Improving the Performance of Batch Schedulers Using Online Job Runtime Classification - INRIA - Institut National de Recherche en Informatique et en Automatique
Preprint, Working Paper. IEEE Transactions on Parallel and Distributed Systems. Year: 2020

Improving the Performance of Batch Schedulers Using Online Job Runtime Classification

Abstract

Job scheduling on high-performance computing platforms is a hard problem that involves uncertainty in both the job arrival process and job execution times. Users typically provide only loose upper bounds on execution times, which are of little use to scheduling heuristics based on processing times. Previous studies applied regression techniques to obtain better execution time estimates, which worked reasonably well and improved scheduling metrics. However, these approaches require a long period of training data. In this work, we propose a simpler approach: classify jobs as small or large and prioritize the execution of small jobs over large ones. Indeed, small jobs are the most impacted by queuing delays, but they typically represent a light load and place only a small burden on the other jobs. The classifier operates online and learns from data collected over the previous weeks, which facilitates its deployment and enables fast adaptation to changes in workload characteristics. We evaluate our approach using four scheduling policies on six HPC platform workload traces. We show that, first, incorporating such classification reduces the average bounded slowdown of jobs in all scenarios and, second, in most of the considered scenarios, the improvements are comparable to those of the ideal hypothetical situation in which the scheduler knows the exact running time of jobs in advance.
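
The two ideas the abstract relies on, prioritizing jobs predicted to be small and measuring bounded slowdown, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Job` fields, the `predicted_small` flag (standing in for the output of some classifier), the FCFS tie-breaking policy, and the 10-second threshold `tau` are all assumptions made for illustration. The bounded slowdown formula used is the commonly cited definition, max((wait + run) / max(run, tau), 1).

```python
from dataclasses import dataclass


@dataclass
class Job:
    job_id: int
    submit_time: float     # seconds since the start of the trace
    requested_time: float  # user-provided upper bound, in seconds
    predicted_small: bool  # hypothetical classifier output (assumption)


def bounded_slowdown(wait: float, run: float, tau: float = 10.0) -> float:
    """Commonly used bounded slowdown; tau = 10 s is an assumed threshold."""
    return max((wait + run) / max(run, tau), 1.0)


def order_queue(queue: list[Job]) -> list[Job]:
    """Put predicted-small jobs first, then apply the base policy (FCFS here)."""
    # False sorts before True, so `not predicted_small` ranks small jobs first.
    return sorted(queue, key=lambda j: (not j.predicted_small, j.submit_time))


queue = [
    Job(1, submit_time=0.0, requested_time=3600.0, predicted_small=False),
    Job(2, submit_time=5.0, requested_time=600.0, predicted_small=True),
    Job(3, submit_time=2.0, requested_time=7200.0, predicted_small=False),
]
ordered_ids = [job.job_id for job in order_queue(queue)]
```

In this sketch the job predicted to be small (job 2) moves to the head of the queue even though it arrived last, while the remaining jobs keep their arrival order. The `tau` threshold keeps very short jobs from dominating the average slowdown.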
Main file
preprint.pdf (4.28 MB)
Origin: Files produced by the author(s)

Dates and versions

hal-03023222, version 1 (25-11-2020)
hal-03023222, version 2 (28-02-2022)

Identifiers

  • HAL Id: hal-03023222, version 1

Cite

Salah Zrigui, Raphael Y. de Camargo, Arnaud Legrand, Denis Trystram. Improving the Performance of Batch Schedulers Using Online Job Runtime Classification. In press. ⟨hal-03023222v1⟩