
A machine learning approach for an HPC use case: The jobs queuing time prediction / Vercellino, Chiara; Scionti, Alberto; Varavallo, Giuseppe; Viviani, Paolo; Vitali, Giacomo; Terzo, Olivier. - In: FUTURE GENERATION COMPUTER SYSTEMS. - ISSN 0167-739X. - Electronic. - 143:(2023), pp. 215-230. [10.1016/j.future.2023.01.020]

A machine learning approach for an HPC use case: The jobs queuing time prediction

Vercellino, Chiara; Scionti, Alberto; Vitali, Giacomo
2023

Abstract

The High-Performance Computing (HPC) domain has provided the tools that underpin the scientific and industrial advancements of the last decades. HPC is a broad domain that aims to provide both software and hardware solutions, as well as methodologies for achieving goals of interest such as system performance and energy efficiency. In this context, supercomputers have been the vehicle for developing and testing the most advanced technologies since their first appearance. Unlike cloud computing resources, which are provided to end-users on demand in the form of virtualized resources (i.e., virtual machines and containers), supercomputers' resources are generally served through state-of-the-art batch schedulers (e.g., SLURM, PBS, LSF, HTCondor). Users submit their computational jobs to the system, which manages their execution with the support of queues. In this setting, predicting the behaviour of jobs in the batch scheduler queues becomes valuable. Indeed, in many cases a deeper knowledge of the time a job will spend in a queue (e.g., when submitting check-pointed jobs or jobs with execution dependencies) enables more effective workflow orchestration policies. In this work, we apply machine learning (ML) techniques to historical data collected from the queuing systems of real supercomputers, aiming to predict the time a given job spends in a queue. Specifically, we apply both unsupervised learning (UL) and supervised learning (SL) techniques to identify the most effective features for the prediction task and to perform the actual prediction of the queue waiting time. To this end, two approaches are explored: on one side, the prediction of ranges of jobs' queuing times (classification approach) and, on the other side, the prediction of the waiting time at the minute level (regression approach). Experimental results highlight a strong relationship between the SL models' performance and the way the dataset is split. After the prediction step, we present an uncertainty quantification approach, i.e., a tool to associate the predictions with reliability metrics, based on variance estimation.
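To make the regression approach and the variance-based uncertainty quantification concrete, below is a minimal sketch. It is not the authors' implementation: the feature names, the synthetic data, and the choice of a random forest (whose per-tree spread supplies a variance estimate) are illustrative assumptions.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical job-submission features, as they might be extracted from
# batch-scheduler (e.g., SLURM) accounting logs. Names are assumptions.
rng = np.random.default_rng(0)
jobs = pd.DataFrame({
    "requested_cores":   rng.integers(1, 512, 1000),
    "requested_minutes": rng.integers(10, 2880, 1000),
    "queue_load":        rng.random(1000),          # jobs ahead at submission
    "hour_of_day":       rng.integers(0, 24, 1000),
})
wait_minutes = rng.exponential(scale=60.0, size=1000)  # synthetic target

X_train, X_test, y_train, y_test = train_test_split(
    jobs, wait_minutes, test_size=0.2, random_state=0
)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Point prediction: estimated waiting time in minutes for each test job.
y_pred = model.predict(X_test)

# Variance-based uncertainty: the spread of the individual trees'
# predictions acts as a simple reliability metric for each forecast.
per_tree = np.stack([t.predict(X_test.to_numpy()) for t in model.estimators_])
y_std = per_tree.std(axis=0)

for pred, std in list(zip(y_pred, y_std))[:5]:
    print(f"predicted wait ~ {pred:6.1f} min (+/- {std:.1f} min, 1 std)")

The classification variant would instead bin the waiting times into ranges and fit a classifier on the same features. Moreover, given the abstract's observation that SL performance depends strongly on how the dataset is split, a chronological train/test split would be a more realistic evaluation than the random split used in this sketch.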
Files in this record:

1-s2.0-S0167739X23000274-main.pdf (not available)
Description: Pre-proof journal
Type: 1. Preprint / submitted version [pre-review]
Licence: Non-public - Private/restricted access
Size: 8.03 MB
Format: Adobe PDF

1-s2.0-S0167739X23000274-main.pdf (open access)
Type: 2a Post-print editorial version / Version of Record
Licence: Creative Commons
Size: 2.79 MB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2975716