Datacenters play a vital role in today's society. At large, a datacenter room is a complex controlled environment composed of thousands of computing nodes, which consume kW of power. To dissipate the power, forced air/liquid flow is employed, with a cost of millions of euros per year. Reducing this cost involves using free-cooling and average case design, which can create a cooling shortage and thermal hazards. When a thermal hazard happens, the system administrators and the facility manager must stop the production to avoid IT equipment damage and wear-out. In this paper, we study the thermal hazards signatures on a Tier-0 datacenter room's monitored data during a full year of production. We define a set of rules for detecting the thermal hazards based on the inlet and outlet temperature of all nodes of a room. We then propose a custom Temporal Convolutional Network (TCN) to predict the hazards in advance. The results show that our TCN can predict the thermal hazards with an F1-score of 0.98 for a randomly sampled test set. When causality is enforced between the training and validation set the F1-score drops to 0.74, demanding for an in-place online re-training of the network, which motivates further research in this context.
Prediction of Thermal Hazards in a Real Datacenter Room Using Temporal Convolutional Networks / Ardebili, Ms; Zanghieri, M; Burrello, A; Beneventi, F; Acquaviva, A; Benini, L; Bartolini, A. - (2021), pp. 1256-1259. (Intervento presentato al convegno 2021 Design, Automation & Test in Europe Conference & Exhibition (DATE)) [10.23919/DATE51398.2021.9474116].
Prediction of Thermal Hazards in a Real Datacenter Room Using Temporal Convolutional Networks
Burrello, A;
2021
Abstract
Datacenters play a vital role in today's society. At large, a datacenter room is a complex controlled environment composed of thousands of computing nodes, which consume kW of power. To dissipate the power, forced air/liquid flow is employed, with a cost of millions of euros per year. Reducing this cost involves using free-cooling and average case design, which can create a cooling shortage and thermal hazards. When a thermal hazard happens, the system administrators and the facility manager must stop the production to avoid IT equipment damage and wear-out. In this paper, we study the thermal hazards signatures on a Tier-0 datacenter room's monitored data during a full year of production. We define a set of rules for detecting the thermal hazards based on the inlet and outlet temperature of all nodes of a room. We then propose a custom Temporal Convolutional Network (TCN) to predict the hazards in advance. The results show that our TCN can predict the thermal hazards with an F1-score of 0.98 for a randomly sampled test set. When causality is enforced between the training and validation set the F1-score drops to 0.74, demanding for an in-place online re-training of the network, which motivates further research in this context.File | Dimensione | Formato | |
---|---|---|---|
DATE_D___IP.pdf
accesso aperto
Tipologia:
2a Post-print versione editoriale / Version of Record
Licenza:
PUBBLICO - Tutti i diritti riservati
Dimensione
567.17 kB
Formato
Adobe PDF
|
567.17 kB | Adobe PDF | Visualizza/Apri |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/2978567