Using Hardware Performance Counters to support in-field GPU Testing / Guerrero-Balaguera, Juan-David; Rodriguez Condia, Josie E.; Sonza Reorda, Matteo. - (2021), pp. 1-4. (Paper presented at the 28th IEEE International Conference on Electronics Circuits and Systems, held in Dubai, 28 November-1 December 2021) [10.1109/ICECS53924.2021.9665511].

Using Hardware Performance Counters to support in-field GPU Testing

Juan-David Guerrero-Balaguera; Josie E. Rodriguez Condia; Matteo Sonza Reorda
2021

Abstract

Graphics Processing Units (GPUs) have gained importance in several domains that require high computational effort (e.g., Artificial Intelligence). At the same time, their adoption has extended to domains (e.g., automotive, robotics, aerospace) where the effects of hardware faults can be extremely serious. Hence, it is crucial to identify methods that can quickly detect faults in a GPU while it is operating in the field. In this paper we propose a method based on hardware performance counters. We show that the method is able to detect permanent faults occurring in some critical modules, thereby increasing the overall reliability of the system. The method does not involve significant costs, since performance counters already exist in all real GPUs (e.g., to support silicon debug) and can be easily accessed in software.
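The abstract does not describe how counter values are turned into a pass/fail decision. One plausible reading, sketched below in Python purely as an assumption, is that a test program is run on the GPU, its performance counters are read back, and the values are compared against reference ("golden") values recorded on a fault-free device. The counter names, the values, and the function `find_counter_mismatches` are hypothetical and are not taken from the paper; reading the counters from a real GPU is outside the scope of this sketch.

```python
from typing import Dict, List


def find_counter_mismatches(golden: Dict[str, int],
                            observed: Dict[str, int]) -> List[str]:
    """Return the sorted names of counters whose observed value differs
    from the golden one. A counter missing from `observed` is also reported."""
    return sorted(
        name for name, expected in golden.items()
        if observed.get(name) != expected
    )


# Hypothetical counter names and values, for illustration only.
golden = {"inst_executed": 4096, "branch": 128, "shared_load": 512}

# A fault-free run reproduces the golden values exactly.
fault_free = dict(golden)

# A run on a faulty device is assumed to perturb at least one counter.
faulty = {"inst_executed": 4096, "branch": 131, "shared_load": 512}

print(find_counter_mismatches(golden, fault_free))  # []
print(find_counter_mismatches(golden, faulty))      # ['branch']
```

An exact comparison only makes sense if the test program is deterministic, so that its counters are repeatable across fault-free runs; counters that vary from run to run would need a tolerance or would have to be left out of the reference set.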
ISBN: 978-1-7281-8281-0
Files in this item:

GPU_Hardware_Performance_Counters.pdf
Access: open access
Type: 2. Post-print / Author's Accepted Manuscript
License: Public - All rights reserved
Size: 214.24 kB
Format: Adobe PDF

Using_Hardware_Performance_Counters_to_support_infield_GPU_Testing.pdf
Access: not available
Type: 2a. Post-print, publisher's version / Version of Record
License: Non-public - Private/restricted access
Size: 1.54 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2924512