Using Hardware Performance Counters to support in-field GPU Testing / Guerrero-Balaguera, Juan-David; Rodriguez Condia, Josie E.; Sonza Reorda, Matteo. - (2021), pp. 1-4. (Paper presented at the 28th IEEE International Conference on Electronics Circuits and Systems, held in Dubai, 28 November-1 December 2021) [10.1109/ICECS53924.2021.9665511].
Using Hardware Performance Counters to support in-field GPU Testing
Juan-David Guerrero-Balaguera;Josie E. Rodriguez Condia;Matteo Sonza Reorda
2021
Abstract
Graphics Processing Units (GPUs) have gained importance in several domains that require high computational effort (e.g., Artificial Intelligence). At the same time, their adoption has extended to domains (e.g., automotive, robotics, aerospace) where the effects of hardware faults can be extremely serious. Hence, it is crucial to identify methods that quickly detect the occurrence of faults in a GPU during its operational phase. In this paper, we propose a method based on hardware performance counters. We show that the method can detect permanent faults occurring in some critical modules, thereby increasing the overall reliability of the system. The method does not involve significant costs, since performance counters already exist in all real GPUs (e.g., to support silicon debug) and can be easily accessed in software.

File | Access | Type | License | Size | Format
---|---|---|---|---|---
GPU_Hardware_Performance_Counters.pdf | Open access | 2. Post-print / Author's Accepted Manuscript | Public - All rights reserved | 214.24 kB | Adobe PDF
Using_Hardware_Performance_Counters_to_support_infield_GPU_Testing.pdf | Restricted access | 2a. Post-print, publisher's version / Version of Record | Not public - Private/restricted access | 1.54 MB | Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2924512