Graphics processing units (GPUs) are widely used to accelerate Artificial Intelligence applications, such as those based on Convolutional Neural Networks (CNNs). Since in some domains in which CNNs are heavily employed (e.g., automotive and robotics) the expected lifetime of GPUs is over ten years, it is of paramount importance to study the impact of permanent faults (e.g. due to aging). Crucially, while the impact of transient faults on GPUs running CNNs has been widely studied, an accurate evaluation of the impact of permanent faults is still lacking. Performing this evaluation is challenging due to the complexity of GPU devices and the software implementing a CNN. In this work, we propose a methodology that combines the accuracy of gate-level fault simulation with the speed and flexibility of software fault injection to evaluate the effects of permanent hardware faults affecting a GPU. First, we profile the executed low-level GPU instructions during the CNN inference. Then, using extensive gate-level fault injection campaigns, we provide an accurate analysis of the effects of permanent faults on the internal modules executing the targeted instructions. Finally, we propagate these effects using fast software-based fault injection. The method allows, for the first time, to estimate the percentage of permanent faults leading the CNN to produce wrong results (i.e., changing the result of its work). The method's feasibility, which allows for flexibly trade-off accuracy with the required computational effort, is shown using LeNet running on an Ampere Nvidia GPU as a case study. The method reduces the computational effort for the evaluation by several orders of magnitude with respect to plain gate- and RTL-level faults simulation.
A Multi-level Approach to Evaluate the Impact of GPU Permanent Faults on CNN's Reliability / Rodriguez Condia, Josie E.; Guerrero-Balaguera, Juan-David; Dos Santos, Fernando F.; Reorda, Matteo Sonza; Rech, Paolo. - (2022), pp. 278-287. (Intervento presentato al convegno 2022 IEEE International Test Conference (ITC) tenutosi a Anaheim, CA (USA) nel 23-30 September 2022) [10.1109/ITC50671.2022.00036].
A Multi-level Approach to Evaluate the Impact of GPU Permanent Faults on CNN's Reliability
Rodriguez Condia, Josie E.;Guerrero-Balaguera, Juan-David;Reorda, Matteo Sonza;Rech, Paolo
2022
Abstract
Graphics processing units (GPUs) are widely used to accelerate Artificial Intelligence applications, such as those based on Convolutional Neural Networks (CNNs). Since in some domains in which CNNs are heavily employed (e.g., automotive and robotics) the expected lifetime of GPUs is over ten years, it is of paramount importance to study the impact of permanent faults (e.g. due to aging). Crucially, while the impact of transient faults on GPUs running CNNs has been widely studied, an accurate evaluation of the impact of permanent faults is still lacking. Performing this evaluation is challenging due to the complexity of GPU devices and the software implementing a CNN. In this work, we propose a methodology that combines the accuracy of gate-level fault simulation with the speed and flexibility of software fault injection to evaluate the effects of permanent hardware faults affecting a GPU. First, we profile the executed low-level GPU instructions during the CNN inference. Then, using extensive gate-level fault injection campaigns, we provide an accurate analysis of the effects of permanent faults on the internal modules executing the targeted instructions. Finally, we propagate these effects using fast software-based fault injection. The method allows, for the first time, to estimate the percentage of permanent faults leading the CNN to produce wrong results (i.e., changing the result of its work). The method's feasibility, which allows for flexibly trade-off accuracy with the required computational effort, is shown using LeNet running on an Ampere Nvidia GPU as a case study. The method reduces the computational effort for the evaluation by several orders of magnitude with respect to plain gate- and RTL-level faults simulation.File | Dimensione | Formato | |
---|---|---|---|
ITC_2022.pdf
accesso aperto
Tipologia:
2. Post-print / Author's Accepted Manuscript
Licenza:
PUBBLICO - Tutti i diritti riservati
Dimensione
717.18 kB
Formato
Adobe PDF
|
717.18 kB | Adobe PDF | Visualizza/Apri |
A_Multi-level_Approach_to_Evaluate_the_Impact_of_GPU_Permanent_Faults_on_CNNs_Reliability.pdf
non disponibili
Tipologia:
2a Post-print versione editoriale / Version of Record
Licenza:
Non Pubblico - Accesso privato/ristretto
Dimensione
979.14 kB
Formato
Adobe PDF
|
979.14 kB | Adobe PDF | Visualizza/Apri Richiedi una copia |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/2974450