
A Multi-level Approach to Evaluate the Impact of GPU Permanent Faults on CNN's Reliability / Rodriguez Condia, Josie E.; Guerrero-Balaguera, Juan-David; Dos Santos, Fernando F.; Reorda, Matteo Sonza; Rech, Paolo. - (2022), pp. 278-287. (Paper presented at the 2022 IEEE International Test Conference (ITC), held in Anaheim, CA (USA), 23-30 September 2022) [10.1109/ITC50671.2022.00036].

A Multi-level Approach to Evaluate the Impact of GPU Permanent Faults on CNN's Reliability

Rodriguez Condia, Josie E.; Guerrero-Balaguera, Juan-David; Dos Santos, Fernando F.; Reorda, Matteo Sonza; Rech, Paolo
2022

Abstract

Graphics processing units (GPUs) are widely used to accelerate Artificial Intelligence applications, such as those based on Convolutional Neural Networks (CNNs). In some domains where CNNs are heavily employed (e.g., automotive and robotics), the expected lifetime of GPUs exceeds ten years, so it is of paramount importance to study the impact of permanent faults (e.g., due to aging). While the impact of transient faults on GPUs running CNNs has been widely studied, an accurate evaluation of the impact of permanent faults is still lacking. Performing this evaluation is challenging due to the complexity of both GPU devices and the software implementing a CNN. In this work, we propose a methodology that combines the accuracy of gate-level fault simulation with the speed and flexibility of software fault injection to evaluate the effects of permanent hardware faults affecting a GPU. First, we profile the low-level GPU instructions executed during CNN inference. Then, using extensive gate-level fault injection campaigns, we accurately analyze the effects of permanent faults on the internal modules that execute the targeted instructions. Finally, we propagate these effects using fast software-based fault injection. For the first time, the method makes it possible to estimate the percentage of permanent faults that lead the CNN to produce wrong results (i.e., that change the result of its work). The feasibility of the method, which allows accuracy to be flexibly traded off against the required computational effort, is shown using LeNet running on an NVIDIA Ampere GPU as a case study. The method reduces the computational effort of the evaluation by several orders of magnitude with respect to plain gate-level and RTL fault simulation.
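The final step described in the abstract, propagating a hardware-level error model through the application in software, can be illustrated with a minimal Python sketch. It is not the authors' tool and makes several loud assumptions: the error model is a hypothetical single stuck-at bit on a 16-bit multiplier output (standing in for the results of a gate-level campaign), the "network" is a toy linear layer with argmax rather than LeNet, and all names (`stuck_at`, `multiply`, `classify`, `critical_fault_fraction`) are invented for illustration.

```python
import random

WIDTH = 16  # assumed width of the multiplier output, in bits (hypothetical)


def stuck_at(word, bit, value):
    """Force one bit of a non-negative word to 0 or 1 (permanent fault model)."""
    mask = 1 << bit
    return (word | mask) if value else (word & ~mask)


def multiply(a, b, fault=None):
    """Unsigned multiply truncated to WIDTH bits; `fault` is (bit, value) or None."""
    product = (a * b) & ((1 << WIDTH) - 1)
    if fault is not None:
        product = stuck_at(product, *fault)
    return product


def classify(pixels, weights, fault=None):
    """Toy linear layer plus argmax, computed with the (possibly faulty) multiplier."""
    scores = [
        sum(multiply(p, w, fault) for p, w in zip(pixels, row))
        for row in weights
    ]
    return scores.index(max(scores))


def critical_fault_fraction(inputs, weights):
    """Fraction of stuck-at faults that change the predicted class of any input."""
    faults = [(bit, value) for bit in range(WIDTH) for value in (0, 1)]
    golden = [classify(x, weights) for x in inputs]  # fault-free reference
    critical = sum(
        any(classify(x, weights, fault) != g for x, g in zip(inputs, golden))
        for fault in faults
    )
    return critical / len(faults)


# Small synthetic workload (random 4-bit weights and inputs, fixed seed).
rng = random.Random(0)
weights = [[rng.randrange(16) for _ in range(8)] for _ in range(4)]
inputs = [[rng.randrange(16) for _ in range(8)] for _ in range(20)]

fraction = critical_fault_fraction(inputs, weights)
print(f"fraction of faults changing at least one prediction: {fraction:.3f}")
```

In this toy setting every product fits in 8 bits, so faults on the upper output bits are masked: a stuck-at-0 changes nothing, and a stuck-at-1 adds the same offset to every class score and leaves the argmax unchanged. This mirrors, in a very simplified way, why only a fraction of permanent faults alters the final result.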
ISBN: 978-1-6654-6270-9
Files in this item:

ITC_2022.pdf
Access: open access
Type: 2. Post-print / Author's Accepted Manuscript
License: PUBLIC - All rights reserved
Size: 717.18 kB
Format: Adobe PDF

A_Multi-level_Approach_to_Evaluate_the_Impact_of_GPU_Permanent_Faults_on_CNNs_Reliability.pdf
Access: not available
Type: 2a. Post-print, publisher's version / Version of Record
License: Non-public - Private/restricted access
Size: 979.14 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2974450