The complexity of both hardware and software makes GPUs reliability evaluation extremely challenging. A low level fault injection on a GPU model, despite being accurate, would take a prohibitively long time (months to years), while software fault injection, despite being quick, cannot access critical resources for GPUs and typically uses synthetic fault models (e.g., single bit-flips) that could result in unrealistic evaluations. This paper proposes to combine the accuracy of Register-Transfer Level (RTL) fault injection with the efficiency of software fault injection. First, on an RTL GPU model (FlexGripPlus), we inject over 1.5 million faults in low-level resources that are unprotected and hidden to the programmer, and characterize their effects on the output of common instructions. We create a pool of possible fault effects on the operation output based on the instruction opcode and input characteristics. We then inject these fault effects, at the application level, using an updated version of a software framework (NVBitFI). Our strategy reduces the fault injection time from the tens of years an RTL evaluation would need to tens of hours, thus allowing, for the first time on GPUs, to track the fault propagation from the hardware to the output of complex applications. Additionally, we provide a more realistic fault model and show that single bit-flip injection would underestimate the error rate of six HPC applications and two convolutional neural networks by up to 48parcent (18parcent on average). The RTL fault models and the injection framework we developed are made available in a public repository to enable third-party evaluations and ease results reproducibility.

Revealing GPUs Vulnerabilities by Combining Register-Transfer and Software-Level Fault Injection / Fernandes dos Santos, Fernando; Rodriguez Condia, Josie Esteban.; Carro, Luigi; Sonza Reorda, Matteo; Rech, Paolo. - ELETTRONICO. - (2021), pp. 292-304. (Intervento presentato al convegno 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2021 tenutosi a twn nel 2021) [10.1109/DSN48987.2021.00042].

Revealing GPUs Vulnerabilities by Combining Register-Transfer and Software-Level Fault Injection

Rodriguez Condia, Josie Esteban.;Carro, Luigi;Sonza Reorda, Matteo;Rech, Paolo
2021

Abstract

The complexity of both hardware and software makes GPUs reliability evaluation extremely challenging. A low level fault injection on a GPU model, despite being accurate, would take a prohibitively long time (months to years), while software fault injection, despite being quick, cannot access critical resources for GPUs and typically uses synthetic fault models (e.g., single bit-flips) that could result in unrealistic evaluations. This paper proposes to combine the accuracy of Register-Transfer Level (RTL) fault injection with the efficiency of software fault injection. First, on an RTL GPU model (FlexGripPlus), we inject over 1.5 million faults in low-level resources that are unprotected and hidden to the programmer, and characterize their effects on the output of common instructions. We create a pool of possible fault effects on the operation output based on the instruction opcode and input characteristics. We then inject these fault effects, at the application level, using an updated version of a software framework (NVBitFI). Our strategy reduces the fault injection time from the tens of years an RTL evaluation would need to tens of hours, thus allowing, for the first time on GPUs, to track the fault propagation from the hardware to the output of complex applications. Additionally, we provide a more realistic fault model and show that single bit-flip injection would underestimate the error rate of six HPC applications and two convolutional neural networks by up to 48parcent (18parcent on average). The RTL fault models and the injection framework we developed are made available in a public repository to enable third-party evaluations and ease results reproducibility.
2021
978-1-6654-3572-7
File in questo prodotto:
File Dimensione Formato  
dsn_2021_end.pdf

accesso aperto

Descrizione: post-print
Tipologia: 2. Post-print / Author's Accepted Manuscript
Licenza: PUBBLICO - Tutti i diritti riservati
Dimensione 931.53 kB
Formato Adobe PDF
931.53 kB Adobe PDF Visualizza/Apri
Revealing_GPUs_Vulnerabilities_by_Combining_Register-Transfer_and_Software-Level_Fault_Injection.pdf

non disponibili

Tipologia: 2a Post-print versione editoriale / Version of Record
Licenza: Non Pubblico - Accesso privato/ristretto
Dimensione 1.31 MB
Formato Adobe PDF
1.31 MB Adobe PDF   Visualizza/Apri   Richiedi una copia
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11583/2961673