Multiprocessor system-on-chip such as embedded GPUs are becoming very popular in safety-critical applications, such as autonomous and semi-autonomous vehicles. However, these devices can suffer from the effects of soft-errors, such as those produced by radiation effects. These effects are able to generate unpredictable misbehaviors. Fault tolerance oriented to multi-threaded software introduces severe performance degradations due to the redundancy, voting and correction threads operations. In this paper, we propose a new fault injection environment for NVIDIA GPGPU devices and a fault tolerance approach based on error detection and correction threads executed during data transfer operations on embedded GPUs. The fault injection environment is capable of automatically injecting faults into the instructions at SASS level by instrumenting the CUDA binary executable file. The mitigation approach is based on concurrent error detection threads running simultaneously with the memory stream device to host data transfer operations. With several benchmark applications, we evaluate the impact of softerrors classifying Silent Data Corruption, Detection, Unrecoverable Error and Hang. Finally, the proposed mitigation approach has been validated by soft-error fault injection campaigns on an NVIDIA Pascal Architecture GPU controlled by Quad-Core A57 ARM processor (JETSON TX2) demonstrating an advantage of more than 37% with respect to state of the art solution.

Analysis and Mitigation of Soft-Errors on High Performance Embedded GPUs / Sterpone, L.; Azimi, S.; De Sio, C.; Parisi, F.. - ELETTRONICO. - (2022), pp. 91-98. (Intervento presentato al convegno 21st IEEE International Symposium on Parallel and Distributed Computing tenutosi a Basel (Switzerland) nel 11-13 July 2022) [10.1109/ISPDC55340.2022.00022].

Analysis and Mitigation of Soft-Errors on High Performance Embedded GPUs

L. Sterpone;S. Azimi;C. De Sio;
2022

Abstract

Multiprocessor system-on-chip such as embedded GPUs are becoming very popular in safety-critical applications, such as autonomous and semi-autonomous vehicles. However, these devices can suffer from the effects of soft-errors, such as those produced by radiation effects. These effects are able to generate unpredictable misbehaviors. Fault tolerance oriented to multi-threaded software introduces severe performance degradations due to the redundancy, voting and correction threads operations. In this paper, we propose a new fault injection environment for NVIDIA GPGPU devices and a fault tolerance approach based on error detection and correction threads executed during data transfer operations on embedded GPUs. The fault injection environment is capable of automatically injecting faults into the instructions at SASS level by instrumenting the CUDA binary executable file. The mitigation approach is based on concurrent error detection threads running simultaneously with the memory stream device to host data transfer operations. With several benchmark applications, we evaluate the impact of softerrors classifying Silent Data Corruption, Detection, Unrecoverable Error and Hang. Finally, the proposed mitigation approach has been validated by soft-error fault injection campaigns on an NVIDIA Pascal Architecture GPU controlled by Quad-Core A57 ARM processor (JETSON TX2) demonstrating an advantage of more than 37% with respect to state of the art solution.
File in questo prodotto:
File Dimensione Formato  
ISPDC_2022.pdf

accesso aperto

Tipologia: 2. Post-print / Author's Accepted Manuscript
Licenza: PUBBLICO - Tutti i diritti riservati
Dimensione 416.1 kB
Formato Adobe PDF
416.1 kB Adobe PDF Visualizza/Apri
Analysis_and_Mitigation_of_Soft-Errors_on_High_Performance_Embedded_GPUs.pdf

non disponibili

Tipologia: 2a Post-print versione editoriale / Version of Record
Licenza: Non Pubblico - Accesso privato/ristretto
Dimensione 1.11 MB
Formato Adobe PDF
1.11 MB Adobe PDF   Visualizza/Apri   Richiedi una copia
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11583/2971150