The high processing power of GPUs makes them attractive for safety-critical applications, where transient effects are a major concern, and resilience must be enforced without compromising performance. Configurable softcore GPUs are a recent technology that allows detailed reliability assessment capable of bringing directions to the design of reliable GPU applications. This work investigates the reliability of the register files and the pipeline of a softcore GPU under radiation-induced faults. It proposes software-based fault tolerance techniques to mitigate errors. Faults are simulated at the register transfer level in four case-study algorithms, and the Architectural Vulnerability Factor (AVF) and Mean Workload to Failure (MWTF) are checked over different GPU configurations. Results indicate that software-based techniques efficiently reduce AVF. In terms of MWTF, results show that the best cases depend on an optimized balance between GPU configuration, application runtime, and AVF.

Evaluating low-level software-based hardening techniques for configurable GPU architectures / Goncalves, Marcio M.; Rodriguez Condia, Josie Esteban; Sonza Reorda, Matteo; Sterpone, Luca; Azambuja, Jose Rodrigo. - In: THE JOURNAL OF SUPERCOMPUTING. - ISSN 0920-8542. - ELETTRONICO. - 78:6(2022), pp. 8081-8105. [10.1007/s11227-021-04154-z]

Evaluating low-level software-based hardening techniques for configurable GPU architectures

Rodriguez Condia, Josie Esteban;Sonza Reorda, Matteo;Sterpone, Luca;
2022

Abstract

The high processing power of GPUs makes them attractive for safety-critical applications, where transient effects are a major concern, and resilience must be enforced without compromising performance. Configurable softcore GPUs are a recent technology that allows detailed reliability assessment capable of bringing directions to the design of reliable GPU applications. This work investigates the reliability of the register files and the pipeline of a softcore GPU under radiation-induced faults. It proposes software-based fault tolerance techniques to mitigate errors. Faults are simulated at the register transfer level in four case-study algorithms, and the Architectural Vulnerability Factor (AVF) and Mean Workload to Failure (MWTF) are checked over different GPU configurations. Results indicate that software-based techniques efficiently reduce AVF. In terms of MWTF, results show that the best cases depend on an optimized balance between GPU configuration, application runtime, and AVF.
File in questo prodotto:
File Dimensione Formato  
Goncalves2022_Article_EvaluatingLow-levelSoftware-ba.pdf

non disponibili

Descrizione: post-print
Tipologia: 2a Post-print versione editoriale / Version of Record
Licenza: Non Pubblico - Accesso privato/ristretto
Dimensione 1.6 MB
Formato Adobe PDF
1.6 MB Adobe PDF   Visualizza/Apri   Richiedi una copia
JSupercomp___Software_based_for_FlexGrip.pdf

Open Access dal 08/01/2023

Descrizione: post-print accepted manuscript
Tipologia: 2. Post-print / Author's Accepted Manuscript
Licenza: PUBBLICO - Tutti i diritti riservati
Dimensione 734.78 kB
Formato Adobe PDF
734.78 kB Adobe PDF Visualizza/Apri
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11583/2961684