The high processing power of GPUs makes them attractive for safety-critical applications, where transient effects are a major concern, and resilience must be enforced without compromising performance. Configurable softcore GPUs are a recent technology that allows detailed reliability assessment capable of bringing directions to the design of reliable GPU applications. This work investigates the reliability of the register files and the pipeline of a softcore GPU under radiation-induced faults. It proposes software-based fault tolerance techniques to mitigate errors. Faults are simulated at the register transfer level in four case-study algorithms, and the Architectural Vulnerability Factor (AVF) and Mean Workload to Failure (MWTF) are checked over different GPU configurations. Results indicate that software-based techniques efficiently reduce AVF. In terms of MWTF, results show that the best cases depend on an optimized balance between GPU configuration, application runtime, and AVF.
Evaluating low-level software-based hardening techniques for configurable GPU architectures / Goncalves, Marcio M.; Rodriguez Condia, Josie Esteban; Sonza Reorda, Matteo; Sterpone, Luca; Azambuja, Jose Rodrigo. - In: THE JOURNAL OF SUPERCOMPUTING. - ISSN 0920-8542. - ELETTRONICO. - 78:6(2022), pp. 8081-8105. [10.1007/s11227-021-04154-z]
Evaluating low-level software-based hardening techniques for configurable GPU architectures
Rodriguez Condia, Josie Esteban;Sonza Reorda, Matteo;Sterpone, Luca;
2022
Abstract
The high processing power of GPUs makes them attractive for safety-critical applications, where transient effects are a major concern, and resilience must be enforced without compromising performance. Configurable softcore GPUs are a recent technology that allows detailed reliability assessment capable of bringing directions to the design of reliable GPU applications. This work investigates the reliability of the register files and the pipeline of a softcore GPU under radiation-induced faults. It proposes software-based fault tolerance techniques to mitigate errors. Faults are simulated at the register transfer level in four case-study algorithms, and the Architectural Vulnerability Factor (AVF) and Mean Workload to Failure (MWTF) are checked over different GPU configurations. Results indicate that software-based techniques efficiently reduce AVF. In terms of MWTF, results show that the best cases depend on an optimized balance between GPU configuration, application runtime, and AVF.File | Dimensione | Formato | |
---|---|---|---|
Goncalves2022_Article_EvaluatingLow-levelSoftware-ba.pdf
non disponibili
Descrizione: post-print
Tipologia:
2a Post-print versione editoriale / Version of Record
Licenza:
Non Pubblico - Accesso privato/ristretto
Dimensione
1.6 MB
Formato
Adobe PDF
|
1.6 MB | Adobe PDF | Visualizza/Apri Richiedi una copia |
JSupercomp___Software_based_for_FlexGrip.pdf
Open Access dal 08/01/2023
Descrizione: post-print accepted manuscript
Tipologia:
2. Post-print / Author's Accepted Manuscript
Licenza:
PUBBLICO - Tutti i diritti riservati
Dimensione
734.78 kB
Formato
Adobe PDF
|
734.78 kB | Adobe PDF | Visualizza/Apri |
Pubblicazioni consigliate
I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.
https://hdl.handle.net/11583/2961684