The General-Purpose Graphics Processing Units (GPGPU) with energy efficient execution are increasingly used in wide range of applications due to high performance. These GPGPUs are fabricated with the cutting-edge technologies. Shrinking transistor feature size and aggressive voltage scaling has increased the susceptibility of devices to intrinsic and extrinsic noise leading to major reliability issues in the form of the transient faults. Therefore, it is essential to ensure the reliable operation of the GPGPUs in the presence of the transient faults. GPGPUs are designed for high throughput and execute the multiple threads in parallel, that brings a new challenge for the fault detection with minimum overheads across all threads. This paper proposes a new fault detection method called REFU, an architectural solution to detect the transient faults by temporal redundant re-execution of instructions using the idle functional execution units of the GPGPU. The performance of the REFU is evaluated with standard benchmarks, for fault free run across different workloads REFU shows mean performance overhead of 2%, average power overhead of 6%, and peak power overhead of 10%.

REFU: Redundant Execution with Idle Functional Units, Fault Tolerant GPGPU architecture / Raghunandana, K. K.; Varaprasad, B. K. S. V. L.; Sonza Reorda, M.; Singh, Virendra. - (2022), pp. 394-397. (Intervento presentato al convegno 2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) tenutosi a Nicosia, Cyprus nel 04-06 July 2022) [10.1109/ISVLSI54635.2022.00088].

REFU: Redundant Execution with Idle Functional Units, Fault Tolerant GPGPU architecture

Sonza Reorda, M.;
2022

Abstract

The General-Purpose Graphics Processing Units (GPGPU) with energy efficient execution are increasingly used in wide range of applications due to high performance. These GPGPUs are fabricated with the cutting-edge technologies. Shrinking transistor feature size and aggressive voltage scaling has increased the susceptibility of devices to intrinsic and extrinsic noise leading to major reliability issues in the form of the transient faults. Therefore, it is essential to ensure the reliable operation of the GPGPUs in the presence of the transient faults. GPGPUs are designed for high throughput and execute the multiple threads in parallel, that brings a new challenge for the fault detection with minimum overheads across all threads. This paper proposes a new fault detection method called REFU, an architectural solution to detect the transient faults by temporal redundant re-execution of instructions using the idle functional execution units of the GPGPU. The performance of the REFU is evaluated with standard benchmarks, for fault free run across different workloads REFU shows mean performance overhead of 2%, average power overhead of 6%, and peak power overhead of 10%.
2022
978-1-6654-6605-9
File in questo prodotto:
File Dimensione Formato  
REFU_Redundant_Execution_with_Idle_Functional_Units_Fault_Tolerant_GPGPU_architecture.pdf

accesso aperto

Tipologia: 2. Post-print / Author's Accepted Manuscript
Licenza: PUBBLICO - Tutti i diritti riservati
Dimensione 604.54 kB
Formato Adobe PDF
604.54 kB Adobe PDF Visualizza/Apri
REFU_Redundant_Execution_with_Idle_Functional_Units_Fault_Tolerant_GPGPU_architecture.pdf

non disponibili

Tipologia: 2a Post-print versione editoriale / Version of Record
Licenza: Non Pubblico - Accesso privato/ristretto
Dimensione 609.83 kB
Formato Adobe PDF
609.83 kB Adobe PDF   Visualizza/Apri   Richiedi una copia
Pubblicazioni consigliate

I documenti in IRIS sono protetti da copyright e tutti i diritti sono riservati, salvo diversa indicazione.

Utilizza questo identificativo per citare o creare un link a questo documento: https://hdl.handle.net/11583/2981818