REFU: Redundant Execution with Idle Functional Units, Fault Tolerant GPGPU architecture / Raghunandana, K. K.; Varaprasad, B. K. S. V. L.; Sonza Reorda, M.; Singh, Virendra. - (2022), pp. 394-397. (Paper presented at the 2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), held in Nicosia, Cyprus, 4-6 July 2022) [10.1109/ISVLSI54635.2022.00088].
REFU: Redundant Execution with Idle Functional Units, Fault Tolerant GPGPU architecture
Sonza Reorda, M.;
2022
Abstract
General-Purpose Graphics Processing Units (GPGPUs) offer energy-efficient execution and are increasingly used in a wide range of applications because of their high performance. These GPGPUs are fabricated with cutting-edge technologies. Shrinking transistor feature sizes and aggressive voltage scaling have increased the susceptibility of devices to intrinsic and extrinsic noise, leading to major reliability issues in the form of transient faults. It is therefore essential to ensure that GPGPUs operate reliably in the presence of transient faults. GPGPUs are designed for high throughput and execute multiple threads in parallel, which brings a new challenge: detecting faults across all threads with minimum overhead. This paper proposes a new fault detection method called REFU, an architectural solution that detects transient faults by temporally redundant re-execution of instructions on the idle functional execution units of the GPGPU. The performance of REFU is evaluated with standard benchmarks. For fault-free runs across different workloads, REFU shows a mean performance overhead of 2%, an average power overhead of 6%, and a peak power overhead of 10%.
File | Access | Type | License | Size | Format
---|---|---|---|---|---
REFU_Redundant_Execution_with_Idle_Functional_Units_Fault_Tolerant_GPGPU_architecture.pdf | Open access | 2. Post-print / Author's Accepted Manuscript | Public - All rights reserved | 604.54 kB | Adobe PDF
REFU_Redundant_Execution_with_Idle_Functional_Units_Fault_Tolerant_GPGPU_architecture.pdf | Not available | 2a. Post-print publisher's version / Version of Record | Non-public - Private/restricted access | 609.83 kB | Adobe PDF
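The abstract describes detecting transient faults by re-executing an instruction at a later time and comparing the two results. The toy Python model below is only a sketch of that general idea of temporal redundancy, not the REFU hardware design: it does not model idle-unit scheduling, warps, or the comparison logic of the paper. The names `flip_bit`, `execute`, and `redundant_execute`, the 32-bit result width, and the single-bit-flip fault model are all assumptions made for illustration.

```python
import operator

MASK32 = 0xFFFFFFFF  # assumed 32-bit result width (illustrative only)


def flip_bit(value, bit):
    """Model a transient fault as a single bit flip in a 32-bit result."""
    return (value ^ (1 << bit)) & MASK32


def execute(op, a, b, fault_bit=None):
    """Run one instruction on a hypothetical functional unit.

    If fault_bit is given, a transient fault corrupts this execution only.
    """
    result = op(a, b) & MASK32
    if fault_bit is not None:
        result = flip_bit(result, fault_bit)
    return result


def redundant_execute(op, a, b, fault_bit=None):
    """Execute, re-execute at a later time, and compare the two results.

    Returns (first_result, fault_detected). The fault is injected into the
    first execution only, because a transient fault is assumed not to
    persist until the re-execution.
    """
    first = execute(op, a, b, fault_bit)
    second = execute(op, a, b)  # redundant re-execution in a later slot
    return first, first != second


print(redundant_execute(operator.add, 7, 5))     # (12, False)
print(redundant_execute(operator.add, 7, 5, 3))  # (4, True)
```

In the second call, bit 3 of the correct result 12 is flipped, giving 4. The fault-free re-execution returns 12, so the mismatch flags the fault. A mismatch only signals detection: this sketch does not decide which of the two results is correct.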
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.
https://hdl.handle.net/11583/2981818