TRRR: Accelerated Online Error Detecting and Correcting Fault Tolerant Architecture / K K, Raghunandana; K R, Yogesh Prasad; Sonza Reorda, M.; Singh, Virendra. - (2024), pp. 1-6. (Paper presented at the 2024 IEEE East-West Design & Test Symposium (EWDTS), held in Yerevan, Armenia, 13-17 November 2024) [10.1109/ewdts63723.2024.10873630].

TRRR: Accelerated Online Error Detecting and Correcting Fault Tolerant Architecture

Sonza Reorda, M.;
2024

Abstract

Soft errors are a major concern for devices fabricated at the latest technology nodes. General-Purpose Graphics Processing Units (GPGPUs) built with cutting-edge technologies are susceptible to soft errors, which can cause silent data corruption or application crashes. Various error detection and correction techniques have been developed to mitigate soft errors in GPGPUs. To improve the performance and reduce the power overhead of these techniques, we propose a novel microarchitecture called Repetition check to Reduce Redundant execution (RRR). RRR builds on the observation that, in the streaming multiprocessor of a GPGPU, the warps of a kernel exhibit temporal execution repetition: many dynamic instructions of the same kind across different warps have the same operand values and results. RRR stores the operand values, operation type, and result of the dynamic instructions of a warp. If the stored data match the values of the current primary execution, the redundant execution used for error detection is bypassed, improving performance and reducing power overhead. On a mismatch, the original error detection and correction method is applied and RRR stores the verified data. To demonstrate the effectiveness of the proposed method, RRR is integrated into an existing fault-tolerant GPGPU microarchitecture, Triple modular Redundant Execution with idle Functional Units (TREFU); the combined design is called TRRR. The performance and power overheads of TRRR are assessed on a set of ISPASS 2009 and Rodinia benchmarks, under error-free conditions and at different error rates. Our evaluation shows that TRRR is 4% more energy efficient than TREFU. The overhead of TRRR is 3% for performance, 2% for average power, and 1% for peak power.
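The repetition check described in the abstract can be illustrated with a small software model. This is only a sketch under stated assumptions, not the hardware design from the paper: the names `RepetitionCheckTable` and `execute_protected`, the table capacity, and the oldest-first eviction are hypothetical, and a plain three-way majority vote stands in for the TREFU detection and correction path, whose details the abstract does not give.

```python
from collections import Counter, OrderedDict
import operator

# Hypothetical operation set, for illustration only.
OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


class RepetitionCheckTable:
    """Assumed bounded table mapping (operation, operands) to a verified result."""

    def __init__(self, capacity=64):
        self.capacity = capacity
        self.entries = OrderedDict()

    def lookup(self, key):
        return self.entries.get(key)

    def store(self, key, result):
        self.entries[key] = result
        self.entries.move_to_end(key)
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the oldest entry


def execute_protected(table, op, operands, stats, fault=None):
    """Run one dynamic instruction, skipping redundant execution on a table match."""
    fn = OPS[op]
    primary = fn(*operands)
    if fault is not None:
        primary = fault(primary)  # model a soft error in the primary execution

    key = (op, operands)
    stored = table.lookup(key)
    if stored is not None and stored == primary:
        stats["bypassed"] += 1  # stored verified data matches: no redundant run
        return primary

    # Miss or mismatch: fall back to redundant execution (stand-in: majority vote),
    # then store the verified result for later instructions.
    stats["redundant"] += 1
    votes = Counter([primary, fn(*operands), fn(*operands)])
    verified, _ = votes.most_common(1)[0]
    table.store(key, verified)
    return verified


table = RepetitionCheckTable()
stats = {"bypassed": 0, "redundant": 0}

# Three warps issue the same instruction with the same operand values.
results = [execute_protected(table, "mul", (6, 7), stats) for _ in range(3)]

# A bit flip in the primary result no longer matches the stored entry,
# so the instruction takes the redundant path and is corrected.
corrected = execute_protected(table, "mul", (6, 7), stats, fault=lambda r: r ^ 1)
```

In this model only the first of the three repeated instructions pays for redundant execution; the other two are bypassed. The faulty fourth instruction fails the table comparison, so the bypass never masks an error in the primary result.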
2024
979-8-3315-1576-8
File in this record:
TRRR_Accelerated_Online_Error_Detecting_and_Correcting_Fault_Tolerant_Architecture.pdf
Access: restricted
Type: 2a Post-print, publisher's version / Version of Record
License: Not public - private/restricted access
Size: 732.44 kB
Format: Adobe PDF

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2997725