A Software-based Fault Tolerance Mechanism for Matrix Multiplication Operations in Tensor Cores / Limas Sierra, R., Guerrero-Balaguera, J., Rodriguez Condia, J.E., Reorda, M.S. - In: IEEE TRANSACTIONS ON COMPUTERS. - ISSN 0018-9340. - Electronic. - (2026), pp. 1-14. [10.1109/tc.2026.3702572]

A Software-based Fault Tolerance Mechanism for Matrix Multiplication Operations in Tensor Cores

Limas Sierra, Robert;Guerrero-Balaguera, Juan-David;Rodriguez Condia, Josie Esteban;Reorda, Matteo Sonza
2026

Abstract

Modern Graphics Processing Units (GPUs) are increasingly used to accelerate algorithms in scientific and machine learning domains. Given the importance of General Matrix Multiplication (GEMM) operations, GPUs include specialized on-chip accelerators, such as Tensor Cores (TCUs), to speed them up. High-Performance Computing (HPC) and safety-critical sectors (e.g., automotive, space, and autonomous robotics) impose severe constraints not only on energy consumption, performance, and area but also on reliability. Faults arising from advanced semiconductor technologies or sustained HPC workloads can propagate silently, potentially leading to catastrophic failures. This work introduces a hardware-aware, software-based fault tolerance method to enhance the resilience of GEMM operations on TCUs. By leveraging the TCU architecture and the parallel distribution of operations, the method enables efficient fault detection and mitigation. It uses redundant executions on the TCU arithmetic cores (Dot-Product Units) to detect and, if required, correct fault effects. The method’s flexibility supports online detection and correction of transient and permanent hardware faults in the TCU’s arithmetic units. Experimental results on real GPUs show that the proposed mechanism introduces a minimal, constant memory overhead and a negligible performance overhead (up to 1.13 times) across operand sizes. Thus, this solution offers an effective and complementary hardening strategy for TCU operations.
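The general idea of detecting and correcting faults through redundant execution on different arithmetic units can be illustrated with a small software model. The sketch below is not the authors' implementation: the number of units (`NUM_UNITS`), the way dot products are assigned to units, and the fault model (a single permanently faulty unit that adds a fixed error) are all assumptions made for illustration, and the names `dot_on_unit` and `hardened_matmul` are hypothetical. Integer inputs are used so that results can be compared exactly.

```python
# Illustrative model only: unit count, scheduling, and fault model are
# assumptions for this sketch, not details taken from the paper.

NUM_UNITS = 4  # hypothetical number of dot-product units


def dot_on_unit(row, col, unit, faulty_unit=None):
    """Compute one dot product on `unit`; the faulty unit corrupts its result."""
    result = sum(a * b for a, b in zip(row, col))
    if unit == faulty_unit:
        result += 1  # assumed fault effect: a fixed additive error
    return result


def hardened_matmul(A, B, faulty_unit=None):
    """Multiply A by B, executing every dot product on two different units.

    A mismatch between the two results counts as a detection. A third
    execution on another unit then acts as a tie-breaker (majority vote).
    Returns (C, detected, corrected).
    """
    cols = list(zip(*B))  # columns of B
    C, detected, corrected = [], 0, 0
    for i, row in enumerate(A):
        out_row = []
        for j, col in enumerate(cols):
            # Assumed round-robin mapping of output elements to units.
            primary = (i * len(cols) + j) % NUM_UNITS
            shadow = (primary + 1) % NUM_UNITS
            r1 = dot_on_unit(row, col, primary, faulty_unit)
            r2 = dot_on_unit(row, col, shadow, faulty_unit)
            value = r1
            if r1 != r2:
                detected += 1
                third = (primary + 2) % NUM_UNITS
                r3 = dot_on_unit(row, col, third, faulty_unit)
                if r3 in (r1, r2):
                    value = r3  # two of three executions agree
                    corrected += 1
                # Otherwise no majority exists; this sketch keeps r1.
            out_row.append(value)
        C.append(out_row)
    return C, detected, corrected


if __name__ == "__main__":
    A = [[1, 2], [3, 4]]
    B = [[5, 6], [7, 8]]
    print(hardened_matmul(A, B))                 # fault-free run
    print(hardened_matmul(A, B, faulty_unit=1))  # unit 1 is faulty
```

In this model, the third execution is issued only when the first two disagree, so the extra cost of correction is paid only when a fault is actually observed. How the redundant executions are scheduled on real Tensor Core hardware is beyond what this sketch represents.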

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3012343