A Software-based Fault Tolerance Mechanism for Matrix Multiplication Operations in Tensor Cores

Limas Sierra, Robert; Guerrero-Balaguera, Juan-David; Rodriguez Condia, Josie Esteban; Reorda, Matteo Sonza

doi:10.1109/tc.2026.3702572

Modern Graphics Processing Units (GPUs) are increasingly employed to enhance the performance of algorithms across scientific and machine learning domains. Given the importance of General Matrix Multiplication (GEMM) operations, GPUs feature specialized on-chip accelerators, such as Tensor Cores (TCUs), to speed them up. High-Performance Computing (HPC) and safety-critical sectors (e.g., automotive, space, and autonomous robotics) impose severe constraints not only on energy consumption, performance, and area but also on reliability. Faults arising from advanced semiconductor technologies or sustained HPC workloads can propagate silently, potentially leading to catastrophic failures. This work introduces a hardware-aware, software-based fault tolerance method to enhance the resilience of GEMM operations on TCUs. By leveraging the TCU architecture and its parallel distribution of operations, the method enables efficient fault detection and mitigation. It uses redundant executions on the TCU arithmetic cores (Dot-Product Units) to detect and, if required, correct fault effects. The method's flexibility supports online detection and correction of transient and permanent hardware faults in the TCU's arithmetic units. Experimental results on real GPUs show that the proposed mechanism introduces a minimal, constant memory overhead and a negligible performance overhead (up to 1.13×) across operand sizes. Thus, this solution offers an effective and complementary hardening strategy for TCU operations.
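The general idea of detecting and correcting faults through redundant execution on different arithmetic units can be illustrated with a small simulation. The sketch below is not the method from the paper: the unit count (`NUM_UNITS`), the round-robin unit assignment, the off-by-one fault model, and the names `dot_on_unit` and `hardened_matmul` are all assumptions made for illustration. Each output element is computed on two different simulated units; a mismatch signals a fault, and a third execution on another unit allows a majority vote.

```python
from collections import Counter

NUM_UNITS = 4  # assumed number of dot-product units; illustrative only


def dot_on_unit(row, col, unit, faulty_unit=None):
    """Simulate one dot product on a given unit (hypothetical fault model)."""
    result = sum(x * y for x, y in zip(row, col))
    if unit == faulty_unit:
        result += 1  # assumed permanent fault: the faulty unit is off by one
    return result


def hardened_matmul(a, b, faulty_unit=None):
    """Multiply a (n x k) by b (k x m) with duplicated execution and voting.

    Returns the product and the number of elements where a mismatch was detected.
    """
    cols = list(zip(*b))
    out, detected = [], 0
    for i, row in enumerate(a):
        out_row = []
        for j, col in enumerate(cols):
            # Assumed round-robin mapping of output elements to units.
            primary = (i * len(cols) + j) % NUM_UNITS
            units = [(primary + s) % NUM_UNITS for s in range(3)]
            first = dot_on_unit(row, col, units[0], faulty_unit)
            second = dot_on_unit(row, col, units[1], faulty_unit)
            if first != second:
                # Detection: the two redundant results disagree.
                detected += 1
                # Correction: a third unit breaks the tie by majority vote.
                third = dot_on_unit(row, col, units[2], faulty_unit)
                first = Counter([first, second, third]).most_common(1)[0][0]
            out_row.append(first)
        out.append(out_row)
    return out, detected
```

For example, `hardened_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], faulty_unit=1)` still returns the correct product, because with a single faulty unit at least two of the three executions agree. The third execution runs only after a mismatch, so the fault-free path costs two executions per element in this toy model; the overhead figures in the abstract refer to the authors' implementation on real GPUs, not to this sketch.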

A Software-based Fault Tolerance Mechanism for Matrix Multiplication Operations in Tensor Cores / Limas Sierra, R., Guerrero-Balaguera, J., Rodriguez Condia, J.E., Reorda, M.S. - In: IEEE TRANSACTIONS ON COMPUTERS. - ISSN 0018-9340. - Electronic. - (2026), pp. 1-14. [10.1109/tc.2026.3702572]

Year of publication: 2026
DOI: https://dx.doi.org/10.1109/tc.2026.3702572
Journal title: IEEE TRANSACTIONS ON COMPUTERS
Publication type: 1.1 Journal article

Files in this item:

File	Size	Format
_TComp2024__A_Software_Fault_Tolerant_Mechanism_for_MxM_in_Tensor_Cores.pdf (restricted access; Type: 2. Post-print / Author's Accepted Manuscript; License: Non-public, private/restricted access)	1.42 MB	Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3012343

PORTO @ Archivio Istituzionale della Ricerca
