A Structured Method to Generate Self-Test Libraries for Tensor Cores

Limas Sierra, Robert; Guerrero Balaguera, Juan David; Rodriguez Condia, Josie E.; Sonza Reorda, Matteo

doi:10.3390/electronics14112148

Modern computing systems increasingly rely on specialized hardware accelerators, such as Graphics Processing Units (GPUs), to meet growing computational demands. GPUs accelerate a wide range of applications, from machine learning and scientific computing to safety-critical domains such as autonomous systems and aerospace. To enhance performance, modern GPUs integrate dedicated in-chip units, such as Tensor Cores (TCs), which are designed for efficient mixed-precision matrix operations. However, as semiconductor technologies scale down, reliability challenges emerge. Permanent hardware faults caused by aging, process variations, or environmental stress can lead to Silent Data Corruptions, which corrupt computation results without raising any error. To detect such faults, self-test libraries (STLs) are widely used: suitably crafted pieces of code that activate faults, propagate their effects to observable points (e.g., the memory), and possibly signal their occurrence. This work introduces a structured method for generating STLs that detect permanent hardware faults in TCs. By leveraging the parallelism and regular structure of TCs, the method facilitates the creation of effective STLs for in-field fault detection without hardware modifications and with minimal test time and memory requirements. The proposed approach was validated on an NVIDIA GeForce RTX 3060 Ti GPU, installed in a Hewlett-Packard Z2 G5 workstation with an Intel Core i9-10800 CPU and 32 GB RAM, available at the Department of Control and Computer Engineering (DAUIN), Politecnico di Torino, Turin, Italy. This setup was used to target stuck-at faults in the arithmetic units of TCs. The results demonstrate that the methodology offers a practical, scalable, and non-intrusive solution for enhancing GPU reliability, applicable in both high-performance and safety-critical environments.
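The activate, propagate, and signal idea behind a self-test can be illustrated with a small software model. The sketch below is not the method proposed in the article and does not model real Tensor Core hardware. It assumes a hypothetical 4x4 integer multiply-accumulate tile whose multiplier output may have one bit stuck at 0 or 1, and every name in it (`multiply`, `mma`, `signature`, `test_patterns`, `detects`) is invented for illustration. Each test pattern drives the multiplier so that a fault changes the result tile, the tile is compacted into a signature (standing in for a value written to memory), and a mismatch against the fault-free signature signals the fault.

```python
# Toy software model of a self-test; all names and sizes are hypothetical.
N = 4                      # assumed tile size
WIDTH = 16                 # assumed width of the multiplier output, in bits
MASK = (1 << WIDTH) - 1


def multiply(a, b, fault=None):
    """8-bit x 8-bit product; fault = (bit, value) forces one output bit."""
    p = (a * b) & MASK
    if fault is not None:
        bit, value = fault
        p = (p | (1 << bit)) if value else (p & ~(1 << bit))
    return p


def mma(A, B, C, fault=None):
    """D = A x B + C on N x N tiles, using the (possibly faulty) multiplier."""
    return [[C[i][j] + sum(multiply(A[i][k], B[k][j], fault) for k in range(N))
             for j in range(N)] for i in range(N)]


def signature(D):
    """Compact a result tile into one 32-bit word (the observable point)."""
    sig = 0
    for row in D:
        for v in row:
            sig = (sig * 31 + v) & 0xFFFFFFFF
    return sig


def tile(value):
    return [[value] * N for _ in range(N)]


def test_patterns():
    """Operand tiles chosen so every product bit takes both 0 and 1."""
    zeros = tile(0)
    return [
        (tile(0x00), tile(0xFF), zeros),  # product 0x0000: exposes stuck-at-1
        (tile(0xFF), tile(0xFF), zeros),  # product 0xFE01: bits 0 and 9..15 set
        (tile(0x02), tile(0xFF), zeros),  # product 0x01FE: bits 1..8 set
    ]


def run_self_test(fault=None):
    return [signature(mma(A, B, C, fault)) for A, B, C in test_patterns()]


GOLDEN = run_self_test()   # fault-free reference signatures


def detects(fault):
    """True if the self-test signatures differ from the golden ones."""
    return run_self_test(fault) != GOLDEN


if __name__ == "__main__":
    faults = [(bit, value) for bit in range(WIDTH) for value in (0, 1)]
    detected = sum(detects(f) for f in faults)
    print(f"detected {detected} of {len(faults)} modelled stuck-at faults")
```

In this toy model a fault is detected only when some pattern sets the affected bit to the value opposite to the stuck one, which is why the operand values are chosen rather than random. Real Tensor Core units are considerably more complex than a single shared multiplier output.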

A Structured Method to Generate Self-Test Libraries for Tensor Cores / Limas Sierra, Robert; Guerrero Balaguera, Juan David; Rodriguez Condia, Josie E.; Sonza Reorda, Matteo. - In: ELECTRONICS. - ISSN 2079-9292. - 14:11(2025). [10.3390/electronics14112148]


	Publication year: 2025
	DOI: https://dx.doi.org/10.3390/electronics14112148
	Journal title: ELECTRONICS
	Publication type: 1.1 Journal article

Files in this record:

File	Size	Format
electronics-14-02148.pdf (open access; type: 2a Post-print, publisher's version / Version of Record; license: Creative Commons)	1.62 MB	Adobe PDF


Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3000405

PORTO @ Archivio Istituzionale della Ricerca
