A Structured Method to Generate Self-Test Libraries for Tensor Cores / Limas Sierra, Robert; Guerrero Balaguera, Juan David; Rodriguez Condia, Josie E.; Sonza Reorda, Matteo. - In: ELECTRONICS. - ISSN 2079-9292. - 14:11(2025). [10.3390/electronics14112148]

A Structured Method to Generate Self-Test Libraries for Tensor Cores

Limas Sierra, Robert; Guerrero Balaguera, Juan David; Rodriguez Condia, Josie E.; Sonza Reorda, Matteo
2025

Abstract

Modern computing systems increasingly rely on specialized hardware accelerators, such as Graphics Processing Units (GPUs), to meet growing computational demands. GPUs are essential for accelerating a wide range of applications, from machine learning and scientific computing to safety-critical domains such as autonomous systems and aerospace. To enhance performance, modern GPUs integrate dedicated on-chip units, such as Tensor Cores (TCs), which are designed for efficient mixed-precision matrix operations. However, as semiconductor technologies scale down, reliability challenges emerge. Permanent hardware faults caused by aging, process variations, or environmental stress can lead to Silent Data Corruptions, which compromise computation results without any visible error indication. To detect such faults, self-test libraries (STLs) are widely used: these are suitably crafted pieces of code that activate faults, propagate their effects to observable points (e.g., memory), and possibly signal their occurrence. This work introduces a structured method for generating STLs that detect permanent hardware faults in TCs. By leveraging the parallelism and regular structure of TCs, the method facilitates the creation of effective STLs for in-field fault detection without hardware modifications and with minimal test time and memory requirements. The proposed approach was validated on an NVIDIA GeForce RTX 3060 Ti GPU, installed in a Hewlett-Packard Z2 G5 workstation with an Intel Core i9-10800 CPU and 32 GB RAM, available at the Department of Control and Computer Engineering (DAUIN), Politecnico di Torino, Turin, Italy. This setup was used to target stuck-at faults in the arithmetic units of TCs. The results demonstrate that the methodology offers a practical, scalable, and non-intrusive solution for enhancing GPU reliability, applicable in both high-performance and safety-critical environments.
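
The activate, propagate, and compare idea behind a self-test routine can be illustrated with a small software model. The sketch below is not the method proposed in the paper and does not run on a GPU: it assumes a toy integer multiply-accumulate unit (real TCs operate on mixed-precision data), a hypothetical fault model in which one bit of the product is stuck at 0 or 1, and illustrative names (`mac_unit`, `signature`, `self_test`, `PATTERNS`) that do not come from the original work.

```python
from typing import List, Optional, Tuple

Matrix = List[List[int]]
Fault = Tuple[int, int]  # hypothetical fault model: (bit position, stuck value 0 or 1)


def mac_unit(a: int, b: int, acc: int, fault: Optional[Fault] = None) -> int:
    """Toy multiply-accumulate; an optional stuck-at fault forces one product bit."""
    product = a * b
    if fault is not None:
        bit, stuck = fault
        if stuck:
            product |= 1 << bit      # bit stuck at 1
        else:
            product &= ~(1 << bit)   # bit stuck at 0
    return acc + product


def matmul(a: Matrix, b: Matrix, fault: Optional[Fault] = None) -> Matrix:
    """Matrix product built from the toy unit, so a fault affects every product."""
    rows, inner, cols = len(a), len(b), len(b[0])
    out = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        for j in range(cols):
            acc = 0
            for t in range(inner):
                acc = mac_unit(a[i][t], b[t][j], acc, fault)
            out[i][j] = acc
    return out


def signature(c: Matrix) -> int:
    """Compact all results into one 32-bit word, standing in for the value stored to memory."""
    sig = 0
    for row in c:
        for value in row:
            sig = (sig * 31 + value) & 0xFFFFFFFF
    return sig


# Illustrative test patterns (non-negative operands) chosen to toggle product bits 0..15.
PATTERNS = [
    ([[0x55, 0xAA], [0xAA, 0x55]], [[1, 0], [0, 1]]),
    ([[0xFF, 0x00], [0x00, 0xFF]], [[0xFF, 0xFF], [0xFF, 0xFF]]),
]
# Golden signatures are computed once on the fault-free model.
GOLDEN = [signature(matmul(a, b)) for a, b in PATTERNS]


def self_test(fault: Optional[Fault] = None) -> bool:
    """Return True when every signature matches its golden value (no fault detected)."""
    return all(
        signature(matmul(a, b, fault)) == golden
        for (a, b), golden in zip(PATTERNS, GOLDEN)
    )
```

In this model, `self_test()` passes on the fault-free unit, while `self_test((0, 1))` and `self_test((15, 0))` fail because the patterns activate those bits and the mismatch propagates into the signature. A fault such as `(20, 0)` goes undetected because no pattern ever sets product bit 20, which shows why test patterns must be chosen to activate the targeted faults.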
Files in this item:
File: electronics-14-02148.pdf
Access: open access
Type: 2a Post-print, publisher's version / Version of Record
License: Creative Commons
Size: 1.62 MB
Format: Adobe PDF
Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3000405