The widespread adoption of Convolutional Neural Networks (CNNs) and their integration into safety-critical applications have been driven by sophisticated optimizations, including advanced pruning techniques that preserve accuracy and performance, and by specialized hardware accelerators such as GPU Tensor Cores (TCs) with support for structured sparsity. Nonetheless, the impact of soft errors on sparse CNNs caused by corrupted sparsity mechanisms in GPUs, and the development of corresponding mitigation strategies, have not yet been sufficiently investigated. This work evaluates how soft errors in the structured sparsity mechanism of GPU TCs affect the operation of CNN models. The effect of transient faults is evaluated at two levels: i) the micro-architecture of the sparsity mechanisms in TCs, and ii) sparse CNN applications. In particular, several fault injection campaigns targeting the sparsity mechanism (errors in sparse indices and compressed weights) enable the characterization of errors in sparse workloads. Moreover, we introduce FT-Sparse, an algorithm-based fault tolerance (ABFT) solution that enhances the resilience of sparse CNNs against soft errors in the structured sparsity mechanism of GPUs. The results indicate that FT-Sparse can reduce critical Silent Data Corruptions (SDCs) by up to 4.87×, with an average kernel execution overhead of at most 7.59% on GPUs.
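The abstract does not describe how FT-Sparse works internally, so the following is only an illustrative sketch of the kind of fault it targets. It assumes the common 2:4 pattern (at most two nonzero weights per group of four, each stored with a 2-bit position index), flips one bit of a sparse index, and applies a generic checksum-based ABFT test. All function names are hypothetical, and the checksum shown is the classic ABFT check for matrix-vector products, not necessarily the one used by FT-Sparse.

```python
def compress_2_4(row):
    """Compress a dense row (length a multiple of 4) into values + 2-bit indices."""
    values, indices = [], []
    for g in range(0, len(row), 4):
        group = row[g:g + 4]
        kept = [i for i in range(4) if group[i] != 0]
        if len(kept) > 2:
            raise ValueError("group violates the assumed 2:4 pattern")
        # Pad with zero-valued positions so each group stores exactly two entries.
        kept += [i for i in range(4) if i not in kept][:2 - len(kept)]
        kept.sort()
        values += [group[i] for i in kept]
        indices += kept
    return values, indices


def sparse_dot(values, indices, x):
    """Dot product that selects activations through the stored indices."""
    total = 0
    for k, (w, idx) in enumerate(zip(values, indices)):
        group_base = (k // 2) * 4  # two stored entries per group of four
        total += w * x[group_base + idx]
    return total


def flip_index_bit(indices, pos, bit):
    """Return a copy of the indices with one bit (0 or 1) flipped at position pos."""
    faulty = list(indices)
    faulty[pos] ^= 1 << bit
    return faulty


def abft_check(outputs, column_checksum, x):
    """Generic ABFT test: the sum of the outputs must equal checksum . x."""
    return sum(outputs) == sum(c * a for c, a in zip(column_checksum, x))


# Toy 2x8 weight matrix following the 2:4 pattern, with integer data so the
# comparison is exact (real kernels would need a floating-point tolerance).
W = [[3, 0, 0, -2, 0, 5, 1, 0],
     [0, 4, 7, 0, -1, 0, 0, 6]]
x = [1, 2, 3, 4, 5, 6, 7, 8]

# Checksum computed once from the fault-free dense weights.
column_checksum = [sum(col) for col in zip(*W)]
compressed = [compress_2_4(row) for row in W]

golden = [sparse_dot(v, i, x) for v, i in compressed]

# Inject a single bit flip into the first sparse index of row 0.
v0, i0 = compressed[0]
faulty_row0 = sparse_dot(v0, flip_index_bit(i0, 0, 0), x)
faulty = [faulty_row0, golden[1]]

print("golden:", golden, "check passes:", abft_check(golden, column_checksum, x))
print("faulty:", faulty, "check passes:", abft_check(faulty, column_checksum, x))
```

In this toy case the corrupted index makes the kernel read the wrong activation, so the output sum no longer matches the checksum and the fault is flagged. A check of this form misses the fault only when the wrongly selected activation happens to equal the correct one, in which case the output is unchanged anyway.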

FT-Sparse: Algorithm-Based Fault Tolerance for Sparse CNNs Using Structured Sparsity in GPUs / Rodriguez Condia, J.E., Ahmadilivani, M.H., Raik, J., Jenihhin, M., Reorda, M.S. - ELECTRONIC. - (2026), pp. 1-7. (IEEE 44th VLSI Test Symposium (VTS), Napa, California, USA, 27-29 April 2026) [10.1109/vts69484.2026.11563359].

FT-Sparse: Algorithm-Based Fault Tolerance for Sparse CNNs Using Structured Sparsity in GPUs

Rodriguez Condia, Josie Esteban; Reorda, Matteo Sonza
2026


Use this identifier to cite or link to this document: https://hdl.handle.net/11583/3012344