Graphics Processing Units (GPUs) are now adopted in several domains with substantial reliability requirements, such as automotive and robotics. Evaluating the impact of possible faults affecting the internal components of a device is therefore a crucial step towards developing products certified according to industrial standards (e.g., ISO 26262). Block scheduling controllers play an important role in resource management and task operation in GPUs, so understanding their sensitivity to faults is crucial for developing mitigation mechanisms and effective countermeasures. This work evaluates the impact of transient faults on the block scheduling controller of a GPU. For this purpose, we extended a low-level microarchitectural GPU model (FlexGripPlus) to support the management of the different execution cores (i.e., the Streaming Multiprocessors or SIMD Engines) and to allow the analysis of fault effects. A set of typical workloads was employed in the reliability evaluation. The experimental results show that faults in the scheduler are most critical when they arise during the device's configuration and during the exchange of tasks from an application. Moreover, when considering faults in the controller, multi-core GPUs appear to be less sensitive than single-core GPUs. Finally, the parallel distribution of tasks (in blocks) also plays a significant role in the scheduler's vulnerability to faults.

Microarchitectural Reliability Evaluation of a Block Scheduling Controller in GPUs / Rodriguez Condia, Josie E.; Faggiano, R; Reorda, Ms. - (2022), pp. 26-31. (Paper presented at the IEEE Computer Society Annual Symposium on VLSI (ISVLSI 2022), held in Nicosia, Cyprus, 4-6 July 2022) [10.1109/ISVLSI54635.2022.00018].

Microarchitectural Reliability Evaluation of a Block Scheduling Controller in GPUs

Rodriguez Condia, Josie E.; Faggiano, R; Reorda, MS
2022

Files in this item:
File: Microarchitectural_Reliability_Evaluation_of_a_Block_Scheduling_Controller_in_GPUs.pdf
Availability: not available
Type: 2a Post-print, publisher's version / Version of Record
License: Not public - private/restricted access
Size: 857.35 kB
Format: Adobe PDF

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11583/2978951