# POLITECNICO DI TORINO Repository ISTITUZIONALE

Cross-layer soft-error resilience analysis of computing systems

Original: Cross-layer soft-error resilience analysis of computing systems / Bosio, A.; Canal, R.; Di Carlo, S.; Gizopoulos, D.; Savino, A. - Print. - (2020), pp. 79-79. (Paper presented at the 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks: Supplemental Volume, DSN-S 2020, held in Valencia, Spain, 29 June-2 July 2020) [10.1109/DSN-S50200.2020.00042].

Availability: This version is available at: 11583/2853435 since: 2020-11-20T14:33:25Z

Publisher: Institute of Electrical and Electronics Engineers Inc.

Published DOI: 10.1109/DSN-S50200.2020.00042

Terms of use: This article is made available under terms and conditions as specified in the corresponding bibliographic description in the repository.

Publisher copyright: IEEE postprint/Author's Accepted Manuscript. ©2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collecting works, for resale or lists, or reuse of any copyrighted component of this work in other works.
# Cross-Layer Soft-Error Resilience Analysis of Computing Systems

Alberto Bosio<sup>1</sup>, Ramon Canal<sup>2</sup>, Stefano Di Carlo<sup>3</sup>, Dimitris Gizopoulos<sup>4</sup>, Alessandro Savino<sup>3</sup>

<sup>1</sup>École Centrale de Lyon – INL, France {alberto.bosio@ec-lyon}

<sup>2</sup>Universitat Politècnica de Catalunya and Barcelona Supercomputing Center, Spain {rcanal@ac.upc.edu}

<sup>3</sup>Politecnico di Torino, Italy {stefano.dicarlo@polito.it, alessandro.savino@polito.it}

<sup>4</sup>University of Athens, Greece {dgizop@di.uoa.gr}

Abstract: In a world with computation at the epicenter of every activity, computing systems must be highly resilient to errors even if miniaturization makes the underlying hardware unreliable. Techniques able to guarantee high reliability are associated with high costs. Early resilience analysis has the potential to support informed design decisions that maximize system-level reliability while minimizing the associated costs. This tutorial focuses on early cross-layer (hardware and software) resilience analysis considering the full computing continuum (from IoT/CPS to HPC applications), with emphasis on soft errors.

### I. INTRODUCTION

The tutorial proposes a bottom-up approach to the early resilience assessment of computing systems against hardware faults. Fig. 1 depicts how faults caused by different sources of failure at the silicon level (e.g., manufacturing defects, particle strikes) propagate through the different abstraction and design levels until they become a system failure, i.e., an error visible to the user.

Fig. 1. Fault Propagation

The tutorial will cover the methodologies used to compute the vulnerability (derating) factors at each layer of the system stack: technology (TVF), circuit (CVF), microarchitecture (uAVF), architecture (AVF), and software/program (SVF).
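To make the role of these factors concrete, the following minimal Python sketch shows one common first-order way of using derating factors: a raw fault rate is scaled by the fraction of faults assumed to survive each layer. The multiplicative model, the independence assumption behind it, the function name `derated_fit`, and every numeric value are illustrative assumptions. They are not results or methods taken from the tutorial or from [1][2].

```python
from math import prod
from typing import Mapping

# Hypothetical per-layer vulnerability (derating) factors for one hardware
# component. Each value is the assumed fraction of faults that propagate
# through that layer; the numbers are made up for illustration only.
FACTORS = {
    "TVF": 0.50,   # technology
    "CVF": 0.40,   # circuit
    "uAVF": 0.30,  # microarchitecture
    "AVF": 0.50,   # architecture
    "SVF": 0.20,   # software/program
}


def derated_fit(raw_fit: float, factors: Mapping[str, float]) -> float:
    """Scale a raw failure rate (in FIT) by every derating factor.

    Assumes the layers mask faults independently of each other, which is
    only a first-order approximation.
    """
    for name, value in factors.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return raw_fit * prod(factors.values())


if __name__ == "__main__":
    raw = 1000.0  # hypothetical raw FIT of the component
    print(f"system-visible FIT: {derated_fit(raw, FACTORS):.2f}")
```

With the made-up values above, the product of the five factors is 0.006, so a raw rate of 1000 FIT derates to about 6 FIT. A real analysis would estimate each factor per component and would not necessarily treat the layers as independent.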
These separate vulnerability factors, when combined, can potentially deliver an evaluation of the overall system reliability. This is needed not only to evaluate the effect of hardware faults on the final system. Analyzing the system error rate early, already at design time, also gives designers a tool to better estimate that rate and to analyze the effect of specific countermeasures, applied at any level, on the whole system [1][2].

### II. TUTORIAL ORGANIZATION

#### A. Introduction to Reliability

Reliability is a very broad domain to which several communities have made significant contributions. However, definitions and metrics have different meanings in different communities. This creates a serious obstacle to sharing knowledge and to efficiently implementing cross-layer reliability techniques, which require synergy between all layers of the system stack.

#### B. Cross-Layer Reliability Techniques Overview

Cross-layer reliability (or cross-layer resilience) is gaining relevance in both the academic and industrial sectors. In a cross-layer resilient system, physical and circuit-level techniques can mitigate low-level faults. Hardware redundancy can be used to manage errors at the hardware architecture layer. Finally, software-implemented error detection and correction mechanisms can manage the errors that escape the lower layers of the stack. To convey both the potential and the complexity of this design paradigm, the tutorial provides a brief overview of the most widely used protection techniques available at the different layers, including Logic, Architectural, ISA/Software, and System-Layer.
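The layered behaviour described above, where each layer handles some errors and passes the rest upward, can be sketched in a few lines of Python. This is a minimal illustration that assumes each layer independently removes a fixed fraction (its coverage) of the errors reaching it. The layer names mirror those in the text, while the `Layer` type, the helper functions, and all coverage and cost values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Layer:
    name: str
    coverage: float  # assumed fraction of incoming errors handled at this layer
    cost: float      # assumed relative overhead (area, power, or runtime)


def escaped_fraction(layers: Iterable[Layer]) -> float:
    """Return the fraction of errors that escapes every layer in turn."""
    remaining = 1.0
    for layer in layers:
        remaining *= 1.0 - layer.coverage
    return remaining


def total_cost(layers: Iterable[Layer]) -> float:
    """Sum the assumed overhead of all protection techniques in the stack."""
    return sum(layer.cost for layer in layers)


# Made-up protection stack, ordered from the lowest layer to the highest.
STACK = [
    Layer("Logic", coverage=0.50, cost=1.0),
    Layer("Architectural", coverage=0.80, cost=3.0),
    Layer("ISA/Software", coverage=0.50, cost=2.0),
    Layer("System-Layer", coverage=0.0, cost=0.0),  # unprotected in this example
]

if __name__ == "__main__":
    print(f"errors escaping all layers: {escaped_fraction(STACK):.1%}")
    print(f"total relative cost: {total_cost(STACK):.1f}")
```

Under these assumed values, about 5% of the errors escape the whole stack at a total relative cost of 6. Changing the coverage or cost of a single layer shows how protection at one level trades off against protection at another.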
The goal is not to provide an exhaustive review of the state of the art but to give an idea of the building blocks that can be exploited in a cross-layer resilient design and, most importantly, to let the audience understand the size and complexity of the related design space, which makes reliability analysis a crucial task in the early phases of the design.

#### C. Reliability Analysis in a Cross-Layer Domain

The decision of how to distribute error management across the different layers aims to meet the system reliability requirements of a specific application, considering its sensitivity to hardware faults, while minimizing the related reliability tax. Overall, by considering multiple layers, one can exploit a wider range of information when handling errors. This leads to globally optimized error management strategies that address not only reliability but also other design constraints. However, although a cross-layer holistic design approach has several advantages over traditional single-layer techniques, it increases the complexity of the design process, since a larger design space must be explored. This translates into an increasing demand for system-level reliability analysis frameworks able to evaluate different combinations of cross-layer error protection techniques early in the design cycle. Unfortunately, such tools still lack maturity, especially compared to those available for optimizing other design parameters such as power and performance.

### ACKNOWLEDGMENT

This tutorial presents methodologies and results obtained in the framework of several EC-funded projects in which the presenters have been actively involved: CLERECO (FP7), UniServer (H2020), and RECIPE (FETHPC).

### REFERENCES

- [1] A. Vallero et al., "SyRA: Early System Reliability Analysis for Cross-Layer Soft Errors Resilience in Memory Arrays of Microprocessor Systems," IEEE Transactions on Computers, vol. 68, no. 5, pp. 765-783, 1 May 2019.
- [2] A. Vallero et al., "Cross-layer system reliability assessment framework for hardware faults," 2016 IEEE International Test Conference (ITC), Fort Worth, TX, 2016, pp. 1-10, doi: 10.1109/TEST.2016.7805863.