# POLITECNICO DI TORINO Repository ISTITUZIONALE

# Soft Error Reliability Prediction of SRAM-based FPGA Designs

# Original

Soft Error Reliability Prediction of SRAM-based FPGA Designs / Vacca, Eleonora; Azimi, Sarah; DE SIO, Corrado; Portaluri, Andrea; Rizzieri, Daniele; Sterpone, Luca; Merodio Codinachs, David; Poivey, Christian. - ELETTRONICO. - (2022), pp. 1-4. (Intervento presentato al convegno IEEE Radiation and its Effects on Components and Systems 2022 tenutosi a Venice (ITA) nel 03-07 October 2022) [10.1109/RADECS55911.2022.10412546].

Availability:

This version is available at: 11583/2971147 since: 2022-10-13T12:58:58Z

Publisher:

**IEEE** 

Published

DOI:10.1109/RADECS55911.2022.10412546

Terms of use:

This article is made available under terms and conditions as specified in the corresponding bibliographic description in the repository

Publisher copyright

IEEE postprint/Author's Accepted Manuscript

©2022 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collecting works, for resale or lists, or reuse of any copyrighted component of this work in other works.


# Soft Error Reliability Prediction of SRAM-based FPGA Designs

Eleonora Vacca<sup>1</sup>, Sarah Azimi<sup>1</sup>, Corrado De Sio<sup>1</sup>, Andrea Portaluri<sup>1</sup>, Daniele Rizzieri<sup>1</sup>, Luca Sterpone<sup>1</sup>, David Merodio Codinachs<sup>2</sup>, Christian Poivey<sup>2</sup>

<sup>1</sup>Politecnico di Torino, Department of Control and Computer Engineering (DAUIN), Turin, Italy <sup>2</sup>European Space Agency, Noordwijk, The Netherlands

Abstract— We developed a tool for the reliability analysis of SEU effects on the configuration memory of Xilinx Zynq SRAM-based FPGAs. A proton radiation test campaign on different TMR layouts demonstrated the effectiveness of our approach.

Keywords—Reliability, Single Event Upsets, SRAM-based FPGAs, Proton Test.

#### I. INTRODUCTION

In aerospace applications, Field Programmable Gate Arrays (FPGAs) are affected by radiation-induced permanent and transient faults [1] due to the harsh space environment. FPGAs are extremely susceptible to radiation because ionizing radiation can provoke two major effects: a cumulative degradation due to the Total Ionizing Dose (TID), or sudden radiation-induced Single Event Effects (SEEs), which may damage the device permanently, as with a Single Event Latch-Up (SEL), or temporarily, as with transient faults such as Single Event Upsets (SEUs) and Single Event Transients (SETs) [2]. The adoption of either Commercial-Off-The-Shelf (COTS) or Radiation-Hardened devices requires mature design tools to implement circuits that are resilient to radiation-induced errors. When SRAM-based FPGAs are considered, SEEs affecting the configuration memory are the main problem [3], since SRAM-based configuration memory cells are particularly susceptible to SEUs, which cause structural changes in the circuit implemented on the FPGA and thus compromise its functionality [4]. Hardware Triple Modular Redundancy (TMR) applied at the design stage is the classical mitigation approach for SRAM-based FPGA circuits. However, it has been demonstrated that TMR may suffer from cross-domain failures induced by single and multiple bitflips within the configuration memory [5].

Several software tools have been proposed for analyzing SEU effects within the SRAM-based FPGA configuration memory. Generally, these tools focus on identifying essential and critical bits, which makes it possible to detect single points of failure that bypass the voter protection scheme of the TMR architecture [6][7]. The main idea behind such approaches is to relate the configuration memory bit coding to the FPGA routing and logic resources used by the mapped circuits. The tool proposed in [8] analyzes the circuit architecture to identify the bits that control a cross-domain routing resource or a voter resource, classified as sensitive and critical bits, and it estimates the contribution of each such bit to the overall sensitivity of the mapped circuit.

While these tools can be effectively used to identify single points of failure in TMR circuits mapped on SRAM-based FPGAs, they cannot provide a correct estimation of the soft error reliability, mainly because they do not consider the cumulative effects of bitflips within the FPGA architecture when it is subjected to single-particle-induced Multiple Bit Upsets (MBUs). Since it has recently been demonstrated that most ultra-nanometer FPGA technologies suffer from MBUs [7][9], there is a major need for a reliability analysis tool able to perform soft-error prediction that accounts for the cumulative effects of MBUs within the configuration memory of SRAM-based FPGAs.

In this paper, we developed an FPGA architectural model driven by the configuration memory coding of the Xilinx Zynq-7020 SRAM-based FPGA, which makes it possible to evaluate the architectural modifications induced by MBU effects. The model has been used as a back-end database for an analyzer tool able to calculate the overall reliability of the implemented designs according to the redundancy technique adopted at the design level. The approach has been validated by analyzing three designs implemented on a Xilinx Zynq XC7Z020 SRAM-based FPGA using a proton beam provided by the Paul Scherrer Institute (PSI) proton facility, with energies ranging from 16 MeV to 200 MeV.

This paper is organized as follows. Section II gives an overview of the configuration memory model. Section III describes the soft-error prediction tool and the identified radiation-induced effects. The radiation test experimental setup and results are presented in Section IV. Finally, Section V contains conclusions and future work.

#### II. BACKGROUND ON FPGA CONFIGURATION MEMORY

The mechanism relating the configuration memory bits to the physical resources programmed on the FPGA is fundamental for evaluating the circuit sensitivity to SEUs within the configuration memory. We developed two models: the first is related to the coding of the configuration memory bit data, and the second to the routing switch architecture.

The configuration memory bit model is developed according to the configuration memory bit organization of the Xilinx Zynq XC7Z020 SRAM-based devices, which we decoded in [10]. We identified the information related to logic resources, such as Look-Up Tables (LUTs) and Block RAMs (BRAMs), and to routing resources, including Programmable Interconnection Points (PIPs), local short lines, diagonal lines, long lines, and hex lines.

The logic and routing resource configuration data are stored within a configuration map that reflects the effective FPGA configuration memory, with the following considerations: the configuration memory bits are organized in clusters corresponding to the FPGA reconfigurable regions, and each type of resource has a defined memory map associated with the respective Configurable Logic Block (CLB). The characteristics of the configuration memory model coding scheme are described in Figure 1, which shows the different number of bits used to configure the routing segments and a schematic representation of the configuration memory model reflecting the effective layout of the Xilinx Zynq-7020 device.



Fig. 1. The configuration memory model coding scheme for the routing segments and the respective localization on the FPGA architecture for the Xilinx Zynq-7020.
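The configuration map described above can be pictured as a lookup table from configuration bits to the resources they program. The following is a minimal sketch under stated assumptions: the `(cluster, offset)` bit address, the `Resource` fields, and all names and coordinates are hypothetical and do not reproduce the actual Zynq-7020 bit coding.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

# Hypothetical bit address: (cluster index, bit offset inside the cluster).
BitAddress = Tuple[int, int]


@dataclass(frozen=True)
class Resource:
    kind: str              # e.g. "LUT", "BRAM", "PIP", "SHORT_LINE"
    clb: Tuple[int, int]   # (column, row) of the associated CLB
    name: str              # illustrative resource name


class ConfigurationMap:
    """Maps configuration memory bits to the resource they program."""

    def __init__(self) -> None:
        self._bits: Dict[BitAddress, Resource] = {}

    def program(self, address: BitAddress, resource: Resource) -> None:
        self._bits[address] = resource

    def lookup(self, address: BitAddress) -> Optional[Resource]:
        # None means the bit does not configure any resource used by the design.
        return self._bits.get(address)

    def bits_of_kind(self, kind: str) -> List[BitAddress]:
        return sorted(a for a, r in self._bits.items() if r.kind == kind)


# Illustrative content only; addresses and coordinates are made up.
cmap = ConfigurationMap()
cmap.program((0, 12), Resource("LUT", (3, 7), "lut_a"))
cmap.program((0, 40), Resource("PIP", (3, 7), "pip_0"))
cmap.program((1, 5), Resource("SHORT_LINE", (5, 7), "short_e2"))
```

Keeping the map keyed by bit address makes the later question "which resource does this upset touch?" a single dictionary lookup, with `None` standing for a bit that programs nothing in the design.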

The routing switch architecture model is based on the routing organization of Xilinx 7 Series SRAM-based FPGAs. In detail, the CLBs are organized in a two-dimensional array, and between the CLB rows and columns there are horizontal and vertical routing channels that allow data to be exchanged among CLBs, input blocks, and output blocks. Routing switchboxes enclose the PIPs, which allow switching between vertical and horizontal wires and enable access to the CLBs.



Fig. 2. A view of the Xilinx routing switchbox highlighting the reachable nodes, the source node, and the PIP code.

We developed a model according to the scheme reported in Figure 2, where each node is associated with a set of PIPs controlled by a group of SRAM configuration cells according to the configuration memory coding. The model is fundamental for the analysis of the cumulative effects of MBUs. Given a node, its PIPs correspond to a set of reachable nodes belonging to the same switch matrix (local connection) and/or to another nearby switchbox at a distance of 2 (short lines), 4 (longitudinal and diagonal lines), 6 (long lines), or 12 (hex lines) CLBs from the original CLB. In addition, each node is characterized by a direction and a displacement. This information is essential to describe a propagation scenario, since each configuration memory bit code is associated with a group of PIPs that may lie in different CLBs. This is shown in the example scenario of Figure 3, where the short-line configuration memory bits control routing resources located in different CLBs. Therefore, if an MBU hits one bit of the PIP code and one bit of the short-line code, the result is a cumulative effect on the PIP and on the short-line buffer.



Fig. 3. A representation of a common configuration memory bit coding two PIPs in different CLBs.
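To make the node and PIP relationship concrete, the sketch below models a node by its CLB, line type, and direction, and collects the PIPs touched by a set of upset bits. It is an illustration only: the CLB distances come from the text above, while the direction encoding, the wire names, and the bit indices 100 and 101 are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Set, Tuple

# CLB distance per line type, as listed in the text above.
SPAN = {"local": 0, "short": 2, "diagonal": 4, "long": 6, "hex": 12}

# Unit step per direction (hypothetical encoding of the node direction).
STEP = {"N": (0, 1), "S": (0, -1), "E": (1, 0), "W": (-1, 0)}


@dataclass(frozen=True)
class Node:
    clb: Tuple[int, int]
    line: str
    direction: str

    def destination(self) -> Tuple[int, int]:
        """CLB reached by following this node's line for its full span."""
        dx, dy = STEP[self.direction]
        span = SPAN[self.line]
        return (self.clb[0] + dx * span, self.clb[1] + dy * span)


@dataclass(frozen=True)
class Pip:
    clb: Tuple[int, int]   # CLB whose switchbox holds the PIP
    source: str            # illustrative wire names
    sink: str


def affected_pips(upset_bits: Iterable[int],
                  bit_to_pips: Dict[int, List[Pip]]) -> Set[Pip]:
    """Union of the PIPs controlled by every upset bit (cumulative effect)."""
    hit: Set[Pip] = set()
    for bit in upset_bits:
        hit.update(bit_to_pips.get(bit, []))
    return hit


short_east = Node(clb=(3, 7), line="short", direction="E")
far_clb = short_east.destination()

# One bit code may control PIPs in two different CLBs, as in Figure 3.
bit_to_pips = {
    100: [Pip((3, 7), "LUT_A_OUT", "SHORT_E2_BEG"),
          Pip(far_clb, "SHORT_E2_END", "LUT_B_IN")],
    101: [Pip((3, 7), "SHORT_E2_BEG", "SHORT_E2_BUF")],
}

touched = affected_pips({100, 101}, bit_to_pips)
touched_clbs = {p.clb for p in touched}
```

In this toy case the two-bit upset touches three PIPs spread over two CLBs, which is the kind of cumulative, non-local effect that a single-bit analysis would not capture.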

#### III. SOFT ERROR PREDICTION TOOL

The goal of the developed soft error prediction tool is to compute the reliability curve under radiation effects affecting the configuration memory of an SRAM-based FPGA. The tool can analyze any circuit implemented on Xilinx Zynq-7020 SRAM-based FPGAs. It consists of an algorithm that loads the physical netlist of the circuit, exported from the commercial tool, and performs the configuration memory bit analysis considering all the possible multiple combinations of SEUs within the FPGA configuration memory.

The flow of the tool is represented in Figure 4. The tool starts by loading the physical implementation of the circuit netlist. It then generates a virtual configuration memory database that contains all the information related to the programmed resources of the FPGA, together with the configuration memory coding associated with logic and routing resources, according to the rules described in the previous section. The execution flow consists of the following steps:

1. Configuration memory bit coding: it generates the configuration memory bit database, including the coordinates of LUTs, BRAMs, BRAM interconnects, FFs, and PIPs. Furthermore, it computes the decoded buffer coding for the long interconnection scenario. The coding tool is a fundamental part of the developed framework, since it contains the design implementation details. The framework uses a Physical Design Description (PDD) format that aggregates the information of the implemented design at the place-and-route level. The PDD is a graph whose nodes model the FPGA architectural resources, including LUTs, hardwired device units such as adders, multipliers, or memory blocks, Flip-Flops (FFs), and logic and routing configuration drivers, including clock wires and resets.

2. Architectural map: it associates each configuration memory bit code with the corresponding circuit resource used on the FPGA. At this stage, every configuration memory bit is associated with a circuit resource.
3. Bitflip insertion and propagation: it selects a group of configuration memory bits, mimics the upsets, and computes the circuit topology according to the modified bits. The modifications of the architecture may be cumulative with respect to the configuration memory bits; therefore, the final circuit topology may contain multiple architectural modifications that affect the reliability classification. Every bitflip effect is propagated to the output of the circuit and classified.
4. Reliability calculation: the reliability curve is calculated by performing a Monte Carlo analysis with a maximum of 100,000 iterations per MBU combination. The tool considers combinations from 1 to 500 simultaneous upsets. Finally, the tool generates a report including a list of parameters used during the Monte Carlo analysis, such as:
   - Accumulated bits: the number of bitflips accumulated in the virtual configuration memory.
   - Miss: the number of upsets that did not hit any programmed resource.
   - Errors: the number of bitflips that cause an error which propagates until a netlist cell labeled as output is reached. When more than one bit is upset, it means that at least one of them could propagate an error to the output of the circuit.
   - Filtered: the number of bitflips that, although related to a used resource, did not propagate the error to an output cell.



Fig. 4. The overall soft-error reliability analysis tool for Xilinx SRAM-based FPGAs.
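The injection, classification, and Monte Carlo steps can be sketched as follows. This is a simplified illustration, not the tool's implementation: it treats the netlist as a plain directed graph, classifies a bitflip as an error when its resource can reach an output cell, and estimates reliability for `k` simultaneous upsets as the fraction of iterations with no error. Topology modification and TMR voter masking are not modeled, the function names and the toy netlist are hypothetical, and the default iteration count is far below the 100,000 maximum quoted above so that the example runs quickly.

```python
import random
from typing import Dict, Iterable, List, Set


def reaches_output(start: str, graph: Dict[str, List[str]],
                   outputs: Set[str]) -> bool:
    """Depth-first search: can an error at `start` reach an output cell?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node in outputs:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, []))
    return False


def classify(bits: Iterable[int], bit_to_node: Dict[int, str],
             graph: Dict[str, List[str]], outputs: Set[str]) -> Dict[str, int]:
    """Count misses, errors, and filtered bitflips for one group of upsets."""
    bits = list(bits)
    report = {"accumulated": len(bits), "miss": 0, "errors": 0, "filtered": 0}
    for bit in bits:
        node = bit_to_node.get(bit)
        if node is None:
            report["miss"] += 1        # bit programs no used resource
        elif reaches_output(node, graph, outputs):
            report["errors"] += 1      # effect propagates to an output
        else:
            report["filtered"] += 1    # used resource, but no propagation
    return report


def reliability(k: int, total_bits: int, bit_to_node: Dict[int, str],
                graph: Dict[str, List[str]], outputs: Set[str],
                iterations: int = 1000, seed: int = 0) -> float:
    """Monte Carlo estimate: fraction of k-bit upsets with no output error."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(iterations):
        bits = rng.sample(range(total_bits), k)
        if classify(bits, bit_to_node, graph, outputs)["errors"] > 0:
            failures += 1
    return 1.0 - failures / iterations


# Toy netlist: lut0 -> ff0 -> out, while lut1 drives nothing observable.
graph = {"lut0": ["ff0"], "ff0": ["out"], "lut1": []}
outputs = {"out"}
bit_to_node = {0: "lut0", 1: "ff0", 2: "lut1"}

report = classify([0, 2, 7], bit_to_node, graph, outputs)
curve = {k: reliability(k, 10, bit_to_node, graph, outputs) for k in (1, 5, 10)}
```

Here `report` contains one miss (bit 7), one error (bit 0), and one filtered bitflip (bit 2), mirroring the report fields listed above, and `curve` gives one reliability estimate per number of simultaneous upsets.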

#### IV. EXPERIMENTAL ANALYSIS

The developed tool has been applied to a TMR benchmark design implemented in three alternative layouts on a Xilinx Zynq XC7Z020 28 nm CMOS SRAM-based FPGA, which adopts the Xilinx Artix-7 FPGA family architecture. The analysis has been compared with the results of the radiation test experiments performed using a proton beam at PSI. In this section, we describe the experimental setup and the results of the radiation test campaign.

## A. Experimental Setup

The selected benchmark design consists of the RISC-V ALU scheme illustrated in Figure 5. The RISC-V ALU was chosen as the main core because of its peculiar characteristics: it supports bit manipulation operations, adder partitioning, and Single Instruction Multiple Data (SIMD) operation, processing multiple data in parallel. These features make it an advanced computational unit that is nevertheless easy to customize, stimulate, and control under harsh working conditions.



Fig. 5. The scheme of the implemented ALU and the three implemented layout solutions: Default implemented by the commercial tool (a), TMR-Domain Based Isolation (b) and Resource Sharing (c).

The design is proposed in three resource layout solutions, referred to as Default, TMR-Domain Based Isolation, and Resource Sharing. The Default layout solution, represented in Fig. 5.a, is the outcome of the commercial implementation tool without any directives or constraints applied during the design flow. The TMR-Domain Based Isolation and Resource Sharing layouts have been realized by integrating an in-house tool with the commercial CAD tool. The TMR-Domain Based Isolation solution, shown in Fig. 5.b, aims at separating the physical resources used by the replica cores to keep them unrelated in the event of an error: physical separation reduces the probability that an error affecting one replica propagates to a nearby replica through shared resources. Instead, the Resource Sharing solution, reported in Fig. 5.c, emphasizes the problem of resource sharing, taking it to the extreme, as all replicas share all available resources. The characteristics of the implemented designs are reported in Table I, which lists, for each TMR layout, the number of LUTs, the number of routing resources, and the overall resource utilization.

Table I. Design Layouts Comparison

| TMR Designs      | LUTs [#] | Routing PIPs [#] | Usage [%] |
|------------------|----------|------------------|-----------|
| Default (Original) | 11,572   | 13,590           | 13.18     |
| Domain isolated  | 11,578   | 14,642           | 15.25     |
| Resource sharing | 11,572   | 24,948           | 23.10     |
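The figures in Table I can be compared directly against the first row. The short sketch below computes the ratio of routing PIPs relative to the Original row, which appears to correspond to the Default layout; the dictionary layout and the helper name `pip_ratio` are illustrative choices, while the numbers are those of Table I.

```python
# Values copied from Table I.
table_i = {
    "Original":         {"luts": 11_572, "pips": 13_590, "usage_pct": 13.18},
    "Domain isolated":  {"luts": 11_578, "pips": 14_642, "usage_pct": 15.25},
    "Resource sharing": {"luts": 11_572, "pips": 24_948, "usage_pct": 23.10},
}

baseline = table_i["Original"]


def pip_ratio(layout: str) -> float:
    """Routing PIPs used by `layout`, relative to the Original row."""
    return table_i[layout]["pips"] / baseline["pips"]


for name, row in table_i.items():
    extra_luts = row["luts"] - baseline["luts"]
    print(f"{name:17s} PIP ratio {pip_ratio(name):.2f}, extra LUTs {extra_luts}")
```

The LUT count is essentially constant across layouts (a difference of at most 6), whereas the routing changes substantially: roughly 8% more PIPs for the domain-isolated layout and about 84% more for the resource-sharing layout.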

## B. Experimental Results

All three solutions presented in Section IV.A were tested with energies ranging from 50.80 MeV up to 150 MeV with an average flux of $4.134 \cdot 10^7$ cm<sup>-2</sup>s<sup>-1</sup>.
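For reference, a cross-section is conventionally obtained by dividing the number of observed events by the particle fluence, where the fluence is the flux integrated over the exposure time. The sketch below applies that standard relation to the average flux quoted above; the exposure time and the failure count are hypothetical placeholders, not values from the test campaign.

```python
AVERAGE_FLUX = 4.134e7  # particles / (cm^2 * s), average flux from the text


def fluence(flux: float, exposure_s: float) -> float:
    """Fluence in particles/cm^2 for a constant flux over `exposure_s` seconds."""
    return flux * exposure_s


def cross_section(events: int, fluence_value: float) -> float:
    """Cross-section in cm^2: observed events divided by fluence."""
    if fluence_value <= 0:
        raise ValueError("fluence must be positive")
    return events / fluence_value


# Hypothetical run: 600 s of exposure and 12 observed cross-domain failures.
run_fluence = fluence(AVERAGE_FLUX, 600.0)
sigma = cross_section(12, run_fluence)
print(f"fluence = {run_fluence:.4e} cm^-2, cross-section = {sigma:.3e} cm^2")
```

Dividing the cross-section of each layout by that of a reference layout then gives a normalized failure figure that is independent of the exposure time of the individual runs.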

The achieved results are reported in Figure 6, which shows the normalized TMR failure rate of the three TMR layout designs, with the *Default* layout taken as the reference for the other solutions.



Fig. 6. The cross-domain TMR failure rate computed from proton-beam experiments for the different layout solutions.

At energies lower than 100 MeV, the design with resource sharing and routing congestion shows few differences from the reference design. Since no layout policy is applied in the *Default* design, the placement algorithm adopted by the CAD tool pushes the sharing of resources among the TMR cores to achieve area and performance optimization. At high energies, however, the difference is significant and the sensitivity deteriorates strongly, as the number of cross-domain failures doubles with respect to the *Default* design.



Fig. 7. The comparison of the predicted versus proton-test measured reliability.

We compared the reliability predicted by the developed tool with the reliability measured from the data recorded during the proton radiation test experiment. The results, illustrated in Figure 7, demonstrate an effective prediction for the three layout versions of the TMR benchmark circuit. In detail, we observed a marginal difference at higher reliability values, mainly related to the limited statistics of such events collected during the radiation test experiment. However, the estimated reliability curve is accurate, with a computed difference of less than 0.08% on average.
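The text does not state how the average difference was computed. One plausible reading, assumed here purely for illustration, is the mean absolute difference between the predicted and measured reliability curves sampled at the same points, expressed in percentage points. The curve values below are hypothetical and are not data from Figure 7.

```python
from typing import Sequence


def mean_abs_difference_percent(predicted: Sequence[float],
                                measured: Sequence[float]) -> float:
    """Mean absolute difference between two reliability curves, in percent.

    Both curves must be sampled at the same points, with reliability in [0, 1].
    """
    if len(predicted) != len(measured) or not predicted:
        raise ValueError("curves must be non-empty and of equal length")
    total = sum(abs(p - m) for p, m in zip(predicted, measured))
    return 100.0 * total / len(predicted)


# Hypothetical curves (not measured data).
predicted_curve = [0.9990, 0.9950, 0.9800]
measured_curve = [0.9995, 0.9945, 0.9805]

difference = mean_abs_difference_percent(predicted_curve, measured_curve)
print(f"average difference: {difference:.3f}%")
```

With these made-up samples each point differs by 0.0005, so the average difference is about 0.05%.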

#### V. CONCLUSIONS AND FUTURE WORKS

In this paper, a soft error reliability prediction tool has been presented and evaluated against radiation test experiments performed with a proton beam. The experimental analysis has been conducted on a Xilinx Zynq XC7Z020 SRAM-based FPGA. A TMR benchmark application has been implemented in three different layout solutions, which have been evaluated to assess the impact of the layout on the reliability of a hardened-by-design core, by measuring the cross-section of the TMR cross-domain failures and comparing the results of the radiation experiments with the reliability prediction performed by the proposed tool. The experimental results show that the tool is effective in predicting the reliability of TMR designs.

As future work, we will perform an accurate analysis of the contents of the configuration memory acquired during the readback procedure to verify the observation of SEMUs.

#### REFERENCES

- [1] V. Dumitriu, L. Kirischian and V. Kirischian, "Run-Time Recovery Mechanism for Transient and Permanent Hardware Faults Based on Distributed, Self-Organized Dynamic Partially Reconfigurable Systems," in *IEEE Transactions on Computers*, vol. 65, no. 9, pp. 2835-2847, 1 Sept. 2016.
- [2] R. C. Baumann, et al., "Radiation-induced soft errors in advanced semiconductor technologies," 2005.
- [3] D. Lee, et al., "Single-Event Characterization of the 28 nm Xilinx Kintex-7 Field-Programmable Gate Array under Heavy Ion Irradiation," in *IEEE Radiation Effects Data Workshop*, 2014.
- [4] M. D. Berg, K. A. LaBel, H. Kim, M. Friendlich, A. Phan and C. Perez, "A Comprehensive Methodology for Complex Field Programmable Gate Array Single Event Effects Test and Evaluation," in *IEEE Transactions on Nuclear Science*, vol. 56, no. 2, pp. 366-374, April 2009.
- [5] H. Quinn, K. Morgan, P. Graham, J. Krone, M. Caffrey and K. Lundgreen, "Domain Crossing Errors: Limitations on Single Device Triple-Modular Redundancy Circuits in Xilinx FPGAs," in *IEEE Transactions on Nuclear Science*, vol. 54, no. 6, pp. 2037-2043, Dec. 2007.
- [6] R. Zhang, L. Xiao, J. Li, X. Cao and L. Li, "An Adjustable and Fast Error Repair Scrubbing Method Based on Xilinx Essential Bits Technology for SRAM-Based FPGA," in *IEEE Transactions on Reliability*, vol. 69, no. 2, pp. 430-439, June 2020.
- [7] B. Du et al., "Ultrahigh Energy Heavy Ion Test Beam on Xilinx Kintex-7 SRAM-Based FPGA," in *IEEE Transactions on Nuclear Science*, vol. 66, no. 7, pp. 1813-1819, July 2019.
- [8] M. Desogus, L. Sterpone and D. M. Codinachs, "Validation of a tool for estimating the effects of soft-errors on modern SRAM-based FPGAs," in *IEEE 20th International On-Line Testing Symposium (IOLTS)*, 2014, pp. 111-115.
- [9] A. Pérez-Celis and M. J. Wirthlin, "Statistical Method to Extract Radiation-Induced Multiple-Cell Upsets in SRAM-Based FPGAs," in *IEEE Transactions on Nuclear Science*, vol. 67, no. 1, pp. 50-56, 2020.
- [10] L. Bozzoli, et al., "PyXEL: An Integrated Environment for the Analysis of Fault Effects in SRAM-Based FPGA Routing," in *International Symposium on Rapid System Prototyping (RSP)*, pp. 70-75, 2018.