# POLITECNICO DI TORINO Repository ISTITUZIONALE

FPGA Qualification and Failure Rate Estimation Methodology for LHC Environments Using Benchmarks Test Circuits

Original

FPGA Qualification and Failure Rate Estimation Methodology for LHC Environments Using Benchmarks Test Circuits / Scialdone, A.; Ferraro, R.; Alía, R. G.; Sterpone, L.; Danzeca, S.; Masi, A. - In: IEEE TRANSACTIONS ON NUCLEAR SCIENCE. - ISSN 0018-9499. - ELETTRONICO. - 69:7(2022), pp. 1633-1641. [10.1109/TNS.2022.3162037]

Availability: This version is available at: 11583/2961200 since: 2022-04-13T10:58:29Z

Publisher: IEEE

Published DOI:10.1109/TNS.2022.3162037

Terms of use:

This article is made available under terms and conditions as specified in the corresponding bibliographic description in the repository


# FPGA Qualification and Failure Rate Estimation Methodology for LHC Environments Using Benchmarks Test Circuits

Antonio Scialdone, Rudy Ferraro, Rubén García Alía, *Member, IEEE*, Luca Sterpone, *Member, IEEE*, Salvatore Danzeca, and Alessandro Masi

*Abstract*: When studying the behavior of a field programmable gate array (FPGA) under radiation, the most commonly used methodology consists in evaluating the single-event effect (SEE) cross section of its elements individually. However, this method does not allow the estimation of the device failure rate when using a custom design. An alternative approach based on benchmark circuits is presented in this article. It allows standardized application-level testing, which makes the comparison between different FPGAs easier. Moreover, it allows the evaluation of the FPGA failure rate independently of the application that will be implemented. The employed benchmark circuit belongs to the ITC'99 benchmark suite developed at Politecnico di Torino. Using the proposed methodology, the response of four FPGAs (the NG-Medium, the ProASIC3, the SmartFusion2, and the PolarFire) was evaluated under high-energy protons. Radiation tests with thermal neutrons were also conducted on the PolarFire to assess its potential sensitivity to them. Moreover, its performance in terms of total ionizing dose (TID) effects was evaluated by measuring the degradation of the propagation delay during irradiation.

*Index Terms*: Benchmark tests, failure estimation, field-programmable gate array (FPGA), protons, radiation tests, single-event effects (SEEs), thermal neutrons (ThNs), total ionizing dose (TID).

# I. INTRODUCTION

AT CERN, many electronic systems are installed in the mixed-field radiation environment of the large hadron collider (LHC). Thus, the radiation hardness assurance (RHA) of electronic components is fundamental to ensure reliable operation of accelerators and experiments. Because of their benefits in terms of cost, flexibility, and performance, field programmable gate arrays (FPGAs) are often at the core of several electronic systems. However, their lifetime and performance are affected by radiation-induced effects, such as single-event effects (SEEs) and total ionizing dose (TID). This sensitivity makes it necessary to perform many qualification tests in order to find a suitable FPGA for a specific system or experiment. In the past, many FPGAs, such as the ProASIC3, the SmartFusion2 [1], the Artix7 [2], and the NG-Medium [3], were qualified under radiation for CERN purposes. The ProASIC3 has been embedded in most of the accelerator systems in the past, whereas the SmartFusion2 and Igloo2 are used nowadays in new developments. However, with the imminent approach of the high-luminosity LHC (HL-LHC) and the consequent increase of the radiation levels, more robust FPGAs, able to withstand a higher fluence and ionizing dose, are necessary. Therefore, CERN is considering two new FPGAs as possible candidates for its applications: the NG-Medium and the PolarFire. The former is a radiation-hardened-by-design (RHBD) FPGA, manufactured using the STM C65 space process. The latter is the fifth-generation nonvolatile FPGA device from Microsemi, built on the state-of-the-art SONOS 28-nm nonvolatile process technology [4].

Manuscript received 22 February 2022; accepted 21 March 2022. Date of publication 24 March 2022; date of current version 18 July 2022. This work was supported in part by the French National Program "Programme d'Investissements d'Avenir, IRT Nanoelec" under Grant ANR-10-AIRT-05.

Antonio Scialdone is with the European Organization for Nuclear Research (CERN), 1211 Geneva, Switzerland, and also with the Dipartimento di Automatica e Informatica, Politecnico di Torino, 10129 Turin, Italy (e-mail: antonio.scialdone@cern.ch).

Rudy Ferraro, Rubén García Alía, Salvatore Danzeca, and Alessandro Masi are with the European Organization for Nuclear Research (CERN), 1211 Geneva, Switzerland.

Luca Sterpone is with the Dipartimento di Automatica e Informatica, Politecnico di Torino, 10129 Turin, Italy.

Color versions of one or more figures in this article are available at https://doi.org/10.1109/TNS.2022.3162037.

Digital Object Identifier 10.1109/TNS.2022.3162037

When performing a qualification test, the most commonly used procedure consists in evaluating the cross section of each functional element (FE), that is, DSPs, Flip-Flops, RAM, and phase-locked loops (PLLs), separately [5]. Several radiation test datasets for this test topology are already available in the literature for many FPGAs, including those analyzed in this article, such as the PolarFire [4], [6]. Even though this approach yields a lot of information about the sensitivity of each FE, it does not give a realistic overview of how a custom application will behave. Extrapolating the SEE susceptibility of a user design from the SEE response of its FEs is a difficult task. Other works, like [7] and [8], discuss the challenges and consequences faced when performing such analyses on mission-specific designs. This limitation makes the estimation of the device failure rate during LHC operation quite complex. For this reason, as also mentioned in [5], application-style tests are recommended to derive a realistic behavior of the system. Nonetheless, performing a test for each application is an expensive and time-consuming task. In addition, interpreting the obtained results for other FPGAs belonging to different families and with different technologies can be difficult. These limitations, together with the imminent approach of the HL-LHC, make the qualification process even more important and create the need to find a more reliable testing approach for the estimation of the device failure rate.

This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

In this article, a testing methodology to overcome the aforementioned problems is presented. The approach uses a benchmark application, which better reflects the workload of a real application and facilitates standardized application-level testing, making the comparison of different families of FPGAs easier. The benchmark belongs to the ITC'99 suite [9] developed by Politecnico di Torino. Such circuits have already been used in other reliability experiments, like [10] and [11], for evaluating the performance of different mitigation techniques against SEEs. The advantages of this new methodology compared to the standard methodology are demonstrated by the results obtained for the SmartFusion2, the ProASIC3, the NG-Medium, and the PolarFire under high-energy protons. Using the benchmark, a device reliability analysis has been performed to understand the failure modes of the different FPGAs and compare the advantages and disadvantages of using one instead of another. For the PolarFire FPGA only, the results for the FEs under protons and thermal neutrons (ThNs) are presented, since a higher sensitivity is expected because of its newer technology node. Finally, its performance against TID effects is analyzed.

# II. CERN RADIATION ENVIRONMENT

The radiation environment inside the CERN accelerators is composed of different particles over a large spectrum of energies, from megaelectronvolts up to gigaelectronvolts, whose distribution may vary significantly depending on the location. When exposed to this sort of environment, FPGA performance can be impacted by both TID and SEEs. The two main contributions to SEEs inside the LHC come from high-energy hadrons (HEHs) and ThNs. HEHs are defined as all hadrons with kinetic energy above 20 MeV. The different areas inside the LHC and the corresponding radiation levels of the upcoming HL-LHC are analyzed in detail in [12]. The LHC areas where commercial off-the-shelf (COTS) components can be installed can be divided into two main groups, according to their radiation levels: tunnel areas and shielded areas. Among the shielded areas, the UJs (junction chambers) and ULs (liaison galleries), located near the interaction points (IPs), are heavily shielded, whereas the RRs, located between the insertion regions (IRs) and the dispersion suppressors (DSs), are lightly shielded. In the heavily shielded areas, the ThN contribution is much higher, because high-energy neutrons are thermalized by the shielding, whereas high-energy particles are attenuated. Different studies, like [2], [13], and [14], show that the ThN contribution can significantly affect the sensitivity of recent technologies. Therefore, a risk factor $(R_{\rm th})$ [15] was introduced to identify locations with higher ThN contribution. The aforementioned risk factor is defined as

$$R_{\rm th} = \frac{\Phi_{\rm ThN}}{\Phi_{\rm HEH}}$$

where $\Phi_{\rm ThN}$ is the ThN fluence and $\Phi_{\rm HEH}$ is the high-energy hadron fluence. Table I summarizes the expected fluences with the HL-LHC upgrade for these areas obtained by FLUKA simulation [12], and the average $R_{\rm th}$ derived from measurements in [14]. In the tunnel, the risk factor is around 4, but it can go up to 25 in the ULs and up to 50 in the UJs. Therefore, ThNs can have a nonnegligible contribution to the total failure rate. This adds more complexity to the RHA procedure [16], making ThN tests mandatory for FPGAs with smaller node technologies.

TABLE I EXPECTED ANNUAL RADIATION LEVELS IN THE LHC AREAS FOR THE HL-LHC UPGRADE AND AVERAGE MEASURED RISK FACTOR

| Location       | TID (Gy) | HEH Fluence (p/cm<sup>2</sup>) | R<sub>th</sub> |
|----------------|----------|--------------------------------|----------------|
| Shielded areas |          |                                |                |
| UJ             | 10       | $5\cdot 10^9$                  | 50             |
| UL             | 0.2      | $10^{8}$                       | 25             |
| RR             | 6        | $3\cdot 10^9$                  | 6              |
| Tunnel         |          |                                |                |
| DS             | 100      | $5 \cdot 10^{10}$              | 4              |
| ARC            | 2        | $10^{9}$                       | 3              |
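To make the role of the risk factor concrete, the following Python sketch estimates the expected number of SEEs per year at each location of Table I. It assumes that the HEH and ThN contributions simply add, that is, $N = \sigma_{\rm HEH}\Phi_{\rm HEH} + \sigma_{\rm ThN}\Phi_{\rm ThN}$ with $\Phi_{\rm ThN} = R_{\rm th}\Phi_{\rm HEH}$. The additive model, the function names, and the two cross-section values are assumptions introduced for illustration only; they are not measured values from this work.

```python
# Illustrative estimate of annual SEE counts from Table I (hypothetical cross sections).

# Location -> (annual HEH fluence in cm^-2, average risk factor R_th), from Table I.
TABLE_I = {
    "UJ": (5e9, 50),
    "UL": (1e8, 25),
    "RR": (3e9, 6),
    "DS": (5e10, 4),
    "ARC": (1e9, 3),
}


def annual_events(sigma_heh, sigma_thn, phi_heh, r_th):
    """Expected events per year, assuming HEH and ThN contributions add."""
    phi_thn = r_th * phi_heh  # definition of the risk factor
    return sigma_heh * phi_heh + sigma_thn * phi_thn


def thn_share(sigma_heh, sigma_thn, r_th):
    """Fraction of the expected events that is due to thermal neutrons."""
    return (sigma_thn * r_th) / (sigma_heh + sigma_thn * r_th)


# Placeholder device cross sections in cm^2 (assumed, not from the measurements).
SIGMA_HEH = 1e-9
SIGMA_THN = 1e-10

for location, (phi_heh, r_th) in TABLE_I.items():
    events = annual_events(SIGMA_HEH, SIGMA_THN, phi_heh, r_th)
    share = thn_share(SIGMA_HEH, SIGMA_THN, r_th)
    print(f"{location}: {events:.2f} events/year, ThN share {share:.0%}")
```

With these placeholder values, a ThN cross section ten times smaller than the HEH one would still account for most of the expected events in the UJs, which illustrates why the ThN response cannot be neglected in heavily shielded areas.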

To assess the devices' sensitivity to HEHs and ThNs individually, the FPGAs are tested in two different facilities. Concerning the HEHs, according to the RHA procedure, they all have the same probability of inducing SEEs due to their similar nuclear interaction cross sections. Therefore, the devices are tested at the Paul Scherrer Institute (PSI) in Villigen, Switzerland, under a high-energy proton beam of 200 MeV. The ThN tests are carried out at the TENIS beam line at the Institut Laue-Langevin (ILL) [17] in Grenoble, France.

# III. TEST SETUP AND METHODOLOGY

Different radiation campaigns were performed in order to investigate the robustness of different FPGAs. The NG-Medium and the PolarFire have been proposed as candidates for future CERN applications, considering the TID levels expected for the HL-LHC. The former, although its cost is not comparable to that of a fully commercial solution, is an RHBD FPGA; therefore, it might be the only solution for systems in very harsh environments, especially in terms of TID. The PolarFire, instead, belongs to the same family as the SmartFusion2 and ProASIC3, which had already been qualified for the LHC environments. Thus, similar or better performance is expected, considering the addition of the SONOS technology.

To carry out the radiation campaigns, a setup comprising two different systems was adopted: the FPGA under test, acting as the device under test (DUT), and a second FPGA, acting as a tester. Using an additional FPGA removes from the DUT the circuitry needed to transfer the data to a host computer, for example, the logic necessary to implement a UART/Ethernet peripheral or a built-in self-test (BIST) architecture. Thus, the circuits under test are monitored directly. This allows a better analysis of the results because the testing logic on the DUT is minimized, reducing the potential sources of errors. Fig. 1 illustrates a top-level view of the test setup. The Zynq-7000 SoC was adopted as the tester. It contains an ARM Cortex-A9 CPU connected to an Artix-7 FPGA. The FPGA implements the necessary test routine and sends the relevant data to the ARM CPU, and an external computer can be used to monitor the test. The tester and the DUT are connected through a low pin count (LPC) FMC cable. Therefore, the development board hosting the DUT must be equipped with an FMC connector. This is the case for the NG-Medium and the PolarFire. However, for the ProASIC3 and SmartFusion2, whose development boards lack an FMC connector, a UART interface was integrated on the DUT to communicate with the external computer.

Fig. 1. Overview of the test setup adopted for the characterization of the DUT.

Implementing the benchmark structure on all the FPGAs allowed a better comparison between multiple devices belonging to different families. Additionally, the SEE sensitivity of the PolarFire FEs was studied, to investigate the possibility of estimating the benchmark circuit sensitivity starting from these results. Propagation delay tests were also performed to quantify the design lifetime as a function of dose. These tests were conducted only on the PolarFire since they had already been performed on the other FPGAs in previous works.

#### A. Standard Methodology

Following the already established guidelines detailed in [5], the sensitivity of the FEs was retrieved through dedicated circuits.

Concerning the Flip-Flops, multiple chains of windowed shift registers (WSRs) were implemented. They are shift registers with a serial-to-parallel output window that captures the output of the last N Flip-Flops. This structure was chosen because it allows performing tests at higher speed, reducing signal integrity issues at the FPGA output. Thanks to this structure, in normal conditions, the output window always contains the same values; only in the case of an error is the content of the output window different. The same structure was used to test the single-event transient (SET) capture sensitivity, by adding combinatorial gates between each pair of Flip-Flops in the chain (WSR-SET). Both chain topologies were implemented with and without triple modular redundancy (TMR) to test the efficiency of this mitigation technique. Moreover, as many chains as possible were placed on the DUT to gather good statistics. However, in order to keep the number of required I/Os low, a comparator mechanism for each chain, which checks the correctness of the data present in the window, was included. The output of this comparator was monitored by the tester architecture. Additionally, two test modes were included: static and dynamic. In the static mode, the bit shifted inside the chain is set to either 0 or 1. In the dynamic mode, the input alternates between the two. Fig. 2 shows the overview of the logic circuit used to retrieve the Flip-Flop sensitivity.

Fig. 2. Top-level architecture of the logic circuit used to retrieve the flip-flop sensitivity. The tester provides the inputs to the various structures on the DUT and monitors their outputs.

Fig. 3. Top-level architecture (left) of the circuit used to characterize the DSP's sensitivity. Internal structure (right) of a cluster of DSP.
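The behavior of a WSR chain and of its window comparator can be illustrated with a small behavioral model. The Python sketch below is not the RTL used in the tests: the class name, the chain length, the window size, and in particular the rule used to check the window in the dynamic mode (adjacent bits must differ) are assumptions made for illustration.

```python
from collections import deque


class WindowedShiftRegister:
    """Behavioral model of one WSR chain (illustrative only)."""

    def __init__(self, length, window):
        if not 0 < window <= length:
            raise ValueError("window must be between 1 and length")
        self.ffs = deque([0] * length, maxlen=length)
        self.window = window

    def clock(self, bit_in):
        # The new bit enters at index 0; the oldest bit drops off the far end.
        self.ffs.appendleft(bit_in)

    def flip(self, index):
        # Emulate a single-event upset in one flip-flop.
        self.ffs[index] ^= 1

    def output_window(self):
        # Parallel view of the last `window` flip-flops of the chain.
        return list(self.ffs)[-self.window:]


def window_ok(window, mode):
    """Comparator: all bits equal (static) or strictly alternating (dynamic)."""
    if mode == "static":
        return len(set(window)) == 1
    if mode == "dynamic":
        return all(a != b for a, b in zip(window, window[1:]))
    raise ValueError(f"unknown mode: {mode}")


def run(mode, length=16, window=4, cycles=40, seu_at=None):
    """Count the cycles in which the comparator flags an error."""
    wsr = WindowedShiftRegister(length, window)
    errors = 0
    for t in range(cycles):
        wsr.clock(1 if mode == "static" else t % 2)
        if seu_at is not None and t == seu_at:
            wsr.flip(0)  # upset the first flip-flop of the chain
        # Start checking only once the chain has been filled with test data.
        if t >= length and not window_ok(wsr.output_window(), mode):
            errors += 1
    return errors


print(run("static"), run("static", seu_at=20), run("dynamic", seu_at=20))
```

In this model a single upset is flagged for as many clock cycles as the window is wide, because the corrupted bit is visible while it travels through the window. A single comparator output per chain is therefore enough to observe it.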

DSPs can usually be configured to implement various operations. In the case of the PolarFire, the possible configurations are: multiplier only (MULT), adder (ADD), and multiply and accumulate (MAC). To retrieve all of their sensitivities, a circuit containing all the configurations was implemented. In order to increase the amount of statistics collected, while keeping the number of I/Os within the limit of the FMC connector, the DSPs are organized in clusters. Each cluster contains DSPs performing the same operation. All the clusters are fed with the same input, stored on the DUT in triplicated registers. Thus, all the DSPs in the same cluster are expected to produce the same output. A comparator, for each DSP, checks that the output is correct. Finally, all the comparators of a cluster are connected through an AND gate to the DUT output that is monitored by the tester. This way, an error is detected as soon as one DSP in the cluster is affected by an SEE. The test can be performed either in the static or in the dynamic mode: in the static mode, the inputs are fixed and the DSPs perform the same operation starting from the same input. In the dynamic mode, the input alternates between two values. The overview of this architecture is shown in Fig. 3.
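The cluster-level check described above can be sketched as follows. This Python model is only a behavioral illustration: the operand values, the way the expected result is obtained, and the XOR masks used to emulate an SEE are assumptions and do not describe the actual DUT logic.

```python
# Behavioral sketch of one DSP cluster check (hypothetical model, not the DUT RTL).

OPERATIONS = {
    "MULT": lambda a, b, acc: a * b,
    "ADD": lambda a, b, acc: a + b,
    "MAC": lambda a, b, acc: a * b + acc,
}


def cluster_output(op, a, b, acc, n_dsp, upset_masks=None):
    """Return True when every DSP of the cluster matches the expected result."""
    upset_masks = upset_masks or {}
    expected = OPERATIONS[op](a, b, acc)
    # All DSPs compute the same operation; an SEE is emulated by XOR-ing a mask.
    outputs = [expected ^ upset_masks.get(i, 0) for i in range(n_dsp)]
    comparators = [out == expected for out in outputs]  # one comparator per DSP
    return all(comparators)  # AND gate driving the monitored DUT output


print(cluster_output("MAC", 3, 5, 7, n_dsp=10))
print(cluster_output("MAC", 3, 5, 7, n_dsp=10, upset_masks={4: 0b1}))
```

The AND reduction trades observability for pin count: the tester learns that a cluster failed, but not which DSP inside the cluster was affected.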

As far as clocking circuitry is concerned, the PolarFire comes with eight PLLs. Their sensitivity was analyzed by monitoring their LOCK signal using the tester. For each PLL, the tester registers an event every time the PLL is not locked.

#### B. Benchmark Test Methodology

Evaluating the radiation response of an FPGA is a challenging task because there are many aspects to take into consideration. Starting from a given RTL description of a circuit, its final implementation will depend on the target FPGA, the tool used, and the applied constraints. Moreover, starting from the RTL model, there are many steps automatically performed by the tool that are configurable by the user. All these aspects can lead to a different implementation of the same RTL circuit, with different area, power consumption, and performance characteristics, and eventually a different radiation response, because the resulting number of configuration bits is different or because an SET may propagate differently depending on the resource used. However, when choosing the FPGA candidate for a particular application at CERN, there is no defined implementation strategy, since it depends on each application. Most of the time, optimizations are carried out only at the RTL level and through primitives available from the tool, for example, to apply the TMR mitigation scheme. In addition, when using different tools, the available implementation strategies may not be the same, which would make the comparison harder. Moreover, while applying a different strategy can improve the circuit response, it can also mask some of the failure modes that could arise during operation when such techniques are not applied by the FPGA developers. Thus, since the goal is to perform an application-level test, testing all the FPGAs using the same implementation strategies was important. For all these reasons, it was decided to study the behavior of each device running the same benchmark application, constraining only the clock frequency and the I/Os, and applying or not the TMR mitigation technique, leaving everything else to the vendor tools.

1) Benchmark Presentation: The adopted benchmark belongs to the ITC'99 suite, in particular to the IT99 portion developed by Politecnico di Torino. The goal of the suite is to provide a set of RTL circuits with different test cases and different complexity, but with uniform characteristics. They were built starting from public VHDL files, modified and combined to obtain larger circuits. Following this process, the circuits might have lost their original functionalities in favor of uniformity of description. The result of this process is a set of fully synthesizable circuits, without any hardware- or compiler-specific directive, which allows their implementation on different families of FPGAs. The VHDL descriptions range from tiny to larger circuits which implement a variety of functionalities, from finite-state machines (FSMs) to soft-core microprocessors. Therefore, they are a good representation of how FPGAs are used in a real environment. In this work, the B13 was chosen. Its original function was to act as an interface with a weather sensor. The circuit occupies relatively few resources (339 gates, 53 FFs, 20 I/O). It is mainly based on FSMs, which are quite common in real applications. Even though it is quite small, it was selected because it had already been used in other reliability experiments on other FPGAs. Hence, it was possible to compare our results with the ones obtained in the other measurements. Moreover, tests were planned for FPGAs of different sizes, with the aim of collecting as many statistics as possible from the irradiation tests. Thus, the B13 represented a good choice given its small size, allowing one to replicate it as many times as possible on the DUT.

Fig. 4. Top-level architecture of the circuit used to study the response of the benchmark application.

2) Benchmark Setup: When choosing the DUT implementation, a tradeoff between error observability and complexity of the test structure is necessary, depending on the target FPGA. Each B13 has ten outputs, and therefore it is not possible to directly monitor each instance using the tester because of the limited number of pins. Thus, a comparison on the DUT was adopted, and additional test logic was implemented on the tester in order to keep a good level of observability. The input generator and the golden reference were instead moved to the tester. During the test, the input generator is used to feed the golden circuit on the tester and all the B13s on the DUT. The outputs of the B13s on the DUT are compared against the output of the golden B13. In the case of an error, the test logic raises an error signal and generates the identifier of the B13 affected by the error and its output. All this information is sent to the tester, to help identify the structure in which the error occurred. Fig. 4 shows this architecture. A second version of this design with TMR was also tested. In both versions, TMR is implemented on all the comparators and the failure detection logic to make them more robust against SEEs. This architecture was used for both the NG-Medium and the PolarFire. For the SmartFusion2 and ProASIC3, whose boards lack the FMC connector, another solution was necessary because of the limited number of pins available on the test board. The error observability was sacrificed in favor of a less complex DUT design. The B13s were compared two-by-two, and a UART interface was used to notify the errors to the external computer for data analysis. In this case, the tester was not employed. Even for this design, TMR was applied on all the logic surrounding the B13s.
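The two comparison schemes described above can be contrasted with a short behavioral sketch in Python. The function `b13_stub` is a purely hypothetical stand-in for the B13 logic (the real circuit is a sequential FSM), and the XOR masks used to emulate upsets are likewise assumptions made for illustration.

```python
MASK_10BIT = 0x3FF  # each B13 exposes ten outputs


def b13_stub(stimulus):
    """Hypothetical placeholder for the B13 outputs for a given stimulus."""
    return (stimulus * 37 + 11) & MASK_10BIT


def check_against_golden(stimulus, n_instances, upsets=None):
    """Golden-reference scheme: report (instance id, faulty output) pairs."""
    upsets = upsets or {}
    golden = b13_stub(stimulus)
    reports = []
    for instance_id in range(n_instances):
        output = golden ^ (upsets.get(instance_id, 0) & MASK_10BIT)
        if output != golden:
            reports.append((instance_id, output))
    return reports


def check_pairwise(stimulus, n_instances, upsets=None):
    """Two-by-two scheme: return the indices of the pairs that disagree."""
    upsets = upsets or {}
    golden = b13_stub(stimulus)
    outs = [golden ^ (upsets.get(i, 0) & MASK_10BIT) for i in range(n_instances)]
    return [i // 2 for i in range(0, n_instances - 1, 2) if outs[i] != outs[i + 1]]


print(check_against_golden(5, 8, upsets={3: 0b1}))
print(check_pairwise(5, 8, upsets={3: 0b1}))
```

The golden-reference scheme identifies the faulty instance and its output, whereas the pairwise scheme only reveals that one of the two instances in a pair has failed. This is the loss of observability accepted for the boards without the FMC connector.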

#### C. Propagation Delay Monitoring

Fig. 5. Circuit for measuring the propagation delay change in the DUT. The tester selects the ring oscillators to observe, and a frequency counter on the tester FPGA measures its output frequency.

TABLE II NUMBER OF CIRCUITS AND ELEMENTS USED FOR THE CHARACTERIZATION OF THE POLARFIRE MPF300

| Structure                 | # Replicas | # Elements |
|---------------------------|------------|------------|
| WSR                       | 4          | 8000       |
| WSR-TMR                   | 2          | 8000       |
| WSR-SET                   | 4          | 8000       |
| WSR-SET-TMR               | 2          | 8000       |
| DSPs (all configurations) | -          | 80         |
| PLLs                      | -          | 8          |

When exposed to radiation, the propagation delay of the FPGA elements can increase [18], [19]. Since every logic circuit works at a specific frequency, if the propagation delay changes while under radiation, the design could fail even before the device breakdown. As observed in [20], the propagation delay degradation reached values of up to 1100% before the device failed. This kind of event is the main source of failure related to TID for a user design. Therefore, the propagation delay increase was measured by monitoring the frequency of many ring oscillators. A ring oscillator is simply a loop of inverters. For an odd number of inverters N, each implemented with an LUT whose propagation delay is $t_{\rm pd}$, the output of the last inverter will oscillate with a frequency f expressed by

$$f = \frac{1}{t_{\rm pd} \times N \times 2}.$$

Fig. 5 shows the circuit used to monitor the propagation delay. The TID effects on the DUT were analyzed by measuring the change of frequency of each ring oscillator during irradiation using the tester.

# IV. RADIATION TESTS AND RESULTS

This section analyzes the results obtained from the different radiation campaigns. For all the FPGAs, the benchmark circuit was used to retrieve the devices' response at application level and to estimate their failure rate during operation. For the PolarFire only, the FEs and the performance against TID were evaluated too.

#### A. PolarFire FEs

The sensitivity of the Flip-Flops, DSPs, and PLLs of the PolarFire was studied under proton and ThN irradiation. The Flip-Flops were tested using the WSR and WSR-SET structures, implemented with and without TMR, in the dynamic mode. Multiple tests at multiple frequencies were performed, with the aim of understanding the possible relationship between the frequency and the SEE cross sections. DSPs and PLLs were also placed on the same DUT, together with the FFs. The DSPs were tested at different frequencies and in the dynamic mode. The PLLs were all fed by the same clock, and they were all using the same configuration. During the irradiation, in addition to single-event upsets (SEUs), global failures were observed: most of the circuits on the DUT (FFs, DSPs, and PLLs) were failing continuously at the same time. These events, considered to be single-event functional interrupts (SEFIs), were treated as a separate category. Table II describes the number of replicas used for each type of element. Fig. 6 shows the cross section for SEUs and SEFIs for the different elements. It must be noted that no SEU was observed on the PLLs. For each cross section, the 95% confidence interval calculated using the methodology presented in [21] is reported.

Fig. 6. SEU and SEFI cross sections under protons and ThNs for the PolarFire MPF300 elements.
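As an illustration of how a cross section and its confidence interval can be derived from a number of observed events, the Python sketch below computes $\sigma = N/(\Phi \cdot n)$ together with an exact two-sided Poisson interval on N. The exact Poisson interval is a substituted assumption: the methodology of [21] may differ, for instance by also including the fluence uncertainty. The event count, fluence, and number of elements in the example are hypothetical.

```python
import math


def poisson_cdf(n, mu):
    """P(X <= n) for X ~ Poisson(mu), summed term by term."""
    if n < 0:
        return 0.0
    term = math.exp(-mu)
    total = term
    for k in range(1, n + 1):
        term *= mu / k
        total += term
    return min(total, 1.0)


def _root_of_decreasing(func, lo, hi, iterations=200):
    """Bisection for a function that decreases from positive to negative."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if func(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def poisson_interval(n_events, confidence=0.95):
    """Exact two-sided limits on the Poisson mean for n_events observed."""
    alpha = 1.0 - confidence
    hi = n_events + 10.0 * math.sqrt(n_events + 1.0) + 10.0  # safe search bound
    if n_events == 0:
        lower = 0.0
    else:
        # Lower limit solves P(X >= n | mu) = alpha / 2.
        lower = _root_of_decreasing(
            lambda mu: alpha / 2 - (1.0 - poisson_cdf(n_events - 1, mu)), 0.0, hi)
    # Upper limit solves P(X <= n | mu) = alpha / 2.
    upper = _root_of_decreasing(
        lambda mu: poisson_cdf(n_events, mu) - alpha / 2, 0.0, hi)
    return lower, upper


def cross_section(n_events, fluence, n_elements=1):
    """Cross section in cm^2 per element for a fluence given in cm^-2."""
    return n_events / (fluence * n_elements)


# Hypothetical run: 20 events, 1e11 p/cm^2, 8000 flip-flops.
n, phi, elements = 20, 1e11, 8000
low, up = poisson_interval(n)
print(cross_section(n, phi, elements),
      cross_section(low, phi, elements),
      cross_section(up, phi, elements))
```

Under this assumption, a run with zero observed events still yields an upper limit of about 3.7 events at the 95% level, which is how an upper bound on the cross section can be quoted when no event is detected.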

Several considerations can be made about these results. First, despite the smaller technology node, the PolarFire shows an average FF cross section $2.5\times$ lower than that of its predecessor, the SmartFusion2 [1]. The difference is even larger for the DSPs, with an average cross section two orders of magnitude lower than that of its predecessor. Second, a lower cross section in the TMR version of the WSR chain can be observed, both for protons and ThNs, which demonstrates the efficiency of the mitigation technique. However, global design failures (SEFIs) were observed and their cross section is relatively high; they represent the dominant failure type. These SEFIs can originate from SETs on a global route, which can be either the reset or the clock. However, not all the structures were failing, probably because the SETs were attenuated before reaching the structures placed far from the affected location.

Furthermore, a very interesting outcome of these tests is the cross section measured for the ThNs. For every metric, except for FFs with TMR, the proton and ThN cross sections are comparable. The cause of this high ThN cross section could be related to the presence of boron-10 in the device, since this isotope has a very high ThN capture cross section. Considering the TMR version instead, the ThN cross section is lower. One possible reason could be the different SET duration: it is possible that the SETs induced by ThNs are shorter than those induced by protons, and thus they are better mitigated by TMR. However, the SET duration should be measured with techniques such as those presented in [22] to verify this assumption. These results confirm that ThN tests are necessary because ThNs can have a large impact on the failure rate of these devices in operation. Fig. 7 shows the FF cross section for SEUs as a function of the frequency. The results show that the cross section is stable with frequency, except for the WSR version in the TMR mode.



Fig. 7. PolarFire flip-flop SEE cross section as a function of the frequency. The solid lines refer to the static test, whereas the dashed lines refer to the dynamic test.



Fig. 8. Propagation delay difference before and after irradiation of the PolarFire MPF300 for a total cumulative dose of 5.2 kGy.

#### B. PolarFire MPF300 Propagation Delay Degradation

The degradation of the propagation delay was monitored using 1952 ring oscillators, each containing 47 inverters. With this number of stages, considering the LUT delay and the routing path, each ring generates a 100-MHz signal. The structures were spread across the entire FPGA to investigate whether some areas could be affected more than others. Moreover, the ring oscillators were placed manually. For each of them, all the inverters were placed next to each other, avoiding the extra logic that could be added by the placement tool, resulting in an identical structure for all the rings. The rings were monitored during the whole irradiation run using the tester. Fig. 8 shows the difference, in percentage, between the frequency measured before and after the irradiation. As can be seen, the FPGA exhibited very good behavior. The frequency of most of the ring oscillators changed by only 0.3%, and the maximum observed variation is 0.45%, which is a very good result considering that the absorbed dose reached 5.2 kGy.
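The relation $f = 1/(2 N t_{\rm pd})$ can be used to convert the measured frequencies into per-stage delays. The Python sketch below does this for the figures quoted above (47 inverters, 100 MHz). The function names are introduced here for illustration, and the example assumes that the 0.3% frequency change is a decrease, which is not stated explicitly in the text.

```python
def ring_frequency(t_pd, n_inverters):
    """Oscillation frequency in Hz of a ring of n_inverters stages."""
    return 1.0 / (2.0 * n_inverters * t_pd)


def stage_delay(frequency, n_inverters):
    """Effective delay per stage in seconds (LUT plus routing)."""
    return 1.0 / (2.0 * n_inverters * frequency)


def delay_change_percent(f_before, f_after):
    """Relative change of the stage delay, using t_pd proportional to 1/f."""
    return (f_before / f_after - 1.0) * 100.0


N_INVERTERS = 47
F_NOMINAL = 100e6  # Hz, as reported for each ring

t_pd = stage_delay(F_NOMINAL, N_INVERTERS)
print(f"effective stage delay: {t_pd * 1e12:.1f} ps")

# Assumed example: the frequency drops by 0.3% after irradiation.
f_after = F_NOMINAL * (1.0 - 0.003)
print(f"delay increase: {delay_change_percent(F_NOMINAL, f_after):.3f} %")
```

For small variations the relative delay change is almost equal to the relative frequency change, so a sub-percent frequency shift corresponds to a sub-percent increase of the propagation delay.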

#### C. Benchmark Application

Before presenting the results for the benchmark application, it is necessary to analyze the different events and failure types observed.

1) NG-Medium: It is an SRAM-based FPGA whose cells are hardened by design, and therefore they are more resilient to single events compared to those of the other FPGAs. However, its configuration memory (CRAM) is based on an SRAM architecture, which is more sensitive to SEUs than that of Flash-based FPGAs. Nevertheless, the NG-Medium is equipped with a configuration memory integrity check (CMIC), an embedded engine performing automatic verification and repair of single-bit errors inside the CRAM. When the bitstream is generated, a CMIC reference is automatically added by the tool. During the download process, the bitstream is loaded into the CRAM, while the CMIC reference is loaded into its own on-chip memory, protected by ECC. The engine periodically scans the CRAM, and in the case of a single-bit mismatch in a word, it corrects the error. According to the datasheet, this process takes around 4 ms. In the case of a double-bit error, however, the engine stops working and further single-bit errors are no longer corrected. Without scrubbing, an error in the CRAM would change the interconnect configuration of the FPGA, leading to a permanent failure of the design. Thanks to the CMIC, instead, an SEU in the CRAM leads to a temporary failure only, until the repair is finished. While the CRAM is corrupted, the entire design is in a faulty state, and this event will be referred to as a SEFI from now on.

Fig. 9. SEU and SEFI cross section of the benchmark application tested on the NG-Medium, PolarFire, SmartFusion2, and ProASIC3, with version #2 implemented with TMR.

Another type of failure observed is related to the FPGA itself. During the irradiation, the NG-Medium suffered from radiation-induced resets that caused a reload of the CRAM. In our test setup, the configuration was stored inside an external memory. Hence, when a reset occurred, the FPGA could reload it and resume operation. However, this may not always be possible in operation, where the CRAM might have to be reloaded manually. For this reason, this kind of event is referred to as a permanent failure.

2) PolarFire, SmartFusion2, ProASIC3: They are Flash-based FPGAs, and therefore their CRAM is more resilient to SEUs, but since they are not radiation-hardened, their cells are more sensitive. In this case, the failures of the application are caused by SETs or SEUs inside the FPGA logic, and the design functionalities are affected only temporarily, so these events are referred to as temporary failures.

*3) Results Analysis:* Fig. 9 shows the response of the four FPGAs under high-energy protons. Table III reports the number of B13 replicas considered for each FPGA, the fluences, and the errors detected. Results were gathered across multiple campaigns, and therefore the table contains cumulative values.

TABLE III

CIRCUITS EMPLOYED, FLUENCES, AND ERRORS DETECTED FOR EACH DESIGN ON THE DIFFERENT FPGAs

| FPGA         | Design Name | TMR | B13 instances | Fluence (p/cm<sup>2</sup>) | Errors detected |
|--------------|-------------|-----|---------------|----------------------------|-----------------|
| PolarFire    | MPF-1       | N   | 2048          | $3.00 \cdot 10^{10}$       | 42              |
|              | MPF-2       | Y   | 2048          | $2.3 \cdot 10^{11}$        | 64              |
| NG-Medium    | NG-1        | N   | 184           | $7.90 \cdot 10^{11}$       | 35              |
|              | NG-2        | Y   | 32            | $2.65 \cdot 10^{12}$       | 11              |
| SmartFusion2 | SM2-1       | N   | 160           | $1.10 \cdot 10^{11}$       | 12              |
|              | SM2-2       | Y   | 50            | $8.50 \cdot 10^{11}$       | 1               |
| ProASIC3     | PRO-1       | N   | 180           | $2.00 \cdot 10^{11}$       | 9               |
|              | PRO-2       | Y   | 60            | $8.50 \cdot 10^{11}$       | 2               |

Starting with the NG-Medium, no mitigation technique was originally planned for operation, given the rad-hard nature of the FPGA. However, an unexpectedly high number of CMIC corrections was observed. For this reason, the TMR version of the benchmark was tested as well. Since the placement tool provided by the manufacturer ("NanoXmap 2.7") did not offer any automatic triplication routine, the design was triplicated manually at block level. The results show that block TMR reduced the problem only slightly. Further analysis demonstrated that this is due to the place-and-route tool, which creates many long common paths in the design, namely those between the outputs of the circuits and their corresponding voters. Such paths, referred to in the literature as single points of failure (SPFs), reduce the effectiveness of the triplication technique. At the time of writing, it was not possible to place the different elements of the design manually in order to apply known placement techniques [23] that mitigate this problem. Therefore, a considerable margin of improvement can be expected with the planned update of the FPGA development tools by NanoXplore. Concerning the permanent failure, the cross section for this event was measured as $1.10 \cdot 10^{-12}$ cm<sup>2</sup>/device. In collaboration with the manufacturer, it emerged that one possible source of this reset is an error affecting a peripheral available in the FPGA as a hard-coded block, namely the SpaceWire interface. This peripheral, if affected by an error, triggers a reset and causes the reloading of the CRAM even though it is not used by the user logic. Another possible cause is the failure of the CMIC engine: when a double error is detected, the FPGA is reset and the CRAM is reloaded, so that the CMIC operation can restart. Tests were therefore repeated after these reset conditions were masked. Nonetheless, the errors still occurred, indicating that there might be other sources, such as a micro-latch-up that causes a power reset of the FPGA. Further investigation of this problem is necessary.
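To illustrate why common paths weaken block-level TMR, the following minimal sketch models a bitwise 2-out-of-3 majority voter. It is an abstraction, not the tested design: the output word is hypothetical, and a single point of failure is represented, as an assumption, by one fault that corrupts two voter inputs at once.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise 2-out-of-3 majority of three replica outputs."""
    return (a & b) | (a & c) | (b & c)


golden = 0b1011_0010                  # hypothetical fault-free output word
upset = golden ^ 0b0000_1000          # same word with one flipped bit

# A fault confined to one replica is outvoted by the other two.
masked = majority_vote(golden, upset, golden)

# Assumption: a fault on a shared path reaches two voter inputs,
# so the corrupted value wins the vote.
shared_fault = majority_vote(upset, upset, golden)

print(masked == golden, shared_fault == golden)
```

The voter masks any fault that stays within one domain, which is why placement that keeps the three domains and their paths to the voters physically separate matters for the effectiveness of the triplication.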

Concerning the PolarFire, some temporary failures due to SEUs inside the B13 circuits were observed, as well as some SEFIs in which many B13s failed, as in the FE tests. As visible in Fig. 9, even though the use of triplication reduced the SEU sensitivity by a factor of 5, the total failure rate is dominated by the SEFIs in both cases (mitigated and nonmitigated). Further investigation is necessary to mitigate this effect for this FPGA.

TABLE IV

B13 SEU CROSS SECTION FOR THE POLARFIRE MPF300 UNDER ThNs

| B13 instances   | Fluence (n/cm<sup>2</sup>) | # Events | SEU $\sigma$ (cm<sup>2</sup>/B13) |
|-----------------|----------------------------|----------|-----------------------------------|
| Thermal neutron |                            |          |                                   |
| 2048            | $5.49 \cdot 10^{11}$       | 82       | $7.29 \cdot 10^{-14}$             |
| 1024            | $1.21 \cdot 10^{12}$       | 7        | $5.65 \cdot 10^{-15}$             |
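The cross sections in Table IV are consistent with the usual normalization of the event count by the fluence and by the number of circuit instances, σ = N_events / (Φ · N_B13). The sketch below recomputes the two rows under that assumption; the function name is illustrative.

```python
def cross_section_per_instance(events: int, fluence_cm2: float, instances: int) -> float:
    """Per-instance cross section in cm^2: events / (fluence * instances)."""
    return events / (fluence_cm2 * instances)


# (B13 instances, ThN fluence in n/cm^2, events) taken from Table IV.
rows = [
    (2048, 5.49e11, 82),
    (1024, 1.21e12, 7),
]

for instances, fluence, events in rows:
    sigma = cross_section_per_instance(events, fluence, instances)
    print(f"{instances} B13s: sigma = {sigma:.2e} cm^2/B13")
```

Both results reproduce the tabulated values to three significant figures.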

On the other hand, no SEFI was observed on the SmartFusion2 and ProASIC3. The TMR version of the design also shows a lower sensitivity compared to the PolarFire, which is reasonable since these devices are based on a larger technology node and are therefore less sensitive.

4) Comparison: Comparing the total cross sections of all the FPGAs, it is clear that the NG-Medium exhibits the best behavior. However, mitigating the SEFIs on the PolarFire would lower its cross section to the same level as that of the NG-Medium, which is remarkable when comparing a commercial FPGA with an RHBD FPGA. Moreover, the SEU cross sections of the Flash-based FPGAs are comparable for the non-TMR version, but the SmartFusion2 and ProASIC3 perform better than the PolarFire with the TMR version.

It is important to note that this conclusion would have been very hard to draw using the FE tests alone, for several reasons. The PolarFire failure rate is a combination of SEUs and SEFIs, and it differs from the SEFI cross section retrieved at the FE level. Moreover, for the NG-Medium, it is impossible to estimate the impact of the CMIC operations at the circuit level from the CRAM sensitivity, because the number of critical bits given by the tool is inaccurate: the measured results could not be reproduced from its estimate. This demonstrates why the FE test alone is not sufficient for FPGA qualification, and it also shows how benchmarks allow for the comparison of different technologies that are subject to different kinds of effects and sources of failure.

For the PolarFire, SmartFusion2, and ProASIC3, the benchmark application was also tested under ThNs. As expected, no event was observed for the SmartFusion2 and ProASIC3, in contrast with the PolarFire, which is based on a newer technology. Table IV reports its cross section, which is $10 \times$ lower than the one observed under protons. Surprisingly, no SEFI was observed during the ThN tests, even though the design was the same for both tests. Moreover, two different PolarFire FPGAs were used for the FE tests under ThNs, and in both cases, SEFIs were observed.

#### D. FPGA Lifetime

The robustness of the FPGAs against TID effects was also investigated during the tests, in terms of lifetime and programmability. For the lifetime, three events are considered as the end of life of the device: an exponential propagation delay increase, a sudden failure of the device, or the corruption of the flash memory. However, only the first event was observed during the tests.

No degradation or failures were observed with the NanoXplore NG-Medium up to 3 kGy [3]. More surprisingly, similar performance was reached with the PolarFire. The same evaluation board was used for three different campaigns, separated by six and five months, respectively, in which the doses reached 500 Gy, 2 kGy, and 3 kGy, respectively. Nevertheless, the FPGA did not show any significant sign of degradation, in terms of either current consumption or propagation delay. Also, no loss of reprogrammability was observed after the second campaign (2.5 kGy cumulative); the board could no longer be programmed only at the end of the third campaign (5.5 kGy cumulative). These are remarkable results for a commercial FPGA, especially when compared to the low lifetime and programmability thresholds of its predecessors, the SmartFusion2 and the ProASIC3. The SmartFusion2 lost its programmability after only 70 Gy and survived up to 650 Gy. The ProASIC3 instead lost its programmability after only 20 Gy, whereas its lifetime was 540 Gy. In these two cases, the end of life corresponds to an exponential increase in the propagation delay.

The NG-Medium instead had a shorter effective lifetime because of two issues. First, the CMIC engine, as mentioned before, stops operating when it detects a double error. Single-bit errors then start accumulating until the design fails, with no possibility of recovery other than reloading the CRAM. Second, radiation-induced FPGA resets were observed, causing the loss of the CRAM content and thus the permanent failure of the design. As described before, during the test, the FPGA could reload the design from an external memory and resume operation. However, the radiation levels of the LHC areas where this FPGA will be used are high enough to cause the failure of flash memories. In this situation, therefore, this kind of event would cause the failure of the system and require a manual reload of the user design. A possible mitigation is the remote programming of the FPGA through JTAG chains, as is already done for some FPGAs in the CERN detectors, for instance, ATLAS. However, deploying the same solution across the whole accelerator part of the complex would significantly increase its cost. More investigations are necessary to understand the origins of these resets and possible mitigation techniques.

#### V. FAILURE RATE ESTIMATION

By using the failure rate of the benchmark circuit, the failure rates of FPGAs belonging to different families can be compared. As previously done in [2], but considering two SEE contributions (ThNs and high-energy hadrons), the homogeneous Poisson process (HPP) is used to estimate the probability of having x failures over a period of time t. The process assumes a constant failure rate and failures that are independent of each other. It is expressed as

$$f(x; \lambda_T; t) = \frac{(\lambda_T t)^x e^{-\lambda_T t}}{x!}$$

where  $\lambda_T$  is the combination of the ThN and high-energy hadron contributions, and it is defined as

$$\lambda_T = \lambda_{\text{HEH}} + \lambda_{\text{ThN}} = \sigma_{\text{HEH}} \Phi_{\text{HEH}} + \sigma_{\text{ThN}} \Phi_{\text{ThN}}.$$



Fig. 10. System failure probability for the PolarFire FPGA considering 12 years of operation in the UJ and DS, calculated using the annual fluence level of the HL-LHC, for both HEH and ThN.



Fig. 11. Average failures for all the FPGAs in the four areas of the LHC, considering 12 years of operation.

$\Phi_{\text{HEH}}$ and $\Phi_{\text{ThN}}$ are the HEH and ThN annual fluences, respectively, whereas $\sigma_{\text{HEH}}$ and $\sigma_{\text{ThN}}$ are the HEH and ThN cross sections, respectively. For the hadron contribution, the proton cross section was used, as explained in Section II.
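The HPP expression and the definition of $\lambda_T$ given above translate directly into a few lines of code. The sketch below is illustrative only: the function names are not from the original work, and the cross sections and annual fluences used in the example are placeholder values chosen for readability, not measured or tabulated data.

```python
import math


def failure_rate(sigma_heh: float, phi_heh: float,
                 sigma_thn: float, phi_thn: float) -> float:
    """lambda_T = sigma_HEH * Phi_HEH + sigma_ThN * Phi_ThN (failures per year)."""
    return sigma_heh * phi_heh + sigma_thn * phi_thn


def hpp_pmf(x: int, lam: float, t: float) -> float:
    """Probability of exactly x failures in time t for a constant rate lam."""
    mu = lam * t
    return mu ** x * math.exp(-mu) / math.factorial(x)


def prob_at_least(k: int, lam: float, t: float) -> float:
    """Probability of observing k or more failures in time t."""
    return 1.0 - sum(hpp_pmf(x, lam, t) for x in range(k))


# Placeholder inputs (hypothetical): cm^2/device and particles/cm^2/year.
lam_t = failure_rate(sigma_heh=1e-9, phi_heh=1e8,
                     sigma_thn=1e-10, phi_thn=5e8)

years = 12
print(f"lambda_T = {lam_t:.3f} failures/year")
print(f"P(x >= 1) over {years} years = {prob_at_least(1, lam_t, years):.3f}")
print(f"P(x >= 3) over {years} years = {prob_at_least(3, lam_t, years):.3f}")
```

Because the two contributions simply add in $\lambda_T$, the same functions can be called with one cross section set to zero to obtain separate HEH and ThN curves.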

As an example, Fig. 10 shows the HPP curves calculated for the PolarFire operating for 12 years in the DS and in the UJ, two areas with different risk factors. The ThN and proton contributions are shown separately in this case. As can be seen, in the DS, where the HEH fluence is higher, the PolarFire has a 70% chance of failing three or more times, whereas failures caused by ThNs are less likely. In the UJ instead, where the risk factor is higher, there is a 20% probability of observing one or more ThN-induced failures. Fig. 11 shows the average number of failures, calculated from the HPP equation, for the non-TMR benchmark on all the FPGAs in four main LHC areas. The NG-Medium performs very well compared with the others in terms of SEEs. However, the loss of configuration represents a problem, especially in the DS. For the PolarFire instead, the impact of ThNs in the different areas can be appreciated. In an area with a low risk factor, such as the DS, 42% of the failures are induced by ThNs. As the risk factor increases, the ThN contribution grows until it becomes the predominant failure source: in the UJ, 81% of the failures are induced by ThNs. It must be noted that Figs. 10 and 11 show the failures for a single FPGA, and thus the numbers seem low, but there will be hundreds of devices installed inside the LHC tunnel.
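Two consequences of the numbers above can be made explicit with a short sketch. First, a ThN share of the failures fixes the ratio between the ThN and HEH failure rates. Second, because independent Poisson processes add, single-device figures scale linearly with the number of installed devices. The function names, the per-device mean, and the fleet size of 300 are hypothetical values used only for illustration.

```python
import math


def thn_to_heh_ratio(thn_share: float) -> float:
    """Ratio lambda_ThN / lambda_HEH implied by the ThN share of all failures."""
    return thn_share / (1.0 - thn_share)


def fleet_mean_failures(mean_per_device: float, n_devices: int) -> float:
    """Expected failures for n independent, identical devices."""
    return mean_per_device * n_devices


def fleet_prob_at_least_one(mean_per_device: float, n_devices: int) -> float:
    """Probability that at least one failure occurs anywhere in the fleet."""
    return 1.0 - math.exp(-mean_per_device * n_devices)


# ThN shares quoted in the text for the DS (42%) and the UJ (81%).
print(f"DS: lambda_ThN / lambda_HEH = {thn_to_heh_ratio(0.42):.2f}")
print(f"UJ: lambda_ThN / lambda_HEH = {thn_to_heh_ratio(0.81):.2f}")

# Hypothetical example: 0.05 expected failures per device, 300 devices.
print(f"fleet mean = {fleet_mean_failures(0.05, 300):.1f}")
print(f"P(at least one) = {fleet_prob_at_least_one(0.05, 300):.6f}")
```

Even a per-device mean that looks negligible becomes a near-certain event at the scale of several hundred devices, which is the point made in the text about the single-FPGA figures.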

Thus, it is clear that the PolarFire represents a better solution in areas with a low $R_{\rm th}$ factor, since its lifetime and programmability thresholds are much higher than those of the SmartFusion2 or ProASIC3. However, the situation is reversed in areas with a high $R_{\rm th}$ factor. Here, the TID levels are very low, and therefore the higher sensitivity to ThNs makes the PolarFire a worse candidate than the SmartFusion2 and ProASIC3, which are less sensitive to ThNs and can still survive for 12 years. Nonetheless, the racks hosting the FPGAs could be shielded to reduce ThN fluxes, mitigating this problem. Therefore, the correct FPGA must be chosen mainly depending on the LHC area, but also according to the failure modes tolerated by the application and to whether shielding is possible.

#### VI. CONCLUSION

These experiments provided a suitable characterization of the PolarFire and the NG-Medium. For the PolarFire, a complete characterization has been performed in terms of SEE and TID. The work presented an approach for FPGA testing based on benchmark circuits. The results obtained proved the efficiency and the advantages of this technique compared to standard SEE testing. A benchmark failure rate, different from the one that can be obtained by analyzing the FEs by themselves, was derived. Moreover, by testing the same application on four different FPGAs, it was possible to demonstrate how the methodology allows for the comparison of FPGAs belonging to different families and affected by different errors. The PolarFire showed a relatively high sensitivity to ThNs, most probably due to the presence of boron-10 in the device, and therefore tests under these particles are necessary for devices based on recent technologies. Using the experimental data in conjunction with classical reliability analysis, the failure rates of the various FPGAs in the different areas of the LHC were estimated. The results of this estimation showed that the NG-Medium and the PolarFire are two interesting candidates for CERN applications. Results also confirmed the importance of assessing ThN sensitivity, since ThNs represent the major cause of SEEs in the shielded areas. Future work will focus on a deeper study of the SEFIs that occurred in the PolarFire and will also extend the tests to other benchmark applications.

#### ACKNOWLEDGMENT

The authors acknowledge the ILL for the allocated beamtime (INDU-225).

#### References

- [1] G. Tsiligiannis, R. Ferraro, S. Danzeca, A. Masi, M. Brugger, and F. Saigne, "Investigation on the sensitivity of a 65 nm flash-based FPGA for CERN applications," in *Proc. 16th Eur. Conf. Radiat. Effects Compon. Syst. (RADECS)*, Sep. 2016, pp. 510–513.
- [2] G. Tsiligiannis *et al.*, "Radiation effects on deep submicrometer SRAM-based FPGAs under the CERN mixed-field radiation environment," *IEEE Trans. Nucl. Sci.*, vol. 65, no. 8, pp. 1511–1518, Aug. 2018.
- [3] G. Tsiligiannis, C. Debarge, J. Le Mauff, A. Masi, and S. Danzeca, "Reliability analysis of a 65 nm rad-hard SRAM-based FPGA for CERN applications," in *Proc. 19th Eur. Conf. Radiat. Effects Compon. Syst. (RADECS)*, Sep. 2019, pp. 130–137.
- [4] N. Rezzak, J.-J. Wang, S. Varela, G. Bakker, and A. N. Gu, "Neutron and proton characterization of Microsemi 28 nm PolarFire SONOS-based FPGA," in *Proc. IEEE Nucl. Space Radiat. Effects Conf. (NSREC)*, Jul. 2018, pp. 210–215.
- [5] M. Berg, "Field programmable gate array (FPGA) single event effect (SEE) radiation testing," NASA/Goddard Space Flight Center, Washington, DC, USA, Tech. Rep., 2012. [Online]. Available: https://nepp.nasa.gov/files/23779/FPGA\_Radiation\_Test\_Guidelines\_2012.pdf
- [6] J. J. Wang, N. Rezzak, F. Hawley, G. Bakker, J. McCollum, and E. Hamdy, "Radiation characteristics of field programmable gate array using complementary-sonos configuration cell," Microchip, San Jose, CA, USA, Tech. Rep., 2019. [Online]. Available: https://www.microsemi.com/document-portal/doc\_view/1244474rt-polarfire-radiation-test-report
- [7] M. Berg, "SEU system analysis: Not just the sum of all parts," presented at the Single Event Effects (SEE) Symp. Mil. Aerosp. Program. Logic Devices (MAPLD) Workshop, La Jolla, CA, USA, 2014. [Online]. Available: https://ntrs.nasa.gov/citations/20140008977
- [8] M. Berg and M. Campola, "SEE test and data analysis for complex FPGA systems," presented at the Microelectron. Rel. Qualification Work. Meeting (MRQW), El Segundo, CA, USA, 2020. [Online]. Available: https://ntrs.nasa.gov/citations/2020000820
- [9] F. Corno, M. S. Reorda, and G. Squillero, "RT-level ITC'99 benchmarks and first ATPG results," *IEEE Design Test Comput.*, vol. 17, no. 3, pp. 44–53, Jul./Sep. 2000.
- [10] H. Quinn *et al.*, "Using benchmarks for radiation testing of microprocessors and FPGAs," *IEEE Trans. Nucl. Sci.*, vol. 62, no. 6, pp. 2547–2554, Dec. 2015.
- [11] A. M. Keller, T. A. Whiting, K. B. Sawyer, and M. J. Wirthlin, "Dynamic SEU sensitivity of designs on two 28-nm SRAM-based FPGA architectures," *IEEE Trans. Nucl. Sci.*, vol. 65, no. 1, pp. 280–287, Jan. 2018.
- [12] R. G. Alía *et al.*, "LHC and HL-LHC: Present and future radiation environment in the high-luminosity collision points and RHA implications," *IEEE Trans. Nucl. Sci.*, vol. 65, no. 1, pp. 448–456, Jan. 2018.
- [13] S. Wen, R. Wong, M. Romain, and N. Tam, "Thermal neutron soft error rate for SRAMs in the 90 nm–45 nm technology range," in *Proc. IEEE Int. Rel. Phys. Symp.*, May 2010, pp. 1036–1039.
- [14] M. Cecchetto *et al.*, "Thermal neutron-induced SEUs in the LHC accelerator environment," *IEEE Trans. Nucl. Sci.*, vol. 67, no. 7, pp. 1412–1420, Jul. 2020.
- [15] D. Kramer *et al.*, "LHC RadMon SRAM detectors used at different voltages to determine the thermal neutron to high energy hadron fluence ratio," *IEEE Trans. Nucl. Sci.*, vol. 58, no. 3, pp. 1117–1122, Jun. 2011.
- [16] K. Roeed, M. Brugger, and G. Spiezia, "An overview of the radiation environment at the LHC in light of R2E irradiation test activities," Dept. ATS, CERN, Geneva, Switzerland, Tech. Rep. CERN-ATS-Note-2011-077, 2011. [Online]. Available: https://cds.cern.ch/record/1382083
- [17] J. Beaucour *et al.*, "Grenoble large scale facilities for advanced characterisation of microelectronics devices," in *Proc. 15th Eur. Conf. Radiat. Effects Compon. Syst. (RADECS)*, Sep. 2015, pp. 312–315.
- [18] T. Oldham and F. McLean, "Total ionizing dose effects in MOS oxides and devices," *IEEE Trans. Nucl. Sci.*, vol. 50, no. 3, pp. 483–499, Jun. 2003.
- [19] H. Hatano and M. Shibuya, "Total dose radiation effects on CMOS ring oscillators operating during irradiation," *IEEE Electron Device Lett.*, vol. 4, no. 12, pp. 435–437, Dec. 1983.
- [20] F. L. Kastensmidt, E. C. P. Fonseca, R. G. Vaz, O. L. Goncalez, R. Chipana, and G. I. Wirth, "TID in flash-based FPGA: Power supply-current rise and logic function mapping effects in propagation-delay degradation," *IEEE Trans. Nucl. Sci.*, vol. 58, no. 4, pp. 1927–1934, Aug. 2011.
- [21] Single Event Effects Test Method and Guidelines, Basic Specification, ESA, Paris, France, 1995. [Online]. Available: https://escies.org/download/webDocumentFile?id=62690
- [22] N. Battezzati *et al.*, "On the evaluation of radiation-induced transient faults in flash-based FPGAs," in *Proc. 14th IEEE Int. On-Line Test. Symp.*, Jul. 2008, pp. 135–140.
- [23] M. J. Cannon, A. M. Keller, C. A. Thurlow, A. Perez-Celis, and M. J. Wirthlin, "Improving the reliability of TMR with nontriplicated I/O on SRAM FPGAs," *IEEE Trans. Nucl. Sci.*, vol. 67, no. 1, pp. 312–320, Jan. 2020.