1. Introduction
With the advent of the artificial intelligence Internet of Things (AIoT) era, reconfigurable systems are becoming the mainstream architectures in related fields to meet the demands of fragmented new applications at the edge [
1,
2], mainly because they combine the flexibility of software with the efficiency of hardware computation. The application fields of reconfigurable systems are extensive and of significant strategic importance. In the aerospace field, reconfigurable systems are widely applied in satellite data processing [
3], spacecraft control systems [
4], and deep space exploration missions [
5]. In particular, in the intense radiation of the space environment, FPGA-based reconfigurable systems can self-repair faults through dynamic reconfiguration, ensuring mission continuity. National defense and military applications are another important field, including radar signal processing [
6], encrypted communication systems [
7], and unmanned aerial vehicle control systems [
8]. These applications have extremely high requirements for system reliability and security; any failure may lead to serious consequences. In the field of industrial control, reconfigurable systems are widely applied in nuclear power plant safety systems [
9], high-speed rail control systems [
10], and intelligent manufacturing equipment [
11]. Harsh conditions such as electromagnetic interference and temperature variations in industrial environments pose severe challenges to system reliability. Important applications in the field of medical electronics include medical image processing, implantable medical devices, and telemedicine systems; these applications require extremely high reliability during long-term operation while meeting strict power and size constraints. Although the specific requirements of the various application fields differ, they all impose strict demands on the reliability of reconfigurable systems. Issues such as single-event effects in space environments, electromagnetic interference in industrial settings, and long-term stability in medical applications all call for specialized reliability guarantee technologies. This cross-domain demand is driving the development of reconfigurable system reliability technologies and prompting researchers to explore solutions at multiple levels. A reconfigurable system is a computer system whose circuit structure or composition can be reconfigured according to application requirements [
12]. In recent years, with the rapid development of Field-Programmable Gate Arrays (FPGAs), which have gradually become the mainstream devices in reconfigurable computing, FPGA-based reconfigurable systems have become the dominant reconfigurable architecture. Such a system uses reusable FPGA hardware resources to flexibly change its own architecture according to different application requirements, providing a suitable circuit/organization structure for each specific application [
13].
In recent years, the aggressive scaling of CMOS technology and the increasing complexity of deployment environments have significantly increased the vulnerability of modern systems. In addition, to reduce development costs, many designs are fabricated in untrusted third-party foundries. As a result, these systems are exposed to various reliability threats, including environmental effects, aging-induced degradation, and malicious hardware attacks. Therefore, how to ensure the reliable operation of reconfigurable systems against all kinds of failure/attack threats is a hot research topic in the field. In this paper, we first focus on the “reliability problem” in reconfigurable systems and analyze its causes. Three key reliability issues facing reconfigurable systems are identified: (1) complex and harsh operating environments that make the system susceptible to various types of software/hardware failures, (2) aging and degradation of the system during long-term operation that can lead to permanent failures and increase the probability of transient failures, and (3) adversaries implanting Trojan horses and launching hardware attacks that can lead to device malfunctions/failures.
Based on the above analysis, we first outline the main threat models and protection methods facing reconfigurable systems. This paper then focuses on the three key reliability issues of reconfigurable systems: (1) it outlines the classification of fault-tolerant technologies for reconfigurable systems and the main research progress of fault-tolerant methods, compares and analyzes these technologies, and gives the corresponding evaluation metrics; (2) it classifies and summarizes the related technologies for aging protection of reconfigurable systems, describes the current state of research, analyzes and compares the characteristics of these technologies, and gives the corresponding evaluation metrics; (3) it introduces the basic concepts of hardware attack protection for reconfigurable systems, provides an overview of Hardware Trojans and evolutionary hardware-based protection techniques, and summarizes the evaluation metrics for hardware protection. Finally, the development trends of reliability-related technologies are discussed, their shortcomings and limitations are analyzed, and key challenges as well as future research hotspots in this field are identified.
The paper is structured as follows:
Section 2 and
Section 3 present the threat model and primary reliability technologies, respectively.
Section 4,
Section 5, and
Section 6 describe the current research status and technical characteristics of fault tolerance technologies, aging mitigation technologies, and hardware attack defense methods, respectively. The development trends and key technological perspectives of reliability in FPGA-based reconfigurable systems are discussed in
Section 7, and the paper is concluded in
Section 8.
2. Threat Model of Reliability
FPGA-based reconfigurable systems are increasingly used in military, aerospace, industrial control, and other complex critical areas. These highly complex intelligent applications are usually characterized by unattended operation, long operation cycles, complex deployment environments, etc., which impose very stringent requirements on system reliability. In this regard, this section will focus on analyzing the reliability threat model and explaining the importance of ensuring high system reliability.
2.1. Threat Model of Fault
The operating environment of complex applications has a direct impact on the reliability of FPGA-based reconfigurable systems. Configuration and logic memories in FPGAs are highly susceptible to long-term or transient failures caused by radiation from energetic particles in the environment. Ionizing radiation generates electron-hole pairs in the oxide of Complementary Metal Oxide Semiconductor (CMOS) devices; these charge carriers may accumulate in the gate oxide, shifting transistor threshold voltages and timing behavior. This cumulative charge trapping is known as the total ionizing dose (TID) effect. In addition, energetic particles can displace atoms in the silicon lattice (displacement damage), leading to permanent degradation of device electrical parameters.
More critically, high-energy particles can cause immediate, unpredictable faults through Single-Event Effects (SEEs) [
5], which occur when a single particle strikes a sensitive node in the circuit. As illustrated in
Figure 1, different types of SEE manifest depending on the struck component:
(1) Single-Event Upset (SEU): A temporary change in the state of a storage element. In FPGAs, SEUs in D flip-flops (DFFs) can alter logic outputs, while SEUs in configuration memory may change routing or logic functions, potentially leading to system malfunction. (2) Single-Event Transient (SET): A short-lived voltage pulse generated in combinational logic by charge collection at a node. If the pulse propagates to a flip-flop during its sampling window, it can result in an SEU. (3) Single-Event Functional Interrupt (SEFI): A fault that causes a complete loss of function in a module, often due to a configuration bit flip that disables critical control signals. (4) Single-Event Latchup (SEL): Triggering of a parasitic thyristor structure that creates a low-impedance path between power and ground and a large current surge; this can cause permanent damage or require power cycling. These effects are particularly problematic in harsh environments such as space, high-altitude aviation, nuclear facilities, and high-energy physics experiments, where cosmic rays and other ionizing radiation are prevalent. Figure 1 schematically illustrates the vulnerability of FPGA logic blocks to these SEE mechanisms, highlighting key components and their potential failure modes under particle strikes.
Figure 1.
Schematic illustration of Single-Event Effects (SEEs) in FPGA logic blocks.
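To make the effect of an SEU in configuration memory concrete, the following minimal Python sketch models a 4-input LUT as a 16-bit truth table and shows how flipping a single configuration bit silently changes the implemented logic function; the LUT size, the AND-gate example, and the flipped bit index are illustrative assumptions rather than a description of any specific device.

```python
# Minimal sketch: effect of a single-event upset (SEU) on a 4-input LUT.
# The LUT is modeled as a 16-entry truth table (one bit per input combination);
# the LUT size and the flipped bit index are illustrative assumptions.

def lut_eval(truth_table: int, inputs: int) -> int:
    """Evaluate a 4-input LUT: 'truth_table' holds 16 config bits, 'inputs' is 0..15."""
    return (truth_table >> inputs) & 1

def inject_seu(truth_table: int, bit_index: int) -> int:
    """Flip one configuration bit, emulating a particle-induced upset."""
    return truth_table ^ (1 << bit_index)

# Golden configuration: LUT implements a 4-input AND (only input pattern 0b1111 -> 1).
golden = 1 << 0b1111

# Upset a single configuration bit (index 7 chosen arbitrarily for illustration).
corrupted = inject_seu(golden, 7)

# Compare the implemented functions over all input combinations.
mismatches = [i for i in range(16) if lut_eval(golden, i) != lut_eval(corrupted, i)]
print(f"Corrupted input patterns: {mismatches}")   # -> [7]: the AND gate now also fires on 0b0111
```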
2.2. Threat Model of Aging
FPGA-based reconfigurable systems also face the threat of permanent failures due to aging. In recent decades, advances in very-large-scale integration technology have brought CMOS transistors into the nanoscale era, in which a small die area can accommodate billions of transistors to improve system performance. However, aggressive process scaling has increased circuit density and shrunk device geometries while also increasing the likelihood of reliability issues [
14,
15,
16]. These reliability issues range from an increase in the number of design flaws to increased susceptibility to transient and interconnect failures [
17,
18], with the most significant threat being accelerated CMOS transistor aging. The main effects associated with transistor aging include Bias Temperature Instability (BTI), Time-Dependent Dielectric Breakdown (TDDB), Hot Carrier Injection (HCI), and electromigration (EM) [
19]. These aging effects cause a cumulative shift in the threshold voltage Vth of N-type Metal Oxide Semiconductor (nMOS) and P-type Metal Oxide Semiconductor (pMOS) transistors, slowing circuits down over their normal service life. If this aging-induced delay degradation is not accounted for or compensated, the circuit will eventually fail to meet its timing constraints, and the entire system may even fail. The aging threat model is shown in
Figure 2. The specific content of
Figure 2 is as follows:
- (a)
Hot Carrier Injection (HCI): High-energy carriers injected into the gate oxide cause interface trap generation and threshold voltage shifts.
- (b)
Time-Dependent Dielectric Breakdown (TDDB): Gradual degradation of the gate oxide due to charge trapping leads to leakage or breakdown over time.
- (c)
Electromigration (EM): Metal atom migration under high current density causes voids and opens in interconnects, shown via SEM images and a schematic layout.
- (d)
Bias Temperature Instability (BTI): Threshold voltage shifts induced by negative bias stress on PMOS transistors result in performance degradation over time.
Figure 2.
Schematic diagram of aging threat model illustrating four primary mechanisms: (a) Hot Carrier Injection (HCI): high-energy carriers injected into gate oxide causing interface trap generation; (b) Time-Dependent Dielectric Breakdown (TDDB): gradual gate oxide degradation leading to leakage; (c) electromigration (EM): metal atom migration under high current density; (d) Bias Temperature Instability (BTI): threshold voltage shift induced by bias stress.
2.3. Threat Model of Hardware Attack
FPGA-based reconfigurable systems are increasingly used in critical scenarios, where they may be targeted and compromised by adversaries, causing reliability problems [
20]. In recent years, with the intensification of hardware attacks, there has been a gradual increase in the number of systems compromised at the hardware level, the most common of which is the Hardware Trojan (HT) attack. Hardware Trojans are malicious circuits that can remain dormant and avoid detection during testing, but are triggered by certain conditions during operation to cause damage or even catastrophic consequences. Current chip designs follow a globalization strategy which, while reducing costs and ensuring on-time delivery [
19], creates vulnerabilities to hardware attacks.
Figure 3 illustrates the potential stages for Hardware Trojan insertion in the FPGA design and manufacturing lifecycle under a globalized supply chain model. In this paradigm, various stages of Very-Large-Scale Integration (VLSI) design—including specification, logic synthesis, physical layout, and fabrication—are outsourced to different vendors worldwide. Third-party intellectual property (IP) cores are commonly integrated into the final system design, further increasing the attack surface.
Reconfigurable devices such as FPGAs are also subject to this distributed development process. An untrusted IC foundry may introduce Hardware Trojans during the manufacturing phase by modifying the physical layout or inserting malicious circuitry at the wafer level—threats that are extremely difficult to detect after fabrication. Additionally, vendors providing reconfigurable IP could embed Trojans directly into the configuration code, enabling covert activation during runtime.
While the packaging stage is also part of the off-shore supply chain and could theoretically be exploited, our threat model focuses primarily on fabrication-level attacks—specifically those occurring during wafer production and IP integration—as these represent the most prevalent and well-documented vectors for persistent, stealthy Hardware Trojans in reconfigurable systems [
21,
22]. For clarity,
Figure 4 marks the entire IC Foundry block as “Untrusted” to emphasize the high risk associated with silicon-level manipulation, while acknowledging that packaging remains a secondary concern in this context.
3. Primary Reliability Technology
FPGA-based reconfigurable systems are designed to meet the demands of smart IoT development: they can rapidly implement prototype systems, evolve with the iteration of applications and algorithms, and satisfy high-performance, low-power operating requirements, greatly advancing the development of intelligent edge devices [
23]. Therefore, to better serve these intelligent application scenarios, ensuring the reliability of reconfigurable systems is crucial. To this end, extensive research has been carried out in both academia and industry to cope with the reliability problems that may arise in different application scenarios.
Figure 5 shows the main technical means of improving the reliability of reconfigurable systems. In this paper, we divide the techniques that guarantee the reliability of such systems into two main categories: proactive prevention and maintenance techniques, and reactive response and recovery mechanisms. The former focuses on preventive measures to shield against or reduce the occurrence of reliability problems, while the latter restores normal system operation through repair mechanisms after failures or anomalies occur. The two complement each other and share the common goal of ensuring high system reliability. This section describes the detailed protection strategies from these two aspects.
The term “proactive” in the context of proactive prevention and maintenance techniques is defined as the ability of a system to be in a state of readiness or control in advance of possible disturbances (failures, faults, attacks) [
24], with the aim of keeping the system operating normally and achieving the desired mean time between failures. Typically, in a proactive approach, the state of the system is continuously monitored in conjunction with fault detection, localization, or prediction of possible failures or attacks (often using artificial intelligence-based methods) so that preventive action can be taken in advance. For reconfigurable systems, proactive prevention and maintenance techniques can be divided into three main categories: autonomous cyclic repair, proactive migration/deployment, and prediction and early warning techniques based on experience and expectations [
25]. Autonomous cyclic repair periodically resets or updates the system to cope with possible failures; commonly used methods include periodic scrubbing and reconfiguration [
26]. Proactive migration/deployment is mostly aimed at applications with long operating cycles, in which multiple redundancies or backups usually exist: migration/deployment strategies can be planned before tasks are mounted, or tasks can be migrated according to the actual operating status, for purposes including, but not limited to, load balancing, protecting parts of the system from attack, and extending the system's average uptime [
27]. Prediction and early warning means are mostly based on empirical or artificial intelligence prediction techniques, such as failure prediction, mean-time-to-failure prediction, and aging prediction, which provide warnings and guide the use of proactive measures to avoid possible failures or system risks [
28].
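As a rough illustration of how prediction and early warning can drive proactive migration, the following minimal Python sketch pairs at-risk reconfigurable regions (those whose predicted mean time to failure falls below a warning threshold) with lightly loaded healthy regions; the health metric, threshold, and region names are hypothetical and stand in for whatever prediction model and platform a real system would use.

```python
# Minimal sketch of a proactive prediction-and-early-warning loop.
# The health metric, thresholds, and region names are illustrative assumptions,
# not a specific FPGA vendor API.

from dataclasses import dataclass

@dataclass
class RegionHealth:
    name: str
    predicted_mttf_hours: float   # output of an aging/failure-prediction model
    utilization: float            # current load share, 0.0 .. 1.0

WARNING_MTTF = 1_000.0            # hours; below this, plan a proactive migration

def plan_proactive_migration(regions: list[RegionHealth]) -> list[tuple[str, str]]:
    """Return (from_region, to_region) migration plan for at-risk regions."""
    at_risk = [r for r in regions if r.predicted_mttf_hours < WARNING_MTTF]
    healthy = sorted((r for r in regions if r.predicted_mttf_hours >= WARNING_MTTF),
                     key=lambda r: r.utilization)          # prefer lightly loaded targets
    return [(src.name, dst.name) for src, dst in zip(at_risk, healthy)]

regions = [RegionHealth("PR_region_0", 800.0, 0.7),
           RegionHealth("PR_region_1", 5_000.0, 0.2),
           RegionHealth("PR_region_2", 4_200.0, 0.5)]
print(plan_proactive_migration(regions))   # [('PR_region_0', 'PR_region_1')]
```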
Reactive response and recovery mechanisms focus on taking measures to handle reliability issues after they occur, mitigating or compensating for abnormal system operation through maintenance processes within the system [
24]. Unlike proactive techniques, reactive methods incur no additional overhead in the absence of faults, which makes them conservative by nature yet effective under nominal conditions.
In reconfigurable systems, such mechanisms can be systematically classified according to the abstraction level at which recovery is enacted, yielding three orthogonal categories: configuration-level repair, logic-level redundancy, and runtime system-level migration.
(1) Configuration-level repair leverages the bitstream-based nature of FPGA configuration to restore correct functionality after transient faults. When a task fails due to radiation-induced bit flips or other soft errors, the affected configuration frames can be reloaded without disrupting the entire system. Common techniques include dynamic full reconfiguration, partial reconfiguration, and background scrubbing [
29]. This approach uniquely exploits the reprogrammability of FPGAs and operates directly at the configuration layer. (2) Logic-level redundancy and backup provide passive fault tolerance by replicating critical logic modules. Architectures such as dual-machine hot/cold standby, Triple Modular Redundancy (TMR), and quad-modular redundancy are widely deployed in high-availability and safety-critical applications [
30]. These methods operate at the hardware logic layer, offering fault shielding and uninterrupted task execution in the presence of hard or persistent faults. While effective, they typically incur area and power overheads proportional to the degree of replication. (3) Runtime system-level migration/deployment enables post-failure adaptation by relocating tasks to healthy regions of the FPGA fabric after permanent (hard) failures are detected. This technique operates at the system management layer and relies on Dynamic Partial Reconfiguration (DPR) capabilities. Methods such as task relocation, redirection deployment, and spare module activation allow continued execution despite localized hardware degradation or compromise [
31]. Unlike redundancy, migration conserves resources by reusing available healthy areas rather than pre-allocating backups.
Together, these three categories represent complementary strategies that act at distinct levels of the reconfigurable computing stack—configuration, logic, and runtime system—ensuring comprehensive post-failure recovery while maintaining orthogonality in both scope and implementation.
The following sections focus on the key reliability issues in reconfigurable systems, and the methods/strategies involved, such as fault-tolerant design, aging mitigation, and hardware attack protection, are described in detail, analyzed, and compared.
4. Fault Tolerance Technology
4.1. Concept and Classification
Faults arising during the operation of FPGA-based reconfigurable systems are an important factor limiting their reliability [
32], and fault-tolerant designs must be used to reduce the risks associated with failures. Fault tolerance, defined as the ability of a system to maintain its intended mission in the presence of faults [
33,
34], is one of the most important issues for reliable system operation. Even well-designed systems with optimal components and services cannot be considered reliable if they lack fault tolerance [
35].
From a functional point of view, fault tolerance techniques for reconfigurable systems include fault detection and localization techniques, transient (soft) fault mitigation techniques, and permanent (hard) post-fault recovery techniques. The former focus on diagnosing the system state and discovering the fault location, while the latter two focus on taking appropriate actions to maintain reliable system operation; some of these techniques span multiple functional domains. In this paper, we focus on fault detection, masking, and recovery through fault-tolerant design methods; physical techniques such as hardware hardening (materials research) are not included in the discussion for the time being. As shown in
Figure 5, current fault tolerance techniques for FPGA-based reconfigurable systems fall into four main categories: fault detection and location techniques, redundancy-based fault tolerance techniques, dynamically reconfigurable fault tolerance techniques, and intelligent fault tolerance.
Fault detection and location techniques are an important means of diagnosing and determining the operational status of a system. In FPGA-based reconfigurable systems, fault detection techniques based on information redundancy, such as Parity Check Code (PCC), Cyclic Redundancy Check (CRC), and Error Correcting Code (ECC) [
36], mainly detect and correct faults and errors in the bitstream and configuration memory. Redundancy-based fault tolerance techniques are the most widely used, and the most common ones are Triple Modular Redundancy (TMR), Dual-Modular Cold/Hot Backup (DCB/DHB), Time Redundancy, etc. Such fault tolerance methods can effectively mitigate or even shield the impact of faults on the system. However, they usually incur high resource/time overheads [
37,
38,
39,
40]. In addition, FPGA-based Dynamic Partial Reconfiguration (DPR) techniques can be used to implement fault-tolerant system design by dynamically transforming some of the logic resources of the FPGA during system operation [
26,
29,
39,
41,
42]. Some research has proposed combining DPR techniques with redundancy-based fault tolerance to achieve fine-grained redundancy and dynamic partial scrubbing to reduce the overhead of fault tolerance under resource-constrained conditions [
43,
44]. In recent years, the development of artificial intelligence techniques and reconfigurable hardware has had a profound impact on the field of intelligent fault tolerance research. The field brings together aspects of reconfigurable hardware, artificial intelligence, fault tolerance, and autonomous systems. The most notable of these is the design of fault-tolerant systems based on the evolutionary hardware approach, which makes hardware “soft” through its ability to dynamically adapt to problems [
45], and has the advantage of synthesizing novel structures to replace failed modules or functions, thus allowing autonomous system repair. In addition, bio-inspired immune-based fault tolerance methods are also a typical class of intelligent fault tolerance methods, mostly applied to circuit fault diagnosis.
4.2. Research Status
To date, researchers from academic institutions and industry worldwide have extensively investigated fault tolerance in FPGA-based reconfigurable systems, devoting significant effort and producing a wide range of research outcomes.
Figure 6 illustrates the PRISMA flowchart of the literature selection process for this review. In this section, we review the current state of the art in four main aspects of fault tolerance, namely fault detection and localization techniques, redundancy-based fault tolerance techniques, dynamic reconfiguration-based fault tolerance techniques, and intelligent fault tolerance, discussed in turn.
4.2.1. Fault Detection and Localization
Fault detection and localization techniques are designed to diagnose the operational status of a system and to discover potential or known points of failure; they are an important feature for measuring the reliable operation of a system. Overall, for FPGAs and their reconfigurable systems, there are three types of fault detection and localization techniques: (1) redundancy-based methods; (2) fault detection structure-based methods; and (3) configuration readback-based fault localization. Redundancy-based methods can be subdivided into information redundancy-based and spatial redundancy-based methods. Typical information redundancy-based techniques include Parity Check Codes, Cyclic Redundancy Checks, and Hamming codes; codes that can detect errors are called Error Detection Codes (EDCs), and codes that can additionally repair errors are called Error Correction Codes (ECCs). Mainstream Xilinx FPGAs have built-in information redundancy in BRAM and Configuration Random Access Memory (CRAM), which can detect single-bit and double-bit upsets and repair single-bit upsets. Fault detection and localization can also be based on spatial redundancy, the simplest form of which is N-modular redundancy, where redundant structures are connected in parallel (e.g., Triple Modular Redundancy). By comparing the outputs of the redundant modules and voting, the module in which the fault is located can be identified. The hardware overhead of such redundant fault detection is inherently high, since the protected circuit is replicated at least twice, but it has the advantage of detecting any fault indiscriminately and fails only if all redundant modules produce the same erroneous output.
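As a concrete instance of the information-redundancy approach, the following minimal Python sketch implements a Hamming(7,4) code and shows how the syndrome locates and corrects a single-bit upset; the code choice and bit ordering are illustrative, and production FPGAs typically use stronger SECDED-style ECC on configuration frames and BRAM.

```python
# Minimal sketch of Hamming(7,4) single-error correction, illustrating the kind of
# error-correcting code (ECC) used to protect configuration and block memories.
# Bit ordering and the injected error position are illustrative choices.

def hamming74_encode(d: list[int]) -> list[int]:
    """Encode 4 data bits into a 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
    d1, d2, d3, d4 = d
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p3 = d2 ^ d3 ^ d4
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_correct(c: list[int]) -> tuple[list[int], int]:
    """Return (corrected codeword, error position); position 0 means no error."""
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]          # checks positions 1,3,5,7
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]          # checks positions 2,3,6,7
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]          # checks positions 4,5,6,7
    pos = s1 + 2 * s2 + 4 * s3              # syndrome value = 1-based error position
    if pos:
        c = c.copy()
        c[pos - 1] ^= 1                      # flip the faulty bit back
    return c, pos

codeword = hamming74_encode([1, 0, 1, 1])
upset = codeword.copy()
upset[4] ^= 1                                # single-bit upset at position 5
fixed, pos = hamming74_correct(upset)
print(pos, fixed == codeword)                # -> 5 True
```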
Compared to information redundancy and spatial redundancy, which target particular device modules or fault types, detection based on dedicated fault detection structures covers more fault types and has wider detection coverage. One of the most widely used techniques is offline fault detection, in which the detection function is performed after the application has been interrupted, using either test equipment external to the FPGA or test structures configured into the FPGA. The latter, known as Built-In Self-Test (BIST), typically consists of a test pattern generator, a block under test, and an output response analyzer, and is switched in periodically by reconfiguration. This built-in structure can fully test the entire FPGA, but it only detects faults while the test patterns are applied and may therefore miss some time-dependent faults. In SRAM-based FPGAs, a notable research-oriented online fault detection method is the Self-Testing ARea (STAR) architecture. Initially proposed by researchers at Bell Labs in the late 1990s [
46], STAR divides the FPGA logic array into uniformly structured blocks, each of which can be independently configured to perform built-in self-test while the rest of the system continues to operate normally. This block-level granularity enables local fault detection and isolation without interrupting the application, a key advantage over full-chip offline testing. It should be emphasized that STAR is a fault-tolerant architecture proposed in academic research and has not been implemented as a standard feature of commercial FPGA devices. The method relies on dynamic partial reconfiguration to switch between functional mode and test mode, achieving high fault coverage at the cost of the area overhead of the test infrastructure.
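The block-level self-test idea can be sketched as follows in Python: one block at a time is exercised by a test pattern generator and checked by an output response analyzer against a golden function, while the remaining blocks keep running. The block functions, patterns, and injected fault are illustrative assumptions, not an actual STAR implementation.

```python
# Minimal sketch of block-level built-in self-test (BIST) in the spirit of roving
# self-test areas: one block at a time is switched into test mode while the others keep
# running. Block functions, test patterns, and the injected fault are illustrative.

def tpg(width: int):
    """Exhaustive test pattern generator for a 'width'-input block."""
    return range(2 ** width)

def ora(block_fn, golden_fn, patterns) -> bool:
    """Output response analyzer: compare the block under test against the golden function."""
    return all(block_fn(p) == golden_fn(p) for p in patterns)

golden_fn = lambda x: (x * 3 + 1) & 0xF           # reference behaviour of every block
blocks = {f"block_{i}": golden_fn for i in range(4)}
blocks["block_2"] = lambda x: (x * 3) & 0xF       # block_2 harbours a fault

# Roving test: reconfigure one block at a time into test mode and check it.
for name, fn in blocks.items():
    healthy = ora(fn, golden_fn, tpg(4))
    print(f"{name}: {'pass' if healthy else 'FAIL -> isolate and repair'}")
```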
Configuration readback is a debugging mechanism specific to FPGAs. In this method, an external controller reads the contents of the FPGA configuration memory and of the flip-flops in the Configurable Logic Blocks (CLBs) in the form of a configuration bitstream; the readback process can be regarded as the reverse of configuring the FPGA. Bitstream readback has two modes. The first is readback verification, in which the controller reads the contents of the configuration memory cells and compares them with the original bitstream; this mode is mainly used to verify whether a previously completed configuration was successful. The second is readback capture, which additionally retrieves the current state of all flip-flops within the CLBs and the state of the I/O blocks (IOBs). Using the data obtained from readback together with the data expected in the FPGA configuration memory and other resources, diagnostic algorithms can detect and locate faults in the FPGA [
47,
48].
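A minimal sketch of the readback verification flow is shown below: the readback copy of the configuration memory is compared frame by frame with the golden bitstream to locate corrupted frames. Frame size, frame count, and the injected upset are illustrative assumptions and do not correspond to any particular device family.

```python
# Minimal sketch of readback verification: the configuration memory is read back frame
# by frame and compared against the golden bitstream to locate corrupted frames.
# Frame size, frame count, and the injected corruption are illustrative assumptions.

import random

FRAME_WORDS = 4

def readback_verify(golden: list[list[int]], readback: list[list[int]]) -> list[int]:
    """Return indices of configuration frames whose readback differs from the golden copy."""
    return [i for i, (g, r) in enumerate(zip(golden, readback)) if g != r]

# Golden bitstream: a handful of configuration frames (random placeholder contents).
random.seed(0)
golden = [[random.getrandbits(32) for _ in range(FRAME_WORDS)] for _ in range(8)]

# Readback copy with a single-bit upset injected into frame 5.
readback = [frame.copy() for frame in golden]
readback[5][2] ^= 1 << 17

faulty = readback_verify(golden, readback)
print(f"Frames to scrub: {faulty}")   # -> [5]; only these frames need reconfiguration
```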
4.2.2. Redundancy-Based Fault Tolerance
The most widely used fault tolerance techniques are redundancy techniques. These include temporal redundancy, which mitigates failures that may occur during a single computation by running the computation or logic function repeatedly, and spatial redundancy, which uses redundant hardware resources to replicate selected circuits/modules in order to eliminate single points of failure. In FPGA-based reconfigurable systems, temporal redundancy is usually combined with configuration refreshing: the same module bitstream is configured and executed multiple times and the outputs are compared, reducing the impact of failures caused by single-event effects that may occur during configuration. The advantage is a low hardware resource overhead; the limitation is a high time overhead [
49].
Spatial redundancy is widely used in various types of electronic systems, where the most common form of construction is Triple Modular Redundancy, where the original circuit/module is expanded into three parallel executions and majority voters are added at appropriate circuit outputs to verify the outputs [
50,
51]. In principle, TMR allows circuits/modules to maintain normal operation in the presence of any single failure, providing uninterrupted fault tolerance against one fault. However, for complex electronic systems, a significant amount of resources would be wasted if Triple Modular Redundancy were applied exclusively. In addition, a common-mode fault can affect multiple TMR replicas simultaneously, undermining the reliability of this fault-tolerant architecture [
52,
53]. Yang et al. [
54] proposed a three-mode redundancy architecture based on an evolutionary mechanism, which can efficiently solve the common-mode fault problem by evolving system circuits into redundant modules with different structures using an interactive two-phase evolutionary strategy, but it will require more hardware resources. To balance the resource overhead, Glein et al. [
30] proposed an adaptive redundancy mechanism that uses reconfigurability to dynamically change the redundant execution of functional modules based on an assessment of how the external environment affects reliability; however, the approach does not take into account differences between applications, which poses a reliability threat for some critical applications. In applications such as nuclear energy and aerospace, where reliability requirements are extremely high, additional redundant fault-tolerant structures are used, such as quad-modular redundancy and hybrid hot/cold standby configurations (e.g., triple-hot with a cold spare, or dual-hot with dual-cold spares) [
55,
56]. By comparison, in resource-constrained scenarios with slightly lower reliability requirements, dual cold/hot standby is another commonly used architecture; although it does not mask failures and requires time overhead to restart or resynchronize applications, it can still ensure task recovery and continued execution [
57,
58]. Similarly, there are rebuilding techniques which use pre-defined redundant system resources to repair failures; this approach also reduces the resource overhead to some extent, but the limitation is that the system design will be very complex [
59,
60].
For example, in FPGA-based satellite controllers used in aerospace applications, a Triple Modular Redundancy (TMR) architecture is employed. The core control logic runs simultaneously in three independent FPGA logic regions, and a majority voter compares their outputs—even if one region suffers a single-event upset due to space radiation, the consistent output from the other two normal regions ensures continued controller operation, preventing loss of satellite attitude control. Dual-machine hot backup is widely used in industrial PLCs (Programmable Logic Controllers), where the primary controller operates in real time and the standby controller synchronizes data. If the primary controller fails due to aging or electromagnetic interference, the standby takes over within milliseconds, maintaining uninterrupted production-line operation.
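The voting and fault localization principle behind TMR can be sketched in a few lines of Python; the replicated function and the injected fault below are illustrative assumptions, and a real design would implement the replicas and voter in logic rather than software.

```python
# Minimal sketch of triple modular redundancy (TMR): three replicas compute the same
# function, a majority voter masks a single faulty replica, and disagreement identifies it.
# The replicated function and the injected fault are illustrative assumptions.

from collections import Counter

def majority_vote(outputs: list[int]) -> tuple[int, list[int]]:
    """Return (voted output, indices of replicas that disagree with the majority)."""
    voted, _ = Counter(outputs).most_common(1)[0]
    suspects = [i for i, o in enumerate(outputs) if o != voted]
    return voted, suspects

def replica(x: int) -> int:
    return (x * x + 1) & 0xFF                 # the protected computation

def faulty_replica(x: int) -> int:
    return ((x * x + 1) & 0xFF) ^ 0x04        # replica hit by an upset (stuck bit)

replicas = [replica, replica, faulty_replica]
for x in (3, 7, 12):
    outs = [f(x) for f in replicas]
    voted, suspects = majority_vote(outs)
    print(f"x={x}: outputs={outs}, voted={voted}, faulty replicas={suspects}")
```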
Overall, redundancy-based fault tolerance techniques are highly reliable and practical and have been among the most popular technical tools for decades; however, their limitation is equally obvious: extremely high time/hardware overhead. In this regard, for FPGA-based reconfigurable systems, many researchers have improved traditional redundancy architectures and proposed optimized designs, such as design diversity redundancy [
55], reduced precision redundancy [
61], comparative replication [
56], alternative logic systems, and dynamic redundancy. Most of these techniques exploit the fact that application-specific functions map to fixed resources within the FPGA, achieving more effective fault mitigation than plain triple modular redundancy.
4.2.3. Dynamic Reconfiguration Fault Tolerance
Dynamic reconfiguration fault tolerance is a technique specific to FPGAs and their reconfigurable systems that repairs transient or permanent faults by reconfiguring the affected configuration memory or relocating functions to new configuration memory. Transient faults such as single-event upsets can be repaired by refreshing the correct portion of the configuration bitstream, while the remaining functional modules on the chip continue to operate unaffected during the reconfiguration of the current module. There are currently two common configuration refreshing strategies. The first is blind scrubbing, which periodically reconfigures functional modules from golden copies of the specified partial configuration bitstreams without any fault detection. The second is adaptive scrubbing combined with fault localization, in which the fault location must first be identified and the golden copy of the corresponding module is then reloaded to fix the fault. Although blind scrubbing requires no fault detection and localization, its obvious disadvantage is that the scrubbing period and effect cannot be tuned to actual need, which tends to waste resources and reduce the efficiency of function execution. Adaptive scrubbing relies on a detection and localization technique implemented in the design and triggers the recovery process when a fault is detected. Detection and localization of faulty modules are usually accomplished by checks implemented in the design itself or by unit replication with checker units. When a fault signal is generated, the reconfiguration controller triggers the reconfiguration process by fetching the appropriate configuration bitstream from bitstream memory. The reconfiguration controller is usually implemented as a soft-core processor or IP core in the FPGA [
62,
63].
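The two scrubbing strategies can be contrasted with the following minimal Python simulation; the frame model, fault rate, and scrubbing period are illustrative assumptions, and a real controller would drive a vendor configuration interface (e.g., an internal configuration access port) rather than Python lists.

```python
# Minimal sketch contrasting blind (periodic) scrubbing with readback-triggered
# (adaptive) scrubbing of configuration frames. Timing, frame counts, and fault
# injection are illustrative assumptions.

import random

random.seed(1)
N_FRAMES = 16

def run(strategy: str, cycles: int = 50, fault_rate: float = 0.05) -> int:
    golden = list(range(N_FRAMES))                 # placeholder golden frame contents
    live = golden.copy()
    reconfig_ops = 0
    for t in range(cycles):
        if random.random() < fault_rate:           # a random upset corrupts one frame
            live[random.randrange(N_FRAMES)] ^= 0xFF
        if strategy == "blind" and t % 10 == 0:    # periodic full refresh, no detection
            live = golden.copy()
            reconfig_ops += N_FRAMES
        elif strategy == "adaptive":               # readback, then fix only bad frames
            bad = [i for i in range(N_FRAMES) if live[i] != golden[i]]
            for i in bad:
                live[i] = golden[i]
                reconfig_ops += 1
    return reconfig_ops

print("blind scrubbing reconfig ops:   ", run("blind"))
print("adaptive scrubbing reconfig ops:", run("adaptive"))
```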
For permanent failures caused by FPGA wear-out or harsh environments, online reconfiguration and design-time pre-configuration methods can be used for fault tolerance. Online reconfiguration remaps functional modules onto redundant resources through reconfiguration after a fault occurs, excluding the resources affected by the fault. This method relies on spare resources remaining on the chip; without them, such failures could not be tolerated. Its disadvantage is that it requires non-negligible bitstream generation time and increases power consumption and area overhead in the event of a failure [
64,
65,
66]. In contrast to the online reconfigurable fault tolerance approach, some studies focus on preparing preplans at the design stage to mitigate possible permanent failures [
67,
68]. The reconfigurable area of the FPGA is pre-divided into blocks, and the application design is partitioned into modules that are mapped and configured to execute in different blocks, with some blocks reserved as fault-tolerant spares. The application modules are also mapped to these spare blocks and precompiled into alternate configurations, and the resulting partial configuration bitstreams are stored in some form of memory. When a permanent fault is detected and localized in one of the blocks, the precompiled bitstream of the module running in the faulty block is loaded directly into a spare block. With this approach, fault recovery time can be minimized, but the main drawback is that the precompiled partial configuration bitstreams require external storage.
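Conceptually, recovery with design-time pre-configuration reduces to a lookup of a precompiled partial bitstream for a spare block, as in the minimal Python sketch below; module names, block names, and bitstream identifiers are hypothetical.

```python
# Minimal sketch of design-time pre-configuration for permanent faults: each module has
# a precompiled alternate bitstream targeting a reserved spare block, so recovery is a
# simple lookup plus partial reconfiguration. Names and bitstream IDs are illustrative.

precompiled = {
    # (module, target block) -> partial bitstream identifier prepared at design time
    ("fir_filter", "spare_block_A"): "fir_filter_spareA.bit",
    ("fft_core",   "spare_block_A"): "fft_core_spareA.bit",
    ("fft_core",   "spare_block_B"): "fft_core_spareB.bit",
}
spare_blocks = ["spare_block_A", "spare_block_B"]

def recover_from_hard_fault(module: str) -> str:
    """Pick a free spare block and return the precompiled bitstream to load into it."""
    for block in spare_blocks:
        bitstream = precompiled.get((module, block))
        if bitstream is not None:
            spare_blocks.remove(block)            # this spare is now occupied
            return bitstream
    raise RuntimeError(f"no precompiled spare configuration for {module}")

# A permanent fault is detected in the block currently running 'fft_core':
print(recover_from_hard_fault("fft_core"))        # -> fft_core_spareA.bit
```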
4.2.4. Intelligent Fault Tolerance
With the continuous development of computer and electronic technologies, the ideas of artificial intelligence have been integrated into fault tolerance research. As intelligent fault-tolerant techniques with value and promise in high-reliability and high-security applications, Artificial Immune Systems (AISs) and evolutionary hardware fault tolerance are becoming important and widely used approaches. The study and design of AISs is a relatively new research area that attempts to build computational systems inspired by the natural immune system. In the 1990s, AISs emerged as a new branch of computational intelligence and became increasingly popular, with AIS-based work ranging from theoretical modeling and simulation to a wide range of applications. Among the various mechanisms of the biological immune system, negative selection, immune network models, and clonal selection remain the most discussed models [
69,
70,
71]. They are used in pattern recognition, fault detection, computer security, and various other applications that are being explored by researchers in the fields of science and engineering [
66,
72,
73,
74]. AISs are a class of intelligent algorithms inspired by the mechanisms of the natural immune system; they are distributed, adaptive, and fault-tolerant, and are capable of unsupervised learning, recognition, and memory [
75,
76,
77,
78]. AISs have proven to have great potential for fault detection and diagnosis in complex systems and have been applied to fault and anomaly detection in FPGA electronic systems [
79,
80], power control systems [
81], and so on.
Evolutionary hardware is self-organizing, self-adaptive, and self-repairing, and hence, it is a good match for fault-tolerant systems [
82,
83]. The dynamic reconfiguration capability of FPGA-based systems can be used by an evolutionary hardware approach [
84], and when faults occur, they can be evolved to restore the correct operation of the system. Liu et al. [
85] specifically addressed electromagnetic pulse (EMP)-induced damage, which can cause permanent gate failures or interconnect degradation. Their approach employs an on-chip evolutionary algorithm that mutates the configuration bits of affected CLBs while preserving the I/O interface, with fitness evaluated via built-in test vectors. While this method demonstrated recovery from severe localized damage in simulation, its effectiveness drops sharply with distributed faults, and the lack of domain-specific heuristics leads to slow convergence, often requiring a large number of
generations. Gavrie and Thompson [
86,
87] introduced an online evolvable architecture featuring a dedicated reconfigurable array alongside the main FPGA fabric. This auxiliary array runs the evolutionary search in parallel without interrupting system operation—a key advantage for mission-critical applications. However, this design incurs significant area overhead (typically 30–50% extra logic) and is tightly coupled to custom FPGA architectures, limiting its applicability to commercial off-the-shelf devices. Zhang et al. [
88,
89] proposed a hybrid self-healing strategy combining coarse-grained evolutionary reconfiguration with analog-compensated balancing networks. Instead of evolving digital logic alone, their method adjusts bias voltages or reference currents in mixed-signal peripherals to counteract performance drift caused by aging or radiation. This co-design reduces the digital search space and achieves subsecond recovery in prototype ASIC-FPGA hybrids. Yet, it requires specialized analog circuitry, making it incompatible with pure digital FPGA flows, and introduces new failure modes related to analog calibration stability. Wang et al. [
90,
91], Lanchares et al. [
92], and Mukherjee and Dhar [
93] all developed real-time EHW frameworks, but they differ markedly in implementation: Wang’s group used bitstream-level evolution on Xilinx FPGAs, directly manipulating configuration frames. While highly granular, this approach is constrained by undocumented routing rules, leading to frequent invalid bitstreams and low evolution efficiency. Lanchares adopted a functional-unit abstraction, evolving connections between pre-defined IP blocks (e.g., adders, multipliers). This improves evolvability and portability but sacrifices fine-grained optimization, resulting in 20% higher resource usage post-repair compared to manual redesign. Mukherjee and Dhar integrated fault injection feedback into the fitness function, enabling the system to adapt to emerging fault models during operation. Their method showed high resilience in dynamic radiation environments but demanded continuous monitoring resources, increasing static power consumption by up to 15%.
In this regard, improving evolutionary efficiency has gradually become one of the focuses of EHW fault tolerance research. In early research, simple programmable logic devices [
94,
95] and field-programmable gate arrays [
96,
97,
98,
99] were usually chosen as the programmable substrate, and evolution was performed and accelerated at the bitstream level. However, these approaches are usually limited by the underlying placement and routing rules, which FPGA manufacturers do not disclose, and could therefore only be studied at the theoretical and simulation level. To reduce dependence on specific programmable devices and to increase the scale and efficiency of circuit evolution, various methods and programmable architectures have been proposed, such as functional evolution [
100,
101], modular evolution [
102], Cartesian genetic programming-based evolution [
103], decomposition evolution [
104,
105], development methodology-based evolution [
106], evolution based on virtual reconfigurable hardware layers [
39,
41,
107], etc. Among them, virtual reconfigurable circuits and virtual coarse-grained reconfigurable architectures are typical representatives of evolution based on a virtual reconfigurable hardware layer. Constructed as an abstraction layer on top of the FPGA, they enable coarse-grained, functional module-level evolution, improving the scale, efficiency, and practicality of circuit evolution, and they are the mainstream architectures adopted in research on evolutionary hardware fault tolerance.
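The core loop of evolutionary fault repair can be sketched as a simple (1+λ) evolutionary strategy: mutate the configuration genome, evaluate fitness against test vectors, and keep the best candidate until full functionality is restored. The genome encoding below (a bare 16-bit truth table) is a deliberately simplified stand-in for a virtual reconfigurable circuit configuration.

```python
# Minimal sketch of an evolutionary repair loop in the style of a (1+lambda) evolutionary
# strategy over a virtual-reconfigurable-circuit configuration: the genome is a bit string,
# fitness is the fraction of test vectors answered correctly, and mutation searches for a
# configuration that restores the target function. The encoding is purely illustrative.

import random

random.seed(2)
GENOME_BITS, LAMBDA = 16, 4

def fitness(genome: int, target: int) -> float:
    """Fraction of the 16 truth-table entries that match the target function."""
    return sum((genome >> i & 1) == (target >> i & 1) for i in range(GENOME_BITS)) / GENOME_BITS

def mutate(genome: int) -> int:
    return genome ^ (1 << random.randrange(GENOME_BITS))     # flip one configuration bit

def evolve_repair(damaged: int, target: int, max_gen: int = 500) -> tuple[int, int]:
    parent = damaged
    for gen in range(max_gen):
        if fitness(parent, target) == 1.0:
            return parent, gen
        offspring = [mutate(parent) for _ in range(LAMBDA)]
        parent = max(offspring + [parent], key=lambda g: fitness(g, target))
    return parent, max_gen

target = 0b1000_0000_0000_0000              # desired 4-input AND truth table
damaged = target ^ 0b0000_1010_0001_0000    # several configuration bits corrupted by faults
repaired, generations = evolve_repair(damaged, target)
print(f"repaired={repaired == target}, generations={generations}")
```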
4.2.5. Application-Level Fault Tolerance and Soft Error Handling
In addition to hardware- and architecture-level fault-tolerant technologies, recent research has also emphasized application-level fault tolerance, which exploits algorithmic properties and application semantics to detect, tolerate, or mask soft errors. Unlike traditional redundancy-based methods, application-level techniques aim to ensure the correctness of the program output rather than strict hardware-state correctness, which usually reduces hardware overhead significantly. These techniques offer an important complementary perspective on fault tolerance, especially for FPGA-based accelerators running error-resilient or approximate applications.
Beyond hardware-centric fault-tolerant mechanisms, application-level approaches have become an effective means of handling soft errors by leveraging algorithmic flexibility and correctness properties. These techniques can ensure that the application-level output remains correct or acceptable even in the presence of transient faults in the underlying hardware. Cong and Gururaj proposed a typical class of application-level correctness techniques [
108], which exploit the invariants and error detection mechanisms of specific algorithms to identify and mitigate soft errors without relying on full hardware redundancy. Their research indicates that many FPGA-accelerated applications, such as signal processing and machine learning workloads, can inherently tolerate a certain degree of computational imprecision, enabling lightweight fault detection and recovery strategies at the application level.
However, application-level fault tolerance essentially depends on the application and may not provide complete fault coverage for safety-critical systems. Therefore, it is usually combined with lower-level fault-tolerant mechanisms to form a cross-layer reliability framework.
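As one well-known example of exploiting algorithmic invariants (a checksum property of matrix multiplication in the spirit of algorithm-based fault tolerance, used purely as an illustration and not as the specific technique of the cited work), the following minimal Python sketch detects a corrupted output element without hardware replication; the matrix sizes and the injected error are assumptions.

```python
# Minimal sketch of application-level soft-error detection using a checksum invariant:
# the column sums of A@B must equal (column sums of A)@B, so a corrupted result element
# is caught without duplicating the hardware. Sizes and the injected error are illustrative.

import numpy as np

def checked_matmul(a: np.ndarray, b: np.ndarray, tol: float = 1e-6):
    """Multiply on the 'accelerator', then check a checksum invariant of the result."""
    c = a @ b                                   # accelerator result
    c[1, 2] += 5.0                              # emulate a soft error in one output element
    expected_colsum = a.sum(axis=0) @ b         # invariant computed from the inputs
    ok = np.allclose(c.sum(axis=0), expected_colsum, atol=tol)
    return c, ok

rng = np.random.default_rng(0)
a, b = rng.random((4, 4)), rng.random((4, 4))
_, ok = checked_matmul(a, b)
print("result accepted" if ok else "soft error detected -> recompute or roll back")
```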
4.3. Comparative Analyses
Section 4.2 describes in detail the four types of fault tolerance techniques for FPGA-based reconfigurable systems that are commonly used today. As can be seen from the previous section, they are not independent of one another; rather, they overlap and are frequently combined, with fault detection and localization methods often paired with fault recovery techniques to meet the fault tolerance requirements of the system.
Table 1 compares and analyzes the typical technical means of various fault tolerance methods. From
Table 1, it can be seen that there are four mainstream fault detection and localization methods, whose advantages and disadvantages differ mainly in resource overhead and detection delay. Redundancy-based fault tolerance generally suffers from large resource/time overhead, yet it is the most widely used approach and has served many industries for a long time. Dynamically reconfigurable fault tolerance is unique to FPGAs and their reconfigurable systems; its advantage is that it can achieve fine-grained fault tolerance and improve the overall utilization of hardware resources. Intelligent fault tolerance, which has emerged in recent years alongside evolvable hardware, offers adaptive fault tolerance without human intervention and represents the future direction of fault tolerance technology; its current limitations mainly lie in long evolution times and the need for special hardware platforms.
Table 2 summarizes the key characteristics of three representative fault-tolerant techniques.
5. Aging Mitigation Technology
5.1. Concept and Classification
Successive reductions in transistor size worsen the aging degradation of digital circuits by intensifying aging effects such as Bias Temperature Instability, Time-Dependent Dielectric Breakdown, and Hot Carrier Injection [
109]. In addition, the amplified power density in nanoscale CMOS devices is another major cause of accelerated aging [
110]. Aging degradation of CMOS transistors has become one of the major reliability threats in digital integrated circuits, with aging-related transient and permanent failures occurring more frequently over the long term, thus shortening the normal life cycle of the device or system [
111]. Therefore, aging protection techniques must be designed and employed to counter or mitigate the reliability degradation of digital circuits.
Aging effects and degradation mechanisms are always present in practical applications. The purpose of aging protection techniques is to slow down aging so as to extend the life cycle of the device, or to take measures that prevent reliability problems caused by aging degradation. Achieving these goals requires, on the one hand, assessing the aging of digital circuit chips/systems and, on the other hand, taking appropriate mitigation measures based on that assessment. Aging assessment takes two forms: monitoring the aging condition of the currently operating circuit to evaluate its degree of aging, and predicting the aging trend or failure time of a given circuit or device. In summary, as shown in
Figure 7, the main aging protection techniques for current FPGA-based reconfigurable systems include three main categories: aging monitoring techniques, aging adaptation and mitigation techniques, and aging modeling and prediction techniques.
Aging monitoring techniques are the most effective way to assess the degree of chip/system aging. The effect of aging on a chip/system results from the combined influence of various environmental parameters, and because of process variations and the inherent randomness of aging effects, the aging degradation of each device is different; it is therefore impossible to fit all aging degradation with a simple mathematical function or model. Sensor- and test circuit-based methods can directly monitor changes in circuit parameters such as delay, frequency, and voltage, and are the most commonly used means of assessing aging [
112,
113,
114,
115,
116,
117,
118,
119,
120,
121]. The main goal of aging modeling and prediction techniques is to determine the expected service life and reliability of circuits at design time to inform preventive and maintenance measures. Aging modeling has been studied at gate, circuit, and system levels, and these models focus on the physical aspects of microelectronics and have a high prediction accuracy. Their limitation is that the model parameters are quite complicated, and long-term aging experiments are required to determine the appropriate parameter values [
122,
123,
124,
125,
126]. With the expansion of AI techniques in CMOS device aging protection, machine learning-based aging prediction models have developed significantly. Such models can fit well and predict the aging trend of chips with complete data but require a large amount of sample data for model training [
127,
128,
129,
130]. Aging adaptation and mitigation techniques are the most important part of aging protection. Commonly used aging protection methods for CMOS devices are design-time protection and dynamic voltage/current regulation techniques; however, with the dramatic scaling of transistor sizes, the effectiveness of such conventional techniques is limited. For FPGAs and their reconfigurable systems in particular, aging mitigation can be performed using resource management and aging-aware placement schemes based on reconfiguration. Such approaches monitor or evaluate the aging degree of hardware resources at different granularities and then use reconfiguration to dynamically adjust task mapping and resource usage, balancing the overall chip stress to mitigate aging [
56,
125,
131,
132,
133,
134,
135,
136].
5.2. Research Status
To date, researchers from research institutes, universities, and industry worldwide have investigated aging protection for FPGA-based reconfigurable systems, devoting considerable effort and achieving a series of research results. In this section, we summarize the current status of the related research in three aspects: aging monitoring technology, aging mitigation technology, and aging prediction technology.
5.2.1. Aging Monitoring
The effect of aging on FPGA chips is determined by a variety of environmental parameters and structural/process factors. The environmental parameters mainly include temperature, operating stress, and duty cycle. However, for the same FPGA executing different applications, the degree of aging varies even under the same environmental changes. This problem is exacerbated in the presence of process variations, which add further uncertainty to circuit aging assessment. Moreover, the BTI effect is inherently stochastic in deeply scaled nanometer processes, which means that even similar devices operating with the same application and environment may exhibit different levels of aging (delay degradation). Given these issues, a simple aging assessment is not feasible, and sophisticated techniques are required to monitor the aging level of FPGAs. Since FPGA aging increases circuit propagation delays, most current aging monitoring techniques assess the degree of FPGA aging by measuring circuit delays with inserted aging monitors or various types of test structures. Based on the design of the aging sensors and sensing mechanisms, aging monitoring techniques can be classified into two categories: (1) simulation-based monitoring methods and (2) actual (on-chip) measurement methods.
Usually, aging monitoring circuits for FPGAs are not generic and require the use of on-chip resources to construct a monitoring module according to the actual situation, so some scholars have used simulation-based monitoring methods to assess the aging of FPGAs. For example, Morales et al. [
112] developed a relatively generic simulation environment for FPGA circuits and were able to predict the propagation delay of the Look Up Table (LUT) under the NBTI and HCI aging regimes, while Mohammad et al. [
113] further designed different simulation experiments to differentiate between the main factors of HCI and BTI aging. Glocker et al. [
114] proposed a real-time power, temperature, and aging monitoring system, eTAPMon, for an FPGA prototype of MPSoC, modeled the temperature monitor based on a linear regression model obtained from offline thermal simulation, further modeled the behavior of the aging monitor based on the critical path model, and finally calculated the timing margin due to aging degradation. Krieg et al. [
115] introduced a multidisciplinary approach based on state-of-the-art power simulation and FPGA partial reconfiguration techniques and proposed a novel device aging detection mechanism using power simulation techniques. However, the above simulation-based monitoring methods rely too much on the prediction of empirical formulas, making it difficult to avoid certain estimation errors.
For example, in long-running industrial IoT gateway FPGAs, Ring Oscillator (RO) test structures are inserted on critical logic paths to monitor circuit delay variations in real time: as the device ages, transistor degradation gradually increases path delays, and when a threshold is exceeded the system automatically triggers frequency scaling or task migration to prevent timing violations that could cause data acquisition errors. In medical device FPGAs, sensors monitor chip temperature and voltage drift to indirectly assess aging levels, ensuring long-term stable operation.
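A minimal sketch of such threshold-triggered mitigation is given below; the degradation thresholds and the mapping to actions are hypothetical illustrations rather than values from any specific system.

```python
# Hedged sketch of a threshold-triggered aging response policy.
# Threshold values and actions are hypothetical, not taken from a real product.

def select_mitigation(delay_degradation: float) -> str:
    """Map a measured relative delay increase to a mitigation action."""
    if delay_degradation < 0.03:
        return "none"               # still within the timing guard band
    if delay_degradation < 0.08:
        return "frequency_scaling"  # lower the clock to restore slack
    return "task_migration"         # move tasks to a less-aged region

for d in (0.01, 0.05, 0.12):
    print(f"degradation {d:.0%} -> {select_mitigation(d)}")
```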
In order to obtain more realistic aging data, some scholars have focused on researching FPGA on-chip aging monitoring methods, which construct latches or RO path monitors by using on-chip LUTs or interconnect resources to measure the actual aging of circuits. Stott et al. [
116] proposed an aging measurement and modeling methodology for FPGA chips, which describes how the aging of FPGAs is affected by temperature, voltage, frequency, duty cycle, and other factors, and compared the aging measured by the actual measurement structure with the aging calculated by the formula model to verify the correctness of the proposed measurement method. Wong and Miyake et al. [
117,
119,
121] proposed a method based on the measurement of circuit delays to indirectly assess the degradation of the chip. Mohammad and Xiang [
118,
120] both designed FPGA aging test platforms that can be quickly and easily applied to different series of FPGA devices and can monitor the aging of multiple groups of FPGA chips simultaneously. These research methods can collect FPGA aging data and calculate the degree of aging degradation using relevant models, offering high accuracy and real-time capability; their main limitations are the difficulty of deciding where to deploy the monitors or test structures and the considerable resource overhead they introduce.
5.2.2. Aging Mitigation
In early applications, FPGA aging adaptation and mitigation mostly used aging mitigation techniques for CMOS-class devices, such as shield soldering, protective tape design, and dynamic adaptation techniques [
110]. Protective band (guard-band) design addresses degradation over the design lifetime by reducing the operating frequency or increasing the supply voltage so that timing violations caused by aging effects such as NBTI do not occur. However, providing a voltage guard band increases energy consumption throughout the operating period [
137], which is why guard-banding is also regarded as a worst-case design technique. Dynamic adaptation techniques include Dynamic Voltage Scaling (DVS) and Dynamic Frequency Scaling (DFS). The fine-grained combination of the two (DVFS) is a very effective tool for resisting chip aging [
138], but it also requires additional control modules, which increases circuit complexity.
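As a simple illustration of the worst-case (guard-band) design idea, the sketch below derates the nominal clock so that the expected end-of-life delay plus a safety margin still meets timing; all numbers are hypothetical.

```python
# Illustrative guard-band calculation for the worst-case design technique
# described above (nominal frequency, delay increase, and margin are hypothetical).

def guard_banded_frequency(nominal_fmax_mhz: float,
                           expected_delay_increase: float,
                           margin: float = 0.05) -> float:
    """Derate the clock so end-of-life critical-path delay plus margin meets timing."""
    # End-of-life delay grows by (1 + expected_delay_increase); the usable
    # frequency shrinks by the same factor, plus an extra safety margin.
    return nominal_fmax_mhz / ((1.0 + expected_delay_increase) * (1.0 + margin))

print(guard_banded_frequency(200.0, 0.10))  # ~173 MHz usable over the lifetime
```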
With the rapid development of dynamic partial reconfiguration technology for FPGAs, the mapping relationship between tasks and on-chip resources can be changed dynamically to equalize the on-chip stress, thus achieving aging mitigation. On this basis, current mainstream FPGA aging mitigation techniques are classified into three categories: the bit flipping technique, the layout and remapping technique, and the route-aware technique [
109]; the first two techniques are mainly aimed at mitigating the aging of the configuration units and connected transistors in FPGAs, and the third is aimed at aging mitigation of the routing in FPGAs.
The bit flipping technique is a fine-grained aging mitigation technique that mainly achieves aging mitigation by balancing the stress on the configuration cells (Static Random Access Memory (SRAM)) or connected transistors. Among them, Stott et al. [
139] proposed periodically inverting the SRAM cell to balance its aging, which can only mitigate FPGA aging to a certain extent, since the probability of SRAM performance degradation due to aging is small. Mottaghi et al. [
112] further demonstrated that FPGA aging degradation can be mitigated by a bit flipping scheme and proposed a post-routing aging-aware timing analysis method to find the optimal flipping frequency, while Ghaderi et al. [
140] proposed applying an optimal arrangement of the LUT inputs to mitigate aging effects and achieved good results; however, this method is difficult to apply to large-scale applications and creates considerable routing pressure as the number of LUTs increases. Layout and remapping, on the other hand, is an aging mitigation technique for high-level design and adaptive resource management [
131], where the on-chip stress is balanced for aging mitigation by making layout planning at the design stage or dynamically adjusting the task placement at runtime [
56,
132]. Depending on the stage at which the layout strategy is generated, these methods fall into two categories: online layout strategies and offline layout design. Online layout strategies dynamically reschedule tasks at runtime, based on the wear of each reconfigurable block, to balance the on-chip stress distribution. Zhang et al. addressed this problem by moving from fine-grained to on-chip coarse-grained strategies, first proposing a CLB-level task placement strategy that achieves aging adaptation and fault tolerance within reconfigurable blocks [
133]; a two-phase (design-time and runtime) stress-balancing strategy is further proposed [
134], and later refining it so that aging mitigation is achieved jointly inside and outside the reconfigurable partitions. These approaches are very effective in mitigating FPGA on-chip aging, but they must assume that the dynamically reconfigurable partitions are of equal size, which wastes resources to some extent and limits placement options at runtime. In addition, online methods require real-time monitoring of aging information and computation of stress, which incurs significant resource and time overheads. In contrast, offline layout design generates an aging-aware layout plan in the offline phase, balancing on-chip stress through the mapping of pre-scheduled tasks; this approach therefore requires no additional resources to monitor and compute aging information. Sahoo et al. proposed a heterogeneous task-to-resource mapping and placement strategy that uses Mixed-Integer Linear Programming (MILP) and genetic algorithms to search the solution space, achieving good resource utilization while guaranteeing aging mitigation as far as possible; however, the method cannot place all tasks flexibly, and no runtime task placement method is provided [
135,
136]. Hu et al. [
125] designed a reliability-aware FPGA resource layout planner that mitigates aging by balancing how frequently each resource is used by tasks while also considering routing lengths to save resources. However, none of the current offline aging-aware layouts consider hard-fault tolerance, which poses a fatal threat to pre-generated offline layout strategies. Route-aware aging mitigation techniques likewise exploit FPGA dynamic reconfigurability and have been studied in depth by Khaleghi et al. They first proposed an aging-aware routing algorithm that uses a tree-based routing multiplexer structure and changes multiplexer priorities so that less-aged transistors are used, in an attempt to balance the stress on the multiplexer transistors [
141], but the approach idealizes and oversimplifies the actual FPGA architecture. They later presented a comprehensive analysis of aging in FPGA routing networks together with mitigation techniques, and subsequently proposed an aging-aware layout [
142], which balances resource degradation in a coarse-grained manner by imposing layout constraints or moving initially placed designs to less-aged regions without affecting their routing; however, the approach incurs additional storage overhead.
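To make the online stress-balancing idea concrete, the sketch below greedily maps each incoming task to the currently least-aged reconfigurable region; it is a simplified illustration, and the region names, stress values, and per-task stress increments are hypothetical rather than taken from any cited scheme.

```python
# Simplified sketch of stress-balancing task remapping (online layout strategy).
# Region names, stress values, and task loads are hypothetical.

from typing import Dict, List

def remap_tasks(region_stress: Dict[str, float],
                tasks: List[float]) -> Dict[str, List[float]]:
    """Greedily place each task (given as a stress increment) on the least-aged region."""
    placement = {region: [] for region in region_stress}
    for load in sorted(tasks, reverse=True):              # heaviest tasks first
        target = min(region_stress, key=region_stress.get)  # least-aged region
        placement[target].append(load)
        region_stress[target] += load                      # region accumulates wear
    return placement

stress = {"PR0": 0.42, "PR1": 0.30, "PR2": 0.55}  # accumulated aging stress per region
print(remap_tasks(stress, [0.10, 0.05, 0.08]))    # new tasks land on PR1/PR0, sparing PR2
```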
In summary, the bit flipping technique is a fine-grained aging mitigation method that balances the stress on a small portion of configuration cells or connected transistors, and is suitable for on-chip locations with particularly high reliability requirements. Layout and remapping is the current mainstream method for aging protection and mitigation in FPGA-based reconfigurable systems; it relies on dynamic partial reconfiguration to realize flexible task-to-resource mapping strategies and implements aging mitigation in both the design phase and the operation phase. Many research questions remain in such high-level design schemes, most notably how to balance resource cost against reliability requirements, and there is still no good solution for cases where the aging degree of different blocks in the FPGA is unequal. Research on route-aware aging mitigation has confirmed that aging of the SRAM cells in the routing fabric has a negligible impact on routing latency, whereas aging of the multiplexer transistors (directly connected to those SRAM cells) can cause severe delay increases, so research should focus on mitigating transistor stress. These techniques can also be combined with the aging mitigation techniques commonly used for CMOS devices; such combinations may yield higher reliability at lower resource overhead and are a worthwhile focus of current research.
5.2.3. Aging Prediction
Aging prediction techniques aim to assess the lifecycle and reliability trends of circuits during the design phase in order to take appropriate measures for preventive maintenance in advance. Research has been conducted to integrate aging prediction models into the Static Timing Analysis (STA) framework to optimize circuit timing using traditional EDA design tools [
143,
144]. In the current research field, there exist two main types of approaches for aging modeling and prediction of FPGAs, based on traditional physical models and machine learning-based approaches [
145,
146,
147]. The former uses physical experimental means, which usually require up to several years to observe and analyze the aging process of the device to obtain relevant formulas and parameters, so as to achieve the assessment and prediction of the aging trend of the device. Among them, Jang et al. [
123] designed an on-chip aging sensor to acquire aging data and predict circuit failures caused by BTI and HCI aging effects based on empirical formulas. Yu et al. [
124] proposed an aging prediction and screening method based on a novel chip architecture called ZeroScreen. Xiang et al. [
125] proposed a generic aging modeling approach for predicting the remaining lifetime of a device. The main limitations of the above research methods for aging prediction based on transistor-level or LUT-level physical models are the long experimental period and the difficulty in determining appropriate formula parameters.
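For reference, a commonly used empirical power-law form of BTI-induced threshold-voltage shift is shown below; it is a generic textbook expression with technology-dependent fitting parameters, not the specific model used in the cited works.

```latex
% Generic empirical power-law form for BTI-induced threshold-voltage shift:
\Delta V_{th}(t) \;=\; A \, e^{-E_a/(kT)} \, V_{gs}^{\gamma} \, t^{\,n}
% A, \gamma: fitting constants; E_a: activation energy; k: Boltzmann constant;
% T: temperature; V_{gs}: gate stress voltage; n \approx 0.1\text{--}0.25: time exponent.
```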
With the development of artificial intelligence technology, intelligent prediction methods for machine learning with data-driven modeling have gradually gained attention. At the system level, Gugulothu et al. [
127] acquired data from built-in sensors and trained a recurrent neural network-based sequence model to estimate the remaining useful life of the whole system. Li et al. [
128] proposed an ensemble learning framework combining multiple intelligent algorithms to predict the remaining useful life of the system. At the IC level, Karimi et al. [
129] proposed a generalized IC aging prediction method that takes into account a comprehensive set of IC operating conditions including workload, usage time, operating temperature, etc. Vijayan et al. [
130] proposed a low-cost, fine-grained methodology for monitoring workload-induced stress in order to accurately predict aging-induced delays. However, the above studies focus only on predicting BTI-induced circuit degradation and still rely on logic simulation to obtain the feature values and labels used as inputs to the prediction models.
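As a hedged illustration of such data-driven prediction, the sketch below trains a regression model that maps on-chip sensor features to remaining useful life (RUL); the feature set and synthetic labels are hypothetical placeholders for real long-term aging logs.

```python
# Hedged sketch of a data-driven aging-prediction model (synthetic data only).

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
# Hypothetical features: mean temperature (C), duty cycle, RO delay drift (%), operating hours
X = np.column_stack([
    rng.uniform(40, 90, n),
    rng.uniform(0.2, 1.0, n),
    rng.uniform(0.0, 8.0, n),
    rng.uniform(0, 5e4, n),
])
# Synthetic RUL label: shorter life under higher temperature, stress, and delay drift
y = 1e5 - 400 * X[:, 0] - 2e4 * X[:, 1] - 3e3 * X[:, 2] - 0.5 * X[:, 3]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out data:", round(model.score(X_te, y_te), 3))
```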
In summary, aging prediction research for FPGAs and their reconfigurable systems is still in an exploratory stage for two main reasons: (1) traditional aging models are difficult to apply directly to aging-trend prediction and reliability assessment; and (2) large volumes of long-term FPGA aging data are lacking, which makes it difficult to train intelligent prediction models.
5.3. Comparative Analyses
Section 5.2 describes in detail three types of aging protection techniques for FPGA-based reconfigurable systems commonly used in the industry today. Among them, aging monitoring and aging prediction techniques focus on evaluating the aging state of the chip/system, while aging adaptation and mitigation techniques are based on the former and take corresponding measures to slow down aging or protect against aging-caused reliability problems.
Table 3 provides a comparative analysis of typical techniques within each class of aging protection method. As can be seen from
Table 3, aging monitoring technology includes three categories: in situ sensor monitoring, external sensor monitoring, and testing with dedicated aging structures, which differ mainly in resource overhead and test location. Aging prediction technology is divided into traditional aging prediction models and intelligent aging prediction models; the former focuses on analyzing the physical degradation caused by aging at the microelectronic level, while the latter adopts intelligent algorithms such as machine learning to predict aging trends. Aging adaptation and mitigation technology is likewise divided into two categories and is the focus of aging protection: the design protection and dynamic adaptation techniques used for CMOS devices are widely applicable and simple to operate, and in them the setting of timing margins is crucial to circuit efficiency, while bit flipping, layout and remapping, and route awareness are aging mitigation techniques specific to reconfigurable systems, whose advantage is that reconfiguration can adjust the use of hardware resources to balance on-chip stress and thereby slow down aging.
6. Hardware Attack Defense
6.1. Concept and Classification
In complex and critical applications, hardware is gradually becoming an anchor point for adversary attacks. Chip design follows a globalization strategy to reduce costs and ensure timely delivery [
19]. In this globalized model, IPs procured from different third-party IP suppliers are integrated into the final system design, and the various stages of very-large-scale integration design are outsourced to different parts of the world. Neither third-party IP suppliers nor foundries can be fully trusted: third-party IP suppliers may plant malicious entities in the reconfigurable IP they provide [
148], and adversaries in foundries may plant malicious circuits during the fabrication of the IP [
149], opening up the possibility of hardware attacks. In recent years, Quo Vadis Labs has reported chip backdoors in mission-critical systems that could be used for attack or sabotage, such as weapons control systems, nuclear power plants, and even public transport systems [
150].
Among hardware attacks, Hardware Trojans have attracted the widest attention as a mainstream attack vector. Hardware Trojans are malicious circuits designed to alter the functionality of the original chip/circuit, and they can be used to leak confidential information or even cause the chip they are embedded in to fail permanently at runtime. A Hardware Trojan consists of two parts: a trigger and a payload. The trigger usually corresponds to a rare data input (or input sequence), while the payload is the activity that causes data leakage or failure once the Trojan is triggered. Since FPGAs are also built through a globalized supply chain, it is likewise feasible to implant Hardware Trojans in FPGA structures [
151]. In addition, the bitstreams that configure FPGA chips are often generated from IP procured from third-party intellectual property (3PIP) vendors, which also opens up the possibility of inserting Hardware Trojans during bitstream generation [
148,
152].
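To illustrate the trigger/payload structure, the following conceptual software model shows why a rare trigger makes a Trojan hard to activate with random functional tests; the trigger value, payload mask, and stand-in "original logic" are purely hypothetical, since real Hardware Trojans are circuits rather than software.

```python
# Conceptual (software) model of the trigger/payload structure described above.

TRIGGER = 0xDEADBEEF          # hypothetical rare 32-bit input pattern (the trigger)
PAYLOAD_MASK = 0x00000001     # payload: flip one output bit when triggered

def infected_module(x: int) -> int:
    clean_result = x ^ 0x5A5A5A5A             # stand-in for the original logic
    if x == TRIGGER:                           # trigger: a single rare input value
        return clean_result ^ PAYLOAD_MASK     # payload corrupts the output
    return clean_result

# With uniformly random 32-bit test vectors, the chance of hitting the trigger
# in any single test is 1 / 2**32, so logic testing rarely activates the payload.
print(hex(infected_module(0x12345678)), hex(infected_module(0xDEADBEEF)))
```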
Currently, Trojans are usually classified based on their payload, activation mechanism, and physical characteristics. This paper focuses on the impact of Hardware Trojans on the system, from which perspective they can be classified into three categories: (1) Trojans that directly cause failures: such Trojans cause FPGA failures whose severity ranges from simple logic errors to complete device failure with electrical faults, and they can be further divided into Trojans that prevent FPGA operation and Trojans that inject faults; (2) Trojans that produce side effects: the payloads of such Trojans do not interfere with the normal design logic but aim to leak information from the FPGA or waste hardware resources, and they can be further divided into information-leaking Trojans and resource-wasting Trojans; (3) Trojans that introduce vulnerabilities: such Trojans have no direct effect on the system/hardware but instead introduce vulnerabilities that create conditions for other hardware attacks; because they usually do not generate a payload immediately after implantation, they are difficult to detect. In this paper, we discuss the reliability problems caused by Hardware Trojans in FPGAs and their reconfigurable systems, so we focus on the first type of Hardware Trojans.
6.2. Research Status
6.2.1. Prevention and Detection of Hardware Trojans
Currently, various studies are conducted in academia and industry in the areas of Hardware Trojan detection and protection [
153,
154,
155,
156]. Both prevention and detection methods are necessary to prevent adversaries from implanting Hardware Trojans into systems. In the category of prevention techniques, a new TMR structure called Adapted Triple Modular Redundancy (ATMR) has been proposed [
153]. ATMR uses three different circuits to implement the same module, which depends on the fact that it is highly unlikely that a Hardware Trojan activation will trigger all the circuits at the same time. Due to the redundant structure, both the traditional TMR approach and the proposed ATMR have high area overhead and power consumption. Filling unused space on the FPGA is another way to protect circuits from Hardware Trojan insertion [
154]. This method incurs minimal performance and power loss. Karam et al. [
155] used physical and logical keys to improve the security of FPGA systems by obfuscating FPGA bitstreams; the technique is based on a dedicated and configurable architecture. Zhang et al. [
156] proposed a three-line-of-defense FPGA-Oriented Moving Target defense (FOMT) approach, which creates uncertainty for attackers and makes it more difficult to insert Hardware Trojans.
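The redundancy-with-diversity idea behind TMR/ATMR can be illustrated with the following minimal sketch, in which three deliberately different implementations of the same function feed a majority voter; the implementations and the voter are hypothetical stand-ins for real, diverse circuit variants.

```python
# Minimal sketch of majority voting over three functionally equivalent but
# differently implemented modules: a single diverging (possibly Trojan-triggered)
# output is masked by the voter.

from collections import Counter

def impl_a(x: int) -> int: return (x * 3) + 1
def impl_b(x: int) -> int: return x + x + x + 1
def impl_c(x: int) -> int: return (x << 1) + x + 1

def voter(x: int) -> int:
    outputs = [impl_a(x), impl_b(x), impl_c(x)]
    value, votes = Counter(outputs).most_common(1)[0]
    if votes < 2:
        raise RuntimeError("no majority: multiple modules disagree")
    return value

print(voter(7))  # 22, even if one replica's output were maliciously perturbed
```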
Relative to prevention techniques, Hardware Trojan horse detection techniques have received wider attention and strong support from relevant governments, and a great deal of research has been conducted in academia and industry on the attack mechanism and detection methods of Hardware Trojans. Generally speaking, these methods can be broadly classified into two categories according to the type of features used: dynamic detection and static detection. Since the activity of Hardware Trojan circuits introduces additional effects on the target IC (e.g., circuit functions, bypass parameters, etc.), some researchers have attempted to exploit these additional effects to determine whether a given IC-to-be-tested is infected by a Hardware Trojan. Such approaches typically use a given test vector to activate or run the Trojan circuit to obtain several dynamic characteristics of the IC; hence, we call them dynamic detection. Dynamic detection methods aim to detect Hardware Trojan circuits inserted by untrusted foundries during the manufacturing process [
157,
158]. The selected IC features are sensitive to the effects of implanted Hardware Trojans, and such effects are also relatively easy to observe. Therefore, dynamic methods can achieve high accuracy in Hardware Trojan detection [
159,
160]. Among them, logic testing and side-channel (bypass) analysis are two typical dynamic Hardware Trojan detection methods.
Although dynamic detection methods can better verify the presence of Hardware Trojan circuits in ICs, they still have some limitations in practical applications [
161,
162]. When an attacker maliciously tampers with the electronic design files of an IC during the design phase to circumvent Hardware Trojan detection, dynamic detection methods are no longer applicable. In contrast, static detection techniques, which aim to uncover Hardware Trojan circuits using testability-related structural features extracted from IC design files, are not subject to these limitations [
163,
164]. For example, Zareen et al. [
165] explored a framework for detecting Hardware Trojan horses based on Artificial Immune Systems (AISs), a methodology that extracts from RTL code represented as a binary-encoded Control and Data Flow Graph (CDFG) certain behavioral features to reveal the presence of Hardware Trojan circuits. However, this method is unable to further locate the suspected Hardware Trojan circuits by abnormal behavioral features. Hasegawa et al. [
163] proposed extracting 51 Trojan-related features from the Gate-Level Netlist (GLN) and applied a Random Forest (RF) algorithm to select the 11 most relevant features among them. Based on these 11 features, various machine learning (ML) models are then built to classify the nets of a given GLN as normal or Trojan-infected. The authors of [
161,
162] proposed several new features related to Hardware Trojans based on the original 51 features. Based on these features, an RG-Secure framework and an XGBoost-based approach have been proposed to detect the implanted Hardware Trojan circuits. These solutions can effectively improve the detection accuracy of gate-level Hardware Trojans, but the detection results achieved for some specific target circuits, such as s35932-T200 [
166], are poor. In addition, Choo et al. [
167] proposed a multi-ML Hardware Trojan detection framework that uses a Support Vector Machine (SVM) and a Decision Tree (DT) to classify branching operations and signals extracted from the RTL code and the corresponding GLNs, respectively, in order to identify RTL code containing Hardware Trojan instances. Compared with previous research, this approach achieves higher performance without false alarms. However, it is somewhat complex to implement and requires a significant amount of time to train multiple ML models.
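As an illustration of such feature-based static detection, the sketch below trains a Random Forest on synthetic per-net features; the feature set and labels are hypothetical placeholders and are not the 51/11 features of the cited work.

```python
# Hedged sketch of static, feature-based Trojan detection on gate-level netlists.
# Features and labels are synthetic illustrations only.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
n = 1000
# Hypothetical per-net features: fan-in, logic depth from a primary input,
# distance to the nearest flip-flop, estimated switching probability
X = np.column_stack([
    rng.integers(1, 10, n),
    rng.integers(1, 40, n),
    rng.integers(0, 15, n),
    rng.uniform(0.0, 0.5, n),
])
# Synthetic labels: deep, rarely switching nets are marked suspicious (1),
# mimicking typical Trojan-net characteristics.
y = ((X[:, 1] > 25) & (X[:, 3] < 0.05)).astype(int)

clf = RandomForestClassifier(n_estimators=200, random_state=1).fit(X, y)
suspect = clf.predict([[4, 33, 12, 0.01]])
print("Trojan-suspect net" if suspect[0] else "normal net")
```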
For example, in military communication equipment, FPGA chips employ “blank cell filling” technology during the design phase to fully occupy all unused logic resources, preventing attackers from implanting Hardware Trojans. Meanwhile, side-channel analysis (SCA) is used to detect malicious chips by identifying power consumption discrepancies, thereby safeguarding communication encryption keys. In civilian applications, FPGA accelerator cards in cloud servers utilize static detection of gate-level netlists to identify hidden Trojans in third-party IP cores, preventing data breaches.
In summary, most Hardware Trojan detection techniques (both dynamic and static) can achieve good results, but existing research still has notable limitations. Firstly, dynamic detection methods mainly target Hardware Trojans implanted by untrustworthy foundries during manufacturing, so they must generate enough test vectors to trigger the Trojans, which is time-consuming. Secondly, SCA-based methods cannot detect small Hardware Trojans because the side-channel characteristics of ICs are susceptible to process variations and ambient noise, while logic-testing (LT)-based methods have difficulty detecting large Hardware Trojan circuits. Thirdly, most of these methods assume the availability of a golden reference chip, which is often difficult to obtain in practice. Moreover, an adversary can maliciously modify electronic design files and insert Hardware Trojans during IC design, and dynamic detection methods may fail to detect such Trojans. Static detection methods are not subject to these limitations, but they can only verify whether a target IC is infected and cannot report where the Hardware Trojan is implanted; in addition, the choice of structural or functional IC features strongly affects detection accuracy. Finally, most studies focus solely on Hardware Trojan detection and do not address what happens after detection (e.g., reporting the exact location of the implanted Trojan and providing a mitigation strategy).
6.2.2. Evolutionary-Based Hardware Trojan Prevention
Evolutionary hardware is becoming an important approach for fault-tolerant and reliable design, as the idea of artificial intelligence is incorporated into the field of security and high-reliability research [
168]. Evolutionary hardware is the design of hardware’s physical structure by simulating the natural evolution process using the idea of evolutionary algorithms, and it consists of two necessary elements: one is the programmable logic device (PLD) represented by FPGA, and the other is the evolutionary algorithm [
45]. The rapid development of programmable logic devices and evolutionary algorithms has greatly promoted the development and implementation of evolutionary hardware. Its basic principle is as follows: coded bit strings describing the structure and parameters of the programmable logic device serve as the individuals evolved by the evolutionary algorithm; the algorithm generates candidate bit strings corresponding to the currently required function, which are downloaded to the programmable logic device; and through repeated fitness evaluation against the requirements and further evolutionary operations, bit strings with ever better fitness are produced, until the configuration that best suits the current environment and purpose is obtained. In other words, the optimal hardware structure is obtained by directly adjusting the coded bit strings of the programmable logic device [
169]. Currently, evolutionary hardware has been successfully applied in many fields, such as image processing algorithms [
170], fault-tolerant system design [
44], face recognition [
171], power consumption optimization [
172], and arithmetic circuit design [
173].
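As a minimal illustration of this evolutionary loop, the sketch below evolves candidate configuration bit strings toward a target string; the bit-string length, population size, and fitness function are hypothetical stand-ins for downloading a configuration and measuring how well the resulting circuit meets the required function.

```python
# Minimal sketch of an evolutionary-hardware-style loop over configuration bit strings.
# "Fitness" simply counts matching bits against a target, standing in for measuring
# how well the configured circuit satisfies the required behaviour.

import random

TARGET = [random.randint(0, 1) for _ in range(64)]   # desired configuration

def fitness(bits): return sum(b == t for b, t in zip(bits, TARGET))

def mutate(bits, rate=0.02):
    return [1 - b if random.random() < rate else b for b in bits]

population = [[random.randint(0, 1) for _ in range(64)] for _ in range(20)]
for generation in range(200):
    population.sort(key=fitness, reverse=True)        # rank candidates by fitness
    if fitness(population[0]) == len(TARGET):
        break
    parents = population[:5]                          # truncation selection
    population = parents + [mutate(random.choice(parents)) for _ in range(15)]

print("generation:", generation, "best fitness:", fitness(population[0]))
```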
Evolutionary hardware with its adaptive, self-organizing, and self-healing properties is now ripe for system fault tolerance applications. In early research, simple programmable logic devices [
94,
102] were usually chosen as programmable architectures, and evolutionary approaches were proposed and implemented at the bitstream level. However, these methods are not applicable in practical engineering applications because the technologies related to the underlying hardware layout are confidential. Therefore, the current research basically targets the virtual reconfigurable hardware layer (e.g., virtual reconfigurable circuits, virtual reconfigurable architectures, virtual coarse-grained reconfigurable arrays, etc.) to achieve evolutionary hardware fault tolerance. The virtual reconfigurable hardware layer is built on specific programmable devices, forming a programmable architecture for evolutionary hardware. At this abstract architecture level, coarse-grained circuit evolution, i.e., functional module-level evolution, can be achieved, thereby increasing the scale and utility of circuit evolution. Currently, evolvable circuits based on virtual reconfigurable circuits have even been integrated into partially reconfigurable hardware [
174]. Yang et al. [
5] proposed a functional module array consisting of virtual reconfigurable arrays, on which functional-level evolution to tolerate functional-level faults is implemented. Towards fault-tolerant and adaptive systems, Barker et al. [
175,
176,
177] proposed device-independent evolvable architectures: reconfigurable integrated system arrays and bionic hardware fault-tolerant architectures.
Hardware Trojans, as hardware structures maliciously implanted by an adversary that can disrupt the reliable operation of a system, constitute a special kind of failure threat, so protection against Hardware Trojan attacks can be achieved with fault-tolerance methods analogous to those of evolutionary hardware. Intelligent mitigation mechanisms for Hardware Trojan attacks on programmable devices, or on reconfigurable systems built from such devices, are therefore gradually gaining attention: Trojan-infected circuits can be reconfigured to change the circuit structure or to avoid using the infected resources, which effectively protects against the attack. However, research in this area is still at an exploratory stage. Labafniya et al. [
169] proposed a mechanism to mitigate Hardware Trojan attacks in virtual reconfigurable circuits implemented on Field-Programmable Gate Arrays. They eliminated the effect of the Trojan by periodically reconfiguring the circuit, but it was difficult to determine the appropriate configuration period. Liu et al. [
178] proposed a security mapping methodology—a dynamic resource management strategy based on security values—to strengthen protection against Hardware Trojan attacks by selectively protecting the processing units in coarse-grained virtual reconfigurable arrays. However, this method relies on a triple modular redundant architecture to mask possible attacks, has a high resource overhead, does not fundamentally prevent Trojan attacks, and will fail completely if multiple identical modules are implanted with Hardware Trojans.
In general, research on Hardware Trojan protection based on evolutionary hardware is still exploratory, and existing results remain at the stage of theoretical study and the evolution of small-scale circuits. One reason is that evolutionary hardware platforms developed relatively late; another is that the reconfiguration technology central to evolutionary hardware research is controlled by a few large companies, whose underlying hardware layout and routing rules are kept confidential. As a result, many theoretical studies have not yet proven effective for evolving larger-scale physical circuits. In addition, the huge time overhead of the evolution process usually cannot meet the real-time requirements of high-reliability systems, which is another issue evolutionary hardware must address in hardware attack protection applications.
7. Reliability Evaluation Indicators
So far, there is a lack of effective performance metrics and evaluation criteria for assessing the reliability of FPGA-based reconfigurable systems. Currently, commonly used performance evaluation metrics include the following: mean time to failure, mean time to failure detection, mean time to repair, mean time to restore service, mean time between failures, accuracy, false negative rate, false positive rate, true negative rate, true positive rate, precision, recall, etc. The definition of each metric is described below.
Mean Time to Failure (MTTF): This is a key indicator of system reliability, representing the average time from operation to service termination due to failure in an irreparable system or product. It can be simply understood as the average service life.
Mean Time to Failure Detection (MTTD): This represents the average time between a system failure and the first detection of the problem, i.e., the average length of time a problem exists before it is detected. It can be calculated by dividing the total detection time of events in a given period by the total number of events.
Mean Time To Repair (MTTR): This measures the efficiency of troubleshooting and repairing faults, referring to the average time from the beginning of repair to the system returning to normal operation. It can be obtained by dividing the total system repair time within a certain period by the total number of events. The smaller the MTTR, the stronger the maintainability and recoverability of the system.
Mean Time to Restore Service (MTRS): This represents the service recovery time, which measures the average time it takes for a system to recover from an unavailable state to a normal available state. It is numerically equal to the average unavailable time of the system.
Mean Time Between Failures (MTBF): This is one of the key indicators for measuring system reliability and availability. It refers to the average time that a repairable system experiences from the end of the previous failure to the occurrence of the next failure during operation, representing the average available time of the system.
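For clarity, the calculation rules stated above can be summarized compactly in LaTeX (a standard formulation; N denotes the number of failure events in the observation window, and A denotes steady-state availability under the MTBF definition given above):

```latex
\mathrm{MTTD} = \frac{1}{N}\sum_{i=1}^{N} t_i^{\mathrm{detect}}, \qquad
\mathrm{MTTR} = \frac{1}{N}\sum_{i=1}^{N} t_i^{\mathrm{repair}}, \qquad
A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}}
```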
True Positive (TP): This is the number of Trojan-free ICs that are correctly identified as Trojan-free.
True Negative (TN): This is the number of Trojan-infected ICs that are correctly identified as Trojan-infected.
False Positive (FP): This is the number of Trojan-infected ICs that are incorrectly identified as Trojan-free.
False Negative (FN): This is the number of Trojan-free ICs that are incorrectly identified as Trojan-infected.
Genuine Positive (GP): This is the total number of ICs that are actually Trojan-free.
Genuine Negative (GN): This is the total number of ICs that are actually Trojan-infected.
True Positive Rate (TPR): This refers to the ratio of the number of ICs found to be TP to the total number of GPs.
True Negative Rate (TNR): This refers to the ratio of the number of ICs found to be TN to the total number of GNs.
False Positive Rate (FPR): This refers to the ratio of the number of ICs found to be FP to the total number of GNs.
False Negative Rate (FNR): This refers to the ratio of the number of ICs found to be FN to the total number of GPs.
Accuracy: This refers to the ratio of the number of ICs under test that are correctly identified to the total number of ICs. Specifically, it can be expressed as
\[ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Sensitivity: The proportion by which the original circuit parameters are affected by a Hardware Trojan. Sensitivity is often used in the evaluation of SCA methods and can be expressed as
\[ \mathrm{Sensitivity} = \frac{\lvert C_{\mathrm{Trojan}} - C_{\mathrm{pure}} \rvert}{C_{\mathrm{pure}}} \]
where \(C_{\mathrm{Trojan}}\) denotes the side-channel (bypass) characteristics of the circuit after implantation of the Hardware Trojan and \(C_{\mathrm{pure}}\) denotes the side-channel characteristics of the pure circuit.
Precision: This refers to the ratio of the number of ICs judged to be TPs to the total number of TPs and FPs:
\[ \mathrm{Precision} = \frac{TP}{TP + FP} \]
Recall: This refers to the ratio of the number of ICs judged to be TPs to the total number of TPs and FNs:
\[ \mathrm{Recall} = \frac{TP}{TP + FN} \]
8. Reliability Technology Development Trend
In this section, we mainly statistically analyze the development trend and main research contents of each high-reliability key technology in FPGA-based reconfigurable systems, and based on this, we elaborate the hotspots or urgent problems to be solved in future research.
8.1. Analysis of Development Trend on Reliability Technology
- A.
Fault tolerance technology for FPGA-based reconfigurable systems
Figure 8 shows the number of academic papers on fault tolerance of reconfigurable systems published worldwide each year from 2012 to 2023, together with the fitted power trend line. As can be seen from
Figure 8, the number of relevant papers published globally was high in 2012 and 2013 and then showed a year-on-year decreasing trend; from 2018 to the end of 2022, the number of published papers gradually stabilized. Research on fault tolerance of reconfigurable systems focuses on the following aspects: from the viewpoint of the fault tolerance process, it includes fault injection, fault diagnosis, fault detection, fault-tolerant control, etc.; from the viewpoint of fault tolerance technology, it includes modular design, hardware design, checkpoint design, and other fault-tolerant designs; from the viewpoint of the types of faults covered, it mainly addresses transient faults, such as single-particle (single-event) bit-flip faults, as well as hard faults.
- B.
Aging mitigation technology for FPGA-based reconfigurable systems
Figure 9 shows the number of academic papers on the aging of reconfigurable systems published worldwide each year from 2012 to 2023, together with the fitted power trend line. As can be seen from
Figure 9, the number of relevant papers published globally was high in 2012 and 2014, and from 2015 to the end of 2023 the annual number of relevant papers remained relatively stable. Over the past decade, research on aging protection of reconfigurable systems has focused on the following aspects. From the perspective of the aging protection process, it includes aging prediction, aging monitoring, aging testing, aging mitigation, etc.; among these, aging monitoring is an important class of techniques, covering monitoring systems, aging sensors, temperature sensors, delay monitoring, and so on. From the perspective of aging mechanisms, it includes studies of the EM, TDDB, HCI, and BTI effects; the latter two (HCI and BTI) are the mechanisms researchers are most concerned about in the nanoscale CMOS era. At the same time, aging research can also be combined with hardware security research, for example on physical unclonable functions (PUFs).
- C.
Hardware defense techniques for FPGA-based reconfigurable systems
Figure 10 shows the number of academic papers related to hardware protection of reconfigurable systems published globally each year from 2012 to 2023, together with the fitted power trend line. As can be seen from
Figure 10, the number of relevant papers published globally was higher before 2015 and then gradually decreased; from 2018 to the end of 2022, the number of relevant research papers stabilized. In recent years, research on hardware protection for reconfigurable systems has focused on Hardware Trojans and physical unclonable functions. Hardware Trojan detection accounts for the largest share of this work, with side-channel (bypass) analysis and logic testing being the two most studied aspects; in addition, reverse engineering, design obfuscation, blank cell filling, and other protection strategies have also received some attention. With the development of programmable devices and evolutionary algorithms, Hardware Trojan mitigation strategies based on evolutionary hardware have gradually become a research hotspot. With the development of IoT and information technology in recent years, PUFs are attracting attention as an effective device authentication method, with SRAM PUFs, Ring Oscillator PUFs, and arbiter PUFs being the most mainstream research objects at present.
8.2. Reliability Key Technology Perspective of FPGA-Based Reconfigurable Systems
FPGA-based reconfigurable systems have become one of today's indispensable hardware platforms due to their functional flexibility and balanced computing performance, but with the increasing level of system integration, the reliability requirements on the overall architecture have become more stringent. In particular, when systems are deployed in complex and harsh scenarios, reliable computing faces new challenges. At the same time, man-made attacks against reconfigurable systems can seriously threaten system reliability and security. It is noteworthy that in recent years, the development of artificial intelligence and machine learning technologies has provided new ideas for ensuring the high reliability of reconfigurable systems. Therefore, based on the existing research, we believe that the following aspects will become the main research hotspots in the future.
- A.
Design of a multi-level fault tolerance mechanism for flipping faults
Specialized devices such as reconfigurable embedded CNC systems and reconfigurable micro-robots are currently implemented on FPGAs and are typically deployed in space or other complex environments to perform their tasks. The main threats to such systems/devices are failures caused by high-energy particle strikes in the environment, with single- and multiple-bit flips being the most common. Current fault tolerance strategies for such failures are mainly design-based approaches and hardware-hardening approaches. The design-based approach operates at the application layer and suffers from high resource overhead; the hardware-hardening approach operates at the physical layer and suffers from high cost and a long development cycle. This paper has mainly addressed the high resource overhead caused by the triple modular redundant architecture in the design-based approach, without further considering improving the fault tolerance of the overall architecture by combining it with hardware hardening. Future work can combine design-based and hardware-hardening approaches to achieve multi-level fault tolerance for reconfigurable systems: physical hardening can be applied to the critical parts of the reconfigurable system or the partitions executing critical tasks, while the design-based approach further ensures reliable operation at the application layer, effectively reducing development cost and cycle time while improving system reliability.
- B.
Research on intelligent fault tolerance for reconfigurable systems
In current research on fault tolerance for reconfigurable systems, the focus is mainly on the design of fault tolerance methods, such as the use of tri-mode redundancy architecture to ensure uninterrupted operation after failure. With the development of artificial intelligence technology, the achievement of intelligent fault tolerance has also gradually become a trend. Intelligent fault tolerance methods based on evolutionary hardware have been implemented in reconfigurable systems, such as virtual reconfigurable circuits and coarse-grained reconfigurable arrays, but they are still at a preliminary stage and can only be applied to small-scale evolution. In addition, the time-consuming evolution process makes it unacceptable for real-time fault-tolerant systems. Future work will explore the implementation of intelligent evolutionary fault tolerance on larger-scale architectures and will focus on improving evolutionary efficiency.
- C.
Research on fine-grained fault localization in reconfigurable systems
Existing fault detection methods for reconfigurable systems mainly operate at the functional module level. Exploiting reconfigurability, when a fault is detected in part of a functional module, the faulty part is replaced with a resource of the same granularity to achieve fault tolerance. However, coarse-grained fault detection and resource replacement lead to a degree of resource waste. In particular, for resource-constrained application scenarios, this exacerbates the tension between ensuring system reliability and keeping resource overhead low. Therefore, future work can pursue precise fault localization and finer-grained resource replacement to reduce resource overhead.
- D.
Research on aging assessment/prediction of reconfigurable systems
As CMOS processes shrink, the effects of aging on integrated circuits are exacerbated. Reconfigurable systems face the same aging threat to the normal lifecycle of devices. Particularly in unattended, long-lead-time applications such as space exploration, software/hardware failures due to aging have a serious impact on system reliability. Therefore, there is a strong need to assess/predict the aging of architectures in order to take countermeasures in advance. However, the underlying hardware of reconfigurable systems contains a variety of electronic components and interconnects that are subject to different major aging effects. For example, the aging effects of transistors are mainly hot carrier effects and Bias Temperature Instability effects, while the aging effects of metal interconnects are mainly electromigration effects. To achieve preventive maintenance, future work could focus on the assessment/prediction of aging in reconfigurable systems.
- E.
Research on Hardware Trojan detection methods for “non-golden design”
Existing Hardware Trojan detection methods mostly rely on a “golden design”, such as classification methods based on supervised learning and side-channel analysis methods. However, a “golden design” is usually difficult to obtain in practice and, owing to manufacturing-process variations and other factors, is not universal, so methods based on a “golden reference” are severely limited. Future work can therefore investigate Hardware Trojan detection methods that require no “golden reference”, for example unsupervised learning methods that filter out Trojan-infected chips/circuits through clustering.
- F.
Hardware Trojan detection for reconfigurable systems
Existing Hardware Trojan detection methods are mainly aimed at detecting Hardware Trojans in a specific circuit/functional module. However, they do not consider the detection of Hardware Trojans at the system/architecture level. Some Hardware Trojans, such as those causing information leakage and increased power consumption, are often difficult to detect, especially when they are hidden in the whole system. In this paper, we focused on Hardware Trojans that disrupt functionality and cause malfunctions in reconfigurable systems, which are relatively easy to detect when they cause obvious malfunctions or disruptions, but we neglected to consider the detection of other Hardware Trojans. Future work will target the detection and location of non-function-destroying Hardware Trojans to improve hardware security and overall system reliability.
8.3. Open Challenges and Key Issues
Although significant progress has been made in high-reliability technologies for FPGA-based reconfigurable systems, there remain three core challenges in actual deployment and future development that require focused breakthroughs.
8.3.1. Integration with Edge Computing Scenarios
In the AIoT era, reconfigurable systems deployed at edge nodes must simultaneously meet requirements for low power consumption, real-time responsiveness, and dynamic multi-task adaptation. However, the edge environment is characterized by three key factors: hardware heterogeneity, dynamic variability, and harsh operating conditions. These pose new demands on reliability technologies: power constraints limit the use of conventional monitoring and scrubbing techniques, while real-time requirements demand millisecond-level fault detection and recovery. Existing reliability solutions are mostly designed for fixed scenarios and lack adaptive capabilities for heterogeneous edge environments. This challenge provides a clear direction for the research on “intelligent fault tolerance” and “aging assessment/prediction” discussed in
Section 8.2. Future development should focus on developing low-power adaptive fault-tolerant algorithms, fast fault detection mechanisms, and aging prediction models tailored to dynamic edge environments.
8.3.2. Deepening Cybersecurity Challenges in Reconfigurable Environments
As reconfigurable systems are increasingly deployed in critical domains, hardware attack methods are becoming more sophisticated, posing new challenges to traditional protection techniques. First, attackers can exploit dynamic reconfiguration to tamper with bitstreams and implant “dynamic Trojans,” whose behavior changes with configuration switching, making them difficult to detect using static detection methods. Second, the distributed deployment of edge nodes exacerbates supply chain attack risks, as the untrustworthiness of third-party IP cores and foundries becomes harder to trace. Third, there is insufficient coordination between reliability and security protections—for example, fault tolerance mechanisms may be exploited, while security hardening may introduce additional reliability risks such as increased latency and aging. Existing research often focuses in isolation on Trojan detection or fault tolerance design, lacking a unified “reliability–security” protection framework. This challenge is closely related to the topics in
Section 8.2, including “system-level Hardware Trojan detection,” “Hardware Trojan protection strategies,” and “non-golden design detection.” Future efforts should focus on system-level end-to-end protection, breakthroughs in reference-free detection technologies, and the development of a coordinated “fault tolerance-attack prevention-aging resistance” framework to enhance the security and reliability of reconfigurable systems.
9. Critical Analysis and Systematic Comparison of Core Technologies
To clarify the current status and practical bottlenecks of high-reliability technologies for FPGA-based reconfigurable systems, this section provides a systematic comparison and critical analysis of the mainstream core technologies previously introduced, focusing on functional features, performance metrics, and deployment constraints, so as to offer actionable guidance for technology selection and future optimization.
9.1. Classification and Comparative Framework of Core Technologies
Building on prior studies, high-reliability core technologies are grouped into four categories: fault-tolerant design, aging mitigation, hardware attack defense, and monitoring and prediction. Six key dimensions—core strengths, key limitations, resource overhead, response speed, applicable scenarios, and practical defects—are selected to form a comparative analysis framework.
9.2. Analysis of Defects in Mainstream Technologies
Based on the above comparison, this section conducts an in-depth analysis of the practical weaknesses and core deficiencies of key technologies.
Table 4 provides a comparative overview of various reliability technologies and their core advantages.
9.2.1. Fault-Tolerant Design: Imbalance Between Resources and Reliability
Triple Modular Redundancy (TMR) trades more than 200% hardware cost for reliability, which is generally unaffordable for edge nodes and leaves systems vulnerable to common-mode faults in harsh environments such as space. Error Correction Code (ECC) imposes low resource overhead but can only repair transient configuration memory faults; it is ineffective for logic faults that constitute more than 30% of observed failures, thus providing limited overall reliability. Dynamic Partial Reconfiguration (DPR) enables fine-grained repair, yet it requires precise fault localization and precompiled bitstreams. Its slow bitstream generation and susceptibility to localization errors hinder real-time deployment and may even induce secondary faults.
9.2.2. Aging Mitigation: Conflict Between Monitoring Accuracy and Practical Fit
Aging sensors can monitor delay variations in real time, but a single sensor covers only a localized region. Deploying sensors across the entire chip consumes more than 10% of on-chip resources, and the sensors themselves are subject to aging, leading to long-term drift in measurement accuracy. Dynamic Voltage and Frequency Scaling (DVFS) can alleviate certain aging effects by reducing frequency or increasing voltage; however, raising voltage exacerbates Hot Carrier Injection (HCI), creating a paradox where mitigating one aging mechanism accelerates another, potentially shortening device lifetime. Placement and remapping techniques rely on accurate aging assessment; yet, aging rates differ significantly across chip regions, making precise modeling difficult and sometimes intensifying local stress concentrations.
9.2.3. Hardware Attack Defense: Trade-Off Between Detection Capability and Deployment Cost
Static detection techniques, such as gate-level netlist analysis, depend on a “golden design” reference. Process variations and hidden modifications in third-party IP cores result in inherent discrepancies between the reference and manufactured chips, producing false positive rates greater than 25% and an inability to detect runtime “dynamic Trojans” implanted via bitstream tampering. Side-channel analysis (SCA) leverages power and timing signatures for Trojan detection, but it is highly susceptible to process noise and environmental interference, achieving detection rates below 30% for Trojans occupying less than 0.1% of chip area. Moreover, SCA requires specialized test equipment and complex data processing, making field deployment impractical. Adaptive Triple Modular Redundancy (ATMR) resists attacks through heterogeneous modules but introduces additional design and verification complexity, lacks unified standards, and exhibits poor compatibility across FPGA platforms, remaining largely confined to laboratory environments.
9.2.4. Monitoring and Prediction: Data Dependency and Real-Time Limits
Machine learning (ML)-based aging prediction models require extensive long-term labeled data. Since FPGA aging cycles span multiple years, obtaining full-lifecycle data is extremely challenging, resulting in poor model generalization and error rates exceeding 40% across different fabrication processes and application scenarios. Ring Oscillator (RO) delay monitoring employs a simple structure but is sensitive to temperature and voltage fluctuations, exhibiting measurement errors of 10–15% under identical aging conditions. Furthermore, RO monitoring reflects only overall aging trends and cannot identify aging damage in fine-grained resources such as CLBs and LUTs, limiting its effectiveness for precision maintenance.
9.3. Implications for Technology Optimization
The principal conflicts in current technologies center on the trade-offs among reliability, resource consumption, and detection accuracy versus deployment cost, as well as the gap between theoretical performance and practical feasibility. Future research should pursue reductions in legacy technology overhead (e.g., dynamic TMR scheduling, integration of ECC and DPR), overcome dependency bottlenecks (e.g., golden-free Trojan detection, few-shot aging prediction), and promote multitechnology synergy (e.g., integrated frameworks for fault tolerance, attack defense, and aging resistance) to better address the demands of complex application scenarios.
10. Conclusions
The reliability of reconfigurable systems is a growing research focus. This paper examines the threat models and core issues of FPGA-based reconfigurable systems, compares reliability research on faults, aging, and hardware attacks, summarizes achievements and shortcomings, and highlights the trend toward intelligent, fine-grained, and collaborative development.
The core contributions of this work are threefold. First, we establish a comprehensive review framework encompassing “threat model–core technology–evaluation metric–challenge,” integrating the classification and quantitative characteristics of four fault-tolerance, three aging-mitigation, and two attack-defense technology families. Second, through critical analysis, we reveal critical bottlenecks such as TMR's resource overhead and the dependence of Hardware Trojan detection on a “golden design.” Third, we identify three open challenges—the reliability–resource–delay trade-off, edge integration, and security–reliability synergy—along with seven future research directions, including intelligent fault-tolerant optimization.
Gaps remain in 3D trade-off mechanisms under resource limits, low-power edge adaptability, and protection against dynamic Trojans and supply chain attacks. Future work should focus on multitechnology co-optimization to bridge theory and practice.
This study provides actionable guidance for topic selection and engineering decisions, advancing FPGA-based reconfigurable system reliability.