Facilitating Autonomous Systems with AI-Based Fault Tolerance and Computational Resource Economy

Abstract: Proposed is the facilitation of fault-tolerant capability in autonomous systems with particular consideration of low computational complexity and system interface devices (sensor/actuator) performance. Traditionally, model-based fault-tolerant/detection units for multiple sensor faults in automation require a bank of estimators, normally Kalman-based ones. An AI-based control framework enabling low computational power fault tolerance is presented. Contrary to the bank-of-estimators approach, the proposed framework exhibits a single unit for multiple actuator/sensor fault detection. The efficacy of the proposed scheme is shown via rigorous analysis for several sensor fault scenarios for an electro-magnetic suspension testbed.


Introduction
Modern control systems require careful, reliable and economic design with maximum performance, which normally imposes several design trade-offs (economic design, reliability, performance). Reliability is vital in control systems, especially in safety-critical systems (i.e., where faults must be accommodated before the impaired system becomes unstable). In non-safety-critical cases, such as production lines, reliability supports a normal operation regime and avoids production delays and/or unnecessary maintenance. In areas more aligned to autonomy, such as Unmanned Aerial Vehicles (UAVs), control methods for reliability add to the computational load on already limited resources [1][2][3][4][5][6].
Autonomous systems must be trustworthy, and trustworthiness has been a popular topic of discussion in the current autonomous systems literature [7,8]. Reliability facilitates trustworthiness, and in this context fault accommodation can be achieved with the a priori design of a controller that is able to take remedial actions so that the stability of the control system is maintained, even with degraded performance. The stability and performance of a control system depend upon the healthy operation of its interfaces (actuators and sensors), and various approaches to designing such capability appear in the literature (both model-based and model-free methods) [9][10][11][12][13][14][15][16][17]. Fault-tolerant control (FTC) supports reliability [18], and the approach is normally classified as either Passive (PFTC) or Active (AFTC) [19]. The Passive FTC type requires a priori knowledge of the faults, while the Active type (used in this work) does not necessitate such knowledge; rather, it requires a Fault Detection and Isolation (FDI) mechanism with reconfigurable control. Reconfigurable FTC has gained significant attention over recent years given the demand for reliable system design [20][21][22], and in the area of cyber-physical systems [23].
Regarding sensor fault tolerance, especially after sensor failure, a few methods exist that use the information from the remaining healthy sensors in order to reconstruct the lost signal of the faulty ones [24]. These methods include the use of a bank of Neural Networks (NNs) or the use of Kalman Estimators (KEs) [1,25]. Both approaches are worth considering when aiming to avoid sensor redundancy. In contrast to the KE approach, NNs have increased False Alarm Rates (FAR), mainly because they leave a very small residual after fault estimation [24]. Despite that, they are widely used since they can be designed without precise knowledge of the model of the system under test [26][27][28][29]. In the above approaches, where many actuators/sensors exist, an (in-parallel) bank of estimators is employed for the detection of multiple faults [30]. For example, if there is one actuator with $n_y$ sensors, then the number of sensor fault combinations that could happen is $2^{n_y} - 1$ (where $n_y$ is the total number of sensors, assuming that not all sensors can fail). Hence, detecting those faults requires the same number of estimators. However, this increases the complexity of the control design and requires additional computational resources, since the estimators must work in parallel.
NNs have been used to a great extent in many engineering fields including control systems [31,32], as well as for Fault Detection (FD) methods in FTC systems [33,34] and more specifically in sensor FDI [24,35,36]. We propose an AI-based FD mechanism, referred to as iFD, based on a Neural Network approach, which performs a similar task to the conventional bank-of-estimators FD but offers substantially reduced computational complexity. This is depicted in Figure 1, where the bank of estimators running in parallel is shown with dotted lines. The authors, in their brief paper [19], presented the original concept framework from an automatic fault-tolerant control viewpoint. This paper considerably extends the original results (with a particular emphasis on their interpretation) and presents how reliable system autonomy can be facilitated. We discuss the framework solution with the help of a practical example, i.e., an Electro-Magnetic Suspension (EMS) testbed, which typically forms the suspension platform of a Maglev train, supporting the vehicle compartment and maintaining acceptable passenger ride quality. The rationale behind this choice is that the EMS (Maglev) is an inherently unstable, non-linear, safety-critical system, subject to non-trivial control performance and reliability requirements (hence offering a challenging application for the validation of the work). Sensor faults are first modelled, and then simulation results using various fault scenarios showcase the proposed method.
The rest of the paper is organized as follows: Section 2 describes the proposed iFD approach, including a short description of the NN training, and Section 3.1 shows the efficacy of the proposed method based on the analysis of the results (with the help of the practical example of the EMS; the test case details are discussed in Appendix A). Conclusions are discussed in Section 4.

Proposed Fault Detection Scheme
We employ the FD unit to detect actuator/sensor faults and activate controller reconfiguration. The proposed FD scheme is NN-based and the concept is depicted in Figure 2. Industrial systems are typically Multiple-Input, Multiple-Output systems. Control inputs relate to actuation (i.e., industrial drives, motors, pumps, electro-magnets etc.) and are indicated in a control setting by the variable $U$. Outputs measure several useful parameters (both for control and for monitoring purposes) and are typically indicated by the variable $Y$. Control design satisfies a desired performance relevant to the application. When one or more actuators and/or sensors are impaired, those signals are distorted, leading to performance degradation or possibly instability of the closed loop. The sets of actuators and sensors are defined as $U = [u_1, u_2, \ldots, u_{n_u}]$ and $Y = [y_1, y_2, \ldots, y_{n_y}]$, where $u_j$ is the $j$th actuator, $y_j$ is the $j$th sensor, $n_u$ is the total number of actuators and $n_y$ is the total number of sensors.
The control loop features a bank of controllers $[K_1, K_2, \ldots, K_{n_{yu}}]$ and two isolation units for isolating faulty actuator and sensor signals when faults occur. The iFD mechanism is employed to detect the faults and comprises an NN-based estimator, a Residual Generator (RG) and a Decision Mechanism (DM). The NN estimator is trained in such a way that the actuator and sensor signals are estimated and fed into the RG, which in turn compares the real signals with the estimated ones. As soon as the RG has completed the comparison, it passes the residuals to the DM, which decides whether a component is faulty or not.
The key point in the proposed iFD method is the estimator training approach to observe $U$ and $Y$ (the training process is discussed in Section 2.1).
The inputs to the estimator are obtained from the Binary Switches (BS) depicted in Figure 2. Each BS has three inputs. The first represents the real measured values of $U$ and $Y$, and the second comes from the functions $C_{u_j}$ and $C_{y_j}$, defined as $C_{u_j} = [c_{u_1}, c_{u_2}, \ldots, c_{u_{n_u}}]$ and $C_{y_j} = [c_{y_1}, c_{y_2}, \ldots, c_{y_{n_y}}]$. $C_{u_j}$ and $C_{y_j}$ are two arrays that contain predefined functions, used during the training and operation of the iFD. Selecting these functions benefits from designer experience, as they tend to be application dependent. The third input ($IS_{y_j}$) is a binary input which controls the switching operation between the other two inputs, e.g., from $y_1$ to $c_{y_1}$. The output $y_{BS_j}$ of the BS is given by (1). Residual generation is a vital task, although it is out of the scope of this paper. The moving average filter defined in (2) accommodates the noise coming from the sensors in order to reduce the FAR [37].
In (2), $r_{y_j}$ is the residual, $y_j$ and $\hat{y}_j$ are the $j$th real and estimated signals (for the actuators, $y$ is replaced by $u$), and $N$ is the total number of past samples. The DM decides whether one or more actuating and/or sensing components are impaired by comparing the relative residual $r_j$ with a predefined threshold (which the engineer must define). Threshold selection is a non-trivial task, since it impacts both the fault detection sensitivity of the DM and the FAR. A DM that is very sensitive to faults (thresholds too low) increases the FAR, while a less sensitive DM (thresholds too high) may cause instability due to delayed reconfiguration. Threshold selection is normally a trial-and-error process in which designer experience in the application is beneficial. Within the AI remit of fault estimation, threshold selection has received attention [38], but the details of this aspect are beyond the scope of this paper. In the proposed iFD, the Reconfiguration Signal (RS) at the output of the DM enables controller reconfiguration. The proposed iFD setup works as follows:
• Under normal operation, the actuator/sensor signals are estimated and then fed into the residual generator. The generator calculates the residual (a very small value under normal operation) for each actuator/sensor and feeds it to the DM. The DM outputs the $IS_j$ and RS signals.
• When one or more actuators/sensors fail, the corresponding residuals $r_j$ increase and the DM detects the change by comparison with the relevant threshold levels. Next, the ISs switch in order to make the corresponding BS change its output to $c_j$, while the RS feeds the appropriate data value to enable controller reconfiguration. The BSs also isolate the signals of the faulty devices from the estimator so that the latter 'sees' some 'known' data based on its training/knowledge (this is described in the next section).
• Finally, the isolation units remove the faulty devices from the loop and the controller reconfigures to maintain performance and stability.
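The three steps above can be sketched for a single sensor channel. This is a minimal illustration and not the authors' implementation: the class name `ResidualChannel`, the use of a mean absolute error over the last $N$ samples as the residual, the constant value chosen for $c_{y_j}$ and the same-step latching of the switch are all assumptions made for the example.

```python
from collections import deque


class ResidualChannel:
    """One sensor channel: moving-average residual, threshold test, latched switch."""

    def __init__(self, window, threshold, c_value=0.0):
        self.errors = deque(maxlen=window)  # last N absolute estimation errors
        self.threshold = threshold          # chosen by the designer
        self.c_value = c_value              # known function c_yj (a constant here)
        self.isolated = False               # IS_yj, latched once a fault is declared

    def residual(self, measured, estimated):
        """Mean absolute error over the stored samples (assumed residual form)."""
        self.errors.append(abs(measured - estimated))
        return sum(self.errors) / len(self.errors)

    def step(self, measured, estimated):
        """Return (estimator input y_BS, isolation flag) for one sample."""
        if not self.isolated and self.residual(measured, estimated) > self.threshold:
            self.isolated = True            # the decision mechanism declares a fault
        y_bs = self.c_value if self.isolated else measured
        return y_bs, self.isolated


# Hypothetical data: the sensor output jumps at the fourth sample, then recovers.
channel = ResidualChannel(window=5, threshold=0.5)
samples = [1.0, 1.02, 0.98, 5.0, 5.0, 5.0, 1.0]
outputs = [channel.step(y, 1.0) for y in samples]
```

In this sketch the switch stays on `c_value` even when the last sample returns to a healthy value, which mirrors the treatment of faults as permanent once they are captured.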

Offline Training of the iFD Unit: Obtaining the Learning Set
As in all AI-based solutions, training of the algorithm is the key point. The NN iFD unit is trained on data accumulated from an extensive set of scenarios on subsets of the main sensor set, $Y$.
The collected data are then packed in the structure shown in Table 1. The first column presents the sensor set number, from 1 to $n_{yn}$ (the total number of sensor sets). The second column shows the status of the sensor set, i.e., all possible sensor/actuator fault scenarios are covered, and the next two columns show the measured sensor and actuator signals. The last two columns reflect the estimated sensor and actuator signals. The data set $D$ has dimensions $d_r \times d_c$, with $d_r = n_{yn}\,k$ and $d_c = n_y + n_u + n_{\hat{y}} + n_{\hat{u}}$ (4), where $k$ is the number of data points per sensor set and $n_{\hat{u}}$ and $n_{\hat{y}}$ are the numbers of estimated signals for actuators and sensors, respectively. $D$ is constructed with data from numerical simulations of each sensor/actuator fault scenario. Where a sensor and/or actuator is assumed to be faulty, a known function $c_{u_j}$, $c_{y_j}$ replaces its $k$ data points. In an automatic setup, the design engineer is required to select a set of functions $C_{u_j}$ and $C_{y_j}$ to replace the non-predicted outputs from the faulty actuators and sensors respectively. In an autonomous setting, experience from automation and the use of reinforcement learning design will enable the aforementioned choice (this part is studied as future research and is not the main aspect of the work presented here).
When an actuator and/or sensor fault occurs, the corresponding function $c_{u_j}$ and/or $c_{y_j}$ is connected to the iFD. This is a result of the iFD learning capability to respond to sensor/actuator faults, in such a way that the iFD unit itself continually checks for faults on the full sensor set and its subsets.
An electro-magnetic suspension (EMS) system testbed (details in Appendix A) is used as the practical platform to illustrate the proposed solution.

Visualizing the iFD Applied on the EMS Example
The EMS system is excited by a single control input (i.e., $n_u = 1$), $U = \{u_c\}$, and provides four output measurements (i.e., $n_y = 4$), namely $Y = \{i, (z_t - z), \dot{z}, \ddot{z}\}$. The outputs are employed for controller design. The $H_\infty$ Loop-Shaping Design Procedure (LSDP) robust control approach was used to achieve the necessary closed-loop performance. Please note that the airgap $(z_t - z)$ is required by default as a standard input [39]. With $(z_t - z)$ as a default measurement, the total number of actuator/sensor sets is $n_{yn} = 8$. Hence, eight LSDP controllers (i.e., $K_{(z_t - z)}$, $K_{i,(z_t - z)}$, $K_{(z_t - z),\dot{z}}$, $K_{(z_t - z),\ddot{z}}$, ..., etc.) are used to cover the spectrum of (seven) possible sensor fault combinations that could occur, as indicated in Figure 3. Details on the design of the LSDP controllers are discussed in [40]. The iFD concept presented here can be considered for the detection and isolation of sensor faults in a typical FTC system arrangement for other engineering test platforms. In the case of one or more sensor failures, the relevant fault is detected and isolated (note that the isolation switches and the binary switches are merged together as shown in the figure) and the system switches to the alternative controller $K_{\bullet}$ (for closed-loop purposes).

Sensor faults can be categorized into additive and multiplicative categories. As shown in Figure 4, there exist three types of faults in each category, namely: (i) abrupt or step-type faults, (ii) incipient or soft faults and (iii) indeterminate faults. The faults accounted for in this work fall into the additive and multiplicative categories and are of both abrupt and incipient types. For these faults, a function $f_a$ is added to the output of the sensor, or the output is multiplied by a function $f_m$, respectively. Indeterminate faults could also belong to any combination of the aforementioned faults, but they are assumed to occur in time windows of random width and with random amplitudes. These faults are treated as permanent by the iFD, and they are 'captured' once they appear.
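The count of eight sensor sets for the EMS example can be checked by enumeration. The following is a small sketch that assumes the airgap is always available and that each of the other three sensors is either healthy or isolated; the sensor labels and the `controller_for` naming scheme are hypothetical and only illustrate a one-controller-per-set lookup.

```python
from itertools import combinations

ALWAYS_HEALTHY = ("airgap",)                          # (z_t - z), assumed never to fail
CAN_FAIL = ("current", "velocity", "acceleration")    # i, z-dot, z-double-dot


def sensor_sets():
    """List every healthy-sensor subset that still contains the airgap."""
    sets = []
    for size in range(len(CAN_FAIL), -1, -1):         # full set first, airgap-only last
        for kept in combinations(CAN_FAIL, size):
            sets.append(ALWAYS_HEALTHY + kept)
    return sets


def controller_for(healthy):
    """Hypothetical lookup key: one controller per healthy sensor set."""
    return "K_" + "_".join(sorted(healthy))


all_sets = sensor_sets()
controllers = {frozenset(s): controller_for(s) for s in all_sets}
n_fault_combinations = len(all_sets) - 1              # every set except the healthy one
```

Under these assumptions the enumeration yields one full (healthy) set plus seven degraded sets, matching the eight controllers quoted in the text.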

Neural Network Algorithm
A plethora of NN algorithm types exist in the research literature; in this paper, a dynamic non-linear input-output Neural Network model with tapped delay lines at the input is employed for time-series prediction. The NN's task is to perform similarly to the more conventional bank of Kalman estimators (KE) in the feedback loop, as well as to predict future values based on past values of one or multiple time-series. In particular, it predicts the series $\phi(t)$ based on $n_c$ past values of the series $\theta(t)$, so that $\phi(t) = \lambda(\theta(t-1), \ldots, \theta(t-n_c))$.
This type of NN algorithm is adapted to fit the inputs and targets of the suspension for the iFD realization. It has a total of five inputs ($u_c$ and $i$, $(z_t - z)$, $\dot{z}$, $\ddot{z}$) and three estimated outputs ($\hat{i}$, $\hat{\dot{z}}$, $\hat{\ddot{z}}$). Its internal architecture is shown in Figure 5, and is realized as a hidden layer (with one delay and 20 hidden neurons) and an output layer, with sigmoid and linear functions respectively. Each neuron's output is generally described by (6), where $o_n$ is the neuron's output, $n_n$ is the number of the neuron's inputs, $w$ are the weights (a vector equal in size to the neuron's inputs), $\Delta$ are the delay lines and $\psi$ is the bias point (considered to be one for all neurons). A fast convergence method for training moderate-sized feed-forward neural networks is the Levenberg-Marquardt backpropagation algorithm (for details see [41,42] (Chapters 11-12)).
The training data (later used to train the NN) were successively collected in equal time windows $T$ for all failing set combinations, with a sampling time $\tau_s$, on an online working state model with the $H_\infty$ LSDP controllers in the feedback loop. The stopping criterion was set to a Mean Square Error of MSE $\leq 10^{-5}$ or a maximum of 1000 epochs.
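The structure described above (five inputs, one input delay, 20 sigmoid-type hidden neurons, three linear outputs) can be sketched as a forward pass. This is only an illustration under stated assumptions: the weights are random placeholders rather than trained values, `tanh` stands in for the unspecified sigmoid function, a unit bias is one possible reading of the bias description, and the names `predict` and `should_stop` are hypothetical.

```python
import numpy as np

N_INPUTS, N_HIDDEN, N_OUTPUTS, N_DELAYS = 5, 20, 3, 1

rng = np.random.default_rng(0)
# Untrained placeholder weights; in the paper they result from Levenberg-Marquardt training.
W_hidden = rng.normal(scale=0.1, size=(N_HIDDEN, N_INPUTS * N_DELAYS))
b_hidden = np.ones(N_HIDDEN)            # assumed unit bias
W_out = rng.normal(scale=0.1, size=(N_OUTPUTS, N_HIDDEN))
b_out = np.ones(N_OUTPUTS)              # assumed unit bias


def predict(theta_past):
    """Estimate phi(t) from the n_c most recent input vectors theta(t-1), ..., theta(t-n_c)."""
    x = np.asarray(theta_past, dtype=float).reshape(-1)   # flatten the tapped delay line
    hidden = np.tanh(W_hidden @ x + b_hidden)              # sigmoid-type hidden layer
    return W_out @ hidden + b_out                          # linear output layer


def should_stop(mse, epoch, mse_goal=1e-5, max_epochs=1000):
    """Stopping rule from the text: MSE goal reached or epoch budget exhausted."""
    return mse <= mse_goal or epoch >= max_epochs


# One past sample of (u_c, i, airgap, velocity, acceleration); the values are arbitrary.
estimate = predict([[0.0, 0.0, 0.0, 0.0, 0.0]])
```

The returned vector has one entry per estimated signal (current, velocity, acceleration); with trained weights it would feed the residual generator.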

Efficacy and Assessment of the Proposed iFD
A rigorous analysis follows next, to show the effectiveness of the proposed iFD unit with results from realistic simulations (MATLAB/Simulink) on the EMS system.
Offline training with a typical NN-based estimator, described in Section 2.3, was performed prior to the FTC design of the suspension. The total training process (1000 epochs) required c. 7 min on a mobile workstation laptop (equipped with an Intel Core i7-9750H @ 2.6 GHz and 32 GB RAM), with the parameters described in the previous section. The total numbers of sensors and actuators, their estimated signals and the number of actuator/sensor sets are $n_y = 4$, $n_u = 1$, $n_{\hat{y}} = 3$, $n_{\hat{u}} = 1$ and $n_{yu} = 8$ respectively.
The training data from each sensor set comprising $D$ were collected using a sampling rate of 1 kHz (i.e., $\tau_s = 1$ ms) over a total simulation time of $T = 6.6$ s. The data set $D$ consists of data subsets $D_d$ and $D_s$, drawn from the deterministic and stochastic suspension responses (subscripts $d$ and $s$ indicate the deterministic and stochastic cases respectively). The data set used for training stacks the two subsets row-wise, $D = [D_d^T \; D_s^T]^T$, where $d_r = d_{r_d} + d_{r_s}$ and $d_c = d_{c_d} = d_{c_s}$ are both found from (4). The total number of columns is calculated as $d_c = 9$, while the total number of samples per set is $k = 6.6 \times 1000 = 6600$, hence $d_{r_d} = d_{r_s} = 52{,}800$ and $d_r = 105{,}600$. The functions used for the training of the NN are $C_{y_j} = \{c_i = 0, c_{\dot{z}} = 0, c_{\ddot{z}} = 0\}$ with dimensions $k \times 3$. These functions are used in both the stochastic and deterministic responses of the suspension where a fault is assumed, as explained in Section 2.1.
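The dimensions quoted above can be reproduced with a short check. The sketch assumes that the column count is the sum of measured and estimated signals and that each disturbance case contributes $k$ rows per sensor set, which is consistent with the figures in the text; the zero-filled arrays are placeholders for the simulation data, and `build_case` is a hypothetical helper.

```python
import numpy as np

SAMPLE_RATE_HZ = 1000
T_SIM_S = 6.6
N_Y, N_U, N_Y_HAT, N_U_HAT = 4, 1, 3, 1
N_SETS = 8                                     # number of actuator/sensor sets

k = round(T_SIM_S * SAMPLE_RATE_HZ)            # samples per sensor set
d_c = N_Y + N_U + N_Y_HAT + N_U_HAT            # measured plus estimated signals


def build_case(n_sets, samples, columns):
    """Placeholder block for one disturbance case (zeros stand in for simulation data)."""
    return np.zeros((n_sets * samples, columns))


D_d = build_case(N_SETS, k, d_c)               # deterministic responses
D_s = build_case(N_SETS, k, d_c)               # stochastic responses
D = np.vstack([D_d, D_s])                      # row-wise stacking of the two cases
```

With these values `D` has 105,600 rows and 9 columns, i.e., 52,800 rows per disturbance case.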
Overall, there are four sensors in the full sensor set ($Y$), while it is taken for granted that the actuator, $u_c$, and the airgap sensor $(z_t - z)$ cannot fail (where fault tolerance is required for the airgap, redundant components can be used under a voting scheme). Three sensor fault possibilities are considered: the current $i$, the vertical velocity $\dot{z}$ and the acceleration $\ddot{z}$. Sensor faults are generally classified into abrupt and incipient fault types. We also use additive and multiplicative faults, with bias, for each one of the sensors $i$, $\dot{z}$ and $\ddot{z}$. Figure 6 illustrates the fault profiles for the current sensor measurement, $i$ (fault profiles with a similar pattern are used for the other sensors in this work). For all cases, faults start developing at time $t_f = 1$ s (marked as point A in all relevant figures). For the impaired sensor, Figure 6a illustrates the normal current value superimposed with a low-frequency, band-limited random signal, $\nu(t)$, with zero mean, white noise characteristics and a power spectral density of $S_i = 5$, limited to $\omega_i = 10$ rad/s (approximately 1.6 Hz). Consequently, the additive/abrupt fault profile for the current sensor ($f_{aa_i}$) is given by (8). The sensor outputs $\dot{z}$ and $\ddot{z}$ follow a similar pattern (same bandwidth), with $S_{\dot{z}} = 0.03$ and $S_{\ddot{z}} = 2$ respectively. Next, Figure 6b depicts the multiplicative/abrupt case ($f_{ma_i}$), described by (9), where the current sensor is suddenly damaged at $t = 1$ s and as a result its output becomes five times larger than normal.
The same fault profile is used for the other two sensors, $\dot{z}$ and $\ddot{z}$. For the current measurement, a bias (abrupt) type of failure is shown in Figure 6e. The sensor output abruptly increases (in a step manner) to its maximum value, i.e., in this case $\max y_{o_i} = 10$ A.
The incipient types of faults are illustrated in Figure 6c and Figure 6d respectively. In Figure 6c, the additive/incipient fault on the current measurement ($f_{ai_i}$) is described by (10); it is a ramp-type signal with slope $\sigma_i$ superimposed with a low-frequency random signal with band-limited white noise characteristics, as previously explained. Figure 6d depicts the multiplicative/incipient fault ($f_{mi_i}$) described by (11), where the fault starts developing at $t_f = 1$ s and the output then falls to zero (due to multiplication by zero).
The fault profiles for $\dot{z}$ and $\ddot{z}$ follow the same behavior as above, with effective slopes $\sigma_i = 20$, $\sigma_{\dot{z}} = 6$ and $\sigma_{\ddot{z}} = 20$ and, for the PSD, $S_i = 0.5$, $S_{\dot{z}} = 0.03$ and $S_{\ddot{z}} = 0.5$ respectively. The following scenario helps to explain the iFD working principle:
• three sensors are impaired sequentially (accelerometer at 0.5 s, velocity at 1.5 s and current at 2 s),
• the deterministic disturbance to the suspension is used and,
• a multiplicative/abrupt fault profile is injected for each sensor at each time instant mentioned above (e.g., for the current sensor see Figure 6b).
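The fault profiles described in this section can be mimicked with a small generator. This is a sketch and not the paper's equations (8)-(11): the function `apply_fault`, its parameter names and, in particular, the linear decay used for the multiplicative/incipient case are assumptions. The band-limited random signal is passed in as an optional array rather than generated here.

```python
import numpy as np


def apply_fault(t, y, kind, t_f=1.0, gain=5.0, slope=20.0, y_max=10.0, noise=None):
    """Return the faulty sensor output; the healthy signal is kept before t_f."""
    t = np.asarray(t, dtype=float)
    y = np.asarray(y, dtype=float)
    nu = np.zeros_like(y) if noise is None else np.asarray(noise, dtype=float)
    elapsed = np.clip(t - t_f, 0.0, None)             # time since fault onset
    if kind == "additive_abrupt":
        faulty = y + nu                                # random signal added at onset
    elif kind == "multiplicative_abrupt":
        faulty = gain * y                              # output scaled, e.g. five times
    elif kind == "bias":
        faulty = np.full_like(y, y_max)                # output stuck at its maximum
    elif kind == "additive_incipient":
        faulty = y + slope * elapsed + nu              # ramp plus random signal
    elif kind == "multiplicative_incipient":
        faulty = y * np.clip(1.0 - slope * elapsed, 0.0, 1.0)   # assumed decay to zero
    else:
        raise ValueError(f"unknown fault kind: {kind}")
    return np.where(t >= t_f, faulty, y)


# Hypothetical constant healthy signal sampled at four instants.
t = np.array([0.0, 0.5, 1.0, 1.5])
y = np.full(4, 2.0)
abrupt = apply_fault(t, y, "multiplicative_abrupt")
ramp = apply_fault(t, y, "additive_incipient")
fade = apply_fault(t, y, "multiplicative_incipient")
```

Injecting the sequential scenario above would amount to calling the generator once per sensor with `t_f` set to 0.5 s, 1.5 s and 2 s respectively.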
The airgap sensor output for the fault-free case and for the aforementioned fault scenario is depicted in Figure 7. The figure illustrates the airgap in the fault-free case (i.e., healthy sensor set, $Y$, with $K_{i,(z_t - z),\dot{z},\ddot{z}}$) and under the fault scenario mentioned. The acceleration sensor is impaired at 0.5 s (point A), and controller reconfiguration follows immediately (a new controller, $K_{i,(z_t - z),\dot{z}}$, is introduced in the loop) in order to maintain the stability and performance of the EMS. The EMS responses in both the fault-free case and the fault scenario comply with the control performance requirements described in Appendix A. Following the acceleration fault, the velocity sensor fails at $t = 1.5$ s (point B) and the current sensor follows at $t = 2$ s (point C). The subsequent faults are successfully detected and accommodated via appropriate switching to $K_{i,(z_t - z)}$ and $K_{(z_t - z)}$ respectively.
The fault accommodation for each sensor failure is carried out in three steps. To assist in explaining the steps of the procedure, the current sensor fault is used as an example: (i) Sensor FD: when the fault occurs at $t = 1$ s, the residual of the current measurement, $r_{y_i}$, starts increasing, and as soon as it passes the threshold (see Figure 8) the fault is detected. (ii) Fault isolation: at this stage the faulty sensor is removed from the loop using a BS, while a 'known' function $c_{y_i} = 0$ is connected at the input of the iFD. Figure 9 shows the signal at the input and output of the BS, as well as the signal at the output of the iFD. (iii) Controller reconfiguration: after the faulty sensor isolation, a reconfiguration signal is generated and the new controller, $K_{(z_t - z)}$, is introduced in the loop.
Careful investigation of Figure 9 (after the fault occurs at point C) shows that the unit detects the fault after a few time steps and that one time step is required for the $BS_{y_i}$ portion to permanently change its output to $c_{y_i} = 0$. Hence, the residual remains large, which is why the BS output will never return to its previous state when/if the fault vanishes. The same figure also shows the input to $BS_{y_i}$ with the two previous sensor faults, i.e., acceleration and velocity (at points A and B respectively).

Table 2 indicates the resulting performance of the suspension and the false alarms (FA) after 70 tests, namely 35 for each of the deterministic and stochastic responses of the suspension. The first column of the table describes the sensor fault scenarios used to test the proposed iFD. Rows 2-4 show the results for single sensor faults that occur at $t = 1$ s, while the remaining rows show the results for sequential faults starting from 0.5 s with a time difference of 1 s. The first six columns present the performance with abrupt faults, while the remaining four show the performance with incipient faults. The track inputs exciting the EMS are discussed in Appendix A. Multiplicative (Mult.) and Additive (Add.) faults are used, as well as a bias (Bis.) fault that occurs abruptly. For each scenario, an entry "✓" indicates that EMS performance is successfully maintained, while an entry "x" indicates the opposite. In addition, if an FA arises, the entry is marked in red.
Close investigation of the aforementioned table of results indicates that the iFD successfully detects the faults and reconfigures the controller under all scenarios (maintaining the appropriate performance levels). In some cases, i.e., in id: 7-8, an FA appears, meaning that a sensor is 'shown' as impaired although it is actually healthy. This is an important finding towards facilitating reliable system autonomy. The threshold setting for the residual plays a substantial role, and this needs to be addressed in a reliable autonomous system (as an FA could hinder perception and hence impact system stability).
If the residual threshold for such sensor cases is increased to avoid the FA, and the sensors themselves then become impaired, the FD could be delayed long enough to cause instability. Two particular issues that have been noted and are being looked at further as future research are: (i) in the NN-trained FD unit, a small residual generally remains after a fault occurs in some scenarios, and (ii) the coupled nature of the closed loop (FD unit, reconfiguration mechanisms, decision making). A deep learning approach is currently being investigated to address that uncertainty envelope.

The sensor FD time investigation is shown in Table 3. Column-wise: column 1 gives the identifier number for the chosen scenario and column 2 lists the sensors (remark: underlined channels indicate faulty ones, with the incident occurring at $t_f$, which is shown in column 3). The rest of the columns refer to the FD time, $t_d$; in particular, the entries marked in boldface indicate that the fault occurrence is detected (while boldface entries with superscript "*" are the false alarms).
Careful investigation of the results shows that with abrupt single sensor failures (id: 1-3) the fault is detected at the instant the sensor fails. In the incipient fault cases, a short delay in the range 0.030-0.399 s is noted. However, this delay does not hinder performance, due to the robust controller. Although delays are observed in FD throughout these three and the next two scenarios, none of them causes FAs. In the last two scenarios (i.e., scenario 6, where two sequential faults on the velocity and acceleration outputs are considered, and scenario 7, where the acceleration, velocity and current sensors are impaired sequentially) FAs appear, mainly in the current sensor case. Please note that the other observed fault detection delays in the iFD are successfully accommodated via the control reconfiguration.

Comparison of the Execution Time
Using the full sensor set, $Y$, it is possible to compare the time taken for a simulation to complete, i.e., the execution time ($t_e$), using the iFD with that for a bank of eight in-parallel KEs, as shown in Figure 10a. The execution time is measured in a high-level simulation on the Simulink platform, iteratively (×100) with each estimator. The execution time for each simulation is illustrated in Figure 10b. The mean simulation time for the iFD is 0.5 s, while that of the bank-of-estimators setup is 7.9 s. Clearly the iFD is c. ×16 faster, in terms of mean execution time, compared to the bank of KEs (in addition, the proposed scheme offers a c. ×13 smaller standard deviation, i.e., 0.15 s). The comparison showcases a two-fold aspect: (i) the efficacy of the proposed fault detection approach in terms of fast actuator/sensor fault detection, and (ii) a level of re-assurance, as the proposed approach performs within the same envelope of performance as the conventional bank-of-estimators approach.

Figure 10c shows the normal probability plots for the conventional bank-of-estimators and the proposed AI-based schemes. Although the mean execution times are very different, it is seen that the former setup is closer to a normal distribution, while the latter NN-based solution favors faster execution time (while providing comparable performance). The importance of this result is two-fold, which we study further: (i) maintaining a similar level of performance/reliability to an acceptable conventional solution enables certification [43], and (ii) correlating the process with descriptive statistics supports faster training for the autonomous system solution. To summarize, the following specific comments are highlighted for the proposed scheme:
1. a single AI-based estimator unit can be used in the FDI instead of the conventional bank of estimators, offering a substantial reduction in computational resources;
2. the AI-based unit maintaining a similar level of performance/reliability to an acceptable conventional solution enables certification [43];

Moreover, the control design requirements of an EMS system are dependent on the train type and its speed [48]. The EMS system should follow the gradient onto the rail (deterministic) and remain insensitive to track irregularities. Figure A1c summarizes the control performance requirements.