Incremental Modeling and Monitoring of Embedded CPU-GPU Chips

Abstract: This paper presents a monitoring framework to detect drifts and faults in the behavior of the central processing unit (CPU)-graphics processing unit (GPU) chips that power embedded systems. To construct the framework, an incremental model and a fault detection and isolation (FDI) algorithm are proposed. The reference model is composed of a set of interconnected, exchangeable subsystems, which allows it to be adapted to changes in the structure of the system or its operating modes by replacing or extending its components. It estimates a set of variables characterizing the operating state of the chip from only two global inputs. Then, through analytical redundancy, the estimated variables are compared to the outputs of the system in the FDI module, which generates alarms in the presence of faults or drifts in the system. Furthermore, the interconnected nature of the model allows for the direct localization and isolation of any detected abnormalities. The implementation of the proposed framework requires no additional instrumentation, as the variables used are already measured by the system. Finally, we use multiple experimental setups to validate our approach and to show that it can be applied to most existing embedded systems.


Introduction
Heterogeneous systems-on-chips (SoC) combine more than one type of processor, generally central processing units (CPU) and graphics processing units (GPU) [1]. They rose to popularity in the early 2000s, thanks to their suitability for most household and entertainment uses. Nowadays, heterogeneous SoCs are at the heart of most modern computer systems and handheld devices, including those used in safety-critical systems (military, aerospace, automobile, etc.) [2]. In these increasingly embedded and mobile systems, manufacturers face the growing challenge of designing chips that offer high performance while maintaining minimal power consumption and manageable thermal output.
The goal of the project behind this study is to develop touchscreens to serve as both the primary displays and controls in the cockpits of the future. Because reliability is an essential aspect of these systems, in this particular study we propose a monitoring framework for the surveillance of the embedded systems powering these touchscreens.
To monitor such chips, we propose an incremental model with an incorporated fault detection and isolation (FDI) algorithm. Together, the pair allows for the early detection of faults and drifts in the system [3]. The idea is to construct an incremental and interconnected modeling structure that is composed of simpler subsystems, each written to estimate only one variable or characteristic [4]. This approach separates the different (and mostly irreconcilable) dynamics, such as discrete and continuous ones. This modeling methodology constitutes the first contribution of this work.
The FDI algorithm aims to detect errors or anomalies in the behavior of the system, mainly in power consumption, operating temperature, and possible software bugs affecting critical system programs or drivers such as the frequency scaling governor. The algorithm relies on analytical redundancy, and as will be described later, the interconnected structure of our model makes it easy to detect and isolate such anomalies, since every subsystem only estimates one variable.
The main advantage of our modeling framework and monitoring algorithm is that they rely only on data and sensors present on most modern chips and can therefore be deployed on most current and forthcoming SoCs with no additional development or adaptation, apart from adapting and training the model. Once the program is implemented, one can easily use its alarms and fault indicators to watch over the operating state of the device and isolate faulty subsystems. It can also be used to investigate the effect of these errors on the rest of the system, and even to view wear traits for diagnosis and prognosis purposes.
The remainder of this paper is organized into eight sections. We start by exploring and summarizing established works on modeling and monitoring embedded chips in Section 2. Then, the proposed modeling and monitoring approach is presented in Section 3, and the experimental setup in Section 4. In Section 5, we detail the modeling process of each subsystem and explain the modeling choices. Section 6 presents the FDI algorithm and describes the process of residual generation and evaluation (the decision-making process). In Sections 7 and 8, we present and discuss experimental results, where the fault detection algorithm is validated by examining residuals in normal and faulty scenarios. Finally, we draw a conclusion from the collected data and obtained results.

Literature Overview
In the approach proposed hereafter, we select a set of variables that characterize the operating state of the SoC (functioning correctly, normal power consumption, etc.). In the first step, our monitoring framework estimates these variables. Then, it monitors them to detect faults and drifts.
To our knowledge, no existing work treats the problem of online monitoring of the operating state of CPU-GPU chips as we do, which shows the gap in the literature that we are trying to address and indicates the novelty of this work. Nevertheless, some works study problems that are close to ours. In the remainder of this section, we explore the main research done in these areas, limiting our scope to works done on embedded or mobile SoCs.

Power Consumption Modeling in Embedded SoCs
Power consumption in embedded systems is a vibrant and active field of research, particularly in the case of smartphones. These devices are abundantly available and easy to program. More importantly, they come with a limited power supply, which has prompted developers to study battery life [5], construct energy and power consumption models [6][7][8], and closely follow the influence of user actions and applications on power consumption [9]. Furthermore, studying the power profile of embedded systems has led to improvements in energy efficiency [10] and reliability [11,12], as well as in monitoring and detecting anomalies [13][14][15] and energy hogs [16].
There exist three main literature reviews detailing power profiling and modeling in smartphones [17][18][19]. In the first review, Hoque et al. [17] gave an overview of the steps of building an energy profiler. Then, they went on to detail the different energy measurement mechanisms (external instruments and self-metering), types of models (utilization-based, event-based, and code-based), modeling philosophies (white-box vs. black-box), and profiling schemes. Finally, the authors proposed a taxonomy of the studied profilers based on where they are deployed (internally on the device or externally) and where the model is constructed (on or off the device).

Temperature Modeling in Embedded SoCs
The thermal performance of the SoC is an important variable for its proper functioning and for determining its life cycle. Thus, temperature modeling is usually carried out for optimization purposes. During the design phase, it is used to define the temperature thresholds [39] or the size of the heat evacuation apparatus (radiators, heat pipes, etc.). During the operating phase, certain models are built to optimize thermal behavior [12,40], while other models are used to study the reliability of the system by investigating the effects of internal and external temperatures on the chip [41], or on the lifetime of the system [42,43] in order to improve it [44].
In addition to the models designed for optimization, the literature contains models focusing on energy and thermal management. For instance, Mercati et al. [45] proposed a method to adapt the operating conditions (processors' frequencies, screen brightness, etc.) to the needs of the user (usage, applications running, etc.). Other works aim at the same goal by proposing new scheduling policies [11,46] or new Dynamic Voltage and Frequency Scaling (DVFS) algorithms [12].
The closest work to our framework (that is, models used for online estimation and monitoring) is the Therminator simulator [47] and its second version, ThermTap [48]. These programs were developed for the online estimation of temperature in mobile devices for debugging purposes.

SoC Monitoring and Diagnosis
In the monitoring field, Gao et al. [49] summarized most of the work on fault diagnosis and detection in a two-part survey. In this survey, they divided these works into four categories: model-based, signal-based, knowledge-based, and hybrid/active approaches. The survey also went on to highlight fault diagnosis applications and fault tolerance methods [50].
This study falls into the second category (signal-based approaches [49]). However, most of the existing work does not specifically study the SoC independently but rather studies electronic boards as a whole. These boards are often viewed as discrete systems whose functioning depends on the proper operation of all the components, with the assumption that there is a strong dependence of operation between each of these components [51]. Accordingly, the developed diagnosis methods are based on causal models such as multi-signal flow graph [52], information flow model [53], directed graph [54], and the fault tree [55].
Cui et al. [56] introduced an improved dependency model and used it for fault detection and isolation on an electronic board harboring a CPU-GPU chip. The chip, like our case, was used in the field of avionics. The new model is capable of eliminating multiple faults by disconnecting them from the main model via switches in the Dependency Graphical Model (DGM). The model also is the first implementation of dynamic reconfiguration concepts, and takes into consideration the malfunctions of tests.
Additionally, Gizopoulos et al. [57] provided a classification and a detailed study of existing online error detection techniques applied to multicore processor architectures. These approaches are classified into four main categories: redundant execution [58,59], periodic Built-In Self-Test (BIST) approaches [60], dynamic verification approaches [61,62], and anomaly detection approaches [63,64]. The results of this comparative study highlight the effectiveness of the dynamic verification approaches in targeting transient faults, permanent faults, and design bugs. The latter was the focus of the recent work by Sinha et al. [65] who studied the reliability of the hardware-software combined system at the design stage and proposed a unified functional model that can be simulated to detect potential failures.
In a different approach, Mercati et al. [45] viewed reliability as a convex optimization problem and proposed Workload-Aware Reliability Management (WARM), an optimal policy for multicore systems, while Zadegan et al. [66] used the networking capabilities of the Institute of Electrical and Electronics Engineers (IEEE) Std 1687-2014 for in-field monitoring of embedded systems and fast localization of faults.
Finally, Löfwenmark and Nadjm-Tehrani [67] published a survey that regrouped works focusing on multicore systems in avionics. They suggested that there is an increased sensitivity to faults due to shrinking transistor sizes, and highlighted areas where research is still needed.

Contributions
All of the works mentioned above describe novel methods with great results in their respective fields. Each of them focuses on a specific aspect or component of the embedded system or the SoC (reliability or power consumption modeling, hardware components or software components, etc.). This work, on the other hand, presents a new modeling framework that is capable of estimating and monitoring all of the SoC's characteristic variables online at once, and of linking the software and hardware parts. The obvious obstacle to overcome, however, was creating a model capable of estimating all of these variables.
The dynamics of CPU-GPU chips are quite diverse and can be considered from different angles. They can be described with discrete variables (like the CPU load) or seen as discrete event subsystems (such as the frequency), not to mention the nonlinear continuous dynamics like temperature and power consumption. To model all of these dynamics, the chip is viewed as a system with a variable structure, in which the speed, power consumption, and the generated heat depend on both the software load and the operation mode (power saving, performance, etc.). This causal link between the variables allows for the creation of an incremental modeling structure that simplifies the modeling process. Hence, the emphasis in this work is on streamlining and organizing both the modeling and monitoring processes into a manageable and adaptable framework, rather than presenting novel models for all the individual dynamics. Moreover, the incremental model naturally accounts for the variability in the structure of the system and estimates different dynamics at once, and more importantly, it is easily adaptable to changes in components or operating modes. Additionally, it gives the model builder the freedom to choose between writing new models (as was done hereafter) and selecting ones from the library of existing models (like PowerBooter [23]), allowing for better monitoring results and greater control over the whole model.

Modeling and Monitoring Approach
The monitoring approach proposed in this paper is complementary to the methods mentioned in the previous paragraph because it aims to provide early detection of drifts in the system's characteristics caused by wear, or harsh conditions, or over-solicitation.
Since it is going to be deployed in safety-critical systems and environments (airplanes), the monitoring algorithm's main job will be to ensure that the SoC is behaving correctly and working under optimal conditions. To do so, we first describe the operating state that the algorithm has to monitor.

The Selection of Variables
The operating state of an SoC needs to be described by both the software and physical aspects of the said SoC, since both are intricately linked. Furthermore, the selected variables have to be either readable directly from the system or measurable, to ease model validation. Thus, we selected the following variables to describe the operating state of the SoC:
• Per-core CPU load (Load_1, ..., Load_n): Sometimes also called CPU utilization [68]. The load is the sum of the times the processor spends either busy or waiting (e.g., for I/O) during a sampling period, relative to that sampling period, in percent [69].
• GPU load (Load_GPU): Defined in the same way as the CPU load, it is the sum of the times the processor spends either busy or waiting during a sampling period, relative to that sampling period, in percent [70].
• Memory Occupation Rate (MOR): Memory usage plays a major role in the power consumption [71] and thermal output of an SoC [72]. To characterize the influence of the Random Access Memory (RAM) on the temperature of the SoC and its power consumption, we needed to include its value as an input to those models. However, the amount of occupied RAM on its own is indicative of neither the base value used nor the maximum. Thus, we define MOR as the ratio of the occupied RAM relative to its full size.
• The power consumed by the board or the device (P)
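As a minimal sketch of the two ratio-type definitions above, the load and the MOR can each be computed from a pair of raw readings. The function names and the example figures are illustrative assumptions, not part of the original framework.

```python
def load_percent(busy_s: float, waiting_s: float, period_s: float) -> float:
    """Load over one sampling period, in percent.

    Follows the definition in the text: time spent busy or waiting,
    relative to the sampling period.
    """
    if period_s <= 0:
        raise ValueError("sampling period must be positive")
    return 100.0 * (busy_s + waiting_s) / period_s


def memory_occupation_rate(used_bytes: int, total_bytes: int) -> float:
    """MOR: occupied RAM relative to the full RAM size (0.0 to 1.0)."""
    if total_bytes <= 0:
        raise ValueError("total RAM size must be positive")
    return used_bytes / total_bytes


# Hypothetical readings: 12 ms busy and 3 ms waiting in a 20 ms period,
# and 512 MB occupied out of 2048 MB.
cpu_load = load_percent(0.012, 0.003, 0.020)
mor = memory_occupation_rate(512, 2048)
```

Expressing the memory input as a ratio keeps it comparable across devices with different RAM sizes, which is the motivation given for MOR above.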

Interconnected and Incremental Modeling
The aforementioned variables all follow different dynamics and are influenced by different factors, which makes establishing a global model for all of them arduous. However, when investigating these variables, we find clear causal relationships between most of them. Firstly, the CPU load dictates the frequency at which the CPU should run, and the same holds for the GPU. Consequently, the voltage will change according to the chosen frequency through DVFS [40]. Secondly, power consumption in processors is a direct function of the voltage and the frequency [12]. Thirdly, higher frequencies cause the temperature to rise, indicating a clear link and correlation between these two variables [46]. Finally, modern CPUs always include a thermal regulator that will either limit the highest possible frequency or shut down the processor (or one of its cores) whenever its temperature crosses a certain threshold [40].
Therefore, instead of trying to build a global model for all these variables, we adopted a gradual approach and a modular structure for the modeling of the SoC in hand.
The modeling scheme is composed of a set of subsystems, each written to estimate only one variable. By exploiting the previously mentioned causal relationships and connecting the subsystems accordingly, we obtain a model capable of estimating all the variables we defined while also accounting for the variable structure of the system. As shown in Figure 1, such a modular approach makes it possible to analyze and model each of the variables incrementally and progressively. Another clear advantage of this modeling scheme is that it allows for easy integration or replacement of subsystems in the case of changes or updates, thus giving users the flexibility to use whichever subsystem they find most suitable. For instance, during the course of this work, we developed three power models: a Nonlinear Autoregressive with eXogenous input (NARX) neural network (which will be presented later in the article), a regression-based model similar to the one developed by Kim et al. [26], and a regression tree. All of these models were easily interchangeable, and the library can be extended even with models from the literature. Figure 2 shows how easy it is to interchange subsystems and connect them into the general model.
The interconnected model is required to have the least possible overhead. Thus, it only uses readings and sensors available on the system. Additionally, variables like the frequency (and consequently the voltage and power consumption) are reevaluated every 20 ms [73]. The model needs to read these variables and generate estimations at this sampling time. While building the monitoring framework, we built and experimented with several types of models, notably for the temperature and power consumption. However, in this work, we present the models that delivered the best accuracy while satisfying the estimation speed requirements.
The variables estimated by the model characterize the operating state of the SoC (i.e., both the CPU and GPU) and are used as inputs to the monitoring algorithm.
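The interchangeable, single-output subsystems described in this section can be sketched as a chain of callables evaluated in causal order. This is only an illustration under stated assumptions: the class name `IncrementalModel`, the variable names, and the three toy subsystems are hypothetical and do not reproduce the models developed in this work.

```python
from typing import Callable, Dict

# A subsystem maps the variables known so far to the single variable it estimates.
Subsystem = Callable[[Dict[str, float]], float]


class IncrementalModel:
    """Chain of single-output subsystems evaluated in causal order."""

    def __init__(self) -> None:
        # Insertion order of the dict is the causal evaluation order.
        self._subsystems: Dict[str, Subsystem] = {}

    def register(self, output_name: str, subsystem: Subsystem) -> None:
        """Add a subsystem, or replace the one already estimating output_name."""
        self._subsystems[output_name] = subsystem

    def estimate(self, inputs: Dict[str, float]) -> Dict[str, float]:
        """Propagate the global inputs through every subsystem in turn."""
        state = dict(inputs)
        for name, subsystem in self._subsystems.items():
            state[name] = subsystem(state)
        return state


# Toy placeholder subsystems (hypothetical, for illustration only).
model = IncrementalModel()
model.register("freq_ghz", lambda s: 2.0 if s["load"] > 80.0 else 1.0)
model.register("voltage_v", lambda s: 0.7 + 0.1 * s["freq_ghz"])
model.register("power_w", lambda s: s["voltage_v"] ** 2 * s["freq_ghz"])
first = model.estimate({"load": 90.0})

# Swapping the power subsystem leaves the rest of the chain untouched.
model.register("power_w", lambda s: 1.5 * s["freq_ghz"])
second = model.estimate({"load": 90.0})
```

Because each subsystem produces exactly one named variable, replacing a model only requires registering a new callable under the same name, and a residual on that variable points back to a single subsystem.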

Monitoring Approach
Most of the fault diagnosis techniques, presented in Section 2, are based on the dependency theory, hardware and software redundancy, and online verification tests. The method proposed in this work, on the other hand, is based on analytical redundancy, created by the construction of a reference model of the system. This technique compares the real outputs of the system to reference ones generated by the developed estimation model. In the presence of faults, the system's outputs deviate from their reference values, thus, generating fault indicators. The latter, commonly called residuals, are evaluated by a probabilistic method to account for modeling uncertainties in the decision-making process.
The use of analytical redundancy is fairly common for the diagnosis of faults in the hardware section of systems. Yet, to our knowledge, this is the first use of this technique to monitor both hardware and software components. Figure 3 presents an overview diagram of our approach to monitoring an SoC with an embedded CPU and GPU. Finally, the modularity of the incremental estimation model allows each of the generated residuals to be associated with an identified module. Hence, in the presence of faults, faulty subsystems can be easily isolated.

Experimental Setup
Two systems were used in this study for testing and experimental validation: an Android smartphone and an ARM-based development board.

The Android Smartphone
These devices are ubiquitous, and developed applications can be effortlessly transferred from one device to another. Moreover, the availability of the source code for their operating systems makes some of the needed parameters accessible for reading and even modification. Additionally, they are equipped with all the necessary sensors for the measurement of all the variables mentioned in Section 3.1.
The smartphone we used in this study is a Samsung Galaxy S5 [74]. It was equipped with an SoC harboring a quad-core processor with variable frequencies-through frequency and voltage scaling-ranging between 0.3-2.45 GHz. It also has a GPU with frequencies ranging between 200-578 MHz. The SoC is covered by the system's 2 GB low power RAM.
After much experimentation, this phone was no longer usable. For the last part of experimentation (see Section 8.3), we used a second device equipped with an octa-core processor arranged in a big.LITTLE configuration (Arm Holdings, Cambridge, United Kingdom, 2007) [75], running at frequencies up to 2.3 GHz, a state of the art GPU, and 4 GB of RAM.

The ARM-Based Development Board
For the actual prototype of the project, we use a safety-critical certified development board. The board runs on Linux and is Android capable. It has a single-core ARM Cortex-A9 processor (ARM Holdings, Cambridge, United Kingdom, 2007) [76] and is equipped with 1 GB of RAM. Since it is only a prototype, the board is not yet equipped with sensors to measure consumed power (or drawn current, for that matter). Instead, we used an external ammeter and an oscilloscope.

The Monitoring Equipment
There are two possible ways of implementing a monitoring algorithm: either implement it directly on the monitored board or use external equipment to do so. We tested both ways, especially on the phone, where it was feasible and useful for the phone to monitor its own operating state. From a reliability standpoint, however, the operating state of an electronic board should be monitored externally: this reduces overhead on the board, and if the board freezes or fails, warnings and failure indications can still be generated. Hence, for this prototype, we used a PC and linked it to the monitored device through a TCP/IP connection. An application that we developed sends data from the device to the PC, which generates estimated reference variables and compares them to the readings. Figure 4 shows the actual development board alongside its touch screen and the multimeter used to measure the power it draws. The monitoring PC with the monitoring program is shown alongside as well.

System Modeling
In the following paragraphs, we detail the modeling process of each of the subsystems. Firstly, we start by reverse engineering the algorithms of the frequency governor, voltage scaling, and the thermal regulator. Then we move on to the black-box modeling of power and temperature.

Reverse Engineering the Algorithms
These algorithms are already present on the system. To generate similar outputs as they do, we studied and repurposed their code for simulation and monitoring.

Frequency Governor
Frequency scaling is carried out by programs called governors. In the studied systems, several frequency governors exist; they calculate the frequency according to usage needs as well as several other factors (responsiveness, power consumption, etc.). Smartphones usually tend to use the Interactive Governor (Google Inc, Mountain View, CA, United States, 2015) [77], which is also compatible with the ARM-based board. Thus, it has been used in our case study. Nevertheless, thanks to the modular structure of the model, any other governor can replace it if needed. Figure 5 displays a simplified flowchart of the governor's algorithm. It is based on the source code available in the code repositories of the manufacturer of the studied smartphone [73]. In summary, it increases and decreases the frequency of each core as a function of the load, specific constants (goHispeedLoad, targetLoad, hispeedFreq, etc.), and timers (timerRate, downTimer, etc.).
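The behavior summarized above can be illustrated with a heavily simplified, interactive-style decision step. This is a sketch under stated assumptions and not the governor's actual source: the parameter names mirror the constants cited in the text, while the default values, the frequency table, and the `down_ticks` hold-off counter are hypothetical.

```python
from bisect import bisect_left
from typing import Sequence


def next_frequency(load: float, cur_freq: float, freqs: Sequence[float],
                   go_hispeed_load: float = 99.0, target_load: float = 90.0,
                   hispeed_freq: float = 1.2, ticks_since_raise: int = 0,
                   down_ticks: int = 4) -> float:
    """One timer tick of a simplified, interactive-style governor.

    freqs must be sorted in ascending order; all frequencies are in GHz.
    """
    # Frequency that would bring the load back to target_load.
    wanted = cur_freq * load / target_load
    # A very high load on a slow core jumps straight to hispeed_freq.
    if load >= go_hispeed_load and cur_freq < hispeed_freq:
        wanted = max(wanted, hispeed_freq)
    # Pick the lowest available frequency that satisfies the demand.
    idx = min(bisect_left(freqs, wanted), len(freqs) - 1)
    new_freq = freqs[idx]
    # Do not ramp down until the current frequency has been held long enough.
    if new_freq < cur_freq and ticks_since_raise < down_ticks:
        return cur_freq
    return new_freq


# Hypothetical frequency table (GHz).
table = [0.3, 0.6, 1.2, 1.8, 2.4]
burst = next_frequency(100.0, 0.3, table)
held = next_frequency(40.0, 1.2, table, ticks_since_raise=0)
lowered = next_frequency(40.0, 1.2, table, ticks_since_raise=5)
```

Running the same step function on the monitoring side with the measured load gives an estimated frequency that can be compared with the frequency actually reported by the device.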

The Thermal Regulator
The manufacturers of the SoCs used in this study programmed three thermal regulators that can be selected to control the temperatures of the SoC. These regulators are Proportional Integral Derivative (PID), Single Step (SS), and the one used in our systems, Monitor. This algorithm simply samples the temperature at a predefined rate (samplingMs), and if a core's temperature is higher than the predefined threshold, the core is shut down until its temperature drops below a second threshold (thresholdsClr), at which point the core can be turned on again.
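The two-threshold behavior of the Monitor regulator amounts to a hysteresis rule, which can be sketched as follows. The threshold values (80 °C and 70 °C) and the function name are illustrative assumptions; only the shutdown-then-clear logic comes from the description above.

```python
from typing import List


def monitor_step(core_on: bool, temp_c: float,
                 threshold_c: float = 80.0, threshold_clr_c: float = 70.0) -> bool:
    """Return the core's on/off state after one temperature sample."""
    if core_on and temp_c > threshold_c:
        return False      # too hot: shut the core down
    if not core_on and temp_c < threshold_clr_c:
        return True       # cooled below the clear threshold: turn it back on
    return core_on        # otherwise keep the current state


def simulate(temps_c: List[float]) -> List[bool]:
    """Apply monitor_step to a sequence of samples, starting with the core on."""
    states, core_on = [], True
    for temp in temps_c:
        core_on = monitor_step(core_on, temp)
        states.append(core_on)
    return states


# Hypothetical temperature trace sampled at the regulator's rate.
trace = simulate([75.0, 85.0, 78.0, 72.0, 69.0, 75.0])
```

The gap between the two thresholds prevents the core from toggling on and off when the temperature hovers around a single limit.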

Voltage Scaling
Recovered voltage readings of the CPU cores show that, like the frequency, the voltage is discrete and takes values from a well-defined set. To see the relationship between these two variables, voltage readings are plotted against the frequencies (Figure 6a). The plot shows an almost linear trend (in red) where the values are concentrated, indicating that each frequency is associated with a fixed voltage value.
Furthermore, by analyzing the measurements, we find 15 frequency values (plus a zero frequency for a turned-off core) against 14 voltage values, which suggests that some frequencies share the same voltage value. Thus, to better investigate the relationship between the frequencies and voltages, we use the histogram of voltage values for each frequency. As shown in Figure 6b for f = 1.49 GHz, the histogram clearly indicates that the core voltage for this frequency value is V_1.49 = 0.875 V. All of the other voltage values are obtained in the same manner.
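The histogram-based identification described above can be sketched as taking, for each frequency, the most frequent voltage among the recorded samples. The function name and the sample data are illustrative assumptions; only the pair 1.49 GHz and 0.875 V comes from the text.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_voltage_table(samples: Iterable[Tuple[float, float]]) -> Dict[float, float]:
    """Map each frequency (GHz) to the most frequent voltage (V) observed at it."""
    by_freq: Dict[float, Counter] = defaultdict(Counter)
    for freq, volt in samples:
        by_freq[freq][volt] += 1
    # The mode of each per-frequency histogram is taken as the core voltage.
    return {freq: counts.most_common(1)[0][0] for freq, counts in by_freq.items()}


# Hypothetical (frequency, voltage) readings; the odd 0.880 V sample stands in
# for a reading taken while the voltage was still changing.
readings = [(1.49, 0.875), (1.49, 0.875), (1.49, 0.880), (0.30, 0.775)]
voltage_table = build_voltage_table(readings)
```

The resulting lookup table can then serve as the voltage subsystem: given an estimated frequency, it returns the expected voltage.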

Black-Box Modeling
White-box modeling of temperature and power in an SoC would require design level knowledge of all the inner components of the chip [23]. It would also require constructing finite state machines [17] or simulating differential equations [48]. Identification techniques, on the other hand, train the model to fit its outputs to observations using statistics [17]. Good expertise of the factors influencing the output is appreciated, but not required, nor is the formal theoretical proof of the relation between the inputs and outputs [23]. Still, these methods require large amounts of data, time, and computational resources for training and validation to account for all cases [17], unlike white-box models.

Power Model
Before starting the model construction and training processes, it is necessary to determine the inputs of the model. For microprocessors, the power consumption is often given by [78]:
$$P = C \, V^{2} \, f \qquad (1)$$
where f is the frequency, V the voltage, and C the capacitance of the microprocessor. C depends on the design, the internal wiring, and gate switching in the microprocessors [78]. While we cannot directly use Equation (1) to compute power consumption, we can use it as a guide to select the variables that correlate the most with power consumption in the microprocessor; in this case, the frequency and voltage. It was also shown in the previous paragraph that the voltage and the frequency data are tied to one another. Therefore, the frequencies of the cores, the frequency of the GPU, and the MOR are used as the inputs to our power model. Some of the models in the literature also multiplied the frequencies by the load. Nonetheless, our experimental results showed that the load, which defines the frequency in the first place, had no further effect on the final accuracy of the model. Hence, for optimization purposes (limiting the input size of the model), we omitted it.
Most power models in the literature are constructed using data fitting techniques like linear regression and tend to be of a polynomial form [23,24,26,27]. These models assume a linear relationship between the inputs and the outputs, which is not always the case [23], thus increasing the error when the change in power consumption is not linear [17]. Although we also use data-based techniques for identification, the model we construct is a nonlinear autoregressive model with exogenous input (NARX) [79], which overcomes the shortcomings of linear regression-based models [36,37].
The neural network is composed of two layers. The hidden layer contains neurons with sigmoids as activation functions, and the output layer contains a single linear neuron. Figure 7 shows the general structure of the NARX model with the output feedback and the time delay line (TDL), which in our model is equal to 1. Since this work is focused on the SoC, in order to minimize the influence of other system components, all communication and secondary peripherals (Wi-Fi, modem, cameras, screen, etc.) were disabled and assumed to consume a static, constant amount of power in that state [28,80]. The measured consumed power becomes:
$$P = P_{SoC} + P_{Static} \qquad (2)$$
During our experimentation, we found the static power P_Static to be marginal, thanks to the power management capabilities of modern devices, but we still included it in the model for better accuracy.
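The forward pass of such a two-layer NARX network can be sketched as follows. Several points are assumptions made for illustration: the logistic function is used as the sigmoid, the delay of 1 is applied to both the inputs and the fed-back output, and all weights shown are placeholders rather than trained values.

```python
import math
from typing import List, Sequence


def sigmoid(x: float) -> float:
    """Logistic sigmoid (assumed here as the hidden-layer activation)."""
    return 1.0 / (1.0 + math.exp(-x))


def narx_step(u_prev: Sequence[float], y_prev: float,
              w_hidden: List[List[float]], b_hidden: List[float],
              w_out: List[float], b_out: float) -> float:
    """One-step NARX estimate with a delay of 1 on the inputs and the output."""
    # Regressor: delayed exogenous inputs followed by the fed-back output.
    x = list(u_prev) + [y_prev]
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w_hidden, b_hidden)]
    # Single linear output neuron.
    return sum(w * h for w, h in zip(w_out, hidden)) + b_out


# Placeholder network: 2 hidden neurons, inputs [f_1, f_GPU, MOR] plus y(k-1).
w_h = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
p_est = narx_step([1.49, 0.578, 0.4], 2.0, w_h, [0.0, 0.0], [2.0, 4.0], 1.0)
```

In closed-loop use, the returned estimate is fed back as `y_prev` at the next sampling instant, so the model runs without access to the measured power.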

Thermal Model
Temperature dynamics are different from those of the other studied variables. To start with, temperature is continuous and has to be sampled (discretized). In addition, it does not depend only on the inputs, but also on its own previous values. Since our aim is to monitor the operating state of the SoC, we do not focus on the mechanics of heat generation and transfer. Instead, we focus our attention on finding the variables affecting the temperature.
Recorded data show that temperature is directly correlated with the frequency of the CPU and the GPU. The data also confirm the correlation in the measurements between the recorded temperature and power consumption, as concluded in the literature [45,46]. Therefore, the considered inputs of the model of the temperatures are only the frequencies of the CPU and the GPU, along with the MOR and the power consumption.
To better represent these dynamics, we tried several models and settled on an Autoregressive-Moving-Average with Exogenous Inputs (ARMAX) model. ARMAX models use the regression of inputs and previous outputs, along with the moving average, to simulate or predict the current output as follows [81]:
$$A(q^{-1}) \, y(k) = B(q^{-1}) \, u(k) + C(q^{-1}) \, e(k) \qquad (3)$$
where y(k) is the output to estimate (or predict), u(k) is the exogenous (X) variable or the input, and e(k) is the moving average (MA) variable. q^{-1} is the delay operator. A(q^{-1}) holds the autoregressive parameters, B(q^{-1}) holds the input parameters, and C(q^{-1}) holds the MA parameters [81].
In the case of this work, the ARMAX model is well suited to modeling the temperature, as it takes into account previous inputs and previous values of the output. In the identification process, the temperature T_SoC is the output y(k) to be estimated, and the inputs are:
$$u(k) = [f_1, \ldots, f_n, f_{GPU}, MOR, P_{SoC}]$$
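To illustrate how an ARMAX model produces an estimate from past values, the sketch below implements a one-step-ahead predictor for a first-order model, y(k) + a1 y(k-1) = sum_j b1[j] u_j(k-1) + e(k) + c1 e(k-1). The model order, the coefficient values, and the data are assumptions chosen for illustration; the orders identified in this work are not reproduced here.

```python
from typing import List, Sequence


def armax_predict(y: Sequence[float], u: Sequence[Sequence[float]],
                  a1: float, b1: Sequence[float], c1: float) -> List[float]:
    """One-step-ahead predictions of a first-order ARMAX model.

    y holds the measured outputs and u[k] the input vector at step k.
    """
    y_hat = [y[0]]   # no history at k = 0: start from the first measurement
    err = 0.0        # previous prediction error, used as the MA term e(k-1)
    for k in range(1, len(y)):
        drive = sum(b * x for b, x in zip(b1, u[k - 1]))
        y_hat.append(-a1 * y[k - 1] + drive + c1 * err)
        err = y[k] - y_hat[k]
    return y_hat


# Hypothetical temperature samples (degrees C) and a single constant input.
temps = [40.0, 41.0, 42.0]
inputs = [[1.0], [1.0], [1.0]]
predicted = armax_predict(temps, inputs, a1=-0.5, b1=[20.0], c1=0.5)
```

This predictor uses the measured previous output; a free-running reference model would instead feed its own previous estimate back in place of `y[k - 1]`.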

Monitoring: Residuals Generation and Evaluation
The design of a monitoring system goes through two major steps. The first is the generation of residuals, whereas the second consists of their evaluation. In theory, generated residuals have a value of zero in normal operating conditions. However, in practice, due to uncertainties, residuals are rarely exactly equal to zero. Thus, two types of approaches can be found in the specialized literature for residual evaluation:
• Model-based approaches use the analytical expression of the residuals to generate thresholds that take into account parameter and model uncertainties along with measurement noises [82][83][84][85][86].
• Signal-based approaches consider residuals as a signal and do not take into account its physical origin. The idea consists of extracting the statistical and probabilistic properties of the residual signal in normal operation and using them as a reference to make a decision (generate an alarm, for instance) [87][88][89][90].

Residual Generation
In this work, raw residuals are generated by comparing the current outputs of the system with those estimated by the reference model (cf. Figure 3). They are expressed as follows:
$$r_{f_i} = f_{i,meas} - f_{i,estm}, \quad r_{V_i} = V_{i,meas} - V_{i,estm}, \quad r_{P} = P_{meas} - P_{estm}, \quad r_{T} = T_{meas} - T_{estm} \qquad (4)$$
where i = {1, ..., n, GPU} denotes either the GPU or the core number of the CPU, the subscript estm denotes model estimations, and meas denotes measured values (or readings). This approach to residual generation is well suited to the detection of progressive deviations in the characteristics of the system and thus allows for early detection of degradation. Interpretations of these drifts in characteristics, on the other hand, are performed using relationships of causality arising from the expertise of manufacturers and system operators. Hence, this method employs a large part of the available information (physical knowledge, expertise, measures, etc.).
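Raw residual generation reduces to a per-variable comparison between readings and estimates, which can be sketched as follows. The sign convention (measured minus estimated), the variable names, and the numerical values are assumptions made for illustration.

```python
from typing import Dict


def raw_residuals(measured: Dict[str, float],
                  estimated: Dict[str, float]) -> Dict[str, float]:
    """Residual per variable: measured value minus model estimate (assumed sign)."""
    return {name: measured[name] - estimated[name] for name in estimated}


# Hypothetical readings and estimates at one sampling instant.
meas = {"f_1": 1.5, "V_1": 0.875, "P": 2.5, "T": 45.0}
estm = {"f_1": 1.5, "V_1": 0.875, "P": 2.0, "T": 44.0}
residuals = raw_residuals(meas, estm)
```

Because each estimate comes from a single subsystem, a residual that departs from zero (here, the power and temperature residuals) points directly at the subsystem to inspect.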

Residual Evaluation
This step allows us to determine the presence or absence of a fault. Each raw residual generated in the previous section is evaluated with a method suited to its type. The purpose of this evaluation is to generate a normalized residual R which can only take two values: 1 or 0. This residual indicates to users the presence (or absence) of faults in the system.
To generate the normalized residual, we need to define threshold values. In this work, taking into account how residuals are generated, their evaluation is carried out using signal-based approaches. Additionally, since each subsystem is modeled by the method most appropriate for its dynamics, the threshold calculation is also adapted to those dynamics.

Frequency and Voltage Residuals Evaluation
Analysis of the frequency raw residuals $r_f$ and the voltage raw residuals $r_V$ (given in Section 7) shows that these signals are equal to zero, except for momentary differences at the instants when the frequency or the voltage changes. These spikes are caused by a delay in the computation of estimated values relative to measured ones. To take this delay into account in the decision-making process, the maximum value of the delay, denoted $\tau_{ref}$, is identified and then used to calculate the normalized residuals $R_f$ and $R_V$, where $x \in \{f, V\}$ denotes either the frequency or the voltage, and $\tau_{x_d}$ is the value of the computed delay for each of these two variables.
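One possible reading of this decision rule is sketched below in Python: an alarm is raised only when a mismatch between reading and estimate persists for longer than the reference delay. The function name, the sampling period `dt`, the tolerance `eps`, and the sample values are assumptions made for illustration, not values taken from the experiments.

```python
from typing import List


def delay_tolerant_residual(raw: List[float], dt: float, tau_ref: float,
                            eps: float = 1e-9) -> List[int]:
    """Return 1 where the raw residual has been nonzero for longer than tau_ref.

    raw     : raw residual samples (zero when model and reading agree)
    dt      : sampling period in seconds (assumed constant)
    tau_ref : maximum delay observed in normal operation, in seconds
    """
    flags: List[int] = []
    run = 0  # consecutive samples with a nonzero residual
    for r in raw:
        run = run + 1 if abs(r) > eps else 0
        tau_d = run * dt  # duration of the current mismatch
        flags.append(1 if tau_d > tau_ref else 0)
    return flags


# A short spike (2 samples = 0.08 s) is tolerated; a mismatch lasting
# 4 samples exceeds tau_ref = 0.1 s from its third sample onward.
raw = [0.0, 0.3, 0.3, 0.0, 0.0, 0.5, 0.5, 0.5, 0.5, 0.0]
flags = delay_tolerant_residual(raw, dt=0.04, tau_ref=0.1)
```

Under these assumptions, the short spike that mimics an estimation delay produces no alarm, while the persistent mismatch does.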

Power and Temperature Residuals Evaluation
Analysis of the raw residuals $r_P$ and $r_T$ shows that these signals carry a high-frequency noise with an average close to zero. They are also normally distributed around that average (Figures 9b and 10b), so that, in theory, 99.7% of the residuals lie within the mean µ plus or minus three times the standard deviation σ (in our case, 99.2% for the temperature residuals and 99.4% for the power residuals). However, outliers will still arise from estimation and scheduling errors [91]. These errors are especially prevalent in periods of high load, where readings and estimations might be out of synchronization (as seen with the frequency and voltage).
Since drifts in the characteristics are mostly contained in the average value of the signal, and in order to minimize the effect of the noise and avoid false alarms coming from estimation and synchronization errors, we use the moving average method to compute averaged residuals for power and temperature, $r_P^m$ and $r_T^m$, from the raw residuals $r_P$ and $r_T$, respectively (the moving average is the unweighted mean of the data over a window of n samples). Then, we use µ and σ of the averaged signals as thresholds to form an envelope around each of the residuals (Equation (6)). The normal operating thresholds are generated in this way for power, and the temperature averaged residuals and thresholds are obtained using the same formula. To further simplify the process of decision making, we generate normalized residuals $R_P$ and $R_T$ from the averaged residuals $r_P^m$ and $r_T^m$, where $x \in \{P, T\}$ denotes either the power or the temperature.
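The averaging and thresholding steps can be sketched as follows in Python. This is a minimal illustration that assumes a trailing window, an envelope of µ ± 3σ computed on the averaged residuals of a fault-free run, and synthetic Gaussian residuals; the function names, window length, noise level, and drift magnitude are all hypothetical.

```python
import random
from statistics import mean, pstdev
from typing import List, Tuple


def moving_average(signal: List[float], n: int) -> List[float]:
    """Unweighted trailing mean over at most n samples."""
    out: List[float] = []
    for k in range(len(signal)):
        window = signal[max(0, k - n + 1): k + 1]
        out.append(sum(window) / len(window))
    return out


def envelope(reference: List[float], k: float = 3.0) -> Tuple[float, float]:
    """Thresholds mu - k*sigma and mu + k*sigma (k = 3 is an assumption)."""
    mu, sigma = mean(reference), pstdev(reference)
    return mu - k * sigma, mu + k * sigma


def normalize(averaged: List[float], low: float, high: float) -> List[int]:
    """Normalized residual: 0 inside the envelope, 1 outside."""
    return [0 if low <= r <= high else 1 for r in averaged]


rng = random.Random(0)  # fixed seed so the example is repeatable
healthy = [rng.gauss(0.0, 0.03) for _ in range(200)]  # fault-free residuals
drifted = [rng.gauss(0.5, 0.03) for _ in range(50)]   # residuals after a drift

low, high = envelope(moving_average(healthy, n=10))
R = normalize(moving_average(healthy + drifted, n=10), low, high)
```

With these synthetic values, the normalized residual switches to 1 once the averaging window is dominated by drifted samples. The first few averaged samples use a shorter window and are therefore noisier, which a practical implementation may prefer to discard.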

Fault Isolation
After fault detection comes isolation of the faulty subsystems, which remains an area of ongoing research. In this paper, fault isolation is achieved through the incremental and interconnected structure of the model (Figure 1). Indeed, in our model, each subsystem represents a functional component of the system and describes the relationship between its inputs and outputs during normal operation. When a fault is present, it is first detected in the readings and alarms of the faulty subsystem. Once a fault is detected, the next step is either to analyze its propagation and effect on the rest of the subsystems or simply to isolate it. In this work, we aim for isolation. Since the subsystems are interconnected, a fault would normally propagate. Thus, the outputs from the model of the faulty component are replaced with readings, which allows the rest of the subsystems in the model to continue generating the same outputs as measured.
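The substitution mechanism described above can be illustrated with a small Python sketch. The three subsystem relations used here (frequency from load, voltage from frequency, power from voltage and frequency) are toy placeholders chosen for the example, not the models of this paper, and the names `run_model`, `subsystems`, and `alarms` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Signals = Dict[str, float]
Subsystem = Tuple[str, Callable[[Signals], float]]


def run_model(subsystems: List[Subsystem], inputs: Signals,
              measured: Signals, alarms: Dict[str, int]) -> Signals:
    """Evaluate the subsystems in order and return their estimates.

    When the alarm of a subsystem is raised, its measured reading (not its
    estimate) is passed to the downstream subsystems.
    """
    signals = dict(inputs)  # values visible to downstream subsystems
    estimates: Signals = {}
    for name, model in subsystems:
        estimates[name] = model(signals)
        signals[name] = measured[name] if alarms.get(name, 0) else estimates[name]
    return estimates


# Toy relations, for illustration only.
subsystems: List[Subsystem] = [
    ("f", lambda s: 2.0 * s["load"]),
    ("V", lambda s: 0.5 * s["f"] + 0.6),
    ("P", lambda s: s["V"] ** 2 * s["f"]),
]

# Frequency locked at 2.0 although the load calls for 1.0.
measured = {"f": 2.0, "V": 1.6, "P": 5.12}
propagated = run_model(subsystems, {"load": 0.5}, measured, alarms={})
isolated = run_model(subsystems, {"load": 0.5}, measured, alarms={"f": 1})
```

Without substitution, the voltage and power estimates also diverge from their readings, so every downstream residual would react. With substitution, only the frequency subsystem keeps a nonzero residual, which points to it as the origin of the fault.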

Validation of the Model and Monitoring Algorithm
The results presented in this section were recorded in a default usage and benchmarking scenario. This scenario is composed of three main steps. The first is a series of standard benchmarks (for both CPU and GPU). These benchmarks are PCMark [92], 3DMark [93], AnTuTu Benchmark [94], and Geekbench 4 [95]. The second step is a less complex video playback task, and in the third step the device is left idle for a while.
Since we used two experimental setups, in this section, we will alternate showing results from both setups to highlight the flexibility and portability of the models and algorithm we propose here.

Frequency and Voltage Models
The evolution of the measured frequencies from the CPU cores of the Android phone and the frequencies estimated by the model of the governor is given in Figure 8a. It shows near-identical results, with a small delay at the instants of frequency change. This delay is explained by the time needed by the model to assess the change. The maximum recorded delay is $\tau_{f_{ref}} = 0.1$ s, during a period of high load, due to scheduling delays [91]. This value will be used in the residual evaluation step. Analyzing Figure 8b, we arrive at the same conclusion for the voltage model, the difference being a maximum recorded delay of $\tau_{V_{ref}} = 0.12$ s.

NARX Power Model
To train and evaluate the NARX model, the data recorded during the default scenario are split, as per standard training and validation procedure, into three sets: a training set (60%), a validation set (20%), and a test set (20%). The first two sets are used during the training process, while the test set is used to judge the performance of the model. Once the model was validated and tested offline, it was further tested online to evaluate its estimation speed and its capacity to withstand scheduling delays during periods of high load and input lag. Table 1 shows the training, validation, and test results for the NARX model, in addition to the online test results, which are required to validate the model in our use case. Figure 9a compares the measured and estimated power consumption, and Figure 9b shows the estimation error distribution of the NARX power model for the online test. The errors have a mean value of $1.178 \times 10^{-4}$ W and a standard deviation of 0.0265 W, with 99.4% within µ ± 3σ. The Kolmogorov-Smirnov (KS) test confirms that this distribution fits the profile of a standard normal distribution.
We also compared our results with the accuracy of established and recent power models: Trepn (measured at 94.5% [22]), Snapdragon Profiler (measured at 95.2%), PowerBooter (which reported 96% accuracy [17,23]), PETrA (96% [29]), the models proposed by Walker [26], and GreenOracle (∼10% [31]).
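The 60/20/20 partition can be sketched in a few lines of Python. The sketch assumes a chronological split without shuffling, which is a common choice for time series but is an assumption here, since the text does not state how samples were assigned to each set; the function name and the dummy log are hypothetical.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def chronological_split(samples: Sequence[T], train_pct: int = 60,
                        val_pct: int = 20) -> Tuple[List[T], List[T], List[T]]:
    """Split a recorded sequence into training, validation, and test sets.

    Percentages are integers so that the cut points are computed exactly;
    the test set receives whatever remains after the first two sets.
    """
    n = len(samples)
    i = n * train_pct // 100
    j = i + n * val_pct // 100
    return list(samples[:i]), list(samples[i:j]), list(samples[j:])


log = list(range(1000))  # stand-in for 1000 recorded samples
train, val, test = chronological_split(log)
```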
Finally, we compared the power draw overhead caused by our modeling and monitoring program to that of the profilers we could test: PowerBooter, Trepn, and Snapdragon Profiler. In our standardized 15 min benchmarking test, PowerBooter caused an increase of 9% in power consumption, Trepn an increase of 8%, and the Snapdragon Profiler an increase of 5%. Compared to these three profilers, our modeling and monitoring scheme caused an increase of only 3% in power consumption during the same test.
While we are pleased that our NARX model delivered results that are on par with or even better than established works, it is worth noting that some of these profilers have higher levels of granularity than our model. Furthermore, some works, like PowerBooter, have not been updated for several years, and manufacturer-specific profilers were trained using their own brands' SoCs, and thus would probably perform better with them.

ARMAX Temperature Model
In system identification, only two sets of data are needed. The first set is used to train the model, and the second is used to validate and test it. For this model, 70% of the data are used as a training set and the rest as a validation set. Finally, as in the case of the NARX power model, the model was tested online. Table 2 shows the MAPE and MAE (%) from both sets and the online test. The ARMAX model has an MAE of 0.4741 °C (1.48%). The estimation errors from the ARMAX model are normally distributed (KS test), as they were in the case of the power model, with 99.2% of the values within µ ± 3σ. Figure 10b displays the distribution of these errors. Figure 10a displays the evolution of the estimated temperature on the development board, compared with the measured temperature of the CPU. The estimations delivered by our model are accurate for both low-frequency changes (heating and cooling) and high-frequency ones (weak temperature changes). The figure also shows that the model takes into account the initial conditions. Hence, temperature modeling is validated.
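For reference, the error metrics quoted in this section can be computed as in the following Python sketch, using the standard definitions of MAE and MAPE and the share of errors within µ ± 3σ. The function names and the four temperature samples are hypothetical and are not taken from Table 2.

```python
from statistics import mean, pstdev
from typing import List


def mae(meas: List[float], estm: List[float]) -> float:
    """Mean absolute error, in the unit of the measured variable."""
    return mean(abs(m - e) for m, e in zip(meas, estm))


def mape(meas: List[float], estm: List[float]) -> float:
    """Mean absolute percentage error (measured values must be nonzero)."""
    return 100.0 * mean(abs(m - e) / abs(m) for m, e in zip(meas, estm))


def coverage_3sigma(errors: List[float]) -> float:
    """Percentage of errors lying within mu +/- 3 sigma."""
    mu, sigma = mean(errors), pstdev(errors)
    inside = sum(1 for e in errors if abs(e - mu) <= 3 * sigma)
    return 100.0 * inside / len(errors)


# Hypothetical temperature readings and estimates, in degrees Celsius.
t_meas = [40.0, 50.0, 60.0, 50.0]
t_estm = [41.0, 49.0, 60.0, 50.0]
errors = [m - e for m, e in zip(t_meas, t_estm)]

t_mae = mae(t_meas, t_estm)
t_mape = mape(t_meas, t_estm)
t_cov = coverage_3sigma(errors)
```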

Validation of the Fault Detection and Isolation (FDI) Algorithm
Profiles of the raw residuals $r_f$ and normalized residual $R_f$ during our tests on the smartphone are given in Figure 11. The figure shows that raw residuals are sensitive to estimation delays at the moments of frequency changes, resulting in the appearance of a number of spikes. However, the normalized residual $R_f$ is equal to zero over the duration of the test. Hence, no false alarm is recorded. Figure 12 allows us to draw the same conclusion for the voltage raw residuals $r_V$ and the normalized residuals $R_V$, which also present no false alarms. Raw residuals $r_P$, averaged residuals $r_P^m$, and normalized residuals $R_P$ from the tests on the smartphone are given in Figure 13a-c, along with the computed thresholds (in red). In the raw residuals $r_P$, 99.4% of the data is within the normal operating envelope, with the remainder giving rise to some false alarms. The number of false alarms is reduced to zero by using the averaged residuals $r_P^m$. Thus, averaged residuals greatly improve the accuracy of the power normalized residuals. Figure 13 also shows all the temperature residuals $r_T$, $r_T^m$, and $R_T$ along with the thresholds (red lines) from the tests on the development board, and allows us to draw the same conclusion for the temperature residuals.

Testing the FDI Algorithm
To test the sensitivity of the residuals to faults and degradation, we propose three different faulty scenarios with faults originating from the environment, the software, or the hardware. During these experiments, the device is put through the same benchmarking and usage scenario as in the previous section, but in the middle of that usage, a fault is introduced as described below. A demonstration video of this process has been published online (Supplementary Material) [96].

Control Faults
In the first faulty scenario, we reproduce a possible governor bug or a system clock malfunction where frequency no longer corresponds to the load. In this scenario, a constant frequency is forced onto the CPU. Figure 14 shows the evolution of both measured and estimated frequencies on the development board. At t = 46 s, the measured frequency is locked to a constant frequency that does not match the input load. This fault is instantly detected by the raw residual r f as shown in Figure 15a. Consequently, the normalized residual R f rises from 0 to 1, generating an alarm (Figure 15b).

Hardware or Component Faults
This scenario is intended to simulate a faulty component, the presence of foreign bodies on the chip (like the accumulation of dust), or a change in the chip's characteristics due to wear, either over time or through overuse. Such faults and drifts are generally noticed through a decrease in power consumption.
As already described in Section 5.2.1, the neural net model is trained to monitor the power consumed by the SoC (and the static power consumed by the rest of the peripherals by extension). Thus, to simulate the change in power profile caused by one of the aforementioned reasons, we plugged in an external peripheral, a USB lamp, in this experiment.
In Figure 16, the measured and estimated power consumption values from the development board are drawn against each other. It shows a noticeable difference between the two values compared to Figure 9a. When computing and evaluating the raw residuals (Figure 18a), we find that most values fall within the threshold envelope. Nevertheless, further analysis using averaged residuals (Figure 18b) shows that the average of the residuals falls outside of the threshold envelope, and so an alarm is generated (Figure 18c).

Environmental Faults
While the faults discussed in the previous sections come from either the software or the chip itself, the fault we introduce in this scenario is generally due to the environment around the chip. In this part of the test, the chip is heated, allowing us to test the ability of the fault detection algorithm to detect abnormal heating of the SoC, which can be caused by a failure in the cooling system, electrostatic charges, or even radiation.
In order to simulate such faults, we had to heat the devices up to a temperature higher than their running temperatures, without damaging their components or compromising their structural integrity. The phone was sealed in a waterproof bag and put into a hot water bath at a temperature of 80 °C, whereas the development board was exposed in close proximity to a 1000 W light projector. Figure 17 shows the evolution of measured and estimated SoC temperatures on the smartphone. The difference between the two curves becomes clear at t = 180 s, about 20 s after the submersion of the phone into the water. Figure 18 shows the reaction of residuals $r_T$ (Figure 18d), $r_T^m$ (Figure 18e), and $R_T$ (Figure 18f). The fault is detected around t = 190 s, when the residual $r_T^m$ goes beyond the normal operating envelope. Then, an alarm is generated ($R_T$ rises from 0 to 1).

Conclusions
This work has tackled the problem of monitoring a CPU-GPU SoC in an embedded system intended for safety-critical use, hence making its monitoring an obligation. We propose a monitoring scheme that acquires data from the device. Then, using this data, we built and trained a novel incremental interconnected model for the estimation of characteristic variables of the system.
The unique structure of the model streamlines the otherwise difficult process of modeling an SoC, making it relatively easier and more focused on the actual use of the model rather than on building it. The model is composed of subsystems, each built to estimate one variable, which makes fault isolation easier. Moreover, this interconnected structure gives the model builder the freedom and flexibility to adapt the model to the use case by including new subsystems, hence enlarging the scope of the model, or by omitting some of them to focus on certain variables. The process of building each of the subsystems was also detailed, from the reverse-engineering of the algorithms to the use of data-based techniques (neural net and regression).
Experimental results validated all of the subsystems, and the incremental model delivers estimations on the fly. Furthermore, our NARX power model has an accuracy of 97.12%, one of the highest in the literature. The choice of a nonlinear neural net to model power consumption proved its merit, since the NARX power model outperforms the regression-based models because it accounts for the nonlinearity in the changes of power consumption. Finally, the ARMAX model's temperature results are accurate and on par with established models.
In the monitoring part, we presented a fault diagnosis module, aimed at the early detection of drifts from estimated characteristics. The fault diagnosis module uses residuals to generate alarms in case of faults. These residuals are generated by comparing estimations made by the developed model with readings from the system. To test the fault diagnosis module, it was first trained and validated during standard use. Then, three failure scenarios were tested, each to simulate a different fault. The fault diagnosis module quickly detected, localized, and isolated these faults. Thus, it is experimentally validated. In addition to fault detection, the information collected on the trend of the characteristic drift can be used for analysis and modeling of degradation phenomena in CPU-GPU chips, which will be discussed in future works.