Prediction-Correction Techniques to Support Sensor Interoperability in Industry 4.0 Systems

Industry 4.0 is envisioned to transform the entire economic ecosystem by the inclusion of new paradigms, such as cyber-physical systems or artificial intelligence, into production systems and solutions. One of the main benefits of this revolution is the increase in the production systems' efficiency, thanks to real-time algorithms and automatic decision-making mechanisms. However, at the software level, these innovative algorithms are very sensitive to the quality of the received data. Common malfunctions in sensor nodes, such as delays, numerical errors, corrupted data or inactivity periods, may cause a critical problem if an inadequate decision is made based on those data. Many systems remove this risk by seamlessly integrating the sensor nodes and the high-level components, but this situation substantially reduces the impact of the Industry 4.0 paradigm and increases its deployment cost. Therefore, new solutions that guarantee the interoperability of all sensors with the software elements in Industry 4.0 solutions are needed. In this paper, we propose a solution based on numerical algorithms following a predictor-corrector architecture. Using a combination of techniques, such as the Lagrange polynomial and Hermite interpolation, data series may be adapted to the requirements of Industry 4.0 software algorithms. Series may be expanded, contracted or completed using predicted samples, which are later updated and corrected using the real information (if received). Results show the proposed solution works in real time, substantially increases the quality of data series and reduces the error probability in Industry 4.0 systems.


Introduction
The worsening of major global crises, such as the climate crisis or the natural resource crisis, makes a change in the productive schemes of all countries essential, especially in those with a relevant industrial sector [1]. Increasing the efficiency of all industrial production processes is the only way to optimize the use of resources, support citizens' wellbeing and strengthen social development [2]. Industry 4.0 is an innovative paradigm referring to this new era [3].
In Industry 4.0, production systems and solutions implement mechanisms to make flexible, automatic and real-time decisions [4] that guarantee the adaptation of production processes to the variable behavior of economic, social and physical contexts [5]. With this approach, the global efficiency of industry has proved to increase significantly [6]. Paradigms, such as cyber-physical systems [7] or artificial intelligence [8], are basic to enable this new era, although all these monitoring mechanisms and decision-making algorithms are supported by a common technology: sensor nodes [9].
Using the sensor data, high-level software modules may create models to represent (and later predict) the evolution of the production processes and the industry context, make real-time business decisions and (even) trigger alarms about workers' safety and wellbeing [10]. Although good results have been reported, this approach needs the high-level, decision-making modules to be adapted as well, so Industry 4.0 implantation costs and barriers tend to be higher. The solution proposed in this paper addresses this challenge.
In semantic architectures, data adaptation is also critical. Different mechanisms to adapt and transform the different semantic standards into any other data format may be found [44]. Besides, ontologies to allow semantic data processing have been reported [45]. Contrary to the proposed solution, these schemes cannot be employed to protect the Industry 4.0 system against corrupted data, malfunctions, etc.
On the other hand, data-curation mechanisms are not explicitly addressed, as a seamless integration between hardware and software components [34,46] is the preferred approach in the literature. Hard and complex calibration processes are usually considered [6] to make the processing algorithms aware of the sensor nodes' biases. Besides, computationally heavy schemes to compensate different effects (such as redundant data) based on previous observations and offline processing may be found [47]. However, none of these proposals enables sensor interoperability (they are totally application-specific); on the contrary, they hinder it. Moreover, they are not flexible or dynamic solutions and, of course, they cannot be executed in real time (essential requirements in Industry 4.0).
Only a few proposals on actual data curation have been reported. In this area, most contributions are focused on outlier detection [11]. Using different techniques, datasets are transformed, and anomalous data are removed. Techniques based on digital encoders [48], machine learning [49], statistical indicators [50], performance indicators [16] or hybrid approaches [51] have been described. Although these schemes are useful, they cannot be employed in real time, and many other potential malfunctions, such as packet losses, cannot be addressed through these solutions. On the other hand, mechanisms based on signal-processing techniques may be found [15]. In these solutions, data are understood as communication signals, and they are curated based on tools such as the complex envelope. This approach may operate in real time and may correct and curate all kinds of malfunctions; however, it only considers one criterion to propose a curated data series. Thus, the error introduced by the curation algorithm is very variable, depending on how similar the sensor data under curation are to a communication signal. In some Industry 4.0 scenarios, this error may be too high to be acceptable.
Finally, some generic proposals on data analysis may be employed to support data curation in Industry 4.0. For example, algorithms to classify time series in an automatic and more flexible manner [52] have been reported. If only two labels (valid and invalid) are defined, this scheme could be employed for data curation. However, it cannot be employed in real time, and it does not enable the correction of errors in data series.
Contrary to all these previous proposals, the solution described in this paper may operate in real time, as it only operates with a limited amount of data. It is flexible and adaptable to all scenarios, as it does not depend on the sensor technology or software algorithms to be employed. Besides, all kinds of malfunctions can be curated, and up to four different potential curated data series are analyzed before selecting the most probable one.

Proposed Predictor-Corrector Solution
In this section, the proposed data curation mechanism, based on a predictor-corrector approach, is presented. Section 3.1 describes the general statistical framework to calculate and obtain the curated data series. Section 3.2 presents the different approaches to calculate the actual curated data flows, even in real time. Section 3.3 describes the models to analyze and estimate the different data malfunctions that may appear in Industry 4.0 solutions. Finally, Section 3.4 presents the mechanisms to update the curated data series if real data from Industry 4.0 is received in the future.

General Mathematical Framework and Curation Strategy
Given an Industry 4.0 platform T (1), a set of N different generic sensor nodes S_i generate N independent data series y_i[n]. These data series suffer different malfunctions, and they are received by the high-level software modules as a different set of N data series x_i[n]. These malfunctions are represented as a collection of L different functions λ_l (2) transforming the original series y_i[n], generated by the sensor nodes, into the received time series x_i[n]. Although other sensing patterns could be applied in industrial scenarios, samples are periodically generated and sent to the high-level software modules for real-time monitoring and decision making. Samples in the S_i sensor node are produced every T_i seconds (3).
Statistically, each data series x_i[n] is a realization of a stochastic process φ_i (4), where Ω is the universe of possible values ω_k generated by the sensor node S_i. This universe is a subset of the field of real numbers R (5). It is discrete and strictly depends on the hardware capabilities of the sensor node, and it is analyzed and reported by manufacturers.
For each different time instant n 0 , the stochastic process φ i transforms into a different random variable X i [ω] (6) with some specific statistical properties.
However, in Industry 4.0 scenarios, physical variables evolve much more slowly than the sampling period T_i; i.e., the highest frequency of the physical signals f_max is much lower than the sampling frequency f_s^i (7).
In this context, for any time instant n_0, it is possible to define an open time interval B_s around n_0 with radius ε (8), where the random variables show equivalent statistical properties for all time instants. Thus, we are assuming the stochastic process φ_i is locally first-order stationary in B_s (9).
Given a data series x_i[n], if data in the time interval [n_1, n_2] should be curated, an expanded time interval [n_1^e, n_2^e] (10) must be considered, so that it contains the original interval [n_1, n_2] where data must be curated, but is included in the open time interval B_s (the stochastic process must be stationary in the interval).
This time interval [n_1, n_2] may refer to a past time period (11) (so we are performing an offline data curation), but we can also perform a real-time data curation if the current time instant n_0 belongs to the time interval [n_1, n_2] under study (12). If operating in real time, the proposed curation solution is employed as a prediction mechanism for calculating future samples in advance. If offline data curation is performed, the algorithm may be run as fast as possible, while in real-time data curation, the process is synchronized with the sampling period T_i, so one sample is curated at each time instant, although as many samples as desired may be predicted with each new sampling period T_i.

n_0 ∉ [n_1, n_2] : n_2 < n_0 (11)
n_0 ∈ [n_1, n_2] : n_1 ≤ n_0 < n_2 (12)

The problem of data curation is to find a new time series x*_i[n] in the interval [n_1, n_2], so that it represents in a more precise way (compared to the original time series x_i[n]) the real situation of the Industry 4.0 system, represented by time series y_i[n]. As many random effects impact this study, a probabilistic approach is the most adequate, so this condition transforms into a comparison between two different probabilities, p*_i and p_i (13). Hereinafter, P(·) is the probability function, calculating the probability of a predicate being true. Figure 1 shows the proposed algorithm to find that time series x*_i[n], fulfilling the previous condition (13), if it exists.

As can be seen (in the initial prediction phase), first, a set of C suitable candidates X_c to be that curated time series x*_i[n] are calculated (14). To calculate those candidates, different techniques are employed, based on interpolation mechanisms. In particular, five different techniques are considered: Newton's divided differences, Hermite interpolation, splines, Taylor interpolation and the Lagrange polynomial. The purpose of this approach is to guarantee the curated time series x*_i[n] is continuous and coherent with samples outside the curation interval [n_1, n_2]. Using the previously curated data in the expanded interval [n_1^e, n_2^e], a collection of possible curated time series x*_i[n] are calculated, considering that all samples define a continuous function. For each one of these candidates X_c, statistical theory (Bayes' theorem) is then applied to obtain probability p*_i. If, for any candidate, the curation condition (13) is met, that candidate X_c is selected as the curated time series x*_i[n]. Otherwise, and depending on how different probabilities p*_i and p_i are, the time series may remain as is, or the curated data series may be obtained as a combination of the most probable candidates and the original data flow x_i[n].
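The selection logic of this prediction phase can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: `candidate_probability` is a hypothetical callable standing in for the Bayesian computation of p*_i, and the 10% margin mirrors the threshold used later in the final data-generation step.

```python
def select_curated_series(candidates, p_i, candidate_probability, margin=0.10):
    """Return (curated series, probability) or (None, p_i) if no candidate wins.

    candidates: list of candidate series X_c (lists of samples).
    p_i: probability that the received series is already correct.
    candidate_probability: hypothetical callable X_c -> p*_i (Bayes' theorem).
    margin: relative improvement required to replace the original data.
    """
    best, best_p = None, p_i
    for X_c in candidates:
        p_star = candidate_probability(X_c)
        if p_star > best_p:
            best, best_p = X_c, p_star
    # Only replace the original series if the improvement is clear
    if best is not None and (best_p - p_i) > margin * p_i:
        return best, best_p
    return None, p_i
```

If no candidate clearly beats p_i, the original series is kept (or, as described above, blended with the most probable candidates).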
If a real-time data curation is performed, new information about the curation interval [n 1 , n 2 ] is received at each sampling period T i . In that case, a correction phase is carried out. In this phase, the new sample is compared to the predicted one, and (depending on how different they are) different actions are taken to correct the curated time series initially calculated.
To solve this problem, both probabilities p*_i and p_i must be obtained. Probability p_i represents the fact that the received data x_i[n] are exactly those data generated by the sensor nodes y_i[n]. In other words, no malfunction (of any kind) has occurred in the interval [n_1, n_2]. In our model, that means all functions λ_l are the identity function (15). As all the malfunctions are physically independent, they are also statistically independent, and the joint probability may be rewritten as a product of unidimensional probabilities (16). Section 3.3 analyzes how to evaluate those probabilities for each one of the considered malfunctions.
On the other hand, probability p*_i is more complicated to calculate, and Bayes' theorem is employed (17).
To apply this theorem, three different probabilities, p_cont, p_mal and p_rx, must be obtained. Probability p_mal is the probability that functions λ_l (representing the malfunctions) transform data series x*_i[n] into series x_i[n] in the interval [n_1, n_2]. As said before, this probability may be written as a product of L different unidimensional probabilities (18). Section 3.3 analyzes how to evaluate those probabilities for each one of the considered malfunctions.
Probability p_rx is the probability of receiving the sequence x_i[n] in the interval [n_1^e, n_2^e]. This value may be easily calculated using the probability function of the random variable (and stochastic process) φ_i(n, ω) in the time interval B_s (19). In this case, once again, we are considering samples as independent events, so the joint probability may be rewritten as a product.

Finally, probability p_cont is the probability that the Industry 4.0 system's evolution follows a continuous and coherent flow. In this case, we are evaluating how probable it is that the data series x*_i[n] shows certain values in the interval [n_1, n_2], considering the other data received in the expanded time interval [n_1^e, n_2^e]. As the stochastic process is stationary in the interval B_s, the probability distribution g_i^0 in the interval [n_1, n_2] and the distribution g_i^e in the expanded time interval [n_1^e, n_2^e] must be identical. As both distributions become different, the probability of series x*_i[n] being the best candidate for the curated series is reduced. To calculate how different these two distributions are, we employ the traditional function scalar product and the Lebesgue integral (20). However, in this case, as the universe under study is discrete, the Lebesgue integral may be approximated by a common sum. Thus, considering the distance function induced by the function scalar product, we can calculate the distance d_g between distributions g_i^0 and g_i^e (21). Finally, to calculate the probability p_cont, we must apply a function transforming values in the interval [0, ∞) into the interval [0, 1] (22).
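As a minimal sketch of this step: over a discrete universe, the distance induced by the function scalar product reduces to a Euclidean distance between the two distributions. Since the exact mapping from [0, ∞) to [0, 1] is not fixed above, exp(−d_g) is used here as one assumed choice, which yields 1 for identical distributions.

```python
import math

def distribution_distance(g0, ge):
    """Distance d_g induced by the discrete scalar product.

    g0, ge: discrete probability distributions over the same support,
    given as equal-length lists of probabilities.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(g0, ge)))

def p_cont(g0, ge):
    """Map d_g in [0, inf) to a probability in [0, 1].

    exp(-d) is an assumed mapping, not necessarily the authors' exact
    function: it is monotone decreasing and equals 1 when g0 == ge.
    """
    return math.exp(-distribution_distance(g0, ge))
```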
Then, to enable the calculation of probabilities p_cont and p_rx, we have to model the probability distribution of the stochastic process φ_i within the interval B_s.
We now define the operator C(·, ·) within the universe Ω (23). Basically, this operator indicates the number of elements in the universe that lie between two provided values, card{·} being the standard cardinality operator. This operator is coherent, as Ω is a subset of the field of real numbers, where a strict order relation is defined. It is a positive operator, as its target set is the set of natural numbers N ∪ {0}.
Now, we are assuming the stochastic process φ_i is also locally ergodic in B_s. Thus, and according to the Birkhoff ergodic theorem, the additions A_m of the composite function C ∘ φ_i|B_s (φ_i restricted to B_s) converge "almost surely" to the statistical expected value of the composite function C ∘ I_Ω (24), where I_Ω is the identity function in the universe Ω.
Now, we are considering a partition Π_Ω of the universe Ω, composed of Δ different subsets π_i (25). All subsets π_i have the same measure, understood as the Lebesgue measure (26).
If we restrict the previous operator C(·, ·) to any subset π_i, C|π_i, and evaluate this operator at the limits (a_i, b_i) of this set π_i, the statistical expected value of the composite function C|π_i ∘ I_π_i is "almost surely" identical to the expression for the Laplace rule employed to calculate the probability of an event (27).
In this case, the event under study is that a sensor node generates a sample belonging to π_i in the time interval B_s. In conclusion, we are studying the probability distribution of the stochastic process φ_i in the interval B_s.
We are now defining a function f (·) associating the mean point σ i of every subset π i with the additions A m (28). In other words, through the additions A m , we are generating a discrete probability function f (·), which "almost surely" converges to the actual probability distribution of the stochastic process φ i in the interval B s and the points σ i (29).
To calculate the probability distributions g_i^0 and g_i^e, the same process as described before may be employed, but considering the proper time interval and a new universe Σ composed of the points σ_i (30). Probability p_rx can be directly obtained using function f(σ_i).
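A minimal sketch of this empirical estimation, assuming the universe is partitioned into Δ equal-measure subsets and the Laplace rule is applied by counting samples in each subset (function names and the clamping of the right edge are illustrative choices):

```python
def empirical_distribution(samples, x_min, x_max, delta):
    """Estimate f(sigma_i): partition [x_min, x_max) into `delta` subsets
    of equal measure and apply the Laplace rule (favourable / total cases).

    Returns (midpoints sigma_i of each subset, probabilities f(sigma_i)).
    """
    width = (x_max - x_min) / delta
    counts = [0] * delta
    for s in samples:
        # Index of the subset containing the sample; clamp the right edge
        k = min(int((s - x_min) / width), delta - 1)
        counts[k] += 1
    total = len(samples)
    midpoints = [x_min + (k + 0.5) * width for k in range(delta)]
    probs = [c / total for c in counts]
    return midpoints, probs
```

By ergodicity, the time averages computed this way converge "almost surely" to the actual distribution of φ_i in B_s.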

Candidates to Curated Time Series: Calculation
The first step to improve the quality of the time series x_i[n] produced by sensor nodes S_i in Industry 4.0 systems is to find the candidate series X_c to be the curated flow we are looking for. Initially, any series X_c could be a candidate, and probability p*_i will be the indicator to select the final curated series x*_i[n]. However, this approach is almost impossible to implement in practice, as the universe of time series in the curation interval [n_1, n_2] is infinite. Moreover, as this is not a free mathematical problem, some physical restrictions inherited from the Industry 4.0 system we are modeling must be considered.
First, sensor nodes have an operational range [x_min, x_max], which introduces a hard restriction: no candidate X_c with samples outside the interval [x_min, x_max] can be the final curated series x*_i[n] (31).
Second, Industry 4.0 systems monitor physical processes, which are continuous and smooth (as natural variables), so no gaps or abrupt changes may appear in the curated time series. In this context, the curated series x*_i[n] in the interval [n_1, n_2] must be continuous and coherent with the time series in the surroundings of this interval, i.e., in the expanded time interval [n_1^e, n_2^e]. In that way, the analytic function describing the evolution of the Industry 4.0 system in the curation interval [n_1, n_2] must also be able to describe the system evolution in the expanded interval [n_1^e, n_2^e]. To apply this restriction, the best way to calculate the candidates X_c is using interpolation techniques. Different interpolation techniques may generate different candidates X_c, so in this work, we are considering the most powerful, popular and well-behaved interpolation solutions: Newton's divided differences, Hermite interpolation, splines, Taylor interpolation and the Lagrange polynomial.
These techniques are evaluated using the E points x_i^ext[n], which belong to the expanded interval [n_1^e, n_2^e] but are not included in the curation interval [n_1, n_2] (as data in the curation interval may be wrong and introduce false information into our algorithm) (32).
Candidate X_1[n] is obtained using Newton's divided differences technique. In this case, as the independent variable is the discrete time n, the traditional expressions for Newton's interpolation are slightly modified. Specifically, given the E points in x_i^ext[n], candidate X_1[n] is a polynomial of order E − 1 (33). The coefficients (named divided differences) may be easily calculated using simple mathematical operations, which reduces the computational time, enabling real-time operation (34).
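As a sketch of this calculation, the standard in-place divided-differences table and a Horner-style evaluation can be written as follows (a textbook formulation, not the paper's exact expressions):

```python
def newton_divided_differences(n_pts, x_pts):
    """Coefficients of Newton's interpolating polynomial.

    n_pts: time instants; x_pts: sample values. Returns the list of
    divided differences [f[n0], f[n0,n1], f[n0,n1,n2], ...].
    """
    coef = list(x_pts)
    m = len(n_pts)
    for j in range(1, m):
        # Update the table in place, from the bottom up
        for k in range(m - 1, j - 1, -1):
            coef[k] = (coef[k] - coef[k - 1]) / (n_pts[k] - n_pts[k - j])
    return coef

def newton_eval(coef, n_pts, n):
    """Evaluate the Newton polynomial at time instant n (Horner-like)."""
    result = coef[-1]
    for k in range(len(coef) - 2, -1, -1):
        result = result * (n - n_pts[k]) + coef[k]
    return result
```

For samples lying on x[n] = n^2, the polynomial reproduces the parabola exactly, so predicted samples outside the known points follow the same law.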
Candidate X_2[n] is calculated through the Lagrange polynomial interpolation algorithm. In this case, the candidate is just a linear combination of the data in sequence x_i^ext[n] (35). Besides, in this case, the order β of the interpolation polynomial may be selected (if it is lower than E, the number of points in x_i^ext[n]). In general, polynomials with order above six are not suitable (because they present unnatural fluctuations), but this parameter is free to be selected according to the Industry 4.0 system under study.
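A direct (if naive) evaluation of this linear combination, using all supplied points rather than a selected order β, can be sketched as:

```python
def lagrange_interpolate(n_pts, x_pts, n):
    """Lagrange interpolation: the candidate value at time instant n is a
    linear combination of the samples weighted by the Lagrange basis
    polynomials evaluated at n."""
    total = 0.0
    for j, (nj, xj) in enumerate(zip(n_pts, x_pts)):
        basis = 1.0
        for r, nr in enumerate(n_pts):
            if r != j:
                basis *= (n - nr) / (nj - nr)
        total += xj * basis
    return total
```

Restricting to an order β < E simply means passing only β + 1 of the available points, which avoids the unnatural fluctuations of high-order polynomials mentioned above.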
The third candidate X_3[n] is obtained using the Hermite interpolation theory. In this case, besides the sequence x_i^ext[n], it is also necessary to know the value of the first-order derivative ẋ_i^ext[n] at the points n_j^ext. When managing discrete-time sequences, this may be easily calculated using first-order finite differences. In general, we use the central difference (36), as it presents a much lower error. However, if either time point n_{j−1}^ext or time point n_{j+1}^ext does not exist, we can employ the forward difference (37) or the backward difference (38), respectively (although a higher numerical error is introduced). If both time instants n_{j−1}^ext and n_{j+1}^ext do not exist, the first-order derivative cannot be calculated for instant n_j^ext. In that case, that point n_j^ext is not considered to calculate the candidate sequence X_3[n].
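The derivative-estimation rule described above (central difference where possible, forward or backward at the edges, undefined for isolated points) can be sketched as:

```python
def finite_difference_derivatives(n_pts, x_pts):
    """First-order derivative estimates for Hermite interpolation.

    Uses the central difference when both neighbours exist, and the
    forward / backward difference at the edges. Returns None for an
    isolated point, which would then be discarded from the candidate.
    """
    m = len(n_pts)
    deriv = []
    for j in range(m):
        if 0 < j < m - 1:           # central difference (lower error)
            d = (x_pts[j + 1] - x_pts[j - 1]) / (n_pts[j + 1] - n_pts[j - 1])
        elif j == 0 and m > 1:      # forward difference
            d = (x_pts[1] - x_pts[0]) / (n_pts[1] - n_pts[0])
        elif j == m - 1 and m > 1:  # backward difference
            d = (x_pts[j] - x_pts[j - 1]) / (n_pts[j] - n_pts[j - 1])
        else:                       # isolated point: derivative undefined
            d = None
        deriv.append(d)
    return deriv
```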
Given the Lagrange polynomial L_r[n] (39) and its first-order derivative L̇_r[n], the Hermite interpolated sequence may be calculated through an osculating polynomial (40). This approach generates high-quality candidates, which may integrate large amounts of points with a reduced computational cost.
Candidate X_4[n] is based on Taylor's interpolation. Formally, this approach only requires one sample at the time instant n_taylor^ext, so it is a very good candidate for the initial moments of the Industry 4.0 system operation, when the collected data are very limited. In practice, however, this method requires the use of the successive r-th derivatives. They can be easily obtained using the central, forward or backward differences we already described (36)-(38), but this needs some additional samples. In this approach, the order β of the interpolation polynomial can also be selected. Therefore, in general, for a given order β, this method needs between β + 1 and β + 2 samples. As said before, polynomials with order above six show some unnatural variations. On the other hand, for very low values of β, the numerical error is also high. A balance between both factors must be reached.
In this context, candidate X 4 [n] may be easily obtained (41).
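Assuming the successive derivatives at the expansion point have already been estimated by finite differences, the Taylor candidate can be evaluated as the usual truncated series (a generic Taylor expansion, not necessarily the paper's exact formulation of (41)):

```python
def taylor_candidate(n0, derivs, n):
    """Taylor interpolation: expand the series around time instant n0.

    derivs: [x(n0), x'(n0), x''(n0), ...] up to the selected order beta,
    e.g. estimated through finite differences on neighbouring samples.
    """
    result, power, factorial = 0.0, 1.0, 1.0
    for r, d in enumerate(derivs):
        if r > 0:
            power *= (n - n0)       # (n - n0)^r, built incrementally
            factorial *= r          # r!
        result += d * power / factorial
    return result
```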
Finally, candidate X_5[n] is obtained through splines. Using the splines technique, the candidate is just a segmented polynomial (42).
This polynomial may have different orders (from one to three), but it is well known that cubic splines are the solution generating the best candidates [53] (they are smooth, contrary to linear splines, and they adapt to a larger range of system behaviors). For each pair of time instants n_j^ext, n_{j+1}^ext, a new cubic polynomial X_5^j[n] is defined (43). The coefficients of this polynomial X_5^j[n] are obtained by imposing continuity and smoothness conditions (44). In this proposal, we are using natural splines, so for any point n_j^ext where it is impossible to calculate the first-order derivative, these values are considered null (zero).
This final candidate is more computationally costly to calculate, as we are introducing a system of linear equations that must be solved to obtain the final expression for the candidate. However, this method creates high-quality, curated data series (with a reduced error), and (currently) linear equations are easily solved using numerical methods (especially in strong cloud infrastructures or Industry 4.0 software platforms).
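For reference, the natural cubic spline can be computed with the classic tridiagonal algorithm. This sketch follows the standard textbook formulation (one cubic per segment, second derivatives forced to zero at both end points), which is an assumption about how the system of linear equations mentioned above is solved:

```python
def natural_cubic_spline(n_pts, x_pts):
    """Natural cubic spline: one cubic per segment, with the second
    derivative forced to zero at both ends. Returns a list of
    (a, b, c, d, n_j) tuples so that, on [n_j, n_{j+1}],
    X[n] = a + b*(n - n_j) + c*(n - n_j)**2 + d*(n - n_j)**3."""
    m = len(n_pts) - 1                      # number of segments
    h = [n_pts[j + 1] - n_pts[j] for j in range(m)]
    # Right-hand side of the tridiagonal system for the c coefficients
    alpha = [0.0] * (m + 1)
    for j in range(1, m):
        alpha[j] = 3 * ((x_pts[j + 1] - x_pts[j]) / h[j]
                        - (x_pts[j] - x_pts[j - 1]) / h[j - 1])
    # Forward sweep (Thomas algorithm)
    l, mu, z = [1.0] + [0.0] * m, [0.0] * (m + 1), [0.0] * (m + 1)
    for j in range(1, m):
        l[j] = 2 * (n_pts[j + 1] - n_pts[j - 1]) - h[j - 1] * mu[j - 1]
        mu[j] = h[j] / l[j]
        z[j] = (alpha[j] - h[j - 1] * z[j - 1]) / l[j]
    # Back substitution; natural boundary: c_0 = c_m = 0
    c = [0.0] * (m + 1)
    b, d = [0.0] * m, [0.0] * m
    for j in range(m - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = ((x_pts[j + 1] - x_pts[j]) / h[j]
                - h[j] * (c[j + 1] + 2 * c[j]) / 3)
        d[j] = (c[j + 1] - c[j]) / (3 * h[j])
    return [(x_pts[j], b[j], c[j], d[j], n_pts[j]) for j in range(m)]
```

The tridiagonal system is solved in linear time, so even this, the most expensive of the five candidates, remains compatible with real-time operation on a capable platform.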

Malfunction Modeling
The calculation of probabilities p_i and p_mal is directly associated with the functions λ_l, which model the data malfunctions in the Industry 4.0 platform. In this proposal, four different malfunctions are addressed: jitter (including inactivity periods in hardware nodes), communication delays, numerical errors in microprocessors and electromagnetic interferences (including data losses and transmission errors). As said before, to calculate p_i, we must consider the probability of all these functions λ_l being the identity function I_Ω (null effect), while probability p_mal is obtained considering (for each candidate) the situation when λ_l(X_c[n]) = x_i[n].
Jitter J (ξ) is probably the most harmful malfunction among all the ones considered in this proposal. Jitter is the maximum fluctuation in the communication delays, transmission periods or clock synchronization that causes samples x i [n] to be randomly ordered compared to the original ones in y i [n] (45). Jitter is represented by a function τ jitter [k] taking values in the interval [0, ξ k ], where ξ k is a realization (different for each value of k) of a random variable J (ξ) taking values in a continuous universe: the field of positive real numbers (46). Because of jitter, no samples may be received for long periods, while in other moments, large amounts of samples may be received and interfere.
Total jitter J (ξ) in Industry 4.0 systems is, actually, the composition of two different and independent sources: random jitter J random (ξ) and deterministic jitter J deter (ξ) (47). Operator * represents the convolution.
Deterministic jitter is caused by three different additive effects (well known and modeled through deterministic expressions): fluctuations in the clock periods and edges, variations in the data packet length, and sleep periods in the sensor nodes or the communication channels [54]. Although these three effects are deterministic, they also depend on random variables, such as the clock duty cycle and frequency, the packet length, and the duration of the blocking, respectively. In conclusion, deterministic jitter J_deter(ξ) follows a Gaussian distribution, according to the central limit theorem (48), with mean value m_det and standard deviation s_det.
On the other hand, many other random and uncontrolled effects, such as thermal oscillations, flicker or shot noise, may result in levels of jitter that cannot be predicted or calculated in any way. This is known as random jitter J_random(ξ) and, according to the central limit theorem, it also follows a Gaussian distribution (49) with mean value m_ran and standard deviation s_ran.
The convolution of these two Gaussian distributions is a third Gaussian distribution (50). The mean value m and the standard deviation s depend on the scenario and Industry 4.0 system, but most modern 5G solutions establish the mean value m around 1 millisecond, with the standard deviation s one order of magnitude lower. These values are aligned with the expected performance of ultra-reliable low-latency communications (URLLC) in 5G networks and scenarios [55].
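The parameters of the total jitter distribution follow directly from the standard result for the convolution of two Gaussians (means add, variances add), which can be sketched as:

```python
import math

def total_jitter_params(m_det, s_det, m_ran, s_ran):
    """Parameters of the total jitter distribution: the convolution of
    the deterministic and random Gaussian components is itself Gaussian,
    with summed means and root-sum-squared standard deviations."""
    m = m_det + m_ran
    s = math.sqrt(s_det ** 2 + s_ran ** 2)
    return m, s
```

The example values below (0.5 ms per component) are illustrative only; they reproduce the ~1 ms total mean cited above for URLLC-class links.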
Communication delays D(ξ) are not as harmful as jitter, as they are a linear transformation (51), but they may cause delayed decisions and other similar problems. Like all malfunctions, delays are a random effect, and they are described by a random variable taking values in a continuous universe: the field of positive real numbers (51).
In this case, we are decomposing the total delay D(ξ) in three different contributions (52): delay in the output queue (sensor node) D out (ξ), transmission delay D tran (ξ), and delay in the input queue (software module) D in (ξ).
Both delays associated with queues are formally identical. Traffic theory allows us to obtain the probability distribution of both components D_out(ξ) and D_in(ξ). In both cases, we are using a Poisson model M/M/1/P (in Kendall notation), where the sample generation rate ψ and the serving rate η follow a Poisson distribution, Ψ is the mean sample generation rate, Θ is the mean serving time and Γ is the time period taken as reference (typically one hour). P is the number of samples allowed in the system (53).
If we assume a FIFO (first in, first out) managing strategy for both queues in our model, we can define a Markov chain for describing the queues state. In that situation, the traffic theory and statistical laws define the queueing delay as an exponential distribution (54).
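The exponential form of this queueing delay can be sketched as follows; the rate η − ψ (serving rate minus generation rate, valid when η > ψ) is the standard M/M/1 delay parameterization and is assumed here to match Equation (54):

```python
import math

def queueing_delay_cdf(t, serving_rate, generation_rate):
    """CDF of the FIFO M/M/1 queueing delay: an exponential distribution
    with rate (eta - psi). Only meaningful for a stable queue, eta > psi."""
    rate = serving_rate - generation_rate
    return 1.0 - math.exp(-rate * t)
```

For example, with η = 10 samples/s and ψ = 5 samples/s, half of the samples are delayed less than ln(2)/5 ≈ 0.14 s.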
Regarding the transmission delay D tran (ξ), several different random and unknown variables affect the final value: data packet length, channel capacity, physical configuration of the scenario, etc. Therefore, and considering the central limit theorem, the probability distribution of the transmission delay is a Gaussian function with mean value m d and standard deviation s d (55).
Numerical errors in microprocessors are caused by the limited precision of hardware components. Basically, these errors are associated with two different modules: the analog-to-digital converter (ADC) and the arithmetic combinational modules. In the ADC, because of the quantization process, samples are irreversibly modified. Because of the arithmetic combinational modules, operations with large numbers may be truncated to the maximum number admissible in the microprocessor, ρ_max.
In any case, the numerical error N (ξ) is an additive effect (56), composed of two different sub-effects: the ADC error N ADC (ξ) and the arithmetic error N ari (ξ) (57). In this case, random variable N (ξ) takes values in the universe of real numbers (56).
Regarding the quantization error in the ADC, we are assuming the sensor node is configured according to the manufacturer's restrictions, and the analog signal y_i(t) being digitized is limited in amplitude to the operating range [−Y_max^ADC, Y_max^ADC] of the ADC device. In this context, given an ADC device with u intervals of amplitude Σ units (58), the error will be restricted to the interval [−Σ/2, Σ/2] (the maximum error appears when the sample is exactly in the middle of an ADC interval) (59). All values within the proposed interval have the same probability, so the distribution is uniform (60).
On the other hand, errors caused by the limited precision of arithmetic devices, N_ari(ξ), may take any value, as there is no superior limit. However, all these values do not have the same probability, as lower errors are much more probable than higher errors (input parameters are also limited, and system designers usually also consider this problem in their code). In particular, we propose that the arithmetic error follows an exponential distribution (61). If we assume the maximum number the microprocessor can represent is ρ_max, the worst situation (where the arithmetic error is highest) is the addition of two parameters with this maximum value ρ_max. The result, then, presents a maximum error of ρ_max units. Thus, the mean value of our exponential distribution is ρ_max (and the distribution is totally defined).
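The additive composition of both sub-effects can be sampled as sketched below, assuming a uniform quantization error over one ADC step and an exponential arithmetic error with mean ρ_max, as modeled above (the function name and the injectable `rng` are illustrative choices):

```python
import random

def sample_numerical_error(step, rho_max, rng=random):
    """Draw one realization of the numerical error N(xi): a uniform ADC
    quantization error in [-step/2, step/2) plus an exponential
    arithmetic error with mean rho_max (both additive)."""
    n_adc = rng.uniform(-step / 2.0, step / 2.0)
    n_ari = rng.expovariate(1.0 / rho_max)   # expovariate takes rate = 1/mean
    return n_adc + n_ari
```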
Finally, electromagnetic interferences are responsible for two main problems: sample losses and data corruption. At the data level, electromagnetic interferences manifest as a bit error rate (BER). If, because of the BER, the number of corrupted bits in a sample x_i[n] is above a limit γ_lim, the sample is corrupted, rejected and not received. If the number of corrupted bits is below that limit γ_lim, the sample x_i[n] is corrupted by an additive effect but is still received (62). Hereinafter, we consider Λ as the length (in bits) of samples. This is a random variable with a uniform distribution in the interval [0, Λ_max].
The calculation of the additive distortion ξ is also based on the length of samples Λ (in bits). The error, then, must be included in the interval [−(2^Λ − 1), 2^Λ − 1], as 2^Λ − 1 is the maximum value that can be represented using Λ bits. Errors have the same probability in all bits, so the probability distribution of distortion values ξ is uniform.
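The corruption-and-rejection behavior can be sketched as follows. This is a hedged Python illustration of the model, assuming samples are unsigned integers and that bit errors act as independent flips on the binary representation (the function name and this flip mechanism are our assumptions, not taken from the paper):

```python
import random

def corrupt_sample(value, n_bits, ber, gamma_lim):
    """Flip each of the n_bits of an unsigned-integer sample with
    probability ber. If more than gamma_lim bits are corrupted, the
    sample is rejected (returns None); otherwise the distorted sample
    is still received, modified by an additive (XOR) effect (62)."""
    flipped = [i for i in range(n_bits) if random.random() < ber]
    if len(flipped) > gamma_lim:
        return None  # sample rejected: treated as a loss
    distorted = value
    for i in flipped:
        distorted ^= 1 << i  # distortion on bit i
    return distorted
```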
Finally, we must calculate the value of the BER using the level of electromagnetic interferences. Given the signal power per bit E_b and the power of interferences N_0, signal theory establishes that the BER is obtained through the complementary error function erfc(·) (65).
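For illustration, the classical closed form for a BPSK-like link, BER = ½ erfc(√(E_b/N_0)), can be evaluated directly; we assume this is the expression behind (65), although the exact formula depends on the modulation scheme:

```python
import math

def bit_error_rate(eb, n0):
    """BER for a BPSK-like link: 0.5 * erfc(sqrt(Eb/N0))."""
    return 0.5 * math.erfc(math.sqrt(eb / n0))
```

As expected, increasing the signal power per bit E_b (or lowering the interference power N_0) monotonically reduces the BER.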

Final Data Generation and Correction Step
After calculating the probabilities p*_i and p_i and selecting the candidate X_c[n] associated with the highest probability p*_i among all five calculated candidates, different situations may occur. First, if offline data curation is being performed, no new information is expected after these calculations, so the results are final. However, three possibilities are considered:

•	If the probability p*_i is clearly higher than p_i, the candidate X_c[n] associated with this probability p*_i is selected as the curated data series x*_i[n], and the initially received information x_i[n] is deleted. In this proposal, we consider a difference higher than 10% (66) as the limit to take this action.
Second, if real-time data curation is being performed, there will be a future time instant n_fut, belonging to the expanded interval [n_e1, n_e2], whose associated sample x_i[n_fut] is not available at the time instant n_0 when the curated series is obtained but is received later. This new information may affect the previous calculation x*_old,i[n], so a correction phase is considered to update the previous results.
First, the new sample x_i[n_fut] is compared to the previously obtained curated sample x*_old,i[n_fut]. Then, if the difference is large enough (above 10%) (70), the whole curation process is redone considering the new information. Otherwise, the corrected sample is obtained as the average of the previously curated and the newly received values (71):

x*_i[n_fut] = (1/2) (x*_old,i[n_fut] + x_i[n_fut])	(71)

Table 1 summarizes the most relevant symbols and variables considered in the proposed predictor-corrector scheme.
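The correction rule above can be sketched in a few lines. This is a minimal Python illustration under our reading of the text: a relative difference above 10% triggers a full re-curation (signaled here by `None`), while smaller differences are resolved by the average of Equation (71). The function name and the `None` convention are ours:

```python
def correct_sample(x_old, x_new, threshold=0.10):
    """Correction step for a late-arriving sample x_new at instant n_fut.

    If the relative difference with the previously curated value x_old
    exceeds the threshold (70), the whole curation process must be redone;
    otherwise the corrected value is the average of both values (71)."""
    if abs(x_new - x_old) > threshold * abs(x_old):
        return None  # signal: redo the full curation with the new sample
    return 0.5 * (x_old + x_new)
```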

Experimental Validation and Results
To evaluate the performance and usability of the proposed technology, an experimental validation was designed and carried out. In this experimental phase, we analyzed the precision of the proposed prediction-correction method, but we also evaluated its behavior in terms of scalability and required computational time. This section describes the proposed experimental methods and the obtained results.

Experiments: Methods and Materials
Four different experiments were planned and developed. Three of them were based on simulation techniques, while the fourth one was supported by a real hardware infrastructure.
The first experiment evaluated the evolution of the computational cost of the proposed solution when deployed in an Industry 4.0 system. To do that, the number of sensor nodes N in the scenario was varied, and the total computational time required to curate all data series in real time was monitored. This experiment focused on the solution's cost and its scalability. Time measurements and results were captured and displayed as relative values, normalized by the sampling period T_i. In that way, when the processing time was above the sampling period T_i, we determined the proposed framework was congested (buffers would grow until the entire software module became unavailable). In this experiment, the MATLAB 2019B software was employed to build a simulation scenario where we could easily change the number of nodes N. The experiment was repeated for different values of the sampling period T_i.
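The congestion criterion used in this experiment can be made explicit with a short sketch (a Python illustration of the normalization described above; the function name is ours):

```python
def is_congested(processing_times, sampling_period):
    """Normalize per-sample processing times by the sampling period T_i.
    The framework is considered congested when any relative time exceeds 1,
    as buffers would then grow until the software module becomes unavailable."""
    relative = [t / sampling_period for t in processing_times]
    return any(r > 1.0 for r in relative)
```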
The second experiment also focused on analyzing the computational cost of the proposed solution and its scalability, but in this case when performing offline data curation. Here, the relative length of the expanded time interval [n_e1, n_e2] with respect to the curation interval [n_1, n_2] (72) was varied across different realizations of the experiment, and the total computational time required to curate all the samples in the curation interval was monitored.
(n_e2 − n_e1) / (n_2 − n_1)	(72)

As in the first experiment, to reduce the external validity threats and make the results more general, the computational time was expressed as a relative number with respect to the sampling period T_i. This experiment was also based on simulation techniques using the MATLAB 2019B software, where an Industry 4.0 system was built and run. All sensors generated data every 30 s. The experiment was repeated for different numbers of sensor nodes N in the Industry 4.0 system.
The third experiment aimed to evaluate the precision and curation success of the proposed prediction-correction algorithm. This experiment was also based on simulation techniques using the MATLAB 2019B software. Measures of the proposal's precision were obtained as the mean square error (MSE) (73) for the entire curated data flow and all the sensor nodes in the scenario. This error basically evaluated how different the obtained curated data series x*_i[n] was from the original information generated by the sensor nodes, y_i[n]. To improve the clarity of the results, this MSE was expressed as a percentage (74). The same experiment was performed for the two proposed operation modes: offline data curation and real-time data curation. Furthermore, the experiment was repeated for different relative lengths of the expanded time interval [n_e1, n_e2] with respect to the curation interval [n_1, n_2] (72). All sensors generated data every 30 s.
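The error measure can be sketched as follows. This is a Python illustration of (73)-(74) under one assumption of ours: that the percentage in (74) normalizes the MSE by the power of the original signal (the paper does not spell out the normalization, so this choice is hypothetical):

```python
def mse_percent(curated, original):
    """Mean square error between the curated series x*_i[n] and the
    original series y_i[n] (73), normalized by the original signal power
    and expressed as a percentage (74) -- normalization assumed."""
    n = len(original)
    mse = sum((c - o) ** 2 for c, o in zip(curated, original)) / n
    power = sum(o ** 2 for o in original) / n
    return 100.0 * mse / power
```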
Furthermore, to evaluate the improvement provided by the proposed scheme compared to existing mechanisms, the results of the third experiment were compared to results reported by state-of-the-art solutions [56]. Specifically, to perform a coherent and valid comparison, we selected a low-cost error-correction framework for wireless sensor networks. This scenario is similar to Industry 4.0 applications, and the selected solution also includes a prediction and a correction phase, so the mechanisms (and their results) were technically comparable. Results for this state-of-the-art solution [56] were obtained through a secondary simulation scenario with the same characteristics and setup employed for the proposed new approach.
For these three initial experiments, a simulation scenario was built and run using the MATLAB 2019B suite. The simulation scenario represented a chemical industry where environmental data are collected to make decisions about how safe it is to work on the pending activities. Four different kinds of sensor nodes were considered: temperature sensors, humidity sensors, CO2 sensors (air quality) and volatile compound sensors (detecting dangerous gas emissions). The scenario included a random composition using all four types of sensor nodes. All sensors generated data every T_i seconds, although they were not synchronized. Each sensor started operating at a random instant within the first minute of the system running. Simulated sensor nodes were designed to represent an ESP-32 microprocessor (with a 12-bit ADC and a 16-bit architecture). Figure 2 shows the experimental simulation setup. The simulated scenario was based on a minimum-sized industry, where distances were never larger than 50 m. The decision-making software module collected information from the sensor nodes using LoRaWAN wireless technology. The global environment was suburban, so the level of interferences was moderate-low.
Malfunctions were represented by models and functions included in the MATLAB libraries, so we guaranteed the independence of the system configuration phase and the system evaluation phase. In that way, results were more relevant. Models for jitter were introduced from Simulink and included duty-cycle-distortion deterministic jitter, random jitter, sinusoidal jitter and noise. In this work, only deterministic and random jitter were considered. The different types of jitter were injected into devices according to the IBIS-AMI specifications [57]. Delays were managed through the Control System Toolbox, which allowed integrating the input delay, the output delay and the independent transport delays in the transfer function and the frequency-response data. In this work, we are using first-order plus dead time models and state-space models with input and output delays. Regarding BER, MATLAB includes the Bit Error Rate Analysis App Environment, which can integrate different instances of the numerical models generated through a Monte Carlo Analysis Simulink block and simulation. Finally, numerical errors were introduced and modeled as numerical noise in devices, using the Simulink framework and the IBIS-AMI specifications (in a similar way as described and conducted for jitter models).
Simulations represented a total of 24 h of system operation. To remove the impact of exogenous variables from the results, each simulation was repeated 12 times, and results were calculated as the mean value of all these simulations for each system configuration. All simulations were performed on a Linux architecture (Ubuntu 20.04 LTS) with the following hardware characteristics: Dell R540 Rack 2U, 96 GB RAM, two processors (Intel Xeon Silver 4114, 2.2 GHz), HD 2 TB SATA 7.2K rpm. Simulations were performed under isolation conditions: only the MATLAB 2019B suite was installed and running on the machine; all other services, software and internet connections were stopped or removed. The objective was to remove as many validity threats as possible. The global (wall-clock) simulation time was variable and automatically calculated by the system to obtain data representing 24 h of system operation.
The fourth and final experiment also aimed to evaluate the precision and curation success of the proposed prediction-correction algorithm. However, in this last experiment, we employed a real hardware platform. In this case, in an emulation environment representing the referred chemical industry, four nodes (one of each type) were deployed together with a LoRaWAN gateway. The sensor node configuration was identical to the configuration proposed for the simulation scenarios. In particular, all sensors generated data every 30 s. In this case, measures of the proposal's precision were also obtained as the mean square error (73)-(74). The experiment was also performed for the two proposed operation modes: offline data curation and real-time data curation. The experiment was repeated for different relative lengths of the expanded time interval [n_e1, n_e2] with respect to the curation interval [n_1, n_2], as in the third experiment. In this case, malfunctions were introduced by actual environmental and technological factors. For each configuration, the system operated for 24 h (so results in the third and fourth experiments are comparable). For all these experiments, the configuration of the proposed prediction-correction algorithm is shown in Table 2.

Figure 3 shows the results of the first experiment. As can be seen, for all Industry 4.0 systems with up to 10 sensor nodes, the proposed solution was able to curate all data series in real time, as the computational time was below the sampling period. However, systems with higher sample-generation rates (T_i = 1 s and T_i = 5 s) got congested when the number of nodes went up. Specifically, systems with 20 and 40 elements, respectively, caused the system to become unavailable. No higher sampling period caused congestion in the system, even for platforms including up to 100 sensor nodes.

Results
Although for higher device densities and larger numbers of sensor nodes the proposed system may get congested even with higher sampling periods, 100 nodes is above the number of nodes most current Industry 4.0 platforms include.

Figure 4 shows the results of the second experiment. In this case, we evaluated the computational cost of offline data curation. As can be seen, in only one case was the computational time above the sampling period: for networks with 100 sensor nodes. This situation was caused by a very populated network (100 sensor nodes), where data were generated at a high rate, greater than the processing rate offered by the prediction-correction mechanism (the data generation rate went up as the number of nodes in the network increased). Thus, after a few samples, the system did not process data "on the fly" and the new data were queued. Therefore, in the long term, the mean computational time was higher than the sampling period, because of data queues and the congestion caused by the high data-generation rate. In any case, and as said before, 100 is a number of devices above the current needs of common Industry 4.0 deployments (although future pervasive platforms may go beyond this number).
Only a second system configuration caused the system to get congested: an Industry 4.0 system with 50 devices and an expanded time interval 25 times longer than the curation interval. Although this is a large number of samples to process, the reduced complexity of the proposed solution allowed (even in this case) computational times slightly below the sampling period.

As a conclusion of these two initial experiments, the proposed solution successfully operated both in real time and offline. Besides, it is scalable to large systems, above current Industry 4.0 needs. However, for future pervasive sensing platforms, the proposed solution may require powerful cloud systems or the definition of a distributed or edge computing scheme.

Figure 5 presents the results of the third experiment. As can be seen, in all situations the MSE was below 50%, even for weak configurations, such as expanded time intervals less than two times longer than the curation interval. As the number of samples integrated into the candidate calculation process went up, the MSE clearly went down. For configurations where the length of the expanded interval was around 25 times the length of the curation interval, the MSE was almost null, and the remaining error may be considered residual and intrinsic to all processing schemes.

It was also interesting to analyze the impact of the proposed correction phase. As can be seen, real-time curation showed worse performance than offline data curation, as less information was available.
The MSE may be between 10% and 20% higher in real-time data curation than in offline curation for the same system configuration. However, when considering the proposed correction phase, which was able to integrate future information into the previously obtained curated data series, real-time data curation may reach the same precision as offline mechanisms.

Figure 6 shows a comparison between the results obtained in the third experiment and the results reported by state-of-the-art solutions [56]. As state-of-the-art mechanisms do not depend on the length of the expanded time interval (this notion was defined only in the newly proposed scheme), the MSE for the previously reported solution was constant. The small variations displayed in Figure 6 were caused by numerical effects in the simulation. As can be seen, for expanded time intervals very similar in length to the curation interval, the MSE was lower in state-of-the-art solutions. Decision trees in state-of-the-art mechanisms analyze large amounts of data to be trained, so their models are much more precise than the models generated in the proposed framework. However, as larger expanded intervals were considered, the MSE in the proposed solution was greatly reduced, while state-of-the-art mechanisms remained constant. In fact, for length ratios above 20, the proposed solution showed an MSE 50% lower than the previously reported mechanism. In conclusion, for models obtained from the analysis of comparable amounts of samples, the proposed framework produced curated data series that were 50% more precise.
Figure 6. Results of the third experiment: comparison with state-of-the-art solutions [56].
If we evaluate Figures 3-6 together, we can conclude the proposed solution successfully obtained precise curated data series in Industry 4.0 platforms. Besides, if we accept an error of around 5%, the proposed algorithm may be deployed in scenarios with a large number of sensor nodes, above the current state of the art in Industry 4.0 solutions.
Finally, Figure 7 shows the results of the fourth experiment. As can be seen, the evolution of the MSE with the relative interval length was similar to the one observed in Figure 5 (third experiment). As larger intervals were considered and more samples were introduced in the candidate calculation process, the precision went up and the MSE went down. However, in this case, the precision was slightly lower than in the simulation study; in particular, the MSE was around 5% higher. In any case, the qualitative evolution of the proposed mechanism was equivalent in the simulation and real scenarios. Only in one situation was the performance different in a relevant manner: for curation and expanded time intervals of the same length (length ratio equal to one). Under those circumstances, in the simulation scenarios, the correction phase had a smooth behavior and still reduced the MSE. However, in the real scenarios, when the length ratio was equal to one, not enough information was provided to the correction algorithm and the thresholds did not converge properly. As a consequence, the correction phase introduced an additional error instead of improving the data quality. This situation, in any case, was very transitory and was solved immediately when the expanded interval was even slightly longer than the curation interval (length ratio above one). Despite this fact, and once more, the impact of the correction phase was relevant, even more so in this real hardware implementation (the MSE may grow up to around 90% without the correction phase). Thus, the results allowed us to conclude the proposed solution was successful and valid as a data curation technology for Industry 4.0, focused on improving sensor interoperability.

Conclusions
At the software level, real-time algorithms and automatic decision-making mechanisms are very sensitive to the quality of received data. Common malfunctions in sensor nodes, such as delays, numerical errors, corrupted data or inactivity periods, may cause a critical problem if an inadequate decision is made based on those data. The most common solution to this problem is the adaptation and transformation of high-level software components to tolerate these effects, but this calibration turns interoperability between physical sensors and software modules into a very problematic issue.
Therefore, new solutions that guarantee the interoperability of all sensors with the software elements in Industry 4.0 solutions are needed. In this paper, we proposed a solution based on numerical algorithms following a predictor-corrector architecture. Using a combination of techniques, such as Lagrange polynomial and Hermite interpolation, data series may be adapted to the requirements of Industry 4.0 software algorithms. Series may be expanded, contracted or completed using predicted samples, which are later updated and corrected using the real information (if received).
Through this process, the resulting curated time series has enough quality to be employed with any software module (artificial intelligence, decision making, etc.), guaranteeing the interoperability of all sensor nodes with the high-level applications (which now do not require any adaptation or calibration procedure).
Results show the proposed solution successfully operated in real time and offline, and that it is scalable to large systems, above current Industry 4.0 needs. Besides, we can conclude the proposed solution successfully obtained precise curated data series in Industry 4.0 platforms, even in scenarios with a large number of sensor nodes.
Future works will consider the proposed solution in large Industry 4.0 deployments with intense industrial activity, where the environment is more hostile and the operation conditions are more critical.