Predicting machine failures from multivariate time series: an industrial case study

Non-neural Machine Learning (ML) and Deep Learning (DL) models are often used to predict system failures in the context of industrial maintenance. However, only a few researches jointly assess the effect of varying the amount of past data used to make a prediction and the extension in the future of the forecast. This study evaluates the impact of the size of the reading window and of the prediction window on the performances of models trained to forecast failures in three data sets concerning the operation of (1) an industrial wrapping machine working in discrete sessions, (2) an industrial blood refrigerator working continuously, and (3) a nitrogen generator working continuously. The problem is formulated as a binary classification task that assigns the positive label to the prediction window based on the probability of a failure to occur in such an interval. Six algorithms (logistic regression, random forest, support vector machine, LSTM, ConvLSTM, and Transformers) are compared using multivariate telemetry time series. The results indicate that, in the considered scenarios, the dimension of the prediction windows plays a crucial role and highlight the effectiveness of DL approaches at classifying data with diverse time-dependent patterns preceding a failure and the effectiveness of ML approaches at classifying similar and repetitive patterns preceding a failure.


Introduction
Predictive maintenance is a strategy that aims to reduce equipment degradation and prevent failures [1,2,3] by forecasting their occurrence.It differs from corrective maintenance (in which the components are used for their entire life span and replaced only when there is a fault) [4,5] and preventive maintenance (in which the replacement and reparation activities are scheduled periodically) [6,7].
Industrial data are often the result of the monitoring of several physical variables, which produces multivariate time series [8,9].Predicting failures via manual inspection is time-consuming and costly, especially because the correlation among multiple variables can be difficult to appraise with manual approaches.Computer-aided methods can improve performance and reliability and are less onerous and error-prone [10,11,12].
Predicting machine failures requires the cycle of data collection, model identification, parameter estimation, and validation [13].Alternative models are compared in terms of the effort needed to train them, the performances they deliver, and the relationship between such performances and the amount of input data used to make a prediction and the time horizon of the forecast.
In this paper, we compare alternative models trained for predicting malfunctions.The focus of the analysis is to study how performances, measured with the macro F 1 score metrics, are affected by the length of the input time series fragment used to make the prediction (reading window -RW ) and by the extension in the future of the forecast (prediction window -PW ).
As case studies, we consider the time series made of multiple telemetry variables collected with IoT sensors from (1) an industrial wrapper machine, (2) a blood refrigerator, and (3) a nitrogen generator.
In the three data sets, IoT sensors collect physical quantities to monitor the system status.The emission of alert codes signals the occurrence of anomalous conditions.The goal of the described data-driven predictive maintenance scenario is to anticipate the occurrence of a fault (i.e., a malfunctioning indicated by an alert code) within a future time interval (i.e., the prediction window) using historical data (i.e., the reading window).The presence of labeled data (i.e., the actual alert codes emitted by the machine in past work sessions) makes it possible to formulate the predictive maintenance problem as a binary classification task addressed with a fully supervised approach.
The contribution of the paper can be summarized as follows: • We compare three ML methods (Logistic Regression, Random Forest, and Support Vector Machine) and three DL methods (LSTM, ConvL-STM, and Transformer) on three novel industrial data sets with multiple telemetry data.
• We study the effect of varying the size of both the reading window and the prediction window in the context of failure prediction and discuss the consequences of these choices on the performances.
• We evaluate the diversity of patterns preceding faults using the Euclidean distance between time series windows and the spectral entropy measured on whole data sets as a global complexity metrics, showing that DL approaches outperform ML approaches significantly only for complex data sets with more diverse patterns.In contrast, for simpler datasets where patterns exhibit greater uniformity, both DL and ML approaches produce comparable results, with DL algorithms not introducing substantial improvements.
• All methods lose predictive power when the horizon enlarges because the temporal correlation between the input and the predicted event tends to vanish.
• When patterns are diverse, the amount of historical data becomes more influential.In general, augmenting the amount of input is not always beneficial.
• We publish the data sets and the compared algorithms to ensure reproducibility and allow the scientific community to extend the comparison to further ML and DL methods 1 .
The remainder of this paper is structured as follows: Section 2 surveys the related work, Section 3 presents the methodology adopted for pre-processing the data and making predictions, Section 4 presents and discusses the results, and Section 5 presents the conclusions and proposes future work.

Related work
Anomaly detection and failure prediction are two related fields applied to diverse data, including time series [27,28,29,30,31], and images [32,33,34].This research focuses on failure prediction applied to multivariate time series data acquired by multiple IoT sensors connected to an industrial machine.Several works have surveyed failure prediction on time series [35,30,36].
Failure prediction approaches belong to three categories, depending on the availability and use of data labels: supervised [18,12], semi-supervised [37] and unsupervised [38].Predicting failures in a future interval given labeled past data can be considered a supervised binary classification task [12].The data set used in this research is labeled, making supervised methods a viable option.
RF is one of the most used approaches [25,39,40,41,22,42,43,44,14,29,23,16,17,15,12].The work in [25] compares RF with ML and DL approaches, including Neural Networks and SVM, and shows the superiority of RF in terms of accuracy for a prediction window not longer than 300 seconds for a case study of wind turbines.However, accuracy does not consider class unbalance and is not appropriate for evaluating the performance of failure prediction algorithms [30].The work in [41] also considers the case of wind turbines and presents only the performances of RF using accuracy, sensitivity, and specificity for a prediction window of 1 hour.The of November 2023) work in [22] also uses RF on wind turbine data and measures performances using precision and recall for varying the reading windows up to 42 hours, showing rather poor performances.Other applications of RF in predictive maintenance range from refrigerator systems [44] to components of vending machines [42], vehicles [29], and chemical processes [16].
SVM is another approach commonly used in supervised failure prediction [25,26,45,42,14,46,47,15,12].The work in [26] presents an application of SVM to railway systems and evaluates the performances using False Positive Rate (FPR) and recall.It is one of the few approaches considering variation in both PW and RW.However, the assessment of the contribution of each window is not performed because they were changed together.The work in [15] compares ML algorithms (DT, RF, GB, and SVM) and shows that GB outperforms the other approaches, followed by RF.The work in [14] compares the accuracy of six ML algorithms on industrial data, including SVM, RF, and DT.It shows that DT not only surpasses RF but is also more interpretable.However, precision and recall would be necessary to evaluate the quality of predictions, as the data set is likely unbalanced.The work in [17] also measures performances using accuracy (for a balanced data set), compares RF, LR, and Ctree, and shows that (1) RF outperforms the other algorithms, and (2) the reading window size does not have a significant impact on the performances.
A few works contrast ML and DL predictors.The work in [20] compares the performances using AUC and shows that a small fully-connected neural network outperforms Naïve Bayes.However, other popular methods (e.g., SVM, LR, LSTM) are not evaluated.The work in [19] is one of the few approaches considering a comparison between two DL approaches (a CNN and LSTM) and shows (1) the superiority of the convolutional network in terms of accuracy and (2) the decrease in accuracy as the prediction window increases.However, additional metrics would be necessary to evaluate performances on a likely unbalanced data set.
Considering alternative DL approaches, aside from the commonly employed LSTM-based architectures [48,49,50], there is noteworthy research interest on ConvLSTM models.The work in [51] finds that ConvLSTM exhibits comparable performance to Random Forest (RF) when predicting faults in wind turbine gearbox systems.The work in [52] shows the effectiveness of a ConvLSTM-based architecture for failure detection.Both works do not analyze the impact of the size of RW and PW nor compare alternative DL architectures.ConvLSTM has also proved effective in forecasting the value of telemetry variables, a preliminary step for fault prediction in diverse systems [53,54,55].Transformer-based architectures have been applied to anomaly detection [56,57], and fault diagnosis [58,59].The work in [60] shows the effectiveness of a Transformer-based architecture for the fault prediction task in an Electric Power Communication Network.In contrast to LSTM-based architectures, Transformers can capture long-range dependencies and offer parallel processing capabilities, which makes them suitable also for time series tasks over long intervals.
Several works investigate the failure prediction task by forecasting the occurrence of a failure within a pre-determined time frame [26,16,20,17].Some studies evaluate the impact of either the reading window length [26,16,17] or the prediction window length [20,23,24].Only a few [26,12] evaluate the impact of varying the reading and the prediction window.Considering the surveyed approaches, the work in [12] is the only one that proposes a comparison between different ML algorithms varying both the RW and PW and assessing the contribution of each.However, it does not consider DL algorithms in the comparison.
In summary, a few works on supervised failure prediction for multivariate time series evaluate the variation of both RW and PW.However, none compares ML and DL algorithms in this respect.Moreover, none of the surveyed papers considers the presence of discrete sessions in the telemetry time series, which is instead the case of the wrapping machine data set.This research investigates the contribution of both the RW and PW variations and compares diverse ML and DL approaches on three novel industrial data sets featuring both discrete session-based and continuous time machines.

Method
This section outlines the experimental design employed to evaluate the impact of the reading/prediction windows and the hyperparameter selection on the performance of ML and DL approaches.It describes the data sets used and explains the data pre-processing procedures, the experimental methodology, and the performance metrics.

Data sets 3.1.1. Wrapping machine
Wrapping machines are systems used for packaging and comprise various components such as motors, sensors, and controllers.The monitored exem-plars are semi-automatic machines that wrap objects carried on pallets with stretch film.Each machine comprises: the turntable, a central circular platform for loading objects; the lifter, a moving part that contains the wrapping film reel; the tower, a column on which the lifter moves; the platform motor, a gear motor controlling the movements of the turntable with a chain drive, and the lifting motor, an electric motor for lifting the carriage.
The functioning of the machine is cyclic.A cycle consists of five steps: 1) Loading the products on a pallet at the center of the rotating platform.
2) Tying the film's trailing end to the pallet's base.3) Starting the wrapping cycle, executed according to previously set parameters. 4) Manually cutting the film (through the cutter or with an external tool) and making it adhere to the wrapped products.5) Removing the wrapped objects.
The data set contains records from sensors installed on a wrapping machine collected over one year (01-06-2021 to 31-05-2022).The machine is equipped with sensors attached to different components, such as the rotating platform and the electric motors.An acquisition system gathers data from sensors and sends them to a storage server.Sensors measure over 100 quantities (e.g., temperatures, motor frequencies, and platform speeds).Each data item records a timestamp, and the last value measured by a sensor.Data are transmitted only when at least one sensor registers a variation for saving energy, resulting in variable data acquisition frequency.Data are grouped into work sessions that represent periods in which the machine is operating.Each work session has a starting point, manually set by the operator, corresponding to the start of an interval during which the machine wraps at least one pallet.A work session ends when the machine finishes wrapping the pallets or when a fatal alert occurs.Session-level metrics are collected, too, such as the number of completed pallets and the quantity of consumed film.The data set contains information about alerts thrown by the machine during the power-on state (i.e., when the machine is powered on, which does not necessarily mean it is producing pallets).An alert is a numerical code that refers to a specific problem that impacts the functioning of the machine.
Depending on the changes detected by sensors, the irregular sampling time is addressed by resampling the time series with a frequency of 5 seconds, which gives a good trade-off between memory occupation and signal resolution.Missing values are filled in using the last available observation because the absence of new measurements indicates no change in the recorded value.

Blood refrigerator
Blood refrigerators are systems designed for the safe storage of blood and its derivatives at specific temperatures.Each unit includes different components.Compressor : pumps the refrigerant, increasing its pressure and temperature.Condenser : cools down and condenses the refrigerant.Evaporator : causes the refrigerant to evaporate for cooling the inside of the unit.Expansion Valve: regulates the flow of refrigerant into the evaporator.Interconnecting Tubing: facilitates the movement of the refrigerant between the compressor, condenser, expansion valve and evaporator.
A refrigeration cycle comprises the following phases.1) Compression: The cycle begins with the compressor drawing in low-pressure, low-temperature refrigerant gas.The compressor then compresses the gas, which raises its pressure and temperature.2) Condensation: The high-pressure, high-temperature gas exits the compressor and enters the condenser coils.As the gas flows through these coils, it releases heat, cools down, and condenses into a highpressure liquid.3) Expansion: This high-pressure liquid then flows through the capillary tube or expansion valve.As it does, its pressure drops, causing a significant drop in temperature.The refrigerant exits this stage as a low-pressure, cool liquid.4) Evaporation: The low-pressure, cool liquid refrigerant enters the evaporator coils inside the refrigerator.Here, it absorbs heat from the interior, causing it to evaporate and turn back into a lowpressure gas.This process cools the interior of the refrigerator.5) Return to Compressor: The low-pressure gas then returns to the compressor and the cycle starts over.
The data set includes information from IoT sensors that measure 58 variables (e.g., the status of the door of the refrigerator, its temperature, energy consumption, etc.).The series comprises data from 31-10-2022 to 30-12-2022 and records only when a variable changes value.This results in an irregular sampling period varying from a few seconds to one hour.For this reason, we resampled the data set using the median of the original sampling frequency (i.e., ≈ 34 seconds).

Nitrogen generator
Nitrogen generators separate the nitrogen from the other gasses, such as the oxygen, in compressed air.They include: an Air compressor, to supply the compressed air; Carbon Molecular Sieves (CMSs), to filter other gasses from the nitrogen; Absorption vessels, to take the nitrogen not filtered out by the CMS; Towers, to increase the nitrogen production.A tower comprises one CMS, two or more absorption vessels and the space where the division between nitrogen and oxygen takes place; Valves, to direct the flow of air and nitrogen and regulate the absorption of the nitrogen by the vessels; a Buffer Tank, to store the purified nitrogen.
The generator works as follows: 1) The air compressor compresses the air to a high pressure and supplies it to the machine.2) In the towers the CMS adsorbs smaller gas molecules such as oxygen while allowing larger nitrogen molecules to pass through the sieve and go into the vessels.3) The buffer tank receives the nitrogen gas from the vessels through the valves.
4) The valves reduce the pressure in the current working tower to release the residual gasses, while in the other towers the pressure is increased to restart the process.
The data set includes information from IoT sensors installed in the generator.A set of 64 variables measure the essential work parameters (e.g., the nitrogen pressure, CMS air pressure, etc.).The series comprises data from 01-08-2023 to 29-09-2023 and contains values recorded when a change in a variable occurs.This results in an irregular sampling period varying from a few seconds to four hours.For this reason, we resampled the data set using 1 minute as frequency.

Data Processing
Data processing is an essential step in the analysis of each dataset, involving the selection of pertinent variables associated with physical quantities and of the resampling period.To characterize the diversity of the data sets, the anomalous patterns preceding faults can be considered.To this end, the data set is first normalized using min-max normalization and then a set of fixed-size windows preceding each fault is defined.For every pair of such windows, denoted as w i and w j , the Euclidean distance, d ijkv , is calculated between pairs of data points at corresponding positions, indexed by k, for each variable, represented as v.The distance is computed between the points p ikv and p jkv .The diversity of patterns is quantified as the mean value scaled in the range [0, 1]: To characterize the complexity of a data set as a whole, we compute the mean of the spectral entropies of the considered time series [61], as imple-mented in the antropy library2 , normalized between 0 and 1 and using the fft method [62].Higher entropy values correspond to a higher time series complexity.

Wrapping machine
The first phase is feature selection.Since not all the quantities are equally relevant, only 13 are kept based on the data description provided by the manufacturer and are summarized in Table 1.Most removed features were redundant or had constant values.For example, such features as the employed recipe and the firmware version are excluded because they do not describe the system dynamics, whereas such features as the variation of a variable value during the session are eliminated because they can be derived from the initial and final values.The second phase is selecting relevant alert codes for the anticipation of machine malfunction.Figure 1 presents the distribution of the alert codes most relevant according to the manufacturer.Most of them are rare (i.e., are observed less than 10 times), and alerts 11 (platform motor inverter protection) and 34 (machine in emergency condition) are the most common ones.Alert 34 is thrown when the operator presses an emergency button.However, this occurrence heavily depends on human behavior because the operators often use the emergency button as a quick way to turn the machine off, as confirmed by the plant manager.For this reason, alert 34 is discarded, and the prediction task focuses only on alert 11.This alert is fundamental for the correct functioning of the machine, because the inverter controls the frequency or power supplied to the motor and thus controls its rotation speed.The improper control of the rotation speed hinders the wrapping operation.In the examined data set, alert 11 is preceded by diverse time-dependent patterns, and the mean Euclidean distance defined in Equation 1 is ≈ 0.26 for a window of 15 minutes.Figure 2 shows two examples of patterns preceding faults.The spectral entropy is ≈ 0.72.
The work session boundaries provided in the data set also depend on human intervention.Some inconsistent session-level metrics and a large number of sessions that do not produce any pallets are observed.A more reliable definition of a work session can be inferred from the telemetry variables.Intuitively, a session start time is introduced when at least one motion variable increases its value, and a session end time is created when all motion variables have decreased their value in the last ten minutes.Algorithm 1 specifies how sessions are defined.
Figure 3 shows the distribution of the work session duration computed from the telemetry data.The idle time intervals between sessions are neglected in further processing.
After pre-processing, it is possible to compute the distance of each data  The example on the left shows two faults observed on 15-06-2021.They are preceded by a slow decrease in the film tension and after the second fault the slave motor has a speed of zero.On the right, the fault observed on 30-06-2021 is preceded by a sudden acceleration of the platform rotation speed, while the slave motor does not move, but the film tension remains constant.

Algorithm 1 Computation of session boundaries
Require: ds, the data set containing, for each timestamp, an array with the values of the movement variables, structured as a hash map with the timestamp ts as the key and the movement variables array mv as the value Ensure: sessions, a map that associates a session number to each timestamp and -1 to out-of-session timestamps   sample within a work session to the successive alert, i.e., the Time To Failure (TTF).

Blood refrigerator
The data comprise variables directly measured by sensors (base variables) and variables derived from the base ones.For instance, the base variable "Door status" represents the status of the refrigerator door and has an associated derived variable "Door opening daily counter", which counts every day when the door is open.We keep the 12 most significant variables, shown in Table 2.The data set also contains alarms (e.g., High-Temperature Compartment, Blackout, Defrost Timeout Failure, etc.).Specifically, the data set originally contained 15 types of alarms, but those described as critical by the manufacturer are the ones presented in Table 3.After an analysis of such two alarms, we have selected "alarm 5" as the prediction target because "alarm 1" has too few occurrences, making the analysis of performances unfeasible.Storing blood at low temperature is crucial, as higher temperature can lead to bacterial growth [63], hemolysis [64] and increased risk of adverse reactions, including life-threatening conditions [65]. Figure 4 presents the typical anomalous pattern preceding a fault for three variables.In this case, the mean Euclidean distance defined in Equation 1 is ≈ 0.21 for a window of 15 minutes, ≈ 20% less than the one of the wrapping machine data set.The spectral entropy is ≈ 0.59, indicating a lower complexity with respect to the wrapper machines data set.

Nitrogen generator
The data are collected by IoT sensors over two months (01-08-2023 to 29-09-2023).We have considered only the variables that are directly measured from the sensors and discarded the derived ones.Table 4 shows the relevant variables, which record the changes of the physical status of the system.The nitrogen generator can produce different alarms, including "Air pressure too high", and "Oxygen failure -Second threshold reached".However, only one type of alarm is present in the data set: "CMS pressurization fail", whose frequency is shown in Table 5.A fail in CMS pressurization can hinder the separation of oxygen and nitrogen, as mentioned in [66].Figure 5 presents the typical anomalous pattern of three variables preceding a fault.The mean Euclidean distance defined in Equation 1 is ≈ 0.09 for a window of 15 minutes, ≈ 65% less than the one of the wrapping machine data set, and ≈ 57% less than the one of the blood refrigerator data set.The spectral entropy is ≈ 0.53, indicating a lower complexity with respect to both the wrapper machines and the blood refrigerator data sets.

Definition of the Reading and Prediction Windows
The task for comparing ML and DL methods is predicting the occurrence of alerts within a future time interval (i.e., a prediction window PW) given historical data (i.e., a reading window RW).This task is formulated as a binary classification, which assigns a label (failure/no failure) to each RW [30].Failures were provided by the manufacturer, which has measured them using sensors installed in the considered machines.A failure label associated with an RW is assigned when the corresponding PW contains at least an alert code.It is interpreted as a high probability of an alert occurring in the PW next to it.Given a session of length N and an RW size S RW , N − S RW + 1 reading windows are extracted.Each RW starts at timestamp t i , with i = 1 . . .N −S RW +1.For an RW (t j . . .t j +S RW ), the corresponding PW starts at time stamp t j +S RW +1.The proposed approach can be applied in absence of sessions without loss of generality, because the entire interval of the data set can be considered as one session.

Class Unbalance
Alerts are anomalies and thus, by definition, rarer than normal behaviors [67].For this reason, the number of RWs with label no failure is significantly greater than that of RWs with label failure, leading to a heavily imbalanced data set.For all the data sets, Random UnderSampling (RUS) [68,69] is applied to balance the class distribution by making the cardinality of the majority class comparable to that of the minority class.The algorithm randomly selects and removes observations from the majority class until it achieves the desired equilibrium between the two classes.In the case of the wrapping machine, RUS is applied separately on each train set (comprising 4 folds) and test set (1 fold) for each combination of RW and PW sizes.To prevent the presence of similar data in the train and test sets (i.e., partially overlapping data), after the definition of the test set, the partially overlapping windows in the train set are neglected by RUS.The RUS step is applied 10 times, as proposed in [12].In the blood refrigerator and nitrogen generator data sets, RUS is applied on the train and validation sets during the training phase, but not on the test set, where the test F 1 score is evaluated assigning a different weight to the failure and no failure classes in order to account for the unbalance.

Algorithms and hyperparameter tuning
The algorithms for which the influence of the RW and PW size is assessed include Logistic Regression (LR), as a representative of the Generalized Linear Models family [70], Random Forest (RF), as a representative of the Ensemble Methods family [71], Support Vector Machine (SVM), as a representative of the Kernel Methods family [72], Long Short-Term Memory (LSTM) [73], Transformers [74] and ConvLSTM [75], as representatives of DL methods.
For SVM, RF, and LR, the implementation used in this work is from the Python library scikit-learn3 , while DL models use TensorFlow4 .The tested network architectures of the DL models and the hyperparameters resulting from the hyperparameter search of all models can be found in the project repository.
For LR, C is the inverse of the regularization strength, and smaller values indicate a stronger regularization.
In the case of RF, the search varies the number of internal estimators N E (i.e., the number of decision trees) and the maximum number of features that can be selected at each split (in this context, a feature is defined as the value of a variable at a specific timestamp).
In the case of SVM, the C coefficient is the regularization hyperparameter multiplied by the squared L 2 penalty and is inversely proportional to the strength of the regularization.Small values of C lead to a low penalty for misclassified points, while if C is large, SVM tries to minimize the number of misclassified examples.
Considering LSTM, in the case of the wrapping machine the only hyperparameters that vary are the type of LSTM layer, Unidirectional or Bidirectional, and the loss function.For the latter, we compare the sigmoidF 1 loss with β = 1 and η = 0 (from now on referred to as F 1 loss) [76] and the Binary Cross-Entropy (BCE) loss.The use of BiLSTM has been shown to be effective in the works [77,73], which focus on forecasting problems.The two compared loss functions have been proven to deliver the top performances in [76].Each loss function serves a distinct purpose and offers advantages and disadvantages.The F 1 loss considers precision and recall, thus providing a balanced model performance evaluation.In this case, minimizing the F 1 loss means maximizing the F 1 score for the "Failure" class (i.e., reducing the missed "Failure" occurrences).The BCE loss is a widely used loss function for binary classification tasks and is symmetric with respect to the two classes.It measures the dissimilarity between predicted probabilities and target labels, encouraging the model to improve its probability estimations.BCE loss encourages the model to output probabilities for each class, providing richer information about the model's confidence in its prediction.However, it does not inherently consider the interplay between precision and recall and is not the most suitable choice for tasks where a balance between precision and recall is essential.
In the case of Transformers, the Binary Cross-Entropy (BCE) loss function is employed [78], and two hyperparameters are relevant: the number of attention heads and the number of transformer blocks.The former affects the capacity of the model to capture diverse patterns and relationships in the data, with higher values improving the ability to discern patterns.The latter increases the complexity of the model, thus allowing it to capture more complex patterns but also augmenting the risk of overfitting.
In the case of ConvLSTM [75], BCE is chosen as the loss function and the hyperparameter search (described in the project repository) has varied both the kernel size and the number of filters.Increasing the kernel size extends the receptive field and augmenting the number of filters allows capturing more diverse and complex spatiotemporal features.

Training and evaluation
Results are assessed using the macro F 1 score, which extends the F 1 score, defined in Equation 2. The F 1 score depends on precision and recall, defined respectively in Equation 3 and Equation 4, where TP stands for the number of true positives, FP for the number of false positives, TN for the number of true negatives, and FN for the number of false negatives.
P REC = T P T P + F P (3) The macro F 1 score is the average of the F 1 scores computed for each class independently.To compute the F 1 score for each class, the model treats that class as the positive class and the other class as the negative class.Let F 1,f be the F 1 score computed when class "Failure" is the positive one and F 1,n be the F 1 score computed when class "No failure" is the positive one.Then, the macro F 1 score is given by Equation 5.
The range of RW and PW sizes is adapted to the dynamics of the machines (a slow varying behavior requires testing less window sizes than a fast changing one).For the wrapping machine data set, we use 9 values for the PW (0.25, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, and 4 hours) and 6 values for the RW (10, 15, 20, 25, 30, and 35 minutes).For the blood refrigerator, we use 4 values for the PW (0.5, 1, 1.5, and 2 hours) and 5 values for the RW (10, 15, 20, 25, and 30 minutes).For the nitrogen generator data set, we use 6 values for the PW (0.5, 1, 1.5, 2, 3 and 5 hours) and 5 values for the RW (10, 15, 20, 25, and 30 minutes).For all the data sets, the minimum PW size is set to the smallest time span necessary for avoiding an unrecoverable machine failure, so that the operator can intervene.
The validation procedure is also adapted to the characteristics of the different use cases.In the wrapping machine data set there are only 13 alarms, which yield ≈ 500 failure RWs in the whole time series.Thus the number of failure RWs in the test set would be too small to test adequately the performances.Thus we adopt a training and evaluation procedure based on k-fold cross-validation (with k = 5).Each fold is identified by f k with k ∈ [1,5].The procedure also relies on 10 RUS instances r ik for the test set and 10 RUS instances p ik for the train set, 6 algorithms a j , with j ∈ {RF, LR, SVM, LSTM, ConvLSTM, Transformer}, and the hyperparameter settings for each algorithm (h jl ).Each combination of RW and PW sizes is used to compare the models with a procedure consisting of six steps: 1.For each k-fold split, the data are divided into a train set T rS k , comprising the folds f m with m ̸ = k, and a test set T eS k , corresponding to f k .
2. For each T rS k and T eS k , RUS is applied to balance the classes, obtaining 10 RUS instances r ik for the test set and 10 RUS instances p ik for the train set.
3. For each p ik , the four presented algorithms a j , with their hyperparameters h jl , are trained on the train folds and evaluated on the RUS instance r ik on the test fold, obtaining j × l macro F 1 scores on the test set, denoted as s ikjl .
4. For each algorithm and hyperparameter setting, the mean of the results is computed as m jl = mean i,k (s ikjl ) 5.Then, the maximum macro F 1 score for each algorithm is computed as b j = max l (m jl ) 6. Finally, the best macro F 1 score is computed as B s = max j (b j ), and the best algorithm as B a = argmax j (b j ) For the blood refrigerator and nitrogen generator data sets, a higher number of failure RWs is available and thus k-fold cross validation is unnecessary.Each data set is first divided into a train, validation and test set.The model is trained on the training set and the best combination of hyperparameters is determined as the one with the best macro F 1 score on the validation set using Bayesian Optimization.Finally, the best configuration of each algorithm is evaluated on the test set.

Results
This section reports the macro F 1 scores (from now on referred to as F 1 scores) and the macro average precision and recall of the compared algorithms computed with the abovementioned procedure.It discusses the effect of RW and PW selection on SVM, RF, LR, LSTM, ConvLSTM and Transformers prediction performances.

Wrapping machine
In the wrapping machine case, the cardinality of the minority class samples is significantly influenced by both the PW and the RW, as reading windows spanning across multiple sessions are neglected.Figure 6 shows the number of RWs with class "Failure" (i.e., the support) as RW and PW sizes vary.Note that a longer PW is more likely to contain a failure, resulting in a higher probability of the RW being classified as "Failure".Consequently, the number of RWs classified as "Failure" increases for larger PWs (i.e., the support increases).The support also increases with the decrease in RW size because more RWs are paired to a PW of class "Failure".
Table 6 compares the F 1 scores of each algorithm as PW and RW vary (i.e., the b j values), and Figure 7 shows the best algorithms for each reading and prediction window size (i.e., the B a values) and their F 1 scores (i.e., the B s values).In both cases, lighter backgrounds indicate better performances.Missing results correspond to cases in which at least one fold does not contain RWs of class "Failure".Figure 8: The F 1 score varies depending on the RW and, on average, reaches its peak for an RW of 20 minutes.This pattern is particularly noticeable for a PW of 0.25 hours.
Table 6, Figure 7 and Figure 8 show that the F 1 score, on average, increases as the RW size increases until a peak at 20 minutes and decreases afterward.Smaller train sets and the lesser relevance of past information can explain this behavior [79,80].For example, considering a PW of 15 minutes, the support increases of a factor ≈ 4 from an RW of 30 minutes to an RW of 10 minutes.The work in [80] reports that decreasing the number of train set samples (in this case, the number of RWs in the train set) can reduce performances and that the best results are achieved for the greatest number of training samples.The work in [79] finds that increasing the amount of historical data (in this case, the RW length) does not necessarily lead to better performance, as irrelevant information may be provided to the predictor.Similarly to [12], the performances tend to decrease as the PW size increases, with the best performances observed for the smallest PW size.This pattern depends on the fact that the predictions for smaller PWs exploit more recent data than for larger PWs.
LSTM performs better than the other algorithms (on average, 61.4% F 1 ).ConvLSTM is, on average, the second-best algorithm (on average, 54.5% F 1 ), followed by LR (on average, 54.0%F 1 ), Transformers (on average, 52.9% F 1 ), SVM (on average, 47.4% F 1 ), and RF (on average, 39.8% F 1 ).Related studies 0.25h 0.5h 1.0h 1.5h 2.0h 2.5h 3.0h 3. Figure 9: The average improvement of LSTM with respect to LR shows that LSTM benefits more from short PWs since it can capture relevant time-dependent patterns.Note that the values for the two smallest PWs (0.25 and 0.5 hours) are not entirely comparable with the others because they do not consider the longest RWs, for which the prediction could not be computed.
on time series forecasting also show the effectiveness of LSTM over simpler non-DL approaches [81,73,82].Figure 9 shows that the improvement in performance introduced by LSTM (the best DL algorithm) with respect to LR (the best ML algorithm) is, on average, higher for smaller PWs, and lower for larger PWs.The work in [83] also observes that LSTM forecasting performances decrease for a longer prediction horizon.This pattern can be explained by the decrease in the relevance of past information for predicting behaviors far in the future.In addition, [77] observes that the performance improvement between different models tends to decrease as the forecasting horizon enlarges, suggesting that more complex approaches are as effective as less complex ones when the historical data are far in the past.Overall, our findings show that in the case study, LSTM can capture temporal dependencies effectively and behave better for small PWs (i.e., the observed temporal patterns have a noticeable influence on the predicted class).As observed in [84], LR can exploit time-independent patterns since it does not consider any temporal dependency.This explains why the difference between LSTM and LR performances decreases as the PW increases.
Considering LSTM-based networks, the choice of the loss function also has an impact.As the RW and PW vary, the difficulty of predicting failures In particular, predicting errors becomes harder when the PW increases, especially when less historical information is available (i.e., for small RWs and large PWs).In such cases, the F 1 loss is ineffective, as it maximizes the F 1 score for the "Failure" class.However, a relatively high score (≈ 66.7%) corresponds to the trivial case in which only the "Failure" class is predicted, and it becomes more challenging to surpass this result in the most difficult instances of the prediction problem.Instead, the BCE loss does not privilege one of the two classes and is more suitable for the more challenging cases, leading to better macro F 1 scores.Focusing on the ML approaches, RF and SVM show poorer performances.SVM, similarly to LR, is not able to capture time dependencies.However, LR is expected to perform better than SVM, especially when the number of features exceeds the number of samples, as SVM with RBF is prone to overfitting [85], while LR is a simpler model.In this case, the number of features for both algorithms is given by the number of samples in the reading windows multiplied by the number of time series variables.On average, there are more features than training windows (≈ 1.13 features per training sample), which justifies the better performances of LR over SVM (≈ 1.14 times better in terms of F 1 score, on average).
RF is the worst approach because each decision tree considers a limited number of features selected randomly, neglecting temporal dependencies, and the number of trees is much smaller than the number of input features (≈ 2% to ≈ 13%).For these reasons, the random forest ensembles weak estimators that cannot capture complex patterns.Increasing the number of estimators or the maximum number of features does not necessarily yield better results.
For the combination of PW and RW leading to the best result for RF (i.e., 0.25 hours and 25 minutes, respectively), the best number of trees is 150, and the best number of maximum features is 33% of all the features.Increasing the number of trees to 200 yields a -0.01%variation in the macro F 1 score, and increasing the maximum number of features to 100% yields a -0.1% variation in the macro F 1 score.On average across all the RW-PW combinations, the best ML algorithm has an F 1 of ≈ 54.8%, while the best DL algorithm (LSTM) has an F 1 of ≈ 61.4%.In this data set, DL algorithms result the best choice for the specified fault prediction task.
In summary, the best results are obtained using a PW of 0.25 hours and an RW of 20 minutes, which is considered suitable in the industrial setting of the case study.A moderate amount of historical data is sufficient to predict an alert enough in advance to allow the operator to intervene.For these RW and PW values, SVM, LR, LSTM, and ConvLSTM reach their highest F 1 scores (respectively 0.783, 0.789, 0.861, and 0.807).However, LSTM obtains the best F 1 score, which is ≈ 7% better than LR and SVM, ≈ 5% better than ConvLSTM, and ≈ 32% better than RF.The difference between LSTM and the other approaches is justified by considering that it can consider temporal dependencies explicitly, and benefits are especially significant when recent historical data are used for predicting.Regarding the other DL algorithms, Transformers focus on non-sequential patterns and ConvLSTM on spatial patterns, which makes them less effective for the specified time series fault prediction task.

Blood refrigerator
Table 8 shows the F 1 scores of each algorithm as PW and RW vary, similarly to Table 6.DL algorithms outperform ML algorithms in less cases than in the wrapping machine time series.Such difference can be explained by considering the different nature of the data sets.While the wrapping machine faults are not associated with clearly identifiable short patterns and are caused by a longer wearing of the machine components, in the blood refrigerator, the high product temperature is often observed as a consequence of some short easily identifiable patterns, exemplified in Figure 4.Such patterns can be found by simpler algorithms (i.e., ML algorithms), and more complex architectures do not lead to significant benefits.
Figure 10 presents the algorithm with the best F 1 score for each RW and PW pair.The results highlight the small variation of performances across the algorithms.On average, the best ML algorithm for each RW-PW    combination has an F 1 of ≈ 67.3%, while the best DL algorithm for each RW-PW combination has an F 1 of ≈ 67.8%.
In the case of the blood refrigerator, the absence of sessions leads to negligible differences in the support, and does not affect results significantly.In this case, the average F 1 score for different RWs varies between ≈ 67.4% and ≈ 69.8%, a small variation (≈ 2.4%) compared to the one of the wrapping machine data set (≈ 9%).This result also suggests that the amount of historical data has, for this data set, a negligible effect, and that short RWs already lead to high F 1 scores.

Nitrogen generator
Table 9 compares the F 1 scores of each algorithm as PW and RW vary.Also in this case, the performances of ML and DL algorithms are similar, suggesting that as the data set becomes simpler, more complex algorithms become less effective and often worse than simple ones.In particular, the best DL algorithms for each RW-PW combination reach, on average, an F 1 score of 89.1%, while the best ML algorithms for each RW-PW combination reach an average of 89.0%.As in the blood refrigerators case, this result can be explained considering the easily identifiable and similar anomalous patterns preceding faults, as visible in Figure 5.
Figure 11 presents the algorithm with the best F 1 score for each RW and PW pair.The results highlight the similar performances of different algorithms on all the RWs and PWs.In particular, the average F 1 score of the best algorithm varies between 87.9% and 88.5% as the RW changes, a smaller difference (≈ 0.6%) compared to the other data sets.
In this case, RF is, on average, the best algorithm (with an F 1 score of ≈ 87.3%).It benefits from simple and repetitive patterns in the data, that it is able to capture effectively.Other algorithms have similar performances.Transformers have an average F 1 of ≈ 86.6%, LSTM has an average F 1 of ≈ 85.6%, SVM has an average F 1 of ≈ 85.6%, ConvLSTM has an average F 1 of ≈ 85.3%, and LR has an average F 1 of ≈ 78.4%.

Conclusions
Failure prediction on industrial multivariate data is crucial for implementing effective predictive maintenance strategies to reduce downtime and increase productivity and operational time.However, achieving this goal with non-neural machine learning and deep learning models is challenging and requires a deep understanding of the input data.
The illustrated case studies show the importance of setting meaningful prediction and reading windows consistent with domain-specific requirements.Experimental results demonstrate that basic, general purpose algorithms, such as Logistic Regression already achieve acceptable performances in complex cases.In the wrapping machine case study, Logistic Regression attains a macro F 1 score of 0.789 with a prediction window of 15 minutes and a reading window of 20 minutes.In the same case study, LR is outperformed by LSTM, a complex model that exploits temporal dependencies, with a margin of more than 7% on average on all RW-PW pairs.In the bestcase scenario of a prediction window of 15 minutes and a reading window of 20 minutes, the LSTM model achieves a notable macro F 1 score of 0.861.However, the performance gap declines as the prediction horizon enlarges because the relevance of the temporal dependencies between the RW and the PW fades out.Ultimately, an LSTM model is maximally effective for short PWs, when the influence of the reading window historical data is expected to be more critical and becomes less effective as the prediction window increases.
The better performances obtained with DL algorithms, however, are negligible in the case of simpler data sets, where easily identifiable repetitive patterns can be found.The blood refrigerator data set is a simpler data set, with a few relevant anomalous patterns leading to failures.In that case, ML algorithms perform similar to DL algorithms, often surpassing the latter.A similar result is obtained for the nitrogen generator data set.In this case, Random Forest, which has the worst performances on the wrapping machine data set, can effectively find simple rule-based repetitive patterns anticipating failures, achieving the best performances.
The results presented in this paper are valid for the industrial scenario and data sets employed in the experiments.Larger data sets would likely contain more occurrences of multiple alert codes enabling the study of different types of anomalies.
Our future work will focus on fine-tuning the LSTM model and exploring different architectures (e.g., CNN-based models [86] and Gaussian Processes [87]) and hybrid models that combine multiple machine learning and deep learning techniques [88].Such a comparison can help identify the most effective approach for different industrial scenarios and enable a better understanding of how different network structures and learning algorithms impact failure prediction accuracy.Another fundamental research direction for the practical application of LSTM as predictors is the interpretability of their output [89].As DL models are often used as black boxes, understanding the reasons behind their predictions is crucial in industrial settings where the model's output is used to take preventive maintenance action.

Figure 1 :
Figure1: Distribution of the alert codes on a logarithmic scale.Alert 34 is the most frequent but must be discarded because it is unreliable.Alert 11 is the second most frequent.The remaining alerts have less than ten occurrences in the entire data set.

Figure 2 :
Figure 2: Examples of diverse patterns preceding faults in the wrapping machine data set.The patterns involve three variables and the faults are displayed as dashed vertical lines.The example on the left shows two faults observed on 15-06-2021.They are preceded by a slow decrease in the film tension and after the second fault the slave motor has a speed of zero.On the right, the fault observed on 30-06-2021 is preceded by a sudden acceleration of the platform rotation speed, while the slave motor does not move, but the film tension remains constant.

Figure 3 :
Figure 3: Distribution of the duration of work sessions.

Figure 4 :
Figure 4: Examples of patterns preceding a fault in a blood refrigerator, observed on 22-11-2022.Before the fault (shown by a dashed line), the instant power consumption increases and then decreases after a short time, the evaporator temperature decreases of ≈ 40 • C after a sudden increase, and the product temperature has an increase.

Figure 5 :
Figure5: A typical pattern in a nitrogen generator, observed on 01-08-2023.In this case, before a fault (indicated with a gray dashed line), the oxygen concentration drops, the CMS air pressure decreases of of ≈ 8 bar, and the nitrogen pressure decreases of ≈ 4 bar, making the generation of nitrogen impossible.

Figure 6 :
Figure 6: The number of RWs with class "Failure" as RW and PW sizes vary.The highest number of "Failure" labels is observed for short RWs and long PWs.

Figure 7 :
Figure 7: Comparison of the best F 1 scores for each combination of RW and PW sizes for the wrapping machine case study.The heatmap shows each combination's best method and the corresponding F 1 score.Lighter background colors correspond to better results.

Figure 10 :
Figure 10: Comparison of the best F 1 scores for each combination of RW and PW sizes for the blood refrigerator case study.The heatmap shows each combination's best method and the corresponding F 1 score.Lighter background colors correspond to better results.

Figure 11 :
Figure 11: Comparison of the best F 1 scores for each combination of RW and PW sizes for the nitrogen generator case study.The heatmap shows each combination's best method and the corresponding F 1 score, where "Trans."indicates the Transformer.Lighter background colors correspond to better results.

Table 1 :
The 13 variables obtained after pre-processing.The minimum and maximum values are reported for each variable and the measurement unit.The "Movement" column indicates whether the variable is used to determine whether any motors are moving.

Table 2 :
The 12 variables of the blood refrigerator

Table 3 :
The alarms of the blood refrigerator

Table 4 :
The 4 variables of the nitrogen generator

Table 5 :
The alarm of the nitrogen generator

Table 6 :
Classification results on the F 1 metrics for the wrapping machine case study, varying both the reading and prediction window, for SVM, RF, LR, LSTM, ConvLSTM, and Transformers.Lighter background corresponds to better results.Empty cells indicate that the results cannot be computed due to the lack of "Failure" labels in some folds.

Table 7 :
Comparison of the F 1 and BCE loss functions for LSTM.F 1 loss becomes less effective as the classification problem becomes more difficult (i.e. when the PW increases and the RW decreases)

Table 8 :
Classification results on the F 1 metrics for the blood refrigerator case study, varying both the reading and prediction window, for SVM, RF, LR, LSTM, Transformer, and ConvLSTM.Lighter background corresponds to better results.

Table 9 :
Classification results on the F 1 metrics for the nitrogen generator case study, varying both the reading and prediction window, for SVM, RF, LR, LSTM, Transformer, and ConvLSTM.Lighter background corresponds to better results.