Comparing LSTM and GRU Models to Predict the Condition of a Pulp Paper Press

Abstract: The accuracy of a predictive system is critical for predictive maintenance and for supporting the right decisions at the right times. Statistical models, such as ARIMA and SARIMA, are unable to describe the stochastic nature of the data. Neural networks, such as long short-term memory (LSTM) and the gated recurrent unit (GRU), are good predictors for univariate and multivariate data. The present paper describes a case study in which the performance of LSTM and GRU networks is compared under different hyperparameters. In general, the GRU exhibits better performance in this case study on pulp paper presses. The final result demonstrates that, to maximize equipment availability, the GRU is the best option among the models tested.


Introduction
Modern algorithms, data storage, and computing power make it possible not only to analyze past behavior, but also to anticipate the future behavior of industrial equipment with reasonable confidence [1][2][3]. Anticipating future failures is, therefore, a topic that has been receiving increasing attention from researchers.
There are several types of maintenance: curative, which solves problems after they occur; preventive, which is performed at regular intervals and aims to prevent common problems; and condition-based, namely predictive maintenance, which attempts to predict problems before they happen so that interventions can be made at the optimal time [4].
Nowadays, preventive maintenance is the most common approach. It aims to optimize maintenance costs and increase equipment availability [5]. Maintenance procedures are performed when parts are expected to be worn out, which prevents failures and avoids halting production for longer than strictly necessary. Its main focus is to prevent future failures. However, with this approach, some parts may be replaced before they are actually worn out, while others may wear out faster than expected and still fail [6]. Predictive maintenance aims to make the process more efficient by narrowing down the optimal time window for maintenance procedures. Using sensor data and adequate forecasting algorithms, the state of the equipment can be determined and the optimal time for maintenance interventions can be predicted some time in advance, avoiding unnecessary costs as well as failures due to lack of maintenance.
More recently, however, artificial intelligence methods have become more popular. They impact societies, politics, economies, and industries [11], offering tools for data analysis, pattern recognition, and prediction, which could be beneficial in predictive maintenance and in production systems.
Modern machine learning methods offer superior performance and have become more popular [12]. They can work with high-dimensional and multivariate data [13]. The most popular tools include artificial neural networks (ANNs), which have been proposed in many industrial applications, including soft sensing [14] and predictive control [15]. Random forest models are also good predictors, as shown in [16].
Traditional ANNs are simple and adequate for a wide range of problems. Bangalore et al. have studied the performance of neural networks for early detection of faults in gearbox bearings, to optimize the maintenance of wind turbines [17]. However, for prediction in sequential data, long short-term memory (LSTM) and gated recurrent units (GRUs) have shown superior performance [18].
LSTM is very good at predicting time series [19,20]. It can extract patterns from sequential data and store these patterns in internal state variables. Each LSTM cell can retain important information for a long period while it is in use. This property allows the LSTM to perform well in classifying, processing, or predicting complex dynamic sequences [21].
The present study aims to compare the performance of LSTM and GRU in predicting the future behavior of an industrial paper pulp press. Section 2 presents a survey of related work. Section 3 describes the theory of the LSTM and GRU networks, as well as the formulae used to calculate the different errors. Section 4 describes the methods used for cleaning the dataset and also the behavior of some samples. Section 5 describes the experimental methods. Section 6 describes the tests, results, and validation of the predictive models. Section 7 discusses the results and compares them to the state of the art. Section 8 draws some conclusions and suggestions for future work.

Literature Review
Monitoring physical assets has become a priority for predictive maintenance. Recent studies demonstrate the importance of the topic [22,23]. Many statistical and machine learning tools have been used for prediction purposes, in monitoring and preventing equipment failures [24,25], quality control [26], and in other areas [27].
Artificial neural networks have received special attention in the area of electrical energy. Studies, such as [27,28], show their capacity and performance as good predictors, as long as a dataset with sufficient quality and quantity of data is available and the right parameters are found.

Predictive Maintenance
The creation of a predictive maintenance program is a strategic decision that, until now, has lacked analysis of issues related to its installation, management, and control. Carnero [29] suggests that predictive maintenance can provide an increase in safety, quality, and availability in industrial plants.
Bansal et al. [30] present a new real-time predictive maintenance system for machine systems based on neural networks. Other studies, such as [31,32], indicate the feasibility of artificial neural networks for predictive maintenance.

MLP and Recurrent Networks
Multilayer perceptron (MLP) neural networks have been used with success for predicting and diagnosing pump failures, showing promising results with different types of failures [33][34][35][36]. According to Ni and Wang [37] and Partovi and Anandarajan [38], neural networks have high prediction accuracies and aid in decision-making [39].
In the context of recurrent neural networks, LSTM-based models presented good performance in time series classification tasks and prediction tasks [40]. The LSTM network is useful in solving non-linear problems due to its non-linear processing capacity [41].
Sakalle et al. [42] used an LSTM network to recognize a number of emotions in brain waves. The results obtained with the LSTM were superior when compared to the other models mentioned in the study. The same approach was used in predictive and proactive maintenance for high-speed rail power equipment [43]. Some architectures based on LSTM and GRU networks show good ability in predicting univariate or multivariate time series [44][45][46].
Models that use RNNs are usually suitable for time-series information. Hochreiter and Schmidhuber [47] proposed the LSTM, which has shown remarkable performance in several sequence-centric tasks, such as handwriting recognition [48,49], acoustic modeling of speech [50,51], language modeling [52], and language translation. Besides these areas, such networks have also been used in predicting heart failure [53].

Deep Learning
Recently, deep learning strategies have been used, with success, in a variety of areas [54]. Vincent et al. [55] show that deep neural networks can outperform other methods in voice recognition tasks. A similar approach was used in audio processing [56].
Yasaka et al. [35] used deep learning with a convolutional neural network (CNN), obtaining a high performance in image recognition. The images themselves can be used in a learning process with this technique, and feature extraction prior to the learning process is not necessary. Other studies in the field of computer vision include [57,58].
Krizhevsky et al. [36] showed good results in image processing, employing a layered pre-training technique. The analysis shows that a large deep convolutional neural network can achieve record-breaking results on a challenging dataset using supervised learning. The same study demonstrates how important the number of convolutional layers is to achieve good results. Deep architectures are necessary in order to learn the kinds of complicated functions that can represent high-level abstractions. There is a need for an exhaustive exploration of the types of layers, sizes, transfer functions, and other hyperparameters [59]. Figure 1 shows the inner design of an LSTM unit cell, according to Li and Lu [60]. Formally, the LSTM cell model is characterized as follows:

Long Short-Term Memory
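In the standard formulation, the gate activations and states of the LSTM cell at time step $t$ are computed as:

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \circ c_{t-1} + i_t \circ \tilde{c}_t \\
h_t &= o_t \circ \tanh(c_t)
\end{aligned}
$$

where $\sigma$ is the logistic sigmoid function and $\circ$ is the element-wise product. Matrices $W_q$ and $U_q$ contain the weights of the input and recurrent connections, where the index $q$ can be the input gate $i$, the output gate $o$, the forget gate $f$, or the memory cell $c$, depending on the activation being calculated. $c_t \in \mathbb{R}^h$ is not a single cell of an LSTM unit, but contains the $h$ cells of the LSTM units, while $i_t$, $o_t$, and $f_t$ represent the activations of the input, output, and forget gates, respectively, at time step $t$, where:
• $x_t \in \mathbb{R}^d$: input vector to the LSTM unit;
• $f_t \in (0, 1)^h$: forget gate's activation vector;
• $i_t \in (0, 1)^h$: input/update gate's activation vector;
• $o_t \in (0, 1)^h$: output gate's activation vector;
• $h_t \in (-1, 1)^h$: hidden state vector, also known as the output vector of the LSTM unit;
• $\tilde{c}_t \in (-1, 1)^h$: cell input activation vector;
• $c_t \in \mathbb{R}^h$: cell state vector.
$W \in \mathbb{R}^{h \times d}$, $U \in \mathbb{R}^{h \times h}$, and $b \in \mathbb{R}^h$ are the weight matrices and bias vector parameters, which need to be learned during training. The indices $d$ and $h$ refer to the number of input features and the number of hidden units, respectively.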
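The cell update described by the definitions above can be illustrated with a minimal NumPy sketch of a single time step. It assumes the standard LSTM formulation (sigmoid gates, tanh candidate and output squashing); the function names, the dictionary layout of the parameters, and the random initialization are hypothetical choices made for illustration and are not taken from the study.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step (standard formulation, assumed here).

    W, U and b are dicts keyed by 'f', 'i', 'o', 'c' (forget gate, input
    gate, output gate, cell candidate) with shapes (h, d), (h, h) and (h,).
    """
    f_t = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])      # forget gate
    i_t = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])      # input gate
    o_t = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])      # output gate
    c_tilde = np.tanh(W["c"] @ x_t + U["c"] @ h_prev + b["c"])  # cell candidate
    c_t = f_t * c_prev + i_t * c_tilde                          # new cell state
    h_t = o_t * np.tanh(c_t)                                    # new hidden state
    return h_t, c_t

def init_lstm_params(d, h, seed=0):
    """Random illustrative parameters (hypothetical helper, not trained weights)."""
    rng = np.random.default_rng(seed)
    W = {k: rng.normal(scale=0.1, size=(h, d)) for k in "fioc"}
    U = {k: rng.normal(scale=0.1, size=(h, h)) for k in "fioc"}
    b = {k: np.zeros(h) for k in "fioc"}
    return W, U, b
```

With d = 6 input features and h = 40 hidden units, one call to `lstm_step` maps a six-element sensor vector and the previous states to a new pair of 40-element state vectors.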

Gated Recurrent Unit
The gated recurrent unit is a special type of optimized LSTM-based recurrent neural network [61]. The GRU internal unit is similar to the LSTM internal unit [62], except that the GRU combines the input gate and the forget gate of the LSTM into a single update gate. In [63], a new system called the multi-GRU prediction system was developed based on GRU models for the planning and operation of electricity generation.
The GRU was introduced by Cho et al. [64]. Although it was inspired by the LSTM unit, it is considered simpler to calculate and implement. It retains the LSTM's resistance to the vanishing gradient problem. Its internal structure is simpler and, therefore, it is also easier to train, as less calculation is required to update the internal states. The update gate controls the extent to which the state information from the previous moment is retained in the current state, while the reset gate determines whether the current state should be combined with the previous information [64]. Figure 2 shows the internal architecture of a GRU unit cell. In the standard formulation, the gating mechanism of the GRU cell is controlled by the following functions:

$$
\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma(W_r x_t + U_r h_{t-1} + b_r) \\
\tilde{h}_t &= \tanh(W x_t + U (r_t \circ h_{t-1}) + b) \\
h_t &= z_t \circ h_{t-1} + (1 - z_t) \circ \tilde{h}_t
\end{aligned}
$$

where $W_z$, $W_r$, and $W$ denote the weight matrices for the corresponding connected input vector, $U_z$, $U_r$, and $U$ represent the weight matrices of the previous time step, and $b_r$, $b_z$, and $b$ are biases.
$\sigma$ denotes the logistic sigmoid function, $\circ$ the element-wise product, $r_t$ the reset gate, $z_t$ the update gate, and $\tilde{h}_t$ the candidate hidden state [65]. The GRU thus has an update gate and a reset gate, similar to the forget and input gates of the LSTM unit. The update gate defines how much old memory to keep, and the reset gate defines how to combine the new input with the old memory. The main difference is that the GRU fully exposes its memory content, using only leaky integration (but with an adaptive time constant controlled by the update gate).
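As an illustration of the gating mechanism described above, the following NumPy sketch performs one GRU time step. It assumes the standard formulation in which the update gate weights the previous state (so a gate value near 1 retains the old memory); some libraries use the opposite convention, with the roles of the gate and its complement swapped. The function and argument names are illustrative only.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x_t, h_prev, Wz, Uz, bz, Wr, Ur, br, W, U, b):
    """One GRU time step (standard formulation, assumed here)."""
    z_t = sigmoid(Wz @ x_t + Uz @ h_prev + bz)           # update gate
    r_t = sigmoid(Wr @ x_t + Ur @ h_prev + br)           # reset gate
    h_tilde = np.tanh(W @ x_t + U @ (r_t * h_prev) + b)  # candidate hidden state
    # Leaky integration: blend the old state and the candidate, weighted by z_t.
    return z_t * h_prev + (1.0 - z_t) * h_tilde
```

Because the new state is a convex combination of the previous state and a tanh output, it stays within [-1, 1] whenever the previous state does.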
Deep learning networks are very sensitive to hyperparameters. When the hyperparameters are incorrectly set, the predicted output will produce high-frequency oscillation [66]. Important hyperparameters for GRU network models are the number of hidden units in the recurrent layers, the dropout value, and the learning rate value.
Individually, these hyperparameters can significantly influence the performance of the LSTM or GRU neural models. Studies, such as [67,68], demonstrate how important the adjustment of hyperparameters is, as it optimizes the learning process and can present good results against more complex neural network structures.
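To give a sense of how the number of hidden units affects model size, the sketch below counts the trainable parameters of a single recurrent layer. It assumes that each gate has one input matrix, one recurrent matrix, and one bias vector, and that the six sensor variables feed the recurrent layer directly; both are simplifying assumptions, and some framework implementations add a second bias vector per gate. The function names are hypothetical.

```python
def recurrent_param_count(n_inputs, n_units, n_gates):
    """Parameters of one recurrent layer: per gate, W (h x d), U (h x h), b (h)."""
    return n_gates * (n_units * n_inputs + n_units * n_units + n_units)

def lstm_params(n_inputs, n_units):
    """LSTM: forget, input and output gates plus the cell candidate (4 blocks)."""
    return recurrent_param_count(n_inputs, n_units, n_gates=4)

def gru_params(n_inputs, n_units):
    """GRU: update and reset gates plus the candidate state (3 blocks)."""
    return recurrent_param_count(n_inputs, n_units, n_gates=3)
```

Under these assumptions, a layer with six inputs and 40 units has 7520 parameters as an LSTM and 5640 as a GRU, i.e., the GRU layer is 25% smaller for the same number of units.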

Model Evaluation
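In the present experiments, LSTM and GRU neural network models are compared. To evaluate the prediction performance of the models, the metrics used were root mean square error (RMSE), mean absolute percentage error (MAPE), and mean absolute error (MAE). They are defined as follows:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left(Y_t - \hat{Y}_t\right)^2}
$$

$$
\mathrm{MAPE} = \frac{100\%}{n} \sum_{t=1}^{n} \left| \frac{Y_t - \hat{Y}_t}{Y_t} \right|
$$

$$
\mathrm{MAE} = \frac{1}{n} \sum_{t=1}^{n} \left| Y_t - \hat{Y}_t \right|
$$

where $Y_t$ is the actual data value and $\hat{Y}_t$ is the forecast obtained from the model. The prediction error is the difference between $Y_t$ and $\hat{Y}_t$, i.e., the difference between the desired output and the output obtained. $n$ is the number of samples used in the test set.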
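The three error measures can be computed directly with NumPy, as in the sketch below. It uses the usual textbook definitions of RMSE, MAE, and MAPE (the latter expressed as a percentage and undefined when an actual value is zero); the function names are illustrative.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be non-zero)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(100.0 * np.mean(np.abs((y_true - y_pred) / y_true)))
```

For example, with actual values (100, 200, 400) and forecasts (110, 180, 400), the MAE is 10 and the MAPE is about 6.67%.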

Data Preparation
The present work is a continuation of previous work, in which the data from the industrial press were studied and analyzed using LSTM models [59]. The industrial presses are monitored by six sensors, with a sampling period of 1 min. The dataset contains data samples from 1 February 2018 to December 2020, for a total of 1004 days. The variables monitored are (1) electric current intensity (C. intensity); (2) hydraulic unit oil level (hydraulic unit level); (3) VAT pressure; (4) motor velocity (velocity); (5) temperature at the hydraulic unit (temperature at U.H.); and (6) torque.
Figure 3 shows the plot of the raw data. As the graph shows, there are zones of typical operation and spikes of discrepant data. Figure 4 is a Q-Q plot, used to assess the normality of the data. As the figure shows, the data are not homogeneous. There are many discrepant samples in the extreme quantiles, and the points do not follow a straight line, so the distribution departs from normality.
Data quality is essential for developing effective modeling and planning. Data with discrepant values, such as those shown in the charts, can pose difficulties to machine learning models. Therefore, data need to be processed and structured prior to analysis.
There are several treatment methods designed for this purpose, but a careful selection is needed so that information is not impaired. In the present work, the approach followed was the quantile method [59]. The quantile method removes extreme values, which are often due to sensor reading errors, stops, or other abnormal situations. After those samples are removed, the data distributions are closer to normal.
Since the present study relies on the information that exists in the samples, it is useful to examine the correlations between the variables. That information is condensed in the correlation matrix shown in Figure 7. As the figure shows, some variables are noticeably correlated, such as the current, torque, and pressure. Other correlations are very low, such as that between oil level and temperature.
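The cleaning and correlation steps described above can be sketched as follows. The 1% and 99% thresholds, the removal of a whole row when any variable is out of range, and the function names are assumptions made for illustration; the quantiles actually used in the study are not stated here.

```python
import numpy as np

def quantile_filter(data, lower=0.01, upper=0.99):
    """Keep only the rows where every column lies within its own quantile range.

    data: array of shape (n_samples, n_variables).
    lower, upper: assumed quantile thresholds (illustrative values).
    """
    data = np.asarray(data, dtype=float)
    lo = np.quantile(data, lower, axis=0)   # per-variable lower bound
    hi = np.quantile(data, upper, axis=0)   # per-variable upper bound
    mask = np.all((data >= lo) & (data <= hi), axis=1)
    return data[mask]

def correlation_matrix(data):
    """Pearson correlation between the columns (variables) of data."""
    return np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
```

Applied to the six sensor columns, `correlation_matrix` returns a 6 x 6 symmetric matrix with ones on the diagonal, which is the kind of information condensed in a correlation plot.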

Methods
The present study aims to compare the performance of the LSTM model and the GRU model in predicting future sensor values 30 days in advance, based on a window of past values. Experiments were performed using a computer with a third-generation i5 processor and 8 GB of RAM. Previous work [59] shows that LSTM can make predictions with MAPE errors down to 2.17% for current intensity, 2.71% for hydraulic unit oil level, 2.50% for torque, 7.65% for VAT pressure, 16.88% for velocity, and 3.06% for temperature, using a window of 10 days and a sampling rate of two samples per day per sensor.
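The construction of training pairs from a window of past values and a target some steps ahead can be sketched as below. The sketch assumes that each target is the single sample located `horizon` steps after the end of its window; whether the models in the study predict a single point or a sequence is not specified here, and the function name is hypothetical.

```python
import numpy as np

def make_windows(series, window, horizon):
    """Build (input window, future target) pairs from a multivariate series.

    series: array of shape (T, n_features), ordered in time.
    window: number of past samples in each input.
    horizon: number of steps between the last input sample and the target.
    Returns X of shape (N, window, n_features) and y of shape (N, n_features).
    """
    series = np.asarray(series, dtype=float)
    n = len(series) - window - horizon + 1
    if n <= 0:
        raise ValueError("series is too short for this window and horizon")
    X = np.stack([series[i:i + window] for i in range(n)])
    first_target = window - 1 + horizon          # index of the target of X[0]
    y = series[first_target:first_target + n]
    return X, y
```

With one sample per day (1004 samples), a 12-day window, and a 30-day horizon, this yields 963 input/target pairs.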
In the present work, different network architectures and hyperparameters were tested, for LSTM and GRU. In both cases, the networks rely on an encoding layer, a hidden layer of variable length, and an output layer. The internal architecture of the LSTM and GRU units is as shown in Figures 8 and 9.
The models were programmed in Python, using the frameworks TensorFlow and Keras. For training, a batch size of 16 was used. Other hyperparameters, such as the activation function, when not indicated otherwise, are the TensorFlow and Keras default values. For the experiments, the dataset was divided into train and test subsets. The test set was used for validation during the training process and for final evaluation. Its samples were not included in the training set. The training set consisted of 70% of the samples and the test set contained the remaining 30% of the samples.
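A 70/30 division of the samples can be sketched as follows. The sketch assumes a chronological split, in which the most recent 30% of the samples form the test set; this is a common choice for time series, but the text does not state whether the split was chronological or shuffled. The function name is illustrative.

```python
def chronological_split(samples, train_fraction=0.7):
    """Split a time-ordered sequence into train and test parts without shuffling."""
    n_train = int(len(samples) * train_fraction)
    return samples[:n_train], samples[n_train:]
```

For 1004 daily samples, this gives 702 samples for training and 302 for testing.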
Experiments were performed with different resampling rates. Using aggressive resampling, the size of the dataset is greatly reduced, which increases speed and decreases the influence of outliers in the data. However, when more precision is needed, less aggressive resampling must be used.
To determine the best size for the sliding window, experiments were performed, resampling to just one sample per day, which gave a total of 1004 samples, 70% of which were used for training and 30% for testing. Experiments were also performed to determine the best resampling rate, showing that using one sample per hour was a good compromise between the computation required and the performance of the model, as explained in Section 6.
Different experiments were performed to compare the performance of the LSTM and the GRU, with different sets of hyperparameters. The parameters were varied and tested one-by-one. Dense search methods, such as grid-search, were not used because of the processing time required.
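The one-by-one variation of hyperparameters can be sketched as a simple coordinate-wise search. The sketch assumes that the best value found for one hyperparameter is kept while the next one is varied; the text does not say whether this or a reset to default values was used. `evaluate` is a hypothetical callable that would train a model and return its error.

```python
def one_at_a_time_search(evaluate, defaults, candidates):
    """Vary one hyperparameter at a time, keeping the best value found so far.

    evaluate: callable mapping a dict of hyperparameters to an error (lower is better).
    defaults: dict with the starting value of each hyperparameter.
    candidates: dict mapping each hyperparameter name to the values to try.
    """
    best = dict(defaults)
    best_score = evaluate(best)
    for name, values in candidates.items():
        for value in values:
            trial = dict(best)
            trial[name] = value
            score = evaluate(trial)
            if score < best_score:       # keep the improvement and move on
                best, best_score = trial, score
    return best, best_score
```

With six candidate window sizes and eight candidate layer sizes, this procedure needs 1 + 6 + 8 = 15 evaluations, whereas a full grid over the same values would need 6 x 8 = 48.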

Experiments and Results
Experimental work was performed to confirm the ability of the models to learn, and then to determine the optimal hyperparameters of the LSTM and GRU. Figure 10 shows the learning curve of a GRU model, with 40 units in the hidden layer and window of 12 samples. The graph shows the loss measured in the train and in the test set. The learning process converges and takes less than 10 epochs to reach a small loss. This is similar to previous results obtained for the LSTM [59].

Testing the Convergence of the Learning Process
Although the learning curve shows that the model learns very quickly, in less than 10 epochs, in the following experiments, the number of epochs was limited to 15.

Experiments to Determine Model Performance with Different Window Sizes
The first experiment carried out aimed to find the optimal window for the LSTM model and for the GRU model. The experiments were performed using one sample per day. Thus, the dataset had a total of 1004 samples. The models used had 40 units in the hidden layer. Figure 11 shows the results of the two models, using different window sizes and two different activation functions in the output layer. The RMSE is the average of all the variables. As the charts show, the GRU is always better than the LSTM, regardless of the window size or activation function used. The window size only has a small impact on the performance of the model, with minimal differences between 2 and 12 days. On the other hand, the results are better when the ReLU is used at the output layer. When the sigmoid function is used, the difference in performance between the GRU and the LSTM is larger than when the ReLU function is used.
Figure 11. RMSE values for LSTM and GRU models, with different window sizes and activation functions for the output layer: (a) results with ReLU activation; (b) results with sigmoid activation.
Figure 12 shows the MAPE and MAE associated with the 30-day forecast, for past windows of 2 to 12 days. The charts demonstrate that the LSTM architecture that uses a ReLU activation function in the output layer has lower errors. Using the sigmoid function, the LSTM errors are much larger. The GRU, however, in general performs better than the LSTM for all variables and activation functions. The prediction error results are much more stable for the GRU than they are for the LSTM. Table 1 shows exact error values for the best window sizes for the LSTM model. Table 2 shows the best window sizes for the GRU model.

Experiments to Determine Model Performance with Different Resample Rates
In a second experiment, the models were tested with different resampling rates. Resampling is often used as a preprocessing method, and different techniques exist. One of them is undersampling, in which the dataset size is reduced, which speeds up the data processing. In other cases, oversampling methods (such as data augmentation) are used in order to increase the number of samples.
In the present experiments, the dataset contains a large number of samples, so only undersampling techniques are necessary in order to reduce the number of data points. The method used was to average a number of samples, depending on the size of the dataset desired. Experiments were performed undersampling to obtain one sample per 12 h (two samples per day), one per six hours (four samples per day), one per three hours, and finally one sample per hour. In all cases, the dataset size was greatly reduced.
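Undersampling by averaging can be sketched as below. The sketch assumes evenly spaced samples with no gaps and simply averages consecutive groups of fixed size; with real timestamps and missing samples, a time-based grouping would be more appropriate. The function name is illustrative.

```python
import numpy as np

def average_downsample(samples, factor):
    """Average consecutive groups of `factor` rows.

    samples: array of shape (n_samples,) or (n_samples, n_variables).
    factor: number of original samples per output sample
            (e.g., 60 to go from 1-min data to hourly data).
    An incomplete trailing group is dropped.
    """
    samples = np.asarray(samples, dtype=float)
    n_groups = len(samples) // factor
    trimmed = samples[: n_groups * factor]
    grouped = trimmed.reshape(n_groups, factor, *samples.shape[1:])
    return grouped.mean(axis=1)
```

With a 1 min sampling period, factors of 60, 180, 360, and 720 correspond to one sample per hour, per three hours, per six hours, and per 12 h, respectively.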
The window sizes were the best of the previous experiments: a window size of 4 days for the LSTM and 12 days for the GRU, with the ReLU, and a window size of 6 days for the LSTM and 10 days for the GRU, with the sigmoid. Figure 13 shows the average RMSE errors for both models. As the results show, sometimes the LSTM outperformed the GRU, namely when using the sigmoid function with periods of six and three hours. However, the difference was not statistically significant. On the other hand, the GRU was able to learn in all the situations and the RMSE error was always approximately 1. So, the GRU is robust and accepts larger periods with minimal impact on the performance, while the LSTM model is much more unstable.
Figure 13: (a) using the ReLU function at the output layer; (b) using the sigmoid function at the output layer.
Figure 14 shows the MAE and MAPE errors calculated for each variable. It is possible to verify that, in general, the errors are much smaller with the sigmoid function. The LSTM model with the ReLU function is able to learn when a period of 12 h is used. When the sampling period is six hours, it seems the error gradient explodes for all variables and the errors become extremely large. For lower sampling periods, the LSTM does not learn. The GRU model continues to learn with acceptable errors. Table 3 shows the best results for the LSTM model with different resampling rates. Table 4 shows the best results for the GRU model with different resampling rates.

Experiments with Different Layer Sizes
An additional experiment was performed, to compare the performance of the models with different numbers of units in the hidden layer.
Using the GRU model, it is possible to learn with a larger number of samples, and with different numbers of units in the model, as shown in Figures 15 and 16. The LSTM was unable to learn with the resampling period of 1 h; therefore, its results are missing. The window used in the experiments was 10 days for the sigmoid and 12 days for the ReLU, which were the optimal windows for the GRU using the sigmoid and ReLU functions, respectively.
(a) Using the ReLU function at the output layer. (b) Using the sigmoid function at the output layer.
As the charts show, the GRU, using the sigmoid activation function, achieves the lowest RMSE error with 50 units in the hidden layer. Experiments described in Section 6.3 showed that the GRU with the same parameters, with 40 units in the hidden layer, had an RMSE error of 1.06. Table 5 shows the best results for the GRU model, after the tests with different numbers of cells in the hidden layer.

Comparing Many-to-Many and Many-to-One Architectures
An additional experiment was performed to determine whether the models perform better when trained to predict all the variables at the same time (one model, six outputs: many-to-many) or when trained to predict just one variable (six models, one output each: many-to-one).
This experiment was performed only for the GRU, which presented the best results in the previous experiments.
According to the graphs presented in Figure 17, the many-to-many architecture presents slightly better results. Therefore, there is no advantage in training one model to predict each variable.
Figure 17. Comparison of the performance of the GRU models, trained to predict many-to-many and many-to-one variables.

Tests with Different Activation Functions in the Hidden Layer
An additional step was to test combinations of different activation functions, for the hidden and output layers of the GRU. The activation functions tested were sigmoid, hyperbolic tangent (tanh), and ReLU. Figure 18 shows a chart with the average RMSE of the models. Globally, ReLU in the hidden layer and tanh in the output layer is the best combination, even though ReLU-sigmoid and ReLU-ReLU are close behind. Table 6 shows the RMSE error for the different combinations of activation functions, for each variable. As the table shows, different variables may benefit from different functions, although, in general, a first layer of ReLU and a second layer of ReLU, sigmoid, or tanh are good choices.
The values shown in Table 6 are calculated for the raw output predicted. However, the raw output values have some sharp variations, which are undesirable for a predictive system. Therefore, the values were filtered and smoothed using a median filter. Figure 19 shows plots of selected results, where the signals and predictions were filtered with a rolling median filter, with a rolling window of 48 h. Table 7 shows the MSE errors calculated after smoothing. As the table shows, after smoothing, the prediction errors decrease.
Figure 19 shows examples of plots of different prediction lines in part of the test set. As the results show, in some cases the ReLU-tanh combination is the best, while in other cases, the ReLU-sigmoid offers better performance. The ReLU-tanh combination is better, in general, but in the case of temperature, the sigmoid output shows the best performance.
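The smoothing step can be illustrated with a simple rolling median. The sketch assumes a trailing window (each output uses the current sample and the preceding ones) and shorter windows at the start of the signal; whether the filter used in the study was trailing or centered is not stated. With one sample per hour, a 48 h window corresponds to `window=48`. The function name is illustrative.

```python
import numpy as np

def rolling_median(values, window):
    """Trailing rolling median of a 1-D signal.

    The first window-1 outputs use only the samples available so far.
    """
    values = np.asarray(values, dtype=float)
    out = np.empty_like(values)
    for t in range(len(values)):
        start = max(0, t - window + 1)
        out[t] = np.median(values[start:t + 1])
    return out
```

Unlike a moving average, the median discards isolated spikes entirely: a single outlier inside a window of three or more samples does not affect the output.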

Discussion
Based on studies presented in the state of the art, it is possible to verify the usefulness of deep networks for prediction in time series variables. The area of prediction using deep neural networks has grown fast, due to the development of new models and the evolution of calculation power. LSTM and GRU models are two of the best forecast models. They have gained popularity recently, even though most of the state-of-the-art models are more traditional architectures.
The GRU network is simpler than the LSTM, supports higher resampling rates, and can work on smaller and larger datasets. The experiments performed showed that the best results are obtained with the GRU neural network: it is easier and faster to train and achieves good results. A GRU network, with encoding and decoding layers, is able to forecast the future behavior of an industrial paper press, 30 days in advance, with MAPE in general less than 10%. An optimized GRU model offers better results with a 12-day sliding window, a sampling period of 1 h, and 50 units in the hidden layer. The best activation functions depend on the model. However, the ReLU-tanh combination is perhaps one of the best, on average.
The results also demonstrate that training the models using just one output variable, thus optimizing a model for each variable separately, is not advantageous when compared to training one model to predict all six variables at the same time.
The present work shows that a GRU network, with encoding and decoding layers, can be used to anticipate the future behavior of an industrial paper press. It shows better overall performance, with lower processing requirements, when compared to an equivalent LSTM model. To the best of the authors' knowledge, this is the first time such a study has been made. The prediction errors are smaller than those presented by the LSTM neural network, and the GRU is more robust to exploding or vanishing gradient problems, so it learns in a wider range of configurations.
Compared to the literature, previous research has shown that the GRU is often the best predictor [69][70][71]. However, those studies were performed for univariate data only. The present work uses six variables in a time series and compares the multivariate and the univariate models. In [72], the model that presents the lowest RMSE is the ARIMA. However, that is just for a small dataset and a forecast horizon of 6 samples. In [44,73], forecasting models with LSTM, including encoding and decoding, are proposed, although not compared to GRU.

Conclusions
In the industrial world, it is important to minimize downtime. Equipment downtime, due to failure or curative maintenance, represents hours of production lost. To solve this problem, predictive maintenance is, nowadays, the best solution. Artificial intelligence models have been employed, aimed at anticipating the future behavior of machines and, therefore, avoiding potential failures.
The study presented in this paper compares the performance of LSTM and GRU models in predicting, 30 days in advance, future values of six sensors installed at an industrial paper press.
The GRU models, in general, operate with less data and offer better results, with a wider range of parameters, as demonstrated in the case study based on pulp presses. Future work will include testing the GRU with different time gaps, in order to determine the best performance achievable for each.
Funding: Our research, leading to these results, has received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement 871284 project SSHARE and the European Regional Development Fund (ERDF) through the Operational Programme for Competitiveness and Internationalization (COMPETE 2020), under project POCI-01-0145-FEDER-029494, and by national funds through the FCT-Portuguese Foundation for Science and Technology, under projects PTDC/EEI-EEE/29494/2017, UIDB/04131/2020, and UIDP/04131/2020.

Institutional Review Board Statement: Not applicable.
Informed Consent Statement: Not applicable.

Data Availability Statement:
Restrictions apply to the availability of these data.

Conflicts of Interest:
The authors declare no conflict of interest.