A Data-Driven Multi-Regime Approach for Predicting Energy Consumption

Abstract: Interest in reducing carbon footprints has grown globally in recent years, and governments are therefore promoting energy efficiency and a larger share of green energy. Optimizing energy consumption is consequently becoming more critical for people, companies, industries, and the environment. Predicting energy consumption more precisely makes future energy management planning more effective. To date, most research papers have focused on predicting residential building energy consumption; however, a large portion of energy is consumed by industrial machines. Predicting the energy consumption of large industrial machines in real time is challenging because of concept drift, under which prediction performance deteriorates over time. In this research, a novel data-driven method, the multi-regime approach (MRA), was developed to better predict the energy consumption of industrial machines. Whereas most papers have focused on finding a single excellent prediction model, which contradicts the no-free-lunch theorem, this study concentrated on adding potential concept drift points into the prediction process. A real-world dataset collected from a semi-autogenous grinding (SAG) mill was used as the data source, and a deep neural network was used as the prediction model for the MRA method. The results showed that the MRA method enables the detection of multiple regimes over time and provides highly accurate prediction performance, thanks to the dynamic model approach.


Introduction
Energy management is becoming vital for companies around the world, and energy prediction is necessary as an initial step to create an energy management system [1]. Predicting energy consumption is not only essential for energy management, but it is also crucial when considering climate change [2]. In the last decade, numerous researchers have applied several statistical analyses, such as data mining, machine learning and deep learning methods, on time series to predict energy consumption for buildings, cities and industrial machines [3][4][5].
The mining industry is economically vulnerable because of recessions, economic uncertainty and the use of machines that are costly to maintain [6]. In mineral processing, comminution is one of the most substantial operations, especially milling [7]. Moreover, milling machines have high energy and maintenance costs [8,9]. Semi-autogenous grinding (SAG) mills have become more common in mining due to economic advantages, such as advanced processing capacity, low physical space requirements, relatively low maintenance expenses, convenient configuration, and low investment [10][11][12]. Feed size, ore hardness and mill load are essential variables for productive SAG mill operation, but it is not always possible to have optimum values [13]. Predicting energy consumption accurately is a complex and critical task for researchers. Detecting change points and integrating these unpredictable change points into the prediction process is one of the most challenging tasks [16]. Time series forecasting has become more popular in recent decades due to its significant applications in numerous fields [17], such as energy consumption [18][19][20], financial variables [21] and wind power generation [22,23]. A single method cannot achieve satisfactory prediction results for all types of time series [24]. Various intricate models have been studied to examine the nonlinear behavior of time series [25,26]. However, one of the biggest challenges is that the data stream may contain frequent, repetitive variations over time, called concept drift [27].
The term "concept drift" refers to unanticipated shifts in the underlying distribution of streaming data over time. Furthermore, many researchers have tried to solve the multiple regime problem on time series data [16,26,28,29]. Nevertheless, there are not sufficient papers that investigate real industrial machines' time series with possible multi-regime solutions. To the authors' best knowledge, most papers employ a traditional approach to create their prediction model, which disregards the data stream with concept drift issues and results in less accurate prediction performance over time.
In this research, the conventional method was first applied to show how its performance declines on a real-time SAG mill dataset. Then, a data-driven model was developed to avoid this degradation in prediction performance. Furthermore, if there are multiple repetitive regimes in industrial machinery datasets, the MRA method aids in detecting these regimes with a high level of accuracy. Finally, we compared the performance of the MRA method with the traditional approach, and the results showed that the proposed model reduces the overall error rate and is useful in finding repetitive regimes.

Motivation
A large number of research papers have investigated the buildings' or cities' energy consumption [30][31][32][33][34]. However, industrial machines use a vast amount of energy [35]; the industrial sector consumes over half of the world's total energy, and its energy consumption has nearly doubled in the last 60 years [36]. For example, the average annual electricity consumption for a U.S. residential utility customer in 2019 was 10,649 kilowatthours (kWh) [37], and a SAG mill consumes the same amount of energy in an hour. Furthermore, industrial machines tend to have more complicated distributed time series with multi-regime running cases, which may be for reasons such as operator behavior, load amount, material type, or size. In this research, a novel data-driven method was developed to enhance the prediction performance of industrial machine energy consumption based on the variables.

Contribution
Because of concept drift, which is one of the biggest problems in real-time data streams, the prediction accuracy of the traditional approach decreases over time. A novel method has been proposed to handle concept drift issues and precisely estimate the energy consumption of industrial machines. In addition, instead of dividing the real-time data into fixed-size chunks, the data were split into variable-size chunks based on the machine's operating conditions. Change points and recurrent regimes have thereby been successfully detected over time.
The rest of the paper is organized as follows: Section 2 presents the related work and differences with the MRA method. Section 3 explains the MRA method in detail. The results and comparisons are shared in Section 4. Finally, conclusions and future work are given in Section 5.

Related Work
Over the past few decades, energy consumption and efficiency have attracted an increasing number of researchers, not only for energy saving and supply purposes, but also because of CO₂ emissions, which have a significant impact on climate change [30]. An artificial neural network (ANN) was implemented based on the principal physical method to estimate the building energy consumption [31]. Hamzacebi demonstrated the power of the ANN for the prediction of seasonal time series [32]. A novel method named pattern sequence-based forecasting (PSF) was developed by Alvarez et al. First, a clustering method was applied to cluster the time series data. Second, the sequence of labeled groups was calculated to predict the next day's group, which increased the model performance for the specified group of the time series [33]. Hill et al. compared the traditional statistical method and the neural network method on time series forecasting [14]. Similarly, Tso et al. showed that the decision tree and the neural network outperformed the regression method for the Hong Kong energy-consumption prediction [34]. Kankal et al. used four independent variables (gross domestic product, population, and the amounts of import and export) and implemented an ANN to forecast energy demand [38]. A genetic algorithm and an ANN were integrated by Azadeh et al. to forecast electricity demand for agricultural activities by using stochastic procedures [39]. Wang et al. developed a method to select secondary variables from a cooling energy consumption dataset, and the model discovered periodicity over the time series. As a result, the model could predict energy consumption more precisely compared to the conventional methods [40].
He et al. developed a novel data-driven approach to predict the energy consumption of grinding and milling machines. They implemented several feature extraction methods to eliminate unnecessary features, and deep learning was used as the prediction method. The approach increased prediction accuracy compared to the traditional approach [41]. A similar study was carried out for the prediction of energy consumption of electric arc furnaces, and the results showed that deep neural networks outperformed support vector machines, linear regression, and decision trees [42]. Kant and Sangwan implemented an ANN to predict the cutting energy of machining, and the results confirmed that a higher feed rate and spindle speed use less energy [36]. Avalos et al. used real-time operational variables (feed tonnage, bearing pressure, and spindle speed) from SAG mills. They implemented several deep learning and machine learning techniques to predict the energy consumption of the SAG mills. The results showed that neural networks achieved one of the best prediction performances for SAG mill energy consumption [7].
Several researchers have used the Markov regime-switching model to detect change points related to the multi-regime approach [25,43]. The disadvantage of the model is that the change points must be known before it is applied. However, each machine has specific properties and working conditions, requiring a unique approach for detecting change points more accurately [43]. Additionally, the model is less explainable and problematic to forecast, and it is broadly used in economics to define different structures [25].

Methodology
Industrial machines usually have complex designs and working conditions. Solutions must consider many aspects, such as feature selection, noisy data, trending data, stationarity, nonlinearity, seasonality, and multi-regimes. Moreover, accurately analyzing industrial time series requires an interdisciplinary approach to better understand the problem. In this paper, a novel data-driven model was developed, and working conditions and running cycles were considered based on a subject matter expert's (SME's) advice, which helped us to develop a better prediction model. Figure 2 summarizes the main steps of the proposed method to predict SAG mill energy consumption with a multi-regime approach.
Step 1: Understanding the data and running conditions of the machine is crucial for accurately discovering the machine's potential change points over time. Furthermore, an SME was consulted to decide threshold values for the output and potential change points to investigate possible regime regions, named chunks. There were five factors that directly or indirectly impact energy consumption, and all those features were used as input variables.
Step 2: Real-time industrial data usually have several issues, such as missing values, outliers, noisy data, feature tag names that change over time, and upgrades to sensor quality and sensitivity. As a preprocessing step, the data were cleaned for further processing. Missing values are a widespread problem because of the limited reliability of sensors, and there are two common strategies in the literature [20]. If the majority of features were missing in a single record, the whole row was removed from the dataset. If only a minority of the features were missing in a record, the missing values were replaced with the corresponding mean value. In this way, we attempted to use each record as much as possible. Each chunk has a different data size, and several chunks have a limited number of instances. Furthermore, when the data are split into several chunks, any record is considered valuable for its chunk.
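The two missing-value strategies of Step 2 can be sketched in plain Python as follows. This is only an illustrative sketch: the function name `clean_records`, the `None` marker for a missing reading, the 50% cut-off for "majority", and the use of the per-feature mean over the retained rows are assumptions made here and are not specified in the text.

```python
from statistics import mean


def clean_records(rows, majority=0.5):
    """Drop rows with mostly-missing features; mean-impute the rest.

    rows: list of equal-length lists; None marks a missing sensor value.
    majority: assumed cut-off; a row is dropped when more than this
              fraction of its features is missing.
    """
    n_features = len(rows[0])
    # Keep rows in which at most `majority` of the features are missing.
    kept = [r for r in rows
            if sum(v is None for v in r) <= majority * n_features]
    # Per-feature mean over the observed values of the kept rows (assumption).
    col_means = []
    for j in range(n_features):
        observed = [r[j] for r in kept if r[j] is not None]
        col_means.append(mean(observed) if observed else None)
    # Replace each remaining missing value with its feature mean.
    return [[col_means[j] if v is None else v for j, v in enumerate(r)]
            for r in kept]


records = [[1.0, 2.0, None], [3.0, None, None], [5.0, 4.0, 6.0]]
cleaned = clean_records(records)
```

In this toy input, the second record is dropped because two of its three features are missing, and the gap in the first record is filled with the mean of the third feature.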
Step 3: This step was mainly designed to discover potential regime areas. Several change points were selected when a daily cumulative energy consumption equaled zero for more than 24 h. Long-time inactivity is abnormal for a SAG mill as they work 24 h a day, seven days a week, except for regular maintenance or machine breakdowns. During inactive days, various operational changes on the machine that might significantly impact the machine running cycle are considered potential change points. Furthermore, the threshold timing should be updated according to machine type and working conditions.
Equation (1) is used for deciding the change points and the chunks. $W_t$ represents the timing window threshold and is a minimum 24 h time period for the cumulative energy consumption. $S_{OV}$ denotes the sum of the output variable, and $O_V(t)$ denotes the hourly energy consumption for the selected duration.

$$S_{OV} = \sum_{t=1}^{W_t} O_V(t) \tag{1}$$

After separating the data into several chunks based on the threshold, a deep neural network (DNN) model was developed based on the first chunk. The chunk data were divided into several training and testing percentages, and the final split ratio was decided according to performance. The remaining chunks were used as unseen testing data.
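The chunking rule of Step 3 can be illustrated with a short Python sketch. The function name `split_into_chunks` and the index-pair output format are hypothetical, and the sketch assumes that a stop counts as a change point when the hourly consumption is zero for at least `window` consecutive hours (the text mentions both "more than 24 h" and "a minimum 24 h", so the exact comparison is an assumption). Since consumption is non-negative, a zero cumulative sum over the window is equivalent to a run of zero readings.

```python
def split_into_chunks(hourly_energy, window=24):
    """Return (start, end) index pairs, end exclusive, for the active chunks.

    A run of zero hourly consumption lasting at least `window` hours is
    treated as a change point that closes the current chunk.
    """
    chunks = []
    chunk_start = None      # first index of the chunk currently being built
    zero_run_start = None   # first index of the current run of zero readings
    for t, value in enumerate(hourly_energy):
        if value == 0:
            if zero_run_start is None:
                zero_run_start = t
            continue
        # A non-zero reading ends any zero run; a long run closes the chunk.
        if zero_run_start is not None:
            if chunk_start is not None and t - zero_run_start >= window:
                chunks.append((chunk_start, zero_run_start))
                chunk_start = None
            zero_run_start = None
        if chunk_start is None:
            chunk_start = t
    if chunk_start is not None:
        n = len(hourly_energy)
        # A long zero run at the very end also closes the last chunk early.
        trailing_stop = (zero_run_start is not None
                         and n - zero_run_start >= window)
        chunks.append((chunk_start, zero_run_start if trailing_stop else n))
    return chunks


series = [5] * 3 + [0] * 24 + [4] * 2 + [0] * 5 + [3] * 2
chunks = split_into_chunks(series, window=24)
```

With this toy series, the 24 h stop separates two chunks, while the 5 h stop stays inside the second chunk. Because the window is a parameter, the threshold can be adjusted for other machine types, as the text recommends.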
A DNN was selected as a prediction method for the machine's energy consumption since it provides one of the best accuracies in the literature [20]. There are several alternatives to the DNN model, but a comparison for different models was not investigated in this paper. Overall, the main goal is to improve prediction performance by discovering potential repetitive multi-regimes over time.
A DNN model has an input layer, output layer, and multiple hidden layers. It uses a multi-layer feed-forward neural network structure. It also has more enhanced features, such as dropout, early stopping and penalties on the l1 and l2 norms of the weights against overfitting problems. Many hidden layers containing neurons with hyperbolic tangent function (tanh), rectifier, and sigmoid activation functions can be adjusted in the network. A sample of the DNN structure is illustrated in Figure 3.
Following the computation of the DNN, the value of the output is calculated using a feed-forward method. The mathematical description of the relationship between the output ($y_t$) and the inputs ($y_{t-n}$) is as follows [44]:

$$y_t = W_b + \sum_{j=1}^{h} W_j \, f\left(W_{bj} + \sum_{i=1}^{n} W_{ij} \, y_{t-i}\right) \tag{2}$$

$W_{ij}$ and $W_j$ are model parameters commonly referred to as connection weights. $n$ is the number of input nodes, and $h$ is the number of hidden nodes. $W_b$ and $W_{bj}$ are bias unit weights that are distinctive to each process unit, and $f$ is the activation function, for which the Rectified Linear Unit (ReLU) function is widely used. The network structure and connection weights determine the function $f$. The output error $E_t$ is calculated each time and used as negative feedback to adjust the incoming-weight connections and bias. This adjustment improves the DNN's computation accuracy by minimizing the output error.
Step 4: Thresholds are subjective, and they may have a significant impact on the results. The results can be evaluated during this step to discover optimum threshold values for detecting multiple regimes more accurately and predicting energy consumption more precisely. If the results are not satisfactory, the thresholds for the machine should be updated accordingly.
Finally, the results and possible future work are discussed with an SME, as each industrial machine may have specific working conditions and require an interdisciplinary approach.
The MRA method is illustrated as a flow chart in Figure 4. The method divides the dataset into several chunks and optimizes the necessary models based on the chunks. NC, Th, Err, C, M, and NM represent the total number of chunks, the threshold value, the error rate, the chunk number, the model number, and the number of models, respectively. The DNN model is developed based on Chunk-1 data in the first step, and the following chunks are used as unseen testing data. The most recent model is used for the subsequent chunks until the error rate exceeds the threshold value. When the error rate is greater than the set threshold value (Err > Th), the MRA method first tries all available historical models to obtain an error rate lower than the threshold. All previous models are used in a loop, represented by the symbol i in Figure 4, to discover a suitable historical model for the current chunk. If a satisfactory result (Err < Th) cannot be found, the current chunk is assumed to be a new regime chunk requiring a new model, and the MRA method builds a new model from the current chunk's (C) data.
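The control flow described above can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: `run_mra`, `build_model`, and `evaluate` are hypothetical names, the prediction model is left as a pluggable callable instead of a DNN, and the sketch assumes that the model which last satisfied the threshold remains in use for the next chunk.

```python
def run_mra(chunks, build_model, evaluate, threshold):
    """Assign a model (regime) index to every chunk, following the MRA flow.

    chunks: chunk datasets in time order.
    build_model(chunk) -> model trained on that chunk.
    evaluate(model, chunk) -> error rate (for example, MAPE in percent).
    """
    models = [build_model(chunks[0])]  # Model-1 is built from Chunk-1
    current = 0                        # index of the model currently in use
    regimes = [0]
    for chunk in chunks[1:]:
        if evaluate(models[current], chunk) > threshold:
            # Err > Th: try every historical model before building a new one.
            for i, model in enumerate(models):
                if i != current and evaluate(model, chunk) < threshold:
                    current = i        # repetitive regime found
                    break
            else:
                # No historical model fits: treat the chunk as a new regime.
                models.append(build_model(chunk))
                current = len(models) - 1
        regimes.append(current)
    return models, regimes


# Toy usage: each "chunk" is a single consumption level and each "model"
# simply remembers the level it was built from.
levels = [10, 10.5, 20, 21, 10.2, 30]
models, regimes = run_mra(
    levels,
    build_model=lambda chunk: chunk,
    evaluate=lambda model, chunk: abs(model - chunk) / chunk * 100,
    threshold=10,
)
```

In the toy run, the fifth chunk is assigned to the first model again, which mirrors the repetitive-regime case, while the third and sixth chunks each trigger a new model.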
Additionally, the traditional approach, also named the static model, was compared with the MRA method to quantify the improvement in prediction performance. The most common metrics to determine a model's accuracy for continuous variables are root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE) [45]. RMSE, MAE, and MAPE are given in Equations (3)-(5), respectively. $m$ stands for the number of samples in the test set, $Y_i$ stands for the sample's actual value, and $\hat{Y}_i$ stands for the sample's predicted value. Lower values of these metrics indicate higher model accuracy.

$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(Y_i - \hat{Y}_i\right)^2} \tag{3}$$

$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|Y_i - \hat{Y}_i\right| \tag{4}$$

$$\mathrm{MAPE} = \frac{100\%}{m}\sum_{i=1}^{m}\left|\frac{Y_i - \hat{Y}_i}{Y_i}\right| \tag{5}$$
All three evaluation metrics were shared and used to avoid overfitting and underfitting. Furthermore, MAPE was used for determining the threshold values and general evaluation of the model performance.
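For reference, the three metrics can be computed with a few lines of plain Python. The sketch below uses the standard definitions of RMSE, MAE, and MAPE; the function names are chosen here for illustration. MAPE is undefined when an actual value is zero, so the sketch assumes that inactive hours with zero consumption have already been excluded.

```python
import math


def rmse(actual, predicted):
    """Root mean square error."""
    m = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / m)


def mae(actual, predicted):
    """Mean absolute error."""
    m = len(actual)
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / m


def mape(actual, predicted):
    """Mean absolute percentage error, in percent (actual values must be non-zero)."""
    m = len(actual)
    return 100.0 / m * sum(abs((y - p) / y) for y, p in zip(actual, predicted))


actual = [100.0, 200.0]
predicted = [110.0, 180.0]
scores = (rmse(actual, predicted), mae(actual, predicted), mape(actual, predicted))
```

For these two samples, the MAPE is 10%, which is exactly the threshold level used later in the experiments.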

Dataset
The dataset was collected from a SAG mill over three consecutive years, and the summary statistics are given in Table 1. The time intervals between sequential records differ for each variable, but all features are stored as hourly average values in the dataset. Furthermore, each year has a slightly different number of cumulative active hours: 8744, 8674, and 8539, respectively. There are six input variables (feed particle size, fresh feed amount, mill density, mill sound, mill speed, and mill pressure) and one output variable (mill energy consumption) in the dataset. The mill speed data were removed from the input variables as a preprocessing step since they had around 80% missing values, which left five input variables. The remaining inputs had less than 20% missing values, which were preprocessed according to Step 2. The distributions of the inputs and the output are illustrated in Figure 5, where the x-axes are the actual values and the y-axes are the frequencies. Time series problems can be distinguished from the more common classification and regression problems. If a time series has no trend or seasonal effect, it is classified as stationary. All three years of data have similar distributions, and the dataset appears to be stationary. Figure 6 shows sudden daily cumulative changes in the output value, with three graphs separated by year. When the machine's energy consumption equals zero for longer than the threshold duration (24 h), the period is marked with a red rectangle.

Experimental Results
The data between two consecutive red rectangles are described as a chunk. There are 23 different chunks in total over the three years. The first chunk in June of the third year contains only a small number of records and is counted as one marked shape. The first eight chunks occurred in the first year, chunks 9 to 16 were observed during the second year, and the remaining seven chunks were seen in the third year. After the dataset was divided into different chunks, the MRA method was implemented to detect possible multiple repetitive regime areas based on these chunks. Furthermore, each chunk has a different sample size as the chunks were not divided into a fixed size. Therefore, the MRA method has a more flexible and dynamic approach compared to the static model.
For all DNN structures, the features were standardized since they have different scales. Numerous combinations were tried to discover the optimum hyperparameter values for the DNN prediction performance. The best model was found by varying the number of hidden layers over the set {3, 4, 5} and the number of neurons over the set {50, 100, 150}. The number of epochs, which is the number of passes over the training dataset, was set to 10. ReLU and tanh were tried as activation functions, and ReLU outperformed tanh. We also used the early stopping criterion and dropout for the hidden layers to avoid overfitting where required. Different split ratios were also tried for the training and testing parts, as shown in Appendix A, Table A1, and the split ratio for each model was decided according to prediction performance. In addition, epsilon, which provides forward progress, was set to 1.0 × 10⁻⁸. Rho, the gradient moving average decay factor used for the learning rate decay over each update, was set to 0.99. The l1 regularization, which constrains the absolute value of the weights, was set to 1.0 × 10⁻⁵, whereas l2, which constrains the sum of the squared weights, was set to 0.0. Details of the DNN model parameters are shown in Appendix A, Table A1. Building a new model for each chunk can provide higher prediction performance. However, it is not efficient in terms of time, since tuning the hyperparameters and training each model separately requires extra time. In order to show the efficiency of the developed model, the conventional approach was also applied to the dataset, and the results were compared. As a traditional approach, a DNN model named Model-1 was developed based on Chunk-1 data, and the remaining 22 chunks were used as unseen testing data.
Eighty percent of Chunk-1 data were used as the training set and the remaining 20% as a testing part. Table 2 illustrates Model-1's performance for each chunk. In addition, we calculated the general MAPE moving average to see overall model prediction performance. The last column, named Data Size, shows the sample size of each chunk.
According to Table 2, several chunks have a similar MAPE rate for the static Model-1, indicating that several consecutive chunks are similar in terms of their error rate for the same model. However, which specific regimes are discovered depends on the carefully chosen MAPE threshold value.
The MRA method tries the old models before building a new one when the MAPE exceeds the threshold. When the error rate is higher than the threshold value, a dynamic approach in which a new model is created immediately can be considered. However, testing the old models before creating a new one enables us to optimize the number of developed models and detect possible regime groups. Additionally, it may reduce complexity and save computing time. For this study, the MAPE threshold value was set to 10%, which is accepted as high accuracy for similar research papers in the literature [45]. The MRA method applies the old models sequentially until it finds an error rate lower than the threshold value. If a satisfactory result is not found, a new model is created for the current chunk of data. The MRA method assigns a regime number based on the model used. Table 3 shows the results of the MRA method. Compared to the static approach, the results have greater accuracy as the MRA method creates a new model for chunks on which the current model has a high error rate. Chunk-18, Chunk-20, and Chunk-23 exceed the MAPE threshold for the current model, and they are shown in bold in Table 3. The MRA method enhances the prediction quality due to the dynamic model approach. The machine may have several running modes for different input combinations, and the MRA method assists in discovering those distinct potential regimes by predicting the energy consumption more precisely compared to the traditional method. For Chunk-23, it can be seen that the MRA method used Model-1 and achieved a MAPE lower than the threshold, which is an example of a repetitive regime. However, Chunk-18 and Chunk-20 required a new model based on the agreed threshold. Figure 7 illustrates the MRA method's prediction performance for each chunk that is handled differently from the traditional approach.
According to the results, most new regimes occurred in the last year of the dataset, which is from Chunk-16 to Chunk-23. A performance comparison of the traditional approach and the MRA method for the last eight chunks is shown in Figure 8. The results show that the MRA method outperformed the traditional approach.
Applying old models rather than building a new one has several advantages. First, it facilitates detection of possible repetitive regimes. Second, where the traditional approach's general prediction performance MAPE rate was around 8.35%, the MRA method's general prediction performance was around 5.53%. It also reduces the total number of models by applying old models before building a new one. As a result, the MRA method provides a better prediction performance for the energy consumption of the SAG Mill with the detection of potential repetitive regime areas.

Conclusions and Future Work
In this paper, a novel data-driven method named the MRA method was developed to predict the energy consumption of a SAG mill. The MRA method allows us to discover potential change points over time and enhance the prediction performance. In addition, the performance of the proposed method was compared with the traditional approach. The MRA method reduces the overall error rate and is useful in finding repetitive regimes. Furthermore, we also showed the importance of understanding the dataset rather than just focusing on the quality of the prediction models. More complex systems, such as industrial machines, require more interdisciplinary solutions to obtain better prediction results.
Typical machine learning algorithms that assume stationary data have difficulty with real-world variations in streaming data. Enormous volumes of data can be generated, necessitating distributed processing over time. The results show that the proposed model effectively predicts the energy consumption of industrial machines with concept drift difficulties. Instead of using a traditional static model, which does not provide acceptable prediction performance after a concept drift occurs, the MRA method detects concept drift points over time to maintain highly accurate prediction performance thanks to the dynamic approach. In addition, the results showed that instead of using fixed-size chunks, separating the data based on machine types and working conditions is more efficient for discovering concept drift points.
In future work, the MRA method could be applied to different industrial machines' time series to see whether there is an improvement in energy consumption prediction. Additionally, the dataset has one-hour time interval records, but different time interval records may boost the accuracy of the results. Finally, in this research, a DNN was used as a prediction model. However, several prediction methods, such as SVR or RF, can be integrated into the MRA method and possibly increase the prediction performance.

Appendix A
In Table A1, a DNN model was developed for each chunk of data separately, which is referred to as fully dynamic modeling. The results were shared as RMSE, MAE, and MAPE. Additionally, distinctive parameters of each DNN model's details are given in the last column. Figure A1 illustrates each chunk's actual and predicted energy consumption according to the chunk number. All x-axes illustrate the time index, and y-axes show the amount of energy consumption within an hour.