RobustSTL and Machine-Learning Hybrid to Improve Time Series Prediction of Base Station Traffic

Abstract: Green networking has become an urgent requirement for cellular network architecture. One measure that can be undertaken to fulfill this objective is a traffic-aware scheme for the base station, which controls the power consumption of the cellular network based on the number of demands. This requires an understanding of estimated traffic in future demands. Various studies have undertaken experiments to obtain network traffic predictions with good accuracy. However, dynamic patterns, burstiness, and various noises hamper the prediction model from learning the traffic data comprehensively. Therefore, this paper proposes a prediction model using deep learning with a one-dimensional convolutional neural network (1DCNN) and a gated recurrent unit (GRU). Initially, this study decomposes the network traffic data with RobustSTL, instead of standard STL, to obtain the trend, seasonal, and residual components. Then, these components are fed into the 1DCNN-GRU as input data. Through decomposition with RobustSTL, the hybrid 1DCNN-GRU model can fully capture the patterns and relationships of the traffic data. Based on the experimental results, the proposed model overall outperforms the counterpart models in the MAPE, RMSE, and MAE metrics. The predicted data of the proposed model can follow the patterns of the actual network traffic data.


Introduction
The implementation of green networking architecture nowadays attracts attention; that is, the base station architecture must be power saving. About 70% of the power in the cellular network infrastructure is consumed by base station units [1]. Hence, the first undertaking is to predict the future network traffic of the base station. A network traffic prediction can bring significant information for understanding traffic patterns [2]. Through such a prediction, the base station may actively control its power demands by reducing the bandwidth capacity during low traffic to lower the energy consumption of the base station. However, nonlinearity and intricacy in traffic data are the main issues in network prediction [3].
Generally, the effort of network traffic prediction is categorized into two schemes, i.e., model driven (parametric) and data driven [4]. The first scheme works based on the practicality of theoretical assumptions, such as the autoregressive integrated moving average (ARIMA) model, and the second scheme deals with machine learning by interpreting and learning the data, such as the artificial neural network (ANN). Various models have been applied in network traffic prediction studies. Zhang et al. [5] presented an improved long short-term memory (LSTM) with the wavelet transform to decompose the original internet network traffic. This model can successfully reduce the prediction error in the network traffic prediction problem. In Ref. [6], a method integrating fuzzy clustering and the weight exponential to improve LSTM and the adaptive neuro-fuzzy inference system (ANFIS) in series models was proposed. From the results, this method can increase the prediction accuracy rate by enhancing the reliability of the preprocessing stages. A wavelet neural network (WNN) with a seeker optimization algorithm (SOA) based on a dynamic adaptive search step was proposed to optimize the prediction accuracy by overcoming the poor local search and adaptive adjustment ability of traditional SOA [2]. This model can catch the trend of the traffic data signal with valid prediction accuracy, but it may not be robust for long-term prediction. Zheng et al. proposed a method for 4G network base station prediction by combining the ν-support vector regression (νSVR) algorithm with the optimization of a symbiotic organisms search (SOS) [7]. However, obtaining the optimal prediction result requires many experiments on input optimization.
However, the existing research on network traffic data faces challenges affecting the prediction accuracy. The main challenges are complicated characteristics and dynamic patterns. Zhang et al. [5] designed a network traffic prediction model by utilizing the wavelet transform to decompose the original data into multiple components with different frequencies as the input of the model. These components can bring significant trends of time granularity to learn the changing rules of the traffic. In Ref. [8], seasonal and trend decomposition using Loess (STL) was performed to address noise in the network traffic. STL decomposes the network traffic into seasonal, trend-cycle, and remainder components. These components are then utilized as input for the GRU model. Another study combined LSTM and Gaussian process regression (GPR) to accurately predict single-cell cellular traffic [9]. First, it extracts the periodic components of the data traffic using the Fourier transform, which are fed into the LSTM cell, while GPR predicts the residual random components. This method can successfully increase the prediction accuracy compared with the traditional method.
Although some models have achieved good prediction of network traffic with dynamic traffic patterns, they still need improvement in identifying abrupt changes, traffic burstiness, and outliers in future demands. Therefore, this study proposes a deep-learning method of base station traffic prediction that is reliable against traffic burstiness and outliers. To obtain a comprehensive pattern of the traffic load from the base station, this study first decomposes the original traffic data using the RobustSTL method, instead of standard STL. RobustSTL not only decomposes the seasonal, trend, and remainder (residual) components but can also further decompose the remainder component into a spike and white noise [10]. The resulting components are synchronously fed into a hybrid model of a one-dimensional convolutional neural network (1DCNN) and gated recurrent units (GRU). Through this hybrid model, the 1DCNN can catch and extract features of the components [11], and the GRU can capture the rules and relationships among the components to improve model performance [12]. Although a study in the load forecasting field has proposed 1DCNN-GRU [13], it does not utilize decomposition. We enhance their model by not only proposing the hybrid 1DCNN-GRU model but also utilizing RobustSTL decomposition in the beginning stage.
Through the above analysis, this paper focuses on designing a deep-learning method for base station traffic prediction by combining RobustSTL and 1DCNN-GRU. The primary contributions of this paper can be summarized as follows:
1. We propose a single traffic load prediction model of the base station based on RobustSTL and 1DCNN-GRU. This method can extract dynamic patterns of the traffic data and more accurately predict the base station traffic;
2. The main contribution is the hybrid model combining the decomposition of time series data using the RobustSTL technique, instead of standard STL, with 1DCNN-GRU;
3. The proposed model can provide a reference for sleeping control operation in the base station by estimating the future internet traffic.
The remaining sections of this paper are organized as follows. In Section 2, this paper gives a background on our materials and the methods used in this study. Section 3 presents the experimental results of base station traffic prediction and discusses these results. In Section 4, we put forward the conclusions of this study and highlight the directions of future research.

RobustSTL and 1DCNN-GRU
As mentioned above, we use the combined model of RobustSTL and 1DCNN-GRU to predict base station traffic. Figure 1 shows that a raw input is decomposed by RobustSTL to reveal the underlying insights of base station traffic. After obtaining the three decomposed components, i.e., the trend, seasonality, and residual components, these are simultaneously fed into the 1DCNN. The 1DCNN helps to better capture the traffic patterns and spatial features. The 1DCNN architecture here consists of convolution, pooling, and fully connected layers. The number of inputs to the 1DCNN model from one raw input is three, obtained from the decomposition result. Then, these 1DCNN inputs are fed to the convolutional layer, which has 32 filters. The outputs of the convolutional layer are carried to the pooling layer to minimize the dimension. The last layer in the 1DCNN is the fully connected layer, whose input comes from the pooling layer.

The outputs from the 1DCNN block are put into the GRU block. Through the GRU operation, the traffic characteristic and rules can be learned to improve the accuracy of the prediction result. The number of GRU cells is the same as the number of 1DCNN outputs. Then, the outputs of GRU are flattened with the fully connected layer to obtain the final output.

Standard STL
STL is a statistical method for decomposing time series data into three elements, i.e., seasonality, trend, and residual components. Trend is a systematic pattern that changes over time, does not happen repeatedly, and indicates the general tendency of the data to increase or decrease over a period. Seasonality is a component that changes repeatedly over time and represents data fluctuation. Meanwhile, the remainder is the non-systematic component besides the trend and seasonality components within the data. Suppose y_t is time series data; STL decomposes it into seasonality s_t, trend τ_t, and residual r_t [14] as follows:

y_t = s_t + τ_t + r_t

The standard STL decomposition consists of two loops, i.e., an inner loop and an outer loop [15]. The inner loop updates the seasonal and trend components in each pass. Then, the residual component is calculated. The outer loop adjusts the weights used by the Loess smoothing when an anomaly is detected.
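The additive model above can be illustrated with a deliberately simplified NumPy sketch: a centered moving average stands in for the trend and per-phase means for the seasonality, whereas real STL refines both with iterated Loess smoothing. The function name, toy series, and period are invented for illustration.

```python
import numpy as np

def naive_additive_decompose(y, period):
    """Simplified additive decomposition y_t = s_t + tau_t + r_t.
    Trend: centered moving average; seasonality: per-phase mean of the
    detrended data; residual: everything left over. (Standard STL refines
    the trend and seasonal estimates with iterated Loess smoothing.)"""
    y = np.asarray(y, dtype=float)
    trend = np.convolve(y, np.ones(period) / period, mode="same")
    detrended = y - trend
    # Average the detrended values at each position within the cycle.
    base = np.array([detrended[i::period].mean() for i in range(period)])
    seasonal = np.tile(base - base.mean(), len(y) // period + 1)[: len(y)]
    residual = y - trend - seasonal
    return seasonal, trend, residual

t = np.arange(240)                            # 10 days of hourly samples
y = 0.01 * t + np.sin(2 * np.pi * t / 24)     # linear trend + daily cycle
s, tau, r = naive_additive_decompose(y, period=24)
assert np.allclose(s + tau + r, y)            # components sum to the data
```

By construction the three components always sum back to the original series; STL's value lies in how robustly it separates them.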

RobustSTL
In standard STL, we assume that all components contained in the time series data besides trend and seasonality are residual components. In contrast, RobustSTL proposed by Wen et al. [10] and adopted in this study claims that residual components can be further extracted into two terms, i.e., spike and noise. Moreover, RobustSTL can accurately and precisely decompose the time series data despite containing a long seasonality and high noise [10]. This method can handle fractional and shifted seasonality components over a period.
Data collected from the management unit frequently contain various noises. To obtain the exact trend and seasonality components, such noises initially need to be removed. Here, RobustSTL proposes bilateral filtering to remove various noises. Then, RobustSTL utilizes least absolute deviations (LAD) with L1-norm regularizations to extract the trend component, and non-local seasonal filtering to obtain the seasonality component. Given time series data, Algorithm 1 shows the decomposition summary of the RobustSTL method.
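The bilateral filtering step can be sketched in NumPy as below. This is not the authors' implementation: the window size and the two bandwidths (sigma_d for temporal distance, sigma_i for value difference) are hypothetical parameters chosen for illustration.

```python
import numpy as np

def bilateral_filter_1d(y, half_window=2, sigma_d=1.0, sigma_i=1.0):
    """Edge-preserving denoising: each point becomes a weighted average of
    its neighbors, with weights that decay with temporal distance (sigma_d)
    and with difference in value (sigma_i), so abrupt level shifts survive."""
    y = np.asarray(y, dtype=float)
    out = np.empty_like(y)
    for t in range(len(y)):
        j = np.arange(max(0, t - half_window), min(len(y), t + half_window + 1))
        w = (np.exp(-((j - t) ** 2) / (2.0 * sigma_d ** 2))
             * np.exp(-((y[j] - y[t]) ** 2) / (2.0 * sigma_i ** 2)))
        out[t] = np.sum(w * y[j]) / np.sum(w)
    return out

# A noisy series with an abrupt level shift that should be preserved
noisy = np.array([1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 4.9, 5.0])
denoised = bilateral_filter_1d(noisy)
```

Unlike a plain moving average, the value-difference term keeps the jump from level 1 to level 5 sharp, which is exactly the property that lets the later stages recover an exact trend.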

Algorithm 1: RobustSTL decomposition summary.
Input: y_t, parameter configuration
Output: s_t, τ_t, r_t
1: Denoise the network traffic data y_t by bilateral filtering to obtain denoised data y'_t
2: Obtain the relative trend τ^r_t and apply y''_t = y'_t − τ^r_t to the denoised data
3: Perform seasonality extraction on y''_t using non-local seasonal filtering to obtain the s_t value
4: Obtain the trend, seasonality, and residual components
5: Repeat steps 1-4 to obtain a more accurate estimation

1DCNN
In the proposed model architecture, we introduce a one-dimensional convolutional neural network (1DCNN) as a component of the base station traffic prediction model. The 1DCNN can extract the morphological features of the traffic data to enhance the understanding of traffic patterns [11]. The 1DCNN architecture generally consists of a convolutional layer, a pooling layer, and a fully connected layer.

Convolutional Layer
This layer in the 1DCNN overcomes the regular neural network by faster convergence. It contains a set of time series maps, kernels, filters, strides, and neurons, and it connects each neuron to its neighboring neurons. A convolutional operation calculates the dot product between the corresponding convolution filters and the input time series maps. Through the kernels, it can learn the characteristics of the input maps. The time series input is first fed to the input layer. Then, the output of the convolution process is calculated as follows:

y^l_m = b^l_m + Σ_{i ∈ L_m} conv1D(w^{l−1}_{im}, ŷ^{l−1}_i)

where y^l_m and b^l_m are the output and the bias of the m-th neuron at layer l, w^{l−1}_{im} is the kernel weight of the i-th neuron at layer l−1, and ŷ^{l−1}_i is the input to the m-th neuron at layer l from the i-th neuron at layer l−1. L_m denotes a selection of input maps, and conv1D refers to the convolution operator.
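The convolution formula above can be sketched as a minimal NumPy forward pass. This is a simplified illustration, not the paper's implementation; the toy maps, kernels, and bias are invented.

```python
import numpy as np

def conv1d_layer(inputs, kernels, biases):
    """Compute y_m = b_m + sum_{i in L_m} conv1D(w_im, x_i): each output
    map m accumulates the valid 1-D convolution of every input map i with
    its kernel w_im, plus a per-map bias."""
    out_len = len(inputs[0]) - len(kernels[0][0]) + 1
    outputs = []
    for m, b in enumerate(biases):
        y = np.full(out_len, float(b))
        for i, x in enumerate(inputs):
            # np.correlate slides the kernel without flipping it
            y += np.correlate(x, kernels[i][m], mode="valid")
        outputs.append(y)
    return outputs

x_maps = [np.array([1.0, 2.0, 3.0, 4.0])]   # a single input map
kernels = [[np.array([1.0, -1.0])]]         # kernels[i][m]: 2-tap difference
out = conv1d_layer(x_maps, kernels, biases=[0.5])
# difference filter: [1-2, 2-3, 3-4] = [-1, -1, -1], plus bias 0.5
```

In the proposed architecture this operation is repeated for each of the 32 filters, so each filter learns a different local pattern of the decomposed traffic components.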

Max Pooling Layer
The pooling operation in the 1DCNN model refers to a resampling process, i.e., a transformation of multiple cells into one cell. This operation minimizes the computational cost while retaining the significant information, and it can also help avoid overfitting [16]. In this study, we select max pooling, which takes the maximum value of an array. Figure 2 shows an example max pooling operation on time series data. Initially, the time series map has 15 elements, and max pooling with a stride of three divides them into five groups, denoted by various colors. Then, we can obtain a smaller time series map that maintains the discriminant information through the following equation:

p = max(y_R)

where p is the output of the max pooling operation, and y_R denotes the elements of the corresponding pool area R.
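The operation is a one-line reshape in NumPy, assuming non-overlapping pools of size three as in the Figure 2 example; the toy series is invented.

```python
import numpy as np

def max_pool_1d(y, pool_size):
    """p = max(y_R): take the maximum of each non-overlapping pool area R."""
    y = np.asarray(y, dtype=float)
    usable = (len(y) // pool_size) * pool_size   # drop any ragged tail
    return y[:usable].reshape(-1, pool_size).max(axis=1)

# 15 elements with pool size 3 -> 5 outputs, mirroring the Figure 2 setup
series = np.array([2, 5, 1, 7, 3, 4, 0, 9, 6, 8, 2, 2, 5, 1, 3])
pooled = max_pool_1d(series, pool_size=3)
# → [5. 7. 9. 8. 5.]
```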

Fully Connected Layer
The fully connected layer receives the outputs of the pooling layer. First, they are typically flattened into a one-dimensional array, then connected to a layer where each input is linked to the outputs with a weight. This layer has a similar structure to a regular ANN, and a neuron in this layer can be calculated as follows:

u_k = Σ_{i=1}^{m} ω_{i,k} p_i + b_k

where u_k is the output value at the k-th neuron, ω_{i,k} is the weight of the i-th input at the k-th neuron, p_i is the i-th input value, m is the number of inputs, and b_k is the bias value at the k-th neuron. After we obtain the output of the fully connected layer, it is followed by an activation function, such as the rectified linear unit (ReLU) or a linear activation function.
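The neuron computation above reduces to a matrix-vector product followed by the activation; here is a minimal NumPy sketch with invented toy weights (the values of p, W, and b are illustrative only).

```python
import numpy as np

def fully_connected(p, W, b):
    """u_k = sum_i w_{i,k} p_i + b_k over the flattened pooling outputs p."""
    return np.asarray(p) @ np.asarray(W) + np.asarray(b)

def relu(u):
    """Rectified linear unit: max(0, u) applied elementwise."""
    return np.maximum(0.0, u)

p = np.array([1.0, -2.0, 0.5])        # flattened pooling-layer output
W = np.array([[0.2, -0.1],            # shape (m inputs, k neurons)
              [0.4,  0.3],
              [-0.6, 0.5]])
b = np.array([1.0, 0.0])
u = relu(fully_connected(p, W, b))
# u_1 = 0.2 - 0.8 - 0.3 + 1.0 = 0.1; u_2 = -0.1 - 0.6 + 0.25 = -0.45 -> 0.0
```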

GRU
The GRU architecture is a simpler model compared with the LSTM architecture because the GRU has no memory cells. It has a lower computational cost but the same level of performance as LSTM [17]. The GRU architecture consists of two gates, i.e., the update gate and the reset gate. The update gate determines which past information of the time series is remembered or forgotten for the current prediction. Then, the reset gate determines the amount of information to be remembered. Figure 3 depicts the architecture of the GRU.

Initially, the current input x_t and the previous output h_{t−1} are concatenated to form the input for the update gate and the reset gate. Furthermore, each value of the cells or components in the GRU can be obtained as follows:

z_t = σ(W_z · [h_{t−1}, x_t] + b_z)
r_t = σ(W_r · [h_{t−1}, x_t] + b_r)
h̃_t = tanh(W_h · [r_t ⊙ h_{t−1}, x_t] + b_h)

where z_t, r_t, and h̃_t are the outputs of the update gate, the reset gate, and the candidate hidden layer, and σ and tanh are the activation functions. Then, W_z, W_r, W_h and b_z, b_r, b_h are, respectively, the weight matrices and bias vectors of the update gate, the reset gate, and the candidate hidden layer. Finally, we can obtain the output h_t of the t-th GRU as follows:

h_t = (1 − z_t) ⊙ h_{t−1} + z_t ⊙ h̃_t
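A single GRU step following these gate equations can be sketched in NumPy as below. The weights here are random and purely illustrative (a real model learns them during training), and the dimensions are invented.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, Wz, Wr, Wh, bz, br, bh):
    """One GRU step following the gate equations above."""
    concat = np.concatenate([h_prev, x_t])        # [h_{t-1}, x_t]
    z = sigmoid(Wz @ concat + bz)                 # update gate z_t
    r = sigmoid(Wr @ concat + br)                 # reset gate r_t
    h_tilde = np.tanh(Wh @ np.concatenate([r * h_prev, x_t]) + bh)
    return (1.0 - z) * h_prev + z * h_tilde       # new hidden state h_t

hidden, features = 4, 3
rng = np.random.default_rng(0)                    # illustrative random weights
Wz, Wr, Wh = (rng.normal(size=(hidden, hidden + features)) for _ in range(3))
bz = br = bh = np.zeros(hidden)
h = gru_cell(rng.normal(size=features), np.zeros(hidden), Wz, Wr, Wh, bz, br, bh)
```

Because h̃_t passes through tanh and z_t through the sigmoid, the hidden state stays bounded, which is part of what makes the GRU stable over long traffic sequences.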

Preparation Stage
The dataset used in the experiment stage was obtained from the Italy cellular network activity dataset on Kaggle [19]. This dataset contains base station internet traffic from many locations, each represented by a cell id as an instance. Here, we take the internet traffic at three locations: cell 3350, cell 5060, and cell 5864. Figure 4 shows a closer view of the internet network traffic over an hourly period at the selected base stations.

Before we go into the training process, the training dataset is processed with normalization to improve stability, especially for datasets containing various noises [20]. The normalization formula of the Min-Max Scaler is presented in the equation below:

x' = (x − x_min) / (x_max − x_min)

where x is the original data, while x_min and x_max are the minimum value and the maximum value of the dataset. Then, x' is the normalized data.
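The Min-Max scaling can be sketched in a few lines of NumPy; the toy traffic values are invented for illustration.

```python
import numpy as np

def min_max_scale(x, x_min=None, x_max=None):
    """x' = (x - x_min) / (x_max - x_min), mapping the data into [0, 1]."""
    x = np.asarray(x, dtype=float)
    x_min = x.min() if x_min is None else x_min
    x_max = x.max() if x_max is None else x_max
    return (x - x_min) / (x_max - x_min)

traffic = np.array([120.0, 300.0, 180.0, 300.0, 60.0])  # toy hourly traffic
scaled = min_max_scale(traffic)
# → [0.25, 1.0, 0.5, 1.0, 0.0]
```

In practice, the minimum and maximum are usually fitted on the training split only and reused for the validation and test data, so that no information leaks from the evaluation sets into the model.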


Training Process
The input data window used in the proposed model to predict the next-hour traffic x_t is the 3 h of data before, i.e., x_{t−1}, x_{t−2}, and x_{t−3}. We set the historical data from 80 to 120 h to predict internet traffic activity from 10 to 25 h. In performing the training stage, we use 90% of the dataset for the training step and the remaining data for the testing step because the size of the dataset is not too big. For the validation set, we select 10% of the training dataset. We set the number of epochs to 1000 and use the mean square error (MSE) as the loss function in the training process. We also present the training and validation loss of each epoch to reflect the performance of our proposed model during the training process, as shown in Figure 5. The learning curve of the training and validation loss is high at the beginning. Then, it gradually decreases upon adding training examples, and it flattens. This means the proposed model is not overfitting.
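The sliding-window construction and the 90/10 split described above can be sketched as follows; the toy series stands in for the normalized hourly traffic, and the helper name is invented.

```python
import numpy as np

def make_windows(series, window=3):
    """Build supervised pairs [x_{t-3}, x_{t-2}, x_{t-1}] -> x_t."""
    X, y = [], []
    for t in range(window, len(series)):
        X.append(series[t - window:t])
        y.append(series[t])
    return np.array(X), np.array(y)

series = np.arange(10.0)              # stand-in for normalized hourly traffic
X, y = make_windows(series, window=3)
n_train = int(0.9 * len(X))           # 90% training / 10% testing, as above
X_train, X_test = X[:n_train], X[n_train:]
y_train, y_test = y[:n_train], y[n_train:]
```

Splitting by position rather than shuffling keeps the test set strictly in the future of the training set, which matters for time series evaluation.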

Evaluation Metrics
To evaluate the quality of the proposed model, we use the mean absolute percentage error (MAPE), root mean square error (RMSE), and mean absolute error (MAE) as evaluation metrics between the predicted data and the original data during the testing process, as follows [21]:

MAPE = (100/T) Σ_{v=1}^{T} |ad_v − pd_v| / ad_v
RMSE = sqrt((1/T) Σ_{v=1}^{T} (ad_v − pd_v)^2)
MAE = (1/T) Σ_{v=1}^{T} |ad_v − pd_v|

where ad_v and pd_v denote the actual data and the predicted data, while T is the amount of data.
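These three metrics translate directly into NumPy; the toy actual/predicted values are invented for illustration (the name pd_ avoids shadowing the common pandas alias).

```python
import numpy as np

def mape(ad, pd_):
    """Mean absolute percentage error, in percent."""
    ad, pd_ = np.asarray(ad, float), np.asarray(pd_, float)
    return 100.0 * np.mean(np.abs(ad - pd_) / np.abs(ad))

def rmse(ad, pd_):
    """Root mean square error."""
    ad, pd_ = np.asarray(ad, float), np.asarray(pd_, float)
    return float(np.sqrt(np.mean((ad - pd_) ** 2)))

def mae(ad, pd_):
    """Mean absolute error."""
    ad, pd_ = np.asarray(ad, float), np.asarray(pd_, float)
    return float(np.mean(np.abs(ad - pd_)))

actual = [100.0, 200.0, 400.0]      # toy actual traffic ad_v
pred = [110.0, 190.0, 400.0]        # toy predicted traffic pd_v
# mape = 100 * mean(0.1, 0.05, 0.0) = 5.0
```

Note that MAPE divides by the actual values, so hours with near-zero traffic can inflate it; RMSE penalizes large errors more heavily than MAE.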



Results and Analysis
To verify the proposed model, we present other benchmark models for comparison with our base station traffic prediction model. Here, the experimental stage is performed on the internet traffic of three base stations. The compared benchmark models are ARIMA, LSTM, LSTM-1DCNN, Wavelet-LSTM, and standard STL-GRU. We present ARIMA as the traditional model and LSTM as the single model in time series analysis. LSTM-1DCNN, as a hybrid model, initially takes the input maps from the dataset to the LSTM cell; the outputs are then utilized as input to the 1DCNN model. Then, the Wavelet-LSTM and standard STL-GRU models have a similar process to the proposed model: these models initially decompose the time series signal into several components, which are then used as input data.
Furthermore, the prediction result of the base station traffic is evaluated using MAPE, RMSE, and MAE. We perform the testing stage five times for each cell and calculate the average of the results. The decomposition of time series data proves beneficial for achieving accurate prediction, especially using RobustSTL. This decomposition method in our proposed model can extract dynamic patterns of the traffic data despite some burstiness and outliers. Figure 6 shows that the predicted data of the proposed model in the base station of cell 5060, as an instance, match the actual data with low error. The actual data in the red line and the predicted data in green have similar patterns.

Conclusions
Base station traffic prediction plays a vital role in green networking architecture by describing future demands as references for applying sleeping control. The main challenges in base station traffic prediction are the dynamic and complicated patterns. Besides, some burstiness and outliers appear in the traffic. The decomposition process can increase the understanding of traffic rules and relationships. The decomposition using RobustSTL indicates the optimal result compared to counterpart schemes in time series extraction. The combination of RobustSTL and 1DCNN-GRU generally obtains the optimal accuracy with the lowest values of the MAPE, RMSE, and MAE metrics in base station traffic prediction, outperforming the other models. The proposed model can detect noises and outliers of the traffic. For the next study, we hope the proposed model can be applied to a larger dataset with more features and a longer series to evaluate the proposed model more comprehensively in the experimental stage.

Data Availability Statement:
The data presented in this study are openly available in Kaggle (https://www.kaggle.com/andrewfager/analysis-and-modeling-of-internet-usage/data, accessed on 15 November 2021).

Conflicts of Interest:
The authors declare no conflict of interest.
