Electricity Load and Internet Traffic Forecasting Using Vector Autoregressive Models

This study investigates whether measured Internet traffic can serve as an input to short-term electricity demand forecasts. We believe our study makes a significant contribution to the literature, especially on short-term load prediction techniques, as we found that Internet traffic can be a useful variable in certain models and can increase prediction accuracy compared to models that omit it. In addition, we found that the prediction error could be further reduced by applying the multivariate VARX model, which adds exogenous variables to the vector autoregressive (VAR) model. The VAR-based model showed excellent forecasting performance compared with the univariate models, without resorting to the artificial neural network models that achieved high prediction accuracy in previous studies.


Introduction
As electricity demand grows globally, load demand forecasting has become an important factor in many aspects of energy production and delivery. The time horizons for forecasting are classified as short-, medium-, or long-term. Short-term load forecasting (STLF) refers to hourly forecasts, medium-term load forecasting (MTLF) to forecasts over a week to a month, and long-term load forecasting (LTLF) to forecasts over more than a year [1]. STLF is mainly used in the operational phase, while LTLF is used in the planning phase. Before information and communication technologies (ICT) and smart grids were developed, forecasting was based primarily on supply-side aggregated data, in top-down formats at large governmental levels. However, owing to the recent development of smart-grid technology, it has become possible to consider end-user demand through a bottom-up approach [2], which can now be applied to STLF. Thus, these technologies have expanded their roles by undertaking the responsibility of forecasting load demand from energy suppliers to consumers.
Summer and winter temperatures are becoming more extreme with rapid climate change, and demand is increasing because of the operation of energy-intensive devices such as air conditioners and heating appliances. In addition, load demand is increasing in buildings and parking lots, because of the surge in electric vehicle (EV) sales [3]. Furthermore, Internet traffic is continuously increasing because of the growing global popularity of smartphones and other Internet communication devices. The Internet makes it possible to find information, send emails, share photos and videos, manage bank accounts, and access home network devices remotely. This high demand can also be attributed to traffic delivery and data storage [4].
From a supplier's point of view, as renewable energy (RE) replaces energy produced from nuclear power, it has become more important to control supply and demand accurately [5]. However, energy supply uncertainty has become an issue, because RE increases energy supply variability according to factors such as season, temperature, precipitation, cloud cover, and wind speed. The changing pattern of supply and demand has a direct impact on power production, as well as on relative energy prices, power rate
Muzaffar and Afshari [27] studied long short-term memory (LSTM) networks, which are a special type of recurrent neural network, and applied them to learning the long-term dependencies in STLF. Global horizontal, direct normal, and diffuse horizontal irradiance, as well as temperature, humidity, and wind speed variables, were considered as potential exogenous variables. Only temperature was applied as a dependent variable, to reduce computational costs. It was shown that LSTM outperforms other methods, such as ARMA, SARIMA, and ARMA with exogenous variables.
Zhu et al. [28] proposed a new weather forecasting technique generated with the dry-bulb temperature profile, relative humidity, and global solar radiation. Then, some of the ranked influential factors were filtered. The final input variables were grouped and applied in an ANN model with back-propagation.
Reddy [29] proposed a Bat algorithm-based back-propagation approach for STLF, with weather factors such as temperature, humidity, and dew point; the best results were obtained in a case study considering temperature and humidity.
Morley et al. [30] suggested that understanding Internet traffic usage patterns may help in modeling electricity load demand, because Internet-connected equipment such as mobile networks, ICT-related devices, and PCs consumes electricity. This phenomenon has become more important as network-based infrastructures grow.
Kim [31] proposed Internet traffic forecasting models using an AR-GARCH error model with seasonal ARIMA models. This motivated our study to build various forecasting models considering Internet traffic data.
As outlined above, some of the common external variables used in these studies include weather and socio-economic variables. As smart grid technology quickly advances, electronic device usage data, as well as non-electronic data, such as meteorological or economic variables, can be easily accessed by region. Many attempts have been made to keep up with the technologies; however, at the time of writing, no clear studies have considered Internet traffic data to forecast load demand. In this study, we have adopted Internet traffic data as an external variable in an ARIMA-based model, and as a dependent variable in a vector AR with exogenous variables (VARX) model. Although AI-based models are widely used for producing accurate forecast results, it is difficult to draw inferences about the variables from them. Therefore, we demonstrate several representative statistical forecasting methods, and adopt them in a smart grid environment.
The contributions of this paper are presented as follows.
• The existing STLF for load demand is limited to considering only predictor variables such as weather, holidays, and weekends. Thus, we demonstrate the effectiveness of considering Internet traffic data as a dependent variable in a multivariate time series forecasting method, and also as an external variable in univariate methods.
• Moving-window prediction techniques were used in STLF to determine which models are superior at each k-step horizon, from the basic 15 min to 2 h forecasts, and whether the superior models remain robust across these time horizons.
The remainder of this paper is organized as follows. Section 2 introduces the models used in this study. Section 3 describes the data and analysis. Section 4 presents the performance evaluations. Section 5 concludes the paper.

Taylor's Double Seasonal Exponential Smoothing Method
Taylor [32] introduced an extended version of the Holt-Winters method with two seasonal cycles, to address multiplicative seasonality. The method also includes an adjustment for first-order autocorrelation in the residuals.
In this method, $y_t$ represents the actual value of demand, $S_t$ represents the seasonal component observed at time $t$ ($t = 1, 2, \ldots, T$), and $s_1$ and $s_2$ are the two seasonal cycles. The components $L_t$ and $T_t$ are the level and trend components of the series at time $t$, respectively. The coefficients $\alpha$, $\beta$, $\gamma$ and $\delta$ are smoothing parameters. $F_{t+h}$ is the predicted value $h$ steps ahead from time $t$.
The forecast $F_{t+h}$ of Taylor's method includes a correction term in which $\phi$ represents the adjusted first-order autocorrelation coefficient; the smoothing parameters are given by $\alpha$, $\beta$, $\gamma$, $\delta$, and $\phi$.
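As a rough illustration of how such a recursion operates, the following sketch implements the commonly cited multiplicative double seasonal form of the method. The names `d` and `w` for the two seasonal indices, the function name, and the crude initialization (level set to the mean of the first long cycle, zero trend, unit seasonal indices) are assumptions made for illustration and are not necessarily the formulation or initial values used in this study.

```python
def double_seasonal_forecast(y, s1, s2, alpha, beta, gamma, delta, phi, h=1):
    """h-step-ahead forecast from a multiplicative double seasonal
    exponential smoothing recursion with a first-order residual adjustment."""
    if len(y) < s2:
        raise ValueError("need at least one full long seasonal cycle")
    # Crude initial values (assumption, for illustration only).
    level = sum(y[:s2]) / s2
    trend = 0.0
    d = [1.0] * s1  # short-cycle indices, stored circularly by t % s1
    w = [1.0] * s2  # long-cycle indices, stored circularly by t % s2
    err = 0.0
    for t, obs in enumerate(y):
        d_old, w_old = d[t % s1], w[t % s2]
        # One-step-ahead error, reused for the autocorrelation adjustment.
        err = obs - (level + trend) * d_old * w_old
        new_level = alpha * obs / (d_old * w_old) + (1 - alpha) * (level + trend)
        trend = beta * (new_level - level) + (1 - beta) * trend
        d[t % s1] = gamma * obs / (new_level * w_old) + (1 - gamma) * d_old
        w[t % s2] = delta * obs / (new_level * d_old) + (1 - delta) * w_old
        level = new_level
    target = len(y) - 1 + h  # time index being forecast
    seasonal = d[target % s1] * w[target % s2]
    return (level + h * trend) * seasonal + (phi ** h) * err


# Toy usage with short cycles; for 15 min data, a day and a week would be
# s1 = 96 and s2 = 672.
flat = [10.0] * 32
forecast = double_seasonal_forecast(flat, s1=4, s2=8, alpha=0.3, beta=0.1,
                                    gamma=0.2, delta=0.2, phi=0.5, h=1)
```

On a perfectly flat series the seasonal indices stay at one and the forecast equals the level, which is a convenient sanity check before applying the recursion to real load data.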

Reg-ARIMA-GARCH Model
First, we introduce the basic ARIMA model. The ARIMA model has undergone various developments and was once a benchmark model for time series analysis and forecasting [33]. Once the stationarity assumption for the data is confirmed, a wide variety of time series can be described by different non-seasonal $(p, q)$ and seasonal $(P, Q)$ orders of ARIMA. When the series $\{y_t \mid t = 1, 2, \ldots, T\}$ follows ARIMA$(p, d, q)(P, D, Q)_s$ with a mean of $\mu$, the time series takes the form
$$\phi_p(l)\,\Phi_P(l^s)\,(1 - l)^d (1 - l^s)^D (y_t - \mu) = \theta_q(l)\,\Theta_Q(l^s)\,\varepsilon_t,$$
where $y_t$ represents the actual value of demand (in kilowatts) observed at time $t$ ($t = 1, 2, \ldots, T$), and $\varepsilon_t$ represents the random errors, assumed to be white noise with a mean of zero and a constant variance of $\sigma^2$; $p$, $d$ and $q$ are integers and orders of the model; $\phi_p(l) = 1 - \phi_1 l - \cdots - \phi_p l^p$, where $p$ denotes the degree of the non-seasonal autoregressive polynomial; $\theta_q(l) = 1 - \theta_1 l - \cdots - \theta_q l^q$, where $q$ is the degree of the non-seasonal moving average polynomial; for the seasonal operators, $\Phi_P(l^s) = 1 - \Phi_1 l^s - \cdots - \Phi_P l^{Ps}$, where $P$ denotes the degree of the seasonal autoregressive polynomial; and $\Theta_Q(l^s) = 1 - \Theta_1 l^s - \cdots - \Theta_Q l^{Qs}$, where $Q$ denotes the degree of the seasonal moving average polynomial. The terms $(1 - l)^d$ and $(1 - l^s)^D$ are the non-seasonal and seasonal difference operators of order $d$ and $D$, respectively; $s$ is a seasonal cycle.
Next, external variables are considered to explain the many factors that affect electricity load demand, including holidays, temperature, and socio-economic variables. Typically, climate-related variables are regarded as important factors, imposing high demand on electrical appliances such as heating systems in winter and air conditioning in summer. In this study, temperature and weekend and holiday indices were included as explanatory variables in the model.
The Reg-ARIMA model is a regression model with ARIMA error terms [34]. When the series $\{y_t \mid t = 1, 2, \ldots, T\}$ follows the Reg-ARIMA model with $k$ predictors, the time series takes the form
$$\phi_p(l)\,\Phi_P(l^s)\,(1 - l)^d (1 - l^s)^D \left( y_t - \sum_{i=1}^{k} \beta_i \chi_{ti} \right) = \theta_q(l)\,\Theta_Q(l^s)\,\varepsilon_t,$$
where $\beta_i$ is the coefficient of predictor $\chi_{ti}$. The basic ARIMA models rely on the assumption of constant variance. To account for changing volatility in the time series, Engle [35] proposed the autoregressive conditional heteroscedasticity (ARCH) model. Bollerslev [36] extended it to the generalized ARCH (GARCH) model, whose main feature is that it can handle data with heavier-tailed error distributions. The error term of the ARIMA-GARCH model is defined as
$$\varepsilon_t = \sigma_t z_t, \qquad \sigma_t^2 = a_0 + \sum_{i=1}^{s} a_i \varepsilon_{t-i}^2 + \sum_{j=1}^{r} b_j \sigma_{t-j}^2,$$
where $r$ and $s$ are the orders of the GARCH and ARCH processes, respectively; $a_0$, $a_i$ and $b_j$ are constants; $\varepsilon_t$ is the error term; $\sigma_t^2$ is the conditional variance of $\varepsilon_t$; and $z_t$ is a standardized error term.
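To make the role of the GARCH term concrete, the conditional variance recursion can be sketched as below. This assumes the standard GARCH form, in which the ARCH coefficients $a_i$ weight past squared errors and the GARCH coefficients $b_j$ weight past conditional variances; the function name, the toy coefficients, and the use of a fixed `initial_variance` for pre-sample values are illustrative assumptions.

```python
def garch_variance(residuals, a0, a, b, initial_variance):
    """Conditional variances sigma2[t] for a GARCH model.

    a: ARCH coefficients applied to past squared residuals.
    b: GARCH coefficients applied to past conditional variances.
    Pre-sample squared residuals and variances are replaced by
    initial_variance (an assumption of this sketch).
    """
    sigma2 = []
    for t in range(len(residuals)):
        value = a0
        for i, a_i in enumerate(a, start=1):
            past_sq = residuals[t - i] ** 2 if t - i >= 0 else initial_variance
            value += a_i * past_sq
        for j, b_j in enumerate(b, start=1):
            past_var = sigma2[t - j] if t - j >= 0 else initial_variance
            value += b_j * past_var
        sigma2.append(value)
    return sigma2


# GARCH(1, 1) with toy coefficients: a large shock at t = 1 raises the
# conditional variance at t = 2.
variances = garch_variance([1.0, -2.0, 0.5], a0=0.1, a=[0.2], b=[0.7],
                           initial_variance=1.0)
```

Dividing each residual by the square root of its conditional variance gives the standardized residuals $z_t$ on which diagnostic tests are usually run.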

VARX Model
Sims [37] introduced the VARX model, a method used to analyze the relationship between multivariate influencing variables. The model is a combination of several AR models, where these models form a vector between the variables affecting each other. The VAR model is a quantitative forecasting approach usually applied to multivariate time-series data.
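The VARX(p) model is defined as
$$y_t = \sum_{i=1}^{p} \Phi_i y_{t-i} + \sum_{i=0}^{s} \Theta_i^{*} x_{t-i} + \varepsilon_t,$$
where $y_t = (y_{1t}, y_{2t}, \ldots, y_{kt})'$ is a vector of multivariate time-series variables, and $x_t = (x_{1t}, x_{2t}, \ldots, x_{rt})'$ is a vector of exogenous variables; $p$ and $s$ denote the lag orders of the endogenous and exogenous variables, respectively; $\Phi_i$ and $\Theta_i^{*}$ are matrix coefficients; $y_t$ and $x_t$ are $(k \times 1)$ and $(r \times 1)$ column vectors, and $\Phi_i$ and $\Theta_i^{*}$ are $(k \times k)$ and $(k \times r)$ matrices, respectively; and $\varepsilon_t = (\varepsilon_{1t}, \varepsilon_{2t}, \ldots, \varepsilon_{kt})'$ is a noise process vector that has a zero mean and is independent over time.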
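A one-step-ahead VARX forecast is simply a sum of matrix-vector products. The sketch below assumes that only the contemporaneous exogenous vector enters the model (no lagged exogenous terms) and omits any intercept; the function name and the toy coefficient values are hypothetical and are not estimates from this study.

```python
def varx_one_step(phi, theta, y_history, x_t):
    """One-step-ahead forecast for a VARX model with p autoregressive lags.

    phi: list of p coefficient matrices, each k x k (lag 1 first).
    theta: k x r coefficient matrix for the exogenous vector x_t.
    y_history: list of past k-vectors, most recent last (length >= p).
    """
    k = len(theta)
    forecast = [0.0] * k
    for lag, phi_i in enumerate(phi, start=1):
        y_lag = y_history[-lag]
        for row in range(k):
            forecast[row] += sum(phi_i[row][col] * y_lag[col] for col in range(k))
    for row in range(k):
        forecast[row] += sum(theta[row][col] * x_t[col] for col in range(len(x_t)))
    return forecast


# Toy example: k = 2 endogenous series (e.g. log load and log traffic),
# r = 1 exogenous series (e.g. temperature), p = 1.
phi = [[[0.5, 0.1],
        [0.0, 0.4]]]
theta = [[0.2],
         [0.3]]
prediction = varx_one_step(phi, theta, y_history=[[2.0, 1.0]], x_t=[10.0])
```

Because both endogenous series are forecast jointly, the off-diagonal entries of each coefficient matrix are what allow past Internet traffic to inform the load forecast and vice versa.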

Electricity Load Data
The electricity load data were obtained from Chung-Ang University, Seoul, Korea. They were collected at 15 min intervals during the period from 20 April to 21 June 2019, for a total of 6048 data points. The total floor area of the buildings is approximately 182,730 m². The campus has 25 buildings comprising research facilities, administrative offices, classrooms, cafeterias, and dormitories. Figure 1a shows a general time series profile of the load data. The electricity load demand shows daily and weekly patterns. It is clear that the Monday through Friday demand is higher than that of the weekend. Demand also declines on national holidays. Figure 1b shows a time-series plot of the log-transformed data, which were used as the dependent variable instead of the original series to better satisfy the homoscedasticity assumption in the ARIMA-GARCH models and the VAR model.

Internet Traffic Data
The Internet traffic data were obtained from the same campus buildings, over the same period. However, they were collected at 5 min intervals. The data were aggregated into 15 min intervals to ensure comparability with the electricity load variable. Figure 2a shows the time series plot of the Internet traffic data. It shows cyclic patterns for the days and weeks, with clearer differences between weekdays and weekends than in Figure 1a. The series was also log-transformed, as shown in Figure 2b. The data were used as an exogenous variable in the Reg-ARIMA-GARCH models, and as a dependent variable in the VAR model.
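The aggregation from 5 min to 15 min intervals amounts to combining every three consecutive readings. The sketch below sums them, which suits volume-type measurements; whether the readings were summed or averaged in this study is not stated, so the choice of summation and the function name are assumptions.

```python
def aggregate_to_15_min(values_5_min, factor=3):
    """Combine consecutive 5 min readings into 15 min totals."""
    if len(values_5_min) % factor != 0:
        raise ValueError("series length must be a multiple of the factor")
    return [sum(values_5_min[i:i + factor])
            for i in range(0, len(values_5_min), factor)]


# Six 5 min readings become two 15 min totals.
totals = aggregate_to_15_min([1, 2, 3, 4, 5, 6])
```

After aggregation, both series share the same 96 observations per day, so they can be aligned row by row before log-transformation.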

Temperature Data
Weather variables have been widely studied as important variables that may have a great impact on electricity load demand. A positive correlation exists between temperature and demand during summer, because of the increased use of air conditioning. However, demand also rises as temperatures fall during winter, because of the use of heating appliances.
Thus, the relationship between temperature and demand is usually negative in winter, in contrast to summer. Therefore, heating and cooling degree day indices are typically derived over a half year, to explain such opposite directions of correlation. However, the data used in this study cover April through June (spring in Korea). The original temperature data were therefore considered appropriate for use as an exogenous variable. The data were obtained from the Korea Meteorological Administration and used as predictor values in the Reg-ARIMA-GARCH models and the VARX model.

Special Days
To fit the different patterns in demand on weekends and holidays, dummy variables for these days were created. These were applied in the Reg-ARIMA-GARCH models and the VARX model as predictor variables.
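Dummy variables of this kind can be derived directly from the observation timestamps. In the sketch below, the function name is hypothetical and the holiday set is a placeholder supplied by the caller; it is not the holiday calendar used in this study.

```python
from datetime import date, datetime


def special_day_dummies(timestamps, holidays):
    """Return (weekend, holiday) 0/1 indicator lists for each timestamp."""
    weekend = [1 if ts.weekday() >= 5 else 0 for ts in timestamps]  # Sat or Sun
    holiday = [1 if ts.date() in holidays else 0 for ts in timestamps]
    return weekend, holiday


# Placeholder holiday entry, assumed purely for illustration.
example_holidays = {date(2019, 4, 22)}
stamps = [datetime(2019, 4, 20, 0, 15),   # a Saturday
          datetime(2019, 4, 22, 0, 15)]   # a Monday
weekend_flags, holiday_flags = special_day_dummies(stamps, example_holidays)
```

Keeping the weekend and holiday indicators as separate columns lets each receive its own coefficient in the regression part of the model.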

Data Analysis
The 6048 data observations (9 weeks) were divided into 7 weeks of training data, with the remaining 2 weeks used for validation. In this study, moving window forecasting methods were considered, and the optimal number of parameters, at each k step, was identified according to the Akaike information criterion for the ARIMA-based models, and to the Schwarz criterion (SC) for the VAR model. Thus, the models are recursively re-estimated on each training window before forecasting. Tables 1-4 present examples of the estimated parameters and the results of the assumption checks in the training set.
Table 3. Parameter estimations of the ARIMA(3, 0, 1)(0, 1, 0)$_{s=96}$-GARCH(1, 1) model with temperature, weekend, and holiday variables.
Table 1 indicates the estimated coefficients for Taylor's double seasonal exponential smoothing method. We took the double seasonal cycles to describe a day ($s_1 = 96$) and a week ($s_2 = 672$).
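The split and the moving-window evaluation can be sketched as follows. The forecaster used here is a seasonal naive placeholder (it repeats the value observed one day before the target time) standing in for the models of this study, and no order selection or re-estimation is performed; all function names are assumptions for illustration.

```python
STEPS_PER_DAY = 96  # 15 min observations per day


def split_train_validation(series, train_weeks=7):
    """Split the series into the first train_weeks weeks and the remainder."""
    n_train = train_weeks * 7 * STEPS_PER_DAY
    return series[:n_train], series[n_train:]


def seasonal_naive(window, k):
    """Placeholder model: value observed one day before the target time."""
    return window[len(window) - 1 + k - STEPS_PER_DAY]


def moving_window_forecasts(series, n_train, k, forecaster):
    """Slide a fixed-length window forward one step at a time and collect
    (actual, forecast) pairs for the k-step-ahead target."""
    pairs = []
    for origin in range(n_train, len(series) - k + 1):
        window = series[origin - n_train:origin]  # last observed: origin - 1
        pairs.append((series[origin + k - 1], forecaster(window, k)))
    return pairs


# Synthetic series with a perfect daily cycle, 9 weeks long.
toy = [float(i % STEPS_PER_DAY) for i in range(6048)]
train, validation = split_train_validation(toy)
pairs = moving_window_forecasts(toy, len(train), k=8, forecaster=seasonal_naive)
```

Replacing `seasonal_naive` with a function that refits a model on each window would turn this into the recursive scheme described above, at the cost of one estimation per window.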

The residuals from ARIMA-fitted values were checked for heteroscedasticity in the case of the basic ARIMA models: (1) without any predictor variables; (2) with temperature and special-day variables; and (3) with temperature, special-day, and Internet traffic variables. Although the Ljung-Box Q-statistics showed no significant autocorrelation in the standardized residuals of the ARIMA models ((1) p = 0.5735, (2) p = 0.5551, (3) p = 0.7010), the same test on the squared standardized residuals indicated heteroscedasticity ((1) p < 0.0001, (2) p < 0.0001, (3) p < 0.0001). To confirm the presence of ARCH effects, Engle's Lagrange multiplier tests were additionally conducted. The tests confirmed that the volatility needs to be fitted by a GARCH term in the ARIMA-based models. Tables 2-4 show the estimated coefficients from the ARIMA-GARCH model for the same cases as the ARIMA models. Here, the coefficients indicate how much each exogenous variable affects the demand. For example, Table 4 shows that more demand was observed as temperature or Internet traffic increased. On the other hand, less demand was observed on weekends and holidays.
Rather than considering the Internet traffic data as one of the input variables in the models, we tried to forecast the electricity load demand and the Internet traffic demand jointly using the VAR model. Before fitting the model, the augmented Dickey-Fuller (ADF) test was conducted to determine if the main dependent variables had a unit root. The log-transformed series were used for the test, with trends and intercepts included for both series. The optimal lag length was automatically selected based on the SC. Given that Table 5 indicates that the two series were stationary, there is no need to perform a further Johansen's cointegration test. To confirm stationarity, the ADF test was repeated with an intercept but no trend; again, the null hypothesis of non-stationarity was rejected for both series. Therefore, the VARX model was deemed an appropriate method. Table 6 shows the estimated coefficient matrix from the VAR model with temperature and special-day variables.
Table 6. Parameter estimations of VARX(4,0) model.
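The Ljung-Box Q-statistic referred to above can be computed directly from sample autocorrelations, $Q = n(n+2)\sum_{k=1}^{m} \hat{\rho}_k^2 / (n - k)$. The sketch below returns only the statistic; the p-value would come from a chi-square distribution and is omitted here. The function name and the choice of lag are assumptions. Passing squared standardized residuals instead of the residuals themselves gives the check for remaining heteroscedasticity.

```python
def ljung_box_q(x, max_lag):
    """Ljung-Box Q-statistic for autocorrelation up to max_lag."""
    n = len(x)
    mean = sum(x) / n
    dev = [value - mean for value in x]
    denom = sum(d * d for d in dev)
    total = 0.0
    for lag in range(1, max_lag + 1):
        # Sample autocorrelation at this lag.
        rho = sum(dev[t] * dev[t - lag] for t in range(lag, n)) / denom
        total += rho * rho / (n - lag)
    return n * (n + 2) * total


# A strongly alternating series yields a large Q at lag 1.
alternating = [1.0, -1.0] * 50
q_stat = ljung_box_q(alternating, max_lag=1)

# For a volatility check, square the standardized residuals first.
residuals = [0.5, -1.2, 0.3, 2.0, -1.8, 0.1, 0.4, -0.2]
q_squared = ljung_box_q([r * r for r in residuals], max_lag=2)
```

A small Q on the residuals together with a large Q on their squares is the pattern that motivates adding a GARCH term.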

Performance Evaluations
This section compares the various models using the mean absolute percentage error (MAPE) and the root-mean-square error (RMSE). These measures are widely used to evaluate model performance, especially for STLF.
MAPE is defined as
$$\mathrm{MAPE} = \frac{100}{n} \sum_{t=1}^{n} \left| \frac{y_t - \hat{y}_t}{y_t} \right|,$$
where $y_t$ is the actual value, $\hat{y}_t$ is the forecasted demand at time $t$, and $n$ is the number of forecasts evaluated. The equation of RMSE is given by
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{t=1}^{n} \left( y_t - \hat{y}_t \right)^2}.$$
Here we also obtained the accuracy results for the Internet traffic from the VAR model, but given that the main purpose of our study is to forecast electricity load demand, we only discuss the results for the power demand. Table 7 presents the MAPE results in the validation set at k steps ahead. It shows that the VARX model is superior to the other models at all steps. The second-best model was the ARIMA-GARCH model (3), with temperature, special-day, and Internet traffic variables; it showed higher accuracy than the other ARIMA-GARCH models that did not consider Internet traffic values as an input.
Table 8 shows the validation RMSE values; the performance of the VARX and GARCH-based models showed the same patterns as those for the MAPE. However, when comparing the exponential smoothing method to ARIMA model (1), without any predictor variables, the ARIMA model performed better than Taylor's model. That is, ARIMA models are preferred for univariate datasets.
The results were further stratified by day type (Figures 3 and 4) and by quarter-hour (Figures 5 and 6), reporting the MAPE and RMSE for the 15 min (k = 1) and 2 h (k = 8) forecasts, respectively. Here, we only compare three representative models: Taylor's exponential smoothing method, ARIMA-GARCH (3), and VARX models; and we assume four variables were available: temperature, special day, Internet traffic, and electricity load demand.
Figure 3 presents accuracy plots categorized by day type for 15 min forecasting. Special days were excluded from the day type stratification because there were no holiday seasons in the test set period. The VAR model shows the lowest error regardless of the day type, in terms of MAPE and RMSE. However, forecasts on weekdays were less accurate in the ARIMA and VAR models, while the GARCH model shows the opposite.
Figure 4 shows the accuracy plots by day type for 2 h forecasting. It shows similar patterns to those of the 15 min forecasting, but the VAR model shows less accuracy in the weekday results. If the forecasting horizon is very short (k = 1), the VAR model is recommended. However, if the horizon is short (k = 8), the ARIMA-GARCH model is worth consideration.
Figure 5 shows accuracy plots categorized by quarter-hour for 15 min forecasting. Note that the x-axis sequence of 1 to 96 corresponds to 00:15 a.m. to midnight. Forecasts prove less accurate between 9:00 a.m. and 10:45 a.m. (x = 32-39), when the morning classes begin. Although the GARCH and VAR models perform better in the afternoon, the ARIMA model shows continuously poor results until night. The VAR model generally outperforms the other models, but the accuracy of the GARCH model diminishes again between 07:00 p.m. and 10:30 p.m. (x = 72-86).
Figure 6 shows the accuracy plots for 2 h forecasting. The ARIMA and GARCH models show similar patterns to the 15 min forecasting. However, the performance of the VAR model is poor, as it is best suited to very short-term forecasting. As seen in Figure 4, the ARIMA-GARCH model provides higher accuracy than the VAR model.
Figure 7 presents the actual values on the day after the national holiday in the validation set, together with the predicted values from each model. The 15 min (k = 1) forecasts do not differ much in general, but Taylor's model underestimated the level. In the 2 h (k = 8) forecasts, Taylor's model also underestimates the actual values considerably more than the other models. We assume the main reason for this is that Taylor's model cannot incorporate exogenous variables such as a special-day indicator.
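The two accuracy measures used throughout this section are straightforward to compute from paired actual and forecast values. The following is a minimal sketch; the function names are illustrative, and the MAPE is expressed in percent and assumes no actual value is zero.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error, in percent."""
    ratios = [abs((a - f) / a) for a, f in zip(actual, forecast)]
    return 100.0 * sum(ratios) / len(ratios)


def rmse(actual, forecast):
    """Root-mean-square error, in the units of the series."""
    squared = [(a - f) ** 2 for a, f in zip(actual, forecast)]
    return math.sqrt(sum(squared) / len(squared))


# Toy values in kW.
observed = [100.0, 200.0]
predicted = [110.0, 190.0]
mape_value = mape(observed, predicted)
rmse_value = rmse(observed, predicted)
```

Because MAPE is scale-free while RMSE penalizes large errors more heavily, the two measures can rank models differently, which is why both are reported.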

Concluding Remarks
Accurate STLF is a critical issue for decision makers and power generation companies in terms of policy making and development planning. Thus, many attempts have been made to improve the performance of electricity load prediction. This study examined the relevant time series methods for short-term forecasting of electricity load demand through 15 min to 2 h time horizons, in an institutional campus in Seoul. Taylor's double seasonal exponential smoothing methods, ARIMA-GARCH models, and the VARX model were used for optimization. In this study, these models provided the lowest MAPEs and RMSEs from 15 min (k = 1) to 2 h (k = 8) forecasting.
The results show that the VAR model is superior to the univariate models at all steps. Taking the indirect variable as another dependent variable, rather than applying it as an input, provided high accuracy as well as the time efficiency of a single multivariate model. However, caution is needed when using the VAR model: the series must be checked for stationarity and, if they are not stationary, a further cointegration test is required. Sometimes a cointegrated relationship appears in the same variables with longer data sets at a lower frequency. If this is the case, the vector error correction model is considered the appropriate method. The strength of the evidence for a relationship between the variables is known to depend on the length and time unit of the datasets.
The second-best model was the ARIMA-GARCH with Internet traffic, temperature and special-day predictors. It demonstrated that Internet traffic data are useful as input values, even in univariate models. Fitting the volatility with a GARCH term in the ARIMA models did not always improve the results across all steps, even though the ARCH effects tests indicated heteroscedasticity in the data. However, the data in this study were appropriate for STLF when fitted with GARCH models that include the Internet traffic usage data.
In buildings that do not offer Internet traffic data, it is worth considering finding a potential dependent variable in a multivariate model such as VARX.
The results demonstrated that weather and holiday characteristics have an impact on demand forecasting. However, even if the external variables are appropriate, the accuracy varies depending on whether the model fits the volatility in the data. Although the best-fitted model was the VARX model using electricity load demand and Internet traffic data as multiple dependent variables, the other models still offer great insights for considering explanatory factors. In addition, the VARX model is fast and time-effective.
Further, we discuss the model performances in depth by stratifying day types and quarter-hour of the days, to compare ARIMA, ARIMA-GARCH, and VAR models with exogenous variables. We show that the forecasts degrade over the time horizon, and the VARX model is not universally superior to other models.
In this study, we mainly aimed to compare the performance of the exponential smoothing methods, ARIMA-GARCH models, and VARX models. However, other approaches, such as SVM models, fuzzy models, and Kalman filters, will be examined in future studies.
Other future studies may set the goal of building an optimal and customized forecasting model for each single unit/building, according to building size, age, and type of external wall (for smaller units).

Data Availability Statement:
The data have been collected from the Office of Information and Communication Technology, Chung-Ang University.

Conflicts of Interest:
The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript, or in the decision to publish the results.