## 1. Introduction

As non-conventional renewable power sources increase their contribution to the energy supply mix, transportation is becoming the main source of CO$_{2}$ emissions. The road transport sector accounts for about 25% of greenhouse gas (GHG) emissions. Several long- and short-term policies have been adopted to mitigate GHG emissions, ranging from increasing fuel efficiency to using alternative fuels to power vehicles. For instance, the Paris Agreement, proposed by the United Nations (UN) and in force since November 2016, has as its main objective to keep the increase in global average temperature below two degrees Celsius above pre-industrial levels. The transport sector is called to play an essential role in decarbonizing the energy system. The most recent trends lean toward sustainable road transportation through the electrification of propulsion systems [1], which is the most promising way to fulfil the climate change goals set by the UN. Recent stringent regulations and customers’ demand for high fuel economy have accelerated the development of alternative powertrain solutions, especially electric vehicles (EVs) [2]. EVs are more efficient than traditional cars with internal combustion engines (ICEs), and even natural gas-powered vehicles, because they have fewer moving parts and can be charged from any source of electricity [3]. Thus, their massive use may help reduce GHG emissions. Consequently, several policies have been implemented worldwide to increase the adoption of EVs [4]. As a result, 750 thousand new EVs were sold in 2016, and there are now more than 2 million EVs on the roads [5].

Despite the numerous barriers EVs face, their massive adoption worldwide is now possible thanks to several catalysts, including sustained growth in renewable energy, falling battery prices, and improvements in battery technology [6]. Electricity from renewable sources, including solar photovoltaic (PV) and wind, among others, is at its lowest price in decades [7]. Likewise, battery costs have reached their lowest level in history, 85% cheaper than in 2010, and average prices are forecast to fall by up to 90% relative to 2010 prices by 2023 [8]. These reductions are due to increasing order sizes and the market uptake of battery electric vehicles. Therefore, EV users should take advantage of these reductions to charge their vehicles from green sources and thus achieve a truly environmentally friendly way to decarbonize the transportation sector. Nevertheless, a significant share of EV users still charge their vehicles from the conventional power grid, so their impact on distribution grids poses an essential challenge [9,10,11,12].

Over the last decades, several approaches have been developed to solve problems related to load balancing and capacity planning under large-scale EV penetration [13]. For instance, demand-side response (DR) provides optimal charging schemes that strategically shift EV demand to the periods with the lowest electricity prices [14]. However, despite the high performance and accuracy of these strategies, many factors have still not been taken into account when modelling users’ charging habits. For instance, EV charging loads are sensitive to the seasons, yet seasonal factors have not been widely considered in recent studies. It is therefore necessary to analyze the factors that can affect owners’ charging habits. Zhao et al. proposed a strategy for domestic EV charging loads that explains the key factors affecting users’ charging behavior, taking seasonal factors into account [15]. Boston and Werthman analyzed the charging and driving behavior of Ford plug-in electric vehicles with real-world data. Their findings show that electricity prices, local charging infrastructure, and seasonal weather changes may affect driving behavior and, in turn, charging habits [16]. Ul-Haq et al. used a stochastic approach to model EV charging patterns associated with residential load, considering several cases in which seasonal factors were included [17]. Their results indicate that charging habits change across seasons.

With the rapid growth of EVs, several concerns arise regarding their power requirements. A potential solution to prevent unexpected load demand is load forecasting [18,19,20,21,22]. Early work on EV load forecasting was based on the charging behavior of EV owners [23]. Markov chains and Monte Carlo (MC) simulations are methods that describe holiday and seasonal periods well [24]. For instance, Wang et al. used MC algorithms to forecast EV charging load based on charging frequency [25]. Gerossier et al. modelled charging habits to obtain probabilistic forecasts of the aggregated load. They also evaluated the impact of EVs on the grid by 2030, assuming that future charging behavior follows current habits [26]. However, load prediction remains uncertain owing to the overdispersed nature of the data. Other studies have focused on the spatial and temporal distribution of demand at EV charging stations (CS) [27,28,29]. Currently, advances in machine learning and deep learning techniques greatly aid in modelling the fluctuation of CS demand.

The rapid uptake of EVs worldwide has produced a large amount of charging data, and machine learning methods represent a potential way to deal with it. Several methods have been used to forecast EV charging load, such as support vector machines (SVM) [30], recurrent methods including long short-term memory (LSTM) neural networks [31], and extreme gradient boosting (XGBOOST) [32]. However, data are too scarce to capture the randomness of EV load, which is a significant obstacle for most time series prediction algorithms. That randomness stems from several stochastic factors, including weather conditions and occupant behavior [33,34]. Since these factors considerably affect electricity load forecasting, their contribution has been studied through correlation analysis based on non-parametric residual data from autoregressive (AR) models [35] and through behavior surveys [36]. Recent studies support this. For instance, Amara et al. developed an approach to model the residual load for household short-term load forecasting (STLF) through an Adaptive Circular Conditional Expectation (ACCE) combined with linear models (LM) [35]. Wang et al. modelled customer behavior for effective load forecasting by using Sparse Continuous Conditional Random Fields (SCCRF) and further applying hierarchical cluster analysis [37].

Unlike the methods listed above, random forest (RF) models have been widely applied to user-side power load prediction [38]. RF can provide more accurate and more stable results than SVM in load prediction [39]. RF is an ensemble method based on several decision trees and has been widely shown to be a powerful approach for forecasting electrical load [40]. Therefore, in this work, we used RF for user-side demand forecasting, and we used quasi-Poisson models (QPM) to deal with possible overdispersion in the data.

This work considers seasonal factors to perform short-term load forecasting (STLF) and to discriminate EV demand among seasons. The research question addressed by this work is: do the seasons shape the way users charge their vehicles? Our approach differs from related ones in the literature mainly in its use of both regression and classification, since most studies rely on one or the other but not both. Moreover, most studies in this regard only consider information from the vehicle or charging station side; we therefore added the average temperature to the variables analyzed. This study trained and tested not only multiple classifiers (12) capable of discriminating charging behaviors among seasons, but also several (3) regression models to forecast energy consumption under various scenarios. In addition, an analysis of the effect of seasons on charging behavior was carried out using principal components, both to provide a visual representation of the data in a two-dimensional space and to analyze the relationships among variables and their contributions. Afterwards, a feature importance technique was used to evaluate the importance of each variable. Finally, a statistical analysis based on analysis of variance (ANOVA) and the Chi-square test is provided to test, at a 95% confidence level, the hypothesis that EV load and charging time vary with seasonal factors.

The results of this work can aid in planning power distribution systems at large scale and can provide valuable information for supply strategies, such as demand response, in practical applications.

The structure of this paper is as follows: Section 2 presents the methodology used for exploratory analysis, feature importance, classification and regression procedures, as well as the metrics used for their evaluation. Then, Section 5 presents the results. Finally, Section 6 concludes the paper.

## 4. Case Analysis

This section presents aggregated data from the 44 city-operated public charging stations in Boulder, Colorado, from 1 January 2018 to 28 February 2019, as presented in [41]. It also presents the results of an exploratory analysis and a feature importance analysis of the dataset, as well as an analysis per season. Finally, the classification and forecasting models, together with their performance metrics, are listed.

#### 4.1. EVs Charging Stations in Boulder

The Colorado market for EVs has grown rapidly, from 20 vehicles in 2011 to more than 3100 in early 2014. According to the ZEV sales dashboard, by mid-2017 there were 10,930 EVs, more than three times the number in 2014 [42]. The state is well prepared to meet the challenge of electrifying its road transport. For instance, the local government has proposed several well-defined strategies, including charging infrastructure based on mature renewables and vehicle grant programs, among others. With these and several other strategies, the state has emerged as one of the top ten EV markets in the USA [43,44]. This trend is expected to continue, with a projected 940,000 EVs in Colorado by 2030. In particular, Boulder surpasses California in the number of registered EVs per 1000 residents (2.6 vs. 1.4). The city has promoted EV usage, and thus charging stations, since 2010 through local projects such as The Boulder SmartGrid Plug-In Electric/Hybrid Vehicles [45].

Figure 1 illustrates the distribution of the existing charging stations in Boulder. The numbers inside the colored markers indicate the number of charging stations in each zone. Note that there were 44 stations in the city during the time horizon considered in this study; to date, the number of stations has more than doubled.

From the information above, EV load demand is growing rapidly, and Figure 2 illustrates its behavior over the last year. This trend is expected to continue and therefore represents an important challenge, as not all of the CS in Boulder are powered by renewable sources.

#### 4.2. Exploratory Analysis

We used principal component analysis (PCA) to provide a visual representation of the data. Before PCA, the data were grouped by the number of charging sessions. Only the first two principal components (PCs) were retained, as together they represent 72.2% of the information. Figure 3 shows the daily observations in the new orthogonal space spanned by these two PCs. On the one hand, the points on the plot are the observations, and their colour reflects the number of charging sessions. On the other hand, the level of contribution of each variable is highlighted with colours, with red being the maximum. The larger the contribution value, the more the variable contributes to the selected principal components.

Thus, the variable that provides the least information is CTM, followed by ACGHC. As expected, most low-frequency charging sessions (red points) lie in the third quadrant. This location corresponds to low values of most variables, since it lies opposite to their direction of growth. The same pattern holds for the remaining charging-session groups: more charging sessions imply higher energy consumption and more ports used. The opposite growth directions of CTM and TEMP suggest that they may have an inverse relationship.
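The following is a minimal sketch of how the share of information captured by the first two PCs can be computed. It is not the implementation used in this study: it relies only on the Python standard library, it finds the leading components by power iteration with deflation rather than a full eigendecomposition, and the toy rows and the function names are hypothetical.

```python
import math
import random


def standardize(rows):
    """Center each column and scale it to unit sample variance."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
            for j in range(d)]
    return [[(r[j] - means[j]) / stds[j] for j in range(d)] for r in rows]


def covariance(rows):
    """Sample covariance matrix of rows that are already centered."""
    n, d = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / (n - 1) for b in range(d)]
            for a in range(d)]


def top_components(cov, k=2, iters=500, seed=0):
    """Leading k (eigenvalue, eigenvector) pairs of a symmetric matrix,
    found by power iteration followed by deflation."""
    rng = random.Random(seed)
    d = len(cov)
    mat = [row[:] for row in cov]
    pairs = []
    for _ in range(k):
        v = [rng.random() + 0.1 for _ in range(d)]
        for _ in range(iters):
            w = [sum(mat[a][b] * v[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm == 0:
                break
            v = [x / norm for x in w]
        # Rayleigh quotient gives the eigenvalue for the converged vector.
        lam = sum(v[a] * sum(mat[a][b] * v[b] for b in range(d))
                  for a in range(d))
        pairs.append((lam, v))
        # Deflate: remove the component just found before the next pass.
        mat = [[mat[a][b] - lam * v[a] * v[b] for b in range(d)]
               for a in range(d)]
    return pairs


def explained_variance_ratio(cov, pairs):
    """Fraction of total variance captured by each retained component."""
    total = sum(cov[j][j] for j in range(len(cov)))
    return [lam / total for lam, _ in pairs]


def project(rows, pairs):
    """Coordinates of each observation in the space of the retained PCs."""
    return [[sum(x * c for x, c in zip(r, vec)) for _, vec in pairs]
            for r in rows]


# Hypothetical daily observations (three made-up variables per day).
toy = [[1, 2, 5], [2, 4, 3], [3, 6, 6], [4, 8, 2], [5, 10, 4]]
z = standardize(toy)
cov = covariance(z)
pcs = top_components(cov, k=2)
print(explained_variance_ratio(cov, pcs))
print(project(z, pcs)[0])
```

The sum of the two ratios plays the role of the 72.2% figure reported above, and the projected coordinates are what a two-dimensional PCA plot displays.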

Figure 4 shows a heatmap of daily KWH versus CTM and NOP. It allows the energy demand of the CS to be observed across weekdays and weekends, making critical days in terms of energy demand visible. For instance, most of the ports are used on Monday, Wednesday, and Thursday. Additionally, most EV users took 320 min or more to charge their vehicles from Thursday to Friday.

#### 4.3. Feature Importance

The aim of this work is the classification and forecasting of energy consumption depending on the season. However, among the multiple variables contained in the dataset, some contribute more than the rest. To get a better grasp of the importance of each variable in predicting energy consumption, we applied the Boruta feature importance algorithm. This method belongs to the family of wrapper methods, which evaluate each subset of features created with a given resampling method combined with an ensemble model (in this case, Random Forest). The Boruta algorithm works as follows:

First, it duplicates the dataset and shuffles the values in each column of the copy. These shuffled copies are called shadow features. Then, it trains a classifier on the extended dataset. In this way, the model gives an idea of the importance of each feature in the dataset through its effect on accuracy. The higher the score, the more important the feature.

Then, the algorithm checks whether each real feature is important. Each feature is evaluated through its Z-score, i.e., the number of standard deviations a data point lies from the mean. The importance of each feature depends on whether it has a higher score than the maximum score of the shadow features. Features that do are taken into account; these are called hits. Next, another iteration is performed. After a predefined number of iterations, the algorithm provides a table of these hits.

At every iteration, the model compares the Z-scores of the shuffled copies (shadow features) with those of the original variables to see whether the latter outperform the former. If so, the algorithm marks the feature as important. In summary, the algorithm validates the importance of each feature by comparing it with randomly shuffled copies, which increases robustness.
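The shadow-feature comparison described above can be sketched as follows. This is a simplified illustration and not the Boruta implementation itself: to stay self-contained it swaps the random-forest importance Z-score for the absolute Pearson correlation with the target, and it omits the statistical test that Boruta applies to the hit counts. The function names and the toy data are hypothetical.

```python
import random
import statistics


def abs_correlation(xs, ys):
    """Stand-in importance score: |Pearson r| with the target.
    Assumption: Boruta itself uses random-forest importance Z-scores."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy) / (sxx * syy) ** 0.5


def count_hits(features, target, n_iter=50, seed=0):
    """features maps a name to a list of values.
    Returns how often each real feature beat the best shadow feature."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(n_iter):
        # Shadow features: shuffled copies of every real column.
        shadow_scores = []
        for values in features.values():
            shadow = list(values)
            rng.shuffle(shadow)
            shadow_scores.append(abs_correlation(shadow, target))
        threshold = max(shadow_scores)
        # A real feature scores a hit when it beats the best shadow.
        for name, values in features.items():
            if abs_correlation(values, target) > threshold:
                hits[name] += 1
    return hits


# Hypothetical example: one informative column and one uninformative one.
target = [float(t) for t in range(30)]
features = {
    "informative": [2 * t + 1 for t in target],
    "alternating": [1.0 if i % 2 == 0 else -1.0 for i in range(30)],
}
print(count_hits(features, target, n_iter=20))
```

A feature that collects hits in nearly every iteration would be kept, while one that rarely beats the shadows, as DAY does in this dataset, would be flagged as unimportant.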

Figure 5 shows a boxplot of the predictors ranked by the Boruta algorithm. Blue boxplots correspond to the minimum, average and maximum Z-scores of the shadow features. Red and green boxplots represent the Z-scores of the worst and top contributors, respectively. It can be seen that the predictors contributing the most to the variability of the dataset were GHGS, GSGS, NOS, CTM, and UD. Most of the predictors have significant importance, as only the variable DAY contributed less than the shadow features. For this reason, we did not treat this stage as feature selection but as a complement to the exploratory analysis described above.

#### 4.4. Analysis per Season

EV charging behavior depends on several factors, such as socio-demographics, time of day, location of CS, distance travelled, and tax incentives [46]. However, over the year, people vary their typical activities depending on the season. For instance, someone's activities in summer may differ from their activities in winter. These changes directly affect the driving range of EVs because of its strong dependence on ambient temperature. Most EV owners avoid using their vehicles in winter, as sub-zero temperatures may degrade the expected life of the battery pack [47]. Several studies have reported driving range reductions of up to 45% [48], 34.3% [49], and 31% [50] in extreme cold (below zero degrees) scenarios.

We used the dataset mentioned above to analyze the charging behavior of EV owners in each season. Figure 6 gives a general picture of charging behaviors across seasons through their probability density functions. The season that accounts for the most significant EV load and number of charging sessions is autumn. This is strongly related to the fact that, as the ambient temperature begins to decrease, the available battery range is reduced, and owners may charge their vehicles more often to ensure they have sufficient range to reach their final destination [16]. Winter shows the lowest energy consumption, and the frequency of small charging sessions is higher than in the other seasons. This behavior can be explained from several angles. For example, cold temperatures cause drastic changes in people's habits, especially at sub-zero temperatures, as in Canada, Sweden, Norway and Finland [51], where people prefer to stay indoors. This behavior may help extend battery lifetime, as users do not use their vehicles in the same way as during other seasons, when the ambient temperature is warmer. Charging habits during spring and summer are quite similar. This may be because, during these seasons, people tend to spend time outdoors to take advantage of pleasant weather and longer days (more sunlight hours).

#### 4.5. EVs Charging Load Classification per Season

As mentioned in Section 4.4, the dataset was divided by season. We used several classification methods, including classic ones such as support vector machines with linear (SVML) and radial (SVMR) kernels, linear discriminant analysis (LDA and SLDA), multinomial regression (MN), decision trees (DT), naive Bayes (NB), and k-nearest neighbors (KNN). In addition, we evaluated boosting-based models such as extreme gradient boosting (XGBOOST) and boosted logistic regression (BLR). Finally, we used Lasso and elastic-net regularized generalized linear models (GLMNET) and bagged trees (TBAG) as ensemble models. This set of classifiers was trained and tested on unseen (test) data using 10-fold cross-validation.
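As an illustration of the 10-fold cross-validation procedure, the sketch below splits the observations into ten folds and averages the accuracy obtained on each held-out fold. It is a standard-library sketch under stated assumptions, not the pipeline used in this study: the 1-nearest-neighbour classifier is only a placeholder for any of the twelve models, and the toy temperature data and function names are hypothetical.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_indices, test_indices) for each of the k folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        yield train, test


def cross_val_accuracy(fit, predict, X, y, k=10, seed=0):
    """Mean accuracy over the k held-out folds."""
    scores = []
    for train, test in k_fold_indices(len(X), k, seed):
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        scores.append(correct / len(test))
    return sum(scores) / len(scores)


def fit_1nn(X_train, y_train):
    """Placeholder model: 1-nearest neighbour simply stores the data."""
    return list(zip(X_train, y_train))


def predict_1nn(model, x):
    """Return the label of the closest stored observation."""
    def sq_dist(pair):
        return sum((a - b) ** 2 for a, b in zip(pair[0], x))
    return min(model, key=sq_dist)[1]


# Hypothetical data: one feature (a temperature-like value) per day.
X = [[float(t)] for t in list(range(20)) + list(range(100, 120))]
y = ["winter"] * 20 + ["summer"] * 20
print(cross_val_accuracy(fit_1nn, predict_1nn, X, y, k=10))
```

Each observation is used for testing exactly once, so the averaged score reflects performance on data the model did not see during fitting.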

#### 4.6. Performance Metrics for Classification

In this work, we used accuracy and the receiver operating characteristic (ROC) curve as performance metrics for classification. The ROC curve evaluates performance in terms of the number of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN), and the area under the curve (AUC) summarizes the detection ability of a model. These metrics were computed during the cross-validation procedure for each model. The metrics are described below.

#### 4.6.1. Accuracy

Accuracy is the proportion of correct predictions, i.e., the number of TP and TN divided by the total number of observations.
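A minimal sketch of this computation is given below. It treats one season as the positive class (a one-vs-rest view, which is an assumption made here for illustration), and the labels and function names are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive):
    """Count TP, TN, FP and FN, treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Share of correct predictions among all observations."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical true and predicted season labels for five days.
y_true = ["winter", "winter", "summer", "summer", "summer"]
y_pred = ["winter", "summer", "summer", "summer", "winter"]
print(accuracy(*confusion_counts(y_true, y_pred, positive="winter")))
```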

#### 4.6.2. Receiver Operating Characteristic

This metric describes the TP rate versus the FP rate. It helps to understand how sensitive (TP rate) and specific (TN rate) a model is. The ROC curve is obtained by plotting the TP rate against the FP rate. The best possible AUC is 1.0. The diagonal line in the ROC plot represents random guessing.
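The construction of the ROC curve and its AUC can be sketched as follows for a binary problem. This is an illustrative standard-library version, not the routine used in this study; a multi-class season problem would need a one-vs-rest treatment, which is assumed here rather than taken from the source, and the function names are hypothetical.

```python
def roc_points(y_true, scores):
    """ROC points (FP rate, TP rate) obtained by lowering the decision
    threshold one distinct score at a time. y_true holds 1 (positive)
    or 0 (negative) and must contain both classes."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Observations with tied scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical labels and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
print(auc(roc_points(labels, scores)))
```

A classifier whose scores carry no information (for example, identical scores for every observation) yields the diagonal and an AUC of 0.5, while a perfect ranking yields 1.0.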

#### 4.7. EVs Charging Load Prediction

To assess the effectiveness of the prediction, the demand data for all seasons in Boulder, Colorado, were selected as a numerical example for simulation and verification. The daily energy demand of EV users was taken as the outcome, and the accuracy of both RF and QPM was evaluated. We designed three scenarios to predict a season from the previous one or ones, as explained below:

Case I: We used the data from spring (Training) to forecast the data in summer (Test)

Case II: We joined the data from spring and summer (Training) to forecast the data in autumn (Test)

Case III: We joined data from spring, summer and autumn (Training), to forecast the data in winter (Test)
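The three cases amount to an expanding training window over the seasons, which can be sketched as follows. The record layout (a dictionary with `season` and `kwh` keys) and the function name are assumptions made for illustration; how each day is assigned to a season is left to the dataset.

```python
SEASON_ORDER = ["spring", "summer", "autumn", "winter"]


def seasonal_scenarios(records):
    """Yield (case_name, train_records, test_records) for Cases I to III.
    Each record is assumed to be a dict carrying a 'season' key."""
    for case_name, test_pos in zip(("Case I", "Case II", "Case III"),
                                   range(1, 4)):
        train_seasons = SEASON_ORDER[:test_pos]   # all earlier seasons
        test_season = SEASON_ORDER[test_pos]      # the season to forecast
        train = [r for r in records if r["season"] in train_seasons]
        test = [r for r in records if r["season"] == test_season]
        yield case_name, train, test


# Hypothetical daily records with made-up energy values.
records = [
    {"season": "spring", "kwh": 410.0},
    {"season": "summer", "kwh": 455.0},
    {"season": "autumn", "kwh": 520.0},
    {"season": "winter", "kwh": 380.0},
]
for name, train, test in seasonal_scenarios(records):
    print(name, len(train), len(test))
```

Any regression model can then be fitted on `train` and evaluated on `test` for each case.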

#### 4.8. Performance Metrics for Regression

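The mean absolute percentage error (MAPE) and the root mean square error (RMSE) were selected as metrics to assess the performance of the regression models. They are computed as follows:

$$\mathrm{MAPE}=\frac{100\%}{n}\sum_{i=1}^{n}\left|\frac{{p}_{n}\left(i\right)-\widehat{{p}_{n}\left(i\right)}}{{p}_{n}\left(i\right)}\right|$$

$$\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}{\left({p}_{n}\left(i\right)-\widehat{{p}_{n}\left(i\right)}\right)}^{2}}$$

where ${p}_{n}\left(i\right)$ and $\widehat{{p}_{n}\left(i\right)}$ are the real and predicted values of the ith data point, respectively, and $n$ is the length of the data used for verification.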
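Both metrics can be sketched in a few lines using their standard definitions. The function names and the example values are hypothetical, and the MAPE function assumes that no real value is zero, since the percentage error is undefined in that case.

```python
import math


def mape(actual, predicted):
    """Mean absolute percentage error, in percent.
    Assumes every actual value is non-zero."""
    n = len(actual)
    return 100.0 / n * sum(abs((a - p) / a)
                           for a, p in zip(actual, predicted))


def rmse(actual, predicted):
    """Root mean square error, in the units of the data."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2
                         for a, p in zip(actual, predicted)) / n)


# Hypothetical daily energy demand (kWh) and model forecasts.
actual = [100.0, 200.0]
predicted = [110.0, 180.0]
print(mape(actual, predicted), rmse(actual, predicted))
```

MAPE is scale-free and therefore comparable across seasons with different demand levels, whereas RMSE penalizes large errors more heavily and is expressed in the units of the demand.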