Predicting Electric Vehicle Charging Station Availability Using Ensemble Machine Learning

: Electric vehicles may reduce greenhouse gas emissions from individual mobility. Due to the long charging times, accurate planning is necessary, for which the availability of charging infrastructure must be known. In this paper, we show how the occupation status of charging infrastructure can be predicted for the next day using machine learning models— Gradient Boosting Classiﬁer and Random Forest Classiﬁer. Since both are ensemble models, binary training data (occupied vs. available) can be used to provide a certainty measure for predictions. The prediction may be used to adapt prices in a high-load scenario, predict grid stress, or forecast available power for smart or bidirectional charging. The models were chosen based on an evaluation of 13 different, typically used machine learning models. We show that it is necessary to know past charging station usage in order to predict future usage. Other features such as trafﬁc density or weather have a limited effect. We show that a Gradient Boosting Classiﬁer achieves 94.8% accuracy and a Matthews correlation coefﬁcient of 0.838, making ensemble models a suitable tool. We further demonstrate how a model trained on binary data can perform non-binary predictions to give predictions in the categories “low likelihood” to “high likelihood”.


Introduction
Electric vehicles (EVs) are generally seen as a viable path towards reducing CO2 emissions from mobility. In the context of this paper, we focus on battery electric or plug-in cars. These vehicles store (some of) the energy required for propulsion and auxiliary in an internal traction battery that is recharged through an electricity grid connection. While the share of EVs in car sales might have been considered negligible in the past several years, this cannot be considered true anymore-globally and in Germany. Globally, China had been market leader for several years in terms of absolute numbers, with annual sales exceeding one million vehicles for the first time in 2018 [1]. Likely due to a reduction in subsidy, the local market lost some of its dynamic growth recently. European governments such as Germany, France, Italy, and some others in turn have increased subsidies and other regulatory measures significantly [2]. In combination with the stricter EU fleet emission targets, Europe is now the globally leading market for EV sales, with 747,000 battery electric vehicles and 625,000 plug-in hybrid electric vehicles sold in 2021 alone [1]. Germany has been spearheading this development, with up to EUR 9000 in subsidies for new EVs, the worldwide fourth-highest share of battery and plug-in electric vehicles worldwide, and the highest absolute numbers in Europe [1]. The governmental goal of a cumulative sale of one million vehicles was reached, with a slight delay of 8 months, in August 2021 [3]. These trends and developments clearly show that the market is picking up pace. Using the typically used diffusion of innovation curve, one can state that the market is moving away from early innovators and is entering the early adopters or even early majority stage instead [4].
To achieve the widespread adoption of EVs, it is essential to have sufficient public charging infrastructure (CI) available such that users are able to comfortably and quickly recharge their EVs [5][6][7]. Currently, EV drivers in Germany are billed for CI usage at a fixed price per kWh and sometimes additionally for parking time. The situation may be different elsewhere, but generally does not vary significantly. Fixed price mechanisms lead to the highly inhomogeneous usage of CI, with occupation rates differing widely between day and night as well as between weekdays [8]. This is clearly not ideal as it leads to potentially avoidable charging hotspots where little to no CI is available. Some EVs may, however, already have a high state of charge and only recharge because the opportunity arises, while others urgently require recharging. During these shortages, drivers with a relatively high battery state of charge should be incentivised to give up their recharging opportunity for somebody with a more urgent recharging need. Implementing a system that globally optimises CI allocation is, however, not feasible for public CI as these devices should be easily accessible.
An alternative solution to the aforementioned problem is to vary prices. With current vehicle batteries covering 300 km and more of range [9], recharging has to happen only occasionally if persons drive 39 km/day on average in Germany [10,11]. If such persons were price-sensitive, it would be possible to incentivise them to charge when sufficient CI is available and, hence, indirectly support the grid load or capacity, respectively. This is the goal of the project BeNutz LaSA [12], in which the authors together with partner organisations implement such a price-based incentive mechanism in Germany.
Charging typically takes hours, especially at slower AC stations, and it is not feasible to force users to stop their charging spontaneously. If a shortage of CI occurs, we therefore cannot simply reduce the current occupation. For this reason, the authors attempt to predict CI usage with the goal of adapting prices already before a hotspot materialises. In this context, a good prediction is one that accurately predicts the occupation for as many stations as possible. The methodologies presented to achieve such a prediction can be used worldwide and are evaluated with real-world charging data in the use case of Germany.
The main aim of this paper is consequently to derive and evaluate methods to predict the status of many stations as accurately as possible and to give a measure of certainty on a station-by-station level. The paper is structured as follows: in the next Sections 1.1 and 1.2, the current literature on the issue is summarised. The following section, "Materials", contains all data sources used in this work. Section 3, "Methods", outlines all methods applied on the data. The results are presented in Section 4, "Results". The discussion and conclusions are provided in Sections 5 and 6, respectively. The back matter contains, in order of appearance, the acknowledgements, funding information, author contributions, data availability, software availability, the ethics statement, and lists of tables and figures.

Literature Review
Predicting the usage of CI is a much-discussed topic in the literature. One strand of literature focuses on which locations are sensible ones for the construction of CI [13][14][15][16], some with a focus on when future stations are likely to be used [17]. These models tend to have a longer-term focus and are not very relevant to the work done in this paper, where the goal is to predict short-term charging hotspots for the next day(s).
In the field of predicting the usage of individual stations for the short-term future, much less work has been published [18][19][20][21]. These authors attempt to generate a time series prediction for specific charging stations. The works, however, focus on very specific sites, such as a university campus [18], workplace charging [19], or EV taxis in the metropolitan area of Shenzhen [20], and therefore cannot serve as a model for a national scope, as done in this paper. Given the small geographic area, employing the used models may not be suitable for our purposes, as a risk of overfitting exists. A key reason for this is that models sometimes require specific conditions to be true to be applicable. Simple time series models focusing on a few points, such as done by Majidpur [18], fail if the patterns become highly irregular and cannot absorb the depth of information that our dataset contains. The models and methods, however, do provide a starting point for this work as they are a mix of classical statistical methods and machine-learning-based ideas.
Almaghrebi et al. [22] focus on a slightly different challenge. They attempt to predict energy requirements once a charging session has started given past CI usage data. They employ four models, namely a linear model, a gradient booster, a random forest, and a support vector machine. These concepts are also looked at in this paper. Frendo et al. [23] take this process one step further and examine the behaviour of an entire fleet and additionally provide an extensive literature review in their particular field. The complimentary challenge of trying to predict energy consumption while the vehicle is on the road is also being tackled using very similar methods [24,25].
An analogous problem that has received much more attention in the past is the prediction of the availability of bicycles at public bicycle sharing stations [26][27][28][29][30][31][32][33][34][35]. The problems are comparable since there are a number of spots that can either be occupied or not and it is relevant to determine when stations are out of bicycles to rent. The employed tools are quite similar to the ones used for CSs and include statistical models or machine learning models. Examples for the former are a generalised extreme value count model [28], dynamic linear models [29], generalised additive models [30], autoregressive, or autoregressive, integrating, moving average models (ARIMA) [26]. These models can typically be employed when usage patterns are fairly regular, particularly for the latter two. This requires the observed number of stations to be sufficiently aggregated such that random fluctuations do not have a major impact. In the context of this paper, the aggregation unit is the individual CS, with the corresponding high noise making such tools unsuitable. A selection of employed machine learning models includes random forest models [26], deep learning models [27], k-means clustering [31,32], or, less frequently, models such as the averaged one dependence estimators with subsumption resolution model [33]. Most of these models are in fact quite useful for our purpose as well and have been investigated. Deep learning, such as done by Yoshida et al. [27], however, requires that more input variables are available per decision than is possible in this paper. In other words, a CS history is not relevant for future events apart from the average weekly usage. We therefore do not consider deep learning models. Next to these time series models, it is also possible to use more holistic models, such as multi-agent systems [34] or flow-based models [35].
Looking at the three outlined categories, we can try to understand the applicability of the used methods for the purpose of CS availability prediction. In general, there are no models that are being used much more frequently than others, but, rather, models are selected to fulfil a specific purpose in each publication. The only important trend that we saw in researching the literature was that the easier-to-implement, standard models are more popular than their more complicated counterparts are. This can presumably be linked to the fact that most researchers come from the technical domain and not so much from the data science domain and therefore rely on existing libraries.
The statistical models can only be used on a large scale when trying to understand and predict trends involving hundreds of CSs. The reason for this is that the usage patterns at individual CSs are highly inhomogeneous and often do not follow a regular pattern consistently, as the standard deviations calculated in our previous work show [8]. An ARIMA model, for instance, would see far too irregular usage in the weeks before the predicted hour to provide any useful prediction.
The machine learning models used are very similar to the ones used in the prediction of CS availability. A more detailed analysis of which of these models is suitable follows later in the paper.
The last category of more holistic models is very challenging to create for cars. The reason for this is that cars move much further and for much more diverse reasons. Bicycle rides are typically limited to a few kilometres and there is no option to transport heavy goods on bikes. These two effects automatically limit the diversity of possible trip purposes and therefore make it much easier to develop models of the vehicles' movement. In contrast, the movement of EVs is far more diverse. An additional challenge is that public bicycle operators are able to track where the bikes are going, but no such data are available for (private) cars to protect the privacy of drivers.

Improvements of This Paper in the Context of the Literature
This paper uses similar machine learning and statistical tools as some previous work has already done [18][19][20][21] and thereby verifies the applicability of these tools. Our key contribution and thereby advancement compared to the current state of the art is that we are able to include a much broader geographic scope for our time series prediction as compared to previous publications. This Germany-wide approach allows us to ensure that overfitting is limited given the diverse geographies and characteristics, ranging from small villages all the way to larger metropolitan areas such as Munich, Berlin, or Hamburg. This wide geographical scope further ensures that the results are representative at least for European countries and possibly even for other geographies. We further advance the literature by clearly showing which models were used, why they were used, and how they were tuned. This will help fellow researchers in deciding which model to use in their own applications.

Materials
In this paper, several data sources are used. These are explained in the following subsections. All datasets are geographically referenced time series datasets that need to be merged into one single input table. Merging is done based on geographical reference and time.
While outlining the data used, it is also important to state which data were not used and why. This paper intentionally does not employ datasets that describe the static environment of the CS. Examples of such sets could be OpenStreetMaps points of interest, population density, number of registered cars, etc. While these factors are certainly important in understanding why and how CSs are used, this was not of relevance in this paper. The reason is that the past CS usage data provide a much better picture of usage patterns than environmental factors. Given that we assume there to be a usage history for each station, it is consequently superfluous to describe the environment and we focus solely on features changing quickly with time.

2019 vs. 2021 Dataset
In this work, we define two datasets, one starting in 2019 and one starting in 2021. The reason for this difference is that traffic data are available only from 2021 onwards, while CS occupation was available from 2019 onwards. A benefit of this split is that we can observe whether longer or shorter training data are preferable. The split between the two datasets was performed across all datasets listed in the following.

CS Usage Data
In the context of the project BeNutz LaSA [12], the ISEA institute has received the CS occupation from the industry partners SMART/LAB and Hubject. The dataset can be converted into CS hours, where each data point describes the status of a CS for a given hour. This hourly status was defined by dividing the occupied time by one hour and the number of EVSEs at a specific CS, as shown in the below equation.
where T occupied is the sum of all time that vehicles spent at the CS in the given hour. Multiple vehicles can result in T occupied becoming larger than 1. The variable # EVSEs is the number of EVSEs at the CS. A typical station with two EVSEs would consequently have an occupation of 50% if one of the two EVSEs were fully occupied during the hour while the other one remained available for customers, both were occupied for half an hour, or any possibility in between these two options. The total number of data points was around 90 million, out of which 12% had missing information. The time period analysed was approximately two years and nine months from 1 January 2019 to 12 September 2021. Reasons for missing information could be that the station was built during the time analysed and, therefore, historical data were unavailable earlier, the station was a test station, no matching weather station or area for traffic analysis could be found, or the station's given latitude and longitude was outside of the boundaries of a state. For this paper, the aspects of the data that were used were:

•
Status changes of the EVSEs as defined by the Open Charge Point Protocol [36]; • Location of the charging infrastructure; • Grouping of EVSEs into charging stations.

Public Holidays and School Holidays
Public holidays were retrieved using the Python library holidays [37]. The library allows us to check whether a date is a public holiday in countries and in the states of these countries. In this work, the public holidays on the level of German states are considered. Public holidays sometimes create long weekends, which are frequently used for short trips. This increased mobility was assumed to have an impact on charging station usage, which is why long weekends were also detected by checking whether public holidays were either directly adjacent to a weekend or with only a single day in between.
School holidays were retrieved from Ferienwiki.de [38]. The website provides iCal files containing school holidays. The school holidays are defined on a state level in Germany and can be quite different in terms of when they occur and how long they are. No correction is made in this paper for cases where stations are near state borders since it is challenging to define how the effect should be quantified.

Weather Data
Precipitation and temperature time series were downloaded from the open data portal of the German weather service, DWD [39,40]. The data comprise time series measured at 468 weather stations. Temperatures were recorded as the average temperature over one hour in • C and precipitation in mm h . Stations at an altitude above 1000 m were filtered out since the weather service operates stations on mountaintops, which would clearly not be representative for the weather situation where people live.
All weather datasets further contain a measure of accuracy and an indication of the equipment used to obtain said values. Given that the used data were quite recent, this information was of no relevance in this project since all stations have been operating with high-quality equipment over the last few years. Inaccuracies due to imprecise equipment occurred in the beginning of the time series, which, for many stations, was in the 20th century.

Traffic Data
Traffic data were purchased from ADAC Service GmbH [41] for this project. ADAC has trackers in over 300,000 vehicles in Germany and receives their position, velocity, and several other values. The data split Germany into rectangles of approximately 1.8 × 1.8 km and are available at an hourly resolution. For each such rectangle, the dataset contains the number of fleet vehicles that were in the rectangle during this hour. An example of the data can be found in Table 1. While the number of vehicles is not the traffic intensity itself, the monitored fleet is large enough that it can be understood as a proxy for traffic intensity. The recorded data have been provided from the start of 2021 onwards and predictions are delivered on a daily basis. In this work, only the recorded past has been used.

Methods
This chapter describes the methodology used in this paper. The different data sources listed in the "Materials" section were matched as outlined in "Data Merging". The resulting full datasets were processed as explained in "Data Preprocessing". The final dataset was then fed into several prediction models, which are given in "Chosen Prediction Models". The evaluation steps relevant for discussion are not outlined in the methodology section since common metrics were used.

Data Merging
Data matching was performed based on the location of the charging station. For weather data, each charging station was matched with the nearest available weather station using cKDTree from scipy.spatial [42]. Weather stations were marked as usable if data were recorded for the time period observed in this paper and if they were located at a height of less than 1 km. The latter criterion was introduced to filter weather stations located on mountaintops, such as the one on top of the mountain Zugspitze.
For spatial properties such as holidays and vacation periods that relate to entire states, charging stations were matched to the state that they were in using the Geospatial Data Abstraction Library in combination with its Python bindings.

Data Preprocessing
Models learning from large datasets can be subject to overfitting, where random noise in the training set is interpreted as an actual signal that the models interpret as significant. For the data discussed in this paper, the risk of overfitting is particularly prominent for weather data. The reason for this is that human behaviour does change with the weather but is generally agnostic to minor changes. A person would, for instance, drive to a nearby forest for a hike if it were warm and dry, but would not care about exact values for precipitation or temperature. To adjust for these somewhat coarser categories, precipitation was split into three levels of rain, which more or less corresponded to no rain, little rain, and heavy rain, as shown in Equation (2). These levels were chosen based on the analysis of the dataset used in our previous publication [8].
where prec f ull are the measured precipitation values obtained from the weather stations in mm h on the right-hand side. The variable prec in turn is the categorised precipitation.
To avoid overfitting for temperature, the recorded temperature was rounded to the nearest multiple of five since this is sufficiently granular to correspond with the human sensation of temperature but simultaneously coarse enough to avoid overfitting.
For the traffic data, we calculated how many vehicles were in a square relative to the square's average usage to provide a measure of how much the current traffic deviated from the traffic that the area would normally experience.
All datasets used in this work are time series. The next step of data preprocessing was consequently to align the various datasets along the time axis. This was done for each CS separately. An example of such a dataset can be seen in Table 2.

Feature Categorisation
While the calculated occupation, in theory, is a continuous variable and regression models could be employed to predict the occupation numerically, this is neither achievable nor desirable. The non-achievability stems from the fact that the provided features do not carry sufficient information to explain minor differences such as a vehicle leaving a few minutes earlier or later, which would in turn change the occupation. It is also not desirable since the goal of the project is to create a more human-readable output. The resulting occupation probabilities would likely be very confusing for users, particularly when compared to the chosen categories "very low" to "very high".
It was consequently necessary to create a categorisation of the occupation as a feature. This was done in two ways, which we will refer to as binary and categorised.
The binary classification, as the name already suggests, is a classification into binary values. We chose to set a boundary at 50%, as shown in Equation (3). The boundary was chosen since a 50% occupation means that only one EVSE was available on average at a typical CS with two EVSEs. A newly arriving vehicle would consequently cause the full usage of the station and no additional customers could be served. Additionally, virtually all CSs have an even number of EVSEs, leading to a good fit when taking half of 100%.
where occupation has been defined in Equation (1). The categorised classification is meant for users of CI. The goal is to provide a humanreadable classification in the categories "very low" (vl), "low" (l), "medium" (m), "high" (h), and "very high" (vh). These five categories are defined by occupation as well, as shown in Equation (4).
where occupation has been defined in Equation (1). This classification was chosen based on an analysis of which categories appear how frequently while simultaneously using easy-to-remember boundaries. Different categorisations are certainly possible and will be explored in our future works. The given one is arbitrary to an extent. The results of GB and RF cannot be used directly for the cat-egorisation since the output probability is defined as the probability that the station is occupied up to 50%. This was resolved by running many test predictions and creating a translation function. This function is simply a set of new boundaries that were chosen such that the size of the predicted groups is equal to the sizes when using AW on a set of test predictions. The boundaries are given in Equation (5) and the concept is visualised in Figure 1. As shown, the boundaries of the occupation are translated into group sizes by checking which x-value corresponds to the given y-values. These x-values were then kept and the corresponding y-values of the test predictions were chosen. The process was executed only for GB, but, as will be seen later, the prediction values were very close for both models and the same categorisation boundaries may be used.
where prediction is the output value generated by the machine learning models. size of the predicted groups is equal to the sizes when using AW on a set of test predictions. The boundaries are given in Equation (5) and the concept is visualised in Figure 1. As shown, the boundaries of the occupation are translated into group sizes by checking which x-value corresponds to the given y-values. These x-values were then kept and the corresponding y-values of the test predictions were chosen. The process was executed only for GB, but, as will be seen later, the prediction values were very close for both models and the same categorisation boundaries may be used.
is the output value generated by the machine learning models.  (4) and (5). The boundaries were chosen such that the group sizes are identical using GB and AW.

Train-Test Split
For the evaluation of machine learning models, one typically splits the data into a training and a test set. This was done in this work as well. Test feature-label combinations were randomly drawn from the dataset until 30% of the dataset was used. The remaining 70% were used for training. Note that this split consequently does not orient itself along a time axis and we therefore do not predict the last 30% in terms of time. This was done since models will be retrained regularly in real-world application and, given the highly dynamic market development, it is not sensible to use months-old models.

Model Selection
The proper analysis of the suitability of a model requires significant computational and time resources. To gain an overview of which model is suitable for the task and how the different models compare, a comparison was performed using the function classification.compare_models() of the library PyCaret [43]. The library wraps other machine learning libraries such as sklearn [44] and calls the most frequently used models using their standard settings. The models are consequently not optimised for their final task and overfitting might be an issue in certain cases. Results should therefore be understood as indicative   (4) and (5). The boundaries were chosen such that the group sizes are identical using GB and AW.

Train-Test Split
For the evaluation of machine learning models, one typically splits the data into a training and a test set. This was done in this work as well. Test feature-label combinations were randomly drawn from the dataset until 30% of the dataset was used. The remaining 70% were used for training. Note that this split consequently does not orient itself along a time axis and we therefore do not predict the last 30% in terms of time. This was done since models will be retrained regularly in real-world application and, given the highly dynamic market development, it is not sensible to use months-old models.

Model Selection
The proper analysis of the suitability of a model requires significant computational and time resources. To gain an overview of which model is suitable for the task and how the different models compare, a comparison was performed using the function classification.compare_models() of the library PyCaret [43]. The library wraps other machine learning libraries such as sklearn [44] and calls the most frequently used models using their standard settings. The models are consequently not optimised for their final task and overfitting might be an issue in certain cases. Results should therefore be understood as indicative and must be confirmed with properly tuned models, which was performed in the second part of this paper.
The goal of performing the model comparison was to evaluate whether the models outlined in the Section 3.4 "Chosen Prediction Models" were comparable in performance to other models. The choice for the actual implementation was, however, limited to models providing an uncertainty measure since the goal is not to perform binary prediction as accurately as possible but rather to later perform categorised prediction. Further, note that the average week model does not appear in the overview provided by PyCaret. It was added by us due to its low complexity and since it is easy to understand for end users and system administrators. Deterministic models were not evaluated since it could be shown in previous work that these are not sufficiently accurate.

Chosen Prediction Models
In this work, several prediction methods are tested and compared with each other. The Random Forest Classifier (RF) and Gradient Boosting Classifier (GB) were selected as typically used ensemble models and because of their high prediction performance, as shown in Table 3. Additionally, the average week model (AW) was added as a reference since, nowadays, many websites use this metric to inform customers about when to expect how many other customers. The chosen models are described in detail in the respective subsections. GB and RF were both trained and evaluated with multiple combinations of feature sets. In all combinations, data about weekday, hour, vacations, and long weekends were used given that their retrieval can be considered trivial. In prediction applications, the weather and traffic density input data require sophisticated prediction models themselves and are therefore not readily available in all applications.

Average Week Model (AW)
The AW is the simplest model analysed and serves as a baseline or reference for the more advanced models. It is created by averaging the weekly usage of a CS using an hourly resolution, as defined in Equations (6) and (7).
where P i (d, h) is the likelihood of CS i being occupied at a certain weekday and hour of the day and x d,h,i is the occupation status of a CS at a certain date and time range, found by summing all occupied EVSEs by the overall number of EVSEs at the station.

Gradient Boosting Classifier Model
A GB builds on similar concepts as used for RF. It combines many simple prediction models-in this case, decision trees-into a more complex model. The GB is built sequentially and each additional decision tree is fitted to minimise the losses of the previous ensemble. This can be understood as creating each consecutive tree to correct the errors made in the previous model. Since this method is a standard method and widely used, we omit an in depth description for the sake of conciseness. The implementation used in this paper is the GradientBoostingClassifier as defined in sklearn.ensemble [44]. The benefits of using a GB are similar to those of an RF. The model is able to handle non-linear data with interactions between features. A disadvantage is that the individual decision trees can only be generated sequentially, which makes the overall model creation quite slow.
The model was parametrised as follows: where max_depth describes the maximum number of nodes in a tree until a leaf has to be reached and min_samples_leaf describes the minimum share of the training data needed in each leaf. The maximum depth was chosen as 5 since the models typically use the first two to three levels for the average weekly patterns. The default and typically used setting of 3 would therefore have been insufficiently deep to make use of the full feature set. The minimum sample share per leaf was chosen as given to avoid overfitting. Otherwise, the model could be tempted to dedicate leaves for noisy patterns of individual charging stations. The remainder of the parameters were kept at their default values (e.g., learning_rate = 0.1, n_estimators = 100, . . . ). Listing all would be excessive at this point and the reader is referred to the documentation of the library for further details [44].

Random Forest Classifier Model
An RF consists of many individual decision trees that collectively estimate a value. The training data for the model are typically split into smaller sets and each decision tree is trained using one such set. Since RFs are a standard method in data science, an in-depth description is omitted. The implementation used in this work is the RandomForestClassifier as defined in sklearn.ensemble [44].
The model is able to handle data with strong interaction dependencies between the features, which is beneficial for this project. As an example, temperature or precipitation might affect CS usage much more during the weekend if people spontaneously make use of good weather compared to commuters on weekdays, since they have to go to work independent of the weather. An additional benefit is that the outputs of individual decision trees relative to the result of the final model can be interpreted as an uncertainty. For a real-world application, this is beneficial since the goal is to predict stations and areas with an overload and a high certainty that this overload will occur.
The model was parametrised as follows: where max_samples is the share of samples assigned to each tree and min_samples_leaf describes the minimum share of the training data needed in each leaf (same as for GB). This parametrisation was chosen based on visual inspection of sampled decision trees. The trees showed a reasonable balance between learned abstractions of the problem while still avoiding overfitting.

Results
This section contains the results obtained. We start by comparing the untuned models. Next, the prediction performance of the selected models is shown. As a last step, the feature importance in the selected prediction models is shown. Table 3 shows the results of the model comparison. For performance reasons, 10 million label-feature combinations were randomly sampled for the model evaluation. The label in this case was the binary version as outlined in "Feature Categorisation". Most of the evaluated models have performance in a similar range, with MCC varying between 74.3% and 82.6% for all models except the K Neighbours Classifier and the Quadratic Discriminant Analysis. The reason that these models perform poorly is that they were designed for different classification problems to the one analysed in this paper. The K Neighbours Classifier relies on the fact that neighbours share similar properties, which cannot be guaranteed if features influence each other. If the time were 9 a.m. in the morning, the day of the week matters a lot since this would mean commuters using CI during the week, but very little traffic on the weekend. The quadratic classifier also requires the features to be meaningful by themselves and is poor at handling interaction effects.

Model Comparison
The remaining models all perform reasonably well, with an MCC between 74.3% and 82.6%. Boosting algorithms particularly seem to outperform their competitors, but the differences are minor. The already introduced Gradient Boosting Classifier (GB) and Random Forest Classifier (RF) are among the well-performing models. A key advantage of these models is that they provide a probability because they are ensemble learners. In the application targeted in this work, having this uncertainty measure is valuable as it allows us to determine stations that are highly likely to be occupied. The Light Gradient Boosting Machine slightly outperforms GB, but the differences between the models are less than 0.1% on most metrics and therefore almost meaningless. Given that scikit-learn possesses an easy-to-use implementation of the Gradient Boosting Classifier, but not of the Light Gradient Boosting Machine, we decided to stick to scikit-learn for better comparability with the remainder of this work. The case is different for RF, which, although still mostly within 1%, does not quite match the performance of the boosting methods. The reason for still using RF can be seen in the fact that parallelised training is possible and that trees are generated without the implicit history that boosting algorithms possess. This history is created since new trees are generated to correct for the mistakes of the previous ensemble and could therefore involuntarily carry information. Given that the probability of achieving the right result has a central role to play in this work, RF is attractive as all trees are independent of each other and do not depend on each other in any way.

Prediction Performance
The key results of this paper are the prediction of the half-occupancy and the prediction of the categorised values. The results of this prediction are shown in Table 4. • "Model" (column 1) The model used as explained in the Section 3.4 "Chosen Prediction Models" • "Dataset"-"Include traffic" (columns 2-5) The columns contain the datasets used as outlined in the section "Data". TRUE indicates that the dataset was used in training and prediction and FALSE indicates that it was not used. • Share correctly categorised (columns 6-9) These columns contain the share of predictions that matched the real value. The numbers in the second title row of Table 4 indicate how far removed a category was allowed to be in order to be still considered correct. The column with the heading "0" consequently contains the share of correctly assigned categories. Column "1" allows for neighbouring categories (e.g., if "low" was predicted, "very low" and "medium" were still considered correct) to be correct. Columns 2 and 3 follow the same pattern, but allow further removed neighbours to be correct. Table 5 shows an example of how the confusion matrix translates into the values shown in Table 4. The binary metrics columns contain typically applied metrics in binary classification problems. These are, in order of appearance, accuracy, area under the receiver operating characteristic curve (AUC), recall, precision (Prec), F1 score (F1), Cohen's kappa score (Kappa), and Matthews correlation coefficient (MCC). As these are standard metrics in machine learning, a detailed explanation of each is omitted at this point and the reader is referred to the Appendix A.1 of this paper. Further information can be found in the documentation of the used methods in the model evaluation package of sklearn [44] or other literature.
The key results are further summarised in Figure 2. The models have been grouped into three key groups. The first group, "GB and RF incl. AW", shows the average result of all GB and RF that include the average weekly usage. "AW" is simply the result of the AW and "GB and RF excl. AW" is the result of the models unaware of average weekly usage.  Table 4 grouped by model type and whether the average weekly data (AW) were used or not. Part (a) shows the results for binary metrics and (b) for the categorised results. Category "0" corresponds to the share of correctly assigned categories; "1" counts direct neighbour categories (e.g., "very low"-"medium" are still correct if "low" was the true value); "2" includes neighbours once removed; "3" neighbours twice removed.
Another way to look at the results is presented in Figure 3, where the accuracy of GB with the 2021 dataset and all available data sources is compared to the accuracy for AW at the level of charging stations. It can be observed that all models are able to predict the usage of stations with either overall quite high or low usage. This is not surprising since a station that is nearly always/never used will likely be used/not be used in the future as well. The machine learning models, however, perform superiorly in the middle section, with stations being used between 30% and 50% of the time.

Feature Importance
When trying to use the prediction models in a real-world application, it is relevant to ensure that only those features are used that bring additional value to the prediction performance. While, theoretically, no damage should be done by including data with little additional value, the risk of overfitting and the effort required to warehouse and treat excess data sources both rise significantly with the amount of data used as features. Table A1 shows the feature relevance using the mean decrease in impurity as a metric for the prediction models created in the context of this work. The AW is not included as it does not rely on external features.  Table 4 grouped by model type and whether the average weekly data (AW) were used or not. Part (a) shows the results for binary metrics and (b) for the categorised results. Category "0" corresponds to the share of correctly assigned categories; "1" counts direct neighbour categories (e.g., "very low"-"medium" are still correct if "low" was the true value); "2" includes neighbours once removed; "3" neighbours twice removed.
Another way to look at the results is presented in Figure 3, where the accuracy of GB with the 2021 dataset and all available data sources is compared to the accuracy for AW at the level of charging stations. It can be observed that all models are able to predict the usage of stations with either overall quite high or low usage. This is not surprising since a station that is nearly always/never used will likely be used/not be used in the future as well. The machine learning models, however, perform superiorly in the middle section, with stations being used between 30% and 50% of the time.

Feature Importance
When trying to use the prediction models in a real-world application, it is relevant to ensure that only those features are used that bring additional value to the prediction performance. While, theoretically, no damage should be done by including data with little additional value, the risk of overfitting and the effort required to warehouse and treat excess data sources both rise significantly with the amount of data used as features. Table A1 shows the feature relevance using the mean decrease in impurity as a metric for the prediction models created in the context of this work. The AW is not included as it does not rely on external features. Figure 3. Accuracy of GB with all available inputs in 2021 (left) and of AW again for 2021 (right). The colour indicates which share of stations with the given average occupation was predicted with which accuracy. The heat map is constructed of rectangles with 2% edge length in both dimensions. Example of how to read: A dark green square corresponding to 80% with an average occupation of 50% and accuracy of 90% indicates that, for all stations with an average occupation of 50% to 52%, the prediction model achieved accuracy between 90% and 92% in 80% for all CS falling into the category. The blue line indicates what share of stations fell into each 2% block of average occupation. Figure 3. Accuracy of GB with all available inputs in 2021 (left) and of AW again for 2021 (right). The colour indicates which share of stations with the given average occupation was predicted with which accuracy. The heat map is constructed of rectangles with 2% edge length in both dimensions. Example of how to read: A dark green square corresponding to 80% with an average occupation of 50% and accuracy of 90% indicates that, for all stations with an average occupation of 50% to 52%, the prediction model achieved accuracy between 90% and 92% in 80% for all CS falling into the category. The blue line indicates what share of stations fell into each 2% block of average occupation.

Discussion
In this section, we discuss the results shown in the previous section. The structure is similar to the order in which the results were presented. We start with a short discussion of the model choice. The next three subsections discuss the usefulness of the different datasets and models given the prediction tasks. As a last content discussion, the feature importance is analysed. The section ends by discussing the limitations of the work done.

Model Choice
The comparison of a broad range of models was done to verify that the selected models in fact perform well in comparison to others. Since the project requires a measure of probability, the model choice was limited even before the comparison, but it is reassuring that GB and RF are among the better choices even in the more general task of binary classification. A Kappa and an MCC score of above 0.8 mean that the models have strong prediction quality in the binary challenge. Since the classification into the more diverse categories ranging from "very low" to "very high" is derived directly from the trained binary models' probability, the same conclusion can be drawn about the choice of models for this application.

2019 and 2021 Dataset
As can be seen in Table 4, the prediction performance in all major categories is significantly poorer for the longer dataset. The most likely explanation for this phenomenon is that Germany experienced dramatic changes in the time period that the models were unable to cover. One such change is the global COVID-19 pandemic, which resulted in three major lockdowns, resulting in significantly altered mobility patterns. With much more generous home office rules, the change in mobility patterns further had an effect outside the strict lockdowns. The second main change is the strong rise in EV sales. There were only 150,000 EVs on the road in the beginning of 2019 [45], but this number had risen to over 1 million by August 2021 [3]. Given that the mobility patterns and the number of users have dramatically changed over time, it is of little surprise that the models do not perform well when they have to predict the full time range.
The market is continuously changing, with increasing market shares of EVs [1], higher charging powers, and changing work modes, which may continue to change due to a broader acceptance of home-working [46]. This implies that there will be ongoing shifts in the use patterns of CSs. It therefore seems reasonable to assume that short training timeframes are more suitable for prediction than long ones.

Predicting Half-Occupancy
Looking at Table 4, three distinct groups can be observed. These are: • GB and RF unaware of average weekly data All GB and RF models that are unaware of the average weekly data perform very poorly. Besides a seemingly reasonable accuracy of around 0.78, the MCC is 0 and AUC 0.5. The latter two indicate that the model has no predictive value. The reason for this is that the provided input features are an insufficient basis to predict a CS occupation of above 50% since the average occupation is too low and features are not unique enough. • Average week model The average week model reaches a reasonable score in all major metrics. The relatively low recall, however, tells us that the model is unable to identify high CS occupation and earns most of its accuracy score by correctly predicting labels with low occupation rates. Given the goal of predicting usage spikes, this is clearly not ideal. This behaviour can be explained by the fact that only stations with an extremely high usage rate over the entire observed period during certain hours of the week would be correctly predicted as having a high occupation rate.
• GB and RF aware of average weekly data Once GB and RF are aware of the average weekly data, the prediction performance becomes much better. While there are small improvements in predictive quality if weather and traffic information is available, they do not make a strong contribution to station occupation. This occurs despite the fact that, individually, they show a strong correlation with station occupation. A reason for this could be that there are strong collinearities present. Traffic and station occupation, for instance, both show a similar pattern over the course of the day when using aggregated data. The results show that this effect holds true even when training detailed models.
The difference between the groups described in the second and third bullet points of the above list can also be observed in Figure 3. Unsurprisingly, both models achieve high accuracy for CSs with either a very high or a very low average occupation. Given that these CSs are either practically never used or always used, they are very likely to exhibit this behaviour in the test set as well. External features are irrelevant compared to the strong trend.
The difference is more pronounced in the middle segment, where CSs have more diverse usage. For these situations, GB (and RF, although not shown) have access to more information than AW and are thereby able to analyse the situation better. This allows these models to make much more accurate decisions in this segment, which is the reason that most binary metrics in Table 4 are much higher for GB and RF as compared to AW.

Predicting Categorised Values
Given that the ultimate goal of this work is to create a user-friendly way of displaying the likelihood that a CS is occupied, special attention is paid to the prediction of categorised values analysed through the confusion matrix. The 2021 data show a quite clear trend, where both RF and GB models are able to predict the exact category in over 80% of the predictions and predict the category itself or a nearby category at a rate of over 90%.
For users of the model, this high accuracy is good news. For most people, the difference between, for instance, "low" and "very low" is quite negligible, but the general tendency is of relevance. We can further see that GB and RF outperform AW significantly. This shows that simply taking the average weekly usage is insufficient when trying to make reasonably accurate predictions, but a correction using more models that are sophisticated is necessary.
As a third point, we can see, similar to the previous subsection, that the models unaware of the average weekly usage pattern are unable to make an accurate prediction. Virtually all predictions are in the categories "low" and "medium" since traffic, weather, and calendar data alone provide insufficient information for a prediction by itself. A consequence of this fact is that the suggested model can only be used in a real-world application if the stations have been operational for some time.
For the prediction of how a newly built station in a new location would likely be used, other tools are more adequate (see this paper's literature review, as well as the literature review in [8,47]).

Feature Importance
As can be seen in Table A1, the average weekly occupation is by far the dominating factor if it is available. As was shown already earlier, both GB and RF only provide useful outputs once they gain access to the average weekly usage. It is consequently of little surprise that this feature is heavily used and reduces around 99% of impurity. The models, however, score significantly better than the AW in close to all settings. This seeming contradiction can be overcome by understanding that the average weekly usage essentially forms a starting point for the remainder of the decision trees. Upon visual inspection, most trees use the average occupation in the first two to three decision nodes, and only after this do they switch to other decision variables. The other features are consequently used to correct the average occupation for the specific circumstances that occur during a specific hour. This fact explains why the GB and RF have much higher prediction performance than the AW.

Limitations
In this work, several limitations must be considered. These are listed in the following.

Market development
The EV market has undergone tremendous changes in the studied period. The number of vehicles increased from approximately 150,000 vehicles in early 2019 to one million in 2021. Similar trends could be observed with respect to the number of CSs. At the same, vehicle range has increased significantly over time [9] and travel within Germany is now comfortably possible [48]. It seems probable that all these changes had an effect on infrastructure usage. Consequently, training and test data might in part be outdated, which the results in Table 4, including the 2019 dataset, also show.
The COVID-19 pandemic has had an immense and lasting impact on mobility and travel patterns in Germany and worldwide. These developments are a second highly dynamic factor that the used models may not be able to reflect in detail.

•
Availability of datasets One can think of various other aspects that probably increase prediction accuracy but for which we could not obtain datasets in this study. Examples of such data are events taking place (where visitors require recharging opportunities), hotel bookings (assuming that people need more public infrastructure if they are not at home), etc. If a reliable and quantifiable dataset can be obtained for these tasks, the prediction accuracy would likely increase further.
If the EV market and the COVID-19 pandemic should stabilise, it would certainly be of interest to study the topics in this paper again. We believe that it would, for instance, be critical to include seasonal patterns in future work, but could not do so here due to the fact that too much changed over a period of a single year. If researchers have access to datasets described in the last point of the Limitations list, a new study would certainly also be highly interesting and may create new insights into which features add value.

Conclusions
In this paper, we showed how the occupation level of a CS can be predicted accurately using data-driven methods. The models were able to achieve high scores in all standard metrics typically used in machine learning. We further showed how a binary training set might be used to create models that are then able to predict the more fine-grained categories of "very low" to "very high" occupation levels of public CI.
We further conclude that predicting future CS requires knowledge of past station usage. Since the usage patterns are currently changing quickly due to the COVID-19 pandemic and the rapidly growing market, it is sufficient to have knowledge of the near past. Longer datasets are not required. If this information is unavailable, at least the features available in this work were able to provide a useful prediction in the time resolution required. The addition of other datasets, such as traffic and weather data, increases model performance somewhat, but the impact is comparably small. Using a machine learning algorithm is sensible since the tested algorithms, GB and RF, scored significantly better than the simple average usage on a weekly basis. Funding: As stated in the Acknowledgements, the research on which this paper is based was conducted in the context of the project BeNutz LaSA. The project was funded by the Federal Ministry for Economic Affairs and Energy of the Federal republic of Germany based on a decision of the German Parliament. The funding identifier is "01MV20001A".
Institutional Review Board Statement: Not applicable.

Informed Consent Statement: Not applicable.
Data Availability Statement: The CS information and traffic data were provided under a confidentiality agreement and cannot be shared externally. For a coarse overview of CS usage, we refer the reader to our previous publication alongside which data were published [8]. Other datasets are available open source and the respective sources are listed in the Data section. If any of the readers are facing difficulties in acquiring these data, please contact the authors.
Accuracy is defined as the share of predictions that were predicted correctly. This score does not treat an imbalance in the test set. In the context of this paper, high accuracy can be achieved by predicting that a CS will be used less than 50% for all prediction instances. Since this the case in the majority of cases, accuracy scores of above 70% can be achieved with a naïve model.
Area under the receiver operating characteristic curve (AUC) AUC= 1 0 Accuracy(y,ŷ)dθ y= positive i fy ≥ θ negative otherwise where Accuracy, y, andŷ are defined in the previous subsection andy is the predicted probability thatŷ is positive. A perfect prediction would achieve an AUC of 1, meaning that the true positive rate is always 1 and the false positive rate always 0. This is achieved ify is 1 for all true positives and 0 for all true negatives. A score of 0.5 is achieved if the predictor is random. A score below 0.5 would indicate that the model is guessing incorrectly more often than random choice would have been. Recall Recall = t p t p + f n where t p . is the sum of true positive predictions and f n is the number of false negative predictions.
Recall shows how many of the positive values have been labelled correctly as positive compared to the total number of positives in the test set. In the context of this paper, the recall consequently tells us how many high-usage situations the system correctly identified.
Precision (Prec) Precision = t p t p + f p where t p is the sum of true positive predictions and f p is the number of false positive predictions.
Precision is a measure for how many of the predicted positive values were actually positive. In the context of this paper, the score consequently tells us how many of the predicted high-usage situations were actually true. F1 score (F1) F1 = 2 · Precision · Recall Precision + Recall The F1 score is a weighted average of recall and precision. The metric consequently tries to balance the statements made. For this paper, this consequently strikes a balance between measuring how many situations were correctly predicted as positive compared to how many there were and how many were predicted.
Cohen's kappa score (Kappa) κ = p o −p e 1−p e = 2·(t p ·t n − f n · f p ) (tp+ f p )·( f p +t n )+(tp+ f n )·( f n +t n ) where p o is the relative observed agreement among raters and p e is the probability of chance agreement. In the second definition, t p is the sum of true positives, t n is the sum of true negatives, f p is the sum of false positives, and f n is the sum of false negatives.
Kappa measures how well matching was performed relative to a random assignment. Due to the relatively complex definition, no direct definition in the context of this work is possible.
Matthews correlation coefficient (MCC) MCC = t p · t n − f p · f n t p + f p · t p + f n · t n + f p · (t n + f n ) where t p is the sum of true positives, t n . is the sum of true negatives, f p is the sum of false positives, and f n is the sum of false negatives. MCC is a balanced measure and describes a model's prediction performance on a score from −1 to 1. A score of 1 corresponds to a perfect model, 0 to a model with no predictive value, and −1 to a model perfectly predicting the opposite of reality. MCC is the key measure in our paper for binary classification problems since it most accurately describes model performance in a highly imbalanced test set, as is the case in our work.
Appendix A.2. Feature Importance of the Selected Models Table A1 shows the feature importance for the models created in this paper. Table A1. Feature importance based on mean decrease in impurity of selected models, sorted by dataset first and then by MCC (i.e., same as in Table 4).