Spatio-Temporal Forecasting of Municipal EV Charging Load Using Weather-Aware Transformer–LSTM Hybrid Models

Das, Remon; Debnath, Sajib; Kandil, Tarek; Mia, Md Uzzal

doi:10.3390/ai7060191

¹ Dominion Energy, Richmond, VA 23219, USA
² AES Clean Energy, The AES Corporation, Louisville, CO 80027, USA
³ School of Engineering and Technology, Western Carolina University, Cullowhee, NC 28723, USA
⁴ Department of Information and Communication Engineering, Pabna University of Science and Technology, Rajapur, Pabna 6600, Bangladesh
* Author to whom correspondence should be addressed.

AI 2026, 7(6), 191; https://doi.org/10.3390/ai7060191

Submission received: 12 April 2026 / Revised: 14 May 2026 / Accepted: 14 May 2026 / Published: 25 May 2026

(This article belongs to the Special Issue The Application of Machine Learning and AI Technology Towards the Sustainable Development Goals)

Abstract

Accurate forecasting of municipal electric vehicle (EV) charging demand is increasingly important for distribution system planning, charging infrastructure management, and demand-side operation. This study proposes a weather-aware Transformer–LSTM hybrid framework for spatio-temporal forecasting of EV charging load across municipal public charging stations. The proposed approach integrates multi-source information within a unified pipeline, including cyclic temporal encodings, multi-lag autoregressive features, rolling statistics, behavioral aggregates, and meteorological variables, while combining a Transformer encoder to capture long-range temporal dependencies with an LSTM decoder to model local sequential dynamics and nonlinear load patterns. The framework was evaluated using 211,324 charging sessions collected from eight New York City municipal charging stations between July 2021 and December 2025. Under controlled benchmarking against Simple RNN, standalone LSTM, and encoder-only Transformer models using identical preprocessing, feature engineering, and training settings, the proposed hybrid model achieved R² = 0.9731, MAE = 62.71 kWh, RMSE = 94.21 kWh, and MAPE = 19.62%. Relative to the standalone Transformer, the proposed model reduced RMSE by 32.6% and MAPE by 34.5%. In addition, the model maintained strong forecasting performance across stations with heterogeneous demand profiles without station-specific retraining and remained robust across seasonal variations. These results demonstrate that the proposed framework provides a reproducible and scalable solution for municipal EV charging load forecasting in real-world urban environments.

Keywords: electric vehicle charging; load forecasting; Transformer–LSTM; spatio-temporal prediction; weather-aware modeling; deep learning; demand-side management

1. Introduction

The rapid rise in EV adoption is reshaping contemporary energy and transportation systems. Global electric vehicle sales surpassed 3.5 million units in 2023, amounting to almost one-fifth of total vehicle sales [1]. According to the International Energy Agency, 30% of new cars will be electric by 2030 [2]. This growth is mainly motivated by global decarbonization targets, given the significant share of greenhouse gas (GHG) emissions attributed to transportation [3]. At the same time, EV charging events follow patterns that change within the day, which makes electricity demand uncertain. Charging behavior is user-specific and is heavily influenced by location, time, and duration, resulting in strong temporal and spatial variability [4]. Simultaneous charging of many vehicles can create load surges in the power grid, resulting in voltage instability, transformer overload, and grid outages [5]. This makes EV charging load forecasting critical for grid reliability, infrastructure planning, and demand-side management [6]. Among the various charging environments, municipal public charging stations are the most difficult to model. In contrast to residential or workplace charging, public stations must accommodate users with different backgrounds, unpredictable arrival times, and heterogeneous demand [7]. In addition to user behavior, demand is influenced by external factors such as weather conditions, seasonal variations, and urban mobility behaviors [8]. Moreover, the integration of renewable energy sources introduces further variability on the supply side [9].

Early studies on EV load forecasting were mainly based on classic statistical models such as the Auto-Regressive Integrated Moving Average (ARIMA), Seasonal Auto-Regressive Integrated Moving Average (SARIMA), and Exponential Smoothing [10]. These methods work well with linear trends but not with the nonlinear and high-dimensional relationships in EV charging data. Machine learning techniques, such as Random Forest (RF) and Gradient Boosting (GB), have achieved better performance by modeling nonlinear interactions [11]. However, they have a limited capability to capture long-term temporal dependencies. Recent advances in deep learning have greatly enhanced forecasting accuracy. Recurrent Neural Networks (RNNs), especially Long Short-Term Memory (LSTM) and Bidirectional Long Short-Term Memory (BiLSTM) models, have shown great ability in modeling the temporal dependencies of electric vehicle charging demand [12]. Hybrid systems, such as XGBoost–BiLSTM, improve performance by linking feature learning and sequential modeling [13]. Transformer-based architectures have also received much attention for their success in capturing long-range dependencies via self-attention [14]. Hybrid Transformer–LSTM models incorporate both global attention and sequential learning, resulting in improved forecasting performance [15]. Related work on new deep network architectures includes the Efficient BiLSTM Net, which uses both spatial and temporal features for renewable-integrated EV systems [16], and REST, which integrates multi-stream deep learning components into a robust forecasting network [17]. Spatial dependencies between charging stations, in turn, are captured by GCN–Transformer frameworks [18] and multi-channel graph attention networks that include weather and price signals [19]. These methods emphasize the importance of modeling both temporal and spatial relations [20].

Another predictive feature of EV charging demand is user behavior heterogeneity. Clustering approaches, such as grouping users or stations with similar charging patterns, are employed to improve prediction accuracy [21]. Meteorological data also have a significant impact on charging behavior, with existing studies suggesting that physical conditions such as temperature and humidity act on top of the other drivers of EV charging demand [22]. Although these advancements are promising, several significant issues persist. First, most Transformer–LSTM hybrid models are assessed using private or limited datasets, which significantly limits their reproducibility and generalization [9]. Second, graph-based methods require explicit station-edge connectivity information as an essential input; however, such edge features are usually not available in real-world municipalities [5,18]. Third, the comprehensive integration of behavioral features, multi-lag temporal dependencies, cyclic encodings, and weather variables in a single end-to-end pipeline remains underexplored [12]. Fourth, very few controlled comparisons between RNN, LSTM, and Transformer models (with or without hybrids) have been performed under unified conditions [13,14]. Finally, municipal public charging environments have been studied less than residential or highway scenarios [4]. To fill these gaps, this study introduces a weather-aware Transformer–Long Short-Term Memory (LSTM) hybrid framework for spatio-temporal EV charging load forecasting at municipal public charging stations. The model combines multi-source features, such as temporal patterns, behavioral aggregates, and meteorological variables. Long-range dependencies are captured through multi-head self-attention in the Transformer encoder, whereas local sequential dynamics are modeled by the LSTM decoder.

The principal contributions of this study are summarized as follows:

Weather-aware feature engineering: An organized feature pipeline incorporating behavioral aggregates, multi-lag temporal dependencies, rolling statistics, cyclic encodings, and meteorological attributes to represent both short-term dynamics and seasonal trends of EV charging demand.
Transformer–LSTM hybrid evaluation: A rigorous application and controlled evaluation of a Transformer–LSTM hybrid architecture that uses global attention mechanisms to complement sequential temporal modeling and improve forecast accuracy.
Multi-site spatio-temporal validation under shared model weights: Joint training and validation across eight geographically distributed municipal charging stations, where 'spatio-temporal' refers to parallel temporal forecasting with shared parameters and weather-station-mapped meteorological inputs, rather than to explicit graph-structured modeling of inter-station interactions.
Benchmarking under controlled settings: A comparison with baseline models (RNN, LSTM, and encoder-only Transformer) under the same preprocessing, feature engineering, and training configurations, so that performance differences can be attributed on a fair basis.
Large-scale real-world validation: Evaluation on a multi-year public dataset containing 211,324 EV charging sessions from municipal stations, which allows for an empirically and practically relevant assessment of forecasting performance.

For clarity, we explicitly define the sense in which the term spatio-temporal is used in this paper. We refer to the joint modeling of multi-site temporal patterns under shared model weights, with spatial information injected through (a) station-specific scaling and (b) the geographic mapping between each charging station and its nearest ASOS weather station. We do not model explicit graph-structured interactions among stations; such graph-based extensions are identified as a future-work direction (Limitations and Future Work section). Relative to existing Transformer–LSTM hybrids [12,14,15,23], the present work contributes: (i) the first end-to-end integration of meteorological covariates, multi-lag autoregressive features, rolling statistics, cyclic encodings, and behavioral aggregates within a unified Transformer–LSTM pipeline for municipal EV charging load forecasting; (ii) controlled benchmarking against both classical baselines (Section 4.1) and contemporary state-of-the-art baselines (Section 4.5); (iii) a comprehensive ablation study isolating the contributions of each architectural component and feature group (Section 5.3); and (iv) validation on a multi-year real-world dataset of 211,324 charging sessions across eight heterogeneous municipal stations without station-specific retraining.

2. Related Work

2.1. Deep Learning and Transformer-Based Approaches

Deep learning is the dominant paradigm for EV charging load forecasting because it captures complex nonlinear temporal features from large-scale operational data. Zhou et al. [24] proposed a Bayesian deep learning framework for EV charging station load forecasting that performs variational inference over LSTM parameters and outperformed SVR and MLR baselines. Shen et al. [25] addressed forecasting under low-quality data conditions, showing that noise-robust LSTM models combined with imputation retain relatively good accuracy even when sensor records are highly corrupted. Shanmuganathan et al. [26] applied Empirical Mode Decomposition (EMD) before feeding a Deep LSTM whose parameters were tuned using the Arithmetic Optimization Algorithm (AOA); this approach improved accuracy on data collected at the Georgia Tech EV station. Vishnu et al. [27] performed a framework-agnostic comparison in which Long Short-Term Memory (LSTM) outperformed Support Vector Regression (SVR) and autoregressive models in short-term accuracy (RMSE = 5.9 kW, MAE = 4 kW) on real-time ACN data. Aduama et al. [11] demonstrated that structured multi-feature fusion (SMFF), which aggregates behavioral, contextual, and meteorological covariates, can deliver very large station-level improvements in forecast skill over univariate baselines.

The Transformer architecture (originally designed for Natural Language Processing (NLP) and later adapted to time series through self-attention) has become increasingly popular. Koohfar et al. [8] were among the first to apply a Transformer-based model to EV charging demand prediction, testing 7-, 30-, and 90-day horizons using data from Denver, Colorado, and consistently outperforming ARIMA, SARIMA, RNN, and LSTM baselines. A systematic follow-up study offers a detailed comparison of accuracy and computational trade-offs across multiple deep learning architectures and model families. Several notable contributions appeared in 2024. Manzoor et al. [28] compared conventional self-attention and ProbSparse attention within Transformer models on the Adaptive Charging Network (ACN) dataset, showing that ProbSparse attention improves accuracy relative to conventional self-attention while considerably reducing computational cost. Hu et al. [18] proposed a GCN–Transformer hybrid for load prediction at EV battery swapping stations, which models spatial couplings under heterogeneous topological configurations. Xiong et al. [14] proposed a Cyber–Physical Cognitive Control System that combines BiLSTM and Transformer modules to predict the EV charging demand of a commercial building, offering real-time responsiveness and robustness to problems such as incomplete data.

In 2025, a series of major advances in graph-based, attention-based, and probabilistic architectures appeared. Wang et al. [29] compared static graph-based methods with the attention-based spatial–temporal graph recurrent network (AST-GRN), which uses the station dependency graph over time and provides better performance for short-term EV demand forecasting. Tian et al. [5] proposed a Multi-Scale Spatial–Temporal Graph Attention Network (MS-STGAN), which introduces Pyramid Split Attention modules to extract multi-resolution features and reports consistent gains over state-of-the-art methods on four real-world EV datasets for both 7-day and 30-day tasks. Alaraj et al. [7] strengthened the argument for hybrid spatial–temporal designs with an exploratory comparison in which LSTM performed better at capturing temporal dependencies, whereas the GCN modeled graph-connected station structures. Alghamdi et al. [17] proposed the REST Network, a multi-component ensemble of EfficientNet and ResNet streams with BiLSTM, for EV load forecasting in mixed port supply-chain environments. Bouhamed et al. [30] advanced probabilistic forecasting by introducing a Transformer-based encoder–decoder that outputs distributional load forecasts at 24 h, one-week, and one-month horizons, with accuracy improvements of up to 11.7 percentage points on Malaysian electricity and French energy datasets. Yılmaz et al. [31] showed that Transformers can be generalized to EV battery state-of-charge estimation and achieved strong performance on a diverse set of real-world battery datasets relative to LSTM, BiLSTM, and SVR. Jia and Yang [15] proposed EVformer, a spatio-temporally decoupled Transformer with separate temporal and spatial branches and distinct attention modules, which addresses the quadratic time complexity of classical self-attention in order to predict instantaneous city-wide EV loads at an unprecedented scale.

2.2. Hybrid Deep Learning Models

Hybrid deep learning models that integrate several distinct modeling methodologies, taking advantage of their complementary analytical power, have consistently shown higher accuracy and robustness than single-architecture approaches [32]. In a seminal work, Ren et al. [33] proposed a serial SARIMA-LSTM hybrid in which SARIMA captures linear seasonality and LSTM models the remaining nonlinear dynamics. The hybrid outperformed six standalone baseline methods on a real-world dataset from a Spanish charging station covering 2015 to 2016. Da et al. [34] proposed a CNN-BiLSTM hybrid model trained on advanced measurement infrastructure (AMI) data, showing that the combined architecture extracts both local and long-term temporal dependencies better than standalone architectures for building-level load forecasting in smart solar microgrids.

Recently, hybrid architectures have incorporated ensembles of complex approaches and multi-component features. Osman et al. [10] integrated the base forecasters Prophet, TBATS, and LSTM using two ensemble strategies, RF regression and Gene Expression Programming (GEP), to achieve improved multi-seasonal EV charging load forecasting performance by taking advantage of the complementary temporal modeling capabilities of each constituent. Dhanawat et al. [16] proposed an Efficient BiLSTMNet that fuses the spatial features encoded by EfficientNet, the residual feature preservation of ResNet, and temporal sequence modeling by BiLSTM, with hyper-parameters optimized using the Enhanced Firefly Algorithm; R² = 0.90 was obtained on a dataset that integrated renewable energy types into EVs. Peng et al. [21] introduced a transferable ConvLSTM-based spatio-temporal framework for adaptive EV charging station siting and sizing, achieving a successful transfer from data-rich cities to data-poor cities.

The current state of the art in EV charging load forecasting involves hybrid architectures that include Transformer and LSTM modules. Hussain et al. [12] proposed an LSTM-based Transformer that replaces the usual linear projections in both the encoder and decoder with LSTM layers, combining LSTM sequential memory with Transformer global attention. Tested at multiple medium- and long-term horizons (30, 120, and 240 days) on the Adaptive Charging Network (ACN) dataset, the model surpassed standalone LSTM, Transformer, RNN, and ARIMA models. Siddiqui et al. [3] reported DeepBoost, a signal decomposition-based gradient boosting approach that combines VMD-based signal decomposition, BiLSTM temporal modeling, and gradient boosting, achieving competitive accuracy under heterogeneous behavioral characteristics. In parallel, architectural innovations and clustering-based hybrid frameworks have used behavioral segmentation to improve both accuracy and interpretability. Zhou et al. [6] noted that differential user characteristics are often not modeled explicitly in spatio-temporal forecasting models, which therefore yield a single representative load for urban community distribution network prediction. Private car owners differ from rideshare operators or commercial fleets, so modeling these distinctions can produce more accurate and interpretable predictions. Shahrokhi et al. [4] developed a two-stage framework that augments behavioral clustering with ensemble forecasting and showed that splitting station populations into behaviorally coherent segments, followed by segment-specific modeling, greatly reduces prediction error relative to undifferentiated approaches. In 2026, Mansour et al. [13] used a hybrid XGBoost-BiLSTM stacking ensemble to achieve superior performance on the ACN dataset compared to 24 competing methods on several evaluation metrics, highlighting that stacked learning strategies combining gradient-boosted feature selection with BiLSTM temporal modeling yield consistent gains. The most recent and comprehensive contribution was proposed by Yuan et al. [23], who combined a two-stage hierarchical clustering framework with a Transformer–BiLSTM hybrid to disentangle user behavioral heterogeneity from weather-driven demand variability [35]. Validated on large real-world datasets and compared against a variety of competing approaches, this framework consistently reduces MAE across data splits while producing actionable insights about user–weather interactions, marking another step towards accurate, interpretable, and operationally deployable EV load forecasting.

Table 1 provides a structured comparative overview of the surveyed approaches, highlighting their key techniques, datasets and reported outcomes. This synthesis reveals two persistent gaps: limited integration of weather-aware features into end-to-end hybrid frameworks and the absence of unified benchmarking across RNN, LSTM, Transformer, and hybrid architectures under identical experimental conditions. The proposed model directly addresses these gaps.

3. Materials and Methods

The architecture of the proposed spatio-temporal EV charging load forecasting framework is depicted in Figure 1, comprising four major stages: data acquisition, preprocessing and feature engineering, modeling, and evaluation. In the data acquisition phase, EV charging session records and meteorological data were downloaded from public sources. These datasets were then carefully processed according to a structured preprocessing pipeline consisting of data cleaning, spatial and temporal alignment, and weather station mapping with charging stations. During feature engineering, various categories of features were created, including cyclic temporal encodings, lag features for autoregressive modeling, rolling statistics in time-series and behavioral aggregates, and weather variables to set up a comprehensive input representation. The processed sequences were then fed into the proposed Transformer–LSTM hybrid model, where multi-head self-attention in the Transformer encoder captured long-range temporal dependencies, and local sequential dynamics and nonlinear patterns were modeled by an LSTM decoder. Finally, in the evaluation stage, we evaluated the model predictions using various performance metrics, such as RMSE, MAE, and MAPE, to ensure that the forecasting accuracy was well validated for different stations and time periods.
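The evaluation metrics named in this overview (RMSE, MAE, and MAPE, together with the R² reported in the abstract) can be illustrated with a minimal standard-library sketch. The function name `regression_metrics` and the decision to skip zero-load days when computing MAPE are assumptions made for illustration; the paper does not state how zero targets are handled.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, RMSE, MAPE (%) and R^2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # MAPE is undefined when a true value is zero; such days are skipped here
    # (an assumption of this sketch, not a rule taken from the paper).
    pct = [abs(e) / abs(t) for e, t in zip(errors, y_true) if t != 0]
    mape = 100.0 * sum(pct) / len(pct) if pct else float("nan")

    # R^2 = 1 - SS_res / SS_tot, with SS_tot taken around the mean of y_true.
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    ss_res = sum(e * e for e in errors)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mae": mae, "rmse": rmse, "mape": mape, "r2": r2}
```

Because MAPE divides by the true load, low-demand days inflate it far more than they inflate MAE or RMSE, which is one reason the three metrics are usually read together.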

3.1. Dataset Description

This study uses two publicly available real-world datasets to model EV charging demand in connection with the weather. The first dataset, obtained from the Department of Citywide Administrative Services (DCAS), includes session-level EV charging records from New York City (NYC) municipal parking facilities, and the second contains meteorological observations from the Automated Surface Observing System (ASOS) network. The EV charging dataset was obtained from the NYC Department of Transportation open data portal [36]. It originally comprised 211,384 charging session records with 15 attributes. To keep only the features useful for demand modeling, seven columns were dropped, including fields with high missingness such as Country and Invalidity Reason. The eight variables retained in the final dataset were Date, Station ID, Location Name, Connected Time, Disconnected Time, Charge Duration (min), Connected Duration (min), and Energy Provided (kWh).

After removing 29 incomplete records and applying temporal filtering, which removed a further 31 records, the cleaned dataset contained 211,324 charging sessions spanning from 31 July 2021 to 15 December 2025, across eight municipal charging stations located in Queens, the Bronx, Manhattan, Brooklyn, and Staten Island. The raw dataset contained 19 location name variants, which were consolidated into eight unique physical charging stations using a standardized mapping dictionary (LOC_MAP). Among the retained variables, Energy Provided (kWh) was used as the target variable for forecasting. Weather data were obtained from the Iowa Environmental Mesonet ASOS network [37]. The raw dataset included 189,874 hourly observations with 33 meteorological variables from three stations in the NYC region: NYC (Central Park), LGA (LaGuardia Airport), and JRB (John F. Barry). Based on their relevance to EV demand, six variables were selected: air temperature (tmpf), relative humidity (relh), apparent temperature (feel), wind speed (sped), precipitation (p01m), and snow depth (snowdepth). After aggregating the hourly data into daily values and aligning them with the EV dataset, the final weather dataset contained 4683 station-day records.

3.2. Data Preprocessing

The datasets were preprocessed using a structured pipeline for forecasting. It involves three steps: data cleaning, dataset integration, and post-merge preparation.

3.2.1. Data Cleaning

In the EV charging dataset, the 15 original fields were reduced to eight by eliminating administrative and non-informative fields. Examination of missing values showed only 29 records with an empty Location Name field; these were removed by listwise deletion. A temporal filter (31 July 2021–15 December 2025) was subsequently applied, which removed a further 31 records. The resulting dataset consisted of 211,324 charging sessions (211,384 − 29 − 31). The 19 raw location name variants in the dataset were mapped to eight distinct municipal charging locations using a mapping dictionary (LOC_MAP).

The ASOS weather data were preprocessed independently. Eight relevant fields were selected from the 33 original variables, and the timestamp field valid was renamed to Date. Hourly observations were then aggregated into daily values by station and date. Mean aggregation was used for temperature, relative humidity, apparent temperature, and wind speed; precipitation was summed; and snow depth was converted into a binary indicator. After the same temporal filter was applied, the final weather dataset included 4683 daily records across three stations.
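The daily aggregation rules described above (mean for four variables, sum for precipitation, binary flag for snow depth) can be sketched as follows. This is an illustrative standard-library version, not the authors' pipeline: the row layout (dictionaries keyed by the ASOS column names), the output name `snow_flag`, the "YYYY-MM-DD HH:MM" timestamp format, and the treatment of missing snow depth as "no snow" are all assumptions.

```python
from collections import defaultdict
from statistics import fmean

# Variables aggregated with a daily mean, per the description above.
MEAN_VARS = ("tmpf", "relh", "feel", "sped")


def aggregate_daily(hourly_rows):
    """Collapse hourly ASOS rows into one record per (station, date)."""
    groups = defaultdict(list)
    for row in hourly_rows:
        date = row["valid"][:10]  # assumes "YYYY-MM-DD HH:MM" timestamps
        groups[(row["station"], date)].append(row)

    daily = []
    for station, date in sorted(groups):
        rows = groups[(station, date)]
        record = {"station": station, "Date": date}
        for var in MEAN_VARS:
            values = [r[var] for r in rows if r.get(var) is not None]
            record[var] = fmean(values) if values else None
        # Precipitation is accumulated over the day rather than averaged.
        precip = [r["p01m"] for r in rows if r.get("p01m") is not None]
        record["p01m"] = sum(precip) if precip else None
        # Snow depth becomes a 0/1 indicator; missing readings count as 0
        # (an assumption of this sketch).
        snow = [r["snowdepth"] for r in rows if r.get("snowdepth") is not None]
        record["snow_flag"] = int(any(s > 0 for s in snow))
        daily.append(record)
    return daily
```

Days with no valid reading for a variable are left as `None` here so that a later interpolation step can fill them.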

As the EV dataset is keyed by facility names and the weather dataset is keyed by ASOS station codes, no common identifier exists for merging. Thus, a geographic mapping approach was implemented, assigning each charging station to the closest weather station according to its borough. Queens stations were mapped to LGA, Bronx stations to JRB, and Manhattan, Brooklyn, and Staten Island stations to NYC. A new variable, weatherstation, was created in the EV dataset based on this mapping. Finally, the datasets were combined with a left join on the composite key (Date, weatherstation) to ensure that all electric vehicle (EV) records were retained. After dropping duplicate columns, the final cleaned dataset had 211,324 records and 15 features combining EV charging and weather information. The final set of features is listed in Table 2.
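The borough-based mapping and the left join on (Date, weatherstation) can be illustrated with a small standard-library sketch. The borough-to-station assignments follow the text; the `borough` field on each EV record, the function name `attach_weather`, and the dictionary row layout are hypothetical and used only for illustration.

```python
# Borough -> ASOS station code, as stated in the text.
BOROUGH_TO_ASOS = {
    "Queens": "LGA",
    "Bronx": "JRB",
    "Manhattan": "NYC",
    "Brooklyn": "NYC",
    "Staten Island": "NYC",
}

WEATHER_VARS = ("tmpf", "relh", "feel", "sped", "p01m", "snowdepth")


def attach_weather(ev_rows, daily_weather):
    """Left-join daily weather onto EV rows via (Date, weatherstation)."""
    weather_index = {(w["Date"], w["station"]): w for w in daily_weather}
    merged = []
    for ev in ev_rows:
        station = BOROUGH_TO_ASOS[ev["borough"]]  # `borough` is a hypothetical field
        weather = weather_index.get((ev["Date"], station), {})
        row = dict(ev, weatherstation=station)
        for var in WEATHER_VARS:
            # Left-join semantics: the EV row is always kept, and weather
            # values are None when no matching observation exists.
            row[var] = weather.get(var)
        merged.append(row)
    return merged
```

Keeping unmatched EV rows with empty weather values, rather than dropping them, is what preserves the full session count through the merge; the gaps are then filled in the post-merge step.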

3.2.2. Post-Merge Processing

Additional steps were taken to prepare the merged dataset for forecasting. First, weather values that were missing because no ASOS observations existed within the time period were filled by linear interpolation within each weather station, followed by forward-fill and backward-fill to handle any remaining boundary cases. Second, session-level charging logs were aggregated to daily station-level observations. The charging energy (kWh) per station and day was computed as the sum over all sessions; the daily averages of the meteorological variables and the number of charging sessions were also calculated. The summed daily energy is labeled load_kwh; it represents the per-station charging load and is the target to be predicted. Third, feature scaling was performed with the StandardScaler, which standardizes each feature to zero mean and unit variance. The scaling parameters were estimated using the training data only to avoid data leakage. Finally, the dataset was divided chronologically into training (70%), validation (15%), and test (15%) subsets for each station. The input sequences were constructed using a sliding window of $T = 21$ consecutive days, where each sequence was used to predict the charging load for the following day.
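The chronological split, train-only scaling, and sliding-window construction can be sketched for a single station as follows. This is a minimal standard-library illustration under stated assumptions: the function names are hypothetical, the standardization mirrors the zero-mean, unit-variance behavior described for StandardScaler using the population standard deviation, and windows are built within each subset (the text does not say whether windowing happens before or after the split).

```python
from statistics import fmean, pstdev


def chronological_split(rows, train_pct=70, val_pct=15):
    """Split a time-ordered list into train/validation/test without shuffling."""
    n = len(rows)
    train_end = n * train_pct // 100
    val_end = n * (train_pct + val_pct) // 100
    return rows[:train_end], rows[train_end:val_end], rows[val_end:]


def fit_standardizer(train_rows):
    """Estimate per-feature mean and standard deviation on training rows only."""
    columns = list(zip(*train_rows))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) or 1.0 for col in columns]  # guard constant columns
    return means, stds


def standardize(rows, means, stds):
    """Apply previously fitted parameters; never refit on validation/test."""
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def make_windows(features, targets, window=21):
    """Pair each run of `window` consecutive days with the next day's target."""
    X, y = [], []
    for end in range(window, len(features)):
        X.append(features[end - window:end])  # days end-window .. end-1
        y.append(targets[end])                # load on day `end`
    return X, y
```

In use, `fit_standardizer` would be called once on the training subset, and the returned means and standard deviations reused for the validation and test subsets, so no statistic from later periods leaks into training.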

3.3. Exploratory Data Analysis

Exploratory data analysis (EDA) was performed before model development, including the examination of statistical features related to EV charging demand and exploration of any spatial–temporal patterns and weather factors. Insights from this analysis informed various modeling decisions made in the proposed framework.

The distribution of Energy Provided (kWh), along with the location-level variation and a normality check, is shown in Figure 2. The distribution is right-skewed, with a median of 19.3 kWh and a mean of 22.3 kWh, indicating a heavy upper tail. A few sessions exceed 75 kWh, and some exceed 140 kWh. Z-score analysis identified a total of 2349 outlier sessions ($|z| > 3$), which accounted for approximately 1.1% of the entire dataset. This heavy-tailed behavior motivates the use of RobustScaler and Huber loss, both of which are less sensitive to extreme values.
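The two ideas in this paragraph, flagging sessions with |z| > 3 and using a loss that grows only linearly for large errors, can be illustrated with a short standard-library sketch. The function names and the Huber threshold `delta=1.0` are assumptions for illustration; the text does not report the threshold used.

```python
from statistics import fmean, pstdev


def zscore_outliers(values, threshold=3.0):
    """Return indices of values whose absolute z-score exceeds `threshold`."""
    mu = fmean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # a constant series has no outliers
    return [i for i, v in enumerate(values) if abs((v - mu) / sigma) > threshold]


def huber_loss(y_true, y_pred, delta=1.0):
    """Mean Huber loss: quadratic for |error| <= delta, linear beyond it."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        r = abs(t - p)
        if r <= delta:
            total += 0.5 * r * r
        else:
            total += delta * (r - 0.5 * delta)
    return total / len(y_true)
```

With squared error, an error of 3 contributes 9 times as much as an error of 1; under the Huber loss with `delta=1.0` it contributes 2.5 versus 0.5, so rare extreme sessions pull the fit far less.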

The monthly Total Energy Delivered and Charging Session Volume are shown in Figure 3. Both measures increased consistently over the course of the study. Annual energy rose from 76 MWh in 2021 to 1709 MWh in 2024, and session counts climbed from 3606 to 84,611. The high co-movement between the two variables implies that the growth in demand is mainly driven by greater use of the charging infrastructure.

Figure 4 shows the spatial distribution of the cumulative energy delivered across the stations. The highest demand was recorded at Court Square (1501 MWh), followed by Delancey & Essex (1286 MWh) and Queens Borough Hall (1183 MWh). Stations such as Queens Family Court and St. George recorded far lower totals, so the lower end of the distribution indicates which stations are underutilized relative to the others. This strong spatial heterogeneity necessitates station-aware modeling instead of a single aggregated forecasting model.

The daily charging load time series for the selected stations are shown in Figure 5. Although all stations showed an overall upward trend, their growth rates and variability differed considerably. Some stations show sudden drops and recoveries that are likely due to operational disruptions or maintenance events. These patterns, which vary across both time and stations, suggest the need for models that can capture both temporal dependencies and the specific dynamics of each station.

The relationship between meteorological conditions and charging behavior is shown in Figure 6. Charging sessions during extreme cold deliver higher average energy, which is consistent with increased battery energy consumption at low temperatures. Snowy days corresponded to more low-energy sessions than clear days, which suggests that drivers may stay at home during severe weather conditions. Wind speed appears to have a relatively small effect on the charging demand.

3.4. Feature Engineering

Table 2 describes the 15 columns of the raw merged session-level dataset. The feature engineering pipeline generates 26 engineered daily features from these columns, which form the model input X \in R^{N \times T \times F} with N = 8, T = 21, and F = 26. To create a robust EV charging demand dataset, the pipeline combines temporal, behavioral, and meteorological information with autoregressive features. All features are computed independently for each charging station before the sequences are created. Calendar variables (month, weekday, and day of the year) were encoded with sine–cosine transformations to ensure cyclic continuity:

x_{sin} = sin (\frac{2 π x}{P}), x_{cos} = cos (\frac{2 π x}{P})

(1)

where x is the calendar variable and P its natural period, yielding six cyclic features alongside a binary weekend flag and a quarterly index. Furthermore, the day of the year is cyclically encoded (sin/cos, P = 365) to capture annual seasonality at a finer temporal granularity than the monthly encoding, giving the features dayofyear_sin and dayofyear_cos. The behavioral features are charge_dur_mean (mean daily charging duration) and session_count (number of sessions per station). Autoregressive features include past values of the target load_kwh at 1-, 2-, 3-, 7-, and 14-day lags, along with its 7- and 14-day rolling mean, standard deviation, minimum, and maximum, and exponential moving averages.
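The sine–cosine encoding of Equation (1) can be illustrated with a minimal sketch. The function name and the value conventions (months numbered 1 to 12) are assumptions made for illustration and are not taken from the authors' pipeline.

```python
import math

def cyclic_encode(x, period):
    """Return (sin, cos) of a calendar value x with natural period P, as in Eq. (1)."""
    angle = 2.0 * math.pi * x / period
    return math.sin(angle), math.cos(angle)

def distance(a, b):
    """Euclidean distance between two encoded points on the unit circle."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

# Assumed convention: months are numbered 1..12, so P = 12.
december = cyclic_encode(12, 12)
january = cyclic_encode(1, 12)
june = cyclic_encode(6, 12)

# December and January end up close together; December and June are opposite.
print(distance(december, january))
print(distance(december, june))
```

The point of the encoding is that adjacent calendar values stay adjacent across the period boundary, which a raw integer month (12 followed by 1) would not preserve.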

All autoregressive features are computed only from observations that precede the target date, ensuring complete temporal causality. That is, lag-k of load_kwh at time t uses the observed value at time t - k, and rolling statistics are calculated over the k days preceding the day in question, excluding that day. The inputs do not include any future information. For meteorological conditions, six ASOS weather variables were used: air temperature, relative humidity, apparent temperature, wind speed, precipitation, and snow depth. The final input contained F = 26 features per daily time step. With a lookback window of T = 21 days and N = 8 stations, the model input tensor is

X \in R^{N \times T \times F}

(2)

where N = 8, T = 21, and F = 26.
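The causality constraint on the autoregressive features can be sketched as follows. This is an illustrative pure-Python version under stated assumptions: the helper names and the toy load values are hypothetical, and the authors' implementation is not shown in the text.

```python
from statistics import mean

def lag_feature(series, k):
    """Value observed k days earlier; None where the history is too short."""
    return [series[t - k] if t >= k else None for t in range(len(series))]

def rolling_mean_excluding_today(series, window):
    """Mean of the `window` days strictly before day t (day t itself is excluded)."""
    out = []
    for t in range(len(series)):
        if t < window:
            out.append(None)  # not enough past observations yet
        else:
            out.append(mean(series[t - window:t]))
    return out

# Hypothetical daily loads (kWh) for one station.
load_kwh = [10.0, 12.0, 11.0, 15.0, 14.0, 13.0, 16.0, 18.0, 17.0]
lag_1 = lag_feature(load_kwh, 1)
lag_7 = lag_feature(load_kwh, 7)
roll_7 = rolling_mean_excluding_today(load_kwh, 7)
```

Because the rolling window ends at day t - 1, changing the load on day t leaves the feature for day t unchanged, which is the property that prevents target leakage.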

The engineered input features included in the forecasting model are listed in Table 3. The features include meteorological variables, behavioral indicators, cyclic temporal encodings, and autoregressive statistics, allowing the model to capture both short-term fluctuations and long-term seasonal patterns of EV charging demand. Weather and behavioral features represent external and usage-driven influences, whereas cyclic and calendar-based features encode periodic temporal structures. Lagged values, rolling statistics, and exponential moving averages let the model learn temporal dependencies and demand trends. In total, 26 features were developed to capture the spatio-temporal charging behavior.

3.5. Proposed Model Architecture

This study proposes a hybrid Transformer–Long Short-Term Memory architecture for forecasting EV charging loads while considering weather conditions. The model combines the ability of the Transformer to learn global dependencies from the entire sequence with the sequential learning capabilities of LSTM networks. The Transformer captures long-range temporal relationships through a self-attention mechanism, whereas the LSTM component learns local sequential dynamics and nonlinear temporal patterns. The complete architecture contains four parts: an input embedding layer, a Transformer encoder, an LSTM decoder, and a multilayer perceptron (MLP) output head, as shown in Figure 7.

Input embedding. The input sequence for a single station is represented as X \in R^{T \times F}, where T = 21 days and F = 26 features. The sequence is first projected into a latent representation using a linear transformation as follows:

Z^{(0)} = X W_{proj} + b_{proj}

(3)

where W_{proj} \in R^{F \times d_{model}} and d_{model} = 128. Sinusoidal positional encoding [38] is then added to incorporate temporal-order information.
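A minimal sketch of a sinusoidal positional encoding for T = 21 and d_model = 128 is given below. It assumes the commonly used formulation with base 10000 and alternating sine and cosine channels; the text does not state the exact variant, so the constants and the function name are assumptions.

```python
import math

def sinusoidal_positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model nested list with alternating sin/cos channels."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            # Channel pair (i, i + 1) shares one frequency; base 10000 is assumed.
            angle = pos / (10000.0 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = sinusoidal_positional_encoding(seq_len=21, d_model=128)
```

Each row of `pe` would be added element-wise to the corresponding row of the projected sequence Z^{(0)}.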

Transformer encoder. The embedded sequence is processed by four stacked Transformer encoder layers, each comprising multi-head self-attention and a feed-forward network. For each layer, the attention operation is calculated as

Attention (Q, K, V) = softmax (\frac{Q K^{T}}{\sqrt{d_{k}}}) V

(4)

where Q, K, and V are the query, key, and value matrices, respectively. The encoder employs h = 8 attention heads together with a model dimension d_{model} = 128 and a feed-forward dimension d_{ff} = 512. Regularization was performed with dropout (p = 0.15). The encoder generates a contextualized sequence representation Z^{(N_{enc})} that captures global temporal dependencies across the input window.
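The attention operation in Equation (4) can be sketched for a single head in plain Python. The toy matrices below are hypothetical, and the sketch omits the learned projections, the multi-head split, and dropout that a full encoder layer would include.

```python
import math

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    out, weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        out.append([sum(w_j * v[c] for w_j, v in zip(w, V)) for c in range(len(V[0]))])
    return out, weights

# Hypothetical toy inputs: 2 queries, 3 keys/values, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
context, attn_weights = attention(Q, K, V)
```

Each output row is a weighted average of the value rows, so every time step in the 21-day window can draw on any other time step directly.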

LSTM decoder. This contextualized sequence is subsequently fed into a two-layer LSTM decoder with a hidden dimension h_{LSTM} = 128. The LSTM models sequential dependencies in the encoded features using the standard gating mechanism:

i_{t} = σ (W_{ii} z_{t} + W_{hi} h_{t - 1} + b_{i})

(5)

f_{t} = σ (W_{if} z_{t} + W_{hf} h_{t - 1} + b_{f})

(6)

c_{t} = f_{t} ⊙ c_{t - 1} + i_{t} ⊙ tanh (W_{ig} z_{t} + W_{hg} h_{t - 1} + b_{g})

(7)

h_{t} = σ (W_{io} z_{t} + W_{ho} h_{t - 1} + b_{o}) ⊙ tanh (c_{t})

(8)

The final hidden state h_{T} represents an encoded temporal representation of the input sequence.
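One LSTM time step following Equations (5)–(8) can be sketched as below. The parameter layout (a dictionary of per-gate weight tuples), the toy sizes, and the all-zero weights are assumptions chosen so the result is easy to verify by hand; they are not the trained model's parameters.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def affine(W_in, z, W_h, h, b):
    """Compute W_in z + W_h h + b for one gate (weights as lists of rows)."""
    return [
        sum(w * v for w, v in zip(W_in[r], z))
        + sum(w * v for w, v in zip(W_h[r], h))
        + b[r]
        for r in range(len(b))
    ]

def lstm_step(z_t, h_prev, c_prev, params):
    """One LSTM step; params maps gate name -> (W_in, W_h, b)."""
    def pre(gate):
        W_in, W_h, b = params[gate]
        return affine(W_in, z_t, W_h, h_prev, b)

    i_t = [sigmoid(a) for a in pre("i")]    # input gate, Eq. (5)
    f_t = [sigmoid(a) for a in pre("f")]    # forget gate, Eq. (6)
    g_t = [math.tanh(a) for a in pre("g")]  # candidate cell update
    o_t = [sigmoid(a) for a in pre("o")]    # output gate
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]  # Eq. (7)
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]                  # Eq. (8)
    return h_t, c_t

# Hypothetical toy sizes: 3 input features, 2 hidden units, all-zero weights.
n_in, n_hid = 3, 2
zero_gate = (
    [[0.0] * n_in for _ in range(n_hid)],
    [[0.0] * n_hid for _ in range(n_hid)],
    [0.0] * n_hid,
)
params = {name: zero_gate for name in ("i", "f", "g", "o")}
h_t, c_t = lstm_step([1.0, 2.0, 3.0], [0.0, 0.0], [1.0, -1.0], params)
```

With zero weights every gate equals 0.5 and the candidate update is 0, so the cell state is simply halved, which makes the role of the forget gate in Equation (7) visible.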

Output head. The LSTM hidden state is passed to a multilayer perceptron consisting of two hidden layers (64 and 32 neurons) with GELU activation and dropout. The final layer produces a single scalar prediction as follows:

\hat{y} = f (h_{T})

(9)

where \hat{y} denotes the predicted daily EV charging load (load_kwh).

Algorithm:

The overall procedure for a forward pass and training of the proposed model is summarized in Algorithm 1. The input sequence is first mapped to a latent embedding with sinusoidal positional encoding (Steps 1 and 2), which subsequently passes through four Transformer encoder layers and captures the long-range temporal dependencies (Steps 3–9). The contextualized representations are then passed through a two-layer LSTM to capture the local sequential dynamics (Steps 10–16), followed by an output through a two-layer GELU-activated MLP that generates the final non-negative load prediction (Steps 17–19). Training uses Huber loss, AdamW optimization, cosine annealing scheduling, and gradient clipping with early stopping (Steps 20–24).

Baseline models: To evaluate the proposed architecture, we implemented three baseline models with the same preprocessing, feature set, and training settings: (i) a Simple RNN with hidden dimension 64; (ii) a two-layer LSTM with hidden dimension 96; and (iii) an encoder-only Transformer model with two attention layers (d_{model} = 96, h = 4). All models used a common MLP output head and optimization setup for a like-for-like comparison.

Advanced baseline models: To further validate the effectiveness of the proposed framework, three contemporary state-of-the-art time-series forecasting architectures were additionally implemented under identical preprocessing, feature engineering, and training settings. Informer [39] was selected for its ProbSparse self-attention mechanism, which efficiently captures long-range temporal dependencies while reducing computational complexity. PatchTST [40] was included as a recent Transformer-based forecasting model that partitions time-series sequences into patch-level tokens to improve local temporal representation and long-horizon forecasting accuracy. The Temporal Fusion Transformer (TFT) [41] was adopted due to its interpretable attention mechanism and capability to integrate temporal dynamics with variable selection and gating networks for multivariate forecasting tasks. All advanced baselines used the same chronological data split, optimizer, and evaluation protocol to ensure a fair comparison with the proposed Transformer–LSTM hybrid model.

Algorithm 1 Weather-Aware Transformer–LSTM Hybrid Model

Require: X \in R^{N \times T \times F}: input sequences (N = 8, T = 21, F = 26)
Ensure: \hat{y} \in R^{N}: predicted daily EV charging load (kWh)
Stage 1: Input Embedding
1: Z^{(0)} \leftarrow X W_{proj} + b_{proj} ▹ Linear projection: R^{F} \to R^{128}
2: Z^{(0)} \leftarrow Z^{(0)} + PE (T) ▹ Sinusoidal positional encoding
Stage 2: Transformer Encoder (4 layers)
3: for l = 1 to 4 do
4:     Q, K, V \leftarrow Z^{(l - 1)} W_{Q}, Z^{(l - 1)} W_{K}, Z^{(l - 1)} W_{V}
5:     A \leftarrow softmax (Q K^{⊤} / \sqrt{d_{k}}) V ▹ Multi-head attention (h = 8)
6:     \tilde{Z} \leftarrow LayerNorm (Z^{(l - 1)} + A)
7:     Z_{ff} \leftarrow FFN (\tilde{Z}) ▹ Feed-forward, d_{ff} = 512
8:     Z^{(l)} \leftarrow Dropout (LayerNorm (\tilde{Z} + Z_{ff}), p = 0.15)
9: end for
Stage 3: LSTM Decoder (2 layers)
10: Initialise h_{0}, c_{0} \leftarrow 0
11: for t = 1 to T do
12:     i_{t} \leftarrow σ (W_{ii} z_{t} + W_{hi} h_{t - 1} + b_{i})
13:     f_{t} \leftarrow σ (W_{if} z_{t} + W_{hf} h_{t - 1} + b_{f})
14:     c_{t} \leftarrow f_{t} ⊙ c_{t - 1} + i_{t} ⊙ tanh (W_{ig} z_{t} + W_{hg} h_{t - 1} + b_{g})
15:     h_{t} \leftarrow σ (W_{io} z_{t} + W_{ho} h_{t - 1} + b_{o}) ⊙ tanh (c_{t})
16: end for
Stage 4: MLP Output Head
17: h \leftarrow Dropout (GELU (W_{1} h_{T} + b_{1}), p = 0.15) ▹ 128 \to 64
18: h \leftarrow Dropout (GELU (W_{2} h + b_{2}), p = 0.15) ▹ 64 \to 32
19: \hat{y} \leftarrow max (W_{3} h + b_{3}, 0) ▹ Scalar output, clipped \geq 0
Training Configuration
20: Minimise L_{δ} (y, \hat{y}) via AdamW ▹ Huber loss, δ = 1.0
21: Apply CosineAnnealingLR ▹ η = 3 \times 10^{- 4}, η_{min} = 10^{- 6}
22: Apply gradient clipping ▹ ℓ_{2} norm \leq 1.0
23: Early stopping with patience = 20 epochs
24: return \hat{y}

3.6. Training Setup

All models were implemented using PyTorch 2.x and trained on a CUDA-enabled GPU. To ensure reproducibility, a fixed random seed of 42 was applied to Python (version 3.10), NumPy (version 1.24.0), and PyTorch.

The model parameters were optimized by minimizing the Huber loss (smooth L_{1} loss) with δ = 1.0:

L_{δ} (y, \hat{y}) = \begin{cases} \frac{1}{2} {(y - \hat{y})}^{2} & | y - \hat{y} | \leq δ \\ δ | y - \hat{y} | - \frac{1}{2} δ^{2} & \text{otherwise} \end{cases}

(10)

The Huber loss combines the sensitivity of the mean squared error for small deviations with the robustness of the mean absolute error for large deviations, making it suitable for EV charging datasets that may contain high-energy outlier sessions. The predicted values were clipped to non-negative values before the metric evaluation to ensure physically valid charging loads.
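The piecewise definition in Equation (10) can be written directly as a short function. This is a scalar sketch for illustration; the function names are hypothetical, and a training loop would normally use the framework's built-in loss.

```python
def huber(y, y_hat, delta=1.0):
    """Huber loss for one sample, following the two branches of Eq. (10)."""
    r = abs(y - y_hat)
    if r <= delta:
        return 0.5 * r * r                 # quadratic branch for small residuals
    return delta * r - 0.5 * delta * delta  # linear branch for large residuals

def mean_huber(y_true, y_pred, delta=1.0):
    """Average Huber loss over a batch of targets and predictions."""
    return sum(huber(a, b, delta) for a, b in zip(y_true, y_pred)) / len(y_true)

small = huber(0.0, 0.5)   # quadratic region
large = huber(0.0, 3.0)   # linear region
print(small, large)
```

The two branches meet at a residual of δ, where both evaluate to δ²/2, so the loss stays continuous while large outlier residuals grow only linearly.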

Parameter updates were performed using the AdamW optimizer [42] with a learning rate η = 3 \times 10^{- 4} and weight decay 10^{- 4}. The learning rate followed a cosine annealing schedule as follows:

η_{t} = η_{min} + \frac{1}{2} (η - η_{min}) (1 + cos (\frac{π t}{T_{max}}))

(11)

where T_{max} = 100 epochs and η_{min} = 10^{- 6}. This scheduling strategy enables a smooth learning rate decay and improves the optimization stability of deep sequence models [43,44,45].
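Equation (11) can be evaluated directly to see how the learning rate evolves. The sketch below uses the reported values (η = 3 × 10⁻⁴, η_min = 10⁻⁶, T_max = 100) and assumes t counts epochs starting from zero; the function name is hypothetical.

```python
import math

def cosine_annealing_lr(t, eta=3e-4, eta_min=1e-6, t_max=100):
    """Learning rate at epoch t under the cosine schedule of Eq. (11)."""
    return eta_min + 0.5 * (eta - eta_min) * (1.0 + math.cos(math.pi * t / t_max))

lr_start = cosine_annealing_lr(0)     # equals eta
lr_mid = cosine_annealing_lr(50)      # midpoint between eta and eta_min
lr_end = cosine_annealing_lr(100)     # equals eta_min
print(lr_start, lr_mid, lr_end)
```

The rate falls slowly at first, fastest around the midpoint, and flattens again near T_max.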

Several regularization strategies were applied to prevent overfitting. Dropout with probability p = 0.15 was used within the Transformer encoder, LSTM decoder, and output MLP layers. Gradient clipping was applied with a global ℓ_{2} norm threshold of 1.0 to stabilize backpropagation through the recurrent layers. Early stopping with a patience of 20 epochs was used, and the model checkpoint corresponding to the lowest validation loss was retained.
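The early-stopping rule with best-checkpoint retention can be sketched as a small helper. The class, the helper function, and the validation losses below are hypothetical, and a patience of 3 is used only to keep the example short (the study reports a patience of 20).

```python
class EarlyStopping:
    """Track the best validation loss and signal when patience is exhausted."""

    def __init__(self, patience=20):
        self.patience = patience
        self.best_loss = float("inf")
        self.best_epoch = None
        self.bad_epochs = 0

    def step(self, epoch, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch  # a real loop would save the checkpoint here
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience

def run(val_losses, patience):
    """Feed a sequence of validation losses and report (best, last) epoch."""
    stopper = EarlyStopping(patience)
    last_epoch = None
    for epoch, loss in enumerate(val_losses):
        last_epoch = epoch
        if stopper.step(epoch, loss):
            break
    return stopper.best_epoch, last_epoch

# Hypothetical validation curve: the minimum occurs at epoch 2.
best_epoch, last_epoch = run([0.9, 0.7, 0.6, 0.65, 0.64, 0.66, 0.63], patience=3)
print(best_epoch, last_epoch)
```

In this toy run, training stops three epochs after the last improvement, and the weights from the best epoch rather than the last one would be used for evaluation.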

Training batches were constructed with a batch size of 32 and shuffled at each epoch, whereas the validation and test batches were processed without shuffling. As described earlier, the dataset was split chronologically per station into 70% training, 15% validation, and 15% test partitions to prevent temporal leakage. The complete training configuration used for all the models is summarized in Table 4.
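A chronological 70/15/15 split applied per station can be sketched as follows. The station names and series lengths are hypothetical, and rounding the partition sizes is an assumption, since the text does not say how fractional sizes are handled.

```python
def chronological_split(series, train_frac=0.70, val_frac=0.15):
    """Split a time-ordered list into train/val/test without shuffling."""
    n = len(series)
    n_train = round(n * train_frac)
    n_val = round(n * val_frac)
    train = series[:n_train]
    val = series[n_train:n_train + n_val]
    test = series[n_train + n_val:]
    return train, val, test

# Hypothetical stations with different history lengths (values are day indices).
station_days = {"station_a": list(range(100)), "station_b": list(range(40))}
splits = {name: chronological_split(days) for name, days in station_days.items()}

for name, (train, val, test) in splits.items():
    print(name, len(train), len(val), len(test))
```

Because each station is split on its own timeline, every validation and test day lies strictly after that station's training days.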

3.7. Evaluation Metrics

The forecasting performance was assessed using six complementary metrics calculated on the test set after inverse scaling to the original kWh units. The use of multiple metrics gives a fuller picture of prediction accuracy, as each captures a different aspect of the error distribution.

Root Mean Square Error (RMSE) is the square root of the mean squared error between predicted and actual load values, defined as:

RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}

(12)

The RMSE penalizes large prediction mistakes more severely and is therefore sensitive to peak-load deviations.

Mean Absolute Error (MAE) measures the average magnitude of prediction errors:

MAE = \frac{1}{n} \sum_{i = 1}^{n} | y_{i} - {\hat{y}}_{i} |

(13)

Unlike the RMSE, the MAE treats all errors equally and provides an interpretable estimate of the typical forecasting error in kWh.

Mean Absolute Percentage Error (MAPE) measures prediction error in relation to the actual load:

MAPE = \frac{100}{| S |} \sum_{i \in S} |\frac{y_{i} - {\hat{y}}_{i}}{y_{i}}|

(14)

where S includes only samples with y_{i} > 1.0 kWh to avoid instability caused by near-zero values of the denominator.

Symmetric Mean Absolute Percentage Error (sMAPE) is a percentage-based error measure that normalizes each absolute error by the mean of the absolute actual and predicted values:

sMAPE = \frac{100}{n} \sum_{i = 1}^{n} \frac{| y_{i} - {\hat{y}}_{i} |}{\frac{1}{2} (| y_{i} | + | {\hat{y}}_{i} |)}

(15)

It provides a symmetric formulation of the percentage error, enabling more stability when the actual values are small.

Peak Signal-to-Noise Ratio (PSNR), a measure commonly used to assess reconstruction quality in lossy compression, relates the squared maximum of the observed load to the mean squared error:

PSNR = 10 {log}_{10} (\frac{{(max (y))}^{2}}{MSE})

(16)

where MSE = \frac{1}{n} \sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}. Higher PSNR values indicate better reconstruction quality and lower distortion in the predicted load signals.

Finally, the coefficient of determination (R^{2}) assesses how well the model accounts for the variance of the observed charging demand:

R^{2} = 1 - \frac{\sum_{i = 1}^{n} {(y_{i} - {\hat{y}}_{i})}^{2}}{\sum_{i = 1}^{n} {(y_{i} - \bar{y})}^{2}}

(17)

Better predictive performance is indicated by higher R^{2} values.

These metrics were calculated after reversing the RobustScaler transformation back to the original kWh scale.
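The six metrics in Equations (12)–(17) can be sketched in plain Python as below. The function names and the toy values are hypothetical; the inputs are assumed to be already on the original kWh scale.

```python
import math

def mse(y, p):
    return sum((a - b) ** 2 for a, b in zip(y, p)) / len(y)

def rmse(y, p):
    return math.sqrt(mse(y, p))

def mae(y, p):
    return sum(abs(a - b) for a, b in zip(y, p)) / len(y)

def mape(y, p, threshold=1.0):
    """MAPE over the subset S of samples with y_i > threshold (Eq. 14)."""
    pairs = [(a, b) for a, b in zip(y, p) if a > threshold]
    return 100.0 / len(pairs) * sum(abs((a - b) / a) for a, b in pairs)

def smape(y, p):
    return 100.0 / len(y) * sum(
        abs(a - b) / (0.5 * (abs(a) + abs(b))) for a, b in zip(y, p)
    )

def psnr(y, p):
    return 10.0 * math.log10(max(y) ** 2 / mse(y, p))

def r2(y, p):
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical daily loads (kWh); the first sample falls below the MAPE threshold.
y_true = [0.5, 100.0, 200.0, 400.0]
y_pred = [2.0, 110.0, 190.0, 420.0]
print(rmse(y_true, y_pred), mae(y_true, y_pred), mape(y_true, y_pred))
print(smape(y_true, y_pred), psnr(y_true, y_pred), r2(y_true, y_pred))
```

In this toy example the 0.5 kWh sample is excluded from MAPE but still counts toward sMAPE, which illustrates why the two percentage metrics can differ noticeably on low-demand days.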

4. Results

In this section, we show the forecasting performance of the proposed hybrid Transformer–LSTM model and three baseline architectures on held-out test sets for all eight municipal charging stations. The results are presented on the original kWh scale after performing an inverse transformation with respect to the RobustScaler.

4.1. Overall Forecasting Performance

The aggregate test-set performance of all four models, measured using six evaluation metrics, is tabulated in Table 5. The Transformer–LSTM hybrid achieved the best performance on all metrics: R^{2} of 0.9731, MAE of 62.71 kWh, RMSE of 94.21 kWh, MAPE of 19.62%, sMAPE of 15.54%, and PSNR of 27.31 dB. The Simple RNN exhibited the weakest performance among the four models (R^{2} = 0.9007, RMSE = 181.01 kWh, MAPE = 52.40%, sMAPE = 37.22%, PSNR = 21.64 dB), suggesting that it cannot capture the complex and non-stationary characteristics of EV charging demand well enough to avoid under- or over-prediction. The LSTM improved on the Simple RNN (R^{2} = 0.9215), and the Transformer performed better still (R^{2} = 0.9408), leveraging its global self-attention mechanism. The proposed hybrid model performs substantially better than the standalone Transformer, reducing the RMSE by 32.6% (from 139.76 to 94.21 kWh), MAPE by 34.5% (from 29.94% to 19.62%), and sMAPE by 30.9% (from 22.48% to 15.54%). It also has the highest PSNR (27.31 dB), which reflects the overall quality of the reconstructed signal. These results show that combining the two components allows the model to capture both longer-term dependencies and localized temporal patterns more effectively.
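The relative reductions quoted against the standalone Transformer follow from the reported metric values, as the short check below shows. The helper name is hypothetical; the numbers are those stated in the text.

```python
def relative_reduction(baseline, proposed):
    """Percentage reduction of `proposed` relative to `baseline`."""
    return 100.0 * (baseline - proposed) / baseline

rmse_drop = relative_reduction(139.76, 94.21)   # RMSE in kWh
mape_drop = relative_reduction(29.94, 19.62)    # MAPE in percent
smape_drop = relative_reduction(22.48, 15.54)   # sMAPE in percent

print(round(rmse_drop, 1), round(mape_drop, 1), round(smape_drop, 1))
```

Note that the MAPE and sMAPE figures are relative reductions of a percentage metric, not differences in percentage points.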

Figure 8 provides a joint comparison of the models in terms of correlation, normalized standard deviation, and RMSE. Among all the tested models, the proposed Transformer–LSTM model lies closest to the reference point, indicating the highest correlation and lowest error. In contrast, the baseline models were positioned farther from the reference, indicating lower accuracy and greater divergence from the observed data. This supports the stronger predictive power of the proposed hybrid model.

Figure 9 shows the actual and predicted EV charging loads for the entire test set. The model tracks both the low-demand regime in the first half of the test period and the rapid demand growth in the second half, which coincides with the commissioning of additional high-capacity stations such as Queens Borough Hall. The predicted values closely followed the actual load trajectory and showed no systematic bias, confirming the model’s ability to generalize across heterogeneous demand regimes.

Figure 10 displays the predictions from all the models for a representative test window of 200 samples, including the actual load. The Simple RNN demonstrated the greatest discrepancies compared to the actual series, consistently overestimating or underestimating demand during inflection points. Visually, the LSTM and standalone Transformer appear closer to the actual values but still suffer from smoothing artifacts at sharp demand transitions. The Transformer–LSTM hybrid (red) follows the actual signal most closely, with a better match on both the peaks and troughs of the load profile.

4.2. Baseline Model Comparison

The MAE and RMSE values of each model are shown as grouped bar charts in Figure 11. The Transformer–LSTM hybrid gives the lowest value on each error metric, reducing MAE by 49.3% relative to the Simple RNN baseline and by 32.7% relative to the standalone Transformer. The gap between MAE and RMSE is also smallest for the proposed model, which indicates fewer large-deviation (outlier) predictions than for the baselines.

The normalized distribution of the RMSE values for all models is shown in Figure 12. A lower median and a smaller spread indicate better predictive performance, and the proposed Transformer–LSTM model attained the lowest values. The baseline models showed higher and more dispersed errors.

Figure 13 presents the normalized radar chart of MAE, RMSE, and MAPE across all models. The Transformer–LSTM occupied the smallest polygon, demonstrating its consistently superior performance in all three error dimensions. The radar chart also illustrates the cumulative gain from moving from the Simple RNN to LSTM to Transformer to our proposed hybrid model.

4.3. Scatter Plot Analysis

The actual versus predicted scatter plots for all four models are shown in Figure 14. Points on the red diagonal identity line correspond to perfect predictions. The Simple RNN plot shows large dispersion, and the LSTM plot shows clustered point patterns at high load values (i.e., above 1000 kWh); both models underestimate these loads. The standalone Transformer reduces this dispersion but still has a visible spread at the highest loads. The Transformer–LSTM yields the tightest distribution of all models, with points clustered closely along the diagonal over the full demand range (from near zero up to 2200 kWh), which is consistent with its R^{2} of 0.9731.

In Figure 15, we visualize the uncertainty of the proposed model's predictions as 80% and 95% prediction intervals (PIs) around the predicted values, overlaid on the actual values. Most of the observed data fell within the 95% interval, which indicates that the uncertainty estimation is reasonably reliable. Wider intervals during high-variability periods indicate that prediction uncertainty is higher when sudden demand changes occur.

4.4. Spatial Generalization Across Stations

A comparison of the actual and predicted station-level average loads is presented for all eight station groups in Figure 16. The model reproduces the differences in demand magnitude between high-capacity stations (Court Square ≈ 1450 kWh/day, Queens Borough Hall ≈ 11,220 kWh/day) and low-utilization stations (Queens Family Court < 40 kWh/day). The mean predicted load closely matched the actual values for all stations, demonstrating that the station-specific RobustScaler and shared model weights together captured the network’s full spatial heterogeneity without requiring per-station tuning.

4.5. Comparison with Advanced Time-Series Forecasting Baselines

To complement the controlled four-model benchmark reported in Section 4.1, three state-of-the-art time-series forecasting architectures were additionally implemented and evaluated on the same dataset, preprocessing pipeline, 26-feature input representation, 21-day lookback window, and chronological 70/15/15 split per station as the original baselines: Informer, which introduces ProbSparse self-attention and self-attention distilling for efficient long-sequence modeling; the Temporal Fusion Transformer (TFT), which combines variable-selection networks, gated residual networks, and interpretable multi-head attention with static covariate enrichment; and PatchTST, which partitions the input sequence into patches and processes them through a channel-independent Transformer encoder. All three models were trained with the AdamW optimizer, Huber loss, cosine-annealing schedule, and early stopping identical to those used for the proposed model and the original baselines. Hyper-parameters of each advanced baseline were tuned on the validation set using a grid search around the configurations reported in their respective original papers.

The aggregated test-set performance of all seven models is summarized in Table 6. Among the three contemporary baselines, TFT achieves the best forecasting accuracy (R^{2} = 0.9612, RMSE = 109.42 kWh, MAPE = 22.74%), followed by PatchTST (R^{2} = 0.9583, RMSE = 113.74 kWh, MAPE = 23.86%) and Informer (R^{2} = 0.9489, RMSE = 124.18 kWh, MAPE = 26.83%). All three substantially outperform the encoder-only Transformer (R^{2} = 0.9408), confirming that these architectures genuinely improve upon the vanilla Transformer for this task.

The proposed Transformer–LSTM hybrid nevertheless achieves the best performance across every metric. Informer, PatchTST, and TFT exhibit 31.8%, 20.7%, and 16.1% higher RMSE than the proposed model, respectively, and MAPE differences of 7.21, 4.24, and 3.12 percentage points in the same order. These margins indicate that, for short-horizon next-day municipal EV charging load forecasting, the explicit pairing of a Transformer encoder with an LSTM decoder captures complementary temporal information that standalone advanced architectures do not fully exploit. Although the proposed hybrid is the largest model in the comparison (≈1.07 M parameters), its absolute footprint remains modest by contemporary deep learning standards and is well within deployment budgets for offline day-ahead forecasting workflows; PatchTST emerges as the most attractive lightweight alternative (≈178 K parameters), offering close-to-best accuracy at approximately one-sixth of the parameter count.

4.6. Statistical Reliability of Performance Differences

To verify that the reported performance differences are not artifacts of a single training run, every model in the benchmark was retrained with five random seeds ({42, 123, 256, 512, 1024}) under otherwise identical conditions (preprocessing, feature pipeline, optimizer, learning-rate schedule, and early-stopping criterion). Table 7 reports the mean and standard deviation across the five runs, together with paired two-sided Student’s t-test p-values and Wilcoxon signed-rank p-values (p_{W}). Both tests compare each baseline against the proposed Transformer–LSTM hybrid on the per-test-sample absolute errors (errors averaged across seeds first, then paired over the test samples). The pairing is performed over N_{test} test station-day samples, corresponding to the 15% chronological test split aggregated across the eight stations, giving N_{test} - 1 degrees of freedom for the paired t-test; the same N_{test} pairs are used for the Wilcoxon signed-rank test. The standard deviations are small (e.g., \pm 2.50 kWh on RMSE for the proposed model, corresponding to a coefficient of variation of approximately 2.6%), and all pairwise differences are statistically significant at the 0.1% level (p < 0.001 and p_{W} < 0.001 for all comparisons), confirming that the reported gains are not attributable to seed-level variance.
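The pairing procedure described above (average the absolute errors across seeds, then pair over test samples) can be sketched as follows. The error values, the two-seed setup, and the helper names are hypothetical; only the t statistic and degrees of freedom are computed here, since the p-value would in practice come from a statistics library.

```python
import math
from statistics import mean, stdev

def average_over_seeds(per_seed_errors):
    """Average per-sample absolute errors across seeds (one list per seed)."""
    return [mean(sample) for sample in zip(*per_seed_errors)]

def paired_t_statistic(errors_a, errors_b):
    """Paired t statistic and degrees of freedom for two matched error lists."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1

# Hypothetical absolute errors (kWh) for 5 test samples and 2 seeds per model.
baseline_runs = [[13.0, 14.0, 10.0, 21.0, 15.0], [11.0, 16.0, 8.0, 19.0, 13.0]]
proposed_runs = [[9.0, 12.0, 8.0, 16.0, 11.0], [11.0, 10.0, 8.0, 14.0, 13.0]]

baseline_err = average_over_seeds(baseline_runs)
proposed_err = average_over_seeds(proposed_runs)
t_stat, dof = paired_t_statistic(baseline_err, proposed_err)
print(t_stat, dof)
```

A positive t statistic here means the baseline's errors are larger on average than the proposed model's on the same test samples.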

4.7. Error Decomposition and Mitigation Strategies

To assess the practical applicability of the proposed model for grid dispatch and to identify where the 19.62% overall MAPE concentrates, the test-set residuals were decomposed along two complementary axes: by station and by daily-demand range (<100, 100–250, 250–500, 500–1000, and >1000 kWh/day).

Per-station performance is reported in Table 8. The three highest-utilization stations (Court Square, Queens Borough Hall, Delancey Essex) achieve MAPE between 11.47% and 12.31%. Mid-utilization stations (Jerome 190th, Jerome Gun Hill, Bay Ridge) yield MAPE between 14.92% and 18.34%. The overall MAPE is dominated by two low-utilization stations, St. George (MAPE 27.43%) and Queens Family Court (MAPE 43.85%), whose daily loads frequently fall below 30 kWh; in this regime the percentage error is amplified by the small denominator even though the absolute error remains small (MAE \leq 23 kWh). The aggregate R^{2} (0.9731) exceeds every per-station R^{2} because between-station mean differences (roughly 40-fold) dominate the total sum of squares.
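The effect of between-station mean differences on the pooled R^{2} can be reproduced with a small synthetic example. The two stations and their values below are hypothetical and chosen only to mimic a roughly 40-fold difference in mean load.

```python
def r2(y, p):
    """Coefficient of determination, as in Eq. (17)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, p))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical low- and high-utilization stations (kWh/day).
low_true, low_pred = [30.0, 40.0, 50.0], [35.0, 40.0, 45.0]
high_true, high_pred = [1500.0, 1600.0, 1700.0], [1550.0, 1600.0, 1650.0]

r2_low = r2(low_true, low_pred)
r2_high = r2(high_true, high_pred)
r2_pooled = r2(low_true + high_true, low_pred + high_pred)
print(r2_low, r2_high, r2_pooled)
```

Both stations have the same within-station R^{2}, yet the pooled value is much closer to 1 because the total sum of squares is inflated by the gap between the two station means while the residuals are unchanged.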

This pattern is confirmed by the demand-range decomposition in Figure 17. MAPE drops monotonically from 38.21% in the <100 kWh/day bin to 10.34% in the >1000 kWh/day bin. From an operational perspective, the peak-load forecasting task most relevant for capacity dispatch benefits from a comparatively low relative error on the high-demand stations.

Targeted mitigation strategies: Three concrete strategies could reduce the residual MAPE on the low-demand sub-population: (1) training a separate demand-tier model for low-utilization stations; (2) replacing the Huber loss with a percentage-error-weighted loss; (3) producing quantile or conformal prediction intervals. These directions are noted in the Limitations and Future Work section.

5. Discussion

5.1. Model Performance and Architectural Insights

The extensive experimental results show that the proposed Transformer–LSTM hybrid achieves superior performance compared to all three baselines under all evaluation metrics, with an R^{2} of 0.9731, RMSE of 94.21 kWh, MAE of 62.71 kWh, and MAPE of 19.62%. The relative performance gains are substantial: compared with the next-best baseline (the encoder-only Transformer), the RMSE is reduced by 32.6% and the MAPE by 34.5%. These results validate the fundamental architectural hypothesis that global self-attention and local recurrent modeling are complementary capabilities in EV charging load forecasting, such that their combination can provide gains that are unobtainable from each modality independently.

The performance improvement from the Simple RNN to the LSTM, then to the Transformer, and finally to the Transformer–LSTM hybrid is monotonic for all three metrics (Table 5). This orderly progression is consistent with the controlled benchmark design, in which every model shares the same preprocessing, feature set, training configuration, and evaluation procedure. These gains are therefore attributable to architectural differences rather than to potential confounding factors, such as different input representations or training budgets.

The multi-head self-attention mechanism of the Transformer encoder learns long-range temporal dependencies across the 21-day lookback window, which allows the model to pick up patterns such as weekly periodicity and multi-week demand trajectories that recurrent models generally struggle to retain beyond a few time steps. The LSTM decoder subsequently uses these contextualized sequence representations to model short-period sequential dynamics, such as momentum or day-to-day volatility, which are not well handled by the attention mechanism alone. Together, these two inductive biases produce both the tightest alignment of points in the scatter plot of Figure 14 and the smallest gap between MAE and RMSE in Figure 11 among all models tested.

5.2. Multi-Temporal Forecasting Consistency

The actual versus predicted loads for the four seasons are shown in Figure 18. The model maintains its accuracy in every season, despite large differences in both the magnitude of demand and its variability. Summer and winter peak loads are the highest (over 2000 kWh at high-utilization stations), driven by cold-weather battery consumption (e.g., heating) and high-mobility summer demand. The model captures these high-demand periods and also follows the lower and steadier demand dynamics observed in spring and fall. The stability of the performance across representative months of all seasons suggests that the weather features, specifically air temperature (tmpf), apparent temperature (feel), and the binary snow indicator (snowdepth), encode a significant seasonal signal that the model incorporates alongside the lagged load features.

Figure 19 shows the full prediction timeline of the Transformer–LSTM model from the start of the dataset until January 2026. Over the complete 4.5-year period, the model follows the major demand cycles and structural changes, including the explosive demand growth of late 2021 and early 2022 (the rapid buildout of the Court Square and Delancey Essex stations), the temporary decline in demand in mid-2022, and the renewed growth in 2024–2025. The highlighted best-prediction area (green shading, approximately time steps 1150–1200) corresponds to a period of stable moderate demand in which the predicted values nearly overlap the actual values, showing that the model can be applied under steady-state operating conditions. Wider deviations appear at sudden demand transitions (e.g., the sharp drop at approximately time step 1150) and account for most of the residual error, because such structural breaks cannot be predicted solely from a 21-day lookback window.

5.3. Ablation Study

To quantify the individual contribution of each architectural component and feature group of the proposed Transformer–LSTM hybrid, three complementary ablation analyses were conducted under identical preprocessing, optimizer, learning-rate schedule, random seed, and early-stopping criterion as the main experiment, so that any observed performance change is attributable solely to the ablated component.

5.3.1. Architectural Ablation

Four architectural variants were compared: (i) the encoder-only Transformer (Transformer-only), which removes the LSTM decoder and maps the encoder output directly to the MLP head; (ii) the LSTM-only model (LSTM-only), which replaces the Transformer encoder with an additional LSTM block; (iii) a reverse-order hybrid (LSTM → Transformer), in which the LSTM encodes local dynamics first, followed by Transformer self-attention as decoder; and (iv) the proposed Transformer → LSTM hybrid.

The Transformer-only and LSTM-only variants correspond exactly to the encoder-only Transformer and standalone LSTM baselines reported in Section 4.1, and are reproduced here in the ablation context for completeness. They retain their original Section 4.1 capacities (encoder-only Transformer: $d_{model} = 96$, $h = 4$, two attention layers; standalone LSTM: two layers, $h_{LSTM} = 96$). The LSTM → Transformer reverse-order hybrid was trained at the proposed model’s full capacity ($N_{enc} = 4$, $d_{model} = 128$, $h = 8$; $h_{LSTM} = 128$). This ablation therefore addresses the question of what each component adds to its respective baseline, rather than an iso-capacity component swap, and the performance differences reported in Table 9 are interpreted in this light.

As shown in Table 9, removing the LSTM decoder degrades $R^{2}$ from 0.9731 to 0.9408 and increases RMSE by 48.4%, whereas removing the Transformer encoder reduces $R^{2}$ to 0.9215 with a 70.8% increase in RMSE. The reverse-order hybrid yields intermediate performance ($R^{2} = 0.9512$), confirming that the encoder–decoder ordering is not arbitrary: long-range dependencies captured by self-attention provide a contextualized representation from which the LSTM can subsequently extract local sequential dynamics, but this complementarity is not symmetric.
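The 48.4% figure can be cross-checked against the 32.6% RMSE reduction relative to the standalone Transformer reported in the Conclusions: if the hybrid's RMSE is lower than a baseline's by a fraction r, the baseline's RMSE is higher than the hybrid's by r/(1 − r). The short sketch below works through that arithmetic. The helper name is illustrative, and the implied baseline RMSE values are derived here from the quoted percentages rather than read from Table 9.

```python
def increase_from_reduction(reduction: float) -> float:
    """Relative increase of a baseline over the improved model, given the
    improved model's fractional reduction relative to that baseline."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must lie in [0, 1)")
    return reduction / (1.0 - reduction)


proposed_rmse = 94.21             # kWh, reported for the hybrid
reduction_vs_transformer = 0.326  # reported reduction relative to the Transformer
increase_without_encoder = 0.708  # reported increase when the encoder is removed

increase_without_decoder = increase_from_reduction(reduction_vs_transformer)

# Baseline RMSE values implied by the quoted percentages (derived, not reported).
implied_transformer_rmse = proposed_rmse * (1.0 + increase_without_decoder)
implied_lstm_rmse = proposed_rmse * (1.0 + increase_without_encoder)

print(f"RMSE increase without the LSTM decoder: {increase_without_decoder:.1%}")
print(f"Implied Transformer-only RMSE: {implied_transformer_rmse:.1f} kWh")
print(f"Implied LSTM-only RMSE: {implied_lstm_rmse:.1f} kWh")
```

The first printed value is 48.4%, which agrees with the increase quoted above.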

5.3.2. Feature-Group Ablation

The contribution of each engineered feature category was assessed by retraining the proposed hybrid model after removing one feature group at a time while keeping the rest of the 26-feature pipeline intact (Figure 20). Autoregressive lag features contribute the largest performance gain (RMSE increases by 51.3% when removed), reflecting the strong temporal self-similarity of EV charging demand. Rolling statistics (+33.5% RMSE) and weather variables (+21.9% RMSE) provide the next largest contributions, validating the weather-aware design hypothesis. Cyclic encodings and behavioral aggregates yield smaller but non-negligible gains, indicating that periodic temporal structure and usage-driven signals add complementary information beyond what is captured by lagged loads alone.
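The leave-one-group-out procedure described above can be sketched as follows. This is a minimal illustration, not the authors' code: the group boundaries follow Table 3, the rolling-statistic column names (`roll7_mean` and so on) are hypothetical, and the autoregressive lag columns are omitted because Table 3 does not itemize them. Retraining itself is left out; the sketch only builds the feature subsets and the relative-RMSE metric used to compare them.

```python
# Feature groups following Table 3; rolling-statistic names are hypothetical.
FEATURE_GROUPS = {
    "weather": ["tmpf", "relh", "feel", "sped", "p01m", "snowdepth"],
    "behavioral": ["charge_dur_mean", "session_count"],
    "cyclic": ["month_sin", "month_cos", "weekday_sin", "weekday_cos",
               "dayofyear_sin", "dayofyear_cos"],
    "calendar": ["quarter", "is_weekend"],
    "rolling": [f"roll{window}_{stat}"
                for window in (7, 14)
                for stat in ("mean", "std", "min", "max")],
    "ema": ["ema_7", "ema_14"],
}


def leave_one_group_out(groups):
    """Yield (removed_group, remaining_features) for every feature group."""
    all_features = [name for cols in groups.values() for name in cols]
    for group, cols in groups.items():
        removed = set(cols)
        yield group, [name for name in all_features if name not in removed]


def relative_rmse_change(ablated_rmse: float, full_rmse: float) -> float:
    """Fractional RMSE increase of an ablated model over the full model."""
    return (ablated_rmse - full_rmse) / full_rmse


ablation_sets = dict(leave_one_group_out(FEATURE_GROUPS))
for group, kept in ablation_sets.items():
    print(f"without {group:<10} -> {len(kept)} features remain")
```

Each subset would be used to retrain the hybrid under the same settings as the full model, and `relative_rmse_change` would then give the percentages of the kind shown in Figure 20.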

5.3.3. Hyper-Parameter Sensitivity

A focused sensitivity analysis was performed on three principal architectural hyper-parameters (Figure 21): the number of Transformer encoder layers $N_{enc} \in \{2, 4, 6\}$, the number of attention heads $h \in \{4, 8, 16\}$, and the LSTM hidden dimension $h_{LSTM} \in \{64, 128, 256\}$. The configuration ($N_{enc} = 4$, $h = 8$, $h_{LSTM} = 128$) adopted in the main experiment offered the best accuracy–capacity trade-off, with deeper or wider configurations yielding gains that fall within the seed-level noise range reported in Table 7 (RMSE $\pm 2.50$ kWh, coefficient of variation $\approx 2.6\%$) at substantially higher computational cost. This confirms that the chosen architecture lies near the saturation point of the accuracy–capacity trade-off for the present forecasting task.
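The sketch below shows one way such a sensitivity sweep could be enumerated and judged against the seed-level noise band. It assumes that each hyper-parameter is varied in turn around the adopted configuration; the text does not state whether a full grid was used instead. The function names are illustrative.

```python
D_MODEL = 128          # encoder width of the proposed model
NOISE_RMSE_KWH = 2.50  # seed-level RMSE spread reported in Table 7

BASE = {"n_enc": 4, "heads": 8, "h_lstm": 128}
AXES = {
    "n_enc": (2, 4, 6),
    "heads": (4, 8, 16),
    "h_lstm": (64, 128, 256),
}


def one_at_a_time(base, axes):
    """Return the base configuration plus every single-parameter variation."""
    configs = [dict(base)]
    for key, values in axes.items():
        for value in values:
            if value != base[key]:
                variant = dict(base)
                variant[key] = value
                configs.append(variant)
    return configs


def gain_within_noise(candidate_rmse, base_rmse, noise=NOISE_RMSE_KWH):
    """True if the candidate's RMSE improvement does not exceed the noise band."""
    return (base_rmse - candidate_rmse) <= noise


configs = one_at_a_time(BASE, AXES)

# Multi-head attention splits d_model evenly across heads, so every head
# count in the sweep must divide the model width.
assert all(D_MODEL % cfg["heads"] == 0 for cfg in configs)

for cfg in configs:
    print(cfg)
```

Under this reading, a larger configuration would be preferred only if `gain_within_noise` returned `False` for it, that is, only if its improvement exceeded the ±2.50 kWh spread observed across seeds.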

5.3.4. Discussion of Ablation Findings

Three principal conclusions emerge from the ablation results. First, the Transformer encoder and the LSTM decoder play distinct and complementary roles: the encoder establishes long-range temporal context across the 21-day lookback window, while the decoder refines this representation through gated recurrent processing of local dynamics; removing either component produces a substantial accuracy loss that cannot be recovered by enlarging the remaining one. Second, the order of composition matters: the Transformer → LSTM arrangement consistently outperforms the reverse, suggesting that contextualization should precede sequential refinement. Third, among engineered features, the autoregressive-lag and rolling-statistic groups dominate the contribution budget, while weather features provide a meaningful secondary improvement consistent with the temperature-dependent variation visible in Figure 6, particularly under cold conditions where battery preconditioning loads elevate charging demand beyond what autoregressive and rolling features can fully anticipate.

5.4. Computational Efficiency and Deployment Feasibility

To assess practical deployability, training and inference cost were measured for every benchmarked model on the same hardware (NVIDIA Tesla T4, 16 GB), with results summarized in Table 10. End-to-end training time covers the full training loop with cosine-annealed AdamW and early stopping; single-sample inference latency is the wall-clock time for one forward pass on a 21-day input sequence, averaged over 1000 test samples after a 100-pass warm-up (batch size 1, FP32); peak GPU memory is the maximum observed during training.
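The latency protocol described above (warm-up passes followed by timed single-sample passes) can be sketched with the standard library alone. The `dummy_forward` function is a hypothetical stand-in for the model's forward pass, and the 21 × 26 window shape is taken from the lookback length and feature count stated in the paper. On a GPU, kernels are launched asynchronously, so a real measurement would also need to synchronize the device before each clock read (in PyTorch, for example, with `torch.cuda.synchronize()`).

```python
import time
from statistics import mean


def measure_latency_ms(forward, samples, warmup=100):
    """Mean wall-clock time per single-sample call, in milliseconds."""
    # Warm-up passes are executed but not timed.
    for i in range(warmup):
        forward(samples[i % len(samples)])

    timings = []
    for sample in samples:
        start = time.perf_counter()
        forward(sample)
        timings.append((time.perf_counter() - start) * 1000.0)
    return mean(timings)


def dummy_forward(window):
    """Hypothetical stand-in for a model forward pass on one input window."""
    return sum(sum(row) for row in window)


# 1000 test windows of 21 days x 26 features, batch size 1.
samples = [[[0.0] * 26 for _ in range(21)] for _ in range(1000)]
latency_ms = measure_latency_ms(dummy_forward, samples, warmup=100)
print(f"mean single-sample latency: {latency_ms:.4f} ms")
```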

The proposed Transformer–LSTM hybrid is the largest model in the comparison (≈1.07 M parameters, 248 MB peak GPU memory), but its single-sample inference latency remains under 2 ms. For a municipality operating tens of charging stations and producing day-ahead forecasts at daily resolution, the required inference budget is several orders of magnitude below the available throughput, and the model can be deployed on commodity hardware without specialized infrastructure. Training cost (≈33 min on a T4) is fully amortized by infrequent retraining schedules (e.g., monthly). For applications subject to tighter compute or memory constraints, Table 10 indicates that PatchTST offers a strong lightweight alternative (≈178 K parameters, 0.89 ms inference latency, 96 MB peak memory) at the cost of approximately 20.7% higher RMSE relative to the proposed model.
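To make the "several orders of magnitude" statement concrete, the following back-of-the-envelope sketch assumes a hypothetical fleet of 50 stations (the text says only "tens"), one sequential batch-size-1 forecast per station per day, the 2 ms upper bound on latency, and monthly retraining taken as every 30 days.

```python
import math

LATENCY_S = 2e-3                 # upper bound on single-sample latency (from the text)
SECONDS_PER_DAY = 24 * 60 * 60
TRAIN_MINUTES = 33               # approximate T4 training time (from the text)

stations = 50                    # hypothetical fleet size
retrain_every_days = 30          # hypothetical monthly retraining interval

# Total GPU time spent on inference per day, one forecast per station.
daily_inference_s = stations * LATENCY_S

# How many times the daily inference workload fits into one day.
headroom = SECONDS_PER_DAY / daily_inference_s
orders_of_magnitude = math.log10(headroom)

# Training cost spread evenly over the retraining interval.
amortized_training_s_per_day = TRAIN_MINUTES * 60 / retrain_every_days

print(f"daily inference time: {daily_inference_s:.2f} s")
print(f"headroom: about 10^{orders_of_magnitude:.1f}")
print(f"amortized training: {amortized_training_s_per_day:.0f} s per day")
```

Under these assumptions the daily inference workload is about 0.1 s, nearly six orders of magnitude below the 24 h available, and amortized training adds roughly one minute of GPU time per day.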

6. Conclusions

This study presents a weather-aware Transformer–LSTM hybrid framework for the spatio-temporal forecasting of EV charging loads at eight municipal stations in New York City. The model consists of a four-layer Transformer encoder ($h = 8$, $d_{model} = 128$) and a two-layer LSTM decoder ($h_{LSTM} = 128$), which allows long-range temporal dependencies (captured by the Transformer's self-attention) and local sequential dynamics to be learned jointly. The model was trained on 211,324 charging sessions spanning 4.5 years, using a 26-feature input pipeline consisting of temporal encodings, lag features, rolling statistics, behavioral aggregates, and meteorological variables. The proposed model ($R^{2} = 0.9731$, MAE = 62.71 kWh, RMSE = 94.21 kWh, and MAPE = 19.62%) outperforms the Simple RNN, standalone LSTM, and Transformer baselines, reducing RMSE by 47.9% relative to the Simple RNN and by 32.6% relative to the Transformer. It generalizes across stations with differing demand patterns and captures seasonal changes, including peak winter and summer loads. Monthly and yearly aggregated forecasts are only marginally biased and closely reflect actual demand over time. Overall, the evaluation indicates that the model combines accuracy, generality, and suitability for multi-year data, paving the way towards practical applications such as forensic simulation, infrastructure planning, and weather-aware day-ahead demand management.

Limitations and Future Work

We organize the limitations of the present study into four categories and pair each with concrete, actionable future-work directions.

Architectural limitations: The proposed framework treats each charging station as a parallel temporal-forecasting target with shared model weights and does not explicitly model inter-station spatial interactions. Future work will incorporate graph-structured spatial modules (e.g., Graph Attention Networks or graph-based informers) to capture demand spillover between geographically nearby stations. The encoder–decoder ordering and module sizes were determined by grid search (Section 5.3); a fully automated neural-architecture-search study over Transformer–LSTM hybrid configurations is a promising direction.

Data limitations: Charging stations were mapped to weather stations at the borough level, which loses local micro-climatic variations. Higher-resolution gridded reanalysis data (e.g., Copernicus ERA5-Land at ∼9 km horizontal resolution at mid-latitudes) and radar-derived precipitation could sharpen the meteorological signal. Operational metadata such as maintenance schedules and station outages are not currently ingested; integrating these signals as structured covariates would directly mitigate the residual error spikes observed at demand transitions (Figure 19).

Evaluation limitations: The framework currently produces single-step (next-day) point forecasts. Two extensions are concretely planned: (i) multi-step forecasting via iterative roll-out or direct multi-output decoding heads, supporting horizons of 3–14 days; (ii) probabilistic forecasting via pinball-loss-based quantile regression or split-conformal prediction intervals, delivering calibrated uncertainty estimates for risk-aware grid management. Per-tier loss reweighting (Section 4.7) to reduce MAPE on low-demand stations is an additional near-term improvement.

Deployment limitations: All measurements were conducted on a single GPU (NVIDIA Tesla T4). Although Section 5.4 establishes that inference latency is well within the budget for day-ahead operational use, embedded or edge deployment on CPU-only or constrained hardware has not been benchmarked here; lightweight student models distilled from the proposed hybrid, or PatchTST as a deployment-ready surrogate, are concrete pathways for that scenario. Emerging multimodal frameworks for ingesting unstructured operational signals (textual maintenance reports, policy notices, event-based grid signals) constitute a longer-term direction.
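The two probabilistic extensions named under the evaluation limitations rest on standard building blocks, sketched below in plain Python. This is an illustration of the general techniques, not part of the present framework: `pinball_loss` is the quantile-regression objective, and `split_conformal_halfwidth` computes the half-width of a symmetric prediction interval from absolute residuals on a held-out calibration set. Function names and the example numbers are hypothetical.

```python
import math


def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for quantile level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    total = 0.0
    for y, f in zip(y_true, y_pred):
        diff = y - f
        # Under-prediction is weighted by q, over-prediction by (1 - q).
        total += max(q * diff, (q - 1.0) * diff)
    return total / len(y_true)


def split_conformal_halfwidth(cal_true, cal_pred, alpha=0.1):
    """Half-width of a (1 - alpha) split-conformal interval.

    Uses absolute residuals on a calibration set as conformity scores and
    returns the ceil((n + 1)(1 - alpha))-th smallest score.
    """
    scores = sorted(abs(y - f) for y, f in zip(cal_true, cal_pred))
    n = len(scores)
    # round() guards the ceiling against floating-point noise.
    k = math.ceil(round((n + 1) * (1.0 - alpha), 9))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return scores[k - 1]


# Hypothetical calibration data: daily loads (kWh) and point forecasts.
cal_true = [510.0, 495.0, 530.0, 620.0, 580.0, 470.0, 505.0, 540.0, 600.0]
cal_pred = [500.0, 500.0, 520.0, 600.0, 590.0, 480.0, 500.0, 560.0, 570.0]

half_width = split_conformal_halfwidth(cal_true, cal_pred, alpha=0.1)
point_forecast = 550.0
interval = (point_forecast - half_width, point_forecast + half_width)
print(f"90% interval: {interval[0]:.1f} to {interval[1]:.1f} kWh")
print(f"pinball loss at q=0.9: {pinball_loss(cal_true, cal_pred, 0.9):.2f}")
```

A quantile-regression variant would train one output head per quantile with `pinball_loss` as the objective, whereas the split-conformal route leaves the trained point forecaster unchanged and only requires a calibration split.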

Author Contributions

Conceptualization, R.D., S.D. and T.K.; methodology, R.D., S.D. and M.U.M.; software, M.U.M.; validation, R.D., S.D. and T.K.; formal analysis, S.D. and M.U.M.; investigation, R.D. and S.D.; resources, T.K.; data curation, M.U.M.; writing—original draft preparation, R.D., S.D. and M.U.M.; writing—review and editing, R.D., S.D. and T.K.; visualization, M.U.M.; supervision, T.K.; project administration, T.K.; funding acquisition, T.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original EV charging data presented in the study are openly available on Data.gov at https://catalog.data.gov/dataset/electric-vehicle-ev-charging-data-municipal-lots-and-garages (accessed on 10 February 2026). Code and dataset: https://github.com/uzzal2200/Spatio-Temporal-Forecasting-of-Municipal-EV-Charging-Load-Using-Weather-Aware-Transformer-LSTM- (accessed on 10 February 2026).

Conflicts of Interest

Author Remon Das was employed by Dominion Energy, and author Sajib Debnath was employed by The AES Corporation. The authors confirm that the research was conducted independently and without any commercial or financial influence, and declare no conflict of interest.

References

1. Gül, T.; Pales, A.F.; Connelly, E. Global EV Outlook 2024: Moving towards increased affordability. In Electric Vehicles Initiative; International Energy Agency: Paris, France, 2024; Volume 79.
2. Jia, Z.; Li, J.; Zhang, X.-P.; Zhang, R. Review on Optimization of Forecasting and Coordination Strategies for Electric Vehicle Charging. J. Mod. Power Syst. Clean Energy 2023, 11, 389–400.
3. Siddiqui, J.; Ahmed, U.; Amin, A.; Alharbi, T.; Alharbi, A.; Aziz, I.; Khan, A.R.; Mahmood, A. Electric vehicle charging station load forecasting with an integrated DeepBoost approach. Alex. Eng. J. 2025, 116, 331–341.
4. Shahrokhi, S.; Wang, Z.; Paranjape, R.; Kozoriz, D.; Fick, J.; Pederson, S. A two-stage framework for spatiotemporal clustering and forecasting of electric vehicle charging. IEEE Access 2025, 13, 182345–182364.
5. Tian, R.; Wang, J.; Sun, Z.; Wu, J.; Lu, X.; Chang, L. Multi-scale spatial-temporal graph attention network for charging station load prediction. IEEE Access 2025, 13, 29000–29017.
6. Zhou, L.; Xu, Y.; Zhu, W.; Tang, B.; Li, Z. Spatio-temporal forecasting of EV charging demand in urban communities considering differentiated user characteristics. Int. J. Electr. Power Energy Syst. 2025, 172, 111249.
7. Alaraj, M.; Martins, C.; Radi, M.; Darwish, M.; Majdalawieh, M. Spatial–temporal deep learning for electric-vehicle charging demand: An exploratory study of GCN and LSTM networks performance. IEEE Access 2025, 13, 202203–202213.
8. Koohfar, S.; Woldemariam, W.; Kumar, A. Prediction of electric vehicles charging demand: A transformer-based deep learning approach. Sustainability 2023, 15, 2105.
9. Koohfar, S.; Woldemariam, W.; Kumar, A. Performance comparison of deep learning approaches in predicting EV charging demand. Sustainability 2023, 15, 4258.
10. Osman, H.; Azab, A.; AlGhazi, A.; Duffuaa, S.; Baki, F. Forecasting electric vehicle charging loads using random forest and gene expression programming ensemble models. Results Eng. 2025, 28, 108369.
11. Aduama, P.; Zhang, Z.; Al-Sumaiti, A.S. Multi-feature data fusion-based load forecasting of electric vehicle charging stations using a deep learning model. Energies 2023, 16, 1309.
12. Hussain, A.; Eswarakrishnan, V.; Aslam, A.; Tripura, S. Charging station demand forecasting using an LSTM-based hybrid transformer model. Sci. Rep. 2025, 15, 36639.
13. Mansour, H.S.E.; Mohamed, A.S.; Abdel-Aziz, M. Electric vehicles charging stations load forecasting based on hybrid XGBoost-BiLSTM model. Sci. Rep. 2026, 16, 374.
14. Xiong, X.; Huang, Z.; Chen, Y.; Sun, J. Load forecasting for commercial buildings using BiLSTM-Transformer network and cyber-physical cognitive control systems. Symmetry 2024, 16, 1601.
15. Jia, M.; Yang, B. EVformer: A spatio-temporal decoupled transformer for citywide EV charging load forecasting. World Electr. Veh. J. 2026, 17, 71.
16. Dhanawat, V.; Shinde, V.; Alami, R.; Akhunzada, A.; Faheem, Z.B.; Biswas, A. Electric vehicles charging station load forecasting integration with renewable energy using novel deep EfficientBiLSTMNet. IEEE Open J. Veh. Technol. 2025, 6, 2642–2661.
17. Alghamdi, E.A.; Alkinani, M.H.; Almazroi, A.A.; Alqarni, M.; Aldhahri, E.A.; Ayub, N. REST network: An ensemble deep learning approach for EV charging load forecasting in artificial port supply chains. IEEE Access 2025, 13, 133128–133144.
18. Hu, X.; Zhang, Z.; Fan, Z.; Yang, J.; Yang, J.; Li, S.; He, X. GCN-transformer-based spatio-temporal load forecasting for EV battery swapping stations under differential couplings. Electronics 2024, 13, 3401.
19. Ding, H.; Guo, Y.; Wang, H. Spatiotemporal forecasting of regional EV charging load: A multi-channel attentional graph network integrating dynamic electricity prices and weather conditions. Electronics 2025, 14, 4010.
20. Luo, R.; Song, Y.; Huang, L.; Zhang, Y.; Su, R. AST-GIN: Attribute-augmented spatiotemporal graph informer network for electric vehicle charging station availability forecasting. Sensors 2023, 23, 1975.
21. Peng, Y.; Ye, L.; Ouyang, W.; Xi, Q.; Wang, J.; Wang, X. A comprehensive framework for adaptive electric vehicle charging station siting and sizing based on transferable spatio-temporal demand prediction. IEEE Access 2025, 13, 109752–109770.
22. Nespoli, A.; Ogliari, E.; Leva, S. User behavior clustering based method for EV charging forecast. IEEE Access 2023, 11, 6273–6283.
23. Yuan, L.; Zhong, J.; Liu, Y.; Liu, X.; Wang, Y.; Dong, Z.Y. Two-Stage Hierarchical Clustering and Transformer–BiLSTM Hybrid Framework for Electric Vehicle Charging Load Forecasting. Int. J. Electr. Power Energy Syst. 2026, 174, 111461.
24. Zhou, D.; Guo, Z.; Xie, Y.; Hu, Y.; Jiang, D.; Feng, Y.; Liu, D. Using Bayesian Deep Learning for Electric Vehicle Charging Station Load Forecasting. Energies 2022, 15, 6195.
25. Shen, X.; Zhao, H.; Xiang, Y.; Lan, P.; Liu, J. Short-Term Electric Vehicle Charging Load Forecasting Based on Deep Learning in Low-Quality Data Environments. Electr. Power Syst. Res. 2022, 212, 108247.
26. Shanmuganathan, J.; Victoire, A.A.; Balraj, G.; Victoire, A. Deep Learning LSTM Recurrent Neural Network Model for Prediction of Electric Vehicle Charging Demand. Sustainability 2022, 14, 10207.
27. Vishnu, G.; Kaliyaperumal, D.; Pati, P.B.; Karthick, A.; Subbanna, N.; Ghosh, A. Short-Term Forecasting of Electric Vehicle Load Using Time Series, Machine Learning, and Deep Learning Techniques. World Electr. Veh. J. 2023, 14, 266.
28. Manzoor, T.; Lall, B.; Panigrahi, B.K. Transformer Models for EV Charging Demand Forecasting: Comparing Attention Mechanisms. In Proceedings of the 2024 Mediterranean Smart Cities Conference (MSCC), Tetouan, Morocco, 2–4 May 2024; pp. 1–5.
29. Wang, S.; Li, Y.; Shao, C.; Wang, P.; Wang, A.; Zhuge, C. Adaptive Spatio-Temporal Graph Recurrent Network for Short-Term Electric Vehicle Charging Demand Prediction. Appl. Energy 2025, 383, 125320.
30. Bouhamed, O.; Dissem, M.; Amayri, M.; Bouguila, N. Transformer-Based Deep Probabilistic Network for Load Forecasting. Eng. Appl. Artif. Intell. 2025, 152, 110781.
31. Yılmaz, M.; Çinar, E.; Yazıcı, A. A Transformer-Based Model for State of Charge Estimation of Electric Vehicle Batteries. IEEE Access 2025, 13, 33035–33048.
32. Debnath, S.; Mia, M.U.; Abubakkar, M.; Islam, M.R.; Mridul, M.S.I.; Biswas, A.K. Hybrid Multi-Scale Deep Learning Enhanced Electricity Load Forecasting Using Attention-Based Convolutional Neural Network and LSTM Model. IEEE Access 2026, 14, 13423–13444.
33. Ren, F.; Tian, C.; Zhang, G.; Li, C.; Zhai, Y. A Hybrid Method for Power Demand Prediction of Electric Vehicles Based on SARIMA and Deep Learning with Integration of Periodic Features. Energy 2022, 250, 123738.
34. Da, T.N.; Cho, M.-Y.; Thanh, P.N. Hourly Load Prediction-Based Feature Selection Scheme and Hybrid CNN–LSTM Method for Smart Solar Microgrid of Buildings. Expert Syst. 2024, 41, e13539.
35. Das, R.; Kandil, T.; Harris, A.; Herron, B.; J. Magnante, E. A Hybrid Deep Learning Framework for National-Level Power Generation Forecasting of Different Energy Sources Including Renewable Energy and Fossil Fuel. Energies 2026, 19, 1564.
36. City of New York. Electric Vehicle (EV) Charging Data—Municipal Lots and Garages. Data.gov. 2025. Available online: https://catalog.data.gov/dataset/electric-vehicle-ev-charging-data-municipal-lots-and-garages (accessed on 10 February 2026).
37. Iowa Environmental Mesonet. Automated Surface Observing System (ASOS) Weather Data. Iowa State University. 2025. Available online: https://mesonet.agron.iastate.edu/request/download.phtml?network=NY_ASOS (accessed on 10 February 2026).
38. Vaswani, A. Attention Is All You Need. arXiv 2017, arXiv:1706.03762.
39. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. Proc. AAAI Conf. Artif. Intell. 2021, 35, 11106–11115.
40. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series is Worth 64 Words: Long-Term Forecasting with Transformers. arXiv 2022, arXiv:2211.14730.
41. Lim, B.; Arik, S.O.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting. arXiv 2019, arXiv:1912.09363.
42. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the 7th International Conference on Learning Representations (ICLR 2019), New Orleans, LA, USA, 6–9 May 2019.
43. Debnath, S.; Basu, U.; Abubakkar, M.; Islam, M.R.; Debnath, M.; Biswas, A.K. Extreme Weather Grid Load Forecasting Using Weather-Informed LSTM and Transformer Machine Learning Models. In Proceedings of the 57th North American Power Symposium (NAPS), Hartford, CT, USA, 26–28 October 2025; pp. 1–7.
44. Loshchilov, I.; Hutter, F. SGDR: Stochastic Gradient Descent with Warm Restarts. In Proceedings of the 5th International Conference on Learning Representations (ICLR 2017), Toulon, France, 24–26 April 2017.
45. Hajhashemi, E.; Sauri Lavieri, P.; Nassir, N. Identifying consumers’ electric vehicle charging styles: A latent class cluster analysis. Transp. Res. Interdiscip. Perspect. 2024, 27, 101198.

Figure 1. Overall architecture of the proposed multi-site EV charging load forecasting framework, including data acquisition, preprocessing and feature engineering, sequence modeling with the proposed Transformer–LSTM hybrid, and performance evaluation.

Figure 2. Distribution of EV charging session energy (kWh), showing histogram-based distribution, location-level variability, and normality analysis.

Figure 3. Monthly EV charging energy delivered and charging session volume from July 2021 to December 2025, illustrating the long-term growth trend in charging demand.

Figure 4. Total cumulative energy delivered (MWh) across municipal charging stations. The figure illustrates strong spatial heterogeneity in station utilization and charging demand levels.

Figure 5. Daily EV charging load (kWh) across eight municipal charging stations from July 2021 to December 2025. Raw daily values and 7-day moving averages reveal heterogeneous growth patterns, seasonal trends, and occasional operational interruptions.

Figure 6. Relationship between meteorological conditions and EV charging behavior, illustrating the influence of weather variables on charging demand.

Figure 7. Architecture of the proposed Transformer–LSTM hybrid model for EV charging load forecasting, combining Transformer-based global temporal attention with LSTM-based sequential modeling.

Figure 8. Taylor diagram comparing forecasting performance of all benchmarked models in terms of correlation coefficient, normalized standard deviation, and RMSE relative to the observed data.

Figure 9. Actual versus predicted EV charging load (kWh) on the held-out test set using the proposed Transformer–LSTM model. The model accurately captures both low- and high-demand regimes without noticeable bias.

Figure 10. Comparison of actual and predicted EV charging load for Simple RNN, LSTM, Transformer, and the proposed Transformer–LSTM hybrid over a representative test window. The proposed model most closely follows the actual load trajectory.

Figure 11. MAE and RMSE comparison across the four benchmarked models on the held-out test set. The proposed Transformer–LSTM hybrid achieves the lowest values on both metrics, reducing MAE by 49.3% relative to the Simple RNN baseline and by 32.7% relative to the standalone Transformer.

Figure 12. Distribution of normalized RMSE values for all forecasting models, illustrating the error spread and prediction consistency of each method.

Figure 13. Radar-chart comparison of normalized MAE, RMSE, and MAPE across all benchmarked models, highlighting the overall superiority of the proposed Transformer–LSTM framework.

Figure 14. Scatter plots of actual versus predicted EV charging load for all models. The proposed Transformer–LSTM model shows the tightest clustering around the ideal diagonal reference line.

Figure 15. Prediction uncertainty of the proposed Transformer–LSTM model using 80% and 95% prediction intervals. Most observed values remain within the estimated confidence bounds.

Figure 16. Station-level average daily charging load (kWh): actual versus predicted values using the proposed Transformer–LSTM model. The model successfully captures inter-station demand variation without station-specific retraining.

Figure 17. Error decomposition of the proposed model across different daily demand ranges, showing both relative forecasting errors (MAPE) and absolute forecasting errors (MAE and RMSE).

Figure 18. Actual versus predicted EV charging load across all seasons, demonstrating stable forecasting accuracy under varying seasonal demand and weather conditions.

Figure 19. Full-timeline EV charging load prediction using the proposed Transformer–LSTM model from 2021 to January 2026, showing accurate tracking of long-term demand trends, seasonal variations, and major demand transitions.

Figure 20. Feature group ablation analysis of the proposed Transformer–LSTM framework. The figure illustrates the impact of removing individual feature groups on forecasting performance, highlighting the importance of autoregressive lag features, rolling statistics, and weather variables.

Figure 21. Hyper-parameter sensitivity analysis of the proposed Transformer–LSTM model, showing the effects of varying the number of Transformer encoder layers, attention heads, and LSTM hidden dimensions on RMSE and $R^{2}$ performance.

Table 1. Summary of existing and proposed EV load forecasting approaches.

Ref.	Year	Model/Architecture	Key Technique	Dataset/Case Study	Main Contribution	Performance/Outcome
Siddiqui et al. [3]	2025	DeepBoost Hybrid	VMD signal decomposition + BiLSTM + Gradient Boosting	EV charging datasets	Hybrid decomposition and boosting framework	Competitive forecasting accuracy
Shahrokhi et al. [4]	2025	Clustering-Ensemble Framework	Behavioral clustering + ensemble forecasting	EV station datasets	Segment-specific forecasting models	Reduced prediction errors
Tian et al. [5]	2025	MS-STGAN	Multi-Scale Spatial–Temporal Graph Attention with Pyramid Split Attention	Four real-world EV datasets	Multi-resolution spatial-temporal feature extraction	State-of-the-art results for 7-day and 30-day forecasts
Zhou et al. [6]	2025	Spatio-Temporal Hybrid	Behavioral segmentation of EV users	Urban EV charging network	Models heterogeneous user behavior (private, rideshare, fleet)	Improved interpretability and accuracy
Alaraj et al. [7]	2025	Hybrid LSTM + GCN analysis	Exploratory comparative framework	EV charging networks	Demonstrated complementary roles of temporal and spatial models	LSTM best for temporal; GCN best for spatial modeling
Koohfar et al. [8]	2023	Transformer	Self-attention based time-series forecasting	Denver EV charging data	First Transformer application for EV demand forecasting	Outperformed ARIMA, SARIMA, RNN, and LSTM
Osman et al. [10]	2025	Ensemble Hybrid	Prophet + TBATS + LSTM combined via RF and GEP ensembles	EV charging load data	Multi-seasonal ensemble forecasting exploiting complementary models	Improved forecasting accuracy
Aduama et al. [11]	2023	Multi-feature fusion model	Structured Multi-Feature Fusion (SMFF)	EV charging station data	Integration of behavioral, contextual, and meteorological features	Significant improvement over univariate baselines
Hussain et al. [12]	2025	LSTM–Transformer	Replace Transformer linear projections with LSTM layers	ACN dataset	Combines sequential memory with global attention	Outperformed LSTM, Transformer, RNN, ARIMA
Mansour et al. [13]	2026	XGBoost–BiLSTM Stacking	Gradient boosting feature selection + BiLSTM temporal modeling	ACN dataset	Stacking ensemble hybrid framework	Outperformed 24 competing methods
Xiong et al. [14]	2024	BiLSTM + Transformer	Cyber–Physical Cognitive Control System	Commercial building EV charging	Hybrid architecture for real-time and robust forecasting	Robust against incomplete data
Jia & Yang [15]	2026	EVformer	Decoupled spatial–temporal Transformer attention	City-scale EV load dataset	Independent spatial and temporal attention modules	Improved scalability and forecasting performance
Dhanawat et al. [16]	2025	EfficientBiLSTMNet	EfficientNet + ResNet + BiLSTM with Enhanced Firefly Algorithm tuning	Renewable energy integrated EV dataset	Multi-stream spatial–temporal hybrid architecture	Achieved $R^{2} = 0.90$
Alghamdi et al. [17]	2025	REST Network	Ensemble of EfficientNet, ResNet, and BiLSTM	EV supply-chain charging ports	Multi-stream ensemble forecasting framework	Improved EV load prediction accuracy
Hu et al. [18]	2024	GCN–Transformer Hybrid	Spatial graph modeling + Transformer temporal learning	EV battery swapping station	Captured spatial dependencies across stations	Improved forecasting accuracy
Peng et al. [21]	2025	ConvLSTM Hybrid	Transferable spatio-temporal ConvLSTM framework	Multi-city EV charging station datasets	Transfer learning for station siting and sizing	Effective transfer from data-rich to data-poor cities
Zhou et al. [24]	2022	Bayesian LSTM	Variational inference over LSTM parameters	EV charging station data	Probabilistic deep learning framework for EV load forecasting	Outperformed SVR and MLR baselines
Shanmuganathan et al. [26]	2022	Deep LSTM	Empirical Mode Decomposition (EMD) + Arithmetic Optimization Algorithm (AOA) tuning	Georgia Tech EV charging station	Signal decomposition with optimized deep LSTM training	Improved prediction accuracy
Manzoor et al. [28]	2024	Transformer	ProbSparse attention mechanism	ACN dataset	Efficient Transformer attention mechanism	Higher accuracy with reduced computation cost
Bouhamed et al. [30]	2025	Transformer Encoder–Decoder	Probabilistic distribution forecasting	Malaysian & French energy datasets	Multi-horizon probabilistic EV load forecasting	Up to 11.7% accuracy improvement
Ren et al. [33]	2022	SARIMA–LSTM Hybrid	Serial hybrid (SARIMA for linear patterns + LSTM for nonlinear dynamics)	Spanish EV charging station data (2015–2016)	First hybrid model integrating statistical and deep learning forecasting	Outperformed six standalone baseline models
Da et al. [34]	2024	CNN–BiLSTM	CNN for spatial feature extraction + BiLSTM for temporal modeling	AMI data from solar microgrid building	Captured spatial and long-term temporal dependencies simultaneously	Better performance than standalone models
Yuan et al. [23]	2026	Transformer–BiLSTM Hybrid	Two-stage hierarchical clustering + hybrid deep model	Real-world EV charging datasets	Disentangles user behavior and weather effects	Consistent MAE reduction and improved interpretability
This Study	2026	Weather-Aware Transformer–LSTM Hybrid	Weather-aware attention with hybrid temporal modeling	DCAS real-world NYC EV and ASOS weather datasets	Integrated weather features into Transformer–LSTM for EV load forecasting	Improved MAE, RMSE, and MAPE over baseline models

Table 2. Description of features in the merged EV charging and meteorological dataset.

Serial No.	Feature	Short Description
1	Date	Calendar date of the EV charging session.
2	Station ID	Unique identifier assigned to each municipal EV charging station.
3	Location Name	Name of the municipal parking facility where the charging session occurred.
4	Connected Time	Timestamp when the EV was connected to the charging station.
5	Disconnected Time	Timestamp when the EV charging session ended.
6	Charge Duration (min)	Total duration of active charging during the session (minutes).
7	Connected Duration (min)	Total duration for which the vehicle remained connected to the charger.
8	Energy Provided (kWh)	Total electrical energy delivered during the charging session. This variable was used as the forecasting target.
9	weather_station	Identifier of the mapped ASOS weather station associated with the charging location.
10	tmpf	Daily mean air temperature (°F).
11	relh	Daily mean relative humidity (%).
12	feel	Daily mean apparent temperature (°F).
13	sped	Daily mean wind speed (mph).
14	p01m	Total daily precipitation accumulation.
15	snowdepth	Binary indicator representing the presence of snow on the ground.

Table 3. Summary of engineered input features used in the forecasting model.

Category	Features	Count
Weather	tmpf, relh, feel, sped, p01m, snowdepth	6
Behavioral	charge_dur_mean, session_count	2
Cyclic (month)	month_sin, month_cos	2
Cyclic (weekday)	weekday_sin, weekday_cos	2
Cyclic (day-of-year)	dayofyear_sin, dayofyear_cos	2
Calendar flags	quarter, is_weekend	2
Rolling 7-day	mean, std, min, max	4
Rolling 14-day	mean, std, min, max	4
EMA	ema_7, ema_14	2
TOTAL		26
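
The cyclic, rolling, and EMA groups in Table 3 can be illustrated with a minimal standard-library sketch. It is not the authors' pipeline. The periods (12, 7, 365), the one-day shift that keeps the current day's load out of its own rolling window, the sample standard deviation, the EMA smoothing factor alpha = 2 / (span + 1), and all function names are assumptions made for illustration.

```python
import math
from statistics import mean, stdev


def cyclic_encode(value, period):
    """Map a periodic integer (e.g. weekday 0-6) to a (sin, cos) pair."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


def rolling_stats(series, window):
    """Trailing mean/std/min/max over the previous `window` values.

    Assumption: the window ends one step before t, so the value being
    forecast never appears in its own features. Requires window >= 2.
    """
    rows = []
    for t in range(len(series)):
        past = series[max(0, t - window):t]
        if len(past) < window:
            rows.append(None)  # not enough history yet
        else:
            rows.append((mean(past), stdev(past), min(past), max(past)))
    return rows


def ema(series, span):
    """Exponential moving average with alpha = 2 / (span + 1) (assumed)."""
    alpha = 2.0 / (span + 1)
    out = [series[0]]
    for x in series[1:]:
        out.append(alpha * x + (1.0 - alpha) * out[-1])
    return out


if __name__ == "__main__":
    daily_kwh = [310.0, 295.5, 402.1, 388.0, 120.4, 98.7, 305.2, 330.9]
    print(cyclic_encode(6, 7))          # weekday_sin, weekday_cos for day 6
    print(rolling_stats(daily_kwh, 7))  # only the last entry has 7 days of history
    print(ema(daily_kwh, 7))
```

With the sine and cosine pair, weekday 6 and weekday 0 sit next to each other on the unit circle, which a raw integer encoding would not express.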

Table 4. Model parameters and training configuration.

Parameter	Model	Value
Params	Simple RNN	12.1 K
	LSTM	129.47 K
	Transformer	160.38 K
	Transformer_LSTM	1.07 M
Layers	Simple RNN	10
	LSTM	10
	Transformer	34
	Transformer_LSTM	55
FLOPs (reported as MACs)	Simple RNN	127.49 KMac
	LSTM	2.59 MMac
	Transformer	3.37 MMac
	Transformer_LSTM	22.79 MMac
Sequence Length		21
Batch Size		32
Epochs		100
Learning Rate		0.0003
Optimizer		AdamW
Loss Function		HuberLoss
Scheduler		CosineAnnealingLR
Dropout		0.15
GPU		NVIDIA Tesla T4
Framework		PyTorch
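
The loss and scheduler entries in Table 4 correspond to simple closed-form expressions, sketched below in plain Python rather than PyTorch so the arithmetic is visible. Table 4 does not report the Huber threshold, the scheduler period, or the minimum learning rate. The values delta = 1.0, t_max = 100 (taken as equal to the epoch count), and eta_min = 0 are therefore assumptions, and the function names are illustrative.

```python
import math


def huber(error, delta=1.0):
    """Huber loss for one residual: quadratic near zero, linear in the tails."""
    a = abs(error)
    if a <= delta:
        return 0.5 * error * error
    return delta * (a - 0.5 * delta)


def cosine_annealing_lr(epoch, base_lr=3e-4, t_max=100, eta_min=0.0):
    """Cosine schedule: base_lr at epoch 0, decaying to eta_min at epoch t_max."""
    cosine = math.cos(math.pi * epoch / t_max)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + cosine)


if __name__ == "__main__":
    # Learning rate at a few checkpoints of a 100-epoch run.
    for epoch in (0, 25, 50, 75, 100):
        print(epoch, f"{cosine_annealing_lr(epoch):.2e}")
    # Small residuals are penalised quadratically, large ones linearly.
    print(huber(0.5), huber(3.0))
```

The linear tail is what makes the Huber loss less sensitive than squared error to occasional large load spikes.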

Table 5. Comparative performance evaluation of baseline and proposed models across multiple forecasting metrics.

Model	R²	MAE (kWh)	RMSE (kWh)	MAPE (%)	sMAPE (%)	PSNR (dB)
Simple RNN	0.9007	123.7652	181.0053	52.40	37.2223	21.6434
LSTM	0.9215	108.1074	160.9423	34.01	26.2883	22.6638
Transformer	0.9408	93.1184	139.7643	29.94	22.4808	23.8893
Transformer–LSTM (Proposed)	0.9731	62.7131	94.2132	19.62	15.5407	27.3150
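
For readers who want to recompute the columns of Table 5 on their own predictions, the following is a minimal sketch of the six metrics. The exact definitions are assumptions where conventions differ: sMAPE is taken as 2|e| / (|y| + |ŷ|), and PSNR uses the maximum observed load as the peak value. The function name and the `eps` guard are illustrative.

```python
import math


def regression_metrics(y_true, y_pred, eps=1e-8):
    """Return R2, MAE, RMSE, MAPE, sMAPE and PSNR for two equal-length lists."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # Percentage errors; eps guards against division by zero on idle days.
    mape = 100.0 * sum(
        abs(e) / max(abs(t), eps) for e, t in zip(errors, y_true)
    ) / n
    smape = 100.0 * sum(
        2.0 * abs(e) / max(abs(t) + abs(p), eps)
        for e, t, p in zip(errors, y_true, y_pred)
    ) / n

    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot  # undefined if y_true is constant

    # Assumed PSNR convention: peak = largest observed load.
    psnr = math.inf if rmse == 0 else 20.0 * math.log10(max(y_true) / rmse)

    return {"r2": r2, "mae": mae, "rmse": rmse,
            "mape": mape, "smape": smape, "psnr": psnr}


if __name__ == "__main__":
    print(regression_metrics([100.0, 200.0, 300.0], [110.0, 190.0, 330.0]))
```

MAPE divides by the true load, so days with very small loads dominate it, which is one reason a scale-free metric such as sMAPE is reported alongside it.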

Table 6. Performance comparison of the proposed Transformer–LSTM model with advanced time-series forecasting baselines under identical preprocessing, feature engineering, and training settings.

Model	R²	MAE (kWh)	RMSE (kWh)	MAPE (%)	sMAPE (%)	PSNR (dB)	Params (K)
Informer	0.9489	82.47	124.18	26.83	20.31	24.92	285.00
PatchTST	0.9583	74.92	113.74	23.86	18.62	25.71	178.00
TFT	0.9612	71.85	109.42	22.74	17.93	26.05	412.00
Transformer–LSTM (Proposed)	0.9731	62.71	94.21	19.62	15.54	27.32	1070.00

Table 7. Multi-seed statistical evaluation of all benchmarked models, reporting mean and standard deviation across five independent training runs.

Model	R²	MAE (kWh)	RMSE (kWh)	MAPE (%)	p vs. Proposed	$p_{W}$
Simple RNN	0.8993 ± 0.0041	124.43 ± 3.21	181.62 ± 4.48	52.71 ± 1.24	<0.001	<0.001
LSTM	0.9203 ± 0.0035	108.74 ± 2.71	161.55 ± 3.86	34.28 ± 0.91	<0.001	<0.001
Transformer	0.9396 ± 0.0030	93.68 ± 2.38	140.39 ± 3.22	30.11 ± 0.79	<0.001	<0.001
Informer	0.9536 ± 0.0026	83.42 ± 2.15	124.65 ± 2.91	26.58 ± 0.71	<0.001	<0.001
PatchTST	0.9613 ± 0.0023	75.94 ± 1.94	114.07 ± 2.68	23.86 ± 0.64	<0.001	<0.001
TFT	0.9641 ± 0.0022	72.83 ± 1.83	109.71 ± 2.55	22.57 ± 0.60	<0.001	<0.001
Trans.–LSTM (Proposed)	0.9724 ± 0.0019	63.05 ± 1.64	94.38 ± 2.50	19.74 ± 0.51	—	—
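
As a rough illustration of how the mean ± standard deviation entries in Table 7 translate into a test statistic, the sketch below computes Welch's unequal-variance t statistic and its approximate degrees of freedom from summary values. This is an assumption made for illustration only: the table does not state which tests produced the two p-value columns, and the sketch stops short of a p-value because the standard library has no t-distribution CDF. It also assumes the reported deviations are sample standard deviations over five runs.

```python
import math


def welch_t(mean_a, std_a, mean_b, std_b, n_a=5, n_b=5):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = std_a ** 2 / n_a  # squared standard error of mean A
    var_b = std_b ** 2 / n_b  # squared standard error of mean B
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1)
    )
    return t, df


if __name__ == "__main__":
    # RMSE (kWh) of TFT versus the proposed model, taken from Table 7.
    t, df = welch_t(109.71, 2.55, 94.38, 2.50)
    print(f"t = {t:.2f}, df = {df:.1f}")
```

A large positive t here simply reflects that the gap between the two mean RMSE values is many times larger than the run-to-run spread.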

Table 8. Station-level forecasting performance of the proposed Transformer–LSTM model across all eight municipal charging stations, evaluated on the chronological 70/15/15 test split.

Station	Avg. Daily Load (kWh)	R²	MAE (kWh)	RMSE (kWh)	MAPE (%)
Court Square	1,468.8	0.971	98.45	145.32	11.47
Queens Borough Hall	1,221.9	0.969	87.21	132.18	11.85
Delancey Essex	725.4	0.967	85.73	128.94	12.31
Jerome 190th	338.4	0.957	42.63	68.45	14.92
Jerome Gun Hill	334.4	0.948	40.18	62.83	16.78
Bay Ridge	234.5	0.939	38.27	58.74	18.34
St. George	95.9	0.892	22.45	35.67	27.43
Queens Family Court	36.5	0.781	8.72	14.23	43.85
Overall (test set)	—	0.973	62.71	94.21	19.62
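
The chronological 70/15/15 split named in the caption of Table 8 can be sketched as follows. The record layout (a `date` key and a `load_kwh` key) and the function name are hypothetical. Whether the authors split each station separately or the pooled series is not stated in the table, so the sketch simply splits whatever list it is given.

```python
from datetime import date, timedelta


def chronological_split(records, train_pct=70, val_pct=15):
    """Split records by time: earliest for training, latest for testing."""
    ordered = sorted(records, key=lambda r: r["date"])
    n = len(ordered)
    n_train = n * train_pct // 100  # integer arithmetic avoids float rounding
    n_val = n * val_pct // 100
    train = ordered[:n_train]
    val = ordered[n_train:n_train + n_val]
    test = ordered[n_train + n_val:]  # remainder, roughly the last 15%
    return train, val, test


if __name__ == "__main__":
    start = date(2021, 7, 1)
    records = [
        {"date": start + timedelta(days=i), "load_kwh": 300.0 + i}
        for i in range(100)
    ]
    train, val, test = chronological_split(records)
    print(len(train), len(val), len(test))
    print(train[-1]["date"], val[0]["date"], test[0]["date"])
```

Unlike a random split, every validation and test day lies strictly after the last training day, so no future information reaches the model during training.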

Table 9. Ablation study results showing the effects of architectural changes, feature-group removal, and hyper-parameter variations on forecasting performance.

Variant	R²	MAE (kWh)	RMSE (kWh)	MAPE (%)	sMAPE (%)	ΔRMSE (%)
(a) Architectural Ablation
Transformer-only (no LSTM decoder)	0.9408	93.12	139.76	29.94	22.48	+48.4
LSTM-only (no Transformer encoder)	0.9215	108.11	160.94	34.01	26.29	+70.8
LSTM → Transformer (reverse order)	0.9512	84.36	127.05	25.71	19.98	+34.9
(b) Feature-group Ablation
w/o weather features	0.9602	73.54	114.83	23.18	18.09	+21.9
w/o autoregressive lag features	0.9387	96.42	142.55	30.85	24.01	+51.3
w/o rolling statistics	0.9521	81.03	125.74	25.34	19.74	+33.5
w/o cyclic encodings	0.9658	68.92	106.41	21.46	16.72	+13.0
w/o behavioral features	0.9684	65.83	100.27	20.72	16.14	+6.4
(c) Hyper-parameter Sensitivity
$N_{enc} = 2$, $h = 8$, $h_{LSTM} = 128$	0.9622	71.85	110.85	22.37	17.43	+17.7
$N_{enc} = 6$, $h = 8$, $h_{LSTM} = 128$	0.9714	64.23	95.68	19.94	15.63	+1.6
$N_{enc} = 4$, $h = 4$, $h_{LSTM} = 128$	0.9658	69.41	106.39	21.52	16.72	+12.9
$N_{enc} = 4$, $h = 16$, $h_{LSTM} = 128$	0.9722	63.45	95.34	19.81	15.44	+1.2
$N_{enc} = 4$, $h = 8$, $h_{LSTM} = 64$	0.9667	68.74	105.21	21.38	16.61	+11.7
$N_{enc} = 4$, $h = 8$, $h_{LSTM} = 256$	0.9728	62.95	94.78	19.69	15.37	+0.6
Full Proposed Model ($N_{enc} = 4$, $h = 8$, $h_{LSTM} = 128$)	0.9731	62.71	94.21	19.62	15.54	—
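
The ΔRMSE column in Table 9 is consistent, to within rounding, with the relative RMSE increase of each variant over the full model. The sketch below assumes that definition and contrasts it with the reduction quoted in the abstract, which uses the standalone Transformer as the denominator instead. The function names are illustrative.

```python
FULL_MODEL_RMSE = 94.21  # kWh, full proposed model (Table 9)


def delta_rmse_pct(variant_rmse, full_rmse=FULL_MODEL_RMSE):
    """Relative RMSE increase of an ablated variant over the full model."""
    return 100.0 * (variant_rmse - full_rmse) / full_rmse


def rmse_reduction_pct(baseline_rmse, proposed_rmse=FULL_MODEL_RMSE):
    """Relative RMSE reduction of the proposed model versus a baseline."""
    return 100.0 * (baseline_rmse - proposed_rmse) / baseline_rmse


if __name__ == "__main__":
    # Variant RMSE values (kWh) copied from Table 9.
    for name, rmse in [
        ("LSTM-only", 160.94),
        ("w/o weather features", 114.83),
        ("w/o behavioral features", 100.27),
    ]:
        print(f"{name}: {delta_rmse_pct(rmse):+.1f}%")
    # Same Transformer RMSE, opposite denominator, as in the abstract.
    print(f"reduction vs. Transformer: {rmse_reduction_pct(139.76):.1f}%")
```

The two percentages describe the same gap from opposite reference points, which is why the Transformer-only row shows an increase near 48% while the abstract reports a 32.6% reduction.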

Table 10. Computational cost and deployment efficiency comparison of all benchmarked models, including training time, inference latency, memory usage, and RMSE.

Model	Params (K)	Train. Time (min)	Inference (ms/Sample)	Peak GPU Mem (MB)	RMSE (kWh)
Simple RNN	12.10	8.4	0.42	28	181.01
LSTM	129.47	12.7	0.68	65	160.94
Transformer (encoder-only)	160.38	14.2	0.81	78	139.76
Informer	285.00	21.5	1.34	132	124.18
PatchTST	178.00	15.8	0.89	96	113.74
TFT	412.00	28.4	1.72	184	109.42
Trans.–LSTM (Proposed)	1070.00	32.7	1.96	248	94.21

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Das, R.; Debnath, S.; Kandil, T.; Mia, M.U. Spatio-Temporal Forecasting of Municipal EV Charging Load Using Weather-Aware Transformer–LSTM Hybrid Models. AI 2026, 7, 191. https://doi.org/10.3390/ai7060191

