4.1. Data Collection and Preprocessing
A multi-source, multivariate panel dataset was constructed for this analysis, comprising 504 monthly observations (14 municipalities in Hunan Province over 36 months, from January 2021 through December 2023). Each record was uniquely identified by its city and date. The dataset covered over 180,000 EV charging piles, with operating capacities ranging from 1 to 16,300 units and voltage levels including 10 kV, 220 V, and 380 V. Fourteen feature variables were selected for their theoretical influence on EV development dynamics: EV stock, sales volume, and penetration rate; GDP and per capita disposable income; transportation metrics (taxis, road mileage, and registered vehicles); demographic indicators; and meteorological parameters (temperature, precipitation, and sunshine duration). The target variable, EV charging pile operating capacity, was defined as the operational capability metric representing charging infrastructure utilization within each urban jurisdiction.
Following the confirmation of long-term equilibrium relationships between influencing factors and charging capacity, seven key determinants were identified: EV stock, sales volume, penetration rate, passenger car sales, taxi fleet size, sunshine duration, and precipitation. The error correction terms (ECTs) for these features were extracted to quantify short-term deviations from the long-run equilibrium, and the resulting ECT series were then supplied as inputs to the BiLSTM model.
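As an illustration of what an ECT series looks like, the sketch below follows a simple two-step, Engle-Granger-style procedure with a single regressor: fit the long-run regression by ordinary least squares, then lag its residuals by one period. The text does not state which estimation procedure was used, so this is an assumption, and the function names and numeric values are hypothetical.

```python
from statistics import mean

def ols_fit(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def error_correction_term(x, y):
    """Residuals of the long-run regression, lagged by one period.

    ect[t] = y[t-1] - (a + b * x[t-1]). The first period has no
    predecessor, so its entry is None.
    """
    a, b = ols_fit(x, y)
    residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    return [None] + residuals[:-1]

# Hypothetical monthly values for one city (not data from the study).
ev_stock = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
capacity = [21.0, 25.5, 30.0, 39.5, 48.0, 61.0]
ect = error_correction_term(ev_stock, capacity)
```

A positive entry means capacity sat above its long-run level in the previous month, and a negative entry means it sat below. Repeating the procedure for each of the seven determinants would give seven ECT columns.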
4.2. Experimental Design and Evaluation Metrics
A series of comparative experiments was conducted to evaluate the proposed model’s performance against three benchmark approaches, all of which used the same input features: (1) Support vector regression (SVR), a non-linear regression model implemented with a radial basis function (RBF) kernel. (2) A temporal convolutional network (TCN) comprising three residual blocks with a kernel size of 3 and 64 hidden channels; causal, dilated convolutions with ReLU activation functions were used, and final predictions were generated through a fully connected output layer. (3) A gated recurrent unit (GRU) model, a streamlined recurrent neural network variant that uses update and reset gates to manage information flow; the architecture featured two stacked GRU layers with 128 units each, followed by a fully connected output layer.
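To make the TCN configuration more concrete, the following sketch computes the receptive field, that is, how many past time steps can influence one prediction. The kernel size and the number of residual blocks come from the description above. The dilation schedule (doubling per block: 1, 2, 4) and the two convolutions per residual block are assumptions based on the common TCN design, not details stated in the text.

```python
def tcn_receptive_field(kernel_size, dilations, convs_per_block=2):
    """Receptive field of stacked causal dilated convolutions.

    Each convolution with dilation d extends the receptive field by
    (kernel_size - 1) * d time steps.
    """
    field = 1
    for d in dilations:
        field += convs_per_block * (kernel_size - 1) * d
    return field

# Three residual blocks, assumed dilations 1, 2, 4.
dilations = [2 ** i for i in range(3)]
print(tcn_receptive_field(3, dilations))                     # 29
print(tcn_receptive_field(3, dilations, convs_per_block=1))  # 15
```

Under these assumptions a single output would depend on up to 29 monthly steps. With one convolution per block the figure drops to 15.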
To compare the impacts of different input features, we categorized our experiments into three groups: (1) The all-features group used all 14 original features as input for charging pile operating capacity prediction. (2) The seven-features group trained the model using only the seven variables that exhibited a medium- to long-term cointegration relationship. (3) The ECT-enhanced group used all 14 original features together with the seven ECT features from the cointegration analysis, totaling 21 features, as input for model training.
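The three input configurations can be assembled as in the minimal sketch below. The column names (`x1` to `x14`) and the `_ect` suffix are placeholders introduced for illustration; the text does not define column names, and which placeholder columns count as cointegrated is arbitrary here.

```python
def build_feature_groups(original, cointegrated):
    """Return the three feature sets compared in the experiments."""
    missing = [name for name in cointegrated if name not in original]
    if missing:
        raise ValueError(f"not among the original features: {missing}")
    # One ECT column per cointegrated variable (hypothetical naming).
    ect_columns = [f"{name}_ect" for name in cointegrated]
    return {
        "all_features": list(original),
        "seven_features": list(cointegrated),
        "ect_enhanced": list(original) + ect_columns,
    }

# Placeholder names standing in for the 14 original features.
original = [f"x{i}" for i in range(1, 15)]
cointegrated = original[:7]  # arbitrary choice for the example
groups = build_feature_groups(original, cointegrated)
sizes = {name: len(cols) for name, cols in groups.items()}
```

Here `sizes` gives 14, 7, and 21 columns for the three groups, matching the counts described above.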
To ensure a comprehensive assessment, the predictive performance was evaluated using a suite of metrics: R-squared (R²), mean squared error (MSE), root mean squared error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Five-fold cross-validation was also employed, and the results from the fold yielding the best training performance were selected.
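For reference, the metrics reported in this section (R², MSE, RMSE, MAE, and MAPE) can be computed as in the sketch below, which uses their standard definitions. The function name and the sample values are illustrative only. MAPE is undefined when a true value is zero, so the sketch assumes strictly nonzero targets.

```python
from math import sqrt

def regression_metrics(y_true, y_pred):
    """Standard regression metrics; y_true must contain no zeros (MAPE)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    y_bar = sum(y_true) / n
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "MSE": mse,
        "RMSE": sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "MAPE": 100.0 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n,
    }

# Hypothetical capacities and predictions (not data from the study).
metrics = regression_metrics([100, 200, 300, 400], [110, 190, 310, 390])
```

Because MAPE divides by the true value, errors on cities with small operating capacities weigh far more heavily in MAPE than in MAE or MSE.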
4.3. Results and Discussions
All experiments were conducted within a consistent hardware environment, utilizing an NVIDIA RTX A6000 GPU with 48 GB of memory. The deep learning models were implemented using the PyTorch 2.1 framework, while traditional methods were implemented with the statsmodels and scikit-learn libraries. Each model was trained and evaluated independently for each city.
The proposed model was trained on the ECT-enhanced input, that is, the 14 original features augmented with the seven ECT variables.
Figure 5 compares the model predictions with the actual values for Changsha over a nine-month period (January to September 2022). Visual inspection indicated that the proposed model aligned most closely with the true values.
To assess the model’s generalization capability, a five-fold cross-validation strategy was employed. The model achieved its best training performance in the second fold; the specific results are presented in Table 5.
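The fold construction and best-fold selection can be sketched as follows. The text does not say how the folds were formed, so the contiguous, unshuffled split below is an assumption, as is the use of validation MSE as the selection criterion. The per-fold scores are hypothetical values chosen only to show the selection step.

```python
def kfold_indices(n_samples, n_folds=5):
    """Split indices 0..n_samples-1 into contiguous (train, val) folds."""
    base, extra = divmod(n_samples, n_folds)
    folds, start = [], 0
    for fold in range(n_folds):
        # The first `extra` folds take one additional sample.
        size = base + (1 if fold < extra else 0)
        stop = start + size
        val = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        folds.append((train, val))
        start = stop
    return folds

# 36 monthly observations for a single city.
folds = kfold_indices(36)

# Hypothetical validation MSE per fold (not results from the study).
fold_mse = [0.9, 0.4, 0.7, 0.6, 0.8]
best_fold = min(range(len(fold_mse)), key=fold_mse.__getitem__) + 1  # 1-based
```

With contiguous folds on a time series, the training portion of most folds includes months that come after the validation block. A forward-chaining split avoids this, at the cost of smaller early training sets.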
Figure 6, generated from the normalized data in Table 5, illustrates the performance distribution and stability of the proposed model across the five folds. The gray data points represent the normalized performance score for each fold. The R² box plot is notably narrower than the others and sits near the upper end of the y-axis, indicating that this metric was both high and stable across folds and that the model achieved a consistently good fit. In contrast, MSE and MAPE varied more noticeably from fold to fold. Taken together with Table 5, the second fold yielded the lowest MAE, MSE, and RMSE among all five folds, further confirming its superior performance. Consequently, all subsequent comparative experiments used the data from the second fold.
To validate the representativeness of the chosen input features, we compared three input feature combinations: the all-features group, comprising the 14 original features without cointegration analysis or error correction; the seven-features group, comprising only the seven features cointegrated with charging pile operating capacity; and the ECT-enhanced group, comprising all 14 original features augmented with seven error correction terms (ECTs) derived from cointegration analysis, for a total of 21 features. Each feature set was used for model training, and the results are presented in Table 6.
Figure 7, generated from the normalized data in Table 6, visually compares the three feature sets across five key metrics: R², MAE, MSE, RMSE, and MAPE. A higher Z-score indicates superior model performance, and the dashed lines help show the overall performance trends. In the original data, a larger R² signifies better performance, whereas smaller MAE, MSE, RMSE, and MAPE values are better. During normalization, these four error metrics were therefore inversely transformed so that the group with the lowest original error received the highest normalized score.
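The inverse transformation can be illustrated with the sketch below. The text refers to Z-scores but does not give the normalization formula, so the sketch assumes min-max scaling to [0, 1], with error metrics flipped so that the lowest error maps to 1. The function name and the sample values are hypothetical.

```python
def normalize_scores(values, higher_is_better):
    """Min-max scale to [0, 1]; flip so that 1 is always the best score."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All groups tie, so every group receives the top score.
        return [1.0 for _ in values]
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled if higher_is_better else [1.0 - s for s in scaled]

# Hypothetical values for three feature groups (not results from the study).
r2_scores = normalize_scores([0.90, 0.85, 0.95], higher_is_better=True)
mae_scores = normalize_scores([12.0, 20.0, 8.0], higher_is_better=False)
```

In this example the third group has both the highest R² and the lowest MAE, so it receives 1.0 on both normalized scales, which lets metrics with opposite directions share one axis.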
As depicted in the figure, the ECT-enhanced group consistently occupied the highest position, with the best normalized scores across all metrics. The tabular data corroborated this observation: the ECT-enhanced group had the highest R² and the lowest values for all error metrics. The all-features group ranked second, and the seven-features group performed worst, in line with experimental expectations. The ECT-enhanced group was therefore identified as the optimal feature set, as it outperformed the other two feature sets on every evaluation metric.
To validate the performance of the proposed model against other commonly used load forecasting models, we conducted comparative experiments against the TCN, SVR, and GRU benchmarks. These three models likewise used the ECT-enhanced feature set as input and five-fold cross-validation, and their second-fold results were selected for comparison with our model. The outcomes are presented in Table 7.
As in the feature group comparison, a higher Z-score in Figure 8 indicates superior model performance. The BiLSTM model achieved a normalized score of 1.00 on all five metrics: R², MAE, MSE, RMSE, and MAPE. The GRU model performed worst, while the SVR and TCN models fell in between. This outcome supports the choice of the proposed model.
The selection of the BiLSTM was justified because the evolution of operating capacity was influenced by both short-term fluctuations (e.g., seasonal and climatic) and long-term socioeconomic drivers. Traditional linear models often fail to adequately capture the complex, non-linear interactions between these multi-scale factors. BiLSTMs, however, excel at learning such temporal hierarchies and non-linear relationships, effectively balancing long-term memory with short-term dynamics.
Moreover, the inclusion of cointegrated variables provided an economically interpretable basis for the model, thereby improving its stability and explanatory power. Consequently, the BiLSTM in this framework functioned not only as a predictive tool but also as a crucial bridge between statistical analysis and deep learning. This hybrid approach offered significant methodological and practical advantages.
When the proposed model is extended to regions with different socioeconomic and climatic conditions, influencing factors such as GDP and taxi fleet size are expected to mitigate the effect of regional differences in economic structure and transportation characteristics on its generalization capability. However, policy-driven effects are not explicitly modeled in the current framework, which may limit generalizability to other regions. Addressing this limitation will require systematic evaluation and further refinement in future research.