4.1. Dataset Description
The two datasets employed in this study originate from residential BESSs located in Europe. Each BESS comprises six battery sub-modules, and the corresponding batteries use lithium iron phosphate (LFP) chemistry. Henceforth, these datasets are referred to as Dataset 1 and Dataset 2. Both datasets were originally sampled at 10 min intervals and include monitored variables such as current, voltage, state of charge (SOC), maximum temperature, and minimum temperature. All of these variables are system-level data obtained directly from the BESSs’ monitoring systems. The prediction targets of this study were the overall maximum and minimum temperatures of the entire BESS; these system-level aggregate temperatures are critical for thermal management and safety assessment.
To ensure stable model training and reliable prediction results, we preprocessed the datasets. First, we checked the datasets for missing values; the few missing points detected were imputed by linear interpolation to preserve the continuity of the time series. Second, because sensors in practical deployments can produce anomalous readings, we applied the local outlier factor (LOF) algorithm to identify potential outliers in the data [29]. LOF assesses whether a data point is an outlier by comparing its local density with that of the points in its neighborhood. Points flagged as outliers were likewise corrected by linear interpolation to mitigate the negative impact of noisy data on model performance.
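A minimal sketch of this preprocessing step is given below; the column names are illustrative, as the full dataset schema is not specified here, and the LOF neighborhood size is an assumption (scikit-learn’s default) since the paper does not report it:

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import LocalOutlierFactor

# Illustrative column names; the actual schema may differ.
COLS = ["current", "voltage", "soc", "temp_max", "temp_min"]

def clean_series(df: pd.DataFrame, n_neighbors: int = 20) -> pd.DataFrame:
    df = df.copy()
    # Fill the few missing points first so LOF receives a complete matrix.
    df[COLS] = df[COLS].interpolate(method="linear", limit_direction="both")
    # LOF compares each point's local density with that of its neighbors;
    # fit_predict returns -1 for points flagged as outliers.
    labels = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(df[COLS].values)
    # Mask flagged points and re-impute them by linear interpolation.
    df.loc[labels == -1, COLS] = np.nan
    df[COLS] = df[COLS].interpolate(method="linear", limit_direction="both")
    return df
```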
The raw data span the period from January 2020 to December 2020. Given the substantial volume of the original 10 min data, and to enable efficient model training and evaluation within the available computational resources, we downsampled the data to 1 h intervals by averaging. While this smooths out very rapid, sub-hourly thermal transients, the predictive objective of this study is to capture hour-level temperature changes over the upcoming day or days, providing decision support for BESS thermal management. In this application context, 1 h granularity is well suited to capturing the dominant daily cycles and the key heat accumulation and dissipation trends.
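Assuming the cleaned data carry a parseable timestamp column (name illustrative), the downsampling reduces to a single resampling step:

```python
# Downsample the 10 min records to hourly means.
df_clean["timestamp"] = pd.to_datetime(df_clean["timestamp"])
hourly = df_clean.set_index("timestamp").resample("1h").mean()
```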
For the subsequent experiments, the data were partitioned such that the first 70% constituted the training set, the following 10% the validation set, and the remaining 20% the test set. The data were standardized using the mean and standard deviation derived from the training set, which were then applied to the training, validation, and test sets alike.
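A sketch of the chronological split and leakage-free standardization, under the layout assumed above:

```python
import numpy as np

values = hourly.to_numpy()                     # shape: (T, n_vars)
n = len(values)
train_end, val_end = int(n * 0.7), int(n * 0.8)
train, val, test = values[:train_end], values[train_end:val_end], values[val_end:]

# Statistics come from the training split only and are reused for all splits,
# so no information from the validation/test periods leaks into training.
mu, sigma = train.mean(axis=0), train.std(axis=0)
train_z, val_z, test_z = [(x - mu) / sigma for x in (train, val, test)]
```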
4.2. Setup of the Experiments
To evaluate the model’s performance, we adopted three evaluation metrics: in addition to the MSE used in the loss function, the mean absolute error (MAE) and the root mean square error (RMSE) were also selected. They are calculated as follows:
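$$
\mathrm{MSE}=\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2,\qquad
\mathrm{MAE}=\frac{1}{n}\sum_{i=1}^{n}\left|y_i-\hat{y}_i\right|,\qquad
\mathrm{RMSE}=\sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i-\hat{y}_i\right)^2},
$$

where $y_i$ denotes the observed value, $\hat{y}_i$ the corresponding prediction, and $n$ the number of evaluated samples.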
To validate the effectiveness of the proposed model for BESS temperature prediction, we compared it with the following models: Transformer [25], Informer [30], iTransformer [31], patch time series Transformer (PatchTST) [23], and time series Transformer with exogenous variables (TimeXer) [32]. The Transformer is a classic attention-based model widely applied to time series forecasting tasks. The Informer builds on the Transformer with improvements to the self-attention mechanism and the encoder–decoder architecture. In contrast, iTransformer, PatchTST, and TimeXer focus their enhancements on the processing and embedding of the time series data.
The aforementioned algorithms were all implemented in Python 3.9 with PyTorch 2.3.0 and executed on a computer running Windows 10 with an Intel® Core™ i9-10980XE CPU @ 3.00 GHz and an Nvidia GeForce RTX 3090 GPU. The source code for the comparative models was obtained from the time series library (TSLib) [24]. The hyperparameters of each model were tuned according to its performance on a dedicated validation set, with the aim of identifying a reasonable and competitive configuration for every model. To ensure stable and reproducible results and to mitigate the impact of randomness, each model was run independently five times with different random seeds; the reported performance metrics are the means and standard deviations over these five runs.
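A sketch of this protocol, where `train_and_eval` is a hypothetical callable wrapping one full training/evaluation cycle, and the seed values are arbitrary placeholders (the actual seeds are not reported):

```python
import numpy as np
import torch

def run_repeated(train_and_eval, seeds=(0, 1, 2, 3, 4)):
    """Run one train/evaluate cycle per seed; return mean and std per metric.

    `train_and_eval(seed)` is assumed to return a dict such as
    {"MSE": ..., "MAE": ..., "RMSE": ...} computed on the test set.
    """
    runs = []
    for seed in seeds:
        torch.manual_seed(seed)   # PyTorch RNG: weight init, dropout, ...
        np.random.seed(seed)      # NumPy RNG: shuffling, sampling, ...
        runs.append(train_and_eval(seed))
    return {k: (float(np.mean([r[k] for r in runs])),
                float(np.std([r[k] for r in runs]))) for k in runs[0]}
```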
In the experimental setup, the target variables were the maximum and minimum temperatures, while the covariates included current, voltage, and SOC. It should be noted that the method proposed in this paper also treats vectorized timestamps as covariates, whereas the comparative models process timestamp information with their own respective mechanisms. In the experiments, we used the preceding 7 days of data to predict the subsequent 1, 2, and 3 days; i.e., the look-back window length was 24 × 7 h, and the prediction horizons were 24, 24 × 2, and 24 × 3 h, respectively. All models adopted an input–output overlapping strategy similar to that of the Informer [30], and the normalization method proposed for Non-Stationary Transformers [33] was applied to all models to enhance performance.
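A minimal sketch of the window construction and the per-window normalization follows; this reflects our reading of the setup, not the exact TSLib implementation:

```python
import numpy as np

LOOKBACK = 24 * 7   # 7-day history at 1 h resolution
HORIZON = 24        # 24 h horizon; 24 * 2 and 24 * 3 are analogous

def make_windows(series: np.ndarray):
    """Slide (look-back, horizon) pairs over a (T, n_vars) array."""
    X, Y = [], []
    for t in range(len(series) - LOOKBACK - HORIZON + 1):
        X.append(series[t : t + LOOKBACK])
        Y.append(series[t + LOOKBACK : t + LOOKBACK + HORIZON])
    return np.stack(X), np.stack(Y)

def stationarize(x: np.ndarray, eps: float = 1e-5):
    """Per-window standardization in the spirit of Non-Stationary Transformers
    [33]: each look-back window is normalized by its own statistics, and the
    model output is later de-normalized with the same mu and sigma."""
    mu = x.mean(axis=0, keepdims=True)
    sigma = x.std(axis=0, keepdims=True) + eps
    return (x - mu) / sigma, mu, sigma
```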
To mitigate overfitting, we adopted an early-stopping strategy during training. Specifically, the model’s performance on the validation set was monitored continuously; if no improvement was observed over five consecutive training epochs, training was terminated, and the parameters that yielded the best validation performance were retained. This halts training near the point of best generalization and prevents overfitting to the training data.
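A compact sketch of this early-stopping logic with a patience of five epochs:

```python
import copy
import torch

class EarlyStopping:
    """Stop once validation loss fails to improve for `patience` consecutive
    epochs, retaining the parameters of the best checkpoint."""

    def __init__(self, patience: int = 5):
        self.patience = patience
        self.best_loss = float("inf")
        self.counter = 0
        self.best_state = None

    def step(self, val_loss: float, model: torch.nn.Module) -> bool:
        if val_loss < self.best_loss:
            self.best_loss, self.counter = val_loss, 0
            self.best_state = copy.deepcopy(model.state_dict())
            return False                        # improvement: keep training
        self.counter += 1
        return self.counter >= self.patience    # True: stop training
```

After training, `model.load_state_dict(stopper.best_state)` restores the best-performing parameters.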
4.3. Comparative Experiments for Classic 24-Hour Prediction
We initially conducted comparative experiments on the classic 24 h prediction horizon. The CNN block proposed herein can extract multi-scale periodic features from long historical look-back windows. For some models, however, such long windows might introduce irrelevant reference information, increasing their processing burden and potentially impacting their accuracy. Considering these factors, for each comparative model we evaluated performance not only with the 24 × 7 sampling point look-back window but also with new, shorter look-back windows constructed by truncating this window (see the sketch below), to ensure that each comparative method could be evaluated with a more appropriate look-back length. The final test results for the two datasets are presented in Table 2 and Table 3, respectively, wherein all metrics are reported as means ± standard deviations over five runs.
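Our reading of the truncation, assuming the most recent samples of each window are retained so that all variants share the same forecast origin:

```python
# x_full: one look-back window of shape (24 * 7, n_vars); name is illustrative.
x_truncated = {24 * k: x_full[-24 * k:] for k in (1, 3, 5)}
```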
Based on the metrics presented in Table 2, the algorithm proposed in this paper achieved the best test results, and its stability was also extremely high. iTransformer and PatchTST, when using a look-back window truncated to 24 × 5, obtained close test results that were second-best overall; the average metric values of iTransformer were slightly better than those of PatchTST, but its stability was slightly lower. Furthermore, when iTransformer and PatchTST employed the 24 × 7 look-back window, their metrics did not degrade significantly from their respective best performances. This may be attributed to iTransformer’s ability to integrate inter-variable correlations effectively and PatchTST’s proficiency in handling local temporal features.
As for TimeXer, its prediction performance with look-back windows of 24 × 5 and 24 × 7 was slightly superior to that with windows of 24 × 1 and 24 × 3; overall, however, its average prediction metrics were poorer than those of the leading models. With the look-back window truncated to 24 × 1, both the Transformer and Informer models achieved their respective best prediction results, and their stability deteriorated significantly when longer windows were employed. This suggests that, when processing longer look-back windows, their performance may be adversely affected by the introduction of irrelevant reference information.
According to Table 3, the experiments conducted on Dataset 2 demonstrate that the proposed algorithm again achieved the best test performance. Unlike on Dataset 1, both the Informer and TimeXer models obtained their best average test metrics when the look-back window was truncated to 24 × 1, while the remaining models achieved theirs with a look-back window length of 24 × 7; this may be attributed to differences between the two datasets. As on Dataset 1, iTransformer and PatchTST once again ranked second only to the proposed model, whereas the Transformer and Informer models yielded poorer metrics and lower stability. This could be attributed to the different perspectives from which each model analyzes the characteristics of the time series data.
To provide a more intuitive demonstration of the proposed method’s performance in BESS temperature prediction, we present, using Dataset 1 as an example, the predicted time series curves of each model under the same random seed. For the comparative models, the look-back window lengths that achieved their best average prediction metrics in Table 2 were used. Due to the large quantity of data, Figure 4 illustrates only a segment of these data.
As can be seen from Figure 4, for 1-step-ahead prediction, the proposed method tracks the actual values closely. During minor data fluctuations, the predicted and actual values align well, and the method responds sensitively to local variations. At higher peaks, the proposed method tends to underestimate; nevertheless, these errors remain within approximately 2 °C, which is an acceptable range.
The Transformer model performed poorly in predicting minor fluctuations and was sluggish in responding to small local changes, exhibiting significant prediction bias around the 0th to 20th and the 170th to 220th sampling points. The Informer model likewise showed notable bias around the 5th to 20th and the 170th to 210th sampling points, consistent with the poorer results of the Transformer and Informer models in Table 2. The iTransformer model was also sensitive to local changes but underestimated many minor peaks. PatchTST demonstrated sensitivity to local variations as well, a characteristic closely related to its strong extraction of local temporal features; however, its underestimation of some peaks is more pronounced than that of the proposed method. Although TimeXer captures local fluctuations reasonably well, its overall estimates at each sampling point are generally poor.
To evaluate each model’s prediction performance over the upcoming 24 h more comprehensively, we present, using Dataset 1 as an example, box plots of the prediction errors of each model at every time step within the prediction window; see Figure 5. To keep the subplots clear and concise, outliers are not displayed in the figure.
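A sketch of how such per-step box plots can be produced; `errors` is a hypothetical array of shape (n_windows, horizon) holding predicted minus actual temperatures in °C:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_stepwise_errors(errors: np.ndarray) -> None:
    fig, ax = plt.subplots(figsize=(10, 4))
    ax.boxplot([errors[:, t] for t in range(errors.shape[1])],
               showfliers=False)            # hide outliers, as in Figure 5
    ax.axhline(0.0, linewidth=0.8)          # zero-error reference line
    ax.set_xlabel("Prediction step (h)")
    ax.set_ylabel("Prediction error (°C)")
    plt.tight_layout()
    plt.show()
```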
According to Figure 5, the median prediction error of the proposed method is close to zero at all time steps, indicating low systematic bias in its temperature predictions. As the prediction horizon increases, the interquartile range (IQR) and whisker length grow gradually, stabilizing after the fifth time step; the error distribution remains relatively stable and compact, with whisker tips around ±2 °C, which is within an acceptable range.
The median prediction error of the Transformer model is slightly above zero at almost all time steps, indicating systematic overestimation that appears more pronounced for the maximum temperature. Its IQR and whisker length also grow with the prediction step, and its error dispersion is slightly greater than that of the proposed method. The Informer model’s errors are relatively compact before the sixth prediction step, but its predictions also exhibit systematic overestimation. The median error of the iTransformer model stays near zero, indicating minimal systematic bias, although the compactness of its errors at later time steps is slightly inferior to that of the proposed method. The median error of PatchTST is likewise close to zero, exhibiting low systematic bias, while its errors at later time steps are less compact than those of the proposed method. The median error of TimeXer remains near zero; however, its error dispersion at earlier time steps is noticeably higher than that of the other models, in line with the predictions shown in Figure 4, and the magnitude of its errors is slightly higher overall.
Furthermore, the optimal results reported above for the comparative models in the 24-step prediction experiments were obtained only after experimenting with several truncated look-back windows. In practical applications, where future label data is unavailable, generating multiple differing predictions in this way would create ambiguity for decision-makers, who cannot easily determine which result is the most reliable. In contrast, the method proposed in this paper takes the full long look-back window directly as input and automatically extracts its multi-scale periodic features and local temporal characteristics, thereby minimizing such ambiguity.
In summary, the proposed method not only demonstrates commendable accuracy and prediction stability in the comparative experiments but is also better suited to providing a reliable basis for decision-making in practical scenarios.
4.4. Comparative Experiments for Other Prediction Horizons
To evaluate the performance of the proposed model under different prediction horizons, we conducted comparative experiments for 48 h and 72 h predictions. As in the classic 24 h prediction, for each comparative model we evaluated performance both with the 24 × 7 sampling point look-back window and with new, shorter windows constructed by truncating this window. The final test results of the 48 h prediction experiments on the two datasets are presented in Table 4 and Table 5, respectively; the results of the 72 h experiments are shown in Table 6 and Table 7, respectively. All metrics in these tables are likewise reported as means ± standard deviations over five runs.
According to Table 4 and Table 5, the proposed model continued to achieve the best test results, and its stability remained high. As in the classic 24 h prediction, iTransformer and PatchTST still yielded good, comparable test results, whereas the stability of the Transformer and Informer models was again relatively poor. TimeXer, on the other hand, delivered moderate performance overall.
In Table 2, Table 3, Table 4 and Table 5, it can be observed that the prediction performance of all models declined as the prediction horizon increased, in line with the general pattern in time series forecasting. When the prediction horizon was extended from 24 h to 48 h, for Dataset 1, the average MSE, MAE, and RMSE of the proposed model degraded by 14.56%, 8.12%, and 7.03%, respectively; the corresponding best-performance metrics of iTransformer degraded by 22.63%, 12.04%, and 10.74%, and those of PatchTST by 20.87%, 12.21%, and 9.94%. For Dataset 2, the average MSE, MAE, and RMSE of the proposed model degraded by 28.34%, 15.64%, and 13.29%, respectively; the best-performance metrics of iTransformer degraded by 33.79%, 17.07%, and 15.67%, and those of PatchTST by 33.69%, 18.55%, and 15.62%.
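These degradation percentages correspond, on our reading, to the relative increase of each error metric with respect to its 24 h value:

$$\text{degradation}(M) = \frac{M_{\text{long}} - M_{24\,\text{h}}}{M_{24\,\text{h}}} \times 100\%,$$

where $M$ is MSE, MAE, or RMSE and $M_{\text{long}}$ is its value at the extended horizon (48 h here, 72 h below).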
According to Table 6 and Table 7, the proposed model continued to achieve the best average evaluation metrics, and its stability remained high. After the prediction horizon was extended from 24 h to 72 h, for Dataset 1, the average MSE, MAE, and RMSE of the proposed model degraded by 25.29%, 14.19%, and 11.93%, respectively; the best-performance metrics of iTransformer degraded by 39.83%, 22.59%, and 18.25%, and those of PatchTST by 33.38%, 18.35%, and 15.49%. For Dataset 2, the average MSE, MAE, and RMSE of the proposed model degraded by 43.31%, 23.50%, and 19.71%, respectively; the best-performance metrics of iTransformer degraded by 53.35%, 25.88%, and 23.83%, and those of PatchTST by 50.98%, 26.95%, and 22.88%.
Based on these metrics, iTransformer and PatchTST were consistently the strongest competitors to the model proposed in this paper. However, iTransformer focuses primarily on the overall characteristics of the time series and the correlations between monitored variables, whereas PatchTST places greater emphasis on local variations within the series. The proposed model can concurrently address inter-variable correlations, the multi-scale periodic characteristics of the time series, and local variations within it; consequently, it consistently achieved the best evaluation metrics. Furthermore, the degradation figures above indicate that, as the prediction horizon lengthens, the performance decay of the proposed algorithm is considerably smaller than that of iTransformer and PatchTST, further demonstrating its superiority in prediction tasks of varying lengths.