PatchConvFormer: A Patch-Based and Convolution-Augmented Transformer for Periodic Metro Energy Consumption Forecasting

Long, Liheng; Li, Linlin; Zhang, Lijie; Fu, Qing; Zou, Runzong; Feng, Fan; Zhang, Ronghui

doi:10.3390/electronics15010178

Open AccessArticle

PatchConvFormer: A Patch-Based and Convolution-Augmented Transformer for Periodic Metro Energy Consumption Forecasting

by

Liheng Long

¹

,

Linlin Li

¹,

Lijie Zhang

¹,

Qing Fu

²

,

Runzong Zou

^3,*

,

Fan Feng

⁴

and

Ronghui Zhang

³

¹

Guangzhou General Design & Contracting Department, Guangzhou Metro Design & Research Institute Co., Ltd., Guangzhou 510420, China

²

School of Physics, Sun Yat-sen University, Guangzhou 510275, China

³

Guangdong Key Laboratory of Intelligent Transportation System, School of Intelligent Systems Engineering, Sun Yat-sen University, Guangzhou 510275, China

⁴

School of Ocean Engineering and Technology, Sun Yat-sen University, Zhuhai 519082, China

^*

Author to whom correspondence should be addressed.

Electronics 2026, 15(1), 178; https://doi.org/10.3390/electronics15010178 (registering DOI)

Submission received: 5 November 2025 / Revised: 15 December 2025 / Accepted: 26 December 2025 / Published: 30 December 2025

(This article belongs to the Special Issue Smart Grid Technologies and Energy Conversion Systems)

Download

Browse Figures

Versions Notes

Abstract

Accurate forecasting of metro energy consumption is essential for intelligent power management and sustainable urban transportation systems. However, existing studies often overlook the intrinsic properties of metro energy time series, such as strong periodicity, inter-line heterogeneity, and pronounced non-stationarity. To address this gap, this paper proposes an enhanced Informer-based framework, PatchConvFormer (PCformer). The model integrates three key innovations: (1) a channel-independent modeling mechanism that reduces interference across metro lines; (2) a patch-based temporal segmentation strategy that captures fine-grained intra-cycle energy fluctuations; and (3) a multi-scale convolution-augmented attention module that jointly models short-term variations and long-term temporal dependencies. Using real operation data from 16 metro lines in a major city in China, PCformer achieves significant improvements in forecasting accuracy (MSE = 0.043, MAE = 0.145). Compared with the strongest baseline model in each experiment (i.e., the second-best model), the MSE and MAE are reduced by approximately 41.9% and 19.8%, respectively. In addition, the model maintains strong stability and generalization across different prediction horizons and cross-line transfer experiments. The results demonstrate that PCformer effectively enhances Informer’s capability in modeling complex temporal patterns and provides a reliable technical framework for metro energy forecasting and intelligent power scheduling.

Keywords:

metro energy consumption prediction; Informer; patch-based modeling; channel-independent modeling

1. Introduction

Amid rapid urban growth, metropolitan public transit networks worldwide are encountering increasing operational strain. Traditional road-based transport modes, characterized by low efficiency, high energy consumption, and heavy pollution under high passenger demand, can no longer meet the rapidly growing travel needs [1]. Urban rail transit, especially metro systems, serves as a backbone of contemporary public transport for its high throughput, efficiency, and environmental friendliness [2]. Reports from the IEA (International Energy Agency) indicate that rail consumes far less energy per passenger-kilometer than road or air travel, underscoring its vital role in achieving carbon neutrality and energy efficiency [3].

Metro energy use primarily arises from traction, lighting, ventilation, and air-conditioning, with traction alone contributing over half of the total consumption [4]. As networks expand and ridership grows, precise energy forecasting and intelligent scheduling become essential for enhancing efficiency and reducing emissions [5].

Metro energy forecasting aims to infer future consumption patterns from historical data, forming the basis for intelligent dispatching, with existing methods spanning statistical, machine learning, and deep learning paradigms. Traditional statistical approaches, such as ARIMA (Autoregressive Integrated Moving Average model) and SARIMA (Seasonal Autoregressive Integrated Moving Average model), perform well for stationary series with short-term dependencies but fail to capture nonlinear and complex temporal dynamics [6]. Algorithms like SVR (Support Vector Regression) and RF (Random Forests) enhance prediction accuracy when supported by well-crafted feature engineering. However, they are often sensitive to noise and exhibit limited generalization capability [7].

Benefiting from powerful computation and abundant data, deep learning approaches have achieved notable success in time-series forecasting by autonomously learning hierarchical features that capture intricate temporal dynamics [8]. Deep models not only deliver strong forecasting results [9] but also exhibit exceptional representational strength across vision tasks like image recognition and object detection, consistently reaching top-tier performance [10,11].

CNNs (Convolutional Neural Networks) [12] excel at extracting spatial features, while RNNs (Recurrent Neural Networks) [13] and their variants like LSTM (Long Short-Term Memory) [14] and GRU (Gated Recurrent Unit) [15] effectively model sequential data. However, RNN-based architectures suffer from sequential dependency propagation, making parallelization difficult and often leading to gradient vanishing problems in long-sequence tasks. To address these drawbacks, the Transformer architecture [16] leverages attention-based interactions to model long-range dependencies efficiently, leading to major advances in language [17], vision [18], and time-series [19] prediction. Subsequently, several Transformer-based variants—such as Informer [20], Autoformer [21], and FEDformer (Frequency Enhanced Decomposed Transformer) [22] were proposed to enhance the efficiency of long-sequence modeling through sparse attention and hierarchical encoding. Nevertheless, most existing models are tailored for generic time-series forecasting and show weak adaptability to the specific characteristics of metro energy prediction.

Unlike typical industrial or meteorological time-series data [23,24,25], metro energy consumption sequences exhibit stronger periodicity and structural heterogeneity. As illustrated in Figure 1, taking Line 1 from the metro energy consumption dataset used in this study as an example, the daily energy consumption demonstrates a clear weekly periodic pattern, forming a stable seven-day energy cycle. Metro energy usage fluctuates with passenger flow, scheduling, and external influences like weather or policy changes, exhibiting nonstationary dynamics that traditional models fail to capture. Moreover, although metro lines share global operating conditions and environmental patterns within the same city, their operational characteristics—such as section length, passenger density, and equipment configuration—differ significantly. As a result, there are no strict causal dependencies among different lines. Models that adopt a channel-mixing strategy to jointly model all lines tend to introduce spurious correlations, leading to overfitting and performance degradation.

This study introduces Patch Convolutional Transformer (PCformer), a specialized framework designed for metro energy consumption prediction. The proposed model is designed to exploit the periodic regularities and inter-line structures of metro systems, thereby achieving high-accuracy and interpretable multivariate time-series prediction. Built upon the Informer architecture, PCformer introduces three key innovations:

Channel-Independent Modeling: To handle weak correlations among different metro lines, PCformer adopts an independent modeling strategy with parameter sharing, ensuring inter-line consistency while preserving individual line dynamics. This design eliminates non-causal interference and enhances the model’s interpretability.
Patch-based Modeling Mechanism: A sliding-window convolutional patching operation is introduced at the input layer to capture localized fluctuations within weekly and holiday cycles, effectively balancing short-term variability and long-term trend modeling.
Multi-scale Convolution-Augmented Self-Attention: In the attention layer, multi-scale one-dimensional convolutions replace linear projections for key and value computations, strengthening local temporal dependency modeling and enabling joint feature extraction across multiple time scales.

Through these innovations, PCformer achieves efficient and robust energy sequence modeling within a unified framework. It can precisely capture short-term periodic fluctuations while maintaining stable representation of long-term energy trends, significantly improving forecasting accuracy and generalization under complex operational conditions.

The paper is structured as follows. Section 2 summarizes related studies on time-series forecasting; Section 3 details the design of PCformer; Section 4 describes experiments using real energy consumption data from 16 metro lines in a major Chinese city; Section 5 discusses and interprets the findings; and Section 6 concludes with prospects for future work.

2. Related Work

2.1. Patch in Transformer Architecture

Vaswani et al. [16] introduced the Transformer, a recurrence- and convolution-free architecture that leverages multi-head attention to model global dependencies, establishing a foundation for later variants across tasks. When local semantic information is essential, the patching mechanism becomes a crucial component of Transformer-based architectures. In natural language processing, BERT [26] introduced bidirectional encoding and masked pre-training, effectively learning contextual dependencies through patch-style token masking. In time-series forecasting, PatchTST [27] first applied patching to segment long sequences into subseries-level tokens, preserving local temporal semantics and establishing the foundation for patch-based Transformers in sequence modeling. PatchMLP [28] further revealed that the performance gain of Transformers in long-term forecasting stems primarily from patching rather than attention itself, proposing a simple Patch-based MLP that highlights the mechanism’s core role in temporal modeling. In vision tasks, the Vision Transformer (ViT) [29] introduced patch-based tokenization of images, effectively adapting the Transformer framework to handle high-dimensional visual data. Building on this, the Swin Transformer [30] introduced a shifted window mechanism that enables hierarchical, multi-scale feature representation and efficient local–global interaction, inspiring subsequent patch-based vision Transformers [31,32,33,34,35].

2.2. Convolution in Transformer Architecture

CvT [36] is one of the earliest attempts to incorporate convolutional structures into the Transformer architecture, motivated by the superior locality modeling ability of convolutions in vision tasks. It represents a key step toward the Convolution-in-Transformer paradigm. MetaFormer [37] further demonstrated that model performance relies more on the overall Transformer framework rather than specific attention or convolution modules, providing new insights for lightweight hybrid designs. Subsequent studies explored diverse fusion strategies, such as integrating depthwise separable convolutions into feed-forward networks [38,39], or combining MBConv [40] with Transformer layers [41]. UniFormer [42] integrates convolutional and attention mechanisms by combining local feature extraction in early layers with global dependency modeling in deeper ones, balancing efficiency and expressiveness.

Beyond computer vision, convolution–Transformer fusion has been increasingly applied in time-series forecasting. A lightweight hybrid CNN–Transformer model [43] was proposed for resource-constrained devices, where CNNs extract local patterns while Transformers model long-term dependencies. SCINet [44] introduced hierarchical convolutions to capture dynamic temporal correlations, and TimesNet [45] transformed 1D series into 2D tensors for multi-scale feature learning via the Inception module [46]. Later studies [47] proposed a CNN–Transformer probabilistic load forecasting framework combining local peak–trend extraction and global dependency modeling. ChannelMixer [48] further enhanced multivariate forecasting by integrating CNN-based short-term interaction modeling with Transformer-based long-term dependency learning.

Inspired by these advances, our proposed PatchConvFormer (PCformer) introduces a multi-scale convolution-augmented self-attention mechanism, which replaces the standard linear projection in the attention layer with multi-scale 1D convolutions to strengthen local temporal dependency modeling and achieve joint feature extraction across multiple time scales.

2.3. Deep Learning in Energy Consumption Forecasting

Deep learning has recently found extensive application in energy demand prediction. In 2023, the DECPR-TFT model [49] combined energy consumption pattern decomposition with a Temporal Fusion Transformer, achieving highly accurate and interpretable predictions for building energy consumption. ASSA-iTransformer [50] combined adaptive signal decomposition with the Transformer framework, yielding roughly 20% gains in MAE and RMSE, while SPAformer [51] employed spectral–temporal patch attention to further boost accuracy by 12%. The Transformer-based intelligent energy forecasting model [52] utilized self-attention to capture long-term dependencies, outperforming traditional ARIMA and LSTM models in long-horizon load forecasting. Moreover, the STRLM model [53], combining trend decomposition with graph neural networks, demonstrated strong generalization and few-shot forecasting capabilities.

In the context of metro energy forecasting, Rahn et al. [54] conducted one of the earliest multi-city experiments on urban rail energy consumption, verifying the potential of training-based metro systems for energy efficiency improvement. Sanchis et al. [55] established an energy consumption prediction model using train speed as the main variable, achieving accurate estimation of metro operation energy under different environmental conditions. A later study [56] developed a genetic algorithm-enhanced SVR approach that incorporated both internal and external variables, reaching 95% accuracy and surpassing BPNN and linear regression baselines.

However, most existing studies have focused primarily on feature extraction and architectural optimization [57,58], while neglecting the intrinsic temporal dependencies of energy consumption sequences. In metro systems, energy consumption is highly correlated with passenger flow, time periods, and scheduling strategies, exhibiting strong periodicity and non-stationarity.

In summary, existing research on time-series forecasting has made substantial progress in model architecture design. Some studies adopt channel-independent modeling or parameter-sharing strategies to mitigate channel interference in multivariate time series. Other works introduce patching or segmentation mechanisms to enhance local temporal semantic modeling. In addition, several methods employ convolutional or multi-scale structures to capture dynamics at different temporal scales. However, most of these approaches focus on only one or a subset of these mechanisms. They are mainly designed for generic time-series forecasting tasks. A summary of representative prior works is provided in Table 1.

In the field of energy consumption forecasting, especially for multi-line metro energy prediction, existing methods tend to emphasize feature engineering or isolated architectural improvements. They rarely consider the “weakly correlated” nature among different metro lines. The coexistence of strong periodicity and non-stationarity is also often overlooked. To address these limitations, the proposed PCformer integrates three mechanisms within the Informer framework: channel-independent modeling with parameter sharing, patch-based modeling, and multi-scale convolution-augmented self-attention. The model is specifically designed and validated for multi-line metro energy forecasting. As a result, PCformer achieves significant improvements in both prediction accuracy and generalization performance.

3. Methods

Metro energy forecasting involves predicting future consumption using past operational data while accounting for correlations among different lines. Given a set of metro network energy consumption data, we represent it as a multivariate time series sample. Let the look-back window length be L, with the sample denoted as

(x_{1}, \dots, x_{L})

, where each time step

x_{t}

is an M-dimensional vector, and M represents the number of metro lines in the network. The objective is to predict the energy consumption values for the next T time steps,

(x_{L + 1}, \dots, x_{L + T})

.

As shown in Figure 2, the proposed PCformer builds upon the Informer backbone and is tailored for metro energy data through three major upgrades: channel-disentangled modeling, patch-wise segmentation, and convolution-augmented attention, which together improve the extraction of periodic behaviors and inter-line dependencies.

3.1. Introduction to Informer

The Transformer excels in sequence modeling but faces scalability issues due to its heavy structure, costly attention, and slow decoding. Informer [59], as an efficient variant, mitigates these bottlenecks and delivers superior performance on long-sequence forecasting tasks.

In metro energy forecasting, long-horizon prediction supports proactive scheduling and cost optimization, motivating the adoption of Informer as the baseline framework.

As illustrated in Figure 3, Informer follows an encoder–decoder architecture, incorporating two major innovations:

ProbSparse Self-Attention Mechanism
The authors of Informer observe that the probability distribution of self-attention exhibits a long-tail property. Based on this observation, they aim to reduce attention complexity by retaining only the most important Query–Key interactions. Therefore, ProbSparse Self-Attention is proposed, and its core formulation is given as:

$A (Q, K, V) = S o f t m a x (\frac{\bar{Q} K^{T}}{\sqrt{d}}) V,$

(1)

where Q, K, and V denote the Query, Key, and Value matrices, respectively, and d is the feature dimension. $\bar{Q}$ represents a sparse query matrix formed by selecting the Top-u dominant queries according to Informer’s sparsity measurement. This design preserves only the most critical query–key interactions. The value of u is determined jointly by the sampling factor and the input sequence length. More details can be found in the original Informer paper.
Layer-wise Distilling Operation
To mitigate redundancy in the encoder, Informer introduces a layer-wise distilling mechanism composed of a one-dimensional convolution (Conv1D), an activation function (ELU, Exponential Linear Units), and a max-pooling layer (stride = 2). This process extracts the dominant temporal features while progressively reducing the sequence length. The transformation from the i-th to the $i + 1$ -th layer can be formulated as:

$X_{j + 1}^{t} = M a x P o o l (E L U (C o n v 1 d ({[X_{j}^{t}]}_{A B}))),$

(2)

which effectively halves the computational cost per layer while preserving the essential temporal information. Here, ${[X_{j}^{t}]}_{A B}$ denotes the output feature of the attention block at the j-th encoder layer. The operation compresses the sequence length step by step following the order Conv1D → ELU → MaxPool. Since the MaxPool stride is set to 2, the sequence length is reduced from L to approximately $\frac{L}{2}$ after each distilling step. Equation (2) explains the key mechanism that enables Informer to efficiently handle long sequences. More detailed implementations and derivations can be found in the original Informer paper.

In summary, by integrating probsparse self-attention and layer-wise feature distillation, Informer achieves lower computational complexity, higher modeling efficiency, and improved forecasting accuracy in long-sequence prediction tasks. These advantages establish a strong and efficient foundation for the proposed PCformer model.

3.2. Channel-Independent Modeling

In multivariate time series, each channel can be treated as an independent signal, where channel independence preserves single-channel features and channel mixing projects all channels into a shared space for information fusion. Traditional Transformer architectures (including Informer) typically adopt the channel-mixing paradigm, which is effective for capturing global dependencies but exhibits notable limitations when handling data with strong inter-channel heterogeneity.

In the metro network energy forecasting task, although the overall system energy data can be formalized as a multivariate time series, the meaning of each channel differs from that in conventional sensor data. Specifically, each column in the dataset represents the total energy consumption of one metro line over time. These channels (lines) do not exhibit direct causal dependencies, as each line has distinct operational environments, segment lengths, passenger flows, and equipment configurations. However, they share similar external periodic factors, such as city-wide operation schedules, weather conditions, and weekday–weekend patterns, leading to statistical correlations and global periodic constraints across lines.

Under this “weakly correlated but non-causal” multi-channel structure, conventional channel-mixing modeling introduces three major drawbacks:

Dilution of local features: The mixing process tends to obscure fine-grained temporal variations within individual lines, such as sharp energy peaks during rush hours.
Increased overfitting risk: The model may capture spurious correlations among lines, thereby reducing its generalization capability.
Reduced computational efficiency: As the number of lines increases, the input dimensionality grows linearly, while the attention computation complexity increases quadratically, resulting in higher GPU memory usage and longer training time.

These issues motivate the adoption of channel-independent modeling, which preserves intra-line dynamics, mitigates overfitting, and enhances computational efficiency for metro energy forecasting.

To address the aforementioned issues, this study proposes a channel-independent modeling mechanism specifically designed for the metro network energy forecasting task, aiming to enhance the model’s interpretability and prediction stability, as illustrated in Figure 4. Specifically, let the energy consumption sequence of the i-th metro line start at time index 1 with a length of L:

x_{1 : L}^{(i)} = (x_{1}^{(i)}, x_{2}^{(i)}, \dots, x_{L}^{(i)}), i = 1, 2, \dots, M .

(3)

Given the multivariate input sequence

(x_{1}, x_{2}, \dots, x_{L})

, we decompose it into M univariate sequences, each corresponding to one metro line:

x^{(i)} \in R^{1 \times L} .

(4)

Under the channel-independent setting, each sequence is individually fed into a shared-parameter PCformer backbone, which independently outputs the future T-step predictions for each channel:

{\hat{x}}^{(i)} = ({\hat{x}}_{L + 1}^{(i)}, \dots, {\hat{x}}_{L + T}^{(i)}) \in R^{1 \times T} .

(5)

In this way, the forward computation processes across channels are independent, achieving a “channel-independent yet parameter-shared” modeling mechanism.

It should be noted that the mapping from input length L to prediction length T shown in Figure 4 is not a special domain transformation, but rather a standard implementation in time-series forecasting. Moreover, under the channel-independent yet parameter-shared modeling strategy, the original multivariate energy consumption sequence

x \in R^{M \times L}

is decomposed into M univariate sequences, each processed independently by the same PCformer model with shared parameters. This design is based on the assumption that metro lines operate largely independently while exhibiting similar operational patterns. However, when strong couplings exist among lines or when different lines display highly distinct behaviors, a purely channel-independent model may be insufficient to capture cross-line dependencies. Therefore, this represents an interesting direction for future research.

3.3. Patch-Based Modeling

Metro energy consumption data typically exhibit strong periodic patterns, particularly a weekly cycle (7 days) characterized by distinct fluctuations in energy demand. This periodicity is largely influenced by factors such as passenger flow, operating schedules, and holiday adjustments, leading to a typical pattern of higher energy usage on weekdays and lower usage on weekends.

However, most existing studies primarily focus on global trends over long look-back windows (e.g., 30 days or more) and model the entire sequence under a unified temporal scale. Such approaches often fail to capture fine-grained variations within shorter local cycles, thereby overlooking dynamic changes in intra-week energy consumption. As a result, their predictive performance tends to deteriorate when faced with rapid periodic transitions—such as shifts between weekday and weekend modes—limiting their applicability in real-world metro energy management and scheduling.

To better capture local periodicity and dynamic variations, a patch-based convolutional module is introduced at the input stage for localized temporal modeling and channel-wise feature enhancement.

As shown in Figure 5, each input sequence

x^{(i)}

is first processed by a 1D convolution layer to extract initial features and broaden channel representation, improving sensitivity to energy fluctuations and holiday effects. Subsequently, a depthwise separable 1D convolution (DWConv1D) is introduced to perform the patching operation. The convolution kernel size is set to K and the stride to S, where S defines the non-overlapping interval between adjacent patches. After this operation, the sequence length is reduced from the original L to a shorter patched length

L_{p}

:

L_{p} = ⌊\frac{L - K}{S}⌋ + 1 .

(6)

Following the patching step, a linear layer is employed to achieve weighted fusion of features across different channels, allowing the model to capture interdependencies among local energy components—for instance, the energy flow patterns between main and branch lines or between peak and off-peak periods. Finally, another Conv1D layer is used to compress the multi-channel features back into a single-channel representation, yielding the final patched sequence embedding:

x_{p}^{(i)} = C o n v 1 d (L i n e a r (D W C o n v 1 d (C o n v 1 d (x^{(i)})))) .

(7)

This patch-based design compresses the input sequence from length L to

L_{p}

, yielding a quadratic decrease in memory usage and attention computation within later Transformer layers. This design not only provides clear advantages under limited GPU memory and training time constraints, but also enables the model to capture local periodic variations within longer historical windows, thereby balancing both long-term dependencies and short-term periodic dynamics in metro energy forecasting.

3.4. Convolution-Augmented Self-Attention

To capture both rapid variations and long-term dependencies in metro energy data, we enhance the Informer’s probsparse attention with a multi-scale convolutional module, as shown in Figure 6. In this structure, Key and Value features are derived from multi-scale 1D convolutions instead of linear projections, strengthening local temporal modeling.

For an input sequence

X \in R^{B \times L \times d_{m o d e l}}

, the original linear mapping:

K = X W_{K}, V = X W_{V},

(8)

is replaced by a multi-scale convolutional operation:

K_{s} = C o n v 1 d_{s} (X), V_{s} = C o n v 1 d_{s} (X), s \in (1, 3, 5, 7),

(9)

and the features extracted from different convolutional scales are fused through weighted summation:

K = \sum_{s} a_{s} K_{s}, V = \sum_{s} a_{s} V_{s}, \sum_{s} a_{s} = 1,

(10)

to obtain the final representation. The fusion weights

a_{s}

are learnable parameters and are updated jointly during training. To ensure non-negativity and enforce the constraint

\sum_{s} a_{s} = 1

, a softmax normalization is applied to the raw learnable weights

{\tilde{a}}_{s}

:

a_{s} = \frac{exp ({\tilde{a}}_{s})}{\sum_{s^{'}} exp ({\tilde{a}}_{s^{'}})} .

(11)

This allows the model to adaptively allocate the contributions of different convolutional scales based on the temporal characteristics of metro energy data. As a result, the model can effectively capture multi-scale energy consumption patterns while preserving the efficiency of ProbSparse Attention, achieving a unified representation of both short-term fluctuations and long-term trends.

3.5. Loss Function

In the metro energy consumption forecasting task, the energy consumption sequences exhibit strong non-stationary characteristics; during most periods, the variations are relatively smooth, but sharp spikes may occur during holidays, large-scale events, or unexpected system faults. When the model is trained using only the Mean Squared Error (MSE) loss, it becomes overly sensitive to extreme values; conversely, using only the Mean Absolute Error (MAE) reduces sensitivity to abrupt fluctuations, making it difficult to capture anomalous consumption patterns.

To address this issue, we design a hybrid loss function that combines MSE and MAE, balancing global trend fitting and local anomaly robustness.

In metro energy consumption forecasting, the consumption sequences exhibit pronounced non-stationary characteristics; while most periods show relatively smooth variations, sudden spike-like anomalies may occur during holidays, large-scale events, or unexpected system faults. The Mean Squared Error (MSE) amplifies large deviations through a squared penalty, which encourages the model to learn a smooth global trend but also makes it sensitive to outliers, as extreme values can disrupt overall trend fitting. In contrast, the Mean Absolute Error (MAE) provides constant gradients across all error magnitudes and is therefore less affected by abrupt anomalies, offering stronger robustness during training; however, relying solely on MAE may weaken the model’s ability to capture subtle variations and overall trend structures.

To address this issue, we design a hybrid loss function that integrates MSE and MAE, with the weighting coefficients

α

and

β

controlling their contributions. This formulation allows the model to achieve stable global trend learning while avoiding excessive interference from abnormal fluctuations.

Let the predicted value be

\hat{x}

and the ground truth be x; the joint loss function is defined as:

L_{t o t a l} = α \cdot M S E + β \cdot M A E,

(12)

where

M S E = \frac{1}{M} \sum_{i = 1}^{M} {∥ {\hat{x}}_{L + 1 : L + T}^{(i)} - x_{L + 1 : L + T}^{(i)} ∥}_{2}^{2},

(13)

M A E = \frac{1}{M} \sum_{i = 1}^{M} ∥ {\hat{x}}_{L + 1 : L + T}^{(i)} - x_{L + 1 : L + T}^{(i)} ∥,

(14)

and the coefficients

α

and

β

serve as balancing factors that regulate the trade-off between trend fitting and robustness to outliers.

In our experiments, the hybrid loss with

α = 0.5

and

β = 0.5

is adopted as the optimization objective during model training. It is worth noting that, consistent with common practice in time-series forecasting research, we evaluate model performance using standalone MSE and MAE metrics to ensure standardized and objective cross-model comparison. Therefore, although the hybrid loss enhances training stability and improves the model’s generalization capability, its numerical value is not reported as an evaluation metric.

This joint loss formulation enables the PCformer model to learn overall energy consumption trends accurately while mitigating the negative impacts of abrupt anomalies during training, thereby improving its stability and generalization performance under complex metro operation scenarios.

4. Experiment

The data were independently collected by the authors from an operational metro energy monitoring system. The dataset consist of daily energy usage from 16 metro lines in a major city in China, spanning July to December 2021 (182 days in total), with 80% of samples allocated for training and 20% for testing. Due to confidentiality requirements, the actual names and identifiers of the metro lines are not disclosed; instead, they are anonymized and denoted as Line 1 through Line 16 for analysis and presentation purposes.

4.1. Implementation Details

In PCformer, an overlapping patching strategy is applied to the input sequence to strengthen local temporal representations, with the sequence length set to 28 (four weeks). The patch size and stride were set to 7 and 1, producing a sequence length of 22. The model was trained for 100 epochs with a warm-up (from

5 \times 10^{- 7}

to

5 \times 10^{- 4}

) followed by cosine decay, using a batch size of 64 and weight decay of 0.05 for better generalization. All experiments were conducted in Python 3.8 with PyTorch 1.9.1 on Ubuntu 18.04, utilizing CUDA 11.8 acceleration on an RTX 2070 Super (8 GB) with an i7-10700K CPU and 32 GB RAM.

Model accuracy was assessed using Mean Squared Error (MSE) and Mean Absolute Error (MAE), where MSE highlights large deviations and MAE reflects the average absolute prediction gap. In general, smaller values of both metrics indicate better model fitting and higher forecasting accuracy. The mathematical formulations are as follows:

M S E = \frac{1}{M} \sum_{i = 1}^{M} {({\hat{y}}_{i} - y_{i})}^{2}, M A E = \frac{1}{M} \sum_{i = 1}^{M} | {\hat{y}}_{i} - y_{i} | .

(15)

Here,

y_{i}

and

{\hat{y}}_{i}

represent the actual and predicted metro energy values, with M denoting the number of samples. Using both MSE and MAE allows evaluation of the model’s capacity to capture overall trends and respond to sudden variations, offering a balanced measure of accuracy and robustness.

4.2. Comparison with the State of the Art

The superior performance of PCformer mainly arises from three structural innovations tailored to metro-energy forecasting. First, the channel-independent modeling mechanism effectively eliminates non-causal interference among different lines. It enables independent temporal modeling for each line while sharing parameters to improve computational efficiency. Second, the patch-based module segments sequences to capture fine-grained periodic patterns like weekday–weekend shifts while maintaining long-term trends. Third, the multi-scale convolution-augmented attention structure replaces linear projections with multi-scale one-dimensional convolutions, strengthening the joint modeling of short-term variations and long-term dependencies. The combination of these three designs allows PCformer to effectively characterize the complex periodicity and non-stationarity of metro energy consumption, resulting in superior accuracy and robustness compared with existing methods.

To validate PCformer’s effectiveness, we compared it with eight representative models: LSTM [14], BiLSTM [14], GRU [15], BiGRU [15], Transformer [16], Informer [20], ARMA-CNNLSTM [60], Mamba [61], and MTM-LSTM—trained [62] from scratch without pre-trained weights.

As shown in Table 2, under the setting where both the input and prediction windows are 28 steps, PCformer achieves the best performance with MSE = 0.043 and MAE = 0.145. The “Avg.” row represents the arithmetic mean of the results across the 16 metro lines. Compared with the strongest baseline models for each metric (i.e., the second-best results), PCformer reduces MSE and MAE by 41.9% and 19.8%, respectively. In addition, the standard deviation is reported to illustrate the variability of prediction errors across different metro lines. As shown in the table, PCformer achieves the lowest prediction error on almost all metro lines, with only a slight performance drop on Line 6. This indicates that PCformer maintains strong predictive accuracy and generalization capability across diverse operating conditions.

4.3. Ablation Studies

Table 3 summarizes the energy consumption forecasting results of different models under multiple prediction lengths (14, 28, 42, and 56). When the look-back window is fixed at 28, the overall prediction accuracy of all models gradually decreases as the prediction horizon increases. Although long-horizon forecasts often degrade due to error accumulation and information loss, PCformer maintains the lowest MSE and MAE across all settings, showing stable and generalizable performance.

Compared with baseline models, PCformer exhibits a slower degradation rate as the prediction horizon extends. This robustness is primarily attributed to the patch-based modeling mechanism and the multi-scale convolution-augmented attention structure, which jointly enable the model to maintain effective feature representation over longer time spans and accurately capture long-term dependencies without overfitting. These results confirm that PCformer maintains high prediction accuracy and stability under different forecasting intervals. It shows strong practical adaptability, making it well suited for dynamic prediction requirements in real-world metro energy forecasting applications.

To validate the rationality of the patch-based hyperparameter settings, we conduct ablation experiments under

L = 28

. Only the patch length P and stride S are varied. All other settings remain unchanged. MSE and MAE are used as evaluation metrics (see Table 4). When fixing

S = 1

, the errors consistently decrease as P increases from 3 to 5 and 7. Among these settings,

P = 7

achieves the best performance. This is because it matches the weekly 7-day periodicity of metro energy consumption. When fixing

P = 7

, the best performance is obtained at

S = 1

(MSE = 0.0430, MAE = 0.1452). As S increases to 3, 5, and 7, the prediction errors rise significantly. This indicates that larger strides, especially non-overlapping patches, reduce the model’s ability to preserve local information and thus degrade forecasting accuracy. Overall, the setting

P = 7

and

S = 1

yields the lowest MSE and MAE for PCformer.

4.4. Transfer Learning

To assess generalization, transfer experiments trained PCformer on twelve metro lines and tested it on four unseen ones (Lines 5, 9, 12, and 15). This setup evaluates the model’s adaptability to new lines without prior exposure. This experiment aims to assess whether the model can learn new energy consumption patterns and accurately capture temporal dependencies under conditions with no prior knowledge of the target lines.

Table 5 shows that PCformer achieves the best overall performance in the cross-line transfer experiments, with MSE and MAE of 0.1001 and 0.2034, respectively. The “Avg.” row represents the arithmetic mean of the results across different metro lines. Compared with the strongest baseline models for each metric (i.e., the second-best results), PCformer still demonstrates clear advantages in the cross-line forecasting task. In addition, the standard deviation is reported to reflect the variability of prediction errors across different transfer lines. The results show that even on completely unseen metro lines, PCformer maintains high prediction accuracy and stable performance, indicating strong cross-line generalization and transferability.

This performance advantage mainly arises from the channel-independent modeling mechanism, which enables each line to be learned as an independent temporal sequence while sharing parameters across channels. This design mitigates overfitting to particular lines and enhances adaptability to unseen routes. In addition, the patch-based modeling mechanism and the multi-scale convolution-augmented attention module work together to capture transferable temporal structures shared among different lines. Overall, the results indicate that PCformer not only performs strongly on known lines but also maintains stable performance when applied to new metro lines, demonstrating excellent scalability and practical applicability for large-scale metro energy forecasting.

5. Discussion

To provide an intuitive comparison of model performance in practical metro energy forecasting scenarios, Figure 7 illustrates the prediction results of different models on Line 1 of the metro energy consumption dataset used in this study. PCformer closely aligns with actual energy trends, showing minimal deviation and smoother residuals.

Across all experiments, PCformer surpasses leading forecasting baselines, showing faster convergence, stronger stability, and higher accuracy. Its channel-disentangled structure, patch-based representation, and multi-scale convolutional attention jointly capture both rapid variations and long-range dependencies in metro energy data. Unlike recurrent models such as LSTM and GRU, which have difficulty handling intricate temporal relationships, and standard Transformers that often overfit or incur high computational costs on multivariate data, PCformer adopts a lightweight multi-scale architecture that effectively balances modeling power and efficiency.

In summary, the proposed PCformer demonstrates excellent adaptability and scalability across different metro lines and time horizons. It provides a robust and efficient framework for large-scale metro energy forecasting and offers strong potential for real-world applications in urban rail energy management and intelligent power scheduling.

6. Conclusions

In this research, we tackled the issue of precisely estimating metro energy demand under complex operational dynamics marked by pronounced periodic trends and cross-line diversity. To overcome these challenges, we introduce PatchConvFormer (PCformer)—an Informer-inspired predictive framework that incorporates three principal innovations: (1) a channel-disentangled modeling scheme to suppress cross-line interference; (2) a patch-level temporal decomposition to capture fine-grained periodic signals; and (3) a multi-scale convolution-enhanced attention unit that jointly encodes both transient variations and extended temporal relations.

Extensive evaluations using real operational records from 16 metro routes in a major city in China verify that PCformer persistently surpasses leading time-series predictors in both accuracy and reliability. The framework converges more rapidly, maintains stronger training stability, and attains higher forecasting precision, demonstrating excellent adaptability to varying time granularities and unseen line scenarios.

By harmonizing model transparency with computational tractability, PCformer offers a deployable and efficient paradigm for metro energy forecasting and intelligent scheduling within urban transit systems. Future work will explore the incorporation of exogenous influences—including passenger traffic intensity, meteorological variables, and calendar anomalies—and the transferability of the framework to broader energy management domains.

Author Contributions

Conceptualization, L.L. (Liheng Long), L.L. (Linlin Li) and F.F.; methodology, L.Z., Q.F. and R.Z. (Runzong Zou); software, Q.F., R.Z. (Runzong Zou) and R.Z. (Ronghui Zhang); validation, L.Z., F.F. and R.Z. (Ronghui Zhang); formal analysis, L.L. (Linlin Li), R.Z. (Runzong Zou) and R.Z. (Ronghui Zhang); investigation, L.L. (Liheng Long), L.Z. and R.Z. (Runzong Zou); resources, L.L. (Linlin Li), F.F. and R.Z. (Ronghui Zhang); data curation, L.Z., F.F. and R.Z. (Ronghui Zhang); writing—original draft preparation, L.L. (Liheng Long), R.Z. (Runzong Zou) and R.Z. (Ronghui Zhang); writing—review and editing, L.L. (Linlin Li), R.Z. (Runzong Zou) and R.Z. (Ronghui Zhang); visualization, L.Z., F.F. and R.Z. (Ronghui Zhang); supervision, L.L. (Liheng Long), Q.F. and R.Z. (Ronghui Zhang); project administration, L.L. (Linlin Li), Q.F. and R.Z. (Ronghui Zhang). All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Grant No. 62203479), the Guangdong Basic and Applied Basic Research Foundation (Grant No. 2022A1515110891), and the Key Scientific Research Projects of China Association of Metros (No. CAMET-KY-202205).

Data Availability Statement

Data will be made available on request.

Conflicts of Interest

Authors Liheng Long, Linlin Li and Lijie Zhang were employed by the company Guangzhou General Design & Contracting Department, Guangzhou Metro Design & Research Institute Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:

PCformer	PatchCovFormer, Patch-based and Convolution-Agumented Transformer
MSE	Mean Squared Error
MAE	Mean Absolute Error
IEA	International Energy Agency
ARIMA	Autoregressive Integrated Moving Average model
SARIMA	Seasonal Autoregressive Integrated Moving Average model
SVR	Support Vector Regression
RF	Random Forests
CNN	Convolutional Neural Networks
RNN	Recurrent Neural Networks
LSTM	Long Short-Term Memory
GRU	Gated Recurrent Unit
FEDformer	Frequency Enhanced Decomposed Transformer
ELU	Exponential Linear Units
GPU	Graphics Processing Unit
DWConv	Depthwise Convolution
CUDA	Compute Unified Device Architecture
CPU	Central Processing Unit

References

Han, D.; Wu, S. The capitalization and urbanization effect of subway stations: A network centrality perspective. Transp. Res. Part A Policy Pract. 2023, 76, 103815. [Google Scholar] [CrossRef]
Su, W.; Li, X.; Zhang, Y.; Zhang, Q.; Wang, T.; Magdziarczyk, M.; Smolinski, A. High-speed rail, technological improvement, and carbon emission efficiency. Transp. Res. Part D Transp. Environ. 2025, 142, 104685. [Google Scholar] [CrossRef]
Zhang, Y.; Zhao, N.; Li, M.; Xu, Z.; Wu, D.; Hillmansen, S.; Tsolakis, A.; Blacktop, K.; Roberts, C. A techno-economic analysis of ammonia-fuelled powertrain systems for rail freight. Transp. Res. Part D Transp. Environ. 2023, 119, 103739. [Google Scholar] [CrossRef]
Feng, Z.; Chen, W.; Liu, Y.; Chen, H.; Skibniewski, M.J. Long-term equilibrium relationship analysis and energy-saving measures of metro energy consumption and its influencing factors based on cointegration theory and an ARDL model. Energy 2023, 263, 125965. [Google Scholar] [CrossRef]
Zheng, S.; Liu, Y.; Xia, W.; Cai, W.; Liu, H. Energy consumption optimization through prediction models in buildings using deep belief networks and a modified version of big bang-big crunch theory. Build. Environ. 2025, 279, 112973. [Google Scholar] [CrossRef]
Singh, S.K.; Das, A.K.; Singh, S.R.; Racherla, V. Prediction of rail-wheel contact parameters for a metro coach using machine learning. Expert Syst. Appl. 2023, 215, 119343. [Google Scholar] [CrossRef]
Domala, V.; Kim, T. Application of Empirical Mode Decomposition and Hodrick Prescot filter for the prediction single step and multistep significant wave height with LSTM. Ocean. Eng. 2023, 285, 115229. [Google Scholar] [CrossRef]
Cao, W.; Yu, J.; Chao, M.; Wang, J.; Yang, S.; Zhou, M.; Wang, M. Short-term energy consumption prediction method for educational buildings based on model integration. Energy 2023, 283, 128580. [Google Scholar] [CrossRef]
Li, M.; Zhang, P.; Xing, W.; Zheng, Y.; Zaporojets, K.; Chen, J.; Zhang, R.; Zhang, Y.; Gong, S.; Hu, J.; et al. A Survey of Large Language Models for Data Challenges in Graphs. Expert Syst. Appl. 2025, 225, 129643. [Google Scholar] [CrossRef]
Zhang, R.; Zou, R.; Zhao, Y.; Zhang, Z.; Chen, J.; Cao, Y.; Hu, C.; Song, H. BA-Net: Bridge Attention in Deep Neural Networks. Expert Syst. Appl. 2025, 292, 128525. [Google Scholar] [CrossRef]
Zhao, Y.; Chen, J.; Zhang, Z.; Zhang, R. BA-Net: Bridge attention for deep convolutional neural networks. In Proceedings of the European Conference on Computer Vision (ECCV), Tel Aviv, Israel, 23–27 October 2022; pp. 297–312. [Google Scholar] [CrossRef]
Krizhevsky, A.; Sutskever, I.; Hinton, G.E. Imagenet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Lake Tahoe, NV, USA, 3–6 December 2012; Volume 25. [Google Scholar]
Jordan, M.I. Serial order: A parallel distributed processing approachn. Adv. Psychol. 1997, 121, 471–495. [Google Scholar] [CrossRef]
Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
Cho, K.; Van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv 2014, arXiv:1406.1078. [Google Scholar] [CrossRef]
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Long Beach, CA, USA, 4–6 December 2017; Volume 30. [Google Scholar]
Kalyan, K.S.; Rajasekharan, A.; Sangeetha, S. Ammus: A survey of transformer-based pretrained models in natural language processing. arXiv 2021, arXiv:2108.05542. [Google Scholar]
Khan, S.; Naseer, M.; Hayat, M.; Zamir, S.W.; Khan, F.S.; Shah, M. Transformers in vision: A survey. ACM Comput. Surv. (CSUR) 2022, 54, 1–41. [Google Scholar] [CrossRef]
Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in time series: A survey. arXiv 2022, arXiv:2202.07125. [Google Scholar]
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Virtual, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar] [CrossRef]
Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Virtually, 6–14 December 2021; Volume 34, pp. 22419–22430. [Google Scholar]
Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. In Proceedings of the International Conference on Machine Learning (ICML), Baltimore, MD, USA, 17–23 July 2022; pp. 27268–27286. [Google Scholar]
Xu, J.; Chen, T.; Yuan, J.; Fan, Y.; Li, L.; Gong, X. Ultra-Short-Term wind power prediction based on spatiotemporal contrastive learning. Electronics 2025, 14, 3373. [Google Scholar] [CrossRef]
Zi, X.; Liu, F.; Liu, M.; Wang, Y. Transformer with Adaptive Sparse Self-Attention for Short-Term Photovoltaic Power Generation Forecasting. Electronics 2025, 14, 3981. [Google Scholar] [CrossRef]
Pavlatos, C.; Makris, E.; Fotis, G.; Vita, V.; Mladenov, V. Enhancing electrical load prediction using a bidirectional LSTM neural network. Electronics 2023, 12, 4652. [Google Scholar] [CrossRef]
Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), Minneapolis, MN, USA, 2–7 June 2019; Volume 1 (long and short papers), pp. 4171–4186. [Google Scholar] [CrossRef]
Nie, Y. A Time Series is Worth 64Words: Long-term Forecasting with Transformers. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Tang, P.; Zhang, W. Unlocking the Power of Patch: Patch-Based MLP for Long-Term Time Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 12640–12648. [Google Scholar] [CrossRef]
Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representations (ICLR), Vienna, Austria, 3–7 May 2021. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual Event, 11–17 October 2021; pp. 10012–10022. [Google Scholar]
Beyer, L.; Izmailov, P.; Kolesnikov, A.; Caron, M.; Kornblith, S.; Zhai, X.; Minderer, M.; Tschannen, M.; Alabdulmohsin, I.; Pavetic, F. Flexivit: One model for all patch sizes. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 14496–14506. [Google Scholar]
Ronen, T.; Levy, O.; Golbert, A. Vision transformers with mixed-resolution tokenization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 4613–4622. [Google Scholar]
Chen, M.; Lin, M.; Li, K.; Shen, Y.; Wu, Y.; Chao, F.; Ji, R. Cf-vit: A general coarse-to-fine method for vision transformer. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 7042–7052. [Google Scholar] [CrossRef]
Wang, Y.; Du, B.; Wang, W.; Xu, C. Multi-tailed vision transformer for efficient inference. Neural Netw. 2024, 174, 106235. [Google Scholar] [CrossRef]
Hu, Y.; Cheng, Y.; Lu, A.; Cao, Z.; Wei, D.; Liu, J.; Li, Z. LF-ViT: Reducing spatial redundancy in vision transformer for efficient image recognition. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), Vancouver, BC, Canada, 20–27 February 2024; Volume 38, pp. 2274–2284. [Google Scholar] [CrossRef]
Wu, H.; Xiao, B.; Codella, N.; Liu, M.; Dai, X.; Yuan, L.; Zhang, L. Cvt: Introducing convolutions to vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual Event, 11–17 October 2021; pp. 22–31. [Google Scholar]
Yu, W.; Luo, M.; Zhou, P.; Si, C.; Zhou, Y.; Wang, X.; Feng, J.; Yan, S. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 10819–10829. [Google Scholar]
Yuan, K.; Guo, S.; Liu, Z.; Zhou, A.; Yu, F.; Wu, W. Incorporating convolution designs into visual transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Virtual Event, 11–17 October 2021; pp. 579–588. [Google Scholar]
Yuan, Y.; Fu, R.; Huang, L.; Lin, W.; Zhang, C.; Chen, X.; Wang, J. Hrformer: High-resolution transformer for dense prediction. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–14 December 2021; Volume 34, pp. 7281–7293. [Google Scholar]
Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
Dai, Z.; Liu, H.; Le, Q.V.; Tan, M. Coatnet: Marrying convolution and attention for all data sizes. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Online, 6–14 December 2021; Volume 34, pp. 3965–3977. [Google Scholar]
Li, K.; Wang, Y.; Zhang, J.; Gao, P.; Song, G.; Liu, Y.; Li, H.; Qiao, Y. Uniformer: Unifying convolution and self-attention for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 12581–12600. [Google Scholar] [CrossRef] [PubMed]
Yao, Z.; Liu, X. A cnn-transformer deep learning model for real-time sleep stage classification in an energy-constrained wireless device. In Proceedings of the 2023 11th International IEEE/EMBS Conference on Neural Engineering (NER), Baltimore, MD, USA, 25–27 April 2023; pp. 1–4. [Google Scholar] [CrossRef]
Liu, M.; Zeng, A.; Chen, M.; Xu, Z.; Lai, Q.; Ma, L.; Xu, Q. Scinet: Time series modeling and forecasting with sample convolution and interaction. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), New Orleans, LA, USA, 28 November–9 December 2022; Volume 35, pp. 5816–5828. [Google Scholar]
Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; Long, M. Timesnet: Temporal 2d-variation modeling for general time series analysis. In Proceedings of the International Conference on Learning Representations (ICLR), Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar]
Tian, Z.; Liu, W.; Jiang, W.; Wu, C. Cnns-transformer based day-ahead probabilistic load forecasting for weekends with limited data availability. Energy 2024, 293, 130666. [Google Scholar] [CrossRef]
Zhang, E.; Yuan, W.; Liu, X. ChannelMixer: A Hybrid CNN-Transformer Framework for Enhanced Multivariate Long-Term Time Series Forecasting. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; pp. 1–5. [Google Scholar] [CrossRef]
Zheng, P.; Zhou, H.; Liu, J.; Nakanishi, Y. Interpretable building energy consumption forecasting using spectral clustering algorithm and temporal fusion transformers architecture. Appl. Energy 2023, 349, 121607. [Google Scholar] [CrossRef]
Liu, J.; Yang, F.; Yan, K.; Jiang, L. Household energy consumption forecasting based on adaptive signal decomposition enhanced iTransformer network. Energy Build. 2024, 324, 114894. [Google Scholar] [CrossRef]
Wang, C.F.; Liu, K.X.; Peng, J.; Li, X.; Liu, X.F.; Zhang, J.W.; Niu, Z.B. High-precision energy consumption forecasting for large office building using a signal decomposition-based deep learning approach. Energy 2025, 314, 133964. [Google Scholar] [CrossRef]
Sreekumar, G.; Martin, J.P.; Raghavan, S.; Joseph, C.T.; Raja, S.P. Transformer-based forecasting for sustainable energy consumption toward improving socioeconomic living: AI-enabled energy consumption forecasting. IEEE Syst. Man Cybern. Mag. 2024, 10, 52–60. [Google Scholar] [CrossRef]
Xi, Y.; Gan, X.; Zhan, Z.; Deng, K. Energy Data Forecasting Based on the STRLM Time Series Prediction Model. In Proceedings of the 2025 10th International Conference on Electronic Technology and Information Science (ICETIS), Hangzhou, China, 27–29 June 2025; pp. 489–494. [Google Scholar] [CrossRef]
Rahn, K.; Bode, C.; Albrecht, T. Energy-efficient driving in the context of a communications-based train control system (CBTC). In Proceedings of the 2013 IEEE International Conference on Intelligent Rail Transportation Proceedings, Beijing, China, 30 August–1 September 2013; pp. 19–24. [Google Scholar] [CrossRef]
Sanchis, I.V.; Zuriaga, P.S. An energy-efficient metro speed profiles for energy savings: Application to the Valencia metro. Transp. Res. Procedia 2016, 18, 226–233. [Google Scholar] [CrossRef]
Peng, J.; Kimmig, A.; Wang, D.; Niu, Z.; Liu, X.; Tao, X.; Ovtcharova, J. Energy consumption forecasting based on spatio-temporal behavioral analysis for demand-side management. Appl. Energy 2024, 374, 124027. [Google Scholar] [CrossRef]
Moon, J. A multi-step-ahead photovoltaic power forecasting approach using one-dimensional convolutional neural networks and transformer. Electronics 2024, 13, 2007. [Google Scholar] [CrossRef]
Fu, H.; Zhang, J.; Xie, S. A novel improved variational mode decomposition-temporal convolutional network-gated recurrent unit with multi-head attention mechanism for enhanced photovoltaic power forecasting. Electronics 2024, 13, 1837. [Google Scholar] [CrossRef]
Wang, H.; Guo, M.; Tian, L. A deep learning model with signal decomposition and informer network for equipment vibration trend prediction. Sensors 2023, 23, 5819. [Google Scholar] [CrossRef]
He, K.; Yang, Q.; Ji, L.; Pan, J.; Zou, Y. Financial time series forecasting with the deep learning ensemble model. Mathematics 2023, 11, 1054. [Google Scholar] [CrossRef]
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
Mohammadi, M.; Jamshidi, S.; Rezvanian, A.; Gheisari, M.; Kumar, A. Advanced fusion of MTM-LSTM and MLP models for time series forecasting: An application for forecasting the solar radiation. Meas. Sens. 2024, 33, 101179. [Google Scholar] [CrossRef]

Figure 1. The total energy consumption of Metro Line 1 (excerpt). The data cover the period from 1 July to 30 September 2021, totaling 92 days. Each pair of red dashed lines indicates the interval of one week.

Figure 2. PCformer model overview.

Figure 3. Informer model architecture.

Figure 4. Channel independence modeling process.

Figure 5. Patch modeling process.

Figure 6. Multi-scale convolution-augmented module.

Figure 7. Visualization of prediction results of each model on Metro Line 1 (a) PCformer; (b) MTM-LSTM; (c) Mamba; (d) ARMA-CNNLSTM; (e) Informer; (f) Transformer; (g) BiGRU; (h) GRU; (i) BiLSTM; (j) LSTM.

Table 1. Comparison of modeling mechanisms in related works.

Method	Backbone	Channel Modeling	Patching Modeling	Convolution Multi-Scale
PatchTST [27]	Transformer	Independent	✔	×
PatchMLP [28]	Transformer	Mixed	✔	×
Informer [20]	Transformer	Mixed	×	×
TimesNet [45]	CNN-based	Mixed	×	✔
SCINet [44]	CNN	Mixed	×	✔
DECPE-TFT [49]	Transformer	Mixed	×	×
ASSA-iTransformer [50]	Transformer	Mixed	×	×
SPAformer [51]	Transformer	Mixed	✔	×
Rahn et al. [54]	Regression models	-	×	×
Sanchis et al. [55]	Regression models	-	×	×
PCformer (Ours)	Transformer	Independent (param. sharing)	✔	✔

Table 2. Multivariate long-term forecasting results obtained using PCformer. Prediction horizon is set to

T = 28

.

Table 2. Multivariate long-term forecasting results obtained using PCformer. Prediction horizon is set to

T = 28

.

Models	PCformer	MTM-LSTM	Mamba	ARMA-CNNLSTM	Informer	Transformer	BiGRU	GRU	BiLSTM	LSTM
Models	(Ours)	(2024)	(2024)	(2023)	(2021)	(2017)	(2014)	(2014)	(1997)	(1997)
Metric	MSE MAE	MSE MAE	MSE MAE	MSE MAE	MSE MAE	MSE MAE	MSE MAE	MSE MAE	MSE MAE	MSE MAE
Line 1	0.017 0.096	0.055 0.186	0.031 0.167	0.041 0.174	0.028 0.131	0.044 0.246	0.051 0.157	0.082 0.297	0.070 0.271	0.089 0.337
Line 2	0.015 0.112	0.058 0.195	0.032 0.174	0.043 0.182	0.037 0.142	0.044 0.235	0.054 0.164	0.086 0.310	0.073 0.283	0.093 0.352
Line 3	0.010 0.091	0.029 0.164	0.016 0.147	0.021 0.153	0.011 0.077	0.027 0.254	0.027 0.139	0.043 0.262	0.036 0.239	0.046 0.297
Line 4	0.058 0.195	0.190 0.257	0.107 0.230	0.141 0.240	0.120 0.262	0.124 0.152	0.177 0.217	0.284 0.410	0.242 0.374	0.308 0.465
Line 5	0.041 0.223	0.089 0.232	0.050 0.207	0.066 0.216	0.041 0.160	0.046 0.230	0.082 0.195	0.132 0.369	0.113 0.337	0.143 0.419
Line 6	0.060 0.223	0.146 0.241	0.082 0.216	0.108 0.225	0.053 0.202	0.115 0.134	0.136 0.203	0.219 0.384	0.186 0.350	0.237 0.436
Line 7	0.129 0.201	0.307 0.274	0.173 0.245	0.228 0.255	0.165 0.227	0.157 0.231	0.285 0.231	0.459 0.436	0.391 0.398	0.497 0.495
Line 8	0.113 0.272	0.525 0.516	0.296 0.462	0.390 0.481	0.313 0.437	0.485 0.584	0.488 0.436	0.786 0.822	0.668 0.751	0.850 0.934
Line 9	0.006 0.067	0.026 0.118	0.015 0.106	0.019 0.110	0.013 0.093	0.028 0.136	0.024 0.100	0.039 0.189	0.033 0.172	0.042 0.214
Line 10	0.013 0.077	0.035 0.133	0.020 0.119	0.026 0.124	0.019 0.123	0.020 0.128	0.032 0.112	0.052 0.212	0.044 0.193	0.056 0.241
Line 11	0.057 0.183	0.141 0.249	0.079 0.222	0.104 0.232	0.076 0.199	0.077 0.218	0.131 0.210	0.210 0.396	0.179 0.362	0.228 0.450
Line 12	0.045 0.140	0.118 0.197	0.067 0.176	0.088 0.183	0.061 0.162	0.076 0.174	0.110 0.166	0.177 0.313	0.151 0.286	0.192 0.356
Line 13	0.007 0.065	0.021 0.090	0.012 0.081	0.016 0.084	0.012 0.073	0.016 0.081	0.020 0.076	0.032 0.143	0.027 0.131	0.034 0.163
Line 14	0.009 0.062	0.028 0.098	0.016 0.088	0.021 0.091	0.012 0.062	0.026 0.120	0.026 0.083	0.042 0.156	0.036 0.142	0.045 0.177
Line 15	0.094 0.267	0.294 0.416	0.165 0.372	0.218 0.388	0.205 0.407	0.156 0.336	0.273 0.351	0.439 0.662	0.373 0.605	0.475 0.752
Line 16	0.016 0.101	0.039 0.144	0.022 0.129	0.029 0.134	0.024 0.133	0.016 0.113	0.036 0.121	0.058 0.229	0.049 0.209	0.063 0.260
Avg.	0.043 0.145	0.131 0.219	0.074 0.196	0.097 0.204	0.074 0.181	0.104 0.214	0.122 0.185	0.196 0.349	0.167 0.319	0.212 0.397
Std.	0.039 0.070	0.135 0.110	0.076 0.098	0.100 0.102	0.083 0.106	0.112 0.117	0.126 0.093	0.202 0.175	0.172 0.160	0.219 0.199

Note: The best results are highlighted in bold, while the second-best results are underlined.

Table 3. Performance comparison across different prediction lengths.

Prediction Length	14	28	42	56
Prediction Length	MSE MAE	MSE MAE	MSE MAE	MSE MAE
Transformer	0.0782 0.1703	0.1042 0.2142	0.1588 0.2517	0.1674 0.2991
Informer	0.0622 0.1592	0.0743 0.1806	0.1446 0.2425	0.1527 0.2517
Mamba	0.0409 0.1336	0.0738 0.1962	0.1227 0.2238	0.1741 0.3122
MTM-LSTM	0.0961 0.1912	0.1312 0.2193	0.1293 0.2342	0.1634 0.2548
PCformer (Ours)	0.0335 0.1168	0.0430 0.1452	0.0955 0.2021	0.1308 0.2223

Note: The best results are highlighted in bold, while the second-best results are underlined.

Table 4. Performance comparison across different patch length and stride (

L = 28

).

Table 4. Performance comparison across different patch length and stride (

L = 28

).

Patch Length (P)	Stride (S)	Patched Length ( $L_{p}$ )	MSE	MAE
3	1	26	0.1441	0.2374
5	1	24	0.0821	0.1951
7	1	22	0.0430	0.1452
7	3	8	0.1456	0.2410
7	5	5	0.2574	0.3286
7	7	4	0.2954	0.3459

Note: The best results are highlighted in bold.

Table 5. Evaluation of models’ transfer learning capability.

Method	PCformer (Ours)	Mamba	Informer	Transformer
Metric	MSE MAE	MSE MAE	MSE MAE	MSE MAE
Line 5	0.0192 0.1165	0.0459 0.1628	0.0777 0.2293	0.0871 0.2356
Line 9	0.0409 0.1884	0.0523 0.1685	0.0729 0.2160	0.0963 0.2383
Line 12	0.3094 0.3637	0.3761 0.4268	0.2563 0.3435	0.3677 0.4147
Line 15	0.0309 0.1451	0.0677 0.2185	0.0589 0.2029	0.0754 0.2252
Avg.	0.1001 0.2034	0.1355 0.2433	0.1165 0.2479	0.1566 0.2785
Std.	0.1211 0.0960	0.1391 0.1077	0.0810 0.0560	0.1221 0.0788

Note: The best results are highlighted in bold, while the second-best results are underlined.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Long, L.; Li, L.; Zhang, L.; Fu, Q.; Zou, R.; Feng, F.; Zhang, R. PatchConvFormer: A Patch-Based and Convolution-Augmented Transformer for Periodic Metro Energy Consumption Forecasting. Electronics 2026, 15, 178. https://doi.org/10.3390/electronics15010178

AMA Style

Long L, Li L, Zhang L, Fu Q, Zou R, Feng F, Zhang R. PatchConvFormer: A Patch-Based and Convolution-Augmented Transformer for Periodic Metro Energy Consumption Forecasting. Electronics. 2026; 15(1):178. https://doi.org/10.3390/electronics15010178

Chicago/Turabian Style

Long, Liheng, Linlin Li, Lijie Zhang, Qing Fu, Runzong Zou, Fan Feng, and Ronghui Zhang. 2026. "PatchConvFormer: A Patch-Based and Convolution-Augmented Transformer for Periodic Metro Energy Consumption Forecasting" Electronics 15, no. 1: 178. https://doi.org/10.3390/electronics15010178

APA Style

Long, L., Li, L., Zhang, L., Fu, Q., Zou, R., Feng, F., & Zhang, R. (2026). PatchConvFormer: A Patch-Based and Convolution-Augmented Transformer for Periodic Metro Energy Consumption Forecasting. Electronics, 15(1), 178. https://doi.org/10.3390/electronics15010178

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

PatchConvFormer: A Patch-Based and Convolution-Augmented Transformer for Periodic Metro Energy Consumption Forecasting

Abstract

1. Introduction

2. Related Work

2.1. Patch in Transformer Architecture

2.2. Convolution in Transformer Architecture

2.3. Deep Learning in Energy Consumption Forecasting

3. Methods

3.1. Introduction to Informer

3.2. Channel-Independent Modeling

3.3. Patch-Based Modeling

3.4. Convolution-Augmented Self-Attention

3.5. Loss Function

4. Experiment

4.1. Implementation Details

4.2. Comparison with the State of the Art

4.3. Ablation Studies

4.4. Transfer Learning

5. Discussion

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

Abbreviations

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI