Carbon Emission Forecasting Using Multi-Scale Temporal Patches

Xiong, Yuanhao; Wang, Meiling

doi:10.3390/app16042025


by Yuanhao Xiong and Meiling Wang *

College of Engineering Science and Technology, Shanghai Ocean University, Shanghai 201306, China

* Author to whom correspondence should be addressed.

Appl. Sci. 2026, 16(4), 2025; https://doi.org/10.3390/app16042025

Submission received: 21 January 2026 / Revised: 12 February 2026 / Accepted: 16 February 2026 / Published: 18 February 2026

(This article belongs to the Topic Applications of Artificial Intelligence in Sustainable Energy and Environment)


Abstract

Accurate carbon emission forecasting is essential for China’s dual-carbon targets and mitigation planning. However, many existing models struggle to capture long-range dependencies while remaining sensitive to short-term fluctuations. We evaluate State Space Transformer (SST) on a Rwanda dataset constructed from weekly Sentinel-5P observations. The resulting time series are noisy, weakly periodic, and heterogeneous across monitoring sites. SST forms interrelated temporal patches through Multi-Scale Temporal Patches (MSTP). It models low-frequency trends with a Mamba state space backbone and captures high-frequency disturbances using an enhanced Local Window Transformer (LWT). These design choices explicitly disentangle low-frequency trends from high-frequency perturbations in noisy observations, improving robustness to non-stationary remote-sensing sequences. Across forecasting horizons from 6 to 72 weeks, SST achieves an average MSE of 0.0331. It reduces MSE by approximately 3.5% compared with the strongest baseline, PatchTST, and consistently outperforms other baselines. With short input histories, SST remains stable for one-year-ahead forecasting (about 53 weeks), which is critical when historical records are limited in operational monitoring systems. Ablation studies further show that MSTP, Mamba, and LWT each contribute substantially to accuracy. Overall, SST-style multi-scale modeling is well suited to noisy monitoring data and supports sustainable planning and emission-trend analysis.

Keywords: carbon emission modeling; time series decomposition; MSTP; Mamba; LWT

1. Introduction

Amid the escalating climate crisis, carbon neutrality has become a central policy objective worldwide [1,2]. Accurate carbon emission forecasting supports mitigation planning and emissions management. Industrial carbon emissions are driven by interacting economic and technological factors. These include industrial structure, energy efficiency, and technological upgrading. Uneven regional development also plays an important role [3,4]. Consequently, emission time series are often non-stationary. They also exhibit pronounced cross-sector and cross-region heterogeneity. In addition, observations may be contaminated by monitoring noise. Noise can arise from estimation and reporting processes. These characteristics complicate multi-horizon forecasting and uncertainty assessment. Reliable forecasts can help anticipate potential fluctuations. They can also facilitate differentiated policy design [5,6].

Forecasting methods have gradually shifted from classical statistical approaches to data-driven learning frameworks [7]. Early studies commonly adopted Linear Regression [8], ARIMA [9], and the grey forecasting model GM(1,1) [10]. These approaches assume linear relationships and fixed structural forms. They are interpretable and relatively easy to implement, but they can be inadequate under nonlinear emission trajectories and may fail under policy-driven structural shifts. Emissions may change with adjustments in energy structure or with climate variability. To relax linear assumptions, researchers introduced SVR [11], RF [12], and ANN [13] for nonlinear mapping. Nevertheless, these methods typically lack explicit mechanisms for modeling long-term temporal dependencies and provide limited structure for capturing multi-scale dynamics.

Deep learning further expanded modeling capacity for complex emission time series. LSTM [14] and GRU [15] employ gating mechanisms. They help capture temporal dependence in sequential data. CNN–LSTM hybrids combine local feature extraction with recurrent modeling [16]. These architectures can fit nonlinear patterns and short-term dynamics. Yet several challenges remain for practical industrial carbon forecasting. First, recurrent memory may degrade for very long sequences. This weakens trend learning across years and transition periods. Second, CNN–LSTM models emphasize local patterns. They can be sensitive to scale shifts. This may reduce robustness under non-stationarity and extreme events. Third, emission data exhibit multi-source heterogeneity across sectors and regions. Many RNN- or CNN-based approaches lack explicit cross-sectional structure. This can limit generalization across industries and locations.

Transformers were introduced to strengthen long-range dependency modeling for time series forecasting [17]. However, standard self-attention has quadratic complexity in sequence length. This hinders long-horizon forecasting and high-frequency settings. Informer reduces attention cost via the ProbSparse mechanism [18]. It improves efficiency on long sequences. However, sparse attention may under-represent local perturbations. It may also reduce sensitivity near emission turning points. Autoformer incorporates autocorrelation and decomposes series into trend and seasonal components [19]. It performs well when periodic structures are stable. However, industrial emission series often contain non-periodic disturbances. They also include structural changes. Decomposition assumptions may introduce bias under complex dynamics. PatchTST adopts patch-based tokenization to shorten sequences [20]. It can enhance local learning for long-horizon forecasting. DLinear provides a strong linear baseline through decomposition and lightweight projections [21]. These models highlight benefits of decomposition and patching.

Recent work has explored Mamba for sequence modeling. It uses selective state space representations [22]. Several studies investigate Mamba-based forecasting variants. They analyze when Mamba can be competitive [23]. Bidirectional designs incorporate context from both directions [24]. Multi-block compositions stack Mamba modules for long-horizon prediction [25]. Foundation-model efforts examine scaling Mamba-style architectures for time series [26]. Despite these advances, direct application to industrial emission forecasting remains challenging. Many pipelines treat multivariate dimensions as generic channels. This is insufficient when heterogeneity is structured by industry and region. Moreover, many multi-scale solutions rely on manually selected resolutions. They may not reconcile long-term trends with short-term dynamics.

Local shocks and high-frequency perturbations are central to industrial carbon emission dynamics. Yet many local-attention designs implicitly assume neighborhood stationarity. They are not tailored to abrupt events or rapidly changing regimes. Studies in energy forecasting suggest frequency aggregation or hierarchical designs [27,28]. These approaches can handle sudden changes and regime shifts. Mixture-of-experts frameworks can address heterogeneity via specialized subnetworks [29]. State space modeling has improved reliability in safety-critical grid forecasting [30]. Motivated by these observations, we adopt the SST (Section 2). SST offers configurable patching and explicit trend–residual separation. It represents long-term components and short-term fluctuations within a multi-scale structure. SST also employs locality-aware encoding to preserve short-lived signals. This aims to improve robustness for industrial carbon emission forecasting.

In this work, we investigate SST for industrial carbon emission time series forecasting. SST combines Multi-Scale Temporal Patches with state space sequence modeling: a Mamba backbone represents global, low-frequency evolution, while a local-window Transformer models short-term variations and local shocks. This design separates trend information from residual fluctuations within each sequence. We evaluate SST across multiple horizons to assess both short- and long-term forecasting performance. We compare SST with statistical, recurrent, linear, and Transformer-based baselines, including PatchTST and DLinear. Finally, ablation studies isolate the contributions of the multi-scale structure, the state space backbone, and local attention.

2. Overall Architecture of SST

The State Space Transformer (SST) [31] integrates Mamba with a Transformer-style attention variant. In SST, the state space module is implemented with Mamba to enable efficient modeling of long sequences. Mamba is well suited for capturing gradual and step-wise shifts in emissions that may arise from policy interventions. To model local variations, SST employs windowed attention to emphasize nearby temporal context. This design strengthens the representation of industrial cycles, short-term peaks, and local fluctuations. The SST model is composed of five primary modules:

Multi-Scale Temporal Patches,
Global Patterns Expert,
Local Variations Expert,
Long-Short Router,
Forecasting Module.

As shown in Figure 1, the MSTP module converts the input series into multiple temporal resolutions based on data characteristics. The Global Patterns Expert extracts long-term patterns from lower-resolution sequences, while the Local Variations Expert captures short-term fluctuations from higher-resolution sequences. The Long-Short Router learns the contribution weights of the two experts and fuses global and local features accordingly. The fused representation is then fed into the Forecasting Module, which consists of linear layers and produces the final predictions.

2.1. Multi-Scale Temporal Patches—MSTP

The MSTP module segments the original temporal window into block-wise patches, aggregating the continuous time series into subsequence tokens for subsequent processing. It assigns different resolutions to long-range and short-range sequences. For each univariate series $X^{(i)} \in \mathbb{R}^{L \times 1}$, patching yields $X_{p}^{(i)} \in \mathbb{R}^{N \times P}$. Here, L is the original length, P is the patch length, and Str is the stride. The patch count N is computed using Equation (1):

N = \left\lfloor \frac{L - P}{\mathrm{Str}} \right\rfloor + 1

(1)
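
As a small illustration of Equation (1), the following sketch slices a univariate window into overlapping patches. The function names `patch_count` and `make_patches` are illustrative and not taken from the SST code base; the sketch uses plain Python lists rather than tensors.

```python
def patch_count(length, patch_len, stride):
    """Equation (1): N = floor((L - P) / Str) + 1."""
    if patch_len > length:
        raise ValueError("patch length P must not exceed window length L")
    return (length - patch_len) // stride + 1


def make_patches(series, patch_len, stride):
    """Slice a univariate window into N (possibly overlapping) patches of length P."""
    n = patch_count(len(series), patch_len, stride)
    return [series[k * stride: k * stride + patch_len] for k in range(n)]


# Toy window with L = 12, P = 4, Str = 2 (values chosen only for illustration).
window = list(range(12))
patches = make_patches(window, patch_len=4, stride=2)
print(len(patches), patches[0], patches[-1])
```

With P = 4 and Str = 2, consecutive patches share two time steps, so each observation away from the window edges appears in two tokens.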

To quantify the effective temporal resolution after patching, we adapt the notion of “resolution” from image processing and define the Patched Time Series Resolution R_PTS as a token-level metric. Unlike raw sampling frequency, R_PTS depends on both token coverage and patch overlap. Token coverage reflects how densely patches span the input window, while overlap reflects redundancy introduced by overlapping strides. We thus define R_PTS as an overlap-weighted token density:

R_{PTS} = \frac{N}{L} \left(\frac{P}{\mathrm{Str}}\right)^{2}

(2)

where L > P and L > Str. Since $N \approx \frac{L}{\mathrm{Str}}$, we obtain

R_{PTS} \approx \frac{1}{\mathrm{Str}} \left(\frac{P}{\mathrm{Str}}\right)^{2} = \frac{P^{2}}{\mathrm{Str}^{3}}

(3)

This approximation removes the dependence on the fixed window length L and highlights how P and Str regulate the effective temporal resolution, which facilitates comparisons across patching strategies (P, Str).
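
To see how close the approximation in Equation (3) is to the exact definition in Equation (2), the sketch below evaluates both for the two patching settings reported in Section 3.3 (P = 8, Str = 8 with L = 72, and P = 4, Str = 2 with L = 36). The function names are illustrative.

```python
def r_pts_exact(length, patch_len, stride):
    """Equation (2): (N / L) * (P / Str)^2, with N taken from Equation (1)."""
    n = (length - patch_len) // stride + 1
    return n / length * (patch_len / stride) ** 2


def r_pts_approx(patch_len, stride):
    """Equation (3): P^2 / Str^3, which no longer depends on L."""
    return patch_len ** 2 / stride ** 3


for length, p, s in [(72, 8, 8), (36, 4, 2)]:
    print(length, p, s, round(r_pts_exact(length, p, s), 4), r_pts_approx(p, s))
```

For the non-overlapping setting the two values coincide (0.125). For the overlapping setting the exact value is slightly below the approximate value of 2, because N = 17 is a little smaller than L/Str = 18.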

In practice, the Mamba module focuses on long-term trend identification and processes distant segments using a lower R_PTS value to provide a coarse-grained view of long-range dependencies. In contrast, the LWT module focuses on local variation modeling and processes near-term segments using a higher R_PTS value to preserve sensitivity to short-term fluctuations. The corresponding configuration is given in Section 3.3, and the design is validated by the sensitivity analysis in Section 3.4.

For unpatched sequences, P = 1 and Str = 1, leading to R_PTS = 1 according to Equation (3). With multi-scale resolution adjustment, MSTP disentangles long-term and short-term information more effectively than raw sequences, and it also supports efficient feature extraction by the subsequent expert modules.

2.2. Global Patterns Expert

To capture long-term trends and cross-period dependencies in carbon emission sequences, SST adopts Mamba as the core component of the Global Patterns Expert. Mamba is a selective State Space Model (SSM) architecture built on the classical SSM framework [32]. Classical SSMs provide a solid theoretical foundation for sequence modeling, but their linear time-invariant assumptions and fixed parameterization can limit their ability to represent non-stationary dynamics; moreover, scaling classical SSMs to high-dimensional settings is often challenging. Mamba addresses these limitations by introducing input-dependent (selective) parameterization, enabling content-adaptive state transitions for complex sequences.

As illustrated in Figure 2, the embedded long-term sequence is first projected by Linear 1 (input projection) and then split into two parallel streams: a content stream and a gating stream. The content stream is processed by a lightweight local-mixing block (Conv → SiLU → SSM) and then passed to a selective scan module for efficient long-range state space modeling. The “selective” mechanism arises because the SSM parameters are generated from content features. After the Conv–SiLU–SSM block, a projection produces three groups of input-dependent parameters (dt, B, C). Here, B controls how the current input is injected into the state, and C controls how the state is read out to form the output. The term dt is a low-rank step-size representation, which is further mapped to the discretization step Δ through a learnable projection and a positivity constraint:

\Delta = \mathrm{Softplus}(\mathrm{Proj}(dt)), \quad \Delta > 0.

In contrast, the state-dynamics base A and the skip/residual term D are learnable parameters shared across time. Because Δ, B, and C vary across tokens while A and D stay fixed, Mamba can adapt its state-transition behavior to non-stationary carbon emission signals.
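
The selective parameterization can be sketched for a single channel in pure Python. This is a simplified sequential scan under stated assumptions, not the hardware-aware kernel: it assumes the discretization commonly used with Mamba-style models (exp(ΔA) for the transition and ΔB for the input term), which the text above does not spell out, and `dt_raw` stands in for Proj(dt). All names are illustrative.

```python
import math


def softplus(z):
    """Numerically stable softplus; keeps the step size strictly positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(x, dt_raw, B, C, A, D):
    """Sequential selective scan for one channel with an n-dimensional state.

    x, dt_raw: length-T lists (input and pre-activation step size).
    B, C:      length-T lists of length-n lists (token-dependent parameters).
    A:         length-n list of negative reals (shared across time).
    D:         scalar skip weight (shared across time).
    """
    n = len(A)
    h = [0.0] * n
    y = []
    for t in range(len(x)):
        delta = softplus(dt_raw[t])            # Delta > 0 for every token
        for j in range(n):
            a_bar = math.exp(delta * A[j])     # assumed discretized transition
            b_bar = delta * B[t][j]            # assumed simplified input term
            h[j] = a_bar * h[j] + b_bar * x[t]
        y.append(sum(C[t][j] * h[j] for j in range(n)) + D * x[t])
    return y


# One-dimensional state, unit impulse followed by zero input.
y_demo = selective_scan(
    x=[1.0, 0.0], dt_raw=[0.0, 0.0],
    B=[[1.0], [1.0]], C=[[1.0], [1.0]], A=[-1.0], D=0.0,
)
print(y_demo)
```

In this toy run the state decays between the two steps by the factor exp(ΔA). A larger Δ for a token therefore means faster forgetting of earlier context and stronger injection of the current input.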

The gating stream passes through SiLU to produce a content-aware modulation signal. The selective-scan output is then fused with this gate via element-wise multiplication:

\mathrm{gated\_output} = y_{scan} \otimes \mathrm{SiLU}(\mathrm{gate}),

where y_scan denotes the output of the Conv–SiLU–SSM branch. Linear 2 projects the gated representation back to the model dimension, yielding global-trend features for subsequent multi-scale fusion. With a hardware-aware selective scan, Mamba scales approximately linearly with sequence length in practice, making it suitable for long-horizon industrial carbon emission forecasting. Temporal order is preserved through recurrent state transitions within the scan; therefore, SST does not require additional positional encoding after the embedding stage.
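
The gating step is an element-wise product between the scan output and a SiLU-activated gate. A minimal sketch on plain Python lists follows; the helper names are illustrative, and the final Linear 2 projection is omitted.

```python
import math


def silu(z):
    """SiLU(z) = z * sigmoid(z), written to avoid overflow for large |z|."""
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)


def gated_output(y_scan, gate):
    """Element-wise product of the scan output with the activated gate."""
    return [y * silu(g) for y, g in zip(y_scan, gate)]


print(gated_output([2.0, 3.0], [0.0, 100.0]))
```

A gate value near zero suppresses the corresponding feature, while a large positive gate passes it through scaled by roughly the gate value itself.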

To model cross-period dependence and suppress noise, SST builds a Mamba-centered global-pattern unit. Each univariate series is segmented into patches $x_{pL}^{(i)} \in \mathbb{R}^{N_{L} \times P_{L}}$ and then projected into a D-dimensional embedding $x_{L}^{(i)} \in \mathbb{R}^{N_{L} \times D}$. These embeddings are fed into the Mamba-based expert to capture long-range patterns and attenuate low-amplitude noise in distant temporal contexts. We denote the resulting long-term representation as $Z_{L}^{(i)} \in \mathbb{R}^{N_{L} \times D}$, which provides global temporal context for subsequent multi-scale feature fusion.

2.3. Local Variations Expert

To improve sensitivity to short-range fluctuations and local perturbations, SST introduces a Local Variations Expert built on an enhanced Local Window Transformer (LWT). Although vanilla Transformer attention is flexible, global attention lacks an explicit local inductive bias and incurs quadratic complexity, $O(L^{2})$, which reduces efficiency for forecasting. This can weaken responsiveness to localized variations in carbon emission sequences. To address this issue, LWT restricts attention to a fixed window of size w, so that each token attends only to its nearby context and local patterns are emphasized.

As illustrated in Figure 3, the short-term input is segmented into patches, yielding $x_{pS}^{(i)} \in \mathbb{R}^{N_{S} \times P_{S}}$. The patches are then embedded as $x_{S}^{(i)} \in \mathbb{R}^{N_{S} \times D}$. Unlike the Mamba-based global expert, LWT injects positional encoding into these tokens before attention (the ⊕ operation in Figure 3), which preserves fine-grained local order information within the short-term patch sequence. The resulting sequence is processed by stacked LWT blocks for local-pattern extraction. Each block contains a Local Window Attention (LWA) sub-layer and a feed-forward network, along with residual connections, normalization, and other standard Transformer components.

For a token at position t, LWA restricts attention to a neighborhood window $W(t)$:

W(t) = \left[ t - \left\lfloor \frac{w}{2} \right\rfloor, \; t + \left\lfloor \frac{w}{2} \right\rfloor \right]

where indices are clipped to valid bounds and w is the window size. Given the embedded sequence $X = x_{S}^{(i)}$, we compute Q = XW_Q, K = XW_K, and V = XW_V. Within each local window, attention is computed as

\mathrm{Attention}(Q_{W}, K_{W}, V_{W}) = \mathrm{softmax}\left(\frac{Q_{W} K_{W}^{\top}}{\sqrt{d_{k}}}\right) V_{W}

(4)

where $Q_{W}, K_{W}, V_{W} \in \mathbb{R}^{w \times d_{k}}$ denote the query, key, and value matrices within the local window, and d_k is the key dimension per head. Multi-head aggregation follows the standard Transformer formulation:

\mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_{1}, \dots, \mathrm{head}_{h}) W_{O}

(5)
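
A single-head version of the windowed attention in Equation (4) can be sketched with NumPy. For readability this sketch builds the full score matrix and masks entries outside the window, so it illustrates the attention pattern rather than the reduced cost; an efficient implementation would compute only the in-window scores. The weight matrices and shapes below are placeholders, and the multi-head concatenation of Equation (5) is omitted.

```python
import numpy as np


def local_window_attention(X, W_q, W_k, W_v, w):
    """Single-head attention where token t sees only [t - w//2, t + w//2]."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    idx = np.arange(n)
    outside = np.abs(idx[:, None] - idx[None, :]) > w // 2
    scores = np.where(outside, -np.inf, scores)      # mask out-of-window pairs
    scores -= scores.max(axis=1, keepdims=True)      # stabilise the softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ V, weights


rng = np.random.default_rng(0)
X = rng.normal(size=(17, 8))                         # 17 tokens, toy width 8
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = local_window_attention(X, W_q, W_k, W_v, w=7)
print(out.shape, np.count_nonzero(weights[0]))
```

Clipping at the sequence boundary happens implicitly: the first token has only four in-window positions (itself and three to the right), so its attention weights are normalised over those four entries.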

As shown in Figure 4, each LWA layer attends only within a window of size w. Stacking l LWT layers enables information to propagate across layers, so higher-layer tokens can indirectly access positions beyond the original window via intermediate tokens in lower layers. Consequently, the effective receptive field expands with depth and can be approximated as l ⋅ w − l + 1 positions. This mechanism allows broader context integration while retaining a strong local modeling bias.
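
The receptive-field estimate can be checked by propagating the reachable index range layer by layer. The sketch below assumes an odd window size w, so that the window is symmetric around each token, and ignores clipping at the sequence boundaries. Function names are illustrative.

```python
def receptive_field(num_layers, window):
    """Closed form l * w - l + 1 (odd window size assumed)."""
    return num_layers * window - num_layers + 1


def receptive_field_by_propagation(num_layers, window):
    """Grow the reachable span by floor(w / 2) on each side per layer."""
    half = window // 2
    lo = hi = 0
    for _ in range(num_layers):
        lo -= half
        hi += half
    return hi - lo + 1


for layers in (1, 2, 3):
    print(layers, receptive_field(layers, 7), receptive_field_by_propagation(layers, 7))
```

With w = 7, one layer covers 7 positions, two layers cover 13, and three layers cover 19.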

By restricting attention to local windows, the per-layer complexity is reduced from $O(L^{2})$ to $O(wL)$, where L denotes the sequence length. This substantially lowers computational cost and improves runtime efficiency, while preserving strong performance in modeling short-term dynamics and local fluctuations.
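
As a rough check on the complexity argument, the sketch below counts the query-key pairs that are actually scored under a clipped local window and compares the count with full attention. The token count of 17 matches the short-term branch setting from Section 3.3 (I_S = 36, P_S = 4, Str_S = 2); the function name is illustrative.

```python
def windowed_score_count(n_tokens, window):
    """Number of (query, key) pairs inside clipped windows [t - w//2, t + w//2]."""
    half = window // 2
    return sum(
        min(t + half, n_tokens - 1) - max(t - half, 0) + 1
        for t in range(n_tokens)
    )


n_tokens, window = 17, 7
print(windowed_score_count(n_tokens, window), n_tokens * window, n_tokens ** 2)
```

The windowed count stays below w times the token count because boundary tokens have truncated windows, and it grows linearly rather than quadratically as the sequence gets longer.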

2.4. Long-Short Router

The Long-Short Router is a central component of SST’s fusion architecture. It adaptively balances the Global Patterns Expert and the Local Variations Expert. Given the input sequence $x \in \mathbb{R}^{L \times D}$, the router computes a compact sequence summary and applies a lightweight projection followed by a softmax to produce two normalized, sample-wise routing coefficients p_L, p_S ∈ (0,1), with p_L + p_S = 1. These coefficients quantify the relative emphasis on long-term trends versus short-term variations and are used to weight the outputs of the two experts.

The weighting is data-adaptive and varies across samples. For stable trends or strong periodicity, the router assigns a larger weight to the global expert, whereas for frequent abrupt changes, it increases the weight of the local expert. This provides an interpretable gating mechanism between global and local pathways and enables SST to reweight multi-scale information based on input characteristics, improving robustness across heterogeneous carbon emission series.

A concise expression of router-guided fusion is

y = p_{L} \cdot y_{L} + p_{S} \cdot y_{S}

(6)

where y_L and y_S denote the outputs from the global and local experts, respectively. In practice, the weights are applied during feature fusion to form a joint representation for the final prediction head.
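
The routing step can be sketched as follows. The text does not specify how the compact sequence summary is computed, so this sketch assumes mean pooling over time; the projection parameters `W_r` and `b_r` and all shapes are hypothetical placeholders.

```python
import numpy as np


def route(x, W_r, b_r):
    """Return (p_L, p_S) from a pooled summary of x with shape (L, D)."""
    summary = x.mean(axis=0)              # assumed summary: mean over time, shape (D,)
    logits = summary @ W_r + b_r          # lightweight projection to two logits
    logits = logits - logits.max()        # stabilise the softmax
    weights = np.exp(logits)
    weights = weights / weights.sum()
    return float(weights[0]), float(weights[1])


def fuse(y_L, y_S, p_L, p_S):
    """Equation (6): convex combination of the two expert outputs."""
    return p_L * y_L + p_S * y_S


rng = np.random.default_rng(0)
x = rng.normal(size=(72, 16))
W_r = np.zeros((16, 2))                   # zero weights so only the bias matters here
b_r = np.array([1.0, 0.0])
p_L, p_S = route(x, W_r, b_r)
y = fuse(np.ones(12), np.zeros(12), p_L, p_S)
print(round(p_L, 4), round(p_S, 4))
```

Because the coefficients come from a two-way softmax, they are strictly between 0 and 1 and sum to one, so the fused output always lies between the two expert outputs.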

2.5. Forecasting Module

The Global Patterns Expert (Mamba) outputs long-term features $Z_{L}^{(i)} \in \mathbb{R}^{N_{L} \times D}$, and the LWT outputs short-term features $Z_{S}^{(i)} \in \mathbb{R}^{N_{S} \times D}$. Before fusion, each feature map is flattened into a one-dimensional vector:

z_{L} = \mathrm{Flatten}(Z_{L}^{(i)}) \in \mathbb{R}^{N_{L} D},

(7)

z_{S} = \mathrm{Flatten}(Z_{S}^{(i)}) \in \mathbb{R}^{N_{S} D}

(8)

The router produces normalized coefficients p_L and p_S. These scalars are broadcast to match the flattened vectors and reweight the expert outputs before fusion. The weighted vectors are concatenated to form the fused representation:

z_{LS} = \mathrm{Concat}(p_{L} \cdot z_{L}, p_{S} \cdot z_{S}) \in \mathbb{R}^{(N_{L} + N_{S}) D}

(9)

The fused vector z_LS jointly encodes global trends and local variations. It is then fed into a learned prediction head that maps z_LS to an O-step forecasting horizon:

\hat{X}^{(i)} = W_{head} z_{LS} + b_{head}, \quad W_{head} \in \mathbb{R}^{O \times (N_{L} + N_{S}) D}, \quad b_{head} \in \mathbb{R}^{O}

(10)
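
Equations (7) to (10) amount to a flatten, a weighted concatenation, and one linear map. The NumPy sketch below traces the shapes, using N_L = 9 and N_S = 17 (the token counts implied by the patching settings in Section 3.3) together with a toy embedding width D = 16 and horizon O = 12, which are assumptions for illustration only.

```python
import numpy as np


def forecast_head(Z_L, Z_S, p_L, p_S, W_head, b_head):
    """Flatten both expert outputs, reweight, concatenate, and project to O steps."""
    z_L = Z_L.reshape(-1)                              # Equation (7)
    z_S = Z_S.reshape(-1)                              # Equation (8)
    z_LS = np.concatenate([p_L * z_L, p_S * z_S])      # Equation (9)
    return W_head @ z_LS + b_head                      # Equation (10)


N_L, N_S, D, O = 9, 17, 16, 12
Z_L = np.ones((N_L, D))
Z_S = np.ones((N_S, D))
W_head = np.ones((O, (N_L + N_S) * D))
b_head = np.zeros(O)
forecast = forecast_head(Z_L, Z_S, 0.25, 0.75, W_head, b_head)
print(forecast.shape, forecast[0])
```

Note that the head size grows with (N_L + N_S) D, so the patching configuration directly affects the number of parameters in the final linear layer.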

3. Experiments

Accurate monitoring of carbon emissions is essential for climate-change mitigation. High-quality emissions data support scientific analysis and policy design. However, many regions still lack comprehensive monitoring infrastructure, leading to missing data and inconsistent assessments. In such data-sparse settings, forecasting can complement monitoring by supporting energy-transition planning and low-carbon operations. It can also provide early signals of potential emission anomalies.

3.1. Dataset

To evaluate SST for carbon emission forecasting, we use a Rwanda dataset derived from weekly Sentinel-5P observations. The dataset is publicly available through a Kaggle competition and is provided as train_datasets.csv [33]. It includes 497 monitoring locations across Rwanda (with latitude and longitude) and spans 1 January 2019 to 31 December 2021. Each location contains 159 weekly records, resulting in 79,023 training samples in total. Figure 5 illustrates the spatial coverage and the weekly sampling scheme.

All features are aggregated at weekly resolution and follow the column naming in the released CSV. The inputs include Sentinel-5P atmospheric product groups (e.g., SO₂, CO, NO₂, HCHO, and O₃), as well as aerosol- and cloud-related variables. Many variables are column or slant-column quantities with associated air-mass-factor (AMF) terms. In Sentinel-5P retrieval pipelines, AMF is used to convert slant columns to vertical columns in DOAS-style retrievals. The dataset also provides auxiliary attributes such as cloud fraction and viewing-geometry variables.

The prediction target is the field named “emission” in train_datasets.csv. The competition description refers to this target as CO₂ emissions, but no physical unit or flux definition is provided. We therefore interpret “emission” as a dataset-defined CO₂-related indicator rather than a physical emission flux. This clarification is important for scientific interpretation because Sentinel-5P inputs describe atmospheric columns and cloud states rather than direct surface fluxes. Such observations may be influenced by atmospheric transport and meteorological conditions; therefore, the target should be regarded as a proxy label within this benchmark.

The original dataset contains missing values that are unevenly distributed across feature groups. The “UvAerosolLayerHeight_” columns are missing in 99.44% of samples and are thus removed. After removal, the missing rate is 23.18% for the nitrogen dioxide features and 18.49% for the sulfur dioxide features. The missing rate is 9.21% for the formaldehyde features and 2.69% for the carbon monoxide features. Missing rates for ozone, UV aerosol index, and cloud features are all below 0.70%. We apply the same missing-value handling strategy to all compared models (see Section 3.2). The dataset is accessed via Kaggle and is used in accordance with Kaggle’s Terms of Use.

To compare the temporal characteristics of ETTh1 (Electricity Transformer Temperature) and the Rwanda dataset, we assess periodicity, noise intensity, and trend stationarity using three diagnostics. Periodicity is quantified by the Pearson lagged autocorrelation coefficient P_lag. We use lag = 24 for ETTh1 hourly data to represent a daily cycle, and lag = 53 for Rwanda weekly data to represent an annual-scale cycle. Noise intensity is measured by an STL-based signal-to-noise ratio (SNR). In STL (Seasonal and Trend decomposition using Loess), the structured signal is defined as the sum of trend T and seasonal S components, while the residual R is treated as noise. We define

\mathrm{SNR} = 10 \log_{10}\left(\frac{\mathrm{Var}(T + S)}{\mathrm{Var}(R)}\right)

Trend stationarity is assessed using the Augmented Dickey–Fuller (ADF) test applied to the STL-extracted trend component. These diagnostics are used only to characterize dataset properties; they are not used to make statistical-significance claims in model comparisons. The results are summarized in Table 1.
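
Two of the three diagnostics are simple to express once the STL components are available. The sketch below assumes the trend, seasonal, and residual arrays have already been computed by an STL implementation (for example, the one in statsmodels); the decomposition itself and the ADF test are not reproduced here, and the function names are illustrative.

```python
import numpy as np


def lagged_autocorr(x, lag):
    """Pearson correlation between the series and its copy shifted by `lag`."""
    x = np.asarray(x, dtype=float)
    return float(np.corrcoef(x[:-lag], x[lag:])[0, 1])


def snr_db(trend, seasonal, resid):
    """10 * log10(Var(T + S) / Var(R)) in decibels."""
    signal = np.asarray(trend, dtype=float) + np.asarray(seasonal, dtype=float)
    return float(10.0 * np.log10(np.var(signal) / np.var(resid)))


# Synthetic check: a pure 24-step cycle is perfectly correlated at lag 24.
t = np.arange(240)
cycle = np.sin(2 * np.pi * t / 24)
print(round(lagged_autocorr(cycle, 24), 6))

# Synthetic check: structured variance ten times the residual variance gives 10 dB.
rng = np.random.default_rng(0)
r = rng.normal(size=200)
print(round(snr_db(np.zeros(200), np.sqrt(10.0) * r, r), 6))
```

A negative SNR means the residual variance exceeds the variance of the trend-plus-seasonal signal.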

The results indicate substantial differences between the two datasets. ETTh1 shows strong periodicity with P_lag = 0.9406 at lag = 24 and a low-noise profile (SNR = 15.10 dB). Its STL trend yields a small ADF p-value (ADF p = 0.0116), consistent with regular electricity and temperature dynamics. In contrast, the Rwanda dataset exhibits weak periodicity at lag = 53 (P_lag = 0.0129) and a much noisier pattern (SNR = −5.33 dB), suggesting that residual fluctuations dominate the structured components. Its STL trend yields an extremely small ADF p-value (ADF p = 6.177 × 10⁻³⁰). Overall, the Rwanda dataset presents weak periodic structure and pronounced volatility, motivating robust representation learning and component disentanglement.

3.2. Data Preprocessing

We preprocess the emission dataset using a standardized pipeline. Given the large scale and high noise level, we apply the following feature-engineering and data-cleaning procedures:

We assign a stable geographic identifier (ID) to each of the 497 subregions by rounding latitude/longitude coordinates, forming a spatiotemporal index.
We extract seasonal attributes from weekly timestamps (Season ∈ {Spring, Summer, Autumn, Winter}) and apply one-hot encoding to obtain 4-dimensional orthogonal basis vectors.
We group the data by ID and sort records by time. For each ID, we compute a 7-step (i.e., 7-week) moving average (MA7) and standard deviation (SD7) of the target in a strictly causal manner (using only historical observations). For initialization, the first six MA7/SD7 values are set to the first available full-window statistic (the 7th value).
We construct a binary indicator for the COVID-19 period (is_covid) and an additional lockdown-status feature (is_lockdown).
Using the central coordinates of each region, we generate directional rotation features at multiple azimuth angles (e.g., rot_15_x, rot_15_y, rot_30_x, rot_30_y).
We compute spherical distances from each region to five key landmarks in Rwanda using the Haversine formula.
We perform K-means spatial clustering on the 497 regions (K = 12) using the training split only, producing a clustering feature (geo_cluster). We additionally compute the spherical distance from each point to each cluster centroid (cluster_i_dist, $i \in \{1, \dots, 12\}$).
We remove low-variance features (variance < 0.1) to filter redundant attributes.
To avoid data leakage, all preprocessing statistics are estimated using only the training split (a 6:2:2 time-based split within each ID). Features with more than 50% missing values in training are discarded. For the remaining features, missing values are filled by within-ID forward fill along the time axis, then imputed using training-only group means and the training global mean. Any residual missing entries are finally filled with zeros.
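
Two of the feature-engineering steps above, the Haversine distance and the 7-week moving average with its initialization rule, can be sketched in pure Python. The sketch assumes that the trailing window ends at the current week and that the Earth radius is 6371 km; neither detail is stated in the text. The landmark coordinates are not given in the source, so the example uses arbitrary points. SD7 would follow the same windowing pattern.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def trailing_mean(values, window=7):
    """Trailing mean over `window` steps (assumed to end at the current step).

    The first window - 1 entries are set to the first full-window value,
    mirroring the initialization rule described for MA7.
    """
    if len(values) < window:
        raise ValueError("series is shorter than the window")
    out = []
    for t in range(len(values)):
        if t + 1 >= window:
            out.append(sum(values[t - window + 1: t + 1]) / window)
        else:
            out.append(None)
    first_full = out[window - 1]
    return [first_full if v is None else v for v in out]


print(round(haversine_km(0.0, 0.0, 0.0, 180.0), 2))
print(trailing_mean([1, 2, 3, 4, 5, 6, 7, 8, 9, 10]))
```

In a grouped setting, `trailing_mean` would be applied separately to the time-sorted records of each location ID so that no window spans two locations.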

3.3. Experimental Setup

We evaluate SST on the Rwanda dataset and compare it with representative time series forecasting baselines. The baselines include Transformer-based architectures with self-attention mechanisms (Transformer, Informer, and Autoformer). We also include PatchTST and DLinear as strong competitive baselines.

We report three evaluation metrics—Mean Squared Error (MSE), Mean Absolute Error (MAE), and Relative Squared Error (RSE)—defined as

\mathrm{MSE} = \frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}

\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - \hat{y}_{i}|

\mathrm{RSE} = \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}

(11)

where n is the number of samples, y_i is the ground truth, $\hat{y}_{i}$ is the prediction, and $\bar{y}$ is the mean of the ground-truth values. Lower values indicate better forecasting performance.
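
The three metrics in Equation (11) can be written directly from their definitions. The sketch follows the RSE form given above, without a square root; some forecasting code bases report a square-rooted variant, so values may not be directly comparable across implementations. Function names are illustrative.

```python
def mse(y, y_hat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def rse(y, y_hat):
    """Relative squared error: squared error divided by squared deviation from the mean."""
    y_bar = sum(y) / len(y)
    num = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    den = sum((a - y_bar) ** 2 for a in y)
    return num / den


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 6.0]
print(mse(y_true, y_pred), mae(y_true, y_pred), rse(y_true, y_pred))
```

An RSE below 1 means the model has a smaller squared error than a constant predictor that always outputs the mean of the ground truth.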

For SST, we adopt a dual-branch temporal processing design:

Long-term modeling (Mamba branch): a low-resolution configuration with R_PTS = 0.125 (i.e., P_L = 8, Str_L = 8) and an input length I_L = 72.
Short-term modeling (LWT branch): a high-resolution configuration with R_PTS = 2 (i.e., P_S = 4, Str_S = 2) and an input length I_S = 36.

The local window size is set to w = 7. We evaluate multi-horizon forecasting with output lengths $O \in \{6, 12, 24, 48, 72\}$.

We split the dataset into training/validation/test sets using a 6:2:2 ratio (time-based within each ID) and apply the same preprocessing pipeline to all models. All methods are implemented in PyTorch (2.7.0) and optimized using Adam. Early stopping is used with a patience of 6 epochs to prevent overfitting. Experiments are conducted on a workstation equipped with an NVIDIA GeForce RTX 4090D GPU (24 GB VRAM). Detailed parameter settings are provided in Appendix A.

3.4. Results and Discussion

Before the main experiments, we conduct a sensitivity analysis to verify that a lower R_PTS is more suitable for Mamba, whereas a higher R_PTS benefits LWT (with all other settings fixed; see Appendix A). To ensure feasibility, we set the local window size to w = 3 in this analysis. Results are reported in Table 2.

Table 2 shows that using a lower R_PTS for the Mamba branch helps preserve long-context structure and stabilizes global-trend modeling. It avoids over-fragmenting the sequence and supports robust long-range representation learning. In contrast, using a higher R_PTS for LWT produces finer-grained tokens and improves sensitivity to short-term fluctuations.

Under the experimental setup described in Section 3.3, we further compare SST with PatchTST, DLinear, Autoformer, Informer, and Transformer. The results are summarized in Table 3.

As shown in Table 3, for the multivariate-to-univariate forecasting task on the Rwanda dataset, SST achieves the best average performance across the five horizons (6–72 weeks). Averaged over all horizons, SST attains an MSE of 0.0331, an MAE of 0.0711, and an RSE of 0.4323. Compared with the strongest baseline, PatchTST, these results correspond to reductions of 3.4%, 3.7%, and 6.5%, respectively. SST ranks first on all three metrics at the 12-, 24-, and 48-week horizons, and it also achieves the lowest MSE and RSE at the 6-week horizon. At the 72-week horizon, PatchTST performs slightly better.

Figure 6 further highlights differences in error growth across horizons. Autoformer, Informer, and the standard Transformer exhibit larger increases in error as the horizon extends, and their curves fluctuate more across horizons, indicating less stable long-range forecasting. In contrast, SST and PatchTST show slower and more stable degradation in MSE, MAE, and RSE as the prediction length increases. DLinear is relatively stable but remains consistently less accurate than SST and PatchTST. For short and medium horizons (6–48 weeks), SST achieves the lowest MSE and RSE throughout and the lowest MAE at the 12-, 24-, and 48-week horizons. At 72 weeks, PatchTST becomes marginally better, while SST remains competitive. Overall, SST provides strong short-horizon accuracy and stable long-horizon behavior, suggesting improved robustness under heterogeneous and non-stationary carbon emission dynamics.

To examine the impact of input length, we evaluate SST under different configurations by varying the long-term input length I_L ∈ {24,48,72,96} and the short-term input length I_S ∈ {12,24,36,48}, while fixing the output length to O = 53. We compare SST with strong baselines, including PatchTST and DLinear. The results are summarized in Table 4.

As shown in Table 4, SST is not applicable under the setting I_L = 24 and I_S = 12. In this case, the LWT branch contains only 5 tokens. With the local-window constraint w = 7, valid attention windows cannot be formed. Therefore, we report results for the remaining configurations.
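The token count can be reproduced with a short calculation. This sketch assumes that the short branch uses the patch length and stride of the baseline group in Table 2 (4 and 2), that patches are extracted without padding, and that a window is considered valid only when the branch has at least w tokens; these assumptions are our reading of the text rather than a statement of the implementation.

```python
def num_patches(input_len, patch_len, stride):
    """Number of patches from an unpadded sliding window over the input."""
    if input_len < patch_len:
        return 0
    return (input_len - patch_len) // stride + 1


def window_fits(n_tokens, w):
    """Assumed validity rule: at least w tokens are needed for one full window."""
    return n_tokens >= w


W = 7  # local window size from Table A1
for i_s in (12, 24, 36, 48):
    n = num_patches(i_s, patch_len=4, stride=2)
    status = "valid" if window_fits(n, W) else "not applicable"
    print(f"I_S = {i_s}: {n} tokens -> {status}")
```

With I_S = 12 the calculation gives (12 − 4)/2 + 1 = 5 tokens, which is below w = 7, whereas I_S = 24 already yields 11 tokens.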

For 48-24-53, SST achieves MSE = 0.0544, MAE = 0.1082, and RSE = 0.5924. For 72-36-53, errors further decrease to 0.0478, 0.0953, and 0.5552. For 96-48-53, SST reaches the best performance, with MSE = 0.0466, MAE = 0.0911, and RSE = 0.5481. These results show a steady accuracy improvement as the input window grows. SST is also the top-performing model across these valid settings. It consistently outperforms strong baselines such as PatchTST and DLinear.

We also compare model size and computational cost, as reported in Table 5. SST contains 2.65 M parameters, comparable to Autoformer, Informer, and Transformer (2.73–2.94 M). PatchTST uses fewer parameters (1.61 M), while DLinear is extremely lightweight (876 parameters). In terms of GPU memory consumption, PatchTST requires the most memory (3057.38 MB). SST also has a relatively high memory footprint (2685.46 MB), although it is 12.2% lower than that of PatchTST. SST consumes substantially more GPU memory than Autoformer (380.64 MB), Informer (182.02 MB), and Transformer (226.32 MB). Regarding training speed, SST is the slowest model in our experiments at 69.13 s/epoch, compared with 49.56 s/epoch for PatchTST and 23.84–43.55 s/epoch for the other Transformer-style baselines. Overall, SST achieves higher forecasting accuracy at the cost of longer training time and, relative to every baseline except PatchTST, higher memory usage.

Based on the above results, SST delivers consistently strong forecasting accuracy on the Rwanda dataset, particularly at short and medium horizons, and its performance improves steadily as the input window increases. These gains are primarily attributed to its multi-scale design: the Mamba backbone supports long-range dependency modeling across patches, the local-window Transformer captures short-term fluctuations and local shocks, and the routing mechanism coordinates the two branches adaptively. Table 5 also reveals clear trade-offs. SST has a parameter count comparable to Transformer-style baselines but requires higher GPU memory and longer training time per epoch. PatchTST trains faster in our setting but consumes more GPU memory, while DLinear is highly efficient but consistently less accurate. Overall, SST prioritizes predictive performance over training efficiency and is suitable when accuracy and robustness are the primary objectives.

3.5. Ablation Studies

SST is a hybrid architecture that combines a Mamba-based state space backbone with Transformer-style local attention. Whether this hybrid design is effective for carbon emission forecasting requires empirical validation. To quantify the contribution of each core component, we conduct controlled ablation experiments with the following configurations: (1) No Mamba, which removes the Mamba-based global expert to assess the role of long-range dependency modeling; (2) No LWT, which removes the LWT-based local expert to evaluate the importance of short-term local-pattern modeling; (3) No Patch, which disables MSTP to test the necessity of multi-resolution feature extraction from raw inputs; (4) No Router, which removes the routing module and replaces it with equal-weight fusion of the two experts (i.e., p_L = p_S = 0.5), to examine whether adaptive long-short balancing is necessary; and (5) Full SST, the complete model, which serves as the reference configuration.
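The difference between the Full SST and No Router configurations can be illustrated with a small sketch. It assumes, purely for illustration, that the router produces two logits that are normalized by a softmax into p_L and p_S, and that the expert outputs are combined by an element-wise weighted sum; the actual fusion used by SST is the one defined in Section 2.4, and all values below are made up.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(z_long, z_short, p_long, p_short):
    """Element-wise weighted sum of two expert outputs (illustrative stand-in)."""
    return [p_long * a + p_short * b for a, b in zip(z_long, z_short)]


# Hypothetical expert outputs and router logits, not values from the paper.
z_long = [0.2, 0.4, 0.6]    # global-pattern (Mamba) expert
z_short = [0.1, 0.5, 0.3]   # local-variation (LWT) expert
router_logits = [1.2, -0.3]

# Full SST: weights depend on the input through the router.
p_long, p_short = softmax(router_logits)
adaptive = fuse(z_long, z_short, p_long, p_short)

# No Router ablation: fixed equal weights p_L = p_S = 0.5.
equal = fuse(z_long, z_short, 0.5, 0.5)

print("adaptive weights:", round(p_long, 3), round(p_short, 3))
print("adaptive fusion:", adaptive)
print("equal-weight fusion:", equal)
```

The ablation therefore isolates the effect of input-dependent weighting: both variants see the same two expert outputs, and only the weights differ.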

As shown in Figure 7, the full SST achieves the lowest average MSE across different prediction horizons and consistently outperforms all ablated variants. This demonstrates the benefit of jointly modeling global trends and local variations within a unified framework. Removing Mamba degrades performance, indicating that long-range global-trend modeling is important for accurate carbon emission forecasting. Removing LWT also increases error, suggesting that short-term local variations provide essential complementary information. Moreover, the No Patch variant performs worse than Full SST, confirming that MSTP improves performance by extracting informative multi-resolution representations from the raw time series. Overall, these results verify that each component contributes meaningfully, and that their combination yields stronger and more robust forecasting accuracy.

4. Conclusions

In this study, we applied SST to carbon emission forecasting and conducted extensive experiments to evaluate its effectiveness. By decomposing multi-resolution time series into global trends and local variations, SST achieves strong forecasting performance under complex emission dynamics. The results indicate that SST captures both long-term trends and short-term fluctuations, providing practical support for emission-trend analysis and policy formulation.

Nevertheless, our current evaluation is based on a dataset with weekly granularity, and SST is validated only for weekly-scale forecasting in this work. Its applicability to other temporal resolutions, such as daily or monthly forecasting, remains to be verified. Direct resampling may introduce additional noise and reduce comparability across settings. In future work, we will evaluate SST on datasets with different sampling frequencies and systematically examine how resolution changes affect forecasting reliability. In addition, our experiments focus on relatively recent observations and do not include multi-decade historical records. As a result, SST’s stability under long-term climate variability and potential regime shifts is still unclear. Future studies will test SST on longer-span datasets covering multiple decades and evaluate performance across different historical periods to assess robustness under regime transitions. Finally, we plan to incorporate key exogenous modalities—such as policy signals and regional economic indicators (e.g., GDP)—through a multimodal design to further improve robustness and applicability in real-world carbon-policy settings.

Author Contributions

Conceptualization, Y.X. and M.W.; methodology, Y.X.; software, Y.X.; validation, Y.X.; formal analysis, Y.X.; investigation, Y.X.; resources, M.W.; data curation, Y.X.; writing—original draft preparation, Y.X.; writing—review and editing, Y.X.; visualization, Y.X.; supervision, M.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding. The article processing charge (APC) was paid by Yuanhao Xiong.

Data Availability Statement

The dataset used in this study is publicly available on the Kaggle platform for the competition “Predict CO₂ Emissions in Rwanda”, accessible at https://www.kaggle.com/competitions/playground-series-s3e20/data (accessed on 17 May 2025). All raw data in this research were obtained from this open-access source, with no proprietary or restricted-access data involved. The dataset is subject to Kaggle’s Terms of Use and the competition rules. We use the data exclusively for academic research purposes, and readers should obtain the raw data directly from Kaggle after accepting the relevant competition rules.

Acknowledgments

The author would like to acknowledge the assistance of ChatGPT (version 5.1 Thinking), a large language model developed by OpenAI (San Francisco, CA, USA), which was used solely for language refinement and academic polishing of the English manuscript. The model was accessed via the ChatGPT platform (https://chat.openai.com), and its role was limited to improving the clarity, grammar, and fluency of the paper’s English expressions. All technical content, experimental design, and scientific analysis were independently conceived and completed by the author.

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Appendix A.1

This appendix provides supplementary details that are essential for understanding and reproducing our experiments, but would otherwise interrupt the flow of the main text. Specifically, we summarize the complete experimental configuration used throughout the paper, including data split protocol, preprocessing and normalization rules, model architectures, and training hyperparameters. Unless otherwise stated, all reported results are obtained under the same training budget and early-stopping strategy, and are averaged over the same set of random seeds.

Table A1. Experimental configuration used in this paper.

Parameters	SST	PatchTST	DLinear	Autoformer	Informer	Transformer
Epoch	30	30	30	30	30	30
Learning rate (Lr)	0.0001	0.0001	0.0001	0.0001	0.0001	0.0001
d_model/d_ff	256/1024	256/1024	256/1024	256/1024	256/1024	256/1024
dropout	0.3	0.3	0.3	0.3	0.3	0.3
batch_size	64	64	64	64	64	64
random_seed	2021	2021	2021	2021	2021	2021
w	7	null	null	null	null	null
patience	6	6	6	6	6	6
optimizer	Adam	Adam	Adam	Adam	Adam	Adam
schedule	Lr × 0.9^{max(0,Epoch−3)}	Lr × 0.9^{max(0,Epoch−3)}	Lr × 0.9^{max(0,Epoch−3)}	Lr × 0.9^{max(0,Epoch−3)}	Lr × 0.9^{max(0,Epoch−3)}	Lr × 0.9^{max(0,Epoch−3)}
e_layers/d_layers	2/1	2/1	2/1	2/1	2/1	2/1
heads	8	8	8	8	8	8
m_layers	1	null	null	null	null	null
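The learning-rate schedule and the patience setting in Table A1 can be sketched as follows. The sketch assumes that epochs are counted from 1 and that early stopping monitors a validation loss; the class and function names are illustrative and are not taken from the authors' code.

```python
def learning_rate(epoch, base_lr=1e-4, decay=0.9, hold_epochs=3):
    """Schedule from Table A1: Lr * 0.9^max(0, Epoch - 3).
    Assumes 1-indexed epochs, so the rate is constant for epochs 1 to 3."""
    return base_lr * decay ** max(0, epoch - hold_epochs)


class EarlyStopping:
    """Stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=6):
        self.patience = patience
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


for epoch in (1, 3, 4, 10, 30):
    print(f"epoch {epoch:2d}: lr = {learning_rate(epoch):.3e}")
```

Under this reading, the rate stays at 0.0001 for the first three epochs and then decays by a factor of 0.9 per epoch, while training stops once six consecutive epochs fail to improve the monitored loss.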

Appendix A.2

For clarity, we collect here the symbols and terms used throughout the experiments. Time units are expressed in time steps (weeks) unless otherwise stated.

I_S—short input length: number of historical time steps provided to the short branch (short-branch input length). The short branch models local/short-term dynamics.
I_L—long input length: number of historical time steps provided to the long branch (long-branch input length). The long branch models long-term trends and global state.
O—prediction horizon or output length: number of future time steps the model predicts.
p_m, s_m—Mamba patch length and stride: for the Mamba (long) branch, p_m is the number of original time steps grouped into one patch token; s_m is the step between adjacent Mamba patches.
p_l, s_l—LWT patch length and stride: for the Local Window Transformer (short branch), p_l and s_l are defined analogously to p_m, s_m.
w—LWT local window size: number of patch tokens included in each local self-attention window of the LWT. w is expressed in patch tokens and must be an odd integer (center token + symmetric left/right wings).
d_model—model embedding dimension: dimension of token embeddings/hidden representations.
d_ff—feed-forward width: inner dimension of the position-wise feed-forward network.
heads—number of attention heads: number of parallel heads in multi-head attention.
m_layers/e_layers/d_layers—layer counts for different modules: m_layers = number of Mamba layers; e_layers and d_layers = encoder/decoder layer counts for Transformer-style modules.
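The role of the window size w can be made concrete with a small attention-mask sketch. It assumes a symmetric band in which token i may attend to token j when |i − j| ≤ (w − 1)/2, with windows simply truncated at the sequence boundaries; how the LWT treats boundary tokens in practice is not specified here, so this is an illustration rather than the implementation.

```python
def local_window_mask(n_tokens, w):
    """Boolean mask: mask[i][j] is True if token i may attend to token j.
    w must be odd (center token plus symmetric left/right wings)."""
    if w < 1 or w % 2 == 0:
        raise ValueError("w must be a positive odd integer")
    half = (w - 1) // 2
    return [[abs(i - j) <= half for j in range(n_tokens)]
            for i in range(n_tokens)]


mask = local_window_mask(n_tokens=6, w=3)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

Each interior row of the printed mask contains exactly w allowed positions, which is why w is required to be odd: the center token has the same number of neighbors on each side.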

References

Wang, F.; Harindintwali, J.D.; Yuan, Z.; Wang, M.; Wang, F.; Li, S.; Yin, Z.; Huang, L.; Fu, Y.; Li, L.; et al. Technologies and perspectives for achieving carbon neutrality. Innovation 2021, 2, 100180.
Liu, Z.; Deng, Z.; He, G.; Wang, H.; Zhang, X.; Lin, J.; Qi, Y.; Liang, X. Challenges and opportunities for carbon neutrality in China. Nat. Rev. Earth Environ. 2022, 3, 141–155.
Chen, H.; Wang, R.; Liu, X.; Du, Y.; Yang, Y. Monitoring the enterprise carbon emissions using electricity big data: A case study of Beijing. J. Clean. Prod. 2023, 396, 136427.
Liu, Y.; Xiao, H.; Zhang, N. Industrial carbon emissions of China’s regions: A spatial econometric analysis. Sustainability 2016, 8, 210.
Hu, Y.; Man, Y. Energy consumption and carbon emissions forecasting for industrial processes: Status, challenges and perspectives. Renew. Sustain. Energy Rev. 2023, 182, 113405.
Tollefson, J. China’s carbon emissions could peak sooner than forecast. Nature 2016, 531, 425–426.
Gao, H.; Wang, X.; Wu, K.; Zheng, Y.; Wang, Q.; Shi, W.; He, M. A review of building carbon emission accounting and prediction models. Buildings 2023, 13, 1617.
Libao, Y.; Tingting, Y.; Jielian, Z.; Guicai, L.; Yanfen, L.; Xiaoqian, M. Prediction of CO₂ emissions based on multiple linear regression analysis. Energy Procedia 2017, 105, 4222–4228.
Sharma, S.; Mittal, A.; Bansal, M.; Joshi, B.P.; Rayal, A. Forecasting of carbon emissions in India using (ARIMA) time series predicting approach. In Proceedings of the International Conference on Renewable Power, Singapore, 28–29 November 2023; Springer Nature: Singapore, 2023; pp. 799–811.
Lin, C.S.; Liou, F.M.; Huang, C.P. Grey forecasting model for CO₂ emissions: A Taiwan study. Appl. Energy 2011, 88, 3816–3820.
Chen, Y.; Xu, P.; Chu, Y.; Li, W.; Wu, Y.; Ni, L.; Bao, Y.; Wang, K. Short-term electrical load forecasting using the Support Vector Regression (SVR) model to calculate the demand response baseline for office buildings. Appl. Energy 2017, 195, 659–670.
Rigatti, S.J. Random forest. J. Insur. Med. 2017, 47, 31–39.
Ahmad, A.S.; Hassan, M.Y.; Abdullah, M.P.; Rahman, H.A.; Hussin, F.; Abdullah, H.; Saidur, R. A review on applications of ANN and SVM for building electrical energy consumption forecasting. Renew. Sustain. Energy Rev. 2014, 33, 102–109.
Nie, W.; Huang, Z.; Mai, S.; Ha, W.; Chen, X.; Zhang, Q.; Feng, X.; Yuan, Z. Carbon emission prediction and analysis of influencing factors based on the LSTM model. In Proceedings of the International Conference on Computer Graphics, Artificial Intelligence, and Data Processing (ICCAID 2024), Guangzhou, China, 6–8 December 2024; SPIE: Bellingham, WA, USA, 2025; Volume 13560, pp. 631–636.
Yang, F.; Liu, D.; Zeng, Q.; Chen, Z.; Ye, Y.; Yang, T.; He, Y.; Zhou, S.; Zheng, L. Prediction of Mianyang carbon emission trend based on adaptive gru neural network. In Proceedings of the 2022 4th International Conference on Frontiers Technology of Information and Computer (ICFTIC), Qingdao, China, 9–11 December 2022; IEEE: Piscataway, NJ, USA, 2022; pp. 747–750.
Han, Z.; Cui, B.; Xu, L.; Wang, J.; Guo, Z. Coupling LSTM and CNN neural networks for accurate carbon emission prediction in 30 Chinese provinces. Sustainability 2023, 15, 13934.
Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–11.
Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 11106–11115.
Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430.
Nie, Y. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. arXiv 2022, arXiv:2211.14730.
Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 11121–11128.
Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024.
Wang, Z.; Kong, F.; Feng, S.; Wang, M.; Yang, X.; Zhao, H.; Wang, D.; Zhang, Y. Is mamba effective for time series forecasting? Neurocomputing 2025, 619, 129178.
Liang, A.; Jiang, X.; Sun, Y.; Shi, X.; Li, K. Bi-mamba+: Bidirectional mamba for time series forecasting. arXiv 2024, arXiv:2404.15772.
Ahamed, M.A.; Cheng, Q. Timemachine: A time series is worth 4 mambas for long-term forecasting. In Proceedings of the ECAI 2024: 27th European Conference on Artificial Intelligence, Santiago de Compostela, Spain, 19–24 October 2024; Volume 392, p. 1688.
Ma, H.; Chen, Y.; Zhao, W.; Yang, J.; Ji, Y.; Xu, X.; Yang, G. A Mamba Foundation Model for Time Series Forecasting. arXiv 2024, arXiv:2411.02941.
Hong, J.T.; Han, S.; Yan, J.; Liu, Y.-Q. Dual-path Frequency Mamba-Transformer Model for Wind Power Forecasting. Energy 2025, 332, 137225.
Shen, T.; Shi, W.; Lei, J.; Li, Q. PAKMamba: Enhancing electricity load forecasting with periodic aggregation and Koopman analysis. Comput. Electr. Eng. 2025, 123, 110113.
Hu, J.; Duan, P.; Cao, X.; Xue, Q.; Zhao, B.; Zhao, X.; Yuan, X.; Zhang, C. A multi-energy load forecasting method based on the Mixture-of-Experts model and dynamic multilevel attention mechanism. Energy 2025, 324, 135947.
Lee, J.; Hong, S. Reliable Grid Forecasting: State Space Models for Safety-Critical Energy Systems. arXiv 2026, arXiv:2601.01410.
Xu, X.; Chen, C.; Liang, Y.; Huang, B.; Bai, G.; Zhao, L.; Shu, K. SST: Multi-Scale Hybrid Mamba-Transformer Experts for Time Series Forecasting. In Proceedings of the 34th ACM International Conference on Information and Knowledge Management, Boise, ID, USA, 21–25 October 2025; pp. 3655–3665.
Faruque, M.O.; Rabby, M.A.J.; Hossain, M.A.; Hossain, M.A.; Islam, M.R.; Rashid, M.M.U.; Muyeen, S.M. A comparative analysis to forecast carbon dioxide emissions. Energy Rep. 2022, 8, 8046–8060.
Moruri, D.; Bray, A.; Reade, W.; Chow, A. Predict CO₂ Emissions in Rwanda; Kaggle: San Francisco, CA, USA, 2023; Available online: https://kaggle.com/competitions/playground-series-s3e20 (accessed on 17 May 2025).

Figure 1. SST model architecture.

Figure 2. Main structure of the Global Patterns Expert.

Figure 3. Main structure of the Local Variations Expert.

Figure 4. Stacked LWT layers extend the receptive field.

Figure 5. Map of carbon emission observation points in the Rwanda dataset.

Figure 6. MSE, MAE, and RSE across forecasting horizons for different models.

Figure 7. Ablation results (average MSE) under I_L = 72, I_S = 36, and O ∈ {6,12,24,48}.

Table 1. Comparison of periodicity, noise level (SNR), and trend stationarity (ADF) between ETTh1 and Rwanda.

Dataset	Sampling Interval	Cycle	$P_{lag}$	SNR	ADF p
ETTh1	1 h	24 (Daily)	0.9406	15.10	0.0116
ETTh1	1 h	168 (Weekly)	0.8701	——	——
Rwanda	7 days	13 (Quarter)	0.0091	——	——
Rwanda	7 days	26 (Half a year)	0.0021	——	——
Rwanda	7 days	53 (Annual)	0.0129	−5.33	6.177 × 10⁻³⁰

Table 2. MSTP sensitivity analysis under the setting I_L-I_S-O = 72-36-6 (The underlined values indicate the best results).

Experimental Group	w	P_L, Str_L	P_S, Str_S	MSE
Low–High (baseline)	3	(8, 8)	(4, 2)	0.0090
High–High	3	(4, 2)	(4, 2)	0.0092
Low–Low	3	(8, 8)	(8, 8)	0.0095
High–Low	3	(4, 2)	(8, 8)	0.0105

Table 3. Multivariate to univariate carbon emission prediction results at different prediction horizons on the Rwanda dataset (the best results are in bold and underlined).

I_L-I_S-O	72-36-6			72-36-12			72-36-24			72-36-48			72-36-72
Model	MSE	MAE	RSE	MSE	MAE	RSE	MSE	MAE	RSE	MSE	MAE	RSE	MSE	MAE	RSE
SST	0.0089	0.0312	0.2392	0.0141	0.0446	0.3013	0.0249	0.0616	0.4003	0.0435	0.0880	0.5296	0.0743	0.1260	0.6909
PatchTST	0.0097	0.0338	0.2974	0.0194	0.0502	0.3469	0.0286	0.0712	0.4286	0.0461	0.0917	0.5637	0.0678	0.1221	0.6754
DLinear	0.0169	0.0679	0.3302	0.0279	0.0952	0.4237	0.0423	0.1229	0.5220	0.0657	0.1696	0.6509	0.0868	0.2080	0.7468
Autoformer	0.0590	0.1715	0.6170	0.0549	0.1638	0.5951	0.1400	0.2449	0.9502	0.1216	0.2291	0.8855	0.2140	0.3151	1.1743
Informer	0.0555	0.1496	0.5986	0.0838	0.1752	0.7353	0.3097	0.3329	1.4131	0.1745	0.2412	1.0606	0.6586	0.4203	2.0598
Transformer	0.0548	0.1212	0.5950	0.1674	0.1704	1.0391	0.1479	0.2091	0.9767	0.1068	0.2011	0.8297	0.2032	0.2803	1.1442

Table 4. Multivariate to univariate carbon emission prediction results for different input sequence lengths on the Rwanda dataset (the best results are in bold and underlined).

I_L-I_S-O	24-12-53			48-24-53			72-36-53			96-48-53
Model	MSE	MAE	RSE	MSE	MAE	RSE	MSE	MAE	RSE	MSE	MAE	RSE
SST				0.0544	0.1082	0.5924	0.0478	0.0953	0.5552	0.0466	0.0911	0.5481
PatchTST	0.0662	0.1191	0.6532	0.0584	0.1223	0.6327	0.0485	0.0956	0.5592	0.0491	0.1011	0.5627
DLinear	0.0696	0.1719	0.6697	0.0701	0.1751	0.6722	0.0700	0.1778	0.6715	0.0683	0.1742	0.6635
Autoformer	0.1192	0.2307	0.8766	0.1443	0.2399	0.9644	0.1543	0.2542	0.9973	0.1440	0.2564	0.9635
Informer	0.3544	0.2878	1.5114	0.1468	0.2370	0.9726	0.1569	0.2454	1.0057	0.3616	0.3320	1.5265
Transformer	0.2003	0.2167	1.1362	0.2066	0.2638	1.1539	0.1379	0.2251	0.9428	0.1097	0.1933	0.8408

Table 5. Comparison of model parameters, memory consumption and training time.

Model	Total Number of Parameters	Memory Cost (MB)	Average Time per Epoch (s/epoch)
SST	2,648,842	2685.46	69.13
PatchTST	1,614,088	3057.38	49.56
DLinear	876	21.08	4.45
Autoformer	2,732,033	380.64	43.55
Informer	2,936,065	182.02	29.15
Transformer	2,738,689	226.32	23.84

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Xiong, Y.; Wang, M. Carbon Emission Forecasting Using Multi-Scale Temporal Patches. Appl. Sci. 2026, 16, 2025. https://doi.org/10.3390/app16042025

