1. Introduction
With the accelerating penetration of renewable energy and the growing importance of demand-side flexibility in modern power systems, the controllability and responsiveness of load-side resources have become critical to maintaining secure and efficient grid operation [1]. As one of the most energy-intensive components of the cold-chain logistics sector, cold storage facilities exhibit considerable flexibility potential. By regulating refrigeration cycles and switching between freezing and insulation modes, cold storage can provide short-term peak shaving and load shifting [2], enabling participation in demand response programs, load aggregation, and virtual power plant applications. The accuracy of cold storage load forecasting directly affects the feasibility of scheduling strategies, operational safety margins, and overall response performance, thereby determining whether demand-side resources can reliably contribute to system regulation [3]. However, due to the diversity and uncertainty inherent in cold storage operations, load profiles often display strong intermittency, multi-timescale coupling, and substantial operational noise, making accurate forecasting a highly challenging task [4].
The intermittency of cold storage loads originates from compressor on-off cycling [4,5], chamber-specific temperature regulation policies, and irregular logistics activities. Operational peaks such as inbound, outbound, and picking processes can induce sharp short-term fluctuations [6,7], whereas nighttime and insulation phases exhibit sustained low-load periods. Such instability prevents models relying on stationarity assumptions from capturing critical temporal patterns. Moreover, cold storage loads encompass multiple superimposed cycles—from minute-level compressor switching [8], to hourly logistics rhythms, daily business routines, and long-term seasonal variations—making single-timescale models insufficient for learning complex periodic dependencies [9,10]. Compounding these challenges is the prevalence of high-frequency noise generated by temperature deadbands, equipment wear, ambient temperature variations, and frequent door openings. These factors cause effective temporal patterns to become sparse, while noisy and redundant components dominate the sequence, ultimately degrading model generalization and robustness. Therefore, developing forecasting models capable of capturing multi-period structures while effectively suppressing noise is essential for practical cold storage scheduling.
Time-series forecasting has long served as an important methodology in power systems, industrial control, finance, and meteorology, progressing through traditional statistical models, machine learning models, and deep learning approaches [11]. Classical statistical models such as ARIMA and SARIMA have been widely employed in load forecasting [12,13,14] due to their clear mathematical formulation of trends, cycles, and autocorrelations. Nonetheless, their reliance on linearity and stationarity assumptions limits their ability to handle non-stationary, nonlinear, and multi-scale industrial load patterns, while requiring heavy feature engineering. Machine learning models such as SVM [15], random forests (RF) [16], and gradient boosting decision trees (GBDT) [17] enhance nonlinear modeling capacity through kernel functions and ensemble techniques, but still depend on manually constructed window features and struggle to capture long-range dependencies or multi-cycle interactions. Furthermore, in high-noise industrial scenarios, their lack of sequence context modeling reduces predictive robustness. Deep learning has advanced time-series forecasting by enabling automatic pattern extraction from raw sequences [18,19,20,21]. Recurrent neural networks (RNNs) and their variants LSTM and GRU address gradient issues through gating mechanisms and perform recursive temporal modeling. Nevertheless, their sequential computation limits parallelization and hampers their ability to model long-range and multi-scale dependencies, especially under noisy and irregular load conditions. Temporal Convolutional Networks (TCN) [22] introduce dilated convolutions to expand the receptive field efficiently, achieving strong stability and training efficiency in industrial applications. Yet, convolutional architectures lack explicit mechanisms for modeling cross-scale periodic interactions, limiting their applicability in scenarios with stacked multi-period patterns.
Transformer-based forecasters have recently advanced long-sequence prediction through innovations such as sparse attention (Informer [23]), decomposition-based trend–season separation (Autoformer [24]), enhanced temporal encoding (iTransformer [25]), and patch/channel decoupling (PatchTST [26]). Frequency-structured models like TimesNet [27] and TimeMixer [28], as well as lightweight linear models like DLinear [29], further boost multi-scale pattern extraction. In contrast, the cold-storage load exhibits domain-specific complexities that fundamentally violate the assumptions underlying these models. Compressor on–off switching produces nonstationary micro-cycles that drift in both amplitude and duration; logistics-driven operations introduce abrupt, high-magnitude impulses that obscure stable periodic patterns; and chamber-level heterogeneity generates asynchronous cycles with inconsistent phase alignment. These factors result in aliasing and period fragmentation, causing existing multi-scale or multi-period modules to either detect incorrect periodicities, overfit to noise-induced pseudo-cycles, or misalign key temporal dependencies across different scales. Consequently, methods that rely on fixed decomposition schemes, static kernel periods, or rigid inter-scale interactions often fail to construct a coherent representation of the true load dynamics.
Motivated by these observations, this work proposes the multi-scale and Adaptive Multi-Period with Compression–Fusion Attention Network (MA-CFAN), which tackles the above challenges at their root. The MA-CFAN begins with a multi-resolution frequency-domain probing module that adaptively identifies latent and drifting periodicities rather than assuming fixed or global periods, enabling faithful extraction of compressor-level micro-cycles, operational mid-range rhythms, and long-term thermal inertia trends. To address noise-induced pattern dilution, a Compression–Fusion Attention (CFA) mechanism is introduced to compress redundant high-frequency activations within the attention space and selectively amplify time steps that truly govern future loads, substantially improving robustness in the presence of impulsive logistics events and switching noise. Moreover, MA-CFAN incorporates a period-weight aggregation module that dynamically evaluates the reliability of different periodic subspaces and performs cross-period fusion only when consistent dependencies exist, thereby mitigating phase misalignment and aliasing effects. Together, these components form a domain-tailored architecture capable of reconstructing stable, interpretable multi-scale structures from the intrinsically unstable and intermittently perturbed cold-storage load signals. The main contributions of this study are threefold.
We propose MA-CFAN, a forecasting model tailored to cold storage load characteristics, capable of jointly modeling multi-scale dependencies and suppressing operational noise through adaptive period mining and compression-fusion attention.
We establish a systematic benchmark of state-of-the-art deep time-series models under real cold storage conditions, providing methodological references for future research.
Using real operational data, we demonstrate the model’s superior forecasting performance and its potential applications in demand response and coordinated cold storage cluster scheduling. Overall, this work offers a new pathway for cold storage load forecasting and contributes to the broader development of intelligent control strategies for industrial flexible loads.
The remainder of this paper is organized as follows.
Section 2 describes the overall architecture of the proposed MA-CFAN model, including its multi-scale design, adaptive period extraction, and compression-fusion attention mechanism.
Section 3 introduces the dataset, baseline models, and experimental setup.
Section 4 presents the experimental results, comparative analysis, and ablation studies. Finally,
Section 5 concludes the paper and discusses limitations and directions for future research.
2. Materials and Methods
This section introduces the overall architecture and technical pathway of the proposed Multi-Scale and Adaptive Multi-Period Compression-Fusion Attention Network (MA-CFAN). Unlike existing forecasting models that struggle with drifting micro-cycles, noisy pseudo-periods, and misaligned multi-scale dependencies, MA-CFAN is explicitly designed to overcome these domain-specific challenges and reconstruct stable temporal structures from highly volatile cold-storage load sequences. The method targets several inherent difficulties in cold storage load forecasting, including multi-scale temporal patterns, multi-period structures, trend-seasonality entanglement, and substantial operational noise. As illustrated in
Figure 1, the framework integrates multi-resolution temporal processing, adaptive period extraction in the frequency domain, and unified trend-season modeling within a coherent architecture. The model comprises three major components: (1) a multi-scale input projection module composed of a multi-scale processing layer and an embedding layer; (2) the CFABlock layer, which performs parallel modeling of temporal dependencies across different temporal scales; and (3) an output prediction head that aggregates features from all scales to generate the final forecasting results.
2.1. Problem Definition
The task of cold storage load forecasting investigated in this study can be formalized as follows. Given a historical window of length T containing past cold storage power measurements, equipment operational states (e.g., compressor status, pump pressure), cold storage operation indicators (e.g., door-opening events), ambient temperature, internal sensor measurements, and temperature setpoint boundaries, the goal is to train a predictive function f_θ that estimates the cold storage power over the next H time steps. This can be expressed as:

ŷ_{T+1}, …, ŷ_{T+H} = f_θ(x_1, x_2, …, x_T),

where x_i ∈ ℝ^N represents the multivariate input at time step i, and N denotes the number of input features. In our dataset, N = 57 (details are provided in Section 3.1). The output ŷ_j denotes the predicted cold storage power at future time step j.
2.2. Multi-Scale and Multi-Period Design
Cold storage load exhibits inherent multi-scale temporal characteristics, wherein different temporal resolutions emphasize distinct structural properties. Fine-grained sequences capture rapid fluctuations and localized patterns, whereas coarse-grained sequences reveal long-term trends and smooth variations [30]. As illustrated in Figure 2, hourly load data manifest strong high-frequency oscillations, daily sequences highlight clearer trend components, and weekly sequences show that most high-frequency variations have been smoothed out. These behaviors arise from the combined influence of multiple interacting factors—such as compressor on-off cycling, product loading and unloading operations, ambient temperature variation, and circadian rhythms—which collectively induce multi-scale structures and rich multi-periodicity in cold storage loads. Moreover, real-world data are further perturbed by stochastic disturbances and dynamic control strategies, making robust modeling even more challenging.
To address these characteristics, we design a multi-scale parallel adaptive period segmentation module. Specifically, the raw load sequence is first downsampled to obtain representations at different temporal resolutions, capturing both fine and coarse patterns. Although downsampling may smooth short-term fluctuations at coarser resolutions, this effect is mitigated by the parallel multi-scale architecture of MA-CFAN. Fine-grained sequences remain fully preserved and are processed independently, ensuring that short-term dynamics are not discarded. The final prediction head aggregates representations from all scales, allowing both high-frequency fluctuations and long-term trends to jointly inform the forecasting result. Then, instead of retaining all Fourier components, we apply FFT to each scale and extract only the top-k frequency components with the highest amplitudes. Alternative time-frequency analysis methods such as wavelet transforms or learnable spectral filters were considered. Nevertheless, FFT-based period identification was adopted due to its computational efficiency, robustness to noise, and clear physical interpretability, which are particularly advantageous in industrial cold storage scenarios. Exploring learnable spectral representations is identified as a promising direction for future work. In this study, the number of dominant frequencies k is treated as a small hyperparameter and is selected from the range 3–5 based on validation performance and the physical interpretability of cold storage operations. Empirically, this range is sufficient to capture compressor micro-cycles, mid-term operational rhythms, and long-term thermal trends. Sensitivity analysis shows that MA-CFAN is robust to moderate variations in k, as the subsequent amplitude-normalized aggregation mechanism assigns lower weights to less informative frequency components, thereby reducing the risk of overfitting or information dilution.
This strategy focuses the model on the most informative periodic signals while suppressing noise-dominated frequencies that may cause overfitting. At the same time, selecting multiple dominant frequencies avoids the information loss that would arise from oversimplified period assumptions, thereby enabling the model to effectively capture the complex multi-period temporal behaviors inherent in cold storage load data.
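As a concrete illustration, the top-k period extraction described above can be sketched with NumPy. This is a simplified, hypothetical version operating on a single 1D series; the function and variable names are ours, not from the original implementation:

```python
import numpy as np

def detect_periods(x, k=3):
    """Identify the k dominant periods of a 1D series via the FFT
    amplitude spectrum (illustrative sketch, not the paper's code)."""
    n = len(x)
    spec = np.abs(np.fft.rfft(x - x.mean()))  # mean removal suppresses the DC term
    spec[0] = 0.0                             # guard against residual DC energy
    k = min(k, len(spec) - 1)
    top = np.argsort(spec)[-k:][::-1]         # frequency bins with largest amplitude
    periods = [max(1, n // f) for f in top]   # period length = series length / freq index
    return periods, spec[top]

# toy series: a daily (period 24) plus a half-daily (period 12) cycle
t = np.arange(240)
x = 2.0 * np.sin(2 * np.pi * t / 24) + 1.0 * np.sin(2 * np.pi * t / 12)
periods, amps = detect_periods(x, k=2)  # -> periods [24, 12], amplitudes descending
```

Noise-dominated bins are naturally excluded because only the k largest amplitudes survive, which is the behavior the paragraph above relies on.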
2.3. MA-CFAN
2.3.1. Multi-Scale Input Projection Layer
Prior studies such as PatchTST and DLinear adopt channel-independent processing to better model intra-variable temporal relations [26,29,30]. However, cold storage load dynamics fundamentally arise from coupled interactions among multiple variables (e.g., compressor cycling, door operations, ambient temperature fluctuations). Therefore, we adopt a channel-mixing strategy to explicitly capture these cross-variable temporal dependencies, which is crucial for accurate cold-load modeling. The multi-scale input projection layer consists of two modules: (i) multi-scale processing, and (ii) embedding. Given the original input sequence X^{(0)} ∈ ℝ^{L×N}, we apply average pooling to construct M scales:

X^{(m)} = AvgPool(X^{(m−1)}), m = 1, …, M − 1,

where L is the sequence length and N is the feature dimension.
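The multi-scale construction can be sketched as repeated average pooling over the time axis. The stride-2 kernel below is an assumption for illustration; the original module may use a different pooling factor:

```python
import numpy as np

def multi_scale(x, num_scales=3):
    """Build coarse views of an (L, N) series by repeated average pooling
    with stride 2 (pooling factor is an assumption, not from the paper)."""
    scales = [x]
    cur = x
    for _ in range(num_scales - 1):
        L = cur.shape[0] - cur.shape[0] % 2              # trim to an even length
        cur = cur[:L].reshape(-1, 2, cur.shape[1]).mean(axis=1)
        scales.append(cur)
    return scales

x = np.arange(16, dtype=float).reshape(8, 2)  # L=8 steps, N=2 features
views = multi_scale(x, num_scales=3)          # shapes (8,2), (4,2), (2,2)
```

The finest view is kept unchanged in the list, mirroring the text's point that short-term dynamics are preserved alongside the coarser trends.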
Each scale then undergoes channel-mixed value embedding and temporal positional encoding:

H^{(m)} = ValueEmbed(X^{(m)}) + PosEmbed(X^{(m)}), m = 0, …, M − 1,

producing the embedded sequence set {H^{(m)}} with H^{(m)} ∈ ℝ^{L_m×d}, where d denotes the embedding dimension.
2.3.2. CFABlocks
After projection, the M multi-scale sequences are processed independently by M parallel CFABlocks. For the m-th scale, the forward computation is:

Z^{(m)} = CFABlock_m(H^{(m)}).
Each CFABlock consists of three major components:
- (a)
Multi-Period Reshape
For a given scale, the input H^{(m)} ∈ ℝ^{L_m×d} is adaptively transformed into K higher-dimensional 2D representations. FFT is applied independently to each sample on the channel-averaged embedded sequence. The amplitude spectrum is obtained by averaging magnitudes across embedding dimensions, and the top-k frequencies with the largest averaged amplitudes are selected:

{f_1, …, f_k}, {A_1, …, A_k} = TopK_k(Avg(|FFT(H^{(m)})|)), p_i = ⌈L_m / f_i⌉,

where A_i denotes the unnormalized amplitude and p_i represents the estimated period length. Based on the detected frequencies, the 1D input is padded and reshaped into 2D tensors:

Z_i^{(m)} = Reshape_{f_i × p_i}(Pad(H^{(m)})), i = 1, …, k.

Each tensor captures intra-period variations (columns) and inter-period variations (rows). This dual locality allows attention to extract structural patterns along both directions.
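The pad-and-fold step can be illustrated as follows. This is a hypothetical sketch on a single (L, d) sequence; the real module operates batch-wise on embedded tensors:

```python
import numpy as np

def reshape_by_period(x, period):
    """Zero-pad an (L, d) sequence and fold it into (rows, period, d):
    columns index intra-period position, rows index successive periods."""
    L, d = x.shape
    rows = int(np.ceil(L / period))
    pad = rows * period - L
    xp = np.concatenate([x, np.zeros((pad, d))], axis=0)  # pad the tail with zeros
    return xp.reshape(rows, period, d)

x = np.ones((10, 4))                    # L=10 steps, d=4 channels
t2d = reshape_by_period(x, period=4)    # shape (3, 4, 4); last 2 slots are padding
```

Folding this way makes one detected period span each row, so 2D operations can read repetition along columns and evolution down rows, as the text describes.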
- (b)
Compression-Fusion Attention
The CFABlock is designed to simultaneously capture seasonal repetition and trend evolution across multiple periods, while suppressing redundant noise through directional compression. The structure of compressing attention is shown in Figure 1. Once reshaped by period, each 2D representation naturally separates into a seasonal tensor sensitive to intra-period repetition and a trend tensor sensitive to inter-period evolution. To focus on meaningful periodic structures, compression-fusion attention does not apply attention directly to the full 2D tensor. Instead, it performs directional adaptive compression: the seasonal branch compresses along the trend dimension, and the trend branch compresses along the seasonal dimension. This produces compact vectors that preserve representative periodic and trend-related patterns while discarding high-frequency noise caused by dead zones, device wear, or door openings.
Let the multi-period reshaped feature tensor be Z ∈ ℝ^{P×L×d}, where P denotes the number of extracted periods, L is the intra-period length, and d is the embedding dimension. The seasonal branch compresses information along the inter-period dimension using average pooling:

S = AvgPool_P(Z) ∈ ℝ^{L×d},

where S represents the compressed seasonal representation. Similarly, the trend branch compresses information along the intra-period dimension:

T = AvgPool_L(Z) ∈ ℝ^{P×d},

where T denotes the compressed trend representation.
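The two directional compressions reduce to simple axis-wise means. A minimal sketch, assuming average pooling collapses the full axis as in the equations above:

```python
import numpy as np

def compress(z):
    """Directional compression of a (P, L, d) tensor: the seasonal branch
    averages across periods (axis 0), the trend branch averages within
    each period (axis 1). Illustrative sketch only."""
    seasonal = z.mean(axis=0)  # (L, d): average intra-period profile
    trend = z.mean(axis=1)     # (P, d): per-period activity level
    return seasonal, trend

z = np.random.default_rng(0).normal(size=(5, 24, 8))  # P=5 periods, L=24, d=8
s, tr = compress(z)  # shapes (24, 8) and (5, 8)
```

Averaging over whole axes is the most aggressive choice; it keeps only the representative profile per direction, which is exactly the noise-discarding behavior the text motivates.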
Within the Compression-Fusion Attention (CFA) module, Full Attention (Figure 3) serves as the fundamental operation applied to each compressed 2D tensor representation. After multi-period decomposition and structural reshaping, each transformed 2D tensor Z is projected into query, key, and value spaces:

Q = Z W_Q, K = Z W_K, V = Z W_V,

where W_Q, W_K, W_V ∈ ℝ^{d×d} are learnable matrices. CFA applies the standard scaled dot-product attention on each decomposed period component:

Attention(Q, K, V) = Softmax(Q Kᵀ / √d) V.

This operation computes dense token-to-token interactions within each compressed temporal-frequency block, allowing the model to extract both intra-period structure and localized multi-frequency patterns. Full Attention measures the similarity between all token pairs in the compressed representation through the dot product Q Kᵀ. The Softmax function then assigns adaptive weights to each position, enabling the model to:
focus on salient temporal regions within a specific period,
capture high-resolution relationships preserved during compression,
maintain full expressiveness despite operating on shorter sequences.
Because the input to Full Attention has already been compressed through multi-period restructuring, CFA retains Full Attention’s modeling power while dramatically reducing computational cost. The attention outputs are later fused across periods using amplitude-normalized adaptive weights, completing the CFA pipeline.
The seasonal branch is processed using a Full Attention, enabling it to highlight repetitive structures and phase shifts. The trend branch preserves long-range evolution patterns without distortion. Both branches integrate time-frequency joint modeling, allowing attention to consider temporal locality and energy distribution across frequencies. After attention, the two branches are reshaped back to their 3D forms and fused multiplicatively: not by simple concatenation, but through element-wise interaction, allowing the model to dynamically emphasize trend-enhanced seasonality or seasonally modulated trend shifts. This asymmetric compression-fusion strategy enables CFABlock to explicitly model the interaction between long-term trend evolution and intra-period seasonal repetition while maintaining low computational complexity. This mechanism explicitly encodes the real-world phenomenon where repeated behaviors (e.g., compressor cycling) vary with long-term operating levels (e.g., daytime vs. nighttime load), making CFABlock well aligned with the coupled structures in cold storage operations.
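The scaled dot-product attention applied to each compressed branch can be sketched in a few lines of NumPy. This is a single-head illustration with randomly initialized projection matrices standing in for the learnable W_Q, W_K, W_V:

```python
import numpy as np

def full_attention(x, wq, wk, wv):
    """Standard scaled dot-product attention over a compressed (n, d)
    representation (single head; a sketch, not the paper's code)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    a = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    a = a / a.sum(axis=-1, keepdims=True)
    return a @ v

rng = np.random.default_rng(1)
x = rng.normal(size=(6, 8))                      # 6 compressed tokens, d=8
wq, wk, wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = full_attention(x, wq, wk, wv)              # shape (6, 8)
```

Because n here is the compressed length rather than the full sequence length, the quadratic cost of the dense score matrix stays small, which is the efficiency argument made above.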
- (c)
periodic weight aggregation
Finally, to generate the input for the next layer, the CFABlock fuses the k extracted 1D representations {Ẑ_1, …, Ẑ_k} through an adaptive, amplitude-informed weighting mechanism. Inspired by the Auto-Correlation principle in [24], the amplitude values A_1, …, A_k associated with each selected frequency quantify the relative significance of the corresponding periodic component. These amplitudes naturally reflect the contribution of each transformed 2D tensor, enabling a principled fusion strategy. To this end, we first normalize the amplitudes using a Softmax function:

α_i = exp(A_i) / Σ_{j=1}^{k} exp(A_j), i = 1, …, k,

and then compute the aggregated representation as

Z_out = Σ_{i=1}^{k} α_i · Ẑ_i.
Since both intra-period and inter-period variations have already been encoded within the set of structurally enriched 2D tensors, this amplitude-guided fusion enables the CFABlock to effectively capture diverse multi-scale temporal patterns. Consequently, the CFABlock provides a more expressive and robust representation than directly modeling the raw 1D input sequence, ensuring stronger temporal modeling capability across heterogeneous periodicities.
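The amplitude-guided fusion is a softmax-weighted sum over the k per-period representations. A minimal sketch (names are ours):

```python
import numpy as np

def aggregate(reps, amps):
    """Fuse k per-period representations with softmax-normalized
    amplitude weights (illustrative sketch of the aggregation step)."""
    w = np.exp(amps - amps.max())      # numerically stable softmax
    w = w / w.sum()
    return sum(wi * r for wi, r in zip(w, reps))

reps = [np.full((4, 2), 1.0), np.full((4, 2), 3.0)]
out = aggregate(reps, np.array([0.0, 0.0]))  # equal amplitudes -> elementwise mean
```

With equal amplitudes the result is the plain mean; a dominant amplitude pulls the fused tensor toward its periodic subspace, which is how weak or noisy pseudo-periods are down-weighted.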
2.4. Output Prediction Head
After processing all scales, we obtain the multi-scale feature set. Since each scale captures distinct temporal patterns, we assign an independent prediction head to each scale:

ŷ^{(m)} = Linear_m(Z^{(m)}),

where Linear_m is a linear layer used for the m-th scale. The per-scale predictions are then aggregated to produce the final forecast.
3. Baseline Models and Experimental Setup
3.1. Dataset
This study utilizes the historical operational data from a cold storage facility located in Jinan, China. The warehouse occupies approximately 4658 m² and consists of three freezer chambers and a loading bay. Its cooling capacity is jointly supplied by a compressor system with an input power of 265.2 kW and an air-cooling system rated at 149.7 kW.
The dataset spans the period from 26 June 2023 to 11 June 2024, covering nearly one full year of operation. The raw operational data are originally recorded at a 10-min resolution, yielding a total of 50,550 time steps, each containing 57 features (see
Table 1). For model training and evaluation, the data are aggregated to a 1-h resolution using mean aggregation to reduce high-frequency noise and align with practical scheduling requirements. The features are closely related to the cold-storage load, including:
Historical load power, which serves as the prediction target.
Equipment operational data, characterizing the states of compressors, evaporators, and fans, which directly influence load variations.
Operational status signals, such as door-opening records, reflecting cargo inflow/outflow that alters thermal disturbances.
Temperature setpoints (upper and lower bounds), which govern the regulation behavior of the cooling system. For example, after high-temperature goods enter the chamber, compressors and fans compensate by increasing power output.
Outdoor ambient temperature, which impacts overall energy consumption: higher loads in summer due to heat ingress and reduced loads in winter.
Chamber temperatures of the three freezers, included to enrich input dimensionality and improve forecasting precision.
All features are normalized using Z-score standardization. Occasional missing values are handled using forward filling, while abnormal sensor readings are implicitly mitigated through Z-score normalization and the model’s robustness to noise. Extreme operational events are retained in the dataset, as they reflect realistic cold storage operating conditions. The dataset is split chronologically into 70% training, 10% validation, and 20% testing subsets to prevent information leakage between past and future observations.
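The preprocessing pipeline (chronological 70/10/20 split plus z-score standardization) can be sketched as below. Fitting the statistics on the training portion only is our assumption, consistent with the stated goal of preventing information leakage:

```python
import numpy as np

def chrono_split_standardize(x, train=0.7, val=0.1):
    """Chronological 70/10/20 split; z-score statistics are fit on the
    training slice only (an assumption, consistent with leakage avoidance)."""
    n = len(x)
    i, j = int(n * train), int(n * (train + val))
    mu = x[:i].mean(axis=0)
    sd = x[:i].std(axis=0) + 1e-8          # guard against constant features
    norm = lambda a: (a - mu) / sd
    return norm(x[:i]), norm(x[i:j]), norm(x[j:])

x = np.arange(100, dtype=float).reshape(50, 2)  # 50 time steps, 2 features
tr, va, te = chrono_split_standardize(x)        # lengths 35, 5, 10
```

A chronological (rather than shuffled) split matters here because adjacent time steps are strongly correlated; shuffling would leak near-future values into training.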
3.2. Baseline Models
To comprehensively evaluate the effectiveness of the proposed MA-CFAN model, we compared it against a diverse set of state-of-the-art baselines covering recurrent architectures, linear models, decomposition-based networks, and Transformer-family models. These baselines represent the mainstream paradigms in contemporary time-series forecasting. In addition, we include a simple persistent (seasonal naive) baseline as a reference. This baseline generates predictions by directly copying historical observations from previous temporal patterns (e.g., the most recent value or the value at the same hour of the previous day), without any parameter learning. Although conceptually simple, such a seasonal naive model serves as a meaningful lower-bound benchmark that anchors the intrinsic difficulty of the forecasting task and helps contextualize the performance gains achieved by more sophisticated models. Specifically, for a given forecasting horizon pred_len, the persistent (seasonal naive) baseline (SN) generates the prediction by directly copying the historical load values from pred_len time steps earlier, i.e., the forecasted sequence is identical to the corresponding historical segment immediately preceding the prediction window. This formulation provides a clear and reproducible reference that reflects purely seasonal repetition without any model training.
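The SN baseline as described reduces to copying the pred_len observations immediately preceding the prediction window:

```python
import numpy as np

def seasonal_naive(history, pred_len):
    """Persistent (seasonal naive) forecast: repeat the last pred_len
    observed values as the prediction for the next pred_len steps."""
    history = np.asarray(history, dtype=float)
    assert len(history) >= pred_len, "history shorter than forecast horizon"
    return history[-pred_len:].copy()

hist = np.arange(10.0)             # toy load history
fc = seasonal_naive(hist, 4)       # -> [6., 7., 8., 9.]
```

Any learned model that cannot beat this zero-parameter copy on a given horizon is not extracting usable temporal structure, which is exactly the anchoring role described above.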
Long Short-Term Memory (LSTM) networks are classical recurrent architectures that capture temporal dependencies through gated memory units. LSTM serves as a strong traditional baseline for load forecasting due to its ability to model sequential patterns, though its limited capacity to extract long-range dependencies and multi-periodic structures constrains its performance on complex cold-storage signals.
DLinear [29] is a simple yet competitive linear model that decomposes time series into trend and seasonal components via channel-wise linear projections. Its minimal architectural assumptions enable efficient training and strong performance on many benchmark datasets, making it a popular baseline for testing whether complex models truly outperform linear structures.
TimeMixer [28] leverages temporal token mixing and feature mixing to capture cross-channel interactions and temporal patterns in a lightweight architecture. Its design enables efficient modeling of multi-frequency structures, offering a strong baseline for comparing multi-period extraction capabilities.
TimesNet [27] utilizes multi-period based convolutional encoders to capture 2D temporal patterns by transforming 1D signals into structured tensors. Due to its use of period-aware feature extraction, TimesNet is particularly relevant for cold-storage load forecasting, which is dominated by strong periodic behaviors.
Informer [23] employs ProbSparse self-attention and a generative decoder to handle long sequence forecasting efficiently. Its architecture is designed for scalability and large-range dependency modeling, providing an important reference for Transformer-based long-term prediction tasks.
Autoformer [24] integrates an auto-correlation mechanism to explicitly model period-based dependencies and decomposes time series into trend and seasonal components. As one of the first decomposition-enhanced Transformers, Autoformer is a key baseline for evaluating the periodic extraction ability of our proposed CFABlock.
iTransformer [25] introduces an inverted attention mechanism that swaps the roles of the feature and temporal dimensions, enabling cross-variable dependency modeling with improved computational efficiency. Its ability to learn channel-wise correlations makes it suitable for multivariate load forecasting tasks.
PatchTST [26] is a patch-level Transformer model that divides the input sequence into non-overlapping temporal patches and applies self-attention on patch embeddings. This patch representation enhances local pattern extraction and improves generalization on long sequences, making it a strong benchmark for high-frequency operational data.
3.3. Experimental Setup
For fair comparison, all baseline models and MA-CFAN are trained using the same experimental configuration, including identical input length, prediction horizons, data splits, optimizer settings, early stopping criteria, and training epochs. The look-back window is fixed at 96 time steps, corresponding to four days. Forecasting horizons are set to 24, 48, 96, and 192 steps, representing 1-day, 2-day, 4-day, and 8-day forecasts, respectively, thus covering short-, medium-, and long-term prediction ranges. Each model is trained for 50 epochs with early stopping (patience = 20). The Adam optimizer is used for stable and efficient training, with an initial learning rate of 0.001 that decays exponentially. Traditional TSF models commonly adopt the Mean Squared Error (MSE) loss:

L_MSE = (1/H) Σ_{j=1}^{H} (ŷ_j − y_j)².
However, MSE is insufficient for capturing structural patterns in cold-storage load sequences. To address this limitation, we introduce a hybrid objective combining MSE with a Patch-wise Structural (PS) loss (Equation (21)). This method first performs Fourier-based Adaptive Patching (FAP), where a dominant frequency f yields an initial period p = ⌈L/f⌉, from which the patch length is derived. On the patched predicted and target sequences, the PS loss is defined by three components: a correlation loss, a variance loss, and a mean alignment loss. A gradient-based dynamic weighting strategy assigns adaptive weights w_corr, w_var, and w_mean, where w_corr and w_var reflect covariance and variance consistency. The PS loss is then

L_PS = w_corr L_corr + w_var L_var + w_mean L_mean,

and the final training objective becomes

L = L_MSE + λ L_PS,

where λ balances the pointwise and structural terms.
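A hedged sketch of the patch-wise structural terms follows, using equal weights instead of the gradient-based dynamic weighting (the weighting scheme, patch handling, and names here are our simplifications, not the paper's implementation):

```python
import numpy as np

def ps_loss(pred, true, patch_len):
    """Sketch of a patch-wise structural loss: per-patch correlation,
    variance, and mean discrepancies, equally weighted here for
    illustration (the paper uses gradient-based dynamic weights)."""
    n = (len(pred) // patch_len) * patch_len     # drop the incomplete tail patch
    P = pred[:n].reshape(-1, patch_len)
    T = true[:n].reshape(-1, patch_len)
    eps = 1e-8
    cov = ((P - P.mean(1, keepdims=True)) * (T - T.mean(1, keepdims=True))).mean(1)
    corr = cov / (P.std(1) * T.std(1) + eps)     # per-patch Pearson correlation
    l_corr = (1.0 - corr).mean()                 # penalize shape mismatch
    l_var = np.abs(P.var(1) - T.var(1)).mean()   # penalize amplitude mismatch
    l_mean = np.abs(P.mean(1) - T.mean(1)).mean()  # penalize level mismatch
    return l_corr + l_var + l_mean

rng = np.random.default_rng(2)
y = rng.normal(size=48)
loss_same = ps_loss(y, y, patch_len=12)   # identical series -> near zero
```

Unlike MSE, each term compares patch-level statistics, so a forecast with the right shape but a small phase jitter is penalized less harshly than one with the wrong periodic structure.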
The evaluation metrics for model performance in this study include the Mean Squared Error (MSE), whose formulation is presented in Equation (15), along with the Mean Absolute Error (MAE) and the Mean Absolute Percentage Error (MAPE), whose formulations are presented in Equations (23) and (24), respectively. MAE is calculated in the normalized space, while MAPE is computed after inverse transformation back to the original kilowatt (kW) scale.
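For reference, the three evaluation metrics can be computed as follows (a straightforward sketch; the small eps guard against zero loads is our addition):

```python
import numpy as np

def metrics(pred, true, eps=1e-8):
    """MSE, MAE, and MAPE (in percent) for a forecast/target pair."""
    err = pred - true
    mse = float(np.mean(err ** 2))
    mae = float(np.mean(np.abs(err)))
    mape = float(np.mean(np.abs(err) / (np.abs(true) + eps)) * 100.0)
    return mse, mae, mape

p = np.array([110.0, 90.0])   # forecast, kW
t = np.array([100.0, 100.0])  # ground truth, kW
mse, mae, mape = metrics(p, t)
```

Note that, per the text, MAE would be evaluated on normalized values while MAPE requires the series to be inverse-transformed back to kW first, since percentage errors are meaningless on z-scored data centered at zero.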
4. Results and Discussion
4.1. Main Results
Using the experimental configuration described above, we conducted extensive empirical evaluations of MA-CFAN against six state-of-the-art time series forecasting models on the cold-storage dataset. As shown in
Table 2 and
Table 3, MA-CFAN consistently achieved the best performance across all forecasting horizons, demonstrating superior robustness from short-term (24 h) to long-term (192 h) load prediction. Although evaluated on a single facility, MA-CFAN is designed to learn structural temporal dependencies driven by refrigeration mechanics and operational cycles rather than site-specific parameters. This design enables potential transferability to cold storage facilities with different sizes, equipment configurations, and climatic conditions.
From the perspective of forecasting difficulty, the persistent (seasonal naive) baseline provides a meaningful lower-bound reference. Across the majority of evaluation metrics and forecasting horizons, deep learning-based models consistently outperform this naive baseline, indicating that the prediction task cannot be solved by simple historical repetition alone. This performance gap becomes more pronounced as the forecasting horizon increases, reflecting the growing difficulty of capturing long-term temporal dependencies and non-stationary operational patterns in cold storage load data.
Compared with the traditional recurrent model LSTM, MA-CFAN reduces the average MSE by 19.3%, MAE by 14.8%, and MAPE by 23.8%. This improvement can be attributed to the fact that LSTM’s gated architecture struggles to capture long-range dependencies and cross-variable interactions inherent in multi-periodic cold-storage load patterns. Among Transformer-based baselines, both Informer and Autoformer perform relatively poorly. Autoformer relies heavily on periodic decomposition and, thus, becomes unstable under multi-period, multi-frequency data; Informer’s ProbSparse attention effectively handles long sequences but tends to lose critical temporal fluctuations due to sparsification.
DLinear, TimeMixer, and TimesNet outperform other baselines but still fall short of MA-CFAN. Compared with the strongest baseline DLinear, MA-CFAN improves average MSE by 4.25%, MAE by 2.8%, and MAPE by 3.6%. DLinear shows stable performance—especially in longer forecasting horizons (48, 96, 192 h), where it achieves 7 out of 9 second-best results—yet its purely linear mapping limits its ability to model the nonlinear multi-frequency dynamics of cold-storage loads. TimesNet exhibits strong short-term prediction capability due to its multi-scale convolutional architecture but deteriorates in long-term forecasting because convolution kernels have inherently restricted receptive fields. TimeMixer performs more consistently, benefiting from its coarse-to-fine multi-scale interaction design, yet still fails to fully uncover intricate temporal dependencies. In contrast, MA-CFAN’s multi-scale representation and multi-period CFA mechanism more effectively capture both global and local temporal structures.
iTransformer and PatchTST form the second-tier group. Both rely on patch-level modeling rather than point-wise attention: iTransformer treats each variable as a token and focuses on cross-variable relations while ignoring within-series temporal structure; PatchTST splits the series via fixed-length patches but applies channel-independent processing, limiting its ability to capture cross-channel interactions. Thanks to channel-mixing and adaptive multi-scale period decomposition, MA-CFAN surpasses both models by a considerable margin:
- Compared with iTransformer: MSE −9.9%, MAE −6.4%, MAPE −5.2%;
- Compared with PatchTST: MSE −10.6%, MAE −6.4%, MAPE −8.2%.
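The percentage figures above follow the standard definitions of the three error metrics and of relative reduction. The sketch below spells these out; the sample arrays and prediction values are hypothetical, chosen only to exercise the formulas:

```python
import numpy as np

def mse(y, yhat):  return float(np.mean((y - yhat) ** 2))
def mae(y, yhat):  return float(np.mean(np.abs(y - yhat)))
def mape(y, yhat): return float(np.mean(np.abs((y - yhat) / y)) * 100)

def reduction_pct(baseline, model):
    """Relative error reduction of `model` versus `baseline`, in percent."""
    return (baseline - model) / baseline * 100

y      = np.array([10.0, 12.0, 8.0, 11.0])
pred_a = np.array([10.5, 11.0, 8.5, 10.0])   # hypothetical baseline forecast
pred_b = np.array([10.2, 11.8, 8.1, 10.8])   # hypothetical improved forecast

mse_gain = reduction_pct(mse(y, pred_a), mse(y, pred_b))
```

A reported figure such as “MSE −10.6%” corresponds to `reduction_pct` evaluated on the two models’ averaged errors.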
Figure 4 visualizes the prediction curves of MA-CFAN, DLinear, and TimesNet for 24-, 48-, 96-, and 192-step forecasting. MA-CFAN maintains consistently superior alignment with the ground truth across all forecasting horizons. Notably, the real-world dataset includes abrupt logistics activities, compressor switching events, and dynamic control strategy changes. The proposed CFA module enhances robustness under such disturbances by compressing redundant high-frequency activations and selectively emphasizing structurally consistent temporal patterns. As evidenced by the stable performance gains across all horizons, MA-CFAN maintains strong predictive accuracy even under highly stochastic operational conditions. In the short 24-step prediction, both DLinear and TimesNet show a large offset at the starting position of the forecast and fail to capture the sudden change in trend, whereas the proposed MA-CFAN better captures this departure from past behavior. At longer horizons, especially the 192-step-ahead forecast (8 days), TimesNet captures the general trend but misses high-frequency oscillations due to its limited receptive field, while DLinear deviates significantly from the actual curve and tends to repeat similar temporal patterns. MA-CFAN more accurately reproduces both the trend and the fluctuation structure, which is crucial for cold-storage scheduling, although minor deviations from the raw data remain.
Furthermore, Figure 5 summarizes the average MSE across forecasting windows. As expected, prediction errors increase with horizon length for all models. Nevertheless, MA-CFAN achieves the lowest MSE at every horizon, demonstrating strong robustness and generalization. DLinear remains the closest competitor, yet its error is 1.1–1.8% higher than MA-CFAN’s across windows.
The results reported in the main text are obtained using a fixed random seed of 2025. To further assess the robustness of the proposed method with respect to random initialization, MA-CFAN and several representative baseline models are additionally evaluated across multiple random seeds (2021, 2022, 2023, 2024, and 2025). For each seed, the MSE and MAE scores are computed, and the mean and standard deviation of the results are summarized in Table 4. Among all compared methods, DLinear exhibits the most stable performance, followed closely by MA-CFAN, while the remaining models show relatively larger variations. Overall, the variances across random seeds are consistently small, indicating that MA-CFAN is robust to the choice of random seed.
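The mean-and-standard-deviation summary in Table 4 can be reproduced with a few lines of standard-library code; the per-seed MSE values below are illustrative placeholders, not the paper’s measured results:

```python
import statistics

# hypothetical per-seed MSE scores for one model (values for illustration only)
seed_mse = {2021: 0.412, 2022: 0.418, 2023: 0.409, 2024: 0.415, 2025: 0.411}

scores   = list(seed_mse.values())
mean_mse = statistics.mean(scores)
std_mse  = statistics.stdev(scores)   # sample standard deviation over seeds
cv       = std_mse / mean_mse         # coefficient of variation

# a coefficient of variation well below a few percent indicates seed robustness
```

Reporting the coefficient of variation alongside mean ± std makes the “consistently small variance” claim directly comparable across models with different error scales.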
4.2. Ablation Study
To evaluate the contribution of each architectural component, we conduct a systematic ablation study on MA-CFAN, with DLinear, PatchTST, TimeMixer and TimesNet included as competitive baselines for intuitive comparison. Three variants of MA-CFAN are designed:
MA-FAN—Replace the proposed Compress-Fusion Attention (CFA) with the full attention mechanism of the vanilla Transformer.
M-MLP—Replace the CFA module with a multilayer perceptron.
Patch-CFAN—Replace the multi-scale and multi-period processing with the fixed-length patching strategy of PatchTST.
As shown in Figure 6, all three variants exhibit different degrees of performance degradation, demonstrating the indispensable role of each architectural element.
Among the three variants, MA-FAN replaces the proposed CFA with the full attention mechanism of the vanilla Transformer. The average MSE of MA-FAN increases by 4.8%, and the MAE increases by 3.0% compared with MA-CFAN. This performance drop provides strong evidence that full attention is less effective in modeling the structured seasonal-trend interactions that CFA explicitly compresses and fuses across multiple periods. Full attention processes all pairwise dependencies uniformly, causing it to dilute the periodic and trend-aligned patterns that are crucial for cold storage load forecasting. Even so, MA-FAN still shows performance comparable to DLinear and remains superior to several other baselines, suggesting that the multi-scale and multi-period representation design retains its predictive strength even when CFA is removed.
For the M-MLP variant, replacing CFA with a multilayer perceptron results in a further decline in accuracy. The average MSE of M-MLP increases by 8.8%, and the MAE increases by 5.1%, compared with MA-CFAN. This degradation suggests that simple nonlinear transformations cannot compensate for the loss of the structured feature extraction and cross-period fusion capabilities embedded in CFA. The MLP fails to capture temporal dependencies with explicit periodicity, leading to weaker representations and diminished forecasting accuracy.
Patch-CFAN, which substitutes the adaptive multi-scale and multi-period module with the fixed-length patching strategy of PatchTST, also shows substantial performance degradation. The inferior results indicate that fixed patches cannot effectively align with variable-length seasonal patterns or capture cross-scale temporal dependencies. Cold storage load exhibits multi-periodicity influenced by operational cycles, refrigeration mechanics, and environmental temperature fluctuations, making the patch-based representation insufficient for learning such dynamic structures.
Overall, the consistent performance decreases across all variants provide clear evidence that both the Compress-Fusion Attention module and the multi-scale, multi-period representation design are essential for MA-CFAN’s superior forecasting ability. The ablation results validate that CFA plays a central role in capturing interpretable seasonal-trend interactions, while the multi-scale/multi-period processing framework ensures comprehensive temporal representation. These components jointly enable MA-CFAN to outperform existing baselines and maintain robust predictive performance under complex operating conditions.
To further justify the design choice of multiplicative fusion in the proposed compression-fusion attention module, we conducted a dedicated ablation study comparing it with several commonly used fusion strategies, including additive fusion, MLP-based fusion, and gated fusion. The goal of this experiment was to examine how different interaction mechanisms between trend and seasonal components affect the overall forecasting performance. Due to the FFT-based adaptive period decomposition, the number of extracted periods varies across samples and, thus, the tensor shapes after period segmentation are not fixed in advance.
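A minimal sketch of FFT-based dominant-period detection, which underlies this kind of adaptive decomposition, is shown below; the function, the top-k selection rule, and the synthetic two-period signal are illustrative assumptions rather than the exact MA-CFAN procedure:

```python
import numpy as np

def top_k_periods(x: np.ndarray, k: int = 2) -> list:
    """Return the k dominant periods from the amplitude spectrum (DC excluded)."""
    amp = np.abs(np.fft.rfft(x - x.mean()))
    amp[0] = 0.0                          # ignore the zero-frequency bin
    top = np.argsort(amp)[::-1][:k]       # strongest frequency-bin indices
    return [len(x) // int(f) for f in top if f > 0]

t = np.arange(192)
x = np.sin(2 * np.pi * t / 24) + 0.5 * np.sin(2 * np.pi * t / 8)
periods = top_k_periods(x, k=2)           # detected periods, strongest first
```

Because the detected periods depend on each sample’s spectrum, segmenting by them yields tensors whose period dimension genuinely varies from sample to sample, which is the shape-variability issue discussed above.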
After the compression-attention operation, the trend component is reduced to a compact two-dimensional representation, and the seasonal component to another, where L denotes the intra-period length, P denotes the number of detected periods, and d is the feature dimension. These intermediate representations serve as the inputs to the fusion strategies described below.
For additive fusion, the compressed trend and seasonal representations are first expanded to compatible 3D tensors and then summed element-wise. This strategy assumes equal and independent contributions from trend and seasonal components without explicitly modeling their interactions.
Direct interaction via an MLP is not feasible due to the adaptive and sample-dependent tensor shapes produced by the FFT-based period decomposition. Therefore, both the trend and seasonal components are first expanded to 3D tensors, then reshaped into 2D representations consistent with the Multi-Period Reshape operation. The resulting vectors are concatenated and passed through a fully connected MLP layer to model cross-component interactions.
The gated fusion strategy follows the same preprocessing steps as the MLP-based fusion. After concatenation, a gating unit is applied to adaptively control the information flow between the trend and seasonal components, allowing the model to selectively emphasize or suppress each component.
In contrast, the proposed multiplicative fusion directly models the interaction between trend and seasonal components via element-wise multiplication after compression, without introducing additional parameters or requiring tensor reshaping. This design enables dynamic modulation between the two components while preserving structural simplicity.
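The broadcasting step shared by additive and multiplicative fusion can be sketched as follows; the assumed shapes (P, d) for the compressed trend and (L, d) for the compressed seasonal part, and the random tensors, are illustrative stand-ins for the model’s actual representations:

```python
import numpy as np

rng = np.random.default_rng(0)
L, P, d = 24, 4, 8                       # intra-period length, #periods, feature dim
trend    = rng.normal(size=(P, d))       # assumed compressed trend representation
seasonal = rng.normal(size=(L, d))       # assumed compressed seasonal representation

def additive_fusion(trend, seasonal):
    """Broadcast both parts to (L, P, d) and sum: independent contributions."""
    return seasonal[:, None, :] + trend[None, :, :]

def multiplicative_fusion(trend, seasonal):
    """Element-wise product after broadcasting: the trend modulates the
    seasonal part, with no extra learnable parameters."""
    return seasonal[:, None, :] * trend[None, :, :]

fused_add = additive_fusion(trend, seasonal)
fused_mul = multiplicative_fusion(trend, seasonal)
```

Note that multiplicative fusion scales each seasonal feature by the corresponding trend feature, so a near-zero trend channel suppresses its seasonal counterpart entirely—an interaction the additive form cannot express.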
The ablation results are summarized in Figure 7. Among all compared fusion strategies, the proposed multiplicative fusion consistently achieves the best forecasting performance across all evaluation metrics. This indicates that explicitly modeling cross-component modulation through multiplicative interaction is more effective than additive or parameter-heavy fusion mechanisms in capturing the complex temporal dynamics of cold storage load data.
5. Conclusions and Outlooks
In this study, we proposed MA-CFAN, a novel neural forecasting framework specifically designed for cold-storage load prediction. To address the inherently complex characteristics of cold-storage load data—including multi-scale structures, multi-periodicity, and rich high-frequency variations—we first constructed multi-resolution representations through hierarchical downsampling. Based on these representations, we introduced an adaptive period extraction mechanism together with a Compression-Fusion Attention (CFA) module, enabling effective modeling of multi-period dependencies while suppressing noise and redundant temporal patterns. To comprehensively evaluate the performance of MA-CFAN, we benchmarked it against a diverse suite of state-of-the-art baselines spanning Transformer-based, CNN-based, MLP-based, and RNN-based forecasting paradigms. Extensive experiments on a real-world cold-storage dataset yield the following conclusions:
MA-CFAN demonstrates superior forecasting performance across short-, medium-, and long-term horizons. In short-term forecasting (24 and 48 steps), MA-CFAN substantially reduces both MSE and MAE compared with traditional models such as LSTM. In medium- and long-term forecasting (96 and 192 steps), MA-CFAN further outperforms the second-best baseline with notably lower MSE and MAE, confirming its robustness and stability over longer prediction spans.
Ablation studies strongly validate the effectiveness of CFA and the multi-scale multi-period strategy. Removing either component leads to consistent performance drops, highlighting their crucial roles in capturing complex multi-period dependencies and extracting discriminative temporal structures from noisy cold-storage load sequences.
MA-CFAN provides a reliable and powerful forecasting framework for cold-storage clusters, offering improved accuracy, enhanced robustness, and stronger adaptability to multi-period, multi-frequency temporal dynamics. These advantages make MA-CFAN well-suited for practical deployment in cold-storage scheduling, energy management, and grid-interactive demand response applications.
Despite the strong performance of MA-CFAN, this study still has several limitations. First, the model is trained and evaluated on a specific cold storage dataset, and its generalizability to other industrial load scenarios remains to be validated. Second, although the proposed CFA effectively captures multi-scale and multi-period patterns, it introduces additional computational overhead compared with purely linear models. Future work will focus on validating MA-CFAN across multiple cold storage facilities and other industrial load types to further assess generalizability. To reduce computational complexity, strategies such as attention pruning, knowledge distillation, and lightweight approximation of CFA will be explored. In addition, interpreting cross-period interactions through attention visualization and frequency-weight analysis may provide actionable insights for operators and energy managers, supporting more transparent and informed decision-making.