Article

A Patch-Based State-Space Hybrid Network for Container Resource Usage Forecasting

1 Ultra-High Voltage Company, State Grid Ningxia Electric Power Company Ltd., Yinchuan 750001, China
2 School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an 710049, China
3 School of Electronic and Information Engineering, Xi’an Technological University, Xi’an 710021, China
* Author to whom correspondence should be addressed.
Algorithms 2026, 19(2), 148; https://doi.org/10.3390/a19020148
Submission received: 31 December 2025 / Revised: 6 February 2026 / Accepted: 10 February 2026 / Published: 11 February 2026
(This article belongs to the Special Issue AI and Computational Methods in Engineering and Science: 2nd Edition)

Abstract

Accurate forecasting of container resource usage is crucial for efficient resource scheduling and ensuring Quality of Service (QoS) in cloud data centers. The inherent complexity of container workloads, characterized by strong temporal dependencies, multivariate correlations, and non-stationarity, challenges existing forecasting models, which often fail to capture both fine-grained local patterns and global trends efficiently. To address this gap, this paper proposes a novel Patch-based State-space Hybrid Network (PSH). PSH features a dual-branch architecture: a Local Transformer Path that models complex short-range dependencies and a Global Mamba Path that leverages a linear-complexity State-Space Model (SSM) to capture long-range dependencies efficiently. An initial patching mechanism shortens the input sequence, lowering computational overhead and enabling efficient feature processing, while a cross-attention fusion module integrates the representations of the two branches. The fusion module enables bidirectional interaction between the paths: global context from the Global Mamba Path refines local features from the Local Transformer Path, balancing the model’s ability to capture both local patterns and global trends while maintaining high computational efficiency. Extensive experiments on the large-scale, real-world Alibaba Cluster Traces 2018 dataset demonstrate that PSH significantly outperforms existing state-of-the-art forecasting models in terms of accuracy and robustness.

1. Introduction

Cloud-native systems have become the backbone of modern distributed architectures, with containerized applications dominating the deployment of services ranging from microservices to large-scale data processing [1]. In such systems, efficient resource scheduling, load balancing, and anomaly detection heavily rely on accurate container resource usage forecasting—a task that directly impacts resource utilization efficiency and Quality of Service (QoS) compliance [2]. For instance, cloud service providers need to avoid resource waste while meeting user demands, which requires precise prediction of future workloads to adjust resource allocation dynamically [3]. However, container resource time series exhibit inherent complexities that pose significant challenges to forecasting models: (1) temporal dependencies spanning short-term fluctuations and long-term trends; (2) multivariate correlations between interrelated metrics; and (3) non-stationarity caused by dynamic workload changes [4].
Traditional forecasting methods struggle to address these complexities. Statistical models like Autoregressive Integrated Moving Average (ARIMA) [5] excel at capturing linear trends but fail to model non-linear patterns in container workloads. Machine learning approaches such as Support Vector Regression (SVR) [6] and Back Propagation Neural Networks (BPNNs) [7] improve non-linear fitting but lack efficiency in handling long-sequence dependencies. Deep learning has emerged as a promising direction: Long Short-Term Memory (LSTM)/Gated Recurrent Unit (GRU)-based models [8] capture temporal dynamics but struggle with long-range trends; Transformer-derived models (e.g., Autoformer [9], Informer [10]) reduce computational complexity via sparse attention but prioritize global patterns at the cost of local feature granularity; hybrid models like PW-GAN-GP [4] and Transformation-Encoding-Attention (TEA) [11] enhance anomaly detection but are not optimized for multi-step resource forecasting, leading to suboptimal accuracy in real-world cloud scenarios.
To bridge these gaps, this paper proposes a Patch-based State-space Hybrid Network (PSH) for multivariate container resource usage forecasting. The core design philosophy of PSH is to synergistically capture local and global temporal dependencies through a dual-branch architecture: (1) A Local Transformer Path leverages multi-head self-attention to model short- to mid-range patterns. (2) A Global Mamba Path uses a State-Space Model (SSM), specifically the Mamba architecture [12], to efficiently learn long-term trends with linear scaling. A Cross-Branch Fusion module with cross-attention enables adaptive interaction between local and global representations, avoiding the one-sided bias of single-branch models. Additionally, we design a comprehensive data preprocessing pipeline, including adaptive resampling and stratified sampling to handle the heterogeneity and irregularity of the Alibaba Cluster Trace 2018 dataset [13], ensuring high-quality input for the model.
The main contributions of this work are summarized as follows:
  • This paper proposes a dual-branch hybrid architecture that integrates Transformer and Mamba, enabling simultaneous modeling of local fine-grained patterns and global long-range trends in container resource time series.
  • The design of a Patch-based Encoder for reducing the quadratic complexity of self-attention, along with a Cross-Branch Fusion module that realizes adaptive information interaction between dual branches to balance accuracy and efficiency.
  • For the Alibaba Cluster Trace 2018 dataset, this study develops a tailored preprocessing pipeline: it addresses key challenges like irregular sampling intervals and massive data volume, and generates a representative subset to facilitate efficient model prototyping.
  • We conduct extensive experiments on the large-scale, real-world Alibaba Cluster Traces 2018 dataset, demonstrating that PSH significantly outperforms existing state-of-the-art forecasting models, achieving the lowest CPU MAE of 0.0931 and a near-perfect memory $R^2$ of 0.9957.
The remainder of this paper is organized as follows: Section 2 reviews related work on time series forecasting for cloud resources. Section 3 conducts problem analysis and formulation, while Section 4 elaborates on the data preprocessing pipeline, the detailed architecture of the PSH model, and the training and optimization strategies. Section 5 presents the experimental setup, comprehensive comparison results with baseline models, and exhaustive ablation studies. Specifically, Section 5 also includes a parameter sensitivity analysis and a rigorous computational efficiency analysis to evaluate the practical feasibility of our model. Section 6 concludes the paper and discusses future research directions.

2. Related Work

Container resource usage forecasting is a well-studied topic in cloud computing, with research evolving from traditional statistical methods to advanced deep learning architectures [14]. This section categorizes and reviews existing work, highlighting their strengths and limitations, and clarifies the positioning of our proposed PSH model.

2.1. Traditional Statistical and Machine Learning Methods

Early forecasting primarily relied on statistical models and classical machine learning. ARIMA [5] is a foundational statistical method effective for linear trends but struggles with the non-linear, non-stationary patterns common in container workloads. To address non-linearity, machine learning algorithms such as SVR were adopted, often enhanced with optimization techniques like Particle Swarm Optimization (PSO) to improve prediction accuracy for cloud loads [3]. Despite these improvements, such methods depend on manual feature engineering and are less effective at automatically capturing long-range temporal dependencies from raw time series data.

2.2. Single-Branch Deep Learning Methods

With the development of deep learning, single-branch architectures that can automatically learn hierarchical temporal features have become mainstream. These are primarily divided into Recurrent Neural Networks (RNN)-based and Transformer-based approaches.

2.2.1. RNN-Based Models

RNNs, particularly LSTM [8] and GRU, became widely used for modeling temporal sequences. Recent works continue to leverage these architectures for their effectiveness. For instance, Dogani et al. [1] combined Discrete Wavelet Transformation (DWT) with a Bidirectional GRU to better handle non-stationary data by decomposing the workload into sub-bands. Similarly, Bi et al. [2] designed an integrated deep learning method using bidirectional and grid LSTMs, achieving high-quality predictions on large-scale cloud cluster traces. However, a comprehensive evaluation by Lackinger et al. [15] highlights that while effective, the inherently sequential computation of RNNs creates bottlenecks for long sequences and poses challenges for capturing very long-range dependencies efficiently.

2.2.2. Transformer-Based Models

Transformer architectures, with their self-attention mechanism, overcome the sequential limitations of RNNs and enable parallel processing. To address the quadratic complexity of self-attention on long sequences, more efficient variants were developed. Autoformer [9] introduced an Auto-Correlation mechanism, while Informer [10] proposed a ProbSparse self-attention mechanism, both significantly improving efficiency for long-term forecasting. A comprehensive survey by Wen et al. [16] details the extensive impact and evolution of Transformers on time series analysis. Despite their success, these models prioritize global, long-range patterns, sometimes at the cost of capturing local, fine-grained details [4]. More recently, PatchTST [17] introduced a patching mechanism that treats segments of a time series as tokens, which not only reduces complexity but also improves the model’s ability to capture local semantics. However, similar to other single-branch Transformer architectures, it may still prioritize either local or global features based on patch granularity. Our proposed PSH addresses this by introducing a dual-branch Mamba–Transformer hybrid to explicitly capture multi-scale dependencies.

2.3. Hybrid Deep Learning Models

To leverage complementary strengths, hybrid architectures have become a prominent research direction. A common and effective approach is to combine Convolutional Neural Networks (CNNs) for local feature extraction with LSTMs for temporal modeling, a strategy that has demonstrated strong performance for virtual machine workload forecasting [18]. Other recent approaches have focused on creating robust frameworks for more specific tasks like anomaly detection. For example, Qi et al. [4] proposed PW-GAN-GP, a GAN-based predictive framework for anomaly detection in cloud data centers. In a similar vein, Zhang et al. [11] designed the TEA framework for anomaly detection in IoT environments. While powerful, these models are often highly specialized for a particular task (e.g., anomaly detection) or, like those developed for complex, geography-aware task scheduling [19], are not optimized for general-purpose resource forecasting. Many existing hybrids still lack a unified architecture that dynamically balances local and global feature extraction.
It is precisely this gap that our proposed PSH is designed to fill. Our model’s dual-branch architecture explicitly separates these concerns. The Local Transformer Path operates on patches to capture complex, short-range patterns with high fidelity. In parallel, the Global Mamba Path, leveraging a linear-complexity State-Space Model, efficiently captures long-range dependencies and overarching trends. By fusing the representations from these two specialized paths, PSH creates a comprehensive feature set that integrates both local and global context, leading to more accurate and robust forecasts.
Our PSH model addresses these limitations by (1) using a dual-branch architecture (Transformer + Mamba) to simultaneously capture local and global dependencies; (2) employing a Patch-based Encoder to reduce computational complexity; and (3) designing a Cross-Branch Fusion module to enable adaptive information interaction, ensuring balanced performance in both short-term and long-term forecasting.

3. Problem Analysis and Formulation

3.1. Problem Analysis

Container resource usage forecasting is critical for efficient resource scheduling, load balancing, and anomaly detection in cloud-native systems, where containerized applications dominate modern distributed architectures. However, container resource time series—including metrics like CPU utilization, memory utilization, and network throughput—exhibit complex characteristics: they involve both short-term fluctuations and long-term trends (temporal dependencies), interactions between multiple correlated metrics (multivariate correlations), and dynamic changes due to varying workloads (non-stationarity). These traits pose significant challenges for accurate forecasting.
These characteristics impose specific requirements on forecasting models: short-term fluctuations demand high-resolution local pattern capture, long-term trends require efficient modeling of global dependencies over extended horizons, multivariate correlations necessitate joint representation learning across metrics, and non-stationarity calls for robustness to distribution shifts. Existing single-branch architectures like pure Transformer or LSTM fail to satisfy all these requirements simultaneously, which motivates a hybrid design that decouples local and global modeling.

3.2. Problem Formulation

Container resource forecasting is formalized as a multivariate sequence-to-sequence prediction task with a fixed input–output window. This formulation is chosen for several reasons: it naturally accommodates multivariate inputs (such as CPU, memory, and network metrics) and multiple target outputs; its sliding-window mechanism aggregates temporal context while supporting multi-step-ahead prediction, a capability essential for proactive resource scheduling; and it provides a standardized interface compatible with deep learning backbones such as the Transformer and SSM.
This task is framed as a multivariate time series forecasting task with a sliding window mechanism. Let $\mathcal{T} = \{t_1, t_2, \ldots, t_N\}$ denote a sequence of timestamps for a container, where $N$ is the total number of time steps. At each timestamp $t_i$, the container’s state is represented by a feature vector $\mathbf{x}_{t_i} \in \mathbb{R}^F$, with $F$ being the number of input features (e.g., resource utilization, resource requests, and limits). The target variables to forecast are $M$ key resource metrics (a subset of the input features), denoted as $\mathcal{Y} = \{y_1, y_2, \ldots, y_M\}$ (e.g., CPU utilization and memory utilization).
To capture temporal dependencies, we use a sliding window of length $L_{\text{in}}$ (input sequence length) to extract historical information, aiming to predict the next $L_{\text{out}}$ time steps (output sequence length) of the target variables. Specifically, given an input window $X = [\mathbf{x}_{t-L_{\text{in}}+1}, \ldots, \mathbf{x}_t] \in \mathbb{R}^{L_{\text{in}} \times F}$ (ending at time $t$), the goal is to learn a mapping function $f$ that outputs the predicted target sequence $\hat{Y} = f(X) \in \mathbb{R}^{L_{\text{out}} \times M}$, where $\hat{Y}$ should be as close as possible to the true sequence $Y = [\mathbf{y}_{t+1}, \ldots, \mathbf{y}_{t+L_{\text{out}}}] \in \mathbb{R}^{L_{\text{out}} \times M}$ (with $\mathbf{y}_{t+k}$ representing the target vector at time $t+k$).
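The windowed $(X, Y)$ pair construction described above can be sketched in a few lines of NumPy. This is a simplified illustration; `make_windows` and its default arguments are hypothetical names, not the paper’s actual code:

```python
import numpy as np

def make_windows(series, l_in=96, l_out=12, target_cols=(0, 1)):
    """Slice a (N, F) multivariate series into (X, Y) forecasting pairs.

    X has shape (n_windows, l_in, F); Y holds the next l_out steps of the
    target columns only, shape (n_windows, l_out, len(target_cols)).
    """
    n = series.shape[0] - l_in - l_out + 1
    X = np.stack([series[i : i + l_in] for i in range(n)])
    Y = np.stack([series[i + l_in : i + l_in + l_out][:, list(target_cols)]
                  for i in range(n)])
    return X, Y

rng = np.random.default_rng(0)
series = rng.random((200, 10))   # e.g. 200 time steps, 10 features
X, Y = make_windows(series)
# X.shape == (93, 96, 10), Y.shape == (93, 12, 2)
```

With $N = 200$, $L_{\text{in}} = 96$, and $L_{\text{out}} = 12$, this yields $200 - 96 - 12 + 1 = 93$ training pairs.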

4. Methodology

4.1. Data Preprocessing

We preprocess the Alibaba Cluster Trace 2018 dataset [13] through a tailored pipeline to address data heterogeneity, irregular sampling, and scale, transforming it into a high-quality time series format suitable for forecasting.
The pipeline begins with cleaning two primary files: container_meta.csv, which contains 370,540 container records, and container_usage.csv—a much larger file with approximately 4 billion rows and a size of 164 GB. From the metadata, we retain only containers with a “started” status and a lifespan of at least one hour. In the usage data, we remove records missing any of the four key metrics, namely CPU utilization, memory utilization, network-in, and network-out, and discard entire containers that lack complete coverage of these metrics. Next, we merge the two datasets using a composite key consisting of container_id, machine_id, and timestamp. This yields a unified table with ten features: the four utilization metrics, the three identifiers, and three allocation-related features, specifically cpu_request, cpu_limit, and mem_size, which reflect user-defined resource configurations submitted at deployment time.
To ensure computational feasibility when working with this massive industrial trace (approximately 270 GB and 4 billion raw records), we construct a high-quality, representative subset roughly one-tenth the size of the full dataset. This subset-selection strategy accelerates training and hyperparameter search without discarding the heterogeneity of the real-world data, and no synthetic data are used. Specifically, all 370,540 containers are first categorized into distinct workload strata based on their average CPU and memory utilization patterns. Within each stratum, we identify high-quality containers, defined as those with at least 108 valid, contiguous time steps, so that the model can learn robust temporal dependencies. Finally, stratified sampling selects a proportional number of such containers from each stratum. Because the sampling ratio matches that of the original filtered population, the subset preserves the original distribution of workload types (from stable online services to highly volatile batch jobs), ensuring that PSH is evaluated on a true-to-life representation of the Alibaba production environment.
All numerical features are clipped to the 1st and 99th percentiles to mitigate the impact of outliers, downcast to minimal data types for memory efficiency, and saved in Apache Parquet format to enable fast I/O during training.
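A minimal NumPy sketch of the percentile clipping and downcasting step. `clip_and_downcast` is an illustrative helper, not the authors’ pipeline, and the Parquet write is omitted:

```python
import numpy as np

def clip_and_downcast(col, lo_pct=1.0, hi_pct=99.0):
    """Winsorize a feature column at the 1st/99th percentiles, then
    downcast to float32 to roughly halve memory before writing Parquet."""
    lo, hi = np.percentile(col, [lo_pct, hi_pct])
    return np.clip(col, lo, hi).astype(np.float32)

rng = np.random.default_rng(0)
cpu_util = np.concatenate([rng.random(1000), [10.0, -5.0]])  # two extreme outliers
clean = clip_and_downcast(cpu_util)
```

After clipping, both injected outliers are pulled inside the 1st–99th percentile band, so downstream standardization is no longer dominated by them.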

4.2. PSH Model

To address the challenges of forecasting multivariate container resource usage, we propose the PSH. The overall architecture of our model is illustrated in Figure 1. The design is predicated on a dual-branch philosophy to synergistically capture both local and global temporal dependencies from the input data.
The model begins with a Patching Encoder, which segments the input time series of length $L_{\text{in}}$ into a sequence of non-overlapping patches. This crucial step reduces the temporal dimension to a much shorter length $L'$, mitigating the quadratic complexity of subsequent attention mechanisms and creating a shared, down-sampled representation. This patched sequence is then processed in parallel by two specialized branches:
1. The Local Transformer Path utilizes multi-head self-attention to model short- to mid-range dependencies and capture fine-grained, local patterns within the data.
2. The Global Mamba Path employs an SSM to efficiently capture long-range dependencies and global trends across the entire sequence. While this branch scales linearly, the overall network cost is still dominated by the quadratic term of the attention mechanism in the Transformer path.
The representations from both paths are then integrated through a Cross-Branch Fusion module. This module uses cross-attention to allow the global context from the Mamba path to inform and refine the local features extracted by the Transformer path. Finally, the fused feature sequence is passed to a Multi-Step Head, which consists of a pooling layer and a multi-layer perceptron (MLP), to generate the final multi-step forecast $\hat{Y} \in \mathbb{R}^{L_{\text{out}} \times M}$. The following subsections provide a detailed description of each component.

4.2.1. Patching Encoder

Standard self-attention mechanisms exhibit a quadratic computational complexity of $O(L_{\text{in}}^2)$ with respect to the input sequence length $L_{\text{in}}$. To preserve salient temporal dynamics while mitigating this complexity, we employ a temporal patching mechanism via a 1D convolution. Specifically, by setting both the kernel size and stride to $p$, the operation aggregates $p$ adjacent time steps into a single patch and projects the $C$ input channels into a $d$-dimensional latent space.
Given the kernel weights $\{W^{(i)}\}_{i=1}^{p} \subset \mathbb{R}^{d \times C}$, a bias term $b \in \mathbb{R}^d$, and a non-linear activation $\phi$, the downsampled sequence of patch vectors $\{z_\tau^{(0)}\}_{\tau=1}^{L'}$ (where $L' = L_{\text{in}} / p$) is defined as:

$$z_\tau^{(0)} = \phi\left( \sum_{i=1}^{p} W^{(i)} x_{(\tau-1)p+i} + b \right), \quad \tau = 1, \ldots, L'.$$

This yields the unified patch sequence $Z^{(0)} = [z_1^{(0)}, \ldots, z_{L'}^{(0)}] \in \mathbb{R}^{L' \times d}$. For instance, with $L_{\text{in}} = 96$ and $p = 12$, the sequence length is reduced to $L' = 8$, effectively shrinking the quadratic attention term from $96^2$ to $8^2$. Beyond complexity reduction, patching encodes local window statistics into the hidden dimension $d$. This “temporal-to-channel rearrangement” allows subsequent layers to operate at a consistent temporal granularity, thereby mitigating potential mismatches during cross-path fusion.
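Because a 1D convolution whose kernel size equals its stride is just an independent linear map per patch, the encoder can be illustrated without a deep learning framework. The sketch below assumes ReLU for the unspecified activation $\phi$ and uses random weights:

```python
import numpy as np

rng = np.random.default_rng(0)
L_in, C, p, d = 96, 10, 12, 64            # sequence length, channels, patch, width
x = rng.standard_normal((L_in, C))
W = rng.standard_normal((p, d, C)) * 0.1  # one d x C kernel slice W^(i) per offset i
b = np.zeros(d)

# Conv1d with kernel_size == stride == p reduces to a per-patch linear map:
# split the series into L' = L_in // p non-overlapping patches of p steps each.
patches = x.reshape(L_in // p, p, C)              # (L', p, C)
z = np.einsum('lpc,pdc->ld', patches, W) + b      # sum_i W^(i) x_{(tau-1)p+i} + b
z = np.maximum(z, 0.0)                            # phi = ReLU (one possible choice)
# z.shape == (8, 64): attention cost shrinks from 96^2 to 8^2
```

The `einsum` sums over both the in-patch offset $i$ and the input channels, exactly matching the patch equation above.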

4.2.2. Local Transformer Path

The local path targets short- to mid-range interactions at the patch scale. The patch sequence $Z^{(0)} \in \mathbb{R}^{L' \times d}$ enters a Pre-LN Transformer with $n_f$ layers. Let the layer input be $Z \in \mathbb{R}^{L' \times d}$. LayerNorm $\mathrm{LN}(\cdot)$ standardizes each time step over channels; multi-head attention $\mathrm{MHA}$ uses $h$ heads, each with head dimension $d_h = d/h$; the two-layer feed-forward network $\mathrm{FFN}$ has intermediate width $d_{ff}$ with GELU and dropout. The update is

$$\tilde{Z} = \mathrm{LN}(Z), \quad H_a = \mathrm{MHA}(\tilde{Z}, \tilde{Z}, \tilde{Z}), \quad Z \leftarrow Z + \alpha_a \odot H_a, \quad Z \leftarrow Z + \alpha_f \odot \mathrm{FFN}(\mathrm{LN}(Z))$$

where $\alpha_a, \alpha_f \in \mathbb{R}^d$ are per-channel learnable residual scales that damp complex branches early and improve numerical stability. For each head,

$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left( \frac{Q K^\top}{\sqrt{d_h}} \right) V, \quad Q = Z W_Q, \; K = Z W_K, \; V = Z W_V$$

with $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ the projection matrices and the $\sqrt{d_h}$ scale stabilizing the softmax. Heads are concatenated and linearly projected back to dimension $d$. Expressively, MHA learns data-adaptive receptive fields for velocity changes, fine peaks/valleys, and local saturation; optimization-wise, Pre-LN with residual scaling controls gradients in deep stacks.
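A single attention head at the patch scale can be sketched as follows. This is illustrative NumPy only; the paper’s model uses $h$ heads with learned projections and Pre-LN residual scaling around this core:

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)   # subtract max for stability
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(Z, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the patch axis."""
    Q, K, V = Z @ W_q, Z @ W_k, Z @ W_v
    d_h = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d_h))       # (L', L') attention weights
    return A @ V, A

rng = np.random.default_rng(1)
L_p, d = 8, 16
Z = rng.standard_normal((L_p, d))
H, A = self_attention(Z, *(rng.standard_normal((d, d)) for _ in range(3)))
# each row of A is a probability distribution over the 8 patches
```

Each query patch attends over all $L' = 8$ patches, which is what gives the local path its data-adaptive receptive field.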

4.2.3. Global Mamba Path

The global path models long-range dependencies and slow trends via the linear-complexity SSM family, Mamba [12]. Unlike traditional SSMs, Mamba introduces a selective mechanism that adjusts its parameters based on the input sequence, enabling the model to focus on relevant information while discarding noise. This selection mechanism, implemented through time-varying state transitions, allows the model to capture the overarching trends in container workloads with much higher efficiency than standard attention-based models. With input $U \in \mathbb{R}^{L' \times d}$ (i.e., $Z^{(0)}$) and state $s_\tau \in \mathbb{R}^{d}$, a selective SSM recursion is

$$s_{\tau+1} = A_\tau s_\tau + B_\tau u_\tau, \quad g_\tau = C_\tau s_\tau$$

where $A_\tau, B_\tau, C_\tau$ are input-modulated system and projection matrices, and $g_\tau$ is the filtered output per step. Equivalently, in the convolutional view, $g = k * u$ with a learned kernel generator. The implementation uses Pre-LN, dropout, and residual scaling:

$$G = U + \alpha_m \odot \mathrm{Drop}(\mathrm{Mamba}(\mathrm{LN}(U)))$$

where $\alpha_m \in \mathbb{R}^d$ is a per-channel scale and $\mathrm{Drop}(\cdot)$ operates with a fixed dropout probability. Unlike attention’s $O(L'^2)$ cost, Equation (4) scales linearly with $L'$; after patching, each $u_\tau$ summarizes $p$ steps, enabling robust estimation of periodic/trend structure at low resolution, complementary to the local path.
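The recursion above can be illustrated with a toy diagonal (elementwise) state transition. Real Mamba generates $A_\tau, B_\tau, C_\tau$ from the input and runs an optimized parallel scan, so this per-step Python loop is purely didactic:

```python
import numpy as np

def selective_ssm(U, A, B, C):
    """Toy linear SSM scan following s_{t+1} = A_t s_t + B_t u_t, g_t = C_t s_t.

    A, B, C are given per step (and elementwise/diagonal here) to mimic
    input-modulated dynamics; cost is O(L'), not O(L'^2)."""
    L, d = U.shape
    s = np.zeros(d)
    G = np.empty_like(U)
    for t in range(L):
        G[t] = C[t] * s                # read-out of the current state
        s = A[t] * s + B[t] * U[t]     # elementwise (diagonal) transition
    return G

rng = np.random.default_rng(2)
L_p, d = 8, 16
U = rng.standard_normal((L_p, d))
A = rng.uniform(0.8, 0.99, (L_p, d))   # stable decay per step and channel
B = rng.standard_normal((L_p, d))
C = rng.standard_normal((L_p, d))
G = selective_ssm(U, A, B, C)
```

With the zero initial state, the first output is zero by construction, and each later output mixes an exponentially decaying summary of all earlier patches.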

4.2.4. Cross-Branch Fusion

Both paths operate at the same patch length $L'$ and hidden width $d$, so fusion needs no resampling. We apply a single cross-attention to achieve “local dominance, global conditioning”: with the local features $L \in \mathbb{R}^{L' \times d}$ as queries and the global features $G \in \mathbb{R}^{L' \times d}$ as keys/values, we first apply LayerNorm to each and linearly project to $\mathbb{R}^d$:

$$Q = \mathrm{LN}(L) W_Q, \quad K = \mathrm{LN}(G) W_K, \quad V = \mathrm{LN}(G) W_V$$

where $W_Q, W_K, W_V \in \mathbb{R}^{d \times d}$ mirror Equation (3) but are distinct parameters. The cross-attention output is

$$F = L + \alpha_c \odot \mathrm{softmax}\left( \frac{Q K^\top}{\sqrt{d_h}} \right) V$$

where $\alpha_c \in \mathbb{R}^d$ is a per-channel residual scale and the softmax is normalized along the key length $L'$. Equation (7) lets the global path’s slow background modulate local judgments at each patch with varying strengths, preventing overshoot/undershoot during trend phases. A single fusion suffices for this task’s statistics and notably reduces peak memory and numerical uncertainty compared to multi-round alternations.
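Under the same simplified single-head setting, the fusion step might look like the sketch below; the per-channel residual scale `alpha` is fixed here rather than learned:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    sd = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sd + eps)

def softmax(a):
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def cross_branch_fusion(L_feat, G_feat, W_q, W_k, W_v, alpha):
    """Local features query the global (Mamba) features; the residual keeps
    local dominance while alpha gates how much global context flows in."""
    Q = layernorm(L_feat) @ W_q
    K = layernorm(G_feat) @ W_k
    V = layernorm(G_feat) @ W_v
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return L_feat + alpha * (A @ V)

rng = np.random.default_rng(3)
Lp, d = 8, 16
L_feat, G_feat = rng.standard_normal((2, Lp, d))
Ws = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3)]
alpha = np.full(d, 0.1)                  # small per-channel residual scale
F = cross_branch_fusion(L_feat, G_feat, *Ws, alpha)
```

Setting `alpha` to zero recovers the local branch unchanged, which is exactly the “local dominance” property of the residual form.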

4.2.5. Multi-Step Head

The fused sequence $F \in \mathbb{R}^{L' \times d}$ enters a lightweight joint regression head. We apply LayerNorm to unify channel scales, then temporal adaptive average pooling $\mathrm{Pool}(\cdot)$ to aggregate over the length $L'$ into $v \in \mathbb{R}^d$; finally, a two-layer MLP maps to $L_{\text{out}} \cdot T$ outputs and reshapes to $\hat{Y} \in \mathbb{R}^{L_{\text{out}} \times T}$:

$$v = \mathrm{Pool}\big(\mathrm{LN}(F)\big), \quad \mathrm{vec}(\hat{Y}) = \mathrm{GELU}(v W_1 + b_1) W_2 + b_2$$

where $W_1 \in \mathbb{R}^{d \times d_{ff}}$ and $W_2 \in \mathbb{R}^{d_{ff} \times (L_{\text{out}} T)}$ are the dense weights, and $b_1 \in \mathbb{R}^{d_{ff}}$ and $b_2 \in \mathbb{R}^{L_{\text{out}} T}$ are biases; $\mathrm{vec}(\cdot)$ flattens before reshaping to $L_{\text{out}} \times T$. The output uses no compressive activation, matching the implementation as real-valued regression. Pooling shifts future-window evidence integration upstream to the backbone, reducing head-level overfitting; the columns of $W_2$ provide distinct linear decoders for each step and target.
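A sketch of this head under assumed toy dimensions; the tanh approximation of GELU is one common choice, and the weights are random rather than learned:

```python
import numpy as np

def layernorm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def gelu(x):  # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def multi_step_head(F, W1, b1, W2, b2, l_out, n_targets):
    """Pool the fused patch sequence to one vector, then decode all
    l_out * n_targets values jointly (no output activation: plain regression)."""
    v = layernorm(F).mean(axis=0)        # average pool over the L' patch axis
    h = gelu(v @ W1 + b1)                # (d_ff,)
    return (h @ W2 + b2).reshape(l_out, n_targets)

rng = np.random.default_rng(4)
Lp, d, d_ff, l_out, T = 8, 16, 32, 12, 2
F = rng.standard_normal((Lp, d))
Y_hat = multi_step_head(F, rng.standard_normal((d, d_ff)), np.zeros(d_ff),
                        rng.standard_normal((d_ff, l_out * T)),
                        np.zeros(l_out * T), l_out, T)
# Y_hat.shape == (12, 2): 12 future steps for 2 targets, decoded jointly
```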

4.3. Training and Optimization

The training procedure is based on the empirical risk minimization principle. Let $\Theta$ denote the set of all learnable parameters, including convolutional/attention projections, feed-forward layers, Mamba SSM kernels, and head weights and biases. Let $\mathcal{D}$ be the training dataset, built through sliding-window operations. The per-time-step, per-target $\ell_1$ (Mean Absolute Error, MAE) loss function is given by:

$$\mathcal{L}(\Theta) = \frac{1}{|\mathcal{D}|} \sum_{(X, Y) \in \mathcal{D}} \frac{1}{L_{\text{out}} T} \sum_{t=1}^{L_{\text{out}}} \sum_{j=1}^{T} \left| \hat{y}_{t,j}(X; \Theta) - y_{t,j} \right|$$

where $\hat{y}_{t,j}(X; \Theta)$ is the predicted value for the $j$-th target at the $t$-th time step given input $X$ and parameters $\Theta$, $y_{t,j}$ is the corresponding ground-truth value, $L_{\text{out}}$ is the length of the output sequence, and $T$ is the number of target variables.
Because real-world systems exhibit spikes and mild label noise, the $\ell_1$ loss is more robust than the $\ell_2$ (Mean Squared Error, MSE) loss. To avoid statistical leakage, all features are standardized using the mean $\mu_c$ and standard deviation $\sigma_c$ calculated only from the training data, together with a small numerical stabilizer $\varepsilon > 0$. The standardization is defined as:

$$x'_{t,c} = \frac{x_{t,c} - \mu_c}{\sigma_c + \varepsilon}, \quad 1 \le t \le L_{\text{in}}, \; 1 \le c \le C$$

where $x_{t,c}$ is the original value of the $c$-th feature at the $t$-th time step, $x'_{t,c}$ is the standardized value, $L_{\text{in}}$ is the length of the input sequence, and $C$ is the number of feature variables. The term $\varepsilon$ prevents the numerical amplification of features with small variances. The statistics $(\mu_c, \sigma_c)$ are computed on the training set and then frozen during validation and testing. The standardized time series are then divided into windowed pairs $(X, Y)$ with fixed $L_{\text{in}}$ (input window length) and $L_{\text{out}}$ (output window length), ensuring that the data distributions in training and inference are aligned.
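The leakage-free standardization can be sketched as follows: statistics are fit on the training split only and then applied frozen to every split:

```python
import numpy as np

def fit_scaler(train):
    """Compute per-feature mean/std on the training split only."""
    return train.mean(axis=0), train.std(axis=0)

def transform(x, mu, sigma, eps=1e-8):
    return (x - mu) / (sigma + eps)   # eps guards near-constant features

rng = np.random.default_rng(5)
train = rng.standard_normal((800, 10)) * 5 + 2   # train split, its own scale
test = rng.standard_normal((200, 10))            # test split, never touched by fit
mu, sigma = fit_scaler(train)                    # frozen after fitting
train_s = transform(train, mu, sigma)
test_s = transform(test, mu, sigma)
```

Reusing `(mu, sigma)` for validation and test is what keeps the train and inference distributions aligned without leaking future statistics.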
Training is carried out by minimizing the $\ell_1$ joint loss defined previously. The AdamW optimizer is employed with a learning rate schedule that combines linear warm-up and cosine decay. The learning rate $\eta(e)$ at epoch $e$ is formulated as follows:

$$\eta(e) = \begin{cases} \eta_0 \cdot \dfrac{e}{E_{\text{warm}}}, & 1 \le e \le E_{\text{warm}}, \\ \dfrac{\eta_0}{2} \left[ 1 + \cos\left( \pi \cdot \dfrac{e - E_{\text{warm}}}{E - E_{\text{warm}}} \right) \right], & E_{\text{warm}} < e \le E, \end{cases}$$

where $\eta_0$ denotes the peak learning rate, $E$ represents the total number of epochs, and $E_{\text{warm}}$ is the number of warm-up epochs.
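The schedule reduces to a small function; the default values for $\eta_0$, $E_{\text{warm}}$, and $E$ below are hypothetical, not the paper’s settings:

```python
import math

def lr_schedule(e, eta0=1e-3, e_warm=5, e_total=50):
    """Linear warm-up for the first e_warm epochs, cosine decay afterwards."""
    if e <= e_warm:
        return eta0 * e / e_warm
    return eta0 * (1 + math.cos(math.pi * (e - e_warm) / (e_total - e_warm))) / 2

# ramps 0 -> eta0 over warm-up, then decays to ~0 at the final epoch
```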
To mitigate overfitting, weight decay is utilized. Additionally, Automatic Mixed Precision (AMP) and TF32 are enabled. Global gradient-norm clipping is performed: if the L 2 -norm of the gradient Θ L exceeds a threshold τ , the gradients are scaled to keep them within bounds, preventing outlier batches from disrupting the optimization process.
Exponential Moving Average (EMA) is maintained in the parameter space to smooth noise and stabilize evaluation. Given the weights $\Theta_e$ after epoch $e$ and an EMA momentum $\beta$, the EMA update rule is:

$$\bar{\Theta}_e = \beta \bar{\Theta}_{e-1} + (1 - \beta) \Theta_e.$$
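Applied per parameter tensor, the update rule reduces to a one-liner; the scalar example and $\beta = 0.9$ below are toy values for illustration:

```python
def ema_update(ema, params, beta=0.999):
    """One EMA step per parameter tensor: ema <- beta*ema + (1-beta)*params."""
    return {k: beta * ema[k] + (1 - beta) * params[k] for k in ema}

ema = {"w": 0.0}
for epoch_w in [1.0, 1.0, 1.0]:      # constant weights: EMA creeps toward 1
    ema = ema_update(ema, {"w": epoch_w}, beta=0.9)
# after 3 steps the EMA reaches 1 - 0.9**3 = 0.271
```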
For validation and test forward passes, $\bar{\Theta}_e$ is preferred to reduce evaluation variance. The overall procedure of PSH is shown in Algorithm 1.
Algorithm 1 Patch-based State-space Hybrid Network (PSH).
Require: Data $\mathcal{P}$, features $F$, targets $T$; windows $(L_{\text{in}}, L_{\text{out}})$; patch size $p$; epochs $E$; learning rate $\eta$; patience $P$
Ensure: Best checkpoint selected by validation average MAE
1: Load & Clean: dataframe $D \leftarrow \mathrm{ReadParquet}(\mathcal{P})$; drop NaN/Inf
2: Group-wise Split: by container id $\Rightarrow$ Train/Val/Test (80/10/10)
3: Standardize: fit scaler on Train only; transform Train/Val/Test
4: Windows: build pairs $(x, y)$ with $x = X[i : i + L_{\text{in}}]$, $y = X[i + L_{\text{in}} : i + L_{\text{in}} + L_{\text{out}}, T]$
5: Model (PSH-Net): Patch($p$) → Local-Transformer ‖ Global-Mamba → Cross-Attn Mixer → LN + Pool + MLP
6: Optimize: AdamW($\eta$), warm-up + cosine, AMP, gradient clipping; optional EMA
7: for $e = 1, \ldots, E$ do
8:   Train: $\hat{Y} \leftarrow \mathrm{PSHNet}(x)$; $\mathcal{L} = \| \hat{Y} - y \|_1$; update parameters
9:   Validate: compute per-target MAE/RMSE/$R^2$ and averages; save best by average MAE
10:  if no improvement for $P$ epochs then break
11:  end if
12: end for

5. Experiments

In this section, we first present the experimental setup, including descriptions of datasets, evaluation metrics, and model configuration. Next, the comparison methods selected as baselines for performance validation are introduced. Subsequently, the contrasting experimental results of our proposed model and the aforementioned comparison methods on the target datasets are discussed. Furthermore, we carry out ablation studies to analyze the impact of the model’s intrinsic structures and configuration parameters on its overall performance.

5.1. Experimental Setup

5.1.1. Datasets

To evaluate and analyze the performance of our proposed approach, we conduct comprehensive experiments on a real-world dataset, Alibaba Cluster Traces 2018.
Alibaba Cluster Traces 2018 [13] is a public cluster tracing dataset released by Alibaba, which assists researchers in understanding the characteristics and workloads of modern Internet data centers. It contains operational data of 4034 servers over an 8-day period (approximately 270 GB in total) and consists of six files, including machine_meta.csv and machine_usage.csv. These files record the meta-information and resource usage of machines/containers, as well as the instance and task information of batch workloads (the task_name field in batch_task.csv encodes the Directed Acyclic Graph (DAG) structure of tasks). Data are sampled every 60 s and recorded as the average value over 300 s windows; the dataset’s core purpose is to support the study of mixed deployment of online and offline tasks in large-scale data centers and the optimization of coordinated scheduling schemes. Note that the experiments in this paper use a preprocessed version of the original dataset; the specific processing steps (including outlier removal, data standardization, and feature selection) are detailed in Section 4.1.

5.1.2. Evaluation Metrics

To comprehensively evaluate the model's predictive performance, we employ three standard metrics: MAE, Root Mean Squared Error (RMSE), and the Coefficient of Determination (R²). Let $y_i$ denote the ground-truth value, $\hat{y}_i$ the predicted value, $\bar{y}$ the mean of the true values, and $n$ the total number of samples.
MAE provides a direct interpretation of the average error magnitude by calculating the average absolute difference between predicted and actual values, as defined by:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

RMSE measures the square root of the average of squared errors; it penalizes large deviations more heavily and is thus particularly sensitive to significant errors. Its formula is:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}$$

To assess how well the model explains the data's variability, we use R², which measures the proportion of the variance in the target variable that is predictable from the model:

$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2}{\sum_{i=1}^{n} \left( y_i - \bar{y} \right)^2}$$

For MAE and RMSE, values closer to zero indicate higher accuracy, while an R² approaching 1 signifies a superior model fit that explains a large portion of the data's variance.
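The three metrics can be computed directly from prediction arrays; the following NumPy sketch mirrors the formulas above (the sample values are illustrative only):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean Absolute Error: average magnitude of the errors."""
    return np.mean(np.abs(y_true - y_pred))

def rmse(y_true, y_pred):
    """Root Mean Squared Error: penalizes large deviations more heavily."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

def r2(y_true, y_pred):
    """Coefficient of determination: share of variance explained."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

y = np.array([1.0, 2.0, 3.0, 4.0])
yhat = np.array([1.1, 1.9, 3.2, 3.8])
print(mae(y, yhat), rmse(y, yhat), r2(y, yhat))  # ≈ 0.15, ≈ 0.158, ≈ 0.98
```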

5.1.3. Configuration

All experiments were conducted on a server equipped with an Intel(R) Xeon(R) Gold 6348 CPU and a single NVIDIA L40 GPU (48 GB VRAM), using the PyTorch framework (version 2.4.1) under a CUDA 12.8 environment. We employed automatic mixed precision (AMP), gradient clipping, and an exponential moving average (EMA) of the weights to enhance training stability and efficiency. The dataset was split by container_id into training, validation, and test sets (80%/10%/10%) and standardized using statistics computed only on the training data. The model uses an input sequence of 96 steps to predict an output of 12 steps. We minimized the L1 loss using the AdamW optimizer, combined with a cosine annealing learning rate schedule with warm-up and an early stopping strategy. Key hyperparameters are detailed in Table 1.
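To make the scheduling logic concrete, the helper below sketches a cosine-annealed learning rate with linear warm-up, as used in training; the warm-up length, total step count, and minimum rate are illustrative assumptions, not the paper's exact settings:

```python
import math

def lr_at(step, total_steps, base_lr=1.5e-4, warmup_steps=500, min_lr=0.0):
    """Cosine-annealed learning rate with linear warm-up.

    Ramps linearly from ~0 to base_lr over warmup_steps, then follows a
    half-cosine decay from base_lr down to min_lr over the remaining steps.
    """
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

total = 10_000
print(lr_at(0, total))        # small rate early in warm-up
print(lr_at(499, total))      # reaches base_lr at the end of warm-up
print(lr_at(total, total))    # decays to min_lr at the end of training
```

In practice the same shape is available via PyTorch's built-in schedulers; this standalone version just makes the warm-up/decay behavior explicit.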

5.2. Comparison Methods

To verify the effectiveness of our approach, we select several representative methods as baselines for comparison.

5.2.1. ARIMA

ARIMA [20] is a classical statistical method that models time series data using autoregressive (AR), differencing (I), and moving average (MA) components. It is effective for capturing linear trends and seasonality but is limited in modeling complex non-linear patterns.

5.2.2. CNN-LSTM

CNN-LSTM is a hybrid deep learning architecture that combines a Convolutional Neural Network (CNN) with a Long Short-Term Memory (LSTM) network. The CNN layer acts as a feature extractor that identifies local patterns within the time series, and the subsequent LSTM layer models the long-term temporal dependencies among the extracted features.

5.2.3. Autoformer

Autoformer [9] is a Transformer-based model featuring a novel decomposition architecture and an Auto-Correlation mechanism. It first decomposes the time series into seasonal and trend-cyclical components, then utilizes the Auto-Correlation mechanism in place of standard self-attention to discover period-based dependencies, improving both efficiency and performance.

5.2.4. Informer

Informer [10] is an efficient Transformer-based model designed for long sequence time series forecasting. It improves upon the standard Transformer by introducing a ProbSparse self-attention mechanism to reduce computational complexity, a self-attention distilling operation to shorten sequence length, and a generative decoder to produce long outputs in a single forward pass.

5.2.5. PatchTST

PatchTST [17] is a representative Transformer-based model that introduces a patching mechanism to segment time series data into tokens. By utilizing channel-independence and sub-series level processing, it significantly reduces the computational complexity of the self-attention mechanism while enhancing the model’s ability to capture local semantic dependencies. We include it to evaluate the performance gain of our dual-branch hybrid architecture over a single-branch patch-based Transformer.

5.2.6. DLinear

DLinear [21] is a simple yet effective linear forecasting model that decomposes time series into trend and remainder components. It applies a single linear layer to each component to handle the non-stationarity of workloads. Despite its lack of complex attention mechanisms, DLinear has demonstrated competitive performance against many deep learning models, serving as a baseline for structural simplicity and linear efficiency.
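A minimal sketch of the decomposition idea behind DLinear follows, assuming a simple moving-average trend with replicate padding at the edges; the kernel size and toy series are illustrative, and the per-component linear layers are omitted:

```python
import numpy as np

def decompose(x, kernel=5):
    """Split a series into a moving-average trend and a remainder.

    Edges are padded by repeating boundary values so the trend has the
    same length as the input, mirroring DLinear-style decomposition.
    """
    pad = kernel // 2
    padded = np.concatenate([np.full(pad, x[0]), x, np.full(pad, x[-1])])
    trend = np.convolve(padded, np.ones(kernel) / kernel, mode="valid")
    return trend, x - trend

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
trend, remainder = decompose(x)
# The two components sum back to the original series exactly.
print(np.allclose(trend + remainder, x))  # True
```

DLinear then forecasts each component with its own linear layer and sums the two forecasts.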

5.2.7. TimeMachine

TimeMachine [22] is a state-of-the-art forecasting model built entirely upon the Mamba (SSM) architecture. It leverages the linear scalability and small memory footprint of SSMs to capture long-term dependencies in multivariate sequences. TimeMachine employs an innovative integrated quadruple-Mamba architecture to produce multi-scale contextual cues and handle both channel-mixing and channel-independence scenarios. Including TimeMachine allows for a direct comparison between a pure-Mamba approach and our proposed hybrid Transformer–Mamba framework.

5.3. Discussion of Prediction Results

This section provides a detailed analysis of the experimental results. We first visualize the prediction performance of our proposed PSH model, then present a quantitative comparison against baseline methods, and finally, offer a distributional analysis of the prediction errors.
Figure 2 illustrates the prediction performance of our proposed PSH model on the Alibaba Cluster Trace 2018 dataset for CPU and memory usage. The line charts in the first column compare the ground-truth values (blue line) with the predicted values (red line), with a magnified view provided for detailed inspection. It is evident that the predicted values closely follow the trends and fluctuations of the ground-truth data, even for highly volatile patterns, showcasing the model’s capability to capture complex temporal dependencies. The second column presents the prediction error, calculated as the difference between the predicted and true values. The error plots for both CPU and memory usage show that the errors are centered around zero and remain within a relatively small range, indicating that the predictions are unbiased and accurate. The absence of significant error drift confirms the model’s stability over long prediction horizons.
Table 2 provides a comprehensive quantitative comparison of PSH against seven baseline models, spanning traditional statistical methods (ARIMA), standard deep learning architectures (CNN-LSTM), Transformer-based variants (Autoformer, Informer, and PatchTST), the linear-based DLinear, and the state-of-the-art Mamba-based TimeMachine. The results demonstrate the superior performance of PSH across nearly all metrics and target variables, validating the effectiveness of the proposed hybrid dual-branch architecture. For CPU utilization, while the Mamba-based TimeMachine achieves a slightly lower MAE (0.0928) than PSH (0.0931), PSH significantly outperforms it in RMSE (0.2123 vs. 0.2231) and R² (0.9525 vs. 0.9513). This gap indicates that although pure Mamba models like TimeMachine follow average workload trends well, they may struggle to capture extreme fluctuations and sudden spikes, which the RMSE metric penalizes heavily. By integrating a Local Transformer Path, PSH effectively captures these fine-grained local dynamics, providing more robust predictions for peak loads.

A similar trend is observed for memory utilization, where PSH achieves the highest R² score (0.9957) and the lowest RMSE (0.0712) among all compared models. For network I/O (Net_in and Net_out), PSH maintains its competitive edge, achieving the best scores across all error metrics. The observed similarity between Net_in and Net_out metrics is primarily attributable to the inherent characteristics of the Alibaba Cluster Trace 2018, in which many containerized workloads exhibit either highly correlated or sparse network traffic patterns, leading to nearly identical forecasting errors. Overall, these quantitative findings highlight that the synergy between the Transformer and Mamba branches allows PSH to maintain high accuracy and stability in complex, non-stationary cloud environments.
Figure 3 presents the Cumulative Distribution Function (CDF) of the MSE for all eight compared methods: the seven baselines (ARIMA, CNN-LSTM, Autoformer, Informer, PatchTST, DLinear, and TimeMachine) and our proposed PSH. A curve that rises more steeply and is shifted further to the left indicates a model with lower prediction errors and better overall stability. As depicted in the figure, the CDF curve for PSH is consistently positioned to the left of all other baseline models, including the modern patch-based PatchTST and the Mamba-based TimeMachine. This signifies that for any given MSE value on the x-axis, a larger proportion of PSH’s predictions have an error less than or equal to that value. For instance, approximately 70% of the predictions from PSH have an MSE lower than 0.2, a mark that other strong competitors like DLinear and TimeMachine only reach at higher MSE levels. This demonstrates that PSH not only yields a lower average error (as detailed in Table 2) but also maintains a significantly higher concentration of low-error predictions. This distributional analysis further solidifies the conclusion that PSH provides more consistently accurate and robust forecasts across diverse container workload patterns than existing state-of-the-art methods.
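Reading a point off such a curve amounts to evaluating the empirical CDF of the per-sample MSE values; the sketch below shows the computation (the error values are illustrative, not taken from the experiments):

```python
import numpy as np

def mse_cdf_at(per_sample_mse, threshold):
    """Empirical CDF: fraction of predictions with MSE <= threshold."""
    per_sample_mse = np.asarray(per_sample_mse)
    return np.mean(per_sample_mse <= threshold)

errors = np.array([0.05, 0.10, 0.15, 0.30, 0.50])
print(mse_cdf_at(errors, 0.2))  # 0.6: three of five predictions fall at or below 0.2
```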
While our experiments focus on the Alibaba Cluster Trace 2018, the heterogeneity of this large-scale dataset, which encompasses both stable online services and volatile batch jobs, provides strong evidence of the model’s robustness across different workload types. The superior performance of PSH implies strong generalization potential. Theoretically, the patching mechanism enhances generalization by capturing local semantic patterns (e.g., trend slopes and local shapes) rather than overfitting to point-wise values. This structural prior makes PSH likely to transfer well to other industrial datasets.

5.4. Ablation Studies

5.4.1. Effectiveness of Architectural Components

Table 3 summarizes the ablation studies conducted by excluding specific modules from the proposed method. To validate the effectiveness of each component within our proposed PSH model, we designed four variant models for comparison against the complete PSH architecture:
  • PSH: The complete prediction model.
  • PSH-T: Removes the Mamba branch and the fusion module (Transformer path only).
  • PSH-M: Removes the Transformer branch and the fusion module (Mamba path only).
  • PSH-A: Replaces the cross-attention fusion with element-wise addition.
  • PSH-G: Replaces the cross-attention fusion with a gating mechanism.
As presented in Table 3, the superior performance of the complete PSH model over all its variants validates the significant contributions of our proposed dual-branch architecture and cross-attention fusion strategy. The importance of combining both local and global feature extraction is highlighted by comparing PSH with PSH-T and PSH-M. PSH-T, which relies solely on the Transformer branch, experiences a noticeable degradation in performance. More strikingly, PSH-M, which only uses the Mamba branch, shows a severe decline in accuracy. For instance, its MAE for CPU utilization (0.1408) is over 51% higher than that of the full PSH model (0.0931). This substantial performance drop underscores that the local feature extraction of the Transformer is fundamental, while the global modeling of Mamba serves as a crucial enhancement. The choice of fusion mechanism is also critical, as evidenced by the performance of PSH-A and PSH-G. Both variants, which replace cross-attention with simpler fusion methods like addition and gating, fail to match the performance of the complete PSH model. This indicates that a more sophisticated fusion mechanism like cross-attention, which enables adaptive interaction between the local and global representations, is superior to static methods.
In summary, the ablation studies confirm that each component of the PSH model plays an indispensable role. The synergy between the Transformer branch, the Mamba branch, and the cross-attention module is essential for achieving the highest prediction accuracy.

5.4.2. Parameter Sensitivity Analysis

In the implementation of the PSH model, most of the fundamental hyperparameters, including the hidden dimension (d = 640), the number of Transformer and Mamba layers (n_tf = 5, n_mb = 3), and the dropout rate (0.12), were set based on empirical experience and configurations prevalent in the time series forecasting literature. The decision to use empirical defaults for these parameters rests on several considerations of scientific rigor and computational practicality. First, by adopting standard configurations that have been widely validated in state-of-the-art models such as Informer and PatchTST, we ensure that the performance improvements of PSH are strictly attributable to our core architectural innovations, namely the dual-branch hybrid structure and cross-attention fusion, rather than artifacts of exhaustive hyperparameter fine-tuning. Second, given the immense scale of the Alibaba Cluster Trace 2018 dataset, an exhaustive grid search over every hyperparameter is computationally prohibitive and risks overfitting the model to specific noise patterns within the trace. This approach maintains a strict control-variable environment, allowing a more objective evaluation of the proposed method's inherent effectiveness.
Building upon this foundation, we specifically investigated the impact of the patch size (p) on forecasting accuracy, as it is the most critical structural parameter unique to the Patching Encoder in the PSH architecture. We implemented variants with p ∈ {8, 12, 16, 24} and recorded performance using three key metrics: MAE, RMSE, and R². As detailed in Table 4, the configuration with p = 12 consistently yielded the best results across all three metrics, achieving the lowest MAE and RMSE and the highest R² for both CPU and memory utilization. Deviating from p = 12, whether by decreasing it to 8 or increasing it to 24, resulted in a noticeable degradation in predictive performance. This suggests that p = 12 provides the most suitable temporal granularity for the Patching Encoder to capture the underlying dynamics of container resource usage in the Alibaba Cluster Trace 2018 dataset: smaller patches introduce excessive local noise, while excessively large patches obscure the fine-grained fluctuations essential for accurate forecasting. These findings justify our selection of p = 12 as the default parameter for the PSH model, striking a balance between local feature resolution and global trend modeling.
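The effect of p on sequence length can be seen in a minimal patching sketch: with the paper's 96-step input window and p = 12, the encoder operates on 8 patch tokens instead of 96 time steps. Non-overlapping patches are assumed here; PatchTST-style variants may instead use an overlapping stride.

```python
import numpy as np

def make_patches(series, patch_size):
    """Segment a 1-D series into non-overlapping patches (tokens).

    Any trailing remainder shorter than patch_size is dropped.
    """
    n_patches = len(series) // patch_size
    return np.asarray(series)[: n_patches * patch_size].reshape(n_patches, patch_size)

x = np.arange(96, dtype=float)   # input window of 96 steps
patches = make_patches(x, 12)    # p = 12, as in Table 1
print(patches.shape)             # (8, 12): sequence length reduced 12x
```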

5.5. Computational Efficiency Analysis

To evaluate the feasibility of deploying PSH in real-world environments, a rigorous efficiency analysis was conducted. Table 5 compares the parameter counts, inference latency, and peak memory usage across all models. Experiments were performed on an NVIDIA L40 GPU with a batch size of 1 to simulate online inference. The results highlight a clear accuracy-efficiency trade-off. Lightweight models such as DLinear and the Mamba-based TimeMachine demonstrate exceptional speed, with inference latencies of 0.20 ms and 1.16 ms, respectively. In comparison, PSH exhibits a higher latency of 4.88 ms and a larger memory footprint (120.25 MB) due to its dual-branch design and cross-attention mechanism. However, in the context of cloud resource management, where scheduling intervals are typically on the order of minutes (e.g., 5 min), a sub-5 ms latency is negligible. Crucially, this additional computational investment yields returns in performance: as shown in Table 2, PSH outperforms TimeMachine in RMSE and R². This demonstrates that PSH successfully trades a marginal increase in latency for superior capability in capturing complex, high-frequency workload fluctuations, making it a more robust choice for QoS-critical applications.
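For reference, single-sample latency of this kind is typically measured along the following lines; the warm-up and repetition counts are illustrative, and a GPU model would additionally require device synchronization (e.g., torch.cuda.synchronize()) before reading the clock:

```python
import time

def measure_latency_ms(fn, warmup=10, reps=100):
    """Median per-call latency of fn() in milliseconds.

    A few warm-up calls are run first so one-time setup costs
    (JIT compilation, cache fills) do not inflate the measurement.
    """
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - t0) * 1000.0)
    samples.sort()
    return samples[len(samples) // 2]

# Toy stand-in for a model's forward pass on a batch of size 1.
latency = measure_latency_ms(lambda: sum(i * i for i in range(1000)))
print(latency >= 0.0)
```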

6. Conclusions

In this paper, we addressed the critical challenge of multivariate time series forecasting for container resource usage in cloud environments. We introduced PSH, a novel dual-branch architecture designed to concurrently and efficiently model both local and global temporal dynamics. The core innovation of PSH lies in its synergistic integration of a Transformer-based path, which excels at capturing intricate, short-range patterns, and a Mamba-based path, which efficiently models long-range dependencies using a State-Space Model. The model's scalability is further enhanced by an initial Patching Encoder, and its predictive power is amplified by a cross-attention fusion module that allows global context to dynamically refine local feature representations.
Extensive experiments on the Alibaba Cluster Traces 2018 dataset show that PSH consistently outperforms state-of-the-art baselines: it achieves an MAE of 0.0931 for CPU utilization, improving on CNN-LSTM by 1.7% and significantly exceeding Autoformer and Informer, while delivering the best RMSE and R² among all compared methods, with similar gains observed in memory and network I/O metrics. These findings confirm its robustness and generalizability. Ablation studies further validate the indispensability of each component, especially the dual-branch design and the cross-attention fusion. In summary, PSH provides a robust, accurate, and scalable solution for proactive resource management in modern cloud data centers. Regarding real-world applicability, PSH is designed to function as a predictive engine for cloud orchestration frameworks such as the Kubernetes Horizontal Pod Autoscaler (HPA). By providing highly accurate multi-step-ahead predictions, it enables proactive resource provisioning, allowing system controllers to scale out instances before workload spikes occur, thereby minimizing Quality of Service (QoS) and Service-Level Agreement (SLA) violations in dynamic production environments. Future work may explore its applicability to other complex time series domains as well as the development of adaptive mechanisms for dynamic environments.

Author Contributions

Conceptualization, Z.S. and H.B.; methodology, H.B.; software, X.Y. and H.B.; validation, C.L. and H.B.; formal analysis, H.B.; investigation, Z.S. and H.B.; resources, Z.S., X.Y. and C.L.; data curation, H.B.; writing, H.B.; writing—review and editing, L.L.; visualization, H.B.; supervision, L.L.; project administration, Z.S., X.Y., C.L. and L.L.; funding acquisition, Z.S., X.Y., C.L. and L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Science and Technology Project of State Grid Ningxia Electric Power Company Ltd. (No. 2024-1025).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

Author Zhilong Song, Xiangguo Yin and Chencheng Li were employed by the company Ultra-High Voltage Company, State Grid Ningxia Electric Power Company Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

  1. Dogani, J.; Khunjush, F.; Seydali, M. Host load prediction in cloud computing with discrete wavelet transformation (dwt) and bidirectional gated recurrent unit (bigru) network. Comput. Commun. 2023, 198, 157–174. [Google Scholar] [CrossRef]
  2. Bi, J.; Li, S.; Yuan, H.; Zhou, M. Integrated deep learning method for workload and resource prediction in cloud systems. Neurocomputing 2021, 424, 35–48. [Google Scholar] [CrossRef]
  3. Zhong, W.; Zhuang, Y.; Sun, J.; Gu, J. A load prediction model for cloud computing using PSO-based weighted wavelet support vector machine. Appl. Intell. 2018, 48, 4072–4083. [Google Scholar] [CrossRef]
  4. Qi, S.; Chen, J.; Chen, P.; Wen, P.; Niu, X.; Xu, L. An efficient GAN-based predictive framework for multivariate time series anomaly prediction in cloud data centers. J. Supercomput. 2024, 80, 1268–1293. [Google Scholar] [CrossRef]
  5. Calheiros, R.N.; Masoumi, E.; Ranjan, R.; Buyya, R. Workload prediction using ARIMA model and its impact on cloud applications’ QoS. IEEE Trans. Cloud Comput. 2014, 3, 449–458. [Google Scholar] [CrossRef]
  6. Barati, M.; Sharifian, S. A hybrid heuristic-based tuned support vector regression model for cloud load prediction. J. Supercomput. 2015, 71, 4235–4259. [Google Scholar] [CrossRef]
  7. Wang, L.; Zeng, Y.; Chen, T. Back propagation neural network with adaptive differential evolution algorithm for time series forecasting. Expert Syst. Appl. 2015, 42, 855–863. [Google Scholar] [CrossRef]
  8. Song, B.; Yu, Y.; Zhou, Y.; Wang, Z.; Du, S. Host load prediction with long short-term memory in cloud computing. J. Supercomput. 2018, 74, 6554–6568. [Google Scholar] [CrossRef]
  9. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition transformers with auto-correlation for long-term series forecasting. Adv. Neural Inf. Process. Syst. 2021, 34, 22419–22430. [Google Scholar]
  10. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtual, 2–9 February 2021; Volume 35, pp. 11106–11115. [Google Scholar]
  11. Zhang, R.; Chen, J.; Song, Y.; Shan, W.; Chen, P.; Xia, Y. An effective transformation-encoding-attention framework for multivariate time series anomaly detection in iot environment. Mob. Netw. Appl. 2024, 29, 1551–1563. [Google Scholar] [CrossRef]
  12. Gu, A.; Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. In Proceedings of the First Conference on Language Modeling, Philadelphia, PA, USA, 7–9 October 2024. [Google Scholar]
  13. Alibaba Inc. Alibaba Production Cluster Data v2018. Available online: https://github.com/alibaba/clusterdata/tree/v2018 (accessed on 2 June 2025).
  14. Amiri, M.; Mohammad-Khanli, L. Survey on prediction models of applications for resources provisioning in cloud. J. Netw. Comput. Appl. 2017, 82, 93–113. [Google Scholar] [CrossRef]
  15. Lackinger, A.; Morichetta, A.; Dustdar, S. Time series predictions for cloud workloads: A comprehensive evaluation. In Proceedings of the 2024 IEEE International Conference on Service-Oriented System Engineering (SOSE), Shanghai, China, 15–18 July 2024; pp. 36–45. [Google Scholar]
  16. Wen, Q.; Zhou, T.; Zhang, C.; Chen, W.; Ma, Z.; Yan, J.; Sun, L. Transformers in time series: A survey. arXiv 2022, arXiv:2202.07125. [Google Scholar]
  17. Nie, Y. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. arXiv 2022, arXiv:2211.14730. [Google Scholar]
  18. Leka, H.L.; Fengli, Z.; Kenea, A.T.; Tegene, A.T.; Atandoh, P.; Hundera, N.W. A hybrid CNN-LSTM model for virtual machine workload forecasting in cloud data center. In Proceedings of the 2021 18th International Computer Conference on Wavelet Active Media Technology and Information Processing (ICCWAMTIP), Chengdu, China, 17–19 December 2021; pp. 474–478. [Google Scholar]
  19. Yuan, H.; Bi, J.; Zhou, M. Geography-aware task scheduling for profit maximization in distributed green data centers. IEEE Trans. Cloud Comput. 2020, 10, 1864–1874. [Google Scholar] [CrossRef]
  20. Box, G.E.; Jenkins, G.M.; Reinsel, G.C.; Ljung, G.M. Time Series Analysis: Forecasting and Control; John Wiley & Sons: Hoboken, NJ, USA, 2015. [Google Scholar]
  21. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are transformers effective for time series forecasting? In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 11121–11128. [Google Scholar]
  22. Ahamed, M.A.; Cheng, Q. Timemachine: A time series is worth 4 mambas for long-term forecasting. In Proceedings of the 27th European Conference on Artificial Intelligence, Santiago de Compostela, Spain, 19–24 October 2024; Volume 392, pp. 1688–1695. [Google Scholar]
Figure 1. Workload prediction model in cloud environment.
Figure 2. Prediction results for the workload time series with our PSH in Alibaba Cluster Trace 2018. The line charts in the first column compare the ground-truth values and the predicted ones with PSH. The second column shows the error between the ground-truth values and the predicted ones. (a) Time series of CPU usage. (b) Time series of memory usage.
Figure 3. CDF for total workload time series in the Alibaba Cluster Trace 2018.
Table 1. Setting of PSH parameters for workload.

| Parameters | Values | Description |
|---|---|---|
| Patch Size (p) | 12 | Input patch size |
| Hidden Dimension (d) | 640 | Model's feature dimension |
| FFN Dimension (d_ff) | 1536 | FFN inner dimension |
| Attention Heads | 10 | Number of attention heads |
| Transformer Layers (n_tf) | 5 | Local path depth |
| Mamba Layers (n_mb) | 3 | Global path depth |
| Dropout Rate | 0.12 | Dropout probability |
| Optimizer | AdamW | Optimizer for training |
| Batch Size | 2048 | Batch size for training |
| Learning Rate | 1.5 × 10⁻⁴ | Initial learning rate |
| Weight Decay | 10⁻⁴ | L2 regularization |
| Gradient Clip Norm | 0.1 | Max gradient norm |
| EMA Momentum (β) | 0.999 | EMA decay factor |
| Random Seed | 42 | Seed for experimental reproducibility |
Table 2. Performance comparison on the Alibaba Cluster Trace 2018. Each cell reports MAE / RMSE / R².

| Methods | CPU | MEM | Net_in | Net_out |
|---|---|---|---|---|
| ARIMA | 0.1641 / 0.3595 / 0.8734 | 0.0154 / 0.0804 / 0.9938 | 0.0036 / 0.0674 / 0.9955 | 0.0035 / 0.0669 / 0.9956 |
| CNN-LSTM | 0.0947 / 0.2227 / 0.9514 | 0.0184 / 0.0760 / 0.9945 | 0.0062 / 0.0662 / 0.9957 | 0.0063 / 0.0661 / 0.9957 |
| Autoformer | 0.0981 / 0.2240 / 0.9509 | 0.0277 / 0.0801 / 0.9939 | 0.0109 / 0.0698 / 0.9952 | 0.0109 / 0.0694 / 0.9952 |
| Informer | 0.0971 / 0.2231 / 0.9512 | 0.0222 / 0.0776 / 0.9942 | 0.0076 / 0.0662 / 0.9957 | 0.0073 / 0.0659 / 0.9957 |
| PatchTST | 0.1310 / 0.2377 / 0.9447 | 0.0209 / 0.1226 / 0.9856 | 0.0107 / 0.1163 / 0.9866 | 0.0120 / 0.1161 / 0.9867 |
| DLinear | 0.1028 / 0.2218 / 0.9518 | 0.0197 / 0.0804 / 0.9938 | 0.0077 / 0.0705 / 0.9951 | 0.0079 / 0.0704 / 0.9951 |
| TimeMachine | 0.0928 / 0.2231 / 0.9513 | 0.0141 / 0.0748 / 0.9946 | 0.0050 / 0.0706 / 0.9956 | 0.0054 / 0.0705 / 0.9956 |
| PSH | 0.0931 / 0.2123 / 0.9525 | 0.0174 / 0.0712 / 0.9957 | 0.0034 / 0.0662 / 0.9957 | 0.0034 / 0.0659 / 0.9957 |
Table 3. Ablation study of PSH's different modules on the Alibaba Cluster Trace 2018. Each cell reports MAE / RMSE / R².

| Methods | CPU | MEM | Net_in | Net_out |
|---|---|---|---|---|
| PSH | 0.0931 / 0.2123 / 0.9525 | 0.0174 / 0.0712 / 0.9957 | 0.0034 / 0.0662 / 0.9957 | 0.0034 / 0.0659 / 0.9957 |
| PSH-T | 0.1032 / 0.2431 / 0.9421 | 0.0190 / 0.0836 / 0.9933 | 0.0037 / 0.0674 / 0.9955 | 0.0037 / 0.0672 / 0.9955 |
| PSH-M | 0.1408 / 0.2911 / 0.9170 | 0.0402 / 0.1564 / 0.9766 | 0.0155 / 0.1430 / 0.9798 | 0.0156 / 0.1414 / 0.9802 |
| PSH-A | 0.1025 / 0.2400 / 0.9435 | 0.0185 / 0.0820 / 0.9936 | 0.0036 / 0.0676 / 0.9955 | 0.0036 / 0.0674 / 0.9955 |
| PSH-G | 0.1033 / 0.2422 / 0.9425 | 0.0186 / 0.0824 / 0.9935 | 0.0033 / 0.0673 / 0.9955 | 0.0033 / 0.0673 / 0.9955 |
Table 4. Sensitivity analysis of patch size (p) on performance metrics. Each cell reports MAE / RMSE / R².

| Patch Size (p) | CPU | MEM | Net_in | Net_out |
|---|---|---|---|---|
| 8 | 0.1012 / 0.2315 / 0.9458 | 0.0192 / 0.0784 / 0.9942 | 0.0037 / 0.0674 / 0.9955 | 0.0037 / 0.0672 / 0.9955 |
| 12 | 0.0931 / 0.2123 / 0.9525 | 0.0174 / 0.0712 / 0.9957 | 0.0034 / 0.0662 / 0.9957 | 0.0034 / 0.0659 / 0.9957 |
| 16 | 0.1154 / 0.2568 / 0.9382 | 0.0225 / 0.0891 / 0.9921 | 0.0041 / 0.0692 / 0.9948 | 0.0040 / 0.0688 / 0.9949 |
| 24 | 0.1421 / 0.2984 / 0.9145 | 0.0312 / 0.1156 / 0.9874 | 0.0052 / 0.0754 / 0.9930 | 0.0051 / 0.0748 / 0.9931 |
Table 5. Comparison of model complexity and efficiency.

| Model | Parameters (M) | Latency (ms/Sample) | Peak Memory (MB) |
|---|---|---|---|
| PSH (Ours) | 28.69 | 4.88 | 120.25 |
| TimeMachine | 0.27 | 1.16 | 10.66 |
| PatchTST | 0.28 | 0.57 | 10.83 |
| DLinear | <0.01 | 0.20 | 8.46 |
| Autoformer | 4.94 | 3.46 | 35.30 |
| Informer | 5.56 | 3.18 | 37.42 |
| CNN-LSTM | 1.16 | 1.96 | 17.07 |
| ARIMA | <0.01 | 4.24 | N/A (CPU) |
