Article

Traffic Forecasting for Industrial Internet Gateway Based on Multi-Scale Dependency Integration

School of Physics, Liaoning University, Chongshan Campus, Shenyang 110031, China
*
Author to whom correspondence should be addressed.
Sensors 2026, 26(3), 795; https://doi.org/10.3390/s26030795
Submission received: 22 December 2025 / Revised: 21 January 2026 / Accepted: 23 January 2026 / Published: 25 January 2026
(This article belongs to the Section Industrial Sensors)

Abstract

Industrial gateways serve as critical data aggregation points within the Industrial Internet of Things (IIoT), enabling seamless data interoperability that empowers enterprises to extract value from equipment data more efficiently. However, their role exposes a fundamental trade-off between computational efficiency and prediction accuracy, a contradiction yet to be fully resolved by existing approaches. The rapid proliferation of IoT devices has led to a corresponding surge in network traffic, posing significant challenges for traffic forecasting methods. Deep learning models such as Transformers and GNNs demonstrate high accuracy in traffic prediction, but their substantial computational and memory demands hinder effective deployment on resource-constrained industrial gateways. Simple linear models are far cheaper to run, yet they struggle to capture the complex characteristics of IIoT traffic, which often exhibits high nonlinearity, significant burstiness, and a wide distribution of time scales. The inherent time-varying nature of traffic data further complicates achieving high prediction accuracy. To address these interrelated challenges, we propose the lightweight and theoretically grounded DOA-MSDI-CrossLinear framework, which recasts traffic forecasting as a hierarchical decomposition–interaction problem. Unlike existing approaches that simply combine components, we recognize that industrial traffic inherently exhibits scale-dependent temporal correlations requiring explicit decomposition prior to interaction modeling. The Multi-Scale Decomposable Mixing (MDM) module implements this concept through adaptive sequence decomposition, while the Dual Dependency Interaction (DDI) module simultaneously captures dependencies across time and channels. Finally, the decomposed patterns are fed into an enhanced CrossLinear model to predict flow values for specific future time periods.
The Dream Optimization Algorithm (DOA) provides bio-inspired hyperparameter tuning that balances exploration and exploitation—particularly suited for the non-convex optimization scenarios typical in industrial forecasting tasks. Extensive experiments on real industrial IoT datasets thoroughly validate the effectiveness of this approach.

1. Introduction

The Industrial Internet of Things (IIoT) is defined as the specific application of the Internet of Things (IoT) within industrial automation scenarios. This system employs industrial communication technologies and adheres to a service-oriented application paradigm [1]. The IIoT concept builds upon existing industrial infrastructure, leveraging intelligent technologies to enable flexible resource scheduling and data interconnection [2], thereby establishing a novel information communication paradigm. As network devices, industrial gateways facilitate communication and interconnection between heterogeneous networks in industrial environments by converting connections across different network types. Their core function lies in bridging networks, enabling interoperability among diverse industrial network devices while supporting comprehensive system monitoring and management. Industrial gateways comprise devices adapted for complex industrial environments, such as communication gateways, protocol converters, and remote terminal units. Within the IIoT domain, these devices serve as hubs connecting field devices to system platforms, achieving data interoperability and optimizing data value. The rapid expansion of data scale and connection volume has triggered a surge in network traffic, leading to challenges such as communication latency, bandwidth bottlenecks, and reliability issues at critical nodes. These challenges are often interrelated—for instance, bandwidth constraints exacerbate real-time and reliability problems. Addressing these challenges typically requires multi-layered, collaborative solutions. This encompasses network architecture optimization (edge computing, network slicing), intelligent scheduling and load forecasting, security design and supply chain governance, as well as establishing unified standards and operational frameworks [3]. Effectively managing diverse and complex network traffic is a prerequisite for advancing Industry 4.0. 
Industrial gateways play a pivotal role in IIoT but currently face a triple dilemma:
(1) Trade-off between model complexity and deployability: While Transformers and GNNs offer high accuracy, their $O(n^2)$ computational complexity and high memory requirements (typically >4 GB RAM) render them unsuitable for resource-constrained industrial gateways [4,5].
(2) Multi-scale modeling and interpretability gaps: Existing lightweight linear models (e.g., DLinear) fail to capture IIoT traffic’s multi-scale characteristics and nonlinear dependencies, leading to a 40–60% increase in error during burst traffic scenarios [6].
(3) Gaps exposed by emerging trends: Technologies like 5G slicing, edge computing, and digital twins complicate traffic propagation patterns (e.g., short-duration bursts between microservices, and cross-tier long-term dependencies); yet existing models lack targeted designs [7,8].
As shown in Table 1, existing research primarily focuses on accuracy optimization while neglecting computational constraints in edge deployment. For instance, although transformer-based architectures are powerful, their self-attention mechanisms introduce significant latency, which is unacceptable for real-time anomaly detection in industrial gateways. Similarly, graph neural networks (GNNs) that assume a static topology cannot adapt to devices dynamically joining or leaving an industrial IoT environment. Linear and MLP hybrid models, while lightweight and efficient, often exhibit significantly degraded performance when handling multi-scale traffic patterns in industrial IoT scenarios. We contend that the critical challenge lies in developing a model that combines the interpretability and speed advantages of linear models with the feature extraction capabilities of deep neural networks to address multi-scale temporal dependencies.
Y. Li et al. [15] combined diffusion convolutions with sequence models (recurrent units) to explicitly model spatio-temporal dependencies using graph structures (propagation matrices). This approach significantly improved prediction accuracy for multi-node simultaneous time series (e.g., traffic flow). However, RNN-based architectures are limited in their capacity for long prediction horizons and computational parallelism when compared to convolutional or transformer approaches, and the graph-based formulation additionally requires pre-computation or online estimation of stable adjacency/diffusion matrices. H. Zhou et al. [5] proposed Informer, an efficient Transformer variant for long-sequence time series forecasting. Its sparse self-attention and probabilistic sampling mechanisms substantially reduce computational complexity while outperforming the native Transformer on long-term predictions. B. Lim et al. [9] proposed an explainable multi-domain multi-step forecasting Transformer that integrates static features, historical sequences, and known future inputs. Its modules for variable selection and attention visualization enhance interpretability and multivariate modeling performance. However, the model is large, with high training and inference costs, and it requires engineering optimizations for very long historical dependencies or ultra-long sequences.
From the perspective of practical value in industrial process prediction, classical statistical methods and interpretable models remain viable and indispensable foundational approaches. In contrast, deep learning (particularly Transformer variants and spatio-temporal graph models) demonstrates superior expressive power in handling complex nonlinear relationships and multi-node dependencies. However, a critical issue lies in existing research failing to address the core industrial deployment challenge: “How to simultaneously achieve multi-scale modeling and high-precision prediction under <100 millisecond inference latency and <512 MB memory constraints?” Based on the strengths and weaknesses of state-of-the-art models, this paper develops the DOA-MSDI-CrossLinear model. Its main contributions are as follows:
(1) To overcome the limitations of purely linear models, we enhanced the original CrossLinear architecture by replacing the single-linear layer with a lightweight shallow multi-layer perceptron (MLP) employing the GELU activation function. This design achieves a critical balance: significantly boosting the model’s ability to capture complex nonlinear traffic patterns while maintaining an ultra-compact parameter size (approximately 10k parameters), ensuring the structural simplicity required for high-speed inference.
(2) Instead of relying on generic feature extraction, we strategically integrate the state-of-the-art Multi-Scale Decomposable Mixing and Dual Dependency Interaction modules into the CrossLinear backbone. Our contribution lies in the novel orchestration of these components to solve the specific challenge of industrial traffic forecasting. By embedding these modules within our lightweight framework, we effectively leverage their capability for adaptive sequence decomposition and dependency modeling, while successfully constraining the overall model complexity to suit resource-limited edge scenarios.
(3) We employ the Dream Optimization Algorithm (DOA) for end-to-end hyperparameter tuning. This adaptive mechanism not only maximizes prediction accuracy while suppressing false alarms but also ensures an optimal balance between model performance and resource consumption. The final model demonstrates millisecond-level latency (2.91 ms) and negligible storage overhead (0.04 MB), proving its feasibility for real-time online deployment on resource-constrained industrial IoT edge devices.
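To make the size claim in contribution (1) concrete, the following is a minimal sketch of a shallow two-layer MLP head with a GELU activation, written in plain Python. The class name `ShallowMLPHead` and the dimensions (lookback 96, hidden width 80, horizon 24) are hypothetical choices picked only so that the parameter count lands near the 10k figure quoted above; they are not taken from the paper, and this is not the authors' implementation.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact GELU: x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


class ShallowMLPHead:
    """Two-layer MLP (lookback -> hidden -> horizon) with GELU in between.

    All sizes are illustrative assumptions, not values from the paper.
    """

    def __init__(self, lookback: int, hidden: int, horizon: int, seed: int = 0):
        rng = random.Random(seed)
        s1 = 1.0 / math.sqrt(lookback)
        s2 = 1.0 / math.sqrt(hidden)
        self.w1 = [[rng.uniform(-s1, s1) for _ in range(lookback)] for _ in range(hidden)]
        self.b1 = [0.0] * hidden
        self.w2 = [[rng.uniform(-s2, s2) for _ in range(hidden)] for _ in range(horizon)]
        self.b2 = [0.0] * horizon

    def num_parameters(self) -> int:
        """Count weights and biases of both layers."""
        return (len(self.w1) * len(self.w1[0]) + len(self.b1)
                + len(self.w2) * len(self.w2[0]) + len(self.b2))

    def forward(self, window):
        """Map one lookback window to a forecast of length `horizon`."""
        hidden = [gelu(sum(w * x for w, x in zip(row, window)) + b)
                  for row, b in zip(self.w1, self.b1)]
        return [sum(w * h for w, h in zip(row, hidden)) + b
                for row, b in zip(self.w2, self.b2)]


head = ShallowMLPHead(lookback=96, hidden=80, horizon=24)
n_params = head.num_parameters()      # 96*80 + 80 + 80*24 + 24
forecast = head.forward([0.0] * 96)   # untrained weights, shape check only
```

With these assumed sizes the head has 9704 parameters, which shows that a single hidden layer is enough to add a nonlinearity while staying in the ten-thousand-parameter range.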
The remainder of this paper is organized as follows: Section 2 reviews recent work on forecasting methods, grouped primarily into machine learning-based and deep learning-based approaches. Section 3 presents the proposed forecasting model and the specific implementation of each of its components. Section 4 reports the experimental validation, and Section 5 summarizes the conclusions.

2. Related Works

This section reviews prior work on network traffic prediction and time series forecasting in IIoT.

2.1. Machine Learning-Based Predictive Methods: Achievements and Limitations

Machine learning methods can process massive amounts of sensor data, maintenance logs, and operating parameters [16]. Their strength lies in learning how different variables interact and cause equipment performance degradation over time. However, traditional machine learning models cannot be applied directly; carefully designed features must be extracted from raw data to uncover patterns in both the frequency domain and time domain (sometimes requiring simultaneous extraction of both). In the field of time series forecasting, classical methods like ARIMA models [17] and exponential smoothing have been applied for decades. Machine learning alternatives such as linear regression and support vector machines [18] have also carved out a niche. However, the critical point, in our view, is that these methods perform best only when data behaves well—that is, when it is relatively stable and follows linear patterns. Once confronted with the complexity of IoT traffic, traditional methods begin to fall short. They were simply not designed to handle such complex states.
Osovsky et al. [16] ran a comprehensive comparison across different baseline models and found some interesting patterns. For UDP traffic, ARIMA came out on top as the most effective approach. When they looked at TCP traffic, though, linear regression and theta models performed better. And for HTTPS traffic, it turned out that linear regression, ARIMA, and N-BEATS all showed strong performance—no single winner there.
Another study [19] took real-world IIoT data traffic and benchmarked it against the 5G New Radio performance analysis model to see how well the characteristics matched up. Meanwhile, researchers have explored different machine learning approaches for cellular traffic forecasting: N. Sapankevych and R. Sankar [18] went with linear regression, while Deng et al. [13] opted for support vector machine regression.
In the context of eMBB traffic, one paper [20] suggested using ARIMA models to forecast demand ahead of time. The idea is pretty straightforward: if you can predict traffic spikes, you can reserve channels proactively to maximize throughput and maintain good data rates. In theory, this kind of predictive approach should improve overall QoS performance.
Despite these advances, machine learning-based traffic forecasting faces significant challenges. First, there is a tension between feature engineering and automated feature learning: traditional approaches rely heavily on manually designed frequency-domain, time-domain, and spatio-temporal features, demanding deep domain expertise while limiting transferability across industrial scenarios. Second, complex models require massive training data to avoid overfitting, yet industrial process data are often scarce due to privacy constraints, sensor failures, and the high cost of labeling anomalous data [21]. Existing research has not sufficiently explored how models can balance capacity and sample efficiency in data-scarce industrial environments.

2.2. Deep Learning-Based Predictive Methods

Deep learning, a pivotal branch of machine learning, emulates brain cognition through multi-layer artificial neural networks. The primary advantage of this approach lies in its capacity to automatically extract and learn complex features from high-dimensional and unstructured data, such as images and time series, without the necessity of manual feature engineering. This capability enables it to outperform other machine learning techniques in fields such as predictive maintenance, particularly when processing historical sensor data from equipment.

2.2.1. RNN/LSTM-Based Methods

Deep learning approaches for early traffic flow forecasting primarily rely on recurrent neural networks (RNNs) and their architectural variants, particularly long short-term memory (LSTM) networks and gated recurrent units (GRUs). These architectures are widely adopted due to their inherent capability to model long-term dependencies in sequential data, which is crucial for capturing the temporal evolution patterns characteristic of network traffic dynamics.
Wu et al. [22] extended the traditional LSTM framework by introducing a mechanism for dynamically modeling degradation factors and inferring latent variables, thereby significantly improving remaining useful life (RUL) prediction accuracy. In parallel research, Tziolas et al. [23] conducted a systematic evaluation of three autoencoder architectures on industrial datasets. Their empirical analysis showed that autoencoders integrating LSTM layers with convolutional neural networks performed best.
In the realm of 5G network optimization, Alawe et al. [24] developed a prediction mechanism based on LSTM to anticipate traffic load fluctuations and enable dynamic resource allocation. This approach guides elastic network resource scaling through traffic forecasting, with a focus on the Access and Mobility Management Function (AMF). Simulation-based validation demonstrates that, compared to traditional threshold-driven approaches, this prediction-driven scaling strategy significantly reduces latency in responding to traffic changes and effectively shortens the configuration delay of Virtual Network Function (VNF) instances during demand surges.
Although LSTM networks excel within the recurrent neural network family, their training process incurs substantial computational overhead that increases linearly with the number of parameters. To overcome this limitation, Hua et al. [25] proposed the sparse-connected LSTM variant RCLSTM. Its core innovation lies in deliberately designed sparse neural connection patterns. Experimental validation on two benchmark datasets demonstrated that this model reduces computational time by approximately 30% while maintaining or exceeding the prediction quality of standard LSTMs.
Despite the strong performance of LSTM and GRU architectures in modeling sequential dependencies, they face fundamental limitations when applied to industrial traffic forecasting. The vanishing gradient phenomenon becomes particularly pronounced when capturing long-term temporal patterns spanning weekly or monthly cycles. Furthermore, the inherent sequential computation structure of these models hinders effective parallelization, creating processing bottlenecks that conflict with the real-time latency constraints demanded by edge gateway deployments [26].

2.2.2. Transformer-Based Methods

The transformer architecture has fundamentally reshaped how we approach sequence modeling tasks since its introduction [27]. While it started in natural language processing, researchers quickly realized its potential for time series forecasting and traffic prediction, that is, for any task that needs to capture long-range dependencies in sequential data.
Several teams have adapted Transformers specifically for traffic forecasting. Chen et al. [28] combined multi-task learning with transformers to create MTL-Trans, which outperformed existing models on multidimensional time series tasks. Liu et al. [29] took a different approach with ST-Tran, their spatio-temporal model that uses separate spatial and temporal transformer blocks to extract features efficiently. When tested on real cellular network data, it proved both effective and practical for traffic prediction.
But here is the fundamental issue: even with these optimizations, the $O(n^2)$ complexity of self-attention conflicts with the linear scalability needed for edge gateway deployment. Recent efficient variants like Informer, Autoformer, and FEDformer reduce this complexity through approximations, but those approximations introduce errors that may be unacceptable in precision-critical industrial applications [5]. There is also a deeper question about interpretability: attention weights might look meaningful, but whether they actually correspond to operationally relevant temporal relationships remains debatable [30].

2.2.3. GNN-Based Methods

Graph neural networks have demonstrated remarkable effectiveness in traffic forecasting by integrating spatio-temporal modeling. Key architectures include Diffusion Convolutional Recurrent Neural Networks (DCRNNs) [15], Spatio-Temporal Graph Convolutional Networks (STGCNs) [11], and Graph WaveNets [31]. These models capture both spatial dependencies between network nodes and temporal evolution of traffic flows, enabling more precise multi-node predictions. For instance, multi-task learning approaches successfully integrated spatio-temporal features to enhance IoT traffic prediction accuracy in industrial settings [32].
Wavelet-based approaches offer an alternative perspective. Wang et al. [33] developed frequency-aware deep learning models using wavelet neural network architectures, while Yang et al. [34] combined high-order fuzzy cognitive graphs with redundant wavelet transforms to handle large-scale non-stationary time series. Given that wireless traffic is influenced not only by historical patterns but also by inter-base-station handover, Zhao et al. [35] proposed the STGCN-HO model, which enhances prediction accuracy by incorporating handover probabilities. Their results demonstrate that this model significantly outperforms baseline models at both cell and base-station levels.
However, these approaches still face limitations. Some approaches have been questioned due to computational inefficiency and suboptimal performance in specific scenarios [34]. Furthermore, graph neural network-based methods rely on known or derivable graph structures, which are often unavailable or unreliable in dynamic industrial networks (where device connections change with production configurations). The topology inference process not only increases computational burden but may also induce error propagation.

2.3. Linear and Hybrid Models

Linear mixing methods have been getting a lot of attention lately. Take DLinear [4], for example: it handles time series forecasting by decomposing sequences and applying linear regression to the components. The appeal? Low memory usage and fast inference. It stays pretty stable even when you extend the lookback window, but there is a catch: it struggles to pick up local features, which limits how accurate its predictions can be.
Then there is the MLP-Mixer approach. Tolstikhin et al. [36] built this architecture entirely on multilayer perceptrons—no convolutions, no self-attention—and it still performs competitively on image classification tasks. Li et al. [37] took a closer look at what attention mechanisms actually contribute to time series forecasting and came up with MTS-Mixers. Their model uses a dual-factor decomposition module to capture both temporal patterns and relationships between channels.
Multivariate time series prediction is tricky for two main reasons: patterns shift over time in unpredictable ways, and channels interact in complex ways that are not always obvious. Qiu et al. [38] tackled this with DUET, a framework that applies clustering in both the temporal and channel dimensions to improve prediction quality.
Now, here is something interesting: simple linear models like DLinear and NLinear perform surprisingly well on standard time series benchmarks [39], which suggests that many deep learning models might be overengineered for these tasks. But purely linear approaches have their own blind spots—they cannot handle the nonlinear dynamics you see in industrial systems, where equipment behavior often involves threshold effects, saturation points, and sudden mode switches. That is why we are proposing a hybrid approach that combines the efficiency of linear methods with the ability to capture directional nonlinear relationships.

2.4. Overview of Research Gaps

Reviewing existing research, three fundamental issues remain unresolved. First, the efficiency–accuracy gap: transformers and graph neural networks, while accurate, incur prohibitively high computational costs that make deployment on edge devices impractical, whereas linear models, though efficient, are overly simplistic and struggle to capture complex patterns. Second, the interpretability gap: certain models can identify multi-timescale patterns (e.g., hourly peaks and weekly trends) yet operate like black boxes, so it remains impossible to trace why the system predicts traffic surges during specific periods. Third, the domain gap: general-purpose time series models fail to leverage the actual behavioral characteristics of industrial IoT systems. We reject the practice of stacking existing technologies and instead address the problem at its core. Industrial flow forecasting requires advancing two processes in tandem: decomposing signals into distinct timescales (daily, weekly, and seasonal) and modeling the interactions between these scales. This theoretical foundation, which treats decomposition and interaction as distinct yet coupled processes, distinguishes our approach from unsubstantiated modular patchwork solutions.

3. Dream-Based Optimization of MSDI-CrossLinear for Network Traffic Forecasting

This section details the proposed model and its theoretical foundations. Its overall architecture is illustrated in Figure 1.
Our model design is based on three empirically validated hypotheses regarding industrial gateway traffic, each supported by domain analysis as follows:
Hypothesis 1 (Multiscale Stationarity).
Industrial traffic exhibits non-stationarity globally but demonstrates local stationarity within production cycles. Rationale: Manufacturing systems follow fixed schedules (8 h shifts, weekly maintenance). Dataset analysis reveals autocorrelation peaks at 24 h (ρ = 0.87) and 168 h (ρ = 0.73), confirming periodicity. Implication: Supports scale-specific linear modeling within decomposed components.
Hypothesis 2 (Cross-Scale Directional Causality).
Fine-scale anomalies propagate to coarse scales, but not vice versa. Rationale: Equipment failures (hourly events) accumulate to reduce daily throughput, yet weekly trends cannot alter historical hourly readings. Granger causality tests confirm unidirectional effects (p < 0.01). Implication: Justifies modeling asymmetric interactions within the CrossLinear module.
Hypothesis 3 (Resource-Accuracy Tradeoff Threshold).
Industrial gateways tolerate <5% accuracy loss in exchange for 10× latency reduction. Rationale: Interviews with 12 industrial engineers revealed that response times below 100 milliseconds are critical for real-time control, while 5% prediction error remains within operational safety margins. Implication: Guides the Dream Optimization Algorithm to prioritize latency reduction over marginal accuracy gains.
These assumptions differentiate our approach from generic time series models that ignore industrial system constraints.
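The periodicity check behind Hypothesis 1 can be illustrated with a short sketch. It computes the sample autocorrelation of an hourly series at the 24 h and 168 h lags. The `traffic` series below is synthetic (a daily plus a weekly sinusoid over four weeks) and is purely an assumption for illustration; it does not reproduce the paper's dataset or its reported ρ values.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation at `lag` (standard biased estimator)."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var


# Synthetic hourly traffic over four weeks: daily and weekly cycles (assumed).
hours = range(24 * 7 * 4)
traffic = [10.0
           + 3.0 * math.sin(2 * math.pi * h / 24)
           + 2.0 * math.sin(2 * math.pi * h / 168)
           for h in hours]

rho_24 = autocorrelation(traffic, 24)    # daily lag
rho_168 = autocorrelation(traffic, 168)  # weekly lag
rho_12 = autocorrelation(traffic, 12)    # half-day lag, for contrast
```

On a series with genuine daily and weekly cycles, the 24 h and 168 h lags stand out as clearly positive while an off-cycle lag such as 12 h does not, which is the pattern the hypothesis relies on.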

3.1. Methodological Foundations

The fundamental epistemological framework underlying this paper posits that industrial IoT traffic is not a single sequence but rather the superposition of deterministic production cycles and random concept drift. We emphasize that model design should serve the practical needs of industrial deployment, rather than merely pursuing theoretical optimality. Our model is not built to stack algorithms but to mathematically simulate industrial production’s “shift schedules” (corresponding to the model’s daily scale) and “supply chain/order cycles” (corresponding to weekly/seasonal scales). Furthermore, model architecture should be reverse-engineered from industrial gateway resource constraints (<512 MB memory, <100 ms latency), prioritizing interpretability, robustness, and deployment feasibility over performance gains on benchmark datasets alone.
This paper introduces an innovative theoretical paradigm, the Dual-Process Decomposition Interaction Theory (DPDIT), for industrial flow modeling. Unlike existing approaches treating temporal decomposition and feature interaction as sequential preprocessing steps, we establish them as coupled mathematical processes with independent optimization objectives as follows:
(1) Decomposition Process: Extract scale-specific components via adaptive moving average kernels, where kernel sizes are derived from industrial production cycle theory rather than empirical tuning.
(2) Interaction Process: Modeling cross-scale dependencies through directional nonlinear mappings while preserving inherent causal constraints of industrial systems.
Theoretical Basis: Industrial flow exhibits a hierarchical temporal structure governed by production planning (deterministic) and equipment failures (random). Existing methods suffer from the following shortcomings: unified processing of all scales (Transformer models) leads to loss of scale-specific features; independent modeling of each scale (DLinear) ignores cross-scale causal relationships; and black-box interaction mechanisms (LSTM models) violate industrial interpretability requirements.
DPDIT resolves this tension. Unlike existing methods that optimize decomposition and interaction sequentially, it formalizes them as a coupled optimization problem as follows:
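$$\min_{\Phi_D,\,\Phi_I} \mathcal{L} = \mathcal{L}_{\text{decomp}}(\Phi_D) + \lambda\,\mathcal{L}_{\text{interact}}(\Phi_I) + \gamma\,\mathcal{L}_{\text{causality}}$$
where
  • $\mathcal{L}_{\text{decomp}}$ ensures high-quality multi-scale decomposition:
    $$\mathcal{L}_{\text{decomp}} = \Big\| x_t - \sum_{s \in S} x_t^{(s)} \Big\|_2^2 + \alpha \sum_{s} \mathrm{TV}\big(x_t^{(s)}\big)$$
  • $\mathcal{L}_{\text{interact}}$ models cross-scale dependencies:
    $$\mathcal{L}_{\text{interact}} = \Big\| y_t - f_{\text{cross}}\big(x_t^{\text{daily}}, x_t^{\text{weekly}}, x_t^{\text{seasonal}}\big) \Big\|_2^2$$
  • $\mathcal{L}_{\text{causality}}$ enforces temporal ordering constraints absent in prior work:
    $$\mathcal{L}_{\text{causality}} = \sum_{t} \left\| \frac{\partial y_t}{\partial x_{>t}} \right\| + \sum_{s_i < s_j} \left\| \frac{\partial y^{(s_i)}}{\partial x^{(s_j)}} \right\|$$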
The weight λ ranges over [0.1, 1.0] and is tuned by grid search, while γ ranges over [0.01, 0.1] and is tuned by incrementally increasing its value. This theoretical framework incorporates a decomposition mechanism adaptable to prediction tasks, with each loss term possessing a clear physical interpretation. The individual terms of the formula are explained in the following subsections.
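As a rough illustration of how the weighted objective is assembled, the sketch below evaluates a decomposition term (reconstruction error plus a smoothness penalty) and an interaction term (squared forecast error) on a toy univariate sequence, then combines them with λ and γ. Several things here are assumptions rather than details from the paper: total variation is taken to be the sum of absolute first differences, the losses are summed over time for a single channel, the causality term is passed in as a precomputed number because it depends on model gradients, and all function names are hypothetical.

```python
def total_variation(component):
    """1-D total variation: sum of absolute first differences (assumed form)."""
    return sum(abs(b - a) for a, b in zip(component, component[1:]))


def decomposition_loss(x, components, alpha):
    """Squared reconstruction error plus alpha-weighted TV of every component."""
    recon = [sum(vals) for vals in zip(*components)]
    recon_err = sum((xi - ri) ** 2 for xi, ri in zip(x, recon))
    smoothness = sum(total_variation(c) for c in components)
    return recon_err + alpha * smoothness


def interaction_loss(y_true, y_pred):
    """Squared error between targets and the cross-scale prediction."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred))


def combined_objective(l_decomp, l_interact, l_causality, lam, gamma):
    """Weighted sum: decomposition + lam * interaction + gamma * causality."""
    return l_decomp + lam * l_interact + gamma * l_causality


# Toy data: two scale components that sum exactly to x.
x = [1.0, 2.0, 4.0, 3.0]
components = [[1.0, 1.5, 2.5, 2.5], [0.0, 0.5, 1.5, 0.5]]

l_dec = decomposition_loss(x, components, alpha=0.1)
l_int = interaction_loss([2.0, 3.0], [2.5, 2.0])
total = combined_objective(l_dec, l_int, l_causality=0.0, lam=0.5, gamma=0.05)
```

Because the toy components reconstruct `x` exactly, the decomposition term here consists only of the smoothness penalty, which makes the effect of α easy to see in isolation.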

3.1.1. Theoretical Advantages of Multi-Scale Decomposition

Industrial IoT traffic is difficult to predict because it does not follow simple wave-like patterns but rather resembles multiple rhythms intertwining simultaneously at varying speeds. We can decompose it into a three-tier structure as follows: (1) Fast rhythm (second level): real-time communication between devices, in which sensors collect data and controllers send commands, generating sudden traffic spikes. (2) Medium rhythm (minute level): production cycles form repetitive patterns; for example, production lines complete cycles every 5 min. (3) Slow rhythm (hour level): macro trends span the workday, such as shift changes or production order modifications. Traditional forecasting methods face a dilemma: focusing on short-term bursts overlooks overall trends, while broadening the view to daily trends misses critical details like equipment failures [40].
Our solution employs a multi-scale decomposable mixing block [41]. Rather than choosing between granularity and the big picture, we first decompose signals into distinct temporal layers, then perform tailored analysis at each level. This resembles multiple cameras synchronously recording the same scene at different frame rates, with one capturing rapid motion and another tracking gradual changes. Attention mechanisms require comparing all pairs of time points, an $O(n^2)$ operation, whereas MDM relies on moving averages that scale linearly with data length ($O(n)$). For typical industrial data streams (512 time points), processing is 6.8 times faster. When MDM decomposes flow into three components, it can explicitly state: “This is the hourly trend, this is the production cycle, this is equipment noise.” Attention weights, by contrast, remain a black box: they deliver predictions without clarifying the temporal scales underlying decisions.
While other researchers have attempted to handle multiple timescales, each approach has a significant drawback. Wavelet decomposition [33] applies a fixed mathematical basis to decompose signals, but equipment operates at varying speeds across factories, and fixed patterns cannot adapt to motor cycles that range from every 30 s to every 3 min. Hierarchical attention mechanisms let models learn the critical timescales, but at high computational cost (still $O(n^2)$ complexity) and with difficulty explaining why the model focuses on specific scales. Multi-scale convolutional neural networks capture patterns using filters of varying sizes, but this assumes data stationarity, that is, that patterns remain constant over time, whereas industrial systems continuously switch between operating modes. Our approach employs adaptive decomposition [42], abandoning fixed patterns to learn a dedicated averaging window for each device. When Device A operates on a 2 min cycle and Device B on a 10 min cycle, the model automatically adjusts the convolution kernel size to match. Furthermore, it achieves multi-scale separation through convolutional operations that scale linearly with data length ($O(n)$), enabling the algorithm to run on edge gateways without hardware overload. Finally, unlike attention mechanisms that blend all signals, MDM provides three independent signals corresponding to different time scales. Each component retains physical meaning: when plotting the “hourly trend” component, the curve reflects shift-change variations.
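The linear-time decomposition idea can be sketched as a cascade of moving averages. The example below is a simplified stand-in, not the MDM module itself: it uses fixed, hand-picked window sizes (25 and 5 samples) where the paper describes learned, per-device kernels, and the function names are hypothetical. Each moving average is computed from prefix sums, so the cost grows linearly with sequence length regardless of window size.

```python
def moving_average(series, window):
    """Centered moving average with edge replication, O(n) via prefix sums."""
    half = window // 2
    padded = ([series[0]] * half
              + list(series)
              + [series[-1]] * (window - 1 - half))
    prefix = [0.0]
    for v in padded:
        prefix.append(prefix[-1] + v)
    return [(prefix[i + window] - prefix[i]) / window
            for i in range(len(series))]


def multiscale_decompose(series, windows=(25, 5)):
    """Split a series into coarse-to-fine smooth components plus a residual.

    Each window extracts a smooth component from whatever the previous
    (coarser) window left behind; the final remainder is the noise term.
    """
    components = []
    remainder = list(series)
    for w in windows:
        smooth = moving_average(remainder, w)
        components.append(smooth)
        remainder = [r - s for r, s in zip(remainder, smooth)]
    components.append(remainder)
    return components


# Toy series: a sawtooth "production cycle" on top of a slow upward trend.
series = [float(t % 12) + 0.1 * t for t in range(60)]
trend, cycle, noise = multiscale_decompose(series)
reconstructed = [a + b + c for a, b, c in zip(trend, cycle, noise)]
```

By construction the three components add back up to the input, so each one can be inspected or plotted separately without losing information.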

3.1.2. The Theoretical Necessity of Dual Dependency Interaction

Standard self-attention in time series forecasting attempts two fundamentally distinct tasks at once and performs both suboptimally. In industrial IoT forecasting, we must ask two questions: “Which past events are influencing the present?” and “Which other devices are affecting this device?” Standard transformer attention conflates these questions by processing them through a single set of weights. Because it must learn correlations across the temporal and channel dimensions jointly, the number of interactions to learn grows as T × C × T × C (where T is the number of time steps and C the number of channels). DDI [41] instead uses dedicated paths that separately learn T × T temporal patterns and C × C channel patterns, so it requires only T² + C² interactions rather than (T × C)². IIoT traffic exhibits both temporal dependencies, where historical traffic influences future patterns, and channel dependencies, such as coupling between devices or protocols (e.g., correlations between Modbus and OPC UA traffic). DDI addresses both through its decoupled interaction mechanism, avoiding the “curse of dimensionality” of traditional approaches (reducing parameter complexity from O(n²d²) to O(nd)) [43,44].
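As an illustration of the counts discussed above, the following minimal sketch compares (T × C)² with T² + C². The function names and the sizes T = 96 and C = 7 are illustrative assumptions and are not taken from the experiments reported in this paper.

```python
def joint_interactions(T: int, C: int) -> int:
    """Pairwise interactions when time and channel are mixed jointly: (T*C)^2."""
    return (T * C) ** 2


def decoupled_interactions(T: int, C: int) -> int:
    """Pairwise interactions with separate temporal and channel paths: T^2 + C^2."""
    return T ** 2 + C ** 2


# Illustrative sizes only (assumed, not from the paper's experiments).
T, C = 96, 7
joint = joint_interactions(T, C)
decoupled = decoupled_interactions(T, C)
print(joint, decoupled)
```

For these assumed sizes, the joint formulation involves 451,584 pairwise interactions against 9,265 for the decoupled one.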

3.1.3. DOA Applicability Argument

Hyperparameter optimization in industrial settings faces three major challenges: (1) High-dimensional non-convexity: the hyperparameter space has more than 15 dimensions and multiple local optima. (2) High evaluation cost: each evaluation requires retraining the model (more than 30 min). (3) Noise interference: real-world data contain missing values and outliers. DOA suits this setting for two reasons. First, the search space in this paper includes continuous (learning rate, dropout), discrete (layer dimensions, kernel size), and categorical (activation function) hyperparameters, and DOA handles this heterogeneity naturally through bio-inspired encoding. Second, the MDM scale decomposition and the DDI interaction patterns jointly form a complex multimodal optimization surface on which gradient-based algorithms often struggle to converge [45].
As shown in Table 2, DOA employs a two-stage search that maintains global search capability while reducing computational overhead [46]. It converges slightly more slowly than Adam, but it exhibits stronger global search ability, better noise resilience than other traditional methods, and lower computational cost than Bayesian optimization and grid search.

3.2. Problem Description

The task of Long-term Multivariate Time Series Forecasting (LMTSF) is defined as follows: given a sequence of historical observations $X = \{x_1, x_2, \ldots, x_L\} \in \mathbb{R}^{L \times M}$ of length L, the objective is to generate a forecast sequence $\hat{X} = \{x_{L+1}, x_{L+2}, \ldots, x_{L+H}\} \in \mathbb{R}^{H \times M}$ of length H. Here, L and H denote the numbers of input and output time steps, respectively, and $x_t \in \mathbb{R}^{M}$ is the observation vector containing the M variables at time step t. In multivariate forecasting, the number of variables M must be the same for the input and output sequences.

3.3. Multi-Scale Decomposable Mixing Block

Real-world time series are frequently characterized by multi-scale heterogeneity: they show distinct behavioral patterns across temporal granularities or cycles (e.g., daily trends, weekly seasonality, monthly cycles). Single-model or single-scale feature extraction cannot fully capture the interactions inherent in such systems. The MDM module [41] provides clearer, more focused inputs to downstream prediction models by adaptively decomposing the raw sequence into multiple sub-sequences, each representing a specific temporal scale. The raw input retains fine-grained details, while coarse-grained information is extracted through average pooling. The initial temporal pattern $\tau_1$ is set to the channel input X. The coarser temporal patterns $\tau_i \in \mathbb{R}^{1 \times \lfloor L / d^{\,i-1} \rfloor}$, $i \in \{2, \ldots, h\}$, are then obtained by applying average pooling to the pattern from the preceding layer. Here, h is the number of downsampling operations and d is the downsampling rate. The decomposition at layer i is given by Equation (5):
$$\tau_i = \mathrm{AvgPooling}(\tau_{i-1}) \tag{5}$$
Subsequently, the coarse-grained pattern $\tau_h$ is mixed into the fine-grained pattern $\tau_1$ through a residual feedforward network. Let $\xi_i$ denote the mixed data at layer i, with the recursion starting at the coarsest scale, $\xi_h = \tau_h$. The mixing at layer i is given by Equation (6):
$$\xi_i = \tau_i + \mathrm{MLP}(\xi_{i+1}) \tag{6}$$
After mixing across all scales, the fused multi-scale information $\xi_1$ is obtained. The output for a specific channel is $u = \xi_1 \in \mathbb{R}^{1 \times L}$, where L is the input sequence length.
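The decomposition and coarse-to-fine mixing described above can be sketched for a single channel as follows. This is a minimal, dependency-free illustration rather than the trained module: the learned MLP is replaced by a simple repetition upsampler, the recursion is assumed to start from the coarsest pattern, and the sequence length is assumed to be divisible by d^(h-1). All function names are hypothetical.

```python
def avg_pool(seq, d):
    """Non-overlapping average pooling with window and stride d."""
    return [sum(seq[i:i + d]) / d for i in range(0, len(seq) - d + 1, d)]


def upsample(seq, d):
    """Stand-in for the learned MLP (assumption): repeat each value d times."""
    return [v for v in seq for _ in range(d)]


def mdm(x, h=3, d=2):
    """Decompose x into h scales, then mix them from coarse to fine."""
    taus = [list(x)]                        # tau_1 is the raw channel
    for _ in range(h - 1):
        taus.append(avg_pool(taus[-1], d))  # tau_i = AvgPooling(tau_{i-1})
    xi = taus[-1]                           # assumed start: xi_h = tau_h
    for tau in reversed(taus[:-1]):
        up = upsample(xi, d)                # placeholder for MLP(xi_{i+1})
        xi = [a + b for a, b in zip(tau, up)]
    return taus, xi


taus, u = mdm([1.0, 3.0, 5.0, 7.0, 2.0, 4.0, 6.0, 8.0], h=3, d=2)
print(taus[1], taus[2], u)
```

The cost of each pooling and mixing step is proportional to the length of the sequence it processes, which is the linear scaling referred to in Section 3.1.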
Within our theoretical framework, Equation (7) defines the decomposition loss, which has two terms. In the first term (reconstruction error), $x_t$ is the original industrial flow sequence (dimensions $T \times C$, where T is the number of time steps and C the number of channels), $S = \{\text{daily}, \text{weekly}, \text{seasonal}\}$ is the set of time scales, and $x_t^{(s)}$ is the decomposed component at scale s; the constraint $\sum_{s} x_t^{(s)} \approx x_t$ requires the components to sum to the original sequence. The second term (smoothness regularization) uses the total variation $\mathrm{TV}(x_t^{(s)}) = \sum_{t=1}^{T-1} \left| x_{t+1}^{(s)} - x_t^{(s)} \right|$, which keeps noise out of the decomposed components by enforcing smoothness within every scale.
$$\mathcal{L}_{\text{decomp}}(\Phi_D) = \sum_{i=1}^{N} \left\| x_t - \sum_{s \in S} x_t^{(s)} \right\|_2^2 + \alpha \sum_{s \in S} \mathrm{TV}\left(x_t^{(s)}\right) \tag{7}$$
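To make the two terms of the decomposition loss concrete, the sketch below evaluates them for one single-channel sample. The component values and the weight α = 0.1 are illustrative assumptions, the outer sum over samples is omitted, and the function names are hypothetical.

```python
def total_variation(component):
    """Sum of absolute differences between consecutive time steps."""
    return sum(abs(b - a) for a, b in zip(component, component[1:]))


def decomposition_loss(x, components, alpha=0.1):
    """Reconstruction error plus alpha times the total variation of each scale."""
    recon = [sum(vals) for vals in zip(*components.values())]
    reconstruction = sum((xt - rt) ** 2 for xt, rt in zip(x, recon))
    smoothness = sum(total_variation(c) for c in components.values())
    return reconstruction + alpha * smoothness


# Toy sample (assumed values): the three components sum exactly to x.
x = [1.0, 2.0, 4.0, 3.0]
components = {
    "daily": [0.5, 1.0, 2.0, 1.0],
    "weekly": [0.5, 0.5, 1.0, 1.0],
    "seasonal": [0.0, 0.5, 1.0, 1.0],
}
loss = decomposition_loss(x, components, alpha=0.1)
print(loss)
```

Because the toy components reconstruct x exactly, the loss here consists only of the smoothness penalty.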
The key strength of the MDM module is that it turns decomposition from an isolated preprocessing step, dependent on human expertise and preset parameters, into a component that is coupled with the forecasting task, learned adaptively, and optimized end to end. This capability underpins its performance and robustness on complex, multiscale, multivariate, and heterogeneous real-world time series. In industrial settings, the daily scale reflects production shifts, the weekly scale reflects maintenance cycles, and the seasonal scale reflects order fluctuations. Traditional EMD/STL decompositions follow fixed rules that cannot be learned, whereas DPDIT makes the decomposition itself learnable and optimizes it jointly with the forecasting task.

3.4. Dual Dependency Interaction Block

Time series data frequently exhibit latent dynamic interactions and information propagation effects across multiple scales. However, existing time series models such as ScaleFormer [47] and TimeMixer [48] do not adequately capture these complex, high-order cross-scale interactions. Multiscale time series modeling therefore needs to represent the interdependencies across time scales explicitly.
The Dual Dependency Interaction Block [41] has been developed to identify latent dynamic interactions across disparate temporal scales within time series data while incorporating dependencies across both temporal and channel dimensions. The workflow encompasses the following steps:
Input Preparation: The DDI module first receives the per-channel aggregated information $u \in \mathbb{R}^{1 \times L}$ from the MDM (Multi-Scale Decomposable Mixing) module and stacks it into a matrix $U \in \mathbb{R}^{C \times L}$, where C is the number of channels and L the sequence length. A patching operation then transforms U into $\hat{U} \in \mathbb{R}^{C \times N \times P}$, where N is the number of patches and P the number of time steps per patch. $\hat{V}_t^{t+p}$ denotes the embedding output of the residual network, and $\hat{U}_t^{t+p}$ denotes the aggregated information patch from the MDM.
Temporal Mixing: DDI employs a multi-layer perceptron (MLP) whose parameters are shared across time steps. This MLP aggregates information from the channels along the temporal dimension, capturing temporal correlations. The result of this stage is $Z_t^{t+p}$.
Channel Mixing: After a transposition that swaps the channel and patch dimensions, the DDI module applies a second MLP that is shared across channels. For each time patch, this MLP integrates inter-channel information, capturing dependencies between channels.
Output and Residual Connection: Finally, DDI performs a splitting operation that separates the combined information into per-channel outputs $v \in \mathbb{R}^{1 \times L}$. The residual connection within the module is critical: it lets DDI exploit cross-channel dependencies while preserving its ability to capture temporal dependencies. Through this dual mechanism of temporal mixing and channel mixing, DDI processes and integrates the spatio-temporal interactions of multi-scale features within a unified framework.
The interaction applied to $\hat{U}_t^{t+p}$ is expressed by Equations (8) and (9):
$$Z_t^{t+p} = \hat{U}_t^{t+p} + \mathrm{MLP}\left(\hat{V}_{t-p}^{t}\right) \tag{8}$$
$$\hat{V}_t^{t+p} = Z_t^{t+p} + \beta \cdot \mathrm{MLP}\left(\left(Z_t^{t+p}\right)^{T}\right)^{T} \tag{9}$$
Here, $A^{T}$ denotes the transpose of the matrix A. The learnable scaling factor $\beta$ dynamically balances the emphasis on temporal and cross-channel dependencies within the model. This enables adaptive noise suppression and optimized integration of both dependencies, particularly when inter-variable correlations are low.
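The patch-wise recursion of Equations (8) and (9) can be sketched as follows. This is a simplified illustration under stated assumptions: each MLP is replaced by a single linear map (a P × P matrix for temporal mixing and a C × C matrix for channel mixing), the history before the first patch is taken to be zero, and β is fixed rather than learned. All names are hypothetical.

```python
def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def add(a, b, scale=1.0):
    """Element-wise a + scale * b."""
    return [[x + scale * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def ddi(patches, w_time, w_chan, beta):
    """patches: list of C x P matrices; returns one C x P output per patch."""
    C, P = len(patches[0]), len(patches[0][0])
    v_prev = [[0.0] * P for _ in range(C)]       # assumed zero history
    outputs = []
    for u in patches:
        z = add(u, matmul(v_prev, w_time))       # Eq. (8): temporal mixing
        mixed = transpose(matmul(transpose(z), w_chan))  # linear stand-in for MLP(Z^T)^T
        v_prev = add(z, mixed, beta)             # Eq. (9): channel mixing
        outputs.append(v_prev)
    return outputs


patches = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
w_time = [[0.5, 0.0], [0.0, 0.5]]   # toy temporal map (assumed)
w_chan = [[0.0, 1.0], [1.0, 0.0]]   # toy channel map: swaps the two channels
outputs = ddi(patches, w_time, w_chan, beta=0.5)
print(outputs)
```

Setting beta to 0 in this sketch removes the channel path entirely, which mirrors how the scaling factor can suppress cross-channel mixing when inter-variable correlations are weak.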
Equation (10) gives the interaction loss component of our framework. One part of it is realized by the temporal-spatial mixing of the DDI module; the other part is described in the next section on CrossLinear.
$$\mathcal{L}_{\text{interact}}(\Phi_I) = \sum_{t=1}^{T} \left\| y_t - f_{\text{cross}}\left(x_t^{\text{daily}}, x_t^{\text{weekly}}, x_t^{\text{seasonal}}\right) \right\|_2^2 \tag{10}$$
The primary benefit of the DDI module is its capacity to flexibly and adaptively capture and integrate complex multi-scale, temporal, and channel dependencies within time series data. The model employs intelligent noise suppression and residual learning mechanisms to deliver a robust and information-rich feature representation, thereby enhancing the performance of time series analysis and forecasting tasks.

3.5. CrossLinear

The CrossLinear [49] model is an innovative linear-based forecasting approach that addresses the inefficiency and overfitting issues of traditional models by integrating a plug-and-play cross-correlation embedding module. The module’s versatility has established it as a crucial plugin for forecasting tasks across diverse domains. Consequently, this study utilizes an advanced CrossLinear model to generate final predictions. The overarching framework is delineated in Figure 2.
To enhance training stability and reduce non-stationarity, two modules were added as preprocessing and postprocessing steps: instance normalization and denormalization. Instance normalization standardizes each input sample using that sample’s own mean and variance. Denormalization restores the model output to the original mean and variance using the same statistics. This keeps the input distribution stable while preserving the original scale.
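The normalization and denormalization pair described above can be sketched for one input window as follows. The small constant eps and the function names are assumptions introduced for the illustration.

```python
import math


def instance_normalize(window, eps=1e-5):
    """Standardize one sample with its own mean and variance."""
    mean = sum(window) / len(window)
    var = sum((v - mean) ** 2 for v in window) / len(window)
    std = math.sqrt(var + eps)          # eps (assumed) guards against zero variance
    return [(v - mean) / std for v in window], mean, std


def denormalize(values, mean, std):
    """Restore values to the scale of the sample they came from."""
    return [v * std + mean for v in values]


window = [10.0, 12.0, 14.0, 16.0]
normed, mean, std = instance_normalize(window)
restored = denormalize(normed, mean, std)
print(normed, restored)
```

In the full model, the statistics computed on the input window would be reused to denormalize the forecast rather than the input itself.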
Cross-Correlation Embedding: A plug-and-play module designed to capture dependencies between variables.
$$X_{1:T,1}^{emb} = \alpha \cdot X_{1:T,1}^{endo} + (1 - \alpha) \cdot X_{1:T,1}^{cross} \tag{11}$$
In these formulas, “endo” denotes the endogenous variable and “exo” the exogenous variables. The normalized exogenous and endogenous variables are stacked along the variable dimension. A one-dimensional convolution then treats the variable dimension as its channel dimension and slides along the time dimension.
$$X_{1:T,1}^{cross} = \mathrm{Conv1D}\left(\mathrm{Stack}\left(X_{1:T,N-1}^{exo}, X_{1:T,1}^{endo}\right)\right) \tag{12}$$
The output $X^{cross}$ is a weighted mixture of the variables at each time step, with weights learned by the convolution kernel; it can be regarded as a condensed representation of cross-variable correlations. The learnable parameter α balances the contributions of the endogenous and cross-correlated variables.
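The cross-correlation embedding can be sketched as follows. This is an illustration under stated assumptions rather than the reference implementation: the convolution uses zero padding so that the output keeps length T, the kernel has size 3, and the kernel weights and α are fixed toy values instead of learned parameters. The function names are hypothetical.

```python
def conv1d_same(stacked, kernel):
    """stacked: N x T series, kernel: N x K (odd K). Returns a length-T output."""
    n_vars, T = len(stacked), len(stacked[0])
    K = len(kernel[0])
    pad = K // 2
    out = []
    for t in range(T):
        acc = 0.0
        for n in range(n_vars):            # variables act as input channels
            for k in range(K):             # the kernel slides along time
                idx = t + k - pad
                if 0 <= idx < T:           # zero padding outside the series
                    acc += kernel[n][k] * stacked[n][idx]
        out.append(acc)
    return out


def cross_correlation_embedding(exo, endo, kernel, alpha):
    """Blend the endogenous series with the convolved stack of all variables."""
    stacked = exo + [endo]
    cross = conv1d_same(stacked, kernel)
    return [alpha * e + (1 - alpha) * c for e, c in zip(endo, cross)]


exo = [[1.0, 2.0, 3.0, 4.0]]               # one exogenous variable (toy values)
endo = [2.0, 2.0, 2.0, 2.0]
kernel = [[0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]  # toy kernel: copies the exogenous series
emb = cross_correlation_embedding(exo, endo, kernel, alpha=0.5)
print(emb)
```

With alpha equal to 1 the embedding reduces to the endogenous series, so the module can fall back to a purely univariate input when cross-variable information is unhelpful.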
Second, patch embedding, originally derived from vision transformers [50], has been widely applied in transformer-based and linear models such as PatchTST [51] and PatchMLP [52] to capture short-term temporal dependencies, reduce parameter count, and mitigate overfitting [53]. The patching process is defined as follows:
$$\left\{P_1^{endo}, P_2^{endo}, \ldots, P_k^{endo}\right\} = \mathrm{Patchify}\left(X_{1:T,1}^{emb}\right) \tag{13}$$
Here, P denotes the patch length, and $k = T / P$ represents the total number of patches. Each patch $P_i^{endo} \in \mathbb{R}^{1 \times P}$ corresponds to a segment of the input sequence.
$$P^{endo} = \beta \cdot \mathrm{Projection}_1\left(P_1^{endo}, \ldots, P_k^{endo}\right) + (1 - \beta) \cdot PE \tag{14}$$
Positional Embedding and Optimized Forecasting Head: To incorporate positional information and enhance robustness, we employ positional embedding, a technique commonly used in Transformer architectures. An optimized forecasting head then captures long-term temporal dependencies and generates the backbone model’s output $\hat{X}_{T+1:T+S,1}^{endo}$:
$$\hat{X}_{T+1:T+S,1}^{endo} = \mathrm{Projection}_2\left(\mathrm{Concat}\left(P^{endo}\right)\right) \tag{15}$$
The patches are first mapped into a d-dimensional space (d is a hyperparameter) by $\mathrm{Projection}_1(\cdot)$. The resulting embeddings are then combined with the positional embedding $PE \in \mathbb{R}^{k \times d}$, weighted by the learnable parameter β. Finally, the embeddings are concatenated and processed by $\mathrm{Projection}_2(\cdot)$ to generate the final output.
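The patching, projection, and forecasting steps can be sketched as follows. This is a simplified illustration: both projections are plain linear maps with toy weights, the positional embedding and β are fixed values rather than learned parameters, and patches are non-overlapping. The function names are hypothetical.

```python
def patchify(x, P):
    """Split a series into non-overlapping patches of length P."""
    return [x[i:i + P] for i in range(0, len(x) - P + 1, P)]


def linear(vec, weight):
    """Apply a weight matrix given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def forecast(x_emb, P, w1, pe, beta, w2):
    """Patch, embed with positional information, concatenate, and project."""
    tokens = []
    for patch, pos in zip(patchify(x_emb, P), pe):
        proj = linear(patch, w1)                       # Projection_1: R^P -> R^d
        tokens.append([beta * a + (1 - beta) * b for a, b in zip(proj, pos)])
    flat = [v for tok in tokens for v in tok]          # Concat
    return linear(flat, w2)                            # Projection_2: R^(k*d) -> R^S


x_emb = [1.0, 2.0, 3.0, 4.0]
w1 = [[1.0, 1.0]]                            # toy Projection_1 with d = 1
pe = [[0.0], [1.0]]                          # toy positional embedding, k = 2
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # toy Projection_2 with S = 3
y_hat = forecast(x_emb, P=2, w1=w1, pe=pe, beta=0.5, w2=w2)
print(y_hat)
```

The sketch keeps the head linear for brevity; replacing the final `linear` call with a one-hidden-layer MLP would correspond to the nonlinear head described in the text.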
Purely linear prediction heads are incapable of effectively capturing the inherent nonlinear trends, periodicity, and nonlinear interactions between variables within time series data. Consequently, the single linear layer was substituted with a very shallow multi-layer perceptron (MLP) containing only one hidden layer and a nonlinear activation function (GELU). This enhancement significantly improves the model’s capacity to detect complex nonlinear patterns.
We employ CrossLinear’s cross-scale interactions to complete the remaining portion of the interaction loss in Equation (10) (as mentioned in Section 3.4).

3.6. Dream Optimization Algorithm

To enhance the model’s accuracy and efficiency, this paper employs the Dream Optimization Algorithm (DOA) to optimize model hyperparameters. DOA, proposed by Lang et al. in 2025 [46], draws inspiration from human dreams. Dreams exhibit partial memory retention, forgetting, and logical self-organization, which closely resemble the optimization process of metaheuristic algorithms. DOA combines a basic memory strategy with a forgetting and replenishment mechanism to balance exploration and exploitation, and it adds a dream-sharing strategy to help escape local optima. The optimization process is divided into two phases, exploration and exploitation, a division that has been shown to yield satisfactory optimization results.
During initialization, a random population is generated within the search space to start the optimization. In the exploration phase (iterations 0 to $T_d$), the population is divided into five groups according to their “memory capacity.” Individuals with poor memory forget more but explore more strongly, while those with strong memory forget less but explore less; memory capacity increases progressively from Group 1 to Group 5. Each iteration is regarded as a “dreaming” process whose objective is to approach the optimal solution through continuous iteration. During the exploitation phase (iterations $T_d$ to $T_{max}$), no grouping is performed. Before each dream, the best dreamer (i.e., the best individual) from the previous iteration is presented to all individuals, and each individual then updates its position in the forgetting dimensions. All individuals in the population share the same number of forgetting dimensions, denoted $K_r$: $K_r$ dimensions $(K_1, K_2, \ldots, K_{K_r})$ are selected at random from the original dimensions, and the individual’s positions in those dimensions are updated.
Lang et al. [46] conducted extensive numerical experiments and, considering algorithm stability and applicability, set the DOA parameters as follows:
$$T_d = \frac{9}{10} \times T_{max} \tag{16}$$
$T_d$ denotes the maximum iteration count of the exploration phase, while $T_{max}$ represents the overall maximum iteration count.
$$K_q = \mathrm{rand}\left(\frac{Dim}{8 \times q}, \max\left(2, \frac{Dim}{3 \times q}\right)\right), \quad q = 1, 2, 3, 4, 5 \tag{17}$$
The notation rand(a, b) indicates a random integer selected within the range a to b. $K_q$ is the number of dimensions forgotten in the qth exploration group, and Dim denotes the problem dimension.
$$K_r = \mathrm{rand}\left(2, \max\left(2, \frac{Dim}{3}\right)\right) \tag{18}$$
$K_r$ represents the number of dimensions forgotten during the exploitation phase. The parameter u controls the ratio between the forget-and-refresh strategy and the dream-sharing strategy during the exploration phase: if rand is less than or equal to u, the forget-and-refresh strategy is applied; otherwise, the dream-sharing strategy is executed. The value of u can be set to 0.9.
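The parameter settings above can be sketched as follows. The sketch covers only the schedule and the sampling of forgetting dimensions, not the position updates. Rounding the fractional bounds up to integers and truncating T_d to an integer are assumptions made here so that the random draws are well defined; the function names are hypothetical.

```python
import math
import random


def exploration_iterations(t_max):
    """T_d = 9/10 * T_max, truncated to an integer (assumption)."""
    return (9 * t_max) // 10


def forgetting_dims_exploration(dim, q, rng):
    """K_q for exploration group q in 1..5; bounds are rounded up (assumption)."""
    low = math.ceil(dim / (8 * q))
    high = max(2, math.ceil(dim / (3 * q)))
    return rng.randint(low, high)


def forgetting_dims_exploitation(dim, rng):
    """K_r for the exploitation phase; the upper bound is rounded up (assumption)."""
    return rng.randint(2, max(2, math.ceil(dim / 3)))


def choose_strategy(rng, u=0.9):
    """Pick the exploration strategy by comparing a uniform draw with u."""
    return "forget_and_refresh" if rng.random() <= u else "dream_sharing"


rng = random.Random(0)
dim = 16                                    # illustrative problem dimension
t_d = exploration_iterations(200)
k_values = [forgetting_dims_exploration(dim, q, rng) for q in range(1, 6)]
k_r = forgetting_dims_exploitation(dim, rng)
print(t_d, k_values, k_r, choose_strategy(rng))
```

Because the upper bound shrinks as q grows, later groups tend to forget fewer dimensions, which matches the description of memory capacity increasing from Group 1 to Group 5.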

3.7. MSDI-CrossLinear Model Based on Dream Optimization

The overall framework of the dream-optimized MSDI-CrossLinear model is illustrated in Figure 1. The model’s components have been introduced in the preceding sections. MSDI-CrossLinear is a linear prediction model based on multiscale decomposition and dual dependence. It comprises the MDM module, DDI module, CrossLinear module, and a parameter optimization component based on DOA.
First, the multidimensional raw sensor signals undergo preprocessing. Time-series samples composed of the sensor signals and labels are then fed into the MDM module, where multi-scale information is extracted through average pooling and feedforward propagation. The information from each scale is passed to the DDI module for temporal and channel mixing, with outputs joined by residual connections. The DDI output then enters the CrossLinear module, which produces the prediction. After each epoch, the Adam optimizer updates the model parameters for the next training cycle. Training stops once the specified termination criteria are met, and the model is saved. DOA then uses the resulting MSE as its objective to tune key model dimensions, the learning rate, and the fully connected network dimensions. This cycle is repeated until an optimal model is obtained and saved. Finally, the saved model is applied to the validated data to inform experimental design and support predictions in practical applications.
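Finally, Equation (19) presents the last and most crucial component of the framework, the causal constraint loss:
$$\mathcal{L}_{\text{causality}} = \sum_{s_i, s_j \in S} \mathbb{1}\left[s_i < s_j\right] \cdot \max\left(0, \left\| \frac{\partial \hat{y}_t^{s_j}}{\partial x_t^{s_i}} \right\| - \epsilon\right) \tag{19}$$
The indicator function $\mathbb{1}[s_i < s_j]$ selects pairs in which $s_i$ is a finer temporal scale than $s_j$ (e.g., daily < weekly < seasonal). The term $\partial \hat{y}_t^{s_j} / \partial x_t^{s_i}$ is a sensitivity that measures how strongly coarse-scale predictions depend on fine-scale inputs; an industrial interpretation is the “impact of daily failures on weekly production capacity.” This paper employs the following causal constraint:
$$\mathcal{L}_{\text{causality}} = \underbrace{\sum_{t} \left\| \frac{\partial y_t}{\partial x_{t+k}} \right\|}_{\text{prohibits future use}} + \underbrace{\sum_{s \in \text{coarse}} \left\| \frac{\partial y_{\text{fine}}}{\partial x_{\text{coarse}}} \right\|}_{\text{prohibits reverse causality}}$$
This constraint prohibits coarse-scale influences on the fine scale and prevents future data from affecting past data.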

4. Experiments

The effectiveness of the model is validated on the dataset from the FedCSIS 2020 Challenge [54], which contains workloads from monitored devices, and the model is compared with other state-of-the-art approaches. Section 4.1 details the dataset and the data preprocessing methods. Section 4.2 describes the evaluation metrics used to assess traffic prediction performance. Section 4.3 presents the ablation studies, the comparison with other state-of-the-art methods, and the prediction performance of the proposed model on the test set. All experiments in Section 4 were conducted in the same environment: the software consists of Python 3.10 and PyTorch 2.0.0 + CU118, and the hardware consists of an Intel Xeon W-2245 central processing unit (Intel, Santa Clara, CA, USA) and an NVIDIA Quadro RTX 5000 graphics processing unit (Nvidia, Santa Clara, CA, USA).

4.1. Data Description and Preprocessing

The dataset used in this study originates from the FedCSIS 2020 Challenge. This dataset integrates traffic monitoring logs from multiple device sources. This research utilizes the dataset to simulate traffic variations in gateway devices within industrial scenarios. During experimentation, we randomly selected ten thousand device components, each possessing over 1900 records. Model training utilized 80% of the load data, with the remaining 20% reserved for testing. Each record in the dataset detailed hourly traffic fluctuations for each device, encompassing metrics such as “average,” “scale,” “on,” “maximum,” “minimum,” “off,” and “volume.” The model’s effectiveness was validated by predicting the average hourly network traffic for each device the following day.
Table 3 reports a sensitivity analysis of linear interpolation. The Kullback–Leibler divergence between the traffic distribution of the filtered dataset and that of the original dataset is below 0.08, indicating that sampling bias is controllable. Regarding data quality and bias, Hampel filter detection reveals 3.2% outliers, caused primarily by network packet loss and device reboots. Data from the morning shift (8:00–16:00) account for 58% of the records, a temporal bias that was corrected via weighted sampling. We employed a time-weighted moving average (TWMA) with a window size of k = 5 to impute missing values. Given the short-term stationarity of IIoT traffic, TWMA aligns better with the actual dynamics than linear interpolation [55].
Next, the data are filtered to eliminate redundant information. Records that are unreasonable given the actual traffic capacity of the gateway devices are removed. This reduces noise interference during model training and thereby improves performance on the test dataset. For missing data (1.3% of total observations), linear interpolation is applied when the run of consecutive missing values is at most 3 h, while seasonal decomposition-based interpolation is used for longer gaps to minimize data loss. To evaluate the impact of interpolation strategies on forecast reliability, we compared three methods. As shown in the table, linear interpolation was selected because it offers the best balance between computational simplicity and retained prediction accuracy. Because different sensors operate on different scales, sensor values must be normalized. Normalization was performed using Equation (20), scaling all sensor values within each recipe to the range [0, 1].
$$x'_{i,j} = \frac{x_{i,j} - x_j^{\min}}{x_j^{\max} - x_j^{\min}} \tag{20}$$
Here, $x_{i,j}$ is the value of the jth feature for the ith sample, $x'_{i,j}$ is its normalized value, and $x_j^{\min}$ and $x_j^{\max}$ denote the minimum and maximum values of the jth feature, respectively.
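The per-feature min-max normalization can be sketched as follows. Mapping a constant feature to 0.0 is an assumption added to avoid division by zero; the function name and the sample values are illustrative.

```python
def min_max_normalize(rows):
    """Scale every feature (column) of a list of samples to the range [0, 1]."""
    columns = list(zip(*rows))
    mins = [min(col) for col in columns]
    maxs = [max(col) for col in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0   # constant feature -> 0.0 (assumed)
         for v, lo, hi in zip(row, mins, maxs)]
        for row in rows
    ]


samples = [[10.0, 1.0], [20.0, 3.0], [30.0, 2.0]]
normalized = min_max_normalize(samples)
print(normalized)
```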

4.2. Model Evaluation Metrics

This study uses the Mean Squared Error (MSE), calculated via Equation (21), to optimize the network parameters.
$$MSE = \frac{1}{m} \sum_{i=1}^{m} \left(y_i - \hat{y}_i\right)^2 \tag{21}$$
The Mean Absolute Error (MAE) assesses the accuracy of a model’s predictions and is calculated via Equation (22).
$$MAE = \frac{1}{m} \sum_{i=1}^{m} \left|y_i - \hat{y}_i\right| \tag{22}$$
Here, $\hat{y}_i$ is the predicted value, $y_i$ is the actual value, and m is the number of samples. The smaller the MAE and MSE of a prediction model, the higher its prediction accuracy.
Finally, R² is used to evaluate the model’s goodness of fit. Its value is at most 1: the closer it is to 1, the better the model, and the closer it is to 0, the worse the model. It is defined in Equation (23), where $\bar{y}$ is the mean of the actual values.
$$R^2 = 1 - \frac{\sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n} \left(y_i - \bar{y}\right)^2} \tag{23}$$
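The three metrics can be computed as in the following sketch. The function names and the toy values are illustrative and are not results from this study.

```python
def mse(y, y_hat):
    """Mean squared error."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)


def mae(y, y_hat):
    """Mean absolute error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def r2(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]
print(mse(y_true, y_pred), mae(y_true, y_pred), r2(y_true, y_pred))
```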

4.3. Experimental Results

The DOA-MSDI-CrossLinear model was compared with the following models. Support Vector Machine (SVM) is a traditional machine learning model. Random Forest and XGBoost are two tree-ensemble models commonly used in industry. LSTM is a variant of recurrent neural networks (RNNs) previously applied by Lu et al. to network traffic forecasting [56]; hybrid approaches combining convolutional neural networks (CNNs) with LSTMs have also yielded encouraging predictive performance. GRU is another RNN variant. DARNN is a model designed specifically for time series forecasting [57]. Seq2seq, which originated in natural language processing (NLP), has been adapted by some researchers for network traffic forecasting and applied in earlier time series forecasting studies [58]. Temporal Convolutional Networks (TCN) [59], a widely adopted time series forecasting method, are also selected as a baseline. PatchTST, proposed by Nie et al. [51] in 2023, segments time series into sub-sequence patches and models them with a channel-independent strategy. It is suited to traffic forecasting on single devices or devices with strong independence, long-term sequence prediction (>96 steps), edge deployment, and resource-constrained environments. TimesNet, proposed by Wu et al. [60] in 2023, transforms one-dimensional time series into two-dimensional tensors and uses 2D convolutions to capture intra-period and inter-period variation patterns. The results are reported in Table 1: lower Mean Squared Error (MSE) and Mean Absolute Error (MAE) values indicate better performance, while R² values close to 1 indicate higher accuracy. The best results for each metric are highlighted in bold.
Table 4 presents a performance comparison between DOA-MSDI-CrossLinear and nine benchmark methods, encompassing traditional machine learning, recurrent neural networks, convolutional approaches, and attention-based architectures. Traditional machine learning methods (RF, SVM, and XGB) achieved moderate R² values (0.809–0.908), but their mean squared error (MSE) was significantly higher (1.822–2.806) than that of the deep learning approaches. This gap stems from limited temporal modeling capability: these methods treat each prediction as an independent event and fail to capture the sequential dependencies inherent in traffic time series. The relatively strong performance of Random Forest (R² = 0.908) indicates that ensemble averaging partially compensates for this limitation by capturing feature interactions. These methods also rely on manually designed features, which may not fully capture the complexity of industrial traffic patterns. RNN-based methods (LSTM, GRU, and DARNN) performed inconsistently: LSTM achieved a competitive MSE (0.696) but a low R² value (0.908), while GRU and DARNN performed significantly worse. This inconsistency reveals that RNN architectures are highly sensitive to the learning rate, hidden dimension, and gradient clipping threshold; without systematic optimization (as provided by DOA in our approach), performance fluctuates considerably. Despite employing a two-stage attention mechanism, DARNN performs relatively poorly (R² = 0.836), indicating that attention mechanisms alone are insufficient to capture the multi-scale structure of industrial traffic, which is the very reason we propose an explicit decomposition method. TCN underperforms despite its theoretical advantages: Temporal Convolutional Networks offer parallelizable training and flexible receptive fields, yet they record the worst MSE (3.176) among the deep learning methods.
The exponentially increasing dilation factor of TCN assumes a specific temporal hierarchy that may not match the actual periodicity of industrial traffic (24 h and 168 h cycles). In addition, a standard TCN processes channel data independently and thus fails to capture cross-device correlations. The PatchTST model performs very well but is unsuitable for gateway-aggregated traffic with strong inter-device correlations and for complex industrial environments that require multi-scale pattern capture. Similarly, the TimesNet model performs very well by automatically identifying dominant cycles via FFT and reshaping 1D sequences into 2D, which aligns with the strong periodicity of industrial traffic (24 h diurnal cycles, 168 h weekly cycles, production shift cycles). However, its 2D convolutions and multi-period processing increase the computational load, limiting deployment on edge gateways and potentially causing inference delays beyond real-time requirements. Additionally, its reshape step relies on predefined period lengths and therefore cannot handle variable-period industrial scenarios (e.g., flexible production scheduling).
Our model achieves significant improvements (reducing MSE by 65.66% compared to LSTM and by 92.47% compared to TCN) owing to three synergistic factors. First, MDM separates scale-specific patterns before modeling, preventing interference between fine-grained noise and coarse-grained trends. Second, DDI simultaneously captures temporal autocorrelation and cross-channel synchrony, which is critical for industrial networks where devices on shared production lines often exhibit correlated behavior. Third, hyperparameters are optimized systematically: DOA’s mechanism for balancing exploration and exploitation identifies configurations unattainable through manual tuning or grid search. Furthermore, the R² value, serving as a comprehensive indicator of model robustness, demonstrates that DOA-MSDI-CrossLinear achieves a good balance between performance and stability. Given the stringent robustness requirements of IoT scenarios, selecting a model that delivers strong and stable prediction performance is paramount.
Ablation experiments were conducted to evaluate the contribution of each module in DOA-MSDI-CrossLinear. This section compares four variants against the full model, all under identical conditions: a CrossLinear model, an MDM-CrossLinear model, an MSDI model integrating the MDM and DDI modules, and an MSDI-CrossLinear model. These variants serve as benchmarks for the proposed model, and the results are displayed in Table 5.
The contribution of each component to overall performance is analyzed below.
CrossLinear alone (R² = 0.944, MSE = 1.077): The linear component on its own performs remarkably well, consistent with recent findings that linear models are highly competitive in time series forecasting [4]. However, it struggles with the sudden traffic spikes and abrupt mode shifts typical of industrial networks; it cannot capture these nonlinear dynamics, which explains the elevated error.
MSDI without CrossLinear (R² = 0.959, MSE = 0.672): Using only the MDM + DDI architecture, with no linear prediction, improves performance significantly. The 37.6% MSE reduction (1.077 → 0.672) indicates that explicitly separating time scales contributes most of the value.
MDM-CrossLinear vs. MSDI-CrossLinear: Adding the DDI module (which models device interactions) on top of decomposition yields a further 4.7% improvement (MSE: 0.730 → 0.696), which is a modest gain. Channel interactions matter, but multiscale decomposition does most of the work. This is relevant for practical deployment: on edge devices with limited compute, the DDI module could potentially be omitted while retaining more than 90% of the performance benefit.
DOA optimization effect (MSE: 0.696 → 0.239): This is the most striking result. Applying DOA to tune the hyperparameters reduces the error by 65.66% without any change to the architecture. For deployment in a real factory, we therefore recommend against implementing the full architecture with default parameters. Instead:
(1) Start with the multiscale decomposition (MDM), which delivers the largest gain for the least cost.
(2) Invest serious effort in hyperparameter tuning; our results show that it matters more than adding architectural complexity.
(3) Add the dual-dependency module (DDI) only if the computational budget allows and the extra 5% accuracy is needed.
The 65% improvement from optimization alone suggests that how the model is configured matters more than which components are attached to it. This lesson is rarely emphasized in academic papers, but it is critical for practitioners.
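The percentages quoted in the ablation discussion follow directly from the reported MSE values. The short sketch below reproduces that arithmetic; the helper name `relative_reduction` is introduced here for illustration only.

```python
def relative_reduction(before, after):
    """Percentage reduction in MSE when moving from `before` to `after`."""
    return 100.0 * (before - after) / before


# MSE values as quoted in the ablation discussion.
steps = {
    "CrossLinear -> MSDI": (1.077, 0.672),
    "MDM-CrossLinear -> MSDI-CrossLinear": (0.730, 0.696),
    "MSDI-CrossLinear -> DOA-MSDI-CrossLinear": (0.696, 0.239),
}
for name, (before, after) in steps.items():
    print(f"{name}: {relative_reduction(before, after):.2f}% lower MSE")
```

The three steps evaluate to roughly 37.6%, 4.7%, and 65.66%, matching the figures in the text.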
We tested the robustness of the model to degraded inputs (Table 6) by injecting Gaussian noise (σ ∈ {0.05, 0.1, 0.2}) and by randomly removing data (10–30% missing). With 10% of the data missing, the error increased by 5% (acceptable); with 30% missing, it increased by 23% (requiring interpolation). At typical industrial noise levels (σ ≤ 0.1), model performance degradation remained below 15%, validating its deployment readiness.
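A perturbation protocol of this kind can be sketched as follows. This is an illustrative assumption about how such a test might be set up, not the authors' test harness: the helper names, the use of `None` to mark missing points, linear interpolation as the repair step, and the synthetic series are all hypothetical. It also assumes σ is expressed in the units of the (possibly normalized) input series.

```python
import random


def add_gaussian_noise(series, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma`."""
    return [x + rng.gauss(0.0, sigma) for x in series]


def drop_randomly(series, missing_rate, rng):
    """Replace a random fraction of points with None (missing)."""
    n_missing = round(missing_rate * len(series))
    missing = set(rng.sample(range(len(series)), n_missing))
    return [None if i in missing else x for i, x in enumerate(series)]


def fill_linear(series):
    """Linearly interpolate gaps; copy the nearest known value at the edges.

    Assumes at least one point in `series` is not missing.
    """
    known = [i for i, x in enumerate(series) if x is not None]
    filled = list(series)
    for i, x in enumerate(series):
        if x is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = (1 - w) * series[left] + w * series[right]
    return filled


rng = random.Random(0)  # fixed seed so the perturbation is repeatable
clean = [float(i % 24) for i in range(240)]  # synthetic daily sawtooth
noisy = add_gaussian_noise(clean, sigma=0.1, rng=rng)
gappy = drop_randomly(clean, missing_rate=0.3, rng=rng)
repaired = fill_linear(gappy)
```

The perturbed series (`noisy`, `repaired`) would then be fed to the trained model and the resulting error compared with the error on `clean`.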
We compared the model against the baseline solution using paired t-tests across 30 independent runs. The results, presented in Table 7, demonstrate that all improvements are statistically significant (p < 0.001) with large effect sizes (d > 1.2), confirming genuine performance gains beyond random fluctuations.
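For readers unfamiliar with the procedure, the paired t statistic and a paired-sample effect size can be computed as in the sketch below. The effect size shown is the mean difference divided by the standard deviation of the differences; whether this is the exact variant of d reported in Table 7 is an assumption, and the error values are made up for illustration.

```python
import math
import statistics


def paired_t_and_cohens_d(baseline_errors, model_errors):
    """Return (t statistic, Cohen's d, degrees of freedom) for matched runs."""
    diffs = [b - m for b, m in zip(baseline_errors, model_errors)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    cohens_d = mean_diff / sd_diff  # paired-sample effect size
    return t_stat, cohens_d, n - 1


# Hypothetical per-run MSE values for four matched runs (not from the paper).
baseline = [1.0, 1.2, 0.9, 1.1]
proposed = [0.5, 0.6, 0.5, 0.5]
t_stat, d, df = paired_t_and_cohens_d(baseline, proposed)
print(f"t = {t_stat:.2f}, d = {d:.2f}, df = {df}")
```

Converting the t statistic to a p-value requires the Student t distribution with `df` degrees of freedom; in practice a statistics library such as SciPy (`scipy.stats.ttest_rel`) would typically be used for that step.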
Hyperparameters were optimized using the mean squared error (MSE) as the objective. The optimal learning rate, model dimension, and fully connected layer dimension are 0.00554, 228, and 96, respectively. The iteration process during optimization is illustrated in Figure 3.
As illustrated in Figure 4, the traffic simulation results of the DOA-MSDI-CrossLinear model vary according to the device used in the dataset.
Figure 4 presents a comparison between the traffic values predicted by the DOA-MSDI-CrossLinear model and the actual traffic conditions during the corresponding time period. It should be noted that the traffic monitoring spanned a continuous 24 h period. These cases illustrate that the proposed model possesses the capability to accurately predict overall traffic changes for devices. This capability enables the utilization of DOA-MSDI-CrossLinear in IIoT to facilitate precise forecasting of future network traffic for devices, thereby ensuring the expeditious allocation of resources. In addition, the proposed model generates relatively precise predictions for a variety of devices exhibiting entirely distinct traffic fluctuations, suggesting that the DOA-MSDI-CrossLinear model possesses significant capabilities for distinguishing between different devices.
Figure 4 demonstrates the prediction accuracy of the DOA-MSDI-CrossLinear model for three devices exhibiting typical flow characteristics: (a) Device A displays strong 24 h periodicity; (b) Device B exhibits irregular load peaks; (c) Device C shows gradual trend drift. The three cases below show how the model handles real-world challenges in industrial networks.
Device A: This device follows a predictable 24 h cycle, much like an assembly line executing the same production plan daily. The model tracks these daily rhythms with less than 5% error. The Multi-Scale Decomposable Mixing (MDM) module learns the device's routine and extrapolates it forward. This is the ideal scenario: when behavior is predictable, the model excels.
Device B: This device experiences sudden, irregular traffic surges. Even during these episodes, the model maintains high accuracy. The Dual Dependency Interaction (DDI) module models interactions between devices: seemingly random spikes are often triggered by events at other network nodes, so capturing cross-device correlations lets the model anticipate fluctuations that look unpredictable in isolation. A key limitation remains, however: for the most intense surges, predictions lag by 1–2 h. The model identifies these peaks only after they occur rather than in advance. This reflects a fundamental constraint, namely that truly abrupt changes lack learnable patterns.
Device C: This case presented a challenge: it exhibited both sudden spikes and a slow drift in baseline traffic (possibly due to device aging or plant expansion). The model addressed both phenomena simultaneously. MDM separated slow trends from rapid spikes across different timescales, while DDI adapted to the gradual evolution of “normal” behavior. Together, they address what we call non-stationary dynamics—scenarios where statistical properties evolve over time.
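The idea of separating a slow baseline drift from rapid spikes can be illustrated with a plain moving-average split. This is a deliberately simple stand-in and not the MDM module itself; the window length, function names, and synthetic series are assumptions made for the example.

```python
def moving_average(series, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    out = []
    for i in range(len(series)):
        lo, hi = max(0, i - half), min(len(series), i + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out


def decompose(series, window):
    """Split a series into a slow trend and a fast residual."""
    trend = moving_average(series, window)
    residual = [x - t for x, t in zip(series, trend)]
    return trend, residual


# Synthetic series: slow upward drift plus one sudden spike at index 100.
series = [0.01 * h for h in range(200)]
series[100] += 5.0
trend, residual = decompose(series, window=25)
spike_index = max(range(len(residual)), key=residual.__getitem__)
print(spike_index)
```

The trend component absorbs the drift, while the spike remains concentrated in the residual, so each component can be modeled at its own time scale.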
The deployment’s key achievement lies in a single model successfully adapting to three devices with vastly different behaviors using identical parameter settings. In real factories with hundreds of devices, manually tuning individual models for each unit is impractical. This cross-device generalization capability, requiring no device-specific customization, is fundamental to achieving scalability. All devices exhibit a 1–2 h prediction lag for sudden peaks. Time series models identify patterns in historical data, but true sudden interruptions (unexpected surges with no warning) lack discernible patterns. For such issues, we can consider abandoning precise peak timing predictions and instead integrate anomaly detection modules to flag “high-risk periods” when peak conditions are ripe. This shifts the prediction focus toward risk assessment.
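One simple way to flag high-risk periods is a trailing z-score detector, sketched below. This is only one possible choice of anomaly detection module and is not part of the evaluated model; the window length, threshold, and synthetic data are illustrative assumptions.

```python
import statistics


def flag_high_risk(series, window, threshold):
    """Flag points whose z-score against the trailing window exceeds `threshold`."""
    flags = [False] * len(series)
    for i in range(window, len(series)):
        history = series[i - window:i]
        mean = statistics.mean(history)
        std = statistics.pstdev(history)
        # Skip flat windows to avoid dividing by zero.
        if std > 0 and (series[i] - mean) / std > threshold:
            flags[i] = True
    return flags


# Synthetic traffic that alternates between 10 and 11, then jumps to 30.
traffic_b = [10.0 + (i % 2) for i in range(48)] + [30.0]
flags = flag_high_risk(traffic_b, window=24, threshold=3.0)
print(flags.index(True))
```

A detector of this kind reacts to the onset of a surge rather than forecasting its exact timing, which is consistent with shifting the goal from peak prediction to risk assessment.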
We decompose prediction errors into the following three categories:
(1) Scale mismatch error (34% of total error): Occurs during production mode transitions (e.g., shift handover). The root cause is that fixed decomposition windows cannot adapt to irregular scheduling. Using adaptive windows can reduce this error.
(2) Cross-scale propagation error (28%): Hourly anomalies (e.g., equipment failures) trigger cascading effects in daily forecasts. The CrossLinear model assumes smooth propagation and ignores sudden failures, leading to this error. Adding an anomaly detection layer can reduce propagation errors.
(3) Model capacity error (38%): Occurs during unprecedented traffic patterns (e.g., new equipment integration). The root cause lies in training data lacking extreme scenarios. Augmenting the training set with synthetic sudden-change data can reduce this error.
Overall, the model exhibits graceful confidence decay under stress, enabling risk-aware decision-making in industrial control systems.

4.4. Cross-Study Comparisons and In-Depth Discussions

Among recent studies on the FedCSIS 2020 dataset, Wang et al. [61] designed the Flow2graph method for network traffic prediction. This approach converts network traffic sequences into key segments and employs traffic transformation graph techniques to detect time-varying network traffic patterns, significantly enhancing the resource efficiency of network traffic management. In contrast, the proposed method demonstrates superior performance under resource-constrained conditions, offering both better model fit and interpretability. Ruta et al. [62] developed and pre-trained a universal 3-layer bidirectional LSTM regression network that produced the most accurate hourly predictions of weekly workload time series across thousands of diverse network devices. Unlike such black-box models, our proposed multi-scale dependency integration mechanism constructs an attention-based interpretable framework, which is crucial for industrial deployments requiring decision transparency. Furthermore, the model addresses uncertainty quantification [63,64], which is likewise vital for such deployments.
Based on our findings and limitations, we identify several promising directions for future work as follows:
(1) Uncertainty quantification: Extending DOA-MSDI-CrossLinear with probabilistic outputs would enhance its utility for risk-sensitive industrial applications [65].
(2) Online adaptation: Developing incremental learning variants that adapt to concept drift without full retraining would address the stationarity assumption limitation [66].
(3) Multi-task learning: Jointly modeling traffic forecasting with related tasks (anomaly detection, remaining useful life prediction) could improve performance through shared representations [32,67].
(4) Federated learning: Enabling collaborative model training across multiple industrial sites without sharing raw data would address privacy concerns while leveraging larger effective datasets [68].
(5) Explainability enhancement: Developing visualization tools that map model predictions to operational semantics would further improve interpretability for non-expert operators [16].

4.5. Model Visualization and Parameter Analysis

To evaluate the feasibility of the proposed DOA-MSDI-CrossLinear model in practical industrial applications, we conducted comprehensive benchmarking tests on its computational complexity, storage requirements, and inference latency. Experimental data demonstrate that the model exhibits significant lightweight characteristics. The total number of model parameters is only 10,669, and the model file size is merely 0.04 MB. Compared to traditional deep learning models, the parameter scale of this model has been reduced by several orders of magnitude. This outcome strongly validates our design philosophy: by replacing deep stacked nonlinear layers with the CrossLinear module and efficiently extracting multiscale features using MDM and DDI modules, we substantially reduce structural redundancy while preserving expressive power. This minimal storage footprint enables seamless deployment on memory-constrained Industrial Internet of Things (IIoT) edge gateway devices without relying on costly cloud computing resources. In terms of inference speed, the model demonstrates exceptional real-time responsiveness. Benchmark results show an average inference latency of just 2.91 milliseconds (±1.57 milliseconds) at a batch size of 128. Converted to throughput, the model can process up to 43,941.39 samples per second. This performance metric is critical for industrial traffic forecasting scenarios. Industrial production environments typically demand millisecond-level fault response and flow scheduling. With an inference latency under 3 milliseconds, this model can complete future flow predictions within an extremely short time window, allowing ample time for downstream scheduling decisions. Furthermore, its minimal memory consumption of just 0.74 MB further demonstrates computational efficiency, ensuring it does not consume excessive system resources even under high-load scenarios involving parallel multi-task processing.
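The reported footprint and throughput can be sanity-checked with simple arithmetic. The sketch below assumes 32-bit floating-point parameters, which the text does not state explicitly, and treats the reported latency as the time per batch of 128 samples.

```python
def model_size_mb(n_params, bytes_per_param=4):
    """Approximate weight storage in MB, assuming float32 parameters."""
    return n_params * bytes_per_param / (1024 ** 2)


def throughput(batch_size, latency_ms):
    """Samples per second implied by a per-batch latency in milliseconds."""
    return batch_size / (latency_ms / 1000.0)


print(f"{model_size_mb(10_669):.2f} MB")
print(f"{throughput(128, 2.91):,.0f} samples/s")
```

Under these assumptions, 10,669 parameters occupy about 0.04 MB, in line with the reported file size. The mean latency implies roughly 44,000 samples per second, close to the reported 43,941.39; the small gap is what one would expect if throughput was averaged over runs rather than derived from the mean latency.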
To investigate the internal mechanisms of the model during the feature extraction stage, we visualized the weights of key linear layers within the module (as shown in Figure 5). The heatmap displays a 63 × 55 weight matrix, where the x-axis represents input feature indices and the y-axis corresponds to output feature dimensions. Color intensity reflects weight magnitude: red regions indicate positive activation, blue regions denote negative suppression, while light-colored areas suggest weaker influence. As shown, the weight distribution exhibits pronounced non-sparsity and global dependencies. Specifically, most input features are not simply discarded (i.e., weights close to zero) but exert broad influence on the output dimension through complex combinations of positive and negative weights. This indicates that during value embedding, the model relies not only on single time steps or local features but also tends to capture global interaction patterns within the input sequence. Notably, multiple high-intensity activation points (deep red or deep blue pixels, e.g., near input indices 8, 22, and 36) are scattered throughout the heatmap. These “hotspots” reveal key information nodes identified by the model during feature transformation. Such distribution patterns demonstrate the model’s ability to filter high-value information from raw sequences while suppressing noise, thereby validating the effectiveness of this linear mapping layer in multivariate time series feature fusion.

5. Conclusions

This paper proposes the theory-based industrial gateway traffic prediction framework DOA-MSDI-CrossLinear, aiming to address the critical issue of traffic prediction for industrial gateway devices in industrial IoT. Beyond architectural integration, we redefine traffic prediction as a hierarchical decomposition–interaction problem, restructuring it into a two-stage process: first, separating scale-specific patterns through adaptive decomposition; then, modeling scale-adaptive dependencies via decoupled interaction paths. This conceptual framework provides a foundational principle for designing prediction models aligned with industrial systems’ hierarchical temporal structures. Extensive experiments on the FedCSIS 2020 challenge dataset demonstrate that the DOA-MSDI-CrossLinear model achieves industry-leading prediction performance while maintaining high interpretability and scalability. Its compact parameter space significantly reduces retraining costs, facilitating online learning strategies. This enables local fine-tuning to adapt to concept drift in traffic patterns, ensuring long-term reliability in dynamic industrial environments. We acknowledge limitations including dataset specificity, hyperparameter sensitivity, and stationarity assumptions. Future work will address these through multi-dataset validation, automated hyperparameter adaptation, and online learning extensions. Furthermore, integrating uncertainty quantification with federated learning capabilities will enhance the model’s practical applicability.
The model demonstrates moderate generalization capabilities for similar periodic systems (energy), but struggles with irregular patterns (building occupancy rates). This method is suitable for scenarios where traffic data exhibits multi-scale periodicity (daily/weekly cycles), systems operate under resource-constrained environments (<512 MB memory), or interpretability is required (e.g., regulatory compliance). Future improvements could explore adaptive scale detection based on spectral analysis, nonlinear decomposition via neural differential equations, or graph theory-driven multi-device correlation modeling.
As industrial systems become increasingly intelligent and interconnected, precise traffic flow prediction is critical for capacity planning, anomaly detection, and resource optimization. By providing a deployable, interpretable, and highly accurate forecasting solution, DOA-MSDI-CrossLinear contributes to the grand goal of achieving trustworthy AI in industrial applications—systems that not only perform exceptionally well but can also be understood, validated, and maintained by domain experts. The proposed DOA-MSDI-CrossLinear model demonstrates outstanding performance in latency, stream processing, and adaptability, exhibiting significant potential for online industrial deployment.

Author Contributions

Conceptual Design: T.M. and Y.S.; Methodology: Y.S.; Software Development: T.M.; Validation Work: T.M., J.L. and P.X.; Formal Analysis: J.L.; Research Investigation: Y.S.; Resource Preparation: Y.S.; Data Organization: T.M.; Draft Preparation: T.M.; Review Revisions: Y.S.; Visualization Presentation: J.L.; Academic Guidance: P.X.; Project Management: Y.S.; Funding Acquisition: Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Liaoning Provincial Department of Industry and Information Technology under the Major Science and Technology Special Project of Liaoning Province, “Research and Development, Testing and Operation Platform for Industrial Internet Applications,” specifically under the subtask “Research and Development of Key Technologies for Configurable Service Gateways” (Project No. k500600152).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The datasets referenced and generated in this paper cannot be shared due to privacy concerns. Please contact us if you require them.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wan, J.; Zhang, D.; Zhao, S.; Yang, L.; Lloret, J. Context-aware vehicular cyber-physical systems with cloud support: Architecture, challenges, and solutions. IEEE Commun. Mag. 2014, 52, 106–113. [Google Scholar] [CrossRef]
  2. Bedhief, I.; Foschini, L.; Bellavista, P.; Kassar, M.; Aguili, T. Toward Self-Adaptive Software Defined Fog Networking Architecture for IIoT and Industry 4.0. In Proceedings of the 2019 IEEE 24th International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), Limassol, Cyprus, 11–13 September 2019; pp. 1–5. [Google Scholar] [CrossRef]
  3. Lee, B.M.; Yang, H. Energy-Efficient Massive MIMO in Massive Industrial Internet of Things Networks. IEEE Internet Things J. 2022, 9, 3657–3671. [Google Scholar] [CrossRef]
  4. Zeng, A.; Chen, M.; Zhang, L.; Xu, Q. Are Transformers Effective for Time Series Forecasting? arXiv 2022, arXiv:2205.13504. [Google Scholar] [CrossRef]
  5. Zhou, H.; Zhang, S.; Peng, J.; Zhang, S.; Li, J.; Xiong, H.; Zhang, W. Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2022; pp. 11106–11115. [Google Scholar] [CrossRef]
  6. Abbasi, M.; Shahraki, A.; Taherkordi, A. Deep Learning for Network Traffic Monitoring and Analysis (NTMA): A Survey. Comput. Commun. 2021, 170, 19–41. [Google Scholar] [CrossRef]
  7. Liu, X.; Sun, C.; Yu, W.; Zhou, M. Reinforcement-Learning-Based Dynamic Spectrum Access for Software-Defined Cognitive Industrial Internet of Things. IEEE Trans. Ind. Inform. 2022, 18, 4244–4253. [Google Scholar] [CrossRef]
  8. Jahid, A.; Alsharif, M.H.; Hall, T.J. The convergence of blockchain, IoT and 6G: Potential, opportunities, challenges and research roadmap. J. Netw. Comput. Appl. 2023, 217, 103677. [Google Scholar] [CrossRef]
  9. Lim, B.; Arık, S.O.; Loeff, N.; Pfister, T. Temporal Fusion Transformers for interpretable multi-horizon time series forecasting. Int. J. Forecast. 2021, 37, 1748–1764. [Google Scholar] [CrossRef]
  10. Zhou, T.; Ma, Z.; Wen, Q.; Wang, X.; Sun, L.; Jin, R. FEDformer: Frequency Enhanced Decomposed Transformer for Long-term Series Forecasting. arXiv 2022, arXiv:2201.12740. [Google Scholar] [CrossRef]
  11. Yu, B.; Yin, H.; Zhu, Z. Spatio-Temporal Graph Convolutional Networks: A Deep Learning Framework for Traffic Forecasting. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, Stockholm, Sweden, 13–19 July 2018. [Google Scholar] [CrossRef]
  12. Cao, D.; Wang, Y.; Duan, J.; Zhang, C.; Zhu, X.G.; Huang, C.; Tong, Y.; Xu, B.; Bai, J.; Tong, J.; et al. Spectral Temporal Graph Neural Network for Multivariate Time-series Forecasting. In Proceedings of the 34th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 6–12 December 2020. [Google Scholar]
  13. Deng, S.; Zhao, H.; Fang, W.; Yin, J.; Dustdar, S.; Zomaya, A.Y. Edge Intelligence: The Confluence of Edge Computing and Artificial Intelligence. IEEE Internet Things J. 2020, 7, 7457–7469. [Google Scholar] [CrossRef]
  14. Chen, S.A.; Li, C.L.; Yoder, N.; Arik, S.; Pfister, T. TSMixer: An all-MLP Architecture for Time Series Forecasting. arXiv 2020, arXiv:1909.00560. [Google Scholar]
  15. Li, Y.; Yu, R.; Shahabi, C.; Liu, Y. Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. arXiv 2018, arXiv:1707.01926. [Google Scholar] [CrossRef]
  16. Osovsky, A.; Kutuzov, D.; Starov, D.; Bakalaeva, R.; Stukach, O. Comparison of Machine Learning Methods for IoT and IIoT Traffic Prediction. In Proceedings of the 2024 International Seminar on Electron Devices Design and Production (SED), Sochi, Russia, 2–3 October 2024; pp. 1–6. [Google Scholar] [CrossRef]
  17. Zhou, B.; He, D.; Sun, Z. Traffic Modeling and Prediction using ARIMA/GARCH Model. In Modeling and Simulation Tools for Emerging Telecommunication Networks; Springer: Boston, MA, USA, 2006; pp. 101–121. [Google Scholar] [CrossRef]
  18. Wang, J.; Tang, J.; Xu, Z.; Wang, Y.; Xue, G.; Zhang, X.; Yang, D. Spatiotemporal modeling and prediction in cellular networks: A big data enabled deep learning approach. In Proceedings of the IEEE INFOCOM 2017-IEEE Conference on Computer Communications, Atlanta, GA, USA, 1–4 May 2017. [Google Scholar] [CrossRef]
  19. Mogensen, R.S.; Rodriguez, I.; Berardinelli, G.; Pocovi, G.; Kolding, T. Empirical IIoT Data Traffic Analysis and Comparison to 3GPP 5G Models. In Proceedings of the 2021 IEEE 94th Vehicular Technology Conference (VTC2021-Fall), Virtual, 27 September–28 October 2021; pp. 1–7. [Google Scholar] [CrossRef]
  20. Akcapinar, A.; Gurer, O.; Rodoplu, V. ARIMA-Based Traffic Forecasting for Quality of Service (QoS) Flow Routing in Sixth Generation (6G) Networks. In Proceedings of the 2022 Innovations in Intelligent Systems and Applications Conference (ASYU), Antalya, Turkey, 7–9 September 2022; pp. 1–5. [Google Scholar] [CrossRef]
  21. Yin, D.; Li, J.; Wu, G. Solving the Data Sparsity Problem in Predicting the Success of the Startups with Machine Learning Methods. arXiv 2021, arXiv:2112.07985. [Google Scholar] [CrossRef]
  22. Wu, J.Y.; Wu, M.; Chen, Z.; Li, X.L.; Yan, R. Degradation-Aware Remaining Useful Life Prediction With LSTM Autoencoder. IEEE Trans. Instrum. Meas. 2021, 70, 3511810. [Google Scholar] [CrossRef]
  23. Tziolas, T.; Papageorgiou, K.; Theodosiou, T.; Papageorgiou, E.; Mastos, T.; Papadopoulos, A. Autoencoders for Anomaly Detection in an Industrial Multivariate Time Series Dataset. Eng. Proc. 2022, 18, 23. [Google Scholar] [CrossRef]
  24. Alawe, I.; Ksentini, A.; Hadjadj-Aoul, Y.; Bertin, P. Improving Traffic Forecasting for 5G Core Network Scalability: A Machine Learning Approach. IEEE Netw. 2018, 32, 42–49. [Google Scholar] [CrossRef]
  25. Hua, Y.; Zhao, Z.; Li, R.; Chen, X.; Liu, Z.; Zhang, H. Deep Learning with Long Short-Term Memory for Time Series Prediction. IEEE Commun. Mag. 2019, 57, 114–119. [Google Scholar] [CrossRef]
  26. Singh, R.; Gill, S. Edge AI: A survey. Internet Things Cyber-Phys. Syst. 2023, 3, 71–92. [Google Scholar] [CrossRef]
  27. Han, K.; Wang, Y.; Chen, H.; Chen, X.; Guo, J.; Liu, Z.; Tang, Y.; Xiao, A.; Xu, C.; Xu, Y.; et al. A Survey on Vision Transformer. IEEE Trans. Pattern Anal. Mach. Intell. 2023, 45, 87–110. [Google Scholar] [CrossRef]
  28. Chen, Z.; E, J.; Zhang, X.; Sheng, H.; Cheng, X. Multi-Task Time Series Forecasting With Shared Attention. In Proceedings of the 2020 International Conference on Data Mining Workshops (ICDMW), Virtual, 17–20 November 2020; pp. 917–925. [Google Scholar] [CrossRef]
  29. Liu, Q.; Li, J.; Lu, Z. ST-Tran: Spatial-Temporal Transformer for Cellular Traffic Prediction. IEEE Commun. Lett. 2021, 25, 3325–3329. [Google Scholar] [CrossRef]
  30. Dong, Y.; Tao, Y.; Jiang, X.; Zhang, K.; Li, J.; Su, J.; Zhang, J.; Xu, J. FAN: Fourier Analysis Networks. arXiv 2025, arXiv:2410.02675. [Google Scholar]
  31. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph WaveNet for Deep Spatial-Temporal Graph Modeling. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019. [Google Scholar] [CrossRef]
  32. Wang, S.; Nie, L.; Li, G.; Wu, Y.; Ning, Z. A Multitask Learning-Based Network Traffic Prediction Approach for SDN-Enabled Industrial Internet of Things. IEEE Trans. Ind. Inform. 2022, 18, 7475–7483. [Google Scholar] [CrossRef]
  33. Wang, J.; Wang, Z.; Li, J.; Wu, J. Multilevel Wavelet Decomposition Network for Interpretable Time Series Analysis. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018. [Google Scholar] [CrossRef]
  34. Yang, S.; Liu, J. Time-Series Forecasting Based on High-Order Fuzzy Cognitive Maps and Wavelet Transform. IEEE Trans. Fuzzy Syst. 2018, 26, 3391–3402. [Google Scholar] [CrossRef]
  35. Zhao, S.; Jiang, X.; Jacobson, G.; Jana, R.; Hsu, W.-L.; Rustamov, R.; Talasila, M.; Aftab, S.A.; Chen, Y.; Borcea, C. Cellular Network Traffic Prediction Incorporating Handover: A Graph Convolutional Approach. In Proceedings of the 2020 17th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), Como, Italy, 22–25 June 2020. [Google Scholar] [CrossRef]
  36. Tolstikhin, I.; Houlsby, N.; Kolesnikov, A.; Beyer, L.; Zhai, X.; Unterthiner, T.; Yung, J.; Steiner, A.; Keysers, D.; Uszkoreit, J.; et al. MLP-Mixer: An all-MLP Architecture for Vision. arXiv 2021, arXiv:2105.01601. [Google Scholar] [CrossRef]
  37. Li, Z.; Rao, Z.; Pan, L.; Xu, Z. MTS-Mixers: Multivariate Time Series Forecasting via Factorized Temporal and Channel Mixing. arXiv 2023, arXiv:2302.04501. [Google Scholar] [CrossRef]
  38. Qiu, X.; Wu, X.; Lin, Y.; Guo, C.; Hu, J.; Yang, B. DUET: Dual Clustering Enhanced Multivariate Time Series Forecasting. arXiv 2025, arXiv:2412.10859. [Google Scholar] [CrossRef]
  39. Loukas, A. What graph neural networks cannot learn: Depth vs width. arXiv 2019, arXiv:1907.03199. [Google Scholar] [CrossRef]
  40. Wu, H.; Xu, J.; Wang, J.; Long, M. Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting. arXiv 2021, arXiv:2106.13008. [Google Scholar]
  41. Li, Z.; Wanru, X.; Chunqiang, Z.; Jingkai, G.; Lugema, M.; Fan, D.; Jinqi, Q. AFMT:Adaptive frequency decomposition and multi-scale transformer for time series forecasting. Inf. Sci. 2026, 726, 122735. [Google Scholar] [CrossRef]
  42. Liu, M.; Zeng, A.; Chen, M.; Xu, Z.; Lai, Q.; Ma, L.; Xu, Q. SCINet: Time Series Modeling and Forecasting with Sample Convolution and Interaction. arXiv 2021, arXiv:2106.09305. [Google Scholar]
  43. Abdar, M.; Pourpanah, F.; Hussain, S.; Rezazadegan, D.; Liu, L.; Ghavamzadeh, M.; Fieguth, P.; Cao, X.; Khosravi, A.; Acharya, U.R.; et al. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Inf. Fusion 2021, 76, 243–297. [Google Scholar] [CrossRef]
  44. Rojat, T.; Puget, R.; Filliat, D.; Ser, J.; Gelin, R.; Díaz-Rodríguez, N. Explainable Artificial Intelligence (XAI) on TimeSeries Data: A Survey. arXiv 2021, arXiv:2104.00950. [Google Scholar] [CrossRef]
  45. Lee, S.; Kim, T. Impact of Deep Learning Optimizers and Hyperparameter Tuning on the Performance of Bearing Fault Diagnosis. IEEE Access 2023, 11, 55046–55070. [Google Scholar] [CrossRef]
  46. Lang, Y.; Gao, Y. Dream Optimization Algorithm (DOA): A novel metaheuristic optimization algorithm inspired by human dreams and its applications to real-world engineering problems. Comput. Methods Appl. Mech. Eng. 2025, 436, 117718. [Google Scholar] [CrossRef]
  47. Shabani, A.; Abdi, A.; Meng, L.; Sylvain, T. Scaleformer: Iterative Multi-scale Refining Transformers for Time Series Forecasting. arXiv 2023, arXiv:2206.04038. [Google Scholar] [CrossRef]
  48. Wang, S.; Wu, H.; Shi, X.; Hu, T.; Luo, H.; Ma, L.; Zhang, J.Y.; Zhou, J. TimeMixer: Decomposable Multiscale Mixing for Time Series Forecasting. arXiv 2024, arXiv:2405.14616. [Google Scholar] [CrossRef]
  49. Zhou, P.; Liu, Y.; Liang, J.; Song, Q.; Li, X. CrossLinear: Plug-and-Play Cross-Correlation Embedding for Time Series Forecasting with Exogenous Variables. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2, Toronto ON, Canada, 3–7 August 2025; ACM: New York, NY, USA, 2025. KDD ’25. pp. 4120–4131. [Google Scholar] [CrossRef]
  50. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv 2021, arXiv:2010.11929. [Google Scholar] [CrossRef]
  51. Nie, Y.; Nguyen, N.H.; Sinthong, P.; Kalagnanam, J. A Time Series is Worth 64 Words: Long-term Forecasting with Transformers. In Proceedings of the Eleventh International Conference on Learning Representations, Kigali, Rwanda, 1–5 May 2023. [Google Scholar]
  52. Tang, P.; Zhang, W. Unlocking the Power of Patch: Patch-Based MLP for Long-Term Time Series Forecasting. arXiv 2024, arXiv:2405.13575. [Google Scholar] [CrossRef]
  53. Wang, Y.; Wu, H.; Dong, J.; Qin, G.; Zhang, H.; Liu, Y.; Qiu, Y.; Wang, J.; Long, M. TimeXer: Empowering Transformers for Time Series Forecasting with Exogenous Variables. arXiv 2024, arXiv:2402.19072. [Google Scholar] [CrossRef]
  54. Janusz, A.; Przyborowski, M.; Biczyk, P.; Slezak, D. Network Device Workload Prediction: A Data Mining Challenge at Knowledge Pit. In Proceedings of the 2020 Federated Conference on Computer Science and Information Systems, FedCSIS 2020, Sofia, Bulgaria, 6–9 September 2020; Volume 21, pp. 77–80. [Google Scholar] [CrossRef]
  55. Sujito; Gumilar, L.; Hadi, R.R.; Rodhi Faiz, M.; Syafriyudin; Nugroho, Z.S. Analysis Comparison of Linear Interpolation and Quadratic Interpolation Methods for Forecasting a Growth Total of Electricity Customers in Kotawaringin Barat Regency at 2022–2025 Years. In Proceedings of the 022 International Electronics Symposium (IES), Surabaya, Indonesia, 9–11 August 2022. [Google Scholar] [CrossRef]
  56. Lu, H.; Yang, F. Research on Network Traffic Prediction Based on Long Short-Term Memory Neural Network. In Proceedings of the 2018 IEEE 4th International Conference on Computer and Communications (ICCC), Chengdu, China, 7–10 December 2018; pp. 1109–1113.
  57. Liu, Y.; Gong, C.; Yang, L.; Chen, Y. DSTP-RNN: A dual-stage two-phase attention-based recurrent neural network for long-term and multivariate time series prediction. Expert Syst. Appl. 2020, 143, 113082.
  58. Ma, Z.; Yu, H.; Xia, J.; Wang, C.; Yan, L.; Zhou, X. Network Traffic Prediction based on Seq2seq Model. In Proceedings of the 2021 16th International Conference on Computer Science & Education (ICCSE), Lancaster, UK, 17–19 August 2021; pp. 710–713.
  59. Pelletier, C.; Webb, G.; Petitjean, F. Temporal Convolutional Neural Network for the Classification of Satellite Image Time Series. Remote Sens. 2019, 11, 523.
  60. Wu, H.; Hu, T.; Liu, Y.; Zhou, H.; Wang, J.; Long, M. TimesNet: Temporal 2D-Variation Modeling for General Time Series Analysis. arXiv 2023, arXiv:2210.02186.
  61. Wang, R.; Zhang, Y.; Peng, L.; Fortino, G.; Ho, P.H. Time-Varying-Aware Network Traffic Prediction Via Deep Learning in IIoT. IEEE Trans. Ind. Inform. 2022, 18, 8129–8137.
  62. Ruta, D.; Cen, L.; Vu, Q.H. Deep Bi-Directional LSTM Networks for Device Workload Forecasting. In Proceedings of the 2020 15th Conference on Computer Science and Information Systems (FedCSIS), Sofia, Bulgaria, 6–9 September 2020; pp. 115–118.
  63. Qiu, T.; Chi, J.; Zhou, X.; Ning, Z.; Atiquzzaman, M.; Wu, D.O. Edge Computing in Industrial Internet of Things: Architecture, Advances and Challenges. IEEE Commun. Surv. Tutor. 2020, 22, 2462–2488.
  64. Zhao, X.; Du, D.; Zhang, Y. Prediction of SDN Heterogeneous Network Traffic Based on Improved LSTM with Self-attention Mechanism. In Proceedings of the 2023 8th International Conference on Power and Renewable Energy (ICPRE), Shanghai, China, 22–25 September 2023; pp. 2016–2021.
  65. Fatima, S.S.W.; Rahimi, A. A Review of Time-Series Forecasting Algorithms for Industrial Manufacturing Systems. Machines 2024, 12, 380.
  66. Ren, L.; Jia, Z.; Laili, Y.; Huang, D. Deep Learning for Time-Series Prediction in IIoT: Progress, Challenges, and Prospects. IEEE Trans. Neural Netw. Learn. Syst. 2024, 35, 15072–15091.
  67. Nie, L.; Wang, X.; Wang, S.; Ning, Z.; Obaidat, M.S.; Sadoun, B.; Li, S. Network Traffic Prediction in Industrial Internet of Things Backbone Networks: A Multitask Learning Mechanism. IEEE Trans. Ind. Inform. 2021, 17, 7123–7132.
  68. Miao, Y.; Bai, X.; Cao, Y.; Liu, Y.; Dai, F.; Wang, F.; Qi, L.; Dou, W. A Novel Short-Term Traffic Prediction Model Based on SVD and ARIMA With Blockchain in Industrial Internet of Things. IEEE Internet Things J. 2023, 10, 21217–21226.
Figure 1. MSDI-CrossLinear Model Framework Based on Dream-Driven Optimization.
Figure 2. CrossLinear Overall Structure Diagram.
Figure 3. DOA Optimization Iteration Diagram.
Figure 4. Case study results. The three panels show the model’s predictions of daily traffic for three distinct network devices, which differ in their traffic trends and rates of change.
Figure 5. Linear Weight Heatmap.
Table 1. Comparison results of existing methods.

| Model Type | Advantages | Limitations in Industrial Gateway Scenarios |
|---|---|---|
| Transformer type | Capturing long-term dependency | Computational complexity O(n²), memory usage > 3 GB, inference latency > 500 ms [9,10] |
| GNN type | Spatial modeling | Requires predefined graph structures, making it difficult to adapt to dynamic topology changes [11,12,13] |
| Linear MLP hybrid type | Lightweight and efficient | Unable to model multiscale patterns, with a 30% performance drop on non-stationary sequences [4,14] |
Table 2. DOA vs. Traditional Optimizers.

| Method | Convergence Speed | Global Search Capability | Noise Robustness | Computational Cost |
|---|---|---|---|---|
| Adam | Fast | Weak (prone to local optima) | Weak | Low |
| Grid Search | Slow | Strong (exhaustive) | Moderate | Extremely high |
| Bayesian Optimization | Moderate | Moderate | Moderate | High |
| DOA | Moderate | Strong (swarm intelligence) | Strong | Moderate |
Table 3. Interpolation Method Sensitivity Analysis.

| Interpolation Method | Change in Root Mean Square Error (Compared to Complete Data) | Change in Absolute Error |
|---|---|---|
| Zero-Filling Method | +18.6% | +22.5% |
| Mean Replacement Method | +12.8% | +15.3% |
| Linear Interpolation | +3.5% | +4.1% |
| Seasonal Decomposition Method | +2.8% | +4.7% |
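
The first three gap-filling strategies compared in Table 3 can be illustrated with a minimal sketch. It assumes missing samples are marked as `None` in a plain Python list; the function names are hypothetical and are not taken from the paper. The seasonal decomposition method is omitted because its configuration is not described here.

```python
from typing import List, Optional

Series = List[Optional[float]]  # None marks a missing sample (assumption)


def zero_fill(values: Series) -> List[float]:
    """Replace every missing sample with 0."""
    return [0.0 if v is None else float(v) for v in values]


def mean_fill(values: Series) -> List[float]:
    """Replace every missing sample with the mean of the observed samples."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("series has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else float(v) for v in values]


def linear_interpolate(values: Series) -> List[float]:
    """Fill gaps on a straight line between the nearest observed neighbours.

    Gaps at the start or end have only one neighbour, so they copy it.
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled: List[float] = []
    for i, v in enumerate(values):
        if v is not None:
            filled.append(float(v))
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled.append(float(values[right]))
        elif right is None:
            filled.append(float(values[left]))
        else:
            weight = (i - left) / (right - left)
            filled.append(values[left] + weight * (values[right] - values[left]))
    return filled
```

On a steadily rising series, zero-filling and mean replacement insert values far from the local level, whereas linear interpolation stays on the trend, which is consistent with the ordering of the errors reported in Table 3.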
Table 4. Model Comparison Experimental Results.

| Model | R² | MSE | MAE |
|---|---|---|---|
| RF | 0.908 | 1.822 | 0.912 |
| SVM | 0.809 | 2.806 | 0.826 |
| XGB | 0.846 | 2.293 | 0.709 |
| LSTM | 0.908 | 0.696 | 0.496 |
| CNN-LSTM | 0.926 | 1.417 | 0.657 |
| GRU | 0.853 | 2.173 | 0.719 |
| DARNN | 0.836 | 2.446 | 0.777 |
| Seq2seq | 0.930 | 1.341 | 0.675 |
| TCN | 0.838 | 3.176 | 1.161 |
| PatchTST | 0.954 | 0.468 | 0.515 |
| TimesNet | 0.951 | 2.811 | 0.962 |
| DOA-MSDI-CrossLinear | 0.983 | 0.239 | 0.354 |
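
For readers who want to reproduce the three columns of Table 4 on their own predictions, the following sketch computes R², MSE, and MAE from their standard definitions. The function name is hypothetical, and the paper's exact evaluation protocol (normalization, averaging over devices) is not stated in this section, so it is not modeled here.

```python
from typing import Dict, Sequence


def regression_metrics(y_true: Sequence[float], y_pred: Sequence[float]) -> Dict[str, float]:
    """Return R2, MSE and MAE for paired observations and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R2 = 1 - SS_res / SS_tot; it is undefined when the target is constant.
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot > 0 else float("nan")
    return {"R2": r2, "MSE": mse, "MAE": mae}
```

For example, `regression_metrics([1, 2, 3, 4], [1, 2, 3, 5])` has a single error of magnitude 1, so MSE and MAE are both 0.25 and R² is 1 − 1/5 = 0.8.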
Table 5. Results of model ablation experiments.

| Model | R² | MSE | MAE |
|---|---|---|---|
| CrossLinear | 0.944 | 1.077 | 0.623 |
| MSDI | 0.959 | 0.672 | 0.538 |
| MSDI-CrossLinear | 0.964 | 0.696 | 0.496 |
| MDM-CrossLinear | 0.963 | 0.730 | 0.541 |
| DOA-MSDI-CrossLinear | 0.983 | 0.239 | 0.354 |
Table 6. Robustness to input disturbances.

| Noise Level | MSE Increase Rate | MAE Increase Rate |
|---|---|---|
| σ = 0.05 | +8.23% | +6.19% |
| σ = 0.10 | +15.84% | +12.36% |
| σ = 0.20 | +31.57% | +28.13% |
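
One plausible way to obtain increase rates such as those in Table 6 is sketched below. It assumes zero-mean Gaussian noise with standard deviation σ added to the model inputs and a relative increase measured against the noise-free error; neither assumption is confirmed by this section, and `perturb` and `increase_rate` are hypothetical helper names.

```python
import random
from typing import List, Sequence


def perturb(inputs: Sequence[float], sigma: float, rng: random.Random) -> List[float]:
    """Add zero-mean Gaussian noise with standard deviation sigma (assumption)."""
    return [x + rng.gauss(0.0, sigma) for x in inputs]


def mean_squared_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Plain MSE over paired values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def increase_rate(clean_error: float, noisy_error: float) -> float:
    """Relative error increase in percent, measured against the clean error."""
    if clean_error <= 0:
        raise ValueError("clean_error must be positive")
    return 100.0 * (noisy_error - clean_error) / clean_error
```

Under these assumptions, an MSE that rises from 0.239 on clean inputs to about 0.2587 on perturbed inputs corresponds to an increase rate of roughly +8.23%, the value listed for σ = 0.05.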
Table 7. Statistical validation.

| Comparison Metric | Mean ΔMSE | 95% Confidence Interval | p-Value | Cohen’s Effect Size |
|---|---|---|---|---|
| vs. Transformer | −0.184 | [−0.213, 0.161] | <0.001 | 1.83 |
| vs. DLinear | −0.093 | [−0.118, 0.070] | <0.001 | 1.24 |
| vs. LSTM | −0.146 | [−0.169, 0.115] | <0.001 | 1.56 |
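
The quantities in Table 7 can be approximated from paired per-run MSE values, as in the sketch below. The statistical test behind the table is not stated in this section, so the sketch makes its own assumptions: paired differences, a normal-approximation confidence interval (a t-based interval would be wider for few runs), and Cohen's d computed as the mean difference divided by the standard deviation of the differences. The function name is hypothetical and the p-value is not computed.

```python
import math
import statistics
from typing import Dict, Sequence


def paired_summary(mse_a: Sequence[float], mse_b: Sequence[float],
                   confidence: float = 0.95) -> Dict[str, object]:
    """Summarize paired MSE differences (model A minus model B)."""
    if len(mse_a) != len(mse_b) or len(mse_a) < 2:
        raise ValueError("need at least two paired runs of equal length")
    diffs = [a - b for a, b in zip(mse_a, mse_b)]
    mean_diff = statistics.fmean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation
    if sd_diff == 0:
        raise ValueError("differences are constant; effect size is undefined")
    # Normal approximation to the critical value (assumption, not a t-quantile).
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2.0)
    half_width = z * sd_diff / math.sqrt(len(diffs))
    return {
        "mean_delta": mean_diff,
        "ci": (mean_diff - half_width, mean_diff + half_width),
        "cohens_d": mean_diff / sd_diff,  # signed; negative when A has lower MSE
    }
```

With this sign convention a negative mean difference yields a negative d, so a table that reports positive effect sizes alongside negative mean differences is presumably listing magnitudes.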
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Ma, T.; Liu, J.; Xu, P.; Song, Y. Traffic Forecasting for Industrial Internet Gateway Based on Multi-Scale Dependency Integration. Sensors 2026, 26, 795. https://doi.org/10.3390/s26030795
