4.3. Experimental Results
The DOA-MSDI-CrossLinear model was compared with the following models:
SVM: Support Vector Machine, a traditional machine learning model.
Random Forest and XGBoost: Two tree-ensemble models commonly used in industry. Random Forest is based on bagging, whereas XGBoost is based on gradient boosting.
LSTM: A variant of recurrent neural networks (RNNs), previously applied by Lu et al. to network traffic forecasting [56]. Hybrid approaches combining convolutional neural networks (CNNs) with LSTMs have also yielded encouraging predictive performance.
GRU: Another variant of RNNs.
DARNN: A model specifically designed for time series forecasting [57].
Seq2seq: Originally developed for natural language processing (NLP), the Seq2seq model has been adapted by some researchers for network traffic forecasting and has been applied in earlier time series forecasting studies [58].
TCN: Temporal Convolutional Networks [59], a widely adopted time series forecasting method, selected as one of the baselines in this paper.
PatchTST: A time series forecasting model proposed by Nie et al. [51] in 2023. Its core strategy is to segment the time series into sub-sequence patches and model them with a channel-independent approach. It is suitable for traffic forecasting on single devices or devices with strong independence, long-term sequence prediction (>96 steps), edge deployment, and resource-constrained environments.
TimesNet: A time series analysis model proposed by Wu et al. [60] in 2023. It transforms one-dimensional time series into two-dimensional tensors and uses 2D convolutions to capture intra-period and inter-period variation patterns.
The performance of the proposed model is validated by the results in Table 1. Lower Mean Squared Error (MSE) and Mean Absolute Error (MAE) values indicate superior performance, while R² values close to 1 signify higher accuracy. The best results for each metric are highlighted in bold.
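For reference, the three evaluation metrics can be computed as in the following minimal sketch. It uses the standard textbook definitions; the function names and the sample values are illustrative assumptions and do not come from the paper's code or data.

```python
def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical traffic values, for illustration only.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.0, 5.0, 8.0, 9.0]
print(mse(actual, predicted), mae(actual, predicted), r2(actual, predicted))
```

Because R² is normalized by the variance of the target while MSE is not, the two metrics need not rank models identically.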
Table 4 presents a performance comparison between DOA-MSDI-CrossLinear and nine benchmark methods, encompassing traditional machine learning, recurrent neural networks, convolutional approaches, and attention-based architectures.
Traditional machine learning methods (RF, SVM, and XGB) achieved moderate R² values (0.809–0.908), but their mean squared error (MSE) was significantly higher (1.822–2.806) than that of deep learning approaches. This performance gap stems from limited temporal modeling capabilities: these methods treat each prediction as an independent event and fail to capture the sequential dependencies inherent in traffic time series. The relatively strong performance of Random Forest (R² = 0.908) indicates that ensemble averaging partially compensates for this limitation by capturing feature interactions. Additionally, these methods rely on manually designed features, which may not fully capture the complexity of industrial traffic patterns.
RNN-based methods (LSTM, GRU, and DARNN) performed inconsistently: LSTM achieved a competitive MSE (0.696) but a low R² value (0.908), while GRU and DARNN performed significantly worse. This inconsistency suggests that RNN architectures are highly sensitive to the learning rate, hidden dimension, and gradient clipping threshold. Without systematic optimization (as provided by DOA in this approach), performance fluctuates considerably. Despite employing a two-stage attention mechanism, DARNN performs relatively poorly (R² = 0.836), indicating that attention mechanisms alone are insufficient to capture the multi-scale structure of industrial traffic, which is the very reason we propose an explicit decomposition method.
TCN underperforms despite its theoretical advantages. Temporal convolutional networks offer parallelizable training and flexible receptive fields, yet TCN achieves the worst MSE (3.176) among the deep learning methods. Its exponentially increasing dilation factor assumes a specific temporal hierarchy, which may not match the actual periodic characteristics of industrial traffic (24 h and 168 h cycles). In addition, a standard TCN processes channel data independently and therefore fails to capture cross-device correlations.
The PatchTST model demonstrates exceptional performance but is unsuitable for gateway-aggregated traffic with strong inter-device correlations and for complex industrial environments that require multi-scale pattern capture. Similarly, the TimesNet model performs excellently by automatically identifying dominant cycles via FFT and reshaping 1D sequences into 2D, which aligns well with the strong periodicity of industrial traffic (24 h diurnal cycles, 168 h weekly cycles, and production shift cycles). However, the combination of 2D convolutions and multi-period processing increases the computational load, which limits deployment on edge gateways and may cause inference delays beyond real-time requirements. Additionally, the reshape step relies on fixed period lengths and cannot handle variable-period industrial scenarios (e.g., flexible production scheduling).
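The remark about the TCN dilation schedule can be made concrete with a small calculation. The sketch below assumes a simplified stack with one dilated causal convolution per level and dilations 1, 2, 4, and so on. The exact TCN configuration used in the experiments is not stated, so the kernel sizes and depths here are illustrative only.

```python
def receptive_field(kernel_size, num_levels):
    """Receptive field of stacked causal convolutions with dilations 1, 2, 4, ...

    Assumes one convolution per level; each level adds (kernel_size - 1) * dilation.
    """
    dilations = [2 ** i for i in range(num_levels)]
    return 1 + (kernel_size - 1) * sum(dilations)


def fields_for(kernel_size, max_levels):
    """Receptive fields obtained with 1..max_levels levels."""
    return [receptive_field(kernel_size, n) for n in range(1, max_levels + 1)]


for k in (2, 3):
    fields = fields_for(k, 8)
    aligned = [p for p in (24, 168) if p in fields]
    print(f"kernel={k}: receptive fields {fields}, aligned with 24/168 h: {aligned}")
```

Under these assumptions the receptive field grows roughly as a power of two, so no depth lines up exactly with a 24-sample or 168-sample period. This illustrates the possible mismatch described above; it does not show that it explains the measured error.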
This model achieves significant improvements (reducing mean squared error by 65.66% compared to LSTM and by 92.47% compared to TCN) due to three synergistic factors. First, MDM separates scale-specific patterns before modeling, preventing interference between fine-grained noise and coarse-grained trends. Second, DDI simultaneously captures temporal autocorrelation and cross-channel synchrony, which is critical for industrial networks where devices on shared production lines often exhibit correlated behavior. Third, systematic hyperparameter optimization: by balancing directional detection and exploration, DOA identifies configurations that are difficult to reach through manual tuning or grid search. Furthermore, the R² value, serving as a comprehensive indicator of model robustness, demonstrates that the DOA-MSDI-CrossLinear model achieves a good balance between performance and stability. Given the stringent robustness requirements in IoT scenarios, selecting a model that ensures strong and stable prediction performance is paramount.
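To illustrate what separating scale-specific patterns can look like, the following sketch splits a series into additive components using moving averages of increasing width. It is an assumed stand-in for the general idea, not the MDM described in this paper. The window sizes (4 and 24 samples) and the synthetic hourly series are hypothetical.

```python
import math


def moving_average(series, window):
    """Trailing moving average; early points average over the samples seen so far."""
    out = []
    running = 0.0
    for i, x in enumerate(series):
        running += x
        if i >= window:
            running -= series[i - window]
        out.append(running / min(i + 1, window))
    return out


def decompose(series, windows=(4, 24)):
    """Split a series into additive components, ordered from fine to coarse.

    The first component holds fast fluctuations, the last holds the slow trend,
    and the components sum back to the original series.
    """
    components = []
    current = list(series)
    for w in sorted(windows):
        smooth = moving_average(series, w)
        components.append([c - s for c, s in zip(current, smooth)])
        current = smooth
    components.append(current)
    return components


# Synthetic hourly traffic: a 24 h cycle plus occasional spikes (illustrative only).
traffic = [10 + 5 * math.sin(2 * math.pi * t / 24) + (3 if t % 17 == 0 else 0)
           for t in range(96)]
parts = decompose(traffic)
rebuilt = [sum(values) for values in zip(*parts)]
print(len(parts), max(abs(a - b) for a, b in zip(traffic, rebuilt)))
```

Each component can then be modeled separately, so that short spikes do not distort the estimate of the slow trend.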
Ablation experiments were conducted to evaluate the contribution of each module in DOA-MSDI-CrossLinear. Four variants were compared with the full model under identical conditions: a CrossLinear model, an MDM-CrossLinear model, an MSDI model integrating the MDM and DDI modules, and an MSDI-CrossLinear model. The results are displayed in Table 5.
The contribution of each model component to overall performance is analyzed below.
CrossLinear alone (R² = 0.944, MSE = 1.077): The linear component by itself performs remarkably well, consistent with recent findings that linear models are highly competitive in time series forecasting [4]. However, it struggles with the sudden traffic spikes and abrupt mode shifts of industrial networks. A linear model cannot capture these nonlinear dynamics, hence the elevated error.
MSDI without CrossLinear (R² = 0.959, MSE = 0.672): Using only the MDM+DDI architecture, with no linear prediction at all, reduces MSE by 37.6% (1.077 → 0.672). This indicates that explicitly separating time scales contributes most of the value.
MDM-CrossLinear vs. MSDI-CrossLinear: Adding the DDI module (which models device interactions) on top of decomposition yields a further 4.7% improvement (MSE: 0.730 → 0.696). The gain is real but modest: channel interactions matter, but multiscale decomposition does most of the work. This has practical implications for deployment. On edge devices with limited compute, the DDI module could potentially be omitted while retaining more than 90% of the performance benefit.
DOA optimization effect (MSE: 0.696 → 0.239): This is the most striking result. Applying DOA to tune the hyperparameters reduced the error by 65.66% without any change to the architecture. For deployment in a real factory, we therefore advise against implementing the full architecture with default parameters. Instead:
(1) Start with the multiscale decomposition (MDM), which provides the largest benefit for its cost.
(2) Invest serious effort in hyperparameter tuning. Our results show that it matters more than adding architectural complexity.
(3) Add the dual-dependency module (DDI) only if the computational budget allows and the additional accuracy of roughly 5% is needed.
The 65% improvement from optimization alone suggests that how the model is configured matters more than which additional components are attached to it. This point receives too little emphasis in academic papers, but it is critical for practitioners.
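The percentages quoted in this section follow directly from the reported MSE values. The short sketch below recomputes them. The helper name is illustrative, and the MSE figures are the ones stated in the text.

```python
def relative_reduction(before, after):
    """Percentage by which `after` is lower than `before`."""
    return 100.0 * (before - after) / before


# (label, MSE before, MSE after), using the values reported in the text.
steps = [
    ("CrossLinear -> MSDI", 1.077, 0.672),
    ("MDM-CrossLinear -> MSDI-CrossLinear", 0.730, 0.696),
    ("MSDI-CrossLinear -> DOA-MSDI-CrossLinear", 0.696, 0.239),
    ("TCN -> DOA-MSDI-CrossLinear", 3.176, 0.239),
]
for label, before, after in steps:
    print(f"{label}: {relative_reduction(before, after):.2f}% lower MSE")
```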
We tested the robustness of the model to input degradation, as reported in Table 6, by injecting Gaussian noise and randomly missing data (10–30%). With 10% of the data missing, the error increased by 5% (acceptable); with 30% missing, it increased by 23%, which calls for interpolation. At typical industrial noise levels, model performance degradation remained below 15%, validating its deployment readiness.
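A robustness test of this kind can be set up along the following lines. This is a minimal sketch: the noise level `sigma`, the synthetic series, and the choice of linear interpolation are assumptions made for illustration, since the exact settings of the experiment are not reproduced here.

```python
import random


def add_gaussian_noise(series, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation `sigma` to every point."""
    return [x + rng.gauss(0.0, sigma) for x in series]


def drop_random(series, missing_rate, rng):
    """Replace a random subset of points with None.

    The two endpoints are always kept so that interpolation is well defined.
    """
    n = len(series)
    count = round(missing_rate * n)
    interior = list(range(1, n - 1))
    missing = set(rng.sample(interior, min(count, len(interior))))
    return [None if i in missing else x for i, x in enumerate(series)]


def interpolate_linear(series):
    """Fill None gaps by linear interpolation between the nearest known neighbors."""
    filled = list(series)
    known = [i for i, x in enumerate(filled) if x is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            frac = (i - left) / (right - left)
            filled[i] = filled[left] + frac * (filled[right] - filled[left])
    return filled


rng = random.Random(0)
clean = [float(t % 24) for t in range(48)]   # synthetic hourly series
noisy = add_gaussian_noise(clean, 0.1, rng)  # sigma = 0.1 is hypothetical
gappy = drop_random(clean, 0.3, rng)         # 30% of points removed
repaired = interpolate_linear(gappy)
print(sum(x is None for x in gappy), "points removed and interpolated")
```

The degraded series would then be passed to the trained model, and the resulting error compared with the error on the clean input.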
We compared the model against the baseline solution using paired t-tests across 30 independent runs. The results, presented in Table 7, demonstrate that all improvements are statistically significant (p < 0.001) with large effect sizes (d > 1.2), confirming genuine performance gains beyond random fluctuations.
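For readers who wish to reproduce this kind of test, the sketch below computes a paired t statistic and an effect size from per-run errors. The effect size shown is Cohen's d for paired samples (mean difference divided by the standard deviation of the differences), which is an assumption because the text does not state which variant was used. The per-run errors are synthetic.

```python
import math
import random
import statistics


def paired_t_and_d(errors_a, errors_b):
    """Return (t statistic, Cohen's d for paired samples, degrees of freedom)."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    cohen_d = mean_diff / sd_diff
    return t_stat, cohen_d, n - 1


# Synthetic per-run MSE values for 30 runs (illustrative, not the paper's data).
rng = random.Random(42)
baseline_mse = [1.0 + rng.gauss(0.0, 0.1) for _ in range(30)]
proposed_mse = [0.5 + rng.gauss(0.0, 0.1) for _ in range(30)]

t_stat, cohen_d, dof = paired_t_and_d(baseline_mse, proposed_mse)
# With 29 degrees of freedom, a two-sided p < 0.001 needs |t| above roughly 3.66.
print(f"t = {t_stat:.2f}, d = {cohen_d:.2f}, dof = {dof}")
```

The runs must be paired meaningfully, for example by sharing the same random seed or data split between the two models, for the paired test to be appropriate.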
The hyperparameters of the model are optimized using the mean squared error (MSE) as the objective. The resulting optimal learning rate, model dimension, and fully connected layer dimension are 0.00554, 228, and 96, respectively. The iteration process during optimization is illustrated in Figure 3.
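The general shape of such an MSE-driven search can be sketched as follows. Note that this example uses plain random search rather than DOA, and that `toy_validation_mse` is a made-up stand-in objective with an arbitrary minimum; in practice it would be replaced by training the model and returning its validation MSE. The search ranges are also assumptions.

```python
import math
import random


def toy_validation_mse(config):
    """Stand-in objective (NOT the paper's model): a smooth bowl with an arbitrary minimum."""
    return (
        (math.log10(config["learning_rate"]) + 2.5) ** 2
        + ((config["model_dim"] - 200) / 100) ** 2
        + ((config["fc_dim"] - 100) / 100) ** 2
    )


def sample_config(rng):
    """Draw one candidate configuration from hypothetical search ranges."""
    return {
        "learning_rate": 10 ** rng.uniform(-4, -1),  # log-uniform in [1e-4, 1e-1]
        "model_dim": rng.randint(32, 512),
        "fc_dim": rng.randint(16, 256),
    }


def random_search(objective, trials, seed=0):
    """Keep the configuration with the lowest objective value seen so far."""
    rng = random.Random(seed)
    best_config, best_score = None, float("inf")
    history = []  # best score after each trial, comparable to a convergence curve
    for _ in range(trials):
        config = sample_config(rng)
        score = objective(config)
        if score < best_score:
            best_config, best_score = config, score
        history.append(best_score)
    return best_config, best_score, history


best_config, best_score, history = random_search(toy_validation_mse, trials=200)
print(best_config, round(best_score, 4))
```

The `history` list plays the same role as an iteration curve: it records how the best objective value evolves as the search proceeds.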
As illustrated in Figure 4, the traffic prediction results of the DOA-MSDI-CrossLinear model vary across the devices in the dataset.
Figure 4 presents a comparison between the traffic values predicted by the DOA-MSDI-CrossLinear model and the actual traffic during the corresponding time period. The traffic monitoring spanned a continuous 24 h period. These cases show that the proposed model can accurately predict the overall traffic changes of devices. This capability allows DOA-MSDI-CrossLinear to be used in IIoT for precise forecasting of future device network traffic, thereby supporting timely resource allocation. In addition, the proposed model generates relatively precise predictions for devices with entirely distinct traffic fluctuations, suggesting that it distinguishes well between different devices.
Figure 4 demonstrates the prediction accuracy of the DOA-MSDI-CrossLinear model for three devices exhibiting typical flow characteristics: (a) Device A displays strong 24 h periodicity; (b) Device B exhibits irregular load peaks; (c) Device C shows gradual trend drift. These three devices are used below to show how the model performs against real-world challenges in industrial networks.
Device A: This device follows a predictable 24 h cycle—much like an assembly line executing the same production plan daily. Our model tracks these daily rhythms with less than 5% error. The Multi-Scale Decomposition Module (MDM) “learns” device routines and makes forward predictions. This represents the ideal scenario—when behavior is predictable, the model excels.
Device B: This device experiences sudden, irregular, and unpredictable traffic surges. Even during such chaotic moments, the model maintains high accuracy. The Dual Dependency Interaction Module (DDI) models interactions between devices, and these seemingly random spikes are often triggered by events at other network nodes. By capturing cross-device correlations, the model anticipates fluctuations that appear unpredictable in isolation. However, a key limitation exists: predictions of the most intense traffic surges lag by 1–2 h. The model can only identify these peaks after they occur, not predict them in advance. This reflects a fundamental constraint: truly abrupt changes lack learnable patterns.
Device C: This case presented a challenge: it exhibited both sudden spikes and a slow drift in baseline traffic (possibly due to device aging or plant expansion). The model addressed both phenomena simultaneously. MDM separated slow trends from rapid spikes across different timescales, while DDI adapted to the gradual evolution of “normal” behavior. Together, they address what we call non-stationary dynamics—scenarios where statistical properties evolve over time.
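The peak lag noted for Device B can be quantified by shifting the predicted series against the actual one and finding the shift with the smallest error. The sketch below is one simple way to do this; the function name and the synthetic hourly series are assumptions for illustration.

```python
def estimate_lag(actual, predicted, max_lag):
    """Return the delay (in samples) of `predicted` relative to `actual`.

    For each candidate lag, predicted[t + lag] is compared with actual[t];
    the lag with the lowest mean squared difference is returned.
    """
    best_lag, best_err = 0, float("inf")
    for lag in range(max_lag + 1):
        pairs = list(zip(actual, predicted[lag:]))
        err = sum((a - p) ** 2 for a, p in pairs) / len(pairs)
        if err < best_err:
            best_lag, best_err = lag, err
    return best_lag


# Synthetic hourly traffic with one surge; the "prediction" reacts two hours late.
actual = [10.0] * 48
actual[30] = 50.0
predicted = [actual[0]] * 2 + actual[:-2]

print(estimate_lag(actual, predicted, max_lag=6))  # lag in hours for hourly data
```

Applied to windows around observed surges, such an estimate would indicate whether the delay is stable across devices or depends on surge intensity.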
The key achievement is that a single model, using identical parameter settings, successfully adapts to three devices with vastly different behaviors. In real factories with hundreds of devices, manually tuning an individual model for each unit is impractical. This cross-device generalization capability, which requires no device-specific customization, is fundamental to achieving scalability. All devices exhibit a 1–2 h prediction lag for sudden peaks. Time series models identify patterns in historical data, but truly sudden surges that arrive without warning lack discernible patterns. For such cases, one option is to abandon precise prediction of peak timing and instead integrate an anomaly detection module that flags "high-risk periods" when conditions favor a peak. This shifts the focus of prediction toward risk assessment.
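One simple way an anomaly detection module could mark such periods is a rolling z-score on recent traffic. The sketch below is an assumed illustration, not a component of the proposed model. The window length, threshold, and synthetic series are hypothetical.

```python
import statistics


def flag_high_risk(series, window=24, threshold=3.0):
    """Flag points that exceed the recent mean by more than `threshold` standard deviations.

    Only the preceding `window` samples are used, so the rule can run online.
    Points without a full window of history are never flagged.
    """
    flags = []
    for i, x in enumerate(series):
        history = series[max(0, i - window):i]
        if len(history) < window:
            flags.append(False)
            continue
        mean = statistics.mean(history)
        spread = statistics.pstdev(history)
        flags.append(spread > 0 and (x - mean) / spread > threshold)
    return flags


# Synthetic hourly traffic with mild variation and one surge at hour 40.
traffic = [10.0 + (t % 3) for t in range(48)]
traffic[40] = 60.0

flags = flag_high_risk(traffic)
print([t for t, flagged in enumerate(flags) if flagged])
```

A rule like this reacts to the onset of a surge rather than forecasting it. A risk-oriented variant would apply the same idea to leading indicators, such as traffic at correlated devices.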
We decompose prediction errors into the following three categories:
(1) Scale mismatch error (34% of total error): Occurs during production mode transitions (e.g., shift handover). The root cause is that fixed decomposition windows cannot adapt to irregular scheduling. Adopting adaptive windows can reduce this error.
(2) Cross-scale propagation error (28%): Hourly anomalies (e.g., equipment failures) trigger cascading effects in daily forecasts. The CrossLinear model assumes smooth propagation and ignores sudden failures, leading to this error. Adding an anomaly detection layer can reduce propagation errors.
(3) Model capacity error (38%): Occurs during unprecedented traffic patterns (e.g., new equipment integration). The root cause is that the training data lack extreme scenarios. Augmenting the training set with synthetic sudden-change samples can reduce this error.
Overall, the model exhibits graceful confidence decay under stress, enabling risk-aware decision-making in industrial control systems.