Article

A Bayesian-Optimized Mixture of Experts Framework for Short-Term Traffic Flow Prediction

1 School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou 341000, China
2 School of Information Engineering, Wenzhou Business College, Wenzhou 325000, China
* Author to whom correspondence should be addressed.
Modelling 2026, 7(2), 55; https://doi.org/10.3390/modelling7020055
Submission received: 2 February 2026 / Revised: 9 March 2026 / Accepted: 11 March 2026 / Published: 16 March 2026

Abstract

Accurate and reliable short-term traffic flow prediction is crucial for managing urban congestion but is challenged by the complex spatio-temporal dependencies inherent in traffic systems. Conventional single models, such as Long Short-Term Memory (LSTM) and Temporal Convolutional Network (TCN), often fail to capture these nonlinear dynamics. To address this, we propose a novel Bayesian-Optimized Mixture of Experts (BO-MoE) framework. This hybrid architecture utilizes a Mixture of Experts (MoE) to dynamically integrate multiple specialized deep learning models, allowing it to adapt to diverse and complex traffic patterns. Bayesian Optimization (BO) is further integrated to automate hyperparameter tuning, significantly enhancing predictive accuracy and model efficiency. We evaluated BO-MoE on three real-world traffic datasets. Empirical results demonstrate that our model consistently outperforms strong baselines, including TCN. Specifically, on PEMS04, it reduces MAE, RMSE, and MAPE by 1.97%, 1.19%, and 3.23%, respectively, while on PEMS08, the corresponding reductions reach 3.83%, 1.26%, and 5.49%. On the NZ dataset, BO-MoE also achieves superior performance, with improvements comparable to those on PEMS benchmarks.

1. Introduction

In recent years, the rapid growth in vehicle ownership has exacerbated traffic congestion in many regions. As a critical component of modern transportation systems, road networks play a vital role in supporting regional connectivity, logistics, and daily mobility. Traffic flow prediction, as one of the core technologies for analyzing traffic conditions, is essential for real-time traffic monitoring, congestion management, and informed traffic control.
Traffic flow characterizes the movement of vehicles across time and space within a transportation network and is typically described by key parameters such as speed, traffic volume, and vehicle density [1]. Different traffic environments, such as urban roads, arterial roads, and highways, exhibit distinct characteristics in terms of traffic volume, travel speed, intersection density, and trip distances. Additionally, traffic flow often demonstrates complex temporal patterns, posing significant challenges for accurate prediction and modeling. Short-term traffic flow prediction refers to the task of estimating vehicle flow within upcoming time intervals, typically ranging from 5 min to 3 h, based on recent dynamic traffic data.
In traffic flow prediction, spatial and temporal dependencies are critical factors that directly impact the predictive performance of models [2]. Spatial dependencies arise from the network topology formed by interconnected traffic monitoring nodes and the road segments linking them. Traffic conditions at one node are strongly correlated with those at neighboring nodes, and changes at upstream locations can propagate downstream through transmission effects. For example, a disruption in upstream traffic may induce congestion or changes in downstream flow patterns.
Temporal dependencies reflect the influence of multi-scale time patterns on traffic flow dynamics. These include intra-day variations, such as differences between workdays and holidays, as well as fluctuations during morning and evening peak hours. Additionally, there are sudden disruptions, including traffic accidents or abrupt weather events, that affect road capacity. Consequently, traffic flow exhibits notable regularity, periodicity, and volatility across different time frames. Furthermore, these temporal patterns exhibit distinct trends and cyclic behaviors at various time scales.
Various temporal scales, including hourly, daily, and weekly, reveal distinct patterns. The hourly scale captures short-term fluctuations, while the daily scale highlights the pronounced morning and evening peak periods. In contrast, the weekly scale showcases significant variations in traffic volumes between weekdays and weekends. These multi-scale characteristics underscore the need for predictive models that can simultaneously capture both short-term variations and long-term trends, an ability often lacking in traditional prediction methods.
Traditional models often encounter bottlenecks when handling nonlinearity and complex interactions, such as the vanishing gradient issue in Long Short-Term Memory (LSTM) networks and the locality constraints of Temporal Convolutional Networks (TCNs). To overcome these limitations, this study proposes a Bayesian-Optimized Mixture of Experts (BO-MoE) model. Specifically, Bayesian Optimization (BO) is utilized to identify optimal hyperparameters, thereby improving the adaptability and performance of the model. The Mixture of Experts (MoE) architecture dynamically fuses three representative time-series prediction models, Bidirectional Long Short-Term Memory (BiLSTM), TCN, and Transformer, through a learnable gating mechanism. Each expert contributes distinct advantages: BiLSTM captures long-range temporal dependencies, TCN effectively extracts local sequential features, and the Transformer models global sequence interactions via self-attention. Their complementary strengths provide a robust foundation for dynamic integration, enabling the proposed framework to model the complex and nonlinear characteristics of traffic flow in a comprehensive manner.
The structure of this paper is organized as follows. Section 2 provides a brief overview of the background and challenges associated with traffic flow prediction, along with a review of recent advancements in the field. Section 3 provides a detailed introduction to the proposed BO-MoE model. Section 4 presents a comprehensive evaluation of the model’s predictive performance using publicly available traffic datasets, with comparisons against several baseline approaches. This section also describes the experimental setup, including data sources, computational environment, and Hyperparameter Optimization (HPO) strategies. Finally, Section 5 presents an in-depth analysis of the experimental results, summarizes the main findings, and outlines potential directions for future research.

2. Literature Review

Traffic flow prediction methods are broadly categorized into model-driven and data-driven approaches. Traditional model-driven methods rooted in statistical principles use predefined mathematical structures to capture traffic dynamics. Linear models like the Autoregressive Integrated Moving Average (ARIMA) serve as established benchmarks, with seasonal variants (SARIMA) demonstrating strong performance even with limited data [3,4]. Hybrid approaches have sought to enhance these models by separating linear and nonlinear components, thereby improving accuracy in scenarios with missing data [5].
The limitations of model-driven approaches have catalyzed a shift towards data-driven methods, particularly deep learning, which has become the dominant paradigm in the field. These models excel by autonomously learning intricate patterns directly from data, bypassing the need for rigid assumptions. Early deep learning applications focused on temporal dependencies. For instance, Recurrent Neural Network (RNN) variants, such as LSTM networks [6], demonstrated effectiveness in capturing long-term temporal patterns from time-series traffic flow data. Similarly, Gated Recurrent Units (GRU) [7] and their bidirectional counterparts, BiLSTM [8], were successfully applied to traffic prediction tasks, proving their capacity for sequence modeling. However, these initial models largely neglected the critical influence of spatial dependencies inherent in traffic networks.
Subsequent research focused on integrating spatial features to create more holistic spatio-temporal representations. Initial efforts included using autoencoders with LSTM (AE-LSTM) to incorporate data from adjacent road segments or leveraging Convolutional Neural Networks (CNNs) to extract grid-like spatio-temporal features [9,10]. More advanced frameworks, such as the Sequential-Periodic Network (SPN), have employed attention mechanisms to dynamically weigh the importance of different spatial and temporal inputs or have utilized a TCN to efficiently capture temporal evolution [11,12].
The introduction of Graph Neural Networks (GNNs) marked a significant milestone, allowing models to explicitly represent the topological structure of road networks. Seminal works like the Attention-Based Spatio-Temporal Graph Convolutional Network (ASTGCN) and Spatial-Temporal Graph Attention Networks (ST-GAT) set a new standard [13,14]. However, these models often rely on a predefined, static graph structure, which limits their adaptability to dynamic traffic conditions and network changes. To address this, recent research has explored adaptive graph learning mechanisms. Architectures like Graph WaveNet, Dynamic Graph Convolutional Network (DGCN), and Principal Graph Embedding Convolutional Recurrent Network (PGECRN) learn the graph structure directly from the data, enabling them to capture hidden spatial dependencies and adapt to data distribution drift [15,16,17]. However, even with adaptive graphs, a fundamental challenge persists: conventional Graph Convolutional Network (GCN) modules mainly capture spatial dependencies among immediate neighbors. While stacking multiple layers can theoretically model multi-hop relationships, it often leads to well-documented issues such as over-smoothing and low-frequency bias. Addressing this, the Contrastive Learning-based Deeper Graph Convolutional Network (CL-DGCN) introduces a hyper-aggregation-based learnable message aggregation mechanism to mitigate over-smoothing and enhance the modeling of long-range spatial dependencies [18].
Beyond the complexities of spatial modeling, another critical challenge lies in handling noisy sensor observations and the epistemic uncertainty that permeates spatio-temporal scales. Kalman filtering (KF), a recursive state-space model, is widely employed for dynamic state estimation and is particularly effective at mitigating measurement noise in streaming traffic data [19]. Nevertheless, a standalone Kalman filter cannot fully capture the complex nonlinear and spatio-temporal dependencies of traffic systems. This has motivated hybrid frameworks such as LSTM–Kalman models, which combine LSTM’s nonlinear temporal modeling with KF for recursive state estimation and noise reduction. More recently, PI-GRNN has advanced this direction by embedding physical constraints into spatio-temporal modeling and leveraging Kalman filtering for enhanced uncertainty quantification [20,21].
While technically sophisticated, the models above face two persistent challenges. First, their monolithic architectures struggle to handle the profound spatio-temporal heterogeneity of traffic flow. The MoE framework offers a compelling solution by routing inputs to specialized sub-models [22]. Recent applications, such as Spatio-Temporal MoE (ST-MoE) and Congestion Prediction Mixture of Experts (CP-MOE), have demonstrated that MoE can reduce predictive bias and improve robustness by assigning different traffic patterns to the most suitable expert [23,24].
The second challenge is the daunting complexity of hyperparameter configuration. The performance of these intricate hybrid models is highly sensitive to their initial parameters, often demanding extensive and inefficient manual tuning. Automated HPO has emerged as a solution. While early methods used Genetic Algorithms (GA) or Particle Swarm Optimization (PSO), these often suffer from slow convergence. BO has proven to be a more sample-efficient and effective HPO technique, leveraging a surrogate model to intelligently explore the parameter space [25,26].
Despite the individual successes of MoE frameworks and BO, the synergistic integration of these two paradigms remains largely unexplored. State-of-the-art models are powerful but difficult to tune, while MoE provides a flexible architecture to handle heterogeneity but adds another layer of complexity. To our knowledge, few studies have utilized the efficiency of BO to systematically optimize a sophisticated MoE framework for traffic flow prediction. This paper aims to bridge this gap by proposing a hybrid framework that leverages BO to configure a MoE architecture composed of diverse time-series experts. This approach is designed to enhance model adaptability and performance while efficiently navigating the high-dimensional complexity of spatio-temporal traffic data.

3. Methods

3.1. Problem Definition

The primary objective of this study is to predict short-term traffic flow from historical data. The traffic network is defined as a directed graph G = (V, E), where V = {v_1, v_2, …, v_N} is a set of N nodes representing sensor locations, and E is a set of edges representing the connectivity between these nodes.
The traffic state at any time step t is represented by a feature matrix X_t ∈ R^(N×C), where C is the number of features. Given historical traffic data over a past period of p time steps, X = (X_{t−p+1}, …, X_t), the task is to learn a mapping function f that predicts the traffic flow for the next s time steps, denoted as Y = (Y_{t+1}, …, Y_{t+s}). This can be formally expressed as:
Y = f(X; θ)
where Y denotes the predicted traffic values, and θ denotes the learnable parameters of the model.

3.2. Multi-Scale Traffic Flow Prediction Using BO-MoE

To effectively capture the complex spatio-temporal dependencies in traffic flow, we propose the BO-MoE framework. This framework is built around two core components: a MoE module for robust spatio-temporal feature modeling and a BO module for automatic hyperparameter tuning. The overall architecture is illustrated in Figure 1.

3.2.1. Input Feature Layer

To capture traffic flow dynamics across multiple temporal scales, we partition the historical data into three distinct temporal segments, following the multi-scale division method of [13]: an adjacent segment (X_a), a daily segment (X_d), and a weekly segment (X_w). These segments are constructed by sampling historical data from the most recent period, the same time periods on previous days, and the same time periods on previous weeks, respectively.
The three resulting input tensors, X_a ∈ R^(p_a×N×C), X_d ∈ R^(p_d×N×C), and X_w ∈ R^(p_w×N×C), are concatenated along the temporal dimension to form the final model input X, where p_a, p_d, and p_w are the numbers of time steps for each respective scale. This is expressed as:
X = Concatenate(X_a, X_d, X_w)
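As a minimal sketch of this multi-scale slicing, the three segments can be cut from a single history tensor and concatenated along the temporal axis. The helper below is illustrative only (the function name and segment lengths are our assumptions), using 5 min intervals, i.e., 288 steps per day:

```python
import numpy as np

def build_multiscale_input(history, t, p_a=12, p_d=12, p_w=12,
                           day=288, week=288 * 7):
    """Slice adjacent, daily, and weekly segments ending at step t.

    history: array of shape (T, N, C); day/week give the number of
    5 min steps in one day/week (288 and 2016 for the PEMS datasets).
    """
    x_a = history[t - p_a:t]              # most recent p_a steps
    x_d = history[t - day - p_d:t - day]  # same window on the previous day
    x_w = history[t - week - p_w:t - week]  # same window one week earlier
    # Concatenate along the temporal axis to form the model input X
    return np.concatenate([x_a, x_d, x_w], axis=0)

# toy check: 3 weeks of data from 5 sensors with 1 feature
hist = np.random.rand(288 * 21, 5, 1)
X = build_multiscale_input(hist, t=288 * 20)
print(X.shape)  # (36, 5, 1)
```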

3.2.2. MoE Module

The MoE module, illustrated in Figure 1a, forms the predictive core of our framework. The multi-scale input features X are fed into M expert models, and a gating network adaptively weights their respective outputs. Each expert is designed to capture distinct temporal dynamics: BiLSTM models long-range dependencies, TCN extracts local sequential features and short-term fluctuations, and Transformer captures sequence-wide interactions.
The gating network employs a two-layer feed-forward network to compute the expert weights g(X):
g(X) = Softmax(W_2 · Dropout(ReLU(W_1 X + b_1), r) + b_2)
where g(X) is the vector of gating weights, representing the probability distribution over M experts, W_1 and W_2 are learnable weight matrices, b_1 and b_2 are learnable bias vectors, and r is the dropout rate.
Let H_i(X) be the output representation from the i-th expert model. This representation is passed through a Feed-Forward Network (FFN) to produce the final output of the expert, E_i(X). The FFN performs feature transformation and dimensionality reduction to obtain task-specific representations. Formally, the expert output is defined as:
E_i(X) = FFN_i(H_i(X)), i = 1, 2, 3
where FFN_i denotes the feed-forward network associated with the i-th expert.
The final output of the MoE module, Ŷ, is the weighted sum of all expert outputs:
Ŷ = Σ_{i=1}^{M} g_i(X) · E_i(X)
where g_i(X) is the i-th element of g(X), so each expert is dynamically weighted according to the input.
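The gating computation and weighted expert fusion can be sketched with NumPy. The toy experts and dimensions below are illustrative assumptions only, standing in for the actual BiLSTM, TCN, and Transformer experts, and dropout is omitted as it would be at inference time:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

def gate(x_flat, W1, b1, W2, b2):
    """Two-layer gating network: Softmax(W2 ReLU(W1 x + b1) + b2)."""
    h = np.maximum(0.0, W1 @ x_flat + b1)  # ReLU hidden layer
    return softmax(W2 @ h + b2)            # weights over M experts

# hypothetical expert outputs E_i(X): each maps the input to a 12-step forecast
experts = [lambda x: x.mean() + np.zeros(12),
           lambda x: x.max() + np.zeros(12),
           lambda x: x.min() + np.zeros(12)]

d, hidden, M = 36, 16, 3
W1, b1 = rng.normal(size=(hidden, d)), np.zeros(hidden)
W2, b2 = rng.normal(size=(M, hidden)), np.zeros(M)

x = rng.normal(size=d)
g = gate(x, W1, b1, W2, b2)
y_hat = sum(g[i] * experts[i](x) for i in range(M))  # Ŷ = Σ g_i(X)·E_i(X)
print(g.shape, y_hat.shape)  # (3,) (12,)
```

The gating weights always sum to one, so the fused forecast is a convex combination of the expert forecasts.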

3.2.3. BO Algorithm

The proposed MoE architecture, with its multiple heterogeneous experts, presents a high-dimensional and intricately coupled hyperparameter space that makes manual tuning inefficient. To address this, a BO module is integrated into our framework, BO-MoE, as shown in Figure 1b, to automate the search for optimal hyperparameter configurations.
The BO algorithm aims to find the hyperparameter vector x that minimizes the objective function f(x), where x includes model structural parameters and training parameters, such as the dropout rate r. The objective function, typically defined as a validation loss metric, quantifies the overall performance of the MoE model.
BO consists of a probabilistic surrogate model and an acquisition function. A Gaussian Process (GP) is used as the surrogate model to approximate f(x). The GP defines a prior over functions, assuming that for any finite set of points, the corresponding function values follow a joint Gaussian distribution:
f ~ GP(m(x), k(x, x′))
where m(x) is the mean function, typically set to zero, and k(x, x′) is a scalar-valued kernel function that defines the covariance between two hyperparameter vectors x and x′. In this study, the Radial Basis Function (RBF) kernel is adopted:
k(x, x′) = σ_f² exp(−(1/2) Σ_{i=1}^{n} (x_i − x′_i)² / l_i²)
where σ_f² is the global variance, and l_i is the length-scale parameter of the i-th dimension, controlling its sensitivity. The kernel hyperparameters are optimized by maximizing the marginal log-likelihood of the observed data.
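As a small illustration of this per-dimension (ARD-style) RBF kernel, a sketch rather than the paper's implementation, identical inputs yield the maximal covariance σ_f², and covariance decays as inputs move apart:

```python
import numpy as np

def rbf_ard(x, x2, sigma_f=1.0, lengthscales=None):
    """k(x, x') = sigma_f^2 * exp(-0.5 * sum_i (x_i - x'_i)^2 / l_i^2)."""
    x, x2 = np.asarray(x, float), np.asarray(x2, float)
    # unit length-scales by default; one l_i per hyperparameter dimension
    l = np.ones_like(x) if lengthscales is None else np.asarray(lengthscales, float)
    return sigma_f**2 * np.exp(-0.5 * np.sum((x - x2)**2 / l**2))

# identical inputs give the maximal covariance sigma_f^2
print(rbf_ard([0.1, 2.0], [0.1, 2.0]))  # 1.0
```

A large l_i makes the kernel insensitive to the i-th hyperparameter, which is how the GP learns which dimensions matter.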
At each iteration, an acquisition function is used to determine the next hyperparameter set to evaluate. Its core purpose is to balance exploration and exploitation by sampling in regions of high uncertainty and near known high-performing solutions, respectively. We employ the Expected Improvement (EI) function, defined as:
EI(x) = E[max(f(x_best) − f(x), 0)]
where x_best denotes the best observed hyperparameter configuration, and f(x_best) is its corresponding objective function value. Given the predictive posterior distribution of the GP at point x, with mean μ(x) and standard deviation σ(x), the EI can be computed in closed form:
EI(x) = (f(x_best) − μ(x)) Φ(z) + σ(x) ϕ(z)
where z = (f(x_best) − μ(x)) / σ(x), and Φ(z) and ϕ(z) are the cumulative distribution function and probability density function of the standard normal distribution, respectively. The overall BO-MoE procedure is summarized in Algorithm 1.
Algorithm 1. BO-MoE for Traffic Prediction
Input: Traffic flow data X, Hyperparameter space x, prediction horizon s, iterations T.
Output: Predicted traffic flow Y ^ .
Procedure:
Initialize the GP surrogate model
   for i = 1 to T do
    //GP posterior prediction
    Compute posterior mean μ(x) and standard deviation σ(x)
    //Select hyperparameters using EI
    x_i = argmax_x Expected_Improvement(x | GP)
    //Build and train MoE model with selected hyperparameters
    MoE_model = Build_MoE(x_i)
    MoE_model = Train_MoE(Model = MoE_model, Data = X)
    //Evaluate model performance
    f(x_i) = Evaluate_MAE(MoE_model, Validation_Data)
    //Update surrogate model
    GP = Update_GP(GP, x_i, f(x_i))
   end for
   //Select optimal hyperparameters
   x_best = argmin_i f(x_i)
   //Build and train final model
   Final_MoE = Build_MoE(x_best)
   Final_MoE = Train_MoE(Model = Final_MoE, Data = X)
   //Generate predictions
   Ŷ = Predict(Final_MoE, s)
   return Ŷ
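The closed-form EI used in the selection step of Algorithm 1 follows directly from μ(x), σ(x), and f(x_best). The sketch below uses only the standard library and illustrative numbers:

```python
import math

def expected_improvement(mu, sigma, f_best):
    """Closed-form EI for minimization, given the GP posterior at x."""
    if sigma <= 0.0:
        # degenerate posterior: improvement is deterministic
        return max(f_best - mu, 0.0)
    z = (f_best - mu) / sigma
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # std normal CDF
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)   # std normal PDF
    return (f_best - mu) * Phi + sigma * phi

# a point predicted well below the incumbent loss has high EI
print(round(expected_improvement(mu=0.5, sigma=0.1, f_best=1.0), 4))  # 0.5
```

Points whose predicted loss is far above f(x_best) receive near-zero EI, while uncertain regions retain some EI, which is exactly the exploration–exploitation balance described above.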
Table 1 lists the hyperparameters optimized by BO, including learning rate, dropout rate, MoE hidden dimension, and expert-specific parameters, along with their optimal values.
The BO algorithm is executed for 30 trials to identify the optimal hyperparameter configuration. In each trial, the model is trained for 60 epochs on the training set without early stopping. After each epoch, performance is evaluated on the validation set using an objective function. The objective value from each trial guides the subsequent optimization process.

4. Results and Discussion

4.1. Evaluation Metrics

To evaluate the performance of the proposed BO-MoE framework, four widely used evaluation metrics are adopted: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and the Coefficient of Determination (R²). Specifically, MAE and RMSE measure the discrepancy between predicted and actual values, reflecting the overall prediction error of the model. MAPE expresses the relative error as a percentage, while R² evaluates the goodness-of-fit. The metrics are defined as follows:
MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|
RMSE = √((1/n) Σ_{i=1}^{n} (y_i − ŷ_i)²)
MAPE = (1/n) Σ_{i=1}^{n} |(y_i − ŷ_i)/y_i| × 100%
R² = 1 − (Σ_{i=1}^{n} (ŷ_i − y_i)²) / (Σ_{i=1}^{n} (ȳ − y_i)²)
where n is the total number of samples, y i is the observed value, y ^ i is the corresponding predicted value, and y ¯ is the mean of the observed values.
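The four metrics translate directly into code; this plain-Python sketch mirrors the definitions above (the toy values are illustrative only):

```python
import math

def mae(y, y_hat):
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def mape(y, y_hat):
    # relative error as a percentage; assumes observed values are nonzero
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)

def r2(y, y_hat):
    y_bar = sum(y) / len(y)
    ss_res = sum((b - a) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((y_bar - a) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

y, y_hat = [100.0, 200.0, 300.0], [110.0, 190.0, 310.0]
print(mae(y, y_hat), round(rmse(y, y_hat), 2), round(mape(y, y_hat), 2))
# 10.0 10.0 6.11
```

Note that MAPE is undefined when an observed value y_i is zero, which is one reason traffic studies report it alongside the absolute-error metrics.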

4.2. Dataset

We evaluate our method on three benchmark traffic datasets: two from the Caltrans Performance Measurement System (PeMS) in California, USA, namely PEMS04 and PEMS08, and one from the New Zealand Transport Agency (NZ) [27]. All datasets provide traffic flow measurements aggregated at regular intervals. In this study, we focus exclusively on traffic flow, which represents the total vehicle volume recorded during each interval.
The PEMS datasets include traffic flow, occupancy, and speed measurements collected from highway sensor stations across California. Data are officially aggregated at 5 min intervals, yielding 288 data points per sensor per day. PEMS04 comprises data from 307 sensors over 59 consecutive days (1 January to 28 February 2018). PEMS08 comprises data from 170 sensors over 62 consecutive days (1 July to 31 August 2016).
The NZ dataset contains traffic flow measurements from highway sensors across New Zealand, aggregated at 15 min intervals. We use data from 1 July to 31 December 2020 (184 consecutive days), originally collected from 2042 sensors. For data preprocessing, sensors with more than 1% missing values were excluded, and remaining missing entries were imputed using a Random Forest-based method that leverages spatial correlations among neighboring sensors. After preprocessing, 313 sensors from the NZ dataset were retained as graph nodes for constructing the spatial graph. The adjacency matrix is built based on the Euclidean distance between sensor locations, with edges retained only for sensor pairs within a distance threshold. All data were normalized to zero mean and unit variance prior to model input.
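The distance-thresholded adjacency construction can be sketched as follows; the coordinates and threshold here are toy values, and the actual graph uses the distances between real sensor locations:

```python
import numpy as np

def distance_adjacency(coords, threshold):
    """Binary adjacency: connect sensor pairs within a distance threshold."""
    coords = np.asarray(coords, float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))     # pairwise Euclidean distances
    A = (dist <= threshold).astype(float)   # keep only nearby pairs
    np.fill_diagonal(A, 0.0)                # no self-loops
    return A

# three sensors on a line; only the pair within 1.5 units is connected
A = distance_adjacency([[0, 0], [1, 0], [3, 0]], threshold=1.5)
print(A)  # row 0 connects only to row 1
```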

4.3. Implementation Details

All experiments were implemented using PyTorch with Python 3.10 and CUDA 12.1, and conducted on an NVIDIA RTX 4090 GPU (NVIDIA Corporation, Santa Clara, CA, USA). The datasets were split into 60% for training, 20% for validation, and 20% for testing. Feature normalization was applied using the mean and standard deviation derived from the training set. The model was optimized with the Adam optimizer, with MAE as the loss function. To capture multi-scale temporal dependencies, three historical input sequences of different lengths were used, with a prediction horizon of s = 12 steps.
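A minimal sketch of the chronological 60/20/20 split with normalization statistics taken from the training portion only (the helper name is ours, not from the paper's code):

```python
import numpy as np

def split_and_normalize(series, train=0.6, val=0.2):
    """Chronological 60/20/20 split; z-score using training statistics only."""
    n = len(series)
    i, j = int(n * train), int(n * (train + val))
    tr, va, te = series[:i], series[i:j], series[j:]
    mu, sd = tr.mean(), tr.std()
    norm = lambda x: (x - mu) / sd   # same transform applied to all splits
    return norm(tr), norm(va), norm(te), (mu, sd)

data = np.arange(100, dtype=float)
tr, va, te, (mu, sd) = split_and_normalize(data)
print(len(tr), len(va), len(te))  # 60 20 20
```

Deriving μ and σ from the training split alone avoids leaking information from the validation and test periods into the model inputs.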

4.4. Comparative Analysis

To evaluate the effectiveness of different expert combination strategies within the MoE framework, we conducted comparative experiments on the PEMS04, PEMS08, and NZ datasets. The details of the expert combinations are provided in Table 2.
Table 3 presents the evaluation results of different MoE combinations over a 12-step prediction horizon. On the PEMS04 dataset, MoE_d and MoE_e achieved the lowest evaluation metrics, indicating superior performance among the initial, non-optimized configurations. Similarly, on PEMS08, MoE_d and MoE_h yielded the most favorable results. In contrast, on the NZ dataset, no single combination demonstrated consistent dominance. Based on their robust performance on PEMS04 and PEMS08, MoE_d, MoE_e, and MoE_h were selected for further hyperparameter tuning via BO.
Following the optimization, we conducted comprehensive comparative experiments to evaluate the performance of our proposed models. The experiments benchmarked our optimized variants against a wide array of baseline models, which include classical deep learning models such as LSTM, GRU, BiLSTM, TCN, Transformer, as well as several state-of-the-art spatio-temporal graph neural networks: ASTGCN, DGCN, and PGECRN. To ensure a fair comparison, all models were evaluated using a consistent multi-scale temporal division method [13]. This method is crucial for capturing the complex temporal dynamics of traffic flow. The detailed results of these comparisons are presented in Table 4.
Table 4 shows that the proposed BO-MoE_h achieves the best performance on most metrics across all three datasets. Compared to the strongest baseline, TCN, BO-MoE_h consistently obtains lower errors. On PEMS04, it reduces MAE, RMSE, and MAPE by 1.97%, 1.19%, and 3.23%, respectively; on PEMS08, the reductions are 3.83%, 1.26%, and 5.49%, respectively. A notable exception is observed on the NZ dataset, which has a longer 15 min sampling interval. On this dataset, the classic LSTM model records a higher R² value. This observation suggests that the efficacy of our model might be sensitive to the data sampling frequency. This is likely because data sparsity at this interval exposes our complex model to overfitting, while the simpler LSTM robustly captures the underlying trend. Despite this, BO-MoE_h maintains a clear advantage in terms of MAE, RMSE, and MAPE on the NZ dataset. Given its outstanding overall performance, BO-MoE_h is identified as our final proposed model.
While certain models like BiLSTM and TCN exhibit competitive results, BO-MoE_h consistently demonstrates superior accuracy and stability, confirming its effectiveness in capturing complex spatio-temporal dependencies. The model's performance gain comes from its design, which effectively combines multi-scale temporal analysis with an adaptive gating mechanism. Unlike other models that use fixed combinations, our MoE framework offers a more adaptive fusion strategy. This design allows the model to explicitly learn from different temporal patterns, making it better at handling complex traffic changes and improving its overall accuracy.

5. Conclusions

In this study, we introduced a MoE architecture, optimized via BO, for multi-step traffic flow prediction. The strength of the proposed model lies in its ability to capture complex spatio-temporal dependencies by leveraging multi-scale temporal features and a gating network. Our empirical evaluations confirm that this approach yields superior accuracy, stability, and generalization compared to state-of-the-art baseline models.
Despite its strong performance, limitations include a sensitivity to coarser data sampling intervals and a lack of full interpretability in the expert gating mechanism. Future work will therefore focus on improving model robustness to varied data resolutions, enhancing interpretability, and exploring more advanced expert allocation strategies to better support intelligent traffic management systems.

Author Contributions

Conceptualization and methodology, J.W. and J.R.; investigation, data curation and writing—original draft preparation, J.W. and J.R.; writing—review and editing, J.W. and H.W.; visualization, S.C. and M.J.; supervision, J.W. and H.W.; funding acquisition, F.X. and H.W. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially funded by the National Natural Science Foundation of China (NSFC) under Grant No. 62366016. Additional support was provided by the Doctoral Scientific Research Foundation of Jiangxi University of Science and Technology (Grant No. 205200100591) and the Jiangxi Provincial Key Laboratory of Multidimensional Intelligent Perception and Control (Grant No. 2024SSY03161).

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yin, X.; Wu, G.; Wei, J.; Shen, Y.; Qi, H.; Yin, B. Deep Learning on Traffic Prediction: Methods, Analysis and Future Directions. IEEE Trans. Intell. Transp. Syst. 2022, 23, 4927–4943. [Google Scholar] [CrossRef]
  2. Song, C.; Lin, Y.; Guo, S.; Wan, H. Spatial-Temporal Synchronous Graph Convolutional Networks: A New Framework for Spatial-Temporal Network Data Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; PKP Publishing Services Network: Burnaby, BC, Canada, 2020; pp. 914–921. [Google Scholar]
  3. Smith, B.L.; Williams, B.M.; Oswald, R.K. Comparison of Parametric and Nonparametric Models for Traffic Flow Forecasting. Transp. Res. Part C Emerg. Technol. 2002, 10, 303–321. [Google Scholar] [CrossRef]
  4. Kumar, S.V.; Vanajakshi, L. Short-Term Traffic Flow Prediction Using Seasonal ARIMA Model with Limited Input Data. Eur. Transp. Res. Rev. 2015, 7, 21. [Google Scholar] [CrossRef]
  5. Xu, C.; Li, Z.; Wang, W. Short-Term Traffic Flow Prediction Using a Methodology Based on ARIMA and Genetic Programming. Transport 2016, 31, 343–358. [Google Scholar] [CrossRef]
  6. Zhao, Z.; Chen, W.; Wu, X.; Chen, P.C.Y.; Liu, J. LSTM network: A deep learning approach for short-term traffic forecast. IET Intell. Transp. Syst. 2017, 11, 68–75. [Google Scholar] [CrossRef]
  7. Fu, R.; Zhang, Z.; Li, L. Using LSTM and GRU Neural Network Methods for Traffic Flow Prediction. In Proceedings of the 31st Youth Academic Annual Conference of the Chinese Association of Automation, Wuhan, China, 11–13 November 2016; IEEE: New York, NY, USA, 2016; pp. 324–328. [Google Scholar]
  8. Abduljabbar, R.L.; Dia, H.; Tsai, P.-W. Unidirectional and Bidirectional LSTM Models for Short-Term Traffic Prediction. J. Adv. Transp. 2021, 2021, 5589075. [Google Scholar] [CrossRef]
  9. Wei, W.; Wu, H.; Ma, H. An Autoencoder and LSTM-Based Traffic Flow Prediction Method. Sensors 2019, 19, 2946. [Google Scholar] [CrossRef] [PubMed]
  10. Zhang, W.; Yu, Y.; Qi, Y.; Shu, F.; Wang, Y. Short-Term Traffic Flow Prediction Based on Spatio-Temporal Analysis and CNN. Transp. A Transp. Sci. 2019, 15, 80–91. [Google Scholar] [CrossRef]
  11. Liu, L.; Zhen, J.; Li, G.; Zhan, G.; He, Z.; Du, B.; Lin, L. Dynamic Spatio-Temporal Representation Learning for Traffic Flow Prediction. IEEE Trans. Intell. Transp. Syst. 2021, 22, 7169–7183. [Google Scholar] [CrossRef]
  12. Zhao, W.; Gao, Y.; Ji, T.; Wan, X.; Ye, F.; Bai, G. Deep Temporal Convolutional Networks for Short-Term Traffic Flow Forecasting. IEEE Access 2019, 7, 114496–114507. [Google Scholar] [CrossRef]
  13. Guo, S.; Lin, Y.; Feng, N.; Song, C.; Wan, H. Attention-Based Spatio-Temporal Graph Convolutional Networks for Traffic Flow Forecasting. In Proceedings of the AAAI Conference on Artificial Intelligence, Honolulu, HI, USA, 27 January– 1 February 2019; PKP Publishing Services Network: Burnaby, BC, Canada, 2019; pp. 922–929. [Google Scholar]
  14. Zhang, C.; Yu, J.J.Q.; Liu, Y. Spatial-Temporal Graph Attention Networks: A Deep Learning Approach for Traffic Forecasting. IEEE Access 2019, 7, 4–16. [Google Scholar] [CrossRef]
  15. Wu, Z.; Pan, S.; Long, G.; Jiang, J.; Zhang, C. Graph WaveNet for Deep Spatio-Temporal Graph Modeling. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, Macao, China, 10–16 August 2019; AAAI Press: Washington, DC, USA, 2019; pp. 1907–1913. [Google Scholar]
  16. Guo, K.; Hu, Y.; Qian, Z.; Sun, Y.; Gao, J.; Yin, B. Dynamic Graph Convolution Network for Traffic Forecasting Based on Latent Network Estimation. IEEE Trans. Intell. Transp. Syst. 2022, 23, 1009–1018. [Google Scholar] [CrossRef]
  17. Han, Y.; Zhao, S.; Deng, H.; Jia, W. Principal Graph Embedding Convolutional Recurrent Network for Traffic Flow Prediction. Appl. Intell. 2023, 53, 17809–17823. [Google Scholar] [CrossRef]
  18. Zhang, E.; Lv, Z.; Cheng, Z.; Ke, J. CL-DGCN: Contrastive Learning Based Deeper Graph Convolutional Network for Traffic Flow Data Prediction. Transp. Res. Part E Logist. Transp. Rev. 2025, 203, 104345. [Google Scholar] [CrossRef]
  19. Emami, A.; Sarvi, M.; Asadi Bagloee, S. Using Kalman filter algorithm for short-term traffic flow prediction in a connected vehicle environment. J. Mod. Transp. 2019, 27, 222–232. [Google Scholar] [CrossRef]
  20. Lin, T.; Lin, R. An Efficient Hybrid Model Combining LSTM and Kalman Filter for Real-Time Traffic Flow Prediction in Smart Transportation Systems. In Proceedings of the 2025 IEEE 15th International Conference on Signal Processing, Communications and Computing, Hong Kong, China, 18–21 July 2025; IEEE: New York, NY, USA, 2025; pp. 1–6. [Google Scholar]
  21. Deshpande, N.; Park, H. Physics-Informed Deep Learning with Kalman Filter Mixture for Traffic State Prediction. Int. J. Transp. Sci. Technol. 2025, 17, 161–174. [Google Scholar] [CrossRef]
  22. Jacobs, R.A.; Jordan, M.I.; Nowlan, S.J.; Hinton, G.E. Adaptive Mixture of Local Experts. Neural Comput. 1991, 3, 79–87. [Google Scholar] [CrossRef] [PubMed]
  23. Li, S.; Cui, Y.; Zhao, Y.; Yang, W.; Zhang, R.; Zhou, X. ST-MoE: Spatio-Temporal Mixture-of-Experts for Debiasing in Traffic Prediction. In Proceedings of the ACM International Conference on Information and Knowledge Management, Birmingham, UK, 21–25 October 2023; ACM Inc.: Nashville, TN, USA, 2023; pp. 1208–1217. [Google Scholar]
  24. Jiang, W.; Han, J.; Liu, H.; Tao, T.; Tan, N.; Xiong, H. Interpretable Cascading Mixture-of-Experts for Urban Traffic Congestion Prediction. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Barcelona, Spain, 25–29 August 2024; ACM Inc.: Nashville, TN, USA, 2024; pp. 5206–5217. [Google Scholar]
  25. Lu, X.; Chen, C.; Gao, R.; Xing, Z. Prediction of High-Speed Traffic Flow around Cities Based on BO-XGBoost Model. Symmetry 2023, 15, 1453. [Google Scholar] [CrossRef]
  26. Wang, C.; Huang, S.; Zhang, C. Short-Term Traffic Flow Prediction Considering Weather Factors Based on Optimized Deep Learning Networks. Sustainability 2025, 17, 2576. [Google Scholar] [CrossRef]
  27. Li, B.; Yu, R.; Chen, Z.; Ding, Y.; Yang, M.; Li, J.; Wang, J.; Zhong, H. High-Resolution Multi-Source Traffic Data in New Zealand. Sci. Data 2024, 11, 1216. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Overall framework of the proposed BO-MoE: (a) architecture of the MoE module; (b) workflow of the BO algorithm.
Table 1. MoE Hyperparameter Settings and Optimization Results.
| Hyperparameter | PEMS04 | PEMS08 | NZ |
|---|---|---|---|
| learning rate | 0.0005 | 0.0001 | 0.0003 |
| dropout | 0.006 | 0.004 | 0.002 |
| BiLSTM hidden | 32 | 256 | 128 |
| BiLSTM num layers | 3 | 2 | 3 |
| TCN channels | [64, 128, 256] | [96, 192, 384] | [96, 192, 384] |
| TCN kernel size | 3 | 2 | 3 |
| Transformer hidden | 64 | 64 | 128 |
| Transformer heads | 8 | 8 | 4 |
| Transformer layers | 2 | 3 | 3 |
| Expert gating hidden | 32 | 128 | 32 |
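The per-dataset values in Table 1 are the outputs of the BO tuning stage. As a minimal sketch of such a tuning loop, the snippet below uses random search as a simple stand-in for the Gaussian-process surrogate and acquisition function of full Bayesian optimization; all parameter names, ranges, and the toy objective are illustrative assumptions, not the paper's actual search space.

```python
import random

# Illustrative search space (the bounds are assumptions; Table 1 lists only
# the optimized values, not the ranges that were actually searched).
SEARCH_SPACE = {
    "learning_rate": (1e-4, 1e-3),        # continuous range
    "dropout": (0.001, 0.01),             # continuous range
    "bilstm_hidden": [32, 64, 128, 256],  # categorical choices
    "transformer_heads": [4, 8],          # categorical choices
}

def sample_config(space, rng):
    """Draw one candidate configuration from the search space."""
    cfg = {}
    for name, spec in space.items():
        if isinstance(spec, tuple):   # (low, high) -> continuous
            lo, hi = spec
            cfg[name] = rng.uniform(lo, hi)
        else:                         # list -> categorical
            cfg[name] = rng.choice(spec)
    return cfg

def tune(objective, space, n_trials=20, seed=0):
    """Return the sampled configuration with the lowest objective value.
    Random search stands in here for the surrogate-guided proposals of BO."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("inf")
    for _ in range(n_trials):
        cfg = sample_config(space, rng)
        val = objective(cfg)          # in practice: validation MAE of the MoE
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

# Toy objective standing in for a full train-and-validate cycle.
toy = lambda cfg: abs(cfg["learning_rate"] - 5e-4) + cfg["dropout"]
best, val = tune(toy, SEARCH_SPACE)
```

A real BO implementation would replace `sample_config` with proposals that maximize an acquisition function (e.g., expected improvement) under the surrogate model, but the outer select-the-best loop is the same.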
Table 2. MoE Expert Combinations.
| Abbreviation | Expert Combination |
|---|---|
| MoE_a | LSTM-TCN |
| MoE_b | LSTM-TCN-Transformer |
| MoE_c | LSTM-GRU-Transformer |
| MoE_d | GRU-TCN-Transformer |
| MoE_e | BiLSTM-GNN-Transformer |
| MoE_f | BiLSTM-GRU-TCN |
| MoE_g | BiLSTM-AGC-Transformer |
| MoE_h | BiLSTM-TCN-Transformer |
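In each combination above, a gating network weights the experts' outputs before they are fused. The sketch below shows the core softmax-gated fusion step; the function names and numeric values are hypothetical, and in the actual model the gate logits come from a learned gating network trained jointly with the experts.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_combine(expert_preds, gate_logits):
    """Fuse expert predictions as a convex combination weighted by the gate."""
    weights = softmax(gate_logits)
    return sum(w * p for w, p in zip(weights, expert_preds))

# Hypothetical example for MoE_h: BiLSTM, TCN, and Transformer predictions
# of traffic flow at one sensor and time step.
preds = [310.0, 295.0, 305.0]   # illustrative flow predictions
logits = [1.2, 0.3, 0.8]        # illustrative gating-network outputs
fused = moe_combine(preds, logits)
```

Because the gate weights sum to one, the fused prediction always lies within the range spanned by the individual experts, which is what lets the gate interpolate between specialists for different traffic regimes.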
Table 3. 12-Step Performance of MoE Expert Combinations.
Columns report MAE, RMSE, MAPE, and R² for PEMS04, PEMS08, and NZ, in that order.

| Model | MAE | RMSE | MAPE | R² | MAE | RMSE | MAPE | R² | MAE | RMSE | MAPE | R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MoE_a | 18.439 | 30.717 | 0.122 | 0.958 | 13.583 | 22.705 | 0.099 | 0.972 | 19.869 | 41.203 | 0.282 | 0.939 |
| MoE_b | 18.466 | 30.794 | 0.122 | 0.957 | 13.489 | 22.694 | 0.088 | 0.972 | 19.637 | 41.075 | 0.275 | 0.936 |
| MoE_c | 19.566 | 31.586 | 0.146 | 0.955 | 14.209 | 23.128 | 0.096 | 0.971 | 20.232 | 42.630 | 0.271 | 0.935 |
| MoE_d | 18.394 | 30.688 | 0.122 | 0.958 | 13.273 | 22.611 | 0.088 | 0.972 | 20.222 | 43.094 | 0.275 | 0.932 |
| MoE_e | 18.375 | 30.740 | 0.121 | 0.958 | 13.346 | 22.681 | 0.088 | 0.972 | 19.719 | 41.515 | 0.280 | 0.938 |
| MoE_f | 18.522 | 30.769 | 0.123 | 0.957 | 14.012 | 23.032 | 0.094 | 0.971 | 19.685 | 41.258 | 0.270 | 0.939 |
| MoE_g | 18.475 | 30.884 | 0.122 | 0.957 | 13.607 | 22.815 | 0.089 | 0.972 | 19.908 | 40.743 | 0.270 | 0.942 |
| MoE_h | 18.403 | 30.792 | 0.122 | 0.957 | 13.086 | 22.631 | 0.087 | 0.972 | 19.938 | 42.525 | 0.269 | 0.932 |
Table 4. Comparative Results on PEMS04, PEMS08 and NZ.
Columns report MAE, RMSE, MAPE, and R² for PEMS04, PEMS08, and NZ, in that order.

| Model | MAE | RMSE | MAPE | R² | MAE | RMSE | MAPE | R² | MAE | RMSE | MAPE | R² |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LSTM | 19.740 | 32.271 | 0.132 | 0.952 | 15.108 | 24.494 | 0.101 | 0.965 | 21.051 | 42.448 | 0.289 | 0.940 |
| GRU | 19.665 | 32.220 | 0.133 | 0.952 | 14.929 | 24.361 | 0.100 | 0.966 | 21.834 | 46.000 | 0.282 | 0.931 |
| BiLSTM | 19.298 | 31.643 | 0.128 | 0.954 | 14.669 | 23.877 | 0.098 | 0.968 | 20.469 | 41.567 | 0.275 | 0.939 |
| TCN | 18.561 | 30.858 | 0.124 | 0.957 | 13.681 | 22.775 | 0.091 | 0.972 | 20.360 | 42.930 | 0.279 | 0.933 |
| Transformer | 19.833 | 31.643 | 0.162 | 0.955 | 14.429 | 23.136 | 0.128 | 0.971 | 21.164 | 42.746 | 0.345 | 0.935 |
| ASTGCN | 21.905 | 36.331 | 0.144 | 0.941 | 15.651 | 25.856 | 0.107 | 0.963 | 23.143 | 48.813 | 0.310 | 0.925 |
| DGCN | 20.598 | 33.288 | 0.146 | 0.920 | 17.425 | 26.809 | 0.120 | 0.919 | 25.158 | 51.518 | 0.396 | 0.824 |
| PGECRN | 19.682 | 32.220 | 0.134 | 0.953 | 14.571 | 24.515 | 0.102 | 0.967 | 34.791 | 76.372 | 0.404 | 0.880 |
| BO-MoE_d | 18.420 | 30.778 | 0.126 | 0.958 | 13.564 | 22.891 | 0.093 | 0.971 | 19.986 | 42.933 | 0.271 | 0.933 |
| BO-MoE_e | 18.338 | 30.687 | 0.122 | 0.958 | 13.495 | 22.751 | 0.091 | 0.972 | 19.953 | 43.071 | 0.275 | 0.935 |
| BO-MoE_h | 18.195 | 30.490 | 0.120 | 0.958 | 13.157 | 22.488 | 0.086 | 0.972 | 19.742 | 40.504 | 0.267 | 0.936 |
Share and Cite

MDPI and ACS Style

Wu, J.; Ren, J.; Wang, H.; Xie, F.; Chen, S.; Jiang, M. A Bayesian-Optimized Mixture of Experts Framework for Short-Term Traffic Flow Prediction. Modelling 2026, 7, 55. https://doi.org/10.3390/modelling7020055