Applied Sciences
  • Article
  • Open Access

30 October 2025

A Meta-Learning-Based Framework for Cellular Traffic Forecasting

School of Space Information, Space Engineering University, Beijing 101416, China
* Author to whom correspondence should be addressed.

Abstract

The rapid advancement of 5G/6G networks and the Internet of Things has rendered mobile traffic patterns increasingly complex and dynamic, posing significant challenges to achieving precise cell-level traffic forecasting. Traditional deep learning models, such as LSTM and CNN, rely heavily on substantial datasets. When confronted with new base stations or scenarios with sparse data, they often exhibit insufficient generalisation capabilities due to overfitting and poor adaptability to heterogeneous traffic patterns. To overcome these limitations, this paper proposes a meta-learning framework—GMM-MCM-NF. This framework employs a Gaussian mixture model as a probabilistic meta-learner to capture the latent structure of traffic tasks in the frequency domain. It further introduces a multi-component synthesis mechanism for robust weight initialisation and a negative feedback mechanism for dynamic model correction, thereby significantly enhancing model performance in scenarios with small samples and non-stationary conditions. Extensive experiments on the Telecom Italia Milan dataset demonstrate that GMM-MCM-NF outperforms traditional methods and meta-learning baseline models in prediction accuracy, convergence speed, and generalisation capability. This framework exhibits substantial potential in practical applications such as energy-efficient base station management and resilient resource allocation, contributing to the advancement of mobile networks towards more sustainable and scalable operations.

1. Introduction

With the rapid advancement of 5G/6G networks and Internet of Things (IoT) technologies, mobile network traffic has grown explosively. Within the current cellular network infrastructure, factors such as population mobility, public holidays, or special circumstances may lead to predictable network congestion, while certain cells may simultaneously consume far more power than demand requires. To address escalating user requirements and reduce energy wastage, research into edge computing within a framework that combines cell-level traffic forecasting with drone assistance is paramount. For instance, accurate forecasting of future traffic loads across multiple base stations enables dual-pronged optimisation: deploying drone swarms to assist overloaded sites, thereby enhancing user experience, while implementing sleep modes for temporarily idle base stations to reduce energy expenditure.
Currently, precise cellular-level traffic forecasting is a core enabling technology for dynamic network resource scheduling, base station energy management, and service quality assurance. It has consequently garnered significant attention in both academic research and engineering applications worldwide. However, existing forecasting methods still face three major challenges:
Firstly, the dilemma of model complexity. Whilst statistical models and shallow learning models exhibit lower complexity and training costs, real-world network traffic variations are vastly more intricate than any statistical or shallow learning model can capture, rendering them inadequate for complex spatio-temporal characteristics. Secondly, challenges stem from task heterogeneity: traffic patterns across different functional zones within cities exhibit significant divergence, leaving single models lacking in generalisation potential. Thirdly, the challenge of few-shot learning: the performance of deep learning-based forecasting methods heavily depends on the training dataset. When training data is insufficient, deep learning models exhibit overfitting.
To address the first challenge, the hierarchical optimisation architecture of meta-learning effectively reduces model complexity. By decomposing complex traffic distributions into K Gaussian components, each associated with a long short-term memory (LSTM)-based learner focused on learning specific patterns, the overall model complexity is diminished.
To address the second challenge, a Gaussian mixture model (GMM) is employed as the meta-learner within the meta-learning architecture. This captures the frequency-domain characteristics of traffic sequences, quantifies task differences, and probabilistically adapts prediction tasks, thereby significantly enhancing adaptation effectiveness across heterogeneous tasks.
To address the third challenge, this paper builds a multi-component synthesis mechanism (MCM) on top of the hierarchical meta-learning architecture and its dedicated modelling of diverse traffic patterns. The MCM dynamically synthesises weight vectors across components, thereby effectively mitigating prediction errors under small sample sizes.
Given the above considerations, this paper applies meta-learning to cellular network traffic forecasting. The proposed model employs a multi-layer long short-term memory (LSTM) network as the base learner, with each base learner handling a distinct base task, namely traffic forecasting for one cell. The meta-task of the meta-learning model is to initialise the weight vectors of the base learners for each base task according to the distinct meta-features of each cell. To accomplish this meta-task, a GMM is adopted as the meta-learner. Employing the meta-learning model effectively balances prediction accuracy and training cost. Numerical experiments demonstrate that the proposed algorithm significantly improves post-training prediction accuracy and learning efficiency.
The principal contributions of this paper are summarised as follows:
  • A GMM-based meta-learner is proposed to replace the KNN meta-learner in ML-TP. The GMM enables probabilistic modelling of the meta-feature space, capturing latent structures in task distributions.
  • The MCM is introduced to overcome the limitations of KNN’s rigid assignment in ML-TP. The MCM initialises the base learner for new tasks by synthesising weight vectors from multiple Gaussian components.
  • A prediction–correction negative feedback (NF) mechanism is designed to dynamically adjust GMM parameters during long-term predictions.

3. GMM-ML-TP Cellular Traffic Forecasting Model

This section first outlines the traffic forecasting framework proposed herein—GMM-MCM-NF—before explicitly defining its three core components: the base learner tasked with predicting traffic load for individual cells, the meta-learner designed to enhance the base learner’s predictive accuracy and learning efficiency, and the correction mechanism for long-term forecasting tasks.
As illustrated in Figure 1, the GMM-MCM-NF workflow comprises three core steps. Step one is meta-feature extraction and meta-learner training: each cell’s traffic load data is converted into a frequency-domain signal via the Fast Fourier Transform (FFT). Following the methodology described in Section 3.1, the real and imaginary parts of five dominant frequency components are then extracted to construct a 10-dimensional meta-feature vector. These vectors represent each cell’s traffic pattern and are used to train the Gaussian mixture model. The trained GMM meta-learner generates initialisation weights for the base learners according to each cell’s traffic characteristics. Step two is meta-knowledge-based weight initialisation and prediction: the GMM-generated weights serve as initial parameters for the LSTM networks that forecast network traffic load. Step three is the prediction-correction feedback mechanism: the LSTM’s prediction error is used to dynamically evaluate the responsibility weight of each Gaussian component, components with larger errors are penalised, and the GMM parameters are optimised accordingly, thereby improving the quality of weight selection for subsequent tasks.
Figure 1. Schematic diagram of the GMM-MCM-NF model.

3.1. Dataset and Preliminary Analysis

This section primarily introduces the dataset utilised in this study, comprising real mobile network traffic records. Additionally, it presents mathematical analyses of cellular traffic in both the time and frequency domains.
(1) Spatial Gridding and Cell Definition
This study employs the mobile network dataset provided by Telecom Italia’s Big Data Challenge [43] programme. The dataset comprises approximately 3 million traffic records collected in Milan between 1 November 2013 and 1 January 2014. The Milan area was divided into 10,000 grids, each representing a square region with a side length of 235 m, serving as the fundamental spatial unit for this research. Each record contains a timestamp, grid ID, and mobile traffic load (i.e., traffic payload). Given that the grid size approximates the coverage area of a 5G base station, each grid is defined as a “cell” in this paper.
(2) Time Series Construction
For analytical convenience, the entire temporal span of the dataset was divided into consecutive one-hour time intervals, $\Delta t = 1$ h. The traffic load of the $p$th cell during the $t$th time interval is calculated as follows:
$$l_p[t] = \sum_{r \in \mathbb{Z}^+:\; ID_r = p,\; (t-1)\cdot\Delta t < t_r \le t\cdot\Delta t} vol_r, \qquad p \in \{1,\dots,10000\},\; t \in \{1,2,\dots,N\},\; r \in \{1,2,\dots\}$$
Assuming there are $N$ time intervals, $r$ denotes the traffic record index, $ID_r$ represents the cell ID for record $r$, $t_r$ indicates the timestamp for record $r$, and $vol_r$ signifies the traffic load for record $r$. The time series of the traffic load for the $p$th cell is represented by the vector $l_p = (l_p[1], l_p[2], \dots, l_p[N])$.
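The aggregation rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' preprocessing code: the record layout `(timestamp_seconds, cell_id, volume)`, the function name `hourly_load`, and the origin `t0` are assumptions introduced here.

```python
import math
from collections import defaultdict

def hourly_load(records, delta_t=3600.0, t0=0.0):
    """Sum record volumes per (cell, interval).

    records: iterable of (timestamp_seconds, cell_id, volume) tuples (assumed layout).
    Interval t (1-based) covers (t0 + (t-1)*delta_t, t0 + t*delta_t],
    matching the left-open, right-closed bounds of the aggregation rule.
    """
    load = defaultdict(float)
    for ts, cell_id, vol in records:
        t = math.ceil((ts - t0) / delta_t)  # index of the interval containing ts
        if t >= 1:                          # records at or before t0 fall outside interval 1
            load[(cell_id, t)] += vol
    return load

# Hypothetical records: three for cell 7, one for cell 9.
records = [(1800, 7, 2.0), (3600, 7, 1.5), (3601, 7, 4.0), (10, 9, 0.5)]
load = hourly_load(records)
```

A record stamped exactly on an hour boundary is counted in the interval that ends at that boundary, which is what the right-closed bound implies.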
(3) Data Normalisation
To facilitate deep learning training, the traffic load time series for each cell is normalised. To prevent data leakage, normalisation parameters are strictly calculated within each cell’s meta-training set time period. Specifically, for the $p$th cell, whose meta-training set time series is $X_p^{\mathrm{meta\text{-}train}}$, the normalisation formula is as follows:
$$l_p[t] = \frac{X - \min(X_p^{\mathrm{meta\text{-}train}})}{\max(X_p^{\mathrm{meta\text{-}train}}) - \min(X_p^{\mathrm{meta\text{-}train}})}$$
where $X$ is the raw traffic load of cell $p$ in interval $t$, and $\min(X_p^{\mathrm{meta\text{-}train}})$ and $\max(X_p^{\mathrm{meta\text{-}train}})$ denote the minimum and maximum traffic load values within the cell’s meta-training set, respectively. For the test task set (new cells), normalisation parameters are computed using their own limited fine-tuning set (data from the preceding week) to simulate the real-world scenario where new base stations lack future information. The normalised traffic load time series for the $p$th cell, denoted as $l_p = (l_p[1], l_p[2], \dots, l_p[N])$, constitutes the “cell-level network traffic” data utilised in subsequent analyses.
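A small sketch may make the leakage-free normalisation concrete: the minimum and maximum are fitted on the meta-training window only and then reused for later data. The function names and the handling of a constant series are assumptions made for illustration.

```python
def fit_min_max(train_series):
    """Return (min, max) computed on the meta-training window only."""
    return min(train_series), max(train_series)

def apply_min_max(series, lo, hi):
    """Scale a series with previously fitted parameters."""
    span = hi - lo
    if span == 0:
        # Assumption: a constant training window maps to zeros.
        return [0.0 for _ in series]
    return [(x - lo) / span for x in series]

# Hypothetical raw loads: first three values are the meta-training window.
raw = [10.0, 30.0, 20.0, 50.0, 40.0]
train, later = raw[:3], raw[3:]
lo, hi = fit_min_max(train)
norm_train = apply_min_max(train, lo, hi)
norm_later = apply_min_max(later, lo, hi)
```

Because the parameters come from the training window alone, later values can fall outside [0, 1], as they do in this toy example. That is the price of not peeking at future data.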
Figure 2 illustrates the normalised traffic load patterns across three distinct cells during the same two-week period. It is evident that traffic load exhibits distinct temporal characteristics across different cells.
Figure 2. Time-domain traffic load (normalised) for cells 1884, 7121, and 1684.
(4) Meta-feature extraction
Within each cell, although daily variations in traffic load differ, they exhibit a fixed periodic pattern on a weekly basis. To quantify the temporal correlation of cellular traffic load, the autocorrelation coefficient of the normalised traffic load vector of cell $p$ at time lag $T_{lag}$ is calculated as follows:
$$cor_{p,T_{lag}} = \frac{\sum_{t=1}^{N-T_{lag}} (l_p[t]-\bar{l}_p)(l_p[t+T_{lag}]-\bar{l}_p)}{\sum_{t=1}^{N} (l_p[t]-\bar{l}_p)^2}$$
where $\bar{l}_p$ is the mean of the normalised traffic load vector of cell $p$. Figure 3 displays the autocorrelation coefficients of the normalised traffic load vectors for different cells. The coefficients behave similarly across cells. As shown in Figure 3, the autocorrelation coefficient of the normalised cell traffic load vector peaks when the time lag $T_{lag}$ is an integer multiple of 24 h.
Figure 3. Autocorrelation coefficients of normalised traffic load vectors across different cells.
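The autocorrelation coefficient defined above can be checked on synthetic data. The sketch below assumes a purely sinusoidal 24 h load over two weeks, which is not the Milan data, and uses a hypothetical helper named `autocorr`.

```python
import math

def autocorr(series, lag):
    """Autocorrelation coefficient at a given lag (0-based indexing internally)."""
    n = len(series)
    mean = sum(series) / n
    num = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    den = sum((x - mean) ** 2 for x in series)
    return num / den

# Synthetic load with a 24 h period, sampled hourly for two weeks (336 h).
series = [math.sin(2 * math.pi * t / 24) for t in range(336)]
r24 = autocorr(series, 24)   # lag of one full period
r12 = autocorr(series, 12)   # lag of half a period
```

For this signal the coefficient is strongly positive at a 24 h lag and strongly negative at 12 h, which mirrors the peaks at multiples of 24 h described for Figure 3.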
Based on the above analysis, a discrete periodic signal may be constructed from the normalised traffic load vector of the cell, as follows:
$$\tilde{l}_p[t] = \begin{cases} l_p[t], & 1 \le t < T \\ l_p[t \bmod T], & t < 1 \ \text{or} \ t \ge T \end{cases}$$
where $T = 168$ denotes the total number of hours in a week. The FFT of this discrete periodic signal is then obtained:
$$F_p\!\left(k \cdot \frac{2\pi}{T}\right) = \sum_{t=0}^{T-1} \tilde{l}_p[t]\, W_T^{kt}, \quad k = 0, 1, \dots, T-1, \qquad W_T = e^{-j\frac{2\pi}{T}}$$
This paper selects five frequency components, $\omega = \pi/84$, $\pi/12$, $\pi/6$, $\pi/4$, and $\pi/3$, corresponding to sinusoids with periods of 1 week, 1 day, 12 h, 8 h, and 6 h, respectively, as the meta-features of each cell. This selection is based on a systematic analysis of the frequency-domain energy distribution across all cellular traffic sequences in the dataset. The selection rationale and validation process are as follows:
(1) Generation of feature candidate pool based on global energy spectrum
First, the FFT amplitude spectrum $|F_p(\omega)|$ of the normalised traffic load vector was computed for all cells. Subsequently, the global average energy spectral density $\bar{S}(\omega)$ was calculated:
$$\bar{S}(\omega) = \frac{1}{N} \sum_{p=1}^{N} |F_p(\omega)|^2$$
This energy spectrum quantifies the average importance of each frequency component $\omega$ across the entire dataset. As shown in Figure 4, $\bar{S}(\omega)$ exhibits pronounced peaks near five frequency points, corresponding to periods of 1 week, 1 day, 12 h, 8 h, and 6 h, respectively. This constitutes the preliminary basis for the feature candidate pool in this paper.
Figure 4. FFT results of normalised traffic load vectors for cells 1884, 7121, and 1684.
(2) To ensure the universality of these five frequency components, we further examined their significance across different cellular subpopulations.
Step 1: We randomly sampled multiple subsets based on each cell’s total traffic volume and geographical location (e.g., high-traffic cells, low-traffic cells, city-centre cells, suburban cells).
Step 2: For each subset, we repeated the procedure in (1) to compute its average energy spectrum.
Results: Across all tested subsets, the five frequency components consistently represented the most prominent peaks in the energy spectrum. Paired t-tests comparing the average energy at these five frequency points with that at adjacent frequencies gave $p < 0.01$ in all instances, showing that the energy at these five points was significantly higher than background noise and neighbouring frequencies and indicating that they represent robust, cross-regional common patterns.
(3) Interpretability of physical significance
The selected frequency components possess explicit, human-activity-driven physical significance:
$\omega = \pi/84$ (weekly cycle): captures macro-level flow pattern differences between weekdays and weekends.
$\omega = \pi/12$ (diurnal cycle): reflects fundamental human rhythms, constituting the core pattern of diurnal traffic alternation.
$\omega = \pi/6, \pi/4, \pi/3$ (12/8/6 h cycles): these sub-diurnal cycles may correspond to refined intra-day activity patterns such as lunch breaks and commuting peaks (morning/evening), providing the model with richer temporal detail.
Figure 4 shows the amplitude of the FFT of the cellular traffic loads in Figure 2. The real and imaginary parts of the five principal frequency components of cell $p$ form a 10-dimensional principal frequency component vector, i.e., the cell’s meta-feature:
$$\Gamma_p = [\Re(F_p(\pi/84)), \Im(F_p(\pi/84)), \Re(F_p(\pi/12)), \Im(F_p(\pi/12)), \Re(F_p(\pi/6)), \Im(F_p(\pi/6)), \Re(F_p(\pi/4)), \Im(F_p(\pi/4)), \Re(F_p(\pi/3)), \Im(F_p(\pi/3))]$$
where $\Re(\cdot)$ and $\Im(\cdot)$ denote the real and imaginary parts of a complex number, respectively.
The Pearson correlation coefficient $\rho_{p,q}$ between the normalised traffic load vectors of any two cells $p$ and $q$ is calculated as
$$\rho_{p,q} = \frac{cov(l_p, l_q)}{\sigma_{l_p}\,\sigma_{l_q}}$$
where $cov(\cdot)$ denotes the covariance, and $\sigma_{l_p}$ and $\sigma_{l_q}$ represent the standard deviations of the normalised traffic load for cells $p$ and $q$, respectively.
Figure 5 is based on 1000 cell pairs randomly selected from the complete dataset and illustrates the overall statistical relationship across the sample. Because of the large sample size and the random sampling, individual cell identifiers are not annotated in the figure. For each pair, the Pearson correlation coefficient between the normalised traffic load vectors and the Euclidean distance between the principal frequency component vectors were calculated. As illustrated, the Pearson correlation coefficient between two cells’ normalised traffic load vectors diminishes as the Euclidean distance between their principal frequency component vectors increases. Thus, two cells whose principal frequency component vectors are close in Euclidean space tend to exhibit similar traffic patterns in the time domain, whereas cells whose vectors are far apart tend to exhibit markedly different patterns.
Figure 5. Relationship between the Pearson correlation coefficient of two cellular traffic vectors and the Euclidean distance of their corresponding principal frequency component vectors.
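The two quantities compared in Figure 5 are simple to compute. The following sketch shows one possible implementation. The helper names are hypothetical and the inputs are toy vectors, not values from the dataset.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (std_a * std_b)   # the 1/n factors cancel

def euclidean(u, v):
    """Euclidean distance between two meta-feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))

r_same = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
dist = euclidean([0.0, 0.0], [3.0, 4.0])
```

Applied to many cell pairs, plotting `pearson` of the load vectors against `euclidean` of the meta-feature vectors would produce a scatter of the kind the figure describes.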
It should be noted that the Telecom Italia dataset employed herein originates from 2013–2014, reflecting traffic patterns characteristic of 3G/4G networks. With the proliferation of 5G/6G and IoT technologies, contemporary network traffic may exhibit heightened volatility, low-latency requirements, and diverse service types (e.g., video streaming, IoT device communications). Nevertheless, this study focuses on validating the meta-learning framework’s generalisability under task heterogeneity and small-sample scenarios, rather than precisely simulating the latest traffic patterns. The proposed method does not rely on specific data distributions. In Section 4.5.7, the Telecom Shanghai Dataset is employed to validate the model’s cross-dataset generalisation capability [44].

3.2. Probabilistic Modelling in Feature Space

Let the meta-feature vectors of the historical base tasks be denoted as $\{\Gamma_p\}_{p=1}^{N} \subset \mathbb{R}^{D}$ ($D = 10$), where $N$ represents the total number of historical tasks. This paper assumes that the meta-feature vectors follow a Gaussian mixture model (GMM) composed of $K$ Gaussian distributions:
$$p(\Gamma_p \mid \theta) = \sum_{k=1}^{K} \pi_k\, \mathcal{N}(\Gamma_p \mid \mu_k, \Sigma_k)$$
The model parameters are $\theta = \{\pi_k, \mu_k, \Sigma_k\}_{k=1}^{K}$, where $\pi_k$ is the mixture coefficient of the $k$th Gaussian component, i.e., the probability of selecting the $k$th component during data generation, $\mu_k$ is the mean of the $k$th component, and $\Sigma_k$ is its covariance matrix. Each Gaussian distribution corresponds to a category of base station traffic features with similar traffic patterns (e.g., commercial areas, residential areas, etc.).
For the $k$th Gaussian component, the corresponding set of optimal weight vectors is denoted as
$$W_k = \{\, w_i \mid \Gamma_i \in \theta_k \,\}$$
The class centres of the weight vectors are
$$\bar{w}_k = \frac{1}{|W_k|} \sum_{w_i \in W_k} w_i$$
Given the meta-feature $\Gamma_q$ of a new task $q$, the posterior probability of each component is as follows:
$$\gamma_k(\Gamma_q) = \frac{\pi_k\, \mathcal{N}(\Gamma_q \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(\Gamma_q \mid \mu_j, \Sigma_j)}$$
The component $k^* = \arg\max_k \gamma_k(\Gamma_q)$ with the highest posterior probability is selected, and initial weights are allocated based on one of the following two rules.
The first rule, termed the SCM, operates on the principle that “if the meta-features of two tasks are probabilistically similar, their optimal model weights should also be similar”. It selects, as the initial weight, the base model weight whose meta-feature vector is closest to $\Gamma_q$ under the covariance structure of component $k^*$:
$$w_q^{init} = \arg\min_{w_i \in W_{k^*}} \|\Gamma_q - \Gamma_i\|_{\Sigma_{k^*}^{-1}}$$
where $\|\cdot\|_{\Sigma_{k^*}^{-1}}$ denotes the Mahalanobis distance, which sharpens the measurement of local similarity by introducing the covariance structure of component $k^*$.
The second rule is the MCM. Rather than directly selecting the parameter set corresponding to a single Gaussian component, it synthesises a new parameter set based on the posterior probabilities. This mechanism fully leverages the probabilistic modelling advantages of GMMs, avoiding the information loss caused by rigid allocation. The allocation rule is as follows:
$$w_q^{init} = \sum_{k=1}^{K} \gamma_k\, \theta_k$$
where $\theta_k$ is the initial weight vector associated with the $k$th Gaussian component.
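The two allocation rules can be contrasted on a toy problem. The NumPy sketch below assumes two components with identity covariances in a 2-dimensional feature space and tiny 2-dimensional "weight vectors". All names and numbers are hypothetical, and `comp_weights` stands for the per-component weight vectors (here the class centres).

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Density of a multivariate Gaussian N(x | mean, cov)."""
    d = x.shape[0]
    diff = x - mean
    maha_sq = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * maha_sq) / norm

def responsibilities(x, pis, means, covs):
    """Posterior probability of each component given meta-feature x."""
    dens = np.array([p * gaussian_pdf(x, m, c) for p, m, c in zip(pis, means, covs)])
    return dens / dens.sum()

def scm_init(x, gammas, covs, task_feats, task_weights, task_comp):
    """SCM: weight of the nearest historical task (Mahalanobis) in component k*."""
    k_star = int(np.argmax(gammas))
    members = [i for i, k in enumerate(task_comp) if k == k_star]
    def maha_sq(i):
        diff = x - task_feats[i]
        return diff @ np.linalg.solve(covs[k_star], diff)
    return task_weights[min(members, key=maha_sq)]

def mcm_init(gammas, comp_weights):
    """MCM: posterior-weighted sum of the per-component weight vectors."""
    return gammas @ comp_weights

# Toy GMM and three historical tasks (all values are illustrative assumptions).
pis = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [4.0, 4.0]])
covs = np.array([np.eye(2), np.eye(2)])
task_feats = np.array([[0.0, 0.1], [1.0, 1.0], [4.0, 4.2]])
task_weights = np.array([[1.0, 0.9], [1.2, 1.0], [3.0, 3.1]])
task_comp = [0, 0, 1]                                  # hard component labels
comp_weights = np.array([[1.1, 0.95], [3.0, 3.1]])     # class centres per component

x_new = np.array([0.5, 0.2])
gammas = responsibilities(x_new, pis, means, covs)
w_scm = scm_init(x_new, gammas, covs, task_feats, task_weights, task_comp)
w_mcm = mcm_init(gammas, comp_weights)
```

For a task that sits clearly inside one component the two rules give similar answers. They diverge when the posterior is spread across components, which is the case the MCM is meant to handle. In 10 dimensions the densities can underflow, so a log-domain computation would be a safer choice in practice.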

3.3. Verification of Feature Distribution Assumptions

The validity of the Gaussian mixture model rests upon the assumption that the distribution of the meta-feature vectors $f_p$ (denoted $\Gamma_p$ in Section 3.2) can be adequately approximated by multiple multivariate Gaussian distributions. This paper employs the following multi-faceted, quantitative methods to validate this core assumption.

3.3.1. Intra-Component Multinormality Test

Given that the GMM assumes data within each component follow a multivariate Gaussian distribution, we first use the trained GMM to assign each meta-feature vector $f_p$ to its most probable component, i.e., $\arg\max_k \gamma(z_{pk})$, and then conduct multivariate normality tests on the data within each component $k$ separately. We employ the Mardia test, which is based on multivariate skewness and kurtosis.
The multivariate skewness statistic is defined as
$$b_{1,d} = \frac{1}{N_k^2} \sum_{i=1}^{N_k} \sum_{j=1}^{N_k} \left[ (f_i - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (f_j - \hat{\mu}_k) \right]^3$$
where $N_k$ is the number of vectors assigned to component $k$. The scaled statistic $N_k\, b_{1,d}/6$ is asymptotically $\chi^2$ distributed with $d(d+1)(d+2)/6$ degrees of freedom, where $d = 10$ is the dimension of the meta-feature vector.
The multivariate kurtosis statistic is defined as
$$b_{2,d} = \frac{1}{N_k} \sum_{i=1}^{N_k} \left[ (f_i - \hat{\mu}_k)^T \hat{\Sigma}_k^{-1} (f_i - \hat{\mu}_k) \right]^2$$
Its asymptotic distribution is normal, with mean $d(d+2)$ and variance $8d(d+2)/N_k$.
The test was applied to all components $k = 1, \dots, K$. The results indicate that, at the $\alpha = 0.01$ significance level, the majority of components (>95%) fail to reject the null hypothesis of multivariate normality, supporting the component-wise Gaussian assumption of the GMM.
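The two Mardia statistics are straightforward to compute for the vectors assigned to one component. The sketch below is an illustrative NumPy version run on synthetic Gaussian data. It returns the statistics and their standardised forms but does not compute p-values, which would need a chi-squared and a normal CDF (for example from SciPy). The function name and the sample are assumptions.

```python
import numpy as np

def mardia_statistics(X):
    """Mardia skewness b1 and kurtosis b2 for the rows of X (n samples, d dims)."""
    n, d = X.shape
    centred = X - X.mean(axis=0)
    cov = centred.T @ centred / n                    # maximum-likelihood covariance
    G = centred @ np.linalg.solve(cov, centred.T)    # G[i, j] = (f_i - mu)^T S^-1 (f_j - mu)
    b1 = (G ** 3).sum() / n ** 2
    b2 = (np.diag(G) ** 2).mean()
    skew_stat = n * b1 / 6.0                         # asymptotically chi-squared
    skew_dof = d * (d + 1) * (d + 2) // 6
    kurt_z = (b2 - d * (d + 2)) / np.sqrt(8.0 * d * (d + 2) / n)  # asymptotically N(0, 1)
    return b1, b2, skew_stat, skew_dof, kurt_z

# Synthetic check: 500 draws from a 3-dimensional standard normal.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
b1, b2, skew_stat, skew_dof, kurt_z = mardia_statistics(X)
```

For truly Gaussian data, `b2` should lie close to $d(d+2)$ and `skew_stat` should be of the same order as `skew_dof`. Large departures in either direction are what the test flags.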

3.3.2. Model Comparison and Goodness-of-Fit Assessment

This paper compares the goodness-of-fit between GMM and other potential distribution models to demonstrate the superiority of GMM.
Likelihood Ratio Test (LRT): This paper compares GMM with a single multivariate Gaussian distribution (i.e., GMM with $K = 1$). The LRT statistic is
$$\Lambda = -2 \log \frac{L(M_{Single})}{L(M_{GMM})}$$
where $L(\cdot)$ is the maximum likelihood value of the model. The statistic $\Lambda$ approximately follows a $\chi^2$ distribution with degrees of freedom equal to the increase in parameters from $K = 1$ to the optimal $K$ value (determined by BIC). The test result ($p$-value $< 0.001$) strongly rejects the hypothesis of a single Gaussian model, supporting that the data originate from a mixture distribution.
Bayesian Information Criterion (BIC) comparison: We further contrasted the GMM with a mixture model based on the t-distribution (MoT), which is more robust to heavy-tailed distributions. BIC is defined as:
$$\mathrm{BIC} = \ln(N) \cdot |\theta| - 2 \ln(L)$$
where $N$ is the number of samples, $|\theta|$ is the number of free parameters, and $L$ is the maximised likelihood. Results indicate that the GMM’s BIC value ($-12{,}450$) is significantly lower than the MoT’s BIC ($-11{,}980$), demonstrating that the GMM represents a superior choice in balancing model complexity and fit.
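To make the bookkeeping behind the LRT and BIC explicit, here is a small sketch. It assumes full covariance matrices when counting parameters, and the log-likelihood values are invented placeholders, not results from the paper.

```python
import math

def gmm_param_count(k, d):
    """Free parameters of a K-component, d-dimensional GMM with full covariances.

    (K - 1) mixing coefficients + K*d means + K*d*(d+1)/2 covariance entries.
    """
    return (k - 1) + k * d + k * d * (d + 1) // 2

def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: ln(N) * |theta| - 2 * ln(L)."""
    return math.log(n_samples) * n_params - 2.0 * log_likelihood

def lrt_statistic(loglik_single, loglik_gmm):
    """Likelihood ratio statistic -2 * log(L_single / L_gmm)."""
    return -2.0 * (loglik_single - loglik_gmm)

d = 10
params_single = gmm_param_count(1, d)
params_k3 = gmm_param_count(3, d)          # K = 3 is an arbitrary illustration
lrt_dof = params_k3 - params_single        # degrees of freedom used for the LRT

# Hypothetical log-likelihoods, for illustration only.
stat = lrt_statistic(-100.0, -80.0)
score = bic(-80.0, params_k3, 1000)
```

The parameter count grows quickly with $K$ in 10 dimensions, which is why the $\ln(N)\cdot|\theta|$ penalty matters when choosing the number of components.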

3.3.3. Cluster Structure Validation

GMM essentially performs probabilistic clustering. We employ the silhouette coefficient to assess clustering quality. For a data point $i$, its silhouette coefficient $s(i)$ is computed as follows:
$$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i,\, j \ne i} d(f_i, f_j)$$
$$b(i) = \min_{C_k \ne C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(f_i, f_j)$$
$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$$
where $C_i$ is the cluster to which point $i$ belongs and $d$ is the Euclidean distance. Cluster membership is obtained by a hard assignment based on the posterior probability $\gamma(z_{pk})$ (each point is assigned to the component with the highest probability). The average silhouette coefficient calculated across all meta-feature vectors is 0.65 (>0.5), indicating a clear and reasonable clustering structure, with compactness within components and good separation between them.
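The silhouette computation can be sketched directly from the three formulas above. The version below is a plain-Python illustration for small inputs. The singleton convention ($s(i) = 0$) and the toy points are assumptions, and at least two clusters are required.

```python
import math

def silhouette(points, labels):
    """Per-point silhouette coefficients using Euclidean distance."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    def mean_dist(i, members):
        return sum(math.dist(points[i], points[j]) for j in members) / len(members)

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)   # assumption: singleton clusters score 0
            continue
        a = mean_dist(i, own)                                   # cohesion
        b = min(mean_dist(i, members)                           # separation
                for other, members in clusters.items() if other != lab)
        scores.append((b - a) / max(a, b))
    return scores

# Two well-separated toy clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
scores = silhouette(points, labels)
avg_score = sum(scores) / len(scores)
```

With hard labels taken from the GMM posterior, `avg_score` plays the role of the average silhouette coefficient reported in the text.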
In summary, through distribution verification, model comparison, and clustering evaluation, this paper validates the rationality and effectiveness of probabilistic modelling using Gaussian mixture models for frequency-domain feature vectors.

3.4. Prediction Correction Mechanism Based on the SCM

Within the SCM, when assigning initial weights for new forecasting tasks, we select only the weight vector corresponding to the Gaussian component with the highest posterior probability. This rigid allocation method is simple and efficient but may overlook useful information from other components. To optimise the model’s performance in long-term forecasting, we introduce a correction mechanism that dynamically adjusts the GMM’s mixture coefficients based on forecast errors. The steps for operating the SCM mechanism are as follows:
(1) Initial weight allocation
$$w_{init} = w_{k^*}, \quad k^* = \arg\max_k \gamma_k$$
where $\gamma_k$ represents the posterior probability of task $q$ belonging to the $k$th Gaussian component, reflecting the similarity between the task and the component. $k^*$ denotes the index of the component with the highest posterior probability, i.e., the most similar task pattern category. The weight vector corresponding to the Gaussian component with the highest posterior probability is selected as the initial weight. This equates to selecting the “expert experience” most similar to the current task from historical tasks.
(2) Prediction Error Calculation
Assume a new task $q$ with a validation set $D_q = \{(x_i, y_i)\}_{i=1}^{M}$. The prediction error $\varepsilon_q$ is defined as the mean squared error (MSE):
$$\varepsilon_q = \frac{1}{|D_q|} \sum_{(x, y) \in D_q} \left( f(x; w_{init}) - y \right)^2$$
where $f(\cdot)$ is the base learner parameterised by $w_{init}$. The error $\varepsilon_q$ quantifies the base learner’s underperformance on task $q$, directly reflecting the suitability of the initial weights. A larger error indicates that the selected initial weights deviate further from the true requirements of task $q$, while a smaller error indicates a closer match.
(3) Weighting Coefficient Update
This paper adjusts the mixing coefficients based on the error, penalising poorly performing components:
$$\pi_k^{new} = \frac{\pi_k \exp(-\beta\, \delta_{k,k^*}\, \varepsilon_q)}{\sum_{j=1}^{K} \pi_j \exp(-\beta\, \delta_{j,k^*}\, \varepsilon_q)}$$
where $\delta_{k,k^*}$ is an indicator function, equal to 1 when $k = k^*$ and 0 otherwise, and $\beta$ is a sensitivity coefficient that controls the adjustment magnitude. When $\varepsilon_q \to 0$, $\exp(-\beta \varepsilon_q) \to 1$ and the weight of component $k^*$ remains largely unchanged. When $\varepsilon_q \to \infty$, $\exp(-\beta \varepsilon_q) \to 0$ and the weight of component $k^*$ is significantly reduced. The denominator ensures that the mixing coefficients sum to 1, so that they remain a valid probability distribution. Through error-driven mixing coefficient adjustment, the model can automatically reduce the priority of poorly performing components, enhancing the overall robustness of the meta-learner.
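A short numerical sketch shows how the update behaves. The mixing coefficients, the error value, and the function name `scm_update` are illustrative assumptions.

```python
import math

def scm_update(pis, k_star, error, beta=1.0):
    """Down-weight the selected component k* by exp(-beta * error), then renormalise."""
    scaled = [pi * math.exp(-beta * error) if k == k_star else pi
              for k, pi in enumerate(pis)]
    total = sum(scaled)
    return [s / total for s in scaled]

pis = [0.5, 0.3, 0.2]
# An error of ln(2) with beta = 1 halves the selected component before renormalisation.
new_pis = scm_update(pis, k_star=0, error=math.log(2.0), beta=1.0)
unchanged = scm_update(pis, k_star=0, error=0.0)
```

Only the selected component is scaled down. The other components gain weight solely through the renormalisation, so a run of poor predictions gradually shifts prior mass away from the component that kept being chosen.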

3.5. Prediction Correction Mechanism Based on MCM

The prediction correction mechanism based on the multi-component synthesis mechanism primarily achieves collaborative multi-component correction by distributing error responsibility through the posterior probability distribution of the GMM. Weighted experience from multiple similar tasks is typically more robust than that from a single most-similar task. Through soft allocation and collaborative optimisation, it better handles situations with blurred task boundaries. The steps for the MCM mechanism are as follows:
(1) Initial Weight Synthesis
The following formula calculates the initial weights prior to correction:
$$w_q^{init} = \sum_{k=1}^{K} \gamma_k(\Gamma_q)\, w_k^*$$
(2) Responsibility Weight Calculation
The responsibility weight $\rho_k$ of component $k$, defined below, forms the core of the correction mechanism. It determines the degree of responsibility component $k$ bears for the current error $\varepsilon_q$. Here, $\gamma_k(\Gamma_q)$ denotes component $k$’s posterior probability, $\varepsilon_q$ represents the error, and the denominator is a normalising sum over the products of each component’s posterior probability and the error:
$$\rho_k = \frac{\gamma_k(\Gamma_q)\, \varepsilon_q}{\sum_{j=1}^{K} \gamma_j(\Gamma_q)\, \varepsilon_q}$$
(3) Multi-component collaborative error correction
Mixing coefficient update:
$$\pi_k^{new} = \frac{\pi_k - \alpha \cdot \rho_k \cdot \varepsilon_q}{\sum_{j=1}^{K} \left( \pi_j - \alpha \cdot \rho_j \cdot \varepsilon_q \right)}$$
where $\alpha$ is the learning rate, controlling the adjustment magnitude, and $\rho_k$ and $\varepsilon_q$ act as penalty terms. The greater the error $\varepsilon_q$ for the current task and the greater the responsibility $\rho_k$ of component $k$, the greater the reduction in that component’s mixing coefficient. This implies that in subsequent tasks, the prior probability of selecting this underperforming component will decrease.
Mean Vector Update:
$$\mu_k^{new} = \mu_k + \alpha\, \rho_k\, (\Gamma_q - \mu_k)$$
where $\alpha$ is the learning rate, and $\Gamma_q - \mu_k$ represents the difference between the new task’s meta-feature vector and the component’s current mean vector. The greater the responsibility weight $\rho_k$, the more the mean vector $\mu_k$ of component $k$ shifts towards the meta-feature $\Gamma_q$ of the new task $q$. This effectively refines the component’s “centre” based on error, enabling it to better match tasks similar to $\Gamma_q$ in the future.
Covariance matrix update:
$$\Sigma_k^{new} = \Sigma_k + \alpha\, \rho_k \left[ (\Gamma_q - \mu_k)(\Gamma_q - \mu_k)^T - \Sigma_k \right]$$
This formula adjusts the distribution range of the Gaussian component, where $(\Gamma_q - \mu_k)(\Gamma_q - \mu_k)^T$ is an outer product matrix reflecting the spread of the new task’s meta-feature $\Gamma_q$ relative to the current mean $\mu_k$.
Representative weight update:
$$w_k^{*,new} = w_k^* + \beta\, \rho_k\, (w_q^{final} - w_k^*)$$
Here $w_q^{final}$ denotes the final optimised weights obtained after fine-tuning on the new task $q$, which are better adapted to task $q$ than the initial weights, and $(w_q^{final} - w_k^*)$ represents the gap between the current component’s representative weight and the optimal weight. The greater the responsibility weight $\rho_k$, the closer the component’s representative weight $w_k^*$ moves towards the optimal weight $w_q^{final}$. This effectively evolves the component using the training results from this task, enabling it to provide better initial weights for similar tasks in the future.
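One correction step of the MCM can be sketched with NumPy as below. The array shapes, the default values of `alpha` and `beta`, and the clipping guard that keeps mixing coefficients positive are assumptions added for illustration. The guard in particular is not part of the formulas above, and the error is assumed to be strictly positive.

```python
import numpy as np

def mcm_correct(pis, means, covs, comp_weights, gammas, feat, w_final, error,
                alpha=0.05, beta=0.1):
    """One MCM correction step for K components (error must be > 0).

    pis: (K,), means: (K, D), covs: (K, D, D), comp_weights: (K, P),
    gammas: (K,) posteriors of the new task, feat: (D,), w_final: (P,).
    """
    rho = gammas * error / np.sum(gammas * error)        # responsibility weights
    new_pis = np.clip(pis - alpha * rho * error, 1e-12, None)  # guard: assumption
    new_pis = new_pis / new_pis.sum()
    diff = feat - means                                  # (K, D), uses the old means
    new_means = means + alpha * rho[:, None] * diff
    outer = diff[:, :, None] * diff[:, None, :]          # (K, D, D) outer products
    new_covs = covs + alpha * rho[:, None, None] * (outer - covs)
    new_weights = comp_weights + beta * rho[:, None] * (w_final - comp_weights)
    return new_pis, new_means, new_covs, new_weights

# Toy setting with K = 2 components and 2-dimensional features and weights.
pis = np.array([0.6, 0.4])
means = np.array([[0.0, 0.0], [4.0, 4.0]])
covs = np.array([np.eye(2), np.eye(2)])
comp_weights = np.array([[1.0, 1.0], [3.0, 3.0]])
gammas = np.array([0.75, 0.25])
new_pis, new_means, new_covs, new_weights = mcm_correct(
    pis, means, covs, comp_weights, gammas,
    feat=np.array([1.0, 1.0]), w_final=np.array([2.0, 2.0]),
    error=0.2, alpha=0.5, beta=0.1)
```

As long as `alpha * rho` stays at or below 1, the covariance update is a convex combination of the old covariance and a rank-one outer product, so the updated matrices remain positive semi-definite.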
(4) Convergence Analysis
The MCM correction mechanism may be regarded as an online expectation maximisation (EM) algorithm. Under mild regularity conditions, when the learning rate $\eta$ satisfies the Robbins–Monro conditions ($\sum_t \eta_t = \infty$ and $\sum_t \eta_t^2 < \infty$), the parameter estimates converge almost surely to a local optimum.
(5) Comparison of the Correction Mechanism with Standard Methods
(1) Differences from the standard EM algorithm:
The EM algorithm aims to maximise the log-likelihood $\log p(X \mid \theta)$ by alternately executing E-steps (computing the posterior probability $\gamma(z_{nk})$) and M-steps (updating parameters):
$$\gamma(z_{nk}) = \frac{\pi_k\, \mathcal{N}(x_n \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_n \mid \mu_j, \Sigma_j)}$$
$$\mu_k^{new} = \frac{1}{N_k} \sum_{n=1}^{N} \gamma(z_{nk})\, x_n$$
The present correction mechanism directly updates GMM parameters based on prediction errors, constituting supervised online learning rather than unsupervised likelihood maximisation.
(2) Differences from reinforcement learning:
Reinforcement learning updates parameters via policy gradients to maximise cumulative rewards. This paper’s correction mechanism employs instantaneous error as a signal, eliminating the need to define reward or value functions, and thus more closely resembles online gradient descent.

4. Experimental Analysis

4.1. Experimental Objectives

This section comprehensively evaluates the proposed GMM-MCM-NF model’s performance in cellular traffic forecasting. Specific objectives are as follows:
  • Evaluate the efficacy of core innovations: Quantify improvements in predictive performance from the MCM and responsibility-weighted negative feedback mechanism (MCM-NF) through systematic comparisons with traditional deep learning baselines and meta-learning baselines (ML-TP).
  • Conduct ablation analysis to quantify component contributions: By controlling variables, isolate and evaluate the individual impacts of the GMM, MCM, and NF on the model’s overall performance.
  • Test model robustness in small-sample scenarios: Evaluate model generalisation capability under varying training data volumes, with particular emphasis on prediction stability under extreme data scarcity.
  • Analyse model learning efficiency: Compare the proposed model with baseline methods in terms of convergence speed and the training data volume required to achieve equivalent performance.

4.2. Datasets and Preprocessing

This study employs the Milan urban mobile traffic dataset released by Telecom Italia in the “Big Data Challenge”. The dataset specifics are detailed in Section 3.1, with the preprocessing workflow as follows:
(1)
Time Series Construction:
The entire dataset’s temporal span is divided into consecutive one-hour intervals, i.e., $\Delta t = 1$ h. The traffic load for cell $p$ during hour $t$ is calculated as
$$l_p[t] = \sum_{r \in \mathbb{Z}^{+}:\ ID_r = p,\ (t-1)\cdot\Delta t < t_r \le t\cdot\Delta t} vol_r$$
Assuming there are $N$ time intervals, $r$ denotes the traffic record index, $ID_r$ represents the cell ID for record $r$, $t_r$ indicates the timestamp for record $r$, and $vol_r$ signifies the traffic load for record $r$. The time series of the traffic load for the $p$th cell is represented by the vector $l_p = (l_p[1], l_p[2], \ldots, l_p[N])$.
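The aggregation rule can be sketched as follows. The record layout `(cell_id, t_r, vol_r)`, timestamps in seconds from the start of the collection span, and the function name are assumptions made for illustration; the dataset’s actual file format is not described here.

```python
import math
from collections import defaultdict


def hourly_load(records, delta_t=3600.0):
    """Sum record volumes into 1-based intervals t with (t-1)*delta_t < t_r <= t*delta_t.

    records: iterable of (cell_id, t_r, vol_r), with t_r in seconds (assumed layout).
    Returns a nested mapping loads[cell_id][t] -> aggregated load.
    """
    loads = defaultdict(lambda: defaultdict(float))
    for cell_id, t_r, vol in records:
        if t_r <= 0:
            continue  # falls before the first half-open interval
        t = math.ceil(t_r / delta_t)
        loads[cell_id][t] += vol
    return loads


records = [
    (1, 600.0, 2.0),   # cell 1, first hour
    (1, 3600.0, 3.0),  # exactly on the boundary: still the first hour
    (1, 3601.0, 1.5),  # cell 1, second hour
    (2, 1800.0, 4.0),  # cell 2, first hour
]
loads = hourly_load(records)
print(loads[1][1], loads[1][2], loads[2][1])  # 5.0 1.5 4.0
```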
(2)
Data Normalisation:
To facilitate deep learning training, the traffic load time series for each cell is normalised. To prevent data leakage, normalisation parameters are calculated strictly within each cell’s meta-training set time period. Specifically, for the $p$th cell, whose meta-training set time series is $X_p^{\mathrm{meta\text{-}train}}$, the normalisation formula is as follows:
$$l_p[t] = \frac{X - \min(X_p^{\mathrm{meta\text{-}train}})}{\max(X_p^{\mathrm{meta\text{-}train}}) - \min(X_p^{\mathrm{meta\text{-}train}})}$$
where $X$ is the raw traffic load value being normalised, and $\min(X_p^{\mathrm{meta\text{-}train}})$ and $\max(X_p^{\mathrm{meta\text{-}train}})$ denote the minimum and maximum traffic load values within the cell’s meta-training set, respectively. For the test task set (new cells), normalisation parameters are computed using their own limited fine-tuning set (data from the preceding week) to simulate the real-world scenario where new base stations lack future information. The normalised traffic load time series for the $p$th cell, denoted as $l_p = (l_p^{1}, l_p^{2}, \ldots, l_p^{N})$, constitutes the “cell-level network traffic” data utilised in subsequent analyses.
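A minimal sketch of leakage-free min-max scaling is given here. The helper names and the toy series are hypothetical; the point is only that the minimum and maximum are fitted on the training segment and then reused unchanged on later data.

```python
def fit_min_max(train_series):
    """Return (min, max) computed on the training segment only."""
    lo, hi = min(train_series), max(train_series)
    if hi == lo:
        raise ValueError("constant series: min-max scaling is undefined")
    return lo, hi


def apply_min_max(series, lo, hi):
    """Scale a series with previously fitted parameters."""
    return [(x - lo) / (hi - lo) for x in series]


full = [10.0, 30.0, 20.0, 50.0, 40.0, 70.0]
train, later = full[:4], full[4:]

lo, hi = fit_min_max(train)                  # parameters see the training span only
train_scaled = apply_min_max(train, lo, hi)
later_scaled = apply_min_max(later, lo, hi)
print(train_scaled)  # [0.0, 0.5, 0.25, 1.0]
print(later_scaled)  # [0.75, 1.5]
```

Values outside the training range map outside [0, 1], as the last value shows; this is the expected consequence of not peeking at future data.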
(3)
Meta-Feature Extraction:
Leveraging the periodicity of cellular network traffic, each cell’s traffic load data is modelled as a discrete signal with a period T = 168 h (one week). Its frequency domain characteristics are analysed via FFT. Five principal frequency components (corresponding to periods of 1 week, 1 day, 12 h, 8 h, and 6 h) are selected from the frequency domain information. Their real and imaginary parts collectively constitute a 10-dimensional feature vector.
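The feature extraction can be sketched by evaluating the discrete Fourier transform only at the five bins of interest. The division by the series length, the ordering of the features, and the function names are assumptions; the text specifies only the five periods and the use of real and imaginary parts. An FFT routine such as `numpy.fft.rfft` would give the same coefficients before scaling.

```python
import cmath
import math

PERIODS_H = (168, 24, 12, 8, 6)  # 1 week, 1 day, 12 h, 8 h, 6 h


def dft_coefficient(series, k):
    """k-th DFT coefficient of the series (forward transform, no scaling)."""
    n = len(series)
    return sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(series))


def meta_features(series, periods=PERIODS_H):
    """Real and imaginary parts at the selected periods: a 10-dimensional vector."""
    n = len(series)
    feats = []
    for period in periods:
        if n % period != 0:
            raise ValueError("series length must be a multiple of every period")
        c = dft_coefficient(series, n // period) / n  # scaled by length (assumption)
        feats.extend([c.real, c.imag])
    return feats


# One synthetic week with a purely daily rhythm of amplitude 0.3.
week = [0.5 + 0.3 * math.cos(2 * math.pi * t / 24) for t in range(168)]
f = meta_features(week)
print(len(f))          # 10
print(round(f[2], 6))  # 0.15, the real part at the 24 h period
```

For this synthetic series only the daily component is non-zero, which is the kind of signature the GMM later clusters on.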
(4)
Foundational Sample Construction:
The traffic forecasting task is framed as a supervised learning problem. To comprehensively evaluate model performance, we designed two tasks: single-step forecasting and multi-step forecasting. Single-step forecasting uses the preceding three hours’ data as input to predict the fourth hour’s traffic. Multi-step forecasting employs the preceding six hours’ data as input to forecast traffic for the first, sixth, and twelfth hours ahead, thereby assessing the model’s performance across varying forecasting horizons.
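The two sample constructions can be expressed with one sliding-window helper. The function name and the convention that a horizon of $h$ means $h$ steps after the last input value are illustrative assumptions.

```python
def make_windows(series, input_len, horizons):
    """Build (input, targets) pairs from a series.

    For each window of input_len consecutive values, the targets are the values
    h steps after the last input value, for every h in horizons.
    """
    samples = []
    max_h = max(horizons)
    for start in range(len(series) - input_len - max_h + 1):
        x = series[start:start + input_len]
        last = start + input_len - 1
        y = [series[last + h] for h in horizons]
        samples.append((x, y))
    return samples


series = list(range(20))  # stand-in for an hourly load series

# Single-step: three hours in, the fourth hour out.
single = make_windows(series, input_len=3, horizons=(1,))
# Multi-step: six hours in, targets 1, 6 and 12 hours ahead.
multi = make_windows(series, input_len=6, horizons=(1, 6, 12))

print(single[0])  # ([0, 1, 2], [3])
print(multi[0])   # ([0, 1, 2, 3, 4, 5], [6, 11, 17])
```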
(5)
Dataset Partitioning:
To ensure data quality, we filtered out grid cells with zero total traffic throughout the collection period and those with persistently very low traffic (e.g., average hourly traffic below 0.001), treating them as inactive cells or areas serving very few users. Ultimately, approximately 9920 active cells were retained for subsequent experiments.
Meta-Training Set: 70% of cells (6944) were randomly selected to construct the meta-knowledge base. For each meta-training cell, its data was temporally partitioned into two segments: the initial 80% served to “train” the base learner for optimal weights, while the final 20% functioned as a “validation set” for hyperparameter tuning and early stopping decision-making to prevent overfitting.
Test Task Set: The remaining 30% of cells (2976) simulate newly deployed base stations or new tasks. They are excluded from meta-training construction and serve solely for performance evaluation.
Fine-tuning set: For each test task, only its initial week’s data (168 samples) is used to fine-tune the base learner, simulating small-sample application scenarios.
Test Dataset: Each test task utilises the two weeks of data following the fine-tuning set (9–23 December 2013) for final prediction performance evaluation.
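The partitioning can be sketched as a seeded shuffle of cell identifiers followed by a chronological cut within each meta-training cell. The function names and the use of integer cell identifiers are assumptions; the 70/30 and 80/20 ratios, the cell counts, and the seed value 42 are those reported in this section.

```python
import random


def split_cells(cell_ids, train_fraction=0.7, seed=42):
    """Randomly assign cells to the meta-training set or the test task set."""
    ids = list(cell_ids)
    random.Random(seed).shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]


def temporal_split(series, train_fraction=0.8):
    """Chronological split: the earlier part for training, the rest for validation."""
    cut = round(len(series) * train_fraction)
    return series[:cut], series[cut:]


meta_train_cells, test_cells = split_cells(range(9920))
print(len(meta_train_cells), len(test_cells))  # 6944 2976

train_part, val_part = temporal_split(list(range(10)))
print(train_part, val_part)  # [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```

Splitting by cell rather than by time is what makes the test tasks behave like newly deployed base stations.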

4.2.1. Base Learner Configuration

To test the robustness of the entire model, this experiment employs two LSTM network architectures as base learners. Their hyperparameters are determined via grid search, with mean squared error serving as the core evaluation metric. To prevent data leakage, hyperparameter optimisation is strictly confined to the meta-training set. Each meta-training cell’s time series is partitioned into training, validation, and test sets at an 8:1:1 ratio. The validation set is used exclusively for hyperparameter tuning and early stopping decisions, whilst the test set serves solely to evaluate the optimal weights for that cell and takes no part in training or parameter adjustment. Training stops when the validation loss has not decreased for ten consecutive epochs, and the model parameters with the smallest validation loss are then restored.
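The early stopping rule can be sketched independently of any deep learning library. Here a list of validation losses stands in for a real training loop, and the function name and return values are hypothetical; a real implementation would snapshot and restore model weights at the marked points.

```python
def run_early_stopping(val_losses, patience=10):
    """Return (best_epoch, best_loss, epochs_run) for a stream of validation losses."""
    best_loss, best_epoch, epochs_run = float("inf"), -1, 0
    for epoch, loss in enumerate(val_losses):
        epochs_run = epoch + 1
        if loss < best_loss:
            # A real loop would save a copy of the model weights here.
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop and
            # (in a real loop) restore the saved weights.
            break
    return best_epoch, best_loss, epochs_run


# Loss improves for four epochs, then plateaus at a worse value.
losses = [1.0, 0.8, 0.6, 0.5] + [0.55] * 20
best_epoch, best_loss, epochs_run = run_early_stopping(losses)
print(best_epoch, best_loss, epochs_run)  # 3 0.5 14
```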
Search space: number of layers {2, 3, 4, 5}, hidden layer dimensions {64, 64, 32}, learning rate {0.0001, 0.001, 0.01}. The mean squared error (MSE) is employed as the loss function during search.
Table 4 presents the parameter configurations for the two LSTM network architectures.
Table 4. Parameter configurations for the two LSTM network architectures.

4.2.2. Component Learner and Correction Parameters

KNN Model: The number of nearest neighbours K was optimised on the validation set and set to 10.
GMM Model: Implemented using scikit-learn’s GaussianMixture. Number of Gaussian components K automatically selected within the range [5, 50] via Bayesian Information Criterion (BIC).
Feedback strength coefficient: The hyperparameters α and β in the MCM correction mechanism are set to 1.0.
Training configuration and computational environment:
Optimiser: All deep learning models employ the AdamW optimiser with a fixed weight decay coefficient of $1 \times 10^{-5}$.
Weight Initialisation: Xavier uniform initialisation is employed.
Random seed: To ensure reproducibility, all experiments were conducted with a fixed random seed of 42.
Experiment Replication: Due to computational constraints, each experiment was run once under a fixed seed. However, Section 4.5.8 demonstrates the model’s robustness near optimal parameters through sensitivity analysis.
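The BIC-based choice of the number of components can be sketched with scikit-learn’s `GaussianMixture`. The synthetic two-dimensional features, the small candidate range, and the helper name are assumptions chosen to keep the example fast; the experiments themselves use 10-dimensional meta-features and a candidate range of [5, 50].

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def select_gmm_by_bic(features, k_range, seed=42):
    """Fit one GMM per candidate K and keep the model with the lowest BIC."""
    best_model, best_bic, bics = None, np.inf, {}
    for k in k_range:
        model = GaussianMixture(n_components=k, random_state=seed).fit(features)
        bics[k] = model.bic(features)
        if bics[k] < best_bic:
            best_model, best_bic = model, bics[k]
    return best_model, bics


# Synthetic stand-in for meta-features: three well-separated clusters.
rng = np.random.default_rng(42)
centres = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
features = np.vstack([c + rng.normal(scale=0.5, size=(100, 2)) for c in centres])

model, bics = select_gmm_by_bic(features, range(1, 7))
posterior = model.predict_proba(features[:1])  # soft assignment of one task
print(model.n_components, posterior.round(3))
```

The posterior row returned by `predict_proba` is the soft assignment that a multi-component mechanism can use in place of a single hard cluster label.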

4.3. Comparison of Algorithms

CLN (FIWV): Conventional LSTM Network (Fixed Initial Weight Vector). This algorithm is a traditional deep learning approach, training each task independently, with all tasks using the same fixed set of random initial weights.
CLN (RSIWV): Conventional LSTM Network (Randomly Selected Initial Weight Vector). This method trains each task independently, with each task employing distinct random initial weights.
ML-TP (KNN): The method proposed in the original paper [5]. It employs the K-Nearest Neighbours algorithm as the meta-learner, selecting meta-samples with the closest Euclidean distance for new tasks to assign initial weights. This serves as the core comparison baseline.
GMM-SCM: Single-Component Mechanism. Employing a GMM as the meta-learner, the SCM mechanism is utilised when assigning initial parameters to new tasks.
GMM-MCM: Multi-Component Mechanism. Employing GMM as the meta-learner, it utilises the MCM mechanism to assign initial parameters for new tasks.
GMM-SCM-NF: SCM with Negative Feedback. Initial strategy identical to GMM-SCM. Its correction mechanism employs SCM correction.
GMM-MCM-NF: MCM with Negative Feedback. Initial strategy identical to GMM-MCM. Its correction mechanism employs multi-component collaborative correction (MCM correction).

4.4. Evaluation Metrics

Mean Absolute Error (MAE): Reflects the average absolute deviation between predicted and actual values; lower values are preferable.
Coefficient of Determination ($R^2$): Reflects the extent to which predicted results explain the variance in actual data; values closer to 1 are preferable.
Number of Training Epochs Required for Convergence (Epoch): Used to assess learning efficiency.
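The error metrics can be computed as in the following sketch, which uses the standard definitions. RMSE is included because the result tables report it alongside MAE and $R^2$; the sample values are made up for illustration.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


y_true = [0.2, 0.4, 0.6, 0.8]
y_pred = [0.25, 0.35, 0.65, 0.75]
print(round(mae(y_true, y_pred), 4))   # 0.05
print(round(rmse(y_true, y_pred), 4))  # 0.05
print(round(r2(y_true, y_pred), 4))    # 0.95
```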

4.5. Results and Discussion

4.5.1. Overall Performance Comparison

The results in Table 5 demonstrate that the proposed GMM-MCM-NF framework exhibits significant advantages: (1) compared with traditional deep learning, the meta-learning approach reduces MAE by approximately 52%, decreases RMSE by approximately 49%, and increases $R^2$ by approximately 0.26; (2) GMM-based meta-learning outperforms the KNN method across all metrics, validating the advantages of probabilistic modelling; (3) the MCM mechanism consistently outperforms SCM; (4) the negative feedback mechanism further reduces RMSE to 0.0431, achieving the best overall performance.
Table 5. Comparison of prediction performance across different methods on the test set.
The data for Figure 6 and Figure 7 is shown in Table 6 and Table 7.
Figure 6. Prediction results for cell 1233.
Figure 7. Prediction results for cell 2367.
Table 6. Comparison of real load and prediction methods for Figure 6.
Table 7. Comparison of real load and prediction methods for Figure 7.
Two cells were randomly selected from the dataset, with traffic load data from 9 to 10 December 2013 (48 h) extracted as the test set. The predictive performance of CLN (FIWV), ML-TP (KNN), and GMM-MCM-NF was evaluated, with results presented in Figure 6 and Figure 7.
As shown in Figure 6 and Figure 7, CLN (FIWV) exhibits both prediction lag and magnitude error: its prediction curve falls below the actual values at nearly all time points. Taking Figure 6 as an example, during the rapid traffic increase in the morning (e.g., at 32 on the x-axis), its prediction (0.55) significantly underestimated the actual traffic (0.70). This indicates that fixed initial weights struggle to adapt rapidly to new traffic patterns (such as holidays), resulting in pronounced magnitude errors.
The KNN prediction curve closely tracks the actual curve, significantly outperforming CLN. This demonstrates the effectiveness of similarity-based meta-initialisation. However, at pattern transition points (e.g., position 35 on the horizontal axis in Figure 6), its prediction (0.32) remains slightly below the actual value (0.35), exhibiting minor lag.
The GMM-MCM-NF prediction curve nearly coincides with the actual curve. Particularly on 10 December (holiday), it accurately forecasted both the overall traffic level increase and specific values at each time point, demonstrating the superior predictive capability of this model.

4.5.2. Multi-Step Forecasting Performance Analysis

To validate the model’s practicality over extended forecasting horizons, we compared the GMM-MCM-NF with baseline models regarding their prediction performance (MAE) for 1, 6, 12, and 24 h ahead. Results are presented in Table 8.
Table 8. Prediction performance comparison at different time horizons.
As shown in Table 8, the prediction error of all models increases with extended forecasting horizons, a common phenomenon in time series forecasting. However, GMM-MCM-NF maintains the lowest Mean Absolute Error (MAE) across all forecasting horizons, with its performance advantage remaining significant even in 24-h long-term forecasts. This demonstrates that our framework, by learning robust meta-features, can effectively capture long-term dynamic traffic patterns and possesses the potential to address real-world multi-step forecasting requirements.

4.5.3. Learning Efficiency Analysis

The learning curves in Figure 8 demonstrate that meta-learning methods converge within 20 iterations, whereas traditional methods require over 40 iterations. GMM-MCM-NF converges fastest (approximately 10 iterations) with the lowest convergence error, highlighting its superior initialisation quality. The data for Figure 8 are shown in Table 9.
Figure 8. Learning curves of different methods (Architecture 1, fine-tuning set size = 168).
Table 9. MAE comparison across epochs.

4.5.4. Robustness Testing in Low-Sample-Size Scenarios

As shown in Table 10, GMM-MCM-NF demonstrates a more pronounced advantage in small-sample scenarios. In an extremely small-sample scenario (24 samples), GMM-MCM-NF’s MAE (0.059) and RMSE (0.081) were significantly lower than those of ML-TP (MAE: 0.081, RMSE: 0.113) and CLN (MAE: 0.145, RMSE: 0.198). Statistical hypothesis testing indicates that when the sample size N 24 , the difference in MAE between GMM-MCM-NF and ML-TP is statistically significant ( p < 0.05 ).
Table 10. Performance under small-sample scenarios (MAE/RMSE).

4.5.5. Model Component Ablation Experiments

The objective of this section is to quantitatively analyse the contributions of the three core components: GMM, MCM, and NF. The ablation model settings are as follows:
Base (KNN): Utilises only the KNN meta-learner.
+GMM: Incorporates the GMM meta-learner on top of Base.
+GMM+MCM: Further incorporates the multi-component synthesis mechanism.
+GMM+MCM+NF: Full model incorporating negative feedback mechanism.
The ablation results in Table 11 demonstrate that the cumulative improvement contributed by each component on the enhanced base learner reaches 10.6%. Particularly noteworthy is the most significant acceleration in convergence speed achieved by the MCM mechanism (reduced from 22 to 16 iterations), indicating that multi-component synthesis provides initial points closer to the optimal solution, thereby synergising effectively with the enhanced base learner.
Table 11. Ablation experiment results.

4.5.6. Comprehensive Comparison with GNN Baselines

This section aims to validate the advantages of the proposed framework over state-of-the-art GNN models, selecting STGCN, GWNET, GTS, and AGCRN as baselines. All models employ identical data partitioning and input sequence lengths.
Experimental results are shown in Table 12 and Table 13. GMM-MCM-NF significantly outperformed the GNN baseline models across all metrics. Regarding MAE, it achieved improvements of 10.1% (Structure 1) and 9.2% (Structure 2). Regarding training efficiency, GMM-MCM-NF required only 12–15% of the training time of GNN models, demonstrating the significant computational advantage of the meta-learning framework. This indicates that while GNN models excel at capturing spatial dependencies, the meta-learning framework achieves superior generalisation performance through cross-task knowledge transfer when handling small-sample, heterogeneous task scenarios.
Table 12. Comparison Results with GNN Models.
Table 13. Performance of different models on the Shanghai dataset (MAE).

4.5.7. Validation of Cross-Dataset Generalisation Capability

This section’s experiments aim to validate the GMM-MCM-NF framework’s performance on another independent public dataset, demonstrating that its effectiveness is not dataset-specific and that it possesses strong generalisation capabilities.
The dataset selected for this section’s experiments is the Telecom Shanghai Dataset. Like Telecom Italia, this dataset is another commonly used public benchmark dataset within the field. Data preprocessing procedures remain consistent with those applied to the Telecom Italia dataset.
(1)
Dataset partitioning and simulation rationale
Partitioning was conducted based on Shanghai’s urban geographic information system and base station density maps:
Region-A (Central Business District): Includes base stations in the core Puxi area (e.g., Huangpu, Jing’an, and parts of Xuhui). This zone is predominantly commercial, office, and high-end residential, exhibiting typical weekday-dominated traffic patterns with pronounced daytime peaks.
Region-B (Peripheral Mixed-Use Zone): Includes base stations in Pudong New Area and certain peripheral urban districts. This functionally mixed zone encompasses residential, industrial, and emerging development areas, exhibiting traffic patterns with both commuting and residential characteristics, though cyclical patterns are relatively less pronounced.
This division effectively simulates the inherent heterogeneity that different operators may encounter in network deployment planning, where operators prioritise distinct areas, leading to systematic differences in the traffic patterns they face.
(2)
Experimental Setup
Meta-training phase: Constructing the meta-knowledge base (GMM model and weight vector set) using only Region-A data.
Testing Phase: Model performance is evaluated using Region-B base station data, simulating the direct application of experience learned from one operator (Region-A) to an entirely new operator (Region-B) scenario.
Baseline Comparison: All comparison models adhere to the identical training-testing regional partitioning.
(3)
Results and Analysis
Table 14 demonstrates the models’ predictive performance in a “cross-region/cross-operator” scenario.
Table 14. Predictive performance in cross-operator scenarios (MAE/RMSE/$R^2$).
As shown in Table 14, all models exhibit performance degradation, but GMM-MCM-NF demonstrates the smallest decline. When generalising from Region-A to Region-B, the MAE performance of traditional CLN methods degrades by approximately 20%, whereas GMM-MCM-NF’s degradation is 14.6%. This demonstrates that the proposed meta-learning framework exhibits greater robustness in handling cross-domain heterogeneity. Even on the unseen Region-B, GMM-MCM-NF’s MAE (0.0988) and RMSE (0.1365) remain significantly lower than all baseline models’ performance within Region-A. This indicates that the “prior knowledge” acquired through meta-learning possesses high transfer value.

4.5.8. Hyperparameter Sensitivity Analysis

To assess the model’s robustness to hyperparameter variations, this study systematically analysed the impact of key hyperparameters on predictive performance. With other parameters fixed at optimal settings, the model’s behaviour was evaluated under typical values for learning rate η and sensitivity coefficient λ (Structure 1, sample size = 168). Sensitivity analysis results are presented in Table 15.
Table 15. Results of Hyperparameter Sensitivity Analysis (MAE).
Learning rate sensitivity: When $\eta = 0.001$, the model maintains low MAE across varying $\lambda$ values, demonstrating strong robustness. At $\eta = 0.0001$, convergence is excessively slow; at $\eta = 0.01$, gradient oscillations occur.
Sensitivity coefficient impact: $\lambda = 1.0$ yields optimal performance in most scenarios. Excessively low values ($\lambda = 0.5$) result in insufficient bias correction, while excessively high values ($\lambda = 2.0$) cause overcorrection.
Parameter coupling effect: A weak coupling exists between $\eta$ and $\lambda$, yet performance changes gradually near the optimal region ($\eta = 0.001$, $\lambda = 1.0$), demonstrating the model’s insensitivity to hyperparameter perturbations.
Comprehensive sensitivity analysis indicates that GMM-MCM-NF maintains stable performance across a broad hyperparameter range, reducing the difficulty of tuning during practical deployment.

4.5.9. Model Explainability and Failure Mode Analysis

To enable base station operators and network engineers to comprehend, trust, and effectively deploy the GMM-MCM-NF model, we designed a systematic interpretability framework to explain the model’s decision-making rationale.
(1)
Meta-feature-component semantic mapping:
Each GMM component $k$ corresponds to a latent, semantically meaningful traffic pattern category. We interpret its semantics by analysing the mean vector $\mu_k$ of each component.
Frequency domain pattern analysis: Calculate the amplitudes of $\mu_k$ at the five principal frequencies (week, day, 12 h, 8 h, 6 h). For instance, a component exhibiting significantly higher amplitude at the ‘day’ cycle ($\pi/12$) than at other frequencies may represent a ‘commercial district’ pattern (high daytime traffic, low night-time traffic), whereas a component with a pronounced ‘week’ cycle ($\pi/84$), that is, marked differences between weekdays and weekends, may indicate a ‘commuter zone’ pattern.
Component semantic labels: Through clustering analysis (e.g., K-Means) of all component mean vectors, we assigned semantic labels to each GMM component, as shown in Table 16.
Table 16. Semantic interpretation of GMM components.
When the model assigns initial weights to a new base station $q$, operators can consult its posterior probability distribution $\gamma(z_{qk})$ to understand: “The model interprets this base station’s traffic pattern as 40% similar to commercial areas, 35% similar to residential areas, and 25% similar to entertainment districts.” This provides an intuitive explanation for weight initialisation.
(2)
Attribution analysis for prediction failures:
When the model produces significant prediction errors on a task (base station) $q$, we conduct attribution through the following steps:
  • Responsibility Weight Analysis: Calculate the responsibility weights $r_k$ within the MCM correction mechanism. This weight quantifies the contribution of each component $k$ to the current error. Define the primary responsible component as $k^{*} = \arg\max_k r_k$.
  • Meta-feature anomaly detection: Calculate the Mahalanobis distance from the new task’s meta-feature $f_q$ to the centre of its primary responsible component $k^{*}$:
    $$D_{\mathrm{Mahalanobis}} = (f_q - \mu_{k^{*}})^{T} \, \Sigma_{k^{*}}^{-1} \, (f_q - \mu_{k^{*}})$$
    If this distance exceeds a preset threshold (e.g., the $\chi^2_{d,0.99}$ quantile), it indicates that the base station’s actual traffic pattern falls outside the existing experience range of its assigned component, constituting an out-of-distribution (OOD) sample. This represents a primary cause of model failure.
  • Feedback signal interpretation: The corrective mechanism’s specific operations provide direct diagnostic information. If the mixing coefficient $\pi_k$ of a component $k$ is significantly reduced, it indicates that the component has performed poorly in recent tasks, suggesting its provided “experience” may be outdated or inaccurate. Conversely, if a component’s mean $\mu_k$ or representative weight is substantially updated, it indicates the system is learning a new or evolving traffic pattern.
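The out-of-distribution check in the second step can be sketched as follows. The component mean, the diagonal covariance, and the two query vectors are made-up values; the threshold 23.209 is the 0.99 quantile of the chi-square distribution with $d = 10$ degrees of freedom, matching the 10-dimensional meta-features. The quantity compared with the threshold is the quadratic form exactly as written above, without a square root.

```python
import numpy as np

CHI2_99_DF10 = 23.209  # 0.99 quantile of chi-square with 10 degrees of freedom


def mahalanobis_sq(f_q, mean, cov):
    """Quadratic form (f_q - mean)^T cov^{-1} (f_q - mean)."""
    diff = np.asarray(f_q, dtype=float) - np.asarray(mean, dtype=float)
    # Solving the linear system avoids forming the explicit inverse.
    return float(diff @ np.linalg.solve(cov, diff))


def is_out_of_distribution(f_q, mean, cov, threshold=CHI2_99_DF10):
    """Flag a task whose meta-feature lies beyond the chi-square threshold."""
    return mahalanobis_sq(f_q, mean, cov) > threshold


d = 10
mean = np.zeros(d)          # hypothetical centre of component k*
cov = 0.04 * np.eye(d)      # hypothetical covariance (std 0.2 per dimension)
near = np.full(d, 0.1)      # half a standard deviation away in every dimension
far = np.full(d, 0.5)       # 2.5 standard deviations away in every dimension

print(round(mahalanobis_sq(near, mean, cov), 3), is_out_of_distribution(near, mean, cov))
print(round(mahalanobis_sq(far, mean, cov), 3), is_out_of_distribution(far, mean, cov))
# 2.5 False
# 62.5 True
```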

4.5.10. Uncertainty Analysis and Calibration

This section aims to evaluate the predictive reliability and uncertainty of the GMM-MCM-NF model. The experiments on the test task set were repeated, and the statistical uncertainty of key metrics was calculated.
(1)
Performance Stability
Table 17 displays the performance fluctuation of GMM-MCM-NF across five runs. The standard deviations of MAE and RMSE are both less than 1.8% of their respective means, indicating the framework’s stability and reproducibility.
Table 17. Statistical uncertainty of GMM-MCM-NF performance (Structure 1).
(2)
Characteristics of Prediction Bias and Error Distribution
We conducted a systematic analysis of the errors (residuals) across all test tasks during the prediction period, identifying the following significant patterns:
Systematic bias detection: The mean residual was $8.3 \times 10^{-4}$. Although numerically small, a one-sample t-test revealed this bias to be statistically significant ($p < 0.05$). This indicates a slight systematic overestimation tendency in the model, averaging approximately 0.083% of the normalised traffic range.
Error Distribution Characteristics: The residual standard deviation is 0.0426, indicating a reasonable range of prediction uncertainty. A skewness of +0.18 shows a slightly right-skewed error distribution, meaning that a small number of larger positive errors (overestimations) exist. A kurtosis of 3.12 is slightly above that of a normal distribution (3), suggesting a slightly higher probability of extreme errors.
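The residual statistics quoted above can be computed as in the following sketch. It assumes that a residual is the predicted value minus the actual value, so that positive residuals are overestimations, and that kurtosis is reported in its non-excess form (3 for a normal distribution); the toy residuals are illustrative only.

```python
import math


def residual_summary(residuals):
    """Mean, standard deviation, skewness, kurtosis and one-sample t statistic."""
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((r - mean) ** 2 for r in residuals) / n  # central moments
    m3 = sum((r - mean) ** 3 for r in residuals) / n
    m4 = sum((r - mean) ** 4 for r in residuals) / n
    sample_std = math.sqrt(m2 * n / (n - 1))          # unbiased variance for the t-test
    return {
        "mean": mean,
        "std": math.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "kurtosis": m4 / m2 ** 2,                     # non-excess kurtosis
        "t_stat": mean / (sample_std / math.sqrt(n)), # H0: mean residual is zero
    }


residuals = [-1.0, 0.0, 0.0, 1.0]  # symmetric toy residuals (predicted - actual)
summary = residual_summary(residuals)
print(summary["mean"], summary["skewness"], summary["kurtosis"], summary["t_stat"])
# 0.0 0.0 2.0 0.0
```

With a large number of test residuals, a small mean can still give a large t statistic, which is how a bias of this size can be statistically significant.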
(3)
Error Calibration and Reliability Assessment
To evaluate the calibration quality of model uncertainty, this paper analysed the relationship between prediction error and predicted values:
Conditional bias analysis: When predicted values < 0.3 (low flow), the mean error is +0.0021 (significant overestimation); when predicted values > 0.7 (high flow), the mean error is −0.0015 (mild underestimation). This indicates the model’s bias exhibits conditional dependence.
In summary, GMM-MCM-NF demonstrates reasonable performance in point prediction accuracy, though room for improvement exists in uncertainty calibration. The model exhibits a slight systematic overestimation bias, particularly under low flow conditions. These findings provide clear directions for subsequent model optimisation, such as incorporating asymmetric loss functions or modelling conditional variance to better handle uncertainty across varying flow levels.

5. Conclusions

This paper delves into the critical challenges confronting cellular network traffic forecasting within the 5G/6G and IoT context: model complexity, task heterogeneity, and few-shot learning. To address these challenges, this paper proposes a meta-learning framework based on Gaussian mixture models—GMM-MCM-NF. Within this architecture, the GMM serves as the meta-learner, capturing the latent distributional structure of traffic patterns across different functional zones within cities by probabilistically modelling frequency-domain information from historical task meta-features. Building upon this foundation, the multi-component synthesis mechanism employs soft assignment to synthesise robust weight initialisations for new tasks, effectively overcoming the limitations of hard assignment inherent in traditional KNN approaches. Furthermore, the negative feedback correction mechanism dynamically adjusts meta-knowledge during long-term forecasting, enhancing the model’s long-term adaptability and robustness to non-stationary traffic sequences. Through systematic experimental validation on public datasets, we draw the following conclusions:
  • In terms of prediction accuracy and generalisation capability, the GMM-MCM-NF model significantly outperforms traditional deep learning models and baseline meta-learning models. Whether on homogeneous datasets or simulated heterogeneous scenarios spanning multiple operators, this model demonstrates superior MAE, RMSE, and $R^2$ performance, confirming its superior knowledge transfer and task adaptation capabilities.
  • Regarding learning efficiency, owing to high-quality initialisation, the proposed framework converges at a markedly faster rate (requiring approximately 40–50% fewer training iterations) while maintaining robust predictive performance under extreme small-sample conditions (e.g., with only 24 h of data). This holds considerable practical value for rapid deployment and energy-efficient management of new base stations.
  • Regarding model robustness, ablation experiments confirm the synergistic effect of the three core components—GMM, MCM, and NF—which collectively contribute over 10% performance improvement. Sensitivity analysis indicates the model exhibits stability near optimal parameters, reducing fine-tuning complexity in practical deployment.
In summary, the proposed GMM-MCM-NF framework offers a novel and effective solution to core challenges in cellular traffic forecasting. It not only academically validates the potential of probabilistic meta-learning in this domain but also possesses efficient and robust characteristics that enable its deployment in practical network management systems. This lays the foundation for constructing smarter, greener, and more adaptive future mobile networks. Despite the positive outcomes of this research, several avenues warrant further exploration in future work:
  • Model lightweighting and online learning mechanisms: While the current framework achieves optimal performance with large-scale meta-training datasets, its storage and computational overhead increase with the number of tasks. Future work will investigate lightweighting techniques such as model pruning and knowledge distillation, alongside exploring more efficient online meta-learning algorithms to enable real-time, low-overhead adaptation to dynamically changing traffic patterns.
  • Fusion of multimodal meta-features: This paper primarily utilises frequency-domain traffic features. Future work may incorporate richer meta-features, such as POI information around base stations, real-time weather data, and social event data, to construct a multimodal meta-learning framework. This would enable more precise characterisation of task contexts and further enhance initialisation quality.
  • Validation in real cross-operator and B5G/6G scenarios: While we simulated cross-operator scenarios through regional partitioning, ultimate validation requires real-world data encompassing more operators. Furthermore, with the development of B5G/6G, network slicing, and integrated air-ground-space networks, traffic patterns will exhibit novel characteristics. Applying this framework to these emerging scenarios to test and extend its applicability holds significant research value.

Author Contributions

Methodology, X.L. and Y.L.; Resources, S.Z.; Writing—original draft, X.L. and Y.L.; Writing—review & editing, S.Z., Q.S. and C.L.; Supervision, C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The dataset analyzed in this study is publicly available from the repository at https://doi.org/10.1038/sdata.2015.55.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Jiang, W. Cellular traffic prediction with machine learning: A survey. Expert Syst. Appl. 2022, 201, 117163.
  2. Wang, X.; Wang, Z.; Yang, K.; Song, Z.; Bian, C.; Feng, J.; Deng, C. A survey on deep learning for cellular traffic prediction. Intell. Comput. 2024, 3, 54.
  3. Duan, A.; Zhang, Z. Cellular traffic prediction using a hybrid neural network based on quadratic decomposition. Syst. Eng. Electron. 2025, 47, 1687–1697.
  4. Jiang, D.; Zhao, H.; Wang, Z. A Long-Term Cellular Network Traffic Forecasting Method Based on EWT and NeuralProphet-MLP. Mod. Inf. Technol. 2024, 8, 52–57.
  5. Yan, W. Research on Key Energy-Saving Technologies Based on Traffic Forecasting in Cellular Networks. Ph.D. Thesis, Zhejiang University, Hangzhou, China, 2012.
  6. Gu, M.C. Traffic forecasting method for EMD-LSTM networks based on noise statistics. Comput. Meas. Control 2023, 31, 21–27.
  7. Zang, Y.; Ni, F.; Feng, Z.; Cui, S.; Ding, Z. Wavelet transform processing for cellular traffic prediction in machine learning networks. In Proceedings of the 2015 IEEE China Summit and International Conference on Signal and Information Processing (ChinaSIP), Chengdu, China, 12–15 July 2015; pp. 458–462.
  8. Zhang, L. Wavelet-scale particle swarm variable step detection for multi-cluster network traffic. Sci. Technol. Bull. 2015, 31, 215–217.
  9. Zhang, Z.; Wu, D.; Zhang, C. Research on cellular traffic prediction based on multi-channel sparse LSTM. Comput. Sci. 2021, 48, 296–300.
  10. Jaffry, S.; Hasan, S.F. Cellular traffic prediction using recurrent neural networks. In Proceedings of the 2020 IEEE 5th International Symposium on Telecommunication Technologies (ISTT), Shah Alam, Malaysia, 9–11 November 2020; pp. 94–98.
  11. Jaffry, S. Cellular traffic prediction with recurrent neural network. arXiv 2020, arXiv:2003.02807.
  12. Li, W.; Jia, H.; Shen, C.; Wu, Y. LSTM-TCN base station traffic prediction algorithm based on multi-head self-attention mechanism. Mod. Electron. Technol. 2024, 47, 125–130.
  13. Alsaade, F.W.; Hmoud Al-Adhaileh, M. Cellular traffic prediction based on an intelligent model. Mob. Inf. Syst. 2021, 2021, 6050627.
  14. Azari, A.; Papapetrou, P.; Denic, S.; Peters, G. Cellular traffic prediction and classification: A comparative evaluation of LSTM and ARIMA. In Proceedings of the International Conference on Discovery Science, Split, Coratia, 28–30 October 2019; pp. 129–144. [Google Scholar]
  15. Kurri, V.; Raja, V.; Prakasam, P. Cellular traffic prediction on blockchain-based mobile networks using LSTM model in 4G LTE network. Peer-Netw. Appl. 2021, 14, 1088–1105. [Google Scholar] [CrossRef]
  16. Li, H.; Xu, Y.; Guo, Y. Time series forecasting based on LSTM hybrid models. Yangtze River Inf. Commun. 2022, 35, 38–40. [Google Scholar]
  17. Zheng, S.; Zhang, X.; Zhang, Y.; Wang, X.; Yuan, G. Low-Complexity Cellular Traffic Forecasting Method Based on Lightweight Convolutional Neural Networks. Radio Commun. Technol. 2024, 50, 921–931. [Google Scholar]
  18. Huang, D.; Yang, B.; Wu, Z.; Kuang, J.; Yan, Z. Spatiotemporal fully connected convolutional networks for citywide cellular traffic prediction. Comput. Eng. Appl. 2021, 57, 168–175. [Google Scholar]
  19. Zhang, C.; Zhang, H.; Yuan, D.; Zhang, M. Citywide cellular traffic prediction based on densely connected convolutional neural networks. IEEE Commun. Lett. 2018, 22, 1656–1659. [Google Scholar] [CrossRef]
  20. Zhang, D.; Ren, J. Cellular network traffic prediction based on multi-temporal granularity spatio-temporal graph networks. Comput. Technol. Dev. 2024, 34, 24–30. [Google Scholar]
  21. Feng, J.; Chen, X.; Gao, R.; Zeng, M.; Li, Y. Deeptp: An end-to-end neural network for mobile cellular traffic prediction. IEEE Netw. 2018, 32, 108–115. [Google Scholar] [CrossRef]
  22. Zhang, D.; Liu, L.; Xie, C.; Yang, B.; Liu, Q. Citywide cellular traffic prediction based on a hybrid spatiotemporal network. Algorithms 2020, 13, 20. [Google Scholar] [CrossRef]
  23. Ni, F. Cellular network traffic prediction based on an improved wavelet-Elman neural network algorithm. Electron. Des. Eng. 2017, 25, 171–175. [Google Scholar]
  24. Li, Z.; Song, W.; Wang, C. Cellular network traffic prediction based on weighted multi-graph neural networks. Electron. Des. Eng. 2025, 33, 17–21. [Google Scholar]
  25. Guo, X.; Ma, M.; Zhou, Z.; Lu, Z.; Zhang, B. Mobile Cellular Network Traffic Forecasting Based on Spatio-Temporal Graph Convolutional Neural Networks. Sci. Ocean. Story Rev. 2023, 25–27. [Google Scholar]
  26. Wang, Y.; Fan, Y.; Sun, Y.; Xiong, J.; Jiang, T.; Zhou, Y.; Han, Z.; Li, Z.; Wang, Z. Research on Dynamic Base Station Switching Based on Deep Reinforcement Learning. Radio Commun. Technol. 2024, 50, 815–822. [Google Scholar]
  27. Fu, B.; Liu, S.; Liao, G.; Liu, Q.; Li, Z. Intelligent Decision System for Energy Conservation and Emission Reduction in 4G/5G Base Stations Based on T-GCN. Radio Commun. Technol. 2024, 50, 631–639. [Google Scholar]
  28. Wang, Z.; Hu, J.; Min, G.; Zhao, Z.; Chang, Z.; Wang, Z. Spatial-temporal cellular traffic prediction for 5G and beyond: A graph neural networks-based approach. IEEE Trans. Ind. Inform. 2022, 19, 5722–5731. [Google Scholar] [CrossRef]
  29. Yao, Y.; Gu, B.; Su, Z.; Guizani, M. MVSTGN: A multi-view spatial-temporal graph network for cellular traffic prediction. IEEE Trans. Mob. Comput. 2021, 22, 2837–2849. [Google Scholar] [CrossRef]
  30. Zhao, N.; Wu, A.; Pei, Y.; Liang, Y.C.; Niyato, D. Spatial-temporal aggregation graph convolution network for efficient mobile cellular traffic prediction. IEEE Commun. Lett. 2021, 26, 587–591. [Google Scholar] [CrossRef]
  31. Zhou, X.; Zhang, Y.; Li, Z.; Wang, X.; Zhao, J.; Zhang, Z. Large-scale cellular traffic prediction based on graph convolutional networks with transfer learning. Neural Comput. Appl. 2022, 34, 5549–5559. [Google Scholar] [CrossRef]
  32. Zhao, S.; Jiang, X.; Jacobson, G.; Jana, R.; Hsu, W.L.; Rustamov, R.; Talasila, M.; Aftab, S.A.; Chen, Y.; Borcea, C. Cellular network traffic prediction incorporating handover: A graph convolutional approach. In Proceedings of the 2020 17th Annual IEEE International Conference on Sensing, Communication, and Networking (SECON), Como, Italy, 22–25 June 2020; pp. 1–9. [Google Scholar]
  33. Liu, Q.; Li, J.; Lu, Z. ST-Tran: Spatial-temporal transformer for cellular traffic prediction. IEEE Commun. Lett. 2021, 25, 3325–3329. [Google Scholar] [CrossRef]
  34. Gu, B.; Zhan, J.; Gong, S.; Liu, W.; Su, Z.; Guizani, M. A spatial-temporal transformer network for city-level cellular traffic analysis and prediction. IEEE Trans. Wirel. Commun. 2023, 22, 9412–9423. [Google Scholar] [CrossRef]
  35. Wei, B. Research on Spatio-Temporal Prediction Methods for Metropolitan Cellular Traffic Based on Deep Multi-Task Learning. Ph.D. Thesis, North China University of Technology, Beijing, China, 2022. [Google Scholar]
  36. Zhang, J.; Sun, L. Deep learning-based network anomaly detection and intelligent traffic prediction methods. Radio Commun. Technol. 2022, 48, 81–88. [Google Scholar]
  37. Cai, D.; Chen, K.; Lin, Z.; Li, D.; Zhou, T.; Leung, M.F. JointSTNet: Joint pre-training for spatial-temporal traffic forecasting. IEEE Trans. Consum. Electron. 2024, 71, 6239–6252. [Google Scholar] [CrossRef]
  38. Wan, Y.; Wang, N.; Liu, X.; Wang, Y.; Blaabjerg, F.; Chen, Z. Inertia-Emulation-Based Fast Frequency Response From EVs: A Multi-Level Framework With Game-Theoretic Incentives and DRL. IEEE Trans. Smart Grid 2025, in press. [CrossRef]
  39. Mehri, H.; Chen, H.; Mehrpouyan, H. Cellular Traffic Prediction Using Online Prediction Algorithms. arXiv 2024, arXiv:2405.05239. [Google Scholar] [CrossRef]
  40. Santos Escriche, E.; Vassaki, S.; Peters, G. A comparative study of cellular traffic prediction mechanisms. Wirel. Netw. 2023, 29, 2371–2389. [Google Scholar] [CrossRef]
  41. Zhang, C.; Zhang, H.; Qiao, J.; Yuan, D.; Zhang, M. Deep transfer learning for intelligent cellular traffic prediction based on cross-domain big data. IEEE J. Sel. Areas Commun. 2019, 37, 1389–1401. [Google Scholar] [CrossRef]
  42. Zhao, N.; Ye, Z.; Pei, Y.; Liang, Y.C.; Niyato, D. Spatial-temporal attention-convolution network for citywide cellular traffic prediction. IEEE Commun. Lett. 2020, 24, 2532–2536. [Google Scholar] [CrossRef]
  43. Barlacchi, G.; De Nadai, M.; Larcher, R.; Casella, A.; Chitic, C.; Torrisi, G.; Antonelli, F.; Vespignani, A.; Pentl, A.; Lepri, B. A multi-source dataset of urban life in the city of Milan and the Province of Trentino. Sci. Data 2015, 2, 150055. [Google Scholar] [CrossRef]
  44. Mexwell. Telecom Shanghai Dataset. 2023. Available online: https://www.kaggle.com/datasets/mexwell/telecom-shanghai-dataset (accessed on 15 January 2024).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.