1. Introduction
With the in-depth implementation of new-type urbanization and the strategy of building a strong transportation power, China’s urban spatial pattern is undergoing profound restructuring. Faced with scarce land resources and saturated existing transportation systems, developing a high-capacity, high-density underground urban rail transit network has become an inevitable choice to break through urban development bottlenecks and optimize the spatial structure. In this systemic undertaking, advances in tunnel construction technology directly determine construction efficiency, safety and environmental performance.
Among various methods such as cut-and-cover and shallow tunnelling, the shield tunnelling method, with its “minimally invasive” ability to pass beneath cities, offers high safety, low environmental impact and good economy. It is evolving from an advanced construction technique into the “backbone” and core driver of efficient underground space development in modern cities [1,2,3].
The thrust of a shield machine is the resultant force applied by the machine to overcome the resistance of the soil at the tunnel face and the frictional resistance of the lining segments, thereby driving the cutterhead and segments forward during tunnelling. It is a key control parameter that reflects both the tunnelling capacity of the shield machine and the stability of the tunnel face. Insufficient thrust may lead to risks such as face instability, over-excavation and even collapse, whereas excessive thrust tends to increase ground disturbance and additional stress, potentially inducing surface settlement, deformation of adjacent buildings and damage to lining segments [4]. Therefore, under complex geological conditions and diverse operating conditions, accurately predicting and optimally controlling shield thrust is of great significance for ensuring tunnelling safety and maintaining the stability of the surrounding environment. Hereafter, the total shield thrust is denoted by TH.
During shield tunnelling, sudden changes in ground conditions and complex interactions among construction parameters make it difficult to predict variations in shield thrust in a timely and reliable manner based solely on operator experience. Earlier studies mainly used empirical formulas or laboratory tests to investigate shield thrust. Krause et al. [5] proposed an empirical formula for estimating shield thrust based on multiple engineering cases in Japan and Germany. Owing to its simple form, it enables site engineers to quickly approximate and compare thrust demands. However, shield thrust is influenced by numerous factors, and this empirical formula takes shield diameter as essentially its only independent variable. As a result, its ability to capture the coupled effects of complex ground conditions and varying operating states is limited, and it cannot reliably estimate thrust under such circumstances. Physical model tests are an important approach for revealing the coupling mechanisms between shield thrust and other tunnelling parameters. Nevertheless, when scaling prototype conditions down for laboratory testing, it is often difficult to simultaneously satisfy both geometric similarity and similarity in material mechanical properties. Zhu Hehua et al. [6] analyzed the relationship between shield thrust and torque at different burial depths through laboratory model tests in sandy soils, and Su et al. [7] conducted reduced-scale model tests to study the coupling behavior between shield thrust and soil chamber pressure. These studies are useful for revealing the mechanisms associated with one or a few influencing factors, but they can hardly account simultaneously for the combined effects of geological conditions, tunnel geometry and multiple tunnelling parameters, and they thus have limited capability for accurately predicting shield thrust under complex conditions.
With advances in shield construction data acquisition systems and digitalization, large volumes of measured data—including geological parameters, tunnel geometry and multiple excavation parameters—have been accumulated in engineering practice, providing a solid data basis for the data-driven prediction of shield thrust. Recent studies have used machine learning methods to develop predictive models for shield tunnelling parameters. Sun et al. [8] proposed a dynamic prediction framework based on multi-source heterogeneous data, using a random forest model to predict shield loads, thereby providing a paradigm for data-driven research on shield tunnelling parameters. Kong et al. [9] developed a random-forest-based prediction model for thrust and torque by fully integrating geological conditions and tunnelling parameters, and its prediction accuracy was significantly better than that of traditional empirical methods. Liu et al. [10] constructed characteristic indicators such as tunnelling specific energy and a field cutting depth index and used artificial neural networks and support vector regression, respectively, to predict key tunnelling parameters, thereby providing references for evaluating shield energy consumption and optimizing tunnelling parameter configurations. Abbasi et al. [11] investigated an EPB-TBM project on Mashhad Metro Line 3 in Iran and systematically compared the performance of multiple machine-learning models for predicting TBM performance/penetration rate, providing a useful reference for evaluating the engineering applicability of different algorithms in practice.
Most of the above machine-learning models treat the tunnelling data from each ring as an independent sample, implicitly assuming that there is no correlation between rings. However, shield tunnelling is inherently a dynamic process with strong temporal dependence, and the thrust state at the current ring is inevitably influenced by the “inertia” of parameters from previous rings. Ignoring the time-series characteristics of the tunnelling parameters may prevent the model from fully capturing the dynamic patterns governing thrust evolution, thereby degrading prediction accuracy and generalization performance.
To address this limitation, temporal deep learning models have gradually been introduced into the prediction of shield tunnelling parameters in recent years [12,13]. Long Short-Term Memory (LSTM) networks can selectively retain or discard historical information during training through gating mechanisms formed by the input, forget and output gates and are well suited to engineering time-series data with long-term dependencies. Existing studies have shown that time-series deep learning models represented by LSTM can effectively learn the nonlinear mapping from historical monitoring data to future parameter values, enabling more accurate prediction of key quantities such as thrust. For example, Chen et al. [14] developed an LSTM model using short-term sequence data from the “ascending phase” of the tunnelling cycle to predict tunnelling parameters in the subsequent steady phase, thereby demonstrating the feasibility of the dynamic prediction of tunnelling parameters based on LSTM. Qiu et al. [15] constructed an LSTM model to accurately predict shield thrust and torque for deep-buried tunnel projects and achieved high prediction accuracy under complex surrounding rock conditions. Li et al. [16] further proposed a hybrid deep learning framework that combines Kriging interpolation, a channel-attention convolutional neural network and LSTM, effectively integrating geological and tunnelling features and significantly improving the accuracy of shield attitude prediction in complex ground conditions.
In summary, temporal deep learning models provide a powerful tool for the dynamic, data-driven prediction of shield tunnelling parameters, particularly shield thrust.
When using deep learning methods for thrust prediction, the neural network architecture and hyperparameter settings have a significant impact on model performance. The bidirectional long short-term memory network (BiLSTM) extends the conventional LSTM by introducing two recurrent structures that process the sequence in the forward and backward directions. This enables the model to exploit dependencies both from past to future and from future to past within the input sequence for feature extraction, thereby capturing more comprehensively the temporal evolution patterns of thrust and other parameters [17]. Hyperparameters of BiLSTM models—such as the number of neurons, learning rate and dropout rate—usually need to be specified before training, and their values directly affect the convergence speed and prediction accuracy of the network. However, it is difficult to determine an optimal hyperparameter combination purely based on experience. Manual trial-and-error or grid search is not only time-consuming but is also prone to becoming trapped in local optima, which limits model performance.
To overcome these problems, researchers have introduced intelligent optimization algorithms to optimize the hyperparameters of deep learning models [18]. Mahmoodzadeh et al. [19] combined an LSTM network with the grey wolf optimizer to predict TBM tunnelling performance, demonstrating that hybrid frameworks integrating temporal deep learning and intelligent optimization have strong potential for data-driven prediction in tunnel engineering. Genetic algorithms (GA), particle swarm optimization (PSO), the sparrow search algorithm (SSA) and the Hunger Games Search (HGS) algorithm all exhibit strong global search capabilities, do not rely on gradient information of the objective function and are well suited for hyperparameter search in complex, high-dimensional spaces [20]. Ye et al. [21] used PSO to optimize the number of neurons in LSTM and BPNN models, significantly improving the prediction accuracy of tunnelling speed. Gorgolis et al. [22] showed that the prediction accuracy of an LSTM model optimized by GA was significantly and consistently higher than that of an unoptimized baseline model. Zhang et al. [23] used an SSA-optimized LSTM model for real-time multi-region thrust prediction. Meanwhile, Wang et al. [24] combined HGS with LSTM and demonstrated that this approach can effectively improve model accuracy in shield tunnelling parameter prediction tasks. Hyperparameters have a substantial impact on the performance of temporal neural networks; therefore, selecting appropriate hyperparameters is crucial for achieving high prediction accuracy. Because optimization algorithms differ markedly in their search strategies, identifying the most suitable method for a given dataset or engineering problem is particularly important. Accordingly, this study compares different optimizers in terms of prediction accuracy, computational efficiency and stability during the hyperparameter tuning process, with the aim of determining an algorithm that delivers high accuracy, fast convergence and robust performance.
In summary, intelligent optimization algorithms offer clear advantages for hyperparameter optimization in deep learning models. However, most existing studies have focused on issues such as the tunnelling rate and cutterhead torque, and research on shield thrust prediction remains relatively limited. At the same time, many studies employ only a single optimization algorithm, and systematic comparisons of different optimization strategies in terms of prediction accuracy, convergence efficiency and result stability are still lacking.
Based on the above analysis, this study develops a BiLSTM-based time-series prediction model for shield thrust that simultaneously accounts for the effects of geological conditions, tunnel geometry and multiple tunnelling control parameters. Four intelligent optimization algorithms—GA, HGS, PSO and SSA—are employed to automatically tune key hyperparameters of the BiLSTM model. By comparing BiLSTM models optimized by different algorithms in terms of prediction accuracy, convergence speed and sensitivity to population size, the objectives of this study are twofold: (1) to establish a dynamic prediction framework for shield thrust that is applicable to complex ground conditions; and (2) to evaluate the characteristics of different intelligent optimization algorithms in the hyperparameter optimization of shield thrust prediction models.
2. Methodological Framework
An intelligent hybrid modeling framework for shield thrust prediction is constructed by jointly optimizing the number of neurons, learning rate and dropout rate of a BiLSTM model using GA, HGS, PSO and SSA.
2.1. Computational Principles of the BiLSTM Neural Network
BiLSTM is an extension of the traditional LSTM. LSTM is a special type of recurrent neural network (RNN) designed to address the vanishing and exploding gradient problems that occur in traditional RNNs when training on long sequences. As shown in
Figure 1, LSTM effectively handles complex long-sequence dependencies by introducing gating mechanisms and a cell state to control the flow of information, thereby enabling selective forgetting, memorizing and outputting. The computation process is as follows:
(1) Forget gate: determines how much information from the previous cell state $C_{t-1}$ should be retained:

$$f_t = \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right)$$

(2) Input gate and candidate cell state: decides how much of the current input $x_t$ is written into the cell state while generating the candidate cell state $\tilde{C}_t$:

$$i_t = \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right), \qquad \tilde{C}_t = \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right)$$

(3) Cell state update: combines the effects of the forget and input gates to update the cell state:

$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

(4) Output gate: based on the updated cell state, computes the final hidden state $h_t$:

$$o_t = \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right), \qquad h_t = o_t \odot \tanh(C_t)$$

In the above equations, $x_t$ denotes the input vector at time step $t$, and $h_{t-1}$ is the hidden state at the previous time step. The notation $[h_{t-1}, x_t]$ represents the concatenation of the two vectors. The variables $f_t$, $i_t$ and $o_t$ are the activations of the forget, input and output gates, respectively; $C_t$ is the cell state at time step $t$, and $\tilde{C}_t$ is the candidate cell state. $W_f$, $W_i$, $W_C$ and $W_o$ are the corresponding weight matrices, and $b_f$, $b_i$, $b_C$ and $b_o$ are the corresponding bias vectors. $\sigma$ denotes the Sigmoid activation function, $\tanh$ denotes the hyperbolic tangent activation function and $\odot$ denotes element-wise multiplication.
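Equations (1)–(4) can be sketched in a few lines of NumPy. The toy dimensions (11 input features, 4 hidden units, a 20-step window) and the random weights are illustrative only, not the trained model from this study:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM time step. W maps [h_prev, x_t] to the four gate
    pre-activations stacked as (f, i, c_tilde, o)."""
    hx = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    z = W @ hx + b                              # stacked pre-activations
    H = h_prev.size
    f = sigmoid(z[0:H])                         # forget gate
    i = sigmoid(z[H:2*H])                       # input gate
    c_tilde = np.tanh(z[2*H:3*H])               # candidate cell state
    o = sigmoid(z[3*H:4*H])                     # output gate
    c_t = f * c_prev + i * c_tilde              # cell state update
    h_t = o * np.tanh(c_t)                      # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 11, 4                                    # 11 input features, 4 hidden units
W = rng.standard_normal((4 * H, H + D)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((20, D)):          # a 20-step input window
    h, c = lstm_step(x, h, c, W, b)
print(h.shape)                                  # (4,)
```

Because the output gate multiplies a Sigmoid activation by a tanh of the cell state, each component of the hidden state is always bounded in (−1, 1).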
By combining two LSTMs that process the sequence in opposite directions, a BiLSTM is formed, as shown in
Figure 2. In a BiLSTM, at each time step
t, the forward layer captures the dependency of historical information from the beginning of the sequence up to
$t$, while the backward layer captures the intra-sequence dependencies in the reverse direction (from later to earlier time steps) within the given input sequence. After the hidden states from both directions are concatenated, the combined state $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ integrates the forward and backward contextual information within the same historical window, thereby providing a richer temporal feature representation.
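In a deep learning framework such as PyTorch (used later in this study), the bidirectional structure and the doubled hidden dimension can be illustrated as follows; the layer sizes here are arbitrary examples, not the tuned configuration:

```python
import torch
import torch.nn as nn

# Toy dimensions for illustration only: 11 input features per ring,
# a 20-ring window, 4 hidden units per direction.
torch.manual_seed(0)
bilstm = nn.LSTM(input_size=11, hidden_size=4, bidirectional=True,
                 batch_first=True)
x = torch.randn(8, 20, 11)                 # (batch, time steps, features)
out, (h_n, c_n) = bilstm(x)

# Each time step carries forward and backward states concatenated: 2 * 4 = 8.
print(out.shape)                           # torch.Size([8, 20, 8])

# A regression head would typically read the last step's combined state:
head = nn.Linear(2 * 4, 1)                 # predicts one thrust value
y_hat = head(out[:, -1, :])
print(y_hat.shape)                         # torch.Size([8, 1])
```

The doubled last dimension of `out` is exactly the concatenation $[\overrightarrow{h}_t; \overleftarrow{h}_t]$ described above.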
2.2. Hyperparameters of BiLSTM
In BiLSTM networks, hyperparameter settings have a significant impact on the convergence speed, representational capacity and generalization performance of the model. Among them, the number of neurons, the learning rate and the dropout rate are three of the most sensitive hyperparameters affecting model performance. Proper hyperparameter configuration helps improve the accuracy of shield thrust prediction, whereas inappropriate values may lead to underfitting, overfitting or unstable training.
In a BiLSTM, the number of neurons usually refers to the dimensionality of the hidden state in each LSTM layer, i.e., the number of units in that layer. Each neuron can be regarded as a feature extractor, and the number of neurons determines the capacity of the model, namely, its ability to learn complex patterns and retain information. If there are too few neurons, the model may fail to capture complex relationships in the data; if there are too many, the model is prone to overfitting and incurs unnecessary computational cost.
The learning rate controls the step size of each parameter update and thus governs both the speed and stability of convergence. A learning rate that is too large may cause oscillation or even divergence of the training process, whereas a learning rate that is too small may result in slow convergence and increase the risk of getting trapped in local optima. An appropriately chosen learning rate helps the model approach a good solution efficiently while maintaining good generalization performance.
The dropout rate specifies the proportion of neurons that are randomly “dropped” during training. Dropout is an effective regularization technique whose main purpose is to prevent overfitting, i.e., the situation where the model fits the training data too well but performs poorly on unseen test data. When the dropout rate is too low, the regularization effect is weak and the model may still overfit. When the dropout rate is too high, the regularization effect is overly strong, which can destroy too much information and cause underfitting, making it difficult for the model to learn effective representations from the data.
In this study, the number of neurons, the learning rate and the dropout rate are selected as the hyperparameters to be optimized for the BiLSTM model. An intelligent optimization algorithm is employed to jointly optimize these hyperparameters in order to further improve the accuracy and stability of shield thrust prediction.
2.3. Intelligent Optimization Algorithm
To improve the prediction accuracy and generalization performance of the BiLSTM model and to overcome the limitations of manual, experience-based hyperparameter selection, this study employs intelligent optimization algorithms, namely, GA, HGS, PSO and SSA, to automatically tune three key hyperparameters of the BiLSTM model.
The fitness function is the key indicator for evaluating the quality of hyperparameter combinations. In this study, the root-mean-square error (RMSE) between the predicted and measured values is used as the fitness value of each individual [19,25,26]. Each candidate hyperparameter configuration is trained for 100 epochs on the training set. RMSE assigns a higher penalty to larger errors, which helps suppress peak deviations in safety-sensitive engineering parameter prediction tasks; moreover, it shares the same unit as thrust, facilitating engineering interpretation and comparisons across different models. A smaller RMSE indicates better predictive performance of the model under the corresponding hyperparameter set. Let $\theta_i^g$ denote the hyperparameter vector of the $i$-th individual in generation $g$; then the fitness function is defined as

$$\mathrm{fitness}\left(\theta_i^g\right) = \mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(y_k - \hat{y}_k\left(\theta_i^g\right)\right)^2}$$

where $N$ is the number of training samples, $y_k$ is the measured thrust of sample $k$ and $\hat{y}_k(\theta_i^g)$ is the predicted thrust for sample $k$ under the hyperparameter vector $\theta_i^g$.
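The fitness evaluation reduces to an RMSE computation; a minimal sketch, using hypothetical thrust values in kN:

```python
import numpy as np

def rmse_fitness(y_true, y_pred):
    """RMSE between measured and predicted thrust; lower is better."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical thrust values in kN, for illustration only.
measured = [32000.0, 33500.0, 31000.0]
predicted = [32500.0, 33000.0, 31500.0]
print(rmse_fitness(measured, predicted))   # 500.0
```

Because RMSE keeps the unit of the target (kN here), a fitness value of 500.0 is directly interpretable as a typical thrust error.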
2.3.1. Genetic Algorithm
GA is an intelligent optimization algorithm that mimics natural selection and genetic evolution. To be consistent with the other optimization algorithms used in this study, a real-coded GA is adopted, in which each candidate BiLSTM hyperparameter combination is directly encoded as a chromosome of real-valued genes. The evolution of GA is mainly driven by three basic operators: selection, crossover and mutation.
As shown in Figure 3, the selection operator constructs a breeding pool according to individual fitness; individuals with higher fitness are more likely to be retained and selected for reproduction, thereby simulating the natural selection principle of “survival of the fittest”. The crossover operator then selects pairs of parent individuals from the breeding pool and, with crossover probability $P_c$, exchanges segments of their genes to generate offspring, achieving information recombination and expanding the search space. In this study, $P_c$ is set to 0.95. Finally, the genes of the offspring individuals are randomly mutated with mutation probability $P_m$ to generate new gene values, maintain population diversity and reduce the risk of premature convergence to a local optimum.

In the real-coded representation, the mutation of the $j$-th gene of the $i$-th individual can be expressed as

$$x'_{ij} = \begin{cases} x_{ij} + \Delta, & \text{rand} < P_m \\ x_{ij}, & \text{rand} \geq P_m \end{cases}$$

where $x_{ij}$ and $x'_{ij}$ denote the gene values before and after mutation, respectively; $P_m$ is the mutation probability, which is set to 0.025 in this study; $\Delta$ is a random perturbation generated according to the specified distribution; and rand is a random number in the interval [0, 1]. By repeatedly applying selection, crossover and mutation operations over multiple generations, the GA population gradually evolves towards regions with lower fitness values in the hyperparameter search space, ultimately yielding an optimal hyperparameter configuration for the GA–BiLSTM shield thrust prediction model.
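A compact real-coded GA sketch with the stated operator probabilities ($P_c$ = 0.95, $P_m$ = 0.025) is given below. The smooth toy objective merely stands in for the expensive real fitness (training a BiLSTM and returning its RMSE), and its target point is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
# Search ranges from the paper: neurons [20, 100], learning rate
# [0.001, 0.05], dropout rate [0, 0.7].
LOW = np.array([20.0, 0.001, 0.0])
HIGH = np.array([100.0, 0.05, 0.7])

def toy_fitness(theta):
    # Hypothetical smooth surrogate for the BiLSTM-training RMSE.
    target = np.array([60.0, 0.01, 0.3])
    return float(np.sum(((theta - target) / (HIGH - LOW)) ** 2))

def evolve(pop, pc=0.95, pm=0.025, gens=60):
    for _ in range(gens):
        fit = np.array([toy_fitness(ind) for ind in pop])
        # Tournament selection: the lower (better) RMSE-like value wins.
        pairs = rng.integers(len(pop), size=(len(pop), 2))
        winners = np.where(fit[pairs[:, 0]] < fit[pairs[:, 1]],
                           pairs[:, 0], pairs[:, 1])
        pool = pop[winners]                       # fancy indexing copies
        # Arithmetic crossover on consecutive pairs with probability pc.
        for a in range(0, len(pool) - 1, 2):
            if rng.random() < pc:
                w = rng.random()
                child_a = w * pool[a] + (1 - w) * pool[a + 1]
                child_b = w * pool[a + 1] + (1 - w) * pool[a]
                pool[a], pool[a + 1] = child_a, child_b
        # Gaussian mutation of each gene with probability pm.
        mask = rng.random(pool.shape) < pm
        noise = rng.normal(0.0, 0.1, size=pool.shape) * (HIGH - LOW)
        pool = np.where(mask, pool + noise, pool)
        pop = np.clip(pool, LOW, HIGH)            # keep genes in range
    return min(pop, key=toy_fitness)

pop0 = rng.uniform(LOW, HIGH, size=(20, 3))
best = evolve(pop0)
print(best)
```

Each row of the population is one candidate (neurons, learning rate, dropout) triple; clipping after mutation keeps every gene inside its search range.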
2.3.2. Hunger Games Search Algorithm
HGS is an intelligent optimization algorithm inspired by animal foraging behavior. By modeling individuals’ hunger levels and their foraging strategies under different game modes, it seeks to balance global exploration and local exploitation in the search space. By introducing hunger-driven adaptive weights and stochastic perturbation terms, HGS can, to some extent, mitigate premature convergence and enhance the ability to escape local optima; however, its convergence performance is still influenced by factors such as population size, dataset characteristics and specific engineering scenarios.
As shown in Figure 4, during each iteration, HGS first ranks the population according to individual fitness and then computes the hunger level and corresponding weight for each individual. Hunger generally increases as fitness worsens: individuals with poorer fitness values have higher hunger and therefore obtain stronger search momentum during position updates. Denoting the global best position in generation $g$ by $X_b^g$ and the random perturbation term for individual $i$ by $R$, the position of individual $i$ in generation $g+1$ can be written as

$$X_i^{g+1} = W \cdot X_b^g + R \cdot W \cdot \left|X_b^g - X_i^g\right|$$

where $W$ is a weight coefficient determined by the hunger level and control parameters, and $R$ introduces randomness and diversity. By adopting different game modes, HGS enhances global exploration in the early stage of the search and gradually strengthens local exploitation in the later stage, thereby guiding the population to converge towards regions with better fitness.
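A hunger-weighted position update in the spirit of the equation above can be sketched as follows; the weight model is deliberately simplified for illustration and does not reproduce the full HGS algorithm:

```python
import numpy as np

rng = np.random.default_rng(7)

def hgs_step(pop, fitness, x_best):
    """Simplified hunger-driven position update (illustrative only)."""
    f = np.asarray(fitness, dtype=float)
    # Poorer (larger) fitness -> higher hunger -> stronger search momentum.
    hunger = (f - f.min()) / (f.max() - f.min() + 1e-12)
    W = 1.0 + hunger[:, None]                    # hunger-driven weight
    R = rng.uniform(-1.0, 1.0, size=pop.shape)   # random perturbation term
    return W * x_best + R * W * np.abs(x_best - pop)

pop = rng.uniform(0.0, 1.0, size=(6, 3))
fit = np.sum((pop - 0.5) ** 2, axis=1)           # toy objective
new_pop = hgs_step(pop, fit, pop[np.argmin(fit)])
print(new_pop.shape)                             # (6, 3)
```

Individuals far from the minimum receive a larger weight `W`, so their moves toward (and around) the current best position are longer, which is the hunger-driven momentum described above.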
2.3.3. Particle Swarm Optimization Algorithm
PSO is an intelligent optimization algorithm that simulates the foraging behavior of bird flocks. In PSO, each candidate hyperparameter combination is represented as a particle position vector $x_i$ in the search space, and the particles are iteratively updated according to their current velocity, their own best historical position and the global best position of the swarm so as to approach the optimal solution. As shown in Figure 5, in generation $g$, the updates of the velocity $v_i$ and position $x_i$ of particle $i$ can be expressed as

$$v_i^{g+1} = \omega v_i^g + c_1 r_1 \left(p_i - x_i^g\right) + c_2 r_2 \left(p_b - x_i^g\right), \qquad x_i^{g+1} = x_i^g + v_i^{g+1}$$

with the inertia weight decreasing linearly over the iterations:

$$\omega = \omega_{\max} - \left(\omega_{\max} - \omega_{\min}\right)\frac{g}{G_{\max}}$$

where $p_i$ is the personal best position found so far by particle $i$; $p_b$ is the global best position of the current swarm; $c_1$ and $c_2$ are the cognitive and social learning factors, respectively, both set to 2 in this study; $r_1$ and $r_2$ are random numbers uniformly distributed in [0, 1], used to increase the randomness and diversity of the search; $\omega$ is the inertia weight, which trades off global exploration and local exploitation; $\omega_{\max}$ and $\omega_{\min}$ denote the maximum and minimum inertia weights, set to 0.9 and 0.4 in this study, respectively; and $G_{\max}$ is the maximum number of iterations. By iteratively applying the above update rules, the particle swarm gradually converges toward regions with better fitness in the hyperparameter search space, yielding an improved hyperparameter configuration for the BiLSTM model.
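The update rules, with the settings used in this study ($c_1 = c_2 = 2$, inertia decayed from 0.9 to 0.4), can be sketched on a toy objective; the sphere function stands in for the real BiLSTM-training fitness:

```python
import numpy as np

rng = np.random.default_rng(3)

def sphere(x):
    # Stand-in objective; minimum 0 at the origin.
    return float(np.sum(x ** 2))

def pso(n=20, dim=3, g_max=100, w_max=0.9, w_min=0.4, c1=2.0, c2=2.0):
    x = rng.uniform(-5.0, 5.0, size=(n, dim))
    v = np.zeros_like(x)
    p_best = x.copy()                             # personal bests
    p_fit = np.array([sphere(p) for p in p_best])
    g_best = p_best[np.argmin(p_fit)].copy()      # global best
    for g in range(g_max):
        w = w_max - (w_max - w_min) * g / g_max   # linearly decaying inertia
        r1, r2 = rng.random((n, dim)), rng.random((n, dim))
        v = w * v + c1 * r1 * (p_best - x) + c2 * r2 * (g_best - x)
        x = x + v
        fit = np.array([sphere(p) for p in x])
        improved = fit < p_fit
        p_best[improved], p_fit[improved] = x[improved], fit[improved]
        g_best = p_best[np.argmin(p_fit)].copy()
    return g_best

best = pso()
print(sphere(best))
```

Tracking personal and global bests makes the best fitness found non-increasing over generations, so the swarm steadily approaches the minimum even while individual particles overshoot.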
2.3.4. Sparrow Search Algorithm
SSA is an intelligent optimization algorithm that mimics the foraging and anti-predation behaviors of sparrows. Individuals in the population are divided into producers and scroungers according to their functional roles, and a small proportion of them act as scouts to detect danger signals. Producers are mainly responsible for searching for food sources over a wide area; scroungers follow the producers to forage in the surrounding area; and scouts guide the population out of the current region when danger is detected, thereby reducing the risk of falling into local optima.
As shown in Figure 6, in generation $g$, the position vector of sparrow $i$ is denoted by $X_i^g$. Depending on the role of the individual and the current danger level, the update of $X_i^g$ can be written, for producers, as

$$X_i^{g+1} = \begin{cases} X_i^g \cdot \exp\left(\dfrac{-i}{\alpha \cdot G_{\max}}\right), & R_2 < ST \\[4pt] X_i^g + Q, & R_2 \geq ST \end{cases}$$

and, for scouts that perceive danger, as

$$X_i^{g+1} = X_b^g + \beta \cdot K \cdot \left|X_i^g - X_b^g\right|$$

where $X_b^g$ is the global best position at generation $g$; $G_{\max}$ is the maximum number of iterations; $\alpha$ is a control parameter; $R_2$ is a random number uniformly distributed on [0, 1] that characterizes the danger level; $ST$ is the safety threshold; $Q$ is a random matrix generated from a truncated normal distribution; $K$ is a random vector uniformly distributed on [−1, 1]; and $\beta$ is a constant that controls the step size. Through this mechanism, producers are responsible for guiding global exploration, scroungers perform local exploitation around high-quality individuals and scouts drive the population away from dangerous or stagnant regions, thus achieving a balance between exploration and exploitation.
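A simplified single-generation SSA-style update following the role split described above can be sketched as follows; the proportions of producers and scouts and the parameter values are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(11)

def ssa_step(pop, fitness, g_max, st=0.8, alpha=0.8, beta=0.5):
    """One simplified SSA generation: producers, scroungers, scouts."""
    n, dim = pop.shape
    order = np.argsort(fitness)                  # best individuals first
    pop = pop[order].copy()
    x_best = pop[0].copy()
    n_prod = max(1, n // 5)                      # e.g., 20% producers
    r2 = rng.random()                            # danger level
    for i in range(n_prod):                      # producer update
        if r2 < st:
            pop[i] = pop[i] * np.exp(-(i + 1) / (alpha * g_max))
        else:
            pop[i] = pop[i] + rng.normal(size=dim)
    for i in range(n_prod, n):                   # scroungers follow producers
        pop[i] = x_best + np.abs(pop[i] - x_best) * rng.uniform(-1, 1, dim)
    scouts = rng.choice(n, size=max(1, n // 10), replace=False)
    for i in scouts:                             # scouts move relative to the best
        pop[i] = x_best + beta * rng.uniform(-1, 1, dim) * np.abs(pop[i] - x_best)
    return pop

pop = rng.uniform(-5, 5, size=(10, 3))
fit = np.sum(pop ** 2, axis=1)                   # toy objective
new_pop = ssa_step(pop, fit, g_max=100)
print(new_pop.shape)                             # (10, 3)
```

Sorting by fitness before the update is what assigns the producer role to the currently best individuals, while the randomly chosen scouts inject large perturbations that help the population leave stagnant regions.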
2.4. Prediction Method Framework
Using measured tunnel geometry, geomechanical properties and shield tunnelling parameters from the Guanzhuang–Yongshun section of Beijing Metro Line 22 as input variables, the next-ring TH was selected as the prediction target. A total of 1500 ring-wise samples were collected. To preserve the integrity of time-series prediction, the training/testing split strictly followed the chronological order of ring numbers: the first 1200 sequences were used for training and the remaining 300 sequences were used for testing. No random shuffling was performed, to avoid disrupting temporal dependencies and artificially inflating performance. To mitigate the influence of differing units and magnitudes among physical variables on network training, feature normalization was carried out using the linear transformation in Equation (14), mapping the data to the interval [−1, 1]:

$$x' = 2 \cdot \frac{x - x_{\min}}{x_{\max} - x_{\min}} - 1 \tag{14}$$

In the formula, $x'$ is the normalized value, $x$ is the measured value, and $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the measured data. The normalization parameters were determined consistently over the full dataset to ensure scale consistency when comparing different algorithms. After verification, the minimum and maximum values of each feature in the test set fall within the range of the training set; therefore, the min/max values computed from the full dataset are identical to those computed using only the training set in Equation (14). Hence, no information leakage is introduced by the normalization procedure.
The time step was set to 20. That is, at any time t, the model uses only the parameters from the previous 20 excavated rings (t – 19, …, t) as inputs to predict the thrust of the next ring (t + 1). Therefore, during both training and testing, the model does not use any observations from future ring numbers. The backward branch of the BiLSTM performs feature extraction only within this historical window to enhance sequence representation.
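The normalization and sliding-window construction described above can be sketched as follows; the random arrays are placeholders for the real ring-wise records:

```python
import numpy as np

def scale_to_unit_interval(x, x_min, x_max):
    """Equation (14): linear map of x onto [-1, 1]."""
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

def make_windows(features, target, window=20):
    """Build (window, n_features) inputs from rings t-19..t and the
    next-ring thrust at t+1 as the label, using past data only."""
    X, y = [], []
    for t in range(window - 1, len(features) - 1):
        X.append(features[t - window + 1 : t + 1])
        y.append(target[t + 1])
    return np.array(X), np.array(y)

# Placeholder ring-wise data: 1500 rings, 11 features each.
rng = np.random.default_rng(0)
feats = rng.uniform(0.0, 1.0, size=(1500, 11))
thrust = rng.uniform(22e3, 49e3, size=1500)
feats_n = scale_to_unit_interval(feats, feats.min(axis=0), feats.max(axis=0))
X, y = make_windows(feats_n, thrust, window=20)
print(X.shape, y.shape)                       # (1480, 20, 11) (1480,)
```

A chronological split would then simply slice `X` and `y` by index (first 1200 sequences for training, the rest for testing), with no shuffling.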
All experiments were conducted on Ubuntu 22.04, using Python 3.12 and PyTorch 2.8.0. The computing resources included an NVIDIA RTX 5090 GPU and an Intel(R) Xeon(R) Gold 6459C CPU.
The number of neurons, learning rate and dropout rate were selected as the hyperparameters to be optimized, with search ranges set to the number of neurons [20, 100], learning rate [0.001, 0.05] and dropout rate [0, 0.7]. Four intelligent optimization algorithms—GA, PSO, HGS and SSA—were adopted, and three population sizes (10, 30 and 50) were considered for the search. The maximum number of iterations for GA, PSO, HGS and SSA was set to 100, and the stopping criterion was reaching the maximum iteration number. Since both BiLSTM training and intelligent optimization involve stochastic mechanisms such as random initialization, the fitness values from a single run may exhibit some variability. Therefore, each combination of algorithm and population size was executed independently 10 times. The root mean square error (RMSE) between predicted and measured values was used as the fitness function, and the best hyperparameter set from each run was recorded. Based on these results, four hybrid prediction models—GA-BiLSTM, HGS-BiLSTM, PSO-BiLSTM and SSA-BiLSTM—were constructed, as shown in Figure 7.
3. Database Construction
Using data from the shield tunnelling section of the Beijing Metro Line 22 project from Guanzhuang to Yongshun, this study selects 11 characteristic variables from three aspects—tunnel geometry, geomechanical conditions and shield tunnelling parameters—to construct a multi-source database for shield thrust prediction. Specifically, one tunnel geometry parameter is considered, namely, the tunnel burial depth; five geomechanical parameters are considered, namely, cohesion, equivalent internal friction angle, compression modulus, permeability coefficient and standard penetration blow count; and five shield tunnelling parameters are considered, namely, excavation speed, cutterhead torque, soil chamber pressure, synchronous grouting pressure and synchronous grouting volume.
The burial depth of each ring of the shield tunnel is obtained from the geological investigation report, while the five shield tunnelling parameters are obtained from on-site construction records. To account for the influence of geomechanical conditions on model predictions, the geomechanical parameters of all strata from the ground surface to the tunnel crown within the excavation section of each ring are computed by weighted averaging, with the weight of each stratum given by its thickness:

$$\bar{x} = \frac{1}{H}\sum_{i=1}^{n} h_i x_i$$

In the formula, $x_i$ is the geomechanical parameter of the $i$-th soil layer; $n$ is the number of strata; $h_i$ is the thickness of the $i$-th soil layer, obtained from the burial depths of the layer boundaries; $\bar{x}$ is the thickness-weighted parameter value; and $H$ is the tunnel burial depth.
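The thickness-weighted averaging can be sketched in a few lines, with hypothetical layer thicknesses and cohesion values:

```python
# Thickness-weighted averaging of a geomechanical parameter over the
# strata above the tunnel crown; layer values are hypothetical.
def weighted_parameter(thicknesses, values):
    H = sum(thicknesses)                      # total depth to the crown
    return sum(h * x for h, x in zip(thicknesses, values)) / H

# Three layers totalling 25 m, cohesion in kPa (illustrative values).
h = [5.0, 10.0, 10.0]
c = [8.0, 12.0, 15.0]
print(weighted_parameter(h, c))               # 12.4
```

Each ring thus receives one representative value per geomechanical parameter, weighted toward the thicker strata it passes through.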
To ensure the reliability and temporal consistency of the modelling data, the raw ring-wise records were preprocessed before model development. First, the construction logs were aligned by ring number to ensure a one-to-one correspondence among geometric parameters, geological parameters and tunnelling parameters for the same ring. For occasional missing entries, adjacent-ring interpolation was used when the missing ratio was small, whereas records that could not be reliably recovered were removed. Second, considering potential abnormal records during shield tunnelling—such as abrupt changes caused by shutdowns, operating-mode switching or sensor fluctuations—statistical screening was performed using the 3σ rule and further checked against engineering-reasonable ranges. Isolated outliers were replaced with smoothed values to maintain sequence continuity, while consecutive abnormal segments were discarded to avoid biasing the temporal learning process. After these procedures, continuous and valid samples were obtained for model training and testing.
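The 3σ screening and the replacement of isolated outliers can be sketched as follows; the thrust series and the spike position are fabricated for illustration, and the neighbour-mean replacement is a simplified stand-in for the smoothing used in practice:

```python
import numpy as np

def three_sigma_screen(series):
    """Flag samples outside mean +/- 3*std (the 3-sigma rule)."""
    s = np.asarray(series, dtype=float)
    mu, sigma = s.mean(), s.std()
    return np.abs(s - mu) > 3.0 * sigma

def replace_isolated_outliers(series, mask):
    """Replace flagged isolated points with the mean of their valid
    neighbours to keep the ring sequence continuous (simplified)."""
    s = np.asarray(series, dtype=float).copy()
    for i in np.flatnonzero(mask):
        lo, hi = max(i - 1, 0), min(i + 1, len(s) - 1)
        neighbours = [s[j] for j in (lo, hi) if j != i and not mask[j]]
        if neighbours:
            s[i] = sum(neighbours) / len(neighbours)
    return s

# Fabricated thrust series (unit arbitrary) with one sensor spike.
thrust = np.array([32.0, 33.0, 31.5, 32.5, 33.5, 31.0] * 10)
thrust[17] = 90.0                             # injected spike
mask = three_sigma_screen(thrust)
cleaned = replace_isolated_outliers(thrust, mask)
print(int(mask.sum()), cleaned[17])           # 1 32.75
```

Only the injected spike exceeds the 3σ band; it is replaced by the mean of its two valid neighbours, while every other ring is left untouched.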
After the above processing, 1500 samples were obtained. Each sample contains 12 variables, including 11 input variables and 1 output variable. The maximum, minimum, mean and median values of each parameter are summarized in Table 1. For the ground parameters, the burial depth ranges from 17.7 to 33.0 m, with a mean of 25.3 m and a median of 24.7 m; cohesion ranges from 4.8 to 21.5 kPa, with a mean of 12.7 kPa and a median of 12.0 kPa; and the internal friction angle ranges from 14.6° to 24.9°, with a mean of 20.2° and a median of 20.7°.
For the shield tunnelling parameters, the excavation speed ranges from 40 to 97 mm/min, with a mean of 72.7 mm/min and a median of 73 mm/min. The synchronous grouting volume ranges from 10.0 × 10³ to 16.5 × 10³ L, with a mean of 12.1 × 10³ L and a median of 12.1 × 10³ L. The output variable, total shield thrust TH, ranges from 22.0 × 10³ to 49.0 × 10³ kN, with a mean of 32.9 × 10³ kN and a median of 32.0 × 10³ kN.
The distributions of the 11 input variables and the 1 output variable are shown in
Figure 8. It can be observed that each parameter generally exhibits a unimodal distribution: samples are mainly concentrated around their respective mean values, with a small number of samples located in the tails on both sides. This indicates that the data not only cover the typical operating conditions of the project site but also include a certain proportion of extreme conditions, providing a rich sample basis for subsequent training and generalization of the BiLSTM model.
To avoid qualitative judgments based solely on the distribution shape, the tail samples were further quantified using the quantiles of the ground-truth thrust in the test set. Specifically, 22 samples fall in the lower tail (TH ≤ P5) and 21 samples fall in the upper tail (TH ≥ P95), accounting for 7.3% and 7.0% of the test set, respectively. This indicates that the test set covers not only typical operating conditions but also a certain number of extreme high/low thrust cases.
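The tail quantification above can be reproduced directly from the ground-truth thrust of the test set. In the sketch below, `th_test` is a synthetic stand-in for the actual test-set TH values (the real counts of 22 and 21 depend on the project data, which are not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the measured test-set thrust TH (kN)
th_test = rng.normal(32.9e3, 4.0e3, size=300)

# P5 / P95 quantiles of the ground-truth thrust
p5, p95 = np.percentile(th_test, [5, 95])
lower_tail = th_test <= p5   # extreme low-thrust samples
upper_tail = th_test >= p95  # extreme high-thrust samples

print(f"lower tail: {lower_tail.sum()} samples ({lower_tail.mean():.1%})")
print(f"upper tail: {upper_tail.sum()} samples ({upper_tail.mean():.1%})")
```

Defining the tails on the ground-truth quantiles, rather than on predicted values, keeps the subsample selection independent of any particular model.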
4. Analysis of Shield Thrust Prediction Results
The shield thrust prediction performance of the four hybrid models (GA–BiLSTM, HGS–BiLSTM, PSO–BiLSTM and SSA–BiLSTM) is comprehensively analyzed in terms of the influence of population size on convergence characteristics and computational efficiency, and in terms of prediction accuracy under the evaluation indices RMSE, MAE, MAPE and R².
4.1. Sensitivity Analysis of Computational Efficiency with Respect to Population Size
To analyze the impact of population size on optimization performance and computational efficiency, three population sizes, Pop = 10, 30 and 50, are selected for comparative experiments using the four algorithms GA, HGS, PSO and SSA.
Figure 9 presents the average number of iterations to convergence and the computation time under different population sizes; the corresponding numerical values are listed in Table 2.
In terms of sensitivity to population size, the average number of iterations of GA and PSO varies relatively smoothly as Pop changes, whereas HGS and SSA are more sensitive to population size. When Pop = 10, the numbers of iterations for HGS and SSA are 44.6 and 42.6, respectively, which are significantly lower than those for Pop = 30 and 50. However, their average fitness values are 1961.73 kN and 1963.22 kN, both higher than 1944.10 kN and 1933.98 kN for Pop = 30 and 1941.07 kN and 1934.22 kN for Pop = 50. This indicates that, although convergence is faster with a small population, the quality of the obtained solutions is poorer and the algorithms tend to suffer from pronounced premature convergence. As the population size increases to 30, the average fitness of HGS and SSA decreases markedly and stabilizes around 1944.10 kN and 1933.98 kN, respectively, thereby avoiding premature convergence while ensuring a more thorough search.
In terms of computational time, the runtime of all four algorithms increases significantly as the population size grows. When Pop increases from 10 to 50, the computation times of GA, HGS, PSO and SSA rise from 9.02, 6.10, 9.22 and 11.72 min to 58.86, 38.00, 43.72 and 67.88 min, respectively, i.e., approximately five to six times the runtime at Pop = 10.
Considering both optimization performance and computational efficiency, the average fitness differences among the four algorithms are small for both Pop = 30 and Pop = 50, whereas Pop = 30 significantly reduces computation time compared with Pop = 50. Therefore, in the subsequent performance comparison experiments in this study, Pop = 30 is uniformly adopted as the benchmark population size, which ensures reliable convergence of the algorithms while also satisfying the time requirements of practical engineering applications.
Table 3 reports the test RMSE and corresponding rankings of the optimal configurations obtained by each algorithm under Pop = 10/30/50. The relative ranking of the four hybrid models remains consistent across population sizes, indicating that the comparative conclusions are robust with respect to Pop. In terms of accuracy, the overall difference between Pop = 10 and Pop = 30 is small, whereas Pop = 50 further reduces the error for some algorithms. However, when the computational time and convergence efficiency shown in Figure 9 and Table 2 are jointly considered, the time cost of Pop = 50 increases substantially. Meanwhile, although the small population size Pop = 10 can achieve comparable accuracy, it is more sensitive to random initialization and carries a higher risk of premature convergence, making it unsuitable as a unified benchmark for fair comparison. Therefore, Pop = 30 is adopted as the uniform setting in subsequent analyses to strike a balance among accuracy, efficiency and stability.
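The roughly linear growth of runtime with Pop follows from the evaluation budget: at a fixed number of iterations, each individual incurs one "train BiLSTM, score fitness" call per iteration, so the total number of fitness evaluations is approximately Pop × iterations. The sketch below illustrates this with a generic keep-if-better population search on a sphere function; it is a deliberately simplified stand-in, not the authors' GA/HGS/PSO/SSA implementations, and the cheap `sphere` fitness stands in for the far more expensive BiLSTM training-and-scoring step.

```python
import random

def population_search(fitness, dim, pop, iters, seed=0):
    """Generic population-based search (illustrative stand-in for GA/HGS/PSO/SSA):
    each iteration perturbs every individual and keeps improvements, so the
    evaluation budget is pop * iters (plus pop initial evaluations)."""
    rnd = random.Random(seed)
    swarm = [[rnd.uniform(-5, 5) for _ in range(dim)] for _ in range(pop)]
    scores = [fitness(x) for x in swarm]
    evals = len(swarm)
    for _ in range(iters):
        for k, x in enumerate(swarm):
            cand = [xi + rnd.gauss(0, 0.3) for xi in x]  # perturb individual k
            s = fitness(cand)
            evals += 1
            if s < scores[k]:  # keep the candidate only if it improves
                swarm[k], scores[k] = cand, s
    return min(scores), evals

# Cheap stand-in fitness; in the real workflow this would be the test RMSE
# of a BiLSTM trained with the candidate hyperparameters.
sphere = lambda x: sum(xi * xi for xi in x)

best10, ev10 = population_search(sphere, dim=3, pop=10, iters=100)
best50, ev50 = population_search(sphere, dim=3, pop=50, iters=100)
```

Here `ev50 / ev10` = 5, consistent with the roughly five- to six-fold runtime growth observed when Pop increases from 10 to 50 at a comparable iteration budget.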
4.2. Comparison of Thrust Prediction Performance
To systematically evaluate the predictive performance of the four hybrid models, four metrics are computed on the same test set: root-mean-square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and the coefficient of determination (R²). The results are summarized in Table 4.
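The four metrics follow their standard definitions and can be computed as below; `y` and `yhat` are placeholders for the measured and predicted thrust on the test set.

```python
import numpy as np

def regression_metrics(y, yhat):
    """RMSE, MAE, MAPE (%) and R^2 for measured y and predicted yhat."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    err = y - yhat
    rmse = np.sqrt(np.mean(err ** 2))                       # root-mean-square error
    mae = np.mean(np.abs(err))                              # mean absolute error
    mape = 100.0 * np.mean(np.abs(err / y))                 # mean absolute % error
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y - y.mean()) ** 2)  # coefficient of determination
    return rmse, mae, mape, r2
```

Note that MAPE assumes the measured thrust is strictly positive, which holds here since TH is on the order of 10⁴ kN.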
As shown in Table 4, the RMSE values of the four models are approximately 1330–1400 kN, the MAE values lie between 1030 kN and 1110 kN, all MAPE values are below 3.5% and all R² values are around 0.70. This indicates that, after the introduction of intelligent optimization algorithms, the BiLSTM-based models as a whole can adequately capture the nonlinear relationship between shield thrust and its influencing factors. Nevertheless, there are noticeable differences among the optimization algorithms. Among the four models, HGS–BiLSTM achieves the best performance on all four metrics, with RMSE = 1332 kN, MAE = 1032 kN, MAPE = 3.17% and R² = 0.726, outperforming the other three models in both error level and goodness of fit. The errors of PSO–BiLSTM are slightly higher than those of HGS–BiLSTM but lower than those of GA–BiLSTM. SSA–BiLSTM performs the worst among the four models, with RMSE = 1400 kN, MAE = 1106 kN, MAPE = 3.38% and R² = 0.697, indicating relatively weak overall predictive performance.
Meanwhile, to examine the reliability of the models under extreme thrust conditions, error metrics were computed separately on the tail subsamples TH ≤ P5 and TH ≥ P95, as reported in Table 5. The errors on these extreme-thrust subsamples are generally higher than the overall statistics in Table 4, indicating that prediction becomes more challenging in tail conditions. For TH ≥ P95, the model errors diverge markedly: PSO–BiLSTM achieves the lowest RMSE (931.98 kN) and MAE (639.84 kN) in this range, followed by HGS–BiLSTM. In contrast, for TH ≤ P5, all models exhibit relatively large errors, and PSO–BiLSTM shows the highest error, suggesting comparatively weaker robustness under low-thrust conditions. Overall, the tail-subsample analysis confirms performance differences among the models under extreme operating conditions and provides quantitative evidence to support model selection in risk-critical engineering scenarios.
To further elucidate the error distribution patterns across different thrust levels, Figure 10 presents the variation of MAE after binning the test set into quantile-based intervals. The error exhibits a "larger at both ends and smaller in the middle" trend: MAE increases noticeably in the low- and high-thrust ranges while remaining relatively low across the mid-range (P25–P95). This indicates that the models provide more stable predictions under typical operating conditions but are more prone to larger deviations under extreme thrust conditions; engineering applicability should therefore be assessed in conjunction with tail-focused metrics.
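The quantile binning behind this analysis can be sketched as follows. The bin edges chosen here (P0/P5/P25/P75/P95/P100 of the ground-truth thrust) are one plausible choice consistent with the tail definitions above, not necessarily the exact edges used for Figure 10:

```python
import numpy as np

def mae_by_quantile_bin(y, yhat, probs=(0, 5, 25, 75, 95, 100)):
    """MAE of predictions within quantile bins of the ground truth y."""
    y, yhat = np.asarray(y, dtype=float), np.asarray(yhat, dtype=float)
    edges = np.percentile(y, probs)  # bin edges from ground-truth quantiles
    maes = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (y >= lo) & (y <= hi)    # samples falling in this thrust range
        maes.append(np.mean(np.abs(y[m] - yhat[m])))
    return maes
```

Binning on ground-truth quantiles guarantees comparable sample counts per bin, so a rise in tail-bin MAE reflects genuinely harder conditions rather than small-sample noise.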
For a more intuitive comparison of the predictive performance of each model, Figure 11 presents the fitted curves and error distributions of the four hybrid models on the test set. The prediction curve of HGS–BiLSTM matches the measured values more closely and exhibits smaller residual fluctuations, whereas SSA–BiLSTM shows noticeable overestimation or underestimation in some peak segments. As indicated by Table 4 and Figure 11, HGS–BiLSTM achieves the best overall performance on the full-sample metrics. Meanwhile, Table 5 and Figure 10 show that the relative advantages of the models vary in the extreme tail regions. Therefore, HGS–BiLSTM is selected as the overall preferred model, and it is recommended that tail-focused error metrics be considered under extreme operating conditions to support decision-making.
5. Conclusions
Taking the shield tunnelling section of the Beijing Metro Line 22 project from Guanzhuang to Yongshun as the engineering background, this study develops four hybrid models, namely GA–BiLSTM, HGS–BiLSTM, PSO–BiLSTM and SSA–BiLSTM, for shield thrust prediction and systematically compares their performance. The main conclusions are as follows:
(1) The BiLSTM sequence model constructed using tunnel geometric parameters, geomechanical parameters and shield tunnelling parameters can effectively capture the nonlinearity and temporal dependence of shield thrust. After incorporating the intelligent optimization algorithms GA, HGS, PSO and SSA, the four hybrid models all achieved accurate one-step-ahead prediction on the Guanzhuang–Yongshun dataset of Beijing Metro Line 22, with test-set MAPE values below 3.5%. Although the errors increase under extreme tail operating conditions, they remain generally acceptable, indicating that the models exhibit a certain degree of reliability in extreme thrust ranges. Overall, the proposed framework demonstrates effectiveness and engineering applicability for the studied section under the adopted time-series split and modeling settings; however, its transferability still needs further validation on additional sections and projects.
(2) Population size has a significant impact on algorithm convergence and computational efficiency. For the dataset used in this study, HGS and SSA with a small population size Pop = 10 are prone to premature convergence, whereas an excessively large population Pop = 50 leads to a multi-fold increase in computational time. By jointly considering the average fitness, iteration number and computational time, Pop = 30 can effectively mitigate premature convergence while keeping the computational cost under control and thus can be regarded as a reasonable population-size setting for similar engineering applications.
(3) For the dataset used in this study, under the condition of uniformly adopting Pop = 30, HGS–BiLSTM exhibits the best overall performance among the four hybrid models, with the smallest RMSE, MAE and MAPE and the largest R². It can therefore be recommended as a preferred model for real-time prediction of shield thrust and construction decision support under complex geological conditions.
It should be noted that the conclusions of this study are based on a single project section and a fixed time-series split; further validation of the model’s generalizability will be conducted using data from multiple sections or multiple projects.