Article

Ground-Type Classification from Earth-Pressure-Balance Shield Operational Data with Uncertainty Quantification

1
School of Resources and Safety Engineering, Central South University, Changsha 410083, China
2
Institute of Innovation, Science and Sustainability, Federation University Australia, Ballarat, VIC 3350, Australia
*
Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(24), 13234; https://doi.org/10.3390/app152413234
Submission received: 20 November 2025 / Revised: 12 December 2025 / Accepted: 14 December 2025 / Published: 17 December 2025
(This article belongs to the Special Issue Latest Advances in Rock Mechanics and Geotechnical Engineering)

Abstract

In urban underground space construction using shield tunnelling, the geological conditions ahead of the tunnel face are often uncertain. Without timely and accurate classification of the ground type, mismatches in operational parameters, uncontrolled costs, and schedule risks are likely to occur. Using observations from an earth pressure balance (EPB) project on an urban railway, a data-driven classification framework is developed that integrates shield tunnelling operating measurements with physically derived quantities to discriminate among soft soil, hard rock, and mixed strata. Principal component analysis (PCA) is performed on the training set, followed by a systematic comparison of tree-based classifiers and hyperparameter optimization strategies to explore the attainable performance. Under unified evaluation criteria, a categorical boosting (CatBoost) model optimized by a Nevergrad combination strategy (NGOpt) attains the highest test accuracy of 0.9625, with macro-averaged precision and macro-averaged recall of 0.9715 and 0.9716, respectively. To mitigate optimism from single-point estimates, stratified bootstrap intervals are reported for the test set. A Monte Carlo experiment applies independent perturbations to the PCA-transformed features, producing low label-flip rates across the three classes, with only minor changes in probability calibration metrics, which suggests consistent decisions under sensor noise and sampling bias. Overall, within the scope of the considered EPB project, the study delivers a compact workflow that demonstrates the feasibility of uncertainty-aware ground-type classification and provides a methodological reference for developing decision-support tools in underground tunnel construction.

1. Introduction

With the continuing expansion of urban underground space, shield tunnelling has become a routine choice for constructing metro lines, utility tunnels, and underground complexes in densely built cities [1,2]. Compared with cut-and-cover or the bench method, shield construction features a sealed face and synchronous support, i.e., excavation, soil and rock removal, and segmental lining installation proceed as a continuous process that effectively constrains construction-induced disturbance [3,4]. Among shield types, earth pressure balance (EPB) machines maintain face stability through soil conditioning and active chamber pressure control and therefore adapt well to soft and mixed ground [5,6,7]. This widespread use in metro and municipal tunnels places higher demands on continuous and reliable geological information. The ground ahead of the face is heterogeneous, and changes in the material in contact with the cutting tools alter penetration resistance and mucking behaviour. If such changes in ground type are not identified in a timely and accurate manner, the setpoints selected for mechanical control parameters may become mismatched with the actual ground conditions. In practice, this mismatch can trigger abnormal torque and thrust fluctuations, face instability and inefficient muck transport, which in turn accumulate into higher energy consumption and more frequent maintenance interventions.
In EPB drives for urban railways, such situations are by no means rare. For example, when transitioning from soft soil to a mixed or hard-rock face, operators typically need to increase cutterhead torque/thrust and adjust chamber pressure within a narrow temporal window. If these adjustments lag behind the actual change in ground conditions, the machine may experience unstable advance rates, which accumulate into cost overruns. Conversely, when a section that is classified as stiff soil locally contains weak pockets, overly aggressive parameters may trigger excessive face deformation, leading to remedial grouting. These examples illustrate that ground-type classification is not merely an academic exercise; it directly governs the matching between ground conditions and operational parameters, and thereby plays a central role in meeting construction schedules in complex urban tunnelling projects. In practice, the tunnel face often shifts among three operating regimes, i.e., pure soil, hard rock, and mixed soil and rock, and the cutterhead configuration and wear mechanisms change accordingly [8,9]. In predominantly soft soils, scrapers or ripper teeth are commonly used, and wear is governed mainly by particle size distribution and mineralogy; tests with quartz sand and soil abrasion devices show that increasing grain size and quartz content markedly accelerate scraper wear [10]. In particular, higher equivalent quartz content and unfavourable moisture conditions can amplify abrasive action by enhancing grain–metal interaction [11]. In rock faces, disc cutters are standard, and cutter ring life has been shown to correlate with rock strength and abrasivity indices, such as uniaxial compressive strength and Cerchar Abrasivity Index [12]. In mixed faces, intermittent contact and uneven support promote impact chipping and abnormal wear of disc cutters, so combined tooling and a narrower operating window are required to protect the cutters and control consumption [9].
Existing approaches often struggle to provide continuous geological information at the scale and frequency required by EPB operations [13,14,15,16]. Boreholes and in situ tests establish the overall geological framework, but their sparse spatial coverage and the delay in interpreting their results limit their value for rapidly updating mechanical operating parameters during excavation [13]. Geophysical surveys and probing methods can supplement the picture over a certain range, yet inversion non-uniqueness and site disturbance reduce their reliability as the sole basis for high-frequency decisions [15]. Construction monitoring, such as settlement or chamber pressure, is essential for safety management but mainly reflects external responses rather than changes at the tunnel face [17]. At the same time, the shield system exhibits strong coupling: thrust, torque, advance rate, and cutterhead rotation speed co-vary with control strategies and operating regimes, complicating inversion of ground properties [18,19,20]. Penetration per revolution is governed jointly by the normal load on the cutters and the ground strength, and the advance rate equals the product of penetration per revolution and the cutterhead rotation speed [18,21]. Axial work together with rotational work determines the specific energy per unit excavated volume; as ground strength or abrasivity increases, thrust and torque typically rise in concert, and the specific energy increases accordingly [18,19,20,21]. The practical gap is clear: site teams require continuous judgement, whereas most available information is discrete and delayed. Achieving accurate and efficient assessment without a substantial increase in investigation workload has therefore become a central challenge.
Against this backdrop, data-driven methods offer a pragmatic complement to conventional investigation. The aim is not to replace investigation, but to make systematic use of high-frequency operational data already collected by the machine in order to learn the nonlinear relationships between measurable operating parameters and geological conditions [22,23,24]. In practice, feature sets built around physically meaningful quantities improve alignment with excavation mechanics: penetration rate (PR), field penetration index (FPI = thrust divided by PR), torque penetration index (TPI = torque divided by PR), and specific energy (SE) are widely used derived variables with clear physical interpretation [21,25,26,27]. This also provides an important theoretical background for data-driven research.
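The derived indices named above can be illustrated with a minimal sketch. It uses only the relations stated in the text (PR as advance rate divided by rotation speed, FPI as thrust divided by PR, TPI as torque divided by PR). The `RingRecord` class, its field names, the units and the sample values are hypothetical, and SE is left out because its formula is not given in this section.

```python
from dataclasses import dataclass


@dataclass
class RingRecord:
    """One ring of shield data; field names and units are illustrative assumptions."""
    advance_rate: float    # AR, e.g. mm/min
    rotation_speed: float  # CRS, e.g. rev/min
    thrust: float          # mean thrust, any consistent force unit
    torque: float          # mean torque, any consistent torque unit


def derived_indices(ring: RingRecord) -> dict:
    """Return PR, FPI and TPI for one ring."""
    if ring.rotation_speed <= 0 or ring.advance_rate <= 0:
        raise ValueError("advance rate and rotation speed must be positive")
    pr = ring.advance_rate / ring.rotation_speed  # penetration per revolution
    return {"PR": pr, "FPI": ring.thrust / pr, "TPI": ring.torque / pr}


# Hypothetical ring, chosen only to make the arithmetic easy to follow.
ring = RingRecord(advance_rate=24.0, rotation_speed=1.5, thrust=4800.0, torque=640.0)
indices = derived_indices(ring)
print(indices)
```

Because FPI and TPI share PR in the denominator, a drop in penetration raises both at once, which is one reason they tend to move together.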
A growing body of studies has used tunnel-boring machine operating data to infer ground properties or to support look-ahead prediction. Building on large-scale tunnel boring machine (TBM) operation datasets, Zhang et al. [28] proposed a data-driven framework that engineers features from machine logs and compares supervised learners for predicting lithological categories ahead of the face, demonstrating the feasibility of continuous classification from operational signals. Zhao et al. [29] used torque, thrust, advance rate and fuel-tank temperature as inputs to an artificial neural network model for rapid analysis of TBM operations and geological-type prediction. Using multi-site TBM records, Jung et al. [30] trained an ANN to separate relatively hard, mixed and relatively soft strata, showing that simple feed-forward models already carry useful discriminative power for ground-type labelling. For a Chinese water-conveyance tunnel, Liu et al. [31] derived disc normal and rolling forces from cutterhead thrust and torque and used cutter-force features within an AdaBoost–CART classifier to improve ground-type recognition. Time-sequence modelling has also been explored: Liu et al. [32] employed long short-term memory networks to learn from monitoring sequences and directly predict rock-mass categories or indices, improving look-ahead stability when temporal dependence is strong. With partially labelled multi-project data, Yu et al. [33] developed a semi-supervised pipeline based on stacked sparse autoencoders and deep networks to map TBM parameters to binary geological attributes. At the ensemble level, Hou et al. [34] designed a stacked classifier for real-time rock-mass classification from machine data, addressing class imbalance in both training and evaluation. Yan et al. [35] advanced a prediction paradigm centred on the field penetration index (FPI) and torque penetration index (TPI), trained with stacking, grid search and k-fold cross-validation; they also released a companion dataset that has become a reproducible baseline for subsequent method development [35,36]. Under more volatile operating conditions, Fu et al. [37] proposed a hybrid learning framework for real-time look-ahead prediction that stresses robustness to complex ground and operational variability. Beyond pure classification, Pan et al. [38] used time-series clustering on ring-by-ring logs to relate TBM sequences to lithology with good robustness on a Singapore project. Extending from type to quality, Katuwal et al. [39] linked TBM cycle data to rock-mass rating and reported a practical workflow for rock-mass classification in challenging Himalayan geology.
A further practical complication, which has been emphasized in recent studies, is the pronounced spatial variability within what is nominally classified as the same soil or rock type [40]. Even over short distances, local weak/strong pockets can alter face stability and induce marked changes in EPB performance and operating parameters. Therefore, the ground encountered by the shield is often more heterogeneous than the discrete strata shown in design profiles. For example, random large-deformation analyses of face stability in spatially variable soils have shown that intra-layer strength variations can govern the onset and evolution of face deformation even when the average properties remain favourable [40]. Notably, this work does not attempt to resolve such fine-scale variability explicitly. The ground is instead represented at the ring scale through three discrete ground-type labels, and intra-class heterogeneity is treated implicitly via the dispersion of mechanically motivated features and the associated uncertainty indicators. This ring-wise abstraction is considered appropriate for the available monitoring resolution, while more explicit coupling with spatially variable strength fields and face-stability simulations is identified as an important direction for future extension of the framework.
Although the above studies have demonstrated that EPB operational data carry rich information on ground conditions, several gaps remain for practical, uncertainty-aware decision support. Many existing frameworks feed largely raw sensor readings into generic classifiers, which can obscure the underlying excavation mechanics and suffer from multicollinearity in high dimensions. Comparisons between algorithms are often limited to one or two model families without a systematic examination of how different machine learning prediction paradigms and optimizer choices interact under the same EPB context. In addition, robustness and uncertainty are commonly summarized by single-point metrics, with little insight into how stable the predictions remain under plausible noise, label revisions, or threshold changes.
This study develops a workflow that addresses these limitations and is explicitly tailored to EPB shield tunnelling. On the feature side, ring-wise monitoring parameters are combined with mechanically motivated derived indices, followed by standardization and principal component analysis (PCA), so that the final inputs form an orthogonal and physically interpretable space in which each component is linked to penetration resistance, chamber pressure control and energy consumption. To ensure modelling quality, several representative tree-based classifiers are evaluated for a specific EPB project, and CatBoost is further combined with three distinct hyperparameter optimization strategies (INFO, NGOpt and Optuna). This design makes it possible to characterize the attainable performance and robustness of different model-optimiser combinations, rather than reporting isolated accuracy gains for a single algorithm. Reliable model evaluation is taken as a prerequisite for credible predictions. To this end, stratified bootstrap intervals, Monte Carlo perturbation analysis and probability calibration metrics are integrated into a coherent assessment framework that quantifies discriminative power and the reliability of probabilistic outputs. Based on a ring-wise dataset from a representative EPB drive, the case study illustrates a workflow that can, in principle, be reused for cross-project benchmarking when comparable monitoring data are available. In practical terms, the framework is designed to provide high-quality ground-type classification together with transparent uncertainty information, thereby supporting risk-aware adjustment of EPB operating parameters, reducing trial-and-error in parameter tuning and helping to shorten the time required to reach stable and safe operating regimes in complex urban ground.

2. Modelling Background and Methodology

2.1. Categorical Boosting (CatBoost)

CatBoost is a gradient boosting model that uses oblivious decision trees as base learners, as shown in Figure 1 [41]. The task here is three-class classification with a training set $\{(x_i, y_i)\}_{i=1}^{n}$, where $y_i \in \{1, 2, 3\}$. The predictor is written as an additive model, and at iteration $t$ a new oblivious tree $g_t(x)$ is added to the current ensemble to form:
$$F_t(x) = F_{t-1}(x) + \eta\, g_t(x), \qquad \eta \in (0, 1] \tag{1}$$
Class probabilities are obtained by the softmax mapping:
$$p_c(x) = \frac{\exp\big(F_c(x)\big)}{\sum_{k=1}^{3} \exp\big(F_k(x)\big)}, \qquad c = 1, 2, 3 \tag{2}$$
and training minimizes the cross-entropy:
$$L = -\sum_{i=1}^{n} \log p_{y_i}(x_i) \tag{3}$$
Samples are routed to leaf sets $\{L_l\}$. For each leaf $l$ and class $c$, CatBoost applies a second-order (Newton) leaf value update, as presented in Equation (4):
$$v_{lc} = -\frac{\sum_{i \in L_l} g_{ic}}{\sum_{i \in L_l} h_{ic} + \lambda} \tag{4}$$
where the first-order term is $g_{ic} = p_c(x_i) - \mathbb{1}(y_i = c)$, the second-order term is $h_{ic} = p_c(x_i)\big(1 - p_c(x_i)\big)$, and $\lambda > 0$ is the L2 regularizer on leaf values. An oblivious tree uses the same feature–threshold split at each depth level for all samples, so the number of leaves is determined solely by depth, and the inference cost is stable. During training, CatBoost combines shrinkage, L2 regularization and subsampling to mitigate overfitting, and employs ordered statistics to reduce target leakage and ordering bias. In sum, the combination of softmax cross-entropy, second-order boosting and the oblivious tree structure yields a multiclass classifier that balances prediction accuracy and modelling stability.
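As a worked illustration of the softmax, cross-entropy and Newton leaf update described above, the following sketch evaluates them for a single leaf. It is not CatBoost's implementation: the function names are hypothetical, class labels are 0-based for indexing convenience, the L2 value of 1.0 is arbitrary, and the Newton step is written with the usual negative sign so that a positive leaf value raises the score of an under-predicted class.

```python
import math


def softmax(scores):
    """Map raw class scores F_c(x) to probabilities p_c(x)."""
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(prob_rows, labels):
    """Sum of -log p_{y_i}(x_i); labels are 0-based class indices."""
    return -sum(math.log(p[y]) for p, y in zip(prob_rows, labels))


def newton_leaf_values(prob_rows, labels, n_classes=3, l2=1.0):
    """Per-class leaf value -sum(g) / (sum(h) + lambda) over the samples in one leaf."""
    values = []
    for c in range(n_classes):
        g = sum(p[c] - (1.0 if y == c else 0.0) for p, y in zip(prob_rows, labels))
        h = sum(p[c] * (1.0 - p[c]) for p in prob_rows)
        values.append(-g / (h + l2))
    return values


# Three samples in one leaf, all starting from zero scores (uniform probabilities).
labels = [0, 0, 2]
probs = [softmax([0.0, 0.0, 0.0]) for _ in labels]
loss = cross_entropy(probs, labels)
leaf = newton_leaf_values(probs, labels)
print(loss, leaf)
```

In this toy leaf, class 0 is under-predicted (two of three samples) and receives a positive value, class 1 never occurs and receives a negative value, and class 2 is already predicted at its observed frequency, so its value is approximately zero.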

2.2. Optuna

The core of Optuna is a Tree-structured Parzen Estimator (TPE) sampler [42] that performs sequential sampling [43]. Past trials are first partitioned by a quantile threshold of the objective into a “better” set and a “worse” set. Optuna then models the conditional distributions of the hyperparameter vector $\theta$ for these two sets in a nonparametric way, and samples future candidates from regions that are more likely to yield improvement. Let the observed objective be $y = f(\theta)$ and let $y^{\star}$ denote the threshold (typically an upper quantile of historical observations). TPE estimates the density of good solutions, $G(\theta) = p(\theta \mid y \ge y^{\star})$, and the density of non-good solutions, $B(\theta) = p(\theta \mid y < y^{\star})$, and selects from the candidate pool the candidate that maximizes the density ratio:
$$\theta^{*} = \arg\max_{\theta} \frac{G(\theta)}{B(\theta)} \tag{5}$$
This separation of “good” and “non-good” regions supports broad exploration when evidence is scarce and progressively concentrates samples in promising neighbourhoods once such regions are identified. The densities are typically estimated by kernel methods or piecewise histograms. The approach natively handles mixed continuous and integer variables and can express hierarchical search spaces via conditional dependencies, so that a hyperparameter is sampled only when its prerequisite settings are active.
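The density-ratio idea can be sketched for a single continuous hyperparameter as follows. This is a simplified illustration, not Optuna's sampler: the Gaussian kernel, the fixed bandwidth, the 25% quantile split, the candidate grid and the toy objective are all assumptions made for the example.

```python
import math


def kde(points, bandwidth):
    """Return a Gaussian kernel density estimate built from 1-D points."""
    norm = 1.0 / (len(points) * bandwidth * math.sqrt(2.0 * math.pi))

    def density(x):
        return norm * sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points)

    return density


def tpe_suggest(history, candidates, gamma=0.25, bandwidth=0.1):
    """Pick the candidate maximizing G(theta) / B(theta) for a maximized objective."""
    ranked = sorted(history, key=lambda pair: pair[1], reverse=True)
    n_good = max(1, math.ceil(gamma * len(ranked)))
    good = [theta for theta, _ in ranked[:n_good]]   # the "better" set
    bad = [theta for theta, _ in ranked[n_good:]]    # the "worse" set
    g, b = kde(good, bandwidth), kde(bad, bandwidth)
    # A small constant guards against division by zero far from all "worse" points.
    return max(candidates, key=lambda theta: g(theta) / (b(theta) + 1e-12))


def objective(theta):
    # Hypothetical stand-in for a cross-validated accuracy, peaked at theta = 0.3.
    return 1.0 - (theta - 0.3) ** 2


history = [(i / 19, objective(i / 19)) for i in range(20)]  # 20 past trials
candidates = [i / 100 for i in range(101)]
suggestion = tpe_suggest(history, candidates)
print(suggestion)
```

The suggested value falls inside the neighbourhood of the best past trials, which is the concentration behaviour the paragraph above describes.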
To reduce computational cost and dampen noise-driven fluctuations, Optuna also provides pruning based on intermediate evaluations. MedianPruner compares the interim score of the current trial at a given training step with the median of completed trials at the same step and stops the trial when it clearly underperforms. Successive Halving allocates resources in stages, first screening many candidates with a small budget and then progressively concentrating resources on the top performers. Hyperband distributes resources across multiple brackets with different initial budgets and promotion schedules to balance exploration breadth and per-trial depth.
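The median rule described above reduces to a one-line comparison. The sketch below states it for a score that is maximized; the function name and the `min_trials` warm-up guard are illustrative assumptions rather than Optuna's API.

```python
from statistics import median


def should_prune(interim_score, completed_scores_at_step, min_trials=3):
    """Median rule for a maximized score: stop a trial that is below the median."""
    if len(completed_scores_at_step) < min_trials:
        return False  # too little evidence to justify stopping early
    return interim_score < median(completed_scores_at_step)


# Interim accuracies of three finished trials at the same training step (made up).
finished = [0.85, 0.90, 0.88]
print(should_prune(0.80, finished), should_prune(0.89, finished))
```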
In summary, Optuna combines density-ratio–guided adaptive sampling with evidence-based early stopping. It delivers favourable sample efficiency for nonconvex, noisy, and costly objective functions, remains friendly to mixed and conditional hyperparameter spaces, and maintains a practical balance between exploration and exploitation under limited budgets.

2.3. Information Driven Metaheuristic Optimization (OriginalINFO/INFO)

OriginalINFO treats hyperparameter search as a global optimization task in a mixed variable space [44]. It maintains a population of candidates $\{\theta_i(t)\}_{i=1}^{M}$ and schedules exploration and exploitation by means of information measures. The central idea is to quantify both population diversity and convergence progress, and to adapt step sizes and perturbations accordingly so that the method explores broadly at early stages and refines solutions later. Let $\Sigma(t)$ denote the sample covariance of the population at iteration $t$. A compact diversity proxy is the log volume:
$$D(t) = \frac{1}{d} \log\det\big(\Sigma(t) + \varepsilon I\big) \tag{6}$$
where $d$ is the dimensionality and $\varepsilon > 0$ is a stabilizer. Progress is measured by the improvement of the current best score under a unified evaluation function $J(\cdot)$:
$$P(t) = J\big(\theta^{*}(t)\big) - J\big(\theta^{*}(t-1)\big) \tag{7}$$
where $\theta^{*}(t)$ refers to the optimal hyperparameter combination of generation $t$. Candidate updates combine attraction toward the current best with information-scaled stochastic perturbations and a projection back to the feasible set:
$$\theta_i(t+1) = \Pi_S\big[\theta_i(t) + \alpha(t)\,\big(\theta^{*}(t) - \theta_i(t)\big) + \kappa(t)\,\varepsilon_i(t)\big] \tag{8}$$
Here $\Pi_S$ enforces bounds and variable types (continuous variables are clipped; integer or categorical variables are discretized after the update). The exploitation gain $\alpha(t)$ is increased when progress is sustained, while the exploration gain $\kappa(t)$ is enlarged when diversity collapses; the noise $\varepsilon_i(t)$ is Gaussian with zero mean and covariance aligned to $\Sigma(t)$. Elitist retention preserves a few top candidates each generation, and random immigration replaces some weak candidates with fresh samples to restore coverage. The design naturally supports mixed and conditional search spaces through projection and masking, and provides a family of candidates that complements density-ratio sequential samplers under nonconvex and noisy objectives.
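The update rule and diversity proxy described above can be sketched in a few lines. This is a simplified reading, not the reference INFO implementation: the population covariance is approximated by its diagonal (so the log-determinant becomes a sum of log-variances), the gains are passed in as fixed numbers, and the bounds for iterations, depth and learning rate are hypothetical.

```python
import math
import random


def diversity(population, eps=1e-8):
    """D = (1/d) * log det(Sigma + eps*I), with Sigma assumed diagonal."""
    m, d = len(population), len(population[0])
    total = 0.0
    for j in range(d):
        column = [theta[j] for theta in population]
        mean = sum(column) / m
        variance = sum((v - mean) ** 2 for v in column) / (m - 1)
        total += math.log(variance + eps)
    return total / d


def project(theta, bounds, integer_dims):
    """Pi_S: clip every coordinate to its bounds, then round integer-valued ones."""
    out = []
    for j, (value, (lo, hi)) in enumerate(zip(theta, bounds)):
        value = min(max(value, lo), hi)
        out.append(float(round(value)) if j in integer_dims else value)
    return out


def update(theta, best, alpha, kappa, sigma, bounds, integer_dims, rng):
    """theta + alpha * (best - theta) + kappa * noise, projected to the feasible set."""
    moved = [t + alpha * (b - t) + kappa * rng.gauss(0.0, s)
             for t, b, s in zip(theta, best, sigma)]
    return project(moved, bounds, integer_dims)


# Hypothetical search space: (iterations, depth, learning_rate).
bounds = [(100.0, 1000.0), (2.0, 10.0), (0.01, 0.3)]
integer_dims = {0, 1}
rng = random.Random(0)
candidate = update([200.0, 4.0, 0.05], [600.0, 8.0, 0.10], alpha=0.5, kappa=1.0,
                   sigma=[50.0, 1.0, 0.02], bounds=bounds,
                   integer_dims=integer_dims, rng=rng)
print(candidate)
```

Setting `kappa` to zero turns the step into pure attraction toward the current best, which makes the effect of the exploration term easy to isolate.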

2.4. Nevergrad Optimizer (NGOpt)

NGOpt in Nevergrad builds an adaptive portfolio of derivative-free optimizers and allocates budget online by comparing the observed payoffs of the sub-strategies [45]. This avoids committing to a single algorithm before any evidence is available. The workflow is as follows. First, a parametrization is created according to variable types, dimensionality, and the overall budget, and a pool of candidate strategies is assembled, for example, evolutionary search, 1 + 1 random mutation, differential variation, and covariance-adaptation style methods. The optimization then proceeds with an ask–tell interface: each sub-strategy proposes a hyperparameter vector θ, which is scored by a unified evaluation function J(θ) such as the mean accuracy over five stratified folds.
To balance exploration and exploitation at the strategy level, NGOpt selects the next sub-optimizer using an upper confidence bound rule from multi-armed bandits:
$$j_t = \arg\max_{j} \left\{ \hat{\mu}_j(t) + c \sqrt{\frac{\ln t}{n_j(t)}} \right\} \tag{9}$$
where $\hat{\mu}_j(t)$ is the empirical payoff of strategy $j$ at iteration $t$, $n_j(t)$ is its call count, and $c > 0$ controls exploration strength. The chosen strategy produces a candidate $\theta(t)$; after evaluation, $\hat{\mu}_j(t)$ and $n_j(t)$ are updated. This shifts more budget toward strategies that perform well while still reserving some resources to test alternatives. For noisy objectives, NGOpt allows limited re-evaluation of top candidates or uses robust aggregation such as medians or batched means to reduce the influence of outliers. For mixed and hierarchical hyperparameter spaces, masking activates a parameter only when its prerequisites are satisfied; continuous variables are projected to bounds after updates, and integer or categorical variables are discretized.
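The upper confidence bound rule described above can be written as a short selection function. The sketch follows the rule as stated in the text and does not call Nevergrad; the function names and the choice to try every strategy once before applying the bound are assumptions.

```python
import math


def ucb_select(mean_payoff, counts, t, c=1.0):
    """Return the index j maximizing mean_payoff[j] + c * sqrt(ln t / counts[j])."""
    for j, n in enumerate(counts):
        if n == 0:
            return j  # give every strategy one call before comparing bounds
    scores = [mu + c * math.sqrt(math.log(t) / n)
              for mu, n in zip(mean_payoff, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def update_stats(mean_payoff, counts, j, payoff):
    """Incrementally update the call count and empirical payoff of strategy j."""
    counts[j] += 1
    mean_payoff[j] += (payoff - mean_payoff[j]) / counts[j]


# Two hypothetical sub-strategies with made-up payoffs.
means, calls = [0.9, 0.8], [50, 1]
chosen = ucb_select(means, calls, t=51)
print(chosen)
```

Here the second strategy is selected despite its lower mean payoff, because it has been called only once and its exploration bonus is still large.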

2.5. Hybrid Modelling Procedure

  • Step 1: Data partitioning. The dataset is split 80:20 into a training set for learning the classification rule and a test set for evaluating generalization on unseen samples. Standardization and linear orthogonalization are fitted on the training set only and then applied to the test set using the same parameters to rigorously prevent information leakage (see Section 3 for preprocessing details). Stratified sampling is used during the split so that class proportions remain consistent across the two subsets, which mitigates bias in performance estimates for imbalanced data.
  • Step 2: Objective function design. To obtain a more reliable estimate of generalization under limited sample sizes, the optimization objective is defined by cross-validation, as presented in Figure 2. Specifically, the mean accuracy over fivefold cross-validation during training serves as the objective value (fitness value) at each trial:
    $$J(\theta) = \frac{1}{K} \sum_{k=1}^{K} \mathrm{Accuracy}_k(\theta), \qquad K = 5 \tag{10}$$
Compared with a single train/validation split, fivefold cross-validation reduces variance due to arbitrary partitions and lowers the risk of overfitting [46]. Using the fold-averaged accuracy steers the hyperparameter search toward configurations that perform consistently across different subsets rather than coincidentally excelling on one split, thereby improving the robustness of the selected model.
  • Step 3: Model initialization. To fully explore the performance envelope of CatBoost, three hyperparameters are tuned: the number of trees/boosting iterations (iterations), tree depth (max_depth), and learning rate (learning_rate). These jointly govern model capacity and learning dynamics. Boosting iterations controls the number of base learners and directly affects fit and training time; max_depth determines the complexity of each tree and is closely related to the ability to capture higher-order nonlinearities as well as the risk of overfitting; learning_rate sets the update step of each boosting round and balances convergence speed against generalization. The ranges and default settings are summarized in Table 1.
  • Step 4: Iterative optimization. After initialization, an iterative search is conducted to identify hyperparameter combinations that better serve the objective. The optimization loop proceeds over candidate configurations and terminates once the budget of 200 trials is exhausted, yielding the best setting for subsequent final training and evaluation, as displayed in Figure 3.
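Steps 1 and 2 can be sketched with a stratified fold assignment and a fold-averaged accuracy objective. This is a minimal illustration under stated assumptions: `fit_predict` is a hypothetical callback standing in for training a CatBoost model with hyperparameters `theta`, a majority-class predictor is used as a placeholder, and the class counts are synthetic.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign sample indices to k folds class by class, preserving class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def cv_objective(theta, X, y, fit_predict, k=5):
    """J(theta): mean accuracy over k stratified folds."""
    accuracies = []
    for held_out in stratified_folds(y, k):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        preds = fit_predict(theta, [X[i] for i in train], [y[i] for i in train],
                            [X[i] for i in held_out])
        hits = sum(p == y[i] for p, i in zip(preds, held_out))
        accuracies.append(hits / len(held_out))
    return sum(accuracies) / k


def majority(theta, X_train, y_train, X_test):
    # Placeholder model: always predicts the most frequent training class.
    most_common = max(set(y_train), key=y_train.count)
    return [most_common] * len(X_test)


# Synthetic imbalanced labels: 30, 10 and 10 samples for three classes.
labels = [0] * 30 + [1] * 10 + [2] * 10
X = [[float(i)] for i in range(len(labels))]
score = cv_objective(None, X, labels, majority)
print(score)
```

Taking one of five stratified folds as the hold-out also yields an 80:20 split with matching class proportions, which mirrors Step 1.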

3. Dataset Description

3.1. Data Source and Feature Analysis

The dataset employed in this study originates from the field records published by Tao Yan [36] for an earth pressure balance (EPB) shield tunnelling project on the Guangzhou–Foshan intercity railway, as illustrated in Figure 4. The construction site is located in the Long-Da section of the alignment in Guangzhou, China, where a single EPB shield with an outer diameter of 9.15 m, an overall length of 103 m and an installed power of 4.5 MW was used for excavation. During tunnelling, shield operational data were sampled on a per-ring basis. The resulting project database comprises 590 rings from the westbound tunnel and 589 rings from the eastbound tunnel, yielding a total of 1179 ring-wise samples with synchronized operational measurements and geological labels. Notably, the ground-type labels used in this work follow the engineering interpretation in [36,37]: (i) formations with soft soil, (ii) formations with uneven soft soil and hard rock (mixed ground), and (iii) formations with full-section hard rock are distinguished on the basis of the field penetration index and torque penetration index, combined with borehole information and construction logs, so that the mapping from field observations to class labels remains physically and geologically consistent. Notably, all observations were collected from a single EPB shield machine along this alignment, so the dataset is homogeneous in terms of machine type and project setting. Within this scope, this work should be viewed as an in-depth methodological case study on a well-documented project dataset, rather than as a comprehensive multi-project validation.
In terms of variable design, the dataset combines raw machine measurements with physically motivated derived quantities. The former includes cutterhead rotation speed (CRS), advance rate (AR), mean thrust (MF), mean torque (MT), upper chamber pressure (UEP), and lower chamber pressure (LEP), which capture the immediate control–feedback chain of advance, cutting and chamber pressure regulation. The latter makes the excavation mechanics explicit. Penetration per revolution (PR) serves as a proxy for advance per turn, integrating penetration resistance and cutting efficiency. The field penetration index (FPI) and the torque penetration index (TPI) are approximations to the axial force and torque required per unit penetration and are designed to reflect the combined influence of ground strength and heterogeneity. The specific energy (SE) can be interpreted as the energy density required to excavate a unit volume, integrating the contributions of axial and rotational work. Higher values of FPI and TPI are generally associated with stronger or more heterogeneous soil–rock assemblages, whereas elevated SE indicates less favourable cutting and mucking conditions; these relationships underpin the use of the derived variables as proxies for ground strength and deformation characteristics in the subsequent ground-type classification framework.
Table 2 summarizes the descriptive statistics of the ten input variables. CRS has a mean of about 1.43 rpm with a standard deviation of about 0.18 rpm and a narrow interquartile range, indicating that cutterhead speed is adjusted within a tight band. MF and MT have means of about 453.68 kN/m² and 6.67 kN/m², respectively, with comparatively large standard deviations, suggesting that the force–torque system is sensitive to nonstationarity in the ground. For the derived variables, PR has a mean of about 16.35 mm/r and a median of about 11.54 mm/r with substantial dispersion. FPI, TPI and SE show even larger mean–median gaps (for example, FPI mean ≈ 3033 vs. median ≈ 2550; TPI mean ≈ 542 vs. median ≈ 443; SE mean ≈ 14.50 vs. median ≈ 11.90), and their maxima are far above the upper quartiles, indicating heavy tails. This pattern accords with engineering intuition: when the machine encounters hard interbeds or accelerated tool wear, the required axial force and torque per unit penetration and the energy per unit volume tend to exhibit transient spikes.
To maintain consistency and traceability, the definitions, units and computational formulae of CRS, AR, MF, MT, UEP, LEP, PR, FPI, TPI and SE strictly follow those reported by Yan et al. [36,37]; the present study re-uses this public feature set and builds a new classification and uncertainty-quantification scheme on top of it, without altering the original measurements or label assignments.
To characterize linear and monotonic associations between raw measurements and derived variables and to provide evidence for subsequent feature treatment and model selection, this section reports both Pearson and Spearman coefficients [47,48]. These help distinguish approximately linear couplings from monotonic but nonlinear relations, and identify multicollinearity arising from shared construction terms (for example, FPI and TPI based on PR) or coordinated control variables (such as UEP and LEP). This, in turn, helps avoid double counting similar physical information during model training. In addition, efficiency measures (AR and PR) show systematic negative associations with resistance and energy measures (FPI, TPI and SE), whereas force–torque and upper–lower chamber pressures tend to be positively associated [25,26,27,49]. The signs and magnitudes of these coefficients serve as a quick consistency check against mechanical reasoning. The correlation results are shown in Figure 5, where the diagonal panels display univariate distributions, the lower triangle shows scatter plots with fitted trend lines, and the upper triangle lists Pearson (P.) and Spearman (S.) coefficients.
Several clusters of very strong positive correlation are apparent and largely reflect definitional or mechanistic coupling. AR and PR are almost perfectly linearly related (P ≈ 0.99, S ≈ 0.98), which is consistent with the kinematic relation that advance rate equals penetration per revolution multiplied by rotation speed. MF and MT are strongly positively correlated (P and S ≈ 0.78), indicating a coordinated increase in axial force and cutting torque as resistance rises. UEP and LEP are also highly correlated (P ≈ 0.84, S ≈ 0.87), reflecting coordinated chamber pressure control. FPI and TPI exhibit an extremely strong correlation (P ≥ 0.90, S ≥ 0.93) because both share PR in the denominator and are driven by MF and MT, respectively. TPI–SE and FPI–SE are nearly co-varying, showing that energy density and unit penetration force/torque respond in the same direction when resistance changes. While these highly correlated derived variables provide redundant evidence about resistance and energy, they also imply pronounced multicollinearity, which calls for standardization and moderate dimensionality reduction during modelling to stabilize estimation and limit variance inflation.
A set of stable negative patterns is also evident, mainly between efficiency and resistance or energy measures. PR and AR show moderate to strong negative correlations with FPI, TPI and SE (Pearson roughly −0.70 to −0.75, Spearman around −0.85), indicating that when the ground is harder or more heterogeneous, required unit force/torque and unit energy increase while penetration per revolution and advance rate decrease. This aligns with engineering intuition and site experience. In addition, CRS is moderately negatively correlated with AR and PR (P ≈ −0.43 to −0.57) but weakly to moderately positively correlated with FPI, TPI and SE (P ≈ 0.29–0.31). This suggests that increasing rotation speed does not linearly improve advance efficiency under certain regimes and tends instead to raise unit energy and unit force requirements, consistent with the idea that accelerating cutting without a concurrent reduction in resistance elevates energy input.
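The contrast between the two coefficients can be illustrated with a minimal sketch. It assumes a hypothetical index that is inversely proportional to penetration per revolution (a constant force divided by PR); the numbers are invented for illustration and are not taken from the project dataset. For such a monotonic but nonlinear relation, the Spearman coefficient reaches −1 while the Pearson coefficient stays noticeably weaker, which is the kind of gap visible in the efficiency versus resistance pairs.

```python
import math

def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman coefficient computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical values (assumption): an index proportional to 1 / PR.
pr = [5.0, 10.0, 20.0, 40.0, 80.0]        # penetration per revolution
fpi_like = [1000.0 / v for v in pr]       # monotonic but nonlinear in PR

p = pearson(pr, fpi_like)
s = spearman(pr, fpi_like)
print(f"Pearson = {p:.3f}, Spearman = {s:.3f}")
```

Because the derived indices divide by PR, a stronger Spearman than Pearson coefficient for these pairs is expected from the construction alone and does not by itself indicate a separate physical effect.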

3.2. PCA

To mitigate instability arising from multicollinearity and scale differences among the input variables, this study applies principal component analysis (PCA) to the input features [50,51]. The procedure is as follows: a standardization transform to zero mean and unit variance is first fitted on the training set, after which the learned transformation is applied to both the training and test sets so that information leakage is avoided. The cumulative explained variance is shown in Figure 6. The first five principal components together account for 97.61% of the total variance, with individual contributions of approximately PC1: 57.92%, PC2: 23.93%, PC3: 7.48%, PC4: 5.43%, and PC5: 2.86%. In other words, more than 97% of the variability in the original ten-dimensional feature space is retained by these five components, while each of the remaining components explains less than 2.4% of the total variance individually. The curve exhibits a clear elbow after PC2–PC3, and retaining these five components therefore captures the dominant information while reducing high-dimensional redundancy that could hinder downstream modelling and hyperparameter tuning.
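The leakage-free ordering described above can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the study's implementation: the two-column training and test rows are hypothetical, the helper names are invented, and the PCA step itself is omitted (it would be fitted on the standardized training rows only). The explained-variance ratios are the rounded values quoted in the text.

```python
import math

def fit_standardizer(train_rows):
    """Column means and standard deviations estimated from training rows only."""
    n, d = len(train_rows), len(train_rows[0])
    means = [sum(r[j] for r in train_rows) / n for j in range(d)]
    stds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in train_rows) / n)
            for j in range(d)]
    return means, stds

def apply_standardizer(rows, means, stds):
    """Apply the training-set statistics to any split without refitting."""
    return [[(v - m) / s for v, m, s in zip(r, means, stds)] for r in rows]

def n_components_for(ratios, target):
    """Smallest number of leading components whose cumulative ratio >= target."""
    total = 0.0
    for k, ratio in enumerate(ratios, start=1):
        total += ratio
        if total >= target:
            return k, total
    return len(ratios), total

# Hypothetical two-feature logs (assumption, not project data).
train = [[10.0, 200.0], [12.0, 260.0], [14.0, 240.0], [16.0, 300.0]]
test = [[13.0, 250.0], [20.0, 400.0]]

means, stds = fit_standardizer(train)          # statistics from training only
train_z = apply_standardizer(train, means, stds)
test_z = apply_standardizer(test, means, stds) # test reuses training statistics

# Rounded explained-variance ratios of PC1..PC5 as quoted in the text.
reported = [0.5792, 0.2393, 0.0748, 0.0543, 0.0286]
k, cum = n_components_for(reported, target=0.97)
print(k, round(cum, 4))
```

The rounded ratios sum to about 0.976, which agrees with the reported 97.61% up to rounding of the individual contributions.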
The loading structure provides a physically interpretable decomposition, as shown in Table 3. PC1 is dominated by FPI, TPI, and SE. When ground hardness or heterogeneity increases, the required axial force per unit penetration, the torque per unit penetration, and the unit excavation energy tend to rise together, and the PC1 score increases accordingly. This is consistent with the earlier correlation analysis, in which advance-efficiency measures such as AR and PR showed strong negative associations with impedance and energy measures such as FPI, TPI, and SE. PC2 is mainly governed by UEP and LEP, with CRS also carrying notable weight, reflecting a process variable cluster of chamber-pressure control and rotation speed; the joint regulation of upper and lower chamber pressure together with cutterhead speed captures changes in ground conditions. PC3 again highlights CRS, with secondary contributions from UEP and LEP, and can be viewed as a complement that describes pressure responses under fine adjustments of rotation speed. PC4 is shaped by AR, CRS, and PR and emphasizes kinematic efficiency: advance rate equals penetration per revolution times rotation speed. PC5 is driven primarily by MT and FPI, which helps maintain classification sensitivity when energy consumption rises without a commensurate improvement in advance efficiency. Taken together, these patterns indicate that the retained principal components not only compress the correlated input variables into a lower-dimensional space, but also preserve clear mechanical interpretations, linking pressure control, penetration behaviour and energy input, which are directly relevant to discriminating between soft soil, mixed ground and hard rock strata.
Notably, although the specific loading values and explained variance ratios are tied to the Guangzhou–Foshan intercity railway dataset, the overall feature engineering and preprocessing strategy is kept project agnostic. All variables entering PCA are either directly measured EPB operating quantities (thrust, torque, advance rate, rotation speed, chamber pressures) or dimensionally consistent derived indicators such as penetration rate, field penetration index, torque penetration index and specific energy that are widely used in tunnelling mechanics. Standardization followed by PCA is applied to this generic feature set without relying on project-specific tuning, so that the same workflow can be recomputed on other EPB drives once ring-wise logs and basic geological labels are available. In that sense, the present PCA not only stabilizes model fitting on this project, but also defines a transferable low-dimensional representation that can be reestimated for new projects while retaining clear mechanical interpretation.

4. Results and Discussion

4.1. Selection of the Optimal Model

Conducting performance analysis of machine learning models for ground-type discrimination under finite samples and class imbalance requires multiple complementary indicators rather than a single score. In this study, three metrics were adopted, as shown in Figure 7 [52]. Overall accuracy (Acc) quantifies the proportion of correctly classified labels and reflects the global correctness of the model across all classes. Macro-recall (Rec), defined as the unweighted average of per-class recall, emphasizes the ability to correctly identify each ground type, including minority or more difficult classes such as mixed ground or hard-rock dominated strata. Macro-precision (Pre), defined as the unweighted average of per-class precision, highlights control of false alarms by penalizing situations where a class is predicted frequently but with many misclassifications. Let the confusion matrix be $C = [C_{ij}]$ with true class $i$ and predicted class $j$. For class $k$, define $TP_k = C_{kk}$, $FP_k = \sum_{i \neq k} C_{ik}$, and $FN_k = \sum_{j \neq k} C_{kj}$. Overall accuracy is $Acc = \sum_{k} C_{kk} / \sum_{i,j} C_{ij}$, which is the proportion of correctly classified samples. Class-wise precision and recall are $Pre_k = TP_k / (TP_k + FP_k)$ and $Rec_k = TP_k / (TP_k + FN_k)$. Macro-averaged indicators are simple arithmetic means across classes, i.e., $Pre = \frac{1}{K} \sum_{k=1}^{K} Pre_k$ and $Rec = \frac{1}{K} \sum_{k=1}^{K} Rec_k$, where $K$ is the number of classes. Compared with accuracy, which is dominated by the overall sample count, macro-averaged metrics weight each class equally and thus capture performance on minority classes more objectively. In particular, macro-recall emphasizes control of missed detections, whereas macro-precision emphasizes control of false alarms.
Taken together, these indicators form an evaluation system that balances global performance, sensitivity to each geological category and the risk of overconfident but unreliable predictions, which is crucial when misclassifying hard-rock or mixed ground may lead to inappropriate EPB operating parameters and elevated construction risk.
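The three indicators can be computed directly from a confusion matrix, as in the following minimal sketch. The 3 × 3 counts are illustrative assumptions: they were chosen so that the resulting scores land close to the test metrics reported later for NGOpt-CatBoost, and they are not read from Figure 9. The function name and the convention of returning 0 for an empty denominator are likewise assumptions.

```python
def macro_metrics(C):
    """Accuracy, macro-precision and macro-recall from a square confusion
    matrix C, where C[i][j] counts samples of true class i predicted as j."""
    K = len(C)
    total = sum(sum(row) for row in C)
    acc = sum(C[k][k] for k in range(K)) / total
    precisions, recalls = [], []
    for k in range(K):
        tp = C[k][k]
        fp = sum(C[i][k] for i in range(K) if i != k)   # column k, off-diagonal
        fn = sum(C[k][j] for j in range(K) if j != k)   # row k, off-diagonal
        # Assumed convention: a class never predicted (or absent) scores 0.
        precisions.append(tp / (tp + fp) if tp + fp else 0.0)
        recalls.append(tp / (tp + fn) if tp + fn else 0.0)
    return acc, sum(precisions) / K, sum(recalls) / K, precisions, recalls

# Illustrative counts only (rows: true class, columns: predicted class).
C = [[100, 4, 0],
     [5, 102, 0],
     [0, 0, 29]]
acc, pre, rec, pre_k, rec_k = macro_metrics(C)
print(f"Acc={acc:.4f}  Pre={pre:.4f}  Rec={rec:.4f}")
# prints: Acc=0.9625  Pre=0.9715  Rec=0.9716
```

In this example the smallest class (29 samples) contributes one third of each macro score, whereas it contributes only 29 of 240 samples to accuracy, which is why the macro metrics can exceed accuracy when a minority class is classified perfectly.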
Given that the present task involves structured data and a clear requirement for engineering interpretability and uncertainty analysis, this study focuses on tree-based classifiers rather than neural networks or kernel methods. Ensemble methods built from decision trees have repeatedly been shown to perform strongly on structured data, typically requiring less intensive hyperparameter tuning than deep neural networks, being less sensitive to feature scaling, and allowing more straightforward inspection of variable importance and decision rules. In the present setting, EPB operational variables and rock mass characteristics exhibit pronounced nonlinearities, interactions and threshold effects (for example, sharp changes in thrust and torque when transitioning from soft soil to mixed or hard-rock faces), and tree ensembles can naturally accommodate such patterns while remaining robust to monotonic transformations.
To build representative baselines that cover common tree-based paradigms, six comparison models were included: a single decision tree (DT), random forest (RF), extremely randomized trees (ET), gradient-boosted trees (XGBoost), categorical-friendly gradient boosting (CatBoost), and a rule-based linear model (RuleFit). The choice is motivated by the presence of threshold effects, strong nonlinearity, and feature interactions between machine operating parameters and geological feature categories. Tree models naturally express piecewise decision boundaries via feature–threshold splits, are less sensitive to feature scaling and mixed physical units, and are relatively robust to outliers. DT provides the most basic and transparent tree baseline [53]. RF reduces variance and stabilizes generalization by bootstrap sampling and feature subsampling [54]. ET further randomizes split thresholds within the RF framework to enhance decorrelation and variance reduction [55]. XGBoost fits residuals additively with second-order approximations and explicit regularization, often balancing accuracy and robustness on structured data with moderate noise [56]. CatBoost uses oblivious (symmetric) trees and ordered statistics to mitigate target leakage and ordering bias, yielding stable representations of multivariate nonlinear relations [40]. RuleFit first extracts high-support logical rules from ensembles and then forms a sparse linear combination with L1 regularization, offering interpretability close to linear models while retaining expressive power [57].
Evaluating these six classifiers under the same data partitioning scheme and metric set allows a fair comparison of their strengths and weaknesses and makes it possible to identify not only the single best-performing model, but also the most suitable modelling paradigm for geological classification from shield operational data.
Among the six base classifiers, CatBoost achieves the highest test accuracy (Acc = 0.9292) while maintaining a favourable balance between precision (0.9373) and recall (0.9047); the train–test accuracy gap is about 0.056, indicating adequate capacity with controlled overfitting. Non-boosting base learners show weaker test performance overall. DT reaches test Acc = 0.8917 with precision 0.9095 and recall 0.8930; the train–test accuracy gap is about 0.092, suggesting a more pronounced overfitting tendency. ET yields test Acc = 0.8958 with precision 0.9133 and recall 0.8882; although lower in absolute value than CatBoost, its train–test accuracy gap (≈0.06) is comparable to CatBoost, indicating that strong randomization helps reduce variance. RuleFit provides good interpretability but attains the lowest test scores in this dataset (Acc = 0.8708, precision 0.8937, recall 0.8770), with a train–test accuracy gap of about 0.11, implying that a linear–rule hybrid may be capacity-limited in strongly nonlinear, highly coupled settings. Overall, CatBoost offers the best combination of top accuracy and a balanced precision–recall profile, as illustrated in Table 4 and Table 5.
Although CatBoost ranks among the leading base classifiers on the test set, the default hyperparameter configuration does not necessarily yield an ideal classification outcome. To further enhance generalization and stabilize hyperparameter selection, this study integrated three representative hyperparameter optimization strategies from Section 2.2, Section 2.3 and Section 2.4 with CatBoost, producing three hybrid models: INFO-CatBoost (information-driven metaheuristic), NGOpt-CatBoost (Nevergrad composite strategy), and Optuna-CatBoost (TPE-based sequential sampling with early-stopping pruning). These optimizers embody three distinct design philosophies: global population search, portfolio-style adaptive budget allocation across strategies, and density-ratio-guided sequential model optimization. The fitness function (objective) is the mean accuracy over five-fold cross-validation to ensure comparability. From an optimization perspective, INFO maintains a globally distributed population and explicitly encourages diversity in the candidate set, NGOpt adaptively reallocates the limited trial budget among multiple internal algorithms according to their observed performance, whereas Optuna progressively concentrates sampling in regions with favourable density ratios between high-performing and previously evaluated configurations. The convergence traces of the fitness values and the resulting hyperparameter configurations are presented in Figure 8: all three hybrid models exhibit an overall upward trend and eventual stabilization, but with characteristic convergence trajectories that illustrate how each strategy approaches its performance plateau under the same trial budget.
As shown in Table 4 and Table 5, all three hybrid models outperform the untuned CatBoost in both training and test phases and markedly narrow the train–test performance gap. NGOpt-CatBoost attains a test accuracy of 0.9625 with macro-precision 0.9715 and macro-recall 0.9716. Relative to the untuned CatBoost (accuracy 0.9292, precision 0.9373, recall 0.9047), this corresponds to an accuracy gain of about 0.0333, with macro-precision and macro-recall gains of about 0.0342 and 0.0669. Its train–test accuracy gap is approximately 0.0354, smaller than CatBoost’s roughly 0.056, indicating a higher performance ceiling with better overfitting control. INFO-CatBoost achieves a test accuracy of 0.9542 (gain ≈ 0.0250), with macro-precision 0.9654 and macro-recall 0.9570; the train–test gap is about 0.0437, which is also clearly better than the untuned model. Optuna-CatBoost reaches a test accuracy of 0.9500 (gain ≈ 0.0208), with macro-precision 0.9621 and macro-recall 0.9622. Overall, all three optimization schemes deliver consistent improvements in generalization under the present data regime, with NGOpt-CatBoost exhibiting the strongest aggregate performance. The confusion matrices of the candidate models introduced in this study are shown in Figure 9, enabling an intuitive assessment of the decision-making performance of different prediction techniques. In each panel, the numbers denote sample counts (not normalized proportions), and the colour intensity encodes the magnitude of the counts (darker cells indicate higher frequencies). In each confusion matrix, the tick labels 1, 2, 3 correspond to the three geological conditions used in this study, i.e., 1 = Soft-soil dominated, 2 = Soft-soil–hard-rock mixed, and 3 = Hard-rock dominated. Diagonal cells therefore count correct recognitions for each class, while off-diagonal cells quantify specific confusions.
To assess the model’s discrimination across decision thresholds, the NGOpt-CatBoost model was further evaluated using receiver operating characteristic (ROC) analysis [58]. As shown in Figure 10, the key quantity is the area under each ROC curve (AUC), i.e., the area enclosed by the curve and the axes of false positive rate (FPR) and true positive rate (TPR). In the multiclass setting, a one-vs-rest scheme is adopted to compute ROC/AUC for each class separately: the macro-average AUC is the arithmetic mean of the per-class AUCs and gives equal weight to each class, whereas the micro-average AUC aggregates true and false decisions over all samples to form a single overall ROC that better reflects the model’s global ranking ability. AUC takes values from 0 to 1; values closer to 1 indicate better threshold-free ranking performance. Unlike accuracy or recall computed at a fixed threshold, AUC is insensitive to the particular threshold choice and is relatively robust under moderate shifts in class proportions.
From the figure, the three one-vs-rest curves (for Classes 1, 2, and 3) lie close to the top and left borders, with AUCs of 0.99 for Class 1, 0.99 for Class 2, and 1.00 for Class 3, indicating strong separability for all three operating conditions. In particular, the near-perfect AUC for distinguishing hard-rock-dominated ground (Class 3) from the other two categories implies minimal overlap between positive and negative samples, reducing the risk of confusing hard-rock conditions with soft-soil or mixed strata, which is an advantage for proactive shield operation planning. Both macro- and micro-average AUCs are close to 1.00 and nearly coincide, showing that the model sustains very high overall discrimination while remaining balanced across classes, without trading minority-class performance for aggregate gains. This aligns with the previously reported high accuracy and well-balanced macro-precision and macro-recall at fixed thresholds, and further indicates low false-alarm rates together with high detection rates across a wide range of thresholds.
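A minimal sketch of the one-vs-rest AUC computation is given below. It relies on the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one (ties counted as one half). The six samples and their probability rows are hypothetical, and the function names are illustrative.

```python
def binary_auc(scores, is_positive):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as one half."""
    pos = [s for s, p in zip(scores, is_positive) if p]
    neg = [s for s, p in zip(scores, is_positive) if not p]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

def one_vs_rest_auc(y_true, proba, classes):
    """Per-class, macro-average and micro-average one-vs-rest AUC."""
    per_class = {}
    for col, c in enumerate(classes):
        per_class[c] = binary_auc([row[col] for row in proba],
                                  [y == c for y in y_true])
    macro = sum(per_class.values()) / len(classes)
    # Micro-average: pool every (sample, class) pair into one binary problem.
    pooled_scores = [row[col] for row in proba for col in range(len(classes))]
    pooled_labels = [y == c for y in y_true for c in classes]
    micro = binary_auc(pooled_scores, pooled_labels)
    return per_class, macro, micro

# Hypothetical labels and class-probability rows (columns: classes 1, 2, 3).
y_true = [1, 1, 2, 2, 3, 3]
proba = [
    [0.80, 0.15, 0.05],
    [0.50, 0.45, 0.05],
    [0.60, 0.35, 0.05],   # a class-2 sample scored higher for class 1
    [0.10, 0.80, 0.10],
    [0.05, 0.15, 0.80],
    [0.10, 0.20, 0.70],
]
per_class, macro, micro = one_vs_rest_auc(y_true, proba, classes=[1, 2, 3])
print(per_class, round(macro, 4), round(micro, 4))
```

The pairwise formulation is quadratic in the sample count, which is acceptable for a test set of a few hundred rings; a rank-based formula would be preferable for much larger sets.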

4.2. Model Uncertainty Analysis

Relying solely on single-point statistics such as one-off accuracy, precision, and recall tends to underestimate a model’s robustness and uncertainty. Point estimates neither capture sampling variance nor account for sensitivity to threshold choices and class proportions; with a modest test set or mild class imbalance, a single evaluation can yield spuriously high or low impressions. Against this backdrop, the conventional comparison is complemented with interval-based evaluation to quantify how much the reported metrics may vary under finite samples, i.e., their confidence, and to judge whether observed train–test differences fall within a statistically acceptable range.
A stratified bootstrap was employed to obtain nonparametric interval estimates for the overall indicators of the selected model (NGOpt-CatBoost) [59]. With the model and decision threshold held fixed, paired observations of true labels and model predictions are repeatedly resampled with replacement within each class so that class proportions are preserved, and empirical distributions for accuracy, macro-averaged precision, and macro-averaged recall are constructed. The analysis reports 95% confidence intervals together with 50% interquartile bands. These intervals are conditional on the fitted model and therefore reflect sampling uncertainty due to finite data rather than optimization variance arising from refitting; as such, they align with the engineering question of how stable the current model would be on similar samples. In the implementation, both the training and test sets were resampled 1000 times.
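The resampling scheme can be sketched as follows. This is an illustration under assumptions rather than the study's code: the labels and predictions are hypothetical, the interval is a simple percentile interval, and the function names are invented. The fitted model is never retrained; only the paired (true label, prediction) observations are resampled within each class.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of matching label pairs."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def stratified_bootstrap_ci(y_true, y_pred, metric, n_boot=1000,
                            alpha=0.05, seed=0):
    """Percentile interval for `metric`, resampling indices with replacement
    inside each true class so that class proportions are preserved."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(y_true):
        by_class.setdefault(y, []).append(idx)
    stats = []
    for _ in range(n_boot):
        sample = []
        for idxs in by_class.values():
            sample.extend(rng.choices(idxs, k=len(idxs)))
        stats.append(metric([y_true[i] for i in sample],
                            [y_pred[i] for i in sample]))
    stats.sort()
    lo = stats[int((alpha / 2) * n_boot)]
    hi = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

# Hypothetical test labels and predictions (assumption, 100 samples).
y_true = [1] * 40 + [2] * 40 + [3] * 20
y_pred = [1] * 38 + [2] * 2 + [2] * 37 + [1] * 3 + [3] * 20

point = accuracy(y_true, y_pred)
lo, hi = stratified_bootstrap_ci(y_true, y_pred, accuracy)
print(f"Acc = {point:.4f}, 95% CI = [{lo:.4f}, {hi:.4f}]")
```

Passing `alpha=0.5` to the same function would give the 25th to 75th percentile band, which is one way to obtain the 50% interquartile bands mentioned above.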
As shown in Figure 11, all three training-set metrics are near the upper bound with very tight intervals: Acc = 0.9979 with a 95% CI of [0.9948, 1.0000]; Pre = 0.9963 with [0.9898, 1.0000]; Rec = 0.9984 with [0.9960, 1.0000]. The narrow widths of these intervals indicate that sampling variance has largely converged under the present sample size and class composition. The test-set intervals are likewise compact and high: Acc = 0.9625 with [0.9417, 0.9833]; Pre = 0.9715 with [0.9559, 0.9876]; Rec = 0.9716 with [0.9557, 0.9874]. In all cases, the 50% interquartile bands remain close to the point estimates, which implies limited fluctuation of the metrics across resampled subsets. The lower bounds of the test-set intervals are well above the nominal chance level of 1/3 for a three-class problem and clearly exceed the performance of trivial baselines, indicating that the predictive ability of NGOpt-CatBoost is not only numerically high but also statistically distinct from random guessing.
In addition, the relative positions of the training and test intervals provide further insight into generalization. As expected, the three metrics on the training set are slightly higher than those on the test set. The gap in accuracy between the training and test sets is about 0.0354 (from 0.9979 to 0.9625), which is modest and not indicative of severe overfitting. The reported 95% intervals do not overlap (for accuracy, the training lower bound of 0.9948 lies above the test upper bound of 0.9833), so the train–test difference is statistically detectable, but it remains small in practical terms. Taken together, the relatively tight intervals, their separation from chance-level performance, and the small distance between the training and test bands support the view that the NGOpt-CatBoost estimates are stable and statistically meaningful in the present EPB dataset, rather than over-specializing to the particular realization of the training set.
To further characterize the discriminative capability of NGOpt-CatBoost across different ground types, the overall macro-averaged metrics were complemented by class-wise estimates of accuracy, precision and recall with stratified bootstrap confidence intervals for soft soil, mixed strata and hard rock, as summarized in Figure 12. For the soft soil class, the test accuracy is approximately 0.9625 with a 95% confidence interval of [0.9374, 0.9834]. The corresponding class-wise precision reaches 0.9524 with an interval of [0.9181, 0.9899], while the recall is 0.9615 with an interval of [0.9135, 0.9904]. Performance for mixed strata is of a similar level. Under the same test split, the accuracy again attains 0.9625 with the same confidence interval [0.9374, 0.9834]; the precision and recall are 0.9623 and 0.9533, with 95% confidence intervals of [0.9174, 0.9905] and [0.9159, 0.9907], respectively. The hard rock class exhibits even stronger results in the present setting, with accuracy, precision and recall all equal to 1.0000 and the stratified bootstrap intervals collapsing to [1.0000, 1.0000]. This pattern indicates that, for the data condition and feature configuration considered, hard rock samples occupy a particularly well separated region in feature space and no misclassification or omission occurs on the test set.
Further, the results in Figure 12 suggest that NGOpt-CatBoost delivers not only high macro-level performance but also balanced class-wise behaviour. Across all three ground types, accuracy and the class-wise precision and recall are consistently above 0.95, and no category shows a pronounced degradation that would create a limiting “weak link”. Because precision and recall for hard rock both equal 1.0000, the test-set misclassifications are confined to a small number of confusions between the neighbouring soft-soil and mixed-strata categories, and even these errors remain relatively infrequent within the 95% confidence bounds. It should be noted that the apparent perfect classification for the hard rock class is still constrained by the sample size and operating conditions of the present EPB project, and thus mainly reflects the attainable performance ceiling under the specific dataset and feature design. Nevertheless, both the central estimates and the narrow confidence intervals indicate that NGOpt-CatBoost provides strong and stable predictive capacity for soft soil, mixed strata and hard rock, offering a quantitative basis for ground-type-informed adjustments in construction risk assessment.
Notably, EPB shield onboard sensors inevitably exhibit noise, drift, and calibration offsets, while excavation introduces transient deviations driven by cutter wear and operational latency [17,60]. Conventional classification metrics alone cannot reveal decision stability under such real-world perturbations. Accordingly, to characterize the robustness of the selected model (NGOpt-CatBoost) under input disturbances, a Monte Carlo perturbation analysis was conducted separately for each true class in the test set [61]. The procedure perturbed standardized and PCA-transformed test features by applying independent, unbiased random scaling of ±10%, repeated 1000 times, thereby generating for every sample an empirical distribution of the one-vs-rest probability for its true class. From these distributions, the per-sample mean, standard deviation, and the 95% prediction interval with its width are extracted. A decision-stability indicator is further defined as the expected flip rate, namely the frequency with which the perturbed prediction crosses the decision threshold relative to the baseline label. In addition, per-class Brier scores and calibration errors, including ECE, RMSCE, and MCE, are reported. The Brier score quantifies the mean squared gap between predicted probabilities and binary outcomes for a given one-vs-rest class; lower values indicate probabilities that are both well calibrated and suitably sharp rather than clustered around the threshold 0.5. Expected Calibration Error (ECE) summarizes the average absolute difference between predicted confidence and empirical accuracy across probability bins (equal-width bins over [0, 1] with empty bins ignored), providing a global view of calibration; smaller ECE reflects closer alignment between confidence and realized frequency [62,63]. 
Root Mean Squared Calibration Error (RMSCE) is the square-rooted, bin-weighted mean of squared differences and therefore places extra emphasis on larger local mismatches; it is useful for detecting pockets of pronounced miscalibration that may be muted in ECE [62]. Maximum Calibration Error (MCE) captures the worst per-bin deviation and highlights the most severe local misfit between confidence and accuracy [62,63]. Together, these metrics complement ROC/AUC by assessing not only ranking ability but also the reliability of probabilistic outputs, which is essential for threshold selection and risk-aware decision making. The resulting class-wise Monte Carlo summaries are visualized in Figure 13, Figure 14 and Figure 15, with quantitative statistics given below.
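The four probability-quality measures can be sketched for a single one-vs-rest class as below. The binning follows the description above (equal-width bins over [0, 1], empty bins ignored); comparing the mean predicted class probability with the empirical class frequency in each bin, the choice of ten bins, and the sample values are assumptions made for illustration.

```python
import math

def calibration_metrics(probs, outcomes, n_bins=10):
    """Brier score, ECE, RMSCE and MCE for one one-vs-rest class.
    probs: predicted probability of the class; outcomes: 1 if the sample
    belongs to the class, else 0."""
    n = len(probs)
    brier = sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / n
    bins = [[] for _ in range(n_bins)]
    for p, o in zip(probs, outcomes):
        b = min(int(p * n_bins), n_bins - 1)   # p == 1.0 falls in the last bin
        bins[b].append((p, o))
    ece = sq = mce = 0.0
    for members in bins:
        if not members:
            continue                            # empty bins are ignored
        conf = sum(p for p, _ in members) / len(members)
        freq = sum(o for _, o in members) / len(members)
        gap = abs(conf - freq)
        w = len(members) / n                    # bin weight by sample share
        ece += w * gap
        sq += w * gap ** 2
        mce = max(mce, gap)
    return {"brier": brier, "ece": ece, "rmsce": math.sqrt(sq), "mce": mce}

# Hypothetical one-vs-rest probabilities and binary outcomes.
probs = [0.05, 0.10, 0.15, 0.90, 0.95, 0.85, 0.55, 0.45]
outcomes = [0, 0, 0, 1, 1, 1, 0, 1]
m = calibration_metrics(probs, outcomes)
print({name: round(value, 4) for name, value in m.items()})
```

With this construction ECE ≤ RMSCE ≤ MCE always holds, so an MCE far above the ECE points to a sparsely populated bin with a large local mismatch rather than to poor calibration overall.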
Class 1 (soft-soil dominated, Figure 13) exhibits a low-variance and high-consistency profile. The mean standard deviation is about 0.0110, and the median is about 0.0003, indicating that most samples are insensitive to perturbations, with a slightly right-skewed tail. The average width of the 95% prediction interval is about 0.0402 with a median of about 0.0010, implying very tight probability bands for the vast majority of cases. At the decision level, the expected flip rate is approximately 1.13%, meaning that only a very small subset of samples crosses the threshold under perturbations. In terms of probability quality and calibration, the Brier score decreases from 0.0313 at baseline (i.e., NGOpt-CatBoost’s prediction without input perturbation) to 0.0300 when using the Monte Carlo mean probability; RMSCE declines from 0.1761 to 0.1723 and MCE from 0.8928 to 0.8576, suggesting a modest denoising effect when aggregating across perturbations.
Class 2 (mixed soft-soil and hard-rock, Figure 14) is slightly more sensitive than Class 1 yet remains within a robust range. Under random perturbation, the mean standard deviation is about 0.0132 and the median is about 0.0029; the average 95% interval width is about 0.0487 with a median of about 0.0103. Decision consistency remains high, with an expected flip rate of approximately 1.45%. Calibration degrades slightly under perturbation: the Brier score increases from 0.0425 to 0.0443, ECE rises from 0.0827 to 0.0848, and both RMSCE and MCE show small upward shifts, reflecting greater sensitivity of mixed-strata samples to modest input changes; however, these post-perturbation increases remain small in magnitude.
Class 3 (hard-rock dominated, Figure 15) shows larger dispersion and wider intervals due to the combined effects of a smaller sample size and greater operating heterogeneity. Under the Monte Carlo perturbation, the mean standard deviation is about 0.0263 with a median of about 0.0050; the mean 95% interval width is about 0.0908 with a median of about 0.0203. Decision stability remains high but is weaker than in the other two classes, with an expected flip rate of approximately 3.38%. Probability and calibration metrics deteriorate under perturbation, with the Brier score increasing from 0.0296 to 0.0325, ECE from 0.0924 to 0.0994, and both RMSCE and MCE trending upward. This pattern is consistent with a limited sample size and a higher share of cases close to the classification boundary in the hard-rock regime, and it indicates the value of incorporating geological priors or cost-sensitive thresholds to maintain conservative decisions in hard-rock-dominated segments.
Across the three ground classes, a high level of consistency is maintained under ±10% input perturbations, with Class 1 appearing most robust, Class 2 slightly less so, and Class 3 somewhat more sensitive, given its smaller sample size and more diverse operating states. The per-class boxplots show narrow probability boxes for most samples, with NGOpt-CatBoost’s predicted probability closely aligned with the box medians; only a small number of long-whisker or outlying cases appear, and these correspond to potential boundary-adjacent instances.
Under the ±10% input perturbation regime used for Figure 13, Figure 14 and Figure 15, the patterns described above can be interpreted as scenarios in which each of the three ground classes experiences only a low rate of prediction change. For soft soil, mixed strata and hard rock, the expected flip rates are approximately 1.13%, 1.45% and 3.38%, respectively. Although all three classes are subjected to stochastic disturbances in their input features, only a small subset of samples in each class actually changes its predicted label, and these flips are largely confined to instances lying close to the decision boundaries. Consistent with this low rate of prediction change across the three classes, the probability calibration metrics show only minor deviations from the unperturbed baseline: the Brier scores and the ECE, RMSCE and MCE values remain very close to their original levels. In other words, under low-perturbation conditions where only a few samples per class experience label flips, the calibrated probabilistic outputs of NGOpt-CatBoost are essentially preserved.
From an engineering perspective, such low flip rate scenarios correspond to realistic situations in EPB tunnelling, for example, minor revisions to geological logs, small adjustments of decision thresholds, or the reclassification of a limited number of borderline rings near soft–mixed or mixed–hard interfaces. The observation that probability calibration metrics are only weakly affected in this regime indicates that the model’s predictive probabilities are robust to plausible levels of distributed uncertainty among the three ground types. This complements the point estimates and global performance metrics by showing that, even when some ambiguity in the local geological conditions is present, the probability calibration of the classifier remains stable and retains its interpretability for decision support.
To assess the robustness of the selected model under different levels of input perturbation, four amplitudes were considered: 5%, 10%, 15%, and 20%. Two indicators were tracked for each geological class: the mean expected flip rate (EFR) and the mean width of the 95% predictive interval (PIW95), as illustrated in Figure 16. Both indicators increase monotonically with perturbation amplitude, but the rates and shapes of increase differ by class. Class 1 (soft-soil dominated) exhibits the strongest robustness: EFR rises modestly from 0.0061 to 0.0170, an absolute gain of 0.0109, and PIW95 increases from 0.0210 to 0.0702, an absolute gain of 0.0492. The growth is nearly linear and mild, indicating stable decisions across a fairly wide range of sensor or sampling disturbance. Class 2 (mixed strata) is slightly more sensitive than Class 1 and shows an early jump followed by a plateau: EFR moves from 0.0050 to 0.0146 between 5% and 10% and then grows only marginally to 0.0166 at 20%; PIW95 increases from 0.0277 to 0.0769. This suggests a small subset of samples near the decision boundary that react to light perturbations, while larger perturbations do not trigger proportional additional instability. Class 3 (hard-rock dominated) is the most sensitive: EFR increases from 0.0199 to 0.0476 with an almost linear trend, and PIW95 expands from 0.0507 to 0.1678. Across all amplitudes, both EFR and PIW95 for Class 3 remain clearly higher than for the other classes, implying faster probability spread and a greater likelihood that the predicted ground-type label flips when noise is injected.
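The amplitude sweep can be illustrated with a minimal sketch. Several elements are assumptions: the fitted classifier is replaced by a hypothetical logistic function of two features, the perturbation is modelled as independent multiplicative noise drawn uniformly from ±amplitude (the distribution is not specified in the text), and a flip is counted when the perturbed probability falls on the other side of a 0.5 threshold relative to the unperturbed prediction.

```python
import math
import random

def toy_probability(x):
    """Hypothetical stand-in for a fitted model's one-vs-rest probability."""
    z = 2.0 * x[0] - 1.5 * x[1]
    return 1.0 / (1.0 + math.exp(-z))

def perturbation_summary(samples, predict, amplitude, n_rep=1000,
                         threshold=0.5, seed=0):
    """Mean expected flip rate (EFR) and mean 95% interval width (PIW95)
    under noise x * (1 + u), with u ~ Uniform(-amplitude, +amplitude)."""
    rng = random.Random(seed)
    flip_rates, widths = [], []
    for x in samples:
        base_label = predict(x) >= threshold       # unperturbed decision
        probs = []
        for _ in range(n_rep):
            noisy = [v * (1.0 + rng.uniform(-amplitude, amplitude)) for v in x]
            probs.append(predict(noisy))
        flips = sum((p >= threshold) != base_label for p in probs)
        probs.sort()
        lo = probs[int(0.025 * n_rep)]
        hi = probs[min(n_rep - 1, int(0.975 * n_rep))]
        flip_rates.append(flips / n_rep)
        widths.append(hi - lo)
    n = len(samples)
    return sum(flip_rates) / n, sum(widths) / n

# Hypothetical samples: two far from the boundary, two close to it.
samples = [[2.0, 0.5], [1.0, 1.2], [-1.5, 0.8], [0.3, 0.35]]
results = {}
for amp in (0.05, 0.10, 0.15, 0.20):
    results[amp] = perturbation_summary(samples, toy_probability, amp)
    efr, piw95 = results[amp]
    print(f"amplitude {amp:.0%}: EFR = {efr:.4f}, PIW95 = {piw95:.4f}")
```

In this toy setting only the two boundary-adjacent samples ever flip, which mirrors the interpretation above that label changes are concentrated in a small subset of samples close to the decision boundary.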
Viewed as a whole, Figure 16 characterizes how classification stability and predictive uncertainty evolve as sensor noise or input disturbances increase. Even under relatively aggressive perturbations of ±20%, the mean expected flip rate remains below about 2% for the soft-soil and mixed-strata classes and below about 5% for the hard-rock class, indicating that only a small fraction of samples in each class change their predicted ground-type labels. In parallel, the monotonic growth in PIW95 quantifies the gradual widening of the predictive probability envelopes with increasing noise amplitude, most prominently for hard-rock-dominant conditions. For the low-perturbation regimes (5–10%), where label-change rates across all three classes remain at the sub-percent to low-percent level, the associated probability calibration metrics (Brier score, ECE, RMSCE and MCE) deviate only marginally from their baseline values and the reliability curves stay close to the diagonal, showing that the probabilistic outputs remain well calibrated despite the injected noise. From an engineering standpoint, these low-flip-rate scenarios correspond to practical situations such as minor revisions to geological logs or small recalibrations of thrust, torque and chamber-pressure sensors, in which only a few borderline rings near soft–mixed or mixed–hard interfaces are relabelled while the overall class balance is preserved. The combination of low flip rates, limited changes in calibration metrics and controlled interval expansion therefore suggests that NGOpt-CatBoost maintains statistically stable decisions over a realistic range of disturbance levels, while at the same time signalling increased uncertainty in a form that can be directly incorporated into risk-aware threshold setting and on-site operational triggers.
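For readers who wish to reproduce the calibration indicators named above, the following sketch computes them under commonly used definitions, which are assumed here and may differ in detail from those used in the paper: the multiclass Brier score is the mean squared distance between the probability vector and the one-hot label, and ECE, RMSCE and MCE are the weighted mean, weighted root mean square and maximum of the gap between accuracy and mean confidence over equal-width confidence bins. The function name, the bin count and the example data are illustrative.

```python
import math


def calibration_metrics(probs, labels, n_bins=10):
    """Brier score, ECE, RMSCE and MCE from top-label confidence bins."""
    n = len(labels)
    # Multiclass Brier score: squared distance to the one-hot label vector.
    brier = sum(
        sum((p - (1.0 if k == y else 0.0)) ** 2 for k, p in enumerate(row))
        for row, y in zip(probs, labels)
    ) / n

    # Group samples by the confidence of the predicted (top) class.
    bins = [[] for _ in range(n_bins)]
    for row, y in zip(probs, labels):
        conf = max(row)
        pred = row.index(conf)
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if pred == y else 0.0))

    ece = 0.0
    squared = 0.0
    mce = 0.0
    for members in bins:
        if not members:
            continue
        weight = len(members) / n
        mean_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(hit for _, hit in members) / len(members)
        gap = abs(accuracy - mean_conf)
        ece += weight * gap
        squared += weight * gap ** 2
        mce = max(mce, gap)
    return {"brier": brier, "ece": ece, "rmsce": math.sqrt(squared), "mce": mce}


# Invented three-class probabilities and labels, for illustration only.
example_probs = [
    [0.92, 0.06, 0.02],
    [0.10, 0.85, 0.05],
    [0.05, 0.15, 0.80],
    [0.55, 0.40, 0.05],
    [0.20, 0.30, 0.50],
]
example_labels = [0, 1, 2, 1, 2]
print(calibration_metrics(example_probs, example_labels))
```

Evaluating these quantities once on the unperturbed test probabilities and again on the perturbed ones gives the baseline-versus-noise comparison described in the text.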
The above uncertainty analyses also provide a basis for discussing within-class variability and local weak or strong pockets. In practice, even rings labelled as belonging to the same ground type may encompass heterogeneous materials and strength levels along the circumference or over short longitudinal distances. In the present framework, such intra-class heterogeneity is not resolved explicitly. Instead, it manifests as dispersion in the feature space and, consequently, in the predicted probabilities and uncertainty measures. For rings whose operating signals remain close to the typical patterns of their assigned class, the model tends to produce high predicted probabilities with narrow predictive intervals and very low flip rates, indicating stable decisions despite moderate sensor noise. By contrast, rings influenced by local weak or strong pockets that are underrepresented in the training distribution are more likely to appear as boundary-adjacent cases, with slightly reduced predicted probabilities for the nominal class, wider probability intervals and elevated flip rates, particularly under larger perturbation amplitudes. In such situations, the classifier still provides a useful warning signal in the form of increased uncertainty, rather than issuing overconfident but potentially misleading hard labels. At the same time, the ring-wise labelling scheme implies that abrupt sub-ring changes which do not sufficiently affect the aggregate machine response may escape detection; this limitation is inherent to the resolution of the available data and motivates future integration with higher-resolution monitoring and spatially variable numerical modelling.
Beyond the numerical summaries above, the uncertainty-aware outputs are intended to be used as practical decision aids rather than as purely statistical descriptors. For each ring, the joint information carried by the predicted class probabilities, their bootstrap confidence intervals and the perturbation-driven flip-rate diagnostics can be mapped to interpretable risk categories for soft-soil, mixed and hard-rock operating regimes. Rings with a clearly dominant class probability and narrow intervals, together with very low expected flip rates, correspond to a high-confidence regime. In such cases, site engineers can adjust EPB thrust, torque, chamber pressure and conditioning parameters more decisively toward the setting recommended for the predicted ground type, for example, by increasing cutterhead torque and chamber pressure when a hard-rock-dominated regime is identified with high confidence, or by relaxing these parameters and optimizing conditioning when the ground is confidently classified as soft soil. Rings characterized by intermediate probabilities, wide intervals or moderately elevated flip rates fall into an ambiguous regime. For these segments, conservative strategies are preferable, i.e., parameter changes should be implemented in smaller increments, and operators should explicitly watch for signs of face deformation or torque spikes.
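The mapping from ring-wise diagnostics to the two regimes described above could be sketched as follows. The paper does not report numerical cut-offs, so the thresholds `min_prob`, `max_width` and `max_flip`, the `RingDiagnostics` container and the example rings are all hypothetical and would need to be set by site engineers.

```python
from dataclasses import dataclass

CLASS_NAMES = ("soft soil", "mixed strata", "hard rock")


@dataclass(frozen=True)
class RingDiagnostics:
    ring_id: int
    class_probs: tuple      # predicted probabilities, ordered as CLASS_NAMES
    interval_width: float   # width of the interval on the top-class probability
    flip_rate: float        # perturbation-driven flip rate for this ring


def assess_ring(d, min_prob=0.90, max_width=0.10, max_flip=0.02):
    """Return (predicted ground type, regime); all thresholds are hypothetical."""
    top = max(d.class_probs)
    label = CLASS_NAMES[d.class_probs.index(top)]
    confident = (
        top >= min_prob
        and d.interval_width <= max_width
        and d.flip_rate <= max_flip
    )
    return label, "high-confidence" if confident else "ambiguous"


# Invented rings, for illustration only.
rings = [
    RingDiagnostics(101, (0.02, 0.03, 0.95), 0.04, 0.005),
    RingDiagnostics(102, (0.10, 0.52, 0.38), 0.21, 0.060),
]
for ring in rings:
    label, regime = assess_ring(ring)
    print(f"ring {ring.ring_id}: {label} ({regime})")
```

Requiring all three conditions at once is a deliberately conservative design choice: a ring with a dominant probability but a wide interval or an elevated flip rate is still treated as ambiguous.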
In addition, the uncertainty indicators can be used to define quantitative operational triggers. For example, when hard-rock dominated rings exhibit a combination of rising flip rates and rapidly widening predictive intervals as perturbation amplitude increases, this pattern can be interpreted as an early warning that local conditions are approaching the limits of the training distribution. In such situations, it is advisable to adopt more conservative setpoints and to treat the segment as potentially more adverse than its nominal label suggests until further evidence is gathered. Conversely, when a stretch of rings maintains low flip rates and stable, narrow prediction intervals under ±10% perturbations, the associated ground-type classification can be regarded as robust to plausible levels of sensor noise and local variability, which justifies gradual optimization of operating parameters to improve efficiency. In practice, the classifier and its uncertainty indicators can be embedded as an auxiliary module within existing EPB data platforms, providing rapid recommendations for parameter ranges and flagging rings where elevated uncertainty should trigger conservative adjustments or additional investigation. In this way, the proposed uncertainty analysis is directly linked to risk-aware tunnelling decisions, rather than treating accuracy metrics as an end in themselves.
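As a worked illustration of such a trigger, the sketch below takes the class-level EFR and PIW95 values reported in this section at 5% and 20% amplitude, converts them to average growth per percentage point of amplitude, and compares them with two limits. The limits `EFR_SLOPE_LIMIT` and `PIW_SLOPE_LIMIT` are hypothetical and are not taken from the paper; in practice the same check would be applied to ring-level rather than class-level diagnostics.

```python
# (value at 5% amplitude, value at 20% amplitude), as quoted in the text.
REPORTED = {
    "Class 1 (soft soil)": {"efr": (0.0061, 0.0170), "piw95": (0.0210, 0.0702)},
    "Class 2 (mixed strata)": {"efr": (0.0050, 0.0166), "piw95": (0.0277, 0.0769)},
    "Class 3 (hard rock)": {"efr": (0.0199, 0.0476), "piw95": (0.0507, 0.1678)},
}
AMPLITUDE_SPAN = 20 - 5  # percentage points between the two amplitudes

# Hypothetical limits on growth per percentage point of amplitude.
EFR_SLOPE_LIMIT = 0.0015
PIW_SLOPE_LIMIT = 0.0050


def growth_per_point(start, end, span=AMPLITUDE_SPAN):
    """Average increase per percentage point of perturbation amplitude."""
    return (end - start) / span


def early_warning(values):
    """True when both flip rate and interval width grow faster than the limits."""
    efr_slope = growth_per_point(*values["efr"])
    piw_slope = growth_per_point(*values["piw95"])
    return efr_slope > EFR_SLOPE_LIMIT and piw_slope > PIW_SLOPE_LIMIT


for name, values in REPORTED.items():
    action = "conservative setpoints" if early_warning(values) else "gradual optimization"
    print(f"{name}: {action}")
```

With these illustrative limits only the hard-rock class exceeds both growth rates, which agrees with the qualitative ranking of sensitivity given earlier in this section.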

5. Limitations and Prospects

This study draws on publicly available data from EPB shield tunnelling on the Guangzhou to Foshan intercity railway and develops a geological-type classifier based on onboard measurements and physically derived variables. Evaluation is conducted within a comprehensive metric system and an explicit uncertainty framework. Although the optimal model exhibits strong performance in terms of accuracy, AUC, and robustness, several intrinsic limitations at the data and methodological levels should be acknowledged. The dataset reflects a high degree of homogeneity in project and equipment settings, as all samples come from the same line and the same machine. The primary objective is therefore to examine the feasibility of the proposed predictive techniques and to provide a methodological reference for subsequent work, rather than to claim immediate generalisability. The numerical results reported for the Guangzhou–Foshan drive should not be extrapolated to other projects without additional calibration and external validation. Accordingly, the broader applicability discussed in this paper should be understood in a methodological sense, i.e., as a reusable workflow for similar EPB datasets, rather than as evidence that the current model can be directly deployed on shield tunnels with different alignments, machine configurations or ground conditions. At the same time, the feature definitions, preprocessing steps and learning algorithms are formulated at a level that is compatible with ring-wise EPB logs in general, so that, once sufficient labelled data from new projects become available, the same workflow can be retrained and externally validated to assess cross-project transferability in a controlled manner. In addition, the ground-type labels are defined at the ring scale as three discrete categories, i.e., soft-soil dominated, mixed strata and hard-rock dominated. 
This labelling scheme inevitably averages over sub-ring spatial variability in fabric, and the present classifier therefore cannot explicitly resolve very local weak or strong pockets that do not sufficiently alter the aggregate operational response. Combining the proposed data-driven framework with higher-resolution monitoring or with random deformation analyses in spatially variable soils is an important avenue for future work.
On the modelling side, standardization and PCA are used to mitigate multicollinearity and scale differences, followed by tree-based learning and hyperparameter optimization. The linear orthogonalization induced by PCA helps stabilize estimation but sacrifices part of the interpretability and may fail to capture latent nonlinear structures. Although the hyperparameter search spans representative strategies, including information-driven population search, adaptive portfolio allocation, and sequential model-based optimization, both the compute budget and the search space are constrained, so selection bias arising from the accuracy–computation trade-off cannot be ruled out. Moreover, the relative advantages of different optimizers under diverse engineering scenarios remain to be established through systematic comparison.
In light of these limitations, future work can proceed along several directions. On the data side, richer corpora should be built that incorporate more granular stratigraphic descriptors and a broader range of operating conditions. On the methodological side, a wider spectrum of supervised learning techniques should be explored, together with stricter external validation and probabilistic calibration, so that the proposed approach remains stable and usable in more complex tunnelling environments.

6. Conclusions

In urban underground construction, the ground conditions ahead of the TBM face often exhibit rapid, nonstationary changes. Misclassification of strata can readily lead to parameter mismatches. When such mismatches persist, they typically manifest as increased energy consumption, accelerated cutter wear, more frequent unplanned stoppages, and remedial interventions, which cumulatively translate into higher construction costs, especially in constrained urban rail projects. Establishing a fast and effective method for predicting geological categories is therefore essential for adaptive process control and proactive risk management. Against this background, this study uses ring-by-ring EPB tunnelling data and targets reliable discrimination among three strata types with explicit confidence characterization. A classification framework is built by jointly leveraging onboard operating measurements and physically interpretable derived quantities. Principal component analysis is introduced to mitigate multicollinearity and scale disparities, followed by a comparative assessment of multiple tree-based learners and the application of robust hyperparameter optimization. To avoid overconfidence from single-point metrics, the evaluation is augmented on the test set with stratified-bootstrap confidence intervals and Monte Carlo–based probability propagation and calibration under input perturbations, enabling a precise depiction of robustness and credibility under finite samples and realistic noise.
The optimized NGOpt-CatBoost achieves the best test accuracy (Acc = 0.9625) with a balanced macro-precision and macro-recall of about 0.97. In the uncertainty analysis, stratified-bootstrap results indicate compact test-set intervals (Acc = 0.9625, 95% CI [0.9417, 0.9833]; macro-precision = 0.9715, 95% CI [0.9559, 0.9876]; macro-recall = 0.9716, 95% CI [0.9557, 0.9874]). The train–test differences are consistent with the respective intervals, suggesting that overfitting remains controlled. Monte Carlo input propagation at the class level shows low flip probabilities across the three target categories (1.13%, 1.45%, and 3.38%), with wider 95% prediction intervals for the hard-rock dominant class; calibration indicators (Brier score, ECE, RMSCE, MCE) change only modestly under perturbations, indicating high decision consistency and probabilistic stability in the presence of sensor noise and sampling deviations.
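The stratified bootstrap behind intervals of this kind can be sketched as follows, assuming a percentile interval and resampling with replacement within each true class so that the class proportions of the test set are preserved in every replicate. The number of replicates, the seed, the class sizes and the placement of errors in the synthetic test set are invented; only its overall accuracy is chosen to coincide with the reported 0.9625.

```python
import random


def stratified_bootstrap_ci(y_true, y_pred, n_boot=2000, alpha=0.05, seed=42):
    """Point accuracy and percentile CI from class-stratified resampling."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(y_true):
        by_class.setdefault(y, []).append(i)

    stats = []
    for _ in range(n_boot):
        correct = 0
        for indices in by_class.values():
            # Resample within each class, keeping the class size fixed.
            draw = rng.choices(indices, k=len(indices))
            correct += sum(y_true[i] == y_pred[i] for i in draw)
        stats.append(correct / len(y_true))

    stats.sort()
    lower = stats[int((alpha / 2) * (n_boot - 1))]
    upper = stats[int((1 - alpha / 2) * (n_boot - 1))]
    point = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return point, lower, upper


# Synthetic test set: invented class sizes (100, 80, 60) and invented errors.
y_true = [0] * 100 + [1] * 80 + [2] * 60
y_pred = list(y_true)
for i in (0, 1):                  # two soft-soil rings predicted as mixed
    y_pred[i] = 1
for i in (100, 101, 102):         # three mixed rings predicted as hard rock
    y_pred[i] = 2
for i in (180, 181, 182, 183):    # four hard-rock rings predicted as mixed
    y_pred[i] = 1

point, lower, upper = stratified_bootstrap_ci(y_true, y_pred)
print(f"accuracy={point:.4f}, 95% CI=[{lower:.4f}, {upper:.4f}]")
```

The same resampling loop applies to macro-precision and macro-recall by replacing the accuracy computation inside the loop with the corresponding metric.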
Overall, the study delivers a technical workflow that integrates feature engineering, model selection and uncertainty quantification on ring-wise data from a single EPB shield project in mixed urban ground. Within this case study setting, the framework yields uncertainty-aware ground-type predictions that could support parameter adjustment and risk assessment for specific (or similar) underground engineering projects. More broadly, the workflow is intended as a methodological reference for EPB drives with similar machine types and data acquisition schemes, and any direct deployment in other projects would require additional calibration and external validation.

Author Contributions

Conceptualization: S.H. and J.Z.; Methodology: S.H. and Y.C.; Investigation: S.H. and J.Z.; Resources: M.K. and J.Z.; Writing—original draft preparation: S.H. and J.Z.; Writing—review and editing: M.K. and J.Z.; Visualization: S.H. and Y.C.; Supervision: J.Z.; Funding acquisition: J.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially funded by the National Natural Science Foundation of China (52474121, 42177164).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data used in this study are from published research: Yan [36].

Acknowledgments

The authors want to thank all the individuals who provided help and cooperation.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Broere, W. Urban underground space: Solving the problems of today’s cities. Tunn. Undergr. Space Technol. 2016, 55, 245–248.
  2. Yu, P.; Liu, H.; Wang, Z.; Fu, J.; Zhang, H.; Wang, J.; Yang, Q. Development of urban underground space in coastal cities in China: A review. Deep Undergr. Sci. Eng. 2023, 2, 148–172.
  3. Guglielmetti, V.; Grasso, P.; Mahtab, A.; Xu, S. Mechanized Tunnelling in Urban Areas: Design Methodology and Construction Control; CRC Press: London, UK, 2008.
  4. Maidl, B.; Herrenknecht, M.; Maidl, U.; Wehrmeyer, G. Mechanised Shield Tunnelling, 2nd ed.; Ernst & Sohn: Berlin, Germany, 2012.
  5. Anagnostou, G.; Kovári, K. Face stability conditions with earth-pressure-balanced shields. Tunn. Undergr. Space Technol. 1996, 11, 165–173.
  6. Peila, D. Soil conditioning for EPB shield tunnelling. KSCE J. Civ. Eng. 2014, 18, 831–836.
  7. Peila, D.; Martinelli, D.; Todaro, C.; Luciani, A. Soil conditioning in EPB shield tunnelling—An overview of laboratory tests. Geomech. Tunn. 2019, 12, 491–498.
  8. Tang, S.; Zhang, X.; Liu, Q.; Zhang, Q.; Li, X.; Wang, H. Experimental study on the influences of cutter geometry and material on scraper wear during shield TBM tunnelling in abrasive sandy ground. J. Rock Mech. Geotech. Eng. 2024, 16, 410–425.
  9. Elbaz, K.; Shen, S.L.; Cheng, W.C.; Arulrajah, A. Cutter-disc consumption during earth-pressure-balance tunnelling in mixed strata. Geotech. Eng. 2018, 171, 363–376.
  10. Ren, D.J.; Shen, S.L.; Zhou, A.; Chai, J.C. Prediction of lateral continuous wear of cutter ring in soft ground with quartz sand. Comput. Geotech. 2018, 103, 86–92.
  11. Mucha, K. Application of rock abrasiveness and rock abrasivity test methods—A review. Sustainability 2023, 15, 11243.
  12. Sun, Z.; Zhao, H.; Hong, K.; Chen, K.; Zhou, J.; Li, F.; Zhang, B.; Song, F.; Yang, Y.; He, R. A practical TBM cutter wear prediction model for disc cutter life and rock wear ability. Tunn. Undergr. Space Technol. 2019, 85, 92–99.
  13. Li, S.; Liu, B.; Xu, X.; Nie, L.; Liu, Z.; Song, J.; Sun, H.; Chen, L.; Fan, K. An overview of ahead geological prospecting in tunneling. Tunn. Undergr. Space Technol. 2017, 63, 69–94.
  14. Zaki, N.F.M.; Ismail, M.A.M.; Abidin, M.H.Z.; Madun, A. Geological prediction ahead of tunnel face in limestone formation tunnel using multi-modal geophysical surveys. J. Phys. Conf. Ser. 2018, 995, 012114.
  15. Abate, G.; Catalano, E.; Ippolito, F.; Spagnoli, G. An early-warning system to validate the soil profile during TBM tunnelling by applying the HVSR method to TBM-induced microtremors. Geosciences 2022, 12, 113.
  16. Yang, T.; Wen, T.; Huang, X.; Liu, B.; Shi, H.; Liu, S.; Peng, X.; Sheng, G. Predicting Model of Dual-Mode Shield Tunneling Parameters in Complex Ground Using Recurrent Neural Networks and Multiple Optimization Algorithms. Appl. Sci. 2024, 14, 581.
  17. Huang, X.; Liu, Q.; Liu, H.; Zhang, P.; Pan, S.; Zhang, X.; Fang, J. Development and in-situ application of a real-time monitoring system for the interaction between TBM and surrounding rock. Tunn. Undergr. Space Technol. 2018, 81, 187–208.
  18. Liu, J.; Li, S.; Wang, Y.; Wang, X.; Sun, Z. Application of specific energy in evaluation of geological conditions ahead of tunnel face. Energies 2020, 13, 909.
  19. Cardu, M.; Coragliotto, M.; Oreste, P.; Papini, M. Performance analysis of tunnel boring machines for rock excavation. Appl. Sci. 2021, 11, 2794.
  20. Zhou, X.; Zhang, Y.; Wang, W.; Li, X.; Lin, J. Performance evaluation of TBM using an improved load reversal method. Machines 2023, 11, 141.
  21. Teale, R. The concept of specific energy in rock drilling. Int. J. Rock Mech. Min. Sci. Geomech. Abstr. 1965, 2, 57–73.
  22. Yu, H.; Mooney, M.A. Characterizing the as-encountered ground condition with tunnel boring machine data using semi-supervised learning. Comput. Geotech. 2023, 154, 105159.
  23. Zhao, D.; He, Y.; Chen, X.; Wang, J.; Liu, Y.; Zhang, Q.; Bai, J.; Liu, R. Data-driven intelligent prediction of TBM surrounding rock and personalized evaluation of disaster-inducing factors. Tunn. Undergr. Space Technol. 2024, 148, 105768.
  24. Huang, Y.; Hu, X.; Pang, S.; Fu, W.; Chang, S.; Gao, B.; Hua, W. TBM enclosure rock grade prediction method based on multi-source feature fusion. Appl. Sci. 2025, 15, 6684.
  25. Feng, S.; Wang, S. Theoretical considerations of field penetration index model and its application in TBM performance prediction. Geomech. Geophys. Geo Energ. Geo Resour. 2023, 9, 84.
  26. Sun, M.; Chen, S.; He, H.; Wang, W.; Song, K.; Lin, X. Classification and prediction of rock mass drillability for a tunnel boring machine based on operational data mining. Front. Earth Sci. 2024, 12, 1518844.
  27. She, L.; Hu, C.; Li, Y.; Hu, M.; Liu, Z.; Lei, F.; Wang, X.; Li, J. An empirical method for estimating TBM penetration rate using tunnelling specific energy. Tunn. Undergr. Space Technol. 2024, 144, 105525.
  28. Zhang, Q.; Liu, Z.; Tan, J. Prediction of geological conditions for a tunnel boring machine using big operational data. Autom. Constr. 2019, 100, 73–83.
  29. Zhao, J.; Shi, M.; Hu, G.; Song, X.; Zhang, C.; Tao, D.; Wu, W. A Data-Driven Framework for Tunnel Geological-Type Prediction Based on TBM Operating Data. IEEE Access 2019, 7, 66703–66713.
  30. Jung, J.H.; Chung, H.; Kwon, Y.S.; Lee, I.M. An ANN to predict ground condition ahead of tunnel face using TBM operational data. KSCE J. Civ. Eng. 2019, 23, 3200–3206.
  31. Liu, Q.; Wang, X.; Huang, X.; Yin, X. Prediction model of rock mass class using classification and regression tree integrated AdaBoost algorithm based on TBM driving data. Tunn. Undergr. Space Technol. 2020, 106, 103595.
  32. Liu, S.; Yang, K.; Cai, J.; Zhou, S.; Zhang, Q. Prediction of Geological Parameters during Tunneling by Time Series Analysis on In Situ Data. Comput. Intell. Neurosci. 2021, 2021, 3904273.
  33. Yu, H.; Tao, J.; Qin, C.; Xiao, D.; Sun, H.; Liu, C. Rock mass type prediction for tunnel boring machine using a novel semi-supervised method. Measurement 2021, 179, 109545.
  34. Hou, S.; Liu, Y.; Yang, Q. Real-time prediction of rock mass classification based on TBM operation big data and stacking technique of ensemble learning. J. Rock Mech. Geotech. Eng. 2022, 14, 123–143.
  35. Yan, T.; Shen, S.L.; Zhou, A.; Chen, X. Prediction of geological characteristics from shield operational parameters by integrating grid search and K-fold cross validation into stacking classification algorithm. J. Rock Mech. Geotech. Eng. 2022, 14, 1292–1303.
  36. Yan, T. Data on prediction of geological characteristics during shield tunnelling in mixed soil and rock ground. Data Brief 2022, 45, 108726.
  37. Fu, X.; Wu, M.; Tiong, R.L.K.; Zhang, L. Data-driven real-time advanced geological prediction in tunnel construction using a hybrid deep learning approach. Autom. Constr. 2023, 146, 104672.
  38. Pan, Y.; Wu, M.; Zhang, L.; Chen, J. Time series clustering-enabled geological condition perception in tunnel boring machine excavation. Autom. Constr. 2023, 153, 104954.
  39. Katuwal, T.B.; Panthi, K.K.; Basnet, C.B. Machine Learning Approach for Rock Mass Classification with Imbalanced Database of TBM Tunnelling in Himalayan Geology. Rock Mech. Rock Eng. 2024, 58, 11293–11318.
  40. Chen, X.; Zhang, J.; Hu, Y.; Wang, W.; Liu, Y. Random large-deformation modelling on face stability considering dynamic excavation process during tunnelling through spatially variable soils. Can. Geotech. J. 2025, 62, 1–21.
  41. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. Adv. Neural Inf. Process. Syst. 2018, 31, 6639–6649.
  42. Bergstra, J.S.; Bardenet, R.; Bengio, Y.; Kégl, B. Algorithms for hyper-parameter optimization. Adv. Neural Inf. Process. Syst. 2011, 24, 2546–2554.
  43. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’19), Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631.
  44. Ahmadianfar, I.; Heidari, A.A.; Noshadian, S.; Chen, H.; Gandomi, A.H. INFO: An efficient optimization algorithm based on weighted mean of vectors. Expert Syst. Appl. 2022, 195, 116516.
  45. Trajanov, R.; Nikolikj, A.; Cenikj, G.; Teytaud, F.; Videau, M.; Teytaud, O.; Eftimov, T.; López-Ibáñez, M.; Doerr, C. Improving Nevergrad’s algorithm selection wizard NGOpt through automated algorithm configuration. In Parallel Problem Solving from Nature–PPSN XVII; Bäck, T., Preuss, M., Deutz, A., Eds.; Lecture Notes in Computer Science; Springer: Cham, Switzerland, 2022; Volume 13398, pp. 18–31.
  46. Huang, S.; Zhou, J. Refined Approaches for Open Stope Stability Analysis in Mining Environments: Hybrid SVM Model with Multi-Optimization Strategies and GP Technique. Rock Mech. Rock Eng. 2024, 57, 9781–9804.
  47. Schober, P.; Boer, C.; Schwarte, L.A. Correlation Coefficients: Appropriate Use and Interpretation. Anesth. Analg. 2018, 126, 1763–1768.
  48. Hauke, J.; Kossowski, T. Comparison of Values of Pearson’s and Spearman’s Correlation Coefficients on the Same Sets of Data. Quaest. Geogr. 2011, 30, 87–93.
  49. Ma, T.; Jin, Y.; Liu, Z.; Prasad, Y.K. Research on Prediction of TBM Performance of Deep-Buried Tunnel Based on Machine Learning. Appl. Sci. 2022, 12, 6599.
  50. Hotelling, H. Analysis of a Complex of Statistical Variables into Principal Components. J. Educ. Psychol. 1933, 24, 417–441.
  51. Jolliffe, I.T.; Cadima, J. Principal Component Analysis: A Review and Recent Developments. Philos. Trans. R. Soc. A Math. Phys. Eng. Sci. 2016, 374, 20150202.
  52. Sokolova, M.; Lapalme, G. A Systematic Analysis of Performance Measures for Classification Tasks. Inf. Process. Manag. 2009, 45, 427–437.
  53. Breiman, L.; Friedman, J.H.; Olshen, R.A.; Stone, C.J. Classification and Regression Trees; Wadsworth: Belmont, CA, USA, 1984.
  54. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
  55. Geurts, P.; Ernst, D.; Wehenkel, L. Extremely Randomized Trees. Mach. Learn. 2006, 63, 3–42.
  56. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; ACM: New York, NY, USA, 2016; pp. 785–794.
  57. Friedman, J.H.; Popescu, B.E. Predictive Learning via Rule Ensembles. Ann. Appl. Stat. 2008, 2, 916–954.
  58. Fawcett, T. An Introduction to ROC Analysis. Pattern Recognit. Lett. 2006, 27, 861–874.
  59. Efron, B. Bootstrap Methods: Another Look at the Jackknife. Ann. Stat. 1979, 7, 1–26.
  60. Liu, M.-B.; Liao, S.-M.; Men, Y.-Q.; Xing, H.-T.; Liu, H.; Sun, L.-Y. Field Monitoring of TBM Vibration During Excavating Changing Stratum: Patterns and Ground Identification. Rock Mech. Rock Eng. 2022, 55, 1481–1498.
  61. Metropolis, N.; Ulam, S. The Monte Carlo Method. J. Am. Stat. Assoc. 1949, 44, 335–341.
  62. Niculescu-Mizil, A.; Caruana, R. Predicting Good Probabilities with Supervised Learning. In Proceedings of the 22nd International Conference on Machine Learning (ICML 2005), Bonn, Germany, 7–11 August 2005; pp. 625–632.
  63. Guo, C.; Pleiss, G.; Sun, Y.; Weinberger, K.Q. On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), Sydney, Australia, 6–11 August 2017; Volume 70, pp. 1321–1330.
Figure 1. CatBoost architecture diagram.
Figure 2. Illustration of objective function design.
Figure 3. Iterative optimization framework for three optimization strategies.
Figure 4. Basic geographic information of the sampling site.
Figure 5. Matrix plot of input variable pairs.
Figure 6. PCA scree plot with cumulative explained variance (training set).
Figure 7. Evaluation metric: (a) A detailed description of each indicator, and (b) Schematic diagram of a confusion matrix.
Figure 8. Iteration curves of the three CatBoost-based hybrid models.
Figure 9. Confusion matrices of the comparative models.
Figure 10. ROC curve of the hybrid model NGOpt-CatBoost.
Figure 11. Stratified bootstrap confidence intervals for evaluation metrics.
Figure 12. Class-wise indicator scores of NGOpt-CatBoost for three ground types, with 95% bootstrap confidence intervals on the test set.
Figure 13. Monte Carlo uncertainty analysis for Class 1 (soft-soil dominated).
Figure 14. Monte Carlo uncertainty analysis for Class 2 (mixed soft-soil and hard-rock).
Figure 15. Monte Carlo uncertainty analysis for Class 3 (hard-rock dominated).
Figure 16. (a) EFR (expected flip rate) and (b) PIW95 (mean 95% predictive-interval width) vs. perturbation amplitude (per class).
Table 1. Hyperparameter search range of CatBoost.

| Hyperparameter | Boosting Iterations | Max_Depth | Learning_Rate |
|---|---|---|---|
| Search range | [50, 500] | [1, 10] | [0.001, 0.3] |

Table 2. Descriptive statistics of the input variables.

| Variable | Mean | Std | Min | 25% | Median | 75% | Max |
|---|---|---|---|---|---|---|---|
| CRS (rpm) | 1.43 | 0.18 | 0.90 | 1.30 | 1.40 | 1.50 | 2.00 |
| AR (mm/min) | 22.36 | 14.22 | 3.00 | 10.00 | 15.00 | 35.00 | 65.00 |
| MF (kN/m²) | 453.68 | 170.94 | 66.91 | 323.91 | 425.79 | 577.86 | 912.41 |
| MT (kN/m²) | 6.67 | 3.40 | 1.57 | 3.92 | 5.87 | 9.14 | 16.97 |
| UEP (MPa) | 0.16 | 0.08 | 0.00 | 0.12 | 0.17 | 0.21 | 0.33 |
| LEP (MPa) | 0.20 | 0.09 | 0.00 | 0.15 | 0.20 | 0.26 | 0.40 |
| PR (mm/r) | 16.35 | 11.27 | 2.31 | 7.14 | 11.54 | 25.00 | 50.00 |
| FPI (-) | 3032.61 | 2367.19 | 327.28 | 878.38 | 2550.00 | 4500.00 | 16,000.00 |
| TPI (-) | 542.01 | 475.61 | 48.75 | 127.50 | 443.48 | 780.00 | 3380.00 |
| SE (kW·h/m³) | 14.50 | 12.64 | 1.37 | 3.49 | 11.90 | 20.82 | 89.87 |

Table 3. Top contributing input variables for each principal component (PCA). Values in parentheses are absolute loadings.

| Principal Component (PC) | Top-1 Feature | Top-2 Feature | Top-3 Feature |
|---|---|---|---|
| PC1 | FPI (0.3846) | SE (0.3829) | TPI (0.3825) |
| PC2 | UEP (0.5012) | LEP (0.4471) | CRS (0.4034) |
| PC3 | CRS (0.6843) | LEP (0.3166) | UEP (0.3160) |
| PC4 | CRS (0.5869) | AR (0.5100) | PR (0.4235) |
| PC5 | MT (0.6808) | FPI (0.5996) | PR (0.2289) |

Table 4. The performance of each tree-based model in the training phase.

| Classifier | Acc | Pre | Rec |
|---|---|---|---|
| NGOpt-CatBoost | 0.9979 | 0.9963 | 0.9984 |
| INFO-CatBoost | 0.9979 | 0.9942 | 0.9984 |
| Optuna-CatBoost | 0.9948 | 0.9898 | 0.9939 |
| CatBoost | 0.9854 | 0.9854 | 0.9854 |
| RF | 0.9812 | 0.9812 | 0.9812 |
| XGBoost | 0.9822 | 0.9822 | 0.9822 |
| DT | 0.9833 | 0.9833 | 0.9833 |
| ET | 0.9540 | 0.9540 | 0.9540 |
| RuleFit | 0.9833 | 0.9833 | 0.9833 |

Table 5. The performance of each tree-based model in the test phase.

| Classifier | Acc | Pre | Rec |
|---|---|---|---|
| NGOpt-CatBoost | 0.9625 | 0.9715 | 0.9716 |
| INFO-CatBoost | 0.9542 | 0.9654 | 0.9570 |
| Optuna-CatBoost | 0.9500 | 0.9621 | 0.9622 |
| CatBoost | 0.9292 | 0.9292 | 0.9292 |
| RF | 0.9167 | 0.9167 | 0.9167 |
| XGBoost | 0.9125 | 0.9125 | 0.9125 |
| DT | 0.8917 | 0.8917 | 0.8917 |
| ET | 0.8958 | 0.8958 | 0.8958 |
| RuleFit | 0.8708 | 0.8708 | 0.8708 |

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Huang, S.; Chen, Y.; Khandelwal, M.; Zhou, J. Ground-Type Classification from Earth-Pressure-Balance Shield Operational Data with Uncertainty Quantification. Appl. Sci. 2025, 15, 13234. https://doi.org/10.3390/app152413234
