Article

Data-Driven Estimation of Cerchar Abrasivity Index Using Rock Geomechanical and Mineralogical Characteristics

1 Department of Geotechnical Engineering Research, Korea Institute of Civil Engineering and Building Technology, Goyang-si 10223, Republic of Korea
2 Department of Energy and Resources Engineering, Kangwon National University, Chuncheon-si 24341, Republic of Korea
* Author to whom correspondence should be addressed.
Appl. Sci. 2026, 16(1), 552; https://doi.org/10.3390/app16010552
Submission received: 30 November 2025 / Revised: 21 December 2025 / Accepted: 29 December 2025 / Published: 5 January 2026
(This article belongs to the Section Civil Engineering)

Abstract

The Cerchar Abrasivity Index (CAI) is essential for predicting tool wear in mechanized tunneling and mining, but direct measurement requires time-consuming laboratory procedures. We developed a data-driven framework to estimate CAI from standard geomechanical and mineralogical properties using 193 rock samples covering igneous, metamorphic, and sedimentary lithologies. After evaluating 278 feature combinations with multicollinearity constraints (VIF < 10.0), we identified an optimal four-variable subset: brittleness index B1, density, Equivalent Quartz Content (EQC), and Uniaxial Compressive Strength (UCS), with rock type indicators. CatBoost achieved the best performance (Test R2 = 0.907, RMSE = 0.420), and SHAP analysis confirmed that density and EQC are primary drivers of abrasivity. Additionally, symbolic regression derived an explicit formula using only three variables (density, EQC, B1) without rock type classification (Test R2 = 0.720). The proposed framework offers a practical approach for assessing rock abrasivity at early project stages.

1. Introduction

In mechanized tunneling, underground mining, and deep excavation projects, the interaction between cutting tools and rock mass fundamentally determines project economics, scheduling, and technical feasibility. Rock abrasivity (the capacity of rock to wear down cutting tools like TBM disc cutters, roadheader picks, and drill bits) drives cutter consumption rates, forces frequent machine downtime for tool replacements, and ultimately reduces advance rates. Accurate assessment of rock abrasivity during preliminary design and geotechnical investigation is therefore essential for selecting appropriate excavation methods, estimating costs, and managing risks in hard and abrasive ground conditions [1,2,3].
The Cerchar Abrasivity Index (CAI) test remains the most widely accepted standard for quantifying rock abrasivity at laboratory scale. The International Society for Rock Mechanics (ISRM) has formalized standardized procedures to ensure consistency across testing facilities [4]. The standard CAI test is a micro-destructive surface method that generally requires specially prepared specimens or intact core surfaces, which are often difficult to obtain during preliminary exploratory drilling in highly fractured or weathered rock masses. In addition, reliable characterization of heterogeneous geological formations demands a large number of tests, making the CAI-based approach both time-consuming and financially burdensome for large-scale infrastructure projects [5,6].
These constraints have driven efforts to develop reliable indirect methods for estimating CAI from geomechanical and mineralogical properties routinely measured during site investigations. Research has shown that mineralogical composition and petrographic features primarily control abrasivity. Hard mineral content, often quantified as Equivalent Quartz Content (EQC), correlates strongly with CAI values across sedimentary and granitic rocks [7,8,9,10]. Similarly, correlations with macro-mechanical properties such as Uniaxial Compressive Strength (UCS), Brazilian Tensile Strength (BTS), and rock density indicate that stronger, denser matrices offer greater resistance to stylus scratching and therefore cause more tool wear [11,12,13,14]. CAI measurements also respond to environmental conditions such as confining stress [15] and water content [16], as well as testing parameters like particle size [17] and micro-texture [18]. This multifaceted behavior reflects the complex interaction among the various factors that control rock abrasivity.
To capture this complexity beyond individual descriptors, we incorporate physically motivated composite parameters increasingly recognized in rock cutting and TBM performance studies. First, we evaluate rock brittleness, which is critical for fragmentation and cutter wear, using four complementary indices derived from UCS (σc) and BTS (σt) [19]. These include ratio and difference forms (B1 = σc/σt; B2 = (σc − σt)/(σc + σt)) alongside product-based measures designed to capture the combined influence of compressive and tensile resistance (B3 = (σc · σt)/2; B4 = √B3). Second, we introduce a Rock Abrasivity Index (RAI = (EQC × UCS)/100) reflecting the synergistic contribution of hard mineral abundance and matrix load-bearing capacity, following established correlations linking quartz content and strength to TBM wear [20].
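To make these composite parameters concrete, the minimal sketch below computes B1–B4 and RAI from laboratory values; it assumes UCS and BTS in MPa and EQC in percent, and the helper names are ours rather than from the study.

```python
import math

def brittleness_indices(ucs: float, bts: float) -> dict:
    """Brittleness indices from UCS (sigma_c) and BTS (sigma_t), both in MPa."""
    b1 = ucs / bts                      # B1 = sigma_c / sigma_t (ratio form)
    b2 = (ucs - bts) / (ucs + bts)      # B2 = (sigma_c - sigma_t) / (sigma_c + sigma_t)
    b3 = (ucs * bts) / 2.0              # B3 = (sigma_c * sigma_t) / 2 (product form)
    b4 = math.sqrt(b3)                  # B4 = sqrt(B3)
    return {"B1": b1, "B2": b2, "B3": b3, "B4": b4}

def rock_abrasivity_index(eqc_percent: float, ucs: float) -> float:
    """RAI = (EQC [%] * UCS [MPa]) / 100."""
    return eqc_percent * ucs / 100.0

# Illustrative values: UCS = 150 MPa, BTS = 10 MPa, EQC = 60%
print(brittleness_indices(150.0, 10.0))    # B1 = 15.0, B2 = 0.875, B3 = 750.0, B4 ~ 27.4
print(rock_abrasivity_index(60.0, 150.0))  # RAI = 90.0
```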
Early studies used statistical approaches like simple or multiple linear regression to predict CAI from rock properties. While these models provided initial insights, they often failed to capture the complex, non-linear interactions among diverse properties, limiting accuracy beyond their training datasets. Recent computational geotechnics has shifted toward data-driven approaches. Machine Learning (ML) algorithms can model intricate non-linear relationships without predefined equations, showing improved performance in predicting rock properties [21,22,23]. Studies applying techniques from Artificial Neural Networks to advanced evolutionary algorithms have shown promising improvements in CAI estimation [21,22].
However, existing ML studies for CAI prediction have several limitations. First, many studies rely on relatively small datasets, typically ranging from 30 to 106 samples [24,25,26,27,28], which limits statistical robustness and model generalization. Second, datasets are often geologically homogeneous, restricted to specific rock types or regions (e.g., sedimentary rocks in Turkey [25,26], igneous rocks in India [24], or rocks from Pakistan [29]), reducing applicability to diverse geological environments. Third, some studies employ “black-box” algorithms such as ANN without providing physical interpretability of the predictions [24]. Fourth, computationally intensive evolutionary algorithms like GEP, while offering explicit equations, require substantial computational resources [27]. These limitations collectively hinder the practical application of existing models to new tunnel projects with heterogeneous geological conditions.
We address this gap using a dataset of 193 rock samples spanning igneous, sedimentary, and metamorphic origins, compiled from published studies [9,10,12,13,14,15,16,17,29,30,31,32,33] and geotechnical investigation reports from tunnel construction projects. To overcome the “black-box” nature of traditional ML and ensure engineering reliability, we propose a transparent framework for optimal feature selection. Rather than relying on a single technique, we evaluated all possible variable combinations satisfying strict multicollinearity constraints (VIF < 10.0). This exhaustive search identified the most efficient parameter subset maximizing predictive accuracy while minimizing complexity. We benchmarked ten ML algorithms (OLS, Ridge, Elastic Net, SVR, KNN, Random Forest, Gradient Boosting, XGBoost, LightGBM, and CatBoost) using a comprehensive weighted ranking system based on accuracy, generalization, and stability. Finally, we employed SHapley Additive exPlanations (SHAP) to quantify each parameter’s contribution to CAI prediction, verifying that model behavior aligns with rock mechanics principles.
Beyond black-box machine learning models, symbolic regression offers an alternative approach that automatically discovers explicit mathematical formulas from data. Unlike traditional ML algorithms, symbolic regression generates interpretable equations that can be directly implemented in engineering practice without specialized software. By employing genetic programming techniques, symbolic regression searches through large mathematical expression spaces to identify optimal formulas balancing accuracy and parsimony. This approach combines predictive accuracy with practical applicability while reducing input requirements for geotechnical applications. This study presents a data-driven, physically interpretable tool for assessing rock abrasivity in early-stage engineering projects.

2. Materials and Methods

2.1. Dataset Preparation

We used 193 rock samples to predict the Cerchar Abrasivity Index (CAI), classified into three petrological groups consisting of Igneous (77), Metamorphic (40), and Sedimentary (76) rock samples. The dataset includes comprehensive geomechanical properties such as Equivalent Quartz Content (EQC), Uniaxial Compressive Strength (UCS), Brazilian Tensile Strength (BTS), density, and brittleness indices (B1–B4), with CAI as the target variable.

2.2. Exploratory Data Analysis and Preprocessing

We conducted comprehensive Exploratory Data Analysis (EDA) to ensure statistical validity and guide preprocessing strategies, examining data distributions, group-wise variability, and variable interrelationships. The distribution of CAI values (Figure 1a) spans low to extremely abrasive conditions, which is essential for training a generalizable model. Furthermore, the dataset’s geological diversity is confirmed by the distinct CAI distributions across igneous, metamorphic, and sedimentary rock groups (Figure 2). This wide petrological coverage justifies the mandatory inclusion of rock group indicators to capture group-specific abrasive behaviors inherent in different geological origins.
We first examined the distributional characteristics of numerical input variables. Figure 1 shows histograms with best-fit probability curves, revealing that most parameters including UCS, BTS, CAI, EQC, and RAI follow Weibull distributions with varying degrees of right-skewness. Notably, density is the only variable that follows a normal distribution, while parameters such as B3 exhibit highly right-skewed Weibull distributions with large standard deviations relative to their means.
We then examined petrological influences on abrasivity by comparing CAI values across rock groups (Figure 2). Igneous rocks generally show higher median CAI and wider value ranges than Sedimentary and Metamorphic rocks. This distinct variation necessitates incorporating rock type as a categorical predictor. We employed one-hot encoding for categorical variable representation. This choice was made because (1) CatBoost inherently handles categorical features through its built-in ordered target encoding mechanism [34], which automatically optimizes categorical variable processing; (2) with only three rock groups, one-hot encoding adds only two dummy variables, resulting in minimal dimensionality increase; and (3) one-hot encoding enables SHAP analysis to quantify each rock group’s individual contribution to CAI prediction.
Pearson correlation analysis examined linear dependencies between variables. The correlation heatmap (Figure 3) shows that density and EQC have weak to moderate positive correlations with CAI (r ≈ 0.4 and 0.3, respectively). These modest linear coefficients suggest that the relationships between input variables and CAI are predominantly nonlinear, justifying our use of nonlinear machine learning models rather than simple linear regression. High multicollinearity among the brittleness indices (B1–B4) suggests redundancy. Scatter plots (Figure 4) visually confirm the positive trends, particularly for density and EQC, and also reveal data dispersion and potential non-linear patterns that simple correlation coefficients might miss.
Based on these insights, we preprocessed the dataset by log-transforming skewed variables, one-hot encoding categorical rock types, and standardizing all numerical features using Z-score normalization.
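A minimal sketch of this preprocessing is given below, assuming a pandas DataFrame with hypothetical column names; the exact set of log-transformed variables is our assumption based on the skewed distributions noted above, not the paper's published code.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

# Hypothetical column names; the paper's actual dataset schema may differ.
skewed_cols = ["UCS", "BTS", "EQC", "B3", "RAI"]      # right-skewed -> log-transform
other_num_cols = ["density", "B1", "B2", "B4"]
cat_cols = ["rock_group"]                             # Igneous / Metamorphic / Sedimentary

log_then_scale = Pipeline([
    ("log", FunctionTransformer(np.log1p)),           # log-transform skewed variables
    ("scale", StandardScaler()),                      # Z-score normalization
])

preprocessor = ColumnTransformer([
    ("skewed", log_then_scale, skewed_cols),
    ("numeric", StandardScaler(), other_num_cols),
    ("rock", OneHotEncoder(handle_unknown="ignore"), cat_cols),
])

# X = preprocessor.fit_transform(df)   # df: DataFrame holding the 193 samples
```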

2.3. Feature Selection Strategy

To account for the inherent influence of rock type, categorical variables representing rock groups (Igneous, Metamorphic, Sedimentary) were mandatorily included in all feature subsets throughout the modeling process.
For numerical predictors, we implemented a multi-stage selection process. First, we applied Variance Inflation Factor (VIF) [35] to screen for multicollinearity among numerical variables. Second, we used Recursive Feature Elimination (RFE) to rank numerical features by their contribution to model accuracy. Third, based on the RFE ranking, we selected a preliminary subset consisting of EQC, BTS, B1, density, and RAI to establish baseline performance.
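The multicollinearity screening and RFE ranking can be reproduced along the following lines, assuming a DataFrame of numerical predictors; using a random forest as the RFE base estimator is our illustration, since the paper does not specify which estimator it used.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

def vif_table(X_num: pd.DataFrame) -> pd.Series:
    """VIF per numerical predictor; values above 10 flag severe multicollinearity."""
    X = sm.add_constant(X_num)                         # intercept column for proper VIFs
    vifs = pd.Series(
        [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
        index=X.columns,
    )
    return vifs.drop("const")

def rfe_ranking(X_num: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Rank features by recursive elimination (1 = most important)."""
    selector = RFE(RandomForestRegressor(n_estimators=200, random_state=42),
                   n_features_to_select=1)
    selector.fit(X_num, y)
    return pd.Series(selector.ranking_, index=X_num.columns).sort_values()
```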

2.4. Symbolic Regression Analysis

To complement the black-box machine learning approach, we employed symbolic regression to derive explicit mathematical formulas for CAI prediction. Symbolic regression is an evolutionary algorithm-based technique that searches through the space of mathematical expressions to find equations that best fit the observed data while maintaining parsimony.
We utilized TuringBot (https://turingbotsoftware.com/) (accessed on 10 November 2025) [36], a commercial symbolic regression software that employs genetic programming to discover mathematical formulas automatically. TuringBot evolves a population of candidate equations through selection, crossover, and mutation operations, optimizing for both accuracy and simplicity.
Unlike the machine learning models that incorporated categorical rock type indicators, the symbolic regression analysis was conducted using only three numerical variables: density, EQC, and B1. This deliberate simplification aimed to develop a universal equation applicable across different rock types without requiring petrological classification. The search space included basic arithmetic operations (+, −, ×, ÷), trigonometric functions (sin, cos, tan), logarithmic functions, and power functions. The optimization objective was to minimize prediction error while penalizing equation complexity to prevent overfitting.
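TuringBot itself is commercial and cannot be scripted here, so as a rough open-source analogue only, the sketch below sets up a comparable genetic-programming search with the gplearn library; the function set mirrors the operations listed above, while the population size, generation count, parsimony coefficient, and placeholder data are our assumptions rather than the study's settings.

```python
import numpy as np
from gplearn.genetic import SymbolicRegressor

# Placeholder data with columns [density, EQC, B1]; in practice the compiled
# 193-sample dataset restricted to these three predictors and measured CAI is used.
rng = np.random.default_rng(0)
X_sr = rng.uniform([2.3, 5.0, 5.0], [3.0, 95.0, 30.0], size=(100, 3))
y_sr = 0.05 * X_sr[:, 1] ** 0.8 + X_sr[:, 0] - 2.0      # synthetic stand-in for CAI

sr = SymbolicRegressor(
    population_size=2000,
    generations=40,
    function_set=("add", "sub", "mul", "div", "sin", "cos", "tan", "log"),
    parsimony_coefficient=0.001,   # penalizes complexity, analogous to the parsimony pressure described above
    random_state=42,
)
sr.fit(X_sr, y_sr)
print(sr._program)                 # best evolved symbolic expression
```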
TuringBot generates a Pareto front of solutions representing trade-offs between accuracy and complexity. From this Pareto front, we selected the optimal equation based on test set performance to ensure generalization capability.

2.5. Model Framework and Evaluation Strategy

We developed a prediction model using a two-phase framework with multi-criteria evaluation. The dataset was randomly split into training (80%) and test (20%) sets to enable unbiased model evaluation. The training set was used for model development and hyperparameter optimization, while the test set was reserved exclusively for final performance assessment.
First, Base Model Selection identified the most effective algorithms for this geological dataset. We benchmarked ten diverse machine learning algorithms [21,22,23], ranging from linear models (OLS, Ridge, ElasticNet) to distance-based methods (KNN, SVR) and ensemble methods (Random Forest, Gradient Boosting, XGBoost, LightGBM [37], CatBoost [34]). Based on preliminary performance, we selected CatBoost, Random Forest, and Gradient Boosting as top-tier models and rigorously optimized their hyperparameters using Optuna [38], a Bayesian optimization framework, over 50 trials.
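A sketch of the Optuna-based tuning loop for CatBoost is shown below; the search ranges and the synthetic stand-in data are our assumptions, and in the actual workflow X_train and y_train would be the preprocessed 80% training split.

```python
import optuna
from catboost import CatBoostRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the preprocessed training split (about 80% of 193 samples).
X_train, y_train = make_regression(n_samples=154, n_features=6, noise=10.0, random_state=0)

def objective(trial):
    params = {
        "iterations": trial.suggest_int("iterations", 300, 1200),
        "depth": trial.suggest_int("depth", 4, 10),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.2, log=True),
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1.0, 10.0),
        "random_seed": 42,
        "verbose": 0,
    }
    model = CatBoostRegressor(**params)
    # 5-fold cross-validated R2 on the training data as the tuning objective.
    return cross_val_score(model, X_train, y_train, cv=5, scoring="r2").mean()

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=50)     # 50 Bayesian optimization trials, as in the paper
print(study.best_params)
```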
Second, full scenario analysis determined the globally optimal feature subset. Using the three optimized base models, we evaluated 278 valid feature scenarios generated by combining numerical variables while satisfying strict multicollinearity constraints (VIF < 10.0). This threshold is a widely accepted standard in statistical literature [35], where VIF values between 5 and 10 indicate high but acceptable multicollinearity, while VIF > 10 suggests severe multicollinearity that compromises coefficient reliability. The VIF < 10 criterion ensures that variance inflation due to inter-variable correlation remains within acceptable limits while retaining physically meaningful predictor combinations. This exhaustive search identifies the most efficient feature set maximizing predictive accuracy without unnecessary complexity.
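The scenario generation can be sketched as an exhaustive enumeration of numerical-variable subsets filtered by the VIF constraint; this reuses the vif_table helper from the feature-selection sketch above and assumes a DataFrame df_num holding the candidate numerical predictors.

```python
from itertools import combinations

candidate_vars = ["EQC", "UCS", "BTS", "density", "B1", "B2", "B3", "B4", "RAI"]
valid_scenarios = []

for k in range(1, len(candidate_vars) + 1):
    for subset in combinations(candidate_vars, k):
        # Single variables cannot be collinear; larger subsets must pass VIF < 10.
        if k == 1 or vif_table(df_num[list(subset)]).max() < 10.0:
            valid_scenarios.append(subset)

print(f"{len(valid_scenarios)} feature scenarios satisfy the VIF < 10 constraint")
```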
Finally, we employed a Comprehensive Weighted Ranking System to select the best model among diverse scenarios. Rather than relying on a single metric, this system evaluates models across three key dimensions. Generalization performance (40% weight) was assessed using Cross-Validation R2 score to evaluate model robustness on training data. Prediction accuracy (30% weight) was measured by Root Mean Squared Error (RMSE) on unseen test sets to quantify precision. Model stability (30% weight) was evaluated through the gap between Training R2 and Test R2 as an indicator of overfitting risk.
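The composite ranking can be expressed as a weighted score over the three criteria; min-max normalization of each criterion before weighting is our assumption, since the paper does not state the exact scaling used.

```python
import pandas as pd

def composite_ranking(results: pd.DataFrame) -> pd.DataFrame:
    """Rank model/feature scenarios by the 40/30/30 weighted criteria.

    `results` must contain columns: cv_r2, test_rmse, train_r2, test_r2.
    """
    norm = lambda s: (s - s.min()) / (s.max() - s.min())
    gap = (results["train_r2"] - results["test_r2"]).abs()       # overfitting indicator
    score = (
        0.4 * norm(results["cv_r2"])                # generalization: higher CV R2 is better
        + 0.3 * (1.0 - norm(results["test_rmse"]))  # accuracy: lower test RMSE is better
        + 0.3 * (1.0 - norm(gap))                   # stability: smaller train-test gap is better
    )
    return results.assign(score=score).sort_values("score", ascending=False)
```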

3. Results

3.1. Establishment and Optimization of Base Models

All three models showed improved generalization capability after tuning, with CatBoost achieving the highest CV R2 of 0.763 (representing a +0.024 increase) and Final Test R2 of 0.875, establishing itself as the best-performing algorithm for this dataset (Table 1). The tuning process effectively reduced the gap between Train R2 and Test R2, thereby mitigating overfitting risks inherent in tree-based ensemble methods.
We subsequently used these optimized models as fixed algorithmic engines for full scenario analysis in Section 3.2.

3.2. Optimization of Feature Subsets via Full Scenario Analysis

To determine the globally optimal feature combination, we evaluated 278 valid feature scenarios generated by combining numerical variables while strictly filtering out combinations with Variance Inflation Factor (VIF) exceeding 10.0 to prevent multicollinearity. All models included categorical rock group variables to account for petrological differences.
Table 2 presents the top 10 feature combinations ranked by their overall performance. Results show that CatBoost consistently outperformed other algorithms, occupying top positions. The combination of B1, density, EQC, and UCS achieved the best overall ranking, with a Final Test R2 of 0.907 and a Test RMSE of 0.420. A closer examination of the top-ranked models reveals that density and EQC were present in every successful combination, confirming their critical role as primary predictors of rock abrasivity. Brittleness indices including B1, B2, and B4 also frequently appeared, suggesting their importance as complementary features in capturing the rock’s fracture behavior.
We further analyzed the relationship between model performance and the number of numerical input variables to determine the optimal model complexity. Table 3 summarizes the performance of the best model for each feature count, and Figure 5 illustrates the corresponding performance trend. As shown in these results, the predictive accuracy improved significantly as the number of features increased from 1 to 3. The performance metrics peaked at 4 numerical variables, where the subset of B1, density, EQC, and UCS provided the highest balance of accuracy and efficiency. Adding more features beyond 4 resulted in marginal performance gains or even slight degradation in stability, indicating potential redundancy introduced by additional variables.
Consequently, we selected the 4-variable CatBoost model as the final optimal model for CAI prediction.

3.3. Prediction Accuracy of the Optimal Model

To validate the predictive capability of our framework, we analyzed the prediction accuracy of the final optimal model (CatBoost using the 4-variable subset). CatBoost’s high performance can be attributed to its ordered boosting mechanism and efficient handling of categorical features [34]. Figure 6 presents the scatter plot comparing predicted versus actual CAI values. The model shows good agreement between predicted and observed values, with data points clustered along the 1:1 line. The model achieved a high Final Test R2 of 0.907, explaining approximately 90.7% of variance in unseen test data. The low Test RMSE of 0.420 confirms the model’s precision in estimating CAI values across the sampled range. Notably, the Test R2 (0.907) exceeds the Cross-Validation R2 (0.777), which is opposite to the typical signature of overfitting where training performance significantly exceeds test performance. Furthermore, as shown in Table 1, hyperparameter tuning reduced the gap between CV R2 and Test R2, indicating that overfitting risk was effectively mitigated through the optimization process. The close alignment of test data (red diamonds) with the perfect fit line suggests the model successfully captured underlying geological relationships with strong generalization capability.

3.4. Symbolic Regression Model

TuringBot symbolic regression yielded an explicit mathematical formula for CAI prediction using only three input variables (density, EQC, and B1), without requiring rock type classification. The optimal equation is expressed as:
CAI = α0 + α1⋅density − (density − α2)cos(Φ)
where Φ is calculated through intermediate variables:
Φ = β1Ψ + cos(β2 − B1)
Ψ = (density − γ1)Ω + γ5
Ω = EQC + γ2 + tan[tan(γ3 + γ4 ⋅ EQC) − γ6] − tan(γ7 − B1)
The model coefficients are summarized in Table 4. The physical interpretation of this equation can be understood as follows. The primary linear term (α1·density) reflects that denser rocks exhibit lower porosity and stronger grain interlocking, providing greater resistance to stylus scratching and thus higher CAI values. The intermediate variable Ω incorporates EQC, confirming established findings that hard mineral content is a primary driver of abrasivity. The brittleness index B1 (= σc/σt) captures rock fracture behavior, inherently incorporating strength characteristics without requiring UCS as a separate variable. The trigonometric functions (cos, tan) capture nonlinear interactions among these parameters, reflecting that rock abrasivity is governed by combined influences rather than simple linear superposition. While the formula appears complex for manual calculation, it can be readily implemented in spreadsheet software for practical engineering applications.
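For convenience, the closed-form expression can be transcribed directly into a short function using the coefficients in Table 4; whether density, EQC, and B1 enter in raw physical units (g/cm³, %, dimensionless) or in transformed form is not restated here, so the input scaling should be checked against the original dataset before use.

```python
import math

# Coefficients from Table 4.
ALPHA0, ALPHA1, ALPHA2 = -2.7428, 1.9224, 0.9852
BETA1, BETA2 = -0.0707, -1.1617
G1, G2, G3, G4, G5, G6, G7 = 1.8681, -14.7198, -1.0535, 0.0752, 9.4508, 0.00166, 4.9859

def cai_symbolic(density: float, eqc: float, b1: float) -> float:
    """CAI estimate from the TuringBot-derived formula in Section 3.4."""
    omega = eqc + G2 + math.tan(math.tan(G3 + G4 * eqc) - G6) - math.tan(G7 - b1)
    psi = (density - G1) * omega + G5
    phi = BETA1 * psi + math.cos(BETA2 - b1)
    return ALPHA0 + ALPHA1 * density - (density - ALPHA2) * math.cos(phi)
```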
Figure 7 presents the scatter plot comparing predicted versus actual CAI values for the symbolic regression model. The model achieved a Training R2 of 0.776 and Test R2 of 0.720 with RMSE of 0.728. This performance was achieved using only three numerical variables without rock type classification, compared to the CatBoost model which required four numerical variables (B1, density, EQC, UCS) plus categorical rock type indicators. This demonstrates that the symbolic regression approach can achieve comparable predictive capability with a more parsimonious model structure by utilizing B1 as a composite parameter that inherently incorporates strength characteristics.

3.5. Physical Interpretation of the Optimal Model (SHAP Analysis)

To validate the model’s physical reliability, we performed SHAP (SHapley Additive exPlanations) analysis [39]. Figure 8 summarizes the interpretability results.
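The plots in Figure 8 can be reproduced with the shap library along the following lines, assuming `model` is the tuned CatBoost regressor and `X_train` is the preprocessed feature DataFrame with the four numerical variables plus one-hot rock group columns from the earlier modeling steps.

```python
import shap

explainer = shap.TreeExplainer(model)          # tree-model explainer for CatBoost
shap_values = explainer.shap_values(X_train)

shap.summary_plot(shap_values, X_train, plot_type="bar")  # mean |SHAP| importance (Fig. 8a)
shap.summary_plot(shap_values, X_train)                   # beeswarm plot (Fig. 8b)
shap.dependence_plot("density", shap_values, X_train, interaction_index="UCS")   # Fig. 8c
shap.dependence_plot("EQC", shap_values, X_train, interaction_index="density")   # Fig. 8d
```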
The bar chart (Figure 8a) ranks input variables by their mean absolute SHAP values. Among all predictors, the rock group indicator for Sedimentary rocks emerged as the most influential variable, highlighting the critical role of petrological classification in CAI prediction. Among the numerical predictors, density showed the highest importance, followed by EQC, B1, and UCS. This hierarchy confirms that physical compactness (density) and mineral hardness (EQC) are primary drivers of rock abrasivity. The Beeswarm plot (Figure 8b) illustrates the direction and magnitude of each feature’s effect on CAI. High density values (red dots) consistently associate with positive SHAP values, indicating increased CAI. Similarly, EQC exhibits a strong positive relationship. The rock group variables show distinct patterns, with sedimentary classification having the widest impact range on model predictions.
The dependence plots (Figure 8c,d) illustrate the impact of key geological parameters on abrasivity [7,8,11,12]. For density (Figure 8c), the plot shows that SHAP values increase progressively as the standardized density increases. This impact is noticeably amplified in the range typically associated with hard rock transition, with SHAP values sharply rising from approximately 0.25 to 0.75 for standardized density values above 1.5. While the vertical dispersion is relatively narrow, the color gradient indicates a strong correlation with UCS; samples with high density (positive x-values) predominantly correspond to high UCS (red dots). This visual coupling provides initial evidence of a synergistic relationship where the CAI is jointly determined by compactness and strength. This confirms that physical compactness and mechanical strength are coupled in driving higher predicted abrasivity [2,12,13]. For EQC (Figure 8d), a positive linear trend with SHAP values is observed, though with greater variance compared to density. The color overlay reveals that for similar EQC levels, samples with higher density (red dots) generally exhibit higher SHAP values. This suggests a combined effect where mineral hardness (Quartz content) and rock compactness (density) collectively enhance the rock’s abrasive potential.

4. Discussion

4.1. Efficacy of Data-Driven Feature Selection

Our feature selection process showed that a reduced subset of variables could achieve better predictive performance compared to utilizing all available geological parameters. Full scenario analysis revealed that predictive accuracy (R2) reached an optimal plateau with just four numerical variables (B1, density, EQC, UCS) combined with categorical rock type indicators. The rigorous enforcement of the strict multicollinearity constraint (VIF < 10) ensured that this four-variable subset represents the optimal balance between predictive accuracy and statistical independence. Adding more variables beyond this point yielded diminishing returns and increased overfitting risk, as evidenced by the widening gap between training and testing performance. This finding aligns with the parsimony principle by maximizing performance with a statistically justified, minimum-redundancy feature set [21,22], suggesting that CAI is primarily governed by a few dominant factors such as mineral hardness (EQC), physical compactness (density), and rock strength/brittleness (UCS, B1), rather than complex interplay among numerous minor parameters.

4.2. Physical Interpretation of Key Predictors

SHAP analysis provided critical insights into the physical mechanisms underlying the model’s predictions, validating its alignment with rock mechanics principles.
Density and EQC emerged as the most influential numerical predictors [11,12]. The strong positive correlation between density and CAI reflects that denser rocks, often associated with lower porosity and higher grain interlocking, offer greater resistance to stylus scratching, resulting in higher wear. Similarly, the high importance of EQC confirms the well-established relationship that quartz content is the primary mineralogical driver of abrasivity [7,8,9,10]. Recent studies applying SHAP analysis in geotechnical engineering have similarly demonstrated the value of explainable AI in validating physical interpretations of ML model outputs [40].
The dependence plots also highlighted significant synergistic effects between predictors [2,3]. For instance, density’s impact on CAI was notably amplified in rocks with higher UCS. The SHAP dependence plot (Figure 8c) explicitly illustrates this synergistic mechanism by demonstrating that a dense rock matrix combined with high strength requires significantly more energy to fracture, thereby increasing normal force and friction on tool surfaces and leading to accelerated wear. This interaction effect, where the effect of density on CAI is amplified by UCS, is a key non-linear relationship successfully captured by the CatBoost ensemble model.

4.3. Performance of Ensemble Learning Models

Benchmarking results showed that tree-based ensemble methods (CatBoost, Random Forest, Gradient Boosting) outperformed other algorithms, including linear and distance-based methods [37]. Ensemble learning methods have shown better predictive performance compared to single models in geotechnical applications [41,42]. Our final CatBoost model achieved Test R2 of 0.907, significantly outperforming linear models that struggled to capture non-linear dependencies between geological properties and abrasivity. CatBoost’s stability stems from its effective handling of categorical variables (Rock Groups) and its ability to model complex interactions without extensive feature engineering [21,22,23]. Rigorous hyperparameter tuning further enhanced generalization capability, reducing prediction error (RMSE) by approximately 14.1% compared to the baseline model.

4.4. Comparative Advantage over Prior ML Studies

To contextualize the performance of our approach, Table 5 provides a comparative analysis of CAI prediction models arranged chronologically. Early studies primarily relied on conventional regression analyses or early soft computing techniques (e.g., ANN) often limited by smaller, regionally specific datasets. While evolutionary algorithms like GEP introduced explicit equation generation, they often came with high computational costs. In contrast, our study utilizes a significantly larger and more diverse dataset (N = 193) and employs a rigorous VIF-constrained feature selection process with an advanced ensemble learning algorithm (CatBoost) to ensure model stability and high generalization.
As demonstrated in Table 5, earlier studies such as Ozdogan et al. [26] and Capik and Yilmaz [25] were constrained by small sample sizes (N < 50), limiting their applicability to broader geological conditions. While Tripathy et al. [24] and Kadkhodaei and Ghasemi [27] expanded the dataset size to around 100 samples, they either relied on “black-box” models or computationally intensive evolutionary algorithms. Our study advances this field by applying CatBoost on the largest diverse dataset (N = 193) among the compared works. Furthermore, our VIF-constrained scenario analysis ensures that the high predictive performance is achieved with a statistically independent and physically meaningful subset of features.

4.5. Comparison of Machine Learning and Symbolic Regression Approaches

Table 6 summarizes the performance comparison between the CatBoost machine learning model and the TuringBot symbolic regression model.
The CatBoost model demonstrates higher predictive accuracy with Test R2 of 0.907. It should be noted that the two models were intentionally designed for different application scenarios rather than direct comparison under identical conditions. The symbolic regression model deliberately used only three numerical variables (density, EQC, B1) without rock type classification to develop a universal equation applicable in heterogeneous geological formations where rock type boundaries are unclear. In contrast, the CatBoost model utilized four numerical variables plus categorical rock type indicators to maximize predictive accuracy when comprehensive geotechnical data are available. The symbolic regression model achieved a Test R2 of 0.720, demonstrating that reasonable prediction is achievable with reduced input requirements. This suggests that the essential information for CAI prediction is largely captured by density, EQC, and B1, with rock type classification and UCS providing incremental accuracy improvements.
The symbolic regression model offers several practical advantages. First, the model requires only three input variables (density, EQC, B1) instead of four numerical variables plus rock type indicators, resulting in a more parsimonious model structure. Second, the equation applies universally across all rock types without requiring categorical classification, making it particularly suitable for heterogeneous geological formations where rock type boundaries are unclear. Third, the explicit mathematical formula provides interpretability, allowing engineers to understand and verify the relationships between input parameters and CAI predictions. Finally, the closed-form equation offers good portability, as it can be implemented in any computing environment including spreadsheets and simple calculators without requiring specialized machine learning software.
The performance difference between CatBoost (R2 = 0.907) and symbolic regression (R2 = 0.720) is significant, and predictive accuracy is indeed a critical parameter in engineering applications. However, the two models serve different purposes. For applications requiring maximum predictive accuracy with comprehensive geotechnical data, such as detailed design stages, the CatBoost model is strongly recommended. For preliminary assessments at early project stages where rock type classification is unavailable, rapid screening is needed, or when only limited testing data are available, the symbolic regression formula provides a practical alternative that balances reasonable accuracy with reduced data requirements and immediate implementation capability in spreadsheet software.

4.6. Limitations and Future Research

Despite promising results, this study has certain limitations. First, while our dataset (N = 193) is relatively large compared to many prior studies, it may still underrepresent specific rare rock types or extreme geological conditions [29]. Second, although a rigorous validation strategy was employed—splitting the dataset into 80% training and 20% test sets, with 5-fold cross-validation performed on the training set for model development and hyperparameter optimization, followed by final evaluation on the completely held-out test set—this approach provides internal validation only. Samples from the same geological formation or published study may appear in both training and test sets, which could lead to optimistic performance estimates. Leave-one-study-out cross-validation was not feasible due to the uneven distribution of samples across source studies. Furthermore, acquiring additional public datasets with complete input variables (UCS, BTS, density, EQC, and CAI) remains challenging due to the limited availability of comprehensive rock property databases. Therefore, the reported Test R2 of 0.907 should be interpreted as an upper bound of expected performance, and field validation against independent project data is strongly recommended before engineering application. Expanding the dataset with samples from diverse geographical locations would further enhance the model’s global applicability. Third, the current model relies on macro-scale laboratory indices and does not incorporate micro-textural parameters (e.g., grain size distribution, cementation type) [17,18]. While such parameters could potentially refine predictions, especially for sedimentary rocks, systematic micro-textural data were not available in the compiled dataset.
It should also be clarified that the term “early project stages” in this study refers specifically to the bidding phase, where the number of boreholes and available core samples for testing is typically limited. At this stage, conducting extensive CAI testing is impractical due to time and budget constraints; however, basic geomechanical properties (UCS, BTS, density) and mineralogical composition can be obtained from the limited core samples available. The proposed framework enables CAI prediction from these routinely measured properties, providing valuable input for preliminary cost estimation and risk assessment during bid preparation.
For truly preliminary assessments where only limited data exist, Table 3 demonstrates that a single-variable model using only density achieves Test R2 = 0.801. However, it should be noted that previous studies have consistently reported that single parameters alone are not suitable for predicting CAI. Ko et al. [2] concluded that a single parameter is not suitable to predict the value of CAI for igneous and metamorphic rocks, and Sun et al. [21] also reported that a single factor is not suitable for directly predicting CAI. Furthermore, the prediction accuracy of single-variable models varies significantly depending on rock type; for example, while sedimentary rocks show relatively acceptable correlations between UCS and CAI, metamorphic rocks cannot be classified into specific UCS and CAI ranges, making single-variable prediction unreliable [43]. Therefore, the multi-variable approach adopted in this study is justified for achieving reliable predictions across diverse rock types.
Future research should focus on integrating these micro-scale features and validating the model’s performance against field TBM cutter wear data [2,3,20] to bridge the gap between laboratory indices and in situ excavation conditions. Recent studies have demonstrated the application of ML techniques for predicting TBM disc cutter wear [44,45], highlighting the practical value of data-driven approaches in mechanized tunneling. Integration with real-time TBM operational data represents a promising direction for practical implementation.
For practical project application, the required input parameters can be obtained through standard geotechnical testing procedures: density is measured following ISRM suggested methods; UCS is determined by uniaxial compression testing or estimated from Point Load Index (Is50); BTS is obtained from Brazilian tensile testing; B1 is calculated as the UCS/BTS ratio; and EQC is determined through mineralogical analysis via thin-section petrography or XRD. Once CAI is predicted, it can be translated into engineering recommendations using established empirical relationships for TBM disc cutter life estimation [20] and excavation method selection based on CAI classification (Very Low to Extremely High abrasivity).

5. Conclusions

This study established a data-driven framework for predicting the Cerchar Abrasivity Index (CAI) using 193 rock samples. The key findings are as follows:
(1) An optimal four-variable subset (B1, density, EQC, UCS) with rock type indicators achieved the highest predictive accuracy (Test R2 = 0.907, RMSE = 0.420) using CatBoost, outperforming nine other machine learning algorithms.
(2) SHAP analysis confirmed that density and EQC are primary drivers of abrasivity, with significant synergistic effects between rock compactness and mechanical strength, validating the model’s alignment with rock mechanics principles.
(3) Symbolic regression derived an explicit formula using only three variables (density, EQC, B1) without rock type classification (Test R2 = 0.720), offering a practical alternative for preliminary assessments with reduced data requirements.
The proposed framework provides a cost-effective tool for early-stage rock abrasivity assessment, reducing the need for extensive laboratory testing in tunneling and mining projects.

Author Contributions

Conceptualization, S.-W.C. and T.Y.K.; methodology, S.-W.C. and T.Y.K.; software, S.-W.C.; validation, S.-W.C. and T.Y.K.; formal analysis, S.-W.C.; investigation, S.-W.C.; resources, T.Y.K.; data curation, S.-W.C.; writing—original draft preparation, S.-W.C.; writing—review and editing, T.Y.K.; visualization, S.-W.C.; supervision, T.Y.K.; project administration, T.Y.K.; funding acquisition, T.Y.K. All authors have read and agreed to the published version of the manuscript.

Funding

This research was conducted with the support of the National R&D Project for Smart Construction Technology (No. RS-2020-KA157074) funded by the Korea Agency for Infrastructure Technology Advancement under the Ministry of Land, Infrastructure, and Transport, and managed by the Korea Expressway Corporation.

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Acknowledgments

During the preparation of this manuscript/study, the author(s) used TuringBot (Version 3.1.4, https://turingbotsoftware.com/, accessed on 10 November 2025) for the purposes of symbolic regression analysis to derive explicit mathematical formulas for CAI prediction. The authors have reviewed and edited the output and take full responsibility for the content of this publication. The authors also appreciate the support of the Korea Agency for Infrastructure Technology Advancement under the Ministry of Land, Infrastructure, and Transport, and of the Korea Expressway Corporation, which managed the project.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Zhang, G.; Thuro, K.; Song, Z.; Dang, W.; Bai, Q. Cerchar abrasivity test and its applications in rock engineering: A review. Int. J. Coal Sci. Technol. 2025, 12, 13. [Google Scholar] [CrossRef]
  2. Ko, T.Y.; Kim, T.K.; Son, Y.; Jeon, S. Effect of geomechanical properties on Cerchar Abrasivity Index (CAI) and its application to TBM tunnelling. Tunn. Undergr. Space Technol. 2016, 57, 99–111. [Google Scholar] [CrossRef]
  3. Rostami, J. Hard rock TBM cutterhead modeling for design and performance prediction. Geomech. Tunn. 2008, 1, 18–28. [Google Scholar] [CrossRef]
  4. Alber, M.; Bruland, A.; Dahl, F.; Grima, M.A.; Käsling, H.; Michalakopoulos, T.N. ISRM Suggested Method for Determining the Abrasivity of Rock by the CERCHAR Abrasivity Test. Rock Mech. Rock Eng. 2014, 47, 261–266. [Google Scholar] [CrossRef]
  5. Käsling, H.; Thuro, K. Determining rock abrasivity in the laboratory. In Proceedings of the ISRM EUROCK 2010, Lausanne, Switzerland, 15–18 June 2010. [Google Scholar]
  6. Gao, K.; Wang, X.; Wei, H.; Zhu, T.; Zhang, Z. Abrasivity Database of Different Genetic Rocks Based on CERCHAR Abrasivity Test. Sci. Data 2024, 11, 630. [Google Scholar] [CrossRef]
  7. Moradizadeh, M.; Cheshomi, A.; Ghafoori, M.; TrighAzali, S. Correlation of equivalent quartz content, Slake durability index and Is50 with Cerchar abrasiveness index for different types of rock. Int. J. Rock Mech. Min. Sci. 2016, 86, 42–47. [Google Scholar] [CrossRef]
  8. Heydarian, P.; Asef, M.R.; Hamidi, J.K.; Talkhablo, M. The relationship between mechanical properties and mineralogical composition of some sedimentary rocks. Q. J. Eng. Geol. Hydrogeol. 2024, 57, qjegh2024-069. [Google Scholar] [CrossRef]
  9. Majeed, Y.; Abu Bakar, M.Z. A study to correlate LCPC rock abrasivity test results with petrographic and geomechanical rock properties. Q. J. Eng. Geol. Hydrogeol. 2018, 51, 365–378. [Google Scholar] [CrossRef]
  10. Er, S.; Tuğrul, A. Correlation of physico-mechanical properties of granitic rocks with Cerchar Abrasivity Index in Turkey. Measurement 2016, 91, 114–123. [Google Scholar] [CrossRef]
  11. Wani, S.R.; Teshnizi, E.S.; Jalota, S. Correlation between Cerchar abrasivity index and geotechnical properties of igneous rocks: A comprehensive analysis using machine learning algorithms and interpretative analysis. Measurement 2026, 257, 118989. [Google Scholar] [CrossRef]
  12. Zhang, S.-R.; She, L.; Wang, C.; Wang, Y.-J.; Cao, R.-L.; Li, Y.-L.; Cao, K.-L. Investigation on the relationship among the Cerchar abrasivity index, drilling parameters and physical and mechanical properties of the rock. Tunn. Undergr. Space Technol. 2021, 112, 103907. [Google Scholar] [CrossRef]
  13. Majeed, Y.; Abu Bakar, M.Z.; Butt, I.A. Abrasivity evaluation for wear prediction of button drill bits using geotechnical rock properties. Bull. Eng. Geol. Environ. 2020, 79, 767–784. [Google Scholar] [CrossRef]
  14. Aligholi, S.; Lashkaripour, G.R.; Ghafoori, M.; Azali, A. Evaluating the Relationships Between NTNU/SINTEF Drillability Indices with Index Properties and Petrographic Data of Hard Igneous Rocks. Rock Mech. Rock Eng. 2017, 50, 2929–2953. [Google Scholar] [CrossRef]
  15. Alber, M. Stress dependency of the Cerchar abrasivity index (CAI) and its effects on wear of selected rock cutting tools. Tunn. Undergr. Space Technol. 2008, 23, 351–359. [Google Scholar] [CrossRef]
  16. Abu Bakar, M.Z.; Majeed, Y.; Rostami, J. Effects of rock water content on CERCHAR Abrasivity Index. Rock Mech. Rock Eng. 2016, 49, 3745–3758. [Google Scholar] [CrossRef]
  17. Majeed, Y.; Abu Bakar, M.Z. Effects of variation in the particle size of the rock abrasion powder and standard rotational speed on the NTNU/SINTEF abrasion value steel test. Bull. Eng. Geol. Environ. 2019, 78, 1537–1554. [Google Scholar] [CrossRef]
  18. Ündül, Ö.; Er, S. Investigating the effects of micro-texture and geo-mechanical properties on the abrasiveness of volcanic rocks. Eng. Geol. 2017, 229, 85–94. [Google Scholar] [CrossRef]
  19. Meng, F.; Wong, L.N.Y.; Zhou, H. Rock brittleness indices and their applications to different fields of rock engineering: A review. J. Rock Mech. Geotech. Eng. 2021, 13, 221–247. [Google Scholar] [CrossRef]
  20. Plinninger, R.J.; Käsling, H.; Thuro, K. Wear prediction in hard rock excavation using the CERCHAR Abrasiveness Index (CAI). In Proceedings of the EUROCK 2004 and 53rd Geomechanics Colloquium, Salzburg, Austria, 7–9 October 2004; Schubert, W., Ed.; VGE Verlag GmbH: Essen, Germany, 2004; pp. 599–604. [Google Scholar]
  21. Sun, J.; Fan, X.; Wang, H.; Shang, Y.; Sun, C. New Prediction Model of Rock Cerchar Abrasivity Index Based on Gene Expression Programming. Appl. Sci. 2025, 15, 10901. [Google Scholar] [CrossRef]
  22. Kwak, N.-S.; Ko, T.Y. Machine learning-based regression analysis for estimating Cerchar abrasivity index. Geomech. Eng. 2022, 29, 219–228. [Google Scholar]
  23. Hong, J.-P.; Kang, Y.S.; Ko, T.Y. Estimation of Cerchar abrasivity index based on rock strength and petrological characteristics using linear regression and machine learning. J. Korean Tunn. Undergr. Space Assoc. 2024, 26, 39–58. [Google Scholar]
  24. Tripathy, A.; Singh, T.N.; Kundu, J. Prediction of abrasiveness index of some Indian rocks using soft computing methods. Measurement 2015, 68, 302–309. [Google Scholar] [CrossRef]
  25. Capik, M.; Yilmaz, A.O. Modeling of Micro Deval abrasion loss based on some rock properties. J. Afr. Earth Sci. 2017, 134, 549–556. [Google Scholar] [CrossRef]
  26. Ozdogan, M.V.; Deliormanli, A.H.; Yenice, H. The correlations between the Cerchar abrasivity index and the geomechanical properties of building stones. Arab. J. Geosci. 2018, 11, 604. [Google Scholar] [CrossRef]
  27. Kadkhodaei, M.H.; Ghasemi, E. Development of a GEP model to assess CERCHAR abrasivity index of rocks based on geomechanical properties. J. Min. Environ. 2019, 10, 917–928. [Google Scholar]
  28. Teymen, A. The usability of Cerchar abrasivity index for the estimation of mechanical rock properties. Int. J. Rock Mech. Min. Sci. 2020, 128, 104258. [Google Scholar] [CrossRef]
  29. Majeed, Y.; Abu Bakar, M.Z. Statistical evaluation of CERCHAR Abrasivity Index (CAI) measurement methods and dependence on petrographic and mechanical properties of selected rocks of Pakistan. Bull. Eng. Geol. Environ. 2016, 75, 1341–1360. [Google Scholar] [CrossRef]
  30. Lee, S.; Jung, H.-Y.; Jeon, S. Determination of Rock Abrasiveness using Cerchar Abrasiveness Test. Tunn. Undergr. Space 2012, 22, 284–295. [Google Scholar] [CrossRef]
  31. Eide, L.N.R. TBM Tunnelling at the Stillwater Mine. Master’s Thesis, Norwegian University of Science and Technology (NTNU), Trondheim, Norway, 2014. [Google Scholar]
  32. Macias, F.J. Hard Rock Tunnel Boring: Performance Predictions and Cutter Life Assessments. Ph.D. Thesis, Norwegian University of Science and Technology (NTNU), Trondheim, Norway, 2016. [Google Scholar]
  33. Macias, F.J.; Dahl, F.; Bruland, A. New Rock Abrasivity Test Method for Tool Life Assessments on Hard Rock Tunnel Boring: The Rolling Indentation Abrasion Test (RIAT). Rock Mech. Rock Eng. 2016, 49, 1679–1693. [Google Scholar] [CrossRef]
  34. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased Boosting with Categorical Features. In Proceedings of the 32nd International Conference on Neural Information Processing Systems (NeurIPS’18), Montréal, QC, Canada, 3–8 December 2018; pp. 6639–6649. [Google Scholar]
  35. O’Brien, R.M. A Caution Regarding Rules of Thumb for Variance Inflation Factors. Qual. Quant. 2007, 41, 673–690. [Google Scholar] [CrossRef]
  36. TuringBot. Ver. 3.1.4. Symbolic Regression Software, TuringBot Software: São Paulo, Brazil, 2020. Available online: https://turingbotsoftware.com/ (accessed on 10 November 2025).
  37. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; pp. 3146–3154. [Google Scholar]
  38. Akiba, T.; Sano, S.; Yanase, T.; Ohta, T.; Koyama, M. Optuna: A Next-Generation Hyperparameter Optimization Framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’19), Anchorage, AK, USA, 4–8 August 2019; pp. 2623–2631. [Google Scholar]
  39. Lundberg, S.M.; Lee, S.-I. A Unified Approach to Interpreting Model Predictions. In Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17), Long Beach, CA, USA, 4–9 December 2017; pp. 4765–4774. [Google Scholar]
  40. Lin, S.; Liang, Z.; Zhao, S.; Dong, M.; Guo, H.; Zheng, H. A Comprehensive Evaluation of Ensemble Machine Learning in Geotechnical Stability Analysis and Explainability. Int. J. Mech. Mater. Des. 2024, 20, 331–352. [Google Scholar] [CrossRef]
  41. Baghbani, A.; Choudhury, T.; Costa, S.; Reiner, J. Application of Artificial Intelligence in Geotechnical Engineering: A State-of-the-Art Review. Earth-Sci. Rev. 2022, 228, 103991. [Google Scholar] [CrossRef]
  42. Saadati, G.; Javankhoshdel, S.; Mohebbi Najm Abad, J.; Mett, M.; Kontrus, H.; Schneider-Muntau, B. AI-Powered Geotechnics: Enhancing Rock Mass Classification for Safer Engineering Practices. Rock Mech. Rock Eng. 2025, 58, 11319–11349. [Google Scholar] [CrossRef]
  43. Di Giovanni, A.; Rispoli, A.; Ferrero, A.M.; Farinetti, A.; Cardu, M. A statistical approach for the correlation between Cerchar Abrasivity Index and Uniaxial Compressive Strength of rocks. Geomech. Tunn. 2023, 16, 378–386. [Google Scholar] [CrossRef]
  44. Agrawal, A.K.; Murthy, V.M.S.R.; Chattopadhyaya, S.; Raina, A.K. Prediction of TBM Disc Cutter Wear and Penetration Rate in Tunneling Through Hard and Abrasive Rock Using Multi-Layer Shallow Neural Network and Response Surface Methods. Rock Mech. Rock Eng. 2022, 55, 3489–3506. [Google Scholar] [CrossRef]
  45. Kwon, K.; Choi, H.; Jung, J.; Kim, D.; Shin, Y.J. Prediction of Abnormal TBM Disc Cutter Wear in Mixed Ground Condition Using Interpretable Machine Learning with Data Augmentation. J. Rock Mech. Geotech. Eng. 2025, 17, 2059–2071. [Google Scholar] [CrossRef]
Figure 1. Histograms with best-fit probability distributions for numerical input variables: (a) Cerchar Abrasivity Index (CAI); (b) Equivalent Quartz Content (EQC); (c) Uniaxial Compressive Strength (UCS); (d) Brazilian Tensile Strength (BTS); (e) Rock density; (f) Coefficient B1; (g) Coefficient B2; (h) Coefficient B3; (i) Coefficient B4; (j) Rock Abrasivity Index (RAI).
Figure 2. Box plots of Cerchar Abrasivity Index (CAI) by rock group. The circles (o) represent outlier values beyond the whiskers.
Figure 3. Correlation matrix heatmap of numerical input variables.
Figure 4. Scatter plots with linear regression lines showing relationships between Cerchar Abrasivity Index (CAI) and numerical input variables: (a) EQC (r = 0.34); (b) UCS (r = 0.41); (c) BTS (r = 0.42); (d) density (r = 0.39); (e) B1 (r = 0.07); (f) B2 (r = 0.08); (g) B3 (r = 0.43); (h) B4 (r = 0.46); (i) RAI (r = 0.56). The blue dots represent individual sample data points, the solid red line indicates the linear regression fit, and the red shaded area represents the 95% confidence interval.
Figure 5. Model performance trends with respect to the number of numerical input variables. The red arrow highlights the optimal subset at N = 4.
Figure 6. Comparison of predicted vs. actual CAI values for the final optimal CatBoost model.
Figure 7. Comparison of predicted vs. actual CAI values for the symbolic regression model derived using TuringBot (version 3.1.4).
Figure 8. SHAP analysis results for the optimal CatBoost model. (a) Global feature importance ranked by mean absolute SHAP value. (b) Beeswarm summary plot showing the distribution and direction of feature impact. (c) SHAP dependence plot for density. (d) SHAP dependence plot for EQC.
Table 1. Performance comparison before and after hyperparameter tuning for the top 3 base models.
Algorithm | Status | CV R2 | Final Test R2 | RMSE | Optimal Hyperparameters
CatBoost | Baseline | 0.738 | 0.871 | 0.489 | Default Settings
CatBoost | Tuned | 0.763 | 0.875 | 0.477 | iter: 800, depth: 8, lr: 0.08, l2_leaf_reg: 3.5
Random Forest | Baseline | 0.741 | 0.882 | 0.465 | Default Settings
Random Forest | Tuned | 0.752 | 0.898 | 0.440 | n_est: 500, max_depth: 15, min_samples_split: 5
Gradient Boosting | Baseline | 0.725 | 0.865 | 0.495 | Default Settings
Gradient Boosting | Tuned | 0.734 | 0.874 | 0.488 | n_est: 300, max_depth: 5, lr: 0.1, subsample: 0.8
Note: “Baseline” refers to models with default parameters using the RFE-selected feature subset. All models include categorical Rock Group variables as mandatory predictors.
Table 2. Top 10 Performing Feature Combinations.
Rank | Model | CV R2 | Final Test R2 | RMSE | Nfeat | Feature List
1 | CatBoost | 0.777 | 0.907 | 0.420 | 4 | B1, density, EQC, UCS
2 | CatBoost | 0.777 | 0.905 | 0.425 | 4 | B2, B4, density, EQC
3 | CatBoost | 0.778 | 0.898 | 0.440 | 4 | B1, B4, density, EQC
4 | CatBoost | 0.775 | 0.902 | 0.431 | 4 | B1, BTS, density, EQC
5 | CatBoost | 0.770 | 0.912 | 0.407 | 6 | B1, B2, BTS, density, EQC, UCS
6 | CatBoost | 0.770 | 0.905 | 0.424 | 5 | B1, B3, B4, density, EQC
7 | RandomForest | 0.760 | 0.890 | 0.455 | 5 | B1, B4, density, EQC, RAI
8 | RandomForest | 0.753 | 0.896 | 0.444 | 5 | B1, BTS, density, EQC, RAI
9 | CatBoost | 0.771 | 0.907 | 0.420 | 5 | B1, B3, density, EQC, UCS
10 | CatBoost | 0.777 | 0.895 | 0.445 | 6 | B1, B2, B3, density, EQC, UCS
Note: All models include categorical Rock Group variables as mandatory predictors.
Table 3. Best Performing Model per Feature Count (N).
N | Model | CV R2 | Final Test R2 | RMSE | Feature List
1 | RandomForest | 0.592 | 0.801 | 0.613 | density
2 | RandomForest | 0.670 | 0.882 | 0.472 | B1, RAI
3 | RandomForest | 0.752 | 0.891 | 0.453 | B1, density, RAI
4 | CatBoost | 0.777 | 0.907 | 0.420 | B1, density, EQC, UCS
5 | CatBoost | 0.772 | 0.905 | 0.424 | B1, B3, B4, density, EQC
6 | CatBoost | 0.770 | 0.912 | 0.407 | B1, B2, BTS, density, EQC, UCS
7 | CatBoost | 0.783 | 0.886 | 0.463 | B1, B2, B3, BTS, density, EQC, RAI
Note: All models include categorical Rock Group variables as mandatory predictors.
Table 4. Coefficients of the symbolic regression model.
Coefficient | Value
α0 | −2.7428
α1 | 1.9224
α2 | 0.9852
β1 | −0.0707
β2 | −1.1617
γ1 | 1.8681
γ2 | −14.7198
γ3 | −1.0535
γ4 | 0.0752
γ5 | 9.4508
γ6 | 0.00166
γ7 | 4.9859
Table 5. Comparative Analysis of CAI Prediction Models based on Literature Review.
Study (Reference) | Algorithm | Dataset Characteristics | Key Inputs/Approach | Limitations & Remarks
Tripathy et al. [24] | Soft Computing (ANN) | N = 105 (India) | UCS, PLI, E, Vp | Used “black-box” ANN model; good accuracy (R2 = 0.97)
Capik and Yilmaz [25] | Simple & Multiple Regression | N = 41 (Turkey) | UCS, BTS, Is50, Porosity, Schmidt Hardness | Focused on Micro Deval Abrasion Loss (MDAL)-CAI correlations; limited sample size
Ozdogan et al. [26] | Multiple Regression | N = 30 (Building stones) | UCS, Porosity, Shore Hardness | Very small sample size; restricted to specific stone type
Kadkhodaei and Ghasemi [27] | Gene Expression Programming (GEP) | N = 106 (Compiled) | RAI, BTS | Evolutionary algorithm offers explicit equations but is computationally expensive
Teymen [28] | Multiple Regression | N = 80 (Turkey) | CAI, UCS, E, BTS, Is50, ROP, BPI | Focus on estimating properties from CAI
This Study | CatBoost (Ensemble) | N = 193 (Compiled) | B1, Density, EQC, UCS | Rigorous VIF selection; high generalization due to diverse data
Note: Point Load Index (Is50 or PLI), P-wave velocity (Vp), Young’s Modulus (E), Rate of Penetration (ROP), and Block Punch Index (BPI).
Table 6. Performance comparison between CatBoost and Symbolic Regression models.
Model | Numerical Variables | Rock Type | Test R2 | RMSE | Interpretability
CatBoost | 4 (B1, density, EQC, UCS) | Required | 0.907 | 0.420 | Black-box
Symbolic Regression | 3 (density, EQC, B1) | Not required | 0.720 | 0.728 | Explicit equation
