Predictive Modeling of Aggregate Polished Stone Value from Mineralogical and Chemical Composition

Soudani, Khedoudja; Bounefla, Yazid; Cerezo, Veronique; Haddadi, Smail

doi:10.3390/eng7040149

¹ LEEGO, FCE, USTHB, BP32 El-Alia, Bab-Ezzouar 16111, Algiers, Algeria

² LBE FCE, USTHB, BP32 El-Alia, Bab-Ezzouar 16111, Algiers, Algeria

³ AME-EASE, Gustave Eiffel University, 69540 Bron, France

* Author to whom correspondence should be addressed.

Eng 2026, 7(4), 149; https://doi.org/10.3390/eng7040149

Submission received: 11 February 2026 / Revised: 22 March 2026 / Accepted: 24 March 2026 / Published: 26 March 2026

Abstract

The polished stone value (PSV) is a key parameter for assessing the resistance of aggregates to polishing in the laboratory. It is included in technical specifications and serves as both a regulatory and contractual criterion for selecting aggregates for wearing courses. Its determination requires non-negligible amounts of material, long testing durations, and skilled operators. This study aims to develop a predictive modeling approach to estimate the polished stone value (PSV) from the mineralogical and chemical composition of aggregates. A curated database was compiled from the peer-reviewed literature, and compositional data were transformed using Isometric Log-Ratio (ILR) to generate physically interpretable balances and avoid constant-sum artifacts. Machine learning algorithms, including Gradient Boosting, CatBoost, and Multivariate Adaptive Regression Splines (MARS), were trained and evaluated using repeated 10 × 2 K-Fold cross-validation with preprocessing embedded within the loop. CatBoost achieved the highest accuracy, with 90.4% of predictions within ±20% of the measured PSV. Model interpretability using permutation feature importance and SHAP analysis identified meaningful drivers, highlighting the roles of CO₂/SO₃ versus the major-oxide framework, and silica-rich oxides versus CaO/MgO, consistent with petrographic expectations. The proposed workflow provides a practical and interpretable approach for predicting PSV from compositional data. It offers a time- and resource-efficient alternative to conventional laboratory tests, while also providing insight into the material factors that control aggregate polishing resistance. Limitations related to dataset size and inter-source variability are discussed.

Keywords: polished stone value (PSV); aggregates; compositional data analysis; machine learning; predictive modeling; feature importance

1. Introduction

Aggregates are fundamental for the construction and maintenance of civil engineering works, including buildings, structural elements, and transport infrastructure such as roads, railways, and airport runways. The construction sector used approximately 25.9 to 29.6 billion tons of aggregates in 2012 [1]. Aggregates are the primary component of concrete, typically comprising 70–85% of the mixture by weight [2], and they account for approximately 90–95% of the weight in asphalt mixtures [3]. This highlights the important role of aggregate properties in controlling final material behavior and their direct impact on the performance and service life of structures.

Consequently, accurate characterization and assessment of the evolution of aggregate properties are essential for optimizing the selection of raw material to ensure long-term structural integrity. Aggregate testing not only verifies their appropriateness for various construction applications but also serves as the basis for material specification [4,5].

This characterization is based on laboratory tests covering a wide range of properties including geometric, physical, mechanical, chemical, and petrographic properties. It enables the evaluation of aggregate shape, bulk density, porosity, chemical and mineralogical composition, as well as resistance to degradation processes such as fragmentation, abrasion, impact, and polishing. Polishing resistance, in particular, is of critical importance for aggregates used in the wearing courses of road pavements. It is assessed using the polished stone value (PSV), defined by the British Standard [6] as “the measure of the resistance of roadstone to the polishing action of vehicle tyres under conditions similar to those occurring on the surface of a road”. The degree of polish achieved is measured using the British Portable Skid Resistance Tester and expressed as the PSV [7]. PSV is one of the most widely used laboratory tests for characterizing the polishing resistance of aggregates and is commonly specified as an indicator of the long-term skid resistance potential of road surface materials [8,9]. It is routinely used as a specification parameter to define minimum performance requirements for aggregates in surface courses, with threshold values adjusted according to traffic intensity and functional road classification [10,11,12]. The relevance of PSV is closely linked to road safety, as reductions in pavement skid resistance have been consistently associated with an increased risk of accidents, particularly under wet conditions when friction levels fall below critical thresholds [13,14,15,16,17,18]. In this context, several studies have investigated the relationship between pavement skid resistance and the polishing behavior of aggregates, highlighting the influence of aggregate characteristics on long-term friction performance [8,19,20] and establishing the PSV as a key predictor variable in skid resistance models [21,22,23].

Due to its importance, this property has attracted considerable attention from the scientific community. Indeed, numerous studies have focused on parameters governing this property, particularly petrographic characteristics such as Relative Hardness (RHD) of aggregates, Differential Hardness (DH) and others [24,25], as well as mineral [7] and chemical composition [20]. These studies clearly indicate that the mineral and chemical composition of aggregates is a key factor in explaining their polishing behavior.

Some studies have focused on developing predictive models to estimate PSV, as the PSV test exhibits several practical laboratory limitations, including high aggregate and energy consumption, a long testing duration (approximately 6 h), and the requirement for qualified operators to ensure accurate sample preparation and execution [25,26]. For instance, Shabani et al. [25] combined an experimental approach with advanced statistical modeling to predict the PSV of aggregates based on selected physical and petrographic properties. The developed models revealed a strong correlation between PSV and RHD (relative hardness). Furthermore, it was observed that for homogeneous rocks, texture and mineralogical composition (particularly the presence of hard minerals within a soft matrix) have a more decisive influence on PSV than relative hardness alone.

In recent years, there has been a growing demand for the prediction of material behavior and performance under varying conditions. The advent of Machine Learning (ML) has provided a powerful tool for advancing the understanding and prediction of material performance, enabling researchers and engineers to model nonlinear relationships, analyze large experimental datasets, and optimize multiple influencing parameters simultaneously.

El-Ashwah et al. [27] investigated this approach with the aim of reducing reliance on costly and time-consuming experimental testing, and they also demonstrated the predictive power of ML by highlighting the correlations between quantitative texture and morphology parameters measured on different types of aggregates, namely the AI (Angularity Index), FI (Form Index), and STI (Surface Texture Index). These parameters served as input variables for both statistical analyses and ML models aimed at predicting pavement friction loss. The statistical analysis identified the key global characteristics influencing friction loss, while ML, using Random Forest Analysis (RFA), was employed to develop predictive models based on aggregate features, achieving excellent performance (R² > 0.97).

Additionally, Hussain et al. [28] demonstrated that petrographic characteristics, including mineralogy, texture, and porosity, are crucial indicators for governing and predicting the engineering performance of carbonate aggregates, as measured by Los Angeles abrasion (LAA), aggregate crushing value (ACV), aggregate impact value (AIV), specific gravity (SG), water absorption (WA), and unconfined compressive strength (UCS). ML models such as Random Forest, Gradient Boosting, CatBoost, and Multi-Layer Perceptron showed a clear advantage over traditional multiple regression methods. In that study, the Gradient Boosting model proved particularly effective, achieving excellent predictive performance (R² ≈ 0.997) for estimating engineering properties based on petrographic data.

According to the literature review, although aggregate polishing has been extensively investigated, chemical composition-based approaches for predicting accelerated polishing resistance remain largely underexplored. This study therefore proposes a data-driven framework to predict the PSV from aggregate chemical composition using ML techniques. Given the compositional nature of the data, the Isometric Log-Ratio Transformation (ILR) was applied to ensure a statistically coherent treatment of the variables. In addition, this study introduces an oxide-ratio-based approach to improve the interpretation of the relationships between chemical composition and aggregate polishing behavior. This integrated framework provides a novel perspective for linking aggregate chemistry to polishing resistance performance and contributes to advancing PSV prediction methodology.

2. Methodology

In this study, a systematic ML framework was developed to model the relationship between the chemical and mineralogical composition of aggregates and their polishing resistance, based on a dataset compiled from previously published experimental studies. Several predictive algorithms were implemented and compared, including tree-based ensemble models (Gradient Boosting, CatBoost version 1.2.10) and the non-parametric Multivariate Adaptive Regression Splines (MARS) approach.

Model calibration and validation were performed using repeated K-fold cross-validation to ensure statistical robustness, with model accuracy estimated through standard performance indicators such as the Mean Absolute Error (MAE), Root Mean Square Error (RMSE), and the coefficient of determination (R²). To enhance the interpretability and physical transparency of the results, feature relevance and interaction effects were analyzed through Permutation Feature Importance (PFI) and SHapley Additive exPlanations (SHAP). This combined methodological approach aims to balance predictive performance with interpretability, providing a reliable framework for assessing how the chemical and mineralogical characteristics of aggregates influence their polishing resistance. Figure 1 summarizes the main steps of the methodology.

2.1. Description of Database

To develop a predictive model and carry out statistical analysis, one can use either experimental laboratory test results or a dataset compiled from previously published studies. In the present study, a total of 87 data samples describing the PSV of various natural aggregates were collected from previously published journal articles [8,19,20,21,22,23,24,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44]. The selected studies report both chemical compositions (major oxides) and mineralogical proportions, together with corresponding PSV test results.

This literature-based approach enables the creation of a comprehensive and diverse dataset, covering a wide range of lithologies (limestone, dolomite, basalt, granite, etc.) and PSV values, thereby providing a robust foundation for subsequent statistical analysis and ML modeling.

2.2. Statistical Description of Database

Descriptive statistics of the main chemical and mineralogical variables are summarized in Table 1. The dataset exhibits heterogeneous distributions with CaO (24.8%) and SiO₂ (28.1%) dominating, while most accessory phases (e.g., Hematite, Magnetite, Montmorillonite) occur at less than 1%. The PSV (i.e., the response variable) averages around 58.5 (CoV = 0.20), consistent with medium-to-high polishing resistance.

Dispersion metrics (standard deviation, CoV) indicate that several variables exhibit high variability relative to their means. Biotite (CoV ≈ 4.36), Hornblende (CoV ≈ 5.46), and Montmorillonite (CoV ≈ 6.88) present very unstable distributions, suggesting sporadic occurrences and potential influence of local mineralogical heterogeneities. In contrast, major oxides such as CaO (CoV ≈ 0.96) and SiO₂ (CoV ≈ 0.90) exhibit more consistent distributions, reflecting their fundamental role in rock-forming minerals.

Analysis of distribution shapes shows strong deviations from normality for several variables. Most accessory phases present highly positive skewness (e.g., Hematite, Illite, Magnetite) and very high kurtosis (>30), reflecting near-zero concentrations with occasional extreme outliers. Conversely, the target variable PSV is nearly symmetric (skewness ≈ −0.03) with low kurtosis (≈0.21), indicating a relatively homogeneous distribution across the dataset.
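As an illustration of the dispersion and shape indicators discussed above, the following minimal sketch computes the CoV, skewness, and kurtosis of a single variable. It assumes population (biased) moments and reports excess kurtosis; the source does not state which estimator conventions were used, and the `accessory` values are hypothetical rather than taken from the database.

```python
import numpy as np

def describe(x):
    """CoV, skewness, and excess kurtosis from population (biased) moments."""
    x = np.asarray(x, dtype=float)
    mean = x.mean()
    dev = x - mean
    # Central moments of order 2, 3, and 4.
    m2, m3, m4 = (np.mean(dev ** p) for p in (2, 3, 4))
    return {
        "cov": np.sqrt(m2) / mean,
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,
    }

# Hypothetical zero-inflated accessory mineral (wt%): mostly absent, two occurrences.
accessory = np.array([0.0] * 18 + [5.0, 20.0])
stats = describe(accessory)
for name, value in stats.items():
    print(f"{name}: {value:.2f}")
```

A variable that is zero in most samples and large in a few yields a CoV well above 1 together with strongly positive skewness and kurtosis, which is the pattern reported for the accessory phases.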

The number of distinct values varies considerably, ranging from very low cardinality (e.g., Siderite, SrO = constant) to highly diverse distributions (e.g., CaO = 66 unique values, SiO₂ = 64). This heterogeneity has direct implications for model training: variables with low diversity are unlikely to enhance predictive performance, whereas broadly distributed variables (CaO, SiO₂, Quartz) are expected to carry greater discriminative power.

Overall, the statistical profiling confirms the relevance of PSV as the target variable and identifies a subset of major components (CaO, SiO₂, Quartz, Calcite, Dolomite) as the most informative predictors. The pronounced skewness and kurtosis of accessory minerals justify the use of advanced resampling or robust modeling approaches (e.g., SMOGN, tree-based learners) to mitigate distributional imbalances.

In addition to the statistical indicators, the distribution plots presented in Figure 2 provide a visual confirmation of the heterogeneous behavior of the dataset. The histograms highlight that most accessory minerals are concentrated near zero with occasional pronounced outliers, whereas major oxides and dominant phases (e.g., CaO, SiO₂, Quartz) exhibit broader and more continuous distributions. Overall, these plots complement the descriptive statistics by illustrating the contrast between sparse, zero-inflated variables and stable, continuous ones, thereby reinforcing the need for careful feature selection and robust learning strategies.

2.3. Feature Selection

The selection of predictors was guided by a combination of geological knowledge and statistical reasoning. Chemically, major oxides provide fundamental descriptors of aggregates, while mineralogical proportions can be partially estimated from these oxides in combination with petrographic analysis. For example, SiO₂ is associated with quartz, CaO with calcite/dolomite, and Al₂O₃ + K₂O/Na₂O with feldspars. Thus, including both chemical and mineralogical variables could introduce redundancy and increase collinearity without adding meaningful information. Statistical profiling supports this choice: major oxides exhibit broad ranges, moderate variability, and many distinct values, making them informative for predictive modeling, whereas mineralogical variables are often skewed, sparsely distributed, and contribute little to model performance. Consequently, the analysis was limited to chemical oxides only, ensuring a parsimonious set of predictors, better physical interpretability, and enhanced model generalization. This approach lays the foundation for the subsequent analysis, focusing on the predictive capacity of chemical oxides in estimating the PSV.

2.4. Data Transformation

2.4.1. Nature of Data

The input variables represent chemical oxide compositions that inherently sum to a constant (e.g., 100%). Such data are therefore compositional, meaning that each component carries only relative information rather than absolute values [45].

This property constrains the dataset to a simplex, a reduced subset of the n-dimensional Euclidean space Rⁿ. Standard statistical and ML algorithms, however, generally assume unconstrained variability in Euclidean space. Directly applying models in Rⁿ can lead to spurious correlations, distorted distances, and misleading interpretations. To overcome these issues, compositional data must be transformed into an unconstrained space using log-ratio transformations, which preserve the relative structure of the compositions while enabling the application of standard ML techniques.
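The spurious-correlation effect can be reproduced with a small simulation. The sketch below is illustrative only: it draws three independent positive parts from a lognormal distribution (an arbitrary choice, not the study's oxide data), applies closure, and compares the correlation between the first two parts before and after.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three independent, strictly positive "raw" parts (hypothetical data).
raw = rng.lognormal(mean=0.0, sigma=0.5, size=(5000, 3))

# Closure: rescale each row so that its parts sum to 1.
closed = raw / raw.sum(axis=1, keepdims=True)

corr_raw = np.corrcoef(raw[:, 0], raw[:, 1])[0, 1]
corr_closed = np.corrcoef(closed[:, 0], closed[:, 1])[0, 1]

print(f"correlation before closure: {corr_raw:+.3f}")
print(f"correlation after closure:  {corr_closed:+.3f}")
```

Although the raw parts are independent, the closed parts are negatively correlated simply because they must sum to one; this is the constant-sum artifact that log-ratio transformations are meant to remove.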

2.4.2. Isometric Log-Ratio

The isometric log-ratio (ILR) transformation converts compositional vectors from the constrained simplex into an orthonormal Euclidean space, where each transformed coordinate (called an ILR balance) represents the logarithmic ratio between groups of components. Practically, each balance expresses how one subset of oxides dominates or balances another (e.g., siliceous vs. calcareous phases). This transformation removes the constant-sum constraint, eliminates spurious correlations, and allows standard statistical and ML algorithms to operate correctly. Moreover, ILR coordinates retain physical interpretability, as variations along an ILR axis reflect relative changes between chemically or functionally meaningful groups of constituents. The new ILR variables are derived from the following steps:

Let $x_{1}, x_{2}, \dots, x_{D}$ be the raw compositional parts (e.g., oxide contents), where $D$ denotes the total number of compositional components considered in the raw vector $x$:

x = (x_{1}, x_{2}, \dots, x_{D}), \quad x_{i} > 0

(1)

Closure rescales the parts so that $\sum_{i} C_{i} = 1$:

C_{i} = \frac{x_{i}}{\sum_{j = 1}^{D} x_{j}}

(2)

For a given balance $\mathrm{ILR}_{n}$, defined by two disjoint groups of components, $A_{n}$ of size $r_{n}$ and $B_{n}$ of size $s_{n}$, with $i$ and $j$ indexing the components belonging to $A_{n}$ and $B_{n}$, respectively, the ILR coordinate is given by:

\mathrm{ILR}_{n} = \sqrt{\frac{r_{n} s_{n}}{r_{n} + s_{n}}} \ln \left( \frac{\left( \prod_{i \in A_{n}} x_{i} \right)^{\frac{1}{r_{n}}}}{\left( \prod_{j \in B_{n}} x_{j} \right)^{\frac{1}{s_{n}}}} \right)

(3)
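Equations (2) and (3) can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names `closure` and `ilr_balance`, the four-oxide sample, and the grouping of (SiO₂, Al₂O₃) against (CaO, MgO) are assumptions chosen for the example.

```python
import numpy as np

def closure(x):
    """Rescale strictly positive parts so that they sum to 1 (Equation (2))."""
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("all parts must be strictly positive")
    return x / x.sum(axis=-1, keepdims=True)

def ilr_balance(x, group_a, group_b):
    """ILR balance between two disjoint groups of part indices (Equation (3))."""
    x = np.asarray(x, dtype=float)
    r, s = len(group_a), len(group_b)
    # Log of a geometric mean equals the mean of the logs.
    log_gm_a = np.log(x[..., group_a]).mean(axis=-1)
    log_gm_b = np.log(x[..., group_b]).mean(axis=-1)
    return np.sqrt(r * s / (r + s)) * (log_gm_a - log_gm_b)

# Hypothetical sample: SiO2, Al2O3, CaO, MgO (wt%); values are illustrative only.
sample = np.array([60.0, 15.0, 20.0, 5.0])
c = closure(sample)

# Balance of (SiO2, Al2O3) against (CaO, MgO), on raw and on closed parts.
b_raw = ilr_balance(sample, [0, 1], [2, 3])
b_closed = ilr_balance(c, [0, 1], [2, 3])
print(b_raw, b_closed)
```

Here the geometric means of the two groups are 30 and 10, so the balance equals ln 3. The raw and closed inputs give the same value because a ratio of geometric means is unaffected by rescaling all parts by a common factor.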

2.4.3. Data Oversampling

Given the limited dataset size (87 instances) and the skewed distribution of the target variable, we adopted Regression SMOTE (reg-SMOTE) [46], an extension of SMOTE, originally proposed for classification [47], to handle continuous targets. reg-SMOTE identifies rare regions of the response using a relevance function and generates synthetic samples by interpolation between each rare observation and its k nearest rare neighbors in the feature space, with the corresponding target values interpolated in the same proportion. To respect the geometry of compositional predictors, this process was carried out in the ILR space (after closure and isometric log-ratio transformation). Formally, for a rare instance i and a neighbor j in its k-NN set, with u ~ U(0,1):

x^{(\mathrm{syn})} = x_{i} + u (x_{j} - x_{i}) \quad \text{and} \quad y^{(\mathrm{syn})} = y_{i} + u (y_{j} - y_{i})

(4)

This mechanism preserves local structure while enriching under-represented regions of the target distribution. The application of reg-SMOTE increased the dataset size from 87 to 146 instances, enriching the tails of the distribution for both the target PSV and several ILR balances/chemical components (Table 2), while preserving the overall shape of the original distributions (Figure 3). Rare values became better represented, reducing imbalance without introducing unrealistic shifts. The hyperparameters adopted for the reg-SMOTE procedure are summarized in Table 3.
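The interpolation step of Equation (4) can be sketched as follows. This is a simplified illustration under stated assumptions: the relevance function that selects rare instances is omitted (the inputs are taken to be the rare subset already), the function name `interpolate_rare` and its parameters are hypothetical, and the toy coordinates stand in for ILR balances.

```python
import numpy as np

def interpolate_rare(X_rare, y_rare, k=3, n_new=5, seed=0):
    """Generate synthetic (x, y) pairs by interpolating between rare neighbours."""
    rng = np.random.default_rng(seed)
    X_rare = np.asarray(X_rare, dtype=float)
    y_rare = np.asarray(y_rare, dtype=float)
    # Pairwise Euclidean distances among rare instances.
    d = np.linalg.norm(X_rare[:, None, :] - X_rare[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)  # an instance is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    X_syn, y_syn = [], []
    for _ in range(n_new):
        i = rng.integers(len(X_rare))   # rare seed instance
        j = rng.choice(neighbours[i])   # one of its k nearest rare neighbours
        u = rng.uniform()               # u ~ U(0, 1)
        X_syn.append(X_rare[i] + u * (X_rare[j] - X_rare[i]))
        y_syn.append(y_rare[i] + u * (y_rare[j] - y_rare[i]))
    return np.array(X_syn), np.array(y_syn)

# Toy rare subset: four points in a two-dimensional feature space.
X_rare = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y_rare = np.array([70.0, 72.0, 74.0, 76.0])
X_syn, y_syn = interpolate_rare(X_rare, y_rare, k=2, n_new=5)
print(X_syn.shape, y_syn.round(2))
```

Because each synthetic point lies on the segment between two existing rare points, both its features and its target stay within the range spanned by the rare subset.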

The comparison of the variance inflation factor (VIF) (Figure 4) indicates that collinearity levels remain moderate after augmentation: although slight increases are observed for correlated oxides (e.g., Fe₂O₃, SiO₂, TiO₂), no variable exceeds critical thresholds, and the relative ranking of VIF values remains stable, providing evidence that the dependency structure of the features is preserved.
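For reference, the VIF of a feature is 1 / (1 − R²), where R² comes from regressing that feature on all the others. The sketch below is a plain NumPy illustration on synthetic columns; the helper name `vif` and the data are assumptions, and the source does not say which software computed the values in Figure 4.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column: 1 / (1 - R^2) from regressing it on the others."""
    X = np.asarray(X, dtype=float)
    out = []
    for j in range(X.shape[1]):
        y = X[:, j]
        # Design matrix: intercept plus all remaining columns.
        A = np.column_stack([np.ones(len(X)), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2 = 1.0 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return np.array(out)

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)                 # independent of the others
x3 = x1 + 0.1 * rng.normal(size=500)      # nearly collinear with x1
v = vif(np.column_stack([x1, x2, x3]))
print(v.round(2))
```

The independent column has a VIF close to 1, while the two nearly collinear columns have much larger values.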

To prevent data leakage, reg-SMOTE was applied strictly within the training folds during cross-validation, ensuring that synthetic instances did not influence the evaluation of held-out test sets. These results show that reg-SMOTE produced a richer and more balanced dataset while maintaining statistical coherence, providing a robust and unbiased basis for subsequent ML modeling.

3. Machine Learning Models Used

Once the data were processed, our focus shifted to developing the ML model for predicting the PSV. Based on the size and nature of the available dataset, we considered three complementary learners for tabular regression to evaluate their predictive capabilities.

3.1. Gradient Boosting Regression (GBR)

Gradient Boosting Regression aggregates shallow decision trees to capture non-linearities and interactions, with explicit control of the bias-variance trade-off via learning rate, tree depth, and ensemble size. It is a strong baseline but typically requires careful tuning and offers no native support for categorical variables [48].

3.2. CatBoost

CatBoost enhances Gradient Boosting with ordered boosting and native categorical encoding, which can reduce target leakage and improve generalization with limited preprocessing. Notably, its ordered scheme often makes it competitive on small-to-moderate datasets by curbing overfitting, albeit at increased training cost and reduced transparency relative to simpler baselines [49].

3.3. Multivariate Adaptive Regression Splines (MARS)

Multivariate Adaptive Regression Splines approximate responses with piecewise linear basis functions and data-driven knots, yielding interpretable terms and revealing thresholds and local effects while being more sensitive to collinearity and less reliable for extrapolation [50].

In practice, GBR serves as a robust reference; CatBoost is applied when predictive accuracy is the priority, particularly for heterogeneous or smaller datasets; and MARS is preferred when interpretability and mechanism-oriented insight are important, particularly with limited sample sizes.

3.4. Model Validation (Repeated K-Fold Cross-Validation)

Generalization is assessed using repeated K-fold cross-validation, in which a standard K-fold split is repeated R times with different random partitions, yielding R × K train/validation evaluations. Estimator variance is reduced compared to a single split, producing a more stable performance estimate on small datasets. In this study, K = 10 and R = 2 are used [51] (10 × 2 CV): in each iteration, data are partitioned into 10 folds. The models are trained on nine folds and evaluated on the held-out fold. The procedure is repeated with a new random shuffle. All preprocessing steps (closure, ILR transformation, scaling, and, when applicable, reg-SMOTE oversampling) are performed within the training folds to prevent information leakage from the validation set to the training set. The mean and standard deviation of MAE, RMSE, and R² across the 20 runs are reported, and hyperparameters are selected by minimizing the average validation error. A fixed random seed is used where feasible for reproducibility, and shuffling is applied at each repeat to decorrelate partitions.
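A minimal sketch of this 10 × 2 scheme with scikit-learn is given below. It is an illustration under assumptions rather than the study's code: the data are synthetic (87 rows, eight arbitrary features standing in for ILR coordinates), only scaling is shown as in-fold preprocessing, and reg-SMOTE is left out because it is not a standard scikit-learn transformer.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import RepeatedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in data; not the study's database.
X = rng.normal(size=(87, 8))
y = 55.0 + 6.0 * X[:, 0] - 4.0 * X[:, 1] + rng.normal(scale=2.0, size=87)

# The scaler sits inside the pipeline, so it is refit on the training folds only.
model = make_pipeline(StandardScaler(), GradientBoostingRegressor(random_state=0))

# K = 10 folds repeated R = 2 times gives 20 train/validation evaluations.
cv = RepeatedKFold(n_splits=10, n_repeats=2, random_state=0)
scores = cross_validate(
    model, X, y, cv=cv,
    scoring=("neg_mean_absolute_error", "neg_root_mean_squared_error", "r2"),
)

mae = -scores["test_neg_mean_absolute_error"]
rmse = -scores["test_neg_root_mean_squared_error"]
r2 = scores["test_r2"]
print(f"MAE  {mae.mean():.2f} ± {mae.std():.2f}")
print(f"RMSE {rmse.mean():.2f} ± {rmse.std():.2f}")
print(f"R²   {r2.mean():.2f} ± {r2.std():.2f}")
```

Wrapping preprocessing and estimator in one pipeline is what keeps validation rows out of every fitted transformation; any oversampling step would need the same treatment.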

Model output was evaluated using several performance indicators: mean absolute error (MAE), root mean square error (RMSE), and coefficient of determination (R²) (Equations (5)–(7)).

In this section, we present the regression metrics used to evaluate model performance and interpret their meaning:

MAE indicates the average magnitude of prediction errors in the original units, treating all deviations linearly.

\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} |y_{i} - \hat{y}_{i}|

(5)

RMSE indicates the typical size of errors while penalizing larger deviations more heavily (quadratically), and expresses the results in the original units.

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}

(6)

R² indicates the proportion of variance in the observed data explained by the model relative to a mean-only baseline (closer to 1 is better).

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}

(7)

where:

•: $y_{i}$ denotes the observed PSV value for observation $i$,
•: $\hat{y}_{i}$ denotes the corresponding predicted value,
•: $n$ is the total number of observations in the test set (or in a validation fold),
•: $\bar{y} = \frac{1}{n} \sum_{i = 1}^{n} y_{i}$ is the mean of the observed values.
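Equations (5)–(7) translate directly into NumPy. The short sketch below uses three made-up observed and predicted PSV values purely to show the arithmetic.

```python
import numpy as np

def mae(y, y_hat):
    """Mean absolute error, Equation (5)."""
    return np.mean(np.abs(y - y_hat))

def rmse(y, y_hat):
    """Root mean square error, Equation (6)."""
    return np.sqrt(np.mean((y - y_hat) ** 2))

def r2(y, y_hat):
    """Coefficient of determination, Equation (7)."""
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# Illustrative values only.
y = np.array([50.0, 60.0, 70.0])
y_hat = np.array([52.0, 58.0, 73.0])

print(f"MAE  = {mae(y, y_hat):.3f}")   # (2 + 2 + 3) / 3
print(f"RMSE = {rmse(y, y_hat):.3f}")  # sqrt((4 + 4 + 9) / 3)
print(f"R2   = {r2(y, y_hat):.3f}")    # 1 - 17 / 200
```

The single error of 3 units weighs more in RMSE than in MAE, which is why RMSE comes out slightly higher here.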

3.5. Feature Importance Assessment and Interpretation

3.5.1. Permutation Feature Importance

Permutation Feature Importance (PFI) evaluates a model’s reliance on each variable by measuring how the model’s performance deteriorates when the variable’s values are randomly permuted while the others remain unchanged. A larger post-permutation error indicates greater importance. Repeating the procedure and averaging across the repeated K-folds provides a stable estimate. The approach is easy to implement, model-agnostic, and reflects the trained model’s actual usage of features, though it can understate the role of correlated or interchangeable variables; using grouped or conditional permutations helps address this limitation, and the evaluation metric should be stated explicitly.

In this study, R² was selected as the evaluation metric, as it quantifies the proportion of variance in the response explained by the model relative to a mean-only baseline, enabling a straightforward comparison across algorithms and folds.
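The permutation procedure with R² as the metric can be sketched as below. This is a simplified, model-agnostic illustration: the function name `permutation_importance_r2`, the synthetic data, and the least-squares stand-in model are assumptions, and the importances are computed on the fitting data for brevity rather than on held-out folds as in the study.

```python
import numpy as np

def r2_score(y, y_hat):
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def permutation_importance_r2(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in R^2 when each column of X is shuffled in turn."""
    rng = np.random.default_rng(seed)
    base = r2_score(y, predict(X))
    drops = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        for _ in range(n_repeats):
            Xp = X.copy()
            Xp[:, j] = rng.permutation(Xp[:, j])  # break the link between column j and y
            drops[j] += base - r2_score(y, predict(Xp))
    return drops / n_repeats

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
# Column 0 matters most, column 1 less, column 2 not at all.
y = 3.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

# Hypothetical stand-in model: ordinary least squares with an intercept.
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

def predict(Z):
    return np.column_stack([np.ones(len(Z)), Z]) @ coef

importance = permutation_importance_r2(predict, X, y)
print(importance.round(3))
```

Any fitted model exposing a prediction function can be passed in place of `predict`, which is what makes the method model-agnostic.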

3.5.2. SHapley Additive exPlanations

SHAP (SHapley Additive exPlanations) quantifies each feature’s contribution to an individual prediction using cooperative game theory, producing additive, signed attributions that sum to the model output relative to a baseline and indicate whether a feature drives the prediction upward or downward. Local attributions can be aggregated into global summaries that reveal which features matter most and how their values influence the response. In this study, SHAP was applied only to the best performing model to provide instance level insights and a global importance profile that complements PFI.
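The additive property can be checked on a toy function by computing exact Shapley values by enumeration. This sketch is not the SHAP library and not the fitted CatBoost model: it assumes a single fixed baseline vector for "absent" features (SHAP implementations typically average over background data), and the `model` function is hypothetical.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley attributions of f(x) relative to f(baseline).

    Features outside a coalition are set to their baseline value.
    """
    n = len(x)

    def value(coalition):
        z = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return f(z)

    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of a coalition of this size in the Shapley average.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for s in combinations(others, size):
                phi[i] += weight * (value(set(s) | {i}) - value(set(s)))
    return phi

# Toy model with an interaction between features 0 and 2.
def model(z):
    return 50.0 + 4.0 * z[0] - 2.0 * z[1] + 1.0 * z[0] * z[2]

x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(model, x, baseline)
print(phi, sum(phi), model(x) - model(baseline))
```

The attributions are signed (feature 1 pushes the prediction down), they sum to the difference between the prediction and the baseline output, and the interaction term is split equally between the two features involved.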

4. Results and Discussion

4.1. Models’ Performances

The predicted-versus-observed (precision) plots (Figure 5) assess visual calibration by analyzing the proximity to the 45° identity line, trend slope/intercept, residual spread, and stability across the range.

For CatBoost, the curve closely follows the identity line with a slope near one and negligible intercept bias. Residual bands are tight across low–high values, with no evident fan shape, indicating limited heteroscedasticity and minimal shrinkage toward the mean. Tail fidelity is comparatively strong.

For GBR, the central region is well captured, although mild mean reversion appears: lower observed values are slightly overpredicted and higher observed values are slightly underpredicted, resulting in a slope just below one. Dispersion increases toward the extremes, suggesting that deeper trees or adjusted regularization could improve tail fitting.

Alignment with the identity is reasonable for the MARS model in the midrange. However, deviations intensify near the boundaries, and residual bands widen and become asymmetric. This pattern is consistent with piecewise linear bases that capture local structure yet struggle with higher order interactions and exhibit heteroscedasticity in the tails.

CatBoost shows the most consistent calibration and precision, GBR is competitive with modest tail bias, and MARS offers interpretability at the cost of reduced tail accuracy. These patterns are aligned with each method’s inductive bias and capacity.

Using out-of-fold (OOF) predictions from identical CV splits (n = 146), CatBoost achieved the best overall performance across all metrics (Table 4). It showed the highest agreement between observed and predicted values (r = 0.867; p < 0.001) and the largest explained variance (R² = 0.749; Adj-R² = 0.736). In absolute error terms, CatBoost reduced RMSE to 6.76 and MAE to 3.98, corresponding to improvements of 11.46% and 26.9% over MARS and of 2.30% and 3.17% over Gradient Boosting (GB), respectively. Relative error was also the lowest (MAPE = 7.07%), representing a 36.6% reduction versus MARS and 6.53% versus GB. The share of “practically accurate” predictions (a20, within ±20% of the measured value) reached 90.41% with CatBoost, +8.22 percentage points over MARS and +0.69 over GB. Variance accounted for (VAF) followed the same pattern (75.14% for CatBoost vs. 73.94% for GB and 67.99% for MARS). The composite performance index (PI) ranked the models as CatBoost > GB > MARS.

GB ranked second overall (r = 0.860; R² = 0.737; RMSE = 6.92; MAPE = 7.57%; a20 = 89.73%). Relative to MARS, GB lowered the RMSE by 9.38%, MAPE by 32.2%, and increased a20 by 7.53 percentage points. MARS, while the most interpretable, showed the highest errors (RMSE = 7.64; MAPE = 11.16%) and the lowest agreement (r = 0.826; R² = 0.680). For all models the correlation tests were highly significant (p ≪ 0.001), indicating the observed relationships are very unlikely under the null of no association. Nevertheless, statistical significance does not replace the need to consider effect size or practical accuracy.

CatBoost offers the best trade-off between explanatory power and predictive accuracy on this dataset (≈75% variance explained, ≈7% mean relative error, and >90% predictions within ±20%), with GB close behind and MARS trailing primarily due to larger systematic errors.
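The relative-error indicators used in this comparison can be illustrated with a few lines of NumPy. The definitions below are the commonly used ones and are assumed here, since the text does not spell them out: a20 as the percentage of predictions with |ŷ − y| ≤ 0.2·y, MAPE as the mean of |ŷ − y| / y, and VAF as 1 − Var(y − ŷ) / Var(y). The four value pairs are made up.

```python
import numpy as np

def a20(y, y_hat):
    """Percentage of predictions within ±20% of the measured value (assumed definition)."""
    return 100.0 * np.mean(np.abs(y_hat - y) <= 0.2 * np.abs(y))

def mape(y, y_hat):
    """Mean absolute percentage error (assumed definition)."""
    return 100.0 * np.mean(np.abs(y_hat - y) / np.abs(y))

def vaf(y, y_hat):
    """Variance accounted for, in percent (assumed definition)."""
    return 100.0 * (1.0 - np.var(y - y_hat) / np.var(y))

# Illustrative values only; the last prediction is off by 25%.
y = np.array([50.0, 60.0, 70.0, 40.0])
y_hat = np.array([52.0, 58.0, 73.0, 50.0])

print(f"a20  = {a20(y, y_hat):.2f}%")
print(f"MAPE = {mape(y, y_hat):.2f}%")
print(f"VAF  = {vaf(y, y_hat):.2f}%")
```

Three of the four predictions fall inside the ±20% band, so a20 is 75%, while the one large miss dominates the MAPE.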

The coefficient of determination (R²) indicates that the model explains 74.9% of the variance in the target PSV values from the explanatory variables (chemical components). Nevertheless, approximately 25% of the variance remains unexplained, primarily due to the dataset’s size and quality, a limitation noted in several previous studies [26,28,52]. Therefore, expanding the dataset’s size and representativeness could enhance the model’s predictive strength and generalizability. It is also crucial to acknowledge that the dataset for this study was compiled from various published studies conducted in different countries and laboratories, which may have influenced the results.

4.2. PFI—Permutation Feature Importance

The permutation importance profiles (Figure 6) show a strong level of consistency across the three models, indicating that only a small number of ILR balances carry most of the predictive information despite the limited dataset. Among these, the balance contrasting CO₂/SO₃ with the major oxides systematically emerges as the most influential (or close to it) in both CatBoost and GBR, and it remains highly ranked in MARS as well. This pattern suggests that relative increases in CO₂/SO₃ compared with the other oxides have a substantial and stable association with the target variable.

Balances comparing SiO₂ with (Al₂O₃, Fe₂O₃, TiO₂) and the broader silica-rich group (SiO₂, Al₂O₃, Fe₂O₃, TiO₂) against (CaO, MgO) also demonstrate consistently strong contributions. These findings highlight the importance of compositional contrasts between silica-bearing phases, calcium–magnesium oxides, and alumina–iron–titania oxides, which are chemically plausible and commonly reported in similar compositional analyses.

Intermediate effects are observed for the Na₂O/K₂O balances, whether evaluated jointly against (CaO, MgO) or separately as Na₂O versus K₂O. Their importance tends to be higher in CatBoost and GBR compared to MARS, which is consistent with the ability of tree-based models to detect interaction patterns and threshold behaviors that the piecewise-linear structure of MARS tends to smooth out. By contrast, the P₂O₅ balance appears systematically weaker, particularly in MARS, suggesting either a genuinely limited role or heightened uncertainty due to the small sample size.

Overall, CatBoost and GBR exhibit very similar importance rankings, indicating stable feature ordering across boosted-tree implementations. MARS identifies the same dominant signals but compresses the differences between features, which aligns with its more constrained, additive formulation. Since importances are normalized using an R²-based scale, values close to 1 correspond to the greatest observed degradation in model performance after permutation. Minor numerical differences (on the order of a few hundredths) should thus be interpreted cautiously given the dataset size.
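
The permutation procedure described above can be sketched in a few lines. The example below is a dependency-free illustration, not the code used in the study: `toy_model` is a hypothetical stand-in for a fitted regressor, the data are synthetic, and dividing by the largest R² drop is only one possible reading of the "R²-based scale" mentioned in the text.

```python
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination of predictions against observations."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def permutation_importance(predict, X, y, n_repeats=20, seed=0):
    """Mean drop in R² when each column of X is shuffled on its own."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    drops = []
    for j in range(len(X[0])):
        total = 0.0
        for _ in range(n_repeats):
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            total += baseline - r2_score(y, [predict(row) for row in X_perm])
        drops.append(total / n_repeats)
    return drops


def normalize_to_max(drops):
    """Rescale so that the most damaging permutation has importance 1."""
    top = max(drops)
    return [d / top for d in drops]


def toy_model(row):
    # Hypothetical fitted model: uses balances 0 and 1, ignores balance 2.
    return 60.0 - 8.0 * row[0] + 3.0 * row[1]


# Synthetic data standing in for ILR balances and measured PSV.
data_rng = random.Random(42)
X = [[data_rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [toy_model(row) + data_rng.gauss(0, 1) for row in X]

importances = normalize_to_max(permutation_importance(toy_model, X, y))
print([round(v, 3) for v in importances])
```

In this toy setting the unused third balance receives an importance of exactly zero, because shuffling it cannot change any prediction, while the two used balances are ranked by the size of their effect.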

Despite these consistent patterns, the rankings must be interpreted with care:

• Permuting single balances can spread importance across correlated ILR components, which is expected behavior with compositional data. Group-based permutations of chemically related balances can help assess robustness.
• Fold-wise variability (e.g., confidence intervals from repeated k-fold cross-validation) should be examined to distinguish reliable signals from sampling noise.
• SHAP analysis of the best-performing model can serve as an additional check, ensuring that highly ranked balances also show coherent signed effects at the instance level.
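
The group-based permutation mentioned in the first point can be illustrated as follows. This is a minimal sketch under stated assumptions: the data are synthetic, `toy_model` is a hypothetical stand-in for a fitted model, and the spread is computed across permutation repeats rather than across cross-validation folds. All columns in a group are shuffled with the same row order, so the dependence between grouped balances is preserved while their link to the target is broken.

```python
import random
from statistics import mean, stdev


def r2_score(y_true, y_pred):
    """Coefficient of determination of predictions against observations."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def grouped_permutation_drops(predict, X, y, group, n_repeats=30, seed=0):
    """R² drop per repeat when the columns in `group` are shuffled together."""
    rng = random.Random(seed)
    baseline = r2_score(y, [predict(row) for row in X])
    drops = []
    for _ in range(n_repeats):
        order = list(range(len(X)))
        rng.shuffle(order)
        X_perm = [list(row) for row in X]
        # Each row receives all grouped values from one and the same donor row.
        for target, donor in enumerate(order):
            for j in group:
                X_perm[target][j] = X[donor][j]
        drops.append(baseline - r2_score(y, [predict(row) for row in X_perm]))
    return drops


# Synthetic data: two strongly correlated balances sharing one signal.
data_rng = random.Random(7)
signal = [data_rng.gauss(0, 1) for _ in range(300)]
X = [[z + data_rng.gauss(0, 0.1), z + data_rng.gauss(0, 0.1)] for z in signal]
y = [60.0 - 5.0 * z + data_rng.gauss(0, 1) for z in signal]


def toy_model(row):
    # Hypothetical fitted model that splits the signal across both columns.
    return 60.0 - 2.5 * row[0] - 2.5 * row[1]


single_0 = grouped_permutation_drops(toy_model, X, y, group=[0])
single_1 = grouped_permutation_drops(toy_model, X, y, group=[1])
joint = grouped_permutation_drops(toy_model, X, y, group=[0, 1])

for name, drops in [("balance 0", single_0), ("balance 1", single_1), ("joint", joint)]:
    print(f"{name}: mean drop = {mean(drops):.3f}, sd = {stdev(drops):.3f}")
```

Shuffling either balance alone leaves its correlated partner intact, so the joint drop is larger than either single drop in this example. The standard deviation across repeats gives a first impression of how stable each estimate is.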

Taken together, the results indicate that the strongest and most stable information lies in the balances contrasting CO₂/SO₃ with the major oxides and in the contrasts between silica-dominated oxides and both calcium/magnesium and alumina/iron/titania components. Sodium–potassium contrasts contribute moderately and appear more sensitive to interaction effects. These conclusions are consistent across the modeling approaches, but they should be interpreted with caution due to potential collinearity and the inherently limited sample size.

4.3. SHAP—SHapley Additive exPlanations for CatBoost Model

The SHAP summary of the CatBoost model, presented in Figure 7, indicates that a limited subset of ILR balances accounts for the majority of predictive power. The beeswarm plots also provide direct evidence of nonlinearity and feature interactions: non-monotonic color gradients and sign reversals along the SHAP axis reveal regime shifts, while substantial vertical dispersion for a fixed feature value, accompanied by mixed colors, suggests that effects depend on other balances.
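
The vertical dispersion described above, where the same feature value receives different SHAP values depending on other balances, can be reproduced with a brute-force Shapley computation. The sketch below is illustrative only: it enumerates all feature subsets against a single reference row, which is feasible for a handful of features but is not the tree-specific algorithm normally used with boosted models, and `toy_model` with its threshold is a hypothetical stand-in for the fitted CatBoost model.

```python
from itertools import combinations
from math import factorial


def exact_shapley(predict, x, background):
    """Exact Shapley values for one instance.

    Features outside a coalition are set to the values of one background row.
    """
    n = len(x)

    def value(present):
        mixed = [x[j] if j in present else background[j] for j in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                present = set(subset)
                total += weight * (value(present | {i}) - value(present))
        phi.append(total)
    return phi


def toy_model(row):
    # Hypothetical model: balance 1 only matters once balance 0 is positive.
    interaction = 4.0 * row[1] if row[0] > 0 else 0.0
    return 60.0 - 6.0 * row[0] + interaction


background = [0.0, 0.0, 0.0]
instance_a = [1.0, 2.0, 0.5]   # balance 0 above the threshold
instance_b = [-1.0, 2.0, 0.5]  # same balance 1, balance 0 below the threshold

phi_a = exact_shapley(toy_model, instance_a, background)
phi_b = exact_shapley(toy_model, instance_b, background)
print("instance a:", phi_a)
print("instance b:", phi_b)
```

Both instances share the same value of balance 1, yet its contribution differs between them because of the threshold on balance 0. In each case the contributions sum to the difference between the prediction for the instance and the prediction for the background row.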

The CO₂/SO₃ contrast relative to the major oxides emerges as the primary determinant, followed by the balance contrasting (SiO₂, Al₂O₃, Fe₂O₃, TiO₂) with (CaO, MgO) and then by SiO₂ versus (Al₂O₃, Fe₂O₃, TiO₂). These relative proportions appear as the main drivers of the response. Balances involving Na₂O/K₂O against CaO or MgO, as well as Na₂O versus K₂O, exert only moderate influence. Their beeswarm clouds, characterized by mixed colors and increasing vertical spread, suggest contributions governed by threshold effects and interactions captured by tree-based models.

The balance associated with P₂O₅ remains globally weak, consistent with PFI. Considering the sample size, this likely reflects a limited or uncertain contribution rather than a definitive absence of effect.

The overall ranking largely mirrors the one obtained via PFI, reinforcing the consistency of the observed signals. The balance of CO₂/SO₃ against the major oxides and the comparative levels of silica-based oxides versus Ca/Mg and versus Al/Fe/Ti are the most influential predictors, while alkali contrasts play a secondary, often interaction-dependent role.

Indeed, CO₂ (carbon dioxide) primarily originates from carbonates such as calcite (CaCO₃) and dolomite (CaMg(CO₃)₂). Its presence induces decarbonation, leading to the formation of less dense phases, micropores, and zones of structural weakness. These alterations promote the development of soft phases, including residual calcite, which compromise mineralogical cohesion. SO₃ (sulfur trioxide) derives from sulfates (e.g., gypsum, anhydrite) or oxidized sulfides. SO₃ contributes to the formation of chemically unstable phases, notably ettringite and secondary gypsum, generates crystalline discontinuities, and further diminishes mineralogical cohesion.

A high value of the CO₂/SO₃ balance relative to the major oxides reflects a predominance of soft phases, which are more sensitive to wear. This explains the accelerated loss of polishing resistance in materials with such chemical characteristics.

5. Conclusions

This research aims to develop predictive models for the PSV of aggregates based on their mineralogical and chemical compositions, using a machine-learning approach. A comprehensive database compiled from published studies was created to train and evaluate the models. Exploratory analysis of the dataset allowed for the identification of key oxide combinations, the detection of correlations and outliers, and the characterization of non-linear relationships, providing a solid foundation for predictive modeling.

Due to the compositional nature of the data, it was necessary to transform the chemical components into an unconstrained space using log-ratio transformations (ILR balances) to ensure the effective operation of standard statistical and machine-learning algorithms.

The modeling process adopted a comparative approach, using tree-based algorithms (Gradient Boosting, CatBoost) and a non-parametric model (MARS). Repeated K-fold cross-validation was applied to assess the robustness and generalizability of the method. Model performance was evaluated using MAE, RMSE, and R² metrics. Among the models assessed, CatBoost demonstrated the most consistent calibration and accuracy, achieving the best performance across all metrics. It showed the highest agreement between observed and predicted values (r = 0.867; p < 0.001) and the greatest explained variance (R² = 0.749; Adj-R² = 0.736). SHAP and PFI interpretation methods were used to quantify feature contributions and the directional effects of chemical balances on PSV predictions. The results are promising and highlight a clear relationship between aggregate polishing resistance and chemical composition. The CO₂/SO₃ balance relative to the major oxides, along with the contrast between the silica-rich oxides and CaO + MgO, were identified as the primary controlling factors of aggregate polishing resistance. Higher carbonate content is associated with lower PSV values, consistent with the well-documented polishing susceptibility of carbonate-rich limestones. Conversely, aggregates dominated by silica-rich phases, such as quartz and feldspars, exhibit higher PSV due to their greater hardness and resistance. Moreover, the findings highlight the predictive capability of machine-learning techniques and their relevance for predicting the PSV of aggregates.

Even though the developed model demonstrates promising predictive performance, further improvement will require expanding the database in both size and diversity.

Author Contributions

Conceptualization: K.S.; Methodology: K.S., Y.B.; Software: K.S., Y.B.; Validation: K.S., Y.B.; Formal analysis: K.S., Y.B.; Investigation: K.S.; Writing—original draft preparation: K.S., Y.B.; Writing—review and editing: K.S., Y.B., V.C.; Supervision: K.S., V.C., S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset used in this study was compiled from data extracted from previously published peer-reviewed journal articles. All data and related information used in this study are available from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. UNEP. Sand and Sustainability: Finding New Solutions for Environmental Governance of Global Sand Resources; United Nations Environment Programme: Geneva, Switzerland, 2019.
2. Kim, S.S.; Qudoos, A.; Jakhrani, S.H.; Lee, J.B.; Kim, H.G. Influence of Coarse Aggregates and Silica Fume on the Mechanical Properties, Durability, and Microstructure of Concrete. Materials 2019, 12, 3324.
3. Sakthivel, S.N.; Kathuria, A.; Singh, B. Utilization of Inferior Quality Aggregates in Asphalt Mixes: A Systematic Review. J. Traffic Transp. Eng. 2022, 9, 864–879.
4. Makul, N. Aggregates (Building Materials): Testing. In Dictionary of Concrete Technology; Springer Nature: Singapore, 2025; pp. 53–55.
5. Mitchell, C. Construction aggregates: Evaluation and specification. In Proceedings of the Third International Forum for Industrial Rocks & Mining Conference & Exhibition, Fujairah, United Arab Emirates, 30 March–1 April 2015; Available online: https://nora.nerc.ac.uk/id/eprint/510909/ (accessed on 23 March 2026).
6. BS EN 1097-8; Tests for Mechanical and Physical Properties of Aggregates-Determination of the Polished Stone Value. BSI (British Standards Institution): London, UK, 2009.
7. Perry, M.J. Role of Aggregate Petrography in Micro-Texture Retention of Greywacke Surfacing Aggregate. Road Mater. Pavement Des. 2014, 15, 791–803.
8. Crisman, B.; Ossich, G.; Bevilacqua, P.; Roberti, R. Degradation Prediction Model for Friction of Road Pavements with Natural Aggregates and Steel Slags. Appl. Sci. 2020, 10, 32.
9. Arampamoorthy, H.; Patrick, J. Potential of the Wehner–Schulze Test to Predict the On-Road Friction Performance of Aggregate; NZ Transport Agency Research Report 443; NZ Transport Agency: Wellington, New Zealand, 2011; 34p. Available online: https://www.nzta.govt.nz/assets/resources/research/reports/443/docs/443.pdf (accessed on 23 March 2026).
10. BS EN 13043:2013; Aggregates for Bituminous Mixtures and Surface Treatments for Roads, Airfields and Other Trafficked Areas. BSI (British Standards Institution): London, UK, 2013.
11. Goehl, D.C.; Gurganus, C.; Park, E.S. Selection Criteria for Coarse Aggregate in Flexible Pavement Surfaces; Report No. FHWA/TX-21/0-7077-R1; Texas A&M Transportation Institute: College Station, TX, USA, 2021; Available online: https://static.tti.tamu.edu/tti.tamu.edu/documents/0-7077-R1.pdf (accessed on 23 March 2026).
12. National Highways. Design Manual for Roads and Bridges: Index (GG 000); National Highways: London, UK, 2025; Available online: https://www.standardsforhighways.co.uk/dmrb (accessed on 23 March 2026).
13. Najafi, S.; Flintsch, G.W.; Medina, A. Linking Roadway Crashes and Tire–Pavement Friction: A Case Study. Int. J. Pavement Eng. 2017, 18, 119–127.
14. McCarthy, R.; Flintsch, G.; de León Izeppi, E. Impact of Skid Resistance on Dry and Wet Weather Crashes. J. Transp. Eng. Part B Pavements 2021, 147, 04021029.
15. Wallman, C.-G.; Åström, H. Friction Measurement Methods and the Correlation Between Road Friction and Traffic Safety: A Literature Review; Swedish National Road and Transport Research Institute (VTI): Linkoping, Sweden, 2001; Available online: https://www.diva-portal.org/smash/get/diva2:673366/FULLTEXT01.pdf (accessed on 23 March 2026).
16. Cerezo, V.; Do, M.-T.; Violette, E. A Global Approach to Warn the Drivers before a Curve by Considering the Decrease of Skid Resistance Due to the Rain. In Proceedings of the 3rd International Conference on Road Safety and Simulation, Indianapolis, IN, USA, 14–16 September 2011; Available online: https://onlinepubs.trb.org/onlinepubs/conferences/2011/RSS/3/Cerezo,V.pdf (accessed on 23 March 2026).
17. Lebaku, P.K.R.; Gao, L.; Sun, J.; Wang, X.; Kang, X. Assessing the Influence of Pavement Performance on Road Safety Through Crash Frequency and Severity Analysis. Int. J. Pavement Res. Technol. 2025.
18. Lindenmann, H.P. New Findings Regarding the Significance of Pavement Skid Resistance for Road Safety on Swiss Freeways. J. Safety Res. 2006, 37, 395–400.
19. Li, P.; Yi, K.; Yu, H.; Xiong, J.; Xu, R. Effect of Aggregate Properties on Long-Term Skid Resistance of Asphalt Mixture. J. Mater. Civ. Eng. 2021, 33, 4020413.
20. Zong, Y.; Li, S.; Zhang, J.; Zhai, J.; Li, C.; Ji, K.; Feng, B.; Zhao, H.; Guan, B.; Xiong, R. Effect of Aggregate Type and Polishing Level on the Long-Term Skid Resistance of Thin Friction Course. Constr. Build. Mater. 2021, 282, 122730.
21. Szatkowski, W.S.; Hosking, J.R. The Effect of Traffic and Aggregate on the Skidding Resistance of Bituminous Surfacing; TRRL Report LR 504; Transport and Road Research Laboratory: Crowthorne, UK, 1972.
22. Pichayapan, P.; Chaleonpan, P.; Jitsangiam, P.; Wongchana, P. An evaluation of relationship with polished stone value and skid resistance value based on a laboratory investigation. Key Eng. Mater. 2019, 801, 410–415.
23. Pérez-Acebo, H.; Gonzalo-Orden, H.; Findley, D.J.; Rojí, E. A Skid Resistance Prediction Model for an Entire Road Network. Constr. Build. Mater. 2020, 262, 120041.
24. Roy, N.; Sarkar, S.; Kuna, K.K.; Ghosh, S.K. Effect of Coarse Aggregate Mineralogy on Micro-Texture Deterioration and Polished Stone Value. Constr. Build. Mater. 2021, 296, 123716.
25. Shabani, S.; Ahmadinejad, M.; Ameri, M. Developing a Model for Estimation of Polished Stone Value (PSV) of Road Surface Aggregates Based on Petrographic Parameters. Int. J. Pavement Eng. 2013, 14, 242–255.
26. Liu, J.; Guan, B.; Chen, H.; Liu, K.; Xiong, R.; Xie, C. Dynamic Model of Polished Stone Value Attenuation in Coarse Aggregate. Materials 2020, 13, 1875.
27. El-Ashwah, A.S.; Abdelrahman, M. Relating Aggregate Friction Properties to Asphalt Pavement Friction Loss through Laboratory Testing, Statistical Analysis, and Machine Learning Insights. Int. J. Pavement Eng. 2025, 26, 2456739.
28. Hussain, J.; Zafar, T.; Fu, X.; Ali, N.; Chen, J.; Frontalini, F.; Hussain, J.; Lina, X.; Kontakiotis, G.; Koumoutsakou, O. Petrological Controls on the Engineering Properties of Carbonate Aggregates through a Machine Learning Approach. Sci. Rep. 2024, 14, 31948.
29. Lei, J.; Zheng, N.; Chen, X.; Bi, J.; Wu, X. Research on the Relationship between Anti-Skid Performance and Various Aggregate Micro Texture Based on Laser Scanning Confocal Microscope. Constr. Build. Mater. 2022, 316, 125984.
30. Ergin, B.; Gökalp, İ.; Uz, V.E. Effect of Aggregate Microtexture Losses on Skid Resistance: Laboratory-Based Assessment on Chip Seals. J. Mater. Civ. Eng. 2020, 32, 4020040.
31. Zong, Y.; Xiong, R.; Wang, Z.; Zhang, B.; Tian, Y.; Sheng, Y.; Xie, C.; Wang, H.; Yan, X. Effect of Morphology Characteristics on the Polishing Resistance of Coarse Aggregates on Asphalt Pavement. Constr. Build. Mater. 2022, 341, 127755.
32. Qian, Z.; Wu, J.; Sun, F.; Wang, L. Effect of Aggregate Mineral Composition on Polish Resistance Performance. In Transportation Research Congress 2016; American Society of Civil Engineers (ASCE): Reston, VA, USA, 2016; pp. 263–271.
33. Aboutalebi Esfahani, M.; Kalani, M. Petrographic Analysis Method for Evaluation and Achieving Durable Hot Mix Asphalt. Constr. Build. Mater. 2020, 234, 117408.
34. Gökalp, İ.; Uz, V.E. The Effect of Aggregate Type and Gradation on Fragmentation Resistance Performance: Testing and Evaluation Based on Different Standard Test Methods. Transp. Geotech. 2020, 22, 100300.
35. Sulandari, E.; Subagio, B.S.; Rahman, H.; Maha, I. Analysis of Aggregate Types with Micro-Texture and Macro-Texture Characteristics of Asphalt Mixture in Indonesia. Open Civ. Eng. J. 2023, 17, e187414952309010.
36. Zhao, H.; Gao, H.; Tang, J.; Xue, X.; Guan, B. Investigation of Changes in Aggregates Morphological Characteristics and Abrasion Resistance before and after Abrasion. Int. J. Pavement Eng. 2025, 26, 2520024.
37. He, Z.; Li, J.; Nian, J.; Guan, B. Experimental Analysis and Modeling of Micro-Texture and Vickers Hardness Impact on Polished Stone Value in High-Friction Aggregates. Front. Mater. 2024, 11, 1340828.
38. Wu, Z.; Hungria, R.; Bharati, S. Assessment of Laboratory Friction Testing Equipment and Validation of Pavement Friction Characteristics with Field and Accelerated Friction Testing; FHWA/LA.25/707; Louisiana Transportation Research Center: Baton Rouge, LA, USA, 2025. Available online: https://rosap.ntl.bts.gov/view/dot/82247/dot_82247_DS1.pdf (accessed on 23 March 2026).
39. Li, S.; Xiong, R.; Zhai, J.; Zhang, K.; Jiang, W.; Yang, F.; Yang, X.; Zhao, H. Research Progress on Skid Resistance of Basic Oxygen Furnace (BOF) Slag Asphalt Mixtures. Materials 2020, 13, 2169.
40. Qian, Z.; Hou, Y.; Dong, Y.; Cai, Y.; Meng, L.; Wang, L. An Evaluation Method for the Polishing and Abrasion Resistance of Aggregate. Road Mater. Pavement Des. 2020, 21, 1374–1385.
41. Kane, M.; Artamendi, I.; Scarpas, T. Long-Term Skid Resistance of Asphalt Surfacings: Correlation between Wehner–Schulze Friction Values and the Mineralogical Composition of the Aggregates. Wear 2013, 303, 235–243.
42. Wang, D.; Chen, X.; Xie, X.; Stanjek, H.; Oeser, M.; Steinauer, B. A Study of the Laboratory Polishing Behavior of Granite as Road Surfacing Aggregate. Constr. Build. Mater. 2015, 89, 25–35.
43. Fernández, A.; Alonso, M.A.; López-Moro, F.J.; Moro, M.C. Polished Stone Value Test and Its Relationship with Petrographic Parameters (Hardness Contrast and Modal Composition) and Surface Micro-Roughness in Natural and Artificial Aggregates. Mater. Constr. 2013, 63, 377–391.
44. Gökalp, İ.; Uz, V.E.; Saltan, M.; Tepe, M. Site Assessment of Surface Texture and Skid Resistance by Varying the Grit Parameters of an SMA. J. Transp. Eng. Part B Pavements 2022, 148, 04022033.
45. Greenacre, M. Compositional Data Analysis. Annu. Rev. Stat. Appl. 2021, 8, 271–299.
46. Torgo, L.; Ribeiro, R.P.; Pfahringer, B.; Branco, P. SMOTE for Regression. In Progress in Artificial Intelligence—EPIA 2013; Correia, L., Reis, L.P., Cascalho, J., Eds.; Lecture Notes in Artificial Intelligence (LNAI); Springer: Berlin/Heidelberg, Germany, 2013; Volume 8154, pp. 378–389.
47. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority over-Sampling Technique. J. Artif. Intell. Res. 2002, 16, 321–357.
48. Friedman, J.H. Greedy Function Approximation: A Gradient Boosting Machine. Ann. Stat. 2001, 29, 1189–1232.
49. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. Adv. Neural Inf. Process. Syst. 2018, 31, 6638–6648.
50. Friedman, J.H. Multivariate Adaptive Regression Splines. Ann. Stat. 1991, 19, 1–67.
51. Molinaro, A.M.; Simon, R.; Pfeiffer, R.M. Prediction Error Estimation: A Comparison of Resampling Methods. Bioinformatics 2005, 21, 3301–3307.
52. Koné, A.; Es-Sabar, A.; Do, M.-T. Application of Machine Learning Models to the Analysis of Skid Resistance Data. Lubricants 2023, 11, 328.

Figure 1. Flowchart of the ML approach used in the study.

Figure 2. Data distribution plots.

Figure 3. ILR features and target (PSV)—densities before vs. after oversampling.

Figure 4. Variance inflation factor VIF—ILR features before and after oversampling.

Figure 5. Model performance for PSV prediction. (a) Gradient Boosting model; (b) CatBoost; (c) MARS.

Figure 6. Permutation feature importance analysis across three ML models.

Figure 7. SHAP summary plot for CatBoost model predicting PSV: feature impact on model output.

Table 1. Descriptive statistical analysis of database.

Variable	Mean	Std	Min	Max	Skewness	Kurtosis	CoV (Std/Mean)	Distinct Values
Al₂O₃	7.9743	10.1374	0	71	3.1624	17.7358	1.2713	54
Amphibole	3.2455	12.9335	0	70	4.1726	16.6303	3.9851	8
Biotite	0.6771	2.9492	0	25	7.3772	60.2028	4.3555	9
CO₂	7.9338	16.4284	0	47.2	1.6831	0.9562	2.0707	17
CaO	24.7985	23.8570	0	96.53	0.8293	−0.1145	0.9620	66
Calcite	12.3477	27.8417	0	96.19	2.0763	2.7612	2.2548	20
Chlorite	1.2009	3.8228	0	24	4.0153	18.1552	3.1833	12
Dolomite	2.8174	10.4240	0	60	4.3456	18.9962	3.6999	12
Fe₂O₃	5.2214	6.6423	0	30.1	1.3906	1.6501	1.2721	48
Hematite	0.0185	0.1658	0	1.48	8.9443	80.0000	8.9443	2
Hornblende	0.6647	3.6295	0	24.58	5.7833	33.7437	5.4602	4
Illite	0.3125	1.7965	0	14.83	7.1959	56.0160	5.7487	5
K₂O	0.8894	1.6217	0	6.1	1.9328	2.6433	1.8234	34
Magnetite	0.0583	0.3799	0	2.96	6.8588	48.4896	6.5197	3
MgO	3.1478	3.6926	0	15.17	1.4897	1.6736	1.1731	64
MnO	0.2541	0.9968	0	7.2	5.6000	33.8654	3.9234	25
Montmorillonite	0.3751	2.5822	0	22.59	8.3211	71.7781	6.8846	4
Na₂O	1.152	1.8609	0	8.67	2.1552	5.3364	1.6154	36
PSV	58.5450	11.8005	33.4	89.6	−0.0285	0.2112	0.2016	70
Plagioclase	3.1780	11.3261	0	58.05	3.7030	12.9603	3.5640	8
Potash feldspar	9.3079	18.0426	0	60	1.6772	1.2264	1.9384	18
Pyroxene	2.4177	8.9707	0	55	4.4305	20.3485	3.7103	10
P₂O₅	0.1604	0.4039	0	2.25	3.0355	10.1578	2.5189	20
Quartz	9.4275	18.2083	0	77	2.0209	2.9464	1.9314	27
SO₃	0.0223	0.0867	0	0.5	4.4368	19.5944	3.8891	8
SiO₂	28.138	25.3575	0	86.8	0.21697	−1.5204	0.9012	64
Siderite	0	0	0	0	0	/	/	1
SrO	0	0	0	0	0	/	/	1
TiO₂	0.4300	0.9266	0	4.8	2.8907	8.7998	2.155	26

Table 2. Chemical oxide ratios defining ILR balances for compositional data modeling.

ILR (Oxide Ratio)	Description
$\frac{(\mathrm{CO_2},\ \mathrm{SO_3})}{(\mathrm{SiO_2},\ \mathrm{Al_2O_3},\ \mathrm{Fe_2O_3},\ \mathrm{TiO_2},\ \mathrm{CaO},\ \mathrm{MgO},\ \mathrm{Na_2O},\ \mathrm{K_2O})}$	This ratio contrasts volatile-bearing carbonate and sulfate phases with the bulk major-oxide framework of the aggregate.
$\frac{(\mathrm{SiO_2},\ \mathrm{Al_2O_3},\ \mathrm{Fe_2O_3},\ \mathrm{TiO_2})}{(\mathrm{CaO},\ \mathrm{MgO})}$	This ratio measures the relative abundance of silico-aluminous and Fe-Ti oxides versus calco-magnesian components.
$\frac{\mathrm{SiO_2}}{(\mathrm{Al_2O_3},\ \mathrm{Fe_2O_3},\ \mathrm{TiO_2})}$	This ratio expresses the proportion of free silica (quartz) relative to Al-Fe-Ti-bearing phases.
$\frac{(\mathrm{Na_2O},\ \mathrm{K_2O})}{(\mathrm{CaO},\ \mathrm{MgO})}$	This ratio opposes alkali oxides associated with feldspathic minerals to Ca-Mg oxides typical of carbonates and mafic silicates.
$\frac{\mathrm{Na_2O}}{\mathrm{K_2O}}$	This ratio differentiates Na-rich plagioclase-dominated compositions from K-feldspar-dominated ones.
$\frac{\mathrm{P_2O_5}}{(\mathrm{SiO_2},\ \mathrm{Al_2O_3},\ \mathrm{Fe_2O_3},\ \mathrm{TiO_2})}$	This ratio represents the proportion of P-bearing accessory phases (mainly apatite) relative to the silico-aluminous framework.
$\frac{(\mathrm{Fe_2O_3},\ \mathrm{TiO_2},\ \mathrm{MgO},\ \mathrm{MnO})}{(\mathrm{SiO_2},\ \mathrm{Al_2O_3})}$	This ratio contrasts Fe-Ti-Mg-Mn oxides and mafic silicates with the Si-Al framework.
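
Each ratio in Table 2 can be read as an ILR balance between a numerator group of r parts and a denominator group of s parts, computed as sqrt(r·s/(r + s)) times the log-ratio of the two geometric means. The sketch below illustrates this standard formula only. The `sample` composition is hypothetical, and no zero-replacement strategy is implemented because the one used in the study is not described in this section; non-positive parts simply raise an error.

```python
from math import exp, log, sqrt


def geometric_mean(values):
    """Geometric mean of strictly positive values."""
    return exp(sum(log(v) for v in values) / len(values))


def ilr_balance(composition, numerator, denominator):
    """ILR balance between two groups of parts of one composition."""
    parts = [composition[name] for name in numerator + denominator]
    if any(p <= 0 for p in parts):
        raise ValueError("all parts must be strictly positive before taking logs")
    r, s = len(numerator), len(denominator)
    g_num = geometric_mean([composition[name] for name in numerator])
    g_den = geometric_mean([composition[name] for name in denominator])
    return sqrt(r * s / (r + s)) * log(g_num / g_den)


# Hypothetical oxide contents in wt%, for illustration only.
sample = {"SiO2": 60.0, "Al2O3": 15.0, "Fe2O3": 5.0, "TiO2": 1.0, "CaO": 8.0, "MgO": 4.0}

silica_vs_alfeti = ilr_balance(sample, ["SiO2"], ["Al2O3", "Fe2O3", "TiO2"])
silicate_vs_camg = ilr_balance(sample, ["SiO2", "Al2O3", "Fe2O3", "TiO2"], ["CaO", "MgO"])

# Rescaling the whole composition leaves every balance unchanged.
rescaled = {name: 0.5 * value for name, value in sample.items()}
silica_vs_alfeti_rescaled = ilr_balance(rescaled, ["SiO2"], ["Al2O3", "Fe2O3", "TiO2"])

print(round(silica_vs_alfeti, 4), round(silicate_vs_camg, 4))
```

Because only ratios between parts enter the formula, the balance does not depend on whether the composition is closed to 100%, which is the property that removes the constant-sum constraint.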

Table 3. reg-SMOTE hyperparameters used.

Parameter	Value
k (neighbors)	4
n_syn_per_rare	3
relevance threshold	0.75 (lower-tail focus)
relevance function	quantile-based (q0.15, q0.50)
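
To make the role of these parameters concrete, the sketch below shows one way a SMOTE-for-regression step could use them. It is not the implementation used in the study. The shape of the relevance function (1 at or below q0.15, falling linearly to 0 at q0.50), the restriction of neighbours to other rare cases, and the linear interpolation of the target are all assumptions made for illustration, and the data are synthetic.

```python
import random
from math import dist


def quantile(sorted_values, q):
    """Linear-interpolation quantile of an already sorted list."""
    pos = q * (len(sorted_values) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (pos - lo) * (sorted_values[hi] - sorted_values[lo])


def lower_tail_relevance(y, q_low, q_mid):
    """Assumed relevance: 1 below q_low, 0 above q_mid, linear in between."""
    if y <= q_low:
        return 1.0
    if y >= q_mid:
        return 0.0
    return (q_mid - y) / (q_mid - q_low)


def smote_regression(X, y, k=4, n_syn_per_rare=3, threshold=0.75, seed=0):
    """Generate synthetic cases by interpolating between rare low-target cases."""
    rng = random.Random(seed)
    ordered = sorted(y)
    q_low, q_mid = quantile(ordered, 0.15), quantile(ordered, 0.50)
    rare = [i for i, t in enumerate(y)
            if lower_tail_relevance(t, q_low, q_mid) >= threshold]
    X_syn, y_syn = [], []
    for i in rare:
        # k nearest rare neighbours in feature space (Euclidean distance).
        neighbours = sorted((j for j in rare if j != i),
                            key=lambda j: dist(X[i], X[j]))[:k]
        if not neighbours:
            continue
        for _ in range(n_syn_per_rare):
            j = rng.choice(neighbours)
            t = rng.random()
            X_syn.append([a + t * (b - a) for a, b in zip(X[i], X[j])])
            y_syn.append(y[i] + t * (y[j] - y[i]))
    return X_syn, y_syn


# Synthetic stand-in for ILR balances (two columns) and PSV.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(60)]
y = [55.0 + 8.0 * row[0] + data_rng.gauss(0, 2) for row in X]

X_syn, y_syn = smote_regression(X, y)
print(len(y_syn), "synthetic cases, max target", round(max(y_syn), 2))
```

Under these assumptions every synthetic target lies between two rare targets, so oversampling adds cases only in the lower tail of the PSV distribution.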

Table 4. Evaluation metrics for predictive models.

Model	CatBoost	Gradient Boosting	MARS
n	146	146	146
R	0.867	0.860	0.826
R²	0.749	0.737	0.680
Adj R²	0.736	0.723	0.663
p_value	≤0.001	≤0.001	≤0.001
RMSE	6.762	6.921	7.637
MAE	3.976	4.106	5.440
MAPE (%)	7.073	7.567	11.163
VAF (%)	75.138	73.938	67.989
a20 (%)	90.411	89.726	82.192
PI	−5.274	−5.459	−6.294
Rank	1	2	3
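
The error metrics in Table 4 can be computed as sketched below. The definitions are assumptions where the section does not state them: a20 is taken as the share of predictions within ±20% of the measured value, in line with the wording of the abstract, VAF is computed from the variance of the residuals, and the performance index is taken as PI = Adj R² + 0.01·VAF − RMSE, a form that reproduces the tabulated PI values to within rounding. The four measured and predicted values are made up for illustration.

```python
from math import sqrt
from statistics import pvariance


def regression_metrics(y_true, y_pred):
    """RMSE, MAE, MAPE (%), VAF (%) and a20 (%) for paired observations."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    within_20 = sum(1 for e, t in zip(errors, y_true) if abs(e) <= 0.2 * abs(t))
    return {
        "rmse": sqrt(sum(e * e for e in errors) / n),
        "mae": sum(abs(e) for e in errors) / n,
        "mape": 100.0 * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)) / n,
        "vaf": 100.0 * (1.0 - pvariance(errors) / pvariance(y_true)),
        "a20": 100.0 * within_20 / n,
    }


def performance_index(adj_r2, vaf_percent, rmse):
    """Assumed definition: PI = Adj R² + 0.01 * VAF - RMSE."""
    return adj_r2 + 0.01 * vaf_percent - rmse


# Made-up measured and predicted PSV values.
measured = [40.0, 50.0, 60.0, 70.0]
predicted = [44.0, 49.0, 75.0, 68.0]
metrics = regression_metrics(measured, predicted)
print({name: round(value, 3) for name, value in metrics.items()})

# Check of the assumed PI definition against the CatBoost column of Table 4.
pi_catboost = performance_index(0.736, 75.138, 6.762)
print(round(pi_catboost, 3))
```

In the made-up example, three of the four predictions fall within ±20% of the measured value, so a20 is 75%; the third prediction misses by 15 PSV units against a tolerance of 12.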

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Soudani, K.; Bounefla, Y.; Cerezo, V.; Haddadi, S. Predictive Modeling of Aggregate Polished Stone Value from Mineralogical and Chemical Composition. Eng 2026, 7, 149. https://doi.org/10.3390/eng7040149

