Optimizing Bare Soil Mosaics for Clay Prediction via Environmental Covariates and Variable Selection

Suleymanov, Azamat; Kriuchkov, Nikita; Asylbaev, Ilgiz; Suleymanov, Ruslan

doi:10.3390/rs18101503

¹ Laboratory of Soil Science, Ufa Institute of Biology, Ufa Federal Research Centre, Russian Academy of Sciences, 450054 Ufa, Russia

² Department of Geodesy, Cartography and Geographic Information Systems, Ufa University of Science and Technology, 450076 Ufa, Russia

³ Faculty of Biology, Shenzhen MSU-BIT University, International University Park Road 1, Dayun New Town, Longgang District, Shenzhen 517182, China

⁴ Department of General Ecology and Hydrobiology, Faculty of Biology, Lomonosov Moscow State University, 119991 Moscow, Russia

⁵ Department of Soil Science, Agrochemistry and Precision Agriculture, Bashkir State Agrarian University, 450001 Ufa, Russia

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1503; https://doi.org/10.3390/rs18101503

Submission received: 6 March 2026 / Revised: 7 May 2026 / Accepted: 8 May 2026 / Published: 11 May 2026

Highlights

What are the main findings?

Multi-temporal bare soil mosaics and regional environmental covariates in combination with variable selection methods were tested for clay mapping across croplands.
Combining mosaic with regional proxies increased model performance.
The modified greedy feature selection outperformed a combination of variance inflation factor and recursive feature elimination techniques and resulted in the best model performance (RMSE = 8.73%, R² = 0.42, RPD = 1.34).

What are the implications of the main findings?

Integrating other environmental variables besides mosaics is an important step toward achieving maximum prediction accuracy, highlighting pedogenic controls beyond reflectance.
In addition to including other environmental variables besides mosaics, it is important to consider the method of variable selection for soil mapping tasks.

Abstract

Spatial information on soil clay content addresses critical needs in precision agriculture and soil health assessment worldwide. This study develops and evaluates a mapping workflow for topsoil clay (0–20 cm) across croplands in southern Russia using Sentinel-2 bare soil mosaics and regional covariates representing soil-forming factors. We generated 26 multi-temporal mosaics via per-pixel and per-date approaches and then tested five scenarios with regional covariates in combination with three variable selection techniques: variance inflation factor (VIF), recursive feature elimination (RFE), and modified greedy feature selection (MGFS). We found that among 26 temporal mosaics, the best single mosaic (scenario 1) explained 25% of the clay variation (RMSE = 10.03%, R² = 0.25, RPD = 1.16). Using only regional covariates selected after VIF and RFE approaches (scenario 2) yielded comparable results (RMSE = 9.77%, R² = 0.27, RPD = 1.18). Combining the best bare soil mosaic with all regional covariates without variable selection (scenario 3) improved the predictions (RMSE = 9.45%, R² = 0.33, RPD = 1.24), whereas applying VIF/RFE to the same set (scenario 4) gave slightly worse accuracy (RMSE = 9.65%, R² = 0.31, RPD = 1.21). MGFS implementation (scenario 5) boosted the model performance and resulted in the best predictions (RMSE = 8.73%, R² = 0.42, RPD = 1.34). NIR bands from the bare soil mosaic, and terrain attributes with Landsat and MODIS variables from the regional covariate set were the key variables. We demonstrate that combining multi-temporal Sentinel-2 mosaics with regional covariates, when paired with an appropriate selection strategy, yields superior clay predictions.

Keywords:

soil clay; Sentinel; digital soil mapping; variable selection; machine learning; remote sensing

1. Introduction

Soil clay content is a key determinant of water retention, infiltration, aggregation, nutrient availability, and the mechanical behavior of soils [1]. Spatial information on clay content is commonly required at field-to-landscape scales for precision agriculture [2], assessment of carbon storage potential [3] or erosion modeling [4,5]. Moreover, clay content is considered the key driver of soil organic carbon (SOC) accumulation via mineral-surface adsorption and aggregate occlusion [6]. This soil parameter also attracts attention due to the SOC:clay ratio, a metric used for soil health assessment and widely discussed in recent studies [7,8,9]. However, traditional soil surveys and laboratory analyses are time-consuming, costly, and often too sparse to capture fine-scale variability.

A widely used approach for spatial assessment of soil properties at various scales is digital soil mapping (DSM) [10]. DSM is a framework that combines field observations with spatial environmental covariates and utilizes statistical or machine-learning techniques for mapping soil properties and classes. Within this framework, remote sensing (RS) is a popular source of explanatory variables [11]. It provides temporally frequent and spatially continuous observations related to soil-forming factors and surface conditions, including surface reflectance, vegetation status, and moisture dynamics. In agricultural landscapes, RS data is an effective tool because of its ability to capture bare-soil reflectance during appropriate acquisition windows [12]. Consequently, numerous methodologies are adopted for producing RS images and mosaics that reveal bare soil conditions [13,14,15]. Such covariates are especially valuable for predicting soil properties that are closely correlated with soil color, such as soil organic carbon or texture [11].

Despite the strong spectral correlations enabling RS bare soil mosaics to capture numerous soil properties, relying solely on such data often limits mapping accuracy [16,17]. Other spatial covariates representing key soil-forming factors are important to fully explain spatial variation in soil properties driven by pedogenic processes beyond surface reflectance. For example, in geomorphologically complex landscapes, elevation or other terrain attributes are often major variables that describe processes such as weathering, transport, and deposition. Notably, Azizi et al. [16] demonstrated that RS and relief attributes were the most important covariates for silt predictions across agricultural soils in Iran. Climate variables are also useful for broad-scale DSM along with bare soil mosaics [17].

An important step in the DSM framework is variable selection. It is usually applied to optimize model performance, reduce overfitting, and enhance interpretability by retaining only the most informative covariates from large environmental datasets [18]. Common approaches include eliminating highly correlated variables to mitigate multicollinearity [19]. Recursive feature elimination (RFE) and Boruta are other popular approaches, which iteratively remove the least important predictors. Several studies have shown that a specific feature selection method can affect soil predictions [20,21]. Therefore, choosing an appropriate variable selection strategy is critical for DSM workflows.

In summary, approaches for DSM within croplands largely involve the use of multi-temporal RS data and their derived indices [22,23]. Several studies, in addition to RS data, include other explanatory variables such as soil maps [24] or crop type maps [25,26], whereas variable selection is often limited to removing highly correlated variables or using only a single method (e.g., Boruta or RFE) [19]. Therefore, in this study, we aimed to develop and evaluate a DSM workflow for topsoil clay content mapping in southern Russia. Specifically, we tested several mapping scenarios that used different RS-derived bare soil mosaics, ancillary environmental covariates, and feature selection algorithms. Hence, the objectives of this study are: (i) to test different RS-derived bare soil mosaics from Sentinel-2A data for clay modeling; (ii) to evaluate the integration of environmental variables; and (iii) to assess the effect of variable selection techniques on model performance.

2. Materials and Methods

2.1. Study Area

The study area (~4000 km²) is located within the Republic of Bashkortostan, Russia (Figure 1). It lies in the southeastern part of the republic, east of the Southern Ural Mountains, and spans a transition from forested highlands in the west to steppe lowlands in the east. Elevation ranges from 250 to 550 m a.s.l. The climate is continental, with cold winters (mean January temperature −15 to −17 °C), warm summers (mean July temperature 18–20 °C), and an annual precipitation of 300–400 mm that increases with elevation. The soils are diverse and are dominated by gray forest soils (Phaeozems) in the mountains and foothills, and by Chernozems and saline soils in the steppe areas. Croplands are mainly concentrated in the eastern part of the study area, on level steppe terrain.

2.2. Soil Data

A total of 182 topsoil samples (0–20 cm) were collected from agricultural lands across the study area as part of a regional soil inventory initiative by the Ministry of Land and Property Relations of the Republic of Bashkortostan in the late 2010s. These soil sampling locations were analogous to those surveyed during the 1970–1980 campaigns, with a focus on croplands. The content of physical clay (the sum of fractions less than 0.01 mm) was determined according to GOST 12536 [27], which is based on the Kachinsky soil particle size classification [28].

2.3. Spatial Modeling Framework

For digital mapping of clay content across croplands, we tested several scenarios that used different types of environmental variables. These covariate sources can be divided into two main groups: multi-temporal RS data that we generated for this study and the existing regional set of covariates consisting of the key soil-forming variables.

2.3.1. Multi-Temporal Remote Sensing Data

Firstly, we generated bare soil mosaics from Sentinel-2 imagery. Because bare-soil retrieval requires minimizing the influence of vegetation, crop residues, and surface moisture, we used spectral-index masking. Specifically, we employed four indices commonly used for these tasks: the Normalized Difference Vegetation Index (NDVI) [29], the Normalized Burn Ratio 2 (NBR2) [30], the Bare Soil Index (BSI) [11,31], and the Soil Surface Moisture Index (S2WI) [14].

All processing was implemented in Google Earth Engine [32]. Sentinel-2 scenes were collected for the study area for the main agricultural season (April–October) over 2016–2024, using an initial metadata filter (scene cloudiness < 70%) to preserve temporal density. Cloud and shadow pixels were then removed using QA/SCL information (when available) together with Sentinel-2 cloud-probability screening, and scenes with insufficient clear-sky fraction over the area of interest (AOI) were discarded. For each retained scene, NDVI, NBR2, BSI, and S2WI were computed, and all outputs were harmonized to 20 m spatial resolution (Sentinel-2 bands 1, 9, and 10 were excluded).

To retain only bare/dry soil pixels across the time series, we applied multiple masking scenarios. In total, 24 filter combinations were evaluated using thresholding of individual indices and/or their combinations, partly following the logic proposed by Vaudour et al. [12]. Importantly, thresholds were not chosen arbitrarily. They were derived from AOI-specific percentile statistics calculated after cloud and shadow masking. The strict NDVI threshold of 0.23 corresponded to the 10th percentile of NDVI values, whereas the less restrictive NDVI threshold of 0.29 corresponded approximately to the 25th percentile. The NBR2 threshold of 0.16 corresponded closely to the 90th percentile, while the BSI threshold of 0.20 was close to the 75th percentile of the AOI-specific BSI distribution.
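The percentile-based thresholding described above can be illustrated with a minimal Python sketch. It is not the Google Earth Engine code used in the study. The function names, the toy reflectance values, and the assumption that a pixel is kept when both NDVI and NBR2 fall below their thresholds are illustrative choices. The band formulas are the standard normalized differences (NDVI from B8 and B4, NBR2 from B11 and B12).

```python
# Sketch only: helper names and toy reflectances are hypothetical.

def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    total = a + b
    return (a - b) / total if total != 0 else None

def ndvi(b8, b4):
    # NIR (B8) versus red (B4)
    return normalized_difference(b8, b4)

def nbr2(b11, b12):
    # SWIR1 (B11) versus SWIR2 (B12)
    return normalized_difference(b11, b12)

def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def is_bare(pixel, ndvi_max, nbr2_max):
    """Assumed rule: keep a pixel when both indices are below their thresholds."""
    v = ndvi(pixel["B8"], pixel["B4"])
    n = nbr2(pixel["B11"], pixel["B12"])
    return v is not None and n is not None and v < ndvi_max and n < nbr2_max

pixels = [
    {"B4": 0.25, "B8": 0.30, "B11": 0.35, "B12": 0.30},  # toy dry bare soil
    {"B4": 0.05, "B8": 0.50, "B11": 0.20, "B12": 0.10},  # toy green vegetation
]
ndvi_values = [ndvi(p["B8"], p["B4"]) for p in pixels]
ndvi_p10 = percentile(ndvi_values, 10)  # AOI-style percentile statistic
# Thresholds 0.23 (NDVI) and 0.16 (NBR2) are the values reported in the text.
bare_flags = [is_bare(p, ndvi_max=0.23, nbr2_max=0.16) for p in pixels]
```

In a real workflow the percentile would be computed over all cloud-free AOI pixels rather than a two-element list.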

Final mosaics were produced using “per-pixel” and “per-date” strategies. In the per-pixel strategy, after applying strict masks (bare/dry eligibility), mosaics were generated either by selecting the single best observation per pixel using qualityMosaic (e.g., minimum S2WI or minimum NBR2, or maximum BSI), or by temporal aggregation (mean/median) of reflectance bands and indices across all eligible observations per pixel [13]. In contrast, the per-date strategy ranks acquisition dates using AOI-level index summaries computed on masked pixels and then builds mosaics by sequential gap-filling in the priority order of the selected dates (coverage-aware selection), rather than by per-pixel min/mean/median over time [12]. Overall, 26 multi-temporal mosaics at 20 m resolution were generated (Table 1). After excluding non-arable land, the number of usable field observations decreased because some sampling locations did not coincide with croplands and/or valid bare-soil pixels in the final mosaics.
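The temporal-aggregation variant of the per-pixel strategy can be sketched as follows. This is a simplified, assumption-laden illustration in plain Python: each image is flattened to a list of pixel values, `eligible` is a hypothetical bare/dry mask per date, and the coverage helper simply reports the share of pixels that received a value.

```python
from statistics import median

def per_pixel_median(series, eligible):
    """series[t][p]: reflectance of pixel p at date t.
    eligible[t][p]: True when pixel p passed the bare/dry mask at date t.
    Returns one composite value per pixel, or None where no date was eligible."""
    n_dates = len(series)
    n_pixels = len(series[0])
    composite = []
    for p in range(n_pixels):
        values = [series[t][p] for t in range(n_dates) if eligible[t][p]]
        composite.append(median(values) if values else None)
    return composite

def coverage(composite):
    """Fraction of pixels that received a composite value."""
    return sum(v is not None for v in composite) / len(composite)

# Toy time series: three dates, three pixels (values are made up).
series = [[0.1, 0.2, 0.3], [0.3, 0.4, 0.5], [0.2, 0.9, 0.1]]
eligible = [[True, True, False], [True, False, False], [True, True, False]]
mosaic = per_pixel_median(series, eligible)
mosaic_coverage = coverage(mosaic)
```

Pixels that never pass the mask stay empty, which is why stricter masks lead to lower mosaic coverage.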

2.3.2. Regional Covariate Stack

Previous studies compiled a harmonized set of spatial covariates covering the entire Republic of Bashkortostan [33]. This dataset is largely consistent with the one used for SoilGrids 2.0 [34] and comprises a wide array of soil-forming variables, including RS data (Landsat and MODIS spectral bands and indices, land cover types), climate variables, hydrology, terrain attributes and derivatives, SOC, pH, and geological and soil type maps. For a detailed review, we refer to [33]. A total of 90 regional covariates with a spatial resolution of 250 m/pixel were used for the workflow.

2.3.3. Mapping Scenarios

Our workflow for mapping clay content included five scenarios that utilized different covariate sets, feature-selection methods, and predictive algorithms (Table 2). Under the first baseline scenario (№ 1), we tested the spectral bands of each multi-temporal mosaic in turn using Partial Least Squares (PLS), Random Forest (RF), Cubist, and Support Vector Machine (SVM). Thus, in each iteration, the Sentinel-2 bands from a single multi-temporal mosaic (n = 26) served as explanatory variables for these predictive algorithms. For each mosaic, only the points that intersected it (i.e., those located on bare soil pixels) were used, resulting in 172–182 points, depending on the mosaic. The best-performing combination of mosaic and predictive technique was chosen after 10-fold cross-validation with 10 repetitions, based on error metrics.

Within the second scenario (№ 2), we used only regional covariates with the same predictive algorithms. Before that, we applied two approaches to reduce the number of regional variables to avoid multicollinearity and redundancy. Firstly, variance inflation factor (VIF) analysis was implemented, which reduced the number of variables to 53. Then, for the remaining variables, we used the recursive feature elimination (RFE) method, which resulted in a final set of 32 regional covariates.

Scenario № 3 used the best bare soil mosaic that was defined within scenario № 1 and all regional covariates as explanatory variables using the RF approach. Therefore, this predictive model used 10 spectral bands from the best mosaic and 90 regional covariates for clay mapping. Here, we did not use any variable selection methods. All regional covariates were resampled to match the 20 m resolution of the Sentinel-2 mosaic using the bilinear method for continuous covariates and nearest neighbor for categorical covariates. Bilinear interpolation preserves the gradual variation in continuous environmental variables (e.g., temperature, elevation) by calculating weighted averages of neighboring pixels, whereas nearest neighbor resampling is used for categorical covariates (e.g., geological classes) to preserve original class integrity without introducing artificial intermediate values. As the RF approach was the best for both previous scenarios, we continued the workflow with this algorithm.

Scenario № 4 included the best bare soil mosaic (scenario № 1) and regional covariates (n = 90). Then the VIF and RFE techniques were applied to all variables. Scenario № 5 also included the same best bare soil mosaic (scenario № 1) and regional covariates (n = 90) but used a different variable selection technique. Specifically, we tested a modified greedy feature selection (MGFS) algorithm that demonstrated marked effectiveness in soil modeling tasks [18]. This method was applied to the full covariate stack (i.e., the spectral bands and all regional variables), and the 29 variables it retained were used for RF modeling.

2.4. Variable Selection Techniques

Recursive feature elimination (RFE) is a wrapper method for feature selection that operates through a backward elimination process [35]. It begins by training a predictive model using the entire set of available features and ranks their importance. The least important feature(s) are then pruned, and the model is retrained on the reduced subset. This recursive procedure of ranking and eliminating features continues until a pre-specified number of features remains. RFE was performed using RF importance (permutation-based) with stepwise removal of five variables per iteration. Subset selection was guided by 10-fold cross-validation within the RFE loop, and the subset yielding the lowest root mean square error (RMSE) was retained.
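The backward-elimination loop described above can be sketched in a few lines of Python. This is a generic illustration, not the caret implementation used in the study: `rank_importance` and `cv_rmse` are hypothetical callables standing in for RF permutation importance and 10-fold cross-validated RMSE, and the toy scores at the end are invented.

```python
def recursive_feature_elimination(features, rank_importance, cv_rmse,
                                  step=5, min_features=1):
    """Drop the `step` least important features per iteration and keep
    the subset with the lowest cross-validated RMSE seen along the way."""
    current = list(features)
    best_subset, best_rmse = list(current), cv_rmse(current)
    while len(current) > min_features:
        importance = rank_importance(current)  # dict: feature -> importance
        n_drop = min(step, len(current) - min_features)
        ordered = sorted(current, key=lambda f: importance[f])
        dropped = set(ordered[:n_drop])
        current = [f for f in current if f not in dropped]
        rmse = cv_rmse(current)
        if rmse < best_rmse:
            best_subset, best_rmse = list(current), rmse
    return best_subset, best_rmse

# Toy setup (invented): two informative features and ten noise features.
USEFUL = {"x0": 2.0, "x1": 1.0}
all_features = ["x%d" % i for i in range(12)]

def toy_importance(subset):
    return {f: USEFUL.get(f, 0.0) for f in subset}

def toy_cv_rmse(subset):
    gain = sum(USEFUL.get(f, 0.0) for f in subset)
    noise = sum(1 for f in subset if f not in USEFUL)
    return 10.0 - gain + 0.1 * noise

rfe_subset, rfe_rmse = recursive_feature_elimination(
    all_features, toy_importance, toy_cv_rmse, step=5)
```

Because importance is re-ranked on the reduced set at each iteration, a real implementation refits the model every time. The toy functions here are constant for simplicity.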

Variance inflation factor (VIF) analysis is a filter method specifically designed to detect multicollinearity—a situation where features are highly correlated with one another—within a set of predictors [36]. It quantifies how much the variance of a regression coefficient is inflated due to linear dependencies with other features. For each predictor, a VIF score is calculated by regressing it against all other predictors; a high VIF (above a threshold of 5 that was used in this study) indicates severe multicollinearity. Features with excessively high VIF scores are typically removed iteratively to enhance model stability and interpretability.
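A minimal sketch of iterative VIF filtering is given below, assuming the threshold of 5 used in this study. It relies on the identity that the VIF of predictor j equals the j-th diagonal element of the inverse correlation matrix, which is equivalent to 1 / (1 − R²) from regressing predictor j on the others. The function names and the toy columns are hypothetical, and constant columns are not handled.

```python
from math import sqrt

def correlation_matrix(columns):
    """Pearson correlation matrix of a dict of equally long numeric columns."""
    scaled = {}
    for name, values in columns.items():
        mean = sum(values) / len(values)
        dev = [v - mean for v in values]
        norm = sqrt(sum(d * d for d in dev))  # zero for constant columns
        scaled[name] = [d / norm for d in dev]
    names = list(columns)
    return [[sum(x * y for x, y in zip(scaled[a], scaled[b])) for b in names]
            for a in names]

def invert(matrix):
    """Gauss-Jordan inversion with partial pivoting (small matrices only)."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def vif_scores(columns):
    inverse = invert(correlation_matrix(columns))
    return {name: inverse[i][i] for i, name in enumerate(columns)}

def vif_filter(columns, threshold=5.0):
    """Repeatedly drop the predictor with the highest VIF above the threshold."""
    kept = dict(columns)
    while len(kept) > 1:
        scores = vif_scores(kept)
        worst = max(scores, key=scores.get)
        if scores[worst] <= threshold:
            break
        del kept[worst]
    return list(kept)

# Toy columns (invented): "b" nearly duplicates "a"; "c" is only weakly related.
a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
d = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
toy = {"a": a,
       "b": [x + 0.1 * y for x, y in zip(a, d)],
       "c": [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]}
vif_kept = vif_filter(toy, threshold=5.0)
```

Note that the filter never looks at the response variable, so which member of a correlated pair survives is decided by collinearity alone.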

Modified greedy feature selection (MGFS) is an advanced heuristic approach that builds upon standard forward or backward selection to improve computational efficiency [37]. The goal is to more closely approximate an optimal feature subset by mitigating the myopic pitfalls of a simple greedy strategy. Briefly, the approach begins by identifying the single most important predictor from a model trained on all available variables. It then uses a forward selection process, iteratively testing and adding the next best predictor that most improves cross-validated model performance using the RMSE metric. This cycle continues, building progressively larger predictor sets until performance stops improving or a maximum number of features is reached. The method is a modification of standard greedy selection, made more efficient by starting with the top-ranked predictor rather than testing each one individually at the first step. For details, we refer the reader to the literature cited above [37]. This procedure was performed using RF models and 10-fold cross-validation. At each iteration, all remaining candidate variables were tested by adding them individually to the current subset, and the candidate subset with the highest mean cross-validated coefficient of determination (R²) was retained for the next iteration. RMSE was also recorded at each step as a complementary error metric. The final MGFS subset was defined as the subset with the highest mean cross-validated R² and lowest RMSE along the selection path.
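The forward search described above can be sketched as follows. This is a simplified reading of the procedure, not the published MGFS code: `importance` stands in for variable importance from a model fitted on all covariates, `cv_score` is a hypothetical callable returning mean cross-validated R², and the toy scores are invented purely to show how a complementary variable gets picked before a redundant one.

```python
def greedy_forward_selection(features, importance, cv_score, max_features=None):
    """Seed with the top-ranked feature, then add the candidate that gives
    the highest cross-validated score until no candidate improves it."""
    limit = max_features or len(features)
    selected = [max(features, key=lambda f: importance[f])]
    best_score = cv_score(selected)
    while len(selected) < limit:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        # Test every remaining candidate added to the current subset.
        step_score, step_feature = max(
            (cv_score(selected + [f]), f) for f in candidates)
        if step_score <= best_score:
            break  # performance stopped improving
        selected.append(step_feature)
        best_score = step_score
    return selected, best_score

# Toy example (all numbers invented): B8 is largely redundant with B8A,
# whereas MRVBF carries complementary information.
toy_features = ["B8A", "B8", "MRVBF", "noise"]
toy_importance = {"B8A": 0.9, "B8": 0.8, "MRVBF": 0.3, "noise": 0.1}

def toy_r2(subset):
    s = set(subset)
    score = 0.0
    if "B8A" in s:
        score += 0.25
    if "B8" in s:
        score += 0.02 if "B8A" in s else 0.24
    if "MRVBF" in s:
        score += 0.12
    if "noise" in s:
        score -= 0.02
    return score

mgfs_subset, mgfs_r2 = greedy_forward_selection(
    toy_features, toy_importance, toy_r2)
```

In this toy run the second-ranked but redundant band is added only after the complementary terrain variable, which a ranking based on individual importance would not do.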

2.5. Evaluation of Model Performance

The prediction performance of each model within all scenarios was evaluated using 10-fold cross-validation with 10 repetitions because the field sample size was limited. Under small-sample conditions, a single round of cross-validation can yield high variance in evaluation metrics due to random data splits. Repeating the entire procedure multiple times and averaging the results reduces this variance, which helps to mitigate the risk of overfitting and provides a more reliable assessment of model generalizability. Hence, this approach reduces the variability of performance estimates while making full use of the available data. Three statistical metrics were used to evaluate prediction performance: RMSE, R², and ratio of performance to deviation (RPD).
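For reference, the three metrics and the repeated k-fold splitting can be written out as a short Python sketch. The definitions below are common conventions and are assumptions with respect to this study: R² is computed as 1 − SS_res/SS_tot (some toolkits report the squared correlation instead), and RPD is the sample standard deviation of the observations divided by RMSE. The toy values are invented.

```python
import random
from math import sqrt

def rmse(observed, predicted):
    errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return sqrt(sum(errors) / len(errors))

def r_squared(observed, predicted):
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

def rpd(observed, predicted):
    mean_obs = sum(observed) / len(observed)
    sd = sqrt(sum((o - mean_obs) ** 2 for o in observed) / (len(observed) - 1))
    return sd / rmse(observed, predicted)

def repeated_kfold(n_samples, k=10, repeats=10, seed=0):
    """Yield (train, test) index lists for `repeats` reshuffled k-fold rounds."""
    rng = random.Random(seed)
    for _ in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        for fold in range(k):
            test = order[fold::k]
            held_out = set(test)
            train = [i for i in order if i not in held_out]
            yield train, test

# Toy clay values in % (invented) and index splits for 182 samples.
observed = [50.0, 60.0, 70.0, 40.0]
predicted = [52.0, 58.0, 69.0, 45.0]
splits = list(repeated_kfold(182, k=10, repeats=10))
```

Metrics would be computed on each held-out fold and averaged over all 100 folds.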

To evaluate the uncertainty associated with our predictions, we used quantile regression forest to calculate the 95th and 5th percentiles for each pixel. The uncertainty was then defined as the range between these two percentiles, which corresponds to the width of the 90% prediction interval.
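The interval-width calculation itself is simple and can be sketched as below. Here `samples` is a hypothetical stand-in for the conditional distribution of clay values that a quantile regression forest would provide for one pixel. The percentile helper uses linear interpolation, which may differ slightly from the quantile definition of a specific package.

```python
def percentile(values, q):
    """Linear-interpolation percentile (q in 0..100) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)

def prediction_interval_width(samples, lower=5.0, upper=95.0):
    """Width of the central interval between two percentiles (90% by default)."""
    return percentile(samples, upper) - percentile(samples, lower)

# Toy conditional distribution for one pixel: the values 0, 1, ..., 100.
pixel_samples = [float(v) for v in range(101)]
width = prediction_interval_width(pixel_samples)
```

Applying the function to every pixel yields an uncertainty map in the same units as the prediction (% clay).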

All steps to implement the modeling workflow, including data preparation, machine learning, digital mapping, and cross-validation steps, were performed in the R programming environment (version 4.5.2) [38] under the caret package (version 7.0-1) [39].

3. Results

3.1. Summary Statistics of Clay Content Across Croplands

The clay content values ranged from 23.9 to 78.1% (Table 3). The mean clay content was 56.5%, and the median was 59.6%. The distribution exhibited moderate variability, as indicated by a standard deviation of 11.2%.

3.2. Final Temporal Mosaic

The effective coverage of the final mosaics ranged from 80% to 100% of the AOI. Per-date mosaics covered 82–85%, whereas per-pixel mosaics covered 80–100%. The lowest coverage was observed for the strict combined-rule mosaics, where NDVI, NBR2, and BSI thresholds were applied simultaneously.

Among the models tested with the bare soil mosaics (n = 26) and four predictive algorithms, the best performance was achieved by PerPix_Bare_NDVI023_FULLSTACK_median in combination with the RF approach (Table 4). This model resulted in RMSE = 10.03%, R² = 0.25, RPD = 1.16 after the cross-validation procedure. The next best models were also based on this mosaic but used the SVM (RMSE = 10.11%, R² = 0.24, RPD = 1.15) and Cubist (RMSE = 10.16%, R² = 0.23, RPD = 1.14) algorithms. They were followed by the PerDate_NDVI023 and PerPix_Bare_NBR2 mosaics, with RMSE = 10.21%, R² = 0.20, and RPD = 1.11 and RMSE = 10.28%, R² = 0.19, and RPD = 1.09, respectively, both using the SVM approach.

Spearman rank correlation analysis revealed significant negative relationships between soil clay content and most spectral bands derived from the best Sentinel-2A mosaic (PerPix_Bare_NDVI023_FULLSTACK_median) (Table 5). The strongest negative correlations were observed in the near-infrared (NIR) and red-edge (RE) regions, with clay content showing a maximum correlation with B8 (r = −0.36, p < 0.001), followed closely by B8A (r = −0.35, p < 0.001) and B7 (r = −0.35, p < 0.001). The B6 (r = −0.34, p < 0.001) and B5 (r = −0.32, p < 0.001) bands also showed strong negative correlations, while the visible spectrum bands showed moderate negative correlations: B2 (r = −0.31, p < 0.001), B3 (r = −0.29, p < 0.001), and B4 (r = −0.29, p < 0.001). The short-wave infrared (SWIR) band B11 demonstrated a weaker correlation (r = −0.24, p < 0.01), whereas B12 showed a non-significant positive correlation (r = 0.06, p > 0.05).

3.3. Performance of Tested Scenarios

Across the five tested scenarios, predictive performance generally improved from Scenario 1 to Scenario 5 (Figure 2 and Table 6). As demonstrated before, the RF model using only the bare soil mosaic as covariates explained only 25% of the clay variation. The second model (№ 2), which incorporated only regional covariates with a spatial resolution of 250 m/pixel, resulted in RMSE = 9.77%, R² = 0.27, and RPD = 1.18. That is, scenario № 2 showed a slight increase in accuracy, with RF again being the most reliable of the tested algorithms. The RF model under scenario № 3, which used the best bare soil mosaic and all regional variables, demonstrated improved performance (RMSE = 9.45%, R² = 0.33, and RPD = 1.24) compared to scenarios 1 and 2. After the VIF and RFE procedures (scenario № 4), the model demonstrated slightly worse performance (RMSE = 9.65%, R² = 0.31, and RPD = 1.21). The final model (Scenario № 5), which used mosaic and regional environmental variables selected after the MGFS approach, was the most accurate, explaining 42% of clay variation with an RMSE of 8.73% and RPD of 1.34. Thus, while the inclusion of regional variables improved performance, the method of their selection strongly influenced the accuracy of predictions.

3.4. Key Environmental Covariates

Figure 3 shows the top ten explanatory environmental covariates in models 3–5. Using all variables (scenario № 3), the most important covariates came mainly from the bare soil mosaic, including NIR bands (8 and 8A) and RE bands (5 and 7). In addition, regional RS data, namely the NIR band from Landsat and the Enhanced Vegetation Index (EVI), as well as land surface temperature, were included in the top 10.

Under scenario № 4, VIF and RFE procedures selected 52 variables that were used to build the model. Here, MODIS EVI, Landsat NIR, and band 2 (blue) from the mosaic were the most important variables. The key variables in model № 5 were bands 8A and 7 from the mosaic, as well as Landsat NIR; in addition, terrain and geology emerged as important predictors. Specifically, the terrain indices MRVBF and downslope curvature, as well as mixed sedimentary rocks, were included in the top 10. Hence, while bands from the bare soil mosaic were key in all models, regional covariates also contributed to each of them.

3.5. Digital Maps of Clay Content

We present the generated maps of clay content from models № 1 and 5 to highlight the difference between them (Figure 4). Both digital maps demonstrated the same gradients in their distribution, where the highest values were observed in the west of the study region (e.g., in the foothills). In contrast, steppe landscapes in the east were mainly characterized by the lowest clay levels. However, there were also small areas with minimum clay content. To quantify the disparity between the two maps, we calculated their difference. Notably, model № 1, which used only the mosaic, slightly underestimated the clay content in the foothills in the west. This model also overestimated some areas in the eastern plains. The magnitude of prediction uncertainties, calculated as the 90% prediction interval width, showed an increase from the eastern plains to the western foothills. Notably, the uncertainty map of model № 1 showed a pronounced “salt-and-pepper” effect, whereas the uncertainty map of model № 5 was smoother due to the influence of regional variables.

4. Discussion

In the first step of spatial mapping of soil clay content across bare croplands, we generated bare soil mosaics from Sentinel-2 satellite data, focusing on minimizing interference from vegetation, plant residues, and soil moisture. Testing these mosaics (n = 26) revealed that the per-pixel approach excelled, particularly with the strict NDVI threshold (NDVI < 0.23). Although this configuration is simpler than the other compositing rules tested in this study (i.e., it uses only NDVI), it outperformed them by best capturing bare soil signatures while excluding non-soil influences, leading to superior clay content predictions. The low NDVI cutoff likely succeeded because it filtered out even sparse vegetation in croplands, where residual greenness can distort soil reflectance. It should be noted that previous studies demonstrated different results in terms of the best temporal mosaics for soil property mapping [12], which highlights the importance of testing different compositing methods for this task. Nevertheless, we found that spectral bands from the bare soil mosaic alone did not lead to impressive prediction accuracy. Compared to other studies, RPD values at four cropland sites in the Czech Republic ranged from 1.05 to 1.89 using Sentinel-2 spectral bands and indices as predictors [40]. Similarly, Gaab et al. [41] reported RPD values of 1.79–1.99 for clay predictive models on a cultivated watershed in South India, depending on the Sentinel-2 acquisition date.

The regional covariates that capture key soil-forming factors yielded comparable performance, which can be explained by a relatively low initial spatial resolution. As a result, these covariates did not capture the short-scale variability of clay content, and despite adding numerous valuable explanatory variables, the improvements were negligible. Conversely, the smoother predictions of Model № 5 (compared to Model № 1) partially reflect the influence of these regional covariates. Coarser-resolution covariates miss spatial details and may introduce a smoothing effect, whereas the finer-resolution RS data retain more spatial heterogeneity, which may dominate model predictions. As a result, this disparity in spatial resolution between Sentinel and regional covariates may affect model performance and uncertainty. Nevertheless, the regional covariates in combination with the bare soil mosaic showed more accurate predictions than these sets of variables alone. It was expected that the regional covariates represent the broad-scale factors influencing soil formation, even with a coarse spatial resolution. Hence, this synergy integrates pedogenic context from the regional variables, while the bare soil mosaic inserts site-specific spectral data that reflects the actual short-range variation the coarser data misses. Earlier, the effectiveness of incorporating climate and terrain covariates into RS data was demonstrated by Zhou et al. [42] and Suleymanov [17] for SOC mapping.

We found that clay content was correlated with the visible and NIR bands (B2–B8A) of Sentinel-2A (Table 5). As demonstrated by numerous studies, these spectral ranges (i.e., Vis-NIR) are associated with iron, soil moisture, and carbon, which are characterized by direct spectral responses [43,44]. This spectral range is broad and overlapping, where especially the presence of the above properties can mask the absorption features of other soil components. Therefore, the success of the clay prediction in our study was mostly attributed to covariations with other soil properties, like SOC [45]. This is also supported by the importance of regional variables, which highlighted differences in soil-forming factors across the study region: the transition from forested foothill mountains with higher SOC content in the west to dry steppe soils with lower SOC content in the east [17]. As a result, soil types, SOC, and clay content vary systematically across the region.

The most interesting finding was the notable performance difference among models (№ 3–5) that differed in their feature selection approach. VIF primarily targets multicollinearity by removing redundant variables that are highly correlated with others, and it does not consider the target variable. Within this procedure, there is a risk of eliminating an informative predictor from a highly correlated pair [46]. For instance, it might remove a regional climate variable that is correlated with a bare soil spectral band, even though together they explain clay content better than either alone. Subsequently, RFE ranks the remaining covariates based on their individual predictive importance. This process can be biased towards variables with strong individual effects and may overlook predictors that are weak in isolation but powerful in specific combinations. Consequently, model № 4 likely arrived at a robust but conservative set of features and therefore may have failed to retain valuable information shared between the broad-scale regional factors and the spectral bands from the multi-temporal mosaic.

In contrast, the MGFS method is inherently more adaptive and combinatorial. A “greedy” algorithm evaluates features not just in isolation, but by iteratively testing which next feature adds the most value to the current subset. This allowed MGFS to actively find complementary variables. It could, for example, first select a key regional covariate that sets a climatic context, then immediately seek out the bare soil spectral bands that best refine predictions within that context, explicitly building a set of interacting predictors. Therefore, we conclude that the improvement under scenario № 5 shows that, in complex geospatial contexts, the interaction between datasets is as important as the datasets themselves. Similar results were reported by Zhang et al. [18], where the MGFS-derived model demonstrated the best model performance for SOC density mapping. For comparison, the authors also used Boruta, RFE, and VIF approaches, and their R² ranged from 0.48 to 0.57, while the MGFS-based model resulted in an R² of 0.60 using only nine explanatory variables. Hence, while VIF is a filter method that ignores predictive power and RFE uses greedy backward elimination that can remove useful variables prematurely, MGFS employs a forward search that evaluates all candidate additions at each step.

The prediction performance of spatial soil models depends on the training data (quantity and quality), the explanatory variables, and the predictive technique itself. With this in mind, the main limitation of our study is the relatively small sample size. For this reason, we used the RF approach, which is suitable for small datasets [47], and adopted a repeated 10-fold cross-validation strategy with 10 repetitions to obtain more robust performance estimates. Nevertheless, future work should validate the model on larger, independent datasets, incorporate additional covariates, and test model transferability across diverse regions.
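
For readers unfamiliar with the resampling scheme, the sketch below generates the index splits for 10-fold cross-validation repeated 10 times on 181 samples. It is an illustration only and not the implementation used in this study; the function name, the seed, and the way folds are cut from the shuffled order are assumptions.

```python
import random


def repeated_kfold(n_samples, k=10, repeats=10, seed=42):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV.

    Each repetition reshuffles the sample indices and splits them into k
    folds of near-equal size, so every sample is tested once per repetition.
    """
    rng = random.Random(seed)
    for rep in range(repeats):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]
        for f, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield rep, f, train_idx, test_idx


splits = list(repeated_kfold(181, k=10, repeats=10))
fold_sizes = sorted({len(test) for _, _, _, test in splits})
print(len(splits), fold_sizes)  # prints: 100 [18, 19]
```

Averaging a metric over the 100 resulting test folds gives a more stable estimate than a single 10-fold run, which matters when each fold holds only 18 or 19 samples.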

5. Conclusions

This study applied the DSM framework to clay content by combining Sentinel-2 bare soil mosaics with diverse covariates and advanced variable selection techniques. The following conclusions summarize the key findings:

The per-pixel-based bare soil mosaic using a single threshold (NDVI < 0.23) outperformed other temporal alternatives and resulted in RMSE = 10.03%, R² = 0.25, and RPD = 1.16. Hence, RS data alone explained only 25% of the clay variation across croplands.
Using regional covariates representing soil-forming factors, the models demonstrated comparable performance (RMSE = 9.77%, R² = 0.27, and RPD = 1.18). Among them, MODIS and Landsat variables were the most important; i.e., RS data representing vegetation conditions remained key predictors among other environmental factors.
Combining a multi-temporal mosaic and a regional set of covariates showed different accuracy depending on the variable selection method. Using all variables without filtering led to a model accuracy of RMSE = 9.45%, R² = 0.33, and RPD = 1.24, whereas excluding variables after the VIF and RFE procedures slightly reduced the accuracy (RMSE = 9.65%, R² = 0.31, RPD = 1.21). This is explained by the fact that the excluded variables (mainly highly correlated) also contained useful information.
The model that combined the bare soil mosaic and regional variables with the MGFS technique increased R² to 0.42 (RMSE = 8.73%, RPD = 1.34), highlighting the importance of the variable selection method. Under this predictive model, NIR bands and relief were the most important. This confirms that other pedogenic factors should be used beyond reflectance.
We recommend that future studies incorporate multi-temporal remote sensing data alongside complementary covariates that represent other soil-forming factors, while also adopting a more robust variable selection strategy. This workflow is applicable to other cropland areas with analogous climate and terrain conditions, where bare soil mosaics can be effectively combined with regional covariates and variable selection techniques for DSM tasks.

Author Contributions

Conceptualization: A.S. and N.K.; Data curation: A.S. and I.A.; Formal analysis: A.S., N.K., R.S. and I.A.; Funding acquisition: A.S.; Investigation: A.S. and I.A.; Methodology: A.S. and N.K.; Project administration: A.S.; Resources: I.A.; Software: A.S., N.K. and R.S.; Supervision: A.S.; Validation: N.K. and R.S.; Visualization: A.S.; Writing—original draft: A.S. and N.K.; Writing—review and editing: I.A. and R.S. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Russian Science Foundation, project No. 25-74-00002.

Data Availability Statement

The data presented in this study are available on reasonable request from the authors.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

RS	Remote sensing
VIF	Variance inflation factor
RFE	Recursive feature elimination
NIR	Near-infrared
RMSE	Root mean square error
RPD	Ratio of performance to deviation
R²	Coefficient of determination
MGFS	Modified greedy feature selection
SOC	Soil organic carbon
NDVI	Normalized Difference Vegetation Index
NBR2	Normalized Burn Ratio 2
S2WI	Soil Surface Moisture Index
AOI	Area of interest
BSI	Bare Soil Index

References

1. Dixon, J.B. Roles of Clays in Soils. Appl. Clay Sci. 1991, 5, 489–503.
2. Zanini, M.; Priori, S.; Petito, M.; Cantalamessa, S. Digital Soil Mapping for Precision Agriculture Using Multitemporal Sentinel-2 Images of Bare Ground. In Proceedings of the 2023 IEEE International Workshop on Metrology for Agriculture and Forestry (MetroAgriFor), Pisa, Italy, 6–8 November 2023; pp. 232–236.
3. Schweizer, S.A.; Mueller, C.W.; Höschen, C.; Ivanov, P.; Kögel-Knabner, I. The Role of Clay Content and Mineral Surface Area for Soil Organic Carbon Storage in an Arable Toposequence. Biogeochemistry 2021, 156, 401–420.
4. Ayoubi, S.; Milikian, A.; Mosaddeghi, M.R.; Zeraatpisheh, M.; Zhao, S. Impacts of Clay Content and Type on Shear Strength and Splash Erosion of Clay–Sand Mixtures. Minerals 2022, 12, 1339.
5. Kriuchkov, N.R.; Makarov, O.A. Modeling Dynamics of Soil Erosion by Water Due to Soil Organic Matter Change (1980–2020) in the Steppe Zone of Russia. Agronomy 2023, 13, 2527.
6. Wiesmeier, M.; Urbanski, L.; Hobley, E.; Lang, B.; von Lützow, M.; Marin-Spiotta, E.; van Wesemael, B.; Rabot, E.; Ließ, M.; Garcia-Franco, N.; et al. Soil Organic Carbon Storage as a Key Function of Soils—A Review of Drivers and Indicators at Various Scales. Geoderma 2019, 333, 149–162.
7. Johannes, A.; Matter, A.; Schulin, R.; Weisskopf, P.; Baveye, P.C.; Boivin, P. Optimal Organic Carbon Values for Soil Structure Quality of Arable Soils. Does Clay Content Matter? Geoderma 2017, 302, 14–21.
8. Mäkipää, R.; Menichetti, L.; Martínez-García, E.; Törmänen, T.; Lehtonen, A. Is the Organic Carbon-to-Clay Ratio a Reliable Indicator of Soil Health? Geoderma 2024, 444, 116862.
9. Wenzel, W.W.; Golestanifard, A.; Duboc, O. SOC: Clay Ratio: A Mechanistically-Sound, Universal Soil Health Indicator across Ecological Zones and Land Use Categories? Geoderma 2024, 452, 117080.
10. McBratney, A.B.; Mendonça Santos, M.L.; Minasny, B. On Digital Soil Mapping. Geoderma 2003, 117, 3–52.
11. Chen, Q.; Vaudour, E.; Richer-de-Forges, A.C.; Arrouays, D. Spectral Indices in Remote Sensing of Soil: Definition, Popularity, and Issues. A Critical Overview. Remote Sens. Environ. 2025, 329, 114918.
12. Vaudour, E.; Gomez, C.; Lagacherie, P.; Loiseau, T.; Baghdadi, N.; Urbina-Salazar, D.; Loubet, B.; Arrouays, D. Temporal Mosaicking Approaches of Sentinel-2 Images for Extending Topsoil Organic Carbon Content Mapping in Croplands. Int. J. Appl. Earth Obs. Geoinf. 2021, 96, 102277.
13. Gasmi, A.; Gomez, C.; Lagacherie, P.; Zouari, H.; Laamrani, A.; Chehbouni, A. Mean Spectral Reflectance from Bare Soil Pixels along a Landsat-TM Time Series to Increase Both the Prediction Accuracy of Soil Clay Content and Mapping Coverage. Geoderma 2021, 388, 114864.
14. Vaudour, E.; Gomez, C.; Loiseau, T.; Baghdadi, N.; Loubet, B.; Arrouays, D.; Ali, L.; Lagacherie, P. The Impact of Acquisition Date on the Prediction Performance of Topsoil Organic Carbon from Sentinel-2 for Croplands. Remote Sens. 2019, 11, 2143.
15. Chinilin, A.; Lozbenev, N.; Shilov, P.; Fil, P.; Levchenko, E.; Kozlov, D. Synergetic Use of Bare Soil Composite Imagery and Multitemporal Vegetation Remote Sensing for Soil Mapping (A Case Study from Samara Region’s Upland). Land 2024, 13, 2229.
16. Azizi, K.; Garosi, Y.; Ayoubi, S.; Tajik, S. Integration of Sentinel-1/2 and Topographic Attributes to Predict the Spatial Distribution of Soil Texture Fractions in Some Agricultural Soils of Western Iran. Soil Tillage Res. 2023, 229, 105681.
17. Suleymanov, A.; Telyagissov, S.; Asylbaev, I.; Mirsayapov, R.; Suleymanov, R.; Keshavarzi, A.; Tuktarova, I.; Belan, L. Multi-Scenario Modeling of Soil Organic Carbon in Semi-Arid Croplands with Uncertainty Quantification and Model Interpretation. Remote Sens. Appl. Soc. Environ. 2026, 41, 101903.
18. Zhang, X.; Chen, S.; Xue, J.; Wang, N.; Xiao, Y.; Chen, Q.; Hong, Y.; Zhou, Y.; Teng, H.; Hu, B.; et al. Improving Model Parsimony and Accuracy by Modified Greedy Feature Selection in Digital Soil Mapping. Geoderma 2023, 432, 116383.
19. Kasraei, B.; Schmidt, M.G.; Zhang, J.; Bulmer, C.E.; Filatow, D.S.; Arbor, A.; Pennell, T.; Heung, B. A Framework for Optimizing Environmental Covariates to Support Model Interpretability in Digital Soil Mapping. Geoderma 2024, 445, 116873.
20. Demir, S.; Sahin, E.K. An Investigation of Feature Selection Methods for Soil Liquefaction Prediction Based on Tree-Based Ensemble Algorithms Using AdaBoost, Gradient Boosting, and XGBoost. Neural Comput. Appl. 2023, 35, 3173–3190.
21. Ferhatoglu, C.; Miller, B.A. Choosing Feature Selection Methods for Spatial Modeling of Soil Fertility Properties at the Field Scale. Agronomy 2022, 12, 1786.
22. Broeg, T.; Don, A.; Scholten, T.; Erasmi, S. Reducing Bias in Cropland Soil Organic Carbon and Clay Predictions Using Sentinel-2 Composites and Data Balancing. Remote Sens. Environ. 2026, 333, 115109.
23. Gomez, C.; Vaudour, E.; Féret, J.-B.; de Boissieu, F.; Dharumarajan, S. Topsoil Clay Content Mapping in Croplands from Sentinel-2 Data: Influence of Atmospheric Correction Methods across a Season Time Series. Geoderma 2022, 423, 115959.
24. Suleymanov, A.; Chen, Q.; Richer-de-Forges, A.; Arrouays, D.; Vaudour, E.; Asylbaev, I.; Mirsayapov, R.; Telyagissov, S.; Valiev, G.; Shagaliev, R.; et al. Synergetic Integration of Multi-Temporal Remote Sensing Mosaic and Conventional Soil Map for Mapping Organic Carbon Content in Chernozems. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 28358–28374.
25. Kasraei, B.; Schmidt, M.G.; Saurette, D.D.; Bulmer, C.E.; Zhang, J.; Pennell, T.; John, K.; Heung, B. Advancing Digital Soil Mapping with Multi-Year Crop Cover Data: Impacts on Model Accuracy and Soil Interpretation. Geoderma 2025, 461, 117481.
26. Wang, X.; Zhou, F.; Chen, S.; Deng, X.; Ren, Z.; Duan, S.-B.; Wang, M.; Shi, Z. Advancing Provincial Cropland Soil Mapping with Temporal Satellite Data Integration. Soil Use Manag. 2025, 41, e70108.
27. GOST 12536-2014; Grounds Methods for Laboratory Determination of Granulometric (Grain Size) and Microaggregate Composition. The Interstate Council for Standardization, Metrology and Certification: Minsk, Belarus, 2014.
28. Kachinsky, N.A. Mechanical and Microagregate Composition of Soils Methods for Study; USSR Academy of Science Publishers: Moscow, Russia, 1958; p. 193. (In Russian)
29. Tucker, C. Red and Photographic Infrared Linear Combinations for Monitoring Vegetation. Remote Sens. Environ. 1979, 8, 127–150.
30. Key, C.; Benson, N. Landscape Assessment: Ground Measure of Severity, the Composite Burn Index; and Remote Sensing of Severity, the Normalized Burn Ratio. In FIREMON: Fire Effects Monitoring and Inventory System; General Technical Reports; USDA Forest Service, Rocky Mountain Research Station: Ogden, UT, USA, 2006; p. LA 1-51.
31. Roy, P.; Rikimaru, A.; Miyatake, S. Tropical Forest Cover Density Mapping. Trop. Ecol. 2002, 43, 39–47.
32. Gorelick, N.; Hancher, M.; Dixon, M.; Ilyushchenko, S.; Thau, D.; Moore, R. Google Earth Engine: Planetary-Scale Geospatial Analysis for Everyone. Remote Sens. Environ. 2017, 202, 18–27.
33. Suleymanov, A.; Valiev, G.; Shagaliev, R.; Suleymanov, R.; Belan, L. Three-Dimensional Mapping of Key Soil Properties with Multi-Stage Validation and Big Data. CATENA 2026, 266, 109949.
34. Poggio, L.; de Sousa, L.M.; Batjes, N.H.; Heuvelink, G.B.M.; Kempen, B.; Ribeiro, E.; Rossiter, D. SoilGrids 2.0: Producing Soil Information for the Globe with Quantified Spatial Uncertainty. Soil 2021, 7, 217–240.
35. Guyon, I.; Weston, J.; Barnhill, S.; Vapnik, V. Gene Selection for Cancer Classification Using Support Vector Machines. Mach. Learn. 2002, 46, 389–422.
36. Curto, J.; Pinto, J. The Corrected VIF. J. Appl. Stat. 2011, 38, 1499–1507.
37. Drobnič, F.; Kos, A.; Pustišek, M. On the Interpretability of Machine Learning Models and Experimental Feature Selection in Case of Multicollinear Data. Electronics 2020, 9, 761.
38. R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. 2025. Available online: https://www.r-project.org/ (accessed on 15 January 2026).
39. Kuhn, M. Building Predictive Models in R Using the Caret Package. J. Stat. Softw. 2008, 28, 1–26.
40. Gholizadeh, A.; Žižala, D.; Saberioon, M.; Borůvka, L. Soil Organic Carbon and Texture Retrieving and Mapping Using Proximal, Airborne and Sentinel-2 Spectral Imaging. Remote Sens. Environ. 2018, 218, 89–103.
41. Gaab, J.; Ruiz, L.; Sekhar, M.; Dharumarajan, S.; Masika Tutondele, J.; Ebengo Okoto, M.; Gomez, C. Detection of Short-Term Changes in Soil Clay Content at Field Scale with Sentinel-2 Enables Mapping of the Reservoir Sediment Reuse in Cropland. Geoderma 2025, 464, 117630.
42. Zhou, T.; Geng, Y.; Chen, J.; Pan, J.; Haase, D.; Lausch, A. High-Resolution Digital Mapping of Soil Organic Carbon and Soil Total Nitrogen Using DEM Derivatives, Sentinel-1 and Sentinel-2 Data Based on Machine Learning Algorithms. Sci. Total Environ. 2020, 729, 138244.
43. Adeline, K.R.M.; Gomez, C.; Gorretta, N.; Roger, J.-M. Predictive Ability of Soil Properties to Spectral Degradation from Laboratory Vis-NIR Spectroscopy Data. Geoderma 2017, 288, 143–153.
44. Tümsavaş, Z.; Tekin, Y.; Ulusoy, Y.; Mouazen, A.M. Prediction and Mapping of Soil Clay and Sand Contents Using Visible and Near-Infrared Spectroscopy. Biosyst. Eng. 2019, 177, 90–100.
45. Hobley, E.; Prater, I. Estimating Soil Texture from Vis–NIR Spectra. Eur. J. Soil Sci. 2018, 70, 83–95.
46. Hanberry, B.B. Practical Guide for Retaining Correlated Climate Variables and Unthinned Samples in Species Distribution Modeling, Using Random Forests. Ecol. Inform. 2024, 79, 102406.
47. Khaledian, Y.; Miller, B.A. Selecting Appropriate Machine Learning Methods for Digital Soil Mapping. Appl. Math. Model. 2020, 81, 401–418.

Figure 1. Locations: (a) the Republic of Bashkortostan on the European continent; (b) study site within the southern part of the republic; and (c) the sampling points (red dots) within the study area. Source: Google Maps.

Figure 2. Scatterplots of the observed and predicted clay contents using different scenarios.

Figure 3. Relative importance of the top 10 environmental covariates for clay prediction under scenarios № 3, 4, and 5. The source of the covariate is shown in parentheses.

Figure 4. Digital maps of clay content predicted under scenarios № 1 and 5 (top left), the difference between them, calculated as model 1 minus model 5 (top right), and their uncertainty maps showing the 90% prediction interval width obtained with the quantile regression forest technique (bottom row).

Table 1. Generated bare soil mosaics using Sentinel-2 data for clay mapping.

Approach	Driver	Mosaic Name	Condition for Selection
Per-pixel	Bare soil	PerPix_Bare_NDVI023_best_FULLSTACK	NDVI ≤ T1 * (NDVI p10 ≈ 0.23). Best: min NDVI
Per-pixel	Bare soil	PerPix_Bare_NDVI023_FULLSTACK_mean	NDVI ≤ T1 (NDVI p10 ≈ 0.23). Aggregation: mean.
Per-pixel	Bare soil	PerPix_Bare_NDVI023_FULLSTACK_median	NDVI ≤ T1 (NDVI p10 ≈ 0.23). Aggregation: median.
Per-pixel	Bare soil	PerPix_Bare_NDVI029_best_FULLSTACK	NDVI ≤ T2 ** (NDVI p25 ≈ 0.29). Best: min NDVI
Per-pixel	Bare soil	PerPix_Bare_NDVI029_FULLSTACK_mean	NDVI ≤ T2 (NDVI p25 ≈ 0.29). Aggregation: mean.
Per-pixel	Bare soil	PerPix_Bare_NDVI029_FULLSTACK_median	NDVI ≤ T2 (NDVI p25 ≈ 0.29). Aggregation: median.
Per-pixel	Bare soil	PerPix_Bare_BSI_best_FULLSTACK	Mask: cloud/shadow-screened scenes. Best: max BSI.
Per-pixel	Bare soil	PerPix_Bare_BSI_FULLSTACK_mean	Mask: cloud/shadow-screened scenes. Aggregation: mean.
Per-pixel	Bare soil	PerPix_Bare_BSI_FULLSTACK_median	Mask: cloud/shadow-screened scenes. Aggregation: median.
Per-pixel	Bare soil	PerPix_Bare_NBR2_best_FULLSTACK	NDVI ≤ T2 (NDVI p25 ≈ 0.29) + GVI1 *** > 0 & GVI2 **** > 0 + NBR2 < 0.16. Best: min NBR2
Per-pixel	Bare soil	PerPix_Bare_NBR2_FULLSTACK_mean	NDVI ≤ T2 (NDVI p25 ≈ 0.29) + GVI1 > 0 & GVI2 > 0 + NBR2 < 0.16. Aggregation: mean.
Per-pixel	Bare soil	PerPix_Bare_NBR2_FULLSTACK_median	NDVI ≤ T2 (NDVI p25 ≈ 0.29) + GVI1 > 0 & GVI2 > 0 + NBR2 < 0.16. Aggregation: median NBR2.
Per-pixel	Driest soil	PerPix_Driest_S2WI_best_FULLSTACK	NDVI ≤ T2 (NDVI p25 ≈ 0.29). Best: min S2WI.
Per-pixel	Driest soil	PerPix_Driest_S2WI_FULLSTACK_mean	NDVI ≤ T2 (NDVI p25 ≈ 0.29). Aggregation: mean S2WI.
Per-pixel	Driest soil	PerPix_Driest_S2WI_FULLSTACK_median	NDVI ≤ T2 (NDVI p25 ≈ 0.29). Aggregation: median S2WI.
Per-pixel	Bare soil	PerPix_RuleStrict_FULLSTACK_median	NDVI < 0.23 + NBR2 < 0.16 + BSI > 0.20. Aggregation: median.
Per-pixel	Bare soil	PerPix_RuleStrict_bestS2WI_FULLSTACK	NDVI < 0.23 + NBR2 < 0.16 + BSI > 0.20. Best: min S2WI.
Per-pixel	Driest soil	PerPix_Driest_S2WI_NBR2minLexi_best_FULLSTACK	NDVI ≤ T2 (NDVI p25 ≈ 0.29) + (GVI1 > 0 & GVI2 > 0). Keep per-pixel NBR2 minima, then Best: min S2WI.
Per-pixel	Driest soil	PerPix_Driest_S2WI_NBR2minLexi_FULLSTACK_mean	NDVI ≤ T2 (NDVI p25 ≈ 0.29) + (GVI1 > 0 & GVI2 > 0). Candidates: per-pixel NBR2 minima. Aggregation: mean.
Per-pixel	Driest soil	PerPix_Driest_S2WI_NBR2minLexi_FULLSTACK_median	NDVI ≤ T2 (NDVI p25 ≈ 0.29) + (GVI1 > 0 & GVI2 > 0). Candidates: per-pixel NBR2 minima. Aggregation: median.
Per-date	Bare soil	PerDate_NDVI023_FULLSTACK	NDVI ≤ T1 (NDVI p10 ≈ 0.23). Date ranking: AOI mean NDVI (min).
Per-date	Bare soil	PerDate_NDVI029_FULLSTACK	Mask: NDVI ≤ T2 (NDVI p25 ≈ 0.29). Date ranking: AOI mean NDVI (min).
Per-date	Bare soil	PerDate_Bare_NBR2_FULLSTACK	Mask: NDVI ≤ T2 (NDVI p25 ≈ 0.29) + (GVI1 > 0 & GVI2 > 0). Date ranking: AOI mean NBR2 (min).
Per-date	Bare soil	PerDate_BSI_FULLSTACK	Mask: NDVI ≤ T2 (NDVI p25 ≈ 0.29). Date ranking: AOI mean BSI (max).
Per-date	Driest soil	PerDate_Driest_S2WI_FULLSTACK	Mask: NDVI ≤ T2 (NDVI p25 ≈ 0.29). Date ranking: AOI mean S2WI (min).
Per-date	Driest soil	PerDate_Driest_S2WI_NBR2thr_FULLSTACK	Scene filter: AOI mean NBR2 < 0.16; Mask: NDVI ≤ T2 (NDVI p25 ≈ 0.29). Ranking: AOI mean S2WI (min).

* T1—NDVI 10th percentile (p10); ** T2—NDVI 25th percentile (p25); *** GVI1—difference between B3 and B2 (green vegetation index 1); **** GVI2—difference between B4 and B3 (green vegetation index 2).
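
To illustrate how a per-pixel median mosaic of the kind listed in Table 1 can be formed, the sketch below masks the acquisition dates of a single pixel by NDVI and takes the per-band median of the remaining observations. It is a simplified stand-in for the actual mosaicking workflow, which is not reproduced here. The function name, the band keys, and the reflectance values are assumptions, and the 0.23 threshold is used as a fixed value, whereas T1 in Table 1 is derived from the NDVI 10th percentile.

```python
from statistics import median


def median_bare_soil_pixel(observations, ndvi_max=0.23):
    """Median reflectance per band over dates that pass the NDVI mask.

    `observations` is a list of dicts, one per acquisition date, each holding
    "red" and "nir" reflectance plus any other bands. Returns None when no
    date qualifies, i.e. the pixel stays empty in the mosaic.
    """
    bare = []
    for obs in observations:
        ndvi = (obs["nir"] - obs["red"]) / (obs["nir"] + obs["red"])
        if ndvi <= ndvi_max:
            bare.append(obs)
    if not bare:
        return None
    return {band: median(o[band] for o in bare) for band in bare[0]}


# Made-up time series for one pixel (reflectance values are illustrative).
pixel_series = [
    {"red": 0.10, "nir": 0.14, "swir1": 0.25},  # NDVI ~ 0.17, bare
    {"red": 0.05, "nir": 0.40, "swir1": 0.20},  # NDVI ~ 0.78, vegetated
    {"red": 0.12, "nir": 0.16, "swir1": 0.27},  # NDVI ~ 0.14, bare
    {"red": 0.11, "nir": 0.17, "swir1": 0.30},  # NDVI ~ 0.21, bare
]

mosaic_pixel = median_bare_soil_pixel(pixel_series)
print(mosaic_pixel)
```

Replacing `median` with a mean, or keeping only the date with the lowest NDVI, would give the "mean" and "best" variants of the same masking rule.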

Table 2. Tested scenarios for clay mapping.

Scenario/Model, №	Covariates	Algorithm	Variable Selection Approach
1	Spectral bands from bare soil mosaic (Sentinel-2)	PLS, RF, KNN, Cubist, SVM	-
2	Regional covariates	PLS, RF, KNN, Cubist, SVM	RFE, VIF
3	The best mosaic from scenario/model № 1 + all regional covariates	RF	-
4	The best mosaic from scenario/model № 1 + regional covariates	RF	RFE, VIF
5	The best mosaic from scenario/model № 1 + regional covariates	RF	MGFS

Table 3. Summary statistics of clay content (n = 181).

Parameter	Min, %	Max, %	Mean, %	Median, %	SD, %
Clay	23.9	78.1	56.5	59.6	11.2

Table 4. Top 5 most accurate predictive models using different bare soil mosaics.

Bare Soil Mosaic	Description	Number of Training Samples	Algorithm	RMSE, %	R²	RPD
1	PerPix_Bare_NDVI023_FULLSTACK_median	181	RF	10.03	0.25	1.16
2	PerPix_Bare_NDVI023_FULLSTACK_median	181	SVM	10.11	0.24	1.15
3	PerPix_Bare_NDVI023_FULLSTACK_median	181	Cubist	10.16	0.23	1.14
4	PerDate_NDVI023_FULLSTACK	172	SVM	10.21	0.20	1.11
5	PerPix_Bare_NBR2_FULLSTACK_median	182	SVM	10.28	0.19	1.09

Table 5. Spearman correlation coefficients between clay content and spectral bands derived from the best Sentinel-2A mosaic.

Blue (B2)	Green (B3)	Red (B4)	RE1 (B5)	RE2 (B6)	RE3 (B7)	NIR (B8)	RE4 (B8A)	SWIR 1 (B11)	SWIR 2 (B12)
−0.31 ***	−0.29 ***	−0.29 ***	−0.32 ***	−0.34 ***	−0.35 ***	−0.36 ***	−0.35 ***	−0.24 **	0.06

*** and ** are statistically significant at p < 0.001 and p < 0.01, respectively.

Table 6. The performance of the clay content predictions according to the tested scenarios.

Scenario/Model, №	Covariates	Variable Selection Approach	Number of Training Samples	Algorithm	RMSE, %	R²	RPD
1	Spectral bands from bare soil mosaic (Sentinel-2)	-	181	RF	10.03	0.25	1.16
2	Regional covariates	RFE, VIF	181	RF	9.77	0.27	1.18
3	The best mosaic from scenario/model № 1 + all regional covariates	-	181	RF	9.45	0.33	1.24
4	The best mosaic from scenario/model № 1 + regional covariates	RFE, VIF	181	RF	9.65	0.31	1.21
5	The best mosaic from scenario/model № 1 + regional covariates	MGFS	181	RF	8.73	0.42	1.34
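
The three metrics in Table 6 can be computed as in the sketch below. The definitions shown are common conventions and are assumptions here: R² is taken as 1 − SSE/SST (some software reports the squared correlation instead), and RPD uses the sample standard deviation of the observations. The observed and predicted values are invented, so the output is not comparable with Table 6.

```python
from math import sqrt
from statistics import mean, stdev


def rmse(observed, predicted):
    """Root mean square error."""
    return sqrt(mean([(o - p) ** 2 for o, p in zip(observed, predicted)]))


def r_squared(observed, predicted):
    """Coefficient of determination as 1 - SSE/SST (one common definition)."""
    o_mean = mean(observed)
    sse = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    sst = sum((o - o_mean) ** 2 for o in observed)
    return 1.0 - sse / sst


def rpd(observed, predicted):
    """Ratio of performance to deviation: SD of observations / RMSE."""
    return stdev(observed) / rmse(observed, predicted)


# Made-up clay contents (%) for four validation points.
observed = [40, 50, 60, 70]
predicted = [44, 48, 62, 66]

rmse_value = rmse(observed, predicted)
r2_value = r_squared(observed, predicted)
rpd_value = rpd(observed, predicted)
print(round(rmse_value, 2), round(r2_value, 2), round(rpd_value, 2))
```

Because RPD divides the spread of the observations by the RMSE, a value close to 1 indicates that the model predicts little better than the sample mean.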

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

