1. Introduction
Soil organic matter (SOM) is one of the most critical components of cropland soils and plays a fundamental role in maintaining soil health and ensuring sustainable agricultural development [1]. In the context of the global carbon cycle, SOM constitutes an essential part of the terrestrial carbon pool [2] and plays a key role in carbon sequestration and greenhouse gas balance [3]. Changes in SOM not only affect carbon cycling dynamics within agricultural systems but also directly influence global climate regulation mechanisms [4]. Regarding soil fertility, the functional mechanisms of SOM are complex and multifaceted. On one hand, SOM can enhance the retention of nutrients such as nitrogen, phosphorus, and potassium through adsorption and complexation processes [5]. On the other hand, it improves soil aggregate structure, thereby enhancing soil aeration and water-holding capacity [6]. However, in recent years, long-term cultivation has led to severe SOM degradation [7], and the consequent reduction in SOM content may accelerate soil degradation and pose a threat to food security [8]. Therefore, predicting the spatial distribution of SOM is crucial for land conservation and sustainable utilization.
Over the past decades, with the continuous development of remote sensing technology and machine learning algorithms, digital soil mapping (DSM) has been widely applied in the prediction of soil properties [9]. Remote sensing offers advantages such as wide spatial coverage, efficient data acquisition, and high temporal resolution [10], effectively overcoming the high costs and limited representativeness associated with traditional soil sampling. Consequently, it has gradually become a core technical approach for monitoring and assessing cropland SOM, providing abundant data sources for SOM mapping [11]. Meanwhile, machine learning algorithms can capture the complex nonlinear relationships between remote sensing data and SOM, thereby improving mapping accuracy [12]. Together, these advances have significantly improved the accuracy and efficiency of spatial SOM prediction [13]. Traditional statistical methods often rely on linear assumptions during modeling, making it difficult to fully capture the complex nonlinear relationships among soil, climate, topography, and vegetation [13]. In contrast, machine learning approaches, such as Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Cubist, can handle high-dimensional, multi-source heterogeneous data and automatically identify nonlinear and interactive effects among variables, and are therefore widely applied in DSM [14]. Especially when estimating SOM using remote sensing data, machine learning models can fully exploit multi-temporal and multi-spectral information by integrating bare-soil and crop-season imagery with meteorological and topographic factors, thereby improving prediction accuracy and spatial generalization [15]. In addition, machine learning methods are generally robust to variations in data quality and sample distribution under different climatic conditions, which supports stable performance across drought, flood, and normal years. Furthermore, with the increasing availability of high-resolution remote sensing imagery and multi-source data, combining machine learning models with feature selection and data fusion strategies can effectively reduce redundant variables and enhance model interpretability. Collectively, integrating remote sensing imagery with machine learning methods allows for accurate estimation and spatial mapping of SOM in black soil regions under varying climatic conditions, providing a scientific basis for black soil conservation and sustainable agricultural development.
In practice, the performance of remote sensing imagery in soil monitoring is constrained by surface cover conditions [16]. Images acquired during the bare-soil period can often provide more direct spectral information of the soil, thereby offering distinct advantages for SOM mapping [17]. However, imagery from the crop-growing season should not be overlooked. Vegetation growth is closely related to soil nutrient content and can indirectly reflect SOM levels [18]. Therefore, bare-soil and crop-season images are complementary. In recent years, an increasing number of studies have shown that the judicious combination of these two temporal datasets can significantly improve the accuracy of soil property predictions [19]. Nevertheless, the applicability and effectiveness of such combinations may vary across regions and climatic conditions, and scientifically selecting the appropriate temporal combination remains a key challenge in remote sensing-based soil mapping [20].
Climatic conditions are a key external factor influencing the spectral signals of soil and vegetation in remote sensing imagery [21]. Variations in precipitation and temperature can modify soil moisture, vegetation cover, and crop growth dynamics, thereby altering the spectral responses of both soil and vegetation [22]. In flood years, excessive precipitation increases soil saturation and promotes vigorous crop growth, resulting in a marked reduction in bare-soil areas. Under these conditions, soil reflectance is often masked by high moisture levels and dense vegetation cover, limiting the usability of bare-soil period imagery [23]. In such cases, crop-season imagery may play a more important role, as the growth state of crops directly reflects soil nutrient and moisture conditions, thereby indirectly providing information for SOM estimation [24]. Conversely, in drought years, insufficient precipitation and dry soils suppress crop growth, resulting in relatively larger bare-soil areas. Under these conditions, bare-soil period imagery can more clearly capture soil spectral characteristics [25]. However, arid conditions may alter surface soil structure and reflectance, and without correction for moisture conditions, this can introduce biases in SOM estimation [26]. In normal years, moderate soil moisture and temperature conditions maintain a balance between crop cover and bare-soil exposure, allowing soil information to be obtained while vegetation signals indirectly reflect soil properties, thereby stabilizing the accuracy of image combinations. Overall, interannual climatic variability significantly modulates the spectral responses of soil and vegetation, potentially leading to differences in the optimal temporal combination strategies under different climatic conditions. Therefore, investigating the variation patterns of image features under various climatic scenarios and exploring optimal image combinations for flood, drought, and normal years are crucial for improving SOM mapping accuracy and enhancing the robustness of predictive models.
In summary, although DSM has made considerable progress across various regions and scales, existing studies have rarely systematically considered the effects of interannual climatic variability on the accuracy of remote sensing-based SOM predictions. In particular, the adaptability and functional mechanisms of multi-temporal image combination strategies, environmental covariates, and feature selection under different climatic conditions remain unclear in typical farmland ecosystems. This research gap can be attributed to three key constraints. First, historical DSM studies have often prioritized the spatial variability of SOM while treating temporal dynamics as a secondary factor: many studies focus on single-year or short-term datasets and assume relatively stable climatic conditions over the study period, which overlooks the modulatory role of interannual climate fluctuations on soil–vegetation spectral responses. Second, data availability and consistency pose practical challenges: long-term, high-quality remote sensing datasets are often scarce, especially for regions with frequent extreme weather, and matching multi-year remote sensing data with synchronous in situ SOM samples requires substantial labor and resource investment, which limits large-scale systematic investigations. Third, there are methodological limitations in integrating climate variability: traditional machine learning models for DSM are often optimized for specific climatic conditions and lack adaptive frameworks to quantify how interannual changes in temperature and precipitation alter the predictive power of remote sensing features.
To address this gap, this study focused on Youyi Farm in the Sanjiang Plain and selected three representative climatic years: 2019 (flood year), 2020 (normal year), and 2021 (drought year). By integrating multi-temporal Sentinel-2 imagery, environmental factors, and an RF model, we systematically evaluated the mapping performance of single-period and dual-period image combinations, examined the contributions of environmental covariates and feature selection to model accuracy and stability, and revealed the spatial distribution patterns of cropland SOM under different climatic conditions. Notably, this study links interannual climatic variability to the optimization of remote sensing temporal combinations for SOM mapping, avoiding the “one-size-fits-all” limitation of traditional fixed-period strategies. It also proposes an RF-based recursive feature elimination (RF-RFE) feature selection framework tailored to complex climatic scenarios, which enhances the robustness of SOM prediction by adaptively screening key features across flood, normal, and drought years. This study aims to refine the methodological framework for SOM mapping under complex climatic scenarios, providing both theoretical guidance and practical references for multi-year cropland quality monitoring, agricultural management, and carbon sink assessment.
3. Results
Before presenting the modeling results, the configuration of predictor variables for each modeling scenario is summarized in Table 4. This table lists the categories and number of input features used in the single-period, dual-period, and integrated models.
3.1. Differences in Optimal Temporal Windows Under Different Climatic Years
Table 5 presents the single-period predictions of SOM based on Sentinel-2 imagery for 2019 (flood year), 2020 (normal year), and 2021 (drought year). Overall, climatic conditions had a significant impact on modeling accuracy.
In 2019, model performance was the poorest, with R2 values ranging from 0.157 to 0.426 and RMSE from 1.064 to 1.290 g·kg−1. The lowest accuracy was observed in October (R2 = 0.157, RMSE = 1.290 g·kg−1), likely due to excessive precipitation resulting in high soil moisture and enhanced vegetation masking effects. In 2021, although model accuracy was higher than in 2019, it remained lower than in 2020. May and April exhibited relatively better performance (R2 = 0.429 and 0.426, respectively), while October showed a marked decline in accuracy (R2 = 0.098, RMSE = 1.333 g·kg−1). Under drought conditions, increased soil exposure facilitated the extraction of soil information, but the larger fluctuations in vegetation indices also introduced greater uncertainty. In contrast, 2020 exhibited the highest overall accuracy, with June and May achieving the best performance (R2 = 0.510 and 0.507, respectively), benefiting from a relatively stable climate that strengthened the correlation between vegetation and SOM.
In summary, remote sensing-based modeling achieved the highest accuracy under normal climatic conditions. In contrast, during extreme years, interference from vegetation growth and soil moisture conditions significantly affected the remote sensing signals, leading to a decline in model performance.
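The two accuracy metrics reported throughout this section can be sketched as follows. This is a minimal illustration that assumes the standard definitions of the coefficient of determination and the root mean square error; the sample values and the function names `r2_score` and `rmse` are hypothetical and are not taken from the study.

```python
import math


def r2_score(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def rmse(observed, predicted):
    """Root mean square error, in the same units as the observations."""
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical SOM values for four validation samples (illustration only).
observed = [3.0, 4.0, 5.0, 6.0]
predicted = [3.5, 3.5, 5.5, 5.5]

print(round(r2_score(observed, predicted), 3))
print(round(rmse(observed, predicted), 3))
```

A higher R2 and a lower RMSE both indicate a closer match between predicted and measured SOM, which is how the monthly models in Table 5 are compared.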
3.2. Effect of Combining Bare Soil and Crop Growth Period Imagery
Table 6 presents the modeling results based on dual-period imagery for different climatic years. Overall, the combination of bare-soil and crop growth period images outperformed single-period imagery.
In 2019, the optimal combination was May and September (R2 = 0.471, RMSE = 1.021 g·kg−1); in 2020, June and July (R2 = 0.563, RMSE = 0.928 g·kg−1); and in 2021, May and July (R2 = 0.520, RMSE = 0.973 g·kg−1). All these optimal combinations included one bare-soil period and one crop growth period, indicating their complementary roles in SOM prediction: the bare-soil period captures soil background information, while the crop growth period provides information on vegetation growth.
The differences in optimal combinations across years also highlight the significant modulating effect of climate. In 2020, with vigorous crop growth, the crop growth period contributed more prominently, whereas in 2019 and 2021, the models relied more heavily on bare-soil imagery, likely due to unstable vegetation signals under flood and drought conditions. Notably, combinations involving October imagery consistently performed poorly, suggesting that post-harvest soil disturbance and crop residues negatively affected the remote sensing signals.
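One way to picture the dual-period setup is to enumerate every pair of acquisition months and concatenate their per-sample features before model fitting. The sketch below assumes this simple stacking scheme; the month list, the feature values, and the helper `stack_pair` are hypothetical, and the study's actual feature construction may differ.

```python
from itertools import combinations


def stack_pair(features_by_month, month_a, month_b):
    """Concatenate the per-sample feature vectors of two acquisition months."""
    rows_a = features_by_month[month_a]
    rows_b = features_by_month[month_b]
    return [a + b for a, b in zip(rows_a, rows_b)]


# Hypothetical data: two soil samples, two band values per month.
features_by_month = {
    "May": [[0.10, 0.20], [0.11, 0.21]],
    "July": [[0.30, 0.40], [0.31, 0.41]],
    "September": [[0.50, 0.60], [0.51, 0.61]],
}

# All unordered month pairs, in the insertion order of the dictionary keys.
pairs = list(combinations(features_by_month, 2))

# One stacked feature table per pair; each would feed a separate model.
stacked = {pair: stack_pair(features_by_month, *pair) for pair in pairs}

print(pairs)
print(stacked[("May", "July")][0])
```

Each stacked table would then be evaluated with the same model and metrics, and the pair with the best validation accuracy retained for that year.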
3.3. Improvement of Mapping Accuracy by Environmental Variables
Table 7 compares the dual-period modeling results before and after the inclusion of environmental variables. The results indicate that environmental factors improved model accuracy across all three climatic years, with R2 increasing by 0.05–0.10 and RMSE generally decreasing.
In 2019, the optimal combination (May + September) saw R2 increase from 0.471 to 0.537, and RMSE decrease from 1.021 to 0.956 g·kg−1. In 2020, the optimal combination (May + July) achieved an R2 increase from 0.547 to 0.588 and RMSE decrease from 0.946 to 0.901 g·kg−1. In 2021, the optimal combination (May + July) showed R2 improvement from 0.520 to 0.566 and RMSE reduction from 0.973 to 0.926 g·kg−1.
3.4. Differences in the Improvement of Mapping Accuracy Across Climatic Years by Feature Selection
Table 8 presents the performance of the optimal image combinations after feature selection, based on models that already incorporated environmental variables, across different climatic years. The results indicate that the feature selection strategy led to varying degrees of accuracy improvement in all three years.
In 2019, the optimal combination was May + September, with R2 increasing from 0.537 to 0.544 and RMSE decreasing from 0.956 to 0.949 g·kg−1, showing a limited improvement. The most pronounced effect was observed in 2020, where the optimal combination (May + July) achieved R2 of 0.609 and RMSE of 0.879 g·kg−1 after feature selection, representing the largest improvement among the three years. This suggests that the original feature set in 2020 contained more redundant or noisy variables, and feature selection effectively removed interfering information, thereby enhancing model generalization and stability. The notable improvement may also be associated with 2020 being a normal climatic year, where the remote sensing and environmental variables were well-balanced, making it easier for the selection strategy to identify key driving variables. In 2021, the May + July combination saw R2 increase from 0.566 to 0.578 and RMSE decrease from 0.926 to 0.913 g·kg−1 after feature selection. Although the improvement was smaller than in 2020, it still outperformed 2019, demonstrating a certain degree of robustness.
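The size of the feature selection effect in each year follows directly from the values quoted above. The short calculation below uses only the reported R2 and RMSE figures; the variable and function names are illustrative.

```python
# (R2, RMSE in g·kg−1) before and after feature selection, as reported above.
reported = {
    2019: {"before": (0.537, 0.956), "after": (0.544, 0.949)},  # May + September
    2020: {"before": (0.588, 0.901), "after": (0.609, 0.879)},  # May + July
    2021: {"before": (0.566, 0.926), "after": (0.578, 0.913)},  # May + July
}


def selection_effect(entry):
    """Return (R2 gain, RMSE reduction), rounded to three decimals."""
    r2_before, rmse_before = entry["before"]
    r2_after, rmse_after = entry["after"]
    return round(r2_after - r2_before, 3), round(rmse_before - rmse_after, 3)


effects = {year: selection_effect(entry) for year, entry in reported.items()}

# Years ordered from the largest to the smallest R2 gain.
ranking = sorted(effects, key=lambda year: effects[year][0], reverse=True)

for year in ranking:
    print(year, effects[year])
```

The resulting order (2020, then 2021, then 2019) matches the comparison made in the text.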
Overall, feature selection improved SOM prediction accuracy across different climatic years, although its effectiveness depended strongly on the climatic context and the quality of the original feature set. In years with stable climate conditions and complete remote sensing information, feature selection could fully exploit the structural relationships among variables, leading to larger improvements in model accuracy. In contrast, during flood or drought years, remote sensing signals were more disturbed and the improvement was relatively limited, but the selection strategy still enhanced model robustness and resistance to noise. Therefore, in practical applications, feature selection strategies should be adjusted according to the characteristics of the climatic year. In normal years, the elimination threshold can be set based on the minimum decrease in the model’s R2, ensuring the retention of features that carry weak information but no noise. In extreme climatic years, the threshold needs to be tightened to remove features that have even minor negative impacts on model stability, thereby avoiding interference from climate-induced abnormal spectral signals. This retains key information while improving the adaptability and reliability of SOM mapping.
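The idea of an adjustable elimination threshold can be sketched as a backward elimination loop with an R2 tolerance. This is a simplified illustration, not the study's RF-RFE implementation: the function `eliminate_features`, the `tolerance` parameter, the feature names, and the toy scoring function are all hypothetical. In a real workflow, `evaluate` would fit an RF model and return its cross-validated R2 together with its feature importances.

```python
def eliminate_features(features, evaluate, tolerance):
    """Backward elimination with an R2 tolerance.

    evaluate(features) returns (r2, importances), where importances maps
    each feature name to a non-negative score. At each step the least
    important feature is removed; the removal is accepted only if R2 stays
    within `tolerance` of the best R2 seen so far.
    """
    current = list(features)
    current_r2, importances = evaluate(current)
    best_r2 = current_r2
    while len(current) > 1:
        weakest = min(current, key=lambda name: importances[name])
        candidate = [name for name in current if name != weakest]
        candidate_r2, candidate_importances = evaluate(candidate)
        if candidate_r2 < best_r2 - tolerance:
            break  # removing this feature costs too much accuracy
        current, current_r2 = candidate, candidate_r2
        importances = candidate_importances
        best_r2 = max(best_r2, candidate_r2)
    return current, current_r2


# Toy stand-in for model fitting: each hypothetical feature adds a fixed
# amount to R2, and one "noise" feature slightly lowers it.
TOY_CONTRIBUTION = {
    "B4_May": 0.30,
    "NDVI_July": 0.25,
    "elevation": 0.05,
    "noise_band": -0.02,
}


def toy_evaluate(selected):
    r2 = round(sum(TOY_CONTRIBUTION[name] for name in selected), 6)
    importances = {name: abs(TOY_CONTRIBUTION[name]) for name in selected}
    return r2, importances


small_tolerance = eliminate_features(list(TOY_CONTRIBUTION), toy_evaluate, 0.01)
large_tolerance = eliminate_features(list(TOY_CONTRIBUTION), toy_evaluate, 0.06)

print(small_tolerance)
print(large_tolerance)
```

In this toy setting, a small tolerance removes only the noisy feature and keeps the weakly informative one, whereas a larger tolerance prunes more aggressively. How the tolerance should be set for a given climatic year is a modeling choice that this sketch does not settle.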
3.5. Spatial Distribution Mapping of SOM in Farmland
Based on the optimal image combinations for each year (May + September for 2019, and May + July for 2020 and 2021) (Table 8), the spatial distribution of soil organic matter (SOM) in the study area from 2019 to 2021 was predicted (Figure 4). The results indicate that the overall spatial pattern of SOM remained consistent across the three years, with high-value areas in the northeast and low-value areas in the central region, exhibiting a clear spatial gradient.
Specifically, the SOM content ranged from 0.91 to 7.96 g·kg−1 in 2019, 0.93 to 7.98 g·kg−1 in 2020, and 0.97 to 8.30 g·kg−1 in 2021. High-value areas were mainly concentrated in the northeastern part, forming large, continuous distributions, with small patches occasionally appearing in the southwest. The high-value regions remained stable over the three years, showing no significant expansion or contraction. Low-value areas were primarily located in the central region, often forming linear or patchy distributions, with persistently low SOM content, resulting in a pronounced “central–peripheral” contrast.