A Prior Knowledge-Guided Remote Sensing Framework for Maize Yield Estimation and Spatiotemporal Interpretability Analysis

Qi, Beisong; Zhang, Xinle; Chen, Lu; Liu, Huanjun; Meng, Linghua; Han, Xinyi; An, Zeyu; Liu, Jiming

doi:10.3390/rs18101455


by Beisong Qi ¹,², Xinle Zhang ³,*, Lu Chen ², Huanjun Liu ², Linghua Meng ², Xinyi Han ³, Zeyu An ³ and Jiming Liu ³

¹ College of Agriculture, Jilin Agricultural University, Changchun 130118, China
² State Key Laboratory of Black Soils Conservation and Utilization, Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun 130102, China
³ College of Information Technology, Jilin Agricultural University, Changchun 130118, China
* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(10), 1455; https://doi.org/10.3390/rs18101455

Submission received: 26 February 2026 / Revised: 25 April 2026 / Accepted: 3 May 2026 / Published: 7 May 2026

(This article belongs to the Special Issue Advances in Remote Sensing for Crop Monitoring and Food Security (Second Edition))


Highlights

What are the main findings?

The proposed prior knowledge-guided YFKD-XGBoost model achieved high-precision regional maize yield estimation by fusing multi-dimensional remote sensing features.
SHAP analysis comprehensively revealed the driving factors of maize yield formation from global, temporal, and spatial perspectives.

What are the implications of the main findings?

The framework successfully translates complex crop mechanistic parameters into observable remote sensing features, significantly enhancing the interpretability of machine learning models.
The spatial diagnosis of specific yield-limiting factors provides a scalable, actionable tool to support site-specific, differentiated precision agriculture management and decision-making.

Abstract

Accurately predicting crop yield and its spatiotemporal variability is crucial for precision agriculture. This study developed a prior knowledge-guided remote sensing yield estimation framework at Youyi Farm in China. Based on multi-source data from 2016 to 2025, a Yield-Formation Key Dataset (YFKD) was constructed by integrating Meteorological, Eco-physiological, Phenological, and Soil features. Combined with Boruta feature selection, MLR (Multiple Linear Regression), RF (Random Forest), and XGBoost (Extreme Gradient Boosting) models were compared, and SHAP (Shapley Additive Explanations) was utilized for spatiotemporal driving force analysis. The results showed that the YFKD-XGBoost model achieved the optimal performance ($R^{2} = 0.865$, RMSE = 1491 kg/ha), improving accuracy by up to 17.7% compared to the baseline model. Global SHAP analysis revealed that Soil Spectral Reflectance provided the highest contribution. Temporally, the period from late July to mid-September (especially mid-August) served as the critical monitoring window. Spatially, based on the area share of the dominant negative SHAP contributor, Meteorological Background was the most widespread limiting factor (34.8% of the constrained area), Soil Conditions constraints showed localized clustering (16.4%), while Phenological and Eco-physiological constraints dominated intra-field spatial differentiation. This study validated the feasibility of this framework for high-precision yield estimation and the analysis of yield formation driving factors under the constraints of a limited regional dataset (n = 233), providing reliable support for regional differentiated agricultural management.

Keywords:

yield estimation; prior knowledge-guided; remote sensing; machine learning; Shapley additive explanations

1. Introduction

The implementation of precision agricultural management necessitates accurate crop yield data and a profound understanding of the driving factors behind yield variability [1]. As a major global staple crop, maize yield formation is a complex biophysical process co-regulated by long-term environmental background conditions and short-term vegetation dynamics [2]. Even within a single field, maize yield exhibits significant spatial heterogeneity, which is primarily driven by complex interactions among Soil Conditions, topography, Meteorological Background, Phenological Information, and management practices [3]. Accurately characterizing the spatiotemporal features of yield variability and identifying the factors underlying yield formation can provide a scientific basis for site-specific field management. This, in turn, effectively enhances resource use efficiency and yield levels, thereby addressing food security challenges [4]. Therefore, developing a research framework that not only achieves high-precision yield estimation but also analyzes the driving factors of yield formation differences is of paramount significance for advancing smart agriculture and precision management.

Crop yield estimation methods are primarily categorized into process-based mechanistic models and data-driven statistical models. Mechanistic models, such as DSSAT and APSIM, are grounded in solid theory; they simulate the Eco-physiological Process to elucidate yield formation prior knowledge and accurately predict production [5]. However, these models demand complex inputs—including specific Soil Conditions (e.g., texture, organic matter, and moisture), daily Meteorological Background data (e.g., temperature, precipitation, and radiation), and management practices [6]. Such data are often difficult to obtain at regional scales, and their low spatiotemporal resolution fails to capture local microclimate and soil heterogeneity. In contrast, data-driven empirical models directly leverage massive remote sensing observations to establish yield prediction relationships. The rapid advancement of multi-source satellite data has enriched this approach. For instance, Kogan [7] utilized NOAA/AVHRR to develop the Vegetation Condition Index (VCI) and Temperature Condition Index (TCI), demonstrating high correlations with maize yield [8]. Labus [9] identified a strong relationship between yield and integrated growing-season NDVI in Montana, USA. Moriondo [10] achieved synergistic estimation accuracy by combining RS-derived vegetation indices with meteorological data and crop models, while Sakamoto [11] improved prediction applications by deriving dynamic time-series Phenological Information. These technologies enable cost-effective, continuous field-scale monitoring over large areas, providing robust support for data-driven yield estimation.

Research on crop yield estimation models utilizing remote sensing data has gradually evolved from simple vegetation index regression to complex machine learning (ML) methods. In recent years, deep learning (DL) has also achieved significant progress in agricultural remote sensing, demonstrating substantial advantages in automatic feature extraction and the processing of high-dimensional spatiotemporal data [12]. For instance, convolutional neural networks (CNNs) have been widely applied to extract spatial features from remote sensing imagery [13]; particularly in yield prediction tasks, integrating attention mechanisms into CNNs has proven highly effective in capturing subtle canopy spectral features for accurate regression [14]. Meanwhile, recurrent neural networks (RNNs) or long short-term memory (LSTM) networks are frequently employed to capture the temporal dependencies of crop growth [15,16]. To further enhance the processing of complex spatiotemporal sequences, recent advances have introduced lightweight tensor attention into ConvLSTM architectures, significantly reducing model parameters while preserving structural features [17]. Yang J. et al. proposed a hierarchical deep learning approach combining LSTM and Transformer modules to extract high-frequency temporal features from daily Meteorological and vegetation indices [18]. Moreover, for highly complex multi-source data, advanced architectures like global clue-guided cross-memory Transformer networks have been proposed to fully exploit multi-modal heterogeneity and extract highly discriminative fusion features [19]. Although studies indicate that deep learning exhibits exceptional performance in data-rich, large-scale scenarios, these models typically require massive training datasets to fully leverage their representation learning capabilities [20]. 
To address the challenge of limited sample sizes in high-dimensional remote sensing tasks, researchers have actively explored various strategies, such as employing tensor decomposition-based relaxed linear regression to enhance model robustness and generalization under small training sets [21]. However, given that crop yield data collection is constrained by harvest cycles and incurs high costs, acquiring sufficient labeled samples is frequently difficult. Consequently, in real-world scenarios characterized by limited sample sizes—such as under extreme climate stress—the over-reliance of deep learning models on large-scale datasets poses a significant challenge.

Traditional machine learning methods, such as Random Forest (RF), Support Vector Machines (SVM), and Extreme Gradient Boosting (XGBoost), can effectively capture the nonlinear and hierarchical relationships between spectral features and response variables such as crop yield. These methods have consistently demonstrated significant advantages across diverse crops and geographical regions. For instance, Khosravani [22] evaluated machine learning models, including XGBoost and RF, using multi-source remote sensing data, and found that XGBoost achieved the highest prediction accuracy for winter wheat yield. Bouras et al. compared SVM, RF, and XGBoost for early cereal yield prediction in Morocco [23], concluding that the XGBoost algorithm outperformed the alternative models. Dhaliwal et al. utilized long-term climate and soil data to predict cotton yield and discovered that RF outperformed both Levenberg–Marquardt neural network (LM-NN) and multilayer perceptron neural network (MLP-NN) models [24]. In scenarios with limited sample sizes, XGBoost can achieve yield prediction performance comparable to or even better than deep learning models by utilizing a feature-based representation of remote sensing data, while offering higher efficiency and explainability [25]. Furthermore, its high computational efficiency and ease of hyperparameter tuning make it highly suitable for regional-scale operational applications [26].

Despite the advancements in predictive accuracy achieved by traditional machine learning and deep learning models, their inherent “black-box” nature severely restricts the understanding of yield formation prior knowledge and the formulation of agricultural management decisions. Existing studies have incorporated interpretable methods, such as SHAP (SHapley Additive exPlanations), to intuitively quantify the specific marginal contributions of individual input features (e.g., accumulated growing degree days, precipitation, and vegetation anomaly indices) to the prediction outcomes [27]. This integration endows machine learning models with high agronomic interpretability, effectively mitigating the “black-box” dilemma commonly associated with pure deep learning models. For instance, Li et al. utilized SHAP analysis to demonstrate that the Enhanced Vegetation Index (EVI) during the soybean pod-setting and corn milk stages plays a crucial role in explaining interannual yield variability in the Midwestern United States [28]. Similarly, Zhou et al. employed SHAP to analyze the environmental factors driving spatial discrepancies in wheat yield across the European Union [29]. Xia et al. leveraged SHAP to identify the most influential indicators and key Phenological windows affecting yield formation under extreme weather conditions [30]. However, current research primarily focuses on enhancing model predictive performance, often lacking a systematic analysis of how input features influence yield, the relative contributions of various driving factors, and the underlying agronomic mechanisms [31]. Moreover, the majority of studies remain confined to global feature ranking, failing to systematically dissect the yield-driving factors across different temporal, spatial, and process dimensions. 
This lack of a comprehensive analytical framework makes it difficult to translate yield estimation results into actionable agronomic insights and management recommendations, thereby limiting the practical utility of these models in precision agriculture decision-making.

To address these challenges, there is an urgent need to construct a prior knowledge-guided, data-driven yield estimation and interpretation framework capable of handling small-to-medium sample scenarios at the regional scale (Figure 1). Focusing on Youyi Farm in the Black Soil Region of Northeast China, this study utilizes Sentinel-2 time-series data from 2016 to 2025 to develop such a framework. The specific objectives are:

To systematically characterize yield driving factors across four dimensions—Meteorological Background, Eco-physiological Process, Phenological Information, and Soil Conditions—using remote sensing data to construct the Yield-Formation Key Dataset (YFKD);
To select optimal features for constructing feature sets, systematically compare the performance of Multiple Linear Regression (MLR), Random Forest (RF), and XGBoost models, and progressively evaluate the accuracy improvements driven by different feature groups within the YFKD;
To employ SHAP analysis to comprehensively interpret the yield estimation results of the optimal model, and quantify the global relative contribution of the four dimensions: Meteorological Background, Eco-physiological Process, Phenological Information, and Soil Conditions;
To reveal critical periods affecting yield, identify spatial yield-limiting factors, and systematically elucidate the factors underlying regional spatiotemporal yield variability.

This study introduces three core innovations: (1) a prior knowledge-driven feature construction logic for YFKD, aligning remote sensing features with input parameters of mechanistic models; (2) implementation of SHAP analysis at a dekadal (10-day) time scale to identify critical Phenological windows for yield formation; and (3) a pixel-level, multi-scale SHAP spatial partitioning method to visualize major yield-limiting factors in the study area. Consequently, this research not only provides a robust method for high-precision regional yield estimation but also transforms black-box prediction results into agricultural guidance, offering solid scientific support for precision agriculture management at specific locations.

2. Materials and Methods

2.1. Study Area

Youyi Farm is located in the Sanjiang Plain of the Black Soil Region in Northeast China (45°–50°N, 132°–137°E) (Figure 2a). Covering approximately 133,000 hectares, it operates as a large-scale, modern, mechanized agricultural enterprise. The terrain exhibits a transitional trend from mountains to plains, featuring diverse geomorphology (Figure 2b). The western region consists of undulating low hills dominated by Planosols (Albic soils); the central rolling plains are characterized by fertile Phaeozems (Black soils) and Meadow soils; and the eastern region is flat and low-lying, with an extensive distribution of Bog soils [32]. This variation results in significant spatial heterogeneity in Soil Conditions (Figure 2e).

The local Meteorological Background is defined by a temperate continental monsoon climate, featuring cold winters and warm, humid summers, with precipitation concentrated between June and August. Regarding management, the farm practices intensive large-scale operations with strict control over the entire production process. Notably, the long-standing practice of autumn straw return combined with deep plowing creates an ideal temporal window for remote sensing observations of bare soil properties (Figure 2f). Consequently, the significant spatial heterogeneity in topography, Soil Conditions, and microclimates renders Youyi Farm an ideal experimental site for analyzing spatiotemporal yield characteristics and conducting regional-scale remote sensing yield estimation [33].

2.2. Data Sources and Preprocessing

2.2.1. Optical Remote Sensing Data

Remote sensing data acquisition and processing were conducted via the Google Earth Engine (GEE) platform [34] (https://earthengine.google.com/). This study utilized Sentinel-2 Level-2A Surface Reflectance products covering the period from 2016 to 2025, which had undergone official atmospheric and geometric corrections by the European Space Agency (ESA). Clouds and cloud shadows were masked using the QA60 band. Between 2016 and 2025, a total of 4259 Sentinel-2 scenes covered the study area, with the alternating Sentinel-2A and 2B satellites providing complete coverage every 2 to 5 days. Specifically, from 2019 to 2025, the annual scene count ranged consistently between 578 and 666. This high observation frequency guarantees an abundance of valid pixels after cloud masking, thereby preventing temporal data gaps from biasing the subsequent dekadal synthesis. To construct a unified spatiotemporal dataset, all bands were resampled to 10 m resolution and reprojected to a consistent coordinate system alongside other multi-source spatial data, ensuring precise spatial alignment.

2.2.2. Vegetation Index Calculation

The crop growing season was defined as 1 June to 31 October annually, comprising 15 dekads (10-day periods, Table 1). Sentinel-2 multispectral images acquired during the 2016–2025 growing seasons were first processed by masking clouds and cloud shadows, followed by the calculation of the Normalized Difference Vegetation Index (NDVI). To construct a high-quality, continuous vegetation index time series, the Maximum Value Composite (MVC) method was applied within each dekad to generate a 15-point time series per season [35]. This effectively filters out contaminated NDVI values caused by residual clouds and aerosols, thereby selecting the accurate vegetation signal for each 10-day window. Accordingly, the sequential dekads across the growing season are denoted by the index $k$ ($k = 1, 2, \dots, 15$).
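The dekadal compositing step can be sketched as follows. This is a minimal illustration, not the authors' GEE workflow: the helper names `dekad_index` and `max_value_composite` are hypothetical, and splitting each month into days 1–10, 11–20, and 21–end is an assumption (Table 1 defines the actual dekads).

```python
import datetime as dt
import math


def dekad_index(day: dt.date) -> int:
    """Map a date in 1 June-31 October to a dekad index k in 1..15.

    Assumption: each month is split into days 1-10, 11-20 and 21-end.
    """
    if not 6 <= day.month <= 10:
        raise ValueError("date lies outside the 1 June-31 October season")
    part = min((day.day - 1) // 10, 2)  # 0, 1 or 2 within the month
    return (day.month - 6) * 3 + part + 1


def max_value_composite(dates, ndvi):
    """Return the 15 per-dekad NDVI maxima (NaN where no valid observation)."""
    composite = [math.nan] * 15
    for day, value in zip(dates, ndvi):
        if math.isnan(value):  # cloud-masked observation
            continue
        k = dekad_index(day) - 1
        if math.isnan(composite[k]) or value > composite[k]:
            composite[k] = value
    return composite
```

Dekads with no cloud-free observation stay NaN in this sketch, which is the kind of gap that the subsequent smoothing step has to handle.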

To address data gaps at specific time nodes caused by consecutive cloudy days and to reconstruct a continuous crop growth curve, a Savitzky–Golay (S-G) filter was subsequently applied to these 15 dekadal points [36]. Based on experimental testing, the S-G parameters were fixed across all years with a window size of 5 and a polynomial order of 2. In the context of a 15-dekad growing season, a 5-dekad (50-day) window ensures adequate coverage over data gaps caused by continuous rainy weather, while the second-order polynomial fitting effectively preserves interannual Phenological features without flattening the overall trajectory. This combined MVC and S-G approach ensures both signal purity and Phenological fidelity, providing a robust foundation for subsequent time-series analyses.
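A Savitzky–Golay filter is a local least-squares polynomial fit, so the smoothing step can be sketched with NumPy alone. The source states only the window size (5) and polynomial order (2); the linear interpolation used here to fill cloud gaps before filtering, the reuse of the first or last full window at the series ends, and the function names are all assumptions made for illustration.

```python
import numpy as np


def fill_gaps(series):
    """Linearly interpolate NaN gaps (assumed gap-filling step)."""
    y = np.asarray(series, dtype=float)
    k = np.arange(y.size)
    valid = ~np.isnan(y)
    return np.interp(k, k[valid], y[valid])


def savitzky_golay(series, window=5, order=2):
    """Smooth by fitting an order-`order` polynomial in each sliding window."""
    y = fill_gaps(series)
    n, half = y.size, window // 2
    smoothed = np.empty(n)
    for i in range(n):
        # Centre the window on i; near the ends reuse the first/last full window.
        start = min(max(i - half, 0), n - window)
        x = np.arange(start, start + window)
        coeffs = np.polyfit(x, y[start:start + window], order)
        smoothed[i] = np.polyval(coeffs, i)
    return smoothed
```

Because the fit is quadratic, a seasonal curve that is locally parabolic passes through unchanged, which is consistent with the stated aim of preserving the shape of the growth trajectory.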

2.2.3. Yield Data

Yield data were collected via field sampling at Youyi Farm in 2025, comprising 233 maize sample points. For each point, GPS coordinates and yield (converted to standard moisture dry weight) were recorded (Figure 2c). Descriptive statistics indicate an approximately normal distribution with slight right skewness (0.45) and a kurtosis of 3.93. The mean yield was 9900.45 kg/ha (range: 1321.16–25,966.5 kg/ha), with a standard deviation of 4399.05 kg/ha and a coefficient of variation (CV) of 0.44, reflecting substantial data dispersion.

Yields were stratified into five classes: Low Yield (<6000 kg/ha), Below Average Yield (6000–9000 kg/ha), Average Yield (9000–12,000 kg/ha), Above Average Yield (12,000–15,000 kg/ha), and High Yield (>15,000 kg/ha). Sample points were spatially co-registered with remote sensing and topographic rasters to facilitate model training and independent validation.

2.2.4. Crop Classification Data and Field Boundaries

Vector data regarding maize planting distribution (2016–2025) were acquired from the Youyi Farm management department. A pixel-scale spatiotemporal overlay analysis was performed on these annual maps to delineate the statistically significant maize cultivation range. By accumulating planting occurrences within each raster cell over the decade, a maize planting frequency distribution map was generated (Figure 2d).

Frequencies ranged from 1 to 8. High-frequency zones (deep blue), primarily concentrated in the central and southwestern regions, indicate stable, long-term, continuous mono-cropping. Conversely, lighter-colored regions in the north and northeast reflect lower frequencies, attributed to crop rotation systems (e.g., maize–soybean) or rice cultivation. This frequency map served as a critical spatial mask for the subsequent extraction of Interannual NDVI Statistics and Intra-annual NDVI. For each pixel, only NDVI values from years in which maize was planted were retained; values from non-maize years were excluded to prevent data contamination.
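The frequency accumulation and year-wise masking described above can be sketched as follows. The array layouts (years first, then dekads, rows, and columns) and the function names are assumptions for illustration; the source performs this overlay on rasterized annual planting maps.

```python
import numpy as np


def planting_frequency(maize_masks):
    """Count, per pixel, the years in which maize was planted.

    maize_masks: boolean array of shape (years, rows, cols).
    """
    return np.asarray(maize_masks, dtype=bool).sum(axis=0)


def mask_non_maize(ndvi, maize_masks):
    """Set NDVI to NaN wherever the pixel was not maize in that year.

    ndvi: float array of shape (years, dekads, rows, cols).
    """
    masks = np.asarray(maize_masks, dtype=bool)[:, None, :, :]  # broadcast over dekads
    return np.where(masks, ndvi, np.nan)
```

Marking excluded years as NaN lets later per-pixel statistics skip them without changing the array shape.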

2.3. Methodology

2.3.1. Construction of the Yield-Formation Key Dataset (YFKD)

Drawing upon the input requirements of process-based crop models (e.g., DSSAT and APSIM), we constructed the Yield-Formation Key Dataset (YFKD)—a prior knowledge-guided spatiotemporal feature set (Table 2). This framework integrates yield formation theory to characterize four critical dimensions via remote sensing: Meteorological Background, Eco-physiological Process, Phenological Information, and Soil Conditions. This feature construction logic based on prior knowledge not only ensures interpretability in model-building processes but also demonstrates superior performance compared to feature aggregation methods [37]. (Note: while the farm implements unified macroscopic agricultural schedules, micro-level management variations inevitably exist; however, due to the unavailability of fine-grained spatial records, management factors were not explicitly included as input variables).

$$ \mathrm{YFKD} = \{M, E, P, S\} \tag{1} $$

where $M$, $E$, $P$, and $S$ represent the Meteorological Background, Eco-physiological Process, Phenological Information, and Soil Conditions, respectively.

1. Meteorological Background ($M$)

Interannual NDVI Statistics were employed to characterize fine-scale Meteorological variations at the regional scale. Meteorological Background conditions (precipitation, temperature, and radiation) macro-regulate canopy growth and photosynthetic productivity; their comprehensive effects are effectively captured by interannual vegetation index variations [38]. Compared to interpolated data from sparse Meteorological stations, high-resolution vegetation indices offer superior granularity in depicting crop growth heterogeneity driven by farm-scale microclimates. Using the Maximum Value Composite (MVC) and filtered dataset from the 2016–2024 growing seasons (1 June–31 October), 135 NDVI scenes were acquired. (Note: The target year 2025 was excluded to prevent collinearity in subsequent modeling). These scenes were aggregated by dekad to calculate multi-year statistics—mean, standard deviation, maximum, minimum, and median—yielding 75 Interannual NDVI Statistics. These indices serve as spatiotemporal features representing crop growth responses to historical Meteorological Background conditions.

Mean: represents the long-term average, serving as the crop growth baseline under normal Meteorological Background conditions.

$$ \mathrm{NDVIH}_{\mathrm{MEAN}}(k) = \frac{1}{N} \sum_{y=1}^{N} \mathrm{NDVIH}(y, k) \tag{2} $$

Median: represents the long-term typical level; unlike the mean, it effectively excludes the influence of years with potential anomalies.

$$ \mathrm{NDVIH}_{\mathrm{MED}}(k) = \operatorname{median}_{y = 1, \dots, N} \, \mathrm{NDVIH}(y, k) \tag{3} $$

Maximum: represents the upper limit of vegetation growth potential, indicating the optimal state achievable under ideal Meteorological Background conditions.

$$ \mathrm{NDVIH}_{\mathrm{MAX}}(k) = \max_{y = 1, \dots, N} \mathrm{NDVIH}(y, k) \tag{4} $$

Minimum: represents the lower limit of vegetation growth, reflecting performance following historical extreme stress events.

$$ \mathrm{NDVIH}_{\mathrm{MIN}}(k) = \min_{y = 1, \dots, N} \mathrm{NDVIH}(y, k) \tag{5} $$

Standard Deviation: represents the stability of interannual growth status, serving as an indicator of sensitivity to climate variations [39].

$$ \mathrm{NDVIH}_{\mathrm{SD}}(k) = \sqrt{\frac{1}{N} \sum_{y=1}^{N} \left( \mathrm{NDVIH}(y, k) - \mathrm{NDVIH}_{\mathrm{MEAN}}(k) \right)^{2}} \tag{6} $$
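Equations (2)–(6) amount to five reductions over the year axis for each dekad. A minimal per-pixel sketch is given below; the function name and the convention of marking skipped years as NaN are assumptions, while `ddof=0` matches the 1/N form of Equation (6).

```python
import numpy as np


def interannual_statistics(ndvih):
    """Per-dekad multi-year statistics following Equations (2)-(6).

    ndvih: array of shape (N years, 15 dekads) for one pixel;
           NaN marks years to skip (e.g. non-maize years).
    """
    ndvih = np.asarray(ndvih, dtype=float)
    return {
        "MEAN": np.nanmean(ndvih, axis=0),
        "MED": np.nanmedian(ndvih, axis=0),
        "MAX": np.nanmax(ndvih, axis=0),
        "MIN": np.nanmin(ndvih, axis=0),
        "SD": np.nanstd(ndvih, axis=0, ddof=0),  # 1/N form of Equation (6)
    }
```

With 15 dekads and five statistics, the dictionary holds 5 × 15 = 75 values per pixel, matching the 75 Interannual NDVI Statistics reported above.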

2. Eco-physiological Process ($E$)

The Eco-physiological Process (target year 2025) was characterized using Intra-annual NDVI and Intra-annual NDVI Anomaly. Remote sensing indices enable continuous monitoring of vegetation greenness, canopy structure, and phenology, effectively characterizing key processes such as biomass accumulation and photosynthetic productivity [40].

The 2025 growing season MVC NDVI dataset served as the Intra-annual NDVI. To quantify crop growth stress within the Eco-physiological Process, the Intra-annual NDVI Anomaly was calculated by dividing the current dekadal NDVI by the corresponding historical multi-year mean [41]. This metric reflects stress induced by extreme Meteorological Background events, physiological damage, or management issues [42,43]. Consequently, values between 0 and 1 indicate that the pixel’s NDVI is below the historical baseline, while values exceeding 1 indicate above-average performance.

$$ \mathrm{NDVIC} = \{\mathrm{NDVIC}(1), \mathrm{NDVIC}(2), \dots, \mathrm{NDVIC}(15)\} \tag{7} $$

$$ \mathrm{NDVIA}(k) = \frac{\mathrm{NDVIC}(k)}{\mathrm{NDVIH}_{\mathrm{MEAN}}(k)} \tag{8} $$
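The ratio in Equation (8) is a single element-wise division. The sketch below adds one guard that the source does not specify: where the historical mean is zero or missing, the anomaly is left as NaN rather than dividing (an assumption, as is the function name).

```python
import numpy as np


def ndvi_anomaly(ndvic, ndvih_mean):
    """Ratio anomaly NDVIA(k) = NDVIC(k) / NDVIH_MEAN(k), Equation (8)."""
    ndvic = np.asarray(ndvic, dtype=float)
    ndvih_mean = np.asarray(ndvih_mean, dtype=float)
    out = np.full(ndvic.shape, np.nan)
    # Assumption: leave NaN where the baseline is zero or missing.
    np.divide(ndvic, ndvih_mean, out=out, where=ndvih_mean > 0)
    return out
```

For example, a current NDVI of 0.4 against a historical mean of 0.5 gives 0.8, that is, a dekad performing below its long-term baseline.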

3. Phenological Information ($P$)

Phenological Information was characterized using Intra-annual NDVI Change Rates derived from the time series. The ascent rate indicates vegetation vigor (greening speed), whereas the descent rate reflects the pace of crop maturation and senescence [44].

Ascending Rate ($\mathrm{NDVICR}_{up}$): This metric calculates the average slope from the start of the growing season ($k = 1$) to the peak growth stage ($k = k_{max}$).

$$ \mathrm{NDVICR}_{up} = \frac{\mathrm{NDVIC}(k_{max}) - \mathrm{NDVIC}(1)}{k_{max} - 1} \tag{9} $$

Descending Rate ($\mathrm{NDVICR}_{down}$): This metric calculates the average slope from the peak growth stage ($k = k_{max}$) to the end of the growing season ($k = 15$).

$$ \mathrm{NDVICR}_{down} = \frac{\mathrm{NDVIC}(k_{max}) - \mathrm{NDVIC}(15)}{15 - k_{max}} \tag{10} $$

where $k_{max}$ denotes the specific dekad index at which the Intra-annual NDVI reaches its maximum value ($\mathrm{NDVIC}(k_{max})$) for a given pixel during the growing season. $\mathrm{NDVIC}(1)$ and $\mathrm{NDVIC}(15)$ represent the NDVI observations at the initial baseline (early June) and the terminal stage (late October) of the growing season, respectively.
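Equations (9) and (10) can be sketched for a single pixel as below. The source does not say what happens when the seasonal peak falls on the first or last dekad, where one denominator becomes zero; returning NaN in that case is an assumption, as is the function name.

```python
import numpy as np


def ndvi_change_rates(ndvic):
    """Return (NDVICR_up, NDVICR_down) per Equations (9) and (10).

    ndvic: 15 smoothed dekadal NDVI values, NDVIC(1)..NDVIC(15).
    """
    ndvic = np.asarray(ndvic, dtype=float)
    k_max = int(np.argmax(ndvic)) + 1  # 1-based dekad of the seasonal peak
    n = ndvic.size
    peak = ndvic[k_max - 1]
    # Assumption: a rate is undefined (NaN) when the peak sits at a season boundary.
    up = (peak - ndvic[0]) / (k_max - 1) if k_max > 1 else np.nan
    down = (peak - ndvic[-1]) / (n - k_max) if k_max < n else np.nan
    return up, down
```

Both rates are non-negative by construction, since the peak is at least as large as either end point; the descending rate therefore measures the magnitude of the decline, not a signed slope.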

4. Soil Conditions ($S$)

Soil Conditions were characterized using Soil Spectral Reflectance. Soil properties, such as moisture, texture, and organic matter, are critical inputs in mechanistic models, influencing water–nutrient supply capacity and yield potential. Given the significant correlation between spectral reflectance during the bare soil period and soil properties (e.g., organic matter), as validated in numerous studies, these data were used to construct soil environmental indicators [45].

To extract these features, the ‘bare soil period’ was explicitly defined as the temporal window from 15 April to 20 May, occurring strictly after spring snowmelt and prior to crop emergence. Based on the annual maize planting range, a pixel-level median composite was performed on multi-year bare soil imagery to generate a single multispectral scene representing the spatiotemporal characteristics of Soil Conditions. This scene includes 13 bands (visible, red-edge, near-infrared, and short-wave infrared). The multi-year median is robust to transient factors, such as temporary low reflectance from recent spring precipitation or anomalous brightening from late ephemeral snow and residual stubble; it therefore suppresses outliers and more accurately reflects the spatial patterns of long-term Soil Conditions [46].

To further mitigate the effects of atmospheric conditions, moisture differences, and topographic relief while accentuating soil spectral morphological features, pixel-level vector normalization was applied to the median composite imagery [47].

$$ \mathrm{SOILR}_{\mathrm{NORM}}(b) = \frac{\mathrm{SOILR}_{\mathrm{MED}}(b)}{\sqrt{\sum_{j=1}^{13} \left( \mathrm{SOILR}_{\mathrm{MED}}(j) \right)^{2}}} \tag{11} $$

where $b$ ($b = 1, 2, \dots, 13$) represents the spectral bands of Sentinel-2 and $j$ is the summation index running over the same 13 bands. The resulting normalized bare soil multispectral image provides stable, high-dimensional soil environmental features, supporting subsequent crop growth monitoring and yield anomaly analysis.
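Equation (11) divides each pixel spectrum by its Euclidean (L2) norm. A minimal sketch, assuming the bands sit on the last array axis and that all-zero spectra should become NaN (both assumptions, as is the function name):

```python
import numpy as np


def vector_normalize(soilr_med):
    """Divide each pixel spectrum by its L2 norm, Equation (11).

    soilr_med: array of shape (..., 13), bands on the last axis.
    """
    spectra = np.asarray(soilr_med, dtype=float)
    norm = np.linalg.norm(spectra, axis=-1, keepdims=True)
    norm[norm == 0] = np.nan  # assumption: all-zero spectra become NaN
    return spectra / norm
```

Because a uniform brightness factor cancels in the ratio, two pixels whose spectra differ only by overall brightness map to the same normalized vector, which is why this step emphasizes spectral shape over magnitude.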

2.3.2. Feature Selection and Predictor Set Construction

Using crop yield as the dependent variable, we categorized 120 features into groups: Interannual NDVI Statistics, Intra-annual NDVI, Intra-annual NDVI Anomaly, Intra-annual NDVI Change Rates, and Soil Spectral Reflectance. The Boruta algorithm was employed for optimal feature screening to construct a multi-level predictor set (Table 2). Boruta is a feature selection technique built upon the Random Forest framework, designed to identify all features with statistically significant importance relative to the dependent variable. Its core mechanism involves creating “shadow features” by randomly shuffling the values of original features to establish a contrastive baseline [48].

The algorithm trains a Random Forest model and calculates the importance score for each feature. A feature is deemed “important” and retained only if its importance score is significantly higher than that of the best-performing shadow feature. By comparing against a random noise baseline, Boruta demonstrates exceptional stability and robustness when handling high-dimensional data, complex non-linear relationships, and multi-collinearity [49].

To ensure an accurate assessment of model robustness, the Boruta screening process (executed for 150 iterations) was conducted exclusively on the training subset. The test subset (30% of the samples, from the 7:3 split) remained entirely independent and unseen during all feature selection and hyperparameter tuning phases, preventing data leakage. After sensitivity testing and iterative tuning, key parameters were set as follows: the importance threshold (percent) was set to 97 to enforce strict statistical standards, and the maximum depth (max_depth) of the underlying Random Forest was limited to 4. Depths greater than 4 induced severe overfitting by capturing sample-specific noise, whereas depths below 4 resulted in underfitting and insufficient predictive accuracy. These parameter settings remained consistent across all feature screening and subsequent model training phases.
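The shadow-feature comparison at the core of Boruta can be illustrated with a deliberately simplified sketch. Two substitutions are made and should be read as assumptions: the Random Forest importance is replaced by the absolute Pearson correlation with the target, and the statistical test that Boruta applies to the hit counts is omitted, so the function only counts how often each real feature beats the chosen percentile of the shadow importances. The names `abs_correlation`, `shadow_feature_hits`, `n_iter`, and `perc` are hypothetical.

```python
import numpy as np


def abs_correlation(X, y):
    """Stand-in importance: |Pearson r| between each column of X and y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    return np.abs(Xc.T @ yc) / np.where(denom > 0, denom, np.inf)


def shadow_feature_hits(X, y, n_iter=150, perc=97, seed=0):
    """Count, per feature, the iterations in which it beats the shadow threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    real = abs_correlation(X, y)
    hits = np.zeros(X.shape[1], dtype=int)
    for _ in range(n_iter):
        # Shadow features: each column shuffled independently, breaking any link to y.
        shadows = np.column_stack([rng.permutation(col) for col in X.T])
        threshold = np.percentile(abs_correlation(shadows, y), perc)
        hits += real > threshold
    return hits
```

To respect the leakage constraint described above, such a routine would be called on the training rows only. Features whose hit count is close to `n_iter` are the candidates a full Boruta run would tend to confirm.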

2.3.3. Experimental Design and Model Construction

A stratified random sampling strategy was employed to partition the dataset, with 70% of the samples allocated for training and the remaining 30% for testing [50]. This approach ensures a consistent distribution of yield classes across both subsets, effectively preventing sample bias and enhancing model stability and robustness across different yield levels. As shown in Figure 3a, the exact distribution of sample points across different yield levels (low, moderate-low, moderate, moderate-high, and high yields) is strictly maintained between the training and test datasets. Furthermore, the continuous yield distribution profiles of the two subsets align perfectly (Figure 3b). This rigorous stratification ensures accurate model evaluation while capturing the full spectrum of spatial yield variability.
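A stratified 70/30 split by yield class can be sketched as follows. The class edges come from the five yield classes defined in Section 2.2.3; the assignment of values lying exactly on an edge to the upper class, the per-class rounding, and the function name are assumptions.

```python
import numpy as np

CLASS_EDGES = [6000, 9000, 12000, 15000]  # kg/ha, from the five yield classes


def stratified_split(yields, test_ratio=0.3, seed=0):
    """Return (train_idx, test_idx) with each yield class split ~70/30."""
    yields = np.asarray(yields, dtype=float)
    classes = np.digitize(yields, CLASS_EDGES)  # 0..4; boundary handling assumed
    rng = np.random.default_rng(seed)
    test_parts = []
    for c in np.unique(classes):
        members = rng.permutation(np.flatnonzero(classes == c))
        n_test = int(round(test_ratio * members.size))
        test_parts.append(members[:n_test])
    test_idx = np.sort(np.concatenate(test_parts))
    train_idx = np.setdiff1d(np.arange(yields.size), test_idx)
    return train_idx, test_idx
```

Sampling within each class keeps the class proportions of the two subsets close to each other, up to rounding in small classes.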

To evaluate the predictive value of the variables, six distinct feature combination schemes were developed (Table 3). Intra-annual NDVI was treated as a fundamental input, as it directly characterizes current-season crop growth variations—a critical factor for explaining yield fluctuations. Building upon this, Interannual NDVI Statistics, Meteorological Background, and Soil Conditions were individually integrated to reflect the long-term environmental constraints (e.g., light, temperature, water, and heat) on yield formation. Finally, all five feature groups were fused to form an all-factor input set.

While maintaining consistent training–test splits, the different predictor sets were input into traditional regression and machine learning (ML) models for comparative analysis. This allows for a systematic evaluation of how model architectures and variable sets influence prediction accuracy. The models utilized include: Multiple Linear Regression (MLR), Random Forest (RF), and Extreme Gradient Boosting (XGBoost).

Multiple Linear Regression (MLR): As a classical statistical method, MLR identifies and constructs linear dependencies between predictors and yield. To ensure robustness and eliminate multi-collinearity, predictors were introduced stepwise into the regression equation based on their contribution to the residual sum of squares. Variables were retained only if they passed the Variance Inflation Factor threshold and the F-test for statistical significance.

Random Forest Regression (RF) is a non-parametric ensemble model that aggregates multiple decision trees to capture non-linear relationships while maintaining robustness against outliers and collinearity [51]. In this study, we replaced traditional grid search with the Optuna framework—a Bayesian optimization approach—to efficiently identify global optima. To prevent overfitting and ensure robustness, we applied a strict regularization strategy: limiting the maximum tree depth to 4 and requiring 5–15 samples per leaf node. The final model was constructed based on the optimal parameters identified after 300 iterations, searching across a forest size of 100 to 500 trees.

Extreme Gradient Boosting (XGBoost) is a sophisticated gradient-boosted decision tree framework that optimizes a second-order Taylor expansion of the loss function, offering significant advantages in capturing complex non-linearities [52]. Unlike traditional ensemble methods, XGBoost incorporates a built-in regularization term (including L1 and L2 penalties) directly into the objective function to penalize model complexity and mitigate overfitting.

To fine-tune its high-dimensional parameter space, we integrated the Optuna framework to optimize critical hyperparameters, including the learning rate, subsampling ratio, and column sampling per tree. We maintained a maximum tree depth of 4 to align with the RF model for comparative consistency. A dynamic iteration strategy was implemented with an upper limit of 3000 trees, coupled with an Early Stopping mechanism: training was automatically terminated if the validation error stagnated for 50 consecutive rounds. This approach ensures the model converges at the optimal generalization point, effectively balancing computational efficiency with high-precision predictive performance in regional-scale applications.
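The stopping rule described above (at most 3000 boosting rounds, patience of 50 rounds) can be expressed independently of any boosting library. The following is a minimal sketch on a synthetic validation curve; the function name and the curve are hypothetical, and the study's actual training loop is not shown in the text.

```python
def best_round_with_early_stopping(val_errors, patience=50, max_rounds=3000):
    """Return (best_round, rounds_run) for a sequence of validation errors.

    Training stops once the error has not improved for `patience`
    consecutive rounds, or when `max_rounds` is reached.
    """
    best_error = float("inf")
    best_round = -1
    rounds_run = 0
    for round_idx, error in enumerate(val_errors):
        if round_idx >= max_rounds:
            break
        rounds_run = round_idx + 1
        if error < best_error:
            best_error, best_round = error, round_idx
        elif round_idx - best_round >= patience:
            break
    return best_round, rounds_run


# Synthetic curve: error falls until round 120, then creeps upward.
curve = [abs(r - 120) * 0.01 + 1.0 for r in range(3000)]
best, ran = best_round_with_early_stopping(curve)
```

On this synthetic curve the loop stops 50 rounds after the minimum, so far fewer than 3000 trees are built and the model at the best round is the one retained.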

2.3.4. Model Evaluation Metrics

As described in Section 2.3.3, stratified random sampling partitioned the dataset into a training set (70%) and an independent test set (30%). To evaluate the robustness of the models, a 10-fold cross-validation strategy was implemented within the training set [53]. In each iteration, 90% of the training data was used for model development while the remaining 10% served as the internal validation set. The independent test set was not involved at any point in this process, which ensures an unbiased evaluation of model performance. Furthermore, to quantify the uncertainty of the model performance and ensure that the results were not overstated by a single random split, a non-parametric bootstrap evaluation was applied to the independent test set. We performed 1000 resampling iterations with replacement and calculated the 95% confidence intervals (95% CI) for the evaluation metrics.
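The bootstrap step can be sketched as a percentile interval over resampled test pairs. This is an assumption about the interval type (the text does not state whether a percentile or another bootstrap interval was used), and the observed and predicted values below are synthetic.

```python
import random


def r_squared(observed, predicted):
    """Coefficient of determination for paired observations."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def bootstrap_ci(observed, predicted, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI of `metric` over paired test samples."""
    rng = random.Random(seed)
    n = len(observed)
    stats = []
    for _ in range(n_boot):
        # Resample sample indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(metric([observed[i] for i in idx],
                            [predicted[i] for i in idx]))
    stats.sort()
    lower = stats[round((alpha / 2) * n_boot)]
    upper = stats[round((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic test set (kg/ha): predictions off by +/- 300 kg/ha.
observed = [4000.0 + 200.0 * i for i in range(30)]
predicted = [o + (300.0 if i % 2 == 0 else -300.0) for i, o in enumerate(observed)]

point = r_squared(observed, predicted)
ci_low, ci_high = bootstrap_ci(observed, predicted, r_squared)
```

The same helper can be reused for RMSE, MAE, or nRMSE by passing a different `metric` function.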

The model performance was quantitatively assessed using the coefficient of determination (R^{2}), root mean square error (RMSE), mean absolute error (MAE), and normalized root mean square error (nRMSE) [54]. Specifically, R^{2} measures the degree of consistency between predicted and observed values, RMSE reflects the overall level of prediction error, MAE characterizes the average absolute magnitude of errors, and nRMSE represents the error magnitude relative to the mean observed value.
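For reference, the four metrics can be computed as follows. This sketch uses their standard definitions, with nRMSE taken as RMSE divided by the mean observed value as stated above; the yield values are hypothetical.

```python
import math


def evaluate(observed, predicted):
    """Return R2, RMSE, MAE and nRMSE (RMSE divided by the observed mean)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": rmse,
        "MAE": sum(abs(r) for r in residuals) / n,
        "nRMSE": rmse / mean_obs,
    }


# Hypothetical yields in kg/ha.
obs = [4000.0, 6000.0, 8000.0, 10000.0]
pred = [4500.0, 5500.0, 8500.0, 9500.0]
scores = evaluate(obs, pred)
```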

2.3.5. Interpretation of Yield Drivers Using SHAP

After identifying the optimal prediction model, the SHAP (SHapley Additive exPlanations) method was introduced to provide an interpretable analysis of the model results. This approach quantitatively reveals the contribution factors of different features across various yield levels [55]. The interpretability analysis focuses on three dimensions: global contribution, temporal features, and spatial limiting factors [56].

1. Global Contribution Analysis

SHAP values were calculated for all samples. The mean absolute SHAP value of each feature was used as the global contribution indicator to rank features at the model level and identify key factors dominating yield prediction.

Samples were stratified into five levels: High Yield, Above Average Yield, Average Yield, Below Average Yield, and Low Yield. We statistically compared the distribution of SHAP values for the same feature across these classes. By analyzing the sign and magnitude of SHAP values, we explored the direction (positive or negative) and intensity of each feature’s contribution at different yield levels, thereby revealing the non-linear and stratified effects of the feature–yield relationship.
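Both summaries (the global ranking by mean absolute SHAP value and the class-wise comparison of signed SHAP values) reduce to simple aggregations over a samples-by-features SHAP matrix. The sketch below assumes such a matrix has already been computed; the feature names, class labels, and values are hypothetical.

```python
def global_ranking(shap_rows, feature_names):
    """Rank features by mean absolute SHAP value across all samples."""
    n = len(shap_rows)
    mean_abs = {
        name: sum(abs(row[j]) for row in shap_rows) / n
        for j, name in enumerate(feature_names)
    }
    return sorted(mean_abs.items(), key=lambda item: item[1], reverse=True)


def classwise_mean(shap_rows, feature_names, yield_classes):
    """Mean signed SHAP value of each feature within each yield class."""
    sums, counts = {}, {}
    for row, cls in zip(shap_rows, yield_classes):
        counts[cls] = counts.get(cls, 0) + 1
        acc = sums.setdefault(cls, [0.0] * len(feature_names))
        for j, value in enumerate(row):
            acc[j] += value
    return {
        cls: {name: sums[cls][j] / counts[cls]
              for j, name in enumerate(feature_names)}
        for cls in sums
    }


features = ["SOILR_4", "NDVIC_8"]  # hypothetical column names
shap_rows = [[-40.0, -10.0], [-20.0, 5.0], [60.0, 15.0], [80.0, 10.0]]
sample_classes = ["Low Yield", "Low Yield", "High Yield", "High Yield"]

ranking = global_ranking(shap_rows, features)
by_class = classwise_mean(shap_rows, features, sample_classes)
```

The sign of the class-wise mean shows whether a feature pushes predictions up or down within a yield level, while the global ranking ignores direction and reflects magnitude only.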

2. Temporal Feature Importance Analysis

SHAP analysis was performed on time-series variables, specifically focusing on Intra-annual NDVI and Intra-annual NDVI Change Rates. This analysis identifies the critical time windows that provide the maximum contribution to yield formation and reveals the dynamic influence of temporal indicators on final production.

3. Spatial Pattern Analysis of Limiting Factors

SHAP values for each variable were mapped at the pixel level to perform spatial attribution. This analysis specifically focuses on regions where features exert a negative constraint on yield. By visualizing the dominant negative contribution variables for each pixel, we identified the key limiting factors restricting regional yield potential and elucidated their spatial heterogeneity.

To quantify the spatial dominance of yield-limiting factors, we aggregated the pixel-level SHAP values into the four predefined agronomic dimensions of the YFKD (YFKD = {M, E, P, S}). For a given pixel, the net SHAP contribution of each dimension was calculated, and the dimension with the most negative net contribution was identified as the dominant limiting factor. The percentage of the constrained area for each factor was computed as follows:

S_{i,g} = \sum_{f \in g} \phi_{i,f}, \quad g \in YFKD

(12)

L_{i} = \arg\min_{g \in YFKD} S_{i,g}, \quad \text{subject to } \min_{g \in YFKD} S_{i,g} < 0

(13)

Percentage(g) = \frac{|\{ i \in P_{constrained} \mid L_{i} = g \}|}{|P_{constrained}|} \times 100\%

(14)

where S_{i,g} is the net SHAP contribution of dimension g (e.g., Soil Conditions) for pixel i; \phi_{i,f} represents the SHAP value of feature f within dimension g; and L_{i} denotes the dominant limiting factor for pixel i. A pixel is considered constrained (i.e., it belongs to P_{constrained}) only if its minimum net SHAP contribution across all four YFKD dimensions is negative. The final percentage for dimension g is the ratio of the number of pixels dominated by g to the total number of constrained pixels.
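Equations (12)–(14) translate directly into a few lines of code. The sketch below is one possible implementation under stated assumptions: the assignment of features to the dimensions M, E, P, and S and all SHAP values are hypothetical placeholders, and ties between dimensions are resolved by dictionary order.

```python
def dominant_limiting_factors(shap_by_pixel, groups):
    """Apply Eqs. (12)-(14) to per-pixel SHAP values.

    shap_by_pixel: list of dicts, feature name -> SHAP value for one pixel
    groups: dict, dimension name -> list of feature names in that dimension
    Returns (labels, percentages); labels[i] is None for unconstrained pixels.
    """
    labels = []
    for phi in shap_by_pixel:
        # Eq. (12): net SHAP contribution of each dimension.
        net = {g: sum(phi[f] for f in feats) for g, feats in groups.items()}
        # Eq. (13): most negative dimension, only if it is below zero.
        g_min = min(net, key=net.get)
        labels.append(g_min if net[g_min] < 0 else None)

    constrained = [g for g in labels if g is not None]
    # Eq. (14): share of constrained pixels dominated by each dimension.
    percentages = {
        g: 100.0 * constrained.count(g) / len(constrained) if constrained else 0.0
        for g in groups
    }
    return labels, percentages


# Hypothetical feature-to-dimension mapping and SHAP values for five pixels.
groups = {
    "M": ["NDVIH_MED_9"],
    "E": ["NDVIC_8", "NDVIA_8"],
    "P": ["NDVICR_down"],
    "S": ["SOILR_4"],
}
names = ["NDVIH_MED_9", "NDVIC_8", "NDVIA_8", "NDVICR_down", "SOILR_4"]
values = [
    [-5.0, 3.0, -1.0, -2.0, -10.0],
    [-8.0, 1.0, 1.0, 0.5, 4.0],
    [2.0, 3.0, 1.0, 0.5, 6.0],
    [1.0, -6.0, -1.0, -3.0, 2.0],
    [-1.0, 0.0, 0.0, -4.0, -9.0],
]
pixels = [dict(zip(names, row)) for row in values]
labels, percentages = dominant_limiting_factors(pixels, groups)
```

In this toy case the third pixel has no negative dimension and is therefore excluded from P_{constrained}, so the percentages are computed over the remaining four pixels.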

3. Results

3.1. Optimal Feature Selection Results

Through Boruta feature selection, 20 key variables were identified, including SOILR_{NORM}(2, 3, 4, 8, 12), NDVIC(2, 6, 7, 8, 9), NDVIH_{MED}(2, 9), NDVIA(6, 7, 8, 9, 10, 11), and NDVICR_{up,down}. Figure 4 illustrates the importance scores of individual variables and the cumulative importance share for each feature group. The importance scores of variables ranged from 97 (NDVIC(9)) to 198 (NDVICR_{down}), highlighting significant differences in feature weights for yield prediction. At the feature group level, the Intra-annual NDVI Anomaly (NDVIA) group accounted for the largest share of the cumulative importance score at 29.6%. This was followed by the Soil Spectral Reflectance (SOILR) group at 24.0% and the Intra-annual NDVI (NDVIC) group at 22.4%. At the individual level, although the cumulative score of the Intra-annual NDVI Change Rates group was relatively small, the Intra-annual NDVI descent rate (NDVICR_{down}) provided the highest importance score (198), significantly exceeding all other features. Other top-performing variables included the historical median during the seedling stage (NDVIH_{med2}, 173) and the anomaly value during the maturity stage (NDVIA_{11}, 171). Additionally, soil features such as SOILR_{4} (165) and SOILR_{12} (152) also demonstrated exceptionally high importance, ranking within the top five.

3.2. Model Performance Comparison

Based on the six predictor input combinations, Multiple Linear Regression (MLR), Random Forest (RF), and Extreme Gradient Boosting (XGBoost) models were constructed for maize yield prediction. The comparative prediction accuracy of these models on the independent test set is summarized in Table 3.

Overall, the machine learning models significantly outperformed the Multiple Linear Regression (MLR) model across all variable combinations, exhibiting higher coefficients of determination (R^{2}) and lower RMSE and MAE values. Specifically, the XGBoost model achieved the highest prediction accuracy in most scenarios, demonstrating superior non-linear modeling capability and robustness, particularly when utilizing the all-factor input set (Figure 5). Quantitative evaluation results on the test set indicate that using Intra-annual NDVI alone yielded relatively high estimation accuracy (R^{2} = 0.735, RMSE = 2096.08 kg/ha for XGBoost), underscoring its strong and direct explanatory power for yield.

Using the Intra-annual NDVI model as a baseline, the inclusion of Interannual NDVI Statistics, Intra-annual NDVI Anomaly, and Soil Conditions individually led to significant accuracy improvements of 12.37–14.5%, accompanied by a marked decrease in RMSE. Among these, the integration of Soil Conditions provided the most prominent enhancement, increasing R^{2} to 0.842 and reducing RMSE by 22.9% relative to the baseline. The improvements from Intra-annual NDVI Anomaly and Interannual NDVI Statistics followed, with R^{2} reaching 0.831 and 0.826, respectively. In contrast, incorporating Intra-annual NDVI Change Rates did not yield performance gains and instead resulted in a marginal decline in accuracy, because the change rates are affected by noise amplification and smoothing artifacts originating from some plots that were harvested early in 2025.

When all variables were integrated to construct the all-factor model, both prediction accuracy and stability reached their optima. The XGBoost model achieved an R^{2} of 0.865 (a 17.6% improvement over the Intra-annual NDVI baseline). The RMSE and MAE decreased to their lowest levels, 1492.12 kg/ha and 999.81 kg/ha, representing reductions of 28.8% (an absolute drop of 603.96 kg/ha) and 29.5%, respectively, compared to the baseline. In the model comparison, XGBoost outperformed the other models under the optimal input combination; its R^{2} (0.865) was higher than that of the Random Forest model (0.764) and far exceeded that of the MLR model (0.691). To quantify model stability and rule out the randomness of a single data split, a non-parametric bootstrap evaluation (n = 1000) was performed on the test set. The resulting 95% confidence interval (CI) for the XGBoost R^{2} was [0.791, 0.935], whose lower bound exceeded the upper bound of the MLR model’s 95% CI [0.496, 0.788]. This demonstrates optimal comprehensive predictive performance and statistically supports the value of constructing a multi-dimensional Yield-Formation Key Dataset (YFKD) for high-precision yield estimation.
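The RMSE reduction and the separation of the two confidence intervals can be checked directly from the values quoted above. The short script below is only an arithmetic illustration; the helper name is hypothetical.

```python
def relative_change(baseline, value):
    """Signed relative change of `value` with respect to `baseline`, in percent."""
    return 100.0 * (value - baseline) / baseline


# RMSE values quoted in the text (kg/ha).
rmse_baseline = 2096.08   # Intra-annual NDVI only, XGBoost
rmse_all = 1492.12        # all-factor input, XGBoost

absolute_drop = rmse_baseline - rmse_all
rmse_change = relative_change(rmse_baseline, rmse_all)

# Bootstrapped 95% CIs of R2 quoted in the text.
xgb_ci = (0.791, 0.935)
mlr_ci = (0.496, 0.788)
intervals_separated = xgb_ci[0] > mlr_ci[1]

print(f"absolute drop: {absolute_drop:.2f} kg/ha")
print(f"relative change: {rmse_change:.1f}%")
print(f"CIs non-overlapping: {intervals_separated}")
```

The script reproduces the 603.96 kg/ha absolute drop and the 28.8% relative reduction in RMSE, and confirms that the two intervals do not overlap.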

3.3. Spatial Patterns of Estimated Yield

Figure 6 illustrates the spatial mapping results of maize yield produced by MLR, RF, and XGBoost across different predictor combinations. Overall, all models exhibited a consistent macro-scale spatial distribution regardless of the variable combination: the northern and northeastern regions were identified as high-yield zones, with most plots showing robust production; the central region was a distinct low-yield zone with significant spatial clustering; and the southwestern region primarily consisted of low-to-medium yield zones with only sporadic high-yield plots. To quantitatively evaluate the spatial reliability of the generated yield maps, a zonal mean validation was performed across five yield levels using the 233 field samples. The results in Table 4 show that the XGBoost model accurately captured the mean productivity of each zone (e.g., Observed Low: 4357 kg/ha vs. Predicted: 4682 kg/ha), confirming the model’s robust spatial representation capability.
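A zonal mean validation of this kind amounts to comparing the mean observed and mean predicted yield within each yield level. The following sketch assumes each field sample carries a zone label; the sample values are hypothetical and do not reproduce Table 4.

```python
def zonal_means(observed, predicted, zones):
    """Mean observed and predicted yield for each yield zone."""
    acc = {}
    for obs, pred, zone in zip(observed, predicted, zones):
        total = acc.setdefault(zone, [0.0, 0.0, 0])
        total[0] += obs
        total[1] += pred
        total[2] += 1
    return {z: (s_obs / n, s_pred / n) for z, (s_obs, s_pred, n) in acc.items()}


# Hypothetical field samples (kg/ha) with their yield level.
obs_yield = [4000.0, 4700.0, 9000.0, 11000.0]
pred_yield = [4400.0, 4900.0, 9500.0, 10500.0]
zone_labels = ["Low", "Low", "High", "High"]

means = zonal_means(obs_yield, pred_yield, zone_labels)
```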

When utilizing only Intra-annual NDVI as a predictor, the spatial characterization varied significantly among the three models, all of which showed high levels of error. The inclusion of SOILR, NDVIC, NDVIH, NDVIA, and NDVICR markedly improved the spatial structure of the results, with the boundaries between high- and low-yield areas becoming clearer. Specifically, the integration of soil conditions enhanced the identification of central low-yielding areas by effectively delineating areas with low bare soil reflectivity. Similarly, the inclusion of Interannual NDVI Statistics led to more precise identification of high-yield zones in the northeast.

Under the all-factor fusion condition (SOILR_{NORM}(2, 3, 4, 8, 12), NDVIC(2, 6, 7, 8, 9), NDVIH_{MED}(2, 9), NDVIA(6, 7, 8, 9, 10, 11), and NDVICR_{up,down}), all three models achieved optimal spatial estimation results. The XGBoost model outperformed RF and MLR in mitigating “salt-and-pepper” noise and low-value anomalies, providing a superior characterization of spatial gradients and field-scale mapping. This performance highlights the model’s ability to capture complex non-linear interactions among high-dimensional features during the yield formation process. Quantitative evaluations (Table 3) confirmed that this configuration achieved the highest R^{2} and the lowest error metrics, further validating the efficacy of multi-source spatiotemporal feature fusion and non-linear modeling for regional maize yield estimation.

3.4. SHAP-Based Analysis of Yield Drivers

The distribution of SHAP values and the global importance ranking derived from the XGBoost model (Figure 7) reveal the specific contribution of different feature categories to maize yield prediction. The results indicate that the Soil Spectral Reflectance (SOILR) group provided the highest contribution among all predictors (Figure 7). Specifically, SOILR_{NORM}(4) exhibited the highest mean absolute SHAP value (70.0), significantly exceeding other variables, followed by SOILR_{NORM}(12) with a mean contribution of 57.9. NDVIC(8) (late August Intra-annual NDVI) ranked third, though its mean SHAP value dropped to 25.2. Other features, such as SOILR_{NORM}(3), SOILR_{NORM}(8), and various NDVIA indicators, showed mean absolute SHAP values ranging from 4.6 to 13.9, representing a relatively smaller contribution to the model.

The SHAP summary plots further demonstrate that the primary SOILR features possess a wide span of SHAP values across their range, with distinct positive and negative contribution intervals, indicating strong non-linear influence. In contrast, the SHAP values for Interannual NDVI Statistics are more concentrated and characterized by primarily positive contributions. Meanwhile, the SHAP values for Intra-annual NDVI and Intra-annual NDVI Anomaly are mostly clustered near zero, suggesting a lower magnitude of impact on the final prediction results.

Regarding the cumulative contribution by feature group (Figure 8b), Soil Spectral Reflectance dominated the model with a contribution share of 50.2%. This was followed by Intra-annual NDVI (20.5%) and Intra-annual NDVI Anomaly (15.5%), while the overall contributions of Intra-annual NDVI Change Rates (9.1%) and Interannual NDVI Statistics (4.7%) were relatively minor.

Different feature groups exhibited distinct response patterns across yield levels (Figure 8a). Soil Spectral Reflectance displayed a unique bi-directional regulatory effect: in the “Low Yield” category, it presented a significant negative SHAP value (approximately −33), whereas in the “High Yield” category its positive contribution surged to over 60. This indicates that Soil Conditions act as a predictive constraint in certain areas while serving as a productivity booster in others. The dual nature of Soil Conditions (both promoting and limiting yield) primarily depends on geographical variations in soil types across different spatial regions. Areas with higher soil fertility exhibit a promoting effect on yield, whereas regions with lower fertility exhibit a limiting effect. Both effects coexist at the regional scale. Furthermore, although the global share of Intra-annual NDVI Change Rates was low, this group exerted the largest negative SHAP value (approximately −35) among all feature groups in low-yield samples, even surpassing the negative contribution of Soil Spectral Reflectance.

Importantly, the aggregation of SHAP values into percentages serves strictly to quantify the relative differences in predictive contributions among feature groups. While the qualitative ranking of these contributions provides meaningful insights into the dominant drivers of the model, the exact quantitative ratios between these percentages lack direct physical meaning. Therefore, they should not be interpreted as strict proportional measures of causal impact on crop yield.

4. Discussion

4.1. Spatiotemporal Characterization Capability of the YFKD

Although direct meteorological observations, soil physicochemical parameters, and management data were not explicitly introduced, the remote-sensing-based Yield-Formation Key Dataset (YFKD) clearly delineates differences among yield classes across various growth stages and spectral features (Figure 9). This demonstrates that the dataset effectively characterizes the comprehensive effects of Soil Conditions, Meteorological Background, Phenological Information, and Eco-physiological Process on yield formation at the farm scale. Furthermore, Boruta screening results indicate that high-importance variables exist within every feature group, confirming the interpretability of NDVIC, NDVIA, NDVIH, NDVICR, and SOILR in representing these critical yield elements.

Significant variability was observed in Interannual NDVI Statistics across yield classes. High-yield samples consistently exhibited multi-year means and dekadal maxima superior to those of low-yield samples (Figure 9a), particularly during the vigorous growth period in August. This suggests that the Meteorological Background exerts a long-term constraining effect on yield formation, which is effectively captured by the interannual time series. Rather than relying on raw meteorological data, utilizing these interannual statistics provides a distinct advantage by directly capturing the crop’s holistic biological response to long-term climate variations [57].

On the intra-annual scale, the Intra-annual NDVI time series showed systematic differences throughout the growing season (Figure 9c). High-yield samples demonstrated a faster ascent in the early stage, sustained higher NDVI values in the mid-stage, and reached higher peaks. Conversely, low-yield samples generally exhibited lower levels throughout the season, with suppressed peaks and earlier senescence. These results indicate that intra-annual variations reflect extrinsic manifestations of changes in physiological indicators (changes in biomass and nitrogen, etc.) and developmental progress within the Eco-physiological Process [58].

Intra-annual NDVI Anomaly further revealed the impact of growth stress on yield (Figure 9d). Except for the early and late stages, low-yield samples consistently showed negative anomalies (below the multi-year average) during key growth phases, whereas high-yield samples remained near or above the average. This demonstrates that NDVIA indicators effectively capture deviations in the Eco-physiological Process caused by improper management, waterlogging, drought, or other extreme Meteorological Background conditions. This approach effectively quantifies the negative outcome of interacting environmental constraints, thereby explaining yield disparities.

Multi-year composite Soil Spectral Reflectance characteristics differed markedly between yield classes (Figure 9e). Areas corresponding to high-yield samples exhibited lower overall bare soil reflectance, while low-yield samples showed the opposite pattern. This is consistent with soil reflectance serving as a proxy for soil fertility, which is positively correlated with yield, and indicates that features extracted from multi-year bare soil imagery effectively characterize the constraining role of Soil Conditions on yield potential [59]. Low soil reflectance often corresponds to higher organic matter or clay content and therefore indicates, to a certain extent, higher fertility and water retention capacity [60].

In summary, by extracting and integrating features from massive long-term remote sensing time series, the YFKD achieved an effective characterization of key yield formation processes without directly introducing meteorological, soil, or physiological data. This provides a scalable, remote-sensing-driven data framework for regional crop yield monitoring and estimation.

4.2. Analysis of Key Monitoring Time Windows for Yield Formation

Given the time-series nature of NDVI indices, their contribution and importance to yield prediction were analyzed systematically to identify critical time windows for yield formation (Figure 10). Features with high contribution were primarily concentrated between late July and mid-September, indicating that variations in Interannual NDVI Statistics and Intra-annual NDVI during this phase possess strong interpretability regarding yield differences.

The SHAP contribution of NDVI indices gradually increased, peaking in mid-August, before declining significantly in mid-to-late September. This trajectory corresponds to the Eco-physiological Process as the crop transitions from vegetative to reproductive growth. The trend suggests that canopy status and biomass accumulation characterized by NDVI in mid-August (mid-growing season) offer the strongest explanatory power for final yield, whereas the contribution diminishes as the crop enters maturity.

Notably, during the early growing season, Interannual NDVI Statistics and Intra-annual NDVI in mid-June exhibited relatively high SHAP contribution values. Although the total number of indices in this period was small, their contribution level ranked fourth among the 15 dekads. This result indicates that early emergence vigor and the Meteorological Background during the seedling stage serve as critical drivers for subsequent yield formation.
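One way to rank time windows is to sum the mean absolute SHAP values of all time-stamped features that fall in the same dekad. The sketch below assumes this summation rule (the text does not specify whether sums or means per dekad were used); the feature names, dekad labels, and values are hypothetical.

```python
def dekad_contributions(mean_abs_shap, feature_dekad):
    """Sum mean |SHAP| of time-stamped features per dekad and rank the dekads."""
    totals = {}
    for feature, value in mean_abs_shap.items():
        dekad = feature_dekad.get(feature)
        if dekad is not None:  # skip features with no time stamp (e.g., soil)
            totals[dekad] = totals.get(dekad, 0.0) + value
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Hypothetical mean |SHAP| values and feature-to-dekad mapping.
mean_abs = {
    "NDVIC_mid_aug": 20.0,
    "NDVIA_mid_aug": 8.0,
    "NDVIC_late_aug": 15.0,
    "NDVIH_mid_jun": 12.0,
    "SOILR_4": 70.0,
}
dekad_of = {
    "NDVIC_mid_aug": "mid-Aug",
    "NDVIA_mid_aug": "mid-Aug",
    "NDVIC_late_aug": "late-Aug",
    "NDVIH_mid_jun": "mid-Jun",
}
window_ranking = dekad_contributions(mean_abs, dekad_of)
```

Because a dekad with few indices can still rank highly under this rule, the approach is compatible with the observation that a small number of early-season features can carry a relatively large contribution.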

4.3. Spatial Distribution Patterns of Regional Yield-Limiting Factors

Based on the SHAP analysis of the optimal XGBoost model (all-factor input), areas exhibiting negative effects (yield constraints) accounted for 80.8% of the total study area in 2025. This indicates that most plots in Friendship Farm are subject to varying types and degrees of yield-limiting conditions. Significant spatial differentiation was observed among limiting factors, with distinct differences in their coverage areas: Meteorological Background was the primary limiting factor (34.80%), followed by Phenological Information (27.50%), Eco-physiological Process (21.36%), and Soil Conditions (16.35%). These results suggest that while Soil Conditions may possess stronger explanatory power regarding yield magnitude (intensity) at the global importance level and specific plot scales, Meteorological Background and Phenological Information constraints are more spatially widespread across the farm (Figure 11).

In terms of spatial patterns, Meteorological Background constraints were predominantly distributed in the northern and northeastern regions, characterized by broad, extensive, and relatively dispersed coverage. Soil Conditions constraints were concentrated in the southwestern mountainous area (Albic soil zone) and the central region (Meadow soil zone), showing spatial clustering consistent with soil types. Constraints related to Phenological Information and Eco-physiological Process were dispersed across other plots, exhibiting high intra-field variability, which reflects a more refined response to growth status differences within individual fields.

These spatial patterns illustrate the scale differences in how limiting factors affect yield formation. Meteorological Background constraints demonstrate strong spatial continuity and broad coverage, as climatic factors like rainfall and temperature operate on a larger scale—even farm-scale microclimates cover substantial areas. In contrast, Soil Conditions constraints reflect relatively stable environmental heterogeneity, such as texture, organic matter, and drainage [61]. For instance, Albic soils formed by leaching contain an impermeable, clay-heavy Albic horizon in the plow layer, while certain Meadow soils formed by river channel shifts have high sand content; these inherent soil barriers significantly restrict yield formation (Figure 12).

Phenological Information constraints reflect intra-field heterogeneity in light, temperature, water, and heat, leading to phenological discrepancies such as late emergence or delayed maturation (“stay-green”), which directly impact yield. Eco-physiological Process constraints are closely linked to current-season canopy status and stress responses, effectively capturing uneven growth caused by variations in soil moisture, fertility, management practices, or micro-topography within the same environmental background.

The zoning results of spatial limiting factors indicate that yield constraints in the study area are not driven by a single factor but by the superposition of Meteorological Background, Soil Conditions, Phenological Information, and Eco-physiological Process differences. This finding provides critical data support for implementing zonal management and differentiated regulation strategies in future agricultural production decisions.

4.4. Mechanism of Accuracy Enhancement via Multi-Dimensional Spatiotemporal Feature Fusion

The model comparison results indicate that while using Intra-annual NDVI alone achieved relatively high estimation accuracy, the inclusion of Interannual NDVI Statistics and Soil Spectral Reflectance significantly improved performance (Table 3). This validates the effectiveness of the prior knowledge-guided feature fusion strategy and the advantage of the XGBoost model in handling high-dimensional features.

As shown in Figure 12, zoomed-in views of plots in the western, central, and northern regions reveal significant spatial heterogeneity in bare soil imagery. The spatial characteristics of yield magnitude show high consistency not only with current-season vegetation features but also with Soil Spectral Reflectance during the bare soil period (especially in central plots). This visually demonstrates that crop yield is not solely determined by intra-annual growth status but is co-modulated by long-term Soil Conditions constraints and the Interannual NDVI Statistics (representing the Meteorological Background). Therefore, relying on single-year, single-phase, or single-process information is insufficient to comprehensively characterize the complex yield formation mechanism.

Under multi-type feature fusion conditions, the XGBoost model exhibited optimal predictive performance, highlighting the advantages of the Gradient Boosting Decision Tree model in handling complex non-linear relationships and multi-source feature extraction. Strong spatial heterogeneity exists across Meteorological Background, Soil Conditions, Phenological Information, and Eco-physiological Process elements at both regional and field scales. This suggests that features are not merely linearly superimposed; rather, the tree structure achieves an adaptive weighted combination. The model automatically identifies and enhances the weights of features that restrict or promote yield within specific plots, significantly improving the precision of local spatial detail characterization while ensuring global robustness.

From a methodological perspective, YFKD provides a mechanism to systematically translate traditional crop model input strategies (parameters like meteorology, soil, phenology, and eco-physiology) into remote sensing data features, where each category corresponds to a critical perspective of yield formation (Figure 13). This framework enables data-driven models to maintain high-precision prediction while possessing clear prior knowledge-guided directionality consistent with ecological laws, thereby significantly enhancing the interpretability and generalizability of the results.

5. Conclusions

Guided by prior knowledge, this study constructed the Yield-Formation Key Dataset (YFKD) using remote sensing data. By integrating the XGBoost algorithm with the SHAP interpretability framework, we achieved high-precision estimation and a spatiotemporal driving factor analysis of maize yield at Friendship Farm. The YFKD framework successfully transforms complex parameters from traditional mechanistic models (specifically Meteorological Background, Soil Conditions, Phenological Information, and Eco-physiological Process) into observable remote sensing features. This approach significantly enhances model interpretability and, as evidenced by our zonal spatial validation (Table 4), demonstrates robust predictive reliability for regional applications. Compared to the baseline model using only Intra-annual NDVI, the YFKD-XGBoost model, which fuses Interannual NDVI Statistics, Intra-annual NDVI, Intra-annual NDVI Anomaly, Intra-annual NDVI Change Rates, and Soil Spectral Reflectance, demonstrated optimal performance (R^{2} = 0.865, RMSE = 1492.12 kg/ha).

SHAP analysis revealed the spatiotemporal driving factors of yield formation. The contribution of vegetation indices to yield prediction was concentrated between late July and mid-September, following an increasing-then-decreasing trend. Notably, observations of Interannual and Intra-annual NDVI features in the early season (mid-June) also held significant indicative value for subsequent predictions. Spatially, Meteorological Background constraints exhibited a widespread distribution (predominantly in the Northeast and North), whereas Soil Conditions constraints showed significant localized clustering (mainly in the Southwest Albic soil zone and the Central sandy Meadow soil zone). These spatialized diagnostic results intuitively reveal the dominant limiting factors across different regions, providing a direct basis for implementing differentiated precision agriculture management measures.

Author Contributions

Conceptualization, B.Q., X.Z. and H.L.; methodology, B.Q., X.Z. and L.M.; validation, L.C., Z.A. and J.L.; investigation, B.Q., X.H. and Z.A.; data curation, B.Q., X.H. and J.L.; writing—original draft preparation, B.Q.; writing—review and editing, X.Z., L.C. and L.M.; supervision, X.Z. and H.L.; funding acquisition, X.Z. and H.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the National Key R&D Program of China (2021YFD1500100 and 2024YFD1501100), the Science and Technology Development Plan Project of Jilin Province, China (20240101043JC), and the Jilin Agricultural University Introduction of Talents Project (No. 202020010).

Data Availability Statement

Data are subject to privacy restrictions. Please contact the corresponding author.

Acknowledgments

We thank the National Earth System Science Data Center for providing geographic information data.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. The overall research framework of the study (In this figure, dark blue indicates the Soil Condition features derived from bare soil imagery; light blue represents the Eco-physiological Process and Phenological Information jointly constructed by intra-annual and inter-annual vegetation indices; and pink denotes the Meteorological Background features constructed using inter-annual vegetation indices).

Figure 2. (a) Location of the study area; (b) Digital Elevation Model (DEM); (c) distribution of measured yield sampling points; (d) maize planting frequency distribution from 2016 to 2025; (e) 1:1,000,000 soil type distribution map; and (f) surface conditions in the bare soil period.

Figure 3. Statistical distribution of the maize yield dataset after stratified random sampling (70% training, 30% testing). (a) Stacked bar chart detailing the exact sample counts across the five predefined yield classes, demonstrating consistent proportional allocation. (b) Kernel Density Estimation (KDE) curves illustrating the continuous yield distributions.

Figure 4. Feature importance scores based on the Boruta algorithm.

Figure 5. Scatter plots comparing observed versus estimated maize yields for different models.

Figure 6. Spatial mapping of maize yield under various feature input combinations and prediction models.

Figure 7. Global feature contribution ranking and SHAP value distribution characteristics.

Figure 8. (a) Contribution factors of different feature groups across varying yield levels (from low yield to high yield); (b) total contribution of different feature groups to yield prediction.

Figure 9. Relationships between yield and: (a) Interannual NDVI Statistics (Mean) time series; (b) Interannual NDVI Statistics (Maximum) time series; (c) Intra-annual NDVI time series; (d) Intra-annual NDVI Anomaly time series; and (e) Soil Spectral Reflectance during the bare soil period. Together, these panels demonstrate the capability of the YFKD to characterize how differences among yield classes form.

Figure 10. Analysis of key time windows for maize yield formation based on SHAP. The bar chart illustrates the contribution of different dekads to yield from emergence (June) to maturity (September).

Figure 11. Spatial distribution patterns of regional yield-limiting factors based on SHAP.

Figure 12. Field-verified evidence and remote sensing observations of typical yield-limiting soil types: (a) soil profile photograph of an Albic soil, illustrating a prominent Albic horizon formed by leaching processes; (b) remote sensing imagery of the Albic soil during the bare soil period; (c) maize yield estimation results within the Albic soil zone; (d) surface soil photograph of a sandy Meadow soil with high sand content; (e) remote sensing imagery of the sandy Meadow soil during the bare soil period; and (f) maize yield estimation results within the sandy Meadow soil zone.

Figure 13. Comparative zoomed-in views of multi-dimensional features and yield estimation results at the field scale.

Table 1. Division of the maize growing season at the dekadal scale.

$k$	Date range (MMDD)	Dekad
1	0601–0610	Early June
2	0611–0620	Middle June
3	0621–0630	Late June
4	0701–0710	Early July
5	0711–0720	Middle July
6	0721–0731	Late July
7	0801–0810	Early August
8	0811–0820	Middle August
9	0821–0831	Late August
10	0901–0910	Early September
11	0911–0920	Middle September
12	0921–0930	Late September
13	1001–1010	Early October
14	1011–1020	Middle October
15	1021–1031	Late October
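
The dekad numbering in Table 1 can be reproduced with a small helper. This is a sketch under the assumption that each month is split into days 1 to 10, 11 to 20, and 21 to month end, which matches the date ranges listed above. The function name `dekad_index` is hypothetical.

```python
from datetime import date


def dekad_index(d):
    """Return the dekad index k (1..15) of Table 1 for a date in June-October."""
    if not 6 <= d.month <= 10:
        raise ValueError("date falls outside the June-October growing season")
    # Days 1-10 -> 0, 11-20 -> 1, 21-end of month -> 2 (day 31 stays in the third dekad).
    part = min((d.day - 1) // 10, 2)
    return (d.month - 6) * 3 + part + 1


# Usage: the year is arbitrary, only month and day matter.
k_late_july = dekad_index(date(2025, 7, 31))
k_mid_september = dekad_index(date(2025, 9, 15))
```

Capping the within-month part at 2 is what keeps the 31st of July, August, and October inside the "Late" dekad rather than opening a fourth one.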

Table 2. The Yield-Formation Key Dataset (YFKD) and its feature groupings.

YFKD	Group	Variable	Number of features
Meteorological Background	Interannual NDVI Statistics	${NDVIH}_{MEAN} (k)$	15
Meteorological Background	Interannual NDVI Statistics	${NDVIH}_{MED} (k)$	15
Meteorological Background	Interannual NDVI Statistics	${NDVIH}_{MAX} (k)$	15
Meteorological Background	Interannual NDVI Statistics	${NDVIH}_{MIN} (k)$	15
Meteorological Background	Interannual NDVI Statistics	${NDVIH}_{STD} (k)$	15
Eco-physiological Process	Intra-annual NDVI	$NDVIC (k)$	15
Eco-physiological Process	Intra-annual NDVI Anomaly	$NDVIA (k)$	15
Phenological Information	Intra-annual NDVI Change Rates	$NDVICR$	2
Soil Conditions	Soil Spectral Reflectance	${SOILR}_{NORM} (b)$	13
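
To make the size of the candidate feature set in Table 2 concrete, the sketch below enumerates one feature name per table entry. The naming scheme (for example `NDVIH_MEAN_1` or `SOILR_NORM_13`) and the labels of the two change-rate features are hypothetical. Only the counts per group come from the table, and the predictor set retained after feature selection may be smaller.

```python
N_DEKADS = 15      # dekads k = 1..15 (Table 1)
N_SOIL_BANDS = 13  # bands b = 1..13 for the soil reflectance group


def build_yfkd_feature_names():
    """Enumerate candidate YFKD features following the groups in Table 2."""
    names = []
    # Meteorological Background: five interannual NDVI statistics per dekad.
    for stat in ("MEAN", "MED", "MAX", "MIN", "STD"):
        names += [f"NDVIH_{stat}_{k}" for k in range(1, N_DEKADS + 1)]
    # Eco-physiological Process: current-season NDVI and its anomaly per dekad.
    names += [f"NDVIC_{k}" for k in range(1, N_DEKADS + 1)]
    names += [f"NDVIA_{k}" for k in range(1, N_DEKADS + 1)]
    # Phenological Information: two change-rate features (labels are placeholders).
    names += ["NDVICR_1", "NDVICR_2"]
    # Soil Conditions: normalized bare-soil reflectance per band.
    names += [f"SOILR_NORM_{b}" for b in range(1, N_SOIL_BANDS + 1)]
    return names


feature_names = build_yfkd_feature_names()
n_features = len(feature_names)  # 5*15 + 15 + 15 + 2 + 13
```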

Table 3. Comparison of model performance under different feature combination schemes.

	MLR				Random Forest				XGBoost
Feature combination	$R^{2}$	RMSE (kg/ha)	MAE (kg/ha)	NRMSE	$R^{2}$	RMSE (kg/ha)	MAE (kg/ha)	NRMSE	$R^{2}$	RMSE (kg/ha)	MAE (kg/ha)	NRMSE
$NDVIC$	0.526	2806.55	2087.92	26.9%	0.623	2503.14	1805.61	24.0%	0.735	2096.08	1418.30	20.09%
$NDVIC + NDVICR$	0.453	3013.07	2346.77	28.8%	0.589	2611.32	1843.83	25.3%	0.644	2430.87	1768.52	23.30%
$NDVIC + NDVIH$	0.663	2366.55	1843.13	22.6%	0.668	2349.07	1683.00	22.5%	0.826	1698.19	1043.68	16.28%
$NDVIC + NDVIA$	0.617	2521.11	1938.27	24.1%	0.669	2342.90	1706.81	22.4%	0.831	1675.22	1050.96	16.06%
$NDVIC + SOILR$	0.666	2355.75	1836.68	22.5%	0.788	1877.88	1392.82	18.0%	0.842	1615.40	1043.84	15.56%
$NDVIC + NDVICR + NDVIH + NDVIA + SOILR$	0.691	2246.35	1749.55	21.4%	0.764	1979.57	1425.23	18.8%	0.865	1492.12	999.81	14.37%
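
The four metrics reported in Table 3 can be computed as in the sketch below. The $R^{2}$, RMSE, and MAE formulas are the standard ones. The NRMSE normalization is an assumption: RMSE is divided here by the mean observed yield, which is one common convention. The function name `regression_metrics` and the toy yields are hypothetical and are not data from the study.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, RMSE, MAE and NRMSE for paired observed/predicted yields (kg/ha)."""
    n = len(observed)
    mean_obs = sum(observed) / n
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    rmse = math.sqrt(ss_res / n)
    mae = sum(abs(o - p) for o, p in zip(observed, predicted)) / n
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": rmse,
        "MAE": mae,
        # Assumption: normalize by the mean observed yield.
        "NRMSE": rmse / mean_obs,
    }


# Toy yields (placeholder values only).
obs = [8000.0, 10000.0, 12000.0]
pred = [8500.0, 10000.0, 11500.0]
metrics = regression_metrics(obs, pred)
```

Dividing the reported RMSE by the reported NRMSE gives roughly 10,400 kg/ha across the rows of Table 3, which is consistent with normalization by a mean yield, though the sketch should not be read as the study's confirmed definition.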

Table 4. Spatial validation of estimated maize yield via zonal mean comparison.

Yield Class	Observed Mean (kg/ha)	XGBoost Predicted Mean (kg/ha)	Bias (kg/ha)
Low	4357	4682	325
Below Average	7589	7421	−168
Average	10,412	10,298	−114
Above Average	13,387	13,512	125
High	17,894	17,456	−438
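
The Bias column in Table 4 equals the predicted zonal mean minus the observed zonal mean. The short sketch below recomputes it from the tabulated means and also derives the bias relative to the observed mean, a quantity that is not reported in the table. The names `ZONAL_MEANS` and `zonal_bias` are hypothetical.

```python
# (observed mean, XGBoost predicted mean) in kg/ha, copied from Table 4.
ZONAL_MEANS = {
    "Low": (4357, 4682),
    "Below Average": (7589, 7421),
    "Average": (10412, 10298),
    "Above Average": (13387, 13512),
    "High": (17894, 17456),
}


def zonal_bias(zonal_means):
    """Bias per yield class: predicted mean minus observed mean (kg/ha)."""
    return {cls: pred - obs for cls, (obs, pred) in zonal_means.items()}


bias = zonal_bias(ZONAL_MEANS)
# Relative bias as a fraction of the observed zonal mean.
relative_bias = {cls: bias[cls] / ZONAL_MEANS[cls][0] for cls in ZONAL_MEANS}
```

Positive values indicate overestimation: the lowest class is overestimated and the highest class underestimated, which is the pattern visible in the table.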
