1. Introduction
Wheat is one of the world’s major food crops, and timely, accurate wheat yield information is crucial for the rational planning of cultivation practices and the early warning of yield fluctuations. Time series crop growth parameters derived from remote sensing data reflect the growth status of wheat across different phenological stages and have been increasingly utilised for regional crop yield estimation [1,2].
Previous research has utilised remote sensing data from satellite sensors to quantify regional crop yields. In this regard, a range of vegetation indices, such as the Leaf Area Index (LAI) and the Enhanced Vegetation Index (EVI), have been incorporated into yield estimation [3,4]. On this basis, meteorological and remote sensing data are being combined to account for the effects of external environmental factors, with the aim of improving the accuracy of crop yield estimation. For instance, Dujakovic et al. [5] combined remote sensing data (Moderate-resolution Imaging Spectroradiometer (MODIS) time series of the Normalized Difference Vegetation Index (NDVI)) with meteorological data analysis to predict the start of the growing season (SOS) in grasslands. Their results showed that the integrated method outperformed the other methods and gave more robust and accurate SOS estimates, which is of great help for estimating grassland production. In addition, soil, as another key factor influencing crop growth, is receiving increasing attention in crop growth studies that aim at accurate yield estimation based on multimodal data (i.e., data comprising multiple types of data or signal sources) [6,7]. However, when multimodal data are used for modelling, the spatio-temporal consistency of data from the various sources presents a primary challenge.
In estimating crop yields from multimodal data, linear statistical models have gained momentum because they are simple to build [8]. For example, Liu et al. [9] used 40 years of data from experimental stations in China’s main winter wheat producing areas to explore the spatiotemporal changes in growing degree days (GDD) under climate change, and used a linear mixed effects regression model to illustrate the relationship between cumulative GDD at different growth stages and wheat yield. However, prediction models based on linear statistical relationships have obvious performance limitations. As a result, crop model-based yield estimation methods have gradually gained attention. Crop growth models are dynamic simulation models based on crop growth mechanisms, which take into account the complex interactions of environmental factors such as soil, climate, human activities and biological processes during crop growth [10,11]. Although crop growth models are reported to be more accurate than linear statistical models, they also face implementation challenges due to excessive parameter constraints, inadequate model optimisation and uncertain inputs. In addition, these models are sensitive to climate change and usually require extensive and diverse validation datasets, which can lead to inaccurate simulations and biased estimates [12]. Therefore, combining observation data with crop growth models, so that model parameters are dynamically adjusted through data assimilation techniques to improve estimation accuracy, has become a research hotspot [13].
Traditional yield estimation methods, such as those based on remote sensing indices, statistical models or crop growth models, rely in most cases on high-quality (precise and accurate) data. Furthermore, they generally fail to capture complex relationships among different crop varieties, growing environments, and management practices. In recent years, deep learning, as a cutting-edge technology of artificial intelligence, has gradually emerged in the field of crop yield estimation due to its superior performance in complex nonlinear problems and large-scale data processing [14,15]. Deep learning, through the feature learning capability of multi-layer neural networks, is able not only to automatically extract key features from spectral images but also to capture the spatiotemporal dynamics of time series data, which helps overcome the limitations of traditional methods in data processing and model adaptation. For example, Zhang et al. [
15] proposed a Long Short-Term Memory (LSTM) based maize yield prediction method by combining vegetation index and meteorological data. The results showed that the model was able to characterise the cumulative effect of environmental factors on yield, and obtained accurate and stable yield estimation results. Wang et al. [
16] utilised the advantage of time series memory of Gated Recurrent Unit (GRU) and combined it with a Convolutional Neural Network (CNN) to estimate the yield of winter wheat at the decadal time scale, and the results showed that the yield of winter wheat at the county scale could be effectively predicted with the combination of vegetation index and meteorological data. Khan et al. [
17] used deep neural networks and multistream deep neural networks combined with vegetation indices to predict county scale maize yields in the U.S., which succeeded in making accurate estimates before harvesting. Aravind et al. [
18] developed a multi-stage crop yield estimation model based on deep learning methods, using time series multivariate meteorological data to estimate wheat harvests. Furthermore, coupling deep learning with a crop growth model can combine the advantages of both to further improve the accuracy of crop yield estimation. Jeong et al. [19] combined deep learning and remote sensing with process-based crop models to enhance rice yield prediction, and developed and evaluated four models based on different deep neural network architectures: a feedforward neural network, LSTM, GRU, and bidirectional LSTM. All models showed high prediction accuracy. Introducing an attention mechanism into a deep learning model can direct the model to pay extra attention to specific variables, which in turn improves its feature extraction capability and efficiency. Tian et al. [
20] proposed an attention mechanism based multilevel crop network (AMCN) to estimate district level wheat yields based on remotely sensed and meteorological data, which provided more accurate yield estimation by utilising a multilevel crop network while taking into account the uncertainties involved in the model estimation. In summary, numerous yield estimation methods combining deep learning models with crop growth models or attention mechanisms have been validated as effective. However, few studies have explored how to construct applicable deep learning model architectures from a crop physiology perspective to further enhance the models’ yield estimation accuracy.
The improved accuracy evidenced by deep learning in agricultural contexts is increasingly irrefutable and has led to its widespread use for crop yield estimation. However, since deep learning models are often regarded as ‘black-box models’, the problem of their interpretability has become a major bottleneck for further improvement in the field of agriculture [
21,
22]. Interpretability is not only related to the reliability and transparency of the model, but is also important for understanding the key factors in the crop growth process and optimising agricultural management strategies. In crop yield estimation, interpretability refers to a model’s ability to explicitly describe the relationship between input features and output results, for example, the causal relationship between the meteorological data input to the model and crop growth. Although deep learning models are capable of handling complex nonlinearities, their high-dimensional feature space and hierarchical structure make it difficult for humans to directly understand their decision-making process. The lack of interpretability not only reduces the credibility of the model, but also limits its potential for application in agricultural decision-making. Therefore, improving the interpretability of deep learning models is key to promoting their application in the field of crop yield estimation.
In recent years, for the interpretability problem of deep learning in the field of yield estimation, academics have carried out various studies and made preliminary progress [
23]. Paudel et al. [
21] used data from the European Commission Joint Research Centre’s MARS crop yield prediction system to assess the performance and interpretability of neural network models for crop yield prediction, where the selected neural networks could process sequential or time series data, demonstrating the potential of deep learning to automatically learn features and generate reliable crop yield predictions, and highlighting the importance of involving human stakeholders to participate in assessing the interpretability of the models. Bi et al. [
24] interpreted the results of a model for three-phase products of nitrogen-rich biomass pyrolysis prediction through SHapley Additive exPlanations (SHAP) value analysis, and conducted sensitivity analysis through Monte Carlo simulation to identify the key features that have a greater impact on the target value. The partial correlation analysis of the two features reveals the importance and dependence of the model features. Wang et al. [
25] further revealed the cumulative effect between remotely sensed LAI and Vegetation Temperature Condition Index (VTCI) with yield by employing Light Gradient Boosting Machines (LightGBM) and SHAP, and the results showed that the cumulative effect between remote sensing parameters and yield improved the interpretability of yield estimation models based on deep learning methods. In summary, most existing research on interpretability within crop yield estimation focuses solely on the input variables of models, with little in-depth explanation of how these variables influence estimated yields from the perspective of crop growth mechanisms. Notably, studies exploring the dynamic changes in contributions from multimodal data across different crop growth stages remain absent.
This study accounts for spatiotemporal scale disparities (multi-scale) among multimodal data (remote sensing, meteorology and soil) and, from a crop-physiological perspective, develops a deep learning architecture for wheat yield estimation that aims to enhance both predictive accuracy and model interpretability. The main objectives of this study are to: (1) propose a multi-scale network framework for winter wheat yield estimation based on multimodal data; (2) validate the model’s yield estimation accuracy on rain-fed and irrigated farmlands, and assess its accuracy at the county scale based on the inter-annual yield estimation results; and (3) carry out a model interpretability analysis.
2. Materials and Methods
2.1. Study Area
The study area is located in the Guanzhong Plain in northwestern China, covering a total area of approximately 36,000 km². It is one of China’s major grain-producing regions (Figure 1), primarily cultivating winter wheat and summer maize. Winter wheat is sown in early October each year and harvested in early June of the following year. During its growth cycle, winter wheat undergoes several key stages, including tillering, overwintering, green-up (early to mid March), jointing (late March to mid April), heading-filling (late April to early May), and milk maturity (mid to late May). This study focuses solely on winter wheat within the study area. The specific distribution and delineation criteria for winter wheat growing regions are detailed in Section 2.2.
2.2. Remote Sensing Data
This study utilises remote sensing data from the Guanzhong Plain’s wheat-growing regions during the main growth stages of winter wheat in 2024. The data include the MODIS Nadir BRDF-Adjusted Reflectance Daily 500 m data set (MCD43A4), the MODIS Vegetation Indices 16-Day Global 250 m data set (MOD13Q1), the MODIS Leaf Area Index/FPAR 4-Day Global 500 m data set (MCD15A3H) and the land cover data set (MCD12Q1), all of which are directly accessible via the Google Earth Engine (GEE) platform. From these products, time series of a biophysical parameter (LAI), a vegetation index (NDVI) and a water stress index (the Shortwave Infrared Water Stress Index, SIWSI) were generated for the four growth stages of winter wheat and used as input data to the yield estimation model. LAI, NDVI and SIWSI were selected for winter wheat yield estimation because together they reflect the physiological characteristics of the crop and the main factors affecting yield, namely canopy structure, growth status and water stress conditions. LAI quantifies the canopy structure and photosynthetic capacity of the vegetation and is closely related to biomass accumulation; NDVI provides comprehensive information on vegetation health and growth status and is significantly related to crop yield; SIWSI complements NDVI through its sensitivity to water stress, directly reflecting the crop water status and its effect on yield. The combination of the three indices provides dynamic and comprehensive inputs to the model, thus helping to improve the accuracy and reliability of yield estimation. The MCD12Q1 data set was used to extract the land cover type layer for the study area. The data set contains 13 bands, and the LC_Type1 band, which corresponds to the International Geosphere-Biosphere Programme (IGBP) classification, was used. Specifically, land cover type 12 (Croplands) was selected to build a cropland mask, which was then applied to extract the remote sensing, meteorological and soil data specific to farmland in the Guanzhong Plain.
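As an illustration of how the indices and the cropland mask described above could be computed per pixel, the following minimal Python sketch may help. It is not the authors’ GEE workflow: the function names are hypothetical, NDVI uses the standard normalized difference of near-infrared and red reflectance, and SIWSI is assumed here to take the form (SWIR − NIR)/(SWIR + NIR), which the text does not spell out.

```python
IGBP_CROPLANDS = 12  # LC_Type1 class used in the text for croplands


def normalized_difference(a, b):
    """Return (a - b) / (a + b), or None when the denominator is zero."""
    denom = a + b
    return None if denom == 0 else (a - b) / denom


def ndvi(nir, red):
    """Standard NDVI from near-infrared and red reflectance."""
    return normalized_difference(nir, red)


def siwsi(swir, nir):
    """Assumed SIWSI form: (SWIR - NIR) / (SWIR + NIR)."""
    return normalized_difference(swir, nir)


def mask_croplands(values, lc_type1):
    """Keep values on cropland pixels; set all other pixels to None."""
    return [v if lc == IGBP_CROPLANDS else None
            for v, lc in zip(values, lc_type1)]


if __name__ == "__main__":
    # Three hypothetical pixels: reflectances and IGBP land cover classes.
    nir, red, swir = [0.50, 0.40, 0.45], [0.10, 0.20, 0.08], [0.20, 0.30, 0.18]
    land_cover = [12, 10, 12]
    ndvi_px = [ndvi(n, r) for n, r in zip(nir, red)]
    siwsi_px = [siwsi(s, n) for s, n in zip(swir, nir)]
    print(mask_croplands(ndvi_px, land_cover))
    print(mask_croplands(siwsi_px, land_cover))
```

Under this assumed form, more negative SIWSI values correspond to higher canopy water content, so the sign convention should be checked against the source used for the index before reuse.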
2.3. Meteorological Data
In this study, the European Centre for Medium-Range Weather Forecasts (ECMWF) ERA5-Land reanalysis dataset was used as the source of meteorological information for crop yield estimation. The ERA5-Land Daily Aggregated–ECMWF meteorological Reanalysis dataset, containing 14 meteorological variables, was downloaded from the GEE platform for this study (https://developers.google.com/earth-engine/datasets/catalog/ECMWF_ERA5_LAND_DAILY_AGGR, accessed on 2 January 2025) (Table 1). Temperature is an important meteorological factor affecting crop growth. The temperature-related variables are the temperature at 2 m, the dewpoint temperature at 2 m, and the soil temperature at levels 1 and 2, which reflect the heat supply and transpiration during the crop growing season. Water supply is a key limiting factor for crop productivity. The precipitation and water variables, namely the sum of total precipitation, the sum of runoff, and the volumetric soil water of layers 1 and 2, are used to describe soil moisture conditions. Photosynthesis drives the growth process of crops and is closely related to solar radiation. Radiation-related variables include the sum of surface net solar radiation and the sum of surface net thermal radiation, which are used to measure the energy balance and crop photosynthesis potential. Surface pressure and wind field variables include the surface air pressure and the eastward (u) and northward (v) components of wind at 10 m, which can reflect the dynamics of weather systems and crop transpiration water consumption. The sum of potential evaporation is an important indicator of crop water requirement and can reflect the regional water balance and the intensity of the evapotranspiration process. All meteorological variables were obtained at a daily time scale.
Since the meteorological data have a lower spatial resolution than the remotely sensed data, a geostatistics-based spatial interpolation method, Kriging Interpolation (KI), was used in this study to interpolate the meteorological data to a spatial resolution consistent with that of the remotely sensed data. Han’s study showed that this method gives convincing results when interpolating meteorological data in the Guanzhong Plain [26].
2.4. Soil Data
The Harmonized World Soil Database (HWSD) is a 30 arc-second resolution raster database containing more than 15,000 different soil mapping units. The database integrates global soil information from multiple sources, including regional and national updates (such as the Soil and Terrain (SOTER) database, the European Soil Database (ESD), the China Soil Map and the World Inventory of Soil Emission (WISE)), and data from the FAO-UNESCO World Soil Atlas [27,28]. In this study, the HWSD dataset was used to extract 10 key soil variables that affect crop yields (Table 2). These variables can be summarised into the following categories: (1) cation exchange capacity (CEC), which reflects the soil’s ability to adsorb nutrient ions and is an important indicator of soil fertility; (2) soil acidity (pH value), which affects the availability of nutrients in the soil and the growth of crop roots; (3) organic carbon content, which is an important component of soil fertility and directly affects the physical and chemical properties of the soil and crop productivity; and (4) soil texture, which affects the soil’s ability to retain water and nutrients. The clay contents of the topsoil and subsoil (T_CLAY and S_CLAY) are important texture variables that provide a basis for studying water and crop root distribution. All soil data obtained are static. Like the meteorological data, the soil data were interpolated to a spatial resolution of 500 m.
2.5. Yield Data
In this study, a field sampling experiment for winter wheat yield was conducted in early June 2024 in the Guanzhong Plain (Figure 1). The field sampling rules followed Han’s study [29]. Given the spatial resolution limitations of MODIS data, the selection criteria for field sampling sites were as follows: within a 500 m × 500 m area, there should be no significant interference from villages, wide roads, buildings, trees, or similar features, and the surface cover should predominantly consist of wheat. The spacing between field sampling points had to exceed 500 m. A total of 222 field samples of winter wheat yield were collected in the experiment, including 149 samples from irrigated farmland and 73 samples from rain-fed farmland.
2.6. MultiScaleWheatNet Model
In this study, a multi-scale network framework for winter wheat yield estimation, named MultiScaleWheatNet, is proposed. Here, multi-scale refers to the model’s input comprising multiple data sources (remote sensing, meteorology and soil) with differing spatio-temporal scales. The network structure mainly consists of a sub-network for feature extraction among crop growth stages, a sub-network for time series feature extraction, a sub-network for feature fusion, and a sub-network for yield estimation. The sub-network for feature extraction among crop growth stages is built from the time series feature extraction sub-network, which in turn is composed of five branches with the same network structure: a meteorological transformer branch, an NDVI transformer branch, an LAI transformer branch, an SIWSI transformer branch and a soil transformer branch (Figure 2). It is worth noting that each of the three remote sensing indices is treated as a separate branch because their temporal resolutions differ.
The model was developed using a range of data sources and variables. These included time series feature variables corresponding to the four growth stages of winter wheat in 2024 (green-up, jointing, heading-filling, and milk maturity): 14 daily meteorological variables from the ERA5-Land dataset, the 16-day MODIS NDVI composite, the 4-day MODIS LAI composite, the 8-day MODIS SIWSI composite, and 10 soil property variables from the HWSD dataset. The labels of the model were the yield values of the 222 winter wheat samples measured in the field. Because the time step differs between feature variables, the sequences have different lengths, so conventional mini-batch training and feature extraction methods cannot be applied directly. Consequently, shorter sequences were padded, and a transformer padding mask was employed to align the time steps. The padding mask is a Boolean matrix in which a ‘true’ value indicates that the corresponding position is padding and a ‘false’ value indicates valid data. In this study, 0 is used as the padding value. In the self-attention mechanism, the mask sets the attention score of each padded position to a very large negative value (usually negative infinity) before the softmax, so that its attention weight becomes zero and the padded positions do not affect the features at other positions. This enables the model to differentiate between real data and padding, and thus to focus on processing valid data.
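The padding-mask idea can be illustrated with a small, framework-free Python sketch. It is only an illustration under stated assumptions, not the authors’ implementation: the helper names `pad_sequences` and `masked_softmax` are hypothetical, sequences are right-padded with 0, and the mask follows the convention that True marks a padded position (the same convention PyTorch uses for `src_key_padding_mask`).

```python
import math

PAD_VALUE = 0.0  # the text states that 0 is used as the fill value


def pad_sequences(seqs, pad_value=PAD_VALUE):
    """Right-pad 1-D sequences to a common length.

    Returns (padded, mask), where mask[i][t] is True at padded positions.
    """
    max_len = max(len(s) for s in seqs)
    padded, mask = [], []
    for s in seqs:
        n_pad = max_len - len(s)
        padded.append(list(s) + [pad_value] * n_pad)
        mask.append([False] * len(s) + [True] * n_pad)
    return padded, mask


def masked_softmax(scores, key_mask):
    """Softmax over attention scores with padded keys set to -inf.

    Assumes at least one position is valid (not padded).
    """
    masked = [-math.inf if m else s for s, m in zip(scores, key_mask)]
    peak = max(masked)
    exps = [math.exp(s - peak) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # A longer (e.g. 4-day LAI) and a shorter (e.g. 16-day NDVI) series.
    padded, mask = pad_sequences([[0.8, 1.1, 1.6, 2.0], [0.35]])
    print(padded[1], mask[1])
    # Raw attention scores for the short series: padded keys get zero weight.
    print(masked_softmax([0.2, 1.5, -0.3, 0.9], mask[1]))
```

Because the masked scores become negative infinity before the softmax, the padded positions receive exactly zero weight, which is why a padding value of 0 cannot be confused with a genuine observation of 0.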
In order to efficiently extract the temporal features of each feature variable, all data with different time scales must be transformed to a unified time scale. The data first enter the embedding layer for temporal feature embedding, and the temporal information is incorporated into the feature vectors through position encoding. This enables the model to perceive the temporal order of the data and enhances feature representation. The position encoding is generated using sine and cosine functions so that the model can capture position information in long time series. The feature variables are then processed by the transformer encoder to generate a feature representation on a uniform temporal scale. The feature vector of the last time step is passed through a nonlinear activation and used as a composite feature representation of that feature variable’s time series. Subsequently, the composite features of the different feature variables are fused to describe the growth status of winter wheat during a given growth stage. The same time series feature extraction procedure was applied to all four growth stages. The time series features of the four stages were then concatenated in chronological order to form a complete time series feature vector, which was input into the feature fusion sub-network. This sub-network employs a multi-layer attention mechanism and a nonlinear activation layer to integrate the key information of the full growth cycle of winter wheat, from the green-up stage to the milk maturity stage, thereby generating a comprehensive feature vector. Finally, the integrated feature vector is passed through the fully connected layer, which performs the regression and outputs the estimated yield of winter wheat.
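The sine and cosine position encoding mentioned above can be sketched as follows. The text does not give the exact formula, so this example assumes the widely used sinusoidal scheme with a base of 10000, and the function name `positional_encoding` is hypothetical.

```python
import math


def positional_encoding(seq_len, d_model, base=10000.0):
    """Sinusoidal position encoding (assumed standard form).

    Even feature indices use sine and odd indices use cosine, with a
    wavelength that grows geometrically along the feature axis.
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (base ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe


if __name__ == "__main__":
    # Four time steps embedded in a 32-dimensional space (hidden_dim = 32).
    pe = positional_encoding(seq_len=4, d_model=32)
    print([round(v, 3) for v in pe[1][:4]])
```

The encoding is added to the embedded features element-wise, so each time step carries a distinct signature even when two steps hold identical feature values.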
The dataset was stratified by irrigation type and randomly partitioned into training and test sets in an 8:2 ratio. The input data were organised as 3-dimensional matrices of shape (number of samples, time steps, feature variables), i.e., Batch × Time × Feature. To accelerate the convergence of model training, the feature variables were rescaled to the dimensionless [0, 1] interval.
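A minimal sketch of the stratified 8:2 split and the [0, 1] scaling is given below. It rests on assumptions the text does not confirm: the helper names are hypothetical, the split is done per irrigation type with a fixed random seed, and the scaling is min-max normalisation whose bounds are fitted on the training data only.

```python
import random


def stratified_split(labels, test_frac=0.2, seed=0):
    """Split sample indices into train/test sets, stratum by stratum."""
    rng = random.Random(seed)
    groups = {}
    for idx, lab in enumerate(labels):
        groups.setdefault(lab, []).append(idx)
    train, test = [], []
    for lab in sorted(groups):
        idxs = groups[lab][:]
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_frac)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


def fit_minmax(column):
    """Return the (min, max) bounds of one feature column."""
    return min(column), max(column)


def apply_minmax(column, lo, hi):
    """Scale a column with previously fitted bounds; constant columns map to 0."""
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in column]


if __name__ == "__main__":
    # 149 irrigated and 73 rain-fed samples, as reported in the text.
    labels = ["irrigated"] * 149 + ["rainfed"] * 73
    train_idx, test_idx = stratified_split(labels)
    print(len(train_idx), len(test_idx))
    lo, hi = fit_minmax([2.0, 4.0, 6.0])
    print(apply_minmax([2.0, 4.0, 6.0], lo, hi))
```

With bounds fitted on the training data only, test values can fall slightly outside [0, 1]; fitting on the full dataset avoids that but leaks information from the test set, so the choice is a trade-off rather than a given.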
The optimal hyperparameters of the MultiScaleWheatNet model, determined after 3000 epochs, are shown in Table 3. The hidden dimension used for feature extraction (hidden_dim) in each data input stream (meteorological variables, NDVI, LAI, SIWSI, soil properties) is 32. To ensure that the model learns a high-quality representation of the features, a two-layer Transformer encoder (num_layers = 2) is used for temporal modelling. For the output features of the modules corresponding to the four growth stages, the model achieves global feature fusion by concatenating the features and mapping them to a high-dimensional space of size final_hidden_dim (set to 128), which further captures high-level feature interactions across stages. Finally, the features are mapped to the target dimension through the fully connected layer to output a single yield value (output_dim = 1). These hyperparameters were chosen to capture the complex dependencies between time steps while controlling the computational complexity to limit overfitting.
The MultiScaleWheatNet model was developed with the PyTorch deep learning framework (version 1.11.0), using Python 3.8 as the programming language. The operating system is Ubuntu 20.04, and the CUDA version is 11.3. The experimental platform is configured with an NVIDIA GeForce RTX 3090 GPU with 24 GB of memory and an AMD EPYC 7642 48-Core Processor with 80 GB of RAM.
2.7. Benchmark Model for Comparison
Two mainstream deep learning architectures, an LSTM based model and a CNN based model, and a classic machine learning method, the Random Forest Regression (RFP) model, were designed as benchmark models to compare with the proposed MultiScaleWheatNet model in terms of yield estimation accuracy. The LSTM based model was established as a baseline due to its proven capability in modelling temporal sequences, especially in crop growth monitoring tasks. Its input data structure aligns with that of MultiScaleWheatNet, using three-dimensional matrices (Batch × Time × Feature). Each input variable (meteorological, NDVI, LAI, SIWSI, and soil properties) was processed through a separate LSTM branch due to the varying temporal resolutions. Each branch employs two stacked LSTM layers with 32 hidden units to capture temporal dependencies. The outputs from these branches were concatenated across the four growth stages in chronological order, resulting in a unified feature vector. Subsequently, the unified feature vector was processed through two fully connected layers with dimensions of 128 and 64, respectively, followed by a regression layer to output the yield estimate. The CNN based model was selected due to its ability to capture spatial-temporal patterns within multi-dimensional datasets. In the CNN based model, the input data are reshaped into a four-dimensional tensor with the shape (Batch × Channel × Time × Feature). Each feature variable is processed through an independent convolutional branch. Each branch consists of two 2D convolutional layers, followed by max pooling and batch normalisation. Specifically, the first convolutional layer employs 32 filters with a kernel size of 3 × 1, while the second layer uses 64 filters with the same kernel size. After the convolutional operations, the feature maps from the different branches are flattened and concatenated in sequence according to the corresponding growth stages. The integrated features are then passed through two fully connected layers with 128 and 64 neurons, respectively, both activated using the ReLU function. Finally, a regression output layer is used to estimate the crop yield. Both deep learning benchmark models used the same training/test split ratio (8:2), normalisation method ([0, 1]), and optimisation parameters (Adam optimizer with a learning rate of 0.001) as MultiScaleWheatNet. Training was run for 3000 epochs to ensure convergence.
2.8. Model Interpretability Analysis
One fundamental aspect of interpretability is feature attribution, which aims to elucidate the contribution of individual input features to a model’s decision-making process. A key technique in feature attribution is gradient-based analysis, which leverages the gradients of the model’s output with respect to its input to assess the importance of different features. However, local gradients can be unstable and may not provide a reliable measure of feature contributions, especially in highly nonlinear models. To address this limitation, Sundararajan et al. [
30] introduced the Integrated Gradients (IG) method, a principled approach that ensures robustness and theoretical soundness in feature attribution. The core idea is to compute the integral of the gradient over a path from a reference input (baseline) $x'$ to a target input $x$, thus measuring the overall contribution of the features to the prediction result. The integrated gradient of the $i$-th feature is defined as follows:

$$\mathrm{IG}_i(x) = (x_i - x'_i) \int_{0}^{1} \frac{\partial F\left(x' + \alpha\,(x - x')\right)}{\partial x_i}\, d\alpha$$

where $x'$ is a reference input, such as an all-zero vector or a mean input; $x$ is the target input; $\alpha$ is a path parameter from 0 to 1 used for interpolating between $x'$ and $x$; and $\partial F / \partial x_i$ is the gradient of the function $F$ with respect to the input feature $x_i$.
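In practice, since the integral generally cannot be solved analytically, it is usually approximated using Numerical Integration (NI) methods. The most common numerical integration method is the Trapezoidal Rule:

$$\mathrm{IG}_i(x) \approx (x_i - x'_i) \cdot \frac{1}{m} \sum_{k=1}^{m} \frac{1}{2} \left[ \frac{\partial F(\tilde{x}_{k-1})}{\partial x_i} + \frac{\partial F(\tilde{x}_{k})}{\partial x_i} \right], \qquad \tilde{x}_k = x' + \frac{k}{m}\,(x - x')$$

where $m$ is the number of integration steps (usually 50 to 100) and $\partial F(\tilde{x}_k) / \partial x_i$ is the gradient value at the $k$-th interpolation point between $x'$ and $x$. In deep learning, the gradient can be calculated by automatic differentiation and the integrated gradient can be solved using numerical integration methods.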
IG assigns an importance value to each variable by quantifying the marginal contribution of each variable to the model output when combined with other variables. Each of the input variables to the model (meteorological variables, remote sensing indices, soil properties) was analysed to determine their relative importance in estimating wheat yield. Specifically, the IG framework was applied to the MultiScaleWheatNet model to examine the impact of time series input variables on yield estimation.
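To make the numerical approximation of IG concrete, the sketch below applies the trapezoidal rule to a toy function with a hand-written gradient. It is an illustration only: `toy_model` and `toy_grad` are hypothetical stand-ins for the trained network and for automatic differentiation, and the all-zero baseline is one of the reference choices mentioned above.

```python
def integrated_gradients(grad_fn, x, baseline, m=50):
    """Approximate IG attributions with the trapezoidal rule.

    grad_fn(point) must return the gradient of the model output with
    respect to each input feature at `point`.
    """
    def grad_at(alpha):
        # Point on the straight path from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        return grad_fn(point)

    grads = [grad_at(k / m) for k in range(m + 1)]
    attributions = []
    for i in range(len(x)):
        # Average adjacent gradient values; each interval has width 1/m.
        area = sum((grads[k][i] + grads[k + 1][i]) / 2.0 for k in range(m)) / m
        attributions.append((x[i] - baseline[i]) * area)
    return attributions


def toy_model(v):
    """Hypothetical nonlinear model: F(v) = v0^2 + 3 * v0 * v1."""
    return v[0] ** 2 + 3.0 * v[0] * v[1]


def toy_grad(v):
    """Analytic gradient of toy_model, standing in for autograd."""
    return [2.0 * v[0] + 3.0 * v[1], 3.0 * v[0]]


if __name__ == "__main__":
    x, baseline = [1.0, 2.0], [0.0, 0.0]
    attr = integrated_gradients(toy_grad, x, baseline)
    print(attr, sum(attr), toy_model(x) - toy_model(baseline))
```

A useful sanity check is the completeness property of IG: the attributions should sum to the difference between the model output at the input and at the baseline, which is how the number of steps m can be judged sufficient in practice.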
3. Results
3.1. Results of Winter Wheat Yield Estimation
The accuracy results of the models on the validation set revealed that MultiScaleWheatNet achieved superior performance compared to the benchmark models (Table 4). These results indicate the effectiveness of the proposed MultiScaleWheatNet and its ability to better capture complex temporal patterns and feature interactions across multiple scales and sources of data. The multi-scale transformer-based framework enhanced the model’s accuracy by addressing the challenges posed by the inconsistent temporal resolution of the input variables.
The MultiScaleWheatNet model was trained and tested on the sample dataset, and the loss curves of the training and test sets were visualised (Figure 3). As the number of epochs increases, the model loss decreases. For the mixed samples (samples from rain-fed and irrigated farmland were pooled, and the training and test sets were selected at random), the model converges and the training and test loss curves largely coincide once training reaches 2000 epochs. For the non-mixed samples (rain-fed and irrigated samples were treated separately, and the training and test sets were selected proportionally from each), the model converges and the test loss curve flattens out after 1500 epochs. The loss of the model converged at 0.008 for the mixed samples and at 0.014 for the non-mixed samples, indicating that the models constructed for estimating winter wheat yield all perform well.
The trained MultiScaleWheatNet model was applied to the test set to estimate winter wheat yields. A comparison between the estimated and field-measured yield values is shown in Figure 4. To evaluate the model’s fit, the slope and intercept of the fitted line are examined; the fitted line illustrates the relationship between the measured and estimated yields. The winter wheat yields measured in the field were concentrated between 7.3 t·ha⁻¹ and 8.0 t·ha⁻¹. For the model trained with mixed samples, the fitted line has a slope of 0.73, and the estimates agree well with the measured values (R² = 0.79, RMSE = 0.17 t·ha⁻¹, nRMSE = 9.59%). There were no significant under- or overestimates of yield for high- and low-yielding winter wheat, although in most of the sample plots the estimated yields were somewhat lower than the measured yields. The model trained on unmixed samples estimated winter wheat yield less accurately on irrigated farmlands (R² = 0.73, RMSE = 0.17 t·ha⁻¹, nRMSE = 10.21%) than on rain-fed farmlands (R² = 0.86, RMSE = 0.15 t·ha⁻¹, nRMSE = 9.19%). The field survey indicated that local farmers usually carry out one or two irrigation events in irrigated farmlands and spray pesticides to control pests and diseases. Winter wheat growth in irrigated fields was therefore more affected by field management practices, which lowered the yield estimation accuracy of the MultiScaleWheatNet model in irrigated farmlands.
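The agreement statistics reported above can be reproduced from paired measured and estimated yields. The following is a minimal Python sketch rather than the authors' code. The sample values are hypothetical, R² is computed as the coefficient of determination, and nRMSE is assumed to be the RMSE divided by the mean measured yield, which is one common convention and may differ from the definition used in this study.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares slope and intercept of y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def evaluate(measured, estimated):
    """Return R2, RMSE, nRMSE (%) and the fitted-line slope and intercept."""
    n = len(measured)
    mean_obs = sum(measured) / n
    ss_res = sum((o - e) ** 2 for o, e in zip(measured, estimated))
    ss_tot = sum((o - mean_obs) ** 2 for o in measured)
    rmse = math.sqrt(ss_res / n)
    slope, intercept = fit_line(measured, estimated)
    return {
        "r2": 1 - ss_res / ss_tot,
        "rmse": rmse,
        # Assumption: RMSE is normalised by the mean measured yield.
        "nrmse_pct": 100 * rmse / mean_obs,
        "slope": slope,
        "intercept": intercept,
    }


# Hypothetical plot-level yields in t/ha, for illustration only.
measured = [7.3, 7.5, 7.6, 7.8, 8.0]
estimated = [7.2, 7.5, 7.5, 7.7, 7.9]
metrics = evaluate(measured, estimated)
print(metrics)
```

A slope below 1 together with a positive intercept would indicate that high yields are underestimated and low yields overestimated, which is why the slope and intercept are examined alongside R² and RMSE.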
3.2. Results of Winter Wheat Yield Spatial Distribution
The yield estimation results for winter wheat in the Guanzhong Plain exhibited clear spatial distribution characteristics and interannual variation (Figure 5). Overall, yield was high in the west and low in the east. High-yielding areas (8–10 t·ha⁻¹) were predominantly concentrated in the western part of the plain, which has good soil quality and adequate irrigation conditions. Medium-yielding areas (5–7 t·ha⁻¹) were primarily distributed in the eastern region, where yields are constrained by factors such as rainfall and soil quality and are marginally lower but still relatively high. The northern zone and the Loess Plateau in the north-east are low-yielding, owing to undulating topography, poor soil quality, and limited water resources. With regard to interannual variability, yields were lower in 2021, particularly in the eastern regions. This may be attributable to climatic fluctuations or changes in water availability that led to pests, diseases and lodging and, in turn, to a significant drop in yields. In 2022, the overall yield was higher, with a wide distribution of high-yielding areas in the west, recovery of yields in some medium-yielding areas, and a smaller extent of low-yielding areas, which indicates that climatic conditions were more suitable in that year. In 2023, yield increased further, marked by an expansion of the high-yielding area in the west and a notable improvement in the central region. This may be attributed to warmer climatic conditions and optimised farm management practices, such as improved irrigation regulation and fertilisation strategies, which enhanced the growing environment for the crop. Conversely, yield declined in 2024 as the low-yielding areas in the west and east expanded, possibly because extreme weather conditions (e.g., drought or excessive precipitation) hindered the growth of winter wheat.
County-scale yield estimates for 2021, 2022 and 2023 were calculated from the regional-scale yield estimates of the MultiScaleWheatNet model using the administrative map of the study area. County-scale yield data from official statistics (Shaanxi Provincial Bureau of Statistics) were collected to validate the accuracy of the model’s county-scale yield estimates. The scatter plot of the county-scale winter wheat yields estimated by the MultiScaleWheatNet model against the official county-scale yield statistics is shown in Figure 6. The points lie roughly along the 1:1 line, and the model’s tendency to overestimate yields is more pronounced in low-yielding counties. In all three years, R² was greater than 0.35, RMSE was less than 0.73 t·ha⁻¹, and nRMSE was less than 20.4%. Overall, the MultiScaleWheatNet model performed reasonably well in estimating county-scale winter wheat yields.
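The step from pixel-level estimates to county-scale values can be illustrated with a simple zonal average. The following Python sketch is an assumption about how such an aggregation might be done, not the procedure used in this study: the grids, the county identifiers and the choice of the arithmetic mean over wheat pixels are all hypothetical.

```python
from collections import defaultdict


def county_mean_yield(yield_grid, county_grid):
    """Average pixel-level yield estimates within each county.

    yield_grid and county_grid are equally shaped 2-D lists. county_grid
    holds one county identifier per pixel. Pixels whose yield is None
    (for example, non-wheat pixels) are skipped.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for yield_row, county_row in zip(yield_grid, county_grid):
        for pixel_yield, county in zip(yield_row, county_row):
            if pixel_yield is None:
                continue
            totals[county] += pixel_yield
            counts[county] += 1
    return {county: totals[county] / counts[county] for county in totals}


# Hypothetical 2 x 3 yield map (t/ha) and matching county mask.
yield_grid = [[8.0, 9.0, None],
              [6.0, 5.0, 7.0]]
county_grid = [["A", "A", "B"],
               ["A", "B", "B"]]

county_yields = county_mean_yield(yield_grid, county_grid)
print(county_yields)
```

The resulting county values can then be paired with the official statistics and compared using the same R², RMSE and nRMSE measures as at the plot level.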
3.3. Results of Winter Wheat Yield Estimation Feature Importance
The interpretability of the MultiScaleWheatNet model was analysed on the basis of its wheat yield estimates. Feature contributions were determined by calculating the IG values for the input data of the different modalities. The five meteorological and soil variables and the three remote sensing indices that contributed the most at each growth stage were selected separately, and the average contribution of each category of input data at each growth stage was calculated (Figure 7). Across all growth stages combined, remote sensing indices had relatively high contributions, with roughly equal contributions from meteorological and soil variables. By growth stage, remote sensing indices had relatively high contributions throughout all four growth stages, especially at the heading-filling stage. Meteorological variables had relatively high contributions in the first two growth stages and contributions close to zero in the last two. Soil attributes had relatively low contributions in the first two growth stages but a high contribution in the last growth stage.
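Assuming that IG denotes integrated gradients, the attribution of one input feature is the difference between the input and a baseline, multiplied by the gradient of the model output averaged along the straight path between them. The following Python sketch illustrates the idea on a hypothetical three-feature function. The function `toy_model`, the all-zero baseline and the finite-difference gradient are illustrative assumptions and do not represent the MultiScaleWheatNet model or the authors' implementation.

```python
import math


def toy_model(x):
    """Hypothetical differentiable stand-in for a trained yield model."""
    z = 0.8 * x[0] + 0.3 * x[1] * x[2] - 0.5 * x[2]
    return math.tanh(z)


def gradient(f, x, eps=1e-6):
    """Central finite-difference gradient of f at point x."""
    grad = []
    for i in range(len(x)):
        upper = list(x)
        lower = list(x)
        upper[i] += eps
        lower[i] -= eps
        grad.append((f(upper) - f(lower)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of integrated gradients."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(gradient(f, point)):
            avg_grad[i] += g / steps
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


x = [1.0, 0.5, 2.0]          # hypothetical input features
baseline = [0.0, 0.0, 0.0]   # assumed all-zero reference input
attributions = integrated_gradients(toy_model, x, baseline)
print(attributions)
```

A useful check is the completeness property: the attributions should sum approximately to the difference between the model output at the input and at the baseline. Averaging attributions over the variables of one category and one growth stage would then give category-level summaries of the kind described above.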
As shown by the feature attribution results presented in Figure 8, the contribution of each variable differs significantly among growth stages. Such differences reflect the changing demand for environmental conditions at different stages.
The contributions of the remote sensing indices varied less among the growth stages. The contribution of NDVI was greater during the green-up stage, whereas the contribution of LAI was more pronounced in the remaining three growth stages. During the early growth phase, NDVI increases with vegetation coverage, reaching its peak between the jointing and heading stages, when crop biomass and chlorophyll content are at their maximum [31]. LAI reflects vegetation density. A higher LAI typically indicates greater photosynthetic capacity, which directly influences yield. During the mid-to-late stages of the growing season, changes in LAI affect crop growth and yield formation, making it a key determinant of yield [32]. SIWSI contributed comparatively little at every growth stage, possibly because no severe agricultural drought occurred in the Guanzhong Plain in 2024.
The contribution of meteorological variables varied with the growth stage of winter wheat. The contributions of surface air pressure (S_P), the sum of potential evaporation (P_E_S) and volumetric soil water of layer 1 (V_S_W_L1) were higher during the green-up stage. During this period, wheat is at a critical stage of vegetative growth and the plant begins to resume growth, which requires sufficient water to support root water absorption and nutrient uptake. The high contribution of V_S_W_L1 indicates that winter wheat requires sufficient soil moisture at this time to promote water uptake by the root system and nutrient transformation in the plant. P_E_S reflects the rate of water loss through transpiration and evaporation, and S_P exerts an indirect influence on transpiration and gas exchange. As growth progressed to the jointing stage, the dominant meteorological factors shifted to V_S_W_L1, soil temperature at level 1 (S_T_L1), and dewpoint temperature at 2 m (DT_2). The high contributions of V_S_W_L1 and S_T_L1 reflect a dual dependence on moisture and temperature. During this period, winter wheat accelerates growth and differentiates internodes, and the root system’s demand for soil water reaches a new height. A suitable soil temperature can promote root uptake and growth. This observation is corroborated by the pronounced contribution of DT_2. During the heading-filling stage, S_P, S_N_S_R_S and V_S_W_L1 had greater effects on yield. The heading-filling stage is the critical period for wheat yield formation. S_N_S_R_S is a direct driver of photosynthetic efficiency. Under adequate sunlight, photosynthetic efficiency increases and more photosynthetic products are transferred to the kernel, improving kernel weight and quality.
V_S_W_L1 becomes more important during this period, as sufficient soil moisture not only ensures nutrient uptake by wheat but also avoids ‘shrivelled grains’ caused by water stress during the filling period. During the milk maturity stage, the contributions of soil temperature (S_T_L1 and S_T_L2), wind speed (U_C_W_10 and V_C_W_10), and P_E_S were elevated. Milk maturity is the late stage of grain filling, when starch and protein accumulation in the grain is nearing its end and the main processes are dry matter accumulation and water loss. The high contributions of S_T_L1 and S_T_L2 indicate that a suitable soil temperature helps to maintain root activity and provides the final support for nutrient accumulation in the grain. The effects of U_C_W_10 and V_C_W_10 were further enhanced during the milk maturity stage, which may be closely related to transpiration and grain dehydration. The importance of P_E_S suggests that evapotranspiration plays an important role in dry matter accumulation and water loss from the grain.
Compared with meteorological factors, soil variables have a more stable effect on yield throughout the growing season. The green-up stage is an important transition from dormancy to vegetative growth of winter wheat, during which subsoil pH (S_PH_H2O), topsoil pH (T_PH_H2O), subsoil cation exchange capacity (S_CEC_SOIL) and topsoil cation exchange capacity (T_CEC_SOIL) show higher contributions. Soil pH determines the availability of nutrients in the soil, and an appropriate pH can ensure rapid recovery of plant growth. Cation exchange capacity reflects the ability of the soil to hold and release nutrients. A soil with high cation exchange capacity can prevent nutrient loss and ensure nutrient uptake by the root system and plant growth during the green-up stage. The jointing stage is the transition from vegetative to reproductive growth in winter wheat, and the high contributions of topsoil organic carbon (T_OC), subsoil organic carbon (S_OC) and topsoil clay (T_CLAY) during this stage underscore the significance of organic carbon in the accumulation of grain material. Soil organic carbon promotes root uptake of elements such as nitrogen, phosphorus and potassium by increasing soil microbial activity, thus improving photosynthetic efficiency and enhancing the synthesis and transport of photosynthetic products. Soils with high clay content retain moisture and provide a stable supply of water and nutrients to the root system, helping the plant to accumulate dry matter during the jointing stage. The heading-filling stage is critical for the formation of wheat grain yield, and the substantial contributions of S_OC, S_CEC_SOIL and T_CEC_SOIL during this stage underscore the significance of organic carbon and cation exchange capacity in dry matter accumulation.
The milk maturity stage is the final stage of winter wheat grain maturation, and the contributions of the cation exchange capacity of the subsoil clay fraction (S_CEC_CLAY) and S_PH_H2O increased during this period, reflecting the close relationship between root vigour and nutrient translocation. Soils with high cation exchange capacity, such as clay soils, are particularly important during this phase because of their capacity to retain water, which averts premature root decline. The significance of soil pH during the milk maturity stage underscores the importance of a suitable pH for enhancing nutrient uptake by the root system and ensuring the accumulation of dry matter in the final stage.
In summary, the contribution of LAI was comparatively stable throughout the growth stages, particularly during the jointing, heading-filling and milk maturity stages of winter wheat. LAI reflects crop growth with a high degree of accuracy and thus serves as a key variable for crop health monitoring. The combined impact of meteorological factors followed a discernible temporal sequence, dominated initially by water availability and shifting to temperature and sunlight in the middle and late stages. The contribution of soil factors was dominated by soil pH and cation exchange capacity in the early and late stages and by organic carbon content in the middle stage. This dynamic contribution pattern of multimodal data indicates that the input features of the deep learning model should be optimised by combining the key variables of the different growth stages to enhance prediction accuracy and interpretability in winter wheat yield estimation.
4. Discussion
Guo [33] and Li [34] previously estimated winter wheat yield in the Guanzhong Plain using remote-sensing-only inputs (LAI, fraction of photosynthetically active radiation (FPAR), VTCI) and explained the yield estimates on the basis of calendar time (Guo used a transformer-based neural network, and Li used a CNN-LSTM joint model incorporating attention mechanisms). At the sample level, both studies reported R² less than 0.65 and RMSE greater than 0.49 t·ha⁻¹. In this study, the MultiScaleWheatNet model used wheat phenology as the temporal coordinate and fused remote sensing, meteorological, and soil data across heterogeneous spatial and temporal scales. In the field experiments, the MultiScaleWheatNet model achieved R² = 0.79 and RMSE = 0.17 t·ha⁻¹. This demonstrates that aligning feature learning with phenological stages and integrating cross-modal information can enhance yield estimation accuracy. Moreover, compared with these two studies, the interpretability advantage of this study arises from a model architecture that is explicitly phenology-aligned. Time series features are learned within four main growth stages, each stage produces a consistent representation, and the stage embeddings are concatenated in chronological order. A multilayer attention fusion subnetwork then captures cross-stage interactions and yields attributions resolved jointly by stage and by variable. These attributions map directly to crop physiology and distinguish early-season water dominance from mid- to late-season effects of temperature and radiation, as well as stage-dependent soil influences. Because interpretability is embedded in the architecture rather than added after the fact, the framework provides richer evidence, more precise and mechanistically coherent explanations, and better handling of temporal heterogeneity and missing data, achieving a more balanced trade-off among accuracy, interpretability, and operational deployment than the RS-only, calendar-time frameworks of the previous studies.
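The idea of using phenology rather than calendar time as the temporal coordinate can be illustrated with a small example. In the following Python sketch, time series with different temporal resolutions are each reduced to one value per growth stage, so that every modality yields a vector of the same length and stage order. This is a deliberately simplified stand-in: mean pooling replaces the learned stage encoders and attention fusion described above, and the stage windows and input series are hypothetical values that are not taken from this study.

```python
def stage_features(series, stage_bounds):
    """Mean-pool one (day_of_year, value) series within each growth stage.

    stage_bounds maps stage name -> (start_day, end_day), inclusive.
    Returns one value per stage in the order of stage_bounds, or None
    for a stage in which the series has no observation.
    """
    pooled = []
    for start, end in stage_bounds.values():
        values = [v for day, v in series if start <= day <= end]
        pooled.append(sum(values) / len(values) if values else None)
    return pooled


# Hypothetical stage windows in day of year (not taken from the study).
STAGES = {
    "green-up": (60, 90),
    "jointing": (91, 115),
    "heading-filling": (116, 145),
    "milk maturity": (146, 160),
}

# Hypothetical inputs with different temporal resolutions.
lai_8day = [(d, 1.0 + 0.02 * (d - 60)) for d in range(60, 161, 8)]
temp_daily = [(d, 5.0 + 0.1 * (d - 60)) for d in range(60, 161)]

# Each modality is reduced to four stage-aligned values, so the vectors
# line up even though the raw series differ in length.
aligned = {
    "LAI": stage_features(lai_8day, STAGES),
    "temperature": stage_features(temp_daily, STAGES),
}
print(aligned)
```

Because every modality ends up with one entry per stage, attributions computed on such inputs can be read directly by stage and by variable, which is the property the phenology-aligned design relies on.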
Although this study estimated winter wheat yield in the Guanzhong Plain accurately, some issues remain to be addressed. On the one hand, the quality of the input data is potentially inconsistent. The resolution of the multimodal data used to estimate yield is not uniform, and even after the KI method is applied to harmonise spatial resolution, there is no guarantee that all features exhibit consistent prominence across all agricultural areas. This inconsistency somewhat reduces the estimation accuracy of the proposed model and introduces potential errors into the results. On the other hand, the multimodal data are still not fully comprehensive. Crop management practices, such as irrigation, pesticide spraying and fertilisation, play a significant role in the growth of wheat, but because of limited data availability, it is not feasible to account precisely for all influencing factors [35]. In future studies, more thorough analysis and validation could identify substitute data or data relationships that represent similar factors.
In this study, the interpretability of the yield estimation model was explored in depth. The analyses show that different input variables have significantly different positive and negative effects on yield, i.e., some variables are positively correlated with yield, while others are negatively correlated. For a given variable, the contribution may also differ (positively or negatively) across crop growth stages (Figure 8). However, the current study did not systematically analyse these positive and negative effects; it focused only on the importance of the variables for the predicted outcomes and did not interpret the direction of their influence. This limitation may weaken the scientific soundness of the model interpretation, which in turn affects its practical value in agricultural production management. To further improve the interpretability of deep learning models in crop yield estimation, we believe that future research should incorporate crop growth mechanisms and modify the model structure accordingly. Specifically, the key physiological and ecological mechanisms of crop growth can be embedded into the deep learning model, so that the model better reflects the causal relationships between variables in feature attribution analysis. For example, a process-based modelling approach can be introduced to combine environmental factors such as temperature, precipitation, and photosynthetic radiation with the growth and development process of winter wheat, so as to optimise the importance weights of the features and ensure that the direction of each variable’s influence on yield is consistent with the crop growth mechanism. In addition, physics-informed deep learning can further enhance the interpretability of the model, increasing its potential for application in agricultural production management and climate change assessment.