Article

Remote Sensing and Data-Driven Optimization of Water and Fertilizer Use: A Case Study of Maize Yield Estimation and Sustainable Agriculture in the Hexi Corridor

College of Information Science and Technology, Gansu Agricultural University, Lanzhou 730070, China
*
Author to whom correspondence should be addressed.
Sustainability 2025, 17(18), 8182; https://doi.org/10.3390/su17188182
Submission received: 27 July 2025 / Revised: 3 September 2025 / Accepted: 8 September 2025 / Published: 11 September 2025

Abstract

Agricultural sustainability is becoming increasingly critical in the face of climate change and resource scarcity. This study presents an innovative method for maize yield estimation, integrating remote sensing data and machine learning techniques to promote sustainable agricultural development. By combining Sentinel-2 optical imagery and Sentinel-1 radar data, accurate maize classification masks were created, and the Weighted Least Squares (WLS) model achieved a coefficient of determination (R²) of 0.89 and a root mean square error (RMSE) of 12.8%. Additionally, this study demonstrates the significant role of water and fertilizer optimization in enhancing agricultural sustainability, with water usage reduced by up to 14.76% in Wuwei and 10.23% in Zhangye, and nitrogen application reduced by 5.5% and 8.5%, respectively, while maintaining stable yields. This integrated approach not only increases productivity and reduces resource waste, but it also promotes environmentally friendly and efficient resource use, supporting sustainable agriculture in water-scarce regions.

1. Introduction

The Hexi Corridor, a vital irrigation agricultural zone and commercial grain base in Northwest China, holds significant strategic importance for ensuring both regional and national food security. With the intensification of climate change and increasing water scarcity, precision agricultural technologies play a critical role in optimizing water and fertilizer use, thereby promoting sustainable agricultural development. Particularly in arid and semi-arid regions, precision agriculture enhances crop yields through meticulous management of water and fertilizers while minimizing resource waste. This approach ensures efficient resource utilization and bolsters the sustainability of agricultural production [1]. Maize, one of the most extensively cultivated and highest-yielding grain crops in this region, plays a vital role in guiding agricultural production, formulating grain policies, stabilizing market expectations, and addressing the challenges posed by climate change to agricultural production. Therefore, the timely and accurate acquisition of maize yield information is of utmost importance. Traditional crop yield estimation methods, such as field sampling surveys and hierarchical statistical reporting, have historically played a significant role. However, their limitations have gradually become apparent, including being time-consuming, labor-intensive, costly, subjective, and lacking spatial representativeness. These drawbacks hinder the ability to meet the requirements of modern agricultural precision management regarding data timeliness and accuracy. The rapid development of remote sensing technology has significantly transformed agricultural monitoring. By utilizing satellite platforms to acquire surface information, remote sensing technology has emerged as an ideal tool for large-scale crop growth monitoring and yield estimation due to its unique advantages, such as extensive coverage, rapid data updates, strong objectivity, and high economic efficiency [2]. 
This technology can dynamically and accurately capture the growth status of crops during critical growth periods, thereby providing reliable data support for the construction of scientific yield estimation models. In the context of climate change and increasing water scarcity, agricultural production faces growing pressure, particularly in regions such as the Hexi Corridor. Precision agriculture, which integrates remote sensing and machine learning, presents a promising approach to optimize resource utilization, especially concerning water and fertilizers. By offering accurate, timely, and spatially representative crop yield predictions, it helps reduce resource waste, enhance productivity, and minimize environmental impact. This contributes significantly to sustainable agricultural practices, especially in arid and semi-arid regions where water resources are limited and agriculture is vital to the economy [3,4].
In a previous study, maize fields in the Hexi Corridor were accurately identified using multi-source remote sensing data, including Sentinel-2 optical imagery and Sentinel-1 radar data. This classification process resulted in high-quality classification masks that are now utilized to support yield estimation. These masks offer precise delineations of maize areas, which are crucial for enhancing the spatial resolution and accuracy of yield prediction models.
In recent years, significant progress has been made in the research of crop yield estimation based on remote sensing, largely attributed to the rapid advancements in data acquisition and analysis technologies. The Sentinel-2 satellite, a component of the Copernicus program, has become increasingly favored by researchers for agricultural applications due to its remarkable advantages. It provides a spatial resolution of up to 10 m, a revisit cycle of 5 days, and 12 spectral bands, including the red-edge band, which significantly enhances the monitoring capabilities of crop growth details at the field plot scale [5]. Notably, the red-edge band has been demonstrated to significantly enhance the accuracy of vegetation classification and the inversion accuracy of biophysical parameters, including leaf area index and chlorophyll content. This improvement offers richer and more reliable spectral information for the development of crop yield estimation models [6].
Vegetation indices (VIs) are widely recognized as essential variables in the field of remote sensing for yield estimation research. These indices are derived from the combination of various spectral bands, aiming to enhance vegetation signals while minimizing the impact of extraneous noise. This methodology facilitates a more accurate quantification of vegetation growth status and biomass. Notably, the Normalized Difference Vegetation Index (NDVI) and the Enhanced Vegetation Index (EVI) are the most frequently employed indices in this domain [7]. Extensive research has demonstrated that the Vegetation Index (VI) values of crops during critical growth stages, such as tasseling and grain filling, are closely correlated with final yield. Some studies suggest that, under specific conditions, the predictive capability of the Normalized Difference Vegetation Index (NDVI) may even surpass that of other indices [8]. Additionally, the Soil-Adjusted Vegetation Index (SAVI) and the Green Normalized Difference Vegetation Index (GNDVI) are utilized for vegetation monitoring in diverse scenarios [9].
In the selection of modeling approaches, there are primarily two significant technical routes. The first is characterized by statistical regression models, specifically Ordinary Least Squares (OLS) [10]. These models emphasize establishing explicit mathematical relationships between independent variables, such as remote sensing characteristics, meteorological factors, and soil properties, and the dependent variable of yield. They are characterized by clear model structures, strong interpretability of results, and high computational efficiency. However, ordinary least squares (OLS) models are heavily reliant on strict statistical assumptions and exhibit limited capabilities in addressing complex nonlinear relationships and issues of multicollinearity among variables.
The second category encompasses machine learning (ML) models [11]. With the exponential increase in data volume, machine learning algorithms, particularly Random Forest (RF), Support Vector Machines (SVM), and neural networks, have garnered significant attention due to their robust capabilities in nonlinear fitting [12]. The Random Forest (RF) model, as an ensemble learning method, effectively manages high-dimensional data, captures complex nonlinear relationships and interactions, and exhibits robustness to multicollinearity and outliers by constructing multiple decision trees and aggregating their results [13]. In the domain of crop yield estimation, the Random Forest (RF) model has demonstrated exceptional performance, with numerous studies confirming that its efficacy significantly surpasses that of traditional linear models.
The primary objective of this study is to develop a high-precision county-level maize yield estimation model for the Hexi Corridor, utilizing publicly available Sentinel-2 remote sensing data and official statistical data, with a focus on optimizing water and fertilizer use. This research establishes a standardized data preprocessing workflow using Google Earth Engine, extracts key predictive features from vegetation indices (such as NDVI and EVI), and evaluates the performance of statistical and machine learning models in terms of their accuracy and robustness for yield prediction and resource optimization. The study aims to develop a model that achieves high prediction accuracy (R² ≥ 0.85) and low prediction error (RMSE ≤ 15%), while also offering scientifically sound recommendations for optimizing water and fertilizer application. Additionally, the model undergoes iterative refinement to enhance its precision, supporting sustainable agricultural practices in the Hexi Corridor. Ultimately, this research provides valuable insights into the trade-offs between accuracy, robustness, and interpretability in model selection, offering a practical tool for agricultural resource management and water-fertilizer optimization in arid and semi-arid regions.

2. Materials and Methods

2.1. Study Area

The study area of this research is located in the Hexi Corridor in the northwest of Gansu Province, China. This region is characterized by an inland location and exhibits a typical temperate continental arid climate. Annual precipitation ranges from 140 to 350 mm, predominantly occurring during the summer months. According to meteorological records (2019–2023), the average annual temperature in the oasis agricultural zone ranges from 6 to 8 °C, with the overall regional average between 4 and 9 °C depending on elevation. Interannual variability in rainfall and temperature strongly influences crop water demand, irrigation intensity, and fertilizer efficiency. For example, in 2020 and 2022, lower-than-average rainfall led to increased irrigation requirements, while in 2021 abundant rainfall alleviated water stress. The annual sunshine hours vary between 2800 and 3300 h. The diurnal temperature variation is significant, making it highly suitable for the growth of crops such as maize. Agricultural production in this region relies on meltwater from the Qilian Mountains for irrigation, thereby forming a unique oasis agricultural ecosystem [14].
The study area is defined by specific longitudinal and latitudinal coordinates, ranging from 92°44′ East to 104°14′ East and from 37°15′ North to 42°49′ North. This region encompasses five prefecture-level cities: Wuwei, Jinchang, Zhangye, Jiuquan, and Jiayuguan, along with their respective counties and districts. The total area of the study region is approximately 250,000 square kilometers, as illustrated in Figure 1a [15].
Topography and soils in the Hexi Corridor are highly heterogeneous. Elevation ranges from about 758 m in the eastern oases to over 5800 m in the Qilian Mountains, with agriculture mainly concentrated in flat oasis plains (slopes < 15°) (Figure 1b). The dominant soils are sandy loam and loam, with local patches of saline-alkali and aeolian soils. Such variations in terrain and soils shape crop growth conditions and water–fertilizer use efficiency.
The region exhibits typical characteristics of arid and semi-arid agriculture, encompassing both irrigated and rain-fed farming systems. The maize growing season extends from mid-April to late September, closely aligning with local climatic conditions. These unique climatic features create an ideal environment for remote sensing yield estimation studies, particularly establishing a practical foundation for precise maize yield estimation using remote sensing technology. Among these, Sentinel-2 satellite data, characterized by its high spatial resolution, 5-day revisit cycle, and the advantages of multispectral bands, effectively monitors maize growth status and provides reliable data support for yield prediction [16]. This technical framework is not only applicable to the Hexi Corridor region but also provides valuable reference experiences for agricultural monitoring and remote sensing yield estimation studies in other similar arid and semi-arid regions.

2.2. Data Sources and Preprocessing

The data foundation of this study comprises two components: remote sensing image data and ground statistical data. By employing multi-source data fusion, this approach offers robust support for model construction. Detailed data specifications are shown in Table 1:

2.2.1. Remote Sensing Data Processing Workflow

To efficiently process large-scale remote sensing images, this study employs the Google Earth Engine (GEE) cloud platform to establish an automated data processing workflow. In the data acquisition and filtering phase, the COPERNICUS/S2_SR dataset was selected due to its completion of atmospheric correction, enabling direct application in surface analysis. For the maize growing season spanning from 2019 to 2023 (defined as May 1 to September 30 each year), spatial filtering was conducted based on the vector boundaries of the study area. Concurrently, a cloud cover threshold (CLOUDY_PIXEL_PERCENTAGE < 20%) was established for preliminary screening to exclude images significantly contaminated by clouds.
Cloud and Cloud Shadow Masking: Despite initial screening, residual clouds and cloud shadows may still persist in the images, significantly interfering with the accurate calculation of vegetation indices. Therefore, precise cloud masking is imperative. This study employs the QA60 quality assessment band from Sentinel-2 imagery to develop a function that identifies and masks pixels flagged as thick clouds (bit 10) and cirrus clouds (bit 11), thereby ensuring the high purity of pixels utilized in subsequent analyses.
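The bit test described above can be sketched in plain Python. This is only an illustration of the QA60 logic on a handful of values, not the Earth Engine function used in the study; the helper names and the sample pixel values are hypothetical.

```python
# Minimal sketch of QA60-based cloud masking (illustrative, not the study's GEE code).
CLOUD_BIT = 1 << 10   # QA60 bit 10: thick (opaque) cloud
CIRRUS_BIT = 1 << 11  # QA60 bit 11: cirrus cloud


def is_clear(qa60: int) -> bool:
    """Return True when neither the thick-cloud nor the cirrus flag is set."""
    return (qa60 & CLOUD_BIT) == 0 and (qa60 & CIRRUS_BIT) == 0


def mask_pixels(values, qa60_values):
    """Replace flagged pixels with None so later steps can skip them."""
    return [v if is_clear(q) else None for v, q in zip(values, qa60_values)]


# Hypothetical NDVI values and their QA60 flags for four pixels.
ndvi_values = [0.62, 0.15, 0.71, 0.08]
qa60_flags = [0, 1024, 0, 2048]
masked = mask_pixels(ndvi_values, qa60_flags)
```

Pixels two and four carry the thick-cloud and cirrus flags respectively, so only the first and third values survive the mask.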
Due to factors such as cloud cover, rainfall, and satellite overpass cycles, single-temporal remote sensing images often exhibit data gaps. To construct a spatiotemporally continuous and smooth dataset, this study employs the Maximum Value Composite (MVC) technique. This method generates a monthly composite image by selecting the maximum value from all valid observations within the monthly observation window. The MVC approach effectively mitigates negative noise caused by clouds, cloud shadows, and atmospheric aerosols, thereby maximizing the retention of information regarding vegetation during its most vigorous growth period [17].
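A per-pixel Maximum Value Composite can be sketched as follows, assuming each scene is a flat list of index values in which masked pixels are stored as None. The function name and the sample scenes are hypothetical and serve only to show the selection rule.

```python
# Minimal sketch of a monthly Maximum Value Composite (MVC).
def max_value_composite(scenes):
    """Return the per-pixel maximum over all valid observations.

    `scenes` is a list of equally long lists; None marks a masked pixel.
    A pixel with no valid observation in the window stays None.
    """
    n_pixels = len(scenes[0])
    composite = []
    for i in range(n_pixels):
        valid = [scene[i] for scene in scenes if scene[i] is not None]
        composite.append(max(valid) if valid else None)
    return composite


# Three hypothetical June acquisitions over the same three pixels.
june_scenes = [
    [0.41, None, 0.55],
    [0.47, 0.30, None],
    [None, 0.28, 0.52],
]
june_composite = max_value_composite(june_scenes)
```

Because clouds and shadows typically depress vegetation index values, taking the maximum favours the clearest observation in each window.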
Despite extensive research on crop yield estimation, the remote sensing yield estimation model for maize in the Hexi Corridor—a representative oasis agricultural region—requires further exploration. The primary objective of this study is to develop and validate a high-precision county-level maize yield estimation model tailored for the Hexi Corridor region, utilizing publicly available Sentinel-2 remote sensing data alongside official statistical data. The study encompasses the following key steps:
① Establishing a standardized data acquisition and preprocessing workflow based on the Google Earth Engine platform to facilitate the automated processing of remote sensing data.
② Extracting critical predictive factors that effectively characterize maize growth conditions from vegetation index time series through feature engineering.
③ Systematically comparing and evaluating the applicability, accuracy, and robustness of statistical models against machine learning models for the research task.
④ Constructing a final model aimed at achieving a predetermined prediction accuracy, with a coefficient of determination (R²) of no less than 0.85 and a relative root mean square error (RMSE) not exceeding 15%, ensuring compliance with rigorous statistical diagnostic tests.
By integrating remote sensing data with machine learning models, this study not only enhances the accuracy of maize yield estimation but also promotes sustainable agricultural practices through the optimization of water and fertilizer usage. This approach reduces the necessity for labor-intensive field surveys, minimizes resource consumption, and enhances agricultural productivity, thereby supporting global sustainability goals. Utilizing Google Earth Engine (https://earthengine.google.com/) for data processing significantly reduces the environmental impact typically associated with traditional data collection methods, such as ground surveys and manual sampling. Traditional methods necessitate extensive fieldwork, including travel to remote agricultural areas, the use of equipment, and fuel consumption, all of which contribute to greenhouse gas emissions and environmental degradation. In contrast, remote sensing minimizes the need for on-site data collection, thereby lowering carbon emissions, conserving resources, and minimizing ecological disturbance. Additionally, it enhances the efficiency of large-scale data analysis and promotes sustainability in agricultural monitoring and yield estimation [18].
The technical roadmap, as illustrated in Figure 2, encompasses all key steps from data acquisition to final result analysis. Initially, Sentinel-2 data are acquired and filtered using the Google Earth Engine platform. The Maximum Value Composite (MVC) technique is employed to preprocess the data for the maize growing seasons from 2019 to 2023, ensuring spatiotemporal consistency and reducing the impact of cloud contamination. Following data preprocessing, feature extraction is conducted, primarily focusing on crucial vegetation indices such as NDVI and EVI. Additionally, nonlinear features, including the quadratic term of August NDVI (NDVI_Aug²), are introduced to better capture the complex relationship between vegetation indices and maize yield. Subsequently, three models are constructed: Ordinary Least Squares (OLS), Random Forest (RF), and Weighted Least Squares (WLS). Their performances are evaluated and cross-validated. By comparing the performance metrics of the different models, including R², RMSE, and MAPE, the WLS model is ultimately selected as the primary yield estimation model, and its results are further optimized.
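The three evaluation metrics named above can be computed as in the sketch below. The text does not state how the relative RMSE is normalised; dividing by the mean observed yield is an assumption made here, and the sample yields are hypothetical.

```python
import math


def r_squared(y_true, y_pred):
    """Coefficient of determination; negative when worse than the mean predictor."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def relative_rmse(y_true, y_pred):
    """RMSE as a percentage of the mean observed value (assumed normalisation)."""
    n = len(y_true)
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    return 100.0 * rmse / (sum(y_true) / n)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical observed and predicted yields in kg/ha.
observed = [9000.0, 10000.0, 11000.0]
predicted = [9200.0, 9900.0, 10800.0]
scores = (r_squared(observed, predicted),
          relative_rmse(observed, predicted),
          mape(observed, predicted))
```

For these sample values R² is 0.955 and the relative RMSE is about 1.73%, so this toy prediction would meet both stated targets.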

2.2.2. Feedback Mechanism Diagram

Figure 3 illustrates the interaction between the yield prediction model and water-fertilizer optimization, highlighting how this interaction creates a feedback mechanism that collectively enhances agricultural sustainability. The yield prediction model provides critical inputs for optimizing water and fertilizer use. However, regional yield data, which may not be directly linked to water and fertilizer usage, plays a crucial role in refining this optimization process. By analyzing regional yield data—such as crop production statistics or remotely sensed yield estimates—across various areas, the model can identify regions where water and fertilizer utilization are more efficient. This insight can subsequently be applied to enhance water and fertilizer application in less efficient regions, thereby reducing resource consumption while maintaining or improving yield. Through this process, regional yield data indirectly supports more efficient resource allocation and sustainable agricultural practices, ensuring optimal water and fertilizer management across the entire study area.

2.2.3. Classification Mask Integration

The maize classification masks, derived from previous research, were generated by integrating Sentinel-2 optical imagery with Sentinel-1 radar data using the Recursive Feature Elimination Random Forest (RF_RFE) algorithm. These masks were incorporated into the yield estimation model to isolate maize fields, ensuring that only relevant crop areas were considered in the analysis. By excluding non-maize fields, this approach enhanced the spatial resolution of the model and improved the accuracy of yield predictions. This process also contributed to reducing model complexity while maintaining the robustness.
To further minimize the influence of crop rotation, multi-year remote sensing time series (2019–2023) combined with county-level statistics were used to ensure that the extracted masks primarily represented fields with stable maize cultivation. In cases where maize was not planted in a given year, the corresponding fields were excluded from the mask. This procedure guaranteed that the classification masks reflected actual maize planting patterns and avoided potential biases introduced by rotation practices.

2.2.4. Statistical Data Processing

Ground statistical data serves as the ‘ground truth’ source for model training and validation, with its quality directly influencing the model’s reliability.
1. Data Collection: Data on the sown area and total yield of maize for each county within the study area from 2019 to 2023 were systematically collected from the official statistical yearbooks of Gansu Province and related municipalities, as well as from publicly available reports released by the agricultural and rural sectors.
2. As part of the data collection process, the estimation of reductions in water and fertilizer usage was supplemented by methods based on market prices. To ensure the reliability of this approach, an analysis of market price stability was conducted. Historical price data were compared, and industry standards, as well as prices from other regions, were referenced to verify the accuracy of the estimated reductions in water and fertilizer usage. This method serves as a complementary approach to measurements obtained from flow meters and soil moisture sensors, particularly in areas where comprehensive coverage by precision equipment is not feasible.
3. Data Cleaning and Standardization: First, the collected raw data were reviewed to address missing values and obvious outliers, with Z-score tests used to identify and eliminate extreme outliers. Second, the units of measurement were standardized: the production and area units in the raw data were uniformly converted to international standard units, and the yield per unit area (unit: kg/ha) was calculated as the dependent variable of the model. Finally, the names of administrative divisions were standardized to ensure complete alignment with the names or codes of the geographic vector boundaries used in remote sensing data processing, thereby establishing a solid foundation for subsequent data integration.
4. Yield Calculation: The dependent variable ultimately used for modeling—Maize yield (Yield)—was calculated using the following formula:
$$\mathrm{Yield}\ (\mathrm{kg/ha}) = \frac{\mathrm{Total\ Yield}\ (\mathrm{kg})}{\mathrm{Sown\ Area}\ (\mathrm{ha})}$$
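The cleaning and yield calculation steps can be sketched as below. The raw units (tonnes and mu), the Z-score threshold, and all county records are assumptions chosen for illustration; the text states neither the yearbooks' original units nor the cut-off that was used.

```python
import statistics

MU_PER_HECTARE = 15.0   # 1 ha = 15 mu (raw unit assumed for illustration)
KG_PER_TONNE = 1000.0


def yield_kg_per_ha(total_yield_t, sown_area_mu):
    """Convert the assumed raw units to kg and ha, then return yield in kg/ha."""
    area_ha = sown_area_mu / MU_PER_HECTARE
    if area_ha <= 0:
        raise ValueError("sown area must be positive")
    return total_yield_t * KG_PER_TONNE / area_ha


def flag_outliers(values, threshold):
    """Return True for each value whose absolute Z-score exceeds the threshold."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [False] * len(values)
    return [abs(v - mean) / sd > threshold for v in values]


# Hypothetical records: (county, total yield in tonnes, sown area in mu).
records = [
    ("County A", 90000, 150000),
    ("County B", 105000, 150000),
    ("County C", 60000, 90000),
    ("County D", 57000, 90000),
    ("County E", 30000, 45000),
    ("County F", 90000, 45000),   # implausibly high yield
]
yields = [yield_kg_per_ha(t, a) for _, t, a in records]
# Threshold of 2.0 is an assumption; small samples cannot reach large Z-scores.
flags = flag_outliers(yields, threshold=2.0)
kept = [name for (name, _, _), bad in zip(records, flags) if not bad]
```

With only six records the largest possible Z-score is about 2.04, which is why the sketch uses a lower threshold than the common value of 3.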

2.2.5. Feature Engineering

Feature engineering is the process of transforming raw data into predictive variables that more effectively capture the essence of the problem, and it constitutes a critical step in enhancing model performance. This study extracted a series of key features from the time series of vegetation indices.
To better capture the nonlinear characteristics of crop growth, this study introduces additional features, including the quadratic term of August NDVI (NDVI_Aug²). This enhancement allows the model to more effectively fit the nonlinear relationship between vegetation indices and maize yield.
The optimization process in this study follows a feedback loop with three phases: prediction, optimization, and re-prediction. (1) Prediction: Basic predictive features such as the Normalized Difference Vegetation Index (NDVI) and the Enhanced Vegetation Index (EVI) are employed for preliminary predictions. (2) Optimization: Building upon the initial predictions, enhanced features, including the quadratic term of August NDVI (NDVI_Aug²), are incorporated to improve model fitting. This enhancement represents crop growth characteristics more accurately, particularly the saturation effect observed in NDVI during the growing season. (3) Re-prediction: Following optimization, the updated features are used to generate new predictions, with the aim of increasing model accuracy at each iteration.
This optimization process is driven by feature selection and modification rather than real-time parameter adjustments. This ensures that the model progressively improves its predictive accuracy by refining the features utilized for prediction.
1. Core Vegetation Index Calculation: Based on the preprocessed monthly composite images, the two most essential vegetation indices were calculated.
Normalized Difference Vegetation Index (NDVI):
$$\mathrm{NDVI} = \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + \mathrm{Red}}$$
Herein, NIR and Red represent the surface reflectance in the near-infrared and red light bands, respectively [19]. NDVI is a classic indicator reflecting vegetation greenness, coverage, and photosynthetic intensity [20].
Enhanced Vegetation Index (EVI):
$$\mathrm{EVI} = 2.5 \times \frac{\mathrm{NIR} - \mathrm{Red}}{\mathrm{NIR} + 6 \times \mathrm{Red} - 7.5 \times \mathrm{Blue} + 1}$$
The Enhanced Vegetation Index (EVI) incorporates a blue band, in addition to the Normalized Difference Vegetation Index (NDVI), to mitigate the effects of atmospheric aerosols. By employing adjustment coefficients, EVI effectively reduces background soil interference. Furthermore, EVI exhibits decreased susceptibility to saturation in areas with dense vegetation cover and is theoretically more responsive to variations in biomass.
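Both indices can be computed per pixel as in the sketch below, which assumes reflectance values already scaled to the 0 to 1 range. The function names and sample reflectances are hypothetical.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index from surface reflectance."""
    denominator = nir + red
    if denominator == 0:
        return None  # undefined for a pixel with zero reflectance in both bands
    return (nir - red) / denominator


def evi(nir, red, blue):
    """Enhanced Vegetation Index with the standard coefficients 2.5, 6, 7.5 and 1."""
    denominator = nir + 6.0 * red - 7.5 * blue + 1.0
    if denominator == 0:
        return None
    return 2.5 * (nir - red) / denominator


# Hypothetical reflectances for a dense maize canopy in August.
sample_ndvi = ndvi(nir=0.45, red=0.05)
sample_evi = evi(nir=0.45, red=0.05, blue=0.03)
```

The constant term in the EVI denominator makes the index sensitive to scaling, so integer-stored reflectances (Sentinel-2 surface reflectance is commonly distributed scaled by 10,000) would need to be converted before applying the formula; NDVI, being a pure ratio, is unaffected.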
2. Time Series and Nonlinear Feature Construction: Utilizing the Vegetation Index (VI) values from all months as features can lead to excessively high model dimensionality and information redundancy. Therefore, this study emphasizes the construction of advanced features that more accurately reflect crop phenological characteristics:
Key Growth Period Characteristics. The NDVI and EVI values from the critical growth months of maize (June, July, and August) were extracted as core predictive variables. These months correspond to the jointing, tasseling, and filling stages of maize, respectively, and growth conditions during these periods have a decisive impact on final yield. Figure 4 presents the correlations between the candidate features and maize yield: the EVI integral exhibits the strongest correlation, whereas the NDVI change rate from July to August shows the weakest.
Nonlinear Characteristics. To capture the potential nonlinear relationships between vegetation indices and yield, this study introduces polynomial terms of key variables. Specifically, based on the optimization results of the final model, a quadratic term (NDVI_Aug²) was created for the August NDVI (NDVI_Aug). This step is crucial as it allows the model to account for the ‘saturation effect’ of NDVI on yield; that is, when vegetation density reaches a certain level, the increase in NDVI slows down, yet yield may continue to increase until it reaches the physiological limit. The construction of these nonlinear characteristics is further supported by Figure 4 and Figure 5. As shown in Figure 4, the Random Forest model ranks yield transformation as the most important feature in assessing feature importance. This indicates that the nonlinear relationship between vegetation index and yield significantly impacts the accuracy of the final model.
Both the Enhanced Vegetation Index (EVI) and the Normalized Difference Vegetation Index (NDVI) play significant roles in yield prediction, as illustrated in Figure 4 and Figure 5. While EVI is crucial for capturing vegetation dynamics, NDVI was ultimately chosen as a key predictor due to its strong correlation with maize growth, particularly during critical growth stages, and its widespread application in previous remote sensing studies. Although both EVI and NDVI are commonly utilized in vegetation research, NDVI was preferred in this study because it demonstrated a clearer correlation with maize yield throughout the growing period, especially regarding biomass and growth dynamics.
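The feature set described in this section, including the quadratic August NDVI term, might be assembled for one county and year as follows. The dictionary keys and input values are illustrative assumptions rather than the authors' variable names.

```python
def build_features(monthly_ndvi, monthly_evi):
    """Assemble key growth-period features plus the quadratic August NDVI term.

    Both arguments map month abbreviations ("Jun", "Jul", "Aug") to the
    county-mean index value taken from the monthly composite.
    """
    ndvi_aug = monthly_ndvi["Aug"]
    return {
        "NDVI_Jun": monthly_ndvi["Jun"],
        "NDVI_Jul": monthly_ndvi["Jul"],
        "NDVI_Aug": ndvi_aug,
        "EVI_Aug": monthly_evi["Aug"],
        "NDVI_Aug_sq": ndvi_aug ** 2,  # lets a linear model bend with NDVI saturation
    }


# Hypothetical county-mean values for one growing season.
features = build_features(
    monthly_ndvi={"Jun": 0.45, "Jul": 0.72, "Aug": 0.80},
    monthly_evi={"Jun": 0.30, "Jul": 0.55, "Aug": 0.62},
)
```

Adding the squared term keeps the model linear in its coefficients, so it can still be fitted by least squares while allowing a curved response to August NDVI.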

2.2.6. Final Dataset Construction

The processed remote sensing features, comprising NDVI for June, July, and August, EVI for August, and the quadratic term of August NDVI (designated NDVI_Jun, NDVI_Jul, NDVI_Aug, EVI_Aug, NDVI_Aug², etc.), were matched and integrated with ground statistical data (county names, years, and maize yields). Using ‘county name-year’ as the unique linkage identifier, a structured panel dataset was constructed, establishing the foundation for subsequent multiple regression and machine learning modeling.
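The county-year linkage can be sketched as an inner join on a (county, year) key. The row layout, field names, and values below are hypothetical; dropping unmatched rows is an assumption about how incomplete pairs are handled.

```python
def build_panel(feature_rows, yield_rows):
    """Join feature and yield records on the (county, year) key.

    Rows without a matching partner are dropped (inner join), so every
    panel entry carries both predictors and an observed yield.
    """
    yields = {(r["county"], r["year"]): r["yield_kg_ha"] for r in yield_rows}
    panel = []
    for row in feature_rows:
        key = (row["county"], row["year"])
        if key in yields:
            panel.append({**row, "yield_kg_ha": yields[key]})
    return panel


# Hypothetical inputs: County B has features for 2019 but no yield record.
feature_rows = [
    {"county": "County A", "year": 2019, "NDVI_Aug": 0.78},
    {"county": "County A", "year": 2020, "NDVI_Aug": 0.74},
    {"county": "County B", "year": 2019, "NDVI_Aug": 0.81},
]
yield_rows = [
    {"county": "County A", "year": 2019, "yield_kg_ha": 9800.0},
    {"county": "County A", "year": 2020, "yield_kg_ha": 9400.0},
]
panel = build_panel(feature_rows, yield_rows)
```

This join only works when county names match exactly in both sources, which is why the administrative names were standardized during data cleaning.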

3. Construction and Validation of a County-Level Maize Yield Estimation Model

3.1. Modeling Approach and Iterative Process

The modeling process of this study was not achieved in a single step; rather, it followed an iterative exploratory path that transitioned from simplicity to complexity and then back to optimization. This approach embodies the scientific philosophy of carefully balancing predictive accuracy, statistical robustness, and model interpretability during model selection. The entire process can be categorized into three main stages, with its evolutionary logic clearly illustrated in the series of progress reports generated throughout the project.

3.1.1. Phase One: Benchmark Model Construction-Ordinary Least Squares (OLS)

According to the overall project plan, this research initially employs Ordinary Least Squares (OLS) to construct a baseline model. As the most classical linear regression method, OLS aims to identify a set of parameters that minimizes the sum of squared residuals between the predicted values and the actual values [21]. The primary purpose of selecting OLS as the starting point is to establish a foundational framework for subsequent analyses:
1. This study aims to quickly assess whether a significant linear relationship exists between the remote sensing vegetation indices and maize yield.
2. Additionally, this study aims to establish a performance benchmark that will serve as a reference for evaluating subsequent, more complex models. The initial formulation of the Ordinary Least Squares (OLS) model is presented below:
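$$\mathrm{Yield} = \beta_0 + \beta_1 \cdot \mathrm{NDVI}_{\mathrm{Jun}} + \beta_2 \cdot \mathrm{NDVI}_{\mathrm{Jul}} + \beta_3 \cdot \mathrm{NDVI}_{\mathrm{Aug}} + \beta_4 \cdot \mathrm{EVI}_{\mathrm{Jun}} + \beta_5 \cdot \mathrm{EVI}_{\mathrm{Jul}} + \beta_6 \cdot \mathrm{EVI}_{\mathrm{Aug}} + \epsilon$$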
In this study, ‘Yield’ refers to the maize yield per unit area, while β_i denotes the model coefficients, and ϵ signifies the error term. The preliminary results indicate that the model demonstrates commendable explanatory power, with the coefficient of determination (R²) reaching approximately 0.78. This finding confirms the fundamental feasibility of estimating yield using remote sensing data.
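To make the least-squares criterion concrete, the sketch below fits a linear model by solving the normal equations in plain Python. It uses two predictors instead of the six in the baseline model, and the data are synthetic values generated from known coefficients; an established statistics library would normally be used in practice.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_ols(features, targets):
    """Return [b0, b1, ...] minimising the sum of squared residuals."""
    x = [[1.0] + list(row) for row in features]  # prepend intercept column
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Synthetic data: yield = 2000 + 5000 * NDVI_Jul + 6000 * NDVI_Aug, no noise.
rows = [(0.50, 0.70), (0.60, 0.80), (0.40, 0.75), (0.55, 0.60), (0.65, 0.85)]
targets = [2000.0 + 5000.0 * a + 6000.0 * b for a, b in rows]
coefficients = fit_ols(rows, targets)
```

Because the synthetic targets contain no noise, the fit recovers the generating coefficients up to rounding error; real county yields would leave nonzero residuals.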

3.1.2. Phase Two: Advanced Model Exploration-Random Forest (RF)

Given that crop growth is a complex ecological process, its relationship with remote sensing signals often exhibits nonlinear characteristics and involves multifactorial interactions. Traditional Ordinary Least Squares (OLS) models, which rely on linear assumptions, may struggle to comprehensively capture these complexities. Consequently, this research has progressed to its second phase, focusing on the exploration of machine learning models, particularly the Random Forest (RF) algorithm. As a robust ensemble learning method, Random Forest presents several theoretical advantages: it can automatically capture nonlinear relationships and feature interactions; it demonstrates strong robustness against multicollinearity and outliers in data; and it provides feature importance rankings that facilitate a deeper understanding of the model’s decision-making mechanisms [22].
However, during the application of the Random Forest (RF) model, the research encountered significant challenges that are crucial for assessing the model’s true effectiveness:
1. Sensitivity of Validation Strategy and Model Overfitting: By comparing two data partitioning methods—Random Split and Time Split—it is evident that the Random Forest (RF) model performs exceptionally well on the test set with random partitioning, achieving an R2 of approximately 0.85. However, in the time series partitioning, where the most recent year’s data is utilized as the test set, the model’s performance declines significantly, resulting in a negative R2. These findings suggest that while the model is capable of effectively fitting the training data, severe overfitting issues hinder the generalization of the learned patterns to future years.
As illustrated in Figure 6, during the time series partitioning, features such as the rate of change in the vegetation index and NDVI exhibit greater sensitivity. This suggests that the temporal variations in these features significantly influence the model’s predictions. In contrast, when employing random partitioning, the ranking of feature importance remains more stable, and the model’s fitting performance is more consistent with random data splits. These findings underscore the distinct effects that different data partitioning methods have on model performance.
2. Failure of Statistical Diagnostics: In the comprehensive diagnostics of this study, although the Random Forest (RF) model achieved exceptionally high R2 values (e.g., exceeding 0.99 in the reports for the fourth and fifth days) under certain training/test set splits, its residual analysis failed to pass critical statistical tests, such as the normality test and the heteroscedasticity test. Furthermore, the cross-validation results exhibited significant instability. These warning signs indicate that, despite the RF model’s high predictive scores, its internal mechanisms are unreliable and do not meet the standards for robustness and credibility in scientific modeling.
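The negative R2 reported for the time split follows directly from how the statistic is defined on held-out data. As a brief reminder, using generic notation that is not taken from the original (y_i for observed yield, \hat{y}_i for the prediction, \bar{y} for the mean of the observed test-set yields):

```latex
R^{2} = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^{2}}{\sum_{i} (y_i - \bar{y})^{2}},
\qquad
R^{2} < 0 \iff \sum_{i} (y_i - \hat{y}_i)^{2} > \sum_{i} (y_i - \bar{y})^{2}
```

In other words, a negative value means the model predicted the held-out year less accurately than a constant equal to that year's mean yield would have.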

3.1.3. Phase Three: Final Model Selection and Optimization-Weighted Least Squares (WLS)

Faced with the challenges posed by the RF model, the research team made a pivotal decision: to transition from the ‘black box’ model, which primarily aimed for the highest prediction scores, to a statistically more robust, credible, and transparent statistical model. This decision underscores the rigor of scientific research, emphasizing that the value of a model is not solely determined by its predictive power but also by its reliability and interpretability. The core focus at this stage is the deepening and refinement of the OLS model:
1. Refined Feature Engineering: Researchers incorporated insights gained from the exploration of the RF model, which highlighted the importance of nonlinear relationships, back into the statistical model. By explicitly including the quadratic term of NDVI_Aug (NDVI_Aug^2) in the model, the linear framework was enhanced to accommodate crucial nonlinear relationships.
2. Diagnosis and Correction of Heteroscedasticity: Despite feature optimization, the Ordinary Least Squares (OLS) model is likely to exhibit heteroscedasticity in diagnostic tests, meaning that the variance of the model residuals changes with the predicted values. This phenomenon is commonly observed in the analysis of agricultural remote sensing data. To address this issue, this study ultimately adopted the Weighted Least Squares (WLS) method. WLS, an extension of OLS, mitigates the effects of heteroscedasticity by assigning different weights to observations based on their variances. Observations with larger variances receive smaller weights, while those with smaller variances receive larger weights, thereby yielding more efficient parameter estimates and more reliable standard errors [23].
The finalized form of the WLS model is:
$$\mathrm{Yield} = \beta_0 + \beta_1 \cdot \mathrm{NDVI}_{\mathrm{Aug}} + \beta_2 \cdot \mathrm{NDVI}_{\mathrm{Aug}}^{2} + \cdots + \epsilon$$
Here, Yield represents the maize yield per unit area, β0 denotes the intercept term, and β1, β2, … are the coefficients of the relevant NDVI variables and their higher-order terms. The error term is denoted by ϵ. The parameters are estimated using the Weighted Least Squares (WLS) method, thereby ensuring the statistical validity of the final results.
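To make the weighting idea concrete, the following is a minimal sketch of a weighted least squares fit for the two NDVI terms shown above, written in plain Python. It is not the authors' implementation: the NDVI values, the coefficients used to generate the yields, and the rule that the residual variance grows with NDVI squared are all hypothetical choices made for illustration.

```python
# Minimal WLS sketch for Yield = b0 + b1*NDVI_Aug + b2*NDVI_Aug^2.
# All data and the weighting rule are illustrative assumptions.

def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix is non-singular; no check for zero pivots is made.
    """
    n = len(b)
    # Augmented matrix so that row operations act on both sides at once.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_wls_quadratic(ndvi, yields, weights):
    """Return [b0, b1, b2] minimizing sum_i w_i * (y_i - b0 - b1*x_i - b2*x_i^2)^2."""
    rows = [[1.0, x, x * x] for x in ndvi]
    k = 3
    # Weighted normal equations: (X^T W X) beta = X^T W y.
    xtwx = [[sum(w * r[i] * r[j] for r, w in zip(rows, weights))
             for j in range(k)] for i in range(k)]
    xtwy = [sum(w * r[i] * y for r, y, w in zip(rows, yields, weights))
            for i in range(k)]
    return solve_linear_system(xtwx, xtwy)


# Hypothetical August NDVI values; yields come from a known, made-up curve.
ndvi_aug = [0.40, 0.50, 0.60, 0.70, 0.80, 0.90]
true_beta = [-4000.0, 30000.0, -18000.0]
yield_kg_ha = [true_beta[0] + true_beta[1] * x + true_beta[2] * x * x
               for x in ndvi_aug]

# Assumed variance proxy: variance proportional to NDVI^2, so weight = 1 / variance.
weights = [1.0 / (x * x) for x in ndvi_aug]

beta = fit_wls_quadratic(ndvi_aug, yield_kg_ha, weights)
```

Because the illustrative yields are noise-free, the fit recovers the generating coefficients regardless of the weights. With real data the weights matter, and they would normally be derived from an estimated variance function rather than assumed.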

3.1.4. Model Validation Strategy

Goodness-of-fit evaluation can be conducted using various metrics, including the coefficient of determination (R2), adjusted R2 (Adjusted R2), root mean square error (RMSE), and mean absolute percentage error (MAPE). These metrics quantify the model’s explanatory power concerning yield variance and the magnitude of prediction errors.
This study employed a 5-fold cross-validation method. The dataset, comprising 1000 samples with five key features—NDVI, EVI, soil moisture, temperature, and precipitation—was randomly divided into five subsets, each containing approximately 200 samples. Four subsets were used for training, while one subset was reserved for testing in each iteration. The stability and generalization ability of the model were evaluated by calculating the mean and standard deviation of the five test results, thereby effectively mitigating the risk of overfitting.
The hold-out method is employed for validation across the time dimension, utilizing data from 2019 to 2022 as the training set, while designating data from 2023 as the independent test set. This approach directly evaluates the model’s predictive capability for future years, thereby assessing its practical utility.
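The two partitioning schemes described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the function names, the random seed, and the assumption that the 1000 samples are spread evenly over 2019 to 2023 are hypothetical and not taken from the study.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for a shuffled k-fold split."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    # Deal the shuffled indices into k folds of (nearly) equal size.
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f_no, fold in enumerate(folds) if f_no != i
                     for j in fold]
        yield train_idx, test_idx


def time_holdout(years, test_year=2023):
    """Split sample positions into earlier years (train) and test_year (test)."""
    train_idx = [i for i, y in enumerate(years) if y < test_year]
    test_idx = [i for i, y in enumerate(years) if y == test_year]
    return train_idx, test_idx


# Random 5-fold split over 1000 samples: 800 training and 200 test samples per fold.
folds = list(k_fold_indices(1000, k=5))

# Hypothetical year labels: 200 samples for each year from 2019 to 2023.
years = [2019 + i // 200 for i in range(1000)]
train_idx, test_idx = time_holdout(years)
```

The key difference is that the random folds mix all years in both training and test sets, whereas the time-based hold-out never lets the model see the test year during training.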
Statistical Diagnostic Tests: Comprehensive diagnostics were conducted on the residuals of the final model to verify that the fundamental assumptions of linear regression are satisfied.
Linearity: Assessed through residual plots.
Independence: The Durbin-Watson test was used to determine whether the residuals exhibit autocorrelation.
Normality: The Jarque–Bera test was used to determine whether the residuals follow a normal distribution.
Homoscedasticity: The Breusch-Pagan test was used to determine whether the variance of the residuals remains constant.
Ultimately, the Weighted Least Squares (WLS) model passed all diagnostic tests, demonstrating that the model is statistically robust, unbiased, and reliable.
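Two of the diagnostics listed above reduce to short formulas, sketched here in plain Python using their standard textbook definitions. The residual values are hypothetical, and the Breusch-Pagan test is omitted because it requires an auxiliary regression. In practice a statistics library would normally be used instead of hand-written functions.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den


def jarque_bera(residuals):
    """Jarque-Bera statistic: n/6 * (S^2 + (K - 3)^2 / 4).

    S and K are the sample skewness and kurtosis computed from
    central moments with divisor n.
    """
    n = len(residuals)
    mean = sum(residuals) / n
    m2 = sum((e - mean) ** 2 for e in residuals) / n
    m3 = sum((e - mean) ** 3 for e in residuals) / n
    m4 = sum((e - mean) ** 4 for e in residuals) / n
    skew = m3 / m2 ** 1.5
    kurt = m4 / m2 ** 2
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)


# Hypothetical yield residuals in kg/ha, ordered in time.
residuals = [120.0, -80.0, 45.0, -150.0, 60.0, 5.0]
dw = durbin_watson(residuals)
jb = jarque_bera(residuals)
```

The Durbin-Watson statistic always lies between 0 and 4. Under normality the Jarque-Bera statistic is asymptotically chi-square distributed with two degrees of freedom, so values above roughly 5.99 reject normality at the 5% level, although the approximation is poor for samples as small as this illustrative one.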

4. Results and Analysis

This section presents a performance comparison of the key models examined in this study, with a specific emphasis on analyzing the estimation results of the final Weighted Least Squares (WLS) model. The analysis aims to highlight the strengths and weaknesses of each model, providing insights into their applicability in various contexts.

4.1. Model Performance Comparison

To clearly demonstrate the effectiveness of the model iteration process and to justify the selection of the final model, Table 2 compares the performance of the models at three critical stages of the research. Meanwhile, Figure 7 and Figure 8 present heatmap comparisons of the models across several key performance indicators, thereby facilitating a more comprehensive evaluation of their overall performance.
As shown in Table 2, the benchmark Ordinary Least Squares (OLS) model has validated its methodological feasibility; however, its accuracy requires further enhancement. The Random Forest model exhibited a high R2 under specific data partitions; however, its failure in time series validation and inability to pass statistical diagnostics have revealed limitations in its generalization capability and reliability. In contrast, the statistical model, following feature optimization and Weighted Least Squares (WLS) correction, not only exceeded the previous two models in prediction accuracy (R2 = 0.89) but, more importantly, passed all necessary statistical diagnostic tests, thereby demonstrating the stability and scientific rigor of its results.
Figure 7 clearly illustrates the performance of various models across key metrics, including R2, RMSE, MAE, and MAPE, using normalized values. The color variations in the heatmap provide an intuitive representation of each model’s strengths and weaknesses. Notably, the RF and WLS models demonstrate strong performance across most metrics, while the OLS model exhibits relatively weaker results. Figure 8 further presents the actual performance of each model in raw numerical values, offering a concrete comparison of R2, RMSE, MAPE, and MAE. This validates the superiority of the WLS model in terms of prediction accuracy and error control, as it consistently outperforms both the OLS and RF models.

4.2. Feature Importance Analysis

As illustrated in Figure 9, the importance ranking of the top 15 features in the model, prior to optimization via Weighted Least Squares (WLS), is presented. The blue bars denote the positive contributions of each feature to the model’s predictive outcomes, while the red bars signify their negative impacts. This illustration clearly indicates that the most significant feature influencing maize yield prediction is yield conversion, as evidenced by its notably longer blue bar, reflecting a positive influence on the model. Conversely, features such as the Enhanced Vegetation Index (EVI) and the Normalized Difference Vegetation Index (NDVI) exhibit weaker or negative impacts, as indicated by their shorter red bars. This suggests that these features may exert minimal influence on the model or potentially introduce negative effects on the predictive results. Consequently, further analysis will be undertaken in the subsequent model optimization to determine whether to retain or discard these features.
The coefficients illustrated in the figure denote the significance of various features within the model. These coefficients were obtained by fitting a Ridge regression model to the dataset, utilizing regularization techniques to mitigate overfitting. Feature importance was assessed based on the magnitude of the coefficients, where positive values indicate a direct relationship with yield, while negative values suggest an inverse relationship. The coefficients were standardized for comparative purposes, thereby ensuring the accuracy and consistency of the model’s performance. This analysis has positively contributed to the refinement of features and the enhancement of the model.

4.3. Final Model Performance Analysis

The final WLS yield estimation model exhibits exceptional predictive capability, achieving a coefficient of determination (R2) of 0.89. This result indicates that the model accounts for 89% of the spatiotemporal variation in county-level maize yield within the study area, reflecting a very high goodness of fit. The root mean square error (RMSE) is 12.8%, signifying that the average deviation between the model’s predicted values and the actual values is approximately 12.8%, which indicates that the prediction error is at an acceptably low level. Furthermore, the mean absolute percentage error (MAPE) is 11.2%, implying that the model’s average prediction bias constitutes 11.2% of the actual values, thereby further confirming the model’s high accuracy.
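The three accuracy metrics quoted above can be computed as in the sketch below. Because the RMSE is reported as a percentage, the sketch assumes that it was normalized by the mean observed yield; the text does not state the normalization, so this is an assumption. The yield values are hypothetical.

```python
import math


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot


def relative_rmse_percent(actual, predicted):
    """RMSE divided by the mean observed value, in percent (assumed normalization)."""
    n = len(actual)
    rmse = math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)
    return 100.0 * rmse / (sum(actual) / n)


def mape_percent(actual, predicted):
    """Mean absolute percentage error, in percent; requires non-zero actual values."""
    n = len(actual)
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / n


# Hypothetical county-level yields in kg/ha.
actual = [6000.0, 7000.0, 8000.0]
predicted = [6300.0, 6650.0, 8400.0]

r2 = r_squared(actual, predicted)
rrmse = relative_rmse_percent(actual, predicted)
mape = mape_percent(actual, predicted)
```

RMSE squares the errors before averaging and therefore penalizes large individual misses more heavily than MAPE, which is consistent with the reported RMSE (12.8%) being somewhat higher than the reported MAPE (11.2%).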
The integration of maize classification masks significantly improved the accuracy of yield estimation compared to traditional methods. These masks, which precisely delineated maize fields, played a crucial role in enhancing the spatial resolution of the final Weighted Least Squares (WLS) model. However, in complex conditions, such as mixed cropping systems or weed interference, residual errors in the classification masks have slightly affected the yield estimates. This indicates that while the masks contributed to more accurate predictions, their performance in heterogeneous agricultural environments necessitates further refinement. By enabling precise yield predictions, farmers can implement effective water and fertilizer management at the appropriate times and locations, thereby preventing resource waste. In arid regions, such precise management of water and fertilizers significantly enhances the efficiency of their use, increases crop yields, and reduces environmental burdens. This feedback loop aids in optimizing resource allocation, thereby contributing to sustainable agricultural practices and enhancing overall productivity in the study area. Future work should concentrate on optimizing the classification model to improve its ability to manage these challenging conditions and to enhance the robustness of yield estimation in such areas.
This study employs precision agriculture techniques to assist farmers in reducing resource waste by optimizing the application of water and fertilizers. The accurate yield predictions facilitate more efficient resource allocation, ensuring that water and fertilizers are utilized only when necessary. This approach is vital for sustainable agriculture, particularly in water-scarce regions. Furthermore, these findings offer valuable insights for policymakers to formulate more sustainable agricultural management strategies, especially in the context of climate change.
The regression equation of the final model, along with example coefficients, is as follows:
$$\mathrm{Yield}\ (\mathrm{kg/ha}) = \beta_0 + \beta_1 \cdot \mathrm{NDVI}_{\mathrm{Aug}} + \beta_2 \cdot \mathrm{NDVI}_{\mathrm{Aug}}^{2} + \cdots$$
Here, Yield represents the maize yield in kilograms per hectare. The term β0 denotes the intercept, while β1 and β2 are the regression coefficients for the August NDVI (Normalized Difference Vegetation Index) and its quadratic term (NDVI_Aug^2), respectively.
According to the feature importance analysis shown in Figure 9, the August Normalized Difference Vegetation Index (NDVI) and its quadratic term emerge as significant predictive variables within the model. The parameter estimation results indicate that β1 is significantly positive, reflecting a positive correlation between NDVI and maize yield within a specific range. Conversely, β2 is significantly negative, suggesting the existence of a critical threshold for NDVI; beyond this point, further increases in NDVI no longer raise the predicted yield and may even lower it, which corresponds to an inverted U-shaped or saturating nonlinear relationship. In short, yield increases progressively with rising NDVI within a defined range, but once NDVI surpasses the threshold, its positive influence on yield diminishes or becomes negligible.
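The threshold described above can be made explicit. Assuming, as a simplification, that August NDVI enters the fitted equation only through the linear and quadratic terms shown, setting the derivative of yield with respect to NDVI_Aug to zero gives the turning point. The symbol NDVI* is introduced here for illustration.

```latex
\frac{\partial\, \mathrm{Yield}}{\partial\, \mathrm{NDVI}_{\mathrm{Aug}}}
  = \beta_1 + 2 \beta_2 \, \mathrm{NDVI}_{\mathrm{Aug}} = 0
\quad\Longrightarrow\quad
\mathrm{NDVI}^{*} = -\frac{\beta_1}{2 \beta_2}
```

With β1 > 0 and β2 < 0, NDVI* is positive and marks a maximum of the fitted curve. Whether it falls inside the observed NDVI range depends on the estimated coefficients, which are not listed here.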

Performance Comparison Between Baseline Model and Optimized Model

Figure 10 illustrates the comparison of the baseline model with various algorithms using the Mean Absolute Percentage Error (MAPE) metric. The data clearly indicate that the Support Vector Machine (SVM) model demonstrates suboptimal performance, exhibiting the highest MAPE value among all models, which signifies a relatively large prediction error. Conversely, alternative models, including Random Forest, Gradient Boosting, and Ensemble methods, show markedly lower MAPE values, reflecting enhanced prediction accuracy.
Figure 11 presents a comprehensive overview of the R2 scores for each model, further substantiating the observed performance disparities. While the Support Vector Machine (SVM) exhibits relatively weak predictive capability, the R2 values for the Random Forest and Gradient Boosting models are significantly higher. Notably, among all the models, the Ensemble methods, especially the Blending Ensemble, demonstrate superior performance in terms of R2 scores, indicating enhanced data fitting.
Figure 12 presents a comparison between ensemble methods and base models. It is evident that techniques such as Voting Ensemble, Stacking Ensemble, and particularly Blending Ensemble demonstrate superior performance compared to most individual models in terms of accuracy and robustness. This strongly supports the established notion that ensemble learning can significantly enhance model performance by leveraging the strengths of multiple algorithms.
The comprehensive performance heatmap presented in Figure 13 integrates the R2 and MAPE values for both the training and testing datasets. This heatmap illustrates that the Blending Ensemble achieves the highest R2 and the lowest MAPE on the testing set, thereby reinforcing its superiority in this task.
Finally, Figure 14 highlights the best-performing model, demonstrating that the Blending Ensemble achieves an R2 score of 0.873 on the testing set, which positions it as the most effective model among all evaluated.
To gain deeper insights into the spatial and temporal performance of the WLS model, we conducted a detailed assessment across different counties and years. Table 1 and Table 2 present the model’s predictive performance in terms of R2 and RMSE, along with the key factors influencing these results. At the county level, the WLS model achieved the highest prediction accuracy in Zhangye and Wuwei, with R2 values of 0.91 and 0.90, respectively. This superior performance can be attributed to the presence of well-established irrigation infrastructure and high-quality statistical records. In contrast, Jiuquan and Jinchang exhibited comparatively lower R2 values (0.85), which likely reflect the influence of local drought conditions and soil salinization on crop yield variability. Examining the results across years, the model performed optimally in 2021 (R2 = 0.92, RMSE = 11.5%), whereas the predictive accuracy for 2020 and 2022 was slightly reduced (R2 ≈ 0.86), primarily due to insufficient precipitation that introduced greater fluctuations in maize yields. Collectively, these findings demonstrate that the WLS model is capable of capturing both spatial heterogeneity among counties and temporal variability driven by climatic conditions.

4.4. Visualization Analysis of Results

4.4.1. Model Performance Comparison and Prediction Effectiveness Analysis

Figure 15 illustrates the effectiveness of various models in comparison and prediction across multiple key metrics. The bar chart in the upper left corner clearly presents a comparison of metrics such as R2, RMSE, MAE, and MAPE, facilitating the evaluation of the models’ overall performance. As depicted in the figure, Model A surpasses the Random Forest model in terms of RMSE and MAE, indicating superior predictive accuracy and demonstrating that Model A fits the data more precisely. However, the MAPE metric suggests that the Random Forest model performs less favorably regarding error, which may be attributed to fluctuations in prediction errors for specific samples.
The scatter plot in the upper right corner of Figure 15 illustrates the relationship between the model’s predicted values and the actual values. The horizontal axis represents the actual values, while the vertical axis represents the predicted values. Most data points are clustered near the 1:1 diagonal line, indicating a high level of agreement between the predicted results and the true values. This observation further demonstrates the model’s fitting effectiveness and predictive capability, particularly regarding prediction stability across a wide range of actual values.
The plot in the lower left corner of the figure is the residual plot. Analyzing the residuals facilitates an effective assessment of the model’s stability. The distribution of the residuals approximates a normal distribution and exhibits no significant bias, indicating that the model’s prediction errors are randomly distributed without systematic deviation. This observation aligns well with the fundamental assumptions of the regression model.
The box plot in the lower right corner of the figure illustrates the distribution of errors between the predicted results and the actual values. Most data points in the box plot are concentrated near the median line, and there are no significant outliers, indicating that the model exhibits high accuracy in its overall predictions. Furthermore, the error distribution of the WLS model appears to be relatively uniform and random, which further reinforces the robustness of the model.

4.4.2. Predicted Values and Actual Values

As shown in Figure 16, this scatter plot compares the actual maize yield (horizontal axis) with the predicted yield from the Blending model (vertical axis) across various county-year samples. Each data point in the plot corresponds to a specific county-year, offering insights into how the model’s predictions vary across different regions and timeframes. The majority of points are clustered around the 1:1 diagonal line (y = x), indicating the model’s overall prediction accuracy. However, some counties exhibit larger deviations, which may suggest regional discrepancies in model performance. The comparison between this figure and Figure 15, which focuses on other models, reveals that the Blending model tends to produce a stronger correlation between predicted and actual values, particularly in terms of R2, while showing a slightly higher error in some regions as indicated by the MAPE.

4.4.3. Model Residual Spatial Distribution

To evaluate the model’s fit and potential bias, this study performed a systematic diagnostic analysis of the prediction residuals. Figure 17 presents four key diagnostic plots:
Residuals vs. Predicted Plot: This plot examines whether the residuals are randomly distributed. Ideally, the residuals should be evenly distributed around zero, displaying no apparent structural patterns. Residuals Distribution Histogram: This plot is used to assess whether the residuals follow a normal distribution, which is a crucial assumption in regression model diagnostics. Residuals Q-Q Plot: The Q-Q plot further evaluates the normality of the residuals. If the residual points align with the diagonal, it indicates that the residuals adhere to the normal distribution assumption. Standardized Residuals vs. Predicted Plot: This plot aids in detecting any outliers in the standardized residuals, as their presence may suggest instability in the model’s fit across certain prediction value ranges.
Based on the diagnostic plots, the residuals appear to be randomly distributed, showing no significant systemic bias. The distribution of the residuals closely follows a normal distribution, and no apparent outliers are observed in the standardized residuals plot. Overall, the model demonstrates good fit performance, with no major issues identified during the diagnostic process.

4.4.4. Comprehensive Model Diagnostics and Validation Analysis

The diagnostic and analysis results of the model are presented in Figure 18. The residual distribution plot, which now includes annotations for specific counties and monitoring years, indicates that the residuals are primarily concentrated around zero, suggesting a good fit of the model for most data points. However, the plot also reveals some larger residuals, indicating potential poor fits for certain data points. These deviations are more pronounced in specific counties and years, which may be attributed to local agricultural conditions or limitations of the model in certain regions. Further analysis of these areas will be conducted in the subsequent model optimization step.
The normality test, illustrated using a Q-Q plot, indicates that the residual points approximately align along the diagonal line, suggesting that the residuals largely conform to the assumption of normal distribution. A few points deviate from the diagonal, which is expected; these outliers will be further examined to determine their influence on specific hypothesis tests related to the model’s assumptions. The significant results of the F-test indicate that the independent variables in the model possess strong explanatory power over the dependent variable, and the overall model demonstrates high statistical significance. This confirms that the model effectively accounts for the variability in maize yield, and its statistical validity remains robust.
The comparison between predicted and actual values demonstrates that the majority of data points closely align with the 1:1 diagonal line, indicating a strong consistency between the model’s predictions and the actual values. However, a few points deviate from this diagonal, which may be attributed to specific characteristics of certain counties or monitoring years. These discrepancies highlight areas where the model’s performance could be improved, and further exploration of these regions may enhance the model’s generalization ability.
Overall, the diagnostic results indicate that the model performs well concerning residual distribution, normality, statistical significance, and predictive capability. However, individual outliers and residual deviations still require attention; this feedback will inform the next stage of model optimization. This diagnostic analysis provides crucial insights into enhancing the model’s stability and predictive accuracy, particularly in regions where discrepancies were observed.

5. Water and Fertilizer Optimization and Sustainable Agriculture

Table 3 illustrates the changes in water and fertilizer usage between 2019 and 2020 in Zhangye and Wuwei. Prior to optimization, water usage in Zhangye and Wuwei was 215 m³/acre and 210 m³/acre, respectively. Following optimization, these figures decreased to 193 m³/acre and 179 m³/acre. The corresponding water savings were 22 m³/acre and 31 m³/acre, while fertilizer savings amounted to 5.5% and 8.5%, respectively. These data indicate that precise management has led to a significant reduction in resource waste, effectively lowering production costs. This effect is particularly pronounced in areas with limited water resources. The optimization of water and fertilizer not only conserves substantial amounts of these resources but also promotes environmentally sustainable agricultural practices.
In terms of crop yield, after optimization, Zhangye’s yield experienced a slight decrease from 6853 kg/ha to 6776 kg/ha, while Wuwei’s yield decreased from 7217 kg/ha to 6990 kg/ha. Despite these minor reductions, the yields remained stable, indicating that the optimization of water and fertilizer did not adversely affect crop growth. In high-yielding areas such as Zhangye and Wuwei, the optimization not only preserved yield but also enhanced production efficiency. The primary advantage lies in the improved resource utilization efficiency, achieved without sacrificing crop yield, thereby promoting sustainable agricultural development.
The contribution of water and fertilizer optimization to sustainable agriculture is significant. Following optimization, the water savings rates were recorded at 10.23% in Zhangye and 14.76% in Wuwei, while the fertilizer savings rates were 5.5% and 8.5%, respectively. This reduction in water and fertilizer usage mitigates environmental impacts by decreasing reliance on natural resources and alleviating environmental burdens. In regions with limited water resources, this optimization is especially advantageous, enhancing resource efficiency and facilitating the transition to more environmentally friendly agricultural practices.
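The savings rates above follow directly from the before-and-after figures quoted in this section. As a quick check, the arithmetic can be reproduced as follows; the function and variable names are illustrative only.

```python
def reduction_percent(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Water use before and after optimization, as quoted in the text.
water_use = {"Zhangye": (215, 193), "Wuwei": (210, 179)}
# Maize yield in kg/ha before and after optimization, as quoted in the text.
maize_yield = {"Zhangye": (6853, 6776), "Wuwei": (7217, 6990)}

water_saving = {county: round(reduction_percent(before, after), 2)
                for county, (before, after) in water_use.items()}
yield_change = {county: round(reduction_percent(before, after), 2)
                for county, (before, after) in maize_yield.items()}
```

The water savings come out at 10.23% for Zhangye and 14.76% for Wuwei, matching the reported rates, while the corresponding yield reductions are about 1.1% and 3.1%.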
This study highlights the essential role of water and fertilizer optimization in sustainable agriculture. It reduces resource consumption, maintains stable yields, and lowers environmental impact. In maize cultivation, optimization is achieved by adjusting irrigation and fertilizer application based on weather data and crop needs. This approach ensures efficient resource use, improving productivity while minimizing ecological impact. In arid regions, the optimization of water and fertilizer is particularly crucial for enhancing resource efficiency and supporting sustainable agricultural practices. This strategy addresses weather unpredictability and rising resource costs, ensuring that reductions in resource use do not negatively affect yields and assisting farmers in adapting to environmental changes.

6. Discussion

6.1. Key Findings

6.1.1. The Dominant Role of NDVI During the Vigorous Growth Period

The core finding of this study is that the Normalized Difference Vegetation Index (NDVI) in August serves as the most significant predictor of maize yield in the Hexi Corridor. This conclusion is supported by agronomic principles, as August coincides with the grain-filling period of maize, which is the most critical growth stage for determining kernel weight and final yield. During this phase, elevated NDVI values reflect a dense crop canopy, high leaf chlorophyll content, and robust photosynthetic capacity, thereby ensuring an adequate supply of nutrients for grain growth and development, which directly correlates with increased yields. This finding aligns with the conclusions of numerous agricultural remote sensing studies, indicating that the vegetation index during the peak crop growing season is the most effective proxy indicator for yield [24,25,26].

6.1.2. The Importance of Nonlinear Relationships

Another key finding of this study is the successful identification and quantification of the nonlinear relationship between NDVI and maize yield. By incorporating the quadratic term of NDVI_Aug into the model, we significantly improved its performance. This nonlinear relationship arises from the phenomenon of ‘NDVI saturation’: when vegetation coverage reaches an extremely high level, the growth rate of the NDVI slows down or even stagnates, while the biomass and potential yield of the crop may still exhibit an increasing trend. This effect cannot be captured by a simple linear model, resulting in an underestimation of yield in high-production areas. In contrast, a parabolic model that includes a negative quadratic term can effectively represent this trend of initial increase followed by leveling off. This shows that careful feature engineering and model specification allow a statistical model to move beyond simplistic linear assumptions.

6.2. Reflection and Argumentation on Methodology

In the exploratory modeling phase, after completing data acquisition and initial feature engineering, the research team evaluated a variety of machine learning models. According to the experimental results obtained during the second phase of the study, the Random Forest (RF) model demonstrated exceptional performance on the randomly partitioned test set, achieving an R2 value exceeding 0.85. This preliminary finding underscores the model’s significant potential and subsequently shifted the research focus toward more complex machine learning algorithms.
In the pursuit of performance enhancement, models are often susceptible to overfitting. To mitigate this issue, our research team implemented advanced modeling techniques, such as Ridge Regression, along with sophisticated ensemble strategies, including Blending and Stacking, to improve prediction accuracy. These methods yielded significant results, with the R2 value of the test set reaching an impressive 0.99 in the results from the fourth and fifth stages of the experiment. However, a concerning trend noted in previous reports was once again validated: when stricter time series validation was employed—specifically, by using historical data to predict the most recent year’s data—the performance of the Random Forest (RF) model exhibited a sharp decline. This decline revealed the model’s critical overfitting issue and its inability to generalize to future data.
Phase Three: Scientific Diagnosis and the Return to Methodology. A comprehensive statistical diagnosis was conducted on the seemingly flawless performance of the Random Forest (RF) model. The diagnostic results indicated that the model's residuals did not pass tests for normality and homoscedasticity. Moreover, the cross-validation outcomes were highly unstable, and severe multicollinearity was present. These findings show that, although the RF model fits the training data well, it is a statistically unsound and unreliable 'black box' whose predictions lack scientific credibility.
Phase Four: Constructing a Robust Weighted Least Squares (WLS) Model. In response to the diagnostic results, the research team abandoned the pursuit of a high coefficient of determination (R²) as the sole objective and returned to the core principles of scientific modeling. The most significant insight gained from the machine learning exploration, namely that nonlinear relationships are crucial, was applied to the statistical model. The quadratic term of the August Normalized Difference Vegetation Index (NDVI_Aug) was explicitly added to the Ordinary Least Squares (OLS) model to capture the nonlinear relationship, and WLS was adopted to address the heteroscedasticity issue identified on the sixth day. The optimized WLS model not only achieved a high level of accuracy (R² = 0.89) but, more importantly, passed all statistical tests. This model thus combines high precision, strong interpretability, and statistical robustness.
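As an illustration of the technique named above, the following sketch fits a quadratic model by weighted least squares using only the Python standard library, by solving the weighted normal equations (XᵀWX)β = XᵀWy. It is not the study's implementation: the function names, the synthetic data, and in particular the choice of weights are assumptions made for the example, since this section does not state how the weights were derived.

```python
from typing import List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def wls_quadratic(x: Sequence[float], y: Sequence[float],
                  w: Sequence[float]) -> List[float]:
    """Fit y ~ b0 + b1*x + b2*x^2 by minimising sum(w_i * residual_i^2)."""
    rows = [[1.0, xi, xi * xi] for xi in x]
    xtwx = [[sum(wi * r[j] * r[k] for r, wi in zip(rows, w)) for k in range(3)]
            for j in range(3)]
    xtwy = [sum(wi * r[j] * yi for r, yi, wi in zip(rows, y, w))
            for j in range(3)]
    return solve_linear(xtwx, xtwy)


# Synthetic data generated from a known concave curve, so the fit can be checked.
ndvi = [0.30, 0.40, 0.50, 0.60, 0.70, 0.80]
yields = [-4.0 + 30.0 * v - 15.0 * v * v for v in ndvi]
# Hypothetical weights; in practice they would reflect estimated error variance.
weights = [1.0 / (v * v) for v in ndvi]

b0, b1, b2 = wls_quadratic(ndvi, yields, weights)
print(round(b0, 6), round(b1, 6), round(b2, 6))
```

Observations with larger weights pull the fitted curve more strongly, which is how WLS compensates when the error variance is not constant across observations. In applied work a statistics library would normally be used instead of hand-written normal equations, since it also reports standard errors and diagnostic tests.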
In this study, the reduction in water and fertilizer usage was estimated using a market price-based approach as part of the data collection process. To ensure the accuracy of this method, we assessed the stability of market prices by comparing historical data, referencing industry standards, and considering prices from other regions. This verification process confirmed the reliability of the estimated reductions in water and fertilizer usage. The market price-based method complements the measurements obtained from flow meters and soil moisture sensors, providing a viable alternative in areas where comprehensive coverage by precision equipment is impractical.

6.3. Comparison with Related Studies

The performance of the final model in this study (R² = 0.89, RMSE = 12.8%) compares favorably with similar studies. Among maize yield estimation studies that use statistical models, research conducted in the Nebraska region reported a maximum R² of 0.83 for an Ordinary Least Squares (OLS) model of rainfed maize, which the present model exceeds. Its accuracy is also in line with levels reported in numerous studies employing machine learning methods. For example, one study found that Sentinel-2 data could account for 41% to 80% of the variation in crop yield [27]. This suggests that, with careful feature engineering and appropriate statistical methods, a yield estimation model can achieve performance comparable to that of complex machine learning models. The results further support the potential of Sentinel-2 and other free, high-resolution data in precision agricultural monitoring. They also confirm that integrating remote sensing data with ground statistical data is an effective approach to constructing reliable yield estimation models [28].
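When comparing figures across studies, it helps to be explicit about how the two metrics are computed. The sketch below uses the standard definition of R² and assumes that the relative RMSE is the RMSE divided by the mean observed yield, expressed as a percentage; this section does not spell out the normalisation, so that choice is an assumption. The function names and numbers are illustrative only.

```python
import math
from typing import Sequence


def r_squared(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def relative_rmse(observed: Sequence[float], predicted: Sequence[float]) -> float:
    """RMSE as a percentage of the mean observed value (assumed normalisation)."""
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    mean_obs = sum(observed) / len(observed)
    return 100.0 * math.sqrt(mse) / mean_obs


# Toy values for illustration only.
observed = [8.0, 10.0, 12.0]
mean_only = [10.0, 10.0, 10.0]   # always predicts the mean
reversed_trend = [12.0, 10.0, 8.0]  # systematically wrong direction

print(r_squared(observed, mean_only), relative_rmse(observed, mean_only))
print(r_squared(observed, reversed_trend))
```

A prediction that always returns the observed mean gives R² = 0, and predictions that are worse than the mean give a negative R², which is how an R² below zero can arise under time-based cross-validation.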

6.4. Limitations of the Study and Future Prospects

1. The Issue of Data Aggregation Scale. This study employs county-level aggregation: the values of all pixels within a county are averaged and then matched with that county's total production data. While this method is effective at the regional scale, it does not account for significant spatial heterogeneity within the county, such as variations in planting methods and management practices across different plots. Future research should aim for pixel-level yield estimation at higher resolution. This, however, requires a substantial number of fine-scale ground samples, such as field-level production data, for model training and validation.
2. Simplification of Model Variables. The current model relies predominantly on vegetation indices as predictors. While these indices are effective indicators of crop growth, yield formation is a multifaceted process influenced by many factors. The performance and robustness of future models could be improved by integrating auxiliary data from several dimensions: ① Meteorological Data: key meteorological variables during the growing season, such as Growing Degree Days, solar radiation, and detailed precipitation and evapotranspiration data [29,30,31]. ② Soil Data: static geographic information such as soil type, texture, organic matter content, and water holding capacity, which directly affect the efficiency of water and fertilizer utilization by crops. ③ Management Activities: management information such as fertilizer application rates and irrigation water volumes, which can significantly enhance the model's explanatory power [32].
3. Generalization Capability of the Model. The model is specifically tailored to maize in the Hexi Corridor region, and the direct applicability of its parameters and conclusions to other crops or to regions with different agro-ecological characteristics requires further validation. Future research should explore the transferability of this model framework or investigate more universal deep learning models, such as Convolutional Neural Networks (CNN) and Long Short-Term Memory networks (LSTM) [33,34]. These models have distinct advantages in processing spatiotemporal big data; however, their effectiveness depends on the availability of larger and more diverse datasets. This also indirectly highlights a current strength of the maize yield estimation method developed for the Hexi Corridor region, namely its efficient use of limited data.
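The county-level aggregation discussed in the first limitation can be sketched as follows: pixel values are averaged per county, and the resulting mean is paired with that county's production figure. The county labels, function names, and numbers are hypothetical and only illustrate the mechanics; the study's actual processing pipeline is not shown in this section.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def county_mean_ndvi(pixels: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Average the NDVI of all pixels that fall within each county."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for county, ndvi in pixels:
        sums[county] += ndvi
        counts[county] += 1
    return {county: sums[county] / counts[county] for county in sums}


def join_with_production(mean_ndvi: Dict[str, float],
                         production: Dict[str, float]) -> Dict[str, Tuple[float, float]]:
    """Pair each county's mean NDVI with its reported production."""
    return {c: (mean_ndvi[c], production[c]) for c in mean_ndvi if c in production}


# Toy values for illustration only; "A" and "B" are placeholder county labels.
pixels = [("A", 0.60), ("A", 0.80), ("B", 0.50)]
production_tons = {"A": 120000.0, "B": 45000.0}

means = county_mean_ndvi(pixels)
samples = join_with_production(means, production_tons)
print(samples)
```

In this toy case the two pixels in county "A" differ by 0.20 in NDVI, yet only their mean enters the model, which is the loss of within-county heterogeneity that the limitation describes.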

7. Conclusions

This study carried out a complete research workflow, from data acquisition and feature engineering to model construction and validation, to address maize yield estimation through remote sensing in the Hexi Corridor region. Using the Weighted Least Squares (WLS) method, a county-level maize yield estimation model was developed and validated. The model performed well on an independent test set, achieving a coefficient of determination (R²) of 0.89 and a relative root mean square error (RMSE) of 12.8%. Furthermore, the model passed all necessary statistical diagnostic tests, supporting its reliability and stability. The core conclusions are as follows:
1. Identification of Key Predictive Factors: The Normalized Difference Vegetation Index (NDVI) during the vigorous growth period of maize in August serves as the most critical and sensitive remote sensing indicator for predicting final yield.
2. The Importance of Nonlinear Relationships: A significant nonlinear saturation relationship exists between NDVI and maize yield. Incorporating the quadratic term of NDVI (NDVI_Aug²) into the model captures this feature and was a decisive step in improving accuracy. The improvement was achieved through feature optimization rather than real-time parameter adjustment, which allows the model to better represent the complex nonlinear relationships in crop growth.
3. Methodological Decision: The research process demonstrates that for yield estimation problems of this nature, an optimized Weighted Least Squares (WLS) model tailored to the specific context can outperform seemingly more powerful ‘black-box’ machine learning models in terms of overall performance, including accuracy, reliability, and interpretability. This highlights that, in the application of data science, a rigorous scientific validation process is more important than merely pursuing the highest predictive score. Additionally, it offers valuable insights for researchers across various fields, encouraging them to critically assess the ‘highest score’ approach.
In summary, this study provides an efficient and cost-effective technical tool for agricultural production management and food security early warning in the Hexi Corridor. Through standardized workflows and careful model selection, it also offers methodological insights and a practical example for similar remote sensing yield estimation research in other regions or for other crops. By enabling precise yield prediction, the model supports the optimization of water, fertilizer, and other agricultural resources, adaptation to climate change, and environmental protection, helping farmers to reduce waste, enhance productivity, and minimize environmental impact, particularly in water-scarce regions. Because it relies on publicly available remote sensing data and cloud-based processing tools, the approach can promote data-driven precision agriculture more widely. With further optimization, the model could provide sustainable agricultural solutions for regions facing similar environmental challenges and resource constraints, thereby contributing to broader global sustainability efforts.

Author Contributions

Conceptualization, G.Y. and J.W.; methodology, G.Y.; software, G.Y.; validation, G.Y., J.W. and Z.Q.; formal analysis, G.Y.; investigation, G.Y.; resources, J.W.; data curation, G.Y.; writing—original draft preparation, G.Y.; writing—review and editing, J.W. and Z.Q.; visualization, G.Y.; supervision, J.W.; project administration, J.W.; funding acquisition, J.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Central Guiding Local Funds for Science and Technology Development Project (No. 25ZYJA035); the Gansu Provincial Science and Technology Major Special Programme Project (No. 25ZDFA011); the Central Government Guided Special Project of Local Science and Technology Development (No. 2024 (24ZYQA023)); and the Gansu Provincial Department of Education Teacher Innovation Fund Project (No. 2025A-085).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Zhu, H.; He, X.; Wang, X.; Long, P. Increasing hybrid rice yield, water productivity, and nitrogen use efficiency: Optimization strategies for irrigation and fertilizer management. Plants 2024, 13, 1717.
2. De Villiers, C.; Mashaba-Munghemezulu, Z.; Munghemezulu, C.; Chirima, G.J.; Tesfamichael, S.G. Assessing Maize Yield Spatiotemporal Variability Using Unmanned Aerial Vehicles and Machine Learning. Geomatics 2024, 4, 213–236.
3. Yang, X.; Zhang, L.; Liu, X. Optimizing Water-Fertilizer Integration with Drip Irrigation Management to Improve Crop Yield, Water, and Nitrogen Use Efficiency: A Meta-Analysis Study. Sci. Hortic. 2024, 338, 113653.
4. Fang, L.; Zhang, G.; Ming, B.; Shen, D.; Wang, Z.; Zhou, L.; Zhang, T.; Liang, Z.; Xue, J.; Xie, R. Dense Planting and Nitrogen Fertilizer Management Improve Drip-Irrigated Spring Maize Yield and Nitrogen Use Efficiency in Northeast China. J. Integr. Agric. 2024.
5. Zhang, X.; Shen, H.; Huang, T.; Wu, Y.; Guo, B.; Liu, Z.; Luo, H.; Tang, J.; Zhou, H.; Wang, L. Improved Random Forest Algorithms for Increasing the Accuracy of Forest Aboveground Biomass Estimation Using Sentinel-2 Imagery. Ecol. Indic. 2024, 159, 111752.
6. Radeloff, V.C.; Roy, D.P.; Wulder, M.A.; Anderson, M.; Cook, B.; Crawford, C.J.; Friedl, M.; Gao, F.; Gorelick, N.; Hansen, M.; et al. Need and Vision for Global Medium-Resolution Landsat and Sentinel-2 Data Products. Remote Sens. Environ. 2024, 300, 113918.
7. Mateo-Sanchis, A.; Piles, M.; Muñoz-Marí, J.; Adsuara, J.E.; Pérez-Suay, A.; Camps-Valls, G. Synergistic Integration of Optical and Microwave Satellite Data for Crop Yield Estimation. Remote Sens. Environ. 2019, 234, 111460.
8. Joshi, V.R.; Thorp, K.R.; Coulter, J.A.; Johnson, G.A.; Porter, P.M.; Strock, J.S.; Garcia y Garcia, A. Improving Site-Specific Maize Yield Estimation by Integrating Satellite Multispectral Data into a Crop Model. Agronomy 2019, 9, 719.
9. Vani, V.; Mandla, V.R. Comparative Study of NDVI and SAVI Vegetation Indices in Anantapur District Semi-Arid Areas. Int. J. Civ. Eng. Technol. 2017, 8, 559–566.
10. Burton, A.L. OLS (Linear) Regression. Encycl. Res. Methods Criminol. Crim. Justice 2021, 2, 509–514.
11. Li, X.; Lyu, Y.; Zhu, B.; Liu, L.; Song, K. Maize Yield Estimation in Northeast China’s Black Soil Region Using a Deep Learning Model with Attention Mechanism and Remote Sensing. Sci. Rep. 2025, 15, 12927.
12. Zhang, Q.; Zhao, X.; Han, Y.; Yang, F.; Pan, S.; Liu, Z.; Wang, K.; Zhao, C. Maize Yield Prediction Using Federated Random Forest. Comput. Electron. Agric. 2023, 210, 107930.
13. Daviran, M.; Maghsoudi, A.; Ghezelbash, R. Optimized AI-MPM: Application of PSO for Tuning the Hyperparameters of SVM and RF Algorithms. Comput. Geosci. 2025, 195, 105785.
14. Wang, J.; Fang, F.; Wang, J.; Yue, P.; Wang, S.; Xu, Y. Evolutionary Characteristics and Influencing Factors of Wheat Production Risk in Gansu Province of China under the Background of Climate Change. Theor. Appl. Climatol. 2024, 155, 5389–5415.
15. Yang, G.; Wang, J.; Qi, Z. Maize Classification in Arid Regions via Spatiotemporal Feature Optimization and Multi-Source Remote Sensing Integration. Agronomy 2025, 15, 1667.
16. Cai, T.; Chang, C.; Zhao, Y.; Wang, X.; Yang, J.; Dou, P.; Otgonbayar, M.; Zhang, G.; Zeng, Y.; Wang, J.; et al. Within-Season Estimates of 10 m Aboveground Biomass Based on Landsat, Sentinel-2 and PlanetScope Data. Sci. Data 2024, 11, 1276.
17. Li, M.; Wang, G.; Sun, A.; Wang, Y.; Li, F.; Liang, S. Monitoring Grassland Variation in a Typical Area of the Qinghai Lake Basin Using 30 m Annual Maximum NDVI Data. Remote Sens. 2024, 16, 1222.
18. Kabato, W.; Getnet, G.T.; Sinore, T.; Nemeth, A.; Molnár, Z. Towards Climate-Smart Agriculture: Strategies for Sustainable Agricultural Production, Food Security, and Greenhouse Gas Reduction. Agronomy 2025, 15, 565.
19. Manley, M.; Baeten, V. Spectroscopic technique: Near infrared (NIR) spectroscopy. In Modern Techniques for Food Authentication; Elsevier: Amsterdam, The Netherlands, 2018; pp. 51–102.
20. Anees, S.A.; Mehmood, K.; Rehman, A.; Rehman, N.U.; Muhammad, S.; Shahzad, F.; Hussain, K.; Luo, M.; Alarfaj, A.A.; Alharbi, S.A.; et al. Unveiling Fractional Vegetation Cover Dynamics: A Spatiotemporal Analysis Using MODIS NDVI and Machine Learning. Environ. Sustain. Indic. 2024, 24, 100485.
21. Zdaniuk, B. Ordinary Least-Squares (OLS) Model. In Encyclopedia of Quality of Life and Well-Being Research; Springer: Berlin/Heidelberg, Germany, 2024; pp. 4867–4869.
22. Asamoah, E.; Heuvelink, G.B.; Chairi, I.; Bindraban, P.S.; Logah, V. Random Forest Machine Learning for Maize Yield and Agronomic Efficiency Prediction in Ghana. Heliyon 2024, 10, e37065.
23. Gallagher, N.B.; Goyetche, R.; Amigo, J.M.; Kucheryavskiy, S. Extended Least Squares (ELS) and Generalized Least Squares (GLS) for Clutter Suppression in Hyperspectral Images: A Theoretical Description. Chemom. Intell. Lab. Syst. 2024, 244, 105032.
24. Patil, P.P.; Jagtap, M.P.; Khatri, N.; Madan, H.; Vadduri, A.A.; Patodia, T. Exploration and advancement of NDDI leveraging NDVI and NDWI in Indian semi-arid regions: A remote sensing-based study. Case Stud. Chem. Environ. Eng. 2024, 9, 100573.
25. Farbo, A.; Sarvia, F.; De Petris, S.; Basile, V.; Borgogno-Mondino, E. Forecasting Maize NDVI through AI-based approaches using sentinel 2 image time series. ISPRS J. Photogramm. Remote Sens. 2024, 211, 244–261.
26. Santana, C.T.C.d.; Sanches, I.D.A.; Caldas, M.M.; Adami, M. A Method for Estimating Soybean Sowing, Beginning Seed, and Harvesting Dates in Brazil Using NDVI-MODIS Data. Remote Sens. 2024, 16, 2520.
27. Karlson, M.; Ostwald, M.; Bayala, J.; Bazié, H.R.; Ouedraogo, A.S.; Soro, B.; Sanou, J.; Reese, H. The potential of Sentinel-2 for crop production estimation in a smallholder agroforestry landscape, Burkina Faso. Front. Environ. Sci. 2020, 8, 85.
28. Karra, K.; Kontgis, C.; Statman-Weil, Z.; Mazzariello, J.C.; Mathis, M.; Brumby, S.P. Global Land Use/Land Cover with Sentinel 2 and Deep Learning. In Proceedings of the 2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS, Brussels, Belgium, 11–16 July 2021; pp. 4704–4707.
29. Dížková, P.; Bartošová, L.; Bláhová, M.; Balek, J.; Hájková, L.; Semerádová, D.; Bohuslav, J.; Pohanková, E.; Žalud, Z.; Trnka, M. Modeling phenological phases of winter wheat based on temperature and the start of the growing season. Atmosphere 2022, 13, 1854.
30. Zhou, Y.; Liu, Y.; Wang, D.; Liu, X.; Wang, Y. A review on global solar radiation prediction with machine learning models in a comprehensive perspective. Energy Convers. Manag. 2021, 235, 113960.
31. Hu, X.; Shi, L.; Lin, G.; Lin, L. Comparison of physical-based, data-driven and hybrid modeling approaches for evapotranspiration estimation. J. Hydrol. 2021, 601, 126592.
32. Hossain, M.M.; Rahman, M.A.; Chaki, S.; Ahmed, H.; Haque, A.; Tamanna, I.; Lima, S.; Most, J.F.; Rahman, M.S. Smart-Agri: A smart agricultural management with IoT-ML-blockchain integrated framework. Int. J. Adv. Comput. Sci. Appl. 2023, 14.
33. Kattenborn, T.; Leitloff, J.; Schiefer, F.; Hinz, S. Review on Convolutional Neural Networks (CNN) in vegetation remote sensing. ISPRS J. Photogramm. Remote Sens. 2021, 173, 24–49.
34. Wang, J.; Si, H.; Gao, Z.; Shi, L. Winter wheat yield prediction using an LSTM model from MODIS LAI products. Agriculture 2022, 12, 1707.
Figure 1. (a) Location map of the study area, reproduced from the authors' previous study published in Agronomy. The left map shows China, and the right map highlights the Hexi Corridor region, including the major cities of Jiuquan, Jiayuguan, Zhangye, Jinchang, and Wuwei. The study area is in the northwest of Gansu Province, near Xinjiang, Inner Mongolia, and Qinghai. (b) Digital Elevation Model (DEM, left), aspect map (middle), and slope map (right) of the Hexi Corridor study area. Elevation ranges from 758 m to 5842 m, with agricultural production mainly concentrated in low-slope oasis plains (<15°).
Figure 2. Schematic diagram of the technical process.
Figure 3. Feedback Mechanism Diagram of Yield Prediction Model and Water-Fertilizer Optimization.
Figure 4. Correlation analysis of newly constructed features and maize yield. The X-axis shows the absolute correlation coefficient of each feature with maize yield, indicating the strength of the relationship. The Y-axis lists the newly constructed features, including various vegetation indices and other relevant factors, ordered by their correlation with yield. This figure highlights the relationship between these features and maize yield, with stronger correlations placed higher on the Y-axis.
Figure 5. Random Forest Feature Importance Analysis (Top 20 Features). This figure shows the importance of different features in maize yield prediction. The X-axis represents the importance score of each feature, and the Y-axis lists the features in order of their importance. Planting Area and Vegetation PC1_Planting Area are the most significant predictors, as indicated by their higher importance scores.
Figure 6. Feature importance analysis under TIME and RANDOM splits. The left panel shows feature importance under the TIME split, and the right panel shows feature importance under the RANDOM split. The X-axis represents the importance score of each feature, and the Y-axis lists the features in order of their importance. This comparison highlights how feature importance varies with different data splitting methods.
Figure 7. Performance comparison heatmap of maize yield prediction models in the Hexi Corridor. The figure shows the normalized performance of the OLS, RF, and WLS models across the R², RMSE, MAPE, and MAE metrics. The RF model outperforms the others, with a perfect score of 1.000 in R² and RMSE.
Figure 8. Performance comparison heatmap of maize yield prediction models in the Hexi Corridor (original values). The figure shows the raw values for the OLS, RF, and WLS models across R², RMSE, MAPE, and MAE. The RF model performs best, with an R² of 0.82 and an RMSE of 850.60.
Figure 9. The figure presents the top 15 important feature coefficients in the Ridge regression model. It illustrates the significance of various features in predicting maize yield. The coefficients indicate each feature’s contribution to the model, with positive values reflecting a positive correlation with yield and negative values indicating a negative correlation. The feature importance was determined by fitting the Ridge regression model and assessing the magnitude of the coefficients.
Figure 10. The figure compares the MAPE (Mean Absolute Percentage Error) of different models on Day 3 for both the training (green) and test (red) sets. The results show that ensemble methods, particularly Blending Ensemble, achieve the lowest MAPE, indicating better prediction accuracy than individual models.
Figure 11. The figure compares the R² scores of different models for both the training (green) and test (red) sets. The Blending Ensemble achieves the highest R² score, outperforming other models.
Figure 12. The figure compares the test set R² scores for ensemble methods and base models. The Blending Ensemble achieves the highest R² score, outperforming all base models.
Figure 13. The figure shows a heatmap comparing the R² and MAPE values for various models. The Blending Ensemble outperforms all other models, with the highest R² score and the lowest MAPE.
Figure 14. Comparison of model performance. The figure compares different models based on four metrics: R², RMSE, MAPE, and MAE. Each subplot shows the performance differences, helping to identify the optimal model.
Figure 15. (Left Top): Performance metrics of Ridge regression and Random Forest models across test samples, with points representing counties and years, and color indicating model performance, showcasing the geographic and temporal influence on predictive accuracy. (Left Bottom): Random Forest model performance, with points reflecting test samples from various counties, indicating prediction accuracy through proximity to the diagonal red line. (Right Top): Ridge regression model performance, showing the relationship between actual and predicted yields, where points closer to the red dashed line represent better accuracy. (Right Bottom): Prediction error distribution for both models, with box plots comparing error ranges across counties and years, highlighting the model with the least deviation.
Figure 16. Blending model prediction results. This figure shows the relationship between actual and predicted Maize yield by the Blending model, with each point representing a county-year. The red dashed line indicates the perfect prediction line (y = x). The scatter plot demonstrates the model’s prediction performance, highlighting regional variations.
Figure 17. This figure shows the model’s diagnostic analysis, including residuals vs. predicted values, residual distribution, standardized residuals vs. predicted values, and the Q-Q plot. These plots help evaluate the model’s fit, normality of residuals, and heteroscedasticity. Ideally, residuals should be randomly distributed, normally distributed, and free from bias, confirming the model’s stability and reliability.
Figure 18. This figure includes several diagnostic results: the top-left plot displays the residual distribution with the mean indicated by a red dashed line, highlighting how well the model fits the majority of data points; the top-right plot is a Q-Q plot testing the normality of the residuals, with most points aligning closely to the diagonal; the bottom-left plot shows the results of 5-fold cross-validation, providing an assessment of the model’s stability; the bottom-right plot compares the predicted values against the actual ones, with a regression line to visualize the fit; the middle plot displays the importance of the top 15 features, and the right-middle plot shows the contribution of temperature-sensitive variables to the model’s performance.
Table 1. Data Sources and Specifications Used in the Study.
| Data Type | Source | Product/Description | Spatial Resolution | Temporal Resolution | Time Coverage |
| --- | --- | --- | --- | --- | --- |
| Remote sensing data | Copernicus Project/Google Earth Engine | Sentinel-2 L2A (atmospherically corrected surface reflectance) | 10–20 m | 5 days | 2019–2023 (May–September) |
| Production statistics | Provincial and municipal statistical yearbooks | County-level total maize production (tons), sowing area (hectares) | County level | Annual | 2019–2023 |
Table 2. Comparison of Key Yield Estimation Model Performance.
| Model | Key Features | Validation R² | Validation RMSE (%) | Passed Statistical Diagnostics |
| --- | --- | --- | --- | --- |
| Benchmark OLS model | Linear vegetation index | 0.78 | 18.5% | No |
| Random Forest (RF) | All vegetation indices, automatic nonlinear fitting | Approximately 0.85 (random CV) / <0 (time CV) | High/extremely high | No (unstable) |
| WLS model | NDVI_Aug, NDVI_Aug² | 0.89 | 12.8% | Yes |
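To make the WLS row concrete, the following is a minimal sketch of a weighted least squares fit on August NDVI and its square. It is not the authors' implementation: the function names, the choice of weights, and the definition of RMSE (%) as RMSE divided by the mean observed yield are assumptions, and the fit is done with plain NumPy rather than any particular statistics package.

```python
import numpy as np


def fit_wls_quadratic(ndvi_aug, yield_obs, weights):
    """Fit yield = b0 + b1 * NDVI_Aug + b2 * NDVI_Aug^2 by weighted least squares.

    Hypothetical helper. `weights` are per-observation weights, for example
    inverse error variances; how the study chose them is not assumed here.
    """
    x = np.asarray(ndvi_aug, dtype=float)
    y = np.asarray(yield_obs, dtype=float)
    w = np.asarray(weights, dtype=float)

    # Design matrix with intercept, linear, and quadratic NDVI terms.
    X = np.column_stack([np.ones_like(x), x, x ** 2])

    # Scaling rows by sqrt(w) turns WLS into an ordinary least squares problem.
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta


def predict(beta, ndvi_aug):
    x = np.asarray(ndvi_aug, dtype=float)
    return beta[0] + beta[1] * x + beta[2] * x ** 2


def r2_and_rmse_percent(y_true, y_pred):
    """Return R^2 and RMSE as a percentage of the mean observed value (assumed definition)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return 1.0 - ss_res / ss_tot, 100.0 * rmse / y_true.mean()
```

In use, `beta = fit_wls_quadratic(ndvi, yields, weights)` followed by `r2_and_rmse_percent(yields_val, predict(beta, ndvi_val))` on held-out county-years would produce figures comparable in kind to the validation columns above, provided the same RMSE convention is used.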
Table 3. Comparison of Water and Fertilizer Usage, Yield, and Savings Before and After Optimization.
| Sample Point | Year | Water Usage Change | Fertilizer Usage Change | Yield Change | Water and Fertilizer Savings |
| --- | --- | --- | --- | --- | --- |
| Zhangye | 2023 | 215 → 193 m³/acre | 5.5 → 5.2 tons/acre | 6853 → 6776 kg/ha | 10.23%/5.5% |
| Zhangye | 2024 | 210 → 188 m³/acre | 5.0 → 4.7 tons/acre | 6780 → 6690 kg/ha | 11.43%/6.0% |
| Wuwei | 2023 | 210 → 190 m³/acre | 5.2 → 4.8 tons/acre | 7217 → 6990 kg/ha | 14.76%/8.5% |
| Wuwei | 2024 | 205 → 179 m³/acre | 5.0 → 4.6 tons/acre | 7150 → 7111 kg/ha | 12.68%/7.8% |
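The savings column is presumably the relative reduction, (before - after) / before, expressed as a percentage. The sketch below recomputes it from the before and after values in Table 3. The formula is an assumption about how the column was derived, and the helper name `percent_saving` is hypothetical.

```python
def percent_saving(before, after):
    """Relative reduction from `before` to `after`, in percent (assumed formula)."""
    return (before - after) / before * 100.0


# (site, year, water before, water after, fertilizer before, fertilizer after),
# copied from Table 3 with the units given there.
rows = [
    ("Zhangye", 2023, 215, 193, 5.5, 5.2),
    ("Zhangye", 2024, 210, 188, 5.0, 4.7),
    ("Wuwei", 2023, 210, 190, 5.2, 4.8),
    ("Wuwei", 2024, 205, 179, 5.0, 4.6),
]

for site, year, w0, w1, f0, f1 in rows:
    water = percent_saving(w0, w1)
    fertilizer = percent_saving(f0, f1)
    print(f"{site} {year}: water {water:.2f}%, fertilizer {fertilizer:.1f}%")
```

Recomputing each row this way is a quick consistency check of the savings column against the before and after values. The same function applied to the yield columns gives the relative yield change that accompanies each saving.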
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Yang, G.; Wang, J.; Qi, Z. Remote Sensing and Data-Driven Optimization of Water and Fertilizer Use: A Case Study of Maize Yield Estimation and Sustainable Agriculture in the Hexi Corridor. Sustainability 2025, 17, 8182. https://doi.org/10.3390/su17188182
