Prediction of Soybean Yield at the County Scale Based on Multi-Source Remote-Sensing Data and Deep Learning Models

Fu, Hongkun; Li, Jian; Lu, Jian; Lin, Xinglei; Kang, Junrui; Zou, Wenlong; Ning, Xiangyu; Sun, Yue

doi:10.3390/agriculture15131337


by Hongkun Fu ^1,2, Jian Li ^2,3,*, Jian Lu ^1,2, Xinglei Lin ^2,3, Junrui Kang ^2,3, Wenlong Zou ^3, Xiangyu Ning ^4 and Yue Sun ^4

^1 College of Agriculture, Jilin Agricultural University, Changchun 130118, China

^2 Jilin Provincial Cross-Regional Collaborative Innovation Center for Agricultural Intelligent Equipment, Changchun 130118, China

^3 College of Information Technology, Jilin Agricultural University, Changchun 130118, China

^4 Northeast Institute of Geography and Agroecology, Chinese Academy of Sciences, Changchun 130102, China

^* Author to whom correspondence should be addressed.

Agriculture 2025, 15(13), 1337; https://doi.org/10.3390/agriculture15131337

Submission received: 20 May 2025 / Revised: 12 June 2025 / Accepted: 20 June 2025 / Published: 21 June 2025

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)


Abstract

Against the backdrop of global food security challenges, precise pre-harvest yield estimation of large-scale soybean crops is crucial for optimizing agricultural resource allocation and ensuring stable food supplies. This study developed an integrated model for county-level soybean yield forecasting that combines multi-source remote-sensing data with deep learning techniques. The ant colony optimization-convolutional neural network with gated recurrent units and multi-head attention (ACGM) model achieved a coefficient of determination (R²) of 0.74, a root mean square error (RMSE) of 123.94 kg/ha, and a mean absolute error (MAE) of 105.39 kg/ha. It outperformed the random forest regression (RFR), support vector regression (SVR), extreme gradient boosting (XGBoost), and convolutional neural network (CNN) models. This study identifies August as the optimal period for early soybean yield prediction, with the model performing best when combining environmental data and photosynthetic parameters (ED + PP). The ACGM model demonstrates good accuracy and generalization ability, providing a practical approach for refined agricultural management. By integrating deep learning with open-source remote-sensing data, this research offers a route to better-informed agricultural decision-making and food security planning.

Keywords:

yield prediction; soybean; multi-source remote-sensing data; ACGM model; county scale; deep learning

1. Introduction

Soybean is one of the most important crops in the global agricultural system [1]. As one of the primary sources of plant protein for both humans and animals, and also an essential oilseed crop, soybean plays an indispensable role in daily life [2]. However, extreme weather events and environmental changes are occurring with an increasing frequency, posing significant challenges to food security [3]. Precise prediction of soybean yields before harvest is not only of vital importance for optimizing agricultural production but also directly linked to a country’s food security strategy [4]. Despite the great significance of this task, accurately predicting soybean yields on a large scale remains a formidable challenge [5].

Traditional yield prediction methods can be mainly classified into three types: empirical statistical models, crop growth models, and expert systems [6]. Empirical statistical models rely on historical yield data, meteorological records, soil properties, agricultural inputs, and other information, and they establish yield prediction models through statistical analysis [7]. However, these models often face difficulties in data acquisition and updating, especially when it comes to large-scale yield prediction [8]. Crop growth models are based on the principles of crop physiology and ecology, taking into account various physiological processes and environmental factors affecting crop growth and development, and then establishing models to describe the dynamic growth of crops. But this kind of model requires a large number of parameter settings and data support, making the model construction and operation rather complicated [9]. Expert systems mainly rely on the subjective judgments of experts in the agricultural field, and the predictions of experts may be significantly affected by personal experience and bias [10]. Therefore, in the context of difficulties in data acquisition and updating, complex construction and operation of crop growth models, and expert systems being vulnerable to subjective factors, remote-sensing technology with its unique advantages provides new ideas and directions for yield prediction.

With the rapid development of remote-sensing technology, agricultural data acquisition modes have been greatly innovated, making it possible to obtain high-precision and large-scale spatial data [11]. Remote-sensing data sources such as the Moderate Resolution Imaging Spectroradiometer (MODIS) and TerraClimate, with their powerful macro-monitoring capabilities, can not only conduct periodic observations of large areas but also provide multi-source heterogeneous data covering vegetation indices, environmental data, photosynthetic characteristics, etc. [12]. These data provide rich information support for crop yield prediction research and play an increasingly important role in the field of agricultural production management [13]. Many studies have exploited remote-sensing data to improve yield prediction accuracy. For example, the Normalized Difference Vegetation Index (NDVI) has been used to predict winter wheat yields with relatively ideal results [14]. Combining climate data such as temperature and rainfall with the Enhanced Vegetation Index (EVI) and Green Normalized Vegetation Index (GNDVI) has allowed for constructing a more comprehensive crop growth evaluation system, effectively improving the accuracy of yield prediction [15]. Further introduction of photosynthetic parameters has significantly optimized the performance of corn yield prediction models [16]. These studies show that the fusion of multi-source data can significantly improve yield prediction accuracy, but at the same time, the increase in data volume has also led to a sharp rise in computational complexity. Traditional prediction models face efficiency bottlenecks when processing massive data. Therefore, developing efficient and accurate prediction models to meet the urgent needs of agricultural production management in the big data era has become an important direction of current research.

In recent years, the widespread adoption of machine learning and deep learning algorithms within the realm of massive data processing has paved the way for innovative research in crop yield prediction [17]. Harnessing their exceptional prowess in feature extraction and pattern recognition, these state-of-the-art technologies have been effectively deployed in various predictive tasks [18]. However, in the realm of crop yield prediction, several existing models encounter substantial challenges. Despite the extensive utilization of potent models such as random forest regression, support vector machine (SVM), and extreme gradient boosting (XGBoost), these models either depend on linear assumptions or adopt relatively simple decision-making processes. As a result, they are disadvantaged when handling multi-source nonlinear relationships [19]. In sharp contrast, convolutional neural networks (CNNs), which are grounded in deep learning, exhibit an exceptional performance in this area. For example, CNN models have been effectively employed to predict the yields of corn and soybeans. They not only provide accurate predictions but also display remarkable cross-regional transferability. This underscores the tremendous potential of this technology within the agricultural sector [20]. Nevertheless, while CNNs surpass other models in feature extraction, they still encounter limitations when it comes to efficiently integrating multi-source time-series data [21]. Moreover, Heilongjiang Province features a highly diverse geographical and environmental setting. There are notable differences in soil types, landforms, and climates across its various regions. Both traditional machine learning and deep learning models find it a daunting task to process such complex time-series data [22]. 
To circumvent this limitation, researchers have directed their attention towards advanced technologies, including long short-term memory (LSTM), gated recurrent units (GRUs), attention mechanisms, and ant colony optimization algorithms. In the domain of time-series modeling, although LSTM is adept at capturing long-term dependencies, its intricate gating mechanism often results in a low training efficiency and overfitting [23]. In contrast, a GRU features a more streamlined dual-gate architecture that preserves sequence information-processing capabilities while offering pronounced advantages in terms of a rapid training speed, minimal memory usage, and superior generalization performance [24]. Particularly in agricultural time-series data prediction scenarios, a GRU’s adaptive information update mechanism enables it to efficiently manage dynamic changes in crop growth processes [25]. Meanwhile, the multi-head attention mechanism dynamically focuses on pivotal information within input sequences through parallel computation across multiple attention subspaces, significantly enhancing the flexibility and precision of model feature extraction [26]. This mechanism not only augments the model’s capacity to express complex patterns but also effectively alleviates overfitting through an adaptive weight allocation strategy, thereby improving the model’s generalization performance across diverse data distributions. However, as model architectures become increasingly convoluted, hyperparameter optimization has emerged as a critical factor constraining model performance. The integration of optimization algorithms has provided an effective remedy to this issue. 
Among these, the ant colony optimization (ACO) algorithm, as a quintessential example of bio-inspired intelligent optimization methods, has been demonstrated to substantially enhance the prediction accuracy and generalization capability of regression models by simulating ant foraging behavior to dynamically explore optimal parameter combinations, thus laying a solid foundation for constructing high-performance predictive models.
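As a rough illustration of how an ant colony search can explore hyperparameter combinations, the following minimal Python sketch lets each ant pick one candidate value per hyperparameter with probability proportional to its pheromone level, then evaporates and reinforces the pheromone according to the resulting loss. The search grid, the `toy_loss` function, and all constants are hypothetical placeholders; the ACO configuration actually used in this study is not described in the text.

```python
import random


def aco_search(grid, loss_fn, n_ants=10, n_iters=30, evaporation=0.3, seed=0):
    """Discrete ACO: one pheromone value per candidate of each hyperparameter."""
    rng = random.Random(seed)
    pheromone = {name: [1.0] * len(values) for name, values in grid.items()}
    best_params, best_loss = None, float("inf")

    for _ in range(n_iters):
        trails = []
        for _ in range(n_ants):
            # Each ant builds one configuration, favouring high-pheromone choices.
            choice = {
                name: rng.choices(range(len(values)), weights=pheromone[name])[0]
                for name, values in grid.items()
            }
            params = {name: grid[name][idx] for name, idx in choice.items()}
            loss = loss_fn(params)
            trails.append((choice, loss))
            if loss < best_loss:
                best_params, best_loss = params, loss

        # Evaporate old pheromone, then deposit more on low-loss trails.
        for name in pheromone:
            pheromone[name] = [(1.0 - evaporation) * p for p in pheromone[name]]
        for choice, loss in trails:
            for name, idx in choice.items():
                pheromone[name][idx] += 1.0 / (1.0 + loss)

    return best_params, best_loss


# Hypothetical search space (not taken from the paper).
grid = {
    "learning_rate": [0.0001, 0.001, 0.01, 0.1],
    "gru_units": [16, 32, 64, 128],
    "attention_heads": [1, 2, 4, 8],
}


def toy_loss(p):
    """Stand-in for 'train the model and return its validation loss'."""
    return (abs(p["learning_rate"] - 0.001) * 100
            + abs(p["gru_units"] - 64) / 64
            + abs(p["attention_heads"] - 4) / 4)


best_params, best_loss = aco_search(grid, toy_loss)
print(best_params, best_loss)
```

In a real setting, `toy_loss` would be replaced by a function that trains the network with the given hyperparameters and returns a validation error, which makes each ant evaluation expensive and is why the number of ants and iterations is usually kept small.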

To address the above issues, this study proposes a county-level soybean yield prediction method integrating vegetation indices, environmental data, and photosynthetic indices, aiming to break through the limitations of existing technologies. The research conducts in-depth exploration around three objectives: First, an innovative ant colony optimization-based convolutional neural network with gated recurrent units and multi-head attention (ACGM) model is constructed and applied to soybean yield prediction in Heilongjiang Province. Through multi-dimensional data fusion and intelligent algorithm collaboration, the prediction accuracy and reliability of the model in complex agricultural environments are systematically evaluated, providing a new technical paradigm for regional-scale yield prediction. Second, by comparing the prediction performance of data from different months, this study deeply analyzes the impact of data timeliness on prediction results, accurately locates the optimal prediction time window, optimizes data collection and analysis strategies while improving prediction timeliness, and enhances the timeliness of agricultural production decisions and resource utilization efficiency. Third, this study comprehensively analyzes the influence mechanisms of variables such as vegetation indices, environmental data, and photosynthetic characteristics and their combinations on soybean yields, identifies key influencing factors, provides a scientific decision-making basis for agricultural production management, and promotes the deep transformation of precision agriculture from theoretical research to practical application.

2. Materials and Methods

2.1. Study Area

Heilongjiang Province is located in the core of Northeast China, between 43 and 53° North latitude and between 121 and 135° East longitude, belonging to a typical temperate humid or semi-humid continental monsoon climate zone [27]. The region exhibits distinct seasonal climatic characteristics: long and severe winters, short and warm summers, and changeable weather with significant temperature fluctuations in spring and autumn. The annual average temperature remains stable between −5 °C and 5 °C, annual precipitation ranges from 400 to 650 mm (concentrated primarily in summer), and the annual sunshine duration reaches 2300–2800 h [28]. Such unique climatic conditions provide an ideal growth environment for soybeans. The synergistic effects of suitable temperatures, abundant precipitation, and sufficient sunlight coupled with Heilongjiang’s possession of fertile land within one of the world’s four major black soil regions (featuring an extremely high organic matter and nitrogen content in the soil), as well as flat and open terrain, are highly conducive to soybean cultivation [29]. In Heilongjiang Province, the soybean-growing season typically begins in late April and ends with the harvest in early October. As a major soybean production area [30], statistical data in 2022 showed that the province’s soybean output accounted for 47.0% of the national total [31]. The specific planting distribution is visually presented in Figure 1, which shows the 2022 soybean-planting distribution map of Heilongjiang Province.

2.2. Dataset and Preprocessing

This study focuses on soybean yield prediction in Heilongjiang Province, using county-level data from April to September over six consecutive years (2017–2022), which covers the entire soybean growth period. The dataset includes multi-source information such as vegetation indices, environmental data, and photosynthetic parameters. During preprocessing, all data were harmonized to a unified spatial resolution of 500 m and a monthly temporal scale. Data processing combined two platforms, Google Earth Engine (GEE) and ArcGIS (10.8). GEE’s cloud computing capabilities were used to process multi-source satellite and sensor data (e.g., MOD13A1, TerraClimate). Administrative boundary data and soybean planting distribution data of Heilongjiang Province were imported into the GEE platform, and county-scale masks were used to extract data for the target area. ArcGIS (10.8) was used to draw study area maps and visualize analysis results. All soybean yield data were sourced from the official website of the Heilongjiang Provincial Bureau of Statistics (https://tjj.hlj.gov.cn/ (accessed on 4 February 2025)). After completing data cleaning, excluding county-level data with missing values, and discarding variables unrelated to the yield through correlation analysis, a total of 67 valid county-level datasets were ultimately selected from the period spanning 2017 to 2022. These datasets cover the months from April to September and include 15 feature indices, which were used for model construction and yield prediction. The data types, variable information, and sources involved in this study are detailed in Table 1.

2.2.1. Vegetation Indices

In the remote-sensing data system, vegetation indices serve as key observation indicators for expressing the physiological status and growth characteristics of vegetation. In this study, based on the Google Earth Engine (GEE) platform, seven reflectance bands (Sur_Refl_b01 to Sur_Refl_b07) of MOD13A1 data were used to calculate vegetation indices including NDVI, EVI, NDWI, RVI, GNDVI, GVCI, SAVI, WDRVI, GLI, and CVI. The specific calculation formulas for each index are detailed in Table 2. These vegetation indices reflect the vegetation growth status from different dimensions. For example, the NDVI [32], as the most widely used vegetation index, can effectively characterize the vegetation chlorophyll content and photosynthesis intensity by detecting the chlorophyll absorption and near-infrared reflection characteristics of vegetation, and its value is directly related to vegetation growth vitality and coverage; the EVI [33] introduces atmospheric correction parameters and soil adjustment factors, making it more sensitive to monitoring vegetation dynamic changes in high-biomass areas and complex environments; and the NDWI [34] focuses on extracting water information in vegetation canopies and their surroundings, providing a basis for evaluating the vegetation water stress status. The collaborative analysis of different vegetation indices provides multi-perspective data support for an in-depth understanding of phenotypic changes during soybean growth.
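To make the index calculation concrete, the sketch below computes three of the listed indices from band reflectances using their commonly cited definitions. It is only an illustration: the formulas actually applied in this study are those in Table 2, and the pixel values and the mapping of bands to red, near-infrared, blue, and green (MODIS bands 1 to 4) are assumptions made for the example.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index."""
    return (nir - red) / (nir + red)


def evi(nir, red, blue):
    """Enhanced Vegetation Index with the coefficients commonly used for MODIS."""
    return 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0)


def gndvi(nir, green):
    """Green Normalized Difference Vegetation Index."""
    return (nir - green) / (nir + green)


# Hypothetical surface reflectances (0-1 range) for one dense-canopy pixel.
# MODIS reflectance products store integers scaled by 0.0001, so raw values
# would need to be multiplied by that factor first.
pixel = {"red": 0.05, "nir": 0.45, "blue": 0.03, "green": 0.08}

indices = {
    "NDVI": ndvi(pixel["nir"], pixel["red"]),
    "EVI": evi(pixel["nir"], pixel["red"], pixel["blue"]),
    "GNDVI": gndvi(pixel["nir"], pixel["green"]),
}
for name, value in indices.items():
    print(f"{name}: {value:.3f}")
```

For this hypothetical pixel the NDVI is 0.8, which is typical of dense green vegetation; the EVI is lower because its denominator adds soil and atmospheric correction terms.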

2.2.2. Environmental Data

Climatic conditions reflect the environmental factors affecting soybean growth and development. During key phenological stages of soybean growth, including germination, flowering, and grain-filling stages, extreme temperatures can cause significant damage to plants [35]. Low-temperature environments inhibit cellular activity, interfere with physiological and metabolic processes such as photosynthesis and respiration, and delay plant growth and development [36]. High temperatures may accelerate plant transpiration, trigger water imbalance, and affect pollen viability and pollination–fertilization processes, ultimately leading to reduced seed setting rates and adverse impacts on soybean yield formation [37]. To deeply explore the internal relationship between meteorological factors and soybean yields, this study selected the TerraClimate dataset as the source of meteorological data, covering key climatic variables such as precipitation (PR), actual evapotranspiration (AET), the Palmer Drought Severity Index (PDSI), solar radiation (SRAD), the monthly minimum temperature (TMMN), monthly maximum temperature (TMMX), vapor pressure (VAP), and vapor pressure deficit (VPD). The TerraClimate dataset is renowned for its wide spatial coverage and high spatiotemporal resolution, enabling the accurate capture of regional climatic characteristics and their dynamic changes.

2.2.3. Photosynthetic Parameters

Soybean physiological changes are also key factors influencing yield formation. During soybean growth and development, the carbon fixation capacity and photosynthetic efficiency directly determine the rate of dry matter accumulation, serving as core physiological indicators for measuring crop productivity [38]. In that regard, this study selected four key photosynthetic physiological variables: gross primary productivity (GPP) [39] to quantify ecosystem carbon assimilation capacity, net photosynthesis (PsnNet) [40] to directly reflect plant photosynthate accumulation efficiency, fraction of absorbed photosynthetically active radiation (Fpar) [41] to characterize vegetation light capture efficiency, and Leaf Area Index (LAI) [42] to indicate the photosynthetic area of the vegetation canopy [43]. These variables systematically characterize the soybean physiological status from aspects such as carbon metabolism, light energy utilization, and canopy structure, providing critical physical and chemical parameter support for an in-depth analysis of crop yield formation mechanisms.

2.3. Yield Prediction Model

2.3.1. RFR

In this study, the random forest regression (RFR) model was employed for soybean yield prediction. As a classic ensemble learning algorithm, this model demonstrates unique advantages in complex system modeling by constructing numerous decision trees and aggregating their prediction results [44]. Specifically, random forest captures nonlinear relationships and interactions between variables through parallel training of multiple decision trees, making it particularly suitable for modeling complex systems with multi-factor coupling in agricultural production. In this study, the number of estimators was set to 200 (n_estimators = 200), a parameter configuration that not only ensures computational efficiency but also significantly improves model prediction accuracy. To ensure model robustness and generalization ability, the model was evaluated with a random shuffling mechanism (shuffle = True) enabled during data partitioning, while the random seed was fixed at 100 (random_state = 100). This series of parameter configurations not only effectively reduces the variance of model evaluation but also fully adapts to the characteristics of the dataset and the requirements of soybean yield prediction tasks through systematic tuning, providing a solid guarantee for reliable evaluation and optimization of model performance.

2.3.2. SVR

In this study, support vector regression (SVR) was adopted for soybean yield prediction. As a classic supervised learning algorithm, SVR fits the data by constructing an optimal hyperplane that minimizes the error between predicted and true values, and it is well suited to high-dimensional data and complex nonlinear relationships [45]. During model construction, the key hyperparameters of SVR were tuned as follows: the penalty parameter C was set to 1000, the kernel coefficient (gamma) to 0.0001, and the error tolerance (epsilon) to 0.001. Additionally, input features were standardized to improve model training efficiency and prediction performance. These parameter settings and preprocessing steps accelerated model convergence and improved the fit of the model to the dataset in this study.

2.3.3. XGBoost

In our study, we harnessed the XGBoost algorithm to iteratively train multiple weak learners (typically decision trees), progressively reducing prediction errors. These weak learners were then combined into a robust predictive model [46]. The parameters were configured as follows: the number of estimators (n_estimators) was set to 180, and the learning rate was fixed at 0.01 to regulate the boosting process. The maximum tree depth (max_depth) was set to 6, and the minimum child weight (min_child_weight) was set to 1 to fine-tune the complexity of tree development. Additionally, 80% of the samples were randomly selected when constructing each tree. A consistent random seed (random_state = 2) was used to ensure the reproducibility of our results.
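The boosting idea described above, in which each weak learner is fitted to the errors left by the previous ones and added with a small learning rate, can be sketched in a few lines. The example below is plain gradient boosting with one-split regression stumps on a single feature, not XGBoost itself: it omits XGBoost's regularization, second-order updates, and the 80% row subsampling. The data are hypothetical; only `n_estimators = 180` and the learning rate of 0.01 are taken from the text.

```python
def fit_stump(x, residuals):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda xi: lmean if xi <= threshold else rmean


def boost(x, y, n_estimators=180, learning_rate=0.01):
    """Fit stumps sequentially to the residuals of the current prediction."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_estimators):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        stumps.append(stump)
        pred = [pi + learning_rate * stump(xi) for pi, xi in zip(pred, x)]
    return base, stumps, pred


def mse(y, pred):
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


# Hypothetical training data: one vegetation-index feature and yields in kg/ha.
x = [0.42, 0.48, 0.55, 0.61, 0.66, 0.70, 0.74, 0.79]
y = [1650.0, 1720.0, 1900.0, 2050.0, 2100.0, 2250.0, 2300.0, 2420.0]

base, stumps, pred = boost(x, y)
mse_before = mse(y, [base] * len(y))
mse_after = mse(y, pred)
print(f"training MSE: {mse_before:.1f} -> {mse_after:.1f}")
```

With a learning rate as small as 0.01, each stump removes only a small share of the remaining error, so the number of estimators and the learning rate have to be chosen together.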

2.3.4. CNN

In the design of the convolutional layers of the CNN model, the first layer is equipped with 64 convolutional kernels of size 3. These kernels are primarily used to capture local features in the data. Through convolution operations, they extract basic patterns and structural information from the data as the first step in feature extraction [47]. The second layer employs 128 convolutional kernels of the same size 3. Building on the features extracted by the first layer, this layer delves deeper to uncover more complex and abstract features. By doing so, it significantly enhances the model’s ability to represent data features, enabling a more comprehensive understanding of the input data [48]. The third layer is furnished with 512 convolutional kernels of size 3, further refining and extracting features at a more granular level. This layer focuses on acquiring high-level feature representations, which are crucial for advanced data analysis and classification tasks. In addition to the convolutional layers, a max-pooling layer is incorporated into the model. The max-pooling layer plays a pivotal role in the overall architecture. It not only reduces the dimensionality of the data, thereby alleviating the computational burden and memory requirements of subsequent processing steps, but also retains the most critical and representative features by selecting the maximum value within local regions [49]. This operation enhances the model’s robustness and generalization ability, enabling it to perform well on unseen data and maintain a stable performance across different datasets.
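The two operations described above, a convolution with a kernel of size 3 followed by max pooling, can be illustrated on a single monthly time series. This is a minimal sketch under several assumptions: the convolution is taken to be one-dimensional along the time axis, a ReLU activation is inserted between the two steps, and the input values and kernel weights are hypothetical rather than learned.

```python
def conv1d(signal, kernel):
    """Valid 1-D convolution (cross-correlation, as in deep learning libraries)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(values):
    """Keep positive responses and zero out negative ones."""
    return [max(0.0, v) for v in values]


def max_pool1d(values, size=2):
    """Keep the largest value in each non-overlapping window."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


# Hypothetical monthly NDVI for one county, April to September.
ndvi_series = [0.20, 0.35, 0.60, 0.80, 0.75, 0.50]

# Hypothetical kernel of size 3 that responds to increases over two months.
kernel = [-1.0, 0.0, 1.0]

features = relu(conv1d(ndvi_series, kernel))
pooled = max_pool1d(features, size=2)
print(features)
print(pooled)
```

The six-month input yields four convolution outputs, which pooling halves to two values; this is the dimensionality reduction referred to above, with the strongest local response in each window retained.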

2.3.5. ACGM

The ACGM model proposed in this study is shown in Figure 2. This model integrates the ant colony optimization (ACO) algorithm to optimize its hyperparameters [50]. The gated recurrent unit (GRU) module, which excels at processing sequential data, effectively captures long-term dependencies in sequences by updating hidden states at each time step based on the current input and previous hidden states [51]. The multi-head attention (MHA) mechanism enables the model to simultaneously focus on different aspects of the input data and extract features from multiple dimensions [52]. Through the parallel computation of multiple heads, each performing attention calculations in different representation subspaces, the MHA mechanism can capture rich feature information [53]. It also dynamically assigns weights according to the importance of the input data, learning to prioritize information relevant to the current task while downplaying irrelevant information. This mechanism enhances the model’s complexity and flexibility, enabling it to learn more complex functional mapping relationships.
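The parallel attention computation described above can be sketched as follows. Each input vector is split into equal slices (heads), scaled dot-product self-attention is applied to each slice, and the per-head results are concatenated. For brevity the learned query, key, value, and output projection matrices are replaced by identity maps, which is a simplifying assumption, and the hidden states are hypothetical values rather than outputs of the actual GRU.

```python
import math


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over a sequence of vectors."""
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append([sum(wt * v[j] for wt, v in zip(w, values))
                        for j in range(len(values[0]))])
    return outputs, weights


def multi_head_self_attention(sequence, n_heads):
    """Attend within each slice of the feature vector, then concatenate.

    Assumption: the learned projections W_Q, W_K, W_V, W_O are identity maps.
    """
    d_model = len(sequence[0])
    assert d_model % n_heads == 0, "d_model must be divisible by n_heads"
    d_head = d_model // n_heads
    merged = [[] for _ in sequence]
    for h in range(n_heads):
        part = [vec[h * d_head:(h + 1) * d_head] for vec in sequence]
        out, _ = attention(part, part, part)
        for t, vec in enumerate(out):
            merged[t].extend(vec)
    return merged


# Hypothetical hidden states for six monthly time steps, four features each.
hidden = [
    [0.1, 0.3, 0.2, 0.0],
    [0.2, 0.4, 0.1, 0.1],
    [0.5, 0.6, 0.3, 0.2],
    [0.8, 0.7, 0.6, 0.4],
    [0.7, 0.6, 0.5, 0.5],
    [0.4, 0.3, 0.2, 0.3],
]

context = multi_head_self_attention(hidden, n_heads=2)
_, example_weights = attention(hidden, hidden, hidden)
print(context[0])
```

Each row of `example_weights` sums to 1, so every output vector is a weighted average of the inputs, with larger weights on the time steps most similar to the query.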

All models were run in a Python (3.11) environment on a computer equipped with an Intel Core i5-12600KF processor (Intel Corporation, Santa Clara, CA, USA), 32 GB of RAM (KingBank Technology Co., Ltd., Shenzhen, China), and 3 TB of storage (FanXiang, Shenzhen, China), running the Windows 11 operating system.

2.4. Model Evaluation Indicators

To evaluate the predictive ability of the model, we employed several metrics, including the coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). R² serves as a key indicator for assessing the goodness of fit between predicted and actual values [33]. R² has an upper bound of 1, with values closer to 1 indicating a better fit; as defined in Equation (1), it can fall below 0 when the predictions fit worse than the mean of the observations. Because a single metric cannot comprehensively measure prediction accuracy, we also report RMSE, MAE, and MAPE. In the equations below, y_i denotes the observed yield, ŷ_i the predicted yield, ȳ the mean observed yield, and n the number of samples.

R^{2} = 1 - \frac{\sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}{\sum_{i = 1}^{n} (y_{i} - \bar{y})^{2}}

(1)

RMSE is a valuable indicator for measuring model accuracy, reflecting the model’s ability to generate predictions close to actual values [54]. It is worth noting that RMSE assigns greater weights to larger errors in the prediction results. Conceptually, it represents the standard deviation of the residuals, quantifying the difference between predicted and actual values. A lower RMSE value indicates a higher prediction accuracy of the model and a better fit between the predicted results and the true values.

\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (y_{i} - \hat{y}_{i})^{2}}

(2)

Compared with the RMSE, the MAE is less sensitive to outliers. In the calculation of the RMSE, outliers are amplified by the squaring operation, thus having a significant impact on the RMSE, but their impact on the MAE is relatively small. Therefore, the MAE can more accurately reflect the average prediction error of the model.

\mathrm{MAE} = \frac{1}{n} \sum_{i = 1}^{n} \left| y_{i} - \hat{y}_{i} \right|

(3)

On the other hand, the MAPE can effectively capture the relative difference between predicted and true values and present it in the form of a percentage. This feature gives the MAPE significant advantages in comparing datasets of different scales and units. The lower the MAPE value, the higher the prediction accuracy of the model.

\mathrm{MAPE} = \frac{100\%}{n} \sum_{i = 1}^{n} \left| \frac{y_{i} - \hat{y}_{i}}{y_{i}} \right|

(4)
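Equations (1) to (4) translate directly into code. The following sketch computes all four metrics for a small set of observed and predicted yields; the numbers are hypothetical and serve only to show the calculation.

```python
import math


def regression_metrics(y_true, y_pred):
    """R2, RMSE, MAE and MAPE as defined in Equations (1)-(4)."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return {
        "R2": 1.0 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(y - p) for y, p in zip(y_true, y_pred)) / n,
        "MAPE": 100.0 / n * sum(abs((y - p) / y)
                                for y, p in zip(y_true, y_pred)),
    }


# Hypothetical county yields in kg/ha.
observed = [1800.0, 2000.0, 2200.0, 2400.0]
predicted = [1900.0, 1950.0, 2300.0, 2350.0]

metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this example the errors are 100, 50, 100, and 50 kg/ha in absolute value, giving an MAE of 75 kg/ha and a larger RMSE of about 79 kg/ha, which shows how squaring gives the larger errors more weight.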

As shown in Figure 3, we present the framework of satellite remote-sensing data and machine learning and deep learning models used for soybean yield prediction.

3. Results

3.1. Correlation Analysis Between Variables and Yield

The Pearson correlation analysis results between vegetation indices, meteorological indices, photosynthetic indices and soybean yield are shown in Figure 4. This analysis aims to comprehensively explore the degree of association between each index and soybean yield and screen optimal input parameter combinations for machine learning and deep learning models through significance testing (p < 0.01), so as to lay a foundation for constructing high-precision soybean yield prediction models. The analysis results show that the soybean yield exhibits significant correlations with multiple types of indices. For vegetation indices, the EVI (0.36) can effectively reduce interference from atmospheric and soil background noise and sensitively reflect the vegetation growth status; the NDWI (0.27) is closely linked to the soybean yield by capturing water information in the vegetation canopy; the NDVI (0.31), as a widely used vegetation indicator, is commonly used to assess the vegetation coverage and growth status; and the GVCI (0.28) and WDRVI (0.28) reflect the vegetation health and photosynthetic capacity from different perspectives, both showing significant correlations with the soybean yield. In terms of meteorological indices, PR (0.27) and AET (0.30) directly influence the water balance during soybean growth; the PDSI (0.27) comprehensively evaluates regional drought conditions; SRAD (0.21) provides an energy foundation for photosynthesis; TMMN (0.32) and TMMX (0.25) significantly impact soybean growth and development processes; and VAP (0.51) and VPD (0.20) reflect atmospheric moisture conditions and are closely related to soybean physiological activities. 
Regarding photosynthetic indices, GPP (0.2) directly characterizes the amount of carbon fixed by vegetation through photosynthesis; and FPAR (0.28) and the LAI (0.28) reveal soybean’s photosynthetic potential and growth status from the perspectives of light energy utilization efficiency and leaf structural characteristics, respectively, with both indices showing significant associations with soybean yield. In constructing the soybean yield prediction model, this study comprehensively considered the influence of various indices on the yield and selected a series of indicators effectively reflecting the vegetation growth status, coverage, health, meteorological conditions, moisture status, photosynthetic efficiency, physiological status, and structural characteristics as model inputs. Based on the correlation analysis results, the GNDVI, GLI, and PsnNet were excluded from the model input parameters due to their non-significant correlations with the soybean yield (p > 0.01), which failed to provide effective information gain. To avoid the impact of redundant parameters on model performance, reduce computational complexity, and minimize overfitting risks, these indices were removed to ensure the model focuses on key influencing factors, thereby improving the prediction accuracy and efficiency.
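The screening step described above, keeping only the indices whose Pearson correlation with yield is significant at p < 0.01, can be sketched as follows. The p-value here is a large-sample approximation based on the Fisher z-transform, chosen so that the example needs only the standard library; the exact test used in this study is not specified, and the feature values and yields are hypothetical.

```python
import math
from statistics import NormalDist


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def approx_p_value(r, n):
    """Two-sided p-value from the Fisher z-transform (approximation, |r| < 1)."""
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def screen_features(features, target, alpha=0.01):
    """Return {name: r} for features significantly correlated with the target."""
    kept = {}
    for name, values in features.items():
        r = pearson_r(values, target)
        if approx_p_value(r, len(target)) < alpha:
            kept[name] = r
    return kept


# Hypothetical yields (kg/ha) and two candidate indices for twelve samples.
yield_kg_ha = [1800, 1850, 1900, 1950, 2000, 2050,
               2100, 2150, 2200, 2250, 2300, 2350]
features = {
    "EVI": [0.40, 0.43, 0.41, 0.46, 0.47, 0.50,
            0.49, 0.53, 0.55, 0.54, 0.58, 0.60],
    "GLI": [0.10, 0.12, 0.10, 0.12, 0.10, 0.12,
            0.10, 0.12, 0.10, 0.12, 0.10, 0.12],
}

kept = screen_features(features, yield_kg_ha)
print(kept)
```

In this toy data the first index rises with yield and is retained, while the second alternates independently of yield and is dropped, mirroring the exclusion of non-significant indices from the model inputs.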

3.2. Comparison of Soybean Yield Prediction Models

To achieve high-precision soybean yield prediction, this study compared three machine learning models (RFR, SVR, and XGBoost) and two deep learning models (CNN and ACGM). All models were trained and tested on a comprehensive dataset containing vegetation indices, meteorological elements, and photosynthetic parameters throughout the soybean growth period. Their performance metrics are summarized in Table 3. As an ensemble learning model, RFR achieved an R² of 0.41, with an RMSE of 142.66 kg/ha and a MAPE of 6.35%. The SVR model had an R² of 0.43, an RMSE of 207.68 kg/ha, and a MAPE of 11.07%. The XGBoost model under the gradient boosting framework showed the weakest fit, with an R² of 0.39, an RMSE of 202.57 kg/ha, and a MAPE of 10.13%. In contrast, the deep learning models explained considerably more of the yield variance. The CNN model, leveraging its convolutional structure for spatial feature extraction, achieved an R² of 0.64 and an RMSE of 198.13 kg/ha, a clear gain in R² over the traditional machine learning methods. The ACGM model proposed in this study performed well in both prediction years: in 2021, it achieved an R² of 0.75, an RMSE of 163.18 kg/ha, and a MAPE of 9.19%; in 2022, its R² was 0.74, its RMSE was 123.94 kg/ha, its MAE was 105.39 kg/ha, and its MAPE was 6.21%.

Figure 5 compares the actual yield data for 2021 and 2022 with the yield predictions of the five models for these two years. The y = x diagonal is included in the figure as a reference: the closer the predicted values are to the measured values, the more closely the data points cluster around this line, indicating a higher accuracy in yield prediction. For the machine learning models, the scatter points in the high-value region are mainly distributed along the diagonal, whereas those in the low-value region are more dispersed and show an overestimation of low values. In contrast, the data points of the deep learning models are concentrated on both sides of the diagonal, which is consistent with the prediction results presented in Table 3. Of the two deep learning models, the data points of the ACGM model lie closer to the diagonal and cluster clearly around it. Additionally, the model exhibits the best performance in the four evaluation indicators selected in Table 3. These findings collectively support the superior prediction performance of the constructed ACGM model.
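The four indicators reported in Table 3 can be computed as in the following minimal sketch. It assumes the standard definitions of R² (one minus the ratio of the residual to the total sum of squares), RMSE, MAE, and MAPE; the function name and the sample values are illustrative and are not taken from the data of this study.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, RMSE, MAE, and MAPE (%) for paired observed/predicted yields."""
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [o - p for o, p in zip(observed, predicted)]
    ss_res = sum(e * e for e in residuals)                # residual sum of squares
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)   # total sum of squares
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in residuals) / n,
        # MAPE is undefined when an observed value is zero.
        "mape": 100.0 * sum(abs(e) / abs(o) for e, o in zip(residuals, observed)) / n,
    }


# Illustrative county yields in kg/ha (hypothetical values).
observed_yield = [2000.0, 2200.0, 1800.0, 2400.0]
predicted_yield = [2100.0, 2100.0, 1900.0, 2300.0]
metrics = regression_metrics(observed_yield, predicted_yield)
```

RMSE and MAE carry the unit of the yield (kg/ha), whereas R² is dimensionless and MAPE is a percentage, which is why a model can rank differently depending on the indicator considered.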

3.3. Optimal Month for Soybean Yield Prediction

In the experiment, we used data from the seedling stage to the maturity stage of soybeans for yield prediction and compared the individual months of this period to determine the optimal prediction time before harvest. Restricting the data collection period also reduces the data volume and thereby improves the computational efficiency. The yield prediction results for the six months shown in Figure 6 indicate that July and August gave the best prediction performance. In July, the prediction results showed an R² of 0.69, an RMSE of 180.58 kg/ha, an MAE of 136.20 kg/ha, and a MAPE of 10.09%. In August, the corresponding values were an R² of 0.70, an RMSE of 179.50 kg/ha, an MAE of 132.70 kg/ha, and a MAPE of 8.34%. In both months, R², RMSE, and MAE indicated a good performance, while the MAPE in August was notably lower than that in July, indicating a higher prediction accuracy. Furthermore, a statistical test (t = 4.9148, p = 0.0012) indicated a significant difference between the prediction performances in July and August. Taken together, these results suggest that soybean yield prediction performs best in August.
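This section does not specify which test produced the reported t-value. As an illustration only, the sketch below computes a paired t statistic, assuming that the July and August results are compared on matched samples (for example, per-county or per-fold errors). The function name and the toy numbers are hypothetical and are not data from this study.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(sample_a, sample_b):
    """Paired t statistic and degrees of freedom for two matched samples."""
    if len(sample_a) != len(sample_b):
        raise ValueError("paired samples must have the same length")
    diffs = [a - b for a, b in zip(sample_a, sample_b)]
    n = len(diffs)
    # Standard error of the mean difference (sample standard deviation, n - 1).
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Toy matched errors (hypothetical): month A versus month B.
errors_month_a = [10.0, 12.0, 9.0, 11.0]
errors_month_b = [8.0, 9.0, 8.0, 9.0]
t_value, dof = paired_t_statistic(errors_month_a, errors_month_b)
```

Obtaining a p-value additionally requires the t distribution with n − 1 degrees of freedom, which is available in statistical packages such as SciPy (`scipy.stats.ttest_rel`).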

3.4. Analysis of Different Variables in Soybean Yield Prediction

Using the best-performing prediction model, ACGM, we explored the roles played by different indicators in soybean yield prediction. Specifically, we analyzed vegetation indices (VIs), environmental data (ED), photosynthesis-related parameters (PP), and different combinations of these variables from the optimal prediction month (August). The results are shown in Figure 7. Among the individual variable groups, ED achieved the most favorable prediction results, with an R² of 0.6356. Among the variable combinations, both the vegetation index–photosynthesis parameter combination (VIs + PP) and the environmental data–photosynthesis parameter combination (ED + PP) demonstrated strong predictive capabilities. Notably, the ED + PP combination performed best, with an R² of 0.6554. The three error indicators (RMSE, MAE, and MAPE) varied consistently with one another and inversely to R². Here, the ED + PP combination again stood out, achieving the lowest error values: an RMSE of 193.5 kg/ha, an MAE of 133.92 kg/ha, and a MAPE of 9.54%. These results indicate that the combination of environmental variables and photosynthesis-related indices plays a crucial role in soybean yield prediction.
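The comparison of variable groups and their combinations can be organized as in the following sketch, which enumerates every non-empty combination of the three groups and scores each with a caller-supplied function. The helper names, the placeholder column names, and the `evaluate` callback are assumptions made for illustration; in practice `evaluate` would train the model on the given columns and return a validation score such as R².

```python
from itertools import combinations


def feature_group_combinations(group_names):
    """Return every non-empty combination of feature groups, smallest first."""
    combos = []
    for size in range(1, len(group_names) + 1):
        combos.extend(combinations(group_names, size))
    return combos


def run_ablation(groups, evaluate):
    """Score each group combination with a caller-supplied evaluate(columns)."""
    results = {}
    for combo in feature_group_combinations(list(groups)):
        columns = [col for name in combo for col in groups[name]]
        results[" + ".join(combo)] = evaluate(columns)
    return results


# Placeholder column names (hypothetical); only the grouping matters here.
feature_groups = {
    "VIs": ["vi_1", "vi_2"],
    "ED": ["precipitation", "temperature"],
    "PP": ["gpp", "fpar", "lai"],
}

# Stand-in scorer that only counts the columns; a real scorer would train and
# validate a model on these columns.
ablation = run_ablation(feature_groups, evaluate=len)
```

Three groups give seven combinations (three single groups, three pairs, and the full set), which matches the layout of a comparison such as the one in Figure 7.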

3.5. Spatial Distribution Map of Predicted Soybean Yield

The experiments above compared the performance of various models in soybean yield prediction and showed that the ACGM model had the best predictive performance, which we attribute to an architecture designed to capture complex features and internal patterns related to the soybean yield. To verify the effectiveness and generalization ability of the ACGM model more reliably, we employed leave-one-year-out cross-validation: data from each year were reserved in turn as the validation set, while data from the remaining years were used for model training. In this way, the model’s predictive performance was evaluated on datasets from different years. Based on this validation method, we generated distribution maps of soybean yield predictions covering 2017 to 2022, which show the predicted soybean yield for each county in each year. As shown in Figure 8, the soybean yields in the northwestern region were generally low over this six-year period, while those in the central and eastern regions were relatively high. Additionally, as shown in Figure 9, the model’s prediction errors remained within the range of −10% to 10% during the six years from 2017 to 2022. This indicates that the ACGM model achieves a satisfactory prediction accuracy and stability and suggests that it generalizes well to the data characteristics of different years and regions.
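The leave-one-year-out procedure and the relative error mapped in Figure 9 can be sketched as follows. The record layout (dictionaries with `year`, `county`, and `yield` keys), the function names, and the sample values are assumptions made for illustration and do not reproduce the data pipeline of this study.

```python
def leave_one_year_out(records, year_key="year"):
    """Yield (held_out_year, train, validation) splits, one per distinct year."""
    years = sorted({record[year_key] for record in records})
    for held_out in years:
        train = [r for r in records if r[year_key] != held_out]
        validation = [r for r in records if r[year_key] == held_out]
        yield held_out, train, validation


def relative_error_percent(observed, predicted):
    """Signed prediction error as a percentage of the observed yield."""
    return 100.0 * (predicted - observed) / observed


# Hypothetical records: two counties per year for 2017 to 2022.
records = [
    {"year": year, "county": county, "yield": 2000.0}
    for year in range(2017, 2023)
    for county in ("county_a", "county_b")
]
folds = list(leave_one_year_out(records))
```

Each fold trains on five years and validates on the remaining one, so no observation from the validation year is seen during training.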

4. Discussion

This study predicts soybean yield at the county scale in Heilongjiang Province using deep learning models driven by vegetation indices, environmental data, and photosynthetic indices. To evaluate the prediction performance of the ACGM model, comparative experiments were conducted with four models: RFR, SVR, XGBoost, and CNN. Although the RFR model can process complex data, it has obvious limitations in extracting spatial and temporal dynamic features from multi-source data. In contrast, the CNN component of the ACGM model, by virtue of its local connections and weight sharing, can extract spatial features and thus addresses the difficulties that traditional statistical methods face in handling nonlinear relationships [55]. The SVR model is less capable of processing high-dimensional data and is easily disturbed by data noise, whereas the GRUs in the ACGM model use a gating mechanism to capture the temporal dynamics of crop growth [56]. Compared with the SVR model, the ACGM model therefore has clear advantages in processing time-series information, providing a reliable temporal basis for soybean yield prediction [57]. Although the XGBoost model benefits from ensemble learning, it is limited in mining nonlinear relationships within multi-source heterogeneous data. The multi-head attention mechanism of the ACGM model addresses this shortcoming: it can mine complex relationships among the data and dynamically focus on the information most closely related to the soybean yield, which improves the prediction accuracy and enhances the adaptability to complex agricultural scenarios.
Overall, the ACGM model overcomes the challenges of traditional statistical methods in handling nonlinear relationships [58] and exploits multi-source information for soybean yield prediction. It integrates time-step information to provide a reliable temporal basis for yield prediction [59], mines complex relationships among multi-source heterogeneous data to focus on key information, and tunes the hyperparameters of each module to improve the model’s generalization ability in complex agricultural environments [60]. The modules complement one another and work as an integrated whole to extract information related to the soybean yield. The application of the ACGM model to soybean yield prediction tasks across different years and regions further supports its generalization ability.

To determine the optimal time window for soybean yield prediction, this study systematically collected multi-source data throughout the soybean growth period (April to September) and performed dynamic prediction analysis using the ACGM model. The results show that when using August’s data as the prediction input, the model achieved the optimal performance in key evaluation indicators such as R², RMSE, MAE, and MAPE. This finding is highly consistent with the laws of soybean growth and development: August coincides with the pod-filling stage, during which seed development accelerates and the rate of dry matter accumulation significantly increases, making it a critical period for yield formation [61]. At this stage, both vegetative and reproductive growth of soybean plants are vigorous, with physiological indicators such as leaf photosynthetic efficiency and root absorption capacity at their most active levels [62]. This enhances the detectability of data and provides the model with rich and accurate feature inputs, effectively improving the model’s ability to capture yield formation mechanisms [63]. Therefore, the key crop growth information contained in August’s data makes it an ideal time node for constructing high-precision soybean yield prediction models, providing an important basis for optimizing data collection strategies and improving prediction timeliness.

To explore the optimal parameter combination for soybean yield prediction and clarify the specific contributions of vegetation indices (VIs), environmental data (ED), and photosynthetic parameters (PP) in yield prediction, this study systematically analyzed and compared the prediction effects of different parameter combinations [64]. As shown in Figure 7, ED alone also gave a favorable prediction performance, indicating its considerable application potential [65]. Notably, the “ED + PP” parameter combination performed best among all experimental groups, with an R² of 0.6554, an RMSE of 193.5 kg/ha, an MAE of 133.92 kg/ha, and a MAPE of 9.54%. As a critical foundation for yield prediction, ED characterizes the external environment of soybean growth. Precipitation, for example, directly participates in the water metabolism and material transportation processes of crops; its abundance or deficiency affects not only the growth and development of plants but also the final yield formation [66]. This study incorporates multi-dimensional environmental data such as temperature, solar radiation, and vapor pressure to construct a comprehensive profile of the external environment of soybean growth [67]. PP, in turn, adds the perspective of the crop’s internal physiological mechanisms to yield prediction [68]. Photosynthetic indices such as gross primary productivity (GPP) quantify the carbon assimilation of plants and thereby reflect the dynamics of photosynthesis intensity and dry matter accumulation.
When environmental data and photosynthetic parameters are organically combined, they form a complementary advantage: ED explains the external conditions for yield formation from the perspective of environmental stress and resource supply, while PP reveals the internal mechanisms of crop responses to the environment at the physiological and metabolic levels. This synergistic effect of internal and external factors significantly enhances the explanatory power and prediction accuracy of soybean yield prediction models for complex agricultural systems. The above findings further confirm the important position of photosynthetic parameters in soybean yield prediction [69]. Monitoring the dynamic changes in photosynthetic parameters and constructing multi-source information fusion models in combination with environmental data not only opens up new technical paths for soybean yield prediction but also provides theoretical support and practical guidance for dynamic monitoring of crop growth and yield estimation in the context of precision agriculture.

From the soybean yield prediction distribution map shown in Figure 8, it can be seen that the yield in the northwestern region is significantly lower than that in the southeastern region. This spatial distribution pattern is closely related to geographical environmental factors [70]. In the northwestern region, complex terrain, poor soil conditions, and water resource shortages directly restrict the growth potential of soybeans. The diverse geography of the province produces different environmental conditions, and these environmental differences play a key role in the yield distribution [71]. Due to latitudinal and altitudinal differences, the northwestern region has a low annual accumulated temperature, with low-temperature environments throughout the soybean growth cycle. In the early growth stage, low temperatures inhibit the photosynthetic efficiency and root development, delaying plant growth; in the late growth stage, low temperatures seriously affect flower bud differentiation and pollination, leading to significantly reduced final yields [72]. In sharp contrast, the central and eastern regions have suitable temperatures, abundant sunlight, and ample precipitation. This favorable combination of water and heat creates excellent environmental conditions for soybean growth, making it easier to achieve high yields in these areas [73]. In the error map in Figure 9, the spatial distribution of soybean yield prediction errors shows a high degree of similarity to the yield distribution. The northwestern region not only has generally low yield predictions but also significantly larger prediction errors. This may be because the complex terrain in the northwestern region poses many challenges for remote-sensing data collection, making it more difficult for the model to extract spatial features accurately. When processing such data, the model struggles to precisely characterize local features, thereby amplifying prediction errors.
Additionally, the complex physiological processes of soybeans under low-temperature stress enhance the nonlinear relationship between photosynthetic parameters and yield, and the existing model’s insufficient ability to characterize the yield formation mechanism in this region further exacerbates prediction errors. In contrast, the stable geographical environment and standardized farmland management in the central and southern regions provide favorable conditions for data collection and model fitting. The flat terrain ensures the consistency of remote-sensing data collection, and stable climatic conditions make crop growth patterns more predictable. Therefore, the model can more accurately fit the relationship between yield and environmental/physiological parameters, effectively reducing prediction errors. This is consistent with our finding that the addition of environmental data significantly improved the yield prediction accuracy.

In future research, we will pursue the following directions. In terms of model optimization, we will continue to explore more efficient algorithm fusion strategies to address complex agricultural environments. In data collection, we will acquire multi-source data with a high spatial resolution. Additionally, we will integrate multidisciplinary knowledge from agronomy, meteorology, soil science, and related fields to uncover the interaction mechanisms between crop physiology and environmental data.

5. Conclusions

This study proposes an ACGM model for predicting the soybean yield at the county level in Heilongjiang Province. The model integrates the ant colony optimization algorithm, convolutional neural networks, gated recurrent units, and multi-head attention mechanisms. It not only achieves high-accuracy predictions but also demonstrates strong generalization capabilities. Through systematic comparisons with random forest regression (RFR), support vector regression (SVR), XGBoost, and convolutional neural networks (CNNs), the experimental results reveal significant differences among these models in terms of metrics such as the coefficient of determination (R²), root mean square error (RMSE), mean absolute error (MAE), and mean absolute percentage error (MAPE). Notably, the ACGM model excels in all evaluations. When selecting the optimal prediction time, we found that the ACGM model performs best when using August’s data for soybean yield prediction. In comparing the predictive performances of different parameter combinations, the combination of environmental data and photosynthetic parameters (ED + PP) exhibits superior predictive effects. The model demonstrates good predictive results across different county scales and in different years, indicating its strong transferability. This study not only explores the optimal prediction time and parameter combinations but also further confirms that the ACGM model has outstanding generalization capabilities and can effectively adapt to soybean yield prediction tasks in complex agricultural scenarios.

Author Contributions

Conceptualization, H.F. and J.L. (Jian Li); methodology, H.F. and J.L. (Jian Lu); software, H.F. and J.L. (Jian Lu); validation, H.F., J.L. (Jian Lu), and X.N.; formal analysis, H.F. and X.L.; investigation, H.F. and Y.S.; resources, H.F. and J.K.; data curation, H.F. and J.K.; writing—original draft preparation, H.F. and J.L. (Jian Li); writing—review and editing, H.F., W.Z. and J.L. (Jian Li); visualization, H.F. and X.N.; supervision, X.L. and X.N.; project administration, W.Z. and X.N.; funding acquisition, J.L. (Jian Li). All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Scientific Research Project of the Jilin Provincial Department of Education (Grant No. JJKH20250574BS), the Demonstration Project for the Construction of a Provincial-Level Modern Agricultural Industry Technology System (Grant No. JLARS-2025-010216), the Ginseng Soil Improvement Technology Development Project in Jingyu County, Baishan City, Jilin Province (Grant No. 20250017), the Science and Technology Project of the Jilin Provincial Department of Agriculture and Rural Affairs (Grant No. 2024PG1204), and the Technical Service (Natural Sciences)—Commissioned Project by Enterprises and Institutions (Grant No. 20240179).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The datasets generated and analyzed during the current study are available from the corresponding author upon reasonable request. These datasets are not publicly accessible as they are subject to ongoing research and contain information that has not yet been fully disseminated.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Yu, H.; Li, J. Short- and long-term challenges in crop breeding. Natl. Sci. Rev. 2021, 8, nwab002.
2. Qin, P.; Wang, T.; Luo, Y. A review on plant-based proteins from soybean: Health benefits and soy product development. J. Agric. Food Res. 2022, 7, 100265.
3. Bazzana, D.; Foltz, J.; Zhang, Y. Impact of climate smart agriculture on food security: An agent-based analysis. Food Policy 2022, 111, 102304.
4. Wadas, W.; Kondraciuk, T. The Role of Foliar-Applied Silicon in Improving the Growth and Productivity of Early Potatoes. Agriculture 2025, 15, 556.
5. Yu, N.; Li, L.; Schmitz, N.; Tian, L.F.; Greenberg, J.A.; Diers, B.W. Development of methods to improve soybean yield estimation and predict plant maturity with an unmanned aerial vehicle based platform. Remote Sens. Environ. 2016, 187, 91–101.
6. Zhou, H.; Huang, F.; Lou, W.; Gu, Q.; Ye, Z.; Hu, H.; Zhang, X. Yield prediction through UAV-based multispectral imaging and deep learning in rice breeding trials. Agric. Syst. 2025, 223, 104214.
7. Okupska, E.; Gozdowski, D.; Pudełko, R.; Wójcik-Gront, E. Cereal and Rapeseed Yield Forecast in Poland at Regional Level Using Machine Learning and Classical Statistical Models. Agriculture 2025, 15, 984.
8. Li, Y.; Guan, K.; Yu, A.; Peng, B.; Zhao, L.; Li, B.; Peng, J. Toward building a transparent statistical model for improving crop yield prediction: Modeling rainfed corn in the U.S. Field Crops Res. 2019, 234, 55–65.
9. Keating, B.A.; Thorburn, P.J. Modelling crops and cropping systems—Evolving purpose, practice and prospects. Eur. J. Agron. 2018, 100, 163–176.
10. Osinga, S.A.; Paudel, D.; Mouzakitis, S.A.; Athanasiadis, I.N. Big data in agriculture: Between opportunity and solution. Agric. Syst. 2022, 195, 103298.
11. Feng, H.; Fan, Y.; Yue, J.; Bian, M.; Liu, Y.; Chen, R.; Ma, Y.; Fan, J.; Yang, G.; Zhao, C. Estimation of potato above-ground biomass based on the VGC-AGB model and deep learning. Comput. Electron. Agric. 2025, 232, 110122.
12. Wei, S.; Zhang, H.; Ling, J. A review of mangrove degradation assessment using remote sensing: Advances, challenges, and opportunities. GISci. Remote Sens. 2025, 62, 2491920.
13. Triantakonstantis, D.; Karakostas, A. Soil Organic Carbon Monitoring and Modelling via Machine Learning Methods Using Soil and Remote Sensing Data. Agriculture 2025, 15, 910.
14. Bregaglio, S.; Ginaldi, F.; Raparelli, E.; Fila, G.; Bajocco, S. Improving crop yield prediction accuracy by embedding phenological heterogeneity into model parameter sets. Agric. Syst. 2023, 209, 103666.
15. Arshad, S.; Kazmi, J.H.; Javed, M.G.; Mohammed, S. Applicability of machine learning techniques in predicting wheat yield based on remote sensing and climate data in Pakistan, South Asia. Eur. J. Agron. 2023, 147, 126837.
16. Lu, C.; Leng, G.; Liao, X.; Tu, H.; Qiu, J.; Li, J.; Huang, S.; Peng, J. In-season maize yield prediction in Northeast China: The phase-dependent benefits of assimilating climate forecast and satellite observations. Agric. For. Meteorol. 2024, 358, 110242.
17. Li, Y.; Liu, X.; Zhang, X.; Gu, X.; Yu, L.; Cai, H.; Peng, X. Using solar-induced chlorophyll fluorescence to predict winter wheat actual evapotranspiration through machine learning and deep learning methods. Agric. Water Manag. 2025, 309, 109322.
18. Zhu, H.; Lin, C.; Dong, Z.; Xu, J.-L.; He, Y. Early Yield Prediction of Oilseed Rape Using UAV-Based Hyperspectral Imaging Combined with Machine Learning Algorithms. Agriculture 2025, 15, 1100.
19. Lu, J.; Li, J.; Fu, H.; Tang, X.; Liu, Z.; Chen, H.; Sun, Y.; Ning, X. Deep Learning for Multi-Source Data-Driven Crop Yield Prediction in Northeast China. Agriculture 2024, 14, 794.
20. Khan, S.N.; Li, D.; Maimaitijiang, M. Using gross primary production data and deep transfer learning for crop yield prediction in the US Corn Belt. Int. J. Appl. Earth Obs. Geoinf. 2024, 131, 103965.
21. Du, J.; Zhang, Y.; Wang, P.; Tansey, K.; Liu, J.; Zhang, S. Enhancing Winter Wheat Yield Estimation With a CNN-Transformer Hybrid Framework Utilizing Multiple Remotely Sensed Parameters. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4405213.
22. Lu, J.; Li, J.; Fu, H.; Zou, W.; Kang, J.; Yu, H.; Lin, X. Estimation of rice yield using multi-source remote sensing data combined with crop growth model and deep learning algorithm. Agric. For. Meteorol. 2025, 370, 110600.
23. Lu, J.; Fu, H.; Tang, X.; Liu, Z.; Huang, J.; Zou, W.; Chen, H.; Sun, Y.; Ning, X.; Li, J. GOA-optimized deep learning for soybean yield estimation using multi-source remote sensing data. Sci. Rep. 2024, 14, 7097.
24. Michael, N.E.; Bansal, R.C.; Ismail, A.A.A.; Elnady, A.; Hasan, S. A cohesive structure of Bi-directional long-short-term memory (BiLSTM)-GRU for predicting hourly solar radiation. Renew. Energy 2024, 222, 119943.
25. Ahmed, S.F.; Alam, M.S.B.; Hassan, M.; Rozbu, M.R.; Ishtiak, T.; Rafa, N.; Mofijur, M.; Shawkat Ali, A.B.M.; Gandomi, A.H. Deep learning modelling techniques: Current progress, applications, advantages, and challenges. Artif. Intell. Rev. 2023, 56, 13521–13617.
26. Hu, Z.; Chen, L.; Luo, Y.; Zhou, J. EEG-Based Emotion Recognition Using Convolutional Recurrent Neural Network with Multi-Head Self-Attention. Appl. Sci. 2022, 12, 11255.
27. Liu, T.; Yu, L.; Bu, K.; Yan, F.; Zhang, S. Seasonal local temperature responses to paddy field expansion from rain-fed farmland in the cold and humid Sanjiang Plain of China. Remote Sens. 2018, 10, 2009.
28. Ma, H.; Wang, C.; Liu, J.; Yuan, Z.; Yao, C.; Wang, X.; Pan, X. Separate prediction of soil organic matter in drylands and paddy fields based on optimal image synthesis method in the Sanjiang Plain, Northeast China. Geoderma 2024, 447, 116929.
29. Wang, W.; Deng, X.; Yue, H. Black soil conservation will boost China’s grain supply and reduce agricultural greenhouse gas emissions in the future. Environ. Impact Assess. Rev. 2024, 106, 107482.
30. Xin, M.; Zhang, Z.; Han, Y.; Feng, L.; Lei, Y.; Li, X.; Wu, F.; Wang, J.; Wang, Z.; Li, Y. Soybean phenological changes in response to climate warming in three northeastern provinces of China. Field Crops Res. 2023, 302, 109082.
31. Wang, T.; Ma, Y.; Luo, S. Spatiotemporal Evolution and Influencing Factors of Soybean Production in Heilongjiang Province, China. Land 2023, 12, 2090.
32. Jasinski, M.F. Sensitivity of the normalized difference vegetation index to subpixel canopy cover, soil albedo, and pixel scale. Remote Sens. Environ. 1990, 32, 169–187.
33. Huete, A.R.; Liu, H.Q.; Batchily, K.V.; Van Leeuwen, W. A comparison of vegetation indices over a global set of TM images for EOS-MODIS. Remote Sens. Environ. 1997, 59, 440–451.
34. McFeeters, S.K. The use of the Normalized Difference Water Index (NDWI) in the delineation of open water features. Int. J. Remote Sens. 1996, 17, 1425–1432.
35. Nendel, C.; Reckling, M.; Debaeke, P.; Schulz, S.; Berg-Mohnicke, M.; Constantin, J.; Fronzek, S.; Hoffmann, M.; Jakšić, S.; Kersebaum, K.-C.; et al. Future area expansion outweighs increasing drought risk for soybean in Europe. Glob. Change Biol. 2023, 29, 1340–1358.
36. Zhang, H.; Zhao, Y.; Zhu, J.-K. Thriving under Stress: How Plants Balance Growth and the Stress Response. Dev. Cell 2020, 55, 529–543.
37. Sato, H.; Mizoi, J.; Shinozaki, K.; Yamaguchi-Shinozaki, K. Complex plant responses to drought and heat stress under climate change. Plant J. 2024, 117, 1873–1892.
38. Yang, A.; Luo, S.; Xu, Y.; Zhang, P.; Sun, Z.; Hu, K.; Li, M. Optimization of Irrigation and Fertilization in Maize-Soybean System Based on Coupled Water-Carbon-Nitrogen Interactions. Agronomy 2025, 15, 41.
39. Williams, M.; Rastetter, E.B.; Fernandes, D.N.; Goulden, M.L.; Shaver, G.R.; Johnson, L.C. Predicting gross primary productivity in terrestrial ecosystems. Ecol. Appl. 1997, 7, 882–894.
40. Peltier, G.; Stoffel, C.; Findinier, J.; Madireddi, S.K.; Dao, O.; Epting, V.; Morin, A.; Grossman, A.; Li-Beisson, Y.; Burlacot, A. Alternative electron pathways of photosynthesis power green algal CO₂ capture. Plant Cell 2024, 36, 4132–4142.
41. Pinker, R.T.; Laszlo, I. Global distribution of photosynthetically active radiation as observed from satellites. J. Clim. 1992, 5, 56–65.
42. Fang, H.; Baret, F.; Plummer, S.; Schaepman-Strub, G. An overview of global leaf area index (LAI): Methods, products, validation, and applications. Rev. Geophys. 2019, 57, 739–799.
43. Duan, J.; Wang, H.; Yang, Y.; Cheng, M.; Li, D. Rice Growth Parameter Estimation Based on Remote Satellite and Unmanned Aerial Vehicle Image Fusion. Agriculture 2025, 15, 26.
44. Dou, J.; Yunus, A.P.; Tien Bui, D.; Merghadi, A.; Sahana, M.; Zhu, Z.; Chen, C.-W.; Khosravi, K.; Yang, Y.; Pham, B.T. Assessment of advanced random forest and decision tree algorithms for modeling rainfall-induced landslide susceptibility in the Izu-Oshima Volcanic Island, Japan. Sci. Total Environ. 2019, 662, 332–346.
45. Zhang, F.; Deb, C.; Lee, S.E.; Yang, J.; Shah, K.W. Time series forecasting for building energy consumption using weighted Support Vector Regression with differential evolution optimization technique. Energy Build. 2016, 126, 94–103.
46. Li, Q.-F.; Song, Z.-M. High-performance concrete strength prediction based on ensemble learning. Constr. Build. Mater. 2022, 324, 126694.
47. Dai, G.; Tian, Z.; Fan, J.; Sunil, C.K.; Dewi, C. DFN-PSAN: Multi-level deep information feature fusion extraction network for interpretable plant disease classification. Comput. Electron. Agric. 2024, 216, 108481.
48. Shin, H.-C.; Roth, H.R.; Gao, M.; Lu, L.; Xu, Z.; Nogues, I.; Yao, J.; Mollura, D.; Summers, R.M. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE Trans. Med. Imaging 2016, 35, 1285–1298.
49. Shi, Z.; Ye, Y.; Wu, Y. Rank-based pooling for deep convolutional neural networks. Neural Netw. 2016, 83, 21–31.
50. Awadallah, M.A.; Makhadmeh, S.N.; Al-Betar, M.A.; Dalbah, L.M.; Al-Redhaei, A.; Kouka, S.; Enshassi, O.S. Multi-objective ant colony optimization: Review. Arch. Comput. Methods Eng. 2025, 32, 995–1037.
51. Chen, J.; Jing, H.; Chang, Y.; Liu, Q. Gated recurrent unit based recurrent neural network for remaining useful life prediction of nonlinear deterioration process. Reliab. Eng. Syst. Saf. 2019, 185, 372–382.
52. Tan, T.H.; Chang, Y.L.; Wu, J.R.; Chen, Y.F.; Alkhaleefah, M. Convolutional Neural Network with Multihead Attention for Human Activity Recognition. IEEE Internet Things J. 2024, 11, 3032–3043.
53. Liu, D.; Dong, X.; Bian, D.; Zhou, W. Epileptic Seizure Prediction Using Attention Augmented Convolutional Network. Int. J. Neural Syst. 2023, 33, 2350054.
54. Gueymard, C.A. A review of validation methodologies and statistical performance indicators for modeled solar radiation data: Towards a better bankability of solar projects. Renew. Sustain. Energy Rev. 2014, 39, 1024–1034.
55. Syed, T.N.; Zhou, J.; Lakhiar, I.A.; Marinello, F.; Gemechu, T.T.; Rottok, L.T.; Jiang, Z. Enhancing Autonomous Orchard Navigation: A Real-Time Convolutional Neural Network-Based Obstacle Classification System for Distinguishing ‘Real’ and ‘Fake’ Obstacles in Agricultural Robotics. Agriculture 2025, 15, 827.
56. Wang, J.; Wang, P.; Tian, H.; Tansey, K.; Liu, J.; Quan, W. A deep learning framework combining CNN and GRU for improving wheat yield estimates using time series remotely sensed multi-variables. Comput. Electron. Agric. 2023, 206, 107705.
57. Shi, S.; Xu, L.; Gong, W.; Chen, B.; Chen, B.; Qu, F.; Tang, X.; Sun, J.; Yang, J. A convolution neural network for forest leaf chlorophyll and carotenoid estimation using hyperspectral reflectance. Int. J. Appl. Earth Obs. Geoinf. 2022, 108, 102719.
58. Xia, Y.; Ding, D.; Chang, Z.; Li, F. Joint Deep Networks Based Multi-Source Feature Learning for QoS Prediction. IEEE Trans. Serv. Comput. 2022, 15, 2314–2327.
59. Wang, M.; Li, T. Correction: Wang, M.; Li, T. Pest and Disease Prediction and Management for Sugarcane Using a Hybrid Autoregressive Integrated Moving Average—A Long Short-Term Memory Model. Agriculture 2025, 15, 500. Agriculture 2025, 15, 774.
60. Radočaj, D.; Jurišić, M. A Phenology-Based Evaluation of the Optimal Proxy for Cropland Suitability Based on Crop Yield Correlations from Sentinel-2 Image Time-Series. Agriculture 2025, 15, 859.
61. Castro, J.C.; Dohleman, F.G.; Bernacchi, C.J.; Long, S.P. Elevated CO₂ significantly delays reproductive development of soybean under Free-Air Concentration Enrichment (FACE). J. Exp. Bot. 2009, 60, 2945–2951.
Gitelson, A.; Viña, A.; Solovchenko, A.; Arkebauer, T.; Inoue, Y. Derivation of canopy light absorption coefficient from reflectance spectra. Remote Sens. Environ. 2019, 231, 111276. [Google Scholar] [CrossRef]
von Bloh, M.; Nóia Júnior, R.d.S.; Wangerpohl, X.; Saltık, A.O.; Haller, V.; Kaiser, L.; Asseng, S. Machine learning for soybean yield forecasting in Brazil. Agric. For. Meteorol. 2023, 341, 109670. [Google Scholar] [CrossRef]
Shen, L.; Li, Z.; Hao, J.; Wang, L.; Chen, H.; Wang, Y.; Xia, B. Evaluating the Dynamic Response of Cultivated Land Expansion and Fallow Urgency in Arid Regions Using Remote Sensing and Multi-Source Data Fusion Methods. Agriculture 2025, 15, 839. [Google Scholar] [CrossRef]
Clarke, B.; Otto, F.; Stuart-Smith, R.; Harrington, L. Extreme weather impacts of climate change: An attribution perspective. Environ. Res. Clim. 2022, 1, 012001. [Google Scholar] [CrossRef]
Song, X.-P.; Li, H.; Potapov, P.; Hansen, M.C. Annual 30 m soybean yield mapping in Brazil using long-term satellite observations, climate data and machine learning. Agric. For. Meteorol. 2022, 326, 109186. [Google Scholar] [CrossRef]
He, M.; Li, H.; Sun, Z.; Li, X.; Li, Q.; Cai, J.; Zhou, Q.; Zhong, Y.; Wang, X.; Jiang, D. Drought priming enhances young spike development in wheat under drought stress during stem elongation. J. Integr. Agric. 2025; in press. [Google Scholar] [CrossRef]
Huang, J.; Tian, L.; Liang, S.; Ma, H.; Becker-Reshef, I.; Huang, Y.; Su, W.; Zhang, X.; Zhu, D.; Wu, W. Improving winter wheat yield estimation by assimilation of the leaf area index from Landsat TM and MODIS data into the WOFOST model. Agric. For. Meteorol. 2015, 204, 106–121. [Google Scholar] [CrossRef]
Croce, R.; Carmo-Silva, E.; Cho, Y.B.; Ermakova, M.; Harbinson, J.; Lawson, T.; McCormick, A.J.; Niyogi, K.K.; Ort, D.R.; Patel-Tupper, D.; et al. Perspectives on improving photosynthesis to increase crop yield. Plant Cell 2024, 36, 3944–3973. [Google Scholar] [CrossRef]
Mohammadi, S.; Rydgren, K.; Bakkestuen, V.; Gillespie, M.A.K. Impacts of recent climate change on crop yield can depend on local conditions in climatically diverse regions of Norway. Sci. Rep. 2023, 13, 3633. [Google Scholar] [CrossRef]
Devkota, K.P.; Bouasria, A.; Devkota, M.; Nangia, V. Predicting wheat yield gap and its determinants combining remote sensing, machine learning, and survey approaches in rainfed Mediterranean regions of Morocco. Eur. J. Agron. 2024, 158, 127195. [Google Scholar] [CrossRef]
Burroughs, C.H.; Montes, C.M.; Moller, C.A.; Mitchell, N.G.; Michael, A.M.; Peng, B.; Kimm, H.; Pederson, T.L.; Lipka, A.E.; Bernacchi, C.J.; et al. Reductions in leaf area index, pod production, seed size, and harvest index drive yield loss to high temperatures in soybean. J. Exp. Bot. 2023, 74, 1629–1641. [Google Scholar] [CrossRef] [PubMed]
Wu, P.; Wang, Y.; Shao, J.; Yu, H.; Zhao, Z.; Li, L.; Gao, P.; Li, Y.; Liu, S.; Gao, C.; et al. Enhancing productivity while reducing water footprint and groundwater depletion: Optimizing irrigation strategies in a wheat-soybean planting system. Field Crops Res. 2024, 309, 109331. [Google Scholar] [CrossRef]

Figure 1. Map of the study area and soybean cultivation areas in 2022. Green indicates the soybean mask.

Figure 2. Structural diagram of the model based on the ant colony optimization algorithm, convolutional neural network, gated recurrent unit, and multi-head attention mechanism.

Figure 3. Comprehensive framework diagram for soybean crop yield prediction based on satellite remote-sensing data and machine and deep learning models.

Figure 4. Correlation heatmap between variables and soybean yield. * indicates that the p-value of the correlation coefficient (r) is less than 0.01, i.e., the correlation is statistically significant at the 1% level.

Figure 5. Scatter plots of yield predictions based on RFR, SVR, XGBoost, CNN, and ACGM models in 2021 and 2022.

Figure 6. Monthly soybean yield prediction performance for Heilongjiang from April to September, evaluated by R², RMSE, MAE, and MAPE.

Figure 7. Performance of various data combinations in predicting county-level soybean yield using the ACGM model.

Figure 8. Spatial distribution maps of county-level soybean yield predictions using the ACGM model for 2017–2022.

Figure 9. Spatial distribution map of the errors between the predicted and actual soybean yields at the county level from 2017 to 2022.

Table 1. Sources of data.

Data	Variable	Temporal Resolution	Spatial Resolution	Source
Vegetation indices	NDVI, EVI, NDWI, RVI, GNDVI	16 days	500 m	MOD13A1
Environmental data	PR, AET, PDSI, DEF, SRAD, TMMN, TMMX, VPD, VAP	Monthly	1 km	TerraClimate datasets
Photosynthetically active indices	Gpp, PsnNet, Fpar, Lai	Monthly	500 m	MODIS
Soybean yield and planting area	Planting area	Year	30 m	https://doi.org/10.5194/essd-12-3081-2020 (accessed on 23 June 2024)
Soybean yield and planting area	Yield data for soybean	Year	City	https://tjj.hlj.gov.cn/ (accessed on 23 June 2024)

Table 2. Spectral vegetation indices.

Vegetation Index	Description	Formula
NDVI	Normalized difference vegetation index	$\mathrm{NDVI} = \frac{\mathrm{NIR} - R}{\mathrm{NIR} + R}$
EVI	Enhanced vegetation index	$\mathrm{EVI} = 2.5 \times \frac{\mathrm{NIR} - R}{\mathrm{NIR} + 6 \times R - 7.5 \times \mathrm{BLUE} + 1}$
NDWI	Normalized difference water index	$\mathrm{NDWI} = \frac{\mathrm{GREEN} - \mathrm{NIR}}{\mathrm{GREEN} + \mathrm{NIR}}$
RVI	Ratio vegetation index	$\mathrm{RVI} = \frac{\mathrm{NIR}}{R}$
GNDVI	Green normalized difference vegetation index	$\mathrm{GNDVI} = \frac{\mathrm{NIR} - \mathrm{GREEN}}{\mathrm{NIR} + \mathrm{GREEN}}$
GVCI	Green vegetation canopy index	$\mathrm{GVCI} = \frac{\mathrm{GREEN} - R}{\mathrm{GREEN} + R}$
SAVI	Soil-adjusted vegetation index	$\mathrm{SAVI} = \frac{(\mathrm{NIR} - R)(1 + 0.5)}{\mathrm{NIR} + R + 0.5}$
WDRVI	Wide-dynamic-range vegetation index	$\mathrm{WDRVI} = \frac{0.15 \times \mathrm{NIR} - R}{0.15 \times \mathrm{NIR} + R}$
GLI	Green leaf index	$\mathrm{GLI} = 2 \times \mathrm{GREEN} - R - \mathrm{BLUE}$
CVI	Chlorophyll vegetation index	$\mathrm{CVI} = \frac{\mathrm{NIR}}{R} \times \frac{\mathrm{GREEN}}{\mathrm{BLUE}}$
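
The formulas in Table 2 can be evaluated directly from band reflectances. The following is a minimal sketch, not the authors' processing code: the function name `vegetation_indices` and the sample reflectance values are hypothetical, and the inputs are assumed to be unitless surface reflectances for the near-infrared, red, green, and blue bands.

```python
def vegetation_indices(nir, red, green, blue):
    """Evaluate the Table 2 formulas for one pixel.

    Inputs are assumed to be unitless surface reflectances (0-1).
    The function name and argument names are illustrative only.
    """
    return {
        "NDVI": (nir - red) / (nir + red),
        "EVI": 2.5 * (nir - red) / (nir + 6 * red - 7.5 * blue + 1),
        "NDWI": (green - nir) / (green + nir),
        "RVI": nir / red,
        "GNDVI": (nir - green) / (nir + green),
        "GVCI": (green - red) / (green + red),
        "SAVI": (nir - red) * (1 + 0.5) / (nir + red + 0.5),
        "WDRVI": (0.15 * nir - red) / (0.15 * nir + red),
        "GLI": 2 * green - red - blue,
        "CVI": (nir / red) * (green / blue),
    }


# Hypothetical reflectances for a dense green canopy (not data from the study).
sample = vegetation_indices(nir=0.45, red=0.05, green=0.08, blue=0.04)
for name, value in sample.items():
    print(f"{name}: {value:.3f}")
```

In practice the same expressions would be applied per pixel to the MODIS bands and then aggregated over the soybean mask within each county; that aggregation step is not shown here.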

Table 3. Soybean yield prediction performance of the ACGM model and the comparison models in 2021 and 2022.

Year	Model	R²	RMSE (kg/ha)	MAE (kg/ha)	MAPE (%)
2021	RFR	0.41 ± 0.10	142.66 ± 25.58	113.75 ± 16.98	6.35 ± 1.27
	SVR	0.43 ± 0.05	207.68 ± 33.57	171.92 ± 38.74	11.07 ± 2.70
	XGBoost	0.39 ± 0.04	202.57 ± 34.37	161.95 ± 22.15	10.13 ± 2.47
	CNN	0.64 ± 0.02	198.13 ± 4.94	150.17 ± 4.19	12.81 ± 0.36
	ACGM	0.75 ± 0.02	163.18 ± 5.02	127.83 ± 8.98	9.19 ± 0.79
2022	RFR	0.42 ± 0.08	198.34 ± 32.04	160.54 ± 21.81	10.43 ± 2.21
	SVR	0.49 ± 0.05	133.09 ± 30.95	107.26 ± 25.04	5.96 ± 1.78
	XGBoost	0.36 ± 0.03	153.30 ± 19.33	119.50 ± 13.28	6.51 ± 0.82
	CNN	0.66 ± 0.02	143.27 ± 4.79	115.69 ± 5.07	6.57 ± 0.39
	ACGM	0.74 ± 0.02	123.94 ± 4.78	105.39 ± 3.95	6.21 ± 0.30
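
For readers who want to reproduce the four indicators reported in Table 3, the sketch below computes R², RMSE, MAE, and MAPE from paired observed and predicted yields. It rests on assumptions: R² is taken as 1 − SS_res/SS_tot (the paper may define it differently, for example as a squared correlation), the function name `regression_metrics` is hypothetical, and the yield values are invented for illustration rather than drawn from the study.

```python
import math


def regression_metrics(observed, predicted):
    """Return R2, RMSE, MAE, and MAPE (%) for paired yield values.

    R2 is assumed here to be 1 - SS_res / SS_tot.
    """
    n = len(observed)
    mean_obs = sum(observed) / n
    residuals = [p - o for o, p in zip(observed, predicted)]
    ss_res = sum(r * r for r in residuals)
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": math.sqrt(ss_res / n),
        "MAE": sum(abs(r) for r in residuals) / n,
        "MAPE": 100 * sum(abs(r) / abs(o) for r, o in zip(residuals, observed)) / n,
    }


# Invented county yields in kg/ha, for illustration only.
observed = [1800.0, 2000.0, 2200.0, 2400.0]
predicted = [1900.0, 1950.0, 2300.0, 2350.0]
metrics = regression_metrics(observed, predicted)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

RMSE and MAE carry the units of the yield (kg/ha), while R² is unitless and MAPE is a percentage, which is why a model can rank differently across the columns of Table 3.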

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
