UAV-Based Hyperspectral and Ensemble Machine Learning for Predicting Yield in Winter Wheat

Winter wheat is a widely-grown cereal crop worldwide. Using growth-stage information to estimate winter wheat yields in a timely manner is essential for accurate crop management and rapid decision-making in sustainable agriculture, and to increase productivity while reducing environmental impact. UAV remote sensing is widely used in precision agriculture due to its flexibility and increased spatial and spectral resolution. Hyperspectral data are used to model crop traits because of their ability to provide continuous rich spectral information and higher spectral fidelity. In this study, hyperspectral image data of the winter wheat crop canopy at the flowering and grain-filling stages was acquired by a low-altitude unmanned aerial vehicle (UAV), and machine learning was used to predict winter wheat yields. Specifically, a large number of spectral indices were extracted from the spectral data, and three feature selection methods, recursive feature elimination (RFE), Boruta feature selection, and the Pearson correlation coefficient (PCC), were used to filter high spectral indices in order to reduce the dimensionality of the data. Four major basic learner models, (1) support vector machine (SVM), (2) Gaussian process (GP), (3) linear ridge regression (LRR), and (4) random forest (RF), were also constructed, and an ensemble machine learning model was developed by combining the four base learner models. The results showed that the SVM yield prediction model, constructed on the basis of the preferred features, performed the best among the base learner models, with an R2 between 0.62 and 0.73. The accuracy of the proposed ensemble learner model was higher than that of each base learner model; moreover, the R2 (0.78) for the yield prediction model based on Boruta’s preferred characteristics was the highest at the grain-filling stage.


Introduction
Winter wheat is one of the three major cultivated cereals and is the most widely-grown cereal crop in the world [1]. Wheat plays a crucial role in global food production, trade, and food security [2]. Estimating wheat yield prior to harvest on a large scale not only offers a scientific foundation for local governments to establish production goals, but also ensures food security [3]. Therefore, the timely and accurate estimation of winter wheat yield is crucial for intelligent agricultural management and people's livelihoods.
The traditional yield assessment of winter wheat involves destructive sampling in the field in order to determine yield, which is not only time consuming, less objective, and lacking in robustness and sustainability, but also fails to monitor the crop growth throughout its reproductive life [4]. The development of remote sensing technology in recent years has provided a non-destructive, rapid, and efficient way to monitor crop growth [5]. Remote sensing techniques include ground-based platforms, satellite-based platforms, and UAV-based platforms [6]. The data collection from the ground-based poor repeatability [26,28,29]. Most of the previous studies have focused on the mining of spectral information and the exploration of regression techniques based on machine learning algorithms [30,31], and there has been little discussion and research on model fusion. Therefore, the potential problems of small sample size and single machine learning algorithms can limit the application to winter wheat yield estimation in practical production. In order to address this issue, we introduced decision-level fusion (DLF) models in ensemble machine learning. The DLF models fuse multichannel/multiscale information and typically produce more consistent and better prediction performance than individual models, have good noise immunity, can handle high-dimensional data, provide complete and detailed object information, and are simple to implement and fast to train [32,33]. These models are extensively used in the fields of injury detection, artificial intelligence, and image processing [34][35][36]. Based on previous studies, machine learning and hyperspectral imagery have been used successfully in many applications, but the strategy based on DLF model fusion has not yet been applied to crop yield prediction [37,38].
The aim of this study was to estimate winter wheat yield using hyperspectral imagery from a UAV. The specific objectives included the following: (1) investigating the potential of hyperspectral imagery for winter wheat yield prediction, (2) evaluating the performance of winter wheat yield prediction models under different feature selection methods, and (3) building a DLF model based on individual machine learning algorithms in order to improve prediction performance.

Experimental Design and Data Collection
This research trial was conducted in the 2019-2020 growing season at the experimental base of the Chinese Academy of Agricultural Sciences in Xinxiang, Henan Province (113 • 45 38 N, 35 • 8 10 E). During the winter wheat reproductive period (November 2019-June 2020), the total monthly rainfall, average monthly temperature, and average monthly sunshine hours all reached their maximums in May, while the monthly relative humidity reached its maximum in January ( Figure 1). Rainfall was mainly concentrated in January, February, April, and May; temperature and sunshine hours both increased gradually from January onwards as the crop developed; and relative humidity was fairly constant throughout the season.
The trial area shown in Figure 2 consisted of 180 plots with three irrigation treatments set at high irrigation (irrigation treatment 1, IT1), moderate irrigation (irrigation treatment 2, IT2), and low irrigation (irrigation treatment 3, IT3) during the full growth period, using large sprinklers corresponding to a total irrigation water depth of 240 mm, 190 mm, and 145 mm, respectively. The irrigation schedule for each stage is shown in Table 1. Each irrigation treatment had 60 plots, 8 m long and 1.4 m wide, with an area of 11.2 m 2 . Thirty varieties of winter wheat were selected for this experiment, and each irrigation treatment was replicated twice in a group of 30 wheat varieties to ensure the objectivity of the experiment. For production fields, pesticide and fertilizer management was performed according to local management practices. At maturity (3 June 2020), winter wheat yields were collected using a plot combine.  Tillering  35  35  35  Overwintering  35  35  35  Greening  35  25  20  Jointing  50  35  20  Heading  50  35  20  Grain filling  35  25  15  Total  240 190 145 Figure 1. Meteorological conditions during the wheat growth period from November to May: (a) total monthly rainfall, (b) average monthly temperature, (c) average monthly humidity, and (d) average monthly sunshine hours.

Acquisition and Processing of Hyperspectral Data
The M600 Pro (SZ DJI Technology Co, Shenzhen, China) was used as the flight platform with an onboard Resonon Pika L nano-hyperspectral propulsion scanner to acquire hyperspectral data. The Resonon Pika L Nano-Hyperspec with meter-level accuracy is a lightweight (0.6 kg) hyperspectral sensor specifically designed for use on UAV platforms. This sensor has 300 spectral bands in the 400-1000 nm wavelength range with a band width of 2.1 nm, including the visible and near infrared regions. It is externally pushed and scanned with a choice of scanning angles (vertically downwards, horizontally, or at any angle). The Resonon Pika L Nano-Hyperspec features a focal length of 12 mm and offers a 22 • field of view. Each scan line contains 640 pixels with a pixel pitch of 6 µm. The spectral resolution and resampling intervals are 6 nm and 2 nm, respectively. The sensor also includes a GPS/inertial measurement unit (GPS/IMU) navigation system, which enables the gathering of real-time altitude data from the UAV platform, allowing for better reflection calibration and geographic alignment. Depending on the environmental circumstances, certain criteria were established to fit the site size survey in this study. To ensure the quality of the data, hyperspectral data corresponding to the flowering (Zadok 65) and grain-filling (Zadok 85) stages of the wheat were acquired on 30 April 2020 and 13 May 2020, respectively. Both UAV flights were carried out between 10 a.m. and 2 p.m. in clear and cloudless weather conditions to minimize the effect of shadows. The UAV flew at a speed of 5 m/s at a height of 40 m, with a ground sampling distance of 2.5 cm. Three 0.25 m 2 reference panels that differed in brightness (95% white, 40% grey, and 5% black) were placed within the study area for postprocessing and measured with the spectrometer. In this study, 12 ground control points (GCPs) were evenly distributed across the field as precise georeferenced positions, and their centimeter-level positioning accuracy was obtained through the differential global positioning systems.
The acquired hyperspectral data was subjected to radiometric correction, atmospheric correction, and geometric correction. Hyperspectral images were acquired at an altitude of 50 m and under stable light conditions, so atmospheric correction was not required. Spec-trononPro software (version 3.4.0, Resonon) was used for hyperspectral image correction. For hyperspectral radiometric corrections, empirical linear corrections were made using the measured images and field spectra of the wheat and reference panels. The hyperspectral image radiance data was converted to reflectance by the known reflectance of the white reference panel. Three standard panels with different reflectance properties were placed in the flight area to derive the three parameters for atmospheric correction. Geographical correction used position and attitude parameters from the GPS/IMU and the relationship between the GPS/IMU and the imager. The parameters were converted between their respective coordinate systems. Noise in the image data can cause large differences at the beginning and end of the spectral range shown by the image and field spectra, so it is necessary to eliminate certain bands from the image data. The background (shadows and dirt) was eliminated from each plot by thresholding the NIR band at a wavelength of 800 nm. According to previous studies, vegetation usually has higher reflection values than the background in the NIR region, which is the reason behind our filtering method, setting the threshold at 30%, and removing noise bands below 440 nm and above 960 nm.

Acquisition of Spectral Indices
Hyperspectral data acquired using UAVs consists of hundreds of bands that contain a wealth of spectral information, and many of the adjacent bands are highly correlated with each other [39]. Sixty published spectral indices calculated using spectral reflectance were selected for predicting yield (Table 2), with each spectral index derived from two or more spectral bands. These spectral indices included the curvative index (CI), chlorophyll absorption index (CAI), normalized difference vegetation index (NDVI), simple ratio index (SR), pigment-specific normalized difference (Psnd), renormalized difference vegetation index (RDVI), triangular vegetation index (TVI), modified versions of these indices, such as the modified normalized difference (MND), modified simple ratio (MSR), normalized difference (ND), and their combinations MCARI/ MTVI2, among others. The majority of the bands utilized are in the red, NIR, and red-edge spectral regions.

Feature Selection Methods
The choice of input features is as important as the choice of the algorithm to be used when building the model. In supervised learning, feature selection is often used prior to model development to minimize the feature set dimensionality and thus gain performance improvements in the learning algorithm. In this study, 60 spectral indices were chosen, so it was ideal to select the most sensitive spectral indices to reduce the number of features. The following three common feature selection methods were used in this study to rank the importance of features: recursive feature elimination (RFE), Boruta, and the Pearson correlation coefficient (PCC).
Recursive feature elimination (RFE) [71] is a wrapper-based feature selection method that selects features with the help of a classification method. RFE requires training multiple classifiers to reduce the feature dimension, training time increases with the number of classifiers trained, and each part of the analysis can continue to be iterated, saving computational time. Low-weighted features are eliminated in each iteration, while equal weights are assigned to relevant attributes. RCE is performed in three steps, as follows: (1) an estimator is used to estimate the initial features' importance scores, (2) the feature with the lowest significance score is eliminated, and (3) a rank in given to each deleted variable in the order in which it was removed.
The Boruta [72] algorithm is a wrapper method built around the random forest algorithm. It provides criteria for a number of important factors and captures the outcome variables for all relevant features in the dataset by scoring all candidate features as well as shaded features. The importance value of the shaded features depends on whether the candidate features are significantly correlated or not. The Boruta algorithm steps are as follows: (1) randomly disrupt the feature order to obtain the shadow feature matrix, (2) train the model with the shadow feature matrix as input, (3) take the maximum value in the shadow features and record the cumulative hits of the real features to mark the features as important or unimportant, and (4) remove the unimportant features and repeat the first three steps until all the features are marked.
The Pearson correlation coefficient (PCC) [73] is a measure of the correlation between two variables, and it varies between −1 and 1. The PCC(r xy ) may be calculated using the following equation: x i and y = 1 y n ∑ i=1 y i denote the means of x and y, respectively, with n representing the sample size. The PCC is invariant to both linear and non-linear changes of the variables. The absolute value of PCC was used to compute the feature significance scores.

Decision-Level Fusion Model for Ensemble Learning
In this study, support vector machines (SVM), Gaussian process (GP), linear ridge regression (LRR), and random forest (RF) were the four regression models based on DLF for ensemble learning. The 'caret' R package in R4.0.2 was used to build the individual learner and DLF framework. The basic principle of DLF is shown in Figure 3. The hyperspectral index and winter wheat yield data pairs from 180 plots were randomly and uniformly divided into five groups, one group of which (n = 36) was randomly taken as the validation set and the remaining four groups (n = 144) as the training set. Predictions were made for each fold by training the model and five-fold cross-validation. In the five-fold crossvalidation process, the winter wheat yield predictions were generated separately for each regression model, and the model effects could be observed by examining the results of the individual learners on the validation set. The m individual learners would get a prediction matrix of m × n dimensions after completing the above process (n was the number of samples in the training set and m was the number of individual learners), and the results of the prediction matrix were then used to train the DLF model to make the final prediction.
Importantly, a five-fold cross-validation method was used in all of the models to ensure the reasonableness of the comparison between methods. To avoid uncertainty in the results, the process of dividing the data into training and validation sets using the five-fold crossvalidation method was repeated 40 times to generate 200 models, and the mean prediction accuracy of the validation set of these 200 models was used as the final evaluation metric.

Regression Methods
Based on a survey of previous research, and in order to assess the effectiveness of different machine learning algorithms and to better comprehend the non-linear connection between the dependent and independent variables, the following four widely utilized machine learning models were selected and used for comparison: SVM, GP, LRR, and RF. The four machine learning algorithms are described below, as follows: SVMs (support vector machines) [74], which benefit from statistical learning theory and the principle of minimal structural risk, are sparse and robust classifiers, mainly used for the classification and regression of high-dimensional samples. SVMs are increasingly popular in existing research areas because of their characteristics, such as good generalization ability and robustness to noise. SVMs are trained on sufficient samples in order to obtain a set of samples that approximate the hyperplane by fitting estimates of successive optimal output variables. The hyperplane is approximated by two important parameters, the kernel function and the loss function. The radial basis function was utilized as the kernel function in this research to change the regularization parameters using a cross-validation method.
GP (Gaussian process) [75] is a supervised learning process for estimating regression model parameters through sample learning. GP belongs to the stochastic process in probability theory and mathematical statistics, where any linear combination of random variables conforms to a normal distribution. GP is now widely used in modelling in the field of remote sensing, and therefore this algorithm was used in this study. LRR (linear ridge regression) [76] is a biased estimation regression method dedicated to covariance data analysis. LRR obtains more objective regression coefficients by losing some information and reducing precision. Typically, LRR has a low R 2 and high regression coefficients, and is widely used in co-linear problems and research with a large amount of data. The LRR algorithm was used in this study to construct yield estimation models.
RF (random forest) [76] is an ensemble learning method that constructs multiple decision trees and can perform decision making and regression. RF is able to model the relationship between dependent and independent variables based on decision rules. It can handle a large number of input variables, assess the importance of variables while deciding on categories, produce higher accuracy, balance errors, and quickly mine the data. Therefore, the RF algorithm was used for modelling in this study.
The machine learning algorithms used in this study were all implemented independently. To improve the prediction accuracy of the models, we further processed these results to construct a DLF model [77], which is a model that fuses the results of different machine learning models by the training weights obtained. Based on previous research, a weighted prior (WP) approach was introduced to construct the DLF model, taking into account the estimated variance of each model. The DLF and WP can further improve the model accuracy and generalization ability and minimize the result bias. The procedure for this method is as follows [78]: where ε (i) is the estimation variance, y (i) is the predicted value from the i th model, and y is the observed value.
where N is the total number of samples in the training set.
where l denotes the total number of models and w i denotes the weight of the i th model.
where w * i is the final DLF weighting.
where y WP is the final result based on the WP method. The individual machine learning models were used as input to build the DLF model.

Cross-Validation and Parameter Optimization
A five-fold cross-validation was used to form the prediction matrix in the personal machine learning process of the DLF, which can be used as external cross-validation. In addition, internal random grid search cross-validation allows the fine-tuning of the hyperparameters of the individual learner shown in Figure 4. In external cross-validation, the original dataset is randomly divided into five equal parts (Figure 4), one of which is then used as the validation set and the remaining four as the training set each time. Each training set used for external cross-validation was also randomly divided into five equal parts, of which 1/5 was used as the validation set for internal cross-validation and the remaining 4/5 was used as the training set for internal cross-validation. The model was trained by setting different combinations of candidate hyperparameters for internal crossvalidation, and the model was then validated on the internal cross-validation set. Each hyperparameter combination was validated five times, and after training model evaluation, the hyperparameter combination with the highest average validation accuracy was applied to the outer cross-validation to construct the ideal model.

Statistical Analysis
In this study, the regression model was evaluated in the four following ways: coefficient of determination (R 2 ), root mean square error (RMSE), ratio of performance to interquartile distance (RPIQ), and ratio of performance to deviation (RPD). The criteria for evaluating models are yield estimation models with higher accuracy, and an RPD of >1.5 is usually considered to indicate a reliable prediction. The formulae for the four evaluation methods are as follows: where y i is the measured value,ŷ i is the predicted value, y is the mean of the measured values, N is the sample size, SD is the standard deviation of the measured value of the prediction set, Q 3 is the lower boundary of the third quartile, and Q 1 is the upper boundary of the first quartile.

Descriptive Statistics
The yield of winter wheat in all of the test plots in this study was 6.55 t·ha −1 , and the mean yields differed for the three irrigation treatments. The yield statistics for the test plots under each irrigation treatment and all of the plots are shown in Table 3. In general, the treatments with higher irrigation levels were associated with higher yields. IT1 had the highest average yield of 7.97 t·ha −1 , followed by IT2 at 6.73 t·ha −1 , and IT3 at 4.94 t·ha −1 . The data ranges, quantile statistics, standard deviations (SD), and coefficients of variation (CV) for the yield datasets for all of the plots and the three experimental treatments showed significant yield differences between the treatments and well separated datasets. The simple linear regression coefficients of determination for each vegetation index at the flowering and the grain-filling stages are shown in Table A1. The results show that the R 2 of each spectral index in the grain-filling stage was mostly greater than that in the flowering stage. The RVSI index performed best at both stages, with R 2 values of 0.48 at the flowering stage, and 0.49 at the grain-filling stage. The poorest performing index was CI in the flowering stage, with an R 2 of 0.08, and the index with the poorest performance in the grain-filling stage was TCARI/OSAVI, with an R 2 of 0.1.

Feature Importance Ranking
In this study, RFE, Boruta, and PCC methods were used to rank the importance of 60 vegetation indices at the flowering and grain-filling stages. The results of the ranking of the importance of each vegetation index are shown in Table A2 of Appendix A. Comparing the ranking of feature importance at the flowering and grain-filling stages for the three feature selection methods revealed that RVSI ranked highly and performed consistently well overall. The ranking of each of the other vegetation indices for the different stages varied with the different feature selection methods. Of the 60 vegetation indices selected, 23 were composed of three or four bands, and about 15 of them were in the top 40 in order of importance. We also noted that two integer indices, MCARI/MTVI2 and TCARI/OSAVI, were ranked in the top 40 by both the RFE and Boruta trait-screening methods in both of the wheat growth stages. Both of the indices were ranked in the top 25 after RFE screening at the grain-filling stage. After the PPC trait-screening method, both of the indices were ranked outside of the top 40 at the flowering stage, and only MCARI/MTVI2 remained in the top 40 at the grain-filling stage.

Comparison and Performance of Feature Selection Methods and Model Accuracy
In order to further explore the high-performance features, a total of 60 features were iteratively added to the machine learning model, starting with the first feature in each order, and updating the model training performance until all of the 60 features were included. The training accuracy was calculated for four base models (SVM, GP, LRR, and RF), obtained under three feature selection methods, for the two wheat growth stages ( Figure 5). For the SVM model, the Boruta method performed best in both the flowering and grain-filling stages, followed by PCC and RFE, and the accuracy of the model improved as the number of features increased ( Figure 5(a1,2)). For the GP model, the flowering stage was more accurate when using the Boruta method, followed by the PCC and RFE methods, and the grain-filling stage was better with the Boruta method and PCC compared to RFE ( Figure 5(b1,2)). For the LRR model, the RFE method performed best at the flowering stage. The Boruta method performed the best at the grain-filling stage and PCC performed the worst ( Figure 5(c1,2)). In the RF model, the best accuracy was achieved at the flowering and grain-filling stages when the Boruta method was used to rank the models, with the PCC and RFE methods performing in general agreement at the flowering stage and the results of the RFE method were the worst at the grain-filling stage ( Figure 5(d1,2)). The combined results showed that the accuracy of all of the four models (SVM, GP, LRR, and RF) remained stable as the number of features increased after about 25 features. Therefore, this study used the top 25 features for the ensemble model development.
Comparing the R 2 of the four models constructed for the two growth stages showed that the LRR model had the lowest accuracy, with R 2 ranging from 0.48 to 0.54 at the flowering stage and 0.48 to 0.63 at the grain-filling stage, and after the input features were stable, the R 2 values were 0.54 and 0.59, respectively. The R 2 of the GP model ranged from 0.12 to 0.72 for the flowering stage, and 0.55-0.81 for the grain-filling stage. The RF model had the highest accuracy, with R 2 ranging from 0.76 to 0.94 at the flowering stage and 0.86-0.95 at the grain-filling stage, and when the input features were stabilized, the R 2 values were 0.93 and 0.95, respectively.
The five models (four base models and the DLF model) were trained using the full features of the training samples and selected features, and model performance was evaluated on the validation samples. The mean values of the validation accuracy obtained from 200 trials are shown in Table 4. Among the base models constructed in this study, the validation accuracy of the SVM model constructed using the RFE method with the preferred spectral indices at the flowering stage was the highest (R 2 = 0.63, RMSE = 1.03 t·ha −1 , RPIQ = 2.40, RPD = 1.60), and the validation set accuracy of the SVM model constructed using the Boruta method with the preferred features at the grain-filling stage was the highest (R 2 = 0.73, RMSE = 0.87 t·ha −1 , RPIQ = 2.74, RPD = 1.90). Among the constructed DLF models, the best accuracy of the models constructed using the Boruta and PCC methods with the preferred features at the flowering stage achieved an R 2 of 0.66, and the highest accuracy of the models constructed using the Boruta method with the preferred features was at the grain-filling stage (R 2 = 0.78, RMSE = 0.79 t·ha −1 , RPIQ = 2.99, RPD = 2.08). Overall, all of the methods gave an R 2 of 0.56 or higher, indicating the effectiveness of these models in estimating the winter wheat yield. The DLF models outperformed all of the individual models. The R 2 values for the DLF models constructed using the preferred features were ≥0.65 for the flowering stage, and 0.63 for the DLF models constructed using all features. At the grain-filling stage, the R 2 values were ≥0.77 for the DLF models constructed using the preferred features, and 0.75 for the DLF models constructed using all of the features at the grain-filling stage. The accuracy of all of the feature selection methods was improved in this study relative to the full feature model, and the RFE method improved the most at the flowering stage. The R 2 values of the SVM, GP, LRR, RF, and DLF models improved by 0.04, 0.03, 0.04, 0.03, and 0.02, reaching 0.63, 0.59, 0.62, 0.60, and 0.65, respectively, at the flowering stage. The Boruta method improved the most at the grain-filling stage. The R 2 values for the five models increased by 0.05, 0.05, 0.06, 0.03, and 0.03, reaching 0.73, 0.72, 0.66, 0.68, 0.78, respectively. In addition, the accuracy of the models was higher at the grain-filling stage compared to the flowering stage.
Scatter plots ( Figure A1) were used to better show the yield prediction performance of the models and the feature selection methods. In general, all of the models gave good results and performed well with all three of the feature selection methods. In addition, the accuracy of the DLF model varied by growth stage and feature selection method. The performance was stable across all of the feature selection methods at the different growth stages, indicating that it was more adaptable to different feature selection methods. Most of the observed and predicted yields obtained from the DLF model showed good agreement with each other, and it was good at simulating the high and low yields obtained at harvest for the different irrigation treatments.

Yield Distribution
A comparison of all of the models used in this study revealed that the DLF model, constructed using the Boruta method for preferential feature selection at the grain-filling stage, achieved the best accuracy, and it was therefore used to generate a distribution of predicted yields ( Figure 6). The results of the t-test analysis between the different irrigation treatments are shown in Table 5 and indicate that the yield distribution differed significantly between the three treatments, in the order IT1 > IT2 > IT3. Overall, the predicted yield distribution in the IT1 treatment was in the range of 5 to 10 t·ha −1 . Based on the observed results, the IT1 treatment had the highest yield of 5 to 9 t·ha −1 , followed by the IT2 and IT3 treatments; this is consistent with the yield distribution predicted by the DLF model and demonstrates the feasibility of using a model to estimate yield.

Discussion
We selected 60 hyperspectral narrow-band indices for this study, of which approximately 74% were associated with red-rimmed bands. At the flowering stage, more than six indices associated with the red-edge band were in the top 10 after sorting by the RFE, Boruta, and PCC methods. These red-edge spectral indices all provided better prediction performance than other band indices, which agrees with the findings of previous studies [79][80][81]. For example, Xie et al. [82] analyzed the relationship between yield and canopy spectral reflectance of winter wheat at maturity under low-temperature stress and found that the red-edge region was associated with grain yield. However, there was a large variation in the ranking of the different indices, which may be due to the use of different feature selection methods or the different environments to which the vegetation indices apply. Some of the spectral indices performed consistently well among the different feature selection methods at the two wheat growth stages, such as the three spectral indices RVSI, DSWI-4, and ND [553,682] . The RVSI index, which consists of three bands including the red-edge band, performed well in assessing wheat rust symptoms and constructing rice physiological trait models [83], and was in the top five in the different methodological feature rankings in this study. This could be because it provided more spectral information and was more sensitive to the yield of the different feature selection methods at the different growth stages. The DWSI-4 index, originally a variant of the plant disease-water stress index constructed using simple and normalized ratios, also had good stability and performance in crop disease prediction [84]. The ND [553,682] index can be used to estimate the chlorophyll content and can minimize the effect of shading and leaf area index size [85,86]. Our study showed that these three spectral indices can be used for yield estimation. MCARI/MTVI2 and TCARI/OSAVI are integrated indices. In previous studies, their performance was better than the individual MCARI, MTVI2, and OSAVI indices, because the integrated indices had richer band information and effectively eliminate the background effects. [87,88]. The Boruta method was second to the RFE method for winter wheat at the flowering stage and performed best at the grain-filling stage, probably due to the difference in the performance between the two methods in the different environments. The Boruta method is a fully correlated feature selection method that aims to select features that are truly correlated with the dependent variable and can be used for prediction, rather than model-specific selection, and can help us to understand the characteristics of the dependent variable more comprehensively and make better and more effective feature selections [24,89]. The RFE method takes into account the correlation between the features, continuously builds models to find the best features, has good generalization ability, and is a suitable method for small sample data sets [90]. The PCC method, which performed the worst in this study, is very commonly used in sensitivity feature selection in the crop science community. It does not require any model training, but does not objectively represent correlations when the correlations between the variables are complex. There is also a risk of multicollinearity between features [91,92]. In this study, the accuracy of the model construction, based on the preferred features under feature selection, was better than that of the model under the full feature condition, which was consistent with the findings of Hsu et al. 2011 [93] and validated the effectiveness and generalizability of the feature selection method.
In this study, four individual machine learning algorithms were used to construct winter wheat yield estimation models based on a subset of spectral indices obtained after feature selection. The RF model had the highest accuracy and performed best when trained using the training set data, but the RF model was not the best performer in the validation set of the model, probably due to the overfitting phenomenon of the RF model in the training set [94]. In the model training set, the LRR models all performed the worst, but in the model validation set the GP models performed the worst at the flowering stage and the LRR models performed the worst at the grain-filling stage. LRR models tend to have a lower R 2 than the ordinary regression models but can generate a value on covariance problems [95]. The GP models use the full sample for prediction, and as the dimensionality of the data rises, the effectiveness decreases [96]. The SVM models did not perform well in the training set but had the highest accuracy in the validation. SVM is a machine learning method based on the inner product kernel function. The wrong choice of kernel hyperparameters may cause a decrease in the accuracy of the model training set estimation. However, the high accuracy of the SVM model validation set was due to its better robustness, suitability for small sample data regression, and the lack of sensitivity to kernel functions with the ability to avoid dimensional catastrophe problems. [97,98]. We also found that the accuracy of yield estimation models constructed using the four independent machine learning algorithms, SVM, GP, LRR, and RF, at the two developmental stages of winter wheat also differed greatly. Based on the model validation set, the accuracy of each model at the grain-filling stage was higher than that at the flowering stage under the different feature selections. This was due to the dry matter stored in the wheat seeds through carbon assimilation in the winter wheat during grain filling, indicating that this stage contains more spectral information that can be used to predict yield. In addition, the spectral information collected from the winter wheat was increased in order to provide a more comprehensive and accurate reflection of the yield of the winter wheat [2,99]. A DLF (decision-level fusion) model was developed based on the individual machine learning models used in this study. The results showed that the DLF model performed significantly better than each of the other models when all of the features or the selected features were used. When using the selected features, the DLF model performed best at the flowering and grain-filling stages, and the model accuracy was better than that of the individual models. In addition, using selected features obtained under the different feature selection methods, the DLF model produced R 2 values of >0.65 at the flowering stage and >0.77 at the grain-filling stage. Overall, the DLF model gave more satisfactory and better results than the individual models. This was the same conclusion reached in a previous study [33] where the DLF model was able to minimize the individual model bias and improve the accuracy of the inverse model. Taken together, the above description suggests that adequacy and diversity are two important principles in the selection of base models in the decision-level fusion process [100]. This requires that the different base learners should all have a good predictive performance and be able to minimize inter-model dependencies and act as complementary information [101,102]. This prerequisite requirement is justified by the fact that the DLF methods fuse the prediction results of different independent machine learners so that the final fusion results are all influenced by each base model [103]. Furthermore, fusion of models with similar high performance will yield limited prediction results [104]. Based on the requirements of DLF and the limitation problem, this study used the SVM, GP, LRR, and RF machine learning algorithms with completely different training mechanisms to construct the yield estimation models and improved the model performance through parameter optimization, and the experimental results provided further evidence of the effectiveness of the underlying models.
This study used the acquired hyperspectral image time series to predict the yield of winter wheat, and the yield prediction model constructed for the grain-filling period had a high accuracy rate. The use of hyperspectral data to construct yield estimation models has been widely used in previous yield estimation studies, and all have achieved high model accuracy, consistent with the findings of this paper [105,106]. For example, Chandel et al. (2019) [107] used hyperspectral indices to construct a yield prediction regression model and found that the yield of irrigated wheat was estimated with an accuracy of 96%. However, relying on hyperspectral data alone for yield estimation still has some limitations. In future research, we intend to integrate UAV RGB and multispectral image data into yield estimation models as well, in order to broaden the application area of yield estimation. In addition, we will also consider examining the effects of biotic (weeds, pests, and diseases) and abiotic (nutrients, temperature, and salinity) stresses based on UAV imagery and ground data. Finally, additional feature selection methods and integrated learning methods will be considered for yield estimation in order to further improve the prediction accuracy. Therefore, in the future, we will also analyze the impact of diseases, insects, and fertility on wheat yield.

Conclusions
In winter wheat production, real-time insight into yield conditions prior to harvesting can help to optimize crop management and guide the field practices. In this study, we developed a DLF-based machine learning model for winter wheat yield prediction using UAV-based hyperspectral imagery. The narrow-band hyperspectral indices were extracted, and the most important indices were selected for model development using each of the three feature selection methods. The results showed that the RFE-based method for feature selection at the flowering stage had a higher accuracy, the Boruta-based method for feature selection at the grain-filling stage had a higher accuracy, and the DLF model outperformed the base models and achieved the highest accuracy when using the preferred features. This study demonstrates the effectiveness of using hyperspectral images to build a model for yield estimation in winter wheat.

Data Availability Statement:
The data presented in this study are available within the article.

Conflicts of Interest:
The authors declare no conflict of interest.

Disclaimer:
The findings and conclusions in this article are those of the authors and should not be construed to represent any official USDA or U.S. Government determination or policy. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture and the Chinese Academy of Agricultural Sciences. USDA is an equal opportunity provider and employer.