Alfalfa Yield Prediction Using UAV-Based Hyperspectral Imagery and Ensemble Learning

: Alfalfa is a valuable and intensively produced forage crop in the United States, and the timely estimation of its yield can inform precision management decisions. However, traditional yield assessment approaches are laborious and time-consuming, and thus hinder the acquisition of timely information at the ﬁeld scale. Recently, unmanned aerial vehicles (UAVs) have gained signiﬁcant attention in precision agriculture due to their e ﬃ ciency in data acquisition. In addition, compared with other imaging modalities, hyperspectral data can o ﬀ er higher spectral ﬁdelity for constructing narrow-band vegetation indices which are of great importance in yield modeling. In this study, we performed an in-season alfalfa yield prediction using UAV-based hyperspectral images. Speciﬁcally, we ﬁrstly extracted a large number of hyperspectral indices from the original data and performed a feature selection to reduce the data dimensionality. Then, an ensemble machine learning model was developed by combining three widely used base learners including random forest (RF), support vector regression (SVR) and K-nearest neighbors (KNN). The model performance was evaluated on experimental ﬁelds in Wisconsin. Our results showed that the ensemble model outperformed all the base learners and a coe ﬃ cient of determination (R 2 ) of 0.874 was achieved when using the selected features. In addition, we also evaluated the model adaptability on di ﬀ erent machinery compaction treatments, and the results further demonstrate the e ﬃ cacy of the proposed ensemble model.


Introduction
Alfalfa is one of the most important and widespread perennial legumes, and it is considered as a valuable forage crop with relatively high yield and nutritional value [1]. In 2018, nearly 53 million tons of alfalfa and alfalfa mixtures were harvested from about 17 million acres in the United States [2]. Due to its large scale, precise management to achieve the forage yield goal is critical to optimize profitability [3]. Timely estimation of alfalfa production within the growing season can inform precision management decisions to reduce the potential production loss. Additionally, rapidly and accurately estimating the yield within the growing season also has the potential to improve the timing in harvesting alfalfa to optimize the forage quality and production [4].
The early development of alfalfa is characterized by the formulation and growth of leaves and stems. Then the appearance of flower buds indicates the transformation from vegetative stage to reproductive stage, after which alfalfa progressively passes through flowering and seed pod development [5]. At different stages of an alfalfa life cycle, adequate fertilization, irrigation and tillage practices are required to ensure long-lasting productive stands [1]. In addition, the cutting frequency also has a strong impact on the alfalfa yield, and three or four cuttings are usually adopted in the northern United States. Typically, decreased yield can be observed in the successive cuttings through a year [6]. Generally, alfalfa is harvested as hay or silage. Alfalfa intended for hay is sun-dried in the field after cutting to a moisture level of less than 12% [7], while shorter drying time is needed for silage which has more than 50% moisture content [8]. Machinery is typically used for alfalfa harvest from mowing, raking, merging, to baling or chopping [9]. Wheel traffic from these machines has a significant impact on soil health and crop production potential. According to a multi-state study, the yield reduction in next cutting caused by wheel traffic ranged from 5% to 26% depending on the traffic timing [10]. Traffic events which occurred three to five days after mowing caused significant losses in alfalfa yield [11]. Though yield loss from wheel traffic varies from field to field, it can be reduced by minimizing the number and locations of machinery operations [12][13][14][15].
Traditional yield assessment is based on destructive sampling which is to manually collect data samples in the field and weigh the samples to determine the yield [16][17][18][19]. However, destructive sampling is not only laborious and time-consuming, but also are unable to monitor the crop status over a growing season [4]. The advances of remote sensing techniques have provided a non-destructive and efficient way to monitor the crop growth, and thus has great potential for crop yield analysis. In the last few decades, satellite remote sensing has been widely used in agriculture [20][21][22]. Among these studies, several researchers explored the utility of satellite data to estimate alfalfa yield. For example, multiple vegetation indices (VIs) extracted from time-series Landsat images were used to predict alfalfa yield in Saudi Arabia, and near-infrared (NIR) reflectance, soil adjusted vegetation index (SAVI) and normalized difference vegetation index (NDVI) were found to be strongly correlated with the yield [23]. High-resolution commercial QuickBird satellite data were applied to estimate alfalfa yield over hilly areas on Loess Plateau of China, and better performance was achieved than using the Landsat data [20]. Though successful, the adoption of satellite remote sensing in precision farming has been restricted due to limitations, such as cloud contamination and relatively coarse spatial and temporal resolutions [21,22].
Recently, unmanned aerial vehicles (UAVs) have gained significant attention due to their greater flexibility in mission scheduling, and image data with finer spatial resolution acquired from different sensors mounted on UAV platforms have been widely used in precision agriculture. Based on low-cost conventional digital RGB images, various studies have been carried out to assess the growth status of crops and predict yields using either the original three color bands or derived color indices, such as green-red vegetation index (GRVI) [24] and excess green vegetation index [25]. Since plants typically have strong reflective properties in the NIR wavelengths, multispectral sensors have become more favored by incorporating the NIR channels. For example, several VIs which were developed by including the NIR band, such as NDVI and red edge position index (REPI), have been successfully applied in grain yield prediction [26], crop senescence rate detection [27], plant water status assessment [28], and other applications in precision agriculture [29][30][31].
Hyperspectral cameras are more expensive than RGB and multispectral cameras, and large data storage capacity is required to store hyperspectral data cubes and perform data processing. Despite the challenges, hyperspectral imagery is capable of providing more detailed spectral information and offering better opportunities for applications in precision agriculture since hundreds of spectral bands, arranged in narrower bandwidth, are consisted in the images. Based on the advantage of hyperspectral imagery, various agricultural applications have been studied using hyperspectral imagery, such as Remote Sens. 2020, 12, 2028 3 of 24 crop mapping [32,33], disease detection [34,35], and stress assessment [36,37]. Among the wide range of applications, crop yield prediction by taking advantage of the continuous narrow spectral bands, is one of the most popular [38][39][40][41][42][43]. For example, Leaf area index (LAI) and chlorophyll (CHL) were estimated using noisy-reduced UAV-based hyperspectral data, and these traits were then used to predict wheat yield in Belm, northwestern Germany, achieving a coefficient of determination (R 2 ) of 0.88 [44]. Another example showed that wheat yield could also be effectively predicted using 190 narrow bands from UAV-based hyperspectral imagery before harvest in Minnesota [45]. Compared to RGB data, hyperspectral datasets were demonstrated to be more effective in modeling the yield of winter barley [43]. Besides, they were also compared to multispectral images in grain sorghum yield prediction, and the narrow-band NDVIs extracted from the hyperspectral data were able to explain more yield variability [42]. Recently, to facilitate the feature selection, a feature explorer interactive system was developed to accelerate the exploration, ranking and selection of the derived hyperspectral indices [46]. Though successful, these studies were all focused on grain crops. It is necessary to investigate the potential of applying hyperspectral imaging for modeling the forage yield.
In general, two types of models have been investigated for modeling the yield: process-based crop simulation models and machine learning models. The crop models forecast yield by simulating the crop growth, based on the known physiological characteristics of plants and a number of environmental factors, and representative alfalfa simulation models include SIMED [47], ALSIM [48], ALFALFA 1.4 [49], ALF2LP [50], and the DSSAT-CROPGRO-Perennial Forage model, which can be adapted to alfalfa [51]. Although successful, these models typically require a large number of input data related to crop cultivar, management practices and soil conditions which are often difficult to obtain [52]. Moreover, the calibration of these mechanistic models can be challenging due to the complexity of physiological processes.
In contrast, machine learning approaches aim to develop an empirical relationship between the independent variables with the yield, and thus have the advantage of forecasting the yield without relying on the specific parameters for individual crops [53]. In this context, various machine learning models have been developed for crop yield prediction, such as linear regression [54], support vector regression (SVR) [55] and artificial neural networks [56]. However, these approaches all depend on single predictive models, and are subject to overfitting with limited training data [57]. In the machine learning community, there is an increasing interest in ensemble techniques, which use a group of base learners for training and combine the predictions from all of them for final predictions. Ensemble models usually result in better predictive performance compared to single models [58,59]. Bagging, boosting and stacking are three commonly used ensemble learning strategies. Bagging generates base learners in parallel using training subsets obtained from bootstrap sampling, and boosting trains a sequence of base models by exploiting their dependences [60]. Random forest (RF) and stochastic gradient boosting are representative bagging and boosting approaches respectively, and their superior performance in yield prediction has been demonstrated in several studies [53,61]. Unlike bagging and boosting which typically combine homogeneous learners, stacking tends to employ heterogeneous learners and leverage differences between them to improve the final accuracy. The diversity condition guarantees the complementary information provided by difference models, which is also the key for ensuring that the incorrect results are unlikely to achieve by all the base learners. To the best of our knowledge, the stacking-based ensemble strategy has not been applied in yield prediction, though several successful applications have been observed in machine learning and computer vision [62,63].
This study was designed to conduct in-season alfalfa yield prediction using UAV-based hyperspectral images. The specific objectives included (1) investigate the potential of using hyperspectral images for alfalfa yield prediction, (2) establish an ensemble learning model to improve prediction performance, and (3) evaluate the model adaptability under different machinery compaction treatments.

Experimental Design and Field Data Collection
The research trials were conducted at the Arlington Agricultural Research Station in Wisconsin in 2019 ( Figure 1). Wisconsin is located in the Great Lakes region which is in the north-central part of the United States, and Arlington is located in south-central Wisconsin. Arlington has a continental climate characterized by warm and humid summers and cold and dry winters. In 2019, the annual total precipitation reached 115 cm, and the monthly average temperature was highest at 23 • C in July and lowest at −9 • C in April [64].

Experimental Design and Field Data Collection
The research trials were conducted at the Arlington Agricultural Research Station in Wisconsin in 2019 ( Figure 1). Wisconsin is located in the Great Lakes region which is in the north-central part of the United States, and Arlington is located in south-central Wisconsin. Arlington has a continental climate characterized by warm and humid summers and cold and dry winters. In 2019, the annual total precipitation reached 115 cm, and the monthly average temperature was highest at 23 °C in July and lowest at −9 °C in April [64]. The study area shown in Figure 1 consists of two adjacent alfalfa fields, namely F635 and F650. There was one block in F635 where alfalfa was seeded in May, 2018, and three blocks in F650 with alfalfa newly planted in May, 2019. Each block contained 21 plots with a size of 24 m 2 (W-E 4 m × N-S 6 m), and it was randomly designed with seven levels of machinery traffic treatments and three replications for each. In each treatment, a certain number of traffic passes were used, with each pass simulating a particular machinery operation, such as mowing, raking, merging, baling, chopping and transporting. For treatments T1-T6, a swather was used to simulate both silage and hay harvest by applying weight and pressure to the plant and soil. For example, three passes applied in T2 and T5 were used to simulate mowing, merging (or raking) and harvest operation, and the delayed second and third passes in T5 were because more time was required for alfalfa to dry in hay harvest. The T7 was a control group without any traffic applied. The detailed descriptions for the seven treatments are presented in Table 1. The two alfalfa fields were harvested four times from June to September, and the dry matter yield data acquired and processed from the last two harvests which occurred in August and September were used in this study. The study area shown in Figure 1 consists of two adjacent alfalfa fields, namely F635 and F650. There was one block in F635 where alfalfa was seeded in May, 2018, and three blocks in F650 with alfalfa newly planted in May, 2019. Each block contained 21 plots with a size of 24 m 2 (W-E 4 m × N-S 6 m), and it was randomly designed with seven levels of machinery traffic treatments and three replications for each. In each treatment, a certain number of traffic passes were used, with each pass simulating a particular machinery operation, such as mowing, raking, merging, baling, chopping and transporting. For treatments T1-T6, a swather was used to simulate both silage and hay harvest by applying weight and pressure to the plant and soil. For example, three passes applied in T2 and T5 were used to simulate mowing, merging (or raking) and harvest operation, and the delayed second and third passes in T5 were because more time was required for alfalfa to dry in hay harvest. The T7 was a control group without any traffic applied. The detailed descriptions for the seven treatments are presented in Table 1. The two alfalfa fields were harvested four times from June to September, and the dry matter yield data acquired and processed from the last two harvests which occurred in August and September were used in this study.

Hyperspectral Image Acquisition and Pre-Processing
The hyperspectral data were acquired by a Headwall nano-hyperspec push-broom scanner. This sensor covers 273 spectral bands ranging between 400-1000 nm with a bandwidth of 2.2 nm [65]. Each scan line contains 640 pixels with a pixel pitch of 7.4 µm. A VectorNav (VN)−300 GNSS/INS navigation system was integrated with the hyperspectral camera to directly provide the position and orientation of the camera for data geo-referencing. The DJI Matrice 600 Pro (M600) was used as the UAV platform. Furthermore, a DJI Ronin MX three-axis gimbal stabilizer was used on the M600 airframe. By using the gimbal, the sensor can maintain a nadir view regardless of airframe orientation. This capability can help stabilize the hyperspectral camera during the flight, leading to improved data geometry quality. Two UAV flights were conducted on July 25 and August 19 in 2019 under cloudless weather conditions. The UAV was flying at a speed of 5 m/s from an altitude of 40 m, and the corresponding ground sampling distance (GSD) was 2.5 cm.
The acquired hyperspectral data were geometrically and radiometrically corrected. For the geometric correction, the data were orthorectified using the Headwall SpectralView software based on the GNSS/INS data from the VN-300. The raw hyperspectral data were first converted to radiance using SpectralView then calibrated to reflectance using the calibration panels with 56%, 32% and 11% reflectivity. Within each plot, the background (e.g., shadow and soil) were removed by thresholding the NIR band at 800 nm wavelength (the pixels with reflectance values below 30% were removed in this study). The filtering strategy was adopted because vegetation typically has much higher reflectance values in the NIR region than the background [66,67]. Similar to previous studies [45,68], the 30% threshold was empirically determined. The alfalfa reflectance data were extracted for each plot and noisy bands below 442 nm and beyond 957 nm were removed.

Spectral Feature Extraction and Reduction
The acquired hyperspectral data contain hundreds of spectral bands, and many adjacent bands are highly correlated. To reduce the data dependency, instead of using all the original bands, we extracted the narrow-band indices as spectral features and used them for modeling the alfalfa yield in this study. Specifically, we explored 80 published indices (Table 2), with each derived from two or more spectral bands. The indices included simple ratio index (SRI), NDVI, REPI, chlorophyll absorption ratio index, modified versions of these indices such as mND 705 and combination of them such as TCARI/OSAVI 1 . Although the calculations are varied, most wavebands used were in the red and NIR ranges.  [720,820] (R820 − R720)/(R820 + R720) [77] NDVI [734,750] (R750 − R735)/(R750 + R734) [76] Physiological reflectance index PRI [528,567] Green normalized difference vegetation index GNDVI (R750 − R550)/(R750 + R550) [82] Renormalized difference vegetation index RDVI Double peak canopy nitrogen index DCNI   [720,750] R750/R720-1 Transformed Chlorophyll absorption in reflectance index Optimized soil-adjusted vegetation index Modified triangular vegetation index MERIS terrestrial chlorophyll index Red edge position index REP 1 700 Optimal vegetation index Vi opt (1 + 0.45)(R8002 + 1)/(R670 + 0.45) [114] In supervised learning, feature selection is typically applied prior to the model development to reduce the data dimensionality, especially when the training set is small. The recursive feature Remote Sens. 2020, 12, 2028 8 of 24 elimination (RFE) approach was a widely applied wrapper feature selection model and performed well in previous studies [46,115,116]. Therefore, it was adopted for feature selection in this study. It was performed iteratively by: (1) running an estimator to determine the initial feature importance scores; (2) removing the feature with the lowest importance score; and (3) assigning the ranking to the removed variable according to its removing order. This procedure was recursively repeated until the rankings were determined for all the input features. We implemented the entire process twice independently using two distinct estimators including SVR [117] and RF [118] to calculate the initial feature importance in step (1), and the final rankings were calculated based on the results from SVR-RFE and RF-RFE.

Ensemble Model Development
To enhance the prediction performance, an ensemble model was proposed based on a stacking strategy including the following two steps: (1) train and apply multiple machine learning models independently; and (2) combine multiple prediction results via a linear regressor [119]. It is important to select appropriate base learners to develop a successful ensemble model, and diversity is a critical condition, as the similarity between different models must be minimized for providing complementary information [60]. SVR, K-nearest neighbors (KNN) and RF are commonly used machine learning approaches with distinct principles, and their predictive ability for crop yield prediction have been assessed by many studies [120][121][122]. Therefore, we used these three models as the base learners and a brief description for each algorithm is presented below.
SVR is a supervised learning model characterized by the usage of kernels [123]. In SVR, the variables are firstly transformed from the original space to another space through a kernel function. Then, a linear function is determined to minimize errors between training data and the insensitive loss function. KNN regression assumes that similar samples exist in close proximity in the feature space. As an instance-based learning method, it estimates the response of an unknown sample by averaging the responses from K nearest neighbors in the training set. RF regression is a combination of multiple regression trees following the classical top-down procedure [124]. Each tree is generated using a bootstrap sample and learned in parallel and independent of each other [125,126]. The final estimation of RF is determined by averaging the predictions of all the independent trees.
In this study, we adopted a five-fold cross validation strategy [4,127,128] to create out-of-sample predictions. To test the robustness of all models, 50 repetitions of five-fold cross validation were performed, resulting in a total of 250 experiments (Figure 2). The R 2 , root mean square error (RMSE) and mean absolute deviation (MAE) were used to evaluate the model performance, and the equations of the indices are shown in Equations (1)-(3).
where n is the number of samples, y i andŷ i represent the observed and the predicted yields of sample i, y and denotes the mean of observed yield. Models with higher R 2 and lower values of RMSE and MAE indicating better performance in prediction. In addition, a paired sample t-test, was adopted to evaluate the significance of the accuracy improvement of the ensemble model [129].
where n is the number of samples, y i and y i represent the observed and the predicted yields of sample i, and y denotes the mean of observed yield. Models with higher R 2 and lower values of RMSE and MAE indicating better performance in prediction. In addition, a paired sample t-test, was adopted to evaluate the significance of the accuracy improvement of the ensemble model [129].

Yield Statistics and Spectral Profiles
Moisture contents were measured for each plot with an average of around 80% and a standard deviation (STD) of 2%, and the dry matter yield was used in this study. The yields varied at the two harvests, with an average of 2102.267 kg/ha in the August harvest, and a reduced average of 1042.149 kg/ha in September. The detailed yield statistics for each machinery treatment are shown in Table 3. Among different treatments, similar patterns were observed in both harvests. In general, compaction treatments with more traffic passes were associated with lower yields. For example, T1 and T7 had higher yields than others, and this is because T1 was simulating the pure mower operation using a single traffic pass, and T7 is the control group without any traffic compaction. In contrast, T6 had the lowest yield due to the most severe compaction. Moreover, earlier machinery treatments were found to have a less negative impact on the yield, by comparing the simulated silage (T2 and T3) and hay (T5 and T6) harvests.

Yield Statistics and Spectral Profiles
Moisture contents were measured for each plot with an average of around 80% and a standard deviation (STD) of 2%, and the dry matter yield was used in this study. The yields varied at the two harvests, with an average of 2102.267 kg/ha in the August harvest, and a reduced average of 1042.149 kg/ha in September. The detailed yield statistics for each machinery treatment are shown in Table 3. Among different treatments, similar patterns were observed in both harvests. In general, compaction treatments with more traffic passes were associated with lower yields. For example, T1 and T7 had higher yields than others, and this is because T1 was simulating the pure mower operation using a single traffic pass, and T7 is the control group without any traffic compaction. In contrast, T6 had the lowest yield due to the most severe compaction. Moreover, earlier machinery treatments were found to have a less negative impact on the yield, by comparing the simulated silage (T2 and T3) and hay (T5 and T6) harvests. To investigate the spectrum changes under compaction, the average alfalfa spectrum under each treatment (T1-T6) was compared with the control group (T7) which had no compaction applied, and the comparison results including the mean and standard deviation are shown in Figure 3. Similar to other green vegetation, alfalfa plants generally have low reflectance in blue (450-500 nm) and red regions (600-700 nm) due to the chlorophyll absorption and high reflectance in NIR (800-900 nm) due to the scattering in spongy mesophyll [130]. In addition, we found that more compaction resulted in lower reflectance values in NIR due to the plant stress. For example, under the most severe wheal compaction, T6 showed the lowest average reflectance values within the NIR region (750-1000 nm), and the least spectra overlap with the control group T7. To investigate the spectrum changes under compaction, the average alfalfa spectrum under each treatment (T1-T6) was compared with the control group (T7) which had no compaction applied, and the comparison results including the mean and standard deviation are shown in Figure 3. Similar to other green vegetation, alfalfa plants generally have low reflectance in blue (450-500 nm) and red regions (600-700 nm) due to the chlorophyll absorption and high reflectance in NIR (800-900 nm) due to the scattering in spongy mesophyll [130]. In addition, we found that more compaction resulted in lower reflectance values in NIR due to the plant stress. For example, under the most severe wheal compaction, T6 showed the lowest average reflectance values within the NIR region (750-1000 nm), and the least spectra overlap with the control group T7.

Feature Importance
The 80 VIs were ranked using the RFE strategy described in Section 2.3. The ranking results from one experiment are shown in Table 4, and the ranking statistics over the 250 experiments are shown in Figure A1. Across the 250 experiments, the VIs ranked in the top and bottom were relatively stable. For example, Datt1 and MCARI1 mostly ranked highly, and they were originally developed for estimating the chlorophyll content of eucalyptus and corn leaves [15,86]. Two indices, MTCI1 and MTCI2, designed for a medium resolution imaging spectrometer [110] also had a stable and satisfactory performance. Out of all the 80 indices, there were 27 three-band or four-band VIs, and

Feature Importance
The 80 VIs were ranked using the RFE strategy described in Section 2.3. The ranking results from one experiment are shown in Table 4, and the ranking statistics over the 250 experiments are shown in Figure A1. Across the 250 experiments, the VIs ranked in the top and bottom were relatively stable. For example, Datt 1 and MCARI 1 mostly ranked highly, and they were originally developed for estimating the chlorophyll content of eucalyptus and corn leaves [15,86]. Two indices, MTCI 1 and MTCI 2 , designed for a medium resolution imaging spectrometer [110] also had a stable and satisfactory performance. Out of all the 80 indices, there were 27 three-band or four-band VIs, and about 20 of them ranked in the top 40. This is likely because the appropriate combination of multiple bands could provide more spectral information. We also noticed that four integrated indices, including TCARI/OSAVI 1 , TCARI/OSAVI 2 , MCARI/OSAVI 1 and MCARI/OSAVI 2 , ranked in the top 40, with TCARI/OSAVI 1 and MCARI/OSAVI 1 appearing in the top 20. This indicates that different types of VIs can be combined to complement each other for achieving enhanced performance. In addition, several narrow-band NDVIs, which have been widely adopted for estimating vegetation properties [68], were also effective in modeling the alfalfa yield.

Model Comparison and Performance
To further explore the high-performing features, we iteratively added the top one index into the machine learning model and updated the model training performance until all the 80 indices were included. The training accuracies obtained by the three base models (SVR, KNN and RF) were calculated, and the results of one experiment are shown in Figure 4. Among all the 250 experiments, RF performed the best, followed by KNN. As more features were included, the accuracies of all models improved at first and then remained stable after around 25 features were included. Therefore, we finally used the top 25 features for the ensemble model development. We trained all the four models (three base models, and one ensemble model) introduced in Section 2.4 using both the full and selected features on training samples, and evaluated the model performance on test samples. The test accuracies obtained from the 250 experiments are shown in Table 5. Satisfactory accuracies with an R 2 of more than 0.822 were achieved by all the approaches, demonstrating the effectiveness of these models in modeling alfalfa yield. The ensemble model outperformed all the base approaches, achieving an R 2 of 0.874 using the reduced features and an R 2 of 0.854 using the full features. Regarding the feature selection, the accuracies were improved for all the approaches. We also compared the results between using the selected VIs and the derivatives of full bands (Table A1). Again, the ensemble model outperformed other base models regardless of the features adopted, which further proved the effectiveness of the ensemble model proposed in this study. Besides, with the ensemble model, higher accuracies were achieved by using the selected VIs than using first or second derivatives of full bands. Moreover, we used a paired sample t-test to evaluate whether the methods are statistically different on the reported R 2 . Specifically, six statistical tests were performed, and the results are shown in Table 6. The t-test was conducted between the ensemble model with each of the three base models (RF, SVR and KNN). The results showed that the accuracy improvement obtained by the ensemble model was statistically significant under both selected and full feature conditions. Remote Sens. 2020, 12, x FOR PEER REVIEW 11 of 23 RF performed the best, followed by KNN. As more features were included, the accuracies of all models improved at first and then remained stable after around 25 features were included. Therefore, we finally used the top 25 features for the ensemble model development. We trained all the four models (three base models, and one ensemble model) introduced in Section 2.4 using both the full and selected features on training samples, and evaluated the model performance on test samples. The test accuracies obtained from the 250 experiments are shown in Table 5. Satisfactory accuracies with an R 2 of more than 0.822 were achieved by all the approaches, demonstrating the effectiveness of these models in modeling alfalfa yield. The ensemble model outperformed all the base approaches, achieving an R 2 of 0.874 using the reduced features and an R 2 of 0.854 using the full features. Regarding the feature selection, the accuracies were improved for all the approaches. We also compared the results between using the selected VIs and the derivatives of full bands (Table A1). Again, the ensemble model outperformed other base models regardless of the features adopted, which further proved the effectiveness of the ensemble model proposed in this study. Besides, with the ensemble model, higher accuracies were achieved by using the selected VIs than using first or second derivatives of full bands. Moreover, we used a paired sample t-test to evaluate whether the methods are statistically different on the reported R 2 . Specifically, six statistical tests were performed, and the results are shown in Table 6. The t-test was conducted between the ensemble model with each of the three base models (RF, SVR and KNN). The results showed that the accuracy improvement obtained by the ensemble model was statistically significant under both selected and full feature conditions.
The agreement between the observed and predicted yield for each model using the selected features is shown in Figure 5. Among the four models, the best agreement was found in the ensemble model (Figure 5d), while the worst agreement was observed in RF (Figure 5a). Moreover, in Figure  5, two clusters could be identified and they respectively corresponded to the yields obtained from the two harvests. Again, comparing to the other approaches, the ensemble model demonstrated better agreement in both clusters.  The agreement between the observed and predicted yield for each model using the selected features is shown in Figure 5. Among the four models, the best agreement was found in the ensemble model (Figure 5d), while the worst agreement was observed in RF ( Figure 5a). Moreover, in Figure 5, two clusters could be identified and they respectively corresponded to the yields obtained from the two harvests. Again, comparing to the other approaches, the ensemble model demonstrated better agreement in both clusters.

Model Adaptability for Different Compaction Treatments
We then further evaluated the model adaptability under different compaction treatments. To this end, we calculated the prediction accuracies for all seven treatments, and the results are shown in Table 7. In general, all models achieved good results and they performed particularly well under T1, T2, T3 and T5 treatments, with all the R 2 above 0.831. Additionally, we noticed that the ensemble model provided decent estimation accuracy under all the treatments, and it was superior to other base models in most cases. Although the accuracy varied across treatments, the ensemble model performed relatively stable, with an R 2 of more than 0.778 under each treatment, demonstrating its stronger adaptability across treatments. Moreover, it is worth mentioning that the ensemble performance can be directly affected by the base learners. For example, it achieved better performance in T3 and T5 when the base models showing higher accuracies.
Finally, the detailed yield modeling performance for each treatment was shown in the scatter plots in Figure 6. It is clear to see that T1 and T7 had higher yields than others, while T6 had the lowest yield due to the most severe machinery compaction. Among the four models, the ensemble model again showed the best agreement between the observed and the predicted yield under most treatments, and strong performance was exhibited under T3 and T5 treatments. Additionally, the figure also showed that both the high and low yields obtained from the two harvests could be well modeled using the ensemble approach.

Selection of the Vegetation Indices
In this study, we explored 80 hyperspectral narrow-band indices, and approximately 95% of these indices used the red-edge wavelengths in their calculations. To reduce the data dimensionality, all the 80 VIs were ranked based on the REF approach, and several representative indices were discussed below.
First, we noticed that three-band or four-band indices ranked importantly and about six of them were among the top ten indices. This is likely because the multi-band indices tend to provide more spectral information and improve the robustness and precision in assessing plant traits [92]. For example, Datt 1 , MCARI 1 , MTCI 1 and MTCI 2 are four three-band indices, and they were relatively stable in the top ten among the 250 experiments. Their strong performance is mainly due to their sensitivity to the chlorophyll content which is strongly correlated with the photosynthetic capacity of the plants [131]. In previous studies, integrated indices such as MCARI/OSAVI and TCARI/OSAVI were found to outperform TCARI and MCARI, and this is mainly because combining the MCARI or TCARI with the optimized SAVI (OSAVI) can help reduce the background effects [15]. However, this pattern was not observed in this study as the background pixels (e.g., soil) were already excluded in the data pre-processing step. Another multi-band index REP is calculated using four narrow bands to ascertain the position of maximum slope in the red-NIR region, and its potential of estimating biomass has been identified in previous studies [76,132,133]. Two REP indices (REP 1 and REP 2 ) were considered in this study, and both were selected for the model development.
In addition, various narrow-band NDVIs have been evaluated by researchers for estimating different plant traits [68,88,111,134] and several of them were also demonstrated to be effective in modeling alfalfa yield in this study. Another common index is the physiological reflectance index (PRI), which is typically used for indicating the photosynthetic efficiency [78]. In this study, PRI [531,570] also ranked importantly and was selected for the model development. Besides, we also studied 19 SRIs with multiple band combinations, and only about three of them were selected as predictors, and none of them ranked in the top ten. This is likely because the information provided by SRIs are already included in other VIs with more spectral bands or complex mathematical structures.

Advantages of the Ensemble Model
Instead of using a single machine learning model, we developed an ensemble model by combining three base learners in this study. The results showed that the ensemble model outperformed the other approaches significantly when using both all the VIs and the selected ones. With the selected features, the ensemble method obtained 0.874 in R 2 , 220.799 kg/ha in RMSE and 164.787 kg/ha in MSE, achieving an increase of 2.8%, 8.5%, and 10.2%, respectively, in comparison with the KNN which was the best performing single model. In addition, we further evaluated the model performance on seven compaction treatments. Again, the ensemble model yielded more adaptive and satisfactory results than base models with R 2 exceeding 0.778 on all the subsets.
As demonstrated previously, in ensemble learning, sufficiency and diversity are the two main principles in selecting the base models [135]. It means that each base learner should possess good predicting capabilities, but at the same time, the dependence between the models must be minimized for providing complementary information [60]. This is reasonable because the ensemble method combines the individual predictions, and therefore the performance of each base model can affect the final fusion results. On the other hand, limited additional information would be gained by combining the high-performance models which are similar. Based on the two conditions, we employed SVR, KNN and RF which have completely different training mechanisms, as the base learners in this study, and their parameters were optimized strategically for achieving the best performance. The experimental results further demonstrate the effectiveness of the base model selection.

Effects of Machinery Compaction
Previous studies showed that alfalfa is one of the most susceptible forages to machinery traffic which may lead to a reduction in the next cutting yield. In this study, seven levels of compaction were explored, and two main insights were found from the experiments. First, we noticed that the yield loss is positively correlated with the severity of the compaction. The more traffic passes were applied, the greater the observed yield loss. This is mainly because the soil compaction caused by the wheel traffic can lead to reduced macropore air permeability and water infiltration, and thus negatively affect the plant root development and lower the yield [12]. Second, the later application of machinery traffic practices tends to increase the yield loss, which could be noticed by comparing the yields from hay and silage harvests. This is because delayed wheel traffic is more likely to cause physical damage to regrowth shoot and therefore decrease the yield [9]. Similar findings were also noticed in other studies [9][10][11]135,136]. For example, wheel traffic was found to be a contributing factor to alfalfa yield loss in a field study conducted in Nevada, and fewer and earlier machinery treatments were recommended to reduce the damage [10]. A field experiment carried out in Auchincruive, Scotland, also demonstrated that both frequent and delayed wheel passes caused reductions in herbage yield [135].

Conclusions
Alfalfa is an important forage crop in the U.S., and it plays an important role in the food supply chain as feedstock for animals. Pre-harvest insight to yield can help optimize management practices. In this study, we developed an ensemble-based machine learning model for alfalfa yield prediction using UAV-based hyperspectral imagery. The narrow-band hyperspectral indices were extracted and the most important ones were selected for the model development. The results showed that the ensemble model outperformed other base models, and the highest accuracy was achieved when using the reduced features. In addition, we examined the model performance under different compaction treatments, and again, the best performance was achieved by the ensemble model, indicating its stronger adaptability in comparison with other approaches. Moreover, we also found that the yield loss is positively correlated with the compaction severity, while earlier compaction after harvest can help reduce the impact. Our study demonstrated the efficacy of using hyperspectral images for modeling alfalfa yield, and for future work, we will incorporate environmental factors, such as climate variables and soil properties, into the modeling process and to further enhance the prediction performance.