Above-Ground Biomass Estimation in Oats Using UAV Remote Sensing and Machine Learning

Current strategies for phenotyping above-ground biomass in field breeding nurseries demand significant investment in both time and labor. Unmanned aerial vehicles (UAV) can be used to derive vegetation indices (VIs) with high throughput and could provide an efficient way to predict forage yield with high accuracy. The main objective of the study is to investigate the potential of UAV-based multispectral data and machine learning approaches in the estimation of oat biomass. UAV equipped with a multispectral sensor was flown over three experimental oat fields in Volga, South Shore, and Beresford, South Dakota, USA, throughout the pre- and post-heading growth phases of oats in 2019. A variety of vegetation indices (VIs) derived from UAV-based multispectral imagery were employed to build oat biomass estimation models using four machine-learning algorithms: partial least squares (PLS), support vector machine (SVM), Artificial neural network (ANN), and random forest (RF). The results showed that several VIs derived from the UAV collected images were significantly positively correlated with dry biomass for Volga and Beresford (r = 0.2–0.65), however, in South Shore, VIs were either not significantly or weakly correlated with biomass. For Beresford, approximately 70% of the variance was explained by PLS, RF, and SVM validation models using data collected during the post-heading phase. Likewise for Volga, validation models had lower coefficient of determination (R2 = 0.20–0.25) and higher error (RMSE = 700–800 kg/ha) than training models (R2 = 0.50–0.60; RMSE = 500–690 kg/ha). In South Shore, validation models were only able to explain approx. 15–20% of the variation in biomass, which is possibly due to the insignificant correlation values between VIs and biomass. Overall, this study indicates that airborne remote sensing with machine learning has potential for above-ground biomass estimation in oat breeding nurseries. The main limitation was inconsistent accuracy in model prediction across locations. Multiple-year spectral data, along with the inclusion of textural features like crop surface model (CSM) derived height and volumetric indicators, should be considered in future studies while estimating biophysical parameters like biomass.


Introduction
Oat (Avena sativa L.) is a cool-season, multipurpose grain crop which ranks sixth among the most produced cereal in the world [1]. According to USDA-National Agricultural Statistics Service small grains 2020 summary statistics, out of 1.2 million hectares of oats farmed in the United States, approximately 406,000 hectares were harvested for grain, accounting for less than half of the entire planted area [2]. The crop has traditionally been collected for fodder, forage, straw, hay, silage, and chaff production in addition to grain production [1]. Oat forage is preferred over other annual forage crops because of its high Statistical models have been implemented to relate spectral information with biophysical attributes of crops [43,44]. Traditional modeling approaches are limited by statistical assumptions failing to address outlier data, nonlinearity, heteroscedasticity, and multicollinearity issues [45]. Recently, machine learning algorithms have been widely employed for the exploration and analysis of big data sets to identify meaningful correlations, patterns, and rules among data, which are frequently found to outperform traditional regression analysis [46]. The relationship between spatial and temporal changes of various predictor factors determines biomass estimation. Machine learning techniques could be highly relevant for biomass estimation as it has excellent capacity to treat multidimensional data via incorporating several predictor features [47]. Expected biomass being a continuous variable, machine learning methods such as support vector machine (SVM) [24], partial least square (PLS) [48,49], random forest (RF) [50], and artificial neural network (ANN) [51] have been used for biomass estimation. Training data is often required for supervised machine learning algorithms, however, obtaining a large dataset is often challenging because of the difficulty in manually harvesting large numbers of plots and the limited crop growing season [28]. In order to get reliable and unbiased estimates of model performance in these cases, validation techniques such as leave one out for cross-validation and k-fold cross-validation have been used in previous studies [22,52].
There are a limited number of studies that have used UAV-based canopy spectral information and machine learning to predict the biomass in oats. Various studies related to above-ground biomass estimation in cereal crops have seen lower estimation accuracy after the heading stage, which could be due to higher biomass amount or other inflorescence/ stem interference overleaf canopy after heading [25,53]. Few studies have explored the impacts of canopy spectral information from different growth phases on biomass estimation for oats. Thus, the objectives of this study are; to (i) evaluate the potential of UAV multispectral imagery-derived VIs in estimation of above ground biomass in oats, (ii) evaluate the performance of UAV imagery collected at pre-and post-heading phases for oat biomass estimation, and (iii) compare the performance of different machine-learning algorithms for estimating above ground biomass of oats.

Field Experiments
Thirty-five oat genotypes adapted to the Northern Great Plains were cultivated in 2019 at three locations in South Dakota ( Figure 1): Volga (44.321994, −96.924565), South Shore (45.105087, −96.927985), and Beresford (43.080859, −96.776148). The experimental design followed a randomized complete block design (RCBD) with three replications. Each plot (experimental unit) was approximately 2.78 m 2 . Oats were planted at a density of approximately 300 seeds per square meter and at a depth of approximately 0.038 m. Beresford, Volga, and South Shore were planted on 26 April, 14 May, and 7 May, respectively, and were harvested on 11 July, 18 July, and 19 July, respectively. Agronomic practices such as fertilization and weed management were carried out in accordance with regional practices. Based on the information extracted from the Agacis website (https://agacis.rcc-acis.org, accessed on 1 July 2021), the average temperature during the growing season (May to July) was 16.4 • C in South Shore, 18.8 • C in Beresford and 17.2 • C in Volga. In 2019, precipitations during the growing season (May to July) totaled 11.93 cm in South Shore, 9.90 cm in Volga, and 11.93 cm in Beresford.

Ground Data Collection
Several phenotypic traits, such as heading time and crown rust severity, which can directly or indirectly affect forage yield, were collected for this study. Crown rust severity was scored as the percentage of leaf area covered by pustules over the entire plot. When plants were between late milk and early dough, oats were harvested for forage. The plants were cut close to the soil surface (approximately 7.6 cm) with a Jari mower or a forage harvester (Figure 2a), depending on the location. The above-ground biomass of each plot (fresh weight) was recorded immediately after harvest. For each plot, a sub-sample was collected and subjected to air-dried oven set at 70 degrees Celsius until the weight was constant (approximately a week). Dry matter content was calculated and used to measure dry matter yield for each plot; the details of dry biomass calculation are as follows: Dry mater content (%) = Subsample dry weight Subsample fresh weight * 100% (1) Dry biomass = Fresh biomass * dry matter content 100 (2) (a)

Ground Data Collection
Several phenotypic traits, such as heading time and crown rust severity, which can directly or indirectly affect forage yield, were collected for this study. Crown rust severity was scored as the percentage of leaf area covered by pustules over the entire plot. When plants were between late milk and early dough, oats were harvested for forage. The plants were cut close to the soil surface (approximately 7.6 cm) with a Jari mower or a forage harvester (Figure 2a), depending on the location. The above-ground biomass of each plot (fresh weight) was recorded immediately after harvest. For each plot, a sub-sample was collected and subjected to air-dried oven set at 70 degrees Celsius until the weight was constant (approximately a week). Dry matter content was calculated and used to measure dry matter yield for each plot; the details of dry biomass calculation are as follows: Dry mater content (%) = Subsample dry weight Subsample fresh weight * 100% Dry biomass = Fresh biomass * dry matter content 100 (2) Figure 1. Three different experimental locations (South Shore, Beresford, and Volga) in South Dakota.

Ground Data Collection
Several phenotypic traits, such as heading time and crown rust severity, which can directly or indirectly affect forage yield, were collected for this study. Crown rust severity was scored as the percentage of leaf area covered by pustules over the entire plot. When plants were between late milk and early dough, oats were harvested for forage. The plants were cut close to the soil surface (approximately 7.6 cm) with a Jari mower or a forage harvester (Figure 2a), depending on the location. The above-ground biomass of each plot (fresh weight) was recorded immediately after harvest. For each plot, a sub-sample was collected and subjected to air-dried oven set at 70 degrees Celsius until the weight was constant (approximately a week). Dry matter content was calculated and used to measure dry matter yield for each plot; the details of dry biomass calculation are as follows: Dry mater content (%) = Subsample dry weight Subsample fresh weight * 100% (1) Dry biomass = Fresh biomass * dry matter content 100 (2) (a)

Sensor and Aerial Platform
The UAV deployed is a DJI (Dà-Jiāng Innovations) Matrice 600 hexcopter (SZ DJI Technology Co., Ltd., Shenzhen 518057, China) ( Figure 2b). Multispectral images were collected with a MicaSense RedEdge-MX camera (MicaSense, Inc., Seattle, WA, USA). Micasense RedEdge-MX has a 3.2-megapixel resolution, and five bands with central wavelengths of 457 nm (blue), 560 nm (green), 668 nm (red), 717 nm (red-edge), and 840 nm (near-NIR). The spectral range covered by the green, red, red-edge, and NIR bands were 545-555 nm, 640-660 nm, 710-720 nm, and 840-860 nm, respectively. For UAV waypoint navigation and flights, an autopilot system was applied using Drone Deploy (Drone Deploy, San Francisco, CA, USA) software over the fields. Drone Deploy software was used for autonomous takeoff, flight, and landing purposes, and for capturing consistent data over time. Each of the flights was performed at an altitude of 25 m and with a front and side overlap of 80%. The flights were performed in either sunny or overcast conditions with wind gusts less than 12 miles per hour. Aerial images were collected on multiple days: Beresford (14 June, 1 July, 8 July, and 12 July), Volga (13 June, 25 June, 4 July, and 11 July), and South Shore (16 June, 25 June, 6 July, 11 July, and 18 July). The UAV flights were conducted between 10 a.m. to 12 p.m. to ensure constant daylight operation.

Image Preprocessing
The processing of raw images captured by UAV was conducted by using Pix4DMappersoftware (Pix4D Inc., San Francisco, CA, USA) to generate orthomosaic images in tiff format ( Figure 3). The orthomosaic images were generated with a spatial resolution of 0.7 cm. Following the orthomosaic, 10 ground control points (GCPs) were employed across the field area to geo-reference the imageries from various flights. The GCP coordinates were measured with a Magellan GPS device (Magellan Navigation Inc., San Dimas, CA, USA). Four white tarps were evenly spaced around each corner of each field for radiometric correction. The reflectance value of the tarps was determined using a CROPSCAN MSR16R (CROPSCAN Inc., 1932 Viola Heights Lane NE Rochester, MN 55906, USA). Four white tarps were used in the development of the linear relationship between DN (digital number) and surface reflectance. The average DN of white tarps from drone imageries from all the flights was used to develop an equation for each band. A linear regressionbased calibration [54] was used where slope and intercept from the equation was later used to convert DN values from each band to reflectance as described. The DN values were converted to reflectance using the following equation:

Sensor and Aerial Platform
The UAV deployed is a DJI (Dà-Jiāng Innovations) Matrice 600 hexcopter (SZ DJI Technology Co., Ltd., Shenzhen 518057, China) ( Figure 2b). Multispectral images were collected with a MicaSense RedEdge-MX camera (MicaSense, Inc., Seattle, WA, USA). Micasense RedEdge-MX has a 3.2-megapixel resolution, and five bands with central wavelengths of 457 nm (blue), 560 nm (green), 668 nm (red), 717 nm (red-edge), and 840 nm (near-NIR). The spectral range covered by the green, red, red-edge, and NIR bands were 545-555 nm, 640-660 nm, 710-720 nm, and 840-860 nm, respectively. For UAV waypoint navigation and flights, an autopilot system was applied using Drone Deploy (Drone Deploy, San Francisco, CA, USA) software over the fields. Drone Deploy software was used for autonomous takeoff, flight, and landing purposes, and for capturing consistent data over time. Each of the flights was performed at an altitude of 25 m and with a front and side overlap of 80%. The flights were performed in either sunny or overcast conditions with wind gusts less than 12 miles per hour. Aerial images were collected on multiple days: Beresford (14 June, 1 July, 8 July, and 12 July), Volga (13 June, 25 June, 4 July, and 11 July), and South Shore (16 June, 25 June, 6 July, 11 July, and 18 July). The UAV flights were conducted between 10 a.m. to 12 p.m. to ensure constant daylight operation.

Image Preprocessing
The processing of raw images captured by UAV was conducted by using Pix4DMappersoftware (Pix4D Inc., San Francisco, CA, USA) to generate orthomosaic images in tiff format ( Figure 3). The orthomosaic images were generated with a spatial resolution of 0.7 cm. Following the orthomosaic, 10 ground control points (GCPs) were employed across the field area to geo-reference the imageries from various flights. The GCP coordinates were measured with a Magellan GPS device (Magellan Navigation Inc., San Dimas, CA, USA). Four white tarps were evenly spaced around each corner of each field for radiometric correction. The reflectance value of the tarps was determined using a CROPSCAN MSR16R (CROPSCAN Inc., 1932 Viola Heights Lane NE Rochester, MN 55906, USA). Four white tarps were used in the development of the linear relationship between DN (digital number) and surface reflectance. The average DN of white tarps from drone imageries from all the flights was used to develop an equation for each band. A linear regression-based calibration [54] was used where slope and intercept from the equation was later used to convert DN values from each band to reflectance as described. The DN values were converted to reflectance using the following equation: where DN ij is the digital number for ith band at jth flight period, and SR ij is the surface reflectance for ith band at jth flight period.

Pixel Classification Using K-Mean Clustering Algorithm
Pixel classification was used based on the K-mean algorithm using MATLAB. The processing software imported stacked mosaic images to create 6 cluster classes. This differentiation of clusters was based on the color feature of the image. Based on higher NIR reflectance, cluster types with green pixels were identified. A binary vegetation image was created after masking non-canopy type cluster classes. Then DN values for that cluster were extracted for all bands (NIR, red edge, red, green, and blue) and converted to surface reflectance using a calibration method. The same VIs was computed as previously described (Table 1).

Spectral Vegetation Indices Extraction
Two methodologies were used to derive vegetation indices. The first one (hereafter referred to as "average reflectance over ROI") was based on averaging the spectral reflectance for all pixels within the region of interest (ROI). However, the spectral information derived from average reflectance over ROI included shadows, background soil, and panicles (for imagery collected after heading), which could affect the overall VIs values. Spectral indices are sensitive to green living vegetation, therefore, only pixels with high NIR re- flectance values within ROI were selected in the second methodology (hereafter referred to as "pixel classification").

Average Reflectance over Region of Interest
The orthomosaic images were processed using ArcGIS software (Version 10.7. Redlands, CA, USA) to extract the spectral indices. They were first converted to float from raster format. Then, using the raster calculator tool in the software, a variety of VIs were generated ( Table 1). The shape file polygons were created using the same software and used for the identification of each sampling plot as an experimental unit. Finally, the zonal statistics tool was used to derive plot-level mean VIs from each experimental unit.

Pixel Classification Using K-Mean Clustering Algorithm
Pixel classification was used based on the K-mean algorithm using MATLAB. The processing software imported stacked mosaic images to create 6 cluster classes. This differentiation of clusters was based on the color feature of the image. Based on higher NIR reflectance, cluster types with green pixels were identified. A binary vegetation image was created after masking non-canopy type cluster classes. Then DN values for that cluster were extracted for all bands (NIR, red edge, red, green, and blue) and converted to surface reflectance using a calibration method. The same VIs was computed as previously described (Table 1).

Data Pre-Processing
Multispectral imagery from each flight was aggregated, resulting in a comprehensive dataset for all three locations. For accessing spectral properties in accordance with the specific growth phase of oats and its relationship with biomass yield, the dataset was divided into two subsets, i.e., pre-heading and post-heading stages. This division was based on the heading date noted for each genotype in different field conditions. The spectral information collected prior to panicle emergence was separated as the pre-heading dataset, and the spectral information collected after panicle emergence in most genotypes was separated as the post-heading dataset. More explanation could be obtained from histograms plotted for each location (Figure 4) representing the distribution of heading occurrence in different genotypes measured after days of planting. The vertical dotted line represents spectral data collection through UAV. For Beresford and South Shore, spectral data from the first two flights were averaged and considered as pre-heading sample data. Likewise, remaining later flights were averaged and considered as post-heading data. While in Volga, the first three flights were averaged for the pre-heading data frame and the last single flight was considered as the post-heading data frame. tograms plotted for each location (Figure 4) representing the distribution of heading occurrence in different genotypes measured after days of planting. The vertical dotted line represents spectral data collection through UAV. For Beresford and South Shore, spectral data from the first two flights were averaged and considered as pre-heading sample data. Likewise, remaining later flights were averaged and considered as post-heading data. While in Volga, the first three flights were averaged for the pre-heading data frame and the last single flight was considered as the post-heading data frame.

Correlation Analysis between VIs and Biomass
The package "hmisc" in R (version 3.5.1, R Development Core Team, 2018) [62] was used to calculate the correlation matrix, including VIs and biomass. The function "rcorr" was used to generate a matrix of Pearson's rank correlation coefficients for all possible pairs of columns of the matrix.

Broad Sense Heritability Estimate
Broad sense heritability estimate refers to the proportion of phenotypic variance in a trait that is attributed to the genetic variance in a population. Based on the linear mixed

Correlation Analysis between VIs and Biomass
The package "hmisc" in R (version 3.5.1, R Development Core Team, 2018) [62] was used to calculate the correlation matrix, including VIs and biomass. The function "rcorr" was used to generate a matrix of Pearson's rank correlation coefficients for all possible pairs of columns of the matrix. Broad sense heritability estimate refers to the proportion of phenotypic variance in a trait that is attributed to the genetic variance in a population. Based on the linear mixed model approach, "Minimum norm quadratic unbiased estimation (MINQUE)" was used for estimating variance components and random effects. The jackknife as resampling technique was implemented to generalize statistical test using R package "minque" [63].

Modeling
The spectral data retrieved from image processing were combined with ground truth dry biomass to create the final dataset for modeling. The dataset included many variables as each VI was considered over different time frames. Hence, various linear and non-linear regression-based machine learning techniques were evaluated, and their performance was compared. The "caret" package (Version 6.0-88) in R (version 3.5.1, R Development Core Team, 2018) was used for implementing all different model algorithms [64]. In this study, four machine learning algorithms, i.e., PLS (partial least square regression), SVM (support vector machine), RF (random forest), and ANN (artificial neural network), were used to predict biomass.
The PLS approach is known for its convenience in highly correlated predictors by dimension reduction techniques as in principle component analysis [65]. The SVM algorithm aims to find a hyperplane in an n-dimensional space that distinctly classifies the data points. These hyperplanes are known as the decision boundary and are used to predict continuous output [66]. In our study, SVM was implemented using a linear variant, "svmLinear" method that was chosen from the caret package in R for this purpose. The RF algorithm principle works on a combination of tree predictors, such that each tree is dependent on the values of a random vector that is sampled independently having similar distribution for rest of trees in forest [67]. The ANN adopts the computing environment by repeated adjustment using neuron weights and thresholds. The network training completes its task once the output error of the network reaches its expected value [68].
For all four modeling approaches, tuning parameters were set ( Table 2). For example, in the PLS method, the model was subjected to tuning for finding the optimal number of principal components ("ncomp") to be incorporated. While in the case of SVM, parameter C, known as "Cost", was used as a tuning parameter, allowing different iterations of C to maximize model accuracy. The cost-penalty parameter relates tolerance to error, which means that when C gets large, the model gets flexible, and it leads to overfitting. In other cases, with a small value of C, the model is rigid and subjected to underfitting. For the RF analysis, the number of trees defaulted to 500, while to obtain the best predictor combination for split candidate, the "mtry" parameter was tuned with its corresponding cross-validation error. For the ANN analysis, size and decay were hyper-parameters used to tune, where size is the number of units in the hidden layer and decay acts as a regularization parameter to avoid over-fitting. To change the candidate values of the tuning parameters, the "tuneLength" or "tuneGrid" arguments were used in the train function. For Beresford and South Shore, seventy percent of the data for each location was used for training the model and the rest was used as a validation set for evaluating the model performance. In Volga, only the first two replications of the field trial were used in our data analysis because heavy precipitation after planting caused delayed emergence in the third replication. Because of the smaller number of datapoints, the set was split 50:50 for training and validation. Random-number seeds were applied before training each model such that every model had the same data partition and had stable result output. For PLS, SVM, and ANN models, data were transformed using the "preProcess" function, which forced all predictors to be centered and scaled. In addition, "trainControl" was used to specify the type of resampling methods to estimate performance of model.
For resampling methods, k-fold cross-validation (CV) was performed on the training data set. The CV approach divides data into folds, estimating the error rate of machine learning-based classifications on iteration and outputs the final model with the least error rate [71]. In this study, repeated k-fold CV was implemented using 10 folds with three replications. The default metric used for accuracy assessment in each model was the root mean square error (RMSE). The comparison analysis was performed for both the training set (cross-validation) and the test set data using RMSE and coefficient of determination (R 2 ). Those parameters were calculated as where X i and Y i were estimated biomass and measured biomass, respectively, and X, Y were the average estimated biomass and measured biomass, respectively, and n was the number of samples. The predictor or variable importance for each model was derived using the generic function "varImp" using the caret package. For the PLS model, the variable importance was calculated based on weighted sums of the absolute regression coefficients. While in RF model, variable importance was derived from mean square error, computed out-of-bag data for each tree, then recomputed again after permuting each predictor variable. For ANN and SVM, there was no model-specific way for calculating variable importance; hence, the importance of each predictor was evaluated individually by using the "filter" approach [64]. The overall workflow for machine learning modeling using UAV remote-sensing data for above-ground biomass estimation is explained in Figure 5. bag data for each tree, then recomputed again after permuting each predictor variable. For ANN and SVM, there was no model-specific way for calculating variable importance; hence, the importance of each predictor was evaluated individually by using the "filter" approach [64]. The overall workflow for machine learning modeling using UAV remotesensing data for above-ground biomass estimation is explained in Figure 5.

Ground-Based Dry Biomass Measurements
The highest dry biomass was produced at South Shore, with an average of 13,674.4 kg/ha. The lowest dry biomass was produced in Volga, with an average of 9191.0 kg/ha (Figure 6i). Wet conditions favored the development of crown rust in all three locations. Crown rust severity was least severe in South Shore, where it averaged 25%, but 50% at

Ground-Based Dry Biomass Measurements
The highest dry biomass was produced at South Shore, with an average of 13,674.4 kg/ha. The lowest dry biomass was produced in Volga, with an average of 9191.0 kg/ha (Figure 6i). Wet conditions favored the development of crown rust in all three locations. Crown rust severity was least severe in South Shore, where it averaged 25%, but 50% at the other two locations (Figure 6ii). There was a negative correlation between fresh biomass and crown rust severity at Beresford (r = −0.59 **) and Volga (r = −0.4 **), and this shows that biomass was negatively affected by the presence of crown rust infection on leaves at those two locations ( Table 3). The correlation between biomass and crown rust severity was, however, not significant in South Shore. The average height for each plot was also correlated to dry biomass yield. Plant height had a significant positive correlation with dry biomass in Beresford (r = 0.38) and South Shore (r = 0.24).

Broad Sense Heritability Estimates for Vegetative Indices
Broad-sense heritability (H 2 ) estimates were calculated for dry biomass yield and VIs. The broad-sense heritability for dry biomass yield was 0.55 for Beresford and 0.24 for Volga. In South Shore, however, the heritability was 0.01, which shows that variation in dry biomass yield was primarily due to other factors than the genotype. Among the VIs considered, VARI and NDVI were found to consistently have higher heritability across growth phases and locations. The broad-sense heritability estimates were lower for VIs derived from pre-heading flights (NDVI: H 2 = 0. 46

Relationship between Dry Biomass Yield and Vegetation Indexes Derived through Average Reflectance over ROI Method
Pearson correlation coefficients (r) were calculated between dry biomass and VIs obtained through average reflectance over the ROI method (Table 4). In Beresford, the highest correlations between VIs and dry biomass yield (0.45 to 0.6) were obtained for later flights (post-heading). For Volga, the strength of correlations between VIs and dry biomass yield was similar for both post-and pre-heading flights. Among the VIs, NDVI and RTVI were most highly correlated with dry biomass yield for both pre-heading (r = 0.43 and 0.57, respectively) and post-heading flights (r = 0.42 and 0.41, respectively). In South Shore, few VIs (TVI, ExGR, VARI) had significant correlations with dry biomass yield for flights before heading. For post-heading flights, only GNDVI was significantly positively correlated with biomass (r = 0.23).

Relationship between Dry Biomass Yield and Vegetation Indexes Derived through Average Reflectance over ROI Method
Pearson correlation coefficients (r) were calculated between dry biomass and VIs obtained through average reflectance over the ROI method (Table 4). In Beresford, the highest correlations between VIs and dry biomass yield (0.45 to 0.6) were obtained for later flights (post-heading). For Volga, the strength of correlations between VIs and dry biomass yield was similar for both post-and pre-heading flights. Among the VIs, NDVI and RTVI were most highly correlated with dry biomass yield for both pre-heading (r = 0.43 and 0.57, respectively) and post-heading flights (r = 0.42 and 0.41, respectively). In South Shore, few VIs (TVI, ExGR, VARI) had significant correlations with dry biomass yield for flights before heading. For post-heading flights, only GNDVI was significantly positively correlated with biomass (r = 0.23).

Relationships between Dry Biomass Yield and Vegetation Indexes Derived through Pixel Classification
For VIs derived from the pixel classification method, post-heading flights were more strongly correlated (r = 0.4-0.7) with dry biomass yield than those derived from pre-heading flights (r = 0.3-0.5) in Beresford (Table 5). Similar results were obtained for Volga. For South Shore, however, dry biomass was not significantly correlated with any of the VIs except TVI (r = 0.23) for pre-heading flights ( Table 5). The use of pixel classification resulted in higher correlations between VIs and dry biomass for both pre-heading and post-heading flights in Beresford. For Beresford, the correlation between dry biomass and NDVI was r = 0.57 for the average reflectance over the ROI method and r = 0.72 after pixel classification. For Volga, correlation coefficients between dry matter yield and VIs derived from pre-heading flights were quite similar for both methods (average reflectance over ROI and pixel classification) irrespective of VIs. For the post-heading phase, VIs derived from the pixel classification method had significantly greater correlation values (r = 0.38-0.54) with dry matter yield as compared to average reflectance over the ROI method. For the post-heading stage in Volga, the correlation between dry biomass and NDVI was r = 0.42 for the average reflectance over the ROI method and r = 0.54 in the pixel classification method. No substantial differences were observed between the two methods for South Shore. In both cases only some VIs was significantly correlated to biomass during pre-heading, i.e., ExGR (r = 0.3), VARI (r = 0.28) and TVI (r = 0.24) in average reflectance over ROI method and TVI (r = 0.23) in the pixel classification method.

Biomass Prediction from Spectral Information Collected Pre-and Post-Heading
Biomass estimation models were built with 7 VIs derived from flights during preheading and post-heading phases using machine learning regression methods. To assess each model's performance, the RMSE and R 2 for the testing data set were compared for each model (Table 6). For UAV data collected prior to heading, the RF model was the best model for Beresford (RMSE = 1726.3 and R 2 = 0.3) and South Shore (RMSE = 1659.1 and R 2 = 0.2), but the SVM model was best for Volga (RMSE = 695 and R 2 = 0.4). For UAV data collected post-heading, the PLS model performed best for Beresford (RMSE = 1098.6 and R 2 = 0.7) and Volga (RMSE = 717.4 and R 2 = 0.3), and the SVM model worked best for South Shore (RMSE = 1681.5 and R 2 = 0.1). For Beresford, most models had a good fit; data points were distributed close to the fitted line as compared to the other two locations (Figure 8). We found no single model that performed best in all three sites, no matter if it was based on pre-heading or post-heading flights. The interval in the dot plot (Figure 9) shows the difference in performance, with wider intervals indicative of greater variation in performance. The overlapping confidence interval for RMSE values for the different models ( Figure 9) represents the performance gap which could be due to the small sample size used for modeling.
For Beresford, models' validation using testing dataset indicates higher R 2 for models developed based on data from post-heading flights as compared to models based on data from pre-heading flights. For Volga and South Shore, however, the model's performance was very similar whether pre-or post-heading data was used for model development.

Assessing Variable Importance in Various Models
All four regression methods considered for model development were implemented with seven predictor variables (VIs), but the relative importance of each predictor varied depending on the algorithm, location, and time of spectral information collection (i.e., preheading or post-heading). For Beresford, GNDVI and ExGR had high importance for both i ii iii Figure 9. Dot plots from "caret" package show model comparisons using the resampling technique for Beresford (i), Volga (ii), and South Shore (iii). Each plot shows the mean estimated RMSE value for all four algorithms. Error bars are 95% confidence intervals on the metrics for each algorithm.

Assessing Variable Importance in Various Models
All four regression methods considered for model development were implemented with seven predictor variables (VIs), but the relative importance of each predictor varied depending on the algorithm, location, and time of spectral information collection (i.e., pre-heading or post-heading). For Beresford, GNDVI and ExGR had high importance for both pre-and post-heading across the models (Figure 10a). For Volga, RTVI had the greatest importance among the VIs (Figure 10b). For South Shore, results were variable across models (i.e., GNDVI in SVM and PLS, ExGR in ANN and RTVI in RF) (Figure 10c).   Variable importance was also accessed for pre-and post-heading by aggregating information for all locations and models. For models based on pre-heading data, ExGR, GNDVI, and RTVI had a greater value of importance in comparison to another VIs ( Figure 11). The same three predictor variables also had higher importance in models developed using data from post-heading flights ( Figure 11). This suggests that both RGB based (ExGR) and NIR based (GNDVI and RTVI) indices were influential for biomass prediction.

Vegetative Indices on Predicting Biomass
Significant correlations between VIs and dry biomass yield were observed in Beresford and Volga. In South Shore, however, very few VIs were significantly correlated to dry biomass. This means that spectral information from aerial multispectral sensors may not be fully efficient for biomass monitoring in certain cases. The principle of VIs is based on photosynthetically active material, which could lead to error for the prediction of total biomass [72,73]. The indicators of plant performance in remote sensing are color, structure, and shapes of leaves. This is determined by properties like chlorophyll content and leaf morphological and surface structures, which are dependent on the genotypes and on environmental stresses and plant nutrition status. In our case, the higher moisture and lower temperature in South Shore likely resulted in the higher biomass production along with a low correlation of biomass yield with a disease like crown rust which led to minimal spectral differences amongst genotypic plots. Another possible reason for the indices not being able to predict biomass could be optical saturation. VIs saturation has been reported previously in different studies. Prabhakara et al. [37] reported that VIs was not able to detect the amount of biomass when there was high vegetation for barley and rye. In their study, NDVI, GNDVI, and G-R saturated after reaching a value of approximately 0.8 and were only related to biomass under~1500 kg/ha, beyond which an increase in biomass did not increase vegetative index value. In our study, although every location had average biomass measured above 1500 kg/ha, in South Shore, VIs reached the highest value (average NDVI value of 0.63) during the second flight (before heading) and gradually declined in later flights. Whereas, for Beresford and Volga, the average value of VIs consistently increased over time and reached to peak for the last flight before forage harvest.
In addition, during the 2019 growing season, precipitations were frequent at South Shore, where the soil was saturated with water, and dew was frequent. The average soil moisture over the growing season in South Shore was relatively 37.5% higher than in Beresford (29.2%) [74]. The presence of dew on the canopies at the time of flight could have affected the spectral reflectance quality and resulted in inaccurate vegetation indices. Pinter et al. [75], in their study on the effect of dew on canopy reflectance, found that moderate to high dew levels enhanced reflectance in visible wavelengths by 40-60% in wheat cultivars. The wetness on leaves has been observed to affect the canopy reflectance in a variety of plants, particularly in visible wavelengths [76,77].
The thirty-five oat genotypes used in this study had different maturity. The interval for heading occurrence varied depending on the location. The heading stage for all 105 plots occurred within nine days in Beresford, within six days in Volga, and within nine days in South Shore. Plots also had different maturity stages on the day of forage harvest. There is evidence that the vegetation indices are affected not only by environmental conditions but also by the growth stage of the crop [78]. Future studies should include soil moisture status, weather information, crop stage of each genotype, and other environmental factors to investigate the possible cause for failure of VIs to predict biomass.
Several studies reported using plant height derived from the crop surface model (CSM) in combination with VIs for the accurate prediction of biomass for crops like barley [79] and winter wheat [80]. Using a volume metric to estimate crop biomass within a plot (combination of spectral and structural information) has significantly improved aboveground biomass in corn [22]. Overall, these studies, along with our findings, suggest that a combination of spectral and structural information from an aerial sensor may be necessary to predict biophysical parameters like biomass more precisely.

Broad-Sense Heritability Estimates for VIs
For all three locations, NDVI and VARI had higher broad-sense heritability than dry biomass yield. Another study reported a strong genetic correlation between winter wheat grain yield and spectral reflectance and found Multispectral/RGB-based VIs with heritability (H 2 = 0.6-0.8), greater than for yield (H 2 = 0.4-0.7) [81]. With these criteria, spectral data can be used for indirect selection in plant breeding operations to increase genetic gains [18]. However, in this study, biomass and VIs were not significantly correlated in all locations. Evaluating the performance of UAV as a breeding tool for phenotyping should be evaluated over multiple locations and years before determining if VIs can be used as an indirect selection tool for oat biomass.

Comparison of Methodologies for VIs Computation
Several studies [82,83] have used pixel classification to enhance the accuracy of UAVbased data to differentiate canopy and non-canopy areas. Booth et al. [82] used the single pixel sample point method to differentiate shrub and grass species from other background pixels. Patrignani et al. [83] used Canopeo (automatic color threshold classification in MATLAB, which classified pixels to the canopy and non-canopy categories in various crops (turf, corn, sorghum, etc.). In our study, NDVI correlation to biomass improved with the pixel classification method in almost all cases (except for Volga for pre-heading flights). Nevertheless, it is essential to note that improvement seen with average reflectance over the ROI method was not consistent for every VIs. The lack of significant correlations between VIs and biomass remained unchanged for most cases in South Shore even when the pixel classification method was applied.
When considering different planophile and erectophile species, Myneni and Williams [84] reported that NDVI was unaffected by pixel heterogeneity for estimating canopy vigor based on biomass and color. Pixel heterogeneity, in our case, was comprised of panicle structure and other background effects (shadow). But resolving problems through the selection of pure canopy pixels was successful for one location (Beresford), but it did not quite improve the relationship with ground truth biomass in all cases.

Evaluation of Prediction Models for Biomass
The proportion of variance in dry biomass yield explained by the models developed in this study ranged from 70% in Beresford to 0.1% in South Shore. Similar to our results, Wengert et al. [23] used VIs (RGB and multispectral) along with texture and plant height as the predictor variable with the RF algorithm to predict above-ground biomass in barley with a R 2 of 0.62. Lu et al. [19], using VIs only as predictor variables, found that RF had a higher R 2 (0.69) than SVM and other linear-based models for predicting biomass in wheat. For Beresford, model performance marginally fluctuated between model development and model validation. Validation R 2 for Volga, however, drastically decreased for all types of models. One of the possible reason could be the lower range of dry biomass yield among plots at that location. The low performance metrics for the models developed for South Shore are expected considering the insignificant correlations between VIs and dry biomass yield.
Comparatively, all machine learning approaches yielded similar performances, except ANN. The sample size in this study was very small, while a high number of training data points is required to build optimal neural network models. Small datasets are subject to overfitting [80,[85][86][87].
RTVI, GNDVI, and ExGR consistently ranked as highly important variables. GNDVI was also reported to be a highly ranked variable for above-ground biomass prediction of a legume-grass mixture using UAV-borne spectral information [21]. Several studies [88,89] have reported that red-edge VIs were not as important as NIR-based VIs for model prediction. In our study, a red-edge-based VI (RTVI) was ranked as an important predictor.

Conclusions
The purpose of the study was to estimate oat biomass using VIs derived from high resolution UAV imagery. Differences in growing conditions between the three locations resulted in significant variations in oat biomass production. The VIs derived from multispectral imagery was found to be positively correlated to above-ground biomass for two of the locations. In the third location, however, very few UAV-derived VIs were significantly correlated with biomass yield. Two different methodologies for VI extraction were compared, i.e., the pixel classification method and average reflectance over ROI method. While the use of pixel classification appears useful to increase the strength of the correlation between VIs and biomass as observed in Beresford, this was not consistent across locations.
Four machine learning algorithms for estimating dry biomass yield were developed using VIs from UAV imagery. Approximately 70% of the variance was explained by RF, SVM, and PLS models for biomass prediction at one location. Additional sampling points with multi-year trials should be considered to improve prediction models because advanced machine learning algorithms, such as deep learning, often requires larger number of data points and long training periods to improve model accuracy.
The same crop in different environments exhibited distinct physical properties, hence, a single algorithm may not suffice the need for precise biomass monitoring. Multi-sensor data fusion, multi-index combination, the inclusion of a range of characteristics not directly linked to crop biomass monitoring, and the use of sophisticated algorithms are all viable options for enhancing the accuracy of oat biomass predictions [90].