Evaluation of RGB and Multispectral Unmanned Aerial Vehicle (UAV) Imagery for High-Throughput Phenotyping and Yield Prediction in Barley Breeding

: With advances in plant genomics, plant phenotyping has become a new bottleneck in plant breeding and the need for reliable high-throughput plant phenotyping techniques has emerged. In the face of future climatic challenges, it does not seem appropriate to continue to solely select for grain yield and a few agronomically important traits. Therefore, new sensor-based high-throughput phenotyping has been increasingly used in plant breeding research, with the potential to provide non-destructive, objective and continuous plant characterization that reveals the formation of the ﬁnal grain yield and provides insights into the physiology of the plant during the growth phase. In this context, we present the comparison of two sensor systems, Red-Green-Blue (RGB) and multispectral cameras, attached to unmanned aerial vehicles (UAV), and investigate their suitability for yield prediction using different modelling approaches in a segregating barley introgression population at three environments with weekly data collection during the entire vegetation period. In addition to vegetation indices, morphological traits such as canopy height, vegetation cover and growth dynamics traits were used for yield prediction. Repeatability analyses and genotype association studies of sensor-based traits were compared with reference values from ground-based phenotyping to test the use of conventional and new traits for barley breeding. The relative height estimation of the canopy by UAV achieved high precision (up to r = 0.93) and repeatability (up to R 2 = 0.98). In addition, we found a great overlap of detected signiﬁcant genotypes between the reference heights and sensor-based heights. The yield prediction accuracy of both sensor systems was at the same level and reached a maximum prediction accuracy of r 2 = 0.82 with a continuous increase in precision throughout the entire vegetation period. Due to the lower costs and the consumer-friendly handling of image acquisition and processing, the RGB imagery seems to be more suitable for yield prediction in this study. seed density was 300 germinating grains per m 2 . Sowing dates differed across the three environments. The ﬁeld trial in Halle ( HAL ) was sown on 28 February, in Gatersleben ( IPK ) on 14 March, and in Merbitz ( MER ) on 21 March, respectively. Field and disease management was carried out in accordance with local practice. The plant-available N content in the soil (N min ), which was measured at all environments on 20 February, resulted in more than 100 kg/ha N min at all sites. As a conclusion, no additional N fertilizer was applied. More-over, in MER and IPK plant growth regulators were applied to prevent lodging, in HAL no application was conducted. The site-speciﬁc weather conditions during the year 2019 are shown in Table S1. All three sites had a high degree of overlap, characterized mainly by below-average rainfall during the growing season and above-average temperatures during the summer months.


Introduction
Despite numerous advances in the field of genetics and the application of new molecular technologies in plant research [1], the genetic gain of major crops has stabilized or even stagnated in many regions of the world [2,3]. Considering that crop demand is estimated to double in the first half of this century [4], accelerating yield increases is crucial to overcome this challenge. Identifying bottlenecks of current approaches in plant breeding and plant production offers the greatest potential for achieving this goal. In the field of plant breeding, field phenotyping in terms of high throughput and quality is one of the biggest bottleneck in breeding programs [5][6][7].
Saiz-Rubio et al. [8] give an in-depth review on the direction modern farming and breeding is heading. While there are a variety of technical options to collect data in the field, satellites [9][10][11][12], stationary in-field sensors [13] or ground-based vehicles and farm robots [14][15][16] always have a catch and enforce decision making about flexibility, scalability and the desired data precision. Satellites can cover a large area but usually only within a rigid time schedule and with a limited spatial resolution. Rover measurements on the ground are, of course, only economical on a smaller scale but provide a high flexibility in terms of payload and usable sensor equipment. Although the regulatory take-off weight and the technically possible payload [17] pose a limit to what flying platforms can carry, Unmanned Aerial Vehicles (UAVs) provide the possibility for high-resolution multispectral and even hyperspectral imaging that can still cover a mid-size area with an acceptable spectral, spatial and temporal resolution [18]. Even larger areas are accessible with fixed wing drones [19]. UAVs can also be relocated to spatially distributed plots with less effort and almost no time-loss. They are the closest solution to a semi-continuous field survey.
On the one hand, visual scoring of traits during the growing season such as plant height or ears per square meter is quite simple but time-consuming and, therefore, limited. On the other hand, scoring of vegetation cover, disease susceptibility or life history traits such as time of shooting or maturity is more difficult, depending on human experience, and, therefore, subjective. Sensor-based phenotyping, in contrast to conventional scoring of traits, offers a high degree of objectivity, which is a prerequisite for the comparability of data from different environments, years or populations and would thus be an advantage for multinational breeders' working methods. In addition, high-throughput crop phenotyping approaches using UAV offer attractive solutions for evaluating plots with high spatial and temporal resolution, leading to an increase in the number of genotypes in mapping populations and thus more robust statistical evaluation capabilities. In addition, UAVbased phenotyping, with its high area performance, allows for screening of the breeding material at very early stages of its development. Breeding selection at this early stage of the breeding cycle is based on a few rapidly collected traits and thus offers the most potential for improvement through a better phenotype database provided by UAV. The high temporal resolution enables the observation of dynamic growth processes and the detection of small phenotypic differences and thus the extraction of new traits regarding plant architecture and physiology [20,21]. In order to make optimal use of time series of measurements and to characterize dynamic growth, it is essential to carry out functional analyses in which mathematical functions are able to simulate the plant growth pattern [22,23]. Based on these smoothed growth functions, biologically relevant traits can be extracted and could become part of a new and more adequate breeding strategy.
The previous focus on specific and final growth phases is mainly due to methodological limitations such as destructive trait surveys (e.g., determination of biomass) or simply lack of experienced staff to phenotype large segregating populations in a dense temporal interval. Phenotypic dissection by repeated UAV phenotyping could provide information on how final yield is formed during the growth phase and how direct and indirect morphological, physiological, and environmental elements influence yield in different genotypes [24]. This becomes especially important since breeding for yield improvement around the globe is based on the empirical selection criterion of yield itself, even though yield is known to be subject to low heritabilities and high genotype × environment inter-Remote Sens. 2021, 13,2670 3 of 30 action [25][26][27]. Furthermore, the collected spectral and derived spatial data can be used for yield predictions already during the growth phase [28,29], which may assist decision making in agribusiness [30]. Traditional methods of measuring yield are destructive, timeand energy-intensive and cannot be applied to large areas [31]. Pre-harvest yield estimates could be used to determine input factors such as nutrients, pesticides, and water, and to forecast upcoming labor-and cost-intensive operations such as harvesting, drying and storage [32]. In the field of plant breeding, yield prediction at the plot level opens up a possible scenario of a selection tool that rapidly and efficiently identifies promising high-yielding genotypes from a large number of early-generation lines even before harvest. This diagnostic tool would require high heritability and good correlation to grain yield.
Remote sensing techniques are able to use canopy light reflectance to assess yield at the genotype level without destructive sampling [33]. For this purpose, information on the reflection of electromagnetic waves from plant canopies is obtained with passive sensors. The measured reflectance depends on chemical and morphological properties of surfaces and changes with plant type, biomass, developmental stage, vigor and physiological properties such as water content and plant pigments [34,35]. Certain plant characteristics are associated with the absorption of very specific wavelengths of electromagnetic radiation. Furthermore, the reflection of light at several specific wavelengths can be mapped to different coefficients, called vegetation indices (VIs), which are more environmentally stable and less susceptible to interference than single wavelength information. For example, Xue et al. [36] discussed more than 100 VIs for their specific applicability and representativeness as a function of vegetation of interest, environment and implementation accuracy.
For many UAV remote sensing applications, the visible (400-700 nm), near-infrared (700-1200 nm) and short-wave infrared (>1200 nm) light spectra are measured by multispectral, hyperspectral or conventional Red-Green-Blue (RGB) cameras. Multi-and especially hyperspectral sensors have a high spectral resolution, but they are usually more expensive and have a higher weight than commercially available RGB cameras. Moreover, spectral sensing is more sensitive to ambient lighting conditions than color imaging. Multispectral sensors are the condensed form of a hyperspectral sensor by (typically) application-specific band selection, since several wavelength bands in the visible and infrared spectrum, which can be freely selected via filters, can be recorded and still enable the use of a broad range of spectral indices [37]. Furthermore, the spectral information obtained with multispectral cameras is more reliable and repeatable through the implementation of radiometric calibration [38]. Besides VI calculation, multispectral imagers are widely used for crop monitoring to assess, for example, yield predictions, nutrient status, pigment degradation, disease severity, and water content [39][40][41][42][43].
However, ultra-high-resolution images from low-cost RGB cameras with a general high quality of factory color calibration mounted on UAVs offer a wide range of phenotyping applications too. Despite the fact that RGB imagery has a very limited spectral range and resolution and is less molecule-specific due to wider spectral bands, several spectral indices related to the visible range have been proposed for use with UAV imagery, such as green chromatic coordinate (GCC) [44], excess green (EG) [45] or Normalized Green-Red Different Index (NGRDI) [46] which are widely used. The high spatial resolution is the main advantage of RGB sensors, which can be used to reconstruct the 3D structure of the leaf canopy based on the structure from motion algorithm (SfM) [47] and thus can reproduce the morphology of the plot very well. The additional information about the canopy architecture can then be integrated into the modelling of yield formation and biomass and has the potential to improve the models [28,48]. On the one hand, many authors reported that RGB indices outperform spectroradiometric VIs in terms of characterizing plant growth [49][50][51], separating vegetation from bare soil [52], assessing nitrogen [27,53] and detecting foliar diseases [54,55]. This could be in part due to the fact that spectroradiometric indices exhibit longer wavelengths in the NIR range than RGB indices and are thus more susceptible to canopy architecture, which affects the reflective properties of plants, and soil mixing pixels, which occurs especially at low spatial resolution [56]. On the other hand, several studies indicate that NIR-based VIs improve soil-plant discrimination [57,58] and provide more information about biophysical parameters than RGB-based VIs [59,60].
Agricultural models based on remotely sensed data, such as those for yield estimation, are increasingly based on machine learning methods. These methods often tend to be more robust and accurate than conventional correlative methods due to their flexibility to adapt to the complexity of data through training [61]. Furthermore, the variance of a variable to be predicted can be explained by either parametric or non-parametric approaches. The former are easier to interpret due to their predefined structure; the latter usually require more training but are also more adaptable. Commonly used methods are multivariate regression [62,63], decision trees [62,64,65], support vector machines [62,65] or artificial neural networks [28,62,65]. Deep learning methods based on artificial neural networks represent a particularly complex approach with promising results, but were not considered in this case due to the complexity of defining their architecture. In this study, a parametric generalized linear model [66] and a non-parametric random forest model [67] were attempted.
Despite many promising advances in the field of UAV-assisted high-throughput phenotyping, breeders have been reluctant to integrate these technologies into their already successful breeding pipelines in the past, due to its complexity and costs. However, as knowledge grows at the translational research level between technological and biological domains, both the complexity of data collection and processing as well as the costs of technology can be reduced, thus increasing acceptance. In addition, acceptance is raised by demonstrating the efficiency gains in achieving breeding goals through new innovative methods [68].
In this context, this study compares UAV-based RGB and multispectral image analyses in terms of traits suitable for breeding, such as plant height (HEI), vegetation cover (VCOV) and yield predictions using different modelling approaches during the entire vegetation period. The temporal grid (weekly) of the trait collection at three test sites was used for the extraction of growth dynamics traits. Using ground-based phenotyping, it was possible to quantify the accuracy of UAV-based phenotyping. Repeatability and genotype association studies were conducted to test the suitability of using conventional and new traits for barley breeding. Barley was selected as the experimental organism in this study as it serves as a genetic and phenotypic model plant for temperate cereals such as wheat, spelt, rye and triticale, while having a high economic relevance. In addition, the results found in this study using the diverse barley mapping population S42IL [69,70] could be compared with numerous previously published studies of this population as a kind of validation.

Plant Material, Environments, and Growing Conditions
The plant material included 49 wild barley introgression lines (ILs) of the S42IL library and 11 spring barley cultivars as controls. The S42IL library was derived by an initial cross between the spring barley cultivar "Scarlett" and the Israeli wild barley accession ISR42-8 [71], followed by three rounds of backcrossing with "Scarlett" and marker assisted selection which is described by Schmalenbach et al. [72]. The S42IL population was conventionally phenotyped in multiple genotype association studies to identify significant introgression lines for plant height [73][74][75], early vigor [76], drought stress [70,[77][78][79], and yield [80], which can be used comparatively for sensor-assisted phenotyping in this study.
In 2019, field experiments were conducted in three environments in Germany located at MLU's experimental field sites in Halle (51 • (Figure 1). The field trials were sown in a randomized split-plot design where each genotype was repeated six times, resulting in 360 plots per environment and 1080 plots in total. The yield plots had a dimension of 7.5 m 2 (1.5 × 5.0 m) with 8 to 10 rows. The number of sown grains per plot was varied depending on the known germination capacity of the individual line. The target Remote Sens. 2021, 13, 2670 5 of 30 seed density was 300 germinating grains per m 2 . Sowing dates differed across the three environments. The field trial in Halle (HAL) was sown on 28 February, in Gatersleben (IPK) on 14 March, and in Merbitz (MER) on 21 March, respectively. Field and disease management was carried out in accordance with local practice. The plant-available N content in the soil (N min ), which was measured at all environments on 20 February, resulted in more than 100 kg/ha N min at all sites. As a conclusion, no additional N fertilizer was applied. Moreover, in MER and IPK plant growth regulators were applied to prevent lodging, in HAL no application was conducted. The site-specific weather conditions during the year 2019 are shown in Table S1. All three sites had a high degree of overlap, characterized mainly by below-average rainfall during the growing season and above-average temperatures during the summer months.
Research (IPK) in Gatersleben (51°48′15′′N 11°15′02′′E) ( Figure 1). The field trials were sown in a randomized split-plot design where each genotype was repeated six times, re sulting in 360 plots per environment and 1080 plots in total. The yield plots had a dimen sion of 7.5 m 2 (1.5 × 5.0 m) with 8 to 10 rows. The number of sown grains per plot was varied depending on the known germination capacity of the individual line. The targe seed density was 300 germinating grains per m 2 . Sowing dates differed across the three environments. The field trial in Halle (HAL) was sown on 28 February, in Gatersleben (IPK) on 14 March, and in Merbitz (MER) on 21 March, respectively. Field and disease management was carried out in accordance with local practice. The plant-available N con tent in the soil (Nmin), which was measured at all environments on 20 February, resulted in more than 100 kg/ha Nmin at all sites. As a conclusion, no additional N fertilizer was applied. Moreover, in MER and IPK plant growth regulators were applied to preven lodging, in HAL no application was conducted. The site-specific weather conditions dur ing the year 2019 are shown in Table S1. All three sites had a high degree of overlap, char acterized mainly by below-average rainfall during the growing season and above-average temperatures during the summer months.

Ground Phenotyping Data
The ground-based conventional phenotyping included agronomic and developmen tal traits described in Table 1, which were used as ground truth (GT) in this study. Pheno typing was carried out weekly at all environments parallel to UAV flights. This continu ous data collection enabled the generation of growth curves for HEI and VCOV and the extraction of new traits based on growth dynamics (2.4). For the reference value HEIGT mean plot height was measured for 107 (two replications of the S42IL population and al Scarlett plots) out of 360 plots for each environment and time point in the field. In addi tion, the life history traits time to shooting (SHO), time to heading (HEA) and time to maturity (MAT) and lodging were phenotyped. All phenotypic data are included in Table  S2.

Ground Phenotyping Data
The ground-based conventional phenotyping included agronomic and developmental traits described in Table 1, which were used as ground truth (GT) in this study. Phenotyping was carried out weekly at all environments parallel to UAV flights. This continuous data collection enabled the generation of growth curves for HEI and VCOV and the extraction of new traits based on growth dynamics (2.4). For the reference value HEI GT , mean plot height was measured for 107 (two replications of the S42IL population and all Scarlett plots) out of 360 plots for each environment and time point in the field. In addition, the life history traits time to shooting (SHO), time to heading (HEA) and time to maturity (MAT) and lodging were phenotyped. All phenotypic data are included in Table S2.

UAV Data Platforms, Camera Systems and UAV Campaigns
The weekly UAV flights started at the beginning of April and ended after 11 weeks in HAL and after 12 weeks in IPK and MER with a flight at the developmental stage of full maturity. This period includes the grain filling as well as the ripening phase of the crops (Table S3). Two camera systems were used, which differ in terms of costs, weight, modes of operation and the type of acquired image data. High-resolution RGB images were captured by a quadrocopter system DJI Phantom 4 Professional (SZ DJI Technology Co. Ltd., Shenzhen, China), which was equipped with a Zenmuse X4S camera by default (see Table S4 Multispectral images were captured by the MACAW multispectral camera, (Tetracam Inc., Chatsworth, CA, USA), which was mounted on the UAV platform DJI Matrice 600 Pro (SZ DJI Technology Co. Ltd., Shenzhen, China). The camera consists of six independent CMOS image sensors and optics with an image size of 1280 × 1024 pixels (1.3 MP) each with 16-bit radiometric resolution. The optics focal length of 9.6 mm resulted in a ground resolution of 1.5 cm/pixel at a flight altitude of 30 m. The 25 mm diameter bandpass filters were user-configurable between~380 and~1000 nm. For the remote sensing study five filters of 10 nm full-width at half maximum (FWHM) were selected with center wavelengths at 670, 700, 740, 780, 900 and one filter of 40 nm FWHM with center wavelengths at 970 nm.
Additional data regarding the specific filter transmission, sensor specifications and spectral sensitivity of the CMOS are listed in Table S4, according to the manufacturer's data (Andover Corporation; Salem, NH, USA). To correct the measured reflectance values for surface radiance and luminance, an incident light sensor (ILS) with the same custom filter combination was mounted on the top of the Matrice 600 Pro (M600) to record the relative strength of down-welling radiation to be integrated into the TIFF image headers. The M600 was equipped with a Gremsy T3 Gimbal, and an A3 flight controller. With a total weight of 11 kg and a total price of EUR 35,000 it represents a cost-intensive measuring platform in this study. Both UAVs were controlled by a remote controller and an iPad (Apple Inc., Cupertino, CA, USA) with the DJI Ground Station Pro app (SZ DJI Technology Co. Ltd., Shenzhen, China). Details of the individual flight campaigns and the current weather conditions at the time of the UAV flights are listed in Tables S1 and S4. Geo-referencing of the UAV images was done by six ground control points (GCPs), equally distributed across the field trials. After the position of the GCPs were determined using a Trimble R9s GNSS (Global Navigation Satellite System) receiver (Trimble Ltd., Sunnyvale, CA, USA) with a precision of 0.02 m, GCPs were not moved during the entire vegetation period.

UAV Data Processing
The overall workflow of the developed pipeline is presented in Figure 2. Except for initial pre-processing and photogrammetric processing all operations on the UAV data were performed in the language and environment for statistical computing R [82]. Special packages used are mentioned in the corresponding sections.
for surface radiance and luminance, an incident light sensor (ILS) with the same custom filter combination was mounted on the top of the Matrice 600 Pro (M600) to record the relative strength of down-welling radiation to be integrated into the TIFF image headers. The M600 was equipped with a Gremsy T3 Gimbal, and an A3 flight controller. With a total weight of 11 kg and a total price of EUR 35,000 it represents a cost-intensive measuring platform in this study.
Both UAVs were controlled by a remote controller and an iPad (Apple Inc., Cupertino, CA, USA) with the DJI Ground Station Pro app (SZ DJI Technology Co. Ltd., Shenzhen, China). Details of the individual flight campaigns and the current weather conditions at the time of the UAV flights are listed in Tables S1 and S4. Geo-referencing of the UAV images was done by six ground control points (GCPs), equally distributed across the field trials. After the position of the GCPs were determined using a Trimble R9s GNSS (Global Navigation Satellite System) receiver (Trimble Ltd., Sunnyvale, CA, USA) with a precision of 0.02 m, GCPs were not moved during the entire vegetation period.

UAV Data Processing
The overall workflow of the developed pipeline is presented in Figure 2. Except for initial pre-processing and photogrammetric processing all operations on the UAV data were performed in the language and environment for statistical computing R [82]. Special packages used are mentioned in the corresponding sections.

Initial Pre-Processing of Multispectral Imagery
The pre-processing of the MACAW imagery was carried out with the image editing program Pixel Wrench 2 (Tetracam Inc.; Chatsworth, CA, USA) and basically comprised two steps. The use of six independent sensors, each representing a user-selected wavelength, causes a spatial offset within the 6-layer TIF file. To correct for this offset, an alignment was required, which was carried out using an alignment file with fixed individual values for rotation and scaling of the images for each sensor. Moreover, reflectance values were obtained as a fraction of the detected incident radiation by using the ILS metadata recorded at the time of image acquisition. Regarding the low flight altitude of the UAVs and the thus assumed limited atmospheric influence, no additional atmospheric correction was conducted.

Initial Pre-Processing of Multispectral Imagery
The pre-processing of the MACAW imagery was carried out with the image editing program Pixel Wrench 2 (Tetracam Inc.; Chatsworth, CA, USA) and basically comprised two steps. The use of six independent sensors, each representing a user-selected wavelength, causes a spatial offset within the 6-layer TIF file. To correct for this offset, an alignment was required, which was carried out using an alignment file with fixed individual values for rotation and scaling of the images for each sensor. Moreover, reflectance values were obtained as a fraction of the detected incident radiation by using the ILS metadata recorded at the time of image acquisition. Regarding the low flight altitude of the UAVs and the thus assumed limited atmospheric influence, no additional atmospheric correction was conducted.

Photogrammetric Processing
Photogrammetric data processing of the gained UAV imagery was performed in Agisoft Metashape Professional (Version 1.5.2.7838, Agisoft LLC, St. Petersburg, Russia) using its built-in SfM algorithm, which enables the three-dimensional reconstruction of the scene based on common feature points of overlapping images [47]. As two different camera systems with distinct spatial-spectral resolution were used, the Agisoft workflow was run using different settings to achieve best results, respectively. The workflow of orthomosaic generation consisted of four main processing steps: image alignment, generation of dense point cloud, digital elevation model (DEM) generation, and final orthomosaic generation. The imagery was georeferenced during the alignment process by manually identifying six GCPs within the raw geometry UAV images and applying Agisoft's optimization function. The individual settings for processing the datasets are summarized in Table S5.

Crop Height Model (CHM) and Vegetation Index (VI) Calculation
Crop height models (CHMs) were derived from high-resolution RGB DEMs generated in Agisoft Metashape Professional by subtracting the pixel values of the first bare soil DEM at each test site from subsequent DEMs' pixel values to gain absolute HEI. HEI CHMred represents a subset of CHM reduced to the 107 plots where HEI CHM was captured in the field. VIs were calculated according to Table 2 and further subdivided into the following groups with regard to their sensitivity: single band, pigment, water content and physiology. All raster calculations were implemented using the raster package [83]. Table 2. Spectral vegetation indices (VIs) tested in this study. "R" denotes the reflection in indicated wavebands; RGB channels are represented in terms of corresponding spectral regions.

Index
Index Full Name Platform Group/Sensitivity Formula Reference

B1-NIR1
Near infrared band 1 Multi Single band R780more information see Table S4 B2-RED Red band Multi Single band R670-

Soil Masking
To reduce bare soil interference for spectral feature extraction during the first developmental stages of the plants, Otsu's method was applied to high-contrast NDVI and EG images ( Table 2). Otsu's method is an adaptive, nonparametric and therefore unsupervised thresholding algorithm for image segmentation based on grayscale histograms [97] that is proven to be suitable for separating plant and bare soil pixels [52]. Assuming a bimodal histogram, the algorithm determines a threshold at which the within-class variance of the resulting two classes becomes minimal. Pixels assigned to the lower index intensity class in this way were treated as soil pixels, masked and excluded from further processing. The method was implemented using the auto_thresh function of the autothresholdr package [98]. The resulting masks were then applied to spectral bands and VIs of the same platform, date and place.

Vegetation Cover (VCOV) Derivation
Soil masking for VCOV derivation was varied as Otsu's method would still result in the presence of blend pixels, which were desired for the acquisition of the spectral features, but not for relative vegetation fraction estimation. Therefore, a stricter discrimination was required. Only Pixels of the RGB imagery above a fixed GCC VI threshold of 0.38 were treated as plant pixels. In contrast, the multispectral imagery was masked by applying the base R k-means algorithm with four cluster centers to NDVI images, resulting in a more dynamic threshold. Only pixels belonging to the highest NDVI cluster were treated as plant pixels. Based on these threshold, binary images are calculated and used as vegetation masks in the further processing of the image data.

Plotwise Feature Extraction
Plotwise orthomosaic features were extracted using the RSAGA package, which provides access to geocomputing and terrain analysis functions of the geographical information system SAGA within R [99]. SAGA version 6.2.0 was used. Plot polygon shapefiles of the field trials were created with the MiniGIS software by Geo-Konzept (Adelschlag, Germany). After applying a −0.2 m buffer to each plot polygon shapefile to account for boundary effects, summary statistics such as arithmetic mean and the 95% quantile of the plots were extracted. The VCOV feature was estimated by extracting the mean value of the masked GCC and NDVI orthomosaics described above.

Statistical Analysis
Further data analysis was performed using the plot polygon's 95% quantile of CHM orthomosaics and the mean value of VI orthomosaics and VCOV orthomosaics.

Ground Truth Validation
To assess the accuracy of canopy height determined by UAV imagery (HEI CHM ) Pearson's coefficient of correlation (r) and the root mean square error (RMSE) of HEI CHM and HEI GT was calculated (formula see 2.5.3) for each point in time and across the entire vegetation period.

Growth Rate Modelling
Multi-temporal measurements enable the calculation of growth rates of HEI and VCOV data by applying the smoothing and extraction of trait (SET) method implemented in the growthPheno R package [22]. According to Brien et al. [22] the best smoothing method, which results in fitting the observed data to natural cubic smoothing splines by direct smoothing was selected.
In addition to the smoothing method, the degrees of freedom (DF) for the fitting procedure must be specified. For HEI, DF = 5 provided the best results, with 10 up to 11 observations made across the year ( Figure S1). Based on the smoothed HEI values, the absolute growth rates (AGR) according to Brien et al. [22] were calculated. The smoothed HEI and AGR values, plotted to identify growth dynamics, appeared to be homogenous. Two different growth phases could be identified. The first growth phase is characterized by a steady increase of the growth rate up to the maximum and the second by a steady decrease of the growth rate. Based on this definition, the slope between the endpoints of the increasing growth phase (HEI GRi ) and decreasing growth phase (HEI GRd ) were used as a trait for plant growth proxies. In addition, the maximum canopy height (HEI MAX ) across all UAV flights was recorded and is referred to as HEI in this study.
For the VCOV estimation, we used two sets of data: one based on the RGB images and the other based on the NDVI values of the multispectral images ( Figure S2). The discrimination between soil and plant pixels is described in detail in soil masking in Section 2.4.4. The VCOV estimation resulted in smoothed vegetation cover data (VCOV smoothed ), which were used in yield prediction modelling. We obtain the best results with DF = 3, based on 4-6 observations. We extracted two traits based on these time series data. For the first trait, the increasing homogeneous growth phase was selected and the slope (VCOV GRi ) was calculated, as was done for HEI. The second trait is represented by the number of days during which the VCOV increased from 10% to 90% (VCOV 90 ). For this purpose, the days of year on which the smoothed VCOV values of 10% and 90% were reached were interpolated and subtracted from each other.

Yield Prediction
Spectral traits (Table 2) as well as the spatial traits HEI and VCOV of the two presented imaging systems were used to assess their ability to predict plot yield. In doing so, both systems were considered independently in order to compare their suitability for prediction. Since CHM could only be derived from RGB imagery, it was not used in multispectral yield prediction. The modelling of plot yield included three approaches: For multiple regression in approaches 2 and 3, an elastic-net regularized generalized linear model (GLM) and a random forest model were fitted. Both were implemented using the tidymodels framework [100]. The elastic-net regularized GLM algorithm linearly combines L1 (lasso) and L2 (ridge) regularization penalties in a multiple linear model. It was applied using the glmnet package [66] as engine with penalty held at 0.001 and a mixture of L1 and L2 held at 0.5. Prior to model building, predictors were centered and scaled, a prerequisite for successful regularization.
The random forest algorithm is a robust ensemble learning method for classification as well as regression that constructs a variety of decision trees or regression trees based on a randomly chosen subset of the data and a randomly chosen subset of features for each derived tree. In regression, the prediction corresponds to the average prediction of all regression trees [67]. The two hyperparameters ntree (the number of trees to grow) and mtry (the number of variables randomly sampled as candidates at each tree split) were held constant at ntree = 1000 and mtry = n predictors /3.
Single trait-yield correlations were evaluated by calculating the coefficient of determination (r 2 ). For multiple linear regression, the RMSE was captured as well. The r 2 and RMSE were calculated as: where x i is the actual plot yield, x is the average actual plot yield, y i is the predicted plot yield and n is the number of observations. Multiple regression metrics were extracted and averaged over a tenfold cross-validated model fit. GLM variable estimates were extracted from the best cross-validated fit and their corresponding percentage values were calculated as fraction of the overall absolute estimate sum.
Yield predictions for the genotype association study were collected over a ten times repeated tenfold elastic-net GLM cross-validation where predictions of the independent validation subset were kept and averaged for each plot.

Genotype Association Study
Prior to the genotype association study the Repeatability (R 2 ) as a measure of the proportion of genotypic variation in the total phenotypic variance was calculated using the lme4 package [101] for each trait at each point in time as: where V G and V P are the genotypic and the total phenotypic variance and r is the number of repetitions (r = 6 in this study) per genotype. The genotype association study was then conducted as Dunnett's test [102], comparing the UAV trait of each genotype to the control cultivar Scarlett using the glht package for general linear hypotheses [103]. To reduce type I error probability due to multiple comparisons [104], the Bonferroni adjustment was applied to the model summary and the significant threshold was set at p < 0.01.

Canopy Height Determination
The correlations of HEI CHM and HEI GT across all dates are shown in Figure 3. In the three environments HAL, IPK and MER correlations of 0.82, 0.89, and 0.91 were attained, resulting in an overall correlation of r = 0.87 using the 95% CHM plot quantile. Moreover, Figure 3 and Figure S3 demonstrate that the maximum height was reached before MAT. As it is known for barley, the ears start to bend over during maturation and, thus, the height decreases. Despite primarily strong correlations, HEI GT was generally underestimated by the UAV, as it is shown in Figure S3.   Figure S4 shows the repeatability of three traits that represent the results of the growth curve analysis made possible by UAV phenotyping. Therefore, raw HEICHM were smoothed ( Figure S1) to identify homogeneous growth phases and to determine the growth rates within the homogeneous growth. The increase in growth rate (HEIGRi) oc- Moderate to high repeatabilities of these parameters ranging from R 2 = 0.59 for HEIGRd in IPK to R 2 = 0.98 for HEIGRi in HAL show that phenotypic variation can be explained by genotype and thus be used in breeding. HEI modelling over time based on UAV phenotyping revealed reliable results and the GR parameters were therefore integrated into the yield estimation model.
The subsequent Dunnett test to identify statistically significant genotype associations for all HEI traits revealed a large number of S42ILs that differ significantly from Scarlett The use of the 95% CHM plot quantile shows the smallest absolute deviations from HEI GT . A drawback of using the 95% CHM plot quantile for height estimation was that isolated and vertically growing plants in plots of genotypes where lodging occurred, resulted in an overestimation of the overall plot height. These caused lower correlations to the HEI GT data, as can be seen in the late acquisition period in HAL and IPK ( Figure S3). That complicates the simple automated detection of lodged plots and in some cases a manual check, by looking at the variation of CHM information in the entire plot, would be necessary. Furthermore, individual time points showed a large variation in correlation coefficients between HEI CHM and HEI GT ranging from r = 0.09 in HAL on 104 DAS to r = 0.93 in MER on 68 DAS and in HAL on 89 DAS, which is shown in Figure S3.
For a direct statistic comparison in the sense of breeding (repeatability and concordance of genotype associations) between the conventional and new phenotyping method, a subset of the HEI CHM dataset was formed (HEI CHMred ), which was reduced to the 107 identical plots of the HEI GT dataset. The repeatabilities of HEI CHM, HEI CHMred and HEI GT in Figure S4 showed the same trends between individual time points with high values of up to R 2 = 0.98 almost across the whole growing period. Before SHO, repeatability was low and began to increase with the height growth around SHO to find the maximum and reach a plateau at HEA until ripening began and the repeatability decreased again. It should be emphasized that HEI CHM generally showed better repeatability than HEI GT and HEI CHMred , except at late plant developmental stages in HAL, where HEI GT showed slightly better repeatability ( Figure S4). The repeatability of HEI CHMred fluctuated slightly and was on the same scale as HEI GT even at those times points where the canopy height measured in the field was clearly underestimated by UAV (e.g., HAL on 89, 95 DAS, IPK on 104, 111 DAS, MER on 84, 90 DAS). Figure S4 shows the repeatability of three traits that represent the results of the growth curve analysis made possible by UAV phenotyping. Therefore, raw HEI CHM were smoothed ( Figure S1) to identify homogeneous growth phases and to determine the growth rates within the homogeneous growth. The increase in growth rate (HEI GRi ) occurred between 59-85 DAS in HAL, 34-61 in MER and 62-83 DAS in IPK. The decrease in the growth rate (HEI GRd ) was between 85-120 DAS in Halle, 61-103 DAS in MER and 83-111 DAS in IPK.
Moderate to high repeatabilities of these parameters ranging from R 2 = 0.59 for HEI GRd in IPK to R 2 = 0.98 for HEI GRi in HAL show that phenotypic variation can be explained by genotype and thus be used in breeding. HEI modelling over time based on UAV phenotyping revealed reliable results and the GR parameters were therefore integrated into the yield estimation model.
The subsequent Dunnett test to identify statistically significant genotype associations for all HEI traits revealed a large number of S42ILs that differ significantly from Scarlett ( Figure S5). The highest significant differences were found between SHO and HEA at all environments, where the difference in growth was the greatest as no plant growth regulator was applied and no lodging occurred during this period. In general, the HEI CHM dataset reached the highest significance and several lines were constantly detected over the measurement period. In comparison, the reduced HEI GT and HEI CHMred datasets showed less significant results, which shows the great advantage of the UAV phenotyping, which can cover a larger sample size. Furthermore, the same lines with a significant HEI could be detected for the calculated growth parameters and for HEI MAX . Again, HEI GRi , which represents the period of increasing HEI and therefore the time between SHO and HEA, showed the highest significance.

Vegetation Cover
The vegetation cover (VCOV) was determined using both RGB and multispectral imagery to compare the two methods. To identify early vigor genotypes single time points of VCOV were used for smoothing ( Figure S2) and subsequently extracting growth parameters.
The estimation of VCOV by the RGB imagery resulted in higher consistency across sequential flights with a later saturation and a lower rate of outliers ( Figure S2), which provided a more reliable dataset compared to the VCOV values determined by multispectral imagery. The VCOV values of the multispectral dataset appeared to have a poorer quality due to the greater scatter and outlier rate. In MER VCOV showed the greatest variation across both determination methods, but nevertheless it can be seen across all datasets that the population average of the S42ILs is slightly above the performance of Scarlett.
Inaccuracies can be mitigated by smoothing the growth curve and still allow growth trends to be identified. In order to identify fast growing genotypes, the smoothed vegetation cover data (VCOV smoothed ) were used to estimate the growth parameters VCOV GRi and VCOV 90 . The calculated repeatabilities show differences between environments ( Figure S6). Across all VCOV traits, the highest repeatability of R 2 = 0.94 was estimated in HAL at 81 DAS by the RGB imagery. Furthermore, in HAL and especially in IPK, high and consistent repeatabilities (R 2 = 0.47-0.94) over time could be detected. In some cases, higher repeatability was detected with the raw VCOV data and in some cases with the smoothed VCOV data, which does not reveal a consistent trend. Nevertheless, MER showed the lowest repeatabilities (R 2 = 0-0.36).
The repeatabilities of the growth parameters VCOV GRi and VCOV 90 were in the same order of magnitude as the repeatabilities on individual measurement days, so that it made sense, with the exception of MER, to carry out a Dunnett test to identify significant genotype associations on the basis of the VCOV smoothed data. The Dunnett test revealed one significant line in HAL and in IPK based on multispectral determination of VCOV smoothed and three significant lines based on RGB determination of VCOV smoothed only in IPK ( Figure S6). Unfortunately, there is no overlap of significant lines across sites or datasets, although decreased p-values of some lines indicate the same tendencies.

Yield Prediction
The measured plot yield was the target value of the yield modelling and is shown in  In order to predict plot yield as accurately as possible, three different approaches were tested. In the following, only the results of the GLM models are described and the results of the corresponding RF models, which generally reflected the same trends, are included in Figure S7.
In approach 1, the linear regression of plot yield based on a single trait and a single time point revealed strong differences in predictive performance between the individual traits and time points (Figure 5a). For Figure 5a, the two VIs with the highest r 2 values per environment and imagery dataset were selected and plotted across all environments, resulting in four and five VIs for the multispectral and RGB imagery datasets, respectively. The highest r 2 values of the RGB data set were detected at all three environments shortly before MAT with r 2 = 0.55, r 2 = 0.37 and r 2 = 0.21 in HAL, MER and IPK, respectively ( Figure  5a,d). In the multispectral dataset, high r 2 values were detected both around HEA and shortly before MAT, which were of the same order of magnitude as in the RGB dataset, except the high predictive power of the multispectral data at 47 DAS in MER. HAL generally had the highest and IPK the lowest r² values. In some cases, the r 2 increased after precipitation, especially in the multispectral dataset, for example at HAL 61 DAS, IPK 70 DAS and 92 DAS and in MER 84 DAS. In order to predict plot yield as accurately as possible, three different approaches were tested. In the following, only the results of the GLM models are described and the results of the corresponding RF models, which generally reflected the same trends, are included in Figure S7.
In approach 1, the linear regression of plot yield based on a single trait and a single time point revealed strong differences in predictive performance between the individual traits and time points (Figure 5a). For Figure 5a, the two VIs with the highest r 2 values per environment and imagery dataset were selected and plotted across all environments, resulting in four and five VIs for the multispectral and RGB imagery datasets, respectively. The highest r 2 values of the RGB data set were detected at all three environments shortly before MAT with r 2 = 0.55, r 2 = 0.37 and r 2 = 0.21 in HAL, MER and IPK, respectively (Figure 5a,d). In the multispectral dataset, high r 2 values were detected both around HEA and shortly before MAT, which were of the same order of magnitude as in the RGB dataset, except the high predictive power of the multispectral data at 47 DAS in MER. HAL generally had the highest and IPK the lowest r 2 values. In some cases, the r 2 increased after precipitation, especially in the multispectral dataset, for example at HAL 61 DAS, IPK 70 DAS and 92 DAS and in MER 84 DAS. In approach 2, yield predictions simultaneously included a number of RGB, multispectral and morphology traits (HEI and VCOV) at a single time point. The predictive power increased, but was still subject to large fluctuations between the time points (Figure In approach 2, yield predictions simultaneously included a number of RGB, multispectral and morphology traits (HEI and VCOV) at a single time point. The predictive power increased, but was still subject to large fluctuations between the time points (Figure 5b). The curves of the prediction accuracy between the RGB and the multispectral data set showed similar fluctuations within a range between r 2 = 0.03 and r 2 = 0.6. The highest prediction accuracy was detected at all 3 sites in the last third of the vegetation period and reached values of r 2 = 0.6, r 2 = 0.5 and r 2 = 0.29 in the RGB imagery in HAL, MER and IPK, respectively (Figure 5b,d). The RMSE is contrary to the r 2 and ranges from 0.51 kg/plot in HAL (89 DAS, RGB) to 0.28 in IPK (92 DAS, Multi), which corresponds to an error of 13.7% and 6.3%, respectively.
Approach 3 included all traits and time points simultaneously to model plot yield. With this approach, the prediction accuracy increased with each additional day of measurement and clearly exceeded the prediction accuracy at the end of the vegetation period compared to the second approach ( Figure 6). Prediction accuracy of all yield estimation models generally showed an asymptotic trend. However, the timing and the level of prediction accuracy differed between environments and imagery datasets. The trends in prediction accuracy between the environments and the RGB and multispectral datasets were retained from approach 2, but at a higher level. High r 2 values were also detected shortly before MAT and reached the highest value in HAL with r 2 = 0.82, followed by r 2 = 0.65 in MER and r 2 = 0.55 in IPK. Nevertheless, the high r 2 values in HAL had a larger RMSE than MER or IPK. The error term of the regression models of the multispectral dataset was between 0.24 and 0.39 kg/plot (=6.4-10.4%) in HAL, between 0.27 and 0.35 kg/plot (=6.7-8.7%) in MER and between 0.26 and 0.33 kg/plot (=5.9-7.5%) in IPK. The RMSE of the RGB based regression models ranged in HAL between 0.23 and 0.45 kg/plot (=6.1-12.1%), in MER between 0.24 and 0.32 kg/plot (=6.1-8.0%) and in IPK between 0.23 and 0.33 kg/plot (=5.2-7.6%). The absolute estimation error was smallest in IPK, despite the lower r 2 , because the CV of the measured yield is lowest (Figure 4). Although IPK had the lowest r 2 value, the absolute estimation error is the smallest because the CV of the measured yield is the smallest (Figure 4). showed similar fluctuations within a range between r 2 =0.03 and r 2 =0.6. The highest prediction accuracy was detected at all 3 sites in the last third of the vegetation period and reached values of r 2 = 0.6, r 2 =0.5 and r 2 = 0.29 in the RGB imagery in HAL, MER and IPK, respectively (Figure 5b,d). The RMSE is contrary to the r 2 and ranges from 0.51 kg/plot in HAL (89 DAS, RGB) to 0.28 in IPK (92 DAS, Multi), which corresponds to an error of 13.7% and 6.3%, respectively. Approach 3 included all traits and time points simultaneously to model plot yield. With this approach, the prediction accuracy increased with each additional day of measurement and clearly exceeded the prediction accuracy at the end of the vegetation period compared to the second approach ( Figure 6). Prediction accuracy of all yield estimation models generally showed an asymptotic trend. However, the timing and the level of prediction accuracy differed between environments and imagery datasets. The trends in prediction accuracy between the environments and the RGB and multispectral datasets were retained from approach 2, but at a higher level. High r 2 values were also detected shortly before MAT and reached the highest value in HAL with r 2 = 0.82, followed by r 2 = 0.65 in MER and r 2 = 0.55 in IPK. Nevertheless, the high r 2 values in HAL had a larger RMSE than MER or IPK. The error term of the regression models of the multispectral dataset was between 0.24 and 0.39 kg/plot (≙6.4-10.4%) in HAL, between 0.27 and 0.35 kg/plot (≙6.7-8.7%) in MER and between 0.26 and 0.33 kg/plot (≙5.9-7.5%) in IPK. The RMSE of the RGB based regression models ranged in HAL between 0.23 and 0.45 kg/plot (≙6.1-12.1%), in MER between 0.24 and 0.32 kg/plot (≙6.1-8.0%) and in IPK between 0.23 and 0.33 kg/plot (≙5.2-7.6%). The absolute estimation error was smallest in IPK, despite the lower r 2 , because the CV of the measured yield is lowest (Figure 4). Although IPK had the lowest r 2 value, the absolute estimation error is the smallest because the CV of the measured yield is the smallest (Figure 4). To get a deeper understanding, which traits were selected for yield prediction (approach 3), the effect sizes of the traits were extracted and visualized in Figure 7 and listed in Table S6. Therefore, VIs were grouped according to Table 2. This reveals the importance of trait groups, for yield estimation over time and general trends can be identified. For example, in the RGB dataset, the importance of HEI and VCOV initially increases and later decreases over the vegetation period. The pigment-classified VIs seem to play a greater role in yield estimation across environments and datasets at the beginning of the vegetation period than at the end. To get a deeper understanding, which traits were selected for yield prediction (approach 3), the effect sizes of the traits were extracted and visualized in Figure 7 and listed in Table S6. Therefore, VIs were grouped according to Table 2. This reveals the importance of trait groups, for yield estimation over time and general trends can be identified. For example, in the RGB dataset, the importance of HEI and VCOV initially increases and later decreases over the vegetation period. The pigment-classified VIs seem to play a greater role in yield estimation across environments and datasets at the beginning of the vegetation period than at the end.
Remote Sens. 2021, 13, x FOR PEER REVIEW 18 of 31 Figure 7. Date-wise estimate sums of the GLM yield prediction approach 3. Spectral traits are grouped according to their sensitivity as described in Table 2. Table S6 shows that with increasing dataset size over the vegetation period, the number of traits selected by the predictive model also increased while the impact of each single estimator per trait decreased. As the prediction performance across all environments increased over time, apparently diversification of the selected traits improved prediction performance. However, this relationship seems to be limited, as at the end of the growing season the number of selected traits for the model continued to increase, while the r 2 values stagnated.
Interestingly, canopy height (HEI) information is important for yield prediction within the RGB dataset starting from the beginning (Figure 7) and moreover, canopy heights from all stages of development were selected for modelling, not just the most recent canopy height, especially in HAL (Table S6). On the contrary, at MER and IPK, where canopy height at the end of the vegetation period showed minimal variation due to the application of growth regulators, early height information (with more variability) was selected for yield prediction until the end.
VCOV was the second morphological trait extracted from the UAV imagery and was included in both RGB and multispectral datasets as determined by different methods. The trait was phenotyped very early in the vegetation period and was still present in almost all yield models across environments and datasets (Figure 7). In HAL, the VCOV at 55 DAS appeared frequently in yield models, in MER at 34 and 40 DAS and in IPK at 42 and 49 DAS. The VCOV estimates were especially high in the RGB dataset, exceeding 30% in the 47 DAS yield model in MER. In the RGB dataset, the sum of the VCOV estimates reached its maximum in the early yield models around SHO. However, it should be taken into account that at this time the prediction accuracy had not yet reached its maximum. In the multispectral dataset, VCOV played a smaller role, but was nevertheless detected in yield models over the entire vegetation period with constant VCOV estimates between 2% and 14% without a discernible maximum across the vegetation period.
Nevertheless, the spectral information represents the largest percentage of the estimates. In yield modelling based on the multispectral dataset, this proportion represents  Table 2. Table S6 shows that with increasing dataset size over the vegetation period, the number of traits selected by the predictive model also increased while the impact of each single estimator per trait decreased. As the prediction performance across all environments increased over time, apparently diversification of the selected traits improved prediction performance. However, this relationship seems to be limited, as at the end of the growing season the number of selected traits for the model continued to increase, while the r 2 values stagnated.
Interestingly, canopy height (HEI) information is important for yield prediction within the RGB dataset starting from the beginning (Figure 7) and moreover, canopy heights from all stages of development were selected for modelling, not just the most recent canopy height, especially in HAL (Table S6). On the contrary, at MER and IPK, where canopy height at the end of the vegetation period showed minimal variation due to the application of growth regulators, early height information (with more variability) was selected for yield prediction until the end.
VCOV was the second morphological trait extracted from the UAV imagery and was included in both RGB and multispectral datasets as determined by different methods. The trait was phenotyped very early in the vegetation period and was still present in almost all yield models across environments and datasets (Figure 7). In HAL, the VCOV at 55 DAS appeared frequently in yield models, in MER at 34 and 40 DAS and in IPK at 42 and 49 DAS. The VCOV estimates were especially high in the RGB dataset, exceeding 30% in the 47 DAS yield model in MER. In the RGB dataset, the sum of the VCOV estimates reached its maximum in the early yield models around SHO. However, it should be taken into account that at this time the prediction accuracy had not yet reached its maximum. In the multispectral dataset, VCOV played a smaller role, but was nevertheless detected in yield models over the entire vegetation period with constant VCOV estimates between 2% and 14% without a discernible maximum across the vegetation period.
Nevertheless, the spectral information represents the largest percentage of the estimates. In yield modelling based on the multispectral dataset, this proportion represents 91.2% on average across environments and time points. On the one hand individual bands and VIs (simple ratios and normalized VIs) showed high effects that were environmentspecific. In MER, for example, the NDWI (water content VI) showed large effects between 7.4−20.2% across the entire vegetation period and was thus crucial for yield prediction (Table S6). At the other environments, however, this VI hardly seemed to play a role.
The VOG (Pigment VI) showed larger effects in HAL and IPK up to the last yield models and seemed to be less weighted in MER and did not appear in the last three yield models. On the other hand, the PSSRa (Pigment VI) and REP index (Physiology VI) seemed to be less environments-specific, as performance was similar across sites, with PSSRa showing larger estimates than REP. In the yield models based on the RGB dataset, the estimates of the spectral information had a lower relevance with 75.5% on average across environments and time points compared to the multispectral dataset. The three individual bands as well as the six calculated VIs were equally important for the yield models.
With the aim of optimizing yield predictions, further model approaches were tested by merging the collected data in different ways. One approach was to merge the RGB and multispectral datasets to combine the good features of the two camera systems. However, r 2 and RMSE of the GLM or RF models showed no significant improvement across all three environments ( Figure S7c). Another approach was to merge the datasets across all three environments and to carry out the modelling to end up with holistic models for RGB and multispectral imagery. In this approach, the RF model seemed to perform much better than the GLM model, with no difference between the two camera systems ( Figure S7d). The final prediction performance of the GLM model (r 2 = 0.68) represents an average of the environment-specific model performances. The final approach was the total combination of the all available datasets that resulted in one model. The prediction performance of the GLM model increased up to r 2 = 0.73, approaching the RF model (r 2 = 0.75), but again no strong increase in predictive performance was evident from the data fusion ( Figure S7d).
Another criterion for the prediction accuracy was the repeatability calculation, which was carried out with the predicted values and the ground true yield (YLD GT , measured in the field). Figure S8 reveals that the highest repeatability by far could be calculated at a consistently high level in IPK. The repeatability of the predicted yield showed a small variation across the vegetation period from R 2 = 0.81 (42 DAS) to R 2 = 0.91 (70 DAS) of the multispectral and R 2 = 0.85 (49 DAS) to R 2 = 0.95 (62 DAS) of the RGB dataset, but did not reach the high repeatability of R 2 = 0.96 of the YLD GT data. In HAL and MER significantly lower repeatabilities were calculated, ranging from R 2 = 0.07 to R 2 = 0.56 with an increasing trend during the vegetation period. The repeatability of the YLD GT data were also at a low level of R 2 = 0.66 in HAL and R 2 = 0.57 in MER.
The subsequent genotype association study conducted with the predicted yield from the stacked dataset approach 3 confirmed the results of the repeatability analysis. In HAL and MER, based on YLD GT data, the Dunnett-test revealed only line S42IL-143 with a significant difference in yield to Scarlett, while no significant lines were detected with the modelled yield data (Table S7). Therefore, only the Dunnett-test results of the field trial in IPK are shown in Figure S9. In IPK a total of nine significant lines were detected in the YLD GT data. In the multispectral dataset, 9 significant lines were detected across the 11 models generated over the vegetation period, but only 6 lines matched those found in the YLD GT dataset. This implies that the three remaining S42 lines were false positives. The false positives S24IL_105, S24IL_121 and S24IL_138 were only detected between the beginning and the middle phase of the vegetation period .
In addition, the significant lines S42IL-111, S42IL-126 and S42IL-176 from the YLD GT dataset were false negatives that could not be confirmed by the modelled yield data at any time point. The results of the Dunnett test based on the yield prediction of the RGB dataset revealed 12 significant lines across the 12 models, with four false positive lines (S24IL_103, S24IL_105, S24IL_121, S24IL_138) and 1 false negative line (S42IL_111), resulting in 7 overlapping lines with YLD GT . The comparison of the predicted yield with YLD GT showed that the lack of difference in estimated yield between the false negatives and the cultivar Scarlett was due to an overestimation of the predicted yield of the false negatives (Table S8). For instance, the yield of line S24IL_111 was overestimated even across the two datasets, although the line showed a very strong negative effect in the YLD GT dataset.

Canopy Height Determination
The correlations of HEI CHM and HEI GT were slightly below the correlations observed in similar studies [48,105,106], which could be due to the fact that the plant height showed a relatively small variation when using plant growth regulators. Without the use of plant growth regulators, as was the case in HAL, plots suffered lodging and the determination of the height was difficult with both the UAV and the ruler, which led to an inaccurate measurement. Moreover, the HEI GT measurements in the field cannot be seen as a general standard for canopy height determination due to the subjectivity of the measurement [107,108].
The general underestimation of plant height by the UAV is an already known issue in the determination of crop height by the SfM algorithm [106,109]. An exact determination is assumed to be mainly impeded by plant movement as it prevents the SfM algorithm from finding common pixels on the top plant level of different images, the so-called tiepoints. This effect is, therefore, more likely to occur with increasing canopy height and under windy conditions. In addition, during the shoot elongation phase between SHO and HEA, when the main tillers elongate, the canopy is very thin, resulting in the greatest underestimation of HEI CHM . That indicates that thinner plant stock reduces the probability, to detect tie-points at the top of the canopy. It is more likely that tie-points occur at a lower level, where leaves form a denser canopy. Unfortunately, no simple correlation between the underestimation and some of these sources of error could be detected as they all overlap to varying degrees at different times.
The low correlation coefficient of individual time points at the late acquisition period in HAL could be explained by the missing possibility to reliably detect lodged plots, as it is mentioned in 3.1. This was mainly observed in HAL, as only there no plant growth regulators were applied to prevent lodging. To some extend this is also the case in IPK. In addition, the DEM generation was partly erroneous in the last three datasets for IPK and it was impossible to create high-quality DEMs. The troubleshooting was very laborious and, retrospectively, hardly possible. As the flight parameters and settings have not been changed and the environmental conditions were similar to the previous flights, these factors can be excluded. A possible cause could be the unconscious movement of GCP between 92 DAS and 98 DAS by farming machines, which could have caused the high inaccuracy of GCP in the final DEM model, highlighting the importance of GCP consistency.
The repeatability analysis revealed lower repeatabilities in the early plant development, which could be explained by the reduced variation in the phenotype. However, in late plant development, lodging, which does not affect all plots equally due to environmental factors, could be a further error term. In a comparison of repeatabilities between HEI CHM , HEI GT and HEI CHMred , HEI CHM generally shows the highest values, which on the one hand indicates that it is possible to determine the relative HEI at individual time points with a high degree of accuracy using UAV. Absolute HEI determination, on the other hand, is subject to various sources of error and therefore only possible with a lower degree of accuracy. Moreover, the higher repeatability of HEI CHM compared to HEI CHMred indicates that an increased sample size increased the statistical power, which can be achieved by using UAVs to determine canopy height in practice. The results of the Dunnett test show a strong overlap of detected significant lines between HEI CHM , HEI GT and HEI CHMred and most of these lines were already known from previous publications [74][75][76][77][78]110]. In addition, HEI MAX and the calculated growth parameters HEI GRi also show a high overlap of detected significant lines. This overlap demonstrates that the accuracy of plant height determination using UAV and SfM algorithms is very useful for plant breeding under certain conditions.
In order to obtain high correlations to the true crop height, the correct flight conditions (no wind, correct flight altitude and image overlap) and a variation in canopy height are required, as the correlation coefficient is highly dependent on the trait variation and is therefore species-and population-specific. In addition, the determination of absolute height has a inaccuracy due to limitation of imagery data using of SfM techniques [111], which made the growth modelling of sequential flights more difficult. Nevertheless, it could be shown that the traits derived from the growth modelling (HEI GRi , HEI GRd) ) have a high repeatability and lines with significantly different growth developmental characteristics could be detected, which is sufficient for plant breeding purposes. If only the canopy height and no spectral information are of interest, light detection and ranging (LiDAR) systems are an interesting substitution technology [112]. UAVs have only recently become capable of carrying light LiDAR systems [113], and for this reason little is known about their use. However, the first studies show the potential of this system for the generation of highly accurate CHMs for predicting biomass [114,115] or detecting lodging [116].

Vegetation Cover
A fast leaf area growth during vegetative development has a great influence on the yield potential of a plant [117] and on soil water evaporation rate [118] and is, thus, of special breeding interest under water-limited growth conditions. In addition, initial rapid growth is very competitive and displaces weeds, which is of particular interest for organic farming. In this study the vegetation cover (VCOV) is used as a proxy of the early vegetative development. This trait is highly correlated with the physiological status of the plant, such as leaf area index, plant vigor, biomass, yield and responses to temperatures [28,[119][120][121][122][123]. In order to detect small variations in growth patterns between genotypes it is necessary to determine this parameter objectively and reliably, which can be ensured by the use of sensor technology.
In this study, the results of the RGB determined VCOV seem to be more reliable, which can be partly explained by a higher image resolution higher, which is related to the lower flight altitude but also to a better spatial resolution of the RGB camera sensor. Moreover, the poorer quality of the multispectral determined VCOV is possibly due to the fact that single time points had to be removed because the multispectral camera failed at the beginning of the season in MER and IPK. The poor results in MER are due to difficulties in creating the orthomosaic, especially in the multispectral dataset. Nevertheless, a comparison of the mean VCOV values of the two methods indicates fast-growing genotypes within the S42ILs and thus shows the genetic potential of the population.
The calculated repeatability of the VCOV, VCOV smoothed and the growth parameters VCOV GRi and VCOV 90 are slightly below the repeatability of a comparable trial in maize [123]. The low repeatability in MER can again be explained by difficulties in the preparation of the orthomosaics and by heterogeneous plant growth within the experiment, which can be attributed to greater soil differences. The proportion of genetic variation in the total phenotypic variation in MER is very small or even negligible. The comparison of methods indicates that the RGB dataset enables a more reliable determination of the VCOV traits than the multispectral dataset. Due to the faster saturation and the deletion of single time points of the VCOV traits in the multispectral dataset, the repeatability in HAL and IPK at single time points is zero, while repeatability on the same day can be observed in the RGB dataset ( Figure S6).
The results of the subsequently conducted Dunnett test confirmed that the determination of VCOV traits was possible through the use of drone-mounted RGB cameras and that the accuracy of the results was sufficient for further quantitative genetic analysis. However, the results presented here were severely limited by the number of observations at the time of the trait increase, because to extract growth dynamics from imagery data, temporally high-resolution data collection is critical [20]. The weekly trait monitoring of the environments resulted in only 2-3 relevant observations that could be used for growth modelling until saturation was reached, which was not sufficient to detect the small trait variations within the population. Nevertheless, the results show the potential of this method to detect the VCOV traits, especially if the monitoring intervals were increased to every 2-3 days. A commercial RGB camera with its simpler data handling and more consistent results may be a more suitable choice if only VCOV traits are of interest.
Monitoring the temporal and spatial variations of the VCOV has many potential breeding applications [124]. The quantification of vegetation cover fraction of crops is a major issue in remote sensing with many established methods [52,[125][126][127]. However, despite the great potential for identifying early vigour genotypes and the fact that bridging phenomics and genetics is a step towards a more powerful genetic analyses, the trait is hardly used by the breeding community. In this study, the potential of the trait was shown by the outperformance of the average VCOV values of the S42IL compared to the Scarlett variety. In addition, it is also an informative trait, and was selected in our yield prediction models.

Yield Prediction
In the following, only the results of the GLM models are discussed. The coefficient of variation of the yield measured in the field was highest in HAL and lowest in IPK and also the yield prediction of approach 1 revealed the highest predictive power in HAL and lowest in IPK (Figure 5a), indicating that a larger variation of the predicted trait may enable higher r 2 values. Similar correlations were also found in other yield modelling studies [28,29,128]. The indication that in some cases a more accurate yield prediction was achieved after precipitation shows a possible positive correlation between the water status of the plant represented by the VIs and the final plot yield. If this is the case, measurements after recent rainfall, subsequent to a period of drought, would be advantageous compared to measurements during continuous dry periods. From a biological point of view, more drought stress tolerant genotypes might be able to re-hydrate faster, which is reflected in large VI differences that might also have an impact on the final yield and its prediction.
Approach 2, which consisted of a data fusion of several VIs, HEI and VCOV information, resulted in better yield prediction models for both multispectral and RGB datasets and is in accordance with the findings of multiple additional studies [94,[129][130][131]. Furthermore, the correlation between precipitation and prediction accuracy was less pronounced than in approach 1, which may indicate that multiple traits were able to break this correlation, which seems to be the case at least for spectral indices. Yue et al. [132] were able to show that using multiple spectral indices to estimate crop parameters is superior to using a single index. However, the high variability of prediction accuracy does not allow a recommendation for favorable measurement dates over the vegetative period.
The stacked data of approach 3 showed the best prediction power with the highest accuracy in HAL, which might be related to a higher variation of the studied traits. For example, HAL was the only environment where no growth regulators were applied, resulting in greater trait variation for canopy height and the target trait plot yield. In Herzig et al. [133] it was also found that the variation of the target trait influences the predictive accuracy of regression models. The gradually increasing predictive accuracy of yield over successive growth stages of plants is a scientific consensus and was observed in a number of studies [29,132,134]. Figure 7 and Table S6 show the traits selected for yield prediction and their effect size, where HEI CHM is an important information for yield prediction. Several studies demonstrated that canopy height information can significantly improve yield mod-elling [29,48,62,94,129,131,135]. In this context, the height information of all development stages are important for the yield modelling, which raises the question of the optimal phenotyping time of each trait for the best yield prediction model. Thus, not only the final height, but also the timing of canopy height information with the largest possible variation and the temporal pattern of HEI accumulation are crucial for the model performance. Since the height information could only be extracted from the RGB imagery, it was only included in the RGB dataset and not in the multispectral dataset. Thus, the additional spatial traits in the RGB dataset contributed decisively to the fact that the yield prediction accuracy between the RGB and the multispectral dataset is at a similar level.
VCOV was included in both RGB and multispectral datasets and revealed continuity of being included in the yield models indicates the information gain that VCOV traits can provide to improve yield prediction. For example, high VCOV values at an early growth stage might represent a good germination rate and a high leaf area index, which in turn affect final grain yield already at an early stage. Considering the low variance of VCOV and the low repeatability, especially in MER (see Section 3.2), these results were surprising. However, a correlation between VCOV and yield was confirmed in maize by García-Martínez et al. [28] and Geipel et al. [129] achieved the best yield prediction performance by implementing crop coverage as the second predictor variable in a study with maize.
The spectral information nevertheless represents a large part of the estimation for the yield modelling with environment-specific variations for individual Vis. For instance, the water index NDWI, first described by Penuelas et al. [86] is an important estimator in MER. The NDWI can be used to estimate the water concentration in plants and is thus strongly associated with grain yield [39,136]. This is consistent with the observation that MER showed the highest drought stress and that growth was limited by lack of water, as indicated by the results of a thermal camera (data not shown). In addition, the red edge reflectance index VOG [92], which was described to retrieve biochemical pigments (especially chlorophyll) [137], showed environment-specific variations by playing an important role predominantly in HAL and IPK.
The PSSRa and the REP index appeared to be less environment-specific as they were equally important at all sites. The PSSRa was initially described as an estimator for photosynthetic pigment concentrations [87] but was also associated with vegetative biomass [35,138]. The REP index is calculated from the region of the red-near infrared transition and has been shown to have high information content for vegetation spectra for example chlorophyll content and as an indication of vegetation stress [139][140][141].
Attempting to merge the RGB and multispectral datasets to optimize yield prediction, no significant improvement was observed, indicating that the advantages of the camera systems do not add up and that there may be redundancy in the spectral information of both camera systems. Another approach was to merge the data sets of the three sites for each camera system separately, resulting in a moderate prediction power. The loss of accuracy is associated with the robustness of the model, which is due to the merging of the environments, which could also be shown in Wiegmann et al. [142]. A robust model becomes more important when forecasting yield across environments and years, which is the medium-term goal of remote sensing in agriculture.
The repeatability calculation of the predicted yield revealed high repeatability at early stages of development, especially in IPK, which underpins the predictive accuracy of these early prediction models. The significantly lower repeatabilities in HAL and MER in comparison with IPK are indicating other sources of error in the field at these environments that may have negatively affected repeatability, such as soil in homogeneities. Based on this, the repeatability of the predicted yield in HAL and MER was expected to be at a lower level than in IPK and not to exceed the YLD GT repeatabilities from HAL and MER.
The subsequent genotype association study with the predicted yield data could only detect significant lines in IPK. Six of the nine significant lines detected in the YLD GT data correspond to Zahn et al. [80]. The overlap of detected significant lines between YLD GT and the multispectral dataset increases with higher r 2 and false positives were only detected between the beginning and the middle phase of the vegetation period, indicating that increasing prediction accuracy reduces false positive detections. In addition, false negative lines (lines detected within the YLD GT but not with the predicted yield) of the predicted yield of the multispectral and RGB datasets were a result of overestimation of the measured yield. This may be caused by a different physiological composition of S24IL_111 compared to the rest of the training population, leading to a kind of mimicry effect. For example, a high chlorophyll content in the leaf could be associated with a high yield in the training population. Under the exemplary assumption that the S24IL_111 line could have an increased chlorophyll content but cannot convert this into a higher yield due to a lack of remobilization efficiency, this would be contrary to the modelled association between yield and chlorophyll content. This type of correlation breaker would result in an overestimation of the yield of the line and could have caused the false negative detection. The results of the association study, especially those of the models at the end of the vegetation period with reduced false positive lines, can be used for plant breeding. A robust and increased predictive performance through several trial years and environments could make a more accurate association study with fewer false negatives possible.

Conclusions
Remote sensing from UAV is expected to be an important new tool to assist breeders and farmers in precision agriculture, but there has only been a slow adoption of promising UAV applications [143]. Therefore, this study used a simple practical approach to objectively and precisely phenotype traditional and new traits in a high throughput manner based on the methodological comparison of two UAVs equipped with different camera systems. The high-resolution RGB imagery could be used with the help of the SfM algorithm to extract the relative canopy height with a high precision across the entire vegetation period. The results of the genotype association study with the high overlap between significant lines of UAV estimated height and field measured height demonstrate the practical use for plant breeding. The determination of vegetation cover at the beginning of the vegetation period is only feasible by means of objective sensor technology. Therefore, RGB and multispectral imagery were used to extract the percentage of plant cover within a plot. The results of the RGB dataset showed a more precise and consistent extraction of VCOV across sequential flights, making it more favorable for this trait than the multispectral data set.
Moreover, non-invasive UAV phenotyping of many plants at many different time points improves accurate growth analysis and enables the identification of genotype association dynamics, which has been successfully executed for the traits canopy height and vegetation cover. For this kind of application, it is crucial to determine the correct time interval of phenotyping depending on the trait. The phenotyping interval of 7 days for the trait vegetation cover appeared to be too large, giving potential for improvement. Nevertheless, vegetation cover and the modelled canopy height were important for yield prediction, which could be quantified by their relative contribution to the predicted yield estimates. The fusion of spectral vegetation indices and spatial traits like canopy height and vegetation cover provides complementary information for the estimation of yield, which is confirmed in numerous studies, at least for canopy height [129,131].
In particular, the yield models based on the RGB imagery could be improved by the spatial traits, resulting in the prediction performance of both camera systems being on the same level. If the main focus is on yield prediction, the RGB system used in this study would be preferable to the multispectral system due to the lower costs and consumer-friendly handling in image acquisition and image processing. Multispectral monitoring seems to be more important for answering research questions such as the interaction between genotype, environment and management (G × E × M), as it enables better resolution of the relationship between plant physiological status and sensor data, which can be used to improve understanding of yield gaps. However, the results of the multispectral camera refer only to the sensors, filters and simplified methods selected here. For example, an additional filter in the green spectral range, a pixel-based and not plot-based evaluation or an additional atmospheric correction by spectral references in the field could greatly improve the results. In addition, it is also feasible to extract spatial traits such as canopy height and vegetation cover from multispectral data, which would also improve yield prediction. Unfortunately, the results of this approach in this study were too imprecise to differentiate the small differences in genotypes.
Modelling based on single time points resulted in large differences in predictive performance across the vegetation period, whereas the accumulation of time points showed a steady improvement of the predictive models. In contrast, the fusion of RGB and multispectral datasets did not improve yield prediction, indicating redundancy in the datasets. Merging the datasets across environments resulted in a general predictive model with average precision for the three environments. In order to exploit the full potential of UAV applications and to strengthen acceptance in the user community, transferability of the models to different years, populations, and environments is required. To achieve this maturity of UAV remote sensing, a broad and standardized data collection and modelling procedure is necessary to establish general and accurate yield prediction models.
Supplementary Materials: The following are available online at https://www.mdpi.com/article/ 10.3390/rs13142670/s1. Figure S1: HEI CHM determined by RGB imagery and HEI smoothed. Figure S2. VCOV determined by RGB and multispectral imagery and VCOV smoothed . Figure S3: Boxplots of canopy height measured in the field (HEI GT ) and determined by UAV RGB imagery by the 95% quantile of a plot (HEI CHMred ). Figure S4: Repeatabilities of canopy height traits determined in the field and by UAV RGB imagery. Figure S5: All significant S42ILs detected for all traits in regard to canopy height. Figure S6: Repeatabilities and all significant S42ILs detected in regard to VCOV traits. Figure S7: Different modelling approaches of GLM and RF. Figure S8: Repeatability of the predicted yield. Figure S9: Results of the post hoc Dunnett test of the predicted yield at the IPK environment. Table S1. Weather conditions of all environments. Table S2. Phenotypic data of all trait for each S42 line of all environments. Table S3. Data acquisition of all trait for each S42 line of all environments. Table S4. Camera-, sensor, and filter specification. Table S5. Agisoft Metashape settings. Table S6. Estimates of the GLM yield prediction model based on the stacked datasets. Table S7. Results of the genotype association study based on YLD GT and predicted YLD of the stacked dataset approach with the GLM model. Table S8. Predicted yield of the stacked dataset approach with the GLM model.
Author Contributions: P.H. conducted the field trials in 2019 in HAL, MER, and IPK, gathered and analyzed the phenotypic data in the field, conducted the UAV flights and drafted the manuscript. P.B. gathered the phenotypic data in the field, processed UAV data, created prediction models and figures and drafted the manuscript. H.-C.K., D.K. and U.K. provided UAV technical support and support in automated Agisoft processing and UAV data processing and drafted the manuscript. K.P. and U.S. planned the project, acquired public funding and institutional co-funding, coordinated the collaboration between the project partners and edited the manuscript. A.M. supported the analysis of the phenotypic and genotypic data and drafted the manuscript. All authors have read and agreed to the published version of the manuscript.  Table S2. UAV imaging data can be made available on request from the corresponding author.