Deep Learning Applied to Phenotyping of Biomass in Forages with UAV-Based RGB Imagery

Monitoring the biomass of forages in experimental plots and livestock farms is a time-consuming, expensive, and biased task. Thus, non-destructive, accurate, precise, and quick phenotyping strategies for biomass yield are needed. To promote high-throughput phenotyping in forages, we propose and evaluate the use of deep learning-based methods and UAV (Unmanned Aerial Vehicle)-based RGB images to estimate the biomass yield of different genotypes of the forage grass species Panicum maximum Jacq. Experiments were conducted in the Brazilian Cerrado with 110 genotypes and three replications, totaling 330 plots. Two regression models based on Convolutional Neural Networks (CNNs), AlexNet and ResNet18, were evaluated and compared to VGGNet, adopted in previous work on the same theme for other grass species. The predictions returned by the models reached a correlation of 0.88 and a mean absolute percentage error of 12.98% using AlexNet with pre-training and data augmentation. This proposal may contribute to forage biomass estimation in breeding populations and livestock areas, as well as reduce labor in the field.


Introduction
Monitoring crop parameters like nutrient content, biomass, and plant height is essential for yield prediction and management optimization [1]. In situ measurement of these parameters can be a time-consuming, expensive, and biased task. To assist plant breeding programs [2] as well as precision agriculture practices [3], remote sensing technologies have been used in multiple approaches [4][5][6], and, lately, this has expanded with the adoption of UAV (Unmanned Aerial Vehicle)-based data. In recent years, UAV-based images have increasingly been combined with robust and intelligent data processing methods. Previous research related to our proposal used conventional machine learning methods, which require handcrafted features [29]. Although some of these methods returned good accuracies, more robust models are still needed for this task. Based on a review analysis, Kamilaris and Prenafeta-Boldú [29] verified that deep learning methods outperformed traditional machine learning in several agriculture applications. Even so, the estimation of biomass using deep learning is still scarce in the literature. In previous work, Ma et al. [30] assessed a deep learning architecture based on VGGNet to predict the above-ground biomass of winter wheat. The authors used an RGB camera on a terrestrial tripod, which somewhat limits the application to larger areas. Nevertheless, this experiment demonstrated the potential of CNNs to accomplish this task and stated that the proposal of novel methods on this theme is still necessary.
To the best of our knowledge, no literature has focused on investigating deep learning-based biomass estimation methods using UAV-RGB orthoimages in tropical forages. Approaches using RGB orthoimages are an interesting practice, since these images have a higher spatial resolution than images from other types of sensors and do not rely on 3D information of the canopy, reducing the amount of data necessary to perform the task. The contribution of this study is to propose a deep learning approach to estimate biomass in forage breeding programs and pasture fields using only UAV-RGB imagery and the AlexNet and ResNet deep learning architectures. We also compared the results with VGGNet, used in previous work [30] on biomass estimation. The rest of the paper is organized as follows. Section 2 presents the materials and methods implemented in this study. Section 3 presents and discusses the results obtained in the experimental analysis. Finally, Section 4 summarizes the main conclusions of our approach.

Study Area and Dataset
The dataset was formed by images obtained with a DJI Phantom 4 Pro UAV equipped with an RGB digital camera with an image resolution of 5472 × 3648 pixels. The experimental area is located in the Brazilian Cerrado at the Experimental Station of Embrapa Beef Cattle, Campo Grande, Mato Grosso do Sul, Brazil (Figure 1; latitude 20°26′46″ S, longitude 54°43′16″ W, altitude 535 m). The flight was carried out on 23 January 2019 at around 9 a.m. with a relative height of 18 m, resolving 0.5 cm/pixel. The photos were taken with a frontal overlap of 81% and a lateral overlap of 61%. The orthoimage (Figure 2b) was generated using the Pix4D software based on the SfM (structure-from-motion) and MVS (multi-view stereo) techniques.
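As a sanity check on the reported resolution, the ground sampling distance implied by these flight parameters can be computed with the standard photogrammetric formula. The sensor width (13.2 mm) and focal length (8.8 mm) below are the published specifications of the Phantom 4 Pro camera, assumed here rather than taken from the text:

```python
# Ground sampling distance (GSD) implied by the flight parameters above.
# Sensor width and focal length are assumed DJI Phantom 4 Pro specs.

def gsd_m_per_px(flight_height_m, sensor_width_m=13.2e-3,
                 focal_length_m=8.8e-3, image_width_px=5472):
    """GSD = (sensor width x flight height) / (focal length x image width)."""
    return (sensor_width_m * flight_height_m) / (focal_length_m * image_width_px)

print(round(gsd_m_per_px(18.0) * 100, 2))  # ~0.49 cm/pixel
```

A flight height of 18 m indeed yields roughly 0.5 cm/pixel, matching the figure reported above.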
The experiment comprised 110 genotypes representing the high genetic diversity of the species Panicum maximum (syn. Megathyrsus maximus), an important tropical forage grass for livestock production [31]. These genotypes are grouped into 86 full-sib progenies, ten sexual and ten apomictic progenitors, along with four commercial cultivars (Mombaça, MG12 Paredão, BRS Quenia, and BRS Tamani). All the genetic material, with the exception of MG12 Paredão, was developed by the P. maximum Breeding Program of Embrapa. A randomized complete block design was used with three replications, totaling 330 plots. Each plot consisted of two rows of 2.0 m spaced 0.5 m apart. Each row contained five plants spaced 0.5 m apart, totaling ten plants per plot. Plots were 1.0 m apart, representing an area of 4.5 m². We evaluated the total green matter yield trait as ground truth data. Each plot was harvested 0.2 m from the soil on 25 January 2019, and the green material was weighed in kg·plot⁻¹ using a field dynamometer and converted to kg·ha⁻¹. For simplicity, this trait will hereafter be called biomass yield. Figure 2 shows the plot definition procedure. We developed a Python script tool (https://github.com/wvmcastro/tiffviewer) named field plot cropper (FPLOTCROPPER). The inputs of the tool are: the orthomosaic, the number of blocks in the image, the number of lines and columns within each block, and the user-defined rectangular polygons (Figure 2b). The result can be seen in Figure 2c. Dealing with orthoimages can be a challenging task from a computational point of view, as images of this type can exceed gigabytes in size. In this regard, specific software to subdivide these images is necessary. For this, FPLOTCROPPER uses the Matplotlib [32] package for viewing the orthoimages and capturing the mouse events the user produces when defining the four corners of each block.
FPLOTCROPPER also uses Rasterio [33] for reading and writing the images, along with the Python 3 programming language [34]. The presented proposal uses biomass yield as the target attribute y. Figure 3 plots the histogram of the y data distribution of the 330 plots of the experimental station. After this pre-processing step, in which the plots were correctly cropped and identified, we proceeded with the experimental evaluation described in the next sections.
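The geometric core of such a plot-cropping tool can be sketched as follows. This is a hypothetical illustration, not the actual FPLOTCROPPER code: given the four user-clicked corners of a block, each plot rectangle is obtained by bilinear interpolation over the block's row/column grid (the real tool additionally handles raster I/O with Rasterio and mouse events with Matplotlib).

```python
import numpy as np

def plot_corners(block_corners, n_rows, n_cols, row, col):
    """Corners of plot (row, col) inside a block whose corners are given in
    order (top-left, top-right, bottom-right, bottom-left), obtained by
    bilinear interpolation of the block corners."""
    tl, tr, br, bl = (np.asarray(c, float) for c in block_corners)

    def point(u, v):  # u: fraction along columns, v: fraction along rows
        top = tl + u * (tr - tl)
        bottom = bl + u * (br - bl)
        return top + v * (bottom - top)

    u0, u1 = col / n_cols, (col + 1) / n_cols
    v0, v1 = row / n_rows, (row + 1) / n_rows
    return [point(u0, v0), point(u1, v0), point(u1, v1), point(u0, v1)]
```

Interpolating instead of assuming an axis-aligned grid keeps the sketch robust to blocks that are slightly rotated or skewed in the orthomosaic.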

Deep Learning Approach and Experimental Setup
Conventional machine learning techniques require considerable domain expertise and careful engineering to extract meaningful features to train the models. According to LeCun et al. [35], deep learning methods can learn convolutional filters that obtain those meaningful features from the images. The advantage is that the learning process incorporates the learning of convolutional filters, replacing the feature extractors designed by human engineers. This approach requires minimal engineering by hand and has achieved state-of-the-art results in many areas of machine intelligence [35]. Therefore, the filters that emphasize and extract essential features are now inside the model, allowing the model to process images in their raw format.
However, training CNNs requires large datasets, such as ImageNet, with 14 million images. In the forage literature, to the best of our knowledge, there is no such dataset. In green biomass estimation, the sample size is relatively small by deep learning standards. This limitation relates to obtaining the ground truth of the field biomass, which requires expensive labor to harvest and weigh each plot's biomass.
In our study, we have only 330 plots. Training CNNs with a small sample size is a challenging task. The literature points to fine-tuning approaches using a pre-trained model [36], learning smaller models [37], and performing data augmentation [38]. We address the learning problem by trying these three approaches. We selected AlexNet [39] (8 layers), a popular and relatively small convolutional neural network, and ResNet [40] (18 layers), both with and without a pre-trained model, and with and without data augmentation. We designed the experimental evaluation not only to show the individual accuracy of each model but also to measure how effective each of these three approaches is. As a baseline, we used VGGNet [41] (11 layers), because it was used for the same purpose for other species [30].
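The third strategy, flip-based data augmentation, can be sketched in a few lines of NumPy. This is a minimal illustration of the idea, assuming images are stored as arrays, not the authors' training pipeline:

```python
import numpy as np

def augment(images):
    """Flip-based augmentation: left-right flips double the training set,
    and adding top-bottom flips triples it."""
    h = [np.fliplr(img) for img in images]   # left-right (horizontal) flips
    v = [np.flipud(img) for img in images]   # top-bottom (vertical) flips
    original = list(images)
    augmented_h = original + h               # 2x the original size
    augmented_hv = original + h + v          # 3x the original size
    return augmented_h, augmented_hv
```

Flips are label-preserving for this task: a mirrored plot image still corresponds to the same harvested biomass value.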
We used the AlexNet, ResNet18, and VGGNet implementations from PyTorch [42]. The last fully connected layer of each architecture was changed so that the models were adapted to a regression problem. The settings for each experiment can be seen in Table 1. In all experiments, the Adam [43] gradient-based optimization method was used with a fixed learning rate of 0.001 and constants β₁ = 0.9, β₂ = 0.999, and ε = 10⁻⁸. The number of epochs was defined empirically using early stopping evaluated every 100 epochs. The pre-trained models are the PyTorch models pre-trained on ImageNet (see https://pytorch.org/docs/stable/torchvision/models.html); we loaded these pre-trained weights and fine-tuned the models using the training set. Models without pre-training were fully trained using the training set.
Considering the relatively restricted number of 330 plots (examples), we performed all experiments using ten-fold cross-validation, since cross-validation produces a better estimate of the generalization error than hold-out [44,45]. To pursue more robust models, we also trained them using data augmentation. We named augmented horizontally (augmented h) the regular data augmentation, in which the images were flipped from left to right, and augmented horizontally and vertically (augmented hv) the data augmentation in which the images were also flipped from top to bottom. The models without data augmentation are named original in Table 1. All models were trained adopting the MSE (mean square error) as the loss function, evaluated between the actual biomass value in kg·ha⁻¹ (yᵢ) and the value predicted by the model (ŷᵢ), as in Equation (1):

MSE = (1/n) ∑ᵢ₌₁ⁿ (yᵢ − ŷᵢ)²,    (1)

where n is the number of examples, yᵢ is the true score, ŷᵢ is the predicted score, and yᵢ ∈ [1556, 15,333], i.e., yᵢ varied from 1556 kg·ha⁻¹ to 15,333 kg·ha⁻¹. We used a desktop with an NVIDIA Titan X GPU (12 GB), an Intel i7-6800K 3.4 GHz CPU, and 64 GB of RAM. We assess the regression problem using MAE (mean absolute error), MAPE (mean absolute percentage error), and R (Pearson correlation). However, these metrics can hide predictions biased towards values higher or lower than the true value. The same can occur with the graphs in Section 3, where high-density regions can overlap points. This limitation motivates the evaluation of the results in RROC (Regression Receiver Operating Characteristic) space [46]. We also evaluate the results using histograms (Section 3.3). The number of bins was determined using the elbow rule [47] from the partitioning of the samples' real values.
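The loss and evaluation metrics above can be written compactly; a minimal NumPy sketch (not the authors' implementation):

```python
import numpy as np

def mse(y, y_hat):        # training loss, Equation (1)
    return np.mean((y - y_hat) ** 2)

def mae(y, y_hat):        # mean absolute error, in kg/ha
    return np.mean(np.abs(y - y_hat))

def mape(y, y_hat):       # mean absolute percentage error, in %
    return np.mean(np.abs((y - y_hat) / y)) * 100

def pearson_r(y, y_hat):  # Pearson correlation coefficient R
    return np.corrcoef(y, y_hat)[0, 1]
```

Note that MAE and MAPE measure error magnitude only, while R measures linear association; none of them distinguishes systematic over- from under-prediction, which is what motivates the RROC analysis.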
For visual inspection of the model activations, we note that the last convolutional layers of a CNN retain spatial information and high-level semantics [48]. Looking at these layers, it is possible to highlight the class-discriminative regions of the images. One of the first methods to emphasize the discriminative areas of the images was CAM (class activation mapping) [48]. However, CAM works only on CNN architectures without fully connected layers. One year later, Grad-CAM [49] was proposed, enabling the use of fully connected layers without architectural changes or re-training. We applied Grad-CAM and display some results in Section 3.4.
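Grad-CAM's core computation, pooling the gradients of the prediction with respect to the last convolutional activations and using them to weight those activations, can be sketched on a toy network (a hypothetical stand-in, not the AlexNet used in the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyNet(nn.Module):
    """Toy CNN standing in for AlexNet; returns the prediction together
    with the last convolutional activations that Grad-CAM needs."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, 3, padding=1)
        self.fc = nn.Linear(4, 1)

    def forward(self, x):
        acts = F.relu(self.conv(x))                 # last conv activations
        return self.fc(acts.mean(dim=(2, 3))), acts

def grad_cam(model, x):
    """Weight each activation map by its spatially averaged gradient and
    keep the positive part, yielding one heatmap per input image."""
    y, acts = model(x)
    acts.retain_grad()                 # keep gradients at the activations
    y.sum().backward()
    weights = acts.grad.mean(dim=(2, 3), keepdim=True)  # pooled gradients
    return F.relu((weights * acts).sum(dim=1))          # (N, H, W) heatmaps

cam = grad_cam(TinyNet(), torch.randn(2, 3, 16, 16))
print(cam.shape)  # torch.Size([2, 16, 16])
```

In practice the heatmap is upsampled to the input resolution and overlaid on the RGB plot image, giving the warm/cold visualizations shown in Section 3.4.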

Experimental Results Evaluation
We divided our evaluation into four parts: (1) evaluation of the results using the standard metrics MAE, MAPE, and R, and the graphs of predicted versus real values; (2) a visual representation of the error in RROC space; (3) the histograms of the predictions; and (4) the heat maps of the feature map activations for visual inspection. Table 2 shows the mean and standard deviation of the mean absolute error, mean absolute percentage error, and Pearson correlation over the ten-fold cross-validation. The results indicate four groups of outcomes. The first and best group comprises the pre-trained AlexNet runs, Experiments #7, #8, and #9; these are the top three results, with an average MAE lower than 768.75 kg·ha⁻¹. The second best group is also from AlexNet, but without pre-training: Experiments #1, #2, and #3, with an average MAE lower than 924.48 kg·ha⁻¹. The third and fourth groups are the ResNet18 results, where the MAE is higher than 1000 kg·ha⁻¹. Overall, and surprisingly, the AlexNet results are better than those of ResNet18. The VGGNet baseline was considered only with pre-training and data augmentation, because we verified that these contributed to the improvement of the other models. AlexNet also outperformed VGGNet11, which presented an average MAE of 825.94 kg·ha⁻¹. Experiment #9 presented the best absolute result. We then proceeded with a one-way ANOVA test. We obtained an F-statistic of 9.81 with a p-value of 5.16 × 10⁻¹³ < 0.05, so we can reject the null hypothesis that the models have equal performance in terms of MAE. We continued the evaluation with Tukey's post-hoc test to find the differences among the experiments. Figure 4 shows Tukey's HSD. The results that differ significantly from AlexNet_ptrain_hv are shown in red; the non-significant differences are gray. The results indicate that AlexNet_ptrain_hv differs significantly from all ResNet18 results.
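The one-way ANOVA step can be reproduced with SciPy; the per-fold MAE values below are hypothetical placeholders, since only the summary statistics appear in Table 2:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)
# Hypothetical per-fold MAE values (kg/ha) for three of the thirteen
# experiments; the real values come from the ten-fold cross-validation.
alexnet_ptrain_hv = rng.normal(750, 40, 10)
alexnet_original = rng.normal(900, 40, 10)
resnet18_original = rng.normal(1100, 60, 10)

f_stat, p_value = f_oneway(alexnet_ptrain_hv, alexnet_original,
                           resnet18_original)
# p < 0.05 rejects the null hypothesis of equal mean MAE; a post-hoc test
# such as Tukey's HSD then locates the pairwise differences, as in Figure 4.
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")
```

With the per-fold results of all thirteen experiments as groups, the same call yields the F-statistic and p-value reported above.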
Pre-training showed a slight improvement in the AlexNet results but not in ResNet18: except for ResNet18_ptrain_hv, pre-training deteriorated the ResNet18 results. For data augmentation hv, we can see a small improvement in the pre-trained models of AlexNet and ResNet18. For AlexNet without pre-training, the use of data augmentation worsened the results.

Standard Evaluation: MAE, MAPE, R, and Graph of Predicted Versus Real
The means and standard deviations of MAE, MAPE, and R assume Gaussian distributions. Plotting the real values and the predictions in the same graph can give a more precise visualization of the results, such as point cloud density and outliers. The graphs presented in Figure 5 show exactly these visualizations, where each point is a pair (y, ŷ). With perfect predictions, these graphs would be perfect lines (1:1). In all graphs, the models show more spread results at the top right and a high concentration of points in the center of the diagram, which corroborates the distribution of the real class y (Figure 3), where biomass values higher than 11,000 kg·ha⁻¹ are scarce and the average value of y is close to 6000 kg·ha⁻¹.
When comparing the graphs of AlexNet and ResNet18, we can see that all ResNet18 results (Experiments #4, #5, #6, #10, #11, and #12) show a point cloud lying more to the bottom right, indicating that ResNet18 has more difficulty than AlexNet in predicting values higher than 10,000 kg·ha⁻¹. The pre-trained AlexNet (Experiments #7, #8, and #9) shows a narrow corridor of points close to the ascending diagonal (dotted line). The narrowest point cloud closest to the dotted line is that of Experiment #9, which confirms the absolute numbers for MAE, MAPE, and R. Among the results with pre-training (Figure 6), AlexNet original, AlexNet augmented h, and AlexNet augmented hv (#7, #8, and #9) are closest to (0, 0). ResNet18 pre-trained augmented h (Experiment #11) is above the dotted line, indicating better results for under-prediction than over-prediction. In practice, this is equivalent to an average prediction ŷ slightly higher than the true value y for Experiment #11. This interpretation may seem counter-intuitive when we look at Figure 5k, due to the points on the middle right of the graph; however, looking closely, we can see high-density points in the middle left, which corroborates the RROC result. The opposite occurs with ResNet18 original (Experiment #4) and VGGNet11 (Experiment #13), where the results are below the dotted line.

ROC Regression
When comparing AlexNet (light blue) and ResNet18 (red), we can see that the AlexNet points are closer to (0, 0). All ResNet18 points are further away from the AlexNet points and spread over the graph, forming a barrier of points close to the ascending diagonal.
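A point in RROC space summarizes one model by its total over-estimation and total under-estimation [46]; a minimal sketch of that computation, assuming the standard RROC definition:

```python
import numpy as np

def rroc_point(y, y_hat):
    """RROC space coordinates for one model: total over-estimation
    (x >= 0) and total under-estimation (y <= 0). A perfect model sits
    at (0, 0); points above the diagonal under-estimate overall."""
    e = np.asarray(y_hat, float) - np.asarray(y, float)
    over = e[e > 0].sum()    # sum of positive errors
    under = e[e < 0].sum()   # sum of negative errors (non-positive)
    return over, under
```

Because the two coordinates separate the signs of the errors, two models with identical MAE can land at very different RROC points, which is exactly the bias that MAE, MAPE, and R can hide.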

Histograms
The histogram graphs of Figure 7 show the intersection between the distribution of the real data and the distribution of the predictions of each experiment. Adding new groups did not significantly increase the representativeness of the data beyond 20 bins, so the number of bins was set to 20. The intersection areas between the distributions were calculated for each experiment and are presented in Table 3. From Table 3 and Figure 7, it is possible to conclude the superiority of the results of Experiment #9, where the training set was at least twice as large as in any other experiment due to the data augmentation.
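The intersection area in Table 3 can be computed as the overlap of the two normalized histograms over a shared set of bins; a minimal sketch, assuming this standard definition rather than the authors' exact procedure:

```python
import numpy as np

def histogram_intersection(real, predicted, bins=20):
    """Intersection area of two normalized histograms over shared bins:
    1.0 means identical distributions, 0.0 means fully disjoint ones."""
    lo = min(np.min(real), np.min(predicted))
    hi = max(np.max(real), np.max(predicted))
    edges = np.linspace(lo, hi, bins + 1)      # shared bin edges
    h_real, _ = np.histogram(real, bins=edges)
    h_pred, _ = np.histogram(predicted, bins=edges)
    h_real = h_real / h_real.sum()             # normalize to area 1
    h_pred = h_pred / h_pred.sum()
    return np.minimum(h_real, h_pred).sum()    # overlap area
```

Sharing one set of bin edges between the two samples matters: computing each histogram over its own range would make the intersection meaningless.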
From the plot of the validation loss versus the number of epochs during training, presented in Figure 8, it can be observed that Experiments #1, #2, #3, #7, #8, #9, and #13 converged the fastest, reaching a plateau at around 100 epochs. Experiments #5 and #11 converged at approximately 230 epochs. Experiments #6 and #12 converged after about 300 epochs. Finally, Experiments #4 and #10 took the longest to reach the plateau, at approximately 500 epochs. From this analysis, it is possible to conclude that the AlexNet and VGGNet11 models consistently presented better results earlier when compared to the experiments using the ResNet18 model. Figures 9 and 10 show the heat maps from Grad-CAM for Experiment #9. Warm colors (red) indicate a more class-discriminative region for the prediction, while cold colors (blue) represent less class-discriminative regions. Figure 9 shows the three best predictions, where no strong red or blue regions are highlighted. Figure 10 shows the three worst predictions, where extreme values in the neurons, strong reds and blues, occur. We believe that these extreme values create a broader range of values that makes the regression problem more difficult to solve, worsening the MAE.

Training and Test Time
Finally, we also evaluated the training and test time of one fold of the ten-fold cross-validation procedure. The test set has a fixed size of 33 examples, and the training set has 300, 600, and 900 examples for original, augmented h, and augmented hv, respectively. Table 4 shows the results. The training time must be analyzed together with the respective number of training steps. It is interesting to see that, although h and hv have two and three times more examples, the training time of ResNet18 was not multiplied by factors of two and three, while for AlexNet these factors are consistent. The test time stays between 0.39 and 0.67 for AlexNet and ResNet18. The test time of VGGNet was the highest among the tested models.

Discussion
Our study focused on a deep learning-based approach to estimate biomass yield in forage fields. Furthermore, we investigated the impact of the data augmentation and pre-training steps on the estimation results. CNNs are state-of-the-art methods for evaluating imagery data. We also considered different genotypes of a forage species, which contributes to the heterogeneity of our dataset. To assess the experiments, we presented regression analysis results with a high correlation between predicted and measured yields (Table 2 and Figure 5), an RROC space comparing the deep networks implemented (Figure 6), an intersection analysis for each experiment (Table 3 and Figure 7), and qualitative information such as heat maps illustrating the best and worst predictions of our data (Figures 9 and 10).
When compared against other methods to estimate biomass, our approach differs from a methodological point of view. Up to this point, most approaches have considered shallow learners (i.e., conventional machine learning methods) and RGB imagery combined with DSM and DTM models, LiDAR, or even vegetation spectral indices from multi- and hyperspectral data [1,[24][25][26][27][28],50]. An advantage of adopting deep neural networks is the simplification of the variety of remote sensing data necessary to conduct this estimation. Although the production of input data such as DTMs, DSMs, spectral indices, and multiple bands is a relatively easy task in remote sensing, the costs to obtain such products, specifically LiDAR and hyperspectral data, are high when compared against RGB-only imagery. Our deep networks, trained with RGB inputs, can perform similarly to or even better than most traditional methods and data previously described. This is an important advance for agricultural remote sensing approaches. The caveat of our proposal and other CNN-based approaches is the requirement of computationally intensive procedures, often involving the use of GPUs.
Compared to a previous study that used 3D information to estimate the biomass of the same species [26], we achieved more accurate results. Batistoti et al. [26] achieved an R² of 0.74 considering a limited number of plots (four in total). Another study [51] estimated yield with proximal sensing equipment in a heterogeneous sward structure of grasslands and applied an MLPSR (Multiple Partial Least Square Regression) approach, which returned an R² of 0.69. In estimating biomass from legume-grass swards, Wachendorf et al. [52] produced high accuracies with an approach similar to that of Moeckel et al. [51], returning accuracies of up to 0.95 for specific models. Another study [53] considered machine learning techniques to estimate grassland biomass from spectral data, with an RMSE of 71.2 t·ha⁻¹. This indicates that our UAV-based RGB dataset, in conjunction with more robust methods (i.e., deep neural networks), can compete with expensive remote measurement systems, including terrestrial measurements.
In this study, we implemented two convolutional networks (AlexNet and ResNet18) and tested another network (VGGNet11) as a baseline, as stated in the previous sections. To evaluate the impact of sample size on our approach, we tested whether pre-trained models and data augmentation produced significant improvements in accuracy. Our results indicated that the AlexNet model performed better. A possible explanation is that ResNet18, although a deeper network than AlexNet, was unable to properly adapt its pre-trained convolutional filters to the problem; in other words, it was not able to modify its layers with enough precision. The results without pre-training were also not sufficient, demonstrating how the lack of training data impacted its performance. Nevertheless, the evaluation of different pre-processing steps (with and without data augmentation and pre-training) resulted in essential implications for integrating agronomic measurements collected in the field with these robust methods applied to remote sensing RGB imagery.
The VGGNet11 performance was calculated to be compared against the accuracy obtained in a previous paper [30], where more traditional approaches, based on spectral vegetation indices and 3D models as standalone data to estimate biomass yield, were also compared against this deep learning method. Even so, it was demonstrated that VGGNet11 outperformed these traditional approaches. Here, our proposal focused mostly on evaluations with the AlexNet and ResNet18 networks throughout the experiment. We noticed that, in both methods, data augmentation improved the overall performance in estimating biomass yield. As a result, we also implemented data augmentation in the VGGNet11 network. Nonetheless, our analysis (Figure 6) demonstrated that the AlexNet model was superior to the deep learning method implemented in [30], even with data augmentation. This may be an indicator that this type of approach with RGB imagery performs better with shallower architectures.
It is possible to observe that the models used were able to return a high correlation between the RGB images of the plots and the real yield value (Table 2 and Figure 5). This study is the first of its kind, and we believe there are many aspects to be evaluated in future research. Although we stated the importance of RGB data from an economic point of view, we do not disregard the potential of other types of remote sensing data to increase the accuracy of deep learning-based neural networks. The use of 3D reconstruction data from point clouds can also be explored in the future, since biomass has a strong relationship with the volume of the plant. This data insertion could assess whether there are performance gains in the prediction and could help infer the density of the plant, a significant characteristic for researchers in the area. Regardless, the results using only RGB imagery and the cross-validation method indicate the capability of the proposed approach.

Conclusions
To date, this is the first research that implements and evaluates a CNN-based architecture, combined with high-resolution UAV RGB images, for the prediction of biomass yield considering different forage genotypes.
Two regression models based on CNNs (Convolutional Neural Networks), AlexNet and ResNet18, were evaluated and compared to VGGNet, adopted in previous work on the same theme for other grass species. The predictions returned by the models reached a correlation of 0.88 and a mean absolute percentage error of 12.98% using AlexNet with pre-training and data augmentation. Compared to a previous study that used 3D information to estimate the biomass of the same species [26], we achieved more accurate results.
In conclusion, the models used were able to establish a high correlation between the images and the biomass values measured in the field. This demonstrates the feasibility of the proposed approach to predict forage yield from highly detailed RGB imagery, producing accuracy comparable to more expensive approaches based on both aerial and proximal remote sensing.
Since this is the first study of its kind, there are many aspects to be evaluated in future research. It is worth noting that the models developed here are not yet ready for deployment in commercial production. Although cross-validation was used in all experiments, the dataset is still small. However, the results obtained are strong indications of the method's potential on larger and more varied forage crop datasets. Experiments using datasets from different locations and weather conditions are essential to provide more generalized models, and we intend to conduct them in future work.