Deep Learning Regression Approaches Applied to Estimate Tillering in Tropical Forages Using Mobile Phone Images

We assessed the performance of Convolutional Neural Network (CNN)-based approaches using mobile phone images to estimate regrowth density in tropical forages. We generated a dataset composed of 1124 labeled images taken with two mobile phones 7 days after the harvest of the forage plants. Six architectures were evaluated, including AlexNet, ResNet (18, 34, and 50 layers), ResNeXt101, and DarkNet. The best regression model showed a mean absolute error of 7.70 and a correlation of 0.89. Our findings suggest that deep learning applied to mobile phone images can successfully estimate regrowth density in forages.


Introduction
Pasture areas cover 21% of the territory (170 million hectares) in Brazil; however, a large part of these pastures is degraded [1], leading to lower livestock productivity. The current average Brazilian productivity (73.5 kg CWE·ha⁻¹·yr⁻¹) is lower than the potential productivity of 294 kg CWE·ha⁻¹·yr⁻¹ [2]. This production gap represents a great challenge for livestock-producing countries. On the one hand, the growing world population leads to increased demand for protein. On the other hand, policies to combat climate change require more conservation of natural environments, thus demanding less area for animal protein production. In this scenario, increasing the productivity of areas already used for animal protein production is essential to meet the growing demand and to comply with policies for reducing greenhouse gas emissions, without increasing pasture area. To achieve this goal, the development of more productive cultivars through efficient forage breeding methodologies can help reduce the productivity gap [3].
Tillers are small units of forage grass plants responsible for pasture production. After defoliation of the pasture (e.g., grazing by animals), the regrowth of tillers is crucial to maintain pasture stability and productivity [4,5]. The tillers that effectively contribute to productivity are those that regrow up to eight days after mechanical defoliation or grazing [6]. Thus, one way to measure productivity is to estimate regrowth seven days after defoliation [7]. However, in situ measurement of this trait can be time-consuming, labor-intensive, and subjective. Thus, the development of low-cost technologies for automated plant phenotyping could help scientists and professionals in forage breeding programs. Machine and deep learning combined with mobile devices, such as smartphones, are powerful and low-cost tools for this purpose. Such tools could reduce the labor and time required and increase the accuracy of the phenotyping process in forage breeding programs, leveraging the efficiency of these programs and contributing to the release of improved cultivars that reduce the productivity gap.
Many machine learning methods, such as Support Vector Machines (SVM) and K-nearest neighbors (KNN), have been employed and have shown outstanding results, indicating their potential role in the future of High-Throughput Phenotyping (HTP) [8,9]. Deep learning is a subset of machine learning known as a versatile tool capable of automatically extracting features and assimilating complex data using deep neural networks. Convolutional Neural Networks (CNNs) have made remarkable achievements in computer-vision-related tasks [10]. CNN-based approaches have been widely applied to plant phenotyping because of their ability to create robust models that can be embedded in remote sensors [11,12]. The literature often neglects simpler and faster digital image processing approaches. However, for the problem tackled in this study, several research papers have already compared digital image processing and deep learning on grass-like plants, especially between 2018 and 2019, and in most cases deep learning showed better performance [13][14][15][16].
Regarding tiller estimation, Zhifeng et al. [17] showed that Magnetic Resonance Imaging (MRI) could be used to measure rice tillers, as could a conventional X-ray computed tomography system; yet an image processing procedure is still necessary. Fang et al. [18] proposed an automatic wheat tiller counting method under field conditions with terrestrial Light Detection and Ranging (LiDAR) using adaptive layering and hierarchical clustering. Boyle et al. [19] conducted experiments using RGB images of wheat taken on different days and at three different angles and used a computer vision algorithm based on the Frangi filter. Deng et al. [20] trained a Faster R-CNN on three different backbones (ZFNet, VGGNet16, and VGG-CNN-M-1024) and evaluated productive rice tiller detection using mobile images, achieving good accuracy compared to manual counting. Kristsis et al. [21] presented a plant identification dataset with 125 classes of vascular plants in Greece, including leaf, flower, fruit, and stem images of tree, herb, and fern-like forms. They focused their proposal on finding deep learning architectures to deploy on mobile devices. This problem has a different goal from our study: we are not concerned with finding a lightweight architecture. Our proposal aims to help HTP find the best genetic material using mobile images, where computational cost is significant but not a critical factor for our application purposes. In addition, they reported their results on validation sets rather than on a test set [22]. Another interesting result on grass-like image input can be found in Fujiwara et al. [23]. The authors used a CNN to estimate legume coverage from Unmanned Aerial Vehicle (UAV) imagery. Their study samples image patches and estimates the coverage of timothy, white clover, and background using a fine-tuned model for each patch. They evaluated only GoogLeNet [24].
Although the literature on deep learning for grass-like plants is rich, to the best of our knowledge no studies have investigated deep-learning-based methods to estimate the regrowth density of tillers in tropical forages using mobile phone images. Mobile phones are more accessible to most researchers than the sources used in previous works (e.g., MRI and LiDAR). Furthermore, while other studies count the number of tillers [17][18][19][20], we use a score between 10 and 100 representing the percentage of regrown tillers to select the top-k best genetic material.
The selection of the top-k genotypes requires a scoring function that defines a total order; therefore, the natural choice is to treat the task as a regression problem. If we trained the model as a classifier with classes 10, 20, 30, up to 100, all scores within a class would be tied, and we would lose the fine granularity that is essential to select the top-k plants. Treating this problem as classification instead of regression would throw away the total ordering made possible by using scores as the main output of the deep learning models. Furthermore, evaluating the use of mobile phones involves two problems: (1) mobile images and (2) small models. The first problem concerns the great variation in image quality, light, and resolution. The second concerns small models, which often compromise accuracy to obtain a lighter model. We therefore compared small models with bigger models to verify whether the accuracy loss is acceptable in these applications.
Our objective is to explore deep learning regression-based methods on mobile phone images to assess the regrowth of tillers. Furthermore, unlike other studies that directly count the number of tillers, we propose a methodology to assess the percentage of regrown tillers using scores from 10 to 100. We collected 1124 images with two distinct mobile phones and labeled them manually. Six different architectures were evaluated using 10-fold cross-validation, with and without transfer learning, and we present a quantitative and qualitative analysis for regression. Thus, our work indicates the potential of the proposed methodology for tiller regrowth estimation, which will be useful in increasing the efficiency of breeding programs. Our work can be used to build powerful tools for scientists and researchers to evaluate and select the best cultivar candidates in forage breeding programs and contribute to increasing animal protein productivity.
The rest of this paper is organized as follows. Section 2 presents the materials and methods adopted in this study. Section 3 presents the results obtained in the experimental analysis. Section 4 discusses our achievements. Finally, Section 5 summarizes the main conclusions and points to future works.

Materials and Methods
We adopt a standard workflow (see Figure 1) of data collection, preprocessing, and training procedures.

Study Area and Dataset
The study was developed in the field at Embrapa Beef Cattle, Campo Grande, Mato Grosso do Sul, Brazil, in the Cerrado Biome (Figure 2). Embrapa Beef Cattle holds the main Panicum maximum germplasm bank in the country and is responsible for its breeding program [3]. Panicum maximum (Guinea grass) is one of the most important tropical forage grasses because of its high production potential, nutritive value, adaptation ability to different soils and climates, and potential as an alternative source of energy [25][26][27]. Our experiments were conducted in two trials (P7 and P8) of a biparental population of Guinea grass with 210 genotypes showing a high genetic diversity.
The dataset was generated with images obtained with two mobile phones, a Redmi Note 8 Pro and a Moto G4 Play, using the Field Book app [28], which organizes the images and their traits in a CSV file. Our dataset is composed of 1124 labeled images. Tables 1 and 2 show the number of images collected by date, mobile phone, and experimental area. Each acquisition session lasted close to 1 h, with a variation of 10 min. The P8 trial was imaged in just one day with a single cell phone, while the P7 trial was imaged on three different days, one of them with two cell phones. Considering the different dates (four days across two seasons, spring and summer) and times (10 a.m. to 11 a.m. and 1 p.m. to 2 p.m.) at which the images were taken, an attempt was made to generate a dataset with high luminosity variability, making the model more generic and robust. All assessments were made seven days after harvest.
Images taken with the Redmi Note 8 Pro are 3264 × 1504 pixels, and images from the Moto G4 Play are 3264 × 2448 pixels. They were taken at a height of approximately 1.05 m. Figure 3 shows the in situ data collection, while Figure 4 shows samples of different regrowth densities from our dataset. The regrowth density was evaluated in each plot seven days after the mechanical harvest, when the regrowth density shows a higher correlation with the next harvest production. To achieve high reliability, the regrowth density measurements must be performed by the same expert (researcher or technical staff) repeatedly after a series of harvests within a year and across different years. For this study, the ground truth data were collected in the field by an Embrapa Beef Cattle researcher (Figure 3). The regrowth was annotated as an integer score divisible by 10, varying from 10 to 100 (inclusive). A score of 10 corresponds to a tiller regrowth of 0% to 10%, and 100 corresponds to a tiller regrowth of 90% to 100%.
The literature usually uses a coarser scoring range from 1 to 5, where 1 represents a regrowth of 0% to 20%, 2 a regrowth from 20% to 40%, 3 a regrowth from 40% to 60%, 4 a regrowth from 60% to 80%, and 5 a regrowth from 80% to 100% [26]. However, we used a more refined scale to increase the robustness of our work.
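As an illustration of the scale described above, the mapping from an observed regrowth percentage to the annotated score can be sketched in a few lines (the function name is ours, not part of the annotation protocol):

```python
import math

def regrowth_score(pct_regrown):
    # Map an observed regrowth percentage (0-100) to the integer score
    # used in this study: 10 covers 0-10%, 20 covers 10-20%, ...,
    # 100 covers 90-100%.
    return max(10, math.ceil(pct_regrown / 10) * 10)
```

For example, a plot with 55% of regrown tillers would receive a score of 60, and a plot with no visible regrowth would still receive the minimum score of 10.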

Deep Learning Approach
After the labeled data were organized, we approached the problem as regression using the FastAi library [29]. The experiments evaluated six architectures: AlexNet [30], ResNet [31] (18, 34, and 50 layers), ResNeXt101 [32], and DarkNet [33]. We used the AlexNet, ResNet, and ResNeXt implementations from PyTorch [34]. For DarkNet, a repository implementation was used [35]. To use ResNeXt with FastAi, a pre-trained model library was used [36]. In addition, all architectures were also evaluated with a model pre-trained on ImageNet [30] in order to assess the influence of fine-tuning.
We performed all experiments using 10-fold cross-validation with an internal hold-out procedure to create training, validation, and test sets. Each fold was divided into 81% for training, 9% for validation, and 10% for testing. All results presented in this paper were evaluated on the test set. We trained our models on a Tesla K80 GPU.
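The splitting protocol above can be sketched as follows (a minimal illustration; the function name and the use of a fixed shuffling seed are our own assumptions, not the exact implementation used in the experiments):

```python
import random

def cross_validation_splits(n_examples, n_folds=10, val_frac=0.1, seed=0):
    # Each fold holds out ~10% of the data for testing; the remaining
    # ~90% is split again into 90% training / 10% validation, yielding
    # the 81% / 9% / 10% proportions described in the text.
    idx = list(range(n_examples))
    random.Random(seed).shuffle(idx)
    fold_size = n_examples // n_folds
    for k in range(n_folds):
        test = idx[k * fold_size:(k + 1) * fold_size]
        rest = idx[:k * fold_size] + idx[(k + 1) * fold_size:]
        n_val = round(len(rest) * val_frac)
        val, train = rest[:n_val], rest[n_val:]
        yield train, val, test
```

With the 1124 images of our dataset, each fold has 112 test, 101 validation, and 911 training examples, and the three sets are disjoint.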

Experimental Setup
We resized the images to 224 × 224 pixels, applying a random horizontal flip and a maximum rotation of 20 degrees, both with a probability of 0.75. We trained for 65 epochs (see Figure 5), split into four stages of 10, 10, 5, and 40 epochs. For the first three stages, we used the One Cycle Policy [37], and for the last stage, we used the standard training policy. The learning rate was chosen empirically using the learning rate finder implemented in the FastAi library.
In pre-trained models, we unfroze the third-to-last layer after the first stage and the whole model after the second stage. The loss function was the flat mean squared error (MSELossFlat in the FastAi library).
At inference, the predictions were rounded to the closest multiple of 10 between 10 and 100 (inclusive). Table 3 shows how we divided our experiments regarding architecture, pre-training status, and batch size. The hashtag (#) indicates the experiment number.
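The rounding rule at inference can be expressed as a short function (a sketch; the name is ours):

```python
def round_score(prediction):
    # Round a raw regression output to the nearest multiple of 10,
    # clipped to the valid score range [10, 100].
    rounded = round(prediction / 10) * 10
    return max(10, min(100, rounded))
```

A raw output of 73.2 is thus reported as 70, while out-of-range outputs are clipped to the scale's end points.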

Approach Evaluation and Statistical Analysis
We evaluated all of our experiments on Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Mean Absolute Percentage Error (MAPE), and Pearson Correlation (R), and plotted the confusion matrix of each experiment. The metrics are defined as

MAE = (1/n) ∑ᵢ |yᵢ − ŷᵢ|
RMSE = √((1/n) ∑ᵢ (yᵢ − ŷᵢ)²)
MAPE = (100/n) ∑ᵢ |yᵢ − ŷᵢ| / |yᵢ|
R = ∑ᵢ (yᵢ − m_y)(ŷᵢ − m_ŷ) / √(∑ᵢ (yᵢ − m_y)² ∑ᵢ (ŷᵢ − m_ŷ)²)

where y represents the true value, ŷ represents the predicted value, and m_y and m_ŷ are the averages of the true and predicted values, respectively. Nonetheless, these metrics do not indicate whether a model tends to predict lower or higher values than the ground truth data. This motivated us to use the Regression Receiver Operating Characteristic (RROC) [38]. RROC space is a plot that depicts the total under-estimation (always negative) against the total over-estimation (always positive). Thus, the closer a point is to (0, 0), called RROC heaven, the better the model. A diagonal dashed line UNDER + OVER = 0 marks the points where the under-estimation matches the over-estimation, making the model unbiased. We also used a histogram to evaluate how well the model learned the score distribution by comparing it with the true distribution. Finally, we applied the Grad-CAM [39] visual approach.
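For concreteness, the four metrics can be computed in plain Python as follows (a sketch with our own naming; in practice, library implementations would be used):

```python
import math

def regression_metrics(y_true, y_pred):
    # Plain-Python versions of MAE, RMSE, MAPE, and Pearson R,
    # following the definitions given in the text.
    n = len(y_true)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
    mape = 100 / n * sum(abs(t - p) / abs(t) for t, p in zip(y_true, y_pred))
    my = sum(y_true) / n
    mp = sum(y_pred) / n
    cov = sum((t - my) * (p - mp) for t, p in zip(y_true, y_pred))
    r = cov / math.sqrt(sum((t - my) ** 2 for t in y_true)
                        * sum((p - mp) ** 2 for p in y_pred))
    return {"MAE": mae, "RMSE": rmse, "MAPE": mape, "R": r}
```

Note that a model that over-predicts every score by the same constant still has R = 1 while MAE and RMSE equal that constant, which is one reason we report several metrics.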

Results
Initially, we plotted the validation loss curves of the models in Figure 5. The plots show the validation loss versus the number of epochs; each point is the average loss over the folds in that epoch. The loss curve gives an overview of the training behavior of the models and makes it possible to check whether an incorrect setting of epochs affected a model's result. We can see that all models converged and reached a stable plateau on the validation set after epoch 30. Another important observation is that no model had abrupt dips in its loss curve, which suggests that terminating training too early could have affected the results. Table 4 shows the mean and standard deviation of the mean absolute error, root mean square error, mean absolute percentage error, and Pearson correlation over the 10-fold cross-validation of each experiment. The experiment number refers to Table 3. Regarding the standard evaluation, the top result, from experiment resnet50-pret, has an average MAE of 7.70 and an average RMSE of 10.97; however, its non-pre-trained counterpart did not achieve such good results. The best pre-trained/non-pre-trained pair was ResNeXt101, which achieved average MAEs of 7.72 and 7.81 and average RMSEs of 11.02 and 11.04 with and without fine-tuning, respectively. All experiments showed a correlation higher than 0.81.

Standard Metrics: MAE, RMSE, MAPE, Pearson Correlation, and Confusion Matrix
The predictions used to plot Figures 5-8 are computed by concatenating all 10 test set results from the cross-validation procedure. In this way, the predictions have no overlapping results, representing the entire dataset as a test set without leaking data from the training set to the test set. Figure 6 shows the confusion matrix of each experiment; the experiment numbering refers to Table 3. Among the 209 examples with a ground truth score of 70, 86 were predicted correctly, and the remaining predictions were concentrated around the correct value. The region of the matrices below a score of 60 represents forages with low regrowth. The goal of the breeding program is to select plants with the best regrowth, i.e., the ones with higher scores for the trait. Therefore, due to the selection applied in past generations, we expect fewer samples with scores below 60. When we look at the prediction quality in this region for the two best-performing models, resnet50-pret and resnext101-pret (Figure 6g,i, respectively), we can observe that resnext101-pret shows a slightly bluish color pattern closer to the main descending diagonal than resnet50-pret. This pattern indicates that resnext101-pret performs better for lower scores than resnet50-pret. When we look at lower-performing models, such as alexnet-nopret (Figure 6b), the results are spread over all scores lower than 60, and the model only starts to hit the main diagonal after 60.
The confusion matrix plot shows some values below and above the descending diagonal. However, it is hard to evaluate whether the algorithms had any tendency to predict higher or lower values than the ground truth. One way to assess the tendency to higher or lower values is using RROC [38].

RROC Space
As defined in Section 2, RROC space depicts the total under-estimation (always negative) against the total over-estimation (always positive): the closer a point is to (0, 0), called RROC heaven, the better the model, and the dashed line UNDER + OVER = 0 marks unbiased models. Figure 7 shows the RROC plot of the trained models. We can observe that all of them are under the dashed line, which indicates that the models tend to predict lower values than the ground truth. This result corroborates the confusion matrices, where the values below the descending diagonal, especially for scores 80, 90, and 100, are usually higher than the values above it.
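The coordinates of a model's point in RROC space can be computed directly from the signed errors (a minimal sketch; the function name is ours):

```python
def rroc_point(y_true, y_pred):
    # A point in RROC space: total over-estimation (sum of positive
    # errors) and total under-estimation (sum of negative errors).
    # An unbiased model satisfies over + under = 0.
    errors = [p - t for t, p in zip(y_true, y_pred)]
    over = sum(e for e in errors if e > 0)
    under = sum(e for e in errors if e < 0)
    return over, under
```

A model whose over- and under-estimations cancel out lies exactly on the dashed diagonal, even if its individual errors are large, which is why RROC complements rather than replaces MAE and RMSE.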
The experiments resnet50-pret and resnext101-pret are the closest to RROC heaven. The least biased model is that of experiment darknet-nopret. Comparing with Table 4, we observe that experiments resnet50-pret and resnext101-pret show good results; however, the RROC space analysis shows that they are biased. This demonstrates the importance of this analysis, since the standard metrics alone do not reveal such bias. Figure 8 shows the intersection (greenish color) of the Probability Density Function (PDF) of the ground truth data distribution and the prediction distribution of each experiment. The number of bins is fixed at 10, representing the multiples of 10 between 10 and 100 (inclusive). The distribution of y is shown in Figure 9.

Histogram Analysis
The intersection area between the distributions in each experiment, shown in Table 5, is a numerical summary of the graphs and allows us to compare the experiments using a single score. All models learned the overall distribution well; however, they had difficulty predicting the classes at the extremes of the scale. The best histograms are from the experiments alexnet-pret, resnext101-pret, and darknet-pret, which achieved an intersection area of 0.93 between the two distributions. We also used the Kullback-Leibler (KL) divergence to measure the distance between the two probability distributions. The distributions most similar to the ground truth data are from the experiments alexnet-pret, resnext50-pret, and darknet-pret.

Table 5. Intersection areas of the histograms shown in Figure 8. Best results presented in bold.
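The histogram intersection and KL divergence used here can be sketched as follows (our own naming; a small epsilon is assumed to guard against empty bins in the KL computation):

```python
import math

def score_histogram(scores):
    # Normalized histogram over the ten score bins 10, 20, ..., 100.
    return [scores.count(b) / len(scores) for b in range(10, 101, 10)]

def intersection_area(p, q):
    # Overlap between two normalized histograms; 1.0 means identical.
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def kl_divergence(p, q, eps=1e-9):
    # KL(p || q); eps avoids taking the log of zero in empty bins.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q))
```

The intersection area is symmetric and bounded in [0, 1], while the KL divergence is asymmetric and unbounded, so the two measures can rank the experiments differently.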

Visual Inspection
Experiment resnet50-pret shows the top MAE, RMSE, and correlation among the algorithms tested. We analyzed the image regions that this model considers more discriminating to define the regrowth areas, i.e., where the model looks at the image to predict the regrowing areas. For this, we look at the last activation map in the model using Grad-CAM. Figures 10 and 11 show the heatmap of Grad-CAM on Experiment resnet50-pret for the best and worst prediction for 10, 50, and 100 ground truth values, respectively. Warmer colors indicate areas that played the most important role in the model's decision, while colder colors mean the opposite.
The heatmaps in Figure 10 show a pattern where, at low density (regrowth score 10), the model avoids the center of the plot and focuses on its border, leaving a central circle that the model does not analyze. At high density (regrowth score 100), the heatmap behaves in the opposite way, focusing more on the center of the plot. This result corroborates the intuition that a high-density plot has more leaves in the center, where the model looks. When looking at Figure 11a, the image seems to be mislabeled as 10. We can see from the image that the plot presents a relatively acceptable regrowth, much better than Figure 10a, and we believe that the model predicted a better score than the ground truth. The same occurs in the other images, where the prediction seems better than the ground truth. The pattern for higher regrowth is similar to Figure 10: the higher the regrowth, the more important the center of the plot is.

Efficiency Analysis
We analyzed the efficiency of the experiments by comparing the average time each model takes to process a single example. Table 6 shows the number of parameters of each experiment and the average inference time on GPU (tested on a Tesla M4) and CPU. We picked 112 examples from our dataset for this analysis. As expected, the models are much faster on GPU than on CPU; therefore, the GPU is preferable. However, in our case, inference time is not an issue because we do not need predictions in real time, and even the slowest model (resnext101-pret) is already quite fast.

Discussion
This study estimates the regrowth density of tropical forages using mobile phone images. To achieve this goal, we evaluated a series of standard and state-of-the-art deep learning methods, from a simpler model such as AlexNet, with only five convolutional layers, to a more complex model such as ResNeXt101, with 101 layers. These models were adapted to tackle the task as a regression problem.
For the first time, we report that deep learning methods can deliver correlations from 0.81 to 0.89 in estimating the regrowth density using mobile phone images. We believe that this result is very acceptable and has the potential to speed up data collection of regrowth density and consequently increase the efficiency of forage breeding programs. The closest approach found in the literature was the study conducted by Deng et al. [20] for rice tillers. The authors used a completely different approach. Their approach required harvesting the rice and evaluating the cross-sections of rice tillers. Using object detection, they estimated the number of productive tillers. Our approach requires just a plot image obtained from a mobile phone without harvesting or other labor-intensive intervention.
Deeper neural networks perform better than shallower versions of the same architecture in most problems [31]. In HTP, we found some controversy, where the deeper model did not always produce the best result. The study conducted by Oliveira et al. [40] using aerial images taken by an Unmanned Aerial Vehicle (UAV) showed results where the best-performing model among AlexNet, ResNeXt50, MaCNN, LF-CNN, and DarkNet53 was the simple AlexNet. Intrigued by these results, we evaluated a broader range of deep learning architectures with a more diverse number of layers. Interestingly, a 50-layer network (ResNet50) achieved our best result. In a traditional computer vision task, we would expect the 101-layer network to give the best result, which did not occur here.
The analysis using RROC indicated that all models were below the descending diagonal, suggesting that deep learning models tend to undervalue their predictions in the problem setting of this paper. Castro et al. [41] also plotted RROC in a biomass prediction problem using deep learning and aerial images, and in their results this tendency did not exist. We believe that this tendency arises from the data distribution being skewed toward higher values (Figure 9).
The heatmap results shed light on where the network is "looking" to predict the regrowth density. To the best of our knowledge, this is the first study to address the interpretability of deep learning models on regrowth. The results indicate that the border of the plot is the main region the models use for low-regrowth images, while the center of the plot is the most characteristic area for high-regrowth images.
Compared to similar works, ours differs in not using any complex sensor technology, such as MRI and LiDAR, which are expensive and excessive compared to a mobile phone. In addition, there is no need for a scheme to take pictures on different days and at different rotations, nor for handcrafted features. Furthermore, the main distinction from other works is the estimated trait: we calculated a score representing the regrowth percentage of the tillers instead of counting the number of tillers.
Machine learning must be used with care. Although the proposed approach can give valuable estimates of tiller regrowth, it is not advisable to completely replace the manual field labeling of regrowth density. It is always good to collect smaller validation sets to evaluate whether the learned models still give good estimates. Therefore, the proposed approach was never intended to completely replace the manual labeling of fields but rather to allow HTP research to multiply the number of plots evaluated while reducing the need for manual label collection.

Conclusions
To the best of our knowledge, this is the first research that evaluated CNN-based architectures to estimate regrowth density using RGB images collected by mobile phones. From our perspective, this study also presents the following contributions according to our results: (1) deep learning can deliver correlations from 0.81 to 0.89 in estimating the regrowth density using mobile phone images; (2) the best-performing architecture is not always the deeper model for this problem; (3) the deep learning models tend to undervalue the predictions in our problem setting; and (4) the heatmap indicates the patterns that deep learning models use to predict regrowth density.
Previous works focus on estimating the tiller number. We used a score that represents the percentage of regrown tillers, and we collected a dataset with images of forages taken on different days, locations, phones, and genotypes, promoting more generalized models.
Our results indicate that our methods can be used for prediction on new data. To develop new cultivars, researchers need to evaluate and select for multiple traits in the breeding program; performing the phenotyping step thus consumes considerable time and cost, sometimes with low accuracy. Training new algorithms to estimate traits such as disease and insect damage, mineral deficiencies, seed number, and others is the next step of this work toward deep learning associated with low-cost mobile devices.
In future work, we will evaluate the problem employing lightweight deep learning architectures to deploy the model on the mobile phone itself. In this way, annotators can speed up their labeling process, and their task becomes more about validating the predictions and collecting images than labeling the plots. We also plan to evaluate the problem using Learning-to-Rank algorithms and to evaluate the use of UAV-based images.

Data Availability Statement:
The data that support the findings of this study are available from the corresponding author with the permission of Embrapa.