A Deep Learning Semantic Segmentation-Based Approach for Field-Level Sorghum Panicle Counting

Abstract: Small unmanned aerial systems (UAS) have emerged as high-throughput platforms for the collection of high-resolution image data over large crop fields to support precision agriculture and plant breeding research. At the same time, the improved efficiency in image capture is leading to massive datasets, which pose analysis challenges in providing needed phenotypic data. To complement these high-throughput platforms, there is an increasing need in crop improvement to develop robust image analysis methods to analyze large amounts of image data. Analysis approaches based on deep learning models are currently the most promising and show unparalleled performance in analyzing large image datasets. This study developed and applied an image analysis approach based on a SegNet deep learning semantic segmentation model to estimate sorghum panicle counts, which are critical phenotypic data in sorghum crop improvement, from UAS images over selected sorghum experimental plots. The SegNet model was trained to semantically segment UAS images into sorghum panicles, foliage and the exposed ground using 462 labeled images of 250 × 250 pixels; the trained model was then applied to the field orthomosaic to generate a field-level semantic segmentation. Individual panicle locations were obtained after post-processing the segmentation output to remove small objects and split merged panicles. A comparison between model panicle count estimates and manually digitized panicle locations in 60 randomly selected plots showed an overall detection accuracy of 94%. A per-plot panicle count comparison also showed high agreement between estimated and reference panicle counts (Spearman correlation ρ = 0.88, mean bias = 0.65). Misclassifications of panicles during the semantic segmentation step and mosaicking errors in the field orthomosaic contributed mainly to panicle detection errors. Overall, the approach based on deep learning semantic segmentation showed good promise and, with a larger labeled dataset and extensive hyper-parameter tuning, should provide even more robust and effective characterization of sorghum panicle counts.


Introduction
Recent years have seen unmanned aerial systems (UAS) emerge as effective means for field-relevant phenotyping activities by enabling efficient and more affordable collection of aerial crop images over entire crop growth cycles. Traditional image analysis and machine learning have played key roles in transforming these image data into targeted phenotypic information such as plant height [1,2] and plant population counts [3][4][5]. However, the massive image data being collected by UAS and other high-throughput platforms pose challenges to traditional methods, whose performance tends to level off and fall short of the high accuracy required for fully automated systems [6]. Much of the loss in performance in traditional methods is attributable to the limitations of feature engineering, which is a critical step in many of them. Feature engineering, the selection of suitable features based on domain knowledge to aid analysis, is almost always data or situation specific, which presents huge challenges when massive and diverse data are involved [7]. It is not feasible to come up with a comprehensive feature set if an expert must account for variability in thousands of images. This is even more complicated in agricultural environments, where image quality is influenced by changes in illumination, crop growth and senescence [4,8].
The most promising image analysis methods in that respect are those based on deep learning. Deep learning is a subset of machine learning where multi-layer artificial neural networks learn from large amounts of data to accomplish a targeted application. Deep learning provides three major advantages over traditional approaches. First, it achieves consistent and high accuracies in classifying, identifying and segmenting objects of different categories from image data. Second, its performance scales with the amount of data, so it can handle massive datasets without significant drops in accuracy. Third, the ability to adapt a pre-trained model to other applications through transfer learning significantly reduces the burden of training [6,9]. Deep learning removes the burden and limitation of feature engineering by learning features automatically, providing an opportunity for improved automation while allowing for increases in accuracy with large data [10]. Such robust performance is likely to be useful for applied breeding by reducing the delay between the collection of massive image datasets and the availability of needed phenotypic data.
Sorghum (Sorghum bicolor (L.) Moench) is an important food and economic crop in many parts of the world [11]. In the face of global climate change and population growth, sorghum breeders continue to strive for improved crop yields to boost food security. Plant phenotyping, the characterization of a plant's physical attributes, is of great significance to achieving this breeding goal. In sorghum breeding, sorghum panicles, the reproductive (flowering) part of a sorghum plant and main yield component, are among the critical components for which phenotypic data are sought [12]. Phenotypic data such as panicle counts, number of seeds per panicle and panicle sizes (length and width) play key roles in various assessments, including assessing genetic diversity, selecting new cultivars and estimating potential yields [13][14][15][16][17][18][19][20]. However, the manual characterization of panicles has proved to be a bottleneck to sorghum crop improvement. Thus, approaches such as deep learning that can provide estimates of phenotypic attributes in an effective and efficient manner are needed to expedite sorghum improvement [21,22].
Previous studies have proposed various deep learning architectures for a variety of tasks including image classification, regression, object detection and semantic segmentation [6][23][24][25]. Popular among these are convolutional neural networks (CNN), which apply sequences of convolutional layers to facilitate learning of image features [26]. Like ordinary neural networks, CNNs apply learnable weights and biases to input image values to transform them into final predictions. However, CNNs have a deep architecture consisting of multiple hidden layers (more than 3), making them more robust than ordinary neural networks. CNNs also account for local connectivity by enforcing consecutive-only neuron connections and, by applying shared weights and layer pooling, reduce the complexity of the training process [26]. The annual ImageNet competitions (http://www.image-net.org/), in which participants compete to classify millions of sample images from the ImageNet database, have inspired a number of deep learning architectures. Notable examples include the VGG-16 model [23], GoogleNet by Google [27] and ResNet by Microsoft Research [24] with 16, 22 and 152 layers, respectively. While all these models were originally designed for image classification, they are adaptable to other applications: the SegNet model [23] adapts the VGG-16 model for semantic segmentation, and R-CNN [28] adapts a common CNN architecture for object detection. Thus, the stage is set for the application of these models to a variety of phenotyping tasks.
The utility of deep learning has already gained ground in various plant phenotyping tasks including plant disease detection and diagnosis [29], fruit and flower classification [30], leaf counting in rosette plants [10] and maize tassel counting [31], with relatively higher accuracies than previous methods. In a number of studies, pretrained deep learning models have been adapted to applications of interest to boost performance, given that many of the pretrained models have already been trained on thousands or millions of images [29,30]. Other researchers have designed custom deep learning architectures to extract targeted information. In Lu et al. [31], a custom deep convolutional neural network model called TasselNet was developed to count maize tassels through density-based estimation under field environments. Xiong et al. [32] segmented rice panicle objects from field images based on a custom model, Panicle-SEG, which combined super-pixel clustering and a CNN to provide robust performance. A positive outcome of previous studies is the growing availability of labeled datasets that other studies can leverage. However, given the still limited application of deep learning analyses in plant phenotyping, there is still a need for more specialized datasets and for evaluation of deep learning approaches in other aspects of plant phenotyping.
For sorghum panicle counting, other traditional machine learning and deep learning approaches have also previously been used [33,34]. In [33] color information and three-dimensional structure information were leveraged by using a color ratio and circle fitting approach to detect sorghum panicles. Given the reliance on color ratios and the circularity of panicles in nadir images, this approach could be limited under wide spectral variability or under shape distortion due to image capture geometry. Another study [34] applied a combination of clustering using the Simple Linear Iterative Clustering (SLIC) [35] super-pixel algorithm and bag-of-words models [36] to learn relevant features for panicle classification. Panicles were then detected based on learned features by applying a logistic regression model. While high accuracy was achievable, it came at the expense of significant feature engineering. A similar approach based on learning multiple features for panicle detection is presented in [37]. To reduce the effort spent on manual labeling, Ghosal et al. [38] applied a weakly supervised deep learning framework whereby a model was trained on a small training sample but gradually updated through human feedback on object detection results. They demonstrated a significant reduction in human labeling effort without compromising final model performance.
In this study, our main goal was to develop a field-based approach for counting sorghum panicles from UAS images using deep learning semantic segmentation. Semantic segmentation is a supervised learning problem aimed at assigning each pixel of an image to one of several semantic classes [39,40]. Our general approach relied on a SegNet model to segment UAS images into sorghum panicles and other features such as foliage and exposed ground. While we could have applied an object detection approach, we chose the semantic segmentation option because of its better definition of object boundaries, which provided opportunities for splitting any merged panicle instances. Object detection only provides bounding boxes of detected objects, which might involve much more post-processing [31]. Our specific objectives were to: (1) develop labeled datasets to support the semantic segmentation of panicles from existing UAS image data from sorghum trials; (2) train and evaluate the SegNet model for semantic segmentation of UAS data into the three semantic classes; (3) develop a method for detection and localization of individual panicles within the experimental field based on the semantic segmentation results; and (4) compare derived panicle count estimates with manually collected counts.

Study Site
Our study site was a sorghum experimental field on the Texas A&M AgriLife Research Farm (Latitude 30°55'00'' N, Longitude 96°25'48''W) in Burleson County, Texas (USA). The climate in this region is humid subtropical with mild winters and hot summers [41]. The soil at the AgriLife Research Farm is a silt clay loam with 0%-1% slopes [42]. The field, planted on 26 April 2017, comprised historical grain sorghum material covering over 700 single-row (0.75 m by 6.0 m) plots. For our study only 360 plots were used as shown in Figure 1. Standard agronomic practices for grain sorghum production, including supplemental irrigation and fertilization as needed, were applied.

UAS Image Data and Plot Boundaries
We collected aerial images over the field using a DJI Phantom 3 Professional UAS (Shenzhen, Guangdong, China) with a 12-megapixel DJI FC300X camera at the following settings: a 35 mm equivalent focal length of 20 mm, an F-stop of f/2.8 and an automatic ISO option. Images were captured on a sunny and virtually still day on 7 July 2017, when the crop was in its reproductive growth stage, from a flight altitude of 10 m with 90% forward and side image overlap in a standard parallel flight pattern.
The collected images were processed using the Pix4Dmapper software to generate orthomosaics. Standard photogrammetric processing was followed including extraction and matching of common points across images followed by triangulation and bundle adjustment to generate densified 3D point clouds and orthomosaic images [1]. The effective ground sample distance of the generated orthomosaics was 0.4 cm/pixel. Ground control points were not used and image registration relied on geotagged GPS image positions only. For our purposes, only relative positions were needed, thus relying on geotagged positions was adequate. Based on the generated orthomosaics, plot boundaries were generated in shapefile format. Plot boundaries were buffered inward by a small distance (10 cm) to minimize edge effects, where panicles from one plot encroached on a neighboring plot.

Labeled Panicle Data
Labeled panicle data for deep learning-based semantic segmentation were developed from 250 × 250 pixel sample images cut out from full UAS scenes. Three semantic classes were defined as follows: Panicle, for all panicle instances in an image; Ground, for exposed ground surfaces in the image; and Background, for green foliage and any shadowed regions in the image, as illustrated in Figure 2. The manner in which the semantic classes are modeled depends on the goal of the application. While two semantic classes (panicle and background) would have sufficed, we modeled ground as a separate class to provide a better scene characterization. In addition, ground and panicles may look similar, so modeling them separately enables a trained model to be more robust to these similarities. In selecting sample images, we tried to make the sample set as representative as possible by including various scenarios. We selected samples with panicles of different colors, samples without panicles, samples with pure classes (e.g., ground only) and samples from images captured at different angles. Labeling for semantic segmentation requires that all pixels in each sample image are labeled. Thus, we digitized all pixels in each sample into the predefined semantic classes using the Image Labeler tool in the MATLAB® software (www.mathworks.com). The Image Labeler provides an easy interface to mark a variety of regions of interest (rectangular, pixel, polygon) in sample images to define targeted ground truth for semantic segmentation and object detection. We labeled 462 samples.

Sorghum Panicle Counting Approach
Our panicle counting approach involved two main steps: (1) deep learning semantic segmentation of input image data and (2) post-processing to facilitate individual panicle counting. The semantic segmentation served to classify and accurately locate panicles within an image. The second step post-processed the semantic segmentation output to get rid of small objects and split any merged panicles. Figure 3 summarizes our general workflow.

Deep Learning Semantic Segmentation and Model Fitting
Deep learning semantic segmentation overview: At a basic level, a deep learning architecture for semantic segmentation comprises an encoder network and a decoder network. An encoder network is normally a CNN, which learns relevant but low-resolution features about the target classes. A decoder network, on the other hand, uses the learned features to generate a pixel-level prediction. Previous studies in computer science have developed various approaches for semantic segmentation [23,25,28,29,43,44]. Region-based and fully convolutional network-based (FCN) approaches are among the predominant ones. Region-based approaches such as R-CNN [28] rely on object detection: candidate regions are selectively extracted and their corresponding features computed. A classifier such as a support vector machine is then applied for pixel-level prediction to generate a semantic segmentation. While high performance is achievable using such approaches, the need to extract many candidate regions can be time-consuming. In addition, selected candidate regions may not be representative, which may lead to poor boundary definition. Like region-based methods, FCN-based methods [25] use a CNN but are able to learn and optimize pixel-to-pixel mappings without extracting candidate regions. However, such networks suffer from loss of resolution due to the consecutive convolutional and pooling operations. Approaches that generate high-resolution segmentations, such as SegNet [44] and U-Net [43], have since been developed and have demonstrated improved performance.
Fitting SegNet semantic segmentation model: In this study, we adapted a SegNet with weights initialized from the VGG-16 network for semantic segmentation of sorghum panicles using the MATLAB® Computer Vision and Deep Learning toolboxes. Building a SegNet-type model in MATLAB only requires one to specify the size of the input sample images (250 × 250 × 3), the number of semantic classes (3) and a pretrained model, VGG-16 in our case. The rest of the network setup, such as assigning weights from the VGG-16 network and adding required layers, is then done automatically. Given the unbalanced sampling among the three semantic classes, we incorporated sample weighting, with weights calculated as the inverse frequency of each class, to enhance the robustness of our model. To improve the accuracy of the network, we also augmented the training data by randomly shifting, rotating and reflecting it to create multiple versions of the data [45,46]. With a fully specified model, we trained it using mini-batch stochastic gradient descent with momentum (SGDM) as the optimizer, with 75% of the labeled data used for training and 15% for validation. The learning rate followed a piecewise schedule that reduced the learning rate from an initial value of 0.03 by a factor of 0.3 every 10 epochs. A mini-batch size of 4 training samples was used to reduce memory usage while training. Training was accomplished over 50 epochs on a 64-bit Dell Workstation (Intel® Xeon® Processor with 256 GB RAM, NVIDIA™ Quadro K5200 GPU with 8 GB RAM) and took about an hour and a half to complete. Having trained the model, we applied it to the field orthomosaic to generate a field-level semantic segmentation.
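As a rough illustration of the class weighting and learning-rate schedule described above, the following Python sketch computes inverse-frequency class weights and a step-decay learning rate. It is not the MATLAB configuration used in the study: the function names and the toy pixel counts are hypothetical, and it assumes that "reduced by a factor of 0.3" means the rate is multiplied by 0.3 at each drop.

```python
def inverse_frequency_weights(pixel_counts):
    """Map each class to 1 / (its share of all labeled pixels)."""
    total = sum(pixel_counts.values())
    return {name: total / count for name, count in pixel_counts.items()}


def piecewise_learning_rate(epoch, initial_rate=0.03, drop_factor=0.3, drop_period=10):
    """Learning rate at a 0-based epoch, multiplied by drop_factor every drop_period epochs."""
    return initial_rate * drop_factor ** (epoch // drop_period)


# Hypothetical pixel counts per class (not the study's data).
counts = {"panicle": 1_000, "ground": 3_000, "background": 6_000}
weights = inverse_frequency_weights(counts)
for name, weight in weights.items():
    print(f"{name}: weight {weight:.2f}")

# Rates at the start of each 10-epoch stage of a 50-epoch run.
for epoch in range(0, 50, 10):
    print(f"epoch {epoch}: learning rate {piecewise_learning_rate(epoch):.6f}")
```

Under this weighting, the rarest class receives the largest weight, so errors on panicle pixels would count more heavily in the loss than errors on the more prevalent classes.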
Accuracy assessment of trained SegNet model: We evaluated the accuracy of the trained deep learning model against the remaining 10% of the labeled data, held out as test data, using two metrics: the overall accuracy (OA) and the intersection over union (IoU). The overall accuracy expresses the rate of correctly classified pixels regardless of class. The IoU segmentation measure, also known as the Jaccard coefficient, is the ratio of correctly classified pixels to the total number of ground truth and predicted pixels in that class, and it penalizes false positives. We applied these metrics to assess both global and per-class performance of the semantic segmentation. Given the number of true positives, TP, the number of true negatives, TN, the number of false positives, FP, and the number of false negatives, FN, we defined the two metrics as follows:
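Under the standard definitions of the two metrics, which are assumed here because they match the verbal descriptions above, OA and IoU can be written in terms of TP, TN, FP and FN as shown below. The equation numbers (1) and (2) are inferred from the later references to Equations (3) to (5).

```latex
\begin{align}
\mathrm{OA}  &= \frac{TP + TN}{TP + TN + FP + FN} \tag{1} \\
\mathrm{IoU} &= \frac{TP}{TP + FP + FN} \tag{2}
\end{align}
```

For a single class, the denominator of the IoU counts every pixel that is either labeled or predicted as that class, so both false positives and false negatives lower the score.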

Post-Processing and Panicle Counting
The semantic segmentation result required post-processing before we could detect individual panicles and estimate plot-level panicle counts, for two reasons: (1) the existence of small, usually misclassified objects and (2) the existence of merged panicle objects that affected the plot-level panicle count estimates. We applied a sequence of morphological operations and the watershed transform to alleviate these two issues.
To remove small objects, we first binarized the semantic segmentation output so that 1s represented panicle regions and 0s represented everything else. We then applied an erosion operation, using a 3-pixel disk-shaped structuring element, to the binarized output to remove some of the boundary pixels from all panicle objects and also get rid of some small objects. Given that we expected panicle objects to be of a certain minimum size, we imposed a size restriction (30 pixels) on the eroded output by applying an area opening morphological operation. The 30-pixel threshold was determined empirically by examining the sizes of panicle objects in the segmented image.
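The erosion and area-opening steps can be sketched in Python with NumPy and SciPy as follows. This is only an illustration, not the MATLAB code used in the study: the function names are hypothetical, the "3-pixel disk" is assumed to mean a disk of radius 3 pixels, and components are assumed to be 8-connected.

```python
import numpy as np
from scipy import ndimage as ndi


def disk(radius):
    """Boolean disk-shaped structuring element of the given pixel radius."""
    y, x = np.ogrid[-radius:radius + 1, -radius:radius + 1]
    return x * x + y * y <= radius * radius


def clean_panicle_mask(mask, radius=3, min_area=30):
    """Erode a binary mask, then drop connected components smaller than min_area pixels."""
    eroded = ndi.binary_erosion(np.asarray(mask, dtype=bool), structure=disk(radius))
    # Label 8-connected components of the eroded mask (label 0 is the background).
    labels, _ = ndi.label(eroded, structure=np.ones((3, 3), dtype=bool))
    areas = np.bincount(labels.ravel())
    keep = areas >= min_area
    keep[0] = False  # never keep the background
    return keep[labels]


# Toy mask (hypothetical): one large blob, one medium blob and one speck.
mask = np.zeros((60, 60), dtype=bool)
mask[5:25, 5:25] = True    # survives both steps
mask[30:41, 30:41] = True  # survives erosion but falls below the area threshold
mask[50:53, 50:53] = True  # removed by the erosion itself
cleaned = clean_panicle_mask(mask)
print("panicle pixels kept:", int(cleaned.sum()))
```

The erosion shrinks every object, so the area threshold applies to the eroded objects rather than to the original segmentation regions.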
With the clean panicle mask from the preceding steps, we applied a watershed transform to split merged panicles. The watershed transform treats an image as a topographic surface with ridges and valleys and achieves a segmentation by decomposing the image into catchment basins [47,48]. To model a topographic surface from a binary image, an approach often used is to calculate the negative of the distance transform of the complement of the binary image (the panicle mask in our case) [49]. The distance transform generates an image showing the distance from each foreground pixel to the closest boundary, and inverting it ensures that the center pixels of target objects are regional minima, satisfying the assumptions of the watershed transform. With the watershed transform, any local minimum can become a catchment basin, which often leads to over-segmentation. To prevent over-segmentation, we suppressed some local minima in the distance transform before applying the watershed transform. We achieved this by suppressing all minima in the negative distance transform image whose depth was less than a threshold d, 0.7 in our case. We determined the threshold d through prior empirical tests.
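A minimal sketch of this splitting step, assuming SciPy and scikit-image rather than the MATLAB tools used in the study, is given below. Instead of modifying the surface to suppress shallow minima, it uses a closely related marker-based variant: only minima at least `depth` deep (found with `h_minima`) seed the watershed. The function name and the toy mask are hypothetical.

```python
import numpy as np
from scipy import ndimage as ndi
from skimage.morphology import h_minima
from skimage.segmentation import watershed


def split_merged_objects(mask, depth=0.7):
    """Label a binary mask, splitting touching blobs with a marker-based watershed."""
    mask = np.asarray(mask, dtype=bool)
    # Negative distance transform: object centers become regional minima.
    surface = -ndi.distance_transform_edt(mask)
    # Keep only minima at least `depth` deep; shallower ones get no marker.
    minima = h_minima(surface, depth)
    markers, _ = ndi.label(minima, structure=np.ones((3, 3), dtype=bool))
    # Flood from the markers, restricted to the foreground.
    return watershed(surface, markers, mask=mask)


# Toy mask (hypothetical): two overlapping disks of radius 10 pixels.
rows, cols = np.mgrid[:40, :50]
mask = ((rows - 20) ** 2 + (cols - 15) ** 2 <= 100) | ((rows - 20) ** 2 + (cols - 33) ** 2 <= 100)
labels = split_merged_objects(mask)
print("objects found:", labels.max())
```

A larger `depth` merges more neighboring basins and so splits less aggressively; a smaller value splits more and risks dividing single panicles.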
Having split the merged panicles, we applied connected components analysis to the watershed transform output to generate individual panicle regions. Based on the area attributes of the connected components, we eliminated any regions with an area of less than 30 pixels. Finally, we determined panicle locations using the centroid coordinates of each connected component, as illustrated in Figure 3. Having detected the panicle instances in the field, we determined plot-level panicle counts by aggregating all detected panicle instances within the respective plot polygons.
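The centroid extraction and plot-level aggregation can be illustrated with the short Python sketch below. It makes simplifying assumptions that are not part of the study: regions are given as an integer label image, plot polygons are replaced by axis-aligned pixel boxes, and all function and plot names are hypothetical.

```python
import numpy as np
from scipy import ndimage as ndi


def panicle_centroids(labels, min_area=30):
    """Return (row, col) centroids of labeled regions with at least min_area pixels."""
    areas = np.bincount(labels.ravel())
    ids = [i for i in range(1, areas.size) if areas[i] >= min_area]
    return ndi.center_of_mass(labels > 0, labels, ids)


def count_per_plot(centroids, plots):
    """Count centroids inside each plot box given as (row0, col0, row1, col1)."""
    return {
        name: sum(r0 <= r < r1 and c0 <= c < c1 for r, c in centroids)
        for name, (r0, c0, r1, c1) in plots.items()
    }


# Toy label image (hypothetical): two valid regions and one region that is too small.
labels = np.zeros((20, 40), dtype=int)
labels[2:8, 2:8] = 1        # 36 pixels
labels[10:16, 30:36] = 2    # 36 pixels
labels[0:2, 20:22] = 3      # 4 pixels, discarded
centroids = panicle_centroids(labels)
plots = {"plot_A": (0, 0, 20, 20), "plot_B": (0, 20, 20, 40)}
print(count_per_plot(centroids, plots))
```

With real plot boundaries stored as polygons, the box test would be replaced by a point-in-polygon test from a GIS library.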

Validating Detected Panicle Counts
We evaluated the performance of our panicle counting method both in terms of individual panicle detection and in terms of overall plot-level correlation by comparing estimated panicle counts with manually digitized panicle data. We randomly selected 60 plots and manually digitized panicle locations from the orthomosaic (Figure 4). In cases where multiple panicles touched, we digitized them as separate entities to test how effectively the developed procedure would separate them. To evaluate panicle detection accuracy, we used three accuracy measures to capture overall under- or over-detection of panicles: the omission error (OE), which captured the rate at which our method missed sorghum panicles; the commission (misclassification) error (CE), which showed the rate at which our method wrongly classified other objects as panicles; and the overall accuracy (OA), which showed the overall detection rate of our method with respect to a known number of panicles. We calculated the three accuracy measures as shown in Equation (3): $\mathrm{OE} = N_o / N_R$, $\mathrm{CE} = N_m / N_D$ and $\mathrm{OA} = N_t / N_R$, where $N_o$ is the number of panicles omitted, $N_m$ is the number of wrongly classified objects, $N_t$ is the total number of correct panicle detections, $N_R$ is the known total number of panicles and $N_D$ is the total number of locations detected as panicles.
To get the total number of correctly classified panicles, $N_t$, we matched detected panicle locations to measured x-y panicle locations. We considered a detected panicle correctly matched if it fell in a 5 cm buffer zone around a reference panicle, selecting the 5 cm radius based on average panicle width data [5]. Where multiple detected panicles fell in the buffer zone, we considered only the panicle nearest to the buffer center. The number of missed panicles, $N_o$, was defined as the number of reference panicles without any detected panicle in their buffer zones, while $N_m$ was determined as the difference between $N_D$ and $N_t$.
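The buffer-based matching and the three detection measures can be sketched in Python as follows. The greedy nearest-detection matching and the function names are illustrative assumptions, and the rates are assumed to take the forms OE = No/NR, CE = (ND − Nt)/ND and OA = Nt/NR, which agree with the figures reported for the 60 validation plots.

```python
import math


def count_matches(reference, detected, radius=0.05):
    """Count reference points that have an unused detection within radius (same units)."""
    used = [False] * len(detected)
    matched = 0
    for rx, ry in reference:
        best, best_dist = None, radius
        for j, (dx, dy) in enumerate(detected):
            dist = math.hypot(dx - rx, dy - ry)
            # Keep the nearest detection that is still free and inside the buffer.
            if not used[j] and dist <= best_dist:
                best, best_dist = j, dist
        if best is not None:
            used[best] = True
            matched += 1
    return matched


def detection_rates(n_reference, n_detected, n_matched, n_omitted):
    """Return omission error, commission error and overall accuracy as fractions."""
    return {
        "OE": n_omitted / n_reference,
        "CE": (n_detected - n_matched) / n_detected,
        "OA": n_matched / n_reference,
    }


# Toy coordinates in metres (hypothetical): only the first reference has a detection within 5 cm.
reference = [(0.00, 0.00), (1.00, 0.00)]
detected = [(0.01, 0.00), (0.03, 0.00), (2.00, 2.00)]
print("matched:", count_matches(reference, detected))

# Counts reported for the 60 validation plots.
rates = detection_rates(n_reference=3458, n_detected=3803, n_matched=3250, n_omitted=231)
print({name: round(100 * value, 1) for name, value in rates.items()})
```

Each detection is used at most once, so two reference panicles that share a single nearby detection yield one match and one omission.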
To capture the under- and over-estimation and the correlation of estimated panicle counts with respect to manually digitized panicle counts, we also calculated a mean bias, $\mathrm{bias}_{avg}$, and Spearman's correlation, ρ. The $\mathrm{bias}_{avg}$ for the selected plots was calculated as in Equation (4), $\mathrm{bias}_{avg} = \frac{1}{N} \sum_{i=1}^{N} (E_i - R_i)$, where $R_i$ is the i-th reference attribute measurement, $E_i$ is the i-th computed attribute measurement and N is the total number of measurements being compared. Spearman's correlation was calculated using Equation (5), where $\bar{E}$ and $\bar{R}$ represented the mean panicle counts for the computed and reference measurements.
Table 1 summarizes overall and class-specific accuracies for the semantic segmentation of UAS image data achieved by the trained deep learning model. The model achieved an overall accuracy of 95% with class-specific accuracies ranging from 94% to 98%. The segmentation performance in terms of intersection over union (IoU) ranged from 80% to 93%, showing a high agreement between the deep learning segmentation and the reference labeled data. Figure 5 shows a montage of a sample input image and its corresponding segmentation output, demonstrating a generally effective segmentation of the image into the three semantic classes. The model effectively segmented panicles of different colors and generated a generally clean output with a low presence of mixed pixels. One key advantage of the deep learning semantic segmentation approach over traditional object-based methods is being able to generate semantic information in a single processing step. With traditional object-based approaches, separate segmentation, classification and post-classification steps are usually involved [50,51].
A few aspects of our model training setup are worth emphasizing. Sample weight balancing is an important consideration for applications where semantic classes have different prevalence. Without it, less prevalent classes (panicle in our case) would not be adequately segmented. We used the inverse frequency weighting scheme; however, other weighting schemes, such as over-weighting a class of interest, are possible and might be of interest for further assessment [52]. While our training set was small, leveraging weights from the pretrained VGG-16 model improved the model fitting and reduced the training time. Model fitting also benefitted from data augmentation, which artificially increases the amount of data. Despite the improvements due to data augmentation, there is still a great need to develop extensive labeled datasets to enhance the generalization of fitted deep learning models.

Observed Sources of Error
Errors of commission among the three classes contributed to loss of performance of the semantic segmentation model. The model misclassified some dry foliage and textured ground surfaces as panicles. The segmentation was also poor around object edges, as illustrated in Figure 6b. The lack of crisp edges in semantic segmentation is a well-known problem in SegNet architectures due to the down-sampling involved in the encoder part of the network [23]. Errors in segmenting panicle objects could also be attributable to the dominance of other features in scenes [53], which made it difficult for the model to segment relatively smaller panicle objects. This, together with edge errors, contributed to lowering the IoU metrics, especially for the panicle class (80%). Nevertheless, an IoU score of 80% was high enough to overlap all panicle instances, which contributed to lower omissions of panicle objects. While we strove to digitize each semantic class as closely as possible, we sometimes missed or sub-optimally digitized subtle objects. Small objects presented the greatest challenge in the labeling process due to the difficulty in digitizing them and the image resolution limitation. The model was generally robust to such omissions and generated the expected output, as illustrated in Figure 6c. However, such missed objects, while correctly segmented, still contributed to lowering the IoU scores because the ground truth was considered perfect. Lastly, artifacts in the orthomosaic also contributed to wrong segmentations at the field level. Errors in 3D reconstruction and orthomosaic generation are well known in UAS image processing [54,55]. Estimation based on individual images could alleviate this limitation, as was demonstrated in [56] for estimating crop cover percent.

Panicle Counting Performance and Field-Level Panicle Mapping
Of the total of 3458 reference panicle locations digitized from the 60 plots, our method detected 3250, representing an overall accuracy of 94%. Our method missed 231 reference panicle locations, translating into an overall omission rate of 6.7%. We observed an overall misclassification error of 14.5% based on 3803 detections. Figure 7 shows panicles detected over the whole field with selected zoomed views to highlight a few aspects of the detection. Differences between manual and estimated panicle counts are attributable to errors due to the deep learning semantic segmentation, as we have already outlined. However, some errors were due to the watershed segmentation. The watershed transform step erroneously split some panicle instances, leading to over-counting, while some merged objects were not effectively separated, leading to under-counting. We observed a positive mean detection bias of 0.65 panicles, indicating a general over-estimation of panicle counts. Based on correlation, the reference plot-level panicle counts agreed with the estimates from our model, with a Spearman correlation of 0.88. The over-estimation resulted mainly from the erroneous splitting of panicle objects. Based on the two performance metrics, the estimated panicle count compared very well with the measured panicle count in each plot (Figure 8), which shows promise for the application of deep learning approaches to such a plant phenotyping task. The level of counting accuracy (94%) achieved in this study is an improvement over our previous study [5] (89.3%), which applied terrestrial lidar data and a density-based clustering approach. While different input data were used, we attribute the increased performance in this study to the deep learning model applied. With the development of deep learning models for point clouds [57,58], comparable performance should be achievable from point clouds.
Though not directly comparable, due to the use of different accuracy metrics, the performance achieved here is generally in line with previous studies that applied image data [34,37,38]. In Olsen et al. [34], a 98% counting accuracy and a median absolute error (MAE) between 1.88 and 2.66 were achieved for 18 sorghum varieties in the Midwestern United States. The accuracy was based on the area under the curve (AUC) of a receiver operating characteristic (ROC) curve, a metric that is usually correlated with overall accuracy. Similar performance was also reported in [37]. As highlighted before, this high accuracy came at the expense of significant feature engineering, which we circumvented in this study by applying a deep learning approach. Ghosal et al. [38] used 1269 hand-labeled images and achieved an AUC accuracy of 94% by applying an active learning-inspired, weakly supervised deep learning framework. Similar counting studies on other crops have also reported good performance. Xiong et al. [32], who segmented rice panicles using a segmentation approach similar to ours, achieved an overall accuracy of about 83%. In counting maize tassels, Lu et al. [31] achieved an accuracy of over 93%. The high accuracies achieved across these diverse studies are evidence of the robust performance of deep learning approaches.

Discussion
Plant breeding and agronomy both benefit from the availability of site-specific information to manage a variety of problems [59,60]. Site-specific information within fields has traditionally been gathered by using sensors such as yield monitors or by manual sampling [21,22]. Both approaches can demand significant financial and human resources. For example, while information on sorghum panicle number would be very useful in a breeding program, panicles are rarely counted because of the time and labor associated with data collection. Consequently, the methods developed herein offer a significant improvement in site-specific panicle information for sorghum crop improvement and agronomy. These data can be leveraged by plant breeders and agronomists to map site-specific panicle distribution and to estimate sorghum yield in lieu of harvesting to obtain direct grain yield estimates. This indirect yield estimate has significant application to sorghum breeding in early phases of hybrid evaluation where relative grain yields are most important and might be estimated with combine harvest [18,61]. For agronomy, indirect yield estimates can be useful for assessing yield potential prior to harvest, which has numerous applications in production agriculture. For full utilization, the capacity of the deep learning approach to segment panicles of different colors without additional processing is also advantageous in cases where multiple multi-color varieties are under study. While the resources required for developing labeled datasets may be high initially, the pay-off would be greater in ensuing years and would contribute to reducing the down time due to manual phenotyping.
[Figure: reference and estimated panicle counts per plot (axes: Plot No, Panicle count; series: Reference, Estimated).]
With UAS image data beyond the visible spectrum now routinely being collected in many breeding programs, there is potential to enhance the counting of sorghum panicles with data from non-visible wavelengths. Multispectral imagery, particularly data from the red-edge and near-infrared bands, has been applied to different aspects of plant phenotyping, including characterizing senescence patterns in sorghum breeding lines [62], assessing nitrogen and chlorophyll content [63] and assessing plant stress and disease [9,64,65]. The ability of these non-visible bands to model such subtle changes in plant condition should contribute to better discrimination of panicles from objects that are spectrally similar in the visible spectrum, such as ground and dried foliage. Thermal imagery also presents another interesting option, especially for characterizing canopy temperature [66][67][68]. Given the high saliency of sorghum panicles compared to other features in a field, significant temperature gradients may be expected, which should be amenable to object discrimination image analysis approaches such as the one presented in this study. Three-dimensional datasets such as digital surface models (DSMs) and point clouds are also promising for object detection and may enable the derivation of panicle attributes other than counts, e.g., panicle lengths and widths. In our previous study [5], we demonstrated the utility of terrestrial lidar data for sorghum panicle detection and characterization of individual panicle lengths and widths. A similar study [33] combined spectral and DSM data to estimate panicle counts and yield. With UAS-based lidar becoming more mainstream, there is a chance that such 3D data and methods will be increasingly used for panicle detection and similar applications.
While we have demonstrated the effectiveness of our developed approach, further assessments are needed to elucidate the impact of field conditions on panicle counting accuracy. Field conditions such as changes in illumination, crop growth and senescence are known to influence image quality and associated algorithm performance [4,8]. Thus, testing the effectiveness of deep learning approaches on image data collected at different times of the day or at different growth stages of the crop under study will be key in shedding light on best practices and in highlighting challenges and limitations in the application of deep learning methods. Also, this study did not address the impact of changes in image spatial resolution on panicle detection. However, previous studies on UAS imaging performance have shown that lower resolution data tend to reduce the accuracy of derived metrics, e.g., plant height and biomass [69][70][71]. Other studies have also shown that lower resolution images tend to lower the performance of deep learning models [72,73]. Koziarski and Cyganek [72] evaluated the robustness of deep learning models against distortions such as blurring and concluded that the models were susceptible to such distortions. A similar study [73] evaluated the impact of low resolution on the classification accuracy of several state-of-the-art models, including VGGNet [23], ResNet [24] and AlexNet [74], and also confirmed the negative impact of lower resolution on deep learning models. From the foregoing, we would recommend high resolution images, which should not only contribute to better accuracies but also allow adequate visualization of target features during ground truth labeling. Aerial imaging, as applied in this study, may also be limited in capturing significantly occluded panicles [75]. The value of applying both nadir and oblique imaging has been demonstrated in other UAS-based studies [76][77][78]. In our case, multi-angle images should provide a better view of panicles and would likely lead to better detection.
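One simple way to probe the resolution sensitivity discussed above would be to degrade the imagery synthetically and re-run the detection on the coarser data. The sketch below is an illustration only and was not part of the workflow used in this study. It coarsens a single-band image by averaging non-overlapping pixel blocks; the function name `block_average` and the toy tile are hypothetical, and block averaging is only a rough approximation of how a sensor would sample the scene at a greater flying height.

```python
def block_average(image, factor):
    """Downsample a 2D grid by averaging non-overlapping factor x factor blocks."""
    rows, cols = len(image), len(image[0])
    if rows % factor or cols % factor:
        raise ValueError("image dimensions must be multiples of factor")
    coarse = []
    for r in range(0, rows, factor):
        coarse_row = []
        for c in range(0, cols, factor):
            # Collect every pixel of the current block and replace it by its mean.
            block = [image[r + i][c + j]
                     for i in range(factor) for j in range(factor)]
            coarse_row.append(sum(block) / len(block))
        coarse.append(coarse_row)
    return coarse


# Hypothetical 4 x 4 single-band tile, coarsened by a factor of 2.
tile = [
    [0, 0, 4, 4],
    [0, 0, 4, 4],
    [8, 8, 2, 2],
    [8, 8, 2, 6],
]
print(block_average(tile, 2))
```

Repeating the detection on tiles coarsened by increasing factors would give a first indication of the resolution at which panicle detection starts to degrade.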
The research presented here could benefit from newer or alternative deep learning approaches. Notable examples are instance segmentation methods, which are capable of identifying individual object instances within an image [79]. Instance segmentation usually combines an object detection step, which locates the individual object instances, and a semantic segmentation step that refines the object boundaries. With such a capability, we could potentially eliminate the post-processing steps applied to split merged panicles. However, instance segmentation is a more challenging problem than the semantic segmentation applied here [79]. Its performance and benefits for sorghum panicle phenotyping need to be assessed in future studies. Another potential deep learning approach is density counting. Density counting enables direct counting by estimating the density of objects in the image without performing segmentation or object detection [80][81][82]. Such an approach would be useful in providing plot-level panicle counts in cases where specific location data are not needed. Previous phenotyping studies have applied this approach to count maize tassels [31] and plant leaves [83], and we expect that it can be applied to sorghum panicles too. Lastly, deep learning has, in part, been driven by the availability of large labeled datasets [6]. Continuous and extensive development of labeled datasets will be key if deep learning is to serve its intended goals in plant phenotyping. Publishing experimental data has major advantages for the research community and encourages inter-disciplinary collaborations among remote sensing scientists, plant breeders and computer scientists [84]. A few public phenotyping datasets are now available through platforms such as CyVerse, where we have published other datasets [85], but more are needed to account for the various areas of research. It is our goal to publish the labeled datasets generated for this study once all the necessary metadata have been documented.
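As a minimal illustration of the density counting idea discussed above (not a method applied in this study), the sketch below builds the kind of density map that such models are typically trained to regress. Each annotated object is spread into a Gaussian kernel normalized to sum to one, so the sum over the whole map gives the object count. The function names, the kernel width `sigma` and the point coordinates are hypothetical, and the regression model itself is not shown.

```python
import math


def density_map(points, height, width, sigma=1.5):
    """Build a density map whose values sum to the number of annotated points."""
    grid = [[0.0] * width for _ in range(height)]
    for pr, pc in points:
        # Unnormalized Gaussian weight of every pixel around this annotation.
        weights = [
            [math.exp(-((r - pr) ** 2 + (c - pc) ** 2) / (2 * sigma ** 2))
             for c in range(width)]
            for r in range(height)
        ]
        total = sum(map(sum, weights))
        # Normalize within the image so that each object contributes exactly 1.
        for r in range(height):
            for c in range(width):
                grid[r][c] += weights[r][c] / total
    return grid


def count_from_density(grid):
    """Integrate (sum) the density map to obtain the object count."""
    return sum(map(sum, grid))


# Hypothetical panicle annotations (row, col) inside a 20 x 20 pixel plot tile.
annotations = [(4, 5), (10, 12), (15, 3)]
density = density_map(annotations, height=20, width=20)
print(round(count_from_density(density), 6))
```

Because the count is recovered by summation, overlapping kernels from adjacent or touching objects still add up correctly, which is why this representation does not require objects to be separated first.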

Conclusion
Deep learning has revolutionized image analysis in diverse fields, and the high accuracies achieved in this study show great promise for plant phenotyping tasks. By applying sample weight balancing, data augmentation and some aspects of transfer learning, we effectively trained a semantic segmentation model that facilitated the estimation of sorghum panicle counts. Going by the high panicle detection accuracy, the impact of merged panicles was largely overcome by our post-processing analyses. Better accuracies can be expected with a larger labeled dataset and extensive hyperparameter tuning, enabling more robust field-based characterization of sorghum panicle counts. By applying such robust modeling, together with high-throughput platforms such as UAS and robots, significant gains can be expected in sorghum crop improvement and agronomy.