Advancing Tassel Detection and Counting: Annotation and Algorithms

Tassel counts provide valuable information related to flowering and yield prediction in maize, but are expensive and time-consuming to acquire via traditional manual approaches. High-resolution RGB imagery acquired by unmanned aerial vehicles (UAVs), coupled with advanced machine learning approaches, including deep learning (DL), provides a new capability for monitoring flowering. In this article, three state-of-the-art DL techniques, CenterNet based on point annotation, and task-aware spatial disentanglement (TSD) and detecting objects with recursive feature pyramids and switchable atrous convolution (DetectoRS) based on bounding box annotation, are modified to improve their performance for this application and evaluated for tassel detection relative to TasselNetv2+. The dataset for the experiments comprises RGB images of maize tassels from plant breeding experiments, which vary in size, complexity, and overlap. Results show that the point annotations are more accurate and simpler to acquire than the bounding boxes, and that bounding box-based approaches are more sensitive to the size of the bounding boxes and to the background than point-based approaches. Overall, CenterNet has high accuracy in comparison to the other techniques, but DetectoRS can better detect early-stage tassels. The results for these experiments were more robust than those of TasselNetv2+, which is sensitive to the number of tassels in the image.


Introduction
Maize is a major crop for food consumption and a source of material for a wide range of products. Increasing maize yield is important, especially under the pressure of global climate change, which is often associated with increased temperatures and extreme droughts [1,2]. Plant breeders focused on developing improved varieties of crops seek to understand the joint impact of genetics, environment, and management practices on yield.
A key component of their programs involves measuring various physical, chemical, and biological attributes of the plants, referred to as phenotypes, throughout the growing season. For maize and many other grain crops, flowering is one of the most important stages, as it initiates reproduction. Any external stress, physical or biological, can cause plant damage and result in production losses. Traditional approaches for field-based monitoring of tasseling are manual and thus time-consuming, labor-intensive, expensive, and potentially error-prone, especially in large fields. Alternatively, image-based techniques that automatically detect and count the tassels to predict the flowering date can mitigate these problems. However, varying illumination and shape, shadows, occlusions, and complex backgrounds impact the accuracy of these approaches [3,4].
Recently, remote sensing (RS) imagery acquired by UAVs has been investigated for counting objects such as plants, because high temporal and spatial resolution data can be acquired over large fields [5]. In this study, data acquired by an RGB camera mounted on a UAV are investigated for tassel detection and counting. CenterNet, an anchor-free detector, uses the center point of each tassel for detection, which is simpler to implement than bounding box annotation; the size and dimensions of the objects are calculated directly, without any prior anchor. The CenterNet Hourglass network was modified and implemented with few-shot learning for this study. Two anchor-based, multi-stage detectors based on bounding box annotations [25,27], which obtained high accuracy on the COCO dataset, were also modified and implemented for tassel detection. Specifically, the loss function for TSD was modified during the training process, and the classification and regression problems were considered separately, which increased the tassel detection accuracy [25]. In DetectoRS, the existing feature pyramid networks (FPN) were modified, and extra feedback connections were added to the backbone. The convolution layers of ResNeXt were also replaced with atrous convolutions and deformable convolutional networks (DCNs) [27].
The remainder of this paper is organized as follows. In Section 2, the study area and details of point and box annotations are described. The three state-of-the-art DL-based algorithms are also introduced, and the specific modifications are described for improved tassel detection using multiple evaluation metrics. Experimental results for the multiple aspects of the study are presented and discussed in Section 3. Section 4 provides a critical evaluation of the approaches and discusses directions for future research.

Field Experiment and Image Acquisition
The experiment was carried out at the Agronomy Center for Research and Education (ACRE) of Purdue University (40°28′43″N, 86°59′23″W; 4540 US-52, West Lafayette, IN, USA) during the 2020 growing season (see Figure 1). The field experiment was planted in a modified randomized complete block design (RCBD), using varieties of maize from the Genomes to Fields (G2F) initiative for High-Intensity Phenotyping Sites (G2F-HIPS). Two replications of 22 hybrid entries and 22 inbred entries (G2F-HIPS) were planted on 12 May in a two-row segment plot layout with a plant population of 30,000 plants per acre.
A total of 88 plots were imaged on 20 July using a Sony Alpha 7R-III RGB camera mounted on a UAV DJI Matrice M600 Pro, with a Trimble APX-15v3 GNSS/INS unit for direct georeferencing. The UAV was flown at 20 m altitude, and the RGB imagery was processed to a 0.25 cm pixel resolution orthophoto using the method of [37] to eliminate inaccuracies due to lens distortion and double mapping associated with significant height changes over short distances. In-field manual phenotyping data were recorded throughout the season as stipulated by the G2F-HIPS standard operating procedures [38], including stand count and anthesis date. During the flowering period, visual inspections were performed to determine the anthesis date of each plot. Hybrids and inbreds in this experiment had different anthesis dates, with a range of 20 days from the first variety to flower to the last. This provided an opportunity to evaluate the counting algorithms over a range of flowering times in the same field.

Data Annotation
A rigorous tassel annotation was performed for this study using the open source annotation tool LabelMe [39]. First, the tassels are annotated in the orthophoto with points at the center of the tassel (see Figure 2a), designating the position of the tassel and the number of tassels per row within the plot. Multiple reviews are needed, as mistakes are common, even for experienced labelers. As different algorithms require specific forms of annotation as inputs, a second annotation dataset was developed from the point annotations, where bounding boxes of 20 × 20 pixels were generated from the previously annotated points (see Figure 2b). The third annotation dataset (see Figure 2c) was developed using bounding boxes where the size of the boxes was manually adjusted to the size of the tassel, avoiding excessive overlap with the neighboring boxes. The impact of the annotation approaches is evaluated in the Results section.
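The conversion from a point annotation to the fixed-size 20 × 20 boxes of the second dataset can be sketched as follows (a hypothetical helper; the actual LabelMe output format may differ):

```python
def point_to_box(cx, cy, size=20):
    """Convert a center-point annotation to a square bounding box.

    Returns (xmin, ymin, xmax, ymax) for a box of `size` pixels
    centered on the annotated tassel center.
    """
    half = size // 2
    return (cx - half, cy - half, cx + half, cy + half)
```

Boxes produced this way are centered on the annotated point but, as discussed later, often fail to enclose the full tassel, which motivated the third, manually adjusted dataset.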

CenterNet
CenterNet, a state-of-the-art anchor-free detector [36] that was recently demonstrated to be effective for plant counting, was investigated for the tassel counting problem because of the simplicity of its annotation and its limited computational requirements [5]. For tassel detection and counting, the point annotation dataset described in the previous section was employed for training. For localization, CenterNet uses a Gaussian kernel and a fully convolutional network (FCN) to create a heatmap, which is used to estimate the tassel centers. CenterNet does not require any post-processing, which reduces computational complexity. For this study, the CenterNet-based approach with an Hourglass-104 architecture was implemented to determine the locations of the tassel centers (see Figure 3 for more details). Hyperparameters such as the learning rate, number of epochs, and batch size were optimized by grid search for tassel counting.
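The Gaussian heatmap target that CenterNet regresses can be illustrated with a minimal pure-Python sketch, assuming one Gaussian peak per annotated tassel center (the actual implementation works on downsampled feature maps and uses an object-size-dependent sigma):

```python
import math

def gaussian_heatmap(height, width, centers, sigma=2.0):
    """Render a ground-truth heatmap: one 2-D Gaussian per tassel center.

    Where Gaussians from nearby centers overlap, the elementwise maximum
    is kept so each peak remains at 1.0.
    """
    hm = [[0.0] * width for _ in range(height)]
    for cx, cy in centers:
        for y in range(height):
            for x in range(width):
                g = math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
                hm[y][x] = max(hm[y][x], g)
    return hm
```

At inference time, local maxima of the predicted heatmap above a threshold are taken as tassel center locations.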

TSD
In TSD [25], the spatial misalignment between the classification and regression functions in the sibling head can decrease object detection accuracy. These functions are decoupled by creating two spatially disentangled proposals. As shown in Figure 4, the original images are input to the backbone, and a regional proposal P is then generated by the region proposal network (RPN). In the next step, two separate proposals, P_c and P_r, are estimated for classification and regression, respectively. Finally, the object is detected, and the box coordinates are regressed.
Three modifications of TSD were implemented for tassel detection, as the shape and size of the tassels are highly variable from one variety to another: (1) Cascade R-CNN, a multi-stage extension of the two-stage R-CNN, was used instead of Faster R-CNN. The architecture includes a sequence of detectors, where the output of one stage is used to train the next. The intersection over union (IoU) threshold is gradually increased from stage to stage without overfitting, which reduces the number of false positives. At inference, the consecutive detectors can also significantly increase the detection accuracy [40]. (2) DCNs were added to the backbone, replacing standard CNNs, which have fixed geometric structures and are not appropriate for tassel detection, particularly given the limited amount of training data. To detect rotated or scaled objects, the training images were augmented with different transforms to achieve reasonable detection accuracy [41]. (3) To address the variation in the size of the tassels, a multiscale test was also implemented.

DetectoRS
DetectoRS obtained the highest accuracy on the COCO dataset in 2020 [27]. It uses the "looking and thinking twice" idea, following the architecture of Cascade R-CNN, and implements subsequent detectors. DetectoRS includes two main components: recursive feature pyramids (RFP) and switchable atrous convolution (SAC). The existing feature pyramid networks (FPN) were modified in DetectoRS by adding feedback connections from the FPN layers into the bottom-up backbone layers (see Figure 5), which helps extract stronger features. In this study, the RFP and SAC were modified for tassel detection because of the tassels' variation in size and geometric complexity; the details are explained in the following.

In the original version of DetectoRS, two sequential RFPs are used. For tassel detection, because of the variety in the shape and size of tassels, three RFPs are considered (see Figure 6). Features at each stage are recursively extracted as

f_i^t = F_i^t(f_{i+1}^t, x_i^t),   x_i^t = B_i^t(x_{i-1}^t, R_i^t(f_i^{t-1})),

where t is the iteration number of the RFP, i is the decomposition level at each RFP, B_i^t is the i-th bottom-up backbone stage, F_i^t is the i-th top-down FPN operation, and R_i^t is the feature transformation applied to the feedback connection.

In the original DetectoRS, the ResNet architecture is modified. ResNet has four similar stages; DetectoRS changes only the first layer, replacing it with a convolution layer (kernel size one). This is called atrous spatial pyramid pooling (ASPP). Atrous convolutions add zeros to the original kernel and increase its effective size, but the computational complexity is not increased [42,43]. The initial weight of this layer is set to zero, so pre-trained weights from the ImageNet or COCO datasets can be used. For tassel detection, the ResNeXt architecture is used and modified. In the ResNeXt architecture, unlike ResNet, the neurons of one path are not connected to the neurons of other paths. It also uses a bottleneck design at each convolution path, reducing the computational complexity. ResNeXt has higher object detection accuracy than the ResNet backbone on the ImageNet data [44].
Additionally, DCNs were used instead of CNNs in the tassel ASPP (see Figure 7) because of the variation in the tassels' shapes and geometry. In Figure 5, SAC with the atrous rates 1 (red) and 2 (green) is shown; the same object at different scales can be detected using different atrous rates. Therefore, considering different atrous rates can increase the detection accuracy for tassels, which have different sizes and shapes. For tassel detection, a DCN with kernel size (1 × 1) is used (see Figure 8). The tassel SAC is calculated as

y = S(x) · Conv(x, w, 1) + (1 − S(x)) · Conv(x, w + ∆w, r),

where r is a hyperparameter (for tassel detection, the optimal value is 5), ∆w is a trainable parameter, and Conv(x, w, r) denotes convolution of x with weights w at atrous rate r. The switchable function S(·) includes average pooling with kernel size (5 × 5) followed by a convolution layer with size (1 × 1). Based on the ideas in SENet [44], a DCN and global average pooling were added before and after the SAC to increase the tassel detection accuracy. The multiscale test was also considered because of the variation in the size and shape of the tassels.
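As a rough illustration of the switchable mechanism, the following 1-D toy sketch blends a rate-1 branch with a rate-r branch whose weights are w + ∆w. All names are illustrative: the actual SAC layers are 2-D, and the switch S(x) is learned rather than supplied:

```python
def dilated_conv1d(x, w, rate):
    """1-D convolution with dilation `rate` (zero padding at the borders)."""
    k = len(w)
    out = []
    for i in range(len(x)):
        s = 0.0
        for j in range(k):
            idx = i + (j - k // 2) * rate  # dilated tap position
            if 0 <= idx < len(x):
                s += w[j] * x[idx]
        out.append(s)
    return out

def sac1d(x, w, dw, rate, switch):
    """Toy switchable atrous convolution: blend rate-1 and rate-r branches.

    `switch` plays the role of S(x); the rate-r branch uses weights
    w + dw, where dw is the trainable weight difference.
    """
    a = dilated_conv1d(x, w, 1)
    wb = [wi + di for wi, di in zip(w, dw)]
    b = dilated_conv1d(x, wb, rate)
    return [s * ai + (1 - s) * bi for s, ai, bi in zip(switch, a, b)]
```

With the switch at 1 the output follows the small-receptive-field branch; with the switch at 0 it follows the large-rate branch, which is what lets one layer adapt to tassels of different sizes.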

TasselNetv2+
For maize tassel counting, the L1 loss function used in the regression problem is calculated as

L_1 = (1/m) Σ_{i=1}^{m} |a_i|,

where m is the number of training images and a_i is the residual that measures the difference between the regressed count and the ground truth count for the i-th image [19]. The optimization aims to minimize L_1 using the Adam technique. The optimal regressed count is the estimated number of tassels in the image, which is usually not an integer.
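This count-regression loss can be sketched in a few lines (a hypothetical helper; TasselNetv2+ actually regresses local patch counts before aggregating them into an image-level count):

```python
def l1_count_loss(pred_counts, true_counts):
    """Mean absolute residual between regressed and ground-truth tassel counts."""
    m = len(pred_counts)
    return sum(abs(p - t) for p, t in zip(pred_counts, true_counts)) / m
```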

Parameter Settings
The optimal parameter settings for the four algorithms are determined experimentally and are shown in Table 1.

Model Evaluation
Two types of metrics for detection and regression-based approaches were implemented and evaluated.

Detection Metrics
The performance of the tassels detected using the bounding box-based approaches (DetectoRS and TSD) is evaluated by the following metrics. Intersection over union (IoU), a widely used metric, measures the overlap between a predicted bounding box B and the ground truth bounding box B_gt:

IoU = area(B ∩ B_gt) / area(B ∪ B_gt).

If B and B_gt match exactly, the value of IoU is one. For tassel detection, the acceptance threshold is set to 0.5.
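The IoU criterion can be written as a small helper, with boxes given as (xmin, ymin, xmax, ymax) tuples:

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    ix = max(0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0
```

A predicted box is accepted as a detection when `iou(pred, gt) >= 0.5`.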
Detections with an IoU value higher than 0.5 are selected as detected objects and compared to the ground reference. Correctly and incorrectly detected tassels are counted as TP and FP, respectively, and missing tassels as FN. The precision (Pr), recall (Re), and score (Sc) values are calculated as

Pr = TP / (TP + FP),   Re = TP / (TP + FN),   Sc = 2 · Pr · Re / (Pr + Re).

The accuracy of the number of detected tassels N_t relative to the ground truth N_g is defined as

Accuracy = 1 − |N_g − N_t| / N_g.

For the point-annotation-based CenterNet, the value r is the maximum distance between the ground truth and the predicted tassel center location for a detection to be counted as correct rather than missing (assumed to be 10 cm). The TP, FP, and FN values are calculated based on the criteria introduced in [5] for plant counting.
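The precision, recall, and score computations can be sketched directly from the standard formulas (the score Sc is the harmonic mean of precision and recall):

```python
def detection_metrics(tp, fp, fn):
    """Precision, recall, and score (F1) from TP/FP/FN tassel counts."""
    pr = tp / (tp + fp) if tp + fp else 0.0
    re = tp / (tp + fn) if tp + fn else 0.0
    sc = 2 * pr * re / (pr + re) if pr + re else 0.0
    return pr, re, sc
```

Using the Figure 10c counts (TP = 55, FP = 0, FN = 6) as an example, precision is 1.0 and recall is 55/61.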

Counting Metrics
The mean absolute error (MAE) and root mean square error (RMSE) indicate the difference between N_t and N_g and are obtained as

MAE = (1/n) Σ_{i=1}^{n} |N_t,i − N_g,i|,   RMSE = sqrt((1/n) Σ_{i=1}^{n} (N_t,i − N_g,i)²),

where n is the number of test images.
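A minimal sketch of these counting metrics over a set of test images:

```python
import math

def counting_errors(pred, truth):
    """MAE and RMSE between predicted (N_t) and ground-truth (N_g) counts."""
    n = len(pred)
    diffs = [p - t for p, t in zip(pred, truth)]
    mae = sum(abs(d) for d in diffs) / n
    rmse = math.sqrt(sum(d * d for d in diffs) / n)
    return mae, rmse
```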

Results
The tassel detection algorithms were implemented on a machine with seven cores, one GPU (GTX 1080ti, 11 GB RAM), and 128 GB external RAM.

Comparison of Original and Developed Anchor and Anchor-Free Based Approaches for Tassel Detection
As mentioned previously, the original versions of the anchor-free CenterNet and the anchor-based approaches (TSD and DetectoRS) were introduced for object detection on the COCO dataset. Because of the variation in size and shape of the tassels, their similarity to the leaves, and their overlap, the modifications described in Section 2.3 were implemented for the tassel dataset. In [40], it was reported that Cascade R-CNN had a higher object detection accuracy than Faster R-CNN and Mask R-CNN on the COCO dataset (by 8% and 6%, respectively). In tassel detection, Cascade R-CNN improved the average detection accuracy by ∼2.3% and ∼2.7% for TSD and DetectoRS, respectively, in comparison to Faster R-CNN. Mask R-CNN requires a mask annotation around the tassels and was not implemented because of the complexity of the tassel shapes and their high overlap. The results in Figure 9 show that applying the modifications increased the mean tassel detection accuracy and reduced the corresponding standard deviation.

Sensitivity Analysis to Bounding Box Sizes
The bounding box-based approaches are sensitive to the size of the bounding boxes and their overlap for mature tassels. Figure 10 shows three different sizes of bounding boxes used in the TSD algorithm. In Figure 10a, small bounding boxes were drawn around the tassel centers originally obtained from the point annotation. These bounding boxes did not fully enclose the tassels and frequently missed parts of their shape; therefore, the detection accuracy was reduced. Figure 10b depicts bounding boxes that fully encompass each tassel. In this image, and for most of the row segments where the tassels have fully flowered, the bounding boxes have highly overlapping areas. In Figure 10c, the bounding boxes were selected to be as large as possible to include the tassel area while attempting to reduce the overlapping areas. This sensitivity to the size of the bounding box during annotation reduces the practicality of this form of annotation. The number of tassels in the ground truth is 61, and TSD detected 44 (TP = 43, FP = 1, FN = 18), 46 (TP = 45, FP = 1, FN = 16), and 55 (TP = 55, FP = 0, FN = 6) tassels in Figure 10a-c, respectively. This illustrates how the detection accuracy for highly overlapped objects depends on the size, and therefore the degree of overlap, of the bounding boxes.

Sensitivity to Tassel Density and Heterogeneity
As noted in Section 2, the dataset for tassel detection and counting was collected during the 2020 growing season and consists of 88 plots with two replications of inbred and hybrid varieties. In Figure 11, the image of the west side of the field shows clear differences between the inbred panel planted in the north and the hybrids planted in the south side of the field. The canopy of the inbreds was less dense, and the tassels' shapes, colors, structures, and stages of maturity differed from those of the hybrids, as seen in Figure 12. The dataset provided an opportunity to evaluate the performance of the algorithms over a field with diverse tassel characteristics.

Training and Testing Information
As described in Section 2, the original orthophoto was divided into 15 subsets (S1-S15) with a size of ∼(3600 × 2100) pixels. Two-row segments of plots on the west side of the field were used for testing, while the east side of the image was used for training. Plots S1-S7 correspond to hybrid entries and S8-S15 to inbreds. Therefore, data from both the hybrid and inbred varieties were included in both training and testing.
The number of training and testing images is shown in Table 2. The sizes of the training and test datasets differ for the three algorithms, primarily due to their input requirements. If the training images are too similar to the testing images, the accuracy of tassel detection is artificially inflated. TSD and DetectoRS can use different image sizes for training, unlike the CenterNet architecture, which requires (512 × 512) inputs. An image size of ∼(600 × 2100) was used for training. The combined size of the training region is ∼(3000 × 2100); if this region is divided into (600 × 2100) tiles, only five training images can be extracted from each subset (75 in total), which is not adequate for training. A random crop with overlap was therefore applied to the training region to extract 105 images, which were then randomly divided 90%/10% into training and validation sets. As mentioned, the CenterNet architecture requires (512 × 512) inputs; if the image coverage were selected similar to TSD and DetectoRS (2000 × 600), it would need to be resized to (512 × 512), and because of the complexity of the tassel shapes, CenterNet could not train well. To mitigate this, a few-shot learning strategy similar to the one used for plant counting in [5] was applied to reduce the number of required training images. Finally, for CenterNet, the total numbers of training and validation images were 350 and 30, respectively. For a fair comparison, the size (512 × 512) was also considered for the training images of TSD and DetectoRS. However, this was so small that some of the annotated bounding boxes were not completely contained in the training images and were excluded during training (especially tassels close to the image boundary). Because CenterNet uses point annotation, the number of tassels close to the boundary that were missed during training was much lower.
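The overlapping random-crop tiling described above can be sketched as follows (a hypothetical helper; the actual pipeline also crops the matching annotations and filters boxes that fall outside the window):

```python
import random

def random_crops(region_w, region_h, crop_w, crop_h, n_crops, seed=0):
    """Sample n overlapping crop windows from a training region.

    Returns (x, y) top-left corners; windows may overlap, which allows
    more training tiles to be extracted than a disjoint grid would.
    """
    rng = random.Random(seed)
    return [(rng.randint(0, region_w - crop_w), rng.randint(0, region_h - crop_h))
            for _ in range(n_crops)]
```

With a ∼(3000 × 2100) region and (600 × 2100) crops, a disjoint grid yields only five tiles, while overlapping sampling can supply the 105 used here.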
In the end, ∼91 million pixels were selected for CenterNet, and ∼122 million pixels were chosen for training TSD and DetectoRS, the training time being approximately equivalent for the three methods (see Table 2).

Comparison for Different Annotation Techniques
The detected tassels and ground reference of five subsets are depicted in Figures 13-17 as examples. Figures 13 and 14 are examples of individual results for hybrid entries with fully emerged tassels (after the flowering date). Figures 15 and 16 depict inbred varieties with fully developed tassels (before the flowering date). For these subsets, CenterNet had the overall best performance, with scores of 99.15%, 94.73%, 95.15%, and 93.97%, respectively. Figure 17 is included to demonstrate the detection in plots where tassels had not yet emerged (before the flowering date); the ground truth indicated two tassels.

In Subset 11 (Figure 17), DetectoRS and TSD could only find one tassel, and CenterNet and TSD had false positives. The worst results were from TasselNetv2+, which incorrectly counted 29 tassels. The performances of TasselNetv2+, CenterNet, DetectoRS, and TSD are shown in Tables 3-5. The results in these tables indicate that the tassels are detected more accurately by the detection approaches (CenterNet, TSD, and DetectoRS) than by the regression approach (TasselNetv2+). As previously mentioned, inbred varieties had fewer tassels than the hybrids at the time of data acquisition, and the performance of each detector is affected by the density of the tassels. Table 3 indicates that among the three methods, TSD had the lowest performance, with a score of 89.90 and a standard deviation of 14.59. The counting results in Table 4 show that TasselNetv2+ had the highest MAE and RMSE values; its performance is also significantly worse when the number of tassels in the image is low. Table 5 shows the scores for each testing plot using CenterNet, DetectoRS, and TSD. In most subsets, CenterNet obtained the best score. The details of tassel counting (TP, FP, and FN) for each of the fifteen subsets are shown in Figure 18. The number of manually counted tassels (N_g) was always somewhat larger than the number detected by the algorithms, although the difference between TP and N_g was not statistically significant for any algorithm. CenterNet had the maximum average value of TP, TSD had the largest average value of FP, and DetectoRS had the largest average FN value.
As shown in Figure 19, the mean precision value of DetectoRS is higher than those of CenterNet and TSD. Investigating further, Table 5 shows that for S13, N_g = 0; therefore, Figure 19 does not contain any information for this subset. The standard deviation of the precision of DetectoRS is also small, from which we can infer that DetectoRS detects actual tassels well. However, its recall value is not as high as that of CenterNet. The accuracy of TSD is lower than that of CenterNet, and it detects more false positives (FP) as incorrect tassels. Furthermore, creating the bounding boxes around the tassels for training DetectoRS and TSD is time-consuming, and these algorithms are sensitive to the size of the bounding boxes. The linear regressions between the manual counts and the predicted number of tassels using the three detection-based methods (CenterNet, DetectoRS, and TSD) and one regression-based method (TasselNetv2+) are shown in Figure 20. The TSD and DetectoRS techniques provide slightly higher fidelity counts than CenterNet and TasselNetv2+ (see Figure 20). However, CenterNet has the highest mean value of TP and the lowest FP and FN values. Because of the date of the imagery, most plots were in the mid-to-late stages of flowering and had a large number of tassels. TasselNetv2+ could only provide good results when the number of tassels in the subset was very high; its overcounting compared to the ground reference data in plots with few tassels is clearly visible.

Conclusions
A key goal of this study was to investigate the value of detection-based approaches compared to regression counting methods in the complex scenario of in-field tassel counting. In this article, three state-of-the-art object detection algorithms, CenterNet, DetectoRS, and TSD, were modified for tassel detection and compared to counts obtained by TasselNetv2+, as well as image-based ground reference counts. All three algorithms had good overall performance in terms of the number of true positives compared to the ground reference. CenterNet achieved the highest recall value and score, DetectoRS obtained the highest precision, and TSD had the lowest score. The performance of TasselNetv2+, which only provides counting information, is highly dependent on the number of tassels.
A specific annotation is used for each of the detection-based algorithms. Two types of label annotations, "point" and "bounding box", were investigated. DetectoRS and TSD require bounding box annotations, and their results are sensitive to the size of the bounding boxes. CenterNet works with point annotation, which is simpler, faster to collect, and more accurate than bounding boxes. As future work, the next steps will explicitly consider multiple dates during the flowering process, where there is greater diversity in the number and characteristics of the developing tassels. The strategy will investigate multitemporal analysis and attention-based networks to predict flowering (anthesis) dates. The impact of temperature will also be investigated via the incorporation of growing degree days (GDD).