A New Dataset and Comparative Study for Aphid Cluster Detection and Segmentation in Sorghum Fields

Aphid infestations are one of the primary causes of extensive damage to wheat and sorghum fields and are one of the most common vectors for plant viruses, resulting in significant agricultural yield losses. To address this problem, farmers often employ the inefficient use of harmful chemical pesticides that have negative health and environmental impacts. As a result, a large amount of pesticide is wasted on areas without significant pest infestation. This brings to attention the urgent need for an intelligent autonomous system that can locate and spray sufficiently large infestations selectively within the complex crop canopies. We have developed a large multi-scale dataset for aphid cluster detection and segmentation, collected from actual sorghum fields and meticulously annotated to include clusters of aphids. Our dataset comprises a total of 54,742 image patches, showcasing a variety of viewpoints, diverse lighting conditions, and multiple scales, highlighting its effectiveness for real-world applications. In this study, we trained and evaluated four real-time semantic segmentation models and three object detection models specifically for aphid cluster segmentation and detection. Considering the balance between accuracy and efficiency, Fast-SCNN delivered the most effective segmentation results, achieving 80.46% mean precision, 81.21% mean recall, and 91.66 frames per second (FPS). For object detection, RT-DETR exhibited the best overall performance with a 61.63% mean average precision (mAP), 92.6% mean recall, and 72.55 on an NVIDIA V100 GPU. Our experiments further indicate that aphid cluster segmentation is more suitable for assessing aphid infestations than using detection models.


Introduction
With a growing global population, the increasing demand for food has led to widespread innovations in the agricultural industry.Approximately 37% of the crops grown worldwide are lost to pest damage, and 13% of this loss is directly attributed to insects [1].The impact is particularly pronounced in staple crops such as rice, wheat, sorghum, and maize, not only posing significant threats to global food security but also impacting various national economies [2], Consequently, there has been a consistent rise in the use of chemical pesticides to combat these pests and maximize yields.The projected expenditure on pesticides reached around USD 107 billion in 2023 and this figure is expected to further increase.The heightened usage not only translates to increased costs for farmers but also underscores the numerous health risks associated with pesticide use.The conventional pest treatments involve applying chemical pesticides uniformly across the entire field using large sprayers in a continuous spray once a specific infestation threshold is reached.However, this treatment approach becomes excessive when pests are only fractionally present.Therefore, a more efficient method is sought after.
In order to provide timely aphid control and reduce the negative impacts of excessive pesticide use through precision spraying, it is desirable to develop an intelligent robot that can scout the field periodically, detect and localize any aphid infections, and apply precise spray once aphids are detected using the on-board spray system, as shown in Figure 1.This system will provide a more efficient, effective, and economical solution for pest management.Through data collection and scouting techniques, farmers are able to better plan their crop planting and treatment application to achieve higher yields and save chemical costs by spraying more efficiently.Automated robotic technologies will significantly reduce the labour costs and increase the yield by noticeable margins of up to 5% [3].An intelligent scouting robot with onboard vision and spray system.The vision system is used to detect and localize aphid affections, and the spray system can apply precise pesticide application to the infected areas.
The implementation of an automated pest control system holds the potential to significantly enhance farmers' ability to pinpoint areas in need of treatment with precision, thus reducing the overall pesticide usage and fostering sustainable agricultural methods.Despite the extensive research dedicated to automated pest detection, it remains a formidable challenge, primarily due to the small size of individual pests and their adeptness at blending into their natural environment.Previous work has been conducted on the detection and segmentation of aphids [4][5][6][7]; however, these works were explored in the context of individual pest detection rather than assessing the overall infestation levels.The datasets are also usually catered to the specific pests and the crops they affect.We explore the feasibility of accurately detecting and segmenting these pests using lightweight models capable of real-time inference speeds.
The primary focus of this paper is on the topic of automated aphid cluster localization through object detection and semantic segmentation tasks.Our multi-scale dataset created using high-resolution images from a sorghum field enables learning to detect aphid clusters at different scales and in robust conditions.We trained popular real-time object detection and semantic segmentation models to evaluate their performance against each other and for the overall application of aphid infestation control.Through a comprehensive analysis, considering the prediction precision, inference speed, and information parameters for automated pest localization, we identify RT-DETR and Fast-SCNN as the top performers for aphid detection and semantic segmentation tasks, respectively.Furthermore, our findings suggest that semantic segmentation proves more advantageous for this application due to the additional spatial information it offers.
The main contributions of this study are summarized below.

•
We created an extensive multi-scale dataset for aphid cluster detection and semantic segmentation, utilizing real-world images captured in sorghum fields.This dataset encompasses various viewpoints, diverse lighting conditions, and multiple scales, thereby providing enhanced representations of aphid features.By leveraging this dataset, we significantly improve the performance of the learning models in detecting and segmenting aphid clusters.

•
We conducted a benchmark evaluation to assess the performance of four state-of-theart real-time semantic segmentation models and three detection models using our generated dataset.The outcomes of this evaluation offer a practical reference point for aphid detection and segmentation in real-world scenarios.
Partial results of this paper were previously presented at the CVPR 2023 Workshops [8].This paper is a significant extension of the workshop paper.In addition to segmentation, the paper incorporates the task of aphid cluster detection with three additional real-time detection models, a comparison with the previous fixed-scale dataset, as well as expanded details regarding the segmentation models.The generated dataset can be downloaded at https://doi.org/10.7910/DVN/N3YJXG.

Related Works
Convolutional Neural Networks (CNNs) have demonstrated exceptional effectiveness in feature extraction during the past decade.Utilizing CNNs has led to state-of-the-art achievements in various image analysis tasks, including image classification [9,10], object detection [11,12], and semantic segmentation [13,14].However, the task of pest localization remains challenging for these models, primarily due to the diminutive sizes of insects.Barbedo [15] spearheaded the use of image processing techniques and handcrafted features to automatically detect and count pests on leaves with simple black backgrounds, but more effective features were needed to improve the performance.Using the histogram-oriented gradient algorithm and a support vector machine, Ref. [4] devised a way to identify and count aphids with greater accuracy.However, with the advent of CNNs, there was a push to move towards learning features directly from the data.
A modification of the widely used U-Net model was introduced in [5] to segment semantically and count the individual aphid nymphs on leaves.While this method demonstrated high precision and recall, its applicability was limited by the dataset consisting solely of ideal laboratory images featuring simple black backgrounds.Consequently, the model lacked generalizability to the natural settings where these pests are typically found.Hence, it is imperative for models to exhibit robustness to diverse viewpoints and intricate crop canopies in the real-world environments where these systems are deployed.A tiny-sized dense distribution network, TD-Det, was proposed in [6] and evaluated using the APHID-4K dataset from [16].The model consisted of a transformer feature pyramid network and used a multi-resolution training method, highlighting the effectiveness of using multiple resolutions of the images for training.The need for more datasets of this variety was further addressed by the small dataset, LeLePhiD [17], where they annotated aphid clusters for infestation detection.
In our preliminary investigation [18], we assembled an extensive dataset of aphids in their natural habitat and undertook a comparative analysis of various object detectors using this dataset.However, the dataset was constructed at a fixed scale, and the models implemented were not tailored for real-time applications.Moreover, the bounding boxes provided by the detection models lacked the requisite precision for accurately estimating the infection levels.Given the recent developments regarding real-time models, this paper shifts its focus towards real-time aphid localization within both detection and segmentation contexts.

Dataset
Aphids are recognized for consistently causing significant damage to both grain and sweet sorghums [19].To ensure the accuracy and representativeness of the data collected for our dataset in capturing the natural habitat of aphids, we took high-resolution images from sorghum fields in both the northern and southern regions of the State of Kansas during the aphid growing season.This was achieved using an imaging rig equipped with three GoPro cameras positioned at different heights, as shown in Figure 2.This setup facilitates capturing diverse images from various viewpoints, heights, and lighting conditions.This approach enables the models to better comprehend the spatial characteristics of sorghum aphids across different environments, ultimately yielding more robust and generalizable results.The initial images exhibited sparse aphid concentration, with those devoid of any aphids being filtered out via manual examination by domain-trained research assistants, yielding a dataset comprising 5447 images, each sized at 3647 × 2736 pixels.Given that aphids typically congregate in groups, our approach for infestation assessment focused on identifying clusters rather than individual insects.To this end, during the annotation process, we set a threshold of six or more closely located aphids to define a cluster [18].This criterion ensured that only clusters of a significant size, indicative of an economic threat, were detected.
To facilitate a statistically meaningful analysis, we implemented 10-fold cross-validation to mitigate potential biases resulting from random data splitting and the substantial class imbalance against background pixels.The selected images were randomly divided into 10 equal sets.Given the sparsity and small sizes of the aphid clusters in the original highresolution images, we employed patch generation to enhance cluster presence and ensure robust localization across different scales.This patch generation process, as illustrated in Figure 3, involved subdividing the images into patches at three distinct scales: Scale 1 (0.132W × 0.132H), Scale 2 (0.263W × 0.263H), and Scale 3 (0.525W × 0.525H), where W and H represent the width and height of the original image, respectively.Some sample images at different scales are shown in Figure 4. To maintain annotation completeness, a 10% overlap was incorporated between adjacent patches.To prevent any potential data leakage across the sets, the patch generation was conducted after the fold separation of the original images.The generated dataset comprised a total of 54,742 multi-scale image patches, with 35,140, 13,311, and 6291 patches captured from the top, middle, and bottom camera heights, respectively.Additionally, there were 36,478, 14,628, and 3636 images obtained from Scale 1, Scale 2, and Scale 3, respectively.Despite the extensive patch generation process, annotations of aphid clusters remained sparse, accounting for only 2.45% of each image.
Initially, annotations were provided in the form of semantic segmentation masks, from which bounding boxes were derived.The statistical distribution of the aphid clusters is shown in Figure 5.To create the object detection dataset, overlapping boxes were merged.In contrast, our previous study [18] exclusively focused on developing the dataset at a single scale.However, as evidenced by our experiments, the incorporation of multi-scale data enables models to learn features at varying resolutions, thereby contributing to improved performance.In the literature, there are two other aphid image datasets with automated aphids for detection.APHID-4K is a high-resolution dataset containing 4294 images of wheat aphids [16].However, as this dataset is annotated for individual aphid detection and counting, it is less suitable for infestation-level assessment.LeLePhiD is another dataset introduced in [17] for semantically segmenting aphid clusters on lemon leaves.While this dataset is suitable for assessing aphid concentration and finding infestation severity, it is a very small dataset with only 665 images and each image consists of a single lemon leaf taken from 30 to 50 cm away, making it hard for practical applications.Compared to these two datasets, our dataset is significantly larger, with 54,742 images captured at multiple scales and from various viewpoints; it offers a comprehensive and diverse dataset for large-scale pest detection research.

Models
With the increased demands of practical applications like autonomous vehicles and intelligent systems, efficient models have been developed to quickly process visual information and take immediate action.This highlights the need for models that can work with high-resolution image data at high processing speeds.In this study, we implemented some of the most popular real-time semantic segmentation networks, including Fast-SCNN, BiSeNetV1, BiSeNetV2, and Small HRNet, as well as real-time object detection networks such as Faster R-CNN, RT-DETR, and YOLOv7.

Segmentation Models
Fast-SCNN [20] achieves state-of-the-art segmentation speeds coupled with a lightweight design, making it ideal for embedded systems with limited computational memory.Its binary path architecture effectively distributes the computation load and integrates various region-based context information to enhance segmentation accuracy.Bilateral Segmentation Network (BiSeNet) [21,22] is another noteworthy family of realtime segmentation networks employing a similar two-pathway architecture.By fusing high-level semantic information with low-level features, BiSeNet strikes a balance between segmentation speed and accuracy, making it highly suitable for real-time applications.High-Resolution Network (HRNet) [23] serves as a robust backbone for various computer vision tasks.Its simplified version, Small HRNet, achieves real-time speeds while preserv-ing high-resolution representations.This is achieved through a high-resolution stem in the initial stage, followed by the gradual addition of high-to-low-resolution streams in the main body.All these models demonstrate remarkable capability in achieving real-time speeds across different benchmark datasets while maintaining high levels of accuracy.

Detection Models
Faster R-CNN is a prominent two-stage real-time object detector building upon the Fast R-CNN model by incorporating a Region Proposal Network (RPN).This RPN shares full-image convolution features with the detection network to enable cost-free region proposals [24].The YOLO (You Only Look Once) family of models represents state-ofthe-art single-stage object detection networks, renowned for their real-time processing speeds and competitive accuracy [25].YOLOv7, a recent addition to this family, is an anchor-based model capable of achieving higher speed and accuracy compared to other known object detectors.Wang et al. [26] demonstrated through ablation studies that scaling the network depth and width sequentially, while concatenating the layers, optimizes the YOLOv7 architecture for improved performance.The Real-Time Detection Transformer (RT-DETR), introduced by Lv et al. [27], stands out as the first real-time end-to-end object detector to achieve state-of-the-art speed and accuracy.It surpasses all YOLO detectors at the same scale by employing a hybrid encoder capable of efficiently processing multi-scale features.RT-DETR also enhances the initialization of object queries through IoU-aware query selection, thereby avoiding delays caused by Non-Maximal Suppression (NMS).Leveraging these three object detectors, we conducted experiments for real-time aphid cluster detection, capitalizing on their respective strengths in speed and accuracy to address the requirements of our study effectively.

Experimental Setup and Evaluation Metrics
Pre-Processing: The original 3647 × 2736 images were utilized to generate patches at three different scales.Patches containing less than 1% aphid cluster coverage were filtered out as they typically represented remnants of clusters from the patch generation process or were deemed insignificant in terms of economic threat.This curation process yielded a final dataset comprising 54,742 multi-scale images.Prior to training, all images were resized to 1024 × 1024 pixels and normalized based on the mean and standard deviation of the dataset.For the semantic segmentation task, considering the substantial class imbalance between aphid clusters and background classes, class weights were computed using the pixels in the image masks and were applied during training.These weights were calculated as follows: W Aphid Cluster = Total pixels Aphid cluster pixels (1) Training Setup: The models were implemented in Python using PyTorch.For the segmentation models, we utilized the MMSegmentation library [28].Additionally, we employed the MMDetection library for the Faster R-CNN model, while the official implementations were utilized for the other object detection models.The training was conducted using 4 NVIDIA v100 GPUs with a total memory capacity of 64 GB.
Training Pipeline: During training, all segmentation models underwent 160,000 iterations utilizing Stochastic Gradient Descent (SGD) with a learning rate set to 0.001, a momentum of 0.9, a weight decay of 0.005, and a batch size of 2. Conversely, a batch size of 8 images was employed for the object detection models.For training YOLOv7 and Faster R-CNN, SGD was utilized as the optimizer.YOLOv7 was trained with a learning rate of 0.01, a momentum of 0.937, and a weight decay of 0.0005, while Faster R-CNN used a learning rate of 0.02, a momentum of 0.9, and a weight decay of 0.0001.RT-DETR, on the other hand, employed AdamW as the optimizer, trained with a learning rate of 0.0001, a momentum of 0.9, and a weight decay of 0.0001.
Model Evaluation: For the object detection task, we utilized the popular mean Average Precision (mAP) metric across various Intersection over Union (IoU) thresholds, including 0.25, 0.5, and 0.75.In the segmentation task, model evaluation relied on two key metrics: Intersection over Union (IoU) and Dice coefficient.These metrics are defined as below.

IoU =
Area of overlap Area of union (3) The speed of all models is evaluated based on how fast they are able to detect aphid clusters by their frames per second (FPS).

Experimental Results
Utilizing the generated multi-scale patches, we have achieved a comprehensive representation of aphid clusters from diverse viewpoints, thereby enhancing the robustness of our models across different regions within a sorghum field.By leveraging this dataset, we trained the identified object detection and semantic segmentation networks, thereby establishing a benchmark for aphid cluster localization.This effort contributes significantly to the broader objective of infestation management.

Segmentation
The four real-time semantic segmentation models we used in our experiments included Fast-SCNN, BiSeNetV1, BiSeNetV2, and Small HRNet.The performance of these models is shown in Table 1.The results show the mean Intersection over Union (mIoU), mean Dice score (mDice), mean precision (mPrecision), mean recall (mRecall), and the speed in frames per second (FPS).From the results, we can see the tradeoff between speed and accuracy, with Small HRNet achieving the highest accuracy at 71.62 mIoU but having the slowest relative speed at 31.57FPS compared to the other networks.The BiSeNetV1 and BiSeNetV2 models yield lower performance while having quite consistent speeds of 53.70 FPS and 56.19 FPS, respectively.In terms of inference speed, Fast-SCNN stands out as the top performer among all the models, achieving above real-time speeds at 91.66 FPS.Remarkably, it maintains a high accuracy level of 71.25 mIoU.Given its superior performance compared to Small HRNet and its exceptional speed, Fast-SCNN emerges as the overall best-performing model for real-time aphid cluster segmentation.Table 1 also shows the overall recommendation in the column "Rank".Figure 6 illustrates the segmentation results across the four models and compares them to the corresponding ground truth annotations.Small HRNet and Fast-SCNN demonstrate the highest accuracy in producing segmentation masks closely aligned with the ground truths.They effectively detect aphid clusters under various visual conditions, including sparsely located clusters, partially occluded clusters, and those in low-light environments.Additionally, Small HRNet exhibits proficiency in handling clusters with more intricate shapes, while Fast-SCNN excels in delineating tighter boundaries, effectively avoiding areas of leaves between clusters devoid of a significant number of aphids.In contrast, the BiSeNet models tend to overlook aphid clusters that blend into the background crop canopy and often misclassify similarly textured parts of the plant as clusters.Notably, BiSeNetV1 may also erroneously detect smaller individual clusters within larger ones rather than recognizing them all as a single cluster.

Detection
The experiments incorporated three real-time object detection models: Faster R-CNN, YOLOv7, and RT-DETR.Table 2 presents the performance metrics of these models corresponding to various IoU thresholds of 0.25, 0.5, and 0.75.It is evident that the IoU threshold plays a significant role in determining the difficulty for models to generate True Positive predictions.Higher IoU thresholds impose stricter localization requirements, leading to lower average precision and recall.This underscores the tradeoff between the stringency of the localization criteria and the overall performance of the model.Table 3 shows the mean average precision and recall averaged across the three IoU thresholds, as well as the corresponding frames per second (FPS) rates, where the column "Rank" indicates the overall recommendation for practical deployment.
Among these detection models, RT-DETR achieved the highest accuracy, attaining a 61.63 mAP, while also managing the second fastest inference speed with an above real-time speed of 72.55 FPS.Faster R-CNN emerges as the second-best performer with an accuracy of 57.83 mAP.Even though this model has the slowest relative speed, it is still able to maintain above real-time speeds of 48.03 FPS.YOLOv7 achieved the lowest relative accuracy of 57.33 mAP, which is quite similar to Faster R-CNN, but its lightweight architecture enables it to be more efficient, with the highest frame rate at 113.64 FPS.Although YOLOv7 demonstrates the fastest frame rate, its lower accuracy highlights the tradeoff between speed and accuracy.Consequently, RT-DETR emerges as the optimal choice overall, offering the best balance between accuracy and above real-time inference speeds.
The visual detection results from each of the three models, alongside their corresponding ground truth bounding boxes, are depicted in Figure 7. From these results, we can see that all the models generated bounding boxes that closely aligned with their ground truths.Faster R-CNN, however, tended to produce additional bounding boxes, often detecting smaller clusters in the background that pose a negligible economic threat, and occasionally misidentifying the background components of the crop as clusters.Moreover, due to the absence of Non-Maximal Suppression, Faster R-CNN resulted in overlapping bounding boxes.Despite encountering challenges in detecting the sparsely populated clusters in the dark and shadow-covered bases of the crops, RT-DETR demonstrated robustness across various conditions, including different viewpoints, lighting conditions, leaf sizes, as well as images with blur and shadow.
Table 2. Object detection results detailed at different IoU thresholds of 0.25, 0.50, and 0.75.For YOLOv7, 0.25, 0.5, and 0.75 were also set as the Non-Maximal Suppression thresholds for the corresponding IoU thresholds.In Table 2 and Table 3, we also compared the detection performance with the following four detection models reported in [18]: ATSS (Adaptive Training Sample Selection) [29], GFLV2 (Generalized Focal Loss V2) [30], PAA (Probabilistic Anchor Assignment) [31], and VFNet (VarifocalNet) [32].Although these models were trained and evaluated from the same original image dataset, their performance does not align with the results obtained in this study.The primary reason for this variance is that these models were trained using the fixed-scale dataset created in [18].This comparison further underscores the significance of creating a multi-scale dataset, emphasizing its crucial role in achieving superior performance in aphid cluster detection and segmentation tasks.

Results Analysis
Despite the advancements in modern computer vision techniques, localizing small insects remains a very challenging problem in automated detection and segmentation within the agricultural field.This difficulty primarily arises from the minuscule sizes of these pests and their adeptness at blending into complex natural crop environments.To tackle these obstacles, we implemented a three-camera setup with varying heights and employed multi-scale patch generation to create object detection and semantic segmentation datasets.As a result, the resulting images encompassed a wide range of scales, viewpoints, and heights, surpassing the diversity of the existing pest image datasets.Moreover, recognizing that aphid infestations are typically clustered, with the size of the cluster indicating the level of economic threat, our annotations occurred at the cluster level rather than focusing on individual insects.Additionally, the utilization of class weights helped to address the significant class imbalance, ensuring that the segmentation models could effectively learn from the data.
Among the models evaluated, RT-DETR emerged as the top performer for object detection, while Fast-SCNN excelled in semantic segmentation.This success can be attributed to their combination of high speed and accuracy, making them well-suited for practical agricultural applications.With its newly designed hybrid encoder able to efficiently process multi-scale features, RT-DETR is able to avoid the delays caused by NMS and reduce the computation time of the network.Using IoU-aware query selection also allows the model to focus on the parts of the image that contain the most relevant objects, enhancing the accuracy.From the results in Figure 7, we can see that RT-DETR performs quite well relative to the other models and is robust to changes in lighting conditions and viewpoints.
Although YOLOv7 manages to achieve a higher frame rate, the combination of the highest accuracy at a speed above real time ensures that RT-DETR is sufficient for quality real-time aphid cluster detection with higher recall.From Table 2, we also notice the tradeoff between the localization strictness and the model's performance.As the agricultural needs are vast and diverse, this enables the threshold to be customized to different requirements.
The highly recommended model, Fast-SCNN, has a binary path architecture, which allows it to efficiently downsample the images and concatenate the shared low-level computed features to achieve its excellent speeds.However, it is this simplicity that keeps it from outperforming the most accurate model, Small HRNet, which sacrifices speed for its better accuracy.We see this further demonstrated in Figure 6 where Small HRNet is able to segment more complicated boundaries to better fit the aphid clusters.Fast-SCNN is still, however, able to quite accurately segment the clusters and account for different lighting conditions and complicated crop backgrounds.

Recommendation for Aphid Infestation Control
Given that aphids typically appear sparsely but form dense clusters, the economic impact of these pests correlates directly with the size of these clusters.For an autonomous pest control system to effectively manage infestations in real time, it must quickly localize these clusters and assess the severity of the infestation based on their sizes.Our comparative results of object detection and semantic segmentation approaches demonstrate that the additional spatial information provided by semantic segmentation, through the sizes of the predicted masks, allows for an accurate determination of whether an infestation has reached a critical threshold requiring intervention.
From Figure 8, we observe the predictions achieved by RT-DETR and Fast-SCNN for object detection and semantic segmentation, respectively, on identical test images.While both methodologies are adept at identifying small, isolated, and circular clusters of aphids, challenges arise with atypical cluster formations.For example, in cases where clusters form elongated lines along leaves or assume elongated shapes, as illustrated in the bottom row of Figure 8, bounding boxes may cover extensive areas of the image, yet the actual clusters occupy only sparse sections within these boxes.Conversely, the predicted masks from semantic segmentation provide a more accurate representation of the area directly occupied by the aphid clusters.
Therefore, we recommend semantic segmentation for this task as its pixel-level precision in identifying aphid clusters proves to be more suitable.While RT-DETR provides high accuracy at speeds above real time, Fast-SCNN's capability to accurately segment clusters with precise boundaries offers a more accurate assessment of the infestation level present in the image.Nonetheless, we recognize that agricultural settings vary widely, and both methodologies have demonstrated effectiveness in localizing aphid clusters in their natural habitats.In scenarios where speed is a priority, or the sheer number of detected clusters is sufficient for deciding on treatment actions, the detection approach may be preferred, still yielding competitive results.In practical applications, we might establish a threshold such that, if the total area covered by segmentation masks or detection bounding boxes in an image exceeds a predetermined threshold, the onboard spray system would be activated to apply pesticides to the identified areas.

Conclusion
Aphids cause some of the most consistent damage to wheat, rice, and sorghum crops worldwide.As a primary cause of insect pest damage, recent advances in automated drone technology have allowed farmers to target them more effectively using selective treatment.However, for computer vision models to be able to detect and segment aphid clusters with a sufficient level of performance, they require a large amount of diverse data.Using manually collected and labelled high-resolution images, our large multi-scale dataset of aphid clusters addresses the need to adequately assess the infestation levels of aphids in sorghum crops.The different scales allow for better generalization capabilities and robustness to different ranges of zoom and viewpoints.In our experiments, we trained popular real-time object detection and semantic segmentation models to localize aphid clusters regarding real-time speeds or better.Our results show that both RT-DETR and Fast-SCNN are ideal choices for object detection and semantic segmentation, respectively, for providing high-quality predictions at above real-time speeds.In terms of the overall better-suited task, semantic segmentation is recommended due to the extra spatial information provided through the masks.These findings highlight the promise of this problem domain, and further studies with deployed models in the field may enable real-world data on the effects of these models on the yield to be studied.Thus, this study will hopefully motivate further research that drives us towards a more sustainable and efficient agricultural system.

Figure 1 .
Figure1.An intelligent scouting robot with onboard vision and spray system.The vision system is used to detect and localize aphid affections, and the spray system can apply precise pesticide application to the infected areas.

Figure 2 .
Figure 2. The imaging rig used to capture images.It is equipped with three adjustable GoPro cameras to take images from different heights and viewpoints.

Figure 3 .
Figure 3. Two examples of the different scales that were used in training.Image (left) shows how the original high-resolution 3647 × 2736 image was subdivided to create patches at the 0.525W × 0.525H scale (Scale 3), where W and H refer to the width and height of the original image.At this scale, the original image will yield 4 patches.Image (right) shows how the image was further subdivided to create patches at the 0.263W × 0.263H scale (Scale 2).This scale will yield 16 patches from the original image.In each case, adjacent patches were taken with an overlap of 10%.

Figure 4 .
Figure 4. Example images from the dataset alongside their corresponding ground truth labels.The first row shows the appearance of the aphid clusters and the second row has the corresponding ground truth masks overlaid on them.The first, second, and third columns show image patches at Scale 1, Scale 2, and Scale 3, respectively.

Figure 5 .
Figure 5. Histograms showing the mask area percentage across the images and the number of images per scale.The chart on the left shows the percentage of aphid cluster masks using intervals of 10% from 0% to 100%.As most images lie in the interval between 0% and 10%, the chart in the center further breaks that interval down for closer analysis.The chart on the right provides the number of images at each scale.

Figure 6 .
Figure 6.The predicted masks from the four real-time semantic segmentation models alongside the ground truth mask annotations.Each row shows us the performance of the models at the three different scales.

Figure 7 .
Figure 7.The predicted bounding boxes from the three object detection models as well as the ground truth bounding boxes.Each row highlights the performance of the models at different lighting conditions and viewpoints.

Figure 8 .
Figure 8. Prediction results for the same test images in object detection and semantic segmentation contexts from RT-DETR and Fast-SCNN, respectively.These examples show how semantic segmentation is more beneficial due to the provided spatial information of the aphid clusters.

Table 3 .
Real-time object detection results (sorted via mAP).The mAP and mRecall are averaged over IoU thresholds at 0.25, 0.50, and 0.75.Bold indicates top result."Rank" shows the overall recommendation."×" means not recommended.