Article

AG-YOLO: A Rapid Citrus Fruit Detection Algorithm with Global Context Fusion

College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
* Author to whom correspondence should be addressed.
Agriculture 2024, 14(1), 114; https://doi.org/10.3390/agriculture14010114
Submission received: 6 December 2023 / Revised: 5 January 2024 / Accepted: 9 January 2024 / Published: 10 January 2024

Abstract

Citrus fruits hold a pivotal position within the agricultural sector, and accurate yield estimation is crucial for orchard management, especially when fruits are occluded by dense foliage or overlap one another. This study addresses the low detection accuracy and frequent missed detections of citrus fruit detection algorithms under occlusion by introducing AG-YOLO, an attention-based network designed to fuse contextual information. AG-YOLO adopts NextViT as its primary architecture, harnessing its ability to capture holistic contextual information within nearby scenes. It also introduces a Global Context Fusion Module (GCFM) that uses self-attention to enable the interaction and fusion of local and global features, significantly improving the detection of occluded targets. An independent dataset of over 8000 outdoor images was collected to evaluate AG-YOLO; after careful screening, a subset of 957 images depicting citrus occlusion scenarios was obtained, covering occlusion, severe occlusion, overlap, and severe overlap. On this dataset, AG-YOLO achieved a precision (P) of 90.6%, a mean average precision (mAP)@50 of 83.2%, and an mAP@50:95 of 60.3%, surpassing existing mainstream object detection methods and confirming its efficacy. AG-YOLO runs at 34.22 frames per second (FPS) while maintaining high detection accuracy, striking a strong balance between speed and accuracy in occlusion-heavy scenes. Compared with existing models, AG-YOLO offers high localization accuracy, low missed detection rates, and fast detection, making it an efficient and reliable solution for object detection under severe occlusion.

1. Introduction

In China, citrus is a prevalent and economically important fruit crop within the agricultural sector. Accurate yield estimation during citrus cultivation is paramount: precise estimates help fruit growers make informed orchard-management decisions on fertilization, irrigation, and harvesting plans, and they support production planning and supply chain management. However, traditional yield estimation methods can deviate substantially, especially when fruits are obstructed by dense foliage or by one another, making it difficult to determine fruit yield accurately. Severe fruit occlusion therefore remains a major obstacle to accurate yield estimation.
The current methods for object detection in the agricultural domain can be broadly categorized into two types: non-deep learning methods and deep learning methods.
Non-deep learning methods, exemplified by Hamuda et al. (2018) [1], employ Kalman filtering and the Hungarian algorithm for crop detection in agricultural fields. However, because the experimental background was simple and lacked overlapping crops, these methods are unsuitable for detecting densely distributed, occluded fruits. Lu et al. (2015) [2] introduced a citrus fruit recognition approach based on color and contour information under varying canopy lighting conditions. Although this method can adapt to complex natural environments with diverse lighting and backgrounds, its detection performance diminishes when dealing with smaller citrus fruits. Lin et al. (2013) [3] proposed an improved method based on the circular random Hough transform (CRHT) to detect occluded camellia fruit. The algorithm incorporates modules such as edge pre-detection and rapid center-point localization to enhance the recognition rate. Song et al. (2012) [4] proposed a method for recognizing occluded apple targets based on the convex hull theory. This method converts images from RGB to Lab color space and employs K-means clustering to categorize images into three classes: leaves, branches, and fruits. Subsequently, through morphology and convex hull techniques, the method extracts contour curves of fruit edges and estimates parameters of smooth curves, achieving occluded fruit localization. Feng et al. (2015) [5] used the 2R-G-B color difference model to extract color features of ripe red tomatoes and employed a dynamic threshold segmentation method for fruit identification. However, this approach suffers from prolonged identification times and does not adequately address accurate identification in complex environments with leaf occlusions. Sun et al. (2017) [6] utilized the K-means clustering algorithm based on the Lab color space to segment apple targets occluded by branches. The mathematical morphology method was employed to extract the contour of the apples. Subsequently, the minimum bounding rectangle method was employed to remove false contours, and the curvature features of the contour were used to reconstruct the apple targets. This study’s approach accurately identifies, locates, and reconstructs apple targets occluded by branches, effectively reducing reconstruction time. Sun et al. (2019) [7] proposed an object extraction algorithm based on geometric morphology and iterative random circles. This algorithm effectively segments and identifies adhered fruits in images. It addresses the segmentation of fruits in complex environments where multiple fruits are adhered or slightly occluded. The aforementioned machine vision-based fruit recognition algorithms are constrained by strong environmental dependencies and poor robustness in complex occlusion scenarios, often resulting in slow recognition speeds and low identification accuracy (Kuznetsova et al. 2020 [8], Sozzi et al. 2022 [9], Zhang et al. 2021 [10]).
The powerful processing capabilities of deep learning and its adaptability to complex scenarios make it an ideal choice for addressing occlusion issues in fruit detection. In the realm of deep learning for fruit detection, the prevailing frameworks are categorized into single-stage detectors and two-stage detectors. YOLO (Redmon et al. 2016 [11], Redmon and Farhadi 2017 [12], Redmon and Farhadi 2018 [13], Bochkovskiy et al. 2020 [14]) is a one-stage object detector known for its outstanding performance in both detection accuracy and speed. Koirala et al. (2019) [15] conducted a comparative analysis of six different deep learning architectures. They combined the distinctive features of YOLOv2 (Tiny) to devise a novel architecture termed ’MangoYOLO’. This new model was specifically designed for mango detection in canopy images. MangoYOLO integrates the strengths of both YOLOv2 and YOLOv3, offering both high detection speed and high accuracy in fruit detection. Tian et al. (2019) [16] proposed an improved YOLOv3 model designed for fruit orchard scenarios characterized by fluctuations in lighting, complex backgrounds, fruit overlap, and foliage obstruction. They leveraged DenseNet (Huang et al. 2017) [17] to process the low-resolution feature maps of the YOLOv3 model, effectively enhancing feature propagation and reuse. Yang et al. (2022) [18] utilized an enhanced YOLOv4-tiny model for tomato fruit recognition. They improved the identification accuracy of small tomatoes by incorporating a detection layer with dimensions of 76 pixels by 76 pixels. Additionally, they introduced a convolutional attention module to enhance the recognition accuracy of occluded tomatoes and employed a densely connected convolutional network to improve global feature integration. Huang et al. (2022) [19] proposed an optimized YOLOv5 model for citrus fruit recognition. By integrating the CBAM (convolutional block attention module) attention mechanism and the α-IoU (alpha-intersection over union) loss function, they significantly enhanced the detection accuracy of occluded targets and improved the precision of bounding box localization. The two-stage detectors are primarily based on the R-CNN algorithm (Girshick et al. 2014 [20]). Mu et al. (2020) [21] utilized deep learning methods, integrating region-based convolutional networks (R-CNN) and ResNet-101 (He et al. 2016 [22]), to detect unripe tomatoes in heavily shaded conditions. This was applied for both ripeness detection and predicting tomato yield. Afonso et al. (2020) [23] applied Mask R-CNN (He et al. 2017) [24] to tomato datasets for detection, utilizing various neural networks as backbones to extract feature maps. Bi et al. (2019) [25] enhanced the multi-scale image detection capability and real-time performance of citrus fruit recognition models using a multiple segmentation approach. Testing the citrus fruit target dataset in natural environments containing various interference factors, the results indicated the robustness and real-time performance of the citrus recognition model against common interference factors and occlusions in natural picking environments. The average recognition accuracy was 86.6%, with an average single-frame image detection time of 80 ms.
Although the aforementioned methods have shown improvements in both the accuracy and speed of detection models and have proposed feasible solutions, they struggle to effectively address severe occlusion issues in close-range object detection. This paper adopts the NextViT (Li et al. 2022) [26] structure, a new generation Vision Transformer (Dosovitskiy et al. 2020) [27], as the backbone network. Additionally, it integrates the Global Context Fusion Module (GCFM) at the end of the neck network, presenting a model with enhanced contextual awareness and feature representation capabilities, named AG-YOLO (Attention-Guided YOLO).
The main contributions of this paper are as follows:
(1)
The study introduces the Global Context Fusion Module (GCFM) to effectively merge local and global contexts. In contrast to conventional methodologies, the GCFM employs self-attention mechanisms to selectively emphasize essential features of occluded targets. Moreover, it enables interaction and fusion between occluded regions and other image areas, facilitating a more comprehensive understanding of the image’s semantic content and structural composition. This utilization of global context information notably enhances the model’s capacity to discern the relationship between occluded fruit targets and their background, thereby improving the detection capability for such occluded targets.
(2)
The utilization of NextViT as the backbone network highlights its superior global perception and representation capabilities compared to traditional convolutional neural networks, attributed to its Vision Transformer-based architecture. The Transformer (Vaswani et al. 2017) [28] architecture of NextViT provides the network with extensive global awareness, allowing it to process information from all areas across the input image simultaneously, rather than focusing solely on localized regions. This capability proves particularly advantageous for addressing challenges in object detection tasks involving issues like occlusion and scale variations.
(3)
To meet the training requirements of the proposed model in this paper, a dataset specifically for occluded citrus fruits was collected, generated, and selected. The dataset contains ripe yellow fruit data and unripe green fruit data, segmented into subsets based on occlusion types, including branch-occluded, densely occluded, severely branch-occluded, and severely densely occluded within each color category. This rigorous classification method ensures the diversity and comprehensiveness of the dataset.

2. Datasets

2.1. Image Data Collection

Comprehensive image data collection was carried out to encompass a diverse range of natural lighting conditions and detection environments within the dataset. Images were collected in a citrus orchard located in Wuming District, Nanning City, Guangxi Zhuang Autonomous Region, during the summer (for unripe fruits) and autumn (for ripe fruits). To cover various detection environments under natural light, the citrus trees were photographed at distances of 0.3 m to 1 m from the citrus target, fully capturing images under diverse scenes such as sunny days, cloudy days, front lighting, and back lighting. Over 8000 images were collected in total, with the resolution preprocessed to 1280 × 720. After manual screening, 957 images meeting the task requirements were obtained to compose the ’Citrus Occlusion Dataset’. Sample images of the dataset are shown in Figure 1.

2.2. Dataset Creation

The processed dataset was manually annotated using LabelImg, an open-source annotation tool developed by Tzutalin and contributors and released on GitHub. The dataset was then divided into training, validation, and test sets in a ratio of 7:2:1: the training set comprises 670 images, the validation set 190 images, and the test set 97 images. To examine the occlusion challenges in citrus fruit detection more closely, the dataset was further divided into two distinct subsets: 666 images with branch and leaf occlusion and 291 images with fruit overlap occlusion. Within the branch and leaf occlusion subset, images in which fruit occlusion exceeded 60% and occluded samples accounted for more than 30% were categorized as the ’severe occlusion’ set, totaling 200 images. Within the fruit overlap occlusion subset, images with overlaps exceeding 60% were categorized as the ’severe overlap’ set, comprising 89 images. The composition of the citrus fruit occlusion dataset is presented in Table 1.
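For reproducibility, the 7:2:1 split described above can be expressed as a short script. The following is a minimal sketch; the directory layout, file extension, and random seed are illustrative assumptions, since the paper does not specify how images were assigned to each subset.

```python
import random
from pathlib import Path

def split_dataset(image_dir: str, seed: int = 0):
    """Split annotated images into train/val/test subsets at a 7:2:1 ratio.

    The directory path, extension, and seed are illustrative assumptions.
    """
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)

    n = len(images)
    n_train = round(0.7 * n)   # ~70% for training (670 of 957 images in the paper)
    n_val = round(0.2 * n)     # ~20% for validation (190 images in the paper)

    return {
        "train": images[:n_train],
        "val": images[n_train:n_train + n_val],
        "test": images[n_train + n_val:],   # remaining ~10% (97 images in the paper)
    }

splits = split_dataset("citrus_occlusion_dataset/images")
print({name: len(files) for name, files in splits.items()})
```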

3. Methods

3.1. The Architecture of AG-YOLO

To mitigate the effects of occlusion and overlap in citrus fruit detection, a detection network model with enhanced global perception and stronger feature extraction capabilities is proposed in this study, as depicted in Figure 2. Initially, the NextViT backbone network extracts global features from citrus fruit images, capturing long-range dependencies and semantic information within the images. This empowers the model to exhibit robust feature representation when dealing with large-scale images and complex scenes. Subsequently, the Feature Pyramid Network (FPN) and the Path Aggregation Network (PANet) extract multi-scale features to strengthen feature fusion. Next, the Global Context Fusion Module (GCFM) effectively integrates local features with global context information, significantly enhancing the model’s capability to understand and analyze occlusions, scale variations, and object relationships. This is particularly important for citrus fruit detection tasks, as citrus fruits might be obscured by dense foliage or other citrus fruits, and leveraging global context information helps the model better understand the image. Finally, AG-YOLO completes the detection task through prediction layers of different scales. These prediction layers handle target category predictions, confidence estimation of target presence, and regression predictions for object positions. Through this hierarchical processing, the model accurately generates detection boxes for citrus fruits.
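To make the data flow concrete, the following PyTorch-style skeleton sketches the pipeline described above (NextViT backbone, FPN/PAN neck, GCFM at the end of the neck, multi-scale prediction heads). It is a structural illustration only: the class and argument names are placeholders, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AGYOLOSkeleton(nn.Module):
    """Structural sketch of AG-YOLO's data flow; submodules are placeholders."""

    def __init__(self, backbone, neck, gcfm_modules, heads):
        super().__init__()
        self.backbone = backbone                 # NextViT: multi-stage global feature extractor
        self.neck = neck                         # FPN + PANet: top-down / bottom-up fusion
        self.gcfm = nn.ModuleList(gcfm_modules)  # one GCFM per output scale
        self.heads = nn.ModuleList(heads)        # class / confidence / box regression heads

    def forward(self, images: torch.Tensor):
        # 1. Multi-scale features with global context from the Transformer-based backbone.
        c3, c4, c5 = self.backbone(images)
        # 2. Multi-scale feature fusion in the FPN/PAN neck.
        p3, p4, p5 = self.neck(c3, c4, c5)
        # 3. Local/global context fusion at the end of the neck.
        fused = [gcfm(p) for gcfm, p in zip(self.gcfm, (p3, p4, p5))]
        # 4. Per-scale predictions: class scores, objectness, and box offsets.
        return [head(f) for head, f in zip(self.heads, fused)]
```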

3.2. Backbone Network NextViT

In the past, most object detection models aimed at addressing severe occlusion issues have relied on convolutional neural networks for feature extraction, such as VGG (Simonyan et al. 2014) [29], ResNet, MobileNet (Howard et al. 2017) [30], and EfficientNet (Tan et al. 2019) [31], among others. These conventional CNN architectures primarily focus on capturing local information within images, leading to certain limitations when handling occlusion challenges. In contrast, NextViT, based on the Vision Transformer architecture, leverages self-attention mechanisms to capture the global relationships within input images. This approach effectively captures long-range dependencies and semantic information within the images. NextViT demonstrates stronger representational capabilities when dealing with large-scale images and complex scenes, enabling the extraction of richer and more accurate features. NextViT follows a hierarchical pyramid structure, where each stage comprises a patch embedding layer along with a series of next-generation convolutional modules (Next Convolution Block) and next-generation Transformer modules (Next Transformer Block). As depicted in Figure 3, the next-generation convolutional modules play a pivotal role within the model. They leverage convolutional operations and multi-head attention mechanisms for feature extraction. By incorporating a path dropout mechanism, these next-generation convolutional modules effectively reduce redundant information within feature maps, thereby enhancing the model’s robustness and generalization capability. Its design enables the model to better adapt to occluded object detection and localization tasks, resulting in a significant performance improvement, particularly in scenarios involving severe occlusion.
As shown in Figure 4, the next-generation Transformer module combines the self-attention mechanism of Transformers with convolutional operations, specifically designed to handle severe occlusion in object instances. It achieves global information interaction and feature fusion on feature maps through the self-attention mechanism, thereby better capturing the details and context of occluded targets. Simultaneously, it performs local feature extraction through convolutional operations, enhancing the recognition capability of occluded targets. Leveraging self-attention and convolutional operations at different scales effectively integrates multi-scale features. This strategy allows the model to comprehensively perceive occluded targets, thereby improving fruit detection and localization capabilities in scenarios involving severe occlusion.
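The hybrid design can be illustrated with a simplified block that pairs multi-head self-attention (global interaction) with a depthwise convolution (local feature extraction), in the spirit of the next-generation Transformer module. This is a didactic sketch rather than the official NextViT code; the channel width, head count, and the use of dropout as a stand-in for path dropout are assumptions.

```python
import torch
import torch.nn as nn

class HybridAttentionConvBlock(nn.Module):
    """Simplified sketch of a Transformer + convolution hybrid block.

    Self-attention models global relationships across the feature map, while
    a depthwise convolution refines local detail. This mirrors the spirit of
    NextViT's blocks, not their exact structure.
    """

    def __init__(self, dim: int = 256, num_heads: int = 8, drop_rate: float = 0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim * 4), nn.GELU(), nn.Linear(dim * 4, dim))
        self.drop = nn.Dropout(drop_rate)  # simple stand-in for path dropout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) feature map
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)          # (B, H*W, C)
        q = self.norm1(tokens)
        attn_out, _ = self.attn(q, q, q)               # global interaction
        tokens = tokens + self.drop(attn_out)
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = x + self.local(x)                          # local feature refinement
        tokens = x.flatten(2).transpose(1, 2)
        tokens = tokens + self.drop(self.mlp(self.norm2(tokens)))
        return tokens.transpose(1, 2).reshape(b, c, h, w)
```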

3.3. Global Context Fusion Module GCFM

The Global Context Fusion Module (GCFM) takes into full consideration the specificity of occlusion scenarios, enhancing the recognition capability of occluded targets through a strategy that fuses global contextual information. As shown in Figure 5, the Global Context Fusion Module primarily comprises two components: the Local Attention Module and the Global Attention Module. Within the Local Attention Module, the module conducts convolution operations and a Softmax activation function on the feature map, resulting in a positional weight map. Subsequently, this positional weight map undergoes similarity computations with the feature map. The results of these similarity computations then undergo a convolutional operation and are weighted fused with a local weighting map obtained from convolutional layers and a sigmoid activation function. The self-attention mechanism facilitates the capture of interdependencies and feature expressions within local regions, thereby reinforcing the feature representation of the target area.
In the Global Attention Module, the input feature map undergoes a series of operations, including adaptive average pooling, convolutional layers, and sigmoid activation functions, resulting in a global weighting vector. Subsequently, the weighted adjusted global features are combined with the self-attention strengthened local features through addition fusion. Even when the target is partially occluded, the module is capable of extracting effective features through global correlations, thereby enhancing recognition accuracy. Whereas the local features encompass detailed contextual information, the global features encapsulate holistic semantic information. Their combination enables the model to gain a more comprehensive understanding of the target, thereby enhancing the detection performance of occluded targets.
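The two branches described above can be sketched as follows. This is a hedged interpretation of the GCFM, not the authors' code: the kernel sizes, the exact form of the similarity computation, and how the positional weights aggregate context are assumptions chosen to match the description (convolution plus softmax positional weights and sigmoid local gating in the local branch; adaptive average pooling, convolution, and sigmoid weighting in the global branch; addition fusion at the end).

```python
import torch
import torch.nn as nn

class GlobalContextFusionModule(nn.Module):
    """Hedged sketch of the GCFM: a local attention branch and a global
    attention branch fused by addition."""

    def __init__(self, channels: int):
        super().__init__()
        # Local branch: positional weights plus local gating.
        self.pos_conv = nn.Conv2d(channels, 1, kernel_size=1)
        self.local_conv = nn.Conv2d(channels, channels, kernel_size=1)
        self.local_gate = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1), nn.Sigmoid()
        )
        # Global branch: channel-wise weighting from pooled context.
        self.global_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape

        # --- Local attention branch ---
        # Positional weight map over all spatial locations (softmax over H*W).
        pos = torch.softmax(self.pos_conv(x).flatten(2), dim=-1)      # (B, 1, H*W)
        # Similarity/aggregation: weight every location's features by the map.
        context = torch.bmm(x.flatten(2), pos.transpose(1, 2))        # (B, C, 1)
        context = context.view(b, c, 1, 1)
        local = self.local_conv(x + context) * self.local_gate(x)     # gated local features

        # --- Global attention branch ---
        global_weight = self.global_gate(x)                           # (B, C, 1, 1)
        global_feat = x * global_weight

        # Addition fusion of the two branches.
        return local + global_feat
```

Because the module only needs a feature map of known channel width, it can be appended after any neck output, which matches the modularity argument made in the next paragraph.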
The Global Context Fusion Module, combined with the NextViT backbone network, maximizes the advantage of effectively capturing contextual global information in complex scenes. This integration contributes to enhancing the detection capability of occluded targets. Furthermore, the Global Context Fusion Module serves as an independent module that can be easily embedded into various object detection architectures without altering the overall network structure. Such modular design allows for more flexibility and convenience in its usage.

4. Experiments

4.1. Experimental Environment and Parameter Settings

The experiment is based on the Ubuntu 22.04.1 operating system, utilizing an Intel Xeon Silver 4214R CPU, Nvidia RTX 3090 GPU, PyTorch version 1.13 for deep learning framework, and CUDA version 11.7. The model undergoes 600 iterations with an initial learning rate set to 0.01. The bounding box loss function selected is the CIoU loss. The batch size is set to 16, utilizing the SGD optimizer. During training, input images are uniformly resized to 640 × 640 pixels.
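For reference, these settings can be collected in a single configuration. The momentum and weight-decay values below are common SGD defaults assumed for illustration; they are not reported in the paper.

```python
# Training configuration mirroring Section 4.1; momentum and weight decay
# are assumed defaults, not values reported in the paper.
train_config = {
    "iterations": 600,
    "img_size": (640, 640),
    "batch_size": 16,
    "optimizer": "SGD",
    "lr0": 0.01,            # initial learning rate
    "momentum": 0.937,      # assumption (typical YOLO-style default)
    "weight_decay": 5e-4,   # assumption
    "box_loss": "CIoU",
    "device": "cuda",       # Nvidia RTX 3090, CUDA 11.7, PyTorch 1.13
}
```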

4.2. Model Evaluation Metrics

To validate the model’s effectiveness, evaluations were conducted qualitatively and quantitatively. For qualitative assessment, the model’s performance was evaluated by comparing detection images between the AG-YOLO model and other models, specifically focusing on object localization precision, detection count, and the presence of false negatives or false positives. For quantitative evaluation, the selected metrics include: precision (P), recall (R), and mean average precision (mAP) for ripe and unripe fruits. Within mAP, mAP50 and mAP50:95 are chosen: mAP50 represents the average precision at an IoU threshold of 0.5, and mAP50:95 signifies the mean detection accuracy across 10 IoU thresholds from 0.5 to 0.95, with a step size of 0.05. The specific mathematical formulas are shown as Equations (1)–(4).
$$P = \frac{TP}{TP + FP} \tag{1}$$
$$R = \frac{TP}{TP + FN} \tag{2}$$
In Equation (1), $TP$ represents the number of true positive samples detected, and $FP$ represents the number of false positive samples predicted; in Equation (2), $FN$ represents the number of positive samples not detected.
$$AP = \int_{0}^{1} P(R) \, dR \tag{3}$$
$$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_i \tag{4}$$
In Equation (4), $n$ represents the number of detected target categories.
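A minimal NumPy sketch of Equations (1)–(4) is given below. The interpolation scheme used to integrate the precision–recall curve is a common choice assumed here; the paper does not state which evaluation implementation it relies on.

```python
import numpy as np

def precision_recall(tp: int, fp: int, fn: int):
    """Equations (1) and (2): precision and recall from detection counts."""
    p = tp / (tp + fp) if (tp + fp) else 0.0
    r = tp / (tp + fn) if (tp + fn) else 0.0
    return p, r

def average_precision(recalls: np.ndarray, precisions: np.ndarray) -> float:
    """Equation (3): area under the precision-recall curve.

    Uses the common monotone-interpolation scheme (an assumption)."""
    r = np.concatenate(([0.0], recalls, [1.0]))
    p = np.concatenate(([1.0], precisions, [0.0]))
    # Make precision monotonically non-increasing, then integrate over recall.
    p = np.maximum.accumulate(p[::-1])[::-1]
    return float(np.sum((r[1:] - r[:-1]) * p[1:]))

def mean_average_precision(ap_per_class: list) -> float:
    """Equation (4): mean of per-class APs over n categories."""
    return float(np.mean(ap_per_class))
```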

4.3. Quantitative Analysis Evaluation

To validate the superiority of the proposed algorithm, comparisons were conducted with various state-of-the-art algorithms from recent years. The algorithms compared in this study include CNN-based object detectors such as YOLOv8 and Faster R-CNN, and Transformer-based object detectors like DAB-DETR (Liu et al. 2022) [32], DN-DETR (Li et al. 2022) [33], SAP-DETR (Liu et al. 2022) [34], Anchor-DETR (Wang et al. 2021) [35], and Conditional-DETR (Meng et al. 2021) [36]. Through these comparisons, the aim is to demonstrate the algorithm’s advantages in object detection over both traditional CNN-based detectors and more recent Transformer-based detectors, particularly in addressing occlusion challenges, as illustrated in Table 2 and Table 3.
Based on the data from Table 2 and Table 3, in the comparative experiments between CNN-based detectors and Transformer-based detectors, AG-YOLO achieved the best performance across metrics, including precision, recall, mAP50, mAP50:95, $AP_s$ (AP for small objects), $AP_m$ (AP for medium objects), and $AP_l$ (AP for large objects). This achievement can be attributed to the comprehensive utilization of the Transformer model’s contextual modeling ability and the global context fusion mechanism. This enables the model to better understand the contextual information within the images, effectively addressing severe occlusion issues and improving the accuracy of object detection.

4.4. Ablation Study

To validate the effectiveness of individual modules and analyze their impact on the precision of the AG-YOLO algorithm, this study conducted ablation experiments. YOLOv5s with the CSPDarknet53 backbone network was employed as the baseline for comparison. The results of the ablation experiments conducted in this research are presented in Table 4. According to the data in Table 4, the ablation experiments illustrate that using NextViT as the backbone network and introducing the Global Context Fusion Module enhances both the precision and recall of the model, consequently reducing misidentification and omission rates. Compared to the baseline model, employing the NextViT backbone network results in an improvement of 4.0 and 2.7 points in mAP50 and mAP50:95, respectively. This signifies that utilizing NextViT as the backbone network helps the network focus on obscured target features, thus enhancing downstream task accuracy. Notably, integrating the Global Context Fusion Module with the NextViT backbone network optimizes model performance. In comparison to the baseline model, metrics such as P, R, mAP50, and mAP50:95, respectively, improve by 3.9, 5.7, 4.2, and 3.2 points, highlighting the effectiveness of NextViT combined with the Global Context Fusion Module.

4.5. Qualitative Analysis Evaluation

To more intuitively demonstrate the superior detection performance of AG-YOLO proposed in this study, comparisons were conducted on the test set of the citrus dataset. The evaluation specifically focused on both one-stage and two-stage detection models: YOLOv5s, YOLOv7, YOLOv8s, Faster R-CNN, and SSD. These models were employed to evaluate AG-YOLO’s effectiveness in detecting severely occluded fruit targets. Based on Figure 6 and Figure 7, in scenarios where immature and mature fruits are heavily obscured by leaves and branches, YOLOv5s, YOLOv7, YOLOv8s, Faster R-CNN, and SSD exhibit instances of missed detections. However, AG-YOLO still detects obscured fruit targets. In the GradCAM heatmaps depicted in Figure 6 and Figure 7, color intensity represents the importance of each pixel or region concerning the predicted target class. Darker colors (such as red or blue) indicate more significant influence on the prediction outcome. The YOLOv5s, YOLOv7, YOLOv8s, Faster R-CNN, and SSD models demonstrate insufficient attention to obscured regions, failing to learn the feature connections between obscured and non-obscured areas, thereby failing to detect obscured target fruits. AG-YOLO exhibits higher responsiveness to obscured regions, indicating that the AG-YOLO model can effectively focus on and localize targets even in severely occluded situations. It displays robustness, extracting crucial information from complex scenes to identify and locate obscured targets.
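The visualization code is not published with the paper; a generic Grad-CAM routine such as the sketch below, built on PyTorch forward and backward hooks, can produce comparable heatmaps for a chosen convolutional layer of any of the compared detectors. The score_fn argument, which reduces the detector's outputs to a scalar score for the fruit class, is an illustrative assumption.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, target_layer, score_fn):
    """Generic Grad-CAM sketch: weight a layer's activations by the
    spatially pooled gradients of a scalar detection score.

    `score_fn(outputs)` must reduce the model outputs to a scalar (e.g., the
    confidence of the fruit class); how such a scalar was chosen for each
    detector in the paper is not specified, so this is an assumption."""
    activations, gradients = [], []

    h1 = target_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    h2 = target_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    try:
        outputs = model(image)          # image: (1, 3, H, W)
        score = score_fn(outputs)
        model.zero_grad()
        score.backward()
    finally:
        h1.remove()
        h2.remove()

    acts, grads = activations[0], gradients[0]                 # (1, C, h, w)
    weights = grads.mean(dim=(2, 3), keepdim=True)             # channel importance
    cam = F.relu((weights * acts).sum(dim=1, keepdim=True))    # (1, 1, h, w)
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalized heatmap
```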
The confusion matrix results in Figure 8 suggest that AG-YOLO identifies samples within the citrus category more accurately than the other models. Moreover, it misclassifies fewer samples of other categories as citrus than these other models do. This indicates the superiority of AG-YOLO in recognizing the citrus category and reducing misclassifications.

4.6. The Detection Performance of AG-YOLO in Different Weather Conditions and PR Curves

The detection performance of AG-YOLO in cloudy weather is illustrated in Figure 9, and its performance in sunny weather is depicted in Figure 10. Experimental results indicate that AG-YOLO demonstrates stable performance in occluded scenes under various lighting conditions, exhibiting strong robustness. Figure 11 illustrates the PR curves of AG-YOLO.

5. Conclusions

This study aims to address the issue of missed detections and low detection rates in citrus target detection due to severe occlusions. To address this challenge, a citrus occlusion dataset was constructed. An innovative network model named AG-YOLO was proposed, incorporating the NextViT backbone along with the Global Context Fusion Module (GCFM). This design led to a significant enhancement in the model’s performance and robustness.
NextViT, serving as the backbone of AG-YOLO, surpasses the limitations of traditional convolutional neural networks by introducing a Transformer-based feature extraction method. This allows the model to better capture the global context information within images. NextViT accurately captures the correlation between target regions and their surrounding environment, enhancing target localization accuracy. This robustly addresses the issue of missed detections caused by severe occlusions in citrus detection.
Simultaneously, the Global Context Fusion Module plays a critical role in AG-YOLO. It utilizes self-attention mechanisms to facilitate the interaction and fusion of local and global features. This is crucial in addressing the low detection rates of occluded targets in complex scenes, as occluded targets are often influenced by factors such as complex environments and indistinct image details. However, the Global Context Fusion Module can capture global associative information through contextual information fusion mechanisms, improving the detection effectiveness of occluded targets.
In the experiment, a comprehensive evaluation of the AG-YOLO model was conducted on the citrus occlusion dataset, comparing it against classic CNN-based detection methods as well as the latest Transformer-based detection methods. The results indicate a significant improvement in AG-YOLO, with its precision (P), recall (R), mAP at IoU 50% (mAP50), mAP at IoU 50–95% (mAP50:95), AP for small objects ($AP_s$), AP for medium objects ($AP_m$), and AP for large objects ($AP_l$) reaching 90.6%, 73.4%, 83.2%, 60.3%, 18.8%, 47.2%, and 69.2%, respectively. Such results confirm the effectiveness and practicality of the approach, illustrating AG-YOLO’s superiority in addressing severe occlusion issues in fruit detection.
In summary, the proposed AG-YOLO model in this study successfully addressed the challenge of severe occlusion in citrus detection by effectively leveraging the global perceptual capabilities of the NextViT backbone network and the feature interaction and fusion abilities of the Global Context Fusion Module. This research contributes to the advancement of object detection techniques and offers valuable insights and explorations for research and practical applications in agricultural intelligence.

Author Contributions

Conceptualization, Y.L. (Yishen Lin) and Y.L. (Yun Liang); methodology, Y.L. (Yishen Lin) and Z.H.; software, Y.L. (Yishen Lin) and Z.H.; validation, Z.H.; investigation, Z.H.; resources, Y.L. (Yun Liang); data curation, Z.H., Y.L. (Yunfan Liu), and W.J.; writing—original draft preparation, Y.L. (Yishen Lin); writing—review and editing, Y.L. (Yishen Lin); supervision, Y.L. (Yun Liang); project administration, Y.L. (Yun Liang). All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Hamuda, E.; Mc Ginley, B.; Glavin, M.; Jones, E. Improved image processing-based crop detection using Kalman filtering and the Hungarian algorithm. Comput. Electron. Agric. 2018, 148, 37–44. [Google Scholar] [CrossRef]
  2. Lu, J.; Sang, N. Detecting citrus fruits and occlusion recovery under natural illumination conditions. Comput. Electron. Agric. 2015, 110, 121–130. [Google Scholar] [CrossRef]
  3. Lin, X.; Li, L.; Gao, Z.; Yi, C.; Li, Q. Revised quasi-circular randomized Hough transform and its application in camellia-fruit recognition. Trans. Chin. Soc. Agric. Eng. 2013, 29, 164–170. [Google Scholar]
  4. Song, H.; He, D.; Pan, J. Recognition and localization methods of occluded apples based on convex hull theory. Trans. Chin. Soc. Agric. Eng. 2012, 28, 174–180. [Google Scholar]
  5. Feng, Q.; Chen, W.; Yang, Q. Identification and localization of overlapping tomatoes based on linear structured vision system. J. China Agric. Univ. 2015, 20, 100–106. [Google Scholar]
  6. Sun, S.; Wu, Q.; Tan, J.; Long, Y.; Song, H. Recognition and reconstruction of single apple occluded by branches. J. Northwest A&F Univ. (Nat. Sci. Ed.) 2017, 45, 138–146. [Google Scholar]
  7. Sun, J.; Sun, Y.; Zhao, R.; Ji, Y.; Zhang, M.; Li, H. Tomato Recognition Method Based on Iterative Random Circle and Geometric Morphology. Trans. Chin. Soc. Agric. Mach. 2019, 50, 22–26. [Google Scholar]
  8. Kuznetsova, A.; Maleva, T.; Soloviev, V. Using YOLOv3 algorithm with pre-and post-processing for apple detection in fruit-harvesting robot. Agronomy 2020, 10, 1016. [Google Scholar] [CrossRef]
  9. Sozzi, M.; Cantalamessa, S.; Cogato, A.; Kayad, A.; Marinello, F. Automatic bunch detection in white grape varieties using YOLOv3, YOLOv4, and YOLOv5 deep learning algorithms. Agronomy 2022, 12, 319. [Google Scholar] [CrossRef]
  10. Zhang, C.; Li, T.; Zhang, W. The detection of impurity content in machine-picked seed cotton based on image processing and improved YOLOV4. Agronomy 2021, 12, 66. [Google Scholar] [CrossRef]
  11. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
  12. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  13. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  14. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  15. Koirala, A.; Walsh, K.; Wang, Z.; McCarthy, C. Deep learning for real-time fruit detection and orchard fruit load estimation: Benchmarking of ‘MangoYOLO’. Precis. Agric. 2019, 20, 1107–1135. [Google Scholar] [CrossRef]
  16. Tian, Y.; Yang, G.; Wang, Z.; Wang, H.; Li, E.; Liang, Z. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput. Electron. Agric. 2019, 157, 417–426. [Google Scholar] [CrossRef]
  17. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  18. Yang, J.; Qian, Z.; Zhang, Y.; Qin, Y.; Miao, H. Real-time recognition of tomatoes in complex environments based on improved YOLOv4-tiny. Trans. Chin. Soc. Agric. Eng. 2022, 9, 215–221. [Google Scholar]
  19. Huang, T. Citrus fruit recognition method based on the improved model of YOLOv5. J. Huazhong Agric. Univ. 2022, 41, 170–177. [Google Scholar]
  20. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  21. Mu, Y.; Chen, T.S.; Ninomiya, S.; Guo, W. Intact detection of highly occluded immature tomatoes on plants using deep learning techniques. Sensors 2020, 20, 2984. [Google Scholar] [CrossRef]
  22. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778. [Google Scholar]
  23. Afonso, M.; Fonteijn, H.; Fiorentin, F.S.; Lensink, D.; Mooij, M.; Faber, N.; Polder, G.; Wehrens, R. Tomato fruit detection and counting in greenhouses using deep learning. Front. Plant Sci. 2020, 11, 571299. [Google Scholar] [CrossRef]
  24. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969. [Google Scholar]
  25. Bi, S.; Gao, F.; Chen, J.; Zhang, L. Detection Method of Citrus Based on Deep Convolution Neural Network. Trans. Chin. Soc. Agric. Mach. 2019, 50, 181–186. [Google Scholar]
  26. Li, J.; Xia, X.; Li, W.; Li, H.; Wang, X.; Xiao, X.; Wang, R.; Zheng, M.; Pan, X. Next-vit: Next generation vision transformer for efficient deployment in realistic industrial scenarios. arXiv 2022, arXiv:2207.05501. [Google Scholar]
  27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  28. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  29. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  30. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  31. Tan, M.; Le, Q. Efficientnet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114. [Google Scholar]
  32. Liu, S.; Li, F.; Zhang, H.; Yang, X.B.; Qi, X.; Su, H.; Zhu, J.; Zhang, L. DAB-DETR: Dynamic Anchor Boxes are Better Queries for DETR. arXiv 2022, arXiv:2201.12329v4. [Google Scholar]
  33. Li, F.; Zhang, H.; Liu, S.; Guo, J.; Ni, L.M.; Zhang, L. DN-DETR: Accelerate DETR Training by Introducing Query DeNoising. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 13609–13617. [Google Scholar]
  34. Liu, Y.; Zhang, Y.; Wang, Y.; Zhang, Y.; Tian, J.; Shi, Z.; Fan, J.; He, Z. SAP-DETR: Bridging the Gap Between Salient Points and Queries-Based Transformer Detector for Fast Model Convergency. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 18–22 June 2023; pp. 15539–15547. [Google Scholar]
  35. Wang, Y.; Zhang, X.; Yang, T.; Sun, J. Anchor DETR: Query Design for Transformer-Based Object Detection. arXiv 2021, arXiv:2109.07107. [Google Scholar]
  36. Meng, D.; Chen, X.; Fan, Z.; Zeng, G.; Li, H.; Yuan, Y.; Sun, L.; Wang, J. Conditional DETR for Fast Training Convergence. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3631–3640. [Google Scholar]
Figure 1. Images of citrus fruit under various conditions.
Figure 2. Network architecture of AG-YOLO.
Figure 3. Next-generation convolutional module diagram.
Figure 4. Next-generation Transformer module diagram.
Figure 5. Global Context Fusion Module.
Figure 6. Comparisons of image detection results and GradCAM thermograms for localized shading conditions in unripe fruits.
Figure 7. Comparisons of image detection results and GradCAM thermograms for localized shading conditions in ripe fruits.
Figure 8. Confusion matrix diagrams for the models.
Figure 9. AG-YOLO’s performance in cloudy weather.
Figure 10. AG-YOLO’s performance in sunny weather.
Figure 11. PR curves of AG-YOLO.
Table 1. Citrus fruit occlusion dataset.

Dataset        Occlusion   Severe Occlusion   Overlap   Severe Overlap
Unripe fruit   129         58                 100       20
Ripe fruit     337         142                102       69
Table 2. Comparative experiments with CNN-based detectors.

Network          P/%    R/%    mAP50/%   mAP50:95/%
AG-YOLO (ours)   90.6   73.4   83.2      60.3
YOLOv5s          86.7   67.7   79.0      57.1
YOLOv7           85.4   69.6   79.9      53.2
YOLOv8s          85.1   67.5   77.6      57.0
Faster R-CNN     50.6   81.1   77.3      47.2
SSD              87.8   61.5   70.9      42.8
Table 3. Comparative experiments with Transformer-based detectors.

Network            mAP50/%   mAP50:95/%   AP_s/%   AP_m/%   AP_l/%
AG-YOLO (ours)     83.2      60.3         18.8     47.2     69.2
DAB-DETR           77.5      48.5         12.0     39.0     65.0
DN-DETR            74.0      49.1         15.8     38.6     65.5
SAP-DETR           78.4      50.9         15.5     41.4     67.0
Anchor-DETR        76.3      47.7         12.9     38.2     63.9
Conditional-DETR   78.0      49.6         13.8     40.0     66.1
Table 4. Ablation experiments of NextViT backbone and Global Context Fusion Module.

Network            Backbone       P/%    R/%    mAP50/%   mAP50:95/%
AG-YOLO w/o GCFM   CSPDarknet53   86.7   67.7   79.0      57.1
AG-YOLO w/o GCFM   NextViT        92.3   72.1   83.0      59.8
AG-YOLO w/ GCFM    CSPDarknet53   86.7   67.0   78.4      57.3
AG-YOLO w/ GCFM    NextViT        90.6   73.4   83.2      60.3

