Article

Wampee-YOLO: A High-Precision Detection Model for Dense Clustered Wampee in Natural Orchard Scenario

1 College of Mathematics and Informatics, South China Agricultural University, Guangzhou 510642, China
2 Edwardson School of Industrial Engineering, Purdue University, West Lafayette, IN 47907, USA
3 School of Marine Science and Technology, Shanwei Institute of Technology, Shanwei 516600, China
* Authors to whom correspondence should be addressed.
† These authors contributed equally to this work.
Horticulturae 2026, 12(2), 232; https://doi.org/10.3390/horticulturae12020232
Submission received: 16 January 2026 / Revised: 11 February 2026 / Accepted: 12 February 2026 / Published: 14 February 2026
(This article belongs to the Section Fruit Production Systems)

Abstract

Wampee (Clausena lansium) harvesting currently relies heavily on manual labor, but automation is significantly hindered by clustered fruit growth patterns, small fruit sizes, and complex orchard backgrounds, which make accurate detection highly challenging. This study proposes Wampee-YOLO, a lightweight and high-precision model based on the YOLO11n architecture, specifically designed for real-time wampee detection in natural orchard environments. The proposed model integrates several architectural enhancements: the RFEMAConv module for expanded receptive fields, an AIFI module for improved small-target interaction, and a C2PSA-MSCADYT structure to boost multi-scale adaptability. Additionally, a Triplet Attention mechanism strengthens multi-dimensional feature representation, while an AFPN-Pro2345 neck structure optimizes cross-scale feature fusion. Experimental results demonstrate that Wampee-YOLO achieves an mAP50 of 90.3%, a precision of 92.1%, and an F1 score of 87%. This represents a significant 3.4% mAP50 improvement over the YOLO11n baseline, with only a slight increase in parameters to 3.28 M. Ablation studies further confirm that the AFPN-Pro2345 module provides the most substantial performance gain, increasing mAP50 by 2.4%. The model effectively balances computational efficiency with detection accuracy. These findings indicate that Wampee-YOLO offers a robust and efficient visual detection solution suitable for deployment on resource-constrained edge devices in smart orchard applications.

1. Introduction

Wampee, an evergreen fruit tree of the genus Clausena in the family Rutaceae, originated in southern China and boasts a cultivation history of at least 1500 years. It is widely cultivated across various regions in China, including Guangdong, Guangxi, Fujian, Yunnan, Sichuan, Hainan, and Taiwan [1]. Notably, Yunan County in Guangdong Province is the world's largest wampee cultivation region. As of 2023, the planting area in Yunan County reached 12,800 hectares (1.28 × 10⁴ ha), accounting for nearly 75% of the global cultivation area, with an annual yield of 108,000 tons and an output value exceeding 688 million USD [2]. Possessing a unique flavor and abundant bioactive substances, wampee fruits are highly favored by consumers [3]. According to the 2024–2030 China Wampee Industry Market Research and Investment Potential Forecast Report, the annual production of wampee in China surged from approximately 1 million tons in 2010 to over 2 million tons in 2020, with market supply increasing year by year. On the demand side, domestic consumption is projected to reach 3 million tons by 2025. In the face of this rapidly growing demand, intelligent harvesting has become a critical milestone for elevating the wampee industry to a higher level of development. Currently, however, the harvesting process still relies predominantly on manual labor; this traditional approach not only requires significant manpower and involves high labor intensity but is also time-consuming and inefficient. The core of realizing intelligent harvesting lies in the object detection capabilities of visual control systems. Therefore, research into efficient wampee detection technologies tailored for natural orchard scenarios is of great significance for improving production and harvesting efficiency, reducing operational costs, and ensuring the sustainable and healthy development of the wampee industry.
Fruit detection technology is a key component of smart agriculture. Wampee fruit detection in natural scenarios presents significant challenges due to characteristics such as dense occlusion, varying light intensity, and small target size. Current object detection methods, both domestically and internationally, are broadly categorized into two main approaches: those based on traditional image processing, and those based on deep learning network models.
Object detection methods based on traditional image processing require the manual extraction of shallow features such as the target's color, shape, and texture, which are then detected using machine learning techniques such as Support Vector Machine (SVM) and Artificial Neural Network (ANN). Liu et al. [4] proposed a grape detection method based on color and texture features and employed a technique to separate useful features from the image to accelerate fruit detection. Experiments showed that the method partially eliminated background interference, achieving an average precision of 87%. Fu et al. [5] proposed a banana detection method that utilized HSV color features, HOG, and LBP texture features, using an SVM classifier to achieve banana detection in plantations. The results demonstrated that the proposed method was applicable for banana detection in plantations under varying lighting and occlusion conditions, achieving an average detection rate of 89.63%. Guo et al. [6] proposed a monocular machine vision-based method to detect clustered lychee fruits under overlapping conditions, achieving an accuracy exceeding 87%. However, this method was highly susceptible to variations in lychee illumination and texture information, often resulting in false detections when dealing with severely occluded lychee fruits. Yu et al. [7] proposed a mature lychee detection method based on depth information, involving the collection of lychee images with depth data using a depth camera, image segmentation using the depth information, and finally training a Random Forest binary classification model using color and texture features, thereby achieving accurate detection of lychee fruits in natural scenarios, with recognition accuracies of 89.92% for green lychees and 94.50% for mature lychees. However, applying traditional image processing methods for wampee detection in natural orchard scenarios is heavily influenced by factors such as illumination changes, occlusion by stems and leaves, similarity between target and background colors, and inter-fruit occlusion, leading to unsatisfactory detection accuracy. Furthermore, traditional methods require manually extracting texture and color features from fruit images; when environmental changes alter these hand-designed features, detection becomes subject to designer subjectivity and environmental complexity, resulting in poor generalization and low robustness and making such algorithms difficult to apply effectively in natural wampee orchard scenarios [8].
With the rapid development of deep learning, object detection based on Convolutional Neural Networks (CNNs) has demonstrated immense advantages. These models not only autonomously learn representative features but also perform direct detection on input images, with detection accuracy far exceeding traditional methods that rely on hand-crafted features [9]. Deep learning-based object detection algorithms are primarily categorized into two-stage and one-stage algorithms. Girshick et al. [10] first proposed the R-CNN two-stage detection model, pioneering the introduction of CNNs to enhance feature learning capabilities for object detection. Subsequent optimizations and improvements led to the creation of Mask R-CNN by He et al. [11], who replaced RoIPooling with RoIAlign to resolve the pixel misalignment issue and added target mask prediction. However, the serial two-step process of two-stage detection requires individual processing for a large number of candidate regions, which greatly reduces the detection speed. Furthermore, performance often deteriorates sharply when dealing with dense and small targets. Consequently, one-stage detection algorithms have gradually become mainstream. YOLO [12,13,14] has garnered widespread attention due to its excellent balance of speed and accuracy and has been utilized in many fruit detection studies based on deep learning. YOLO11, as a new model in the YOLO family, possesses the advantage of effectively handling objects in complex environments. Zhang et al. [15] proposed a YOLO11-Pear-based study for pears in complex orchard environments, resolving issues such as low detection accuracy and poor adaptability. Liao et al. [16] proposed the YOLO-MECD model based on the YOLO11s architecture, significantly improving the detection performance and accuracy for citrus detection. Wang et al. [17] used YOLO11n for tomato detection and semantic segmentation, proposing an optimized region tracking and counting method that effectively promoted the intelligent estimation of tomato yield. Li et al. [18] proposed the improved YOLO11-GS model, achieving efficient detection and localization of both grape fruits and stems. Du et al. [19] proposed the YOLO11-WAS model for apples, which achieves lightweight and high-precision detection of multi-species apples in complex orchard environments, significantly boosting detection accuracy while simultaneously reducing the number of parameters and floating-point operations.
A comprehensive analysis reveals that YOLO11 possesses strong detection advantages across various fruits, including pears, citrus, and tomatoes, and demonstrates accurate performance on cluster fruits like grapes and lychees at multiple scales. However, a review of the existing literature indicates that object detection research specifically for wampee has not yet been reported. Given that wampee, like grapes and lychees, is a cluster fruit, applying the YOLO11 object detection algorithm to wampee is feasible. Nevertheless, the current absence of a dedicated public dataset for wampee is a critical bottleneck hindering the in-depth development of intelligent wampee detection and harvesting research. A high-quality, diverse wampee image dataset is fundamental for training and evaluating deep learning models, and essential for capturing multi-scale features, occlusion variations, and lighting differences in wampee in natural orchard scenarios. Therefore, constructing a dedicated wampee dataset that covers different growth stages, environmental conditions, and complex backgrounds will not only provide necessary training and validation resources for algorithmic research but also lay the data foundation for the practical application and performance optimization of intelligent wampee harvesting technology. Simultaneously, as wampee fruits are small targets, many studies on small-target detection improvements have already been conducted on public datasets of other fruits. For instance, Nan et al. [20] proposed an intelligent detection method for multiple classes of dragon fruits in harvesting rows based on the WGB-YOLO network. Bai et al. [21] proposed an improved YOLOv7 algorithm incorporating a Swin Transformer prediction head for accurate recognition of strawberry seedlings, flowers, and fruits. Zhao et al. [22] combined YOLOv8 with C2f-RepGhost to propose Ta-YOLO to overcome the challenge of target occlusion in greenhouse tomato detection and counting.
To achieve effective detection of naturally growing wampee fruits, particular attention must be paid to balancing the detection model's accuracy and parameter count. To harmonize these two aspects, this study introduces a high-precision detection model, specifically designed for wampee detection in natural scenarios, while keeping the model's parameter volume within a practical range.
The main contributions of this study are summarized as follows:
  • Construction of a dedicated wampee detection dataset for natural scenarios: We constructed a specialized wampee detection dataset tailored for natural orchard scenarios. This dataset comprises wampee images covering different maturity levels (colors), shooting perspectives, lighting conditions, and occlusion scenarios. High-quality manual annotations were performed, providing a crucial benchmark data resource for intelligent wampee detection research.
  • Enhanced multi-scale feature extraction: To address the scale variability and complex background interference of wampee fruits in natural scenes, we integrated the receptive field enhancement and efficient multi-scale attention convolution (RFEMAConv) into the C3k2 module, forming the C3k2-RFEMAConv fundamental unit. This module enhances the robustness of the backbone network in extracting discriminative multi-scale fruit features from complex environments by expanding the effective receptive field and augmenting contextual information compensation.
  • Optimized small target feature localization: We introduced two improvements targeting the precision of small target feature representation. The AIFI module was used to enhance the original SPPF structure for better small target localization and recognition capabilities. Furthermore, a Triplet Attention mechanism was integrated at the end of the backbone to facilitate multi-dimensional feature interaction, thus strengthening the network’s ability to express and distinguish small target features.
  • Occlusion interference suppression: Targeting the feature blurring and localization difficulties caused by dense inter-fruit occlusion and leaf occlusion, we innovatively designed the C2PSA-MSCADYT module featuring dual parallel paths. The Multi-Scale Coordinate Attention (MSCA) branch of this module captures and fuses position-aware features across multiple spatial granularities, significantly enhancing the capability to distinguish and localize occluded fruit contours. Simultaneously, its Dynamic Tanh (DYT) activation branch optimizes the gradient flow and feature representation by adaptively adjusting the non-linear response, thereby enhancing the network’s learning stability and feature discrimination under complex occlusion.
  • Progressive adaptive feature pyramid (AFPN-Pro2345): The AFPN-Pro2345 feature pyramid improves the efficiency and effectiveness of multi-scale feature fusion by directly merging non-adjacent hierarchical layers. This mechanism protects the integrity of deep semantic features and shallow detail features during propagation, avoiding information loss caused by multiple sampling steps. Furthermore, it utilizes four feature layers of different scales from the backbone network to construct the feature pyramid, thereby helping the detection head accurately locate and identify objects of varying sizes.

2. Materials and Methods

The overall process flow of the research materials and methods is shown in Figure 1.

2.1. Dataset

2.1.1. Data Acquisition

Wampee images were collected from Longgang Village, Zhongluotan Town, Baiyun District, Guangzhou City, Guangdong Province (coordinates: 113°23′56.5″ E, 23°23′15.9″ N). A total of 877 images were captured in natural orchard scenarios using an iPhone 13 Pro Max camera (Apple Inc., Cupertino, CA, USA), with images saved in JPG format. As visualized in Figure 2, the dataset was constructed to maximize robustness by incorporating samples from diverse weather conditions, lighting angles, and occlusion levels. This diversity ensures that the model learns to generalize across different real-world conditions.
To ensure dataset robustness across different operational scenarios, image acquisition covered diverse weather conditions (sunny, cloudy, overcast), variable illumination intensities (strong direct light, natural light, weak light), and multi-dimensional shooting angles (top, eye-level, and bottom views). During acquisition, the shooting distance was dynamically adjusted by physically moving the camera closer to or further from the target (from 30 cm to 100 cm). This procedure was designed to simulate the continuous approach of an ‘eye-in-hand’ robotic system during an actual harvesting operation, capturing wampee fruits at various scales and resolutions. The dataset constructed in this study comprehensively covers the typical conditions encountered by wampee fruit in complex orchard environments. Based on phenotypic characteristics, the fruits were classified into two categories: ripe and unripe. Furthermore, extensive sampling was specifically conducted for challenging scenarios, including complex occlusions caused by high-density clustering and interference from cluttered foliage backgrounds. Consequently, the algorithm design and experimental evaluation in this study are centered on the model's robustness in four typical challenging scenarios: ripe fruit, unripe fruit, overlapping fruit, and background interference. It is important to note that these are testing subsets based on image attributes, not distinct classification categories for the model.

2.1.2. Data Labeling and Augmentation

This study utilized LabelMe to uniformly annotate all wampee fruits as a single class (‘Wampee’), regardless of maturity stage or size. This approach ensures the model learns general features applicable to complex orchard environments. Rectangular bounding boxes were drawn around each wampee fruit, and the annotations were exported as TXT label files containing the normalized center coordinates, width, and height of each target. To bolster the model's generalization capabilities and fortify network robustness, we employed data augmentation techniques to diversify the training dataset. Image samples of the different data augmentation techniques applied are shown in Figure 3.
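As a concrete reference, the mapping from LabelMe rectangles to this TXT format can be sketched as follows. This is a minimal illustration assuming LabelMe's default JSON output; the directory names and the single-class index are placeholders rather than the exact scripts used in this study.

```python
import json
from pathlib import Path

def labelme_to_yolo(json_path: Path, out_dir: Path, class_id: int = 0) -> None:
    """Convert one LabelMe JSON file to a YOLO-format TXT label file."""
    data = json.loads(json_path.read_text(encoding="utf-8"))
    img_w, img_h = data["imageWidth"], data["imageHeight"]
    lines = []
    for shape in data["shapes"]:
        if shape["shape_type"] != "rectangle":
            continue
        (x1, y1), (x2, y2) = shape["points"]
        # Normalized center coordinates, width, and height.
        cx = (x1 + x2) / 2 / img_w
        cy = (y1 + y2) / 2 / img_h
        w = abs(x2 - x1) / img_w
        h = abs(y2 - y1) / img_h
        lines.append(f"{class_id} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
    (out_dir / (json_path.stem + ".txt")).write_text("\n".join(lines), encoding="utf-8")

for jp in Path("labels_json").glob("*.json"):  # placeholder directory names
    labelme_to_yolo(jp, Path("labels_txt"))
```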
Through augmentation techniques such as affine transformation, vertical flipping, angle transformation, and random brightness adjustment, an additional 255 images were generated. These augmented images were integrated exclusively into the training set. The entire labeled dataset was then partitioned into training, validation, and test sets at an approximate ratio of 7:1.5:1.5, and it contains 40,736 annotated targets in total. The final training set contains 792 images, the validation set contains 172 images, and the test set contains 168 images.
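A pipeline combining these transformations can be sketched with the albumentations library as below; the library choice and parameter ranges are illustrative assumptions, not the exact augmentation settings of this study. Passing bbox_params keeps the YOLO-format boxes consistent with each transformed image.

```python
import albumentations as A
import cv2

# Mirrors the augmentations listed above: affine transform, vertical flip,
# rotation, and random brightness. Parameter ranges are illustrative.
transform = A.Compose(
    [
        A.Affine(translate_percent=0.05, scale=(0.9, 1.1), p=0.5),
        A.VerticalFlip(p=0.5),
        A.Rotate(limit=15, p=0.5),
        A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.0, p=0.5),
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

image = cv2.imread("wampee_0001.jpg")        # placeholder file name
bboxes = [[0.52, 0.48, 0.06, 0.07]]          # YOLO-format (cx, cy, w, h), normalized
out = transform(image=image, bboxes=bboxes, class_labels=[0])
aug_image, aug_bboxes = out["image"], out["bboxes"]
```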
According to the definition standard proposed by Aldubaikhi et al. [23], a target is classified as a small object when its normalized dimensions (relative to the height and width of the entire image) are less than 0.1. Figure 4 displays the size distribution of targets within the dataset: most wampee fruits have normalized dimensions relative to the entire image below 0.05 (highlighted by the red box in Figure 4I). This indicates that small-scale objects represent a substantial proportion of the dataset.
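Given YOLO-format labels, this proportion can be verified with a short scan such as the following; the label directory name is a placeholder.

```python
from pathlib import Path

small, total = 0, 0
for txt in Path("labels_txt").glob("*.txt"):
    for line in txt.read_text().splitlines():
        _, _, _, w, h = map(float, line.split())
        total += 1
        # Small object: both normalized width and height below 0.1.
        if w < 0.1 and h < 0.1:
            small += 1
print(f"{small}/{total} targets ({small / max(total, 1):.1%}) are small objects")
```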

2.2. Wampee-YOLO

2.2.1. Architecture

Real-time detection of wampee fruits in natural orchard scenarios demands both speed and accuracy, and YOLO series models have long been known for their superior balance of detection speed and end-to-end trainability compared with other detection architectures. Furthermore, YOLO11 has demonstrated strong capabilities in object localization and recognition within natural orchard scenarios, along with computational and deployment advantages. Based on these considerations, the improved network model proposed in this study is built upon YOLO11n. For wampee detection in natural settings, YOLO11 is well equipped for feature learning specific to wampee and can efficiently handle interference such as varying illumination and occlusion. Despite these advantages, YOLO11 exhibits certain limitations. Wampee fruits are small targets that grow in clusters, so inter-fruit occlusion occurs frequently and leaves often obstruct the fruits. Owing to the small fruit size and the diverse appearance variations present in natural scenarios, YOLO11 struggles to maintain the efficiency it achieves on large targets, often producing missed and false detections. Enhancing YOLO11's ability to handle such interference, particularly small targets and occlusion, necessarily involves sacrificing some of its lightweight character in exchange for significantly improved accuracy.
To address the aforementioned challenges, specifically ripe fruits, unripe fruits, overlapping fruits, and background interference in natural orchard scenarios, this paper introduces an enhanced model, Wampee-YOLO, the architecture of which is depicted in Figure 5. First, the C3k2 module in the original backbone is combined with the convolution module RFEMAConv (fusing receptive field enhancement and an efficient multi-scale attention mechanism) to form the C3k2-RFEMAConv module. By expanding the receptive field and optimizing multi-scale attention, the feature extraction capability for multi-scale wampee fruits is significantly improved. This effectively addresses interference from occlusion, illumination changes, and fruit scale variations in natural scenarios, thereby boosting detection precision and accuracy. Next, we propose the C2PSA-MSCADYT module to replace the original C2PSA module. The construction of this module includes two core designs: first, the integration of the Multi-Scale Coordinate Attention (MSCA) mechanism to replace the original attention component; and second, the addition of a dynamic activation branch (PSABlock-DYT), which incorporates a learnable Dynamic Tanh (DYT) component. The advantages of this design are twofold: the MSCA mechanism enhances the model's spatial localization capability under complex occlusion, while the dynamic activation mechanism improves the network's adaptability to changes in input features, thereby comprehensively strengthening the robust detection of wampee fruits in natural scenes.
In deep learning-based computer vision tasks, the attention mechanism has become a key technology for enhancing model performance. Traditional attention mechanisms, such as channel attention (e.g., the SE block) or spatial attention, often process different dimensions in isolation, ignoring the complex interdependencies among the three dimensions of channel, height, and width. Such single-dimensional attention calculation struggles to fully capture the dependencies within the feature map, thereby limiting the model's feature expression capability. In the context of wampee fruit detection in natural scenarios, which involves small targets and occlusion, the model's feature representation capacity is critically important. To address this limitation, we introduce the Triplet Attention mechanism, a lightweight and highly efficient attention module capable of simultaneously capturing cross-dimensional dependencies among the three dimensions of the input feature map (channel, height, and width) through three parallel branches. The module avoids any dimensionality reduction operations, significantly boosting the model's representation capability while incurring minimal computational overhead, thereby enhancing its robustness and detection accuracy for small targets [24]. Finally, we propose AFPN-Pro2345 to improve the neck structure of YOLO11, which achieves cross-scale feature fusion across four stages through multiple up-sampling and down-sampling operations. This further enhances detection accuracy for complex multi-scale target scenarios such as wampee detection in natural orchards. Figure 5A illustrates the basic architecture of the Wampee-YOLO model.
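For reference, a minimal PyTorch sketch of the standard Triplet Attention module [24] is given below; it mirrors the published module rather than any Wampee-YOLO-specific variant, and its placement at the end of the backbone follows Figure 5A.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooled features along the first feature axis."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True).values,
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    def __init__(self, kernel_size: int = 7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2, bias=False),
            nn.BatchNorm2d(1),
        )
    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three parallel branches capture (C,W), (C,H), and (H,W) interactions;
    the outputs are averaged, with no dimensionality reduction anywhere."""
    def __init__(self):
        super().__init__()
        self.cw = AttentionGate()  # channel-width branch
        self.ch = AttentionGate()  # channel-height branch
        self.hw = AttentionGate()  # plain spatial branch
    def forward(self, x):
        # Rotate the tensor so each gate attends across a different dim pair.
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)
        x_ch = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)
        return (x_cw + x_ch + self.hw(x)) / 3.0
```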

2.2.2. C3k2-RFEMAConv

To address the challenges of scale variability, complex occlusion, and illumination changes encountered during wampee fruit detection in natural scenarios, an innovative C3k2-RFEMAConv module was designed. This module replaces the basic convolutional components within the original C3k2 structure with an advanced architecture that fuses Receptive Field Enhancement and Efficient Multi-scale Attention. While maintaining the multi-scale processing advantage of the Feature Pyramid Network (FPN), this module achieves intelligent optimization of multi-scale features through dynamic receptive field adjustment and an adaptive feature selection mechanism. This significantly enhances the model’s capacity to capture critical wampee fruit features, allowing the network to adaptively focus on the key discriminative features of fruits at varying scales.
C3k2-RFEMAConv replaces the standard convolution with RFConv for receptive field expansion and EMA for multi-scale attention. Taking the multi-scale feature maps from the backbone network as input, the RF module first significantly expands the receptive field through spatial reorganization of the feature map without increasing computational complexity, thereby enhancing the utilization of contextual information surrounding the wampee fruits. This module effectively expands the features of the input feature map and then generates enhanced feature maps via dimension rearrangement, successfully preserving the detailed features of wampee in natural scenarios while expanding the context awareness range. Subsequently, the Efficient Multi-scale Attention (EMA) mechanism is adopted in place of traditional single-scale attention. EMA captures the spatial and texture features of wampee fruits at multiple granular levels by processing three feature branches of different scales in parallel: 1 × 1, 3 × 3, and adaptive large-kernel convolutions. The primary innovation of this mechanism resides in its ability to retain and enhance channel information efficiently through cross-scale feature interactions, thereby mitigating the information loss typically associated with dimensionality reduction in conventional attention modules. By strengthening feature representation through a channel information prioritization mechanism, the module adaptively selects the granularity appropriate to wampee features. Finally, the module retains the dual-branch processing advantage of the C3k2 architecture. Figure 5B shows the structural diagram of C3k2-RFEMAConv, where one branch performs deep feature extraction through multiple Bottleneck-RFEMAConv units, and the other maintains an identity mapping to ensure complete preservation of the original feature information. This design effectively avoids the vanishing gradient problem and improves training stability while strengthening feature expression capability. Bottleneck-RFEMAConv is the core construction unit for deep feature extraction in this module, and its structure is shown in Figure 5D. It consists of two convolutional layers: the first uses a standard 3 × 3 convolution for channel adjustment and initial feature extraction, while the second integrates RFEMAConv, generating multi-scale features through group convolution and achieving explicit receptive-field expansion and dynamic enhancement of key features through feature space rearrangement and the EMA mechanism. This bottleneck structure retains optional residual connections: when the numbers of input and output channels are equal, the original and deep features are added through a shortcut path, effectively promoting gradient flow and feature reuse. In complex scenarios, it effectively alleviates the detail loss caused by increasing network depth, which is particularly beneficial for the stable detection of occluded and small-scale wampee fruits.
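The following PyTorch sketch illustrates the two ideas, receptive-field expansion via feature rearrangement and multi-scale attention gating, in a deliberately simplified form; kernel sizes, grouping, and the attention branches are assumptions, and this is an illustrative approximation rather than the authors' exact RFEMAConv implementation.

```python
import torch
import torch.nn as nn

class RFEMAConvSketch(nn.Module):
    """Illustrative approximation of RFEMAConv: receptive-field expansion via
    feature rearrangement (PixelShuffle), then a lightweight multi-scale
    attention gate standing in for EMA. Not the authors' exact module."""
    def __init__(self, c_in: int, c_out: int, up: int = 2):
        super().__init__()
        # Generate up*up maps per output channel, rearrange them spatially to
        # enlarge the effective receptive field, then restore the resolution.
        self.rf = nn.Sequential(
            nn.Conv2d(c_in, c_out * up * up, 3, padding=1, bias=False),
            nn.PixelShuffle(up),
        )
        self.down = nn.Conv2d(c_out, c_out, up, stride=up, groups=c_out, bias=False)
        # Parallel 1x1 and depthwise 3x3 branches gate the channels (EMA stand-in).
        self.attn1 = nn.Conv2d(c_out, c_out, 1, bias=False)
        self.attn3 = nn.Conv2d(c_out, c_out, 3, padding=1, groups=c_out, bias=False)
        self.bn, self.act = nn.BatchNorm2d(c_out), nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.down(self.rf(x))
        gate = torch.sigmoid(self.attn1(y) + self.attn3(y))
        return self.act(self.bn(y * gate))

class BottleneckRFEMAConv(nn.Module):
    """Bottleneck unit of Figure 5D: a standard 3x3 conv followed by the
    RFEMAConv sketch, with a shortcut when channel counts match."""
    def __init__(self, c_in: int, c_out: int):
        super().__init__()
        self.cv1 = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, padding=1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU(),
        )
        self.cv2 = RFEMAConvSketch(c_out, c_out)
        self.add = c_in == c_out

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.cv2(self.cv1(x))
        return x + y if self.add else y
```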

2.2.3. C2PSA-MSCADYT

In orchard or agricultural detection, occlusion is a significant factor affecting model accuracy [25]. In scenarios where wampee fruits are partially occluded by foliage, branches, or adjacent clusters, conventional object detection frameworks frequently exhibit diminished performance in pinpointing precise coordinates and delineating contours, ultimately resulting in a significant degradation of detection precision. As shown in Figure 5C, this paper proposes an improved C2PSA module, named C2PSA-MSCADYT, which integrates the Multi-Scale Coordinate Attention (MSCA) mechanism and the Dynamic Tanh (DYT) activation function into the PSABlock module. Experimental results demonstrate that the improved module effectively enhances the detection performance of wampee fruits under occluded conditions in complex orchard environments.
The architecture of the C2PSA-MSCADYT module employs a dual-path parallel processing design. The input features are processed through two independent branches before subsequent fusion: the Multi-Scale Coordinate Attention (MSCA) branch is responsible for capturing multi-level spatial features, while the Dynamic Activation (DYT) branch focuses on optimizing the non-linear transformation of features. This dual-path topological design expands the dimensionality of the feature representation space, enhancing the network's ability to distinguish fruits from interference in complex backgrounds. Within the MSCA branch, the designed MSCA component adopts a combined structure of multi-scale depthwise convolution and coordinate attention, as shown in Figure 5E. By processing 1 × 1, 3 × 3, and adaptive large-kernel convolutions in parallel, this component captures the spatial features of wampee fruits at multiple granularity levels, allowing the network to better leverage the correlation between position and channel information and thereby optimize feature representation under intra-class occlusion. The added coordinate attention mechanism significantly enhances the model's localization capability for partially occluded fruits by embedding precise spatial position information into channel attention via bi-directional pooling.
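A simplified sketch of such a branch is shown below, combining parallel multi-scale depthwise convolutions with coordinate attention via bi-directional pooling; the kernel sizes and reduction ratio are assumptions for illustration, not the exact MSCA design.

```python
import torch
import torch.nn as nn

class MSCABranchSketch(nn.Module):
    """Illustrative sketch of the MSCA branch: multi-scale depthwise convs
    followed by coordinate attention. Kernels and reduction are assumptions."""
    def __init__(self, c: int, reduction: int = 8):
        super().__init__()
        # Multi-scale depthwise branches (1x1, 3x3, and a larger 7x7 kernel).
        self.dw1 = nn.Conv2d(c, c, 1, groups=c, bias=False)
        self.dw3 = nn.Conv2d(c, c, 3, padding=1, groups=c, bias=False)
        self.dw7 = nn.Conv2d(c, c, 7, padding=3, groups=c, bias=False)
        # Coordinate attention: joint squeeze, then per-direction gates.
        mid = max(c // reduction, 8)
        self.squeeze = nn.Sequential(nn.Conv2d(c, mid, 1, bias=False),
                                     nn.BatchNorm2d(mid), nn.SiLU())
        self.gate_h = nn.Conv2d(mid, c, 1)
        self.gate_w = nn.Conv2d(mid, c, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        y = self.dw1(x) + self.dw3(x) + self.dw7(x)            # fuse multi-scale features
        ph = y.mean(dim=3, keepdim=True)                        # pool along width  -> (n,c,h,1)
        pw = y.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # pool along height -> (n,c,w,1)
        z = self.squeeze(torch.cat([ph, pw], dim=2))            # joint positional encoding
        zh, zw = torch.split(z, [h, w], dim=2)
        ah = torch.sigmoid(self.gate_h(zh))                     # height gate (n,c,h,1)
        aw = torch.sigmoid(self.gate_w(zw.permute(0, 1, 3, 2))) # width gate  (n,c,1,w)
        return y * ah * aw                                      # position-aware reweighting
```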
In the Dynamic Activation branch, the integrated Dynamic Tanh component adaptively adjusts the activation function's response curve through learnable parameters. This design augments the flexibility of feature transformation and optimizes the trajectory of gradient propagation. By effectively alleviating the vanishing gradient phenomenon, it ensures robust training stability for complex architectural configurations. By dynamically adjusting the shape of the activation function, this component significantly enhances the model's ability to learn the distinctive color features and texture patterns of wampee fruits.
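A minimal sketch of such a Dynamic Tanh unit, with a learnable input scale and per-channel affine parameters, could look as follows; the initialization values are assumptions.

```python
import torch
import torch.nn as nn

class DynamicTanh(nn.Module):
    """Sketch of a Dynamic Tanh (DyT) unit: tanh with a learnable input scale
    alpha and per-channel affine parameters gamma and beta, so the response
    curve adapts during training. Initialization values are assumptions."""
    def __init__(self, channels: int, alpha_init: float = 0.5):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(alpha_init))      # global slope
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1)) # per-channel scale
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1)) # per-channel shift

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.gamma * torch.tanh(self.alpha * x) + self.beta
```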

2.2.4. AFPN-Pro2345

To address the challenges of multi-scale object detection in complex orchard environments and optimize the utilization of computational resources, this study proposes the Progressive Asymptotic Feature Pyramid Network (AFPN-Pro2345) module to replace the traditional Feature Pyramid Network (FPN) structure. Building upon the multi-scale fusion foundation of the conventional AFPN, this module innovatively introduces a progressive computational resource allocation strategy and a progressive scale fusion mechanism. This significantly enhances the detection capability for small and occluded targets. The core structure of this module is illustrated in Figure 5A.
The design of the AFPN-Pro2345 module incorporates two core “progressive” ideas. First, in terms of feature extraction depth, the module adopts a progressively decreasing BasicBlock configuration strategy. Based on the recognition that low-level features contain more detailed information and are particularly crucial for object detection, this strategy allocates more computational resources (four BasicBlocks each) to high-resolution feature maps such as P2 and P3. In contrast, the number of BasicBlocks is progressively reduced (three for P4 and two for P5) for feature maps with stronger semantic information but lower resolution. This design optimizes the distribution of computational resources, concentrating more computing power on detail preservation and the reinforcement of small target features. Second, concerning the multi-scale feature fusion path, the module employs a progressively expanded fusion strategy, the workflow of which is clearly reflected in Figure 5A. Specifically, the fusion process proceeds sequentially through three stages: the first stage is Dual-Scale Fusion, involving the bidirectional fusion of P2 and P3 features via the ASFF2 module; the second stage is Tri-Scale Fusion, where the preliminarily fused P2 and P3 features, along with the original P4 feature, are collectively input into the ASFF3 module to further interact with contextual information derived from P4; the third stage is Quad-Scale Fusion, where the processed P2, P3, and P4 features from the previous two stages, along with the original P5 feature, are simultaneously fed into the ASFF4 module. ASFF4 serves as the final aggregator, integrating all information ranging from high-resolution details (P2) to high-level semantics (P5), constructing a multi-scale feature pyramid with comprehensive representation capabilities.
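The per-pixel weighting at the heart of each ASFF stage can be sketched as follows for the dual-scale case; the channel handling and resampling scheme are simplified assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASFF2(nn.Module):
    """Sketch of dual-scale adaptive spatial feature fusion (stage one of
    AFPN-Pro2345): each input contributes per-pixel softmax weights, so the
    network learns where each scale should dominate."""
    def __init__(self, channels: int):
        super().__init__()
        self.w2 = nn.Conv2d(channels, 1, 1)  # weight logits for the P2 input
        self.w3 = nn.Conv2d(channels, 1, 1)  # weight logits for the P3 input
        self.out = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, p2: torch.Tensor, p3: torch.Tensor) -> torch.Tensor:
        # Bring P3 up to P2's resolution before fusing.
        p3_up = F.interpolate(p3, size=p2.shape[2:], mode="nearest")
        logits = torch.cat([self.w2(p2), self.w3(p3_up)], dim=1)
        a = torch.softmax(logits, dim=1)  # per-pixel fusion weights, sum to 1
        fused = a[:, 0:1] * p2 + a[:, 1:2] * p3_up
        return self.out(fused)
```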
Through a progressive fusion process that moves from simplicity to complexity, the AFPN-Pro2345 module ensures that low-level detail features are fully preserved and enhanced during the early fusion stages. As the fusion stages advance, these features are progressively and effectively integrated with higher-level, more abstract semantic information. This design enables the model to achieve comprehensive feature coverage of wampee fruits—ranging from local details to global context—thereby allowing for the precise identification of wampee fruits with varying degrees of occlusion and scale in complex natural scenarios.

2.3. Experimental Settings

2.3.1. Experimental Environment and Training Settings

The model training and experiments for the proposed Wampee-YOLO were conducted on a server equipped with an Intel Core i5-10400F processor (base frequency 2.90 GHz), 40 GB of RAM, and 4 TB of storage; the core computational unit was an NVIDIA GeForce RTX 3090 GPU. The software environment was built upon PyTorch 1.13.1, CUDA 11.7, and Python 3.8.19 to fully leverage the parallel computing capabilities of the GPU. All input images were uniformly resized to a resolution of 640 × 640 pixels to ensure consistency across the dataset. The AdamW optimizer was employed with an initial learning rate of 0.0001, a momentum of 0.937, and a weight decay coefficient of 0.0005 to enhance the model's generalization ability and mitigate overfitting; training ran for 200 epochs, with the IoU loss function guiding the optimization process. The main parameter configurations used for training are detailed in Table 1.
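For reproducibility, a training run with these settings can be sketched via the Ultralytics API as below; the model and dataset YAML file names are illustrative placeholders.

```python
from ultralytics import YOLO

# "wampee-yolo.yaml" (the modified architecture) and "wampee.yaml" (the
# dataset config) are illustrative names, not files shipped with this paper.
model = YOLO("wampee-yolo.yaml")
model.train(
    data="wampee.yaml",
    epochs=200,
    imgsz=640,
    optimizer="AdamW",
    lr0=0.0001,
    momentum=0.937,
    weight_decay=0.0005,
)
metrics = model.val()  # evaluate precision, recall, and mAP on the val split
```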

2.3.2. Evaluation Metrics

To quantitatively assess model performance, we employed a suite of evaluation metrics, including precision (P), recall (R), F1 score, and mean average precision (mAP) for predictive efficacy, alongside parameter count to characterize architectural complexity. Specifically, precision quantifies the proportion of true positive detections relative to the total number of positive predictions. The mathematical formulation for precision is as follows:
$$P = \frac{TP}{TP + FP}$$
where TP denotes the number of true positives (correctly identified positive instances) and FP represents false positives (incorrectly identified positive instances). Similarly, recall (R) quantifies the model’s ability to capture all relevant instances, defined as the proportion of correctly predicted positive samples relative to the total number of actual positive samples. The mathematical expression for recall is as follows:
$$R = \frac{TP}{TP + FN}$$
where FN denotes the number of false negatives (positive instances incorrectly predicted as negative).
The F1 score serves as a comprehensive metric that balances precision and recall by calculating their harmonic mean. It provides a single measure of a model’s predictive efficacy, particularly when a trade-off between precision and recall is required. A higher F1 score indicates a more robust model that effectively minimizes both misidentifications and missed detections. The formula is expressed as follows:
$$F_1 = \frac{2 \times P \times R}{P + R}$$
mAP is a standard performance benchmark in object detection. This metric is derived by first computing the average precision (AP) for each individual class, defined as the area under the Precision-Recall curve, and subsequently averaging these AP values across all categories to provide a holistic assessment of detection accuracy. The mathematical formulation is as follows:
$$\mathrm{mAP} = \frac{1}{N} \sum_{i=1}^{N} AP_i$$
where N is the total number of categories and AP_i is the AP value of the i-th category. In this study, mAP50 was adopted as the primary evaluation criterion. Under this metric, a detection is validated as a true positive only if the Intersection over Union (IoU) between the predicted bounding box and the ground truth exceeds a threshold of 0.5; this threshold determines which detections are counted as true positives in the subsequent precision-recall calculations.
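The metric computations above reduce to a few lines of code; the sketch below shows the IoU test underlying the 0.5 threshold together with precision, recall, and F1 from raw counts (the counts shown are illustrative only).

```python
def iou(box_a, box_b):
    """Intersection over Union for boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def prf1(tp: int, fp: int, fn: int):
    """Precision, recall, and F1 score from raw counts, per the formulas above."""
    p = tp / (tp + fp + 1e-9)
    r = tp / (tp + fn + 1e-9)
    return p, r, 2 * p * r / (p + r + 1e-9)

# A prediction counts as a true positive at mAP50 only if IoU > 0.5.
assert iou((0, 0, 10, 10), (5, 5, 15, 15)) < 0.5
print(prf1(tp=921, fp=79, fn=193))  # illustrative counts only
```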
Parameters denotes the total count of trainable variables within the neural network, encompassing all weights and biases. This metric serves as a key indicator of the model’s structural complexity and its inherent learning capacity. Computational complexity is quantified by Giga Floating-point Operations (GFLOPs), which describes the demand for computational resources during inference. These model complexity metrics, including both Parameters and GFLOPs, serve to clarify the trade-off between detection performance and computational cost under practical deployment constraints.

3. Experimental Results and Analysis

3.1. Ablation Study

The Wampee-YOLO model proposed in this study is developed upon the YOLO11n baseline, integrating five distinct architectural enhancements. To systematically evaluate the individual and collective contributions of these modules, an exhaustive series of ablation experiments was conducted, the results of which are detailed in Table 2. As shown in Table 2, the effectiveness of each proposed module was validated. When the modules were added individually, the AFPN-Pro2345 module yielded the best performance, raising mAP50 and F1 score by 2.4% and 1.9%, respectively; the other single additions also contributed improvements: the AIFI module increased mAP50 by 0.4% and F1 score by 0.5%; the C3k2-RFEMAConv module provided an mAP50 increase of 0.2% and an F1 increase of 0.3%; the Triplet Attention module resulted in a 0.3% increase in mAP50 and 0.4% in F1 score; and the C2PSA-MSCADYT module led to increases of 0.1% in mAP50 and 0.3% in F1 score. In the progressive combination experiments, building upon the AFPN-Pro2345 module, the addition of the AIFI module further increased mAP50 and F1 score by 0.1% each. Subsequently, the inclusion of the C3k2-RFEMAConv module added 0.5% to mAP50 and 0.3% to F1 score. Following that, the inclusion of the C2PSA-MSCADYT module increased mAP50 by 0.3% and F1 score by 0.4%. Finally, the addition of the Triplet Attention module resulted in marginal increases of 0.1% in mAP50 and 0.2% in F1 score. Notably, the final Wampee-YOLO model achieved a significant improvement over the initial YOLO11n model, with mAP50 increasing by 3.4%, and precision, recall, F1 score, and mAP50–95 enhanced by 2%, 3.8%, 2.9%, and 3.6%, respectively. While these enhancements involved a moderate increase in model complexity, with parameters rising by 0.7 M and GFLOPs increasing from 6.3 to 21.9, the substantial gains in detecting small-target wampee fruits in complex orchard environments justify this trade-off between computational cost and detection accuracy.

3.2. Comparison Experiments

3.2.1. Comparison of Different Attention Modules

To further validate the effectiveness of the global, inter-group, and intra-group interactions of the Triplet Attention module in the backbone network, the Triplet Attention in the Wampee-YOLO model was replaced with different attention modules, including SEAM, CBAM, AFGCAttention, BAMblock, and LSKBlock. These highly efficient attention modules have been widely adopted by researchers in recent years, and comparing them with Triplet Attention allows a more comprehensive evaluation of the backbone network's strengths and weaknesses. The comparative results of Wampee-YOLO with these different attention modules are presented in Table 3.
The precision, recall, F1 score, mAP50, and mAP50–95 of Triplet Attention were 92.1%, 82.7%, 87%, 90.3%, and 58.3%, respectively, ranking first among the six attention modules. Notably, while maintaining the lowest parameter count (3.28 M), Triplet Attention also demonstrated highly competitive computational efficiency at 21.9 GFLOPs, which is nearly on par with the minimum value in the group (21.8 GFLOPs). This combination of the best detection accuracy and minimal model size identifies Triplet Attention as the most efficient solution for wampee detection.

3.2.2. Comparison of Different Detection Models

To further validate the effectiveness of the improved model, this study selected several state-of-the-art object detection models for comparative experiments, all trained on the same dataset. This comprehensive comparison aimed to evaluate the Wampee-YOLO model’s capability for wampee fruit detection in natural scenarios. The analysis seeks to elucidate the performance differences among the models and highlight the superior accuracy of Wampee-YOLO. In this comparative study, the proposed Wampee-YOLO model was evaluated against five renowned object detection algorithms: YOLOv5n [26], YOLOv8n [14], YOLOv10n [27], YOLO11n [28], and RT-DETR [29]. The YOLO series constitutes the most popular single-stage object detection models in recent years. By comparing the advantages and disadvantages of the Wampee-YOLO model’s performance against these alternatives, a more comprehensive assessment of its capability was achieved. All experimental comparison results are presented in Table 4.
The model validation results are shown in Figure 6. As illustrated by the precision-epoch trajectories in Figure 6A, the majority of the evaluated models exhibit robust convergence, with precision scores surpassing 90%. However, the curves for RT-DETR and YOLOv10n show noticeable fluctuations in certain sections, suggesting that these two models suffered from error propagation and information loss during training, resulting in unstable detection of multi-scale targets. Wampee-YOLO demonstrates superiority among the models, with its detection precision consistently remaining the highest during the later stages of training.
Furthermore, the Precision-Recall (PR) curve represents the trade-off between precision and recall. Generally, recall is plotted on the X-axis and precision on the Y-axis. The closer the PR curve is to the upper-right corner, the better the model’s performance [30]. As visible from the Precision-Recall curve (Figure 6B), the curve for Wampee-YOLO is clearly closer to the upper-right corner compared to the other models, demonstrating that the performance of Wampee-YOLO is superior to that of the alternatives.
The results presented in Table 4 show that the Wampee-YOLO model outperforms the other five comparison models. Compared with the remaining models in the table, from top to bottom, Wampee-YOLO achieved improvements in precision of 1.1%, 1.4%, 2.4%, 2%, and 0.7%, respectively. In terms of recall, while it was 0.2% lower than RT-DETR, it showed improvements of 5.7%, 3.2%, 3.2%, and 3.8% over the other models. The F1 score increased by 3.6%, 2.3%, 2.7%, 2.9%, and 0.1%, respectively; the mAP50 by 3.8%, 3.5%, 2.4%, 3.4%, and 0.8%; and the mAP50–95 by 3.9%, 3.3%, 2.8%, 3.6%, and 1%. Regarding model size, although Wampee-YOLO achieved improvements across all metrics, its parameter count is roughly one-sixth (6.1 times smaller) that of the RT-DETR model, yet its overall performance indicators show a significant improvement over that better-performing competitor. Therefore, compared with the YOLOv5n, YOLOv8n, YOLOv10n, and YOLO11n models, Wampee-YOLO achieves the optimal results in precision, recall, F1 score, mAP50, and mAP50–95 with only a marginal increase in parameter count. To further evaluate model complexity, we analyzed the computational cost (FLOPs) of each model. As shown in Table 4, existing lightweight models such as YOLOv5n, YOLOv8n, and the YOLO11n baseline maintain low computational costs, ranging from 5.8 to 6.8 GFLOPs, but their detection capabilities are limited in complex scenarios, with mAP50 scores plateauing below 88%. In contrast, Wampee-YOLO requires 21.9 GFLOPs. This increase in computational complexity is a deliberate design choice, stemming from the introduction of the AFPN-Pro2345 and RFEMAConv modules to resolve feature ambiguity in clustered fruits. It demonstrates that Wampee-YOLO effectively utilizes a higher computational budget to achieve a breakthrough in accuracy (+3.4% mAP50 over baseline), offering a robust solution where precision is the priority.
Figure 7 displays the comparison of confusion matrices for YOLOv5n, YOLOv8n, YOLOv10n, YOLO11n, RT-DETR, and Wampee-YOLO. The results show that the Wampee-YOLO model is superior to RT-DETR and other models in the YOLO series in terms of distinguishing wampee fruits from the background. The high precision and low misclassification rate achieved in wampee fruit detection demonstrate the reliability of the Wampee-YOLO model. These advancements highlight the application potential of Wampee-YOLO for multi-scale wampee detection in complex agricultural scenarios.

3.3. Visualization

To validate the capability of Wampee-YOLO in detecting multi-scale targets within natural orchard scenarios, we conducted multiple rounds of comparative visual detection experiments. We systematically evaluated the model across four typical scenarios: multi-scale distribution (including different maturity stages), overlapping occlusion distribution, and background interference. The multi-scale distribution scenario is characterized by color variations in the wampee fruit images due to differences in growth stage, which significantly increases detection difficulty. In the overlapping occlusion distribution scenario, a large number of wampee fruits exist within the same imaging field, where frequent inter-fruit occlusion poses a significant challenge to detection. Lastly, in the background interference scenario, the detrimental effects of foliage and natural environmental conditions are exacerbated. This study employed XGradCAM [31], a feature-map-based visualization technique that weights the feature maps with globally averaged gradients to generate a heatmap, demonstrating outstanding object localization capability. Figure 8 compares the backbone network of Wampee-YOLO against four well-known backbone networks: EfficientViT [32], FasterNet [33], MobileNetV4 [34], and ConvNeXt V2 [35]. Although these networks perform exceptionally well on general datasets, they exhibit limitations in the unstructured orchard environment. Specifically, they tend to be distracted by complex background textures, with high activation weights often drifting towards leaves and branches due to the color similarity between unripe fruits and foliage. This highlights the challenge of ‘feature ambiguity’ in natural agricultural scenes. In contrast, thanks to the specifically designed RFEMAConv module and multi-dimensional attention mechanisms, Wampee-YOLO successfully suppresses these background interferences. It demonstrates highly concentrated attention that precisely localizes the fruit boundaries, validating that our customized backbone possesses superior feature extraction robustness compared to general-purpose SOTA lightweight networks.
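A heatmap of this kind can be produced with the pytorch-grad-cam package as sketched below; the stand-in backbone and target layer are illustrative assumptions, since hooking the actual Wampee-YOLO backbone requires wrapping its forward pass to return a single tensor.

```python
import cv2
import numpy as np
import torch
from pytorch_grad_cam import XGradCAM
from pytorch_grad_cam.utils.image import show_cam_on_image
from torchvision.models import resnet18

# Stand-in network for illustration; in practice the Wampee-YOLO backbone and
# one of its final stages would replace `model` and `target_layer`.
model = resnet18(weights=None).eval()
target_layer = model.layer4[-1]

rgb = cv2.cvtColor(cv2.imread("wampee_0001.jpg"), cv2.COLOR_BGR2RGB)
rgb = cv2.resize(rgb, (640, 640)).astype(np.float32) / 255.0
x = torch.from_numpy(rgb).permute(2, 0, 1).unsqueeze(0)

cam = XGradCAM(model=model, target_layers=[target_layer])
heat = cam(input_tensor=x)[0]                  # (H, W) activation map in [0, 1]
overlay = show_cam_on_image(rgb, heat, use_rgb=True)
cv2.imwrite("xgradcam_overlay.jpg", cv2.cvtColor(overlay, cv2.COLOR_RGB2BGR))
```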
Figure 9 displays the comparative detection results of the YOLOv5n, YOLOv8n, YOLOv10n, YOLO11n, RT-DETR, and Wampee-YOLO models for wampee fruits in typical natural orchard scenarios, including the ripe fruit, unripe fruit, overlapping fruit, and background interference scenarios. The visualization clearly demonstrates the accuracy of Wampee-YOLO in recognizing multi-scale wampee fruits of different sizes across natural orchard scenarios. In the ripe wampee fruit images, the baseline YOLO11n model exhibits false positives, misidentifying the connecting points of wampee clusters as individual fruits, whereas Wampee-YOLO performs more stably. When processing unripe wampee fruits, the YOLOv5n and YOLOv10n models suffer from missed detections, failing to detect unripe fruits located near the foliage. In the overlapping occlusion scenario, several models other than Wampee-YOLO experience a considerable number of missed detections, unable to detect fruits under severe overlap and occlusion. Furthermore, under background interference conditions, all models encountered some false and missed detections of wampee fruits; the RT-DETR model performed especially poorly, producing the most false detections, while YOLOv5n and YOLOv10n showed more missed detections. Notably, Wampee-YOLO also exhibits some limitations. Like the baseline model, it may still miss detections when wampee fruits are excessively overlapped and occluded. This is likely because the wampee targets are small, and the combination of fruit overlap and leaf occlusion leaves very few visible features, resulting in insufficient feature extraction for such extremely feature-scarce targets. Concurrently, very bright leaves are also prone to misdetection, indicating that the model still struggles with interference caused by extreme brightness.
Based on the preceding analysis, Wampee-YOLO still surpasses other models in terms of detection accuracy, exhibiting a comparatively lower number of false and missed detections per image.

4. Discussion

4.1. Advantages

The detection accuracy obtained in this study is of critical importance to the field of intelligent agriculture, particularly for the selective harvesting of wampee. In unstructured orchard environments, the detection challenge is twofold. First, for unripe fruits, the primary bottleneck is the ‘green-on-green’ visual ambiguity, where targets share high similarity with the foliage. Second, and perhaps more critical for immediate economic return, is the detection of ripe (yellow-brown) fruits. Although distinct in color, ripe wampee fruits grow in dense, grape-like clusters, causing severe mutual occlusion that often confuses standard detectors into recognizing a cluster as a single object. Previous studies have struggled to address these conflicting challenges simultaneously. For instance, Yu et al. [7] achieved a recognition accuracy of 89.9% for green lychees but noted limitations under varying lighting. Similarly, Liao et al. [16] proposed the YOLO-MECD model for citrus, achieving an mAP50 of 81.6%, yet its performance often degraded in highly cluttered scenes. In stark contrast, Wampee-YOLO significantly outperforms these benchmarks with an mAP50 of 90.3%. This performance margin validates that our architecture excels in both scenarios: the AFPN-Pro2345 module effectively enhances the feature extraction of camouflaged green fruits, while the RFEMAConv module sharpens the boundary awareness for occluded ripe yellow fruits. By accurately distinguishing individual ripe targets within dense clusters, our model provides the precise localization required for robotic manipulators to pick fruit without causing damage, thereby establishing a new benchmark for automated harvesting.
Beyond technical metrics, the high precision of Wampee-YOLO holds transformative potential for the agricultural industry. Currently, wampee harvesting relies predominantly on manual labor, which is increasingly constrained by rising workforce costs, labor shortages, and the safety risks associated with high-branch operations. In comparison to this traditional approach, Wampee-YOLO can serve as a robust “visual brain” for autonomous harvesting robots, enabling 24/7 continuous operation without fatigue. The economic benefits of this transition are substantial. The achieved 90.3% mAP50 directly minimizes the rate of missed detections, a key factor linked to yield recovery and revenue. While manual harvesting is subjective and prone to error under fatigue, our model ensures consistent localization. This capability not only reduces direct labor costs but also facilitates intelligent orchard management, such as automated yield estimation, thereby translating technical accuracy into tangible economic efficiency.

4.2. Limitations and Future Work

Notwithstanding the superior efficacy of the Wampee-YOLO model across primary benchmarks including precision, recall, F1 score, mAP50, and mAP50–95, its integration into practical agricultural environments is still governed by an inherent trade-off between inference latency and the stringent resource constraints of edge computing devices. At present, we have not yet tested the designed algorithm on agricultural robots, which is a limitation of this research. If the model is deployed directly onto resource-constrained agricultural robot platforms, maintaining high detection accuracy may reduce inference speed, making it difficult to meet the real-time harvesting and monitoring requirements of the orchard environment. Conversely, simplifying the model to pursue real-time capability would compromise its detection performance in natural orchard scenarios involving small targets and occlusion, thereby negating the practical significance of the model's improvements. Furthermore, as observed in the visualization comparison, Wampee-YOLO, like the baseline models, still exhibits certain limitations. For example, model performance may degrade when wampee fruits suffer from excessive overlap and occlusion, or when interference is caused by highly reflective leaves. To provide a deeper assessment of the model's robustness, future research should include a statistical analysis of failure cases, quantifying false and missed detections under various extreme scenarios (severe occlusion, strong backlighting, high reflection) to guide more targeted model optimization.
In addition, constrained by collection conditions and annotation costs during the dataset construction process, the current wampee fruit image data still has room for improvement regarding varietal diversity and maturity range. Future work will focus on expanding the dataset’s coverage to include image samples of more wampee varieties at different growth stages and under varying environmental conditions. More importantly, to enhance the model’s general applicability, subsequent work should incorporate a cross-seasonal testing plan, validating the model across different harvesting seasons and various weather conditions. We will also explore the adoption of transfer learning strategies, utilizing existing models as pre-trained weights and fine-tuning them with a small amount of new data, thereby further enhancing the model’s recognition capability and adaptability in diverse real-world scenarios.
Subsequent research will continue to explore ways to further optimize the model structure and reduce computational complexity while maintaining high accuracy. The goal is to seek a better balance between real-time performance and accuracy, thereby promoting the practical application and deployment of Wampee-YOLO in smart agriculture.

5. Conclusions

This study addresses the persistent challenge of accurate wampee detection in unstructured orchard environments, where dense occlusion and small target sizes severely limit the performance of conventional lightweight detectors. By integrating progressive cross-scale feature fusion and multi-dimensional attention mechanisms into the YOLO11n architecture, the proposed Wampee-YOLO effectively bridges the gap between feature representation capability and model compactness.
The experimental results demonstrate that Wampee-YOLO achieves an mAP50 of 90.3%, significantly outperforming state-of-the-art models such as YOLOv8n and RT-DETR in complex scenarios. Although the parameter count increases slightly to 3.28 M from the baseline's 2.58 M, the model remains compact enough for embedded applications. Crucially, our analysis of computational complexity reveals an important trade-off: Wampee-YOLO incurs a higher computational cost (21.9 GFLOPs) than the baseline, but it converts this additional computational budget into a 3.4% improvement in detection accuracy. This finding suggests that for robotic harvesting tasks, where the cost of a missed pick or a damaged fruit outweighs the benefit of surplus inference speed, prioritizing precision over raw throughput is the optimal strategy.
Ultimately, Wampee-YOLO provides a robust visual perception baseline for the next generation of intelligent agricultural robots, demonstrating that targeted architectural improvements can significantly enhance detection reliability in complex agricultural environments without sacrificing deployment feasibility.

Author Contributions

Conceptualization, Z.L. and C.L.; methodology, Z.L.; software, Z.L. and L.Y.; validation, Z.L. and L.Y.; formal analysis, Z.L.; investigation, Z.L.; resources, Z.L. and C.L.; data curation, Z.L.; writing—original draft preparation, Z.L., Y.X. and J.W.; writing—review and editing, Z.L., G.H., K.Z., J.L. and C.L.; visualization, Z.L., G.H. and C.L.; supervision, G.H. and C.L.; project administration, Z.L., Y.X. and C.L.; funding acquisition, Y.X., G.H. and C.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Philosophy and Social Sciences Planning Project of Guangdong Province of China (Grant No. GD23XGL099), the Philosophy and Social Sciences Planning Project of Guangzhou City of China (Grant No. 2025GZGJ21), and the Guangdong General Universities Young Innovative Talents Project (2023KQNCX247).

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Wang, S.; Liu, Z.; Zhao, M.; Gao, C.; Wang, J.; Li, C.; Dong, X.; Liu, Z.; Zhou, D. Chitosan-wampee seed essential oil composite film combined with cold plasma for refrigerated storage with modified atmosphere packaging: A promising technology for quality preservation of golden pompano fillets. Int. J. Biol. Macromol. 2023, 224, 1266–1275. [Google Scholar] [CrossRef] [PubMed]
  2. Mo, X.; Cai, D.; Yang, H.; Chen, Q.; Xu, C.; Wang, J.; Tong, Z.; Xu, B. Changes in fruit quality parameters and volatile compounds in four wampee varieties at different ripening stages. Food Chem. X 2025, 27, 102377. [Google Scholar] [CrossRef]
  3. Chang, X.; Ye, Y.; Pan, J.; Lin, Z.; Qiu, J.; Guo, X.; Lu, Y. Comparative assessment of phytochemical profiles and antioxidant activities in selected five varieties of wampee (Clausena lansium) fruits. Int. J. Food Sci. Technol. 2018, 53, 2680–2686. [Google Scholar] [CrossRef]
  4. Liu, S.; Whitty, M.; Cossell, S. Automatic grape bunch detection in vineyards for precise yield estimation. In Proceedings of the 14th IAPR International Conference on Machine Vision Applications (MVA), Tokyo, Japan, 18–22 May 2015. [Google Scholar]
  5. Fu, L.; Duan, J.; Zou, X.; Lin, G.; Song, S.; Ji, B.; Yang, Z. Banana detection based on color and texture features in the natural environment. Comput. Electron. Agric. 2019, 167, 105057. [Google Scholar] [CrossRef]
  6. Guo, Q.; Chen, Y.; Tang, Y.; Zhuang, J.; He, Y.; Hou, C.; Chu, X.; Zhong, Z.; Luo, S. Lychee fruit detection based on monocular machine vision in orchard environment. Sensors 2019, 19, 4091. [Google Scholar] [CrossRef]
  7. Yu, L.; Xiong, J.; Fang, X.; Yang, Z.; Chen, Y.; Lin, X.; Chen, S. A litchi fruit recognition method in a natural environment using RGB-D images. Biosyst. Eng. 2021, 204, 50–63. [Google Scholar] [CrossRef]
  8. Xiao, F.; Wang, H.; Li, Y.; Cao, Y.; Lv, X.; Xu, G. Object detection and recognition techniques based on digital image processing and traditional machine learning for fruit and vegetable harvesting robots: An overview and review. Agronomy 2023, 13, 639. [Google Scholar] [CrossRef]
  9. Shi, Y.; Lian, S.; Siyao, Z. Recognition method of pheasant using enhanced Tiny-YOLOV3 model. Trans. Chin. Soc. Agric. Eng. 2020, 13, 141–147. [Google Scholar]
  10. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014. [Google Scholar]
  11. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 16th IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017. [Google Scholar]
  12. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016. [Google Scholar]
  13. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  14. Sapkota, R.; Karkee, M. Ultralytics YOLO evolution: An overview of YOLO26, YOLO11, YOLOv8 and YOLOv5 object detectors for computer vision and pattern recognition. arXiv 2025, arXiv:2510.09653. [Google Scholar]
  15. Zhang, M.; Ye, S.; Zhao, S.; Wang, W.; Xie, C. Pear object detection in complex orchard environment based on improved YOLO11. Symmetry 2025, 17, 255. [Google Scholar] [CrossRef]
  16. Liao, Y.; Li, L.; Xiao, H.; Xu, F.; Shan, B.; Yin, H. YOLO-MECD: Citrus detection algorithm based on YOLOv11. Agronomy 2025, 15, 687. [Google Scholar] [CrossRef]
  17. Wang, A.; Xu, Y.; Hu, D.; Zhang, L.; Li, A.; Zhu, Q.; Liu, J. Tomato yield estimation using an improved lightweight YOLO11n network and an optimized region tracking-counting method. Agriculture 2025, 15, 1353. [Google Scholar] [CrossRef]
  18. Li, P.; Chen, J.; Chen, Q.; Huang, L.; Jiang, Z.; Hua, W.; Li, Y. Detection and picking point localization of grape bunches and stems based on oriented bounding box. Comput. Electron. Agric. 2025, 233, 110168. [Google Scholar] [CrossRef]
  19. Du, X.; Zhang, X.; Li, T.; Chen, X.; Yu, X.; Wang, H. YOLO-WAS: A lightweight apple target detection method based on improved YOLO11. Agriculture 2025, 15, 1521. [Google Scholar] [CrossRef]
  20. Nan, Y.; Zhang, H.; Zeng, Y.; Zheng, J.; Ge, Y. Intelligent detection of multi-class pitaya fruits in target picking row based on WGB-YOLO network. Comput. Electron. Agric. 2023, 208, 107780. [Google Scholar] [CrossRef]
  21. Bai, Y.; Yu, J.; Yang, S.; Ning, J. An improved YOLO algorithm for detecting flowers and fruits on strawberry seedlings. Biosyst. Eng. 2024, 237, 1–12. [Google Scholar] [CrossRef]
  22. Zhao, Y.; Chen, Y.; Xu, X.; He, Y.; Gan, H.; Wu, N.; Wang, Z.; Sun, X.; Wang, Y.; Skobelev, P.; et al. Ta-YOLO: Overcoming target blocked challenges in greenhouse tomato detection and counting. Front. Plant Sci. 2025, 16, 1618214. [Google Scholar] [CrossRef]
  23. Aldubaikhi, A.; Patel, S. Advancements in small-object detection (2023–2025): Approaches, datasets, benchmarks, applications, and practical guidance. Appl. Sci. 2025, 15, 11882. [Google Scholar] [CrossRef]
  24. Wang, H.; Wang, H.; Liu, Q. TW-YOLO: High-precision steel wire rope detection algorithm based on triplet attention. Adv. Comput. Commun. 2025, 6, 48–54. [Google Scholar] [CrossRef]
  25. Zhang, Q.; Liu, Y.; Gong, C.; Chen, Y.; Yu, H. Applications of deep learning for dense scenes analysis in agriculture: A review. Sensors 2020, 20, 1520. [Google Scholar] [CrossRef]
  26. Khanam, R.; Hussain, M. What is YOLOv5: A deep look into the internal features of the popular object detector. arXiv 2024, arXiv:2407.20892. [Google Scholar] [CrossRef]
  27. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J. YOLOv10: Real-time end-to-end object detection. In Advances in Neural Information Processing Systems, Proceedings of the Neural Information Processing Systems Conference, Vancouver, BC, Canada, 16 December 2024; Neural Information Processing Systems Foundation: San Diego, CA, USA, 2024. [Google Scholar]
  28. Khanam, R.; Hussain, M. YOLOv11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  29. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs beat YOLOs on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024. [Google Scholar]
  30. Yang, K.; Song, Z. Deep learning-based object detection improvement for fine-grained birds. IEEE Access 2021, 9, 67901–67915. [Google Scholar] [CrossRef]
  31. Fu, R.; Hu, Q.; Dong, X.; Guo, Y.; Gao, Y.; Li, B. Axiom-based Grad-CAM: Towards accurate visualization and explanation of CNNs. arXiv 2020, arXiv:2008.02312. [Google Scholar]
  32. Liu, X.; Peng, H.; Zheng, N.; Yang, Y.; Hu, H.; Yuan, Y. EfficientViT: Memory efficient vision transformer with cascaded group attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  33. Chen, J.; Kao, S.-h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.-H.; Chan, S.-H.G. Run, don’t walk: Chasing higher FLOPS for faster neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
  34. Qin, D.; Leichner, C.; Delakis, M.; Fornoni, M.; Luo, S.; Yang, F.; Wang, W.; Banbury, C.; Ye, C.; Akin, B. MobileNetV4: Universal models for the mobile ecosystem. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024. [Google Scholar]
  35. Woo, S.; Debnath, S.; Hu, R.; Chen, X.; Liu, Z.; Kweon, I.S.; Xie, S. ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023. [Google Scholar]
Figure 1. The overall process framework of the proposed Wampee-YOLO: (a1,a2) represent ripe and unripe fruits, (a3) denotes overlapping fruits, and (a4) indicates background interference; (b1) represents rain interference, (b2) shading interference, (b3) smoke interference, and (b4) strong light interference; (c1–c4) represent images containing small targets; (d1) represents ripe fruit-on-fruit overlap, (d2) ripe fruits occluded by leaves, (d3) unripe fruit-on-fruit overlap, and (d4) unripe fruits occluded by branches; (e1) captures differences in physical fruit size, (e2) depicts large fruits captured at close range, (e3) shows small fruits captured at a long distance, and (e4) represents occluded fruit images within the visible scale.
Figure 2. Representative samples from the collected wampee dataset demonstrating environmental diversity and robustness.
Figure 3. Image samples of different data augmentation techniques applied for wampee fruits.
Figure 4. Statistical analysis of wampee fruit distribution and dimensions within the dataset. (A) Frequency distribution of target bounding box centers along the horizontal (X-axis). (B) Target distribution within the image, where x and y represent the target’s horizontal and vertical coordinates (center position), respectively. (C) Target distribution frequency along the vertical axis (Y-axis). (D) Relationship between target width and its horizontal position. (E) Relationship between target width and its vertical position. (F) Histogram of target width, showing the distribution of target widths in the dataset. (G) Correlation between target height and horizontal position (X-coordinate), reflecting the spatial distribution of object scales. (H) Correlation between target height and vertical position (Y-coordinate), showing the relationship between object scale and image depth/placement. (I) Joint distribution of target height and width, where the red bounding box encapsulates the dimensional range of the majority of instances. (J) Frequency distribution of target heights, depicting the vertical scale profile of the objects in the dataset.
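For readers who wish to reproduce the dataset statistics of Figure 4 on their own data, the sketch below computes box-center and box-size distributions from YOLO-format annotations. The labels/ directory and the plot layout are hypothetical assumptions; the exact plotting pipeline used for Figure 4 is not published.

```python
# Sketch reproducing Figure 4-style box statistics from YOLO-format labels.
# "labels/" is a hypothetical path; each label line is:
#   class x_center y_center width height   (all coordinates normalized to [0, 1])
from pathlib import Path
import matplotlib.pyplot as plt

xs, ys, ws, hs = [], [], [], []
for label_file in Path("labels").glob("*.txt"):
    for line in label_file.read_text().splitlines():
        _, x, y, w, h = map(float, line.split())
        xs.append(x); ys.append(y); ws.append(w); hs.append(h)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist2d(xs, ys, bins=50)           # center-position distribution (cf. Fig. 4B)
axes[0].set(xlabel="x", ylabel="y", title="box centers")
axes[1].hist(ws, bins=50)                 # width histogram (cf. Fig. 4F)
axes[1].set(xlabel="width", title="widths")
axes[2].hist2d(ws, hs, bins=50)           # width-height joint distribution (cf. Fig. 4I)
axes[2].set(xlabel="width", ylabel="height", title="width vs. height")
plt.tight_layout(); plt.show()
```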
Figure 5. Architecture of the proposed Wampee-YOLO. (A) Overall architecture. (B) Architecture of C3k2-RFEMAConv. (C) Architecture of C2PSA-MSCADYT. (D) Architecture of Bottleneck-RFEMAConv. (E) Architecture of MultiScale Coordinate Attention.
Figure 6. Comparison of the validation results of each model. (A) Precision-epoch curves. (B) Precision-Recall curves.
Figure 7. Comparison of confusion matrices of different models evaluated on the dataset. (A) YOLOv5n. (B) YOLOv8n. (C) YOLOv10n. (D) YOLO11n. (E) RT-DETR. (F) Wampee-YOLO.
Figure 8. Comparison of heatmaps from different backbone networks.
Figure 9. Comparison of detection results from different detection models, where green boxes represent correct detections, blue boxes represent false or duplicate detections, and red boxes represent missed detections.
Table 1. Key training parameter settings.

| Parameter | Setup |
| --- | --- |
| Epochs | 200 |
| Batch size | 8 |
| Learning rate | 1 × 10⁻⁴ |
| Image size | 640 × 640 |
| Weight decay | 0.0005 |
| Momentum | 0.937 |
| Optimizer | AdamW |
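For reference, the Table 1 settings map one-to-one onto arguments of a standard Ultralytics training call. A minimal sketch, assuming the Ultralytics API and a hypothetical dataset configuration file wampee.yaml; the baseline yolo11n.pt weights stand in for the modified Wampee-YOLO architecture, whose definition is not reproduced here.

```python
# Sketch reproducing the Table 1 training configuration.
# Assumes the Ultralytics YOLO training API; "wampee.yaml" is a hypothetical
# dataset configuration file, and yolo11n.pt stands in for the modified model.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")                # baseline architecture used in this study
model.train(
    data="wampee.yaml",
    epochs=200,                           # Epochs
    batch=8,                              # Batch size
    lr0=1e-4,                             # Learning rate
    imgsz=640,                            # Image size 640 x 640
    weight_decay=0.0005,                  # Weight decay
    momentum=0.937,                       # Momentum
    optimizer="AdamW",                    # Optimizer
)
```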
Table 2. Results of ablation study experiments.
| AFPN-Pro2345 | AIFI | C3k2-RFEMAConv | C2PSA-MSCADYT | Triplet Attention | P/% | R/% | F1/% | mAP50/% | mAP50–95/% | Parameters/M | GFLOPs |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
|  |  |  |  |  | 90.1 | 78.9 | 84.1 | 86.9 | 54.7 | 2.58 | 6.3 |
| ✓ |  |  |  |  | 91.9 | 80.8 | 86.0 | 89.3 | 57.5 | 2.56 | 17.8 |
|  | ✓ |  |  |  | 90.7 | 79.2 | 84.6 | 87.3 | 54.8 | 3.21 | 6.6 |
|  |  | ✓ |  |  | 91.7 | 78.2 | 84.4 | 87.1 | 54.8 | 2.59 | 10.4 |
|  |  |  | ✓ |  | 90.7 | 79.0 | 84.4 | 87.0 | 54.9 | 2.61 | 6.3 |
|  |  |  |  | ✓ | 90.9 | 78.9 | 84.5 | 87.2 | 54.9 | 2.58 | 6.3 |
| ✓ | ✓ |  |  |  | 91.4 | 81.3 | 86.1 | 89.4 | 57.6 | 3.18 | 18.1 |
| ✓ | ✓ | ✓ |  |  | 91.4 | 81.9 | 86.4 | 89.9 | 58.2 | 3.25 | 21.8 |
| ✓ | ✓ | ✓ | ✓ |  | 91.9 | 82.2 | 86.8 | 90.2 | 58.3 | 3.28 | 21.8 |
| ✓ | ✓ | ✓ | ✓ | ✓ | 92.1 | 82.7 | 87.0 | 90.3 | 58.3 | 3.28 | 21.9 |
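As a quick consistency check, the F1 scores in Table 2 match the harmonic mean of the reported precision and recall to within rounding. A short verification sketch using three representative rows (baseline, AFPN-Pro2345 only, full model):

```python
# Check that reported F1 equals the harmonic mean of P and R to within rounding.
# Rows taken from Table 2: (P, R, reported F1), all in percent.
rows = [
    (90.1, 78.9, 84.1),   # YOLO11n baseline
    (91.9, 80.8, 86.0),   # AFPN-Pro2345 only
    (92.1, 82.7, 87.0),   # full Wampee-YOLO
]
for p, r, f1_reported in rows:
    f1 = 2 * p * r / (p + r)              # harmonic mean of precision and recall
    assert abs(f1 - f1_reported) <= 0.2, (p, r, f1)
    print(f"P={p}  R={r}  computed F1={f1:.2f}  reported F1={f1_reported}")
```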
Table 3. Results for comparison of different attention modules.
Table 3. Results for comparison of different attention modules.
ModelsP/%R/%F1/%mAP50/%mAP50–95/%Parameters/MGFLOPs/G
Triplet Attention92.182.787.090.358.33.2821.9
SEAM91.082.086.389.958.13.4322
CBAM91.481.686.289.858.13.3521.8
AFGCAttention91.581.986.490.058.13.3521.8
BAMblock91.781.886.589.958.33.3021.9
LSKBlock91.682.186.690.258.33.5322
Table 4. Comparison between Wampee-YOLO and mainstream models.
Table 4. Comparison between Wampee-YOLO and mainstream models.
ModelsP/%R/%F1/%mAP50/%mAP50–95/%Parameters/MGFLOPs/G
YOLOv5n91.077.083.486.554.42.185.8
YOLOv8n90.779.584.786.8552.586.8
YOLOv10n89.779.584.387.955.52.276.5
YOLO11n90.178.984.186.954.72.586.3
RT-DETR91.482.986.989.557.319.8756.9
Wampee-YOLO92.182.787.090.358.33.2821.9