1. Introduction
Statistics show that global apple production exceeds 84.3 million metric tons annually and continues to rise, with China, the United States, Poland, and Turkey accounting for most of the world’s output [1]. At present, apples are still harvested largely by hand, a method that is time-consuming, physically demanding, and inefficient [2]. As the agricultural workforce shrinks and ages, labor shortages for apple picking are becoming increasingly severe [3]. Improving harvesting efficiency and reducing labor costs are therefore pressing concerns in modern agriculture. Fruit-picking robots, which use vision systems to detect and harvest apples automatically, can lower costs and are becoming an important component of smart farming [4]. Detection speed and accuracy determine whether such robots can operate reliably, and both are key to agricultural automation [5].
In recent years, advances in deep learning have introduced innovative approaches for fruit detection and positioning [6]. Deep learning-based object detection algorithms mainly include Faster R-CNN [7], Mask R-CNN [8], R-FCN [9], SSD [10], and YOLO [11,12]. Two-stage detection methods such as Faster R-CNN, Mask R-CNN, and R-FCN achieve high accuracy through region proposals and convolutional feature extraction, but suffer from high computational complexity and slower inference. In contrast, single-stage algorithms such as SSD and the YOLO series output target locations and categories directly through end-to-end networks, eliminating the need for region proposals and thus offering faster detection and lower computational demands. These algorithms have been widely applied to fruit detection tasks, including kiwifruit [13], oranges [14], mangoes [15], grapes [16], and winter jujubes [17]. In apple detection, YOLO series models are particularly favored for their efficiency. For example, Yang et al. proposed the AD-YOLO model combined with MR-SORT for automatic apple detection and counting, improving tracking accuracy in video processing [18]. Wang et al. enhanced apple detection by improving YOLOv5s, reaching a mAP of 89.7% [19,20]. Chen et al. proposed an apple detection method based on the Des-YOLO v4 algorithm, optimized for complex environments [21]. Yue et al. detected apples in complex environments using an improved YOLOv8n [22,23]. As the YOLO series has evolved, YOLO11, the latest version, further improves model accuracy and efficiency. For example, Yang et al. developed the AAB-YOLO framework, an enhanced YOLOv11 network, for detecting apples in natural settings [24]. Yan et al. studied apple recognition and yield estimation using a fusion of improved YOLOv11 and DeepSORT, achieving an F1 score of 91.7% [25]. However, existing methods still face challenges in complex orchard environments: occlusion by branches, leaves, and other fruits blurs target contours and reduces detection accuracy, while high computational complexity and parameter counts limit model deployment on edge devices.
Existing fruit localization techniques primarily encompass laser scanning, stereo vision systems, and methods using RGB-D cameras. Laser scanning employs high-accuracy point cloud data to enable three-dimensional positioning. For instance, Tsoulias et al. applied LiDAR-based scanners to identify and locate apples in orchard settings, attaining a mean detection success rate of 92.5% for trees without foliage and highlighting the capability for remote detection and 3D positioning [26]. However, laser scanning equipment is costly, and its performance is limited when detecting small or partially occluded fruits, particularly in complex orchard environments where lighting and foliage interference reduce accuracy [27]. Stereo vision systems use binocular cameras to acquire depth information. For instance, Jianjun et al. introduced a binocular vision measurement framework using neural networks for 3D tomato localization, attaining 88.6% reliability with X-, Y-, and Z-axis errors maintained within 5 mm [28]. While stereo vision provides spatial depth data, it is computationally complex, less real-time, and sensitive to lighting changes, which easily leads to matching errors or biased depth estimates [29].
In contrast, RGB-D cameras, which integrate color imagery with depth data, achieve superior detection and positioning accuracy under varied lighting and cluttered backgrounds, offering notable benefits in cost, versatility, and adaptability. Paired with YOLO series algorithms, RGB-D cameras can achieve efficient and accurate fruit localization. For example, Wang et al. combined an RGB-D camera with the CA-YOLOv5 model to detect apples in natural environments, reaching a mean Average Precision (mAP) of 91.2% [20]. RGB-D methods thus show great potential for fruit localization but require further optimization to handle occlusion in complex environments.
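As background for how an RGB-D pipeline recovers 3D coordinates from a 2D detection, the sketch below back-projects a detected bounding-box center through a standard pinhole camera model. The intrinsic values and pixel coordinates are illustrative placeholders, not the calibration of the camera used in this study.

```python
def pixel_to_camera_xyz(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with metric depth into camera coordinates.

    Standard pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy,
    where (fx, fy) are focal lengths in pixels and (cx, cy) is the
    principal point.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m

# Illustrative values: bbox center at pixel (400, 300), depth 0.80 m,
# placeholder intrinsics (fx = fy = 600, principal point at image center).
xyz = pixel_to_camera_xyz(400, 300, 0.80, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

In practice, the depth value is usually taken as a median over a small window around the box center, which suppresses sensor noise and depth discontinuities at the fruit boundary.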
This study aims to address the challenge of accurately identifying and localizing apples in complex orchard environments for robotic harvesting. By integrating the proposed YOLO-CSB model with RGB-D camera technology, precise apple detection and localization were achieved. The main contributions are as follows:
1. A novel apple detection model, YOLO-CSB, was proposed, incorporating Partial Convolution to enhance the C3k2 module, resulting in a new CSFC Block architecture that reduces computational complexity and makes the model more lightweight.
2. The SEAM and BiFPN modules were integrated into the YOLO11s framework. The SEAM module improves feature extraction for occluded regions, while the BiFPN module optimizes multi-scale detection through weighted bidirectional feature fusion.
3. A 3D localization method combining YOLO-CSB with an RGB-D camera was developed, with experiments designed to evaluate localization errors. The method improves the accuracy of apple detection and localization in complex orchard environments, meeting the requirements of robotic harvesting systems.
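To make the Partial Convolution and BiFPN ideas above concrete, the sketch below illustrates the two underlying mechanisms in plain Python: Partial Convolution cuts cost by convolving only a fraction of the channels, and BiFPN combines multi-scale inputs with non-negative, fast-normalized weights. The channel fraction (1/4, as in the original FasterNet formulation of PConv) and all numbers are illustrative assumptions, not the exact YOLO-CSB configuration.

```python
def conv_flops(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a k x k convolution on an h x w feature map."""
    return h * w * k * k * c_in * c_out

def pconv_flops(h, w, k, c, ratio=0.25):
    """Partial Convolution: only a fraction `ratio` of the c channels is
    convolved; the remaining channels pass through untouched, so the cost
    falls by roughly ratio**2 versus a full convolution."""
    cp = int(c * ratio)
    return conv_flops(h, w, k, cp, cp)

def bifpn_fuse(inputs, weights, eps=1e-4):
    """Fast normalized fusion used by BiFPN:
    O = sum_i(w_i * I_i) / (eps + sum_j(w_j)), with learnable w_i >= 0."""
    total = sum(weights)
    return sum(w * x for w, x in zip(inputs, weights)) / (eps + total)

# A 3x3 PConv over a quarter of 256 channels costs 1/16 of the full conv.
full = conv_flops(40, 40, 3, 256, 256)
partial = pconv_flops(40, 40, 3, 256)

# Fusing two feature values with equal weights yields (almost) their mean.
fused = bifpn_fuse([1.0, 3.0], [1.0, 1.0])
```

The eps term keeps the fusion numerically stable when all learned weights shrink toward zero, which is why BiFPN can avoid the more expensive softmax normalization.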
4. Discussion
We assessed the performance of the YOLO-CSB framework against recent studies on apple detection to determine its efficacy in complex orchard settings. Recent research on apple detection has predominantly concentrated on single-stage object detection algorithms, seeking to improve detection precision while minimizing model complexity and size. For instance, Liu et al. [1] proposed a YOLOv5s-BC-based apple detection method, achieving a mAP of 92.01%, precision of 88.71%, and recall of 83.80%. Wang et al. [19] developed an improved YOLOv5s model with a mAP of 89.70%, precision of 87.50%, and recall of 86.20%. In contrast, our proposed YOLO-CSB model achieved a mAP of 93.69%, precision of 88.82%, and recall of 87.58%, demonstrating superior detection performance. Additionally, Yang et al. [24] introduced an AAB-YOLO model based on an improved YOLOv11, with a mAP of 91.50% and a parameter count of 10.2 M, whereas YOLO-CSB, with only 9.11 M parameters, reduces the parameter count by 10.69% while maintaining higher accuracy. Chen et al. [21] proposed a Des-YOLO v4-based apple detection model with 12.5 M parameters, while YOLO-CSB’s parameter count is only about 73% of that, further validating its lightweight design. Overall, YOLO-CSB outperforms comparable apple detection methods in both detection accuracy and model compactness.
Accurate positioning is a key step for a harvesting robot to locate small fruits. In this work, an RGB-D camera is combined with the YOLO-CSB network to obtain the 3D coordinates of apple centers, which then serve as test points for the positioning trials. The network detects fruits in the RGB view, yielding 2D bounding-box coordinates, and the depth image supplies the third dimension. The manipulator then drives its end-effector to each selected point, and a precision distance sensor records the actual offset between the end-effector and the target; each measurement is repeated five times to reduce noise. The mean positioning errors for apples are 4.15 mm, 3.96 mm, and 4.02 mm along the X, Y, and Z axes, respectively, meeting the accuracy requirements of automated picking.
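As a quick consistency check on the reported per-axis errors, if the X, Y, and Z errors combined independently, the implied overall Euclidean offset would be their root-sum-square. A minimal sketch, using the per-axis means from the trials above:

```python
import math

# Mean per-axis localization errors (mm) reported in the positioning trials.
err_x, err_y, err_z = 4.15, 3.96, 4.02

# Root-sum-square: the implied overall 3D offset if the per-axis errors
# were independent, here about 7.0 mm.
euclidean_err = math.sqrt(err_x**2 + err_y**2 + err_z**2)
```

This distinction matters when stating accuracy requirements: per-axis errors under 5 mm still imply a total spatial offset of roughly 7 mm, which is the figure an end-effector's grasp tolerance must actually accommodate.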
Previous work on fruit positioning reported a LiDAR-plus-YOLOv5 system with a mean apple localization error of 21.1 mm [33]; replacing LiDAR with an RGB-D sensor, as in this study, reduces cost and brings the error below 5 mm. In bell pepper studies, Guo et al. coupled an RGB-D device with a refined YOLOv4 and reached 89.55% accuracy [34]. Zhou et al. fused an upgraded YOLOX with depth data to keep apple offsets under 7 mm on every axis [35]. Apples, being smaller and more frequently occluded, are harder targets than peppers, yet the YOLO-CSB-plus-RGB-D combination still keeps the average error under 5 mm, outperforming earlier RGB-D attempts on small fruit.
Despite these achievements, YOLO-CSB has certain limitations. As shown in Figure 8, under extreme conditions such as low-light environments, the model may miss detections. These failures are primarily attributable to reduced image quality under insufficient lighting, which diminishes the prominence of apple texture and color features. To address this, future work will enrich the training dataset with more apple images captured under low-light conditions.
Our dataset comes from a single orchard and a single variety, and collection was concentrated in one period. This may degrade model performance in other orchards, for other varieties, or in other seasons. We added only a small number of images from public datasets to the validation set for preliminary cross-validation. Future work will collect multi-source data and conduct comprehensive testing on additional public fruit datasets.
All experiments in this study used a fixed input resolution of 640 × 640. This was because the dataset was uniformly set to this size during preprocessing and export; changing the resolution would involve re-labeling and retraining, and due to limited resources, no comparisons were made. Future work will compare different resolutions (e.g., 512, 640, 800) and analyze their impact on mAP, latency, and computational cost.
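Although other resolutions were not benchmarked, the convolutional cost of a detector scales roughly with the number of input pixels, so the relative cost of the candidate resolutions can be estimated up front. A rough back-of-envelope sketch (this ignores resizing overhead and says nothing about accuracy):

```python
def relative_cost(side, base=640):
    """Approximate FLOPs ratio of a square input of size `side` versus the
    640 x 640 baseline; convolution FLOPs scale with pixel count, i.e. side**2."""
    return (side / base) ** 2

# Candidate resolutions mentioned in the text:
# 512 -> ~0.64x the baseline cost, 640 -> 1.0x, 800 -> ~1.56x.
costs = {s: relative_cost(s) for s in (512, 640, 800)}
```

Such an estimate helps bound the latency budget before committing to retraining at each resolution, which is why a 512/640/800 comparison is a reasonable sweep for the planned experiments.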
Our YOLO-CSB model is currently validated only for apple detection, but it can be readily extended to other fruits, such as oranges, bananas, or pears, which face similar occlusion and lighting challenges. Extension requires only relabeling a new dataset and retraining; the architecture and modules remain unchanged. For multi-class scenarios, such as simultaneously detecting apples at different ripeness levels (green and red) or multiple fruit types, the detection head can be modified to output additional class labels. Because BiFPN handles multi-scale feature fusion, the model is expected to support multi-class targets well, although this remains to be verified experimentally. We will explore these extensions in future work.
Although this study validated the effectiveness of the hand–eye calibration system under laboratory conditions, the localization performance in actual orchard environments remains to be further verified. Limited by current experimental constraints and resources, large-scale field testing will be an important direction for our future work. In the future, we plan to conduct long-term localization accuracy assessments in real orchard scenarios, comprehensively considering the effects of illumination variations, tree occlusions, and terrain undulations on system performance, thereby further enhancing the practicality and robustness of the system.
In conclusion, the YOLO-CSB model and its three-dimensional localization system provide an efficient and precise solution for mechanized apple harvesting in complex orchard environments. Its performance in detection accuracy, model compactness, and localization error offers a scalable reference for the detection and localization of other target fruits. This study evaluated YOLO-CSB in complex orchard environments and compared it with recent apple detection research, demonstrating advantages in both detection accuracy and lightweight design. Nevertheless, several limitations remain. The dataset was collected from a single orchard and a single apple variety, concentrated in autumn under relatively uniform lighting, which may limit generalization to other orchards, varieties, seasons, or extreme lighting conditions. Only a small number of public dataset images were added to the validation set for preliminary cross-dataset validation, without systematic cross-dataset testing. All experiments used a fixed input resolution of 640 × 640 owing to preprocessing constraints and limited resources, leaving the impact of other resolutions unexplored. The model is currently validated only for apple detection, though its architecture is readily extensible to other fruits or multi-class scenarios. Finally, although the hand–eye calibration system was validated under laboratory conditions, its localization performance in actual orchard environments remains to be verified.
Future research can further validate its performance in diverse orchard settings, collect multi-source data, conduct systematic cross-dataset validation on public datasets, compare different input resolutions to optimize the balance between accuracy and efficiency, extend the model to other fruit types or multi-class detection scenarios, and perform large-scale field testing to comprehensively assess localization accuracy under varying illumination, occlusions, and terrain conditions. These efforts will promote the widespread adoption of intelligent agricultural technologies.