1. Introduction
Statistics show that global apple production exceeds 84.3 million metric tons annually and continues to rise, with China, the United States, Poland, and Turkey accounting for most of the world’s output [1]. At present, apples are still harvested largely by hand, a method that is time-consuming, physically demanding, and inefficient [2]. As the agricultural workforce shrinks and ages, labor shortages for apple picking are becoming increasingly severe [3]. Improving harvesting efficiency and reducing labor costs are therefore pressing concerns in modern agriculture. Fruit-picking robots, which use vision systems to detect and harvest apples automatically, can lower costs and are becoming an important component of smart farming [4]. Detection speed and accuracy determine whether such robots can operate reliably, and both are key to agricultural automation [5].
In recent years, advances in deep learning have introduced innovative approaches for fruit detection and positioning [6]. Deep learning-based object detection algorithms mainly include Faster R-CNN [7], Mask R-CNN [8], R-FCN [9], SSD [10], and YOLO [11,12]. Two-stage detection methods such as Faster R-CNN, Mask R-CNN, and R-FCN achieve high accuracy through region proposals and convolutional feature extraction, but suffer from high computational complexity and slower inference. In contrast, single-stage algorithms such as SSD and the YOLO series output target locations and categories directly through end-to-end networks, eliminating the need for region proposals and thus offering faster detection and lower computational demands. These algorithms have been widely applied to fruit detection tasks, including kiwifruit [13], oranges [14], mangoes [15], grapes [16], and winter jujubes [17]. In apple detection, YOLO series models are particularly favored for their efficiency. For example, Yang et al. proposed the AD-YOLO model combined with MR-SORT for automatic apple detection and counting, improving tracking accuracy in video processing [18]. Wang et al. enhanced apple detection by improving YOLOv5s, reaching a mAP of 89.7% [19,20]. Chen et al. proposed an apple detection method based on the Des-YOLO v4 algorithm, optimized for complex environments [21]. Yue et al. detected apples in complex environments using an improved YOLOv8n [22,23]. As the YOLO series has evolved, YOLO11, the latest version, further improves model accuracy and efficiency. For example, Yang et al. developed the AAB-YOLO framework, an enhanced YOLOv11 network, for detecting apples in natural settings [24]. Yan et al. studied apple recognition and yield estimation using a fusion of improved YOLOv11 and DeepSORT, achieving an F1 score of 91.7% [25]. However, existing methods still face challenges in complex orchard environments: occlusion by branches, leaves, and other fruits blurs target contours and reduces detection accuracy, while high computational complexity and parameter counts limit model deployment on edge devices.
Existing fruit localization techniques primarily encompass laser scanning, stereo vision systems, and methods using RGB-D cameras. Laser scanning employs high-accuracy point cloud data to enable three-dimensional positioning. For instance, Tsoulias et al. applied LiDAR-based scanners to identify and locate apples in orchard settings, attaining a mean detection success rate of 92.5% for trees without foliage and highlighting the capability for remote detection and 3D positioning [26]. However, laser scanning equipment is costly, and its performance is limited when detecting small or partially occluded fruits, particularly in complex orchard environments where lighting and foliage interference reduce accuracy [27]. Stereo vision systems use binocular cameras to acquire depth information. For instance, Jianjun et al. introduced a binocular vision measurement framework using neural networks for 3D tomato localization, attaining 88.6% reliability with X-, Y-, and Z-axis errors maintained within 5 mm [28]. While stereo vision provides spatial depth data, it is computationally complex, less real-time, and sensitive to lighting changes, which easily leads to matching errors or biased depth estimates [29].
In contrast, RGB-D cameras, which integrate color imagery with depth data, achieve superior detection and positioning accuracy under varied lighting and cluttered backgrounds, offering notable benefits in cost, versatility, and adaptability. Paired with YOLO series algorithms, RGB-D cameras can achieve efficient and accurate fruit localization. For example, Wang et al. combined an RGB-D camera with the CA-YOLOv5 model to detect apples in natural environments, reaching a mean Average Precision (mAP) of 91.2% [20]. RGB-D methods thus show great potential for fruit localization but require further optimization to handle occlusion in complex environments.
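As background for how an RGB-D pipeline recovers 3D coordinates from a 2D detection, the sketch below back-projects a detected bounding-box center through a standard pinhole camera model. The intrinsic values and pixel coordinates are illustrative placeholders, not the calibration of the camera used in this study.

```python
def pixel_to_camera_xyz(u, v, depth_m, fx, fy, cx, cy):
    """Back-project a pixel (u, v) with metric depth into camera coordinates.

    Standard pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy,
    where (fx, fy) are focal lengths in pixels and (cx, cy) is the
    principal point.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return x, y, depth_m

# Illustrative values: bbox center at pixel (400, 300), depth 0.80 m,
# placeholder intrinsics (fx = fy = 600, principal point at image center).
xyz = pixel_to_camera_xyz(400, 300, 0.80, fx=600.0, fy=600.0, cx=320.0, cy=240.0)
```

In practice, the depth value is usually taken as a median over a small window around the box center, which suppresses sensor noise and depth discontinuities at the fruit boundary.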
This study aims to address the challenge of accurately identifying and localizing apples in complex orchard environments for robotic harvesting. By integrating the proposed YOLO-CSB model with RGB-D camera technology, precise apple detection and localization were achieved. The main contributions are as follows:
1. A novel apple detection model, YOLO-CSB, was proposed, incorporating Partial Convolution to enhance the C3k2 module, resulting in a new CSFC Block architecture that reduces computational complexity and makes the model more lightweight.
2. The SEAM and BiFPN modules were integrated into the YOLO11s framework. The SEAM module improves feature extraction for occluded regions, while the BiFPN module optimizes multi-scale detection through weighted bidirectional feature fusion.
3. A 3D localization method combining YOLO-CSB with an RGB-D camera was developed, with experiments designed to evaluate localization errors. The method improves the accuracy of apple detection and localization in complex orchard environments, meeting the requirements of robotic harvesting systems.
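To make the Partial Convolution and BiFPN ideas above concrete, the sketch below illustrates the two underlying mechanisms in plain Python: Partial Convolution cuts cost by convolving only a fraction of the channels, and BiFPN combines multi-scale inputs with non-negative, fast-normalized weights. The channel fraction (1/4, as in the original FasterNet formulation of PConv) and all numbers are illustrative assumptions, not the exact YOLO-CSB configuration.

```python
def conv_flops(h, w, k, c_in, c_out):
    """Multiply-accumulate count of a k x k convolution on an h x w feature map."""
    return h * w * k * k * c_in * c_out

def pconv_flops(h, w, k, c, ratio=0.25):
    """Partial Convolution: only a fraction `ratio` of the c channels is
    convolved; the remaining channels pass through untouched, so the cost
    falls by roughly ratio**2 versus a full convolution."""
    cp = int(c * ratio)
    return conv_flops(h, w, k, cp, cp)

def bifpn_fuse(inputs, weights, eps=1e-4):
    """Fast normalized fusion used by BiFPN:
    O = sum_i(w_i * I_i) / (eps + sum_j(w_j)), with learnable w_i >= 0."""
    total = sum(weights)
    return sum(w * x for w, x in zip(inputs, weights)) / (eps + total)

# A 3x3 PConv over a quarter of 256 channels costs 1/16 of the full conv.
full = conv_flops(40, 40, 3, 256, 256)
partial = pconv_flops(40, 40, 3, 256)

# Fusing two feature values with equal weights yields (almost) their mean.
fused = bifpn_fuse([1.0, 3.0], [1.0, 1.0])
```

The eps term keeps the fusion numerically stable when all learned weights shrink toward zero, which is why BiFPN can avoid the more expensive softmax normalization.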
4. Discussion
We assessed the performance of the YOLO-CSB framework against recent studies on apple detection to determine its efficacy in complex orchard settings. Recent research on apple detection has predominantly concentrated on single-stage object detection algorithms, seeking to improve detection precision while minimizing model complexity and size. For instance, Liu et al. [1] proposed a YOLOv5s-BC-based apple detection method, achieving a mAP of 92.01%, precision of 88.71%, and recall of 83.80%. Wang et al. [19] developed an improved YOLOv5s model with a mAP of 89.70%, precision of 87.50%, and recall of 86.20%. In contrast, our proposed YOLO-CSB model achieved a mAP of 93.69%, precision of 88.82%, and recall of 87.58%, demonstrating superior detection performance. Additionally, Yang et al. [24] introduced an AAB-YOLO model based on an improved YOLOv11, with a mAP of 91.50% and a parameter count of 10.2 M, whereas YOLO-CSB, with only 9.11 M parameters, reduces the parameter count by 10.69% while maintaining higher accuracy. Chen et al. [21] proposed a Des-YOLO v4-based apple detection model with 12.5 M parameters, while YOLO-CSB’s parameter count is only about 73% of that, further validating its lightweight design. Overall, YOLO-CSB outperforms comparable apple detection methods in both detection accuracy and model compactness.
Accurate positioning is a key step for a harvesting robot to locate small fruits. In this work, an RGB-D camera is combined with the YOLO-CSB network to obtain the 3D coordinates of apple centers, which then serve as test points for the positioning trials. The network detects fruits in the RGB view, yielding 2D bounding-box coordinates, and the depth image supplies the third dimension. The manipulator then drives its end-effector to each selected point, and a precision distance sensor records the actual offset between the end-effector and the target; each measurement is repeated five times to reduce noise. The mean positioning errors for apples are 4.15 mm, 3.96 mm, and 4.02 mm along the X, Y, and Z axes, respectively, meeting the accuracy requirements of automated picking.
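As a quick consistency check on the reported per-axis errors, if the X, Y, and Z errors combined independently, the implied overall Euclidean offset would be their root-sum-square. A minimal sketch, using the per-axis means from the trials above:

```python
import math

# Mean per-axis localization errors (mm) reported in the positioning trials.
err_x, err_y, err_z = 4.15, 3.96, 4.02

# Root-sum-square: the implied overall 3D offset if the per-axis errors
# were independent, here about 7.0 mm.
euclidean_err = math.sqrt(err_x**2 + err_y**2 + err_z**2)
```

This distinction matters when stating accuracy requirements: per-axis errors under 5 mm still imply a total spatial offset of roughly 7 mm, which is the figure an end-effector's grasp tolerance must actually accommodate.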
Previous work on fruit positioning reported a LiDAR-plus-YOLOv5 system with a mean apple localization error of 21.1 mm [33]; replacing LiDAR with an RGB-D sensor, as in this study, reduces cost and brings the error below 5 mm. In bell pepper studies, Guo et al. coupled an RGB-D device with a refined YOLOv4 and reached 89.55% accuracy [34]. Zhou et al. fused an upgraded YOLOX with depth data to keep apple offsets under 7 mm on every axis [35]. Apples, being smaller and more frequently occluded, are harder targets than peppers, yet the YOLO-CSB-plus-RGB-D combination still keeps the average error under 5 mm, outperforming earlier RGB-D attempts on small fruit.
Despite these achievements, YOLO-CSB has certain limitations. As shown in Figure 8, under extreme conditions such as low-light environments, the model may miss detections. These failures are primarily attributable to reduced image quality under insufficient lighting, which diminishes the prominence of apple texture and color features. To address this, future work will enrich the training dataset with more apple images captured under low-light conditions.
Our dataset comes from a single orchard and a single variety, and collection was concentrated in one period. This may degrade model performance in other orchards, for other varieties, or in other seasons. We added only a small number of images from public datasets to the validation set for preliminary cross-validation. Future work will collect multi-source data and conduct comprehensive testing on additional public fruit datasets.
All experiments in this study used a fixed input resolution of 640 × 640. This was because the dataset was uniformly set to this size during preprocessing and export; changing the resolution would involve re-labeling and retraining, and due to limited resources, no comparisons were made. Future work will compare different resolutions (e.g., 512, 640, 800) and analyze their impact on mAP, latency, and computational cost.
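Although other resolutions were not benchmarked, the convolutional cost of a detector scales roughly with the number of input pixels, so the relative cost of the candidate resolutions can be estimated up front. A rough back-of-envelope sketch (this ignores resizing overhead and says nothing about accuracy):

```python
def relative_cost(side, base=640):
    """Approximate FLOPs ratio of a square input of size `side` versus the
    640 x 640 baseline; convolution FLOPs scale with pixel count, i.e. side**2."""
    return (side / base) ** 2

# Candidate resolutions mentioned in the text:
# 512 -> ~0.64x the baseline cost, 640 -> 1.0x, 800 -> ~1.56x.
costs = {s: relative_cost(s) for s in (512, 640, 800)}
```

Such an estimate helps bound the latency budget before committing to retraining at each resolution, which is why a 512/640/800 comparison is a reasonable sweep for the planned experiments.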
Our YOLO-CSB model is currently validated only for apple detection, but it can be readily extended to other fruits, such as oranges, bananas, or pears, which face similar occlusion and lighting challenges. Extension requires only relabeling a new dataset and retraining; the architecture and modules remain unchanged. For multi-class scenarios, such as simultaneously detecting apples at different ripeness levels (green and red) or multiple fruit types, the detection head can be modified to output additional class labels. Because BiFPN handles multi-scale feature fusion, the model is expected to support multi-class targets well, although this remains to be verified experimentally. We will explore these extensions in future work.
Although this study validated the effectiveness of the hand–eye calibration system under laboratory conditions, the localization performance in actual orchard environments remains to be further verified. Limited by current experimental constraints and resources, large-scale field testing will be an important direction for our future work. In the future, we plan to conduct long-term localization accuracy assessments in real orchard scenarios, comprehensively considering the effects of illumination variations, tree occlusions, and terrain undulations on system performance, thereby further enhancing the practicality and robustness of the system.
In conclusion, the YOLO-CSB model and its three-dimensional localization system provide an efficient and precise solution for mechanized apple harvesting in complex orchard environments. Its performance in detection accuracy, model compactness, and localization error offers a scalable reference for the detection and localization of other target fruits. This study evaluated YOLO-CSB in complex orchard environments and compared it with recent apple detection research, demonstrating advantages in both detection accuracy and lightweight design. Nevertheless, several limitations remain. The dataset was collected from a single orchard and a single apple variety, concentrated in autumn under relatively uniform lighting, which may limit generalization to other orchards, varieties, seasons, or extreme lighting conditions. Only a small number of public dataset images were added to the validation set for preliminary cross-dataset validation, without systematic cross-dataset testing. All experiments used a fixed input resolution of 640 × 640 owing to preprocessing constraints and limited resources, leaving the impact of other resolutions unexplored. The model is currently validated only for apple detection, though its architecture is readily extensible to other fruits or multi-class scenarios. Finally, although the hand–eye calibration system was validated under laboratory conditions, its localization performance in actual orchard environments remains to be verified.
Future research can further validate its performance in diverse orchard settings, collect multi-source data, conduct systematic cross-dataset validation on public datasets, compare different input resolutions to optimize the balance between accuracy and efficiency, extend the model to other fruit types or multi-class detection scenarios, and perform large-scale field testing to comprehensively assess localization accuracy under varying illumination, occlusions, and terrain conditions. These efforts will promote the widespread adoption of intelligent agricultural technologies.