1. Introduction
Apples are one of the most important temperate fruits globally. As the largest producer and consumer, China’s industry scale and production technology level exert a decisive influence on the global market [1]. Statistics indicate that China’s apple cultivation area and yield consistently rank first worldwide. In 2022, the cultivation area reached approximately 30.2 thousand hectares with a production of 41.38 million tons, accounting for about 50% of the global total [2,3]. Furthermore, statistics show that from 1998 to 2017, the proportion of hired labor in apple production rose sharply from 2.87% to 28.24% [4]. The sustainable development of this massive industry faces significant challenges from rising labor costs and increasing demand for precise harvesting. Consequently, promoting orchard production automation centered on intelligent harvesting robots has become an inevitable trend for industrial upgrading [5,6,7,8].
The core of an intelligent harvesting system lies in its “visual perception” capability, i.e., accurate, robust, and real-time fruit detection. However, the complex orchard environment, characterized by dense occlusion between branches/leaves and fruits, variable lighting conditions, and diversity in target scales, severely constrains the performance of existing visual systems. Many detection models that excel in controlled laboratory settings often suffer from sharp accuracy drops and poor stability in real orchards, failing to meet the high-reliability requirements of automated harvesting [9,10,11,12]. Therefore, developing a lightweight detection model that maintains high accuracy, speed, and strong robustness in complex environments is a critical prerequisite for achieving automated harvesting.
Early fruit recognition primarily relied on traditional digital image processing techniques, such as color space conversion (e.g., RGB to HSV), threshold segmentation, and handcrafted feature extraction (color, texture, shape) [13]. These methods showed some effectiveness in scenes with simple backgrounds but were limited in feature representation capability. They were heavily dependent on specific lighting and background conditions, exhibiting insufficient generalization to occlusion, overlap, and complex backgrounds, thus failing to meet practical application needs.
The breakthrough in deep learning technology has revolutionized the field of fruit recognition through Convolutional Neural Network (CNN)-based object detection methods, such as the R-CNN series [14,15], SSD [16], and the YOLO series [17,18,19]. These methods can automatically learn deep semantic features from images, significantly improving adaptability to complex environments and detection accuracy. In recent years, researchers have made numerous improvements tailored to agricultural scenarios. For instance, Wu et al. [20] enhanced robustness to variable lighting by improving YOLOv8; Liu Shuai et al. [21] improved the robustness of the detection model in occluded orchard environments by employing an adversarial network to repair occluded images based on the Fast R-CNN detection model; Zhou Hui et al. [22] achieved lightweight design for an orchard detection model by improving YOLOv8; Liu et al. [23] focused on designing lightweight architectures to facilitate model deployment on edge devices; and Shi et al. [24] improved detection accuracy for apple fruits by integrating various attention mechanisms. However, existing studies still require enhancement in accuracy and robustness when dealing with extremely dense occlusion and highly overlapping targets in complex orchard scenes. More importantly, most research stops at the recognition stage, failing to form an effective closed loop with subsequent yield estimation tasks, thereby limiting their overall application value.
Regarding yield estimation, traditional methods mainly rely on statistical sampling [25] or vegetation index models based on remote sensing data [26]. The former suffers from sample representativeness issues, leading to large estimation errors; the latter, while suitable for large-scale monitoring, involves high equipment and technology costs and is ineffective against target information loss caused by occlusion within orchards. Methods combining deep learning-based object counting with average weight estimation [27] offer a new approach. However, relying solely on 2D image counting ignores variations in fruit volume and is susceptible to perspective and occlusion effects, resulting in significant estimation errors. Recently, methods incorporating 3D visual information for volume estimation have emerged. Yet, they often depend on high-precision depth sensors and complex reconstruction algorithms, or face issues like large manual measurement errors and computational complexity, making it difficult to achieve accurate, practical yield estimation in lightweight, low-cost systems.
In summary, current research faces challenges in two key areas for complex orchard environments: fruit detection and precise yield estimation. On one hand, there is a need for a lightweight detection model that maintains high accuracy and strong robustness under occlusion and variable lighting while meeting real-time requirements on edge devices. On the other hand, there is an urgent need to develop a low-cost, high-precision yield estimation algorithm based on visual perception that can overcome occlusion effects.
To systematically address these issues, this paper proposes an integrated solution whose core contributions are a lightweight, high-accuracy detection model and a 3D yield estimation algorithm:
(1) An improved lightweight detection model named YOLO-WBL. Based on YOLOv11n [28], this model combines Wavelet Transform Convolution (WTConv) [29] with the C3K2 module in the backbone network to efficiently extract multi-scale frequency-domain features at low computational cost. It employs a Weighted Bidirectional Feature Pyramid Network (BiFPN) [30] in the neck network to enhance multi-scale feature fusion. Furthermore, a lightweight detection head based on Shared Convolution and separated Group Normalization (Detect-SCGN) [31] is designed, significantly reducing the number of parameters while maintaining or even improving detection accuracy.
(2) A precise yield estimation algorithm based on the Collaborative Local Volumetric (CLV) approach. This algorithm integrates the detection results with point cloud data acquired by an RGB-D camera. Through camera parameter calibration, pixel-to-3D coordinate transformation, point cloud reconstruction, and convex hull volume calculation, it estimates the volume of individual occluded fruits and aggregates these volumes into a total yield estimate, addressing the information loss under occlusion that affects traditional methods.
(3) Integration of the YOLO-WBL model and the CLV algorithm into a complete edge computing system deployed on the NVIDIA Jetson Xavier NX platform, with visual interactive software developed in PySide6 that covers the full process from real-time detection to yield estimation. Extensive field experiments validate the system’s high accuracy, strong robustness, and real-time performance in complex orchard environments.
2. Materials and Methods
2.1. Construction of a Complex Orchard Dataset
To accurately simulate the multifaceted challenges encountered in actual orchard operations and effectively evaluate the robustness of the detection and yield estimation models, systematic data collection was conducted in a typical dwarf, densely planted apple orchard in the Aksu region of Xinjiang. The primary cultivar is Fuji apple, including varieties such as Changfu 2, Qiufu 1, and Yanfu 2. To facilitate data acquisition, the tree height was maintained below 3.5 m. This planting pattern features low and structurally uniform canopies, which not only facilitates robotic arm picking operations but also provides favorable conditions for depth sensors to acquire comprehensive fruit information from multiple angles.
2.1.1. Image Data Collection
The data collection comprehensively considered key variables in the actual orchard environment that affect visual system performance, mainly including: (1) Lighting conditions: Covering different natural lighting scenarios such as front lighting, backlighting, side lighting, and canopy shadows. (2) Shooting perspectives: Including multiple angles such as top-down, horizontal, and bottom-up to simulate the observation perspectives of robots under different working postures. (3) Shooting distance: Varied within the range of 0.1 m to 1.0 m to reflect the typical working distance between the camera and the fruit during actual harvesting. The collection equipment utilized a Huawei Nova7 smartphone (Huawei Technologies Co., Ltd., Shenzhen, China) and an Intel RealSense D455 depth camera (Intel Corporation, Santa Clara, CA, USA) working in coordination to synchronously acquire high-resolution RGB images and corresponding depth information.
For the same batch of apple samples, RGB-D images were captured under two modes to obtain different levels of occlusion and viewing angles: a close-range vertical viewpoint and an inclined multi-viewpoint simulating robotic operation. Immediately after image acquisition, all photographed apples were manually harvested. Their true mass was then measured one by one using a high-precision electronic balance (LiChen FA1204, accuracy 0.01 g) and recorded as the ground-truth benchmark for algorithm evaluation. Consequently, a dataset containing multi-view images and corresponding precise mass data was constructed for subsequent performance comparison and analysis of different yield estimation algorithms. All data were collected intensively during the fruit ripening period on September 20, 2025. The final constructed dataset contains 1634 apple image samples (JPEG format), with partial examples shown in
Figure 1. This dataset fully reflects the diversity of the complex orchard environment, providing a realistic data foundation for subsequent model training and validation.
2.1.2. Data Augmentation and Preprocessing
To enhance the robustness and generalization capability of the model in the variable environment of real orchards, systematic data augmentation was applied to the original collected dataset. Using a Python script, images were sequentially subjected to random rotation (±30°), brightness adjustment (scale factor 0.8–1.2), saturation variation (scale factor 0.7–1.3), and the addition of Gaussian noise (σ = 0.01). This process simulated complex factors commonly encountered during the actual operation of harvesting robots, such as lighting fluctuations, viewpoint changes, dust interference, and sensor noise. Through this augmentation pipeline, a final dataset containing 2014 images was constructed, comprising 400 images with varying lighting conditions, 680 images from different viewpoints, 532 images at varying distances, and 402 images featuring occlusion. Examples of the augmented images are shown in
Figure 2.
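The augmentation sequence described above can be sketched in plain Python on a small RGB image stored as nested lists with values in [0, 1]. This is only an illustrative sketch, not the script used in the study: the function names, the nearest-neighbour rotation, the choice to scale brightness through the HSV value channel, and the interpretation of σ = 0.01 on a [0, 1] intensity scale are all assumptions.

```python
import colorsys
import math
import random


def rotate_nearest(img, angle_deg):
    """Rotate about the image centre with nearest-neighbour sampling.

    Output pixels whose source falls outside the image are left black.
    """
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    out = [[(0.0, 0.0, 0.0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Inverse mapping: locate the source pixel for each output pixel.
            sx = cos_a * (x - cx) + sin_a * (y - cy) + cx
            sy = -sin_a * (x - cx) + cos_a * (y - cy) + cy
            ix, iy = round(sx), round(sy)
            if 0 <= ix < w and 0 <= iy < h:
                out[y][x] = img[iy][ix]
    return out


def clip01(value):
    """Clamp a value to the valid intensity range [0, 1]."""
    return min(1.0, max(0.0, value))


def photometric(pixel, brightness, saturation, sigma, rng):
    """Scale brightness and saturation in HSV, then add Gaussian noise per channel."""
    h, s, v = colorsys.rgb_to_hsv(*pixel)
    r, g, b = colorsys.hsv_to_rgb(h, clip01(s * saturation), clip01(v * brightness))
    return tuple(clip01(c + rng.gauss(0.0, sigma)) for c in (r, g, b))


def augment(img, rng):
    """Apply rotation, brightness, saturation, and noise in sequence."""
    angle = rng.uniform(-30.0, 30.0)      # random rotation within +/-30 degrees
    brightness = rng.uniform(0.8, 1.2)    # brightness scale factor
    saturation = rng.uniform(0.7, 1.3)    # saturation scale factor
    rotated = rotate_nearest(img, angle)
    return [[photometric(p, brightness, saturation, 0.01, rng) for p in row]
            for row in rotated]
```

One set of parameters is drawn per image, so all pixels of an augmented image share the same rotation, brightness, and saturation change, while the noise is drawn independently per channel.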
To ensure the reliability of model training and evaluation, the dataset was divided into a training set (1410 images), a validation set (403 images), and a test set (201 images) in a ratio of 7:2:1. This division ensured complete independence and no overlap of images among the subsets. Data annotation followed the PASCAL VOC format. The LabelImg tool was used for manual annotation, with the label category uniformly defined as “apple,” and corresponding XML files were generated to provide high-quality ground-truth labels for subsequent supervised learning.
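As a minimal sketch of how a 7:2:1 split yields the reported subset sizes, the following function shuffles a list of image names once and cuts it into three disjoint parts. The file names, the fixed seed, and the rounding rule are assumptions made for illustration.

```python
import random


def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle once, then cut into disjoint train/validation/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n = len(items)
    n_train = round(ratios[0] * n)
    n_val = round(ratios[1] * n)
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]   # the remainder goes to the test set
    return train, val, test


names = [f"img_{i:04d}.jpg" for i in range(2014)]  # hypothetical file names
train, val, test = split_dataset(names)
print(len(train), len(val), len(test))  # 1410 403 201
```

If augmented copies are generated before splitting, grouping them by source image before shuffling would keep near-duplicates out of different subsets; the text does not state which order was used.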
2.2. Lightweight Apple Detection Model Based on YOLO-WBL
To provide reliable target location information for subsequent yield estimation, the detector must balance lightweight design with high precision in complex orchard environments. NanoDet prioritizes extreme lightweighting through Ghost modules and depthwise separable convolutions, but its robustness degrades in complex environments. EfficientDet relies on compound scaling and an EfficientNet backbone, offering high computational efficiency at the cost of structural complexity. This study therefore selects YOLOv11n as the baseline model. It consists of a backbone network, a neck network, and a detection head, and offers significant advantages in computational efficiency and parameter count. However, it still suffers from a high rate of missed detections and insufficient robustness in complex scenarios involving branch and leaf occlusion and uneven lighting.
To address these limitations, this paper proposes a lightweight object detection model named YOLO-WBL. Its overall architecture is shown in
Figure 3, with optimizations primarily focused on the following three aspects: (1) To enhance feature representation capability while maintaining a lightweight profile, the original C3K2 module in the backbone network is replaced with a hybrid module combining Wavelet Transform Convolution (WTConv) and C3K2. WTConv separates the low-frequency and high-frequency components of an image through frequency-domain transformation, enabling the network to more effectively capture texture and contour features at different scales. This enhances the model’s robustness to scale variations and local details without significantly increasing the computational burden. (2) To further strengthen multi-scale feature fusion capability, a Weighted Bidirectional Feature Pyramid Network (BiFPN) is adopted to replace the C3K2 module in the original neck structure. BiFPN performs adaptive fusion of features from different scales using learnable weights and facilitates thorough interaction between high-level semantic information and low-level detailed information through its bidirectional pathways, thereby enhancing the model’s representational capacity for apples at varying distances and sizes. (3) To improve detection accuracy and further reduce the parameter count, a lightweight detection head based on Shared Convolution and separated Group Normalization (Detect-SCGN) is designed. This structure reduces redundant computation by sharing shallow convolutional layer weights and independently optimizes the group normalization layers for different output branches. Consequently, it significantly decreases memory usage and inference latency while improving classification and localization accuracy.
2.2.1. C3K2_WT Module Incorporating Wavelet Transform Convolution
The C3K2 module in YOLOv11 is an efficient feature extraction component. Its design core lies in splitting the input feature map into two parts: one part is directly passed through a shortcut connection to preserve the original information, while the other part undergoes deep feature extraction via a C3K bottleneck structure with configurable convolutional kernels (e.g., 3 × 3 or 5 × 5). The features from both paths are finally concatenated. This structure broadens the model’s receptive field by employing variable-sized convolutional kernels, enabling it to capture broader contextual information and thereby enhancing feature representation capability in complex scenes.
However, the introduction of large, variable-sized kernels, while enhancing the receptive field, also significantly increases the model’s parameter count and computational complexity, posing a challenge for deployment on resource-constrained edge devices. To address this issue, this study proposes integrating Wavelet Transform Convolution (WTConv) with the C3K2 module to construct the C3K2_WT module. Although deep CNNs can progressively expand the receptive field through stacked convolutional layers to model global context, relying solely on the layer-by-layer propagation of local convolutional features to capture global information is often inefficient. This is particularly true for orchard scenes that require simultaneously processing multi-scale targets (e.g., apples at varying distances) and complex backgrounds, potentially leading to insufficient understanding of the global scene and consequently affecting detection accuracy for occluded targets [32]. In contrast, WTConv utilizes the wavelet transform to decompose the feature map into different frequency sub-bands, achieving multi-scale frequency-domain feature extraction. This method can effectively enlarge the equivalent receptive field and enhance the model’s representation capability for texture and shape structures without significantly increasing the parameter count.
Specifically, the wavelet convolution layer (WTConv module) employs the Wavelet Transform (WT) to downsample the input tensor, yielding the low-frequency component $X_{LL}$ and the high-frequency components $X_{LH}$, $X_{HL}$, and $X_{HH}$. Subsequently, depthwise convolution is performed on these different frequency maps using small convolutional kernels (e.g., 3 × 3, 5 × 5). Finally, the processed results from the different frequency maps are recombined using the Inverse Wavelet Transform (IWT) to construct the output $Y$. This process can be expressed as:

$$Y = \mathrm{IWT}\left(\mathrm{Conv}\left(w, \mathrm{WT}(X)\right)\right) \tag{1}$$

where $X$ is the input tensor, and $w$ is the weight tensor of the $k \times k$ depthwise kernels, which has four times as many input channels as $X$. This method achieves frequency-domain feature separation and parallel extraction, allowing small-sized convolutional kernels to operate within a larger effective receptive field while maintaining the module’s lightweight characteristics. The structure of the C3K2_WT module and the WTConv processing flow are illustrated in Figure 4.
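The one-level WT, per-band processing, and IWT sequence can be illustrated with a single-channel Haar transform written in plain Python. This is a simplified sketch under stated assumptions: the Haar basis, the sub-band naming convention, and the use of one scalar gain per sub-band (standing in for a learned $k \times k$ depthwise kernel) are illustrative choices, and the function names are hypothetical.

```python
def haar_wt(x):
    """One-level 2-D Haar transform of an even-sized 2-D list -> (LL, LH, HL, HH)."""
    h, w = len(x) // 2, len(x[0]) // 2
    ll = [[0.0] * w for _ in range(h)]
    lh = [[0.0] * w for _ in range(h)]
    hl = [[0.0] * w for _ in range(h)]
    hh = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2   # low-frequency average
            lh[i][j] = (a - b + c - d) / 2   # horizontal differences
            hl[i][j] = (a + b - c - d) / 2   # vertical differences
            hh[i][j] = (a - b - c + d) / 2   # diagonal differences
    return ll, lh, hl, hh


def haar_iwt(ll, lh, hl, hh):
    """Inverse of haar_wt: rebuild the full-resolution map from four sub-bands."""
    h, w = len(ll), len(ll[0])
    x = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            p, q, r, s = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (p + q + r + s) / 2
            x[2 * i][2 * j + 1] = (p - q + r - s) / 2
            x[2 * i + 1][2 * j] = (p + q - r - s) / 2
            x[2 * i + 1][2 * j + 1] = (p - q - r + s) / 2
    return x


def wtconv_one_level(x, gains=(1.0, 1.0, 1.0, 1.0)):
    """WT -> per-band scaling (stand-in for depthwise convolution) -> IWT."""
    bands = haar_wt(x)
    scaled = [[[g * v for v in row] for row in band]
              for band, g in zip(bands, gains)]
    return haar_iwt(*scaled)
```

With unit gains the input is reconstructed exactly, and keeping only the low-frequency band (`gains=(1, 0, 0, 0)`) replaces each 2 × 2 block with its mean, which shows how each sub-band coefficient summarizes a larger region of the input than a same-sized kernel applied directly.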
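Extending this one-level operation with the same cascading principle, in which the low-frequency component is decomposed again at each level, the calculation at level $i$ is given in Equations (2) and (3):

$$X_{LL}^{(i)}, X_{H}^{(i)} = \mathrm{WT}\left(X_{LL}^{(i-1)}\right) \tag{2}$$

$$Y_{LL}^{(i)}, Y_{H}^{(i)} = \mathrm{Conv}\left(w^{(i)}, \left(X_{LL}^{(i)}, X_{H}^{(i)}\right)\right) \tag{3}$$

where $X_{LL}^{(0)}$ represents the input to this layer, and $X_{H}^{(i)}$ denotes the $i$-th level high-frequency maps, namely $X_{LH}^{(i)}$, $X_{HL}^{(i)}$ and $X_{HH}^{(i)}$.

To combine the outputs from different frequencies, we leverage the linear property of WT and its inverse operation, i.e., $\mathrm{IWT}(X + Y) = \mathrm{IWT}(X) + \mathrm{IWT}(Y)$. Therefore, the final executed result can be expressed as:

$$Z^{(i)} = \mathrm{IWT}\left(Y_{LL}^{(i)} + Z^{(i+1)}, Y_{H}^{(i)}\right) \tag{4}$$

where $Z^{(i)}$ denotes the aggregated output from level $i$ onward.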
Addressing the contradiction between receptive field expansion and the surge in parameter count within the C3K2 feature extraction module, this study proposes a lightweight improvement scheme. By introducing wavelet transform, this scheme performs multi-scale decomposition and extraction of features in the frequency domain, effectively mitigating the over-parameterization issue caused by traditional convolutions attempting to enlarge the receptive field. Specifically, by replacing the original convolutional layers with WTConv2d layers, the improvement seamlessly integrates the advantages of wavelet transform in image frequency separation and global information capture while preserving the fundamental architecture and functionality of the bottleneck structure. This enhancement significantly boosts the model’s capability to analyze and fuse multi-frequency components in images (such as fruit contours, textural details, and environmental backgrounds), thereby enriching feature representation and robustness while controlling parameter volume.
2.2.2. Weighted Bidirectional Feature Pyramid Network
To fully leverage multi-scale features and enhance the model’s detection capability for apples of varying sizes, YOLOv11n employs a bidirectional feature pyramid structure based on PANet [33] for feature fusion. This architecture aggregates deep semantic information with shallow detail features through bottom-up and top-down pathways, thereby strengthening multi-scale representation. However, its fixed bidirectional fusion mechanism introduces additional computational overhead and may also generate redundancy or loss during information propagation, which to some extent affects the model’s inference efficiency and accuracy stability in resource-constrained environments.
To address this, this study optimizes the original neck module by adopting a Weighted Bidirectional Feature Pyramid Network (BiFPN). Building upon PANet, BiFPN incorporates two key improvements: First, it simplifies the network topology by removing redundant nodes that have only a single input and do not participate in feature fusion. Second, it adds skip connections between features at the same level and introduces learnable adaptive weights for cross-scale fusion, thereby achieving more efficient multi-scale feature interaction while reducing information loss. A structural comparison between PANet and BiFPN is shown in
Figure 5.
During the feature fusion process, BiFPN assigns learnable weights to input features of different resolutions, enabling the network to dynamically adjust the contribution of each scale’s features to the fused result, thereby achieving more refined feature selection and integration. This mechanism helps enhance the model’s ability to distinguish between targets with scale variations and similar appearances. Taking the 6th-level feature fusion illustrated in Figure 5b as an example, its computational process can be expressed by the following formulas:

$$P_6^{td} = \mathrm{Conv}\left(\frac{w_1 \cdot P_6^{in} + w_2 \cdot \mathrm{Resize}\left(P_7^{in}\right)}{w_1 + w_2 + \epsilon}\right) \tag{5}$$

$$P_6^{out} = \mathrm{Conv}\left(\frac{w_1' \cdot P_6^{in} + w_2' \cdot P_6^{td} + w_3' \cdot \mathrm{Resize}\left(P_5^{out}\right)}{w_1' + w_2' + w_3' + \epsilon}\right) \tag{6}$$

Here, $P_6^{td}$ represents the intermediate feature at level 6 from the top-down path, while $P_6^{out}$ denotes the output feature at level 6 from the bottom-up path; $w_1$ and $w_2$ are the learnable weights for the level 6 intermediate feature fusion, indicating the importance of the level 6 input feature and the resized level 7 input feature, respectively; $w_1'$, $w_2'$, and $w_3'$ are the learnable weights for the level 6 output feature fusion, representing the importance of the level 6 input feature, the level 6 intermediate feature, and the resized level 5 output feature, respectively; $P_6^{in}$ and $P_7^{in}$ refer to the level 6 and level 7 input features, respectively; $P_5^{out}$ is the level 5 output feature; Resize denotes upsampling or downsampling operations; and $\epsilon$ is a small constant that keeps the denominator away from zero, set to 0.0001. All other features are constructed in a similar manner. BiFPN also employs depthwise separable convolution for feature fusion, followed by batch normalization and activation after each convolution.
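The weighted fusion step at one node can be sketched in a few lines of Python. This is an illustrative sketch only: features are flat lists that are assumed to be already resized to a common shape, the convolution that follows the fusion is omitted, and the ReLU clamp on the weights follows the fast normalized fusion of the original BiFPN design rather than anything stated in this section. The function and variable names are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse same-shaped feature vectors as sum(w_i * F_i) / (sum(w_i) + eps)."""
    if len(features) != len(weights):
        raise ValueError("one weight per input feature is required")
    # Clamp weights at zero (ReLU) so the normalization stays well defined.
    w = [max(0.0, wi) for wi in weights]
    denom = sum(w) + eps
    length = len(features[0])
    return [sum(wi * f[k] for wi, f in zip(w, features)) / denom
            for k in range(length)]


# Level-6 intermediate feature from the level-6 input and the resized level-7 input.
p6_in = [1.0, 2.0, 3.0]
p7_resized = [3.0, 2.0, 1.0]
p6_td = fast_normalized_fusion([p6_in, p7_resized], [1.0, 1.0])
```

With equal weights the result is close to the element-wise mean; during training the weights would be learned, so a more informative scale can dominate the fused feature.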
Compared to the PANet architecture, BiFPN streamlines the network topology by removing single-contribution nodes and incorporates skip connections between input and output nodes at the same level. This design enables richer cross-level feature reuse and fusion without significantly increasing computational complexity. By stacking this bidirectional path module multiple times, the network facilitates deeper multi-scale feature interactions, thereby learning more discriminative feature representations. The introduction of the BiFPN structure effectively enhances the feature fusion efficiency of the model under limited computational resources. Its core advantages lie in: dynamically adjusting the contribution of feature maps at different resolutions through learnable weights, and preserving more original feature information via skip connections. This allows the network to more comprehensively capture the multi-scale contextual information of apple targets, particularly in handling complex scenarios such as branch/leaf occlusion, fruit overlap, and scale variations due to distance. This not only strengthens the semantic representation of apples by deep features but also significantly improves the model’s detection robustness and accuracy across varying environments, thereby optimizing overall performance while ensuring model efficiency.
2.2.3. Detect-SCGN Detection Head
To enhance multi-scale detection accuracy while strictly controlling model complexity, this study refactors the original decoupled detection head of YOLOv11 into a lightweight design. Although the original decoupled head improves detection performance by separating classification and regression tasks, its independent branch structure significantly increases parameter count and computational cost. Furthermore, its single-scale prediction mode fails to fully utilize the rich multi-scale feature information aggregated by the neck network, limiting the model’s adaptability to apples of different sizes.
Addressing these issues, this paper proposes a lightweight detection head based on Shared Convolution and separated Group Normalization (Detect-SCGN), whose structure is shown in
Figure 6. The core of this design lies in sharing shallow convolutional weights across multiple prediction branches to drastically reduce parameters, while independently processing the Batch Normalization (BN) layer for each branch to accommodate distribution differences among features at different levels. Specifically, the P3, P4, and P5 feature layers output from BiFPN (responsible for small, medium, and large targets, respectively) first undergo independent 1 × 1 convolutions for channel adjustment and preliminary feature extraction. Considering the inherent differences in the numerical distributions of feature maps at different scales, a Layer Normalization operation is introduced in each branch to stabilize the training process and accelerate convergence.
Subsequently, the parameters from the 1 × 1 convolutions are shared with the following 3 × 3 convolutional layers to form a shared parameter module. This module consists of 3 × 3 convolutional layers and Batch Normalization (BN) layers, which are used to effectively extract discriminative features. The core feature extraction parameters are shared with the Convolutional Regularization layer (Conv-Reg) and the loss function module, as shown in Equation (7). To address scale variations of detection targets, a scaling layer is utilized to resize the feature maps transformed by Conv-Reg, further enhancing the adaptability and prediction accuracy for multi-scale object detection. The scaling layer dynamically adjusts the feature sizes, as shown in Equations (8) and (9), while reducing the model’s parameter count and computational overhead, effectively mitigating detection accuracy loss. This feature transformation process can be formally represented as:
where
denotes the output feature,
represents the input feature,
signifies the extracted feature,
is the scaling tensor,
s is the scaling factor,
is the function for adjusting the tensor size,
indicates the initial input tensor, and
denotes the result of scaling and concatenating multiple tensors.
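The parameter saving from sharing convolutional weights across the P3, P4, and P5 branches can be illustrated with simple arithmetic. The channel width, the assumption of two 3 × 3 convolutions per stack, and the omission of bias and normalization parameters are hypothetical simplifications, not values taken from the model.

```python
def conv_params(c_in, c_out, k, bias=False):
    """Number of learnable parameters in a standard k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def head_params(channels, levels, shared):
    """Parameters of a stack of two 3x3 convolutions, shared or duplicated per level."""
    per_stack = 2 * conv_params(channels, channels, 3)
    return per_stack if shared else levels * per_stack


channels, levels = 64, 3  # hypothetical width; three levels (P3, P4, P5)
separate = head_params(channels, levels, shared=False)
shared = head_params(channels, levels, shared=True)
print(separate, shared)  # 221184 73728
```

Under these assumptions, sharing the stack cuts its convolutional parameters to one third, while per-branch normalization and scaling layers add only a small number of parameters on top.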
2.3. Accurate Yield Estimation Method Based on the Collaborative Local Volumetric (CLV) Approach
Accurate yield estimation is crucial for enabling refined orchard management and optimizing harvest planning. To overcome the significant errors associated with traditional 2D image-based statistical methods when dealing with occlusion and overlap, this study proposes a Collaborative Local Volumetric (CLV) yield estimation algorithm based on 3D point cloud reconstruction. This method deeply integrates the detection results from the aforementioned YOLO-WBL model with the real 3D information captured by a depth camera. It achieves precise yield estimation in occluded scenes by inferring the complete fruit volume from partially visible surfaces. The overall workflow is illustrated in
Figure 7.
The method first employs the YOLO-WBL model to detect apple targets in the RGB image, obtaining their 2D bounding boxes and pixel-level center coordinates. Subsequently, the core algorithm proceeds in two stages: (1) Depth Optimization and Coordinate Mapping: The 2D detection results are aligned with the corresponding depth map using the depth camera’s intrinsic parameters, precisely mapping pixel coordinates to 3D space to acquire the real 3D point cloud of the visible portion of each apple. (2) Local Volume Reconstruction and Aggregation: Based on the irregular real-surface point cloud, the volume of individual fruits is estimated using convex hull reconstruction and sphericity correction algorithms, which are then aggregated to obtain the total yield for the area. This approach discards the assumptions of simplifying apples into regular shapes or relying on average weight, directly performing calculations based on 3D geometric information, thereby significantly enhancing estimation robustness and accuracy in complex scenes.
2.3.1. Depth Optimization and 3D Coordinate Mapping
Existing yield estimation methods based on 2D detection often assume apples are regular spheres lying on the same plane and convert pixel sizes to physical dimensions using empirical scaling factors. The core flaw of such methods lies in the lack of support from real depth information, leading to distorted 3D spatial distribution and overly idealized volume estimation models. This results in significant errors in real orchards due to occlusion, perspective distortion, and the morphological diversity of fruits.
To address these issues, the conversion factors and position coordinates are corrected by invoking the depth camera’s parameters to ensure the accuracy of model reconstruction. Coordinate optimization is divided into three steps: depth value unit conversion, image coordinate normalization, and coordinate system calculation. In the depth value conversion stage, the sensor’s electrical signals are converted into real physical distances through unit system transformation (e.g., millimeter-to-meter, meter-to-centimeter), establishing the absolute distance relationship from each pixel in the image to the camera. The specific calculation formulas are shown in Equations (10) and (11).
$$Z_m = \frac{depth}{1000} \tag{10}$$

$$Z_{cm} = 100 \cdot Z_m \tag{11}$$

where $depth$ represents the raw depth value read from the depth map (mm), $Z_{cm}$ denotes the depth distance of the target in the camera coordinate system (cm), and $Z_m$ signifies the depth distance of the target in the camera coordinate system (m).
During the image coordinate normalization process, pixel positions are converted into spatial direction vectors originating from the camera’s optical center and pointing towards object points in 3D space. This is based on the principle of back-projection using the pinhole camera model. To address the misalignment between the pixel coordinate origin and the optical center origin, the coordinate origin is first shifted from the image corner to the optical center. The specific calculation formula is as follows:
$$dx = u_0 - c_x \tag{12}$$

$$dy = v_0 - c_y \tag{13}$$

where $c_x$ and $c_y$ represent the principal point coordinates of the image, $u_0$ denotes the pixel horizontal coordinate, $v_0$ denotes the pixel vertical coordinate, $dx$ represents the offset in the horizontal direction, and $dy$ represents the offset in the vertical direction.
After shifting the center point to the optical center, focal length normalization is performed to remove the influence of the camera’s internal geometric parameters and convert pixel coordinates into dimensionless normalized coordinates.
$$x_{nor} = \frac{dx}{f_x} \tag{14}$$

$$y_{nor} = \frac{dy}{f_y} \tag{15}$$

where $f_x$ represents the focal length in the horizontal direction, $f_y$ represents the focal length in the vertical direction, $x_{nor}$ denotes the horizontal normalized coordinate, and $y_{nor}$ denotes the vertical normalized coordinate.
Finally, the camera-frame coordinates are calculated by scaling the normalized coordinates by the true depth, which restores their actual physical scale:

$$X_{cm} = x_{nor} \cdot Z_{cm} \tag{16}$$

$$Y_{cm} = y_{nor} \cdot Z_{cm} \tag{17}$$

where $X_{cm}$ represents the horizontal position in the camera coordinate system, and $Y_{cm}$ represents the vertical position in the camera coordinate system.
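The three steps (depth unit conversion, origin shift and focal-length normalization, and depth scaling) can be combined into one back-projection routine. The sketch below assumes the raw depth value is expressed in millimeters and uses hypothetical intrinsic parameters; in practice these would come from the depth camera’s calibration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels."""
    fx: float
    fy: float
    cx: float
    cy: float


def pixel_to_camera_cm(u0, v0, depth_mm, k):
    """Back-project pixel (u0, v0) with raw depth (mm) to camera coordinates (cm)."""
    z_m = depth_mm / 1000.0        # millimeters -> meters
    z_cm = z_m * 100.0             # meters -> centimeters
    dx, dy = u0 - k.cx, v0 - k.cy  # shift origin to the principal point
    x_nor, y_nor = dx / k.fx, dy / k.fy  # dimensionless normalized coordinates
    return x_nor * z_cm, y_nor * z_cm, z_cm


k = Intrinsics(fx=400.0, fy=400.0, cx=320.0, cy=240.0)  # hypothetical values
x_cm, y_cm, z_cm = pixel_to_camera_cm(420, 240, 500, k)
print(x_cm, y_cm, z_cm)  # 12.5 0.0 50.0
```

A pixel 100 px to the right of the principal point at 0.5 m depth therefore lies 12.5 cm to the right of the optical axis; applying the same mapping to every pixel inside a detection box yields the visible-surface point cloud of that fruit.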
2.3.2. Point Cloud Model Reconstruction Based on Deformed Spheres
Traditional methods for 3D modeling of apples often simplify them into regular spheres or ellipsoids. While this greatly simplifies calculation, it fails to represent the diverse morphological characteristics of real fruits (such as local concavities, asymmetrical growth, etc.), leading to inherent biases in volume estimation. To overcome this limitation, this paper proposes an irregular point cloud generation method based on deformed spheres. This method, grounded in spherical coordinates, finely simulates the irregular geometric features of real apple surfaces by introducing multi-frequency deformation functions, random radius perturbations, and apple-specific morphological enhancements.
To prevent excessive clustering of point clouds at the poles when sampling uniformly on a sphere, this study employs an area-element-based uniform sampling strategy. This ensures that the distribution density of sampling points on the spherical surface is proportional to the area, thereby obtaining a more uniform initial distribution of the 3D point cloud. On this basis, a multi-frequency composite deformation function is applied to modulate the radial distance of the sampling points, simulating natural undulations and irregularities on the apple surface. The generation of spherical coordinates for sampling points, the probability density function, and the multi-frequency deformation process can be formally expressed as follows:
where the random number is uniformly distributed in the interval [0, 1]; the azimuth angle and the zenith angle are the spherical coordinates of each sampling point; u is an intermediate variable used to generate the uniform distribution of the sampled angle; and di represents the i-th deformation direction vector, which is defined by the zenith angle and the azimuth angle of the i-th deformation direction.
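As an illustration of area-element-based uniform sampling, the sketch below draws the azimuth uniformly and obtains the zenith angle from a uniform intermediate variable through an inverse cosine. This is a common way to achieve area-proportional density on a sphere and is assumed here; the exact expressions used in this study may differ. The function name `sample_unit_sphere` is hypothetical.

```python
import math
import random


def sample_unit_sphere(n, seed=0):
    """Draw n points on the unit sphere with density proportional to area."""
    rng = random.Random(seed)
    points = []
    for _ in range(n):
        phi = 2.0 * math.pi * rng.random()   # azimuth, uniform in [0, 2*pi)
        u = rng.random()                     # intermediate uniform variable
        theta = math.acos(1.0 - 2.0 * u)     # zenith; cos(theta) is uniform
        s = math.sin(theta)
        points.append((s * math.cos(phi), s * math.sin(phi), math.cos(theta)))
    return points


points = sample_unit_sphere(2000)
print(len(points), points[0])
```

Drawing the zenith angle itself uniformly would instead concentrate points near the poles, which is the clustering this strategy is meant to avoid.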
The final radius of the apple model is determined based on the calculation of the deformation factors. Furthermore, to account for the morphological characteristics of apples, an apple-specific deformation function is introduced to simulate their typical morphological features. The specific implementation method is shown in Equations (23)–(28).
where v represents the position vector of a sampling point; the angle between v and di is the argument of the i-th deformation function, which has its own strength and frequency; the resulting deformation factor gives the deformed radius from the apple’s base radius R; and the deformation factors for the depressions at the apple’s top and bottom are then applied to obtain the apple’s final radius.
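The sketch below shows one possible way to combine these quantities. The functional forms are assumptions made for illustration only: each deformation term is taken as strength times the cosine of frequency times the angle between v and di, and the top and bottom depressions are modelled as Gaussian dips around the two poles. The parameters `top_depth`, `bottom_depth`, and `width` are hypothetical, and the random radius perturbation is omitted for brevity. These are not the expressions of Equations (23)–(28).

```python
import math


def angle_between(v, d):
    """Angle in radians between two non-zero 3D vectors."""
    dot = sum(a * b for a, b in zip(v, d))
    norm = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in d))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def deformed_radius(v, base_radius, directions, strengths, freqs,
                    top_depth=0.12, bottom_depth=0.06, width=0.35):
    """Radius along direction v after multi-frequency and polar deformation."""
    # Assumed multi-frequency term: sum of strength * cos(frequency * angle).
    factor = 1.0
    for d, a, f in zip(directions, strengths, freqs):
        factor += a * math.cos(f * angle_between(v, d))
    radius = base_radius * factor
    # Assumed apple-specific term: Gaussian dips at the top and bottom poles.
    zenith = angle_between(v, (0.0, 0.0, 1.0))
    top = 1.0 - top_depth * math.exp(-((zenith / width) ** 2))
    bottom = 1.0 - bottom_depth * math.exp(-(((math.pi - zenith) / width) ** 2))
    return radius * top * bottom


# One deformation direction along +x on an apple with a 4 cm base radius.
r_side = deformed_radius((1.0, 0.0, 0.0), 4.0, [(1.0, 0.0, 0.0)], [0.05], [2.0])
r_top = deformed_radius((0.0, 0.0, 1.0), 4.0, [], [], [])
print(r_side, r_top)
```

Applying `deformed_radius` to every sampled direction and multiplying the unit vector by the result yields an irregular point cloud of the kind described above.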
2.3.3. Volume Estimation Based on Convex Hull Algorithm and PCA Sphericity Correction
Traditional apple volume estimation methods (e.g., those based on minimum bounding boxes or standard ellipsoid models) often introduce significant errors due to excessive simplification of the fruit’s irregular geometry. To enhance the accuracy of volume calculation, this study proposes an optimized method combining the convex hull algorithm with Principal Component Analysis (PCA)-based sphericity correction.
(1) Initial Volume Estimation Based on the Convex Hull Algorithm
The core of this method is to approximate the fruit volume by calculating the smallest convex polyhedron (convex hull) enclosing the target point cloud. The algorithm first identifies the peripheral vertices of the point cloud, performs triangulation to form a closed convex polyhedron, and obtains the total volume by summing the volumes of all tetrahedra. This computational process can be expressed as:
where the summation runs over the triangular facets of the hull; the coordinates of the three vertices of the i-th triangle, together with the vector from the origin to any vertex of that triangle, determine the volume of the i-th tetrahedron; and the sum of these volumes is the convex hull volume.
This method offers three advantages: (1) High Fitting Accuracy: The convex hull closely conforms to the irregular shape, reducing the average volume estimation error by over 60% compared to the bounding box method. (2) Shape Adaptability: It requires no pre-defined geometric model and automatically adapts to different shapes. (3) Strong Robustness: It is insensitive to variations in point cloud density and noise, exhibits rotational invariance, and provides a stable geometric foundation for subsequent analysis.
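The tetrahedron summation can be sketched in plain Python as follows. The sketch assumes that the triangular facets of the hull are already available, for example from a convex hull routine such as `scipy.spatial.ConvexHull`, and it uses the vertex centroid as the common apex of the tetrahedra rather than the coordinate origin, so that the result does not depend on facet orientation. The function names and the unit cube used as input are illustrative only.

```python
def tetra_volume(a, b, c, d):
    """Unsigned volume of tetrahedron abcd: |(b-a) . ((c-a) x (d-a))| / 6."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    ad = [d[i] - a[i] for i in range(3)]
    cross = (ac[1] * ad[2] - ac[2] * ad[1],
             ac[2] * ad[0] - ac[0] * ad[2],
             ac[0] * ad[1] - ac[1] * ad[0])
    return abs(sum(ab[i] * cross[i] for i in range(3))) / 6.0


def convex_polyhedron_volume(vertices, triangles):
    """Sum tetrahedra spanned by an interior point and each triangular facet."""
    n = len(vertices)
    centroid = tuple(sum(v[i] for v in vertices) / n for i in range(3))
    return sum(tetra_volume(centroid, vertices[i], vertices[j], vertices[k])
               for i, j, k in triangles)


# Unit cube: 8 vertices and 12 triangular facets (two per face).
cube = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
        (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)]
facets = [(0, 1, 2), (0, 2, 3), (4, 5, 6), (4, 6, 7),
          (0, 1, 5), (0, 5, 4), (3, 2, 6), (3, 6, 7),
          (0, 3, 7), (0, 7, 4), (1, 2, 6), (1, 6, 5)]
hull_volume = convex_polyhedron_volume(cube, facets)
print(hull_volume)
```

Because the vertex centroid of a convex polyhedron lies inside it, the unsigned tetrahedra do not overlap and their volumes add up to the hull volume.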
(2) Sphericity Analysis and Volume Correction Based on PCA
Since convex hull volume systematically overestimates the true volume of non-convex objects, the error is particularly significant when fruit surfaces have concavities. To further improve accuracy, this paper introduces PCA to quantitatively analyze the shape characteristics of the point cloud. PCA projects the point cloud onto three principal component axes through orthogonal transformation. The corresponding eigenvalues reflect the dispersion of the point cloud along each principal axis.
In the above equations, C represents the covariance matrix of the point cloud, which is computed from the coordinates of the i-th point, the centroid of the point cloud, and the number of points n. The sphericity is computed from the maximum and minimum eigenvalues of C, and the shape correction factor derived from the sphericity is applied to the convex hull volume to obtain the corrected volume.
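A minimal sketch of the PCA step is given below. It builds the covariance matrix from the centred points and computes its eigenvalues with the standard closed-form method for symmetric 3 × 3 matrices. Defining sphericity as the ratio of the minimum to the maximum eigenvalue is an assumption made for illustration, and the mapping from sphericity to the shape correction factor is not reproduced because its form cannot be inferred from the text. All function names are hypothetical.

```python
import math


def covariance_3x3(points):
    """Covariance matrix of a 3D point cloud about its centroid."""
    n = len(points)
    centroid = [sum(p[k] for p in points) / n for k in range(3)]
    cov = [[0.0] * 3 for _ in range(3)]
    for p in points:
        d = [p[k] - centroid[k] for k in range(3)]
        for a in range(3):
            for b in range(3):
                cov[a][b] += d[a] * d[b] / n
    return cov


def symmetric_eigenvalues(m):
    """Eigenvalues of a symmetric 3x3 matrix, in descending order."""
    p1 = m[0][1] ** 2 + m[0][2] ** 2 + m[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted((m[0][0], m[1][1], m[2][2]), reverse=True)
    q = (m[0][0] + m[1][1] + m[2][2]) / 3.0
    p2 = sum((m[k][k] - q) ** 2 for k in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(m[i][j] - (q if i == j else 0.0)) / p for j in range(3)]
         for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    phi = math.acos(max(-1.0, min(1.0, det_b / 2.0))) / 3.0
    e1 = q + 2.0 * p * math.cos(phi)
    e3 = q + 2.0 * p * math.cos(phi + 2.0 * math.pi / 3.0)
    return [e1, 3.0 * q - e1 - e3, e3]


def sphericity(points):
    """Assumed definition: minimum eigenvalue divided by maximum eigenvalue."""
    eig = symmetric_eigenvalues(covariance_3x3(points))
    return eig[2] / eig[0]


# Six points elongated along the x = y diagonal.
cloud = [(1, 1, 0), (-1, -1, 0), (0.5, -0.5, 0), (-0.5, 0.5, 0),
         (0, 0, 1), (0, 0, -1)]
print(sphericity(cloud))
```

Under this definition a value close to 1 indicates a nearly isotropic point cloud, while smaller values indicate elongation or flattening along one principal axis.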
2.4. Model Evaluation Metrics
To comprehensively and quantitatively evaluate the overall performance of the proposed YOLO-WBL detection model in complex orchard environments, evaluation metrics were selected along three dimensions: detection accuracy, model efficiency, and deployment footprint. These include Precision (P), Recall (R), mean Average Precision at an IoU threshold of 0.5 (mAP@0.5), Floating Point Operations (FLOPs), average inference speed (Frames Per Second, FPS), the number of model parameters (Parameters), and the model’s disk storage size (MB). Together, these metrics constitute a multi-dimensional evaluation framework, ensuring that high detection accuracy is pursued while model complexity is strictly constrained to meet the deployment requirements of embedded devices in practical orchard environments.
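For reference, the accuracy metrics can be sketched as follows. Precision and recall follow their standard definitions from true positives, false positives, and false negatives. The average precision function uses all-point interpolation of the precision-recall curve, which is an assumption here: detection toolkits differ in the interpolation they apply, and the one used in this study is not stated. The counts and scores in the example are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    return tp / (tp + fp), tp / (tp + fn)


def average_precision(scores, is_true_positive, num_ground_truth):
    """Area under the precision-recall curve with all-point interpolation."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    tp = fp = 0
    recalls, precisions = [], []
    for i in order:  # sweep detections from most to least confident
        if is_true_positive[i]:
            tp += 1
        else:
            fp += 1
        recalls.append(tp / num_ground_truth)
        precisions.append(tp / (tp + fp))
    # Replace each precision by the maximum precision at any higher recall.
    for k in range(len(precisions) - 2, -1, -1):
        precisions[k] = max(precisions[k], precisions[k + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


# Hypothetical counts for one class at an IoU threshold of 0.5.
p, r = precision_recall(tp=90, fp=10, fn=30)
ap = average_precision([0.9, 0.8, 0.7, 0.6], [True, False, True, True], 4)
print(p, r, ap)
```

With a single object class, as in apple detection, mAP@0.5 coincides with the average precision of that class.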
2.5. Experimental Environment and Training Strategy
2.5.1. Experimental Environment
All experiments were conducted on a workstation equipped with an AMD Ryzen Threadripper PRO 3975WX 32-core processor, 384 GB of RAM, and an NVIDIA RTX A5000 GPU (24 GB VRAM). The operating system was Windows 11. The deep learning framework used was PyTorch 1.13, the programming language was Python 3.8, and parallel computing relied on CUDA 11.7 and the corresponding version of the cuDNN acceleration library.
2.5.2. Training Strategy
Model training employed the Stochastic Gradient Descent (SGD) optimizer. Key hyperparameters were set as follows: initial learning rate of 0.01, momentum of 0.937, and a weight decay coefficient of 0.0005. The training batch size was set to 16, and input images were uniformly resized to 640 × 640 pixels. The entire training process spanned 300 epochs. To enhance the model’s generalization capability to scale variations, occlusion, and background changes, the Mosaic data augmentation technique was introduced during training. This technique, by randomly stitching four training images, effectively simulates target aggregation and occlusion in complex scenes, strengthening the model’s ability to learn spatial context.
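To make the role of the three optimizer hyperparameters concrete, the sketch below applies one SGD update with momentum and weight decay to a list of scalar weights. It follows the common formulation in which the weight decay term is added to the gradient before the momentum buffer is updated. Learning rate schedules, warm-up, and any per-layer parameter grouping used by the actual training framework are not described in the text and are not modelled here. The function name `sgd_step` is hypothetical.

```python
def sgd_step(weights, grads, velocity, lr=0.01, momentum=0.937,
             weight_decay=0.0005):
    """One SGD update with momentum and L2 weight decay on scalar weights."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + weight_decay * w      # weight decay acts as an extra gradient
        v = momentum * v + g          # momentum buffer accumulates gradients
        new_velocity.append(v)
        new_weights.append(w - lr * v)
    return new_weights, new_velocity


# A single weight of 1.0 with gradient 0.5 and an empty momentum buffer.
weights, velocity = sgd_step([1.0], [0.5], [0.0])
print(weights, velocity)
```

With a momentum of 0.937, a gradient that keeps the same sign over many steps is effectively amplified, which is why the initial learning rate can remain as small as 0.01.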
2.6. Performance Comparison of Different Feature Fusion Networks
To validate the effectiveness of the adopted Weighted Bidirectional Feature Pyramid Network (BiFPN) for apple detection in complex orchard environments, this study conducted comparative experiments against several current mainstream feature fusion architectures, including Hyper-MFM (Hypergraph-based Multi-scale Feature Fusion Module), PST (Pyramid Sparse Transformer), and PSConv (Pinwheel-shaped Convolutional Module). All comparative experiments used the same dataset and training strategy to ensure fairness. The performance metrics of each model are detailed in Table 1.
The experimental results indicate that BiFPN significantly enhances the model’s lightweight characteristics while maintaining high detection accuracy. Specifically, on the apple test set, BiFPN achieved an mAP@0.5 of 86.3%, while the model size was reduced by approximately 24% compared to the baseline YOLOv11n. Compared to the RepHMS structure, although BiFPN’s mAP@0.5 was slightly lower by 11.3%, its parameter count and computational load were substantially reduced, making it more suitable for edge devices with limited computing power. Compared to PST and PSConv, BiFPN significantly reduced model parameters (by 35% and 21.4%, respectively) and storage footprint (by 23% and 21%, respectively) at the cost of a slight increase in computational load. Compared to HAFB and Hyper-MFM, BiFPN exhibited comparable detection accuracy but held a clear advantage in lightweight metrics.
In summary, BiFPN achieves a favorable balance between detection accuracy and model efficiency, making it particularly suitable for deployment on resource-constrained edge computing platforms in orchards. It provides a viable feature fusion solution for subsequent real-time detection and yield estimation systems.
2.7. Ablation Study
To systematically validate the effectiveness and contribution of the proposed improved modules (C3K2_WT, BiFPN, and Detect-SCGN), we conducted a progressive ablation study using YOLOv11n as the baseline model. In each experiment, only one improved module was introduced sequentially at the corresponding network location. Specifically, these included: introducing the C3K2_WT module into the backbone network, integrating the BiFPN structure into the neck network, and employing the Detect-SCGN module in the detection head. The detailed performance comparison is shown in Table 2.
Based on the experimental results presented in Table 2, the following analysis is provided. First, integrating the C3K2_WT module into the backbone network yielded a model with a mean Average Precision (mAP@0.5) of 86.9% and a parameter count of 2.5 M, comparable to the baseline. Notably, this modification reduced computational load and model size by approximately 1.6% and 1.7%, respectively, while improving the recall rate by about 1.1%. These findings indicate that incorporating the wavelet transform within the module slightly reduces model complexity and resource consumption without compromising core detection accuracy.
Second, building upon the improved backbone, the introduction of the BiFPN structure in the neck of the network yielded substantial lightweighting benefits. Compared to the original baseline, this enhancement reduced the model’s parameter count and model size by approximately 23% and 24%, respectively, underscoring the efficiency of BiFPN in feature fusion and structural simplification. Although mAP@0.5 and recall declined slightly at this stage (to 84.5% and 75.4%, respectively), the proposed model remains competitive with existing improved models. For instance, the YOLO-AP model proposed by Huang et al. [34] raised mAP@0.5 and recall to 92.3% and 84.6%, improvements of 4% and 3.6% over its baseline, respectively; the final YOLO-WBL model reaches 94.5% and 93.7% of those mAP@0.5 and recall values. In this context, the 17.4% increase in computational load can be considered a reasonable trade-off for more comprehensive multi-scale feature interaction. Due to resource constraints, the model of Huang et al. could not be replicated.
Finally, the introduction of the proposed Detect-SCGN detection head resulted in comprehensive performance optimization. This module recovered the performance lost through BiFPN integration, raising precision and recall by 1.7 and 1.2 percentage points, respectively. Furthermore, it reduced the parameter count, computational load, and model size by 10.5%, 2.7%, and 5.8%, respectively, while restoring mAP@0.5 to its level prior to BiFPN integration and adding a further 0.3%. This supports the effectiveness of the Detect-SCGN design, which uses shared convolution and separated normalization, in enhancing discriminative power while reducing redundancy.
Compared to the original YOLOv11n baseline, the final YOLO-WBL model, which integrates all improvements, maintained excellent detection accuracy (achieving 87.2% mAP@0.5) with a marginal 0.1-percentage-point increase in recall rate. Meanwhile, the parameter count, computational load, and model size were significantly reduced by 32%, 3.15%, and 28.87%, respectively. The ablation study fully demonstrates the synergistic effects of the individual modules, collectively contributing to a more lightweight and higher-accuracy solution for apple detection.
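The relative figures quoted in this ablation analysis can be checked with a few lines of arithmetic. The sketch below uses only values stated in the text: the mAP@0.5 and recall of YOLO-WBL (87.2% and 79.3%), those reported for YOLO-AP (92.3% and 84.6%), and rounded parameter counts of 2.5 M and 1.7 M. The helper names are hypothetical, and because the parameter counts are rounded, the computed reduction is only approximate.

```python
def retention(reference, value):
    """Share of a reference score that a model reaches, in percent."""
    return value / reference * 100.0


def relative_reduction(baseline, improved):
    """Relative decrease from baseline to improved, in percent."""
    return (baseline - improved) / baseline * 100.0


map_retained = retention(92.3, 87.2)             # YOLO-WBL vs. YOLO-AP, mAP@0.5
recall_retained = retention(84.6, 79.3)          # YOLO-WBL vs. YOLO-AP, recall
param_reduction = relative_reduction(2.5, 1.7)   # rounded parameter counts (M)
map_gain_points = 87.2 - 86.9                    # percentage points

print(round(map_retained, 1), round(recall_retained, 1),
      round(param_reduction, 1), round(map_gain_points, 1))
```

Differences between two percentages, such as the mAP@0.5 gain, are percentage points, whereas the parameter reduction is a relative change with respect to the baseline.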
2.8. Performance Comparison with Mainstream Models
To comprehensively evaluate the performance of the proposed YOLO-WBL model in the complex task of apple detection in orchard environments, comparative experiments were conducted against several mainstream object detection models. Although non-YOLO models such as EfficientDet, NanoDet, and RT-DETR were considered, existing research [35] indicates that these models exhibit comparatively lower recall and average detection rates in complex agricultural settings, alongside higher memory consumption, when contrasted with YOLO-based architectures. Consequently, the comparative experiments in this study focused primarily on YOLO-series detection models, including YOLOv5, YOLOv8, YOLOv10, YOLOv11, YOLOv12, and YOLOv13. All experiments were conducted under identical dataset conditions and training strategies to ensure fairness and consistency in the evaluation. The detailed performance metrics are presented in Table 3.
Based on the experimental results in Table 3, the YOLO-WBL model demonstrates outstanding performance in terms of model lightweighting. Its parameter count is only 1.7 M, and its model size is merely 3.72 MB, representing significant reductions of 34.61% and 28.87%, respectively, compared to its baseline model, YOLOv11n. Even when compared to the more recent YOLOv12 model, YOLO-WBL maintains advantages of approximately 29.17% and 28.18% in parameter count and model size, respectively, reflecting the effectiveness of its lightweight design oriented towards edge deployment. Regarding detection accuracy, YOLO-WBL is also highly competitive. Its mean Average Precision (mAP@0.5) reaches 87.2%, surpassing contemporary advanced models such as YOLOv10 (86.8%) and YOLOv12 (86.6%). Furthermore, its Precision (P) and Recall (R) are 93.8% and 79.3%, respectively, ranking among the highest of the YOLO-series models compared. This indicates that YOLO-WBL’s lightweight design does not come at the cost of detection performance, enabling it to handle the challenges posed by occlusion, lighting variations, and multi-scale targets in orchard environments.
In summary, the YOLO-WBL model achieves an exceptional balance between detection accuracy and model efficiency. Its characteristics of “high accuracy and small footprint” make it particularly suitable for deployment on edge computing devices with limited computational resources (such as the Jetson series) in orchards, providing a reliable and efficient visual solution for real-time apple detection and subsequent yield estimation in complex environments.
4. Conclusions
This study aims to address two core challenges in automated apple harvesting within complex orchard environments: high-precision real-time detection and accurate yield estimation. To this end, we propose an integrated vision solution, whose core components include a lightweight, high-precision detection model named YOLO-WBL and a 3D point cloud-based yield estimation algorithm named CLV. Through systematic theoretical analysis, comprehensive comparative experiments, and deployment validation on actual edge devices, the effectiveness and advancement of this solution have been fully demonstrated. The main conclusions are as follows:
(1) The YOLO-WBL detection model achieves an exceptional balance between accuracy and efficiency. The model realizes lightweight design and performance enhancement through three key innovations: introducing the C3K2_WT module incorporating wavelet transform into the backbone network to enhance multi-scale frequency-domain feature extraction capability; adopting a Weighted Bidirectional Feature Pyramid Network (BiFPN) to optimize neck feature fusion and improve multi-scale representation efficiency; and designing a lightweight detection head based on Shared Convolution and separated Group Normalization (Detect-SCGN) to improve detection accuracy while reducing the parameter count. Experimental results show that the final model achieved a precision of 93.8%, a recall of 79.3%, and an mAP@0.5 of 87.2% on the test set, while its model size was compressed to only 3.72 MB, a reduction of 28.87% compared to the baseline model. Ablation studies confirmed the effectiveness and synergy of the individual improvement modules.
(2) The proposed model demonstrates outstanding performance in lightweight characteristics, possessing excellent edge deployment properties. Compared to mainstream models from YOLOv5 to YOLOv12, YOLO-WBL maintains leading detection accuracy while having the smallest parameter count and model size, showcasing significant lightweight advantages. When deployed on the NVIDIA Jetson Xavier NX edge device and accelerated by TensorRT, its inference speed reached 8.7 FPS, a 52.6% improvement over the original YOLOv11n, fully meeting the stringent real-time requirements of complex orchard environments.
(3) The CLV yield estimation algorithm represents a leap from 2D statistical counting to 3D geometric estimation, achieving significantly improved accuracy. The algorithm deeply integrates the detection results from YOLO-WBL with point cloud data captured by a depth camera. Through precise coordinate mapping, point cloud reconstruction based on deformed spheres, and a calculation method combining convex hull volume with PCA-based sphericity correction, it overcomes the failure issues of traditional methods under occlusion. Experiments indicate that in scenarios where the occlusion rate is below 40%, the Mean Absolute Percentage Error (MAPE) for single-fruit weight estimation can be controlled within 8%. Under ideal, unoccluded conditions, the error rate is as low as 3.2%, far surpassing traditional methods based on 2D image counting or regular shape assumptions.
(4) The integrated system verifies the practicality and robustness of the proposed solution. A visualization system developed based on the PySide6 framework successfully integrated detection, yield estimation, and path planning functionalities, operating stably on the edge computing platform. Field tests confirmed the high accuracy, strong robustness, and real-time processing capability of the entire system in real, complex orchard environments.
In summary, the YOLO-WBL model and CLV algorithm proposed in this study provide an efficient, accurate, and deployable solution for apple detection and yield estimation in complex orchard environments. This work holds positive reference value for promoting the development of smart orchard management and automated harvesting technologies.