Research on Lightweight Apple Detection and 3D Accurate Yield Estimation for Complex Orchard Environments

Chen, Bangbang; Sun, Xuzhe; Liu, Xiangdong; Ma, Baojian; Ding, Feng

doi:10.3390/horticulturae12030393

by Bangbang Chen ¹,², Xuzhe Sun ¹, Xiangdong Liu ¹,*, Baojian Ma ¹ and Feng Ding ²

¹ School of Mechatronic Engineering, Xinjiang Institute of Technology, Aksu 843100, China

² School of Mechatronic Engineering, Xi’an Technological University, Xi’an 710021, China

* Author to whom correspondence should be addressed.

Horticulturae 2026, 12(3), 393; https://doi.org/10.3390/horticulturae12030393

Submission received: 25 January 2026 / Revised: 9 March 2026 / Accepted: 19 March 2026 / Published: 22 March 2026

(This article belongs to the Special Issue AI for a Precision and Resilient Horticulture)

Abstract

Severe foliage occlusion and dynamically changing lighting conditions in complex orchard environments pose significant challenges for visual perception systems in automated apple harvesting, including low detection accuracy, poor robustness, and insufficient real-time performance. To address these issues, this study proposes an improved lightweight detection network based on YOLOv11, named YOLO-WBL, along with a precise yield estimation algorithm based on 3D point clouds, termed CLV. The YOLO-WBL network is optimized in three aspects: (1) A C3K2_WT module integrating wavelet transform is introduced into the backbone network to enhance multi-scale feature extraction capability; (2) A weighted bidirectional feature pyramid network (BiFPN) is adopted in the neck network to improve the efficiency of multi-scale feature fusion; (3) A lightweight shared convolution separated batch normalization detection head (Detect-SCGN) is designed to significantly reduce the parameter count while maintaining accuracy. Based on this detection model, the CLV algorithm deeply integrates depth camera point cloud information through 3D coordinate mapping, irregular point cloud reconstruction, and convex hull volume calculation to achieve accurate estimation of individual fruit volume and total yield. Experimental results demonstrate that: (1) The YOLO-WBL model achieves a precision of 93.8%, recall of 79.3%, and mean average precision (mAP@0.5) of 87.2% on the apple test set; (2) The model size is only 3.72 MB, a reduction of 28.87% compared to the baseline model; (3) When deployed on an NVIDIA Jetson Xavier NX edge device, its inference speed reaches 8.7 FPS, meeting real-time requirements; (4) In scenarios with an occlusion rate below 40%, the mean absolute percentage error (MAPE) of yield estimation can be controlled within 8%. Experimental validation was conducted using apple images selected from the dataset under varying lighting intensities and fruit occlusion conditions. 
The results demonstrate that the CLV algorithm significantly outperforms traditional average-weight-based estimation methods. This study provides an efficient, accurate, and deployable visual solution for intelligent apple harvesting and yield estimation in complex orchard environments, offering practical reference value for advancing smart orchard production.

Keywords: occlusion; point cloud; detection; depth information; lightweight

1. Introduction

Apples are one of the most important temperate fruits globally. As the largest producer and consumer, China’s industry scale and production technology level exert a decisive influence on the global market [1]. Statistics indicate that China’s apple cultivation area and yield consistently rank first worldwide. In 2022, the cultivation area reached approximately 30.2 thousand hectares with a production of 41.38 million tons, accounting for about 50% of the global total [2,3]. Furthermore, statistics show that from 1998 to 2017, the proportion of hired labor in apple production rose sharply from 2.87% to 28.24% [4]. The sustainable development of this massive industry faces significant challenges from rising labor costs and increasing demand for precise harvesting. Consequently, promoting orchard production automation centered on intelligent harvesting robots has become an inevitable trend for industrial upgrading [5,6,7,8].

The core of an intelligent harvesting system lies in its “visual perception” capability, i.e., accurate, robust, and real-time fruit detection. However, the complex orchard environment, characterized by dense occlusion between branches/leaves and fruits, variable lighting conditions, and diversity in target scales, severely constrains the performance of existing visual systems. Many detection models that excel in controlled laboratory settings often suffer from sharp accuracy drops and poor stability in real orchards, failing to meet the high-reliability requirements of automated harvesting [9,10,11,12]. Therefore, developing a lightweight detection model that maintains high accuracy, speed, and strong robustness in complex environments is a critical prerequisite for achieving automated harvesting.

Early fruit recognition primarily relied on traditional digital image processing techniques, such as color space conversion (e.g., RGB to HSV), threshold segmentation, and handcrafted feature extraction (color, texture, shape) [13]. These methods showed some effectiveness in scenes with simple backgrounds but were limited in feature representation capability. They were heavily dependent on specific lighting and background conditions, exhibiting insufficient generalization to occlusion, overlap, and complex backgrounds, thus failing to meet practical application needs.

The breakthrough in deep learning technology has revolutionized the field of fruit recognition through Convolutional Neural Network (CNN)-based object detection methods, such as the R-CNN series [14,15], SSD [16], and the YOLO series [17,18,19]. These methods can automatically learn deep semantic features from images, significantly improving adaptability to complex environments and detection accuracy. In recent years, researchers have made numerous improvements tailored to agricultural scenarios. For instance, Wu et al. [20] enhanced robustness to variable lighting by improving YOLOv8; Liu Shuai et al. [21] improved the robustness of the detection model in occluded orchard environments by employing an adversarial network to repair occluded images based on the Fast R-CNN detection model; Zhou Hui et al. [22] achieved lightweight design for an orchard detection model by improving YOLOv8. Liu et al. [23] focused on designing lightweight architectures to facilitate model deployment on edge devices; Shi et al. [24] improved detection accuracy for apple fruits by integrating various attention mechanisms. However, existing studies still require enhancement in accuracy and robustness when dealing with extremely dense occlusion and highly overlapping targets in complex orchard scenes. More importantly, most research stops at the recognition stage, failing to form an effective closed loop with subsequent yield estimation tasks, thereby limiting their overall application value.

Regarding yield estimation, traditional methods mainly rely on statistical sampling [25] or vegetation index models based on remote sensing data [26]. The former suffers from sample representativeness issues, leading to large estimation errors; the latter, while suitable for large-scale monitoring, involves high equipment and technology costs and is ineffective against target information loss caused by occlusion within orchards. Methods combining deep learning-based object counting with average weight estimation [27] offer a new approach. However, relying solely on 2D image counting ignores variations in fruit volume and is susceptible to perspective and occlusion effects, resulting in significant estimation errors. Recently, methods incorporating 3D visual information for volume estimation have emerged. Yet, they often depend on high-precision depth sensors and complex reconstruction algorithms, or face issues like large manual measurement errors and computational complexity, making it difficult to achieve accurate, practical yield estimation in lightweight, low-cost systems.

In summary, current research faces challenges in two key areas for complex orchard environments: fruit detection and precise yield estimation. On one hand, there is a need for a lightweight detection model that maintains high accuracy and strong robustness under occlusion and variable lighting while meeting real-time requirements on edge devices. On the other hand, there is an urgent need to develop a low-cost, high-precision yield estimation algorithm based on visual perception that can overcome occlusion effects.

To systematically address these issues, this paper proposes an integrated solution, with its core contribution being the simultaneous proposal of a lightweight high-accuracy detection model and a 3D precise yield estimation algorithm. (1) Proposal of an improved lightweight detection model named YOLO-WBL. Based on YOLOv11n [28], this model innovatively introduces Wavelet Transform Convolution (WTConv) [29] and the C3K2 module into the backbone network to efficiently extract multi-scale frequency-domain features at low computational cost. It employs a Weighted Bidirectional Feature Pyramid Network (BiFPN) [30] in the neck network to enhance multi-scale feature fusion capability. Furthermore, a lightweight detection head based on Shared Convolution and separated Group Normalization (Detect-SCGN) [31] is designed, significantly reducing the number of parameters while ensuring and even improving detection accuracy. (2) Proposal of a precise yield estimation algorithm based on Collaborative Local Volume (CLV). This algorithm deeply integrates the aforementioned detection results with point cloud data acquired by an RGB-D camera. Through camera parameter calibration, pixel-to-3D coordinate transformation, point cloud reconstruction, and convex hull volume calculation, it achieves precise estimation of the volume of individual occluded fruits, subsequently aggregating these for total yield estimation. This effectively addresses the information loss problem under occlusion in traditional methods. (3) Integration of the YOLO-WBL model and CLV algorithm to construct and deploy a complete edge computing system. A visual interactive software based on PySide6 was developed on the NVIDIA Jetson Xavier NX platform, realizing full-process functionality from real-time detection to yield estimation. Extensive field experiments validate the system’s high accuracy, strong robustness, and real-time performance in complex orchard environments.

2. Materials and Methods

2.1. Construction of a Complex Orchard Dataset

To accurately simulate the multifaceted challenges encountered in actual orchard operations and effectively evaluate the robustness of the detection and yield estimation models, systematic data collection was conducted in a typical dwarf, densely planted apple orchard in the Aksu region of Xinjiang. The primary cultivar is Fuji apple, including varieties such as Changfu 2, Qiufu 1, and Yanfu 2. To facilitate data acquisition, the tree height was maintained below 3.5 m. This planting pattern features low and structurally uniform canopies, which not only facilitates robotic arm picking operations but also provides favorable conditions for depth sensors to acquire comprehensive fruit information from multiple angles.

2.1.1. Image Data Collection

The data collection comprehensively considered key variables in the actual orchard environment that affect visual system performance, mainly including: (1) Lighting conditions: Covering different natural lighting scenarios such as front lighting, backlighting, side lighting, and canopy shadows. (2) Shooting perspectives: Including multiple angles such as top-down, horizontal, and bottom-up to simulate the observation perspectives of robots under different working postures. (3) Shooting distance: Varied within the range of 0.1 m to 1.0 m to reflect the typical working distance between the camera and the fruit during actual harvesting. The collection equipment utilized a Huawei Nova7 smartphone (Huawei Technologies Co., Ltd., Shenzhen, China) and an Intel RealSense D455 depth camera (Intel Corporation, Santa Clara, CA, USA) working in coordination to synchronously acquire high-resolution RGB images and corresponding depth information.

For the same batch of apple samples, RGB-D images were captured under two modes to obtain different levels of occlusion and viewing angles: a close-range vertical viewpoint and an inclined multi-viewpoint simulating robotic operation. Immediately after image acquisition, all photographed apples were manually harvested. Their true mass was then measured one by one using a high-precision electronic balance (LiChen FA1204, accuracy 0.01 g) and recorded as the ground-truth benchmark for algorithm evaluation. Consequently, a dataset containing multi-view images and corresponding precise mass data was constructed for subsequent performance comparison and analysis of different yield estimation algorithms. All data were collected intensively during the fruit ripening period on September 20, 2025. The final constructed dataset contains 1634 apple image samples (JPEG format), with partial examples shown in Figure 1. This dataset fully reflects the diversity of the complex orchard environment, providing a realistic data foundation for subsequent model training and validation.

2.1.2. Data Augmentation and Preprocessing

To enhance the robustness and generalization capability of the model in the variable environment of real orchards, systematic data augmentation was applied to the original collected dataset. Using a Python script, images were sequentially subjected to random rotation (±30°), brightness adjustment (scale factor 0.8–1.2), saturation variation (scale factor 0.7–1.3), and the addition of Gaussian noise (σ = 0.01). This process simulated complex factors commonly encountered during the actual operation of harvesting robots, such as lighting fluctuations, viewpoint changes, dust interference, and sensor noise. Through this augmentation pipeline, a final dataset containing 2014 images was constructed, comprising 400 images with varying lighting conditions, 680 images from different viewpoints, 532 images at varying distances, and 402 images featuring occlusion. Examples of the augmented images are shown in Figure 2.
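The parameter ranges quoted above can be illustrated with a minimal sketch. It is not the authors' script: it assumes pixel values normalized to [0, 1] (so that σ = 0.01 is on that scale), it operates on a single RGB pixel for brevity, and the function names `sample_params` and `augment_pixel` are hypothetical. Rotation is a geometric operation on the whole image, so the sketch only samples the angle and leaves the rotation itself to an image library.

```python
import colorsys
import random


def sample_params(rng):
    """Draw one set of augmentation parameters within the ranges quoted in the text."""
    return {
        "angle_deg": rng.uniform(-30.0, 30.0),  # random rotation angle
        "brightness": rng.uniform(0.8, 1.2),    # brightness scale factor
        "saturation": rng.uniform(0.7, 1.3),    # saturation scale factor
        "noise_sigma": 0.01,                    # Gaussian noise std (assumed 0..1 scale)
    }


def clip01(value):
    """Clamp a channel value to the valid range [0, 1]."""
    return min(1.0, max(0.0, value))


def augment_pixel(rgb, params, rng):
    """Apply brightness, saturation and Gaussian noise to one RGB pixel in [0, 1]."""
    r, g, b = (clip01(c * params["brightness"]) for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    r, g, b = colorsys.hsv_to_rgb(h, clip01(s * params["saturation"]), v)
    return tuple(clip01(c + rng.gauss(0.0, params["noise_sigma"])) for c in (r, g, b))


rng = random.Random(0)              # fixed seed so the sketch is repeatable
params = sample_params(rng)
pixel = (0.80, 0.20, 0.15)          # a reddish colour chosen only for illustration
augmented = augment_pixel(pixel, params, rng)
print(params)
print(augmented)
```

Applying the same three photometric steps to every pixel of an image, followed by a rotation by `angle_deg`, would reproduce the order of operations described in the text.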

To ensure the reliability of model training and evaluation, the dataset was divided into a training set (1410 images), a validation set (403 images), and a test set (201 images) in a ratio of 7:2:1. This division ensured complete independence and no overlap of images among the subsets. Data annotation followed the PASCAL VOC format. The LabelImg tool was used for manual annotation, with the label category uniformly defined as “apple,” and corresponding XML files were generated to provide high-quality ground-truth labels for subsequent supervised learning.
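The 7:2:1 division can be sketched as follows. This is an illustrative example rather than the authors' procedure: the file names, the seed, and the helper `split_dataset` are assumptions. It shows that rounding the 70% and 20% shares of 2014 images and assigning the remainder to the test set yields the subset sizes reported above.

```python
import random


def split_dataset(items, ratios=(0.7, 0.2, 0.1), seed=42):
    """Shuffle the items and cut them into disjoint train/val/test subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = round(len(items) * ratios[0])
    n_val = round(len(items) * ratios[1])
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


names = [f"apple_{i:04d}.jpg" for i in range(2014)]  # hypothetical file names
train, val, test = split_dataset(names)
print(len(train), len(val), len(test))
```

Because the three slices come from one shuffled list, no image can appear in two subsets. When augmented copies of the same photograph exist, grouping them by source image before splitting is a common way to keep the subsets independent.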

2.2. Lightweight Apple Detection Model Based on YOLO-WBL

To achieve high-accuracy real-time apple detection in complex orchard environments and provide reliable target location information for subsequent yield estimation, the detection model must balance lightweight design with high precision. NanoDet prioritizes extreme lightweighting through the use of Ghost modules and depthwise separable convolutions, but it suffers from poor robustness in complex environments. EfficientDet, on the other hand, relies on compound scaling and an EfficientNet backbone, offering high computational efficiency at the cost of structural complexity. Therefore, this study selects YOLOv11n as the baseline model. This model consists of a backbone network, a neck network, and a detection head, offering significant advantages in computational efficiency and parameter count. However, it still suffers from a high rate of missed detections and insufficient robustness in complex scenarios involving branch and leaf occlusion and uneven lighting.

To address these limitations, this paper proposes a lightweight object detection model named YOLO-WBL. Its overall architecture is shown in Figure 3, with optimizations primarily focused on the following three aspects: (1) To enhance feature representation capability while maintaining a lightweight profile, the original C3K2 module in the backbone network is replaced with a hybrid module combining Wavelet Transform Convolution (WTConv) and C3K2. WTConv separates the low-frequency and high-frequency components of an image through frequency-domain transformation, enabling the network to more effectively capture texture and contour features at different scales. This enhances the model’s robustness to scale variations and local details without significantly increasing the computational burden. (2) To further strengthen multi-scale feature fusion capability, a Weighted Bidirectional Feature Pyramid Network (BiFPN) is adopted to replace the C3K2 module in the original neck structure. BiFPN performs adaptive fusion of features from different scales using learnable weights and facilitates thorough interaction between high-level semantic information and low-level detailed information through its bidirectional pathways, thereby enhancing the model’s representational capacity for apples at varying distances and sizes. (3) To improve detection accuracy and further reduce the parameter count, a lightweight detection head based on Shared Convolution and separated Group Normalization (Detect-SCGN) is designed. This structure reduces redundant computation by sharing shallow convolutional layer weights and independently optimizes the group normalization layers for different output branches. Consequently, it significantly decreases memory usage and inference latency while improving classification and localization accuracy.

2.2.1. C3K2_WT Module Incorporating Wavelet Transform Convolution

The C3K2 module in YOLOv11 is an efficient feature extraction component. Its design core lies in splitting the input feature map into two parts: one part is directly passed through a shortcut connection to preserve the original information, while the other part undergoes deep feature extraction via a C3K bottleneck structure with configurable convolutional kernels (e.g., 3 × 3 or 5 × 5). The features from both paths are finally concatenated. This structure broadens the model’s receptive field by employing variable-sized convolutional kernels, enabling it to capture broader contextual information and thereby enhancing feature representation capability in complex scenes.

However, the introduction of large, variable-sized kernels, while enhancing the receptive field, also significantly increases the model’s parameter count and computational complexity, posing a challenge for deployment on resource-constrained edge devices. To address this issue, this study proposes integrating Wavelet Transform Convolution (WTConv) with the C3K2 module to construct the C3K2_WT module. Although deep CNNs can progressively expand the receptive field through stacked convolutional layers to model global context, relying solely on the layer-by-layer propagation of local convolutional features to capture global information is often inefficient. This is particularly true for orchard scenes that require simultaneously processing multi-scale targets (e.g., apples at varying distances) and complex backgrounds, potentially leading to insufficient understanding of the global scene and consequently affecting detection accuracy for occluded targets [32]. In contrast, WTConv utilizes the wavelet transform to decompose the feature map into different frequency sub-bands, achieving multi-scale frequency-domain feature extraction. This method can effectively enlarge the equivalent receptive field and enhance the model’s representation capability for texture and shape structures without significantly increasing the parameter count.

Specifically, the wavelet convolution layer (WTConv module) first applies the Wavelet Transform (WT) to decompose and downsample the input tensor, yielding the low-frequency component $X_{LL}^{(i)}$ and the high-frequency components $X_{HL}^{(i)}$, $X_{LH}^{(i)}$, and $X_{HH}^{(i)}$. Depthwise convolution is then performed on these frequency maps using small convolutional kernels (e.g., 3 × 3 or 5 × 5). Finally, the processed frequency maps are recombined by the Inverse Wavelet Transform (IWT) to construct the output $Y$. This process can be expressed as:

$$Y = \mathrm{IWT}(\mathrm{Conv}(w, \mathrm{WT}(X))) \tag{1}$$

where $X$ is the input tensor and $w$ is the weight tensor of the k × k depthwise kernels, which has four times as many input channels as $X$ because the transform produces four sub-bands per channel. This method achieves frequency-domain feature separation and parallel extraction, allowing small-sized convolutional kernels to operate within a larger effective receptive field while maintaining the module’s lightweight characteristics. The structure of the C3K2_WT module and the WTConv processing flow are illustrated in Figure 4.
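Equation (1) can be made concrete with a small single-channel sketch. Several simplifications are assumed here and are not taken from the paper: the Haar wavelet is used as the basis, the sub-band naming (LL, LH, HL, HH) follows one of several conventions in use, and the k × k depthwise kernels are replaced by one scalar weight per sub-band (equivalent to a 1 × 1 depthwise kernel). The function names are hypothetical.

```python
def haar_wt(x):
    """One-level 2D Haar transform of a 2D list with even height and width.

    Returns four half-resolution sub-bands (LL, LH, HL, HH)."""
    h, w = len(x) // 2, len(x[0]) // 2
    ll = [[0.0] * w for _ in range(h)]
    lh = [[0.0] * w for _ in range(h)]
    hl = [[0.0] * w for _ in range(h)]
    hh = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2   # local average (low frequency)
            lh[i][j] = (a - b + c - d) / 2   # horizontal difference
            hl[i][j] = (a + b - c - d) / 2   # vertical difference
            hh[i][j] = (a - b - c + d) / 2   # diagonal difference
    return ll, lh, hl, hh


def haar_iwt(ll, lh, hl, hh):
    """Inverse of haar_wt: rebuilds the full-resolution 2D list."""
    h, w = len(ll), len(ll[0])
    x = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            p, q, r, s = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            x[2 * i][2 * j] = (p + q + r + s) / 2
            x[2 * i][2 * j + 1] = (p - q + r - s) / 2
            x[2 * i + 1][2 * j] = (p + q - r - s) / 2
            x[2 * i + 1][2 * j + 1] = (p - q - r + s) / 2
    return x


def wtconv(x, weights):
    """Y = IWT(Conv(w, WT(X))) with one scalar weight per sub-band."""
    bands = haar_wt(x)
    scaled = [[[wgt * v for v in row] for row in band]
              for band, wgt in zip(bands, weights)]
    return haar_iwt(*scaled)


X = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0],
     [9.0, 10.0, 11.0, 12.0],
     [13.0, 14.0, 15.0, 16.0]]

identity = wtconv(X, (1.0, 1.0, 1.0, 1.0))  # unit weights reproduce the input
low_pass = wtconv(X, (1.0, 0.0, 0.0, 0.0))  # LL only: each 2x2 block becomes its mean
print(low_pass)
```

Keeping only the low-frequency band turns every 2 × 2 block into its mean, which illustrates why a small kernel applied after the transform effectively covers a larger region of the original feature map.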

This one-level decomposition can be extended by applying the same cascading principle recursively to the low-frequency component. The calculation formulas are given in Equations (2) and (3):

$$X_{LL}^{(i)}, X_{H}^{(i)} = \mathrm{WT}(X_{LL}^{(i-1)}) \tag{2}$$

$$Y_{LL}^{(i)}, Y_{H}^{(i)} = \mathrm{Conv}(W^{(i)}, (X_{LL}^{(i)}, X_{H}^{(i)})) \tag{3}$$

where $X_{LL}^{(0)}$ represents the input to this layer, $X_{H}^{(i)}$ denotes the $i$-th level high-frequency maps, namely $X_{HL}^{(i)}$, $X_{LH}^{(i)}$, and $X_{HH}^{(i)}$, and $Y_{LL}^{(i)}$ and $Y_{H}^{(i)}$ are the corresponding convolved outputs.

To combine the outputs from different frequencies, we leverage the linearity of WT and its inverse, i.e., $\mathrm{IWT}(X + Y) = \mathrm{IWT}(X) + \mathrm{IWT}(Y)$. Therefore, the final result can be expressed as:

$$Z^{(i)} = \mathrm{IWT}(Y_{LL}^{(i)} + Z^{(i+1)}, Y_{H}^{(i)}) \tag{4}$$

where $Z^{(i)}$ is the aggregated output from level $i$ onward.
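The cascade in Equations (2)–(4) can be traced with a short sketch. To keep it compact, the example assumes a one-dimensional Haar transform (a single high-frequency band per level instead of three) and scalar weights in place of the depthwise kernels; the helper names are hypothetical and the input length must be divisible by 2 raised to the number of levels.

```python
import math

S = math.sqrt(2.0)


def wt(x):
    """One-level 1D Haar transform: returns (low, high), each half as long as x."""
    low = [(x[2 * k] + x[2 * k + 1]) / S for k in range(len(x) // 2)]
    high = [(x[2 * k] - x[2 * k + 1]) / S for k in range(len(x) // 2)]
    return low, high


def iwt(low, high):
    """Inverse 1D Haar transform."""
    x = []
    for l, h in zip(low, high):
        x.extend([(l + h) / S, (l - h) / S])
    return x


def cascaded_wtconv(x, weights):
    """weights[i] = (w_low, w_high): scalar stand-ins for the kernels of level i + 1."""
    outputs = []
    x_ll = x
    for w_low, w_high in weights:
        x_ll, x_h = wt(x_ll)                     # Eq. (2): decompose the previous LL band
        y_ll = [w_low * v for v in x_ll]         # Eq. (3): convolve each band
        y_h = [w_high * v for v in x_h]
        outputs.append((y_ll, y_h))
    z = None                                     # nothing to add below the deepest level
    for y_ll, y_h in reversed(outputs):          # Eq. (4): aggregate from deep to shallow
        if z is not None:
            y_ll = [a + b for a, b in zip(y_ll, z)]
        z = iwt(y_ll, y_h)
    return z


X = [4.0, 2.0, 6.0, 8.0, 1.0, 3.0, 5.0, 7.0]
# Zero low-frequency weight at level 1 and unit weights elsewhere: the level-2
# branch then supplies the level-1 LL band, so the cascade returns the input.
Z = cascaded_wtconv(X, [(0.0, 1.0), (1.0, 1.0)])
print(Z)
```

The chosen weights make the role of linearity visible: the reconstruction of the deeper level is simply added to the shallower low-frequency output before the inverse transform is applied.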

Addressing the contradiction between receptive field expansion and the surge in parameter count within the C3K2 feature extraction module, this study proposes a lightweight improvement scheme. By introducing wavelet transform, this scheme performs multi-scale decomposition and extraction of features in the frequency domain, effectively mitigating the over-parameterization issue caused by traditional convolutions attempting to enlarge the receptive field. Specifically, by replacing the original convolutional layers with WTConv2d layers, the improvement seamlessly integrates the advantages of wavelet transform in image frequency separation and global information capture while preserving the fundamental architecture and functionality of the bottleneck structure. This enhancement significantly boosts the model’s capability to analyze and fuse multi-frequency components in images (such as fruit contours, textural details, and environmental backgrounds), thereby enriching feature representation and robustness while controlling parameter volume.

2.2.2. Weighted Bidirectional Feature Pyramid Network

To fully leverage multi-scale features and enhance the model’s detection capability for apples of varying sizes, YOLOv11n employs a bidirectional feature pyramid structure based on PANet [33] for feature fusion. This architecture aggregates deep semantic information with shallow detail features through bottom-up and top-down pathways, thereby strengthening multi-scale representation. However, its fixed bidirectional fusion mechanism introduces additional computational overhead and may also generate redundancy or loss during information propagation, which to some extent affects the model’s inference efficiency and accuracy stability in resource-constrained environments.

To address this, this study optimizes the original neck module by adopting a Weighted Bidirectional Feature Pyramid Network (BiFPN). Building upon PANet, BiFPN incorporates two key improvements: First, it simplifies the network topology by removing redundant nodes that have only a single input and do not participate in feature fusion. Second, it adds skip connections between features at the same level and introduces learnable adaptive weights for cross-scale fusion, thereby achieving more efficient multi-scale feature interaction while reducing information loss. A structural comparison between PANet and BiFPN is shown in Figure 5.

During the feature fusion process, BiFPN assigns learnable weights to input features of different resolutions, enabling the network to dynamically adjust the contribution of each scale’s features to the fused result, thereby achieving more refined feature selection and integration. This mechanism helps enhance the model’s ability to distinguish between targets with scale variations and similar appearances. Taking the 6th-level feature fusion illustrated in Figure 5b as an example, its computational process can be expressed by the following formula:

$$P_{6}^{td} = \mathrm{Conv}\left(\frac{\omega_{1} P_{6}^{in} + \omega_{2}\,\mathrm{Resize}(P_{7}^{in})}{\omega_{1} + \omega_{2} + \epsilon}\right) \tag{5}$$

$$P_{6}^{out} = \mathrm{Conv}\left(\frac{\omega_{1}^{'} P_{6}^{in} + \omega_{2}^{'} P_{6}^{td} + \omega_{3}^{'}\,\mathrm{Resize}(P_{5}^{out})}{\omega_{1}^{'} + \omega_{2}^{'} + \omega_{3}^{'} + \epsilon}\right) \tag{6}$$

Here, $P_{6}^{td}$ represents the intermediate feature at level 6 on the top-down path, while $P_{6}^{out}$ denotes the output feature at level 6 on the bottom-up path; $\omega_{1}$ and $\omega_{2}$ are the learnable weights for the level 6 intermediate fusion, indicating the importance of the level 6 input feature and the resized level 7 input feature, respectively; $\omega_{1}^{'}$, $\omega_{2}^{'}$, and $\omega_{3}^{'}$ are the learnable weights for the level 6 output fusion, representing the importance of the level 6 input feature, the level 6 intermediate feature, and the resized level 5 output feature, respectively; $P_{6}^{in}$ and $P_{7}^{in}$ refer to the level 6 and level 7 input features, respectively; $P_{5}^{out}$ is the level 5 output feature; Resize denotes an upsampling or downsampling operation; and $\epsilon$ is a small constant, set to 0.0001, that keeps the denominator away from zero. All features are constructed in a similar manner, and BiFPN also employs depthwise separable convolution for feature fusion, followed by batch normalization and activation after each convolution.
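The weighted fusion in Equations (5) and (6) reduces to a normalized weighted sum, sketched below. The example makes several assumptions for brevity: features are flat lists that have already been resized to a common shape, the Conv operator is omitted, and the weights are clamped at zero before normalization, as is done in the original BiFPN design. All numerical values and the helper name `fuse` are illustrative.

```python
def fuse(features, weights, eps=1e-4):
    """Normalized weighted fusion: sum_i(w_i * f_i) / (sum_i(w_i) + eps).

    `features` is a list of equally sized flat lists; negative weights are
    clamped to zero so that every contribution stays non-negative."""
    w = [max(0.0, wi) for wi in weights]
    total = sum(w) + eps
    n = len(features[0])
    return [sum(wi * f[k] for wi, f in zip(w, features)) / total for k in range(n)]


p6_in = [1.0, 2.0]
p7_resized = [3.0, 4.0]    # stands for Resize(P7_in)
p5_resized = [5.0, 6.0]    # stands for Resize(P5_out)

p6_td = fuse([p6_in, p7_resized], [1.0, 1.0])                    # Eq. (5) without Conv
p6_out = fuse([p6_in, p6_td, p5_resized], [0.5, 0.3, 0.2])       # Eq. (6) without Conv
print(p6_td)
print(p6_out)
```

Dividing by the sum of the weights keeps the fused feature on the same scale as its inputs regardless of how large the learned weights become, and the constant `eps` only matters when all weights approach zero.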

Compared to the PANet architecture, BiFPN streamlines the network topology by removing single-contribution nodes and incorporates skip connections between input and output nodes at the same level. This design enables richer cross-level feature reuse and fusion without significantly increasing computational complexity. By stacking this bidirectional path module multiple times, the network facilitates deeper multi-scale feature interactions, thereby learning more discriminative feature representations. The introduction of the BiFPN structure effectively enhances the feature fusion efficiency of the model under limited computational resources. Its core advantages lie in: dynamically adjusting the contribution of feature maps at different resolutions through learnable weights, and preserving more original feature information via skip connections. This allows the network to more comprehensively capture the multi-scale contextual information of apple targets, particularly in handling complex scenarios such as branch/leaf occlusion, fruit overlap, and scale variations due to distance. This not only strengthens the semantic representation of apples by deep features but also significantly improves the model’s detection robustness and accuracy across varying environments, thereby optimizing overall performance while ensuring model efficiency.

2.2.3. Detect-SCGN Detection Head

To enhance multi-scale detection accuracy while strictly controlling model complexity, this study refactors the original decoupled detection head of YOLOv11 into a lightweight design. Although the original decoupled head improves detection performance by separating classification and regression tasks, its independent branch structure significantly increases parameter count and computational cost. Furthermore, its single-scale prediction mode fails to fully utilize the rich multi-scale feature information aggregated by the neck network, limiting the model’s adaptability to apples of different sizes.

Addressing these issues, this paper proposes a lightweight detection head based on Shared Convolution and separated Group Normalization (Detect-SCGN), whose structure is shown in Figure 6. The core of this design lies in sharing shallow convolutional weights across multiple prediction branches to drastically reduce parameters, while independently processing the Batch Normalization (BN) layer for each branch to accommodate distribution differences among features at different levels. Specifically, the P3, P4, and P5 feature layers output from BiFPN (responsible for small, medium, and large targets, respectively) first undergo independent 1 × 1 convolutions for channel adjustment and preliminary feature extraction. Considering the inherent differences in the numerical distributions of feature maps at different scales, a Layer Normalization operation is introduced in each branch to stabilize the training process and accelerate convergence.

Subsequently, the parameters from the 1 × 1 convolutions are shared with the following 3 × 3 convolutional layers to form a shared parameter module. This module consists of 3 × 3 convolutional layers and Batch Normalization (BN) layers, which are used to effectively extract discriminative features. The core feature extraction parameters are shared with the Convolutional Regularization layer (Conv-Reg) and the loss function module, as shown in Equation (7). To address scale variations of detection targets, a scaling layer is utilized to resize the feature maps transformed by Conv-Reg, further enhancing the adaptability and prediction accuracy for multi-scale object detection. The scaling layer dynamically adjusts the feature sizes, as shown in Equations (8) and (9), while reducing the model’s parameter count and computational overhead, effectively mitigating detection accuracy loss. This feature transformation process can be formally represented as:

$$g_{out} = f_{w}(g_{in}) \tag{7}$$

$$X_{scale,i} = F.\mathrm{view}(X_{initial}, \mathrm{scale\_factors} = s) \tag{8}$$

$$X_{combined} = \mathrm{concat}(X_{scale1}, X_{scale2}, X_{scale3}, \dots) \tag{9}$$

where $g_{out}$ denotes the output feature, $g_{in}$ represents the input feature, $f_{w}(\cdot)$ signifies the feature extraction operation with shared weights $w$, $X_{scale,i}$ is the scaled tensor of the $i$-th branch, $s$ is the scaling factor, $F.\mathrm{view}$ is the function for adjusting the tensor size, $X_{initial}$ indicates the initial input tensor, and $X_{combined}$ denotes the result of scaling and concatenating multiple tensors.
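The parameter saving obtained by sharing convolutional weights across prediction branches can be estimated with a short calculation, followed by a toy version of the data flow in Equations (7)–(9). Everything here is an assumption for illustration: the channel width, the use of two 3 × 3 convolutions per head, the modelling of the shared operator $f_{w}$ as a scalar multiplication, and the interpretation of the scaling layer as one learnable multiplier per branch. None of these values is taken from the paper.

```python
def conv_params(c_in, c_out, k):
    """Number of weights in a bias-free k x k convolution."""
    return c_in * c_out * k * k


C, K, LEVELS = 64, 3, 3   # hypothetical channel width, kernel size, number of branches

# Independent heads: every branch owns its two 3x3 convolutions.
independent = LEVELS * 2 * conv_params(C, C, K)
# Shared head: one pair of convolutions for all branches plus one scale per branch.
shared = 2 * conv_params(C, C, K) + LEVELS
print(independent, shared)


def shared_head(g_in, w):
    """Eq. (7): g_out = f_w(g_in); the same weight w is reused for every branch."""
    return [w * v for v in g_in]


levels = {"P3": [1.0, 2.0], "P4": [3.0], "P5": [4.0]}   # toy per-branch features
scales = {"P3": 1.0, "P4": 0.5, "P5": 2.0}              # assumed per-branch scale factors

combined = []
for name, feat in levels.items():
    out = shared_head(feat, w=2.0)
    combined.extend(scales[name] * v for v in out)       # Eq. (8), then Eq. (9)
print(combined)
```

Under these assumptions the shared design needs roughly one third of the convolutional weights of three independent branches, while the per-branch scale factors remain the only branch-specific parameters in the shared part.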

2.3. Accurate Yield Estimation Method Based on the Collaborative Local Volumetric (CLV) Approach

Accurate yield estimation is crucial for enabling refined orchard management and optimizing harvest planning. To overcome the significant errors associated with traditional 2D image-based statistical methods when dealing with occlusion and overlap, this study proposes a Collaborative Local Volumetric (CLV) yield estimation algorithm based on 3D point cloud reconstruction. This method deeply integrates the detection results from the aforementioned YOLO-WBL model with the real 3D information captured by a depth camera. It achieves precise yield estimation in occluded scenes by inferring the complete fruit volume from partially visible surfaces. The overall workflow is illustrated in Figure 7.

The method first employs the YOLO-WBL model to detect apple targets in the RGB image, obtaining their 2D bounding boxes and pixel-level center coordinates. Subsequently, the core algorithm proceeds in two stages: (1) Depth Optimization and Coordinate Mapping: The 2D detection results are aligned with the corresponding depth map using the depth camera’s intrinsic parameters, precisely mapping pixel coordinates to 3D space to acquire the real 3D point cloud of the visible portion of each apple. (2) Local Volume Reconstruction and Aggregation: Based on the irregular real-surface point cloud, the volume of individual fruits is estimated using convex hull reconstruction and sphericity correction algorithms, which are then aggregated to obtain the total yield for the area. This approach discards the assumptions of simplifying apples into regular shapes or relying on average weight, directly performing calculations based on 3D geometric information, thereby significantly enhancing estimation robustness and accuracy in complex scenes.

2.3.1. Depth Optimization and 3D Coordinate Mapping

Existing yield estimation methods based on 2D detection often assume apples are regular spheres lying on the same plane and convert pixel sizes to physical dimensions using empirical scaling factors. The core flaw of such methods lies in the lack of support from real depth information, leading to distorted 3D spatial distribution and overly idealized volume estimation models. This results in significant errors in real orchards due to occlusion, perspective distortion, and the morphological diversity of fruits.

To address these issues, the conversion factors and position coordinates are corrected by invoking the depth camera’s parameters to ensure the accuracy of model reconstruction. Coordinate optimization is divided into three steps: depth value unit conversion, image coordinate normalization, and coordinate system calculation. In the depth value conversion stage, the sensor’s electrical signals are converted into real physical distances through unit system transformation (e.g., millimeter-to-meter, meter-to-centimeter), establishing the absolute distance relationship from each pixel in the image to the camera. The specific calculation formulas are shown in Equations (10) and (11).

Z_m = depth_{mm} \times depth_{scale}

(10)

Z_{cm} = Z_m \times 100

(11)

where depth_mm is the raw depth value read from the depth map, depth_scale is the depth scale factor that converts raw depth units to meters, Z_m signifies the depth distance of the target in the camera coordinate system (m), and Z_cm denotes the same distance in centimeters (cm).

During the image coordinate normalization process, pixel positions are converted into spatial direction vectors originating from the camera’s optical center and pointing towards object points in 3D space. This is based on the principle of back-projection using the pinhole camera model. To address the misalignment between the pixel coordinate origin and the optical center origin, the coordinate origin is first shifted from the image corner to the optical center. The specific calculation formula is as follows:

d_{x} = u_{0} - c_{x}

(12)

d_{y} = v_{0} - c_{y}

(13)

where c_x, c_y represent the principal point coordinates of the image, u₀ denotes the pixel horizontal coordinate, v₀ denotes the pixel vertical coordinate, d_x represents the offset in the horizontal direction, and d_y represents the offset in the vertical direction.

After shifting the center point to the optical center, focal length normalization is performed to remove the influence of the camera’s internal geometric parameters and convert pixel coordinates into dimensionless normalized coordinates.

x_{nor} = d_{x} / f_{x}

(14)

y_{nor} = d_{y} / f_{y}

(15)

where f_x represents the focal length in the horizontal direction, f_y represents the focal length in the vertical direction, x_nor denotes the horizontal normalized coordinate, and y_nor denotes the vertical normalized coordinate.

Finally, the camera-frame coordinates are obtained by scaling the normalized coordinates by the true depth of the target:

X_{cm} = x_{nor} \times Z_{cm}

(16)

Y_{cm} = y_{nor} \times Z_{cm}

(17)

where X_cm represents the horizontal position in the camera coordinate system, and Y_cm represents the vertical position in the camera coordinate system.
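The chain from Equations (10) to (17) can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the `Intrinsics` container, the function name, and all numeric values (focal lengths, principal point, raw depth, depth scale) are assumptions chosen only to make the example concrete.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole intrinsics in pixels (hypothetical container, not from the paper)."""
    fx: float
    fy: float
    cx: float
    cy: float


def pixel_to_camera_cm(u0, v0, depth_raw, depth_scale, intr):
    """Back-project pixel (u0, v0) to camera-frame coordinates in centimeters."""
    z_m = depth_raw * depth_scale   # Eq. (10): raw depth units -> meters
    z_cm = z_m * 100.0              # Eq. (11): meters -> centimeters
    d_x = u0 - intr.cx              # Eq. (12): shift origin to the principal point
    d_y = v0 - intr.cy              # Eq. (13)
    x_nor = d_x / intr.fx           # Eq. (14): focal-length normalization
    y_nor = d_y / intr.fy           # Eq. (15)
    return x_nor * z_cm, y_nor * z_cm, z_cm   # Eqs. (16) and (17), plus depth


# Illustrative values only (assumed, not taken from the paper).
intr = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
print(pixel_to_camera_cm(u0=380.0, v0=210.0, depth_raw=800, depth_scale=0.001, intr=intr))
```

In this sketch a pixel to the right of and above the principal point maps to a positive X_cm and a negative Y_cm, following the usual image convention in which v grows downward.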

2.3.2. Point Cloud Model Reconstruction Based on Deformed Spheres

Traditional methods for 3D modeling of apples often simplify them into regular spheres or ellipsoids. While this greatly simplifies calculation, it fails to represent the diverse morphological characteristics of real fruits (such as local concavities, asymmetrical growth, etc.), leading to inherent biases in volume estimation. To overcome this limitation, this paper proposes an irregular point cloud generation method based on deformed spheres. This method, grounded in spherical coordinates, finely simulates the irregular geometric features of real apple surfaces by introducing multi-frequency deformation functions, random radius perturbations, and apple-specific morphological enhancements.

To prevent excessive clustering of point clouds at the poles when sampling uniformly on a sphere, this study employs an area-element-based uniform sampling strategy. This ensures that the distribution density of sampling points on the spherical surface is proportional to the area, thereby obtaining a more uniform initial distribution of the 3D point cloud. On this basis, a multi-frequency composite deformation function is applied to modulate the radial distance of the sampling points, simulating natural undulations and irregularities on the apple surface. The generation of spherical coordinates for sampling points, the probability density function, and the multi-frequency deformation process can be formally expressed as follows:

f (θ, ϕ) = \frac{1}{4 π} \sin θ

(18)

ϕ = 2 π ξ_{1}

(19)

u = 2 ξ_{2} - 1

(20)

θ = \arccos (u)

(21)

d_{i} = \begin{bmatrix} \sin θ_{i} \cos ϕ_{i} \\ \sin θ_{i} \sin ϕ_{i} \\ \cos θ_{i} \end{bmatrix}

(22)

where ξ represents a random number uniformly distributed in the interval [0, 1], ϕ denotes the azimuth angle, θ denotes the zenith angle, u is an intermediate variable used to generate a uniform distribution for cos θ, d_i represents the i-th deformation direction vector, θ_i denotes the zenith angle for the i-th deformation direction, and ϕ_i denotes the azimuth angle for the i-th deformation direction.
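A short sketch may help show why drawing cos θ uniformly, as in Equations (19) to (22), avoids clustering at the poles. The function name, the sample count, and the seed below are assumptions for illustration; only the sampling rule itself comes from the equations above.

```python
import math
import random


def sample_unit_directions(n, seed=0):
    """Draw n directions uniformly over the unit sphere (Eqs. 19-22)."""
    rng = random.Random(seed)
    directions = []
    for _ in range(n):
        phi = 2.0 * math.pi * rng.random()   # Eq. (19): azimuth angle
        u = 2.0 * rng.random() - 1.0         # Eq. (20): uniform value for cos(theta)
        theta = math.acos(u)                 # Eq. (21): zenith angle
        s = math.sin(theta)
        # Eq. (22): unit direction vector in Cartesian coordinates
        directions.append((s * math.cos(phi), s * math.sin(phi), math.cos(theta)))
    return directions


dirs = sample_unit_directions(20000)
# For an area-uniform distribution, z = cos(theta) is uniform on [-1, 1],
# so about half of the samples should satisfy |z| > 0.5.
polar_fraction = sum(1 for d in dirs if abs(d[2]) > 0.5) / len(dirs)
print(polar_fraction)
```

Sampling θ itself uniformly on [0, π] would instead place too many points near the poles, because the area element shrinks with sin θ there.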

The final radius of the apple model is determined based on the calculation of the deformation factors. Furthermore, to account for the morphological characteristics of apples, an apple-specific deformation function is introduced to simulate their typical morphological features. The specific implementation method is shown in Equations (23)–(28).

a_{i} = \arccos (\min (1, \max (- 1, v \cdot d_{i})))

(23)

f_{deform}(v) = 1 + \sum_{i=1}^{N_{d}} s_{i} \cos(k_{i} a_{i})

(24)

r_{final} = R \cdot (0.95 + 0.1 ξ_{5}) \cdot f_{deform}(v)

(25)

f_{top}(θ) = \begin{cases} 0.85 \cdot \exp(-5θ), & θ < π/6 \\ 1, & \text{otherwise} \end{cases}

(26)

f_{bottom}(θ) = \begin{cases} 0.85 \cdot \exp(-5(π - θ)), & θ > 5π/6 \\ 1, & \text{otherwise} \end{cases}

(27)

r_{apple} = r_{final} \cdot f_{top}(θ) \cdot f_{bottom}(θ) \cdot (1 + 0.05 \sin(2ϕ))

(28)

where v represents the position vector, a_i denotes the angle between v and d_i, f_deform represents the deformation factor, s_i denotes the strength of the i-th deformation function, k_i denotes the frequency of the i-th deformation function, r_final denotes the final radius, R represents the apple’s base radius, f_top represents the deformation factor for the depression at the apple’s top, f_bottom represents the deformation factor for the depression at the apple’s bottom, and r_apple denotes the apple’s final radius.
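The radius computation in Equations (23) to (28) can be sketched as follows. The piecewise factors are transcribed exactly as printed. The paper does not give values for N_d, s_i, or k_i, so the deformation directions, strengths, and frequencies used here are hypothetical, and v is assumed to be the unit vector for the direction (θ, ϕ).

```python
import math


def f_top(theta):
    """Eq. (26), transcribed as printed."""
    return 0.85 * math.exp(-5.0 * theta) if theta < math.pi / 6 else 1.0


def f_bottom(theta):
    """Eq. (27), transcribed as printed."""
    return 0.85 * math.exp(-5.0 * (math.pi - theta)) if theta > 5 * math.pi / 6 else 1.0


def apple_radius(theta, phi, base_radius, directions, strengths, freqs, xi5):
    """Radius r_apple along direction (theta, phi), following Eqs. (23)-(28)."""
    # Assumption: v is the unit position vector of the sampled direction.
    v = (math.sin(theta) * math.cos(phi), math.sin(theta) * math.sin(phi), math.cos(theta))
    f_deform = 1.0
    for d, s, k in zip(directions, strengths, freqs):
        cos_angle = sum(a * b for a, b in zip(v, d))
        a_i = math.acos(min(1.0, max(-1.0, cos_angle)))   # Eq. (23)
        f_deform += s * math.cos(k * a_i)                 # Eq. (24)
    r_final = base_radius * (0.95 + 0.1 * xi5) * f_deform  # Eq. (25), xi5 in [0, 1]
    # Eq. (28): top/bottom factors and a small azimuthal modulation
    return r_final * f_top(theta) * f_bottom(theta) * (1.0 + 0.05 * math.sin(2.0 * phi))


# Hypothetical deformation parameters, for illustration only.
directions = [(0.0, 0.0, 1.0), (1.0, 0.0, 0.0)]
strengths = [0.03, 0.02]
freqs = [2.0, 3.0]
print(apple_radius(theta=1.0, phi=0.7, base_radius=4.0,
                   directions=directions, strengths=strengths, freqs=freqs, xi5=0.5))
```

Multiplying each sampled unit direction by its `apple_radius` value would give one point of the irregular point cloud.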

2.3.3. Volume Estimation Based on Convex Hull Algorithm and PCA Sphericity Correction

Traditional apple volume estimation methods (e.g., those based on minimum bounding boxes or standard ellipsoid models) often introduce significant errors due to excessive simplification of the fruit’s irregular geometry. To enhance the accuracy of volume calculation, this study proposes an optimized method combining the convex hull algorithm with Principal Component Analysis (PCA)-based sphericity correction.

(1) Initial Volume Estimation Based on the Convex Hull Algorithm

The core of this method is to approximate the fruit volume by calculating the smallest convex polyhedron (convex hull) enclosing the target point cloud. The algorithm first identifies the peripheral vertices of the point cloud, performs triangulation to form a closed convex polyhedron, and obtains the total volume by summing the volumes of all tetrahedra. This computational process can be expressed as:

V_{convex} = \frac{1}{6} \sum_{i=1}^{N_{f}} \left| \left( (v_{i1} - v_{i0}) \times (v_{i2} - v_{i0}) \right) \cdot c_{i} \right|

(29)

where N_f represents the number of triangular facets, v_{i0}, v_{i1}, and v_{i2} denote the coordinates of the three vertices of the i-th triangle, c_i represents the vector from the origin to any vertex of the triangle, and V_convex denotes the convex hull volume.
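Equation (29) can be checked on a shape with a known volume. The sketch below applies it to a regular octahedron whose facets are listed by hand. In practice the facets would come from a hull routine (for example, the `simplices` attribute of `scipy.spatial.ConvexHull`), which is omitted here to keep the example dependency-free. Centering the points on the vertex mean is an added assumption, made so that the origin lies inside the hull as the formula requires.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def hull_volume(vertices, facets):
    """Volume of a closed convex triangulated surface via Eq. (29)."""
    n = len(vertices)
    centroid = tuple(sum(v[k] for v in vertices) / n for k in range(3))
    # Shift so that the origin lies inside the hull (assumption of this sketch).
    pts = [sub(v, centroid) for v in vertices]
    total = 0.0
    for i0, i1, i2 in facets:
        v0, v1, v2 = pts[i0], pts[i1], pts[i2]
        # Six times the volume of the tetrahedron (origin, v0, v1, v2); c_i = v0.
        total += abs(dot(cross(sub(v1, v0), sub(v2, v0)), v0))
    return total / 6.0


# Regular octahedron with vertices at distance 1 from its center: volume 4/3.
octa_vertices = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
octa_facets = [(i, j, k) for i in (0, 1) for j in (2, 3) for k in (4, 5)]
print(hull_volume(octa_vertices, octa_facets))
```

Because of the absolute value in Equation (29), the orientation of each facet does not matter, but the reference point must lie inside the hull for the tetrahedra to add up to the enclosed volume.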

This method offers three advantages: (1) High Fitting Accuracy: The convex hull closely conforms to the irregular shape, reducing the average volume estimation error by over 60% compared to the bounding box method. (2) Shape Adaptability: It requires no pre-defined geometric model and automatically adapts to different shapes. (3) Strong Robustness: It is insensitive to variations in point cloud density and noise, exhibits rotational invariance, and provides a stable geometric foundation for subsequent analysis.

(2) Sphericity Analysis and Volume Correction Based on PCA

Since convex hull volume systematically overestimates the true volume of non-convex objects, the error is particularly significant when fruit surfaces have concavities. To further improve accuracy, this paper introduces PCA to quantitatively analyze the shape characteristics of the point cloud. PCA projects the point cloud onto three principal component axes through orthogonal transformation. The corresponding eigenvalues reflect the dispersion of the point cloud along each principal axis.

C = \frac{1}{n - 1} \sum_{i = 1}^{n} (P_{i} - \bar{P}) {(P_{i} - \bar{P})}^{T}

(30)

sphericity = \frac{λ_{\min}}{λ_{\max}}

(31)

shape_{correction} = 0.7 + 0.3 \times sphericity

(32)

V_{corrected} = V_{convex} \times shape_{correction}

(33)

In the above equations, C represents the covariance matrix of the point cloud, P_i denotes the coordinates of the i-th point, \bar{P} is the centroid of the point cloud, n indicates the number of points, λ_max and λ_min are the maximum and minimum eigenvalues of C, respectively, sphericity is their ratio, shape_correction is the shape correction factor, and V_corrected is the corrected volume.
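A minimal sketch of Equations (30) to (33) follows. To stay dependency-free it computes the eigenvalues of the symmetric 3 × 3 covariance matrix with the standard closed-form trigonometric method; a numerical routine such as `numpy.linalg.eigvalsh` would be the more common choice. The function names and the test points are assumptions for illustration.

```python
import math


def covariance3(points):
    """Sample covariance matrix of 3D points, Eq. (30)."""
    n = len(points)
    mean = [sum(p[k] for p in points) / n for k in range(3)]
    centered = [[p[k] - mean[k] for k in range(3)] for p in points]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(3)]
            for i in range(3)]


def symmetric_eigenvalues3(a):
    """Eigenvalues of a symmetric 3x3 matrix, largest first (closed form)."""
    p1 = a[0][1] ** 2 + a[0][2] ** 2 + a[1][2] ** 2
    if p1 == 0.0:  # already diagonal
        return sorted((a[0][0], a[1][1], a[2][2]), reverse=True)
    q = (a[0][0] + a[1][1] + a[2][2]) / 3.0
    p2 = sum((a[i][i] - q) ** 2 for i in range(3)) + 2.0 * p1
    p = math.sqrt(p2 / 6.0)
    b = [[(a[i][j] - (q if i == j else 0.0)) / p for j in range(3)] for i in range(3)]
    det_b = (b[0][0] * (b[1][1] * b[2][2] - b[1][2] * b[2][1])
             - b[0][1] * (b[1][0] * b[2][2] - b[1][2] * b[2][0])
             + b[0][2] * (b[1][0] * b[2][1] - b[1][1] * b[2][0]))
    r = max(-1.0, min(1.0, det_b / 2.0))  # clamp against rounding error
    angle = math.acos(r) / 3.0
    lam_max = q + 2.0 * p * math.cos(angle)
    lam_min = q + 2.0 * p * math.cos(angle + 2.0 * math.pi / 3.0)
    return [lam_max, 3.0 * q - lam_max - lam_min, lam_min]


def corrected_volume(points, v_convex):
    """Return (V_corrected, sphericity) following Eqs. (31)-(33)."""
    lam = symmetric_eigenvalues3(covariance3(points))
    sphericity = lam[2] / lam[0]                 # Eq. (31)
    shape_correction = 0.7 + 0.3 * sphericity    # Eq. (32)
    return v_convex * shape_correction, sphericity   # Eq. (33)


# Elongated toy point set: twice as long along x as along y and z.
pts = [(2, 0, 0), (-2, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
print(corrected_volume(pts, v_convex=100.0))
```

Since the sphericity lies between 0 and 1, the correction factor in Equation (32) stays between 0.7 and 1.0, so the convex hull volume is reduced by at most 30% in this scheme.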

2.4. Model Evaluation Metrics

To comprehensively and quantitatively evaluate the overall performance of the proposed YOLO-WBL detection model in complex orchard environments, the following evaluation metrics were selected from three dimensions: detection accuracy, model efficiency, and deployment lightweightness. These include Precision (P), Recall (R), mean Average Precision (mAP@0.5), Floating Point Operations (FLOPs), average inference speed (Frames Per Second, FPS), the number of model parameters (Parameters), and the model’s disk storage size (MB). Together, these metrics constitute a multi-dimensional evaluation framework, ensuring that high detection accuracy is pursued while strictly constraining model complexity to meet the deployment requirements on embedded devices in practical orchard environments.
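For reference, the accuracy metrics rest on standard definitions: a detection counts as a true positive at mAP@0.5 when its Intersection over Union (IoU) with a ground-truth box is at least 0.5, precision is TP / (TP + FP), and recall is TP / (TP + FN). The sketch below illustrates these definitions only; the function names and the example counts are assumptions, not values from the experiments.

```python
def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision and recall from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative counts only.
print(precision_recall(tp=90, fp=10, fn=30))
print(iou((0, 0, 2, 2), (1, 1, 3, 3)) >= 0.5)
```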

2.5. Experimental Environment and Training Strategy

2.5.1. Experimental Environment

All experiments were conducted on a workstation equipped with an AMD Ryzen Threadripper PRO 3975WX 32-core processor, 384 GB of RAM, and an NVIDIA RTX A5000 GPU (24 GB VRAM). The operating system was Windows 11. The deep learning framework used was PyTorch 1.13, the programming language was Python 3.8, and parallel computing relied on CUDA 11.7 and the corresponding version of the cuDNN acceleration library.

2.5.2. Training Strategy

Model training employed the Stochastic Gradient Descent (SGD) optimizer. Key hyperparameters were set as follows: initial learning rate of 0.01, momentum of 0.937, and a weight decay coefficient of 0.0005. The training batch size was set to 16, and input images were uniformly resized to 640 × 640 pixels. The entire training process spanned 300 epochs. To enhance the model’s generalization capability to scale variations, occlusion, and background changes, the Mosaic data augmentation technique was introduced during training. This technique, by randomly stitching four training images, effectively simulates target aggregation and occlusion in complex scenes, strengthening the model’s ability to learn spatial context.
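To make the role of the three optimizer hyperparameters concrete, the sketch below performs a single SGD update on a toy parameter list. It assumes the update convention used by PyTorch's `torch.optim.SGD` without Nesterov momentum or dampening (weight decay added to the gradient, then a momentum buffer); the paper does not state which variant was used, and the toy weights and gradients are illustrative.

```python
LR = 0.01              # initial learning rate
MOMENTUM = 0.937       # momentum coefficient
WEIGHT_DECAY = 0.0005  # L2 weight decay coefficient


def sgd_step(weights, grads, velocity):
    """One SGD update with momentum and weight decay (assumed PyTorch-style rule)."""
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        g = g + WEIGHT_DECAY * w   # fold L2 weight decay into the gradient
        v = MOMENTUM * v + g       # update the momentum buffer
        new_weights.append(w - LR * v)
        new_velocity.append(v)
    return new_weights, new_velocity


# Toy example: one parameter, zero initial momentum buffer.
weights, velocity = sgd_step([1.0], [0.5], [0.0])
print(weights, velocity)
```

With a momentum of 0.937, a gradient that keeps the same sign over many steps accumulates in the buffer, so the effective step size grows well beyond the nominal learning rate.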

2.6. Performance Comparison of Different Feature Fusion Networks

To validate the effectiveness of the adopted Weighted Bidirectional Feature Pyramid Network (BiFPN) for the apple detection task in complex orchard environments, this study conducted comparative experiments with several current mainstream feature fusion architectures, including Hyper-MFM (Hypergraph-based Multi-scale Feature Fusion Module), PST (Pyramid Sparse Transformer), and PSConv (Pinwheel-shaped Convolutional Module). All comparative experiments were conducted on the same dataset with the same training strategy to ensure fairness. The performance metrics of each model are detailed in Table 1.

The experimental results indicate that BiFPN significantly enhances the model’s lightweight characteristics while maintaining high detection accuracy. Specifically, on the apple test set, BiFPN achieved an mAP@0.5 of 86.3%, while the model size was reduced by approximately 24% compared to the baseline YOLOv11n. Compared to the RepHMS structure, although BiFPN’s mAP@0.5 was slightly lower by 11.3%, its parameter count and computational load were substantially reduced, making it more suitable for edge devices with limited computing power. Compared to methods like PST and PSConv, BiFPN significantly reduced model parameters (by 35% and 21.4%, respectively) and storage footprint (by 23% and 21%, respectively) with a slight increase in computational load. Compared to structures like HAFB and Hyper, BiFPN exhibited comparable detection accuracy but held a clear advantage in lightweight metrics.

In summary, BiFPN achieves a favorable balance between detection accuracy and model efficiency, making it particularly suitable for deployment on resource-constrained edge computing platforms in orchards. It provides a viable feature fusion solution for subsequent real-time detection and yield estimation systems.

2.7. Ablation Study

To systematically validate the effectiveness and contribution of the proposed improved modules (C3K2_WT, BiFPN, and Detect-SCGN), we conducted a progressive ablation study using YOLOv11n as the baseline model. In each experiment, only one improved module was introduced sequentially at the corresponding network location. Specifically, these included: introducing the C3K2_WT module into the backbone network, integrating the BiFPN structure into the neck network, and employing the Detect-SCGN module in the detection head. The detailed performance comparison is shown in Table 2.

Based on the experimental results presented in Table 2, the following analysis is provided: First, the integration of the C3K2_WT module into the backbone network yielded a model that maintained a mean Average Precision (mAP@0.5) of 86.9% and a parameter count of 2.5 M, comparable to the baseline. Notably, this modification reduced computational load and model size by approximately 1.6% and 1.7%, respectively, while simultaneously improving the recall rate by about 1.1%. These findings indicate that the incorporation of wavelet transform within the module enables a slight reduction in model complexity and resource consumption without compromising core detection accuracy.

Second, building upon the improved backbone, the introduction of the BiFPN structure in the neck of the network yielded substantial lightweighting benefits. Compared to the original baseline, this enhancement reduced the model’s parameter count and memory footprint by approximately 23% and 24%, respectively, underscoring the efficiency of BiFPN in feature fusion and structural simplification. Although the model at this stage exhibited a slight decline in mAP@0.5 and recall (to 84.5% and 75.4%, respectively), the proposed model maintained a high level of performance when compared with existing improved models. For instance, the YOLO-AP model proposed by Huang Zhihao et al. [34] raised mAP@0.5 and recall to 92.3% and 84.6%, improvements of 4% and 3.6% over its own baseline, respectively, and the YOLO-WBL model retains 94.5% and 93.7% of those mAP@0.5 and recall values. Because of resource constraints, the model by Huang et al. could not be replicated, so this comparison relies on the values reported in [34]. In this context, the 17.4% increase in computational load can be considered a reasonable trade-off for achieving more comprehensive multi-scale feature interaction.

Finally, the introduction of the proposed Detect-SCGN detection head resulted in comprehensive performance optimization. This module recovered the performance lost through the BiFPN integration, raising precision and recall by 1.7 and 1.2 percentage points, respectively. Furthermore, it reduced the parameter count, computational load, and model size by 10.5%, 2.7%, and 5.8%, respectively, while restoring mAP@0.5 to the level reached before BiFPN integration and achieving an additional improvement of 0.3%. This strongly supports the effectiveness of the Detect-SCGN design, which uses shared convolution and separated normalization, in enhancing discriminative power while reducing redundancy.

Compared to the original YOLOv11n baseline, the final YOLO-WBL model, which integrates all improvements, maintained excellent detection accuracy (achieving 87.2% mAP@0.5) with a marginal 0.1-percentage-point increase in recall rate. Meanwhile, the parameter count, computational load, and model size were significantly reduced by 32%, 3.15%, and 28.87%, respectively. The ablation study fully demonstrates the synergistic effects of the individual modules, collectively contributing to a more lightweight and higher-accuracy solution for apple detection.

2.8. Performance Comparison with Mainstream Models

To comprehensively evaluate the performance of the proposed YOLO-WBL model in the complex task of apple detection in orchard environments, comparative experiments were conducted against several mainstream object detection models. Although non-YOLO models such as EfficientDet, NanoDet, and RT-DETR were considered, existing research [35] indicates that these models exhibit comparatively lower recall and average detection rates in complex agricultural settings, alongside higher memory consumption, when contrasted with YOLO-based architectures. Consequently, the comparative experiments in this study focused primarily on YOLO-series detection models, including YOLOv5, YOLOv8, YOLOv10, YOLOv11, YOLOv12, and YOLOv13. All experiments were conducted under identical dataset conditions and training strategies to ensure fairness and consistency in the evaluation. The detailed performance metrics are presented in Table 3.

Based on the experimental results in Table 3, the YOLO-WBL model demonstrates outstanding performance in terms of model lightweighting. Its parameter count is only 1.7 M, and its model size is merely 3.72 MB, representing significant reductions of 34.61% and 28.87%, respectively, compared to its baseline model, YOLOv11n. Even when compared to the more recent YOLOv12 model, YOLO-WBL maintains advantages of approximately 29.17% and 28.18% in parameter count and model size, respectively, fully reflecting the effectiveness of its lightweight design oriented towards edge deployment. Regarding detection accuracy, YOLO-WBL is also highly competitive. Its mean Average Precision (mAP@0.5) reaches 87.2%, surpassing contemporary advanced models such as YOLOv10 (86.8%) and YOLOv12 (86.6%). Furthermore, its Precision (P) and Recall (R) are 93.8% and 79.3%, respectively, ranking among the highest levels within the YOLO series models compared. This verifies that YOLO-WBL’s lightweight design does not come at the cost of detection performance, enabling it to effectively handle challenges posed by occlusion, lighting variations, and multi-scale targets in orchard environments.
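The relative reductions quoted above follow the usual formula (baseline − reduced) / baseline. As a small worked example, the sketch below inverts that formula to back out the baseline values implied by the reported YOLO-WBL figures. The function names are illustrative, and the implied baselines are derived here from the percentages rather than read from Table 3.

```python
def reduction_percent(baseline, reduced):
    """Relative reduction of `reduced` with respect to `baseline`, in percent."""
    return (baseline - reduced) / baseline * 100.0


def implied_baseline(reduced, reduction_pct):
    """Baseline value implied by a reduced value and its stated percentage reduction."""
    return reduced / (1.0 - reduction_pct / 100.0)


# Reported YOLO-WBL figures: 1.7 M parameters (34.61% lower), 3.72 MB (28.87% lower).
print(round(implied_baseline(1.7, 34.61), 2))    # implied baseline parameters, in M
print(round(implied_baseline(3.72, 28.87), 2))   # implied baseline model size, in MB
```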

In summary, the YOLO-WBL model achieves an exceptional balance between detection accuracy and model efficiency. Its characteristics of “high accuracy and small footprint” make it particularly suitable for deployment on edge computing devices with limited computational resources (such as the Jetson series) in orchards, providing a reliable and efficient visual solution for real-time apple detection and subsequent yield estimation in complex environments.

3. Results and Discussion

3.1. Detection Performance Analysis Under Different Lighting Intensities and Fruit Occlusion Conditions

To evaluate the practical detection performance and robustness of the YOLO-WBL model in real-world, complex orchard environments, this study selected four typical challenging scenarios for comparative testing. These scenarios included: strong sunlight exposure, tree shadow coverage, severe occlusion by branches and leaves, and a wide-angle view with dense fruit clusters. The detection results of YOLO-WBL are compared with those of the baseline model YOLOv11n in Figure 8 (where yellow arrows indicate missed detections) and Figure 9.

From the qualitative results in Figure 8, it can be observed that under complex conditions involving uneven lighting and occlusion, the YOLOv11n model exhibited noticeable missed detections and false positives. Particularly in shadowed scenes, its ability to perceive low-contrast targets was insufficient, leading to missed detections. In areas with dense fruit clusters, limited feature discriminability caused misidentifications.

In contrast, the YOLO-WBL model demonstrated superior and more stable detection performance across all scenarios. In both the strong sunlight and severe occlusion scenes, YOLO-WBL not only successfully detected all targets but also achieved higher localization accuracy. In the most challenging wide-angle dense scenario, YOLO-WBL missed only two small targets at the edges, significantly outperforming the baseline model. This indicates that by introducing the C3K2_WT, BiFPN, and Detect-SCGN modules, the model has enhanced its capability to extract and fuse multi-scale features, contextual information, and intrinsic target features, thereby improving its generalizability and robustness in complex environments. Compared with existing research results [36], the improved model demonstrates enhanced detection accuracy for fruits in complex environments, with notable reductions in missed detections, duplicate detections, and false detections. These findings substantiate the precision and robustness of the YOLO-WBL model under challenging orchard conditions.

To further investigate the underlying reasons for the performance improvement, Figure 9 presents a comparison of the feature activation heatmaps (generated using Grad-CAM) for both models on the same sample. It can be observed that the feature responses of the YOLOv11n model are relatively scattered, with some activated regions deviating from the main body of the fruit, making it susceptible to interference from the background or adjacent branches/leaves. In contrast, the feature heatmaps of the YOLO-WBL model show a more focused and accurate response distribution, with high-activation regions closely aligned with the core semantic parts of the fruit contours. This visual evidence confirms that the proposed improvements effectively guide the model to focus on more discriminative features, thereby reducing false positives and missed detections caused by environmental interference.

In summary, although a small number of missed detections still occur under conditions of extreme occlusion (indicating that there remains room for continuous model optimization), YOLO-WBL has significantly improved detection accuracy and stability in scenarios with variable lighting, occlusion, and high target density, despite a substantial reduction in model parameter count and size. These results validate the effectiveness of the proposed improvements and demonstrate that YOLO-WBL possesses strong potential for practical application, offering more reliable technical support for automated operations in complex orchard environments. It should be noted that the discussion on model robustness in this section is primarily based on visual demonstrations under specific scenarios, which falls within the realm of qualitative analysis. A more rigorous validation approach would entail systematic, condition-based evaluation using quantitative detection-level metrics, such as the rate of mAP degradation under occlusion. We acknowledge that this constitutes an important direction for future work, which will be pursued with more finely annotated data.

3.2. Yield Estimation Experiment

Comparison of Yield Estimation Using Different Detection Methods

(1) Evaluation Metric

The recognition accuracy of the weight estimation system directly impacts the precision of apple weight prediction. The experiment was conducted with different shooting setups, with the weight estimation error serving as the evaluation metric. The error is calculated as shown in Equation (34).

Y = \left| \frac{m - m_{0}}{m_{0}} \right| \times 100\%

(34)

where Y represents the error value (%), m₀ denotes the total mass obtained by manual weighing (g), and m refers to the mass estimated or detected by the algorithm (g).
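Equation (34) and its average over several fruits, the mean absolute percentage error (MAPE) used later in the text, can be written compactly as below. The function names and the example masses are assumptions for illustration, not measurements from the experiments.

```python
def relative_error_percent(m_est, m_true):
    """Eq. (34): absolute relative error in percent."""
    return abs((m_est - m_true) / m_true) * 100.0


def mape(estimates, truths):
    """Mean absolute percentage error over paired estimates and ground truths."""
    errors = [relative_error_percent(m, m0) for m, m0 in zip(estimates, truths)]
    return sum(errors) / len(errors)


# Illustrative masses in grams (assumed values).
estimated = [190.0, 210.0, 205.0]
weighed = [200.0, 200.0, 200.0]
print(mape(estimated, weighed))
```

Because the absolute value is taken per fruit, over- and underestimates do not cancel in the MAPE, even though they may partly cancel in the total yield.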

(2) Yield Estimation Results

The visual comparison of the 3D point cloud reconstruction results (as shown in Table 4) intuitively demonstrates that the point cloud model generated by the proposed CLV algorithm more closely aligns with the physical contour of the real apple in geometric morphology, exhibiting higher fidelity in shape restoration and detail preservation. In contrast, the point cloud generated by the baseline method, which performs 3D reconstruction from traditional 2D photographs, exhibits significant morphological deviations and surface distortions. Such geometric distortions are directly propagated to the subsequent volume and weight estimation steps, leading to reduced prediction accuracy. Furthermore, because the baseline calculation did not utilize the depth camera’s intrinsic parameters, the positional relationships recovered from the traditional 2D photographs deviate substantially from reality.

Quantitative results calculated using Equation (34) show that the mean absolute percentage error for single-fruit weight estimation using the proposed CLV algorithm is approximately 3.2%, significantly lower than the estimation error of about 17.5% for the traditional point-cloud-based reconstruction method. This result confirms that the proposed CLV algorithm, through the deep integration of depth information and local surface geometric reasoning, effectively overcomes occlusion and viewpoint limitations, achieving high weight estimation accuracy. In practical orchard applications, this level of accuracy is sufficient to support yield estimation needs at the industry level, providing a reliable technical basis for harvest planning and smart orchard management.

3.3. Performance Analysis Under Different Occlusion Scenarios

To systematically evaluate the robustness and estimation accuracy of the CLV algorithm under various real-world occlusion conditions, we conducted experiments in an orchard using four typical occlusion scenarios: mutual occlusion between apples, partial occlusion by branches, light occlusion by leaves, and a control scenario with no occlusion. The apple reconstruction and yield estimation results generated by the algorithm under each scenario were compared against the ground truth weights obtained after manual harvesting using a high-precision scale. The 3D coordinates for image capture were set at (−0.2 m, 0 m, 0 m). Details are shown in Table 5 and Figure 10.

Quantitative analysis indicates that the estimation accuracy of the CLV algorithm is closely related to the degree of scene occlusion. In branch occlusion scenarios, although noise from branches interferes with local point clouds, the occluded area is typically small, allowing the CLV algorithm to maintain relatively high reliability, with a mean absolute percentage error (MAPE) of approximately 4.57%. In leaf occlusion scenarios, the high similarity in color and texture between leaves and fruits introduces more significant interference in feature extraction and segmentation, leading to reduced accuracy in deriving key geometric parameters such as radius and width. As a result, the MAPE for apples under leaf occlusion increases to 7.07%, reflecting a rise in estimation error. The most challenging condition involves mutual occlusion between apples. In such cases, the target fruit is heavily obscured, and its visible surface area is substantially reduced, which dramatically increases the uncertainty in reconstructing the complete volume from partial point cloud data. Under this occlusion type, the maximum absolute relative error of apple estimation reaches 13.66%, indicating performance limitations of the algorithm under severe occlusion. Additionally, deviations in reconstructed positional information were observed; for example, the reconstruction of Apple 2 under branch occlusion exhibited a misalignment along the Y-axis.

A comprehensive analysis shows that when the occlusion rate (defined as the proportion of the target area affected) does not exceed 40%, the CLV algorithm can effectively compensate for partial information loss through its built-in volume correction mechanism, achieving a reliable yield estimation with an error rate of less than 8%. Compared to existing yield estimation methods [37] that rely on canopy area and apple count, this approach ensures high accuracy while reducing operational complexity. However, when the occlusion rate exceeds this threshold, the three-dimensional information available for reconstruction becomes severely insufficient. Specifically, the apple’s x, y, w, h, and R data can no longer be acquired accurately, which prevents precise construction of the apple’s point cloud model. Consequently, the initial volume estimate derived from the convex hull algorithm deviates significantly from the actual value, the discrepancy between the corrected volume and the actual value increases, and estimation accuracy declines markedly. This finding delineates the effective application scope of the current algorithm: operating beyond this range (under conditions of high occlusion) leads to increased error rates and inaccurate yield estimates. It also points toward future research directions, indicating the need for further optimization of volume inference and completion strategies under high-occlusion conditions to enhance the algorithm’s generalizability in extremely complex environments.

3.4. System Deployment and Performance Verification

3.4.1. Visual Detection and Yield Estimation System

To intuitively verify the practical performance of the detection and yield estimation algorithms, this study developed an integrated PC-based visual system using the PySide6 framework. Its architecture is shown in Figure 11a. The core functionalities of the system include:

Multi-model Real-time Detection: Supports loading and real-time switching between different trained models (e.g., YOLOv11n, YOLO-WBL), allowing users to dynamically adjust key parameters such as the Intersection over Union (IoU) threshold.

Interactive Results Display: Detection results are displayed in real-time on the main interface, accompanied by detailed statistics including detection categories, target count, average confidence, and the estimated individual fruit weight and cumulative yield calculated based on the CLV algorithm.

Operation Log and Path Planning Integration: The right panel of the system records user operation logs. Furthermore, integrated with the path planning algorithm previously proposed by the research group (e.g., NDT-RRT), the system can automatically generate and visualize the optimal harvesting path upon completion of detection, while also displaying the planning time, providing decision support for automated operations.
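As a point of reference for the adjustable IoU threshold listed among the functionalities above, the following minimal sketch computes IoU for two axis-aligned boxes and compares it with a threshold. The `(x1, y1, x2, y2)` box format, the function names, and the 0.5 default are illustrative assumptions; they are not taken from the system's code.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def overlaps(box_a, box_b, iou_threshold=0.5):
    """True if the two boxes overlap at or above the (hypothetical) threshold."""
    return iou(box_a, box_b) >= iou_threshold


if __name__ == "__main__":
    a = (0.0, 0.0, 2.0, 2.0)
    b = (1.0, 1.0, 3.0, 3.0)
    print(f"IoU = {iou(a, b):.3f}, overlaps at 0.5: {overlaps(a, b)}")
```

Raising the threshold makes overlapping detections less likely to be treated as the same fruit, which matters when apples grow in tight clusters.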

3.4.2. Edge Device Deployment and Real-Time Performance Verification

To validate the feasibility of the algorithms in practical orchard edge computing scenarios, the optimized YOLO-WBL model, CLV yield estimation algorithm, and the visual system were deployed on an NVIDIA Jetson Xavier NX (16 GB version) embedded platform. This device features 384 CUDA cores and 48 Tensor cores, and utilizes the TensorRT framework for model inference acceleration. The overall software and hardware deployment architecture is shown in Figure 11b.

Performance was tested on the Jetson device using 20 images for each of four conditions: varying lighting, different viewpoints, varying distances, and occlusion. The YOLO-WBL model achieved an average inference speed of 8.7 FPS, a 52.6% improvement over the 5.7 FPS average of the original YOLOv11n model, while maintaining high detection accuracy and a compact model size of 3.72 MB. The CLV algorithm required an average of only 125 ms per frame for point cloud analysis and volume calculation on the embedded platform. These results indicate that the proposed YOLO-WBL model and CLV algorithm can achieve high-precision, real-time apple detection and yield estimation in complex orchard environments on resource-constrained edge devices, which gives them significant value for practical engineering applications.
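The reported figures can be turned into a rough per-frame latency budget. The sketch below reproduces the 52.6% speed-up and estimates end-to-end throughput under one explicit assumption that the text does not confirm: that the 8.7 FPS figure covers detection only and that CLV runs sequentially after detection on every frame. All names are illustrative.

```python
# Figures reported for the Jetson Xavier NX deployment.
YOLO_WBL_FPS = 8.7
YOLOV11N_FPS = 5.7
CLV_MS_PER_FRAME = 125.0


def relative_gain_pct(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def frame_time_ms(fps):
    """Average time per frame in milliseconds for a given frame rate."""
    return 1000.0 / fps


gain = relative_gain_pct(YOLO_WBL_FPS, YOLOV11N_FPS)
detect_ms = frame_time_ms(YOLO_WBL_FPS)

# ASSUMPTION: detection and CLV are executed one after the other per frame,
# and the 8.7 FPS measurement does not already include the CLV stage.
pipeline_ms = detect_ms + CLV_MS_PER_FRAME
pipeline_fps = 1000.0 / pipeline_ms

if __name__ == "__main__":
    print(f"Speed-up over YOLOv11n: {gain:.1f}%")
    print(f"Detection time per frame: {detect_ms:.1f} ms")
    print(f"Sequential detection + CLV: {pipeline_ms:.1f} ms ({pipeline_fps:.2f} FPS)")
```

If the two stages were instead pipelined or run only on selected frames, the effective throughput would be closer to the detection rate; the sequential case is the conservative bound under the stated assumption.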

Despite the progress achieved in this study, there remains room for further exploration. The YOLO-WBL model still exhibits a small number of missed detections under extremely dense occlusion. Future work could enhance the model's ability to judge occluded targets by introducing more powerful context reasoning mechanisms or leveraging temporal information. The accuracy of the CLV algorithm decreases under severe mutual occlusion between fruits (occlusion rate above 40%). Subsequent research could explore prior models that infer complete fruit shape from partial observations, or use multi-view information fusion to improve reconstruction completeness. Furthermore, deeply integrating the visual perception system from this study with robotic precision picking actuators to build a fully closed-loop "perception-decision-execution" intelligent harvesting system is the crucial next step toward advancing orchard automation into practical application.

4. Conclusions

This study addresses two core challenges in automated apple harvesting within complex orchard environments: high-precision real-time detection and accurate yield estimation. To this end, we propose an integrated vision solution whose core components are a lightweight, high-precision detection model named YOLO-WBL and a 3D point cloud-based yield estimation algorithm named CLV. Systematic theoretical analysis, comprehensive comparative experiments, and deployment validation on actual edge devices demonstrate the effectiveness of this solution. The main conclusions are as follows:

(1) The YOLO-WBL detection model achieves an exceptional balance between accuracy and efficiency. The model combines lightweight design with performance enhancement through three key innovations: introducing the C3K2_WT module, which incorporates wavelet transform, into the backbone network to enhance multi-scale frequency-domain feature extraction capability; adopting a Weighted Bidirectional Feature Pyramid Network (BiFPN) to optimize neck feature fusion and improve multi-scale representation efficiency; and designing a lightweight detection head based on Shared Convolution and separated Group Normalization (Detect-SCGN) to improve detection accuracy while reducing the parameter count. Experimental results show that the final model achieved a precision of 93.8%, a recall of 79.3%, and an mAP@0.5 of 87.2% on the test set, while its model size was compressed to only 3.72 MB, a reduction of 28.87% compared to the baseline model. Ablation studies confirmed the effectiveness and synergy of the individual improvement modules.

(2) The proposed model is highly lightweight and well suited to edge deployment. Compared to mainstream models from YOLOv5 to YOLOv12, YOLO-WBL maintains leading detection accuracy while having the smallest parameter count and model size. When deployed on the NVIDIA Jetson Xavier NX edge device and accelerated by TensorRT, its inference speed reached 8.7 FPS, a 52.6% improvement over the original YOLOv11n, meeting the real-time requirements of complex orchard environments.

(3) The CLV yield estimation algorithm represents a leap from 2D statistical counting to 3D geometric estimation, achieving significantly improved accuracy. The algorithm deeply integrates the detection results from YOLO-WBL with point cloud data captured by a depth camera. Through precise coordinate mapping, point cloud reconstruction based on deformed spheres, and a calculation method combining convex hull volume with PCA-based sphericity correction, it overcomes the failure issues of traditional methods under occlusion. Experiments indicate that in scenarios where the occlusion rate is below 40%, the Mean Absolute Percentage Error (MAPE) for single-fruit weight estimation can be controlled within 8%. Under ideal, unoccluded conditions, the error rate is as low as 3.2%, far surpassing traditional methods based on 2D image counting or regular shape assumptions.

(4) The integrated system verifies the practicality and robustness of the proposed solution. A visualization system developed based on the PySide6 framework successfully integrated detection, yield estimation, and path planning functionalities, operating stably on the edge computing platform. Field tests confirmed the high accuracy, strong robustness, and real-time processing capability of the entire system in real, complex orchard environments.

In summary, the YOLO-WBL model and CLV algorithm proposed in this study provide an efficient, accurate, and deployable solution for apple detection and yield estimation in complex orchard environments. This work holds positive reference value for promoting the development of smart orchard management and automated harvesting technologies.

Author Contributions

Conceptualization, B.C. and X.S.; methodology, X.L.; validation, B.C. and X.L.; formal analysis, B.M. and F.D.; investigation, B.C., X.S. and B.M.; resources, B.C. and X.L.; data curation, X.S.; writing – original draft preparation, X.S. and B.C.; writing – review and editing, B.M., X.L. and F.D.; project administration, X.L.; funding acquisition, B.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of Xinjiang (Grant Number: 2025D01C89), the second batch of Tianshan Talent Cultivation Plan for Young Talent Support Project (Grant Number: 2023TSYCQNTJ0040), and the National College Student Innovation Training Program Project (Grant Number: 202513558003).

Data Availability Statement

The data that support the findings of this study are available within the manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Wang, C.L.; Pan, W.Y.; Zou, T.L.; Li, C.J.; Han, Q.Y.; Wang, H.M.; Yang, J.; Zou, X.J. A Review of Perception Technologies for Berry Fruit-Picking Robots: Advantages, Disadvantages, Challenges, and Prospects. Agriculture 2024, 14, 1346.
2. Hua, W.J.; Zhang, Z.; Zhang, W.Q.; Liu, X.H.; Hu, C.; He, Y.C.; Mhamed, M.; Li, X.L.; Dong, H.X.; Saha, C.K.; et al. Key technologies in apple harvesting robot for standardized orchards: A comprehensive review of innovations, challenges, and future directions. Comput. Electron. Agric. 2025, 235, 110343.
3. Ji, W.; Ding, Y.; Xu, B.; Chen, G.Y.; Zhao, D.A. Adaptive Variable Parameter Impedance Control for Apple Harvesting Robot Compliant Picking. Complexity 2020, 2020, 4812657.
4. Zhang, X.Q.; Du, L.Y.; Zhu, Q.R. The potential of substituting labors with capitals in apple production under the constraint of increasing labor cost. Res. Agric. Mod. 2020, 41, 484–492.
5. Li, J.; Karkee, M.; Zhang, Q.; Xiao, K.H.; Feng, T. Characterizing apple picking patterns for robotic harvesting. Comput. Electron. Agric. 2016, 127, 633–640.
6. Hu, G.R.; Zhou, J.G.; Chen, Q.Y.; Luo, T.Y.; Li, P.H.; Chen, Y.; Zhang, S.; Chen, J. Effects of different picking patterns and sequences on the vibration of apples on the same branch. Biosyst. Eng. 2024, 237, 26–37.
7. Xin, Q.; Luo, Q.; Zhu, H. Key Issues and Countermeasures of Machine Vision for Fruit and Vegetable Picking Robot. Adv. Transdiscipl. Eng. 2024, 46, 69–78.
8. Li, X.; Wang, W.H.; Wu, L.J.; Chen, S.; Hu, X.L.; Li, J.; Tang, J.H.; Yang, J. Generalized Focal Loss: Learning Qualified and Distributed Bounding Boxes for Dense Object Detection. In Proceedings of the 34th Conference on Neural Information Processing Systems (NeurIPS), Electr. Network, Virtual, 6–12 December 2020.
9. Chen, Y.; Chen, B.B.; Li, H.T. Object Identification and Location Used by the Fruit and Vegetable Picking Robot Based on Human-decision Making. In Proceedings of the 10th International Congress on Image and Signal Processing, BioMedical Engineering and Informatics (CISP-BMEI), Shanghai, China, 14–16 October 2017.
10. Wu, Y.; Wan, X.; Zhang, J.; Yang, Y. Research on fruit picking recognition based on deep learning. In Proceedings of the Optoelectronic Imaging and Multimedia Technology X, Beijing, China, 15–16 October 2023; SPIE: Bellingham, WA, USA, 2023.
11. Chu, P.Y.; Li, Z.J.; Lammers, K.; Lu, R.F.; Liu, X.M. Deep learning-based apple detection using a suppression mask R-CNN. Pattern Recognit. Lett. 2021, 147, 206–211.
12. Nan, Y.L.; Zhang, H.C.; Zeng, Y.; Zheng, J.Q.; Ge, Y.F. Intelligent detection of Multi-Class pitaya fruits in target picking row based on WGB-YOLO network. Comput. Electron. Agric. 2023, 208, 107780.
13. Mo, S.T.; Dong, T.; Zhao, X.X.; Kan, J.M. Discriminant model of banana fruit maturity based on genetic algorithm and SVM. J. Fruit Sci. 2022, 39, 2418–2427.
14. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149.
15. Turan, M.; Almalioglu, Y.; Araujo, H.; Konukoglu, E.; Sitti, M. Deep EndoVO: A recurrent convolutional neural network (RCNN) based visual odometry approach for endoscopic capsule robots. Neurocomputing 2018, 275, 1861–1870.
16. Wang, Y.T.; Xue, J.R. Lightweight object detection method for Lingwu long jujube images based on improved SSD. Trans. Chin. Soc. Agric. Eng. 2021, 37, 173–182.
17. Li, H.P.; Li, C.Y.; Li, G.B.; Chen, L.X. A real-time table grape detection method based on improved YOLOv4-tiny network in complex background. Biosyst. Eng. 2021, 212, 347–359.
18. Wang, A.C.; Qian, W.H.; Li, A.; Xu, Y.W.; Xie, Y.W.; Zhang, L.Y. NVW-YOLOv8s: An improved YOLOv8s network for real-time detection and segmentation of tomato fruits at different ripeness stages. Comput. Electron. Agric. 2024, 219, 108833.
19. Tan, H.S.; Ma, W.H.; Tian, Y.; Zhang, Q.; Li, M.Y. Improved YOLOv8n object detection of fragrant pears. Trans. Chin. Soc. Agric. Eng. 2024, 40, 178–185.
20. Wu, H.T.; Mo, X.T.; Wen, S.J.; Wu, K.L.; Ye, Y.; Wang, Y.M.; Zhang, Y.H. DNE-YOLO: A method for apple fruit detection in Diverse Natural Environments. J. King Saud Univ. Comput. Inf. Sci. 2024, 36, 102220.
21. Liu, S.; Xiao, Y.T.; Zhang, W.P.; Li, F.Z.; Wang, H.C. Apple fruit detection method based on generative adversarial networks under occlusion conditions. Hubei Agric. Sci. 2024, 63, 47–53.
22. Zhou, H.; Yang, J.; Zhao, X.F. Lightweight Improvement of YOLOv8 for Apple Detection in Complex Orchard Environments. Sci. Technol. Eng. 2025, 25, 2274–2283.
23. Liu, Z.F.; Abeyrathna, R.; Sampurno, R.M.; Nakaguchi, V.M.; Ahamed, T. Faster-YOLO-AP: A lightweight apple detection algorithm based on improved YOLOv8 with a new efficient PDWConv in orchard. Comput. Electron. Agric. 2024, 223, 109118.
24. Shi, B.X.; Hou, C.K.; Xia, X.L.; Hu, Y.H.; Yang, H. Improved young fruiting apples target recognition method based on YOLOv7 model. Neurocomputing 2025, 623, 129186.
25. Wang, H.L.; Liu, D.C.; Ye, Q.Z.; Liang, Z.W.; Zhang, Y.D. Yield Estimation of Orah Mandarin Based on UAV Remote Sensing Images. South China Fruits 2025. Epub ahead of print.
26. Xu, J.Y.; Du, X.; Li, Q.Z.; Dong, T.F.; Zhang, Y.; Wang, H.Y.; Xiao, J.; Zhang, J.S. Advances in Remote Sensing Estimation of Crop Yield Based on Hybrid Modeling. Chin. J. Agrometeorol. 2025, 46, 1472–1486.
27. Bao, L.; Wang, M.T.; Liu, J.C.; Wen, B.; Ming, Y. Estimation method of wheat yield based on convolution neural network. Acta Agric. Zhejiangensis 2020, 32, 2244–2252.
28. Huang, Z.J.; Lee, W.S.; Yang, P.; Ampatzidis, Y.; Shinsuke, A.; Peres, N.A. Advanced canopy size estimation in strawberry production: A machine learning approach using YOLOv11 and SAM. Comput. Electron. Agric. 2025, 236, 110501.
29. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet Convolutions for Large Receptive Fields. In Proceedings of the 18th European Conference on Computer Vision (ECCV), Milan, Italy, 29 September–4 October 2024; pp. 363–380.
30. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790.
31. Tian, Z.; Shen, C.; Chen, H.; He, T. FCOS: Fully convolutional one-stage object detection. arXiv 2019, arXiv:1904.01355.
32. Yu, T.W.; Zheng, E.R.; Shen, J.G. Optical remote sensing image scene classification based on multi-level cross-layer bilinear fusion. Acta Photonica Sin. 2022, 51, 0210007.
33. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768.
34. Huang, Z.H.; Lu, C.F.; Cui, Y.R.; Hu, R.H. YOLO-AP: A Lightweight Apple Fruit Detection Algorithm Based on Improved YOLO11n. J. Agric. Sci. Technol. 2025, 27, 118–133.
35. Zhang, J.W.; Wang, J.Z.; Yang, S.L.; Huang, T. Research on Infrared Image Recognition of Hard-to-Identify Coal Gangue with Liquid Intervention Based on HDB-YOLO Algorithm. J. Min. Saf. Eng. 2026; in press. Available online: https://link.cnki.net/urlid/32.1760.td.20260204.0909.002 (accessed on 10 February 2026).
36. Li, J.A.; Zheng, J.Y.; Han, Y.; Wang, X.Y.; Xue, X.M.; Zhang, X. Research on apple target detection method in complex environments based on Improved YOLOv8n. J. Chin. Agric. Mech. 2026; in press. Available online: https://link.cnki.net/urlid/32.1837.S.20260123.1850.004 (accessed on 5 February 2026).
37. Li, H.B.; Shi, Y.; Zhang, B.H.; Li, D.D. ACNN-LSTM prediction method for apple tree yield based on canopy area and number of apple. China Agric. Inform. 2022, 34, 11–18.

Figure 1. Sample images from the dataset.

Figure 2. Examples of Image Data Augmentation. (a–e) are the augmented images of the collected mature red apple images after applying random rotation (±30°), brightness adjustment (scale factor 0.8–1.2), saturation variation (scale factor 0.7–1.3), and Gaussian noise addition (σ = 0.01).

Figure 3. Architecture of the proposed YOLO-WBL model.

Figure 4. Convolutional processing procedure of WTConv.

Figure 5. Network architectures of PANet and BiFPN. (a) PANet structure in YOLOv11n; (b) BiFPN structure improved from PANet.

Figure 6. The Detect-SCGN Detection Head.

Figure 7. Apple yield estimation process.

Figure 8. Visual comparison of detection performance under different scenarios.

Figure 9. Visual comparison of heatmaps before and after model improvements.

Figure 10. Visual results of yield estimation under different occlusion scenarios.

Figure 11. Edge Device Deployment.

Table 1. Performance comparison results of different feature fusion networks.

Model	mAP@0.5	Params/M	FLOPs/G	Weights/MB
YOLOv11n	0.869	2.5	6.4	5.23
YOLOv11-HAFB	0.871	4.1	8.6	8.24
YOLOv11-PST	0.859	2.8	6.1	5.13
YOLOv11-Hyper	0.872	2.7	6.9	5.38
YOLOv11-PSConv	0.869	2.4	6.3	5.02
YOLOv11-Rephms	0.873	2.8	7.0	5.78
YOLOv11-BiFPN	0.863	1.8	7.4	3.95

Table 2. Ablation study results.

Model	C3K2_WT	BiFPN	Detect-SCGN	Params/M	FLOPs/G	P	R	mAP@0.5	Weights/MB
YOLOv11n	×	×	×	2.5	6.4	0.938	0.785	0.869	5.23
	√	×	×	2.5	6.3	0.938	0.796	0.869	5.14
	√	√	×	1.9	7.4	0.922	0.792	0.863	3.95
	√	√	√	1.7	7.2	0.938	0.793	0.872	3.72
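To make the trade-offs in the ablation table easier to read, the short sketch below computes the relative change of the full model (last row) against the YOLOv11n baseline (first row). The dictionary keys and helper name are illustrative; the numbers are copied from the table.

```python
# First and last rows of the ablation table (Table 2).
baseline = {"params_m": 2.5, "flops_g": 6.4, "weights_mb": 5.23, "map50": 0.869}
yolo_wbl = {"params_m": 1.7, "flops_g": 7.2, "weights_mb": 3.72, "map50": 0.872}


def pct_change(old, new):
    """Signed relative change from `old` to `new`, in percent."""
    return (new - old) / old * 100.0


changes = {key: round(pct_change(baseline[key], yolo_wbl[key]), 2) for key in baseline}

if __name__ == "__main__":
    for key, value in changes.items():
        print(f"{key:>10}: {value:+.2f}%")
```

The weight-file reduction reproduces the 28.87% figure quoted in the text. The same calculation shows that FLOPs rise by 12.5% relative to the baseline, so the model is lighter in parameters and storage rather than in arithmetic cost.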

Table 3. Performance comparison results with mainstream models.

Model	Params/M	FLOPs/G	P	R	mAP@0.5	Weights/MB
YOLOv5	2.2	5.9	0.922	0.787	0.863	4.44
YOLOv6	4.2	11.0	0.93	0.796	0.861	8.17
YOLOv8	2.7	2.7	0.936	0.783	0.868	5.42
YOLOv10	2.7	8.4	0.907	0.786	0.86	5.49
YOLOv11	2.6	6.4	0.938	0.785	0.869	5.23
YOLOv12	2.5	6.0	0.922	0.796	0.868	5.19
YOLOv13	2.4	6.3	0.887	0.766	0.836	5.18
YOLO-WBL	1.7	7.2	0.938	0.793	0.872	3.72

Table 4. Comparison of yield estimation results using different detection methods. The rows show ripe apples under direct light estimated with the proposed CLV algorithm (top) and with the baseline method (bottom); the weight values follow the numbering of the fruits in the images.

Detection Method	Detection Result Image	3D Point Cloud	Apple Weight Calculation (Single Fruit)	Error Value
Proposed Method			236.7 g 274.2 g 254.3 g 243.5 g 239.3 g 264.5 g 274.1 g 256.4 g 294.3 g	4.2%
Traditional Method			224.5 g 263.7 g 272.2 g 257.1 g 230.4 g 260.1 g 280.4 g 263.9 g 279.2 g	16.2%

Table 5. Yield estimation results under different occlusion scenarios.

Scenario	Estimated Weight (g)	Measured Weight (g)	Absolute Relative Error (%)
Fruit Occlusion	180.7	188.3	4.04
	221.2	212.7	4.00
	143.1	125.9	13.66
	122.7	127.0	3.39
	155.4	148.3	4.79
	188.3	182.7	3.06
	176.8	183.7	3.76
Branch Occlusion	191.5	197.6	3.09
	303.7	291.2	4.29
	171.6	179.8	4.56
	152.4	140.2	8.70
	144.6	147.9	2.23
Leaf Occlusion	160.4	152.4	5.25
	255.2	274.6	7.07
	329.0	318.1	3.43
	152.6	160.2	4.74
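The per-fruit errors in Table 5 can be aggregated into a mean absolute percentage error (MAPE). The sketch below recomputes each absolute relative error from the estimated and measured weights and averages them overall and per scenario. The variable and function names are illustrative, and because the table does not list the occlusion rate of each sample, this is only a check of the aggregate figure.

```python
from collections import defaultdict

# (scenario, estimated weight in g, measured weight in g), copied from Table 5.
samples = [
    ("fruit", 180.7, 188.3), ("fruit", 221.2, 212.7), ("fruit", 143.1, 125.9),
    ("fruit", 122.7, 127.0), ("fruit", 155.4, 148.3), ("fruit", 188.3, 182.7),
    ("fruit", 176.8, 183.7),
    ("branch", 191.5, 197.6), ("branch", 303.7, 291.2), ("branch", 171.6, 179.8),
    ("branch", 152.4, 140.2), ("branch", 144.6, 147.9),
    ("leaf", 160.4, 152.4), ("leaf", 255.2, 274.6), ("leaf", 329.0, 318.1),
    ("leaf", 152.6, 160.2),
]


def abs_pct_error(estimated, measured):
    """Absolute relative error of one estimate, in percent."""
    return abs(estimated - measured) / measured * 100.0


errors = [abs_pct_error(est, meas) for _, est, meas in samples]
overall_mape = sum(errors) / len(errors)

# Group the per-sample errors by occlusion scenario.
grouped = defaultdict(list)
for (scenario, _, _), err in zip(samples, errors):
    grouped[scenario].append(err)
scenario_mape = {name: sum(vals) / len(vals) for name, vals in grouped.items()}

if __name__ == "__main__":
    print(f"Overall MAPE: {overall_mape:.2f}%")
    for name, value in scenario_mape.items():
        print(f"{name:>6} occlusion MAPE: {value:.2f}%")
    print(f"Largest single-fruit error: {max(errors):.2f}%")
```

On these 16 samples the overall MAPE is about 5.0%, inside the 8% bound, although two individual fruits (13.66% and 8.70%) exceed 8% on their own.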

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

