2.1. Dataset, Acquisition Protocol, and Preprocessing
2.1.1. Experimental Site and Camera Configuration
The data acquisition campaign was conducted in a dwarf-rootstock, high-density orchard at the First Station of the Taihang Mountain Road, Hebei Agricultural University, Baoding, China. At the macro scale, orchard geometry is characterised by a row spacing of approximately 4 m, a plant spacing of 1.5 m, and a mean tree height of 3.5 m. At the canopy scale, the scene is semi-structured: lateral branches extend into the inter-row volume, embedding fruit targets within a multi-layer branch–foliage volume rather than an idealised planar fruiting wall. Illumination variability is introduced through three acquisition regimes—front lighting, lateral lighting, and strong backlighting—which produce fragmented shadows and specular highlights and therefore stress RGB-based feature learning under field conditions.
Depth-aware acquisition uses an Intel RealSense D455 camera (Intel Corporation, Santa Clara, CA, USA) calibrated over 500–1500 mm.
Figure 2 summarises the calibration outcome as box plots of measurement error (mm) versus range, with the full span partitioned into 100 mm distance bins. The profile shows that errors are tightly scattered around the zero-error baseline with comparatively low dispersion within the 700–1100 mm band, whereas the outer segments (approximately 500–700 mm and 1100–1500 mm) exhibit larger spread and a systematic shift toward positive error, especially at the longer end of the calibrated interval. Accordingly, image capture was constrained to a camera–canopy distance of 0.7–1.1 m so that routine depth-guided processing stays within the sensor’s most reliable operating interval implied by this calibration.
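The binning logic behind this calibration summary can be sketched as follows; the helper names and the sample values are illustrative, not the campaign's actual measurements.

```python
# Illustrative sketch of the calibration analysis: depth measurement errors
# (mm) are grouped into 100 mm range bins over the 500-1500 mm span, and a
# per-bin median and spread are computed (sample data are hypothetical).
from statistics import median

def bin_errors(samples, lo=500, hi=1500, width=100):
    """samples: iterable of (range_mm, error_mm); returns {bin_start: [errors]}."""
    bins = {b: [] for b in range(lo, hi, width)}
    for rng, err in samples:
        if lo <= rng < hi:
            b = lo + ((rng - lo) // width) * width
            bins[b].append(err)
    return bins

def bin_summary(bins):
    """Median error and maximum absolute deviation for each non-empty bin."""
    return {b: (median(e), max(abs(x - median(e)) for x in e))
            for b, e in bins.items() if e}
```

Applied to the full error cloud, a summary of this form exposes exactly the pattern described above: low dispersion near 700–1100 mm and growing positive bias toward the edges of the calibrated span.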
2.1.2. Occlusion Taxonomy and Background Purification
The detection task is formulated beyond pure 2D localisation: each target is assigned to one of three occlusion-centric classes—No Occlusion (NO), Soft Occlusion (SO), or Hard Occlusion (HO)—according to the physical rigidity of the occluding object as the primary criterion for robot intervention planning. NO indicates a fully visible target. SO indicates occlusion mainly by deformable foliage, where a push-and-grasp strategy may be feasible. HO indicates occlusion by rigid branches, support wires, or densely clustered fruit, where forced intervention is unsafe, and avoidance planning is preferred.
During data collection, the camera field of view may include distant fruit from adjacent rows when foreground canopy gaps appear. To suppress harvest-irrelevant distant interference at the data source, a depth-guided purification strategy is applied under the calibrated depth reliability regime described in
Section 2.1.1. Specifically, regions with depth Z > 2.0 m are projected onto the RGB plane and replaced via neighbourhood green-foliage texture synthesis masking, restricting learning to the manipulator-relevant canopy layer.
Figure 3 compares representative raw regions with their masked counterparts under this procedure.
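The purification rule can be sketched as below. As a stand-in for the neighbourhood green-foliage texture synthesis, this toy version writes a fixed foliage-green colour; the constant and function names are illustrative assumptions.

```python
# Minimal sketch of depth-guided background purification: pixels whose
# aligned depth exceeds 2.0 m are replaced. A fixed foliage-green RGB value
# stands in for the paper's texture-synthesis masking (an assumption made
# for illustration only).
Z_MAX_MM = 2000             # harvest-relevant depth threshold (Z > 2.0 m masked)
FOLIAGE_RGB = (34, 85, 34)  # hypothetical placeholder colour

def purify(rgb, depth):
    """rgb: H x W list of (r, g, b) tuples; depth: H x W list of depths in mm."""
    out = []
    for row_rgb, row_z in zip(rgb, depth):
        out.append([FOLIAGE_RGB if z > Z_MAX_MM else px
                    for px, z in zip(row_rgb, row_z)])
    return out
```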
2.1.3. Dataset Summary and Size Distribution
The full image corpus comprises 1635 RGB images and 12,705 annotated instances. The dataset is split in a fixed 7:1:2 ratio for training, validation, and testing, yielding 1144 training images, 164 validation images, and 327 test images. Instance-level occlusion proportions (label-derived) and image counts under the three illumination regimes are summarised in
Table 1, together with the sensing operating range and the depth threshold used for background purification (Z > 2.0 m).
To characterise object-scale variation in the image plane, we summarise the distribution of instance-level bounding-box widths computed from YOLO-normalised widths mapped to pixel units. We use 60 equal-width bins over 2–150 px, including all instances across all splits.
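The width-distribution computation can be sketched as follows; the 640 px image width is the training resolution, and the function name is illustrative.

```python
# Sketch of the width-distribution summary: YOLO-normalised box widths are
# mapped to pixel units and counted in 60 equal-width bins over 2-150 px.
def width_histogram(norm_widths, img_w=640, lo=2.0, hi=150.0, n_bins=60):
    counts = [0] * n_bins
    step = (hi - lo) / n_bins
    for w in norm_widths:
        px = w * img_w                      # normalised width -> pixels
        if lo <= px < hi:
            counts[int((px - lo) // step)] += 1
    return counts
```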
All compared models are trained and evaluated under an identical fixed partition (7:1:2) with mutually exclusive image sets; hyperparameters are tuned using the validation set, and the test set is used only for final reporting. Although k-fold cross-validation can estimate variability across partitions, it was not adopted for the main experiments due to the high computational cost of repeated full detector training; we instead prioritise a transparent benchmark split for direct reproducibility and fair comparison.
Figure 4 shows that widths are strongly concentrated at small pixel extents, while a minority of larger boxes reflects closer viewpoints or larger apparent object scale; this motivates multi-scale feature representation in the detection backbone. Code and annotations will be released after publication to support reproduction.
2.2. Detection Model Architecture
In semi-structured orchards, occlusion and lighting variation degrade RGB features. The detector must run on mobile-class hardware while discriminating No Occlusion (NO), Soft Occlusion (SO), and Hard Occlusion (HO). Relative to standard lightweight single-stage designs, the main difficulty is aligning fusion operators with cues relevant to rigidity-aware class decisions.
We take YOLOv11 as the baseline. For three-class occlusion reasoning, its default neck and units show several limitations. PANet-style fusion concatenates multi-scale features with limited control over cross-scale contribution; during upsampling, deep background responses can dominate shallow detail. The native C3k2 stack relies heavily on fixed 3 × 3 convolutions with a relatively small effective receptive field for elongated branches. Feature extraction and upsampling are also carried out without an explicit separation between illumination variation and edge structure.
YOLOv11-CBMES adds a frequency-domain branch and multi-scale reconstruction around the baseline pipeline, as shown in
Figure 5.
A weighted Bidirectional Feature Pyramid Network (BiFPN) replaces the original neck to learn cross-scale fusion weights.
Contrast-Driven Feature Aggregation (CDFA) is placed at P5 to separate low-frequency illumination-dominated energy from high-frequency structure using a Haar wavelet split.
CSP-Encapsulated Multi-Scale Convolutional Blocks (CM and CSP-MSCB) replace C3k2 at selected BiFPN fusion nodes, and Efficient Upsampling Convolutional Blocks (EUCB) replace conventional upsampling along the top-down path.
A zero-parameter Shift-Context (SC) module is inserted at the channel-shuffle stages of CM and EUCB, yielding CMS and EU_SC. SC mixes local neighbourhoods by cyclic shifting without extra parameters or FLOPs.
In
Figure 5, CMS and EU_SC denote the configurations after SC is integrated.
While
Figure 5 shows the static wiring,
Figure 6 summarises the end-to-end data flow. Depth-guided purification suppresses distant background interference. The backbone builds spatial features; CDFA decomposes P5 features to emphasise edge-related responses relative to low-frequency components. Weighted BiFPN and the spatial reconstruction blocks then fuse and refine these representations. The heads output NO, SO, and HO predictions intended as inputs to downstream grasp-or-avoid policies; closed-loop safety is not evaluated here.
2.2.1. Bidirectional Weighted Feature Pyramid Network (BiFPN)
As previously described, the original PANet employs an undifferentiated concatenation mechanism for multi-scale feature fusion, resulting in equal-weight superimposition of features across different hierarchical levels without selective filtering, thereby inducing cross-scale semantic alignment bias. To overcome this limitation, this study introduces the Weighted Bidirectional Feature Pyramid Network (BiFPN) to replace PANet, dynamically adjusting the fusion ratio of features across different scales through learnable weight vectors [25]. For a given multi-scale input feature set $\{F_i\}_{i=1}^{n}$, the fused output feature $F_{\mathrm{out}}$ is computed according to Equation (1), where $F_i$ denotes the $i$-th input feature map participating in the current fusion node, and $n$ represents the number of input branches at that node (in the proposed architecture, $n = 2$ for top-down nodes and $n = 3$ for bottom-up nodes):

$$F_{\mathrm{out}} = \sum_{i=1}^{n} \frac{w_i}{\epsilon + \sum_{j=1}^{n} w_j}\, F_i \tag{1}$$

where $w_i$ denotes the learnable scalar weight, constrained to non-negative values via ReLU activation to ensure physical interpretability of feature contributions, and $\epsilon$ is a numerical stability term. This fast normalisation strategy avoids the exponential computational overhead associated with Softmax, reducing the latency of a single fusion operation from 2.3 ms (PANet) to 1.7 ms, measured on an NVIDIA GeForce RTX 4070 (NVIDIA Corporation, Santa Clara, CA, USA). Through end-to-end training, the weight distribution self-adaptively adjusts during backpropagation: when processing shallow-layer small-target features, the weights of deep-layer background semantics are automatically attenuated, thereby mitigating the semantic dilution problem during cross-scale fusion [25].
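The fast normalised fusion of Equation (1) can be sketched in a few lines; features are flat lists of floats here for illustration, rather than tensors.

```python
# Sketch of fast normalised fusion (Equation (1)): each input feature is
# scaled by ReLU(w_i) / (eps + sum_j ReLU(w_j)) and the results are summed.
def fast_normalized_fusion(features, weights, eps=1e-4):
    w = [max(0.0, wi) for wi in weights]   # ReLU keeps weights non-negative
    denom = eps + sum(w)
    fused = [0.0] * len(features[0])
    for fi, wi in zip(features, w):
        scale = wi / denom
        for k, v in enumerate(fi):
            fused[k] += scale * v
    return fused
```

With equal weights this reduces to a plain average; a negative raw weight is clipped to zero and its branch contributes nothing, which is how the network learns to attenuate unhelpful scales.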
At the network topology level, computationally redundant single-input nodes are removed, and a bidirectional feature pyramid structure comprising both top-down and bottom-up pathways is constructed. The top-down pathway is primarily responsible for injecting the high-level semantic abstractions enhanced by the CDFA module into shallow network layers, leveraging the category-discriminative information at the P5 layer to suppress the responses of texture-similar non-target regions in shallow feature maps. Taking the P4 layer as an example, the intermediate feature $P_4^{\mathrm{td}}$ is generated as expressed in Equation (2). This process propagates the high-level semantic judgement of “fruit versus foliage” from P5 downward, providing semantic priors to shallow layers to suppress non-target regions with similar textures. The bottom-up pathway subsequently propagates high-frequency geometric edge information from shallow layers back to deep layers to correct localisation bias. The final output $P_4^{\mathrm{out}}$ aggregates three signal streams—the original input, the intermediate-state feature, and the feature from the lower layer—as expressed in Equation (3):

$$P_4^{\mathrm{td}} = \mathrm{Conv}\!\left(\frac{w_1\, P_4 + w_2\, \mathrm{Resize}(P_5)}{w_1 + w_2 + \epsilon}\right) \tag{2}$$

$$P_4^{\mathrm{out}} = \mathrm{Conv}\!\left(\frac{w_1'\, P_4 + w_2'\, P_4^{\mathrm{td}} + w_3'\, \mathrm{Resize}(P_3^{\mathrm{out}})}{w_1' + w_2' + w_3' + \epsilon}\right) \tag{3}$$

In Equation (2), $P_4$ is the backbone feature at level 4 and $P_5$ is the backbone feature at level 5. Because $P_5$ is half the spatial resolution of $P_4$, Resize(·) applies 2× nearest-neighbour upsampling to align $P_5$ to $P_4$. Conv(·) denotes a 3 × 3 depthwise separable convolution for post-fusion refinement. $w_1$ and $w_2$ are learnable scalars for this node only. $P_4^{\mathrm{td}}$ is the top-down intermediate feature at P4. In Equation (3), $P_4^{\mathrm{out}}$ is the bottom-up fused output at P4 and $P_3^{\mathrm{out}}$ is the bottom-up output at P3. Because $P_3^{\mathrm{out}}$ has twice the resolution of $P_4$, Resize(·) uses strided convolution (stride 2) to align $P_3^{\mathrm{out}}$ to $P_4$. $w_1'$, $w_2'$, and $w_3'$ are learnable scalars for this bottom-up node and are not shared with $w_1$ and $w_2$ in Equation (2). $P_4^{\mathrm{out}}$ combines deep semantics ($P_4^{\mathrm{td}}$), the original backbone feature ($P_4$), and shallower geometric detail ($P_3^{\mathrm{out}}$).
Learnable fusion reweights existing scales but cannot create edge–illumination separation when backbone features are already strongly aliased, which motivates CDFA.
2.2.2. Contrast-Driven Feature Aggregation (CDFA) Module
Backbone features can mix illumination drift with boundary cues when processing is confined to local spatial convolution. Standard kernels fit local appearance but do not explicitly separate low-frequency background variation from high-frequency discontinuities along object contours [
26]. We therefore embed a Contrast-Driven Feature Aggregation (CDFA) module at the P5 terminus of the backbone, extending processing from the spatial domain to a wavelet-split frequency path. CDFA uses a Haar wavelet decomposition together with a contrast-driven enhancement pathway inspired by WFEN [
27] and ConDSeg [
28]. Convolution without such structure tends to blur fine edges. As illustrated in
Figure 7, CDFA proceeds in three stages: wavelet frequency-domain decoupling, dual-branch cascaded attention enhancement, and spatial reconstruction.
Haar wavelet-based explicit decomposition. In contrast to implicit feature learning, the input feature $F \in \mathbb{R}^{C \times H \times W}$ is projected onto the frequency domain. The Haar wavelet transform losslessly decomposes the feature map into four sub-bands: a low-frequency approximation component $F_{LL}$ and three high-frequency detail components in the horizontal, vertical, and diagonal directions, denoted $F_{LH}$, $F_{HL}$, and $F_{HH}$, respectively. The three high-frequency components are concatenated along the channel dimension to construct the high-frequency edge feature $F_{\mathrm{high}}$, which is highly sensitive to the edge gradients of fruit regions. The low-frequency component serves as the low-frequency illumination feature $F_{\mathrm{low}}$, preserving smooth background and illumination distribution information. The mathematical formulation is given in Equations (4) and (5), where Cat(·) denotes concatenation along the channel dimension. Simultaneously, the original input $F$ is mapped to an initial value vector $V_0$ via a linear projection layer followed by an Unfold operation, serving as the feature substrate for subsequent cascaded aggregation:

$$\{F_{LL}, F_{LH}, F_{HL}, F_{HH}\} = \mathrm{HWT}(F) \tag{4}$$

$$F_{\mathrm{low}} = F_{LL}, \qquad F_{\mathrm{high}} = \mathrm{Cat}(F_{LH}, F_{HL}, F_{HH}) \tag{5}$$

The Haar wavelet transform convolves the input feature map along both row and column directions using stride-2 low-pass (averaging) and high-pass (differencing) filter kernels, producing four sub-bands at half the spatial resolution. Specifically, $F_{LL}$ is the dual low-frequency component encoding smooth illumination and chrominance distributions; $F_{LH}$ (row low-pass, column high-pass) responds to vertical edge gradients; $F_{HL}$ (row high-pass, column low-pass) responds to horizontal edge gradients; and $F_{HH}$ (dual high-frequency) responds to diagonal texture variations.
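The stride-2 Haar split described above can be sketched per channel as follows; the scaling convention (plain averages and differences of each 2 × 2 block) is an illustrative choice, as implementations differ by a constant factor.

```python
# Sketch of a single-channel Haar split: each 2 x 2 block yields one sample
# of the four half-resolution sub-bands LL (smooth), LH (vertical edges),
# HL (horizontal edges), and HH (diagonal texture).
def haar_split(x):
    """x: H x W list of floats with even H and W; returns (LL, LH, HL, HH)."""
    H, W = len(x), len(x[0])
    LL, LH, HL, HH = ([[0.0] * (W // 2) for _ in range(H // 2)] for _ in range(4))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            LL[i // 2][j // 2] = (a + b + c + d) / 4   # row + column low-pass
            LH[i // 2][j // 2] = (a - b + c - d) / 4   # column high-pass: vertical edges
            HL[i // 2][j // 2] = (a + b - c - d) / 4   # row high-pass: horizontal edges
            HH[i // 2][j // 2] = (a - b - c + d) / 4   # dual high-pass: diagonal texture
    return LL, LH, HL, HH
```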
To improve fruit–background separability under challenging illumination, we use a cascaded attention design. In the first stage, the low-frequency branch generates an attention map that suppresses broad illumination and smooth-background responses in the feature substrate. In the second stage, the high-frequency branch generates an attention map that strengthens contour-related responses at fruit and branch boundaries. This two-stage modulation follows a “background suppression, then edge enhancement” sequence and increases contrast between target regions and clutter in feature space.
Although CDFA successfully purifies deep semantic features at the backbone terminus via frequency-domain filtering, this computationally intensive operation inevitably tightens the overall network computational budget. Furthermore, the fixed 3 × 3 convolutional kernels and nearest-neighbour upsampling operators within BiFPN nodes are insufficient in terms of receptive field coverage and frequency-domain fidelity to fully accommodate the high-frequency detail carried in the CDFA output, necessitating additional feature reconstruction modules for compensation.
2.2.3. Multi-Scale Spatial Feature Reconstruction Module
To resolve the tension between CDFA feature capacity and the limited receptive field at BiFPN nodes within a constrained computational budget, this study reconstructs the fundamental convolutional units at the critical feature fusion pathways, ensuring fidelity of deep semantic propagation during downward transmission while simultaneously compensating for the inherent local perception deficiency of lightweight architectures. This section first introduces the independent designs of the Multi-Scale Convolutional Block (MSCB) and the Efficient Upsampling Convolutional Block (EUCB), followed by an exposition of the embedding mechanism of the zero-parameter Shift-Context (SC) strategy. In the ablation experiments (
Section 3.2), intermediate architectures without the embedded SC are denoted as CM (CSP-MSCB) and EUCB, respectively; the final architectures with SC embedded are denoted as CMS (CSP-MSCB-SC) and EU_SC (EUCB-SC).
To address the inadequacy of single convolutional kernels in adapting to the dramatic scale variation of fruit targets, this study proposes a Multi-Scale Convolutional Block based on the CSP architecture. The MSCB is embedded within the residual branch of the CSP framework; without the SC strategy, this configuration is denoted as CM (CSP-MSCB), as illustrated in
Figure 8. The core of this module resides in a globally heterogeneous kernel selection mechanism [
29].
In contrast to conventional approaches that deepen network architectures to enlarge the receptive field, MSCB adopts a “width-for-depth” strategy. A set of parallel convolutional groups is constructed within the CSP residual branch, with the convolutional kernel combination $\mathcal{K}_l$ dynamically allocated according to the depth $l$ of the feature pyramid. Shallow layer P3 (high resolution): small kernels are employed to preserve high-frequency spatial details, adapted to small-target detection. Intermediate layer P4 (medium resolution): a balance between fine-grained detail capture and contextual awareness is maintained. Deep layer P5 (low resolution): large kernels are utilised to expand the effective receptive field (ERF), covering large targets while suppressing background noise. To reconcile expressive capacity during training with ultra-fast inference response, and given the additional parameter overhead introduced by large-kernel convolutions, reparameterisation technology is adopted. During the training phase, multiple parallel branches are employed to capture rich feature representations, as expressed in Equation (6):

$$F_{\mathrm{out}} = \sum_{k \in \mathcal{K}_l} \mathrm{BN}_k\!\left(\mathrm{DWConv}_{k \times k}(F_{\mathrm{mid}})\right) \tag{6}$$

where $F_{\mathrm{mid}}$ denotes the intermediate feature map after channel expansion within the CSP residual branch; $\mathrm{DWConv}_{k \times k}$ denotes a depthwise separable convolution with kernel size $k \times k$, in which each channel independently performs spatial convolution to reduce the parameter count; and $\mathrm{BN}_k$ denotes batch normalisation applied to the branch with kernel size $k$. The outputs of each branch are aggregated via element-wise addition rather than channel concatenation.
During the inference phase, the linear additivity of convolution is exploited to collapse the multi-branch kernels $W_k$ and biases $b_k$ into a single-branch operator, achieving “zero-cost” performance gain, as expressed in Equation (7). Here, $\hat{W}_k$ and $\hat{b}_k$ denote the equivalent convolutional weights and biases of the $k$-th branch after absorbing the batch normalisation parameters, respectively. Specifically, the BN-layer parameters of each training-phase branch (scaling factor $\gamma_k$, shift $\beta_k$, running mean $\mu_k$, and running variance $\sigma_k^2$) are first fused into the convolutional weights following the standard reparameterisation procedure: $\hat{W}_k = \frac{\gamma_k}{\sqrt{\sigma_k^2 + \epsilon}} W_k$, $\hat{b}_k = \beta_k - \frac{\gamma_k \mu_k}{\sqrt{\sigma_k^2 + \epsilon}}$. The $\mathrm{Pad}(\cdot)$ operation extends smaller kernels to the same spatial dimensions as the largest kernel $k_{\max} \times k_{\max}$ via zero-padding, enabling element-wise summation across kernels of different sizes within a consistent tensor dimensionality. Consequently, during inference, a single $k_{\max} \times k_{\max}$ depthwise separable convolution suffices to equivalently reproduce the multi-branch expressive capacity of the training phase:

$$W_{\mathrm{eq}} = \sum_{k} \mathrm{Pad}\!\left(\hat{W}_k\right), \qquad b_{\mathrm{eq}} = \sum_{k} \hat{b}_k \tag{7}$$
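A minimal single-channel sketch of this branch collapse, assuming BN has already been absorbed into the per-branch weights and biases, and using plain nested lists rather than tensors:

```python
# Sketch of the merging step in Equation (7) for one channel: a 3 x 3 and a
# 1 x 1 branch collapse into one 3 x 3 kernel by zero-padding the smaller
# kernel and summing kernels and biases, so a single convolution reproduces
# the two-branch sum exactly.
def conv2d(x, k, bias=0.0):
    """'Same'-padded cross-correlation of an H x W input with an odd k x k kernel."""
    H, W, r = len(x), len(x[0]), len(k) // 2
    out = [[bias] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for u in range(-r, r + 1):
                for v in range(-r, r + 1):
                    if 0 <= i + u < H and 0 <= j + v < W:
                        out[i][j] += k[u + r][v + r] * x[i + u][j + v]
    return out

def pad_kernel(k, size):
    """Zero-pad an odd k x k kernel to size x size (the Pad(.) operation)."""
    off = (size - len(k)) // 2
    out = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(k):
        for j, v in enumerate(row):
            out[off + i][off + j] = v
    return out

def merge_branches(kernels, biases):
    """Collapse parallel branches into one equivalent kernel and bias."""
    size = max(len(k) for k in kernels)
    merged = [[0.0] * size for _ in range(size)]
    for k in kernels:
        for i, row in enumerate(pad_kernel(k, size)):
            for j, v in enumerate(row):
                merged[i][j] += v
    return merged, sum(biases)
```

Because convolution is linear in its kernel, applying the merged kernel once gives the same output as running each branch and summing, which is the "zero-cost" gain claimed above.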
Along the top-down pathway of BiFPN, to robustly mitigate the edge blurring and grid artefacts introduced by conventional interpolation algorithms, EUCB is adopted to replace the standard upsampling operator. As illustrated in
Figure 9, its design follows an integrated “upsample–depth enhancement–dimensionality compression” logic, functionally equivalent to a learnable inverse filtering operator. Specifically, bilinear interpolation first magnifies the feature map spatially; a 3 × 3 depthwise convolution then performs semantic smoothing on the enlarged features to repair pixel discontinuities caused by resolution stretching; finally, a 1 × 1 convolution compresses the channel dimensionality. This design not only restores the sharpness of fruit boundary edges, but also ensures fidelity of deep semantic information during downward propagation [
30].
Although the introduction of MSCB and EUCB enhances perceptual capability, it also increases memory access costs. To achieve a Pareto improvement between accuracy and inference speed, the Shift-Context (SC) strategy is embedded into the channel-shuffle stages of both CM and EUCB, respectively. Following SC integration, CM is upgraded to CMS (CSP-MSCB-SC) and EUCB is upgraded to EU_SC (EUCB-SC).
The SC strategy draws inspiration from temporal shift operations in video understanding and is adapted to the spatial domain. As illustrated in
Figure 10, for the intermediate feature $F \in \mathbb{R}^{C \times H \times W}$ produced by MSCB or EUCB, the channels are evenly divided into four equal partitions $\{F_1, F_2, F_3, F_4\}$, which are subjected to cyclic shifts of one pixel in the four spatial directions {Up, Down, Left, Right}, with displacement vectors $\{(-1, 0), (1, 0), (0, -1), (0, 1)\}$, as expressed in Equations (8) and (9):

$$\{F_1, F_2, F_3, F_4\} = \mathrm{Split}(F) \tag{8}$$

$$F' = \mathrm{Cat}\!\left(\mathrm{Shift}_{(-1,0)}(F_1),\ \mathrm{Shift}_{(1,0)}(F_2),\ \mathrm{Shift}_{(0,-1)}(F_3),\ \mathrm{Shift}_{(0,1)}(F_4)\right) \tag{9}$$
This operation enables the feature vector at each spatial position to assimilate neighbourhood information at zero-parameter and zero-FLOP cost. Although cyclic spatial shifting introduces minor stitching artefacts at the absolute physical boundaries of the feature map (e.g., the leftmost pixel wrapping to the rightmost), such boundary transition noise is rapidly attenuated through the receptive field coverage and pooling operations of subsequent network layers. Moreover, given the strong prior that target fruits are virtually never located within the outermost single-pixel boundary of an image, this boundary noise is effectively diluted during deep feature propagation, thereby achieving global neighbourhood interaction while circumventing the computational overhead of complex boundary mask operations.
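The shift operation can be sketched directly; the group-to-direction assignment follows the Up, Down, Left, Right order given above, and the function names are illustrative.

```python
# Sketch of the Shift-Context operation: channels are split into four equal
# groups and each group is cyclically rolled one pixel Up, Down, Left, or
# Right -- a pure re-indexing with zero parameters and zero FLOPs.
def cyclic_shift(chan, di, dj):
    """Roll an H x W channel by (di, dj) with wrap-around at the borders."""
    H, W = len(chan), len(chan[0])
    return [[chan[(i - di) % H][(j - dj) % W] for j in range(W)]
            for i in range(H)]

def shift_context(feat):
    """feat: C x H x W nested lists, C divisible by 4; groups shifted U, D, L, R."""
    C = len(feat)
    g = C // 4
    dirs = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # Up, Down, Left, Right
    return [cyclic_shift(ch, *dirs[min(c // g, 3)]) for c, ch in enumerate(feat)]
```

The wrap-around visible at the borders is exactly the stitching artefact discussed above; a subsequent convolution sees shifted neighbours in each channel group, giving cheap cross-position mixing.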
In summary, YOLOv11-CBMES combines four coordinated changes for semi-structured orchard images. BiFPN learns cross-scale fusion weights instead of fixed concatenation. CDFA applies Haar-based splitting at P5 to reduce illumination–edge entanglement before neck processing. CM and EUCB enlarge receptive fields and sharpen upsampled features within a fixed budget. SC adds zero-cost neighbourhood mixing inside CM and EUCB (CMS and EU_SC). Together, these blocks target cross-scale misalignment, aliasing, limited context, and efficiency constraints highlighted in
Section 2.2.
2.3. Model Training
All models were trained and evaluated on a single platform for reproducibility and fair comparison: an Intel® Core™ i9-14900HX CPU (Intel Corporation, Santa Clara, CA, USA), an NVIDIA GeForce RTX 4070 GPU (8 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA), CUDA 12.6, cuDNN 9.5.1, and PyTorch 2.6.0. Every architecture compared in this study used the same training recipe (optimiser, schedule, input resolution, augmentation, and stopping rule); only the network structure differed.
Input resolution. Images were resized to 640 × 640 pixels. A trial at 1280 × 1280 yielded only a marginal gain on a narrow subset of cases while reducing throughput to about 45 FPS, which is ill-suited to closed-loop control on mobile orchard platforms; 640 × 640 was therefore adopted as the operational trade-off between accuracy and latency.
Optimisation. Training used stochastic gradient descent with momentum 0.937 and weight decay 5 × 10⁻⁴. The initial learning rate was 0.01, adjusted with a cosine annealing schedule over training. The batch size was 8.
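The cosine annealing schedule can be sketched as follows; the final learning-rate floor of 0 is an illustrative assumption, as implementations often use a small non-zero floor instead.

```python
# Sketch of cosine annealing: the learning rate decays from lr0 following
# lr_final + 0.5 * (lr0 - lr_final) * (1 + cos(pi * t / T)) over T epochs.
import math

def cosine_lr(t, T, lr0=0.01, lr_final=0.0):
    """Learning rate at epoch t (0 <= t <= T) under cosine annealing."""
    return lr_final + 0.5 * (lr0 - lr_final) * (1 + math.cos(math.pi * t / T))
```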
Data augmentation. Mosaic augmentation (four-image Mosaic) was enabled for all models in the main experiments so that architectural comparisons are not confounded by different augmentation policies; its effect is reported separately in
Section 3.2.
Classification loss, foreground–background imbalance, and three-class rebalancing. Instance-level labels follow No Occlusion (NO), Soft Occlusion (SO), and Hard Occlusion (HO). Let $p_c$ denote the empirical class proportion of category $c$ among the annotated instances; the values of $p_{NO}$, $p_{SO}$, and $p_{HO}$ are reported in Table 1. The task has C = 3 mutually exclusive occlusion categories. Two imbalances matter: (1) inter-class spread among NO, SO, and HO, and (2) intra-image imbalance between assigned foreground locations and the large number of background locations on the dense prediction map in single-stage training.
For (1), we apply class-frequency reweighting to the supervised classification terms tied to assigned positive samples. Let $c$ be the ground-truth occlusion class of a positive assignment. Inverse-frequency factors, normalised so that their average over the C classes equals unity, are

$$w_c = \frac{1/p_c}{\frac{1}{C}\sum_{c'=1}^{C} 1/p_{c'}} \tag{10}$$

In our setting (C = 3), the numerical weights follow by substituting the empirical proportions from Table 1 into Equation (10). In the implementation narrative below, $w_c$ always denotes these symbolic weights. For each assigned positive, the per-element classification loss on the class dimensions involved in that assignment is multiplied by $w_c$; background locations use a unit factor (no extra class weight) so that class balance is adjusted without rescaling the entire background set. No hand-tuning of $w_c$ beyond these values was used.
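The reweighting of Equation (10) can be sketched in a few lines; the proportions in the test below are placeholders, not the dataset's actual values (see Table 1).

```python
# Sketch of inverse-frequency reweighting (Equation (10)): w_c = (1 / p_c)
# normalised so that the average weight over the C classes equals one.
def class_weights(props):
    """props: list of empirical class proportions p_c (each > 0, summing to 1)."""
    inv = [1.0 / p for p in props]
    mean_inv = sum(inv) / len(inv)
    return [v / mean_inv for v in inv]
```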
For (2), the classification branch applies focal modulation on binary cross-entropy with logits between predicted logits and task-aligned soft targets from label assignment. Let $p$ denote the logits, $\sigma(\cdot)$ the sigmoid, and $t$ the soft target (element-wise, aligned with the logits $p$). We fix the focusing exponent $\gamma = 1.5$ and the positive-versus-background balancing constant $\alpha = 0.25$ in the focal construction. We define the following:

$$\ell_{\mathrm{focal}}(p, t) = \big[\alpha\, t + (1 - \alpha)(1 - t)\big]\, \lvert t - \sigma(p) \rvert^{\gamma}\, \mathrm{BCE}(p, t)$$

The factor $\lvert t - \sigma(p) \rvert^{\gamma}$ down-weights easy locations (where $\sigma(p)$ is already close to $t$) and emphasises hard, misclassified locations; $\alpha$ applies the usual positive-versus-background scaling in focal formulations for dense detectors. Class-specific factors $w_{NO}$, $w_{SO}$, and $w_{HO}$ are fixed from the empirical $p_c$ in Table 1 as in Equation (10). After applying focal modulation, the weighted per-element classification loss for assigned positives is multiplied by $w_c$ on the supervised class channels associated with ground-truth class $c$; background locations use a unit multiplier on the class-balancing factors so that the dense background field is not globally rescaled. The classification term is summed over locations and normalised in the same manner as in the reference YOLO implementation. The resulting scalar classification loss is denoted $\mathcal{L}_{\mathrm{cls}}$ and enters the total training objective together with the bounding-box regression terms in the same weighted sum as in the reference YOLOv11 detector, so that end-to-end optimisation minimises localisation and classification jointly.
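A per-element sketch consistent with the focal construction described above (γ = 1.5, α = 0.25); the exact form in the released code may differ, and the function names here are illustrative.

```python
# Sketch of a focal-modulated BCE-with-logits term for one element: the
# modulator |t - sigmoid(p)| ** gamma suppresses easy locations, and alpha
# blends the positive and background contributions.
import math

def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))

def focal_bce(p, t, gamma=1.5, alpha=0.25, eps=1e-12):
    s = sigmoid(p)
    bce = -(t * math.log(s + eps) + (1 - t) * math.log(1 - s + eps))
    modulator = abs(t - s) ** gamma           # down-weights easy locations
    balance = alpha * t + (1 - alpha) * (1 - t)
    return balance * modulator * bce
```

A confidently correct positive prediction contributes almost nothing, while a confidently wrong one dominates, which is the intended behaviour against the large easy-background population.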
Robustness analyses (multi-seed training and five-fold cross-validation with a fixed test set). Unless otherwise stated, all results that populate the main benchmark tables in
Section 3—including the progressive ablation in
Section 3.2 and the cross-architecture comparison in
Section 3.4—were obtained from training runs that use the global random seed 0, consistent with the default configuration in our implementation.
Supplementary analyses quantify two additional sources of variability while keeping the test subset fixed and preserving the 7:1:2 split definition in
Table 1. First, we repeat training for YOLOv11 and YOLOv11-CBMES with three seeds (0, 42, and 2026), changing only the seed that controls optimisation stochasticity (for example, initialisation, shuffling, and stochastic augmentation); the training, validation, and test subsets are unchanged across these repeats. Second, we pool the training and validation subsets, repartition them into five folds, and in each round train on four folds and use the remaining fold for validation (monitoring and selection); after each training run, class-wise mAP is evaluated on the same held-out test subset using the procedure in
Section 2.4. The first analysis characterises variability from training randomness under a fixed partition; the second characterises sensitivity to reassigning images between training and validation without using the test subset for training or selection.
The outcomes of these two analyses are summarised in Section 3.1.