2.1. Dataset, Acquisition Protocol, and Preprocessing
2.1.1. Experimental Site and Camera Configuration
The data acquisition campaign was conducted in a dwarf-rootstock, high-density orchard at the First Station of the Taihang Mountain Road, Hebei Agricultural University, Baoding, China. At the macro scale, orchard geometry is characterised by a row spacing of approximately 4 m, a plant spacing of 1.5 m, and a mean tree height of 3.5 m. At the canopy scale, the scene is semi-structured: lateral branches extend into the inter-row volume, embedding fruit targets within a multi-layer branch–foliage volume rather than an idealised planar fruiting wall. Illumination variability is introduced through three acquisition regimes—front lighting, lateral lighting, and strong backlighting—which produce fragmented shadows and specular highlights and therefore stress RGB-based feature learning under field conditions.
Depth-aware acquisition uses an Intel RealSense D455 camera (Intel Corporation, Santa Clara, CA, USA) calibrated over 500–1500 mm.
Figure 2 summarises the calibration outcome as box plots of measurement error (mm) versus range, with the full span partitioned into 100 mm distance bins. The profile shows that errors are tightly scattered around the zero-error baseline with comparatively low dispersion within the 700–1100 mm band, whereas the outer segments (approximately 500–700 mm and 1100–1500 mm) exhibit larger spread and a systematic shift toward positive error, especially at the longer end of the calibrated interval. Accordingly, image capture was constrained to a camera–canopy distance of 0.7–1.1 m so that routine depth-guided processing stays within the sensor’s most reliable operating interval implied by this calibration.
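The binning logic behind this calibration summary can be sketched as follows; the helper names and the sample values are illustrative, not the campaign's actual measurements.

```python
# Illustrative sketch of the calibration analysis: depth measurement errors
# (mm) are grouped into 100 mm range bins over the 500-1500 mm span, and a
# per-bin median and spread are computed (sample data are hypothetical).
from statistics import median

def bin_errors(samples, lo=500, hi=1500, width=100):
    """samples: iterable of (range_mm, error_mm); returns {bin_start: [errors]}."""
    bins = {b: [] for b in range(lo, hi, width)}
    for rng, err in samples:
        if lo <= rng < hi:
            b = lo + ((rng - lo) // width) * width
            bins[b].append(err)
    return bins

def bin_summary(bins):
    """Median error and maximum absolute deviation for each non-empty bin."""
    return {b: (median(e), max(abs(x - median(e)) for x in e))
            for b, e in bins.items() if e}
```

Applied to the full error cloud, a summary of this form exposes exactly the pattern described above: low dispersion near 700–1100 mm and growing positive bias toward the edges of the calibrated span.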
2.1.2. Occlusion Taxonomy and Background Purification
The detection task is formulated beyond pure 2D localisation: each target is assigned to one of three occlusion-centric classes—No Occlusion (NO), Soft Occlusion (SO), or Hard Occlusion (HO)—according to the physical rigidity of the occluding object as the primary criterion for robot intervention planning. NO indicates a fully visible target. SO indicates occlusion mainly by deformable foliage, where a push-and-grasp strategy may be feasible. HO indicates occlusion by rigid branches, support wires, or densely clustered fruit, where forced intervention is unsafe, and avoidance planning is preferred.
During data collection, the camera field of view may include distant fruit from adjacent rows when foreground canopy gaps appear. To suppress harvest-irrelevant distant interference at the data source, a depth-guided purification strategy is applied under the calibrated depth reliability regime described in
Section 2.1.1. Specifically, regions with depth Z > 2.0 m are projected onto the RGB plane and replaced via neighbourhood green-foliage texture synthesis masking, restricting learning to the manipulator-relevant canopy layer.
Figure 3 compares representative raw regions with their masked counterparts under this procedure.
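The purification rule can be sketched as below. As a stand-in for the neighbourhood green-foliage texture synthesis, this toy version writes a fixed foliage-green colour; the constant and function names are illustrative assumptions.

```python
# Minimal sketch of depth-guided background purification: pixels whose
# aligned depth exceeds 2.0 m are replaced. A fixed foliage-green RGB value
# stands in for the paper's texture-synthesis masking (an assumption made
# for illustration only).
Z_MAX_MM = 2000             # harvest-relevant depth threshold (Z > 2.0 m masked)
FOLIAGE_RGB = (34, 85, 34)  # hypothetical placeholder colour

def purify(rgb, depth):
    """rgb: H x W list of (r, g, b) tuples; depth: H x W list of depths in mm."""
    out = []
    for row_rgb, row_z in zip(rgb, depth):
        out.append([FOLIAGE_RGB if z > Z_MAX_MM else px
                    for px, z in zip(row_rgb, row_z)])
    return out
```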
2.1.3. Dataset Summary and Size Distribution
The full image corpus comprises 1635 RGB images and 12,705 annotated instances. The dataset is split in a fixed 7:1:2 ratio for training, validation, and testing, yielding 1144 training images, 164 validation images, and 327 test images. Instance-level occlusion proportions (label-derived) and image counts under the three illumination regimes are summarised in
Table 1, together with the sensing operating range and the depth threshold used for background purification (Z > 2.0 m).
To characterise object-scale variation in the image plane, we summarise the distribution of instance-level bounding-box widths computed from YOLO-normalised widths mapped to pixel units. We use 60 equal-width bins over 2–150 px, including all instances across all splits.
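The width-distribution computation can be sketched as follows; the 640 px image width is the training resolution, and the function name is illustrative.

```python
# Sketch of the width-distribution summary: YOLO-normalised box widths are
# mapped to pixel units and counted in 60 equal-width bins over 2-150 px.
def width_histogram(norm_widths, img_w=640, lo=2.0, hi=150.0, n_bins=60):
    counts = [0] * n_bins
    step = (hi - lo) / n_bins
    for w in norm_widths:
        px = w * img_w                      # normalised width -> pixels
        if lo <= px < hi:
            counts[int((px - lo) // step)] += 1
    return counts
```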
All compared models are trained and evaluated under an identical fixed partition (7:1:2) with mutually exclusive image sets; hyperparameters are tuned using the validation set, and the test set is used only for final reporting. Although k-fold cross-validation can estimate variability across partitions, it was not adopted for the main experiments due to the high computational cost of repeated full detector training; we instead prioritise a transparent benchmark split for direct reproducibility and fair comparison.
Figure 4 shows that widths are strongly concentrated at small pixel extents, while a minority of larger boxes reflects closer viewpoints or larger apparent object scale; this motivates multi-scale feature representation in the detection backbone. Code and annotations will be released after publication to support reproduction.
2.2. Detection Model Architecture
In semi-structured orchards, occlusion and lighting variation degrade RGB features. The detector must run on mobile-class hardware while discriminating No Occlusion (NO), Soft Occlusion (SO), and Hard Occlusion (HO). Relative to standard lightweight single-stage designs, the main difficulty is aligning fusion operators with cues relevant to rigidity-aware class decisions.
We take YOLOv11 as the baseline. For three-class occlusion reasoning, its default neck and units show several limitations. PANet-style fusion concatenates multi-scale features with limited control over cross-scale contribution; during upsampling, deep background responses can dominate shallow detail. The native C3k2 stack relies heavily on fixed 3 × 3 convolutions with a relatively small effective receptive field for elongated branches. Feature extraction and upsampling are also carried out without an explicit separation between illumination variation and edge structure.
YOLOv11-CBMES adds a frequency-domain branch and multi-scale reconstruction around the baseline pipeline, as shown in
Figure 5.
A weighted Bidirectional Feature Pyramid Network (BiFPN) replaces the original neck to learn cross-scale fusion weights.
Contrast-Driven Feature Aggregation (CDFA) is placed at P5 to separate low-frequency illumination-dominated energy from high-frequency structure using a Haar wavelet split.
CSP-Encapsulated Multi-Scale Convolutional Blocks (CM and CSP-MSCB) replace C3k2 at selected BiFPN fusion nodes, and Efficient Upsampling Convolutional Blocks (EUCB) replace conventional upsampling along the top-down path.
A zero-parameter Shift-Context (SC) module is inserted at the channel-shuffle stages of CM and EUCB, yielding CMS and EU_SC. SC mixes local neighbourhoods by cyclic shifting without extra parameters or FLOPs.
In
Figure 5, CMS and EU_SC denote the configurations after SC is integrated.
While
Figure 5 shows the static wiring,
Figure 6 summarises the end-to-end data flow. Depth-guided purification suppresses distant background interference. The backbone builds spatial features; CDFA decomposes P5 features to emphasise edge-related responses relative to low-frequency components. Weighted BiFPN and the spatial reconstruction blocks then fuse and refine these representations. The heads output NO, SO, and HO predictions intended as inputs to downstream grasp-or-avoid policies; closed-loop safety is not evaluated here.
2.2.1. Bidirectional Weighted Feature Pyramid Network (BiFPN)
As previously described, the original PANet employs an undifferentiated concatenation mechanism for multi-scale feature fusion, resulting in equal-weight superimposition of features across different hierarchical levels without selective filtering, thereby inducing cross-scale semantic alignment bias. To overcome this limitation, this study introduces the Weighted Bidirectional Feature Pyramid Network (BiFPN) to replace PANet, dynamically adjusting the fusion ratio of features across different scales through learnable weight vectors [25]. For a given multi-scale input feature set $\{F_i\}_{i=1}^{n}$, the fused output feature $F_{\mathrm{out}}$ is computed according to Equation (1), where $F_i$ denotes the $i$-th input feature map participating in the current fusion node, and $n$ represents the number of input branches at that node (in the proposed architecture, $n = 2$ for top-down nodes and $n = 3$ for bottom-up nodes):

$$F_{\mathrm{out}} = \sum_{i=1}^{n} \frac{w_i}{\epsilon + \sum_{j=1}^{n} w_j}\, F_i \tag{1}$$

where $w_i$ denotes the learnable scalar weight, constrained to non-negative values via ReLU activation to ensure physical interpretability of feature contributions, and $\epsilon$ is a numerical stability term. This fast normalisation strategy avoids the exponential computational overhead associated with Softmax, reducing the latency of a single fusion operation from 2.3 ms (PANet) to 1.7 ms, measured on an NVIDIA GeForce RTX 4070 (NVIDIA Corporation, Santa Clara, CA, USA). Through end-to-end training, the weight distribution self-adaptively adjusts during backpropagation: when processing shallow-layer small-target features, the weights of deep-layer background semantics are automatically attenuated, thereby mitigating the semantic dilution problem during cross-scale fusion [25].
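The fast normalised fusion of Equation (1) can be sketched in a few lines; features are flat lists of floats here for illustration, rather than tensors.

```python
# Sketch of fast normalised fusion (Equation (1)): each input feature is
# scaled by ReLU(w_i) / (eps + sum_j ReLU(w_j)) and the results are summed.
def fast_normalized_fusion(features, weights, eps=1e-4):
    w = [max(0.0, wi) for wi in weights]   # ReLU keeps weights non-negative
    denom = eps + sum(w)
    fused = [0.0] * len(features[0])
    for fi, wi in zip(features, w):
        scale = wi / denom
        for k, v in enumerate(fi):
            fused[k] += scale * v
    return fused
```

With equal weights this reduces to a plain average; a negative raw weight is clipped to zero and its branch contributes nothing, which is how the network learns to attenuate unhelpful scales.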
At the network topology level, computationally redundant single-input nodes are removed, and a bidirectional feature pyramid structure comprising both top-down and bottom-up pathways is constructed. The top-down pathway is primarily responsible for injecting the high-level semantic abstractions enhanced by the CDFA module into shallow network layers, leveraging the category-discriminative information at the P5 layer to suppress the responses of texture-similar non-target regions in shallow feature maps. Taking the P4 layer as an example, the intermediate feature $P_4^{\mathrm{td}}$ is generated as expressed in Equation (2). This process propagates the high-level semantic judgement of “fruit versus foliage” from P5 downward, providing semantic priors to shallow layers to suppress non-target regions with similar textures. The bottom-up pathway subsequently propagates high-frequency geometric edge information from shallow layers back to deep layers to correct localisation bias. The final output $P_4^{\mathrm{out}}$ aggregates three signal streams—the original input, the intermediate-state feature, and the feature from the lower layer—as expressed in Equation (3):

$$P_4^{\mathrm{td}} = \mathrm{Conv}\!\left(\frac{w_1\, P_4 + w_2\, \mathrm{Resize}(P_5)}{w_1 + w_2 + \epsilon}\right) \tag{2}$$

$$P_4^{\mathrm{out}} = \mathrm{Conv}\!\left(\frac{w_1'\, P_4 + w_2'\, P_4^{\mathrm{td}} + w_3'\, \mathrm{Resize}(P_3^{\mathrm{out}})}{w_1' + w_2' + w_3' + \epsilon}\right) \tag{3}$$

In Equation (2), $P_4$ is the backbone feature at level 4 and $P_5$ is the backbone feature at level 5. Because $P_5$ is half the spatial resolution of $P_4$, Resize(·) applies 2× nearest-neighbour upsampling to align $P_5$ to $P_4$. Conv(·) denotes a 3 × 3 depthwise separable convolution for post-fusion refinement. $w_1$ and $w_2$ are learnable scalars for this node only. $P_4^{\mathrm{td}}$ is the top-down intermediate feature at P4. In Equation (3), $P_4^{\mathrm{out}}$ is the bottom-up fused output at P4 and $P_3^{\mathrm{out}}$ is the bottom-up output at P3. Because $P_3^{\mathrm{out}}$ has twice the resolution of $P_4$, Resize(·) uses strided convolution (stride 2) to align $P_3^{\mathrm{out}}$ to $P_4$. $w_1'$, $w_2'$, and $w_3'$ are learnable scalars for this bottom-up node and are not shared with $w_1$ and $w_2$ in Equation (2). $P_4^{\mathrm{out}}$ combines deep semantics ($P_4^{\mathrm{td}}$), the original backbone feature ($P_4$), and shallower geometric detail ($P_3^{\mathrm{out}}$).
Learnable fusion reweights existing scales but cannot create edge–illumination separation when backbone features are already strongly aliased, which motivates CDFA.
2.2.2. Contrast-Driven Feature Aggregation (CDFA) Module
Backbone features can mix illumination drift with boundary cues when processing is confined to local spatial convolution. Standard kernels fit local appearance but do not explicitly separate low-frequency background variation from high-frequency discontinuities along object contours [
26]. We therefore embed a Contrast-Driven Feature Aggregation (CDFA) module at the P5 terminus of the backbone, extending processing from the spatial domain to a wavelet-split frequency path. CDFA uses a Haar wavelet decomposition together with a contrast-driven enhancement pathway inspired by WFEN [
27] and ConDSeg [
28]. Convolution without such structure tends to blur fine edges. As illustrated in
Figure 7, CDFA proceeds in three stages: wavelet frequency-domain decoupling, dual-branch cascaded attention enhancement, and spatial reconstruction.
Haar wavelet-based explicit decomposition. In contrast to implicit feature learning, the input feature $F \in \mathbb{R}^{C \times H \times W}$ is projected onto the frequency domain. The Haar wavelet transform losslessly decomposes the feature map into four sub-bands: a low-frequency approximation component $F_{LL}$ and three high-frequency detail components in the horizontal, vertical, and diagonal directions, denoted $F_{LH}$, $F_{HL}$, and $F_{HH}$, respectively. The three high-frequency components are concatenated along the channel dimension to construct the high-frequency edge feature $F_{\mathrm{high}}$, which is highly sensitive to the edge gradients of fruit regions. The low-frequency component serves as the low-frequency illumination feature $F_{\mathrm{low}}$, preserving smooth background and illumination distribution information. The mathematical formulation is given in Equations (4) and (5), where Cat(·) denotes concatenation along the channel dimension. Simultaneously, the original input $F$ is mapped to an initial value vector $V_0$ via a linear projection layer followed by an Unfold operation, serving as the feature substrate for subsequent cascaded aggregation:

$$\{F_{LL}, F_{LH}, F_{HL}, F_{HH}\} = \mathrm{HWT}(F) \tag{4}$$

$$F_{\mathrm{low}} = F_{LL}, \qquad F_{\mathrm{high}} = \mathrm{Cat}(F_{LH}, F_{HL}, F_{HH}) \tag{5}$$

The Haar wavelet transform convolves the input feature map along both row and column directions using stride-2 low-pass (averaging) and high-pass (differencing) filter kernels, producing four sub-bands at half the spatial resolution. Specifically, $F_{LL}$ is the dual low-frequency component encoding smooth illumination and chrominance distributions; $F_{LH}$ (row low-pass, column high-pass) responds to vertical edge gradients; $F_{HL}$ (row high-pass, column low-pass) responds to horizontal edge gradients; and $F_{HH}$ (dual high-frequency) responds to diagonal texture variations.
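The stride-2 Haar split described above can be sketched per channel as follows; the scaling convention (plain averages and differences of each 2 × 2 block) is an illustrative choice, as implementations differ by a constant factor.

```python
# Sketch of a single-channel Haar split: each 2 x 2 block yields one sample
# of the four half-resolution sub-bands LL (smooth), LH (vertical edges),
# HL (horizontal edges), and HH (diagonal texture).
def haar_split(x):
    """x: H x W list of floats with even H and W; returns (LL, LH, HL, HH)."""
    H, W = len(x), len(x[0])
    LL, LH, HL, HH = ([[0.0] * (W // 2) for _ in range(H // 2)] for _ in range(4))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            LL[i // 2][j // 2] = (a + b + c + d) / 4   # row + column low-pass
            LH[i // 2][j // 2] = (a - b + c - d) / 4   # column high-pass: vertical edges
            HL[i // 2][j // 2] = (a + b - c - d) / 4   # row high-pass: horizontal edges
            HH[i // 2][j // 2] = (a - b - c + d) / 4   # dual high-pass: diagonal texture
    return LL, LH, HL, HH
```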
To improve fruit–background separability under challenging illumination, we use a cascaded attention design. In the first stage, the low-frequency branch generates an attention map that suppresses broad illumination and smooth-background responses in the feature substrate. In the second stage, the high-frequency branch generates an attention map that strengthens contour-related responses at fruit and branch boundaries. This two-stage modulation follows a “background suppression, then edge enhancement” sequence and increases contrast between target regions and clutter in feature space.
Although CDFA successfully purifies deep semantic features at the backbone terminus via frequency-domain filtering, this computationally intensive operation inevitably tightens the overall network computational budget. Furthermore, the fixed 3 × 3 convolutional kernels and nearest-neighbour upsampling operators within BiFPN nodes are insufficient in terms of receptive field coverage and frequency-domain fidelity to fully accommodate the high-frequency detail carried in the CDFA output, necessitating additional feature reconstruction modules for compensation.
2.2.3. Multi-Scale Spatial Feature Reconstruction Module
To resolve the tension between CDFA feature capacity and the limited receptive field at BiFPN nodes within a constrained computational budget, this study reconstructs the fundamental convolutional units at the critical feature fusion pathways, ensuring fidelity of deep semantic propagation during downward transmission while simultaneously compensating for the inherent local perception deficiency of lightweight architectures. This section first introduces the independent designs of the Multi-Scale Convolutional Block (MSCB) and the Efficient Upsampling Convolutional Block (EUCB), followed by an exposition of the embedding mechanism of the zero-parameter Shift-Context (SC) strategy. In the ablation experiments (
Section 3.2), intermediate architectures without the embedded SC are denoted as CM (CSP-MSCB) and EUCB, respectively; the final architectures with SC embedded are denoted as CMS (CSP-MSCB-SC) and EU_SC (EUCB-SC).
To address the inadequacy of single convolutional kernels in adapting to the dramatic scale variation of fruit targets, this study proposes a Multi-Scale Convolutional Block based on the CSP architecture. The MSCB is embedded within the residual branch of the CSP framework; without the SC strategy, this configuration is denoted as CM (CSP-MSCB), as illustrated in
Figure 8. The core of this module resides in a globally heterogeneous kernel selection mechanism [
29].
In contrast to conventional approaches that deepen network architectures to enlarge the receptive field, MSCB adopts a “width-for-depth” strategy. A set of parallel convolutional groups is constructed within the CSP residual branch, with the convolutional kernel combination $\mathcal{K}_l$ dynamically allocated according to the depth $l$ of the feature pyramid. Shallow layer P3 (high resolution): small kernels are employed to preserve high-frequency spatial details, adapted to small-target detection. Intermediate layer P4 (medium resolution): a balance between fine-grained detail capture and contextual awareness is maintained. Deep layer P5 (low resolution): large kernels are utilised to expand the effective receptive field (ERF), covering large targets while suppressing background noise. To reconcile expressive capacity during training with ultra-fast inference response, and given the additional parameter overhead introduced by large-kernel convolutions, reparameterisation technology is adopted. During the training phase, multiple parallel branches are employed to capture rich feature representations, as expressed in Equation (6):

$$F_{\mathrm{out}} = \sum_{k \in \mathcal{K}_l} \mathrm{BN}_k\!\left(\mathrm{DWConv}_{k \times k}(F_{\mathrm{mid}})\right) \tag{6}$$

where $F_{\mathrm{mid}}$ denotes the intermediate feature map after channel expansion within the CSP residual branch; $\mathrm{DWConv}_{k \times k}$ denotes a depthwise separable convolution with kernel size $k \times k$, in which each channel independently performs spatial convolution to reduce the parameter count; and $\mathrm{BN}_k$ denotes batch normalisation applied to the branch with kernel size $k$. The outputs of each branch are aggregated via element-wise addition rather than channel concatenation.
During the inference phase, the linear additivity of convolution is exploited to collapse the multi-branch kernels $W_k$ and biases $b_k$ into a single-branch operator, achieving “zero-cost” performance gain, as expressed in Equation (7). Here, $\hat{W}_k$ and $\hat{b}_k$ denote the equivalent convolutional weights and biases of the $k$-th branch after absorbing the batch normalisation parameters, respectively. Specifically, the BN-layer parameters of each training-phase branch (scaling factor $\gamma_k$, shift $\beta_k$, running mean $\mu_k$, and running variance $\sigma_k^2$) are first fused into the convolutional weights following the standard reparameterisation procedure: $\hat{W}_k = \frac{\gamma_k}{\sqrt{\sigma_k^2 + \epsilon}} W_k$, $\hat{b}_k = \beta_k - \frac{\gamma_k \mu_k}{\sqrt{\sigma_k^2 + \epsilon}}$. The $\mathrm{Pad}(\cdot)$ operation extends smaller kernels to the same spatial dimensions as the largest kernel $k_{\max} \times k_{\max}$ via zero-padding, enabling element-wise summation across kernels of different sizes within a consistent tensor dimensionality. Consequently, during inference, a single $k_{\max} \times k_{\max}$ depthwise separable convolution suffices to equivalently reproduce the multi-branch expressive capacity of the training phase:

$$W_{\mathrm{eq}} = \sum_{k} \mathrm{Pad}\!\left(\hat{W}_k\right), \qquad b_{\mathrm{eq}} = \sum_{k} \hat{b}_k \tag{7}$$
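A minimal single-channel sketch of this branch collapse, assuming BN has already been absorbed into the per-branch weights and biases, and using plain nested lists rather than tensors:

```python
# Sketch of the merging step in Equation (7) for one channel: a 3 x 3 and a
# 1 x 1 branch collapse into one 3 x 3 kernel by zero-padding the smaller
# kernel and summing kernels and biases, so a single convolution reproduces
# the two-branch sum exactly.
def conv2d(x, k, bias=0.0):
    """'Same'-padded cross-correlation of an H x W input with an odd k x k kernel."""
    H, W, r = len(x), len(x[0]), len(k) // 2
    out = [[bias] * W for _ in range(H)]
    for i in range(H):
        for j in range(W):
            for u in range(-r, r + 1):
                for v in range(-r, r + 1):
                    if 0 <= i + u < H and 0 <= j + v < W:
                        out[i][j] += k[u + r][v + r] * x[i + u][j + v]
    return out

def pad_kernel(k, size):
    """Zero-pad an odd k x k kernel to size x size (the Pad(.) operation)."""
    off = (size - len(k)) // 2
    out = [[0.0] * size for _ in range(size)]
    for i, row in enumerate(k):
        for j, v in enumerate(row):
            out[off + i][off + j] = v
    return out

def merge_branches(kernels, biases):
    """Collapse parallel branches into one equivalent kernel and bias."""
    size = max(len(k) for k in kernels)
    merged = [[0.0] * size for _ in range(size)]
    for k in kernels:
        for i, row in enumerate(pad_kernel(k, size)):
            for j, v in enumerate(row):
                merged[i][j] += v
    return merged, sum(biases)
```

Because convolution is linear in its kernel, applying the merged kernel once gives the same output as running each branch and summing, which is the "zero-cost" gain claimed above.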
Along the top-down pathway of BiFPN, to robustly mitigate the edge blurring and grid artefacts introduced by conventional interpolation algorithms, EUCB is adopted to replace the standard upsampling operator. As illustrated in
Figure 9, its design follows an integrated “upsample–depth enhancement–dimensionality compression” logic, functionally equivalent to a learnable inverse filtering operator. Specifically, bilinear interpolation first magnifies the feature map spatially; a 3 × 3 depthwise convolution then performs semantic smoothing on the enlarged features to repair pixel discontinuities caused by resolution stretching; finally, a 1 × 1 convolution compresses the channel dimensionality. This design not only restores the sharpness of fruit boundary edges, but also ensures fidelity of deep semantic information during downward propagation [
30].
Although the introduction of MSCB and EUCB enhances perceptual capability, it also increases memory access costs. To achieve a Pareto improvement between accuracy and inference speed, the Shift-Context (SC) strategy is embedded into the channel-shuffle stages of both CM and EUCB, respectively. Following SC integration, CM is upgraded to CMS (CSP-MSCB-SC) and EUCB is upgraded to EU_SC (EUCB-SC).
The SC strategy draws inspiration from temporal shift operations in video understanding and is adapted to the spatial domain. As illustrated in
Figure 10, for the intermediate feature $F \in \mathbb{R}^{C \times H \times W}$ produced by MSCB or EUCB, the channels are evenly divided into four equal partitions $\{F_1, F_2, F_3, F_4\}$, which are subjected to cyclic shifts of one pixel in the four spatial directions {Up, Down, Left, Right}, with displacement vectors $\{(-1, 0), (1, 0), (0, -1), (0, 1)\}$, as expressed in Equations (8) and (9):

$$\{F_1, F_2, F_3, F_4\} = \mathrm{Split}(F) \tag{8}$$

$$F' = \mathrm{Cat}\!\left(\mathrm{Shift}_{(-1,0)}(F_1),\ \mathrm{Shift}_{(1,0)}(F_2),\ \mathrm{Shift}_{(0,-1)}(F_3),\ \mathrm{Shift}_{(0,1)}(F_4)\right) \tag{9}$$
This operation enables the feature vector at each spatial position to assimilate neighbourhood information at zero-parameter and zero-FLOP cost. Although cyclic spatial shifting introduces minor stitching artefacts at the absolute physical boundaries of the feature map (e.g., the leftmost pixel wrapping to the rightmost), such boundary transition noise is rapidly attenuated through the receptive field coverage and pooling operations of subsequent network layers. Moreover, given the strong prior that target fruits are virtually never located within the outermost single-pixel boundary of an image, this boundary noise is effectively diluted during deep feature propagation, thereby achieving global neighbourhood interaction while circumventing the computational overhead of complex boundary mask operations.
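The shift operation can be sketched directly; the group-to-direction assignment follows the Up, Down, Left, Right order given above, and the function names are illustrative.

```python
# Sketch of the Shift-Context operation: channels are split into four equal
# groups and each group is cyclically rolled one pixel Up, Down, Left, or
# Right -- a pure re-indexing with zero parameters and zero FLOPs.
def cyclic_shift(chan, di, dj):
    """Roll an H x W channel by (di, dj) with wrap-around at the borders."""
    H, W = len(chan), len(chan[0])
    return [[chan[(i - di) % H][(j - dj) % W] for j in range(W)]
            for i in range(H)]

def shift_context(feat):
    """feat: C x H x W nested lists, C divisible by 4; groups shifted U, D, L, R."""
    C = len(feat)
    g = C // 4
    dirs = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # Up, Down, Left, Right
    return [cyclic_shift(ch, *dirs[min(c // g, 3)]) for c, ch in enumerate(feat)]
```

The wrap-around visible at the borders is exactly the stitching artefact discussed above; a subsequent convolution sees shifted neighbours in each channel group, giving cheap cross-position mixing.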
In summary, YOLOv11-CBMES combines four coordinated changes for semi-structured orchard images. BiFPN learns cross-scale fusion weights instead of fixed concatenation. CDFA applies Haar-based splitting at P5 to reduce illumination–edge entanglement before neck processing. CM and EUCB enlarge receptive fields and sharpen upsampled features within a fixed budget. SC adds zero-cost neighbourhood mixing inside CM and EUCB (CMS and EU_SC). Together, these blocks target cross-scale misalignment, aliasing, limited context, and efficiency constraints highlighted in
Section 2.2.
2.3. Model Training
All models were trained and evaluated on a single platform for reproducibility and fair comparison: an Intel® Core™ i9-14900HX CPU (Intel Corporation, Santa Clara, CA, USA), an NVIDIA GeForce RTX 4070 GPU (8 GB VRAM; NVIDIA Corporation, Santa Clara, CA, USA), CUDA 12.6, cuDNN 9.5.1, and PyTorch 2.6.0. Every architecture compared in this study used the same training recipe (optimiser, schedule, input resolution, augmentation, and stopping rule); only the network structure differed.
Input resolution. Images were resized to 640 × 640 pixels. A trial at 1280 × 1280 yielded only a marginal gain on a narrow subset of cases while reducing throughput to about 45 FPS, which is ill-suited to closed-loop control on mobile orchard platforms; 640 × 640 was therefore adopted as the operational trade-off between accuracy and latency.
Optimisation. Training used stochastic gradient descent with momentum 0.937 and weight decay 5 × 10⁻⁴. The initial learning rate was 0.01, adjusted with a cosine annealing schedule over training. The batch size was 8.
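The cosine annealing schedule can be sketched as follows; the final learning-rate floor of 0 is an illustrative assumption, as implementations often use a small non-zero floor instead.

```python
# Sketch of cosine annealing: the learning rate decays from lr0 following
# lr_final + 0.5 * (lr0 - lr_final) * (1 + cos(pi * t / T)) over T epochs.
import math

def cosine_lr(t, T, lr0=0.01, lr_final=0.0):
    """Learning rate at epoch t (0 <= t <= T) under cosine annealing."""
    return lr_final + 0.5 * (lr0 - lr_final) * (1 + math.cos(math.pi * t / T))
```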
Data augmentation. Mosaic augmentation (four-image Mosaic) was enabled for all models in the main experiments so that architectural comparisons are not confounded by different augmentation policies; its effect is reported separately in
Section 3.2.
Classification loss, foreground–background imbalance, and three-class rebalancing. Instance-level labels follow No Occlusion (NO), Soft Occlusion (SO), and Hard Occlusion (HO). Let $p_c$ denote the empirical class proportion of category $c$ among the annotated instances; the values of $p_{NO}$, $p_{SO}$, and $p_{HO}$ are reported in Table 1. The task has C = 3 mutually exclusive occlusion categories. Two imbalances matter: (1) inter-class spread among NO, SO, and HO, and (2) intra-image imbalance between assigned foreground locations and the large number of background locations on the dense prediction map in single-stage training.
For (1), we apply class-frequency reweighting to the supervised classification terms tied to assigned positive samples. Let $c$ be the ground-truth occlusion class of a positive assignment. Inverse-frequency factors, normalised so that their average over the C classes equals unity, are

$$w_c = \frac{1/p_c}{\frac{1}{C}\sum_{c'=1}^{C} 1/p_{c'}} \tag{10}$$

In our setting (C = 3), the numerical weights follow by substituting the empirical proportions from Table 1 into Equation (10). In the implementation narrative below, $w_c$ always denotes these symbolic weights. For each assigned positive, the per-element classification loss on the class dimensions involved in that assignment is multiplied by $w_c$; background locations use a unit factor (no extra class weight) so that class balance is adjusted without rescaling the entire background set. No hand-tuning of $w_c$ beyond these values was used.
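The reweighting of Equation (10) can be sketched in a few lines; the proportions in the test below are placeholders, not the dataset's actual values (see Table 1).

```python
# Sketch of inverse-frequency reweighting (Equation (10)): w_c = (1 / p_c)
# normalised so that the average weight over the C classes equals one.
def class_weights(props):
    """props: list of empirical class proportions p_c (each > 0, summing to 1)."""
    inv = [1.0 / p for p in props]
    mean_inv = sum(inv) / len(inv)
    return [v / mean_inv for v in inv]
```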
For (2), the classification branch applies focal modulation on binary cross-entropy with logits between predicted logits and task-aligned soft targets from label assignment. Let $p$ denote the logits, $\sigma(\cdot)$ the sigmoid, and $t$ the soft target (element-wise, aligned with the logits $p$). We fix the focusing exponent $\gamma = 1.5$ and the positive-versus-background balancing constant $\alpha = 0.25$ in the focal construction. We define the following:

$$\ell_{\mathrm{focal}}(p, t) = \big[\alpha\, t + (1 - \alpha)(1 - t)\big]\, \lvert t - \sigma(p) \rvert^{\gamma}\, \mathrm{BCE}(p, t)$$

The factor $\lvert t - \sigma(p) \rvert^{\gamma}$ down-weights easy locations (where $\sigma(p)$ is already close to $t$) and emphasises hard, misclassified locations; $\alpha$ applies the usual positive-versus-background scaling in focal formulations for dense detectors. Class-specific factors $w_{NO}$, $w_{SO}$, and $w_{HO}$ are fixed from the empirical $p_c$ in Table 1 as in Equation (10). After applying focal modulation, the weighted per-element classification loss for assigned positives is multiplied by $w_c$ on the supervised class channels associated with ground-truth class $c$; background locations use a unit multiplier on the class-balancing factors so that the dense background field is not globally rescaled. The classification term is summed over locations and normalised in the same manner as in the reference YOLO implementation. The resulting scalar classification loss is denoted $\mathcal{L}_{\mathrm{cls}}$ and enters the total training objective together with the bounding-box regression terms in the same weighted sum as in the reference YOLOv11 detector, so that end-to-end optimisation minimises localisation and classification jointly.
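A per-element sketch consistent with the focal construction described above (γ = 1.5, α = 0.25); the exact form in the released code may differ, and the function names here are illustrative.

```python
# Sketch of a focal-modulated BCE-with-logits term for one element: the
# modulator |t - sigmoid(p)| ** gamma suppresses easy locations, and alpha
# blends the positive and background contributions.
import math

def sigmoid(p):
    return 1.0 / (1.0 + math.exp(-p))

def focal_bce(p, t, gamma=1.5, alpha=0.25, eps=1e-12):
    s = sigmoid(p)
    bce = -(t * math.log(s + eps) + (1 - t) * math.log(1 - s + eps))
    modulator = abs(t - s) ** gamma           # down-weights easy locations
    balance = alpha * t + (1 - alpha) * (1 - t)
    return balance * modulator * bce
```

A confidently correct positive prediction contributes almost nothing, while a confidently wrong one dominates, which is the intended behaviour against the large easy-background population.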
Robustness analyses (multi-seed training and five-fold cross-validation with a fixed test set). Unless otherwise stated, all results that populate the main benchmark tables in
Section 3—including the progressive ablation in
Section 3.2 and the cross-architecture comparison in
Section 3.4—were obtained from training runs that use the global random seed 0, consistent with the default configuration in our implementation.
Supplementary analyses quantify two additional sources of variability while keeping the test subset fixed and preserving the 7:1:2 split definition in
Table 1. First, we repeat training for YOLOv11 and YOLOv11-CBMES with three seeds (0, 42, and 2026), changing only the seed that controls optimisation stochasticity (for example, initialisation, shuffling, and stochastic augmentation); the training, validation, and test subsets are unchanged across these repeats. Second, we pool the training and validation subsets, repartition them into five folds, and in each round train on four folds and use the remaining fold for validation (monitoring and selection); after each training run, class-wise mAP is evaluated on the same held-out test subset using the procedure in
Section 2.4. The first analysis characterises variability from training randomness under a fixed partition; the second characterises sensitivity to reassigning images between training and validation without using the test subset for training or selection.
The outcomes of these two analyses are summarised in Section 3.1.