Lightweight and High-Precision Visual Detection of Cherry Cracking Defects Based on Improved YOLO11 with Enhanced Feature Fusion

Sun, Yifei; Miao, Xinying; Zhang, Yi; He, Zhipeng; Tao, Xinyue; Wang, Zhenghan; Hou, Tianwen; Ren, Ping; Wang, Wei

doi:10.3390/agriculture16101110

by Yifei Sun ¹,², Xinying Miao ¹,²,*, Yi Zhang ¹,², Zhipeng He ¹,², Xinyue Tao ¹,², Zhenghan Wang ¹,², Tianwen Hou ¹,², Ping Ren ¹,² and Wei Wang ¹,²

¹ School of Information Engineering, Dalian Ocean University, Dalian 116023, China
² Liaoning Provincial Key Laboratory of Marine Information Technology, Dalian 116023, China
* Author to whom correspondence should be addressed.

Agriculture 2026, 16(10), 1110; https://doi.org/10.3390/agriculture16101110 (registering DOI)

Submission received: 8 April 2026 / Revised: 12 May 2026 / Accepted: 15 May 2026 / Published: 19 May 2026

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

Sweet cherry cracking severely impairs its commercial value and causes huge economic losses, and the accurate real-time detection of fine cracking defects remains a challenging small-target detection task. Traditional manual sorting and conventional machine vision methods suffer from low efficiency and poor robustness, while existing YOLO-based models have limitations in multi-scale feature fusion, local feature discrimination and spatial information retention for cherry cracking detection, and their effectiveness in natural production environments has not been statistically validated. To address these issues, this study proposes YOLO-CY for cherry cracking defect detection. Three key modules were optimized: the C3k2_AdditiveBlock was designed to enhance multi-scale feature extraction, the C2PSA_CGLU module improved the discriminability of local crack features via refined channel attention, and the Efficient Up-Convolution Block replaced traditional upsampling to reduce spatial information loss. Experiments were conducted on a self-constructed dataset of 3662 cherry images acquired on a real sorting line under natural ambient light. The results showed that YOLO-CY achieved an mAP50 of 94.88% and an mAP50-95 of 64.92%, with precision and recall reaching 93.90% and 90.81%, respectively, significantly outperforming mainstream lightweight YOLO models and two-stage detectors. Ablation experiments verified the synergistic effect of the three improved modules, and the model only had a marginal increase in parameters (2.62 M) and GFLOPs (6.60), maintaining lightweight characteristics. YOLO-CY can accurately detect fine, low-contrast and pedicel-overlapping cracks and is suitable for real-time detection on automated cherry-sorting lines, providing a technical solution for intelligent cherry quality inspection.

Keywords: cherry cracking detection; YOLO11; small-object detection; deep learning

1. Introduction

As one of the earliest ripening deciduous fruit species in the Northern Hemisphere, sweet cherries are highly preferred by consumers owing to their appealing appearance, sweet flavor and rich nutritional profile [1]. The global cherry market has experienced sustained expansion, attaining a valuation of 62.5 billion US dollars in 2023 and projected to grow at a compound annual rate of 7.3% from 2024 to 2030 [2]. China and Turkey rank among the largest cherry consumption markets worldwide, with China alone importing 377,000 metric tons of cherries in the 2023–2024 production season [3]. However, the thin, delicate peel and high water content of cherries render them highly susceptible to mechanical injury, fruit cracking and microbial spoilage during natural growth, harvest, transportation and postharvest handling. Among these defects, fruit cracking is one of the most dominant factors that compromise commercial value and shelf life [4]. Cracked fruit not only deteriorates in quality but also accelerates the decay of neighboring healthy cherries during storage and transport, leading to considerable economic losses. Importantly, the accurate detection of incipient cracks is inherently challenging because these defects often manifest as fine, low-contrast linear fissures that can be easily missed by human sorters or conventional imaging setups. Therefore, the rapid and reliable sorting of intact and cracked cherries is of vital practical importance for raising the automation level and economic returns of the cherry industry.

From the perspective of agricultural production, the precise detection of cherry cracking defects carries multi-dimensional practical value. During postharvest sorting, timely identification and removal of cracked fruit can effectively prevent pathogenic microorganisms from invading the pulp through fissures and causing decay, thus markedly reducing the loss rate of the entire fruit batch throughout cold-chain logistics and retail [5]. Statistics indicate that postharvest losses of cherries can reach 30–40% in developing countries, with cracking-induced quality deterioration being a leading contributor. In quality grading, the extent of cracking serves as a key indicator for fruit classification in the international cherry trade, so accurate cracking detection enables differential pricing and strengthens the market premium potential of high-quality fruit [6]. Moreover, the systematic collection and analysis of cracking data provide essential feedback for upstream cultivation, helping agronomists assess the cracking susceptibility of different cultivars, optimize irrigation and fertilization regimes, and adjust harvest timing, thereby reducing the incidence of cracking at its source [7]. In the context of the rapid advancement of smart agriculture and agricultural robotics, integrating high-precision cracking detection algorithms into automated sorting lines and field picking robots can substantially boost operational efficiency and consistency while mitigating the global challenge of agricultural labor shortages [8]. Nevertheless, fully realizing these multi-dimensional benefits critically hinges on the availability of detection methods that can operate accurately and robustly under the natural, unconstrained conditions of commercial packing lines.

Currently, cherry-sorting operations remain heavily reliant on manual labor, which suffers from significant drawbacks, including low throughput, high labor costs and inconsistent grading standards [1]. While conventional machine vision approaches have achieved automated detection of surface defects to some degree, their reliance on handcrafted feature extractors leads to limited accuracy and robustness when confronted with the diverse morphologies, minute sizes and textural similarities to intact skin that characterize cherry cracks, rendering them inadequate for production-line requirements [5]. With the rapid evolution of deep learning, Convolutional Neural Network (CNN)-based object detectors have demonstrated distinct advantages in fruit defect detection [9]. Two-stage detectors, typified by Faster R-CNN, deliver relatively high accuracy, yet their inference speed often falls short of the real-time demands of industrial sorting [10]. By contrast, single-stage detectors of the YOLO family formulate object localization and classification as an end-to-end regression task, achieving real-time inference while maintaining competitive accuracy, and have consequently emerged as the mainstream technical paradigm for fruit defect detection [11,12]. Despite these advances, generic YOLO-based models are not specifically tailored to the unique visual characteristics of cherry cracking, such as irregular linear patterns, extremely low contrast against the fruit skin, and frequent overlap with pedicels. Their detection sensitivity and robustness for such fine-grained defects under natural online imaging conditions remain insufficiently validated and present a clear performance gap. Motivated by this gap, the present study aims to develop a dedicated, lightweight detection framework capable of effectively capturing subtle crack features while satisfying the real-time constraints of automated cherry-sorting lines.

Despite the remarkable progress achieved by existing YOLO-based approaches for fruit defect detection, several critical challenges still remain in the dedicated task of cherry cracking defect identification. Cherry cracking defects typically present as fine linear or irregular fractures on the fruit surface, and their small physical size categorizes this task as a canonical small-object detection problem. The C3k2 module integrated in the standard YOLO11 framework demonstrates insufficient capacity for multi-scale feature fusion, which renders it unable to capture such fine-grained defect characteristics [13]. Although the native C2PSA attention module in YOLO11 introduces a spatial attention mechanism, it suffers from limited channel-wise information interaction and lacks the capability for refined modeling of local detailed features, which frequently gives rise to misclassification between cracking defects and normal fruit surface textures [14]. Furthermore, the conventional upsampling strategy employed in the neck network of YOLO11 tends to incur spatial information loss during feature map reconstruction, which impairs the localization precision of small-sized cracking defects [15].

The main contributions of this work are summarized as follows:

(1) We propose the C3k2_AdditiveBlock module to replace the original C3k2 module. This module adopts a dual-branch parallel structure with an additive fusion strategy for spatial feature extraction, which improves the detection sensitivity to tiny cracks on the cherry surface.
(2) We reconstruct the C2PSA module as C2PSA_CGLU, which achieves more refined feature selection and information filtering along the channel dimension, enhancing the model's discriminative capability for local crack defect features.
(3) We replace the traditional upsampling operation in the neck network of YOLO11 with the Efficient Up-Convolution Block (EUCB). While maintaining computational efficiency, this replacement reduces spatial information loss and improves the quality of multi-scale feature fusion.

2. Related Works

Surface defect detection in fruits constitutes one of the most important computer vision applications in agricultural intelligence, whose core objective is to realize automated evaluation of fruit appearance quality through image processing and pattern recognition. Early approaches predominantly relied on traditional machine vision techniques, which extracted low-level visual features of defective regions using color space transformations, threshold segmentation and morphological operations, followed by classification with support vector machines or artificial neural networks [16,17]. Although these methods perform satisfactorily under laboratory conditions with controlled illumination and simple defect types, their handcrafted feature representations lack the capacity to capture the intricate and irregular patterns of cherry cracking defects. Consequently, their detection performance deteriorates markedly in real production-line environments, revealing an inherent inability to generalize across diverse crack morphologies and natural imaging variations.

With the development and popularization of deep learning, object detectors based on convolutional neural networks (CNNs) have gradually become the mainstream approach for fruit defect detection. As a typical two-stage detector, Faster R-CNN first generates candidate regions via a region proposal network (RPN), followed by refined classification and regression operations, which ensures high localization precision [10]. Wei et al. [18] adopted an improved Faster R-CNN model for the detection of cherry surface defects, and their results showed enhanced sorting efficiency and detection accuracy. Zhang et al. [19] proposed a Faster R-CNN-based algorithm to identify bruise regions on apple surfaces, realizing the rapid detection of small defective areas under complex background conditions. However, the two-stage architecture incurs significant computational overhead, causing inference speeds that fall short of the real-time demands of high-throughput sorting lines. Moreover, for fine crack defects that are often only a few pixels wide, the region proposal stage may generate numerous redundant anchors, further stressing the computational budget without commensurate gains in detection precision.

Single-stage detectors have been increasingly applied in fruit defect detection on account of their end-to-end framework and faster inference efficiency. Liu et al. [20] incorporated focal loss and the CBAM attention mechanism into YOLOX, markedly improving the mean accuracy of cherry defect detection and grading to 97.59%. Feng et al. [21] developed MSDD-YOLOX, which integrates residual connections and attention modules within the neck structure, realizing real-time multi-class detection of citrus surface defects. Wu et al. [22] presented flaw-YOLOv5s for the detection of potato surface defects, addressing the inefficiency of manual inspection and the heavy computational burden of earlier deep learning models. Yao et al. [23] embedded a small-object detection layer and SE attention into YOLOv5, along with CIoU loss, to enhance the performance of kiwifruit defect detection. For cherry-specific tasks, Han et al. [24] combined YOLOv5 with a flood-filling algorithm for quality evaluation, while Song et al. [25] employed Swin Transformer and MLP to achieve high classification accuracy for cherry appearance. More recently, Li et al. [26] designed the lightweight CMD-YOLO for cherry maturity detection, and Cherry-YOLO [3] optimized YOLOv8 for simultaneous ripeness and defect detection. Despite these notable advances, most existing studies address broad fruit defect categories or maturity estimation, and relatively few have specifically targeted the unique challenges of cherry cracking. The extremely fine, low-contrast and pedicel-overlapping nature of cracks requires dedicated multi-scale feature learning and fine-grained discrimination that generic models often lack, leaving a clear performance gap for high-precision cracking detection under natural online conditions.

Attention mechanisms serve as a powerful feature enhancement strategy in deep learning [15,27], adaptively learning importance weights across spatial positions or channel dimensions [28]. By steering the network toward the most informative regions, they significantly improve feature representation and discriminability [29]. In object detection, attention mechanisms have been extensively integrated into backbones, feature pyramids and detection heads. Channel attention models the interdependencies among feature channels. The Squeeze-and-Excitation Network (SENet) [30] compresses spatial information via global average pooling and learns nonlinear channel dependencies through fully connected layers, enabling adaptive channel-wise recalibration. However, the global pooling operation discards local spatial details that are crucial for distinguishing fine cracks from intact peel textures. The Convolutional Block Attention Module (CBAM) [31] augments SENet-style channel attention with a spatial attention branch, yet still relies on global channel compression, which may inadequately model the subtle local variations of crack edges. The Efficient Channel Attention Network (ECA-Net) [32] replaces the fully connected layers with a one-dimensional convolution, reducing complexity while preserving channel interaction, but it similarly lacks mechanisms to capture local neighborhood context explicitly.

Spatial attention mechanisms, on the other hand, emphasize the varying importance of different spatial locations. Non-local networks [33] capture long-range dependencies by computing pairwise relationships among all spatial positions, yet their quadratic complexity restricts deployment on high-resolution feature maps. The Vision Transformer (ViT) [34] and Swin Transformer [35] achieve powerful spatial modeling through global and window-based self-attention, respectively, but their computational demands pose challenges for real-time applications. Against this backdrop, TransNeXt [36] introduces Aggregated Attention as the token mixer and employs Convolutional GLU as the channel mixer. By integrating convolution with gated linear units, Convolutional GLU achieves efficient feature selection and channel-wise information filtering, bridging the gap between GLU and SE mechanisms and endowing each token with channel-specific attention weights. Nevertheless, directly applying these advanced attention designs to cherry cracking detection without tailoring them to the anisotropic, fine linear structures of cracks may still fall short in delivering the necessary local feature discrimination and spatial detail preservation, particularly under the real-time and lightweight constraints imposed by automated sorting systems.

In summary, existing studies have made fruitful progress in fruit defect detection, YOLO evolution, attention mechanisms and feature fusion, establishing a solid foundation for our research. However, for the specific task of cherry cracking defect detection, there still exists a distinct gap in achieving high-precision and real-time identification of small, low-contrast cracks using the YOLO11 framework. This gap can be attributed to three interconnected limitations: insufficient multi-scale feature aggregation in the backbone, inadequate local detail discrimination in the channel attention, and spatial information loss during upsampling in the neck. Addressing these limitations through a coordinated optimization of multi-scale feature extraction, channel-wise information interaction and upsampling quality constitutes the core motivation of the present work.

3. Proposed Methods

As a state-of-the-art object detection framework with superior overall performance, YOLO11 inherits the consistent advantages of the YOLO series and achieves a more favorable trade-off between inference speed and detection accuracy. We specifically adopt its lightweight nano variant, YOLO11n, which further reduces computational demands while benefiting from pre-trained weights on large-scale image datasets for efficient fine-tuning. Although YOLO11n integrates channel and spatial attention through modules such as C2PSA, these built-in attention mechanisms are primarily designed for general-purpose detection. Compared with classic channel attention schemes like SE and CBAM that rely on global pooling and fully connected layers or the gated linear unit (GLU)-based feature selection that achieves token-specific channel filtering, the native attention in YOLO11n exhibits limited capacity for dynamically emphasizing fine, locally irregular patterns such as cherry cracks while suppressing irrelevant textures. Motivated by this gap, we select YOLO11n as the baseline model and implement three groups of systematic improvements to enhance multi-scale crack feature learning, channel-wise information discrimination and spatial detail preservation. As illustrated in Figure 1, the revised network (YOLO-CY) can more effectively exploit visual information from input images to meet the rigorous requirements of cherry cracking detection.

3.1. C3k2_AdditiveBlock

As a basic building block of the YOLO11 network, the C3k2 module follows the Cross Stage Partial (CSP) design paradigm and adopts a dual-branch structure. The main branch passes the input features through without further transformation, while the processing branch refines features via convolutions with configurable kernel sizes; the two branches are then fused by concatenation followed by a 1 × 1 convolutional layer. This layout gives the network strong feature expression while maintaining computational efficiency, and allows the model to extract local detail information and global contextual cues in parallel. Despite these merits, the vanilla C3k2 module has three limitations in complex visual detection scenarios. First, it lacks an efficient attention mechanism to strengthen feature representation, resulting in suboptimal detection performance in scenarios involving target occlusion, small objects, and dense targets. Second, the feature interaction inside the module is relatively monotonous, as it relies chiefly on standard convolution for local feature extraction and lacks the capability for long-range global feature modeling. Third, the vanilla C3k2 module is susceptible to the vanishing gradient problem during training, which hinders stable convergence and compromises the final detection accuracy.

To overcome the aforementioned drawbacks, we propose an optimized modification to the C3k2 module by introducing the innovative AdditiveBlock, which enhances feature representation via additive attention mechanisms and a reinforced local perceptual field. As shown in Figure 2, the optimized C3k2_AdditiveBlock not only retains the original computational efficiency but also significantly enhances the model’s detection capability in complex application scenarios. Specifically, the integration of the additive attention mechanism strengthens high-quality feature extraction and representation. Distinct from conventional attention mechanisms that compute attention weights via multiplicative operations, the AdditiveBlock adopts additive operations for weight calculation, a tailored design that facilitates more efficient gradient backpropagation and effectively alleviates the vanishing gradient issue. The mathematical formulation of the AdditiveBlock is given as follows:

$$\mathrm{Attention}(Q, K, V) = \sigma\left(W_{f}([Q, K])\right) \odot V \tag{1}$$

where $Q$, $K$, and $V$ stand for the query, key, and value matrices, respectively; $\sigma$ signifies the activation function; $W_{f}$ refers to a trainable weight matrix; $[\cdot, \cdot]$ denotes the concatenation operation; and $\odot$ indicates element-wise multiplication.
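To make Equation (1) concrete, the following is a minimal sketch for a single token using plain Python lists. It assumes that σ is the logistic sigmoid and that W_f maps the concatenated 2d-dimensional vector back to d dimensions; neither choice is stated in the text, and the function name and toy values are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function, used here as the activation sigma (an assumption)."""
    return 1.0 / (1.0 + math.exp(-x))


def additive_block_attention(q, k, v, w_f):
    """Compute sigma(W_f [q, k]) * v for one token.

    q, k, v: lists of length d; w_f: d rows, each of length 2 * d.
    """
    qk = q + k  # concatenation [Q, K] along the channel axis
    gate = [sigmoid(sum(w * x for w, x in zip(row, qk))) for row in w_f]
    return [g * x for g, x in zip(gate, v)]  # element-wise product with V


# Hypothetical toy values with d = 2.
q, k, v = [1.0, -1.0], [0.5, 2.0], [4.0, 6.0]
w_zero = [[0.0] * 4 for _ in range(2)]  # zero weights give a gate of 0.5
out = additive_block_attention(q, k, v, w_zero)
print(out)  # [2.0, 3.0]
```

With all-zero weights every gate equals 0.5, so the output is half of V; trained weights would instead open or close each channel depending on the concatenated query and key.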

3.2. EUCB

Cherry cracking is typically characterized by subtle linear fissures or geometric depressions on the pericarp surface. The upsampling layers integrated into the Feature Pyramid Network and Path Aggregation Network of the standard YOLO11 architecture are designed to restore spatial resolution degraded by successive downsampling operations during feature extraction. Conventional interpolation-based upsampling functions as a fixed geometric prior-driven mapping mechanism, which is devoid of learnable parameters required for the dynamic reconstruction of high-frequency spatial details. Such a limitation inevitably leads to the blurring or over-smoothing of textural details at crack edges during the top-down multi-scale feature fusion process, thereby impeding the model from accurately distinguishing genuine crack defects from natural surface textures and pedicel shadows on cherry fruits. To mitigate this technical bottleneck, this study introduces the Efficient Up-Convolution Block (EUCB) as a refined alternative to conventional upsampling mechanisms, as illustrated in Figure 3. The EUCB enables efficient reconstruction of low-resolution feature maps into high-resolution counterparts while synchronously enhancing feature representation, and it precisely aligns the spatial dimensions and channel counts of the upsampled features with those of the skip connection feature maps derived from the encoder branch, laying a solid foundation for subsequent high-quality multi-scale feature fusion.

Assume that the input feature map of the module is $X \in \mathbb{R}^{H \times W \times C_{in}}$, where $H$ and $W$ denote the height and width of the input feature map, respectively, and $C_{in}$ denotes the number of input channels. First, to upsample the low-resolution feature map to a size compatible with the skip connection in the next layer, the module performs an upsampling operation. We adopt bilinear interpolation with a scaling factor of 2, whose mathematical expression is:

$$X_{up} = \mathrm{Up}(X, \mathrm{scale} = 2) \tag{2}$$

Following this process, the height and width of the feature map are each doubled, whereas the channel count stays unaltered:

$$X_{up} \in \mathbb{R}^{2H \times 2W \times C_{in}} \tag{3}$$

After upsampling, to enhance and smooth the local details of the upsampled features, the module applies a 3 × 3 depth-wise convolution. As a component of depthwise separable convolution, depth-wise convolution differs from standard convolution in that it convolves each input channel independently, without inter-channel fusion. For the output $X_{up}$ from the previous step, the depth-wise convolution operation can be expressed as:

$$X_{dw} = \mathrm{DW}_{3 \times 3}(X_{up}) \tag{4}$$

where $\mathrm{DW}_{3 \times 3}$ stands for the depth-wise convolution with a kernel size of 3 × 3. Afterwards, batch normalization (BN) and a ReLU activation function are applied to the output of the depth-wise convolution to obtain the normalized feature map $X_{norm}$:

$$X_{norm} = \mathrm{ReLU}(\mathrm{BN}(X_{dw})) \tag{5}$$

Finally, to match the channel number to the input channel number of the subsequent decoder layer, the module employs a 1 × 1 point-wise convolution:

$$Y = C_{1 \times 1}(X_{norm}) \tag{6}$$

where $C_{1 \times 1}$ stands for a standard convolution with a kernel size of 1 × 1. This yields the final output of the EUCB module, with $C_{out}$ output channels:

$$Y \in \mathbb{R}^{2H \times 2W \times C_{out}} \tag{7}$$

Integrating the above steps, the overall forward propagation formula of the EUCB module can be expressed as:

$$\mathrm{EUCB}(X) = C_{1 \times 1}(\mathrm{ReLU}(\mathrm{BN}(\mathrm{DW}_{3 \times 3}(\mathrm{Up}(X))))) \tag{8}$$

where ReLU stands for the activation function, and BN denotes batch normalization.
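The shape changes in Equations (2) to (8) and the cost of each stage can be traced with a short bookkeeping sketch. It is an illustration rather than the authors' implementation. It assumes bias-free convolutions, stride 1 with padding 1 for the depth-wise convolution, and one scale plus one shift parameter per channel for BN. The function name and the feature-map sizes are hypothetical.

```python
def eucb_summary(h, w, c_in, c_out):
    """Trace feature-map shapes and learnable parameters through the EUCB.

    Assumptions (not stated in the text): convolutions carry no bias,
    the 3x3 depth-wise convolution uses stride 1 with padding 1, and
    BN has one scale and one shift per channel.
    """
    stages = []
    # Up: bilinear interpolation with scale 2, no learnable parameters.
    h, w = 2 * h, 2 * w
    stages.append(("Up", (h, w, c_in), 0))
    # DW 3x3: one 3x3 kernel per channel, channels are not mixed.
    stages.append(("DW3x3", (h, w, c_in), 9 * c_in))
    # BN + ReLU: per-channel scale and shift; ReLU has no parameters.
    stages.append(("BN+ReLU", (h, w, c_in), 2 * c_in))
    # 1x1 point-wise convolution: mixes channels, C_in -> C_out.
    stages.append(("Conv1x1", (h, w, c_out), c_in * c_out))
    return stages


# Hypothetical sizes: a 20 x 20 x 256 map upsampled to 40 x 40 x 128.
summary = eucb_summary(20, 20, 256, 128)
for name, shape, params in summary:
    print(f"{name:8s} shape={shape} params={params}")
total_params = sum(p for _, _, p in summary)
# A standard 3x3 convolution C_in -> C_out would need 9 * C_in * C_out weights.
standard_3x3 = 9 * 256 * 128
print(total_params, standard_3x3)  # 35584 294912
```

Under these assumptions the depth-wise plus point-wise pair needs roughly one eighth of the weights of a single standard 3 × 3 convolution with the same channel counts, which is consistent with the lightweight design goal.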

3.3. C2PSA_CGLU

The C2PSA module in the vanilla YOLO11 architecture employs a dual-branch design that combines channel attention with spatial self-attention to strengthen feature representation through parallel attention-driven recalibration. Its core logic is to capture high-level semantics via channel-wise self-attention while locating target-related spatial cues through spatial self-attention. However, cherry cracking defects manifest as subtle textural fractures whose visual saliency depends not merely on spatial localization but, more critically, on the continuity of geometric morphology. The native spatial attention in C2PSA performs feature reweighting based solely on the response intensity within global or local receptive fields, lacking explicit modeling of directional geometric features. As an anisotropic linear structural defect, fine cherry cracks can easily have their spatial saliency diluted by the isotropic responses of surrounding background regions, thereby impairing precise defect identification. To overcome this limitation, we propose the optimized C2PSA_CGLU module, as illustrated in Figure 4.

More specifically, we define the input feature map of the module as $X \in \mathbb{R}^{H \times W \times C}$, where $H$ and $W$ represent the spatial height and spatial width of the feature map, and $C$ stands for the channel count of the input feature. To align the parameter scale with the traditional ConvFFN, we define the expansion ratio as $R$, and the hidden layer dimension of ConvGLU is

$$d = \frac{2}{3} R C \tag{9}$$
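The factor 2/3 follows from matching the weight counts of the two designs. The short derivation below assumes that the ConvFFN consists of an expanding projection from C to RC channels and a reducing projection from RC back to C; biases and the depth-wise kernels are lower-order terms and are ignored.

```latex
\begin{aligned}
P_{\mathrm{ConvFFN}} &= \underbrace{C \cdot RC}_{\text{expand}} + \underbrace{RC \cdot C}_{\text{reduce}} = 2RC^{2},\\
P_{\mathrm{ConvGLU}} &= \underbrace{C \cdot 2d}_{W_{1}} + \underbrace{d \cdot C}_{W_{2}} = 3dC,\\
3dC = 2RC^{2} \;&\Longrightarrow\; d = \tfrac{2}{3}RC.
\end{aligned}
```

As a purely illustrative check with hypothetical values C = 96 and R = 4, the hidden dimension is d = 256, and both designs then use 73,728 projection weights (2 · 4 · 96² = 3 · 256 · 96).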

First, a linear projection is applied to the input feature to expand the channel dimension from $C$ to $2d$, preparing for the subsequent two-branch split:

$$X_{proj} = X W_{1} + b_{1} \tag{10}$$

where $W_{1} \in \mathbb{R}^{C \times 2d}$ serves as the projection weight matrix, $b_{1} \in \mathbb{R}^{2d}$ acts as the projection bias, and the projected feature map $X_{proj} \in \mathbb{R}^{H \times W \times 2d}$. The expanded feature is divided into two separate branches: the value branch $X_{v}$ and the gating branch $X_{g}$, each having $d$ channels. The corresponding calculation expression is:

$$X_{v}, X_{g} = \mathrm{Split}(X_{proj}) \tag{11}$$

After splitting, the dimensions of both branches are $\mathbb{R}^{H \times W \times d}$. The value branch $X_{v}$ carries the original feature information without additional spatial transformation, while the gating branch $X_{g}$ is used to generate adaptive gating weights by introducing local convolutions to capture neighborhood information. A 3 × 3 depthwise convolution is applied to the gating branch, with the calculation formula:

$$X_{g_{conv}} = \mathrm{DepthwiseConv}_{3 \times 3}(X_{g}) \tag{12}$$

Depthwise convolution performs spatial convolution independently for each channel, introducing local information of the 3 × 3 neighborhood to the gating branch with minimal computational overhead. As a result, the gating signal no longer relies solely on the current token but fuses features from adjacent positions, which also supplies implicit positional information. The GELU activation function is then applied to the convolved gating branch to produce the gating weights:

$$G = \mathrm{GELU}(X_{g_{conv}}) \tag{13}$$

GELU is adopted as the gating function because, compared with the traditional sigmoid, it better accommodates the feature distributions typical of visual tasks while providing a smooth, soft weighting of the value branch. Note that GELU is unbounded above and takes small negative values for negative inputs (its minimum is about −0.17), so $G$ acts as a soft gate rather than a weight strictly confined to [0, 1]. Unlike the standard ConvGLU that applies GELU in an isolated channel mixing block, our C2PSA_CGLU integrates it within a dual-branch attention architecture, allowing the gating mechanism to leverage both channel-wise and spatial context for more discriminative crack feature selection. The generated gating weights are multiplied element-wise with the value branch to achieve fine-grained channel attention:

$$Y = X_{v} \odot G \tag{14}$$

where $\odot$ denotes element-wise multiplication. In this step, each spatial token has an independent gating weight generated from the 3 × 3 neighborhood features of that position, which preserves local fine-grained differences while fusing neighborhood information, thereby addressing the coarse granularity of global gating in the SE mechanism. Finally, the weighted feature map is projected back to the original channel dimension $C$ to obtain the output of the module:

$$Y_{out} = Y W_{2} + b_{2} \tag{15}$$

where $W_{2} \in \mathbb{R}^{d \times C}$ is the output projection weight, $b_{2} \in \mathbb{R}^{C}$ is the output projection bias, and the final output $Y_{out} \in \mathbb{R}^{H \times W \times C}$ has the same dimensions as the input. The overall pipeline of the proposed modules is summarized in Algorithm 1.

Algorithm 1 YOLO-CY Core Modules (C3k2_AdditiveBlock + EUCB + C2PSA_CGLU)

Require: Input feature map X

Ensure: Output feature map Y

(1) C3k2_AdditiveBlock
$X_{short} \leftarrow X$
$X_{conv} \leftarrow Conv(X)$
$Q, K, V \leftarrow Linear(X_{conv})$
$A \leftarrow σ(W_{f}([Q, K]))$
$X_{att} \leftarrow A ⊙ V$
$X_{c3} \leftarrow Conv([X_{short}, X_{att}])$
(2) EUCB (Efficient Up-Convolution Block)
$X_{up} \leftarrow Upsample(X_{c3}, scale = 2)$
$X_{dw} \leftarrow {DWConv}_{3 \times 3}(X_{up})$
$X_{norm} \leftarrow ReLU(BN(X_{dw}))$
$X_{eucb} \leftarrow {Conv}_{1 \times 1}(X_{norm})$
(3) C2PSA_CGLU
$X_{proj} \leftarrow X_{eucb} W_{1} + b_{1}$
$X_{v}, X_{g} \leftarrow Split(X_{proj})$
$X_{g}^{'} \leftarrow {DWConv}_{3 \times 3}(X_{g})$
$G \leftarrow GELU(X_{g}^{'})$
$X_{mul} \leftarrow X_{v} ⊙ G$
$Y \leftarrow X_{mul} W_{2} + b_{2}$
return $Y$
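Step (3) of Algorithm 1 can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: feature maps are nested lists of shape H × W × C, the helper names (`gelu`, `linear`, `depthwise_conv3x3`, `conv_glu`) and the toy weights are hypothetical, the depthwise convolution uses zero padding, and the surrounding attention branch of C2PSA_CGLU is omitted.

```python
import math


def gelu(x):
    """Exact GELU, x * Phi(x), where Phi is the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(tokens, weight, bias):
    """Apply y = x W + b to each token; weight has shape in_dim x out_dim."""
    return [
        [sum(x[i] * weight[i][j] for i in range(len(x))) + bias[j]
         for j in range(len(bias))]
        for x in tokens
    ]


def depthwise_conv3x3(fmap, kernels):
    """Zero-padded 3x3 depthwise convolution over an H x W x d nested list.

    kernels[c] is the 3x3 kernel applied to channel c only.
    """
    h, w, d = len(fmap), len(fmap[0]), len(fmap[0][0])
    out = [[[0.0] * d for _ in range(w)] for _ in range(h)]
    for y in range(h):
        for x in range(w):
            for c in range(d):
                acc = 0.0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        yy, xx = y + dy, x + dx
                        if 0 <= yy < h and 0 <= xx < w:
                            acc += fmap[yy][xx][c] * kernels[c][dy + 1][dx + 1]
                out[y][x][c] = acc
    return out


def conv_glu(fmap, w1, b1, kernels, w2, b2):
    """Gated unit of step (3): project, split, gate, multiply, project back."""
    h, w = len(fmap), len(fmap[0])
    d = len(b1) // 2
    tokens = [tok for row in fmap for tok in row]
    proj = linear(tokens, w1, b1)                    # X_proj, (H*W) x 2d
    value = [p[:d] for p in proj]                    # X_v
    gate = [p[d:] for p in proj]                     # X_g
    gate_map = [gate[r * w:(r + 1) * w] for r in range(h)]
    gate_map = depthwise_conv3x3(gate_map, kernels)  # X_g'
    gated = []
    for idx, v in enumerate(value):
        g = gate_map[idx // w][idx % w]
        gated.append([v[c] * gelu(g[c]) for c in range(d)])  # X_v * GELU(X_g')
    out = linear(gated, w2, b2)                      # Y
    return [out[r * w:(r + 1) * w] for r in range(h)]


# Toy parameters (hypothetical): C = 1, d = 1, identity depthwise kernel.
identity_kernel = [[[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]]
toy_input = [[[1.0], [0.0]], [[-1.0], [2.0]]]
toy_output = conv_glu(toy_input, [[1.0, 1.0]], [0.0, 0.0],
                      identity_kernel, [[1.0]], [0.0])
```

With the identity kernel each output reduces to x · GELU(x), which makes the per-token gating easy to inspect. The token with value −1 also shows that the GELU gate can be slightly negative.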

4. Experiment and Result Analysis

4.1. Dataset and Environment

This study adopts a self-collected cherry image dataset captured on a real sorting production line. All images are acquired using an RGB camera at a shooting distance of 20–25 cm from the fruit surface. Natural ambient light serves as the illumination source, and the illuminance is kept within 1000–1200 lux to ensure consistent imaging quality. The original image resolution is 1920 × 1080 pixels, and all images are uniformly resized to 640 × 640 pixels during preprocessing for subsequent model training and inference. The dataset consists of 3662 cherry images, including 1878 intact fruit samples and 1784 cracked fruit samples. The two categories are therefore close to a 1:1 ratio, so no additional balancing strategies such as weighted loss or oversampling are needed. Each image contains one or multiple cherry fruits, and the crack defects cover diverse typical patterns, including fine linear cracks, low-contrast cracks, stem-overlapped cracks and densely distributed multiple cracks, which gives the dataset high morphological diversity and makes it representative of practical industrial scenarios. Data partitioning is implemented at the fruit level: images of the same fruit and highly similar consecutive frames are never allocated to both the training and validation sets.
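A fruit-level partition of this kind can be sketched as a group-aware split. The snippet below is only an illustration: the mapping from image file to fruit identifier, the file names, the 20% validation fraction and the function name `split_by_fruit` are all assumptions, since the split ratio and the bookkeeping used by the authors are not stated here.

```python
import random
from collections import defaultdict


def split_by_fruit(image_to_fruit, val_fraction=0.2, seed=0):
    """Split images so that all images of one fruit land in the same subset.

    image_to_fruit maps an image name to a (hypothetical) fruit identifier.
    """
    groups = defaultdict(list)
    for image, fruit_id in sorted(image_to_fruit.items()):
        groups[fruit_id].append(image)

    # Shuffle fruit identifiers, not images, so groups are never broken up.
    fruit_ids = sorted(groups)
    random.Random(seed).shuffle(fruit_ids)

    n_val = max(1, round(len(fruit_ids) * val_fraction))
    val_ids, train_ids = fruit_ids[:n_val], fruit_ids[n_val:]
    train = [img for fid in train_ids for img in groups[fid]]
    val = [img for fid in val_ids for img in groups[fid]]
    return train, val


# Toy index: 10 fruits with 3 near-duplicate frames each (made-up names).
toy_index = {
    f"fruit{f:02d}_frame{k}.jpg": f"fruit{f:02d}"
    for f in range(10)
    for k in range(3)
}
train_images, val_images = split_by_fruit(toy_index)
print(len(train_images), len(val_images))
```

Shuffling at the group level is what prevents consecutive frames of one fruit from leaking between the training and validation sets.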

All model training is conducted on a high-performance computing platform equipped with eight NVIDIA A800 graphics cards, each with 80 GB of video memory, and a total of 512 GB of system memory. The detailed experimental environment and partial hyperparameter configurations are documented in Table 1. To guarantee the fairness and comparability of the results, every comparative experiment is run exclusively on the aforementioned custom cherry dataset, and model performance is evaluated using metrics widely adopted in object detection: mAP50, mAP50-95, precision (P), recall (R), number of parameters (Params), and GFLOPs.

4.2. Comparative and Ablation Experiment

As shown in Table 2, the proposed YOLO-CY model achieves the best results among the compared models on the accuracy metrics, with mAP50 reaching 94.88% and mAP50-95 reaching 64.92%. The precision and recall are 93.90% and 90.81%, respectively. Compared with the baseline YOLO11n, YOLO-CY increases mAP50 and mAP50-95 by 1.63 and 1.83 percentage points, while raising recall by 2.00 percentage points and precision by 0.40 percentage points. These improvements are not obtained through simple model expansion: YOLO-CY keeps nearly the same parameter scale and computational complexity as YOLO11n, with increases of only 0.03 M parameters and 0.16 GFLOPs, which supports the efficiency of the presented improvement strategies. The performance gains of YOLO-CY over YOLO11n stem from targeted remedies for three structural weaknesses of the original network. First, the vanilla C3k2 module in YOLO11n exhibits insufficient multi-scale feature fusion capability for fine-grained targets such as cherry cracking and fails to adequately capture feature representations of tiny cracks under diverse receptive fields. The designed C3k2_AdditiveBlock integrates an additive attention mechanism and a dual-branch parallel structure, enhancing the model’s sensitivity to subtle crack features. Second, the original C2PSA module focuses primarily on spatial self-attention modeling yet lacks sufficient channel-wise information interaction and fine feature screening, making it difficult to distinguish local crack textures from normal fruit surface patterns. The proposed C2PSA_CGLU employs a convolutional gated linear unit to realize refined channel attention filtering and strengthen the discrimination of local crack morphological features. 
Third, the conventional upsampling adopted in the neck network of YOLO11n suffers from non-learnable reconstruction of high-frequency details, resulting in the loss of spatial localization information for small cracks during multi-scale feature fusion. In contrast, the EUCB module replaces fixed interpolation upsampling with a learnable combination of depthwise and pointwise convolutions, thereby preserving fine-edge spatial details of cracks and boosting localization accuracy. The collaborative effect of the above designs enables YOLO-CY to comprehensively outperform YOLO11n while retaining its lightweight advantage.
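To give a sense of why a depthwise plus pointwise pair is cheap compared with a standard convolution, the weight counts of the two designs can be compared directly. The sketch below is a generic back-of-the-envelope calculation: the channel width of 64 is a hypothetical value rather than a figure from the paper, and bias and batch-normalization parameters are ignored.

```python
def standard_conv_params(c_in, c_out, k=3):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k=3):
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    return c_in * k * k + c_in * c_out


channels = 64  # hypothetical channel width, chosen only for illustration
standard = standard_conv_params(channels, channels)
separable = depthwise_separable_params(channels, channels)
ratio = standard / separable
print(standard, separable, round(ratio, 2))
```

For equal input and output widths the ratio approaches k² = 9 as the channel count grows, which is the usual argument for depthwise separable designs in lightweight necks.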

As listed in Table 3, the real-time performance of all models is evaluated in terms of inference latency and throughput, which is essential for the practical deployment of automated cherry-sorting pipelines. YOLO-CY achieves the lowest single-image inference latency, 0.0062 s, a 24.4% reduction compared with the baseline YOLO11n. In terms of full-pipeline frame rate, YOLO-CY reaches 125 FPS, substantially outperforming YOLO11n and other mainstream lightweight models. Its pure inference frame rate peaks at 160 FPS, well above those of YOLOv8n, YOLOv10n, YOLOv12n and APNet. These speed advantages are attributed to the computation-oriented structural design of the three modified modules. The additive attention mechanism embedded in C3k2_AdditiveBlock replaces the multiplication operation in conventional attention with additive calculation, enhancing feature representation while avoiding the high cost of matrix multiplication. The EUCB module constructs the upsampling path based on depthwise separable convolutions, substantially reducing computational overhead compared with standard convolution-based upsampling. Meanwhile, C2PSA_CGLU achieves efficient channel-wise feature screening via convolutional gated linear units and avoids the quadratic complexity inherent to self-attention mechanisms. Together, these designs allow the model to improve detection accuracy while also reducing inference latency. From an industrial deployment perspective, this real-time performance is of practical engineering value. Current high-speed cherry-sorting lines generally operate at a processing rate of 10–20 fruits per second. With a full-pipeline processing speed of 125 FPS, YOLO-CY processes each frame in 8 ms. 
Even on production lines equipped with multi-view imaging systems, it can keep pace with a mechanical sorting rhythm of 15 fruits per second and leave a sufficient time margin for subsequent actuation, such as pneumatic rejection and robotic grasping. Furthermore, the low inference latency of 6.2 ms is favorable for deploying the model on embedded edge devices. This reduces reliance on high-performance servers and cloud computing resources and has the potential to lower hardware costs and system complexity for on-site industrial deployment.
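The timing figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text. The baseline latency it derives is merely what the quoted 24.4% reduction implies, not a value reported here, and the variable names are illustrative.

```python
def fps_from_latency(latency_s):
    """Frames per second achievable at a given per-image latency."""
    return 1.0 / latency_s


def frame_time_ms(fps):
    """Per-frame processing time in milliseconds at a given frame rate."""
    return 1000.0 / fps


pure_inference_fps = fps_from_latency(0.0062)   # close to the quoted 160 FPS
pipeline_frame_ms = frame_time_ms(125.0)        # full-pipeline time per frame
fruit_interval_ms = 1000.0 / 15.0               # time slot per fruit at 15 fruits/s
frames_per_fruit = fruit_interval_ms / pipeline_frame_ms

# Baseline latency implied by "24.4% reduction" (derived, not reported).
implied_baseline_latency_s = 0.0062 / (1.0 - 0.244)

print(round(pure_inference_fps, 1), pipeline_frame_ms,
      round(frames_per_fruit, 2), round(implied_baseline_latency_s, 4))
```

At 15 fruits per second each fruit has a slot of about 66.7 ms, so roughly eight full-pipeline frames fit into it, which is the margin available for multi-view imaging and actuation.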

To evaluate whether the performance improvement of the proposed YOLO-CY over the baseline YOLO11n is statistically significant, a paired t-test was conducted on the results of five independent runs. Across the five runs, the performance differences of YOLO-CY relative to YOLO11n were 1.6, 1.6, 1.8, 1.4, and 1.6 percentage points, respectively, all positive, indicating stable improvements. The mean performance gain is 1.6 percentage points, with a standard deviation of 0.13, suggesting low variability across repeated experiments. The paired t-test yields a p-value below 0.0001, indicating that the observed improvement is statistically significant. The performance gain is therefore unlikely to be caused by random fluctuations and reflects a consistent enhancement over the baseline model. These results indicate that the improved modules deliver significant performance gains without imposing excessive additional computational overhead, while preserving the lightweight characteristics of the baseline model, which makes the YOLO-CY architecture well suited to embedded deployment and real-time defect detection on automated cherry-sorting production lines. Overall, the experimental results support the effectiveness of the proposed modules and show that YOLO-CY improves both detection performance and real-time inference efficiency for cherry cracking defect detection, outperforming all of the compared mainstream detection models.
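The paired t-test can be reproduced from the five reported differences using only the standard library. The sketch below relies on the closed-form CDF of Student's t distribution for exactly 4 degrees of freedom, so it applies only to n = 5 paired runs. The helper name `t_cdf_df4` is illustrative. Note that the reported standard deviation of 0.13 matches the population formula, whereas the t statistic itself uses the sample standard deviation (about 0.14).

```python
import math
import statistics

# Per-run differences (YOLO-CY minus YOLO11n), in percentage points.
differences = [1.6, 1.6, 1.8, 1.4, 1.6]


def t_cdf_df4(t):
    """CDF of Student's t distribution with exactly 4 degrees of freedom."""
    s = 1.0 + t * t / 4.0
    return 0.5 + 0.375 * (t / math.sqrt(s)) * (1.0 - (t * t / s) / 12.0)


n = len(differences)
mean_diff = statistics.mean(differences)
sample_sd = statistics.stdev(differences)       # divides by n - 1
population_sd = statistics.pstdev(differences)  # divides by n

# Paired t-test: one-sample t-test on the differences against zero.
t_stat = mean_diff / (sample_sd / math.sqrt(n))
p_two_sided = 2.0 * (1.0 - t_cdf_df4(abs(t_stat)))

print(round(mean_diff, 2), round(sample_sd, 3), round(population_sd, 3))
print(round(t_stat, 1), p_two_sided < 1e-4)
```

The resulting t statistic is about 25.3 with 4 degrees of freedom, which is consistent with the reported p-value below 0.0001.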

Table 4 presents the ablation results based on the YOLO11n baseline with three proposed modules incorporated incrementally. The original YOLO11n achieves a precision of 93.50%, a recall of 88.81%, and mAP50 and mAP50-95 values of 93.25% and 63.09%, respectively. Although such performance is competitive, the limited recall reveals the inherent missing detection risk for tiny or low-contrast cracks. This limitation stems from insufficient multi-scale feature fusion of the vanilla C3k2 module, weak local discrimination of the original C2PSA, and severe spatial information loss induced by conventional upsampling operations. With the sole integration of C3k2_AdditiveBlock, the precision and recall increase to 93.55% and 88.85%, accompanied by a slight mAP50 improvement to 93.30%, while only 0.01 M parameters and 0.05 GFLOPs are added. Such performance gains benefit from the additive attention mechanism, which efficiently enhances the backbone’s capability to capture multi-scale crack features and compensates for the inadequate receptive field coverage of small targets in the original C3k2. When only the EUCB module is adopted, recall obtains the most prominent improvement from 88.81% to 89.02%, with mAP50 rising to 93.45% and mAP50-95 reaching 63.30%. This outcome demonstrates that the learnable upsampling design effectively remedies the high-frequency detail loss caused by interpolated upsampling in YOLO11n, preserves spatial localization cues of crack boundaries during feature fusion, and alleviates the missing detection of small objects. The individual replacement with C2PSA_CGLU increases recall to 88.91% and mAP50 to 93.34%. This module realizes fine-grained channel-wise feature filtering and reduces feature confusion among crack textures, normal fruit surfaces and stem shadows. Dual-module combination experiments further verify the complementarity among the three designs. 
Specifically, C3k2_AdditiveBlock enriches feature representations, EUCB retains spatial details, and C2PSA_CGLU refines feature discrimination. Targeted optimizations are implemented in feature extraction, feature fusion and feature screening to address the core defects of YOLO11n. The full integration of all three modules yields the optimal performance of YOLO-CY, with precision of 93.90%, recall of 90.81%, mAP50 of 94.88% and mAP50-95 of 64.92%. Compared with the YOLO11n baseline, recall is improved by 2.00 percentage points, while mAP50 and mAP50-95 are boosted by 1.63 and 1.83 percentage points, respectively. Meanwhile, the model parameters merely increase from 2.59 M to 2.62 M, and computational complexity rises slightly from 6.44 GFLOPs to 6.60 GFLOPs.
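The complementarity argument can be made concrete by comparing the sum of the single-module gains with the gain of the full model, using the recall and mAP50 values quoted above. This is a simple arithmetic sketch. The dictionary layout and names are illustrative, and only the metrics reported for every single-module configuration are included.

```python
# Values in percent, taken from the ablation discussion above.
baseline = {"recall": 88.81, "mAP50": 93.25}
single_module = {
    "C3k2_AdditiveBlock": {"recall": 88.85, "mAP50": 93.30},
    "EUCB": {"recall": 89.02, "mAP50": 93.45},
    "C2PSA_CGLU": {"recall": 88.91, "mAP50": 93.34},
}
full_model = {"recall": 90.81, "mAP50": 94.88}


def gain(row, metric):
    """Gain over the YOLO11n baseline in percentage points."""
    return round(row[metric] - baseline[metric], 2)


summary = {}
for metric in baseline:
    summed = round(sum(gain(row, metric) for row in single_module.values()), 2)
    combined = gain(full_model, metric)
    summary[metric] = (summed, combined)
    print(f"{metric}: sum of single-module gains = {summed}, "
          f"full-model gain = {combined}")
```

For both metrics the full-model gain (2.00 and 1.63 points) is several times the sum of the individual gains (0.35 and 0.34 points), which is the quantitative basis for describing the modules as synergistic rather than merely additive.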

Table 5 compares the detection performance of different models on hard samples. The constructed hard-sample dataset covers challenging scenarios, including slender linear cracks, stem-overlapped cracks, low-contrast defects, densely distributed multiple cracks, and small objects, to comprehensively evaluate the model’s robustness and generalization in complex practical environments. Experimental results indicate that RT-DETR achieves an mAP50 of only 86.4%, with a precision of 86.8% and a recall of 82.7%. This demonstrates its limited perception capability for tiny cracks and low-contrast defects. Dispersed attention distribution hinders the accurate localization of critical defective regions. By comparison, YOLOv8n and YOLOv12n yield mAP50 values of 88.1% and 87.4%, respectively. Although superior to RT-DETR, they still suffer from obvious false and missing detections under stem-overlapped cracks and dense multi-crack conditions, revealing deficiencies in multi-scale feature fusion and local detail discrimination. APNet and the baseline YOLO11n attain higher mAP50 values of 90.3% and 89.6%, achieving substantial performance improvements. Nevertheless, weakened attention and insufficient feature focusing still restrict their recognition of low-contrast and slender linear cracks, thereby limiting further recall gains. In contrast, the proposed YOLO-CY achieves the best overall performance on hard samples, with an mAP50 of 92.6%, a precision of 93.2%, and a recall of 90.0%. Such remarkable results validate the synergistic effects of the three modified modules. The optimized multi-scale feature extraction strengthens the sensitivity to microcracks. Refined channel attention enhances the differentiation between cracks, fruit stems and surface textures. Meanwhile, the learnable upsampling strategy mitigates spatial information loss and guarantees precise boundary localization of crack defects. 
The high recall indicates that YOLO-CY effectively alleviates missed detections and localization deviations under difficult hard-sample conditions, demonstrating robustness and practical deployment potential.

Table 6 presents the performance of YOLO-CY under six-fold cross-validation to evaluate its stability and generalization capability. The mAP50 of the model remains between 94.75% and 94.95% across all folds, with precision ranging from 93.75% to 93.98% and recall from 90.65% to 90.89%, so all metrics vary only marginally. The six-fold average mAP50 reaches 94.86%, while the average precision and recall are 93.88% and 90.79%, respectively. These results are highly consistent with those of the preceding comparative experiments, which supports the reliability and reproducibility of the model’s performance. The low variance across folds further indicates that YOLO-CY is insensitive to how the data are partitioned and maintains stable detection performance despite differences in cherry cultivars, crack morphologies and imaging conditions between partitions. This stability is attributed to the synergistic design of C3k2_AdditiveBlock, C2PSA_CGLU and EUCB, which strengthen robust feature learning through multi-scale feature extraction, channel attention regulation and spatial information retention, respectively. Accordingly, the model can extract generalizable, discriminative crack features even when the training data partitioning changes, without introducing additional overfitting risks.
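One way to build six folds is sketched below. It assumes that the folds respect the same fruit-level grouping as the train/validation partition, which is not stated explicitly for the cross-validation experiment. The function name `grouped_k_fold` and the toy group identifiers are hypothetical.

```python
import random


def grouped_k_fold(group_ids, k=6, seed=0):
    """Return k (train_indices, val_indices) pairs with groups kept intact.

    group_ids[i] is the group (for example, a fruit identifier) of sample i.
    """
    unique = sorted(set(group_ids))
    random.Random(seed).shuffle(unique)
    # Deal the shuffled groups round-robin into k folds.
    fold_of_group = {g: i % k for i, g in enumerate(unique)}

    folds = []
    for fold in range(k):
        val = [i for i, g in enumerate(group_ids) if fold_of_group[g] == fold]
        train = [i for i, g in enumerate(group_ids) if fold_of_group[g] != fold]
        folds.append((train, val))
    return folds


# Toy data: 12 fruits with 2 images each (made-up identifiers).
toy_groups = [f"fruit{i // 2}" for i in range(24)]
folds = grouped_k_fold(toy_groups)
for train_idx, val_idx in folds:
    print(len(train_idx), len(val_idx))
```

Each sample is used for validation exactly once, and the per-fold metrics can then be averaged as in Table 6.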

To further investigate the regions of interest focused on by each network, as illustrated in Figure 5, we present the visualization results of attention heatmaps for cherry cracking defect detection based on GRAD-CAM [45]. In these heatmaps, red regions denote the key areas emphasized by the network during detection, while blue regions represent low-attention areas. This experiment intuitively reflects the attention-focusing capability and feature-recognition preferences of different models with respect to cherry-crack characteristics. The experimental results demonstrate that the YOLOv8 model exhibits poor attention focusing on cherry cracks and tends to mistakenly identify non-defective regions, such as natural peel textures and pedicel shadows, as core regions of interest, which introduces significant interference into the feature extraction of genuine cracks. Although the RT-DETR model can capture features of partial crack regions, its attention distribution is relatively scattered, with insufficient focusing ability on tiny linear cracks on the cherry surface, making it difficult to accurately locate the core defective regions. As the baseline model in this study, YOLO11n achieves superior attention-focusing performance compared with the former two models and can effectively identify medium and large crack regions. Nevertheless, it still suffers from weakened attention and inadequate feature focusing when dealing with fine micro-cracks and irregular shallow cracks. In contrast, the proposed YOLO-CY model in this paper can precisely concentrate its attention on core defective regions, including linear fissures and irregular damages of cherry cracks, and effectively eliminates the interference caused by natural peel textures, pedicel shadows, and surface spots. The attention distribution is highly consistent with the spatial locations and morphological characteristics of actual cracks. 
This improvement is attributed to the C3k2_AdditiveBlock module, which enhances the multi-scale feature fusion capability of the model, and the C2PSA_CGLU module, which strengthens the discrimination and screening of local crack defect features. These modules enable the model to accurately recognize the visual characteristics of cracks and allocate core attention to them, verifying that the improved modules make the feature extraction logic and attention mechanism of the model more suitable for the detection of tiny cherry cracks.
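For readers unfamiliar with how such heatmaps are produced, the core Grad-CAM computation is sketched below in plain Python. It takes the activations of one convolutional layer and the gradients of the detection score with respect to them as given inputs. Obtaining these from a YOLO model requires framework hooks that are not shown, and the toy arrays are made up for illustration.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from K x H x W activations and matching gradients.

    Channel weights are the spatially averaged gradients; the map is the
    ReLU of the weighted sum of activation channels, scaled to [0, 1].
    """
    k = len(activations)
    h, w = len(activations[0]), len(activations[0][0])

    # Global average pooling of the gradients gives one weight per channel.
    weights = [sum(sum(row) for row in gradients[c]) / (h * w) for c in range(k)]

    cam = [
        [max(0.0, sum(weights[c] * activations[c][y][x] for c in range(k)))
         for x in range(w)]
        for y in range(h)
    ]

    # Normalise so the strongest response is 1 (rendered red in a heatmap).
    peak = max(max(row) for row in cam)
    if peak > 0.0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example: channel 0 supports the detection, channel 1 opposes it.
toy_activations = [[[1.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 1.0]]]
toy_gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
toy_cam = grad_cam(toy_activations, toy_gradients)
print(toy_cam)
```

The ReLU keeps only locations that contribute positively to the score, which is why regions that argue against the detection appear as low-attention (blue) areas.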

To better evaluate the detection performance on diverse targets and challenging samples, as illustrated in Figure 6, we present the experimental results of cherry crack detection using multi-sample and hard-sample test sets. The experiments were carried out on cherry samples with different cultivars, varying crack severity levels, and diverse imaging conditions. The hard samples mainly consist of tiny linear cracks, cracks overlapping with pedicels, low-contrast cracks, and densely distributed multiple cracks, which are representative of high-difficulty cases in actual production. These experiments are formulated to validate the robustness of detection and generalization performance of the model in complex, realistic scenarios. According to the experimental results, the YOLOv8 model suffers from severe missed detection when dealing with micro-cracks and low-contrast cracks and exhibits obvious false detection on samples where cracks overlap with pedicels since it cannot effectively distinguish the visual features between crack defects and pedicels. Although the RT-DETR model can detect partial crack defects in challenging samples, it suffers from low localization accuracy and severe bounding box drift. Furthermore, it tends to merge detection boxes and miss some individual cracks in densely distributed multi-crack regions. The YOLO11n baseline model achieves significantly better detection performance than the above two models, with a remarkable reduction in both missed and false detection rates. However, it still shows limited effectiveness in detecting ultra-fine linear cracks and low-contrast cracks under strong illumination and reflection. By contrast, the YOLO-CY model proposed in this study exhibits better detection capability on both normal and difficult samples. It realizes precise recognition of cracks at various grades, including micro-cracks, shallow cracks, and deep cracks. 
For difficult samples such as cracks overlapping with pedicels, low-contrast cracks, and densely distributed multiple cracks, the proposed model largely mitigates missed detection, false detection, and localization deviation, and its detection boxes closely follow the actual contours and spatial locations of the cracks.

Figure 7 shows how the loss functions and detection performance indicators evolve during the training of the YOLO-CY model, reflecting the convergence behavior of the network. As the number of training epochs increases, the bounding box regression loss, classification loss, and distribution focal loss on the training set decline continuously and steadily without significant oscillations. Meanwhile, the corresponding validation losses decline synchronously and finally stabilize. This indicates that the network parameters have been effectively optimized, that the feature learning process for cherry cracking defects is stable, and that there are no signs of overfitting, vanishing gradients, or training instability. It also supports the compatibility between the improved modules and the adopted training strategies. In terms of detection performance, the precision and recall of the model gradually increase and then saturate as training proceeds, and both mAP50 and mAP50-95 rise steadily and finally reach a relatively high level without significant decline. This demonstrates that the model’s detection capability for cherry cracking defects is continuously enhanced during training: it achieves high-confidence defect identification and localization and maintains good performance across IoU thresholds for small-target cherry cracking defects with diverse morphologies and scales.

5. Discussion and Limitations

The proposed YOLO-CY model outperforms the YOLO11n baseline and other prevailing detection methods across all evaluation metrics, benefiting substantially from the synergistic integration of three elaborately designed modules. The C3k2_AdditiveBlock enhances multi-scale feature fusion via a dual-branch parallel architecture and additive attention mechanism, which strengthens the model’s sensitivity to fine cherry cracks as challenging small targets. By replacing the original spatial attention with convolutional gated linear units, the C2PSA_CGLU module achieves fine-grained channel-wise feature screening, improves the discrimination of local crack textures, and reduces misclassification caused by interference from fruit stems and normal pericarp textures. The EUCB module substitutes conventional interpolated upsampling with a learnable combination of depthwise and pointwise convolutions, which preserves high-frequency spatial details, optimizes multi-scale feature fusion, and enables precise localization of crack boundaries. These targeted improvements address the core deficiencies of the baseline model in feature aggregation, local feature discrimination and spatial information retention, enabling superior robustness under complex industrial scenarios, such as low-contrast cracks, stem-occluded cracks and densely distributed multiple defects.

Nevertheless, certain limitations remain in this study. First, systematic validation across multiple cherry cultivars has not yet been conducted. Although the constructed dataset contains diverse crack patterns, it primarily focuses on mainstream commercial varieties. However, the fundamental visual characteristics of crack defects are largely consistent across cultivars, and the proposed module designs are not tailored to specific varieties. Hence, promising cross-cultivar generalization can be expected, although this remains to be confirmed by follow-up validation. Second, the inference speed may degrade on low-performance hardware, resulting in deployment constraints on devices with extremely limited computing resources. Even so, YOLO-CY keeps a small parameter count and low computational complexity, which are expected to meet real-time detection requirements on general edge devices. Considering that automated sorting pipelines are commonly equipped with edge computing hardware with sufficient computing capacity, this limitation is expected to have little impact on the practical engineering applicability of the proposed model.

6. Conclusions

Aiming at the challenging small-object detection task of cherry cracking detection, which is characterized by tiny target size, low contrast and diverse morphologies, this study proposes the YOLO-CY model to remedy the structural deficiencies of YOLO11 in multi-scale feature fusion, channel discrimination capability and spatial information retention. Systematic optimizations are conducted on the backbone network, attention module and neck structure. Specifically, the C3k2_AdditiveBlock integrates an additive attention mechanism to enhance the model’s sensitivity to subtle cracks. The C2PSA_CGLU module adopts convolutional gated linear units to achieve refined channel-wise feature screening. The EUCB module replaces conventional interpolated upsampling with learnable upsampling operations, effectively preserving the spatial details of crack edges. On the self-collected cherry image dataset acquired from actual sorting lines, YOLO-CY achieves an mAP50 of 94.88% and an mAP50-95 of 64.92%, with the recall rate reaching 90.81%. Compared with the YOLO11n baseline, the three metrics are improved by 1.63, 1.83 and 2.00 percentage points, respectively, while the model parameter scale and computational complexity only slightly rise to 2.62 M and 6.60 GFLOPs. The proposed model also exhibits strong robustness on hard samples such as fine cracks, low-contrast defects and stem-occluded targets, effectively mitigating missing detection and localization deviation. The experimental results demonstrate that YOLO-CY can well satisfy the dual requirements of detection accuracy and real-time performance in industrial sorting lines. Nevertheless, this work has not undergone systematic validation across multiple cherry cultivars and complex field environments, and the inference efficiency of the model on edge devices with extremely limited computational resources remains to be further explored. 
Overall, this work provides a lightweight technical solution for cherry cracking defect detection. The presented modular optimization strategies also offer a valuable reference for small-object defect detection tasks of other fruit varieties.

Author Contributions

Conceptualization, Y.S. and X.M.; methodology, Y.S. and X.M.; software, Y.S. and Y.S.; validation, Y.S., Z.H., X.T. and T.H.; formal analysis, X.T. and Z.W.; investigation, Y.S. and W.W.; resources, X.T. and W.W.; data curation, Y.S. and Z.H.; writing—original draft, Y.S., X.M. and Y.Z.; writing—review and editing Y.S. and Z.H.; visualization, T.H. and P.R.; supervision, X.M. and X.T.; project administration, X.M. and Z.W.; funding acquisition, X.M. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Key Research and Development Project (2024YFD2400103); the Key Project of Liaoning Province “Publishing List and Appointing Commander” (2022081); the Joint Program of Liaoning Province Science and Technology Plan (2024JH2/102600083); the General Project of Liaoning Provincial Department of Education (JYTMS20230489); the Open Project of Sichuan Engineering Research Center for Key Technologies of All-Electric General Aviation Aircraft (2025008); and the Horizontal Project (GH202403).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interest or personal relationships that could have appeared to influence the work reported in this paper.

References

Liu, Y.; Han, X.; Ren, L.; Ma, W.; Liu, B.; Sheng, C.; Song, Y.; Li, Q. Surface defect and malformation characteristics detection for fresh sweet cherries based on YOLOv8-DCPF method. Agronomy 2025, 15, 1234.
Grand View Research. Cherry Market Size, Share & Trends Analysis Report, 2024. Available online: https://www.grandviewresearch.com/industry-analysis/cherry-market-report (accessed on 1 April 2026).
Luan, F.; Fan, K.; Xu, X.; Yang, X.; Chen, J. Cherry-YOLO: Enhanced real-time detection of Cherry ripeness and defects with optimised YOLOv8. Computing 2025, 107, 198.
Momeny, M.; Jahanbakhshi, A.; Jafarnezhad, K.; Zhang, Y.D. Accurate classification of cherry fruit using deep CNN based on hybrid pooling approach. Postharvest Biol. Technol. 2020, 166, 111204.
Lufu, R.; Ambaw, A.; Opara, U.L. The contribution of transpiration and respiration processes in the mass loss of pomegranate fruit (cv. Wonderful). Postharvest Biol. Technol. 2019, 157, 110982.
UNECE. UNECE Standard FFV-13 Concerning the Marketing and Commercial Quality Control of Sweet Cherries; Technical report; United Nations Economic Commission for Europe: Geneva, Switzerland, 2017.
Measham, P.F. Rain-Induced Fruit Cracking in Sweet Cherry (Prunus avium L.). Ph.D. Thesis, University of Tasmania, Sandy Bay, TAS, Australia, 2011.
Zhao, Y.; Gong, L.; Huang, Y.; Liu, C. A review of key techniques of vision-based control for harvesting robot. Comput. Electron. Agric. 2016, 127, 311–323.
Chiou, K.D.; Chen, Y.X.; Chen, P.S.; Jou, Y.T.; Tsai, S.H.; Chang, C.Y. Application of deep learning for fruit defect recognition in Psidium guajava L. Sci. Rep. 2025, 15, 6145.
Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 1137–1149.
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2016; pp. 779–788.
Li, M.; Tao, Z.; Yan, W.; Lin, S.; Feng, K.; Zhang, Z.; Jing, Y. Apnet: Lightweight network for apricot tree disease and pest detection in real-world complex backgrounds. Plant Methods 2025, 21, 4.
Ding, C.; Zhang, R.; Qi, J.; Xie, Y.; Yi, T.; Li, L.; Wu, M.; Zhang, W.; Bao, Z. Precision detection and geolocation of missed pre-tassels in hybrid maize seed production using UAV-based deep learning. Eur. J. Agron. 2026, 173, 127923.
Ren, K.; Chen, R.; Zhang, H.; Cui, K.; Wu, Q.; Dong, J.; Liang, L.; Liu, L. An efficient and lightweight LS-YOLOv11 algorithm for non-invasive pavement distress detection in UAV images. Meas. Sci. Technol. 2026, 37, 015004.
Rahman, M.M.; Munir, M.; Marculescu, R. Emcad: Efficient multi-scale convolutional attention decoding for medical image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 11769–11779.
Soltani Firouz, M.; Sardari, H. Defect detection in fruit and vegetables by using machine vision systems and image processing. Food Eng. Rev. 2022, 14, 353–379.
Li, M.; Tao, Z.; Lin, S.; Feng, K. LAF: Enhancing person re-identification via Latent-Assisted Feature Fusion. Alex. Eng. J. 2025, 127, 116–128.
Wei, R.; Pei, Y.k.; Jiang, Y.c.; Zhou, P.; Zhang, Y. Detection of cherry defects based on improved Faster R-CNN model. Food Mach. 2021, 37, 98–105.
Zhang, Q.; Cao, H. Improved faster R-CNN based on apple defective region target detection. J. Anhui Sci. Technol. Univ. 2023, 37, 96–101.
Liu, J.y.; Pei, Y.k.; Chang, Z.y.; Chang, Z.; Chai, Z.; Cao, P. Cherry defect and classification detection based on improved YOLOX model. Food Mach. 2023, 39, 139–145.
Feng, J.; Wang, Z.; Wang, S.; Tian, S.; Xu, H. MSDD-YOLOX: An enhanced YOLOX for real-time surface defect detection of oranges by type. Eur. J. Agron. 2023, 149, 126918.
Wu, H.; Zhu, R.; Wang, H.; Wang, X.; Huang, J.; Liu, S. Flaw-YOLOv5s: A lightweight potato surface defect detection algorithm based on multi-scale feature fusion. Agronomy 2025, 15, 875.
Yao, J.; Qi, J.; Zhang, J.; Shao, H.; Yang, J.; Li, X. A real-time detection algorithm for Kiwifruit defects based on YOLOv5. Electronics 2021, 10, 1711.
Han, W.; Jiang, F.; Zhu, Z. Detection of cherry quality using YOLOV5 model based on flood filling algorithm. Foods 2022, 11, 1127.
Song, K.; Yang, J.; Wang, G. A Swin transformer and MLP based method for identifying cherry ripeness and decay. Front. Phys. 2023, 11, 1278898.
Li, M.; Ding, X.; Wang, J. CMD-YOLO: A lightweight model for cherry maturity detection targeting small object. Smart Agric. Technol. 2025, 12, 101513.
Zhang, T.; Li, L.; Zhou, Y.; Liu, W.; Qian, C.; Hwang, J.N.; Ji, X. Cas-vit: Convolutional additive self-attention vision transformers for efficient mobile applications. IEEE Trans. Image Process. 2026, 35, 1899–1909.
Li, M.; Tao, Z.; Lin, S.; Feng, K. Mix-net: Hybrid attention/diversity network for person re-identification. Electronics 2024, 13, 1001.
Li, W.; Liu, K.; Zhang, L.; Cheng, F. Object detection based on an adaptive attention mechanism. Sci. Rep. 2020, 10, 11307.
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 7132–7141.
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2018; pp. 3–19.
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2020; pp. 11534–11542. [Google Scholar]
Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2018; pp. 7794–7803. [Google Scholar]
Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision; IEEE: Piscataway, NJ, USA, 2021; pp. 10012–10022. [Google Scholar]
Shi, D. Transnext: Robust foveal visual perception for vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Piscataway, NJ, USA, 2024; pp. 17773–17783. [Google Scholar]
Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In European Conference on Computer Vision; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Piscataway, NJ, USA, 2024; pp. 16965–16974. [Google Scholar]
Jocher, G.; Chaurasia, A.; Stoken, A.; Borovec, J.; Kwon, Y.; Michael, K.; Fang, J.; Yifu, Z.; Wong, C.; Montes, D.; et al. YOLOv5: A State-of-the-Art Real-Time Object Detection System; Zenodo: Geneva, Switzerland, 2021. [Google Scholar]
Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8, 2023. Available online: https://www.scirp.org/reference/referencespapers?referenceid=3532980 (accessed on 1 April 2026).
Wang, C.Y.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar]
Tian, Y.; Ye, Q.; Doermann, D. YOLO12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar]
Jocher, G.; Qiu, J. Ultralytics YOLO11, 2024. Available online: https://docs.ultralytics.com/models/yolo11 (accessed on 1 April 2026).
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. arXiv 2017, arXiv:1610.02391. [Google Scholar]

Figure 1. Overall Structure Diagram of YOLO-CY.

Figure 2. Overall Structure Diagram of C3k2_AdditiveBlock.

Figure 3. Overall Structure Diagram of EUCB.

Figure 4. Overall Structure Diagram of C2PSA_CGLU.

Figure 5. Heatmap visualization experiment: red indicates regions where the network focuses more attention, while blue represents regions with less attention.

Figure 6. Multi-sample and hard-sample experiments.

Figure 7. Curves of training parameters and metrics.

Table 1. Hyperparameters for Experiments.

Hyperparameter	Configurations	Hyperparameter	Configurations
Epochs	100	Batch_size	512
Optimizer	SGD	Lr0	0.01
Weight_decay	0.0005	Box	7.5
Cls	0.5	Dfl	1.5
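
For readers who want to reproduce the settings in Table 1, the values map onto training arguments of the Ultralytics Python API roughly as shown here. This is a minimal sketch, not the authors' training script: the model file `yolo11n.yaml` (the unmodified baseline rather than YOLO-CY), the dataset file `cherry.yaml`, and the helper `train_baseline` are hypothetical placeholders introduced only for illustration.

```python
# Table 1 hyperparameters written as Ultralytics-style training arguments.
# The values come from Table 1; everything else in this block is an assumption.
TRAIN_ARGS = {
    "epochs": 100,           # Epochs
    "batch": 512,            # Batch_size
    "optimizer": "SGD",      # Optimizer
    "lr0": 0.01,             # initial learning rate (Lr0)
    "weight_decay": 0.0005,  # Weight_decay
    "box": 7.5,              # box regression loss gain (Box)
    "cls": 0.5,              # classification loss gain (Cls)
    "dfl": 1.5,              # distribution focal loss gain (Dfl)
}


def train_baseline(model_cfg="yolo11n.yaml", data_cfg="cherry.yaml"):
    """Train a model with the Table 1 settings.

    Both default file names are hypothetical placeholders; the paper does
    not publish its model or dataset configuration files.
    """
    from ultralytics import YOLO  # imported lazily so the dict is usable without it

    model = YOLO(model_cfg)
    return model.train(data=data_cfg, **TRAIN_ARGS)
```

Calling `train_baseline()` requires the `ultralytics` package and a valid dataset description; the dictionary alone can be inspected or logged without either.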

Table 2. Results of Comparative Experiments. Bold values indicate the best result.

Model	mAP50 (%)	mAP50-95 (%)	P (%)	R (%)	F1 (%)	Params (M)	GFLOPs
SSD [37]	46.87	40.51	66.24	60.21	63.08	6.13	3.04
Faster R-CNN [10]	77.64	52.37	81.65	79.25	80.43	18.93	41.80
RT-DETR [38]	89.10	60.14	90.23	86.62	88.40	19.00	54.09
YOLOv5n [39]	88.65	59.26	89.67	84.23	86.87	2.19	5.92
YOLOv8n [40]	92.55	62.95	92.98	88.23	90.54	2.69	6.94
YOLOv9t [41]	89.67	60.17	91.04	86.42	88.67	1.77	6.70
YOLOv10n [42]	91.77	62.95	93.05	87.91	90.41	2.70	8.39
YOLOv12n [43]	92.91	63.01	93.12	88.12	90.55	2.60	6.50
YOLO11n [44]	93.25	63.09	93.50	88.81	91.10	2.59	6.44
APNet [12]	93.97	63.64	93.55	88.91	91.17	2.79	8.00
Ours	94.88	64.92	93.90	90.81	92.33	2.62	6.60
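
The F1 column of Table 2 is presumably the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short sketch below recomputes it for two rows as a sanity check. Because P and R are themselves rounded to two decimals in the table, recomputed values may differ from the printed ones in the last digit. The function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both on the same scale)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs in percent, copied from Table 2.
rows = {
    "YOLOv8n": (92.98, 88.23),
    "Ours": (93.90, 90.81),
}

for name, (p, r) in rows.items():
    print(f"{name}: F1 = {f1_score(p, r):.2f}%")
```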

Table 3. Computational resource comparison. FPS* denotes the total frame rate of the full pipeline, including pre-processing, inference and post-processing. Inference represents single-image inference latency. Bold indicates the best performance.

Methods	FPS (Inference) ↑	FPS* ↑	Inference (s) ↓
YOLOv8n [40]	114	99	0.0088
YOLOv10n [42]	121	102	0.0083
YOLO11n [44]	122	103	0.0082
YOLOv12n [43]	88	69	0.0113
RT-DETR [38]	19	14	0.0536
APNet [12]	116	100	0.0086
Ours	160	125	0.0062
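
The columns of Table 3 are linked by simple reciprocals: inference FPS is approximately 1 / latency, and the gap between 1 / FPS* and 1 / FPS gives the average time spent on pre- and post-processing per image. The sketch below works through this under the assumption that the table reports rounded values, so small mismatches are expected (for example, 1 / 0.0062 s is about 161 FPS against a reported 160). The helper names are illustrative.

```python
def fps_from_latency(latency_s):
    """Frames per second implied by a per-image latency in seconds."""
    return 1.0 / latency_s


def overhead_per_image(fps_inference, fps_pipeline):
    """Seconds per image spent outside inference (pre- and post-processing)."""
    return 1.0 / fps_pipeline - 1.0 / fps_inference


# (inference FPS, full-pipeline FPS*, inference latency in s) from Table 3.
table3 = {
    "YOLO11n": (122, 103, 0.0082),
    "Ours": (160, 125, 0.0062),
}

for name, (fps, fps_star, latency) in table3.items():
    implied = fps_from_latency(latency)
    overhead_ms = overhead_per_image(fps, fps_star) * 1000
    print(f"{name}: implied FPS {implied:.1f}, reported {fps}, "
          f"pre/post overhead {overhead_ms:.2f} ms per image")
```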

Table 4. Results of Ablation Experiments. C3k2 denotes the C3k2_AdditiveBlock and C2PSA denotes the C2PSA_CGLU; ✓ marks an enabled module and – a disabled one.

C3k2	EUCB	C2PSA	P (%)	R (%)	mAP50 (%)	mAP50-95 (%)	Params (M)	GFLOPs
–	–	–	93.50	88.81	93.25	63.09	2.59	6.44
✓	–	–	93.55	88.85	93.30	63.14	2.60	6.49
–	✓	–	93.61	89.02	93.45	63.30	2.60	6.46
–	–	✓	93.54	88.91	93.34	63.19	2.60	6.48
✓	✓	–	93.61	89.14	93.54	63.39	2.61	6.50
–	✓	✓	93.69	89.90	94.13	64.07	2.61	6.52
✓	–	✓	93.74	89.23	93.65	63.51	2.61	6.51
✓	✓	✓	93.90	90.81	94.88	64.92	2.62	6.60
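
One way to read the synergy claim from Table 4 is to compare the mAP50 gain of the full model with the sum of the gains obtained when each module is added alone. The sketch below does this arithmetic from the table values; it is an illustrative check on the reported numbers, not an analysis performed in the paper, and the variable names are assumptions.

```python
# mAP50 (%) values copied from Table 4.
BASELINE_MAP50 = 93.25
SINGLE_MODULE_MAP50 = {
    "C3k2_AdditiveBlock": 93.30,
    "EUCB": 93.45,
    "C2PSA_CGLU": 93.34,
}
FULL_MODEL_MAP50 = 94.88

# Gain of each module when it is the only change to the baseline.
single_gains = {
    name: value - BASELINE_MAP50 for name, value in SINGLE_MODULE_MAP50.items()
}
sum_of_single_gains = sum(single_gains.values())
full_gain = FULL_MODEL_MAP50 - BASELINE_MAP50

for name, gain in single_gains.items():
    print(f"{name}: +{gain:.2f} points")
print(f"Sum of single-module gains: +{sum_of_single_gains:.2f} points")
print(f"Gain of the full model:     +{full_gain:.2f} points")
```

A full-model gain that exceeds the sum of the single-module gains is consistent with the modules reinforcing one another, although single-run differences of this size should be read with caution.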

Table 5. Hard-Sample Experiment Including Small Objects and Complex Lighting Samples.

Model	mAP50	P	R
RT-DETR [38]	86.4	86.8	82.7
YOLOv8n [40]	88.1	91.1	86.4
YOLOv12n [43]	87.4	88.3	85.6
APNet [12]	90.3	89.8	88.5
YOLO11n [44]	89.6	90.1	89.6
Ours	92.6	93.2	90.0

Table 6. K-Fold Cross-Validation Results of the Proposed Model.

Fold	P	R	mAP50
K-1	93.75	90.65	94.75
K-2	93.82	90.72	94.82
K-3	93.95	90.85	94.90
K-4	93.88	90.78	94.85
K-5	93.92	90.83	94.91
K-6	93.98	90.89	94.95
Mean	93.88	90.79	94.86
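
The Mean row of Table 6 can be recomputed from the six folds, and the sample standard deviation gives a rough sense of fold-to-fold variation. The sketch below uses only Python's standard `statistics` module; the standard deviation is an added illustration and is not reported in the paper.

```python
from statistics import mean, stdev

# Per-fold results (%) copied from Table 6, folds K-1 to K-6.
FOLDS = {
    "P": [93.75, 93.82, 93.95, 93.88, 93.92, 93.98],
    "R": [90.65, 90.72, 90.85, 90.78, 90.83, 90.89],
    "mAP50": [94.75, 94.82, 94.90, 94.85, 94.91, 94.95],
}

# Map each metric to (mean, sample standard deviation) across folds.
summary = {metric: (mean(values), stdev(values)) for metric, values in FOLDS.items()}

for metric, (avg, sd) in summary.items():
    print(f"{metric}: mean = {avg:.2f}, sample std = {sd:.3f}")
```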

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

