4.1. Dataset and Environment
This study adopts a self-collected cherry image dataset captured on a real sorting production line. All images are acquired with an RGB camera at a shooting distance of 20–25 cm from the fruit surface. Natural ambient light serves as the illumination source, and the illuminance is kept within 1000–1200 lux to ensure consistent imaging quality. The original image resolution is pixels, and all images are uniformly resized to pixels via preprocessing for subsequent model training and inference. The dataset consists of 3662 cherry images, including 1878 intact fruit samples and 1784 cracked fruit samples. The two categories are therefore balanced at a ratio of approximately 1:1, so no additional balancing strategy such as weighted loss or oversampling is required. Each image contains one or more cherry fruits, and the crack defects cover diverse typical patterns, including fine linear cracks, low-contrast cracks, stem-overlapped cracks and densely distributed multiple cracks. The dataset thus offers high morphological diversity and is representative of practical industrial scenarios. Data partitioning is implemented at the fruit level: images of the same fruit and highly similar consecutive frames are never allocated to both the training and validation sets.
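A fruit-level partition of this kind can be sketched as follows. This is a minimal illustration rather than the pipeline used in this work: the `fruit_id` grouping key, the 80/20 ratio, the random seed and the file names are all assumptions introduced for the example.

```python
import random
from collections import defaultdict


def fruit_level_split(records, val_fraction=0.2, seed=0):
    """Split (image_path, fruit_id) records so no fruit spans both subsets.

    `fruit_id` is a hypothetical grouping key: all images of one fruit,
    including near-duplicate consecutive frames, share the same id.
    """
    groups = defaultdict(list)
    for image_path, fruit_id in records:
        groups[fruit_id].append(image_path)

    fruit_ids = sorted(groups)              # sort first for a reproducible order
    random.Random(seed).shuffle(fruit_ids)

    n_val = max(1, round(len(fruit_ids) * val_fraction))
    val_ids = set(fruit_ids[:n_val])

    train = [p for f in fruit_ids if f not in val_ids for p in groups[f]]
    val = [p for f in fruit_ids if f in val_ids for p in groups[f]]
    return train, val, val_ids


# Toy data: 10 hypothetical fruits with 3 frames each.
records = [(f"fruit{f:02d}_frame{k}.jpg", f) for f in range(10) for k in range(3)]
train_images, val_images, val_fruits = fruit_level_split(records)
```

Splitting on the fruit identifier rather than on individual images is what prevents near-identical frames from leaking between the two subsets.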
All model training is conducted on a high-performance computing platform equipped with eight NVIDIA A800 graphics cards, each with 80 GB of video memory, and a total of 512 GB of system memory. The detailed experimental environment and partial hyperparameter configurations are documented in Table 1. To guarantee the fairness and comparability of all experimental results, every comparative experiment is conducted exclusively on the aforementioned custom cherry dataset, and model performance is evaluated using metrics widely adopted in object detection, including mAP50, mAP50-95, precision (P), recall (R), parameters (Params), and GFLOPs.
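For reference, precision and recall at a single IoU threshold can be sketched as below; mAP50 averages per-class average precision at an IoU threshold of 0.5, and mAP50-95 averages it over thresholds from 0.5 to 0.95 in steps of 0.05. The function names, the greedy one-to-one matching and the toy boxes are assumptions made for illustration and are not taken from the evaluation code used in this work.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(predictions, ground_truths, iou_threshold=0.5):
    """Greedy one-to-one matching; `predictions` sorted by descending confidence."""
    matched = set()
    tp = 0
    for pred in predictions:
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truths):
            if idx in matched:
                continue
            overlap = iou(pred, gt)
            if overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)
            tp += 1
    fp = len(predictions) - tp      # predictions without a matching crack
    fn = len(ground_truths) - tp    # cracks that were missed
    precision = tp / (tp + fp) if predictions else 0.0
    recall = tp / (tp + fn) if ground_truths else 0.0
    return precision, recall


gts = [(10, 10, 50, 50), (60, 60, 100, 100)]
preds = [(12, 12, 50, 50), (200, 200, 240, 240)]  # one hit, one false alarm
p, r = precision_recall(preds, gts)
```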
4.2. Comparative and Ablation Experiment
As shown in Table 2, the proposed YOLO-CY model achieves the best detection accuracy among the compared models, with mAP50 reaching 94.88% and mAP50-95 reaching 64.92%. The precision and recall are 93.90% and 90.81%, respectively. Compared with the baseline YOLO11n, YOLO-CY increases mAP50 and mAP50-95 by 1.63 and 1.83 percentage points, respectively, while raising recall by 2.00 percentage points and precision by 0.40 percentage points. These improvements are not obtained through simple model expansion: YOLO-CY keeps nearly the same parameter scale and computational complexity as YOLO11n, adding only 0.03 M parameters and 0.16 GFLOPs, which confirms the efficiency of the proposed improvements. The performance gains of YOLO-CY over YOLO11n stem from targeted remedies for three structural weaknesses of the original network. First, the vanilla C3k2 module in YOLO11n exhibits insufficient multi-scale feature fusion capability for fine-grained targets such as cherry cracks and fails to adequately capture feature representations of tiny cracks under diverse receptive fields. The designed C3k2_AdditiveBlock integrates an additive attention mechanism and a dual-branch parallel structure, enhancing the model’s sensitivity to subtle crack features. Second, the original C2PSA module focuses primarily on spatial self-attention modeling yet lacks sufficient channel-wise information interaction and fine feature screening, making it difficult to distinguish local crack textures from normal fruit surface patterns. The proposed C2PSA_CGLU employs a convolutional gated linear unit to realize refined channel attention filtering and strengthen the discrimination of local crack morphological features. 
Third, the conventional upsampling in the neck network of YOLO11n relies on fixed interpolation and cannot learn to reconstruct high-frequency details, so spatial localization information for small cracks is lost during multi-scale feature fusion. In contrast, the EUCB module replaces fixed interpolation upsampling with a learnable combination of depthwise and pointwise convolutions, thereby preserving fine-edge spatial details of cracks and improving localization accuracy. Together, these designs enable YOLO-CY to outperform YOLO11n in detection accuracy while retaining its lightweight advantage.
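The reason a depthwise plus pointwise combination adds so few parameters can be illustrated with a simple weight count. The sketch below is generic arithmetic, not the EUCB configuration: the channel width of 64 and the 3 x 3 kernel are assumed values chosen only for illustration, and bias terms are omitted.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a k x k standard convolution (bias omitted)."""
    return c_in * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a k x k depthwise conv followed by a 1 x 1 pointwise conv."""
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise


c_in = c_out = 64   # assumed channel width, for illustration only
k = 3               # assumed kernel size
std = standard_conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
ratio = sep / std   # equals 1 / c_out + 1 / k**2
print(std, sep, round(ratio, 3))
```

Under these assumed sizes the separable form needs roughly one eighth of the weights of a standard convolution, and the same ratio applies to the multiply-accumulate count per output position.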
As listed in Table 3, the real-time performance of all models is evaluated in terms of inference latency and throughput, which is essential for the practical deployment of automated cherry-sorting pipelines. YOLO-CY achieves the lowest single-image inference latency, 0.0062 s, a 24.4% reduction compared with the baseline YOLO11n. In terms of full-pipeline frame rate, YOLO-CY reaches 125 FPS, substantially outperforming YOLO11n and other mainstream lightweight models. Its pure inference frame rate peaks at 160 FPS, well above YOLOv8n, YOLOv10n, YOLOv12n and APNet. These speed advantages stem from the computation-oriented structural design of the three modified modules. The additive attention mechanism embedded in C3k2_AdditiveBlock replaces the multiplication operation in conventional attention with additive calculation, enhancing feature representation while avoiding the high cost of matrix multiplication. The EUCB module builds the upsampling path on depthwise separable convolutions, which require far less computation than standard convolution-based upsampling. Meanwhile, C2PSA_CGLU achieves efficient channel-wise feature screening via convolutional gated linear units and avoids the quadratic complexity inherent to self-attention mechanisms. Together, these designs improve detection accuracy and inference speed at the same time. From an industrial deployment perspective, this real-time performance is of direct engineering value. Current high-speed cherry-sorting lines generally operate at a processing rate of 10–20 fruits per second. At a full-pipeline speed of 125 FPS, YOLO-CY processes each frame in 8 ms. 
Even on production lines equipped with multi-view imaging systems, it can keep pace with a mechanical sorting rhythm of 15 fruits per second and leave a sufficient time margin for subsequent actuation operations, such as pneumatic rejection and robotic grasping. Furthermore, the low inference latency of 6.2 ms is favorable for running the model on embedded edge devices. This reduces reliance on high-performance servers and cloud computing resources and has the potential to lower hardware costs and system complexity for on-site industrial deployment.
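The timing figures quoted above are related by simple arithmetic, sketched below. All inputs come from the text; the YOLO11n latency and the number of frames per fruit are derived values, not reported measurements, and the frame count assumes frames are processed one after another on a single stream.

```python
latency_s = 0.0062          # single-image inference latency of YOLO-CY
reduction = 0.244           # relative latency reduction versus YOLO11n
pipeline_fps = 125          # full-pipeline frame rate
fruits_per_second = 15      # sorting rhythm considered in the text

inference_fps = 1.0 / latency_s                     # about 161 FPS
baseline_latency_s = latency_s / (1.0 - reduction)  # implied YOLO11n latency
frame_time_ms = 1000.0 / pipeline_fps               # 8 ms per frame
budget_ms = 1000.0 / fruits_per_second              # about 66.7 ms per fruit
frames_per_fruit = budget_ms / frame_time_ms        # frames that fit in one budget

print(round(inference_fps, 1), round(baseline_latency_s, 4),
      frame_time_ms, round(budget_ms, 1), round(frames_per_fruit, 1))
```

At 15 fruits per second the time budget per fruit is about 66.7 ms, which leaves room for roughly eight full-pipeline frames, for example from several camera views, before the actuation stage.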
To evaluate whether the improvement of the proposed YOLO-CY over the baseline YOLO11n is statistically significant, a paired t-test was conducted on the results of five independent runs. Across the five runs, the performance differences of YOLO-CY relative to YOLO11n were 1.6, 1.6, 1.8, 1.4, and 1.6 percentage points, all of which are positive, indicating a stable improvement. The mean gain is 1.6 percentage points with a standard deviation of 0.13, suggesting low variability across repeated experiments. The paired t-test yields a p-value below 0.0001, so the observed improvement is highly statistically significant. The performance gain is therefore not caused by random fluctuations but reflects a consistent enhancement over the baseline model. This result shows that the improved modules deliver significant performance gains without imposing excessive computational overhead and preserve the lightweight characteristics of the baseline, which makes YOLO-CY well suited to embedded deployment and real-time defect detection on automated cherry-sorting production lines. Overall, the experimental results validate the effectiveness of the proposed modules and confirm that YOLO-CY improves both detection performance and real-time inference efficiency for cherry cracking defect detection, outperforming all selected mainstream detection models.
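The test statistic can be reproduced from the five reported differences. The sketch below assumes a two-sided paired t-test against a mean difference of zero and uses the closed-form CDF of Student's t distribution with 4 degrees of freedom, so it is only valid for five runs. The helper name `t_cdf_df4` is introduced here for illustration.

```python
import math
import statistics

# Per-run gains of YOLO-CY over YOLO11n, in percentage points (from the text).
diffs = [1.6, 1.6, 1.8, 1.4, 1.6]

n = len(diffs)                             # 5 runs, so 4 degrees of freedom
mean_gain = statistics.mean(diffs)
sample_sd = statistics.stdev(diffs)        # n - 1 denominator, used by the t-test
population_sd = statistics.pstdev(diffs)   # n denominator

# Paired t-test against a zero mean difference.
t_stat = mean_gain / (sample_sd / math.sqrt(n))


def t_cdf_df4(t):
    """Closed-form CDF of Student's t distribution with 4 degrees of freedom."""
    s = (t / 2.0) / math.sqrt(1.0 + t * t / 4.0)
    return 0.5 + 0.75 * s * (1.0 - s * s / 3.0)


p_two_sided = 2.0 * (1.0 - t_cdf_df4(abs(t_stat)))
print(round(mean_gain, 2), round(sample_sd, 3), round(population_sd, 3))
print(round(t_stat, 1), p_two_sided)
```

The standard deviation of 0.13 matches the n-denominator form, while the n - 1 form used inside the t statistic is about 0.14. With either convention the t statistic is large and the p-value stays below 0.0001.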
Table 4 presents the ablation results based on the YOLO11n baseline with three proposed modules incorporated incrementally. The original YOLO11n achieves a precision of 93.50%, a recall of 88.81%, and mAP50 and mAP50-95 values of 93.25% and 63.09%, respectively. Although such performance is competitive, the limited recall reveals the inherent missing detection risk for tiny or low-contrast cracks. This limitation stems from insufficient multi-scale feature fusion of the vanilla C3k2 module, weak local discrimination of the original C2PSA, and severe spatial information loss induced by conventional upsampling operations. With the sole integration of C3k2_AdditiveBlock, the precision and recall increase to 93.55% and 88.85%, accompanied by a slight mAP50 improvement to 93.30%, while only 0.01 M parameters and 0.05 GFLOPs are added. Such performance gains benefit from the additive attention mechanism, which efficiently enhances the backbone’s capability to capture multi-scale crack features and compensates for the inadequate receptive field coverage of small targets in the original C3k2. When only the EUCB module is adopted, recall obtains the most prominent improvement from 88.81% to 89.02%, with mAP50 rising to 93.45% and mAP50-95 reaching 63.30%. This outcome demonstrates that the learnable upsampling design effectively remedies the high-frequency detail loss caused by interpolated upsampling in YOLO11n, preserves spatial localization cues of crack boundaries during feature fusion, and alleviates the missing detection of small objects. The individual replacement with C2PSA_CGLU increases recall to 88.91% and mAP50 to 93.34%. This module realizes fine-grained channel-wise feature filtering and reduces feature confusion among crack textures, normal fruit surfaces and stem shadows. Dual-module combination experiments further verify the complementarity among the three designs. 
Specifically, C3k2_AdditiveBlock enriches feature representations, EUCB retains spatial details, and C2PSA_CGLU refines feature discrimination. Targeted optimizations are implemented in feature extraction, feature fusion and feature screening to address the core defects of YOLO11n. The full integration of all three modules yields the optimal performance of YOLO-CY, with precision of 93.90%, recall of 90.81%, mAP50 of 94.88% and mAP50-95 of 64.92%. Compared with the YOLO11n baseline, recall is improved by 2.00 percentage points, while mAP50 and mAP50-95 are boosted by 1.63 and 1.83 percentage points, respectively. Meanwhile, the model parameters merely increase from 2.59 M to 2.62 M, and computational complexity rises slightly from 6.44 GFLOPs to 6.60 GFLOPs.
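The reported gains and the relative cost of the full model follow directly from the baseline and YOLO-CY values quoted above, as the short check below shows. The dictionary keys are illustrative labels, and the relative increases are derived here rather than reported in Table 4.

```python
baseline = {"P": 93.50, "R": 88.81, "mAP50": 93.25, "mAP50-95": 63.09,
            "Params_M": 2.59, "GFLOPs": 6.44}
yolo_cy = {"P": 93.90, "R": 90.81, "mAP50": 94.88, "mAP50-95": 64.92,
           "Params_M": 2.62, "GFLOPs": 6.60}

# Absolute change per metric (percentage points for P, R and mAP).
gains = {k: round(yolo_cy[k] - baseline[k], 2) for k in baseline}

# Relative increase in model size and compute, in percent.
relative_cost = {k: round(100 * (yolo_cy[k] - baseline[k]) / baseline[k], 2)
                 for k in ("Params_M", "GFLOPs")}

print(gains)
print(relative_cost)
```

The parameter count grows by about 1.2% and the computational cost by about 2.5%, against a recall gain of 2.00 percentage points.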
Table 5 compares the detection performance of different models on hard samples. The constructed hard-sample dataset covers challenging scenarios, including slender linear cracks, stem-overlapped cracks, low-contrast defects, densely distributed multiple cracks, and small objects, to comprehensively evaluate the model’s robustness and generalization in complex practical environments. Experimental results indicate that RT-DETR achieves an mAP50 of only 86.4%, with a precision of 86.8% and a recall of 82.7%. This demonstrates its limited perception capability for tiny cracks and low-contrast defects. Dispersed attention distribution hinders the accurate localization of critical defective regions. By comparison, YOLOv8n and YOLOv12n yield mAP50 values of 88.1% and 87.4%, respectively. Although superior to RT-DETR, they still suffer from obvious false and missing detections under stem-overlapped cracks and dense multi-crack conditions, revealing deficiencies in multi-scale feature fusion and local detail discrimination. APNet and the baseline YOLO11n attain higher mAP50 values of 90.3% and 89.6%, achieving substantial performance improvements. Nevertheless, weakened attention and insufficient feature focusing still restrict their recognition of low-contrast and slender linear cracks, thereby limiting further recall gains. In contrast, the proposed YOLO-CY achieves the best overall performance on hard samples, with an mAP50 of 92.6%, a precision of 93.2%, and a recall of 90.0%. Such remarkable results validate the synergistic effects of the three modified modules. The optimized multi-scale feature extraction strengthens the sensitivity to microcracks. Refined channel attention enhances the differentiation between cracks, fruit stems and surface textures. Meanwhile, the learnable upsampling strategy mitigates spatial information loss and guarantees precise boundary localization of crack defects. 
The high recall indicates that YOLO-CY effectively reduces missed detections and localization deviations under difficult hard-sample conditions, demonstrating strong robustness and practical deployment potential.
Table 6 presents the performance of YOLO-CY under six-fold cross-validation to evaluate its stability and generalization capability. The mAP50 of the model remains between 94.75% and 94.95% across all folds, with precision ranging from 93.75% to 93.98% and recall varying within a narrow interval of 90.65% to 90.89%. All metrics therefore vary only marginally. The six-fold average mAP50 reaches 94.86%, while the average precision and recall are 93.88% and 90.79%, respectively. These results are highly consistent with those of the preceding comparative experiments, which verifies the reliability and reproducibility of the model’s performance. The low variance across folds further demonstrates that YOLO-CY is insensitive to the data partitioning and maintains stable detection performance despite differences in cherry cultivars, crack morphologies and imaging conditions. This stability stems from the synergistic design of C3k2_AdditiveBlock, C2PSA_CGLU and EUCB. The three modules strengthen robust feature learning through multi-scale feature extraction, channel attention regulation and spatial information retention, respectively. Accordingly, the model can extract well-generalized discriminative crack features even when the training data partitioning changes, without introducing additional overfitting risks.
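A six-fold protocol that respects fruit-level grouping can be sketched as follows. This is an assumed construction for illustration: the text does not state how folds were assigned, and the round-robin assignment and integer fruit identifiers are hypothetical.

```python
def grouped_k_fold(fruit_ids, k=6):
    """Yield (train_ids, val_ids) pairs; each fruit is validated exactly once."""
    unique_ids = sorted(set(fruit_ids))
    folds = [unique_ids[i::k] for i in range(k)]   # round-robin assignment
    for i in range(k):
        val_ids = set(folds[i])
        train_ids = {f for j, fold in enumerate(folds) if j != i for f in fold}
        yield train_ids, val_ids


# Toy example with 60 hypothetical fruit identifiers.
splits = list(grouped_k_fold(range(60), k=6))
for train_ids, val_ids in splits:
    # No fruit appears on both sides of any fold.
    assert train_ids.isdisjoint(val_ids)
```

Each fold trains on five sixths of the fruits and validates on the remaining sixth, so the per-fold metrics are measured on fruits the model has not seen.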
To further investigate the regions of interest on which each network focuses, Figure 5 presents attention heatmaps for cherry cracking defect detection generated with Grad-CAM [45]. In these heatmaps, red regions denote the key areas emphasized by the network during detection, while blue regions represent low-attention areas. This experiment intuitively reflects the attention-focusing capability and feature-recognition preferences of different models with respect to cherry-crack characteristics. The results show that the YOLOv8 model focuses poorly on cherry cracks and tends to treat non-defective regions, such as natural peel textures and pedicel shadows, as core regions of interest, which introduces significant interference into the feature extraction of genuine cracks. Although the RT-DETR model captures features of some crack regions, its attention distribution is relatively scattered and focuses insufficiently on tiny linear cracks on the cherry surface, making it difficult to accurately locate the core defective regions. As the baseline model in this study, YOLO11n focuses attention better than the former two models and can effectively identify medium and large crack regions. Nevertheless, its attention weakens and its feature focusing becomes inadequate on fine micro-cracks and irregular shallow cracks. In contrast, the proposed YOLO-CY model concentrates its attention on the core defective regions, including linear fissures and irregular damage, and suppresses the interference caused by natural peel textures, pedicel shadows, and surface spots. Its attention distribution is highly consistent with the spatial locations and morphological characteristics of actual cracks. This improvement is attributed to the C3k2_AdditiveBlock module, which enhances the multi-scale feature fusion capability of the model, and the C2PSA_CGLU module, which strengthens the discrimination and screening of local crack defect features. 
These modules enable the model to accurately recognize the visual characteristics of cracks and allocate core attention to them, verifying that the improved modules make the feature extraction logic and attention mechanism of the model more suitable for the detection of tiny cherry cracks.
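The core Grad-CAM computation behind such heatmaps can be sketched in a few lines. The example below is a framework-free illustration on a toy 2 x 2 feature map. How the activations and gradients are obtained from a detector, and which detection score is differentiated, are left open here and would depend on the implementation.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from one layer's activations and gradients.

    Both inputs are nested lists shaped [channels][height][width];
    `gradients` holds d(score)/d(activation) for the chosen target score.
    """
    height, width = len(activations[0]), len(activations[0][0])

    # Channel weights: global average pooling of the gradients.
    weights = [sum(map(sum, g)) / (height * width) for g in gradients]

    # Weighted sum of activation maps followed by ReLU.
    cam = [[max(0.0, sum(w * a[y][x] for w, a in zip(weights, activations)))
            for x in range(width)] for y in range(height)]

    # Scale to [0, 1] so the map can be rendered as a heatmap.
    peak = max(max(row) for row in cam)
    return [[v / peak for v in row] for row in cam] if peak > 0 else cam


acts = [[[1.0, 0.0], [0.0, 0.0]],       # channel 0 fires at the top-left cell
        [[0.0, 0.0], [0.0, 2.0]]]       # channel 1 fires at the bottom-right cell
grads = [[[1.0, 1.0], [1.0, 1.0]],      # channel 0 supports the score
         [[-1.0, -1.0], [-1.0, -1.0]]]  # channel 1 opposes it
heatmap = grad_cam(acts, grads)
```

In this toy case only the top-left cell is highlighted, because the ReLU discards the location whose channel contributes negatively to the score. This is what makes the red regions correspond to evidence in favor of the detection.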
To better evaluate the detection performance on diverse targets and challenging samples, Figure 6 presents cherry crack detection results on the multi-sample and hard-sample test sets. The experiments were carried out on cherry samples of different cultivars, varying crack severity levels, and diverse imaging conditions. The hard samples mainly consist of tiny linear cracks, cracks overlapping with pedicels, low-contrast cracks, and densely distributed multiple cracks, which are representative of high-difficulty cases in actual production. These experiments are designed to validate the detection robustness and generalization performance of the model in complex, realistic scenarios. According to the results, the YOLOv8 model suffers from severe missed detection on micro-cracks and low-contrast cracks and produces obvious false detections on samples where cracks overlap with pedicels, since it cannot effectively distinguish the visual features of crack defects from those of pedicels. Although the RT-DETR model can detect some crack defects in challenging samples, it suffers from low localization accuracy and severe bounding box drift. Furthermore, it tends to merge detection boxes and miss individual cracks in densely distributed multi-crack regions. The YOLO11n baseline model performs significantly better than the above two models, with a clear reduction in both missed and false detections. However, it remains limited in detecting ultra-fine linear cracks and low-contrast cracks under strong illumination and reflection. By contrast, the YOLO-CY model proposed in this study exhibits better detection capability on both normal and difficult samples. It recognizes cracks of various grades, including micro-cracks, shallow cracks, and deep cracks. 
For difficult samples such as cracks overlapping with pedicels, low-contrast cracks, and densely distributed multiple cracks, the proposed model mitigates the missed detection, false detection, and localization deviation observed in the other models. Its detection boxes closely follow the actual contours and spatial locations of the cracks.
Figure 7 shows how the loss functions and detection performance indicators evolve during the training of the YOLO-CY model, reflecting the convergence behavior of the network. As the number of training epochs increases, the bounding box regression loss, classification loss, and distribution focal loss on the training set decrease continuously and steadily without significant oscillations. Meanwhile, the corresponding losses on the validation set decline synchronously and finally reach a stable state. This indicates that the network parameters have been effectively optimized, that the feature learning process for cherry cracking defects is stable, and that there are no signs of overfitting, vanishing gradients, or training instability. It also verifies the good compatibility between the designed modules and the adopted training strategies. In terms of detection performance metrics, the precision and recall of the model gradually increase and tend to saturate as training proceeds, and both mAP50 and mAP50-95 rise steadily and finally reach a relatively high level without significant decline. This demonstrates that the model’s detection capability for cherry cracking defects is continuously enhanced during training. The model achieves high-confidence defect identification and localization and also maintains strong multi-threshold detection performance and feature representation capability for small-target cherry cracking defects with diverse morphologies and scales.