1. Introduction
Japanese quince is a temperate fruit species characterized by aromatic, acid-rich fruits that undergo pronounced physicochemical changes during ripening, making it suitable for non-destructive maturity detection in postharvest and precision agriculture studies. With the continuous growth of the global population and the increasing demand for high-quality agricultural products, agriculture is rapidly transitioning toward digitalization and intelligent management. In orchard production systems, accurate monitoring of fruit maturity is essential for determining optimal harvest timing, improving supply chain efficiency, and ensuring fruit quality. However, traditional maturity assessment and harvesting practices largely depend on manual experience, which is not only labor-intensive and inefficient but also prone to subjective bias, often resulting in premature or delayed harvesting in large-scale orchards and ultimately reducing economic returns [1].
In recent years, computer vision technology, particularly deep learning-based object detection, has demonstrated considerable potential in the agricultural sector. Kamilaris and Prenafeta-Boldú systematically reviewed the research progress and application status of deep learning methods in agriculture, indicating that deep learning provides essential technical support for agricultural intelligence [2]. While early methodologies relied on two-stage mechanisms (e.g., Faster R-CNN) to ensure accuracy, the advent of single-stage paradigms, most notably the YOLO lineage, marked a critical milestone. These approaches substantially narrowed the trade-off between computational speed and detection performance [3]. Terven et al. surveyed the YOLO lineage from the original YOLOv1 to the more recent YOLOv8 and YOLO-NAS, systematically examining the architectural innovations and performance trajectory of each iteration [4]. In the field of fruit detection, Lawal proposed a modified YOLOv3 framework for tomato detection, which significantly improved detection performance in natural orchard environments by optimizing the network structure [5,6]. The latest YOLOv11 model further improves feature extraction efficiency through an enhanced backbone and neck architecture, establishing a new benchmark in real-time object detection [7].
However, directly transferring models that perform well on general datasets to unstructured natural orchard environments faces the dual challenges of environmental adaptability and hardware resource limitations. On the one hand, orchard environments are highly complex and uncontrollable. Unlike controlled laboratory settings, dramatic fluctuations in natural lighting (such as strong backlighting and dappled shadows) and random occlusion of fruits by branches and leaves can easily cause loss of visual information. Furthermore, fruits at different maturity stages often display only subtle phenotypic differences, and unripe fruits in particular can be highly similar in color to background foliage. This makes it difficult for general models to capture the fine-grained features that distinguish maturity levels, resulting in higher missed-detection and false-detection rates in complex scenarios [8,9]. On the other hand, agricultural automation applications typically require deploying detection models on edge devices such as mobile robots, drones, or handheld terminals. These devices have very limited computational throughput, storage space, and power budgets. Existing high-performance detection models often come with massive parameter counts and complex network structures, leading to excessive inference latency that fails to meet the stringent “real-time” requirements of agricultural operations.
Lightweight object detection network design has therefore become a research hotspot. Addressing detection against natural agricultural backgrounds, Chen et al. introduced GSBF-YOLO, a model that leverages the GSim strategy to minimize parameter overhead. Their results indicate that it characterizes tomato ripeness accurately despite environmental complexity [10]. Building upon the YOLO11n baseline, Li et al. developed the YOLO11-LES framework for assessing strawberry maturity. By combining a lightweight adaptive weighting downsampling scheme with a spatial-enhanced attention mechanism, the model limits its storage footprint to 4.6 MB while securing a 2.9% gain in precision [11]. For pomegranate recognition, Chen et al. devised the PL-YOLO framework, which incorporates an edge-feature extraction unit and a context-guided attention FPN to cope with environmental complexity [12]. In parallel, Lou et al. developed YOLO-TLA, a streamlined architecture that employs C3CrossCovn blocks and a specialized detection head for small targets [13]. While these innovations have reduced computational burdens, balancing fine-grained feature preservation against inference efficiency in unstructured orchard settings remains an open problem.
Traditional pooling often incurs a loss of high-frequency information (e.g., texture). As a remedy, Williams and Li proposed wavelet pooling, which compresses features by discarding specific sub-bands during wavelet decomposition and was reported to mitigate the overfitting associated with max pooling more effectively than neighborhood-based pooling [14]. Subsequent studies, such as the MWCNN model by Liu et al., integrated wavelet transforms to balance resolution and receptive field size in tasks such as super-resolution [15]. Additionally, Brito et al. proposed a multi-pooling network that combines the advantages of max pooling and wavelet pooling and adjusts the output dimensions through 1 × 1 convolution, achieving better trade-offs in semantic segmentation tasks [16]. However, existing wavelet pooling methods are primarily applied to image restoration and classification tasks, with limited applications in agricultural object detection.
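The sub-band idea behind wavelet pooling can be illustrated with a single-level 2D Haar transform. The sketch below is a pure-Python illustration under stated assumptions: the function names `haar_dwt2` and `wavelet_pool` are hypothetical, the sub-band labels follow one common convention (others swap LH and HL), and the code is not the WaveletPool module of WSS-YOLO or the exact procedure of the cited works.

```python
def haar_dwt2(x):
    """Single-level 2D Haar transform of a 2D list with even height and width.

    Returns four sub-bands (LL, LH, HL, HH), each half the size of x.
    Orthonormal scaling is used, so total energy is preserved.
    """
    h, w = len(x), len(x[0])
    if h % 2 or w % 2:
        raise ValueError("height and width must be even")
    ll, lh, hl, hh = [], [], [], []
    for i in range(0, h, 2):
        r_ll, r_lh, r_hl, r_hh = [], [], [], []
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            r_ll.append((a + b + c + d) / 2)  # low-pass approximation
            r_lh.append((a + b - c - d) / 2)  # top rows minus bottom rows
            r_hl.append((a - b + c - d) / 2)  # left columns minus right columns
            r_hh.append((a - b - c + d) / 2)  # diagonal detail
        ll.append(r_ll)
        lh.append(r_lh)
        hl.append(r_hl)
        hh.append(r_hh)
    return ll, lh, hl, hh


def wavelet_pool(x):
    """Downsample by 2 by keeping only the LL sub-band (detail bands discarded).

    With Haar filters this equals twice the 2x2 average of each block.
    """
    ll, _, _, _ = haar_dwt2(x)
    return ll
```

Keeping all four sub-bands makes the transform invertible, whereas keeping only LL discards the texture-like detail carried by the other three; which bands a given pooling layer retains is a design choice.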
Attention mechanisms, as important techniques for enhancing network performance, have been widely applied in agricultural vision. Yang et al. formulated SimAM, a parameter-free operator that derives 3D attention weights by minimizing an energy function. It is simple to implement, requiring only a few lines of code, and estimates the importance of each neuron without adding parameters to the model [17]. In contrast, the CBAM module (Woo et al.) adopts a sequential strategy, refining features first along the channel axis and then along the spatial axis to achieve adaptive calibration [18]. Furthermore, Hu et al. pioneered the Squeeze-and-Excitation (SE) paradigm, which explicitly models channel interdependencies to dynamically recalibrate feature responses and has proved effective across diverse CNN backbones [19].
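To make the parameter-free nature of SimAM concrete, the following pure-Python sketch applies the commonly used closed-form weighting to a single feature-map channel. It follows the widely circulated reference formulation as understood here (inverse energy from the squared deviation and the channel variance, followed by a sigmoid); the function names and the default `e_lambda = 1e-4` are assumptions for illustration, and the exact integration used in WSS-YOLO is not specified in this section.

```python
import math


def simam_weights(x, e_lambda=1e-4):
    """Per-position attention weights for one channel given as a 2D list.

    Positions that deviate more from the channel mean receive larger weights.
    No trainable parameters are involved; e_lambda is a small regularizer.
    """
    h, w = len(x), len(x[0])
    n = h * w - 1
    if n < 1:
        raise ValueError("channel must contain at least two positions")
    mu = sum(sum(row) for row in x) / (h * w)
    sq_dev = [[(v - mu) ** 2 for v in row] for row in x]
    var = sum(sum(row) for row in sq_dev) / n
    weights = []
    for row in sq_dev:
        # inverse energy, then sigmoid
        weights.append([
            1.0 / (1.0 + math.exp(-(d / (4.0 * (var + e_lambda)) + 0.5)))
            for d in row
        ])
    return weights


def simam_channel(x, e_lambda=1e-4):
    """Rescale each activation by its SimAM weight."""
    wts = simam_weights(x, e_lambda)
    return [[v * wv for v, wv in zip(rx, rw)] for rx, rw in zip(x, wts)]
```

In a real network the same computation would run per channel over a batched tensor; the list-based form is used only to keep the example dependency-free.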
In network architecture optimization, Li et al. proposed GSConv, a convolution strategy that concatenates the output of a standard convolution with the output of a depthwise convolution and then applies channel shuffling. This significantly reduces parameters and computational cost while maintaining detection performance, with FLOPs approaching half those of a standard convolution when the channel count is large. The VoVGSCSP module, derived from the GSConv operation, fuses the CSP and VoVNet topologies. This hybrid architecture improves the speed-accuracy trade-off by reducing computational complexity through efficient feature reuse and parallel processing [20]. Zhou et al. proposed a lightweight real-time object detection method based on YOLOv4 for complex scenes, replacing the CSPDarknet53 backbone with MobileNetV3 and using depthwise over-parameterized convolutional layers to improve feature extraction, achieving 41.82 fps on a Titan X while maintaining competitive accuracy [21]. Targeting resource-limited computing environments, Chen et al. developed shuffle-octave-yolo. When deployed on the NVIDIA Jetson TX2, this architecture attained a mean Average Precision (mAP) of 65.97% at 30.9 fps, a favorable compromise between speed and detection accuracy [22].
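The claim that GSConv approaches half the cost of a standard convolution at large channel counts can be checked with a rough weight count. The sketch below assumes a GSConv layout in which a dense k × k convolution produces half of the output channels and a depthwise convolution (kernel size `dw_k`, set to 5 here as an assumption) produces the other half; biases, normalization, and the shuffle are ignored, and the helper names are hypothetical.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution (bias and normalization ignored)."""
    return (c_in // groups) * c_out * k * k


def gsconv_params(c_in, c_out, k, dw_k=5):
    """Approximate weight count of a GSConv-style layer.

    Assumed layout: dense conv to c_out/2 channels, then a depthwise conv
    on those channels; the two halves are concatenated and shuffled.
    """
    half = c_out // 2
    dense = conv_params(c_in, half, k)
    depthwise = conv_params(half, half, dw_k, groups=half)
    return dense + depthwise


def cost_ratio(channels, k=3):
    """GSConv cost relative to a standard conv with equal in/out channels."""
    return gsconv_params(channels, channels, k) / conv_params(channels, channels, k)
```

For equal output resolution, FLOPs scale with the weight count, so the same ratio applies: about 0.59 at 16 channels and about 0.505 at 256 channels under these assumptions.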
Analysis of existing research shows that current fruit maturity detection methods face three key challenges: the tension between feature preservation and lightweight design, insufficient multi-scale information fusion, and inadequate environmental adaptability. Existing lightweight methods often sacrifice fine-grained features, while traditional feature fusion structures transmit information incompletely between levels and lack robustness to dense targets, occlusion, and natural environmental variation. Therefore, this study proposes WSS-YOLO, an improved lightweight fruit maturity detection network based on YOLOv11n. The model is designed for edge deployment in complex orchard environments. It uses WaveletPool to losslessly preserve texture detail through the multi-resolution characteristics of wavelet transforms, adopts a GSConv-based Slim-neck architecture to reduce parameter count and computational cost, and integrates the parameter-free attention mechanism SimAM to focus adaptively on key fruit regions and suppress noise, thereby balancing computational efficiency and detection accuracy.
3. Results
3.1. Experimental Environment
All empirical evaluations were executed on a workstation running the Ubuntu 18.04 operating system. The deep learning environment was configured using the PyTorch framework (version 1.8.0 + cuda11.1) compatible with Python 3.9.13. To ensure efficient computational processing, hardware acceleration was provided by an NVIDIA GeForce RTX 3090 GPU equipped with 24 GB of video memory.
Table 1 outlines the specific hyperparameter settings adopted for model training.
3.2. Evaluation Criteria
To systematically evaluate the performance of YOLOv11n and its enhanced variants in fruit maturity analysis, this study employs established metrics: Precision, Recall, and mean Average Precision (mAP) [27]. As formalized in Equations (3)–(6), Precision gauges the reliability of positive predictions as the fraction of true positives (TPs) among all positive predictions (the sum of TPs and false positives, FPs). In contrast, Recall assesses the model’s sensitivity, indicating the fraction of ground-truth fruit instances that are successfully detected (TPs relative to the sum of TPs and false negatives, FNs).
Regarding aggregate performance, mAP functions as the principal indicator. Specifically, mAP@0.5 is the mean of the per-class average precision (AP) values computed at a single Intersection over Union (IoU) threshold of 0.5. Because AP summarizes the precision–recall curve, this metric reflects the model’s capacity to balance precision and recall, with higher scores denoting stronger detection performance. To offer a more rigorous analysis of localization quality, mAP@0.5:0.95 is also computed. This metric averages mAP across ten IoU thresholds from 0.5 to 0.95 (in 0.05 increments), giving a finer view of detection quality under stricter overlap requirements.
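The quantities described above can be sketched in a few lines using their standard definitions. The helper names below are hypothetical, boxes are assumed to be `(x1, y1, x2, y2)` corner coordinates, and the notation may differ from that of the paper's own equations.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# The ten IoU thresholds averaged by mAP@0.5:0.95.
IOU_THRESHOLDS = [round(0.5 + 0.05 * i, 2) for i in range(10)]


def mean_over_thresholds(map_by_threshold):
    """Average a mapping {IoU threshold: mAP at that threshold}."""
    return sum(map_by_threshold[t] for t in IOU_THRESHOLDS) / len(IOU_THRESHOLDS)
```

A detection counts as a true positive at a given threshold only if its IoU with a matching ground-truth box of the same class meets that threshold, which is why mAP@0.5:0.95 is never higher than mAP@0.5.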
3.3. Model Training Process
Figure 7 presents the training and validation loss curves together with the confusion matrix of the proposed model. The loss curves include box loss, classification loss (cls loss), and distribution focal loss (dfl loss) for both the training and validation sets. During training, all loss values decreased rapidly in the initial epochs and gradually converged to stable levels as training progressed. The trends observed on the validation set were consistent with those on the training set, indicating stable optimization without evident overfitting. To make the convergence process easier to read, smoothed curves are also provided to reduce the influence of fluctuations and highlight the overall training trends.
The confusion matrix further illustrates the classification performance of the proposed model. Most samples were correctly classified, as evidenced by the dominant diagonal entries. Specifically, 1448 ripe quinces and 1085 unripe quinces were correctly identified. Confusion between ripe and unripe quinces was limited, with only three unripe samples misclassified as ripe and two ripe samples misclassified as unripe. Most classification errors were associated with the background category. Some background regions were incorrectly recognized as ripe or unripe quinces, while a portion of fruit samples were classified as background. Nevertheless, the confusion matrix indicates that the proposed model can effectively distinguish ripe and unripe quinces while maintaining reliable performance in the presence of complex background conditions.
Figure 8 presents the comparison of precision–recall curves between the proposed WSS-YOLO model and the baseline YOLOv11n model. The PR curve of the proposed model encloses a larger area and is positioned closer to the upper-right corner compared with the baseline, indicating stronger overall detection performance. This demonstrates that the proposed model maintains higher precision across a wider range of recall values for both ripe and unripe fruits, reflecting its improved ability to detect target instances while reducing false positives.
3.4. Ablation Experiments
Table 2 presents the ablation study conducted on the YOLOv11n baseline to evaluate the individual and combined effects of WaveletPool, Slim-neck, and SimAM. Compared with the baseline model, which achieved 84.1% precision, 85.8% recall, 90.9% mAP50, 74.2% mAP50-95, 2.64 M parameters, and 6.5 G FLOPs, the introduction of different modules resulted in distinct performance changes in terms of detection accuracy and computational efficiency.
When WaveletPool was introduced alone, the model achieved 85.3% precision, 87.1% recall, 91.8% mAP50, and 74.9% mAP50-95, while the number of parameters decreased from 2.64 M to 2.23 M and FLOPs were reduced from 6.5 G to 4.5 G. This result indicates that WaveletPool contributes to improving detection accuracy while reducing model complexity. The reduction in parameters and FLOPs suggests that the module can replace part of the conventional feature-processing operation with a more compact representation, thereby improving computational efficiency.
After introducing Slim-neck alone, FLOPs decreased from 6.5 G to 4.9 G, confirming its effectiveness in reducing computational cost. However, the mAP50 decreased from 90.9% to 89.7%, indicating that excessive feature compression in the neck structure may weaken feature representation to some extent. In contrast, introducing SimAM alone improved recall from 85.8% to 87.4% and increased mAP50 from 90.9% to 91.1% without increasing the number of parameters, suggesting that SimAM enhances feature discrimination in a parameter-free manner.
The intermediate combinations further reveal the interaction among different modules. The combination of WaveletPool and Slim-neck achieved the lowest FLOPs of 4.1 G but showed a slight decrease in mAP50 and mAP50-95 compared with using WaveletPool alone, which may be attributed to the cumulative effect of feature compression. When SimAM was combined with WaveletPool or Slim-neck, the detection performance improved noticeably, with mAP50 reaching 92.1% and 92.3%, respectively. These results indicate that SimAM can effectively compensate for the potential loss of discriminative information caused by lightweight feature processing.
Finally, the full WSS-YOLO model integrating WaveletPool, Slim-neck, and SimAM achieved the best overall performance, with precision, recall, mAP50, and mAP50-95 reaching 86.4%, 87.5%, 93.4%, and 76.0%, respectively. Meanwhile, the model maintained only 2.23 M parameters and reduced FLOPs to 4.1 G. Compared with the baseline, WSS-YOLO improved precision by 2.3 percentage points, recall by 1.7 percentage points, mAP50 by 2.5 percentage points, and mAP50-95 by 1.8 percentage points, while reducing parameters by 15.5% and FLOPs by 36.9%. These results demonstrate that the three modules are complementary rather than simply additive, enabling WSS-YOLO to achieve a better balance between detection accuracy and computational efficiency.
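The reported gains and reductions can be reproduced directly from the baseline and full-model figures quoted above. The short script below is only an arithmetic check; the dictionary keys and function names are illustrative.

```python
BASELINE = {"precision": 84.1, "recall": 85.8, "mAP50": 90.9,
            "mAP50-95": 74.2, "params_M": 2.64, "flops_G": 6.5}
WSS_YOLO = {"precision": 86.4, "recall": 87.5, "mAP50": 93.4,
            "mAP50-95": 76.0, "params_M": 2.23, "flops_G": 4.1}


def point_gain(key):
    """Absolute gain in percentage points for an accuracy metric."""
    return round(WSS_YOLO[key] - BASELINE[key], 1)


def relative_reduction(key):
    """Relative reduction in percent for a cost metric."""
    return round(100.0 * (BASELINE[key] - WSS_YOLO[key]) / BASELINE[key], 1)


if __name__ == "__main__":
    for metric in ("precision", "recall", "mAP50", "mAP50-95"):
        print(metric, point_gain(metric), "percentage points")
    for cost in ("params_M", "flops_G"):
        print(cost, relative_reduction(cost), "% lower")
```

Accuracy differences are expressed in percentage points because they are differences between two percentages, while the parameter and FLOPs figures are relative reductions with respect to the baseline.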
3.5. Comparison of Different Attention Mechanisms
Table 3 presents the final quantitative results of different attention mechanisms in terms of precision, recall, mAP50, and mAP50-95. SimAM achieved the highest performance among all methods, with precision, recall, mAP50, and mAP50-95 of 86.4%, 87.5%, 93.4%, and 76.0%, respectively. CA and CBAM also performed competitively, while SE, EMA, and GAM showed lower results, particularly in mAP50 and mAP50-95. These results highlight the superior effectiveness of SimAM in enhancing feature representation for accurate detection.
Figure 9 compares the mAP50 training curves, where the SimAM mechanism (blue curve) outperforms the other attention mechanisms. From the beginning of training, SimAM converged faster and surpassed 0.8 mAP50 in fewer epochs than the other methods. Throughout the stabilization phase (epochs 50–200), SimAM consistently maintained an advantage over SE, EMA, and GAM. Even compared with the competitive CA and CBAM, SimAM achieved higher accuracy at the end of training. This trajectory indicates that SimAM provides strong feature extraction capability, leading to improved overall detection performance.
3.6. Comparative Experiments
As shown in Table 4, WSS-YOLO demonstrates improved performance across all evaluation metrics. Specifically, it achieves a precision of 86.4%, recall of 87.5%, mAP50 of 93.4%, and mAP50-95 of 76.0%. Compared with YOLOv11n (precision 84.1%, recall 85.8%, mAP50 90.9%), WSS-YOLO shows improvements of 2.3, 1.7, and 2.5 percentage points, respectively, reflecting better detection accuracy and recall performance.
In addition, when compared to YOLOv11n equipped with lightweight backbones, WSS-YOLO achieves better performance in terms of both accuracy and overall detection capability. For example, YOLOv11n + ShuffleNetV2 achieves a precision of 75.2%, recall of 77.5%, and mAP50 of 84.5%, while YOLOv11n + MobileNetV3 and YOLOv11n + MobileNetV4 achieve precision of 76.3% and 79.4%, recall of 76.3% and 78.0%, and mAP50 of 84.7% and 85.2%, respectively. This comparison indicates that although lightweight backbones reduce model parameters and computational cost, they may sacrifice detection accuracy, whereas WSS-YOLO achieves a favorable trade-off between performance and efficiency.
From the perspective of model complexity, WSS-YOLO exhibits both lightweight design and practical deployability. Its parameter count is only 2.23 M, with 4.1 G FLOPs and a model weight of 4.7 MB, which is significantly lower than YOLOv3 (61.5 M parameters, 154.6 G FLOPs) and even more efficient than other lightweight models such as YOLOv5s (7.03 M parameters, 15.8 G FLOPs). Therefore, WSS-YOLO not only achieves strong detection performance but also maintains low computational cost and a compact model size, suggesting its potential for real-time applications on resource-constrained devices.
3.7. Model Detection and Heatmap Visualization Results
As shown in Figure 10, the WSS-YOLO model demonstrates clear advantages over YOLOv11n in the fruit detection examples. First, in terms of detection accuracy, WSS-YOLO performs more consistently, particularly in recognizing fruits at different maturity stages, with generally higher confidence scores. Specifically, WSS-YOLO achieves high confidence not only for ripe fruits (up to 0.97) but also for unripe fruits (up to 0.96). Compared with YOLOv11n’s confidence scores (as low as 0.59 for unripe fruits), WSS-YOLO is more balanced across maturity stages, indicating more stable behavior in the presented examples.
In terms of localization accuracy, WSS-YOLO produces tighter bounding boxes. Even in scenes with densely packed fruits, complex backgrounds, or overlapping fruits, WSS-YOLO bounds each fruit more accurately, reducing the misidentifications and bounding box deviations that occur with YOLOv11n in these scenes. The advantage is most distinct when the spacing between fruits is small or the background is cluttered.
The heatmap visualization results in Figure 10 further show that WSS-YOLO performs well on scenes with densely packed fruits. The heatmaps show that WSS-YOLO generates more prominent focus regions in fruit areas, indicating that the model focuses more effectively on fruits and reduces background interference. In contrast, YOLOv11n displays more dispersed attention regions, especially for unripe fruits and small inter-fruit spacing, where its focusing capability is weaker, resulting in some fruits not being accurately identified.
Under varying illumination, fruit overlap, and complex backgrounds, WSS-YOLO performs better than YOLOv11n: it maintains high detection accuracy and localization precision, with notably better detection of unripe fruits. YOLOv11n, by contrast, is less stable in these conditions, especially when fruits overlap or the background is complex.
4. Discussion
The proposed design provides a useful reference for lightweight fruit detection. By combining WaveletPool, Slim-neck, and SimAM, the model improves feature preservation and background suppression while keeping computational cost low. WaveletPool helps retain important structural information during downsampling, Slim-neck reduces redundant computation in feature fusion, and SimAM strengthens discriminative fruit-region responses without adding trainable parameters. These characteristics make WSS-YOLO suitable for real-time agricultural applications on resource-constrained devices.
In addition, because each module targets a different task characteristic, the multi-module fusion design avoids introducing redundant mechanisms and thereby limits wasted computation. By optimizing the network architecture to achieve multi-level information fusion, the design mitigates information bottlenecks, which improves the model’s performance in complex orchard environments. In particular, the model maintains accurate detection under illumination variation, occlusion, and fine-grained differences in fruit appearance.
Beyond algorithmic optimization, this study further evaluates the edge deployment capability of the proposed WSS-YOLO model. To assess its practical applicability in in situ agricultural scenarios, the model was deployed on an NVIDIA Jetson Orin Nano, as shown in Figure 11 and Table 5. During deployment, the input image resolution was set to 640 × 640, which was consistent with the validation setting used in the accuracy evaluation. The reported speed of 23.0 FPS refers to model inference performance on the edge device and includes network forward propagation and post-processing, including non-maximum suppression. Image acquisition, data loading, visualization, and result saving were not included in the FPS calculation. No additional TensorRT acceleration, FP16 inference, or INT8 quantization was applied in this experiment. Under these deployment settings, WSS-YOLO maintained real-time inference capability on resource-constrained hardware, suggesting its feasibility for continuous visual monitoring and intelligent agricultural detection tasks.
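A timing harness consistent with the measurement scope described above (forward pass plus post-processing, with data loading excluded) might look like the following sketch. The `infer` callable is a hypothetical stand-in for a function that runs the network and non-maximum suppression on one preloaded 640 × 640 frame; the warm-up count is an assumption, and the authors' actual benchmarking code is not given in the text.

```python
import time


def measure_fps(infer, frames, warmup=5):
    """Frames per second of `infer` over preloaded frames.

    `infer` must block until its results are ready (on a GPU this usually
    means synchronizing before returning), otherwise the timing is too low.
    Loading the frames happens before this call and is not timed.
    """
    for frame in frames[:warmup]:  # warm-up runs are not timed
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed


def latency_ms(fps):
    """Average per-frame latency in milliseconds for a given FPS."""
    return 1000.0 / fps
```

By this conversion, the reported 23.0 FPS corresponds to roughly 43.5 ms per frame for inference and post-processing combined.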
Furthermore, although the proposed WSS-YOLO model achieved promising detection performance, several limitations should be acknowledged. First, the experiments in this study were conducted on a single publicly available dataset collected using a smartphone camera. Although the dataset contains variations in shooting distance, illumination, occlusion, fruit overlap, and background complexity, the images are still derived from a limited orchard scenario. Therefore, the generalizability discussed in this study mainly refers to the environmental variations covered by the current dataset, rather than full transferability across different orchards, acquisition dates, seasons, or weather conditions. In addition, the training, validation, and test sets were constructed using a holdout split from the same dataset. Since all subsets originate from the same data source, similar background patterns or scene characteristics may exist across different subsets. In the current study, independent cross-orchard, cross-day, or cross-weather validation was not conducted. This limitation may lead to a slightly optimistic estimation of the model’s performance in unseen orchard environments.
The model may still face challenges under severe occlusion, dense fruit overlap, extreme illumination, or when fruit color is highly similar to the background. These factors may cause missed detections or inaccurate localization. Therefore, future work will focus on introducing multi-source datasets collected from different orchards, dates, weather conditions, and imaging devices. More rigorous cross-scene and cross-day validation will also be conducted to further evaluate the generalization ability of the proposed model in truly complex agricultural scenarios. In addition, more advanced data augmentation strategies and refined feature extraction methods will be explored to improve detection stability under extreme field conditions.
5. Conclusions
This study proposed WSS-YOLO, a lightweight fruit maturity detection model based on YOLOv11n for quince detection in complex orchard environments. The model integrates WaveletPool, a GSConv-based Slim-neck, and SimAM to improve feature preservation, reduce computational cost, and enhance discriminative responses to fruit regions. These components allow the model to better handle texture loss, background interference, and partial occlusion while maintaining a compact network structure.
Systematic experiments on a multi-scenario quince maturity dataset showed that WSS-YOLO achieved 86.4% precision, 87.5% recall, and 93.4% mAP@0.5, exceeding the baseline YOLOv11n by 2.3, 1.7, and 2.5 percentage points, respectively. Heatmap visualization analysis further indicates that the proposed model improves localization accuracy of fruit targets and reduces background interference in natural environments.
Moreover, the model reduces computational cost while improving performance. WSS-YOLO has only 2.23 M parameters, 4.1 G floating-point operations (FLOPs), and a weight file of only 4.7 MB, showing better overall performance than mainstream lightweight networks such as YOLOv8 and YOLOv5. This lightweight design helps the model meet the real-time and accuracy requirements of agricultural harvesting robots and handheld mobile terminals, providing a feasible approach for non-destructive fruit detection in smart agriculture scenarios. Future work will extend deployment to additional embedded hardware platforms and further evaluate long-term stability and generalization performance in actual harvesting operations.
Overall, the proposed WSS-YOLO demonstrates promising performance for lightweight fruit maturity detection in orchard environments. Future work will explore its application in more diverse orchard scenarios and broader agricultural environments to further evaluate its generalizability and practical potential.