1. Introduction
Tomato is one of the leading vegetable crops worldwide in both production and consumption. Available statistics show that global tomato production was approximately 192 million tons in 2023 [1]. Within the tomato industry chain, maturity is a key factor determining fruit quality, nutritional composition, and market value. Particularly in large-scale agricultural production, accurate assessment of tomato maturity is of great significance for optimizing harvesting time, increasing yield, and reducing postharvest losses. However, traditional maturity detection methods rely mainly on manual observation or sensory evaluation, which suffer from strong subjectivity and poor consistency, making them difficult to apply in large-scale automated production [2,3]. As a result, accurate and automated maturity detection has become an important topic in this field.
Considerable progress has been achieved in the maturity detection of tomatoes and other agricultural products. Early research mainly used traditional image processing methods, in which maturity was determined by extracting features such as color, shape, and texture. Traditional machine vision algorithms, such as denoising, color-space transformation, and morphological filtering, were used to preprocess the images, and manually designed visual features were then combined to distinguish different maturity stages. For instance, Arefi et al. [
4] demonstrated that an electronic nose could effectively characterize changes in volatile compounds during tomato ripening, thereby providing a new sensing approach for maturity assessment. Mendoza et al. [
5] developed a computer vision-based method for tomato maturity detection, in which color features were extracted and combined with a classification model to identify different ripening stages. Building on traditional image processing, subsequent studies further evolved toward approaches integrating classical machine learning methods, which showed stronger discriminative capability than simple threshold-based methods. Unlike simple threshold-based methods, which mainly classify maturity according to fixed feature boundaries, classical machine learning methods usually construct feature vectors from manually extracted image descriptors and then use classifiers to learn the mapping relationship between image features and maturity categories. For instance, Liu [
6] proposed a mature tomato detection method based on Histogram of Oriented Gradients (HOG) and Support Vector Machine (SVM), which achieved favorable detection performance. Nevertheless, these methods still rely strongly on manually designed features and remain vulnerable to missed detections and false alarms in complex scenes with cluttered backgrounds, illumination changes, and ambiguous class boundaries, resulting in limited robustness and poor cross-scene generalization.
With the rapid development of deep learning, agricultural vision tasks have increasingly adopted neural-network-based approaches [
7,
8,
9,
10,
11]. In particular, maturity detection methods based on convolutional neural networks have become a dominant approach in this field. Compared with traditional methods, deep learning models can automatically learn multi-scale representations and are more tolerant of occlusion, object-scale variation, and background complexity. The research focus has also moved gradually from two-stage detectors to one-stage detectors. Two-stage detectors refer to detection frameworks that first generate candidate object regions and then classify and refine these regions in a second stage. Representative methods include Faster Region-based Convolutional Neural Network (Faster R-CNN) and Mask Region-based Convolutional Neural Network (Mask R-CNN). This type of detector usually has strong localization and recognition capability because the region proposal and object classification processes are performed separately. Although two-stage methods usually provide high accuracy, their heavy computation and longer inference time restrict their use in real-time industrial scenarios. For example, the improved Mask R-CNN proposed by Zu et al. [
12] achieved certain results in tomato maturity detection under greenhouse or occluded conditions, but its generalization ability in dense fruit clusters and complex natural environments remained insufficient. In contrast, one-stage detectors directly predict object categories and bounding-box locations from feature maps without a separate region proposal stage. Typical examples include the YOLO series and SSD. Because detection is completed in a unified forward propagation process, one-stage detectors generally have faster inference speed and are therefore more suitable for real-time maturity detection and automated agricultural production scenarios.
In recent years, one-stage detectors, especially the YOLO (You Only Look Once) family, have gradually become a research focus in tomato maturity detection because they offer a practical compromise between accuracy and real-time speed. Existing studies have improved the accuracy and efficiency of tomato maturity detection by incorporating attention mechanisms [13], enhanced convolution modules [14,15], redesigned network structures [16,17], and optimized loss functions [13]. However, challenges such as confusion between adjacent maturity categories, interference from complex backgrounds, and insufficient cross-scene generalization have not yet been fundamentally resolved. Recent YOLO-based work has mainly pursued attention enhancement, multi-task learning, lightweight architecture design, small-object detection, and multi-scale feature fusion. For example, Li et al. [
18] proposed MHSA-YOLOv8, which enhances global feature modeling through a multi-head self-attention mechanism, thereby improving tomato maturity detection and counting performance; however, detection results still fluctuate in scenarios with severe occlusion, complex backgrounds, and significant illumination interference. An improved method based on YOLOv9 [
19] further enhanced detection accuracy and inference speed, although its parameter scale and computational burden remained relatively high, and its deployment adaptability and robustness in complex environments still require improvement. AITP-YOLO [
20] improved the detection of small, blurred, and occluded targets through multiple strategies and multi-scale feature fusion, showing that feature enhancement is useful for maturity recognition under complex scenes; however, its modeling of fine-grained differences between adjacent maturity stages remained insufficient. YOLO-PGC [
21] improved detection accuracy by enhancing YOLO11 and demonstrated good robustness under different maturity stages, illumination conditions, and occlusion scenarios, but its optimization still focused mainly on overall performance improvement and lacked dedicated design for ambiguous class boundaries and cross-scene generalization. Chen et al. [
22] developed MTD-YOLOv7 for joint maturity detection and fruit-cluster recognition. Although this multi-task framework achieved good accuracy and real-time performance, it was built on a limited dataset of 390 images collected from a single greenhouse scenario, and the multi-task design significantly increased model complexity, which to some extent restricted its generalization ability under complex natural environments, cultivar differences, and dense occlusion conditions.
In summary, although existing approaches have improved detection accuracy, speed, and adaptability to complex scenes, they still face the following challenges in complex agricultural scenarios: confusion among adjacent maturity categories; interference from complex scenes involving leaf occlusion, fruit overlap, scale variation, and natural illumination fluctuations; and the need to effectively balance detection accuracy, model complexity, and cross-scene robustness. Similar issues have also been reported in maturity detection studies of other fruits such as strawberry [14], grape [23], and citrus [
24]. To overcome these limitations, this study takes YOLOv7 as the baseline model, introduces a DCNConv module with adaptive magnitude constraints to enhance local geometric perception, adopts a stability-enhanced ECANet attention mechanism to strengthen channel discriminability, and applies WIoU v3 to stabilize bounding-box regression. Overall, this study improves tomato maturity detection from three complementary aspects: channel feature discrimination, local geometric perception, and bounding-box regression stability, thereby enhancing the recognition of adjacent maturity stages in complex greenhouse scenarios. These modifications form a tomato maturity detection model for complex agricultural scenes, improving adaptability and detection accuracy while providing technical support for automated maturity detection in large-scale agricultural production.
The main contributions of this study are summarized as follows:
(1) A stability-enhanced ECANet module is introduced into the feature fusion path to strengthen channel-wise discriminative responses and suppress background interference, thereby improving the recognition of adjacent maturity categories.
(2) A DCNv2-based DCNConv module with adaptive offset magnitude constraints is designed to enhance local geometric modeling while mitigating training instability caused by unconstrained deformable sampling.
(3) WIoU v3 is introduced to stabilize bounding-box regression, and the proposed design is evaluated through ablation studies, model comparisons, visualization analysis, and cross-dataset experiments to verify its effectiveness and robustness.
3. Experimental Results and Analysis
3.1. Experimental Environment
All experiments in this study were conducted on a 64-bit Windows 11 operating system. The hardware configuration included an Intel Core i5-14400F CPU @ 2.50 GHz and an NVIDIA GeForce RTX 5060 Ti GPU with 16 GB of memory. The software environment consisted of Python 3.9, CUDA 12.8, and PyTorch 2.8. The training settings were as follows: the input image resolution was 640 × 640, the batch size was 12, the number of workers was 6, the number of training epochs was 300, and the learning rate was 0.01.
All models were trained independently under the same experimental settings. The training, validation, and test sets were strictly separated, so that different augmented versions of the same image never appear in more than one subset. During training, the YOLO framework automatically selected the best weights according to its built-in validation-set metrics, which mainly include Precision, Recall, and mAP. The reported results were then obtained by running inference with these weights on the test set, which was used only for final performance evaluation.
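The separation constraint described above can be illustrated with a group-aware split, in which images are assigned to subsets by their source image rather than individually. The following is a minimal sketch only: the file-naming scheme, the 8:1:1 ratio, and the function name `split_by_source` are assumptions made for illustration and are not taken from this study.

```python
import random
from collections import defaultdict


def split_by_source(image_names, source_of, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split images so that all augmented copies of one source share a subset.

    source_of maps an image name to the identifier of its original image.
    The ratios are hypothetical and apply to source images, not to copies.
    """
    groups = defaultdict(list)
    for name in image_names:
        groups[source_of(name)].append(name)

    # Shuffle the source identifiers, not the individual augmented files.
    sources = sorted(groups)
    random.Random(seed).shuffle(sources)

    n_train = int(len(sources) * ratios[0])
    n_val = int(len(sources) * ratios[1])
    parts = {
        "train": sources[:n_train],
        "val": sources[n_train:n_train + n_val],
        "test": sources[n_train + n_val:],
    }
    return {k: [n for s in v for n in groups[s]] for k, v in parts.items()}


# Hypothetical naming scheme "<source>_aug<k>.jpg" with 3 copies per source.
names = [f"img{i:03d}_aug{k}.jpg" for i in range(20) for k in range(3)]
splits = split_by_source(names, lambda n: n.split("_aug")[0])
print({k: len(v) for k, v in splits.items()})
```

Splitting at the level of source images is what prevents an augmented copy of a training image from leaking into the validation or test set.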
3.2. Evaluation Metrics
To provide a comprehensive assessment of model performance in tomato maturity detection, this study evaluated the model from two aspects: detection performance and model complexity. For detection performance, Precision (%), Recall (%), and mAP@0.5 (%) were adopted. The definitions of Precision and Recall are given in Equation (8), where TP denotes true positives, FP denotes false positives, and FN denotes false negatives. Precision is the proportion of true positives among all samples predicted as positive and measures the reliability of positive predictions. Recall is the proportion of ground-truth targets that are correctly detected and measures the model’s ability to find real targets. The mean Average Precision at IoU = 0.5 (mAP@0.5) is the arithmetic mean of the average precision (AP) values over all categories when the IoU threshold is set to 0.5.
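For reference, the standard forms of these metrics, consistent with the verbal definitions above, can be written as follows. The symbols N (number of maturity categories) and P(R) (precision as a function of recall) are introduced here for illustration, and the equation numbering of the original article is not reproduced.

```latex
\begin{align}
\mathrm{Precision} &= \frac{TP}{TP + FP}, &
\mathrm{Recall} &= \frac{TP}{TP + FN}, \\
\mathrm{AP} &= \int_{0}^{1} P(R)\,\mathrm{d}R, &
\mathrm{mAP@0.5} &= \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i \Big|_{\mathrm{IoU}=0.5}
\end{align}
```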
For model complexity, Parameters (M) and GFLOPs (Giga Floating-point Operations) were adopted for evaluation. Parameters (M) denote the number of learnable model parameters in millions, reflecting the storage cost and capacity scale of the model. GFLOPs refer to the number of billion floating-point operations required for one forward pass, which is used to assess computational complexity.
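To make the two complexity measures concrete, the sketch below counts the parameters and multiply-accumulate operations of a single standard convolution layer. The layer shape is a hypothetical example rather than a layer taken from the model in this study, and the factor of two between MACs and FLOPs is a convention that differs between profiling tools.

```python
def conv2d_cost(c_in, c_out, k, h_out, w_out, bias=True):
    """Return (parameters, multiply-accumulates) for a standard k x k convolution."""
    # One k x k x c_in kernel per output channel, plus an optional bias each.
    params = c_in * c_out * k * k + (c_out if bias else 0)
    # Every output pixel of every output channel needs k * k * c_in MACs.
    macs = c_in * c_out * k * k * h_out * w_out
    return params, macs


# Hypothetical layer: 3 -> 32 channels, 3 x 3 kernel, 640 x 640 output map.
params, macs = conv2d_cost(3, 32, 3, 640, 640, bias=False)
print(f"{params / 1e6:.6f} M parameters")
# Counting one MAC as two floating-point operations (an assumed convention).
print(f"{2 * macs / 1e9:.3f} GFLOPs")
```

Summing these quantities over all layers gives the model-level Parameters (M) and GFLOPs figures; the exact totals depend on the counting convention of the profiling tool.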
3.3. Comparison of Attention Mechanisms
To examine the influence of the ECANet attention mechanism on detection performance, comparative experiments were conducted on the basis of the YOLOv7 baseline by incorporating five attention modules, namely CBAM [35], FcaNet [36], SimAM [37], CA [38], and ECANet. The results are presented in Table 1.
Table 1 shows that different attention mechanisms have noticeably different effects on the detection performance of YOLOv7. Compared with the baseline model, introducing ECANet increases Precision and mAP@0.5 to 81.5% and 89.3%, corresponding to improvements of 1.1 and 0.9 percentage points, respectively. Among all compared methods, ECANet achieves the best mAP@0.5 while keeping Recall at 81.3%, the same as the baseline. This indicates that ECANet improves detection accuracy without reducing recall.
Other attention mechanisms show different performance tendencies. CBAM achieves the best Precision of 82.2% (+1.8 percentage points), indicating its effectiveness in feature enhancement. FcaNet steadily improves all evaluation metrics, although its overall gain remains slightly lower than that of ECANet. SimAM increases Recall to 82.7% (+1.4 percentage points), suggesting an advantage in reducing missed detections. In contrast, CA improves Precision but decreases Recall to 80.6% (−0.7 percentage points), indicating a less balanced performance. In addition, after introducing these modules, the Parameters and GFLOPs of the model remain almost unchanged, with only CA showing a very slight increase in GFLOPs. This suggests that the performance improvements mainly come from optimized feature representation rather than a substantial increase in computational cost.
Overall, ECANet achieves the best mAP@0.5 among all compared attention mechanisms. It improves detection accuracy while maintaining stable recall, and delivers superior overall detection performance without introducing additional model complexity.
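The negligible complexity overhead of ECA-style attention follows from its structure: a single small 1D convolution over the pooled channel descriptor. The sketch below illustrates the standard ECA formulation, not the stability-enhanced variant used in this study; the values gamma = 2 and b = 1 follow the original ECA-Net design, while the kernel weights and input values are hypothetical.

```python
import math


def eca_kernel_size(channels, gamma=2, b=1):
    """Adaptive 1D kernel size of standard ECA: nearest odd value of (log2(C) + b) / gamma."""
    t = int(abs((math.log2(channels) + b) / gamma))
    return t if t % 2 else t + 1


def eca_weights(channel_means, kernel):
    """Channel attention weights from globally average-pooled channel values.

    channel_means: one pooled value per channel.
    kernel: odd-length list of 1D convolution weights (hypothetical values here).
    """
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(channel_means) + [0.0] * pad
    weights = []
    for i in range(len(channel_means)):
        # Local cross-channel interaction: 1D convolution over neighbouring channels.
        z = sum(w * padded[i + j] for j, w in enumerate(kernel))
        weights.append(1.0 / (1.0 + math.exp(-z)))  # sigmoid gate
    return weights


print(eca_kernel_size(256))
print(eca_weights([0.2, 1.5, -0.3, 0.8], [0.25, 0.5, 0.25]))
```

Because the only learnable parameters are the k kernel weights, adding such a module changes the parameter count by a handful of values, which is consistent with the nearly unchanged Parameters and GFLOPs reported above.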
To further interpret the above quantitative differences from the perspective of feature representation, Grad-CAM [39] was used to visualize YOLOv7 and its variants with different attention mechanisms, as shown in Figure 11. All heatmaps were generated based on the same input image, the same predicted target, and the same feature layer (the output of the small-scale detection branch).
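The core Grad-CAM computation behind such heatmaps can be sketched as follows, assuming the activations of the chosen feature layer and the gradients of the selected target score with respect to them have already been extracted. How the target score is defined for a detector (for example, the class score of one predicted box) is an assumption left outside this sketch, and the small arrays are toy values.

```python
def grad_cam(activations, gradients):
    """Grad-CAM map from nested lists shaped [C][H][W].

    Each channel is weighted by the spatial mean of its gradient, the weighted
    channels are summed, negative values are clipped (ReLU), and the map is
    scaled to [0, 1].
    """
    n_ch = len(activations)
    height = len(activations[0])
    width = len(activations[0][0])

    # Channel importance: global average of the gradient map.
    alphas = [sum(sum(row) for row in g) / (height * width) for g in gradients]

    cam = [
        [
            max(0.0, sum(alphas[c] * activations[c][y][x] for c in range(n_ch)))
            for x in range(width)
        ]
        for y in range(height)
    ]

    peak = max(max(row) for row in cam)
    if peak > 0:
        cam = [[v / peak for v in row] for row in cam]
    return cam


# Toy example with 2 channels on a 2 x 2 feature map.
acts = [[[1, 0], [0, 2]], [[0, 3], [1, 0]]]
grads = [[[1, 1], [1, 1]], [[-1, -1], [-1, -1]]]
cam = grad_cam(acts, grads)
print(cam)
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image, which is why fixing the input image, target, and feature layer is necessary for a fair comparison between models.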
As shown in Figure 11a, the high-response regions of YOLOv7 are mainly concentrated along the tomato edges, while the activation inside the tomato regions is relatively weak. This suggests that the baseline model relies more on boundary features and provides insufficient representation of the target body. After introducing CBAM (Figure 11b), the activated regions become larger and cover the two tomato bodies more completely. However, part of the background is also activated, indicating limited background suppression. In CA (Figure 11e), the high-response regions cover both tomato bodies, and a low-response interval appears between the two targets, suggesting relatively clear instance separation in multi-object scenes.
In contrast, FcaNet (Figure 11c) and SimAM (Figure 11d) show more scattered response distributions. The activated regions of FcaNet appear fragmented and lack a continuous dominant hotspot, indicating limited focus on key discriminative regions. SimAM provides wider target-region coverage than the baseline and FcaNet, but its high-response regions remain relatively dispersed, and the inter-target distinction is not sufficiently clear. In ECANet (Figure 11f), the activated regions are mainly concentrated on the tomato bodies, while background activation is relatively weak. This indicates that ECANet provides more concentrated target responses and stronger background suppression, although its target coverage is less complete than that of CBAM and CA.
In summary, the Grad-CAM visualizations show that different attention mechanisms differ in target coverage, background suppression, and multi-object discrimination. CBAM and CA provide more complete target coverage, whereas ECANet produces more concentrated responses in discriminative target regions with relatively weak background activation. By contrast, FcaNet and SimAM show more dispersed activation patterns. Combined with the quantitative results above, the higher mAP@0.5 obtained by ECANet is consistent with its stronger target-focusing characteristic. However, there remains room for further improvement in response completeness under densely packed multi-object scenes.
3.4. Comparison of Loss Functions
To further improve the stability of bounding box regression in tomato maturity detection, this study conducted comparative experiments on the basis of the YOLOv7 baseline by introducing four loss functions, namely GIoU [40], Focal-EIoU [41], SIoU [42], and WIoU v3. The results are presented in Table 2.
As shown in Table 2, different loss functions have varying effects on YOLOv7. WIoU v3 achieves the highest Recall of 81.9% (+0.6 percentage points), while Precision and mAP@0.5 reach 81.1% and 88.5% (+0.7 and +0.1 percentage points). The improved Recall is likely attributable to its dynamic gradient weighting, which emphasizes difficult samples such as occluded or low-light tomatoes. The limited Precision gain suggests that false positives caused by color similarity between adjacent maturity categories are better addressed by attention mechanisms than by the regression loss.
GIoU achieves the highest Precision (82.9%, +2.5 points) but with lower Recall (80.7%, −0.6 points). Focal-EIoU provides the highest mAP@0.5 (88.9%) with balanced performance, while SIoU increases Precision to 81.9% but reduces Recall to 79.4% (−1.9 points). From the perspective of model complexity, the number of parameters and GFLOPs remain unchanged across all loss functions, suggesting that the performance differences mainly arise from the different optimization strategies for bounding box regression rather than changes in model complexity.
Overall, WIoU v3 offers the best Recall while maintaining unchanged model complexity, making it the preferred loss for bounding-box regression in the improved YOLOv7 model.
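The dynamic weighting of WIoU v3 can be sketched for a single box pair as below. This is an illustrative scalar version of the published Wise-IoU formulation, not the training code of this study: the hyperparameters alpha = 1.9 and delta = 3 are commonly cited defaults and are assumed here, and `mean_iou_loss` stands for the running mean of the IoU loss that a training loop would maintain.

```python
import math


def iou_xyxy(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def wiou_v3(pred, gt, mean_iou_loss, alpha=1.9, delta=3.0):
    """Scalar WIoU v3 loss for one predicted box and its ground truth."""
    l_iou = 1.0 - iou_xyxy(pred, gt)

    # Width and height of the smallest box enclosing both boxes.
    wg = max(pred[2], gt[2]) - min(pred[0], gt[0])
    hg = max(pred[3], gt[3]) - min(pred[1], gt[1])

    # Distance-based attention term of WIoU v1 (centre offset over enclosing size).
    dx = (pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2
    r_wiou = math.exp((dx * dx + dy * dy) / (wg * wg + hg * hg))

    # Outlier degree and non-monotonic focusing coefficient of v3.
    # In an autograd framework, beta and the enclosing size are treated as constants.
    beta = l_iou / mean_iou_loss
    r = beta / (delta * alpha ** (beta - delta))
    return r * r_wiou * l_iou


print(wiou_v3((0, 0, 2, 2), (1, 0, 3, 2), mean_iou_loss=0.5))
```

The coefficient r is small for both very easy boxes (beta near zero) and extreme outliers (large beta), so gradient is concentrated on boxes of intermediate quality.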
To further verify that WIoU v3 can improve the stability of bounding box regression and enhance detection robustness in complex scenes, this study selected representative samples from greenhouse environments, including single fruit, dense multi-fruit clusters, low-light conditions, complex backgrounds, similar maturity levels, and occlusion, and visually compared the detection results of CIoU and WIoU v3. The comparison results are shown in Table 3.
As shown in
Table 3, WIoU v3 demonstrates better detection performance in most complex scenarios. In the single-object scene, the detection confidence increases from 0.90 to 0.94, indicating that WIoU v3 provides better regression optimization and helps improve prediction stability. However, in the dense multi-fruit scene, the improvement brought by WIoU v3 is relatively limited, and the number of detected bounding boxes is reduced by three compared with the result of CIoU, suggesting that densely occluded scenes remain challenging for the current model. Under low-light conditions, the detection confidences of three tomato targets increase from 0.82, 0.42, and 0.91 to 0.89, 0.66, and 0.95, respectively, indicating that WIoU v3 provides better robustness for bounding box regression under poor illumination. In the complex-background scene, the confidence of the false detection on the red metal frame at the lower part of the image decreases from 0.75 to 0.37, showing that WIoU v3 can suppress high-confidence false detections caused by background interference to some extent. In the similar-maturity scene, the detection confidence of each target increases slightly, for example from 0.91 to 0.93, suggesting that more accurate bounding box regression helps improve discriminative stability between adjacent maturity categories. In the occlusion scene, CIoU fails to detect two occluded fruits, whereas WIoU v3 successfully detects one of them, further demonstrating its better adaptability to difficult samples.
In summary, the advantages of WIoU v3 are mainly reflected in improving bounding box regression stability, enhancing robustness in complex scenes, and suppressing some background-induced false detections, rather than significantly increasing the number of detections in every scenario.
3.5. Ablation Experiment
This study conducted ablation experiments to quantify the contribution of each proposed module to the performance improvement of YOLOv7. The results are presented in Table 4.
As shown in
Table 4, the effects of different improvement modules on YOLOv7 are clearly distinct. When ECANet is introduced alone, Precision and mAP@0.5 increase to 81.5% and 89.3% (+1.1 and +0.9 points), while Recall, Parameters, and GFLOPs remain unchanged relative to the baseline. This indicates that ECANet enhances channel-wise discriminative feature representation and improves detection accuracy without increasing model complexity. When DCNConv is introduced alone, Precision, Recall, and mAP@0.5 reach 79.3%, 81.4%, and 87.2% (−1.1, +0.1, and −1.2 points). Meanwhile, the number of parameters increases slightly from 36.5 M to 36.7 M, and GFLOPs decrease from 103.2 to 96.9. This suggests that although DCNConv reduces computational complexity and improves local geometric modeling, its standalone use may introduce feature misalignment or optimization instability, so its geometric modeling advantage does not directly translate into overall performance gains. WIoU v3 alone provides a modest improvement, increasing Recall to 81.9% without changing Parameters or GFLOPs, reflecting its ability to reduce missed detections.
For module combinations, ECANet + DCNConv improves Recall and mAP@0.5 to 82.9% and 89.2% (+1.6 and +0.8 points), whereas Precision drops to 79.7% (−0.7 points), indicating a trade-off between target recall and false detections due to the interaction of channel attention and deformable convolution. ECANet + WIoU v3 also improves Recall to 82.9% (+1.6 points), while Precision and mAP@0.5 remain close to baseline, showing that channel feature enhancement and regression optimization can complement each other. The combination of DCNConv and WIoU v3 achieves 82.7% Precision and 89.0% mAP@0.5, with GFLOPs reduced to 96.9, demonstrating that regression optimization can partly compensate for the instability of DCNConv when used alone.
When all three modules are combined, the model achieves the highest Precision and mAP@0.5, reaching 83.7% and 89.6% (+3.3 and +1.2 points), indicating that channel feature enhancement, spatial geometric modeling, and regression optimization are more complementary when used together. Meanwhile, GFLOPs decrease by 6.3, and the number of parameters increases only slightly by 0.2 M, showing that the final model improves accuracy without substantially increasing model size. However, the Recall of the three-module combination decreases to 80.5% (−0.8 points), the lowest among all ablation settings. The confusion matrices in Figure 12 show that this decrease is mainly associated with an increase in background false negatives for the fully_ripened category, suggesting that stricter prediction boundaries for some fully ripened tomatoes in complex backgrounds may lead to slightly more missed detections. Overall, the final model achieves a better balance between detection accuracy and computational complexity, though the small reduction in Recall remains a limitation.
To further analyze the category discrimination capability and the precision-recall trade-off of the improved model, this study quantitatively compares YOLOv7 and YOLO-RCM using confusion matrices and PR curves, as shown in Figure 12.
As shown by the confusion matrices in
Figure 12, YOLO-RCM alleviates the confusion between adjacent maturity categories to some extent. Compared with the baseline YOLOv7, the proportion of fully_ripened samples misclassified as half_ripened decreases from 0.11 to 0.08, and the proportion of half_ripened samples misclassified as green decreases from 0.04 to 0.03. Meanwhile, the true positive rate of the half_ripened category increases from 0.73 to 0.75, whereas that of the fully_ripened category slightly decreases from 0.81 to 0.80. These results suggest that YOLO-RCM improves fine-grained discrimination among similar maturity categories, although the improvement is not uniform across all categories.
For background false detections and misses, YOLO-RCM shows category-specific changes. The proportions of background samples misclassified as fully_ripened and half_ripened decrease from 0.20 and 0.28 to 0.19 and 0.27, respectively, indicating a certain reduction in background false detections for these two categories. However, the proportion of background samples misclassified as green increases from 0.52 to 0.54, suggesting that background suppression for the green category is still limited. In addition, the background FN of the fully_ripened category increases from 0.08 to 0.13, while the corresponding values for half_ripened and green remain unchanged at 0.07. This indicates that the decrease in Recall is mainly associated with increased missed detections in the fully_ripened category.
The PR curves further show the improvement of YOLO-RCM in overall detection accuracy. Compared with YOLOv7, the AP values of YOLO-RCM for fully_ripened, half_ripened, and green increase from 0.869, 0.840, and 0.942 to 0.895, 0.849, and 0.945, corresponding to gains of 2.6, 0.9, and 0.3 percentage points, respectively. The overall mAP@0.5 increases from 0.884 to 0.896 (+1.2 percentage points). In the medium-to-high recall range, the PR curves of YOLO-RCM are generally above those of YOLOv7, especially for the fully_ripened and half_ripened categories, indicating improved precision under comparable recall levels.
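The relationship between the per-category AP values and the overall mAP@0.5 quoted above can be checked with a few lines of arithmetic. The sketch below uses only the AP values reported in this section; the variable and function names are illustrative.

```python
# Per-category AP at IoU = 0.5, as reported in the text.
ap_yolov7 = {"fully_ripened": 0.869, "half_ripened": 0.840, "green": 0.942}
ap_yolo_rcm = {"fully_ripened": 0.895, "half_ripened": 0.849, "green": 0.945}


def mean_ap(ap_by_class):
    """mAP is the arithmetic mean of the per-category AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class)


# Gains in percentage points for each category.
gains = {c: round((ap_yolo_rcm[c] - ap_yolov7[c]) * 100, 1) for c in ap_yolov7}

print(round(mean_ap(ap_yolov7), 3), round(mean_ap(ap_yolo_rcm), 3))
print(gains)
```

The means round to 0.884 and 0.896, matching the reported overall mAP@0.5 values, and the largest per-category gain falls on fully_ripened.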
Combined with the ablation results in
Table 4, the above improvements may mainly come from the complementary effects of different modules. ECANet contributes clear gains in Precision and mAP@0.5, suggesting that the reduced confusion between adjacent maturity categories is related to enhanced channel-wise discriminative features. DCNConv mainly shows its effect when combined with WIoU v3, where local geometric modeling and regression optimization jointly improve detection performance. However, the increase in misses for the fully_ripened category (from 0.08 to 0.13) remains a limitation of the improved model and partly explains the decrease in Recall.
3.6. Model Comparison Experiment
To verify the effectiveness of the improved model, this study compared it with representative object detection models. Because both parameter scale and computational complexity affect detection performance [43,44], the version selected from each mainstream model family was one broadly comparable to YOLOv7 in terms of Parameters and GFLOPs. The final comparison included RT-DETR L [45], YOLOv5 L [46], YOLOv8 M [47], YOLO11 L [48], and YOLO26 L [49]. All models were evaluated under the same experimental settings in terms of detection accuracy and efficiency. The comparison results are shown in Table 5 and Figure 13.
As shown in
Table 5, YOLO-RCM achieves the best Precision (83.7%) and mAP@0.5 (89.6%) among all compared models on the current dataset. Compared with the baseline YOLOv7, these two metrics increase by 3.3 and 1.2 percentage points, respectively, indicating that the proposed improvement strategy effectively enhances target discriminability and overall detection accuracy. Although the Recall of YOLO-RCM decreases slightly from 81.3% to 80.5%, it remains higher than that of all other compared models except the baseline, suggesting that the model largely preserves its target detection capability while improving accuracy.
In terms of mAP@0.5, YOLO-RCM outperforms all compared models. Specifically, its mAP@0.5 is higher than that of YOLOv7, YOLOv5 L, and YOLO26 L by 1.2, 4.6, and 4.7 percentage points, respectively, and exceeds YOLO11 L and YOLOv8 M by 4.9 and 12.4 percentage points. Regarding Precision, YOLO-RCM also ranks first, surpassing YOLO26 L by 2.1 percentage points. This indicates that the proposed method has an advantage in reducing false detections and improving the reliability of detection results.
From the perspective of model complexity, YOLO-RCM has 36.7 M parameters, which is only 0.2 M higher than the baseline YOLOv7, while its GFLOPs are reduced from 103.2 to 96.9. This indicates that the proposed method improves accuracy without substantially increasing model size; instead, it reduces computational complexity while keeping the parameter scale nearly unchanged. Compared with YOLOv5 L, which has a higher parameter count and GFLOPs but lower Precision and mAP@0.5, YOLO-RCM shows better efficiency. Furthermore, although YOLOv8 M, YOLO11 L, and YOLO26 L are lighter in terms of Parameters and GFLOPs, their detection accuracy remains lower than that of YOLO-RCM. Overall, YOLO-RCM achieves a favorable balance between detection accuracy and computational cost.
In summary, YOLO-RCM demonstrates the best overall performance on the current tomato dataset. The results confirm that the joint improvement strategy effectively enhances Precision and mAP@0.5 while reducing computational complexity and maintaining a nearly unchanged parameter scale. However, the slight decrease in Recall remains a limitation and should be further investigated in future work.
3.7. Cross-Dataset Robustness Experiment
To further evaluate the generalization ability and robustness of YOLO-RCM, this study selected the publicly available tomato maturity detection dataset TomatOD [50] as an independent external test set (denoted as Dataset B) and conducted cross-dataset robustness experiments on YOLOv7 and YOLO-RCM.
The TomatOD dataset was collected in a greenhouse environment and includes challenging conditions such as overexposure, low illumination, uneven lighting, and complex backgrounds, as shown in Figure 14. The dataset contains 277 tomato images and 2418 tomato annotations, including 431 fully_ripened, 395 half_ripened, and 1592 green instances. In terms of annotation distribution, the TomatOD dataset is class-imbalanced; however, the relative proportions of different categories are consistent with their actual occurrence frequencies in real-world scenes [50].
As shown in Table 6, when evaluated on Dataset B as an external test set, YOLO-RCM exhibits performance gains that are consistent in direction with those observed on Dataset A. Compared with the baseline model, Precision and mAP@0.5 improve by 5.8 and 4.0 percentage points, respectively, and GFLOPs are reduced by 6.3, at the cost of a 1.2 percentage point decrease in Recall and a 0.2 M increase in parameter count. These results suggest that the proposed method does not merely fit the characteristics of a specific dataset but captures more generalizable features of tomato fruits, which supports the robustness and generalization capability of YOLO-RCM across different datasets.
4. Discussion
The experimental results demonstrate that YOLO-RCM can improve the accuracy of tomato maturity detection in complex agricultural scenes. Compared with the baseline YOLOv7, the proposed method shows better overall performance in terms of detection accuracy and robustness. This suggests that the combination of local geometric modeling, channel-wise feature enhancement, and bounding box regression optimization contributes to improved maturity detection under cluttered conditions.
The performance improvement is consistent with the complementary roles of the three modules. The DCNConv module enhances the model’s ability to perceive local geometric variations, making it better suited for detecting tomato targets under challenging conditions such as partial occlusion and overlap. By imposing magnitude constraints on the offsets, the model preserves its deformation modeling capability while preventing sampling points from drifting excessively, thereby ensuring stable representation of local contours. The stability-enhanced ECANet further strengthens discriminative channel responses, which is beneficial for distinguishing adjacent maturity categories with subtle differences. Finally, WIoU v3 improves the stability of bounding box regression, enabling more reliable localization in densely distributed multi-object regions and under complex illumination conditions.
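The effect of an offset magnitude constraint can be illustrated with a minimal sketch. The exact form of the constraint is not restated in this section, so the tanh squashing used here is an assumption chosen for illustration, and the function name `bound_offsets` and the limit `max_offset` are hypothetical.

```python
import math


def bound_offsets(raw_offsets, max_offset=2.0):
    """Squash raw (dx, dy) sampling offsets into [-max_offset, max_offset].

    Assumption: a tanh-based bound. Small offsets pass through almost
    unchanged, while large offsets saturate at the limit, so sampling
    points cannot drift far from their regular grid positions.
    """
    return [
        (
            max_offset * math.tanh(dx / max_offset),
            max_offset * math.tanh(dy / max_offset),
        )
        for dx, dy in raw_offsets
    ]


# Hypothetical raw offsets, in pixels, as predicted by an offset branch.
raw = [(0.1, -0.2), (1.5, 0.8), (50.0, -50.0)]
for before, after in zip(raw, bound_offsets(raw)):
    print(before, "->", after)
```

Under this assumed form, the first offset is nearly unchanged, whereas the last one is limited to about two pixels in each direction, which corresponds to the intended behavior of retaining deformation capacity while suppressing excessive drift.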
Compared with existing YOLO-based maturity detection methods, the proposed method focuses more on complex backgrounds and confusion between adjacent maturity categories. Previous studies have improved performance through attention mechanisms, multi-scale fusion, or multi-task learning, but their effectiveness is still constrained by scene complexity, data diversity, or model complexity. The current results indicate that enhancing local feature adaptability and channel discriminability can be beneficial for fine-grained tomato maturity detection under the tested greenhouse conditions.
Despite these improvements, YOLO-RCM still has certain limitations. The model may still misclassify adjacent maturity categories, especially when the color transition is gradual or when fruits are heavily occluded by stems and leaves. In addition, the Recall of YOLO-RCM is lower than that of YOLOv7, indicating that more targets are missed even though Precision is improved. According to the experimental analysis, the decline in Recall is mainly associated with an increase in missed detections in the fully_ripened category. Moreover, although the proposed method improves detection precision and reduces computational cost, the number of parameters increases slightly. Therefore, further optimization is needed for lightweight deployment in real-time agricultural systems.
Additionally, this study still has several shortcomings. First, the maturity stages of different tomato varieties are not always consistent. For example, some green-ripe tomato varieties remain green even when fully mature, making them highly similar in color to the immature stage of ordinary red-ripe tomatoes. This may lead to cross-variety misclassification, which is a key bottleneck for current maturity detection methods that rely primarily on color features. Furthermore, the validation of the dynamic weighting mechanism in WIoU v3 remains mainly qualitative and lacks quantitative evidence regarding its regulation of dynamic weights. Future work should therefore quantitatively verify this mechanism through gradient distribution analysis or weight statistics. In addition, future research will focus on the color confusion problem across multiple tomato varieties and will explore multimodal fusion schemes incorporating spectral or depth information, so as to improve the model’s recognition ability, generalization, and robustness across different tomato varieties. Moreover, explainable AI methods will be incorporated to provide interpretable evidence for model predictions, thereby helping growers and decision makers better understand the basis of maturity detection results.
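As an indication of what such weight statistics could look like, the following minimal sketch computes the non-monotonic focusing coefficient of WIoU v3, r = β / (δ · α^(β − δ)), where β is the outlier degree of an anchor box (its IoU loss divided by the mean IoU loss). Several points are assumptions rather than settings taken from this study: α = 1.9 and δ = 3 are values commonly quoted for WIoU v3, the batch mean replaces the running mean used during training, and the sample losses are hypothetical.

```python
from statistics import mean, pstdev


def focusing_coefficient(beta, alpha=1.9, delta=3.0):
    """Non-monotonic focusing coefficient r = beta / (delta * alpha**(beta - delta))."""
    return beta / (delta * alpha ** (beta - delta))


def weight_statistics(iou_losses, alpha=1.9, delta=3.0):
    """Summarize the focusing weights assigned to a batch of IoU losses.

    Simplification: the outlier degree uses the batch mean of the IoU loss
    instead of an exponential running mean.
    """
    mean_loss = mean(iou_losses)
    betas = [loss / mean_loss for loss in iou_losses]
    weights = [focusing_coefficient(b, alpha, delta) for b in betas]
    return {
        "mean": mean(weights),
        "std": pstdev(weights),
        "min": min(weights),
        "max": max(weights),
    }


# Hypothetical IoU losses (1 - IoU) for five anchor boxes.
sample_losses = [0.05, 0.2, 0.4, 0.6, 0.9]
stats = weight_statistics(sample_losses)
print(stats)
```

Tracking these statistics over training epochs, or plotting the weight against the outlier degree, would show whether both very low-loss and very high-loss boxes are down-weighted relative to boxes of intermediate quality, which is the behavior the mechanism is expected to produce.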
5. Conclusions
To address the problems of false detections and insufficient adaptability to complex scenes in tomato maturity detection with YOLOv7, this study proposes an improved model, YOLO-RCM. The model enhances key channel feature representation by introducing ECANet into the FPN, improves spatial modeling of complex targets by replacing standard convolutions with DCNConv in the Backbone, and adopts WIoU v3 to optimize the bounding box regression process, providing complementary improvements in channel feature discrimination, local geometric perception, and bounding-box regression stability.
The experimental results indicate that ECANet achieves the best mAP@0.5 among the attention mechanisms compared while maintaining Recall comparable to the baseline; DCNConv reduces computational complexity and shows a clear synergistic effect when combined with WIoU v3; and WIoU v3 achieves the highest Recall in the loss function comparison. In the ablation study, introducing all three modules together results in Precision of 83.7% and mAP@0.5 of 89.6%, with GFLOPs reduced to 96.9 and Parameters increasing slightly to 36.7 M (+0.2 M). These results indicate that YOLO-RCM achieves better overall detection performance by making a reasonable trade-off between precision and recall.
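The absolute changes in computational cost can also be expressed in relative terms. The short sketch below recovers the baseline values from the reported figures (96.9 GFLOPs after a reduction of 6.3, and 36.7 M parameters after an increase of 0.2 M) and converts the differences into percentages; the helper name `relative_change` is hypothetical.

```python
def relative_change(new_value, delta):
    """Return the change in percent relative to the baseline (new_value - delta)."""
    baseline = new_value - delta
    return 100.0 * delta / baseline


gflops_change = relative_change(96.9, -6.3)  # baseline inferred as 103.2 GFLOPs
params_change = relative_change(36.7, 0.2)   # baseline inferred as 36.5 M parameters

print(f"GFLOPs: {gflops_change:+.1f}%  Parameters: {params_change:+.1f}%")
```

On these figures, the reduction in GFLOPs is roughly an order of magnitude larger in relative terms than the increase in parameter count.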
The confusion matrices and PR curves further show that YOLO-RCM reduces misclassification between adjacent maturity categories and lowers background false detections for the fully_ripened and half_ripened categories, although background false detections for the green category increase slightly. As a result, overall detection accuracy is improved. However, missed detections in the fully_ripened category also increase, indicating that while the model enhances discriminative capability, its ability to detect certain targets is reduced to some extent.
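The relationship between these error types and the two metrics can be made explicit with a small sketch. It derives per-class Precision and Recall from confusion counts that include a background label: a box predicted on background is a false positive and lowers Precision, whereas an object assigned to background is a missed detection and lowers Recall. All counts below are hypothetical and are not taken from the reported confusion matrices.

```python
CLASSES = ["fully_ripened", "half_ripened", "green"]


def per_class_precision_recall(counts, classes):
    """Compute (precision, recall) per class.

    counts maps (predicted_label, actual_label) to a number of boxes.
    The label "background" marks either a missed object (as prediction)
    or a detection on empty background (as actual label).
    """
    labels = classes + ["background"]
    result = {}
    for cls in classes:
        tp = counts.get((cls, cls), 0)
        fp = sum(counts.get((cls, actual), 0) for actual in labels if actual != cls)
        fn = sum(counts.get((pred, cls), 0) for pred in labels if pred != cls)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        result[cls] = (precision, recall)
    return result


# Hypothetical counts for illustration only.
counts = {
    ("fully_ripened", "fully_ripened"): 80,
    ("half_ripened", "fully_ripened"): 5,   # adjacent-category confusion
    ("background", "fully_ripened"): 15,    # missed detections
    ("fully_ripened", "half_ripened"): 4,   # adjacent-category confusion
    ("fully_ripened", "background"): 6,     # background false detections
    ("half_ripened", "half_ripened"): 70,
    ("green", "green"): 90,
}

for cls, (precision, recall) in per_class_precision_recall(counts, CLASSES).items():
    print(f"{cls}: precision={precision:.3f}, recall={recall:.3f}")
```

In this illustration, reducing the background false detections for fully_ripened would raise its Precision, while an increase in the missed detections would lower its Recall, which is the same pattern as described above.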
Compared with other mainstream object detection models, YOLO-RCM ranks highest in both Precision (83.7%) and mAP@0.5 (89.6%). In the cross-dataset robustness experiment, YOLO-RCM shows improvement trends consistent with those observed in the main experiments: Precision and mAP@0.5 increase by 5.8 and 4.0 percentage points, respectively, and GFLOPs are reduced to 96.9 (−6.3), while Recall decreases by 1.2 percentage points and Parameters increase by 0.2 M. These results demonstrate that YOLO-RCM has good robustness and generalization capability across different datasets.