YOLO-RCM: An Improved Tomato Maturity Detection Model for Complex Greenhouse Environments

Chen, Dehua; Teng, Hao; Lu, Yuchen; Zhang, Yuxuan; Wu, Haorong

doi:10.3390/agronomy16121146


by Dehua Chen ¹, Hao Teng ¹, Yuchen Lu ², Yuxuan Zhang ³,⁴,* and Haorong Wu ¹

¹ School of Electronic Information and Electrical Engineering, Chengdu University, Chengdu 610106, China

² Yantai Research Institute, Harbin Engineering University, Yantai 264006, China

³ College of Intelligent Science and Engineering, Beijing University of Agriculture, Beijing 102206, China

⁴ Department of Computer and Electrical Engineering, Mid Sweden University, 85170 Sundsvall, Sweden

* Author to whom correspondence should be addressed.

Agronomy 2026, 16(12), 1146; https://doi.org/10.3390/agronomy16121146

Submission received: 29 April 2026 / Revised: 3 June 2026 / Accepted: 5 June 2026 / Published: 11 June 2026

(This article belongs to the Special Issue Digital Twins in Precision Agriculture)


Abstract

To reduce confusion between adjacent maturity categories, as well as false detections and low detection accuracy caused by complex backgrounds in tomato object detection, this study develops an improved YOLOv7-based model, named YOLO-RCM (Reduce classes misjudgment). First, a stability-enhanced ECANet channel attention module is embedded into the feature pyramid network (FPN) to strengthen discriminative channel responses. Second, a DCNv2-based deformable convolution enhancement module, namely DCNConv with adaptive magnitude constraints, is incorporated into the backbone network to alleviate feature misalignment caused by shape variation, partial occlusion, and fine-grained appearance differences in tomato maturity detection. Third, the WIoU v3 loss function is adopted to refine bounding box regression stability. The model was evaluated on the public Laboro Tomato dataset and TomatOD dataset. Experimental results indicate that YOLO-RCM obtains 83.7% Precision and 89.6% mAP@0.5, exceeding the baseline by 3.3 and 1.2 percentage points, respectively. Its Recall is 80.5%, with a decrease of 0.8 percentage points, whereas GFLOPs are reduced to 96.9, 6.3 lower than the baseline. These results indicate that the proposed method improves detection accuracy and computational efficiency while maintaining an almost unchanged model scale. The confusion matrix and PR curves further show that YOLO-RCM can effectively mitigate misdetections associated with adjacent maturity stages and complex scenes. In the external-dataset robustness test, Precision and mAP@0.5 are improved by 5.8 and 4.0 percentage points over the baseline, respectively, confirming the generalization ability of the proposed model. The main contribution of this study lies in improving tomato maturity detection from three complementary aspects: channel feature discrimination, local geometric perception, and bounding box regression stability. 
The study offers a practical technical reference for intelligent tomato harvesting systems in complex agricultural environments.

Keywords:

deep learning; tomato; maturity detection; YOLOv7; modulated deformable convolution

1. Introduction

Tomato is one of the major vegetable crops worldwide in both production and consumption. Available statistics show that global tomato production was approximately 192 million tons in 2023 [1]. Within the tomato industry chain, maturity is a key factor determining fruit quality, nutritional composition, and market value. Particularly in large-scale agricultural production, accurate assessment of tomato maturity is of great significance for optimizing harvesting time, increasing yield, and reducing postharvest losses. However, traditional maturity detection methods rely mainly on manual observation or sensory evaluation, which suffer from strong subjectivity and poor consistency, making them difficult to apply in large-scale automated production [2,3]. As a result, accurate and automated maturity detection has become an important topic in this field.

Considerable progress has been achieved in the maturity detection of tomatoes and other agricultural products. Early research mainly used traditional image processing methods, in which maturity was determined by extracting features such as color, shape, and texture. Traditional machine vision operations, such as denoising, color-space transformation, and morphological filtering, were used to preprocess the images, and manually designed visual features were then combined to distinguish different maturity stages. For instance, Arefi et al. [4] demonstrated that an electronic nose could effectively characterize changes in volatile compounds during tomato ripening, thereby providing a new sensing approach for maturity assessment. Mendoza et al. [5] developed a computer vision-based method for tomato maturity detection, in which color features were extracted and combined with a classification model to identify different ripening stages. Building on traditional image processing, subsequent studies integrated classical machine learning methods, which showed stronger discriminative capability than simple threshold-based methods. Whereas threshold-based methods classify maturity according to fixed feature boundaries, classical machine learning methods construct feature vectors from manually extracted image descriptors and then use classifiers to learn the mapping between image features and maturity categories. For example, Liu [6] proposed a mature tomato detection method based on Histogram of Oriented Gradients (HOG) and Support Vector Machine (SVM), which achieved favorable detection performance. Nevertheless, these methods still rely strongly on manually designed features and remain vulnerable to missed detections and false alarms in complex scenes with cluttered backgrounds, illumination changes, and ambiguous class boundaries, resulting in limited robustness and poor cross-scene generalization.

With the rapid development of deep learning, agricultural vision tasks have increasingly adopted neural-network-based approaches [7,8,9,10,11]. In particular, maturity detection methods based on convolutional neural networks have become a dominant approach in this field. Compared with traditional methods, deep learning models can automatically learn multi-scale representations and are more tolerant of occlusion, object-scale variation, and background complexity. The research focus has also moved gradually from two-stage detectors to one-stage detectors. Two-stage detectors refer to detection frameworks that first generate candidate object regions and then classify and refine these regions in a second stage. Representative methods include Faster Region-based Convolutional Neural Network (Faster R-CNN) and Mask Region-based Convolutional Neural Network (Mask R-CNN). This type of detector usually has strong localization and recognition capability because the region proposal and object classification processes are performed separately. Although two-stage methods usually provide high accuracy, their heavy computation and longer inference time restrict their use in real-time industrial scenarios. For example, the improved Mask R-CNN proposed by Zu et al. [12] achieved certain results in tomato maturity detection under greenhouse or occluded conditions, but its generalization ability in dense fruit clusters and complex natural environments remained insufficient. In contrast, one-stage detectors directly predict object categories and bounding-box locations from feature maps without a separate region proposal stage. Typical examples include the YOLO series and SSD. Because detection is completed in a unified forward propagation process, one-stage detectors generally have faster inference speed and are therefore more suitable for real-time maturity detection and automated agricultural production scenarios.

In recent years, one-stage detectors, especially the YOLO (You Only Look Once) family, have gradually become a research focus in tomato maturity detection because they offer a practical compromise between accuracy and real-time speed. Existing studies have improved accuracy and efficiency of tomato maturity detection by incorporating attention mechanisms [13], enhanced convolution modules [14,15], redesigned network structures [16,17], and optimized loss functions [13]. However, challenges such as confusion between adjacent maturity categories, interference from complex backgrounds, and insufficient cross-scene generalization have not yet been fundamentally resolved. Existing studies mainly focus on attention enhancement, multi-task learning, lightweight architecture design, small-object detection, and multi-scale feature fusion. Li et al. [18] proposed MHSA-YOLOv8, which enhances global feature modeling through a multi-head self-attention mechanism, thereby improving tomato maturity detection and counting performance; however, detection results still fluctuate in scenarios with severe occlusion, complex backgrounds, and significant illumination interference. An improved method based on YOLOv9 [19] further enhanced detection accuracy and inference speed, although its parameter scale and computational burden remained relatively high, and its deployment adaptability and robustness in complex environments still require improvement. AITP-YOLO [20] improved the detection of small, blurred, and occluded targets through multiple strategies and multi-scale feature fusion, showing that feature enhancement is useful for maturity recognition under complex scenes; however, its modeling of fine-grained differences between adjacent maturity stages remained insufficient. 
YOLO-PGC [21] improved detection accuracy by enhancing YOLO11 and demonstrated good robustness under different maturity stages, illumination conditions, and occlusion scenarios, but its optimization still focused mainly on overall performance improvement and lacked dedicated design for ambiguous class boundaries and cross-scene generalization. Chen et al. [22] developed MTD-YOLOv7 for joint maturity detection and fruit-cluster recognition. Although this multi-task framework achieved good accuracy and real-time performance, it was built on a limited dataset of 390 images collected from a single greenhouse scenario, and the multi-task design significantly increased model complexity, which to some extent restricted its generalization ability under complex natural environments, cultivar differences, and dense occlusion conditions.

In summary, although existing approaches have improved detection accuracy, speed, and adaptability to complex scenes, they still face the following challenges in complex agricultural scenarios: confusion among adjacent maturity categories; interference from complex scenes involving leaf occlusion, fruit overlap, scale variation, and natural illumination fluctuations; and the need to effectively balance detection accuracy, model complexity, and cross-scene robustness. Similar issues have also been reported in maturity detection studies of other fruits such as strawberry [14], grape [23], and citrus [24]. To overcome these limitations, this study takes YOLOv7 as the baseline model, introduces a DCNConv module with adaptive magnitude constraints to enhance local geometric perception, adopts a stability-enhanced ECANet attention mechanism to strengthen channel discriminability, and applies WIoU v3 to stabilize bounding-box regression. Overall, this study improves tomato maturity detection from three complementary aspects: channel feature discrimination, local geometric perception, and bounding-box regression stability, thereby enhancing the recognition of adjacent maturity stages in complex greenhouse scenarios. These modifications form a tomato maturity detection model for complex agricultural scenes, improving adaptability and detection accuracy while providing technical support for automated maturity detection in large-scale agricultural production.

The main contributions of this study are summarized as follows:

A stability-enhanced ECANet module is introduced into the feature fusion path to strengthen channel-wise discriminative responses and suppress background interference, thereby improving the recognition of adjacent maturity categories.
A DCNv2-based DCNConv module with adaptive offset magnitude constraints is designed to enhance local geometric modeling while mitigating training instability caused by unconstrained deformable sampling.
WIoU v3 is introduced to stabilize bounding-box regression, and the proposed design is evaluated through ablation studies, model comparisons, visualization analysis, and cross-dataset experiments to verify its effectiveness and robustness.

2. Materials and Methods

2.1. Dataset and Maturity Criteria

The experiments are based on the public Laboro Tomato dataset [25]. The dataset contains 804 images captured using two different camera devices, with resolutions of 3024 × 4032 pixels or 3120 × 4160 pixels. It includes tomatoes of two size types, namely regular-sized tomatoes and cherry tomatoes, at different maturity stages. The images cover greenhouse scenes under multiple viewing angles and lighting conditions, providing substantial scene diversity and making the dataset suitable for evaluating model generalization in complex greenhouse environments. Since this study focuses on tomato maturity detection, only maturity categories are considered in the Laboro Tomato dataset, while fruit size is not distinguished. Specifically, the annotations for regular-sized tomatoes and cherry tomatoes are merged into the corresponding maturity categories to enhance generalization ability across fruits of different sizes.

With reference to the original annotation scheme of the Laboro Tomato dataset and the relevant industry standard GH/T 1193-2021 [26], the tomatoes in the dataset are divided into three maturity categories in this study: ripe, semi-ripe, and unripe. A tomato is defined as ripe when it appears overall red, with red color visible on 90% or more of its surface. A tomato is defined as semi-ripe when it has a greenish tone, with red color visible on 30% to 90% of its surface. A tomato is defined as unripe when it appears overall green or whitish, with red color visible on 30% or less of its surface. The dataset images are categorized according to these maturity criteria, as shown in Figure 1. The dataset labels fully_ripened, half_ripened, and green correspond to ripe, semi-ripe, and unripe, respectively.
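The three criteria above can be restated as a small decision rule. The sketch below is only an illustration of the thresholds: the dataset labels themselves are manual annotations, the function name is hypothetical, and the handling of the shared boundaries at exactly 30% and 90% (assigned here to unripe and ripe, respectively) is an assumption, since the text lists both values in two categories.

```python
# Mapping between the Laboro Tomato labels and the category names used in the text.
LABEL_MAP = {"fully_ripened": "ripe", "half_ripened": "semi-ripe", "green": "unripe"}


def maturity_from_red_fraction(red_fraction: float) -> str:
    """Return the maturity category for a given fraction of red surface.

    Assumption of this sketch: exactly 0.90 counts as ripe and exactly 0.30
    counts as unripe, so the semi-ripe interval is open on both sides.
    """
    if not 0.0 <= red_fraction <= 1.0:
        raise ValueError("red_fraction must lie in [0, 1]")
    if red_fraction >= 0.90:
        return "ripe"
    if red_fraction <= 0.30:
        return "unripe"
    return "semi-ripe"
```

For example, a fruit with half of its surface red falls into the semi-ripe category under this rule.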

2.2. Data Augmentation

The original dataset exhibited a pronounced imbalance among maturity classes. To improve the model’s ability to detect minority classes, several augmentation strategies were applied. First, the Laboro Tomato dataset was randomly divided into training, validation, and test sets at a 7:2:1 ratio, while ensuring strict separation among the three subsets to avoid data leakage. After the split, the images in each subset were augmented using ±40% brightness adjustment, horizontal flipping, vertical flipping, Gaussian filtering with a 7 × 7 kernel and Sigma = 5, and random translation with a maximum shift of 10%. An augmented dataset, denoted as Dataset A, was then constructed. These augmentation operations were designed to simulate common sources of interference in greenhouse scenes, including illumination variation (brightness adjustment), image blur (Gaussian filtering), and viewpoint differences (flipping and translation).

The numbers of images in the three subsets of the augmented dataset were 3934 training images, 1127 validation images, and 567 test images, still maintaining the 7:2:1 ratio. The numbers of annotated instances for the three target categories, namely fully_ripened, half_ripened, and green, were 2240, 2261, and 8617, approximately in a 1:1:4 ratio. Examples of the data augmentation results are shown in Figure 2.

Figure 2. Examples of data augmentation: (a) Original image; (b) Brightness adjustment; (c) Gaussian filtering; (d) Random translation; (e) Horizontal flipping; (f) Vertical flipping.
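The reported subset sizes are consistent with a sevenfold expansion of a 562/161/81 split of the 804 source images (562 × 7 = 3934, 161 × 7 = 1127, 81 × 7 = 567). The following sketch shows the bookkeeping of splitting first and augmenting afterwards, so that no source image contributes to more than one subset. The variant tags are hypothetical names, and treating brightness adjustment as two variants (+40% and -40%) is an assumption made only to reach seven versions per image; the actual image transforms are omitted.

```python
import random

# Hypothetical tags: the original image plus six augmented versions.
VARIANTS = ["orig", "bright_up", "bright_down", "hflip", "vflip", "gauss", "shift"]


def split_then_augment(image_ids, sizes, seed=0):
    """Shuffle and split the source images first, then expand each subset."""
    ids = list(image_ids)
    if sum(sizes) != len(ids):
        raise ValueError("subset sizes must add up to the number of images")
    random.Random(seed).shuffle(ids)
    subsets, start = {}, 0
    for name, size in zip(("train", "val", "test"), sizes):
        chunk = ids[start:start + size]
        start += size
        # Every variant of a source image stays in the subset of that image.
        subsets[name] = [(img, tag) for img in chunk for tag in VARIANTS]
    return subsets


subsets = split_then_augment(range(804), sizes=(562, 161, 81))
counts = {name: len(items) for name, items in subsets.items()}
```

Because augmentation happens after the split, the sets of source image identifiers behind the three subsets are disjoint by construction.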

2.3. YOLOv7 Object Detection Model

YOLOv7 is an efficient object detection model proposed by Wang et al. [27] in 2022. As illustrated in Figure 3, its architecture consists of four main parts: Input, Backbone, Neck, and Head. The Input stage performs image preprocessing and uses training-stage augmentation such as Mosaic to improve robustness to small objects. The Backbone constructs the feature extraction path by alternately stacking CBS, ELAN, and MP modules to extract features for objects at different scales. The Neck adopts the SPPCSPC module to expand the receptive field and combines the PANet structure with the ELAN-W module to fuse deep semantic information with shallow positional information. The Head introduces the RepConv structure, which improves feature representation through multi-branch representations during training and structural re-parameterization during inference, without increasing inference-time computational cost. Prediction feature maps are finally generated at three scales to support the detection of targets with different sizes.

On the MS COCO dataset, YOLOv7 attains 51.4% mAP@0.5:0.95 with an inference speed of 161 FPS, indicating a strong balance between detection accuracy and inference speed. Because of its real-time inference ability and effective multi-scale detection performance, YOLOv7 serves as a suitable baseline for the subsequent improvement of tomato maturity detection.

2.4. YOLO-RCM

Although YOLOv7 performs effectively in general object detection, it still faces the following challenges in tomato maturity detection, which involves fine-grained recognition: (1) neighboring maturity stages, such as fully_ripened and half_ripened, exhibit only subtle differences in color features, and insufficient channel-level discriminative capability can easily lead to misclassification; (2) under dense occlusion, the approximately spherical contours of tomato fruits often undergo irregular deformation, making them difficult to accurately capture with the fixed receptive fields of standard convolutions. To address these issues, this study systematically improves YOLOv7 from three aspects, namely geometric deformation perception, feature representation enhancement, and regression optimization, and proposes an improved model termed YOLO-RCM (Reduce classes misjudgment). Its architecture is shown in Figure 4, and the main modifications are as follows:

Stability-enhanced ECANet modules are inserted into the FPN feature fusion path in the Head, specifically before the upsampling backbone branch, the lateral branch outputs from the Backbone, and the output of the small-scale detection branch. This design suppresses background noise and increases the saliency of target features from the channel perspective.
In the ELAN modules that output the P3 (stride = 8) and P4 (stride = 16) feature maps in the Backbone, the last two standard convolutions are replaced with DCNConv, a deformable convolution module with adaptive magnitude constraints, improving the ability to capture shape variation, local occlusion, and fine-grained features.
WIoU v3 is used to refine bounding-box regression, which improves localization stability and model generalization.

2.4.1. DCNConv

In natural scenes, tomatoes are approximately spherical in shape. Due to variations in growth posture, viewing angle, and occlusion by leaves and fruits, they often exhibit significant geometric deformation in images. In addition, the fine-grained differences between adjacent maturity categories require the model to have a more refined local feature extraction capability. To address these issues, DCNv2 (Deformable ConvNets v2) dynamically learns sampling offsets and modulation coefficients, enabling the receptive field to adapt to the local contours of the target [28]. In this way, it extracts locally stable features that are more robust to geometric deformation and reduces the limitations imposed by the fixed rectangular receptive field of standard convolution, thereby better accommodating target deformation while suppressing interference from irrelevant background regions. Compared with DCNv1, DCNv2 further introduces a modulation coefficient, denoted as Δm, to weight the contribution of each sampling point, thereby improving the flexibility of feature selection. Therefore, in this study, the core operator of modulated deformable convolution in DCNv2 is adopted as the basis of the deformable convolution enhancement module. By appending a BN layer and a SiLU activation function after this operator, a deformable convolution enhancement module, termed DCNConv, is constructed with the same input–output interface as standard convolution. This module can be integrated into existing network architectures as a plug-and-play component. The DCNConv structure is shown in Figure 5.

The input feature X is first fed into the offset-mask prediction branch to generate a prediction tensor arranged along the channel dimension. This tensor is then split into the offsets of the sampling points along the x and y directions, namely $o_{x}$ and $o_{y}$, as well as the modulation branch output $m_{\mathrm{logit}}$. Specifically, $o_{x}$ and $o_{y}$ are concatenated to form the overall offset $\Delta p$, while $m_{\mathrm{logit}}$ is normalized by the Sigmoid function to generate the modulation coefficient $\Delta m$. The input feature X is then sampled at the offset locations by bilinear interpolation and weighted by the modulation coefficient. Finally, the resulting features are processed by a BN layer and the SiLU activation function to produce the output feature Y. The mathematical formulation is given in Equation (1).

$$y(p_{0}) = \sum_{k=1}^{K} \Delta m_{k} \cdot w_{k} \cdot x\left(p_{0} + p_{k}^{\mathrm{grid}} + \Delta p_{k}\right) \tag{1}$$

Here, $p_{0}$ represents the current location on the output feature map, and K is the number of sampling points in the convolution kernel. $p_{k}^{\mathrm{grid}}$ represents the k-th regular sampling position of the standard convolution kernel, $\Delta p_{k}$ is the total offset of the k-th sampling position, $\Delta m_{k}$ denotes the corresponding modulation coefficient, and $w_{k}$ is the convolution weight at the k-th sampling position. The term $x(p_{0} + p_{k}^{\mathrm{grid}} + \Delta p_{k})$ represents the feature value obtained by bilinear interpolation after offset sampling.
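To make Equation (1) concrete, the following minimal sketch evaluates it at a single output location for one channel and a 3 × 3 kernel, using plain Python lists. It is an illustration under simplifying assumptions rather than the DCNConv module itself: the offsets and modulation logits are passed in directly instead of being predicted by a convolutional branch, samples outside the feature map are treated as zero, and the BN and SiLU steps are omitted. The function names are hypothetical.

```python
import math


def bilinear(x, r, c):
    """Bilinearly sample the 2D list x at fractional (row, col); outside is 0."""
    h, w = len(x), len(x[0])
    r0, c0 = math.floor(r), math.floor(c)
    value = 0.0
    for rr in (r0, r0 + 1):
        for cc in (c0, c0 + 1):
            if 0 <= rr < h and 0 <= cc < w:
                weight = (1 - abs(r - rr)) * (1 - abs(c - cc))
                value += weight * x[rr][cc]
    return value


def modulated_deform_point(x, weights, p0, offsets, mask_logits):
    """Equation (1) at output location p0 for one channel and a 3x3 kernel.

    weights, offsets, and mask_logits each hold 9 entries in row-major order;
    offsets are (d_row, d_col) pairs and mask_logits are pre-Sigmoid values.
    """
    grid = [(dr, dc) for dr in (-1, 0, 1) for dc in (-1, 0, 1)]
    out = 0.0
    for (gr, gc), w_k, (dr, dc), logit in zip(grid, weights, offsets, mask_logits):
        m_k = 1.0 / (1.0 + math.exp(-logit))  # modulation coefficient in (0, 1)
        out += m_k * w_k * bilinear(x, p0[0] + gr + dr, p0[1] + gc + dc)
    return out
```

With all offsets set to zero, the result reduces to a standard convolution scaled by the modulation coefficients, which is a convenient sanity check for this kind of operator.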

2.4.2. ECANet Attention Module

Tomato cultivation environments are often complex. Leaf and fruit occlusion, background interference, and color similarity between adjacent maturity categories make it difficult for the network to accurately extract discriminative features of tomato fruits in complex scenes. To alleviate this problem, an attention mechanism is introduced to adaptively adjust channel-wise feature weights and enhance the network’s perception of discriminative target features. The ECANet module (Efficient Channel Attention Module) [29] is a lightweight channel attention method. Unlike the SE module, which relies on fully connected layers, ECANet adopts an adaptive one-dimensional convolution to realize local cross-channel interaction while avoiding the information loss caused by dimensionality reduction and preserving dependencies among adjacent channels. At the channel level, it can adaptively enhance channel responses related to tomato maturity color features while suppressing interference channels such as background leaves, thereby alleviating both confusion among adjacent maturity categories and background interference. The structure of the ECANet module is shown in Figure 6.

The ECANet module first performs global average pooling (GAP) on the input feature map, compressing each channel into a scalar descriptor that summarizes that channel globally. It then performs local cross-channel interaction through a one-dimensional convolution, where an adaptively sized kernel slides across the sequence of channel descriptors to model local channel relationships while sharing parameters. Finally, the convolution output is mapped to the interval (0, 1) by the Sigmoid function to obtain the channel attention weight $\omega_{i}$, which is multiplied with the corresponding channel feature map to produce the attention-weighted output feature. The mathematical formulation is given in Equation (2).

$$\omega_{i} = \sigma\left(\sum_{j=1}^{k} w^{j} \cdot y_{i}^{j}\right), \quad y_{i}^{j} \in \Omega_{i}^{k} \tag{2}$$

Here, $\omega_{i}$ denotes the attention weight of the i-th channel; $\Omega_{i}^{k}$ denotes the set of k neighboring channels centered on the i-th channel; $y_{i}^{j}$ denotes the global-average-pooled descriptor of the j-th channel in that set; $w^{j}$ denotes the one-dimensional convolution weights shared across all channels; and $\sigma(\cdot)$ denotes the Sigmoid activation function.
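The three steps of Equation (2), namely pooling, shared one-dimensional convolution across channels, and Sigmoid gating, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the module used in the paper: the kernel weights are supplied by the caller instead of being learned, zero padding at the first and last channels is an assumption, and the function names are hypothetical.

```python
import math


def eca_weights(feature, conv_w):
    """Channel attention weights for feature[C][H][W] with a shared 1D kernel."""
    k = len(conv_w)
    if k % 2 == 0:
        raise ValueError("kernel size must be odd")
    # Global average pooling: one scalar descriptor per channel.
    y = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature]
    pad = k // 2
    padded = [0.0] * pad + y + [0.0] * pad  # assumed zero padding at the edges
    weights = []
    for i in range(len(y)):
        # The same k weights are applied around every channel i.
        s = sum(conv_w[j] * padded[i + j] for j in range(k))
        weights.append(1.0 / (1.0 + math.exp(-s)))
    return weights


def eca_apply(feature, conv_w):
    """Rescale every channel of feature[C][H][W] by its attention weight."""
    omega = eca_weights(feature, conv_w)
    return [[[omega[i] * v for v in row] for row in ch]
            for i, ch in enumerate(feature)]
```

Note that the number of parameters is k regardless of the channel count, which is what keeps this form of attention lightweight compared with fully connected alternatives.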

2.4.3. WIoUv3 Loss Function

The original YOLOv7 architecture adopts CIoU Loss (Complete IoU) as the bounding box regression loss. By considering center-point distance and aspect-ratio constraints, CIoU improves regression accuracy to a certain extent. However, CIoU lacks a dynamic mechanism for adjusting the gradient contributions of samples with different quality levels [30]. In addition, the aspect-ratio penalty term computed via the arctangent function may, in some cases, be inconsistent with the IoU optimization direction, leading to unstable regression. To address these limitations of CIoU and to dynamically model the effect of sample quality differences on bounding box regression, this study introduces WIoU v3 (Wise-IoU Loss) as the bounding box regression loss for the improved model [30,31]. Its geometric illustration is shown in Figure 7.

The WIoU v3 loss contains two components: a basic IoU weighting term $L_{\mathrm{WIoU}}$ and a dynamic focusing weight $r$. Built upon the standard IoU loss, it introduces a dynamic weighting term based on the relative center-point distance and combines it with a non-monotonic focusing mechanism, allowing the gradient weights of samples with different quality levels to be adjusted according to their degree of outlierness. Compared with IoU-based loss functions that rely solely on fixed geometric constraints, WIoU v3 can simultaneously suppress the weights of low-quality samples and excessively high-quality samples, thereby encouraging the model to focus on the fine-grained optimization of medium-quality samples. This helps prevent overfitting and improves the localization accuracy of difficult samples. The WIoU v3 loss is defined in Equation (3).

$$L_{\mathrm{WIoUv3}} = r \cdot L_{\mathrm{WIoU}} \tag{3}$$

Here, $L_{\mathrm{WIoU}}$ denotes the basic IoU weighting term constructed from the relative center-point offset, as defined in Equation (4), and $r$ denotes the non-monotonic focusing weight, as defined in Equation (5).

$$L_{\mathrm{WIoU}} = \exp\left(\frac{\rho^{2}(b, b^{\mathrm{gt}})}{c^{2}}\right) \cdot L_{\mathrm{IoU}} \tag{4}$$

In Equation (4), $\rho^{2}(b, b^{\mathrm{gt}})$ denotes the squared Euclidean distance between the center point of the predicted box and that of the ground-truth box, $c^{2}$ denotes the squared diagonal length of the smallest enclosing box covering both the predicted box and the ground-truth box, and $L_{\mathrm{IoU}}$ denotes the standard IoU loss. This term normalizes the squared center-point distance $\rho^{2}(b, b^{\mathrm{gt}})$ by $c^{2}$, the squared diagonal length of the smallest enclosing box, and applies the result, in exponential form, as a geometric penalty weight on the standard IoU loss that grows with the center-point offset.

$$r = \frac{\beta}{\delta \cdot \alpha^{\beta - \delta}} \tag{5}$$

In Equation (5), $\alpha$ and $\delta$ are hyperparameters controlling the shape of the focusing curve, set to 1.7 and 2.7, respectively. These values are taken from the empirical tuning results reported in the original paper and are used to construct a reasonable non-monotonic focusing curve, so as to prioritize the optimization of medium-quality samples while maintaining a good balance between training stability and performance. Under this mechanism, the model can dynamically adjust gradient allocation according to sample quality: for low-quality samples and excessively high-quality samples, their gradient weights are appropriately reduced; for medium-quality samples, greater optimization emphasis is assigned, thereby improving overall localization performance. The outlierness of the current sample, denoted by $\beta$, is defined in Equation (6).

$$\beta = \frac{L_{\mathrm{IoU}}^{*}}{\bar{L}_{\mathrm{IoU}}} \tag{6}$$

In Equation (6), $L_{\mathrm{IoU}}^{*}$ denotes the IoU loss of the current sample after gradient truncation, and $\bar{L}_{\mathrm{IoU}}$ denotes the moving average of the IoU loss during training. When $\beta > 1$, the current sample loss is higher than the training average, indicating that the sample is relatively low-quality. When $\beta < 1$, the sample is regarded as a high-quality sample close to convergence. For both extreme cases, $r$ assigns relatively low weights, so that the model focuses on medium-quality samples with $\beta \approx 1$.
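Equations (3) to (6) can be combined into a short scalar sketch for a single pair of boxes. The code below is an illustration under stated assumptions, not the training implementation: boxes are given as (x1, y1, x2, y2) tuples, the running mean of the IoU loss is passed in by the caller under the hypothetical name `mean_iou_loss`, and plain floats carry no gradient, so the gradient truncation mentioned for Equation (6) has no counterpart here. For the stated values of α and δ, setting the derivative of Equation (5) to zero places the maximum of the focusing weight at β = 1/ln α (about 1.88), and the weight equals 1 at β = δ.

```python
import math


def focusing_weight(beta, alpha=1.7, delta=2.7):
    """Non-monotonic focusing weight r of Equation (5)."""
    return beta / (delta * alpha ** (beta - delta))


def wiou_v3(pred, gt, mean_iou_loss, alpha=1.7, delta=2.7):
    """Scalar WIoU v3 loss for two boxes given as (x1, y1, x2, y2)."""
    # Standard IoU loss.
    iw = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    ih = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = iw * ih
    area_p = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_g = (gt[2] - gt[0]) * (gt[3] - gt[1])
    l_iou = 1.0 - inter / (area_p + area_g - inter)
    # Squared center distance and squared diagonal of the enclosing box.
    rho2 = (((pred[0] + pred[2]) - (gt[0] + gt[2])) / 2) ** 2 \
        + (((pred[1] + pred[3]) - (gt[1] + gt[3])) / 2) ** 2
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c2 = cw ** 2 + ch ** 2
    l_wiou = math.exp(rho2 / c2) * l_iou             # Equation (4)
    beta = l_iou / mean_iou_loss                     # Equation (6)
    return focusing_weight(beta, alpha, delta) * l_wiou  # Equation (3)
```

Evaluating `focusing_weight` over a range of β values is a quick way to inspect how strongly very easy and very hard samples are down-weighted for a given choice of α and δ.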

2.4.4. Training Instability in the Joint Use of ECANet and DCNConv and Its Mitigation

Under the same experimental conditions as described in Section 3.1, when the ECANet and DCNConv modules are jointly introduced into the model, the training metrics exhibit persistent oscillation, with evident performance collapse occurring at the 44th and 81st epochs. The training curves in Figure 8 show that Precision, Recall, and mAP@0.5 all rise abruptly and then drop rapidly near their peak values, indicating that the issue cannot be attributed merely to ordinary random noise. Instead, it is more likely associated with abrupt changes in feature distribution or instability in gradient propagation during training.

From the perspective of module design, the sustained oscillation of the training metrics can be attributed to two main factors. First, the adaptive convolution kernel size in ECANet is not constrained by a lower bound. When the number of input channels is small, the convolutional receptive field may degenerate into a single-channel operation, causing the attention mechanism to degrade into channel-wise scaling. Second, the offset prediction in DCNConv lacks magnitude constraints. The mismatch between batch normalization (BN) and nonlinear activation in the DCNConv structure may trigger gradient explosion and instability in the early training stage [32,33], while deformable convolution itself introduces additional optimization instability through dynamically learned offsets [34]. During training, this may cause sampling points to move beyond valid feature regions, thereby introducing temporary misalignment in feature representation. Therefore, explicit constraints on the offsets are necessary to improve training stability.

To address the stability issues caused by the above module design, an adaptive magnitude constraint related to the convolution kernel size $k$ is imposed on the offsets $\tilde{o}$ in DCNConv, ensuring that dynamic sampling points remain within the valid feature range. The improved structure is shown in Figure 9, and the mathematical form of the constraint is given in Equation (7), where $o$ denotes the original offset.

$$\tilde{o} = M \cdot \tanh\left(\frac{o}{M}\right), \quad M = \frac{k}{2} \tag{7}$$
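The effect of Equation (7) is easy to verify numerically: the constrained offset never exceeds M = k/2 in magnitude, while small offsets pass through almost unchanged because tanh is close to the identity near zero. The helper below is a minimal scalar sketch with a hypothetical function name; in a network the same expression would be applied elementwise to the predicted offset tensor.

```python
import math


def constrain_offset(o: float, k: int) -> float:
    """Bound a raw offset o to the interval [-k/2, k/2] as in Equation (7)."""
    m = k / 2.0
    return m * math.tanh(o / m)
```

For a 3 × 3 kernel, for instance, every sampling point can move by at most 1.5 pixels from its regular grid position, however large the raw prediction is.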

In addition, following the recommendation in the original ECANet paper for low-channel scenarios (C ≤ 64), the lower bound of the one-dimensional convolution kernel size in ECANet is set to 3, ensuring that the kernel covers at least three adjacent channels and thereby preserving the cross-channel modeling capability of the attention module.
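The lower bound can be expressed as a one-line change to the adaptive kernel-size rule. The sketch below assumes the rule from the ECA-Net paper, k = |log2(C)/γ + b/γ| rounded to the nearest odd value, with γ = 2 and b = 1; this formula is not restated in this article, and the function name and the `k_min` parameter are hypothetical.

```python
import math


def eca_kernel_size(channels: int, gamma: int = 2, b: int = 1, k_min: int = 3) -> int:
    """Adaptive odd kernel size for ECA with a lower bound k_min."""
    t = int(abs((math.log2(channels) + b) / gamma))
    k = t if t % 2 == 1 else t + 1  # force an odd kernel size
    return max(k, k_min)
```

Calling the function with `k_min=1` reproduces the unconstrained rule, which makes it straightforward to compare the kernel sizes with and without the bound for the channel counts present in a given network.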

From the perspective of network architecture, ECANet and DCNConv are deployed in series along the FPN feature fusion path. This coupling effect may amplify fluctuations during training. Moreover, DCNConv is simultaneously deployed in the ELAN modules of the Backbone and the fusion path of the Head. During backpropagation, gradient disturbances may be amplified by the two-stage structure and gradually accumulate during training, eventually leading to performance collapse at the 44th and 81st epochs.

To mitigate the coupling problem in the network architecture, the overall model structure is further adjusted, as shown in Figure 10. Specifically, the DCNConv module along the FPN feature fusion path is removed to avoid distributional perturbations caused by module coupling and the amplification of gradient disturbances introduced by the two-stage structure. Therefore, the final model used in the experiments includes the improved ECANet attention mechanism, the improved DCNConv module, and the adjusted overall architecture.

3. Experimental Results and Analysis

3.1. Experimental Environment

All experiments in this study were conducted on a 64-bit Windows 11 operating system. The hardware configuration included an Intel Core i5-14400F @ 2.50 GHz CPU and an NVIDIA GeForce RTX 5060 Ti 16 GB GPU. The software environment consisted of Python 3.9, CUDA 12.8, and PyTorch 2.8. The training settings were as follows: the input image resolution was set to 640 × 640, the batch size was 12, the number of workers was 6, the number of training epochs was 300, and the learning rate was 0.01.

All experiments were independently trained under the same experimental settings. The reported results were obtained by performing inference on the test set using the best weights selected according to validation-set performance during training. The training, validation, and test sets were strictly separated, ensuring that different augmented versions of the same image do not appear in multiple subsets. The best model weights were automatically selected by the YOLO framework based on built-in evaluation metrics, which mainly include Precision, Recall, and mAP on the validation set, while the test set was used only for final performance evaluation.

3.2. Evaluation Metrics

To provide a comprehensive assessment of model performance in tomato maturity detection, this study evaluated the model from two aspects: detection performance and model complexity. For detection performance, Precision (%), Recall (%), and mAP@0.5 (%) were adopted for evaluation. The definitions of Precision and Recall are given in Equation (8), where TP denotes true positives, FP denotes false positives, and FN denotes false negatives. Precision is the proportion of true positive samples among all samples predicted as positive, and measures the reliability of positive predictions. Recall is the proportion of ground-truth targets that are correctly detected, and measures the model's ability to find real targets. The mean Average Precision at IoU = 0.5 (mAP@0.5) is calculated as the arithmetic mean of the average precision (AP) values over all categories when the IoU threshold is set to 0.5.

P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}

(8)
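As a minimal illustration of Equation (8) and of the IoU criterion behind mAP@0.5, the following Python sketch computes the IoU of two axis-aligned boxes and then Precision and Recall from raw counts. The box format (x1, y1, x2, y2), the function names, and the example counts are assumptions made for illustration and are not results from this study.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp: int, fp: int, fn: int):
    """Precision and Recall as defined in Equation (8)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


if __name__ == "__main__":
    # A prediction counts as a true positive at mAP@0.5 only if IoU >= 0.5
    # (and the predicted class matches); this pair overlaps too little.
    iou = box_iou((0, 0, 2, 2), (1, 0, 3, 2))
    print(f"IoU = {iou:.3f}, counts as TP at 0.5: {iou >= 0.5}")

    # Illustrative counts only.
    p, r = precision_recall(tp=80, fp=20, fn=20)
    print(f"Precision = {p:.3f}, Recall = {r:.3f}")
```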

For model complexity, Parameters (M) and GFLOPs (Giga Floating-point Operations) were adopted for evaluation. Parameters (M) denote the number of learnable model parameters in millions, reflecting the storage cost and capacity scale of the model. GFLOPs denote the number of floating-point operations, in billions, required for one forward pass, and are used to assess computational complexity.

3.3. Comparison of Attention Mechanisms

To examine the influence of the ECANet attention mechanism on detection performance, comparative experiments were conducted on the basis of the YOLOv7 baseline by incorporating five attention modules, namely CBAM [35], FcaNet [36], SimAM [37], CA [38], and ECANet. The results are presented in Table 1.

Table 1 shows that different attention mechanisms have noticeably different effects on the detection performance of YOLOv7. Compared with the baseline model, introducing ECANet increases Precision and mAP@0.5 to 81.5% and 89.3%, corresponding to improvements of 1.1 and 0.9 percentage points, respectively. Among all compared methods, ECANet achieves the best mAP@0.5, while maintaining a Recall of 81.3%, which is equal to that of the baseline. This indicates that ECANet improves detection accuracy while preserving a stable recall level.

Other attention mechanisms show different performance tendencies. CBAM achieves the best Precision of 82.2% (+1.8 percentage points), indicating its effectiveness in feature enhancement. FcaNet steadily improves all evaluation metrics, although its overall gain remains slightly lower than that of ECANet. SimAM increases Recall to 82.7% (+1.4 percentage points), suggesting an advantage in reducing missed detections. In contrast, CA improves Precision but decreases Recall to 80.6% (−0.7 percentage points), indicating a less balanced performance. In addition, after introducing these modules, the Parameters and GFLOPs of the model remain almost unchanged, with only CA showing a very slight increase in GFLOPs. This suggests that the performance improvements mainly come from optimized feature representation rather than a substantial increase in computational cost.

Overall, ECANet achieves the best mAP@0.5 among all compared attention mechanisms. It improves detection accuracy while maintaining stable recall, and delivers superior overall detection performance without introducing additional model complexity.

To further interpret the above quantitative differences from the perspective of feature representation, Grad-CAM [39] was used to visualize YOLOv7 and its variants with different attention mechanisms, as shown in Figure 11. All heatmaps were generated based on the same input image, the same predicted target, and the same feature layer (the output of the small-scale detection branch).

As shown in Figure 11a, the high-response regions of YOLOv7 are mainly concentrated along the tomato edges, while the activation inside the tomato regions is relatively weak. This suggests that the baseline model relies more on boundary features and provides insufficient representation of the target body. After introducing CBAM (Figure 11b), the activated regions become larger and cover the two tomato bodies more completely. However, part of the background is also activated, indicating limited background suppression. In CA (Figure 11e), the high-response regions cover both tomato bodies, and a low-response interval appears between the two targets, suggesting relatively clear instance separation in multi-object scenes.

In contrast, FcaNet (Figure 11c) and SimAM (Figure 11d) show more scattered response distributions. The activated regions of FcaNet appear fragmented and lack a continuous dominant hotspot, indicating limited focus on key discriminative regions. SimAM provides wider target-region coverage than the baseline and FcaNet, but its high-response regions remain relatively dispersed, and the inter-target distinction is not sufficiently clear. In ECANet (Figure 11f), the activated regions are mainly concentrated on the tomato bodies, while background activation is relatively weak. This indicates that ECANet provides more concentrated target responses and stronger background suppression, although its target coverage is less complete than that of CBAM and CA.

In summary, the Grad-CAM visualizations show that different attention mechanisms differ in target coverage, background suppression, and multi-object discrimination. CBAM and CA provide more complete target coverage, whereas ECANet produces more concentrated responses in discriminative target regions with relatively weak background activation. By contrast, FcaNet and SimAM show more dispersed activation patterns. Combined with the quantitative results above, the higher mAP@0.5 obtained by ECANet is consistent with its stronger target-focusing characteristic. However, there remains room for further improvement in response completeness under densely packed multi-object scenes.

3.4. Comparison of Loss Functions

To further improve the stability of bounding box regression in tomato maturity detection, this study conducted comparative experiments on the basis of the YOLOv7 baseline by introducing four loss functions, namely GIoU [40], Focal-EIoU [41], SIoU [42], and WIoU v3. The results are presented in Table 2.

As shown in Table 2, different loss functions have varying effects on YOLOv7. WIoU v3 achieves the highest Recall of 81.9% (+0.6 percentage points), while Precision and mAP@0.5 reach 81.1% and 88.5% (+0.7 and +0.1 percentage points). Its improved Recall is plausibly related to its dynamic gradient weighting, which rebalances the contribution of difficult samples such as occluded or low-light tomatoes. The limited Precision gain suggests that reducing false positives caused by color similarity across adjacent maturity categories depends more on attention mechanisms than on the regression loss.
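The dynamic weighting in WIoU v3 can be sketched as follows. This is a simplified illustration based on the formulation in the WIoU paper, where the outlier degree β is the ratio of a box's IoU loss to the running mean IoU loss and the gradient gain is r = β / (δ · α^(β − δ)). The values α = 1.9 and δ = 3 are the settings reported in that paper; whether this study used them is not stated. The sketch omits the distance-based attention term of the full loss, and the function name is hypothetical.

```python
def wiou_v3_gain(iou_loss: float, mean_iou_loss: float,
                 alpha: float = 1.9, delta: float = 3.0) -> float:
    """Non-monotonic gradient gain r used by WIoU v3 (simplified sketch).

    alpha and delta are assumed hyperparameters, not values from this study.
    """
    beta = iou_loss / mean_iou_loss  # outlier degree of the anchor box
    return beta / (delta * alpha ** (beta - delta))


if __name__ == "__main__":
    # Gain is small for well-fitted boxes (low beta) and for extreme outliers
    # (high beta), and is largest for boxes of intermediate quality.
    for beta in (0.2, 1.0, 1.5, 3.0, 8.0):
        print(f"beta={beta:4.1f} -> gain={wiou_v3_gain(beta, 1.0):.3f}")
```

Under these assumptions, the gain equals 1 when β = δ and falls off on both sides of its peak, so neither already well-localized boxes nor very poor boxes dominate the regression gradient. How this weighting is distributed over occluded or low-light tomato samples is not quantified in this study.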

GIoU achieves the highest Precision (82.9%, +2.5 points) but with lower Recall (80.7%, −0.6 points). Focal-EIoU provides the highest mAP@0.5 (88.9%) with balanced performance, while SIoU increases Precision to 81.9% but reduces Recall to 79.4% (−1.9 points). From the perspective of model complexity, the number of parameters and GFLOPs remain unchanged across all loss functions, suggesting that the performance differences mainly arise from the different optimization strategies for bounding box regression rather than changes in model complexity.

Overall, WIoU v3 offers the best Recall while maintaining unchanged model complexity, making it the preferred loss for bounding-box regression in the improved YOLOv7 model.

To further verify that WIoU v3 can improve the stability of bounding box regression and enhance detection robustness in complex scenes, this study selected representative samples from greenhouse environments, including single fruit, dense multi-fruit clusters, low-light conditions, complex backgrounds, similar maturity levels, and occlusion, and visually compared the detection results of CIoU and WIoU v3. The comparison results are shown in Table 3.

As shown in Table 3, WIoU v3 demonstrates better detection performance in most complex scenarios. In the single-object scene, the detection confidence increases from 0.90 to 0.94, indicating that WIoU v3 provides better regression optimization and helps improve prediction stability. However, in the dense multi-fruit scene, the improvement brought by WIoU v3 is relatively limited, and the number of detected bounding boxes is reduced by three compared with the result of CIoU, suggesting that densely occluded scenes remain challenging for the current model. Under low-light conditions, the detection confidences of three tomato targets increase from 0.82, 0.42, and 0.91 to 0.89, 0.66, and 0.95, respectively, indicating that WIoU v3 provides better robustness for bounding box regression under poor illumination. In the complex-background scene, the confidence of the false detection on the red metal frame at the lower part of the image decreases from 0.75 to 0.37, showing that WIoU v3 can suppress high-confidence false detections caused by background interference to some extent. In the similar-maturity scene, the detection confidence of each target increases slightly, for example from 0.91 to 0.93, suggesting that more accurate bounding box regression helps improve discriminative stability between adjacent maturity categories. In the occlusion scene, CIoU fails to detect two occluded fruits, whereas WIoU v3 successfully detects one of them, further demonstrating its better adaptability to difficult samples.

In summary, the advantages of WIoU v3 are mainly reflected in improving bounding box regression stability, enhancing robustness in complex scenes, and suppressing some background-induced false detections, rather than significantly increasing the number of detections in every scenario.

3.5. Ablation Experiment

This study conducted ablation experiments to quantify the contribution of each proposed module to the performance improvement of YOLOv7. The results are presented in Table 4.

As shown in Table 4, the effects of different improvement modules on YOLOv7 are clearly distinct. When ECANet is introduced alone, Precision and mAP@0.5 increase to 81.5% and 89.3% (+1.1 and +0.9 points), while Recall, Parameters, and GFLOPs remain unchanged relative to the baseline. This indicates that ECANet enhances channel-wise discriminative feature representation and improves detection accuracy without increasing model complexity. When DCNConv is introduced alone, Precision, Recall, and mAP@0.5 reach 79.3%, 81.4%, and 87.2% (−1.1, +0.1, and −1.2 points). Meanwhile, the number of parameters increases slightly from 36.5 M to 36.7 M, and GFLOPs decrease from 103.2 to 96.9. This suggests that although DCNConv reduces computational complexity and improves local geometric modeling, its standalone use may introduce feature misalignment or optimization instability, so its geometric modeling advantage does not directly translate into overall performance gains. WIoU v3 alone provides a modest improvement, increasing Recall to 81.9% without changing Parameters or GFLOPs, reflecting its ability to reduce missed detections.

For module combinations, ECANet + DCNConv improves Recall and mAP@0.5 to 82.9% and 89.2% (+1.6 and +0.8 points), whereas Precision drops to 79.7% (−0.7 points), indicating a trade-off between target recall and false detections due to the interaction of channel attention and deformable convolution. ECANet + WIoU v3 also improves Recall to 82.9% (+1.6 points), while Precision and mAP@0.5 remain close to baseline, showing that channel feature enhancement and regression optimization can complement each other. The combination of DCNConv and WIoU v3 achieves 82.7% Precision and 89.0% mAP@0.5, with GFLOPs reduced to 96.9, demonstrating that regression optimization can partly compensate for the instability of DCNConv when used alone.

When all three modules are combined, the model achieves the highest Precision and mAP@0.5, reaching 83.7% and 89.6% (+3.3 and +1.2 points), indicating that channel feature enhancement, spatial geometric modeling, and regression optimization exhibit stronger complementarity when used together. Meanwhile, GFLOPs decrease by 6.3, and the number of parameters increases only slightly by 0.2 M, showing that the final model improves accuracy without substantially increasing model size. However, the Recall of the three-module combination decreases to 80.5% (−0.8 points), the lowest among all ablation settings. Combined with Figure 12, this decrease is mainly associated with an increase in background false negatives for the fully_ripened category, suggesting that stricter prediction boundaries for some fully ripened tomatoes in complex backgrounds may lead to slightly more missed detections. Overall, the final model achieves a better balance between detection accuracy and computational complexity, though the small reduction in Recall remains a limitation.

To further analyze the category discrimination capability and the precision-recall trade-off of the improved model, this study quantitatively compares YOLOv7 and YOLO-RCM using confusion matrices and PR curves, as shown in Figure 12.

As shown by the confusion matrices in Figure 12, YOLO-RCM alleviates the confusion between adjacent maturity categories to some extent. Compared with the baseline YOLOv7, the proportion of fully_ripened samples misclassified as half_ripened decreases from 0.11 to 0.08, and the proportion of half_ripened samples misclassified as green decreases from 0.04 to 0.03. Meanwhile, the true positive rate of the half_ripened category increases from 0.73 to 0.75, whereas that of the fully_ripened category slightly decreases from 0.81 to 0.80. These results suggest that YOLO-RCM improves fine-grained discrimination among similar maturity categories, although the improvement is not uniform across all categories.

For background false detections and misses, YOLO-RCM shows category-specific changes. The proportions of background samples misclassified as fully_ripened and half_ripened decrease from 0.20 and 0.28 to 0.19 and 0.27, respectively, indicating a certain reduction in background false detections for these two categories. However, the proportion of background samples misclassified as green increases from 0.52 to 0.54, suggesting that background suppression for the green category is still limited. In addition, the background FN of the fully_ripened category increases from 0.08 to 0.13, while the corresponding values for half_ripened and green remain unchanged at 0.07. This indicates that the decrease in Recall is mainly associated with increased missed detections in the fully_ripened category.

The PR curves further show the improvement of YOLO-RCM in overall detection accuracy. Compared with YOLOv7, the AP values of YOLO-RCM for fully_ripened, half_ripened, and green increase from 0.869, 0.840, and 0.942 to 0.895, 0.849, and 0.945, corresponding to gains of 2.6, 0.9, and 0.3 percentage points, respectively. The overall mAP@0.5 increases from 0.884 to 0.896 (+1.2 percentage points). In the medium-to-high recall range, the PR curves of YOLO-RCM are generally above those of YOLOv7, especially for the fully_ripened and half_ripened categories, indicating improved precision under comparable recall levels.
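The relationship between the per-class AP values and the overall mAP@0.5 quoted above can be checked with a short Python calculation. The AP values are those reported in this section; the function and variable names are illustrative.

```python
def mean_average_precision(ap_by_class: dict) -> float:
    """mAP is the arithmetic mean of the per-class AP values."""
    return sum(ap_by_class.values()) / len(ap_by_class)


# Per-class AP@0.5 values reported for Figure 12.
yolov7_ap = {"fully_ripened": 0.869, "half_ripened": 0.840, "green": 0.942}
yolo_rcm_ap = {"fully_ripened": 0.895, "half_ripened": 0.849, "green": 0.945}

if __name__ == "__main__":
    print(round(mean_average_precision(yolov7_ap), 3))    # 0.884
    print(round(mean_average_precision(yolo_rcm_ap), 3))  # 0.896
    for name in yolov7_ap:
        gain = (yolo_rcm_ap[name] - yolov7_ap[name]) * 100
        print(f"{name}: +{gain:.1f} percentage points")
```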

Combined with the ablation results in Table 4, the above improvements may mainly come from the complementary effects of different modules. ECANet contributes clear gains in Precision and mAP@0.5, suggesting that the reduced confusion between adjacent maturity categories is related to enhanced channel-wise discriminative features. DCNConv mainly shows its effect when combined with WIoU v3, where local geometric modeling and regression optimization jointly improve detection performance. However, the increase in misses for the fully_ripened category (from 0.08 to 0.13) remains a limitation of the improved model and partly explains the decrease in Recall.

3.6. Model Comparison Experiment

To verify the effectiveness of the improved model, this study selected representative object detection models with parameter sizes and computational complexity in a comparable range for comparison. Considering that both model parameter scale and computational complexity affect detection performance [43,44], this study comprehensively considered Parameters and GFLOPs, and selected representative versions from mainstream object detection models that are broadly comparable to YOLOv7 in terms of parameter count and computational cost. The final comparison included RT-DETR L [45], YOLOv5 L [46], YOLOv8 M [47], YOLO11 L [48], and YOLO26 L [49]. All models were evaluated under the same experimental settings in terms of detection accuracy and efficiency. The comparison results are shown in Table 5 and Figure 13.

As shown in Table 5, YOLO-RCM achieves the best Precision (83.7%) and mAP@0.5 (89.6%) among all compared models on the current dataset. Compared with the baseline YOLOv7, these two metrics increase by 3.3 and 1.2 percentage points, respectively, indicating that the proposed improvement strategy effectively enhances target discriminability and overall detection accuracy. Although the Recall of YOLO-RCM decreases slightly from 81.3% to 80.5%, it remains higher than that of all other compared models except the baseline, suggesting that the model largely preserves its target detection capability while improving accuracy.

In terms of mAP@0.5, YOLO-RCM outperforms all compared models. Specifically, its mAP@0.5 is higher than that of YOLOv7, YOLOv5 L, and YOLO26 L by 1.2, 4.6, and 4.7 percentage points, respectively, and exceeds YOLO11 L and YOLOv8 M by 4.9 and 12.4 percentage points. Regarding Precision, YOLO-RCM also ranks first, surpassing YOLO26 L by 2.1 percentage points. This indicates that the proposed method has an advantage in reducing false detections and improving the reliability of detection results.

From the perspective of model complexity, YOLO-RCM has 36.7 M parameters, which is only 0.2 M higher than the baseline YOLOv7, while its GFLOPs are reduced from 103.2 to 96.9. This indicates that the proposed method improves accuracy without substantially increasing model size; instead, it reduces computational complexity while keeping the parameter scale nearly unchanged. Compared with YOLOv5 L, which has a higher parameter count and GFLOPs but lower Precision and mAP@0.5, YOLO-RCM shows better efficiency. Furthermore, although YOLOv8 M, YOLO11 L, and YOLO26 L are lighter in terms of Parameters and GFLOPs, their detection accuracy remains lower than that of YOLO-RCM. Overall, YOLO-RCM achieves a favorable balance between detection accuracy and computational cost.

In summary, YOLO-RCM demonstrates the best overall performance on the current tomato dataset. The results confirm that the joint improvement strategy effectively enhances Precision and mAP@0.5 while reducing computational complexity and maintaining a nearly unchanged parameter scale. However, the slight decrease in Recall remains a limitation and should be further investigated in future work.

3.7. Cross-Dataset Robustness Experiment

To further evaluate the generalization ability and robustness of YOLO-RCM, this study selected the publicly available tomato maturity detection dataset TomatOD [50] as an independent external test set (denoted as Dataset B) and conducted cross-dataset robustness experiments on YOLOv7 and YOLO-RCM.

The TomatOD dataset was collected in a greenhouse environment and includes challenging conditions such as overexposure, low illumination, uneven lighting, and complex backgrounds, as shown in Figure 14. The dataset contains 277 tomato images and 2418 tomato annotations, including 431 fully_ripened, 395 half_ripened, and 1592 green instances. In terms of annotation distribution, the TomatOD dataset is class-imbalanced; however, the relative proportions of different categories are consistent with their actual occurrence frequencies in real-world scenes [50].

As shown in Table 6, when evaluated on Dataset B as an external test set, YOLO-RCM exhibits performance gains that are consistent in direction with those observed on Dataset A. With a 1.2 percentage point decrease in Recall and only a 0.2 M increase in parameter count, Precision and mAP@0.5 improve by 5.8 and 4.0 percentage points, respectively, compared with the baseline model, while GFLOPs are reduced by 6.3. These results indicate that the proposed method does not merely fit the characteristics of a specific dataset. Instead, the model is able to capture generalizable features of tomato fruits, thereby demonstrating the robustness and generalization capability of YOLO-RCM across different datasets.

4. Discussion

The experimental results demonstrate that YOLO-RCM can improve the accuracy of tomato maturity detection in complex agricultural scenes. Compared with the baseline YOLOv7, the proposed method shows better overall performance in terms of detection accuracy and robustness. This suggests that the combination of local geometric modeling, channel-wise feature enhancement, and bounding box regression optimization contributes to improved maturity detection under cluttered conditions.

The performance improvement is consistent with the complementary roles of the three modules. The DCNConv module enhances the model’s ability to perceive local geometric variations, making it better suited for detecting tomato targets under challenging conditions such as partial occlusion and overlap. By imposing magnitude constraints on the offsets, the model preserves its deformation modeling capability while preventing sampling points from drifting excessively, thereby supporting a stable representation of local contours. The stability-enhanced ECANet further strengthens discriminative channel responses, which is beneficial for distinguishing adjacent maturity categories with subtle differences. Finally, WIoU v3 improves the stability of bounding box regression, enabling more reliable localization under low-light and complex-background conditions, although its benefit in dense multi-fruit scenes was limited in the qualitative comparison (Table 3).

Compared with existing YOLO-based maturity detection methods, the proposed method focuses more on complex backgrounds and confusion between adjacent maturity categories. Previous studies have improved performance through attention mechanisms, multi-scale fusion, or multi-task learning, but their effectiveness is still constrained by scene complexity, data diversity, or model complexity. The current results indicate that enhancing local feature adaptability and channel discriminability can be beneficial for fine-grained tomato maturity detection under the tested greenhouse conditions.

Despite these improvements, YOLO-RCM still has certain limitations. The model may still misclassify adjacent maturity categories, especially when the color transition is gradual or when fruits are heavily occluded by stems and leaves. In addition, the Recall of YOLO-RCM is lower than that of YOLOv7, indicating that some targets are still missed while accuracy is improved. According to the experimental analysis, the decline in Recall is mainly associated with an increase in missed detections in the fully_ripened category. Moreover, although the proposed method improves detection precision and reduces computational cost, the number of parameters still increases slightly. Therefore, further optimization is still needed for lightweight deployment in real-time agricultural systems.

Additionally, this study still has several shortcomings. First, the maturity stages of different tomato varieties are not always consistent. For example, some green-ripe tomato varieties remain green even when fully mature, making them highly similar in color to the immature stage of ordinary red-ripe tomatoes. This may lead to cross-variety misclassification, which is a primary bottleneck of current maturity detection methods based primarily on color features. Furthermore, the validation of the dynamic weighting mechanism in WIoU v3 remains mainly qualitative, and lacks quantitative evidence regarding its regulation of dynamic weights. Future work should therefore quantitatively verify this mechanism through gradient distribution analysis or weight statistics. In addition, future research will focus on the color confusion problem across multiple tomato varieties and will explore multimodal fusion schemes incorporating spectral or depth information, so as to improve the model’s recognition ability, generalization, and robustness across different tomato varieties. Moreover, explainable AI methods will be incorporated to provide interpretable evidence for model predictions, thereby helping growers and decision makers better understand the basis of maturity detection results.

5. Conclusions

To address the problems of false detections and insufficient adaptability to complex scenes in tomato maturity detection with YOLOv7, this study proposes an improved model, YOLO-RCM. The model enhances key channel feature representation by introducing ECANet into the FPN, improves spatial modeling of complex targets by replacing standard convolutions with DCNConv in the Backbone, and adopts WIoU v3 to optimize the bounding box regression process, providing complementary improvements in channel feature discrimination, local geometric perception, and bounding-box regression stability.

The experimental results indicate that ECANet achieves the best mAP@0.5 among the compared attention mechanisms while maintaining Recall comparable to the baseline; DCNConv reduces computational complexity and shows a clear synergistic effect when combined with WIoU v3; and WIoU v3 achieves the highest Recall in the loss function comparison. In the ablation study, introducing all three modules together results in Precision of 83.7% and mAP@0.5 of 89.6%, with GFLOPs reduced to 96.9 and Parameters increasing slightly to 36.7 M (+0.2 M). These results indicate that YOLO-RCM achieves better overall detection performance by making a reasonable trade-off between precision and recall.

The confusion matrices and PR curves further show that YOLO-RCM reduces misclassification between adjacent maturity categories and lowers background false detections for the fully_ripened and half_ripened categories, although background false detections for the green category increase slightly. As a result, overall detection accuracy is improved. However, missed detections in the fully_ripened category also increase, indicating that while the model enhances discriminative capability, its ability to detect certain targets is reduced to some extent.

Compared with other mainstream object detection models, YOLO-RCM ranks highest in both Precision (83.7%) and mAP@0.5 (89.6%). In the cross-dataset robustness experiment, YOLO-RCM shows improvement trends consistent with those observed in the main experiments: Precision and mAP@0.5 increase by 5.8 and 4.0 percentage points, respectively, while Recall decreases by 1.2 percentage points, Parameters increase by 0.2 M, and GFLOPs are reduced to 96.9 (−6.3). These results demonstrate that YOLO-RCM has good robustness and generalization capability across different datasets.

Author Contributions

Conceptualization, D.C. and H.T.; methodology, D.C. and H.T.; software, D.C.; validation, Y.L. and H.W.; formal analysis, Y.Z.; investigation, Y.L. and Y.Z.; resources, Y.Z.; data curation, D.C.; writing—original draft preparation, D.C.; writing—review and editing, D.C., H.T. and H.W.; funding acquisition, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original dataset used in this study is the publicly available Laboro Tomato dataset. The augmented dataset and code generated in this study are not publicly available but can be obtained from the authors upon reasonable request.

Conflicts of Interest

The authors declare no conflicts of interest.

References

Food and Agriculture Organization of the United Nations. FAOSTAT Statistical Database: Crops and Livestock Products—Tomatoes Production. Available online: https://www.fao.org/faostat/ (accessed on 15 April 2026).
Rizzo, M.; Marcuzzo, M.; Zangari, A.; Gasparetto, A.; Albarelli, A. Fruit ripeness classification: A survey. Artif. Intell. Agric. 2023, 7, 44–57.
Nahak, P.; Pratihar, D.K.; Deb, A.K. Tomato maturity stage prediction based on vision transformer and deep convolution neural networks. Int. J. Hybrid Intell. Syst. 2025, 21, 61–78.
Gómez, A.H.; Hu, G.; Wang, J.; Pereira, A.G. Evaluation of tomato maturity by electronic nose. Comput. Electron. Agric. 2006, 54, 44–52.
Wan, P.; Toudeshki, A.; Tan, H.; Ehsani, R. A methodology for fresh tomato maturity detection using computer vision. Comput. Electron. Agric. 2018, 146, 43–50.
Liu, G.; Mao, S.; Kim, J.H. A mature-tomato detection algorithm using machine learning and color analysis. Sensors 2019, 19, 2023.
Kuang, M.; Li, X.; Wu, B.; Liu, D.; Xiang, Y.; Liu, F.; Zou, X.; Xie, F.; Zhang, Y.; Li, X. CF-DETR: A robust transformer-based framework for small-scale chili flower detection in industrial chili production systems. Front. Plant Sci. 2026, 17, 1824412.
Zhang, Y.; Lu, Y.; Martinez-Rau, L.S.; Fan, Z.; Qiu, Q.; O’Flynn, B.; Bader, S. TinyML-enabled IoT edge framework with knowledge distillation for weed classification. IEEE Internet Things J. 2026, early access.
Kuang, M.; Zou, X.; Xie, F.; Li, X.; Chen, S.; Liu, D.; Zhang, Y.; Bader, S.; Zou, X.; Li, X. DDM-YOLO: A lightweight oriented detection model for mature daylily fruits in complex environments. J. King Saud Univ. Comput. Inf. Sci. 2026, 38, 91.
Kuang, M.; Li, X.; Xie, F.; Zou, X.; Xiang, Y.; Zhang, Y.; Liu, D.; Zou, X.; Li, X. Advancements and prospects in key technologies for robotic pollination in greenhouse pepper breeding: A review. Front. Plant Sci. 2026, 17, 1778541.
Zhang, Y.; Lu, Y.; Martinez-Rau, L.S.; Qiu, Q.; Bader, S. Real-time on-device weed identification using a hardware-efficient lightweight CNN. Front. Plant Sci. 2026, 17, 1747863.
Zu, L.; Zhao, Y.; Liu, J.; Su, F.; Zhang, Y.; Liu, P. Detection and segmentation of mature green tomatoes based on Mask R-CNN with automatic image acquisition approach. Sensors 2021, 21, 7842.
Sun, X. Enhanced tomato detection in greenhouse environments: A lightweight model based on S-YOLO with high accuracy. Front. Plant Sci. 2024, 15, 1451018.
Teng, H.; Sun, F.; Wu, H.; Lv, D.; Lv, Q.; Feng, F.; Yang, S.; Li, X. DS-YOLO: A lightweight strawberry fruit detection algorithm. Agronomy 2025, 15, 2226.
Zhao, M.; Cui, B.; Yu, Y.; Zhang, X.; Xu, J.; Shi, F.; Zhao, L. Intelligent detection of tomato ripening in natural environments using YOLO-DGS. Sensors 2025, 25, 2664.
Yang, Z.; Li, Y.; Han, Q.; Wang, H.; Li, C.; Wu, Z. A method for tomato ripeness recognition and detection based on an improved YOLOv8 model. Horticulturae 2025, 11, 15.
Sun, H.; Zheng, Q.; Yao, W.; Wang, J.; Liu, C.; Yu, H.; Chen, C. An improved YOLOv8 model for detecting four stages of tomato ripeness in greenhouses. Agriculture 2025, 15, 936.
Li, P.; Zheng, J.; Li, P.; Long, H.; Li, M.; Gao, L. Tomato maturity detection and counting model based on MHSA-YOLOv8. Sensors 2023, 23, 6701.
Wang, Y.; Rong, Q.; Hu, C. Ripe tomato detection algorithm based on improved YOLOv9. Plants 2024, 13, 3253.
Huang, W.; Liao, Y.; Wang, P.; Chen, Z.; Yang, Z.; Xu, L.; Mu, J. AITP-YOLO: Improved tomato ripeness detection model based on multiple strategies. Front. Plant Sci. 2025, 16, 1596739.
Wu, Q.; Huang, H.; Song, D.; Zhou, J. YOLO-PGC: A tomato maturity detection algorithm based on improved YOLOv11. Appl. Sci. 2025, 15, 5000.
Chen, W.; Liu, M.; Zhao, C.; Li, X.; Wang, Y. MTD-YOLO: Multi-task deep convolutional neural network for cherry tomato fruit bunch maturity detection. Comput. Electron. Agric. 2024, 216, 108533.
Sun, F.; Lv, Q.; Bian, Y.; He, R.; Lv, D.; Gao, L.; Wu, H.; Li, X. Grape target detection method in orchard environment based on improved YOLOv7. Agronomy 2024, 15, 42.
Lv, Q.; Sun, F.; Bian, Y.; Wu, H.; Li, X.; Zhou, J. A lightweight citrus object detection method in complex environments. Agriculture 2025, 15, 1046.
Trigubenko, R.; Fujihara, H. LaboroTomato: Instance Segmentation Dataset. Available online: https://github.com/laboroai/LaboroTomato#readme (accessed on 20 April 2026).
GH/T 1193-2021 Tomato; All China Federation of Supply and Marketing Cooperatives. All China Federation of Supply and Marketing Cooperatives: Beijing, China, 2021. Available online: https://std.samr.gov.cn/hb/search/stdHBDetailed?id=BF8670ADC4159E08E05397BE0A0A3649 (accessed on 4 June 2026).
Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. Available online: https://arxiv.org/abs/2207.02696 (accessed on 20 April 2026).
Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets V2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019.
Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. ECA-Net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 11531–11539.
Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000.
Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding box regression loss with dynamic focusing mechanism. arXiv 2023, arXiv:2301.10051.
Paik, I.; Choi, J. The disharmony between BN and ReLU causes gradient explosion, but is offset by the correlation between activations. arXiv 2023, arXiv:2304.11692.
Panigrahi, A.; Chen, Y.; Kuo, C.C.J. Analysis on gradient propagation in batch normalized residual networks. arXiv 2018, arXiv:1812.00342.
Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 764–773.
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. CBAM: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 783–792.
Qin, X.; Li, N.; Weng, C.; Su, D.; Li, M. Simple attention module based speaker verification with iterative noisy label detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 22–27 May 2022; pp. 6722–6726.
Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722.
Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual explanations from deep networks via gradient-based localization. Int. J. Comput. Vis. 2020, 128, 336–359.
Rezatofighi, H.; Tsoi, N.; Gwak, J.Y.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 658–666.
Zhang, Y.F.; Ren, W.; Zhang, Z.; Jia, Z.; Wang, L.; Tan, T. Focal and efficient IoU loss for accurate bounding box regression. Neurocomputing 2022, 506, 146–157.
Gevorgyan, Z. SIoU loss: More powerful learning for bounding box regression. arXiv 2022, arXiv:2205.12740.
Chen, W.; Luo, J.; Zhang, F.; Tian, Z. A review of object detection: Datasets, performance evaluation, architecture, applications and current trends. Multimed. Tools Appl. 2024, 83, 65603–65661.
Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787.
Lv, W.; Xu, S.; Zhao, Y.; Wang, G.; Wei, J.; Cui, C.; Du, Y.; Dang, Q.; Liu, Y. DETRs beat YOLOs on real-time object detection. arXiv 2023, arXiv:2304.08069.
Jocher, G. Ultralytics YOLOv5, Version 7.0. Available online: https://github.com/ultralytics/yolov5 (accessed on 15 April 2026).
Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics YOLOv8, Version 8.0.0. Available online: https://github.com/ultralytics/ultralytics (accessed on 15 April 2026).
Jocher, G.; Qiu, J. Ultralytics YOLO11, Version 11.0.0. Available online: https://github.com/ultralytics/ultralytics (accessed on 15 April 2026).
Jocher, G.; Qiu, J. Ultralytics YOLO26, Version 26.0.0. Available online: https://github.com/ultralytics/ultralytics (accessed on 15 April 2026).
Tsironis, V.; Bourou, S.; Stentoumis, C. TOMATOD: Evaluation of object detection algorithms on a new real-world tomato dataset. Int. Arch. Photogramm. Remote Sens. Spatial Inf. Sci. 2020, XLIII-B3, 1077–1084.

Figure 1. Examples of tomato fruits at different maturity stages: (a) fully_ripened; (b) half_ripened; (c) green.

Figure 3. Architecture of the YOLOv7 model.

Figure 4. Architecture of the YOLO-RCM model.

Figure 5. Structure of the modulated deformable convolution-based DCNConv module.

Figure 6. Structure of the ECANet module.

Figure 7. Geometric illustration of WIoU v3.

Figure 8. Training curves of the joint ECANet and DCNConv improvement.

Figure 9. Structure of the improved DCNConv module.

Figure 10. Adjusted model architecture.

Figure 11. Comparison of heatmaps for different attention mechanisms: (a) YOLOv7; (b) YOLOv7 + CBAM; (c) YOLOv7 + FcaNet; (d) YOLOv7 + SimAM; (e) YOLOv7 + CA; (f) YOLOv7 + ECANet. The heatmaps use the standard Jet colormap, in which color encodes the attention weight: dark blue marks the lowest weight, followed by cyan, green, and yellow in ascending order, and bright red marks the highest weight.

Figure 12. Comparison of confusion matrices and PR curves: (a) YOLOv7 confusion matrix; (b) YOLOv7 PR curve; (c) YOLO-RCM confusion matrix; (d) YOLO-RCM PR curve.

Figure 13. Comparison of detection results from different models.

Figure 14. Sample images from the TomatOD dataset: (a) Overexposure; (b) Low illumination; (c) Uneven lighting; (d) Complex background.

Table 1. Comparison results of different attention mechanisms.

Models	P/%	R/%	mAP@0.5/%	Parameters/M	GFLOPs
YOLOv7	80.4	81.3	88.4	36.5	103.2
YOLOv7 + CBAM	82.2	81.9	88.9	36.5	103.2
YOLOv7 + FcaNet	82.0	81.8	88.8	36.5	103.2
YOLOv7 + SimAM	80.4	82.7	88.9	36.5	103.2
YOLOv7 + CA	82.1	80.6	88.7	36.5	103.3
YOLOv7 + ECANet	81.5	81.3	89.3	36.5	103.2

Table 2. Comparison results of loss functions.

Model	P/%	R/%	mAP@0.5/%	Parameters/M	GFLOPs
YOLOv7	80.4	81.3	88.4	36.5	103.2
YOLOv7 + GIoU	82.9	80.7	88.8	36.5	103.2
YOLOv7 + Focal-EIoU	81.3	81.6	88.9	36.5	103.2
YOLOv7 + SIoU	81.9	79.4	88.5	36.5	103.2
YOLOv7 + WIoU v3	81.1	81.9	88.5	36.5	103.2

Table 3. Visual comparison of detection results using different loss functions.

[Image table not reproduced here. Columns: Original Images, CIoU, WIoU v3. Rows (scenarios): single object; multiple objects; dark; complex background; similar maturity; occlusion.]

Table 4. Ablation experiment results.

Baseline	ECANet	DCNConv	WIoU v3	P/%	R/%	mAP@0.5/%	Parameters/M	GFLOPs
YOLOv7				80.4	81.3	88.4	36.5	103.2
	√			81.5	81.3	89.3	36.5	103.2
		√		79.3	81.4	87.2	36.7	96.9
			√	81.1	81.9	88.5	36.5	103.2
	√	√		79.7	82.9	89.2	36.7	96.9
	√		√	80.7	82.9	88.7	36.5	103.2
		√	√	82.7	80.6	89.0	36.7	96.9
	√	√	√	83.7	80.5	89.6	36.7	96.9

√ indicates that the corresponding module is used in that configuration.
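The headline differences between the full model and the baseline can be recomputed from the first and last rows of Table 4. The following is a minimal sketch of that arithmetic; the dictionary keys and the `deltas` helper are illustrative names, not part of the original work.

```python
# Values copied from Table 4: the YOLOv7 baseline row and the row with all
# three modules enabled (ECANet + DCNConv + WIoU v3). Key names are illustrative.
baseline = {"P": 80.4, "R": 81.3, "mAP@0.5": 88.4, "Params_M": 36.5, "GFLOPs": 103.2}
yolo_rcm = {"P": 83.7, "R": 80.5, "mAP@0.5": 89.6, "Params_M": 36.7, "GFLOPs": 96.9}


def deltas(new, ref):
    """Return new - ref for every metric in ref, rounded to one decimal place."""
    return {name: round(new[name] - ref[name], 1) for name in ref}


diff = deltas(yolo_rcm, baseline)
for name, value in diff.items():
    # Explicit sign makes gains and losses easy to tell apart.
    print(f"{name}: {value:+.1f}")
```

Under these inputs the differences are +3.3 (P), -0.8 (R), +1.2 (mAP@0.5), +0.2 M parameters, and -6.3 GFLOPs, which matches the figures reported in the abstract.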

Table 5. Comparison results of different models.

Model	P/%	R/%	mAP@0.5/%	Parameters/M	GFLOPs
RT-DETR L	79.9	77.2	79.1	32.0	103.4
YOLOv5 L	76.6	80.3	85.0	53.1	134.7
YOLOv7	80.4	81.3	88.4	36.5	103.2
YOLOv8 M	78.4	77.3	77.2	25.8	78.7
YOLO11 L	79.6	78.5	84.7	25.3	86.6
YOLO26 L	81.6	76.0	84.9	24.7	86.1
YOLO-RCM	83.7	80.5	89.6	36.7	96.9
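To read Table 5 at a glance, one can pick the best model per column. The sketch below does this with plain Python; the column names, the `higher_is_better` mapping, and the `best_per_metric` helper are assumptions introduced for illustration, and the rows are copied from the table.

```python
# Rows copied from Table 5: (model, P %, R %, mAP@0.5 %, parameters in M, GFLOPs).
rows = [
    ("RT-DETR L", 79.9, 77.2, 79.1, 32.0, 103.4),
    ("YOLOv5 L", 76.6, 80.3, 85.0, 53.1, 134.7),
    ("YOLOv7", 80.4, 81.3, 88.4, 36.5, 103.2),
    ("YOLOv8 M", 78.4, 77.3, 77.2, 25.8, 78.7),
    ("YOLO11 L", 79.6, 78.5, 84.7, 25.3, 86.6),
    ("YOLO26 L", 81.6, 76.0, 84.9, 24.7, 86.1),
    ("YOLO-RCM", 83.7, 80.5, 89.6, 36.7, 96.9),
]
columns = ["P", "R", "mAP@0.5", "Params_M", "GFLOPs"]
# Accuracy metrics should be maximised; model size and compute minimised.
higher_is_better = {"P": True, "R": True, "mAP@0.5": True,
                    "Params_M": False, "GFLOPs": False}


def best_per_metric(rows, columns, higher_is_better):
    """Map each column name to the model with the best value in that column."""
    best = {}
    for idx, col in enumerate(columns, start=1):  # index 0 holds the model name
        pick = max if higher_is_better[col] else min
        best[col] = pick(rows, key=lambda row: row[idx])[0]
    return best


best = best_per_metric(rows, columns, higher_is_better)
for col, model in best.items():
    print(f"{col}: {model}")
```

With the table values, YOLO-RCM leads on Precision and mAP@0.5, YOLOv7 keeps the highest Recall, and the smaller YOLO26 L and YOLOv8 M have the fewest parameters and the lowest GFLOPs, respectively.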

Table 6. Cross-dataset robustness experiment results.

Model	P/%	R/%	mAP@0.5/%	Parameters/M	GFLOPs
YOLOv7	55.2	63.2	55.2	36.5	103.2
YOLO-RCM	61.0	62.0	59.2	36.7	96.9
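The cross-dataset gains quoted in the abstract follow directly from the two rows of Table 6. A short check, assuming the table values are in percent and using an illustrative tuple layout:

```python
# Rows copied from Table 6 (TomatOD robustness test): (P %, R %, mAP@0.5 %).
yolov7_tomatod = (55.2, 63.2, 55.2)
yolo_rcm_tomatod = (61.0, 62.0, 59.2)

metric_names = ("P", "R", "mAP@0.5")

# Percentage-point change of YOLO-RCM relative to the YOLOv7 baseline.
gains = {
    name: round(new - ref, 1)
    for name, new, ref in zip(metric_names, yolo_rcm_tomatod, yolov7_tomatod)
}
print(gains)
```

The result is +5.8 points in Precision and +4.0 points in mAP@0.5, consistent with the abstract, alongside a 1.2-point drop in Recall.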

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Chen, D.; Teng, H.; Lu, Y.; Zhang, Y.; Wu, H. YOLO-RCM: An Improved Tomato Maturity Detection Model for Complex Greenhouse Environments. Agronomy 2026, 16, 1146. https://doi.org/10.3390/agronomy16121146


Note that from the first issue of 2016, this journal uses article numbers instead of page numbers.
