1. Introduction
Cherry tomato is widely cultivated on a global scale. It is highly valued for its distinct flavor and superior nutritional content, boasting high levels of vitamin C, potassium, and lycopene [1]. Maturity stage identification is a decisive factor in autonomous harvesting, as the timing of intervention fundamentally dictates postharvest physiological quality and subsequent commercial value. Cherry tomatoes harvested too early often exhibit stunted development, hard skin, and high tomatine content, while late harvesting can lead to overripe fruit and a shortened shelf life; both result in reduced quality and economic losses [2,3]. The current harvesting of greenhouse cherry tomatoes remains predominantly manual. This practice, however, is hampered by the complex greenhouse environment, which leads to low efficiency, high labor intensity, substantial costs, and adverse working conditions [4]. Furthermore, identification of tomato maturity is significantly influenced by the subjective judgment of harvesters, making it difficult to establish a unified classification standard. This limitation hinders the standardization and efficiency improvement of the current tomato cultivation industry [5,6]. Although picking robots show great potential for application in agriculture, existing machines suffer from low recognition accuracy, failing to meet practical picking needs [7]. Thus, developing an accurate and efficient tomato maturity detection algorithm is essential.
A computer vision system is the core perception framework for robotic harvesting, enabling precise fruit identification, localization, and maturity assessment. Its performance directly governs the success rate of the picking operation [8]. Conventional approaches to fruit maturity classification typically begin with image acquisition of tomatoes during cultivation. Digital image processing is then applied to a single external feature of the fruit to recognize the fruit target [9,10]. Laykin et al. [11] employed HSI conversion and threshold segmentation to analyze chromatic and morphological parameters for fruit quality assessment. Khoshroo et al. [12] implemented a region-growth segmentation algorithm to differentiate maturity levels, achieving an accuracy of 82.38% through the integration of a watershed transformation. Si et al. [13] proposed a color difference ratio algorithm for apple recognition based on red-green differences, with an accuracy of 89.5%. Chen and Ding [14] distinguished ripe from semi-ripe tomatoes utilizing infrared spectroscopy and color analysis, achieving a classification accuracy exceeding 94.8%. Liu et al. [15] achieved a 94.41% detection accuracy by training a Support Vector Machine (SVM) classifier with Histograms of Oriented Gradients (HOG) features, utilizing Non-Maximum Suppression (NMS) to refine detection results.
Although digital image processing methods can detect fruit ripeness to a certain extent, they exhibit poor robustness against environmental interference during target recognition and struggle to handle complex detection environments. While the SVM algorithm demonstrates high detection accuracy, its inherent computational complexity and low inference efficiency hinder its deployment on resource-constrained edge devices [16]. In practical greenhouse settings, dense planting and fluctuating illumination pose significant challenges, often leading to fruit occlusion and reduced recognition precision. Fueled by advances in computer technology, deep learning techniques have been used extensively in several distinct areas, such as perch individual motion feature extraction [17], fishing boat sailing centerline detection [18], and abnormal pine tree detection [19]. These applications show great potential and make fruit ripeness recognition possible in complex environments [20].
Yan et al. [21] proposed a picking point localization technique that combines deep threshold segmentation and Mask R-CNN, achieving an 87.3% success rate for fruit stem localization. Quach et al. [22] developed a tomato identification model based on MobileNet. By combining this model with the YOLOv8 detection algorithm, they achieved a 96.69% detection accuracy in complex environments. Leveraging the Inception V2 network with the Single Shot MultiBox Detector (SSD), Yuan et al. [23] reached a detection accuracy of 98.85% for cherry tomatoes within greenhouse settings. Guan et al. [24] employed YOLOv5 to identify the positional relationship between the tomato pedicel and fruit, achieving a processing time of 104 ms per frame, which satisfies the operational criteria for real-time robotic picking. Solimani et al. [25] presented a YOLOv8 model incorporating the SE module, which enhanced the model’s capacity to identify targets of various sizes in intricate settings. Gao et al. [26] added coordinate attention (CA) to the model’s backbone network, which enhanced the algorithm’s capacity to identify maturity features and raised the average detection accuracy by 1.3% over the initial model.
Despite the high detection speed and accuracy of neural network models in maturity recognition, their robustness remains limited under environmental stressors such as fluctuating illumination and physical occlusion. Furthermore, automated harvesting requires strict real-time synchronization between the vision system and the robotic arm. Relying on server-side deployment often introduces unpredictable latencies caused by network instability and signal attenuation in complex greenhouse settings, thereby undermining harvesting success rates. Consequently, edge deployment is essential to ensure deterministic latency, which necessitates a lightweight model architecture.
For those reasons, this paper presents YOLO-ELS, a lightweight algorithm for detecting cherry tomato maturity in greenhouses. It is designed to enhance recognition accuracy while maintaining a low computational footprint. The main contributions of this paper are as follows:
A dataset of cherry tomatoes in a greenhouse environment is collected, labeled and classified by maturity to meet multi-maturity classification tasks;
A lightweight cherry tomato maturity recognition algorithm YOLO-ELS is proposed for complex environments. The proposed model demonstrates significantly enhanced capability in identifying and classifying fruits of different maturity levels under challenging conditions including branch occlusion and varying illumination;
Ablation and comparative experiments were conducted to evaluate the contribution of each improved module, establish a theoretical foundation for the reliability of the enhancement strategies, and empirically demonstrate the efficacy of the proposed algorithm.
3. Detection Algorithm of Tomato Maturity
In greenhouse cultivation of cherry tomatoes, plants can reach heights of 1.6–1.8 m. To optimize land utilization, high-density planting configurations are typically employed, which inevitably leads to severe fruit occlusion. Such environments result in heterogeneous light distribution across the canopy, further exacerbated by leaf occlusion. Even within a single plant, fruit maturity varies considerably due to mutual shading among fruits and between fruits and leaves. Therefore, to address the issues with the current algorithms and realize the requirements of deploying to edge devices, this paper improves the YOLOv8n algorithm.
3.1. Baseline Model Selection
As a single-stage detection algorithm, the YOLO series streamlines object detection by treating it as an end-to-end regression problem. By processing images through a single convolutional neural network in one forward pass, the algorithm directly generates bounding box coordinates and class probabilities. This architectural efficiency ensures high detection speeds. Based on its predecessor’s architecture, YOLOv8 introduces systematic optimizations to the backbone, neck, and head modules. These optimizations significantly enhance the model’s feature extraction and object detection performance, demonstrating strong potential for various recognition tasks. Also, as a representative model, YOLOv8 has proven its reliability through extensive practical use, making it a suitable baseline for interpretable module replacement and ablation experiments. Therefore, YOLOv8 is selected as the baseline model in this study. Its architectural diagram is presented in
Figure 4.
In terms of backbone network design, the YOLOv8 backbone evolves the CSPDarknet architecture by substituting the original C3 module with the C2f module. This modification introduces richer branched connections to enhance gradient flow, thereby significantly improving both feature extraction capability and contextual information fusion efficiency, without substantially increasing the model’s parameter count [
27]. As for the neck network, YOLOv8 continues to employ the PANet structure, which facilitates the aggregation of features through both top-down and bottom-up pathways. This design enhances the flow of information between the three different-scale feature maps output by the backbone. Furthermore, YOLOv8 simplifies the network architecture by removing two convolutional layers from its upsampling component. This modification reduces the computational burden of the model. In the detection head, YOLOv8 innovatively adopts a decoupled head design. This design completely separates the classification and regression tasks into two independent network branches. Each branch can thus focus on its specific task, leading to higher detection accuracy and faster convergence.
These architectural improvements significantly enhance YOLOv8’s detection performance in handling scale variations and complex backgrounds. As a result, the model maintains high detection accuracy while retaining real-time inference capability. This characteristic makes it well-suited for the task of greenhouse cherry tomato ripeness detection, demonstrating promising application potential [
28]. Among the different versions in the YOLOv8 series, the YOLOv8n model has the smallest computational footprint, making it more suitable for deployment on resource-constrained edge detection devices [
29]. Therefore, this study selects YOLOv8n as the baseline algorithm.
3.2. Improved YOLO-ELS Network Model Design
Despite its excellence as a lightweight baseline model for object detection, YOLOv8n encounters notable limitations when directly used for cherry tomato maturity detection on greenhouse edge devices.
As a core component of the backbone, the C2f module enhances feature reuse capability through extensive cross-layer connections. However, its structure involves a large number of convolutional and bottleneck layers, which adversely affects the model’s real-time detection performance. Additionally, due to the unique nature of greenhouse cultivation environments, cherry tomato fruits exhibit significant scale variation, severe occlusion, and target overlap. The original pooling module cannot adequately adapt to objects of different sizes, and the detection head demonstrates insufficient feature extraction capability for partially visible fruits. These deficiencies can easily lead to missed detection of small target fruits and misclassification of fruit maturity, severely compromising the model’s detection performance. To address these issues, we propose YOLO-ELS, an improved model based on the YOLOv8n architecture. The structural design of the improved YOLO-ELS model is illustrated in Figure 5.
The specific modifications of the YOLO-ELS model are as follows:
The bottleneck modules within the original C2f structure are replaced by Edge Information Enhanced Modules (EIEM). By optimizing the feature representation within the C2f blocks, this substitution prioritizes critical morphological cues and filters out redundant background information, significantly increasing the model’s sensitivity to fruit shape characteristics;
In the Spatial Pyramid Pooling Fast module, the Large Separable Kernel Attention (LSKA) is integrated immediately after the concat module. By applying attention to the fused multi-scale features, LSKA expands the effective receptive field, enhancing the model’s ability to recognize cherry tomatoes across varying dimensions and sizes;
The Spatially Enhanced Attention Module (SEAM) is incorporated into the decoupled detection head, positioned directly after the first convolutional layer. This specific placement allows SEAM to strengthen the feature response of visible fruit areas to compensate for response losses in obscured regions, thereby improving recognition performance under severe greenhouse occlusion;
The original CIoU loss function is replaced by Inner-GIoU to optimize the bounding box regression process. By utilizing auxiliary bounding boxes for loss calculation, this substitution accelerates training convergence and enhances the localization accuracy for fruit samples across different scales.
These improvement measures substantially elevate the YOLO-ELS model’s recognition capability for cherry tomato maturity in greenhouse environments.
3.3. C2f-EIEM Edge Feature Extraction Module
In the YOLOv8 architecture, the C2f module processes intermediate feature maps through a split-and-merge strategy. One branch maintains a direct path for feature fusion, while the other passes through a Bottleneck sequence involving convolution, normalization, and activation. This dual-path design facilitates the extraction of more comprehensive feature representations. Ultimately, the feature representation of the algorithm is improved by feature fusion with a multi-branch structure, the processing flow of which is shown in
Figure 6a. However, in greenhouse environments, the chromatic similarity between unripe tomatoes and the background foliage often leads to misidentifications. When utilizing the original model, background branches and leaves are frequently misclassified as unripe fruits, resulting in false positives that compromise overall detection accuracy.
To solve the issue described above, this paper introduces the Edge Information Enhanced Module (EIEM) into the C2f layer. The EIEM is designed to strengthen the extraction of discriminative edge features, thereby improving the model’s ability to distinguish cherry tomatoes from complex backgrounds. The improved module is shown in Figure 6b.
To systematically strengthen edge detection, the C2f-EIEM module undergoes a complete structural substitution, in which the original bottlenecks are replaced with EIEM to facilitate more robust gradient information extraction. The module uses two parallel convolutional branches to obtain more comprehensive feature information. One branch performs feature extraction on the input through convolution to retain the image’s spatial information; the other parallel branch incorporates the Sobel operator for edge feature extraction, enhancing the model’s shape awareness of target objects. The structure of the Sobel operator is shown in Figure 7.
The Sobel operator employs two mutually perpendicular convolution kernels to compute directional derivatives. By convolving these kernels with the image, approximations of the horizontal and vertical luminance gradients are obtained. These directional gradients are then combined to determine the gradient magnitude for each pixel. Finally, by applying a predefined grayscale threshold, the edge segmentation of the target is achieved [30]. The formula for this edge operator is shown below:

$$G_x = \begin{bmatrix} -1 & 0 & +1 \\ -2 & 0 & +2 \\ -1 & 0 & +1 \end{bmatrix} * A$$

$$G_y = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ +1 & +2 & +1 \end{bmatrix} * A$$

$$G = \sqrt{G_x^2 + G_y^2}$$

where A represents the original unprocessed image, Gx and Gy are the grey values of the image for horizontal and vertical edge detection, respectively, and G is the grey value of the point that is finally calculated. The processed image is shown in Figure 8.
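The Sobel edge extraction described above can be illustrated with a minimal pure-Python sketch. It is not the EIEM implementation: the function name `sobel_edges`, the kernel sign convention, the zero border handling, and the toy image patch are all assumptions made for illustration.

```python
import math

# Standard 3x3 Sobel kernels (one common sign convention; an assumption here).
KX = [[-1, 0, 1],
      [-2, 0, 2],
      [-1, 0, 1]]
KY = [[-1, -2, -1],
      [0, 0, 0],
      [1, 2, 1]]


def sobel_edges(image, threshold):
    """Return (magnitude, mask) for a grayscale image given as a list of rows.

    The kernels are applied as a sliding-window correlation; the kernel flip of
    a true convolution only changes the sign of the gradients, not the magnitude.
    Border pixels stay at zero because the 3x3 window does not fit there.
    """
    h, w = len(image), len(image[0])
    magnitude = [[0.0] * w for _ in range(h)]
    mask = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = gy = 0.0
            for dy in range(3):
                for dx in range(3):
                    pixel = image[y + dy - 1][x + dx - 1]
                    gx += KX[dy][dx] * pixel
                    gy += KY[dy][dx] * pixel
            g = math.sqrt(gx * gx + gy * gy)  # gradient magnitude G
            magnitude[y][x] = g
            mask[y][x] = 1 if g >= threshold else 0  # grayscale threshold
    return magnitude, mask


# Hypothetical 5x5 patch: dark left part, bright right part (a vertical edge).
patch = [[0, 0, 0, 255, 255] for _ in range(5)]
magnitude, mask = sobel_edges(patch, threshold=128)
```

In this toy patch only the pixels adjacent to the intensity step receive a non-zero gradient, which is the behavior the edge branch relies on to suppress uniform background regions.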
Compared to the original C2f module, the enhanced C2f-EIEM module effectively filters out substantial irrelevant background information in images and directly extracts more accurate edge orientation information. This process significantly reduces data volume while preserving the model’s ability to capture edge features. By minimizing interference from unrelated background conditions, the module helps lower the false detection rate and reduces the computational demand of the detection algorithm.
3.4. SPPF-LSKA Large Separable Kernel Attention Module
In facility-based cherry tomato cultivation, heterogeneous light distribution within the plant canopy leads to varying maturity levels across different vertical layers. Consequently, fruits of different sizes and maturity stages often coexist within the same field of view. Experimental results demonstrate that the original model, when confronted with occluded small targets, tends to over-focus on local features. This results in the fragmentation of large targets into multiple independent instances during recognition, thereby generating duplicate detections and compromising detection accuracy.
In the YOLOv8 algorithm, the Spatial Pyramid Pooling Fast (SPPF) module performs pooling operations on convolutional feature maps through grids of varying granularities. This design integrates feature information under different receptive fields, thereby enabling efficient processing of multi-scale targets. However, in the original YOLOv8 algorithm, the static pooling layers cannot adequately adapt to tomato fruit targets with varying scales. Therefore, to enhance the model’s capability in learning cherry tomato targets of varying sizes and to ensure detection accuracy for multi-scale objects, the Large Separable Kernel Attention (LSKA) mechanism is incorporated into the original SPPF module. Specifically, it is inserted between the concat layer and the Conv layer. The modified module architecture is shown in Figure 9.
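To make the multi-receptive-field pooling concrete, the following one-dimensional sketch assumes the common SPPF layout in which three stride-1 max-pooling layers with the same kernel size are chained and their outputs are concatenated with the input. The helper names and the sample signal are hypothetical; real SPPF operates on 2-D feature maps.

```python
def max_pool_1d(values, kernel):
    """Stride-1 max pooling with 'same' output length (out-of-range positions ignored)."""
    half = kernel // 2
    n = len(values)
    return [max(values[max(0, i - half): min(n, i + half + 1)]) for i in range(n)]


def sppf_branches(values, kernel=5):
    """Return the four branches that are concatenated: the input plus three chained pools."""
    y1 = max_pool_1d(values, kernel)
    y2 = max_pool_1d(y1, kernel)
    y3 = max_pool_1d(y2, kernel)
    return [values, y1, y2, y3]


# Hypothetical 1-D activation profile.
signal = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
x, y1, y2, y3 = sppf_branches(signal)

# Chaining pools widens the receptive field: two 5-wide pools act like one
# 9-wide pool, and three act like one 13-wide pool.
same_as_9 = y2 == max_pool_1d(signal, 9)
same_as_13 = y3 == max_pool_1d(signal, 13)
```

Because the pooling windows are fixed, the receptive fields are static; in the modified module, the attention step is applied to the concatenated branches so that their contributions can be reweighted.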
To expand the receptive field without prohibitive computational costs, the LSKA module [31] employs a kernel decomposition strategy. Within this module, two-dimensional convolutional kernels are decomposed into separate horizontal and vertical one-dimensional kernels. These directional kernels are then sequentially applied to the input features, allowing the attention module to efficiently implement large-kernel depthwise convolutions. This architectural refinement allows the model to capture extensive contextual information, thereby facilitating superior multi-scale feature representation in complex scenes.
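As shown in Figure 10, compared to Large Kernel Attention (LKA), the improved LSKA module decomposes the original (2d − 1) × (2d − 1) two-dimensional convolutional kernel into two one-dimensional depthwise convolutional layers of 1 × (2d − 1) and (2d − 1) × 1, which extract information in the horizontal and vertical directions. These layers capture the local contextual information of the feature map, and their cascaded outputs generate the preliminary attention map. The output after LSKA processing is as follows:

$$\bar{Z}^{C} = \sum_{H,W} W^{C}_{(2d-1)\times 1} * \left( \sum_{H,W} W^{C}_{1\times(2d-1)} * F^{C} \right)$$

$$Z^{C} = \sum_{H,W} W^{C}_{\lfloor k/d \rfloor \times 1} * \left( \sum_{H,W} W^{C}_{1\times \lfloor k/d \rfloor} * \bar{Z}^{C} \right)$$

$$A^{C} = W_{1\times 1} * Z^{C}$$

$$\bar{F}^{C} = A^{C} \otimes F^{C}$$

where ⊗ is the Hadamard product, ∗ is convolution, d is the dilation rate, W is the convolution kernel, k is the maximal receptive field of the kernel W, $F^{C}$ is the input feature map, $A^{C}$ is the attention map, $Z^{C}$ is the output of the dilated depthwise convolution with kernel sizes of $1 \times \lfloor k/d \rfloor$ and $\lfloor k/d \rfloor \times 1$, $\bar{Z}^{C}$ is the output of the depthwise convolution with kernel sizes of $1 \times (2d-1)$ and $(2d-1) \times 1$, and $\bar{F}^{C}$ is the LSKA output.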
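The parameter saving from the kernel decomposition can be checked with simple arithmetic. The sketch below counts only the spatial depthwise weights per channel and assumes kernel sizes of 2d − 1 for the depthwise convolution and ⌊k/d⌋ for the dilated depthwise convolution; the values k = 23 and d = 3 are illustrative and are not taken from this paper.

```python
def lska_kernel_sizes(k, d):
    """Assumed kernel sizes: depthwise conv (2d - 1), dilated depthwise conv floor(k / d)."""
    return 2 * d - 1, k // d


def spatial_params_per_channel(k, d):
    """Spatial depthwise weights per channel, ignoring biases and the 1x1 convolution."""
    dw, dwd = lska_kernel_sizes(k, d)
    lka = dw * dw + dwd * dwd   # two square 2-D kernels
    lska = 2 * dw + 2 * dwd     # each square kernel split into 1xN and Nx1
    return lka, lska


# Illustrative setting (hypothetical, not reported in the paper).
lka_params, lska_params = spatial_params_per_channel(k=23, d=3)
```

The 2-D variant grows quadratically with the kernel size while the separable variant grows linearly, which is why the decomposition keeps large receptive fields affordable on edge hardware.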
By integrating the LSKA module, the model inevitably incurs an increase in parameters. However, this integration significantly expands the model’s receptive field and enhances its spatial contextual perception, enabling more effective recognition of cherry tomato targets at varying scales. This improvement mitigates the original model’s excessive reliance on local features and promotes robust multi-scale feature aggregation. Furthermore, the convolutional kernel decomposition and depthwise convolution design adopted by LSKA substantially reduce the total parameter count compared to the standard LKA model. Therefore, while introducing attention mechanisms to strengthen feature representation, this design maintains the lightweight nature of the model, making it more suitable for deployment on edge devices.
3.5. SEAM Head Module
The detection of cherry tomatoes in greenhouse environments is frequently compromised by severe occlusion among fruits, branches, and leaves. Such occlusion results in feature overlap and the loss of discriminative characteristics. To enhance detection performance under these conditions, the Spatially Enhanced Attention Module (SEAM) is integrated after the first convolutional layer within the detection head. This module strengthens the feature response by enhancing discriminative cues from visible regions, thereby compensating for information loss in occluded areas and improving the model’s capacity to identify partially obscured targets. The structure of SEAM module is shown in
Figure 11.
Within the SEAM module [
32], input images undergo processing through a residual-enhanced CSMM channel, where depthwise separable convolution establishes cross-dimensional correlations between spatial and channel features. Subsequent channel convolution integrates inter-channel information to strengthen feature connectivity, while the synergistic combination of GELU activation and feature map normalization jointly stabilizes the training process.
Subsequent to the CSMM channel, the module utilizes a two-layer fully connected architecture to aggregate global channel information. This approach enhances the interaction between feature channels, allowing the algorithm to capture and represent heterogeneous image characteristics more robustly. When an occluded target is detected, the lost features can be compensated using the channel relationships learned from unoccluded targets.
Finally, an exponential function is applied to the output logits of the fully connected layer, rescaling the activation values from [0, 1] to [1, e]. These values serve as attention weights and are integrated with the original features through element-wise multiplication to produce the final output. By integrating SEAM into the head module, the framework effectively mitigates informative feature loss induced by fruit-plant occlusion, thereby enhancing overall detection performance while specifically improving recognition accuracy for occluded targets.
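The final reweighting step can be sketched as follows. This is a simplified illustration rather than the SEAM implementation: it assumes one score per channel that already lies in [0, 1] (for example after a sigmoid), and the function name and toy values are hypothetical.

```python
import math


def seam_reweight(features, channel_scores):
    """Scale each channel by exp(score).

    Assumption: scores lie in [0, 1], so the weights lie in [1, e] and a
    channel is never attenuated, only left unchanged or amplified.
    """
    weights = [math.exp(score) for score in channel_scores]
    reweighted = [[w * value for value in channel]
                  for w, channel in zip(weights, features)]
    return reweighted, weights


# Toy example: three channels with two activations each (hypothetical values).
features = [[1.0, 2.0], [0.5, 0.5], [3.0, 0.0]]
scores = [0.0, 0.5, 1.0]
reweighted, weights = seam_reweight(features, scores)
```

Because every weight is at least 1, responses from visible regions are amplified rather than suppressed, which matches the stated goal of compensating for weak responses in occluded areas.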
3.6. Loss Function Improvement
The loss function serves not only as a critical metric for evaluating model predictions but also as the fundamental mechanism for guiding gradient optimization and training trajectories. By modifying the loss function, the model can be guided in the desired direction according to the specific requirements of the particular dataset. To improve the detection accuracy and speed of the model, the original CIoU loss function is replaced with an enhanced Inner-GIoU function, which achieves faster regression convergence than the original. The relevant formulas are shown below:

$$IoU = \frac{|B \cap B^{gt}|}{|B \cup B^{gt}|}$$

$$GIoU = IoU - \frac{|C \setminus (B \cup B^{gt})|}{|C|}$$

$$L_{GIoU} = 1 - GIoU$$

where $B$ and $B^{gt}$ represent the predicted box and the ground-truth box, and $C$ is the smallest rectangle that covers $B$ and $B^{gt}$. In contrast to the CIoU loss, the GIoU loss function incorporates the minimum bounding rectangle that encloses both the predicted and ground-truth boxes to quantify the distance between them. When the predicted box overlaps with the ground-truth box, the degree of overlap can be reflected by the area of the minimum enclosing rectangle. When the two boxes do not overlap, the enclosing rectangle still reflects the distance between them, which effectively solves the problem of the gradient being zero for non-overlapping boxes. However, such loss functions lack adaptability to different detectors and detection tasks in practice, resulting in poor generalization and slow convergence, which ultimately affects the accuracy and speed of the final detection.
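A minimal sketch of the GIoU loss for axis-aligned boxes is given here to show why the enclosing rectangle keeps the loss informative when boxes do not overlap. The `(x1, y1, x2, y2)` box format and the function names are assumptions chosen for illustration.

```python
def box_area(box):
    """Area of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def giou_loss(pred, gt):
    """Return 1 - GIoU for two axis-aligned boxes."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    union = box_area(pred) + box_area(gt) - inter
    iou = inter / union
    # Smallest rectangle C that encloses both boxes.
    enclose = (min(pred[0], gt[0]), min(pred[1], gt[1]),
               max(pred[2], gt[2]), max(pred[3], gt[3]))
    c_area = box_area(enclose)
    giou = iou - (c_area - union) / c_area
    return 1.0 - giou


overlapping = giou_loss((0, 0, 2, 2), (1, 1, 3, 3))
near_miss = giou_loss((0, 0, 1, 1), (2, 0, 3, 1))
far_miss = giou_loss((0, 0, 1, 1), (4, 0, 5, 1))
```

Both non-overlapping pairs have an IoU of zero, yet the farther pair receives the larger loss, so the regression still has a direction in which to move the predicted box.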
To compensate for the poor generalisation and slow convergence limitations in existing IoU functions, this research introduces Inner-IoU [
33] to improve the loss function, as illustrated in
Figure 12.
Inner-IoU incorporates a scale factor to modulate auxiliary bounding box dimensions, optimizing regression constraints. The scale factor is the ratio of the size of the auxiliary bounding box to the ground truth bounding box. When the scale factor exceeds 1, the auxiliary bounding box expands beyond the actual bounding box, capturing more extensive contextual information. This augmented feature representation is conducive to enhancing the localization precision of small targets. Conversely, when the scale factor is less than 1, the auxiliary bounding box is constrained to the core feature regions of the object. This refinement facilitates more precise localization for large-scale targets by prioritizing high-confidence interior pixels. Therefore, the size of the scale factor can be adjusted according to the size of the IoU value in the actual situation, which can achieve the effect of accelerating the convergence speed or expanding the regression effect.
In tomato target detection, larger immature fruits are easily confused with the green branches and leaves in the background, which lowers the recall rate of the model. Therefore, in this experiment, the auxiliary bounding box is reduced by setting the scale factor to a value less than 1 to improve the regression of the model.
By adaptively scaling auxiliary bounding boxes via the scale factor, this approach overcomes generalization limitations in existing methods and bolsters model robustness against multi-scale variations.
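The auxiliary-box idea can be sketched as below. The sketch assumes that both the predicted and ground-truth boxes are scaled about their own centers by the same ratio, and that the combined loss takes the form L_GIoU + IoU − IoU_inner; this combination rule, the helper names, and the example boxes are assumptions for illustration, not a statement of the exact formulation used in the experiments.

```python
def scale_box(box, ratio):
    """Shrink or enlarge an (x1, y1, x2, y2) box about its center by `ratio`."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * ratio / 2, (y2 - y1) * ratio / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a, b):
    inter = (max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
             * max(0.0, min(a[3], b[3]) - max(a[1], b[1])))
    return inter / (area(a) + area(b) - inter)


def giou(a, b):
    inter = (max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
             * max(0.0, min(a[3], b[3]) - max(a[1], b[1])))
    union = area(a) + area(b) - inter
    enclose = (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    return inter / union - (area(enclose) - union) / area(enclose)


def inner_giou_loss(pred, gt, ratio=0.9):
    """Assumed form: GIoU loss plus the gap between plain IoU and auxiliary-box IoU."""
    inner_iou = iou(scale_box(pred, ratio), scale_box(gt, ratio))
    return (1.0 - giou(pred, gt)) + iou(pred, gt) - inner_iou


pred, gt = (0, 0, 2, 2), (1, 1, 3, 3)
loss_ratio_09 = inner_giou_loss(pred, gt, ratio=0.9)
loss_ratio_10 = inner_giou_loss(pred, gt, ratio=1.0)
```

With a ratio below 1 the auxiliary boxes overlap less than the original boxes for the same misalignment, so the loss on a partially aligned pair is larger, which is one way to read the accelerated convergence described above. With a ratio of 1 the expression reduces to the plain GIoU loss.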
4. Experimental Results and Analysis
4.1. Experimental Setups
The models were trained on a machine with an Intel(R) Xeon(R) Gold 6430 processor (Intel, Santa Clara, CA, USA), an NVIDIA GeForce RTX 4090 graphics card (NVIDIA, Santa Clara, CA, USA), and 24 GB of RAM. The software environment was Ubuntu 20.04, and the virtual environment was configured with PyTorch 1.11.0, CUDA 11.3, and Python 3.8. The SGD optimizer was used for model training. The training parameters are summarized in Table 2.
To optimize detection performance, mosaic data augmentation was applied during training, with deactivation in the final 10 epochs to fine-tune model parameters. The graph of the YOLO-ELS model training process is shown in
Figure 13.
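The augmentation schedule described above amounts to a simple epoch test. The sketch below is illustrative only: the function name is hypothetical, epochs are assumed to be 0-indexed, and the total of 300 epochs is a placeholder rather than the value from Table 2.

```python
def mosaic_enabled(epoch, total_epochs, close_last=10):
    """Return True while mosaic augmentation is active (epochs are 0-indexed)."""
    return epoch < total_epochs - close_last


# Placeholder training length (hypothetical; the actual value is given in Table 2).
TOTAL_EPOCHS = 300
schedule = [mosaic_enabled(e, TOTAL_EPOCHS) for e in range(TOTAL_EPOCHS)]
```

Disabling mosaic near the end lets the final epochs train on unmodified images, whose statistics are closer to those seen at inference time.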
The experimental results reveal that the model enters the convergence phase at approximately 200 epochs, whereby the loss function stabilizes within a narrow margin. Upon reaching 250 epochs, both the training and validation curves exhibit asymptotic behavior, converging into near-linear trajectories. This steady state indicates that the model has undergone effective learning and the weights have successfully equilibrated. Furthermore, these trends validate that the configured training hyperparameters are well-suited to the proposed architectural requirements.
4.2. Evaluation Indicators
The evaluation in this study employs five key metrics: Precision, Recall, mAP@50%, F1-score, and GFLOPs. Precision measures the proportion of correctly predicted positive instances among all instances predicted as positive. Recall evaluates the model’s ability to identify all relevant target instances. The mean Average Precision (mAP) at an IoU threshold of 0.5 reflects the overall detection accuracy across all categories under varying confidence thresholds. The F1-score provides a balanced measure by combining both Precision and Recall into a single metric using their harmonic mean. The formula for each metric is as follows:

$$Precision = \frac{TP}{TP + FP}$$

$$Recall = \frac{TP}{TP + FN}$$

$$AP = \int_{0}^{1} P(R)\,dR$$

$$mAP = \frac{1}{n}\sum_{i=1}^{n} AP_i$$

$$F1 = \frac{2 \times Precision \times Recall}{Precision + Recall}$$

where TP (True Positive) is the number of positive samples identified correctly, FP (False Positive) is the number of negative samples identified as positive samples, FN (False Negative) is the number of positive samples misclassified as negative, and n is the number of categories.
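The count-based metrics can be computed directly, as in the short sketch below. The detection counts and per-class AP values are hypothetical and serve only to exercise the definitions; mAP is taken here as the plain mean of per-class AP values.

```python
def precision(tp, fp):
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def mean_average_precision(ap_per_class):
    """Mean of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)


# Hypothetical detection counts and per-class AP values (not from the paper).
tp, fp, fn = 90, 10, 30
p = precision(tp, fp)
r = recall(tp, fn)
f1 = f1_score(p, r)
m_ap = mean_average_precision([0.92, 0.85, 0.88])
```

In this example the precision is higher than the recall, and the harmonic mean pulls the F1-score toward the lower of the two, which is why F1 penalizes a model that trades many missed detections for few false positives.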
GFLOPs measures the computational cost per inference and is directly linked to inference latency. A lower GFLOPs value enables low-power NPUs to achieve higher frame rates. Parameters refers to the total number of trainable elements, while Model Size indicates the disk space occupancy. These two metrics dictate peak memory usage and storage requirements, respectively. For edge devices with constrained resources, these metrics are essential indicators for evaluating model lightweighting.
4.3. Experimental Results and Analysis
4.3.1. Comparison of Different Loss Functions
The loss function quantifies the discrepancy between predictions and ground truth, playing a critical role in guiding the optimization. To validate the effectiveness of the proposed Inner-GIoU loss function, this study compares it with the original CIoU loss function and several other loss functions of different types. The experimental results are summarized in
Table 3.
As shown in the table, among all loss function variants, the baseline CIoU achieves the highest precision. However, its recall rate is only 80.6%, suggesting a tendency toward conservative predictions that lead to missed detections. When the loss function is replaced with SlideLoss, the model’s recall increases to 82.7%, suggesting that the introduction of dynamic thresholds alleviates the class imbalance issue in the original dataset. Nevertheless, its precision decreases significantly, failing to meet detection requirements. ShapeIoU effectively captures geometric features, reaching the highest mAP@50% and recall, yet its sharp decline in precision undermines its practical reliability. Compared to the original loss, both Inner-CIoU and Inner-PIoU show some improvement in recall, but at the cost of reduced precision. In contrast, Inner-GIoU demonstrates the most balanced performance. By leveraging the auxiliary bounding box mechanism, it achieves a substantial 3.3% gain in recall with a marginal 0.3% decrease in precision. This synergistic improvement indicates that Inner-GIoU better regularizes the regression process, providing the most robust detection capability for complex greenhouse scenarios.
4.3.2. Ablation Experimental Results and Analysis
To systematically evaluate the contribution of each proposed module within the integrated framework, nine ablation configurations were designed and executed. These tests were conducted under identical experimental conditions and utilized the same dataset to ensure consistency and comparability across all evaluations. The experimental findings are displayed in Table 4.
The experimental results indicate that the baseline YOLOv8n model achieves a Precision of 86.5%, a mean Average Precision of 88.5%, and an F1-score of 83.7% on the experimental dataset. These initial metrics serve as the reference benchmark for evaluating the performance gains introduced by the proposed modular enhancements.
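Since the F1-score is the harmonic mean of precision and recall, the recall implied by the reported baseline precision and F1-score can be recovered by rearranging that definition. The short check below uses only the two baseline figures quoted above; the function name is hypothetical, and the result is approximate because the inputs are rounded to one decimal place.

```python
def recall_from_precision_and_f1(p, f1):
    """Solve F1 = 2PR / (P + R) for R."""
    return f1 * p / (2 * p - f1)


# Baseline YOLOv8n figures quoted in the text, in percent.
baseline_precision = 86.5
baseline_f1 = 83.7
implied_recall = recall_from_precision_and_f1(baseline_precision, baseline_f1)
```

The implied value is close to 81%, which agrees, to within rounding, with the baseline recall referred to in the discussion of the combined modules.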
Figure 14a displays the detection results. As the graphic illustrates, the baseline YOLOv8n exhibits limited feature representation capabilities for small-scale targets and struggles with spatial occlusion caused by branches throughout the detection phase. This leads to frequent omissions and false detections, which consequently undermines the overall recognition accuracy and robustness of the model in complex greenhouse environments.
By integrating the SPPF-LSKA module, the model’s sensitivity to small-scale targets was significantly enhanced, with P, mAP@50%, and F1-score increasing by 1.7%, 1.1%, and 2.0%, respectively. The incorporation of the LSKA mechanism into the spatial pooling layer has significantly improved the model’s detection capability for multi-scale targets. Although this enhancement introduces additional computational cost, it effectively expands the receptive field. This expansion improves robustness to scale variation and markedly reduces the missed detection rate in complex environments.
After integrating the SEAM module into the detection head, the model exhibits a marginal improvement of 0.2% in precision, while achieving a substantial gain in recall. This outcome demonstrates that the SEAM module effectively compensates for feature response degradation in occluded regions by enhancing discriminative cues in visible areas. Consequently, it mitigates occlusion challenges caused by overlapping fruits and foliage, thereby reducing missed detections in cherry tomato identification.
After the addition of the EIEM module, precision and mAP@50% improve by 2.6% and 1.2%, respectively. Notably, this enhancement is accompanied by a marked reduction in both parameter count and GFLOPs. These results indicate that the EIEM module effectively captures contour-specific features, thereby enhancing the model’s ability to distinguish immature tomato targets from complex green leaf backgrounds. By separating the target from background noise, the module mitigates false positives while simultaneously advancing the lightweight design objectives of the network.
Following the sequential integration of the three enhancement modules, the model exhibits improvements in precision, mAP@50%, and F1-score compared to the baseline. However, its recall decreases from 81.0% to 80.6%, indicating an increased tendency to miss detections. To address this problem, the Inner-GIoU loss function is introduced. Through multiple sets of comparative experiments, the ratio factor of Inner-GIoU is set to 0.9, which enhances the localization accuracy for large targets. This configuration achieves substantially improved recall and overall detection performance while maintaining comparable precision levels. The final detection results are shown in
Figure 14e.
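To make the role of the ratio factor concrete, the following is a minimal sketch of an Inner-GIoU loss for a single box pair. It assumes the commonly cited Inner-IoU formulation, in which both boxes are scaled about their centers by the ratio and the loss is L_GIoU + IoU − IoU_inner. The (x1, y1, x2, y2) box format, the function names, and the `eps` constant are assumptions for illustration and do not reproduce the authors' training code.

```python
def _inter_union(a, b):
    """Intersection and union areas for boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter, area_a + area_b - inter


def scale_box(box, ratio):
    """Auxiliary box: same center, width and height multiplied by ratio."""
    cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
    hw, hh = (box[2] - box[0]) * ratio / 2, (box[3] - box[1]) * ratio / 2
    return (cx - hw, cy - hh, cx + hw, cy + hh)


def inner_giou_loss(pred, gt, ratio=0.9, eps=1e-9):
    """Assumed form: L_GIoU + IoU - IoU_inner."""
    inter, union = _inter_union(pred, gt)
    iou = inter / (union + eps)

    # Smallest enclosing box, used by the GIoU penalty term.
    cw = max(pred[2], gt[2]) - min(pred[0], gt[0])
    ch = max(pred[3], gt[3]) - min(pred[1], gt[1])
    c_area = cw * ch
    giou = iou - (c_area - union) / (c_area + eps)

    # IoU of the auxiliary (scaled) boxes.
    in_inter, in_union = _inter_union(scale_box(pred, ratio),
                                      scale_box(gt, ratio))
    inner_iou = in_inter / (in_union + eps)

    return (1.0 - giou) + iou - inner_iou


pred_box = (0.0, 0.0, 10.0, 10.0)
gt_box = (5.0, 0.0, 15.0, 10.0)
print(inner_giou_loss(pred_box, gt_box, ratio=0.9))
print(inner_giou_loss(pred_box, gt_box, ratio=1.0))  # reduces to plain GIoU loss
```

With a ratio of 1.0 the auxiliary boxes coincide with the original boxes and the expression reduces to the ordinary GIoU loss. A ratio below 1.0 shrinks the auxiliary boxes, so a partially overlapping pair is penalized more strongly.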
The proposed improvements yield significant performance gains, with the enhanced model achieving increases of 6.2% in precision, 2.9% in recall, and 3.5% in mAP@50%. Concurrently, the computational complexity is substantially reduced, meeting the requirements for deployment on edge devices. In summary, YOLO-ELS demonstrates marked improvement in detecting cherry tomatoes of varying sizes and maturity levels under complex greenhouse conditions.
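The gains quoted above are absolute differences in percentage points between the baseline and the final model. The short sketch below reproduces them from the values stated in this section and also prints the corresponding relative changes. The relative figures are derived here for illustration and are not reported in the paper; the dictionary and function names are likewise illustrative.

```python
# Values quoted in this section, in percent.
BASELINE = {"precision": 86.5, "recall": 81.0, "mAP@50": 88.5}
YOLO_ELS = {"precision": 92.7, "recall": 83.9, "mAP@50": 92.0}


def gains(before, after):
    """Return {metric: (absolute gain in points, relative gain in percent)}."""
    result = {}
    for name, old in before.items():
        delta = after[name] - old
        result[name] = (round(delta, 1), round(100.0 * delta / old, 1))
    return result


for metric, (points, relative) in gains(BASELINE, YOLO_ELS).items():
    print(f"{metric}: +{points} points ({relative}% relative)")
```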
4.4. Comparative Analysis of Test Results of Different Network Models
To systematically evaluate the superiority of the proposed model in tomato ripeness identification, seven mainstream architectures were selected for comparative analysis. These include single-stage detectors DETR [
34], SSD [
35], and TOOD [
36], the two-stage model Faster R-CNN [
37], and other versions of YOLO such as YOLOv5s, YOLOv8n, and YOLO11n. The experimental results are shown in
Table 5.
The table shows that the two-stage model Faster R-CNN has higher average detection accuracy and recall than the non-YOLO one-stage models. However, due to the inherent architectural complexity of two-stage models, the algorithm imposes substantial computational overhead, with floating-point operations reaching 138 GFLOPs. Such resource-intensive requirements fail to satisfy the real-time deployment constraints and hardware limitations of edge terminals. Among the one-stage algorithms, the YOLO series has clear advantages in terms of parameter count, model size, and computing power requirements. Furthermore, successive versions of YOLO consistently outperform the other single-stage detectors in both mean Average Precision and recall.
Among various YOLO iterations, YOLOv8n offers a superior balance between performance and efficiency for edge deployment. Although it exhibits a slight accuracy trade-off compared to YOLOv5s, its model size and computational demands are reduced by approximately 50%. Notably, although the latest lightweight model, YOLO11n, demonstrates high efficiency, its recall rate remains inadequate at only 0.748. Such a high miss rate poses a significant bottleneck for automated fruit harvesting, where missing targets directly reduces operational efficiency. In contrast, YOLOv8n serves as a more robust and adaptable baseline for optimization in these practical scenarios. However, as the final refined model, YOLO-ELS achieves a significant boost in overall performance. Specifically, compared to the reference YOLO11n, its precision, recall, and mAP@50% are increased by 1.7%, 9.1%, and 3.2%, respectively. Although YOLO-ELS incurs a marginal increase in model size and computational complexity relative to YOLO11n, this increment is negligible and does not compromise its suitability for edge deployment.
Therefore, based on the experimental results of each model, the improved YOLO-ELS model achieves the best detection performance, with precision, recall, and mAP@50% reaching 0.927, 0.839, and 0.920, respectively. Moreover, its parameter count and computational cost are low, which meets the deployment requirements of terminal equipment. These results show that the model satisfies the requirements of this study and performs well in tomato maturity detection tasks in a greenhouse setting.
4.5. Model Visualisation
Grad-CAM++ represents an enhanced visualization technique built upon the Grad-CAM framework. It produces class activation maps by emphasizing positive gradient contributions from the final convolutional layer toward specific class scores, thereby more accurately highlighting regions critical for target class identification [
38]. To compare the attention of the models before and after the improvement, heat maps were generated to visualize which image regions each detection model attends to when recognizing targets. The visualization is shown in
Figure 15.
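The weighting scheme described above can be illustrated with a small, dependency-free sketch. It assumes that the activations and gradients of the final convolutional layer have already been extracted for one class score, and it uses the common closed-form simplification of Grad-CAM++ in which the higher-order derivatives are written as powers of the first-order gradient. The nested-list data layout and the function name `grad_cam_pp` are assumptions for illustration; this is not the visualization code used in the study.

```python
def grad_cam_pp(activations, gradients, eps=1e-7):
    """Grad-CAM++ style map from K feature maps and their gradients.

    activations, gradients: lists of K maps, each an H x W nested list.
    Returns an H x W nested list scaled to [0, 1] when the map is non-zero.
    """
    height, width = len(activations[0]), len(activations[0][0])
    cam = [[0.0] * width for _ in range(height)]

    for act, grad in zip(activations, gradients):
        act_sum = sum(sum(row) for row in act)
        weight = 0.0
        for i in range(height):
            for j in range(width):
                g = grad[i][j]
                if g <= 0.0:
                    continue  # only positive gradients contribute
                # Pixel-wise coefficient alpha, then weight += alpha * relu(g).
                alpha = g ** 2 / (2 * g ** 2 + act_sum * g ** 3 + eps)
                weight += alpha * g
        for i in range(height):
            for j in range(width):
                cam[i][j] += weight * act[i][j]

    # Keep only positive evidence for the class, then normalize.
    cam = [[max(0.0, value) for value in row] for row in cam]
    peak = max(max(row) for row in cam)
    if peak > 0.0:
        cam = [[value / peak for value in row] for row in cam]
    return cam


# Toy example: one 2 x 2 feature map with a single active location.
toy_act = [[[1.0, 0.0], [0.0, 0.0]]]
toy_grad = [[[1.0, 0.0], [0.0, 0.0]]]
print(grad_cam_pp(toy_act, toy_grad))
```

In practice the resulting map is upsampled to the input resolution and overlaid on the image to produce heat maps like those discussed in this section.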
The heatmap illustrates that for the YOLOv8n model, the attention hotspot map is more scattered and does not concentrate on the tomato’s primary characteristics. Instead, it primarily highlights the exterior contour area features. This limitation is more pronounced in immature tomatoes, suggesting that the baseline struggles to extract critical internal information. In contrast, the improved YOLO-ELS exhibits a more concentrated focus, with activation regions densely covering the core representative areas of the target. This localized concentration indicates that the model can more effectively integrate diverse feature information. In summary, the enhanced architecture demonstrates superior discriminative power for cherry tomatoes, capturing richer feature representations and confirming the efficacy of the proposed modifications.
4.6. Performance Benchmarking Results and Analysis
To further validate the generalization and robustness of the proposed YOLO-ELS, we conducted additional benchmarking experiments on the publicly available “2022 Dataset of String Tomato in Shanxi Nonggu Tomato Town” [
39]. The dataset contains 3665 images of cluster tomatoes at varying maturity stages. It was randomly split into training, validation, and test sets in an 8:1:1 ratio. To evaluate the effectiveness of YOLO-ELS on the public benchmark, we compared its performance against MTS-YOLO [
40] and several representative YOLO-series models using the same dataset.
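The random 8:1:1 split described above can be sketched as follows. The seed, the function name `split_8_1_1`, and the rounding convention (integer division, with the remainder assigned to the test set) are assumptions; the paper does not state the exact subset sizes.

```python
import random


def split_8_1_1(n_items: int, seed: int = 0):
    """Shuffle item indices and split them into train/val/test at 8:1:1."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_train = n_items * 8 // 10
    n_val = n_items // 10
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train_idx, val_idx, test_idx = split_8_1_1(3665)
print(len(train_idx), len(val_idx), len(test_idx))
```

Under these assumptions the 3665 images are divided into 2932 training, 366 validation, and 367 test images.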
As summarized in
Table 6, YOLO-ELS maintains stable detection performance despite changes in cultivar and growth conditions. Specifically, YOLO-ELS achieved a precision of 92.4%, outperforming YOLOv8n and YOLOv10n by 4.8% and 8.8%, respectively. Structurally, the 6.9 GFLOPs and 2.93 M parameters of YOLO-ELS demonstrate that the proposed optimizations effectively condense the model without sacrificing spatial robustness. While the MTS-YOLO model exhibited a slightly higher recall, the marginal F1-score difference indicates that YOLO-ELS provides comparable overall performance with enhanced reliability.
In conclusion, YOLO-ELS maintains consistent detection efficacy across datasets with diverse growth conditions and varieties. The successful adaptation from single-fruit to string-fruit tasks validates that the architectural optimizations enhance feature representation and spatial robustness. These results confirm the model’s viability as a generalized solution for precision agriculture in complex greenhouse environments.
4.7. Application and Edge Deployment Performance Testing of the YOLO-ELS
To validate the deployment suitability of the improved YOLO-ELS model in real-world scenarios, this study selected the NVIDIA Jetson Orin Nano Super, a commonly used embedded hardware platform in this field, as the testbed for deployment. The software environment was set up with Ubuntu 22.04 LTS and JetPack 6.0, with the deployment status on the Jetson Orin Nano platform shown in
Figure 16.
During the test, the input resolution was fixed at 640 × 640 pixels, consistent with the model training configuration. To simulate continuous inference loads in real-world scenarios, a consecutive image stream constructed from the test set was used as input, while hardware performance parameters were monitored in real time using the tegrastats command. The detection execution process is illustrated in
Figure 17.
Experimental results demonstrate that the YOLO-ELS model achieves real-time processing and good energy efficiency on the edge device. In terms of inference speed, the model requires an average inference time of only 25.2 ms per image and achieves an overall detection frame rate of 28.2 FPS, meeting the real-time requirements of agricultural harvesting robots under typical conditions. Meanwhile, during sustained operation, the overall power consumption of the platform remains stable at 6.5 W.
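The relationship between the reported latency, frame rate, and power can be checked with simple arithmetic. The sketch below assumes that the 25.2 ms figure covers model inference only and that the 28.2 FPS figure is an end-to-end rate, so that the difference reflects pre-processing, post-processing, and I/O. The paper does not state this breakdown explicitly, and the function and key names are illustrative.

```python
def frame_budget(inference_ms: float, fps: float, power_w: float) -> dict:
    """Derive per-frame time and energy figures from reported measurements."""
    frame_ms = 1000.0 / fps  # end-to-end time per frame
    return {
        "frame_ms": frame_ms,
        "non_inference_ms": frame_ms - inference_ms,  # assumed pre/post-processing
        "inference_only_fps": 1000.0 / inference_ms,
        "joules_per_frame": power_w / fps,  # W = J/s, so J per frame
    }


budget = frame_budget(inference_ms=25.2, fps=28.2, power_w=6.5)
for key, value in budget.items():
    print(f"{key}: {value:.2f}")
```

Under these assumptions each frame takes about 35.5 ms end to end, of which roughly 10 ms lies outside model inference, and each processed frame costs about 0.23 J.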
To validate the algorithm’s recognition capability in real-world scenarios, this study utilized an external Orbbec Gemini Pro depth camera (Orbbec, Shenzhen, China). The camera captured a video stream at a resolution of 640 × 480 and a frame rate of 30 fps for real-time detection of cherry tomato plants. The detection results are shown in
Figure 18.
Based on the combined experimental results, the YOLO-ELS cherry tomato ripeness recognition algorithm proposed in this study can meet deployment requirements on computationally constrained embedded devices. This confirms the practical deployability of the model and provides an effective solution for real-time detection of cherry tomato ripeness in greenhouse environments.
5. Conclusions and Future Work
To address the problem of missed and false detections in greenhouse environments caused by lighting changes and occlusion from overlapping fruit, this study proposed the YOLO-ELS model. First, an LSKA layer was inserted into the pooling module of the baseline model. This layer processes the output from the original concatenation layer, thereby expanding the model’s receptive field and enhancing its ability to detect small targets. Second, the EIEM module was introduced to replace the bottleneck block in the C2f structure of the backbone network. This replacement enhances the model’s ability to learn shape features of the targets while reducing interference from redundant background information, thereby lowering the model’s computational complexity. Furthermore, a SEAM detection layer was incorporated into the detection head to strengthen feature recognition for occluded fruits, while the Inner-GIoU loss function was employed to optimize bounding box regression, enhancing the model’s convergence capability across multiple scales.
The enhanced YOLO-ELS model achieves a precision, recall, and mAP@50% of 92.7%, 83.9%, and 92.0%, respectively, which are 6.2%, 2.9%, and 3.5% higher than those of the original model. The model occupies only 5.91 MB of storage and requires 6.9 GFLOPs of computation. In practical deployment, the model achieved a detection speed of 28.2 FPS on the Jetson Orin Nano Super, with average power consumption maintained at 6.5 W. These results indicate that the improved model balances accuracy, speed, and energy efficiency well, making it suitable for real-time maturity detection of cherry tomatoes in complex greenhouse environments.
To evaluate its performance, YOLO-ELS was tested under consistent conditions against other common one-stage and two-stage algorithms. On the established cherry tomato dataset, the improved model achieved the highest detection precision and recall, demonstrating superior overall detection performance. The heat map results also show that the improved model attends to more comprehensive feature information and covers more of the target area, indicating stronger target recognition ability. To further validate the proposed architectural improvements, experiments were also conducted on an additional public dataset. The results indicate that the enhanced YOLO-ELS algorithm maintains stable detection performance across different data distributions, outperforming the baseline algorithms. This suggests its adaptability for fruit maturity recognition tasks in diverse production scenarios and across different tomato types.
In the future, we will expand the dataset to include a greater variety of cultivars and a wider range of growing conditions. This expansion aims to enhance the model’s generalization capability and robustness across diverse cultivation environments. By increasing the raw data diversity, we expect to further stabilize performance and mitigate potential minor precision drops that may arise from sample size limitations. In addition, future work will also explore the practical applicability of the algorithm in real picking scenarios. The model’s detection performance will be evaluated and refined based on experimental outcomes to meet actual operational needs.