YOLO-ELS: A Lightweight Cherry Tomato Maturity Detection Algorithm

Tong, Zhimin; Zhou, Yu; Li, Changhao; Cai, Changqing; Rong, Lihong

doi:10.3390/app16021043


by Zhimin Tong ¹, Yu Zhou ¹, Changhao Li ¹, Changqing Cai ²,* and Lihong Rong ¹,*

¹ College of Mechanical and Electrical Engineering, Qingdao Agricultural University, Qingdao 266109, China

² College of Electrical and Information Engineering, Changchun Institute of Technology, Changchun 130012, China

* Authors to whom correspondence should be addressed.

Appl. Sci. 2026, 16(2), 1043; https://doi.org/10.3390/app16021043

Submission received: 15 December 2025 / Revised: 13 January 2026 / Accepted: 17 January 2026 / Published: 20 January 2026

(This article belongs to the Section Agricultural Science and Technology)


Abstract

Within the domain of intelligent picking robotics, fruit recognition and positioning are essential. Challenging conditions such as varying light, occlusion, and limited edge-computing power compromise fruit maturity detection. To tackle these issues, this paper proposes a lightweight algorithm YOLO-ELS based on YOLOv8n. Specifically, we reconstruct the backbone by replacing the bottlenecks in the C2f structure with Edge-Information-Enhanced Modules (EIEM) to prioritize morphological cues and filter background redundancy. Furthermore, a Large Separable Kernel Attention (LSKA) mechanism is integrated into the SPPF layer to expand the effective receptive field for multi-scale targets. To mitigate occlusion-induced errors, a Spatially Enhanced Attention Module (SEAM) is incorporated into the decoupled detection head to enhance feature responses in obscured regions. Finally, the Inner-GIoU loss is adopted to refine bounding box regression and accelerate convergence. Experimental results demonstrate that compared to the YOLOv8n baseline, the proposed YOLO-ELS achieves a 14.8% reduction in GFLOPs and a 2.3% decrease in parameters, while attaining a precision, recall, and mAP@50% of 92.7%, 83.9%, and 92.0%, respectively. When compared with mainstream models such as DETR, Faster-RCNN, SSD, TOOD, YOLOv5s, and YOLO11n, the mAP@50% is improved by 7.0%, 4.7%, 11.4%, 8.6%, 3.1%, and 3.2%. Deployment tests on the NVIDIA Jetson Orin Nano Super edge platform yield an inference latency of 25.2 ms and a detection speed of 28.2 FPS, successfully meeting the real-time operational requirements of automated harvesting systems. These findings confirm that YOLO-ELS effectively balances high detection accuracy with lightweight architecture, providing a robust technical foundation for intelligent fruit picking in resource-constrained greenhouse environments.

Keywords:

lightweight; cherry tomatoes; YOLOv8; maturity classification; target detection

1. Introduction

Cherry tomato is widely cultivated on a global scale. It is highly valued for its distinct flavor and superior nutritional content, boasting high levels of vitamin C, potassium, and lycopene [1]. Maturity stage identification is a decisive factor in autonomous harvesting, as the timing of intervention fundamentally dictates postharvest physiological quality and subsequent commercial value. Cherry tomatoes harvested too early often exhibit stunted development, hard skin, and high tomatine content, while late harvesting can lead to overripe fruit and a shortened shelf life, both of which result in reduced quality and economic losses [2,3]. The current harvesting of greenhouse cherry tomatoes remains predominantly manual. This practice, however, is plagued by the inherent challenges of the complex greenhouse environment, which leads to low efficiency, high labor intensity, substantial costs, and adverse working conditions [4]. Furthermore, identification of tomato maturity is significantly influenced by the subjective judgment of harvesters, making it difficult to establish a unified classification standard. This limitation hinders the standardization and efficiency improvement of the current tomato cultivation industry [5,6]. Although picking robots show great potential for application in agriculture, existing machines suffer from low recognition accuracy and fail to meet practical picking needs [7]. Thus, developing an accurate and efficient tomato maturity detection algorithm is essential.

A computer vision system is the core perception framework for robotic harvesting, enabling precise fruit identification, localization, and maturity assessment. Its performance directly governs the success rate of the picking operation [8]. Conventional approaches to fruit maturity classification typically begin with the image acquisition of tomatoes during cultivation. Digital image processing techniques are then applied to a single external feature of the fruit in the image to recognize the fruit target [9,10]. Laykin et al. [11] employed HSI conversion and threshold segmentation to analyze chromatic and morphological parameters for fruit quality assessment. Khoshroo et al. [12] implemented a region-growth segmentation algorithm to differentiate maturity levels, achieving an accuracy of 82.38% through the integration of a watershed transformation. Si et al. [13] proposed a color difference ratio algorithm for apple recognition based on red-green differences, with an accuracy of 89.5%. Chen and Ding [14] distinguished ripe from semi-ripe tomatoes utilizing infrared spectroscopy and color analysis, achieving a classification accuracy exceeding 94.8%. Liu et al. [15] achieved a 94.41% detection accuracy by training a Support Vector Machine (SVM) classifier with Histograms of Oriented Gradients (HOG) features, utilizing Non-Maximum Suppression (NMS) to refine detection results.

Although digital image processing methods can achieve fruit ripeness detection to a certain extent, they exhibit poor robustness against environmental interference during target recognition and struggle to handle complex detection environments. While the SVM algorithm demonstrates high detection accuracy, its inherent computational complexity and low inference efficiency hinder its actual deployment on resource-constrained edge devices [16]. In practical greenhouse settings, dense planting and fluctuating illumination pose significant challenges, often leading to fruit occlusion and reduced recognition precision. Fueled by innovations in computer technology, deep learning techniques have been used extensively in several distinct areas, such as perch individual motion feature extraction [17], fishing boat sailing centerline detection [18], and abnormal pine tree detection [19]. These applications show great potential and make fruit ripeness recognition possible in complex environments [20].

Yan et al. [21] proposed a picking point localization technique that combines deep threshold segmentation and Mask R-CNN, achieving an 87.3% success rate in fruit stem localization. Quach et al. [22] developed a tomato identification model based on MobileNet. By combining this model with the YOLOv8 detection algorithm, they achieved a 96.69% detection accuracy in complex environments. Leveraging the Inception V2 network with the Single Shot MultiBox Detector (SSD), Yuan et al. [23] reached a detection accuracy of 98.85% for cherry tomatoes within greenhouse settings. Guan et al. [24] employed YOLOv5 to identify the positional relationship between the tomato pedicel and fruit, achieving a processing speed of 104 ms per frame, which satisfies the operational criteria for real-time robotic picking. Solimani et al. [25] presented an SE module based on the YOLOv8 model, which enhanced the model’s capacity to identify targets of various sizes in intricate settings. Gao et al. [26] added coordinate attention (CA) to the model’s backbone network, which enhanced the algorithm’s capacity to identify maturity features and raised the average detection accuracy by 1.3% over the initial model.

Despite the high detection speed and accuracy of neural network models in maturity recognition, their robustness remains limited under environmental stressors such as fluctuating illumination and physical occlusion. Furthermore, automated harvesting requires strict real-time synchronization between the vision system and the robotic arm. Relying on server-side deployment often introduces unpredictable latencies caused by network instability and signal attenuation in complex greenhouse settings, thereby undermining harvesting success rates. Consequently, edge deployment is essential to ensure deterministic latency, which in turn necessitates a lightweight model architecture.

For those reasons, this paper presents YOLO-ELS, a lightweight algorithm for detecting cherry tomato maturity in greenhouses. It is designed to enhance recognition accuracy while maintaining a low computational footprint. The main contributions of this paper are as follows:

A dataset of cherry tomatoes in a greenhouse environment is collected, labeled and classified by maturity to meet multi-maturity classification tasks;
A lightweight cherry tomato maturity recognition algorithm YOLO-ELS is proposed for complex environments. The proposed model demonstrates significantly enhanced capability in identifying and classifying fruits of different maturity levels under challenging conditions including branch occlusion and varying illumination;
Ablation and comparative experiments were conducted to evaluate the contribution of each improved module, establish a theoretical foundation for the reliability of the enhancement strategies, and empirically demonstrate the efficacy of the proposed algorithm.

2. Materials and Methods

2.1. Image Acquisition

All the cherry tomato images in this paper were collected from a cherry tomato greenhouse plantation in Luozhuang Village, Gucheng Street, Shouguang City, Shandong Province, China (118.78° E, 36.91° N). The images were captured with a mobile phone in October 2024 under natural daylight conditions. The shooting distance was set to 10–50 cm, and the images were saved in JPG format with a resolution of 2592 × 2592. Images were collected under a variety of lighting situations, including sunny, cloudy, frontlit, and backlit conditions. To guarantee the variety of image samples, images were captured from the left, right, and front of the tomato plants, covering single-target, multi-target, frontlighting, backlighting, occlusion, shading, and overlapping scenarios. Some samples from the collected dataset are shown in Figure 1.

2.2. Dataset Partitioning

In order to establish a uniform standard for maturity identification, a classification of tomato maturity needs to be made. According to the Chinese national standard GH/T 1193–2021, tomato maturity was divided into six different maturity grades. However, the algorithm developed in this study is primarily applied during the fruit harvesting stage, where grading is required based on the fruit’s appearance and subsequent storage period. To align with practical requirements and industry standards, under the consultation of industry professionals, the maturity levels were consolidated into three categories based on transportable duration after picking: immature, color-turning, and mature. These categories were assigned the labels “unripe”, “half”, and “ripe”. Meanwhile, to ensure the objectivity of our dataset and address potential visual ambiguity, we employed a three-person team to handle the annotation process. Any ambiguous samples were carefully discussed and reconciled to reach a consensus, thereby reducing individual subjectivity. The classification criteria are as detailed in Table 1.

To ensure dataset quality, images with excessive blur, overexposure, or dominant background interference were excluded through rigorous data cleaning. This process yielded a curated dataset of 724 images. The images were annotated using LabelImg 1.8.6 according to the criteria in Table 1, with bounding boxes tightly enclosing each fruit. To ensure quality, overly small or blurred instances were excluded. The resulting dataset was randomly partitioned into training and validation sets at an 8:2 ratio. All annotations were saved in YOLO format, as illustrated in Figure 2.
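The YOLO label format and the 8:2 random split described above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the class order in `CLASSES`, the helper names `to_yolo` and `split_dataset`, the file names, the example box, and the random seed are all assumptions made for the example.

```python
import random

# Hypothetical class order; the paper names the labels but not their indices.
CLASSES = ["unripe", "half", "ripe"]


def to_yolo(box, img_w, img_h, class_id):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line.

    YOLO format stores: class, x_center, y_center, width, height,
    with all four geometric values normalized to [0, 1].
    """
    x_min, y_min, x_max, y_max = box
    xc = (x_min + x_max) / 2 / img_w
    yc = (y_min + y_max) / 2 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{class_id} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}"


def split_dataset(names, train_ratio=0.8, seed=0):
    """Shuffle file names reproducibly and split them into train/validation."""
    names = sorted(names)
    random.Random(seed).shuffle(names)
    n_train = round(len(names) * train_ratio)
    return names[:n_train], names[n_train:]


# One illustrative box on a 2592 x 2592 image.
label = to_yolo((648, 1296, 1296, 1944), 2592, 2592, CLASSES.index("ripe"))
print(label)  # 2 0.375000 0.625000 0.250000 0.250000

# 724 curated images split at 8:2.
train, val = split_dataset([f"img_{i:04d}.jpg" for i in range(724)])
print(len(train), len(val))  # 579 145
```

Fixing the seed keeps the partition reproducible, which matters here because augmentation is applied to the training set only.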

After partitioning the dataset, data augmentation was applied exclusively to the training set images to expand the sample size and enhance model convergence. Using the image augmentation library Albumentations 1.4.11, offline augmentation techniques were applied to the images, including translation, rotation, cropping, mirroring, and noise addition. The augmented images are shown in Figure 3. After augmentation, 1890 training images were obtained, covering a total of 11,691 tomato instances of various maturity levels.
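Geometric augmentations must transform the bounding-box labels together with the pixels. The sketch below shows this for horizontal mirroring only, in plain Python rather than Albumentations; the function name `hflip_with_labels` and the toy image are hypothetical and serve only to illustrate the idea.

```python
def hflip_with_labels(image, labels):
    """Mirror an image left-to-right and update its YOLO labels.

    image:  list of pixel rows (any pixel type).
    labels: list of (class_id, x_center, y_center, width, height),
            with coordinates normalized to [0, 1].
    Only x_center changes under a horizontal flip: x -> 1 - x.
    """
    flipped = [row[::-1] for row in image]
    new_labels = [(c, 1.0 - xc, yc, w, h) for (c, xc, yc, w, h) in labels]
    return flipped, new_labels


image = [[1, 2, 3],
         [4, 5, 6]]
labels = [(0, 0.25, 0.5, 0.2, 0.4)]

flipped, new_labels = hflip_with_labels(image, labels)
print(flipped)     # [[3, 2, 1], [6, 5, 4]]
print(new_labels)  # [(0, 0.75, 0.5, 0.2, 0.4)]
```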

3. Detection Algorithm of Tomato Maturity

In greenhouse cultivation of cherry tomatoes, plants can reach heights of 1.6–1.8 m. To optimize land utilization, high-density planting configurations are typically employed, which inevitably leads to severe fruit occlusion. Such environments result in heterogeneous light distribution across the canopy, further exacerbated by leaf occlusion. Even within a single plant, fruit maturity varies considerably due to mutual shading among fruits and between fruits and leaves. Therefore, to address the issues with the current algorithms and realize the requirements of deploying to edge devices, this paper improves the YOLOv8n algorithm.

3.1. Baseline Model Selection

As a single-stage detection algorithm, the YOLO series streamlines object detection by treating it as an end-to-end regression problem. By processing images through a single convolutional neural network in one forward pass, the algorithm directly generates bounding box coordinates and class probabilities. This architectural efficiency ensures high detection speeds. Based on its predecessor’s architecture, YOLOv8 introduces systematic optimizations to the backbone, neck, and head modules. These optimizations significantly enhance the model’s feature extraction and object detection performance, demonstrating strong potential for various recognition tasks. Also, as a representative model, YOLOv8 has proven its reliability through extensive practical use, making it a suitable baseline for interpretable module replacement and ablation experiments. Therefore, YOLOv8 is selected as the baseline model in this study. Its architectural diagram is presented in Figure 4.

In terms of backbone network design, the YOLOv8 backbone evolves the CSPDarknet architecture by substituting the original C3 module with the C2f module. This modification introduces richer branched connections to enhance gradient flow, thereby significantly improving both feature extraction capability and contextual information fusion efficiency, without substantially increasing the model’s parameter count [27]. As for the neck network, YOLOv8 continues to employ the PANet structure, which facilitates the aggregation of features through both top-down and bottom-up pathways. This design enhances the flow of information between the three different-scale feature maps output by the backbone. Furthermore, YOLOv8 simplifies the network architecture by removing two convolutional layers from its upsampling component. This modification reduces the computational burden of the model. In the detection head, YOLOv8 innovatively adopts a decoupled head design. This design completely separates the classification and regression tasks into two independent network branches. Each branch can thus focus on its specific task, leading to higher detection accuracy and faster convergence.

These architectural improvements significantly enhance YOLOv8’s detection performance in handling scale variations and complex backgrounds. As a result, the model maintains high detection accuracy while retaining real-time inference capability. This characteristic makes it well-suited for the task of greenhouse cherry tomato ripeness detection, demonstrating promising application potential [28]. Among the different versions in the YOLOv8 series, the YOLOv8n model has the smallest computational footprint, making it more suitable for deployment on resource-constrained edge detection devices [29]. Therefore, this study selects YOLOv8n as the baseline algorithm.

3.2. Improved YOLO-ELS Network Model Design

Despite its excellence as a lightweight baseline model for object detection, YOLOv8n encounters notable limitations when directly used for cherry tomato maturity detection on greenhouse edge devices.

As a core component of the backbone, the C2f module enhances feature reuse capability through extensive cross-layer connections. However, its structure involves a large number of convolutional and bottleneck layers, which adversely affects the model’s real-time detection performance. Additionally, due to the unique nature of greenhouse cultivation environments, cherry tomato fruits exhibit significant scale variation, severe occlusion, and target overlap. The original pooling layer module cannot adequately adapt to objects of different sizes. Simultaneously, the detection head demonstrates insufficient feature extraction capability for partially visible fruits. This deficiency can easily lead to missed detection of small target fruits and misclassification of fruit maturity, severely compromising the model’s detection performance. To address these issues, we propose YOLO-ELS, an improved model based on the YOLOv8n architecture. The structural design of the improved YOLO-ELS model is illustrated in Figure 5.

The specific modifications of the YOLO-ELS model are as follows:

The bottleneck modules within the original C2f structure are replaced by Edge Information Enhanced Modules (EIEM). By optimizing the feature representation within the C2f blocks, this substitution prioritizes critical morphological cues and filters out redundant background information, significantly increasing the model’s sensitivity to fruit shape characteristics;
In the Spatial Pyramid Pooling Fast module, the Large Separable Kernel Attention (LSKA) is integrated immediately after the concat module. By applying attention to the fused multi-scale features, LSKA expands the effective receptive field, enhancing the model’s ability to recognize cherry tomatoes across varying dimensions and sizes;
The Spatially Enhanced Attention Module (SEAM) is incorporated into the decoupled detection head, positioned directly after the first convolutional layer. This specific placement allows SEAM to strengthen the feature response of visible fruit areas to compensate for response losses in obscured regions, thereby improving recognition performance under severe greenhouse occlusion;
The original CIoU loss function is replaced by Inner-GIoU to optimize the bounding box regression process. By utilizing auxiliary bounding boxes for loss calculation, this substitution accelerates training convergence and enhances the localization accuracy for fruit samples across different scales.

These improvement measures substantially elevate the YOLO-ELS model’s recognition capability for cherry tomato maturity in greenhouse environments.

3.3. C2f-EIEM Edge Feature Extraction Module

In the YOLOv8 architecture, the C2f module processes intermediate feature maps through a split-and-merge strategy. One branch maintains a direct path for feature fusion, while the other passes through a Bottleneck sequence involving convolution, normalization, and activation. This dual-path design facilitates the extraction of more comprehensive feature representations. Ultimately, the feature representation of the algorithm is improved by feature fusion with a multi-branch structure, the processing flow of which is shown in Figure 6a. However, in greenhouse environments, the chromatic similarity between unripe tomatoes and the background foliage often leads to misidentifications. When utilizing the original model, background branches and leaves are frequently misclassified as unripe fruits, resulting in false positives that compromise overall detection accuracy.

To solve this issue, this paper introduces the Edge Information Enhanced Module (EIEM) into the C2f layer. The EIEM is designed to strengthen the extraction of discriminative edge features, thereby improving the model’s ability to distinguish cherry tomatoes from complex backgrounds. The improved module is shown in Figure 6b.

To systematically strengthen edge detection, the C2f-EIEM module undergoes a complete structural substitution, in which the original bottlenecks are replaced with EIEMs to facilitate more robust gradient information extraction. Two parallel convolutional branches learn from the input, yielding more comprehensive feature information. One branch performs feature extraction on the input through convolution to retain the image’s spatial information; simultaneously, the other branch incorporates the Sobel operator for edge feature extraction, enhancing the model’s shape awareness of target objects. The structure of the Sobel operator is shown in Figure 7.

The Sobel operator employs two mutually perpendicular 3 × 3 convolution kernels to compute directional derivatives. By convolving these kernels with the image, approximations of the horizontal and vertical luminance gradients are obtained. These directional gradients are then combined to determine the gradient magnitude for each pixel. Finally, by applying a predefined grayscale threshold, the edge segmentation of the target is achieved [30]. The formula for this edge operator is shown below:

G_{x} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * A

(1)

G_{y} = \begin{bmatrix} 1 & 2 & 1 \\ 0 & 0 & 0 \\ -1 & -2 & -1 \end{bmatrix} * A

(2)

G = \sqrt{G_{x}^{2} + G_{y}^{2}}

(3)

where A represents the original unprocessed image, G_x and G_y are the gradient approximations obtained by horizontal and vertical edge detection, respectively, and G is the resulting gradient magnitude at each pixel. The processed image is shown in Figure 8.
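Equations (1)–(3) and the thresholding step can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the EIEM implementation: the kernels are applied by sliding-window correlation without flipping (which changes only the sign of the gradients, not the magnitude G), border pixels are left at zero, and the toy image and threshold value are made up for the example.

```python
import math

# Sobel kernels from Equations (1) and (2).
KX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
KY = [[1, 2, 1], [0, 0, 0], [-1, -2, -1]]


def apply_kernel(img, kernel, r, c):
    """Weighted sum of the 3x3 neighbourhood centred on pixel (r, c)."""
    return sum(kernel[i][j] * img[r + i - 1][c + j - 1]
               for i in range(3) for j in range(3))


def sobel_magnitude(img):
    """Gradient magnitude G = sqrt(Gx^2 + Gy^2) for interior pixels."""
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = apply_kernel(img, KX, r, c)
            gy = apply_kernel(img, KY, r, c)
            out[r][c] = math.sqrt(gx * gx + gy * gy)
    return out


def threshold(mag, t):
    """Binary edge map: 1 where the magnitude reaches the threshold t."""
    return [[1 if v >= t else 0 for v in row] for row in mag]


# Toy grayscale image with a vertical step edge between columns 1 and 2.
img = [[0, 0, 10, 10] for _ in range(4)]
mag = sobel_magnitude(img)
edges = threshold(mag, 20)
print(mag[1])    # [0.0, 40.0, 40.0, 0.0]
print(edges[1])  # [0, 1, 1, 0]
```

The vertical kernel responds with zero on this image because the intensity does not change from row to row, so the magnitude comes entirely from the horizontal gradient.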

Compared to the original C2f module, the enhanced C2f-EIEM module effectively filters out substantial irrelevant background information in images and directly extracts more accurate edge orientation information. This process significantly reduces data volume while preserving the model’s ability to capture edge features. By minimizing interference from unrelated background conditions, the module helps lower the false detection rate and reduces the computational demand of the detection algorithm.

3.4. SPPF-LSKA Large Separable Kernel Attention Module

In facility-based cherry tomato cultivation, heterogeneous light distribution within the plant canopy leads to varying maturity levels across different vertical layers. Consequently, fruits of different sizes and maturity stages often coexist within the same field of view. Experimental results demonstrate that the original model, when confronted with occluded small targets, tends to over-focus on local features. This results in the fragmentation of large targets into multiple independent instances during recognition, thereby generating duplicate detections and compromising detection accuracy.

In the YOLOv8 algorithm, the Spatial Pyramid Pooling Fast (SPPF) module performs pooling operations on convolutional feature maps through grids of varying granularities. This design integrates feature information under different receptive fields, thereby enabling efficient processing of multi-scale targets. However, in the original YOLOv8 algorithm, the static pooling layers cannot adequately adapt to tomato fruit targets with varying scales. Therefore, to enhance the model’s capability in learning cherry tomato targets of varying sizes and to ensure detection accuracy for multi-scale objects, the Large Separable Kernel Attention (LSKA) mechanism is incorporated into the original SPPF module. Specifically, it is inserted between the concat layer and the Conv layer. The modified module architecture is shown in Figure 9.

To expand the receptive field without prohibitive computational costs, the LSKA module [31] employs a kernel decomposition strategy. Within this module, two-dimensional convolutional kernels are decomposed into separate horizontal and vertical one-dimensional kernels. These directional kernels are then sequentially applied to the input features, allowing the attention module to efficiently implement large-kernel depthwise convolutions. This architectural refinement allows the model to capture extensive contextual information, thereby facilitating superior multi-scale feature representation in complex scenes.

As shown in Figure 10, compared to Large Kernel Attention (LKA), the improved LSKA module decomposes the original (2d - 1) × (2d - 1) two-dimensional convolutional kernel into two one-dimensional depthwise convolutional layers of 1 × (2d - 1) and (2d - 1) × 1. These layers extract information in the horizontal and vertical directions and capture the local contextual information of the feature map, and their cascaded outputs generate the preliminary attention map. The output after LSKA processing is as follows:

\bar{Z}^{C} = \sum_{H, W} W_{(2d - 1) \times 1}^{C} * \left( \sum_{H, W} W_{1 \times (2d - 1)}^{C} * F^{C} \right)

(4)

Z^{C} = \sum_{H, W} W_{(\frac{k}{d}) \times 1}^{C} * \left( \sum_{H, W} W_{1 \times (\frac{k}{d})}^{C} * \bar{Z}^{C} \right)

(5)

A^{C} = W_{1 \times 1} * Z^{C}

(6)

\bar{F}^{C} = A^{C} \otimes F^{C}

(7)

where ⊗ is the Hadamard product, ∗ is convolution, d is the dilation rate, W is the convolution kernel, k is the maximal receptive field of the kernel W, F^{C} is the input feature map, A^{C} is the attention map, Z^{C} is the output of the dilated depthwise convolution with kernel sizes of (k / d) × 1 and 1 × (k / d), \bar{Z}^{C} is the output of the depthwise convolution with kernel sizes of (2d - 1) × 1 and 1 × (2d - 1), and \bar{F}^{C} is the LSKA output.

By integrating the LSKA module, the model inevitably incurs an increase in parameters. However, this integration significantly expands the model’s receptive field and enhances its spatial contextual perception, enabling more effective recognition of cherry tomato targets at varying scales. This improvement mitigates the original model’s excessive reliance on local features and promotes robust multi-scale feature aggregation. Furthermore, the convolutional kernel decomposition and depthwise convolution design adopted by LSKA substantially reduce the total parameter count compared to the standard LKA model. Therefore, while introducing attention mechanisms to strengthen feature representation, this design maintains the lightweight nature of the model, making it more suitable for deployment on edge devices.
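The parameter saving from kernel decomposition can be checked with simple arithmetic. The sketch below counts weights (biases ignored) for an LKA-style block and an LSKA-style block using the kernel sizes named in the text. The values C = 256, k = 21, and d = 3 are illustrative assumptions chosen so that k is divisible by d; they are not settings reported in the paper.

```python
def lka_params(channels, k, d):
    """Weights in an LKA-style block: two 2D depthwise kernels + 1x1 conv.

    Depthwise kernels hold one filter per channel, so each contributes
    channels * (kernel area). The 1x1 convolution mixes channels.
    """
    local = (2 * d - 1) ** 2      # (2d-1) x (2d-1) depthwise kernel
    dilated = (k // d) ** 2       # (k/d) x (k/d) dilated depthwise kernel
    return channels * (local + dilated) + channels * channels


def lska_params(channels, k, d):
    """Weights in an LSKA-style block: each 2D kernel split into two 1D ones."""
    local = 2 * (2 * d - 1)       # 1 x (2d-1) and (2d-1) x 1
    dilated = 2 * (k // d)        # 1 x (k/d) and (k/d) x 1
    return channels * (local + dilated) + channels * channels


C, k, d = 256, 21, 3              # illustrative values only
print(lka_params(C, k, d))        # 84480
print(lska_params(C, k, d))       # 71680
```

Per channel, the depthwise part shrinks from 25 + 49 = 74 weights to 10 + 14 = 24 under these assumed values, and the gap widens as k grows because the 2D cost is quadratic in kernel size while the separable cost is linear.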

3.5. SEAM Head Module

The detection of cherry tomatoes in greenhouse environments is frequently compromised by severe occlusion among fruits, branches, and leaves. Such occlusion results in feature overlap and the loss of discriminative characteristics. To enhance detection performance under these conditions, the Spatially Enhanced Attention Module (SEAM) is integrated after the first convolutional layer within the detection head. This module strengthens the feature response by enhancing discriminative cues from visible regions, thereby compensating for information loss in occluded areas and improving the model’s capacity to identify partially obscured targets. The structure of the SEAM module is shown in Figure 11.

Within the SEAM module [32], input images undergo processing through a residual-enhanced CSMM channel, where depthwise separable convolution establishes cross-dimensional correlations between spatial and channel features. Subsequent channel convolution integrates inter-channel information to strengthen feature connectivity, while the synergistic combination of GELU activation and feature map normalization jointly stabilizes the training process.

Subsequent to the CSMM channel, the module utilizes a two-layer fully connected architecture to aggregate global channel information. This approach enhances the interaction between feature channels, allowing the algorithm to capture and represent heterogeneous image characteristics more robustly. When an occluded target is detected, its lost features can be compensated using the channel information learned from unoccluded targets.

Finally, an exponential function is applied to the output logits of the fully connected layer, rescaling the activation values from [0, 1] to [1, e]. These values serve as attention weights and are integrated with the original features through element-wise multiplication to produce the final output. By integrating SEAM into the head module, the framework effectively mitigates informative feature loss induced by fruit-plant occlusion, thereby enhancing overall detection performance while specifically improving recognition accuracy for occluded targets.
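The final reweighting step can be illustrated with a short sketch. It covers only the exponential rescaling and the element-wise multiplication described above, not the CSMM or the fully connected layers; the function name `seam_reweight`, the per-channel score layout, and the toy values are assumptions for the example.

```python
import math


def seam_reweight(features, channel_scores):
    """Scale each feature channel by exp(score).

    features:       list of channels, each a flat list of activations.
    channel_scores: one score per channel, assumed to lie in [0, 1],
                    so the resulting weights lie in [1, e].
    """
    weights = [math.exp(s) for s in channel_scores]
    out = [[w * v for v in channel] for w, channel in zip(weights, features)]
    return out, weights


features = [[2.0, 4.0], [1.0, 3.0]]
out, weights = seam_reweight(features, [0.0, 1.0])
print(weights[0])  # 1.0 -> channel passes through unchanged
print(out[0])      # [2.0, 4.0]
```

Because the smallest possible weight is exp(0) = 1, no channel is ever attenuated; channels with higher scores are amplified by up to a factor of e, which is consistent with the stated aim of strengthening responses rather than suppressing them.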

3.6. Loss Function Improvement

The loss function serves not only as a critical metric for evaluating model predictions but also as the fundamental mechanism for guiding gradient optimization and training trajectories. By modifying the loss function, the model can be guided in the desired direction according to the specific requirements of the particular dataset. To improve detection accuracy and speed up training convergence, the original CIoU loss function is replaced with an enhanced Inner-GIoU function, which achieves faster regression convergence than the original. The relevant formulae are shown below:

IoU = \frac{| B \cap B^{gt} |}{| B \cup B^{gt} |}

(8)

GIoU = IoU - \frac{| C \setminus (B \cup B^{gt}) |}{| C |}

(9)

L_{GIoU} = 1 - IoU + \frac{| C \setminus (B \cup B^{gt}) |}{| C |}

(10)

L_{Inner-GIoU} = L_{GIoU} + IoU - IoU^{inner}

(11)

where B and B^{gt} represent the predicted box and the ground-truth box, and C is the smallest rectangle that covers both B and B^{gt}. In contrast to the CIoU loss, the GIoU loss function incorporates the minimum bounding rectangle that encloses both the predicted and ground-truth boxes to quantify the distance between them. When the predicted box overlaps with the ground-truth box, the degree of overlap is reflected by the area of the minimum enclosing rectangle. When the two boxes do not overlap, the loss still reflects the distance between them, which effectively solves the problem of the gradient being zero for non-overlapping boxes. However, such loss functions lack the adaptability to different detectors and detection tasks in practice, resulting in poor generalization and slow convergence, which ultimately affects the accuracy and speed of the final detection.

To compensate for the poor generalization and slow convergence of existing IoU functions, this research introduces Inner-IoU [33] to improve the loss function, as illustrated in Figure 12.

Inner-IoU incorporates a scale factor to modulate auxiliary bounding box dimensions, optimizing regression constraints. The scale factor is the ratio of the size of the auxiliary bounding box to the ground truth bounding box. When the scale factor exceeds 1, the auxiliary bounding box expands beyond the actual bounding box, capturing more extensive contextual information. This augmented feature representation is conducive to enhancing the localization precision of small targets. Conversely, when the scale factor is less than 1, the auxiliary bounding box is constrained to the core feature regions of the object. This refinement facilitates more precise localization for large-scale targets by prioritizing high-confidence interior pixels. Therefore, the scale factor can be adjusted according to the IoU values encountered in practice, either to accelerate convergence or to extend the effective regression range.

During tomato detection, larger immature fruits are easily confused with the green branches and leaves in the background, which lowers the model’s recall. Therefore, in this experiment, the scale factor is set below 1, shrinking the auxiliary bounding box to improve the regression performance of the model.

By adaptively scaling auxiliary bounding boxes via the scale factor, this approach overcomes the generalization limitations of existing methods and bolsters model robustness against multi-scale variations.
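The auxiliary-box idea can be illustrated with a minimal sketch. It assumes the formulation given in the Inner-IoU paper [33], in which both the predicted and the ground-truth box are scaled about their own centres by the same ratio and the Inner-GIoU loss is the GIoU loss plus the difference between the ordinary IoU and the inner IoU. The `Box` class and all function names are hypothetical helpers for illustration, not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box given by centre (cx, cy), width w and height h (all > 0)."""
    cx: float
    cy: float
    w: float
    h: float

    def corners(self, ratio=1.0):
        """Return (x1, y1, x2, y2) after scaling w and h by `ratio` about the centre."""
        hw, hh = self.w * ratio / 2.0, self.h * ratio / 2.0
        return (self.cx - hw, self.cy - hh, self.cx + hw, self.cy + hh)


def _area(x1, y1, x2, y2):
    # Clamp at zero so that non-overlapping boxes give an empty intersection.
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def _inter_union(a, b):
    inter = _area(max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    union = _area(*a) + _area(*b) - inter
    return inter, union


def iou(pred, gt, ratio=1.0):
    """IoU of the two boxes; ratio != 1 gives the inner IoU of the auxiliary boxes."""
    inter, union = _inter_union(pred.corners(ratio), gt.corners(ratio))
    return inter / union if union > 0 else 0.0


def giou_loss(pred, gt):
    """1 - GIoU, where C is the smallest rectangle enclosing both boxes."""
    a, b = pred.corners(), gt.corners()
    inter, union = _inter_union(a, b)
    enclose = _area(min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))
    if union <= 0 or enclose <= 0:
        raise ValueError("boxes must have positive area")
    return 1.0 - (inter / union - (enclose - union) / enclose)


def inner_giou_loss(pred, gt, ratio=0.9):
    """Assumed form from [33]: L_GIoU + IoU - IoU_inner."""
    return giou_loss(pred, gt) + iou(pred, gt) - iou(pred, gt, ratio)
```

For two 2 × 2 boxes whose centres are one unit apart, the ordinary IoU is 1/3, while the inner IoU at a ratio of 0.9 drops to 2/7. The Inner-GIoU loss (5/7) is therefore larger than the GIoU loss (2/3), which is the stronger penalty on partially overlapping boxes that a ratio below 1 is meant to provide.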

4. Experimental Results and Analysis

4.1. Experimental Setups

The models were trained on a machine with an Intel(R) Xeon(R) Gold 6430 processor (Intel, Santa Clara, CA, USA) and an NVIDIA GeForce RTX 4090 graphics card (NVIDIA, Santa Clara, CA, USA), running on 24 GB of RAM. The software environment was Ubuntu 20.04, and the virtual environment was configured with PyTorch 1.11.0, CUDA 11.3, and Python 3.8. The SGD optimizer was used for model training. The training parameters are summarized in Table 2.

To optimize detection performance, mosaic data augmentation was applied during training, with deactivation in the final 10 epochs to fine-tune model parameters. The graph of the YOLO-ELS model training process is shown in Figure 13.
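The training settings from Table 2, together with the SGD optimizer and the mosaic schedule described here, can be collected in one configuration sketch. The key names follow the training arguments of the Ultralytics YOLO package; that the authors used this package, and these exact argument names, is an assumption. The helper `mosaic_enabled` is hypothetical and only illustrates the "last 10 epochs" rule.

```python
# Hyperparameters taken from Table 2 and Section 4.1. Key names mirror
# Ultralytics training arguments (assumed framework, not confirmed by the paper).
TRAIN_CFG = {
    "epochs": 300,
    "batch": 64,
    "workers": 32,
    "imgsz": 640,
    "optimizer": "SGD",
    "lr0": 0.01,            # initial learning rate
    "momentum": 0.937,
    "weight_decay": 0.0005,
    "close_mosaic": 10,     # disable mosaic augmentation for the final 10 epochs
}


def mosaic_enabled(epoch, cfg=TRAIN_CFG):
    """Return True if mosaic augmentation is active in the given 0-based epoch."""
    return epoch < cfg["epochs"] - cfg["close_mosaic"]


if __name__ == "__main__":
    off = [e for e in range(TRAIN_CFG["epochs"]) if not mosaic_enabled(e)]
    print(f"mosaic disabled for epochs {off[0]}..{off[-1]} (0-based)")
```

Under this reading, mosaic is active for the first 290 epochs and switched off for the last 10, so the final fine-tuning steps see only unmixed images.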

The experimental results reveal that the model enters the convergence phase at approximately 200 epochs, after which the loss function stabilizes within a narrow margin. By 250 epochs, both the training and validation curves exhibit asymptotic behavior and flatten into near-linear trajectories. This steady state indicates that the model has learned effectively and that the weights have stabilized. Furthermore, these trends suggest that the configured training hyperparameters are well suited to the proposed architecture.

4.2. Evaluation Indicators

The evaluation in this study employs five key metrics: Precision, Recall, mAP@50%, F1-score, and GFLOPs. Precision measures the proportion of correctly predicted positive instances among all instances predicted as positive. Recall evaluates the model’s ability to identify all relevant target instances. The mean Average Precision (mAP) at an IoU threshold of 0.5 reflects the overall detection accuracy across all categories under varying confidence thresholds. The F1-score provides a balanced measure by combining both Precision and Recall into a single metric using their harmonic mean. The formula for each metric is as follows:

\begin{matrix} P = \frac{TP}{TP + FP} \end{matrix}

(12)

\begin{matrix} R = \frac{TP}{TP + FN} \end{matrix}

(13)

\begin{matrix} AP = \int_{0}^{1} P(R) \, dR \end{matrix}

(14)

\begin{matrix} mAP = \frac{1}{n} \sum_{i = 1}^{n} AP_{i} \end{matrix}

(15)

\begin{matrix} F1 = \frac{2 \times P \times R}{P + R} \end{matrix}

(16)

where TP (True Positive) is the number of positive samples identified correctly, FP (False Positive) is the number of negative samples identified as positive, FN (False Negative) is the number of positive samples misclassified as negative, and n is the number of target categories over which AP is averaged.
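Equations (12) to (16) can be sketched in a few lines of Python. The AP integral is approximated here with all-point interpolation over a monotonically decreasing precision envelope, which is one common numerical convention and not necessarily the one used by the authors' evaluation toolkit. All function names are illustrative.

```python
def precision(tp, fp):
    """Eq. (12): share of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Eq. (13): share of ground-truth positives that are found."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1_score(p, r):
    """Eq. (16): harmonic mean of precision and recall."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


def average_precision(recalls, precisions):
    """Eq. (14): area under the P(R) curve; `recalls` must be sorted ascending."""
    mrec = [0.0] + list(recalls) + [1.0]
    mpre = [0.0] + list(precisions) + [0.0]
    # Replace each precision by the maximum precision at any higher recall.
    for i in range(len(mpre) - 2, -1, -1):
        mpre[i] = max(mpre[i], mpre[i + 1])
    return sum((mrec[i + 1] - mrec[i]) * mpre[i + 1] for i in range(len(mrec) - 1))


def mean_average_precision(ap_per_class):
    """Eq. (15): mean of the per-category AP values."""
    return sum(ap_per_class) / len(ap_per_class)


if __name__ == "__main__":
    # Baseline YOLOv8n values reported in Section 4.3.2: P = 0.865, R = 0.810.
    print(round(f1_score(0.865, 0.810), 3))
```

Applying `f1_score` to the baseline precision (0.865) and recall (0.810) reported in Section 4.3.2 gives 0.837, which agrees with the F1-score of 83.7% stated there.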

GFLOPs measures the computational cost per inference and is directly linked to inference latency. A lower GFLOPs value enables low-power NPUs to achieve higher frame rates. Parameters refers to the total number of trainable elements, while Model Size indicates the disk space occupancy. These two metrics dictate peak memory usage and storage requirements, respectively. For edge devices with constrained resources, these metrics are essential indicators for evaluating model lightweighting.

4.3. Experimental Results and Analysis

4.3.1. Comparison of Different Loss Functions

The loss function quantifies the discrepancy between predictions and ground truth, playing a critical role in guiding the optimization. To validate the effectiveness of the proposed Inner-GIoU loss function, this study compares it with the original CIoU loss function and several other loss functions of different types. The experimental results are summarized in Table 3.

As shown in Table 3, the baseline CIoU achieves the highest precision among all loss function variants. However, its recall is only 80.6%, suggesting a tendency toward conservative predictions that lead to missed detections. When the loss function is replaced with SlideLoss, the model’s recall increases to 82.7%, suggesting that the introduction of dynamic thresholds alleviates the class imbalance issue in the original dataset. Nevertheless, its precision decreases significantly, failing to meet detection requirements. ShapeIoU effectively captures geometric features and matches the highest mAP@50% (0.922) and recall (0.839), yet its decline in precision to 0.913 undermines its practical reliability. Compared to the original loss, both Inner-CIoU and Inner-PIoU improve recall, but at the cost of reduced precision. In contrast, Inner-GIoU demonstrates the most balanced performance. By leveraging the auxiliary bounding box mechanism, it gains 3.3 percentage points in recall with only a 0.3-point decrease in precision. This trade-off indicates that Inner-GIoU better regularizes the regression process, providing the most robust detection capability for complex greenhouse scenarios.
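The differences quoted above can be recomputed directly from Table 3. The short sketch below copies the table values and reports each loss function's change relative to CIoU in percentage points; the dictionary and function names are illustrative.

```python
# Values copied from Table 3: (precision, recall, mAP@50%).
TABLE_3 = {
    "CIoU": (0.930, 0.806, 0.922),
    "SlideLoss": (0.905, 0.827, 0.905),
    "ShapeIoU": (0.913, 0.839, 0.922),
    "Inner-CIoU": (0.902, 0.831, 0.897),
    "Inner-PIoU": (0.902, 0.823, 0.910),
    "Inner-GIoU": (0.927, 0.839, 0.920),
}


def delta_points(name, baseline="CIoU"):
    """Change versus the baseline in percentage points, rounded to 0.1."""
    return tuple(
        round((a - b) * 100, 1) for a, b in zip(TABLE_3[name], TABLE_3[baseline])
    )


if __name__ == "__main__":
    for name in TABLE_3:
        print(name, delta_points(name))
```

For Inner-GIoU this gives changes of -0.3, +3.3, and -0.2 points in precision, recall, and mAP@50%, so the recall gain is obtained at a small cost in the other two metrics.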

4.3.2. Ablation Experimental Results and Analysis

To systematically evaluate the contribution of each proposed module within the integrated framework, nine ablation configurations were designed and executed. These tests were conducted under identical experimental conditions and utilized the same dataset to ensure consistency and comparability across all evaluations. The experimental findings are displayed in Table 4.

The experimental results indicate that the baseline YOLOv8n model achieves a Precision of 86.5%, a mean Average Precision of 88.5%, and an F1-score of 83.7% on the experimental dataset. These initial metrics serve as the reference benchmark for evaluating the performance gains introduced by the proposed modular enhancements. Figure 14a displays the detection results. As the figure illustrates, the baseline YOLOv8n exhibits limited feature representation capabilities for small-scale targets and struggles with spatial occlusion caused by branches. This leads to frequent omissions and false detections, which undermine the overall recognition accuracy and robustness of the model in complex greenhouse environments.

By integrating the SPPF-LSKA module, the model’s sensitivity to small-scale targets was significantly enhanced, with P, mAP@50%, and F1-score increasing by 1.7%, 1.1%, and 2.0%, respectively. The incorporation of the LSKA mechanism into the spatial pooling layer improved the model’s detection capability for multi-scale targets. Although this enhancement introduces additional computational cost, it effectively expands the effective receptive field. This expansion makes detection more robust to scale variation and reduces the missed detection rate in complex environments.

After integrating the SEAM module into the detection head, the model exhibits a marginal improvement of 0.2% in precision, while achieving a substantial gain in recall. This outcome demonstrates that the SEAM module effectively compensates for feature response degradation in occluded regions by enhancing discriminative cues in visible areas. Consequently, it mitigates occlusion challenges caused by overlapping fruits and foliage, thereby reducing missed detections in cherry tomato identification.

After the addition of the EIEM module, the precision and the mAP@50% improved by 2.6% and 1.2%, respectively. Notably, this enhancement was accompanied by a reduction in both parameter count and GFLOPs. These results indicate that the EIEM module effectively captures contour-specific features, thereby enhancing the model’s ability to distinguish immature tomato targets from complex green leaf backgrounds. By separating the target from environmental noise, the module mitigates false positives while simultaneously advancing the lightweight architectural objectives of the network.

Following the sequential integration of the three enhancement modules, the model exhibits improvements in precision, mAP@50%, and F1-score compared to the baseline. However, its recall decreases from 0.810 to 0.806, indicating a slightly increased tendency to miss targets. To solve this problem, this paper adopts the Inner-GIoU loss function. Based on multiple sets of comparative experiments, the ratio factor of Inner-GIoU is set to 0.9, which enhances the localization accuracy for large targets. This configuration achieves substantially improved recall and overall detection performance while maintaining comparable precision levels. The final detection results are shown in Figure 14e.

The proposed improvements yield significant performance gains, with the enhanced model achieving increases of 6.2% in precision, 2.9% in recall, and 3.5% in mAP@50%. Concurrently, the computational complexity is substantially reduced, meeting the requirements for deployment on edge devices. In summary, YOLO-ELS demonstrates marked improvement in detecting cherry tomatoes of varying sizes and maturity levels under complex greenhouse conditions.

4.4. Comparative Analysis of Test Results of Different Network Models

To systematically evaluate the superiority of the proposed model in tomato ripeness identification, seven mainstream architectures were selected for comparative analysis. These include the single-stage detectors DETR [34], SSD [35], and TOOD [36], the two-stage model Faster R-CNN [37], and other versions of YOLO, namely YOLOv5s, YOLOv8n, and YOLO11n. The experimental results are shown in Table 5.

The table shows that the two-stage model Faster R-CNN has higher average detection accuracy and recall than the one-stage models outside the YOLO series. However, due to the inherent architectural complexity of two-stage models, the algorithm imposes substantial computational overhead, with floating-point operations reaching 138 G. Such resource-intensive requirements fail to satisfy the real-time deployment constraints and hardware limitations of edge terminals. Among the one-stage algorithms, the YOLO series has obvious advantages in terms of parameter quantity, model size, and computing power requirements. Furthermore, successive versions of the YOLO algorithm consistently outperform the other single-stage detectors in terms of both mean Average Precision and recall.

Among various YOLO iterations, YOLOv8n offers a superior balance between performance and efficiency for edge deployment. Although it exhibits a slight accuracy trade-off compared to YOLOv5s, its model size and computational demands are reduced by approximately 50%. Notably, although the latest lightweight model, YOLO11n, demonstrates high efficiency, its recall rate remains inadequate at only 0.748. Such a high miss rate poses a significant bottleneck for automated fruit harvesting, where missing targets directly reduces operational efficiency. In contrast, YOLOv8n serves as a more robust and adaptable baseline for optimization in these practical scenarios. Building on this baseline, the final YOLO-ELS model achieves a significant boost in overall performance. Specifically, compared to YOLO11n, its precision, recall, and mAP@50% are increased by 1.7%, 9.1%, and 3.2%, respectively. Although YOLO-ELS incurs a marginal increase in model size and computational complexity relative to YOLO11n, this increment is negligible and does not compromise its suitability for edge deployment.

Therefore, based on the experimental results of each model, the improved YOLO-ELS model has the best detection performance, with precision, recall, and mAP@50% reaching 0.927, 0.839, and 0.920, respectively. Moreover, its parameter count is small and its computational cost is low, which meets the deployment requirements of terminal equipment. These results demonstrate that the model satisfies the practical requirements of this study and performs well in tomato maturity detection tasks in a greenhouse setting.

4.5. Model Visualisation

Grad-CAM++ is an enhanced visualization technique built upon the Grad-CAM framework. It produces class activation maps by emphasizing positive gradient contributions from the final convolutional layer toward specific class scores, thereby more accurately highlighting regions critical for target class identification [38]. Therefore, to show intuitively where each model attends before and after the improvement, this study uses heat maps to visualize the detection models and to analyze their attention to the recognition targets. The visualization results are shown in Figure 15.

The heatmap illustrates that for the YOLOv8n model, the attention hotspot map is more scattered and does not concentrate on the tomato’s primary characteristics. Instead, it primarily highlights the exterior contour area features. This limitation is more pronounced in immature tomatoes, suggesting that the baseline struggles to extract critical internal information. In contrast, the improved YOLO-ELS exhibits a more concentrated focus, with activation regions densely covering the core representative areas of the target. This localized concentration indicates that the model can more effectively integrate diverse feature information. In summary, the enhanced architecture demonstrates superior discriminative power for cherry tomatoes, capturing richer feature representations and confirming the efficacy of the proposed modifications.

4.6. Performance Benchmarking Results and Analysis

To further validate the generalization and robustness of the proposed YOLO-ELS, we conducted supplementary benchmark evaluations on the publicly available “2022 Dataset of String Tomato in Shanxi Nonggu Tomato Town” [39]. The dataset contains 3665 images of cluster tomatoes at varying maturity stages. It was randomly split into training, validation, and test sets in an 8:1:1 ratio. To evaluate the effectiveness of YOLO-ELS on this public benchmark, we compared its performance against MTS-YOLO [40] and several representative YOLO-series models using the same dataset.
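A random 8:1:1 split of this kind can be sketched as follows. The seed, the rounding rule (integer division, with the remainder assigned to the test set), and the function name are assumptions made for illustration; the paper does not state how the split boundaries were rounded.

```python
import random


def split_indices(n, ratios=(8, 1, 1), seed=0):
    """Shuffle indices 0..n-1 and cut them into train/val/test by `ratios`."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)  # local RNG keeps the split reproducible
    total = sum(ratios)
    n_train = n * ratios[0] // total
    n_val = n * ratios[1] // total
    train = indices[:n_train]
    val = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]      # remainder goes to the test set
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_indices(3665)
    print(len(train), len(val), len(test))
```

With 3665 images, this rounding rule yields 2932 training, 366 validation, and 367 test images; a different rounding convention would shift one or two images between the subsets.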

As summarized in Table 6, YOLO-ELS maintains stable detection performance despite differences in variety and growth conditions. Specifically, YOLO-ELS achieved a precision of 92.4%, outperforming YOLOv8n and YOLOv10n by 4.8% and 8.8%, respectively. Structurally, the 6.9 GFLOPs and 2.93 M parameters of YOLO-ELS demonstrate that the proposed optimizations effectively condense the model without sacrificing spatial robustness. While the MTS-YOLO model exhibited a slightly higher recall, the marginal F1-score difference indicates that YOLO-ELS provides comparable overall performance.

In conclusion, YOLO-ELS maintains consistent detection efficacy across datasets with diverse growth conditions and varieties. The successful adaptation from single-fruit to string-fruit tasks validates that the architectural optimizations enhance feature representation and spatial robustness. These results confirm the model’s viability as a generalized solution for precision agriculture in complex greenhouse environments.

4.7. Application and Edge Deployment Performance Testing of the YOLO-ELS

To validate the deployment suitability of the improved YOLO-ELS algorithm in real-world scenarios, this study selected the NVIDIA Jetson Orin Nano Super, an embedded hardware platform commonly used in this field, as the testbed for deployment. The software environment was set up with Ubuntu 22.04 LTS and JetPack 6.0. The deployment on the Jetson Orin Nano Super platform is shown in Figure 16.

During the test, the input resolution was fixed at 640 × 640 pixels, consistent with the model training configuration. To simulate continuous inference loads in real-world scenarios, a consecutive image stream constructed from the test set was used as input, while hardware performance parameters were monitored in real time using the tegrastats command. The detection execution process is illustrated in Figure 17.

Experimental results demonstrate that the YOLO-ELS model achieves real-time processing and good energy efficiency on edge devices. In terms of inference speed, the model requires an average inference time of only 25.2 ms per image and achieves an overall detection frame rate of 28.2 FPS, meeting the real-time requirements of agricultural harvesting robots under typical conditions. Meanwhile, during sustained operation, the overall power consumption of the platform remains stable at 6.5 W.
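The relationship between the reported latency, frame rate, and power can be made explicit with a short calculation. The interpretation that the gap between 1000 / 25.2 ms and the measured 28.2 FPS comes from per-frame work outside the network forward pass (for example pre-processing, post-processing, and I/O) is an assumption; the paper does not break the frame time down. The function names are illustrative.

```python
def fps_from_latency(latency_ms):
    """Frame rate that would result if each frame took only `latency_ms`."""
    return 1000.0 / latency_ms


def frame_time_ms(fps):
    """Average end-to-end time per frame implied by a measured frame rate."""
    return 1000.0 / fps


def energy_per_frame_j(power_w, fps):
    """Average energy per processed frame in joules (watts divided by frames/s)."""
    return power_w / fps


if __name__ == "__main__":
    inference_ms, measured_fps, power_w = 25.2, 28.2, 6.5  # reported values
    print(f"inference-only upper bound: {fps_from_latency(inference_ms):.1f} FPS")
    print(f"end-to-end frame time:      {frame_time_ms(measured_fps):.1f} ms")
    print(f"non-inference overhead:     {frame_time_ms(measured_fps) - inference_ms:.1f} ms")
    print(f"energy per frame:           {energy_per_frame_j(power_w, measured_fps):.3f} J")
```

Under these assumptions, inference alone would allow roughly 39.7 FPS, the measured 28.2 FPS corresponds to about 35.5 ms per frame (around 10.3 ms of overhead), and each processed frame costs about 0.23 J at 6.5 W.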

To validate the algorithm’s recognition capability in real-world scenarios, this study utilized an external Orbbec Gemini Pro depth camera (Orbbec, Shenzhen, China). The camera captured a video stream at a resolution of 640 × 480 and a frame rate of 30 fps for real-time detection of cherry tomato plants. The detection results are shown in Figure 18.

Based on the combined experimental results, the YOLO-ELS cherry tomato ripeness recognition algorithm proposed in this study can meet deployment requirements on computationally constrained embedded devices. This confirms the practical deployability of the model and provides an effective solution for real-time detection of cherry tomato ripeness in greenhouse environments.

5. Conclusions and Future Work

To address the problem of missed and false detections in greenhouse environments caused by lighting changes and occlusion by overlapping fruit, several improvements were made to the baseline model. First, an LSKA layer was inserted into the pooling module. This layer processes the output from the original concatenation layer, thereby expanding the model’s receptive field and enhancing its ability to detect small targets. Second, the EIEM module was introduced to replace the bottleneck block in the C2f structure of the backbone network. This replacement enhances the model’s ability to learn shape features of the targets while reducing interference from redundant background information, and it also lowers the model’s computational complexity. Furthermore, a SEAM detection layer was incorporated into the detection head to strengthen feature recognition for occluded fruits, while the Inner-GIoU loss function was employed to optimize bounding box regression, enhancing the model’s convergence capability across multiple scales.

The enhanced YOLO-ELS model achieves a precision, recall, and mAP@50% of 92.7%, 83.9%, and 92.0%, which are 6.2%, 2.9%, and 3.5% higher than those of the original model, respectively. The model occupies only 5.91 MB of storage and requires 6.9 GFLOPs of computation. In practical deployment, the model achieved a detection speed of 28.2 FPS on the Jetson Orin Nano Super, with average power consumption maintained at 6.5 W. These results indicate that the improved model balances accuracy, speed, and energy efficiency well, making it suitable for real-time maturity detection of cherry tomatoes in complex greenhouse environments.

To evaluate its performance, YOLO-ELS was tested under consistent conditions with other common one-stage and two-stage algorithms. On the established cherry tomato dataset, the improved model achieved the highest detection precision and recall, demonstrating superior overall detection performance. The heat maps also show that the improved model attends to more comprehensive feature information, covers more of the target area, and has stronger target recognition ability. To further demonstrate the benefit of the proposed architectural improvements, experiments were also conducted on an additional public dataset. The results indicate that the enhanced YOLO-ELS algorithm maintains stable detection performance across different data distributions, significantly outperforming the baseline algorithms. This demonstrates its adaptability for fruit maturity recognition tasks in diverse production scenarios and across various fruit types.

In the future, we will expand the dataset to include a greater variety of cultivars and a wider range of growing conditions. This expansion aims to enhance the model’s generalization capability and robustness across diverse cultivation environments. By increasing the raw data diversity, we expect to further stabilize performance and mitigate potential minor precision drops that may arise from sample size limitations. In addition, future work will also explore the practical applicability of the algorithm in real picking scenarios. The model’s detection performance will be evaluated and refined based on experimental outcomes to meet actual operational needs.

Author Contributions

Conceptualization, Z.T. and C.C.; methodology, Y.Z. and Z.T.; software, Y.Z.; validation, Z.T., Y.Z. and C.L.; formal analysis, Y.Z. and Z.T.; investigation, Y.Z. and C.L.; resources, C.C.; data curation, Z.T.; writing—original draft preparation, Y.Z.; writing—review and editing, Z.T., C.L. and L.R.; visualization, Y.Z. and Z.T.; supervision, C.C. and L.R.; project administration, C.C. and L.R.; funding acquisition, C.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Science and Technology Development Program of Jilin Province (Grant No. 20230202077NC). The APC was funded by Changqing Cai and Lihong Rong.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are reported within the article. Further inquiries can be directed to the corresponding authors.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Li, Z.G.; Thomas, C. Quantitative evaluation of mechanical damage to fresh fruits. Trends Food Sci. Technol. 2014, 35, 138–150.
2. Shao, Y.Y.; Wang, Y.X.; Guan, X.T.; Gao, C.; Wang, K.L.; Gao, Z.M. Visual detection of SSC and firmness and maturity prediction for feicheng peach by using hyperspectral imaging. Trans. Chin. Soc. Agric. Mach. 2020, 51, 344–350.
3. Su, F.; Zhao, Y.P.; Wang, G.H.; Liu, P.Z.; Yan, Y.F.; Zu, L.L. Tomato maturity classification based on SE-YOLOv3-MobileNetV1 network under nature greenhouse environment. Agronomy 2022, 12, 1638.
4. Miao, R.; Li, Z.; Wu, J. A lightweight cherry tomato ripening detection method based on improved YOLO v7. J. Agric. Mach. 2023, 10, 225–233.
5. Zheng, S.H.; Jia, X.X.; He, M.L.; Zheng, Z.B.; Lin, T.L.; Weng, W.X. Tomato Recognition Method Based on the YOLOv8-Tomato Model in Complex Greenhouse Environments. Agronomy 2024, 14, 1764.
6. Chen, W.B.; Liu, M.C.; Zhao, C.J.; Li, X.X.; Wang, Y.Q. MTD-YOLO: Multi-task deep convolutional neural network for cherry tomato fruit bunch maturity detection. Comput. Electron. Agric. 2024, 216, 108533.
7. Zhang, F.; Gao, J.; Zhou, H.; Zhang, J.X.; Zou, K.L.; Yuan, T. Three-dimensional pose detection method based on keypoints detection network for tomato bunch. Comput. Electron. Agric. 2022, 195, 106824.
8. Bulanon, D.; Burks, T.; Alchanatis, V. Fruit visibility analysis for robotic citrus harvesting. Trans. ASABE 2009, 52, 277–283.
9. Lv, J.D.; Zhao, D.A.; Ji, W.; Ding, S. Recognition of apple fruit in natural environment. Optik 2016, 127, 1354–1362.
10. Gao, G.C.; Zhao, S.Y.; Zhang, C.; Yu, X.B.; Li, Z.Q. Study on fruit recognition methods based on compressed sensing. J. Comput. Theor. Nanosci. 2015, 12, 2937–2942.
11. Laykin, S.; Alchanatis, V.; Fallik, E.; Edan, Y. Image–processing algorithms for tomato classification. Trans. ASAE 2002, 45, 851.
12. Khoshroo, A.; Arefi, A.; Khodaei, J. Detection of red tomato on plants using image processing techniques. Agric. Commun. 2014, 2, 9–15.
13. Si, Y.S.; Liu, G.; Feng, J. Location of apples in trees using stereoscopic vision. Comput. Electron. Agric. 2015, 112, 68–74.
14. Chen, J.X.; Ding, J.J. The Study on Infrared Image Recognition for Tomato Picker Based on Machine Vision. J. Agric. Mech. Res. 2022, 44, 44–48+53.
15. Liu, G.X.; Mao, S.Y.; Kim, J.H. A mature-tomato detection algorithm using machine learning and color analysis. Sensors 2019, 19, 2023.
16. Wang, J.H.; Lin, X.M.; Luo, L.F.; Chen, M.Y.; Wei, H.L.; Xu, L.J.; Luo, S.M. Cognition of grape cluster picking point based on visual knowledge distillation in complex vineyard environment. Comput. Electron. Agric. 2024, 225, 109216.
17. Yu, J.H.; Liu, L.W.; Xu, L.; Yu, H.H.; Chen, Y.Y. Individual motion feature extraction method for sea bass based on improved YOLOv8 and ByteTrack. Trans. Chin. Soc. Agric. Eng. 2025, 41, 182–190.
18. Sun, Y.P.; Meng, X.W.; Gu, P.X.; Li, Z.Q.; Liu, Y.; Zhao, D. Extracting the navigation center line for fishery complementary photovoltaic boat using improved YOLOv8n. Trans. Chin. Soc. Agric. Eng. 2024, 40, 173–182.
19. Liang, Q.F.; Liang, C.Q.; Guo, H.; Xie, S.F.; Huang, Y.; Long, S.B.; Chen, P. Detecting discolored pine trees under natural scenes using improved YOLOv5. Trans. Chin. Soc. Agric. Eng. 2025, 41, 165–174.
20. Zheng, Y.P.; Li, G.Y.; Li, Y. Survey of Application of Deep Learning in Image Recognition. Comput. Eng. Appl. 2019, 55, 20–36.
21. Yan, J.W.; Wang, P.B.; Wang, T.J.; Zhu, G.F.; Zhou, X.L.; Yang, Z. Identification and localization of optimal picking point for truss tomato based on mask r-cnn and depth threshold segmentation. In Proceedings of the 2021 IEEE 11th Annual International Conference on CYBER Technology in Automation, Control, and Intelligent Systems (CYBER), Jiaxian, China, 27–31 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 899–903.
22. Quach, L.D.; Quoc, K.N.; Quynh, A.N.; Ngoc, H.T.; Thai-Nghe, N. Tomato health monitoring system: Tomato classification, detection, and counting system based on YOLOv8 model with explainable MobileNet models using Grad-CAM++. IEEE Access 2024, 12, 9719–9737.
23. Yuan, T.; Lv, L.; Zhang, F.; Fu, J.; Gao, J.; Zhang, J.X.; Li, W.; Zhang, C.L.; Zhang, W.Q. Robust cherry tomatoes detection algorithm in greenhouse scene based on SSD. Agriculture 2020, 10, 160.
24. Guan, Z.X.; Han, L.; Zuo, Z.J.; Pan, L.B. Design a robot system for tomato picking based on YOLO v5. IFAC-PapersOnLine 2022, 55, 166–171.
25. Solimani, F.; Cardellicchio, A.; Dimauro, G.; Petrozza, A.; Summerer, S.; Cellini, F.; Renò, V. Optimizing tomato plant phenotyping detection: Boosting YOLOv8 architecture to tackle data complexity. Comput. Electron. Agric. 2024, 218, 108728.
26. Gao, X.; Ding, J.Y.; Zhang, R.H.; Xi, X.B. YOLOv8n-CA: Improved YOLOv8n Model for Tomato Fruit Recognition at Different Stages of Ripeness. Agronomy 2025, 15, 188.
27. Zhao, N.; Wen, Y. OGS-YOLOv8: Coffee Bean Maturity Detection Algorithm Based on Improved YOLOv8. Appl. Sci. 2025, 15, 11632.
28. Zhang, H.; Liu, L.; Bi, J.; Liu, H.; Wen, Z.; Bi, L.; Yao, G. Modified Lightweight YOLO v8 Model for Fast and Precise Indoor Occupancy Detection. Appl. Sci. 2026, 16, 335.
29. Yaseen, M. What is YOLOv8: An in-depth exploration of the internal features of the next-generation object detector. arXiv 2024, arXiv:2408.15857.
30. Deng, X.; Huang, T.; Wang, W.; Feng, W. SE-YOLO: A sobel-enhanced framework for high-accuracy, lightweight real-time tomato detection with edge deployment capability. Comput. Electron. Agric. 2025, 239, 110973.
31. Lau, K.W.; Po, L.M.; Rehman, Y.A.U. Large separable kernel attention: Rethinking the large kernel attention design in cnn. Expert Syst. Appl. 2024, 236, 121352.
32. Yu, Z.P.; Huang, H.B.; Chen, W.J.; Su, Y.X.; Liu, Y.H.; Wang, X.Y. Yolo-facev2: A scale and occlusion aware face detector. Pattern Recognit. 2024, 155, 110714.
33. Zhang, H.; Xu, C.; Zhang, S. Inner-IoU: More effective intersection over union loss with auxiliary bounding box. arXiv 2023, arXiv:2311.02877.
34. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 213–229.
35. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37.
36. Feng, C.J.; Zhong, Y.J.; Gao, Y.; Scott, M.R.; Huang, W.L. Tood: Task-aligned one-stage object detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; IEEE Computer Society: Washington, DC, USA, 2021; pp. 3490–3499.
37. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448.
38. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-cam++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 839–847.
39. Song, G.Z.; Yan, S.; Jian, W.; Jing, C.; Luo, G.; Sheng, S.; Wang, X.; Li, Y. 2022 Dataset of String Tomato in Shanxi Nonggu Tomato Town. 2023. Available online: https://www.scidb.cn/en/detail?dataSetId=42525d00518b423baae7a55c4e001a68 (accessed on 10 January 2026).
40. Wu, M.; Lin, H.; Shi, X.; Zhu, S.; Zheng, B. MTS-YOLO: A multi-task lightweight and efficient model for tomato fruit bunch maturity and stem detection. Horticulturae 2024, 10, 1006.

Figure 1. Images of cherry tomatoes in various circumstances: (a) cloudy, (b) frontlight, (c) backlight, (d) single target fruit, (e) overlapping fruit shade, (f) branch and leaf shade.

Figure 2. Classification of tomatoes with different ripening levels and their labeling results map: (a) unripe fruit. (b) ripe fruit. (c) color-turning fruit. (d) unripe fruit annotation box. (e) ripe fruit annotation box. (f) color-turning fruit annotation box. Here, the green box is the minimum bounding box that encloses the fruit.

Figure 3. Sample image after data enhancement: (a) original image, (b) image noise and flipping, (c) image cropping, (d) horizontal image flip.

Figure 4. Architecture of YOLOv8 network.

Figure 5. Architecture of YOLO-ELS network.

Figure 6. Structure of the C2f and C2f-EIEM modules. (a) C2f module: after the split operation, the feature maps pass through the Bottleneck modules for feature extraction. (b) C2f-EIEM module: the Bottleneck layers are replaced with EIEM modules.

Figure 7. Structure of the Sobel convolution. Gx and Gy are the results of filtering in the horizontal and vertical directions, respectively. C, H, and W denote the number of channels, height, and width of the feature maps, respectively, while D represents the depth dimension.

Figure 8. (a) Original image data, (b) image data processed by the Sobel operator.
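As a rough illustration of the Sobel operation shown in Figures 7 and 8, the following sketch applies the standard 3 × 3 Sobel kernels to a single-channel image in plain Python. It is not the EIEM implementation: the function names (`correlate3x3`, `sobel_edges`) and the toy input are assumptions made for illustration, and the module in the paper operates on multi-channel feature maps rather than on a single grayscale image.

```python
import math

# Standard 3x3 Sobel kernels for horizontal (x) and vertical (y) derivatives.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def correlate3x3(image, kernel):
    """Valid-mode 3x3 cross-correlation over a 2D list of numbers."""
    height, width = len(image), len(image[0])
    out = []
    for r in range(height - 2):
        row = []
        for c in range(width - 2):
            acc = 0
            for i in range(3):
                for j in range(3):
                    acc += kernel[i][j] * image[r + i][c + j]
            row.append(acc)
        out.append(row)
    return out


def sobel_edges(image):
    """Return (Gx, Gy, gradient magnitude) for a single-channel image."""
    gx = correlate3x3(image, SOBEL_X)
    gy = correlate3x3(image, SOBEL_Y)
    mag = [[math.hypot(a, b) for a, b in zip(row_x, row_y)]
           for row_x, row_y in zip(gx, gy)]
    return gx, gy, mag


# Toy 4x4 image (hypothetical data): dark left half, bright right half.
image = [[0, 0, 1, 1] for _ in range(4)]
gx, gy, mag = sobel_edges(image)
```

On this toy step edge, Gx is nonzero everywhere while Gy is zero, because the intensity changes only along the horizontal axis.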

Figure 9. Structure of SPPF-LSKA.

Figure 10. Comparison between LKA and LSKA models. (a) Original LKA model, (b) LSKA model.

Figure 11. Structure of SEAM.

Figure 12. Diagrams of the IoU and Inner-IoU structures. The dashed box is the minimum enclosing rectangular box, the green box is the ground-truth box, and the red box is the predicted anchor box.

Figure 13. Training results of YOLO-ELS. Box loss measures the precision of bounding box localization, cls loss evaluates the accuracy of category predictions, and dfl loss characterizes the uncertainty of box boundaries and refines localization precision.

Figure 14. Detection results of different models. Red circles mark missed or false detections.
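The quantities in Figure 12 can be sketched as follows, assuming axis-aligned boxes in (x1, y1, x2, y2) form. The inner boxes share the centers of the original boxes and are scaled by a `ratio` factor, and the loss adds the gap between IoU and inner IoU to the GIoU loss, following the general Inner-IoU formulation. All function names and the `ratio` values are hypothetical; the ratio used for YOLO-ELS is not given here.

```python
def box_area(b):
    """Area of a box (x1, y1, x2, y2); degenerate boxes have zero area."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection_area(a, b):
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, iw) * max(0.0, ih)


def iou(a, b):
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def giou(a, b):
    """IoU minus the fraction of the enclosing box not covered by the union."""
    inter = intersection_area(a, b)
    union = box_area(a) + box_area(b) - inter
    enclosing = (min(a[0], b[0]), min(a[1], b[1]),
                 max(a[2], b[2]), max(a[3], b[3]))
    c_area = box_area(enclosing)
    if union <= 0 or c_area <= 0:
        return 0.0
    return inter / union - (c_area - union) / c_area


def scale_box(b, ratio):
    """Auxiliary (inner) box: same center, width and height scaled by ratio."""
    cx, cy = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    half_w = (b[2] - b[0]) * ratio / 2
    half_h = (b[3] - b[1]) * ratio / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def inner_iou(a, b, ratio):
    return iou(scale_box(a, ratio), scale_box(b, ratio))


def inner_giou_loss(pred, target, ratio=0.75):
    """L_GIoU + IoU - IoU_inner, with L_GIoU = 1 - GIoU (ratio is assumed)."""
    return (1.0 - giou(pred, target)) + iou(pred, target) - inner_iou(pred, target, ratio)


pred, target = (0, 0, 2, 2), (1, 1, 3, 3)  # hypothetical boxes
loss = inner_giou_loss(pred, target, ratio=0.75)
```

With `ratio=1.0` the inner boxes coincide with the original boxes and the expression reduces to the plain GIoU loss; a ratio below 1 shrinks the auxiliary boxes, which changes the gradient for high-overlap samples.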

Figure 15. Visual analysis of Grad-CAM++.

Figure 16. The Jetson Orin Nano Super edge computing platform used for model deployment.

Figure 17. Operational interface on the Jetson Orin Nano Super platform.

Figure 18. YOLO-ELS real-time detection result. (a) Original RGB image, (b) Detection result.

Table 1. The classification criteria of cherry tomatoes.

Maturity Stage	Description	Label
Unripe	The fruit is predominantly whitish-green in color with a glossy surface	Unripe
Color-turning	A yellowish halo begins to appear around the blossom end, and the red surface coverage reaches 10% to 60%	Half
Ripe	The fruit is plump, with red surface coverage exceeding 60%	Ripe
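The coverage thresholds in Table 1 can be written as a simple rule. The sketch below is only an illustration: the function name `maturity_label` is hypothetical, the input is assumed to be the red surface coverage as a fraction in [0, 1], and treating coverage below 10% as "Unripe" is an assumption, since the table describes that stage qualitatively. The qualitative cues in the table (surface gloss, yellowish halo) are not modeled.

```python
def maturity_label(red_coverage):
    """Map red surface coverage (fraction in [0, 1]) to a Table 1 label."""
    if not 0.0 <= red_coverage <= 1.0:
        raise ValueError("red_coverage must be a fraction between 0 and 1")
    if red_coverage > 0.60:      # "exceeding 60%"
        return "Ripe"
    if red_coverage >= 0.10:     # "10% to 60%"
        return "Half"
    return "Unripe"              # assumption: below 10% red coverage


labels = [maturity_label(c) for c in (0.05, 0.10, 0.60, 0.61)]
```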

Table 2. The experimental setup.

Parameter Name	Parameter Value
epochs	300
batch size	64
workers	32
image size	640 × 640
learning rate	0.01
momentum	0.937
weight decay	0.0005

Table 3. Comparison of detection results under different loss functions.

Model	Precision	Recall	mAP@50%
ELS + CIoU	0.930	0.806	0.922
ELS + SlideLoss	0.905	0.827	0.905
ELS + ShapeIoU	0.913	0.839	0.922
ELS + Inner CIoU	0.902	0.831	0.897
ELS + Inner PIoU	0.902	0.823	0.910
ELS + Inner GIoU	0.927	0.839	0.920

Table 4. Results of the ablation experiments.

No.	LSKA	EIEM	SEAM	Inner-GIoU	Precision	Recall	mAP@50%	F1	GFLOPs	Parameters
1	×	×	×	×	0.865	0.810	0.885	0.837	8.1	3,006,233
2	✓	×	×	×	0.882	0.833	0.896	0.857	8.3	3,279,129
3	×	✓	×	×	0.891	0.820	0.897	0.854	7.7	2,851,337
4	×	×	✓	×	0.867	0.845	0.895	0.856	7.0	2,818,073
5	✓	✓	×	×	0.899	0.836	0.900	0.866	7.9	3,124,233
6	✓	×	✓	×	0.877	0.818	0.903	0.846	7.2	3,090,969
7	×	✓	✓	×	0.910	0.861	0.913	0.885	6.7	2,663,177
8	✓	✓	✓	×	0.930	0.806	0.922	0.864	6.9	2,936,073
9	✓	✓	✓	✓	0.927	0.839	0.920	0.881	6.9	2,936,073

Note: ✓ indicates that the algorithm is used; × indicates that the algorithm is not used.
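The F1 column and the headline reductions can be reproduced from the values in Table 4. The short sketch below uses rows 1 (YOLOv8n baseline) and 9 (full YOLO-ELS); the helper names `f1_score` and `relative_reduction` are illustrative, not taken from the paper's code.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def relative_reduction(baseline, improved):
    """Fractional decrease of `improved` relative to `baseline`."""
    return (baseline - improved) / baseline


# Values copied from Table 4, rows 1 and 9.
baseline = {"precision": 0.865, "recall": 0.810, "gflops": 8.1, "params": 3_006_233}
yolo_els = {"precision": 0.927, "recall": 0.839, "gflops": 6.9, "params": 2_936_073}

f1_baseline = round(f1_score(baseline["precision"], baseline["recall"]), 3)
f1_els = round(f1_score(yolo_els["precision"], yolo_els["recall"]), 3)
gflops_drop_pct = round(100 * relative_reduction(baseline["gflops"], yolo_els["gflops"]), 1)
params_drop_pct = round(100 * relative_reduction(baseline["params"], yolo_els["params"]), 1)
```

These values agree with the reported F1 scores (0.837 and 0.881) and with the 14.8% GFLOPs and 2.3% parameter reductions.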

Table 5. Comparison of different object detection models.

Model	Precision	Recall	mAP@50%	Model Size/MB	GFLOPs	Parameters/M
DETR	-	0.707	0.850	478	60.7	41.555
TOOD	-	0.774	0.834	244	127.0	32.023
SSD	-	0.685	0.806	185	30.5	24.013
Faster R-CNN	-	0.762	0.873	315	138.0	41.358
YOLOv5s	0.925	0.818	0.889	13.7	15.8	7.028
YOLO11n	0.914	0.748	0.888	5.25	6.3	2.582
YOLOv8n	0.865	0.810	0.885	5.98	8.1	3.006
YOLO-ELS	0.927	0.839	0.920	5.91	6.9	2.936
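The mAP@50% gains quoted for YOLO-ELS are absolute differences, that is, percentage points rather than relative improvements. A minimal sketch using the Table 5 values (the dictionary and variable names are illustrative):

```python
# mAP@50% values copied from Table 5.
map50 = {
    "DETR": 0.850,
    "TOOD": 0.834,
    "SSD": 0.806,
    "Faster R-CNN": 0.873,
    "YOLOv5s": 0.889,
    "YOLO11n": 0.888,
    "YOLOv8n": 0.885,
}
yolo_els_map50 = 0.920

# Gain of YOLO-ELS over each model, in percentage points.
gains_pp = {name: round((yolo_els_map50 - value) * 100, 1)
            for name, value in map50.items()}
```

For example, the gain over the YOLOv8n baseline is 3.5 percentage points, which corresponds to a relative improvement of roughly 4%.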

Table 6. Comparison of model results on the public dataset.

Model	Precision	Recall	F1	mAP@50%	GFLOPs	Parameters/M
YOLOv8n	0.876	0.851	0.863	0.906	8.1	3.01
YOLOv10n	0.836	0.851	0.843	0.874	8.2	2.57
MTS-YOLO	0.903	0.871	0.887	0.920	6.8	2.05
YOLO-ELS	0.924	0.848	0.884	0.913	6.9	2.93


Share and Cite

MDPI and ACS Style

Tong, Z.; Zhou, Y.; Li, C.; Cai, C.; Rong, L. YOLO-ELS: A Lightweight Cherry Tomato Maturity Detection Algorithm. Appl. Sci. 2026, 16, 1043. https://doi.org/10.3390/app16021043

