1. Introduction
Tomatoes are among the most widely cultivated solanaceous vegetable crops in the world [1]; they are not only rich in nutritional value but also offer distinctive culinary and health benefits, and they are therefore widely planted in facility agriculture. With the growing trend towards large-scale greenhouse cultivation, traditional manual picking can no longer meet modern agriculture’s demand for high efficiency and low loss, making the intelligent upgrade of picking robots inevitable [2]. As greenhouse planting continues to expand, automated picking technology and accurate fruit detection have become key factors in improving production efficiency and reducing labour costs [3].
Fruit recognition technology is the key to realising automatic robotic picking. Improving the accuracy of target fruit recognition and reducing the rate of missed fruits are crucial for enhancing system stability and overall picking efficiency [4]. In the early stages of fruit recognition technology, traditional image processing methods based on visual features such as colour, texture, and shape were dominant, including the sliding-window technique, Histogram of Oriented Gradients (HOG) [5], Support Vector Machines (SVM) [6], and Non-Maximum Suppression (NMS) [7]. These methods can achieve some success under uniform illumination and simple backgrounds, but in natural greenhouse environments, when facing cluttered backgrounds, drastic illumination changes, and shading and overlap among fruits, they show obvious defects: insufficient robustness, weak generalisation ability, and reliance on manual parameter configuration. This makes it difficult to satisfy the dual demands of high real-time performance and high accuracy in practical agricultural production [8].
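To make the classical pipeline concrete, the following minimal sketch (illustrative, not from the original paper) combines HOG features, a linear SVM, and greedy NMS using scikit-image and scikit-learn; the window size, stride, score threshold, and the classifier `clf` (assumed already trained on labelled fruit/background patches) are all hypothetical.

```python
# Minimal sketch of the classical HOG + SVM + NMS pipeline described above.
# Illustrative only: window size, stride, and thresholds are hypothetical.
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def hog_features(patch):
    # 64x64 grayscale patch -> fixed-length HOG descriptor
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2))

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2, score) boxes."""
    ix = max(0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes, iou_thresh):
    """Greedy non-maximum suppression, keeping the highest-scoring boxes."""
    kept = []
    for b in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(b, k) < iou_thresh for k in kept):
            kept.append(b)
    return kept

def sliding_window_detect(image, clf, win=64, step=16, thresh=0.5):
    """Score every window with a trained LinearSVC; return surviving boxes."""
    boxes = []
    for y in range(0, image.shape[0] - win, step):
        for x in range(0, image.shape[1] - win, step):
            score = clf.decision_function(
                [hog_features(image[y:y + win, x:x + win])])[0]
            if score > thresh:
                boxes.append((x, y, x + win, y + win, score))
    return nms(boxes, iou_thresh=0.3)
```

Note how the window size, stride, and thresholds must all be tuned by hand, which is precisely the manual parameter configuration that limits these methods in complex greenhouse scenes.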
With the advancement of deep learning, object detection approaches based on Convolutional Neural Networks (CNNs) [9] have become a mainstream direction in fruit recognition. These methods can generally be divided into two categories [10]: two-stage detectors, such as the R-CNN series [11], Fast R-CNN [12], and Faster R-CNN [13]; and one-stage detectors, including SSD [14] and the YOLO family. The YOLO algorithm, introduced by Redmon et al., pioneered the single-stage detection paradigm by formulating object detection as a regression problem, significantly improving detection speed [15]. In the agricultural field, Sa et al. developed a sweet pepper detection system based on deep learning [16], and Bargoti et al. proposed CNN-based detection methods for apples and mangoes [17]. Long et al. combined CSPNet with the ResNet backbone of Mask R-CNN and proposed an improved tomato segmentation approach, which enhanced both the accuracy and speed of cherry tomato recognition [18].
The proliferation of UAVs and proximal sensing technologies has further accelerated the adoption of deep learning in precision agriculture. Kamilaris et al. conducted a comprehensive review of deep learning in agriculture, systematically analysing its technical paths in crop monitoring and pest and disease detection and charting the development of agricultural AI [19]. To address the challenge of limited annotated data, Goodfellow et al. introduced Generative Adversarial Networks (GANs), which generate high-quality synthetic samples through adversarial training, offering an innovative solution for agricultural dataset augmentation [20]. Building on this, Fawakherji et al. proposed an improved WGAN-GP model that produces spectrally consistent weed images in both RGB and IR domains using a stable gradient penalty, enabling automated generation of high-quality agricultural imagery [21]. Rana et al. developed an efficient annotation pipeline by integrating YOLOv8 with the Segment Anything Model (SAM), automating segmentation and detection in cauliflower monitoring tasks [22]. SAM, introduced by Kirillov et al., provides powerful zero-shot segmentation capability, offering strong support for automatic annotation in complex agricultural environments [23]. In spectral image analysis, fuzzy classification algorithms have demonstrated superior robustness where spectral features overlap: Mehrotra et al. applied a Modified Possibilistic C-Means (MPCM) soft classification strategy combined with vegetation indices to effectively address spectral overlap and heterogeneity in agricultural images [24]. Furthermore, Rana et al. highlighted the challenges posed by spatial–spectral misalignment in multispectral imaging; their registration error analysis showed that even slight misalignments can significantly degrade CNN detector performance, underscoring precise spectral registration as a critical preprocessing step [25].
Although the YOLO series performs exceptionally well in terms of speed, standard models still suffer from parameter redundancy and large model sizes, making deployment on resource-constrained embedded devices challenging. To address this, researchers have introduced various lightweight architectures. Howard et al. developed the MobileNets family, which dramatically reduces model complexity by factorising standard convolutions into depthwise and pointwise operations [26]. Likewise, Zhang et al. proposed the ShuffleNet architecture, which employs channel shuffling and group convolution to reduce computation without sacrificing accuracy [27]. Sun et al. proposed YOLOv5-PRE, a lightweight apple detection framework that integrates ShuffleNet, GhostNet, and attention mechanisms (CA, CBAM) to achieve rapid yield estimation in complex orchards [28]. Li et al. designed a detection algorithm based on lightweight convolutional modules and attention mechanisms, enabling accurate 3D localisation of tea shoots with RGB-D sensors [29]. Zhang et al. developed the EPSA-YOLO-V5s model to address the challenges of outdoor field environments, achieving accurate estimation of rapeseed survival rates [30]. Zhao et al. introduced a lightweight version of YOLOv5s optimised for greenhouse conditions to enable efficient tomato detection [31]. Qiu et al. proposed an enhanced YOLOv8n architecture that combines partial convolution and knowledge distillation to balance accuracy and efficiency for automated mulberry harvesting [32]. However, lightweight designs often compromise detection performance, especially in greenhouse environments with dense fruit occlusion and drastic lighting variations, where false positives and missed detections are common [33].
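To illustrate the depthwise separable factorisation behind MobileNets [26], the sketch below contrasts a standard 3 × 3 convolution with its depthwise-plus-pointwise equivalent in PyTorch; the channel sizes are arbitrary examples rather than any model’s actual configuration.

```python
import torch.nn as nn

c_in, c_out, k = 32, 64, 3

# Standard 3x3 convolution: c_in * c_out * k * k weights.
standard = nn.Conv2d(c_in, c_out, k, padding=1, bias=False)

# Depthwise separable factorisation (MobileNets-style): a per-channel
# 3x3 depthwise convolution followed by a 1x1 pointwise convolution.
separable = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, padding=1, groups=c_in, bias=False),  # depthwise
    nn.Conv2d(c_in, c_out, 1, bias=False),                         # pointwise
)

n_std = sum(p.numel() for p in standard.parameters())   # 32*64*9  = 18432
n_sep = sum(p.numel() for p in separable.parameters())  # 288+2048 = 2336
print(n_std, n_sep, n_std / n_sep)  # roughly 7.9x fewer weights
```

For this 32-to-64-channel layer the factorisation cuts the weight count by roughly a factor of eight, which is the source of the complexity reductions these lightweight backbones report.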
In summary, although substantial progress has been made in fruit recognition, several challenges remain. Occlusion and clustering of fruits in complex environments still limit detection accuracy. Technical issues such as spectral feature overlap and misalignment in multispectral imagery further exacerbate detection difficulty. While existing models have undergone lightweight optimisation, their computational demands and storage footprints remain too high for real-time deployment on low-power embedded platforms. Therefore, achieving high accuracy while further reducing model size and improving deployment efficiency remains a central challenge in agricultural visual recognition.
To address these issues, this study proposes a lightweight tomato fruit detection model named ACLW-YOLO, based on an improved YOLOv11n framework. The model retains high detection accuracy while significantly reducing parameters and accelerating inference through architectural optimisation. Specifically, the traditional downsampling layers are replaced with ADown modules; the C3k2-gConv structure is employed to reduce computational load; a lightweight shared convolutional detection head (LSCD) replaces the original detection head; and the Wise-PIoU loss function is adopted to enhance bounding box regression. This solution enables efficient and accurate tomato fruit detection, demonstrating strong potential for embedded deployment and real-world agricultural applications.
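For concreteness, the sketch below shows an ADown-style downsampling block in PyTorch, following the dual-path design popularised by YOLOv9 (one channel half passes through a stride-2 3 × 3 convolution, the other through a max-pooling path, and the results are concatenated). The ConvBNSiLU helper and all sizes are illustrative assumptions and may differ from the exact ACLW-YOLO configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvBNSiLU(nn.Module):
    """Illustrative conv + batch norm + SiLU helper (assumed, not from the paper)."""
    def __init__(self, c_in, c_out, k, s, p):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ADown(nn.Module):
    """ADown-style dual-path downsampling, after the YOLOv9 design."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in // 2, self.c, 3, 2, 1)  # strided-conv path
        self.cv2 = ConvBNSiLU(c_in // 2, self.c, 1, 1, 0)  # max-pool path

    def forward(self, x):
        x = F.avg_pool2d(x, 2, 1, 0)        # soften features before the split
        x1, x2 = x.chunk(2, dim=1)          # split channels into two halves
        x1 = self.cv1(x1)                   # half 1: 3x3 conv, stride 2
        x2 = F.max_pool2d(x2, 3, 2, 1)      # half 2: 3x3 max pool, stride 2
        x2 = self.cv2(x2)
        return torch.cat((x1, x2), dim=1)   # fuse the two downsampled paths

# Halves spatial resolution: (1, 64, 80, 80) -> (1, 128, 40, 40)
y = ADown(64, 128)(torch.randn(1, 64, 80, 80))
```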
3. Results and Discussion
3.1. Effectiveness of Loss Function Optimisation
To gain deeper insight into the impact of loss functions on network performance, this study conducted a series of systematic comparative experiments. Several mainstream bounding box regression loss functions were selected for analysis, including CIoU, PIoU, SIoU, WIoU, Shape-IoU, and Wise-PIoU. The performance comparison of different loss functions is presented in Table 3.
The experimental results indicate that Wise-PIoU outperforms all other loss functions in terms of both precision and recall, demonstrating a clear advantage. Although its mAP50 is 0.1 percentage points lower than that of CIoU, this minor difference is negligible in practical applications. For greenhouse tomato detection, high precision helps reduce false positives, while high recall mitigates missed detections—characteristics that are more practically valuable than marginal gains in the mAP metric. Wise-PIoU features fast convergence and a dynamic focusing mechanism. Through an adaptive weight allocation strategy, it effectively suppresses harmful gradients from low-quality anchors while enhancing the learning capacity of medium-quality anchors, thereby improving the model’s detection accuracy. Considering overall performance, Wise-PIoU is ultimately adopted as the loss function in this study.
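As a rough illustration of the dynamic focusing mechanism described above, the sketch below applies a Wise-IoU-v3-style non-monotonic focusing weight to a plain IoU loss in PyTorch. This is a simplified stand-in: the paper’s actual Wise-PIoU additionally incorporates the PIoU penalty term, and the alpha, delta, and momentum hyperparameters here are assumed values.

```python
import torch

def bbox_iou(box1, box2, eps=1e-7):
    """IoU for (x1, y1, x2, y2) boxes of shape (N, 4)."""
    x1 = torch.max(box1[:, 0], box2[:, 0])
    y1 = torch.max(box1[:, 1], box2[:, 1])
    x2 = torch.min(box1[:, 2], box2[:, 2])
    y2 = torch.min(box1[:, 3], box2[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    a1 = (box1[:, 2] - box1[:, 0]) * (box1[:, 3] - box1[:, 1])
    a2 = (box2[:, 2] - box2[:, 0]) * (box2[:, 3] - box2[:, 1])
    return inter / (a1 + a2 - inter + eps)

class DynamicFocusIoULoss:
    """IoU regression loss with a Wise-IoU-v3-style non-monotonic focusing
    weight: anchors whose loss sits far from the running mean (typically
    low-quality boxes) receive down-weighted gradients, while medium-quality
    anchors are emphasised. Hyperparameters are assumed values."""
    def __init__(self, alpha=1.9, delta=3.0, momentum=0.01):
        self.alpha, self.delta, self.momentum = alpha, delta, momentum
        self.mean_loss = 1.0  # running mean of the plain IoU loss

    def __call__(self, pred, target):
        loss_iou = 1.0 - bbox_iou(pred, target)
        # Outlier degree: each anchor's loss relative to the running mean,
        # detached so the weighting itself does not backpropagate.
        beta = loss_iou.detach() / max(self.mean_loss, 1e-7)
        self.mean_loss = ((1 - self.momentum) * self.mean_loss
                          + self.momentum * loss_iou.mean().item())
        # Non-monotonic focusing coefficient: peaks for medium-quality
        # anchors, suppressing harmful gradients from very poor ones.
        r = beta / (self.delta * self.alpha ** (beta - self.delta))
        return (r * loss_iou).mean()

# Usage example with a single predicted/target box pair.
loss_fn = DynamicFocusIoULoss()
pred = torch.tensor([[10., 10., 50., 50.]], requires_grad=True)
tgt = torch.tensor([[12., 12., 48., 52.]])
loss_fn(pred, tgt).backward()
```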
3.2. Ablation Experiments
To systematically validate the effectiveness of each proposed optimization strategy, a comprehensive ablation study was conducted on the YOLOv11n model using the self-constructed tomato dataset under a unified training environment. Following the logical order of network structure optimisation, four key improvements were introduced sequentially: the ADown downsampling module, the C3k2-gConv feature extraction structure, the LSCD lightweight detection head, and the Wise-PIoU loss function. A hybrid experimental design was adopted, combining isolated single-module validation with cumulative integration. First, the performance contribution of each module was quantified independently; then, modules were progressively combined to assess their synergistic effects. The impact of each component on detection accuracy and inference efficiency was comprehensively evaluated. The ablation results are summarised in Table 4.
Experiment 1 corresponds to the baseline YOLOv11n model without any modifications. Although it achieves relatively high accuracy in tomato detection, its large model size and computational overhead render it unsuitable for embedded deployment. From the single-module analysis, the ADown module (Experiment 2) significantly reduces model complexity while maintaining detection accuracy: compared with the baseline, precision increased by 1.1 percentage points to 93.2%, while recall and mAP50 decreased slightly; meanwhile, model size decreased by 17.3%, the parameter count dropped by 19.2%, and FLOPs were reduced by 15.9%. These results indicate that ADown enhances model compactness by optimising the downsampling strategy through a multi-path fusion mechanism. The C3k2-gConv module (Experiment 3) primarily improves detection precision: compared with the baseline, precision increased by 1.8 percentage points to 93.9%, while model size and parameter count were reduced by 11.5% and 15.4%, respectively. This highlights the advantage of the gated convolution mechanism in balancing accuracy and lightweight design; by adopting a dual-branch structure and adaptive feature selection, the module enhances feature representation capacity. The LSCD head (Experiment 4) demonstrates strong lightweighting capability, reducing model size by 5.8% and FLOPs by 11.1% while maintaining stable, slightly improved detection performance, which suggests that the lightweight shared convolutional detection head balances efficiency and accuracy through cross-scale feature fusion. The Wise-PIoU loss function (Experiment 5) mainly improves regression quality, boosting precision by 1.0 percentage points and recall by 0.5 percentage points, confirming the effectiveness of dynamic weight adjustment in optimising bounding box regression.
In Experiment 6, integrating the ADown and C3k2-gConv modules reduces model size by 28.8%, parameter count by 30.8%, and FLOPs by 27.0% compared to the baseline, demonstrating positive synergy between the two modules in lightweight optimisation. Experiment 7 adds the LSCD head to this configuration, yielding notable improvements in detection performance: precision increases to 93.6%, recall to 91.8%, and mAP50 reaches the highest value of 95.3%, while model size is further reduced to 3.3 MB and FLOPs to 3.9 G. This indicates that LSCD not only preserves the lightweight advantage but also recovers the recall drop observed in Experiment 6, highlighting the strength of the decoupled detection head in multi-scale feature integration. Experiment 8 constitutes the final ACLW-YOLO model, in which the Wise-PIoU loss function is applied on top of the configuration from Experiment 7 to further optimise bounding box regression. Compared to Experiment 7, precision increases to 94.2% and recall to 92.0%, while mAP50 remains high at 95.2%. Taking all indicators into account, this configuration achieves the best balance between accuracy and efficiency. These results comprehensively validate both the individual effectiveness and the synergistic benefits of the proposed improvement modules.
3.3. Performance Analysis of Tomato Maturity Detection Based on ACLW-YOLO
To comprehensively evaluate the improvements of the ACLW-YOLO model over the YOLOv11n baseline and to analyse its detection performance across different tomato maturity stages, a comparative analysis was conducted for each stage using precision, recall, and mAP50 as evaluation metrics.
Table 5 presents the detection performance of the models before and after optimisation for tomatoes at different ripeness levels.
As shown in Table 5, ACLW-YOLO achieves consistent improvements in overall detection performance. Overall precision increases from 92.1% to 94.2%, a gain of 2.1 percentage points; recall improves from 91.7% to 92.0% and mAP50 rises from 94.9% to 95.2%, each up 0.3 percentage points. In terms of class-specific performance, precision for unripe tomatoes increases by 2.4 percentage points to 94.2%, indicating stronger discriminative capability for green tomatoes and a notable reduction in false positives. Ripe tomatoes also improve markedly, with precision rising by 1.8 percentage points to 94.2%, recall increasing by 0.7 percentage points to 95.6%, and mAP50 climbing by 1.1 percentage points to 97.9%. These results highlight the superior detection performance of the ACLW-YOLO model in the context of tomato maturity classification.
To further investigate the actual detection performance of the ACLW-YOLO model across different tomato categories, an error analysis was conducted based on the confusion matrix. The confusion matrix of ACLW-YOLO is presented in Figure 7.
As shown in Figure 7, the ACLW-YOLO model achieves the best recognition performance on the mature tomato category, followed by the immature tomato category. A detailed analysis of misclassified samples reveals that the primary source of errors is interference from similar background colours, particularly the high colour similarity between green tomatoes and surrounding leaves, which leads to frequent missed detections of immature tomatoes. In the complex greenhouse environment, varying lighting conditions and occlusion further exacerbate the difficulty of distinguishing features.
3.4. Comparison Experiments of Different Algorithms
To verify the validity and superiority of the ACLW-YOLO model, we conducted comparative experiments with various classical models on the same greenhouse tomato dataset. The experimental results are shown in Table 6.
As can be seen from Table 6, YOLOv11n maintains a high recall of 91.7%, but its memory footprint and parameter count are large. YOLOv3-tiny produces the least satisfactory results, and at 23.3 MB its model size is the largest among all algorithms. YOLOv5 performs unremarkably, with a recall of only 90.2%. YOLOv8n is also unsatisfactory: its accuracy is comparatively low, yet its computational cost reaches 8.1 G. YOLOv9t performs better, with an mAP50 on par with the improved model, but its 7.6 G of computation is unfavourable for lightweight deployment. YOLOv10n and YOLOv12 perform well in precision, recall, and mean average precision, yet their model complexity remains comparatively high. In contrast, ACLW-YOLO achieves the best detection performance, 94.2% precision, 92.0% recall, and 95.2% mAP50, with the lowest resource consumption (3.3 MB, 1.6 M parameters, and 3.9 GFLOPs), offering the best combination of detection performance and deployment efficiency and making it well suited to greenhouse tomato detection.
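For reference, complexity figures of the kind reported in Table 6 are typically measured as in the sketch below: parameter counts with plain PyTorch and multiply-accumulates with the third-party thop profiler (pip install thop). The small network here is a placeholder standing in for a loaded detector, not the paper’s model.

```python
import torch
import torch.nn as nn
from thop import profile  # third-party FLOPs/MACs profiler

def count_parameters(model: nn.Module) -> float:
    """Trainable parameters in millions (the parameter column of Table 6)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# Placeholder network standing in for a detector; in practice the loaded
# ACLW-YOLO or YOLOv11n model would be passed instead.
net = nn.Sequential(
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.SiLU(),
    nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.SiLU(),
)

dummy = torch.randn(1, 3, 640, 640)  # standard 640x640 detector input
macs, _ = profile(net, inputs=(dummy,))
print(f"{count_parameters(net):.3f} M params, {macs / 1e9:.2f} GMACs")
# thop reports multiply-accumulates; FLOPs ~= 2 * MACs by convention,
# though YOLO tooling often quotes GFLOPs directly.
```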
3.5. Generalisation Experiments on Different Datasets
To verify the generalisation ability and cross-domain adaptability of the proposed algorithm, both the baseline YOLOv11n and the improved ACLW-YOLO were evaluated through comparative experiments on three datasets: a private tomato dataset, the public Laboro Tomato dataset, and the tomatOD dataset. These datasets collectively cover complex greenhouse environments, large-scale standard samples, and multi-variety expert-labelled samples, enabling a comprehensive assessment of the model’s cross-domain performance.
As shown in Table 7, ACLW-YOLO demonstrates outstanding detection performance across different data distributions. On the private dataset, the model achieves a precision of 94.20%, a 2.09 percentage point improvement over YOLOv11n, with an mAP50 of 95.24%, up 0.37 percentage points. On the Laboro Tomato dataset, ACLW-YOLO reaches a precision of 90.82%, improving by 2.14 percentage points, and an mAP50 of 90.54%, comparable to the baseline’s 90.62%. On the tomatOD dataset, the model attains a precision of 91.96% and an mAP50 of 96.23%, up 0.34 percentage points. In terms of model lightweighting, ACLW-YOLO reduces the parameter count to 1.61 M, a 37.6% decrease compared to YOLOv11n, and compresses the model size to 3.3–3.4 MB across the datasets, further enhancing its deployment feasibility.
These experimental results demonstrate that ACLW-YOLO maintains high detection accuracy while significantly reducing computational complexity and storage requirements, making it well-suited for deployment on resource-constrained edge devices. More importantly, the model maintains stable performance across three datasets with distinct data distributions and scene complexities, confirming its strong cross-domain generalisation capability. This robustness is particularly valuable for adapting to the diversity and complexity of real-world greenhouse environments.
3.6. Comparison of Detection Effects
To more intuitively evaluate the adaptability and robustness of the ACLW-YOLO model in real greenhouse environments, this study selects several typical greenhouse tomato growth scenarios (multiple fruits, occlusion, dim environments, and similar backgrounds) for visual comparison experiments between the ACLW-YOLO and YOLOv11n models. Red rectangles with the “ripe” label indicate recognised ripe tomatoes, and blue rectangles with the “unripe” label indicate recognised unripe tomatoes. The visualisation results are shown in Figure 8.
In the multi-fruit scenario, both models detect the targets accurately, but the improved model yields higher confidence scores and recognises multi-fruit targets better: it handles the mutual occlusion between fruits and delineates the boundary of each tomato precisely, avoiding adhesion-induced misdetections between adjacent fruits. In the occlusion scenario, neither model misses targets outright, but YOLOv11n recognises green tomato fruits poorly because their colour is similar to that of the occluding leaves; the improved model, by contrast, exploits the visible portion of each fruit together with its local features and its spatial relationships to other fruits, allowing occluded fruits to be recognised. In the dim-background scenario, YOLOv11n exhibits missed detections and fails to recognise green fruits under insufficient light, whereas the improved model still captures the key fruit features, demonstrating strong feature extraction capability. In the similar-background scenario, YOLOv11n misses all targets, while the improved model accurately recognises the three main targets in the foreground as well as the small green tomato in the distance; despite the low contrast between fruit and background, the model locates the fruits effectively through the rich feature representations it has learned. Therefore, to avoid the missed and false detections that would cause operational errors for picking robots in actual production, the ACLW-YOLO model, with its higher overall performance, is selected for fruit recognition and detection to improve picking efficiency and meet the task requirements.
3.7. Jetson Platform Model Test
To verify the deployment feasibility of the proposed ACLW-YOLO model in a resource-constrained environment, we conducted a performance evaluation on the NVIDIA Jetson Orin Nano (4 GB) edge computing platform. As shown in Table 8, the model shows good practicality on this edge device.
Specifically, the model’s single-frame processing latency on the Jetson Orin Nano is 84.4 ms. Although this is higher than the 10.8 ms measured on the high-performance computing platform, it still corresponds to a frame rate of 11.85 FPS, meeting the basic requirement for real-time detection. Notably, this performance is achieved at only 10 W of power consumption, a substantial improvement over the roughly 140 W drawn by the desktop platform. It is worth noting that these results reflect the model’s baseline performance without advanced inference optimisations such as TensorRT; integrating such optimisations in actual deployment is expected to further improve throughput and practicality. In summary, these results demonstrate the technical advantages and deployment potential of the ACLW-YOLO model in edge AI applications.
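The latency and FPS figures are related by FPS = 1000 / latency_ms (1000 / 84.4 ≈ 11.85). A minimal timing loop of the kind typically used for such measurements is sketched below; the warm-up count, iteration count, and placeholder model are assumptions, and GPU timing requires explicit synchronisation.

```python
import time
import torch
import torch.nn as nn

@torch.no_grad()
def benchmark(model: nn.Module, device: str, warmup: int = 20, iters: int = 200):
    """Average single-frame latency (ms) and FPS for a 640x640 input."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, 640, 640, device=device)
    for _ in range(warmup):           # warm-up: stabilise clocks and caches
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()      # ensure queued GPU work has finished
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    latency_ms = (time.perf_counter() - start) * 1000 / iters
    return latency_ms, 1000.0 / latency_ms

# Placeholder model; in deployment the exported ACLW-YOLO weights are used.
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.SiLU())
ms, fps = benchmark(net, "cuda" if torch.cuda.is_available() else "cpu")
print(f"{ms:.1f} ms/frame -> {fps:.2f} FPS")
```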
4. Conclusions
Aiming at the large parameter volume and high structural complexity of tomato fruit detection models, which make deployment on picking and inspection equipment difficult, this paper proposes a lightweight greenhouse tomato fruit target detection algorithm (ACLW-YOLO) based on an improved YOLOv11n. The main contributions are as follows: first, the downsampling process is improved with the ADown module, and gConv is introduced into the original C3k2 structure, reducing the model’s parameter count and computation; secondly, the LSCD mechanism is adopted in the detection head, improving detection accuracy while reducing model complexity; finally, the Wise-PIoU loss function is introduced, effectively improving the model’s performance on the greenhouse tomato detection task.
Experimental results show that, compared with the original YOLOv11n, ACLW-YOLO reduces the number of parameters by 38.7%, decreases computational cost by 39.1%, and compresses the model size by 37.5%, while achieving a precision of 94.2%, a recall of 92.0%, and a mean average precision (mAP50) of 95.2%. Cross-dataset testing further verifies its strong generalisation capability. Deployed on the NVIDIA Jetson Orin Nano edge device, the model reaches a real-time inference speed of 11.85 FPS at only 10 W of power consumption, demonstrating excellent performance on resource-constrained platforms.
This study effectively addresses the long-standing trade-off between model lightweighting and high detection accuracy in the context of greenhouse tomato detection. The proposed model achieves high-precision detection, low resource consumption, compact size, and real-time edge inference capability, fulfilling the core requirements of tomato fruit detection in automated greenhouse harvesting systems and providing a solid technical foundation for practical deployment.
Despite the notable improvements, several challenges remain. Under extreme conditions, such as strong glare, severe fruit occlusion, or backgrounds with high visual similarity, detection accuracy may decline. Moreover, the current deployment verification is limited to a single hardware platform and lacks systematic evaluation across heterogeneous edge devices. To address these limitations, future research will focus on two main directions: (1) designing more efficient loss functions to enhance model robustness under challenging scenarios, and (2) expanding compatibility testing across diverse hardware platforms to ensure broad applicability and stable performance in real-world deployment environments. Additionally, future work will include comparative studies with more state-of-the-art models to further validate the feasibility of ACLW-YOLO, and efforts will be made to establish a standardised evaluation benchmark to promote more objective and comprehensive methodological comparisons in this domain.