1. Introduction
As one of the three major traded vegetables worldwide, tomatoes are widely cultivated owing to their rich nutritional value [1]. China is the world's largest producer of this crop in terms of both output and planting area. The tomato industry is of great significance in securing livelihoods, increasing farmers' income, and promoting agricultural exports [2]. Tomato cultivation methods can be classified into open-field and greenhouse production. Greenhouse cultivation effectively reduces the impact of disease and adverse climates and offers clear advantages in improving the yield, quality, and economic returns of tomatoes [3,4].
However, in the closed environment of greenhouses, tomatoes, as hermaphroditic crops, rely on effective pollination for normal fruit setting. In contrast to open fields, closed greenhouses lack sufficient wind and natural pollination media, such as insects, so flowers often miss the optimal pollination window, which reduces the fruit-setting rate and final yield [5,6]. Although artificial pollination can be used as a remedial measure, it has inherent drawbacks, such as high labor intensity, low efficiency, and high operational uncertainty, which can easily cause uneven pollination, leading to flower drop, fruit drop, and deformed fruit and reducing the commercial value and economic returns of the crop [7]. Therefore, developing a robotic system that can autonomously and accurately identify flowers and perform pollination has become a key technical direction for improving the production efficiency of greenhouse tomatoes and promoting the intelligent transformation of agricultural facilities [8,9,10].
In recent years, computer vision technologies, particularly deep learning–based approaches, have achieved significant progress in agricultural object detection and have been increasingly applied to fruit flower recognition and counting. Early studies mainly relied on traditional image processing techniques, such as flower detection based on color-space analysis and threshold segmentation; researchers extracted flower regions using Gaussian filtering combined with color feature analysis, HSL color-space thresholding, and RGB channel ratios, respectively [11,12,13]. However, these methods are highly sensitive to illumination variations, background complexity, and flower shape diversity, resulting in limited robustness in complex field environments.
With the rapid development of deep learning, convolutional neural network (CNN)–based detection and segmentation methods have gradually become the mainstream for agricultural flower detection [14,15,16,17]. For example, Jaju and Chandak [18] employed ResNet-50 combined with transfer learning to achieve multi-class flower recognition. Lin et al. [19] integrated an improved VGG19 backbone with Faster R-CNN for strawberry flower detection. Farjon et al. [20] proposed a Faster R-CNN–based apple flower detection approach to support precision thinning decisions. In addition, Dias et al. [21] and Sun et al. [22] utilized DeepLab-based models to achieve fine-grained segmentation of apple, peach, and pear flowers, while Tian et al. [23] and Mu et al. [24] further enhanced detection and localization performance in complex inflorescence scenarios by incorporating architectures such as U-Net and Mask R-CNN.
Although these methods have demonstrated promising performance under specific conditions, they generally suffer from large model sizes and slow inference, making them unsuitable for real-time, high-precision flower and stamen detection in greenhouse environments. In intelligent agricultural systems—particularly automated pollination applications—flower detection is a critical prerequisite for accurate localization and precise operation, and both detection accuracy and processing speed directly affect system performance. Owing to its end-to-end unified architecture, high inference efficiency, and favorable balance between speed and accuracy, the YOLO series has been widely adopted for flower recognition tasks, which has significantly promoted the development of real-time detection solutions in agricultural scenarios. For example, Lyu et al. [25] proposed the YOLO-HPFD model using a multi-teacher knowledge distillation strategy, achieving an mAP of 94.21% for litchi flower detection in complex backgrounds. Ren et al. [26] developed the FPG-YOLO model, which improved the detection accuracy of pear flowers by optimizing the network structure and loss function. Bai et al. [27] introduced an improved YOLOv7-based model for strawberry flower detection, enhancing multiscale feature fusion and global feature perception. Wang et al. [28] applied YOLOv8 to chili pepper flower detection, achieving high accuracy and recall while maintaining real-time performance. Li et al. [29] proposed a lightweight YOLOv4-based model for kiwi flower bud detection, obtaining an mAP of 97.61% with high inference efficiency. Xu et al. [30] improved tomato flower detection by integrating attention mechanisms and multi-angle feature fusion, while Yuying et al. [31] enhanced YOLOv5s to address illumination and occlusion challenges in apple flower detection, achieving a detection accuracy of 97.2%.
Although YOLO-based target detection methods have made progress in tomato flower recognition, they still face significant challenges in the complex environments of greenhouses. Factors such as foliage shading, overlapping flowers and fruits, variable lighting, and similar morphology among flowers seriously interfere with detection accuracy. Furthermore, existing models generally suffer from complex structures, redundant parameters, and high computational costs, making it difficult to meet the application requirements of high accuracy and real-time performance.
To address the aforementioned challenges, this study proposes a lightweight detection model for greenhouse tomato pollination, termed DSS-YOLO. Taking YOLOv11n as the baseline, the proposed model is developed with the core objective of achieving a favorable balance between lightweight design and accuracy preservation. By collaboratively restructuring backbone feature modeling, downsampling information flow, and bounding box regression optimization, DSS-YOLO realizes a structure-level lightweight detection framework tailored for small-target scenarios. The proposed improvements are summarized as follows:
- (1)
Lightweight backbone network design:
A novel backbone network, DWHGNetv2, is constructed by deeply integrating the efficient architecture of HGNetv2 with depthwise separable convolutions. This design enhances the model’s capability to extract multi-scale and fine-grained features of tomato flowers while significantly reducing the number of parameters and computational complexity, thereby enabling deployment on mobile or embedded devices (a minimal sketch of a depthwise separable convolution block is given after this list).
- (2)
Efficient downsampling mechanism optimization:
The SCDown module is introduced to replace conventional standard convolution-based downsampling layers. By decoupling channel transformation and spatial compression, SCDown effectively performs downsampling while preserving critical fine-grained information essential for small-target detection, thereby improving feature fidelity and computational efficiency.
- (3)
Small-target-oriented loss function improvement:
The SIoU loss function is adopted to optimize bounding box regression. By explicitly incorporating the vector angle relationship between the predicted and ground-truth bounding boxes, SIoU redefines the regression penalty, enabling more accurate localization of direction-sensitive and small-scale targets, such as tomato stamens.
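To make the first improvement concrete, the following is a minimal PyTorch sketch of a depthwise separable convolution block of the kind used to build DWHGNetv2; the channel sizes, normalization, and activation shown here are illustrative assumptions rather than the exact implementation used in this study.

```python
import torch
import torch.nn as nn

class DWSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise) k x k conv
    followed by a 1x1 pointwise conv that mixes channels. Compared with a
    standard k x k conv, the parameter count drops roughly from
    k*k*Cin*Cout to k*k*Cin + Cin*Cout."""
    def __init__(self, c_in, c_out, k=3, s=1):
        super().__init__()
        self.depthwise = nn.Conv2d(c_in, c_in, k, s, padding=k // 2,
                                   groups=c_in, bias=False)   # spatial filtering per channel
        self.pointwise = nn.Conv2d(c_in, c_out, 1, bias=False)  # cross-channel mixing
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

# Quick shape check
x = torch.randn(1, 64, 80, 80)
print(DWSeparableConv(64, 128)(x).shape)  # torch.Size([1, 128, 80, 80])
```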
3. Results
3.1. Ablation Experiment
To verify the effectiveness of the proposed improvements for tomato flower pollination detection, YOLOv11n was selected as the baseline model, and a series of ablation experiments were conducted by incrementally incorporating different improvement components into the network. The comparative results of these ablation experiments are summarized in Table 3. In addition, Figure 8 illustrates the corresponding parameter counts and GFLOPs for each model configuration.
When only DWHGNetv2 (A) is introduced, the number of model parameters is reduced from 2.5 M to 1.8 M, and the FLOPs decrease from 6.3 G to 4.2 G (as illustrated in Figure 8), indicating a significant reduction in model complexity. However, both Recall and mAP@0.5 decline slightly, suggesting that replacing the backbone alone compresses the model but contributes little to detection performance. Specifically, DWConv decomposes a standard convolution into depthwise and pointwise convolutions, which substantially reduces the number of parameters and computational cost; at the same time, it constrains model capacity at the structural level and reduces redundant inter-channel coupling, which helps mitigate overfitting in scenarios with limited agricultural data and thus improves generalization on the test set. When only SCDown (B) is introduced, the model achieves an improvement in Precision while maintaining a relatively low parameter count, indicating that SCDown effectively preserves key information during downsampling and has a positive impact on detection performance.
When only the SIoU loss function (C) is adopted, both Precision and mAP@0.5 are improved compared with the baseline, with mAP@0.5 reaching 95.3%. This result verifies the advantage of SIoU in terms of bounding box regression accuracy and training stability.
In the combined experiments, A + B significantly reduces the model size while increasing mAP@0.5 to 95.4%, with model weights of only 3.4 MB, demonstrating the strong complementarity between DWHGNetv2 and SCDown. The A + C and B + C combinations further improve detection accuracy, among which B + C shows particularly notable gains in Precision and mAP@0.5.
When all three improvements are simultaneously introduced (A + B + C), the model achieves the best overall performance, with Precision, Recall, and mAP@0.5 reaching 94.1%, 94.3%, and 95.9%, respectively. Meanwhile, the model weight and parameter count are reduced to 3.4 MB and 1.6 M, and the FLOPs are only 4.1 G. These results demonstrate that the proposed modules exhibit strong synergistic effects in enhancing detection accuracy while reducing computational complexity. Consequently, the final DSS-YOLO model achieves an optimal balance between performance and lightweight design.
3.2. Comparative Experiment of Mainstream Target Detection Algorithms
To comprehensively evaluate the performance advantages of the proposed DSS-YOLO model, comparative experiments were conducted on a unified hardware platform and software environment. Several mainstream lightweight object detection algorithms were evaluated, while larger variants were included for reference only. The compared methods include Faster R-CNN [19,20], ShuffleNetV2 [43], MobileNetV4 [36], YOLOv3-tiny, YOLOv5n, YOLOv6n, YOLOv8n, YOLOv9-tiny, YOLOv10n, and YOLOv11n, as well as the proposed DSS-YOLO model. Inference speed (FPS) was measured on the GPU to ensure fairness and consistency in speed evaluation across different models. The experimental results are presented in Table 4.
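For context, the snippet below illustrates one common way to measure GPU inference FPS, with a warm-up phase and explicit CUDA synchronization so that asynchronous kernel launches do not distort the timing; batch size 1, a 640×640 input, and the iteration counts are assumptions rather than the exact protocol used in this study.

```python
import time
import torch

@torch.no_grad()
def measure_fps(model, imgsz=640, warmup=20, iters=200, device="cuda"):
    """Rough GPU throughput estimate for single-image forward passes."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, imgsz, imgsz, device=device)
    for _ in range(warmup):          # warm-up: stabilize clocks and CUDA caches
        model(x)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()         # wait for queued kernels before stopping the clock
    return iters / (time.perf_counter() - start)
```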
As shown in Table 4, the traditional two-stage detector Faster R-CNN achieves relatively high detection accuracy; however, it suffers from an excessively large model size (521.5 MB), as well as an extremely high parameter count and computational complexity. This large model size is mainly attributed to its high number of parameters stored in full-precision format, which leads to substantially increased memory consumption. Consequently, its inference speed is limited to only 10 FPS, making it unsuitable for real-time tomato flower detection in greenhouse environments.
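As a rough sanity check of this size figure (a back-of-the-envelope estimate assuming 32-bit floating-point weights and ignoring file metadata):

$$
\text{weight file size} \approx N_{\text{params}} \times 4~\text{bytes (FP32)}, \qquad
\frac{521.5 \times 1024^{2}~\text{B}}{4~\text{B/param}} \approx 1.4 \times 10^{8}~\text{parameters},
$$

i.e., roughly two orders of magnitude more parameters than the 1.6 M of DSS-YOLO.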
Compared with Faster R-CNN, lightweight backbone-based models such as ShuffleNetV2 and MobileNetV4 significantly reduce model size, parameters, and FLOPs. However, this reduction in complexity comes at the cost of limited detection performance, particularly in terms of precision and recall, which restricts their applicability in complex greenhouse scenes with occlusion and small targets. YOLOv3-tiny offers moderate inference speed, but its mAP@0.5 is noticeably lower than that of more recent YOLO-based detectors, resulting in inferior overall performance.
Among the one-stage YOLO-series models, YOLOv5n, YOLOv6n, YOLOv8n, YOLOv9-tiny, and YOLOv10n demonstrate a more favorable balance between detection accuracy and computational efficiency. These models achieve mAP@0.5 values above 94.6% while maintaining real-time inference speeds ranging from 31 to 53 FPS. Nevertheless, their parameter counts and computational costs remain higher than those of ultra-lightweight architectures, and further improvements in accuracy are often accompanied by increased complexity.
In contrast, the proposed DSS-YOLO achieves the best overall performance among all compared models. It attains the highest mAP@0.5 of 95.9%, along with superior precision (94.1%) and recall (94.3%). At the same time, DSS-YOLO maintains an extremely compact model size of only 3.4 MB, with 1.6 M parameters and 4.1 GFLOPs, which is comparable to or lower than most lightweight networks. Moreover, DSS-YOLO achieves the fastest inference speed of 65 FPS, outperforming all other mainstream algorithms evaluated in this study.
Overall, these results demonstrate that DSS-YOLO provides a more favorable trade-off among detection accuracy, model lightweightness, and inference speed. This balanced performance makes DSS-YOLO particularly suitable for real-time tomato flower detection and intelligent pollination applications in resource-constrained greenhouse environments, highlighting its strong potential for practical deployment.
3.3. Comparison Experiments of Different Downsampling Layers
To validate the effectiveness of the proposed SCDown module in multi-scale feature extraction and information preservation, a series of systematic comparative experiments were conducted in this section. For the downsampling stage, three representative methods were introduced for comparison: ADown (adaptive downsampling), SAConv (switchable atrous convolution–based downsampling), and SPDConv (space-to-depth convolution–based downsampling). These approaches aim to reduce computational complexity while preserving critical semantic information through mechanisms such as dynamic receptive field adjustment, feature redundancy suppression, and space-to-depth rearrangement. The comparative experiments analyze the comprehensive impact of different downsampling layers on model accuracy, parameter count, GFLOPs, and model size. The experimental comparison results are presented in Table 5.
Table 5 compares the effects of different downsampling modules on detection performance and computational complexity under the same network architecture and training strategy. In terms of detection accuracy, SCDown achieves the best overall performance across all three metrics—Precision, Recall, and mAP@0.5—with the mAP@0.5 reaching 95.1%, representing improvements of 0.1%, 0.2%, and 0.6% over ADown, SAConv, and SPDConv, respectively. These results indicate that SCDown can more effectively preserve critical feature information during the downsampling process, thereby enhancing the detection accuracy of tomato flower targets.
Regarding model complexity, SCDown maintains high detection performance while incurring relatively low computational overhead. The model equipped with SCDown has a weight size of 4.1 MB, 2.0 M parameters, and 5.5 GFLOPs, which are comparable to those of ADown and significantly lower than those of SAConv and SPDConv. Although SAConv and SPDConv enhance feature representation to some extent, they introduce additional parameters and computational cost. In particular, SPDConv exhibits a substantially higher computational burden, with 11.4 GFLOPs, which limits its suitability for real-time detection applications.
Overall, SCDown achieves a more favorable trade-off between detection accuracy and computational efficiency, improving the model’s ability to represent key tomato flower features while avoiding excessive computational overhead. Therefore, SCDown is selected as the downsampling module in DSS-YOLO, providing effective support for the overall performance enhancement of the proposed model.
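For clarity, the following is a minimal PyTorch sketch of the SCDown-style decoupled downsampling discussed above: a 1×1 pointwise convolution handles the channel transformation, and a stride-2 depthwise convolution handles the spatial compression. Kernel size, normalization, and activation choices are illustrative assumptions, not the exact module configuration used here.

```python
import torch
import torch.nn as nn

class SCDown(nn.Module):
    """Spatial-channel decoupled downsampling: channel transform (1x1 conv)
    followed by spatial compression (stride-2 depthwise conv)."""
    def __init__(self, c_in, c_out, k=3, s=2):
        super().__init__()
        self.pw = nn.Sequential(                       # channel transformation only
            nn.Conv2d(c_in, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU())
        self.dw = nn.Sequential(                       # spatial downsampling only
            nn.Conv2d(c_out, c_out, k, s, padding=k // 2,
                      groups=c_out, bias=False),
            nn.BatchNorm2d(c_out))

    def forward(self, x):
        return self.dw(self.pw(x))

# Quick shape check: halves the spatial resolution while changing channels
x = torch.randn(1, 128, 40, 40)
print(SCDown(128, 256)(x).shape)   # torch.Size([1, 256, 20, 20])
```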
3.4. Comparison Experiments of Different Loss Functions
Within the YOLO framework, the design of the loss function has a direct impact on model optimization efficiency and detection performance. To address the challenges encountered in greenhouse tomato flower and stamen detection—such as large variations in target scale, strong background interference, and difficulty in recognizing small objects—this study introduces the SIoU loss function and conducts a systematic comparison with several mainstream bounding box regression losses, including GIoU, EIoU, and WIoU. The objective is to investigate the influence of different loss functions on the model convergence behavior and final detection accuracy. The comparative experimental results are presented in Table 6.
Table 6 presents the impact of different bounding box regression loss functions on the detection performance of the model under identical experimental settings. Clear differences can be observed among the loss functions in terms of Precision, Recall, and mAP@0.5, indicating that the choice of bounding box regression strategy plays a crucial role in tomato flower detection.
Overall, SIoU achieves the best performance across all three evaluation metrics, with Precision and Recall reaching 93.5% and 93.3%, respectively, and mAP@0.5 attaining 95.3%. Compared with GIoU, EIoU, and WIoU, SIoU demonstrates a clear advantage, suggesting that jointly modeling angular alignment, center distance, and overlap during regression effectively improves the matching accuracy between predicted boxes and ground-truth targets.
In contrast, GIoU exhibits relatively stable Precision and Recall but yields a lower mAP@0.5, implying limited localization accuracy in complex scenarios. Although EIoU achieves a certain improvement in mAP@0.5, its Precision and Recall decrease slightly, which may lead to increased false positives or missed detections. WIoU shows the weakest overall performance, with all three metrics falling below those of the other loss functions, indicating limited adaptability to the present dataset.
Based on the above analysis, SIoU provides the best balance between detection accuracy and stability and is therefore more suitable for tomato flower detection in complex greenhouse environments. Accordingly, SIoU is selected as the bounding box regression loss function for DSS-YOLO to further enhance detection performance.
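For reference, the SIoU loss adopted here can be summarized by the following components (notation and the shape-cost exponent $\theta$ follow the commonly used formulation; the exact hyperparameter settings in our implementation are assumptions):

$$
\Lambda = 1 - 2\sin^{2}\!\left(\arcsin\frac{c_h}{\sigma} - \frac{\pi}{4}\right), \qquad
\Delta = \sum_{t \in \{x, y\}}\left(1 - e^{-(2-\Lambda)\,\rho_t}\right),
$$
$$
\Omega = \sum_{t \in \{w, h\}}\left(1 - e^{-\omega_t}\right)^{\theta}, \qquad
L_{\mathrm{SIoU}} = 1 - \mathrm{IoU} + \frac{\Delta + \Omega}{2},
$$

where $\sigma$ is the distance between the predicted and ground-truth box centers and $c_h$ its vertical component, $\rho_x$ and $\rho_y$ are the squared center offsets normalized by the enclosing box width and height, and $\omega_w$ and $\omega_h$ are the relative width and height differences. The angle cost $\Lambda$ is what gives SIoU its direction awareness for small, offset targets such as stamens.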
3.5. Visual Analysis of Results
To provide a more intuitive evaluation of the detection performance of the improved DSS-YOLO model, a comparative analysis between YOLOv11n and DSS-YOLO was conducted on the tomato flower test set under complex greenhouse conditions. The comparison considered two factors: different shooting distances and varying illumination conditions. In addition, the Grad-CAM method was employed to visualize the feature response regions of the models. Grad-CAM generates heatmaps to highlight the image regions that receive the highest attention during inference, thereby offering further insight into the capability of DSS-YOLO to detect tomato flower targets under diverse environmental conditions. The corresponding detection and visualization results are presented in Figure 9.
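As a reference for how such heatmaps are produced, below is a minimal hook-based Grad-CAM sketch in PyTorch; the choice of target layer and the scalar score to explain (assumed here to be, e.g., the summed confidence of detected flowers) are illustrative assumptions rather than the exact visualization pipeline used in this study.

```python
import torch
import torch.nn.functional as F

class GradCAM:
    """Minimal Grad-CAM: weight a conv layer's activations by the spatial
    average of the gradients of a scalar score, then apply ReLU and normalize."""
    def __init__(self, model, target_layer):
        self.model = model.eval()
        self.activations, self.gradients = None, None
        target_layer.register_forward_hook(self._save_act)
        target_layer.register_full_backward_hook(self._save_grad)

    def _save_act(self, module, inp, out):
        self.activations = out.detach()

    def _save_grad(self, module, grad_in, grad_out):
        self.gradients = grad_out[0].detach()

    def __call__(self, image, score_fn):
        self.model.zero_grad()
        score = score_fn(self.model(image))                      # scalar to explain,
        score.backward()                                         # e.g. summed confidences
        weights = self.gradients.mean(dim=(2, 3), keepdim=True)  # GAP over H, W
        cam = F.relu((weights * self.activations).sum(dim=1, keepdim=True))
        cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear",
                            align_corners=False)
        return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)  # normalize to [0, 1]
```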
In close-range scenarios, both models are able to accurately detect tomato flowers. However, DSS-YOLO demonstrates superior localization stability and confidence for both the flower body and the floral center, with feature responses more consistently concentrated on key flower structures and less influenced by background interference. In contrast, the feature activation of YOLOv11n appears more dispersed. In long-distance scenarios, YOLOv11n is prone to missed detections and localization deviations, whereas DSS-YOLO maintains effective attention to small-scale flower targets. The Grad-CAM heatmaps further indicate that DSS-YOLO exhibits more prominent feature activation for distant small targets, highlighting its advantage in small-object detection under complex conditions.
Under varying illumination conditions, both models retain a certain level of detection capability; however, DSS-YOLO shows greater stability and robustness. Under front-lighting conditions, DSS-YOLO focuses more effectively on critical regions such as the floral center in scenes involving flower overlap and leaf occlusion, while YOLOv11n still exhibits relatively strong responses to non-target regions. Under backlighting and low-light conditions, YOLOv11n suffers from noticeably reduced detection stability due to blurred contours, decreased contrast, and noise interference, often resulting in reduced confidence scores or missed detections. By comparison, DSS-YOLO continues to concentrate on flower structural regions with more compact feature responses, demonstrating stronger robustness and reliability in challenging illumination environments. Although DSS-YOLO maintains relatively stable feature activation on flower structures, several typical failure cases can still be observed in Figure 9, including false detections under low-light conditions and increased attention to non-target objects in the corresponding heatmaps. These failures mainly occur when object boundaries are severely degraded or when the foreground–background contrast is extremely low.
The comparative analysis of feature heatmaps shows that DSS-YOLO consistently produces more concentrated activation regions across different scenarios, primarily focusing on tomato flowers and their floral centers, whereas YOLOv11n exhibits more scattered responses and is more susceptible to background interference. This observation indicates that DSS-YOLO possesses stronger discriminative capability in feature extraction and target representation, which contributes to reducing both false positives and missed detections.
Overall, the visualization results demonstrate that DSS-YOLO consistently outperforms the baseline YOLOv11n model under different shooting distances and complex illumination conditions. Its improved performance in small-target detection, background suppression, and attention to key structural regions provides a more reliable visual perception foundation for automated tomato flower recognition and intelligent pollination in greenhouse environments.
4. Discussion
The DSS-YOLO model proposed in this study demonstrates strong overall performance in detecting greenhouse tomato flowers and stamens. The key methodological contribution lies in a systematic strategy that jointly optimizes model lightweightness and detection accuracy. Compared with traditional image processing approaches based on handcrafted features, the end-to-end DSS-YOLO framework can automatically learn more discriminative and robust high-level features directly from data. Conventional methods typically rely on serial pipelines composed of preprocessing, segmentation, feature extraction, and classification modules, which require extensive parameter tuning and often suffer from limited generalization under complex greenhouse conditions, such as variable illumination, foliage occlusion, and dense flower clustering. In contrast, DSS-YOLO effectively overcomes these limitations, exhibiting superior adaptability and robustness in complex scenarios.
The results of this study are consistent with recent trends in lightweight agricultural object detection while achieving targeted improvements for greenhouse pollination applications. For example, Lyu et al. [25] demonstrated that lightweight network designs help maintain sensitivity to densely occluded litchi flowers under limited computational resources, while Bai et al. [27] showed that efficient feature-fusion mechanisms improve strawberry flower detection in complex backgrounds. Building upon these findings, this study specifically targets the agricultural task of automated pollination. By accurately defining open flowers and clearly visible stamens suitable for pollination as detection targets, and by systematically integrating a lightweight backbone, efficient downsampling, and a direction-aware loss function, DSS-YOLO achieves higher detection accuracy and robustness under varying illumination and viewing angles. These results provide a more reliable visual perception solution for the practical deployment of automated pollination systems in greenhouse environments.
Despite the excellent performance of DSS-YOLO, there remains scope for improvement in its generalizability and performance. The test set in this study strictly followed the same acquisition criteria as the training set and included images captured at different times of the day (09:00, 14:00, and 18:00), thereby providing a preliminary validation of the model's robustness to daily illumination variations. However, several limitations of the dataset remain. First, all data were collected from a single greenhouse and a single tomato cultivar (Provence tomato) within one geographic region. As a result, the generalization capability of the proposed model to different regions, tomato varieties (e.g., cherry tomatoes and beefsteak tomatoes), and greenhouse structures, as well as its robustness to extreme environmental disturbances (such as lens fogging caused by high humidity), was not investigated in this study. These aspects will be systematically evaluated in future work using more diverse and extensive datasets. Second, although the SIoU loss function improves overall localization accuracy, its regression capability for extremely elongated or heavily occluded stamens may be approaching a performance bottleneck. Future studies may therefore explore more specialized shape modeling and constraint strategies to further enhance robustness under such challenging conditions. Finally, although the proposed model is already lightweight, deployment on ultra-low-power embedded devices will likely require the integration of advanced model compression techniques, such as network pruning, quantization, or knowledge distillation, to further exploit its efficiency potential [44,45,46].
In summary, DSS-YOLO provides an efficient and practical solution for identifying tomato pollination status in greenhouse environments. Future work will focus on building a cross-regional, multi-cultivar dataset, exploring specialized regression loss functions for slender objects, and deploying the model on edge computing platforms, thereby advancing the practical implementation of smart agriculture technologies.
5. Conclusions
This study addresses the demand for high-precision and lightweight visual models in greenhouse tomato pollination by proposing a novel detection framework, termed DSS-YOLO. Built upon YOLOv11n, the proposed model systematically enhances feature extraction, information preservation, and bounding box regression through three key improvements: the construction of a lightweight backbone network, DWHGNetv2, based on depthwise separable convolutions; the introduction of an SCDown downsampling module to decouple channel transformation and spatial compression for better preservation of small-target information; and the adoption of an SIoU loss function with an angle-aware mechanism to improve localization accuracy.
Experimental results demonstrate that DSS-YOLO achieves an excellent balance between detection accuracy and computational efficiency. Compared with the baseline YOLOv11n, the model size, parameter count, and computational cost are reduced by 34%, 36%, and 35%, respectively, while precision, recall, and mAP@0.5 are improved by 1.1%, 1.0%, and 0.7%, respectively. Meanwhile, DSS-YOLO maintains a real-time inference speed of 65 FPS, outperforming mainstream lightweight detection models.
Overall, this research provides a reliable visual perception solution for automated greenhouse pollination and offers a valuable technical reference for other resource-constrained agricultural vision applications.