Article

Lightweight YOLOv11n-Based Detection and Counting of Early-Stage Cabbage Seedlings from UAV RGB Imagery

1 Faculty of Geography, Yunnan Normal University, Kunming 650500, China
2 Southwest United Graduate School, Kunming 650092, China
3 School of Information Science and Technology, Yunnan Normal University, Kunming 650500, China
4 Yunnan Institute of Geological Sciences, Kunming 650051, China
* Authors to whom correspondence should be addressed.
Horticulturae 2025, 11(8), 993; https://doi.org/10.3390/horticulturae11080993
Submission received: 5 July 2025 / Revised: 3 August 2025 / Accepted: 19 August 2025 / Published: 21 August 2025
(This article belongs to the Section Vegetable Production Systems)

Abstract

This study proposes a lightweight adaptive neural network framework based on an improved YOLOv11n model to address the core challenges in identifying cabbage seedlings in visible-light images captured by UAVs: the loss of small-target features, poor adaptability to complex lighting conditions, and low deployment efficiency on edge devices. First, the adaptive dual-path downsampling module (ADown) integrates average pooling and maximum pooling into a dual-branch structure to enhance background texture and crop edge features in a synergistic manner. Second, the illumination-robust contrastive learning head (IRCLHead) utilizes a temperature-adaptive network to dynamically adjust the contrastive loss parameters; combined with a dual-output supervision mechanism that integrates growth-stage prediction and interference-resistant feature embedding, this module enhances the model’s robustness in complex lighting scenarios. Finally, a lightweight spatial-channel attention convolution module (LAConv) is developed to optimize the model’s computational load through multi-scale feature extraction paths and depth decomposition structures. Experiments demonstrate that the proposed architecture achieves an mAP@0.5 of 99.0% in detecting cabbage seedling growth cycles, improving upon the baseline model by 0.71 percentage points, and raises mAP@0.5:0.95 by 2.4 percentage points. It also reduces computational complexity (GFLOPs) by 12.7%, cuts inference time from 3.7 ms to 1.0 ms, and reduces the number of parameters by 3%. This model provides an efficient solution for the real-time counting of cabbage seedlings and lightweight operation in drone-based precision agriculture.

1. Introduction

Thanks to the rapid development of drone technology and deep learning algorithms, smart agriculture has great potential to improve the efficiency of crop management and the accuracy of crop monitoring. By mounting visible light imaging devices, drones can obtain high-resolution images of farmland, providing valuable data for monitoring and analyzing crop growth [1]. At the same time, deep learning methods, particularly convolutional neural networks (CNNs), have significantly enhanced the accuracy and automation level of crop recognition systems in target detection and image recognition [2]. As one of the most widely cultivated leafy vegetables worldwide, cabbage is prized for its nutritional value [3]. Although traditional agricultural methods have made significant progress in planting management, challenges such as small-target size, severe occlusion, and frequent changes in lighting conditions still exist during the seedling stage, affecting the accuracy of real-time monitoring [4,5]. Additionally, the limited computational resources of edge devices on drones highlight the need for lightweight and efficient models.
In recent years, YOLO object detection algorithms have gained widespread use in drone image analysis due to their fast detection speed and high accuracy [6]. From the multi-scale detection mechanisms of versions such as YOLOv3 [7] and YOLOv4 [8] to the emergence of lightweight branches like YOLOv5 [9] and YOLOv8 [10], object detection models have demonstrated good adaptability in agricultural edge computing scenarios [11]. YOLOv11, a real-time object detection model, was released in September 2024. The ‘n’ suffix denotes a lightweight nano-scale design, continuing the YOLO series tradition (e.g., v5n/v8n). YOLOv11n achieves significant speed improvements while maintaining accuracy through a lightweight architecture, GPU optimization, and multi-task adaptation [12]. However, drone images are often affected by background interference, lighting changes, and plant overlap in complex field environments, which can lead to the loss of small-object features, poor adaptation to complex lighting conditions, and reduced detection accuracy and stability [13]. Several improvement methods have been proposed to address these issues [14]. To address the issue of small-target feature loss, Tian et al. [15] introduced dynamic sampling points and multi-scale feature fusion into the model, combined with Deformable Attention and ADown modules, to extract cabbage features and improve instance segmentation accuracy. Zhu et al. [16] built the YOLO-SDLUWD network on YOLOv7, improving the accuracy and speed of infrared small-target detection in complex backgrounds through SD-Conv, LU-PAFPN, and WD-loss. Similarly, Zhao et al. [17], based on the YOLOv8s network, improved the detection accuracy of small targets in drone aerial images by introducing SR-Conv, BiFPN, a small-target detection layer, and a fusion loss function. To address the issue of poor adaptability to lighting changes, Shen et al. [18] introduced multi-task learning and optimization algorithms into the YOLOv5-POS model and proposed a temperature-adaptive network to regulate robust lighting contrast learning, thereby improving detection accuracy under different lighting conditions. Fu et al. [19] combined lighting normalization pre-processing with a dynamic occlusion compensation network (DOCN) to mitigate feature loss in occluded regions and validated the improvement in accuracy. Zheng et al. [20] enhanced YOLOv8n by incorporating a CBAM attention module, a Ghost module, and a histogram filtering algorithm to estimate cabbage diameter; they also obtained depth data via uncrewed aerial vehicles (UAVs) to address the issue of estimating diameters in occluded environments. To address the limited computational resources of edge devices mounted on UAVs, Chen et al. [21] improved YOLOv8n by introducing Swin-conv blocks and ParNet attention modules to achieve automated monitoring of cabbage seedlings. Kong et al. [22] utilized MobileNetV3 and cross-scale feature fusion (CSFF) modules to create a lightweight segmentation network that enhances real-time performance. Meanwhile, Wu et al. [23] developed a lightweight model based on GhostNet, which combines attention mechanisms to optimize the extraction of maturity features, achieving fast detection and counting in natural environments. Compared with traditional large-scale models, lightweight designs reduce computational resource consumption and improve real-time performance, adapting to the edge computing requirements of devices such as UAVs [24].
In addition to the YOLO series, progress has also been made in agricultural target detection using other deep learning algorithms. For instance, Faster R-CNN has performed well in detecting soybean plants [25]. At the same time, Mask R-CNN achieved instance segmentation of cherry tomatoes [26]. However, existing studies have primarily focused on field crops, such as corn [27], wheat [28], and rice [29], with limited research on leafy vegetables, including cabbage.
In summary, although computer vision has matured in precision agriculture, cabbage still presents unique challenges for plant identification during the seedling stage. Cabbage seedlings occupy an extremely small proportion of each drone image, making them difficult to detect as small objects [30]. Leaf shapes are highly variable and prone to severe occlusion and overlap [31]. The field environment is complex and susceptible to interference from weeds, soil background, and variable lighting conditions [32]. These factors significantly reduce the accuracy and robustness of existing general-purpose object detection models (such as the YOLO series) in recognizing cabbage seedlings, particularly in densely planted scenarios where false negatives and misclassifications are commonplace [33]. Research on lightweight, high-accuracy, and real-time recognition and counting models for the seedling stage remains relatively scarce. Therefore, based on the YOLOv11n algorithm, this study designs and constructs a lightweight, adaptive target detection architecture. To address the recognition of small targets, adaptation to complex lighting conditions, and efficient edge deployment for the cabbage seedling stage, three innovations are proposed:
(1) An adaptive dual-path downsampling module (ADown) is designed, which effectively suppresses background noise and enhances target edge features through parallel average and maximum pooling operations. The merged feature representation improves the ability to distinguish targets in complex field environments.
(2) The illumination-robust contrastive learning head (IRCLHead) dynamically adjusts contrastive loss parameters through a temperature-adaptive network and combines a dual-output supervision mechanism to extract features with strong discriminative ability and illumination invariance, effectively enhancing the model’s adaptability under complex illumination conditions.
(3) A lightweight spatial-channel attention convolution module (LAConv) is developed, combining multi-scale feature extraction and a depth decomposition structure to integrate spatial pyramid pooling with channel attention mechanisms. This reduces computational complexity while adaptively capturing seedling morphological features and suppressing background interference, meeting the computational resource constraints of drone edge computing platforms.

2. Materials and Methods

2.1. Datasets

The data for this study were collected in Tonghai County, Yunnan Province (102°30′ E–102°52′ E, 23°55′ N–24°14′ N), as illustrated in Figure 1. The region has a subtropical, semi-humid, plateau monsoon climate, with an average annual sunshine duration of 2250 h and an average annual temperature of 15.6 °C, creating an optimal environment for the large-scale cultivation of cabbage. The study area exhibits typical characteristics of highland agriculture, with contiguous fields and a distinct gradient in planting density, which provides strong support for constructing a dataset encompassing different growth stages and environmental conditions. The UAV platform used in this study was a DJI Phantom 4 RTK (SZ DJI Technology Co., Ltd., Shenzhen, China), equipped with a visible-light camera, a multi-frequency multi-system high-precision RTK GNSS, and a three-axis gimbal stabilizer, ensuring that the image data were highly accurate and stable. The flight altitude was set to approximately 5 m under manual control. Images were captured vertically downward (0°) to comprehensively record the top-view morphological distribution of the cabbage plants, yielding 820 raw images (resolution 5472 × 3648 pixels).
To enhance the model’s adaptability to complex scenes, the original images were manually annotated by multiple professionals with experience in crop recognition and converted to TXT format. Enhancement strategies, such as rotation, cropping, grayscale conversion, brightness adjustment, noise addition, and image stitching, were then introduced, resulting in a total of 4669 enhanced images (see Figure 2). To avoid overfitting or wasting resources during training due to highly similar or duplicate samples, this study used the structural similarity index (SSIM) and image histogram analysis to screen the samples before and after data augmentation. This ensured that the training samples were representative and diverse, and that redundant images were removed. Additionally, a random stratified sampling method was used to divide the data into training (3268 images), validation (934 images), and test (467 images) sets in a 7:2:1 ratio. This ensured a balanced distribution of growth stages, lighting conditions, and shooting angles, while maintaining the comprehensiveness and non-redundancy of the samples in each subset. The specific statistics are shown in Table 1.
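To illustrate the redundancy screening step, the minimal sketch below compares each augmented image with its source frame using the structural similarity index from scikit-image and discards near-duplicates. The threshold value and the file naming scheme are assumptions for illustration; they are not reported in this paper.

```python
from pathlib import Path

import cv2
from skimage.metrics import structural_similarity as ssim

SSIM_THRESHOLD = 0.95  # assumed cutoff for "near-duplicate"; not reported in the paper


def is_near_duplicate(path_a: Path, path_b: Path, size=(512, 512)) -> bool:
    """Compare two images by grayscale SSIM after resizing to a common size."""
    a = cv2.resize(cv2.imread(str(path_a), cv2.IMREAD_GRAYSCALE), size)
    b = cv2.resize(cv2.imread(str(path_b), cv2.IMREAD_GRAYSCALE), size)
    return ssim(a, b, data_range=255) > SSIM_THRESHOLD


# Keep only augmented images that differ enough from their source frame.
kept = []
for aug in sorted(Path("augmented").glob("*.jpg")):
    src = Path("raw") / (aug.stem.split("_aug")[0] + aug.suffix)  # hypothetical naming scheme
    if src.exists() and is_near_duplicate(src, aug):
        continue
    kept.append(aug)
print(f"{len(kept)} augmented images retained")
```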

2.2. Improved Model Architecture

This study employs the YOLOv11n lightweight single-stage detection framework as its core architecture, with its overall structure illustrated in Figure 3. Consisting of a backbone network (Backbone), a neck network (Neck), and a detection head (Head), it aims to achieve high-precision and high-efficiency recognition and counting of young cabbage seedlings in the field [34].
In the backbone network, the input image undergoes multi-level convolution operations for preliminary feature extraction. This is followed by downsampling and semantic enhancement through alternately stacked ADown and multi-branch residual modules (C3k2). ADown is an adaptive downsampling structure that utilizes dual-branch paths to extract trunk and edge features, thereby improving detail retention through multi-scale information fusion. C3k2 introduces a bottleneck structure and a multi-branch strategy to dynamically adjust the receptive field at different depths, jointly modeling shallow texture information and deep semantic features. A fast spatial pyramid pooling module (SPPF) is introduced at the bottom of the backbone network to enhance the model’s global context awareness through multi-scale maximum pooling, followed by a cross-stage partial self-attention module (C2PSA). C2PSA stacks multiple partial self-attention blocks (PSABlock) to significantly improve the expressiveness of foreground targets through a joint spatial and channel attention mechanism while suppressing background interference. This provides a high-quality semantic feature foundation for subsequent multi-scale fusion.
The neck network part adopts a multi-path, cross-scale feature fusion structure. The model employs a top-down and bottom-up feature pyramid fusion strategy, involving multi-level upsampling (upsample) and concatenation (concat) operations, to facilitate cross-layer information interaction. Each fusion node contains the C3k2 for feature adaptation and enhancement. In some paths, LAConv is introduced to simulate the dynamic modeling capabilities of deformable convolutions using a lightweight convolution structure. This improves the model’s adaptability to scale changes and target deformations.
The model uses a lightweight decoupled detection head (Detect) with a three-level output structure that corresponds to detection tasks at different feature layers, thereby adapting to seedlings at various scales. Each branch contains an IRCLHead, which utilizes a decoupled structure to optimize bounding box (BBox) loss and class (Cls) loss separately. In the IRCLHead, the BBox loss component uses a series of IoU-enhanced losses to improve regression accuracy. Meanwhile, the Cls loss component introduces a dynamic weighting mechanism for positive and negative samples based on focal loss, which effectively mitigates class imbalance and confusion between similar categories. Multi-layer detection outputs are aggregated through detection heads 1, 2, and 3 (Detect1, Detect2, and Detect3), and the predicted boxes and their categories are finally output, completing the localization and identification of cabbage seedlings.
In addition, the ADown and LAConv modules introduce standard 3 × 3 convolution operations as an alternative to larger 5 × 5 or 7 × 7 convolution kernels. Compared to larger kernels, 3 × 3 convolutions significantly reduce computational complexity and parameter counts while maintaining good local feature extraction capabilities. This makes them more suitable for deployment at the edge on drone platforms with high resource and inference speed requirements. By stacking multiple 3 × 3 convolutions, structures equivalent to larger receptive fields can be constructed while maintaining nonlinear modeling capabilities. This enhances the model’s ability to perceive cabbage seedlings at different scales and in various forms. At the same time, 3 × 3 convolutions are better at preserving target edges and texture details. This avoids the feature smoothing and information loss that larger convolution kernels can cause. Thus, the recognition accuracy of small targets in complex backgrounds is improved.
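As a concrete illustration of this trade-off (not code from the paper), the snippet below counts parameters for a single 5 × 5 convolution versus two stacked 3 × 3 convolutions at an example width of 64 channels; the stacked version covers the same 5 × 5 receptive field with roughly 28% fewer parameters.

```python
import torch.nn as nn

c = 64  # example channel width; not a value taken from the paper

conv5x5 = nn.Conv2d(c, c, kernel_size=5, padding=2, bias=False)
two_conv3x3 = nn.Sequential(
    nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
    nn.Conv2d(c, c, kernel_size=3, padding=1, bias=False),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(conv5x5))      # 5*5*64*64 = 102,400 parameters
print(count(two_conv3x3))  # 2*3*3*64*64 = 73,728 parameters, same 5x5 receptive field
```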

2.2.1. Adaptive Dual-Path Downsampling

The adaptive dual-path downsampling (ADown) module is designed to address two key challenges in detecting cabbage plants in visible light images captured by drones: the loss of small-target features and background interference [35]. This module combines an innovative dual-path architecture, featuring average and maximum pooling, to enhance background texture and target edge features, respectively. The average pooling path suppresses background noise (e.g., soil and weeds), while the max pooling path enhances detail features (e.g., leaf edges and contours) under varying lighting conditions. This effectively reduces the negative impact of uneven lighting on detection accuracy.
As shown in Figure 4, the ADown module employs two parallel feature processing paths. One path uses average pooling to reduce the dimensionality of the input features and extract global consistency information from the background. The other uses max pooling to focus on plant edge features, avoiding the loss of small-target features that occurs in traditional sampling. After convolution, the processed features from the two paths are concatenated to form a composite feature integrating background and target information. To improve computational efficiency and adapt to edge devices, the ADown module adopts a lightweight design. The input features are preliminarily reduced in dimensions through 2 × 2 average pooling, after which they are split into two parts along the channel dimension. One feature is enhanced through a 3 × 3 convolution, while the other is extracted through a 3 × 3 maximum pooling operation followed by a 1 × 1 convolution. The two processing results are then concatenated to maintain feature integrity while reducing the computational burden and meeting real-time requirements.
This module has a dual-branch structure designed to focus on compressing channel information and extracting spatial features. Let the input feature be $X \in \mathbb{R}^{B \times C_1 \times H \times W}$, where B, C1, H, and W represent the batch size, number of input channels, height, and width, respectively. The dual-path pooling output is defined as follows:
$$X_{\mathrm{avg}} = \frac{1}{2 \times 2} \sum_{i=0}^{1} \sum_{j=0}^{1} X(:,:,\,2i:2i+2,\,2j:2j+2)$$
$$X_{\mathrm{max}} = \max_{i,j \in \{0,1\}} X(:,:,\,2i:2i+2,\,2j:2j+2)$$
Following feature segmentation, the convolution processing of each subpath is represented as follows:
$$Y_1 = \mathrm{Conv}_{3\times 3}(X_{\mathrm{avg},1})$$
$$X_{\mathrm{maxdown}} = \mathrm{MaxPool}_{3\times 3}(X_{\mathrm{avg},2})$$
$$Y_2 = \mathrm{Conv}_{1\times 1}(X_{\mathrm{maxdown}})$$
The final, merged output is as follows:
$$Y = \mathrm{Concat}(Y_1, Y_2) \in \mathbb{R}^{B \times C_2 \times H/2 \times W/2}$$
Here, $X_{\mathrm{avg}}$ denotes the output feature map after 2 × 2 average pooling over the spatial region, where i and j are the spatial pooling index variables; $X_{\mathrm{max}}$ denotes the output feature map after maximum pooling over the spatial region; $Y_1$ denotes the output feature map after applying a 3 × 3 convolution to the first split of the average-pooled features; $X_{\mathrm{maxdown}}$ denotes the result of 3 × 3 maximum pooling applied to the second split; $Y_2$ denotes the output feature map after applying a 1 × 1 convolution to $X_{\mathrm{maxdown}}$; and Y denotes the final fused output feature map. $C_2$ denotes the number of output channels, and H/2 and W/2 indicate that the spatial size of the output feature map is halved.
In the cabbage plant detection task, ADown demonstrated three core advantages. Firstly, the maximum pooling path effectively focuses on and enhances key discriminative features, such as the edges and contours of small targets, like cabbage seedlings in the seedling stage. This significantly avoids the loss of small-target information caused by traditional downsampling methods. At the same time, the average pooling path integrates consistent background information, reducing the response to background interference, such as soil and weeds. This enhances the visual distinction between crop targets and complex backgrounds, making them more distinguishable. Secondly, ADown addresses the common phenomenon of dense leaf overlap during cabbage growth by fusing dual-path features (background texture and target edges). This significantly enhances the model’s ability to identify boundaries in overlapping areas, thereby effectively reducing false positives and false negatives caused by plant adhesion. Thirdly, the module achieves linear growth in computational load related to the number of input channels through feature splitting, channel reallocation, and a carefully designed combination of 3 × 3 and 1 × 1 convolutions. This significantly optimizes computational efficiency while maintaining excellent performance and meeting the strict real-time processing requirements of drone-based edge computing platforms. From a code implementation perspective, ADown precisely implements the core functionality of dual-path pooling by calling the ‘avg_pool2d’ and ‘max_pool2d’ functions. The convolution layer, combined with the SiLU activation function and batch normalization layer, embodies the design philosophy of being lightweight yet highly efficient. The feature splitting, parallel convolution operations, and final channel merging processes precisely implement the technical approaches of adaptive channel allocation, multi-scale feature extraction, and deep decomposition structure optimization. The entire module efficiently achieves collaborative downsampling of dual-path features.
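Based on the description above (F.avg_pool2d/F.max_pool2d, channel splitting, and the 3 × 3/1 × 1 convolution pair), a minimal PyTorch sketch of the ADown module is given below. It follows the widely used open-source ADown layout; exact kernel sizes, strides, and activation choices may differ from the authors’ implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvBNSiLU(nn.Module):
    """Convolution + BatchNorm + SiLU, the basic block used throughout the YOLO family."""

    def __init__(self, c_in, c_out, k=1, s=1, p=0):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, p, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))


class ADown(nn.Module):
    """Dual-path downsampling: an average-pooled branch refined by a stride-2 3x3 conv
    and a max-pooled branch refined by a 1x1 conv, concatenated along channels."""

    def __init__(self, c1, c2):
        super().__init__()
        self.cv1 = ConvBNSiLU(c1 // 2, c2 // 2, k=3, s=2, p=1)  # edge/texture branch
        self.cv2 = ConvBNSiLU(c1 // 2, c2 // 2, k=1, s=1, p=0)  # pooled-context branch

    def forward(self, x):
        x = F.avg_pool2d(x, kernel_size=2, stride=1, padding=0)    # smooth background texture
        x1, x2 = x.chunk(2, dim=1)                                 # split channels into two paths
        y1 = self.cv1(x1)                                          # 3x3 stride-2 convolution
        x2 = F.max_pool2d(x2, kernel_size=3, stride=2, padding=1)  # keep strong edge responses
        y2 = self.cv2(x2)                                          # 1x1 convolution
        return torch.cat((y1, y2), dim=1)                          # B x C2 x H/2 x W/2


# Quick shape check: 64 -> 128 channels, spatial size halved.
print(ADown(64, 128)(torch.randn(1, 64, 160, 160)).shape)  # torch.Size([1, 128, 80, 80])
```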

2.2.2. Illumination-Robust Contrastive Learning Head

This paper presents an illumination-robust contrastive learning head (IRCLHead) designed to address the impact of complex lighting conditions (such as intense direct light, shadows, and dynamic light and shadow) on the extraction of crop phenotypic features and target detection accuracy in visible light images captured by agricultural drones. It consists of a classification branch (Cls Head), a contrastive projection branch (Proj Head), and a temperature regulation network (Temp Net). The classification branch utilizes a 1 × 1 convolution to extract the classification features necessary for target detection. The contrastive projection branch utilizes a 3 × 3 convolution to extract local features, which are then compressed into a low-dimensional space using a 1 × 1 convolution to enhance the model’s ability to learn contrasts. The temperature regulation network generates temperature coefficients through global pooling and a two-layer fully connected structure, enabling adaptive adjustment of the temperature parameter in the contrastive loss. These three components form a dual-branch output supervision mechanism that synergistically optimizes classification and contrastive learning, effectively enhancing the model’s ability to recognize phenotypic information and improving detection robustness under varying lighting conditions. The network structure is shown in Figure 5.
The IRCLHead adopts the normalized temperature-scaled cross-entropy loss (NT-Xent) to enhance the robustness of the model’s features under complex lighting conditions. This loss function teaches the model to distinguish between lighting interference and actual phenotypic differences by comparing the similarity between positive and negative samples: positive samples (different illumination-augmented views of the same seedling) maximize feature similarity, while negative samples (different seedlings or background interference features) minimize similarity. The temperature parameter τ, dynamically adjusted according to lighting conditions, controls the strictness of similarity judgments. The contrastive loss is calculated as follows:
$$\mathcal{L}_{\mathrm{contrast}} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\mathrm{sim}(z_i,\, z_i^{+})/\tau}}{\sum_{j=1,\, j \neq i}^{2N} e^{\mathrm{sim}(z_i,\, z_j)/\tau}}$$
where N denotes the number of training samples, $z_i$ denotes the projected feature of sample i, $z_i^{+}$ denotes the feature of its positive sample, and $z_j$ denotes the features of negative samples. $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity, and τ is calibrated in real time by the temperature-adaptive network of the IRCLHead.
Additionally, the IRCLHead mechanism improves the model’s ability to adapt to complex lighting scenes, such as those with intense light or shadow, by introducing a temperature-adaptive adjustment. The core function of this mechanism is to dynamically calibrate the temperature parameter, τ, in the contrast loss. This enables the model to adjust the strength of the contrast loss according to the lighting conditions of the input image. Thus, the model can automatically adjust the strictness of its judgment of feature similarity based on lighting intensity. The specific implementation process is as follows: First, the input feature x undergoes global average pooling to extract the light-related global feature g. Then, the normalization factor α is generated through the linear mapping layer and the nonlinear activation function. Finally, the temperature parameter, τ, is adjusted as shown in the following formulas:
$$\alpha = \sigma\big(W_{\tau}\,\mathrm{ReLU}(W_{r}\, g)\big)$$
$$\tau = \tau_{\min} + \alpha\,(\tau_{\max} - \tau_{\min})$$
Among them, g denotes the image-level global feature vector; $W_{\tau}$ and $W_{r}$ denote learnable linear transformation matrices; $\mathrm{ReLU}(\cdot)$ denotes the rectified linear activation function; $\sigma(\cdot)$ denotes the sigmoid function, which maps the output to the interval [0, 1]; α denotes the adaptive scaling coefficient; $\tau_{\min}$ denotes the lower temperature bound; $\tau_{\max}$ denotes the upper temperature bound; and τ denotes the temperature parameter ultimately used in the contrastive loss.
The total loss function of IRCLHead is
$$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{cls}} + \lambda_2 \mathcal{L}_{\mathrm{contrast}}$$
In this equation, $\mathcal{L}$ denotes the overall loss function, $\mathcal{L}_{\mathrm{cls}}$ denotes the classification loss, $\mathcal{L}_{\mathrm{contrast}}$ denotes the contrastive loss, and $\lambda_1$ and $\lambda_2$ denote the weight hyperparameters.
The advantage of IRCLHead lies in its ability to enhance model robustness under various lighting conditions, particularly in scenes with significant lighting changes. It reduces feature shifts caused by these changes, thereby improving the ability to recognize and separate plant targets. In the context of cabbage seedling detection, IRCLHead reinforces the learning of plant contours and morphological features through the contrastive loss, addressing missed detections caused by leaf occlusion, small-target characteristics, and shadow occlusion. Additionally, IRCLHead is designed to be lightweight: its temperature-adaptive network consists solely of global pooling, linear mapping, and activation functions, avoiding complex computational processes. This ensures improved robustness without excessive computational overhead, making it suitable for real-time processing on drone edge devices.
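A minimal sketch of the two IRCLHead ingredients formalized above, the temperature-adaptive network and the NT-Xent loss, is shown below. The layer widths, the reduction ratio, and the bounds for τ_min and τ_max are assumptions for illustration; the projection and classification branches are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TempNet(nn.Module):
    """Predicts the contrastive temperature tau from global image statistics,
    following the alpha/tau equations above; widths and bounds are illustrative."""

    def __init__(self, c_in, reduction=4, tau_min=0.05, tau_max=0.5):
        super().__init__()
        self.fc_r = nn.Linear(c_in, c_in // reduction)   # W_r
        self.fc_t = nn.Linear(c_in // reduction, 1)      # W_tau
        self.tau_min, self.tau_max = tau_min, tau_max    # assumed bounds, not reported in the paper

    def forward(self, x):                                # x: B x C x H x W
        g = x.mean(dim=(2, 3))                           # global average pooling -> B x C
        alpha = torch.sigmoid(self.fc_t(F.relu(self.fc_r(g))))  # B x 1, in [0, 1]
        return self.tau_min + alpha * (self.tau_max - self.tau_min)


def nt_xent(z_a, z_b, tau):
    """NT-Xent over N positive pairs (z_a[i], z_b[i]); tau is an N x 1 per-image temperature."""
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)   # 2N x d, unit-norm embeddings
    sim = (z @ z.t()) / torch.cat([tau, tau], dim=0)       # cosine similarities scaled per anchor
    n = z_a.size(0)
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))             # exclude self-similarity terms
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)                   # -log softmax score of the positive pair


# Usage with two illumination-augmented views of the same seedlings (random tensors here).
feats = torch.randn(8, 256, 20, 20)                 # feature map entering the head (example size)
tau = TempNet(256)(feats)                           # per-image temperature, 8 x 1
z1, z2 = torch.randn(8, 128), torch.randn(8, 128)   # projected embeddings of the two views
print(nt_xent(z1, z2, tau))
```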

2.2.3. Light Attention Conv

This paper presents a lightweight spatial-channel attention convolutional module (LAConv) to address challenges such as leaf overlap, morphological variability, and complex background interference in the detection of cabbage seedlings. By combining spatial pyramid pooling with channel attention mechanisms, the LAConv module strikes a balance between low computational cost and enhanced feature expression capability, making it well suited for uncrewed aerial vehicle (UAV) edge computing platforms. Experimental results demonstrate that the LAConv module effectively enhances the recognition accuracy of small objects in seedling detection tasks, particularly in low-light or occlusion scenarios. Figure 6 illustrates the overall structure of the LAConv module, comprising four parts: spatial pyramid, channel attention, weighted features, and feature fusion, with a residual path that preserves information flow through a shortcut connection. The input feature X first enters the spatial pyramid, where a 3 × 3 convolution, BatchNorm, SiLU activation, depthwise separable convolution (DWConv), and pointwise convolution are used to extract multi-scale spatial features. Next, the channel attention stage generates channel weights using global average pooling, 1 × 1 convolutions for channel squeezing and expansion, SiLU activation, and Sigmoid activation. The weighted features stage then applies these weights to the multi-scale features. In the final stage, feature fusion, the weighted features are concatenated with the original pyramid features and fused using a 1 × 1 convolution, BatchNorm, and SiLU (or identity mapping) to produce the final enhanced features. Finally, the main-path output is combined with the conditionally adjusted features from the residual path to create the final result. The core architecture comprises three components: spatial pyramid multi-scale feature extraction, feature fusion, and residual connection.
During the spatial pyramid multi-scale feature extraction stage, the module uses the spatial pyramid pooling (SPP) method to adapt to morphological changes in cabbage seedlings at various growth stages. By constructing multi-scale receptive fields through hierarchical convolutions, it significantly enhances the capture of features in small targets and complex backgrounds. First, the module performs channel expansion on the input features through a 3 × 3 convolution, expanding the number of channels to $C_{\mathrm{exp}} = C_{\mathrm{in}} \times e$ (where e is the expansion factor), and then introduces BatchNorm normalization and SiLU activation functions to improve nonlinear modeling capability and training stability. Group convolution (with the number of groups set to $C_{\mathrm{exp}}$) and 1 × 1 convolution are then used in a cascaded operation to achieve the parallel extraction and fusion of multi-scale features. Let the input feature be $X \in \mathbb{R}^{H \times W \times C_{\mathrm{in}}}$; the spatial pyramid pooling process can be expressed as
$$\mathrm{Pyramid}(X) = \big[\mathrm{Conv}_{3\times 3}(X),\ \mathrm{GConv}_{3\times 3}(X),\ \mathrm{Conv}_{1\times 1}(X)\big]$$
where $\mathrm{Pyramid}(\cdot)$ denotes the feature set after pyramid fusion, $\mathrm{GConv}_{3\times 3}$ represents 3 × 3 group convolution, and $\mathrm{Conv}_{3\times 3}$ and $\mathrm{Conv}_{1\times 1}$ represent standard 3 × 3 and 1 × 1 convolutions, respectively.
To strengthen chlorophyll-related features and suppress background noise, LAConv introduces a channel attention mechanism. Context information from the channel dimension is extracted using global average pooling, and channel attention weights are generated with two-level 1 × 1 convolutions (with a 4:1 dimensionality reduction ratio) and Sigmoid activation. Given a multi-scale feature X pyramid , the channel attention process can be expressed as follows:
$$X_{\mathrm{avg}} = \mathrm{GlobalAvgPool}(X_{\mathrm{pyramid}})$$
$$X_{\mathrm{cal}} = \mathrm{ReLU}\big(\mathrm{Conv}_{1\times 1}(X_{\mathrm{avg}})\big)$$
$$\alpha = \mathrm{Sigmoid}\big(\mathrm{Conv}_{1\times 1}(X_{\mathrm{cal}})\big)$$
$$X_{\mathrm{weighted}} = X_{\mathrm{pyramid}} \odot \alpha$$
where $X_{\mathrm{avg}}$ denotes the channel descriptor obtained through global average pooling, $X_{\mathrm{cal}}$ denotes the compressed intermediate feature, α denotes the channel attention weight tensor, and ⊙ denotes element-wise multiplication.
During the feature fusion process, LAConv improves feature complementarity by combining weighted features with original multi-scale features and using a 1 × 1 convolution to reduce the number of channels. To prevent the gradient from disappearing in the deep network, LAConv uses residual concatenation.
$$X_{\mathrm{concat}} = \mathrm{Concat}(X_{\mathrm{weighted}},\ X_{\mathrm{pyramid}})$$
$$X_{\mathrm{fusion}} = \mathrm{SiLU}\big(\mathrm{BN}(\mathrm{Conv}_{1\times 1}(X_{\mathrm{concat}}))\big)$$
$$X_{\mathrm{out}} = X_{\mathrm{fusion}} + \mathrm{Shortcut}(X)$$
Among them, $X_{\mathrm{weighted}}$ denotes the weighted feature map, $X_{\mathrm{fusion}}$ denotes the fused feature map, and $X_{\mathrm{out}}$ denotes the final output feature map of the module. $\mathrm{Shortcut}(X)$ is an identity mapping when the input and output dimensions are consistent and the stride is 1; otherwise, dimension matching is achieved via a 1 × 1 convolution and batch normalization.
The LAConv module enhances the ability to capture detailed features, such as serrated leaf edges and leaf vein texture, through multi-scale feature extraction (spatial pyramid pooling). This enhances cabbage seedling detection and effectively mitigates the issue of small-target leakage resulting from fixed receptive fields in traditional convolution. Meanwhile, the channel attention mechanism in LAConv dynamically suppresses interference noise from the soil and weed background, enhancing the response of the chlorophyll feature channel. This improves target differentiation in complex field environments. To adapt to the computational power of edge devices, LAConv uses a deep decomposition structure, a combination of grouped and pointwise convolutions, which reduces computational complexity while maintaining feature extraction effectiveness. These designs enable the model to balance accuracy and efficiency despite challenges such as overlapping leaves, changing morphology, and background interference. This provides a lightweight solution for real-time plant identification in the field.
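Following the formulation above, a compact PyTorch sketch of LAConv is given below. The expansion factor, the 4:1 reduction ratio, and the exact placement of the normalization layers are partly assumptions; the sketch is meant to show the pyramid–attention–fusion–shortcut flow rather than the authors’ exact implementation.

```python
import torch
import torch.nn as nn


class LAConv(nn.Module):
    """Lightweight spatial-channel attention conv: 3x3 expansion, depthwise + pointwise
    refinement, SE-style channel attention, fusion, and a residual shortcut."""

    def __init__(self, c_in, c_out, stride=1, expansion=2, reduction=4):
        super().__init__()
        c_exp = c_in * expansion  # expanded channel count (expansion factor is assumed)
        self.expand = nn.Sequential(
            nn.Conv2d(c_in, c_exp, 3, stride, 1, bias=False),
            nn.BatchNorm2d(c_exp), nn.SiLU(),
        )
        # Depthwise (groups = c_exp) then pointwise conv: multi-scale spatial refinement.
        self.dw = nn.Sequential(
            nn.Conv2d(c_exp, c_exp, 3, 1, 1, groups=c_exp, bias=False),
            nn.BatchNorm2d(c_exp), nn.SiLU(),
            nn.Conv2d(c_exp, c_exp, 1, bias=False),
            nn.BatchNorm2d(c_exp), nn.SiLU(),
        )
        # Channel attention: GAP -> 1x1 reduce (4:1) -> ReLU -> 1x1 expand -> Sigmoid.
        self.att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(c_exp, c_exp // reduction, 1), nn.ReLU(),
            nn.Conv2d(c_exp // reduction, c_exp, 1), nn.Sigmoid(),
        )
        # Fuse weighted and unweighted pyramid features back to c_out channels.
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * c_exp, c_out, 1, bias=False),
            nn.BatchNorm2d(c_out), nn.SiLU(),
        )
        self.shortcut = (nn.Identity() if stride == 1 and c_in == c_out else
                         nn.Sequential(nn.Conv2d(c_in, c_out, 1, stride, bias=False),
                                       nn.BatchNorm2d(c_out)))

    def forward(self, x):
        p = self.dw(self.expand(x))              # pyramid features
        w = p * self.att(p)                      # channel-weighted features
        y = self.fuse(torch.cat([w, p], dim=1))  # feature fusion
        return y + self.shortcut(x)              # residual connection


print(LAConv(64, 64)(torch.randn(1, 64, 80, 80)).shape)  # torch.Size([1, 64, 80, 80])
```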

2.2.4. Real-Time Cabbage Seedling Counting Model

The design of the real-time cabbage seedling counting model incorporates multiple key modules to ensure robustness and efficiency in complex lighting conditions [36]. The counting output layer processes overlapping targets using dynamic non-maximum suppression (NMS) technology and geographic coordinate mapping to generate spatial location data for plants with confidence levels and distribution heat maps. Throughout the processing workflow, the input image is first divided into sub-images of 1824 × 1824 pixels. These sub-images are then processed by the IRCLHead module for light-adaptive enhancement, ensuring model stability under varying lighting conditions. Meanwhile, the LAConv and ADown modules perform feature extraction in parallel to meet real-time processing requirements. Additionally, the intelligent counting logic utilizes a density-adaptive threshold mechanism that dynamically adjusts detection sensitivity based on plant density. This addresses the challenge of identifying overlapping targets. The model demonstrates high robustness in complex lighting conditions, achieving a balance between accuracy and efficiency by utilizing LAConv to enhance leaf edge features, ADown to optimize background suppression, and IRCLHead to ensure all-weather detection capabilities.
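The counting workflow can be sketched as follows, assuming the Ultralytics inference API and torchvision NMS. The tile size of 1824 × 1824 pixels comes from the text; the tile overlap, confidence and IoU thresholds, and the weights filename are illustrative assumptions, and the density-adaptive thresholding and geographic coordinate mapping are omitted.

```python
import torch
from torchvision.ops import nms
from ultralytics import YOLO

TILE, OVERLAP = 1824, 182            # tile size from the paper; overlap is an assumption
model = YOLO("cabbage_yolov11n.pt")  # hypothetical weights file


def count_seedlings(image, conf=0.25, iou=0.5):
    """Tile the frame, detect per tile, merge boxes in frame coordinates, apply NMS, count."""
    h, w = image.shape[:2]
    boxes, scores = [], []
    step = TILE - OVERLAP
    for y0 in range(0, max(h - OVERLAP, 1), step):
        for x0 in range(0, max(w - OVERLAP, 1), step):
            tile = image[y0:y0 + TILE, x0:x0 + TILE]
            res = model(tile, conf=conf, verbose=False)[0]
            if len(res.boxes) == 0:
                continue
            b = res.boxes.xyxy.cpu()
            b[:, [0, 2]] += x0            # shift tile boxes back to frame coordinates
            b[:, [1, 3]] += y0
            boxes.append(b)
            scores.append(res.boxes.conf.cpu())
    if not boxes:
        return 0
    boxes, scores = torch.cat(boxes), torch.cat(scores)
    keep = nms(boxes, scores, iou)        # remove duplicates created by overlapping tiles
    return len(keep)
```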

3. Results

3.1. Experimental Environment

To ensure fairness in model comparison and comparability of results, all network models in this paper were trained and tested on the same hardware configuration and dataset. The hardware environment for this experiment consisted of an NVIDIA GeForce RTX 3050 GPU with 6 GB of VRAM, an Intel processor, and 16 GB of RAM, which met the training and inference requirements of deep learning models. The software environment was configured with Python 3.11.4 and the Ultralytics 8.3.2 framework, utilizing CUDA-accelerated computing based on PyTorch 2.0.0 and cu118. During training, an initial learning rate of 0.001 was used with the AdamW optimizer, a weight decay of 0.0005, and a momentum parameter of 0.937. The batch size was set to 16 due to GPU memory constraints, and the training consisted of 150 epochs in total. Data augmentation strategies included random brightness adjustment, the addition of Gaussian noise, and random flipping to enhance the model’s ability to generalize in complex scenarios. It is worth noting that, since this study was primarily conducted on a desktop GPU (RTX 3050) for training and inference testing, the reported inference time does not fully represent the runtime efficiency of the model on edge devices. As the proposed model is intended for deployment on edge devices, future research will evaluate deployment and inference times on typical edge hardware, such as Jetson Nano and Jetson Xavier NX, to validate its real-time performance and adaptability.
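For reference, the training configuration described above maps onto the Ultralytics training API roughly as follows; the dataset and model YAML filenames are placeholders, not files provided by this paper.

```python
from ultralytics import YOLO

# Training settings mirroring Section 3.1; the YAML filenames are hypothetical.
model = YOLO("yolov11n-cabbage.yaml")   # improved architecture definition (assumed name)
model.train(
    data="cabbage_seedlings.yaml",      # dataset config with train/val/test splits (assumed name)
    epochs=150,
    batch=16,
    optimizer="AdamW",
    lr0=0.001,                          # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
    device=0,                           # single RTX 3050 GPU
)
```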

3.2. Evaluation Metrics

This paper selects the following evaluation metrics: precision (P), recall (R), mean average precision (mAP) at IoU thresholds of 0.5 and 0.5 to 0.95 (mAP@0.5 and mAP@0.5:0.95), model size, giga floating-point operations (GFLOPs), number of parameters, and inference time. Among these metrics, precision measures the proportion of correctly detected targets out of all detection results, and recall reflects the proportion of detected targets out of the actual number of targets. mAP@0.5 and mAP@0.5:0.95 evaluate target localization accuracy at IoU thresholds of 0.5 and 0.5 to 0.95, respectively. Model size, GFLOPs, and the number of parameters quantify the model’s computational efficiency and storage cost, while inference time reflects its real-time performance.
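For completeness, these metrics follow their standard definitions:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad AP = \int_{0}^{1} P(R)\,\mathrm{d}R, \qquad mAP = \frac{1}{K}\sum_{k=1}^{K} AP_k$$

where TP, FP, and FN denote true positives, false positives, and false negatives, respectively, and K is the number of classes. mAP@0.5 computes AP at an IoU threshold of 0.5, while mAP@0.5:0.95 averages AP over IoU thresholds from 0.5 to 0.95 in steps of 0.05.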

3.3. Comparison Experiments

3.3.1. Comparison of the Improved Model with the Baseline Model

As shown in Table 2 and Figure 7, this section compares the performance of the improved model with that of the YOLOv11n baseline model in the task of detecting cabbage seedlings. The experiments evaluate performance in two areas: detection accuracy and runtime efficiency. They analyze precision, recall, mAP@0.5, and mAP@0.5:0.95, among other detection metrics. Efficiency metrics, including model size, computational complexity, number of parameters, and inference time, were also taken into account.
The results demonstrate that the improved model outperforms the baseline model in all key metrics. In terms of detection accuracy, the enhanced model achieves a precision of 97.7%, a 0.6 percentage point increase over the baseline. The recall rate is 95.6%, a 0.7 percentage point improvement. mAP@0.5 is 99.0%, a 0.7 percentage point increase, and mAP@0.5:0.95 is 89.9%, a 2.4 percentage point increase. In terms of runtime efficiency, the improved model is 5.4 MB in size, has a computational complexity of 5.5 GFLOPs, and has an inference time of 1.0 ms, a 72.9% reduction compared to the baseline model’s 3.7 ms. These improvements maintain the model’s lightweight characteristics while improving detection accuracy and inference speed, meeting the real-time detection requirements of agricultural scenarios.
Figure 7 illustrates the advantages of the improved model in various detection scenarios. In areas with dense seedlings, for example, the enhanced model can more accurately identify and locate adjacent plants, reducing false negatives and false positives. In complex backgrounds, the improved model demonstrates stronger interference resistance and can effectively distinguish seedlings from weeds. Additionally, the enhanced model performed stably in test samples under different lighting conditions, demonstrating good environmental adaptability. Comprehensive quantitative indicators and visualization results show that the improved model strikes a good balance between detection accuracy and operational efficiency. Its performance in detecting cabbage seedlings provides strong technical support for real-time plant monitoring in precision agriculture, with broad application prospects.

3.3.2. Comparison of the Improved Model with Other Network Models

To validate the generalization capability and practical application value of the improved model for the cabbage seedling recognition task, this study compared it against two groups of representative models. The first group comprises mainstream models from the YOLO series (including YOLOv3-Tiny, YOLOv8n, YOLOv9c, and YOLOv11n), as detailed in Table 3. This series is widely used for lightweight object detection and is particularly suitable for edge deployment scenarios, serving as a standard benchmark for evaluating agricultural object detection tasks. Additionally, to demonstrate the advanced nature of the proposed model in similar applications, we conducted a comparative analysis against several state-of-the-art models that have performed exceptionally well in cabbage detection tasks in recent years, such as TSP-YOLO, YOLO11-CGB, and YOLOv7, as detailed in Table 4. Through a multi-dimensional metric comparison, we validated not only the optimization effectiveness of the proposed model within the YOLO framework but also its comprehensive advantages in real-world agricultural detection scenarios.
In terms of detection accuracy, the data in Table 3 and Figure 8 show that the improved model outperforms the baseline models in the YOLO series in both precision and recall. The precision is 97.7%, an improvement of 0.6 percentage points over YOLOv11n, and the recall is 95.6%, an improvement of 0.7 percentage points over YOLOv11n. The mAP@0.5 is 99.0%, an improvement of 0.71 percentage points over YOLOv11n, and the mAP@0.5:0.95 is 89.9%, an improvement of 2.4 percentage points over YOLOv11n. These results indicate that the improved model has been optimized in terms of localization accuracy and bounding-box regression performance in dense scenes. Compared with YOLOv9c, the improved model achieves comparable mAP@0.5:0.95 while reducing computational complexity to 5.5 GFLOPs. Compared with YOLOv3-Tiny and YOLOv8n, the improved model’s mAP@0.5:0.95 increased by 3.8 and 2.4 percentage points, respectively.
In terms of computational efficiency, the improved model has a model size of 5.4 MB, 2.5 million parameters, and an inference time of 1.0 ms, demonstrating good edge computing performance. Compared to YOLOv11n, the inference time is reduced by 73%, the number of parameters is reduced by 3%, and the GFLOPs are reduced by 12%. Through the LAConv depth decomposition structure and the ADown feature splitting strategy, inference time is also significantly reduced compared with YOLOv3-Tiny (6.2 ms) and YOLOv8n (3.3 ms).
As shown in Table 4, the improved model’s mAP@0.5:0.95 is 89.9%, surpassing YOLOv7’s 83.4%. In terms of inference time, the enhanced model takes 1.0 ms, the same as TSP-YOLO, and outperforms YOLOv8n’s 26.3 ms. Compared to YOLO11-CGB’s 4.1 GFLOPs, the improved model dynamically adjusts the parameter τ using the IRCLHead temperature-adaptive mechanism and combines it with the LAConv channel attention mechanism to enhance chlorophyll features. This maintains a precision of 97.7% and a recall of 95.6%, even in complex lighting conditions.

3.3.3. Radar Chart Comparing Different Convolutions

This study compares the performance of the lightweight spatial-channel attention convolution (LAConv) module with that of traditional convolution modules (e.g., DWConv, GhostConv, and DualConv) in cabbage seedling detection tasks. The analysis focuses on detection accuracy, computational efficiency, and inference time, paying particular attention to how each module performs in complex agricultural scenarios.
The experimental results demonstrate that LAConv outperforms the other modules overall. As shown in Table 5, LAConv achieves higher scores than the other modules on key metrics, including precision (97.7%), recall (95.4%), mAP@0.5 (98.9%), and mAP@0.5:0.95 (90.3%). Compared to DWConv, LAConv improves mAP@0.5:0.95 by 2.0 percentage points, reduces inference time from 3.6 ms to 1.3 ms, and decreases the number of model parameters. These improvements are attributed to three key features of LAConv. First, spatial pyramid pooling and multi-scale convolutions enhance the extraction of detailed features. Second, the channel attention mechanism effectively reduces background interference and improves the extraction of chlorophyll features. Third, the depth decomposition structure combined with residual connections optimizes computation and reduces computational complexity.
As shown in the Figure 9 radar chart, LAConv demonstrates excellent performance in both detection accuracy and inference efficiency. Its multi-scale feature fusion and background suppression mechanisms effectively address issues such as missed detections of small targets and misclassification in complex backgrounds, making it suitable for use on drone edge computing platforms. Through its channel-separable strategy, depthwise convolution reduces inference time by approximately 30%, with an accuracy difference of less than 1.5%, making it suitable for resource-limited drone systems. Separable convolution reduces the number of parameters by 40% by decoupling the spatial and channel dimensions of the computation. Although its detection accuracy is slightly lower than that of LAConv (with a 2.1% decrease in mAP@0.5), it remains valuable in scenarios that require high real-time performance.
Overall, the different convolution modules have their respective advantages and disadvantages in agricultural object detection tasks. LAConv excels in accuracy, particularly in complex backgrounds, while depthwise and separable convolutions are more computationally efficient. Therefore, selecting the appropriate convolution module based on the needs of different application scenarios can strike a balance between accuracy and efficiency. LAConv is recommended for scenarios with higher accuracy requirements, such as seedling monitoring, whereas depthwise or separable convolutions may be considered for real-time-focused scenarios, such as drone inspections. This study provides a reference for selecting convolution modules in agricultural intelligent applications.

3.4. Ablation Experiment

This study systematically evaluated the performance of each core module, both in isolation and in combination with the others, through ablation experiments. As shown in Table 6 and Figure 10, the experiment analyzed each module and its combinations quantitatively from three dimensions: detection accuracy, computational efficiency, and inference time. Radar charts intuitively demonstrate performance differences under different configurations.
The results of the experiments show that LAConv performs excellently when used alone, with an mAP@0.5 of 98.9% and an mAP@0.5:0.95 of 90.3%, respectively, improving the baseline model by 0.6 and 2.8 percentage points. This module enhances the extraction of detailed features, such as leaf edges, through spatial pyramid pooling, while the channel attention mechanism effectively suppresses background interference and highlights chlorophyll-related features. When channel attention is removed, the model’s performance decreases significantly, with the accuracy and recall rates falling by 1.5% and 1.2%, respectively, and the mAP@0.5 dropping to 97.5%. This verifies the importance of channel attention in complex backgrounds. When used alone, ADown achieved an mAP@0.5 of 98.4% and a recall rate of 95.0%, but the mAP@0.5:0.95 fell to 87.4%. This module combines average and maximum pooling to suppress background noise and effectively strengthen target edges, thereby improving the separation of overlapping targets. However, the lack of multi-scale information fusion limits the detection accuracy of small targets. After introducing IRCLHead, the model maintains an mAP@0.5:0.95 of 87.4% under various lighting conditions, with the inference time increasing to 3.3 ms. This module enhances light-invariance feature extraction through a temperature-adaptive mechanism that adjusts contrast loss, but it also introduces additional computational overhead.
Module combination experiments reveal significant synergistic advantages. When LAConv is used with ADown, the mAP@0.5:0.95 increases to 88.3%, representing a 0.9 percentage point improvement over using ADown alone. This indicates that multi-scale fusion and background suppression can achieve complementary optimization. When combined with IRCLHead, the mAP@0.5:0.95 remains at 87.8%, but the inference time increases significantly to 9.3 ms. This indicates that channel enhancement and global contrast mechanisms introduce computational redundancy in the inference pipeline, resulting in latency amplification. Operations such as feature queue maintenance, brightness perturbation simulation, and high-resolution access pose particular challenges to inference performance in IRCLHead. Future work could explore local channel-guided contrast or structural reparameterization strategies to reduce computational complexity. When combining ADown with IRCLHead, mAP@0.5 remained at 98.3% and mAP@0.5:0.95 was still at 87.4%, indicating that the advantages of the dual-path and lighting robustness mechanisms were not fully exploited without LAConv support. Ultimately, the model achieved its best performance with full module integration, with mAP@0.5 reaching 99.0% and mAP@0.5:0.95 reaching 89.9%, while the inference time was compressed to 1.0 ms. LAConv provides multi-scale structural information, ADown enhances texture and edge representation, and IRCLHead improves light robustness. These three components work together to optimize the model’s overall balance of accuracy, efficiency, and adaptability. The radar chart in Figure 10 further validates the balanced performance of this configuration across various dimensions.

3.5. Counting Performance

This study proposes a real-time counting model based on YOLOv11n. Table 7 shows the experimental results, and Figure 11 shows counting examples. The improved model achieves a detection accuracy of 99.6% and completes the task in 27.3 min, a significant improvement in efficiency. The number of detection boxes is optimized, and computational redundancy is reduced through the use of the dynamic non-maximum suppression (NMS) algorithm. Error analysis indicates that the remaining 0.4% error stems primarily from extremely dense planting areas and substantial light–shadow interference. Although the ADown module effectively separates partially overlapping targets, false negatives still occur. The temperature-adaptive mechanism of IRCLHead effectively reduces false positives. Figure 11 shows that the model’s counting results align well with the manually annotated positions, particularly in scenarios with weeds, where the channel attention mechanism of the LAConv module significantly suppresses background interference. The model maintains high accuracy in environments with significant lighting changes and severe interference from weeds or soil. Combining the LAConv and ADown modules enhances target feature extraction and optimizes background suppression, improving detection stability and accuracy. In conclusion, the improved YOLOv11n model demonstrates excellent performance in real-time cabbage seedling counting, meeting agricultural application requirements for real-time, high-accuracy solutions.

4. Conclusions

This study develops a lightweight, high-precision model for recognizing and counting cabbage seedlings based on the YOLOv11n framework. It integrates an adaptive dual-path downsampling module (ADown), an illumination-robust contrastive learning head (IRCLHead), and a lightweight spatial-channel attention convolution module (LAConv). These components effectively enhance the model’s detection performance and robustness in complex agricultural environments. Experimental results show that the model achieved a 99.0% mAP@0.5 on the test set and improved mAP@0.5:0.95 by 2.4 percentage points over the baseline model. In the counting task, detection accuracy reaches 99.6%, and the average counting time is reduced to 27.3 min, validating the model’s potential for use in agricultural production scenarios. Ablation experiments show that LAConv significantly improves the recognition of small targets (e.g., early-stage seedlings), ADown distinguishes densely overlapping plants more effectively, and IRCLHead improves stability under complex lighting conditions. However, despite its good performance across multiple metrics, the model still has the following limitations: (1) False negatives still occur in field scenarios with high seedling density and blurred boundaries, indicating room for improvement in modeling fine-grained boundary features. (2) False positives may still occur in strong backlight or environments with severe shadow changes, suggesting that the model’s robustness to lighting conditions needs further optimization. (3) The model demonstrates high inference efficiency in small-scale plots, but its real-time processing capability in large-scale aerial imagery needs enhancement. (4) This study focuses on a single crop (cabbage) and has not validated the model’s ability to generalize to other crop types. In summary, while this model has made positive progress in terms of detection accuracy and deployment efficiency, it requires further optimization in terms of environmental adaptability, inference speed, and cross-crop generalizability. Future work could introduce more expressive visual encoders and Transformer-like structures to enhance global modeling capabilities and combine multi-source data (e.g., hyperspectral or multi-view images) to improve the model’s understanding of complex scenes. Additionally, to expand the model’s adaptability, future research could explore its transferability to other cruciferous crops (e.g., Chinese cabbage and radish) and non-cruciferous crops (e.g., lettuce and spinach). Strategies such as few-shot learning, multi-task training, or domain adaptation could enhance its cross-crop performance and increase the model’s potential applications in various agricultural environments.

Author Contributions

Conceptualization, R.Z. and R.L.; methodology, R.Z.; software, X.D.; validation, R.Z., R.L. and B.Y.; formal analysis, R.Z.; investigation, R.Z.; resources, R.L.; data curation, X.D.; writing—original draft preparation, R.Z.; writing—review and editing, R.L.; visualization, X.D.; supervision, B.Y.; project administration, R.L.; funding acquisition, B.Y. and J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Major Science and Technology Special Project of Yunnan Province (No. 202302AO370003) and the Yunnan Province Basic Research Special Program (Nos. 202301AT070173 and 202401AT070103).

Data Availability Statement

All data are contained within the article. To request the data and code, please send an email to the first or corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Mostafa, S.; Mondal, D.; Panjvani, K.; Kochian, L.; Stavness, I. Explainable deep learning in plant phenotyping. Front. Artif. Intell. 2023, 6, 1203546.
2. Wang, J.; Zhang, Z.; Luo, L.; Wei, H.; Wang, W.; Chen, M.; Luo, S. DualSeg: Fusing transformer and CNN structure for image segmentation in complex vineyard environment. Comput. Electron. Agric. 2023, 206, 107682.
3. Moreb, N.; Murphy, A.; Jaiswal, S.; Jaiswal, A.K. Cabbage. In Nutritional Composition and Antioxidant Properties of Fruits and Vegetables; Elsevier: Amsterdam, The Netherlands, 2020; pp. 33–54.
4. Sun, X.; Miao, Y.; Wu, X.; Wang, Y.; Li, Q.; Zhu, H.; Wu, H. Cabbage Transplantation State Recognition Model Based on Modified YOLOv5-GFD. Agronomy 2024, 14, 760.
5. Tian, Y.; Zhao, C.; Zhang, T.; Wu, H.; Zhao, Y. Recognition Method of Cabbage Heads at Harvest Stage under Complex Background Based on Improved YOLOv8n. Agriculture 2024, 14, 1125.
6. Zhao, F.; He, Y.; Song, J.; Wang, J.; Xi, D.; Shao, X.; Wu, Q.; Liu, Y.; Chen, Y.; Zhang, G. Smart UAV-assisted blueberry maturity monitoring with Mamba-based computer vision. Precis. Agric. 2025, 26, 56.
7. Tian, Y.; Yang, G.; Wang, Z.; Wang, H.; Li, E.; Liang, Z. Apple detection during different growth stages in orchards using the improved YOLO-V3 model. Comput. Electron. Agric. 2019, 157, 417–426.
8. Wu, D.; Lv, S.; Jiang, M.; Song, H. Using channel pruning-based YOLO v4 deep learning algorithm for the real-time and accurate detection of apple flowers in natural environments. Comput. Electron. Agric. 2020, 178, 105742.
9. Qiu, F.; Shao, C.; Zhou, C.; Yao, L. A method for cabbage root posture recognition based on YOLOv5s. Heliyon 2024, 10, e31868.
10. Sapkota, R.; Ahmed, D.; Karkee, M. Comparing YOLOv8 and Mask R-CNN for instance segmentation in complex orchard environments. Artif. Intell. Agric. 2024, 13, 84–99.
11. Luling, N.; Reiser, D.; Straub, J.; Stana, A.; Griepentrog, H.W. Fruit Volume and Leaf-Area Determination of Cabbage by a Neural-Network-Based Instance Segmentation for Different Growth Stages. Sensors 2022, 23, 129.
12. Sapkota, R.; Flores-Calero, M.; Qureshi, R.; Badgujar, C.; Nepal, U.; Poulose, A.; Zeno, P.; Vaddevolu, U.B.P.; Khan, S.; Shoman, M.; et al. YOLO advances to its genesis: A decadal and comprehensive review of the You Only Look Once (YOLO) series. Artif. Intell. Rev. 2025, 58, 274.
13. Xia, Y.; Wang, Z.; Cao, Z.; Chen, Y.; Li, L.; Chen, L.; Zhang, S.; Wang, C.; Li, H.; Wang, B. Recognition Model for Tea Grading and Counting Based on the Improved YOLOv8n. Agronomy 2024, 14, 1251.
14. Wongdee, P.; Teeyapan, K. A Comparative Study of Deep Learning Models for Cabbage Detection and Counting in Drone Imagery. In Proceedings of the 2025 17th International Conference on Knowledge and Smart Technology (KST), Bangkok, Thailand, 26 February–1 March 2025; pp. 260–265.
15. Tian, Y.; Cao, X.; Zhang, T.; Wu, H.; Zhao, C.; Zhao, Y. CabbageNet: Deep Learning for High-Precision Cabbage Segmentation in Complex Settings for Autonomous Harvesting Robotics. Sensors 2024, 24, 8115.
16. Zhu, J.; Qin, C.; Choi, D. YOLO-SDLUWD: YOLOv7-based small target detection network for infrared images in complex backgrounds. Digit. Commun. Netw. 2025, 11, 269–279.
17. Zhao, S.; Chen, H.; Zhang, D.; Tao, Y.; Feng, X.; Zhang, D. SR-YOLO: Spatial-to-Depth Enhanced Multi-Scale Attention Network for Small Target Detection in UAV Aerial Imagery. Remote Sens. 2025, 17, 2441.
18. Shen, X.; Shao, C.; Cheng, D.; Yao, L.; Zhou, C. YOLOv5-POS: Research on cabbage pose prediction method based on multi-task perception technology. Front. Plant Sci. 2024, 15, 1455687.
19. Fu, H.; Zhao, X.; Tan, H.; Zheng, S.; Zhai, C.; Chen, L. Effective methods for mitigate the impact of light occlusion on the accuracy of online cabbage recognition in open fields. Artif. Intell. Agric. 2025, 15, 449–458.
20. Zheng, J.; Wang, X.; Shi, Y.; Zhang, X.; Wu, Y.; Wang, D.; Huang, X.; Wang, Y.; Wang, J.; Zhang, J. Keypoint detection and diameter estimation of cabbage (Brassica oleracea L.) heads under varying occlusion degrees via YOLOv8n-CK network. Comput. Electron. Agric. 2024, 226, 109428.
21. Chen, X.; Liu, T.; Han, K.; Jin, X.; Wang, J.; Kong, X.; Yu, J. TSP-yolo-based deep learning method for monitoring cabbage seedling emergence. Eur. J. Agron. 2024, 157, 127191.
22. Kong, X.; Li, A.; Liu, T.; Han, K.; Jin, X.; Chen, X.; Yu, J. Lightweight cabbage segmentation network and improved weed detection method. Comput. Electron. Agric. 2024, 226, 109403.
  23. Wu, M.; Yuan, K.; Shui, Y.; Wang, Q.; Zhao, Z. A Lightweight Method for Ripeness Detection and Counting of Chinese Flowering Cabbage in the Natural Environment. Agronomy 2024, 14, 1835. [Google Scholar] [CrossRef]
  24. Guo, Z.; Cai, D.; Jin, Z.; Xu, T.; Yu, F. Research on unmanned aerial vehicle (UAV) rice field weed sensing image segmentation method based on CNN-transformer. Comput. Electron. Agric. 2025, 229, 109719. [Google Scholar] [CrossRef]
  25. Zhang, K.; Wu, Q.; Chen, Y. Detecting soybean leaf disease from synthetic image using multi-feature fusion faster R-CNN. Comput. Electron. Agric. 2021, 183, 106064. [Google Scholar] [CrossRef]
  26. Crespo, A.; Moncada, C.; Crespo, F.; Morocho-Cayamcela, M.E. An efficient strawberry segmentation model based on Mask R-CNN and TensorRT. Artif. Intell. Agric. 2025, 15, 327–337. [Google Scholar] [CrossRef]
  27. Yu, H.; Zhao, J.; Bi, C.G.; Shi, L.; Chen, H. A Lightweight YOLOv5 Target Detection Model and Its Application to the Measurement of 100-Kernel Weight of Corn Seeds. CAAI Trans. Intell. Technol. 2025, 1–14. [Google Scholar] [CrossRef]
  28. Yousafzai, S.N.; Nasir, I.M.; Tehsin, S.; Fitriyani, N.L.; Syafrudin, M. FLTrans-Net: Transformer-based feature learning network for wheat head detection. Comput. Electron. Agric. 2025, 229, 109706. [Google Scholar] [CrossRef]
  29. Jin, S.; Cao, Q.; Li, J.; Wang, X.; Li, J.; Feng, S.; Xu, T. Study on lightweight rice blast detection method based on improved YOLOv8. Pest Manag. Sci. 2025, 81, 4300–4313. [Google Scholar] [CrossRef]
  30. Zhang, T.; Zhou, J.; Liu, W.; Yue, R.; Yao, M.; Shi, J.; Hu, J. Seedling-YOLO: High-Efficiency Target Detection Algorithm for Field Broccoli Seedling Transplanting Quality Based on YOLOv7-Tiny. Agronomy 2024, 14, 931. [Google Scholar] [CrossRef]
  31. Zhu, X.; Chen, F.; Zheng, Y.; Chen, C.; Peng, X. Detection of Camellia oleifera fruit maturity in orchards based on modified lightweight YOLO. Comput. Electron. Agric. 2024, 226, 109471. [Google Scholar] [CrossRef]
  32. Gao, X.; Wang, G.; Zhou, Z.; Li, J.; Song, K.; Qi, J. Performance and speed optimization of DLV3-CRSNet for semantic segmentation of Chinese cabbage (Brassica pekinensis Rupr.) and weeds. Crop Prot. 2025, 195, 107236. [Google Scholar] [CrossRef]
  33. Xu, P.; Fang, N.; Liu, N.; Lin, F.; Yang, S.; Ning, J. Visual recognition of cherry tomatoes in plant factory based on improved deep instance segmentation. Comput. Electron. Agric. 2022, 197, 106991. [Google Scholar] [CrossRef]
  34. Khanam, R.; Hussain, M. Yolov11: An overview of the key architectural enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  35. Feng, H.; Chen, X.; Duan, Z. LCDDN-YOLO: Lightweight Cotton Disease Detection in Natural Environment, Based on Improved YOLOv8. Agriculture 2025, 15, 421. [Google Scholar] [CrossRef]
  36. Ye, Z.; Yang, K.; Lin, Y.; Guo, S.; Sun, Y.; Chen, X.; Lai, R.; Zhang, H. A comparison between Pixel-based deep learning and Object-based image analysis (OBIA) for individual detection of cabbage plants based on UAV Visible-light images. Comput. Electron. Agric. 2023, 209, 107822. [Google Scholar] [CrossRef]
  37. Shi, H.; Liu, C.; Wu, M.; Zhang, H.; Song, H.; Sun, H.; Li, Y.; Hu, J. Real-time detection of Chinese cabbage seedlings in the field based on YOLO11-CGB. Front. Plant Sci. 2025, 16, 1558378. [Google Scholar] [CrossRef] [PubMed]
  38. Gao, X.; Wang, G.; Qi, J.; Wang, Q.; Xiang, M.; Song, K.; Zhou, Z. Improved YOLO v7 for Sustainable Agriculture Significantly Improves Precision Rate for Chinese Cabbage (Brassica pekinensis Rupr.) Seedling Belt (CCSB) Detection. Sustainability 2024, 16, 4759. [Google Scholar] [CrossRef]
  39. Jiang, P.; Qi, A.; Zhong, J.; Luo, Y.; Hu, W.; Shi, Y.; Liu, T. Field cabbage detection and positioning system based on improved YOLOv8n. Plant Methods 2024, 20, 96. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Research area data collection example. (A–C represent the seedling, vegetative, and mature stages of cabbage, respectively).
Figure 2. Data augmentation.
Figure 3. Improved YOLOv11n network architecture diagram.
Figure 4. Adaptive dual-path downsampling (ADown) structural diagram.
Figure 5. Illumination-robust contrastive learning head (IRCLHead) structural diagram.
Figure 6. Lightweight attention convolution (LAConv) structural diagram.
Figure 7. Comparison of the detection performance of the improved YOLOv11n and the baseline model.
Figure 8. Bar chart comparing the performance of the improved YOLOv11n with other YOLO-series models. (Note: colors are randomly assigned for visualization and have no specific meaning).
Figure 9. Radar chart comparing different convolution modules.
Figure 10. Ablation experiment radar chart.
Figure 11. Example of seedling counting with the improved YOLOv11n model.
Table 1. Division of the cabbage dataset.
Dataset | Total Images | Training Set | Validation Set | Test Set
Original images | 820 | 574 | 164 | 82
Augmented images | 4669 | 3268 | 934 | 467
Table 2. Comparison between the improved YOLOv11n model and the baseline model.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Model Size (MB) | GFLOPs | Parameters | Inference (ms)
YOLOv11n | 97.1 | 94.9 | 98.3 | 87.5 | 5.5 | 6.3 | 2,582,347 | 3.7
Ours | 97.7 | 95.6 | 99.0 | 89.9 | 5.4 | 5.5 | 2,504,996 | 1.0
Table 3. Comparison experiments.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Model Size (MB) | GFLOPs | Parameters | Inference (ms)
YOLOv3-tiny | 97.2 | 94.7 | 98.2 | 86.1 | 9.6 | 14.3 | 9,519,538 | 6.2
YOLOv6n | 97.1 | 95.1 | 98.3 | 87.5 | 8.6 | 11.5 | 4,155,123 | 3.4
YOLOv8n | 97.2 | 94.8 | 98.3 | 87.5 | 5.6 | 6.8 | 2,684,563 | 3.3
YOLOv9c | 97.4 | 95.6 | 98.7 | 90.1 | 43.3 | 82.7 | 21,146,195 | 21.1
YOLOv10n | 96.0 | 95.0 | 98.2 | 88.4 | 5.8 | 8.2 | 2,694,886 | 4.9
YOLOv11n | 97.1 | 94.9 | 98.3 | 87.5 | 5.5 | 6.3 | 2,582,347 | 3.7
Ours | 97.7 | 95.6 | 99.0 | 89.9 | 5.4 | 5.5 | 2,504,996 | 1.0
Table 4. Comparison with different models in recent studies.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Model Size (MB) | GFLOPs | Parameters | Inference (ms)
TSP-yolo, Chen et al. [21] | 98.5 | 97.8 | 99.4 | 90.3 | — | — | — | 1.0
YOLO11-CGB, Shi et al. [37] | — | — | 97.0 | — | — | 4.1 | 3.2 | —
YOLOv7, Gao et al. [38] | 94.3 | — | 83.4 | — | — | — | — | —
YOLOv8n, Jiang et al. [39] | 95.5 | 85.1 | 93.9 | — | — | — | — | 26.3
— indicates a value not reported in the cited study.
Table 5. Comparison of different convolution modules.
Model | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Model Size (MB) | GFLOPs | Parameters | Inference (ms)
DWConv | 97.0 | 95.0 | 98.3 | 88.3 | 4.9 | 6.1 | 2,289,739 | 3.6
GhostConv | 97.1 | 94.8 | 98.3 | 87.6 | 5.3 | 6.1 | 2,510,219 | 3.4
DualConv | 96.8 | 95.1 | 98.3 | 87.8 | 4.9 | 5.7 | 2,321,838 | 3.4
LAConv (Ours) | 97.7 | 95.4 | 98.9 | 90.3 | 5.1 | 5.8 | 2,349,443 | 1.3
Table 6. Ablation experiment.
ADown | IRCLHead | LAConv | P (%) | R (%) | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Model Size (MB) | GFLOPs | Parameters | Inference (ms)
✕ | ✕ | ✓ | 97.7 | 95.4 | 98.9 | 90.3 | 5.1 | 5.8 | 2,349,443 | 1.3
✓ | ✕ | ✕ | 97.1 | 95.0 | 98.4 | 87.4 | 4.6 | 5.1 | 2,099,787 | 3.0
✕ | ✓ | ✕ | 97.3 | 94.8 | 98.3 | 87.4 | 6.1 | 6.6 | 2,895,940 | 3.3
✓ | ✕ | ✓ | 97.2 | 94.7 | 98.4 | 88.3 | 4.8 | 5.2 | 2,191,403 | 3.2
✕ | ✓ | ✓ | 96.8 | 95.0 | 98.3 | 87.8 | 5.7 | 6.1 | 2,663,036 | 9.3
✓ | ✓ | ✕ | 96.8 | 95.1 | 98.3 | 87.4 | 5.2 | 5.4 | 2,413,380 | 3.0
✓ | ✓ | ✓ | 97.7 | 95.6 | 99.0 | 89.9 | 5.4 | 5.5 | 2,504,996 | 1.0
✓ and ✕ indicate that the module is used and not used, respectively.
Table 7. Counting results.
Method | Number of Instances | Number of Detection Boxes | Accuracy (%) | Time (min)
Real-time counting model based on YOLOv11n | 820 | 27,650 | 99.6 | 27.3
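As a usage illustration of the box-counting workflow summarized in Table 7, the sketch below tallies detections (overall and per growth-stage class) with the Ultralytics Python API. The weight path, source folder, and thresholds are placeholders chosen for the example, not the authors' exact settings or pipeline.

```python
from collections import Counter

from ultralytics import YOLO  # Ultralytics API; also loads YOLO11-family weights

# Hypothetical paths and thresholds, for illustration only.
model = YOLO("runs/train/cabbage_yolo11n_improved/weights/best.pt")

results = model.predict(
    source="uav_plots/",  # folder of UAV RGB image tiles
    conf=0.25,            # confidence threshold
    iou=0.5,              # NMS IoU threshold
    imgsz=640,
)

total_boxes = 0
per_stage = Counter()
for r in results:
    total_boxes += len(r.boxes)                 # one box per detected plant
    for cls_id in r.boxes.cls.int().tolist():   # class index of each box
        per_stage[r.names[cls_id]] += 1         # e.g., seedling / vegetative / mature

print(f"images: {len(results)}, detection boxes (plant count): {total_boxes}")
print(dict(per_stage))
```

Counting is obtained directly from the number of retained detection boxes, which is why the detection accuracy and the box totals reported in Table 7 jointly determine the counting quality.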
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
