High-Altitude UAV-Based Detection of Rice Seedlings in Large-Area Paddy Fields

Li, Zhenhua; Yao, Xinfeng; Ban, Songtao; Hu, Dong; Tian, Minglu; Yuan, Tao; Li, Linyi

doi:10.3390/agriculture16030307

by Zhenhua Li ¹,†, Xinfeng Yao ²,†, Songtao Ban ²,³, Dong Hu ², Minglu Tian ², Tao Yuan ² and Linyi Li ²,*

¹ College of Information Technology, Shanghai Ocean University, Shanghai 201306, China

² Institute of Agricultural Science and Technology Information, Shanghai Academy of Agricultural Sciences, Shanghai 201403, China

³ CAS Center for Excellence in Molecular Plant Sciences, Institute of Plant Physiology and Ecology, Chinese Academy of Sciences, Shanghai 200233, China

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

Agriculture 2026, 16(3), 307; https://doi.org/10.3390/agriculture16030307

Submission received: 4 December 2025 / Revised: 7 January 2026 / Accepted: 19 January 2026 / Published: 26 January 2026

(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

Accurate quantification of field-grown rice seedlings is essential for evaluating yield potential and guiding precision field management. Unmanned aerial vehicle (UAV)-based remote sensing, with its high spatial resolution and broad coverage, provides a robust basis for accurate seedling detection and population density estimation. However, in previous studies, UAVs were typically flown at relatively low altitudes, which provided high-resolution imagery and facilitated seedling recognition but limited survey efficiency. Large-area monitoring requires higher flight altitudes, which reduce image resolution and degrade rice seedling recognition accuracy. In this study, UAVs were flown at a height of 30 m, and the resulting lower-resolution imagery, combined with the small size of seedlings, their dense spatial distribution, and the complex field background, necessitated algorithmic improvements for accurate detection. To address these challenges, we propose an enhanced You Only Look Once version 8 nano (YOLOv8n)-based detection model specifically designed to improve seedling recognition in high-altitude UAV imagery. The model incorporates an improved Bidirectional Feature Pyramid Network (BiFPN) for multi-scale feature fusion and small-object detection, a Global-to-Local Spatial Aggregation (GLSA) module for enriched spatial context modeling, and a Content-Guided Attention Fusion (CGAFusion) module to enhance discriminative feature learning. Experiments on high-altitude UAV imagery demonstrate that the proposed model achieves an mAP@0.5 of 94.7%, a precision of 91.0%, and a recall of 91.2%, with mAP@0.5 improving by 2.3 percentage points over the original YOLOv8n. These results demonstrate the model’s effectiveness and practical potential for large-area rice seedling detection from high-altitude UAV imagery under complex field conditions.

Keywords: rice seedlings; small object detection; YOLOv8n; high-altitude UAV imagery; large-area paddy fields

1. Introduction

1.1. Importance of Rice Seedling Detection

Rice is one of the three major staple crops worldwide, and maintaining high yield and stable quality is vital for ensuring global food security [1]. In rice cultivation management, the number of basic seedlings (or planting density) plays a decisive role in determining canopy structure, effective panicle number, and final yield [2]. The efficiency of rice planting has substantially improved with the transition from manual to mechanical transplanting. However, this transition has introduced new challenges, such as missing seedlings and discontinuous planting rows. Consequently, rapid and accurate assessment of the basic seedling number after mechanical transplanting is essential not only for evaluating transplanting quality, but also for guiding seedling replanting strategies [3,4].

1.2. Related Works

Traditional field surveys for basic seedling estimation primarily rely on manual visual inspection within a limited number of representative sampling plots. Such methods are labor-intensive, subjective, and prone to sampling errors [5]. In recent years, rapid advancements in UAVs and deep learning technologies have provided new opportunities for developing automated and objective approaches to rice seedling monitoring.

Early studies that segmented and identified rice seedlings predominantly relied on handcrafted feature extraction [6], and their performance was significantly affected by image quality and illumination variability, limiting their applicability in complex field conditions [7]. As deep learning-based methods have become the dominant paradigm in rice seedling detection, researchers have increasingly adopted advanced detectors, such as two-stage [8], one-stage [9], and Transformer-based architectures [10].

Among various detection frameworks, the one-stage YOLO family has gained widespread popularity owing to its excellent balance between precision and real-time performance. Yeh et al. [11] proposed an enhanced YOLOv4 model that integrates data augmentation and the Mish activation function to improve tiny rice seedling detection and counting. Li et al. [12] developed a lightweight YOLOv4 model to further enhance inference speed and computational efficiency. Zhao et al. [13] modified the YOLOv8n model by incorporating additional feature extraction modules and an optimized backbone, enabling accurate distinction between qualified and floating seedlings. Chen et al. [14] compared the prediction accuracy of YOLOv8n, YOLOv9t, and YOLOv10n for rice seedlings; YOLOv8n achieved the highest accuracy, with an R² value of 0.889, an RMSE of 3.225, and an rRMSE of 0.032. YOLOv8n employs the C2f module, which merges high-level semantic features with low-level spatial information. Additionally, its anchor-free detection head predicts the center of an object directly, rather than relying on predefined anchor boxes. This enables YOLOv8n to handle rice seedlings of varying shapes and sizes with greater flexibility. A summary of related work on UAV-based rice seedling detection is presented in Table 1.

During the regreening and early tillering phase (7–15 days post-transplant), rice seedlings are characterized by small morphological scales, with typical plant heights of 15–25 cm and leaf widths of 0.4–0.8 cm. Numerous studies have focused on building and training models using UAV imagery acquired at low to medium altitudes (typically between 1.5 and 15 m) to ensure high spatial resolution (ground sampling distance (GSD) of 0.41 to 3.5 mm/pixel), which facilitates the accurate identification of small rice seedlings [15,16,17,18].

Under high-resolution imagery, fine-grained structural and texture details of seedlings are sufficiently preserved, allowing detection models that rely primarily on single attention mechanisms, local texture cues, or mid-scale semantic information to achieve satisfactory performance [19]. However, in large-area paddy field rice seedling monitoring, time constraints and aircraft battery limitations require higher flight altitudes to improve data acquisition efficiency, which results in a coarser GSD (i.e., lower spatial resolution). Models built and trained for high-spatial-resolution imagery struggle to adapt to such low-resolution imagery: the overall morphological features of rice seedlings are weakened, leaf edges become blurred, and color features are lost. Additionally, background interference such as water surface reflections, bare soil patches, and straw residues becomes more pronounced. Taking the standard YOLOv8n as an example, its deep down-sampling architecture causes the sparse geometric features of tiny rice seedlings to vanish, while its general-purpose feature extraction lacks the discriminative attention required to separate blurred targets from morphologically similar background interference [20].

Table 1. Summary of related work on UAV-based rice seedling detection.

Reference	Method	Flight Altitude/GSD	Key Results	Advantages (Pros)	Limitations (Cons)
Cui et al. [15]	improved YOLOv5s	1.5–2 m, GSD = 0.55 mm/pixel	mAP@0.5:0.95 = 72.3%	High precision for individual seedling morphology.	Low efficiency for large-scale fields due to very low altitude; limited coverage.
Yang et al. [17]	RSHRNet (Segmentation)	3 m, GSD = 0.82 mm/pixel	MIoU = 62.68%	Preserves fine-grained morphological details via HRNet; high segmentation quality.	Inefficient data acquisition (3 m); Segmentation is computationally expensive compared to detection.
Gao et al. [21]	improved RT-DETR-r18	10 m, GSD = 0.77 mm/pixel	Accuracy = 82.8%	Real-time capabilities with transformer-based architecture.	Relatively low accuracy; struggles with complex background interference at this resolution.
Chen et al. [14]	YOLOv8n/v9t/v10n	12 m, GSD = 3.23 mm/pixel; 15 m, GSD = 4.03 mm/pixel	12 m, mAP@0.5 = 96.4%	Demonstrates feasibility of lightweight models at medium altitudes.	Performance degrades as altitude increases (15 m); standard models lack specific modules for blurred, tiny feature enhancement.
Xia et al. [16]	improved YOLOv8	12.45 m, GSD = 3.5 mm/pixel	Accuracy = 70.3%	Integrated attention mechanism for feature enhancement.	Focused primarily on detecting missing seedlings rather than characterizing existing ones.
Wu et al. [18]	YOLOv5s	15 m, GSD = 4.11 mm/pixel	N/A	Established baseline for seedling positioning.	Focuses on row extraction rather than precise individual detection; the standard model lacks specific optimizations for preserving tiny, blurred seedling features.
Li et al. [22]	RS-P2PNet	15 m, GSD = 1.9 mm/pixel; 25 m, GSD = 3.1 mm/pixel	MAE = 1.6	Handles very high altitudes (25 m) using multi-scale fusion; label-efficient (points).	Point supervision lacks bounding box information (size/shape); ResNet backbone is computationally heavier than lightweight detectors.
Yang et al. [23]	Improved VGG-16 (Sliding Window)	20 m, GSD = 5.5 mm/pixel	Accuracy = 99%	High classification accuracy for individual patches.	Computationally intensive due to the heavy VGG backbone and sliding window approach; low inference speed makes it unsuitable for real-time edge deployment.
Tseng et al. [10]	EfficientDet and Faster R-CNN	20 m, GSD = 5.5 mm/pixel	mAP = 88.8%	Proven accuracy with two-stage or heavy detectors.	Older architectures; high computational cost and parameter count limit edge deployment on UAVs.

Note: N/A indicates that the data were not reported in the cited study.

To address the accuracy degradation caused by high-altitude UAV imagery, in this study we propose an enhanced YOLOv8n-based framework that integrates an improved Bidirectional Feature Pyramid Network (BiFPN), a Global-to-Local Spatial Aggregation (GLSA) module, and a Content-Guided Attention Fusion (CGAFusion) module.

(1) An improved BiFPN is introduced to counteract the loss of tiny seedling signals caused by deep down-sampling. By reinforcing the contribution of shallow, high-resolution features during multi-scale fusion, the improved BiFPN preserves critical geometric and structural information of small rice seedlings that would otherwise vanish in lightweight backbones.

(2) A GLSA module is embedded to address contrast degradation and feature blur in high-altitude imagery. Through joint modeling of long-range contextual dependencies and local spatial responses, GLSA selectively strengthens weak seedling features and enhances their color characteristics and texture details.

(3) A CGAFusion module is employed to resolve semantic ambiguity between rice seedlings and visually similar background objects. By adaptively fusing semantic context with structural cues based on content relevance, CGAFusion improves feature discriminability and reduces false detections caused by floating debris and straw residues.

These components are jointly designed to strengthen multi-scale feature representation, spatial context modeling, and discriminative feature refinement for high-altitude UAV imagery. As a result, the proposed framework enables accurate localization and counting of rice seedlings under low-resolution conditions, providing a robust technical foundation for scalable and automated monitoring across large-area paddy fields.

2. Materials and Methods

This section describes the complete workflow used to develop and evaluate the proposed rice seedling detection approach. It includes the acquisition and construction of the dataset, the design of the improved detection model, the procedure for missing seedling identification, the experimental environment, and the evaluation metrics. Together, these components provide the methodological foundation for the experiments presented in Section 3.

2.1. Data Acquisition

The study was conducted at Changjiang Farm, located in Chongming District, Shanghai, China (31.662° N, 121.556° E), as shown in Figure 1. The rice cultivar used was Huruanyu, which was machine-transplanted on 25 May 2024. High-resolution remote sensing images were acquired using a DJI M300 RTK UAV (SZ DJI Technology Co., Ltd., Shenzhen, China) equipped with a Zenmuse P1 camera (resolution of 8192 × 5460 pixels). The UAV flight mission was executed on 12 June 2024, between 11:00 and 13:00, under clear weather conditions with minimal wind. The flight was carried out at an altitude of 30 m and a speed of 15 m/s, with a front overlap of 75% and a side overlap of 55%. Following the pre-planned flight route, a total of 999 original images with a GSD of 3.75 mm/pixel were captured and saved in JPEG format.

Representative UAV images of the rice field are shown in Figure 2, highlighting the complexity of the field background. The study area contains diverse backgrounds, including bare soil, water surfaces, and duckweed, as well as heterogeneous field conditions such as floating straw and scum. Additionally, the rice seedlings exhibit variable shapes and sizes, with some regions showing densely clustered seedlings.

2.2. Dataset Construction

Considering the GSD of 3.75 mm/pixel, each original image was cropped into multiple 800 × 800-pixel sub-images. This tile size was specifically chosen to correspond to a physical area of approximately 3 m × 3 m, which provides sufficient spatial context for row structure identification while maintaining the visibility of tiny seedlings. Sub-images covering various backgrounds and field conditions were selected for manual annotation of seedling samples. As illustrated in Figure 3, the LabelImg software (Version 1.8.6) was used for manual annotation of rice seedlings. Each seedling was labeled by drawing a minimum bounding rectangle, and the annotations were saved in the YOLO object detection format in corresponding TXT files. In total, 100 sub-images containing rice seedlings were curated, comprising 32,782 annotated seedlings. Preliminary experiments conducted on subsets of 60, 80, and 100 images showed that the detection accuracy stabilized at this scale, indicating that the dataset size is sufficient for model convergence.
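The footprint arithmetic behind the tile size can be checked with a short sketch. The GSD, tile size, and annotation totals are taken from the text above; the function name `tile_footprint_m` is hypothetical and introduced only for illustration.

```python
GSD_MM_PER_PIXEL = 3.75   # ground sampling distance reported for the 30 m flight
TILE_SIZE_PX = 800        # side length of each cropped sub-image


def tile_footprint_m(tile_px: int, gsd_mm: float) -> float:
    """Return the ground distance (in metres) covered by one tile side."""
    return tile_px * gsd_mm / 1000.0


side_m = tile_footprint_m(TILE_SIZE_PX, GSD_MM_PER_PIXEL)
mean_seedlings_per_tile = 32_782 / 100  # annotated seedlings / annotated sub-images

print(f"tile side: {side_m:.1f} m")
print(f"mean seedlings per tile: {mean_seedlings_per_tile:.1f}")
```

Each tile side therefore spans 3.0 m on the ground, and the curated set averages roughly 328 annotated seedlings per tile.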

The dataset is densely distributed and contains small objects. The majority of seedling bounding boxes occupy less than 1% of the image area, and the number of instances per image ranges from 200 to 400.

The dataset was randomly split into training, validation, and testing sets with a ratio of 7:1:2. To improve the model’s adaptability to varying field conditions, various image augmentation techniques were applied during training, including Hue Saturation Value (HSV) color perturbation, geometric transformations (scaling and translation), horizontal flipping, and mosaic augmentation. The specific hyperparameters are detailed in Table 2.
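A 7:1:2 random split can be sketched as follows. This is a minimal illustration, not the authors' script: the file names, the fixed seed, and the choice to assign any rounding remainder to the test set are all assumptions.

```python
import random


def split_dataset(items, ratios=(7, 1, 2), seed=0):
    """Shuffle items and split them into train/val/test by integer ratios."""
    items = list(items)
    random.Random(seed).shuffle(items)  # seed is an assumed value for repeatability
    total = sum(ratios)
    n_train = len(items) * ratios[0] // total
    n_val = len(items) * ratios[1] // total
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Hypothetical file names standing in for the 100 annotated sub-images.
names = [f"tile_{i:03d}.jpg" for i in range(100)]
train, val, test = split_dataset(names)
print(len(train), len(val), len(test))  # 70 10 20
```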

Visual examples of these effects are illustrated in Figure 4. These augmentation strategies were randomly applied in each training epoch to simulate different data acquisition conditions, such as changes in illumination and shooting angles. This approach effectively expands the sample space, mitigates overfitting, and enhances the model’s detection performance under complex field conditions.

2.3. Rice Seedling Detection Model and Improvements

Among single-stage object detection models, YOLOv8 is well known for its efficiency and flexibility, providing an excellent balance between detection speed and accuracy. The YOLOv8n variant, in particular, features a compact network architecture with fast inference speed, making it an ideal candidate for lightweight detection tasks within the YOLO series. Consequently, we adopted the YOLOv8n model as the baseline for rice seedling detection in this study.

The improved YOLOv8n architecture, shown in Figure 5, consists of four primary components: input, backbone, neck, and head modules.

In the neck module, an improved BiFPN is integrated to enhance multi-scale feature fusion by incorporating shallow feature maps from the backbone. This strategy enables effective fusion of features across varying seedling colors, sizes, and complex background conditions. Additionally, the conventional convolution operations used for channel alignment and feature enhancement are replaced by the GLSA module, which improves the model’s focus on critical features while adjusting channel dimensions. Lastly, before feeding the feature maps into the head, a CGAFusion module is introduced to automatically balance and integrate both the original and enhanced features. This module emphasizes the most informative representations while suppressing irrelevant background noise.

2.3.1. Improved BiFPN Feature Fusion Network

BiFPN is a highly efficient and scalable bidirectional multi-scale feature fusion framework. Compared to traditional Feature Pyramid Networks (FPNs) [24] or the Path Aggregation Network (PANet) [25] structure used in the original YOLOv8n, BiFPN reduces redundant connections and introduces learnable weights, achieving effective feature fusion while minimizing computational overhead, as illustrated in Figure 6. Here, P₂, P₃, P₄, and P₅ denote feature maps at four different scales, which are extracted by the backbone network from the input image through down-sampling by factors of 4, 8, 16, and 32 (i.e., 2², 2³, 2⁴, and 2⁵), respectively.

To further enhance detection performance for small-scale rice seedlings, the BiFPN used in this study was modified to incorporate the shallow P₂ feature map into the fusion process. The P₂ layer retains rich low-level spatial details, significantly improving the model’s ability to detect small objects without a significant increase in computational cost.
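The benefit of the P₂ level can be made concrete by listing feature-map sizes per pyramid level. The sketch below assumes the usual convention that level P_k has a stride of 2^k pixels, and assumes an 800 × 800 tile is fed to the network at native size; neither the input size nor the helper name `feature_map_size` comes from the paper.

```python
INPUT_SIZE_PX = 800  # assumed network input: one tile at native size


def feature_map_size(input_px: int, level: int) -> int:
    """Side length of pyramid level P_level, assuming a stride of 2**level."""
    return input_px // (2 ** level)


for level in (2, 3, 4, 5):
    stride = 2 ** level
    size = feature_map_size(INPUT_SIZE_PX, level)
    print(f"P{level}: stride {stride:2d}, {size} x {size} cells")
```

Under these assumptions, a hypothetical 24-pixel-wide seedling box spans 6 cells at P₂ but less than one cell (0.75) at P₅, which illustrates why the shallow level carries most of the usable spatial detail for small targets.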

BiFPN constructs both top-down and bottom-up pathways, facilitating bidirectional information exchange and reinforcement across different feature levels. During the fusion process, a learnable weighted feature fusion strategy is employed to prevent redundancy and information conflicts that commonly arise from simple feature stacking. Specifically, the feature maps $F_i$ from different levels are first spatially aligned via interpolation and then fused using learnable weights:

$$F_i' = \mathrm{Interpolate}(F_i,\ \mathrm{size} = H \times W) \tag{1}$$

$$F_{fused} = \mathrm{Conv}_{3 \times 3}\left(\sum_{i=1}^{n} w_i \cdot F_i'\right), \quad \text{where } \sum_{i=1}^{n} w_i = 1,\ w_i \geq 0 \tag{2}$$

where $F_i'$ denotes the spatially aligned feature map, and $w_i$ represents the learnable non-negative fusion weights, which are activated by ReLU and normalized to satisfy the sum-to-one constraint. H and W denote the target height and width of the aligned feature maps, respectively, ensuring that all input features share a consistent spatial resolution before fusion. The final fused output is then refined through a 3 × 3 convolution to capture contextual semantics and fine-grained details.
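The weight handling in Eq. (2) can be sketched in plain Python on toy data. This is an illustrative sketch rather than the authors' implementation: feature maps are flattened to short lists, the 3 × 3 convolution is omitted, and the small constant `eps` (as used in the fast normalized fusion of the original BiFPN design) is an assumption, so the weights sum to one only approximately.

```python
def normalize_fusion_weights(raw_weights, eps=1e-4):
    """Apply ReLU to raw weights, then scale them so they sum to (almost) one."""
    positive = [max(0.0, w) for w in raw_weights]
    total = sum(positive) + eps  # eps (assumed) guards against division by zero
    return [w / total for w in positive]


def fuse(features, raw_weights):
    """Weighted element-wise sum of already-aligned feature maps (flat lists)."""
    weights = normalize_fusion_weights(raw_weights)
    return [
        sum(w * f[k] for w, f in zip(weights, features))
        for k in range(len(features[0]))
    ]


# Two toy "feature maps" of four values each, already at the same resolution.
shallow = [1.0, 2.0, 3.0, 4.0]
deep = [4.0, 3.0, 2.0, 1.0]
weights = normalize_fusion_weights([3.0, 1.0])
fused = fuse([shallow, deep], [3.0, 1.0])
print([round(w, 3) for w in weights])
print([round(v, 3) for v in fused])
```

With raw weights of 3 and 1, the shallow map contributes about three quarters of each fused value, which mirrors how a larger learned weight lets high-resolution features dominate the fusion.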

2.3.2. Integration of the GLSA Module

In the BiFPN feature fusion structure, convolution operations are typically employed to align the channel dimensions of multi-scale features extracted by the backbone network, ensuring compatibility for multi-scale fusion. However, traditional convolution operations tend to reduce channel dimensions while neglecting spatial context and local detail information, limiting the effective fusion of high-level semantic and low-level detailed features.

To enhance the representation power of features before fusion, in this study we introduce the GLSA module, proposed by Tang et al. [26]. This module performs channel compression and semantic enhancement on the feature maps output by the backbone network. As shown in Figure 7, the GLSA module consists of two branches: the Global Spatial Attention (GSA) branch and the Local Spatial Attention (LSA) branch. The GSA branch focuses on capturing global contextual information to improve the recognition of large objects, while the LSA branch emphasizes local detail extraction to facilitate the precise localization of small objects.

Specifically, an input feature map $F_{in} \in \mathbb{R}^{C \times H \times W}$ is first split equally along the channel dimension into a local and a global sub-branch:

$$X_{LSA},\ X_{GSA} = \mathrm{Split}(F_{in}) \tag{3}$$

The LSA branch employs multiple layers of depthwise separable convolutions and residual connections, denoted $\varphi(\cdot)$, to extract local features and generate an attention mask for pixel-wise fusion:

$$F_{LSA} = X_{LSA} + X_{LSA} \odot \sigma(\varphi(X_{LSA})) \tag{4}$$

where $\sigma(\cdot)$ denotes the Sigmoid activation function and $\odot$ represents element-wise multiplication.

The GSA branch utilizes attention weighting $\psi(\cdot)$ combined with a multi-layer perceptron (MLP) to capture global context, which is then added to the original feature:

$$F_{GSA} = X_{GSA} + \mathrm{MLP}(\psi(X_{GSA})) \tag{5}$$

Finally, the outputs of both branches are concatenated along the channel dimension and fused through a $1 \times 1$ convolution to generate the enhanced output:

$$F_{out} = \mathrm{Conv}_{1 \times 1}([F_{LSA}, F_{GSA}]) \tag{6}$$

This design effectively integrates global context and local details, thereby strengthening the model’s focus on critical features while suppressing irrelevant background information.

2.3.3. Incorporation of the CGAFusion Module in the Detection Head

To further enhance the model’s discriminative capability in complex paddy field environments, in this study we incorporated the CGAFusion module, proposed by Chen et al. [27], into the detection head stage, as shown in Figure 8. The CGAFusion module employs a Content-Guided Attention (CGA) mechanism that integrates three sub-attention modules: Channel Attention (CA), Spatial Attention (SA), and Pixel Attention (PA). These submodules collectively refine the input features from three complementary perspectives: global semantics, regional saliency, and fine-grained local distributions.

Specifically, the Spatial Attention module first generates a spatial importance map (SIM) for each channel to highlight prominent regions within the image. The Channel Attention module then assigns importance weights to each feature channel, thereby enhancing the semantic representation of salient features. Finally, the Pixel Attention module performs fine-grained modeling of the fused features, improving feature expressiveness, particularly near object edges or in occluded regions.

The combined attention output W is then used to guide the fusion of low-level and high-level features, followed by a $1 \times 1$ convolution for final mapping. The overall process can be formulated as:

$$W = \sigma(\mathrm{PA}([F_{init},\ \mathrm{SA}(F_{init}) + \mathrm{CA}(F_{init})])) \tag{7}$$

$$F_{fused} = \mathrm{Conv}_{1 \times 1}(F_{init} + W \cdot F_x + (1 - W) \cdot F_y) \tag{8}$$

where $F_x$ and $F_y$ denote the low-level and high-level features, respectively, and $F_{init} = F_x + F_y$ represents the initial fused feature. The operators $\mathrm{SA}(\cdot)$, $\mathrm{CA}(\cdot)$, and $\mathrm{PA}(\cdot)$ denote the Spatial, Channel, and Pixel Attention modules, respectively. The symbol $\sigma(\cdot)$ denotes the Sigmoid activation function.

By adaptively balancing and integrating the original and enhanced features, the CGAFusion module strengthens the representation of informative regions while suppressing redundant background noise, thus improving the model’s detection performance under complex and variable field conditions.
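The gating behaviour of Eq. (8) can be illustrated on toy values. In this sketch the attention logits are hypothetical inputs standing in for the output of the pixel-attention stage in Eq. (7), the features are flat lists rather than tensors, and the final 1 × 1 convolution is omitted; it is not the module's actual implementation.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function used to squash attention logits into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def cga_blend(f_x, f_y, attention_logits):
    """Eq. (8) without the final 1x1 convolution, applied element by element."""
    out = []
    for x, y, a in zip(f_x, f_y, attention_logits):
        w = sigmoid(a)   # per-element weight W in (0, 1)
        f_init = x + y   # initial fused feature F_init
        out.append(f_init + w * x + (1.0 - w) * y)
    return out


low_level = [2.0, 2.0, 2.0]     # toy low-level feature F_x
high_level = [0.0, 0.0, 0.0]    # toy high-level feature F_y
logits = [-20.0, 0.0, 20.0]     # hypothetical attention logits
blended = cga_blend(low_level, high_level, logits)
print([round(v, 3) for v in blended])
```

A strongly negative logit leaves the output close to F_init plus the high-level feature, a strongly positive one adds nearly all of the low-level feature, and a logit of zero mixes the two equally.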

2.4. Missing Seedling Detection

Based on the detection of individual rice seedlings, the spacing between seedlings can be calculated and used as a criterion for identifying missing seedlings. As shown in Figure 9, the regions corresponding to detected rice seedlings are labeled in white according to the annotation files generated by the detection model. When the spacing between two seedlings is sufficiently small, their white-labeled regions may overlap and form a single connected area, indicating that no missing seedlings exist between them. Conversely, when the spacing exceeds a certain threshold, the number of missing seedlings can be estimated, and the corresponding regions are marked in red.

To quantify the number of missing seedlings, we adopted a method based on the ratio of the inter-seedling distance to the threshold for missing seedling spacing, followed by a floor operation. The calculation is expressed as:

$$n = \left\lfloor \frac{D}{d} \right\rfloor \tag{9}$$

where $\lfloor \cdot \rfloor$ denotes the floor function, D is the distance between two adjacent seedlings, d is the threshold for missing seedling spacing, and n is the estimated number of missing seedlings. The threshold for identifying missing seedling spacing is determined based on the rice transplanter’s planting distance of 10 cm, with the corresponding spacing set to 25 pixels according to the image’s ground sampling distance.
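Eq. (9) amounts to a single floor division, sketched below. The 25-pixel threshold is taken from the text (at 3.75 mm/pixel, 10 cm corresponds to about 26.7 pixels); the example gap values and the function name are hypothetical, and how D is measured between detections is left as in the description above.

```python
import math

THRESHOLD_PX = 25  # spacing threshold d from the text (10 cm planting distance)


def missing_seedlings(distance_px: float, threshold_px: float = THRESHOLD_PX) -> int:
    """Estimate missing seedlings between two neighbours as floor(D / d)."""
    return math.floor(distance_px / threshold_px)


# Hypothetical inter-seedling distances in pixels.
for gap in (20, 60, 110):
    print(gap, "px ->", missing_seedlings(gap))
```

For these example gaps the estimates are 0, 2, and 4 missing seedlings, respectively.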

2.5. Experimental Environment

The training and testing of the proposed model were conducted on a high-performance computer (HPC) running the Windows 11 operating system (Microsoft Corporation, Redmond, WA, USA). The hardware configuration includes an Intel i9-13900KF processor (Intel Corporation, Santa Clara, CA, USA), 32 GB of RAM, and a GeForce RTX 4080 GPU (NVIDIA Corporation, Santa Clara, CA, USA).

The software environment consists of Python 3.11 (Python Software Foundation, Wilmington, DE, USA), CUDA 12.3 (NVIDIA Corporation, Santa Clara, CA, USA), and cuDNN 8.9.6.50 (NVIDIA Corporation, Santa Clara, CA, USA), with PyTorch 2.2.1 (Meta Platforms, Inc., Menlo Park, CA, USA) serving as the deep learning framework. These versions were selected because PyTorch 2.2.1 provides optimized support for NVIDIA GPUs under CUDA 12.3 and cuDNN 8.9, ensuring stable training and efficient parallel computation. Python 3.11 offers improved execution efficiency and compatibility with the selected deep learning libraries. This configuration collectively provides a reliable and high-performance environment for training and evaluating the proposed model.

2.6. Model Evaluation Metrics

To comprehensively evaluate the performance of the improved YOLOv8n model, the following metrics were used: mean average precision at Intersection over Union (IoU) 0.5 (mAP@0.5), precision (P), and recall (R). The specific formulas are defined as follows:

$$R\ (\mathrm{Recall}) = \frac{TP}{TP + FN} \tag{10}$$

$$P\ (\mathrm{Precision}) = \frac{TP}{TP + FP} \tag{11}$$

$$AP = \int_{0}^{1} P(R)\, dR \tag{12}$$

$$mAP@0.5 = \frac{1}{Q} \sum_{i=1}^{Q} AP_i\ (IoU \geq 0.5) \tag{13}$$

where TP is the number of correctly detected rice seedlings, FP is the number of falsely detected rice seedlings, and FN is the number of missed detections. The AP represents the area under the precision–recall curve, and mAP@0.5 denotes the mean AP across all test samples at an IoU threshold of 0.5. These metrics quantitatively reflect the model’s key performance in practical scenarios.
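The definitions in Eqs. (10) to (12) can be sketched as follows. The counts and precision-recall points are hypothetical, not taken from the experiments, and the trapezoidal rule is only one simple way to approximate the integral in Eq. (12); detection toolkits commonly use interpolated precision-recall curves instead.

```python
def precision_recall(tp: int, fp: int, fn: int):
    """Precision and recall from detection counts (Eqs. 10 and 11)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def average_precision(recalls, precisions):
    """Trapezoidal approximation of the area under the P(R) curve (Eq. 12)."""
    points = sorted(zip(recalls, precisions))
    area = 0.0
    for (r0, p0), (r1, p1) in zip(points, points[1:]):
        area += (r1 - r0) * (p0 + p1) / 2.0
    return area


# Hypothetical counts, not taken from the paper's experiments.
p, r = precision_recall(tp=90, fp=10, fn=30)
print(p, r)  # 0.9 0.75

# Hypothetical precision-recall points.
ap = average_precision([0.0, 0.5, 1.0], [1.0, 0.9, 0.6])
print(round(ap, 3))  # 0.85
```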

In addition to accuracy, computational efficiency was evaluated using frames per second (FPS) and model size. FPS measures the inference speed, calculated as the average number of images processed per second on the test device with a batch size of 1. Model size refers to the storage space occupied by the model weights, indicating the deployment feasibility on resource-constrained devices.
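A batch-size-1 FPS measurement can be sketched as below. The `dummy_infer` function is a hypothetical stand-in for a detector's forward pass, and the warm-up calls are an assumption rather than part of the described protocol.

```python
import time


def measure_fps(infer, images, warmup=2):
    """Average images per second when `infer` is called on one image at a time."""
    for image in images[:warmup]:  # warm-up calls are excluded from the timing
        infer(image)
    start = time.perf_counter()
    for image in images:
        infer(image)               # batch size of 1: one image per call
    elapsed = time.perf_counter() - start
    return len(images) / elapsed


def dummy_infer(image):
    """Hypothetical stand-in for a detector's forward pass."""
    return sum(image)


fps = measure_fps(dummy_infer, [list(range(1000))] * 50)
print(f"{fps:.0f} images per second")
```

For reference, a rate of 117 FPS corresponds to an average latency of about 8.5 ms per image.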

Furthermore, to quantitatively validate the effectiveness of the missing seedling detection algorithm (described in Section 2.4), we employed four statistical metrics: Mean Absolute Error (MAE), Root Mean Square Error (RMSE), Counting Accuracy, and the Coefficient of Determination (R²). These metrics are calculated as follows:

$$MAE = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i| \tag{14}$$

$$RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} \tag{15}$$

$$\mathrm{Counting\ Accuracy} = \left(1 - \frac{\sum_{i=1}^{N} |y_i - \hat{y}_i|}{\sum_{i=1}^{N} y_i}\right) \times 100\% \tag{16}$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{N} (y_i - \bar{y})^2} \tag{17}$$

where N represents the number of test images (randomly selected for statistical analysis), $y_i$ denotes the manually counted ground truth of missing seedlings for the i-th image, $\hat{y}_i$ is the number of missing seedlings predicted by the algorithm, and $\bar{y}$ is the mean of the ground truth values. MAE and RMSE measure the magnitude of the counting error, while Counting Accuracy and $R^2$ reflect the overall reliability and correlation between the automated detection and manual counting.
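Eqs. (14) to (17) translate directly into a few lines of Python. The manual counts and predictions below are hypothetical values chosen for easy checking, not results from the study.

```python
import math


def counting_metrics(y_true, y_pred):
    """Return MAE, RMSE, counting accuracy (%), and R^2 (Eqs. 14-17)."""
    n = len(y_true)
    abs_err = [abs(t - p) for t, p in zip(y_true, y_pred)]
    sq_err = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    mean_true = sum(y_true) / n
    mae = sum(abs_err) / n
    rmse = math.sqrt(sum(sq_err) / n)
    accuracy = (1.0 - sum(abs_err) / sum(y_true)) * 100.0
    r2 = 1.0 - sum(sq_err) / sum((t - mean_true) ** 2 for t in y_true)
    return mae, rmse, accuracy, r2


# Hypothetical manual counts and predictions for four test images.
manual = [10, 20, 30, 40]
predicted = [12, 18, 30, 40]
mae, rmse, acc, r2 = counting_metrics(manual, predicted)
print(mae, round(rmse, 3), round(acc, 1), round(r2, 3))
```

For this toy input the MAE is 1.0, the RMSE is about 1.414, the counting accuracy is 96%, and R² is 0.984.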

3. Results

This section presents the experimental results used to evaluate the effectiveness of the proposed high-altitude rice seedling detection model. We report both quantitative and qualitative findings, including ablation studies, comparative experiments with baseline methods, and application-oriented analyses. These results collectively demonstrate the model’s performance, robustness, and practical utility under real-world field conditions.

3.1. Ablation Study

This subsection investigates the contribution of each proposed module to the overall detection performance. By incrementally adding or removing components from the network architecture, we analyze their individual effects through quantitative metrics and visual results. The ablation study helps clarify the necessity and effectiveness of the BiFPN enhancement, GLSA module, and CGAFusion module.

3.1.1. Ablation Results and Metric Visualization

To evaluate the contribution of each proposed module, six groups of ablation experiments were conducted based on the YOLOv8n baseline model. The results are presented in Table 3, where A, B, and C represent the improved BiFPN, GLSA module, and CGAFusion module, respectively.

The baseline YOLOv8n achieves an mAP@0.5 of 92.4% with a model size of 6.0 MB. After replacing the original PANet with the BiFPN structure, the mAP@0.5 increases by 1.4% while the model size decreases significantly to 4.1 MB. This reduction occurs because BiFPN optimizes the feature fusion path, removing redundant connections while utilizing weighted fusion to retain critical shallow features.

The individual introduction of the GLSA and CGAFusion modules also brings performance gains. Model B improves precision (+0.8%) by enhancing spatial focus, though at the cost of a larger model size (8.0 MB). Model C improves mAP@0.5 to 93.1% with a negligible size increase, validating the efficiency of the attention-guided fusion.

Further, combining the GLSA and BiFPN modules improves mAP@0.5 by 2.2%, highlighting their complementarity in spatial attention and multi-scale modeling. Finally, adding the CGAFusion module results in the highest mAP@0.5 (94.7%), P (91%), and R (91.2%). Although the CGAFusion module adds a moderate computational load, decreasing FPS from 137 to 117, it is crucial for generating spatial importance maps that guide attention effectively.

In summary, the proposed model achieves a superior trade-off. Compared to the baseline, it boosts accuracy significantly (+2.3% mAP) while maintaining a compact size (6.3 MB, only +0.3 MB increase). Although the inference speed drops to 117 FPS due to complex attention calculations, it remains well above the real-time threshold (30 FPS) [28], ensuring practical deployability on UAV platforms.

All optimized configurations outperform the baseline in key metrics. The mAP@0.5 curves for the six models are shown in Figure 10, confirming that the proposed enhancements improve both detection accuracy and stability.

3.1.2. Ablation Detection Results

In Figure 11, the detection results and feature activation maps (Grad-CAM++ [29]) of the baseline and improved models are compared under different field conditions. The heatmaps visualize the aggregated attention of the P₃, P₄, and P₅ feature layers from the output layers of the neck network, where warmer colors (red/orange) indicate regions of higher model focus.

The visualization results in Figure 11 expose distinct limitations of the baseline model across different scenarios. First, in the case of tiny seedlings (Figure 11a), the baseline model suffers from insufficient feature representation. The corresponding heatmaps show extremely weak and faint activations on the targets, indicating that the model fails to extract salience from such small objects, directly leading to missed detections. Second, for densely distributed seedlings (Figure 11b), the model struggles with instance discrimination. The heatmaps exhibit a strong adhesive effect, where activations from adjacent seedlings blur and merge into indistinguishable clusters. This lack of clear boundary separation prevents the model from effectively resolving individual plants. Third, under background interference (Figure 11c), the baseline lacks semantic discrimination capabilities. It generates erroneous high-response hotspots on floating straw and other background textures, causing these non-target elements to be misclassified as seedlings.

In contrast, the improved model demonstrates superior performance across these challenging field conditions. The Grad-CAM++ heatmaps show that the improved model generates sharper and more concentrated activation regions that precisely align with the rice seedlings, effectively suppressing background noise. This indicates that BiFPN captures shallow and multi-scale features, the GLSA module enhances focus on critical regions, and the CGAFusion module suppresses background noise while highlighting relevant features. The results illustrate the accurate detection of small targets, a significant reduction in straw misclassification, and improved completeness in dense areas. These results confirm the practical effectiveness and robustness of the enhanced model for rice seedling detection in complex field environments.

3.2. Comparison Experiments

In this subsection, we compare the proposed model with mainstream object detection algorithms to assess its relative performance. Both metric-based evaluations and qualitative visualizations are included to demonstrate the advantages of the proposed approach in handling high-altitude UAV imagery and small-object detection scenarios.

3.2.1. Comparison Results and Metric Visualization

To further verify the effectiveness of the proposed model for rice seedling detection, comparative experiments were conducted using mainstream object detection models representing different architectures: Faster R-CNN (two-stage), RT-DETR-r18 (Transformer-based), and state-of-the-art YOLO series (one-stage). Considering the strict constraints on storage and inference speed for UAV edge deployment, we specifically selected the lightweight “nano” versions (YOLOv8n, YOLOv10n, and YOLOv12n) and the smallest RT-DETR variant to ensure a fair comparison within the same magnitude of computational complexity. Consequently, the improved model (YOLOv8n + A + B + C) was evaluated against these lightweight counterparts. All models were trained and evaluated under identical experimental settings, including the same dataset split, input resolution, and hardware environment. The experimental results are presented in Table 4.

To facilitate intuitive comparison of model performance across multiple dimensions, Figure 12 presents a radar chart visualization of the results in Table 4. The chart employs piecewise normalization for accuracy metrics (mAP@0.5, P, and R) with segmented mapping (50–80→0–0.3, 80–95→0.3–1.0) to emphasize differences among high-performance models, while FPS is normalized to its maximum value and model size is inversely normalized (smaller is better).
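The normalization described above can be sketched as follows. This is a minimal illustration, not the authors' plotting code: the function names are hypothetical, clamping of values outside 50–95% is an assumption, and the inverse size normalization is assumed to be the ratio of the smallest model size to the model's own size, since the text only states that smaller is better.

```python
def normalize_accuracy(value_pct: float) -> float:
    """Map an accuracy metric (in percent) onto [0, 1] with the two-segment
    mapping from the text: 50-80 -> 0-0.3 and 80-95 -> 0.3-1.0."""
    # Assumption: values outside 50-95 are clamped to the mapped range.
    v = min(max(value_pct, 50.0), 95.0)
    if v <= 80.0:
        return (v - 50.0) / 30.0 * 0.3
    return 0.3 + (v - 80.0) / 15.0 * 0.7


def normalize_fps(fps: float, max_fps: float) -> float:
    """Scale FPS by the maximum FPS among the compared models."""
    return fps / max_fps


def normalize_size(size_mb: float, min_size_mb: float) -> float:
    """Assumed inverse normalization: the smallest model scores 1.0."""
    return min_size_mb / size_mb


if __name__ == "__main__":
    # mAP@0.5 values reported for the baseline and the improved model.
    for name, m_ap in [("YOLOv8n", 92.4), ("improved YOLOv8n", 94.7)]:
        print(name, round(normalize_accuracy(m_ap), 3))
```

Stretching the 80–95% segment over 70% of the axis is what makes a gap of a few percentage points between high-performing models visible on the radar chart.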

As shown in Table 4 and visualized in Figure 12, Faster R-CNN achieves the lowest average precision and has the largest model size, making it unsuitable for rice seedling detection tasks. YOLOv10n, YOLOv12n and RT-DETR-r18 (83.1%, 91.2% and 83.0% mAP@0.5, respectively), despite their advanced architectures, perform worse than the baseline YOLOv8n (92.4%). This performance drop is likely due to their architectural bias towards larger objects commonly found in datasets like COCO; they struggle to converge effectively on the tiny, dense targets characteristic of rice seedling datasets without extensive architectural adaptation. The baseline YOLOv8n outperforms YOLOv10n, YOLOv12n, and RT-DETR-r18 in overall performance, demonstrating its feasibility for this application. This suggests that, without task-specific optimizations for small objects and complex backgrounds, newer models do not necessarily deliver better results.

The improved YOLOv8n model delivers the highest detection accuracy among the compared models. To evaluate the statistical significance and reproducibility of these results, we conducted five independent experiments with different random seeds for both the baseline and the proposed model (Table S1). As shown in Table 4, the proposed model achieved an average mAP@0.5 of 94.7% (±0.3%), consistently outperforming the baseline YOLOv8n, which achieved 92.4% (±0.3%). The non-overlapping performance ranges (baseline: 92.0–92.7%; ours: 94.4–95.1%) and a t-test (p < 0.05) confirm that the improvement is statistically significant and robust. This demonstrates the critical role of the proposed GLSA, BiFPN, and CGAFusion modules in accurately identifying rice seedlings. Regarding efficiency, although our model’s FPS (117) is lower than that of the baseline (237) because of the added computation, it is comparable to that of YOLOv12n (123 FPS) and remains sufficient for operational requirements. Coupled with a compact model size of 6.3 MB, our method offers the best trade-off, prioritizing the high precision essential for yield estimation while maintaining practical run-time performance.

3.2.2. Comparison Detection Results

Figure 13 illustrates the detection results of the six models under various complex background conditions.

Faster R-CNN, YOLOv10n and RT-DETR-r18 exhibit severe missed detections in dense clusters (Figure 13b), likely due to poor small-object feature resolution.

YOLOv8n and YOLOv12n reduce missed detections but suffer from higher false positive rates in areas with floating straw (Figure 13c), indicating insufficient background suppression capabilities.

In contrast, the improved YOLOv8n model overcomes these trade-offs. By leveraging BiFPN for multi-scale feature retention, GLSA for feature enhancement, and CGAFusion for noise filtering, it achieves superior detection performance under challenging conditions involving environmental noise, tiny seedlings, and dense planting areas, outperforming all other tested models.

3.3. Application Demonstration: Quantitative Evaluation of Missing Seedling Detection

Building on the missing seedling detection algorithm described in Section 2.4, we further integrated it with the improved YOLOv8n model to evaluate the planting status of rice seedlings. As shown in Figure 14, by calculating the spacing between seedlings and applying a missing-seedling threshold, missing regions can be effectively identified.
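The spacing-and-threshold idea can be illustrated with a short sketch. It is not the algorithm of Section 2.4 or Equation (9): the function name, the one-dimensional row coordinates, the threshold factor of 1.5, and the rounding rule for the number of missing plants are all assumptions made for illustration.

```python
def find_missing(positions, spacing, threshold_factor=1.5):
    """Return (left, right, n_missing) for every gap along one seedling row
    that exceeds threshold_factor * spacing.

    positions: detected seedling coordinates along the row (same unit as spacing)
    spacing:   nominal planting interval
    """
    ordered = sorted(positions)
    gaps = []
    for left, right in zip(ordered, ordered[1:]):
        gap = right - left
        if gap > threshold_factor * spacing:
            # Assumed rule: a gap of about k intervals hides k - 1 seedlings.
            n_missing = max(1, round(gap / spacing) - 1)
            gaps.append((left, right, n_missing))
    return gaps


if __name__ == "__main__":
    # Hypothetical detections (cm) along a row planted at 15 cm intervals.
    row = [0, 15, 30, 60, 75, 120]
    for left, right, n in find_missing(row, spacing=15):
        print(f"gap {left}-{right} cm: {n} missing")
```

A larger threshold factor makes the check more conservative, since moderately irregular spacing is then not reported as a missing seedling.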

To rigorously validate the effectiveness of the proposed algorithm, we conducted a statistical evaluation on 20 representative test images (Figure S1). These images were carefully selected to cover variations in spatial location, background complexity, and seedling density. The algorithm’s performance was assessed by comparing the automatically detected numbers of missing seedlings with manually counted ground-truth values. Ground truth was established through expert visual inspection of the original high-resolution UAV images, in which the actual number of missing seedlings was determined according to the standard planting interval. Table 5 summarizes the performance metrics comparing these manual counts against the automated results. As shown in Figure 15, the linear regression analysis yields an R² of 0.90, indicating a robust correlation. The system achieved a Counting Accuracy of 88.1% with an MAE of 3.75 and an RMSE of 4.57.
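For reference, the agreement metrics named above can be computed as in the following sketch. The counts in the usage example are hypothetical and are not the values of Table 5. MAE, RMSE, and R² follow their standard definitions; the counting-accuracy formula (one minus total absolute error over total manual count) is an assumption, since its definition is not restated in this section.

```python
import math


def counting_metrics(manual, predicted):
    """Compare automated missing-seedling counts with manual ground truth."""
    n = len(manual)
    errors = [p - m for m, p in zip(manual, predicted)]
    mae = sum(abs(e) for e in errors) / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)

    # R^2 of a simple linear regression equals the squared Pearson correlation.
    mean_m = sum(manual) / n
    mean_p = sum(predicted) / n
    sxy = sum((m - mean_m) * (p - mean_p) for m, p in zip(manual, predicted))
    sxx = sum((m - mean_m) ** 2 for m in manual)
    syy = sum((p - mean_p) ** 2 for p in predicted)
    r2 = sxy ** 2 / (sxx * syy)

    # Assumed definition of counting accuracy.
    accuracy = 1.0 - sum(abs(e) for e in errors) / sum(manual)
    return {"mae": mae, "rmse": rmse, "r2": r2, "accuracy": accuracy}


if __name__ == "__main__":
    manual_counts = [10, 20, 30, 40]  # hypothetical ground truth
    auto_counts = [8, 19, 30, 37]     # hypothetical algorithm output
    print(counting_metrics(manual_counts, auto_counts))
```

Negative errors in such a comparison correspond to points below the y = x line, that is, to underestimation of missing seedlings.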

Notably, the scatter plot reveals a slight trend of underestimation (data points below the y = x line). This is attributed to the conservative design of the distance thresholding algorithm (Equation (9)) and the irregular spacing of field planting. However, this result also indirectly confirms the high recall rate of our improved YOLOv8n model; if the detection model frequently missed seedlings, the algorithm would overestimate missing counts (points above the y = x line), which is rarely observed here.

4. Discussion

4.1. Trade-Off Between Efficiency and Accuracy

UAVs are an effective tool for large-area agricultural monitoring. However, in practical applications, very low-altitude flights (<10 m) provide limited image coverage. In low-texture paddy field environments, the small absolute overlap area between adjacent images often leads to unstable feature matching, making large-scale image mosaicking difficult and restricting such flight strategies to local, fine-scale observations. In contrast, increasing flight altitude substantially improves acquisition efficiency but inevitably reduces image resolution, which increases the difficulty of small-object detection. Therefore, selecting an appropriate flight altitude is critical to balancing acquisition efficiency and detection accuracy.

Under fixed imaging sensor parameters and flight overlap rates, the UAV flight altitude directly affects both the GSD and the image coverage area per flight. The GSD increases approximately proportionally with altitude, while the coverage area increases quadratically [30]. When the altitude increases from 10 m to 30 m, the coverage area of a single flight increases ninefold. This substantially reduces the number of flight lines, leading to an estimated time saving of roughly 90% relative to a 10 m flight. At the same time, the GSD increases with altitude, so seedling information is more likely to be lost during model down-sampling. This challenge becomes especially pronounced during convolutional feature extraction and attention-based fusion, where conventional approaches struggle to capture the small rice seedlings in low-spatial-resolution, high-altitude imagery.
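The arithmetic behind the ninefold coverage figure can be checked with a short sketch. The function names are hypothetical, and the time estimate rests on the simplifying assumption that acquisition time is inversely proportional to the area covered, which ignores take-off, turns, and any change in flight speed.

```python
def coverage_ratio(low_altitude_m: float, high_altitude_m: float) -> float:
    """Coverage area scales with the square of the altitude ratio."""
    return (high_altitude_m / low_altitude_m) ** 2


def time_reduction(low_altitude_m: float, high_altitude_m: float) -> float:
    """Fraction of acquisition time saved, assuming time is inversely
    proportional to covered area (a simplification)."""
    return 1.0 - 1.0 / coverage_ratio(low_altitude_m, high_altitude_m)


if __name__ == "__main__":
    print(coverage_ratio(10, 30))            # 9.0
    print(round(time_reduction(10, 30), 3))  # 0.889
```

Under this idealization the saving is 1 − 1/9, or about 89%, which is consistent with the rounded estimate in the text.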

UAVs equipped with high-resolution RGB sensors can efficiently acquire large-area paddy field imagery while maintaining a sufficiently fine GSD, providing a robust data foundation for automated seedling detection [31]. Chen et al. [14] used a DJI MAVIC 3M (image resolution of 5280 × 3956 pixels) to capture rice seedling images at heights of 12 m (GSD of 3.23 mm/pixel) and 15 m (GSD of 4.03 mm/pixel). The models achieved satisfactory accuracy at both altitudes, and a 15 m flight altitude was recommended to balance identification performance with flight efficiency. In this study, the Zenmuse P1 camera (image resolution of 8192 × 5460 pixels) was used. With this camera, matching the 4.03 mm/pixel GSD reported by Chen et al. [14] would correspond to a flight altitude of 32.24 m. However, considering the actual flight conditions, a flight altitude of 30 m was selected, giving a GSD of 3.75 mm/pixel.
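Because GSD scales approximately linearly with altitude for a fixed camera, the 32.24 m figure follows directly from the reported reference point of 3.75 mm/pixel at 30 m. The sketch below uses only that reference point; the function names are hypothetical and the linear model is the approximation stated in the text.

```python
REF_ALTITUDE_M = 30.0  # flight altitude used in this study
REF_GSD_MM = 3.75      # GSD reported at that altitude (mm/pixel)


def gsd_at(altitude_m: float) -> float:
    """GSD (mm/pixel) at a given altitude, assuming linear scaling."""
    return REF_GSD_MM * altitude_m / REF_ALTITUDE_M


def altitude_for_gsd(target_gsd_mm: float) -> float:
    """Altitude (m) at which the same camera reaches a target GSD."""
    return REF_ALTITUDE_M * target_gsd_mm / REF_GSD_MM


if __name__ == "__main__":
    print(round(altitude_for_gsd(4.03), 2))  # 32.24
    print(round(gsd_at(10.0), 2))            # 1.25
```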

While a 30 m flight altitude ensures an appropriate GSD and substantially improves acquisition efficiency, it also results in weaker target features and stronger background interference, as rice seedlings occupy fewer pixels with blurred textures and edges. This shift places higher demands on the detection model’s ability to preserve fine spatial details and suppress background noise. In this context, YOLOv8n was selected for its favorable balance between efficiency and accuracy. Its C2f module integrates high-level semantic and low-level spatial features, and the anchor-free detection head enables flexible localization of small and irregular targets such as rice seedlings. However, under the low-resolution conditions of high-altitude UAV imagery, the original YOLOv8n exhibits limitations: shallow features are insufficiently emphasized during multi-scale fusion, leading to missed detections, while PANet-based fusion struggles to distinguish true seedlings from visually similar background, resulting in false positives. Therefore, targeted architectural improvements were introduced to enhance feature preservation and discrimination in low-resolution, high-interference scenarios.

4.2. Advantages of the Improved Model

The proposed detection framework modifies the YOLOv8n architecture to address the challenges of high-altitude (30 m) seedling detection: feature vanishing, texture blurring, and background interference. Unlike previous studies that relied on high-resolution imagery (2–15 m altitude) to capture rich plant details, our method enhances weak feature representation under low-resolution, high-altitude conditions through three mechanism-level improvements.

First, the improved BiFPN significantly enhances small-object detection by strengthening shallow, high-resolution features during multi-scale fusion. At a 30 m UAV altitude, rice seedlings occupy very few pixels, and their discriminative cues mainly come from edge contours and local textures. In standard YOLOv8 architectures, repeated down-sampling and unidirectional aggregation suppress these fine details, leading to weak responses for tiny targets. The improved BiFPN addresses this through bidirectional multi-scale fusion with learnable weights, adaptively preserving and propagating shallow feature information throughout the network [32]. This structurally mitigates small-object information loss under high-altitude conditions, yielding a 1.4% improvement in mAP@0.5 over the baseline when used alone (Section 3.1.1). However, the added fusion paths and nodes increase computational cost, indicating that further optimization is needed for deployment in resource-constrained scenarios.
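Fusion with learnable weights can be illustrated by the fast normalized fusion used in the original BiFPN design, where each input feature map receives a non-negative scalar weight and the weights are normalized by their sum. The sketch below is a plain-Python illustration on flattened feature vectors. It assumes the improved BiFPN keeps this fusion rule, which is not confirmed in this section, and the function name and example values are hypothetical.

```python
def fast_normalized_fusion(features, weights, eps=1e-4):
    """Fuse equally sized feature vectors with learnable scalar weights:
    output = sum_i(w_i * F_i) / (eps + sum_i(w_i)), with w_i = max(0, w_i).
    """
    # ReLU keeps every weight non-negative, so the result stays a
    # (nearly) convex combination of the inputs.
    w = [max(0.0, x) for x in weights]
    total = sum(w) + eps
    length = len(features[0])
    return [
        sum(wi * f[k] for wi, f in zip(w, features)) / total
        for k in range(length)
    ]


if __name__ == "__main__":
    shallow = [1.0, 1.0]  # hypothetical high-resolution (shallow) features
    deep = [3.0, 3.0]     # hypothetical upsampled deep features
    # A larger weight on the shallow branch pulls the output toward it.
    print(fast_normalized_fusion([shallow, deep], weights=[3.0, 1.0]))
```

During training the weights would be updated by backpropagation, which is how such a layer can learn to emphasize shallow features that carry small-object detail.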

Second, the GLSA module addresses the limitations of local convolutions under low-contrast and blurred high-altitude imagery. At 30 m, rice seedlings exhibit weak texture and contrast relative to backgrounds such as water and bare soil, making it difficult for standard CNNs or single-attention mechanisms to capture both global structure and fine-grained details. GLSA integrates global attention to model long-range dependencies across planting rows and local attention to enhance edges and neighborhood variations [33]. This dual mechanism enables the model to infer seedling presence even when individual textures are indistinct. In the ablation study, adding GLSA alone improves P by 0.8% over the baseline (Section 3.1.1), and the full model raises P from 89.5% to 91.0%, with more stable detection in dense seedling regions, highlighting its advantage in complex spatial structure modeling. However, GLSA is not optimized for extreme lightweight deployment; its use in resource-constrained or real-time scenarios requires balancing accuracy gains with computational cost.

Finally, the CGAFusion module primarily addresses false positives arising from complex field backgrounds. In rice seedling stage imagery, interference factors such as floating debris, straw, and water reflections can resemble seedlings in shape or color, and may be amplified during multi-scale feature enhancement, leading to misdetections. CGAFusion introduces a content-guided mechanism that adaptively fuses high-level semantic features from the neck with shallow structural features extracted by the backbone [34]. This fusion constrains the enhanced features with low-level structural information, preserving discriminative seedling cues while effectively suppressing background noise. In the ablation study, adding CGAFusion alone raises mAP@0.5 to 93.1% with a negligible size increase, and the full model, which includes it, improves recall by 2.6% over the baseline (88.6% to 91.2%) while reducing straw-induced false positives (Figure 11c). These results indicate stronger discriminative capability than the standard concatenation in the original YOLOv8n.

4.3. Future Research Directions

Despite achieving promising detection performance, the proposed method still faces several limitations.

First, the generalization ability of the model warrants further validation due to specific dataset constraints. The current dataset was collected from a single location, involving a single rice variety and acquisition date at a fixed UAV altitude. Although the dataset consists of only 100 cropped sub-images, it contains a total of 32,782 annotated rice seedlings, with an average of over 300 instances per image. This high density of small objects provides abundant supervision signals, which is consistent with observations in the UAV-based object detection literature, where dense distributions of small targets require substantial annotated instances to achieve reliable deep learning performance [35].

Furthermore, the dataset lacks coverage of complex field scenarios such as extensive bare soil exposure, strong water-surface reflections, or visually similar backgrounds, which frequently occur in practical paddy fields and can cause ambiguous classification or false detections. To validate the sufficiency of the data, we conducted a preliminary saturation experiment, training the model on subsets of 60, 80, and 100 images. The results indicated that performance improvements plateaued after 80 images, demonstrating that the current dataset (approximately 32 k instances) is adequate to train a robust model for this specific domain without overfitting (Table S2). Future work should expand the dataset across different temporal phases, geographical regions, and rice varieties, while integrating synthetic image generation and domain adaptation techniques to improve model robustness and generalization in diverse micro-environments. In addition, extending the framework to leverage multi-temporal UAV imagery, combined with temporal modeling approaches such as spatiotemporal Transformers [36] or temporal convolutional networks [37], could enable the capture of seedling growth dynamics and quality changes, further improving the applicability of the model for large-scale, time-sensitive monitoring.

Finally, the transition from high-performance workstations to edge-intelligent deployment is a critical step for practical agriculture. In our experiments conducted on a workstation equipped with an NVIDIA RTX 4080 GPU, the proposed model achieved an inference speed of 117 FPS. Although this is lower than the baseline model (237 FPS) due to the increased computational complexity of the added modules, it still offers a significant performance margin for real-time processing. Coupled with a compact model size of 6.3 MB, these metrics suggest that the model is structurally efficient. However, deployment on resource-constrained UAV onboard hardware poses greater challenges. Future work will focus on bridging this hardware gap by employing edge-specific optimization techniques, such as TensorRT acceleration, FP16/INT8 quantization, and channel pruning. These measures aim to ensure that the model sustains real-time performance (e.g., >30 FPS) even on embedded platforms like the NVIDIA Jetson series, enabling on-the-fly seedling diagnosis and variable-rate prescription.

5. Conclusions

This study focuses on a practical yet underexplored problem in rice seedling monitoring: when UAV flight altitude increases for large-area coverage, the resulting decrease in image resolution significantly weakens the detectability of small seedlings. Unlike most existing studies that rely on low-altitude and idealized imagery, we employed 30-m-altitude UAV images—characterized by reduced ground resolution, dense spatial seedling patterns, and complex background interference—to construct a realistic detection scenario.

To address the challenges of detecting small, densely distributed rice seedlings against complex backgrounds in large-area farmland, a multi-module enhanced YOLOv8n framework was developed. This framework systematically integrates a modified BiFPN for improved multi-scale feature fusion, a GLSA module for enriched spatial context modeling, and a CGAFusion module for robust feature refinement. These modules collaboratively enhance detection performance from three key aspects: feature fusion depth, spatial attention, and target saliency.

In terms of performance, the improved YOLOv8n model maintains a lightweight footprint (only 6.3 MB) while achieving notable improvements in key metrics: mAP@0.5, P, and R reached 94.7%, 91.0%, and 91.2%, respectively. Compared with the original YOLOv8n (92.4%, 89.5%, and 88.6%), the proposed model demonstrates substantial gains, particularly in detecting rice seedlings within complex backgrounds. Ablation studies further confirmed that each module contributes positively to detection performance and that their integration provides complementary and synergistic benefits. Extensive experiments on UAV remote sensing images acquired at a high flight altitude (30 m) validated the effectiveness and robustness of the proposed method.

Despite these promising results, several limitations remain. The dataset is specific to a single location, rice variety, acquisition date, and UAV altitude. Although dense annotation (over 32,000 seedlings across 100 images) provides sufficient supervision for training, the generalization of the model to other regions, cultivars, and varying field conditions requires further evaluation. Extending the framework to multi-temporal imagery and integrating temporal modeling approaches could enhance the monitoring of seedling growth dynamics and quality. Additionally, while the current model achieves 117 FPS on high-performance workstations, deployment on resource-constrained UAV platforms will require edge-specific optimizations, including quantization, pruning, and inference acceleration, to ensure real-time performance.

Overall, the results validate the practical applicability of the improved YOLOv8n framework for robust and accurate rice seedling detection in real-world, large-area paddy fields, offering a reliable foundation for subsequent precision agriculture monitoring and decision-making.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/agriculture16030307/s1, Figure S1: Twenty charts displaying the statistics of missing seedlings; Table S1: Performance stability of YOLOv8n and the improved YOLOv8n under different random seeds; Table S2: Performance comparison on different training dataset sizes.

Author Contributions

Conceptualization, Z.L. and X.Y.; Data curation, Z.L., S.B., D.H., T.Y. and M.T.; Formal analysis, Z.L. and X.Y.; Funding acquisition, L.L.; Investigation, L.L.; Methodology, Z.L. and L.L.; Project administration, L.L.; Resources, L.L.; Software, Z.L.; Supervision, L.L.; Validation, S.B., D.H., M.T. and T.Y.; Visualization, X.Y.; Writing—original draft, Z.L. and X.Y.; Writing—review & editing, X.Y., S.B., D.H., T.Y., M.T. and L.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by Shanghai Agricultural Science and Technology Innovation Program (Grant No. I2023005).

Data Availability Statement

The data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. He, L.; Du, B.; Xiong, L.; Wang, P.; Zhang, W.; Imran, M.; Sun, G.; Yuan, F.; Liu, Z.; Yao, X. Enhancing food security and farmers’ profit through ratoon rice-potato rotation in central China. Eur. J. Agron. 2025, 168, 127636.
2. Zhu, H.; Lu, X.; Zhang, K.; Xing, Z.; Wei, H.; Hu, Q.; Zhang, H. Optimum basic seedling density and yield and quality characteristics of unmanned aerial seeding rice. Agronomy 2023, 13, 1980.
3. Qin, J.; Hu, T.; Yuan, J.; Liu, Q.; Wang, W.; Liu, J.; Guo, L.; Song, G. Deep-learning-based rice phenological stage recognition. Remote Sens. 2023, 15, 2891.
4. Tan, S.; Liu, J.; Lu, H.; Lan, M.; Yu, J.; Liao, G.; Wang, Y.; Li, Z.; Qi, L.; Ma, X. Machine learning approaches for rice seedling growth stages detection. Front. Plant Sci. 2022, 13, 914771.
5. Gao, M.; Yang, F.; Wei, H.; Liu, X. Automatic monitoring of maize seedling growth using unmanned aerial vehicle-based RGB imagery. Remote Sens. 2023, 15, 3671.
6. Liao, J.; Wang, Y.; Yin, J.; Liu, L.; Zhang, S.; Zhu, D. Segmentation of rice seedlings using the YCrCb color space and an improved Otsu method. Agronomy 2018, 8, 269.
7. Ma, J.; Jiang, X.; Fan, A.; Jiang, J.; Yan, J. Image matching from handcrafted to deep features: A survey. Int. J. Comput. Vis. 2021, 129, 23–79.
8. Huang, S.; Wu, S.; Sun, C.; Ma, X.; Jiang, Y.; Qi, L. Deep localization model for intra-row crop detection in paddy field. Comput. Electron. Agric. 2020, 169, 105203.
9. Deng, X.; Qi, L.; Liu, Z.; Liang, S.; Gong, K.; Qiu, G. Weed target detection at seedling stage in paddy fields based on YOLOX. PLoS ONE 2023, 18, e0294709.
10. Tseng, H.-H.; Yang, M.-D.; Saminathan, R.; Hsu, Y.-C.; Yang, C.-Y.; Wu, D.-H. Rice seedling detection in UAV images using transfer learning and machine learning. Remote Sens. 2022, 14, 2837.
11. Yeh, J.-F.; Lin, K.-M.; Yuan, L.-C.; Hsu, J.-M. Automatic counting and location labeling of rice seedlings from unmanned aerial vehicle images. Electronics 2024, 13, 273.
12. Li, L.-H.; Chung, K.-L.; Jiang, L.-Q.; Sharma, A.K.; Liu, Y.-S. The study of light-weight YOLOv4 model for rice seedling and counting. In Proceedings of the 2022 International Conference on Computer Applications Technology (CCAT), Guangzhou, China, 14–16 July 2022; pp. 1–6.
13. Zhao, B.; Zhang, Q.; Liu, Y.; Cui, Y.; Zhou, B. Detection method for rice seedling planting conditions based on image processing and an improved YOLOv8n model. Appl. Sci. 2024, 14, 2575.
14. Chen, S.; Li, W.; Chen, D.; Xie, Z.; Zhang, S.; Cen, F.; Huang, X.; Tu, L.; Gao, Z. Recognition of rice seedling counts in UAV remote sensing images via the YOLO algorithm. Smart Agric. Technol. 2025, 12, 101107.
15. Cui, J.; Zheng, H.; Zeng, Z.; Yang, Y.; Ma, R.; Tian, Y.; Tan, J.; Feng, X.; Qi, L. Real-time missing seedling counting in paddy fields based on lightweight network and tracking-by-detection algorithm. Comput. Electron. Agric. 2023, 212, 108045.
16. Xia, Y.; Zhu, Z.; Liu, X. SSM-based detection of rice seedling deficiency. Sci. Rep. 2025, 15, 22605.
17. Yang, X.; Li, H.; Zhu, W.; Zuo, Y. RSHRNet: Improved HRNet-based semantic segmentation for UAV rice seedling images in mechanical transplanting quality assessment. Comput. Electron. Agric. 2025, 234, 110273.
18. Wu, S.; Chen, Z.; Bangura, K.; Jiang, J.; Ma, X.; Li, J.; Peng, B.; Meng, X.; Qi, L. A navigation method for paddy field management based on seedlings coordinate information. Comput. Electron. Agric. 2023, 215, 108436.
19. Wu, Y.; Yuan, S.; Tang, L. Plant recognition of maize seedling stage in UAV remote sensing images based on H-RT-DETR. Plant Methods 2025, 21, 60.
20. Giri, K.J. SO-YOLOv8: A novel deep learning-based approach for small object detection with YOLO beyond COCO. Expert Syst. Appl. 2025, 280, 127447.
21. Gao, J.; Tan, F.; Hou, Z.; Li, X.; Feng, A.; Li, J.; Bi, F. UAV-based automatic detection of missing rice seedlings using the PCERT-DETR model. Plants 2025, 14, 2156.
22. Li, C.; Deng, N.; Mi, S.; Zhou, R.; Chen, Y.; Deng, Y.; Fang, K. Automatic counting and location of rice seedlings in low altitude UAV images based on point supervision. Agriculture 2024, 14, 2169.
23. Yang, M.-D.; Tseng, H.-H.; Hsu, Y.-C.; Yang, C.-Y.; Lai, M.-H.; Wu, D.-H. A UAV open dataset of rice paddies for deep learning practice. Remote Sens. 2021, 13, 1358.
24. Lin, T.-Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
25. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768.
26. Tang, F.; Xu, Z.; Huang, Q.; Wang, J.; Hou, X.; Su, J.; Liu, J. DuAT: Dual-aggregation transformer network for medical image segmentation. In Proceedings of the Chinese Conference on Pattern Recognition and Computer Vision (PRCV), Xiamen, China, 13–15 October 2023; pp. 343–356.
27. Chen, Z.; He, Z.; Lu, Z.-M. DEA-Net: Single image dehazing based on detail-enhanced convolution and content-guided attention. IEEE Trans. Image Process. 2024, 33, 1002–1015.
28. Wei, X.; Yin, L.; Zhang, L.; Wu, F. DV-DETR: Improved UAV aerial small target detection algorithm based on RT-DETR. Sensors 2024, 24, 7376.
29. Chattopadhay, A.; Sarkar, A.; Howlader, P.; Balasubramanian, V.N. Grad-CAM++: Generalized gradient-based visual explanations for deep convolutional networks. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; pp. 839–847.
30. Maes, W.H. Practical guidelines for performing UAV mapping flights with snapshot sensors. Remote Sens. 2025, 17, 606.
31. Rejeb, A.; Abdollahi, A.; Rejeb, K.; Treiblmaier, H. Drones in agriculture: A review and bibliometric analysis. Comput. Electron. Agric. 2022, 198, 107017.
32. Shen, L.; Lang, B.; Song, Z. Object detection for remote sensing based on the enhanced YOLOv8 with WBIFPN. IEEE Access 2024, 12, 158239–158257.
33. Lu, J.; Cao, Z.; Wang, J.; Wang, Z.; Zhao, J.; Zhang, M. A picking point localization method for table grapes based on PGSS-YOLOv11s and morphological strategies. Agriculture 2025, 15, 1622.
34. Chen, C.; Yu, C.; Cai, S. Advancing landslide recognition through multi-dimensional feature fusion and transformer architectures. Vis. Comput. 2025, 41, 11311–11325.
35. Wang, X.; Wang, A.; Yi, J.; Song, Y.; Chehri, A. Small object detection based on deep learning for remote sensing: A comprehensive review. Remote Sens. 2023, 15, 3265.
36. Lin, F.; Crawford, S.; Guillot, K.; Zhang, Y.; Chen, Y.; Yuan, X.; Chen, L.; Williams, S.; Minvielle, R.; Xiao, X. MMST-ViT: Climate change-aware crop yield prediction via multi-modal spatial-temporal vision transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 5774–5784.
37. Mobarakeh, Z.M.; Pourmanafi, S.; Ahmadi, M. Employing Sentinel-2 time-series and noisy data quality control enhance crop classification in arid environments: A comparison of machine learning and deep learning methods. Int. J. Appl. Earth Obs. Geoinf. 2025, 142, 104678.

Figure 1. Location of the study area and sample UAV-acquired RGB imagery used for data collection.

Figure 2. Rice seedling images under different background conditions: (a) seedlings on a soil background; (b) seedlings on a water surface; (c) seedlings on a duckweed background.

Figure 3. Example of rice seedlings labeled with green bounding boxes on cropped images.

Figure 4. Illustration of image augmentation effects applied during the training process.

Figure 5. The architecture of the improved YOLOv8n network. The top section illustrates the detailed internal structure of the fundamental modules. (1) Baseline Components: Light-colored blocks represent the original YOLOv8n modules. CBS is the basic convolutional unit composed of a Convolution layer, Batch Normalization (BN), and SiLU activation. C2f denotes the feature extraction module incorporating Split operations and Bottleneck blocks. SPPF stands for Spatial Pyramid Pooling-Fast, and Detect refers to the decoupled head for classification (Cls) and bounding box regression (Bbox). (2) Proposed Improvements: Modules highlighted in color are the newly introduced components. GLSA (red) is the Global-to-Local Spatial Aggregation module; BiFPN (orange) is the weighted feature fusion module adopted from the Bidirectional Feature Pyramid Network; and CGAFusion (yellow) is the Content-Guided Attention Fusion module.

Figure 6. Comparison of three neck network structures: (a) PANet; (b) BiFPN; and (c) the improved BiFPN.

Figure 7. Structure of the GLSA module. The diagram is adapted from the design in [26], with modifications to suit the way the GLSA module is applied in this study.

Figure 8. Architecture of the CGAFusion module.

Figure 9. Illustration of missing seedling detection.

Figure 10. Visualization of mAP@0.5 performance for the baseline YOLOv8n model and the five improved configurations in the ablation study.

Figure 11. Detection results of the baseline and improved models. Red circles indicate missed or false detections, while green circles highlight the corresponding improvements. (a) Detection results for small-object targets; (b) detection results for densely distributed objects; (c) detection results under background interference.

Figure 12. Performance comparison of object detection models using a radar chart with piecewise normalized metrics.

Figure 13. Detection results of different object detection models. Red circles indicate missed and false detections, while green circles highlight the corresponding improvements. (a) Detection results for small-object targets; (b) detection results for densely distributed objects; (c) detection results under background interference.

Figure 14. Detection results of the improved YOLOv8n integrated with the missing seedling detection algorithm. Green bounding boxes indicate detected seedlings, while red bounding boxes represent identified missing seedling regions.

Figure 15. Linear regression analysis of missing seedling detection. The blue dots represent the experimental data points, and the red shaded area indicates the 95% confidence interval of the linear fit.

Table 2. Configuration of data augmentation parameters.

Category	Augmentation Techniques and Parameters
Color Space	HSV Perturbation: Hue = ±1.5%; Saturation = ±70%; Value = ±40%
Geometric	Mosaic: Prob = 1.0; Flip-LR: Prob = 0.5; Scale: Gain = ±10%; Translate: Range = ±10%

Note: Prob denotes the probability of applying the operation; Gain and Range denote the magnitude of the change.
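
The settings in Table 2 can be written down as a small training configuration. The sketch below is illustrative only: it assumes the Ultralytics-style argument names (hsv_h, hsv_s, hsv_v, mosaic, fliplr, scale, translate), which this section does not confirm as the framework used, and both the helper build_train_kwargs and the dataset path are hypothetical.

```python
# Table 2 expressed as keyword arguments.
# Assumption: Ultralytics-style argument names; percentages become fractions.
AUGMENTATION = {
    "hsv_h": 0.015,     # Hue perturbation, +/-1.5%
    "hsv_s": 0.70,      # Saturation perturbation, +/-70%
    "hsv_v": 0.40,      # Value perturbation, +/-40%
    "mosaic": 1.0,      # Mosaic probability
    "fliplr": 0.5,      # Left-right flip probability
    "scale": 0.10,      # Scale gain, +/-10%
    "translate": 0.10,  # Translation range, +/-10%
}


def build_train_kwargs(data_yaml: str, **overrides) -> dict:
    """Merge the Table 2 values with a dataset path (hypothetical helper)."""
    kwargs = {"data": data_yaml, **AUGMENTATION}
    kwargs.update(overrides)  # explicit overrides take precedence
    return kwargs


if __name__ == "__main__":
    # "rice_seedlings.yaml" is a placeholder path, not taken from the source.
    for key, value in build_train_kwargs("rice_seedlings.yaml").items():
        print(f"{key}: {value}")
```

If the Ultralytics package is what is used, the resulting dictionary could be passed on as `model.train(**build_train_kwargs(...))`; keeping the values in one dictionary makes it easy to compare against the table.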

Table 3. Performance metrics of the ablation experiments.

Model Name	mAP@0.5/%	P/%	R/%	FPS	Model Size/MB
YOLOv8n	92.4	89.5	88.6	237	6.0
YOLOv8n + A	93.8	90.8	89.5	190	4.1
YOLOv8n + B	92.9	90.3	89.3	154	8.0
YOLOv8n + C	93.1	90.6	87.9	182	6.3
YOLOv8n + A + B	94.6	90.8	90.7	137	4.4
YOLOv8n + A + B + C	94.7	91.0	91.2	117	6.3
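
To read the accuracy and speed trade-off directly from Table 3, the rows can be compared against the baseline. The following is a minimal sketch using only the values in the table; the names ABLATION, delta_vs_baseline, and fps_ratio are illustrative and not part of the original work.

```python
# Rows copied from Table 3: (mAP@0.5 %, P %, R %, FPS, model size MB).
ABLATION = {
    "YOLOv8n": (92.4, 89.5, 88.6, 237, 6.0),
    "YOLOv8n + A": (93.8, 90.8, 89.5, 190, 4.1),
    "YOLOv8n + B": (92.9, 90.3, 89.3, 154, 8.0),
    "YOLOv8n + C": (93.1, 90.6, 87.9, 182, 6.3),
    "YOLOv8n + A + B": (94.6, 90.8, 90.7, 137, 4.4),
    "YOLOv8n + A + B + C": (94.7, 91.0, 91.2, 117, 6.3),
}
FIELDS = ("mAP@0.5", "P", "R", "FPS", "Size")


def delta_vs_baseline(name: str, baseline: str = "YOLOv8n") -> dict:
    """Absolute change of each metric relative to the baseline row."""
    base, row = ABLATION[baseline], ABLATION[name]
    return {f: round(r - b, 1) for f, r, b in zip(FIELDS, row, base)}


def fps_ratio(name: str, baseline: str = "YOLOv8n") -> float:
    """Fraction of the baseline inference speed retained by a configuration."""
    return ABLATION[name][3] / ABLATION[baseline][3]


if __name__ == "__main__":
    for name in ABLATION:
        print(name, delta_vs_baseline(name), f"FPS ratio {fps_ratio(name):.2f}")
```

For the full configuration this gives a gain of 2.3 percentage points in mAP@0.5 alongside a drop of 120 FPS, which is the trade-off discussed for the improved model.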

Table 4. Comparison of experimental results for different object detection models.

Model Name	mAP@0.5/%	P/%	R/%	FPS	Model Size/MB
Faster-RCNN	50.5	50.5	51.1	50	108
YOLOv8n	92.4 ± 0.3	89.5	88.6	237	6.0
YOLOv10n	83.1	87.0	76.3	190	5.8
YOLOv12n	92.1	88.5	88.0	123	5.3
RT-DETR-r18	83.0	88.0	77.3	92	40.5
Improved YOLOv8n	94.7 ± 0.3	91.0	91.2	117	6.3

Note: The mAP@0.5 values for YOLOv8n and improved YOLOv8n are reported as Mean ± Standard Deviation over five independent runs with different random seeds.

Table 5. Statistical performance of the missing seedling detection algorithm on 20 test images.

Total Images	Total Ground Truth	Total Predicted	MAE	RMSE	Counting Accuracy	R²
20	632	591	3.75	4.57	88.1%	0.90
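
The statistics in Table 5 can be computed from per-image counts as sketched below. This is a sketch under stated assumptions, since the metric definitions are not given in this section: counting accuracy is taken as one minus the summed absolute error divided by the summed ground truth, a definition that reproduces the reported 88.1% from the table's own aggregates, and R² is taken as the squared Pearson correlation, which equals the R² of a simple linear fit such as the one in Figure 15. The function names are illustrative, and no per-image data from the study are reproduced.

```python
import math


def counting_metrics(truth, pred):
    """MAE, RMSE, counting accuracy, and R^2 from per-image counts.

    Assumes equal-length, non-constant sequences of counts.
    """
    n = len(truth)
    errors = [p - t for t, p in zip(truth, pred)]
    abs_sum = sum(abs(e) for e in errors)
    mae = abs_sum / n
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    # Assumed definition: 1 - total absolute error / total ground truth.
    accuracy = 1.0 - abs_sum / sum(truth)
    # R^2 as squared Pearson correlation (R^2 of a simple linear fit).
    mean_t, mean_p = sum(truth) / n, sum(pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(truth, pred))
    var_t = sum((t - mean_t) ** 2 for t in truth)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    r2 = cov * cov / (var_t * var_p)
    return {"MAE": mae, "RMSE": rmse, "accuracy": accuracy, "R2": r2}


def accuracy_from_aggregates(mae, n_images, total_truth):
    """Counting accuracy recovered from the aggregate values in Table 5."""
    return 1.0 - mae * n_images / total_truth


if __name__ == "__main__":
    # Aggregates from Table 5: MAE = 3.75, 20 images, 632 ground-truth instances.
    print(f"{accuracy_from_aggregates(3.75, 20, 632):.3f}")
```

With the Table 5 aggregates, the total absolute error is 3.75 × 20 = 75, and 1 − 75/632 ≈ 0.881, matching the reported counting accuracy.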

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, Z.; Yao, X.; Ban, S.; Hu, D.; Tian, M.; Yuan, T.; Li, L. High-Altitude UAV-Based Detection of Rice Seedlings in Large-Area Paddy Fields. Agriculture 2026, 16, 307. https://doi.org/10.3390/agriculture16030307

