1. Introduction
Rice is one of the most vital staple crops globally, supplying approximately 25% of the caloric intake for over 3 billion people worldwide [1]. Rice lodging, which negatively impacts yield and harvest efficiency, is caused by a combination of natural factors, varietal resistance, and cultivation practices. According to the Food and Agriculture Organization (FAO), natural disasters are among the primary contributors to severe food insecurity in 12 countries, highlighting the importance of mitigating lodging to ensure food security [2].
In recent years, with the advancement of artificial intelligence, deep learning has been increasingly applied in agriculture due to its high precision and automation capabilities [3]. Its applications span a wide range of tasks, including crop disease diagnosis, lodging detection, variety classification, weed identification, and yield estimation. Among these, rice lodging detection has drawn particular attention. Studies have confirmed the effectiveness of deep convolutional neural networks combined with segmentation models applied to UAV-acquired remote sensing imagery for identifying lodging areas [4]. As the technology has evolved, rice lodging detection has progressed from manual inspection to methods based on satellite spectral data, radar imagery, and UAV-based remote sensing platforms. Although satellites and radar systems have been widely used for crop lodging monitoring, UAVs have three advantages that make them particularly suitable for this task. First, UAVs provide higher spatial resolution, often at the centimeter level, whereas satellite images typically have meter-level resolution and radar images generally have lower spatial resolution due to signal dispersion; this allows UAVs to capture the fine-grained lodging features that are essential for precise assessment. Second, UAVs enable flexible, on-demand data acquisition, overcoming the relatively long revisit cycles of satellites (ranging from days to weeks) and the dependence of radar systems on specific weather or orbital conditions. Third, UAV data are generally easier to process, as they are collected at low altitudes with minimal atmospheric interference, whereas satellite and radar data often require complex preprocessing steps such as atmospheric correction and speckle noise reduction. These limitations of satellite and radar methods reduce the effectiveness and timeliness of lodging detection, motivating the exploration of UAV-based remote sensing as a more practical and efficient alternative.
In crop lodging detection, UAV-based remote sensing technology offers significant advantages over traditional manual surveys. It enables rapid and extensive coverage of large areas, reduces both labor and time costs, enhances work efficiency, and minimizes human error. As agricultural insurance continues to grow, the need for accurate and efficient crop damage assessment has become more pressing. UAV-based remote sensing plays a pivotal role in this process, not only improving the efficiency of claim settlements but also facilitating better crop management, increasing yields, and enhancing post-disaster recovery efforts [5]. Furthermore, UAVs offer an affordable solution that meets the need for timely data collection, providing high-resolution imagery at minimal economic cost. Their ease of use allows for precise and convenient monitoring of farmland, making them an invaluable tool in modern agricultural practices.
Existing research on rice lodging detection methods can be broadly categorized into three main themes: traditional vegetation index-based approaches, multi-source remote sensing data fusion methods, and deep learning-based semantic segmentation techniques.
First, vegetation index-based methods utilize spectral characteristics extracted from UAV or satellite imagery to highlight lodging areas. Chauhan et al. [6] utilized multispectral UAV data to assess wheat lodging by extracting spectral features, demonstrating the effectiveness of vegetation index-based approaches for lodging detection. Wu et al. [7] extracted visible-light spectral features and constructed Excess Green (EXG) vegetation index images. By fusing Digital Surface Models (DSMs) with RGB and EXG data, they achieved a soybean lodging extraction precision of 82.84%. Similarly, Yang et al. [8] employed semantic segmentation models enhanced by vegetation indices across multi-date UAV visible images, effectively identifying rice lodging patterns in different growth stages. These methods are straightforward and computationally light but are often sensitive to environmental factors such as illumination changes and background complexity, which limits their robustness.
Second, multi-source remote sensing data fusion approaches integrate different data modalities to enhance lodging detection accuracy. For example, Yongkang et al. [9] employed UAV multispectral image feature fusion to monitor wheat lodging, demonstrating the improved detection accuracy achievable by integrating spectral and spatial features. Jing et al. [10] utilized UAV visible-light remote sensing combined with feature fusion techniques to extract wheat lodging areas, demonstrating the effectiveness of integrating multiple feature types from UAV imagery. Chauhan et al. [11] utilized RADARSAT-2 and Sentinel-1 satellite radar data combined with discriminant analysis to classify wheat lodging severity, demonstrating the potential of radar–satellite data fusion for lodging assessment. Sarkar et al. [12] proposed a Mobile U-Net architecture that fuses RGB and DSM images for soybean lodging detection, balancing accuracy and computational efficiency. Zhao et al. [13] compared multispectral and RGB UAV images for rice lodging assessment, concluding that RGB imagery can sometimes outperform multispectral images without extra feature engineering. Additionally, Dai et al. [14] proposed a rice lodging disaster monitoring framework based on the integration of multi-source remote sensing data, including satellite optical imagery and radar datasets, offering higher detection reliability across diverse environmental scenarios.
Third, deep learning-based semantic segmentation techniques have become predominant due to their strong feature extraction capabilities and adaptability. Zang et al. [15] applied segmentation networks, including U-Net, PSPNet, DeepLabV3+, and ACSNet, to wheat lodging in UAV images, with ACSNet achieving the best results and a relative error of 4.5%. Zhao et al. [16] similarly confirmed ACSNet’s superior performance in segmenting irregular lodging shapes. Yao et al. [17] evaluated DeepLabV3+, U-Net, and BiSeNetV2 on multispectral data, improving the detection of small lodging areas. Zhang et al. [18] further optimized the segmentation architecture for UAV-based rice lodging detection, enhancing performance with architectural refinements. Kumar et al. [19] employed machine learning techniques to assess rice lodging at the plot level using multispectral UAV data, demonstrating the feasibility of fine-grained, high-throughput analysis. Tian et al. [20] combined visible and multispectral UAV imagery to capture a more comprehensive set of lodging features. Guan et al. [21] proposed a quantitative monitoring method for maize lodging across different growth stages using UAV remote sensing data, demonstrating the feasibility of precise lodging severity assessment through advanced feature extraction and analysis. Moreover, Zhang et al. [22] developed a multi-branch classification framework that efficiently detects wheat lodging from UAV imagery, underscoring the potential of lightweight modular designs for large-scale monitoring. Ulku [23] introduced ResLMFFNet, a real-time semantic segmentation network designed for precision agriculture scenarios, which balances lightweight design with competitive segmentation accuracy and is suitable for deployment on embedded UAV platforms.
Although UAV-based rice lodging detection methods have made significant progress, they still face several challenges in practical applications, including complex lighting and environmental conditions, diverse and irregular lodging patterns and scales, severe class imbalance due to limited labeled data and small target proportions, as well as computational constraints for real-time inference on UAV edge devices.
To address these practical challenges, this study focuses on the following core problems, aiming to achieve more accurate, efficient, and practical rice lodging detection:
(1) How to enhance the model’s adaptability to complex lighting conditions and dynamic field environments to ensure accurate identification of lodging areas;
(2) How to design a lightweight and efficient model architecture that satisfies the real-time inference requirements of UAV-mounted edge devices;
(3) How to mitigate the recognition difficulties caused by sample imbalance and scale variation, thereby improving segmentation robustness and generalization;
(4) How to achieve quantitative estimation of lodging areas to support grid-based spatial localization and decision-making in agricultural applications.
To tackle the above challenges, this study proposes SWRD–YOLO, a lightweight instance segmentation model based on an improved YOLO framework. The main contributions of this work are as follows:
(1) A Residual Convolutional Block Attention Module (ResCBAM) is introduced to address the challenge of weakened feature discriminability under complex and variable lighting conditions. By integrating spatial and channel attention mechanisms, the module enhances the model’s ability to focus on critical lodging features and suppress irrelevant background noise, thereby improving detection robustness in diverse field environments.
(2) To better handle the irregular and multi-directional nature of lodging patterns, a Dynamic Oriented Depthwise Convolution (DO-DConv) module is adopted. This module dynamically adjusts convolutional sampling positions according to lodging morphology, enabling the network to capture structural deformations that standard fixed-grid convolutions may miss, thereby enhancing adaptability to varied lodging orientations.
(3) A dynamic sampling strategy (DySample) is employed to mitigate the adverse effects of severe class imbalance and significant scale variation during training. By adaptively emphasizing underrepresented or small-scale lodging features, the strategy guides the model to learn more balanced and discriminative representations, thus improving segmentation robustness and overall generalization.
(4) A grid-based estimation method is proposed to calculate the lodged area ratio, i.e., the proportion of lodged area within each grid cell, enabling fine-grained spatial localization and quantitative assessment of lodging severity, thus supporting precision agricultural management.
Furthermore, the proposed grid-based lodging estimation method provides spatially localized and quantitative assessments of lodging severity within UAV images. Compared with traditional pixel-wise or region-based segmentation, this structured approach facilitates a finer-grained understanding of lodging distribution. Although the current study focuses on image-level grid analysis, this method lays the foundation for future integration with field parcel boundaries, enabling field-scale decision-making in practical applications such as precision field management and agricultural insurance assessment.
3. Experiments and Analysis
3.1. Evaluation Metrics
To evaluate the effectiveness of the proposed SWRD–YOLO instance segmentation model, this study considers the following metrics: precision (P), recall (R), mean average precision (mAP@0.5), F1 score, and giga floating-point operations (GFLOPs).
Precision refers to the ratio of correctly identified positive instances to all instances that the classifier labeled as positive. The higher the precision, the lower the false detection rate of the model [32]. The formula is as follows:
$$P = \frac{TP}{TP + FP}$$
where $TP$ represents the number of targets correctly predicted as rice lodging areas and $FP$ represents the number of targets incorrectly predicted as rice lodging areas.
Recall indicates the ability of the model to retrieve all relevant positive instances from the dataset. The higher the recall, the lower the missed detection rate of the model [32]. The formula is as follows:
$$R = \frac{TP}{TP + FN}$$
where $FN$ represents the number of actual rice lodging areas that were not predicted as rice lodging areas.
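mAP@0.5 represents the average precision value of the model at different recall levels under the condition that the IoU threshold is 0.5, reflecting the average performance of the model across multiple categories. The formula is as follows:
$$AP = \int_{0}^{1} P(R)\,dR, \qquad mAP@0.5 = \frac{1}{n}\sum_{k=1}^{n} AP_k$$
where $n$ is the number of categories and $AP_k$ is the average precision of the $k$-th category at an IoU threshold of 0.5. In addition, $TN$ represents the number of rice areas that are not lodged and were correctly predicted as non-lodged rice areas.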
The F1 score is an indicator used in statistics to measure the accuracy of a binary classification model. It takes into account both precision and recall and is the harmonic mean of the two [37]. The formula is as follows:
$$F1 = \frac{2 \times P \times R}{P + R}$$
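As a minimal sketch of how precision, recall, and the F1 score follow from the confusion counts described above, the arithmetic can be written in a few lines of Python. The function name and the example counts are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) computed from confusion counts.

    tp: lodging areas correctly predicted as lodging
    fp: areas incorrectly predicted as lodging
    fn: lodging areas that were missed
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical counts, used only to illustrate the arithmetic.
p, r, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")
```

Because the harmonic mean is pulled toward the smaller of the two values, a model with high precision but low recall still receives a comparatively low F1 score.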
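GFLOPs denotes the number of floating-point operations, in billions, required for one forward pass of the model. It measures the amount of model computation rather than hardware throughput and should not be confused with GFLOPS, the number of billions of floating-point operations that a device such as a GPU can complete per second. For a convolutional layer, counting each multiply-accumulate as two operations and ignoring bias terms, the formula is as follows:
$$GFLOPs = \frac{2 \times H \times W \times C_{in} \times C_{out} \times K_h \times K_w}{10^{9}}$$
where $H$ and $W$ are the height and width of the output, $C_{in}$ and $C_{out}$ are the number of channels of the input and output, and $K_h$ and $K_w$ are the height and width of the convolution kernel [38].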
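The per-layer operation count can be illustrated with a short Python sketch. It assumes the common convention of counting one multiply-accumulate as two floating-point operations and ignoring bias terms; the `ops_per_mac` parameter, the function names, and the example layer are hypothetical and not taken from the SWRD–YOLO architecture.

```python
def conv_flops(h_out, w_out, c_in, c_out, k_h, k_w, ops_per_mac=2):
    """Floating-point operations of one standard convolution layer.

    ops_per_mac=2 counts the multiply and the add separately (an assumed
    convention); set it to 1 to count multiply-accumulates instead.
    """
    macs = h_out * w_out * c_in * c_out * k_h * k_w
    return ops_per_mac * macs


def to_gflops(flops):
    """Convert a raw operation count to billions of operations."""
    return flops / 1e9


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels, 80x80 output map.
flops = conv_flops(h_out=80, w_out=80, c_in=64, c_out=128, k_h=3, k_w=3)
print(f"{to_gflops(flops):.4f} GFLOPs")
```

Summing this quantity over all layers gives the model-level figure that training frameworks typically report.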
Among these metrics, precision, recall, and mAP@0.5 are obtained from the result.csv file generated after model training; the F1 score is calculated based on its formula; and GFLOPs are automatically reported by the code during the training and inference processes.
3.2. Attention Mechanism Comparison Experiments
To evaluate the performance improvement introduced by the ResCBAM, the Convolutional Block Attention Module (CBAM), Shuffle Attention Mechanism (ShuffleAttention), and Multi-Head Self-Attention (MHSA) were individually integrated into the YOLOv8n-seg network for comparison, with all other components of the network kept unchanged. The comparison results are presented in Table 2.
Compared with the +CBAM, +ShuffleAttention, and +MHSA models, the +ResCBAM model improved recall by 16.4%, 5.9%, and 10.6%, respectively, and increased mAP@0.5 by 11.9%, 4.6%, and 8.3%, respectively. The F1 score also improved by 10.8%, 2.8%, and 7.3%, respectively. GFLOPs increased by 2.4, 2.4, and 2.2, respectively. The precision of the +ResCBAM model was 2.9% and 3.7% higher than that of the +CBAM and +MHSA models, respectively, and 0.2% lower than that of the +ShuffleAttention model. The ResCBAM demonstrated superiority in highlighting key feature information and outperformed the CBAM and MHSA attention mechanisms in segmenting lodging regions. Compared with the ShuffleAttention mechanism, ResCBAM improved overall performance mainly by reducing missed detections, at the cost of a marginal 0.2% decrease in precision.
The superior performance of the ResCBAM can be attributed to its dual attention mechanism, which effectively integrates channel and spatial attention to simultaneously capture important feature maps and their spatial locations. This complementary design enables better feature representation and localization, resulting in improved detection accuracy and reduced errors compared to single-attention or other attention mechanisms.
3.3. Loss Function Comparison Experiments
To evaluate the impact of different loss functions on model performance, the CIoU, WIoU, and minimum point distance intersection over union (MPDIoU) losses were individually integrated into the model for comparison while keeping the rest of the YOLOv8n-seg network structure unchanged. The training performance of each loss function is presented in Table 3.
Compared with the YOLOv8n-seg and +MPDIoU models, the WIoU improved precision by 6.9% and 2.4%, respectively, increased recall by 11.6% and 0.3%, increased mAP@0.5 by 11.8% and 2.3%, and improved the F1 score by 9.5% and 1.3%. The GFLOPs of the YOLOv8n-seg, +WIoU, and +MPDIoU models were all 12.1. The WIoU loss function demonstrated superior optimization capabilities, significantly enhancing model localization precision across objects of various scales.
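To make the compared losses more concrete, the overlap term they share and the MPDIoU penalty can be sketched in plain Python. This is a minimal sketch following the published MPDIoU definition, in which the squared distances between the top-left and bottom-right corners are normalized by the squared image diagonal. It is not the training code used in this study: the function names and the `(x1, y1, x2, y2)` box format are assumptions, and the dynamic focusing term of WIoU is omitted.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mpdiou_loss(pred, target, img_w, img_h):
    """1 - MPDIoU, where MPDIoU = IoU - d1^2/(w^2+h^2) - d2^2/(w^2+h^2)."""
    # Squared distance between the two top-left corners.
    d1_sq = (pred[0] - target[0]) ** 2 + (pred[1] - target[1]) ** 2
    # Squared distance between the two bottom-right corners.
    d2_sq = (pred[2] - target[2]) ** 2 + (pred[3] - target[3]) ** 2
    norm = img_w ** 2 + img_h ** 2
    return 1.0 - (iou(pred, target) - d1_sq / norm - d2_sq / norm)


# Hypothetical boxes on a 10x10 image, used only for illustration.
print(mpdiou_loss((0, 0, 2, 2), (1, 1, 3, 3), img_w=10, img_h=10))
```

A perfectly matching prediction yields a loss of zero, and the corner-distance terms keep the loss informative even when two boxes have the same IoU but different offsets.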
3.4. Optimizer Comparison Experiments
To verify the advantage of the SGD optimizer, comparative experiments were conducted using seven optimizers: Adaptive Moment Estimation (Adam), AdamW, Lion, Nesterov-Accelerated Adaptive Moment Estimation (NAdam), Rectified Adam (RAdam), Sophia Gradient (SophiaG), and SGD, while keeping the rest of the YOLOv8n-seg network structure unchanged. The results are shown in Table 4.
Compared with the +Adam, YOLOv8n-seg, +NAdam, and +RAdam models, SGD improved precision by 7.3%, 5.2%, 5%, and 2.1%, respectively, increased recall by 17%, 14%, 13.6%, and 6.8%, respectively, and improved mAP@0.5 by 12.8%, 9.6%, 9.9%, and 4.4%, respectively. The F1 score increased by 13.1%, 10.1%, 10.3%, and 4.5%. The GFLOPs of the +Adam, YOLOv8n-seg, +NAdam, +RAdam, and +SGD models were all 12.1. The SGD optimizer offered significant advantages in accurately identifying rice lodging regions, thereby improving model robustness and generalization ability in complex field conditions. Its stable training dynamics make it particularly suitable for rice lodging detection tasks, contributing to enhanced detection precision and reliability.
3.5. Comparison Test of Upsampling Module
To verify the influence of the DySample upsampling module on instance segmentation performance, the original nn.Upsample module was replaced with DySample while keeping the rest of the YOLOv8n-seg network structure unchanged, and a comparative experiment was conducted. The test results are presented in Table 5.
The results show that in the model using DySample, precision increased from 86.6% to 92.4%, recall increased from 71.7% to 85.3%, mAP@0.5 increased from 80.5% to 91.0%, and the F1 score increased from 78.6% to 88.7%. GFLOPs remained unchanged at 12.1, indicating that DySample enhances the model’s ability to reconstruct high-resolution features without additional computational overhead. These gains in segmentation precision support its advantages in complex boundary recovery and fine-grained target recognition.
3.6. Convolution Structure Comparison Test
To assess the impact of different convolutional structures on model performance, three modules (Conv, Pinwheel Convolution (PinwheelConv), and DO-DConv) were integrated into the YOLOv8n-seg model for comparison, with all other network components held constant. The experimental findings are presented in Table 6.
Compared with the +PinwheelConv and YOLOv8n-seg models, DO-DConv improved precision by 1.1% and 4.8%, respectively, increased recall by 5.1% and 11.2%, increased mAP@0.5 by 4.2% and 9.5%, increased the F1 score by 3.3% and 8.3%, and reduced GFLOPs to 11.8, showing an excellent balance between detection performance and inference efficiency. The DO-DConv structure, with its enhanced detection performance and reduced computational complexity, is highly valuable for edge devices and real-world deployment.
3.7. Ablation Experiments
To evaluate the optimization effects of the SGD optimizer, WIoU loss function, ResCBAM, DySample upsampling module, and DO-DConv convolution module, ablation experiments were conducted on the test set. Starting with the baseline YOLOv8n-seg model, modules were incrementally added to form five distinct experimental configurations. An overview of the ablation experiment results is provided in Table 7, where a check mark (✓) indicates the usage of the corresponding module.
When the basic model did not introduce any improved modules, precision was 86.6%, recall was 71.7%, mAP@0.5 was 80.5%, the F1 score was 78.6%, and GFLOPs were 12.1. After introducing the SGD optimizer, the mAP@0.5 of the model increased to 90.1%, and the F1 score increased to 88.7%, verifying the positive role of SGD in accelerating convergence speed and improving training stability. After adding the WIoU loss function, mAP@0.5 further improved to 91.2%, indicating that the loss function has a significant effect on improving target positioning precision and can alleviate the precision loss caused by the target box regression error. After introducing ResCBAM, the precision of the model increased to 94.7%, and the F1 score reached 91.3%. The feature expression ability of the model in the target area was significantly enhanced, effectively improving the overall performance of detection and segmentation. After further adding the DySample module, recall increased to 90.5%, and mAP@0.5 reached 93.7%, indicating that the module has advantages in high-resolution feature recovery, which can more accurately restore the target boundary and effectively improve segmentation precision. Finally, the DO-DConv direction-aware dynamic convolution module reduced the GFLOPs from 14.5 to 14.2 while maintaining high detection precision, significantly optimizing the inference efficiency of the model and improving its deployment ability on edge devices. Overall, while ensuring high-precision instance segmentation performance, the improved scheme takes into account the lightweight design and deployment efficiency of the model and provides efficient and reliable technical support for practical agricultural applications such as rice lodging detection.
In summary, the proposed modules enhance model performance in complementary ways. ResCBAM primarily improves precision through better target area recognition, DySample has a greater impact on recall through boundary information recovery, and DO-DConv lowers GFLOPs from 14.5 to 14.2 while maintaining detection precision, offsetting part of the computational cost added by the other modules. The experimental results demonstrate that the proposed scheme achieves high-precision instance segmentation at a moderate computational cost, providing efficient and reliable support for practical agricultural applications such as rice lodging monitoring.
3.8. Comparison Experiment of Different Segmentation Models
Since this study focuses on real-time, lightweight instance segmentation in complex agricultural environments, the YOLO family of models was selected for comparison due to its unified architecture and superior inference speed. Traditional segmentation models such as Mask R-CNN or DeepLabV3+ typically offer higher precision but require significantly more computation, making them less practical for deployment in resource-constrained field scenarios. Therefore, they were not included in the comparison. To evaluate the effectiveness of the proposed model, a comparison was performed on the test set against the original YOLO models.
Table 8 presents the performance comparison results among YOLOv5n-seg, YOLOv8n-seg, YOLO11n-seg, and SWRD–YOLO (the proposed model).
For each YOLO-series model, five versions are typically available: n, s, m, l, and x, with detection precision increasing from n to x. Lower-ranked models generally offer faster inference speeds and lower resource consumption, while higher-ranked models prioritize accuracy at the cost of computational efficiency. YOLOv8n, although lower in detection precision, is the most lightweight and resource-efficient variant, making it suitable for rapid detection tasks.
Compared with YOLOv5n-seg, YOLOv8n-seg, and YOLO11n-seg, the proposed SWRD–YOLO model improved precision by 5.0%, 6.2%, and 3.4%, respectively. mAP@0.5 increased by 2.4%, 13.9%, and 6.0%, while the F1 score improved by 2.0%, 16.2%, and 6.7%. In terms of GFLOPs, SWRD–YOLO showed increases of 7.3, 2.1, and 3.8, respectively. Regarding recall, SWRD–YOLO achieved gains of 22.9% and 9.3% over YOLOv8n-seg and YOLO11n-seg, respectively, while showing a slight decrease of 0.9% compared to YOLOv5n-seg. Overall, SWRD–YOLO demonstrated superior and more stable recognition performance at a modest increase in computational cost, especially in scenarios involving dense and small targets, highlighting its effectiveness for real-time agricultural applications.
3.9. Edge Deployment and Lodging Quantification
3.9.1. Visualization of Segmentation Prediction Results
To enable accurate and efficient rice lodging detection, the SWRD–YOLO model was trained under the PyTorch framework and exported as a best.pt file, containing both the network structure and learned weights. For deployment on resource-constrained edge devices, the model was first converted to the ONNX format, which facilitates interoperability across different inference engines. Subsequently, the ONNX model was optimized into a TensorRT engine (best.engine) using NVIDIA’s TensorRT toolkit. This optimization leverages techniques such as layer fusion and precision calibration to significantly enhance inference speed and efficiency.
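The conversion chain described above (best.pt to ONNX to best.engine) could be scripted roughly as follows. This is a hedged sketch, not the authors' deployment script: it assumes the Ultralytics Python API for the ONNX step and NVIDIA's `trtexec` command-line tool for the engine build, neither of which is named in the text, and the FP16 flag is an illustrative choice. Loading a customized model such as SWRD–YOLO would additionally require its module definitions to be importable.

```python
def export_to_onnx(weights_path="best.pt"):
    """Export PyTorch weights to ONNX (assumes the ultralytics package)."""
    from ultralytics import YOLO  # imported lazily so the sketch loads without it
    return YOLO(weights_path).export(format="onnx")


def trtexec_command(onnx_path="best.onnx", engine_path="best.engine", fp16=True):
    """Build the argument list for a trtexec engine build.

    fp16=True is an assumption; the text does not state the precision used.
    """
    cmd = ["trtexec", f"--onnx={onnx_path}", f"--saveEngine={engine_path}"]
    if fp16:
        cmd.append("--fp16")
    return cmd


# Print the command; on the edge device it could be run with subprocess.run(cmd).
print(" ".join(trtexec_command()))
```

Building the engine on the target device itself is generally advisable, since a TensorRT engine is specific to the GPU and TensorRT version it was built with.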
The deployment was conducted on both PC and edge platforms to evaluate robustness. The intelligent terminal device selected was the Allspark2, powered by the NVIDIA Jetson Orin NX module.
Figure 10, Figure 11, and Figure 12 illustrate the original input images, the segmentation prediction results on the PC platform, and the results on the edge platform, respectively. In these figures, the label “beating down” is used to denote the lodging areas identified by SWRD–YOLO based on the instance segmentation results. The converted engine model achieved consistent segmentation performance across platforms. Specifically, the representative test images showed segmentation confidences of 90%, 92%, and 92% on the PC platform and 91%, 92%, and 92% on the edge device. This minimal variation indicates that the model’s predictions remain stable across platforms after conversion.
3.9.2. Inference Performance Evaluation on Edge Device
The inference speed of both the baseline and SWRD–YOLO models was evaluated on the Allspark2 intelligent terminal equipped with the NVIDIA Jetson Orin NX module.
Inference was conducted on a dataset of 100 images, each processed individually. The average frames per second (FPS) was computed by averaging the results of 20 independent runs to minimize variability. To reflect realistic deployment conditions, no warm-up runs were performed. The baseline model achieved an average inference speed of 17.59 FPS, whereas the SWRD–YOLO model achieved 16.15 FPS. Although this represents a slight reduction of approximately 8%, SWRD–YOLO exhibited significant improvements in detection precision and recall. The reduction in FPS is primarily attributed to the increased computational complexity introduced by advanced modules such as DO-DConv, ResCBAM, and DySample. Nevertheless, the achieved FPS is sufficient to support near-real-time processing in agricultural monitoring tasks, effectively balancing accuracy and efficiency.
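The timing protocol described above can be sketched as a small benchmark loop. This is an illustrative sketch rather than the script used in the study: `fake_infer` is a hypothetical stand-in for the deployed model call, and averaging the per-run FPS values (rather than pooling total time) is an assumption about how the 20 runs were combined.

```python
import time
from statistics import mean


def measure_fps(infer, images, runs=20):
    """Average FPS over `runs` passes, one image per call, without warm-up."""
    fps_per_run = []
    for _ in range(runs):
        start = time.perf_counter()
        for image in images:
            infer(image)
        # Guard against a zero reading from the timer on very fast passes.
        elapsed = max(time.perf_counter() - start, 1e-12)
        fps_per_run.append(len(images) / elapsed)
    return mean(fps_per_run)


def relative_drop(baseline_fps, new_fps):
    """Fractional slowdown of new_fps relative to baseline_fps."""
    return (baseline_fps - new_fps) / baseline_fps


def fake_infer(image):
    """Hypothetical stand-in for a model forward pass."""
    return sum(image)


images = [[i, i + 1, i + 2] for i in range(100)]
print(f"{measure_fps(fake_infer, images):.1f} FPS (dummy workload)")
# Slowdown implied by the FPS figures reported in the text.
print(f"{relative_drop(17.59, 16.15):.1%}")
```

With the reported figures, the relative drop evaluates to roughly 8.2%, in line with the approximately 8% stated above.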
These results confirm that, despite its increased architectural complexity, the proposed SWRD–YOLO model maintains an acceptable inference speed on edge devices, enabling practical deployment for UAV-based rice lodging detection.
3.9.3. Grid-Based Lodging Ratio Estimation
To further enhance the robustness and interpretability of lodging detection, this study employed a grid-based analysis approach to estimate both the local and global lodging ratios. Each image was partitioned into fixed-size grids, enabling fine-grained spatial statistics and intuitive visualization of lodging severity distribution.
Each image is evenly divided into $N$ grids of equal dimensions. Within each grid, the number of lodged pixels (segmentation mask) is denoted as $L_i$, and the total number of valid pixels in the grid is denoted as $T_i$. The local lodging ratio of the $i$-th grid is calculated as
$$r_i = \frac{L_i}{T_i}, \quad i = 1, \dots, N$$
This formulation offers a localized assessment of lodging severity, which is essential for identifying spatial heterogeneity and micro-regional stress zones within the field.
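A minimal sketch of the local ratio computation is given below, using nested Python lists in place of image arrays. It assumes that the image height and width are exact multiples of the grid counts and that a binary validity map marks which pixels count toward each grid; the function name, the `valid` argument, and the toy mask are hypothetical.

```python
def grid_lodging_ratios(mask, rows, cols, valid=None):
    """Per-grid (lodged, valid, ratio) tuples in row-major order.

    mask:  2D list of 0/1, where 1 marks a lodged pixel.
    valid: optional 2D list of 0/1 marking countable pixels (default: all).
    Assumes the image size is an exact multiple of rows and cols.
    """
    height, width = len(mask), len(mask[0])
    cell_h, cell_w = height // rows, width // cols
    cells = []
    for gr in range(rows):
        for gc in range(cols):
            lodged = total = 0
            for y in range(gr * cell_h, (gr + 1) * cell_h):
                for x in range(gc * cell_w, (gc + 1) * cell_w):
                    if valid is None or valid[y][x]:
                        total += 1
                        lodged += mask[y][x]
            # Ratio of lodged to valid pixels; 0.0 if the grid has no valid pixels.
            ratio = lodged / total if total else 0.0
            cells.append((lodged, total, ratio))
    return cells


# Toy 4x4 mask split into a 2x2 grid, for illustration only.
toy_mask = [
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
cells = grid_lodging_ratios(toy_mask, rows=2, cols=2)
print([ratio for _, _, ratio in cells])
```

In practice the same computation would normally be vectorized over the segmentation mask array; the explicit loops are kept here only to mirror the definition.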
For accurate quantification, two complementary methods are used. At the local level, the lodging ratio of each grid is calculated based on the instance segmentation mask, enabling targeted field management. At the global level, the overall lodging ratio can be computed in two ways: (1) by directly dividing the total number of lodged pixels by the total number of valid pixels in the image, and (2) by aggregating the grid-level lodging ratios weighted by their respective pixel counts. Experimental results show that both methods produce consistent results under normal conditions, while the weighted grid-based method offers better robustness in handling partial occlusions or uneven lighting, thereby enhancing the reliability of lodging severity estimation.
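Specifically, the second method, weighted aggregation of grid-level lodging ratios, is defined as follows:
$$r_{global} = \frac{\sum_{i=1}^{N} T_i \, r_i}{\sum_{i=1}^{N} T_i}$$
where $r_i$ is the local lodging ratio of the $i$-th grid, $T_i$ is its number of valid pixels, and $N$ is the number of grids. This represents the weighted average of all local lodging ratios, where the weight of each grid is proportional to its valid pixel count. This approach yields a more objective and reliable estimate of the overall degree of lodging.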
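The two global estimates can be compared with a short sketch that takes per-grid pairs of lodged and valid pixel counts. The function names and the example counts are hypothetical. Under the assumption that both estimates use the same valid-pixel counts, the weighted average reduces algebraically to the direct ratio, so any practical difference between them would come from how valid pixels are determined per grid, which the text does not specify.

```python
def global_ratio_direct(cells):
    """Total lodged pixels divided by total valid pixels."""
    lodged = sum(l for l, _ in cells)
    total = sum(t for _, t in cells)
    return lodged / total if total else 0.0


def global_ratio_weighted(cells):
    """Average of per-grid ratios, each weighted by its valid pixel count."""
    total = sum(t for _, t in cells)
    if not total:
        return 0.0
    # Grids without valid pixels carry zero weight and are skipped.
    return sum(t * (l / t) for l, t in cells if t) / total


# Hypothetical (lodged, valid) counts for four grids of unequal valid area.
example_cells = [(3, 4), (0, 4), (1, 2), (3, 10)]
print(global_ratio_direct(example_cells))
print(global_ratio_weighted(example_cells))
```

Both calls return the same value up to floating-point rounding, which makes the pair a convenient consistency check for an implementation.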
Figure 13 illustrates the grid-based lodging detection results: red grid lines overlay the original image, blue semi-transparent lodging masks highlight lodging areas, and the overall lodging percentage is displayed in the upper-left corner for clear visual feedback. This method enables spatial monitoring of rice lodging by dividing the image into grids, thereby identifying localized areas with severe lodging. Such granularity supports precision field management decisions. Meanwhile, grid-based statistical analysis of lodging percentages provides a standardized metric for quantitative comparisons across different time points or field plots. By integrating instance segmentation with grid-level analysis, the system not only supports localized assessment but also contributes to holistic decision-making, significantly enhancing its value in precision agriculture and UAV-based crop evaluation.
4. Discussion
Lodging detection has become increasingly important in precision agriculture due to its significant impact on crop yield and mechanized harvesting. This study proposes SWRD–YOLO, a lightweight instance segmentation model designed to address challenges such as complex lighting conditions, irregular lodging shapes, and class imbalance inherent in UAV-based lodging imagery. By incorporating advanced modules, including ResCBAM, DySample, and DO-DConv, alongside optimized training strategies like the SGD optimizer and the WIoU loss function, the model achieves substantial improvements in precision, recall, and mAP metrics.
Compared to existing methods, SWRD–YOLO demonstrates a favorable balance between detection accuracy and computational efficiency, making it suitable for real-time deployment on resource-constrained embedded devices. For instance, Zhao et al. [16] used an improved PSPNet for wheat lodging segmentation, achieving high accuracy at the cost of increased model complexity and slower inference speed. Yao et al. [17] developed a modified DeepLabV3+ for real-time lodging recognition on harvesters, focusing more on onboard harvester hardware rather than UAV platforms. In contrast, SWRD–YOLO maintains competitive accuracy with a significantly reduced model size and faster inference (16.15 FPS on a Jetson Orin NX).
Moreover, Zhang et al. [18] optimized segmentation architectures for rice lodging detection using UAV imagery; however, the increased complexity limits their practicality in real-time applications. Kumar et al. [19] utilized multispectral UAV data and machine learning for plot-level lodging assessment, emphasizing spectral features; however, they faced challenges in small target detection. Guan et al. [21] proposed a multi-stage approach for maize lodging quantification, which requires extensive multi-date data and thereby limits rapid assessment capabilities. In comparison, SWRD–YOLO achieves precise lodging segmentation from single-date RGB UAV images by enhancing feature fusion and attention mechanisms.
Despite these advances, limitations persist. The model’s robustness under extreme environmental conditions, such as heavy wind or rain, and under image quality degradation caused by UAV motion blur requires further study. Additionally, while this study focuses on rice lodging, generalizing the approach to other crops requires dataset expansion and model retraining. Future work could explore multi-temporal data integration and domain adaptation techniques to improve adaptability and robustness.
To further improve the model’s accuracy and robustness, the sources of error that affect segmentation performance must be considered. These include variable lighting, occlusions, overlapping plants, and the class imbalance between lodging and non-lodging areas, all of which can cause misclassifications and boundary ambiguity. Careful data collection, including flight altitude standardization and imaging angle control, alongside advanced data augmentation strategies, will be essential to mitigate these challenges and improve segmentation accuracy.
In summary, SWRD–YOLO effectively balances accuracy, speed, and model complexity, demonstrating strong potential for practical deployment in intelligent crop lodging monitoring systems. This contributes to timely field management and sustainable agricultural practices.
5. Conclusions
This study addresses several critical challenges in UAV-based rice lodging detection, including complex lighting conditions, diverse lodging morphologies, varying shooting angles, and class imbalance. To enhance the model’s robustness under these factors, we applied data augmentation techniques such as color enhancement, rotation, and noise injection. A lightweight SWRD–YOLO model was proposed to balance detection precision with computational efficiency, enabling deployment on resource-constrained edge devices. These integrated strategies significantly improved detection precision and adaptability in complex field environments.
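The three augmentation types named above can be sketched in a few lines. This is only an illustrative example under stated assumptions: the gain range, the restriction to 90-degree rotations, and the noise standard deviation are placeholder choices, and the helper `augment` is hypothetical rather than the pipeline used in this study.

```python
import numpy as np

def augment(image, mask, rng):
    """Apply color scaling, rotation, and noise to an image/mask pair.

    `image` is assumed to be HxWx3 uint8 and `mask` HxW; the mask receives
    only the geometric transform so that labels stay aligned with pixels.
    """
    img = image.astype(np.float32)
    # Color enhancement: random global gain (assumed range 0.8 to 1.2).
    img = img * rng.uniform(0.8, 1.2)
    # Rotation: random multiple of 90 degrees, applied to image and mask alike.
    k = int(rng.integers(0, 4))
    img = np.rot90(img, k, axes=(0, 1))
    out_mask = np.rot90(mask, k, axes=(0, 1)).copy()
    # Noise injection: additive Gaussian noise (assumed sigma of 5 gray levels).
    img = img + rng.normal(0.0, 5.0, size=img.shape)
    out_img = np.clip(img, 0, 255).astype(np.uint8)
    return out_img, out_mask

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(4, 6, 3), dtype=np.uint8)
mask = np.zeros((4, 6), dtype=np.uint8)
mask[1:3, 2:5] = 1
aug_image, aug_mask = augment(image, mask, rng)
print(aug_image.shape, aug_mask.shape)
```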
Based on the YOLOv8n-seg framework, the SWRD–YOLO model incorporates several key improvements, including replacing AdamW with the SGD optimizer for enhanced generalization on dense, small targets; integrating the WIoU loss function; adopting the ResCBAM attention module; employing the DySample upsampling module; and utilizing the DO-DConv structure. Collectively, these enhancements yielded significant gains in precision, recall, mAP@0.5, and F1 score of 8.2%, 16.5%, 12.8%, and 12.8%, respectively, over the baseline. While computational complexity slightly increased from 12.1 to 14.2 GFLOPs, operational efficiency remained high. Notably, the DO-DConv module contributed to reducing GFLOPs from 14.5 to 14.2 without compromising precision, optimizing inference efficiency. This balance between computational cost and performance confirms that the proposed SWRD–YOLO model achieved a lightweight design without sacrificing accuracy, making it well-suited for real-time deployment on resource-constrained edge devices in agricultural applications.
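For reference, the F1 score is the harmonic mean of precision and recall (standard definition; the symbols P and R are introduced here only for illustration), and the GFLOPs values quoted above correspond to the following relative changes:

```latex
\[
F_1 = \frac{2PR}{P + R}, \qquad
\frac{14.2 - 12.1}{12.1} \approx 17.4\%, \qquad
\frac{14.5 - 14.2}{14.5} \approx 2.1\%.
\]
```

That is, the final model requires roughly 17.4% more GFLOPs than the baseline, and the DO-DConv module recovers about 2.1% relative to the 14.5 GFLOPs configuration.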
Moreover, the lightweight design and optimizations of the SWRD–YOLO model offer distinct benefits in inference speed and deployment feasibility. The model can be effectively deployed on edge devices such as the NVIDIA Jetson Orin NX, supporting real-time processing. This deployment reduces dependence on costly computing infrastructure, accelerates detection workflows, and delivers timely, actionable insights for field management, thereby lowering operational costs and enhancing the practicality of high-precision lodging detection in agricultural settings.
In addition, a grid-based quantitative method was implemented to estimate lodging severity from segmentation masks. By dividing each image into uniform grids and calculating lodging ratios per grid, the overall lodging degree was determined through weighted aggregation. This fine-grained estimation not only enhances interpretability but also facilitates localized field assessment, making it particularly applicable to precision agriculture and disaster evaluation.
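One way to write such a weighted aggregation is given below. The weights are not specified in the text, so the choice of cell area as the weight is an assumption made for illustration, as are the symbols: r_i is the lodging ratio of grid cell i, m_i its number of lodged pixels, a_i its total number of pixels, and N the number of cells.

```latex
\[
r_i = \frac{m_i}{a_i}, \qquad
L = \frac{\sum_{i=1}^{N} w_i \, r_i}{\sum_{i=1}^{N} w_i}
\;\overset{w_i = a_i}{=}\;
\frac{\sum_{i=1}^{N} m_i}{\sum_{i=1}^{N} a_i}.
\]
```

Under this area-weighted assumption, the overall lodging degree L reduces to the fraction of lodged pixels in the whole image, so cells of unequal size at the image border do not bias the result.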
Guided by insights from this study, future research will focus on further reducing model complexity to enable more efficient deployment on resource-limited devices. Enhancing the model’s robustness in detecting blurred or motion-distorted targets, frequently encountered in high-speed UAV operations, will be a key objective. Additionally, improving adaptability to diverse and complex farmland environments, including various crop species and geographic regions, remains critical. These advancements will drive the development of intelligent, lightweight, and practical solutions for real-time rice lodging detection, ultimately advancing precision agriculture and sustainable crop management.