SWRD–YOLO: A Lightweight Instance Segmentation Model for Estimating Rice Lodging Degree in UAV Remote Sensing Images with Real-Time Edge Deployment

Guo, Chunyou; Tan, Feng

doi:10.3390/agriculture15151570



by Chunyou Guo and Feng Tan *

College of Information and Electrical Engineering, Heilongjiang Bayi Agricultural University, Daqing 163319, China

* Author to whom correspondence should be addressed.

Agriculture 2025, 15(15), 1570; https://doi.org/10.3390/agriculture15151570

Submission received: 25 June 2025 / Revised: 17 July 2025 / Accepted: 21 July 2025 / Published: 22 July 2025

(This article belongs to the Topic Intelligent Agriculture: Perception Technologies and Agricultural Equipment for Crop Production Processes)


Abstract

Rice lodging severely affects crop growth, yield, and mechanized harvesting efficiency. The accurate detection and quantification of lodging areas are crucial for precision agriculture and timely field management. However, Unmanned Aerial Vehicle (UAV)-based lodging detection faces challenges such as complex backgrounds, variable lighting, and irregular lodging patterns. To address these issues, this study proposes SWRD–YOLO, a lightweight instance segmentation model that enhances feature extraction and fusion using advanced convolution and attention mechanisms. The model employs an optimized loss function to improve localization accuracy, achieving precise lodging area segmentation. Additionally, a grid-based lodging ratio estimation method is introduced, dividing images into fixed-size grids to calculate local lodging proportions and aggregate them for robust overall severity assessment. Evaluated on a self-built rice lodging dataset, the model achieves 94.8% precision, 88.2% recall, 93.3% mAP@0.5, and 91.4% F1 score, with real-time inference at 16.15 FPS on an embedded NVIDIA Jetson Orin NX device. Compared to the baseline YOLOv8n-seg, precision, recall, mAP@0.5, and F1 score improved by 8.2%, 16.5%, 12.8%, and 12.8%, respectively. These results confirm the model’s effectiveness and potential for deployment in intelligent crop monitoring and sustainable agriculture.

Keywords: rice lodging severity estimation; UAV-based precision agriculture; agricultural image segmentation; lightweight deep learning; edge AI deployment; efficient inference

1. Introduction

Rice is one of the most vital staple crops globally, supplying approximately 25% of the caloric intake for over 3 billion people worldwide [1]. Rice lodging, which negatively impacts yield and harvest efficiency, is caused by a combination of natural factors, varietal resistance, and cultivation practices. According to the Food and Agriculture Organization (FAO), natural disasters are among the primary contributors to severe food insecurity in 12 countries, highlighting the importance of mitigating lodging to ensure food security [2].

In recent years, with the advancement of artificial intelligence, deep learning has been increasingly applied in agriculture due to its high precision and automation capabilities [3]. Its applications span a wide range of tasks, including crop disease diagnosis, lodging detection, variety classification, weed identification, and yield estimation. Among these, rice lodging detection has drawn particular attention. Studies have confirmed the effectiveness of deep convolutional neural networks combined with segmentation models applied to UAV-acquired remote sensing imagery for identifying lodging areas [4]. As the technology has evolved, rice lodging detection has progressed from manual inspection to more sophisticated methods utilizing satellite spectral data, radar imagery, and UAV-based remote sensing platforms. Although satellites and radar systems have been widely used for crop lodging monitoring, UAVs have distinct advantages that make them particularly suitable for this task. First, UAVs provide higher spatial resolution, often at the centimeter level, whereas satellite images typically have meter-level resolution and radar images generally have lower spatial resolution due to signal dispersion. This allows UAVs to capture fine-grained lodging features that are essential for precise assessment. Second, UAVs enable flexible and on-demand data acquisition, overcoming the relatively long revisit cycles of satellites (ranging from days to weeks) and the dependence of radar systems on specific weather or orbital conditions. Finally, UAV data are generally easier to process, as they are collected at low altitudes with minimal atmospheric interference, whereas satellite and radar data often require complex preprocessing steps such as atmospheric correction and speckle noise reduction. These limitations of satellite and radar methods reduce the effectiveness and timeliness of lodging detection, thereby motivating the exploration of UAV-based remote sensing as a more practical and efficient alternative.

In crop lodging detection, UAV-based remote sensing technology offers significant advantages over traditional manual surveys. It enables rapid and extensive coverage of large areas, reduces both labor and time costs, enhances work efficiency, and minimizes human error. As agricultural insurance continues to grow, the need for accurate and efficient crop damage assessment has become more pressing. UAV-based remote sensing plays a pivotal role in this process, not only improving the efficiency of claim settlements but also facilitating better crop management, increasing yields, and enhancing post-disaster recovery efforts [5]. Furthermore, UAVs offer an affordable solution that meets the need for timely data collection, providing high-resolution imagery at minimal economic cost. Their ease of use allows for precise and convenient monitoring of farmland, making them an invaluable tool in modern agricultural practices.

Existing research on rice lodging detection methods can be broadly categorized into three main themes: traditional vegetation index-based approaches, multi-source remote sensing data fusion methods, and deep learning-based semantic segmentation techniques.

First, vegetation index-based methods utilize spectral characteristics extracted from UAV or satellite imagery to highlight lodging areas. Chauhan et al. [6] utilized multispectral UAV data to assess wheat lodging by extracting spectral features, demonstrating the effectiveness of vegetation index-based approaches for lodging detection. Wu et al. [7] extracted visible-light spectral features and constructed Excess Green (EXG) vegetation index images. By fusing Digital Surface Models (DSMs) with RGB and EXG data, they achieved a soybean lodging extraction precision of 82.84%. Similarly, Yang et al. [8] employed semantic segmentation models enhanced by vegetation indices across multi-date UAV visible images, effectively identifying rice lodging patterns in different growth stages. These methods are straightforward and computationally light but often sensitive to environmental factors such as illumination changes and background complexity, limiting robustness.

Second, multi-source remote sensing data fusion approaches integrate different data modalities to enhance lodging detection accuracy. For example, Yongkang et al. [9] employed UAV multispectral image feature fusion to monitor wheat lodging, demonstrating the improved detection accuracy achievable by integrating spectral and spatial features. Jing et al. [10] utilized UAV visible-light remote sensing combined with feature fusion techniques to extract wheat lodging areas, demonstrating the effectiveness of integrating multiple feature types from UAV imagery. Chauhan et al. [11] utilized RADARSAT-2 and Sentinel-1 satellite radar data combined with discriminant analysis to classify wheat lodging severity, demonstrating the potential of radar–satellite data fusion for lodging assessment. Sarkar et al. [12] proposed a Mobile U-Net architecture that fuses RGB and DSM images for soybean lodging detection, balancing accuracy and computational efficiency. Zhao et al. [13] compared multispectral and RGB UAV images for rice lodging assessment, concluding that RGB imagery can sometimes outperform multispectral images without extra feature engineering. Additionally, Dai et al. [14] proposed a rice lodging disaster monitoring framework based on the integration of multi-source remote sensing data, including satellite optical imagery and radar datasets, offering higher detection reliability across diverse environmental scenarios.

Third, deep learning-based semantic segmentation techniques have become predominant due to their strong feature extraction capabilities and adaptability. Zang et al. [15] applied segmentation networks, including U-Net, PSPNet, DeepLabV3+, and ACSNet, to wheat lodging in UAV images, with ACSNet achieving the best results and a relative error of 4.5%. Zhao et al. [16] similarly confirmed ACSNet’s superior performance in segmenting irregular lodging shapes. Yao et al. [17] evaluated DeepLabV3+, U-Net, and BiseNetV2 on multispectral data, improving the detection of small lodging areas. Zhang et al. [18] further optimized the segmentation architecture for UAV-based rice lodging detection, enhancing performance with architectural refinements. Kumar et al. [19] employed machine learning techniques to assess rice lodging at the plot level using multispectral UAV data, demonstrating the feasibility of fine-grained, high-throughput analysis. Tian et al. [20] combined visible and multispectral UAV imagery to capture a more comprehensive set of lodging features. Guan et al. [21] proposed a quantitative monitoring method for maize lodging across different growth stages using UAV remote sensing data, demonstrating the feasibility of precise lodging severity assessment through advanced feature extraction and analysis. Moreover, Zhang et al. [22] developed a multi-branch classification framework that efficiently detects wheat lodging from UAV imagery, underscoring the potential of lightweight modular designs for large-scale monitoring. Ulku [23] introduced ResLMFFNet, a real-time semantic segmentation network designed for precision agriculture scenarios, which balances lightweight design with competitive segmentation accuracy and is suitable for deployment on embedded UAV platforms.

Although UAV-based rice lodging detection methods have made significant progress, they still face several challenges in practical applications, including complex lighting and environmental conditions, diverse and irregular lodging patterns and scales, severe class imbalance due to limited labeled data and small target proportions, as well as computational constraints for real-time inference on UAV edge devices.

To address these practical challenges, this study focuses on the following core problems, aiming to achieve more accurate, efficient, and practical rice lodging detection:

How to enhance the model’s adaptability to complex lighting conditions and dynamic field environments to ensure accurate identification of lodging areas;
How to design a lightweight and efficient model architecture that satisfies the real-time inference requirements of UAV-mounted edge devices;
How to mitigate the recognition difficulties caused by sample imbalance and scale variation, thereby improving segmentation robustness and generalization;
How to achieve quantitative estimation of lodging areas to support grid-based spatial localization and decision-making in agricultural applications.

To tackle the above challenges, this study proposes SWRD–YOLO, a lightweight instance segmentation model based on an improved YOLO framework. The main contributions of this work are as follows:

A Residual Convolutional Block Attention Module (ResCBAM) is introduced to address the challenge of weakened feature discriminability under complex and variable lighting conditions. By integrating spatial and channel attention mechanisms, the module enhances the model’s ability to focus on critical lodging features and suppress irrelevant background noise, thereby improving detection robustness in diverse field environments.
To better handle the irregular and multi-directional nature of lodging patterns, a Dynamic Oriented Depthwise Convolution (DO-DConv) module is adopted. This module dynamically adjusts convolutional sampling positions according to lodging morphology, enabling the network to capture structural deformations that standard fixed-grid convolutions may miss, thereby enhancing adaptability to varied lodging orientations.
A dynamic sampling strategy (DySample) is employed to mitigate the adverse effects of severe class imbalance and significant scale variation during training. By adaptively emphasizing underrepresented or small-scale lodging features, the strategy guides the model to learn more balanced and discriminative representations, thus improving segmentation robustness and overall generalization.
A grid-based estimation method is proposed to calculate the lodged area ratio, i.e., the proportion of lodged area within each grid cell, enabling fine-grained spatial localization and quantitative assessment of lodging severity, thus supporting precision agricultural management.

Furthermore, the proposed grid-based lodging estimation method provides spatially localized and quantitative assessments of lodging severity within UAV images. Compared with traditional pixel-wise or region-based segmentation, this structured approach facilitates a finer-grained understanding of lodging distribution. Although the current study focuses on image-level grid analysis, this method lays the foundation for future integration with field parcel boundaries, enabling field-scale decision-making in practical applications such as precision field management and agricultural insurance assessment.
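The grid-based estimate described above can be illustrated with a small sketch. It assumes a binary segmentation mask (1 for lodged pixels, 0 otherwise) and a square cell size. The function names and the area-weighted aggregation rule are illustrative assumptions, not the authors' exact procedure.

```python
def grid_lodging_ratios(mask, cell):
    """Return the lodged-pixel ratio of every grid cell in a binary mask."""
    height, width = len(mask), len(mask[0])
    ratios = {}
    for top in range(0, height, cell):
        for left in range(0, width, cell):
            rows = mask[top:top + cell]
            pixels = [v for row in rows for v in row[left:left + cell]]
            ratios[(top // cell, left // cell)] = sum(pixels) / len(pixels)
    return ratios


def overall_lodging_ratio(mask, cell):
    """Aggregate the cell ratios, weighting each cell by its pixel count (assumed rule)."""
    height, width = len(mask), len(mask[0])
    total = 0.0
    for (r, c), ratio in grid_lodging_ratios(mask, cell).items():
        cell_h = min(cell, height - r * cell)  # edge cells may be smaller
        cell_w = min(cell, width - c * cell)
        total += ratio * cell_h * cell_w
    return total / (height * width)


# Toy 4x4 mask: the top-left cell is fully lodged, the bottom-right cell is one quarter lodged.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
cells = grid_lodging_ratios(mask, cell=2)
overall = overall_lodging_ratio(mask, cell=2)
```

The per-cell dictionary gives the spatial localization, while the aggregated value gives a single severity figure for the image. Other aggregation rules, such as counting cells above a threshold, would fit the same structure.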

2. Materials and Methods

2.1. Dataset

2.1.1. Data Source

In this study, a DJI Phantom 3 Professional UAV was employed for data acquisition. The UAV is equipped with a Complementary Metal-Oxide Semiconductor (CMOS) image sensor featuring an effective resolution of 12.4 megapixels, an image resolution of 4000 × 3000 pixels, and a lens aperture of f/2.8. The data collection site was located at the Agricultural Internet of Things Service Center in Wuchang City, Heilongjiang Province (45.0241°N, 127.3121°E), where rice is the primary crop grown, as shown in Figure 1. Under adequate lighting conditions, imagery covering the complete rice growth cycle—from June to October 2024—was captured, focusing on rice lodging phenomena. The acquired images include lodging areas captured under varying illumination angles and flight altitudes, as shown in Figure 2.

2.1.2. Dataset Augmentation and Labeling

After screening, rice images with insufficient target information were eliminated, and 1022 valid images were retained to construct the dataset. The images were annotated with the Labelme annotation tool, producing JavaScript Object Notation (JSON) files containing category and pixel coordinate data, which were then converted into TXT files for training the SWRD–YOLO model. The dataset was randomly divided into training, validation, and test sets. Although the images in the dataset were independent samples, the actual rice field environment was extremely complex, with varying lighting conditions and diverse leaf growth angles and lodging angles, which made it difficult to capture the full range of lodged-rice characteristics through field collection alone. To improve the model's robustness and adaptability to rice lodging characteristics and to prevent overfitting due to insufficient data, the training and validation sets were augmented using color enhancement, rotation, and the addition of Gaussian and salt-and-pepper noise. Examples of the augmented images, including salt-and-pepper noise, Gaussian noise, color transformation, and rotation, are shown in Figure 3.
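As an illustration of the two noise augmentations mentioned above, the following sketch applies Gaussian and salt-and-pepper noise to an 8-bit grayscale image stored as nested lists. The function names, the noise strength `sigma`, and the corruption probability `amount` are hypothetical; the article does not state the values used.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to each pixel and clip to the range [0, 255]."""
    return [
        [min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
        for row in image
    ]


def add_salt_and_pepper(image, amount, rng):
    """Replace each pixel, with probability `amount`, by 255 (salt) or 0 (pepper)."""
    noisy = []
    for row in image:
        new_row = []
        for p in row:
            if rng.random() < amount:
                new_row.append(255 if rng.random() < 0.5 else 0)
            else:
                new_row.append(p)
        noisy.append(new_row)
    return noisy


rng = random.Random(0)                     # fixed seed for repeatable augmentation
image = [[128] * 8 for _ in range(8)]      # uniform gray test image
gaussian = add_gaussian_noise(image, sigma=10.0, rng=rng)
salt_pepper = add_salt_and_pepper(image, amount=0.05, rng=rng)
```

Pixel-value noise leaves the annotation polygons unchanged, whereas a geometric transform such as rotation would also require transforming the polygon coordinates.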

The segmentation parameters, such as the confidence threshold, non-maximum suppression (NMS) threshold, and mask threshold, were optimized based on the performance on the validation set to achieve an appropriate balance between precision and recall. Empirical evaluation indicated that minor variations in these parameters had a limited impact on the segmentation accuracy and recall. Therefore, the reported results correspond to the parameter settings validated on the validation dataset.

After data augmentation, the dataset was expanded to 2000 images. The data augmentation program automatically generated the corresponding annotation files alongside the new images. The training and validation sets together comprised 97% of the dataset and were divided in an 8:2 ratio. The test set consisted of two parts: the remaining 3% of the dataset images and an additional 40 images selected from the original captured images that were not included in the dataset, in order to enrich the diversity of the test set. The number of images in each part of the dataset is shown in Table 1.

2.2. Experimental Setup and Hyperparameter Configuration

The model was trained on an online server provided by SeetaTech. The hardware environment included an Intel(R) Xeon(R) Gold 6430 processor and an NVIDIA GeForce RTX 4090 graphics card with 24 GB of memory. The operating system was Ubuntu 20.04, and the programming language was Python 3.8.10. PyTorch 2.0.0 served as the deep learning framework, with parallel computing performed using CUDA 11.8. To avoid overfitting and reduce training time, the batch size was set to 16 and the number of epochs was set to 200 [24]. The subsequent comparative experiments used the same parameter settings.
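For readers who want to reproduce a comparable baseline run, the stated settings could be collected as in the sketch below. It assumes the Ultralytics Python package, which the article does not explicitly name, and a hypothetical dataset description file `rice_lodging.yaml`. The optimizer entry reflects the SGD choice discussed in Section 2.3.4.

```python
def build_train_config(data_yaml):
    """Collect the training settings stated in the text as keyword arguments."""
    return {
        "data": data_yaml,    # hypothetical dataset description file
        "epochs": 200,        # from Section 2.2
        "batch": 16,          # from Section 2.2
        "optimizer": "SGD",   # optimizer adopted in Section 2.3.4
    }


def train(model_yaml="yolov8n-seg.yaml", data_yaml="rice_lodging.yaml"):
    """Launch training; needs the ultralytics package and a real dataset file."""
    from ultralytics import YOLO  # imported lazily so the config can be inspected without it

    model = YOLO(model_yaml)
    return model.train(**build_train_config(data_yaml))


config = build_train_config("rice_lodging.yaml")
```

Calling `train()` starts an actual training run, so it is left uninvoked here. Settings not stated in the article, such as image size and learning rate, are left at the library defaults.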

2.3. Algorithm Improvement

2.3.1. YOLOv8

You Only Look Once (YOLO) is an end-to-end object detection method that extracts detection results directly from images through single-stage processing [25]. As one of the models in this series, YOLOv8 has the advantages of a small model size, fast deployment speed, low resource occupation, and suitability for edge devices, making it particularly effective in lightweight application scenarios. In addition, YOLOv8 introduces several improvements while retaining the strengths of the previous generation, including the use of the C2f (Cross-Stage Partial with two convolutions and one fusion) module, the removal of upsampling convolutions, the introduction of a decoupled detection head structure, and the implementation of an anchor-free detection mechanism. These enhancements further improve detection precision and speed [26]. In this study, the YOLOv8n-seg model was adopted as the baseline for rice lodging area detection in UAV remote sensing images and was further customized and optimized to better meet the specific requirements of the task. The detailed architecture of YOLOv8n-seg is illustrated in Figure 4.

However, the original model exhibited issues such as insufficient segmentation precision, false positives, and missed detections in the paddy field environment. To address these problems, the YOLOv8n-seg instance segmentation model was improved to enhance detection precision, reduce errors, and maintain model efficiency, thereby increasing robustness. The specific improvements made in this study are as follows:

The model was optimized using Stochastic Gradient Descent (SGD), which introduces beneficial random noise that implicitly regularizes the training process, enhances generalization and test performance, reduces memory usage, and accelerates convergence on large-scale datasets.
To improve localization accuracy, the wise intersection over union (WIoU) loss was adopted. This loss reduces the influence of low-quality samples and eases competition among high-quality anchors, enabling the model to focus on challenging examples. For rice lodging detection, where UAV images have ambiguous boundaries due to overlapping leaves and shadows, the WIoU enables better boundary refinement and more accurate lodging area detection, thereby improving segmentation performance under complex field conditions.
The ResCBAM was employed to reduce information loss and enhance feature extraction through residual learning and channel–spatial attention. Specifically for rice lodging detection, ResCBAM effectively emphasizes critical features such as bent stems, overlapping leaves, and texture variations caused by lodging.
The upsampling module was replaced with DySample, which enables finer pixel-level reconstruction by dynamically distributing weights. This improves the spatial perception of feature maps, allowing better recovery of detailed edge information. In the context of rice lodging detection, this enhancement helps accurately restore the contours of lodged areas, resulting in improved recall and more precise segmentation.
The DO-DConv module was introduced to dynamically adjust feature extraction based on texture direction and edge distribution, improving the model’s responsiveness to rice lodging characteristics. This improves segmentation precision while optimizing inference efficiency for edge device deployment.

Figure 5 illustrates the network architecture of the improved SWRD–YOLO model.

In the backbone, the 7th-layer convolution (Conv) module was replaced with DO-DConv. In the neck, DySample was applied at the 10th and 14th layers for upsampling, and the ResCBAM was introduced at the 13th, 17th, 21st, and 25th layers. Additionally, the 22nd-layer Conv was replaced with DO-DConv to enhance directional texture perception.

Shallow convolutional layers mainly extract low-level texture and edge features. Introducing DO-DConv at this stage would disrupt feature diversity and increase computational overhead because of the high resolution of early feature maps. In contrast, deeper layers process more abstract semantic information, where DO-DConv can enhance directional awareness at relatively low computational cost. Therefore, standard Conv modules were retained in the shallow layers to maintain feature stability, while DO-DConv was selectively applied in the deeper layers to improve semantic representation and overall inference efficiency without compromising precision.

2.3.2. ResCBAM

An attention mechanism acts as a resource allocation system that mimics human cognitive focus, dynamically emphasizing relevant information while disregarding irrelevant details to enhance critical data [27]. It is typically divided into two types: channel attention, which enhances the expressiveness of individual feature channels, and spatial attention, which emphasizes the significance of each spatial location in the image. This study combines both the spatial attention module (SAM) and the channel attention module (CAM) to construct the ResCBAM, incorporating a residual connection (RC) approach [28]. The residual connection introduces direct links between network layers, enabling the network to learn the residuals between inputs and outputs. This reduces gradient vanishing, facilitates the training of deeper models, and preserves information flow.

Specifically, the CAM adaptively recalibrates the importance of each feature channel, enabling the model to focus on discriminative features. Meanwhile, the SAM emphasizes significant spatial regions within the feature maps, allowing the model to better localize important areas. By combining the CAM and the SAM, the ResCBAM leverages their complementary strengths to capture both “what” features are important and “where” they are located within the image. This dual attention mechanism is particularly effective in the context of rice lodging detection, where lodging areas exhibit distinct texture and shape patterns (channel information) as well as distinct spatial distributions.

By generating attention maps for both channels and spatial locations, the ResCBAM enhances important features while suppressing irrelevant ones, thereby improving feature representation [29]. Consequently, the ResCBAM not only addresses the limitations of traditional convolutional neural networks in handling features of varying shapes, scales, and directions but also mitigates the gradient vanishing problem. Its structure is shown in Figure 6.

The ResCBAM receives the input feature, which is fed simultaneously into the CAM, the SAM, and the process that computes the refined feature. Specifically, the input feature map first passes through the CAM to calculate channel attention weights and output the channel-weighted features. The CAM output is then fed into the SAM for spatial attention weighting, producing spatially weighted features. The refined feature is formed by combining the input feature map, the CAM output, and the SAM output, thereby integrating multi-stage attention information. Finally, the module fuses the input feature map and the refined feature through a residual connection to produce the final output features. This design effectively leverages both channel and spatial attention mechanisms while maintaining the stability and richness of feature representation through residual connections.
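The data flow described above can be sketched in plain Python for a feature map stored as nested lists (channels, rows, columns). This is a simplified illustration, not the authors' module. It assumes the sequential order of channel attention followed by spatial attention with a single skip connection, and it replaces the 7x7 convolution that CBAM normally uses in the spatial branch with a 1x1 mixing of the pooled maps. All names and weights are placeholders.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def shared_mlp(vec, w0, w1):
    """Two-layer perceptron with ReLU, shared by the avg- and max-pooled paths."""
    hidden = [max(0.0, sum(w * v for w, v in zip(row, vec))) for row in w0]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w1]


def channel_attention(x, w0, w1):
    """Scale each channel by sigmoid(MLP(avg pool) + MLP(max pool))."""
    avg = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in x]
    peak = [max(map(max, ch)) for ch in x]
    logits = [a + b for a, b in zip(shared_mlp(avg, w0, w1), shared_mlp(peak, w0, w1))]
    return [[[sigmoid(g) * v for v in row] for row in ch] for ch, g in zip(x, logits)]


def spatial_attention(x, w_avg, w_max, bias):
    """Scale each location by a gate built from the channel-wise mean and max.

    Simplification: a 1x1 mixing stands in for the usual 7x7 convolution.
    """
    channels, height, width = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    for i in range(height):
        for j in range(width):
            column = [x[c][i][j] for c in range(channels)]
            gate = sigmoid(w_avg * sum(column) / channels + w_max * max(column) + bias)
            for c in range(channels):
                out[c][i][j] = gate * x[c][i][j]
    return out


def res_cbam(x, w0, w1, w_avg, w_max, bias):
    """Channel attention, then spatial attention, then add the input back."""
    refined = spatial_attention(channel_attention(x, w0, w1), w_avg, w_max, bias)
    return [
        [[a + b for a, b in zip(row_x, row_r)] for row_x, row_r in zip(ch_x, ch_r)]
        for ch_x, ch_r in zip(x, refined)
    ]


# Two channels on a 2x2 map; all weights are zero, so every attention gate equals 0.5.
x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 1.0], [0.0, 1.0]]]
w0 = [[0.0, 0.0]]      # reduction to one hidden unit
w1 = [[0.0], [0.0]]    # expansion back to two channels
y = res_cbam(x, w0, w1, w_avg=0.0, w_max=0.0, bias=0.0)
```

With zero weights, both gates are 0.5, so the attended branch contributes a quarter of the input and the output is 1.25 times the input. Even when the attention branch is uninformative, the skip connection keeps the original signal intact.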

Residual connections provide a direct path for gradient backpropagation, effectively alleviating the problem of vanishing gradients that commonly hinder the training of deep neural networks. By enabling the network to learn residual mappings instead of directly fitting the desired underlying function, residual connections facilitate smoother and more stable optimization. This not only accelerates convergence during training but also helps maintain feature integrity across layers, preventing degradation of learned representations. Consequently, the incorporation of residual connections in the ResCBAM contributes significantly to the overall training stability and robustness of the model, especially when dealing with complex and deep architectures.

2.3.3. WIoU Loss Function

The loss function is a crucial component in target detection algorithms [3]. Most high-performance target detection algorithms rely on a bounding-box regression (BBR) loss function to accurately determine the location of the detected target [31]. Traditionally, the intersection over union (IoU) loss function, which quantifies the overlap between the predicted and the ground-truth bounding boxes, is used for bounding-box regression [32]. However, the IoU ignores the shape and size characteristics of the bounding boxes, which can negatively affect regression precision. The complete intersection over union (CIoU), the default bounding-box regression loss in the YOLOv8n-seg model [30], improves upon this by accounting for not only the intersection area but also the center distance and aspect ratio of the predicted and ground-truth bounding boxes. The formula for the CIoU is as follows:

$$CIoU = IoU - \left( \frac{\rho^{2}(B_{pred}, B_{gt})}{c^{2}} + \alpha \gamma \right) \tag{1}$$

where $IoU$ is the intersection over union between the predicted bounding box $B_{pred}$ and the ground-truth bounding box $B_{gt}$; $\rho^{2}(B_{pred}, B_{gt})$ represents the squared Euclidean distance between the center points of the predicted and ground-truth bounding boxes; $c$ is the diagonal length of the minimal rectangular region encompassing both the predicted and the ground-truth bounding boxes [33]; and $\alpha$ is the balance parameter, which balances the weight of the center-point distance and the aspect ratio consistency term. It is given by

$$\alpha = \frac{\gamma}{1 - IoU + \gamma} \tag{2}$$

where $\gamma$ measures the consistency of the aspect ratio:

$$\gamma = \frac{4}{\pi^{2}} \left( \arctan \frac{w_{gt}}{h_{gt}} - \arctan \frac{w_{pred}}{h_{pred}} \right)^{2} \tag{3}$$

where $w_{gt}$ and $h_{gt}$ represent the width and height of the ground-truth bounding box, and $w_{pred}$ and $h_{pred}$ represent the width and height of the predicted bounding box.
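The CIoU computation can be checked numerically with a short sketch. Boxes are assumed to be given as corner coordinates (x1, y1, x2, y2), and the function names are illustrative. The aspect-ratio term follows the usual CIoU definition, (4/π²)(arctan(w_gt/h_gt) − arctan(w_pred/h_pred))².

```python
import math


def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def ciou(pred, gt):
    """CIoU = IoU - (rho^2 / c^2 + alpha * gamma)."""
    overlap = iou(pred, gt)
    # Squared distance between the two box centers.
    rho2 = ((pred[0] + pred[2] - gt[0] - gt[2]) ** 2
            + (pred[1] + pred[3] - gt[1] - gt[3]) ** 2) / 4.0
    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = ((max(pred[2], gt[2]) - min(pred[0], gt[0])) ** 2
          + (max(pred[3], gt[3]) - min(pred[1], gt[1])) ** 2)
    w_p, h_p = pred[2] - pred[0], pred[3] - pred[1]
    w_g, h_g = gt[2] - gt[0], gt[3] - gt[1]
    gamma = (4.0 / math.pi ** 2) * (math.atan(w_g / h_g) - math.atan(w_p / h_p)) ** 2
    # Guard against 0/0 when the boxes coincide exactly.
    alpha = gamma / (1.0 - overlap + gamma) if gamma > 0 else 0.0
    return overlap - (rho2 / c2 + alpha * gamma)


gt_box = (0.0, 0.0, 4.0, 4.0)
same = ciou(gt_box, gt_box)                    # identical boxes
shifted = ciou((2.0, 0.0, 6.0, 4.0), gt_box)   # same shape, shifted right by 2
```

For the shifted box, the IoU is 1/3, the center-distance penalty is 4/52, and the aspect-ratio term vanishes because both boxes are square. The training loss is then 1 − CIoU.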

The CIoU is greatly affected by the samples in the actual detection scene, and cannot be adaptively adjusted for the diversity of detection targets, which affects the convergence speed of the algorithm. Therefore, this study introduces another loss function, the WIoU.

The calculation for the WIoU is

$$L_{WIoU} = R_{WIoU} \cdot L_{IoU} \cdot \frac{\beta}{\delta \cdot \alpha_{1}^{\beta - \delta}} \tag{4}$$

where $\alpha_{1}$ and $\delta$ are hyperparameters, and

$$L_{IoU} = 1 - IoU \tag{5}$$

$$R_{WIoU} = \exp \left[ \frac{(x_{pred} - x_{gt})^{2} + (y_{pred} - y_{gt})^{2}}{W_{g}^{2} + H_{g}^{2}} \right] \tag{6}$$

where $(x_{pred}, y_{pred})$ and $(x_{gt}, y_{gt})$ represent the coordinates of the center points of the predicted bounding box and the ground-truth bounding box, respectively; $W_{g}$ and $H_{g}$ are the width and height of the minimum enclosing box; and $IoU$ is used to assess the overlap between the predicted and ground-truth bounding boxes.

The variable $\beta$ denotes the irregularity level of the predicted bounding box and is defined by

$$\beta = \frac{\dot{L}_{IoU}}{\bar{L}_{IoU}} \tag{7}$$

where $\dot{L}_{IoU}$ is the constant obtained by transforming the variable $L_{IoU}$, and $\bar{L}_{IoU}$ is the running average of $L_{IoU}$ with momentum $m$, which is calculated as follows:

$$m = \sqrt{t \cdot n \cdot 0.05} \tag{8}$$

where $t$ represents the value of the epoch and $n$ represents the batch size.

Traditional bounding-box regression loss functions often lead to reduced localization accuracy when processing low-quality samples in target detection datasets. However, the WIoU addresses this issue by introducing a dynamic, non-monotonic focusing mechanism and outlier evaluation [31]. Unlike the traditional IoU loss, which assigns a loss of 1 when the bounding boxes do not overlap (thus losing the distance information), the WIoU provides a more effective strategy for gradient assignment, helping to mitigate the problem of gradient vanishing.
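The following sketch evaluates the loss of Equations (4) to (6) for a single pair of boxes. It assumes corner-format boxes (x1, y1, x2, y2) and takes the outlier degree β as the current IoU loss divided by its running mean. The default values `alpha=1.9` and `delta=3.0` are illustrative placeholders rather than settings reported in this article.

```python
import math


def wiou_loss(pred, gt, iou_mean, alpha=1.9, delta=3.0):
    """WIoU loss for one box pair, given the running mean of the IoU loss."""
    inter_w = max(0.0, min(pred[2], gt[2]) - max(pred[0], gt[0]))
    inter_h = max(0.0, min(pred[3], gt[3]) - max(pred[1], gt[1]))
    inter = inter_w * inter_h
    union = ((pred[2] - pred[0]) * (pred[3] - pred[1])
             + (gt[2] - gt[0]) * (gt[3] - gt[1]) - inter)
    l_iou = 1.0 - inter / union
    # Distance penalty R_WIoU from the center offset and the enclosing-box size.
    dx = (pred[0] + pred[2] - gt[0] - gt[2]) / 2.0
    dy = (pred[1] + pred[3] - gt[1] - gt[3]) / 2.0
    w_g = max(pred[2], gt[2]) - min(pred[0], gt[0])
    h_g = max(pred[3], gt[3]) - min(pred[1], gt[1])
    r_wiou = math.exp((dx * dx + dy * dy) / (w_g ** 2 + h_g ** 2))
    # Outlier degree beta and the non-monotonic focusing coefficient.
    beta = l_iou / iou_mean
    focus = beta / (delta * alpha ** (beta - delta))
    return focus * r_wiou * l_iou


def update_iou_mean(iou_mean, l_iou, momentum):
    """Exponential running mean of the IoU loss."""
    return (1.0 - momentum) * iou_mean + momentum * l_iou


gt_box = (0.0, 0.0, 2.0, 2.0)
perfect = wiou_loss(gt_box, gt_box, iou_mean=1.0)                # identical boxes
disjoint = wiou_loss((2.0, 0.0, 4.0, 2.0), gt_box, iou_mean=1.0)  # touching, no overlap
```

Unlike the plain IoU loss, the disjoint case still carries distance information through the exponential term. The focusing coefficient reduces the weight of boxes whose loss is either far below or far above the running mean.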

2.3.4. SGD Optimizer

The original YOLOv8n-seg model uses the Adam with decoupled weight decay (AdamW) optimizer, a variant of Adam that applies weight decay directly to the weights instead of to the gradients. Due to its sensitivity to batch size, AdamW frequently introduces high gradient variance in small-batch training, which destabilizes the optimization process. Additionally, AdamW must compute both first-order and second-order moment estimates, which results in higher computational complexity and longer training time, especially for large and complex models. To overcome these issues, this study introduces SGD as an alternative optimizer. SGD is widely used in deep learning for its ability to efficiently handle large-scale data and update weights based on gradient information in real time. The calculation formula for SGD is as follows:

$$\theta = \theta - \eta \nabla_{\theta} J(\theta; x_{i}, y_{i}) \tag{9}$$

where $\theta$ represents the model parameters, $\eta$ represents the learning rate, $\nabla_{\theta} J(\theta; x_{i}, y_{i})$ is the gradient of the loss function with respect to the parameters $\theta$, $x_{i}$ is the input feature of a sample randomly selected from the dataset, and $y_{i}$ is the corresponding target value or true label for $x_{i}$.

In the SGD algorithm, model weights are updated frequently by computing the gradients of the loss function using randomly selected individual samples or small batches. This frequent updating introduces randomness, which helps the model avoid local minima and enhances both its robustness and training efficiency. The core principle of SGD is to iteratively optimize model parameters, gradually approaching the global or local minima of the objective function and ultimately minimizing the model’s loss on the training data.
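The update rule of Equation (9) can be demonstrated on a toy problem. The sketch below fits a single slope parameter to noiseless data with single-sample updates. The model, data, and learning rate are assumptions chosen purely for illustration and have no connection to the segmentation network.

```python
import random


def sgd_fit_slope(samples, lr, epochs, rng):
    """Fit y = theta * x by single-sample SGD on the squared error."""
    theta = 0.0
    for _ in range(epochs):
        order = samples[:]
        rng.shuffle(order)                           # random sample order each epoch
        for x_i, y_i in order:
            grad = 2.0 * (theta * x_i - y_i) * x_i   # gradient of (theta * x_i - y_i)^2
            theta = theta - lr * grad                # theta <- theta - eta * gradient
    return theta


samples = [(x, 2.0 * x) for x in (0.5, 1.0, 1.5, 2.0)]   # data generated with slope 2
theta = sgd_fit_slope(samples, lr=0.05, epochs=50, rng=random.Random(0))
```

Each step uses the gradient from one randomly ordered sample rather than the full dataset, which is the source of the stochasticity discussed above. On this noiseless example, the parameter converges to the true slope of 2.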

In paddy fields, lodging rice is usually closely adjacent to normally growing rice, forming dense, small targets. The SGD optimizer can help the model to better distinguish these dense, small targets and reduce missed detections caused by targets that are too small and too densely packed. The leaves, stems, and other parts of rice may overlap due to lodging, forming multiple targets on a single organ. The SGD optimizer enhances the model’s understanding of complex structures, enabling it to more accurately identify lodging sites and reduce interference between multiple targets on a single organ. In addition, different organs of rice (such as leaves, stems, and panicles) may show different characteristics under lodging conditions. The SGD optimizer improves the model’s comprehensive extraction ability of multi-organ and multi-objective features, enabling it to more comprehensively identify lodging rice. In extensive paddy fields, the large-scale data to be detected often contains abundant and complex information. The SGD optimizer enhances the model’s ability to process such data so that it can maintain a high recall in complex backgrounds.

2.3.5. DySample Upsampling Module

In the feature pyramid network structure, the upsampling operator plays an important role [34]. However, rice lodging in farmland exhibits local and subtle morphological changes, and the image resolution may be reduced by the influence of environmental factors; under these conditions, the traditional nearest-neighbor and bilinear interpolation upsampling operators are limited by their receptive fields. During multi-scale feature fusion, they struggle to fully capture these key features, which degrades the detection of rice lodging [35]. To solve these problems, this study introduces the lightweight dynamic upsampling module DySample. Common dynamic upsampling methods, such as Content-Aware ReAssembly of FEatures (CARAFE), Fusing the Assets of Decoder and Encoder (FADE), and Similarity-Aware Point Affiliation (SAPA), use dynamic convolution and additional sub-networks to improve the feature extraction ability, which results in a large amount of computation. DySample does not use dynamic convolution but instead reformulates upsampling from the perspective of point sampling: each point is split into multiple sampling points, which yields sharper edges, thereby improving detection precision and reducing the amount of computation. The implementation process of DySample is shown in Figure 7, where X_{1} is the input feature map and X_{2} is the output feature map.

The working process of the DySample upsampling module is as follows: The input feature map is split into two paths. One path is directly fed into the network sampling module as the original feature input. The other path goes through the sampling point generator, which dynamically generates sampling points. These sampling points and their neighborhoods form the sampling set. Both the input feature map and the sampling set serve as two inputs to the network sampling module. The network sampling module integrates feature information from the original input and the sampling set, performing weighted interpolation to produce a high-quality upsampled feature map. This output feature map is then used for subsequent network processing.

The structure of the sampling set is shown in Figure 8. To increase the flexibility of the offset, static and dynamic factors are introduced. Figure 8a shows the sampling set obtained by carrying the offset of the static factor. Figure 8b shows the sampling set obtained by carrying the offset of the dynamic factor. G represents the original sampling feature, and O represents the generated offset.

In the static sampling process of the DySample upsampling module, the input feature map first passes through a linear layer that produces a tensor of size 2 g \cdot s^{2}, representing fixed sampling offsets. Before applying these offsets, a scaling factor of 0.25 is introduced to constrain their magnitude. The scaled offsets are then spatially rearranged using a pixel shuffle operation to form a structured sampling set. Finally, the original input features are fused with the features sampled at these fixed offsets to generate the output feature map. This method ensures stable and regular upsampling.

In the dynamic sampling process of the DySample upsampling module, the input feature map first passes through a linear layer and is divided into two branches. One branch generates a tensor of size 2 g \cdot s^{2}, and a fixed coefficient of 0.5 is applied to adjust its magnitude before fusion. The other branch also produces a tensor of the same size. These two branches are then fused and processed through a pixel shuffle operation to generate the dynamic sampling offsets. Finally, these offsets are combined with the original sampling features to produce the upsampled feature map.
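The two offset-generation schemes can be summarized compactly. The notation below is a sketch under stated assumptions: linear, linear_1, and linear_2 denote linear layers, the pixel shuffle step is omitted for brevity, and grid_sample stands for a generic resampling operator. The sigmoid gate σ in the dynamic case follows the common formulation of DySample and is an assumption here, since the description above states only that a coefficient of 0.5 is applied to one branch before the two branches are fused.

```latex
\begin{align}
  % Static scope factor: offsets scaled by a constant 0.25
  O_{\mathrm{static}}  &= 0.25 \cdot \mathrm{linear}(X_{1}) \\
  % Dynamic scope factor: one branch gates the other (sigmoid is an assumption)
  O_{\mathrm{dynamic}} &= 0.5 \cdot \sigma\!\left(\mathrm{linear}_{1}(X_{1})\right) \odot \mathrm{linear}_{2}(X_{1}) \\
  % Sampling set: original sampling grid G plus generated offset O
  S &= G + O \\
  % Output feature map obtained by resampling the input at S
  X_{2} &= \mathrm{grid\_sample}(X_{1}, S)
\end{align}
```

In both cases, the coefficient bounds how far a sampling point can move from its original grid position, which limits overlap between neighboring sampling regions.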

2.3.6. DO-DConv Convolution Module

The conventional stride convolution used in YOLO-series algorithms tends to cause significant detail loss during continuous downsampling, particularly when extracting small object features, which negatively impacts model performance [36]. DO-DConv augments standard convolution with additional depthwise convolution layers, enabling dynamic adaptation of kernels based on the texture or edge direction of input images, which enhances the network’s ability to capture directional features and improves its overall feature representation. The convolution structure of DO-DConv is shown in Figure 9; the output is computed through kernel composition. The calculation process is expressed as

O = (D \cdot W) \cdot F

(10)

where D is the depthwise convolution kernel, W is the conventional convolution kernel, F is the input feature, and O is the output feature. First, the two weights are multiplied, that is,

W^{'} = D \cdot W

(11)

Thus, a new weight W^{'} is generated, which is then used to convolve the input feature F, that is,

O = W^{'} \cdot F

(12)

Figure 9. Convolution structure of DO-DConv.
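The idea behind kernel composition in Equations (10)–(12) can be illustrated with a small numerical sketch. As a simplifying assumption, plain matrix products stand in for the actual convolution operators, and the values of D, W, and F are hypothetical; the sketch shows only the associativity that allows the two kernels to be folded into one.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


# Hypothetical small operators; integers keep the comparison exact.
D = [[1, 2], [0, 1]]          # stands in for the depthwise kernel
W = [[3, 0], [1, 2]]          # stands in for the conventional kernel
F = [[1, 4, 2], [0, 1, 5]]    # stands in for the input feature

# Kernel composition: fold D into W once, then apply a single operator.
W_prime = matmul(D, W)              # Equation (11)
O_composed = matmul(W_prime, F)     # Equation (12)

# Applying the two operators one after the other gives the same output.
O_sequential = matmul(D, matmul(W, F))

print(W_prime)                      # [[5, 4], [1, 2]]
print(O_composed == O_sequential)   # True
```

Since the composed weight can be computed once, the extra kernel need not add a separate operator at inference time in this simplified setting.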

3. Experiments and Analysis

3.1. Evaluation Metrics

To evaluate the effectiveness of the proposed SWRD–YOLO instance segmentation model, this study considers the following metrics: precision (P), recall (R), mean average precision (mAP@0.5), F1 score, and giga floating-point operations (GFLOPs).

Precision refers to the ratio of correctly identified positive instances to all instances that the classifier labeled as positive. The higher the precision, the lower the false detection rate of the model [32]. The formula is as follows:

P = \frac{T_{P}}{T_{P} + F_{P}} \times 100 %

(13)

where T_{P} represents the number of targets correctly predicted as rice lodging areas and F_{P} represents the number of targets incorrectly predicted as rice lodging areas.

Recall indicates the ability of the model to retrieve all relevant positive instances from the dataset. The higher the recall, the lower the missed detection rate of the model [32]. The formula is as follows:

R = \frac{T_{P}}{T_{P} + F_{N}} \times 100 %

(14)

where F_{N} represents the number of actual rice lodging areas that were not predicted as rice lodging areas.

mAP@0.5 is the mean of the average precision (AP) values over all categories at an IoU threshold of 0.5, where AP summarizes the precision of the model across different recall levels as the area under the precision–recall curve. It reflects the average performance of the model across multiple categories. The formulas are as follows:

A P = \int_{0}^{1} P(R) \, dR

(15)

m A P = \frac{1}{n} \sum_{i = 1}^{n} A P_{i}

(16)

where P(R) is the precision at recall level R, n is the number of categories, and A P_{i} is the average precision of the i-th category.

The F1 score is a statistical indicator used to measure the accuracy of a binary classification model. It takes into account both precision and recall and is the harmonic mean of the two [37]. The formula is as follows:

F 1 = \frac{2 \times P \cdot R}{P + R}

(17)
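Equations (13), (14), and (17) can be checked with a short sketch. The pixel or instance counts in the first example are hypothetical; the second example uses the precision and recall reported for SWRD–YOLO in the abstract.

```python
def precision(tp, fp):
    """Equation (13), in percent."""
    return tp / (tp + fp) * 100.0


def recall(tp, fn):
    """Equation (14), in percent."""
    return tp / (tp + fn) * 100.0


def f1_score(p, r):
    """Equation (17): harmonic mean of precision and recall."""
    return 2.0 * p * r / (p + r)


# Hypothetical counts, for illustration only.
p = precision(tp=90, fp=10)
r = recall(tp=90, fn=30)
print(p, r, round(f1_score(p, r), 1))

# Precision and recall reported for SWRD-YOLO in the abstract.
print(round(f1_score(94.8, 88.2), 1))   # 91.4
```

With P = 94.8% and R = 88.2%, the harmonic mean evaluates to about 91.4%, which agrees with the reported F1 score.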

GFLOPs denotes the number of floating-point operations, in billions, required for a single forward pass. It measures the computational cost of the model in training and inference rather than the throughput of a device such as a GPU. For a single convolutional layer, the formula is as follows:

GFLOPs = \frac{2 \times H_{out} \cdot W_{out} \cdot C_{in} \cdot K_{h} \cdot K_{w} \cdot C_{out}}{10^{9}}

(18)

where H_{out} and W_{out} are the height and width of the output, C_{in} and C_{out} are the number of channels of the input and output, and K_{h} and K_{w} are the height and width of the convolution kernel [38].
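As a worked instance of Equation (18), the following sketch evaluates the cost of one convolutional layer. The layer dimensions are hypothetical and are not taken from the SWRD–YOLO architecture.

```python
def conv_gflops(h_out, w_out, c_in, c_out, k_h, k_w):
    """Equation (18): GFLOPs of one convolutional layer (bias ignored)."""
    return 2 * h_out * w_out * c_in * k_h * k_w * c_out / 1e9


# Hypothetical layer: 3x3 convolution, 64 -> 128 channels, 80x80 output.
layer_cost = conv_gflops(h_out=80, w_out=80, c_in=64, c_out=128, k_h=3, k_w=3)
print(layer_cost)   # about 0.944
```

Summing this quantity over all layers gives the total count of the kind reported in the comparison tables; the factor of 2 counts one multiplication and one addition per multiply-accumulate.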

Among these metrics, precision, recall, and mAP@0.5 are obtained from the result.csv file generated after model training; the F1 score is calculated based on its formula; and GFLOPs are automatically reported by the code during the training and inference processes.

3.2. Attention Mechanism Comparison Experiments

To evaluate the performance improvement introduced by the ResCBAM, the Convolutional Block Attention Module (CBAM), Shuffle Attention Mechanism (ShuffleAttention), and Multi-Head Self-Attention (MHSA) were individually integrated into the YOLOv8n-seg network for comparison, with all other components of the network kept unchanged. The comparison results are presented in Table 2.

Compared with the +CBAM, +ShuffleAttention, and +MHSA models, the +ResCBAM model improved recall by 16.4%, 5.9%, and 10.6%, respectively, and increased mAP@0.5 by 11.9%, 4.6%, and 8.3%, respectively. The F1 score also improved by 10.8%, 2.8%, and 7.3%, respectively. GFLOPs increased by 2.4, 2.4, and 2.2, respectively. The precision of the +ResCBAM model was 2.9% and 3.7% higher than that of the +CBAM and +MHSA models, respectively, and 0.2% lower than that of the +ShuffleAttention model. The ResCBAM demonstrated superiority in highlighting key feature information and outperformed the CBAM and MHSA attention mechanisms in segmenting lodging regions. Compared with the ShuffleAttention mechanism, ResCBAM substantially reduced missed detections (higher recall) at the cost of a marginal 0.2% decrease in precision, yielding better overall performance in terms of mAP@0.5 and F1 score.

The superior performance of the ResCBAM can be attributed to its dual attention mechanism, which effectively integrates channel and spatial attention to simultaneously capture important feature maps and their spatial locations. This complementary design enables better feature representation and localization, resulting in improved detection accuracy and reduced errors compared to single-attention or other attention mechanisms.

3.3. Loss Function Comparison Experiments

To evaluate the impact of different loss functions on model performance, the CIoU, WIoU, and minimum point distance intersection over union (MPDIoU) were individually integrated into the model for comparison while keeping the rest of the YOLOv8n-seg network structure unchanged. The training performance of each loss function is presented in Table 3.

Compared with the YOLOv8n-seg and +MPDIoU models, the WIoU improved precision by 6.9% and 2.4%, respectively, increased recall by 11.6% and 0.3%, increased mAP@0.5 by 11.8% and 2.3%, and improved the F1 score by 9.5% and 1.3%. The GFLOPs of the YOLOv8n-seg, +WIoU, and +MPDIoU models were all 12.1. The WIoU loss function demonstrated superior optimization capabilities, significantly enhancing model localization precision across objects of various scales.

3.4. Optimizer Comparison Experiments

To verify the improved performance of the SGD optimizer, comparative experiments were conducted using seven optimizers (Adaptive Moment Estimation (Adam), AdamW, Lion, Nesterov-Accelerated Adaptive Moment Estimation (NAdam), Rectified Adam (RAdam), SophiaG, and SGD) while keeping the rest of the YOLOv8n-seg network structure unchanged. The results are shown in Table 4.

Compared with the +Adam, YOLOv8n-seg, +NAdam, and +RAdam models, SGD improved precision by 7.3%, 5.2%, 5%, and 2.1%, respectively, increased recall by 17%, 14%, 13.6%, and 6.8%, respectively, and improved mAP@0.5 by 12.8%, 9.6%, 9.9%, and 4.4%, respectively. The F1 score increased by 13.1%, 10.1%, 10.3%, and 4.5%. The GFLOPs of the +Adam, YOLOv8n-seg, +NAdam, +RAdam, and +SGD models were all 12.1. The SGD optimizer offered significant advantages in accurately identifying rice lodging regions, thereby improving model robustness and generalization ability in complex field conditions. Its stable training dynamics make it particularly suitable for rice lodging detection tasks, contributing to enhanced detection precision and reliability.

3.5. Comparison Test of Upsampling Module

To verify the influence of the DySample upsampling module on instance segmentation performance, the original nn.Upsample module was replaced with DySample while keeping the rest of the YOLOv8n-seg network structure unchanged, and a comparative experiment was conducted. The test results are presented in Table 5.

The results show that in the model using DySample, precision increased from 86.6% to 92.4%, recall increased from 71.7% to 85.3%, mAP@0.5 increased from 80.5% to 91.0%, and the F1 score increased from 78.6% to 88.7%. GFLOPs remained unchanged at 12.1, indicating that DySample effectively enhances the model’s ability to reconstruct high-resolution features without increasing additional computational overhead, significantly improving the segmentation precision and verifying its advantages in complex boundary recovery and fine-grained target recognition.

3.6. Convolution Structure Comparison Test

To assess the impact of different convolutional structures on model performance, three modules—Conv, Pinwheel Convolution (PinwheelConv), and DO-DConv—were integrated into the YOLOv8n-seg model for comparison, with all other network components held constant. The experimental findings are presented in Table 6.

Compared with the +PinwheelConv and YOLOv8n-seg models, DO-DConv improved precision by 1.1% and 4.8%, respectively, increased recall by 5.1% and 11.2%, increased mAP@0.5 by 4.2% and 9.5%, increased the F1 score by 3.3% and 8.3%, and reduced GFLOPs to 11.8, showing an excellent balance between detection performance and inference efficiency. The DO-DConv structure, with its enhanced detection performance and reduced computational complexity, is highly valuable for edge devices and real-world deployment.

3.7. Ablation Experiments

To evaluate the optimization effects of the SGD optimizer, WIoU loss function, ResCBAM, DySample upsampling module, and DO-DConv convolution module, ablation experiments were conducted on the test set. Starting with the baseline YOLOv8n-seg model, modules were incrementally added to form five distinct experimental configurations. An overview of the ablation experiment results is provided in Table 7.

In Table 7, a check mark (✓) indicates the usage of the corresponding module.

When the basic model did not introduce any improved modules, precision was 86.6%, recall was 71.7%, mAP@0.5 was 80.5%, the F1 score was 78.6%, and GFLOPs were 12.1. After introducing the SGD optimizer, the mAP@0.5 of the model increased to 90.1%, and the F1 score increased to 88.7%, verifying the positive role of SGD in accelerating convergence speed and improving training stability. After adding the WIoU loss function, mAP@0.5 further improved to 91.2%, indicating that the loss function has a significant effect on improving target positioning precision and can alleviate the precision loss caused by the target box regression error. After introducing ResCBAM, the precision of the model increased to 94.7%, and the F1 score reached 91.3%. The feature expression ability of the model in the target area was significantly enhanced, effectively improving the overall performance of detection and segmentation. After further adding the DySample module, recall increased to 90.5%, and mAP@0.5 reached 93.7%, indicating that the module has advantages in high-resolution feature recovery, which can more accurately restore the target boundary and effectively improve segmentation precision. Finally, the DO-DConv direction-aware dynamic convolution module reduced the GFLOPs from 14.5 to 14.2 while maintaining high detection precision, significantly optimizing the inference efficiency of the model and improving its deployment ability on edge devices. Overall, while ensuring high-precision instance segmentation performance, the improved scheme takes into account the lightweight design and deployment efficiency of the model and provides efficient and reliable technical support for practical agricultural applications such as rice lodging detection.

In summary, the proposed modules enhance model performance in several key areas. Among them, ResCBAM and DySample play key roles in target area recognition and boundary information recovery, respectively, while DO-DConv significantly reduces computational overhead while ensuring the detection precision of the model, achieving a good trade-off between precision and efficiency, thereby showing strong practical deployment potential. The trend analysis reveals that ResCBAM primarily improves precision, DySample has a greater impact on recall, and DO-DConv reduces GFLOPs while optimizing inference efficiency, all while maintaining high precision. The experimental results demonstrate that the proposed scheme ensures high-precision instance segmentation while maintaining low computational resource consumption, providing efficient and reliable support for practical agricultural applications such as rice lodging monitoring.

3.8. Comparison Experiment of Different Segmentation Models

Since this study focuses on real-time, lightweight instance segmentation in complex agricultural environments, the YOLO family of models was selected for comparison due to its unified architecture and superior inference speed. Traditional segmentation models such as Mask R-CNN or DeepLabV3+ typically offer higher precision but require significantly more computation, making them less practical for deployment in resource-constrained field scenarios. Therefore, they were not included in the comparison. To evaluate the effectiveness of the proposed model, a comparison was performed on the test set against the original YOLO models. Table 8 presents the performance comparison results among YOLOv5n-seg, YOLOv8l-seg, YOLO11n-seg, and SWRD–YOLO (the proposed model).

For each YOLO-series model, five versions are typically available: n, s, m, l, and x, with detection precision increasing from n to x. Lower-ranked models generally offer faster inference speeds and lower resource consumption, while higher-ranked models prioritize accuracy at the cost of computational efficiency. YOLOv8n, although lower in detection precision, is the most lightweight and resource-efficient variant, making it suitable for rapid detection tasks.

Compared with YOLOv5n-seg, YOLOv8l-seg, and YOLO11n-seg, the proposed SWRD–YOLO model improved precision by 5.0%, 6.2%, and 3.4%, respectively. mAP@0.5 increased by 2.4%, 13.9%, and 6.0%, while the F1 score improved by 2.0%, 16.2%, and 6.7%. In terms of GFLOPs, SWRD–YOLO showed increases of 7.3, 2.1, and 3.8, respectively. Regarding recall, SWRD–YOLO achieved gains of 22.9% and 9.3% over YOLOv8l-seg and YOLO11n-seg, respectively, while showing a slight decrease of 0.9% compared to YOLOv5n-seg. Overall, SWRD–YOLO demonstrated superior and more stable recognition performance at the cost of a moderate increase in GFLOPs, especially in scenarios involving dense and small targets, highlighting its effectiveness for real-time agricultural applications.

3.9. Edge Deployment and Lodging Quantification

3.9.1. Visualization of Segmentation Prediction Results

To enable accurate and efficient rice lodging detection, the SWRD–YOLO model was trained under the PyTorch framework and exported as a best.pt file, containing both the network structure and learned weights. For deployment on resource-constrained edge devices, the model was first converted to the ONNX format, which facilitates interoperability across different inference engines. Subsequently, the ONNX model was optimized into a TensorRT engine (best.engine) using NVIDIA’s TensorRT toolkit. This optimization leverages techniques such as layer fusion and precision calibration to significantly enhance inference speed and efficiency.

The deployment was conducted on both PC and edge platforms to evaluate robustness. The intelligent terminal device selected was the Allspark2, powered by the NVIDIA Jetson Orin NX module. Figure 10, Figure 11 and Figure 12 illustrate the original input images, the segmentation prediction results on the PC platform, and the results on the edge platform, respectively. In these figures, the label “beating down” is used to denote the lodging areas identified by SWRD–YOLO based on the instance segmentation results. The converted engine model achieved consistent segmentation performance across platforms. Specifically, the representative test images showed segmentation confidences of 90%, 92%, and 92% on the PC platform and 91%, 92%, and 92% on the edge device. This minimal variation demonstrates the model’s strong generalization capability and cross-platform stability.

3.9.2. Inference Performance Evaluation on Edge Device

The inference speed of both the baseline and SWRD–YOLO models was evaluated on the Allspark2 intelligent terminal equipped with the NVIDIA Jetson Orin NX module.

Inference was conducted on a dataset of 100 images, each processed individually. The average frames per second (FPS) was computed by averaging the results of 20 independent runs to minimize variability. To simulate realistic deployment conditions, no warm-up runs were performed before timing. The baseline model achieved an average inference speed of 17.59 FPS, whereas the SWRD–YOLO model achieved 16.15 FPS. Although this represents a slight reduction of approximately 8%, SWRD–YOLO exhibited significant improvements in detection precision and recall. The reduction in FPS is primarily attributed to the increased computational complexity introduced by advanced modules such as DO-DConv, ResCBAM, and DySample. Nevertheless, the achieved FPS is sufficient to support near-real-time processing in agricultural monitoring tasks, effectively balancing accuracy and efficiency.
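The measurement protocol described above can be sketched as follows. This is a minimal illustration, not the benchmarking script used in this study: `measure_fps`, `relative_drop`, and `dummy_infer` are hypothetical names, and the dummy function merely sleeps in place of a real model call.

```python
import time


def measure_fps(infer, images, runs=20):
    """Average FPS over several runs, each processing the images one by one."""
    fps_per_run = []
    for _ in range(runs):
        start = time.perf_counter()
        for image in images:
            infer(image)
        elapsed = time.perf_counter() - start
        fps_per_run.append(len(images) / elapsed)
    return sum(fps_per_run) / len(fps_per_run)


def relative_drop(baseline_fps, model_fps):
    """Fractional slowdown of model_fps relative to baseline_fps."""
    return (baseline_fps - model_fps) / baseline_fps


def dummy_infer(image):
    """Stand-in for a real model call; sleeps for about one millisecond."""
    time.sleep(0.001)


# Two short runs over 100 placeholder images keep the demonstration quick.
print(measure_fps(dummy_infer, images=list(range(100)), runs=2))

# Relative reduction between the reported 17.59 FPS and 16.15 FPS.
print(round(relative_drop(17.59, 16.15) * 100, 1))   # 8.2
```

The second value reproduces the reduction of approximately 8% stated above.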

These results confirm that, despite its increased architectural complexity, the proposed SWRD–YOLO model maintains an acceptable inference speed on edge devices, enabling practical deployment for UAV-based rice lodging detection.

3.9.3. Grid-Based Lodging Ratio Estimation

To further enhance the robustness and interpretability of lodging detection, this study employed a grid-based analysis approach to estimate both the local and global lodging ratios. Each image was partitioned into fixed-size grids, enabling fine-grained spatial statistics and intuitive visualization of lodging severity distribution.

Each image is evenly divided into N grids of equal dimensions. Within each grid, the number of lodged pixels (segmentation mask) is denoted as F_{i}, and the total number of valid pixels in the grid is denoted as T_{i}. The local lodging ratio of the i-th grid is calculated as

r_{i} = \frac{F_{i}}{T_{i}}

(19)

This formulation offers a localized assessment of lodging severity, which is essential for identifying spatial heterogeneity and micro-regional stress zones within the field.

For accurate quantification, two complementary methods are used. At the local level, the lodging ratio of each grid is calculated based on the instance segmentation mask, enabling targeted field management. At the global level, the overall lodging ratio can be computed in two ways: (1) by directly dividing the total number of lodged pixels by the total number of valid pixels in the image, and (2) by aggregating the grid-level lodging ratios weighted by their respective pixel counts. Experimental results show that both methods produce consistent results under normal conditions, while the weighted grid-based method offers better robustness in handling partial occlusions or uneven lighting, thereby enhancing the reliability of lodging severity estimation.

Specifically, the second method—weighted aggregation of grid-level lodging ratios—is defined as follows:

R = \frac{\sum_{i = 1}^{N} T_{i} r_{i}}{\sum_{i = 1}^{N} T_{i}} = \frac{\sum_{i = 1}^{N} F_{i}}{\sum_{i = 1}^{N} T_{i}}

(20)

This represents the weighted average of all local lodging ratios, where the weight of each grid is proportional to its valid pixel count. This approach yields a more objective and reliable estimate of the overall degree of lodging.
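Equations (19) and (20) can be sketched on a toy binary mask. The mask, the cell size, and the function names below are hypothetical, and every pixel is treated as valid, which is an assumption made for simplicity.

```python
def grid_lodging_ratios(mask, grid_h, grid_w):
    """Split a binary mask into grid cells and return (F_i, T_i, r_i) per cell.

    mask is a list of rows; 1 marks a lodged pixel and 0 a non-lodged pixel.
    Every pixel is treated as valid here, which is an assumption.
    """
    height, width = len(mask), len(mask[0])
    cells = []
    for top in range(0, height, grid_h):
        for left in range(0, width, grid_w):
            rows = mask[top:top + grid_h]
            pixels = [v for row in rows for v in row[left:left + grid_w]]
            f_i, t_i = sum(pixels), len(pixels)
            cells.append((f_i, t_i, f_i / t_i))    # Equation (19)
    return cells


def overall_ratio(cells):
    """Equation (20): local ratios weighted by valid pixel count."""
    total_valid = sum(t for _, t, _ in cells)
    return sum(t * r for _, t, r in cells) / total_valid


# Toy 4x6 mask (hypothetical), divided into 2x3 cells.
mask = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 1],
]
cells = grid_lodging_ratios(mask, grid_h=2, grid_w=3)
print([round(r, 3) for _, _, r in cells])
print(overall_ratio(cells))
```

In this toy setting, where the cells tile the image and every pixel is valid, the weighted aggregate coincides with the direct ratio of lodged pixels to all pixels (11/24); the two can differ only when some pixels are excluded as invalid.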

Figure 13 illustrates the grid-based lodging detection results: red grid lines overlay the original image, blue semi-transparent lodging masks highlight lodging areas, and the overall lodging percentage is displayed in the upper-left corner for clear visual feedback. This method enables spatial monitoring of rice lodging by dividing the image into grids, thereby identifying localized areas with severe lodging. Such granularity supports precision field management decisions. Meanwhile, grid-based statistical analysis of lodging percentages provides a standardized metric for quantitative comparisons across different time points or field plots. By integrating instance segmentation with grid-level analysis, the system not only supports localized assessment but also contributes to holistic decision-making, significantly enhancing its value in precision agriculture and UAV-based crop evaluation.

4. Discussion

Lodging detection has become increasingly important in precision agriculture due to its significant impact on crop yield and mechanized harvesting. This study proposes SWRD–YOLO, a lightweight instance segmentation model designed to address challenges such as complex lighting conditions, irregular lodging shapes, and class imbalance inherent in UAV-based lodging imagery. By incorporating advanced modules, including ResCBAM, DySample, and DO-DConv, alongside optimized training strategies like the SGD optimizer and the WIoU loss function, the model achieves substantial improvements in precision, recall, and mAP metrics.

Compared to existing methods, SWRD–YOLO demonstrates a favorable balance between detection accuracy and computational efficiency, making it suitable for real-time deployment on resource-constrained embedded devices. For instance, Zhao et al. [16] used an improved PSPNet for wheat lodging segmentation, achieving high accuracy at the cost of increased model complexity and slower inference speed. Yao et al. [17] developed a modified DeepLabV3+ for real-time lodging recognition on harvesters, focusing more on onboard harvester hardware rather than UAV platforms. In contrast, SWRD–YOLO maintains competitive accuracy with a significantly reduced model size and faster inference (16.15 FPS on a Jetson Orin NX).

Moreover, Zhang et al. [18] optimized segmentation architectures for rice lodging detection using UAV imagery; however, the increased complexity limits their practicality in real-time applications. Kumar et al. [19] utilized multispectral UAV data and machine learning for plot-level lodging assessment, emphasizing spectral features; however, they faced challenges in small target detection. Guan et al. [21] proposed a multi-stage approach for maize lodging quantification, which requires extensive multi-date data and thereby limits rapid assessment capabilities. In comparison, SWRD–YOLO achieves precise lodging segmentation from single-date RGB UAV images by enhancing feature fusion and attention mechanisms.

Despite these advances, limitations persist. The model’s robustness under extreme environmental conditions, such as heavy wind or rain, and image quality degradation due to UAV motion blur, requires further study. Additionally, while this study focuses on rice lodging, generalizing the approach to other crops requires dataset expansion and model retraining. Future work could explore multi-temporal data integration and domain adaptation techniques to improve adaptability and robustness.

To further improve the model’s accuracy and robustness, it is important to consider potential sources of error that may affect segmentation performance. Potential errors arise from variable lighting, occlusions, overlapping plants, and the class imbalance between lodging and non-lodging areas. These factors can cause misclassifications and boundary ambiguity. Careful data collection, including flight altitude standardization and imaging angle control, alongside advanced data augmentation strategies, will be essential to mitigate these challenges and improve segmentation accuracy.

In summary, SWRD–YOLO effectively balances accuracy, speed, and model complexity, demonstrating strong potential for practical deployment in intelligent crop lodging monitoring systems. This contributes to timely field management and sustainable agricultural practices.

5. Conclusions

This study addresses several critical challenges in UAV-based rice lodging detection, including complex lighting conditions, diverse lodging morphologies, varying shooting angles, and class imbalance. To enhance the model’s robustness under these factors, we applied data augmentation techniques such as color enhancement, rotation, and noise injection. A lightweight SWRD–YOLO model was proposed to balance detection precision with computational efficiency, enabling deployment on resource-constrained edge devices. These integrated strategies significantly improved detection precision and adaptability in complex field environments.

Based on the YOLOv8n-seg framework, the SWRD–YOLO model incorporates several key improvements, including replacing AdamW with the SGD optimizer for enhanced generalization on dense, small targets; integrating the WIoU loss function; adopting the ResCBAM; employing the DySample upsampling module; and utilizing the DO-DConv structure. Collectively, these enhancements yielded significant gains in precision, recall, mAP@0.5, and F1 score of 8.2%, 16.5%, 12.8%, and 12.8%, respectively, over the baseline. While computational complexity slightly increased from 12.1 to 14.2 GFLOPs, operational efficiency remained high. Notably, the DO-DConv module contributed to reducing GFLOPs from 14.5 to 14.2 without compromising precision, optimizing inference efficiency. This balance between computational cost and performance confirms that the proposed SWRD–YOLO model achieved a lightweight design without sacrificing accuracy, making it well-suited for real-time deployment on resource-constrained edge devices in agricultural applications.

Moreover, the lightweight design and optimizations of the SWRD–YOLO model offer distinct benefits in inference speed and deployment feasibility. The model can be effectively deployed on edge devices such as the NVIDIA Jetson Orin NX, supporting real-time processing. This deployment reduces dependence on costly computing infrastructure, accelerates detection workflows, and delivers timely, actionable insights for field management, thereby lowering operational costs and enhancing the practicality of high-precision lodging detection in agricultural settings.

In addition, a grid-based quantitative method was implemented to estimate lodging severity from segmentation masks. By dividing each image into uniform grids and calculating lodging ratios per grid, the overall lodging degree was determined through weighted aggregation. This fine-grained estimation not only enhances interpretability but also facilitates localized field assessment, making it particularly applicable to precision agriculture and disaster evaluation.

Guided by insights from this study, future research will focus on further reducing model complexity to enable more efficient deployment on resource-limited devices. Enhancing the model’s robustness in detecting blurred or motion-distorted targets, frequently encountered in high-speed UAV operations, will be a key objective. Additionally, improving adaptability to diverse and complex farmland environments—including various crop species and geographic regions—remains critical. These advancements will drive the development of intelligent, lightweight, and practical solutions for real-time rice lodging detection, ultimately advancing precision agriculture and sustainable crop management.

Author Contributions

Conceptualization, C.G. and F.T.; methodology, C.G.; software, C.G.; validation, C.G. and F.T.; formal analysis, C.G. and F.T.; investigation, C.G. and F.T.; resources, F.T.; data curation, C.G. and F.T.; writing—original draft preparation, C.G.; writing—review and editing, C.G. and F.T.; visualization, C.G.; supervision, F.T.; project administration, F.T.; funding acquisition, F.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key Research and Development Program of China (2023YFD2301605) and the Heilongjiang Natural Science Foundation Project (LH2023F043).

Institutional Review Board Statement

This study did not involve human participants or animals, and therefore ethical review and approval were not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to an ongoing study.

Acknowledgments

The authors would like to acknowledge the anonymous reviewers for their valuable comments and the members of the editorial team for their careful proofreading of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

Figure 1. Map of the study area (Wuchang City, Heilongjiang Province).

Figure 2. Images of rice lodging areas.

Figure 3. Images generated by the four data augmentation (amplification) methods.

Figure 4. Network architecture of the YOLOv8n-seg model.

Figure 5. Network architecture of the improved YOLOv8n-seg model.

Figure 6. Structure of the ResCBAM.

Figure 7. DySample upsampling operator structure diagram.

Figure 8. Structure of the sampling set.

Figure 10. Original images.

Figure 11. Detection results on the PC platform.

Figure 12. Detection results on the intelligent terminal.

Figure 13. Grid-based lodging detection results.

Table 1. Dataset sample information.

Category	Training Set	Validation Set	Test Set	Total
Lodging rice	1522	418	100	2040

Table 2. Comparison results of different attention mechanisms.

Model	Precision/%	Recall/%	mAP@0.5/%	F1 Score/%	GFLOPs
+CBAM	88.3	67.4	78.6	76.5	12.1
+ShuffleAttention	91.4	77.9	89.5	84.5	12.1
+MHSA	87.5	73.2	82.2	80.0	12.3
+ResCBAM	91.2	83.8	90.5	87.3	14.5

Table 3. Comparison results of different loss functions.

Model	Precision/%	Recall/%	mAP@0.5/%	F1 Score/%	GFLOPs
YOLOv8n-seg	86.6	71.7	80.5	78.6	12.1
+WIoU	93.5	83.3	92.3	88.1	12.1
+MPDIoU	91.1	83.0	90.0	86.8	12.1

Table 4. Comparison results of different optimizers.

Model	Precision/%	Recall/%	mAP@0.5/%	F1 Score/%	GFLOPs
+Adam	84.5	68.7	77.3	75.6	12.1
YOLOv8n-seg	86.6	71.7	80.5	78.6	12.1
+NAdam	86.8	72.1	80.2	78.4	12.1
+RAdam	89.7	78.9	85.7	84.2	12.1
+SGD	91.8	85.7	90.1	88.7	12.1

Table 5. Comparison results of different sampling modules.

Model	Precision/%	Recall/%	mAP@0.5/%	F1 Score/%	GFLOPs
YOLOv8n-seg	86.6	71.7	80.5	78.6	12.1
+DySample	92.4	85.3	91.0	88.7	12.1

Table 6. Comparison results of different convolution structures.

Model	Precision/%	Recall/%	mAP@0.5/%	F1 Score/%	GFLOPs
YOLOv8n-seg	86.6	71.7	80.5	78.6	12.1
+DO-DConv	91.4	82.9	90.0	86.9	11.8
+PinwheelConv	90.3	77.8	85.8	83.6	12.1

Table 7. Comparison results of the ablation test.

SGD	WIoU	ResCBAM	DySample	DO-DConv	Precision/%	Recall/%	mAP@0.5/%	F1 Score/%	GFLOPs
					86.6	71.7	80.5	78.6	12.1
✓					91.8	85.7	90.1	88.7	12.1
✓	✓				92.6	84.4	91.2	88.4	12.1
✓	✓	✓			94.7	88.2	92.4	91.3	14.5
✓	✓	✓	✓		92.6	90.5	93.7	91.9	14.5
✓	✓	✓	✓	✓	94.8	88.2	93.3	91.4	14.2
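To see what each module adds, consecutive rows of Table 7 can be differenced. The sketch below is a reading aid with illustrative names. It reports the change in each metric when a module is added on top of the previous configuration, so each delta reflects that specific stacking order rather than the module's effect in isolation.

```python
# Rows of Table 7 in order: (added module, precision, recall, mAP@0.5, F1, GFLOPs).
rows = [
    ("baseline", 86.6, 71.7, 80.5, 78.6, 12.1),
    ("SGD",      91.8, 85.7, 90.1, 88.7, 12.1),
    ("WIoU",     92.6, 84.4, 91.2, 88.4, 12.1),
    ("ResCBAM",  94.7, 88.2, 92.4, 91.3, 14.5),
    ("DySample", 92.6, 90.5, 93.7, 91.9, 14.5),
    ("DO-DConv", 94.8, 88.2, 93.3, 91.4, 14.2),
]

# Change in each metric relative to the previous row, rounded to one decimal.
deltas = {}
for prev, curr in zip(rows, rows[1:]):
    deltas[curr[0]] = tuple(round(c - p, 1) for p, c in zip(prev[1:], curr[1:]))

for name, d in deltas.items():
    print(name, d)
```

The last step illustrates the trade-off: adding DO-DConv lowers the cost by 0.3 GFLOPs and raises precision by 2.2 points, while recall, mAP@0.5, and F1 score fall by 2.3, 0.4, and 0.5 points, respectively.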

Table 8. Comparison results of different segmentation models.

Model	Precision/%	Recall/%	mAP@0.5/%	F1 Score/%	GFLOPs
YOLOv5n-seg	89.8	89.1	90.9	89.4	6.9
YOLOv8l-seg	88.6	65.3	79.4	75.2	12.1
YOLO11n-seg	91.4	78.9	87.3	84.7	10.4
SWRD–YOLO	94.8	88.2	93.3	91.4	14.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
