Article

An Instance Segmentation Method for Agricultural Plastic Residual Film on Cotton Fields Based on RSE-YOLO-Seg

1 School of Agricultural Engineering, Jiangsu University, Zhenjiang 212013, China
2 Key Laboratory for Theory and Technology of Intelligent Agricultural Machinery and Equipment, Jiangsu University, Zhenjiang 212013, China
3 Jiangsu Province and Education Ministry Co-Sponsored Synergistic Innovation Center of Modern Agricultural Equipment, Zhenjiang 212013, China
4 College of Mechanical and Electrical Engineering, Shihezi University, Shihezi 832003, China
* Author to whom correspondence should be addressed.
Agriculture 2025, 15(19), 2025; https://doi.org/10.3390/agriculture15192025
Submission received: 1 August 2025 / Revised: 3 September 2025 / Accepted: 24 September 2025 / Published: 26 September 2025
(This article belongs to the Section Artificial Intelligence and Digital Agriculture)

Abstract

To address the challenges of multi-scale missed detections, false positives, and incomplete boundary segmentation in cotton field residual plastic film detection, this study proposes the RSE-YOLO-Seg model. First, a PKI module (adaptive receptive field) is integrated into the C3K2 block and combined with the SegNext attention mechanism (multi-scale convolutional kernels) to capture multi-scale residual film features. Second, RFCAConv replaces standard convolutional layers to differentially process regions and receptive fields of different sizes, and an Efficient-Head is designed to reduce parameters. Finally, an NM-IoU loss function is proposed to enhance small residual film detection and boundary segmentation. Experiments on a self-constructed dataset show that RSE-YOLO-Seg improves the object detection average precision (mAP50(B)) by 3% and mask segmentation average precision (mAP50(M)) by 2.7% compared with the baseline, with all module improvements being statistically significant (p < 0.05). Across four complex scenarios, it exhibits stronger robustness than mainstream models (YOLOv5n-seg, YOLOv8n-seg, YOLOv10n-seg, YOLO11n-seg), and achieves 17/38 FPS on Jetson Nano B01/Orin. Additionally, when combined with DeepSORT, compared with random image sampling, the mean error between predicted and actual residual film area decreases from 232.30 cm2 to 142.00 cm2, and the root mean square error (RMSE) drops from 251.53 cm2 to 130.25 cm2. This effectively mitigates pose-induced random errors in static images and significantly improves area estimation accuracy.

1. Introduction

As a crucial strategic commodity in China, cotton occupies a unique economic position [1]. Cotton mulching technology, which boasts advantages of water retention, soil moisture conservation, temperature regulation, and weed suppression, notably improves cotton growth conditions and substantially boosts cotton yields. However, agricultural mulch film is mainly composed of polyethylene, a material that requires over 200 years for natural degradation [2]. Its accumulation in soil causes severe problems such as soil structure deterioration and environmental pollution [3]. Consequently, the government has attached increasing importance to the recovery and management of agricultural plastic residual film. Residual film pollution control is a comprehensive effort. Beyond developing residual film recovery machinery for mechanical collection and implementing government subsidies to encourage farmers’ recovery efforts, seasonal monitoring of residual film recovery efficiency is also critical for effective pollution management [4].
The primary traditional method for early detection of residual plastic mulch is manual sampling. Per sampling standards, plots of specific sizes are selected, residual mulch is collected from different soil layers, then washed, dried, and quantified—rendering the process time-consuming and labor-intensive [5]. The widespread application of artificial intelligence (AI) in agriculture has driven agricultural production toward higher efficiency and quality. In the field of plastic film detection, scholars have begun exploring machine learning for plastic film identification. Zheng et al. [6] utilized the Google Earth Engine (GEE) cloud platform and Landsat-8 satellite surface reflectance dataset, selected spectral, index, and texture features through feature selection methods, and applied the random forest algorithm to extract plastic film in Changwu County, achieving an accuracy of up to 95%. Wu et al. [7] investigated post-harvest tobacco fields via UAV remote sensing imagery, employing RGB and HSV color space models to segment plastic mulch targets, and achieved an average recognition rate of 87.49%. Zhai et al. [8] proposed a method integrating pixel blocks and machine learning models to segment ground residual plastic mulch, aiming to assess such mulch in cotton fields before planting. They found the artificial neural network (ANN) model combined with 20 × 20 pixel blocks yielded the optimal average intersection-over-union (IoU), with an identification rate of 71.25%. However, after the operation of residual film recovery machines, the residual film on the ground becomes fragmented—rendering satellite or UAV remote sensing-based methods (originally suitable for intact film detection) difficult to apply. Additionally, in this scenario, the ground background and residual film exhibit similar color characteristics, which makes traditional machine learning methods (e.g., color threshold segmentation, binarization) ineffective.
In recent years, scholars have begun exploring deep learning for residual film detection. Zhang et al. [9] employed the Faster R-CNN convolutional neural network for residual film identification, adopting a dual-threshold algorithm with ResNet50 as the main feature extraction network, achieving an accuracy rate of 89.24%. However, the model is relatively complex and computationally expensive. Huang et al. [10] combined YOLOv7-SPD object detection with DeepLabV3+ image segmentation: first identifying residual films, then calculating their areas using image segmentation algorithms. However, YOLOv7 has many computational parameters and cannot achieve real-time detection. Niu [11] improved the SegFormer model by adding a 1/64-level feature map extraction layer to enhance small object detection. The improved model achieved an average IoU of 83% with a single detection time of 251 ms, but it cannot meet real-time requirements and only performs segmentation on single static images, failing to consider the impact of different residual film poses on segmentation accuracy [12]. Lin et al. [13] used YOLOv5s as the base model, combined it with ELAN attention, and replaced downsampling to propose a YOLO-SDI residual film recognition model. This improved the recognition of small residual film objects and optimized detection speed, but the model could only count the number of residual films.
The aforementioned scholars have made significant contributions to the identification of agricultural residual films in farmland. However, residual films remaining after the operation of residual film recovery machines exhibit distinct characteristics: variable sizes, complex backgrounds (e.g., similar colors between the ground and residual films, coupled with interference from residual cotton, straw, and other debris), and diverse orientations. Existing methods still suffer from limitations in accurately identifying multi-scale residual films and conducting real-time assessments of residual film quantities in complex farmland environments.
Based on this, this paper takes the residual film left on the ground after the operation of a self-propelled residual film recovery machine as the detection object. To address issues such as misclassification and missed detection of multi-scale residual films, unclear edges during segmentation, and diverse orientations, we propose a residual film instance segmentation model RSE-YOLO-Seg. Based on YOLO11-Seg, this model incorporates a variable receptive field PKI module into the C3K2 module of the backbone network, combined with SegNext attention featuring multi-scale convolutional kernels, to capture residual film features with significant scale variations. Next, standard convolutions in the network are replaced with receptive field convolutions (RFCAConv) to emphasize the spatial characteristics of receptive fields in different regions and sizes. An efficient detection head is also adopted to enhance the extraction of residual film feature details. Finally, to address issues such as small targets and incomplete residual film edge segmentation, a novel loss function (NM-IoU) is proposed to effectively resolve problems like missed detection of fragmented residual films and inaccurate segmentation. To address the issue of residual films in different poses, we employ the instance segmentation model RSE-YOLO-Seg combined with DeepSort tracking to track and segment residual films. At their optimal pose, we capture their bounding box images and calculate their projected area (the angle closest to orthographic projection), thereby estimating the residual film area. This reduces the error caused by static image segmentation under different residual film poses.

2. Materials and Methods

2.1. Establishment of Agricultural Plastic Residual Film Dataset

To construct a residual film dataset under natural environments, as illustrated in Figure 1, residual plastic film images were collected from cotton fields in Shihezi City (44.3° N, 86.0° E) and Shawan City (44.2° N, 85.6° E), Xinjiang, during the period from October to mid-November 2024. This sampling site was chosen because it covers the sandy loam and clay loam soils typical of cotton-growing regions in Xinjiang.
Image acquisition was primarily conducted during two time windows: 9:00–11:30 a.m. and 4:00–6:30 p.m. This scheduling was intended to minimize shadow interference induced by intense sunlight and improve image quality. For cotton fields with varying post-recovery machine operation conditions, residual film images were captured using two devices: an iPhone 12 (Apple Inc., Zhengzhou, China; resolution: 3024 × 4032) and a DJI MINI 4K drone (DJI, Zhengzhou, China; resolution: 3840 × 2160). During image acquisition, the cameras were kept parallel to the ground. Furthermore, based on the residual film size range (5–50 cm) and references from the existing literature on UAV-based residual film data collection [11], the shooting parameters were set as follows: the smartphone was used for close-range imaging at 0.5–2 m from the target, while the drone was operated for long-range imaging at a flying altitude of 3–5 m. Images of surface residual film in natural cotton fields were collected under conditions of varying shooting angles, light intensities, and soil types.
The residual film on the field surface mainly exhibits three distribution states: bare exposed surface, suspended on cotton stalks, and distributed in complex inter-row environments. Additionally, residual films were captured under different soil background conditions, such as clean soil, soil with a high content of cotton debris, wet soil, and dry soil. Furthermore, multi-row scenes with residual films mixed across rows were captured by drones. Specific images of the sampling scheme are shown in Figure 2.
After screening, 2535 high-quality images were obtained, and the segmentation targets were annotated using software LabelMe 5.5.0, as shown in Figure 3. A total of 10,604 residual film instances were manually labeled by the same professional researcher under the supervision of agricultural image analysis experts and computer vision researchers, ensuring labeling consistency and high quality. The labeling process adhered to detailed guidelines, which addressed scenarios such as residual film overlaps, occlusion, and accurate delineation of residual film boundaries. To maintain labeling consistency, regular proofreading and expert reviews were conducted; label quality was evaluated by comparing the Intersection over Union (IoU) with model predictions. The annotated labels were subsequently converted to txt format for model training.
To prevent model overfitting during training, data augmentation techniques—including random brightness variation (to simulate lighting changes), angle variation (to simulate movement), and noise addition (to simulate image interference)—were applied to augment and expand the dataset, as illustrated in Figure 4. Specific expansion details are provided in Table 1. A total of 5559 images were obtained and randomly split into training, validation, and test sets at a 7:2:1 ratio, forming the Cotton Field Residual Film Dataset.
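For illustration, the random 7:2:1 partition described above could be carried out with a short Python script such as the following; the directory layout and file names are assumptions rather than the authors' released preprocessing code.

import random
import shutil
from pathlib import Path

random.seed(0)  # fixed seed so the split is reproducible

root = Path("cotton_film_dataset")          # hypothetical dataset root
images = sorted((root / "images").glob("*.jpg"))
random.shuffle(images)

n = len(images)
splits = {
    "train": images[: int(0.7 * n)],
    "val":   images[int(0.7 * n): int(0.9 * n)],
    "test":  images[int(0.9 * n):],
}

for split, files in splits.items():
    for img in files:
        lbl = root / "labels" / (img.stem + ".txt")   # YOLO-format polygon labels
        for src, sub in ((img, "images"), (lbl, "labels")):
            dst = root / split / sub
            dst.mkdir(parents=True, exist_ok=True)
            shutil.copy(src, dst / src.name)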

2.2. Test Environment Configuration and Network Parameter Settings

The experimental environment of this study was configured on a Windows 11 operating system. The hardware included an NVIDIA RTX 4070 GPU (24 GB VRAM) and an Intel® Xeon® Platinum 8352V CPU (2.10 GHz). The GPU provided efficient parallel computing capabilities for intensive object segmentation tasks, and its 24 GB VRAM supported the processing of large-scale image data. The CPU’s multi-threading capability effectively facilitated complex computational tasks in deep learning. The software environment comprised CUDA 12.1, Python 3.10, and PyTorch 2.5.1. Input images were uniformly resized to 640 × 640 pixels, a resolution selected to balance graphics memory usage and computational efficiency. Hyperparameter settings are summarized in Table 2. During model training, CUDA acceleration was enabled, and an SGD optimizer [14] (momentum = 0.937; weight decay = 0.0005) was used to stabilize gradient updates. The initial learning rate was set to 0.01, and the model fully converged after 250 training epochs. A batch size of 32 was chosen based on GPU memory capacity and training stability.
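For reference, a training invocation consistent with these hyperparameters could look like the following Python sketch. It uses the public Ultralytics API with the baseline YOLO11n-seg weights; the dataset YAML path is an assumption, and the actual RSE-YOLO-Seg architecture definition from this study is not reproduced here.

from ultralytics import YOLO

# Baseline YOLO11n-seg shown for illustration; the modified RSE-YOLO-Seg
# architecture would be loaded from its own model YAML in the authors' code.
model = YOLO("yolo11n-seg.pt")

model.train(
    data="cotton_residual_film.yaml",  # hypothetical dataset configuration file
    imgsz=640,                         # images resized to 640 x 640
    epochs=250,
    batch=32,
    optimizer="SGD",
    lr0=0.01,                          # initial learning rate
    momentum=0.937,
    weight_decay=0.0005,
    device=0,                          # first GPU
)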
Meanwhile, to validate the RSE-YOLO-Seg model’s detection performance on edge devices, the model was deployed on two edge computing devices: Jetson Nano Orin (NVIDIA, Hunan, China; 8 GB) and Jetson Nano B01 (NVIDIA, Taiwan, China; 4 GB). These devices ran Ubuntu 22.04, with their acceleration environment configured using CUDA 12.6 and cuDNN 9.3 to ensure stable model acceleration. The TensorRT inference library was further employed to maximize inference throughput and achieve low latency [15]. Detailed configuration of the deployment-side environment is provided in Table 3.
To ensure the comparability of results, all models were trained and tested on the same dataset and under the same experimental conditions.

2.3. Model Evaluation Metrics

In this study, two types of outputs were evaluated: bounding box and mask predictions. The following metrics were selected: Bounding Box Precision (P(b)), Recall (R(b)), and mean Average Precision (mAP50(b)) to assess bounding box detection accuracy; Mask Precision (P(m)), Recall (R(m)), and mean Average Precision (mAP50(m)) to measure pixel-level segmentation accuracy and completeness; and the number of parameters and floating-point operations (FLOPs) to evaluate model complexity. Additionally, inference time (Latency) was used to gauge the model’s real-time performance.
The results of residual film detection are primarily categorized into four types: True Positive (TP), False Positive (FP), True Negative (TN), and False Negative (FN). Precision (P) is defined as the proportion of true residual film instances among all samples predicted as positive by the model, which reflects the accuracy of the model’s positive predictions. Recall (R) is the proportion of correctly predicted positive samples among all actual positive samples, representing the model’s ability to capture positive instances.
$P = \dfrac{TP}{TP + FP}$
$R = \dfrac{TP}{TP + FN}$
Mean Average Precision (mAP) evaluates a model’s detection performance across all categories. Average Precision (AP) refers to the area under the precision-recall curve, serving as a comprehensive metric for assessing the detection performance of a specific category. mAP50 denotes the AP value when the Intersection over Union (IoU) threshold is set to 0.5. Given that this study involves the detection of only one category (total number of categories, k = 1), mAP is numerically equivalent to AP.
$AP = \int_0^1 P(R)\,dR$
$mAP = \dfrac{1}{k}\sum_{i=1}^{k} AP_i$
Furthermore, the parameter count denotes the number of parameters in the model, reflecting its size; FLOPs indicate the number of floating-point operations, representing the computational load of the model; and inference time refers to the time required to detect an image, reflecting the detection speed of the model.
$FLOPs = O\left(\sum_{i=1}^{n} K_i^2 \times C_{i-1} \times C_i + \sum_{i=1}^{n} M_i \times C_i\right)$
$FPS = \dfrac{1000}{t_{preprocess} + t_{inference} + t_{NMS}}$
Latency is primarily influenced by environmental configuration and hardware performance, and thus cannot be calculated via an explicit formula [16]. Instead, it requires actual deployment and inference operations, which can be conducted using either a GPU or a CPU. Generally, GPU-based inference is significantly faster than CPU-based inference, owing to the parallel processing capability of GPUs and their extensive optimization for accelerating deep learning computations. Therefore, CPU-based inference can reflect a model’s inference speed under hardware-constrained conditions. In this study, Latency data were derived from CPU inference on the training side and actual inference time of the Jetson’s GPU.
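As an illustration of how these metrics follow from the definitions above, the short Python sketch below computes precision, recall, and AP from a toy list of ranked detections using simple trapezoidal integration of the precision-recall curve; it is not the exact (COCO-style interpolated) evaluation code used in the experiments.

import numpy as np

def precision_recall_ap(scores, is_tp, num_gt):
    # scores: confidence of each detection
    # is_tp:  1 if the detection matches a ground-truth film (IoU >= 0.5), else 0
    # num_gt: total number of ground-truth residual film instances
    order = np.argsort(-np.asarray(scores))
    tp = np.asarray(is_tp, dtype=float)[order]
    fp = 1.0 - tp

    cum_tp, cum_fp = np.cumsum(tp), np.cumsum(fp)
    recall = cum_tp / num_gt                    # R = TP / (TP + FN)
    precision = cum_tp / (cum_tp + cum_fp)      # P = TP / (TP + FP)

    # AP: area under the precision-recall curve (simple trapezoidal estimate)
    ap = np.trapz(np.concatenate(([1.0], precision)),
                  np.concatenate(([0.0], recall)))
    return precision, recall, ap

# Toy example: five detections evaluated against four labelled films
p, r, ap = precision_recall_ap([0.9, 0.8, 0.7, 0.6, 0.5], [1, 1, 0, 1, 0], num_gt=4)
print(f"P = {p[-1]:.2f}, R = {r[-1]:.2f}, AP50 ≈ {ap:.2f}")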

2.4. Residual Film Segmentation Method

Traditional segmentation models (e.g., Mask R-CNN, DeepLabV3+) often demand extensive computational resources, limiting their practicality for deployment in resource-constrained scenarios. Given that this study focuses on real-time segmentation in cotton fields, YOLO-series models were selected for their unified architecture and superior inference speed. Specifically, YOLO11-Seg is an efficient instance segmentation model within the YOLO series; it achieves pixel-level target segmentation while preserving real-time detection capabilities, rendering it suitable for diverse real-time computer vision tasks [17]. Its network structure is primarily divided into four components: input, backbone network, neck network, and segmentation head. The input refers to the residual film images to be detected, which undergo size normalization before being fed into the detection network. The backbone network serves as the model’s primary feature extractor, convolving the input images into features of different sizes at the feature level. The neck network’s main function is to integrate the features extracted by the backbone network and pass them to the segmentation head, enabling the identification of small, medium, and large-sized targets and displaying the contours of detected residual films through generated masks.
To address the issues of multi-scale missed detection, false detection, and incomplete boundary segmentation of residual film, this paper proposes an RSE-YOLO-Seg segmentation model. Firstly, considering the multi-scale characteristics of residual film, a plug-and-play PKI module with adaptable receptive fields is integrated into the C3K2 module of the backbone network; this enhancement improves the C3K2 module’s capability to capture features across different scales and hierarchies [18]. Combined with the SegNext attention mechanism (which employs multi-scale convolutional kernels), the model effectively captures features of residual film with significant scale variations. Subsequently, the standard convolutional layers in the network are replaced with RFCAConv (which incorporates receptive field attention). This modification emphasizes the spatial characteristics of the receptive field, enabling differentiated processing of regions and sizes of varying types [19], thereby significantly enhancing the model’s ability to capture and utilize residual film information. Furthermore, an efficient detection head is proposed to reduce model parameters while improving the extraction of detailed residual film features. Finally, a novel loss function named NM-IoU is introduced: it combines NWD loss (which focuses on small object detection) and MPD loss (which addresses occlusion and overlap issues) with an optimal fusion weight ratio of 0.5 (determined experimentally). By integrating the advantages of both loss functions, the NM-IoU effectively mitigates missed detection of fragmented residual film and improves segmentation accuracy. Experimental results demonstrate that these enhancements significantly improve the recognition accuracy and recall rate of residual film at various scales under complex field conditions. The architecture of the proposed RSE-YOLO-Seg model is illustrated in Figure 5.

2.4.1. RFCAConv

Convolution operations, as the core component of convolutional neural networks, effectively extract feature information from images through sliding windows and shared parameters. However, they also suffer from issues such as large model parameters and high computational costs. Spatial attention mechanisms, as an important attention technique, focus on the spatial dimension of images—i.e., the correlations between pixels—to assign different attention weights to different regions during model training and learning. This enables the model to establish the priority of effective information more efficiently and focus on key information. The combination of the two enables the model to efficiently extract and process image information from convolutional networks.
Existing spatial attention mechanisms determine weights solely based on spatial feature recognition of input feature maps, failing to account for the spatial characteristics of receptive fields. They apply the same processing to receptive fields of different regions and sizes, limiting the model’s ability to capture and utilize information in images. Additionally, they face challenges in effectively sharing parameters of large-scale convolutional kernels. The Receptive Field Attention (RFA) mechanism addresses the limitations of existing spatial attention [20] by shifting the focus to the spatial features of the receptive field, emphasizing the importance of various features within the receptive field, and enabling more flexible adjustment of convolution kernel parameters based on those spatial features, thereby resolving the issue of parameter sharing for convolution kernels. Coordinate Attention (CA) [21] is a mechanism that integrates spatial position information into channel attention. By incorporating CA into the spatial features of the receptive field, we obtain Receptive Field Coordinate Attention Convolution (RFCAConv); its network framework is shown in Figure 6. By matching the spatial attention derived from the receptive field’s spatial features with convolution, RFCAConv replaces standard convolution and effectively mitigates the issue of convolution parameter sharing [22]. Additionally, it partially considers long-range information, enhancing convolution performance.
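The sketch below gives a simplified PyTorch implementation of this idea, under the assumption that the receptive-field features are generated by a grouped k × k convolution and re-weighted with coordinate-attention-style pooling. It follows the published RFAConv/CA formulation rather than the exact code used in this study, and all class and parameter names are illustrative.

import torch
import torch.nn as nn

class RFCAConvSketch(nn.Module):
    """Simplified receptive-field coordinate attention convolution.

    Each k x k receptive field is expanded into explicit features, re-weighted
    with coordinate attention (separate pooling along height and width), and
    then aggregated by a convolution of kernel size k and stride k.
    """

    def __init__(self, c_in, c_out, k=3, reduction=16):
        super().__init__()
        self.k = k
        # expand every spatial position into its k*k receptive-field features
        self.generate = nn.Sequential(
            nn.Conv2d(c_in, c_in * k * k, k, padding=k // 2, groups=c_in, bias=False),
            nn.BatchNorm2d(c_in * k * k),
            nn.ReLU(inplace=True),
        )
        mid = max(8, c_in // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along the width axis
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along the height axis
        self.squeeze = nn.Sequential(nn.Conv2d(c_in, mid, 1), nn.ReLU(inplace=True))
        self.attn_h = nn.Conv2d(mid, c_in, 1)
        self.attn_w = nn.Conv2d(mid, c_in, 1)
        self.conv_out = nn.Conv2d(c_in, c_out, k, stride=k)  # aggregate the field

    def forward(self, x):
        b, c, h, w = x.shape
        rf = self.generate(x).view(b, c, self.k, self.k, h, w)
        # lay the k x k receptive field out spatially: (b, c, h*k, w*k)
        rf = rf.permute(0, 1, 4, 2, 5, 3).reshape(b, c, h * self.k, w * self.k)
        y_h = self.pool_h(rf)                                 # (b, c, h*k, 1)
        y_w = self.pool_w(rf).permute(0, 1, 3, 2)             # (b, c, w*k, 1)
        y = self.squeeze(torch.cat([y_h, y_w], dim=2))
        y_h, y_w = torch.split(y, [h * self.k, w * self.k], dim=2)
        a_h = torch.sigmoid(self.attn_h(y_h))                 # (b, c, h*k, 1)
        a_w = torch.sigmoid(self.attn_w(y_w.permute(0, 1, 3, 2)))  # (b, c, 1, w*k)
        return self.conv_out(rf * a_h * a_w)                  # (b, c_out, h, w)

# quick shape check
print(RFCAConvSketch(32, 64)(torch.randn(1, 32, 40, 40)).shape)  # -> (1, 64, 40, 40)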

2.4.2. C3K2_PKI Module

The C3k2 module is an optimized version of the traditional CSP bottleneck structure in YOLO11. Compared with the C2f module, it introduces parallel convolutional layers to replace a single convolutional layer, thereby reducing redundant computations and improving inference speed. However, it is still constrained by the receptive field size of convolutional kernels, leading to insufficient coverage when extracting features of objects with significant scale differences and resulting in the loss of effective object information.
Given the significant size variations of residual films on the ground in cotton fields, introducing a variable receptive field enables the network to capture features at different scales and levels, thus achieving a more comprehensive feature representation. Smaller receptive fields can capture details of fragmented residual films and surrounding debris, while larger receptive fields can cover large-area sheet-like residual film regions and their morphological structures. The multi-scale convolutional neural network (PKINet) is a hole-free multi-scale convolutional network specifically designed for remote sensing target detection [23], aiming to address the limitations of small convolutional kernel networks in capturing sufficient target context and the background noise introduced by large convolutional kernel networks when detecting small targets. The PKI module includes a 3 × 3 depth-wise convolution (DWConv) to capture input image information, followed by parallel multi-scale DWConv modules (5 × 5, 7 × 7, 9 × 9, 11 × 11) to capture contextual information of the feature maps [24]. Finally, a 1 × 1 convolution fuses features with different receptive field sizes across channels to obtain multi-scale contextual information and unifies the feature mapping channels into a multi-scale lightweight bottleneck structure, i.e., the PKI bottleneck, as shown in Figure 7. Replacing all bottleneck modules in the C3k2 module with PKI bottleneck modules forms the C3k2_PKI module. Although the C3k2_PKI module incorporates multi-scale large convolutional kernels, it controls computational complexity within an acceptable range via depth-wise convolution (DWConv) and 1 × 1 convolution for feature fusion. When combined with SegNext attention, this module introduces only a minimal increase in model size and complexity (~0.1 M parameters and ~0.1 GFLOPs), as shown in Table 4 (Section 3.1). This ensures the model still meets real-time requirements for practical deployment.
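A minimal PyTorch sketch of this bottleneck is given below, assuming depth-wise convolutions for every branch and a residual connection; the exact PKINet implementation differs in details such as normalization and channel expansion.

import torch
import torch.nn as nn

class PKIBlockSketch(nn.Module):
    """Simplified PKI bottleneck: a 3x3 depth-wise conv followed by parallel
    multi-scale depth-wise convs (5/7/9/11) whose outputs are summed and fused
    by a 1x1 conv, providing multi-scale context without dilation holes."""

    def __init__(self, channels, kernels=(5, 7, 9, 11)):
        super().__init__()
        self.pre = nn.Conv2d(channels, channels, 3, padding=1,
                             groups=channels, bias=False)    # local 3x3 DWConv
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2,
                      groups=channels, bias=False)           # k x k DWConv branch
            for k in kernels
        )
        self.fuse = nn.Sequential(                            # 1x1 channel fusion
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            nn.SiLU(inplace=True),
        )

    def forward(self, x):
        y = self.pre(x)
        y = y + sum(branch(y) for branch in self.branches)    # multi-scale context
        return x + self.fuse(y)                               # residual connection

# quick shape check
print(PKIBlockSketch(64)(torch.randn(1, 64, 80, 80)).shape)   # -> (1, 64, 80, 80)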

2.4.3. SegNext Attention

SegNext Attention is an efficient spatial-channel attention mechanism [25] that captures contextual information through multi-scale separable convolutions and is specifically designed for semantic segmentation tasks. Its core idea is to extract multi-scale features in parallel using convolutional kernels of different sizes, and finally fuse them into a dynamic attention weight map to enhance important feature regions. Its structure is shown in Figure 8. After a feature map is input, it first undergoes 5 × 5 depth-wise convolution, followed by processing through multi-scale branches. Each branch first uses a 1 × K convolution to capture horizontal patterns, then a K × 1 convolution to capture vertical patterns, where K is set to 7, 11, and 21. These three kernel sizes focus on local, medium-range, and long-range contexts, respectively. The outputs of the multi-scale branches are then fused through addition, undergo spatial attention calculation, and achieve cross-channel interaction through 1 × 1 convolution.
This multi-scale convolution kernel structure allows small kernels to capture residual film details and large kernels to capture overall shapes, automatically focusing on residual film feature regions. Additionally, 1 × K and K × 1 convolutions are equivalent to K × K convolutions but reduce the number of parameters to 2/K of the original, significantly decreasing the model’s computational load [26]. Furthermore, since plastic film is typically applied in continuous strips, and half of the residual film remaining on the ground after removal is also strip-shaped, the strip convolution combination used in SegNext_Attention is more sensitive to long, strip-shaped targets, which significantly improves the extraction of residual film features.
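The following sketch shows a simplified version of this multi-scale strip attention in PyTorch, mirroring the structure described above (a 5 × 5 depth-wise convolution, three 1 × K / K × 1 strip branches with K = 7, 11, 21, a 1 × 1 channel-mixing convolution, and element-wise re-weighting of the input); it is an illustrative re-implementation, not the original SegNeXt code.

import torch
import torch.nn as nn

class SegNextAttentionSketch(nn.Module):
    """Simplified multi-scale strip attention in the spirit of SegNeXt."""

    def __init__(self, channels, kernels=(7, 11, 21)):
        super().__init__()
        self.local = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)
        self.strips = nn.ModuleList()
        for k in kernels:
            self.strips.append(nn.Sequential(
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2),
                          groups=channels),                  # horizontal strip conv
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0),
                          groups=channels),                  # vertical strip conv
            ))
        self.mix = nn.Conv2d(channels, channels, 1)           # cross-channel interaction

    def forward(self, x):
        attn = self.local(x)
        attn = attn + sum(branch(attn) for branch in self.strips)
        attn = self.mix(attn)
        return attn * x                                       # attention re-weighting

# quick shape check
print(SegNextAttentionSketch(64)(torch.randn(1, 64, 40, 40)).shape)  # -> (1, 64, 40, 40)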

2.4.4. Segment_Efficient Head and NWD-MPD Loss Function

In object detection, head structures are typically categorized into two types. One is the fully connected head (FC-head), where features extracted from each node in the fully connected layer network are interconnected. This makes the fully connected head more spatially sensitive but results in a larger parameter volume. The other is the convolutional head (Conv-head), which has a simpler network structure and lower computational complexity compared to the fully connected head. The original model YOLO11-seg in this study adopts the convolutional head approach. As shown in Figure 9, the original detection head employs a decoupled head structure [27] with a parallel branch design. When calculating the boundary loss, features first pass through two 3 × 3 convolutional layers; for classification loss, depth-separable convolutions combined with 1 × 1 pointwise convolutions are used. Standard convolutions then separately compute the bounding box loss values and classification loss values. Although this saves significant computational costs and resources, a reduction in detection accuracy is inevitable. Therefore, it is necessary to redesign the detection head to achieve better feature detail extraction capabilities. This paper thus proposes an efficient module to replace the 3 × 3 convolutional layers in the original detection head.
The improved efficient detection head retains the decoupled head structure and parallel branch feature processing method. However, in each branch, two efficient modules are stacked to replace the two 3 × 3 convolutional layers in the original detection head, with the final output computed via a standard convolutional layer. The efficient module consists of two 3 × 3 convolutional layers. Given the low computational burden of 3 × 3 convolutions, they can also enhance nonlinearity, improving the model’s ability to represent complex functions. Consequently, while reducing network computation through efficient convolutions, the model maintains the extraction of deeper and richer image features, preserving more spatial information. This significantly enhances the model’s detection performance.
In addition, in the task of detecting residual film on the ground surface, challenges such as fragmented residual film and complex occlusions also exist. The original model uses CIoU for loss calculation, which only considers the overlap between the predicted bounding box and the actual bounding box. When detecting small residual film targets, it fails to effectively capture their overlap, leading to frequent missed detections of fragmented residual film. To improve the model’s detection rate for fragmented residual films, a novel loss function calculation method combining NWD-IoU and MPD-IoU is proposed. The specific calculation methods for these two loss functions are as follows:
(1)
NWD-IoU
As shown in Figure 10a, when the rectangular box B is translated by 3 pixels to obtain C, the IoU value between the predicted box and the ground truth box decreases from 0.53 to 0.06. This indicates that traditional loss functions are not suitable for small object detection, as IoU is more sensitive to small scales. NWD is a loss function designed to enhance small object detection [28], which calculates the similarity between bounding boxes using a metric-based approach. Specifically, bounding boxes are modeled as Gaussian distributions, and the Wasserstein distance is used to measure the similarity between the two distributions, replacing IoU. This method offers the advantage of being able to measure similarity even when two boxes have minimal overlap. Moreover, it is insensitive to target scale, making it more stable for small object detection.
For two 2D Gaussian distributions $\mu_1 = \mathcal{N}(m_1, \Sigma_1)$ and $\mu_2 = \mathcal{N}(m_2, \Sigma_2)$, the second-order Wasserstein distance between them can be defined as:
$W_2^2(\mu_1, \mu_2) = \left\| m_1 - m_2 \right\|_2^2 + \mathrm{Tr}\left( \Sigma_1 + \Sigma_2 - 2\left( \Sigma_2^{1/2} \Sigma_1 \Sigma_2^{1/2} \right)^{1/2} \right)$
In addition, for Gaussian distributions $\mathcal{N}_a$ and $\mathcal{N}_b$ modeled from bounding boxes $A = (cx_a, cy_a, w_a, h_a)$ and $B = (cx_b, cy_b, w_b, h_b)$, substitution yields:
$W_2^2(\mathcal{N}_a, \mathcal{N}_b) = \left\| \left[ cx_a, cy_a, \dfrac{w_a}{2}, \dfrac{h_a}{2} \right]^T - \left[ cx_b, cy_b, \dfrac{w_b}{2}, \dfrac{h_b}{2} \right]^T \right\|_2^2$
Finally, through exponential normalization, a new metric known as the normalized Wasserstein distance (NWD) is obtained.
$NWD(\mathcal{N}_a, \mathcal{N}_b) = \exp\left( -\dfrac{\sqrt{W_2^2(\mathcal{N}_a, \mathcal{N}_b)}}{C} \right)$
The bounding box loss function based on NWD is as follows:
$L_{NWD} = 1 - NWD(\mathcal{N}_a, \mathcal{N}_b)$
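A compact PyTorch sketch of this loss for axis-aligned (x1, y1, x2, y2) boxes is shown below; the normalization constant C is dataset-dependent, and the value used here is an assumption rather than the one adopted in the paper.

import torch

def nwd_loss(pred, target, c=12.8, eps=1e-7):
    """Normalized Wasserstein distance loss for (x1, y1, x2, y2) boxes.

    Each box is modelled as a 2D Gaussian with mean (cx, cy) and standard
    deviation (w/2, h/2); c is a dataset-dependent constant (assumed value).
    """
    def to_gauss(box):
        cx = (box[..., 0] + box[..., 2]) / 2
        cy = (box[..., 1] + box[..., 3]) / 2
        w = (box[..., 2] - box[..., 0]).clamp(min=eps)
        h = (box[..., 3] - box[..., 1]).clamp(min=eps)
        return torch.stack([cx, cy, w / 2, h / 2], dim=-1)

    d = to_gauss(pred) - to_gauss(target)
    w2 = (d ** 2).sum(dim=-1)                    # squared 2-Wasserstein distance
    nwd = torch.exp(-torch.sqrt(w2 + eps) / c)   # NWD = exp(-sqrt(W2^2) / C)
    return 1.0 - nwd                             # L_NWD = 1 - NWD

# Example: two nearly identical small boxes still yield a low loss
print(nwd_loss(torch.tensor([[10., 10., 20., 20.]]),
               torch.tensor([[11., 11., 21., 21.]])))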
(2)
MPD-IoU
MPDIoU is a new bounding box metric standard based on the minimum point distance. As shown in Figure 10b, it effectively addresses the issue where CIoU becomes ineffective when the aspect ratios of the two boxes are consistent [29]. By directly minimizing the distance between the upper-left corner and lower-right corner of the predicted box and the actual box, it more intuitively reflects the positional relationship between bounding boxes. The specific calculation formula is as follows:
$d_1^2 = \left( x_1^{prd} - x_1^{gt} \right)^2 + \left( y_1^{prd} - y_1^{gt} \right)^2$
$d_2^2 = \left( x_2^{prd} - x_2^{gt} \right)^2 + \left( y_2^{prd} - y_2^{gt} \right)^2$
$MPDIoU = IoU - \dfrac{d_1^2}{w^2 + h^2} - \dfrac{d_2^2}{w^2 + h^2}$
$L_{MPDIoU} = 1 - MPDIoU$
MPDIoU comprehensively considers the positional and size offsets between bounding boxes. It simplifies the similarity comparison between predicted boxes and ground-truth boxes during bounding box regression and converges effectively regardless of whether the predicted and ground-truth boxes overlap. Compared with CIoU, MPDIoU not only measures the similarity between predicted and ground-truth bounding boxes more accurately but also effectively prevents detection box distortion—especially when the target is partially or fully occluded [30]. A more refined box-matching mechanism improves accuracy in occluded scenarios, reduces detection box distortion caused by residual overlap or occlusion, and lowers the risk of missed detections.
By integrating the advantages of both methods, we propose an NM-IoU metric that incorporates the concept of NWD-IoU into MPD. This metric focuses on small residual film targets while considering accurate boundary segmentation for loss calculation. NWD and MPDIoU are combined at a specific weight ratio to derive NM-IoU. The final loss calculation formula of the model is shown in Equation (9), which effectively addresses issues such as detection box distortion and missed target detection caused by fragmented and overlapping residual films.
$Loss = (1 - iou\_ratio) \times \left( L_{NWD} \right)_{mean} + iou\_ratio \times \left( L_{MPDIoU} \right)_{mean}$
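Continuing the sketch above, the MPDIoU term and its weighted combination with the NWD term (NM-IoU) might be implemented as follows; img_w and img_h denote the input image size used for normalization, and the combined function reuses the nwd_loss sketch from the previous listing.

import torch

def mpdiou_loss(pred, target, img_w, img_h, eps=1e-7):
    """MPDIoU loss for (x1, y1, x2, y2) boxes: IoU penalised by the squared
    distances between the two top-left and the two bottom-right corners,
    normalised by the squared image diagonal (w^2 + h^2)."""
    ix1 = torch.max(pred[..., 0], target[..., 0])
    iy1 = torch.max(pred[..., 1], target[..., 1])
    ix2 = torch.min(pred[..., 2], target[..., 2])
    iy2 = torch.min(pred[..., 3], target[..., 3])
    inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
    area_p = (pred[..., 2] - pred[..., 0]) * (pred[..., 3] - pred[..., 1])
    area_t = (target[..., 2] - target[..., 0]) * (target[..., 3] - target[..., 1])
    iou = inter / (area_p + area_t - inter + eps)

    d1 = (pred[..., 0] - target[..., 0]) ** 2 + (pred[..., 1] - target[..., 1]) ** 2
    d2 = (pred[..., 2] - target[..., 2]) ** 2 + (pred[..., 3] - target[..., 3]) ** 2
    diag = img_w ** 2 + img_h ** 2
    return 1.0 - (iou - d1 / diag - d2 / diag)

def nm_iou_loss(pred, target, img_w, img_h, iou_ratio=0.5):
    """NM-IoU: weighted combination of the NWD and MPDIoU losses
    (iou_ratio = 0.5 was the best setting reported in this study).
    nwd_loss is the sketch shown earlier in this section."""
    return ((1 - iou_ratio) * nwd_loss(pred, target).mean()
            + iou_ratio * mpdiou_loss(pred, target, img_w, img_h).mean())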

2.5. Theoretical Area of Residual Film Obtained Based on the Tracking Network

Since the orientation, occlusion status, and posture of residual films can be affected by environmental factors such as wind or dust, the residual film detected in the first frame of the monitoring video may not be in a frontal orientation. A tracking network is used to track the movement of these residual films, and images within their bounding boxes are captured when their posture is optimal (the orthographic projection angle) to estimate the projected area of the residual film.
The DeepSORT algorithm, an improved version of the SORT algorithm, effectively addresses the issue of frequent identity (ID) switching by introducing more stable CNN-based metrics and training on large datasets [31]. The algorithm’s flowchart is shown in Figure 11a:
(1)
Residual film detection: The improved YOLO11-seg network is used to detect residual films in the video, obtaining detection boxes and mask information of residual films.
(2)
Target prediction: The Kalman filter is used to predict the position and state of the residual film in the next frame of the video.
(3)
Target matching: An improved Hungarian algorithm is used to perform optimal matching between residual films in consecutive video frames, obtaining the trajectory of residual films in the video. Trajectories of unmatched residual films are temporarily stored and continue to participate in subsequent frame prediction and matching. If a target remains unmatched for 30 consecutive frames, it is deemed a disappeared residual film and its trajectory is deleted. If the matching is successful, the results are output, parameters are updated, and residual film detection is restarted. In dynamic residual film detection, the complex environment—such as diverse residual films, occlusions, and multiple moving targets—can cause unpredictable jumps in the ID of the detected object, affecting detection results. To address this, a historical frame data recording module is introduced into the original tracker to store the historical trajectory information of residual film targets. This module indexes targets by their ID, with stored values including historical position information and frame indices. Through global variables, historical residual film information can be shared and updated across different frames. As described in the cascade matching and IoU matching steps in Figure 10, during target detection and tracking in each frame, the system checks whether the target’s ID exists in the historical frame module. If it exists, the system verifies the validity of the target ID and updates its historical information; otherwise, a new ID is reassigned [32].
Next, the area of residual film was estimated from the tracking results using a pixel-counting method. As illustrated in Figure 11b, the pixel count of each residual film was recorded during tracking, with the maximum pixel count for each film designated as the theoretical pixel count for its corresponding ID. The number of residual films was determined by counting these IDs, and the total pixel count of all residual films in the video was calculated by summing the theoretical maximum pixel counts of all IDs. This total pixel count was then converted to the actual area using a pixel-to-area calibration value, enabling the estimation of residual film quantity on the cotton field surface.
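The per-ID pixel bookkeeping described above can be summarized by the following Python sketch; the (track_id, mask_pixel_count) tuples and the cm2_per_pixel calibration factor are assumed inputs produced by the detector, the DeepSORT tracker, and the pixel-to-area calibration, respectively.

from collections import defaultdict

def estimate_total_area(tracked_frames, cm2_per_pixel):
    """Estimate residual film count and total area from tracking output.

    tracked_frames: iterable of per-frame lists of (track_id, mask_pixel_count)
                    as produced by the detector + DeepSORT tracker.
    cm2_per_pixel:  calibration factor converting one pixel to cm^2.

    For each track ID, the largest observed mask (the pose closest to an
    orthographic view) is taken as that film's theoretical pixel count.
    """
    best_pixels = defaultdict(int)
    for frame in tracked_frames:
        for track_id, pixel_count in frame:
            best_pixels[track_id] = max(best_pixels[track_id], pixel_count)

    film_count = len(best_pixels)
    total_area_cm2 = sum(best_pixels.values()) * cm2_per_pixel
    return film_count, total_area_cm2

# toy example: two films tracked over three frames (values are illustrative)
frames = [[(1, 1200), (2, 800)], [(1, 1500), (2, 950)], [(1, 1400)]]
print(estimate_total_area(frames, cm2_per_pixel=0.02))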

2.6. RSE-YOLO-Seg Model Accelerated Deployment

Although the optimized architecture of the RSE-YOLO-Seg model has enhanced its feature extraction, fusion capabilities, and detection accuracy, the increased structural complexity may reduce its inference speed when deployed on practical edge devices. To address this issue, this study employs TensorRT for model quantization and accelerated deployment [33]. First, the PyTorch-trained model was exported as a static ONNX model, converting the model (including its network architecture and weights) from the PyTorch format (.pt) to the ONNX 1.13.1 format (.onnx). The network architecture was then optimized using the onnxsim tool. Subsequently, the trtexec tool was used to build a TensorRT inference engine, converting the model into the TensorRT format (.engine) to enable high-performance inference. The trained YOLO model is saved in FP32 format by default. Although converting it to FP16 format via TensorRT may cause a slight degradation in model accuracy, it effectively reduces model size and memory consumption, thereby significantly boosting inference speed.
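A minimal Python sketch of this export pipeline is shown below, assuming the Ultralytics export API and the standard trtexec command-line tool are available on the device; the weight and engine file names are placeholders.

import subprocess
from ultralytics import YOLO

# 1. Export the trained PyTorch weights (.pt) to a static, simplified ONNX graph.
model = YOLO("rse_yolo_seg.pt")                 # placeholder for the trained weights
onnx_path = model.export(format="onnx", imgsz=640, simplify=True)

# 2. Build a TensorRT engine on the Jetson device; --fp16 enables the
#    half-precision quantization discussed above.
subprocess.run(
    ["trtexec",
     f"--onnx={onnx_path}",
     "--saveEngine=rse_yolo_seg_fp16.engine",
     "--fp16"],
    check=True,
)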

3. Experimental Results and Analysis

3.1. Comparison of the Effects of Adding Attention Mechanisms to the Backbone

The effective acquisition of residual film information constitutes a critical component in residual film detection, while attention mechanisms act as a powerful tool enabling models to capture long-term dependencies in information. Residual film remaining on the ground exhibits distinct characteristics, including substantial size variations and irregular strip-like patterns, which arise from stretching during its use or disposal. To address the challenge posed by large-scale variations in target dimensions, this study introduces the PKI Module, specifically designed to capture multi-scale texture features of residual film. Furthermore, a lightweight attention module, termed SegNext Attention, is proposed to enhance the model’s focus on strip-shaped targets by leveraging the inherent strip-like characteristics of residual film. To validate the effectiveness of the proposed approach for residual film recognition, comparative experiments were conducted. These experiments integrated common efficient attention mechanisms—including SE [34], CA [10], CBAM [35], ECA [36], MPCA, and AFGC—into the 11th layer of the backbone network. Notably, SE and CA have been previously employed by researchers in residual film detection tasks, serving as baseline attention modules for this study. The detailed experimental results are presented in Table 4. In terms of recall, the integration of the SE, CBAM, MPCA, and AFGC attention mechanisms was found to suppress the model’s recognition performance for residual films. In contrast, the attention modules that most effectively enhanced the comprehensive recognition of residual films were SegNext, ECA, and CA, with respective improvements of 1.1, 0.7, and 0.3 percentage points in overall performance metrics.
The reasons are analyzed as follows: these mechanisms are mismatched with the characteristics of the residual film task—SE ignores spatial information, while the standard square convolutional kernels of CBAM and similar mechanisms are inefficient in capturing the strip-like morphology of residual film, easily introducing background noise or missing key features. In contrast, the multi-scale strip-like convolutional kernels (1 × K, K × 1) of the SegNext attention mechanism adopted in this study align well with the morphological characteristics of residual film, enabling more precise focusing on target regions. Its synergy with the PKI module enhances multi-scale receptive field capability. Although the introduction of the proposed module results in a slight increase in parameters (from 2.83 M to 2.92 M), it only increases CPU latency by 1.7 ms, while improving the recognition accuracy of residual film from 83% to 86%. This significantly enhances the model’s residual film identification capability and reduces false detections. Furthermore, the performance improvement achieved by the SegNext attention mechanism is statistically significant: a paired t-test conducted on the mean Average Precision (mAP) values of all test-set detection results yielded a p-value < 0.05 when comparing the baseline model with the SegNext-enhanced model. This provides strong evidence that the observed 3% performance gain did not occur by chance.

3.2. Effectiveness of Improved Receptive Field Modules in the Network

To validate the effectiveness of the improved receptive field module integrated into the model and confirm whether the model redirects its focus to the spatial features of the receptive field (thereby facilitating the capture and extraction of residual film features), this study designed three control groups for generating residual film feature extraction heatmaps. These groups are as follows: (1) the original model; (2) the model with only the attention mechanism incorporated; and (3) the model combining the attention mechanism with RFCAConv. The effectiveness of the incorporated attention mechanism and receptive field module was verified using these heatmaps, and the corresponding results are presented in Figure 12. Taking Figure 12a as an example: without the attention mechanism, the model fails to recognize the small residual film feature in the lower right corner and exhibits weak sensitivity to target features—evidenced by the highlighting of non-residual film areas in the heatmap. By contrast, after integrating the receptive field module, the heatmap clearly shows that the model extracts this small feature (redder tones indicate stronger recognition capability). Other images in Figure 12 further confirm that the module enhances the model’s sensitivity to residual film features, reduces background interference, and redirects focus to residual film. This indicates the receptive field module enables the model to utilize spatial information of the receptive field, effectively improving residual film recognition accuracy.
To investigate the impact of integrating RFCAConv (Receptive Field Coordinate Attention Convolution) into the model, we further evaluated the residual film detection performance by replacing standard convolutions with RFCAConv at different layers of the model (results in Table 5). The most significant improvements were observed when RFCAConv was introduced at layers 5, 7, and 18. Analysis of the network structure revealed that these layers correspond to the backbone’s mid-level feature extraction stages, which are responsible for generating feature maps for the P4 layer (medium-scale detection head). Given that medium-sized residual film targets account for the highest proportion (≈60%) of the dataset and that these layers’ receptive fields are better suited to capturing contextual information of medium-scale targets, integrating RFCAConv at these layers more effectively enhances the model’s perception of medium-sized residual films. As shown in Table 5, compared to single-layer RFCAConv introduction, a global incorporation strategy increased FLOPs from 8.9 G to 9.4 G and parameters by ~0.6%. However, due to the model’s lightweight design (with a total of only 2.66 M parameters), it still achieves real-time detection (21 FPS) on the Jetson Orin Nano edge device without acceleration. A paired t-test comparing the mAP of the RFCAConv-enhanced model with that of the baseline model yielded a p-value < 0.05, confirming a statistically significant performance gain. Thus, the global introduction of RFCAConv achieves a better balance between accuracy and efficiency, significantly improving residual film detection performance with only minor increases in computational cost and parameters.

3.3. Comparison Test of Loss Functions

To address the challenges of small residual film detection and edge segmentation, this study proposes the use of NM-IoU, which combines two loss functions at a specific weight ratio: the NWD-IoU loss function (focused on small residual film targets) and the MPD-IoU loss function (which considers the completeness of residual film edge coverage). To determine the optimal fusion weight, the model’s performance under different weight ratios was analyzed, with the results shown in Table 6. As indicated in the table, increasing the iou_ratio tends to improve various prediction metrics, such as residual film recall (R(b)); however, setting iou_ratio to 1 exerts a slight inhibitory effect on overall performance. When iou_ratio is set to 0.5, the residual film precision (P(b), P(m)), recall (R(b), R(m)), bounding box mean average precision (mAP@50(B)), and mask mean average precision (mAP@50(M)) all achieve the highest values in the test results: precision at 86.2% and 86.3%, recall at 80.1% and 79.0%, and mAP@50 at 88.7% (bounding box) and 87.2% (mask), respectively. This weighting ratio likely reflects a balance in gradient contributions between NWD and MPD. Specifically, NWD-IoU mitigates the sensitivity of IoU to small targets via the Wasserstein distance, while MPD-IoU optimizes the regression accuracy of bounding boxes through the minimum point distance. A weight of 0.5 enables the model to emphasize these two attributes equally when handling small targets and complex occlusions, preventing either loss function from dominating the training process.
The base model adopts the CIoU bounding box loss function. To evaluate the actual performance of the proposed NM (NWD-MPD) loss function, this paper compares it with commonly used typical bounding box loss functions, including SIoU, EIoU, PIoU, Shape-IoU, MPD-IoU, and NWD-IoU. Among these, NWD-IoU is the loss function used in Reference [10] for small residual film detection, and MPD-IoU is the loss function proposed in Reference [37] for handling overlapping bounding boxes. As shown in Table 6, SIoU, PIoU, and Shape-IoU do not improve residual film detection performance, while the remaining four loss functions (EIoU, MPD-IoU, NWD-IoU, and the proposed NM loss function) all enhance the model’s ability to detect small targets. Notably, the NM loss function proposed in this paper achieves the highest precision after convergence (as shown in Figure 13). It considers both the detection of small residual film targets and the false negatives caused by detection box distortion due to residual film overlap or occlusion. Combined with the attention mechanism introduced in this paper, it significantly improves the model’s residual film detection capability.

3.4. Ablation Experiments

To verify whether the improvements in RSE-YOLO-Seg achieve the expected results, this paper conducted a series of validation experiments on several improved modules, including A-SegNext (incorporating C3k2-PKI), B-RFCAConv, C-Efficient, and D-MN (MPD-NWD-IoU). Each configuration was independently trained five times using distinct random seeds, with all other training settings held constant. The mAP50(M) served as the primary evaluation metric, and paired t-tests were performed to assess the statistical significance of performance differences between the baseline and enhanced models.
The specific results of the ablation experiment are presented in Table 7.
(1)
Firstly, Module A is a multi-scale convolutional module integrated with an attention mechanism designed for residual films; when used alone, it significantly improved residual film detection performance. Specifically, bounding box and mask precision (P(b) and P(m)) increased by 3.0 and 3.2 percentage points, respectively. Module B is a convolutional module incorporating receptive field attention; when introduced alone, it relatively evenly improved both the precision and recall of residual film detection, with the mean Average Precision (mAP) of bounding boxes and masks each increasing by 0.5 percentage points. Module C is an improved efficient detection head that reduces the model’s computational load while preserving its ability to extract deeper, richer image features. Module D is a loss function designed to address issues in residual film detection, such as missed detection of fragmented residual films, detection box distortion, and missed detections caused by overlapping residual films; when used alone, it improved mask precision (P(m)). The introduction of Modules A and B leads to a slight increase in model parameters and a corresponding extension of inference time. In contrast, Module C employs parallel-branch feature processing, which effectively reduces both the number of parameters and computational latency.
(2)
Secondly, we analyze the effects of combining modules. The combined use of Modules A and B showed a significant enhancement compared to their individual use: relative to the original model, the mean precision of bounding boxes (mAP(B)) and masks (mAP(M)) improved by 1.9 and 1.6 percentage points, respectively, leveraging the advantages of both modules. The combination of Modules A and D outperformed the baseline but saw decreases in P(b), P(m), R(m), and mAP compared to using A or D alone, indicating mutual suppression between the two modules; we hypothesize that this suppression stems from subtle misalignments in their optimization objectives. Module A enhances multi-scale feature representation, with a particular focus on amplifying responses of strip-shaped structures. In contrast, Module D focuses on optimizing geometric consistency (via MPD-IoU) and small-object similarity (via NWD-IoU). The feature priorities emphasized by the attention mechanism may not fully align with the geometric constraints enforced by the loss function, potentially leading to conflicting gradient directions during training and thus suboptimal convergence. Although combining Modules B and D improved precision (P(b), P(m)) compared to their individual use, this came at the cost of lower R(b) and R(m). Neither Module A nor B unleashed its potential when combined with D. This finding reveals the complexity of inter-module interactions and underscores the necessity of carefully co-designing attention mechanisms and loss functions in future research. Additionally, combining any two of the three modules increased both parameters and inference time.
(3)
Therefore, we incorporated the lightweight, efficient detection head Module C into each combination. Adding C to Modules A and B reduced the parameter count by 10% but led to a decrease in average mask precision. In contrast, when Modules A + D and B + D were combined with C, the parameter count decreased while detection performance (P(b), R(b), P(m), R(m), and mAP) improved significantly, indicating that introducing Module C enhances the performance of Module D. Consequently, we introduced Module D into the A + B + C combination, resulting in a marked improvement in detection performance compared to previous configurations. This effectively mitigated performance degradation caused by reduced parameter counts. The combination of Modules A, B, C, and D complements each other, achieving overall performance enhancement. Furthermore, all the improved modules yielded a statistically significant gain in mAP (p < 0.05), which confirms the effectiveness and robustness of the proposed modifications in enhancing detection performance.

3.5. Comparative Experiments on Different Residual Film Detection Models

To further validate the effectiveness of the algorithm improvements, comparative experiments were conducted with two groups: real-time instance segmentation algorithms from the same series (i.e., YOLO (v5–v12) n-series and s-series one-stage networks) and existing residual film research methods (e.g., YOLO-SPD [10], YOLO-SDI [13], DCA-YOLO11 [38]). The comparison results are presented in Table 7. During training, s-series and other comparable network models require 3–4 times more parameters than the n-series, with weight files also 3–4 times larger. However, their detection performance is comparable to that of the n-series, while detection speed is relatively lower. Thus, from the perspective of lightweight real-time deployment, the n-series was selected. Among n-series models, YOLO11n-seg outperforms others in bounding box/mask precision, recall, and mean average precision, making it the base model for this study.
To address issues like false positives, false negatives, low accuracy, and low recall in practical applications, this study proposes the RSE-YOLO-seg model. It was compared with other instance segmentation algorithms in the same series (YOLOv5n-seg, YOLOv8n-seg, YOLOv10n-seg, YOLO11n-seg, YOLO12n-seg) and three existing residual film recognition models from the literature, with results shown in Figure 14. The vertical axis of the figure represents the model’s mean average precision, and the horizontal axis represents its floating-point operations (FLOPs). Generally, FLOPs are inversely proportional to detection speed; thus, a model with smaller FLOPs and higher mean average precision performs better, which corresponds to positions in the upper left corner of the scatter plot. The improved model in this paper exhibits significant advantages in this regard. As shown in Table 8, compared with these models, RSE-YOLO-seg achieves improvements in bounding box mean average precision of 5.1, 4.5, 4.2, 3.0, 7.3, 5.5, 4.2, and 2.0 percentage points, respectively, and in mask mean average precision of 5.1, 3.7, 3.1, 2.7, 7.3, 5.4, 3.5, and 2.9 percentage points. Additionally, compared to lightweight models in the same series, its parameter count is reduced by 4%, 18%, 6%, 6%, and 4%, respectively, with a weight file size of 5.4 MB. The model achieves a single-image detection speed of 46.3 ms on a CPU, outperforming other comparative models. Further deployment experiments will be conducted on edge devices such as the Jetson Orin Nano, using TensorRT to accelerate model inference.

3.6. Comparison of Residual Film Detection Visualization Results Across Different Models

To further validate the real-time residual film detection capability of the RSE-YOLO-seg model, it was compared with four other models—YOLOv5-seg, YOLOv8-seg, YOLOv10-seg, and YOLO11-seg—based on the results in Table 6. The comparison results are shown in Figure 15, where blue indicates predicted bounding boxes and green indicates manually annotated ground-truth bounding boxes. In Scenario (a), which contains background interference such as cotton, the YOLOv5-seg model missed one residual film, while YOLOv8-seg and YOLOv10-seg falsely detected cotton as residual film. Only YOLO11-seg and the proposed RSE-YOLO-seg model correctly detected the residual film, with the latter achieving a 9% higher mean average precision than YOLO11-seg. In Scenario (b), where long strips of cotton cause interference, all models from YOLOv5-seg to YOLO11-seg misclassified the cotton strips as residual film, whereas only the proposed model successfully distinguished them. In Scenarios (c) and (d)—featuring bare ground and multi-row scenes captured by drones—YOLOv5-seg to YOLO11-seg exhibited varying degrees of missed detections. Compared to the manually annotated ground truth, YOLOv5-seg, YOLOv8-seg, YOLOv10-seg, and YOLO11-seg missed 8, 7, 7, and 5 residual films, respectively, while the proposed model missed only 1.
These results clearly demonstrate the performance advantage of the proposed model in practical detection tasks. Through visualization of residual film detection results across different models, the RSE-YOLO-seg model exhibits superior detection performance in various complex scenarios, significantly reducing both false detections and missed detections of residual film.

3.7. Edge Device Deployment Experiment

To validate the practical deployment performance of the RSE-YOLO-seg model on edge devices, this study deployed it on the Jetson Nano B01 (4 GB) and Jetson Orin Nano (8 GB). The YOLO11 and TensorRT environments were configured accordingly, and the PyTorch-trained .pt files were converted to TensorRT .engine files for accelerated inference. The testing platform is illustrated in Figure 16. Table 9 presents the inference speed of RSE-YOLO-seg, its individual improved modules, and the YOLO11-seg models (n and s variants) on the test set, before and after TensorRT conversion (quantization).
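The .pt-to-.engine conversion follows the standard Ultralytics export workflow; the minimal sketch below (file names are placeholders, and this is an illustration of the general procedure rather than the exact deployment script) exports a trained checkpoint to a TensorRT engine on the Jetson GPU and runs inference with it. Setting half=True produces the FP16 engine discussed below, while half=False keeps single precision (FP32).

```python
from ultralytics import YOLO

# Hypothetical file names; run on the Jetson device with TensorRT installed.
model = YOLO("rse_yolo_seg.pt")          # PyTorch checkpoint trained on the residual film dataset

# Export to a TensorRT engine on GPU 0; half=True yields FP16, half=False keeps FP32.
model.export(format="engine", device=0, half=True, imgsz=640)

# Load the exported engine and run accelerated inference on a test image.
trt_model = YOLO("rse_yolo_seg.engine")
results = trt_model.predict("test_image.jpg", imgsz=640)
```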
As shown in Table 9, among all improved modules, the Efficient detection head module is the primary contributor to accelerated detection speed, reducing inference time by 7.1 ms (on Jetson Nano B01) and 4.9 ms (on Jetson Orin Nano, 8 GB). This highlights its key role in enhancing model efficiency and reducing parameters. In contrast, the NM module exerts a relatively minor impact on inference time. For attention-based modules (C3K2-PKI, SegNext, RFCAConv), while they improve the model’s detection and segmentation accuracy, their additional computational overhead leads to increased inference time.
After conversion with TensorRT, the YOLO11-series models exhibit a slight decrease in accuracy, but their deployment efficiency improves. Taking inference time on the Jetson Orin Nano as an example, under the same single-precision (FP32) conditions, the converted YOLO11n-seg, YOLO11s-seg, and RSE-YOLO-seg models reduce inference time by 28.5%, 41.3%, and 30.3%, respectively. This indicates that TensorRT's acceleration is more pronounced for larger models, whereas smaller models offer less room for further speed-up.
After converting the RSE-YOLO-seg model to FP16, the average detection accuracy of residual film bounding boxes and masks decreased by 0.2 and 0.3 percentage points, respectively, while inference time was reduced by 36%. Although this half-precision quantization causes a minor loss in precision, it substantially improves inference speed and therefore offers a practical trade-off for deployment. Compared with the FP16-converted YOLO11n-seg model, the average detection accuracy of residual film bounding boxes and masks improved by 3.0 and 2.6 percentage points, respectively, at a nearly identical inference speed. Compared with the FP16-converted YOLO11s-seg model, the corresponding accuracies improved by 2.9 and 2.2 percentage points, and inference time was reduced by 13%.
Therefore, this study ultimately adopted the RSE-YOLO-seg.engine-FP16 model for deployment. On the edge devices (Jetson Nano B01 and Jetson Orin Nano), it achieves inference speeds of approximately 17 FPS and 38 FPS, respectively, derived from the measured inference times (1000 ms / 59.3 ms ≈ 17 FPS; 1000 ms / 26.3 ms ≈ 38 FPS), meeting real-time detection requirements.

3.8. Experimental Determination of Residual Film Theoretical Area Based on Tracking Network

Experiments were conducted between 8:00 and 10:00 a.m. and between 5:00 and 6:00 p.m. This timing was chosen to avoid strong illumination, which introduces shadow interference during image acquisition, and to ensure the absence of adverse weather (e.g., strong winds, dust) that could compromise data quality. Five different sampling points were randomly selected, each sampled with the four-point sampling method, and the value for each sampling point was taken as the average of five adjacent sample points, as shown in Figure 17a. In total, 100 sample plots (1.5 m × 0.75 m) were selected and divided into 20 groups. A camera was used to record videos of the residual film distribution in each sampling area at a horizontal angle, with a resolution of 3024 × 4032. After filming, residual film on the surface of the sampling area was collected (only exposed residual film was collected); the film from each sample plot was collected and packaged separately. After washing and drying, as shown in Figure 17a(i,ii), the actual residual film area in each sampling area was measured manually with the CI203 area meter and the grid paper method, taking the average of the two measurements.
Additionally, two calculation methods were applied to residual film in the sample plots:
Method 1: Randomly capture video frames of residual film in the sample plots, apply the segmentation model to identify and segment the film, and estimate the residual film area from the segmented pixel count.
Method 2: Import the sampling-area videos into the proposed RSE-YOLO-seg + DeepSORT model to identify, segment, and track each piece of residual film simultaneously. The area of each film is taken as its theoretical contour area at the frame in which its mask pixel count reaches a maximum [39]. The total residual film area in the sample plot is then obtained by summing the areas of all tracked films (identified by their unique IDs in the video).
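The per-track bookkeeping behind Method 2 is straightforward. The following minimal sketch assumes that the segmentation-plus-DeepSORT pipeline already yields per-frame lists of (track ID, mask pixel count) pairs; all variable names and the pixel-to-area scale are illustrative, not part of the released implementation.

```python
from collections import defaultdict

def estimate_total_area(frame_outputs, cm2_per_pixel):
    """frame_outputs: iterable of per-frame lists of (track_id, mask_pixel_count).
    cm2_per_pixel: ground area represented by one pixel, e.g., S / (x * y), where S is
    the real-world area covered by the image and x, y are the image's pixel dimensions."""
    max_pixels = defaultdict(int)
    for frame in frame_outputs:
        for track_id, pixel_count in frame:
            # Keep the largest mask observed for each tracked film; this maximum is
            # taken as the film's theoretical frontal (orthographic) projection.
            max_pixels[track_id] = max(max_pixels[track_id], pixel_count)
    # Plot-level estimate: sum the per-film projection areas over all track IDs.
    return sum(count * cm2_per_pixel for count in max_pixels.values())

# Illustrative input: three frames observing two tracked film pieces.
frames = [[(1, 1200), (2, 800)], [(1, 1500), (2, 950)], [(1, 1400)]]
print(estimate_total_area(frames, cm2_per_pixel=0.05))  # (1500 + 950) * 0.05 = 122.5
```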
The experimental results are shown in Figure 17b. Because residual film lies at varying orientations, a randomly selected frame may not capture a film piece in frontal view, so traditional static image-based methods tend to underestimate the actual residual film area in the sample plots. Compared with the actual residual film area, Method 1 (static-image prediction) had a mean error (ME) of 232.30 cm2 and a root mean square error (RMSE) of 251.53 cm2. In contrast, Method 2 fed the videos into the RSE-YOLO-seg instance segmentation model combined with DeepSORT tracking and took the maximum mask pixel count of each tracked film as its frontal projection area. This significantly reduced prediction errors: the ME between predicted and actual residual film areas was 142.00 cm2, the RMSE was 130.25 cm2, and the coefficient of determination (R2) improved from 0.873 to 0.968, markedly enhancing the accuracy of residual film area estimation.
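For completeness, error metrics of this kind can be computed as in the sketch below. It assumes paired lists of predicted and manually measured areas (the numbers shown are illustrative, not the experimental data) and uses common definitions of mean absolute error, RMSE, and R2, which may differ slightly from the exact formulation used in this study.

```python
import numpy as np

def area_errors(predicted, actual):
    predicted, actual = np.asarray(predicted, dtype=float), np.asarray(actual, dtype=float)
    me = float(np.mean(np.abs(actual - predicted)))            # mean (absolute) error
    rmse = float(np.sqrt(np.mean((actual - predicted) ** 2)))  # root mean square error
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    r2 = float(1.0 - ss_res / ss_tot)                          # coefficient of determination
    return me, rmse, r2

# Illustrative values only (cm^2), not the measured plot data.
predicted_areas = [1850.0, 2100.0, 1720.0, 1980.0, 2250.0]
measured_areas = [1990.0, 2230.0, 1850.0, 2120.0, 2400.0]
print(area_errors(predicted_areas, measured_areas))
```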

4. Discussion

To address the challenges of multi-scale residual film instance segmentation in cotton fields, including missed detections, false detections, and incomplete boundary segmentation, this study proposes the RSE-YOLO-Seg model. By integrating RFCAConv, C3K2-PKI, SegNext attention, an Efficient-Head, and the NM-IoU loss, the model improves the precision, recall, and mean average precision (mAP) of residual film detection.
Compared with existing methods such as the improved Faster R-CNN [9] and the combination of YOLOv7-SPD with DeepLabV3+ [10], the proposed method offers superior computational efficiency; those models are computationally expensive and cannot achieve real-time performance, which limits their practical use in the field. Although methods such as YOLO-SDI [13], FreqDyn-YOLO [40], and a SegFormer-based approach [11] have improved multi-scale film detection, they either only count the number of film pieces without supporting area estimation or estimate area solely from static images. In contrast, the proposed RSE-YOLO-Seg model achieves a good balance between detection accuracy and computational efficiency, making it suitable for real-time deployment on resource-constrained embedded devices. Furthermore, this study integrates RSE-YOLO-Seg with DeepSORT to estimate residual film area, which effectively reduces errors caused by variations in the angle and orientation of residual film in static-image detection.
However, this study has several limitations. The proposed method is primarily designed for cotton fields, and its adaptability to other crop environments remains limited. Extending its applicability to other crops would necessitate dataset expansion and model retraining. Furthermore, the model’s robustness under diverse field conditions (e.g., intense sunlight, dust, strong winds, rainfall) requires further validation. Additionally, the impact of power consumption and thermal performance on detection accuracy during long-term operation on various edge devices warrants further investigation.
Future work will focus on validating its performance under a wider range of environmental conditions to improve generalizability and robustness. Additionally, since this study only targets surface residual films, subsequent on-site sampling will quantify the proportion of residual films that are severely overlapping, occluded, buried, or missed during tracking. Based on these results, an area error correction coefficient will be incorporated to further improve the accuracy of area estimation.

5. Conclusions

To address issues in residual film detection—such as false positives, false negatives, and random errors in static image-based residual film estimation—caused by large scale variations, fragmentation, and diverse occlusion postures of residual films, we propose the RSE-YOLO-seg model: a high-precision, lightweight instance segmentation model for residual film detection. Combined with the DeepSORT tracking algorithm, this model tracks residual films and estimates their theoretical area. The main conclusions are as follows:
(1)
Firstly, to address the multi-scale characteristics of residual films, we introduced the PKI module (with a variable receptive field) into the C3K2 block of the backbone network to capture multi-scale texture features of residual films. Combined with SegNext_Attention, which extracts multi-scale features in parallel using convolutional kernels of different sizes, the model focuses more strongly on strip-shaped residual films of varying scales. Experimental results showed that the adopted SegNext attention outperformed common efficient attention mechanisms such as SE, CA, CBAM, ECA, MPCA, and AFGC, improving residual film recognition precision from 83% to 86% and thereby strengthening the model's ability to identify residual films and reducing false detections.
(2)
Secondly, to address missed detections of fragmented residual films, we replaced standard convolutions in the model with receptive field convolutions (RFCAConv) to emphasize the spatial features of the receptive field. This approach differentially processes receptive fields of different regions and sizes while effectively sharing parameters of large convolutional kernels, enhancing the model's ability to capture and exploit image information. The effectiveness of the replacement was validated with receptive-field heatmaps: after replacement, the model focused effectively on residual film features and attended even to small films that were previously overlooked, and experiments determined the optimal layers at which to apply the replacement. In addition, we designed a lightweight Efficient-Head and a new NM (NWD-MPD) loss function. The efficient detection head adopts a decoupled structure with parallel branches for feature processing; each branch stacks two efficient modules to enhance the model's capacity to represent complex functions, while efficient convolution reduces model parameters by 10% and improves inference speed. Combined with the NM loss function, residual film mask recall and average segmentation mask accuracy improved by 1.3 and 1.5 percentage points, respectively, enhancing the model's ability to identify fragmented residual films and to segment them accurately in complex scenarios. Furthermore, all improved modules achieved statistically significant mAP gains (p < 0.05), confirming the effectiveness and robustness of the proposed modifications.
(3)
Thirdly, we compared RSE-YOLO-seg with widely used detection algorithms, including real-time instance segmentation models (YOLOv5-seg, YOLOv8-seg, YOLOv10-seg, YOLOv11-seg, YOLOv12-seg) and three existing residual film detection models. Results showed that RSE-YOLO-seg outperformed these models in bounding box average precision by 5.1, 4.5, 4.2, 3.0, 7.3, 5.5, 4.2, and 2.0 percentage points, respectively, and in mask average precision by 5.1, 3.7, 3.1, 2.7, 7.3, 5.4, 3.5, and 2.9 percentage points, respectively. Additionally, its parameter count was 4–18% lower than that of lightweight models in the same series. Meanwhile, when deployed on edge devices (Jetson Nano B01, Jetson Orin Nano), the model achieves inference speeds of 17 FPS and 38 FPS, respectively, meeting real-time detection requirements.
(4)
Finally, through field residual film detection experiments, we tested 20 groups of plots across different regions. The proposed RSE-YOLO-seg, combined with DeepSORT, identifies and segments residual films in videos, tracks individual residual films, and uses the residual film area (converted from the maximum pixel count) as its theoretical contour area. Compared to the traditional method of randomly capturing residual film images, the mean error between predicted and actual areas decreased from 232.30 cm2 to 142.00 cm2, and the RMSE decreased from 251.53 cm2 to 130.25 cm2. This effectively mitigates random errors in static images of residual films under different orientations, thereby improving the accuracy of residual film area estimation.

Author Contributions

Conceptualization, H.F., X.C. and Q.Z.; methodology, H.F. and Q.X.; software, Q.X. and Q.Z.; validation, H.F., X.W. and Q.Z.; formal analysis, Q.X. and L.Y.; investigation, Q.X. and Q.Z.; resources, H.F. and X.C.; data curation, H.F.; writing—original draft preparation, Q.X.; writing—review and editing, H.F. and Q.Z.; visualization, Q.Z.; supervision, Q.Z.; project administration, H.F.; funding acquisition, Q.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This study was supported by the National Key R&D Program of China (No. 2022YFD2002403), the China Postdoctoral Science Foundation (Grant Nos. 2023M741433 and 2025M772490), the Talent Development Fund of Shihezi University in 2025 “Group Team” Aid Xinjiang Team (No. CZ002562), and the Priority Academic Program Development of Jiangsu Higher Education Institutions (Grant No. PAPD-2023-87).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Shao, L.; Gong, J.; Fan, W.; Zhang, Z.; Zhang, M. Cost Comparison between Digital Management and Traditional Management of Cotton Fields—Evidence from Cotton Fields in Xinjiang, China. Agriculture 2022, 12, 1105. [Google Scholar] [CrossRef]
  2. Zhang, X.; Shi, Y.; Yan, J.; Yang, S.; Hou, Z.; Li, H. Residual Film–Cotton Stubble–Nail Tooth Interaction Study Based on SPH-FEM Coupling in Residual Film Recycling. Agriculture 2025, 15, 1198. [Google Scholar] [CrossRef]
  3. Lakhiar, I.A.; Yan, H.; Zhang, J.; Wang, G.; Deng, S.; Bao, R.; Zhang, C.; Syed, T.; Wang, B.; Zhou, R.; et al. Plastic Pollution in Agriculture as a Threat to Food Security, the Ecosystem, and the Environment: An Overview. Agronomy 2024, 14, 548. [Google Scholar] [CrossRef]
  4. Hu, C.; Wang, X.; Chen, X.; Tang, X.; Zhao, Y.; Yan, C. Current situation and control strategies of residual film pollution in Xinjiang. Trans. Chin. Soc. Agric. Eng. 2019, 35, 223–234. [Google Scholar] [CrossRef]
  5. Wang, G.; Sun, Q.; Wei, M.; Xie, M.; Shen, T.; Liu, D. Plastic Film Residue Reshaped Protist Communities and Induced Soil Nutrient Deficiency Under Field Conditions. Agronomy 2025, 15, 419. [Google Scholar] [CrossRef]
  6. Zheng, W.; Wang, R.; Cao, Y.; Jin, N.; Feng, H.; He, J. Remote Sensing Recognition of Plastic-film-mulched Farmlands on Loess Plateau Based on Google Earth Engine. Trans. Chin. Soc. Agric. Mach. 2022, 53, 224–234. [Google Scholar] [CrossRef]
  7. Wu, X.; Liang, C.; Zhang, D.; Yu, L.; Zhang, F. Identification Method of Plastic Film Residue Basedon UAV Remote Sensing Images. Trans. Chin. Soc. Agric. Mach. 2020, 51, 189–195. [Google Scholar] [CrossRef]
  8. Zhai, Z.; Chen, X.; Qiu, F.; Meng, Q.; Wang, H.; Zhang, R. Detecting surface residual film coverage rate in pre-sowing cotton fields using pixel block and machine learning. Trans. Chin. Soc. Agric. Eng. 2022, 38, 140–147. [Google Scholar] [CrossRef]
  9. Zhang, X.; Huang, S.; Jin, W.; Yan, J.; Shi, Z.; Zhou, X.; Zhang, C. Identification Method of Agricultural Film Residue Based on Improved Faster R-CNN. J. Hunan Univ. Nat. Sci. 2021, 48, 161–168. [Google Scholar] [CrossRef]
  10. Huang, D.; Zhang, Y. Combining YOLOv7-SPD and DeeplabV3+ for Detection of Residual Film Remaining on Farmland. IEEE Access 2024, 12, 1051–1063. [Google Scholar] [CrossRef]
  11. Niu, Y.; Li, Y.; Chen, Y.; Jiang, P. Image Segmentation Method of Residual Film on Cotton Field Surface based on Improved SegFormer Model. Jisuanji Yu Xiandaihua 2023, 7, 93–98. [Google Scholar] [CrossRef]
  12. Ma, J.; Zhao, Y.; Fan, W.; Liu, J. An Improved YOLOv8 Model for Lotus Seedpod Instance Segmentation in the Lotus Pond Environment. Agronomy 2024, 14, 1325. [Google Scholar] [CrossRef]
  13. Lin, Z.; Xie, L.; Bian, Y.; Jian, Z.; Zhou, L.; Shi, M. YOLO-SDI-based detection of residual film in agricultural fields. Comput. Eng. 2025, 56, 1–12. [Google Scholar] [CrossRef]
  14. Qiu, Z.; Huang, X.; Deng, Z.; Xu, X.; Qiu, Z. PS-YOLO-seg: A Lightweight Instance Segmentation Method for Lithium Mineral Microscopic Images Based on Improved YOLOv12-seg. J. Imaging 2025, 11, 230. [Google Scholar] [CrossRef]
  15. Ji, W.; Pan, Y.; Xu, B.; Wang, J. A Real-Time Apple Targets Detection Method for Picking Robot Based on ShufflenetV2-YOLOX. Agriculture 2022, 12, 856. [Google Scholar] [CrossRef]
  16. Wu, W.; He, Z.; Li, J.; Chen, T.; Luo, Q.; Luo, Y.; Wu, W.; Zhang, Z. Instance Segmentation of Tea Garden Roads Based on an Improved YOLOv8n-seg Model. Agriculture 2024, 14, 1163. [Google Scholar] [CrossRef]
  17. Shi, H.; Liu, C.; Wu, M.; Zhang, H.; Song, H.; Sun, H.; Li, Y.; Hu, J. Real-time detection of Chinese cabbage seedlings in the field based on YOLO11-CGB. Front. Plant Sci. 2025, 16, 1558378. [Google Scholar] [CrossRef]
  18. Wu, Z.; Zhen, H.; Zhang, X.; Bai, X.; Li, X. SEMA-YOLO: Lightweight Small Object Detection in Remote Sensing Image via Shallow-Layer Enhancement and Multi-Scale Adaptation. Remote Sens. 2025, 17, 1917. [Google Scholar] [CrossRef]
  19. Wei, H.; Zhao, L.; Li, R.; Zhang, M. RFAConv-CBM-ViT: Enhanced vision transformer for metal surface defect detection. J. Supercomput. 2025, 81, 155. [Google Scholar] [CrossRef]
  20. Liang, M.; Zhang, Y.; Zhou, J.; Shi, F.; Wang, Z.; Lin, Y.; Zhang, L.; Liu, Y. Research on detection of wheat tillers in natural environment based on YOLOv8-MRF. Smart Agric. Technol. 2025, 10, 100720. [Google Scholar] [CrossRef]
  21. Zhang, T.; Zhou, J.; Liu, W.; Yue, R.; Yao, M.; Shi, J.; Hu, J. Seedling-YOLO: High-Efficiency Target Detection Algorithm for Field Broccoli Seedling Transplanting Quality Based on YOLOv7-Tiny. Agronomy 2024, 14, 931. [Google Scholar] [CrossRef]
  22. Zou, J.; Song, T.; Cao, S.; Zhou, B.; Jiang, Q. Dress Code Monitoring Method in Industrial Scene Based on Improved YOLOv8n and DeepSORT. Sensors 2024, 24, 6063. [Google Scholar] [CrossRef] [PubMed]
  23. Qi, Z.; Wang, J. PMDNet: An Improved Object Detection Model for Wheat Field Weed. Agronomy 2025, 15, 55. [Google Scholar] [CrossRef]
  24. Yang, Z.; Xu, K.; Zhao, L.; Hu, N.; Wu, J. PWDE-YOLOv8n: An Enhanced Approach for Surface Corrosion Detection in Aircraft Cabin Sections. IEEE Trans. Instrum. Meas. 2025, 74, 2504722. [Google Scholar] [CrossRef]
  25. Song, J.; Ma, B.; Xu, Y.; Yu, G.; Xiong, Y. Organ segmentation and phenotypic information extraction of cotton point clouds based on the CotSegNet network and machine learning. Comput. Electron. Agric. 2025, 236, 110466. [Google Scholar] [CrossRef]
  26. Wang, Z.; Qin, J.; Huang, C.; Zhang, Y. CGMISeg: Context-Guided Multi-Scale Interactive for Efficient Semantic Segmentation. Comput. Mater. Contin. 2025, 9, 5811–5829. [Google Scholar] [CrossRef]
  27. Yi, X.; Chen, H.; Wu, P.; Wang, G.; Mo, L.; Wu, B.; Yi, Y.; Fu, X.; Qian, P. Light-FC-YOLO: A Lightweight Method for Flower Counting Based on Enhanced Feature Fusion with a New Efficient Detection Head. Agronomy 2024, 14, 1285. [Google Scholar] [CrossRef]
  28. He, Y.; Wan, L. YOLOv7-PD: Incorporating DE-ELAN and NWD-CIoU for Advanced Pedestrian Detection Method. Inf. Technol. Control 2024, 53, 390–407. [Google Scholar] [CrossRef]
  29. Xiong, C.; Zayed, T.; Jiang, X.; Alfalah, G.; Abelkader, E. A Novel Model for Instance Segmentation and Quantification of Bridge Surface Cracks-The YOLOv8-AFPN-MPD-IoU. Sensors 2024, 24, 4288. [Google Scholar] [CrossRef]
  30. Liu, Y.; Han, X.; Zhang, H.; Liu, S.; Ma, W.; Yan, Y.; Sun, L.; Jing, L.; Wang, Y.; Wang, J. YOLOv8-MSP-PD: A Lightweight YOLOv8-Based Detection Method for Jinxiu Malus Fruit in Field Conditions. Agronomy 2025, 15, 1581. [Google Scholar] [CrossRef]
  31. Chen, S.; Liu, J.; Xu, X.; Guo, J.; Hu, S.; Zhou, Z.; Lan, Y. Detection and tracking of agricultural spray droplets using GSConv-enhanced YOLOv5s and DeepSORT. Comput. Electron. Agric. 2025, 235, 110353. [Google Scholar] [CrossRef]
  32. Zhou, L.; Yang, Z.; Fu, L.; Duan, J. Yield Estimation in Banana Orchards Based on DeepSORT and RGB-Depth Images. Agronomy 2025, 15, 1119. [Google Scholar] [CrossRef]
  33. Zhang, X.; Li, B. Tennis ball detection based on YOLOv5 with tensorrt. Sci. Rep. 2025, 15, 21011. [Google Scholar] [CrossRef]
  34. Liao, J.; He, X.; Liang, Y.; Wang, H.; Zeng, H.; Luo, X.; Li, X.; Zhang, L.; Xing, H.; Zang, Y. A Lightweight Cotton Verticillium Wilt Hazard Level Real-Time Assessment System Based on an Improved YOLOv10n Model. Agriculture 2024, 14, 1617. [Google Scholar] [CrossRef]
  35. Zhou, X.; Chen, W.; Wei, X. Improved Field Obstacle Detection Algorithm Based on YOLOv8. Agriculture 2024, 14, 2263. [Google Scholar] [CrossRef]
  36. Zhu, C.; Hao, S.; Liu, C.; Wang, Y.; Jia, X.; Xu, J.; Guo, S.; Huo, J.; Wang, W. An Efficient Computer Vision-Based Dual-Face Target Precision Variable Spraying Robotic System for Foliar Fertilisers. Agronomy 2024, 14, 2770. [Google Scholar] [CrossRef]
  37. Duan, Y.; Han, W.; Guo, P.; Wei, X. YOLOv8-GDCI: Research on the Phytophthora Blight Detection Method of Different Parts of Chili Based on Improved YOLOv8 Model. Agronomy 2024, 14, 2734. [Google Scholar] [CrossRef]
  38. Meng, Q.; Zhai, Z.; Zhang, L.; Lu, J.; Wang, H.; Zhang, R. Recognition Method of Cotton Field Surface Residual Film Based on Improved YOLO 11. Trans. Chin. Soc. Agric. Mach. 2025, 56, 17–25+48. [Google Scholar] [CrossRef]
  39. Lou, L.; Lu, H.; Song, R. Segmentation of Plant Leaves and Features Extraction Based on Muti-view and Time-series Image. Trans. Chin. Soc. Agric. Mach. 2022, 53, 253–260. [Google Scholar] [CrossRef]
  40. Zhang, M.; Zhang, J.; Peng, Y.; Wang, Y. FreqDyn-YOLO: A High-Performance Multi-Scale Feature Fusion Algorithm for Detecting Plastic Film Residues in Farmland. Sensors 2025, 25, 4888. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Data collection site and method. (a) Data collection location; (b) On-site data collection via drones.
Figure 2. Data Collection Scheme for the Residual Film Dataset. (a) Three Types of Residual Film: (i) Bare Exposed Surface; (ii) Suspended on Cotton Stalks; (iii) Distributed in Complex Inter-row Environments. (b) Residual Film in Different Environments: (i) Clean Soil; (ii) Soil with a High Content of Cotton Debris; (iii) Wet Soil; (iv) Dry Soil. (c) Drone-Captured Images: (i) Clean Soil; (ii) Soil with a High Content of Cotton Debris.
Figure 3. Example of Labelme Polygon Annotation for Residual Film.
Figure 4. Examples of Image Augmentation. (I) Original Image; (II) Enhanced Lighting; (III) Reduced Lighting; (IV) Reversed Angle; (V) Noise Interference; (VI) Random Angle.
Figure 5. RSE-YOLO-Seg Residual Film Segmentation Model. Note: √ indicates the improvement points of this model.
Figure 6. Schematic Diagram of the CoordA Attention (a) and RFCAConv Modules (b).
Figure 7. Structure Diagram of the C3K2_PKI Module. (a) PKI Module; (b) C3K2_PKI Module.
Figure 8. Schematic Diagram of the SegNext Attention Structure.
Figure 9. Efficient Segment Head. (a) Original Segment; (b) Efficient Segment.
Figure 10. Schematic diagrams of NWD-IoU and MPD-IoU. (a) The small target loss function has high sensitivity; (b) MPD Loss Function Diagram.
Figure 11. Orthographic Projection Area of Residual Film Obtained Based on Tracking Network. (a) DeepSORT Flowchart; (b) Diagram of Theoretical Orthographic Projection Area of Residual Film Obtained via Video Tracking. Note: In the formula, S refers to the actual area corresponding to the image, while x and y denote the pixel scale of the image.
Figure 12. Heatmap of Receptive Field RFCAConv Replacement. Note: (a–c) are images of residual film samples from between cotton rows, on suspended cotton stalks, and in complex off-row environments, respectively; (d) is an image of multi-row residual film samples captured by a drone.
Figure 13. Curves of Average Detection Accuracy (mAP) for Different Loss Functions. (a) Curve Diagram of mAP50(B) Detection Accuracy for Different Loss Functions; (b) Curve Diagram of mAP50(M) Detection Accuracy for Different Loss Functions.
Figure 14. Scatter Plot of Prediction Box Mean Average Precision (mAP) and Floating-Point Operations (FLOPs) for Different Models.
Figure 15. Recognition Performance of Different Models. Legend markers indicate true, missed, and misidentified detections, respectively.
Figure 16. Embedded device model test diagram. Note: 1. Test interface 2. Jetson Nano Orin device 3. Jetson Nano B01 device.
Figure 17. Field Sampling of Residual Film, Manual Measurement Tools for Residual Film Area, and Comparison of Residual Film Area Values Measured by Different Calculation Methods. (a) Schematic Diagram of Residual Film Sampling at Each Sample Point: (i) CI203 Area Meter for Measuring Residual Film Area; (ii) Measure residual film area using 5 mm × 5 mm graph paper; (b) Comparison of Residual Film Area Estimates from Static Images, Dynamic Videos, and Manual Measurements.
Table 1. Data Augmentation and Dataset Division for Cotton Field Residual Film.
Different Morphologies of Residual Film | Original Images | Enhanced Images | Training Set | Validation Set | Test Set
Exposed on the surface | 503 | 1509 | 1051 | 304 | 154
Suspended on Cotton Stalks | 498 | 1494 | 1048 | 299 | 147
In complex inter-row areas | 511 | 1533 | 1078 | 305 | 150
Multi-row mixed scenes captured by UAV | 1023 | 1023 | 715 | 204 | 104
Total | 2535 | 5559 | 3892 | 1112 | 555
Table 2. Parameter settings for the model training.
Hyperparameter | Value
Image size | 640 × 640
Epoch | 250
Batch size | 32
Learning rate | 0.01
Momentum | 0.937
Weight decay | 0.0005
Optimizer | SGD
Table 3. Jetson environment configuration.
Index | Parameters (Deploy Phase)
Operating system | Ubuntu 22.04 LTS
Accelerated environment | CUDA 12.6 + cuDNN 9.3.0
Library | PyTorch 1.12
SDK | JetPack 6.2
TensorRT version | TensorRT 10.3.0
Table 4. Comparison of Model Performance with Different Attention Mechanisms.
Model | P(b)/% | P(m)/% | R(b)/% | R(m)/% | mAP@50(B)/% | mAP@50(M)/% | Parameters | FLOPs/G | Latency (CPU)/ms
YOLO11n | 83.0 | 82.9 | 77.2 | 76.3 | 85.7 | 84.5 | 2,834,763 | 10.2 | 51.0
-SE | 82.3 | 81.9 | 76.9 | 75.6 | 85.4 | 83.8 | 2,875,723 | 10.3 | 55.4
-CA | 82.6 | 83.2 | 77.4 | 76.4 | 86.0 | 84.5 | 2,841,443 | 10.2 | 53.7
-CBAM | 83.7 | 83.4 | 76.7 | 75.6 | 86.0 | 84.4 | 2,933,421 | 10.3 | 56.2
-ECA | 83.2 | 83.7 | 77.4 | 75.7 | 86.4 | 84.9 | 2,867,541 | 10.3 | 53.6
-MPCA | 82.2 | 83.0 | 76.9 | 75.1 | 85.8 | 84.3 | 3,195,979 | 10.3 | 52.5
-AFGC | 83.3 | 82.9 | 76.6 | 75.7 | 85.7 | 83.9 | 2,900,561 | 10.2 | 56.3
-SegNext | 86.0 | 86.1 | 77.4 | 76.4 | 86.8 | 85.4 | 2,915,107 | 10.3 | 52.7
Table 5. Comparison of Effects of Adding RFCAConv Model at Different Layers.
Layers | P(b)/% | P(m)/% | R(b)/% | R(m)/% | mAP@50(B)/% | mAP@50(M)/% | Parameter | FLOPs/G
0 | 85.3 | 85.0 | 77.9 | 77.2 | 87.4 | 86.2 | 2,601,434 | 8.9
1 | 83.7 | 84.2 | 78.6 | 75.8 | 86.6 | 84.6 | 2,603,059 | 9.0
3 | 82.8 | 82.9 | 77.9 | 76.7 | 86.0 | 84.5 | 2,609,059 | 9.0
5 | 85.2 | 86.2 | 79.1 | 77.5 | 87.7 | 86.3 | 2,617,059 | 8.9
7 | 86.0 | 86.1 | 78.6 | 77.7 | 87.8 | 86.3 | 2,617,059 | 8.9
18 | 85.4 | 85.9 | 79.1 | 77.2 | 87.6 | 85.8 | 2,667,578 | 9.3
21 | 86.3 | 85.9 | 76.9 | 75.9 | 86.8 | 85.1 | 2,617,059 | 8.9
5, 7 | 86.5 | 86.8 | 78.3 | 77.2 | 87.3 | 86.2 | 2,633,083 | 8.9
5, 18 | 85.7 | 85.7 | 77.8 | 75.8 | 86.9 | 84.8 | 2,625,083 | 8.9
5, 7, 18 | 86.8 | 87.0 | 78.1 | 77.2 | 87.7 | 86.5 | 2,641,107 | 9.0
Our | 86.2 | 86.3 | 80.1 | 79.0 | 88.7 | 87.2 | 2,662,650 | 9.4
Table 6. Model Performance Under Different SN Fusion Weight Ratios and Comparison of Experimental Results Using Different Loss Functions.
IoU | P(b)/% | P(m)/% | R(b)/% | R(m)/% | mAP@50(B)/% | mAP@50(M)/%
Model performance under different SN fusion weight ratios:
iou_ratio = 0.0 | 85.0 | 85.2 | 78.0 | 76.2 | 87.2 | 85.3
iou_ratio = 0.2 | 85.7 | 85.5 | 78.4 | 76.9 | 88.1 | 86.0
iou_ratio = 0.5 | 86.2 | 86.3 | 80.1 | 79.0 | 88.7 | 87.2
iou_ratio = 0.8 | 86.1 | 85.8 | 78.7 | 77.5 | 88.3 | 86.2
iou_ratio = 1.0 | 85.8 | 86.0 | 79.7 | 78.2 | 88.4 | 86.4
Comparison of experimental results using different loss functions:
CIoU | 84.7 | 84.9 | 78.8 | 75.9 | 87.3 | 85.7
SIoU | 84.4 | 85.7 | 79.3 | 76.6 | 87.4 | 85.5
EIoU | 86.3 | 85.5 | 77.8 | 76.6 | 87.8 | 85.8
PIoU | 84.4 | 85.4 | 78.2 | 76.0 | 86.7 | 85.0
Shape | 83.5 | 83.4 | 77.8 | 76.3 | 86.0 | 84.3
MPD | 85.0 | 85.2 | 78.0 | 76.2 | 87.2 | 85.3
NWD | 85.8 | 86.0 | 79.7 | 78.2 | 88.4 | 86.4
NM | 86.2 | 86.3 | 80.1 | 79.0 | 88.7 | 87.2
Table 7. Evaluation Metrics of the RSE-YOLO-Seg Ablation Experiments. Note: “×” means that this module is not used on the baseline network YOLO11n-Seg; “√” indicates the use of this module. Precision, recall, and mAP values are reported as mean ± standard deviation (n = 5). The p-values indicate whether the mAP50(M) improvements relative to the baseline are statistically significant.
Model | SegNext | RFCAConv | Efficient | NM | P(b)/% | P(m)/% | R(b)/% | R(m)/% | mAP50(B)/% | mAP50(M)/% | Parameter/M | Latency (CPU)/ms | p-Value
Base | × | × | × | × | 83.0 ± 0.3 | 82.9 ± 0.4 | 77.2 ± 0.2 | 76.3 ± 0.5 | 85.7 ± 0.4 | 84.5 ± 0.3 | 2.83 | 51.0 | -
A | √ | × | × | × | 86.0 ± 0.6 | 86.1 ± 0.7 | 77.4 ± 0.4 | 76.4 ± 0.5 | 86.8 ± 0.3 | 85.4 ± 0.6 | 2.92 | 52.7 | 0.035
B | × | √ | × | × | 83.9 ± 0.2 | 84.2 ± 0.3 | 77.3 ± 0.5 | 76.0 ± 0.2 | 86.2 ± 0.6 | 85.0 ± 0.4 | 2.90 | 53.1 | 0.013
C | × | × | √ | × | 83.2 ± 0.5 | 83.5 ± 0.3 | 77.6 ± 0.4 | 76.5 ± 0.6 | 86.0 ± 0.2 | 85.2 ± 0.5 | 2.56 | 44.3 | 0.019
D | × | × | × | √ | 83.3 ± 0.6 | 83.5 ± 0.4 | 77.3 ± 0.3 | 76.2 ± 0.5 | 86.3 ± 0.4 | 85.1 ± 0.7 | 2.83 | 50.2 | 0.027
A + B | √ | √ | × | × | 86.7 ± 0.1 | 86.5 ± 0.3 | 78.1 ± 0.4 | 76.8 ± 0.1 | 87.6 ± 0.3 | 86.1 ± 0.2 | 2.93 | 53.5 | 0.008
A + D | √ | × | × | √ | 85.6 ± 0.2 | 85.8 ± 0.1 | 77.7 ± 0.4 | 76.2 ± 0.5 | 86.7 ± 0.3 | 85.2 ± 0.4 | 2.92 | 52.8 | 0.018
B + D | × | √ | × | √ | 84.9 ± 0.6 | 84.9 ± 0.4 | 76.9 ± 0.3 | 75.8 ± 0.5 | 86.3 ± 0.4 | 84.7 ± 0.6 | 2.90 | 53.4 | 0.025
A + B + C | √ | √ | √ | × | 84.7 ± 0.3 | 84.9 ± 0.4 | 78.8 ± 0.2 | 75.9 ± 0.5 | 87.3 ± 0.3 | 85.7 ± 0.5 | 2.66 | 47.4 | 0.016
A + C + D | √ | × | √ | √ | 84.1 ± 0.5 | 85.4 ± 0.4 | 78.6 ± 0.4 | 76.7 ± 0.6 | 87.5 ± 0.4 | 85.9 ± 0.3 | 2.59 | 45.6 | 0.022
B + C + D | × | √ | √ | √ | 86.0 ± 0.4 | 86.3 ± 0.7 | 79.2 ± 0.5 | 77.9 ± 0.3 | 88.2 ± 0.6 | 86.8 ± 0.5 | 2.57 | 45.3 | 0.024
A + B + C + D | √ | √ | √ | √ | 86.2 ± 0.4 | 86.3 ± 0.2 | 80.1 ± 0.1 | 79.0 ± 0.3 | 88.7 ± 0.3 | 87.2 ± 0.2 | 2.66 | 46.3 | 0.011
Table 8. Comparison of Detection Results Among Different Models.
Model | P(b)/% | P(m)/% | R(b)/% | R(m)/% | mAP@50(B)/% | mAP@50(M)/% | Parameter/M | FLOPs/G | Weight/MB | Latency (CPU)/ms
YOLOv5n-seg | 79.4 | 79.2 | 75.6 | 75.1 | 83.6 | 82.1 | 2.76 | 11.0 | 5.8 | 50.2
YOLOv8n-seg | 80.1 | 82.6 | 77.1 | 75.6 | 84.2 | 83.5 | 3.26 | 12.0 | 6.8 | 53.9
YOLOv10n-seg | 81.9 | 82.5 | 77.0 | 76.1 | 84.5 | 84.1 | 2.84 | 11.7 | 6.0 | 51.7
YOLO11n-seg | 83.0 | 82.9 | 77.2 | 76.3 | 85.7 | 84.5 | 2.84 | 10.2 | 6.0 | 51.0
YOLO12n-seg | 78.5 | 78.9 | 72.9 | 71.9 | 81.4 | 79.9 | 2.76 | 9.7 | 5.7 | 52.6
YOLOv5s-seg | 81.1 | 82.4 | 77.6 | 75.0 | 84.6 | 82.8 | 9.77 | 37.8 | 18.9 | 73.5
YOLOv8s-seg | 80.6 | 81.4 | 78.1 | 76.0 | 85.1 | 83.0 | 11.80 | 42.7 | 23.9 | 75.3
YOLOv10s-seg | 82.5 | 82.5 | 78.1 | 76.7 | 85.5 | 83.6 | 10.06 | 41.2 | 20.5 | 75.1
YOLO11s-seg | 81.3 | 83.0 | 78.8 | 76.8 | 85.7 | 84.9 | 10.07 | 35.3 | 20.5 | 74.8
YOLO12s-seg | 78.5 | 78.7 | 75.8 | 74.4 | 83.2 | 81.4 | 9.73 | 33.3 | 20.0 | 74.4
YOLO-SPD [10] | 79.2 | 78.3 | 76.4 | 75.5 | 83.2 | 81.8 | 33.52 | 96.6 | 74.8 | 74.5
YOLO-SDI [13] | 85.1 | 84.7 | 77.2 | 77.2 | 84.5 | 83.7 | 6.52 | 20.0 | 19.0 | 76.2
DCA-YOLO11 [38] | 81.9 | 80.1 | 80.9 | 78.5 | 86.7 | 84.3 | 2.20 | 8.5 | 4.9 | 47.3
Our | 86.2 | 86.3 | 80.1 | 79.0 | 88.7 | 87.2 | 2.66 | 9.4 | 5.4 | 46.3
Table 9. TensorRT Conversion Test Results. Note: The “Latency” columns report the inference time values in the format: inference time (±change relative to the YOLO11n-Seg.pt baseline model), where “+” denotes an increase and “−” denotes a decrease in inference time.
Model | Type | mAP50(B)/% | mAP50(M)/% | Latency (Jetson B01)/ms | Latency (Jetson Orin)/ms
YOLO11n + C3k2-PKI | .pt-FP32 | 86.3 | 85.2 | 92.5 (+6.3) | 44.6 (+5.3)
YOLO11n + SegNext | .pt-FP32 | 86.8 | 85.4 | 94.1 (+7.9) | 45.7 (+6.4)
YOLO11n + RFCAConv | .pt-FP32 | 86.2 | 85.0 | 95.7 (+9.5) | 47.5 (+8.2)
YOLO11n + Efficient | .pt-FP32 | 86.0 | 85.2 | 79.1 (−7.1) | 34.4 (−4.9)
YOLO11n + NM | .pt-FP32 | 86.3 | 85.1 | 86.4 (+0.2) | 40.9 (+1.6)
YOLO11n-seg | .pt-FP32 | 85.7 | 84.5 | 86.2 | 39.3
YOLO11n-seg | .engine-FP32 | 85.6 | 84.3 | 63.5 (−22.7) | 28.1 (−11.2)
YOLO11n-seg | .engine-FP16 | 85.5 | 84.3 | 57.7 (−28.5) | 25.4 (−13.9)
YOLO11s-seg | .pt-FP32 | 85.7 | 84.9 | 105.4 (+19.2) | 57.1 (+17.8)
YOLO11s-seg | .engine-FP32 | 85.7 | 84.9 | 81.2 (−5.0) | 33.5 (−5.8)
YOLO11s-seg | .engine-FP16 | 85.6 | 84.7 | 76.5 (−9.7) | 30.3 (−9.0)
RSE-YOLO-seg | .pt-FP32 | 88.7 | 87.2 | 88.1 (+1.9) | 41.2 (+1.9)
RSE-YOLO-seg | .engine-FP32 | 88.5 | 87.1 | 65.6 (−20.6) | 28.7 (−10.6)
RSE-YOLO-seg | .engine-FP16 | 88.5 | 86.9 | 59.3 (−26.9) | 26.3 (−13.0)
