FLD-Net for Floating Litter Detection in UAV Remote Sensing

Wang, Xingyue; Zhou, Bin; Ye, Xia; Wang, Lidong; Wang, Zhen

doi:10.3390/rs18050736

Open AccessArticle

FLD-Net for Floating Litter Detection in UAV Remote Sensing

by

Xingyue Wang

¹

,

Bin Zhou

^1,2,*

,

Xia Ye

¹,

Lidong Wang

¹

and

Zhen Wang

¹

School of Engineering, Hangzhou Normal University, Hangzhou 311121, China

²

Institute of Remote Sensing and Earth Science, Hangzhou Normal University, Hangzhou 311121, China

^*

Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(5), 736; https://doi.org/10.3390/rs18050736

Submission received: 24 January 2026 / Revised: 24 February 2026 / Accepted: 27 February 2026 / Published: 28 February 2026

(This article belongs to the Special Issue Remote Sensing Intelligent Interpretation in the Era of Large Models and Intelligent Agents: New Challenges, Methods and Opportunities)

Download

Browse Figures

Versions Notes

Highlights

What are the main findings?

Constructed a UAV-perspective dataset to bridge the data deficiency in surface floating object detection.
Proposed FLD-Net improves detection accuracy by integrating deformable features and anti-noise attention to tackle target deformation and interference.

What are the implication of the main findings?

The low latency and high energy efficiency on embedded platforms confirm the system’s viability for real-time UAV patrols.
Demonstrates robust noise suppression in dynamic waters, providing a reliable solution for automated environmental monitoring.

Abstract

Unmanned Aerial Vehicles provide a cost-effective solution for water environment monitoring, yet detecting floating litter remains challenging due to small target scales, complex geometries, and severe surface interferences. To bridge the data deficiency in this domain, this study introduces UAV-Flow, a multi-scenario benchmark dataset wherein small-scale targets constitute 78.9%. Building upon this foundation, we propose the Floating Litter Detection Network (FLD-Net), a lightweight, real-time detection framework tailored for edge deployment. Adopting a progressive optimization paradigm, FLD-Net integrates three cascaded enhancement modules to achieve holistic performance gains across feature extraction, cross-scale fusion, and noise suppression. Specifically, the Deformation Feature Extraction Module (DFEM) enhances backbone adaptability to small targets and non-rigid deformations; the Dynamic Cross-scale Fusion Network (DCFN) facilitates efficient cross-scale semantic fusion via content-aware upsampling and an asymmetric topology; and the Dual-domain Anti-noise Attention (DANA) mechanism achieves discriminative decoupling between target semantics and structural noise through spatial-channel interaction. Experimental results on UAV-Flow demonstrate that FLD-Net achieves an mAP₅₀ of 80.47%. Compared to the YOLOv11s baseline, it improves Recall and mAP₅₀ by 11.66% and 8.51%, respectively, with only 9.9 M parameters. Furthermore, deployment on the NVIDIA Jetson Xavier NX yields an inference latency of 14 ms and an energy efficiency of 4.80 FPS/W, confirming the system’s robustness and viability for automated pollution monitoring.

Keywords:

UAV remote sensing; small object detection; lightweight network; embedded deployment

1. Introduction

With rapid global urbanization and industrial expansion have exacerbated water pollution, transforming it into a critical global challenge. Consequently, the monitoring and governance of floating litter have garnered significant international attention [1]. Floating debris, primarily composed of vegetation, plastic fragments, and anthropogenic waste [2], tends to form high-density accumulations in river channels and coastal zones due to hydrodynamic forces. Non-degradable plastic waste, in particular, disrupts aquatic ecosystems, threatens fisheries and navigation safety, and endangers human health through bioaccumulation [3]. Estimates suggest that 23 million tons of plastic waste entered aquatic ecosystems in 2016 alone, with projections soaring to 53 million tons by 2030 without effective intervention [4]. Traditional management strategies, relying on manual patrols and vessel-based salvage, exhibit significant latency and are constrained by visual range, weather conditions, and labor costs, preventing high-frequency, all-weather coverage [5].

In this context, Unmanned Aerial Vehicle (UAV) low-altitude remote sensing has emerged as a transformative paradigm. Surpassing the limitations of satellite remote sensing, UAVs offer centimeter-level spatial resolution and deployment flexibility, enabling real-time data acquisition in complex river networks [6]. This technology has been successfully applied to wildlife tracking [7], oil spill monitoring [8], and vegetation classification [9,10]. Deploying deep learning-based object detection algorithms on UAV platforms for automated litter identification has become a focal point in environmental remote sensing [11,12,13].

However, algorithm performance relies heavily on domain-specific datasets. General-purpose datasets (e.g., COCO [14]) fail to capture the unique aerial perspectives and optical characteristics of water scenes. Recent benchmarks such as Seaclear [15] and the River Floating Debris Dataset [16] have begun to address this gap. Furthermore, large-scale datasets like FMPD [17] have enhanced sample diversity and included annotations for complex environmental factors, such as reflections. Parallel to data accumulation, standardizing monitoring protocols is crucial. Research has focused on optimizing flight parameters and imaging configurations [18,19] and balancing coverage with resolution under varying hydrological conditions [20] to ensure data consistency and scientific reproducibility.

Despite the maturity of object detection in terrestrial scenarios [21,22], domain adaptation for water surface floating litter presents unique difficulties. As illustrated in Figure 1, UAV-based detection faces a triad of challenges stemming from target characteristics, environmental interference, and edge deployment constraints [23].

(1) Small Scale and Non-Rigid Deformation: Due to the wide field of view of aerial imagery, floating litter often occupies minimal pixel areas (

A < 32^{2}

pixels), with scales fluctuating drastically with flight altitude. While feature pyramids and context enhancement strategies have been proposed to mitigate this [24,25], targets also undergo non-rigid deformation caused by water flow, partial submersion, or occlusion. During the downsampling process of Deep Convolutional Neural Networks, the fine-grained features of these irregular targets are prone to information loss, leading to significant missed detections.

(2) Complex Environmental Noise: Natural water surfaces are highly dynamic [26]. Interferences such as specular reflections, glints, and shadows of shoreline vegetation create high-frequency structural noise. These visual artifacts are easily confused with targets like white foam or plastic bags, hindering the extraction of discriminative features by traditional networks. Consequently, integrating attention mechanisms or frequency domain analysis to suppress background noise has become a mainstream approach for enhancing robustness [27].

(3) SWaP Constraints on Edge Deployment: UAV inspection tasks necessitate deployment on embedded platforms with strict Size, Weight, and Power (SWaP) constraints, requiring real-time inference (FPS

> 30

) [28]. Existing high-performance models typically incur high computational costs, making it difficult to balance accuracy and speed under low-power conditions, thereby restricting practical engineering implementation.

To address these lacunae, this paper proposes FLD-Net, a lightweight, real-time detection model designed for UAV visual perception. Furthermore, to alleviate the scarcity of high-quality annotated data, we construct UAV-Flow, the first multi-scenario dataset for floating litter from a UAV perspective. The primary contributions of this work are summarized as follows:

We construct UAV-Flow, a benchmark dataset that fills the gap for small and non-rigid targets in water environments. Covering diverse hydrological and lighting conditions, it provides critical data support for researching algorithm robustness.
We propose FLD-Net, a real-time detection framework robust against water surface interference. By integrating three novel mechanisms—DFEM, DCFN, and DANA—the model effectively overcomes the technical bottlenecks of deformation adaptation, feature loss, and background noise suppression.
We implement a high-performance edge deployment scheme. Validation on embedded platforms demonstrates the system’s real-time capability and energy efficiency, offering a viable solution for low-cost, automated intelligent water monitoring.

2. Related Work

The automated monitoring of floating litter represents an interdisciplinary synergy of computer vision, remote sensing technology, and environmental engineering. With the rapid advancement of deep learning, research in this domain has witnessed a paradigm shift from traditional image processing methods to intelligent recognition based on Convolutional Neural Networks (CNNs). This section systematically reviews existing literature from two dimensions, data benchmarks and detection algorithms, while analyzing the specific challenges encountered in UAV-based aquatic operations.

2.1. Surface Floating Debris Datasets

High-quality annotated data serves as the cornerstone for driving performance improvements in deep learning models. Although numerous marine and aquatic litter datasets have emerged in recent years, existing public repositories exhibit significant domain discrepancies and scenario limitations when supporting fine-grained detection tasks from a low-altitude UAV perspective. Aquatic litter can be categorized into three distinct types: beach litter, surface floating litter, and benthic debris [29]. This study specifically targets waste that floats or suspends on open water surfaces (including oceans, lakes, and rivers) driven by currents or wind—typically low-density, high-buoyancy items such as plastic bags, bottles, foam, and branches.

The most representative dataset, SeaClear [15], contains 8610 images of shallow water environments and serves as a critical benchmark. However, it primarily focuses on underwater scenarios where image characteristics are dominated by light attenuation and scattering, with background interference stemming largely from benthic organisms or sediment. In contrast, UAV orthophotography is dominated by surface specular reflections, where background noise consists of dynamic wave glint, spray, and shoreline vegetation reflections. Another category, such as the Floater dataset [30], targets inland floating debris but is derived mainly from shore-based surveillance cameras. The horizontal or oblique angles of shore-based monitoring result in severe inter-object occlusion and perspective distortion, failing to simulate the spatial geometric features and scale distributions inherent to the UAV orthographic view.

While some studies have attempted to construct UAV-perspective datasets, they often suffer from limited sample sizes and scenario homogeneity. The River Floating Debris dataset [16] comprises only 840 samples from a single river scene; its limited magnitude and lack of annotation for varied flight altitudes fail to capture scale variations essential for operational UAV missions. Similarly, the HAIDA Trash Dataset [31] focuses on coastal and marine floating litter, covering common items like plastic bottles and fishing nets, yet it possesses a limited scale that restricts the generalization of deep CNNs under fluctuating illumination and cluttered backgrounds. The recently proposed FMPD dataset [17] expands both data volume and scenario diversity but is largely collected under ideal conditions with favorable lighting and calm water surfaces, exhibiting a benign water bias. Real-world aquatic environments are replete with uncontrollable interference factors; consequently, models trained on such biased data may perform well on test sets but suffer a drastic decline in robustness when deployed in complex lighting and adverse hydrological conditions.

Therefore, constructing a benchmark dataset that encompasses multiple scenarios, multi-scale targets, and rich environmental noise is a prerequisite for addressing the robustness deficiencies of current algorithms. This necessity underpins the motivation for developing the UAV-Flow dataset in this study.

2.2. UAV Object Detection

UAV remote sensing imagery presents severe challenges to traditional object detection algorithms due to its high spatial resolution, wide field of view, and complex backgrounds. Unlike natural scene images, UAV aerial photography is characterized by drastic target scale variations and intricate background textures [21]. Addressing these characteristics, existing research primarily focuses on three core dimensions: small target feature enhancement, complex background suppression, and lightweight edge deployment.

Regarding small target detection, UAV perspectives result in ground targets occupying minimal pixel areas (

A < 32^{2}

pixels). Although the YOLO series [32,33,34] and its variants [35,36] optimize multi-scale feature fusion via Path Aggregation Networks (PANet), continuous downsampling operations in deep CNNs inevitably lead to feature information loss. To mitigate this, LAR-YOLOv8 [37] and MSD-YOLOn [38] attempt to improve small target recall by expanding receptive fields or reinforcing shallow feature reuse. Their primary strategies include introducing attention mechanisms to enhance weak feature capture and employing multi-scale fusion architectures to integrate cross-level information, thereby preventing small targets from being submerged in deep features. Furthermore, RFLA [39] and NWD [40] introduce label assignment strategies based on Gaussian distribution modeling and Wasserstein distance, respectively. While these approaches alleviate the sensitivity of positive/negative sample matching from a loss function perspective, the loss of spatial information during the feature extraction stage remains a physical bottleneck constraining detection accuracy.

In terms of noise resistance in complex backgrounds, UAV imagery often includes urban scenes, vegetation cover, or uneven illumination, leading to significant background and occlusion interference. Attention mechanisms are widely regarded as effective means to enhance feature discriminability. FFCA-YOLO [24] utilizes channel attention to amplify salient feature responses while suppressing invalid background channels. However, traditional global attention often relies on statistical information, making it difficult to suppress high-frequency background noise while preserving the structural details of small targets. In comparison, RTD-Net [41] and RT-DETR [42] incorporate Transformer architectures, utilizing self-attention mechanisms to model long-range dependencies and effectively distinguish targets from complex backgrounds. However, their quadratic complexity causes high latency, restricting their use in real-time tasks. Furthermore, recent advances emphasize the necessity of modeling global contexts in complex remote sensing scenes; for instance, Sun et al. [43,44] leveraged consistency reasoning and dual-stream relationship learning to concurrently capture spatial dependencies and mitigate detection ambiguities.

Regarding real-time edge deployment, with the proliferation of edge computing in UAV payloads, balancing inference latency and detection accuracy on resource-constrained devices has become a critical hurdle. Model lightweighting techniques such as pruning, quantization, and knowledge distillation have been extensively explored. For instance, Drone-YOLO [36] maintains high detection accuracy under real-time conditions through a lightweight “sandwich” fusion mechanism, while LUD-YOLO [45] significantly reduces parameters via channel pruning. Nevertheless, a trade-off between accuracy and efficiency persists in practical applications: excessive parameter compression often sacrifices the feature extraction capability for weak and small targets, increasing the miss rate; conversely, high-precision models typically incur massive Floating Point Operations (FLOPs), making it difficult to meet real-time video stream processing requirements (FPS

> 30

) within limited power budgets.

In summary, while existing methods have progressed in their respective dimensions, constructing a lightweight model that simultaneously possesses small target perception and noise resistance remains a key scientific problem to be solved in UAV remote sensing. Moreover, the proposed FLD-Net must specifically account for the additional challenges posed by dynamic water surface backgrounds.

3. UAV-Flow Dataset

As elucidated in Section 2.1, existing benchmarks for floating litter detection universally suffer from domain discrepancies, insufficient sample magnitudes, and scenario homogeneity relative to UAV remote sensing imagery. To bridge this data hiatus and provide robust training support for detection models, this study constructs UAV-Flow, the first multi-scenario benchmark dataset tailored for the UAV perspective.

3.1. Dataset Construction

To guarantee data diversity and fidelity, the dataset construction workflow, as delineated in Figure 2, comprises four distinct phases: data acquisition, screening, annotation, and partitioning.

Data acquisition was conducted using the DJI M350 RTK platform. This platform was selected for its centimeter-level positioning accuracy and high stability against wind, ensuring imaging quality under complex meteorological conditions. To build a robust sample library, the acquisition strategy emphasized spatiotemporal heterogeneity. Three typical aquatic environments were selected: urban river channels, natural lakes, and nearshore mudflats. These cover a spectrum of hydrological conditions, from static to dynamic flow and from clear to turbid water. A multi-variable flight strategy was employed, encompassing varying altitudes (15–100 m), multiple viewing angles, and diverse time slots. This approach effectively captured the visual feature variations of floating debris under different illumination intensities, shadow occlusions, and water surface reflection conditions. Over 5000 raw images were acquired. Following rigorous manual screening, images exhibiting motion blur, severe overexposure, or background-only scenes were discarded, resulting in a core dataset of 4593 high-resolution valid images.

Addressing the challenges of sparse distribution and high annotation costs for aquatic debris, we proposed a semi-supervised annotation strategy to balance efficiency and precision. First, approximately 20% of typical samples were manually annotated via the Makesense.ai [46] platform to establish a high-quality seed dataset. Subsequently, an initial detection model was trained on this seed set to generate pseudo-labels for the remaining 80% of the data using the X-AnyLabeling tool [47]. Human experts were then introduced to verify and fine-tune the automatically generated labels, focusing on correcting missed minute targets and refining boundary regression deviations. Finally, all annotations were standardized to the YOLO format and randomly partitioned into training, validation, and testing sets in a 7:2:1 ratio to ensure unbiased evaluation.

3.2. Statistical Analysis and Characteristics

The dataset contains 20,618 annotated instances covering five common categories of anthropogenic waste in natural waters: plastic bottles, plastic, express cartons, foam, and pieces of paper. The class distribution is relatively balanced. Specifically, plastic bottles—the most representative surface pollutant—account for approximately 30% of instances, effectively mitigating model bias caused by long-tail distributions.

To quantify the scale characteristics of floating litter in UAV imagery, this study strictly adheres to the definition standards of the MS COCO benchmark [14]. Targets are classified into three scale levels based on the pixel area A of the bounding box: small objects (

A < 32^{2}

pixels), medium objects (

32^{2} \leq A \leq 96^{2}

pixels), and large objects (

A > 96^{2}

pixels). Statistical analysis of UAV-Flow based on these criteria (Figure 3) reveals significant small-scale characteristics. Small objects constitute 78.9% of all annotated instances. Notably, the proportion of small objects for plastic bottles, waste paper, and courier cartons exceeds 80%. This scale distribution authentically replicates the visual challenges encountered during high-altitude UAV patrols, where fine-grained feature information is prone to loss during downsampling.

To validate the cross-domain generalization capability of the model, UAV-Flow maintains high diversity in scenario composition, comprising urban rivers (46%), natural lakes (37%), and nearshore mudflats (17%). This multi-environment coverage ensures the inclusion of diverse interference modes, ranging from static zones with strong specular reflections to dynamic zones with complex background clutter. As presented in Table 1, compared to the single-perspective limitations of Floater and FMPD, or the small sample sizes of River Floating Debris and HAIDA, UAV-Flow achieves an order-of-magnitude improvement in image quantity, small object proportion, and environmental heterogeneity. In summary, UAV-Flow provides a moderately scaled, high-quality, and challenging multi-scenario benchmark for surface floating litter detection, offering solid data support for subsequent algorithmic research and model generalization evaluation.

4. Methodology

To address the challenges of high small-target density, non-rigid geometric deformations, and water surface background noise interference inherent in UAV-based floating litter detection, this article proposes FLD-Net, a lightweight real-time detection framework optimized for edge deployment. To satisfy the stringent requirements for low power consumption and real-time inference on UAV platforms, we select YOLOv11 as the baseline model due to its superior balance between speed and accuracy.

As illustrated in Figure 4, we introduce three mechanism-enhancement modules into the backbone, neck, and pre-detection stages of the YOLOv11 architecture: the Deformable Feature Extraction Module (DFEM), the Dynamic Cross-Scale Fusion Network (DCFN), and the Dual-Domain Anti-Noise Attention (DANA). These innovative modules form a continuous enhancement chain from low-level feature acquisition to high-level semantic decision-making, achieving the preservation of deformed small-target information, efficient cross-scale feature fusion, and background noise suppression.

4.1. Deformable Feature Extraction Module

In UAV remote sensing imagery, floating litter often exhibits highly irregular, non-rigid geometric deformations due to hydrodynamic impact and partial submersion. Standard CNNs rely on fixed geometric grids for feature sampling. This rigid sampling mechanism, when processing floating objects with stochastic morphological variations, is prone to sampling points falling on background regions rather than the target body, resulting in spatial mismatch during feature extraction and loss of semantic information [48].

To surmount the physical limitations of standard convolution in geometric modeling, Dai et al. [49] first proposed Deformable Convolutional Networks (DCNs), which introduce a set of learnable 2D offsets to the standard sampling grid, endowing the convolution kernel with the ability to adaptively adjust the receptive field shape. Subsequently, Zhu et al. [50] introduced a modulation mechanism in DCNv2, enhancing the model’s ability to suppress irrelevant backgrounds by adding weight masks. Recent studies have widely demonstrated the significant advantages of DCN in visual tasks involving complex geometric transformations. Xiao et al. [51] utilized multi-scale deformable convolution alignment to effectively resolve spatial misalignment caused by dynamic motion in satellite video, confirming the robustness of DCN in feature alignment for dynamically drifting targets. Similarly, Liu et al. [52] further showed that DCN can adaptively fit the geometric boundaries of complex topological structures such as slender and curved shapes, significantly breaking through the physical limitations of traditional convolution in modeling irregular targets.

Inspired by these studies and given the non-rigid characteristics of floating debris, this research posits that introducing deformable convolution is critical to resolving feature spatial mismatch and semantic loss. To this end, we designed the DFEM, intended to replace traditional convolution units in the deep layers of the backbone network, enabling the network to actively deform to fit the target’s true geometric contour. DFEM adopts a hierarchical nested architecture as shown in Figure 5. This module follows the topological design principles of Cross Stage Partial (CSP), balancing model expressiveness and inference cost through gradient flow splitting and feature recombination.

As shown in Figure 5a, the input feature flow of DFEM is first divided into a backbone branch and a residual branch via a Split operation. The backbone branch enters the core feature extraction component, designed as a C3_DCN unit depending on network depth. C3_DCN retains the efficiency of the YOLO series’ C3 module but introduces a deformation mechanism within its stacked bottleneck layers. The residual branch connects directly to the end via a

1 \times 1

convolution. The two branches are finally merged via a Concat operation and fused by a terminal convolution layer. This design not only ensures lossless gradient flow during backpropagation but also effectively reduces video memory consumption after introducing complex operators.

The core innovation of DFEM lies in the micro-level Bottleneck_DCN unit, as shown in Figure 5b. This unit reconstructs the standard residual block by replacing the standard convolution with deformable convolution, specifically assuming the input feature map is X and the sampling grid of standard convolution is R. In Bottleneck_DCN, for each position

p_{0}

on the output feature map Y, the sampling process is reconstructed as:

Y (p_{0}) = \sum_{p_{n} \in R} w (p_{n}) \cdot X (p_{0} + p_{n} + Δ p_{n}) \cdot Δ m_{n}

(1)

where

p_{n}

denotes the fixed offset within the regular sampling grid R, and

w (p_{n})

represents the kernel weight corresponding to the sampling position. The equation introduces two key variables, the learnable offset

Δ p_{n}

and the modulation scalar

Δ m_{n}

, which are dynamically predicted and generated by the network based on the input feature X. Breaking the constraints of a regular grid, for each sampling point, the network predicts a 2D offset vector, shifting the actual sampling position to

p = p_{0} + p_{n} + Δ p_{n}

. This allows the convolution kernel to deform, enabling its receptive field to adaptively stretch, rotate, or bend to fit the actual geometric contour of the floating litter.

Δ m_{n}

is a scalar activated by Sigmoid, serving as the weight for each sampling point. If a sampling point p falls into an irrelevant background region even after offsetting, the network can suppress its contribution by setting

Δ m_{n} \to 0

. Since the new sampling position p has non-integer coordinates, the feature value

X (p)

at this sub-pixel location is computed via bilinear interpolation, ensuring gradient differentiability during backpropagation.

Through this design, Bottleneck_DCN can dynamically adjust the geometric shape of the receptive field according to the actual morphology of the floating litter, breaking the physical constraints of the rigid grid in standard convolution. Ultimately, via CSP structural feature aggregation, DFEM achieves high-fidelity feature extraction for non-rigid targets in deep networks, significantly ameliorating the issue of missed detections caused by geometric mismatch.

4.2. Dynamic Cross-Scale Fusion Network

In UAV aquatic remote sensing monitoring, floating litter exhibits cross-scale distribution characteristics due to variations in flight altitude, with a predominance of small targets. In the UAV-Flow dataset, small objects account for as high as 78.9%. Feature Pyramid Networks (FPNs) are the standard paradigm for solving such multi-scale problems [53]. However, applying generic FPN architectures directly to water surface environments faces two core bottlenecks. First, while shallow feature maps retain high-resolution spatial details, they also contain substantial high-frequency background noise composed of water ripples and glints; excessive fusion of shallow features often degrades the Signal-to-Noise Ratio (SNR) of the feature maps. Second, traditional bilinear interpolation is content-agnostic, calculating sampling points based solely on spatial distance [54]. When processing floating objects with weak textures or blurred edges, this interpolation strategy is prone to sub-pixel level feature aliasing, leading to the loss of edge information for minute targets during fusion. Consequently, this paper proposes DCFN, which reconstructs the cross-scale feature fusion path through the synergistic design of asymmetric topology pruning and content-aware upsampling.

To achieve efficient feature aggregation under limited edge computing budgets, DCFN utilizes the P3, P4, and P5 feature layers of the backbone network as inputs to construct a lightweight asymmetric multi-path fusion structure, as shown in Figure 6. Drawing on the efficient design concepts of GFPN [55] and DAMO-YOLO [56], this structure strategically prunes redundant shallow fusion paths via an asymmetric topology and strengthens the top-down flow of deep information. By utilizing DySample for the top-down fusion path, semantic information is ensured not to decay during transmission. The exclusion of extra shallow layer (P2) fusion is justified because, in aquatic aerial imagery, while shallow features retain high-resolution spatial information, they are also replete with high-frequency background noise. Excessive shallow injection would introduce significant non-target interference into deep semantics, causing minute targets to be submerged. This design not only adheres to SNR optimization principles in signal processing but also effectively reduces computational load (FLOPs), as complex operations on high-resolution shallow feature maps consume substantial computing power. Thus, DCFN theoretically achieves a dual optimization of accuracy and efficiency, making it particularly suitable for the resource constraints of edge devices.

To address the semantic ambiguity of small targets in cross-scale fusion, DCFN introduces the DySample operator to replace standard interpolation. DySample is a lightweight dynamic upsampler based on a point-sampling perspective. According to Wang et al. [57] and Liu et al. [54], standard interpolation kernels possess spatial invariance, using the same weight kernel for the entire image. However, for small targets with minimal pixel occupancy, local texture changes are drastic and sparse. Research [54] indicates that adopting a content-aware reorganization strategy—dynamically generating upsampling kernels based on the semantic content of feature maps—is key to recovering minute target details and reducing feature aliasing. Compared to kernel prediction-based methods [57], DySample is theoretically more aligned with the nature of geometric transformation and possesses lower computational complexity.

The core idea of DySample is that upsampling should not be fixed network interpolation but a resampling process dynamically generated based on feature content. Let the input feature map be X and the target upsampling factor be s, yielding output feature map Y. As shown on the right side of Figure 6, DySample is divided into two stages: sampling set generation and feature resampling. The network first uses a lightweight generator to predict a dense offset map based on the local semantic content of input feature X. The final sampling grid is defined as the superposition of the original grid G and the offset O:

S = G + O

(2)

where G represents the projection coordinates of the target feature map pixels on the original low-resolution map, and O is the adaptive offset learned by the network. Each pixel value on the output feature map Y is obtained by sampling at position S in the input map X:

Y = Sample (X, S)

(3)

Similarly, the sampling operation here is implemented via bilinear interpolation to ensure differentiability. By learning offsets, DySample attempts to recover lost spatial information, pulling sampling points back to the semantic center or edges of the target. For floating debris, this implies that even if the target is very blurry in high-level features, DySample can focus the upsampling on potential target areas by referencing surrounding contextual information, thereby reconstructing sharp feature boundaries. This mechanism theoretically guarantees semantic alignment during feature fusion, enabling deep semantics to be precisely injected into the texture of small targets in shallow layers.

4.3. Dual-Domain Anti-Noise Attention

Although DFEM and DCFN effectively improve the geometric representation and cross-scale semantic alignment of non-rigid targets, significant high-frequency structural noise specific to aquatic environments remains in the deep feature space before entering the detection head. For instance, specular reflections under strong sunlight and bright edges of dynamic ripples often exhibit feature responses highly similar to minute plastic debris. Traditional global attention mechanisms (e.g., SE [58], CBAM [59]) typically recalibrate channel or spatial weights based on statistical information like global average pooling. However, in aquatic scenarios, such blind enhancement strategies face severe inductive bias risks; they tend to enhance all high-response regions, potentially erroneously amplifying bright reflection noise and leading to a surge in false positives.

To this end, we propose the DANA module, as shown in Figure 7. DANA is not merely feature enhancement but a discriminative decouplingmechanism. By establishing an interaction constraint mechanism between spatial and channel domains, it utilizes the spatial structural priors of the target to guide the screening of channel semantics, thereby precisely stripping the semantic signal of real floating debris from background clutter.

To precisely localize target boundaries while suppressing water ripple noise, DANA designs a Spatial Orthogonal Context Branch. Unlike standard spatial attention which incurs high computational burdens from large-kernel convolutions, this branch adopts the philosophy of Coordinate Attention (CA) [60], performing 1D feature aggregation along horizontal (X) and vertical (Y) directions, respectively. For input feature F, two direction-aware feature vectors are generated:

z^{h} = {AvgPool}^{h} (F), z^{w} = {AvgPool}^{w} (F)

(4)

where superscripts h and w represent the vertical and horizontal spatial dimensions. Accordingly,

z^{h}

and

z^{w}

are formulated to capture long-range dependencies of the feature map F along the vertical and horizontal directions. This operation effectively compresses the 2D spatial structure into two 1D feature descriptors. Subsequently, these two are concatenated, transformed via convolution, and passed through nonlinear activation to generate the intermediate feature f:

f = δ (Conv ([z^{h}, z^{w}]))

(5)

Then, f is split and transformed into two spatial attention maps

A^{h}

and

A^{w}

. The final spatial attention map

M_{s}

is the broadcast product of the two:

M_{s} = A^{h} \otimes A^{w}

(6)

where ⊗ denotes element-wise broadcast multiplication. This interactive design enables spatial attention to simultaneously consider contextual information in both horizontal and vertical directions, thereby more precisely locating the non-rigid boundaries of floating litter.

To avoid the erroneous enhancement of background noise by traditional SE modules, DANA introduces the Spatial-Channel Interaction Strategy. Modulating channel descriptors using intermediate features generated by the spatial branch is the key step for noise resistance. Traditional channel descriptors are computed as

s = F_{g p} (U)

. In contrast, DANA computes a spatially weighted channel descriptor

s^{'} = f \cdot s

. This ensures that if a channel responds spatially only in noise regions, its corresponding spatial weight will be suppressed, thereby lowering the overall weight of that channel via the interaction mechanism. After Sigmoid activation, the final channel attention vector

M_{c}

is generated. The Sigmoid function maps weights to the interval

(0, 1)

rather than binarized

{0, 1}

. This means that even faint target signals are merely re-weighted rather than hard-truncated. This soft attention mechanism ensures that gradients can still flow to minute targets during backpropagation, allowing the network to progressively correct its focus on small targets during training. The final feature reconstruction Y is the result of dual weighting:

Y = F \otimes M_{s} \otimes M_{c}

(7)

Through this spatial-channel interaction mechanism, DANA achieves discriminative decoupling of target semantics and noise structures. Spatial attention

M_{s}

is responsible for suppressing the spatial locations of noise, while channel attention

M_{c}

suppresses the semantic channels where noise resides. A high response is obtained only through the dual validation of

M_{s}

and

M_{c}

, effectively avoiding the over-suppression problem caused by judgment errors in a single dimension. After DANA processing, water surface reflections and ripple interference in the feature maps are significantly suppressed, while the feature responses of minute floating debris are focused and enhanced. This provides a high-SNR decision basis for the subsequent detection head, significantly reducing the false detection rate under complex lighting conditions.

5. Experiments and Discussion

5.1. Experimental Setup

All model training and validation processes in this study were conducted on a single computational node equipped with an NVIDIA GeForce RTX 4060 Ti GPU (16 GB VRAM). The software environment was established on the PyTorch 2.0.0 deep learning framework, accelerated by CUDA 11.8, using Python 3.10.18. Input images were uniformly resized to a resolution of

640 \times 640

pixels. A Stochastic Gradient Descent (SGD) optimizer was employed with an initial learning rate of

1 \times 10^{- 2}

, a momentum factor of 0.937, and a weight decay coefficient of

5 \times 10^{- 4}

. The batch size was set to 16, and the training process spanned 200 epochs.

To comprehensively evaluate model performance, Precision (P), Recall (R), and Mean Average Precision (mAP) were adopted as core metrics, defined mathematically as follows:

P = \frac{T P}{T P + F P}, R = \frac{T P}{T P + F N}

(8)

A P = \int_{0}^{1} P (R) d R, m A P = \frac{1}{N} \sum_{i = 1}^{N} A P_{i}

(9)

where

T P

,

F P

, and

F N

denote the number of True Positives, False Positives, and False Negatives, respectively.

A P

represents the average precision for a specific class, and mAP₅₀ denotes the mean average precision at an Intersection over Union (IoU) threshold of 0.5, reflecting the model’s detection capability. Additionally, Parameters (Params), Floating Point Operations (FLOPs), and Frames Per Second (FPS) were introduced to assess the degree of lightweight design and real-time inference capability.

5.2. Ablation Studies

To validate the scientific rigor of the FLD-Net architecture and the efficacy of its core modules, a two-stage ablation analysis was conducted on the UAV-Flow dataset. Prior to finalizing the architecture, granular comparative experiments were performed regarding the insertion position of DFEM and the topology/operator selection for DCFN.

Although DCN adapts to geometric deformations, it incurs computational overhead. To investigate its efficacy at different feature levels, DFEM was deployed at low levels, high levels, and all levels of the backbone. As illustrated in Table 2, applying the DFEM specifically to the deeper P4 and P5 layers yields superior performance compared to its application across all layers, achieving an mAP₅₀ of 73.51%. This phenomenon can be attributed to the hierarchical nature of feature representation. While DFEM excels at modeling non-rigid geometric transformations of floating litter, applying it to shallow layers (P2, P3) may inadvertently amplify low-level background noise, such as surface ripples and glint, leading to unstable gradient flow. In contrast, the high-level semantic features in P4 and P5 provide a more robust global context, allowing DFEM to focus on the inherent structural deformations of targets while suppressing environmental interference. This selective integration ensures an optimal balance between feature adaptability and model parameters.

To verify the necessity of the DCFN design, we conducted a stepwise optimization starting from the standard PANet baseline. As presented in Table 3, directly introducing the GFPN structure increased the mAP₅₀ to 73.58%; however, it resulted in a parameter surge to 13.72 M, violating the lightweight principle for UAV applications. By adopting an asymmetric pruning strategy (GFPN-Scale), the parameter count was significantly reduced to 9.66 M with only a marginal regression in accuracy. The critical breakthrough occurred with the integration of the DySample operator, which elevated the mAP₅₀ to 77.69% with a negligible increase in parameters. This strongly demonstrates that the performance leap of DCFN stems not from parameter stacking, but from DySample’s pixel-level lossless reconstruction of minute target features, which effectively compensates for the information loss caused by topological pruning. Furthermore, by evaluating a DCFN variant that retains the P2 layer, we explicitly validated the necessity of its removal. Discarding the P2 layer not only reduces computational overhead but also effectively mitigates the interference introduced by shallow-level background noise.

Upon establishing the optimal internal structure, we utilized YOLOv11s as the baseline to evaluate the proposed modules on detection performance (Table 4). The baseline exhibits notable limitations in complex aquatic scenarios, where low Recall stems from severe missed detections (False Negatives, FNs) caused by minute target feature loss. Incorporating DFEM bolsters the network’s capacity to model geometric deformations, consequently elevating detection performance. DCFN effectively mitigates these FNs, increasing Recall to 75.04% and highlighting the critical role of dynamic cross-scale fusion in small object detection. Conversely, complex aquatic backgrounds induce a substantial number of false alarms (False Positives, FPs). The application of DANA effectively filters surface reflections and ripple noise via spatial-channel dual-domain filtering, achieving an mAP₅₀ of 78.95%.

The experiments demonstrate significant complementarity among the modules. DFEM ensures precise geometric alignment of deep features within the backbone, while DCFN and DANA specifically address small target omissions in the neck and complex background noise prior to the detection head, respectively. Crucially, their concurrent utilization exhibits a pronounced synergistic effect, yielding improvements that far exceed their individual contributions. Compared to the baseline, the final FLD-Net achieves an 11.66% increase in Recall and an 8.51% improvement in mAP₅₀. These results definitively validate that FLD-Net systematically resolves the core challenges of extreme scale variations and intense background interference in complex aquatic remote sensing.

5.3. Comparative Experiments

To verify the comprehensive performance of FLD-Net, we benchmarked it against mainstream general object detectors, incorporating ultra-lightweight variants (YOLOv5n [61], YOLOv8n [33]) for edge-deployment evaluation, alongside YOLO series [33,34,61] and RT-DETR [42]. We also compared UAV-optimized models (TPH-YOLOv5 [35], Drone-YOLO [36], UAV-DETR [62]) and the noise-resistant FFCA-YOLO [24]. To quantify the perceptual capability for small objects, the

{AP}_{S}

metric was introduced. All models were evaluated under identical configurations. As shown in Table 5, FLD-Net achieved the optimal overall performance with an mAP₅₀ of 80.47%, a Recall of 77.84%, and an

{AP}_{S}

of 35.6%. While ultra-lightweight models and the baseline YOLOv11s offer extremely high inference speeds, they are fundamentally limited by their feature extraction capabilities for small objects, struggling to surpass an

{AP}_{S}

of 30%. In contrast, FLD-Net improved the baseline’s mAP₅₀ by 8.51% and

{AP}_{S}

by a substantial 6.7%, maintaining an exceptional efficiency ratio with only a 0.5 M increase in parameters. Although heavy architectures like RT-DETR and UAV-DETR achieved competitive precision (

{AP}_{S}

of 35.0% and 34.9%, respectively) via global attention, their computational costs are prohibitive for real-time airborne applications. FLD-Net not only surpassed UAV-DETR in small object accuracy but did so with only 1/4 of the parameters and less than 1/6 of the computational load, achieving a real-time speed of 156.72 FPS. Furthermore, FLD-Net significantly outperformed UAV-specific models like Drone-YOLO, attributed to its explicit mechanisms for suppressing strong structural noise, which generic UAV models lack.

To evaluate generalization and explicitly assess the model’s cross-domain robustness for small objects, validation was conducted on two external datasets: HAIDA Trash Dataset (complex marine scenes) and FMPD Dataset (dynamic river scenes). As presented in Table 6, FLD-Net achieved the highest overall accuracy on HAIDA Trash with an mAP₅₀ of 73.1%, significantly outperforming all comparison models. Furthermore, it demonstrated a clear advantage in small object perception, securing the highest

{AP}_{S}

of 17.8%. The model’s superiority is even more pronounced on the FMPD dataset, which is characterized by macroscopic debris and strong ripple interference. Notably, FLD-Net achieved an exceptional

{AP}_{S}

of 41.8%, overwhelmingly dwarfing the YOLOv11s baseline (12.7%) and RT-DETR (8.2%). Specifically, contrasting the drastic recall and small-target precision decay of RT-DETR in this dynamic environment, FLD-Net maintained robust adaptability. This substantial leap in

{AP}_{S}

is fully credited to the DANA module’s effective suppression of ripple and glare noise, preventing the weak features of minute floating debris from being submerged in highly dynamic aquatic backgrounds.

5.4. Visualization Analysis

To evaluate perceptual robustness in dynamic aquatic environments, Figure 8 qualitatively compares FLD-Net with mainstream detectors across five challenging scenarios. In the visualizations, red and yellow bounding boxes indicate FNs and FPs, respectively.

Baseline models exhibit significant perceptual deficiencies under complex interference. Specifically, under dynamic water ripples (a), low illumination (b), and turbid water (d), the YOLO series and RT-DETR suffer from severe missed detections of small targets due to inadequate weak feature extraction and high-frequency noise interference. In shoreline shadow occlusion (c), low background contrast causes target features to be submerged, leading to further missed detections. Critically, under strong glare reflections (e), baselines fail to distinguish light spots from debris, generating numerous false positives. In contrast, FLD-Net demonstrates superior environmental adaptability, significantly outperforming mainstream models by effectively suppressing false positives and enhancing recall rates.

To further validate the generalization capability of FLD-Net, Figure 9 illustrates the visualization results of FLD-Net versus mainstream detectors on the FMPD dataset, which is characterized by water flow fluctuations and high-density debris accumulation. The results indicate that the YOLO series models exhibit severe missed detection phenomena when confronting densely accumulated and minute floating fragments. This is primarily attributed to the loss of fine-grained features during the downsampling process, making it difficult to decouple minute targets from complex backgrounds. Conversely, although RT-DETR utilizes global attention to capture long-range dependencies, it demonstrates excessive sensitivity when processing dynamic ripple textures, resulting in substantial FPs by misidentifying water surface glints as floating litter. In comparison, FLD-Net exhibits exceptional adaptability on the FMPD dataset, effectively handling domain discrepancies across varying hydrological environments.

5.5. Real-Time Edge Deployment and Energy Efficiency

To validate the feasibility for practical engineering and explicitly address the computational overhead introduced by advanced operators (e.g., DFEM and DySample), end-to-end deployment testing was conducted on an NVIDIA Jetson Xavier NX (16 GB) embedded platform. The environment utilized the TensorRT 8.5 inference engine with FP16 half-precision quantization.

As Table 5 shows, although FLD-Net incurs a 3.8 G FLOPs increase over the baseline, theoretical FLOPs do not always scale linearly with actual edge latency. As detailed in Table 7, under TensorRT optimization, FLD-Net achieved an inference speed of 69.6 FPS, corresponding to a single-frame latency of approximately 14.4 ms. Despite incorporating three robust enhancement modules, the actual hardware latency increased by merely 2.4 ms compared to the optimized YOLOv11s, while securing an 8.51% improvement in mAP₅₀. Furthermore, the full-load power consumption remained highly stable at 14.5 W, achieving an excellent system energy efficiency of 4.80 FPS/W. These results strongly substantiate our lightweight design premise: FLD-Net successfully balances complex deformation adaptation and noise suppression without imposing a significant computational or power burden on typical UAV flight endurance.

As illustrated in Figure 10, combined with field tests on the DJI M350 RTK platform, the end-to-end system response time—including camera acquisition, model inference, and 4G transmission—was controlled within 100 ms. This metric fully satisfies the operational requirements for closed-loop control and real-time ground station decision-making, confirming the practical value of FLD-Net as a high-efficiency edge detector.

6. Conclusions

Addressing the challenges of intelligent floating litter detection in complex aquatic environments, this article presents an end-to-end framework encompassing both a novel dataset and a specialized lightweight detector.

At the data level, we constructed UAV-Flow, a multi-scenario benchmark dataset tailored for UAV-based vision. Leveraging a semi-supervised annotation paradigm, this dataset authentically replicates visual challenges such as illumination variations, complex background clutter, and extreme scale fluctuations, providing robust data support for minute and non-rigid target detection. At the algorithmic level, we proposed FLD-Net, a real-time detection architecture. Specifically, the DFEM mitigates geometric misalignment in standard convolutions when modeling non-rigid targets. The DCFN utilizes a dynamic sampling mechanism to resolve feature aliasing and semantic ambiguity during cross-scale fusion. Furthermore, the DANA module precisely suppresses background structural noise, such as surface glint and ripples. Quantitative results demonstrate that FLD-Net achieves an optimal trade-off between accuracy and efficiency, with only 9.9 M parameters, confirming its viability for edge deployment on resource-constrained UAV payloads.

Despite these promising results, this study has certain limitations. First, regarding data distribution and generalization, UAV-Flow primarily focuses on surface-level debris; the model’s feature representation capacity for semi-submerged or heavily occluded objects remains limited due to light attenuation. Additionally, the data acquisition for UAV-Flow was constrained to a limited geographic range, meaning the model’s cross-regional generalization in globally diverse aquatic environments requires further validation. Second, regarding extreme edge deployment, while real-time inference was achieved on the Jetson Xavier NX, deploying the model on ultra-low-power microcontrollers (MCUs) may require further model compression techniques, such as INT8 quantization or structural reparameterization.

Future work will address these limitations by exploring temporal coherence and multi-modal sensing to penetrate surface reflection interference, and by investigating the spatiotemporal dynamics of aquatic pollution to better support ecological monitoring.

Author Contributions

Conceptualization, B.Z.; methodology, B.Z. and X.W.; software, X.W.; validation, X.W. and Z.W.; formal analysis, B.Z.; investigation, X.Y.; resources, L.W. and B.Z.; data curation, X.Y.; writing—original draft preparation, X.W.; writing—review and editing, B.Z., L.W. and X.Y.; visualization, X.W.; supervision, B.Z.; project administration, B.Z.; funding acquisition, B.Z. and L.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Joint Fund of Zhejiang Provincial Natural Science Foundation of China under Grant No. LGEY26E090014.

Data Availability Statement

The source code and portions of the UAV-Flow dataset have been open-sourced on GitHub (link: https://github.com/starandmoonw/FLD-Net (accessed on accessed on 26 February 2026).).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

Bhatia, S.K.; Mehariya, S.; Bhatia, R.K.; Kumar, M.; Pugazhendhi, A.; Awasthi, M.K.; Atabani, A.; Kumar, G.; Kim, W.; Seo, S.O.; et al. Wastewater based microalgal biorefinery for bioenergy production: Progress and challenges. Sci. Total Environ. 2021, 751, 141599. [Google Scholar] [CrossRef]
Van Sebille, E.; Wilcox, C.; Lebreton, L.; Maximenko, N.; Hardesty, D.P.; Van Franeker, J.A.; Eriksen, M.; Siegel, D.; Galgani, F.; Law, K.L. A global inventory of small floating plastic debris. Environ. Res. Lett. 2015, 10, 124006. [Google Scholar] [CrossRef]
Hong, S.; Lee, J.; Lim, S. Navigational threats by derelict fishing gear to navy ships in the Korean seas. Mar. Pollut. Bull. 2017, 119, 100–105. [Google Scholar] [CrossRef] [PubMed]
Borrelle, S.B.; Ringma, J.; Law, K.L.; Monnahan, C.C.; Lebreton, L.; McGivern, A.; Murphy, E.; Jambeck, J.; Leonard, G.H.; Hilleary, M.A.; et al. Predicted growth in plastic waste exceeds efforts to mitigate plastic pollution. Science 2020, 369, 1515–1518. [Google Scholar] [CrossRef] [PubMed]
Di, J.; Xi, K.; Yang, Y. An enhanced YOLOv8 model for accurate detection of solid floating waste. Sci. Rep. 2025, 15, 25015. [Google Scholar] [CrossRef]
Beatty, D.S.; Aoki, L.R.; Graham, O.J.; Yang, B. The future is big—and small: Remote sensing enables cross-scale comparisons of microbiome dynamics and ecological consequences. Msystems 2021, 6, e01106-21. [Google Scholar] [CrossRef]
Lee, H.; Byeon, S.; Kim, J.H.; Shin, J.K.; Park, Y. Construction of a Real-Time Detection for Floating Plastics in a Stream Using Video Cameras and Deep Learning. Sensors 2025, 25, 2225. [Google Scholar] [CrossRef]
He, L.; Zhou, Y.; Yang, H.; Su, L.; Ma, J. A deep learning-based method for marine oil spill detection and its application in UAV imagery. Mar. Pollut. Bull. 2026, 222, 118889. [Google Scholar] [CrossRef]
Chang, B.; Li, F.; Hu, Y.; Yin, H.; Feng, Z.; Zhao, L. Application of UAV remote sensing for vegetation identification: A review and meta-analysis. Front. Plant Sci. 2025, 16, 1452053. [Google Scholar] [CrossRef]
Cao, Y.; Lei, R. Progress and prospect of forest fire monitoring based on the multi-source remote sensing data. Natl. Remote Sens. Bull. 2024, 28, 1854–1869. [Google Scholar] [CrossRef]
Wang, J.; Zhao, H. Improved yolov8 algorithm for water surface object detection. Sensors 2024, 24, 5059. [Google Scholar] [CrossRef] [PubMed]
Yang, Z.; Yu, X.; Dedman, S.; Rosso, M.; Zhu, J.; Yang, J.; Xia, Y.; Tian, Y.; Zhang, G.; Wang, J. UAV remote sensing applications in marine monitoring: Knowledge visualization and review. Sci. Total Environ. 2022, 838, 155939. [Google Scholar] [CrossRef] [PubMed]
Song, F.; Zhang, W.; Yuan, T.; Ji, Z.; Cao, Z.; Xu, B.; Lu, L.; Zou, S. UAV quantitative remote sensing of riparian zone vegetation for river and lake health assessment: A review. Remote Sens. 2024, 16, 3560. [Google Scholar] [CrossRef]
Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the European Conference on Computer Vision; Springer: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
Đuraš, A.; Ilioudi, A.; Wolf, B.; Palunko, I.; De Schutter, B. Seaclear Marine Debris Detection & Segmentation Dataset; Data Set; 4TU.ResearchData: Delft, The Netherlands, 2024. [Google Scholar] [CrossRef]
Yang, Z. River Floating Debris Dataset for UAV-Based Detection. 2024. [Data Set]. Available online: https://figshare.com/articles/dataset/__zip/24123666 (accessed on 23 September 2025).
Pinson, S.; Vollering, A. Floating Macroplastic Debris in Rivers Labelled Dataset; Data Set; 4TU.ResearchData: Delft, The Netherlands, 2025. [Google Scholar] [CrossRef]
Gonçalves, G.; Andriolo, U.; Gonçalves, L.M.; Sobral, P.; Bessa, F. Beach litter survey by drones: Mini-review and discussion of a potential standardization. Environ. Pollut. 2022, 315, 120370. [Google Scholar] [CrossRef]
Andriolo, U.; Topouzelis, K.; van Emmerik, T.H.M.; Papakonstantinou, A.; Monteiro, J.G.; Isobe, A.; Hidaka, M.; Kako, S.; Kataoka, T.; Gonçalves, G. Drones for litter monitoring on coasts and rivers: Suitable flight altitude and image resolution. Mar. Pollut. Bull. 2023, 195, 115521. [Google Scholar] [CrossRef]
Gallitelli, L.; Girard, P.; Andriolo, U.; Liro, M.; Suaria, G.; Martin, C.; Lusher, A.; Hancke, K.; Blettler, M.; Garcia-Garin, O.; et al. Monitoring macroplastics in aquatic and terrestrial ecosystems: Expert survey reveals visual and drone-based census as most effective techniques. Sci. Total Environ. 2024, 955, 176528. [Google Scholar] [CrossRef]
Wu, X.; Li, W.; Hong, D.; Tao, R.; Du, Q. Deep learning for unmanned aerial vehicle-based object detection and tracking: A survey. IEEE Geosci. Remote Sens. Mag. 2021, 10, 91–124. [Google Scholar] [CrossRef]
Tang, G.; Ni, J.; Zhao, Y.; Gu, Y.; Cao, W. A survey of object detection for UAVs based on deep learning. Remote Sens. 2023, 16, 149. [Google Scholar] [CrossRef]
Jiang, Z.; Wu, B.; Ma, L.; Zhang, H.; Lian, J. APM-YOLOv7 for small-target water-floating garbage detection based on multi-scale feature adaptive weighted fusion. Sensors 2023, 24, 50. [Google Scholar] [CrossRef]
Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
Hui, Y.; Wang, J.; Li, B. STF-YOLO: A small target detection algorithm for UAV remote sensing images based on improved SwinTransformer and class weighted classification decoupling head. Measurement 2024, 224, 113936. [Google Scholar] [CrossRef]
Yu, J.; Zheng, H.; Xie, L.; Zhang, L.; Yu, M.; Han, J. Enhanced YOLOv7 integrated with small target enhancement for rapid detection of objects on water surfaces. Front. Neurorobotics 2023, 17, 1315251. [Google Scholar] [CrossRef] [PubMed]
Gui, S.; Song, S.; Qin, R.; Tang, Y. Remote sensing object detection in the deep learning era—A review. Remote Sens. 2024, 16, 327. [Google Scholar] [CrossRef]
Hu, M.; Zhang, Y.; Jiao, T.; Xue, H.; Wu, X.; Luo, J.; Han, S.; Lv, H. An Enhanced Feature-Fusion Network for Small-Scale Pedestrian Detection on Edge Devices. Sensors 2024, 24, 7308. [Google Scholar] [CrossRef]
Chen, H.; Wang, S.; Guo, H.; Lin, H.; Zhang, Y.; Long, Z.; Huang, H. Study of marine debris around a tourist city in East China: Implication for waste management. Sci. Total Environ. 2019, 676, 278–289. [Google Scholar] [CrossRef]
Qiao, G.; Yang, M.; Wang, H. An annotated Dataset and Benchmark for Detecting Floating Debris in Inland Waters. Sci. Data 2025, 12, 385. [Google Scholar] [CrossRef]
Liao, Y.H.; Juang, J.G. Application of Unmanned Aerial Vehicles for Marine Trash Detection and Real-Time Monitoring System. Master’s Thesis, Department of Communications, Navigation and Control Engineering, National Taiwan Ocean University (NTOU), Keelung City, Taiwan, 2021. [Google Scholar]
Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar]
Jocher, G.; Qiu, J.; Chaurasia, A. Ultralytics YOLO, Version 8.0.0. GitHub. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 26 February 2026).
Jocher, G.; Qiu, J. Ultralytics YOLO11. GitHub. 2024. Available online: https://github.com/ultralytics/ultralytics (accessed on 26 February 2026).
Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 based on transformer prediction head for object detection on drone-captured scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
Zhang, Z. Drone-YOLO: An efficient neural network method for target detection in drone images. Drones 2023, 7, 526. [Google Scholar] [CrossRef]
Yi, H.; Liu, B.; Zhao, B.; Liu, E. Small object detection algorithm based on improved YOLOv8 for remote sensing. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 17, 1734–1747. [Google Scholar] [CrossRef]
Zhang, X.; Du, B.; Jia, Y.; Luo, K.; Jiang, L. MSD-YOLO11n: An improved small target detection model for high precision UAV aerial imagery. J. King Saud Univ. Comput. Inf. Sci. 2025, 37, 100. [Google Scholar] [CrossRef]
Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. RFLA: Gaussian receptive field based label assignment for tiny object detection. In Proceedings of the European conference on Computer Vision; Springer: Cham, Switzerland, 2022; pp. 526–543. [Google Scholar]
Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2022, 190, 79–93. [Google Scholar] [CrossRef]
Ye, T.; Qin, W.; Zhao, Z.; Gao, X.; Deng, X.; Ouyang, Y. Real-time object detection network in UAV-vision based on CNN and transformer. IEEE Trans. Instrum. Meas. 2023, 72, 2505713. [Google Scholar] [CrossRef]
Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. Detrs beat yolos on real-time object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
Sun, P.; Zheng, Y.; Xu, W.; Li, J.; Yang, J. Completing Missing Entities: Exploring Consistency Reasoning for Remote Sensing Object Detection. IEEE Trans. Image Process. 2026, 35, 569–584. [Google Scholar] [CrossRef]
Sun, P.; Zheng, Y.; Xu, W.; Yang, J. Dual-stream global relationship learning for oriented object detection in remote sensing images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2025, 18, 13652–13665. [Google Scholar] [CrossRef]
Fan, Q.; Li, Y.; Deveci, M.; Zhong, K.; Kadry, S. LUD-YOLO: A novel lightweight object detection network for unmanned aerial vehicle. Inf. Sci. 2025, 686, 121366. [Google Scholar] [CrossRef]
Skalski, P. Make Sense. GitHub. 2019. Available online: https://github.com/SkalskiP/make-sense/ (accessed on 26 February 2026).
Wang, W. X-AnyLabeling. GitHub. 2024. Available online: https://github.com/CVHub520/X-AnyLabeling (accessed on 26 February 2026).
Zhuang, Y.; Liu, J.; Zhao, H.; Ma, L.; Fang, Z.; Li, L.; Wu, C.; Cui, W.; Liu, Z. A deep learning framework based on structured space model for detecting small objects in complex underwater environments. Commun. Eng. 2025, 4, 24. [Google Scholar] [CrossRef]
Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE international Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable convnets v2: More deformable, better results. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9308–9316. [Google Scholar]
Xiao, Y.; Su, X.; Yuan, Q.; Liu, D.; Shen, H.; Zhang, L. Satellite video super-resolution via multiscale deformable convolution alignment and temporal grouping projection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5610819. [Google Scholar] [CrossRef]
Liu, H.; Zhou, X.; Wang, C.; Chen, S.; Kong, H. Fourier-deformable convolution network for road segmentation from remote sensing images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 4415117. [Google Scholar] [CrossRef]
Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to upsample by learning to sample. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 6027–6037. [Google Scholar]
Jiang, Y.; Tan, Z.; Wang, J.; Sun, X.; Lin, M.; Li, H. GiraffeDet: A heavy-neck paradigm for object detection. arXiv 2022, arXiv:2202.04256. [Google Scholar]
Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. Damo-yolo: A report on real-time object detection design. arXiv 2022, arXiv:2211.15444. [Google Scholar]
Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. Carafe: Content-aware reassembly of features. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3007–3016. [Google Scholar]
Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
Jocher, G. Ultralytics YOLOv5. GitHub. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 26 February 2026).
Zhang, H.; Liu, K.; Gan, Z.; Zhu, G.N. UAV-DETR: Efficient end-to-end object detection for unmanned aerial vehicle imagery. arXiv 2025, arXiv:2501.01855. [Google Scholar]

Figure 1. Three major challenges in Drone-Based Detection of Surface Floating Debris.

Figure 2. Schematic diagram of the dataset construction workflow.

Figure 3. Analysis of dataset scale information. (a) illustrates the dominance of small objects across all categories. (b) quantifies the overall scale distribution.

Figure 4. Overall network architecture of FLD-Net.

Figure 5. Detailed structure of the Deformable Feature Extraction Module.

Figure 6. Architecture of the Dynamic Cross-Scale Fusion Network and Micro-Structure of DySample Mechanism.

Figure 7. Schematic of the Dual-Domain Anti-Noise Attention module.

Figure 8. Qualitative comparison of FLD-Net with mainstream detection models in five typical challenging scenarios.

Figure 9. Visual comparison of FLD-Net with mainstream detection models on the FMPD Dataset.

Figure 10. Architecture of the real-time floating litter detection system based on UAV edge computing.

Table 1. Comparison of UAV-Flow with Existing Mainstream Surface Floating Litter Datasets.

Dataset	Platform	Environment	Scale	Small Obj.	Main Limitation
Floater	Shore Cams	Multiple Rivers	3000	16.3%	Static View
FMPD	River Monitor	Multiple Rivers	2229	36.5%	Static View
River Debris	DJI Mini 2	Single River	840	-	Small Scale
HAIDA	UAV	Coastal/Port	324	29.6%	Small Scale
UAV-Flow(ours)	DJI M350	River/Lake/Flat	4593	78.9%	Limited Geo. Range

Table 2. Ablation Study on DFEM Hierarchical Settings.

Setting	P (%)	R (%)	mAP₅₀ (%)	Params (M)
Baseline	77.85	66.18	71.96	9.40
P2, P3	78.14	68.55	72.78	9.42 (+0.02)
P4, P5 (Ours)	78.88	67.99	73.51	9.50 (+0.10)
All	77.51	68.53	73.32	9.52 (+0.12)

Note: The bold indicates the optimal result for that column. The same applies to the following table.

Table 3. Ablation Study on DCFN Internal Design.

Model Variants	P2 Layer	P (%)	R (%)	mAP₅₀ (%)	Params (M)
Baseline (PANet)	–	77.85	66.18	71.96	9.40
GFPN	✓	77.78	70.53	73.58	13.72
GFPN-Scale	–	79.89	68.42	73.39	9.66
+DySample (DFCN)	–	79.32	75.04	77.69	9.67
DCFN (w/ P2)	✓	78.23	72.35	74.98	13.74

Table 4. Ablation Experiments of Key Modules on UAV-Flow Dataset.

DFEM	DCFN	DANA	P (%)	R (%)	mAP₅₀ (%)	Params (M)
			77.85	66.18	71.96	9.4
✓			79.31	67.76	73.44	9.5
	✓		79.32	75.04	77.69	9.67
		✓	79.41	75.88	78.95	9.47
✓	✓		78.48	75.53	78.49	9.78
	✓	✓	80.66	75.00	78.71	9.75
✓		✓	79.71	76.29	79.19	9.56
✓	✓	✓	80.67	77.84	80.47	9.9

Table 5. Comparative Experiments on the UAV-Flow Dataset.

Method	P (%)	R (%)	mAP₅₀ (%)	AP_S (%)	Params (M)	FLOPs (G)	FPS
YOLOv5n [61]	74.19	65.64	69.04	25.4	2.2	5.8	194.32
YOLOv8n [33]	78.00	66.11	71.72	27.8	2.68	6.8	347.22
YOLOv5s [61]	78.33	65.94	71.35	29.5	7.8	18.7	323.49
YOLOv8s [33]	78.09	67.40	72.13	29.9	9.8	23.4	309.36
YOLOv11s [34]	77.85	66.18	71.96	28.9	9.4	21.3	289.91
TPH-YOLOv5 [35]	78.57	69.64	75.73	31.0	45.36	245.1	8.22
Drone-YOLO [36]	78.07	69.11	74.10	29.5	34.65	7.9	12.4
FFCA-YOLO [24]	80.39	73.39	75.62	30.7	7.1	51.3	53.37
RT-DETR [42]	82.14	73.36	77.29	35.0	41.9	125.6	80.07
UAV-DETR [62]	82.72	74.91	79.75	34.9	44.6	161.4	35.09
FLD-Net (Ours)	80.67	77.84	80.47	35.6	9.9	25.1	156.72

Table 6. Cross-Domain Generalization on HAIDA Trash Dataset and FMPD Dataset.

Method	HAIDA Trash Dataset				FMPD Dataset
Method	P (%)	R (%)	mAP₅₀ (%)	AP_S (%)	P (%)	R (%)	mAP₅₀ (%)	AP_S (%)
YOLOv8s [33]	69.2	67.3	71.2	11.7	37.30	43.23	33.09	10.4
YOLOv11s [34]	55.5	63.5	64.9	10.8	38.17	49.30	37.00	12.7
RT-DETR [42]	62.9	67.3	67.0	15.4	40.35	25.23	26.77	8.2
FLD-Net (Ours)	74.0	66.4	73.1	17.8	41.31	49.5	39.13	41.8

Table 7. Performance Comparison on NVIDIA Jetson Xavier NX. TRT denotes TensorRT optimization, and Eff. stands for energy efficiency.

Model	TRT	mAP₅₀ (%)	Latency (ms)	FPS	Power (W)	Eff. (FPS/W)
YOLOv11s		71.96	18.5	54.0	14.2	3.80
YOLOv11s	✓	71.57	12.0	83.2	14.2	5.86
FLD-Net		80.47	22.4	44.6	14.5	3.07
FLD-Net	✓	80.29	14.4	69.6	14.5	4.80

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Wang, X.; Zhou, B.; Ye, X.; Wang, L.; Wang, Z. FLD-Net for Floating Litter Detection in UAV Remote Sensing. Remote Sens. 2026, 18, 736. https://doi.org/10.3390/rs18050736

AMA Style

Wang X, Zhou B, Ye X, Wang L, Wang Z. FLD-Net for Floating Litter Detection in UAV Remote Sensing. Remote Sensing. 2026; 18(5):736. https://doi.org/10.3390/rs18050736

Chicago/Turabian Style

Wang, Xingyue, Bin Zhou, Xia Ye, Lidong Wang, and Zhen Wang. 2026. "FLD-Net for Floating Litter Detection in UAV Remote Sensing" Remote Sensing 18, no. 5: 736. https://doi.org/10.3390/rs18050736

APA Style

Wang, X., Zhou, B., Ye, X., Wang, L., & Wang, Z. (2026). FLD-Net for Floating Litter Detection in UAV Remote Sensing. Remote Sensing, 18(5), 736. https://doi.org/10.3390/rs18050736

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

FLD-Net for Floating Litter Detection in UAV Remote Sensing

Highlights

Abstract

1. Introduction

2. Related Work

2.1. Surface Floating Debris Datasets

2.2. UAV Object Detection

3. UAV-Flow Dataset

3.1. Dataset Construction

3.2. Statistical Analysis and Characteristics

4. Methodology

4.1. Deformable Feature Extraction Module

4.2. Dynamic Cross-Scale Fusion Network

4.3. Dual-Domain Anti-Noise Attention

5. Experiments and Discussion

5.1. Experimental Setup

5.2. Ablation Studies

5.3. Comparative Experiments

5.4. Visualization Analysis

5.5. Real-Time Edge Deployment and Energy Efficiency

6. Conclusions

Author Contributions

Funding

Data Availability Statement

Conflicts of Interest

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI