A Lightweight Multi-Level Feature Fusion Detector for UAV-Based Tiny Personnel Detection in Hilly Road Safety Monitoring and Rescue

Cao, Yanghao; Sheng, Chaojun; Tian, Haishan; Xiao, Qi; Zhou, Ziyi; Li, Jiayuan; Tang, Weiwei; Wang, Kui

doi:10.3390/rs18111725


by Yanghao Cao ^1,2, Chaojun Sheng ^1,2, Haishan Tian ^1,2,*, Qi Xiao ^1,2, Ziyi Zhou ^1,2, Jiayuan Li ^1,2, Weiwei Tang ^1,2 and Kui Wang ^3,4

¹ School of Physics and Electronics, Hunan Normal University, Changsha 410081, China

² Key Laboratory of Physics and Devices in Post-Moore Era, College of Hunan Province, Changsha 410081, China

³ National Engineering and Technology Research Center for Emergency and Disaster Relief Equipment, PLA Joint Logistics Support Force University of Engineering, Chongqing 401331, China

⁴ Chongqing Gangli Environmental Protection Co., Ltd., Chongqing 400074, China

* Author to whom correspondence should be addressed.

Remote Sens. 2026, 18(11), 1725; https://doi.org/10.3390/rs18111725

Submission received: 10 April 2026 / Revised: 25 May 2026 / Accepted: 25 May 2026 / Published: 27 May 2026

(This article belongs to the Special Issue Small Target Detection, Recognition, and Tracking in Remote Sensing)


Highlights

What are the main findings?

In this study, a lightweight small-target detector, FCML-YOLO, was developed, tailored for UAV-based applications. The proposed detector attains a 95.2% mAP50 on the self-constructed OPVM-VIRD dataset while using substantially fewer parameters.
The method integrates a frequency-domain feature enhancement (FDFE) module, multi-scale feature reconstruction (MSFR) module, and lightweight multi-level feature fusion (LMFF) architecture. Our method improves small-target detection under complex backgrounds and demonstrates higher inference speed.

What are the implications of the main findings?

This work provides a feasible technical approach for small-target detection in complex mountainous road environments, contributing to improved safety monitoring and emergency rescue capabilities.
This framework enables real-time object detection on resource-limited UAV platforms, offering a feasible approach for practical deployment.

Abstract

Target detection systems based on UAV platforms have the advantages of speed, flexibility, and agility in safety monitoring and rescue operations on hilly roads. However, due to the high altitude of aerial imaging, terrain occlusions, and interference from complex backgrounds, trapped individuals often appear as visually negligible small targets, leading to high miss rates and delayed responses in traditional detection methods. To address this urgent need, this paper proposes a lightweight small-target detector called FCML-YOLO. First, a frequency-domain feature enhancement (FDFE) module is designed, which extracts frequency-domain features using the discrete cosine transform and enhances global context perception through adaptive global pooling and multi-branch fully connected layers. Then, a content-aware reassembly of features (CARAFE) module is incorporated to preserve fine-grained image details during upsampling. Additionally, a multi-scale feature reconstruction (MSFR) module is developed, which integrates features from multiple scales and reduces redundant information using an adaptive weighting mechanism. Building on this, we construct a lightweight multi-level feature fusion (LMFF) network by removing redundant structures and fully exploiting deep and shallow features. The experimental results on multiple datasets demonstrate that, compared to YOLO11s, FCML-YOLO achieves 4.4% improvement in mAP50 on the self-built OPVM-VIRD dataset. Additionally, the model demonstrates a significant advantage over mainstream detection models on public datasets such as VisDrone, USOD, DOTA, and TinyPerson. Furthermore, experiments are extended to the search-and-rescue-oriented SARD dataset to verify the applicability of FCML-YOLO in UAV-based rescue scenarios. The model is deployed on a self-developed UAV-mounted detection pod system, with the number of parameters reduced by 62.6% compared to the baseline model, achieving real-time performance at 64 frames per second (FPS).

Keywords:

remote sensing; YOLO11; small object detection; feature fusion; safety monitoring and rescue

1. Introduction

Hilly roads, especially in popular tourist areas, are high-risk zones for incidents such as missing persons and falls [1]. Typical hilly mountainous scenic areas (e.g., Yuelu Mountain in Changsha) feature undulating terrain, dense vegetation, and winding roads [2]. Coupled with variable weather and lighting conditions, these factors significantly limit the efficiency of safety monitoring and ground search-and-rescue operations in such areas [3]. Drones, which are maneuverable and flexible, can bypass obstacles and reach locations that are difficult for ground rescue teams to access, making them a crucial supplementary tool for rescue efforts [4]. However, in aerial images, individuals appear as small-scale targets that can easily be confused with rocks or tree shadows, making it challenging for existing object detection technologies to achieve stable recognition in complex backgrounds [5,6]. This severely restricts the effectiveness of monitoring and rescue responses.

Conventional object detection paradigms typically depend on handcrafted features and multi-step processing pipelines, characterized by intricacy and poor robustness, thus yielding subpar detection performance for small targets in remote sensing images. By contrast, deep learning-driven object detection methodologies autonomously learn hierarchical features across successive convolutional layers spanning low-level textures to high-level semantics, markedly boosting both detection precision and generalization, particularly for diminutive target identification in remote sensing scenarios.

Within the object detection domain, the YOLO family has struck a favorable balance between inference speed and detection accuracy through its end-to-end pipeline and single-pass inference mechanism. Across successive versions, YOLO variants have shown continuous improvements in detection performance [7,8,9,10,11,12,13,14,15]. In particular, YOLO11, with its upgraded feature extraction architecture and lightweight design, shows clear performance advantages [16,17,18].

Despite the progress achieved by YOLO-based detectors, detecting small objects in UAV-captured images still involves three major challenges:

First, most existing object detection approaches for UAV imagery are designed for urban or open-area scenarios, lacking targeted modeling to address challenges unique to hilly roads, such as viewpoint distortion caused by slope, dynamic vegetation occlusion, and the similarity in height between targets and backgrounds.

Second, inadequate multi-scale feature fusion hinders accurate localization and recognition of small objects. This shortcoming hampers the effective propagation of location cues from shallow features, precipitating a gradual degradation of deep semantic information during backpropagation.

Third, general lightweight models have limited feature extraction capabilities, leading to insufficient recall in practical search-and-rescue applications. Meanwhile, models with higher accuracy typically introduce substantial computational overhead, which hinders their deployment on airborne edge platforms and forces a trade-off between accuracy and efficiency.

To tackle the above issues, this paper presents a lightweight small-target detector FCML-YOLO based on YOLO11s. It significantly improves remote sensing image target detection accuracy while effectively reducing parameters. The primary contributions of this work are outlined below:

(1): A frequency-domain feature enhancement (FDFE) module was designed. It first integrates spatial context information through adaptive global pooling, and then retains effective low-frequency information using discrete cosine transform (DCT). Subsequently, it generates attention weights by combining channel recalibration in a multi-branch fully connected subnetwork. By collaboratively utilizing frequency-domain, spatial, and channel information, it effectively separates similar targets in complex backgrounds.
(2): The content-aware reassembly of features (CARAFE) module is introduced. This module dynamically generates position-adaptive convolutional kernels, enhancing the spatial and semantic expression capabilities of features. This not only better preserves the details of small targets but also provides strong contextual modeling capabilities in occluded regions.
(3): A multi-scale feature reconstruction (MSFR) module was designed. By fusing feature maps of large, medium, and small resolutions and applying an adaptive attention mechanism to weight the key channel features, this module retains shallow-level details while reinforcing deep-level semantics. This allows the model to effectively detect objects across different scales.
(4): A lightweight multi-level feature fusion (LMFF) network was constructed. The original detection layer for large-scale objects is discarded, while a dedicated branch is added to enhance small-object detection. Moreover, the MSFR module is employed to adaptively weight and fuse features across different scales, which greatly enhances its ability to perceive small objects while keeping it lightweight.
(5): The FCML-YOLO model was deployed on a self-developed UAV detection pod system equipped with the NVIDIA Jetson AGX Orin and tested in the Yuelu Mountain Scenic Area. The model achieves real-time performance of 64 FPS, surpassing the 57 FPS of the baseline model YOLO11s. This suggests it is suitable for real-world deployment in hilly road safety monitoring and emergency response scenarios.
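To put the two reported frame rates in perspective, the short calculation below converts them into a per-frame time budget. Only the 64 FPS and 57 FPS figures come from the text; the helper name `frame_time_ms` is illustrative.

```python
def frame_time_ms(fps: float) -> float:
    """Average time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

fcml_ms = frame_time_ms(64)      # FCML-YOLO, as reported above
baseline_ms = frame_time_ms(57)  # baseline model, as reported above
speedup = 64 / 57                # relative throughput

print(f"{fcml_ms:.1f} ms vs {baseline_ms:.1f} ms per frame, "
      f"{100 * (speedup - 1):.1f}% higher throughput")
```

In other words, the reported gain corresponds to roughly 15.6 ms instead of 17.5 ms per frame, or about 12% more frames per second.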

2. Related Work

2.1. Small-Object Detection

Detecting small objects in remote sensing imagery is a persistent challenge attributable to the targets’ minuscule size, sparse textural features, and high vulnerability to complex background interference. To address these hurdles, the research community has investigated diverse optimization approaches spanning multiple dimensions. To boost discriminative feature extraction, He et al. [19] devised SOD-YOLO, which integrates a C2f attention module grounded in Cross-Domain Fusion Attention (CDFA) into the YOLOv8 backbone. Meanwhile, Wang et al. [20] put forward the YOLO-ERF detector. They leveraged an optimized path aggregation network to augment detection precision. Shen et al. [21] introduced the DS-YOLOv8, which integrates two novel modules: DCN_C2f and SC_SA. This design mitigates the drawbacks of fixed convolutional kernels in YOLOv8. Zhou et al. [22] engineered DDSC-YOLO, a YOLOv8n-derived framework equipped with the DCNv3LKA attention mechanism for adaptive handling of multi-scale objects. Moreover, their devised SDI-FPN architecture fuses semantic and fine-grained features, coupled with specialized detection scales, to mitigate missing and false alarms. Jobaer et al. [23] devised the dynamic YOLO-SOD model, showcasing robust performance across diverse datasets. Chenxi et al. [24] presented SFFEF-YOLO, which incorporates a Fine-Grained Information Extraction Module to boost the model’s capacity to capture subtle details. Mahendran N [25] proposed SENetV2, which integrates squeeze-and-excitation modules with dense connections, significantly improving channel-wise feature representation while strengthening global information modeling. Finally, Giri et al. [26] introduced EFR-ACENet to dynamically tune the perceptual region according to task-specific detection needs.

2.2. Lightweight Object Detection

Conventional convolutional neural networks (CNNs) are hindered by massive parameter volumes and high computational overhead and thus often fail to meet latency requirements. Thus, developing lightweight detectors that balance efficiency and accuracy has become imperative. Wang et al. [27] devised the LSOD-YOLO model, creating an architecture that accelerates inference while boosting detection precision, achieving dual improvements. Luo et al. [28] enhanced the YOLOv5 framework with lightweight modifications: embedding the Ghost module into the C3 block and supplementing it with a Coordinate Attention (CA) mechanism. Fang et al. [29] proposed CS-YOLOv8, replacing the YOLOv8 backbone with FasterNet to reduce redundant computations and memory access overheads. Moreover, they designed a Channel Shuffle–Partial Convolution (CS-PConv) module, facilitating cross-channel information flow between convolutional and non-convolutional branches. Hua et al. [30] introduced the Cross-Stage Partial Deformable Network (CSPDNet), which employs a deformable separable convolution structure to achieve feature channel decoupling and adaptive sampling. This design significantly curbs redundant computations in UAV-based detection scenarios. Li et al. [31] developed a lightweight 3D detector for point cloud data, featuring a lightweight 3D sparse convolution module (LW-Sconv) based on decomposed and grouped convolutions. The model also incorporates a knowledge distillation loss to further reduce network complexity. Song et al. [32] presented YOLO-ELWNet, an efficient lightweight network that leverages the SIoU loss in its detection head. This design speeds up convergence and outperforms mainstream lightweight detectors in both detection accuracy and computational cost. Zhang et al. [33] refined YOLOv8’s neck module with a small-object-tailored detection head leveraging high-resolution feature maps, markedly boosting small-scale target detection capability.

2.3. Frequency-Domain Feature Processing for Object Detection

In recent years, extensive studies have been conducted on frequency-domain techniques in image processing, revealing that frequency-domain features are crucial for enhancing both object detection performance and model robustness. Zheng et al. [34] proposed an innovative framework leveraging the fast discrete curvelet transform (FDCT) to efficiently capture directional and scale-related information, thereby significantly enhancing detection accuracy. Duan et al. [35] introduced an algorithm that integrates both frequency- and spatial-domain features. By applying block-wise discrete cosine transform (DCT) in combination with attention mechanisms, their method suppresses background clutter and amplifies target signals. The refined features are then extracted and fused using a lightweight neural network. Fang et al. [36] presented a frequency-domain edge-aware camouflage object detection method, improving the identification of camouflaged objects in complex backgrounds. Liu Y. [37] designed an Anti-Aliasing Module (AAM) that utilizes wavelet pooling to perform frequency decomposition, effectively mitigating aliasing effects caused by downsampling in CNNs and preserving critical high-frequency details of small targets in remote sensing imagery. Li Z. and Fan H. [38] devised the C3k2-WT module to extract detail features of both high and low frequencies via wavelet transform, allowing the model to capture meaningful representations across various frequency bands and visual fields. Qin et al. [39] proposed FcaNet, based on frequency-domain analysis. This work extends the channel compression operation to frequency-domain representations. Compared with traditional channel attention methods, FcaNet achieves more competitive performance in object detection tasks. Li et al. [40] proposed FD2-Net, which decomposes features into distinct frequency components to tackle modal discrepancies between infrared and visible images, effectively boosting detection performance in complex scenarios.

To conclude this section, although deep learning-based approaches have achieved notable improvements in object detection for UAV imagery, hurdles such as substantial scale variations in targets, severe occlusion, and intricate background contexts continue to constrain detection accuracy. Although extensive work on lightweight detectors has yielded encouraging outcomes, achieving high performance while remaining within strict computational constraints remains a key challenge. Additionally, although frequency-domain techniques have proven effective in enhancing model robustness, devising effective strategies to combine them with spatial-domain features to further elevate detection performance remains a critical avenue for future investigation.

3. Method

This paper presents FCML-YOLO, a compact small-object detector derived from YOLO11s. It is specifically designed to balance both detection accuracy and model efficiency. Firstly, the FDFE module is introduced, which combines spatial context, frequency-domain feature, and channel context information to retain key feature information of the object. Secondly, the CARAFE module is employed to substitute the conventional nearest-neighbor interpolation method, which enhances feature map quality and helps minimize information loss in the upsampling phase. Furthermore, the MSFR module is designed to comprehensively integrate fine-grained details from lower-level feature maps and semantic-rich information from higher-level feature maps. Finally, the LMFF network is constructed, with the MSFR module and cross-layer connections enhancing the integration of deep and shallow feature information, while redundant network layers are removed to reduce model complexity and improve sensitivity to small objects. The structure of FCML-YOLO is shown in Figure 1.

3.1. Frequency-Domain Feature Enhancement (FDFE) Module

Object detection in UAV-based remote sensing images encounters numerous challenges, including complicated background conditions and interference due to high similarity between targets. To tackle these problems, this study proposes the FDFE module. In FDFE, the input feature map is initially fused with global spatial information via adaptive global pooling and channel recalibration to enhance perception of spatial structures. Subsequently, a two-dimensional discrete cosine transform (2D DCT) is applied for feature compression, retaining a small set of effective DCT frequency components selected with priority given to low frequencies, in order to extract richer frequency-domain features and enhance the model’s responsiveness to subtle inter-class differences.

The adoption of 2D DCT in the FDFE module is motivated by its compact and efficient frequency-domain representation. This branch is not intended to reconstruct the full frequency spectrum, but to extract discriminative frequency responses for channel-wise feature enhancement. Compared with the Fourier transform, DCT directly produces real-valued frequency coefficients without requiring additional handling of complex-valued magnitude and phase information, which simplifies its integration into the proposed feature enhancement module. Compared with wavelet-based transforms, DCT avoids explicit multi-scale sub-band decomposition and additional cross-scale fusion strategies, thereby reducing structural complexity and better aligning with the lightweight design objective of the detector. Therefore, 2D DCT is adopted in FDFE to strengthen frequency-domain feature discrimination while maintaining computational compactness.

Then, the compressed frequency-domain features are input into a multi-branch dense subnetwork, where channel transformations and activation operations generate information weights corresponding to the feature maps, further strengthening the network’s awareness of global information. By integrating spatial context, frequency-domain feature, and channel context information, this module effectively suppresses complex backgrounds and enhances the discrimination of similar features. The structure of FDFE is shown in Figure 2.

The detailed algorithmic procedure for the FDFE module is outlined below:

Initially, the input feature map $X \in R^{C \times H \times W}$ is processed through an adaptive pooling mechanism, resulting in the feature map $X^{'} \in R^{C}$, which helps extract spatial context information.

X^{'} = σ (F_{1} (X)) \otimes X

(1)

where $σ$ is the Softmax function, and $F_{1}$ is a convolutional layer with a $1 \times 1$ kernel.

Subsequently, $X^{'}$ undergoes two layers of bottleneck transformations and is added to the original feature map $X$ in a residual manner, resulting in the feature map $X^{''} \in R^{C \times H \times W}$, which integrates spatial context information.

X^{''} = X + F_{3} (\mathrm{GELU} (F_{2} (X^{'})))

(2)

where $F_{2}$ and $F_{3}$ are convolutional layers with $1 \times 1$ kernels.
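Equations (1) and (2) can be read as a softmax-weighted global pooling followed by a residual bottleneck. The sketch below follows that reading in plain Python for clarity rather than speed. It assumes that the Softmax runs over all spatial positions of a single-channel score map and that the pooled vector is broadcast back over every position; the function names and weight layouts (`w1`, `w2`, `w3`) are illustrative and not taken from the paper.

```python
import math

def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def softmax(values):
    """Numerically stable softmax over a flat list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(matrix, vector):
    """Dense product; a 1x1 convolution applied to a C-vector reduces to this."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def context_pool(x, w1):
    """Eq. (1): softmax-weighted pooling of x ([C][H][W]) into a C-vector.

    w1 holds the C weights of a 1x1 convolution mapping C channels to one score map
    (an assumed layout).
    """
    C, H, W = len(x), len(x[0]), len(x[0][0])
    scores = [sum(w1[c] * x[c][h][w] for c in range(C))
              for h in range(H) for w in range(W)]
    attn = softmax(scores)  # one weight per spatial position, summing to 1
    return [sum(attn[h * W + w] * x[c][h][w] for h in range(H) for w in range(W))
            for c in range(C)]

def spatial_context(x, w1, w2, w3):
    """Eq. (2): X'' = X + F3(GELU(F2(X'))), broadcast over all positions."""
    ctx = context_pool(x, w1)
    delta = matvec(w3, [gelu(v) for v in matvec(w2, ctx)])
    return [[[val + delta[c] for val in row] for row in x[c]] for c in range(len(x))]

x = [[[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 8.0]]]
print(context_pool(x, [0.0, 0.0]))  # zero scores give uniform weights, i.e. the channel means
```

With all-zero score weights the pooling degenerates to global average pooling, which is a convenient sanity check for the softmax weighting.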

After spatial feature enhancement, feature map $X^{''}$ is split along the channel dimension into n independent channel variables $X^{i} \in R^{C^{'} \times H \times W}$, where $i \in \{0, 1, \dots, n - 1\}$.

X^{''} \overset{\mathrm{Split}}{\to} \{X^{0}, X^{1}, \dots, X^{n - 1}\}

(3)

The two-dimensional DCT frequency space is modeled using a $7 \times 7$ reference frequency grid. To obtain compact yet discriminative frequency-domain descriptors, 16 effective DCT frequency components are selected from this grid according to a fixed low-frequency-prioritized selection criterion. Let $(r_{i}, s_{i})$ denote the i-th selected reference frequency index in the $7 \times 7$ grid, where $i \in \{0, 1, \dots, 15\}$. For a DCT computation size of $H_{d} \times W_{d}$, the reference frequency index $(r_{i}, s_{i})$ is mapped to the actual DCT frequency index $(u_{i}, v_{i})$ as follows:

u_{i} = r_{i} \lfloor \frac{H_{d}}{7} \rfloor, \quad v_{i} = s_{i} \lfloor \frac{W_{d}}{7} \rfloor

(4)

where $H_{d}$ and $W_{d}$ denote the height and width of the feature map used for DCT computation, respectively, and $\lfloor \cdot \rfloor$ denotes the floor operation.
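As a quick illustration of Equation (4), the helper below scales a reference index on the 7 × 7 grid to a given DCT size. The function name and the sample indices are hypothetical; the 16 indices actually selected are not listed in the text and are not reproduced here.

```python
def map_frequency_index(r, s, h_d, w_d, grid=7):
    """Eq. (4): scale a reference index (r, s) on the grid x grid lattice to an h_d x w_d DCT."""
    return r * (h_d // grid), s * (w_d // grid)

# Hypothetical reference indices, chosen only to illustrate the scaling.
for r, s in [(0, 0), (0, 1), (2, 3)]:
    print((r, s), "->", map_frequency_index(r, s, 14, 14))
```

For a 14 × 14 map every reference index is simply doubled. Because of the floor operation, any DCT size below 7 in a given dimension maps every index in that dimension to 0, so the selected components are only distinct when the DCT size is at least 7 in both dimensions.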

Then, the n divided channel variables are subjected to 2D DCT compression.

\mathrm{Freq}^{i} = \mathrm{2DDCT}^{u_{i}, v_{i}} (X^{i}) = \sum_{h = 0}^{H - 1} \sum_{w = 0}^{W - 1} X_{:, h, w}^{i} B_{h, w}^{u_{i}, v_{i}}, \quad \mathrm{s.t.} \; i \in \{0, 1, \dots, n - 1\}

(5)

where $(u_{i}, v_{i})$ is the two-dimensional index corresponding to the frequency components of $X^{i}$, each frequency component $\mathrm{Freq}^{i}$ is the frequency term obtained via DCT compression, and H and W denote the height and width of the feature map, respectively. The basis function of the DCT process can be expressed as

B_{h, w}^{u, v} = \cos (\frac{π u}{H} (h + \frac{1}{2})) \cos (\frac{π v}{W} (w + \frac{1}{2}))

(6)
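The DCT compression of Equation (5) amounts to a weighted sum of each channel with one fixed cosine pattern. The sketch below uses the standard DCT-II basis, with (u, v) as the frequency index and (h, w) as the spatial position, and omits normalization constants; the function names are illustrative.

```python
import math

def dct_basis(u, v, h, w, H, W):
    """DCT-II basis value for frequency (u, v) at spatial position (h, w)."""
    return (math.cos(math.pi * u / H * (h + 0.5))
            * math.cos(math.pi * v / W * (w + 0.5)))

def dct_pool(group, u, v):
    """Eq. (5): reduce each channel of group ([C'][H][W]) to one coefficient at (u, v)."""
    H, W = len(group[0]), len(group[0][0])
    return [sum(channel[h][w] * dct_basis(u, v, h, w, H, W)
                for h in range(H) for w in range(W))
            for channel in group]

group = [[[1.0, 2.0], [3.0, 4.0]]]  # one channel, 2 x 2
print(dct_pool(group, 0, 0))         # frequency (0, 0) simply sums the map
```

At frequency (0, 0) the basis is constant, so the coefficient is proportional to global average pooling; the remaining selected frequencies add responses to spatial variation that plain average pooling discards.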

Next, the compression results of all frequencies are concatenated in a level-wise manner to form the complete frequency feature vector $\mathrm{Freq} \in R^{C}$.

\mathrm{Freq} = \mathrm{cat} (\mathrm{Freq}^{0}, \mathrm{Freq}^{1}, \mathrm{Freq}^{2}, \dots, \mathrm{Freq}^{n - 1})

(7)

Subsequently, the frequency feature vector is processed through multiple branched dense layers, with GELU activation applied after every branch to generate diverse feature representations.

f_{i} = \mathrm{GELU} (\mathrm{FC}_{1} (\mathrm{Freq}))

(8)

After obtaining the outputs of multiple branches, they are fused into a unified feature vector $f_{s}$ via concatenation along the channel dimension, which enriches and diversifies the resulting feature representation:

f_{s} = \mathrm{Concat} (f_{0}, f_{1}, \dots, f_{n - 1})

(9)

where n is set to 4, denoting the four sub-channel groups and corresponding parallel fully connected branches.

To generate the final attention weights, the preliminary attention weight vector is computed using a Sigmoid function, and exponential operations are applied to further enhance the response of salient features, resulting in the final attention weight vector $F_{att}$.

F_{att} = \exp (δ (\mathrm{FC}_{2} (f_{s})))

(10)

where $δ$ refers to the Sigmoid non-linear activation, and FC stands for the weight matrix of a dense layer.

Finally, the input feature map $X$ undergoes element-wise multiplication with the attention tensor $F_{att}$ to obtain the output feature map $Y$.
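The attention path of Equations (8) to (10), followed by the final channel-wise scaling, can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: each branch has its own dense weight matrix, the second dense layer maps the concatenated vector back to C channels, and biases are omitted. All function and parameter names are hypothetical.

```python
import math

def gelu(v):
    """Exact GELU: v * Phi(v), with Phi the standard normal CDF."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def matvec(matrix, vector):
    """Dense layer without bias."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def fdfe_attention(freq, branch_weights, fc2):
    """Eqs. (8)-(10): parallel FC + GELU branches, concatenation, FC, then exp(sigmoid(.))."""
    branches = [[gelu(v) for v in matvec(w, freq)] for w in branch_weights]  # Eq. (8)
    fused = [v for branch in branches for v in branch]                       # Eq. (9)
    return [math.exp(sigmoid(z)) for z in matvec(fc2, fused)]                # Eq. (10)

def apply_attention(x, att):
    """Scale channel c of x ([C][H][W]) by att[c]."""
    return [[[att[c] * val for val in row] for row in x[c]] for c in range(len(x))]

att = fdfe_attention([1.0, -2.0],
                     [[[0.5, 0.1]], [[-0.3, 0.2]]],   # two branches, one output each
                     [[1.0, -1.0], [0.2, 0.3]])       # maps 2 fused values to 2 channels
print(att)
```

Since the Sigmoid output lies strictly between 0 and 1, each attention weight lies between 1 and e (about 2.72). Under this formulation the module therefore re-ranks channels by amplifying them to different degrees and never drives a channel to zero.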

The feature map processed via frequency-domain feature enhancement contains subtle frequency features of small objects as well as global channel and spatial context features, which helps the model effectively distinguish similar targets in complex scenes, improving detection accuracy and robustness under challenging conditions.

3.2. Content-Aware Reassembly of Features (CARAFE) Module

Within the YOLO11 framework, the default upsampling strategy relies on nearest-neighbor interpolation. However, this strategy fails to fully utilize the semantic cues embedded in the feature maps during the upsampling stage, leading to a loss of critical details and causing edges to appear blurred when the feature maps are scaled up. This limitation is especially evident in remote sensing images, where small object sizes and occlusion frequently cause confusion or lead to missing target cues. To address these limitations, this study replaces YOLO11’s nearest-neighbor upsampling with the CARAFE module, which boosts detection accuracy and strengthens robustness to occluded targets without increasing computational cost. The CARAFE module dynamically generates adaptive convolution kernels from the contextual information of the input features [41], which allows the upsampling process to integrate rich semantic representations and spatial awareness. Consequently, the receptive field is effectively enlarged, and more critical details are preserved while edge blurring is avoided during feature map enlargement. The structure of the CARAFE module is illustrated in Figure 3.

The CARAFE module is composed of two primary submodules: kernel prediction and content-aware reassembly. Initially, the kernel prediction submodule compresses the input feature map $X \in R^{C \times H \times W}$ through $1 \times 1$ convolution, reducing its channel count to $C_{m}$ and thereby effectively lowering the computational burden of subsequent operations.

Assume the upsampling factor is $σ$, and let $N (X_{l}, k)$ denote the $k \times k$ subregion of $X$ centered at location $l$. Then, a content encoder with kernel size $k_{encoder}$ predicts an initial reconstructed kernel of size $σ^{2} \times k_{up}^{2}$.

Subsequently, the kernel is spatially expanded to produce a reconstructed kernel of size $σ H \times σ W \times k_{up}^{2}$. Finally, a softmax function is applied to normalize it, yielding the final reassembly kernel $W_{l'}$.

W_{l'} = ψ (N (X_{l}, k_{encoder}), X)

(11)

where $ψ$ represents the kernel prediction submodule. This method allows the network to dynamically determine the most pertinent features from each region of the input, thereby effectively concentrating computational resources on the critical areas.

The content-aware reassembly submodule utilizes the reassembly kernel generated by the kernel prediction submodule to reconstruct the feature map. In this submodule, a $k_{up} \times k_{up}$ region centered at position $l$ is extracted from the input feature map and then undergoes a dot product operation with the predicted reassembly kernel to achieve feature reorganization. The final output is an optimized feature map $X^{'}$, which has dimensions of $σ H \times σ W \times C$.

X^{'} = ϕ (N (X_{l}, k_{up}), W_{l'})

(12)

where $ϕ$ represents the content-aware reassembly submodule. Notably, all channels at a given spatial location share the same upsampling kernel, thereby reducing redundant computations and enhancing efficiency.
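The reassembly step of Equation (12) can be sketched for a single channel as below. This is an illustration only: the kernel prediction submodule is not implemented, so the raw kernel scores are supplied by the caller, and positions outside the map are treated as zeros (an assumption about border handling). The names `carafe_reassemble` and `kernels` are hypothetical.

```python
import math

def softmax(values):
    """Numerically stable softmax over a flat list."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def carafe_reassemble(x, kernels, sigma=2, k_up=5):
    """Upsample x ([H][W]) by sigma using one k_up x k_up kernel per output position.

    kernels[i2][j2] is a flat list of k_up * k_up raw scores for output position (i2, j2);
    it is softmax-normalized here, then applied to the neighborhood of the source position.
    """
    H, W = len(x), len(x[0])
    r = k_up // 2
    out = [[0.0] * (sigma * W) for _ in range(sigma * H)]
    for i2 in range(sigma * H):
        for j2 in range(sigma * W):
            i, j = i2 // sigma, j2 // sigma      # source location l of output location l'
            weights = softmax(kernels[i2][j2])   # normalized reassembly kernel W_{l'}
            acc = 0.0
            for n in range(-r, r + 1):
                for m in range(-r, r + 1):
                    ii, jj = i + n, j + m
                    if 0 <= ii < H and 0 <= jj < W:  # zero padding outside the map
                        acc += weights[(n + r) * k_up + (m + r)] * x[ii][jj]
            out[i2][j2] = acc
    return out

# A kernel concentrated on its center reproduces nearest-neighbor upsampling.
scores = [0.0] * 25
scores[12] = 50.0
x = [[1.0, 2.0], [3.0, 4.0]]
up = carafe_reassemble(x, [[scores] * 4 for _ in range(4)])
print([[round(v, 6) for v in row] for row in up])
```

Nearest-neighbor interpolation is thus the special case in which every predicted kernel puts all of its weight on the center; content-aware kernels generalize it by letting each output position draw on its neighbors. For a multi-channel map the same kernel would be reused for every channel at a given position.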

CARAFE significantly enhances the quality of upsampling, effectively preserving the rich detail of small objects. Through its content-aware reassembly mechanism, CARAFE can better focus on contextual information in occluded regions, restoring the key features of occluded targets while sustaining reliable recognition performance under occlusion. In this study, we set $C_{m}$, $k_{encoder}$, $k_{up}$, and $σ$ to 64, 3, 5, and 2, respectively.
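For these settings, the size of the predicted kernel and of the content encoder follows directly from the definitions above. The calculation assumes the content encoder is a standard dense convolution and excludes biases; the variable names are illustrative.

```python
C_m, k_encoder, k_up, sigma = 64, 3, 5, 2  # settings stated in the text

# Reassembly weights predicted per source location: sigma^2 kernels of k_up x k_up each.
kernel_channels = sigma ** 2 * k_up ** 2

# Weights of a k_encoder x k_encoder convolution from C_m to kernel_channels channels.
encoder_weights = k_encoder ** 2 * C_m * kernel_channels

print(kernel_channels, encoder_weights)
```

The encoder therefore predicts 100 values per source location using 57,600 weights. The 1 × 1 channel compressor adds a further C × 64 weights, where C is the channel count of the layer being upsampled.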

3.3. Multi-Scale Feature Reconstruction (MSFR) Module

The feature fusion strategy employed by YOLO11 only upsamples the smaller-scale feature maps and adds them to the preceding layer’s features, failing to fully exploit the rich detail information contained in shallow high-resolution features. Moreover, features at different scales may introduce redundant information during the fusion process. To address these issues, this paper proposes a multi-scale feature reconstruction (MSFR) module, the structure of which is shown in Figure 4. This module combines three feature maps originating from varied scales and, through cross-scale information fusion, delivers more critical features to the subsequent network, thereby improving detection accuracy. Simultaneously, an SE module is used to adaptively weight the fused channel features, directing the model to prioritize channels with critical object information and suppressing the influence of redundant information.

The MSFR module selects three feature maps of varying scales as its input. For large-scale feature maps, traditional methods typically use strided convolutions for downsampling. However, strided convolutions can cause the loss of important feature details in small objects, which in turn raises the detection difficulty. To address this, we introduce SPD-Conv to replace traditional downsampling methods, effectively preserving fine-grained information, and offering better adaptability for small objects. For small-scale feature maps, the CARAFE module improves the representation capability of low-resolution image features. CARAFE, based on a content-aware reassembly mechanism, can meticulously reconstruct the upsampled feature map, effectively mitigating the limitations of traditional upsampling methods in semantic information representation. Subsequently, the three scale-consistent feature maps are merged in the channel direction, enabling the integration of fine edge information from high-resolution feature maps with the semantic-rich information from low-resolution feature maps. Finally, the SE module generates adaptive feature weights and applies them to the fused features to enhance the expressive power of critical channels. This process further strengthens the feature representation, which leads to improved overall detection accuracy. The core processing flow of the MSFR module is illustrated as follows.

Initially, the SPD-Conv module performs downsampling on the large-scale feature map [42], yielding the processed feature map. The structure of SPD-Conv is shown in Figure 5.

Specifically, the SPD-Conv module comprises three operations:

(1): Partition the input feature map into multiple sub-feature maps.

$f_{0,0} = X_{l} [0 : H : 2, 0 : W : 2], \quad f_{1,0} = X_{l} [1 : H : 2, 0 : W : 2], \quad f_{0,1} = X_{l} [0 : H : 2, 1 : W : 2], \quad f_{1,1} = X_{l} [1 : H : 2, 1 : W : 2]$

(13)

where H and W denote the height and width of the feature map, respectively.
(2): Aggregate the aforementioned sub-feature maps along the channel dimension.

$f_{l} = \mathrm{Concat} (f_{0,0}, f_{1,0}, f_{0,1}, f_{1,1})$

(14)
(3): Apply a non-strided convolution to the aggregated feature map to compress the channel dimension.

${X_{l}}^{'} = \mathrm{Conv} (f_{l})$

(15)

Through the above three operations, SPD-Conv achieves downsampling while preserving the fine-grained information of the input features and effectively improving computational efficiency.
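The three SPD-Conv operations map directly onto strided slicing followed by a convolution. The sketch below is a plain-Python illustration that assumes even H and W and uses a 1 × 1 kernel for the non-strided convolution; the kernel size is an assumption made for brevity, and the function names are illustrative.

```python
def space_to_depth(x):
    """Eqs. (13)-(14): split x ([C][H][W], even H and W) into four stride-2 grids and stack them."""
    def sub(row_start, col_start):
        return [[row[col_start::2] for row in channel[row_start::2]] for channel in x]
    # Order follows Eq. (14): f_{0,0}, f_{1,0}, f_{0,1}, f_{1,1}.
    return sub(0, 0) + sub(1, 0) + sub(0, 1) + sub(1, 1)

def conv1x1(x, weights):
    """Non-strided 1 x 1 convolution; weights is [C_out][C_in], bias omitted."""
    H, W = len(x[0]), len(x[0][0])
    return [[[sum(wt * x[c][h][w] for c, wt in enumerate(row)) for w in range(W)]
             for h in range(H)] for row in weights]

def spd_conv(x, weights):
    """Eq. (15): halve the resolution without discarding any input value."""
    return conv1x1(space_to_depth(x), weights)

x = [[[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]]
print(space_to_depth(x))  # 4 channels of size 2 x 2, every input value kept
```

Unlike a stride-2 convolution, which evaluates its kernel only at every second position, the rearrangement moves all four neighbors of each 2 × 2 cell into the channel dimension, so the following convolution sees every pixel of the original map.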

Subsequently, the small-scale feature map is upsampled via CARAFE, achieving more comprehensive feature reassembly.

$W_{l'} = \psi(N(X_{s,l}, k_{encoder}), X_{s})$

(16)

$X_{s}' = \phi(N(X_{s,l}, k_{up}), W_{l'})$

(17)

where $N(X_{s,l}, k_{encoder})$ denotes the $k \times k$ subregion of $X_{s}$ centered at location $l$; $W_{l'}$, the reassembly kernel; $\psi$, the kernel prediction module; and $\phi$, the content-aware reassembly module.
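The reassembly step in Equation (17) can be illustrated with a small single-channel sketch. It is a simplification under several assumptions: the kernels are supplied directly instead of being predicted by a learned module, the upsampling factor is 2, border pixels are replicated, and the function names `softmax` and `reassemble` are hypothetical.

```python
import math


def softmax(logits):
    """Normalize kernel logits so that each reassembly kernel sums to 1."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reassemble(x, kernels, scale=2, k_up=3):
    """Content-aware reassembly of one H x W channel (the phi step).

    kernels[i][j] holds the k_up * k_up normalized weights W_{l'} for output
    location l' = (i, j). Border pixels are replicated (an assumption here).
    """
    h, w = len(x), len(x[0])
    r = k_up // 2
    out = []
    for i in range(h * scale):
        row = []
        for j in range(w * scale):
            ci, cj = i // scale, j // scale  # source location l
            acc, n = 0.0, 0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    si = min(max(ci + di, 0), h - 1)
                    sj = min(max(cj + dj, 0), w - 1)
                    acc += kernels[i][j][n] * x[si][sj]
                    n += 1
            row.append(acc)
        out.append(row)
    return out


x = [[1.0, 2.0], [3.0, 4.0]]
center = [0.0] * 9
center[4] = 1.0  # all weight on the center pixel
nearest = reassemble(x, [[center] * 4 for _ in range(4)])

uniform = softmax([0.0] * 9)  # equal weight on all nine neighbours
flat = reassemble([[5.0, 5.0], [5.0, 5.0]], [[uniform] * 4 for _ in range(4)])
```

With a kernel that puts all weight on the center, the result reduces to nearest-neighbor upsampling; content-dependent kernels generalize this by letting each output location draw on its neighborhood differently.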

Next, the processed large-scale feature map $X_{l}'$, the medium-scale feature map $X_{m}$, and the upsampled small-scale feature map $X_{s}'$ are concatenated along the channel dimension to form the fused feature $F_{1}$. The corresponding computation is as follows:

$F_{1} = \mathrm{Concat}(X_{l}', X_{m}, X_{s}')$

(18)

Then, the feature map $F_{1}$ is compressed along the channel dimension via a $1 \times 1$ convolution.

$F_{2} = \mathrm{Conv}(F_{1})$

(19)

Finally, the processed feature map is passed into the SE module [43] to obtain the reweighted output feature map $F_{f}$.

$f_{att} = \mathrm{Sigmoid}(\mathrm{FC}(\mathrm{MaxPool}(F_{2})))$

(20)

$F_{f} = F_{2} \otimes f_{att}$

(21)
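The channel reweighting in Equations (20) and (21) can be sketched as follows. The sketch follows Equation (20) as written, with global max pooling and a single fully connected layer; the function name `se_reweight` and the weight values are hypothetical and chosen only for illustration.

```python
import math


def se_reweight(f2, fc_weight, fc_bias):
    """Channel reweighting following Equations (20) and (21).

    f2: C x H x W nested lists; fc_weight: C x C; fc_bias: length C.
    """
    # Global max pooling: one descriptor per channel.
    pooled = [max(max(row) for row in channel) for channel in f2]
    # Fully connected layer followed by a sigmoid gate.
    logits = [sum(wk * p for wk, p in zip(w_row, pooled)) + b
              for w_row, b in zip(fc_weight, fc_bias)]
    f_att = [1.0 / (1.0 + math.exp(-z)) for z in logits]
    # Channel-wise scaling of the fused feature map.
    f_f = [[[a * v for v in row] for row in channel]
           for channel, a in zip(f2, f_att)]
    return f_att, f_f


f2 = [[[1.0, 2.0], [3.0, 4.0]],
      [[0.0, -1.0], [0.5, 0.0]]]
identity = [[1.0, 0.0], [0.0, 1.0]]  # illustrative weights, not learned values
f_att, f_f = se_reweight(f2, identity, [0.0, 0.0])
```

In this toy case the channel with the stronger peak response receives the larger gate value, so it is attenuated less than the weaker channel.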

Shallow features contain abundant fine-grained information, which facilitates small object detection, whereas deep features carry richer high-level semantics, benefiting large-object recognition. We propose the MSFR module to associate shallow and deep features, effectively mitigating the loss of information and improving the model’s detection accuracy and adaptability across objects of varying scales.

3.4. Lightweight Multi-Level Feature Fusion (LMFF) Network

In remote sensing image object detection, existing methods typically deepen the convolutional backbone network to enhance semantic representation capability. While this can improve recognition performance for large-scale objects, it inevitably degrades the capacity to capture detailed features of small-scale objects. In response to this challenge, this study proposes a lightweight multi-level feature fusion (LMFF) network based on the YOLO11 architecture, aiming to efficiently integrate shallow detail features with deep semantic features. Specifically, the LMFF network employs the MSFR module to reconstruct the fused features and incorporates a dynamic feature weighting strategy to allocate optimized weights across various channels, thereby enhancing critical information and suppressing redundant information. To further balance detection accuracy and model complexity, LMFF performs a coordinated structural redesign of the backbone, neck, and detection head by enhancing the utilization of high-resolution features and reducing redundant high-level feature transformations. The overall structure is shown in Figure 6.

As shown in Figure 7, in the original YOLO11 backbone network, the input image undergoes five successive downsampling operations, producing feature layers C1–C5 at different scales. Although high-level features contain rich semantic information, their responsiveness to small objects is insufficient, particularly in edge and detail extraction. To address this limitation, LMFF first removes the C5 feature layer from the original YOLO11 backbone and retains only a four-stage downsampling structure. The corresponding M5 neck layer and P5 detection branch are also discarded, and a high-resolution P2 branch is introduced for small-object detection, forming a P2–P4 detection structure that is more suitable for small targets. The newly introduced P2 branch has a resolution of 160 × 160, enabling the detector to exploit earlier high-resolution shallow features before small-object details are further weakened by deeper downsampling, thereby improving the localization and recognition capability for small objects.
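The relation between detection level and feature-map resolution can be checked with a short calculation. The sketch assumes a 640 × 640 input (consistent with the 160 × 160 P2 resolution quoted above, but not stated explicitly here) and that each downsampling stage halves the spatial size; `pyramid_resolutions` is a hypothetical helper.

```python
def pyramid_resolutions(input_size, levels):
    """Map each pyramid level n to its spatial size after n stride-2 downsamplings."""
    return {f"P{n}": input_size // (2 ** n) for n in levels}


original_heads = pyramid_resolutions(640, [3, 4, 5])  # P3 to P5 heads of the original layout
lmff_heads = pyramid_resolutions(640, [2, 3, 4])      # P2 to P4 heads after the redesign
```

Under these assumptions the coarsest 20 × 20 map is dropped and a 160 × 160 map is added, so a target spanning 16 pixels in the image covers roughly 4 cells at P2 instead of 2 cells at P3.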

For convolutional layers, the number of parameters is mainly determined by the kernel size and the number of input and output channels. The removed deep path associated with C5, M5, and P5 usually involves convolutional layers with more input and output channels, so the parameters of these layers increase significantly with the channel numbers. In contrast, the newly introduced P2 branch is built upon shallow features with fewer channels and therefore introduces only a limited parameter increase. As a result, the parameter reduction achieved by removing the deep path with more channels exceeds the parameter increase introduced by the P2 branch, ultimately reducing the overall parameter scale of the model.
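The argument above can be made concrete with the standard parameter count of a convolutional layer, k × k × C_in × C_out (plus C_out if a bias is used). The channel widths below are hypothetical values chosen only to show the scaling; they are not taken from the LMFF configuration.

```python
def conv_params(k, c_in, c_out, bias=False):
    """Parameter count of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)


deep_layer = conv_params(3, 512, 512)    # hypothetical layer on the removed deep path
shallow_layer = conv_params(3, 64, 64)   # hypothetical layer on the added P2 branch
ratio = deep_layer // shallow_layer
```

Because the count grows with the product of input and output channels, an 8-fold increase in width gives a 64-fold increase in parameters, which is why removing a few wide deep layers can outweigh adding several narrow shallow ones.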

In the network’s neck, the MSFR module fuses feature maps at the C2, C3, and C4 scales and employs an adaptive weighting mechanism to dynamically adjust each scale’s importance, producing outputs that integrate deep semantic information with shallow spatial details and capture more subtle visual cues. LMFF further introduces cross-layer connections to directly associate shallow and deep features, thereby preserving essential semantic information while reducing detail loss. MSFR and the cross-layer connections enable multi-level feature interaction with limited parameter overhead. Specifically, MSFR focuses on feature reconstruction and channel reweighting among C2–C4, while the cross-layer connections promote the direct fusion of shallow details and deep semantics, thereby improving feature representation while preserving the lightweight nature of the network.

4. Experiment

4.1. Datasets

The experiments were conducted on multiple datasets with complementary scene characteristics. OPVM-VIRD and VisDrone2019 contain UAV imagery captured under diverse aerial imaging conditions, including scale variation, occlusion, illumination changes, and complex backgrounds. USOD, DOTA, and TinyPerson further cover remote-sensing small-object detection, multi-scale aerial object detection, and tiny-person detection, respectively. In addition, SARD provides UAV-based search-and-rescue scenes with small human targets in hilly roads, woods, tall grass, shaded regions, and injured-like postures. These datasets allow FCML-YOLO to be evaluated across general UAV detection, remote-sensing small-object detection, and rescue-oriented person detection scenarios.

4.1.1. Self-Built Dataset

OPVM-VIRD: OPVM-VIRD [44] is a UAV-acquired dataset collected from multiple practical monitoring scenarios, including urban areas, industrial parks, and hilly road environments in the Yuelu Mountain Scenic Area. The imagery was captured with an onboard sensing/detection payload across a broad range of conditions, covering day–night cycles, varying weather (e.g., sunny and rainy), and different UAV viewing altitudes, which introduces substantial appearance diversity for aerial vision research. In the hilly road scenes, pedestrians and vehicles are often observed along winding roads and under dense roadside vegetation, tree-shadow interference, partial occlusion, and altitude-induced scale variation. These scene characteristics are consistent with the visual challenges encountered in UAV-based road safety monitoring in mountainous environments. We annotate two frequently encountered target types, namely pedestrians and vehicles, to support practical UAV monitoring and inspection applications. Data collection follows a multi-layer altitude plan to account for scale changes in high-altitude operations, with flight heights spanning 30–200 m. Overall, OPVM-VIRD provides 20,025 labeled instances, including 8584 pedestrians and 11,441 vehicles.

4.1.2. Public Datasets

VisDrone2019: The VisDrone2019 dataset [45] is a widely used benchmark for object detection and tracking in aerial scenarios. It comprises over 10,000 static images captured in complex real-world scenes across 14 cities in China, under diverse weather and lighting conditions that offer rich contextual variety. The dataset comprises 10 annotated categories. The training, validation, and test sets contain 6471, 548, and 1610 images, respectively, with an average resolution of approximately 2000 × 1500 pixels.

USOD: An open-source remote sensing benchmark tailored for small-object detection, the USOD dataset [46] comprises 3000 aerial images with a 0.4 m spatial resolution, featuring 43,378 annotated vehicle instances. It includes 2100 training images and 900 validation images. Notably, 96.3% of annotated objects are smaller than 16 × 16 pixels, and 99.9% are under 32 × 32 pixels. The dataset also encompasses challenging environmental conditions such as low-light scenarios and shadow occlusion.

DOTA: The DOTA dataset [47] is a large-scale benchmark containing multi-scale and multi-oriented objects embedded in complex backgrounds. This work utilizes DOTAv1.5, one of its released iterations, which consists of 2806 aerial images annotated with ~400,000 object instances, covering scenarios like urban areas, rural regions, ports, and airports. The dataset includes 16 object categories, including small vehicles, baseball diamonds, planes, and ships.

TinyPerson: Purpose-built for detecting extremely small objects, particularly long-distance pedestrians, the TinyPerson dataset [48] focuses on high-resolution imagery with sparse small-target distributions. It includes 1610 images annotated with approximately 76,000 pedestrian instances. Over 90% of targets fall into the small-object range, making it highly representative of the challenges in detecting tiny human instances in surveillance and aerial imagery.

4.1.3. Search-and-Rescue UAV Dataset

SARD: SARD [49] is a public UAV-based search-and-rescue dataset designed for person detection in rescue scenarios. The dataset was collected during daylight using a DJI Phantom 4A UAV, with videos recorded at a resolution of 1920 × 1080 pixels. The camera angles varied from 45° to 90°, introducing noticeable viewpoint changes and apparent scale variations for human targets. The dataset contains 1981 image frames with person instances, covering complex backgrounds such as roads in hilly and non-urban natural environments, woods, tall grass, and shaded areas. SARD includes human targets with diverse postures, such as standing, sitting, lying, walking, and running. In addition, participants simulated exhausted or injured-like postures to better reflect human appearance variations in search-and-rescue scenarios. These characteristics make SARD a suitable benchmark for evaluating small-scale person detection in UAV-based emergency search-and-rescue scenarios.

4.2. Experimental Environment and Parameter Setting

Experiments were performed on a computing platform running Ubuntu 20.04, leveraging a software stack consisting of Python 3.10, PyTorch 2.0.0, and CUDA 11.8. The hardware configuration comprises an Intel(R) Xeon(R) Platinum 8481C CPU and an NVIDIA GeForce RTX 4090D GPU equipped with 24 GB of VRAM. Hyperparameter settings employed in the experiments are summarized in Table 1.

This study focused on practical engineering applications and used a self-developed UAV-based target detection pod system integrated with the Z-3N UAV platform for target detection and recognition validation. The system is also compatible with other UAV platforms such as the DJI M350 and M400, as shown in Figure 8. It consists of a visible light imaging sensor, a gimbal, and an edge AI inference module, and was independently developed and encapsulated by the project team to ensure structural stability and reliable operation in complex field environments. The visible light camera is the PDL-1K model from Puzhou Company, which provides high-resolution image input to meet the demands of precise target recognition. The edge AI inference module is based on the NVIDIA Jetson AGX Orin, which offers 275 TOPS of AI computing power, a 2048-core NVIDIA Ampere architecture GPU, and 64 GB of onboard memory, providing sufficient computational power and storage for onboard real-time target detection; the proposed FCML-YOLO algorithm was deployed on this module for field testing. The pod was mounted on the bottom of the UAV using a dedicated mounting bracket to achieve better field-of-view coverage and improve imaging and detection stability under flight vibrations.

4.3. Evaluation Metrics

The evaluation metrics include: precision (P), recall (R), mAP50, mAP50:95, model parameter count and the throughput in frames per second (FPS). mAP50 refers to the Mean Average Precision (mAP) computed at an IoU threshold of 0.5, used to evaluate performance under moderately relaxed matching constraints. mAP50:95 aggregates mAP values across IoU thresholds from 0.5 to 0.95 (stepping at 0.05), enabling a stricter, multi-threshold assessment that captures detection performance across varying difficulty levels. In addition, following the COCO evaluation protocol, AP_S, AP_M, and AP_L are reported to evaluate detection performance for small, medium, and large objects, respectively. These scale-specific AP values are computed over IoU thresholds from 0.50 to 0.95. The corresponding formulas are listed below:

$\mathrm{Precision} = \frac{TP}{TP + FP}$

(22)

$\mathrm{Recall} = \frac{TP}{TP + FN}$

(23)

$AP = \int_{0}^{1} P(R)\, dR$

(24)

$mAP = \frac{1}{n} \sum_{i=1}^{n} AP_{i}$

(25)

where TP represents the count of correctly identified target instances; FP, that of incorrectly predicted targets; FN, that of actual positive instances overlooked by the model; and n, the number of categories.
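Equations (22) to (25) can be illustrated with a short sketch. It assumes detections are already sorted by descending confidence and already matched to ground truth at a chosen IoU threshold, and it approximates the integral in Equation (24) with the all-point interpolated area under the precision-recall curve; evaluation toolkits may use a different interpolation, so the values here are illustrative only. All function names are hypothetical.

```python
def precision_recall(tp, fp, fn):
    """Equations (22) and (23), with 0.0 returned for empty denominators."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(is_tp, num_gt):
    """Area under the precision-recall curve for confidence-sorted detections.

    is_tp[i] is True when the i-th detection matches a ground-truth box at the
    chosen IoU threshold; num_gt is the number of ground-truth boxes.
    """
    tp = fp = 0
    recalls, precisions = [], []
    for hit in is_tp:
        tp += hit
        fp += not hit
        recalls.append(tp / num_gt)
        precisions.append(tp / (tp + fp))
    # Make precision monotonically non-increasing as recall grows.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, prev_recall = 0.0, 0.0
    for r, p in zip(recalls, precisions):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap


def mean_average_precision(ap_per_class):
    """Equation (25): mean of the per-class AP values."""
    return sum(ap_per_class) / len(ap_per_class)


p, r = precision_recall(tp=8, fp=2, fn=2)
ap = average_precision([True, False, True], num_gt=2)
m_ap = mean_average_precision([0.5, 1.0])
```

Computing mAP50:95 then amounts to repeating the matching and AP computation at each IoU threshold from 0.5 to 0.95 and averaging the results.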

4.4. Experimental Analysis of Modules

To systematically assess the individual contributions of the proposed LMFF network and the MSFR, FDFE, and CARAFE modules to overall model performance, we performed ablation studies on the VisDrone dataset using YOLO11s as the baseline. For objectivity and fairness, all models were trained from scratch, without pre-trained weights.

4.4.1. Effectiveness of the LMFF Network and MSFR Module

To validate the performance of the LMFF network and MSFR module, a set of comparative experiments was performed on the YOLO11s architecture. Specifically, we compared the original YOLO11s model, the YOLO11s-P2 model incorporating a P2 detection head, the LMFF network without the MSFR module, and the complete LMFF network.

As shown in Table 2, compared with YOLO11s, YOLO11s-P2 improves precision, recall, mAP50, and mAP50:95 by 2.2%, 2.2%, 2.7%, and 2.1%, respectively. Notably, AP_S increases from 10.9% to 13.6%, indicating that the added P2 detection branch effectively enhances small-object representation by exploiting high-resolution shallow features. Compared with YOLO11s-P2, LMFF without MSFR maintains comparable overall accuracy while reducing the parameter count from 9.56 M to 2.92 M, mainly because the high-channel deep path associated with C5, M5, and P5 is removed. Meanwhile, AP_S further increases to 13.9%, and AP_L remains higher than that of YOLO11s, suggesting that the lightweight redesign does not cause severe degradation for larger objects. After introducing MSFR, the complete LMFF network achieves the best mAP50, mAP50:95, AP_S, and AP_M, with AP_S and AP_M reaching 14.8% and 34.0%, respectively. Although AP_L slightly fluctuates, it remains comparable to the baseline, with only a 0.1% difference. These results verify that LMFF improves small- and medium-object detection while maintaining a favorable balance between large-object detection and model complexity.

To further analyze the internal structure of LMFF, we conducted additional ablation experiments on different MSFR input configurations, as shown in Table 3. LMFF-M2-S and LMFF-M2-D denote two-input MSFR variants using C2 + C3 and C3 + C4, respectively, while LMFF-M3 denotes the complete three-input MSFR configuration using C2 + C3 + C4. As shown in Table 3, compared with LMFF without MSFR, LMFF-M3 improves mAP50 from 42.7% to 43.8% and mAP50:95 from 25.8% to 26.6%, indicating that MSFR contributes to more effective multi-level feature reconstruction. Among the two-input variants, LMFF-M2-S using C2 + C3 achieves higher precision, while LMFF-M2-D using C3 + C4 obtains better recall, mAP50, and mAP50:95, suggesting that shallow-to-middle and middle-to-deep feature combinations emphasize different aspects of detection performance. The complete LMFF-M3 configuration achieves the best overall results, demonstrating that jointly integrating C2, C3, and C4 provides a better balance between shallow spatial details and deeper semantic information. Therefore, the three-input MSFR configuration is adopted as the final design of LMFF.

4.4.2. Effectiveness of the FDFE Module

To thoroughly assess the influence of the FDFE module on small-object detection in complex background scenarios, we performed comparative experiments using the YOLO11s model incorporated with the LMFF network, introducing four prevalent attention mechanisms: the Convolutional Block Attention Module (CBAM) [50], Efficient Multi-Scale Attention Module (EMA) [51], SENetV2 [25], and FcaNet [39].

As shown in Table 4, the model with the FDFE module achieved the best recall, mAP50, and mAP50:95, reaching 42.9%, 44.4%, and 26.9%, respectively, and outperforming all other attention mechanisms on these metrics. While the model incorporating EMA also exhibited performance gains, its detection accuracy remained lower than that of the FDFE module. Although FcaNet achieved a precision 0.4% higher than that of FDFE, FDFE surpassed it in recall, mAP50, and mAP50:95 by 1.7%, 0.7%, and 0.4%, respectively. Consequently, the proposed FDFE module offers the best overall balance among the evaluated modules, strengthening the model's capacity to discriminate among visually similar targets and thereby improving both detection precision and robustness for small objects in complex scenes.

Figure 9 presents a visual comparison of attention modules based on the Grad-CAM algorithm [52], using heatmaps to reveal the target attention characteristics across diverse scenes. The representative complex scene in Figure 9a and simple scene in Figure 9b were specifically selected for comparative analysis. The results show significant differences in target localization capabilities across the attention modules, with EMA and FDFE exhibiting notably superior focus on target regions under complex environmental conditions. Compared with CBAM, FcaNet, and SENetV2, they demonstrate more accurate regional focus. In simpler scenes, EMA and FDFE also show a clear advantage in capturing human-related features. Visualization results reveal that the devised FDFE module markedly enhances the robustness of target recognition in complex real-world scenarios.

4.4.3. Effectiveness of the CARAFE Module

Comparative evaluations were performed using an enhanced YOLO11s model integrating both the FDFE module and LMFF network to validate the efficacy of the CARAFE module. Nearest-neighbor interpolation is the default upsampling strategy in YOLO11s + LMFF + FDFE, and the other methods were evaluated by replacing this default operation under the same framework. As illustrated in Table 5, bilinear interpolation introduced no additional parameters but slightly reduced mAP50 from 44.4% to 44.1%, suggesting that simple interpolation may weaken fine-grained small-object responses. Transposed convolution improved mAP50 to 44.7% and mAP50:95 to 27.2%, but increased the parameter count to 3.72 M. The DySample module [53] and Dynamic Lightweight Upsampling (DLU) method [54] served as additional baselines, but they reduced mAP50 by 1.0% and 0.7%, respectively, compared with the nearest-neighbor baseline. Conversely, CARAFE achieved the highest mAP50:95 of 27.3% and the same mAP50 of 44.7% as transposed convolution, while requiring fewer parameters. Compared with the default nearest-neighbor baseline, CARAFE improved mAP50 and mAP50:95 by 0.3% and 0.4%, respectively, with only a limited parameter increase from 3.38 M to 3.52 M. Overall, CARAFE more effectively captures contextual information surrounding occluded regions and restores feature representations of occluded targets.

4.5. Ablation Study

To systematically assess how the proposed improvements contribute to detection performance, we carried out a sequential ablation analysis on the VisDrone dataset, taking YOLO11s as the baseline. By selectively adding, removing, or adjusting core modules, we evaluated the individual and combined impacts of each component in FCML-YOLO. Table 6 presents the findings of these ablation experiments.

The ablation results indicate that integrating the LMFF network increased mAP50 and mAP50:95 by 4.3% and 2.9%, respectively, while reducing the parameter count from 9.41 M to 3.34 M. This shows that removing the large-object detection layer substantially reduces the computational burden, while LMFF effectively merges shallow and deep multi-scale features. When the FDFE module was further incorporated, mAP50 rose to 44.4% (a 0.6% increase) and mAP50:95 rose to 26.9% (a 0.3% increase), with hardly any growth in the number of parameters. This suggests that the FDFE module helps separate the features of similar targets in the frequency domain, improving detection accuracy for such targets while keeping the model lightweight. When the LMFF, FDFE, and CARAFE modules were all introduced, mAP50 increased to 44.7% and mAP50:95 to 27.3%. This suggests that the CARAFE module, by expanding the receptive field, enables the model to capture the fine-grained characteristics of small objects more precisely. In summary, compared with YOLO11s, the proposed FCML-YOLO model achieves a 5.2% gain in mAP50 and a 3.6% gain in mAP50:95 while reducing the parameter count by 62.6%. It performs well in both lightweight design and small-object detection.
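The summary figures in this paragraph follow directly from the stepwise values it reports, as the short check below illustrates. It uses only numbers quoted in Sections 4.4.3 and 4.5; `relative_reduction` is a hypothetical helper name.

```python
def relative_reduction(before, after):
    """Relative reduction in percent."""
    return 100.0 * (before - after) / before


# Parameter counts (in millions) of YOLO11s and the full FCML-YOLO model.
param_cut = relative_reduction(9.41, 3.52)

# Stepwise gains (LMFF, then FDFE, then CARAFE) in mAP50 and mAP50:95.
map50_gain = 4.3 + 0.6 + 0.3
map5095_gain = 2.9 + 0.3 + 0.4
```

Rounded to one decimal place, these give the 62.6% parameter reduction and the 5.2% and 3.6% cumulative gains stated in the text.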

4.6. Comparison of Detection Performance of Different Models

For qualitative visualization in this section, YOLO11s is adopted as the representative baseline because FCML-YOLO is developed based on the YOLO11s architecture. These visual comparisons are intended to illustrate typical detection differences between YOLO11s and FCML-YOLO under challenging aerial imaging conditions. In the qualitative results presented in this section, red boxes indicate regions where YOLO11s produces missed detections, and the same regions are marked in the corresponding image columns to facilitate direct comparison.

4.6.1. Experiments on VisDrone

To further showcase the competitive performance of the proposed FCML-YOLO model, we compared it on the VisDrone dataset with multiple cutting-edge object detection algorithms, including YOLOv5s, YOLOv8s, YOLOv10s, YOLO11s, YOLOv12s, FFCA-YOLO, DDSC-YOLO, CS-YOLOv8, and Drone-YOLO. The results of DDSC-YOLO, CS-YOLOv8, and Drone-YOLO were cited directly from their original publications because their official implementations are not publicly available.

As shown in Table 7 and Figure 10, the experimental results indicate that the proposed detection model achieves notable advantages in overall performance. Specifically, while YOLO11s shows slightly lower precision than YOLOv5s, YOLOv8s, and YOLOv10s, it outperforms all baseline YOLO series in recall, mAP50, and mAP50:95, demonstrating a clear advantage in detection accuracy. The results further show that FCML-YOLO not only significantly improves all evaluation metrics over the YOLO baseline models but also outperforms advanced models such as DDSC-YOLO, CS-YOLOv8, and Drone-YOLO with respect to precision, recall, mAP50, and mAP50:95. While ensuring high-accuracy detection, FCML-YOLO maintains a lightweight design with only 3.52 M parameters, representing reductions of approximately 62.6% and 67.7% compared to YOLO11s and Drone-YOLO, respectively, and demonstrating strong potential for real-time applications.

Table 8 compares the detection performance between FCML-YOLO and multiple mainstream object detection algorithms across diverse categories in the VisDrone dataset. The results demonstrate that our model yields comprehensive improvements over all 10 target classes. Compared to popular models such as YOLOv8s and YOLOv10s, FCML-YOLO exhibits significant performance gains in small-object detection tasks, particularly for classes like pedestrian, person, bicycle, and motorcycle. For instance, compared with YOLO11s, the mAP50 for people detection increased by 10.3%, and for motorcycle detection by 9.2%, demonstrating the elevated precision of the modified model when detecting densely clustered and small objects. For visually similar categories, such as truck and bus, FCML-YOLO improves mAP50 by 4.3% and 3.7%, respectively, over the baseline, further validating its superiority in distinguishing similar-looking targets.

Figure 11 illustrates qualitative comparisons between FCML-YOLO and the baseline YOLO11s model across diverse real-world detection contexts. In occlusion scenarios (Figure 11a), the lightweight multi-level feature fusion module equips the model with stronger associative reasoning capabilities, allowing it to accurately detect vehicle targets even when partially obscured by branches, with noticeably improved bounding box completeness. In small-scale object scenes (Figure 11b), FCML-YOLO demonstrates heightened sensitivity through shallow feature enhancement mechanisms, successfully detecting small pedestrian targets at the far end of a sports field that the baseline model failed to capture. Under low-light conditions (Figure 11c), the improved model leverages cross-level feature interactions to mitigate edge blurring caused by uneven illumination, enabling the detection of previously missed motorcycles in dark areas and producing more precise contour fits for vehicles. In crowded scenes (Figure 11d), FCML-YOLO significantly lowers false negatives and false positives for distant pedestrians relative to YOLO11s, achieving more accurate differentiation of individuals in high-density areas and generating bounding boxes that better conform to the actual object distribution. These visualization results demonstrate the superior performance of FCML-YOLO across various representative remote sensing scenarios, further highlighting its robustness in complex aerial environments.

4.6.2. Comparison of Performance Across Different Datasets

The generalizability of a model is an important indicator for evaluating its robustness and adaptability across heterogeneous scenarios. In this study, we systematically evaluate the proposed FCML-YOLO model on four representative datasets, namely USOD, DOTA, TinyPerson, and SARD. These datasets cover remote-sensing vehicle detection, multi-scale aerial object detection, tiny pedestrian detection, and UAV-based search-and-rescue person detection, respectively. As reported in Table 9, FCML-YOLO consistently outperforms representative baseline detectors, including YOLOv5s, YOLOv8s, YOLOv10s, YOLO11s, and YOLOv12s.

On the USOD dataset, FCML-YOLO achieves a precision of 92.1% and an mAP50 value of 91.8%, surpassing all other models. Specifically, it improves precision by 1.1% over YOLO11s and mAP50 by 1.5% over YOLOv8s, which are the strongest corresponding baselines for these two metrics. On the DOTA dataset, which is characterized by dense target distributions and significant scale variations, FCML-YOLO improves mAP50 by 4.0%, 3.4%, 6.2%, 4.2%, and 4.9% over YOLOv5s, YOLOv8s, YOLOv10s, YOLO11s, and YOLOv12s, respectively. These results further validate the model’s strong feature representation capability across targets of varying scales. On the TinyPerson dataset, which primarily includes tiny pedestrian targets, FCML-YOLO achieves mAP50 improvements of 6.6%, 6.3%, 6.6%, 6.1%, and 7.1% over YOLOv5s, YOLOv8s, YOLOv10s, YOLO11s, and YOLOv12s, respectively, underscoring the model’s superior efficacy in tiny-object detection tasks.

Furthermore, on the SARD dataset, FCML-YOLO achieves the best performance across all metrics, with 96.6% precision, 92.2% recall, 96.5% mAP50, and 60.7% mAP50:95. Compared with the strongest baseline for each metric, FCML-YOLO improves precision, recall, mAP50, and mAP50:95 by 2.1%, 3.5%, 2.9%, and 3.7%, respectively. In particular, the notable recall improvement indicates that FCML-YOLO can more effectively detect human targets in UAV-based search-and-rescue scenarios, where reducing missed detections is critical. Overall, the results across the four datasets demonstrate that FCML-YOLO maintains strong generalization capability in both general aerial small-object detection and UAV-based emergency search-and-rescue person detection.

Across the evaluated datasets, YOLOv12s does not show the expected advantage over YOLO11s in UAV-based small-object detection. This phenomenon should not be interpreted as a general weakness of YOLOv12, but rather as a task-dependent architecture-suitability issue. Although YOLOv12 adopts an attention-centric design to enhance contextual representation, UAV-based tiny-object detection is dominated by low-resolution targets, weak texture cues, complex backgrounds, occlusion, and large scale variations. In such scenarios, detection accuracy depends more on high-resolution local details and fine-grained spatial cues, while stronger contextual modeling may not fully compensate for insufficient local target evidence. In addition, all YOLO-series baselines are trained under the same protocol without model-specific hyperparameter optimization. These results suggest that task-specific architectural adaptation, particularly high-resolution detail preservation and multi-level feature interaction, is more critical than model recency alone for UAV-based tiny-object detection.

Figure 12 presents qualitative comparisons between YOLO11s and FCML-YOLO across representative aerial detection scenarios, including occluded targets, wide-area surveillance scenes, multi-scale object distributions, tiny pedestrian targets, and UAV-based search-and-rescue cases. On the USOD dataset, which contains low-resolution grayscale images, YOLO11s exhibits evident missed detections when targets are small, moving, or partially occluded. In contrast, FCML-YOLO localizes these targets more accurately, demonstrating stronger robustness under low-resolution and occlusion conditions. On the DOTA dataset, where targets are densely distributed and exhibit significant scale variations, YOLO11s still suffers from noticeable missed detections. Benefiting from the enhanced multi-scale feature fusion strategy, FCML-YOLO improves detection completeness and localization quality in such large-area aerial scenes. On the TinyPerson dataset, FCML-YOLO also provides more reliable detection results for extremely small pedestrian targets, further confirming its effectiveness in tiny-object detection.

For the SARD visualization results, the red-highlighted regions show typical missed detections of YOLO11s. The simulated injured person in the high-grass area is small and highly blended with surrounding vegetation textures, making it difficult for YOLO11s to recognize. In contrast, FCML-YOLO accurately detects this target and also recovers small human targets missed by YOLO11s near shaded and forest-edge regions. These results indicate that FCML-YOLO can effectively reduce missed detections in UAV-based search-and-rescue scenarios and maintain stronger robustness under challenging conditions such as high-grass occlusion, shadow interference, and injured-like postures. Overall, the qualitative results across multiple datasets demonstrate that FCML-YOLO achieves more complete and stable detection performance under diverse aerial imaging conditions, especially for UAV-based emergency search-and-rescue person detection, where reducing missed detections of small human targets is critical.

4.6.3. Performance Comparison of the Self-Built OPVM-VIRD Dataset on Edge Devices

The OPVM-VIRD dataset was collected using our proprietary UAV platform, which captures diverse visual data from complex scenes, including hilly and mountainous areas. We tested multiple key performance metrics on the NVIDIA Jetson AGX Orin platform using this dataset. To validate the advantages of FCML-YOLO, we conducted comparative experiments between FCML-YOLO and several models from the YOLO series, including YOLOv5s, YOLOv8s, YOLOv10s, YOLO11s, and YOLOv12s. The experimental results, shown in Table 10, indicate that FCML-YOLO outperforms all other models across all evaluation metrics. Specifically, FCML-YOLO achieves the highest precision and recall, reaching 92.7% and 92.8%, respectively, which indicates its ability to reduce both false detections and missed detections in practical UAV-based monitoring scenarios. In terms of mAP50, FCML-YOLO achieves 95.2%, surpassing YOLO11s by 4.4% and the other YOLO versions by 4.1% to 5.9%. In terms of mAP50:95, FCML-YOLO reaches 69.6%, outperforming YOLO11s by 5.4% and demonstrating more stable localization performance under stricter IoU thresholds. These quantitative results demonstrate the effectiveness of FCML-YOLO on the self-built OPVM-VIRD dataset, especially for UAV-based hilly road monitoring scenes with small targets, vegetation occlusion, and complex background interference.

As shown in Figure 13, we evaluated YOLO11s and the proposed FCML-YOLO from both low-altitude and high-altitude perspectives. The red boxes in Figure 13 indicate typical missed-detection regions of YOLO11s. In the low-altitude perspective, pedestrians are partially occluded by roadside vegetation and tree shadows. YOLO11s fails to fully detect the occluded pedestrian in the highlighted region, whereas FCML-YOLO successfully detects the missed target, showing better robustness to local occlusion and background interference. In the high-altitude perspective, pedestrians appear smaller and are distributed along winding hilly roads, making them difficult to distinguish from tree shadows, road textures, and dense vegetation. YOLO11s produces more missed detections in the highlighted region, while FCML-YOLO detects more small pedestrian instances and provides more complete detection results. These results indicate that FCML-YOLO is more robust on winding mountain roads with challenging terrain and dense vegetation, which supports its applicability to UAV-based road safety monitoring and rescue-oriented scenarios.

4.7. Deployment of UAV-Based Target Detection System on a UAV Platform

The core application of FCML-YOLO is road safety monitoring and emergency search-and-rescue in hilly and mountainous areas using UAV platforms. To validate the model’s real-time detection performance in complex real-world scenarios, we conducted field validation experiments in the representative Yuelu Mountain scenic area. On the UAV platform, we evaluated the edge inference performance of FCML-YOLO and YOLO11s on the UAV-based target detection system, with inference precision set to FP32. Table 11 summarizes the comparison with the baseline model YOLO11s. Relative to YOLO11s, FCML-YOLO raises the inference speed from 57 FPS to 64 FPS while reducing the parameter count from 9.41 M to 3.52 M. The model therefore maintains real-time detection capability with a more lightweight structure, making it better suited for deployment where onboard computing power and storage are limited. This efficiency allows the model to monitor target areas quickly and accurately in emergency scenarios such as public safety incidents or natural disasters, which can help shorten emergency response times and improve personnel search-and-rescue efficiency.
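
The deployment figures can be restated in relative terms with a few lines of arithmetic. The sketch below only reuses the FPS and parameter values reported in Table 11; the helper names `relative_change` and `latency_ms` are hypothetical. The latency conversion assumes frames are processed one at a time, so it is an implied average and not a measured value.

```python
def relative_change(new, old):
    """Relative change of `new` with respect to `old`, in percent."""
    return (new - old) / old * 100.0


def latency_ms(fps):
    """Average per-frame latency implied by a throughput in frames per second."""
    return 1000.0 / fps


# Values copied from Table 11 (FP32 inference on the UAV-based system)
baseline = {"fps": 57, "params_m": 9.41}   # YOLO11s
proposed = {"fps": 64, "params_m": 3.52}   # FCML-YOLO

fps_gain = relative_change(proposed["fps"], baseline["fps"])
param_cut = -relative_change(proposed["params_m"], baseline["params_m"])

print(f"FPS gain: {fps_gain:.1f}%")
print(f"Parameter reduction: {param_cut:.1f}%")
print(f"Implied latency: {latency_ms(baseline['fps']):.1f} ms -> "
      f"{latency_ms(proposed['fps']):.1f} ms per frame")
```

On these numbers, FCML-YOLO delivers roughly 12% higher throughput with about 63% fewer parameters than YOLO11s.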

5. Discussion

The experimental results indicate that enhancing high-resolution feature utilization is critical for UAV-based tiny-object detection. By strengthening the interaction between shallow spatial details and deep semantic information, FCML-YOLO improves the detection of small and partially occluded targets under complex aerial backgrounds. The introduced frequency-domain enhancement strategy further improves target discrimination in scenes with vegetation interference, shadows, and background clutter, while the CARAFE-based upsampling mechanism helps preserve fine-grained feature details during feature reconstruction.

In addition, the proposed framework achieves a favorable balance between lightweight design and detection performance. By simplifying redundant deep detection structures and optimizing multi-level feature fusion, FCML-YOLO reduces the parameter count from 9.41 M for YOLO11s to 3.52 M while maintaining strong detection accuracy. The deployment experiments on the Jetson AGX Orin platform further demonstrate its applicability for real-time UAV onboard inference in safety monitoring and emergency rescue scenarios.

Nevertheless, performance degradation may still occur under extreme occlusion, severe illumination degradation, and ultra-long-distance aerial imaging conditions. Future work will focus on adaptive frequency-domain learning, multimodal fusion, and temporal feature modeling to further improve robustness in complex real-world rescue environments.

6. Conclusions

This study proposed FCML-YOLO, a lightweight small-object detection framework for UAV-based safety monitoring and rescue applications. By integrating the FDFE module, CARAFE operator, MSFR module, and LMFF network, the proposed method improves multi-scale feature representation and enhances the detection capability for tiny targets in complex aerial scenes. Experimental results on multiple public datasets and the self-built OPVM-VIRD dataset demonstrate that FCML-YOLO achieves superior detection accuracy and robustness while significantly reducing model parameters. In addition, deployment experiments on the Jetson AGX Orin platform verify its effectiveness for real-time UAV onboard deployment.

Author Contributions

Conceptualization, methodology, software, and validation, Y.C.; investigation, C.S.; writing—original draft preparation, supervision, and funding acquisition, H.T.; writing—review and editing, Q.X. and Z.Z.; formal analysis, J.L.; project administration, W.T.; funding acquisition, K.W. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Chongqing Municipal Natural Science Foundation under Grant CSTB2022NSCQ-MSX0192.

Data Availability Statement

All the data are contained in the article.

Conflicts of Interest

Author Kui Wang was employed by the company Chongqing Gangli Environmental Protection Co., Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

References

1. Fraternali, P.; Morandini, L.; Motta, R. Enhancing Search and Rescue Missions with UAV Thermal Video Tracking. Remote Sens. 2025, 17, 3032.
2. Quero, C.O.; Martinez-Carranza, J. Unmanned aerial systems in search and rescue: A global perspective on current challenges and future applications. Int. J. Disaster Risk Reduct. 2025, 118, 105199.
3. Simões, D.P.; Henrique, S.; Marsico, S.; Rodrigo, J.; Barbosa, L.A. Human detection in UAV imagery using deep learning: A review. Neural Comput. Appl. 2025, 37, 18109–18150.
4. Song, Z.; Yan, Y.; Cao, Y.; Jin, S.; Qi, F.; Li, Z.; Lei, T.; Chen, L.; Jing, Y.; Xia, J.; et al. An infrared dataset for partially occluded person detection in complex environment for search and rescue. Sci. Data 2025, 12, 300.
5. Jiang, N.; Chen, J.; Zhu, J.; Liang, B.; Yang, D.; Huang, X.; Xing, M. High Frame Rate Along-Track Swarm SAR Subaperture Collaboration Imaging for Moving Target. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5211721.
6. Jiang, N.; Feng, D.; Wang, J.; Zhu, J.; Huang, X. Along-Track Swarm SAR: Echo Modeling and Sub-Aperture Collaboration Imaging Based on Sparse Constraints. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2023, 16, 5602–5617.
7. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; IEEE: New York, NY, USA, 2016; pp. 779–788.
8. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; IEEE: New York, NY, USA, 2017.
9. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767.
10. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934.
11. Mokayed, H.; Nayebiastaneh, A.; De, K.; Sozos, S.; Hagner, O.; Backe, B. Nordic Vehicle Dataset (NVD): Performance of Vehicle Detectors Using Newly Captured NVD from UAV in Different Snowy Weather Conditions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023; pp. 5314–5322.
12. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976.
13. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; IEEE: New York, NY, USA, 2023.
14. Wang, C.-Y.; Yeh, I.-H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision–ECCV 2024; Springer: Cham, Switzerland, 2025; Available online: https://www.researchgate.net/publication/385275121_YOLOv9_Learning_What_You_Want_to_Learn_Using_Programmable_Gradient_Information (accessed on 15 March 2024).
15. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-To-End Object Detection. arXiv 2024, arXiv:2405.14458.
16. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725.
17. Zhou, S.; Yang, L.; Liu, H.; Zhou, C.; Liu, J.; Zhao, S.; Wang, K. A Lightweight Drone Detection Method Integrated into a Linear Attention Mechanism Based on Improved YOLOv11. Remote Sens. 2025, 17, 705.
18. Zhong, H.; Zhang, Y.; Shi, Z.; Zhang, Y.; Zhao, L. PS-YOLO: A Lighter and Faster Network for UAV Object Detection. Remote Sens. 2025, 17, 1641.
19. He, Z.; Cao, L. SOD-YOLO: Small Object Detection Network for UAV Aerial Images. IEEJ Trans. Electr. Electron. Eng. 2024, 20, 431–439.
20. Wang, X.; He, N.; Hong, C.; Sun, F.; Han, W.; Wang, Q. YOLO-ERF: Lightweight Object Detector for UAV Aerial Images. Multimed. Syst. 2023, 29, 3329–3339.
21. Shen, L.; Lang, B.; Song, Z. DS-YOLOv8-Based Object Detection Method for Remote Sensing Images. IEEE Access 2023, 11, 125122–125137.
22. Zhou, S.; Zhou, H. Detection Based on Semantics and a Detail Infusion Feature Pyramid Network and a Coordinate Adaptive Spatial Feature Fusion Mechanism Remote Sensing Small Object Detector. Remote Sens. 2024, 16, 2416.
23. Jobaer, S.; Tang, X.; Zhang, Y. A Deep Neural Network for Small Object Detection in Complex Environments with Unmanned Aerial Vehicle Imagery. Eng. Appl. Artif. Intell. 2025, 148, 110466.
24. Bai, C.; Zhang, K.; Jin, H.; Qian, P.; Zhai, R.; Lu, K. SFFEF-YOLO: Small Object Detection Network Based on Fine-Grained Feature Extraction and Fusion for Unmanned Aerial Images. Image Vis. Comput. 2025, 156, 105469.
25. Narayanan, M. SENetV2: Aggregated Dense Layer for Channelwise and Global Representations. arXiv 2023, arXiv:2311.10807.
26. Iqra; Giri, K.J. SO-YOLOv8: A Novel Deep Learning-Based Approach for Small Object Detection with YOLO beyond COCO. Expert Syst. Appl. 2025, 280, 127447.
27. Wang, H.; Liu, J.; Zhao, J.; Zhang, J.; Zhao, D. Precision and Speed: LSOD-YOLO for Lightweight Small Object Detection. Expert Syst. Appl. 2025, 269, 126440.
28. Luo, H.; Wei, J.; Wang, Y.; Chen, J.; Li, W. An Improved Lightweight Object Detection Algorithm for YOLOv5. PeerJ Comput. Sci. 2024, 10, e1830.
29. Fang, X.; Song, X.; Li, Y. CS-YOLOv8: Lightweight Mango Detection Algorithm Based on YOLOv8. Int. J. Appl. Math. Control Eng. 2024, 7, 137–143.
30. Hua, W.; Chen, Q.; Chen, W. A New Lightweight Network for Efficient UAV Object Detection. Sci. Rep. 2024, 14, 13288.
31. Li, L.; Li, B.; Zhou, H. Lightweight Multi-Scale Network for Small Object Detection. PeerJ Comput. Sci. 2022, 8, e1145.
32. Song, B.; Chen, J.; Liu, W.; Fang, J.; Xue, Y.; Liu, X. YOLO-ELWNet: A Lightweight Object Detection Network. Neurocomputing 2025, 636, 129904.
33. Zhang, Z. Drone-YOLO: An Efficient Neural Network Method for Target Detection in Drone Images. Drones 2023, 7, 526.
34. Zheng, S.; Wu, Z.; Xu, Y.; Wei, Z.; Plaza, A. Learning Orientation Information from Frequency-Domain for Oriented Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 5628512.
35. Duan, C.; Hu, B.; Liu, W.; Ma, T.; Ma, Q.; Wang, H. Infrared Small Target Detection Method Based on Frequency Domain Clutter Suppression and Spatial Feature Extraction. IEEE Access 2023, 11, 85549–85560.
36. Fang, X.; Chen, J.; Wang, Y.; Jiang, M.; Ma, J.; Wang, X. EPFDNet: Camouflage Object Detection with Edge Perception in Frequency Domain. Image Vis. Comput. 2024, 154, 105358.
37. Liu, Y.; Ye, Q.; Sun, L.; Wu, Z. SOD-YOLOv8n: Small Object Detection in Remote Sensing Images Based on YOLOv8n. IEEE Geosci. Remote Sens. Lett. 2025, 22, 6008405.
38. Li, Z.; Fan, H. MFA-YOLO: Multi-Scale Fusion and Attention-Based Object Detection for Autonomous Driving in Extreme Weather. Electronics 2025, 14, 1014.
39. Qin, Z.; Zhang, P.; Wu, F.; Li, X. FcaNet: Frequency Channel Attention Networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; IEEE: New York, NY, USA, 2021.
40. Li, K.; Wang, D.; Hu, Z.; Li, S.; Ni, W.; Zhao, L.; Wang, Q. FD2-Net: Frequency-Driven Feature Decomposition Network for Infrared-Visible Object Detection. Proc. AAAI Conf. Artif. Intell. 2025, 39, 4797–4805.
41. Wang, J.; Chen, K.; Xu, R.; Liu, Z.; Loy, C.C.; Lin, D. CARAFE: Content-Aware ReAssembly of Features. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Republic of Korea, 27 October–2 November 2019; IEEE: New York, NY, USA, 2019; Available online: https://ieeexplore.ieee.org/document/9010830/ (accessed on 27 February 2023).
42. Yang, Z.; Wu, Q.; Zhang, F.; Zhang, X.; Chen, X.; Gao, Y. A New Semantic Segmentation Method for Remote Sensing Images Integrating Coordinate Attention and SPD-Conv. Symmetry 2023, 15, 1037.
43. Hu, J.; Shen, L.; Sun, G. Squeeze-and-Excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; IEEE: New York, NY, USA, 2018; Available online: https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper.html (accessed on 19 February 2024).
44. Li, K.; Zhong, Z.; Luo, Z.; Tian, H.; Wang, K.; Jiang, H.; Xiang, D.; Tang, W. FCAT: Frequency-Domain Cross-Attention for All-Weather Multispectral Object Detection in Low-Altitude UAV Security Inspection of Urban and Industrial Areas. Remote Sens. 2026, 18, 826.
45. Du, D.; Zhang, Y.; Wang, Z.-X.; Wang, Z.; Song, Z.; Liu, Z.; Bo, L.; Shi, H.; Zhu, R.; Kumar, A.; et al. VisDrone-DET2019: The Vision Meets Drone Object Detection in Image Challenge Results. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; IEEE: New York, NY, USA, 2019.
46. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215.
47. Xia, G.-S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. DOTA: A Large-Scale Dataset for Object Detection in Aerial Images. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; IEEE: New York, NY, USA, 2018; pp. 3974–3983.
48. Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale Match for Tiny Person Detection. In Proceedings of the IEEE Workshop/Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; IEEE: New York, NY, USA, 2020; pp. 1257–1265.
49. Sambolek, S.; Ivašić-Kos, M. Automatic Person Detection in Search and Rescue Operations Using Deep CNN Detectors. IEEE Access 2021, 9, 37905–37922.
50. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV); Springer: Cham, Switzerland, 2018; Available online: https://openaccess.thecvf.com/content_ECCV_2018/html/Sanghyun_Woo_Convolutional_Block_Attention_ECCV_2018_paper.html (accessed on 3 November 2023).
51. Ouyang, D.; He, S.; Zhang, G.; Luo, M.; Guo, H.; Zhan, J.; Huang, Z. Efficient Multi-Scale Attention Module with Cross-Spatial Learning. arXiv 2023, arXiv:2305.13563.
52. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; IEEE: New York, NY, USA, 2017; pp. 618–626.
53. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. arXiv 2023, arXiv:2308.15085.
54. Fu, R.; Hu, Q.; Dong, X.; Gao, Y.; Li, B.; Zhong, P. Lighten CARAFE: Dynamic Lightweight Upsampling with Guided Reassemble Kernels. arXiv 2024, arXiv:2410.22139.

Figure 1. FCML-YOLO network structure.

Figure 2. FDFE structure diagram.

Figure 3. CARAFE structure diagram.

Figure 4. MSFR structure diagram.

Figure 5. SPD-Conv structure diagram.

Figure 6. LMFF structure diagram.

Figure 7. YOLO11 structure diagram.

Figure 8. UAV-based target detection pod system on UAV platform.

Figure 9. Heatmap visualization analysis with different attention modules.

Figure 10. Performance comparison curves for the FCML-YOLO and YOLO series.

Figure 11. Comparison of detection results on aerial images under different scenarios.

Figure 12. Some detection results of FCML-YOLO on various datasets.

Figure 13. Comparison of detection results from aerial views of different altitudes.

Table 1. Experimental parameter settings.

Training Parameters	Details
epochs	200
image size (pixels)	640 × 640
batch size	8
workers	8
initial learning rate	0.001
weight decay	0.0005
optimization algorithm and parameters	Adam optimizer (β₁ = 0.937, β₂ = 0.999)
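
As a rough illustration of how the settings in Table 1 could be passed to a training run, the sketch below maps them onto Ultralytics training arguments. This is an assumption: the table does not name the training framework, the FCML-YOLO model definition is not reproduced here, and the example therefore trains the YOLO11s baseline. The function `run_training` and its `data_yaml` argument (a dataset description file) are hypothetical names introduced for illustration.

```python
# Hyperparameters copied from Table 1. The key names follow the Ultralytics
# training arguments, which is an assumption about the framework used.
TRAIN_ARGS = {
    "epochs": 200,
    "imgsz": 640,
    "batch": 8,
    "workers": 8,
    "lr0": 0.001,            # initial learning rate
    "weight_decay": 0.0005,
    "optimizer": "Adam",
    "momentum": 0.937,       # used by Ultralytics as beta1 for Adam; beta2 is 0.999
}


def run_training(data_yaml, weights="yolo11s.pt"):
    """Train the YOLO11s baseline with the Table 1 settings.

    Requires the `ultralytics` package; it is imported lazily so that the
    configuration above can be inspected without it.
    """
    from ultralytics import YOLO

    model = YOLO(weights)
    return model.train(data=data_yaml, **TRAIN_ARGS)
```

Keeping the hyperparameters in one dictionary makes it straightforward to reuse identical settings for every compared model, which is what a fair comparison across the YOLO baselines would require.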

Table 2. Validation of the effectiveness of the LMFF network.

Model	P (%)	R (%)	mAP50 (%)	mAP50:95 (%)	AP_S (%)	AP_M (%)	AP_L (%)	Parameter (M)
YOLO11s	49.1	39.0	39.5	23.7	10.9	31.4	41.9	9.41
YOLO11s-P2	51.3	41.2	42.2	25.8	13.6	32.8	44.1	9.56
YOLO11s + LMFF (without MSFR)	51.8	41.1	42.7	25.8	13.9	32.7	42.4	2.92
YOLO11s + LMFF	53.4	42.3	43.8	26.6	14.8	34.0	41.8	3.34

Table 3. Ablation study on MSFR configurations in LMFF.

Model	MSFR Configuration	P (%)	R (%)	mAP50 (%)	mAP50:95 (%)	Parameter (M)
LMFF without MSFR	-	51.8	41.1	42.7	25.8	2.92
LMFF-M2-S	C2 + C3	53.4	41.2	42.9	26.1	3.24
LMFF-M2-D	C3 + C4	53.0	42.0	43.4	26.5	3.11
LMFF-M3	C2 + C3 + C4	53.4	42.3	43.8	26.6	3.34

Table 4. Experimental comparison of different attention mechanisms.

Model	P (%)	R (%)	mAP50 (%)	mAP50:95 (%)	Parameter (M)
YOLO11s + LMFF	53.4	42.3	43.8	26.6	3.34
+CBAM	53.1	41.3	43.4	26.4	3.34
+EMA	54.2	41.8	43.9	26.6	3.34
+SENetV2	53.4	42.0	43.5	26.4	3.35
+FcaNet	54.5	41.2	43.7	26.5	3.34
+FDFE	54.1	42.9	44.4	26.9	3.38

Table 5. Experimental results of different upsamplers.

Upsampling Method	P (%)	R (%)	mAP50 (%)	mAP50:95 (%)	Parameter (M)
Nearest-neighbor	54.1	42.9	44.4	26.9	3.38
Bilinear interpolation	54.4	42.0	44.1	26.9	3.38
Transposed convolution	53.9	43.1	44.7	27.2	3.72
DySample	53.7	41.5	43.4	26.5	3.40
DLU	52.6	41.5	43.7	26.5	3.45
CARAFE	54.5	42.9	44.7	27.3	3.52

Table 6. Ablation experiment results of different improved modules on VisDrone.

LMFF	FDFE	CARAFE	P (%)	R (%)	mAP50 (%)	mAP50:95 (%)	Parameter (M)
-	-	-	49.1	39.0	39.5	23.7	9.41
√	-	-	53.4	42.3	43.8	26.6	3.34
√	√	-	54.1	42.9	44.4	26.9	3.38
√	-	√	54.5	42.2	44.4	27.0	3.48
√	√	√	54.5	42.9	44.7	27.3	3.52

Table 7. Performance comparison of different algorithms on VisDrone.

Model	P (%)	R (%)	mAP50 (%)	mAP50:95 (%)	Parameter (M)
YOLOv5s	50.0	37.8	39.2	23.2	9.11
YOLOv8s	50.0	37.7	39.2	23.3	11.12
YOLOv10s	50.2	37.4	38.9	23.4	8.04
YOLO11s	49.1	39.0	39.5	23.7	9.41
YOLOv12s	48.4	36.6	37.3	22.3	9.23
FFCA-YOLO [46]	49.1	40.3	40.1	22.7	7.12
DDSC-YOLO [22]	52.9	40.3	42.2	25.5	-
CS-YOLOv8 [29]	53.4	40.7	42.6	25.7	-
Drone-YOLO [33]	-	-	44.3	27.0	10.9
FCML-YOLO	54.5	42.9	44.7	27.3	3.52

“-” indicates that the corresponding metrics were not reported in the original papers.

Table 8. Comparison of mAP50 across each category.

Model	Pedestrian	People	Bicycle	Car	Van	Truck	Tricycle	Awning-Tricycle	Bus	Motor
YOLOv5s	41.3	32.2	12.4	79.1	44.3	36.8	29.1	16.3	56.8	43.2
YOLOv8s	42.2	33.1	12.7	79.4	44.3	36.7	26.8	14.8	57.2	44.4
YOLOv10s	42.0	32.7	12.6	79.0	44.4	36.6	27.0	15.1	56.4	43.5
YOLO11s	42.2	32.0	12.4	79.8	45.6	37.3	28.4	15.2	58.9	43.2
YOLOv12s	39.5	31.8	10.4	78.0	43.9	32.9	24.4	15.2	55.5	41.0
FCML-YOLO	50.1	42.3	16.9	83.8	47.9	41.6	32.8	16.5	62.6	52.4
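
To see where the improvement over the baseline is concentrated, the per-category values in Table 8 can be differenced and ranked. The sketch below uses only the YOLO11s and FCML-YOLO rows of the table; the function name `gains` is a hypothetical helper, and the differences are in percentage points of mAP50.

```python
CLASSES = ["Pedestrian", "People", "Bicycle", "Car", "Van", "Truck",
           "Tricycle", "Awning-Tricycle", "Bus", "Motor"]

# mAP50 (%) per category, copied from Table 8
YOLO11S = [42.2, 32.0, 12.4, 79.8, 45.6, 37.3, 28.4, 15.2, 58.9, 43.2]
FCML = [50.1, 42.3, 16.9, 83.8, 47.9, 41.6, 32.8, 16.5, 62.6, 52.4]


def gains(proposed, baseline):
    """Per-class gain in percentage points, rounded to one decimal."""
    return [round(p - b, 1) for p, b in zip(proposed, baseline)]


# Sort classes from largest to smallest gain
ranked = sorted(zip(CLASSES, gains(FCML, YOLO11S)),
                key=lambda item: item[1], reverse=True)

for name, gain in ranked:
    print(f"{name:<16} +{gain:.1f}")
```

The three largest gains are for People (+10.3), Motor (+9.2), and Pedestrian (+7.9), while the smallest is for Awning-Tricycle (+1.3). Since person and motorcycle instances tend to be small in aerial imagery, this ordering is consistent with the paper's emphasis on tiny-target detection.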

Table 9. Comparative results on the USOD, DOTA, TinyPerson, and SARD datasets.

Dataset	Model	P (%)	R (%)	mAP50 (%)	mAP50:95 (%)
USOD	YOLOv5s	89.7	83.9	89.7	34.5
	YOLOv8s	89.9	85.3	90.3	36.1
	YOLOv10s	90.3	82.5	89.8	36.2
	YOLO11s	91.0	83.9	90.1	36.2
	YOLOv12s	90.2	84.3	89.8	35.6
	FCML-YOLO	92.1	85.8	91.8	36.9
DOTA	YOLOv5s	63.2	36.1	39.3	24.3
	YOLOv8s	64.6	36.4	39.9	24.8
	YOLOv10s	59.8	34.1	37.1	22.9
	YOLO11s	70.4	35.4	39.1	24.2
	YOLOv12s	69.6	34.8	38.4	23.6
	FCML-YOLO	72.2	39.2	43.3	27.0
TinyPerson	YOLOv5s	42.3	23.6	22.6	9.2
	YOLOv8s	42.6	23.9	22.9	9.3
	YOLOv10s	41.4	23.6	22.6	9.1
	YOLO11s	41.9	24.1	23.1	9.6
	YOLOv12s	40.6	23.6	22.1	9.0
	FCML-YOLO	45.1	29.9	29.2	12.2
SARD	YOLOv5s	93.2	87.5	93.4	57.0
	YOLOv8s	92.3	88.7	93.1	56.4
	YOLOv10s	92.7	84.9	92.3	56.0
	YOLO11s	94.5	87.9	93.6	56.7
	YOLOv12s	91.2	86.5	92.0	54.3
	FCML-YOLO	96.6	92.2	96.5	60.7

Table 10. Comparative results on the OPVM-VIRD dataset.

Dataset	Model	P (%)	R (%)	mAP50 (%)	mAP50:95 (%)
OPVM-VIRD	YOLOv5s	90.0	86.0	91.1	63.5
	YOLOv8s	88.2	86.9	90.3	63.3
	YOLOv10s	89.7	85.2	90.9	63.7
	YOLO11s	88.6	88.2	90.8	64.2
	YOLOv12s	88.6	84.9	89.3	62.3
	FCML-YOLO	92.7	92.8	95.2	69.6

Table 11. Performance on UAV-based target detection system.

Model	Inference Precision	FPS	Parameter (M)
YOLO11s	FP32	57	9.41
FCML-YOLO	FP32	64	3.52

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Cao, Y.; Sheng, C.; Tian, H.; Xiao, Q.; Zhou, Z.; Li, J.; Tang, W.; Wang, K. A Lightweight Multi-Level Feature Fusion Detector for UAV-Based Tiny Personnel Detection in Hilly Road Safety Monitoring and Rescue. Remote Sens. 2026, 18, 1725. https://doi.org/10.3390/rs18111725

