Article

MFEF-YOLO: A Multi-Scale Feature Extraction and Fusion Network for Small Object Detection in Aerial Imagery over Open Water

School of Surveying and Land Information Engineering, Henan Polytechnic University, Jiaozuo 454003, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(24), 3996; https://doi.org/10.3390/rs17243996
Submission received: 31 October 2025 / Revised: 1 December 2025 / Accepted: 7 December 2025 / Published: 11 December 2025
(This article belongs to the Special Issue Deep Learning-Based Small-Target Detection in Remote Sensing)

Highlights

What are the main findings?
  • The proposed MFEF-YOLO delivers superior performance for maritime object detection, demonstrating marked improvements of 0.11 and 0.03 in mAP50 for the SeaDronesSee and TPDNV datasets, respectively, alongside an 11.54% reduction in parameters.
  • The novel DBSPPF and IMFFNet modules demonstrate superior capability in enhancing multi-scale feature extraction and fusion, significantly boosting detection accuracy for small and densely distributed objects in complex open-water environments.
What is the implication of the main finding?
  • This work provides a practical solution for accurate real-time object detection using resource-constrained UAV platforms, enhancing their capability in maritime surveying, monitoring, and search-and-rescue operations.
  • The newly constructed TPDNV benchmark dataset, comprising over 120,000 annotated instances of dense small targets, establishes a challenging and valuable resource for advancing the field of small-object detection in remote sensing applications.

Abstract

Current object detection using UAV platforms in open water faces challenges such as low detection accuracy, limited storage, and constrained computational capabilities. To address these issues, we propose MFEF-YOLO, a small object detection network based on multi-scale feature extraction and fusion. First, we introduce a Dual-Branch Spatial Pyramid Pooling Fast (DBSPPF) module in the backbone network to replace the original SPPF module, while integrating ODConv and C3k2 modules to collectively enhance feature extraction capabilities. Second, we improve small object detection by adding a P2 detection head and reduce model parameters by removing the P5 detection head. Finally, we design an Island-based Multi-scale Feature Fusion Network (IMFFNet) and employ a Coordinate-guided Multi-scale Feature Fusion Module (CMFFM) to strengthen contextual information and boost detection accuracy. We validate the effectiveness of MFEF-YOLO using the public dataset SeaDronesSee and our custom dataset TPDNV. Experimental results show that compared to the baseline model, mAP50 improves by 0.11 and 0.03 using the two datasets, respectively, while model parameters are reduced by 11.54%. Furthermore, DBSPPF and IMFFNet demonstrate superior performance in comparative studies with other methods, confirming their effectiveness. These improvements and outstanding performance make MFEF-YOLO particularly suitable for UAV-based object detection in open waters.

1. Introduction

Maritime transportation plays a pivotal role in global trade, accounting for over 80% of international trade volume annually [1]. However, vessels frequently encounter extreme weather conditions such as storms, leading to maritime accidents. This underscores the critical need for rapid target search and rescue operations, thereby driving the development of object detection technologies in open waters.
The core challenge of open-water object detection lies in rapid target identification. Traditional search methods primarily rely on human observers aboard helicopters or surface vessels, which suffer from low efficiency, limited coverage, and difficulties in achieving accurate and fast target localization [2]. In contrast, unmanned aerial vehicles (UAVs) offer distinct advantages including compact size, high-speed mobility, operational flexibility, low cost, and wide field-of-view. By integrating deep learning-based object detection algorithms, UAVs can autonomously process captured imagery to identify targets, significantly improving search and rescue efficiency.
In UAV-based object detection for open waters, targets are typically small and may appear blurred or occluded due to long distances and water interference, significantly increasing detection difficulty. This necessitates specialized object detection algorithms. To address these challenges, researchers have focused on small object detection and explored algorithmic improvements from various perspectives. Wu et al. [3] enhanced feature extraction capability for small, blurred objects by integrating dilated convolutional branches with varying dilation rates into the C3 modules and incorporating lightweight convolutional layers within YOLOv5s’ backbone. Tang et al. [4] proposed a Spatial Context Pyramid with Multi-Scale Attention (SCPMA) module that captures spatial and channel-dependent features of small objects at multiple scales, improving perception of spatial context features and utilization of multi-scale information. Wang et al. [5] improved model performance while keeping the model lightweight by adding a tiny-object prediction head to YOLOv8 to detect small targets missed by the other heads, while removing the medium and large object prediction heads that had minimal impact on accuracy. Li et al. [6] introduced Transformer [7] into YOLOv5’s backbone to enhance feature extraction for detecting small or occluded objects in complex maritime scenarios, though this significantly increased parameters and computational costs compared to convolutional modules. Beyond these specific improvements for small object detection, numerous researchers [8,9] have employed attention mechanisms to enhance network focus on small targets, thereby improving detection performance.
Inspired by these studies, we propose MFEF-YOLO, an enhanced YOLOv11-based model for improving small object detection performance in open waters. In the backbone network, we design a Dual-Branch Spatial Pyramid Pooling Fast (DBSPPF) module to enhance multi-scale feature utilization and extraction capabilities. Additionally, we integrate the C3k2 module with ODConv to create a novel C3k2_ODConv module, thereby improving the backbone network’s generalization capacity and feature extraction ability. For the head design, we incorporate a high-resolution tiny object detection head while eliminating the low-resolution large object detection head, which simultaneously enhances small object detection accuracy and reduces model parameters. Finally, we propose an Island-based Multi-scale Feature Fusion Network (IMFFNet) that employs our newly designed Coordinate-guided Multi-scale Feature Fusion Module (CMFFM). This architecture effectively fuses three feature maps of different scales and feeds them into deeper network layers, significantly minimizing information loss. By fully leveraging the intrinsic spatial and semantic information across different-scale feature maps, our approach enhances the network’s capacity to capture and utilize multi-scale objects, ultimately boosting overall detection performance.
In summary, the main contributions of this paper are as follows.
(1)
In the backbone network, we propose a plug-and-play DBSPPF module. This module employs a dual-branch architecture to enhance multi-scale feature utilization and feature extraction capability, enabling the model to detect objects of varying sizes with higher accuracy. Furthermore, we incorporate depthwise separable convolution into this module to significantly improve detection speed. Additionally, we integrate ODConv into the C3k2 module, constructing an enhanced C3k2_ODConv module that not only strengthens feature extraction but also reduces computational overhead.
(2)
In the head network, we introduce a 160 × 160 high-resolution detection head specifically for tiny objects while eliminating the 20 × 20 low-resolution head for large objects. This architectural modification significantly enhances small object detection capability in aerial imagery while reducing model parameters by 25%.
(3)
For the neck network, we propose IMFFNet, an innovative feature fusion network that incorporates three core modules: the C3k2_Faster module enhances network learning capacity and computational efficiency while reducing computational redundancy and memory access; the DySample module preserves richer informative features from high-resolution feature maps; and the CMFFM effectively integrates multi-scale features to improve complex scene understanding and multi-scale object detection performance. This architecture significantly boosts detection performance and adaptability to targets of various sizes.
(4)
We construct TPDNV, a novel small-object benchmark dataset comprising 120,822 annotated instances, where the majority are small and densely distributed targets. This dataset establishes a challenging benchmark for evaluating small-object detection models in open waters, with particular emphasis on high-density small target recognition scenarios.

2. Related Work

2.1. Object Detection

Object detection aims to identify all objects of interest in an image and determine their categories and locations. In recent years, the advancement of deep learning technologies has enabled object detection to adaptively extract image features and locate objects through an end-to-end learning framework. Currently, deep learning-based object detection algorithms can be broadly categorized into two types: two-stage and single-stage detection algorithms.
Two-stage object detection algorithms operate through a sequential process: first, the algorithm generates a set of candidate regions likely containing objects, then classifies these regions and refines their positions via bounding box regression. Representative algorithms include R-CNN [10], Fast R-CNN [11], Faster R-CNN [12], Mask R-CNN [13], and Cascade R-CNN [14]. Among them, Faster R-CNN is the most widely adopted—it first employs a Region Proposal Network (RPN) to generate candidate regions, then maps them to fixed-size feature maps via ROI Pooling for final classification and regression. Compared to single-stage detectors, two-stage methods achieve higher accuracy at the cost of slower inference speed, making them less suitable for real-time applications. In contrast, single-stage detectors are faster with relatively minor accuracy trade-offs [15], thus being more favorable for real-time detection.
Single-stage object detection algorithms directly predict target categories and bounding boxes from input images in a single forward pass, significantly optimizing computational efficiency and reducing processing time. Among these, the YOLO (You Only Look Once) series stands out as a representative approach, achieving an excellent balance between inference speed and detection accuracy while demonstrating superior performance in small object detection. The initial YOLOv1 [16] prioritized computational speed and global contextual understanding but sacrificed some detection accuracy. In 2016, YOLOv2 [17] substantially improved accuracy while maintaining fast inference. YOLOv3 [18] introduced Darknet-53 as its backbone, adopted a feature pyramid network structure for multi-scale detection, and replaced softmax with logistic regression, enhancing accuracy without compromising real-time performance. YOLOv4 [19] further advanced the framework with its Bag of Freebies and Bag of Specials optimization strategies, coupled with the more efficient CSPDarknet53 backbone, boosting both speed and precision. YOLOv5 introduced adaptive anchor box learning for improved detection efficiency. YOLOv6 [20] incorporated EfficientRep for a streamlined architecture. YOLOv8 replaced YOLOv5’s C3 blocks with gradient-rich C2f modules. YOLOv9 [21] introduced Programmable Gradient Information (PGI) and Generalized Efficient Layer Aggregation Network (GELAN). Finally, YOLO11 (a derivative of YOLOv8) replaced C2f with C3k2, integrated the novel C2PSA layer into the backbone, and incorporated depthwise separable convolutions into the classification head.
Single-stage detectors are frequently employed in small object detection tasks for Unmanned Aerial Vehicles (UAVs) due to their high inference efficiency. However, they typically fall short of the detection accuracy achieved by two-stage detectors. Furthermore, the inherently small size and limited number of pixels of small objects result in weak feature representation, which further exacerbates the difficulty of detection. Current research trends in small object detection are primarily focused on addressing these core challenges by enhancing feature extraction capabilities, optimizing feature fusion, incorporating attention mechanisms, and exploring advanced learning strategies [22]. In the realm of feature fusion, Jiang et al. [23] proposed a bidirectional dense feature pyramid network (BDFPN), which facilitates efficient multi-scale information fusion by expanding the feature pyramid architecture and incorporating skip connections. Conversely, Ni et al. [24] designed a scale compensation feature pyramid network (SCFPN) that integrates spatial details from shallow layers with semantic features from deep layers, thereby strengthening the network’s feature representation capability. Despite these advancements, a significant limitation remains in the comprehensive and efficient utilization of shallow, multi-scale features. To address this challenge, this paper aims to introduce a novel feature fusion module designed to fully exploit and integrate multi-scale features from the shallow layers. By incorporating an attention mechanism, the proposed module adaptively selects the most effective feature sources and identifies the most critical spatial locations at each scale, thereby achieving more efficient feature utilization compared to traditional concatenation methods.

2.2. Detection of Small Targets in Open Water

In recent years, deep learning-based methods for small object detection in open water scenarios have achieved remarkable progress, with representative models such as [2,25,26] being proposed. However, existing approaches still exhibit certain limitations. Most studies focus on improving detection accuracy, often at the considerable expense of inference speed (frames per second, FPS) and model size. For instance, Xu et al. [27] proposed the YoloOW model, which adapts to multi-scale spatial features to significantly enhance detection performance in unmanned aerial vehicle (UAV) imagery for open-water search and rescue missions. Their method outperformed traditional two-stage detectors and ranked near the top of the SeaDronesSee leaderboard. Meanwhile, Ma et al. [28] employed cross-scale feature fusion and enhanced feature representation for small objects to effectively address the challenges posed by the diversity of UAV images in open waters, achieving notable improvements in detection accuracy.
Concurrently, some research efforts have been directed toward model lightweighting and real-time optimization, though often at the cost of reduced detection accuracy. For example, Tang et al. [29] introduced LiteFlex-YOLO, which incorporates lightweight modules to reduce model complexity. Zhao et al. [30] proposed a modular plug-and-play optimization strategy based on the YOLOv8 framework, substantially enhancing the real-time performance of object detection in maritime rescue operations using UAVs. Their approach achieved a detection speed more than 20 times faster than the two-stage detector provided in the official SeaDronesSee implementation.
Although the aforementioned studies have made significant improvements in specific aspects, they generally struggle to achieve a satisfactory balance among detection accuracy, model lightweightness, and real-time performance. This limitation is particularly unfavorable for UAV platforms, which typically have limited storage resources and require both high responsiveness and high precision. To address this issue, this study aims to develop an object detection model that simultaneously maintains high accuracy, lightweight design, and real-time capability. By introducing lightweight modules, pruning redundant structures to reduce model size, fully leveraging multi-scale features to enhance detection capability, and optimizing computational workflows to minimize redundant computations and memory access costs, our approach seeks to ensure robust detection performance while meeting practical deployment requirements.

2.3. Applications of SPPF Module in Object Detection

The SPPF (Spatial Pyramid Pooling Fast) module in YOLO11 is inherited from YOLOv8. The processing pipeline begins with a 1 × 1 convolutional layer that reduces the input channel dimension $C_1$ to half its size, $C_1/2$. The feature map then sequentially passes through three max-pooling layers. Subsequently, the outputs from these pooling layers are concatenated channel-wise with the original pre-pooling feature map, expanding the channel dimension to $4 \times C_1/2$. Finally, a 1 × 1 convolution adjusts the channel dimension to the desired output size. Positioned between the backbone and neck networks, the SPPF module effectively integrates multi-scale feature representations, significantly enhancing the model’s detection performance for both small and large objects while improving robustness to targets of varying sizes.
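For reference, the following is a minimal PyTorch sketch of this SPPF pipeline. It omits the BatchNorm and SiLU layers that wrap each convolution in the actual YOLO implementations, and the 5 × 5 pooling kernel follows the standard SPPF default.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Minimal SPPF sketch: 1x1 reduce -> three chained 5x5 max-pools -> concat -> 1x1 fuse."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2                            # channel dimension halved
        self.cv1 = nn.Conv2d(c_in, c_hidden, 1)         # 1x1 reduction
        self.cv2 = nn.Conv2d(c_hidden * 4, c_out, 1)    # 1x1 fusion after concatenation
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        y3 = self.pool(y2)
        # channel-wise concatenation of the pre-pooling map and the three pooled maps
        return self.cv2(torch.cat((x, y1, y2, y3), dim=1))
```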
With the advancement of deep learning, researchers have begun to explore improvements to the SPPF module, yielding various innovative variants. Xu et al. [31] proposed the GSPPF module, which employs progressively increasing convolutional kernel sizes to capture broader contextual information across different scales. This module incorporates a gating mechanism to selectively integrate contextual information, thereby enhancing scale awareness and context sensitivity. Fan et al. [32] introduced the Efficient-SPPF module by integrating dilated convolutions into the SPPF structure, effectively expanding the network’s receptive field. Zhao et al. [33] enhanced the SPPF module by incorporating the LSKA attention mechanism [34], which reduces interference from complex backgrounds by capturing long-range dependencies and adaptability. Gu et al. [35] further improved the module by adding a global average pooling branch to strengthen spatial information representation, while equipping each branch with ELA modules [36] to diversify feature representations and boost object recognition capability.

2.4. Applications of Feature Fusion in Object Detection

In object detection, images processed by deep convolutional neural networks generate feature maps at different levels. Shallow-level feature maps retain higher spatial resolution, preserving richer localization and detailed information which benefits object positioning. However, these early-stage features contain weaker semantic representations and more noise due to limited convolutional processing. Conversely, deep-level feature maps exhibit stronger semantic characteristics that facilitate object classification, but suffer from reduced spatial resolution and poorer sensitivity to fine-grained details.
To leverage the advantages of feature maps at different levels, researchers have incorporated feature fusion methods into object detection. Lin et al. [37] proposed the Feature Pyramid Network (FPN), which employs a bottom-up and top-down architecture to enhance features by fusing multi-scale representations. Building upon FPN, Liu et al. [38] introduced the Path Aggregation Network (PANet), which augments higher-level feature layers with precise localization information from lower-level features through an additional bottom-up path. Lian et al. [39] presented a nonlinear fusion module called AFFB that better captures contextual information from various network layers by merging semantically and scale-inconsistent features. Sun et al. [40] designed a rich-scale feature interactive fusion module, which utilizes dilated convolutions and additive operations to refine features and generate multi-receptive-field representations. These features are progressively aggregated through interactive operations, mitigating resolution discrepancies across scales while efficiently consolidating contextual information.

3. Proposed Method

3.1. Overview of YOLO11

Ultralytics YOLO11 represents a cutting-edge SOTA model that builds upon the success of previous YOLO series while introducing novel improvements to further enhance performance and flexibility. Firstly, YOLO11 replaces the original C2f module with an innovative C3k2 module. The C3k2 module dynamically switches between C3k and Bottleneck architectures through its built-in c3k parameter, achieving both faster data processing and more comprehensive feature extraction compared to the C2f module. Secondly, a new C2PSA module is incorporated at the terminus of the backbone network. This module is an extension of the C2f module and incorporates the PSABlock module to enhance feature extraction and attention mechanisms. Finally, YOLO11 introduces depthwise separable convolution into the classification branch of its detection head. The overall architecture of YOLO11 is illustrated in Figure 1.
YOLO11 offers five model variants (n, s, m, l, x) of varying sizes to accommodate diverse application scenarios. Considering the constrained storage and computational resources available on UAVs for object detection tasks over open water, we selected YOLO11n as the baseline model. Despite being the smallest in the YOLO11 series, it delivers remarkably competitive accuracy.

3.2. MFEF-YOLO

In aerial imagery over open water, objects for detection are typically small and often occluded by water bodies. Additionally, these images are frequently affected by sunlight reflections on water surfaces. These factors pose significant challenges to object detection models. To address these issues, we propose MFEF-YOLO, an improved small object detection model based on YOLO11, whose architecture is illustrated in Figure 2. The model consists of three components: backbone, neck, and head. In the backbone, we replace the SPPF module with our proposed DBSPPF module and substitute C3k2 modules with C3k2_ODConv modules to enhance feature extraction capability. For the head, we introduce a high-resolution tiny object detection head while removing one low-resolution large object detection head, thereby improving small object detection and reducing model parameters. In the neck, we replace the last three C3k2 modules with C3k2_Faster modules and upgrade the Upsample module to DySample. Furthermore, we integrate our proposed CMFFM into the neck network to effectively fuse multi-scale feature maps. These neck modifications collectively form our novel IMFFNet.

3.3. Dual-Branch Spatial Pyramid Pooling Fast

The SPPF module expands the receptive field and extracts multi-scale features through successive max-pooling operations. However, this approach may lead to the loss of critical information for small objects, thereby compromising detection performance. Additionally, the simple channel-wise concatenation of pooled features often results in insufficient information fusion. To address these limitations, we propose the DBSPPF module, as illustrated in Figure 3. This module adopts a dual-branch architecture, augmenting the SPPF structure with an additional multi-scale feature interaction and fusion branch. This design enables more effective utilization of multi-scale features while minimizing information loss. By interactively fusing multi-scale features extracted via max-pooling, DBSPPF enhances feature representation, allowing the model to better detect objects of varying sizes. Furthermore, we introduce multiple skip connections in the DBSPPF, effectively preserving feature flow. This allows the network to capture useful information at shallower levels while reducing information loss in deeper layers, thereby enhancing the recognition capability for small targets. Finally, the incorporation of depthwise separable convolutions strikes an optimal balance, reducing computational cost while maintaining effective feature extraction, which in turn increases the overall inference speed.
Next, we detail the complete operational pipeline of DBSPPF. The input feature map $X$ first undergoes a 1 × 1 convolution to halve its channel dimension, producing two branched feature maps $T$ (top) and $B$ (bottom). In the bottom branch, $B$ is processed through three consecutive max-pooling operations, generating feature maps $B_1$, $B_2$, and $B_3$ sequentially. These pooled features ($B_1$, $B_2$, $B_3$) are then concatenated with the original $B$ along the channel dimension, followed by a 1 × 1 convolution to reconcile channel dimensions, yielding the final bottom-branch output $B_4$. This process can be formally expressed as:

$$B = \mathrm{CBS}_{1\times1}^{C_1 \to C_1/2}(X)$$
$$B_1 = \mathrm{Max}(B)$$
$$B_2 = \mathrm{Max}(B_1)$$
$$B_3 = \mathrm{Max}(B_2)$$
$$B_4 = \mathrm{CBS}_{1\times1}^{2C_1 \to C_2}\big(\mathrm{Cat}(B_1, B_2, B_3, B)\big)$$

where $\mathrm{CBS}_{1\times1}^{m \to n}$ denotes a sequential operation comprising: (1) a 1 × 1 convolution adjusting the channel dimension from $m$ to $n$, (2) batch normalization, and (3) a SiLU activation. $\mathrm{Max}$ represents max-pooling, while $\mathrm{Cat}$ indicates channel-wise concatenation.
In the top branch, feature map $T$ first undergoes element-wise addition with $B_1$ to produce $T_1$. Subsequently, $T_1$ is combined with $B_2$ and the depthwise separable convolution-processed $T_1$ through another element-wise addition operation, yielding $T_2$. This computational pattern is repeated to obtain $T_3$. $T_3$ is then processed through depthwise separable convolution and multiplied with the original $T$ via element-wise multiplication. The resulting tensor passes through a 1 × 1 convolution for channel adjustment, producing $T_4$. Finally, the outputs from both branches ($T_4$ and $B_4$) are concatenated channel-wise and processed through a final 1 × 1 convolution to reconcile channel dimensions, generating the output feature map $X'$. The complete DBSPPF computational procedure can be formally expressed as:

$$T = \mathrm{CBS}_{1\times1}^{C_1 \to C_1/2}(X)$$
$$T_1 = \mathrm{Add}(T, B_1)$$
$$T_2 = \mathrm{Add}\big(T_1, B_2, \mathrm{DSC}_{3\times3}(T_1)\big)$$
$$T_3 = \mathrm{Add}\big(T_2, B_3, \mathrm{DSC}_{3\times3}(T_2)\big)$$
$$T_4 = \mathrm{CBS}_{1\times1}^{C_1/2 \to C_2}\big(\mathrm{Mul}(\mathrm{DSC}_{3\times3}(T_3), T)\big)$$
$$X' = \mathrm{CBS}_{1\times1}^{2C_2 \to C_2}\big(\mathrm{Cat}(T_4, B_4)\big)$$

where $\mathrm{DSC}_{3\times3}$ denotes a 3 × 3 depthwise separable convolution operation, $\mathrm{Add}$ represents element-wise addition, and $\mathrm{Mul}$ indicates element-wise multiplication.
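The following is a minimal PyTorch sketch of DBSPPF assembled from the equations above. It is a sketch under stated assumptions rather than the authors' released implementation: the 5 × 5 max-pooling kernel is assumed (matching SPPF), and since the text leaves open whether $T$ and $B$ share one 1 × 1 convolution or use two separate ones, two separate convolutions are used here.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1):
    """Conv + BatchNorm + SiLU, matching the CBS notation above."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

def dsc(c, k=3):
    """3x3 depthwise separable convolution (depthwise then pointwise), channels preserved."""
    return nn.Sequential(nn.Conv2d(c, c, k, padding=k // 2, groups=c, bias=False),
                         nn.Conv2d(c, c, 1, bias=False),
                         nn.BatchNorm2d(c), nn.SiLU())

class DBSPPF(nn.Module):
    def __init__(self, c1, c2, k=5):           # k = 5 pooling kernel is an assumption (as in SPPF)
        super().__init__()
        c_ = c1 // 2
        self.cv_t = cbs(c1, c_)                 # top-branch 1x1 reduction   -> T
        self.cv_b = cbs(c1, c_)                 # bottom-branch 1x1 reduction -> B
        self.pool = nn.MaxPool2d(k, stride=1, padding=k // 2)
        self.cv_b4 = cbs(4 * c_, c2)            # fuse Cat(B1, B2, B3, B)    -> B4
        self.dsc1, self.dsc2, self.dsc3 = dsc(c_), dsc(c_), dsc(c_)
        self.cv_t4 = cbs(c_, c2)                # channel adjustment          -> T4
        self.cv_out = cbs(2 * c2, c2)           # final fusion of Cat(T4, B4) -> X'

    def forward(self, x):
        t, b = self.cv_t(x), self.cv_b(x)
        b1 = self.pool(b)
        b2 = self.pool(b1)
        b3 = self.pool(b2)
        b4 = self.cv_b4(torch.cat((b1, b2, b3, b), dim=1))
        t1 = t + b1                             # T1 = Add(T, B1)
        t2 = t1 + b2 + self.dsc1(t1)            # T2 = Add(T1, B2, DSC(T1))
        t3 = t2 + b3 + self.dsc2(t2)            # T3 = Add(T2, B3, DSC(T2))
        t4 = self.cv_t4(self.dsc3(t3) * t)      # T4 = CBS(Mul(DSC(T3), T))
        return self.cv_out(torch.cat((t4, b4), dim=1))
```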

3.4. C3k2_ODConv

To further enhance the model’s feature extraction capability, we integrate the C3k2 module with ODConv [41] to form a new C3k2_ODConv module. The introduction of ODConv enables C3k2 to dynamically adapt based on target characteristics. ODConv employs a multi-dimensional attention mechanism that computes attention along four dimensions of the convolutional kernel space, allowing fine-grained dynamic adjustments in spatial size, kernel count, input channels, and output channels. This enhances adaptability to input feature variations and improves feature extraction performance. The structure of ODConv is illustrated in Figure 4.
The ODConv module first compresses the input $X$ through a Global Average Pooling (GAP) layer, then maps the resulting feature vector to a low-dimensional space via a Fully Connected (FC) layer, followed by a ReLU activation function to enhance nonlinear representation. Finally, it constructs four parallel head branches (each consisting of an FC layer followed by either a Sigmoid or Softmax function) whose attentions are used to compute the final output $Y$. The computational formulation of ODConv is as follows:

$$Y = \big(\alpha_{w1} \odot \alpha_{f1} \odot \alpha_{c1} \odot \alpha_{s1} \odot W_1 + \cdots + \alpha_{wn} \odot \alpha_{fn} \odot \alpha_{cn} \odot \alpha_{sn} \odot W_n\big) * X$$

where $\alpha_{wi}$ denotes the attention scalar for convolutional kernel $W_i$, while $\alpha_{fi}$, $\alpha_{ci}$, and $\alpha_{si}$ represent three newly introduced attention scalars corresponding to the output channel dimension, input channel dimension, and spatial dimension, respectively. The symbol $\odot$ indicates the multiplication operation along the corresponding dimension of the kernel space, and $*$ denotes the convolution operation.
The C3k2_ODConv module incorporates ODConv into the C3k2 architecture to achieve more accurate feature extraction, thereby enhancing the network’s learning capacity and representational power. Utilizing a multi-dimensional attention mechanism, ODConv effectively reduces computational costs while maintaining robust feature extraction performance. By integrating ODConv within the C3k2 module, the model achieves both improved computational efficiency and enhanced feature extraction capability, with its detailed structure illustrated in Figure 5.
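To illustrate the mechanism, the following is a simplified, self-contained sketch of an ODConv-style layer (GAP → FC → ReLU → four attention heads, followed by attention-weighted aggregation of candidate kernels). The number of candidate kernels, the reduction ratio, and the grouped-convolution batching trick are implementation assumptions; refer to the official ODConv code [41] for the exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ODConv2dSketch(nn.Module):
    """Simplified omni-dimensional dynamic convolution: four attentions weight n candidate kernels."""
    def __init__(self, c_in, c_out, k=3, n_kernels=4, reduction=16):
        super().__init__()
        self.k, self.n = k, n_kernels
        c_mid = max(c_in // reduction, 8)
        self.gap = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(nn.Linear(c_in, c_mid), nn.ReLU(inplace=True))
        # four parallel heads: spatial, input-channel, output-channel, and kernel-wise attention
        self.att_s = nn.Linear(c_mid, k * k)        # sigmoid
        self.att_c = nn.Linear(c_mid, c_in)         # sigmoid
        self.att_f = nn.Linear(c_mid, c_out)        # sigmoid
        self.att_w = nn.Linear(c_mid, n_kernels)    # softmax over candidate kernels
        self.weight = nn.Parameter(torch.randn(n_kernels, c_out, c_in, k, k) * 0.02)

    def forward(self, x):
        b, c_in, h, w = x.shape
        z = self.fc(self.gap(x).flatten(1))                                   # B x c_mid
        a_s = torch.sigmoid(self.att_s(z)).view(b, 1, 1, 1, self.k, self.k)
        a_c = torch.sigmoid(self.att_c(z)).view(b, 1, 1, c_in, 1, 1)
        a_f = torch.sigmoid(self.att_f(z)).view(b, 1, -1, 1, 1, 1)
        a_w = torch.softmax(self.att_w(z), dim=1).view(b, self.n, 1, 1, 1, 1)
        # weighted sum of candidate kernels, modulated along all four dimensions
        w_dyn = (a_w * a_f * a_c * a_s * self.weight.unsqueeze(0)).sum(dim=1)  # B x c_out x c_in x k x k
        # grouped-convolution trick: fold the batch into groups so each sample uses its own kernel
        out = F.conv2d(x.reshape(1, b * c_in, h, w),
                       w_dyn.reshape(-1, c_in, self.k, self.k),
                       padding=self.k // 2, groups=b)
        return out.view(b, -1, h, w)
```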

3.5. P2 Detection Head

To address the scale variation of objects in real-world scenarios, YOLO11 incorporates three detection heads of different sizes to handle objects at varying scales. These heads process feature maps at 1/8, 1/16, and 1/32 of the original image resolution, dedicated to detecting small, medium, and large objects, respectively. As illustrated in Figure 6, the distribution of object width and height ratios relative to images in the SeaDronesSee dataset reveals that the majority of objects occupy a very small portion of the image. This highlights the significant prevalence of small (including tiny) objects in the dataset.
In YOLO11, the detection head for small objects receives an input feature map that is downsampled by a factor of 8 compared to the original image. This aggressive downsampling causes small objects in the input to shrink proportionally, while successive downsampling operations throughout the network progressively erode spatial information, degrading small object representation and detection performance. To preserve finer details, we introduce a higher-resolution P2 detection head specifically designed to capture small objects missed by other heads. As illustrated in Figure 2, some inputs to the P2 detection head come from lower-level feature maps, which have a resolution twice that of the input feature maps for the small object detection head. Additionally, this path involves fewer convolutional and pooling layers, which reduces spatial information loss and preserves more details of small objects, ultimately aiding the model in accurately locating them.
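To make this resolution argument concrete, the short calculation below shows how many feature-map cells a 16 × 16-pixel object (an illustrative size, not a dataset statistic) occupies at each head's stride for a 640 × 640 input.

```python
# Feature-map footprint of a small object at each detection-head stride (640x640 input).
input_size, obj_px = 640, 16          # illustrative 16x16-pixel target
for head, stride in [("P2", 4), ("P3", 8), ("P4", 16), ("P5", 32)]:
    fmap = input_size // stride
    footprint = obj_px / stride
    print(f"{head}: {fmap}x{fmap} feature map, object covers ~{footprint:.1f}x{footprint:.1f} cells")
# P2: 160x160, ~4x4 cells; P3: 80x80, ~2x2; P4: 40x40, ~1x1; P5: 20x20, ~0.5x0.5
```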
While the addition of the P2 detection head improves model performance, the increased number of detection heads inevitably introduces additional parameters and computational overhead. Our subsequent ablation studies revealed that the large-object detection head actually constrained model performance, and its removal led to improved detection accuracy. Consequently, we retain only the small and medium-object detection heads from the original YOLO11 architecture, along with the newly added P2 head. This architectural modification reduces the model’s parameter count by approximately 28%. Ultimately, by incorporating the high-resolution P2 head while eliminating the low-resolution large-object detection head, we achieve dual benefits: enhanced UAV image recognition capability and a more lightweight model architecture.

3.6. Island-Based Multi-Scale Feature Fusion Network

FPN (Feature Pyramid Network) is a deep learning architecture widely used in object detection and segmentation. By constructing a feature pyramid, it effectively leverages multi-scale features to enhance the model’s capability in detecting objects at various scales. FPN merges information across different scales through a top-down and bottom-up structure (highlighted in the blue box in Figure 7a). Acknowledging the limitations of unidirectional information flow, PANet introduces an additional bottom-up pathway (shown in the red box in Figure 7a) to enhance the network’s feature aggregation capabilities. However, PANet fails to fully leverage the information between deep and shallow layers during feature fusion, primarily focusing on local feature integration while neglecting global contextual information. This limitation restricts the model’s understanding of complex scenes. To address this issue, this paper presents an island-based multi-scale feature fusion network, as illustrated in Figure 7b.
IMFFNet is an improvement based on the PANet architecture that incorporates an island-like structure to facilitate direct interaction between shallow detail information and deep semantic information, effectively reducing attenuation during the transmission process. Additionally, the island structure effectively integrates multi-scale features from the shallow layers, capturing richer spatial and texture details compared to single feature transmission. Furthermore, by performing multiple fusions of features at different scales, IMFFNet can acquire abundant global contextual information, enhancing the model’s ability to understand complex scenes and adapt to targets of various sizes, thereby improving detection accuracy.
Next, we will elaborate on the working principles of IMFFNet. Initially, the input image is processed through a backbone network to extract multi-scale feature representations denoted as $C_i$ ($i \in \{1, 2, 3, 4, 5\}$), where each $C_i$ corresponds to a feature map at $1/2^i$ scale of the original input. These multi-resolution feature maps $C_i$ are then fed into the neck network for feature fusion with bottom-up information flow. Notably, $C_1$, $C_2$, and $C_3$ are first processed by the Coordinate-guided Multi-scale Feature Fusion Module (CMFFM), also referred to as the island structure, to achieve cross-scale feature integration. The fused results are subsequently combined with the information flow. Through these fusion operations, we obtain a set of enhanced feature maps $N_i$ ($i \in \{2, 3, 4, 5\}$). These enhanced feature maps are further refined through top-down information flow fusion, producing a new set of optimized feature maps $P_i$ ($i \in \{2, 3, 4\}$). The detailed computational procedure of the island structure is as follows:
$$M_2 = \mathrm{CMFFM}(C_1, C_2, C_3)$$
$$N_2 = E(M_2, N_3)$$
$$M_3 = \mathrm{CMFFM}(P_2, N_3, N_4)$$
$$P_3 = E(M_3, P_2)$$
$$M_4 = \mathrm{CMFFM}(N_3, N_4, N_5)$$
$$P_4 = E(M_4, P_3)$$

where $M_i$ represents the fused feature maps, while $E$ denotes the feature extraction operation.
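The schematic below restates this island flow as Python pseudocode. It is only a wiring sketch under assumptions: the construction of the bottom-up maps $N_3$–$N_5$, the downsampling inside $E(\cdot)$, and the step that turns $N_2$ into the P2-level output are not spelled out in the equations and are therefore passed in or simplified here.

```python
def imffnet_island_flow(c1, c2, c3, n3, n4, n5, cmffm, extract):
    """Schematic island-fusion wiring following the equations above.

    cmffm(a, b, c): Coordinate-guided Multi-scale Feature Fusion Module (the "island").
    extract(a, b):  the feature extraction operation E fusing an island output with the flow.
    n3, n4, n5:     bottom-up enhanced maps; how they are produced from C3-C5 is omitted here.
    """
    m2 = cmffm(c1, c2, c3)   # island over the shallow maps C1, C2, C3
    n2 = extract(m2, n3)     # N2 = E(M2, N3)
    p2 = n2                  # assumed: the P2-level output is taken from N2
    m3 = cmffm(p2, n3, n4)   # M3 = CMFFM(P2, N3, N4)
    p3 = extract(m3, p2)     # P3 = E(M3, P2)
    m4 = cmffm(n3, n4, n5)   # M4 = CMFFM(N3, N4, N5)
    p4 = extract(m4, p3)     # P4 = E(M4, P3)
    return p2, p3, p4
```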
The proposed IMFFNet architecture comprises three core components: C3k2_Faster, DySample, and CMFFM. We now present a detailed description of each module.
(1)
C3k2_Faster: To enhance detection speed and reduce model complexity, we incorporate the FasterNet Block into the C3k2 module. The FasterNet Block originates from FasterNet [42], where researchers introduced partial convolution (PConv) that selectively performs standard convolution on specific input channels while preserving others to exploit feature map redundancy. Notably, PConv achieves efficient feature extraction by minimizing redundant computations and memory access. The FasterNet Block architecture consists of one PConv followed by two 1 × 1 convolutions, characterized by its simple structure and reduced parameters. While YOLO11’s C3k2 module employs Bottleneck blocks for performance enhancement, this approach significantly increases parameters and computational overhead. We therefore propose replacing the Bottleneck blocks in C3k2 with FasterNet Blocks, creating the novel C3k2_Faster module (architecture shown in Figure 8). Compared to the original C3k2, our module enhances learning capacity while reducing computational redundancy and memory access, achieving superior computational efficiency. Subsequent ablation studies demonstrate that C3k2_Faster effectively reduces computational load while improving both detection speed and accuracy.
(2)
DySample: In the YOLO11 architecture, the upsampling operation employs the nearest neighbor interpolation method. This approach amplifies the image by replicating the nearest pixel values, without considering the relationships between image content or features. As a result, it often leads to discontinuous pixel transitions, manifesting as jagged edges or blurred effects. Moreover, in drone-based aerial object detection scenarios, the targets in images are typically small, meaning each pixel’s information can significantly impact detection accuracy. To address this, we propose replacing the Upsample module in the neck network with DySample [43], a dynamic upsampling method based on point sampling, to enhance the upsampled features. The DySample module performs content-aware point sampling for upsampling, dynamically adjusting sampling locations according to the input features’ content. This enables DySample to better adapt to variations in image details and structures, thereby improving information retention in high-resolution feature maps and boosting object detection performance. Figure 9 illustrates the structure of the DySample module. The specific workflow is as follows: Given an input feature map $X$ of size $H \times W \times C$ and an upsampling scale factor $s$, a sampling point generator creates a sampling set of size $sH \times sW \times 2$. The grid_sample function then resamples $X$ based on the sampling set, producing an upsampled feature map $X'$ of size $sH \times sW \times C$, thereby completing dynamic upsampling.
(3)
Coordinate-guided Multi-scale Feature Fusion Module: In IMFFNet, to facilitate information exchange between spatial features of different scales and hierarchical semantics, we propose a Coordinate-guided Multi-scale Feature Fusion Module (CMFFM), as illustrated in Figure 10 (a minimal code sketch is provided after the figure). The CMFFM consists of two components: feature alignment and feature fusion. For feature alignment, we align the dimensions of three input feature maps with varying scales. Considering that larger feature maps provide more low-level spatial details while smaller ones contain richer high-level semantic information, we choose to align them to the intermediate-scale feature map to balance both aspects. Specifically, we first downsample the large-scale feature map $M_l$ by a factor of 2 using a combination of average pooling and max pooling, yielding $M_{lm}$. Then, we upsample the small-scale feature map $M_s$ by a factor of 2 via nearest neighbor interpolation, producing $M_{sm}$. This ensures that both $M_l$ and $M_s$ are spatially aligned with the medium-scale feature map $M_m$. Subsequently, $M_{lm}$, $M_m$, and $M_{sm}$ are concatenated along the channel dimension, followed by a 1 × 1 convolution to adjust the channel dimension to match $M_m$, thereby reducing computational complexity for subsequent operations. The mathematical formulation of the feature alignment process is as follows:
$$M_{lm} = \mathrm{Add}\big(\mathrm{Avg}(M_l), \mathrm{Max}(M_l)\big)$$
$$M_{sm} = \mathrm{Ne}(M_s)$$
$$M_r = \mathrm{CBS}_{1\times1}^{C_l + C_m + C_s \to C_m}\big(\mathrm{Cat}(M_{lm}, M_m, M_{sm})\big)$$

where $M_r$ denotes the aligned feature map. The notation $\mathrm{CBS}_{1\times1}^{m \to n}$ represents a sequence of operations: first a 1 × 1 convolution that adjusts the channel dimension from $m$ to $n$, followed by batch normalization and a SiLU activation. $\mathrm{Cat}$ indicates channel-wise concatenation, $\mathrm{Add}$ represents element-wise addition, $\mathrm{Avg}$ stands for average pooling, $\mathrm{Max}$ denotes max pooling, and $\mathrm{Ne}$ refers to nearest neighbor interpolation.
In the feature fusion stage, we first employ 3 × 3 depthwise separable convolution followed by pointwise convolution to perform deep feature extraction on the input feature map M r , enhancing feature correlation. Simultaneously, M r is fed into a coordinate attention mechanism, which integrates both channel and spatial information to enrich contextual representation and improve feature discriminability. The extracted features are then channel-wise concatenated with M r to preserve original feature information, thereby strengthening fusion effectiveness and boosting model expressiveness and flexibility. Finally, a 1 × 1 convolution is applied to restore the channel dimension.
$$M_r' = \mathrm{CBS}_{1\times1}^{C_m/2 \to C_m}\big(\mathrm{DSC}_{3\times3}^{C_m \to C_m/2}(M_r)\big)$$
$$M_r'' = \mathrm{CA}(M_r)$$
$$\mathrm{CMFFM}_O = \mathrm{CBS}_{1\times1}^{3C_m \to C_m}\big(\mathrm{Cat}(M_r, M_r', M_r'')\big)$$

where $\mathrm{DSC}_{3\times3}^{m \to n}$ denotes a 3 × 3 depthwise separable convolution that transforms the channel dimension from $m$ to $n$. $\mathrm{CA}$ represents the coordinate attention mechanism. $M_r'$ indicates the output after the depthwise separable and pointwise convolutions, while $M_r''$ denotes the output processed by the coordinate attention mechanism. $\mathrm{CMFFM}_O$ refers to the final output of the Coordinate-guided Multi-scale Feature Fusion Module.
Figure 10. Architecture diagram of CMFFM.
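As referenced above, the following is a minimal PyTorch sketch of CMFFM assembled from the alignment and fusion equations. The coordinate attention block is a generic implementation sketch rather than the authors' exact module, and the channel reduction ratio inside it, as well as the use of SiLU throughout, are assumptions.

```python
import torch
import torch.nn as nn

def cbs(c_in, c_out, k=1):
    """Conv + BatchNorm + SiLU (the CBS notation above)."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class CoordAtt(nn.Module):
    """Generic coordinate attention sketch: factorized H/W pooling, shared 1x1 transform, two gates."""
    def __init__(self, c, reduction=32):
        super().__init__()
        c_mid = max(c // reduction, 8)
        self.conv1 = cbs(c, c_mid)
        self.conv_h = nn.Conv2d(c_mid, c, 1)
        self.conv_w = nn.Conv2d(c_mid, c, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                        # B x C x H x 1
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)    # B x C x W x 1
        y = self.conv1(torch.cat((x_h, x_w), dim=2))             # B x c_mid x (H+W) x 1
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                    # B x C x H x 1
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # B x C x 1 x W
        return x * a_h * a_w

class CMFFM(nn.Module):
    """Sketch of CMFFM: align large/medium/small maps to the medium scale, then fuse."""
    def __init__(self, c_l, c_m, c_s):
        super().__init__()
        self.avg = nn.AvgPool2d(2)
        self.max = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.align = cbs(c_l + c_m + c_s, c_m)                   # 1x1 conv after concatenation -> M_r
        self.dsc = nn.Sequential(                                # 3x3 depthwise + pointwise, C_m -> C_m/2
            nn.Conv2d(c_m, c_m, 3, padding=1, groups=c_m, bias=False),
            nn.Conv2d(c_m, c_m // 2, 1, bias=False),
            nn.BatchNorm2d(c_m // 2), nn.SiLU())
        self.restore = cbs(c_m // 2, c_m)                        # 1x1 conv back to C_m -> M_r'
        self.ca = CoordAtt(c_m)                                  # coordinate attention -> M_r''
        self.fuse = cbs(3 * c_m, c_m)                            # final 1x1 fusion

    def forward(self, m_l, m_m, m_s):
        m_lm = self.avg(m_l) + self.max(m_l)                     # downsample the large map by 2 (Avg + Max)
        m_sm = self.up(m_s)                                      # upsample the small map by 2 (nearest)
        m_r = self.align(torch.cat((m_lm, m_m, m_sm), dim=1))
        m_r1 = self.restore(self.dsc(m_r))                       # deep feature extraction branch
        m_r2 = self.ca(m_r)                                      # coordinate-attention branch
        return self.fuse(torch.cat((m_r, m_r1, m_r2), dim=1))
```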

4. Experimental Results

4.1. Experimental Environment and Parameter Settings

In the experimental setup, the model was trained on hardware consisting of an NVIDIA GeForce RTX 4070 GPU (NVIDIA Corporation, Santa Clara, CA, USA) and an Intel(R) Core(TM) i5-13490F CPU (Intel Corporation, Santa Clara, CA, USA). The software environment included Windows 10 OS, Python 3.8, PyTorch 2.4.1, and CUDA 12.4, with specific versions detailed in Table 1. The training hyperparameters were configured as follows: SGD optimizer, 200 epochs, batch size of 8, and input image resolution of 640 × 640 pixels (complete settings shown in Table 2). All other parameters remained at their default values unless otherwise specified.
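For reproducibility, the sketch below shows how such a run could be launched with the Ultralytics API using the hyperparameters listed above. The dataset YAML path is a placeholder, and this is not the authors' released training script; parameters not listed in Tables 1 and 2 are left at their defaults.

```python
from ultralytics import YOLO

# Train the baseline with the hyperparameters stated above.
# "seadronessee.yaml" is a placeholder dataset config, not a file shipped with Ultralytics;
# a custom model YAML would be passed instead of the pretrained weights to build MFEF-YOLO.
model = YOLO("yolo11n.pt")
model.train(
    data="seadronessee.yaml",   # dataset configuration (placeholder path)
    epochs=200,
    batch=8,
    imgsz=640,
    optimizer="SGD",
    device=0,                   # single RTX 4070 GPU
)
```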

4.2. Dataset Description

(1)
SeaDronesSee: SeaDronesSee [44] is a large-scale benchmark dataset for maritime search-and-rescue object detection, designed to advance UAV-based search-and-rescue system development in marine environments. The dataset comprises 14,227 images captured by multiple cameras mounted on various UAVs, featuring varying altitudes, viewing angles, and time conditions. It contains critical maritime objects including vessels, swimmers, and rescue equipment, with resolutions up to 5456 × 3632 pixels. The annotations are categorized into six classes: ignore, swimmer, boat, jetski, life_saving_appliances, and buoy. The dataset is partitioned into training (8930 images), validation (1547 images), and test sets (3750 images).
(2)
TPDNV: To validate MFEF-YOLO’s detection capability for small and tiny objects, we constructed a specialized benchmark dataset named TPDNV. This dataset is curated from three established datasets: TinyPerson [45], DIOR [46], and NWPU VHR-10 [47,48,49]. We specifically selected sea_person and earth_person categories from TinyPerson, along with ship category from both DIOR and NWPU VHR-10, forming three unified classes in TPDNV: sea_person, earth_person, and ship. Through rigorous filtering, we excluded all larger targets from the original selections. The refined dataset was further augmented via downscaling, rotation, flipping, and cropping operations, resulting in 2368 images containing 120,822 annotated instances. The final dataset is partitioned into training (1664 images), validation (232 images), and test sets (472 images).

4.3. Evaluation Metrics

For comprehensive performance evaluation, we adopt Precision, Recall, mAP50, mAP50:95, and mAPS (mean average precision for small objects) as evaluation metrics, with particular emphasis on mAP50, mAP50:95, and mAPS as the primary indicators. Model efficiency is quantified using three key indicators: FPS (Frames Per Second), Parameters, and FLOPs (Floating Point Operations). FPS indicates the number of frames processed per second; a higher FPS corresponds to faster detection. Parameters denote the number of trainable parameters, which is proportional to memory usage; fewer parameters reduce memory overhead. FLOPs measure computational complexity; higher FLOPs imply greater model complexity and increased demand for computational resources.
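The two efficiency indicators reported here can be measured with a short utility such as the sketch below (plain PyTorch; the function names are illustrative, and FLOPs counting is left to an external profiler).

```python
import time
import torch

def count_parameters(model):
    """Number of trainable parameters, reported in millions."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

@torch.no_grad()
def measure_fps(model, imgsz=640, warmup=10, iters=100, device="cuda"):
    """Average frames per second for batch-size-1 inference at the given input size."""
    model = model.to(device).eval()
    x = torch.randn(1, 3, imgsz, imgsz, device=device)
    for _ in range(warmup):                 # warm-up passes excluded from timing
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        model(x)
    if device.startswith("cuda"):
        torch.cuda.synchronize()
    return iters / (time.perf_counter() - t0)
```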

4.4. Ablation Experiments

In this section, drawing on the SeaDronesSee dataset, we conduct ablation studies on MFEF-YOLO using YOLO11n as the baseline model to evaluate the impact of each component. The evaluation metrics include Precision (P), Recall (R), mAP50, mAP50:95, mAPS, Parameters (Params), FLOPs, and FPS. Table 3 presents the ablation results for four key improvements: (A) DBSPPF, (B) C3k2_ODConv, (C) P2 detection head, and (D) IMFFNet.
(1)
DBSPPF: As indicated for module A in Table 3, replacing SPPF with DBSPPF in YOLO11n improves R, mAP50, mAP50:95, and mAPS despite a slight decrease in P. Notably, R and mAP50 increase by 0.023 and 0.014, respectively. The mAPS improvement indicates that DBSPPF’s dual-branch structure enhances small object detection. Additionally, DBSPPF improves FPS, demonstrating its efficiency in accelerating inference. Figure 11 visualizes feature extraction before and after DBSPPF integration, where brighter red regions indicate higher model attention. The results show more concentrated and distinct hotspots around target areas with DBSPPF, confirming its effectiveness in improving localization accuracy and focus.
(2)
C3k2_ODConv: As detailed in Table 3 (A + B approach), replacing C3k2 with C3k2_ODConv in the backbone of YOLO11n improves P, mAP50, and mAP50:95 by 0.015, 0.01, and 0.001, respectively. The incorporation of ODConv not only enhances the feature extraction capability of C3k2 but also reduces the computational cost of the model, decreasing from 6.5 G to 6.1 G.
(3)
P2 Detection Head: In this section, we first compare and analyze different detection head combinations before conducting comparisons with other modules. The neck of YOLO11 contains three detection layers corresponding to P3, P4, and P5 heads. We evaluated three distinct head configurations, with the results presented in Table 4. Comparing the P2, P3, P4, P5 set with the P3, P4, P5 set in the table, the incorporation of the P2 detection head significantly enhances overall model accuracy, improving R by 0.054, mAP50 by 0.044, mAP50:95 by 0.004, and mAPS by 0.084. This improvement stems from the resolution characteristics: the feature map at the P3 head has dimensions of 80 × 80 pixels ( 8 × downsampled relative to the input image), where small targets (objects smaller than 32 × 32 pixels in the COCO dataset) would occupy less than 4 × 4 pixels, leading to feature loss during detection. In contrast, the P2 head’s 160 × 160 feature map provides higher spatial resolution, preserving richer positional and detailed information that facilitates small object localization and boosts detection accuracy.
While the addition of the P2 detection head improves detection accuracy, it also increases model parameters and computational costs. To address this, we performed the experiment using the P2, P3, and P4 configurations from Table 4. A comparison between the P2–P4 and P2–P5 configurations demonstrates that removing the P5 detection head significantly reduces model parameters while improving P, R, mAP50, and mAP50:95. This occurs because the P5 head primarily detects large objects—its feature maps undergo excessive downsampling, causing severe loss of detail information for medium and small targets. This compromises the model’s learning performance on such objects. By removing the P5 head, the remaining heads can focus more effectively on medium and small targets, thereby enhancing detection accuracy. However, since YOLO11n is an extremely lightweight model with limited parameters and computational capacity, relying solely on the P2 head makes it difficult to accurately detect all small objects. In fact, the deep features and contextual information provided by the P5 head also contribute to small object detection, which explains why removing the P5 detection head leads to a drop in the mAPS metric.
Among the three configurations evaluated in Table 4, the third combination (P2 + P3 + P4 heads) demonstrates superior performance. Consequently, we adopt this optimal architecture for subsequent experiments.
As detailed in Table 3 (A + B + C approach), the proposed modification of adding a P2 detection head while removing the P5 head leads to consistent improvements across all evaluation metrics: P (+0.002), R (+0.062), mAP50 (+0.05), mAP50:95 (+0.01), and mAPS (+0.071). Meanwhile, the model parameters are reduced from 2.8 M to 2.1 M, achieving a 25% reduction (0.7 M). This architectural optimization not only enhances detection performance, particularly for small objects, but also achieves an optimal trade-off between accuracy and model compactness.
(4)
IMFFNet: In this section, we first conduct ablation studies on the components within the Island-based Multi-scale Feature Fusion Network (IMFFNet), followed by comparative analysis with other modules. The IMFFNet architecture primarily consists of three key components: C3k2_Faster, DySample, and CMFFM. Building upon the experimental configuration A + B + C from Table 3, we systematically evaluate these modules through ablation experiments, with results presented in Table 5. The symbol ‘✓’ beneath each module name in the table indicates its activation status in the corresponding experiment.
In the experiment shown in Row 3 of Table 5, we integrated FasterNet Block with C3k2 to form C3k2_Faster as a replacement for the original C3k2 module. This modification achieved significant improvements across all metrics without increasing the parameter count. Specifically, we observed increases of 0.017 in P, 0.007 in R, 0.011 in mAP50, 0.012 in mAP50:95, and 0.014 in mAPS. Additionally, the computational cost was reduced by 0.2 GFLOPs while achieving a 3.13 FPS improvement. These results demonstrate that FasterNet Block effectively enhances C3k2’s feature extraction capability (particularly for small objects), reduces computational complexity, and improves detection speed.
In the experiment presented in Row 4 of Table 5, we replaced the Upsample module with DySample. Although this led to slight decreases in P and mAPS, DySample significantly improved R and mAP50 by 0.026 and 0.017, respectively, without increasing either the parameter count or computational cost. These results demonstrate that DySample can effectively enhance detection accuracy while maintaining model efficiency.
As shown in Row 5 of Table 5, the introduction of CMFFM significantly enhances the model’s detection capability. While R experiences a slight decrease, P, mAP50, mAP50:95, and mAPS improve by 0.026, 0.008, 0.013, and 0.031, respectively. The notable improvement in mAPS particularly demonstrates CMFFM’s effectiveness in small object detection. This enhancement stems from two key mechanisms: (1) multi-level feature fusion that mitigates feature discrepancy, and (2) coordinate-guided mechanism that highlights small object features. Visualization in Figure 12 reveals that CMFFM enables the model to focus more precisely on target regions while suppressing background interference. Although CMFFM introduces additional parameters and computational cost, leading to a slight reduction in inference speed, the significant accuracy gains justify this trade-off. Moreover, the maintained high FPS of 256.73 still adequately meets the real-time requirements for UAV applications.
As detailed in Table 3 (A + B + C + D approach), the integration of IMFFNet leads to comprehensive improvements across all accuracy metrics: P increases by 0.005, R by 0.031, mAP50 by 0.036, mAP50:95 by 0.025, and mAPS by 0.030. These consistent gains demonstrate IMFFNet’s effectiveness in multi-scale feature fusion, particularly for enhancing small object detection capabilities. Furthermore, the heatmap visualization in Figure 13 clearly shows that IMFFNet strengthens the model’s focus on target regions (highlighted by red boxes), indicating more precise feature localization.

4.5. Comparative Experiments

To further validate the small object detection performance of MFEF-YOLO, we conducted experiments using the SeaDronesSee and TPDNV datasets (the latter containing more densely packed objects), both of which include a significant number of small targets. The results were compared with various YOLO models and classical object detection algorithms.
(1)
SeaDronesSee: As demonstrated in Table 6, MFEF-YOLO achieves state-of-the-art detection accuracy when compared with 11 competing models. With comparable parameter counts, MFEF-YOLO exhibits superior performance in mAP50:95, surpassing YOLOv8n, YOLOv10n, YOLO11n, YOLO-Remote, and Improved-yolov8 by 0.043, 0.054, 0.042, 0.054, and 0.016, respectively. Remarkably, MFEF-YOLO outperforms YOLO11n with fewer parameters, demonstrating improvements of 0.11 in mAP50 and 0.042 in mAP50:95. Additionally, MFEF-YOLO maintains higher accuracy than YOLO11s while reducing both parameter size (by 4×) and computational cost (by 2×). The comprehensive results presented in Table 6 and Figure 14 reveal that MFEF-YOLO achieves an optimal balance between detection accuracy and inference efficiency. While maintaining high precision, its inference time remains competitive with YOLOv10n and YOLO11s, with only marginal increases of 1.1 ms and 1.3 ms compared to YOLOv8n and YOLO11n, respectively. These results collectively demonstrate that MFEF-YOLO establishes new benchmarks for small object detection in drone imagery applications.
(2)
TPDNV: Table 7 compares the detection performance of 12 different models, where MFEF-YOLO outperforms all others, achieving mAP50 and mAP50:95 scores of 0.474 and 0.228, respectively. Crucially, under identical training hyperparameters, MFEF-YOLO not only reduces the number of parameters but also improves detection accuracy compared to the baseline model. Moreover, while YOLOX, Subtle-YOLOv8, and MFEF-YOLO exhibit comparable mAP50 performance, MFEF-YOLO maintains a clear advantage in both parameter efficiency and computational cost—an essential feature for UAV applications with constrained storage and computing resources. As demonstrated in Figure 15, MFEF-YOLO achieves superior detection accuracy while keeping inference time highly competitive, differing by only 6.1 ms and 5.9 ms compared to YOLOv8n and YOLO11n, respectively.
In conclusion, the proposed MFEF-YOLO demonstrates superior performance over several classical and state-of-the-art object detection methods, particularly in detecting small and densely clustered targets. Furthermore, while maintaining high detection accuracy and real-time processing capability, MFEF-YOLO achieves remarkable efficiency with low parameter count and computational requirements, thereby providing an effective solution for UAV-based target search operations in open water scenarios.

4.6. Experimental Visualization and Analytical Evaluation

To provide an intuitive demonstration of MFEF-YOLO’s capabilities, we conducted comprehensive visualization experiments comparing its performance against six state-of-the-art models on both SeaDronesSee and TPDNV datasets.
Figure 16 presents a comprehensive comparison of detection results from seven state-of-the-art models using the SeaDronesSee dataset. In Figure 16a, MFEF-YOLO demonstrates superior performance by accurately detecting all six swimmers and one buoy, while competing models exhibit either missed detections or false positives. Notably, all models except MFEF-YOLO misclassified the vertically suspended swimmer (marked by the red bounding box in the lower-left region) as life-saving equipment—a challenging case that highlights our model’s exceptional small-object recognition capability. Figure 16b shows that both MFEF-YOLO and Improved-yolov8 achieve optimal detection accuracy, successfully identifying all targets (4 boats, 1 buoy, and 8 swimmers). For the particularly challenging case of two overlapping swimmers, only MFEF-YOLO, YOLOv8n and Improved-yolov8 maintain full detection capability, while other models fail to recognize the occluded instance. These experimental results conclusively demonstrate MFEF-YOLO’s outstanding performance in small object detection and its robustness in handling complex scenarios with object occlusions, thereby validating the effectiveness of our proposed architectural improvements.
Figure 17 and Figure 18 present a comprehensive comparison of seven state-of-the-art models using the TPDNV dataset. In Figure 17, YOLOv8n, YOLOv10n, YOLO11n, and YOLO-Remote exhibit significant missed detections, while Subtle-YOLOv8 and Improved-yolov8 show sporadic false positives. Remarkably, only MFEF-YOLO achieves perfect detection of all densely clustered small objects, demonstrating its exceptional localization accuracy and contextual comprehension capabilities. Figure 18 reveals that YOLO11n fails to detect the ship in the lower-left quadrant, while all other models (except MFEF-YOLO) show sporadic misses on the upper-left target. In contrast, MFEF-YOLO achieves comprehensive detection of all targets, further validating its state-of-the-art performance in dense small object detection scenarios.
In summary, MFEF-YOLO delivers state-of-the-art performance on both the SeaDronesSee and TPDNV datasets, exhibiting (1) precise object localization, (2) robust contextual understanding, and (3) superior detection of small and densely clustered objects.

4.7. Comparative Analysis of Different SPPF Module and Neck Module Enhancement Approaches

(1) Comparative Analysis of Enhanced SPPF Module Variants: To demonstrate the superiority of the proposed DBSPPF module, we conducted a comparative study using YOLO11n as the baseline, replacing the original SPPF module with five alternatives (DBSPPF, FocalModulation [61], SPPF-LSKA [33], Efficient-SPPF [32], and RFB [62]) while keeping all other network components unchanged. As shown in Table 8, SPPF-LSKA achieved the highest P of 0.860 and Efficient-SPPF the best R of 0.629, whereas DBSPPF outperformed all counterparts in both mAP50 (0.675) and mAP50:95 (0.399). For small objects (mAPS), RFB performed best (0.235). Overall, under comparable parameter counts and computational costs, DBSPPF delivers the best balance of accuracy and efficiency, leading on the key mAP50 and mAP50:95 metrics; a minimal structural sketch of the dual-branch pooling idea is provided after this list.
(2) Comparative Analysis of Enhanced Neck Network Architectures: To validate the effectiveness of the proposed IMFFNet, we conducted comparative experiments against state-of-the-art neck architectures, all implemented within the YOLO11n framework under identical configurations (P2, P3, and P4 detection heads) for a fair comparison. As shown in Table 9, IMFFNet leads in mAP50, mAP50:95, and mAPS, ranks second in both P and R, and maintains a relatively low parameter count and computational complexity. These results indicate that IMFFNet strikes an effective balance between detection accuracy and model efficiency for UAV-based object detection tasks.
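As referenced in item (1), the sketch below gives a concrete, minimal picture of what a dual-branch pooling block can look like in PyTorch: the cascaded 5 × 5 max-pooling of a standard SPPF is paired with a parallel average-pooling branch, and the two multi-scale summaries are fused by a 1 × 1 convolution. This is an assumed, illustrative structure only, not the paper's DBSPPF implementation (see Figure 3 for the actual design); the class name and channel choices are ours.

```python
import torch
import torch.nn as nn


class DualBranchSPPFSketch(nn.Module):
    """Illustrative dual-branch SPPF-style block (not the paper's DBSPPF)."""

    def __init__(self, c_in: int, c_out: int, k: int = 5):
        super().__init__()
        c_hidden = c_in // 2
        self.reduce = nn.Conv2d(c_in, c_hidden, kernel_size=1, bias=False)
        self.maxpool = nn.MaxPool2d(k, stride=1, padding=k // 2)  # SPPF-style branch
        self.avgpool = nn.AvgPool2d(k, stride=1, padding=k // 2)  # parallel smoothing branch
        # 1 identity + 3 max-pooled + 3 avg-pooled maps are concatenated before fusion.
        self.fuse = nn.Conv2d(c_hidden * 7, c_out, kernel_size=1, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.reduce(x)
        m1 = self.maxpool(x)
        m2 = self.maxpool(m1)
        m3 = self.maxpool(m2)
        a1 = self.avgpool(x)
        a2 = self.avgpool(a1)
        a3 = self.avgpool(a2)
        return self.fuse(torch.cat([x, m1, m2, m3, a1, a2, a3], dim=1))


# Example: a 256-channel 20 x 20 feature map keeps its spatial size.
# DualBranchSPPFSketch(256, 256)(torch.randn(1, 256, 20, 20)).shape -> (1, 256, 20, 20)
```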

5. Conclusions

To address the challenge of small object detection for drones in open water with complex backgrounds, we propose MFEF-YOLO, a lightweight yet efficient model based on YOLO11. First, we introduce DBSPPF, a plug-and-play module that significantly improves both accuracy and detection speed through a multi-scale feature interaction and fusion mechanism. Next, we integrate ODConv with the C3k2 module in the backbone network, achieving more precise feature extraction while reducing computational costs. The network architecture is further optimized by adding a P2 detection head and removing the P5 head, which enhances small object detection capability while decreasing model parameters. The core innovation lies in our novel IMFFNet architecture, comprising three key components: C3k2_Faster, DySample, and CMFFM. The pivotal CMFFM enables direct interaction between shallow-level details and deep semantic features, minimizing information loss during propagation. Moreover, CMFFM’s multi-scale feature fusion mechanism strengthens the model’s adaptability to complex scenes and various object sizes. Extensive experiments using the SeaDronesSee maritime search-and-rescue dataset demonstrate significant improvements over the baseline, with gains of 0.11 (mAP50), 0.042 (mAP50:95), and 0.091 (mAPS). On our newly constructed dense small object dataset TPDNV, MFEF-YOLO achieves superior performance with 0.474 (mAP50) and 0.228 (mAP50:95), outperforming existing comparative models. Comprehensive comparisons of DBSPPF and IMFFNet with other improvement methods consistently show our approach’s superior performance.
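To make the fusion idea above more tangible, the following minimal sketch shows one generic way shallow and deep features can interact under coordinate-wise gating: P2, P3, and P4 maps are projected to a common channel width, resampled to the finest resolution, concatenated, and then reweighted along the height and width axes. It is a hedged illustration of the general mechanism only, assuming nothing about the real CMFFM or IMFFNet internals; the class name and layer choices are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CoordGuidedFusionSketch(nn.Module):
    """Illustrative coordinate-guided multi-scale fusion (not the paper's CMFFM)."""

    def __init__(self, in_channels, c_out):
        super().__init__()
        # Project each input level (e.g., P2/P3/P4) to a shared channel width.
        self.proj = nn.ModuleList(
            [nn.Conv2d(c, c_out, kernel_size=1, bias=False) for c in in_channels]
        )
        self.mix = nn.Conv2d(c_out * len(in_channels), c_out, kernel_size=1, bias=False)
        self.gate = nn.Conv2d(c_out, c_out, kernel_size=1)  # shared gate for H/W descriptors

    def forward(self, feats):  # feats: [finest, ..., coarsest]
        size = feats[0].shape[-2:]
        fused = torch.cat(
            [F.interpolate(p(f), size=size, mode="nearest") for p, f in zip(self.proj, feats)],
            dim=1,
        )
        fused = self.mix(fused)
        # Coordinate-style attention: pool along width (keep H) and along height (keep W).
        gate_h = torch.sigmoid(self.gate(fused.mean(dim=3, keepdim=True)))  # (B, C, H, 1)
        gate_w = torch.sigmoid(self.gate(fused.mean(dim=2, keepdim=True)))  # (B, C, 1, W)
        return fused * gate_h * gate_w


# Example with hypothetical P2/P3/P4 channel widths:
# fuse = CoordGuidedFusionSketch((64, 128, 256), 128)
# out = fuse([torch.randn(1, 64, 160, 160),
#             torch.randn(1, 128, 80, 80),
#             torch.randn(1, 256, 40, 40)])  # out.shape -> (1, 128, 160, 160)
```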
Although MFEF-YOLO demonstrates promising performance, its practical deployment still faces challenges, including limited coverage of extreme imaging conditions in the training data and the storage and computational constraints of UAV platforms. To improve the model's adaptability, our future work will focus on collecting additional imagery captured under extreme weather and low-light conditions. We also plan to apply model compression techniques such as pruning and knowledge distillation to further reduce parameter count and computational demands, easing storage and runtime constraints on UAV platforms.
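As a pointer to what such compression might look like in practice, the snippet below combines standard PyTorch magnitude pruning with a generic response-based distillation loss. Both are textbook techniques shown on classification-style logits for brevity; they are not the compression pipeline planned for MFEF-YOLO, and applying them to a detection head would require additional adaptation.

```python
import torch
import torch.nn.functional as F
from torch.nn.utils import prune

# (a) L1 unstructured pruning of a convolution layer via the standard PyTorch API.
conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=1)
prune.l1_unstructured(conv, name="weight", amount=0.3)  # zero out the 30% smallest weights
prune.remove(conv, "weight")                            # make the pruning permanent


# (b) Generic response-based knowledge distillation loss (classification-style logits).
def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Blend hard-label cross-entropy with a softened teacher-student KL term."""
    hard = F.cross_entropy(student_logits, labels)
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    return alpha * hard + (1.0 - alpha) * soft
```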

Author Contributions

Conceptualization, Q.L. and H.Y.; methodology, Q.L.; software, Q.L.; validation, Q.L., T.G. and X.Y.; formal analysis, B.J. and P.Z.; investigation, Q.L.; resources, H.Y.; data curation, Q.L.; writing—original draft preparation, Q.L.; writing—review and editing, Q.L. and H.Y.; visualization, Q.L. and S.Z.; supervision, R.M. and P.Z.; project administration, P.Z.; funding acquisition, H.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Ministry of Natural Resources–Henan Province Cooperative Science and Technology Project (Grant Number Yu-Bu-Sheng He Zuo 2024ZRBSHZ027) and the School of Surveying and Land Information Engineering, Henan Polytechnic University (Grant Number GCCYJ202419).

Data Availability Statement

The SeaDronesSee dataset in this study is openly and freely available at https://www.kaggle.com/datasets/ubiratanfilho/sds-dataset (accessed on 10 January 2025). The TinyPerson dataset in this study is openly and freely available at https://github.com/sxy1122/TinyBenchmark (accessed on 10 January 2025). The DIOR dataset in this study is openly and freely available at https://aistudio.baidu.com/datasetdetail/15179 (accessed on 10 January 2025). The NWPU VHR-10 dataset in this study is openly and freely available at https://github.com/Gaoshuaikun/NWPU-VHR-10 (accessed on 10 January 2025). The code implementing the proposed method is publicly available at https://github.com/aa12ad/MFEF-YOLO (accessed on 1 December 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Yang, T.; Jiang, Z.; Sun, R.; Cheng, N.; Feng, H. Maritime Search and Rescue Based on Group Mobile Computing for Unmanned Aerial Vehicles and Unmanned Surface Vehicles. IEEE Trans. Ind. Inform. 2020, 16, 7700–7708. [Google Scholar] [CrossRef]
  2. Sun, C.; Zhang, Y.; Ma, S. DFLM-YOLO: A Lightweight YOLO Model with Multiscale Feature Fusion Capabilities for Open Water Aerial Imagery. Drones 2024, 8, 400. [Google Scholar] [CrossRef]
  3. Wu, M.; Yun, L.; Wang, Y.; Chen, Z.; Cheng, F. Detection algorithm for dense small objects in high altitude image. Digit. Signal Process. 2024, 146, 104390. [Google Scholar] [CrossRef]
  4. Tang, X.; Ruan, C.; Li, X.; Li, B.; Fu, C. MSC-YOLO: Improved YOLOv7 Based on Multi-Scale Spatial Context for Small Object Detection in UAV-View. Comput. Mater. Contin. 2024, 79, 983–1003. [Google Scholar] [CrossRef]
  5. Wang, J.; Li, X.; Chen, J.; Zhou, L.; Guo, L.; He, Z.; Zhou, H.; Zhang, Z. DPH-YOLOv8: Improved YOLOv8 Based on Double Prediction Heads for the UAV Image Object Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5647715. [Google Scholar] [CrossRef]
  6. Li, Y.; Yuan, H.; Wang, Y.; Xiao, C. GGT-YOLO: A Novel Object Detection Algorithm for Drone-Based Maritime Cruising. Drones 2022, 6, 335. [Google Scholar] [CrossRef]
  7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017), Proceedings of the 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; NIPS Foundation: La Jolla, CA, USA, 2017; pp. 6000–6010. [Google Scholar]
  8. Zeng, Y.; Guo, D.; He, W.; Zhang, T.; Liu, Z. ARF-YOLOv8: A novel real-time object detection model for UAV-captured images detection. J. Real-Time Image Process. 2024, 21, 107. [Google Scholar] [CrossRef]
  9. Liu, Y.; Yang, D.; Song, T.; Ye, Y.; Zhang, X. YOLO-SSP: An object detection model based on pyramid spatial attention and improved downsampling strategy for remote sensing images. Vis. Comput. 2025, 41, 1467–1484. [Google Scholar] [CrossRef]
  10. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  11. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  13. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  14. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving Into High Quality Object Detection. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  15. Zhang, Y.; Ye, M.; Zhu, G.; Liu, Y.; Guo, P.; Yan, J. FFCA-YOLO for Small Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5611215. [Google Scholar] [CrossRef]
  16. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  17. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  18. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  19. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  20. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
  21. Wang, C.-Y.; Yeh, I.H.; Mark Liao, H.-Y. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. In Computer Vision—ECCV 2024, Proceedings of the 8th European Conference, Milan, Italy, 29 September–4 October 2024; Springer Nature: Cham, Switzerland, 2025; pp. 1–21. [Google Scholar]
  22. Nikouei, M.; Baroutian, B.; Nabavi, S.; Taraghi, F.; Aghaei, A.; Sajedi, A.; Moghaddam, M.E. Small object detection: A comprehensive survey on challenges, techniques and real-world applications. Intell. Syst. Appl. 2025, 27, 200561. [Google Scholar] [CrossRef]
  23. Jiang, L.; Yuan, B.; Du, J.; Chen, B.; Xie, H.; Tian, J.; Yuan, Z. MFFSODNet: Multiscale Feature Fusion Small Object Detection Network for UAV Aerial Images. IEEE Trans. Instrum. Meas. 2024, 73, 5015214. [Google Scholar] [CrossRef]
  24. Ni, J.; Zhu, S.; Tang, G.; Ke, C.; Wang, T. A Small-Object Detection Model Based on Improved YOLOv8s for UAV Image Scenarios. Remote Sens. 2024, 16, 2465. [Google Scholar] [CrossRef]
  25. Tang, C.; Wang, H.; Liu, J. SPMF: A saliency-based pseudo-multimodality fusion model for data-scarce maritime targets detection. Ocean Eng. 2026, 343, 123023. [Google Scholar] [CrossRef]
  26. Jin, Z.; He, T.; Qiao, L.; Duan, J.; Shi, X.; Yan, B.; Guo, C. MES-YOLO: An efficient lightweight maritime search and rescue object detection algorithm with improved feature fusion pyramid network. J. Vis. Commun. Image Represent. 2025, 109, 104453. [Google Scholar] [CrossRef]
  27. Xu, J.; Fan, X.; Jian, H.; Xu, C.; Bei, W.; Ge, Q.; Zhao, T. YoloOW: A Spatial Scale Adaptive Real-Time Object Detection Neural Network for Open Water Search and Rescue From UAV Aerial Imagery. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5623115. [Google Scholar] [CrossRef]
  28. Ma, S.; Zhang, Y.; Peng, L.; Sun, C.; Ding, B.; Zhu, Y. OWRT-DETR: A Novel Real-Time Transformer Network for Small-Object Detection in Open-Water Search and Rescue From UAV Aerial Imagery. IEEE Trans. Geosci. Remote Sens. 2025, 63, 4205313. [Google Scholar] [CrossRef]
  29. Tang, P.; Zhang, Y. LiteFlex-YOLO:A lightweight small target detection network for maritime unmanned aerial vehicles. Pervasive Mob. Comput. 2025, 111, 102064. [Google Scholar] [CrossRef]
  30. Zhao, B.; Zhou, Y.; Song, R.; Yu, L.; Zhang, X.; Liu, J. Modular YOLOv8 optimization for real-time UAV maritime rescue object detection. Sci. Rep. 2024, 14, 24492. [Google Scholar] [CrossRef] [PubMed]
  31. Xu, S.; Song, L.; Yin, J.; Chen, Q.; Zhan, T.; Huang, W. MFFCI–YOLOv8: A Lightweight Remote Sensing Object Detection Network Based on Multiscale Features Fusion and Context Information. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2024, 17, 19743–19755. [Google Scholar] [CrossRef]
  32. Fan, K.; Li, Q.; Li, Q.; Zhong, G.; Chu, Y.; Le, Z.; Xu, Y.; Li, J. YOLO-Remote: An Object Detection Algorithm for Remote Sensing Targets. IEEE Access 2024, 12, 155654–155665. [Google Scholar] [CrossRef]
  33. Zhao, L.; Liang, G.; Hu, Y.; Xi, Y.; Ning, F.; He, Z. YOLO-RLDW: An Algorithm for Object Detection in Aerial Images Under Complex Backgrounds. IEEE Access 2024, 12, 128677–128693. [Google Scholar] [CrossRef]
  34. Lau, K.W.; Po, L.-M.; Rehman, Y.A.U. Large Separable Kernel Attention: Rethinking the Large Kernel Attention design in CNN. Expert Syst. Appl. 2024, 236, 121352. [Google Scholar] [CrossRef]
  35. Gu, C.; Miao, X.; Zuo, C. TFDNet: A triple focus diffusion network for object detection in urban congestion with accurate multi-scale feature fusion and real-time capability. J. King Saud Univ.—Comput. Inf. Sci. 2024, 36, 102223. [Google Scholar] [CrossRef]
  36. Xu, W.; Wan, Y. ELA: Efficient Local Attention for Deep Convolutional Neural Networks. J. Real-Time Image Process. 2024, 22, 140. [Google Scholar] [CrossRef]
  37. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  38. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  39. Lian, J.; Yin, Y.; Li, L.; Wang, Z.; Zhou, Y. Small Object Detection in Traffic Scenes Based on Attention Feature Fusion. Sensors 2021, 21, 3031. [Google Scholar] [CrossRef]
  40. Sun, F.; Cui, J.; Yuan, X.; Zhao, C. Rich-scale feature fusion network for salient object detection. IET Image Process. 2023, 17, 794–806. [Google Scholar] [CrossRef]
  41. Li, C.; Zhou, A.; Yao, A.J.A. Omni-Dimensional Dynamic Convolution. arXiv 2022, arXiv:2209.07947. [Google Scholar] [CrossRef]
  42. Chen, J.; Kao, S.h.; He, H.; Zhuo, W.; Wen, S.; Lee, C.H.; Chan, S.H.G. Run, Don’t Walk: Chasing Higher FLOPS for Faster Neural Networks. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 12021–12031. [Google Scholar]
  43. Liu, W.; Lu, H.; Fu, H.; Cao, Z. Learning to Upsample by Learning to Sample. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 6004–6014. [Google Scholar]
  44. Varga, L.A.; Kiefer, B.; Messmer, M.; Zell, A. SeaDronesSee: A Maritime Benchmark for Detecting Humans in Open Water. In Proceedings of the 2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Waikoloa, HI, USA, 3–8 January 2022; pp. 3686–3696. [Google Scholar]
  45. Yu, X.; Gong, Y.; Jiang, N.; Ye, Q.; Han, Z. Scale Match for Tiny Person Detection. In Proceedings of the 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), Snowmass, CO, USA, 1–5 March 2020; pp. 1246–1254. [Google Scholar]
  46. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  47. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-class geospatial object detection and geographic image classification based on collection of part detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
  48. Cheng, G.; Han, J. A survey on object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar] [CrossRef]
  49. Cheng, G.; Zhou, P.; Han, J. Learning Rotation-Invariant Convolutional Neural Networks for Object Detection in VHR Optical Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2016, 54, 7405–7415. [Google Scholar] [CrossRef]
  50. Pang, J.; Chen, K.; Shi, J.; Feng, H.; Ouyang, W.; Lin, D. Libra R-CNN: Towards Balanced Learning for Object Detection. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 821–830. [Google Scholar]
  51. Cao, Y.; Xu, J.; Lin, S.; Wei, F.; Hu, H. GCNet: Non-Local Networks Meet Squeeze-Excitation Networks and Beyond. In Proceedings of the 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW), Seoul, Republic of Korea, 27–28 October 2019; pp. 1971–1980. [Google Scholar]
  52. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep High-Resolution Representation Learning for Human Pose Estimation. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 5686–5696. [Google Scholar]
  53. Gao, S.-H.; Cheng, M.-M.; Zhao, K.; Zhang, X.-Y.; Yang, M.-H.; Torr, P. Res2Net: A New Multi-Scale Backbone Architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 43, 652–662. [Google Scholar] [CrossRef] [PubMed]
  54. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G.J.A. YOLOv10: Real-Time End-to-End Object Detection. In Advances in Neural Information Processing Systems 37 (NeurIPS 2024), Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024), Vancouver, BC, Canada, 10–15 December 2024; NIPS Foundation: La Jolla, CA, USA, 2024. [Google Scholar]
  55. Ning, T.; Wu, W.; Zhang, J. Small object detection based on YOLOv8 in UAV perspective. Pattern Anal. Appl. 2024, 27, 103. [Google Scholar] [CrossRef]
  56. Li, B.; Liu, Y.; Wang, X. Gradient Harmonized Single-stage Detector. Proc. AAAI Conf. Artif. Intell. 2018, 33, 8577–8584. [Google Scholar] [CrossRef]
  57. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. TOOD: Task-aligned One-stage Object Detection. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 10–17 October 2021; pp. 3490–3499. [Google Scholar]
  58. Zhang, H.; Wang, Y.; Dayoub, F.; Sünderhauf, N. VarifocalNet: An IoU-aware Dense Object Detector. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 8510–8519. [Google Scholar]
  59. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J.J.A. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  60. Zhao, S.; Chen, J.; Ma, L. Subtle-YOLOv8: A detection algorithm for tiny and complex targets in UAV aerial imagery. Signal Image Video Process. 2024, 18, 8949–8964. [Google Scholar] [CrossRef]
  61. Lu, P.; Jia, Y.S.; Zeng, W.X.; Wei, P. CDF-YOLOv8: City Recognition System Based on Improved YOLOv8. IEEE Access 2024, 12, 143745–143753. [Google Scholar] [CrossRef]
  62. Jia, W.; Li, C. SLR-YOLO: An improved YOLOv8 network for real-time sign language recognition. J. Intell. Fuzzy Syst. 2024, 46, 1663–1680. [Google Scholar] [CrossRef]
  63. Chen, Z.; He, Z.; Lu, Z.-M. DEA-Net: Single Image Dehazing Based on Detail-Enhanced Convolution and Content-Guided Attention. IEEE Trans. Image Process. 2023, 33, 1002–1015. [Google Scholar] [CrossRef] [PubMed]
  64. Li, K.; Geng, Q.; Wan, M.; Cao, X.; Zhou, Z. Context and Spatial Feature Calibration for Real-Time Semantic Segmentation. IEEE Trans. Image Process. 2023, 32, 5465–5477. [Google Scholar] [CrossRef] [PubMed]
  65. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. J. Real-Time Image Process. 2024, 21, 62. [Google Scholar] [CrossRef]
  66. Zhang, J.; Huang, W.; Zhuang, J.; Zhang, R.; Du, X. Detection Technique Tailored for Small Targets on Water Surfaces in Unmanned Vessel Scenarios. J. Mar. Sci. Eng. 2024, 12, 379. [Google Scholar] [CrossRef]
Figure 1. Architectural diagram of YOLO11.
Figure 2. Architectural diagram of MFEF-YOLO.
Figure 3. Architectural diagram of DBSPPF.
Figure 4. Architecture diagram of ODConv.
Figure 5. Architecture diagram of C3k2_ODConv.
Figure 6. Normalized distribution of object width and height ratios relative to image dimensions in SeaDronesSee.
Figure 7. Network architectures: (a) PANet and (b) the proposed IMFFNet.
Figure 8. Architecture diagram of C3k2_Faster.
Figure 9. Architecture diagram of DySample.
Figure 11. Feature extraction comparison before and after DBSPPF integration. (a) Scene a: original image and corresponding heatmaps. (b) Scene b: original image and corresponding heatmaps.
Figure 12. Feature extraction comparison before and after CMFFM integration. (a) Scene a: original image and corresponding heatmaps. (b) Scene b: original image and corresponding heatmaps.
Figure 13. Feature extraction comparison before and after IMFFNet integration. (a) Scene a: original image and corresponding heatmaps. (b) Scene b: original image and corresponding heatmaps.
Figure 14. Scatter plot of MFEF-YOLO (red star) with other models using the SeaDronesSee dataset.
Figure 15. Scatter plot of MFEF-YOLO (red star) with other models using the TPDNV dataset.
Figure 16. Detection results visualization on SeaDronesSee dataset: blue bounding boxes indicate ‘swimmer’, navy blue represents ‘life_saving_appliances’, pink denotes ‘buoy’, and white marks ‘boat’. Missed detections are annotated with yellow dashed circles, while false positives are highlighted with pink dashed circles. (a) Open water scenario under low-illumination conditions; (b) Open water scenario under sufficient-illumination conditions.
Figure 17. Visualization of detection results for dense small-object scenarios in the TPDNV dataset: white bounding boxes identify ‘ship’ targets, red boxes indicate missed detections, and orange boxes mark false positives.
Figure 18. Visualization of detection results for dense and blurred small-object scenarios in the TPDNV dataset: white bounding boxes identify ‘ship’ targets and red boxes indicate missed detections.
Table 1. Software and hardware versions used in the experiment.

| Experimental Device | Version |
|---|---|
| Operating System | Windows 10 |
| GPU | NVIDIA GeForce RTX 4070 |
| CPU | Intel(R) Core(TM) i5-13490F |
| Python | 3.8 |
| PyTorch | 2.4.1 |
| CUDA | 12.4 |
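The short check below, an optional convenience rather than part of the experimental pipeline, prints the runtime versions so they can be matched against Table 1.

```python
import sys
import torch

# Print the runtime environment for comparison with Table 1.
print("Python :", sys.version.split()[0])   # expected: 3.8.x
print("PyTorch:", torch.__version__)        # expected: 2.4.1
print("CUDA   :", torch.version.cuda)       # expected: 12.4
if torch.cuda.is_available():
    print("GPU    :", torch.cuda.get_device_name(0))  # expected: NVIDIA GeForce RTX 4070
```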
Table 2. Hyperparameter settings.

| Parameter | Setup |
|---|---|
| Epochs | 200 |
| Patience | 0 |
| Batch size | 8 |
| Image size | 640 × 640 |
| Workers | 8 |
| Optimizer | SGD |
| Amp | False |
| Initial learning rate | 0.01 |
| Final learning rate | 0.0001 |
| Momentum | 0.937 |
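For orientation, the call below shows how the Table 2 settings would map onto an Ultralytics-style training invocation. The weights file and dataset YAML are placeholders and the paper's actual entry point may differ; note that in this API lrf is the final learning rate expressed as a fraction of lr0, so a final rate of 0.0001 with lr0 = 0.01 corresponds to lrf = 0.01.

```python
# Hypothetical Ultralytics-style training call mirroring the hyperparameters in Table 2.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")      # placeholder weights; MFEF-YOLO would use its own model YAML
model.train(
    data="seadronessee.yaml",   # placeholder dataset configuration
    epochs=200,
    patience=0,
    batch=8,
    imgsz=640,
    workers=8,
    optimizer="SGD",
    amp=False,
    lr0=0.01,
    lrf=0.01,                   # final learning rate = lr0 * lrf = 0.0001
    momentum=0.937,
)
```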
Table 3. Ablation study of MFEF-YOLO using the SeaDronesSee dataset.

| Algorithms | P | R | mAP50 | mAP50:95 | mAPS | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|---|---|---|
| YOLO11n | 0.842 | 0.605 | 0.661 | 0.393 | 0.226 | 2.6 | 6.3 | 364.28 |
| A | 0.814 | 0.628 | 0.675 | 0.399 | 0.229 | 2.8 | 6.5 | 367.56 |
| A + B | 0.829 | 0.625 | 0.685 | 0.400 | 0.216 | 2.8 | 6.1 | 336.43 |
| A + B + C | 0.831 | 0.687 | 0.735 | 0.410 | 0.287 | 2.1 | 9.4 | 301.62 |
| A + B + C + D (Our) | 0.836 | 0.718 | 0.771 | 0.435 | 0.317 | 2.3 | 11.7 | 256.73 |
Table 4. Experimental results of different detection head combinations.

| Detection Head | P | R | mAP50 | mAP50:95 | mAPS | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|---|---|---|
| P3, P4, P5 | 0.829 | 0.625 | 0.685 | 0.400 | 0.216 | 2.8 | 6.1 | 336.43 |
| P2, P3, P4, P5 | 0.808 | 0.679 | 0.729 | 0.404 | 0.300 | 2.9 | 9.9 | 295.02 |
| P2, P3, P4 | 0.831 | 0.687 | 0.735 | 0.410 | 0.287 | 2.1 | 9.4 | 301.62 |
Table 5. Ablation experiments of C3k2_Faster, DySample and CMFFM in IMFFNet.

| C3k2_Faster | DySample | CMFFM | P | R | mAP50 | mAP50:95 | mAPS | Params (M) | FLOPs (G) | FPS |
|---|---|---|---|---|---|---|---|---|---|---|
|  |  |  | 0.831 | 0.687 | 0.735 | 0.410 | 0.287 | 2.1 | 9.4 | 301.62 |
|  |  |  | 0.848 | 0.694 | 0.746 | 0.422 | 0.301 | 2.1 | 9.2 | 304.75 |
|  |  |  | 0.810 | 0.720 | 0.763 | 0.422 | 0.286 | 2.1 | 9.2 | 294.59 |
|  |  |  | 0.836 | 0.718 | 0.771 | 0.435 | 0.317 | 2.3 | 11.7 | 256.73 |
Table 6. Comparative experiments of MFEF-YOLO on the SeaDronesSee test set.

| Methods | mAP50 | mAP50:95 | Params (M) | FLOPs (G) | Inference Time (ms/img) | FPS |
|---|---|---|---|---|---|---|
| Libra R-CNN [50] | 0.572 | 0.373 | 41.6 | 217.0 | 40.7 | 23.8 |
| Faster R-CNN [12] | 0.567 | 0.361 | 41.4 | 207.0 | 38.5 | 25.1 |
| GCNet [51] | 0.595 | 0.383 | 46.5 | 260.0 | 128.2 | 15.5 |
| HRNet [52] | 0.571 | 0.369 | 46.9 | 286.0 | 55.2 | 17.8 |
| Res2Net [53] | 0.587 | 0.370 | 61.0 | 294.0 | 54.1 | 18.2 |
| YOLOv8n | 0.663 | 0.392 | 3.0 | 8.1 | 1.9 | 333.3 |
| YOLOv10n [54] | 0.643 | 0.381 | 2.7 | 8.2 | 2.3 | 384.6 |
| YOLO11n | 0.661 | 0.393 | 2.6 | 6.3 | 1.7 | 364.3 |
| YOLO11s | 0.699 | 0.429 | 9.4 | 21.3 | 2.3 | 331.6 |
| YOLO-Remote [32] | 0.642 | 0.381 | 3.1 | 8.3 | 1.9 | 344.8 |
| Improved-yolov8 [55] | 0.740 | 0.419 | 1.5 | 16.7 | 4.1 | 196.1 |
| MFEF-YOLO | 0.771 | 0.435 | 2.3 | 11.7 | 3.0 | 256.7 |
Table 7. Comparative experiments of MFEF-YOLO using the TPDNV test set.

| Methods | mAP50 | mAP50:95 | Params (M) | FLOPs (G) | Inference Time (ms/img) | FPS |
|---|---|---|---|---|---|---|
| Faster R-CNN | 0.283 | 0.137 | 41.4 | 164.0 | 33.4 | 29.1 |
| GHM [56] | 0.261 | 0.118 | 55.4 | 219.0 | 38.2 | 25.6 |
| TOOD [57] | 0.374 | 0.180 | 32.0 | 154.0 | 39.7 | 24.9 |
| VarifocalNet [58] | 0.357 | 0.168 | 32.7 | 147.0 | 36.8 | 26.6 |
| YOLOX [59] | 0.473 | 0.172 | 8.9 | 13.3 | 8.2 | 104.3 |
| YOLOv8n | 0.448 | 0.223 | 3.0 | 8.1 | 5.7 | 98.0 |
| YOLOv10n | 0.428 | 0.216 | 2.7 | 8.2 | 7.4 | 114.9 |
| YOLO11n | 0.444 | 0.225 | 2.6 | 6.3 | 5.9 | 114.4 |
| YOLO-Remote | 0.435 | 0.218 | 3.1 | 8.3 | 8.0 | 106.4 |
| Improved-yolov8 | 0.466 | 0.217 | 1.5 | 16.7 | 9.3 | 64.5 |
| Subtle-YOLOv8 [60] | 0.472 | 0.216 | 11.8 | 72.8 | 15.5 | 55.2 |
| MFEF-YOLO | 0.474 | 0.228 | 2.3 | 11.7 | 11.8 | 72.7 |
Table 8. Comparative analysis of SPPF enhancement methods on the SeaDronesSee test set.

| Method | P | R | mAP50 | mAP50:95 | mAPS | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|
| YOLO11n + DBSPPF | 0.814 | 0.628 | 0.675 | 0.399 | 0.229 | 2.8 | 6.5 |
| YOLO11n + FocalModulation | 0.787 | 0.625 | 0.658 | 0.395 | 0.216 | 2.7 | 6.4 |
| YOLO11n + SPPF-LSKA | 0.860 | 0.614 | 0.661 | 0.398 | 0.228 | 2.9 | 6.5 |
| YOLO11n + Efficient-SPPF | 0.776 | 0.629 | 0.653 | 0.391 | 0.222 | 2.6 | 6.4 |
| YOLO11n + RFB | 0.791 | 0.615 | 0.659 | 0.394 | 0.235 | 2.7 | 6.5 |
Table 9. Comparative performance evaluation of neck network architectures using the SeaDronesSee test set.

| Method | P | R | mAP50 | mAP50:95 | mAPS | Params (M) | FLOPs (G) |
|---|---|---|---|---|---|---|---|
| YOLO11n + IMFFNet | 0.837 | 0.707 | 0.761 | 0.432 | 0.321 | 2.1 | 11.9 |
| YOLO11n + CGAFusion [63] | 0.835 | 0.698 | 0.745 | 0.423 | 0.301 | 2.2 | 14.3 |
| YOLO11n + CSFCN [64] | 0.824 | 0.712 | 0.749 | 0.427 | 0.309 | 2.0 | 10.8 |
| YOLO11n + slimneck [65] | 0.814 | 0.692 | 0.723 | 0.404 | 0.291 | 2.0 | 9.2 |
| YOLO11n + [66] | 0.861 | 0.702 | 0.742 | 0.424 | 0.319 | 4.4 | 29.3 |