Article

Robust Small-Object Detection in Aerial Surveillance via Integrated Multi-Scale Probabilistic Framework

1 School of Air Traffic Management, Civil Aviation Flight University of China, Deyang 618307, China
2 Institute for Infocomm Research at the Agency for Science, Technology and Research, Singapore 138632, Singapore
3 Aviation Studies Institute, Singapore University of Technology and Design, Singapore 487372, Singapore
* Authors to whom correspondence should be addressed.
Mathematics 2025, 13(14), 2303; https://doi.org/10.3390/math13142303
Submission received: 11 June 2025 / Revised: 14 July 2025 / Accepted: 16 July 2025 / Published: 18 July 2025

Abstract

Accurate and efficient object detection is essential for aerial airport surveillance, playing a critical role in aviation safety and the advancement of autonomous operations. Although recent deep learning approaches have achieved notable progress, significant challenges persist, including severe object occlusion, extreme scale variation, dense panoramic clutter, and the detection of very small targets. In this study, we introduce a novel and unified detection framework designed to address these issues comprehensively. Our method integrates a Normalized Gaussian Wasserstein Distance loss for precise probabilistic bounding box regression, Dilation-wise Residual modules for improved multi-scale feature extraction, a Hierarchical Screening Feature Pyramid Network for effective hierarchical feature fusion, and DualConv modules for lightweight yet robust feature representation. Extensive experiments conducted on two public airport surveillance datasets, ASS1 and ASS2, demonstrate that our approach yields substantial improvements in detection accuracy. Specifically, the proposed method achieves an improvement of up to 14.6 percentage points in mean Average Precision (mAP@0.5) compared to state-of-the-art YOLO variants, with particularly notable gains in challenging small-object categories such as personnel detection. These results highlight the effectiveness and practical value of the proposed framework in advancing aviation safety and operational autonomy in airport environments.

1. Introduction

Accurate and efficient object detection for aerial airport surveillance is a critical enabling technology for aviation safety and autonomous airport operations [1,2]. Modern airports increasingly rely on optical sensors and real-time computer vision systems to detect foreign object debris (FOD), monitor runway and apron activity, and prevent safety-critical incidents [3,4]. With rapid advancements in unmanned aerial systems and drone-based monitoring, airport surveillance has expanded to include autonomous aerial perspectives, substantially enhancing the ability to detect, classify, and track various safety-critical targets across expansive airport areas [5].
However, the aerial surveillance setting introduces unique detection challenges distinct from conventional ground-based detection scenarios [6,7]. These include severe object occlusions, extreme scale diversity due to varying altitudes and viewing angles, dense panoramic clutter typical of busy airports, and an abundance of extremely small targets, such as ground vehicles, personnel, and minute FOD on runways and aprons [8]. Addressing these complexities requires advanced detection architectures capable of robustly handling fine-grained details and extensive scale variance within cluttered aerial scenes [9,10].
Recent research demonstrates substantial progress toward overcoming these technical barriers, predominantly through deep learning-based detectors, notably variants of the YOLO (You Only Look Once) family [11,12,13]. Numerous studies have introduced and benchmarked architectural innovations specifically tailored to aerial airport scenarios, particularly focusing on specialized feature extraction strategies, advanced multi-scale feature fusion methods, tailored loss functions, and computational efficiency optimizations [6,14]. Nevertheless, existing approaches typically address only partial aspects of this multifaceted detection challenge [15]. For example, feature extraction pipelines optimized for small-object detection frequently employ attention mechanisms or transformer-based backbones but still encounter significant difficulties with panoramic clutter and severe occlusions [16,17]. Methods emphasizing multi-scale feature fusion, such as BiFPN or weighted attention modules, have effectively improved handling of extreme scale diversity; however, they often incur high computational costs or demonstrate limited robustness within densely cluttered scenes [18,19]. Furthermore, advanced bounding box regression losses—including SIoU, NCIoU, and Wasserstein-based methods—have improved localization precision [20]. However, many lack adequate uncertainty modeling [21]. To this end, the Normalized Wasserstein Distance (NWD) loss has emerged as a promising alternative, offering a closed-form, scale-invariant formulation particularly beneficial for accurately localizing small objects under uncertainty [22,23].
To comprehensively address these limitations, we propose a novel integrated detection framework composed of four core components: the Normalized Wasserstein Distance (NWD) loss [23] for rigorous probabilistic bounding box regression under uncertainty; Dilation-wise Residual (DWR) modules [24] to enhance multi-scale feature extraction robustness; a Hierarchical Screening Feature Pyramid Network (HS-FPN) for hierarchical feature screening and effective fusion; and DualConv modules for lightweight yet robust feature representation. Specifically, the NWD loss provides a scale-invariant measure of localization error, making it particularly well-suited for small-object detection. It addresses critical limitations of conventional regression losses by offering improved robustness to variations in object size [23]. The DWR modules effectively tackle occlusion and extreme scale variability through expanded receptive fields achieved by multiple dilation rates [24]. HS-FPN significantly refines multi-scale fusion capabilities by integrating semantic and spatial information more efficiently, overcoming scalability and accuracy limitations in dense aerial environments [25]. DualConv modules, meanwhile, optimize computational efficiency without compromising representational capability [26].
Extensive evaluations conducted on publicly available datasets (ASS1 and ASS2) [27] demonstrate our framework’s superior effectiveness. Rigorous benchmarking against state-of-the-art YOLO variants including YOLOv6n [28], YOLOv8n [29], YOLOv9s [14], YOLOv10n [30], YOLOv11n [31], and YOLOv12n [17] indicates that our proposed method achieves substantial improvements in mean Average Precision (mAP@0.5), surpassing baseline models by approximately nine percentage points on the ASS1 dataset. Moreover, it significantly enhances small-object detection accuracy, achieving AP improvements exceeding 25 percentage points for challenging classes such as personnel. These results underscore our framework’s robust ability to tackle comprehensive challenges inherent to aerial airport surveillance, substantially advancing aviation safety and operational autonomy.
The three main contributions of this study, with a particular emphasis on small-object detection, are summarized as follows:
  • This study systematically integrates a scale-invariant bounding box regression loss—Normalized Wasserstein Distance (NWD)—alongside advanced multi-scale feature extraction modules (DWR), hierarchical fusion structures (HS-FPN), and lightweight representation blocks (DualConv) to comprehensively address the challenges of detecting extremely small objects in aerial airport surveillance scenarios.
  • A thorough empirical evaluation is conducted using specialized aerial airport surveillance datasets (ASS1 and ASS2). This rigorous benchmarking clearly demonstrates substantial performance gains in mean Average Precision (mAP@0.5), particularly highlighting notable improvements in detecting small objects (e.g., persons, trucks).
  • Through strategic integration and extensive testing, this work explicitly demonstrates practical applicability and significant advancement in small-object detection accuracy (e.g., an improvement exceeding 25 percentage points in particularly challenging small-object categories), thus significantly contributing to aviation safety and the autonomous operation of airports.

2. Materials

The ASS dataset is a publicly available dataset specifically developed to support research and evaluation of object detection methods for airport surface surveillance tasks [27]. It includes two distinct subsets, ASS1 and ASS2. The ASS1 dataset contains 2000 surveillance images with annotations for a total of 8371 objects, categorized into airplanes (3466 instances), people (2994 instances), and trucks (1911 instances). These images represent realistic airport scenarios captured under varied environmental conditions such as day, night, and adverse weather, ensuring a diverse and challenging detection task.
The ASS2 dataset, on the other hand, comprises 100 panoramic surveillance images annotated with 8414 objects, primarily featuring airplanes (2341 instances) and trucks (6073 instances). Notably, the panoramic nature of ASS2 introduces significant detection complexity due to the high proportion (approximately 99.8%) of small-sized objects, capturing distant and challenging viewpoints.
Both datasets have been meticulously labeled with bounding boxes and categories, facilitating precise benchmarking of object detection methods. Given their public availability, these datasets provide a standardized and robust platform for researchers to compare and improve algorithms targeting complex, real-world surveillance scenarios in airports. Parts of the ASS1 and ASS2 datasets are illustrated in Figure 1.

3. Methods

This study proposes a novel architecture for small-object detection, as illustrated in Figure 2. The backbone network extracts multi-scale features from the input image, which are then processed through the Feature Pyramid Network (FPN) to generate feature maps at different scales. A key innovation in our architecture is the introduction of the SmallObj layer in the detection head, specifically designed to enhance the detection of tiny objects. This additional layer consists of a series of convolutional operations that increase the receptive field while preserving fine-grained spatial information critical for tiny object detection. The SmallObj layer processes features with specialized attention mechanisms and adaptive feature fusion, allowing the network to better capture the subtle characteristics of small objects that might be lost in conventional detection architectures. The detection head, augmented with this SmallObj layer, then predicts bounding boxes and class probabilities, with the predictions being optimized using our proposed Normalized Gaussian Wasserstein Distance Loss.

3.1. Normalized Gaussian Wasserstein Distance Loss

To address the issue that the original IoU loss provides no useful optimization gradient when the predicted box and the ground truth box do not overlap, or when one box completely contains the other, this paper introduces a loss function based on the Wasserstein distance. By modeling each bounding box as a Gaussian distribution and accounting for the weight distribution of pixels within the box, it more accurately reflects the distribution characteristics of tiny objects, enabling the model to better measure the similarity between the predicted and ground truth boxes. The main calculation process is as follows:
Let the predicted box be $P = (cx_p, cy_p, w_p, h_p)$ and the ground truth box be $G = (cx_g, cy_g, w_g, h_g)$. These are modeled as two-dimensional Gaussian distributions:
$\mathcal{N}_p = \mathcal{N}\!\left( \begin{bmatrix} cx_p \\ cy_p \end{bmatrix}, \begin{bmatrix} \frac{w_p^2}{4} & 0 \\ 0 & \frac{h_p^2}{4} \end{bmatrix} \right), \qquad \mathcal{N}_g = \mathcal{N}\!\left( \begin{bmatrix} cx_g \\ cy_g \end{bmatrix}, \begin{bmatrix} \frac{w_g^2}{4} & 0 \\ 0 & \frac{h_g^2}{4} \end{bmatrix} \right).$
The second-order Wasserstein distance is originally defined as follows:
$W_2^2(\mathcal{N}_p, \mathcal{N}_g) = \lVert \mu_p - \mu_g \rVert_2^2 + \operatorname{tr}\!\left( \Sigma_p + \Sigma_g - 2\left( \Sigma_g^{1/2} \Sigma_p \Sigma_g^{1/2} \right)^{1/2} \right).$
The mean term is simply the squared Euclidean distance between the box centers:
$\lVert \mu_p - \mu_g \rVert_2^2 = (cx_p - cx_g)^2 + (cy_p - cy_g)^2.$
Since the covariance matrices are diagonal, their square roots can be calculated directly:
$\Sigma_p^{1/2} = \begin{bmatrix} \frac{w_p}{2} & 0 \\ 0 & \frac{h_p}{2} \end{bmatrix}, \qquad \Sigma_g^{1/2} = \begin{bmatrix} \frac{w_g}{2} & 0 \\ 0 & \frac{h_g}{2} \end{bmatrix}.$
The matrix product $\Sigma_g^{1/2} \Sigma_p \Sigma_g^{1/2}$ is as follows:
$\Sigma_g^{1/2} \Sigma_p \Sigma_g^{1/2} = \begin{bmatrix} \frac{w_g}{2} & 0 \\ 0 & \frac{h_g}{2} \end{bmatrix} \begin{bmatrix} \frac{w_p^2}{4} & 0 \\ 0 & \frac{h_p^2}{4} \end{bmatrix} \begin{bmatrix} \frac{w_g}{2} & 0 \\ 0 & \frac{h_g}{2} \end{bmatrix} = \begin{bmatrix} \frac{w_g^2 w_p^2}{16} & 0 \\ 0 & \frac{h_g^2 h_p^2}{16} \end{bmatrix}.$
Its square root is
$\left( \Sigma_g^{1/2} \Sigma_p \Sigma_g^{1/2} \right)^{1/2} = \begin{bmatrix} \frac{w_g w_p}{4} & 0 \\ 0 & \frac{h_g h_p}{4} \end{bmatrix}.$
The trace term simplifies to
$\operatorname{tr}\!\left( \Sigma_p + \Sigma_g - 2\left( \Sigma_g^{1/2} \Sigma_p \Sigma_g^{1/2} \right)^{1/2} \right) = \operatorname{tr}\begin{bmatrix} \frac{w_p^2}{4} & 0 \\ 0 & \frac{h_p^2}{4} \end{bmatrix} + \operatorname{tr}\begin{bmatrix} \frac{w_g^2}{4} & 0 \\ 0 & \frac{h_g^2}{4} \end{bmatrix} - 2\operatorname{tr}\begin{bmatrix} \frac{w_g w_p}{4} & 0 \\ 0 & \frac{h_g h_p}{4} \end{bmatrix} = \frac{w_p^2}{4} + \frac{h_p^2}{4} + \frac{w_g^2}{4} + \frac{h_g^2}{4} - \frac{w_g w_p}{2} - \frac{h_g h_p}{2}.$
Combining the mean term and the trace term, we have
$W_2^2(\mathcal{N}_p, \mathcal{N}_g) = (cx_p - cx_g)^2 + (cy_p - cy_g)^2 + \frac{w_p^2 + w_g^2}{4} + \frac{h_p^2 + h_g^2}{4} - \frac{w_g w_p + h_g h_p}{2}.$
NWD is obtained through exponential normalization:
$\mathrm{NWD} = \exp\!\left( -\frac{W_2^2(\mathcal{N}_p, \mathcal{N}_g)}{C} \right) = \exp\!\left( -\frac{(cx_p - cx_g)^2 + (cy_p - cy_g)^2 + \frac{w_p^2 + w_g^2}{4} + \frac{h_p^2 + h_g^2}{4} - \frac{w_g w_p + h_g h_p}{2}}{C} \right),$
where C is a dataset-dependent normalization constant.
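The closed-form expression above translates directly into a loss term. The following minimal PyTorch sketch computes the NWD and the corresponding loss 1 − NWD for a batch of boxes in (cx, cy, w, h) format; the function name, tensor layout, and the default C = 14 (the value selected in Section 4.3) are illustrative rather than the exact training code.

```python
import torch

def nwd_loss(pred, target, C=14.0):
    """Sketch of the Normalized Gaussian Wasserstein Distance loss.

    pred, target: tensors of shape (N, 4) holding boxes as (cx, cy, w, h).
    C: dataset-dependent normalization constant.
    Returns the mean of (1 - NWD) over the batch.
    """
    cx_p, cy_p, w_p, h_p = pred.unbind(-1)
    cx_g, cy_g, w_g, h_g = target.unbind(-1)

    # Closed-form squared 2-Wasserstein distance between the two Gaussians.
    center_term = (cx_p - cx_g) ** 2 + (cy_p - cy_g) ** 2
    size_term = (w_p ** 2 + w_g ** 2) / 4 + (h_p ** 2 + h_g ** 2) / 4 \
        - (w_p * w_g + h_p * h_g) / 2
    w2_sq = center_term + size_term

    nwd = torch.exp(-w2_sq / C)   # exponential normalization
    return (1.0 - nwd).mean()
```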

3.2. Dilation-Wise Residual

As illustrated in Figure 4, the calculation process of the Dilation-wise Residual (DWR) module can be described as follows:
In the region residualization stage, the input feature map $F_{in}$ is first processed through a standard $3 \times 3$ convolution layer combined with batch normalization (BN) and a ReLU activation function, generating a series of concise feature maps $F_{rr}$ with different regional sizes. This process can be expressed as
$F_{rr} = \mathrm{ReLU}\left(\mathrm{BN}\left(\mathrm{Conv}_{3\times 3}(F_{in})\right)\right)$
Here, $\mathrm{Conv}_{3\times 3}$ denotes a $3 \times 3$ convolution operation, BN represents batch normalization, and ReLU is the activation function.
Next, in the semantic residualization stage, the feature map $F_{rr}$ is divided into several groups, and each group is processed by a depth-wise dilated convolution with a different dilation rate. Each channel feature uses only one desired receptive field size, avoiding redundant receptive fields. The calculation formula is
$F_{sr}^{i} = \mathrm{DConv}_{d_i}\left(F_{rr}^{i}\right)$
where $F_{rr}^{i}$ is the i-th group of regional feature maps, $\mathrm{DConv}_{d_i}$ represents a depth-wise dilated convolution with dilation rate $d_i$, and $F_{sr}^{i}$ is the i-th group of feature maps obtained after semantic residualization.
Finally, the feature maps obtained from semantic residualization are fused: all feature maps are concatenated, batch normalization is applied to the concatenated maps, the result is merged by a point-wise convolution, and the merged feature maps are added to the input feature map through a residual connection to generate the final output feature map $F_{out}$. This process can be expressed as
$F_{out} = F_{in} + \mathrm{Conv}_{1\times 1}\left(\mathrm{BN}\left(\mathrm{Concat}\left(F_{sr}^{1}, F_{sr}^{2}, \ldots, F_{sr}^{n}\right)\right)\right)$
where Concat denotes feature map concatenation, and $\mathrm{Conv}_{1\times 1}$ represents a point-wise convolution operation.
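For concreteness, the three stages above can be wired together as in the following PyTorch sketch. The equal channel split and the dilation rates (1, 3, 5) are assumptions for illustration; the actual DWR configuration follows [24].

```python
import torch
import torch.nn as nn

class DWRBlock(nn.Module):
    """Sketch of a Dilation-wise Residual block following the equations above."""

    def __init__(self, channels, dilations=(1, 3, 5)):
        super().__init__()
        assert channels % len(dilations) == 0
        group = channels // len(dilations)
        # Region residualization: 3x3 convolution + BN + ReLU.
        self.region = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )
        # Semantic residualization: one depth-wise dilated convolution per group.
        self.branches = nn.ModuleList(
            nn.Conv2d(group, group, 3, padding=d, dilation=d, groups=group, bias=False)
            for d in dilations
        )
        # Fusion: BN on the concatenation, point-wise convolution, residual connection.
        self.bn = nn.BatchNorm2d(channels)
        self.pw = nn.Conv2d(channels, channels, 1, bias=False)

    def forward(self, x):
        f_rr = self.region(x)
        chunks = torch.chunk(f_rr, len(self.branches), dim=1)
        f_sr = torch.cat([branch(c) for branch, c in zip(self.branches, chunks)], dim=1)
        return x + self.pw(self.bn(f_sr))
```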

3.3. High-Level Screening-Feature Fusion Pyramid Networks

As shown in Figure 3, the calculation process of the High-level Screening-feature Fusion Pyramid Network (HS-FPN) is described as follows. The HS-FPN module is designed to address multi-scale detection challenges by effectively fusing high-level semantic information with low-level features.
This module is composed of two primary components: a feature selection unit and a feature fusion unit. Within the feature selection unit, the Channel Attention (CA) mechanism operates on the input feature map $f_{in} \in \mathbb{R}^{C \times H \times W}$, where C, H, and W denote the number of channels, height, and width, respectively. The feature map is processed through both global average pooling and global max pooling operations, and the resulting descriptors are subsequently aggregated. A Sigmoid activation function is then applied to generate channel-wise attention weights, producing an output $f_{CA} \in \mathbb{R}^{C \times 1 \times 1}$. This process can be formulated as
$f_{CA} = \sigma\left(\mathrm{BN}\left(\mathrm{Conv}_{1\times 1}\left(\mathrm{GAP}(f_{in}) + \mathrm{GMP}(f_{in})\right)\right)\right)$
where $\sigma$ is the Sigmoid activation function, BN represents batch normalization, and $\mathrm{Conv}_{1\times 1}$ is a $1 \times 1$ convolution operation. The Dimensional Matching (DM) module then applies a $1 \times 1$ convolution to reduce the number of channels for each scale feature map to 256, which can be represented as
$f_{DM} = \mathrm{Conv}_{1\times 1}(f_{in})$
In the feature fusion module, the Selective Feature Fusion (SFF) mechanism is utilized to integrate high-level and low-level feature representations. Given a high-level input feature $f_{high} \in \mathbb{R}^{C \times H \times W}$ and a low-level input feature $f_{low} \in \mathbb{R}^{C \times H_1 \times W_1}$, the high-level feature is first upsampled using a transposed convolution (T-Conv) with a kernel size of $3 \times 3$ and a stride of 2, yielding an intermediate feature $f'_{high} \in \mathbb{R}^{C \times 2H \times 2W}$. Bilinear interpolation is subsequently applied to adjust the spatial resolution of $f'_{high}$ to match that of the low-level feature, resulting in $f_{att} \in \mathbb{R}^{C \times H_1 \times W_1}$. The CA module is then employed to transform the resolution-adjusted high-level feature $f_{att}$ into channel-wise attention weights, which are used to filter the low-level features. Finally, the filtered low-level features are fused with the resolution-adjusted high-level features to enhance the overall feature representation, producing the output $f_{out} \in \mathbb{R}^{C \times H_1 \times W_1}$. This process is defined by the following equations:
$f'_{high} = \text{T-Conv}(f_{high})$
$f_{att} = \mathrm{BilinearInterpolation}(f'_{high})$
$f_{filtered} = \mathrm{CA}(f_{att}) \cdot f_{low}$
$f_{out} = f_{att} + f_{filtered}$
where T-Conv denotes transposed convolution, BilinearInterpolation represents bilinear interpolation, CA is the channel attention described above, and $\cdot$ indicates element-wise multiplication.
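A compact PyTorch sketch of the feature selection and fusion steps is given below. It follows one reading of the equations above, in which the channel attention computed from the resolution-adjusted high-level feature filters the low-level feature before the two are added; the module names and the 256-channel width are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel Attention (CA) used for feature selection, as in the f_CA equation."""

    def __init__(self, channels):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(channels)

    def forward(self, x):
        gap = F.adaptive_avg_pool2d(x, 1)   # global average pooling
        gmp = F.adaptive_max_pool2d(x, 1)   # global max pooling
        return torch.sigmoid(self.bn(self.conv(gap + gmp)))   # (B, C, 1, 1) weights

class SelectiveFeatureFusion(nn.Module):
    """Sketch of SFF: upsample the high-level feature, filter the low-level
    feature with channel attention, then fuse by addition."""

    def __init__(self, channels=256):
        super().__init__()
        self.tconv = nn.ConvTranspose2d(channels, channels, 3, stride=2,
                                        padding=1, output_padding=1)
        self.ca = ChannelAttention(channels)

    def forward(self, f_high, f_low):
        f_up = self.tconv(f_high)                                     # T-Conv, 2x upsampling
        f_att = F.interpolate(f_up, size=f_low.shape[-2:],
                              mode="bilinear", align_corners=False)   # match low-level size
        f_filtered = self.ca(f_att) * f_low                           # attention-filtered low-level
        return f_att + f_filtered                                     # fused output
```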

3.4. DualConv

As illustrated in Figure 5, the computation process of the DualConv module is described as follows. DualConv integrates the advantages of both group convolution and heterogeneous convolution. It simultaneously applies convolutional kernels of sizes $3 \times 3$ and $1 \times 1$ to the same set of input feature map channels, enabling enhanced feature representation with improved computational efficiency.
Within the DualConv framework, the number of convolutional filter groups $G$ serves as a control parameter that determines the proportion of $K \times K$ kernels utilized in the overall convolutional filter composition. Specifically, for a given value of $G$, a fraction of $1/G$ of the channels is processed using a combination of $K \times K$ and $1 \times 1$ convolutional kernels, while the remaining $(1 - 1/G)$ proportion of channels is processed exclusively using $1 \times 1$ kernels.
Let the output feature map dimensions be $D_o \times D_o \times N$, where $D_o$ denotes the spatial resolution (height and width) and $N$ is the number of output channels. In a standard convolution operation, the input feature map is convolved with $N$ filters of size $K \times K \times M$, where $M$ represents the number of input channels. Accordingly, the total number of floating point operations (FLOPs) required for the standard convolutional layer, denoted as $FL_{SC}$, is given by
$FL_{SC} = 2 \times K^2 \times M \times N \times D_o^2$
In DualConv, the number of FLOPs for the combined convolutional kernels is
$FL_{com} = 2 \times (K^2 + 1) \times \frac{M}{G} \times \frac{N}{G} \times D_o^2$
The number of FLOPs for the remaining $1 \times 1$ pointwise convolutional kernels is
$FL_{1\times 1} = 2 \times 1^2 \times \left(M - \frac{M}{G}\right) \times \frac{N}{G} \times D_o^2$
The total number of FLOPs is
$FL_{DualConv} = G \times FL_{com} + G \times FL_{1\times 1}$
Comparing the computational cost (FLOPs) of the dual convolutional layer with that of the standard convolutional layer, the computational reduction ratio $R_{DC/SC}$ is
$R_{DC/SC} = \frac{FL_{SC} - FL_{DualConv}}{FL_{SC}}$
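The FLOP comparison can be checked numerically with a few lines of Python. The sketch below plugs the formulas above into a helper function; the default values of K, M, N, D_o, and G are arbitrary illustrative choices.

```python
def dualconv_flop_reduction(K=3, M=256, N=256, Do=40, G=4):
    """Evaluate the FLOP formulas of Section 3.4 and return the reduction ratio R_DC/SC."""
    fl_sc = 2 * K**2 * M * N * Do**2                        # standard convolution
    fl_com = 2 * (K**2 + 1) * (M / G) * (N / G) * Do**2     # combined KxK + 1x1 kernels
    fl_1x1 = 2 * 1**2 * (M - M / G) * (N / G) * Do**2       # remaining 1x1 kernels
    fl_dual = G * fl_com + G * fl_1x1
    return (fl_sc - fl_dual) / fl_sc

# With K = 3 and G = 4 the ratio evaluates to roughly 0.64, i.e. DualConv needs
# about 36% of the FLOPs of the corresponding standard convolutional layer.
print(f"FLOP reduction R_DC/SC: {dualconv_flop_reduction():.2%}")
```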

4. Experiments and Analysis

To comprehensively evaluate the effectiveness of the proposed method, a series of experiments were conducted across multiple key performance dimensions. Specifically, we present evaluations based on mAP@0.5 and mAP@0.5:0.95 to assess overall detection accuracy, accompanied by precision–recall (PR) curves to analyze the reliability of predictions across varying confidence thresholds. Performance assessments were carried out on the ASS1 and ASS2 datasets to validate the robustness of the method in domain-specific aerial surveillance scenarios.
In addition to these core evaluations, two dedicated ablation studies were performed. The first, detailed in the subsection “Evaluating NWD-based Loss against Traditional IoU Loss,” investigates the contribution of the Normalized Wasserstein Distance loss relative to standard IoU-based loss functions. The second, presented in “Empirical Analysis of C-Value Selection in NWD Loss,” explores the sensitivity of model performance to the normalization constant C within the NWD formulation. Furthermore, we report comparisons of inference speed, computational complexity (FLOPs), and parameter count to assess efficiency. All experiments include rigorous comparisons against state-of-the-art (SOTA) methods, including YOLOv6n, YOLOv8n, YOLOv9s, YOLOv10n, YOLOv11n, and YOLOv12n.
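For reference, the sketch below shows how these accuracy and latency figures can be read from a validation run in the Ultralytics API used throughout this study (see Section 4.1); the weight and dataset paths are placeholders, and the attribute names reflect recent Ultralytics releases rather than the authors' exact evaluation script.

```python
from ultralytics import YOLO

# Hypothetical paths: a trained checkpoint and an ASS1 dataset config.
model = YOLO("runs/detect/train/weights/best.pt")
metrics = model.val(data="ass1.yaml", imgsz=640, batch=1)   # batch=1 for per-image latency

print(f"mAP@0.5         : {metrics.box.map50:.3f}")
print(f"mAP@0.5:0.95    : {metrics.box.map:.3f}")
print(f"per-class AP@0.5: {metrics.box.ap50}")
print(f"inference time  : {metrics.speed['inference']:.1f} ms per image")
```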

4.1. Experimental Setup

All experiments were conducted on a single NVIDIA RTX 4090 GPU. The entire implementation was developed using the Python (version 3.11.8) programming language, and model training and evaluation were performed using the Ultralytics YOLO library, which is built upon the PyTorch (version 2.2.2) deep learning framework. The baseline model adopted for this study is YOLOv11n. No deployment-oriented optimization techniques were applied. All inference metrics were obtained using standard PyTorch validation procedures to ensure consistent and fair comparisons across models. Inference times were measured under a batch size of 1 to provide a reliable estimate of per-image latency.
For the ASS1 dataset, all YOLO-based models were trained from scratch with randomly initialized weights, without relying on pre-trained parameters from external datasets. Each model was trained for 300 epochs using the Adam optimizer, with an initial learning rate of 0.01, a weight decay of 0.0005, a batch size of 16, and an input image resolution of 640 × 640 pixels. No early stopping criterion was employed to allow complete observation of the training dynamics. The dataset was split with a 7:3 ratio, where 70% of the samples were used for training and the remaining 30% for validation.
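A hedged sketch of this ASS1 training configuration, expressed through the Ultralytics training API, is shown below; the model and dataset YAML names are placeholders, and the patience value is simply chosen large enough that early stopping cannot trigger within the 300-epoch run.

```python
from ultralytics import YOLO

# Placeholder config names; weights are randomly initialized (no pre-training).
model = YOLO("yolo11n.yaml")

model.train(
    data="ass1.yaml",        # hypothetical ASS1 config with a 70/30 train/val split
    epochs=300,
    imgsz=640,
    batch=16,
    optimizer="Adam",
    lr0=0.01,
    weight_decay=0.0005,
    patience=300,            # effectively disables early stopping over 300 epochs
)
```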
Model selection was based on validation performance: the checkpoint achieving the highest mean Average Precision across IoU thresholds from 0.50 to 0.95 (mAP@[0.50:0.95]) was saved and subsequently used for final evaluation and analysis. For the smaller ASS2 dataset, which consists of only 100 images, a transfer learning strategy was adopted to improve generalization and mitigate overfitting. In this case, model weights were initialized using the best-performing checkpoints previously trained on ASS1.

4.2. Evaluating NWD-Based Loss Against Traditional IoU Loss

Figure 6 illustrates a comparative analysis of the original IoU loss and the Normalized Wasserstein Distance (NWD) loss across six YOLO variants (YOLOv6n to YOLOv12n) on the ASS1 dataset. Across all model variants, the use of NWD loss consistently outperforms the original IoU loss in terms of mAP@0.5, demonstrating its generalizability and effectiveness for small-object detection in aerial surveillance contexts. The most notable improvement is observed in the YOLOv8n model, where NWD achieves 0.768 compared to 0.757 with the original loss. Even in more compact variants such as YOLOv6n, NWD yields noticeable improvements (from 0.724 to 0.731), validating the robustness of NWD across architectures with varying capacities.
These improvements can be attributed to the scale-invariant nature of the NWD loss, which provides more reliable regression signals for small or distant objects—an essential aspect of aerial surveillance tasks. While the performance gap narrows in higher-capacity models such as YOLOv10n through YOLOv12n, the NWD-enhanced variants still maintain a consistent edge over their IoU-based counterparts. This suggests that even as model architectures mature, the choice of a more appropriate localization loss function remains critical for maximizing detection accuracy. The consistent advantage across models confirms the utility of NWD as a drop-in replacement for traditional IoU losses, particularly in challenging detection scenarios where object sizes vary significantly.

4.3. Empirical Analysis of C-Value Selection in NWD Loss

To determine the optimal normalization constant C for the NWD loss, a comprehensive ablation study was conducted on the ASS1 dataset, evaluating C values in the range of 0 to 20 with a step size of 0.1. Detection performance was assessed for each object category (airplane, person, and truck) as well as in terms of overall accuracy measured by mAP@0.5. Representative results are shown in Figure 7.
As illustrated in the figure, the airplane and truck categories achieve their highest precision when C = 14, reaching 0.994 and 0.913, respectively. The person category attains its peak precision of 0.363 at C = 16. The overall detection performance, indicated by mAP@0.5, reaches its maximum value of 0.752 at C = 14, reflecting a favorable trade-off across all categories. While increasing C initially leads to improved performance, further increments beyond C = 16 yield diminishing gains or slight declines, particularly in the person and truck categories. These trends suggest that excessively large values of C may reduce the effectiveness of the loss function in capturing fine-grained localization differences.
Based on this analysis, a value of C = 14 was selected for all subsequent experiments, as it offers the most balanced and consistent improvements across object categories. This choice ensures that the NWD loss remains both robust and scale-aware for aerial small-object detection tasks.

4.4. Ablation Study on ASS1 Dataset

Table 1 reports the results of an ablation study conducted on the ASS1 dataset to systematically evaluate the contribution of each component within the proposed detection framework. The experiments are designed to progressively integrate key modules—SmallObj preprocessing, DualConv, Dilation-wise Residual (DWR) modules, and the Hierarchical Screening Feature Pyramid Network (HS-FPN)—starting from a baseline configuration that incorporates only the Normalized Wasserstein Distance (NWD) loss. Each configuration is assessed in terms of detection accuracy (measured by mAP0.5 and per-class AP), computational efficiency (parameters and FLOPs), and inference performance (latency and FPS).
The baseline model, which employs only the NWD loss, yields an mAP0.5 of 75.4% and achieves 92.6 FPS with minimal computational cost (2.58M parameters and 6.3 GFLOPs). However, its performance on small-object categories is suboptimal, with AP scores of 34.8% for person and 92.1% for truck, indicating that the geometric advantages offered by NWD are insufficient on their own to capture fine-grained semantic cues in complex aerial imagery.
The inclusion of the SmallObj preprocessing strategy results in a substantial performance improvement, raising the mAP0.5 to 88.4% and nearly doubling the person AP to 69.7%. This demonstrates the importance of small-object-oriented data enhancement for alleviating scale imbalance and improving the network’s sensitivity to underrepresented instances. Despite a moderate increase in parameters and latency, this configuration establishes a strong performance baseline for further architectural augmentation.
Subsequent integration of the DualConv modules slightly reduces the parameter count (from 2.89 M to 2.58 M) while maintaining high performance (mAP0.5 = 86.8%), reflecting the effectiveness of lightweight convolutional structures in preserving discriminative feature quality. The addition of DWR modules further enhances robustness to occlusion and scale variation, as evidenced by consistent gains in per-class AP and an mAP0.5 of 87.8%. Notably, this improvement is achieved with minimal trade-offs in speed or computational complexity.
The full configuration—comprising NWD loss, Smallob, DualConv, DWR, and HS-FPN—achieves the highest overall performance, with an mAP0.5 of 89.3% and strong precision across all object categories (e.g., 99.3% for airplane, 72.2% for person, and 96.3% for truck). This setup also attains the best runtime efficiency (94.3 FPS) and the lowest parameter count (2.10 M), demonstrating that the proposed framework effectively balances detection accuracy with computational cost. These findings validate the architectural choices and highlight the framework’s suitability for real-time small-object detection in aerial surveillance applications.

4.5. Comparative Analysis of mAP (0.5 and 0.5–0.95)

The mAP@0.5–0.95 evaluation, as illustrated in Figure 8, provides a comprehensive assessment of detection accuracy across multiple IoU thresholds ranging from 0.5 to 0.95. This metric is particularly stringent as it evaluates the model’s performance at increasingly strict localization requirements, offering a more holistic view of detection quality than single-threshold metrics. Our proposed method demonstrates superior performance compared to all baseline YOLO variants, achieving the highest mAP@0.5–0.95 score throughout the evaluation period. The consistent performance advantage is evident across the entire training progression, with our method maintaining a stable lead over YOLOv6n, YOLOv8n, YOLOv9s, YOLOv10n, YOLOv11n, and YOLOv12n. This superior performance can be attributed to the synergistic effects of our integrated components: the Normalized Gaussian Wasserstein Distance loss function provides more accurate bounding box regression, the small-object detection enhancement improves localization precision for challenging targets, and the High-level Screening-feature Fusion Pyramid Network enables better feature representation across multiple scales.
The performance gap between our proposed method and the baseline models becomes more pronounced as the IoU threshold increases, highlighting the superior localization accuracy of our approach. While traditional YOLO variants show relatively modest improvements or even performance plateaus during training, our method exhibits consistent upward trends with better convergence characteristics. The YOLOv12n model, representing the most recent advancement in the YOLO series, achieves competitive performance but still falls short of our proposed method’s accuracy. This performance differential is particularly significant considering that our method maintains computational efficiency while delivering enhanced accuracy. The robust performance across varying IoU thresholds demonstrates that our proposed modifications not only improve overall detection accuracy but also enhance the precision of bounding box localization, which is crucial for applications requiring high spatial accuracy such as autonomous systems and precision surveillance tasks.
The mAP@0.5 evaluation, as shown in Figure 9, demonstrates the detection performance at a single IoU threshold of 0.5, which represents a more lenient evaluation criterion compared to the multi-threshold mAP@0.5–0.95 metric. At this threshold, our proposed method exhibits exceptional performance superiority, achieving consistently higher mAP@0.5 scores throughout the training process compared to all baseline YOLO variants. The performance curves reveal that our method not only achieves higher peak accuracy but also demonstrates superior convergence stability and faster learning dynamics. The gap between our proposed method and the competing approaches is particularly pronounced, with our method maintaining a substantial lead over YOLOv6n, YOLOv8n, YOLOv9s, YOLOv10n, YOLOv11n, and YOLOv12n throughout the evaluation period. This consistent performance advantage at the 0.5 IoU threshold indicates that our method excels at fundamental object detection tasks, successfully identifying and roughly localizing objects with high confidence.
The comparative analysis reveals interesting performance characteristics among the baseline methods, with YOLOv12n showing the most competitive performance among the YOLO variants, followed by YOLOv11n and YOLOv10n. However, even the best-performing baseline method falls considerably short of our proposed approach’s accuracy. The smooth convergence curves of our method, contrasted with the more volatile training dynamics observed in some baseline models, suggest that our integrated components contribute to more stable and reliable training processes. The substantial performance improvement at the mAP@0.5 level, combined with the previously discussed mAP@0.5–0.95 results, demonstrates that our method provides comprehensive detection improvements across both lenient and strict evaluation criteria. This dual-threshold excellence indicates that our proposed modifications enhance not only basic object detection capabilities but also precise localization accuracy, making the method suitable for a wide range of applications with varying precision requirements.

4.6. Precision–Recall Curve Comparison

Figure 10 presents precision–recall (PR) curves comparing the proposed model (denoted as Ours) against multiple baseline models, namely, YOLOv8n, YOLOv9s, YOLOv10n, YOLOv11n, and YOLOv12n, represented respectively in panels (a) through (e). Panel (f) corresponds to the proposed method.
YOLOv8n (Figure 10a) exhibits an mAP@0.5 of 75.7%, with notably high precision for the Airplane (99.5%) and Truck (91.9%) categories, while performing poorly on Person (35.7%). YOLOv9s (Figure 10b) demonstrates improved detection performance, particularly for Person (44.6%), resulting in a higher mAP (79.7%).
Conversely, YOLOv10n (Figure 10c) shows limited improvement in Person precision (33.8%), reflecting a marginal decline compared to YOLOv9s, thus leading to an overall mAP of 74.7%. Similarly, YOLOv11n and YOLOv12n (Figure 10d and Figure 10e, respectively) display comparable results, with minimal variations in precision and recall across the evaluated classes, maintaining an mAP around 74.3%.
The proposed model (Ours, Figure 10f) significantly surpasses all baseline models, achieving an mAP@0.5 of 88.7%. Notably, the Person category precision markedly increases to 70.3%, highlighting the substantial effectiveness of the integrated architecture in detecting small and challenging objects. Additionally, the curves for Ours exhibit higher precision at greater recall levels, indicating improved detection robustness and reliability across all categories.

4.7. Comparative Results on ASS1 Dataset

Table 2 provides a comparative analysis of various YOLO-based models and the proposed method in terms of inference time, parameter count, computational complexity (FLOPs), frames per second (FPS), and detection accuracy (mAP0.5) on the ASS1 dataset. The results offer a comprehensive view of the trade-offs between efficiency and accuracy across different model architectures, with a particular focus on performance in detecting small objects such as personnel.
Among the baseline models, YOLOv8n emerges as a strong performer in terms of inference efficiency, achieving the highest FPS (147.1) and a relatively low parameter count (3.00 M), while still maintaining moderate accuracy (mAP0.5 = 75.7). YOLOv6n, while faster than most models (138.9 FPS), suffers from limited person detection performance (27.3 AP), highlighting its limitations in small-object sensitivity. YOLOv9s, on the other hand, achieves the highest mAP0.5 among the baselines (79.7), but this comes at the cost of significantly increased inference latency (18.1 ms) and FLOPs (26.7 G), making it less suitable for real-time applications.
The proposed method achieves an mAP0.5 of 89.3, outperforming all baseline models by a substantial margin—most notably achieving improvements of 9.6 percentage points over YOLOv9s and nearly 17 percentage points over YOLOv6n. This gain is especially evident in the Person category, where the proposed method achieves an AP of 72.2, more than double that of YOLOv6n (27.3) and YOLOv8n (35.7). This result underscores the model’s ability to effectively detect small, low-visibility targets in complex aerial scenes, a capability not observed in prior architectures.
Despite this notable increase in accuracy, the proposed method maintains competitive efficiency. With an inference time of 10.6 ms and a parameter count of only 2.10 M, it is significantly lighter than YOLOv9s (7.16 M) and exhibits only a moderate increase in FLOPs (16.4 G) compared to other models. Furthermore, it operates at 94.3 FPS, making it suitable for real-time or near-real-time deployment in aerial surveillance systems. These findings suggest that the proposed method offers a favorable balance between detection accuracy and computational efficiency.
Figure 11 provides a qualitative comparison of detection results on sample images from the ASS1 dataset. Panel (a) presents the original images, while panels (b) to (g) show detection results from YOLOv6n, YOLOv8n, YOLOv9s, YOLOv10n, YOLOv11n, and YOLOv12n, respectively. Panel (h) demonstrates the proposed method.
The baseline YOLO variants (panels (b)–(g)) exhibit varied performance, frequently struggling with accurately detecting small-sized person objects, often leading to missed detections or detections with low confidence scores. Conversely, the proposed method (panel (h)) significantly enhances detection performance, successfully identifying person objects with higher confidence scores and more precise bounding boxes.
Additionally, the proposed method consistently improves detection robustness across multiple object categories, notably airplane and truck. These qualitative results underscore the efficacy of the proposed architecture, particularly highlighting its strength in addressing the limitations seen in previous YOLO variants in challenging detection scenarios.

4.8. Comparative Results on ASS2 Dataset

Table 3 presents a comparative evaluation of several YOLO-based models and the proposed framework on the ASS2 dataset, focusing on detection accuracy (measured by mAP0.5 and per-class AP), computational efficiency (inference time, parameter count, and FLOPs), and real-time capability (frames per second). The ASS2 dataset, characterized by its limited size and challenging small-object categories such as aircraft and trucks, provides a rigorous benchmark for assessing the generalization capability of lightweight detectors in constrained surveillance scenarios.
Among the baseline models, YOLOv9s achieves the highest mAP0.5 (50.2), reflecting its ability to handle complex detection tasks through a deeper architecture and larger capacity (7.16 M parameters and 26.7 GFLOPs). However, this performance comes at a significant cost in inference time (43.2 ms) and speed (23.1 FPS), which limits its suitability for real-time applications. In contrast, YOLOv6n and YOLOv8n are among the fastest models, with inference times of 29.0 ms and 27.5 ms, and FPS values of 34.5 and 36.4, respectively. Nevertheless, their mAP scores remain low (41.8 and 45.5), especially for the Truck category (19.9 and 24.7 AP), indicating suboptimal performance on small or low-visibility targets.
The proposed method achieves a significantly higher mAP0.5 of 62.7, outperforming the strongest baseline (YOLOv9s) by 12.5 percentage points. This improvement is particularly prominent in the Truck category, where the proposed model achieves an AP of 42.6, compared to 29.5 by YOLOv9s and less than 26 for all other baselines. The Airplane class also benefits from a notable gain, reaching an AP of 82.7, the highest across all models. These results highlight the model’s superior capability to detect small and domain-specific objects, even under data-scarce conditions, demonstrating the effectiveness of the proposed architectural components and training strategy—particularly the use of transfer learning from ASS1.
Despite its improved accuracy, the proposed model maintains competitive runtime efficiency. It operates at 27.2 FPS with an inference time of 36.8 ms—only slightly slower than YOLOv11n and YOLOv12n—while using fewer parameters (2.10 M) than any baseline model. Although the model exhibits a higher FLOP count (16.4 G), the increase is justified by the substantial performance gain, especially in challenging categories. This suggests that the additional computational cost is efficiently translated into meaningful performance improvements, supporting the design objective of balancing accuracy and inference efficiency.
Figure 12 presents qualitative detection results on the ASS2 dataset, comparing our proposed method against six state-of-the-art YOLO variants across airplane and truck categories. The visual comparison clearly demonstrates the superior detection capabilities of our approach, particularly in challenging scenarios involving varying object scales, complex backgrounds, and partial occlusions. The baseline models (b)–(g) exhibit notable limitations, such as missed detections and imprecise bounding box localizations, especially in cases where objects appear at oblique angles or in cluttered environments.
Our method (h) demonstrates notable improvements across several key aspects of detection performance. First, it achieves more accurate object localization, as reflected by the tighter and more precisely aligned bounding boxes around both airplanes and trucks. This enhanced localization precision can be attributed to the integration of the HS-FPN and the NWD loss, which together enhance multi-scale feature fusion and provide a scale-invariant localization objective. Second, the proposed framework exhibits superior robustness in detecting objects under challenging conditions, including partial occlusions caused by buildings or vegetation, as well as variations in object orientation and lighting. These results underscore the effectiveness of the architectural design in preserving spatial consistency and semantic discriminability under complex aerial surveillance scenarios.
The qualitative results are consistent with the quantitative performance reported in Table 3, where the proposed method demonstrates substantial improvements in detection accuracy across both object categories. The visual examples further emphasize the model’s ability to maintain high detection reliability while effectively suppressing false positives and minimizing missed detections—an essential requirement for real-world aerial surveillance and monitoring applications. These findings support the effectiveness of the proposed architectural components in addressing the inherent challenges of aerial object detection, including scale variability, complex and cluttered backgrounds, and diverse object orientations.

4.9. Analysis on Small-Object Detection Performance

In this section, we conduct a comprehensive analysis of small-object detection performance, specifically focusing on person detection, to evaluate the effectiveness of our proposed method. Given that person detection presents unique challenges due to the relatively small size of human figures in aerial imagery, this analysis is particularly relevant for assessing our method’s capabilities in handling small objects.
Figure 13 presents a comparative analysis of person precision and the number of parameters (in millions) for various object detection models, including baselines, ablation variants, and the proposed method.
The baseline models, namely, YOLOv6n, YOLOv8n, YOLOv9s, YOLOv10n, YOLOv11n, and YOLOv12n, demonstrate person precision values ranging from 0.273 to 0.446. Notably, YOLOv9s achieves the highest precision among the baselines (0.446) but comes at the cost of a significantly larger model size (7.16 M parameters). Conversely, other baseline variants, such as YOLOv11n and YOLOv12n, offer more compact architectures (around 2.5 M parameters) but yield lower precision (0.332 and 0.331, respectively).
Ablation models, which incrementally incorporate modules such as NWD, SmallObj, DualConv, and DWR, illustrate substantial improvements in person precision. For example, the addition of the NWD module to YOLOv11n increases precision to 0.350, while further enhancements with the SmallObj and DualConv modules result in dramatic precision gains, reaching up to 0.684. Importantly, these improvements are achieved without significant increases in model size, which remains below 3 M parameters.
The proposed method achieves the highest Person precision (0.703), outperforming all baseline and ablation variants. Remarkably, this performance is obtained with only 2.10 M parameters, highlighting the efficiency and effectiveness of the proposed approach. This result indicates a favorable trade-off between accuracy and model complexity, underscoring the practical value of the proposed method for resource-constrained scenarios.
Figure 14 visually compares the detection performance of multiple models across representative airport surveillance scenes. Subfigure (a) shows the original input images, while (b) to (g) depict the outputs of baseline YOLOv6n to YOLOv12n models, and (h) presents the results of the proposed method.
The baseline models, shown in (b) to (g), often miss small or occluded objects, especially persons and small vehicles in cluttered backgrounds. Detection confidence scores for persons are typically low, and false negatives are prevalent, particularly in challenging scenarios with complex visual conditions or significant occlusions. Moreover, the baseline models sometimes generate redundant bounding boxes or incorrect category predictions.
In contrast, subfigure (h) demonstrates the effectiveness of the proposed method. It achieves more accurate localization and identification of persons, even for small and partially occluded instances. The proposed model exhibits higher detection confidence and effectively reduces false negatives and redundant detections. These improvements are evident in crowded or visually challenging scenes, where the proposed method identifies more targets than the baselines.
Overall, the visual results indicate that the proposed approach significantly enhances detection accuracy and robustness, particularly for small and difficult objects, thereby providing a more reliable solution for airport scene understanding and surveillance.

5. Discussion

This study advances aerial surveillance by introducing principled innovations in bounding box regression and feature representation, specifically tailored to the challenges of small-object detection. The integration of the NWD loss addresses fundamental limitations of traditional IoU-based losses by offering a scale-invariant metric for localization, thereby enhancing the regression quality for small objects that are often overlooked or under-penalized in standard formulations.
However, the detection of small, oriented objects in remote sensing imagery requires not only precise localization but also enhanced semantic separability. In this context, future research could explore the incorporation of class-specific semantic modeling strategies, such as the differentiation-based embedding approach proposed in SemDiff [32], to further mitigate intra-class confusion and improve foreground activation. Additional directions include the adaptive tuning of NWD normalization constants based on object-specific characteristics, the integration of temporal consistency from sequential data to enhance detection stability, and the application of domain adaptation techniques to ensure robust generalization across diverse aerial surveillance scenarios.

6. Conclusions

This study proposed a unified and computationally efficient multi-scale detection framework specifically designed to address the challenges of small-object detection in aerial airport surveillance. The architecture integrates several key components: the Normalized Wasserstein Distance (NWD) loss for scale-invariant and robust bounding box regression, Dilation-wise Residual (DWR) modules to enhance multi-scale contextual extraction, a Hierarchical Screening Feature Pyramid Network (HS-FPN) for effective feature fusion across scales, and lightweight DualConv modules to preserve representational power while minimizing computational overhead. Together, these innovations effectively mitigate core challenges such as extreme scale variance, dense clutter, and low semantic resolution associated with small aerial targets.
Comprehensive evaluations on the ASS1 and ASS2 datasets demonstrate the efficacy of the proposed method. On ASS1, the framework achieves an mAP@0.5 of 89.3%, outperforming all compared YOLO variants by margins of up to 14.6 percentage points. Notably, it attains a Person AP of 72.2%, significantly higher than the 44.6% achieved by the strongest baseline (YOLOv9s), indicating improved sensitivity to small, occluded targets. On ASS2—a more limited and challenging dataset—the proposed method achieves an mAP@0.5 of 62.7%, again outperforming the strongest baseline by 12.5 points. Additionally, substantial gains are observed in category-specific performance, particularly for the Truck class, where the proposed approach improves AP by 13.1 points over YOLOv9s. These results underscore the model’s strong generalization capacity under data scarcity and its superior discriminative ability across object scales.
In summary, the proposed framework offers a balanced solution to the twin demands of high detection accuracy and real-time inference in aerial surveillance. By combining principled loss design with efficient architectural modules, the model achieves substantial improvements in small-object detection performance without sacrificing computational efficiency. The consistent gains observed across multiple datasets and object categories validate the framework’s applicability to real-world airport monitoring scenarios, where both precision and speed are essential for ensuring operational safety and autonomy.

Author Contributions

Conceptualization, Y.L.; data curation, Y.F. and S.Z.; formal analysis, Y.L., Y.F., Y.Z. and N.A.R.; investigation, Y.F.; methodology, Y.L. and Y.F.; project administration, Y.Z. and N.A.R.; resources, S.Z.; software, Y.L.; supervision, Y.Z. and N.A.R.; validation, Y.L. and Y.F.; writing—original draft, S.Z. and Y.Z.; writing—review and editing, S.Z., Y.Z. and N.A.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research was partially supported by the National Key R&D Program (Grant No. 2024YFC3014403), the Sichuan Science and Technology Program (Grant No. 2023NSFSC0753), and the Fundamental Research Funds for the Central Universities (Grant No. PHD2023-023).

Data Availability Statement

The source code presented in this study is available on GitHub at https://github.com/a211400/Small-Object-Detection-in-ASS-Dataset (accessed on 29 June 2025).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Phat, T.V.; Alam, S.; Lilith, N.; Tran, P.N.; Binh, N.T. Deep4Air: A Novel Deep Learning Framework for Airport Airside Surveillance. In Proceedings of the 2021 IEEE International Conference on Multimedia & Expo Workshops (ICMEW), Shenzhen, China, 5–9 July 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1–6. [Google Scholar] [CrossRef]
  2. Bloisi, D.; Iocchi, L.; Nardi, D.; Fiorini, M.; Graziano, G. Ground traffic surveillance system for air traffic control. In Proceedings of the 2012 12th International Conference on ITS Telecommunications, Taipei, Taiwan, 5–8 November 2012; IEEE: Piscataway, NJ, USA, 2012; pp. 135–139. [Google Scholar]
  3. Nugraha, E.S.; Apriono, C.; Zulkifli, F.Y. A systematic review of radar technologies for surveillance of foreign object debris detection on airport runway. Bull. Electr. Eng. Inform. 2024, 13, 4102–4114. [Google Scholar] [CrossRef]
  4. Ashmi, G.; Priyadharsini, R. The Modern Approaches for Identifying Foreign Object Debris (FOD) in Aviation. In Proceedings of the 2024 International Conference on Integrated Circuits and Communication Systems (ICICACS), Raichur, India, 23–24 February 2024; IEEE: Piscataway, NJ, USA, 2024; pp. 1–5. [Google Scholar]
  5. Munyer, T.; Brinkman, D.; Huang, C.; Zhong, X. Integrative use of computer vision and unmanned aircraft technologies in public inspection: Foreign object debris image collection. In Proceedings of the 22nd Annual International Conference on Digital Government Research, Omaha, NE, USA, 9–11 June 2021; pp. 437–443. [Google Scholar]
  6. Hua, W.; Chen, Q. A survey of small object detection based on deep learning in aerial images. Artif. Intell. Rev. 2025, 58, 1–67. [Google Scholar] [CrossRef]
  7. Noroozi, M.; Shah, A. Towards optimal foreign object debris detection in an airport environment. Expert Syst. Appl. 2023, 213, 118829. [Google Scholar] [CrossRef]
  8. Nikouei, M.; Baroutian, B.; Nabavi, S.; Taraghi, F.; Aghaei, A.; Sajedi, A.; Moghaddam, M.E. Small Object Detection: A Comprehensive Survey on Challenges, Techniques and Real-World Applications. arXiv 2025, arXiv:2503.20516. [Google Scholar] [CrossRef]
  9. Liu, C.; Gao, G.; Huang, Z.; Hu, Z.; Liu, Q.; Wang, Y. YOLC: You Only Look Clusters for Tiny Object Detection in Aerial Images. arXiv 2024, arXiv:2404.06180. [Google Scholar] [CrossRef]
  10. Feng, Q.; Xu, X.; Wang, Z. Deep learning-based small object detection: A survey. Math. Biosci. Eng. 2023, 20, 6551–6590. [Google Scholar] [CrossRef]
  11. Zheng, Y.; Jing, Y.; Zhao, J.; Cui, G. LAM-YOLO: Drones-based Small Object Detection on Lighting-Occlusion Attention Mechanism YOLO. arXiv 2024, arXiv:2411.00485. [Google Scholar] [CrossRef]
  12. Li, Y.; Li, Q.; Pan, J.; Zhou, Y.; Zhu, H.; Wei, H.; Liu, C. Sod-yolo: Small-object-detection algorithm based on improved yolov8 for uav images. Remote Sens. 2024, 16, 3057. [Google Scholar] [CrossRef]
  13. Fan, Q.; Li, Y.; Deveci, M.; Zhong, K.; Kadry, S. LUD-YOLO: A novel lightweight object detection network for unmanned aerial vehicle. Inf. Sci. 2025, 686, 121366. [Google Scholar] [CrossRef]
  14. Wang, C.Y.; Yeh, I.H.; Liao, H.Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  15. Doloriel, C.T.C.; Cajote, R.D. Improving the Detection of Small Oriented Objects in Aerial Images. In Proceedings of the 2023 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WACVW), Waikoloa, HI, USA, 3–7 January 2023; IEEE: Piscataway, NJ, USA, 2023; pp. 176–185. [Google Scholar] [CrossRef]
  16. Li, H.; Qu, H. DASSF: Dynamic-Attention Scale-Sequence Fusion for Aerial Object Detection. arXiv 2024, arXiv:2406.12285. [Google Scholar] [CrossRef]
  17. Tian, Y.; Ye, Q.; Doermann, D. YOLOv12: Attention-Centric Real-Time Object Detectors. arXiv 2025, arXiv:2502.12524. [Google Scholar] [CrossRef]
  18. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. arXiv 2019, arXiv:1911.09070. [Google Scholar] [CrossRef]
  19. Cao, J.; Chen, Q.; Guo, J.; Shi, R. Attention-guided Context Feature Pyramid Network for Object Detection. arXiv 2020, arXiv:2005.11475. [Google Scholar] [CrossRef]
  20. Wang, J.; Xu, C.; Yang, W.; Yu, L. A normalized Gaussian Wasserstein distance for tiny object detection. arXiv 2021, arXiv:2110.13389. [Google Scholar] [CrossRef]
  21. Tang, Y.; Su, A.; Li, Z.; Wang, Z. End-to-end one-stream object tracking based on uncertainty regression. Neurocomputing 2025, 648, 130599. [Google Scholar] [CrossRef]
  22. Yang, X.; Yan, J.; Ming, Q.; Wang, W.; Zhang, X.; Tian, Q. Rethinking Rotated Object Detection with Gaussian Wasserstein Distance Loss. arXiv 2022, arXiv:2101.11952. [Google Scholar] [CrossRef]
  23. Xu, C.; Wang, J.; Yang, W.; Yu, H.; Yu, L.; Xia, G.S. Detecting tiny objects in aerial images: A normalized Wasserstein distance and a new benchmark. ISPRS J. Photogramm. Remote Sens. 2022, 190, 79–93. [Google Scholar] [CrossRef]
  24. Wei, H.; Liu, X.; Xu, S.; Dai, Z.; Dai, Y.; Xu, X. DWRSeg: Rethinking efficient acquisition of multi-scale contextual information for real-time semantic segmentation. arXiv 2022, arXiv:2212.01173. [Google Scholar] [CrossRef]
  25. Shi, Z.; Hu, J.; Ren, J.; Ye, H.; Yuan, X.; Ouyang, Y.; He, J.; Ji, B.; Guo, J. HS-FPN: High frequency and spatial perception FPN for tiny object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 6896–6904. [Google Scholar]
  26. Zhong, J.; Chen, J.; Mian, A. DualConv: Dual convolutional kernels for lightweight deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 2022, 34, 9528–9535. [Google Scholar] [CrossRef]
  27. Zhou, W.; Cai, C.; Zheng, L.; Li, C.; Zeng, D. ASSD-YOLO: A small object detection method based on improved YOLOv7 for airport surface surveillance. Multimed. Tools Appl. 2024, 83, 55527–55548. [Google Scholar] [CrossRef]
  28. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
  29. Reis, D.; Kupec, J.; Hong, J.; Daoudi, A. Real-Time Flying Object Detection with YOLOv8. arXiv 2024, arXiv:2305.09972. [Google Scholar] [CrossRef]
  30. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar] [CrossRef]
  31. Khanam, R.; Hussain, M. YOLOv11: An Overview of the Key Architectural Enhancements. arXiv 2024, arXiv:2410.17725. [Google Scholar] [CrossRef]
  32. Yuan, X.; Cheng, G.; Yao, R.; Han, J. Semantic differentiation aids oriented small object detection. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 5966–5979. [Google Scholar] [CrossRef]
Figure 1. Illustration of scenes from the ASS1 (a) and ASS2 (b) datasets, showcasing varied perspectives and object scales typical in airport surveillance scenarios.
Figure 2. Illustration of the proposed architecture.
Figure 3. Illustration of the HS-FPN module.
Figure 4. Illustration of the Dilation-wise Residual module.
Figure 5. Illustration of the DualConv module.
Figure 6. Comparison of original IoU loss and the NWD loss across YOLOv6n–YOLOv12n on the ASS1 dataset.
Figure 7. Effect of the normalization constant C in the NWD loss on detection precision across categories (airplane, person, truck) and overall mAP@0.5 on the ASS1 dataset.
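For context on Figure 7, the NWD-based regression loss follows the formulation introduced in [20,23]; the restatement below (using the notation of those references, which may differ slightly from the symbols used elsewhere in this paper) makes the role of the normalization constant C explicit. Each bounding box is modeled as a 2D Gaussian with center (cx, cy) and covariance diag(w^2/4, h^2/4), giving
\[
W_2^2(\mathcal{N}_a,\mathcal{N}_b)
= \left\| \left( cx_a,\, cy_a,\, \tfrac{w_a}{2},\, \tfrac{h_a}{2} \right)^{\mathsf{T}}
- \left( cx_b,\, cy_b,\, \tfrac{w_b}{2},\, \tfrac{h_b}{2} \right)^{\mathsf{T}} \right\|_2^2,
\qquad
\mathrm{NWD}(\mathcal{N}_a,\mathcal{N}_b) = \exp\!\left( -\frac{\sqrt{W_2^2(\mathcal{N}_a,\mathcal{N}_b)}}{C} \right),
\qquad
\mathcal{L}_{\mathrm{NWD}} = 1 - \mathrm{NWD}(\mathcal{N}_a,\mathcal{N}_b).
\]
A larger C compresses the distance term and therefore softens the penalty for a given positional error, which is why the per-category precision and overall mAP@0.5 in Figure 7 vary with the choice of C.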
Figure 8. Comparison of mAP@0.5–0.95 between the proposed method and YOLOv6n, YOLOv8n, YOLOv9s, YOLOv10n, YOLOv11n, and YOLOv12n.
Figure 9. Comparison of mAP@0.5 between the proposed method and YOLOv6n, YOLOv8n, YOLOv9s, YOLOv10n, YOLOv11n, and YOLOv12n.
Figure 10. Precision–recall curve comparison between the proposed method and baseline models. The panels display the individual PR curves for: (a) YOLOv8n; (b) YOLOv9s; (c) YOLOv10n; (d) YOLOv11n; (e) YOLOv12n; and (f) the proposed method (Ours).
Figure 11. Qualitative comparison on the ASS1 dataset for the Airplane, Person, and Truck categories, where the red box indicates the selected region for a detailed, magnified comparison: (a) original image; (b–g) YOLOv6n, YOLOv8n, YOLOv9s, YOLOv10n, YOLOv11n, and YOLOv12n; (h) proposed method.
Figure 12. Qualitative comparison on the ASS2 dataset for the Airplane and Truck categories, where the red box indicates the selected region for a detailed, magnified comparison: (a) original image; (b–g) YOLOv6n, YOLOv8n, YOLOv9s, YOLOv10n, YOLOv11n, and YOLOv12n; (h) proposed method.
Figure 13. Performance comparison of person detection precision and model parameter count among YOLO baseline models, ablation variants, and our proposed method.
Figure 14. Visualization of object detection results in an airport scenario, where the red box highlights a challenging region selected for magnified analysis: (a) original image; (b–g) YOLOv6n, YOLOv8n, YOLOv9s, YOLOv10n, YOLOv11n, and YOLOv12n; (h) the proposed method.
Table 1. Ablation study on the ASS1 dataset. A check mark (✓) indicates that the corresponding component is enabled; Airplane, Person, and Truck report per-class AP@0.5 (%).
NWD | Smallob | DualConv | DWR | HS-FPN | Inference (ms) | Params (M) | FLOPs (G) | FPS | Airplane | Person | Truck | mAP@0.5
✓ | - | - | - | - | 10.8 | 2.58 | 6.3 | 92.6 | 99.4 | 34.8 | 92.1 | 75.4
✓ | ✓ | - | - | - | 11.3 | 2.89 | 12.3 | 88.5 | 99.4 | 69.7 | 96.0 | 88.4
✓ | ✓ | ✓ | - | - | 11.6 | 2.58 | 11.4 | 86.2 | 99.4 | 65.3 | 95.7 | 86.8
✓ | ✓ | ✓ | ✓ | - | 10.9 | 2.57 | 10.5 | 91.7 | 99.4 | 68.0 | 96.1 | 87.8
✓ | ✓ | ✓ | ✓ | ✓ | 10.6 | 2.10 | 16.4 | 94.3 | 99.3 | 72.2 | 96.3 | 89.3
Table 2. Comparison of inference time, Params, FLOPs, FPS, per-class AP@0.5, and mAP@0.5 on the ASS1 dataset. Airplane, Person, and Truck report per-class AP@0.5 (%).
Methods | Inference (ms) | Params (M) | FLOPs (G) | FPS | Airplane | Person | Truck | mAP@0.5
YOLOv6n [28] | 7.2 | 4.23 | 11.8 | 138.9 | 99.4 | 27.3 | 90.6 | 72.4
YOLOv8n [29] | 6.8 | 3.00 | 8.1 | 147.1 | 99.5 | 35.7 | 91.9 | 75.7
YOLOv9s [14] | 18.1 | 7.16 | 26.7 | 55.2 | 99.4 | 44.6 | 94.9 | 79.7
YOLOv10n [30] | 9.6 | 2.26 | 6.5 | 104.2 | 99.2 | 33.8 | 91.1 | 74.7
YOLOv11n [31] | 9.4 | 2.58 | 6.3 | 106.4 | 99.4 | 33.2 | 90.3 | 74.3
YOLOv12n [17] | 15.0 | 2.55 | 6.3 | 66.7 | 99.4 | 33.1 | 90.1 | 74.2
Ours | 10.6 | 2.10 | 16.4 | 94.3 | 99.3 | 72.2 | 96.3 | 89.3
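Note that the FPS column in Tables 2 and 3 corresponds to the reciprocal of the per-image inference time, e.g., 1000 ms / 10.6 ms ≈ 94.3 FPS for the proposed method and 1000 ms / 6.8 ms ≈ 147.1 FPS for YOLOv8n, so the two columns convey the same throughput information.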
Table 3. Comparison of inference time, Params, FLOPs, FPS, per-class AP@0.5, and mAP@0.5 on the ASS2 dataset. Airplane and Truck report per-class AP@0.5 (%).
Methods | Inference (ms) | Params (M) | FLOPs (G) | FPS | Airplane | Truck | mAP@0.5
YOLOv6n [28] | 29.0 | 4.23 | 11.8 | 34.5 | 63.6 | 19.9 | 41.8
YOLOv8n [29] | 27.5 | 3.00 | 8.1 | 36.4 | 66.4 | 24.7 | 45.5
YOLOv9s [14] | 43.2 | 7.16 | 26.7 | 23.1 | 70.9 | 29.5 | 50.2
YOLOv10n [30] | 34.0 | 2.26 | 6.5 | 29.4 | 60.6 | 24.2 | 42.4
YOLOv11n [31] | 35.4 | 2.58 | 6.3 | 28.2 | 67.1 | 25.8 | 46.5
YOLOv12n [17] | 39.0 | 2.55 | 6.3 | 25.6 | 67.3 | 25.4 | 46.4
Ours | 36.8 | 2.10 | 16.4 | 27.2 | 82.7 | 42.6 | 62.7
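In both tables, the mAP@0.5 column is the unweighted mean of the per-class AP@0.5 values; for example, on ASS2 the proposed method attains (82.7 + 42.6) / 2 ≈ 62.7, matching the reported mAP@0.5.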
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
