Article

RMVAD-YOLO: A Robust Multi-View Aircraft Detection Model for Imbalanced and Similar Classes

1 National Laboratory on Adaptive Optics, Chengdu 610209, China
2 University of Chinese Academy of Sciences, Beijing 101408, China
3 Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu 610209, China
* Author to whom correspondence should be addressed.
Remote Sens. 2025, 17(6), 1001; https://doi.org/10.3390/rs17061001
Submission received: 4 February 2025 / Revised: 8 March 2025 / Accepted: 11 March 2025 / Published: 12 March 2025
(This article belongs to the Section Remote Sensing Image Processing)

Abstract

Aircraft detection technology plays a vital role in civilian applications, with significant attention being devoted to research on related algorithms in recent years. However, most existing research predominantly focuses on aircraft detection from a single top–down viewpoint, which constrains the applicability of detection technology across diverse scenarios. To overcome this limitation, we propose RMVAD-YOLO, a multi-view aircraft detection model built upon YOLOv8. First, we propose a novel Robust Multi-Link Scale Interactive Feature Pyramid Network (RMSFPN), which robustly extracts features of the same aircraft category from multiple views while enhancing feature differentiation between different aircraft categories. Second, we propose the Shared Convolutional Dynamic Alignment Detection Head (SCDADH), which enhances task interaction and collaboration by sharing convolutions between the classification and localization branches while simultaneously reducing the number of parameters, enhancing the model’s ability to deal with multi-scale targets. Additionally, to further leverage background information and enhance the model’s adaptability to multi-scale target variations, we incorporate the LSK Module into the backbone network. Finally, we propose the WFMIoUv3 loss function, which strengthens the model’s focus on challenging samples and improves detection robustness. Experimental results on the newly released Multi-Perspective Aircraft Dataset (MAD) demonstrate that RMVAD-YOLO achieves a precision of 90.1%, a recall of 76.0%, 84.8% mAP@0.5, and 70.5% mAP@0.5:0.95, while reducing the parameter count and delivering an overall improvement in detection performance compared to the baseline YOLOv8n. RMVAD-YOLO also performs well on the VisDrone2019 dataset, further demonstrating its reliable generalization capabilities.

1. Introduction

As a critical mode of transportation, aircraft play a pivotal role in civilian sectors, and associated detection technologies have become a major focus of research [1,2]. These technologies are essential for air traffic control, aviation monitoring [3], and airport planning [4]. Moreover, the accurate identification of aircraft types is crucial for enhancing safety protocols, optimizing flight schedules, and streamlining airport operations [5].
Although considerable research has been conducted on aircraft detection, this research primarily focuses on top–down detection methods [6,7], which limits the scope and applicability of the available technology. In critical tasks such as air traffic control and search and rescue, the viewing angles of aircraft vary significantly, including downward, eye-level, upward, and oblique perspectives, as illustrated in Figure 1. Variations in viewing angles not only affect detection performance, but also introduce ambiguities in distinguishing similar aircraft types, further complicating recognition tasks in real-world applications. Detectors trained on a single top–down dataset often lack robustness in complex application scenarios, making it difficult to meet real-world demands.
Multi-view aircraft detection currently faces numerous challenges. Under multi-view conditions, aircraft exhibit a greater diversity of pose variations, causing the same aircraft type to present vastly different features across different viewpoints. Moreover, these varying features can easily overlap with those of other aircraft categories, making it more complex to effectively aggregate features for accurate aircraft category detection. At the same time, differences in flight altitude and viewing angles result in more pronounced variations in target scale. Additionally, imbalanced class distributions often lead to poor detection performance for underrepresented aircraft categories, which is unacceptable in safety monitoring and military applications. Therefore, this study aims to address three key challenges: (1) how to robustly extract features of the same aircraft category across multiple viewpoints while enhancing feature distinction between different aircraft categories; (2) how to handle scale variations caused by changes in flight altitude and viewing angle; (3) how to mitigate the impact of class imbalance, which often leads to inferior detection performance for minority aircraft categories.
With the rapid advancement of deep learning, object detection algorithms have made remarkable progress, leading to the emergence of numerous state-of-the-art models. Among these, YOLO has attracted considerable attention from researchers owing to its balance between detection speed and accuracy.
This paper proposes a novel multi-view aircraft detection model, RMVAD-YOLO, which leverages the strengths of YOLOv8 and incorporates targeted improvements to better tackle the challenges of multi-view aircraft detection and classification.
This paper offers the following principal contributions:
(1) We propose a Robust Multi-Link Scale Interactive Feature Pyramid Network (RMSFPN) that employs a flexible skip connection, seamlessly integrates output information from both the neck and the backbone, and incorporates the Neck Heterogeneous Kernel Selection Mechanism (NHKSM) and a channel rearrangement mechanism. This enables the model to robustly extract multi-view aircraft features, effectively capturing fine details across different perspectives. By integrating these features, the model enhances its ability to handle the complexities of aircraft detection, particularly in distinguishing between similar aircraft categories.
(2) We propose the Shared Convolutional Dynamic Alignment Detection Head (SCDADH). The SCDADH facilitates task interaction by sharing convolutions between the classification and localization branches of each detection head, minimizing the number of parameters, while a task decomposition module keeps the classification and localization processes independent, thereby enhancing the model’s ability to deal with multi-scale targets.
(3) We integrate an LSK Module [8] into the backbone network. This dynamically adjusts the receptive field size, thereby effectively capturing the contextual information essential for accurate object detection in multi-view aircraft images. This enhances the model’s ability to adapt to complex backgrounds, extract multi-scale target features, and strengthen its focus on fine-grained details.
(4) We combine the advantages of the Wise-IoUv3 [9], Focaler-IoU [10], and MPDIoU [11] loss functions to propose WFMIoUv3, which replaces the traditional CIoU and directs more focus towards challenging samples, effectively addressing the issue of class imbalance.
The remainder of this paper is organised as follows: Section 2 presents a historical overview of object detection and aircraft detection techniques. Section 3 introduces an enhanced model for detecting multi-view aircraft and details the proposed model architecture and principles. Section 4 presents the dataset and experimental setup. Section 5 presents the experimental results and discussion. Section 6 describes how RMVAD-YOLO addresses the challenges encountered in multi-view aircraft detection tasks. Section 7 highlights the primary conclusions of this paper and suggests directions for future research.

2. Related Work

2.1. Object Detection

The objective of object detection in computer vision is to detect and locate target objects within images and videos. Traditional object detection algorithms, such as Support Vector Machines (SVMs) [12], decision trees [13], and AdaBoost [14], primarily rely on manually extracted features or machine learning techniques to identify and localize objects. In 2012, the emergence of AlexNet [15] marked the official transition of the object detection field into the era of deep learning. Deep learning-based object detection algorithms can be classified into two categories: two-stage and one-stage methods. Two-stage algorithms first generate candidate regions, then perform classification and bounding box regression to identify the object category and refine its location. Relevant networks include R-CNN [16], SPP-Net [17], Fast R-CNN [18], Faster R-CNN [19], and others. These algorithms have achieved remarkable accuracy but at the cost of speed. In contrast, single-stage object detection algorithms, such as SSD [20], RetinaNet [21], the YOLO series, and RT-DETR [22], adopt an end-to-end dense prediction method, eliminating the need for pre-selecting candidate regions and directly using deep convolutional networks to predict the location and category of objects within the original image, thereby greatly improving detection efficiency.
In 2015, Joseph Redmon et al. introduced the YOLO algorithm [23], which overcame the limitations of two-stage detection algorithms and significantly improved detection speed. They subsequently introduced YOLOv2 [24] and YOLOv3 [25], which incorporated feature pyramid networks (FPNs) [26] for multi-scale detection, balancing accuracy and practicality. In YOLOv4 [27], CSPNet served as the backbone network to improve small object detection. YOLOv5, compared to YOLOv4, offers better usability and faster convergence. YOLOX [28] and YOLOv6 [29] introduced anchor-free, end-to-end object detection frameworks. YOLOv7 [30] restructured the detection head and demonstrated strong performance across multiple benchmark datasets. YOLOv8 adjusts the number of channels according to the model scale and replaces the prediction head with a decoupled head structure, adopting an anchor-free detection framework. Subsequently, YOLOv9 [31], YOLOv10 [32], and YOLOv11 were proposed in succession. Nevertheless, with its flexible network architecture, YOLOv8 achieves superior detection performance and has become the most widely adopted model in the YOLO series.

2.2. Aircraft Detection

Early aircraft detection primarily relied on manually designed template matching techniques. Xu et al. [33] employed edge-preserving filtering (EPF) to efficiently perform the aircraft detection task using deformable templates. Lin et al. [34] proposed using the rotationally invariant radial gradient angle (RGA) feature for template matching. Yu et al. [35] incorporated scale priors into the Hough forest algorithm to achieve scale invariance. However, traditional methods rely heavily on manually designed features and perform poorly when confronted with large datasets and complex backgrounds. Given the rapid advancements in deep learning, researchers have placed increasing emphasis on deep learning-driven aircraft detection methods. Ding et al. [36] employed a multi-scale method to enhance VGG16-Net, improving the efficiency of object detection in satellite optical remote sensing images. Yang et al. [37] enhanced ResNet [38] and applied Super Vector Coding (SV) to HOG features from regions of interest, improving detection speed. In the YOLO series, Wu et al. [39] improved YOLOv5 by using EIoU loss for target localization, fully utilizing low-semantic feature maps to enable the network to effectively focus on small aircraft targets and enhance robustness. In aircraft classification, various methods have been proposed to address the challenges of fine-grained recognition and feature discrimination. Wu et al. [40] introduced FGA-YOLO, an efficient model that integrates a triple-feature fusion module and a global multi-scale module to enhance feature extraction, significantly improving classification accuracy. Yu et al. [41] tackled the issue of misclassification among visually similar sub-categories by proposing an improved cross-entropy loss function and attention-based modules, effectively refining spatial dependencies and reducing classification errors. Meanwhile, Wan et al. [42] developed a learnable Gabor filter-based texture feature extractor and a contrastive learning module, enhancing feature discrimination and improving recognition performance on fine-grained aircraft datasets. Yang et al. [43] introduced a multi-view aircraft detection dataset and employed a multilayer perceptron in combination with dynamic receptive field technology to extract aircraft features, thereby enhancing the model’s capability to detect similar aircraft types. Subsequent refinements to the focal loss function mitigated the issue of imbalanced aircraft samples. These enhancements partially improved the model’s detection performance for multi-view aircraft; however, certain limitations persist.
Currently, aircraft detection is primarily performed from a single viewpoint. However, in multi-view scenarios, the increased variety of angles introduces significant challenges for accurate aircraft classification and detection. Aircraft from different classes may appear visually similar from certain viewpoints, complicating accurate detection, as the model may struggle to distinguish between them. Conversely, aircraft from the same class can exhibit highly varying visual characteristics from different perspectives. These changes in appearance can prevent the model from recognizing them as belonging to the same class, thereby degrading the model’s ability to detect aircraft across different viewpoints. The crux of the problem lies in how to effectively capture and align these diverse features from multiple angles, ensuring that the model can both distinguish between different classes and maintain consistency within the same class across varying viewpoints.

3. Method

3.1. Introduction to the RMVAD-YOLO Model

RMVAD-YOLO is an optimized variant of YOLOv8n, and its architecture is illustrated in Figure 2. The backbone processes the original aircraft image into a standardized 640 × 640 × 3 format. To enhance the model’s adaptability to multi-scale variations, the LSK Module is integrated, leveraging background information to improve feature extraction across different scales. The extracted features are then fed into RMSFPN, which robustly captures multi-view aircraft features and enhances the accuracy of aircraft category detection. Additionally, the incorporation of SCDADH further strengthens the model’s ability to handle scale variations by improving feature interaction between different levels. Finally, the model employs WFMIoUv3 as the loss function to mitigate the impact of class imbalance, ensuring more stable detection performance across all aircraft categories.

3.2. Robust Multi-Link Scale Interactive Feature Pyramid Network

To enable more accurate detection of aircraft from different perspectives, it is essential to robustly extract richer features. YOLOv8 employs PANet [44] for feature fusion. This fusion strategy has notable limitations. In the bottom–up link, it fails to fully leverage the detailed features from the shallower layers of the backbone network, resulting in poor performance in detecting small-target aircraft. In the top–down link, the downsampled features are fused only with feature maps of the same size, leading to a significant loss of semantic information and causing the model to mis-detect aircraft from similar categories. To address these limitations, we propose the RMSFPN, as shown in Figure 3. Compared to PANet, RMSFPN incorporates three key optimizations.

3.2.1. Improvements in the Structure

To enhance the model’s detection performance for small-target aircraft, the bottom–up link further downsamples the spatial features from the shallow backbone network while increasing the number of channels, thereby retaining more detailed information. To improve the model’s detection accuracy for similar aircraft categories, the top–down link further up-samples the deeply fused semantic information and integrates it with the shallow detailed features, thus enabling further interaction between the deep semantic and shallow detailed information.

3.2.2. Multi-Scale Deep Convolutional Aggregation Module

The size, shape, and surface characteristics of an aircraft vary significantly across multiple viewpoints, and conventional C2f modules are unable to robustly capture these variations in multi-view aircraft inspection scenarios. To overcome this limitation, we propose the Multi-scale Deep Convolutional Aggregation Module (MDCAM), which retains the overall CSP structure, as shown in Figure 4.
The Multi-scale Deep Convolutional Aggregation Block (MDCAB) is the key component of the MDCAM; its structure is shown in Figure 5.
The MDCAB expands the input feature maps through a point convolutional layer with an expansion factor of 2, thereby providing a richer representation for subsequent feature extraction. To robustly extract aircraft features across multiple viewpoints, the expanded feature maps are passed through deep convolutional layers with varying kernel sizes for multi-scale feature extraction. This multi-scale extraction is crucial for capturing the aircraft’s appearance across varying angles and helps the model retain detailed information across multiple viewpoints. Subsequently, the outputs from the three deep convolutional layers are fused and summed, with a channel rearrangement mechanism facilitating the exchange of information between channels, thus enhancing the model’s adaptability to multi-view aircraft images.

3.2.3. Neck Heterogeneous Kernel Selection Mechanism

The selection of convolution kernel size in the MDCAB is crucial for detection performance. Building upon the Heterogeneous Kernel Selection Protocol proposed by Chen et al. [45], we developed the Neck Heterogeneous Kernel Selection Mechanism, which applies convolution kernels of varying sizes to different levels of feature maps. Deep-level large feature maps use larger convolution kernels to extract coarse-grained features, while shallow-level small feature maps use smaller convolution kernels to extract fine-grained features, thus achieving a favourable trade-off between speed and accuracy.
Specifically, we introduce convolutional kernels of sizes 1, 3, 5, 7, and 9, and assign three convolutional kernels of different sizes to each MDCAM at various positions within the RMSFPN, labelled as a, b, and c in Figure 5. The MDCAMs are numbered based on the sequence in which the feature maps pass through them, and the kernel sizes for each MDCAM are summarized in Table 1.
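To make the structure concrete, the following PyTorch sketch implements the MDCAB as described above: a point convolution with an expansion factor of 2, three depth-wise convolutions whose kernel triple (a, b, c) is supplied by the NHKSM according to the MDCAM’s position (Table 1), summation of the three branches, channel rearrangement, and a projection back to the input width. The residual connection, the 1 × 1 projection, the group count of the channel shuffle, and the SiLU activation are our assumptions rather than details taken from the paper.

import torch
import torch.nn as nn

class MDCAB(nn.Module):
    # Sketch of a Multi-scale Deep Convolutional Aggregation Block.
    # kernels: the (a, b, c) kernel triple assigned by the NHKSM for this position.
    def __init__(self, channels, kernels=(3, 5, 7), shuffle_groups=4):
        super().__init__()
        hidden = channels * 2                                # expansion factor of 2
        self.expand = nn.Conv2d(channels, hidden, kernel_size=1)
        self.branches = nn.ModuleList(
            nn.Conv2d(hidden, hidden, k, padding=k // 2, groups=hidden)  # depth-wise convs
            for k in kernels
        )
        self.project = nn.Conv2d(hidden, channels, kernel_size=1)
        self.act = nn.SiLU()
        self.shuffle_groups = shuffle_groups

    def forward(self, x):
        y = self.act(self.expand(x))                         # point-wise expansion
        y = sum(branch(y) for branch in self.branches)       # fuse the three scales by summation
        b, c, h, w = y.shape                                 # channel rearrangement (shuffle)
        y = y.view(b, self.shuffle_groups, c // self.shuffle_groups, h, w)
        y = y.transpose(1, 2).reshape(b, c, h, w)
        return x + self.project(y)                           # assumed residual connection

# Example: a deeper-level MDCAM would receive a larger kernel triple from the NHKSM.
block = MDCAB(channels=128, kernels=(5, 7, 9))
out = block(torch.randn(1, 128, 40, 40))                     # shape preserved: (1, 128, 40, 40)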

3.3. Shared Convolutional Dynamic Alignment Detection Head

The scale variation in aircraft across multiple viewpoints is often significant; the decoupled head of YOLOv8 employs independent classification and localization branches, resulting in a lack of interaction between the two tasks and an inability to effectively leverage relevant information across different layers, hindering the model’s optimal performance in handling multi-scale targets. Furthermore, the excessive number of convolutions results in a large number of model parameters, making deployment challenging. To address this, we propose the SCDADH, the structure of which is depicted in Figure 6.
The shared convolutional layer consists of two convolutions, with parameters shared across the classification and localization branches of all detection heads. The input feature map $X$ passes through the two shared convolutions sequentially, and their outputs are concatenated along the channel dimension to form the joint feature $F_{shared}$. This approach preserves crucial morphological information about the aircraft at multiple scales while lowering the number of parameters. The convolution operation and the joint feature $F_{shared}$ are formulated as follows:
$\mathrm{CGS}(X) = \mathrm{SiLU}(\mathrm{GN}(\mathrm{Conv2d}(X)))$ (1)
$F_{shared} = \mathrm{Concat}(\mathrm{CGS}_1(X), \mathrm{CGS}_2(\mathrm{CGS}_1(X)))$ (2)
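As a concrete reading of Equations (1) and (2), the sketch below builds the CGS unit (Conv2d followed by GroupNorm and SiLU) and constructs $F_{shared}$ by reusing the same two CGS instances; sharing these modules across all detection-head levels is what reduces the parameter count. The kernel size and the number of GroupNorm groups are assumptions.

import torch
import torch.nn as nn

class CGS(nn.Module):
    # Conv2d -> GroupNorm -> SiLU, as in Equation (1). Kernel size and group count assumed.
    def __init__(self, channels, groups=16):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.gn = nn.GroupNorm(groups, channels)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.gn(self.conv(x)))

class SharedConv(nn.Module):
    # Builds F_shared as in Equation (2); the same cgs1/cgs2 instances are applied to
    # the feature map of every detection-head level, which is the parameter sharing.
    def __init__(self, channels):
        super().__init__()
        self.cgs1 = CGS(channels)
        self.cgs2 = CGS(channels)

    def forward(self, x):
        y1 = self.cgs1(x)
        y2 = self.cgs2(y1)
        return torch.cat((y1, y2), dim=1)        # concatenate along the channel dimension

shared = SharedConv(channels=64)
f_shared = shared(torch.randn(1, 64, 80, 80))    # -> (1, 128, 80, 80)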
To mitigate excessive feature interference between tasks, we designed a task decomposition module. The joint feature $F_{shared}$ is first processed by adaptive average pooling to generate a fixed-size global feature vector $avg\_F_{shared}$, which encapsulates the average information of the input feature map and effectively captures key patterns and contextual information across the entire image. Subsequently, $avg\_F_{shared}$ is passed to the task decomposition modules for classification and localization, where it generates the attention weight $W$, calculated using the following formula, in which $\sigma$ denotes the Sigmoid activation function and $\delta$ the ReLU function:
$W = \sigma(\mathrm{Conv2d}(\delta(\mathrm{Conv2d}(avg\_F_{shared}))))$ (3)
To boost the expressiveness of the feature map, the joint feature $F_{shared}$ and the attention weight $W$ undergo batch matrix multiplication, followed by normalization and activation. The final classification feature $F_{cls}$ and localization feature $F_{reg}$ after task decomposition are obtained as shown in Equation (4), where $\mathrm{bmm}$ denotes the batch matrix multiplication function:
$F_{final} = \mathrm{SiLU}(\mathrm{GN}(\mathrm{bmm}(F_{shared}, W)))$ (4)
In the classification task, the feature $F_{shared}$ undergoes a convolution operation to reduce its dimensionality and enhance the feature representation capability, followed by the Sigmoid activation function. Then, $F_{shared}$ and $F_{final}$ are combined via batch matrix multiplication to obtain the probability distribution $Cls_{prob}$ for each class:
$Cls_{prob} = \mathrm{bmm}(F_{final}, \sigma(\mathrm{Conv2d}(\delta(\mathrm{Conv2d}(F_{shared})))))$ (5)
In the localization task, the feature $F_{shared}$ is passed through convolutional layers to generate offsets and masks for regression alignment. Then, both $F_{shared}$ and $F_{final}$ are fed into the dynamic adjustment mechanism of Deformable Convolution DCNv2 [46], ensuring more precise localization of target objects in multi-view scenarios.
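The sketch below illustrates one way this regression alignment can be realized with torchvision’s modulated deformable convolution (DCNv2): convolutions on $F_{shared}$ predict the sampling offsets and sigmoid-bounded modulation masks, which then steer a deformable convolution applied to $F_{final}$. The channel widths, kernel size, and exact wiring are assumptions made for illustration.

import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class RegressionAlignment(nn.Module):
    # F_shared predicts offsets and masks; DCNv2 then aligns F_final for localization.
    def __init__(self, shared_channels, feat_channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        n = kernel_size * kernel_size
        self.offset_conv = nn.Conv2d(shared_channels, 2 * n, kernel_size, padding=pad)
        self.mask_conv = nn.Conv2d(shared_channels, n, kernel_size, padding=pad)
        self.dcn = DeformConv2d(feat_channels, feat_channels, kernel_size, padding=pad)

    def forward(self, f_shared, f_final):
        offset = self.offset_conv(f_shared)               # per-location sampling offsets
        mask = torch.sigmoid(self.mask_conv(f_shared))    # modulation masks in [0, 1]
        return self.dcn(f_final, offset, mask)            # aligned regression feature

align = RegressionAlignment(shared_channels=128, feat_channels=128)
out = align(torch.randn(1, 128, 80, 80), torch.randn(1, 128, 80, 80))  # -> (1, 128, 80, 80)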
This design enables the SCDADH to overcome the limitations of insufficient feature sharing between classification and localization tasks, as well as the lack of multi-scale processing capabilities in traditional detection heads. This improves the model’s sensitivity to variations in aircraft shape and position.

3.4. LSK Module

To fully exploit background details and further improve the network’s adaptability to multi-scale variations in the target, we integrate the LSKNet-based feature extraction module into the backbone network. LSKNet adapts the receptive field size dynamically based on the requirements. The central component, the LSK Module, is shown in Figure 7. X and Y indicate input and output, respectively, F indicates the convolution operation, and S refers to the attention feature map.
The LSK Module first decomposes a large kernel into 2–3 smaller depth-wise kernels. The input feature is $X$. The feature $U_{i+1}$ of each layer is derived from the previous layer’s feature $U_i$ through the depth-wise convolution $\mathcal{F}_i^{dw}$, and is then processed with a $1 \times 1$ convolution $\mathcal{F}_i^{1\times 1}$ to obtain $\tilde{U}_i$:
$U_0 = X, \quad U_{i+1} = \mathcal{F}_i^{dw}(U_i)$ (6)
$\tilde{U}_i = \mathcal{F}_i^{1\times 1}(U_i), \quad \text{for } i \in [1, N]$ (7)
The processed features $\tilde{U}_i$ are concatenated to obtain a large feature map $\tilde{U}$, on which average pooling and max pooling are then performed separately to extract the effective spatial features $SA_{avg}$ and $SA_{max}$. A convolution layer $\mathcal{F}^{2 \to N}$ converts the two pooled channels into $N$ spatial attention feature maps $\widehat{SA}$, as follows:
$\tilde{U} = [\tilde{U}_1; \cdots; \tilde{U}_N]$ (8)
$SA_{avg} = \mathcal{P}_{avg}(\tilde{U}), \quad SA_{max} = \mathcal{P}_{max}(\tilde{U})$ (9)
$\widehat{SA} = \mathcal{F}^{2 \to N}([SA_{avg}; SA_{max}])$ (10)
For each spatial attention map $\widehat{SA}_i$, a Sigmoid activation function is applied to obtain an independent spatial mask $\widetilde{SA}_i$, which is multiplied with the corresponding feature map $\tilde{U}_i$; the results are then fused through the convolutional layer $\mathcal{F}$ to obtain the attention feature $S$:
$\widetilde{SA}_i = \sigma(\widehat{SA}_i)$ (11)
$S = \mathcal{F}\left(\sum_{i=1}^{N} \widetilde{SA}_i \cdot \tilde{U}_i\right)$ (12)
The final output $Y$ is derived by combining the input feature $X$ and the attention feature $S$ through a weighted fusion process:
$Y = X \cdot S$ (13)
The LSK Module utilizes a larger receptive field for small objects to capture more context-sensitive information, while maintaining a moderate receptive field for large objects to mitigate the risk of feature oversmoothing. This design further improves the model’s adaptability to multi-scale variations in the target.
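The following sketch implements the LSK Module for the case of $N = 2$ sub-kernels, following Equations (6)–(13). The specific kernel sizes (a 5 × 5 depth-wise convolution followed by a 7 × 7 depth-wise convolution with dilation 3) and the 7 × 7 squeeze convolution mirror the public LSKNet design and should be treated as assumptions here.

import torch
import torch.nn as nn

class LSKModule(nn.Module):
    # Large Selective Kernel attention with N = 2, per Equations (6)-(13).
    def __init__(self, dim):
        super().__init__()
        self.dw1 = nn.Conv2d(dim, dim, 5, padding=2, groups=dim)                  # F_1^dw
        self.dw2 = nn.Conv2d(dim, dim, 7, padding=9, dilation=3, groups=dim)      # F_2^dw
        self.pw1 = nn.Conv2d(dim, dim // 2, 1)                                    # F_1^{1x1}
        self.pw2 = nn.Conv2d(dim, dim // 2, 1)                                    # F_2^{1x1}
        self.squeeze = nn.Conv2d(2, 2, 7, padding=3)                              # F^{2->N}
        self.fuse = nn.Conv2d(dim // 2, dim, 1)                                   # F

    def forward(self, x):
        u1 = self.dw1(x)                                    # U_1
        u2 = self.dw2(u1)                                   # U_2 (larger receptive field)
        t1, t2 = self.pw1(u1), self.pw2(u2)                 # U~_1, U~_2
        cat = torch.cat([t1, t2], dim=1)                    # U~
        sa_avg = cat.mean(dim=1, keepdim=True)              # P_avg(U~)
        sa_max = cat.max(dim=1, keepdim=True).values        # P_max(U~)
        sa_hat = self.squeeze(torch.cat([sa_avg, sa_max], dim=1))   # N spatial attention maps
        masks = torch.sigmoid(sa_hat)                       # SA~_i
        s = self.fuse(t1 * masks[:, 0:1] + t2 * masks[:, 1:2])      # attention feature S
        return x * s                                        # Y = X · S

lsk = LSKModule(dim=256)
y = lsk(torch.randn(1, 256, 20, 20))                        # -> (1, 256, 20, 20)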

3.5. Loss Function Optimisation

The traditional YOLOv8 employs CIoU as the bounding box loss function, which does not account for the influence of sample distribution (difficult vs. easy) in bounding box regression. This leads to missed detections and false positives, particularly for aircraft types with limited samples. To address this issue, we combine the concepts of the Wise-IoUv3, Focaler-IoU, and MPDIoU loss functions and propose WFMIoUv3 as a replacement for the traditional CIoU.
Focaler-IoU reconstructs the IoU loss using a linear mapping approach, enabling the model to focus effectively on difficult and easy samples and thus addressing the issue of sample imbalance among aircraft categories. Specifically, as illustrated in Equation (14), where $IoU$ represents the original IoU value, adjusting the parameters $d$ and $u$ enables $IoU_{focaler}$ to focus on different samples:
$IoU_{focaler} = \begin{cases} 0, & IoU < d \\ \dfrac{IoU - d}{u - d}, & d \le IoU \le u \\ 1, & IoU > u \end{cases}$ (14)
MPDIoU handles situations where the aspect ratio of the predicted box is equal to that of the ground truth but their specific sizes differ, an issue not considered by traditional IoU. Equation (15) illustrates this, with $w$ and $h$ denoting the width and height of the input image, and $d_1$ and $d_2$ the Euclidean distances between the upper-left and lower-right corners of the ground-truth and predicted boxes, respectively:
$MPDIoU = IoU - \dfrac{d_1^2}{w^2 + h^2} - \dfrac{d_2^2}{w^2 + h^2}$ (15)
The diverse variations in pose and size in multi-view aircraft datasets often result in a large number of low-quality samples. To prevent the model from overfitting to these samples and to further enhance its generalization capability, this study also employs the Wise-IoUv3 loss function to achieve loss balancing. Wise-IoUv3 addresses the issue of harmful gradients from low-quality samples by incorporating a dynamic non-monotonic focusing mechanism along with gradient scaling. The calculation process is provided in Equation (16), where $\delta$ and $\alpha$ represent a set of hyperparameters and $\beta$ denotes the outlier degree of a sample. $R_{WIoU}$ significantly amplifies the $L_{IoU}$ of common anchor boxes, directing the model’s focus toward common samples, while samples with a larger $\beta$ (i.e., lower-quality samples) are assigned reduced gradient gains, which further prevents the generation of large harmful gradients from low-quality samples.
$L_{WIoUv3} = r \cdot R_{WIoU} \cdot L_{IoU}, \quad r = \dfrac{\beta}{\delta \alpha^{\beta - \delta}}$ (16)
The targeted optimization and combination of the three loss functions not only enhance the model’s capability to focus on challenging samples, but also utilize the dynamic non-monotonic focusing mechanism to improve its generalization capability, thereby improving its suitability for multi-view aircraft detection tasks.
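To clarify how the three components interact, the sketch below combines the Focaler remapping of Equation (14), the MPDIoU corner penalties of Equation (15), and the Wise-IoUv3 focusing of Equation (16) into a single bounding-box loss for axis-aligned (x1, y1, x2, y2) boxes. The hyperparameter defaults and the batch-mean approximation of the outlier degree β (the original Wise-IoUv3 maintains an exponential running mean) are assumptions, and the exact order in which the paper composes the terms may differ.

import torch

def wfm_iou_v3(pred, target, img_w, img_h, d=0.0, u=0.95, alpha=1.9, delta=3.0):
    # pred, target: [N, 4] boxes in (x1, y1, x2, y2) format.
    px1, py1, px2, py2 = pred.unbind(-1)
    tx1, ty1, tx2, ty2 = target.unbind(-1)

    # Plain IoU.
    inter_w = (torch.min(px2, tx2) - torch.max(px1, tx1)).clamp(min=0)
    inter_h = (torch.min(py2, ty2) - torch.max(py1, ty1)).clamp(min=0)
    inter = inter_w * inter_h
    union = (px2 - px1) * (py2 - py1) + (tx2 - tx1) * (ty2 - ty1) - inter
    iou = inter / union.clamp(min=1e-7)

    # MPDIoU (Eq. 15): corner-distance penalties normalized by the image diagonal.
    d1_sq = (px1 - tx1) ** 2 + (py1 - ty1) ** 2
    d2_sq = (px2 - tx2) ** 2 + (py2 - ty2) ** 2
    norm = img_w ** 2 + img_h ** 2
    mpd_iou = iou - d1_sq / norm - d2_sq / norm

    # Focaler linear remapping (Eq. 14) applied to the IoU measure.
    focaler = ((mpd_iou - d) / (u - d)).clamp(min=0.0, max=1.0)
    loss_iou = 1.0 - focaler

    # Wise-IoUv3 focusing (Eq. 16): R_WIoU amplifies ordinary anchors via the
    # normalized center distance (enclosing-box denominator detached, as in Wise-IoU).
    pcx, pcy = (px1 + px2) / 2, (py1 + py2) / 2
    tcx, tcy = (tx1 + tx2) / 2, (ty1 + ty2) / 2
    cw = torch.max(px2, tx2) - torch.min(px1, tx1)
    ch = torch.max(py2, ty2) - torch.min(py1, ty1)
    r_wiou = torch.exp(((pcx - tcx) ** 2 + (pcy - tcy) ** 2)
                       / (cw ** 2 + ch ** 2).detach().clamp(min=1e-7))

    # Outlier degree beta approximated with the batch mean of the IoU loss;
    # the original Wise-IoUv3 uses an exponential running mean instead.
    beta = loss_iou.detach() / loss_iou.detach().mean().clamp(min=1e-7)
    r = beta / (delta * alpha ** (beta - delta))

    return (r * r_wiou * loss_iou).mean()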

4. Datasets and Experimental Environment

4.1. Dataset

To assess the effectiveness of the model improvements, we performed a series of experiments using the newly released Multi-Perspective Aircraft Dataset (MAD), developed by the State Key Laboratory of Mechanical Transmission at Chongqing University and released in late August 2024. Comprising 13,205 high-quality images, the dataset captures aircraft from various perspectives, under changing weather conditions, and in diverse scenes and poses. It includes ten aircraft types, with the number of aircraft per image ranging from 1 to 38. This variability enhances its diversity, thereby supporting the training of more robust detection models.
Figure 8 provides an overview of the MAD dataset. Figure 8a depicts the distribution of instances across the ten aircraft categories, where the horizontal axis denotes the aircraft categories and the vertical axis represents the number of instances per category. A pronounced class imbalance is observed: the A1 category comprises 4698 instances (40% of the total dataset), whereas the least represented category, A4, contains only 184 instances. This imbalance may cause the model to underemphasize rare aircraft categories, thereby impairing detection performance. Figure 8b presents the distribution of aircraft target sizes relative to image dimensions, where the horizontal and vertical axes indicate the width and height ratios of the targets to the image, respectively. Most aircraft targets cluster in the lower-left and upper-right regions, suggesting that they are either very small or occupy a significant portion of the image. Conversely, the middle region is relatively sparse, indicating a lower occurrence of moderately sized targets. Such substantial scale variation presents a major challenge for robust multi-view aircraft detection.
Furthermore, the dataset is officially divided into training, validation, and test sets following a 6:2:2 ratio; therefore, we did not re-partition the data.

4.2. Experimental Evaluation Indicators

For a precise evaluation of the model’s efficacy in the multi-view aircraft detection task, we utilize precision (P), recall (R), and mean average precision (mAP) as the key indicators of detection accuracy. Moreover, we consider parameter count (M) along with floating-point operations (GFLOPS) to measure the complexity of the model and operational efficiency.
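For reference, these metrics follow their standard definitions in terms of true positives (TP), false positives (FP), and false negatives (FN); AP is the area under the precision–recall curve of a class, and mAP averages AP over the $N$ classes:
$P = \dfrac{TP}{TP + FP}, \qquad R = \dfrac{TP}{TP + FN}$
$AP = \int_0^1 P(R)\,\mathrm{d}R, \qquad mAP = \dfrac{1}{N}\sum_{i=1}^{N} AP_i$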

4.3. Experimental Configuration

The experimental setup was based on an Ubuntu 18.04 environment with CUDA 11.8 and PyTorch 2.5.0. The hardware configuration included an Intel Xeon Gold 5220R processor and an A100 GPU. Further details on training parameters are provided in Table 2.

5. Experiment

We conducted comparison experiments for each module based on YOLOv8n, along with ablation experiments, comparative experiments, visualization experiments, and generalization experiments for the overall model.

5.1. Module Comparison Experiment

5.1.1. Comparison of Feature Pyramid Networks

To assess the effectiveness of the RMSFPN in multi-view aircraft detection, we conducted a comparison with eight widely used feature pyramid networks, with the results presented in Table 3.
As shown in the table above, the RMSFPN demonstrates strong performance in the multi-view aircraft detection task, achieving first place for both mAP@0.5 (83.5%) and mAP@0.5:0.95 (70.6%). Its precision and recall also rank among the best, highlighting its exceptional detection capability. Compared to the baseline model, RMSFPN improves recall, mAP@0.5, and mAP@0.5:0.95 by 1.8%, 1.0%, and 1.3%, respectively, while reducing the model parameter count by 0.1M. The increase in computational cost is minimal, rising by only 0.3, indicating that RMSFPN successfully reduces model complexity while maintaining high accuracy.
Compared to other FPN methods, RMSFPN also demonstrates exceptional performance. AFPN, HSFPN, BIFPN, and Gold-YOLO all exhibit varying degrees of performance degradation in the multi-view aircraft detection scenario, with a decrease in mAP@0.5 ranging from 2.8% to 0.1%, and a decrease in mAP@0.5:0.95 ranging from 3.1% to 0.1%. While Gold-YOLO shows the smallest decrease, it sacrifices a significant number of parameters and computational resources, making it unsuitable for real-time detection tasks. HSFPN, the most lightweight structure, experiences a notable drop in detection accuracy. Additionally, GFPN, Slim-neck, and MAFPN either maintain or slightly improve detection performance. Among these, GFPN maintains a relatively high mAP@0.5, but fails to outperform the baseline model in terms of mAP@0.5:0.95, parameter count, and GFLOPS. Slim-neck improves mAP@0.5 by 0.3% through its lightweight design, but the improvement is modest. MAFPN, in comparison to Slim-neck, further enhances detection performance, although at the cost of increased parameters and computational resources. In contrast, the RMSFPN achieves superior detection performance with lower parameter and computational resource consumption, making it an ideal choice for efficient and accurate multi-view aircraft detection tasks.

5.1.2. Comparison of Detection Heads

We compared SCDADH with both the original detection head and DyHead [54] to demonstrate its effectiveness. To ensure the rigor of the experiment, the baseline model for comparison in this section is derived from the previous iteration, with the current baseline model being YOLOv8n+RMSFPN. Subsequent improvements follow the same methodology. The final experimental results are presented in Table 4.
DyHead achieved the best detection accuracy (mAP@0.5 and mAP@0.5:0.95), reaching 84.1% and 71.9%, respectively. However, its parameter count increased by 43%, and computational cost grew by 66.7%, which limits its applicability in real-time scenarios. In contrast, the baseline detector strikes a balance between detection accuracy and computational overhead, but it underperforms in several key metrics.
In contrast, the SCDADH demonstrates a particularly outstanding performance in recall rate (R), improving by 2.4% compared to the baseline model, reaching 77.7%. In multi-view aircraft detection scenarios, a higher recall rate is crucial for reducing the risk of missed detections. Furthermore, the SCDADH reduces the number of parameters while maintaining a high recall rate and exhibits moderate computational overhead, making it suitable for resource-constrained applications.

5.1.3. The Optimal Way to Incorporate the LSK Module

To explore the optimal integration of the LSK Module into the backbone, we conducted three sets of experiments: Scheme A—adding the LSK Module after each C2f module; Scheme B—placing it after the SPPF layer; Scheme C—positioning it before the SPPF layer. Table 5 provides a summary of the experimental findings.
In Schemes A and B, although precision improved, other metrics showed significant decline. In contrast, incorporating the LSK Module before the SPPF layer proved to be the optimal solution. The network effectively adjusts the contextual information of the feature maps prior to entering the SPPF layer, thereby resulting in more accurate feature representation. This solution ranked first in precision, recall, mAP@0.5, and mAP@0.5:0.95, showing improvements of 2.2%, 0.1%, 0.8%, and 0.2%, respectively, over the baseline model.

5.1.4. Comparison of Loss Functions

To validate the outstanding performance of WFMIoUv3 in addressing class imbalance issues, three progressive experiments were designed, and the results are presented in Figure 9. It was observed that in multi-view aircraft detection scenarios, due to the large presence of low-quality samples, MPDIoU exhibited a temporary drop in detection performance compared to the original CIoU. Subsequently, the introduction of Focaler-IoU effectively balanced the model’s focus between normal and challenging samples. Finally, by incorporating the concept of Wise-IoUv3, the negative impact of low-quality samples on gradients was minimized as much as possible. This ingenious combination fully leveraged the strengths of each loss function, resulting in a maximum improvement of 0.6% in mAP@0.5 over the original CIoU, thus demonstrating robust performance.
To thoroughly assess the impact of the WFMIoUv3 loss function on mitigating class imbalance, we compare the detection accuracy across different aircraft categories with and without its application. The results are summarized in Table 6.
Categories A4 (184 instances), A5 (567 instances), A6 (313 instances), and A10 (286 instances), with fewer instances, show notable improvements in mAP@0.5 after incorporating WFMIoUv3. Specifically, Category A4 showed a 3.8% increase in detection performance, from 75.2% to 79.0%. Similarly, Category A5 demonstrated a 1.9% improvement, from 64.5% to 66.4%, while Category A6’s performance increased by 0.5%, from 69.8% to 70.3%. Even Category A10, with only 286 instances, demonstrated a slight yet meaningful improvement of 0.1%, increasing from 96.1% to 96.2%. These improvements across categories with fewer instances unequivocally demonstrate that WFMIoUv3 effectively addresses the class imbalance problem, enabling the model to better learn and detect features from rare categories. Moreover, for categories with a larger number of instances, such as A1 (4698 instances), A2 (1012 instances), and A7 (976 instances), the mAP@0.5 performance remains relatively stable, suggesting that the WFMIoUv3 loss function has little to no adverse effect on the performance of well-represented categories.

5.2. Ablation Experiment

To better highlight the contribution of each improvement, we conducted eight sets of ablation experiments based on YOLOv8n, progressively integrating the proposed modifications. The significance of each module is outlined below: (1) A: substituting PANet with the RMSFPN; (2) B: substituting the detection head with the SCDADH; (3) C: incorporating the LSK Module prior to the SPPF layer; (4) D: substituting traditional CIoU with WFMIoUv3.
As shown in the experimental results in Table 7, it is evident that each of the improvement modules (A, B, C, D) achieved positive effects when used individually, which validates the reasonableness and effectiveness of our improvements. Among them, the LSK Module added in improvement C demonstrated strong feature extraction capabilities for multi-scale targets in complex backgrounds and multi-view scenarios. In comparison with the baseline model, recall, mAP@0.5, and mAP@0.5:0.95 improved by 4.2%, 1.2%, and 1.4%, respectively, although precision decreased by 4.4%. However, after combining the LSK Module with RMSFPN and SCDADH, not only did precision rebound to 86.9%, but recall, mAP@0.5, and mAP@0.5:0.95 also showed significant improvements, with a further reduction in parameters. This demonstrates the powerful feature fusion ability and lightweight nature of RMSFPN and SCDADH in multi-view aircraft detection tasks. Finally, by using the proposed WFMIoUv3 loss function to alleviate the sample imbalance issue, we obtained RMVAD-YOLO. Compared to the baseline YOLOv8n, RMVAD-YOLO achieved an overall improvement in detection performance and lightweight design, with a 10% reduction in model parameters, while precision, recall, mAP@0.5, and mAP@0.5:0.95 improved by 2.6%, 2.5%, 2.3%, and 1.2%, respectively.

5.3. Comparison Experiment with Other Models

Among the various models in the YOLO series, YOLOv8n was selected as the baseline due to its strong performance in multi-view aircraft detection tasks, as shown in Table 8. Specifically, YOLOv8n achieves the highest mAP@0.5 of 82.5%, providing a solid benchmark for performance evaluation. Although YOLOv10n and YOLOv11n are more lightweight models, their detection performance falls short of YOLOv8n’s. Meanwhile, YOLOv9n, while having a similar detection performance to YOLOv8n, incurs a significantly higher computational cost, with GFLOPS reaching 10.7, compared to YOLOv8n’s 8.1. Therefore, YOLOv8n represents an optimal balance between detection accuracy and computational efficiency, making it an ideal baseline for comparing RMVAD-YOLO.
Having established YOLOv8 as the baseline model, we subsequently compare RMVAD-YOLO with other models from the YOLO series, as well as the improved YOLOv8n model proposed by Yang et al. [43], using the MAD dataset. In comparison with YOLOv6n and the improved YOLOv8n model by Yang et al. [43], RMVAD-YOLO achieves superior performance in terms of precision, recall, and mAP@0.5. Compared to YOLOv5s, RMVAD-YOLO achieves 7.5% higher precision and 3.6% higher mAP@0.5, while requiring significantly fewer parameters and lower computational complexity, leading to a more efficient and effective model. Compared to YOLOv9t, although YOLOv9t has a smaller parameter size and slightly higher recall, RMVAD-YOLO achieves a 20% reduction in computational complexity while attaining 9.6% higher precision and 2.5% higher mAP@0.5. Compared to YOLOv10 and YOLOv11n, although they are more lightweight, RMVAD-YOLO achieves significantly better detection accuracy, highlighting its overall superiority in detection accuracy.
Overall, RMVAD-YOLO strikes an optimal balance between detection accuracy and computational efficiency, making it well suited for multi-view aircraft detection tasks. Its superior precision and mAP@0.5, combined with reduced computational complexity, underscore its capability in addressing the complexities of multi-view object detection.
In addition, we also compared RMVAD-YOLO with the RT-DETR series models, as shown in Table 9. Although RMVAD-YOLO is slightly lower than the RT-DETR models in terms of precision and recall, it ranks first in the critical mAP@0.5 metric, demonstrating superior detection capability. With regard to parameter count and computational complexity, RMVAD-YOLO demonstrates notable advantages. Compared to the lightest model in the RT-DETR series, RT-DETR-l, RMVAD-YOLO’s parameter count and computational complexity are only 8.5% of those of RT-DETR-l, making RMVAD-YOLO a more practical and deployable solution for resource-constrained applications.
In summary, RMVAD-YOLO achieves the best trade-off between accuracy and speed, demonstrating superior performance compared to both the YOLO series and RT-DETR series models.

5.4. Visualization Experiment

To more intuitively demonstrate the detection performance of RMVAD-YOLO in multi-view aircraft detection scenarios, we selected two images from aircraft category A1 from different viewpoints in the MAD dataset and conducted comparison experiments with YOLOv8n. The results, shown in Figure 10 and Figure 11, indicate that RMVAD-YOLO demonstrates superior detection performance from both planar and elevation viewpoints.
Figure 10 presents the comparison results of the model in an occlusion scenario. In this image, the A2 aircraft is partially obscured by two A1 aircraft, with only a portion of the fuselage visible and key wing features hidden, which presents a significant challenge for the algorithm in terms of handling occlusion. As shown in the figure, YOLOv8n fails to detect the occluded aircraft, while RMVAD-YOLO not only detects the occluded plane, but also correctly classifies it as A2. This highlights RMVAD-YOLO’s superior ability to handle occlusion scenarios.
Figure 11 presents the comparison results in an adverse environment. From an upward view, an A2 aircraft is flying through dense smoke, creating a challenging detection environment and increasing the likelihood of misdetection. As shown, YOLOv8n incorrectly identifies the A2 aircraft as A1, resulting in highly unstable detection. In contrast, RMVAD-YOLO accurately detects the A2 aircraft with a high confidence score of 0.87, demonstrating its robustness in handling difficult environments.
Using the heatmaps in Figure 12, we further demonstrate the superior performance of RMVAD-YOLO in multi-view aircraft detection. Unlike YOLOv8n, which predominantly focuses on aircraft features, RMVAD-YOLO effectively incorporates complex background information, which is crucial for accurate detection. In challenging environments or under upward views, aircraft features may become less distinct, leading to a significant drop in detection accuracy for traditional models like YOLOv8n. In contrast, RMVAD-YOLO enhances its ability to discriminate targets by integrating both aircraft and background features, allowing it to maintain high detection accuracy in these demanding scenarios.
Furthermore, we evaluated RMVAD-YOLO’s classification performance using the confusion matrix. The model’s true and predicted categories are shown in the confusion matrix’s rows and columns, respectively. The diagonal entries indicate the proportion of correct predictions, and the higher the proportion, the darker the colour. Figure 13 shows that the overall diagonal colour of RMVAD-YOLO is darker than that of YOLOv8n, indicating superior classification performance and an improved ability to reduce false positives.
We further assess the ability of RMVAD-YOLO to address class imbalance by using P-R curves, as shown in Figure 14. The results indicate that, compared to YOLOv8n, RMVAD-YOLO shows a marginal decrease in mAP@0.5 for A6 and A10 aircraft, a change that is nearly negligible. Meanwhile, RMVAD-YOLO shows improvements in mAP@0.5 across the other eight aircraft categories. Notably, in the A4 and A5 categories, which have the fewest samples and are the most challenging to detect, YOLOv8n achieves only 69.7% and 60.3% mAP@0.5, respectively, with suboptimal detection performance. In contrast, RMVAD-YOLO improves these metrics to 79.0% and 66.4%, yielding improvements of 9.3% and 6.1%, effectively mitigating the class imbalance issue and further validating the superiority of RMVAD-YOLO.

5.5. Generalization Experiment

To explore the generalization capability of RMVAD-YOLO in multi-view scenarios, we further selected the VisDrone2019 dataset [55] from the remote sensing domain for the generalization experiments. VisDrone2019, captured by drones across various complex scenes and from different angles, aligns well with the design objectives of RMVAD-YOLO, and is therefore suitable for evaluating its generalization performance. VisDrone2019 includes 6471 training, 548 validation, and 1610 test images, spanning ten categories. It is important to note that the experimental setup for this generalization experiment is consistent with previous ones. Comparative experiments were conducted on YOLOv8n, YOLOv8s, YOLOv8m, and YOLOv8l, and the results are presented in Table 10.
It can be observed that regardless of the model scale, RMVAD-YOLO outperforms all other models in every detection metric. Notably, the VisDrone2019 dataset contains a substantial proportion of small targets, and many researchers enhance detection accuracy by adding a dedicated small-target detection head. However, this approach results in a significant increase in both parameter count and computational complexity. In contrast, RMVAD-YOLO achieves a comprehensive performance improvement without the need for a specialized detection head for small targets. This not only highlights the efficiency of RMVAD-YOLO, but also suggests that it still holds potential for further optimization on the VisDrone2019 dataset.
To further illustrate this, we compare our model with the study by Luo et al. [56], which used the same experimental settings, with both models being improvements based on YOLOv8n. The results presented in Table 11 indicate that Luo et al.’s ESOD-YOLO achieves only a slight improvement of 0.5% in mAP@0.5 and 0.4% in mAP@0.5:0.95 compared to RMVAD-YOLO. However, due to the addition of a small-target detection head, the parameter count and GFLOPS of ESOD-YOLO increase by 64.3% and 60.7%, respectively, compared to RMVAD-YOLO, which complicates its deployment on resource-constrained edge devices. In contrast, RMVAD-YOLO demonstrates greater potential for further optimization and exhibits strong generalization ability.

6. Discussion

6.1. Impact of Viewpoint Diversity on Accurate Aircraft Category Detection

Aircraft detection has traditionally been conducted from a single viewpoint, where the model learns relatively stable feature representations. However, in multi-view detection, the increased diversity of viewing angles presents significant challenges in accurately distinguishing aircraft categories. The key difficulties arise from two primary factors:
(1) Inter-class similarity at certain viewpoints: In multi-view detection, an aircraft’s appearance can change drastically depending on the viewpoint. However, from certain angles, aircraft from different categories can appear almost identical. For instance, two structurally distinct aircraft may look highly similar when viewed from certain oblique angles, where their fuselage shapes and wing configurations align in a way that obscures their distinguishing characteristics.
(2) Intra-class variation across multiple perspectives: Conversely, variations in viewing angles can cause the same aircraft type to exhibit drastically different visual features. For example, an aircraft viewed from the front appears vastly different from the same aircraft viewed from the side or top. This variability undermines feature consistency, making it harder for the model to associate different views of the same aircraft as a single target, thereby reducing detection robustness across multiple perspectives.
RMVAD-YOLO employs RMSFPN to robustly extract aircraft features across multiple viewpoints, thereby facilitating the integration of features from diverse perspectives. Additionally, SCDADH enhances the synergy between classification and localization tasks, ensuring more consistent feature representations across varying viewpoints. As a result, the model reinforces intra-class feature consistency across varying viewpoints, thereby improving overall detection accuracy.

6.2. Confidence Score Analysis in Occlusion Scenarios

Notably, RMVAD-YOLO may produce low confidence scores in certain cases. For instance, in Figure 10, the confidence score for the A2 aircraft is only 0.38. This phenomenon primarily results from the loss of discriminative features due to severe occlusion: when key components (e.g., wings, tail) are obscured, the model must rely on limited local features (e.g., fuselage segments, texture patterns) for prediction. While the current confidence threshold (0.25) effectively filters out false positives in our experimental setting, relying solely on confidence thresholds to determine detection targets may introduce risks.
For future research, we plan to enhance RMVAD-YOLO’s ability to handle occlusions by integrating advanced attention mechanisms, such as Spatial Channel Attention or Transformer-based modules, to help the model focus on discriminative regions even when key features are obscured. Additionally, we will incorporate occlusion-aware data augmentation techniques, including CutOut, MixUp, and synthetic occlusion masks, to improve the model’s robustness in detecting partially visible aircraft. These enhancements are expected to generate more stable confidence scores and further improve RMVAD-YOLO’s reliability in complex multi-object detection scenarios.

7. Conclusions

We propose a novel model, RMVAD-YOLO, for multi-view aircraft detection. RMSFPN robustly extracts and integrates aircraft features across multiple viewpoints, improving its accuracy in aircraft category detection. Both SCDADH and the LSK Module enhance adaptability to multi-scale target variations, ensuring more effective feature representation for diverse target sizes. Additionally, the WFMIoUv3 loss function alleviates class imbalance, boosting detection performance for underrepresented categories. Significant performance gains on the MAD and VisDrone2019 datasets validate RMVAD-YOLO’s effectiveness in real-world applications, offering a reliable solution for multi-view aircraft detection.
Although our model effectively handles class imbalance, the detection accuracy of aircraft types with limited sample sizes needs improvement due to a severe lack of training data. To address this, future work could explore advanced data augmentation techniques, such as Generative Adversarial Networks (GANs), to generate synthetic samples and enhance model robustness. Additionally, investigating strategies like self-supervised learning and few-shot learning could further improve performance on rare classes and strengthen generalization capabilities.

Author Contributions

Conceptualization, K.L. and T.L.; methodology, K.L. and J.B.; validation, K.L., J.B. and T.L.; investigation, K.L. and X.Z.; writing—original draft preparation, K.L.; writing—review and editing, K.L., G.Z., Y.C. and T.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The MAD is openly available at https://github.com/YangBo0411/aircraft-detection, accessed on 13 November 2024.

Acknowledgments

We sincerely thank the editor and anonymous reviewers for their valuable feedback and constructive suggestions, which have significantly enhanced the quality of this manuscript.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Sandamali, G.G.N.; Su, R.; Sudheera, K.L.K.; Zhang, Y.; Zhang, Y. Two-Stage Scalable Air Traffic Flow Management Model Under Uncertainty. IEEE Trans. Intell. Transp. Syst. 2021, 22, 7328–7340. [Google Scholar] [CrossRef]
  2. Shao, L.; He, J.; Lu, X.; Hei, B.; Qu, J.; Liu, W. Aircraft Skin Damage Detection and Assessment from UAV Images Using GLCM and Cloud Model. IEEE Trans. Intell. Transp. Syst. 2024, 25, 3191–3200. [Google Scholar] [CrossRef]
  3. Li, B.; Hu, J.; Fang, L.; Kang, S.; Li, X. A new aircraft classification algorithm based on sum pooling feature with remote sensing image. In Proceedings of the MIPPR 2019: Pattern Recognition and Computer Vision, Wuhan, China, 2–3 November 2019; pp. 361–369. [Google Scholar]
  4. Zhang, F.; Du, B.; Zhang, L.; Xu, M. Weakly Supervised Learning Based on Coupled Convolutional Neural Networks for Aircraft Detection. IEEE Trans. Geosci. Remote Sens. 2016, 54, 5553–5563. [Google Scholar] [CrossRef]
  5. Shi, T.; Gong, J.; Jiang, S.; Zhi, X.; Bao, G.; Sun, Y.; Zhang, W. Complex Optical Remote-Sensing Aircraft Detection Dataset and Benchmark. IEEE Trans. Geosci. Remote Sens. 2023, 61, 5612309. [Google Scholar] [CrossRef]
  6. Qian, Y.; Pu, X.; Jia, H.; Wang, H.; Xu, F. ARNet: Prior Knowledge Reasoning Network for Aircraft Detection in Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5205214. [Google Scholar] [CrossRef]
  7. Xu, X.; Chen, Z.; Zhang, X.; Wang, G. Context-Aware Content Interaction: Grasp Subtle Clues for Fine-Grained Aircraft Detection. IEEE Trans. Geosci. Remote Sens. 2024, 62, 5641319. [Google Scholar] [CrossRef]
  8. Li, Y.; Hou, Q.; Zheng, Z.; Cheng, M.M.; Yang, J.; Li, X. Large Selective Kernel Network for Remote Sensing Object Detection. In Proceedings of the 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, 1–6 October 2023; pp. 16748–16759. [Google Scholar]
  9. Tong, Z.; Chen, Y.; Xu, Z.; Yu, R. Wise-IoU: Bounding Box Regression Loss with Dynamic Focusing Mechanism. arXiv 2023, arXiv:2301.10051. [Google Scholar] [CrossRef]
  10. Zhang, H.; Zhang, S. Focaler-IoU: More Focused Intersection over Union Loss. arXiv 2024, arXiv:2401.10525. [Google Scholar] [CrossRef]
  11. Ma, S.; Xu, Y. MPDIoU: A Loss for Efficient and Accurate Bounding Box Regression. arXiv 2023, arXiv:2307.07662. [Google Scholar] [CrossRef]
  12. Cortes, C.; Vapnik, V. Support-vector networks. Mach. Learn. 1995, 20, 273–297. [Google Scholar] [CrossRef]
  13. Bombara, G.; Vasile, C.-I.; Penedo, F.; Yasuoka, H.; Belta, C. A Decision Tree Approach to Data Classification using Signal Temporal Logic. In Proceedings of the 19th International Conference on Hybrid Systems: Computation and Control, Vienna, Austria, 12–14 April 2016; pp. 1–10. [Google Scholar]
  14. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139. [Google Scholar] [CrossRef]
  15. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Adv. Neural Inf. Process. Syst. 2017, 60, 84–90. [Google Scholar] [CrossRef]
  16. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  18. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  19. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef]
  20. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  21. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2999–3007. [Google Scholar]
  22. Zhao, Y.; Lv, W.; Xu, S.; Wei, J.; Wang, G.; Dang, Q.; Liu, Y.; Chen, J. DETRs Beat YOLOs on Real-time Object Detection. In Proceedings of the 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 16965–16974. [Google Scholar]
  23. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. arXiv 2015, arXiv:1506.02640. [Google Scholar] [CrossRef]
  24. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar]
  25. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar] [CrossRef]
  26. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature Pyramid Networks for Object Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 936–944. [Google Scholar]
  27. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar] [CrossRef]
  28. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. YOLOX: Exceeding YOLO Series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar] [CrossRef]
  29. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. YOLOv6: A Single-Stage Object Detection Framework for Industrial Applications. arXiv 2022, arXiv:2209.02976. [Google Scholar] [CrossRef]
  30. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  31. Wang, C.-Y.; Yeh, I.H.; Liao, H.-Y.M. YOLOv9: Learning What You Want to Learn Using Programmable Gradient Information. arXiv 2024, arXiv:2402.13616. [Google Scholar] [CrossRef]
  32. Wang, A.; Chen, H.; Liu, L.; Chen, K.; Lin, Z.; Han, J.; Ding, G. YOLOv10: Real-Time End-to-End Object Detection. arXiv 2024, arXiv:2405.14458. [Google Scholar] [CrossRef]
  33. Xu, C.; Duan, H. Artificial bee colony (ABC) optimized edge potential function (EPF) approach to target recognition for low-altitude aircraft. Pattern Recognit. Lett. 2010, 31, 1759–1772. [Google Scholar] [CrossRef]
  34. Lin, Y.; He, H.; Yin, Z.; Chen, F. Rotation-Invariant Object Detection in Remote Sensing Images Based on Radial-Gradient Angle. IEEE Geosci. Remote Sens. Lett. 2015, 12, 746–750. [Google Scholar] [CrossRef]
  35. Yu, Y.; Guan, H.; Zai, D.; Ji, Z. Rotation-and-scale-invariant airplane detection in high-resolution satellite images based on deep-Hough-forests. ISPRS J. Photogramm. Remote Sens. 2016, 112, 50–64. [Google Scholar] [CrossRef]
  36. Ding, P.; Zhang, Y.; Deng, W.-J.; Jia, P.; Kuijper, A. A light and faster regional convolutional neural network for object detection in optical remote sensing images. ISPRS J. Photogramm. Remote Sens. 2018, 141, 208–218. [Google Scholar] [CrossRef]
  37. Yang, J.; Zhu, Y.; Jiang, B.; Gao, L.; Xiao, L.; Zheng, Z. Aircraft detection in remote sensing images based on a deep residual network and Super-Vector coding. Remote Sens. Lett. 2018, 9, 228–236. [Google Scholar] [CrossRef]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  39. Wu, J.; Zhao, F.; Jin, Z. LEN-YOLO: A Lightweight Remote Sensing Small Aircraft Object Detection Model for Satellite On-Orbit Detection. J. Real-Time Image Proc. 2025, 22, 25. [Google Scholar] [CrossRef]
  40. Wu, J.; Zhao, F.; Yao, G.; Jin, Z. FGA-YOLO: A one-stage and high-precision detector designed for fine-grained aircraft recognition. Neurocomputing 2025, 618, 129067. [Google Scholar] [CrossRef]
  41. Yu, D.; Fang, Z.; Jiang, Y. Alleviating category confusion in fine-grained visual classification. Vis. Comput. 2025. early access. [Google Scholar] [CrossRef]
  42. Wan, H.; Nurmamat, P.; Chen, J.; Cao, Y.; Wang, S.; Zhang, Y.; Huang, Z. Fine-Grained Aircraft Recognition Based on Dynamic Feature Synthesis and Contrastive Learning. Remote Sens. 2025, 17, 768. [Google Scholar] [CrossRef]
  43. Yang, B.; Tian, D.; Zhao, S.; Wang, W.; Luo, J.; Pu, H.; Zhou, M.; Pi, Y. Robust Aircraft Detection in Imbalanced and Similar Classes with a Multi-Perspectives Aircraft Dataset. IEEE Trans. Intell. Transp. Syst. 2024, 25, 21442–21454. [Google Scholar] [CrossRef]
  44. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  45. Chen, Y.; Yuan, X.; Wu, R.; Wang, J.; Hou, Q.; Cheng, M.-M. YOLO-MS: Rethinking Multi-Scale Representation Learning for Real-time Object Detection. arXiv 2023, arXiv:2308.05480. [Google Scholar] [CrossRef]
  46. Zhu, X.; Hu, H.; Lin, S.; Dai, J. Deformable ConvNets v2: More Deformable, Better Results. arXiv 2018, arXiv:1811.11168. [Google Scholar] [CrossRef]
  47. Yang, G.; Lei, J.; Zhu, Z.; Cheng, S.; Feng, Z.; Liang, R. AFPN: Asymptotic Feature Pyramid Network for Object Detection. arXiv 2023, arXiv:2306.15988. [Google Scholar] [CrossRef]
  48. Chen, Y.; Zhang, C.; Chen, B.; Huang, Y.; Sun, Y.; Wang, C.; Fu, X.; Dai, Y.; Qin, F.; Peng, Y.; et al. Accurate leukocyte detection based on deformable-DETR and multi-level feature fusion for aiding diagnosis of blood diseases. Comput. Biol. Med. 2024, 170, 107917. [Google Scholar] [CrossRef] [PubMed]
  49. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and Efficient Object Detection. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 10778–10787. [Google Scholar]
  50. Wang, C.; He, W.; Nie, Y.; Guo, J.; Liu, C.; Han, K.; Wang, Y. Gold-YOLO: Efficient Object Detector via Gather-and-Distribute Mechanism. arXiv 2023, arXiv:2309.11331. [Google Scholar] [CrossRef]
  51. Xu, X.; Jiang, Y.; Chen, W.; Huang, Y.; Zhang, Y.; Sun, X. DAMO-YOLO: A Report on Real-Time Object Detection Design. arXiv 2022, arXiv:2211.15444. [Google Scholar] [CrossRef]
  52. Li, H.; Li, J.; Wei, H.; Liu, Z.; Zhan, Z.; Ren, Q. Slim-neck by GSConv: A lightweight-design for real-time detector architectures. arXiv 2022, arXiv:2206.02424. [Google Scholar] [CrossRef]
  53. Yang, Z.; Guan, Q.; Zhao, K.; Yang, J.; Xu, X.; Long, H.; Tang, Y. Multi-Branch Auxiliary Fusion YOLO with Re-parameterization Heterogeneous Convolutional for accurate object detection. arXiv 2024, arXiv:2407.04381. [Google Scholar] [CrossRef]
  54. Dai, X.; Chen, Y.; Xiao, B.; Chen, D.; Liu, M.; Yuan, L.; Zhang, L. Dynamic Head: Unifying Object Detection Heads with Attentions. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 7369–7378. [Google Scholar]
  55. Zhu, P.; Wen, L.; Du, D.; Bian, X.; Fan, H.; Hu, Q.; Ling, H. Detection and Tracking Meet Drones Challenge. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 44, 7380–7399. [Google Scholar] [CrossRef]
  56. Luo, J.; Liu, Z.; Wang, Y.; Tang, A.; Zuo, H.; Han, P. Efficient Small Object Detection You Only Look Once: A Small Object Detection Algorithm for Aerial Images. Sensors 2024, 24, 7067. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Aircraft images from complex mission scenarios. (a) Downward view; (b) eye-level view; (c) upward view; (d) oblique view.
Figure 2. The network architecture of RMVAD-YOLO.
Figure 3. The structure of the Robust Multi-Link Scale Interactive Feature Pyramid Network.
Figure 4. The structure of the Multi-scale Deep Convolutional Aggregation Module.
Figure 5. The structure of the Multi-scale Deep Convolutional Aggregation Block.
Figure 6. The structure of the Shared Convolutional Dynamic Alignment Detection Head.
Figure 7. The structure of the LSK Module.
Figure 8. Overview of the MAD dataset. (a) Number of instances per category; (b) scale variation of aircraft targets.
Figure 9. Loss function comparison plot.
Figure 10. Detection comparison in an occlusion scene: (a) original image, (b) YOLOv8n output, and (c) RMVAD-YOLO output.
Figure 11. Detection comparison in an adverse environment: (a) original image, (b) YOLOv8n output, and (c) RMVAD-YOLO output.
Figure 12. Comparison of heat maps: (a) original image, (b) YOLOv8n heat map, and (c) RMVAD-YOLO heat map.
Figure 13. Comparison of normalized confusion matrix: (a) YOLOv8n and (b) RMVAD-YOLO.
Figure 14. Comparison of P-R curves: (a) YOLOv8n and (b) RMVAD-YOLO.
Table 1. Neck Heterogeneous Kernel Selection Mechanism.

| MDCAM Number | Convolution Kernel Size |
|---|---|
| 1 | 5, 7, 9 |
| 2 | 3, 5, 7 |
| 3 | 1, 3, 5 |
| 4 | 1, 3, 5 |
| 5 | 3, 5, 7 |
| 6 | 5, 7, 9 |
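For clarity, the sketch below shows one way the heterogeneous kernel selection in Table 1 could be expressed in code. It is a minimal illustration rather than the released implementation: the names MDCAM_KERNEL_SIZES and MultiKernelDWBranch are hypothetical, and the real MDCAM block is more elaborate than this parallel depthwise-convolution stand-in.

```python
import torch
import torch.nn as nn

# Hypothetical mapping of the six neck MDCAM modules to their
# heterogeneous kernel sizes, following Table 1.
MDCAM_KERNEL_SIZES = {
    1: (5, 7, 9),
    2: (3, 5, 7),
    3: (1, 3, 5),
    4: (1, 3, 5),
    5: (3, 5, 7),
    6: (5, 7, 9),
}


class MultiKernelDWBranch(nn.Module):
    """Parallel depthwise convolutions with heterogeneous kernel sizes.

    A minimal sketch of how an MDCAM-style block might aggregate
    multi-scale context; it is not the authors' implementation.
    """

    def __init__(self, channels: int, kernel_sizes=(3, 5, 7)):
        super().__init__()
        # One depthwise branch per kernel size; odd kernels with k // 2
        # padding keep the spatial resolution unchanged.
        self.branches = nn.ModuleList(
            nn.Conv2d(channels, channels, k, padding=k // 2, groups=channels)
            for k in kernel_sizes
        )
        # 1x1 convolution fuses the concatenated multi-scale responses.
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([branch(x) for branch in self.branches], dim=1))


# Instantiate the six neck blocks with the Table 1 kernel assignments.
neck_blocks = [MultiKernelDWBranch(128, MDCAM_KERNEL_SIZES[i]) for i in range(1, 7)]
```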
Table 2. Training parameters.

| Parameters | Value | Parameters | Value |
|---|---|---|---|
| Epochs | 200 | Optimizer | SGD |
| Batch size | 16 | Learning rate | 0.01 |
| Image size | 640 × 640 | Warmup epochs | 3 |
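For reproducibility, the hyperparameters in Table 2 map directly onto a standard Ultralytics training call. The sketch below is a minimal example assuming the Ultralytics YOLO API; the model configuration file rmvad-yolo-n.yaml and the dataset file MAD.yaml are hypothetical placeholders, and any setting not listed in Table 2 is assumed to follow the framework defaults.

```python
from ultralytics import YOLO

# Hypothetical model YAML describing the RMVAD-YOLO architecture.
model = YOLO("rmvad-yolo-n.yaml")

model.train(
    data="MAD.yaml",      # dataset definition (hypothetical path)
    epochs=200,           # Table 2: training epochs
    batch=16,             # Table 2: batch size
    imgsz=640,            # Table 2: 640 x 640 input resolution
    optimizer="SGD",      # Table 2: optimizer
    lr0=0.01,             # Table 2: initial learning rate
    warmup_epochs=3,      # Table 2: warmup epochs
)
```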
Table 3. Comparison of different feature pyramid networks.

| FPN | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params/M | GFLOPS |
|---|---|---|---|---|---|---|
| PANet (base) | 87.5 | 73.5 | 82.5 | 69.3 | 3.01 | 8.1 |
| AFPN [47] | 84.6 | 72 | 79.7 | 67.3 | 2.60 | 8.4 |
| HSFPN [48] | 83.2 | 73.3 | 80.6 | 66.2 | 1.94 | 6.9 |
| BIFPN [49] | 83 | 75.8 | 82.1 | 68.2 | 1.99 | 7.1 |
| Gold-YOLO [50] | 82.9 | 75.9 | 82.4 | 69.2 | 5.98 | 10.3 |
| GFPN [51] | 83.2 | 75.3 | 82.5 | 69 | 3.26 | 8.3 |
| Slim-neck [52] | 83.2 | 76.8 | 82.8 | 70.5 | 2.80 | 7.3 |
| MAFPN [53] | 84.2 | 76 | 83.3 | 70 | 2.99 | 8.7 |
| RMSFPN (ours) | 85.7 (↓1.8) | 75.3 (↑1.8) | 83.5 (↑1.0) | 70.6 (↑1.3) | 2.91 (↓0.1) | 8.4 (↑0.3) |
Note: Bolded values indicate the best performance.
Table 4. Comparison of different detection heads.

| Detection Head | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params/M | GFLOPS |
|---|---|---|---|---|---|---|
| Base | 85.7 | 75.3 | 83.5 | 70.6 | 2.91 | 8.4 |
| DyHead | 85.2 | 77.2 | 84.1 | 71.9 | 4.17 | 14 |
| SCDADH (ours) | 84.7 (↓1.0) | 77.7 (↑2.4) | 83.4 (↓0.1) | 71.1 (↑0.5) | 2.60 (↓0.31) | 8.8 (↑0.4) |
Note: Bolded values indicate the best performance.
Table 5. Comparison of results for different LSK Module incorporation methods.

| Method | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params/M | GFLOPS |
|---|---|---|---|---|---|---|
| Base | 84.7 | 77.7 | 83.4 | 71.1 | 2.60 | 8.8 |
| A | 85.1 | 75.1 | 81.2 | 66.4 | 2.77 | 9.4 |
| B | 85.4 | 76.4 | 83.2 | 70.7 | 2.72 | 8.9 |
| C (ours) | 86.9 (↑2.2) | 77.8 (↑0.1) | 84.2 (↑0.8) | 71.3 (↑0.2) | 2.72 (↑0.12) | 8.9 (↑0.1) |
Note: Bolded values indicate the best performance.
Table 6. Per-class mAP@0.5 comparison before and after applying WFMIoUv3 loss.

| Loss Function | A1 | A2 | A3 | A4 | A5 | A6 | A7 | A8 | A9 | A10 |
|---|---|---|---|---|---|---|---|---|---|---|
| CIoU | 90.1 | 79.6 | 74.4 | 75.2 | 64.5 | 69.8 | 94.9 | 98.8 | 98.4 | 96.1 |
| WFMIoUv3 | 89.8 | 78.1 | 76.2 | 79.0 | 66.4 | 70.3 | 95.4 | 98.8 | 97.5 | 96.2 |
Table 7. Ablation experiment.

| Group | Base | A | B | C | D | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params/M | GFLOPS |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | ✓ | | | | | 87.5 | 73.5 | 82.5 | 69.3 | 3.01 | 8.1 |
| 2 | ✓ | ✓ | | | | 85.7 | 75.3 | 83.5 | 70.6 | 2.91 | 8.4 |
| 3 | ✓ | | ✓ | | | 83.6 | 76.2 | 82.9 | 70 | 2.24 | 8.6 |
| 4 | ✓ | | | ✓ | | 83.1 | 77.7 | 83.7 | 70.7 | 3.12 | 8.2 |
| 5 | ✓ | | | | ✓ | 85.5 | 75.9 | 83.3 | 68.8 | 3.01 | 8.1 |
| 6 | ✓ | ✓ | ✓ | | | 84.7 | 77.7 | 83.4 | 71.1 | 2.60 | 8.8 |
| 7 | ✓ | ✓ | ✓ | ✓ | | 86.9 | 77.8 | 84.2 | 71.3 | 2.72 | 8.9 |
| 8 (ours) | ✓ | ✓ | ✓ | ✓ | ✓ | 90.1 (↑2.6) | 76 (↑2.5) | 84.8 (↑2.3) | 70.5 (↑1.2) | 2.72 (↓0.29) | 8.9 (↑0.8) |
Table 8. Comparison experiment with YOLO series models.

| Model | P/% | R/% | mAP@0.5/% | Params/M | GFLOPS |
|---|---|---|---|---|---|
| YOLOv5s | 82.6 | 76.3 | 81.2 | 7.04 | 15.8 |
| YOLOv6n | 75.6 | 71 | 80.1 | 4.63 | 11.34 |
| YOLOv8n | 87.5 | 73.5 | 82.5 | 3.01 | 8.1 |
| YOLOv9t | 80.5 | 76.2 | 82.3 | 2.62 | 10.7 |
| YOLOv10n | 84.2 | 73.3 | 80.5 | 2.70 | 8.2 |
| YOLOv11n | 83 | 74.7 | 81.9 | 2.58 | 6.3 |
| Yang | 88.5 | 74.1 | 83.4 | – | – |
| RMVAD-YOLO (ours) | 90.1 | 76 | 84.8 | 2.72 | 8.9 |
Note: Bolded values indicate the best performance.
Table 9. Comparison experiment with RT-DETR series models.

| Model | P/% | R/% | mAP@0.5/% | Params/M | GFLOPS |
|---|---|---|---|---|---|
| RT-DETR-x | 90.8 | 79.8 | 82.6 | 64.59 | 222.5 |
| RT-DETR-l | 90.3 | 79.4 | 82.2 | 32.00 | 103.5 |
| RT-DETR-ResNet50 | 90.8 | 82.3 | 84.3 | 41.96 | 125.7 |
| RMVAD-YOLO (ours) | 90.1 | 76 | 84.8 | 2.72 | 8.9 |
Note: Bolded values indicate the best performance.
Table 10. Comparison of RMVAD-YOLO and YOLOv8 on VisDrone2019.

| Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params/M |
|---|---|---|---|---|---|
| YOLOv8n | 37.9 | 28.8 | 26.3 | 14.5 | 3.01 |
| RMVAD-YOLO-n | 40.8 | 31 | 28.8 | 16.2 | 2.72 |
| YOLOv8s | 45 | 33.4 | 31.7 | 17.8 | 11.13 |
| RMVAD-YOLO-s | 46.3 | 35.6 | 33.8 | 19.3 | 10.61 |
| YOLOv8m | 46.2 | 36.1 | 34.1 | 19.7 | 25.85 |
| RMVAD-YOLO-m | 48.5 | 37.6 | 35.8 | 20.6 | 24.28 |
| YOLOv8l | 48.1 | 37.2 | 35.4 | 20.6 | 43.61 |
| RMVAD-YOLO-l | 50 | 39 | 37.5 | 21.8 | 42.04 |
Note: Bolded values indicate the best performance.
Table 11. Comparison with Luo's experiment.

| Model | P/% | R/% | mAP@0.5/% | mAP@0.5:0.95/% | Params/M | GFLOPS |
|---|---|---|---|---|---|---|
| ESOD-YOLO | – | – | 29.3 | 16.6 | 4.47 | 14.3 |
| RMVAD-YOLO | 40.8 | 31 | 28.8 | 16.2 | 2.72 | 8.9 |
Note: Bolded values indicate the best performance.