RN-YOLO: A Small Target Detection Model for Aerial Remote-Sensing Images

Wang, Ke; Zhou, Hao; Wu, Hao; Yuan, Guowu

doi:10.3390/electronics13122383

School of Information Science and Engineering, Yunnan University, Kunming 650504, China

* Author to whom correspondence should be addressed.

Electronics 2024, 13(12), 2383; https://doi.org/10.3390/electronics13122383

Submission received: 15 May 2024 / Revised: 7 June 2024 / Accepted: 13 June 2024 / Published: 18 June 2024

(This article belongs to the Special Issue Applications of Computer Vision, 2nd Edition)

Abstract

Accurately detecting targets in remote-sensing images is crucial for the military, urban planning, and resource exploration. There are some challenges in extracting detailed features from remote-sensing images, such as complex backgrounds, large-scale variations, and numerous small targets. This paper proposes a remote-sensing target detection model called RN-YOLO (YOLO with RepGhost and NAM), which integrates RepGhost and a normalization-based attention module (NAM) based on YOLOv8. Firstly, NAM is added to the feature extraction network to enhance the capture capabilities for small targets by recalibrating receptive fields and strengthening information flow. Secondly, an efficient RepGhost_C2f structure is employed in the feature fusion network to replace the C2f module, effectively reducing the parameters. Lastly, the WIoU (Wise Intersection over Union) loss function is adopted to mitigate issues such as significant variations in target sizes and difficulty locating small targets, effectively improving the localization accuracy of small targets. The experimental results demonstrate that compared to the YOLOv8s model, the RN-YOLO model reduces the parameter count by 13.9%. Moreover, on the DOTAv1.5, TGRS-HRRSD, and RSOD datasets, the detection accuracy (mAP@.5:.95) of the RN-YOLO model improves by 3.6%, 1.2%, and 2%, respectively, compared to the YOLOv8s model, showcasing its outstanding performance and enhanced capability in detecting small targets.

Keywords: target detection; remote sensing; YOLOv8; attention mechanism; lightweight convolution

1. Introduction

Target detection in aerial remote-sensing images aims to identify and determine the position and type of specific objects contained in the remote-sensing images. With the rapid development of drone and satellite technologies, it has been widely applied in both military and civilian sectors, playing a crucial role in various aspects, such as environmental monitoring [1], urban planning [2], agricultural management [3], and disaster response [4]. As depicted in Figure 1, aerial remote-sensing images have many features, such as overhead imaging, significant changes in object size, and many small targets. These features pose a major challenge to aerial remote-sensing image detection.

In target detection for remote-sensing images, traditional approaches often rely on specific feature selection [5] and hand-designed algorithms [6]. While these methods have yielded some success in specific scenarios, they struggle to adapt to the complexities and variations inherent in remote sensing. In recent years, deep learning has provided a novel solution to these challenges. Deep-learning target detection models fall into two main types: two-stage and one-stage. Two-stage detection generates candidate regions within the image, which are fed into a convolutional neural network (CNN) for target identification and localization through classification and regression. Representative algorithms in this category include R-CNN [7], Fast R-CNN [8], and Faster R-CNN [9]. Single-stage detection, on the other hand, directly computes the coordinates and category probabilities of target objects from the image features extracted by a CNN; examples include SSD [10,11] and the YOLO series [12,13,14,15]. Because two-stage detection algorithms require sequential steps for region proposal and subsequent classification and regression, they tend to be slower than their single-stage counterparts. Moreover, with the introduction of the YOLO family of algorithms, single-stage detection methods have significantly enhanced detection accuracy while maintaining high-speed performance.

In recent years, many researchers have made notable progress in refining YOLO models specifically for detecting targets in remote-sensing images. Zhu et al. [16] proposed TPH-YOLOv5, which integrates the transformer module to enhance the feature extraction capability of the model but increases the number of parameters and the computational complexity. Yang et al. [17] introduced RS-YOLOX, leveraging the ECA attention mechanism and the ASFF feature extraction algorithm to effectively boost the detection accuracy for small targets. Yet, there remains room for improvement in terms of lightweighting. Yu et al. [18] leveraged the centralized feature pyramid (CFP) and the hybrid attention ACmix but obtained only a limited accuracy gain and still encountered missed detections. Liu et al. [19] combined hybrid extended convolution with a self-designed residual network, bolstering feature extraction for targets of varying sizes. However, their model was limited to localizing and classifying aircraft on the DOTA dataset, constraining its generalization capability. Wang et al. [20] introduced a feature processing module in YOLOv8 to integrate shallow and deep features. Still, the target sizes in their experimental dataset are relatively similar, and targets spanning a wide range of sizes remain subject to some false and missed detections.

From the collective efforts of the researchers above, it becomes evident that despite the widespread adoption of YOLO models in remote-sensing target detection, significant opportunities for enhancement persist. Thus, this paper introduces the RN-YOLO (RepGhost NAM YOLO) model, building upon the foundation of improved YOLOv8, and achieves commendable detection results across the DOTAv1.5, TGRS-HRRSD, and RSOD datasets. The main contributions of this study include the following:

(1): To address the difficulty of detecting small targets and extracting detailed features under limited feature-extraction capacity, NAM [21] is integrated between the feature extraction and fusion networks. NAM preserves small-target features through its lightweight design and enhances detection accuracy by adjusting the weight contribution factor.
(2): The RepGhost module [22] is introduced within the feature fusion network, creating RepGhost_C2f. This innovation effectively tackles the issue of inadequate detection capacity for targets spanning a wide range of sizes while significantly reducing model parameters.
(3): The WIoU loss function [23] replaces the CIoU in the original model, enhancing detectors’ overall performance by assigning different weights to the targets with various sizes and alleviating the challenge of localizing small targets.

2. Related Works

2.1. Machine Learning in Remote-Sensing Images

With the continuous evolution of machine-learning technology, feature-based classifiers have emerged as promising tools in remote-sensing target detection. Compared to traditional approaches, these methods offer enhanced capability to handle intricate data relationships, thereby improving detection robustness. Through meticulously crafted features, these classifiers proficiently identify, classify, and localize targets. Support Vector Machine (SVM) [24] stands out as a formidable classifier, partitioning data into distinct classes by identifying optimal hyperplanes. Meanwhile, Random Forest (RF) [25] employs an ensemble learning approach, constructing multiple decision trees and aggregating their outputs through voting, ensuring high accuracy and resilience. On the other hand, AdaBoost [26] iteratively enhances the classification performance by training a sequence of weak classifiers and combining their outputs with appropriate weights.

These machine learning-based methodologies provide a robust framework for remote-sensing target detection, facilitating accurate identification and localization of targets within vast remote-sensing datasets. However, one limitation persists: the reliance on manual feature engineering, which is intricate and laborious. This constraint hampers scalability for target detection in extensive remote-sensing datasets and impedes real-time applicability in remote-sensing image target detection.

2.2. Reinforcement Learning in Remote-Sensing Images

Reinforcement learning-based models typically involve the design of an intelligent agent capable of interacting with its environment and learning optimal action strategies based on feedback. In the context of target detection in remote-sensing images, the environment typically comprises the dataset of remote-sensing images containing the targets to be identified. Following each action within this environment, the intelligent agent receives a portion of the image and a corresponding reward. Subsequently, the model selects its following action based on its learned strategy, such as zooming in on a specific image region or shifting focus to another part of the image to pinpoint a particular detection target.

DQN [27] optimizes decision-making in target detection tasks by learning a value function that guides action selection. However, DQN utilized in target detection often demands many samples and much training time, consuming significant computational resources. Actor–Critic [28] integrates a policy gradient and a value function, allowing for updates to both during the learning process, thereby enhancing stability. Nevertheless, Actor–Critic methods typically necessitate well-designed reward functions and state representations, which is usually tricky in remote-sensing image target detection.

2.3. YOLOv8 Model

Thus far, deep learning-based remote-sensing image target detection remains widely utilized, with YOLO emerging as a frontrunner due to its faster speed and higher accuracy. YOLOv8 has swiftly become the prevailing model in target detection, leveraging its robust generalization capabilities and enhanced accuracy. The network architecture of YOLOv8 [29] comprises two key components: the backbone and the head part, with the latter encompassing feature fusion and target detection. Breaking away from previous iterations, YOLOv8 opts for the C2f module over the C3 module, enhancing the flow of gradient information while still upholding its lightweight design. Additionally, YOLOv8 shifts from an anchor-based approach to an anchor-free mode, eliminating the necessity for predefined bounding boxes and providing a more adaptable solution space. Finally, YOLOv8 employs distribution focal loss, facilitating a quicker focus near the target and expediting convergence.
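To make the distribution-based regression idea more concrete, the following Python sketch shows how a box-side offset could be decoded as the expected value of a discrete distribution over bins, which is the representation that distribution focal loss is trained on. The function names and the four-bin example are assumptions made for illustration; this is not code from YOLOv8 or from the authors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_offset(logits):
    """Decode one box-side offset as the mean of a discrete distribution.

    Bin i stands for an offset of i units (hypothetical convention).
    """
    probs = softmax(logits)
    return sum(i * p for i, p in enumerate(probs))


if __name__ == "__main__":
    # A distribution peaked on bin 2 decodes to an offset close to 2.
    print(expected_offset([0.0, 1.0, 6.0, 1.0]))
```

Because the offset is a weighted mean rather than a single regressed number, a sharper distribution around the true bin gives a more confident localization.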

3. Methods

3.1. Overall Architecture

YOLOv8 has demonstrated significant success in various domains, owing to its robust generalization and heightened accuracy. In addressing challenges such as wide-ranging object sizes, numerous small targets, and limited network feature extraction in remote-sensing images, this paper endeavors to enhance YOLOv8 for remote-sensing image detection. Initially, the NAM module is introduced after the C2f module within the backbone to bolster the learning of crucial image features. This paper adopts three NAM modules, aiming to subject the outputs of varying size dimensions in the backbone to comprehensive feature learning. Furthermore, after conducting comparative experiments on both the backbone and the head, this paper opts to substitute the C2f module connected to the detection head in the head segment with RepGhost_C2f. This modification amplifies feature fusion, enhances model detection accuracy, and streamlines the network model’s parameter count. Lastly, the CIoU loss function is supplanted with WIoU in the detection head to further refine small targets’ localization. The enhanced model structure, depicted in Figure 2, incorporates improvements across three key areas: the backbone, head, and detection head, with the subsequent Section 3.2, Section 3.3 and Section 3.4 elaborating on the respective ideas and methodologies for enhancement.

3.2. Normalization-Based Attention Module (NAM)

The attention mechanism directs the network towards vital information within the image, identifying critical features while filtering out non-essential ones. Therefore, incorporating such a mechanism into target detection can effectively retain the detailed features, bolster the model's memory, and ultimately elevate the detection accuracy. This paper uses the normalization-based attention module (NAM), which improves on the convolutional block attention module (CBAM).

CBAM represents a noteworthy attention mechanism model with versatile applications. It generates two sets of weights for each channel by employing global average pooling and global maximum pooling. These weights then undergo a series of transformations, including feedforward neural network processing with shared parameters, element-wise summation, and softmax activation, yielding final channel-specific weights.

In contrast, NAM alters this intermediate process, as shown in Figure 3. Initially, the individual channels undergo batch-wise normalization, and a scaling factor $\lambda_{i}$ ($i \leq C$) in normalization is introduced. Since this scaling factor is learnable, the weight $W_{i}$ derived from the scaling factor is used to represent the importance of each channel, as shown in Equation (1), obviating the cumbersome parameters associated with fully connected and convolutional layers.

$$W_{i} = \frac{\lambda_{i}}{\sum_{j = 1}^{C} \lambda_{j}} \tag{1}$$

Subsequently, the channel weights are multiplied with each pixel point and activated by the softmax function, akin to CBAM. Finally, the probabilities at the positions of each pixel point are used as weights and multiplied with each pixel point to enable filtering of the features represented by each channel. The NAM module ensures feature integrity by reducing rather than deleting, and it also highlights specific features to capture details about small targets. The subsequent experimental results can verify that NAM can enhance the extraction of crucial information and lay a solid foundation for subsequent feature fusion.
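As a rough illustration of Equation (1), the Python sketch below computes the channel weights from a list of scaling factors and uses them to gate features that are assumed to be batch-normalized already. The function names, the list-of-lists feature layout, the assumption that all scaling factors are positive, and the element-wise sigmoid gate are choices made for this sketch only (the text above describes a softmax-style activation); it is not the authors' implementation.

```python
import math


def channel_weights(scales):
    """Equation (1): W_i = lambda_i / sum_j lambda_j (positive scales assumed)."""
    total = sum(scales)
    return [s / total for s in scales]


def gate_channels(features, scales):
    """Re-weight each channel and gate it with an element-wise sigmoid.

    features: list of C channels, each a flat list of normalized values.
    scales:   list of C learnable normalization scaling factors.
    The sigmoid gate is an assumption of this sketch.
    """
    weights = channel_weights(scales)
    gated = []
    for channel, w in zip(features, weights):
        gated.append([x / (1.0 + math.exp(-w * x)) for x in channel])
    return gated


if __name__ == "__main__":
    scales = [1.0, 3.0]
    print(channel_weights(scales))  # the second channel gets the larger weight
    print(gate_channels([[0.5, -0.5], [0.5, -0.5]], scales))
```

Note that features are scaled down rather than removed, which matches the description of NAM as reducing rather than deleting features.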

3.3. RepGhost_C2f

In addition to enhancing the backbone’s feature extraction using NAM, this paper further optimizes the C2f module in the head and proposes RepGhost_C2f. The RepGhost module can achieve a simultaneous reduction in hardware computation and the model’s parameters, as demonstrated by experiments showcasing its robustness.

Given that numerous similar feature maps are generated during convolution operations, GhostNet addresses this redundancy by employing inexpensive linear operations to derive additional feature maps from a subset of the original ones, as depicted in Figure 4a. Building upon GhostNet, RepGhost further refines this process by substituting concat operations with addition operations, as illustrated in Figure 4b. RepGhost enhances the feature fusion efficiency and reduces the computational overhead at the hardware level. Moreover, RepGhost enhances the model parameter efficiency by repositioning the ReLU activation function after the addition operation, ensuring that only genuinely influential features undergo nonlinear transformation. Additionally, because the middle channels of RepGhost are halved compared to GhostNet, fewer channels pass through the SE attention mechanism or the depthwise separable convolution module, which reduces both the parameter count and the computational load.
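The parameter saving behind the Ghost idea can be checked with simple arithmetic. The sketch below compares the weight count of a standard convolution with a GhostNet-style layer in which half of the output channels come from a primary convolution and the other half from a cheap depthwise operation. The function names, the half-and-half split, and the 3 × 3 cheap kernel are assumptions; biases are ignored, and the sketch does not model RepGhost's re-parameterization or the RepGhost_C2f module itself.

```python
def conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k


def ghost_params(c_in, c_out, k, d=3):
    """Weights of a Ghost-style layer (bias ignored).

    Half of the outputs come from a primary k x k convolution; the rest
    are derived from them with one d x d depthwise filter per channel.
    """
    primary = c_out // 2
    cheap = c_out - primary
    return conv_params(c_in, primary, k) + cheap * d * d


if __name__ == "__main__":
    standard = conv_params(128, 128, 1)
    ghost = ghost_params(128, 128, 1)
    print(standard, ghost, round(ghost / standard, 3))
```

For a 1 × 1 layer with 128 input and output channels, the Ghost-style layer in this sketch needs roughly half the weights of the standard convolution.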

C2f stands out as the pivotal module within YOLOv8, offering lightweight characteristics and the capability to amalgamate intricate abstract information from the deep network with detailed information from the shallow network, thereby enhancing target detection accuracy. This paper replaces the core bottleneck module in C2f with RepGhost to construct the new RepGhost_C2f module, as depicted in Figure 5. This modification amplifies the network’s feature extraction and fusion capabilities and further reduces the network’s parameter and computational load. Thus, it accomplishes a dual enhancement in both performance and efficiency.

3.4. WIoU (Wise Intersection over Union)

Choosing an appropriate loss function is crucial for accurately localizing the target object and giving its coordinate position. IoU-based bounding-box loss functions continue to evolve, with variants such as SIoU, EIoU, and GIoU. In YOLOv8, the CIoU (Complete Intersection over Union) loss function is employed to quantify the disparity between the predicted and ground truth boxes, as depicted in Equations (2)–(4).

$$L_{CIoU} = 1 - IoU + \frac{\rho^{2}(b, b^{gt})}{H^{2} + W^{2}} + \alpha \nu \tag{2}$$

$$\nu = \frac{4}{\pi^{2}} \left( \arctan \frac{w^{gt}}{h^{gt}} - \arctan \frac{w}{h} \right)^{2} \tag{3}$$

$$\alpha = \frac{\nu}{(1 - IoU) + \nu} \tag{4}$$

where IoU represents the intersection over union ratio between the ground truth bounding box and the predicted bounding box. $b$ and $b^{gt}$ signify the center coordinates of the two bounding boxes, and $\rho(b, b^{gt})$ denotes the Euclidean distance between the center coordinates. $\alpha \nu$ indicates the aspect-ratio similarity of the ground truth box and the predicted box. In the last two supplementary formulas, $w^{gt}$ and $h^{gt}$ are the width and height of the ground truth bounding box, $w$ and $h$ are the width and height of the predicted bounding box, and $H$ and $W$ are the height and width of the smallest box enclosing the two bounding boxes.
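The CIoU computation can be sketched in a few lines of plain Python. The sketch follows the standard CIoU definition, with the squared center distance in the numerator and the factor 4/π² in the aspect-ratio term. The box format (x1, y1, x2, y2), the function names, and the guard for a zero denominator are assumptions of this sketch, which is not taken from the YOLOv8 code base.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def ciou_loss(pred, gt):
    """CIoU loss: 1 - IoU + center-distance term + aspect-ratio term."""
    i = iou(pred, gt)
    # Squared distance between the two box centers.
    rho2 = ((pred[0] + pred[2]) / 2 - (gt[0] + gt[2]) / 2) ** 2 \
         + ((pred[1] + pred[3]) / 2 - (gt[1] + gt[3]) / 2) ** 2
    # Width and height of the smallest box enclosing both boxes.
    W = max(pred[2], gt[2]) - min(pred[0], gt[0])
    H = max(pred[3], gt[3]) - min(pred[1], gt[1])
    w, h = pred[2] - pred[0], pred[3] - pred[1]
    wg, hg = gt[2] - gt[0], gt[3] - gt[1]
    v = (4 / math.pi ** 2) * (math.atan(wg / hg) - math.atan(w / h)) ** 2
    denom = (1 - i) + v
    alpha = v / denom if denom > 0 else 0.0
    return 1 - i + rho2 / (H ** 2 + W ** 2) + alpha * v


if __name__ == "__main__":
    print(ciou_loss((0, 0, 2, 2), (1, 1, 3, 3)))
```

When both boxes have the same aspect ratio, as in the example call, the aspect-ratio term vanishes and only the overlap and center-distance terms contribute.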

However, the CIoU loss function solely considers the width and height factors and lacks discrimination between large and small targets: it employs the same computational approach for any bounding box. Since the size of targets in remote-sensing images varies greatly, this may not adequately distinguish between different target sizes. Therefore, we introduce the WIoU loss function to reduce the loss of easily fitted large targets while prioritizing the loss associated with small targets.

Wise-IoU (WIoU) employs a combinatorial approach to calculate the loss function, illustrated in Figure 6. First, as shown in Equation (5), WIoU adopts a multiplicative approach to consider the width and height, effectively reducing the loss associated with high-quality large targets.

$$L_{WIoUv1} = \exp\left( \frac{(x_{a} - y_{a})^{2} + (x_{b} - y_{b})^{2}}{H^{2} + W^{2}} \right) (1 - IoU) \tag{5}$$

where $(x_{a}, x_{b})$ and $(y_{a}, y_{b})$ are the center coordinates of the predicted bounding box and the ground truth bounding box, respectively, and $H$ and $W$ are the height and width of the smallest box enclosing the two bounding boxes. IoU represents the intersection over union ratio between the ground truth bounding box and the predicted bounding box.
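Equation (5) can be sketched as follows: the IoU loss is multiplied by an exponential factor that grows with the distance between the box centers, normalized by the size of the enclosing box. The box format (x1, y1, x2, y2) and the function names are assumptions of this sketch. A training implementation would also typically exclude the normalizing denominator from gradient computation, which plain Python cannot express and is therefore omitted here.

```python
import math


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def wiou_v1(pred, gt):
    """Equation (5): distance-based factor times the IoU loss."""
    # Centers of the predicted and ground truth boxes.
    xa, xb = (pred[0] + pred[2]) / 2, (pred[1] + pred[3]) / 2
    ya, yb = (gt[0] + gt[2]) / 2, (gt[1] + gt[3]) / 2
    # Width and height of the smallest enclosing box.
    W = max(pred[2], gt[2]) - min(pred[0], gt[0])
    H = max(pred[3], gt[3]) - min(pred[1], gt[1])
    factor = math.exp(((xa - ya) ** 2 + (xb - yb) ** 2) / (H ** 2 + W ** 2))
    return factor * (1 - iou(pred, gt))


if __name__ == "__main__":
    print(wiou_v1((0, 0, 2, 2), (1, 1, 3, 3)))
```

The exponential factor equals 1 when the centers coincide, so a well-centered prediction is penalized only through its IoU.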

Following focal loss, which designs a monotonic focusing mechanism for cross-entropy, $L_{WIoUv2}$ applies a monotonic focusing coefficient to $L_{WIoUv1}$, as shown in Equation (6). This effectively reduces the contribution of simple examples to the loss value, enabling the model to focus on difficult examples and improve the classification performance.

$$L_{WIoUv2} = \left( \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \right)^{\gamma} L_{WIoUv1} \tag{6}$$

where $L_{IoU}^{*}$ is a dynamic gradient gain, $\overline{L_{IoU}}$ is a momentum-based running average of the IoU loss that normalizes the loss function to counter slow convergence in the later stages of training, and $\gamma$ is the dynamic update normalization factor.

Finally, to counterbalance the detrimental gradients produced by low-quality targets and prioritize the detection of abundant small targets, WIoU uses $\beta$ to construct non-monotonic focusing coefficients, as depicted in Equations (7) and (8), where $\alpha$ and $\delta$ are hyper-parameters. This ensures that high-quality large and low-quality small targets receive a low loss value.

$$L_{WIoUv3} = \frac{\beta}{\delta \alpha^{\beta - \delta}} L_{WIoUv1} \tag{7}$$

where

$$\beta = \frac{L_{IoU}^{*}}{\overline{L_{IoU}}} \tag{8}$$
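The two focusing mechanisms can be compared numerically with a short Python sketch. The function names and the default values for γ, α, and δ are illustrative assumptions and are not settings reported in this paper; the running average of the IoU loss is passed in as a plain number rather than being updated with momentum.

```python
def outlier_degree(l_iou, mean_l_iou):
    """Equation (8): ratio of the current IoU loss to its running average."""
    return l_iou / mean_l_iou


def wiou_v2(l_wiou_v1, l_iou, mean_l_iou, gamma=0.5):
    """Equation (6): monotonic focusing (gamma is an assumed value)."""
    return outlier_degree(l_iou, mean_l_iou) ** gamma * l_wiou_v1


def focusing_coefficient(beta, alpha=1.9, delta=3.0):
    """Non-monotonic coefficient beta / (delta * alpha ** (beta - delta))."""
    return beta / (delta * alpha ** (beta - delta))


def wiou_v3(l_wiou_v1, l_iou, mean_l_iou, alpha=1.9, delta=3.0):
    """Equation (7): non-monotonic focusing applied to the v1 loss."""
    beta = outlier_degree(l_iou, mean_l_iou)
    return focusing_coefficient(beta, alpha, delta) * l_wiou_v1


if __name__ == "__main__":
    for beta in (0.5, 1.5, 3.0, 5.0):
        print(beta, round(focusing_coefficient(beta), 3))
```

With these assumed hyper-parameters the coefficient rises for moderate values of β and falls again for large ones, so both very easy and very poor boxes are down-weighted, while β = δ leaves the v1 loss unchanged.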

4. Experiments

4.1. Experimental Datasets and Their Preprocessing

The RN-YOLO model is tested and evaluated using three aerial remote-sensing image datasets: DOTAv1.5 [30], TGRS-HRRSD [31], and RSOD [32]. DOTAv1.5 serves as the main dataset for this experiment, facilitating exhaustive comparisons and ablation experiments. To verify the model's generalization, experimental validations and comparisons are also conducted on the TGRS-HRRSD and RSOD datasets.

DOTAv1.5, provided by the China Resources Satellite Data and Application Center (CRSDAC), encompasses 2806 super-large images spanning 16 categories and containing 403,318 instances. Due to its extensive coverage, this dataset is widely utilized across various research domains. However, given the impractical size of the images (up to 20,000 × 20,000 pixels), they are unsuitable for direct model training. Hence, before experimentation, we resized the images to 640 × 640. Subsequently, we augmented the dataset using a variety of techniques and their pairwise combinations, including flipping images horizontally and vertically, enhancing the brightness, adding Gaussian noise, and injecting perceptual noise, as illustrated in Figure 7. The final dataset comprises 25,114 images, divided into training, validation, and testing sets in a 6:2:2 ratio.
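A 6:2:2 partition of this kind can be produced with a small helper such as the one below. The function name, the fixed random seed, and the choice to give any remainder to the test set are assumptions of this sketch; the paper does not state how the split was drawn.

```python
import random


def split_6_2_2(items, seed=0):
    """Shuffle items and split them into train/val/test at a 6:2:2 ratio.

    Integer arithmetic avoids floating-point rounding; leftover items
    go to the test set (an assumption of this sketch).
    """
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 6 // 10
    n_val = n * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_6_2_2(range(100))
    print(len(train), len(val), len(test))
```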

Similarly, TGRS-HRRSD offers a sizable, high-resolution dataset. For our experiments, we focused solely on comparison and validation tests, selecting images with dimensions under 1000 × 1000, totaling 5406 images with 400 images per category. We partitioned this dataset into training, validation, and testing sets following the same 6:2:2 ratio.

RSOD consists of 976 images spread across four categories, primarily sized at 1000 × 1000. We applied identical image augmentation techniques to ensure dataset balance, resulting in 1916 augmented images, with approximately 400 images per category.

Additionally, to visualize the distribution of object sizes in the images, we compute the percentage of the image area that each object occupies and bin the horizontal axis in steps of 0.5%, as shown in Figure 8. The figures clearly demonstrate that the majority of objects occupy less than 3% of the total area, with those in DOTAv1.5 typically accounting for less than 1%, which makes these datasets suitable for validating the performance of our model in small object detection.
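A size distribution of this kind can be computed as sketched below: each object's box area is expressed as a percentage of its image area, and the percentages are counted in 0.5%-wide bins. The function names and the input layout are assumptions for illustration, not the authors' analysis script.

```python
def area_percent(box_w, box_h, img_w, img_h):
    """Share of the image area covered by one bounding box, in percent."""
    return 100.0 * (box_w * box_h) / (img_w * img_h)


def size_histogram(percents, bin_width=0.5):
    """Map each bin's lower edge to the percentage of objects in that bin."""
    counts = {}
    for p in percents:
        index = int(p // bin_width)
        counts[index] = counts.get(index, 0) + 1
    total = len(percents)
    return {index * bin_width: 100.0 * c / total
            for index, c in sorted(counts.items())}


if __name__ == "__main__":
    boxes = [(20, 20), (32, 32), (64, 64), (100, 80)]  # (width, height) in pixels
    percents = [area_percent(w, h, 640, 640) for w, h in boxes]
    print(size_histogram(percents))
```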

4.2. Experimental Environment and Training Setting

The experimental hardware for this study is an Intel Xeon Platinum 8352V CPU with 120 GB of memory and an NVIDIA GeForce RTX 4090 GPU. The experimental software environment is Python 3.8, PyTorch 2.0.0, CUDA 11.8, and Ubuntu 22.04.1. In addition, the version of the ultralytics package used for YOLOv8 was 8.0.212. In the experiments, we set the number of warm-up epochs to 3 with a warm-up learning rate of 0.1; the initial learning rate is 0.01 and is gradually decayed to 0.0001. The batch size is 16 to fit the model and avoid running out of memory. Furthermore, we adopt a cross-entropy loss function for classification and distribution focal loss for bounding box regression. We also replaced CIoU with WIoU, which the experiments below show leads to a better performance.
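The training settings described above could be collected in a configuration such as the following. The keys reuse argument names from the ultralytics training configuration (`lr0`, `lrf`, `warmup_epochs`, `warmup_bias_lr`, `batch`, `imgsz`), but the mapping from the text to these keys is our assumption: in particular, reading the warm-up learning rate of 0.1 as `warmup_bias_lr` and the final rate of 0.0001 as `lr0 * lrf` is an interpretation, not something the paper states.

```python
# Hypothetical mapping of the reported settings onto ultralytics-style keys.
train_config = {
    "imgsz": 640,           # images are resized to 640 x 640
    "batch": 16,            # batch size
    "warmup_epochs": 3,     # warm-up epochs
    "warmup_bias_lr": 0.1,  # warm-up learning rate (assumed mapping)
    "lr0": 0.01,            # initial learning rate
    "lrf": 0.01,            # final rate as a fraction of lr0 (assumed mapping)
}


def final_learning_rate(config):
    """Learning rate reached at the end of the decay schedule."""
    return config["lr0"] * config["lrf"]


if __name__ == "__main__":
    print(final_learning_rate(train_config))
```

Under this reading, the product of `lr0` and `lrf` reproduces the reported final learning rate of 0.0001.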

4.3. Experimental Comparison and Analysis

In remote-sensing target detection, achieving a balance between model precision and speed is often paramount. To comprehensively evaluate model performance, this paper assesses accuracy using metrics such as P (Precision), R (Recall), mAP@50, and mAP@.5:.95. Additionally, the number of parameters (Parameters) in the model is considered to evaluate the operational efficiency. This holistic approach provides a thorough and detailed assessment of both precision and efficiency, offering valuable insights into the model’s applicability. By considering these metrics collectively, a more accurate evaluation of the model’s performance is attained, thus facilitating informed decisions for further enhancement and optimization. Definitions of these evaluation metrics can be found in the literature [17].
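For orientation, the sketch below shows how precision and recall follow from detection counts and how mAP@.5:.95 averages the mean average precision over ten IoU thresholds from 0.5 to 0.95. The function names and the example numbers are hypothetical; computing the average precision at each threshold requires ranked detections and is outside the scope of this sketch.

```python
def precision(tp, fp):
    """Fraction of predicted boxes that are correct."""
    return tp / (tp + fp) if tp + fp else 0.0


def recall(tp, fn):
    """Fraction of ground truth boxes that are found."""
    return tp / (tp + fn) if tp + fn else 0.0


def coco_iou_thresholds():
    """IoU thresholds 0.50, 0.55, ..., 0.95 used by mAP@.5:.95."""
    return [round(0.5 + 0.05 * i, 2) for i in range(10)]


def map_50_95(map_per_threshold):
    """Average of the mAP values, one per IoU threshold."""
    return sum(map_per_threshold) / len(map_per_threshold)


if __name__ == "__main__":
    print(precision(tp=8, fp=2), recall(tp=8, fn=8))
    print(coco_iou_thresholds())
```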

4.3.1. Comparative Experiments for Attention Module

To assess NAM’s efficacy, we conducted experiments incorporating various other commonly employed attention modules for comparison. For instance, CBAM [33] uses a sequential application of channel attention followed by spatial attention, exemplifying a typical hybrid attention mechanism. Context aggregation [34] fuses pixel relationships at multiple scales and in different spaces, thus weighting them separately. Shuffle attention [35] spatially divides the data and assigns attention weights separately for channel-wise information exchange.

To ensure comparability across experimental results, all attention mechanisms in this paper are positioned after C2f modules, which connect the backbone and head. Following training and testing on DOTAv1.5, the results are summarized in Table 1. Based on mAP@.5:.95, it is evident that the model augmented with the NAM module exhibits a superior detection performance, boasting a 1.3% enhancement over the YOLOv8 benchmark model. Notably, the NAM module utilizes the fewest parameters, with virtually no additional parameters added. This underscores the efficacy of the NAM module in achieving significant improvement while maintaining parameter stability.

Furthermore, comparison experiments were carried out on the RSOD and TGRS-HRRSD datasets, as outlined in Table 2. The findings demonstrate that the model integrating the NAM consistently surpasses YOLOv8 across all three datasets, yielding enhancements of 1.2%, 0.8%, and 1.3% on RSOD, TGRS-HRRSD, and DOTAv1.5, respectively. This underscores the substantial influence of the NAM module in improving the detection accuracy across diverse datasets.

4.3.2. Comparative Experiments for Lightweight Convolution

YOLOv8 utilizes C2f modules in both the backbone and the head section. The improved C2f module in this paper can enhance model performance while maintaining generalization. To ascertain the impact of integrating improved C2f modules, we conducted comparative experiments on the RSOD and DOTAv1.5 datasets. To ensure the consistency of the experiments, all C2f modules in the backbone or in the head section were replaced. Two lightweight modules, SCConv [36] and RepGhost, are employed for comparison, and the results are presented in Table 3. Although both lightweight modules decrease the number of parameters, they are more effective in the head, whereas applying them in the backbone may decrease the detection performance. Therefore, we decided to improve and replace the C2f modules connected to the detection head to achieve an optimal detection performance. The comparison with YOLOv8 reveals that the lightweight modules reduce the number of model parameters, and the RepGhost module demonstrates a superior performance on RSOD and DOTAv1.5, improving by 1.0% and 2.1%, respectively. Consequently, the RepGhost module was selected in this paper to enhance C2f, improving both the parameter efficiency and the performance of the model.

Table 4 shows the experimental results on TGRS-HRRSD, indicating a significant reduction in the model's parameter count and a considerable improvement in performance following the integration of RepGhost. This underscores the strong generalization capability of RepGhost_C2f, prompting us to replace the C2f modules that connect the head to the detection head with RepGhost_C2f modules.

4.3.3. Comparative Experiments for Loss Function

YOLOv8 uses the CIoU loss function, which only considers the width and height without distinguishing between high-quality and normal-quality targets. To address this limitation, WIoU and GIoU were employed as replacements for the CIoU loss function in this study, and the resulting experimental outcomes are summarized in Table 5. The table shows that WIoU achieves improvements of 0.9%, 0.6%, and 1.8% over the original YOLOv8 model across the three datasets, respectively, and also outperforms GIoU.

4.3.4. Ablation Study

We systematically integrated the enhancements above into the original YOLOv8 model and evaluated its performance on the cropped and augmented DOTAv1.5 dataset. The results are presented in Table 6. The mAP@.5:.95 of RN-YOLO is improved by 3.6% over YOLOv8, and the parameter count is reduced by 13.9%.

4.3.5. Comparative Experiments with Other Models

To thoroughly assess our proposed model's target detection accuracy and generalization capability, we conducted a comprehensive comparative analysis across three datasets, employing state-of-the-art target detection algorithms for remote sensing as benchmarks. The summarized results are presented in Table 7. It is clear from the table that our proposed model consistently surpasses existing algorithms across all three aerial image datasets, demonstrating a superior detection accuracy. Additionally, our model has a significantly lower parameter count than YOLOv8.

Finally, we comprehensively compared the YOLO series models and RN-YOLO on the TGRS-HRRSD dataset. Table 8 presents the prediction accuracies for each category. Our model demonstrates an absolute accuracy advantage in the majority of categories.

4.4. Visualization Experiments

To showcase the improved detection performance of the RN-YOLO model, we conducted visualization experiments on representative scenes from the three datasets, as illustrated in Figure 9. The top row shows the detection results obtained using YOLOv8, while the bottom row shows the results achieved by RN-YOLO. The comparison indicates that YOLOv8 still suffers from a significant number of missed detections on small targets. Conversely, RN-YOLO exhibits a remarkable capability in detecting small targets, even amidst large target size spans and many small targets. Moreover, the comparison images selected encompass various datasets: the first from the DOTA dataset, the second from the RSOD dataset, and the third from the TGRS-HRRSD dataset. This highlights the effectiveness of RN-YOLO in detecting small targets and its robust generalization across diverse datasets.

5. Discussion

In remote-sensing image detection, there are problems with large target size spans and many small targets. The existing models cannot extract detailed features, leading to missing and false detection. This paper improves the feature extraction network, feature fusion network, and location loss function of YOLOv8 and proposes RN-YOLO. RN-YOLO improves the target detection accuracy while reducing the parameters. Firstly, we integrate NAM into the feature extraction network to filter features. NAM can prioritize key features and suppress insignificant features. Secondly, we introduce RepGhost_C2f in the feature fusion network. RepGhost_C2f can boost the object detection accuracy while substantially reducing the parameters. Lastly, we refine the localization loss function WIoU to mitigate difficulties in localizing small targets and enhance the object detection accuracy. The experimental results demonstrate that our model effectively enhances mAP@.5:.95 by 3.6%, 1.2%, and 2% on the DOTAv1.5, TGRS, and RSOD datasets, respectively, compared to YOLOv8, while reducing the parameters by 13.9%, showcasing strong generalization. This study underscores the effectiveness of lightweight convolution, attention mechanisms, and appropriate loss functions in improving target detection algorithms, offering novel methods and insights in remote-sensing target detection.

Future research will focus on optimizing the network structure and extensively exploring attention mechanisms to further enhance the network detection accuracy while maintaining a lower number of parameters. Additionally, we aim to refine the loss function to ensure a robust performance when encountering tilted bounding boxes. Lastly, leveraging the reasoning capabilities of large multimodal models on detected images provides a pathway for reflecting on and enhancing detection results.

Author Contributions

Conceptualization, K.W. and G.Y.; methodology, K.W. and G.Y.; software, K.W.; validation, K.W.; data curation, H.W.; writing—original draft preparation, K.W.; writing—review and editing, H.Z. and G.Y.; supervision, G.Y.; funding acquisition, H.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was financially supported by the Key R&D Projects of Yunnan Province (Grant No. 202202AD080004) and the Natural Science Foundation of China (Grant No. 62061049, 12263008).

Data Availability Statement

All the original datasets mentioned in this paper are accessible. DOTAv1.5 can be obtained from https://captain-whu.github.io/DOAI2019/dataset.html (accessed on 20 April 2023), TGRS-HRRSD can be acquired from https://github.com/CrazyStoneonRoad/TGRS-HRRSD-Dataset (accessed on 20 April 2023), and RSOD can be downloaded from https://github.com/RSIA-LIESMARS-WHU/RSOD-Dataset- (accessed on 20 April 2023).

Conflicts of Interest

The authors declare there is no conflict of interest.

Figure 1. Remote-sensing images, with red frames marking the bounding boxes. (a) Large difference in target size; (b) Numerous small targets.

Figure 2. RN-YOLO model structure for remote-sensing image detection. Red dotted boxes mark the added or modified parts of the model, and the final arrow points to the model output.

Figure 3. Normalization-based attention module structure diagram.

Figure 4. GhostNet and RepGhost structure. (a) GhostNet structure; (b) RepGhost structure.

Figure 5. RepGhost_C2f structure.

Figure 6. Diagram of the key symbols used in WIoU and CIoU.

Figure 7. Examples of dataset image augmentation. (a) Original image; (b) Image with horizontal flip and salt-and-pepper noise; (c) Image with vertical flip and increased brightness.

Figure 8. Bar charts of the percentage of each object category in the three datasets. (a) RSOD; (b) TGRS-HRRSD; (c) DOTAv1.5.

Figure 9. Detection comparison for YOLOv8 and RN-YOLO. (a) Detection results using YOLOv8; (b) Detection results using RN-YOLO.

Table 1. Comparison of the effectiveness of different attention mechanisms.

Methods	P (%)	R (%)	Params (M)	mAP@50 (%)	mAP@.5:.95 (%)
YOLOv8	85.0	81.6	11.13	86.2	60.2
+NAM	86.3	82.3	11.13	86.7	61.5
+CBAM	85.2	82.3	11.48	86.7	60.8
+ContextAggregation	86.9	82.0	11.82	86.9	61.3
+ShuffleAttention	87.2	82.6	11.13	87.2	61.2

Table 2. Performance of the NAM attention mechanism on three aerial remote-sensing datasets.

Methods	RSOD P (%)	RSOD R (%)	RSOD mAP@.5:.95 (%)	TGRS-HRRSD P (%)	TGRS-HRRSD R (%)	TGRS-HRRSD mAP@.5:.95 (%)	DOTAv1.5 P (%)	DOTAv1.5 R (%)	DOTAv1.5 mAP@.5:.95 (%)
YOLOv8	95.1	94.5	78.5	91.2	87.1	67.9	82.0	81.6	60.2
+NAM	95.4	95.7	79.7	91.6	87.2	68.7	86.3	82.3	61.5

Table 3. Comparative experiments for lightweight modules used in different parts.

Position	Methods	RSOD Params (M)	RSOD mAP@50 (%)	RSOD mAP@.5:.95 (%)	DOTAv1.5 Params (M)	DOTAv1.5 mAP@50 (%)	DOTAv1.5 mAP@.5:.95 (%)
None	YOLOv8	11.13	99.1	78.5	11.13	86.2	60.2
backbone	+Scconv	10.36	97.3	77.7	10.36	84.9	58.6
backbone	+RepGhost	9.59	97.5	78.2	9.59	86.6	60.4
head	+Scconv	10.51	98.0	79.4	10.51	86.7	60.9
head	+RepGhost	9.50	97.5	79.5	9.76	87.0	62.3

Table 4. Performance comparison for RepGhost_C2f on different datasets.

Methods	RSOD P (%)	RSOD R (%)	RSOD mAP@.5:.95 (%)	TGRS-HRRSD P (%)	TGRS-HRRSD R (%)	TGRS-HRRSD mAP@.5:.95 (%)	DOTAv1.5 P (%)	DOTAv1.5 R (%)	DOTAv1.5 mAP@.5:.95 (%)
YOLOv8	95.1	94.5	78.5	91.2	87.1	67.9	85.0	81.6	60.2
+RepGhost	95.8	95.9	79.5	91.1	88.0	68.4	88.1	82.4	62.3

Table 5. Performance comparison for three loss functions on different datasets.

Methods	RSOD mAP@50 (%)	RSOD mAP@.5:.95 (%)	TGRS-HRRSD mAP@50 (%)	TGRS-HRRSD mAP@.5:.95 (%)	DOTAv1.5 mAP@50 (%)	DOTAv1.5 mAP@.5:.95 (%)
YOLOv8	99.1	78.5	91.6	67.9	86.2	60.2
+WIoUv3	98.4	79.4	91.9	68.5	87.6	62.0
+GIoU	98.8	78.5	92.0	68.3	86.4	60.5
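
To make the loss comparison in Table 5 more concrete, the following is a minimal sketch of a Wise-IoU style loss for a single pair of axis-aligned boxes, following the formulation of Tong et al. as we understand it. WIoU v1 scales the IoU loss by a distance-based factor computed from the smallest enclosing box, and WIoU v3 additionally multiplies by a non-monotonic focusing coefficient r = β / (δ·α^(β−δ)), where β is the ratio of the current IoU loss to its running mean. The function names, the (x1, y1, x2, y2) box layout, the defaults α = 1.9 and δ = 3, and passing the running mean as an argument are assumptions made for illustration. In a training framework the enclosing-box term and β would also be detached from the gradient, which plain Python does not model.

```python
import math


def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def wiou_v1(pred, target):
    """IoU loss scaled by a center-distance factor (assumes non-degenerate boxes)."""
    l_iou = 1.0 - iou(pred, target)
    # Width and height of the smallest box enclosing both boxes.
    wg = max(pred[2], target[2]) - min(pred[0], target[0])
    hg = max(pred[3], target[3]) - min(pred[1], target[1])
    # Offset between the two box centers.
    dx = (pred[0] + pred[2]) / 2 - (target[0] + target[2]) / 2
    dy = (pred[1] + pred[3]) / 2 - (target[1] + target[3]) / 2
    r_wiou = math.exp((dx * dx + dy * dy) / (wg * wg + hg * hg))
    return r_wiou * l_iou


def wiou_v3(pred, target, mean_l_iou, alpha=1.9, delta=3.0):
    """WIoU v1 times the focusing coefficient; mean_l_iou must be positive."""
    beta = (1.0 - iou(pred, target)) / mean_l_iou  # outlier degree
    r = beta / (delta * alpha ** (beta - delta))
    return r * wiou_v1(pred, target)
```

When β equals δ the focusing coefficient is exactly 1 and the v3 loss reduces to v1. The coefficient tends to zero for both very small and very large β, which is the non-monotonic behaviour that limits the influence of very easy and very poor predictions.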

Table 6. Ablation experiments.

Model	RepGhost	WIoUv3	NAM	P (%)	R (%)	Params (M)	mAP@50 (%)	mAP@.5:.95 (%)
YOLOv8				85.0	81.6	11.13	86.2	60.2
YOLOv8	√			88.1	82.4	9.76	87.0	62.3 (+2.1)
YOLOv8	√	√		87.3	83.8	9.77	87.8	62.8 (+0.5)
YOLOv8	√	√	√	87.9	84.9	9.77	87.8	63.8 (+1.0)

Table 7. Comparative experimental results with other models.

Dataset	Evaluation Metrics	Improved Faster R-CNN	YOLOX	YOLOv5	YOLOv7	YOLOv8	RN-YOLO (Ours)
DOTAv1.5	mAP@50 (%)	72.5	84.2	87.3	85.7	86.2	87.8
DOTAv1.5	mAP@.5:.95 (%)	50.3	59.9	58.9	58.5	60.2	63.8
DOTAv1.5	Params (M)	60.40	8.94	7.05	36.56	11.13	9.77
TGRS-HRRSD	mAP@50 (%)	74.3	77.4	91.7	92.0	91.6	92.5
TGRS-HRRSD	mAP@.5:.95 (%)	53.2	58.9	66.3	67.5	67.9	69.1
TGRS-HRRSD	Params (M)	60.42	8.94	7.04	36.54	11.13	9.77
RSOD	mAP@50 (%)	80.8	92.1	97.9	98.5	99.1	98.0
RSOD	mAP@.5:.95 (%)	62.1	70.9	73.5	76.6	78.5	80.5
RSOD	Params (M)	60.42	8.94	7.02	36.4	11.13	9.77
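
The headline comparisons can be re-derived from the Table 7 values. The short script below is only a worked check of that arithmetic (the dictionary layout and variable names are ours). It shows that the mAP@.5:.95 gains are absolute differences in percentage points, and that the size of the parameter reduction depends on which model is used as the base.

```python
# mAP@.5:.95 (%) from Table 7, as (YOLOv8, RN-YOLO) pairs.
map_5095 = {
    "DOTAv1.5": (60.2, 63.8),
    "TGRS-HRRSD": (67.9, 69.1),
    "RSOD": (78.5, 80.5),
}
# Parameter counts in millions from Table 7: (YOLOv8, RN-YOLO).
params_m = (11.13, 9.77)

# Absolute gains in percentage points.
gains = {name: round(ours - base, 1) for name, (base, ours) in map_5095.items()}

base, ours = params_m
reduction_vs_yolov8 = round((base - ours) / base * 100, 1)
reduction_vs_rn_yolo = round((base - ours) / ours * 100, 1)

print(gains)  # {'DOTAv1.5': 3.6, 'TGRS-HRRSD': 1.2, 'RSOD': 2.0}
print(reduction_vs_yolov8, reduction_vs_rn_yolo)  # 12.2 13.9
```

Under these values, the 13.9% figure quoted in the text corresponds to the 1.36 M difference taken relative to the RN-YOLO parameter count; taken relative to the YOLOv8 baseline, the same difference is about 12.2%.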

Table 8. Accuracy comparison for each category in the TGRS-HRRSD dataset.

Categories	YOLOv5 (%)	YOLOv7 (%)	YOLOv8 (%)	RN-YOLO (Ours) (%)
ship	66.7	72.4	65.2	61.2
bridge	35.3	40.6	43.5	50.1
ground_track	74.5	79.9	77.5	81.1
storage_tank	84.2	84.1	85.4	88.3
basketball_court	67.9	71.9	69.0	63.9
tennis_court	87.5	87.7	88.2	89.2
airplane	86.2	87.6	85.5	83.5
baseball_diamond	63.4	63.1	65.8	70.3
harbor	73.9	71.3	80.6	82.8
vehicle	74.6	71.9	73.7	72.9
crossroad	53.5	52.7	54.9	58.1
T_junction	45.1	44.3	43.3	42.6
parking_lot	48.1	49.4	49.6	55.2

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
