Article

CNTR-YOLO: Improved YOLOv5 Based on ConvNext and Transformer for Aircraft Detection in Remote Sensing Images

School of Physics and Electronics, Central South University, Lushan South Road, Changsha 410083, China
* Author to whom correspondence should be addressed.
Electronics 2023, 12(12), 2671; https://doi.org/10.3390/electronics12122671
Submission received: 15 May 2023 / Revised: 12 June 2023 / Accepted: 12 June 2023 / Published: 14 June 2023
(This article belongs to the Topic Computational Intelligence in Remote Sensing)

Abstract
Aircraft detection in remote sensing images is an important branch of target detection due to the military value of aircraft. However, the diverse categories of aircraft and the intricate background of remote sensing images often lead to insufficient detection accuracy. Here, we present the CNTR-YOLO algorithm based on YOLOv5 as a solution to this issue. The CNTR-YOLO algorithm improves detection accuracy through three primary strategies. (1) We deploy DenseNet in the backbone to address the vanishing gradient problem during training and enhance the extraction of fundamental information. (2) The CBAM attention mechanism is integrated into the neck to minimize background noise interference. (3) The C3CNTR module is designed based on ConvNext and Transformer to clarify the target’s position in the feature map from both local and global perspectives. This module is applied before the prediction head to optimize the accuracy of prediction results. Our proposed algorithm is validated on the MAR20 and DOTA datasets. The results on the MAR20 dataset show that the mean average precision (mAP) of CNTR-YOLO reached 70.1%, which is a 3.3% improvement compared with YOLOv5l. On the DOTA dataset, the results indicate that the mAP of CNTR-YOLO reached 63.7%, which is 2.5% higher than YOLOv5l.

1. Introduction

With the help of advanced satellite remote sensing technology, many high-resolution remote sensing images have been produced, and they often contain a wealth of information. These images provide rich material for target detection research, so detection methods for remote sensing targets have become a hot topic among scholars [1,2]. Among all types of targets, aircraft have high mobility and are of great value in various fields, especially the military, so studying detection methods for aircraft targets in remote sensing images is significant. It remains a challenging task, however: the top-down view of remote sensing images captures only the upper surfaces of objects, many aircraft types are highly similar to each other, and satellite photography is susceptible to external factors such as weather, light, and shadows [3,4].
In recent years, deep learning algorithms have become the prevailing method for target detection due to advances in computing. Target detection with deep learning can be categorized into two types: single-stage and two-stage target detection algorithms. A single-stage algorithm treats target detection as a combination of regression and classification tasks, while a two-stage algorithm first generates a collection of candidate regions and then identifies and classifies the target object based on these regions [5]. Two-stage algorithms, including R-CNN [6], Fast R-CNN [7], Faster R-CNN [8], and Cascade R-CNN [9], tend to have higher accuracy but suffer from high computational requirements due to the large number of candidate regions, leading to lengthy training periods and slow detection speeds. In contrast, the detection accuracy of single-stage algorithms is typically lower than that of two-stage algorithms, but their detection speed is substantially faster. Notable examples of single-stage algorithms are YOLO [10], SSD [11], RetinaNet [12], and FCOS [13].
Numerous studies have applied deep learning algorithms to detect aircraft targets in remote sensing images. For instance, Liu proposed a two-stage algorithm that uses the Harris operator to detect corners, clusters them with mean-shift clustering to generate small yet precise candidate regions, and then identifies the aircraft region with a CNN model, resulting in enhanced detection accuracy [14]. In DPANet, Shi introduced a deconvolution module to extract external structural features of the aircraft, followed by a position attention mechanism to extract internal structural features, which reduced the false detection rate and improved detection precision [15]. Wu optimized Mask R-CNN by combining self-calibrated convolution with ResNet in the backbone, making the features more discriminative and improving network accuracy [16]. Ji extended Fast R-CNN by incorporating a multi-angle change module that extracts target features from multiple viewpoints, thereby reducing the false detection rate, and further employed a box detection post-processing method with a majority voting strategy to minimize the likelihood of misjudgment [17]. Although these two-stage algorithms have accuracy advantages, they remain more complex than one-stage algorithms. Therefore, many researchers continue to focus on one-stage algorithms, particularly the YOLO series. For example, Cao improved the YOLOv3 model by adding a detection scale with a smaller receptive field and using L2 regularization to combat overfitting [18]. Zhou devised the Deeper and Wider Module (DAWM), inspired by the Inception-ResNet model; incorporating DAWM into YOLOv3 effectively mitigated the impact of background noise and further improved network performance [19]. Luo added center and scale calibration at the beginning and end of the batch normalization layer in YOLOv5 to address the problem that batch normalization ignores representation differences between instances, enabling features to be corrected and improving overall network performance [20]. Liu proposed the YOLO-extract algorithm, which removed feature layers and prediction heads in YOLOv5 with suboptimal feature extraction ability and replaced them with a new feature extractor possessing stronger feature extraction capabilities, improving accuracy while reducing computational cost [21]. Notwithstanding these advances, some algorithms fail to fully utilize the global and local information of remote sensing images, resulting in aircraft target misdetection. To address this shortcoming, we require a novel aircraft detection algorithm for remote sensing images that leverages global and local information more efficiently.
In this paper, we present CNTR-YOLO, which is an improved version of YOLOv5. We have made several modifications to enhance network performance. Firstly, we introduced the Dense module based on DenseNet to reinforce the feature extraction capability of the backbone. By reusing features, this module mitigates the loss of valid information. Secondly, we added the CBAM attention module to the neck to produce attention maps iteratively across both channel and spatial dimensions. This module assists in identifying areas with aircraft targets in images while reducing the impact of background noise interference. Lastly, in order to make full use of global and local information in remote sensing images, we established the C3CNTR module by combining the Transformer Block and ConvNext Block. This novel design is placed before the detection head of YOLOv5 and leverages the Transformer Block for processing global information and the ConvNext Block for processing local information.
Our contributions can be summarized as follows:
1. We propose a single-stage object detection algorithm to improve the accuracy of aircraft detection in remote sensing images.
2. For the first time, we design a structure that combines a convolutional network and Transformer in YOLOv5 to assist the prediction head, maximizing the utilization of local and global feature information.
3. We validate some effective measures to improve the performance in YOLOv5, such as using DenseNet to improve feature extraction and the CBAM attention mechanism to reduce interference from background information.

2. Related Work

In this section, we provide an overview of the key components of our proposed algorithm. Specifically, we discuss YOLOv5, Transformer, and ConvNext.

2.1. YOLOv5

YOLOv5 was released in 2020 by Ultralytics LLC and was built upon the foundation of YOLOv3 [22]. It addressed the earlier trade-off in which faster detection came at the expense of accuracy, while also improving real-time performance and simplifying the network structure. Composed of a backbone, neck, and head, YOLOv5 offers five models, from YOLOv5n to YOLOv5x, that differ in network depth. Although YOLOv5x exhibits marginally higher detection accuracy than YOLOv5l, the latter is faster and requires fewer hardware resources. Therefore, we conduct our research based on YOLOv5l.
Figure 1 illustrates the architecture of YOLOv5. The feature extraction network of YOLOv5 is composed of a CSPDarkNet53 network [23] and an SPPF layer. The neck utilizes a PANet [24] structure, and the head is a YOLO detection head that comprises a convolution layer and a prediction component. In YOLOv5, the C3 module is one of the most frequently applied modules. The structure of the C3 module, as shown in Figure 2, consists of three convolutional modules and a Bottleneck. The Bottleneck is a residual block that possesses faster computation speeds than the residual block of ResNet [25]. Furthermore, it enables a deeper network architecture while reducing computational parameters.
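For concreteness, the following PyTorch sketch shows a C3-style module as described above, with Conv blocks built from Conv2d, BatchNorm, and SiLU as in YOLOv5. It is a minimal illustration rather than the exact Ultralytics implementation; the channel widths and the number of Bottlenecks are assumptions.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    # YOLOv5-style convolution block: Conv2d + BatchNorm + SiLU.
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class Bottleneck(nn.Module):
    # Residual block: 1x1 conv followed by 3x3 conv, with a shortcut.
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, 1)
        self.cv2 = ConvBNSiLU(c, c, 3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C3(nn.Module):
    # C3: two parallel 1x1 convs, Bottlenecks on one branch,
    # concatenation, then a final 1x1 conv.
    def __init__(self, c_in, c_out, n=1):
        super().__init__()
        c_hidden = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, c_hidden, 1)
        self.cv2 = ConvBNSiLU(c_in, c_hidden, 1)
        self.m = nn.Sequential(*(Bottleneck(c_hidden) for _ in range(n)))
        self.cv3 = ConvBNSiLU(2 * c_hidden, c_out, 1)

    def forward(self, x):
        return self.cv3(torch.cat((self.m(self.cv1(x)), self.cv2(x)), dim=1))

y = C3(128, 128, n=3)(torch.randn(1, 128, 40, 40))  # output shape unchanged
```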
While YOLOv5 has demonstrated excellent performance across various vision tasks, its direct application to aircraft target detection in remote sensing images falls short of satisfactory outcomes. Thus, this paper introduces several improvements to enhance its performance in this domain.

2.2. Transformer

In recent years, Transformer [26] has achieved significant success in natural language processing (NLP). Because the limited size of the convolutional kernel restricts convolutions to local representations, researchers have sought to extend Transformer to computer vision. To this end, Dosovitskiy et al. proposed the Vision Transformer (ViT) [27], which leverages Multi-head Self-Attention (MSA) to capture long-range feature dependencies.
The details of the ViT methodology can be succinctly summarized as follows. Firstly, a two-dimensional image is converted into several one-dimensional sequences. Location encoding is then incorporated to provide information on the image’s spatial position. Subsequently, the sequences, with learnable location encoding, are passed through the Transformer encoder, which calculates global attention and extracts features via the multi-headed attention module. Lastly, the MLP layer yields the prediction categories.
Several researchers have already integrated Transformer with YOLOv5. For example, in the detection of targets during UAV shooting scenes, Zhu replaced the Bottleneck in the C3 structure of the original YOLOv5 with the Transformer Block to create the C3TR module [28]. Figure 3 displays the structure of the C3TR module. Transformer’s unique properties enable the C3TR module to capture global information and abundant contextual information from features.
Target detection in remote sensing images presents challenges that differ from those of drone-captured scenes, including larger shooting distances, smaller objects, and a single, nearly vertical viewing angle of aircraft targets. Given these difficulties, it is crucial to explore alternative ways of integrating Transformer to address these complexities.

2.3. ConvNext

In the realm of computer vision, ViT has swiftly replaced convolutional networks as the state-of-the-art approach for image classification models. On the other hand, FAIR’s ConvNext [29], which relies entirely on standard convolutional networks, offers comparable accuracy and generalizability to Transformer.
ConvNext does not introduce significant innovations to the overall network architecture or construction ideas. Instead, it makes some modifications to the existing ResNet network by incorporating some advanced concepts of Transformer. These changes aim to combine the advantages of both convolutional neural networks (CNNs) and Transformer networks, which ultimately leads to improved CNN performance.
In contrast to Transformer, ConvNext, built using convolutional networks, exhibits a greater capacity to capture local information. This ability plays a pivotal role in detecting high-resolution remote sensing images. The present study proposes a novel joint design that integrates the strengths of both Transformers and ConvNext to improve detection performance.

3. Theoretical Model

To address the challenges associated with detecting aircraft targets in remote sensing images, we developed CNTR-YOLO based on YOLOv5. In this section, we first present the architecture of CNTR-YOLO. Subsequently, we elaborate on the critical components of CNTR-YOLO, including the C3CNTR module, Dense module, and CBAM attention module.

3.1. Overview of CNTR-YOLO

The architecture of the proposed CNTR-YOLO is shown in Figure 4. Compared with YOLOv5, CNTR-YOLO differs in seven places: the C3 module at the end of the backbone is replaced with a Dense module, a CBAM attention module is inserted after each of the first three C3 modules in the neck, and a C3CNTR module is inserted before each detection head.

3.2. C3-ConvNext-Transformer (C3CNTR) Module

To enhance YOLOv5's use of global and local information, we draw on the successful integration of Transformer into YOLOv5 through the C3TR module in reference [28]. Building on this experience, we combine ConvNext and Transformer to develop the C3CNTR module: ConvNext strengthens the use of local information, while Transformer strengthens the use of global information. Figure 5 illustrates the structure of the C3CNTR module.
Figure 4. The architecture of CNTR-YOLO.

3.2.1. Transformer Block

In the Transformer Block shown in Figure 6, we employ a classic Transformer Encoder architecture. In contrast to the standard convolutional network, this architecture utilizes certain special operations that will be elaborated on shortly.
1.
Flatten
A Flatten operation is located at the outset of the Transformer Encoder and converts two-dimensional feature maps into one-dimensional sequences. Given an input feature map $X \in \mathbb{R}^{H \times W \times C}$, it becomes $X \in \mathbb{R}^{N \times C}$ after Flatten, where $N = H \times W$.
2.
Multi-head attention
Multi-head attention is a global operation that allows the Transformer Encoder to discover correlations across the entire range of a feature. After Flatten and LayerNorm, the feature map is converted into $Q, K, V \in \mathbb{R}^{N \times C}$ through different linear mappings, which serve as the input to multi-head attention. Multi-head attention comprises several single-head attentions, each of which performs one operation on $Q, K, V$. The output of the $i$-th single-head attention is

$$\mathrm{Output}_i = S_i V_i$$

$$S_i = \mathrm{softmax}\left(Q_i K_i^{T}\right)$$

where $Q_i, K_i, V_i$ are the products of $Q, K, V$ with the $i$-th single-head attention's weight matrices, and $S_i \in \mathbb{R}^{N \times N}$ is the attention matrix, revealing the correlation between each element of the feature map and the other elements. $\mathrm{Output}_i$ is the feature that consolidates global information. After each single-head attention completes its operation, the outputs are unified via a concatenation layer. The final output is

$$\mathrm{Output}_{all} = \mathrm{Concat}(\mathrm{Output}_1, \ldots, \mathrm{Output}_h)$$

where $h$ is the number of attention heads.
3.
FFN
The output of multi-head attention passes to the FFN after LayerNorm. The FFN (Feed-Forward Network) consists of two fully connected layers, the first followed by ReLU activation, with Dropout between the two layers. The FFN computes

$$\mathrm{FFN}(x) = \max(0,\ xW_1 + b_1)W_2 + b_2$$

where $x$ is the sequence of input feature maps, $W_1$ and $b_1$ are the weights and biases of the first fully connected layer, and $W_2$ and $b_2$ are the weights and biases of the second fully connected layer.
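To make these operations concrete, here is a minimal PyTorch sketch of a Transformer-encoder-style block applied to a feature map: Flatten, LayerNorm, multi-head self-attention, and a two-layer FFN with ReLU and Dropout. The head count, dropout rate, and placement of the residual connections are assumptions where they are not specified above.

```python
import torch
import torch.nn as nn

class TransformerEncoderBlock(nn.Module):
    # Flattens a (B, C, H, W) feature map, applies multi-head self-attention
    # and a two-layer FFN with ReLU and Dropout, then restores the 2D shape.
    def __init__(self, c, num_heads=4, dropout=0.1):
        super().__init__()
        self.ln1 = nn.LayerNorm(c)
        self.attn = nn.MultiheadAttention(c, num_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(c)
        self.ffn = nn.Sequential(
            nn.Linear(c, 4 * c),      # inverted bottleneck: hidden dim is 4x
            nn.ReLU(),
            nn.Dropout(dropout),
            nn.Linear(4 * c, c),
        )

    def forward(self, x):
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)      # (B, N, C) with N = H * W
        q = k = v = self.ln1(seq)
        attn_out, _ = self.attn(q, k, v)        # multi-head self-attention
        seq = seq + attn_out                    # residual connection
        seq = seq + self.ffn(self.ln2(seq))     # FFN with residual connection
        return seq.transpose(1, 2).reshape(b, c, h, w)

y = TransformerEncoderBlock(256)(torch.randn(1, 256, 20, 20))  # shape preserved
```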

3.2.2. ConvNext Block

The ConvNext Block’s structure is shown in Figure 7, which adopts the standard ConvNext network structure.
While ConvNext is essentially a convolutional network, its design shares several similarities with Transformer, which are elaborated below.
1.
DW Conv
A group convolution applies multiple groups of convolutional filters. DWConv (depthwise convolution) is a special group convolution in which the number of groups equals the number of channels. Similar to multi-head attention in Transformer, depthwise convolution plays a pivotal role in ConvNext's architecture. Akin to the weighted-sum operation in multi-head attention, depthwise convolution operates on a channel-by-channel basis, mixing information only in the spatial dimension. The combination of depthwise convolution and 1 × 1 convolution separates the spatial and channel dimensions of the feature maps: each operation mixes information either across the spatial dimension or across the channel dimension, but not both, which is analogous to Transformer. Because ConvNext is built purely from convolutions, it lacks the global receptive field of Transformer; to compensate for this limitation, ConvNext uses 7 × 7 kernels in its depthwise convolutions.
2.
Inverted Bottleneck
The ConvNext Block ends with an inverted bottleneck, a design element also found in Transformer. In the Transformer Encoder, a key design choice is the inverted bottleneck at the end, where the hidden dimension of the two fully connected layers in the FFN is four times the input dimension. Following the advent of Transformer, various cutting-edge convolutional networks adopted the inverted bottleneck design, such as MobileNetV2 [30]. In a similar way, ConvNext builds the inverted bottleneck at the end of the block with two 1 × 1 convolutions, whose role is equivalent to that of fully connected layers: the first 1 × 1 convolution expands the number of channels fourfold, and the second restores the original number of channels. The authors of ConvNext validated that this design enhances network performance across multiple tasks, including classification and detection.
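The following PyTorch sketch illustrates the ConvNext Block layout described above: a 7 × 7 depthwise convolution, LayerNorm, a 1 × 1 convolution (written as a linear layer in the channel-last view) that expands the channels fourfold, GELU, and a second 1 × 1 convolution that restores the channel count, wrapped in a residual connection. Layer scale and stochastic depth from the original ConvNext are omitted for brevity.

```python
import torch
import torch.nn as nn

class ConvNextBlock(nn.Module):
    # Standard ConvNext block layout: 7x7 depthwise conv -> LayerNorm ->
    # 1x1 conv expanding channels 4x -> GELU -> 1x1 conv restoring channels,
    # all wrapped in a residual connection.
    def __init__(self, c):
        super().__init__()
        self.dwconv = nn.Conv2d(c, c, kernel_size=7, padding=3, groups=c)
        self.norm = nn.LayerNorm(c)          # applied over the channel dimension
        self.pwconv1 = nn.Linear(c, 4 * c)   # 1x1 conv as a linear layer
        self.act = nn.GELU()
        self.pwconv2 = nn.Linear(4 * c, c)

    def forward(self, x):
        shortcut = x
        x = self.dwconv(x)
        x = x.permute(0, 2, 3, 1)            # (B, C, H, W) -> (B, H, W, C)
        x = self.pwconv2(self.act(self.pwconv1(self.norm(x))))
        x = x.permute(0, 3, 1, 2)            # back to (B, C, H, W)
        return shortcut + x

y = ConvNextBlock(256)(torch.randn(1, 256, 20, 20))  # output shape unchanged
```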

3.3. Dense Module

Toward the end of the feature extraction network, we exchanged a C3 module for a Dense module, aiming to heighten the network’s efficiency in utilizing feature information. The Dense module follows the structure of C3, as depicted in Figure 8, and it contains the architecture of DenseNet [31], which is delineated in Figure 9.
DenseNet melds concepts from ResNet and the Inception networks [32] and offers four fundamental benefits: it retains low-dimensional features, enhances feature reuse, mitigates the vanishing gradient problem, and considerably reduces the number of parameters. Its architecture principally consists of DenseNet Blocks and Transition Blocks; we use two DenseNet Blocks and one Transition Block. A DenseNet Block with $N$ convolution layers has $N(N+1)/2$ connections, with each layer's input derived from the outputs of all previous layers, in stark contrast to the $N$ connections of a traditional convolutional neural network with $N$ layers. This connection pattern makes better use of features and avoids learning a large amount of irrelevant feature information, thereby preventing gradient explosion and diminishing the likelihood of overfitting, so that stronger feature extraction is achieved while reducing computation and the number of parameters. Assuming $N$ convolution layers exist in a Dense Block, the output of the n-th layer is expressed as follows:
$$x_n = f_n([x_1, x_2, \ldots, x_{n-1}])$$
where $f_n$ represents the nonlinear operation at the $n$-th layer, and $[x_1, x_2, \ldots, x_{n-1}]$ denotes the concatenation of all outputs before the $n$-th layer. Concatenation differs from a residual connection: a residual connection simply adds the values of two features, whereas concatenation increases the number of channels and therefore preserves the previous feature information in its entirety. To keep the number of input channels consistent across DenseNet Blocks, a Transition Block restores the number of channels in the output of the previous DenseNet Block.
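A minimal PyTorch sketch of this dense connectivity is given below; the growth rate, layer composition, and number of layers are illustrative assumptions rather than the exact configuration of the Dense module.

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    # One layer of a Dense Block: BN -> ReLU -> 3x3 conv producing `growth` channels.
    def __init__(self, c_in, growth):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(c_in), nn.ReLU(),
            nn.Conv2d(c_in, growth, 3, padding=1, bias=False),
        )

    def forward(self, x):
        return self.body(x)

class DenseBlock(nn.Module):
    # Each layer receives the concatenation of all previous outputs:
    # x_n = f_n([x_1, ..., x_{n-1}]).
    def __init__(self, c_in, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList(
            DenseLayer(c_in + i * growth, growth) for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

class TransitionBlock(nn.Module):
    # Restores the channel count with a 1x1 conv so the next Dense Block
    # receives a fixed number of input channels.
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, 1, bias=False)

    def forward(self, x):
        return self.conv(x)

# A DenseBlock with c_in=256, growth=32 and 4 layers outputs 256 + 4*32 = 384
# channels, which TransitionBlock(384, 256) maps back to 256.
out = TransitionBlock(384, 256)(DenseBlock(256)(torch.randn(1, 256, 20, 20)))
```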

3.4. CBAM

The Convolutional Block Attention Module (CBAM) [33] comprises two sub-modules: the Channel Attention Module (CAM) and the Spatial Attention Module (SAM). Through its attention mechanism, CBAM adjusts features along both the channel and spatial dimensions, enabling the network to capture a more comprehensive range of the information contained in the feature map. Figure 10 depicts the structure of CBAM.
The input feature map first passes through the CAM, which begins with a global max pooling layer and a global average pooling layer. These pool the feature map over its height and width to obtain two 1 × 1 × C feature maps (C is the number of channels), which are then fed into a two-layer MLP shared by the two inputs. The MLP outputs are summed element-wise, and a sigmoid activation generates the channel attention feature. This channel attention feature is multiplied element-wise with the input feature map to produce the input of SAM.
In SAM, the feature map from CAM first undergoes channel-wise global max pooling and global average pooling to obtain two H × W × 1 feature maps, where H and W are the height and width of the feature map. The two maps are concatenated along the channel dimension, doubling the number of channels. A convolutional layer then reduces the number of channels, and a sigmoid activation generates the spatial attention feature. Finally, the spatial attention feature is multiplied element-wise with the input of SAM to obtain the final feature produced by CBAM.
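The following PyTorch sketch follows the CAM and SAM descriptions above; the reduction ratio and the 7 × 7 kernel of the spatial convolution are common CBAM defaults and are assumptions here.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    # CAM: global max/avg pooling over H x W, a shared two-layer MLP,
    # element-wise sum, then sigmoid to produce a per-channel weight.
    def __init__(self, c, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(c, c // reduction), nn.ReLU(), nn.Linear(c // reduction, c)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    # SAM: channel-wise max/avg pooling, concat to 2 channels,
    # then a conv + sigmoid produces an H x W attention map.
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    def __init__(self, c):
        super().__init__()
        self.ca = ChannelAttention(c)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)        # refine channels first
        return x * self.sa(x)     # then refine spatial locations

y = CBAM(256)(torch.randn(1, 256, 40, 40))  # output shape unchanged
```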

4. Experiments

In this section, we first introduce the datasets used in the experiments, namely the MAR20 and DOTA datasets. We then explain the evaluation metrics and implementation details. The experiments comprise a comparison of CNTR-YOLO with other algorithms and an ablation study.

4.1. Dataset

The MAR20 dataset [34], presently the largest dataset for remote sensing military aircraft target recognition, is used in this paper to validate the proposed algorithm's performance. The dataset contains 3842 images and 22,341 instances, mostly 800 × 800 pixels, gathered from 60 military airports across the United States, Russia, and other countries using Google Earth. The MAR20 dataset includes 20 aircraft models, six of which are the Russian SU-35 fighter, TU-160 bomber, TU-22 bomber, TU-95 bomber, SU-34 fighter-bomber, and SU-24 fighter-bomber. The remaining 14 are the U.S. C-130 transport, C-17 transport, C-5 transport, F-16 fighter, E-3 AWACS, B-52 bomber, P-3C ASW aircraft, B-1B bomber, E-8 joint battlefield surveillance aircraft, F-15 fighter, KC-135 aerial refueling aircraft, F-22 fighter, F/A-18 combat attack aircraft, and KC-10 aerial refueling aircraft. These aircraft types are denoted A1 to A20. The dataset is split into a training set of 1331 images with 7870 instances and a testing set of 2511 images with 14,471 instances, as used in this paper's experiments.
The DOTA dataset [35] is a large remote sensing image dataset consisting of 2806 high-resolution images obtained from Google Earth and multiple satellite sensors, with image sizes ranging from 800 × 800 to 4000 × 4000 pixels. Compared with the MAR20 dataset, DOTA covers a more comprehensive range of object categories: Plane, Baseball diamond, Bridge, Ground track field, Small vehicle, Large vehicle, Ship, Tennis court, Basketball court, Storage tank, Soccer ball field, Roundabout, Harbor, Swimming pool, and Helicopter. Because the DOTA images are too large to be fed directly into the network, we divided them into sub-images of 608 × 608 pixels at intervals of 100 pixels. The sub-images were randomly split in an 8:1:1 ratio into the training, validation, and testing sets.
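As an illustration of this preprocessing step, the sketch below crops a large image into 608 × 608 tiles, interpreting "intervals of 100 pixels" as a 100-pixel overlap between adjacent tiles; the function name and output naming scheme are hypothetical, and annotation clipping is omitted.

```python
from pathlib import Path
from PIL import Image

def tile_image(src, dst_dir, tile=608, overlap=100):
    """Split a large image into tile x tile crops with the given overlap.
    Tiles flush with the right/bottom edge and annotation clipping are
    omitted here for brevity."""
    img = Image.open(src)
    w, h = img.size
    stride = tile - overlap
    dst_dir = Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    for top in range(0, max(h - tile, 0) + 1, stride):
        for left in range(0, max(w - tile, 0) + 1, stride):
            crop = img.crop((left, top, left + tile, top + tile))
            crop.save(dst_dir / f"{Path(src).stem}_{left}_{top}.png")
```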

4.2. Evaluation Metrics

We adopt commonly used evaluation metrics, namely P (precision), R (recall), mAP (mean average precision), and mAP 0.5 (mean average precision at IOU = 0.5) in the experiments. Specifically, the expressions for P and R are defined as follows:
$$P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}$$

Here, $TP$ is the number of positive samples that are correctly identified, $FP$ is the number of negative samples that are identified as positive, and $FN$ is the number of positive samples that are identified as negative. Based on P and R, we compute AP (average precision), mAP, and mAP 0.5 as follows:

$$\mathrm{AP} = \int_{0}^{1} P(R)\, dR$$

$$\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i, \quad \mathrm{IOU} = 0.5{:}0.05{:}0.95$$

$$\mathrm{mAP}_{0.5} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{AP}_i, \quad \mathrm{IOU} = 0.5$$

where $N$ is the number of target classes.
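The sketch below shows one common way to compute AP as the area under the precision-recall curve and to average it into mAP and mAP 0.5; it uses all-point interpolation and is an illustrative implementation, not the exact evaluation code used in the experiments.

```python
import numpy as np

def average_precision(precision, recall):
    """Area under the precision-recall curve (all-point interpolation)."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # make precision non-increasing
    idx = np.where(r[1:] != r[:-1])[0]
    return np.sum((r[idx + 1] - r[idx]) * p[idx + 1])

def mean_ap(ap_per_class_per_iou):
    """ap_per_class_per_iou: array of shape (num_classes, num_iou_thresholds),
    with thresholds 0.5:0.05:0.95 in column order. Returns (mAP, mAP@0.5):
    mAP averages over both classes and thresholds, mAP@0.5 uses only the
    first threshold column."""
    ap = np.asarray(ap_per_class_per_iou)
    return ap.mean(), ap[:, 0].mean()
```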

4.3. Implementation Details

The implementation of CNTR-YOLO uses PyTorch (v1.8.0) as the underlying framework, and the operating system is Ubuntu 20.04. An NVIDIA RTX 3060 GPU with 12 GB of memory served as the platform for training and testing. During training, an SGD optimizer was used with the momentum and weight decay set to 0.937 and 0.01, respectively. A warmup strategy was employed to stabilize training: the learning rate started at 0.01, was gradually reduced over the first three epochs, and training then continued at 0.001. The images were resized to 640 × 640 pixels, and given the hardware limitations, the batch size was set to 2.
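As an illustration, the sketch below configures an SGD optimizer with the momentum and weight decay given above and one plausible reading of the learning rate schedule (a gradual decrease from 0.01 to 0.001 over the first three epochs, then 0.001 thereafter); the model placeholder and the epoch count are assumptions.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)   # stand-in for the detector
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.01)

def lr_factor(epoch, warmup_epochs=3, start=0.01, end=0.001):
    # Linearly decay from `start` to `end` over the warmup epochs, then hold.
    if epoch < warmup_epochs:
        lr = start + (end - start) * epoch / warmup_epochs
    else:
        lr = end
    return lr / start   # LambdaLR multiplies the base lr (0.01) by this factor

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)

for epoch in range(100):
    # ... one training epoch over 640 x 640 images with batch size 2 ...
    scheduler.step()
```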
The other models, including Faster R-CNN, YOLOv4, YOLOv5m, YOLOv5l, and YOLOv5x, were tested and trained under the same settings as CNTR-YOLO, with images also resized to 640 × 640 pixels during training. Notably, we adopted the default settings of each model’s referenced research articles concerning other parameters.

4.4. Experimental Results

In line with the implementation settings in Section 4.3, we evaluate CNTR-YOLO on P, R, mAP, mAP 0.5 , and Latency. To show the advantages of the proposed algorithm, we compare it with Faster R-CNN, YOLOv4, YOLOv5m, YOLOv5l, and YOLOv5x. We first present experimental results on the MAR20 dataset, and then, to demonstrate the robustness of the proposed algorithm, we also show experimental results on the DOTA dataset.

4.4.1. Experimental Results on the MAR20 Dataset

The overall comparison results are shown in Table 1. The comparison results of different categories are shown in Table 2.
Table 1 presents the comparative results of six target detection algorithms using different metrics. CNTR-YOLO outperforms the others in terms of P, R, mAP 0.5 , and mAP. Specifically, CNTR-YOLO attains mAP 0.5 and mAP scores of 91.1% and 70.1%, respectively, which are 1.4% and 2.1% higher than YOLOv5x, and 2.6% and 3.3% higher than YOLOv5l. In addition, CNTR-YOLO’s mAP is 13.0% and 5.8% higher when compared against other non-YOLOv5 series algorithms, Faster R-CNN and YOLOv4, respectively. Notably, the proposed algorithm distinguishes different types of aircraft features with remarkable accuracy, achieving a recall rate of 87.5%, which is 4.1% and 1.6% higher than YOLOv5l and YOLOv5x, respectively. This ability significantly reduces recognition errors compared to other algorithms. Despite a 14.2 ms higher inference time than YOLOv5l, CNTR-YOLO’s improved detection performance still ensures that it is 4.0 ms faster than YOLOv5x.
Table 2 illustrates the mean average precision of the six methods across the twenty categories of the MAR20 dataset. Overall, CNTR-YOLO outperforms the other five algorithms in most categories; it is inferior to YOLOv5l or YOLOv5x in only three categories, and the gaps are all within 2%. Notably, in category A14, where every method has its lowest mAP, CNTR-YOLO surpasses YOLOv5l and YOLOv5x by 9.7% and 5.3%, respectively. Additionally, CNTR-YOLO holds its largest advantage in category A16, exceeding YOLOv5l and YOLOv5x by 11.2% and 5.3%, respectively. A comparison of the detection results of CNTR-YOLO and YOLOv5l on the same image is shown in Figure 11. CNTR-YOLO correctly identifies all instances, whereas YOLOv5l misidentifies an A16 aircraft in the bottom right corner as A18. These two categories are visually similar from the vertical perspective of remote sensing satellites, but CNTR-YOLO, with its stronger ability to discriminate details, identifies them correctly.

4.4.2. Experimental Results on the DOTA Dataset

Similarly, we show the general comparison results on the DOTA dataset in Table 3 and then show the comparison results on the specific categories in Table 4.
From Table 3, it can be observed that on the DOTA dataset, the proposed algorithm yields superior mAP 0.5 and mAP compared to other algorithms. This indicates the robustness of the proposed algorithm. When compared to YOLOv5l, CNTR-YOLO achieves 2.8% and 2.5% higher mAP 0.5 and mAP, respectively. When compared to YOLOv5x, CNTR-YOLO achieves 1.6% and 1.7% higher mAP 0.5 and mAP, respectively. In comparison to other algorithms, CNTR-YOLO outperforms them to a greater extent. The inference time results are very similar to those shown in Table 1, which is reasonable.
From Table 4, we can find that CNTR-YOLO outperforms other algorithms in most categories, which indicates that the proposed algorithm has a certain universality in the target detection of remote sensing images. Specifically, in the category of Plane, which represents the Aircraft considered in this paper, CNTR-YOLO yields mAP values of 83.2% that are 2.7% and 3.5% higher than those of YOLOv5x and YOLOv5l, respectively. This indicates that the proposed algorithm has superior performance in Aircraft detection compared to other algorithms on the DOTA dataset.

4.5. Ablation Study

The improvements in CNTR-YOLO include substituting a C3 module with a Dense module, applying the CBAM attention module, and introducing the C3CNTR module. These measures provide different levels of enhancement to YOLOv5l, which we evaluate in this section. Although adding a small-scale detection head is common in YOLO-based object detection studies such as TPH-YOLOv5, this approach is not used in this paper; the reason is explained below. Furthermore, since C3CNTR is an improvement of C3TR, we also inspect the effect of C3TR on the network (at the same position where C3CNTR is implemented) to differentiate the performance of the two. The experimental results are displayed in Table 5, where "tiny head" denotes the small target detection head. Note that, to save table space, the suffixes "module" and "attention module" are omitted in the tables.
Table 5 indicates that adding a small-scale target detection head reduces all metrics. This outcome is due to the majority of instances in the MAR20 dataset not being smaller than 32 × 32 pixels. Consequently, this approach is not employed in this paper. After incorporating the Dense module, all the performance metrics improved noticeably, and the mAP rose by 1.2% compared to YOLOv5l. Following the integration of the CBAM attention module, there were slight enhancements in all measures, resulting in a 0.2% increase in the mAP. In addition to these enhancements, the introduction of C3TR and C3CNTR produced different outcomes. While C3TR produced an increase of 1.1% in the mAP, C3CNTR resulted in a 1.9% increase, indicating that C3CNTR outperforms C3TR. Finally, after implementing all the improvements, CNTR-YOLO experiences a 3.3% enhancement in the mAP compared to YOLOv5l.
Regarding the use of attention mechanisms, several alternatives to CBAM were investigated, including Coordinate Attention (CA), Squeeze-and-Excitation Attention (SE), Normalization-based Attention (NAM), and Efficient Channel Attention (ECA); however, none of them achieved the anticipated outcome. After conducting experiments, we present the comparison results of the CBAM attention module and the aforementioned four alternatives on the MAR20 dataset in Table 6. It is worth noting that the experiments were based on the YOLOv5l+Dense module and represented by YOLOv5l*.
Table 6 reveals that CA and SE did not enhance the network’s performance; instead, they caused a decline of 0.3% and 0.2% on mAP, respectively. NAM maintained the same level of performance, while ECA and CBAM elevated mAP by 0.1% and 0.2%, respectively.

5. Conclusions

In this paper, we propose the CNTR-YOLO algorithm for detecting aircraft targets in remote sensing images by improving the existing YOLOv5 algorithm. Our work includes the first attempt to combine a convolutional network and Transformer into a new module in YOLOv5, and it validates several measures that help YOLOv5 achieve better performance in aircraft detection. Specifically, the proposed C3CNTR module absorbs the local observation capability of ConvNext and the global analysis capability of Transformer, contributing more to detection accuracy than the C3TR module, which uses only Transformer. During the feature extraction stage, the Dense module significantly improves the network's exploitation of features through multiple connections between convolutional layers, while also avoiding the vanishing gradient problem. Finally, we integrate the CBAM attention module to reduce interference from background information, allowing the network to focus on valuable areas and further improving detection accuracy. The mAP of the proposed CNTR-YOLO is 3.3% higher than that of YOLOv5l on the MAR20 dataset and exceeds other comparative methods, such as Faster R-CNN and YOLOv4. On the DOTA dataset, the mAP of CNTR-YOLO reaches 63.7%, also surpassing the other compared methods. In particular, for the Plane category (which corresponds to the aircraft considered in this paper), CNTR-YOLO achieves an mAP of 83.2%, which is 3.5% higher than YOLOv5l. This also reflects the robustness of the proposed algorithm.

Author Contributions

Writing—Original draft preparation and software, F.Z. and Q.X.; Conceptualization and methodology, F.Z. and X.L.; Writing—Review and editing, F.Z. and H.D.; Resources, H.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Acknowledgments

We are grateful to the High-Performance Computing Center of Central South University for the assistance with the computations.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Blaschke, T. Object based image analysis for remote sensing. ISPRS J. Photogramm. Remote Sens. 2010, 65, 2–16. [Google Scholar] [CrossRef] [Green Version]
  2. Cheng, G.; Han, J. A Survey on Object Detection in Optical Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2016, 117, 11–28. [Google Scholar] [CrossRef] [Green Version]
  3. Zhang, L.B.; Zhang, Y.Y. Airport Detection and Aircraft Recognition Based on Two-Layer Saliency Model in High Spatial Resolution Remote-Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2017, 10, 1511–1524. [Google Scholar] [CrossRef]
  4. Zuo, J.; Xu, G.; Fu, K.; Sun, X.; Sun, H. Aircraft Type Recognition Based on Segmentation with Deep Convolutional Neural Networks. IEEE Geosci. Remote Sens. Lett. 2018, 15, 282–286. [Google Scholar] [CrossRef]
  5. Zhao, Z.; Zheng, P.; Xu, S.; Wu, X. Object Detection With Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  6. Girshick, R.B.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the 2014 IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  7. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Berlin, Germany, 11–14 March 2015; pp. 1440–1448. [Google Scholar]
  8. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  9. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162. [Google Scholar]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the Computer Vision—ECCV 2016, Pt. I, Amsterdam, The Netherlands, 11–14 October 2016; Leibe, B., Matas, J., Sebe, N., Welling, M., Eds.; Springer: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  12. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal Loss for Dense Object Detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  13. Tian, Z.; Shen, C.; Chen, H.; He, T. Fcos: Fully convolutional one-stage object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9627–9636. [Google Scholar]
  14. Liu, Q.; Xiang, X.; Wang, Y.; Luo, Z.; Fang, F. Aircraft detection in remote sensing image based on corner clustering and deep learning. Eng. Appl. Artif. Intell. 2019, 87, 103333. [Google Scholar] [CrossRef]
  15. Shi, L.; Tang, Z.; Wang, T.; Xu, X.; Liu, J.; Zhang, J. Aircraft detection in remote sensing images based on deconvolution and position attention. Int. J. Remote Sens. 2021, 42, 4241–4260. [Google Scholar] [CrossRef]
  16. Wu, Q.; Feng, D.; Cao, C.; Zeng, X.; Feng, Z.; Wu, J.; Huang, Z. Improved Mask R-CNN for Aircraft Detection in Remote Sensing Images. Sensors 2021, 21, 2618. [Google Scholar] [CrossRef] [PubMed]
  17. Ji, F.; Ming, D.; Zeng, B.; Yu, J.; Qing, Y.; Du, T.; Zhang, X. Aircraft detection in high spatial resolution remote sensing images combining multi-angle features driven and majority voting CNN. Remote Sens. 2021, 13, 2207. [Google Scholar] [CrossRef]
  18. Cao, C.; Wu, J.; Zeng, X.; Feng, Z.; Wang, T.; Yan, X.; Wu, Z.; Wu, Q.; Huang, Z. Research on Airplane and Ship Detection of Aerial Remote Sensing Images Based on Convolutional Neural Network. Sensors 2020, 20, 4696. [Google Scholar] [CrossRef] [PubMed]
  19. Zhou, L.; Yan, H.; Shan, Y.; Zheng, C.; Liu, Y.; Zuo, X.; Qiao, B. Aircraft detection for remote sensing images based on deep convolutional neural networks. J. Electr. Comput. Eng. 2021, 2021, 1–16. [Google Scholar] [CrossRef]
  20. Luo, S.; Yu, J.; Xi, Y.; Liao, X. Aircraft target detection in remote sensing images based on improved YOLOv5. IEEE Access 2022, 10, 5184–5192. [Google Scholar] [CrossRef]
  21. Liu, Z.; Gao, Y.; Du, Q.; Chen, M.; Lv, W. YOLO-Extract: Improved YOLOv5 for Aircraft Object Detection in Remote Sensing Images. IEEE Access 2023, 11, 1742–1751. [Google Scholar] [CrossRef]
  22. Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv 2018, arXiv:1804.02767v1. [Google Scholar]
  23. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934v1. [Google Scholar]
  24. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path aggregation network for instance segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8759–8768. [Google Scholar]
  25. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  26. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst 2017, 30, 1–11. [Google Scholar]
  27. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  28. Zhu, X.; Lyu, S.; Wang, X.; Zhao, Q. TPH-YOLOv5: Improved YOLOv5 Based on Transformer Prediction Head for Object Detection on Drone-captured Scenarios. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 2778–2788. [Google Scholar]
  29. Liu, Z.; Mao, H.; Wu, C.Y.; Feichtenhofer, C.; Darrell, T.; Xie, S. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 11976–11986. [Google Scholar]
  30. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  31. Huang, G.; Liu, Z.; Van Der Maaten, L.; Weinberger, K.Q. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4700–4708. [Google Scholar]
  32. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 2818–2826. [Google Scholar]
  33. Woo, S.; Park, J.; Lee, J.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  34. Yu, W.Q.; Cheng, G.; Wang, M.J.; Yao, Y.Q.; Xie, X.X.; Yao, X.W.; Han, J.W. MAR20: A Benchmark for Military Aircraft Recognition in Remote Sensing Images. Natl. Remote Sens. Bull. 2022, 1–11. [Google Scholar] [CrossRef]
  35. Xia, G.S.; Bai, X.; Ding, J.; Zhu, Z.; Belongie, S.; Luo, J.; Datcu, M.; Pelillo, M.; Zhang, L. Dota: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 3974–3983. [Google Scholar]
Figure 1. The architecture of YOLOv5.
Figure 2. The structure of C3 module.
Figure 3. The structure of the C3TR module; Tr Block stands for Transformer Block.
Figure 5. The structure of C3CNTR; CN Block stands for ConvNext Block.
Figure 6. The structure of Transformer Block.
Figure 7. The structure of ConvNext Block.
Figure 8. The structure of the Dense module.
Figure 9. The structure of DenseNet.
Figure 10. The structure of CBAM attention module.
Figure 11. The detection results on one image of the test set of the MAR20 dataset: (a) detection result of YOLOv5l; (b) detection result of CNTR-YOLO.
Table 1. Comparison results of CNTR-YOLO and other algorithms on the MAR20 dataset.

Method         P (%)   R (%)   mAP 0.5 (%)   mAP (%)   Latency (ms)
Faster R-CNN   77.3    73.6    82.7          57.1      83.6
YOLOv4         83.3    79.5    86.6          64.3      12.8
YOLOv5m        85.7    80.3    87.6          65.7      11.0
YOLOv5l        85.2    83.4    88.5          66.8      19.3
YOLOv5x        86.6    85.9    89.7          68.0      37.5
CNTR-YOLO      88.9    87.5    91.1          70.1      33.5
Table 2. Comparison results of CNTR-YOLO and other algorithms on various categories of the MAR20 dataset (mAP).

Class   Faster R-CNN   YOLOv4   YOLOv5m   YOLOv5l   YOLOv5x   CNTR-YOLO
A1      63.7           67.5     70.9      73.1      72.8      74.0
A2      67.5           75.3     75.7      77.9      77.6      80.9
A3      70.9           76.0     77.3      78.6      78.2      81.9
A4      66.8           71.9     73.4      76.3      75.2      75.3
A5      65.5           68.7     72.0      72.1      73.1      73.8
A6      55.8           68.5     70.8      71.6      69.4      74.7
A7      45.6           55.0     58.7      61.3      62.8      68.8
A8      59.7           67.0     67.8      70.5      70.4      72.2
A9      62.8           68.3     69.4      69.6      70.5      71.8
A10     50.1           59.3     59.2      62.5      64.6      66.5
A11     57.4           66.1     66.1      68.4      71.0      72.0
A12     40.1           43.2     45.2      48.0      48.7      46.7
A13     50.5           57.6     59.1      59.2      61.0      61.2
A14     26.6           32.0     31.6      33.0      37.4      42.7
A15     51.8           61.4     66.5      65.0      65.3      68.2
A16     58.9           64.1     67.2      63.3      69.2      74.5
A17     69.5           67.4     68.1      70.1      72.5      71.1
A18     65.9           71.5     71.4      73.0      72.7      74.0
A19     63.4           72.9     71.2      75.2      75.0      77.0
A20     63.5           71.6     72.5      73.1      73.7      74.8
Table 3. Comparison results of CNTR-YOLO and other algorithms on the DOTA dataset.

Method         P (%)   R (%)   mAP 0.5 (%)   mAP (%)   Latency (ms)
Faster R-CNN   75.9    71.8    75.8          53.7      84.5
YOLOv4         80.2    75.7    80.3          59.3      13.3
YOLOv5m        82.9    78.8    81.8          60.6      11.6
YOLOv5l        82.8    80.3    82.4          61.2      20.0
YOLOv5x        83.3    82.1    83.6          62.0      38.2
CNTR-YOLO      85.1    84.3    85.2          63.7      34.1
Table 4. Comparison results of CNTR-YOLO and other algorithms on various categories of the DOTA dataset (mAP).

Class                Faster R-CNN   YOLOv4   YOLOv5m   YOLOv5l   YOLOv5x   CNTR-YOLO
Plane                69.1           75.6     78.6      79.7      80.5      83.2
Baseball diamond     58.2           64.0     64.6      65.2      67.6      72.8
Bridge               33.6           37.3     38.8      39.1      39.5      40.1
Ground track field   55.5           63.7     64.7      65.0      67.1      66.9
Small vehicle        49.4           54.0     53.8      55.6      56.1      59.3
Large vehicle        66.9           72.8     73.8      74.9      76.8      76.5
Ship                 63.7           70.1     69.8      72.5      73.0      74.1
Tennis court         84.9           89.2     91.2      90.6      92.0      91.9
Basketball court     70.5           75.1     76.5      77.0      78.4      79.1
Storage tank         57.6           64.4     66.9      67.7      68.1      70.4
Soccer ball field    24.3           28.2     28.5      28.9      28.3      30.1
Roundabout           46.2           52.9     55.8      55.3      56.1      57.5
Harbor               60.7           65.9     66.5      67.5      67.4      68.7
Swimming pool        48.9           55.3     56.6      57.1      57.3      58.9
Helicopter           16.3           20.7     22.3      22.1      22.6      25.5
Table 5. Results achieved by YOLOv5 combining different modules on the MAR20 dataset; "tiny head" denotes the small target detection head and "Dense" denotes the Dense module.

Method                 P (%)   R (%)   mAP 0.5 (%)   mAP (%)
YOLOv5l                85.2    83.4    88.5          66.8
+tiny head             84.5    82.9    88.1          66.6
+Dense                 87.0    84.8    89.8          68.0
+Dense+CBAM            87.5    85.3    89.9          68.2
+Dense+CBAM+C3TR       88.1    86.9    90.6          69.3
+Dense+CBAM+C3CNTR     88.9    87.5    91.1          70.1
Table 6. Comparison results of different attention modules on the MAR20 dataset; YOLOv5l* denotes YOLOv5l with the Dense module.

Method      P (%)   R (%)   mAP 0.5 (%)   mAP (%)
YOLOv5l*    87.0    84.8    89.8          68.0
+CA         86.3    84.2    89.3          67.7
+SE         86.5    84.4    89.5          67.8
+NAM        87.0    84.6    89.7          68.0
+ECA        87.1    84.9    89.8          68.1
+CBAM       87.5    85.3    89.9          68.2
