Remote Sensing
  • Article
  • Open Access

8 July 2023

Few-Shot Object Detection in Remote Sensing Imagery via Fuse Context Dependencies and Global Features

State Key Laboratory of Information Engineering in Surveying Mapping and Remote Sensing, Wuhan University, Wuhan 430079, China
* Author to whom correspondence should be addressed.
This article belongs to the Section Remote Sensing Image Processing

Abstract

The rapid development of Earth observation technology has promoted the continuous accumulation of remote sensing images. However, a large number of these images still lack manual object annotations, which limits the use of strongly supervised deep learning object detection methods and leaves them with weak generalization to unseen object categories. Considering these problems, this study proposes a few-shot remote sensing image object detection method that integrates context dependencies and global features. Starting from a model trained on the base classes, the method fine-tunes with a small number of annotated samples to enhance the detection of novel object classes. The proposed method consists of three main modules: the meta-feature extractor (ME), the reweighting module (RM), and the feature fusion module (FFM). These three modules are respectively used to enhance the context dependencies of the query set features, improve the global features of the annotated support set, and finally fuse the query set features and the support set features. The meta-feature extractor of the entire framework uses an optimized YOLOv5 framework as its baseline. The reweighting module for support set feature extraction is based on a simple convolutional neural network (CNN) framework, and foreground feature enhancement of the support set is performed in the preprocessing stage. The proposed method achieved beneficial results on the two benchmark datasets NWPU VHR-10 and DIOR and, compared with the comparison methods, obtained the best performance in object detection of both the base classes and the novel classes.

1. Introduction

Object detection can help further decision-making analysis by obtaining the object location and category information of the region of interest (ROI) in an image [1,2,3,4]. Object detection is one of the important tasks in the understanding of satellite images, and it plays an indispensable role in military decision making, urban management, environmental monitoring, disaster detection, public safety, etc. [5,6,7,8,9]. Moreover, the results obtained by object detection techniques can also be used to track moving objects, which is of great significance for analyzing the state and movement patterns of the objects [10,11,12]. Currently, strongly supervised object detection in remote sensing images is a relatively mature field that has produced a large number of excellent research results. However, for some specific research fields, such as disaster data and military damage data, it is extremely difficult to collect enough data to build a sample library and obtain excellent results [13,14,15,16]. Therefore, how to perform object detection with limited remote sensing images and only a small number of manually annotated samples is a topic of great research significance.
Object detection technology in remote sensing images has become relatively mature after decades of development, progressing from early methods based on geometric principles, methods based on low-level image features (color, texture, etc.), and traditional machine learning methods to current deep learning methods [17,18,19]. These methods have gradually moved away from the complicated steps of manually designed features toward pipelines that require less manual assistance and are more intelligent. The current mainstream deep learning methods include (1) two-stage detection models, such as R-CNN [20], Fast R-CNN [21], and Faster R-CNN [22], which usually have high detection accuracy but slow speed, and (2) single-stage detection models, such as the YOLO series [23] and SSD [24], which are usually faster but relatively less accurate.
Additionally, some graph-based methods [25,26] are used to enhance the context information of remote sensing images, thereby improving object detection performance. Examples include research on building an embedded graph attention network using latent spaces and semantic relationships among objects [27], research on constructing semantic graphs that combine word embeddings of object category labels on graph nodes [28], and research that combines semantic graphs with a graph aggregation network to update the weights of object nodes and thereby strengthen attention to object rotation angle and scale [29].
These relatively mature models are deployed in application projects to solve practical problems and improve production efficiency. However, remote sensing images differ from general natural images in that they have more complex spectral features, scene features, multiscale features, etc. Therefore, the application of these strongly supervised network models to object detection tasks in remote sensing images still faces many challenges [30,31]. The major challenges currently facing remote sensing object detection are how to detect objects when only very few samples are available and how to adapt a trained model to infer new categories of objects.
To solve these practical problems, a few-shot remote sensing image object detection method was proposed. Few-shot object detection aims to extend recognition and detection to new categories by training on known category objects. It learns from the perspective of the model and absorbs the characteristics of the annotated base classes. Its essence is to realize the transferability of knowledge through fine-tuning and finally realize the detection of new categories. At present, few-shot object detection methods mainly include methods based on metric learning, methods based on meta-learning, methods based on data augmentation, and methods based on multimodal data [32,33,34,35]. The methods based on data augmentation and multimodal data are directly driven by data, but the generalization performance of these models is weak; thus, it is difficult for them to detect objects in other scenarios. The methods based on metric learning and meta-learning start from the perspective of model learning, learn and absorb the characteristics of the data to form a model, and then identify the objects. Such model-based methods often face problems such as incomplete features, as well as insufficient feature fusion and model learning ability, making the detection and recognition of the base and novel classes inaccurate.
To address the above problems, this study proposes a few-shot remote sensing image object detection approach that integrates the global features of the support set (S) and the enhanced context dependencies of the query set (Q). The method includes two parts: the base class training stage and the novel class fine-tuning stage. Moreover, our proposed network contains two important feature extraction modules and one feature fusion module: the meta-feature extractor (ME) for query set feature extraction, the reweighting module (RM) for support set feature extraction, and the feature fusion module (FFM) for fusing the two kinds of features. Among them, the ME uses the optimized YOLOv5 [36,37] framework as the backbone and focuses on the contextual dependencies of the query set objects through graph convolution. The RM is applied to obtain the global features of the support set, and the FFM then fuses them with the enhanced query set features obtained from the ME; the fused results are sent to the prediction layer of the model for prediction.
In summary, the main contributions of this study are as follows:
  • This study innovatively proposes a few-shot remote sensing image object detection method based on meta-learning, which integrates the global features of the S and the enhanced contextual dependencies of the Q, improving the final feature representation ability and object detection performance.
  • In the ME structure, a feature representation structure that takes into account the contextual long-distance dependencies of the Q was constructed, focusing on the regional similarity of the query features, and optimizing the encoding performance of the query features.
  • In the RM, the global feature pyramid extraction (GFPE) module is constructed to enhance the global feature representation of the S. Simultaneously, a new fusion module of query meta-features and support features is designed, which enhances the salient representation ability after feature fusion.

3. Methods

3.1. Overall Framework

Inspired by [70], this study proposes a meta-learning-based few-shot remote sensing image object detection framework with a similar structure, which fuses the global features of S and the enhanced context dependencies of Q. The whole framework includes three important modules (ME, RM, and FFM) and the final bounding box prediction layer, as shown in Figure 1. The proposed framework consists of two stages: base training and fine-tuning training. Base training is performed on the dataset of visible categories to learn meta-knowledge from them; the fine-tuning stage transfers the meta-knowledge learned in the base training stage to the training of unseen category datasets.
Figure 1. The framework of the proposed network. This network contains three important modules: meta-feature extractor, reweighting module, and feature fusion module. The meta-feature extractor uses optimized YOLOv5 as the backbone to extract the meta-features of the query set. The reweighting module uses a global feature pyramid extractor to extract multiscale global features of the support set, and the fusion features through the feature fusion module operation are used as the input (F0, F1, and F2) of the prediction layer, which is eventually used to detect the object.
The input of the proposed framework includes Q and S. The Q can be regarded as test samples for evaluating the performance of the task and does not have annotations; the S is used as training samples and includes the images and annotations of individual instances. Few-shot object detection is also called the N-way K-shot detection task, where N and K mean that the support set S contains N categories and each category has K instance annotations (a minimal sampling sketch is given below). In addition, the Q generally contains Nq images, and these images come from the same category set C as the S. In this study, we use the annotations in the support set S as an auxiliary for frequency-sensitive foreground enhancement (FE) processing to incorporate frequency knowledge into the feature learning of the support set S.
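As a concrete illustration of the N-way K-shot setup described above, the following minimal Python sketch assembles a support set S with K annotated instances for each of N sampled categories and a query set Q drawn from the same category set; the annotation index structure and the function name are illustrative assumptions, not part of the authors' implementation.

```python
# Minimal N-way K-shot episode sampling sketch (illustrative assumptions:
# annotations is a dict mapping category name -> list of annotated samples).
import random


def sample_episode(annotations: dict, n_way: int, k_shot: int, n_query: int):
    """Return (support, query) for one few-shot episode."""
    categories = random.sample(sorted(annotations), n_way)
    # Support set S: K annotated instances per sampled category.
    support = {c: random.sample(annotations[c], k_shot) for c in categories}
    # Query set Q: images from the same categories, used without annotations.
    query_pool = [s for c in categories for s in annotations[c] if s not in support[c]]
    query = random.sample(query_pool, min(n_query, len(query_pool)))
    return support, query


# Usage sketch: a 3-way 5-shot episode with 10 query images.
# support, query = sample_episode(ann_index, n_way=3, k_shot=5, n_query=10)
```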
Specifically, as shown in Figure 2, we first use the annotation information of the support set S to generate a mask (Figure 2b), and then use the mask to crop the annotated part of the support image (Figure 2a), forming an image that contains only the foreground object category; secondly, this study constructs a frequency extractor, as shown in Figure 2, to reduce the noise of the input image. The frequency extractor uses the discrete Fourier transform [78] to obtain the frequency representation of the two-dimensional (2D) image data f(w, h) of size W × H, as given by the formula below.
$F(k,l) = \sum_{w=0}^{W-1} \sum_{h=0}^{H-1} f(w,h)\, e^{-2\pi i \frac{kw}{W}} e^{-2\pi i \frac{lh}{H}},$
where k ∈ [0, W − 1] and l ∈ [0, H − 1] represent the two frequency-domain coordinates of the two-dimensional image, and i in the 2D Fourier transform is the imaginary unit.
Figure 2. Schematic diagram of foreground enhancement (FE) processing. (a) is the original image from the support set S, (b) is the generated mask, (c) is the image obtained by the frequency extractor, and (d) is the final output obtained by fusing (b) and (c). The mask of the support set S was used to crop the foreground (i.e., the ground track field), and the frequency extractor was applied to perform frequency transformation to remove some noise, enhancing the object feature expression.
To better represent the frequencies of different bands, the representation of Cartesian coordinates is first converted to polar coordinates (r and θ), as shown in Equation (2), where AI(r) represents the average intensity of the signal of the two-dimensional image in the radial distance. Meanwhile, to enhance the difference between the high-frequency bands of the image, we take the average intensity AI(r) of the signal as input, and this vector is input into a high-pass filter (Fhp), finally obtaining the one-dimensional spectral vector VI of input image I. The formula is shown in Equation (3).
$(r, \theta) = F(k,l):\ r = \sqrt{k^2 + l^2},\ \theta = \arctan\!\left(\frac{l}{k}\right), \qquad AI(r) = \frac{1}{2\pi} \int_{0}^{2\pi} F(r, \theta)\, d\theta .$
$V_I = F_{hp}\!\left(AI(r)\right), \qquad F_{hp}(x) = \begin{cases} x, & r > r_{\tau}, \\ 0, & \text{otherwise}, \end{cases}$
where $r_{\tau}$ is the threshold radius of the high-pass filter. $V_I \in \mathbb{R}^{Class \times 1}$ is reshaped to $\mathbb{R}^{Class \times 1 \times 1 \times 1}$ and then multiplied with the original support set feature $S_i \in \mathbb{R}^{Class \times 3 \times W \times H}$; the resulting feature map is shown in Figure 2c. Here, Class represents the number of visible categories contained in the support set S. $V_I$ makes it easy to distinguish whether there is noise in the high-frequency band of the image; thus, it can be used for noise reduction. Lastly, the object-only image obtained in the first step of Figure 2 is merged with the image shown in Figure 2c to obtain the final output image in Figure 2d. Figure 2d suppresses the background noise outside the object region, which benefits the enhancement of object features, and it is the final input of the RM.
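The FE preprocessing can be summarized in a few lines of code. The following NumPy sketch computes the 2D FFT of a support image (Equation (1)), the radially averaged intensity AI(r) (Equation (2)), and a high-pass-filtered spectral value (Equation (3)), and then fuses the masked foreground with a frequency-reweighted background; the threshold value, the scalar reweighting, and the function names are simplified, illustrative assumptions rather than the authors' exact implementation.

```python
# Frequency-based foreground enhancement sketch (Equations (1)-(3), simplified).
import numpy as np


def spectral_vector(image: np.ndarray, r_tau: float) -> np.ndarray:
    """image: 2D array (H, W). Returns the high-pass-filtered radial spectrum."""
    h, w = image.shape
    F = np.fft.fftshift(np.fft.fft2(image))                          # Equation (1)
    ky, kx = np.indices((h, w))
    r = np.sqrt((kx - w // 2) ** 2 + (ky - h // 2) ** 2).astype(int)
    counts = np.bincount(r.ravel())
    # Radial average intensity AI(r) (Equation (2)).
    ai = np.bincount(r.ravel(), weights=np.abs(F).ravel()) / np.maximum(counts, 1)
    radii = np.arange(ai.size)
    return np.where(radii > r_tau, ai, 0.0)                          # F_hp (Equation (3))


def foreground_enhance(image: np.ndarray, mask: np.ndarray, r_tau: float = 8.0):
    """Fuse the masked foreground (Figure 2b) with a frequency-reweighted image
    (Figure 2c) to obtain the enhanced input of the RM (Figure 2d)."""
    v = spectral_vector(image, r_tau)
    hp = v[v > 0]
    weight = hp.mean() / (v.mean() + 1e-8) if hp.size else 1.0       # scalar reweighting (simplified)
    return mask * image + (1 - mask) * image * weight


# Usage sketch:
# img, m = np.random.rand(256, 256), np.zeros((256, 256)); m[100:150, 100:150] = 1
# out = foreground_enhance(img, m)
```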

3.2. Meta-Feature Extractor

The ME is used for the multiscale representation of the features of Q. Usually, the same object in a remote sensing image has feature representations at different scales; therefore, adding a multiscale feature representation structure is beneficial to the detection of objects of different sizes. The ME module designed in this study is optimized with YOLOv5 as the backbone. The overall structure of the module is shown in Figure 3, including the Backbone, PANet, and Output. The original backbone of the YOLOv5 model, CSPDarkNet, was replaced with the more lightweight G-GhostNet [79] module, which reduces network parameters and improves detection accuracy. G-GhostNet achieves a balance between accuracy and GPU latency by deleting some activation layers relative to C-GhostNet [80]. In addition, to obtain the context dependencies of each object scale feature, a graph convolutional unit (GCU) inspired by [81] is introduced. This unit is designed to strengthen the connections between similar pixels of each scale feature in the Output layer and to deepen the differences between dissimilar pixels.
Figure 3. Meta-feature extractor module. The overall structure is an optimization of YOLOv5. G-GhostNet is used as the backbone structure, and PANet is used as the neck layer, with the multiscale GCUs structure as the output layer.
In this study, the specific settings of G-GhostNet are as shown in Table 1. The G-Ghost Bottleneck structures in the Backbone layer in Figure 3 correspond to the operations shown in Stage L1, Stage L2, and Stage L3 in Table 1, respectively. The output feature $Y_n$ of each stage is the fusion of the complex feature $Y_n^c \in \mathbb{R}^{(1-\mu)c \times W \times H}$ obtained by the Block operation and the cheap feature $Y_n^g \in \mathbb{R}^{\mu c \times W \times H}$ obtained by the Cheap operation. The value range of μ is 0 ≤ μ ≤ 1; here, μ is set to 0.5. The specific expression is shown in Equation (4), where $Y_n^c$ is generally the feature obtained after residual operations, and $Y_n^g$ is generally the feature obtained by a 1 × 1 or 3 × 3 convolution.
$Y_n = \left[\, Y_n^c, \ Y_n^g \,\right].$
Table 1. The overall framework of G-GhostNet set in this study. Block represents a residual bottleneck, the Cheap operation is generally a 1 × 1 or 3 × 3 convolution to obtain cheap features, and Output represents the size of the output feature map and the number of channels at this stage.
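To make the Block/Cheap split of Equation (4) and Table 1 concrete, the following PyTorch sketch builds one G-Ghost-style stage whose complex branch produces (1 − μ)c channels through a stack of bottleneck blocks and whose cheap branch produces μc channels through a 1 × 1 convolution, with the two concatenated as Yn; the block internals and channel counts are simplified assumptions rather than the exact G-GhostNet [79] configuration.

```python
# Simplified G-Ghost stage sketch (Equation (4)): complex branch + cheap branch.
import torch
import torch.nn as nn


class GGhostStage(nn.Module):
    def __init__(self, c_in: int, c_out: int, num_blocks: int = 3, mu: float = 0.5):
        super().__init__()
        c_cheap = int(mu * c_out)          # mu * c channels from the cheap operation
        c_complex = c_out - c_cheap        # (1 - mu) * c channels from the blocks
        blocks = [self._bottleneck(c_in, c_complex)]
        blocks += [self._bottleneck(c_complex, c_complex) for _ in range(num_blocks - 1)]
        self.blocks = nn.Sequential(*blocks)                 # "Block" column of Table 1
        self.cheap = nn.Sequential(                          # "Cheap" column of Table 1
            nn.Conv2d(c_in, c_cheap, kernel_size=1, bias=False),
            nn.BatchNorm2d(c_cheap),
            nn.ReLU(inplace=True),
        )

    @staticmethod
    def _bottleneck(c_in: int, c_out: int) -> nn.Module:
        # Simplified residual-style bottleneck standing in for a G-Ghost block.
        return nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y_complex = self.blocks(x)         # Y_n^c
        y_cheap = self.cheap(x)            # Y_n^g
        return torch.cat([y_complex, y_cheap], dim=1)   # Y_n = [Y_n^c, Y_n^g]
```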
In addition, in Figure 3, the multiscale features obtained by the 1 × 1 convolution of the output layer lack contextual long-range dependence. Therefore, the GCU is established by using the graph structure to strengthen the dependency between adjacent pixels. The structure of GCU is shown in Figure 4, and the specific calculation method is shown in Algorithm 1.
Figure 4. Schematic diagram of graph convolutional unit.
First, the feature X of the output layer after 1 × 1 convolution is spatially reprojected, and the K-means [82] algorithm is used to initialize the node features $W \in \mathbb{R}^{c_1 \times V}$ and the variances $\mathrm{Variance} \in \mathbb{R}^{c_1 \times V}$. Here, $c_1$ represents the channel size of the input feature X, and V represents the number of regions into which the feature map X is divided, which is also the number of nodes in the graph structure. In Step 1 of Algorithm 1, the calculation formulas for any element of the probability matrix $Q \in \mathbb{R}^{H \times W \times V}$ and the encoding matrix $Z \in \mathbb{R}^{c_1 \times V}$ are shown in Equations (5) and (6). In order to achieve multiscale contextual long-range dependence, this study sets the value of V to 4 and 8, respectively. Therefore, the output features of GCUi (i = 1, 2, 3) in Figure 3 are the fusion of three features: the feature X itself, the feature $\tilde{X}_{V=4}$ when V = 4, and the feature $\tilde{X}_{V=8}$ when V = 8.
$q_{ijk} = \dfrac{\exp\!\left(-\left\| (x_{ij} - w_k)/\sigma_k \right\|_2^2 / 2\right)}{\sum_{k} \exp\!\left(-\left\| (x_{ij} - w_k)/\sigma_k \right\|_2^2 / 2\right)},$
$z_k = \dfrac{z_k'}{\left\| z_k' \right\|_2}, \qquad z_k' = \dfrac{1}{\sum_{ij} q_{ijk}} \sum_{ij} q_{ijk}\, \dfrac{x_{ij} - w_k}{\sigma_k},$
where  q i j k  is the probability of each pixel belonging to the region (node) k, k ∈ (1,V),  x i j  represents the feature of the pixel in the i-th row and j-th column of the two-dimensional feature map, and wk represents the feature vector of the k-th node. In addition,  σ k  is the variance of all dimensions of node k, and it is normalized to (0,1) by the Sigmoid function.
Algorithm 1 Graph convolutional unit (GCU) algorithm
1: Input: a feature map X
2: Output: the enhanced context-dependent feature representation $\tilde{X}$
3: while in the training (test) stage do
4:   // Step 1: project the feature map X into the graph structure to obtain the probability matrix Q, the encoding feature Z, and the adjacency matrix A
5:   initialize W, Variance ← KMeans(n_clusters = V)(X)
6:   probability matrix Q ← f(x_ij, w_k ∈ W)   (Equation (5))
     encoding feature Z ← f(q_ijk ∈ Q, w_k ∈ W, σ_k ∈ Variance)   (Equation (6))
     adjacency matrix A ← Z^T Z
7:   // Step 2: graph convolution, where W_g ∈ R^{c1 × c2} is a random weight matrix
8:   $\tilde{Z}$ ← f(A Z^T W_g)
9:   // Step 3: reverse reprojection
10:  $\tilde{X}$ = Q $\tilde{Z}^T$
11: end while
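The following PyTorch sketch mirrors Algorithm 1: pixels are soft-assigned to V graph nodes (Equation (5)), node encodings are computed and normalized (Equation (6)), a graph convolution is applied with adjacency A = ZᵀZ, and the result is reprojected back to the pixel grid. Replacing the K-means initialization of the node features and variances with learnable parameters is a simplifying assumption, so this is an illustrative approximation rather than the authors' exact implementation.

```python
# Graph convolutional unit (GCU) sketch following Algorithm 1 (simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCU(nn.Module):
    def __init__(self, c1: int, c2: int, num_nodes: int = 4):
        super().__init__()
        self.V = num_nodes
        # Node features w_k (c1 x V) and per-node variances sigma_k (c1 x V),
        # learnable here instead of k-means initialized (simplifying assumption).
        self.w = nn.Parameter(torch.randn(c1, num_nodes))
        self.sigma = nn.Parameter(torch.ones(c1, num_nodes))
        # Graph-convolution weight W_g (c1 x c2).
        self.wg = nn.Parameter(torch.randn(c1, c2))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c1, h, w = x.shape
        pixels = x.flatten(2).transpose(1, 2)                  # (B, HW, c1)
        sigma = torch.sigmoid(self.sigma)                      # keep variances in (0, 1)

        # Step 1: soft-assign each pixel to V nodes (Equation (5)).
        diff = (pixels.unsqueeze(3) - self.w) / sigma          # (B, HW, c1, V)
        q = F.softmax(-0.5 * diff.pow(2).sum(dim=2), dim=-1)   # (B, HW, V)

        # Encode each node (Equation (6)), then l2-normalize along channels.
        z = torch.einsum("bnv,bncv->bcv", q, diff) / (q.sum(1).unsqueeze(1) + 1e-6)
        z = F.normalize(z, dim=1)                              # (B, c1, V)
        a = torch.bmm(z.transpose(1, 2), z)                    # adjacency A = Z^T Z

        # Step 2: graph convolution Z~ = f(A Z^T W_g).
        z_tilde = F.relu(torch.bmm(a, z.transpose(1, 2)) @ self.wg)   # (B, V, c2)

        # Step 3: reverse projection X~ = Q Z~.
        out = torch.bmm(q, z_tilde)                            # (B, HW, c2)
        return out.transpose(1, 2).reshape(b, -1, h, w)


# Usage sketch: fuse multiscale context as in the paper (V = 4 and V = 8).
# feats = torch.randn(2, 256, 32, 32)
# y = torch.cat([feats, GCU(256, 256, 4)(feats), GCU(256, 256, 8)(feats)], dim=1)
```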

3.3. Reweighting Module

The RM is used to extract meta-feature knowledge from the S and to guide the localization and recognition of objects in each image of the Q. The input of the RM is the reshaped form of $V_I$ mentioned in Equation (3) (Section 3.1). The specific structure of the RM is shown in Figure 5. Figure 5a shows the overall structure of the RM: after multiple convolution and pooling operations, the features are fed into the GFPE shown in Figure 1, inspired by [83], and finally three multiscale feature weights of S are obtained ($R_0 \in \mathbb{R}^{W/16 \times H/16 \times 512}$, $R_1 \in \mathbb{R}^{W/32 \times H/32 \times 1024}$, and $R_2 \in \mathbb{R}^{W/8 \times H/8 \times 256}$). Figure 5b shows the global feature extraction (GFE) block in the GFPE. This block obtains an attention weight for each feature map (such as $F \in \mathbb{R}^{W_1 \times H_1 \times C_1}$ in Figure 5b) and then combines all features by weighted averaging to generate the global contextual features. The calculation formula is shown below.
$GF_{ctx} = \sum_{i=1}^{N_p} \dfrac{e^{W_g F_i}}{\sum_{m=1}^{N_p} e^{W_g F_m}}\, F_i,$
where $F \in \mathbb{R}^{W_1 \times H_1 \times C_1} = \{F_i\}_{i=1}^{N_p}$ is the input feature map of the GFE block, $F_i$ is any feature position in F, $N_p = W_1 \times H_1$ is the number of positions in the feature map, and Z is the output feature map of the GFE block. For global feature extraction, $W_g$ (a 1 × 1 convolution kernel) is first used to compress the feature map channels to 1; the result is reshaped to 1 × 1 × $H_1 W_1$, and the Softmax function is applied to $W_g F_i$ to obtain the attention weights. Finally, the weight $e^{W_g F_i} / \sum_{m=1}^{N_p} e^{W_g F_m}$ is applied to aggregate each feature $F_i$, yielding the global context feature $GF_{ctx}$.
Figure 5. Reweighting module: (a) the entire module structure of the reweighting module; (b) the GFE block in the global feature pyramid extractor (shown as GFPE in Figure 1). In addition, the feature map output of each layer is shown with its feature dimensions, e.g., W1 × H1 × C1 denotes a feature map with width W1, height H1, and channel number C1; ⊗ denotes matrix multiplication.
To establish the interdependence between channels, the global context features  G F c t x  are weighted and reassigned. The redistribution calculation formula is as follows:
$Z = W_{t2}\, \mathrm{ReLU}\!\left(\mathrm{LN}\!\left(W_{t1}\, GF_{ctx}\right)\right),$
where  W t 1  (1 × 1 convolution kernel) is used for channel dimensionality reduction of the feature map, followed by LayerNorm for feature normalization and ReLU for nonlinear activation; finally,  W t 2  (1 × 1 convolution kernel) is used to increase the dimension of the feature map channel, restoring the size of the feature channel C1, and the feature map Z is output.
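A minimal PyTorch sketch of the GFE block, assuming a GCNet-style implementation of Equations (7) and (8): attention over all spatial positions produces the global context vector, which is then transformed by a 1 × 1 convolution, LayerNorm, ReLU, and a second 1 × 1 convolution; the reduction ratio and the residual addition of the transformed context back onto the input are illustrative assumptions.

```python
# Global feature extraction (GFE) block sketch (Equations (7) and (8)).
import torch
import torch.nn as nn


class GFEBlock(nn.Module):
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction
        self.wg = nn.Conv2d(channels, 1, kernel_size=1)          # W_g: C -> 1
        self.transform = nn.Sequential(                           # Equation (8)
            nn.Conv2d(channels, hidden, kernel_size=1),            # W_t1
            nn.LayerNorm([hidden, 1, 1]),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),            # W_t2
        )

    def forward(self, f: torch.Tensor) -> torch.Tensor:
        b, c, h, w = f.shape
        # Attention weights over the N_p = H*W positions (Equation (7)).
        attn = torch.softmax(self.wg(f).view(b, 1, h * w), dim=-1)       # (B, 1, HW)
        context = torch.bmm(f.view(b, c, h * w), attn.transpose(1, 2))   # (B, C, 1)
        context = context.view(b, c, 1, 1)
        # Channel interdependence and redistribution (Equation (8)).
        z = self.transform(context)
        return f + z                                              # reweighted output (assumed residual)


# Usage sketch: weight a support-set feature map.
# s_feat = torch.randn(2, 256, 32, 32)
# out = GFEBlock(256)(s_feat)
```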

3.4. Feature Fusion Module

To transfer the meta-knowledge of the RM to the ME, inspired by the omni-dimensional (OD) convolution [84], we propose a dynamic FFM, as shown in Figure 6, which implements the fusion of the output feature maps of the two-part structure. FFM introduces a multidimensional attention mechanism through a parallel strategy to learn more flexible attention to the four dimensions of the convolution kernel space. FFM can be expressed by Equations (9) and (10).
$y = \left( \alpha_{w1} \odot \alpha_{f1} \odot \alpha_{c1} \odot \alpha_{s1} \odot W_1 + \cdots + \alpha_{wn} \odot \alpha_{fn} \odot \alpha_{cn} \odot \alpha_{sn} \odot W_n \right) \times F_r,$
$F_f = \mathrm{F.conv2d}\left(F_m, y\right),$
where $F_r$ represents the feature map (R0, R1, or R2) output by the RM, $\alpha_{wi}$ represents the attention scalar of the convolution kernel $W_i$, and $\alpha_{si} \in \mathbb{R}^{k \times k}$, $\alpha_{ci} \in \mathbb{R}^{c_{in}}$, and $\alpha_{fi} \in \mathbb{R}^{c_{out}}$ represent the three newly introduced attentions, which act along the spatial dimension, the input channel dimension, and the output channel dimension, respectively. These four attentions are calculated using the multi-head attention module $\pi_i(x)$ [85,86]. By progressively multiplying the convolution kernel $W_i$ with different attentions along the position, channel, filter, and kernel dimensions, the convolution operation can strengthen the differences of the input feature map in each dimension and better capture rich meta-knowledge contextual information. F.conv2d represents a two-dimensional convolution operation with the generated weights, and $F_m$ represents the feature map generated by the ME.
Figure 6. Schematic diagram of feature fusion module (FFM).
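The following simplified PyTorch sketch illustrates the spirit of Equations (9) and (10): four attention factors (kernel-wise, output-channel, input-channel, and spatial) modulate n candidate kernels, the modulated kernels are summed, and the aggregated kernel is applied to the meta-features Fm with F.conv2d. Deriving the attentions from globally pooled Fr through a small fully connected head (instead of the multi-head attention module of [85,86]) and the kernel count are illustrative assumptions, not the authors' exact design.

```python
# Dynamic feature fusion module (FFM) sketch, ODConv-style (simplified).
import torch
import torch.nn as nn
import torch.nn.functional as F


class FFM(nn.Module):
    def __init__(self, c_in: int, c_out: int, k: int = 3, n_kernels: int = 4):
        super().__init__()
        self.n, self.k, self.c_in, self.c_out = n_kernels, k, c_in, c_out
        self.weight = nn.Parameter(torch.randn(n_kernels, c_out, c_in, k, k) * 0.02)
        self.pool = nn.AdaptiveAvgPool2d(1)
        hidden = max(c_in // 4, 16)
        self.fc = nn.Sequential(nn.Conv2d(c_in, hidden, 1), nn.ReLU(inplace=True))
        # One head per attention dimension (assumed form of the attention module).
        self.fc_spatial = nn.Conv2d(hidden, k * k, 1)
        self.fc_cin = nn.Conv2d(hidden, c_in, 1)
        self.fc_cout = nn.Conv2d(hidden, c_out, 1)
        self.fc_kernel = nn.Conv2d(hidden, n_kernels, 1)

    def forward(self, f_m: torch.Tensor, f_r: torch.Tensor) -> torch.Tensor:
        # Attentions derived from the reweighting feature F_r (averaged over the
        # batch so the aggregated kernel is shared, a simplifying assumption).
        ctx = self.fc(self.pool(f_r))
        a_s = torch.sigmoid(self.fc_spatial(ctx)).mean(0).view(1, 1, self.k, self.k)
        a_ci = torch.sigmoid(self.fc_cin(ctx)).mean(0).view(1, self.c_in, 1, 1)
        a_co = torch.sigmoid(self.fc_cout(ctx)).mean(0).view(self.c_out, 1, 1, 1)
        a_w = torch.softmax(self.fc_kernel(ctx).mean(0).view(self.n), dim=0)

        # Equation (9): combine the n kernels modulated in all four dimensions.
        y = sum(a_w[i] * (a_co * a_ci * a_s * self.weight[i]) for i in range(self.n))
        # Equation (10): apply the aggregated kernel to the meta-features F_m.
        return F.conv2d(f_m, y, padding=self.k // 2)


# Usage sketch:
# f_m = torch.randn(2, 256, 32, 32)   # meta-features from the ME
# f_r = torch.randn(2, 256, 32, 32)   # reweighting features from the RM
# fused = FFM(256, 256)(f_m, f_r)
```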

3.5. Loss Function

In order to obtain the excellent performance of few-shot remote sensing image object detection, a good loss function design is essential. In this study, we construct the loss function between the predicted bounding box and the ground-truth bounding box from the localization and classification parts of the object detection, so that the training phase gradually reaches convergence. First, for the localization loss, we adopt the mean square error loss to penalize the deviation between the predicted localization and the true localization. Its calculation formula is shown below.
$L_{loc} = \dfrac{1}{N_{pos}} \sum_{pos} \sum_{c} \left( \mathrm{coord}_t^{\,c} - \mathrm{coord}_p^{\,c} \right)^2 .$
Similar to [70], pos means that this study only considers the loss of positive anchor boxes while ignoring the loss of negative anchor boxes. In practice, we choose the localization threshold empirically; for example, if the IoU between the predicted anchor box and the ground-truth anchor box is greater than 0.7, it is considered a positive anchor box, and if the IoU is less than 0.3, it is considered a negative anchor box. In addition, c is a coordinate selector that selects one of the four coordinate representations {x, y, w, h} of a specific bounding box.
Since the standard IoU loss does not take into account the direction mismatch between the ground-truth anchor box and the predicted anchor box, which leads to slow convergence and low efficiency, the SIoU loss function [87] is used as the constraint between the predicted anchor box and the ground-truth anchor box.
$L_{box} = 1 - IoU + \dfrac{\Delta + \Omega}{2},$
where  Δ  represents distance cost,  Ω  represents shape cost, and  Δ  is a cost function redefined based on angle cost.
In addition, the confidence loss $L_{obj}$ also needs attention in object detection; it focuses on the probability that an object is present in a certain area. In this study, the confidence loss is a binary cross-entropy loss that comprehensively considers the objectness loss ($L_o$) and the non-objectness loss ($L_{noobj}$). Its calculation formula is shown below.
$L_{obj} = w_{obj} L_o + w_{noobj} L_{noobj} = w_{obj} \dfrac{1}{N_{pos}} \sum_{pos} \left( -\log P_o \right) + w_{noobj} \dfrac{1}{N_{neg}} \sum_{neg} \left( -\log\left(1 - P_o\right) \right),$
where $P_o$ represents the probability of containing an object. To balance $L_o$ and $L_{noobj}$, $w_{obj}$ and $w_{noobj}$ are weights, introduced considering that, in practice, there are generally more negative boxes than positive boxes.
The cross-entropy loss function is used as the object classification loss, and the calculation method is shown in Equations (14) and (15).
$R_c = \dfrac{e^{r_p^c}}{\sum_{c=1}^{N} e^{r_p^c}},$
$L_{cls} = \dfrac{1}{N_{pos}} \sum_{pos} \sum_{c} \left( -y_c \log R_c \right),$
where $y_c$ represents the true category annotation of category c, and $R_c$ represents the classification probability of category c, normalized by the softmax function; $r_p^c$ (c = 1, 2, 3, …, N) represents the predicted score for one of the N categories of an anchor box.
Ultimately, the loss function constraints in the entire few-shot object detection are shown in Equation (16).
$L = L_{loc} + L_{obj} + L_{cls} .$
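A minimal PyTorch sketch of the composite loss in Equation (16): an MSE localization term over positive anchors (Equation (11)), a weighted objectness/non-objectness binary cross-entropy term (Equation (13)), and a softmax cross-entropy classification term (Equations (14) and (15)). The SIoU box term of Equation (12) is omitted for brevity, and the tensor layout, function name, and weight values are illustrative assumptions.

```python
# Composite few-shot detection loss sketch (Equations (11), (13)-(16), simplified).
import torch
import torch.nn.functional as F


def detection_loss(pred_coords, gt_coords, pred_obj, pos_mask,
                   pred_cls, gt_cls, w_obj: float = 1.0, w_noobj: float = 0.5):
    """pred_coords/gt_coords: (N, 4); pred_obj/pos_mask: (N,);
    pred_cls: (N, C) raw scores; gt_cls: (N,) class indices."""
    pos = pos_mask.bool()
    neg = ~pos
    n_pos = pos.sum().clamp(min=1)
    n_neg = neg.sum().clamp(min=1)

    # Equation (11): MSE over the {x, y, w, h} coordinates of positive anchors.
    l_loc = ((pred_coords[pos] - gt_coords[pos]) ** 2).sum() / n_pos

    # Equation (13): weighted objectness / non-objectness binary cross-entropy.
    p_o = torch.sigmoid(pred_obj)
    l_obj = (w_obj * (-torch.log(p_o[pos] + 1e-7)).sum() / n_pos
             + w_noobj * (-torch.log(1 - p_o[neg] + 1e-7)).sum() / n_neg)

    # Equations (14)-(15): softmax cross-entropy over positive anchors.
    l_cls = F.cross_entropy(pred_cls[pos], gt_cls[pos])

    # Equation (16): total loss.
    return l_loc + l_obj + l_cls


# Usage sketch:
# loss = detection_loss(torch.rand(8, 4), torch.rand(8, 4), torch.randn(8),
#                       torch.tensor([1, 1, 0, 0, 1, 0, 0, 1]),
#                       torch.randn(8, 10), torch.randint(0, 10, (8,)))
```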

4. Experiment and Results

4.1. Datasets and Evaluation Metrics

To verify the effectiveness of the proposed method, experimental verification was conducted on the two datasets NWPU VHR-10 [88] and DIOR [2]. Details of the datasets are shown in Table 2.
Table 2. Details of the experimental dataset.
NWPU VHR-10: This dataset contains a total of 800 remote sensing images collected from Google Earth and the ISPRS Vaihingen dataset. The NWPU VHR-10 dataset is divided into two parts: one in which each image has at least one manually labeled object instance, with a total of 650 images, and one in which the images contain no object instances, with a total of 150 images. The dataset contains 10 types of objects. According to the classification standard of most studies, the base classes include ship, storage tank, basketball court, ground track field, harbor, bridge, and vehicle, and the novel classes include airplane, baseball diamond, and tennis court. Furthermore, in this study, few-shot experiments were used to evaluate the performance of the proposed method in the 3/5/10-shot cases.
DIOR: This dataset contains a total of 23,463 images and is a large-scale benchmark dataset. The images are mainly sourced from Google Earth, and a total of 20 categories are annotated, including 192,472 instances. The size of each image is 800 × 800 pixels, and the image resolution ranges from 0.5 to 30 m. According to the general classification standard, the base classes include airport, basketball court, bridge, chimney, dam, expressway service area, expressway toll station, golf course, ground track field, harbor, overpass, ship, stadium, storage tank, and vehicle. There are five novel classes, namely airplane, baseball field, tennis court, train station, and windmill. Furthermore, in this study, few-shot experiments were used to evaluate the performance of the proposed method in the 3/5/10-shot cases.
To verify the performance of the proposed method and the advanced comparison methods, we use mean average precision (mAP) as the evaluation index, and its calculation process is shown in Equations (17)–(20).
$P = \dfrac{TP}{TP + FP},$
$R = \dfrac{TP}{TP + FN},$
$AP_c = \int_{0}^{1} P_c(R)\, dR,$
$mAP_k = \dfrac{1}{k} \sum_{c=1}^{k} AP_c,$
where TP denotes the number of true positives, FP the number of false positives, and FN the number of false negatives. The AP calculation formula for a certain category c is shown in Equation (19), and mAP_k is the mean average precision of the model under the k-shot setting.
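The following NumPy sketch computes these metrics for one class and averages over classes: precision and recall are accumulated over score-ranked detections (Equations (17) and (18)), AP is the area under the precision-recall curve (Equation (19)), and mAP is the mean over classes (Equation (20)). The input format (pre-matched true/false positive flags) and the all-point interpolation are illustrative assumptions about the evaluation protocol.

```python
# AP / mAP evaluation sketch (Equations (17)-(20), all-point interpolation).
import numpy as np


def average_precision(scores, is_tp, num_gt):
    """scores: confidence of each detection; is_tp: 1 if matched to a ground
    truth (TP), else 0 (FP); num_gt: number of ground-truth objects (TP + FN)."""
    order = np.argsort(-scores)
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(1 - is_tp[order])
    recall = tp / max(num_gt, 1)                        # Equation (18)
    precision = tp / np.maximum(tp + fp, 1e-9)          # Equation (17)
    # Add sentinels, make precision monotonically decreasing, integrate (Equation (19)).
    recall = np.concatenate(([0.0], recall, [1.0]))
    precision = np.concatenate(([0.0], precision, [0.0]))
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    return np.sum((recall[1:] - recall[:-1]) * precision[1:])


def mean_ap(per_class_results):
    """per_class_results: list of (scores, is_tp, num_gt) tuples, one per class
    (Equation (20))."""
    return np.mean([average_precision(*r) for r in per_class_results])


# Usage sketch:
# ap = average_precision(np.array([0.9, 0.8, 0.6]), np.array([1, 0, 1]), num_gt=2)
```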

4.2. Experiment Settings

This study used an RTX 3090 graphics processing unit (24 GB) for the training and prediction of the proposed network model, as well as for the training and prediction of the comparative experiments. Because the images have nonuniform sizes, each image was resized to 512 × 512 pixels before being fed to the network. The mini-batch size was set to eight, the initial learning rate to 1 × 10−4, and the weight decay to 5 × 10−5. For base training on the NWPU VHR-10 dataset, the proposed model was trained for 12k iterations, and the 3-shot, 5-shot, and 10-shot experiments ran for 2k, 2.5k, and 4k iterations, respectively. For base training on the DIOR dataset, the proposed model was trained for 8k iterations, and the 3-shot, 5-shot, and 10-shot experiments ran for 2k, 3k, and 4k iterations, respectively.
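For reference, the settings above can be collected into a single configuration object; the following Python snippet reproduces the values stated in this subsection, with key names chosen for illustration only.

```python
# Training configuration sketch; values follow Section 4.2, key names are illustrative.
TRAIN_CONFIG = {
    "input_size": (512, 512),      # images resized to 512 x 512 pixels
    "batch_size": 8,               # mini-batch size
    "lr": 1e-4,                    # initial learning rate
    "weight_decay": 5e-5,
    "iterations": {
        "NWPU_VHR10": {"base": 12_000, "3shot": 2_000, "5shot": 2_500, "10shot": 4_000},
        "DIOR": {"base": 8_000, "3shot": 2_000, "5shot": 3_000, "10shot": 4_000},
    },
}
```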

4.3. Results on the NWPU VHR 10 Dataset

Table 3 presents the quantitative evaluation results of the experiments using the NWPU VHR 10 dataset in different state-of-the-art FSOD networks, including YOLOv5 [89], faster R-CNN (ResNet101) [22], FSRW [90], FSODM [70], TFA [91], PAMS-Det [92], G-FSDet [93], CIR-FSD [94], SAGS-TFS [67], TINet [69], and the proposed network.
Table 3. Comparisons among different FSOD networks on the NWPU VHR 10 dataset.
As shown in Table 3, the model proposed in this study achieved excellent mAP scores in the prediction results of both the base class and the novel class. In the base class detection results, the mAP score reached 92%, indicating that the network proposed in this study can achieve satisfactory results when a large number of samples is available. For novel class detection, we performed experiments under the 3-shot, 5-shot, and 10-shot settings, and the mAP reached 0.56, 0.69, and 0.72, respectively. Compared with several state-of-the-art few-shot object detectors, our model also had advantages. For example, compared with SAGS-TFS, our proposed model achieved a 5% improvement in mAP in the 3-shot setting and a 3% improvement in the 5-shot setting. As the number of object instance annotations increased, the mAP of the model also tended to stabilize. Under the 10-shot setting, our model and the SAGS-TFS model achieved the same result of 0.72. Meanwhile, compared with TINet, our model achieved the same performance in the 3-shot and 10-shot settings and had a slight advantage in the 5-shot setting. In contrast, the fully supervised YOLOv5 and faster R-CNN (ResNet101) achieved poor results when the training samples were gradually reduced. When comparing the prediction results of meta-learning-based models of the same type, such as FSRW and FSODM, our proposed method also achieved outstanding results; in particular, under the few-shot settings, the mAP of the model proposed in this study was 19%, 16%, and 7% better than that of FSODM. TFA, PAMS-Det, G-FSDet, and CIR-FSD are all two-stage fine-tuning methods. This type of model can generally achieve better results, especially in the base class, while the prediction results in the novel class are slightly lower. This is because, with the introduction of novel class samples, the model is still not perfect in balancing the foreground and background proportions.
Figure 7 shows the qualitative visualization results of the method proposed in this study for the 10-shot setting of the base classes (results shown with a light-blue background) and novel classes (results shown with a light-yellow background) in the NWPU VHR-10 dataset. Combined with the quantitative analysis of Table 3, our model performed excellently in the detection of the base classes and achieved an mAP score above 0.90; accordingly, Figure 7 shows a good visualization effect for the base classes. In particular, for the ground track field, full-coverage detection was realized and the AP reached 100%, which shows that our model had the best detection effect for large-scale objects. Furthermore, for the detection of small and densely distributed objects such as ships and vehicles, the model proposed in this study showed good detection performance. At the same time, the detection of novel classes also achieved the best visualization compared with some state-of-the-art models. For example, it can be seen from Table 3 that FSODM missed a large number of objects in novel class detection, while two-stage fine-tuning networks such as CIR-FSD had a higher probability of missed detection of tennis courts. The meta-learning-based model we propose considers the enhancement of context information and global information, which strengthens the distinction between the foreground and background of the tennis court, thus improving the accuracy of tennis court detection.
Figure 7. The visualization detection results on the NWPU VHR-10 dataset under the base class setting and the 10-shot setting. Results with a light-blue background show the base classes, and results with a light-yellow background show the novel classes.

4.4. Results on the DIOR Dataset

Table 4 shows the quantitative evaluation results of different state-of-the-art FSOD networks on the DIOR dataset. Table 4 and Figure 8 jointly present the quantitative results and the qualitative visualization results of some of the most advanced few-shot object detection models and of the network proposed in this study on the DIOR dataset.
Table 4. Comparisons among different FSOD networks on the DIOR dataset.
Figure 8. The visualization detection results of the DIOR dataset under the base class setting and the 20-shot setting. Results with light-blue background include the visual displays of the base class, and results with light-yellow background include the visual displays of the novel class.
In Table 4 and Figure 8, compared with the current state-of-the-art models, our proposed model achieved the highest mAP values and the most accurate visual detection results for both the base classes and the novel classes; the base class detection mAP reached 0.74, the same level as the CIR-FSD model. For the novel classes, we performed experiments under the 5-shot, 10-shot, and 20-shot settings, and the mAP values reached 0.36, 0.41, and 0.47, respectively. It can be seen that, as the number of annotations increased, the results improved, which is in line with the principle of supervised learning. In the comparison with similar few-shot learning methods, the method proposed in this study improved the base class mAP by 24% compared with the FSRW network; under the 5-shot, 10-shot, and 20-shot settings, the detection results of the novel classes were improved by 14%, 13%, and 13% in mAP, respectively. In addition, for the G-FSDet network, since the prediction categories of that network in the few-shot experiments differed from the experimental settings of this study and the other network models, we took the maximum mAP of the network over the different prediction categories as the comparison accuracy. In general, the detection accuracies of G-FSDet in the base and novel classes were good, but some objects were still missed under very few samples.
Figure 8 shows the qualitative visualization results of the proposed model for the base classes (results shown with a light-blue background) and for the 20-shot setting (results shown with a light-yellow background). According to the visualization results, our model paid more attention to reducing the missed detection rate of densely distributed objects in the few-shot detection of the DIOR dataset and had better detection performance than models such as FSODM, TFA, PAMS-Det, G-FSDet, CIR-FSD, SAGS-TFS, and TINet. For example, in the dense object detection of airplanes, baseball fields, and tennis courts (Figure 8), only a small number of objects were missed in the entire prediction set.

5. Discussion

5.1. Ablation Experiment

In order to further discuss the specific role and rationality of the proposed network in few-shot remote sensing image object detection, we discuss and analyze the effectiveness of its important modules. Our ablation experiments are based on the NWPU VHR-10 dataset. Our baseline setting is a combined framework of an ME based on the YOLOv5 framework and an RM based on a simple CNN for feature extraction. The FE, GFPE, GCU, and FFM modules were then added in turn for the ablation experiments. The specific quantitative evaluation is shown in Table 5.
Table 5. Ablation experiment on the NWPU VHR-10 Dataset.
Effectiveness of FE: The proposed FE module aims to enhance the expression of foreground features in visible category images. After the FE module is applied, its output is used as the input of the RM to guide the enhanced representation of object features. As shown in Figure 9a, we visualized the feature maps of the middle layer before and after adding the FE module under the 10-shot setting (all feature maps in Figure 9 are based on the 10-shot setting). It can be observed from Figure 9a that, when the FE module was not added, the shallow shape features of the foreground object and the background were more obvious, but the foreground and background did not have differential saliency expressions. When the FE module was added, the feature expression of the background was weakened, while the feature expression of the foreground was enhanced. Table 5 shows that, under the 3-shot, 5-shot, and 10-shot settings, the mAP values after adding the FE module increased by 4%, 3%, and 2%, respectively, compared with those before adding it, which directly shows that the introduced FE module is effective for few-shot object detection.
Figure 9. Effectiveness analysis of FE/GFPE/GCU/FFM. (ad) represent the middle layer heat maps before and after adding different modules. The red rectangles indicate the salience representation of adding different modules.
Effectiveness of GFPE: The role of GFPE is to extract the multiscale features of the S, and its output feature map is used as the weight feature of the FFM module to fuse with the output feature of the ME. As shown in Figure 9b, after adding the GFPE module, the deep semantic information of the object features was more prominently expressed. As shown in Figure 9b, it can be seen from the details displayed in the red rectangle that the feature map without the GFPE module reflected the phenomenon of multi-object adhesion; for example, the boundaries between the harbors in Figure 9b were not completely distinguished. On the contrary, after adding the GFPE module, the harbor feature was more specific, and the difference between the harbor and the background was also more prominent. At the same time, it can also be seen from Table 5 that, after adding the GFPE module, the mAP under each experimental setting increased by 12%, 6%, and 4%, which reflects the importance of the GFPE module.
Effectiveness of GCU: As an important structure in the ME, the GCU module is used to guide the enhancement of the context-dependent information of the meta-features of the Q. By setting multiple numbers of graph nodes (V = 4 and V = 8) and fusing the resulting features with the original features, meta-feature fusion of multiscale context information of the Q is realized. It can be observed from Figure 9c that the meta-features guided by the GCU module were more focused on the expression of object semantic information. The first row shows the expression of ship features before and after the introduction of the GCU module; after the introduction of the GCU, the semantic information of the ship was more obvious, as shown in the details of the red rectangle. In addition, in the second row showing airplane features, before the introduction of the GCU module, the features of the airplane and the background were difficult to distinguish, whereas after the GCU module was introduced, the semantic information of the airplane was more obvious, and objects could be well distinguished from other objects and the background, as shown in the details of the red rectangle. At the same time, combined with the specific quantitative indicators in Table 5, the accuracy under the 3-shot, 5-shot, and 10-shot settings was 12%, 6%, and 4% higher, respectively, than without the GCU module. This fully demonstrates the role played by the GCU module and is consistent with the information reflected in the heat maps after adding it.
Effectiveness of FFM: The function of FFM is to aggregate the multiscale features generated by ME and the multiscale weight features generated by RM to generate fusion features for final object detection. Through the middle layer representation of the Q features in Figure 9d, it can be found that the most important difference before and after adding the FFM module was that FFM could focus more on characterizing the object features, strengthening the difference with the background information, and enhancing the feature representation of the object. As shown in the red rectangle details in Figure 9d, before adding the FFM module, the object feature representation was weaker. Combined with the quantitative representation results of Table 5, FFM played a role in enhancing the characteristics of the Q. With small sample annotations, the accuracy increased by 4%, 5%, and 3% under 3-shot, 5-shot, and 10-shot settings, respectively.

5.2. Comprehensive Analysis

The detection ability of the base and novel classes in special cases: To show that the network proposed in this study is capable of comprehensive detection of the base and novel classes, we selected some samples with multiple categories from the NWPU VHR-10 dataset for display. Figure 10 shows the combinations of ground track field + basketball court + baseball diamond, bridge + baseball diamond, basketball court + tennis court, and harbor + tennis court, which are all base class + novel class combinations. Detecting these combinations is extremely challenging; the first three images show successful cases, while the fourth is a typical failure case. The failure in the fourth image was mainly due to missed and false detections of the harbor class. The main reasons for the imperfect detection of the fourth image are as follows: (1) most of the training images of the harbor class show harbors full of ships, while the test set contains harbors that are only partly full of ships or have no ships; (2) the scale gap between harbor instances is large, resulting in a particularly large harbor being detected as multiple harbors.
Figure 10. Comprehensive visual display of base + novel classes object detection results. The ground track field + basketball court + baseball diamond, bridge + baseball diamond, basketball court + tennis court, and harbor + tennis court combinations are shown respectively.
In addition, there were some cases of detection failure in the NWPU VHR-10 dataset, as shown in Figure 11: (1) failures caused by inter-class similarity, e.g., the nonobvious characteristic differences between a few vehicles and airplanes, and between airplanes and ships; (2) failures caused by objects being too densely distributed, e.g., an increase in the probability of repeated or missed detection due to the closely packed distribution of various courts; (3) a small number of missed detections caused by noise, e.g., vehicles missed because of tree occlusion or shadow. Overall, our proposed method has a degree of scene adaptability and generalization ability.
Figure 11. Failed detection cases are marked with red ovals. The cases correspond to failed detection of interclass similarity, false detection due to dense distribution, and missed detection caused by tree occlusion, respectively.
Model complexity and inference time: The complexity and inference time of the model are important factors for evaluating performance. Therefore, this subsection makes a statistical comparison of the complexity and inference time of the model proposed in this study and other representative state-of-the-art few-shot object detection models, as shown in Table 6. Table 6 shows that the complexity of the FSODM network was the highest, and the number of parameters reached 81.25 M, followed by the model proposed in this study, which reached 65.01 M. Simultaneously, compared with these superior models, we calculated the computational complexity of the model in the test phase with a batch size of 4 under the same standard as other papers, and the computational complexity of our model was at a medium level, reaching 197.89 GFLOPs.
Table 6. Model complexity comparison and inference time between our network and other FSOD approaches.
Although the proposed model has a relatively high number of parameters, its inference time per image was the lowest at 0.11 s. This shows that the proposed network achieves its performance benefits at the cost of only a modest increase in parameters. In addition, TFA, G-FSDet, CIR-FSD, and SAGS-TFS are two-stage fine-tuning approaches; although these models have slightly fewer parameters than our proposed model, their single-image inference time was higher, which can be attributed to the lower architectural complexity of our model during the inference stage. In particular, the G-FSDet model had a complexity of 60.19 M parameters after the reparameterization technique and 74.08 M before it. In summary, the proposed model achieved superior few-shot object detection performance.

6. Conclusions

This study proposed a few-shot remote sensing image object detection method that fuses the context dependencies of the query set and the global features of the support set. It mainly includes three important structures: the ME, the RM, and the FFM. Firstly, the RM passes the multiscale global features of the support set to the query set as meta-knowledge; secondly, the optimized YOLOv5 model is used as the baseline of the ME, combined with the multiscale GCU modules to enhance context dependencies; finally, the FFM combines the query set features with the meta-knowledge of the RM, making the query set features more prominently expressed and achieving optimal detection performance. Experiments on the NWPU VHR-10 and DIOR datasets showed that the proposed model outperformed several current advanced FSOD models and had good detection generalization performance for different remote sensing images.
However, considering the failure cases in the comparative and comprehensive experiments in Section 5, the proposed model still has room for improvement. For example, we will pay more attention to the detection of objects with large scale differences and to densely distributed small objects. These problems will be important topics for our future work.

Author Contributions

Conceptualization, H.S. and B.W.; methodology, B.W. and G.M.; writing—original draft preparation, B.W., Y.Z. (Yuan Zhou), and H.Z.; writing—review and editing, B.W., Y.Z. (Yongxian Zhang), and G.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Guangxi Science and Technology Major Project, grant number AA22068072.

Data Availability Statement

The experiments were evaluated on publicly available datasets. Access to the datasets is described in the corresponding published papers.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tiwari, A.K.; Mishra, N.; Sharma, S. Analysis and Survey on Object Detection and Identification Techniques of Satellite Images. In Proceedings of the India International Science Festival, Delhi, India, 4–8 December 2015. [Google Scholar]
  2. Li, K.; Wan, G.; Cheng, G.; Meng, L.; Han, J. Object Detection in Optical Remote Sensing Images: A Survey and a New Benchmark. ISPRS J. Photogramm. Remote Sens. 2020, 159, 296–307. [Google Scholar] [CrossRef]
  3. Wu, X.; Sahoo, D.; Hoi, S.C. Recent Advances in Deep Learning for Object Detection. Neurocomputing 2020, 396, 39–64. [Google Scholar] [CrossRef]
  4. Bhil, K.; Shindihatti, R.; Mirza, S.; Latkar, S.; Ingle, Y.S.; Shaikh, N.F.; Prabu, I.; Pardeshi, S.N. Recent Progress in Object Detection in Satellite Imagery: A Review. In Sustainable Advanced Computing: Select Proceedings of ICSAC 2021; Springer: Singapore, 2022; pp. 209–218. [Google Scholar]
  5. Pi, Y.; Nath, N.D.; Behzadan, A.H. Detection and Semantic Segmentation of Disaster Damage in UAV Footage. J. Comput. Civ. Eng. 2021, 35, 04020063. [Google Scholar] [CrossRef]
  6. Ciaramella, A.; Perrotta, F.; Pappone, G.; Aucelli, P.; Peluso, F.; Mattei, G. Environment Object Detection for Marine ARGO Drone by Deep Learning. In Proceedings of the Pattern Recognition, ICPR International Workshops and Challenges, Virtual Event, 10–15 January 2021; Springer: Berlin/Heidelberg, Germany, 2021; pp. 121–129. [Google Scholar]
  7. Liu, H.; Yu, Y.; Liu, S.; Wang, W. A Military Object Detection Model of UAV Reconnaissance Image and Feature Visualization. Appl. Sci. 2022, 12, 12236. [Google Scholar] [CrossRef]
  8. Haris, M.; Hou, J.; Wang, X. Lane Lines Detection under Complex Environment by Fusion of Detection and Prediction Models. Transp. Res. Rec. 2022, 2676, 342–359. [Google Scholar] [CrossRef]
  9. Thayalan, S.; Muthukumarasamy, S. Multifocus Object Detector for Vehicle Tracking in Smart Cities Using Spatiotemporal Attention Map. J. Appl. Remote Sens. 2023, 17, 016504. [Google Scholar] [CrossRef]
  10. Zhang, Z.; Wang, C.; Song, J.; Xu, Y. Object Tracking Based on Satellite Videos: A Literature Review. Remote Sens. 2022, 14, 3674. [Google Scholar] [CrossRef]
  11. Liu, Y.; Liao, Y.; Lin, C.; Jia, Y.; Li, Z.; Yang, X. Object Tracking in Satellite Videos Based on Correlation Filter with Multi-Feature Fusion and Motion Trajectory Compensation. Remote Sens. 2022, 14, 777. [Google Scholar] [CrossRef]
  12. He, Q.; Sun, X.; Yan, Z.; Li, B.; Fu, K. Multi-Object Tracking in Satellite Videos with Graph-Based Multitask Modeling. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–13. [Google Scholar] [CrossRef]
  13. Deng, Z.; Sun, H.; Zhou, S.; Zhao, J.; Lei, L.; Zou, H. Multi-Scale Object Detection in Remote Sensing Imagery with Convolutional Neural Networks. ISPRS J. Photogramm. Remote Sens. 2018, 145, 3–22. [Google Scholar] [CrossRef]
  14. Fu, K.; Chang, Z.; Zhang, Y.; Xu, G.; Zhang, K.; Sun, X. Rotation-Aware and Multi-Scale Convolutional Neural Network for Object Detection in Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2020, 161, 294–308. [Google Scholar] [CrossRef]
  15. Huang, Z.; Li, W.; Xia, X.-G.; Wu, X.; Cai, Z.; Tao, R. A Novel Nonlocal-Aware Pyramid and Multiscale Multitask Refinement Detector for Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–20. [Google Scholar] [CrossRef]
  16. Shivappriya, S.N.; Priyadarsini, M.J.P.; Stateczny, A.; Puttamadappa, C.; Parameshachari, B.D. Cascade Object Detection and Remote Sensing Object Detection Method Based on Trainable Activation Function. Remote Sens. 2021, 13, 200. [Google Scholar] [CrossRef]
  17. Inglada, J. Automatic Recognition of Man-Made Objects in High Resolution Optical Remote Sensing Images by SVM Classification of Geometric Image Features. ISPRS J. Photogramm. Remote Sens. 2007, 62, 236–248. [Google Scholar] [CrossRef]
  18. Lei, Z.; Fang, T.; Huo, H.; Li, D. Rotation-Invariant Object Detection of Remotely Sensed Images Based on Texton Forest and Hough Voting. IEEE Trans. Geosci. Remote Sens. 2011, 50, 1206–1217. [Google Scholar] [CrossRef]
  19. Zhang, J.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Du, Q. SuperYOLO: Super Resolution Assisted Object Detection in Multimodal Remote Sensing Imagery. IEEE Trans. Geosci. Remote Sens. 2023, 61, 1–15. [Google Scholar] [CrossRef]
  20. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 24–27 June 2014; pp. 580–587. [Google Scholar]
  21. Girshick, R. Fast R-Cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  22. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-Cnn: Towards Real-Time Object Detection with Region Proposal Networks. In Advances in Neural Information Processing Systems 28 (NIPS 2015); Curran Associates, Inc.: Dutchess County, NY, USA, 2015. [Google Scholar]
  23. Jiang, P.; Ergu, D.; Liu, F.; Cai, Y.; Ma, B. A Review of Yolo Algorithm Developments. Procedia Comput. Sci. 2022, 199, 1066–1073. [Google Scholar] [CrossRef]
  24. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single Shot Multibox Detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  25. Tian, S.; Kang, L.; Xing, X.; Li, Z.; Zhao, L.; Fan, C.; Zhang, Y. Siamese Graph Embedding Network for Object Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2020, 18, 602–606. [Google Scholar] [CrossRef]
  26. Li, Z.; Liu, Y.; Liu, J.; Yuan, Y.; Raza, A.; Huo, H.; Fang, T. Object Relationship Graph Reasoning for Object Detection of Remote Sensing Images. In Proceedings of the 2021 6th International Conference on Image, Vision and Computing (ICIVC), Qingdao, China, 23–25 July 2021; pp. 43–48. [Google Scholar]
  27. Tian, S.; Kang, L.; Xing, X.; Tian, J.; Fan, C.; Zhang, Y. A Relation-Augmented Embedded Graph Attention Network for Remote Sensing Object Detection. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–18. [Google Scholar] [CrossRef]
  28. Tian, S.; Cao, L.; Kang, L.; Xing, X.; Tian, J.; Du, K.; Sun, K.; Fan, C.; Fu, Y.; Zhang, Y. A Novel Hybrid Attention-Driven Multistream Hierarchical Graph Embedding Network for Remote Sensing Object Detection. Remote Sens. 2022, 14, 4951. [Google Scholar] [CrossRef]
  29. Zhu, Z.; Sun, X.; Diao, W.; Chen, K.; Xu, G.; Fu, K. Invariant Structure Representation for Remote Sensing Object Detection Based on Graph Modeling. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  30. Guo, W.; Yang, W.; Zhang, H.; Hua, G. Geospatial Object Detection in High Resolution Satellite Images Based on Multi-Scale Convolutional Neural Network. Remote Sens. 2018, 10, 131. [Google Scholar] [CrossRef]
  31. Goyal, V.; Singh, R.; Dhawley, M.; Kumar, A.; Sharma, S. Aerial Object Detection Using Deep Learning: A Review. In Computational Intelligence: Select Proceedings of InCITe 2022; Springer Nature: Singapore, 2023; Volume 968, pp. 81–92. [Google Scholar]
  32. Karlinsky, L.; Shtok, J.; Harary, S.; Schwartz, E.; Aides, A.; Feris, R.; Giryes, R.; Bronstein, A.M. Repmet: Representative-Based Metric Learning for Classification and Few-Shot Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5197–5206. [Google Scholar]
  33. Li, X.; Sun, Z.; Xue, J.-H.; Ma, Z. A Concise Review of Recent Few-Shot Meta-Learning Methods. Neurocomputing 2021, 456, 463–468. [Google Scholar] [CrossRef]
  34. Wu, A.; Zhao, S.; Deng, C.; Liu, W. Generalized and Discriminative Few-Shot Object Detection via SVD-Dictionary Enhancement. In Advances in Neural Information Processing Systems 34 (NeurIPS 2021); Curran Associates, Inc.: Dutchess County, NY, USA, 2021; pp. 6353–6364. [Google Scholar]
  35. Han, G.; Ma, J.; Huang, S.; Chen, L.; Chellappa, R.; Chang, S.-F. Multimodal Few-Shot Object Detection with Meta-Learning Based Cross-Modal Prompting. arXiv 2022, arXiv:2204.07841. [Google Scholar]
  36. Hendrawan, A.; Gernowo, R.; Nurhayati, O.D.; Warsito, B.; Wibowo, A. Improvement Object Detection Algorithm Based on YoloV5 with BottleneckCSP. In Proceedings of the 2022 IEEE International Conference on Communication, Networks and Satellite (COMNETSAT), Solo, Indonesia, 3–5 November 2022; pp. 79–83. [Google Scholar]
  37. Liu, Z.; Gao, Y.; Wang, L.; Du, Q. Aircraft Target Detection in Satellite Remote Sensing Images Based on Improved YOLOv. In Proceedings of the 2022 International Conference on Cyber-Physical Social Intelligence (ICCSI), Nanjing, China, 18–21 November 2022; pp. 63–68. [Google Scholar]
  38. Yao, X.; Shen, H.; Feng, X.; Cheng, G.; Han, J. R2IPoints: Pursuing Rotation-Insensitive Point Representation for Aerial Object Detection. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  39. Iftikhar, S.; Zhang, Z.; Asim, M.; Muthanna, A.; Koucheryavy, A.; Abd El-Latif, A.A. Deep Learning-Based Pedestrian Detection in Autonomous Vehicles: Substantial Issues and Challenges. Electronics 2022, 11, 3551. [Google Scholar] [CrossRef]
  40. Balasubramaniam, A.; Pasricha, S. Object Detection in Autonomous Vehicles: Status and Open Challenges. arXiv 2022, arXiv:2201.07706. [Google Scholar]
  41. Purkait, P.; Zhao, C.; Zach, C. SPP-Net: Deep Absolute Pose Regression with Synthetic Views. arXiv 2017, arXiv:1712.03452. [Google Scholar]
  42. Sun, X.; Wu, P.; Hoi, S.C. Face Detection Using Deep Learning: An Improved Faster RCNN Approach. Neurocomputing 2018, 299, 42–50. [Google Scholar] [CrossRef]
  43. Nguyen, H. Improving Faster R-CNN Framework for Fast Vehicle Detection. Math. Probl. Eng. 2019, 2019, 3808064. [Google Scholar] [CrossRef]
  44. Wu, S.; Yang, J.; Wang, X.; Li, X. IoU-Balanced Loss Functions for Single-Stage Object Detection. Pattern Recognit. Lett. 2022, 156, 96–103. [Google Scholar] [CrossRef]
  45. Zaidi, S.S.A.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.; Lee, B. A Survey of Modern Deep Learning Based Object Detection Models. Digit. Signal Process. 2022, 126, 103514. [Google Scholar] [CrossRef]
  46. Qin, L.; Shi, Y.; He, Y.; Zhang, J.; Zhang, X.; Li, Y.; Deng, T.; Yan, H. ID-YOLO: Real-Time Salient Object Detection Based on the Driver’s Fixation Region. IEEE Trans. Intell. Transp. Syst. 2022, 23, 15898–15908. [Google Scholar] [CrossRef]
  47. Luo, H.-W.; Zhang, C.-S.; Pan, F.-C.; Ju, X.-M. Contextual-YOLOV3: Implement Better Small Object Detection Based Deep Learning. In Proceedings of the 2019 International Conference on Machine Learning, Big Data and Business Intelligence (MLBDBI), Taiyuan, China, 8–10 November 2019; pp. 134–141. [Google Scholar]
  48. Wang, G.; Ding, H.; Li, B.; Nie, R.; Zhao, Y. Trident-YOLO: Improving the Precision and Speed of Mobile Device Object Detection. IET Image Process. 2022, 16, 145–157. [Google Scholar] [CrossRef]
  49. Li, Y.; Li, Z.; Ye, F.; Li, Y. A Dual-Path Multihead Feature Enhancement Detector for Oriented Object Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  50. Yin, Q.; Hu, Q.; Liu, H.; Zhang, F.; Wang, Y.; Lin, Z.; An, W.; Guo, Y. Detecting and Tracking Small and Dense Moving Objects in Satellite Videos: A Benchmark. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–18. [Google Scholar] [CrossRef]
  51. Wang, J.; Ding, J.; Guo, H.; Cheng, W.; Pan, T.; Yang, W. Mask OBB: A Semantic Attention-Based Mask Oriented Bounding Box Representation for Multi-Category Object Detection in Aerial Images. Remote Sens. 2019, 11, 2930. [Google Scholar] [CrossRef]
  52. Dong, X.; Qin, Y.; Fu, R.; Gao, Y.; Liu, S.; Ye, Y. Remote Sensing Object Detection Based on Gated Context-Aware Module. IEEE Geosci. Remote Sens. Lett. 2022, 19, 1–5. [Google Scholar] [CrossRef]
  53. Zhang, S.; He, G.; Chen, H.-B.; Jing, N.; Wang, Q. Scale Adaptive Proposal Network for Object Detection in Remote Sensing Images. IEEE Geosci. Remote Sens. Lett. 2019, 16, 864–868. [Google Scholar] [CrossRef]
  54. Ming, Q.; Miao, L.; Zhou, Z.; Dong, Y. CFC-Net: A Critical Feature Capturing Network for Arbitrary-Oriented Object Detection in Remote-Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
  55. Li, K.; Cheng, G.; Bu, S.; You, X. Rotation-Insensitive and Context-Augmented Object Detection in Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2017, 56, 2337–2348. [Google Scholar] [CrossRef]
  56. Wang, H.; Liao, Y.; Li, Y.; Fang, Y.; Ni, S.; Luo, Y.; Jiang, B. BDR-Net: Bhattacharyya Distance-Based Distribution Metric Modeling for Rotating Object Detection in Remote Sensing. IEEE Trans. Instrum. Meas. 2022, 72, 1–12. [Google Scholar] [CrossRef]
  57. Sun, X.; Wang, P.; Wang, C.; Liu, Y.; Fu, K. PBNet: Part-Based Convolutional Neural Network for Complex Composite Object Detection in Remote Sensing Imagery. ISPRS J. Photogramm. Remote Sens. 2021, 173, 50–65. [Google Scholar] [CrossRef]
  58. Li, J.; Tian, J.; Gao, P.; Li, L. Ship Detection and Fine-Grained Recognition in Large-Format Remote Sensing Images Based on Convolutional Neural Network. In Proceedings of the IGARSS 2020-2020 IEEE International Geoscience and Remote Sensing Symposium, Waikoloa, HI, USA, 26 September–2 October 2020; pp. 2859–2862. [Google Scholar]
  59. Chen, S.; Wang, H.; Mukherjee, M.; Xu, X. Collaborative Learning-Based Network for Weakly Supervised Remote Sensing Object Detection. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022. [Google Scholar] [CrossRef]
  60. Sun, Q.; Liu, Y.; Chua, T.-S.; Schiele, B. Meta-Transfer Learning for Few-Shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  61. Ding, Y.; Tian, X.; Yin, L.; Chen, X.; Liu, S.; Yang, B.; Zheng, W. Multi-Scale Relation Network for Few-Shot Learning Based on Meta-Learning. In Proceedings of the Computer Vision Systems: 12th International Conference, ICVS 2019, Thessaloniki, Greece, 23–25 September 2019; Springer: Berlin/Heidelberg, Germany, 2019; pp. 343–352. [Google Scholar]
  62. Yu, Z.; Chen, L.; Cheng, Z.; Luo, J. Transmatch: A Transfer-Learning Scheme for Semi-Supervised Few-Shot Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 12856–12864. [Google Scholar]
  63. Wang, Y.-X.; Ramanan, D.; Hebert, M. Meta-Learning to Detect Rare Objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9925–9934. [Google Scholar]
  64. Xiao, Z.; Zhong, P.; Quan, Y.; Yin, X.; Xue, W. Few-Shot Object Detection with Feature Attention Highlight Module in Remote Sensing Images. In Proceedings of the 2020 International Conference on Image, Video Processing and Artificial Intelligence, Shanghai, China, 21–23 August 2020; Volume 11584, pp. 217–223. [Google Scholar]
  65. Zhang, Z.; Hao, J.; Pan, C.; Ji, G. Oriented Feature Augmentation for Few-Shot Object Detection in Remote Sensing Images. In Proceedings of the 2021 IEEE International Conference on Computer Science, Electronic Information Engineering and Intelligent Control Technology (CEI), Fuzhou, China, 24–26 September 2021; pp. 359–366. [Google Scholar]
  66. Wang, L.; Zhang, S.; Han, Z.; Feng, Y.; Wei, J.; Mei, S. Diversity Measurement-Based Meta-Learning for Few-Shot Object Detection of Remote Sensing Images. In Proceedings of the IGARSS 2022—2022 IEEE International Geoscience and Remote Sensing Symposium, Kuala Lumpur, Malaysia, 17–22 July 2022; pp. 3087–3090. [Google Scholar]
  67. Zhang, Y.; Zhang, B.; Wang, B. Few-Shot Object Detection With Self-Adaptive Global Similarity and Two-Way Foreground Stimulator in Remote Sensing Images. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 7263–7276. [Google Scholar] [CrossRef]
  68. Zhu, D.; Guo, H.; Li, T.; Meng, Z. Fine-Tuning Faster-RCNN Tailored to Feature Reweighting for Few-Shot Object Detection. In Proceedings of the 5th International Conference on Control and Computer Vision, Xiamen, China, 19–21 August 2022; pp. 48–51. [Google Scholar]
  69. Liu, N.; Xu, X.; Celik, T.; Gan, Z.; Li, H.-C. Transformation-Invariant Network for Few-Shot Object Detection in Remote Sensing Images. arXiv 2023, arXiv:2303.06817. [Google Scholar]
  70. Li, X.; Deng, J.; Fang, Y. Few-Shot Object Detection on Remote Sensing Images. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–14. [Google Scholar] [CrossRef]
  71. Zhou, Z.; Chen, J.; Huang, Z.; Wan, H.; Chang, P.; Li, Z.; Yao, B.; Wu, B.; Sun, L.; Xing, M. FSODS: A Lightweight Metalearning Method for Few-Shot Object Detection on SAR Images. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–17. [Google Scholar] [CrossRef]
  72. Zhang, H.; Zhang, X.; Meng, G.; Guo, C.; Jiang, Z. Few-Shot Multi-Class Ship Detection in Remote Sensing Images Using Attention Feature Map and Multi-Relation Detector. Remote Sens. 2022, 14, 2790. [Google Scholar] [CrossRef]
  73. Liu, S.; Ma, A.; Pan, S.; Zhong, Y. An Effective Task Sampling Strategy Based on Category Generation for Fine-Grained Few-Shot Object Recognition. Remote Sens. 2023, 15, 1552. [Google Scholar] [CrossRef]
  74. Hou, K.; Wang, H.; Li, J. Few-Shot Object Detection Model Based on Transfer Learning and Convolutional Neural Network. Preprint 2022. [Google Scholar] [CrossRef]
  75. Zhou, Z.; Li, S.; Guo, W.; Gu, Y. Few-Shot Aircraft Detection in Satellite Videos Based on Feature Scale Selection Pyramid and Proposal Contrastive Learning. Remote Sens. 2022, 14, 4581. [Google Scholar] [CrossRef]
  76. Zhao, Z.; Liu, Q.; Wang, Y. Exploring Effective Knowledge Transfer for Few-Shot Object Detection. In Proceedings of the 30th ACM International Conference on Multimedia, Lisboa, Portugal, 10–14 October 2022; pp. 6831–6839. [Google Scholar]
  77. Yang, Z.; Zhang, C.; Li, R.; Xu, Y.; Lin, G. Efficient Few-Shot Object Detection via Knowledge Inheritance. IEEE Trans. Image Process. 2022, 32, 321–334. [Google Scholar] [CrossRef]
  78. Kim, N.; Jang, D.; Lee, S.; Kim, B.; Kim, D.-S. Unsupervised Image Denoising with Frequency Domain Knowledge. arXiv 2021, arXiv:2111.14362. [Google Scholar]
  79. Han, K.; Wang, Y.; Xu, C.; Guo, J.; Xu, C.; Wu, E.; Tian, Q. GhostNets on Heterogeneous Devices via Cheap Operations. Int. J. Comput. Vis. 2022, 130, 1050–1069. [Google Scholar] [CrossRef]
  80. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More Features from Cheap Operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Virtual, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
  81. Li, Y.; Gupta, A. Beyond Grids: Learning Graph Representations for Visual Recognition. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018); Curran Associates, Inc.: Dutchess County, NY, USA, 2018; pp. 9225–9235. [Google Scholar]
  82. Ahmed, M.; Seraj, R.; Islam, S.M.S. The K-Means Algorithm: A Comprehensive Survey and Performance Evaluation. Electronics 2020, 9, 1295. [Google Scholar] [CrossRef]
  83. Xiao, B.; Xu, B.; Bi, X.; Li, W. Global-Feature Encoding U-Net (GEU-Net) for Multi-Focus Image Fusion. IEEE Trans. Image Process. 2020, 30, 163–175. [Google Scholar] [CrossRef]
  84. Li, C.; Zhou, A.; Yao, A. Omni-Dimensional Dynamic Convolution. arXiv 2022, arXiv:2209.07947. [Google Scholar]
  85. Chen, H.; Jiang, D.; Sahli, H. Transformer Encoder with Multi-Modal Multi-Head Attention for Continuous Affect Recognition. IEEE Trans. Multimed. 2020, 23, 4171–4183. [Google Scholar] [CrossRef]
  86. Zhu, H.; Lee, K.A.; Li, H. Serialized Multi-Layer Multi-Head Attention for Neural Speaker Embedding. arXiv 2021, arXiv:2107.06493. [Google Scholar]
  87. Gevorgyan, Z. SIoU Loss: More Powerful Learning for Bounding Box Regression. arXiv 2022, arXiv:2205.12740. [Google Scholar]
  88. Cheng, G.; Han, J.; Zhou, P.; Guo, L. Multi-Class Geospatial Object Detection and Geographic Image Classification Based on Collection of Part Detectors. ISPRS J. Photogramm. Remote Sens. 2014, 98, 119–132. [Google Scholar] [CrossRef]
  89. Jocher, G.; Stoken, A.; Borovec, J.; Changyu, L.; Hogan, A.; Diaconu, L.; Ingham, F.; Poznanski, J.; Fang, J.; Yu, L.U. YOLOv5: v3.0. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 1 June 2023).
  90. Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; Darrell, T. Few-Shot Object Detection via Feature Reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8420–8429. [Google Scholar]
  91. Wang, X.; Huang, T.E.; Darrell, T.; Gonzalez, J.E.; Yu, F. Frustratingly Simple Few-Shot Object Detection. arXiv 2020, arXiv:2003.06957. [Google Scholar]
  92. Zhao, Z.; Tang, P.; Zhao, L.; Zhang, Z. Few-Shot Object Detection of Remote Sensing Images via Two-Stage Fine-Tuning. IEEE Geosci. Remote Sens. Lett. 2021, 19, 1–5. [Google Scholar] [CrossRef]
  93. Zhang, T.; Zhang, X.; Zhu, P.; Jia, X.; Tang, X.; Jiao, L. Generalized Few-Shot Object Detection in Remote Sensing Images. ISPRS J. Photogramm. Remote Sens. 2023, 195, 353–364. [Google Scholar] [CrossRef]
  94. Wang, Y.; Xu, C.; Liu, C.; Li, Z. Context Information Refinement for Few-Shot Object Detection in Remote Sensing Images. Remote Sens. 2022, 14, 3255. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
