Article

CRTED: Few-Shot Object Detection via Correlation-RPN and Transformer Encoder–Decoder

by Jinlong Chen 1, Kejian Xu 1,*, Yi Ning 2, Lianyuan Jiang 1 and Zhi Xu 1,*
1 School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541000, China
2 School of Continuing Education, Guilin University of Electronic Technology, Guilin 541000, China
* Authors to whom correspondence should be addressed.
Electronics 2024, 13(10), 1856; https://doi.org/10.3390/electronics13101856
Submission received: 22 March 2024 / Revised: 7 May 2024 / Accepted: 8 May 2024 / Published: 10 May 2024

Abstract: Few-shot object detection (FSOD) aims to address the challenge that conventional object detection requires a substantial number of annotations for training, which is very labor-intensive. However, existing few-shot methods either achieve high precision at the cost of time-consuming, exhaustive fine-tuning or adapt poorly to novel classes. We presume the major reason is that the valuable correlation feature among different categories is insufficiently exploited, hindering the generalization of knowledge from base to novel categories for object detection. In this paper, we propose few-shot object detection via Correlation-RPN and transformer encoder–decoder (CRTED), a novel training network that learns object-relevant features of inter-class correlation and intra-class compactness while suppressing object-agnostic features in the background with limited annotated samples. We also introduce a four-way tuple-contrast training strategy to positively activate the training progress of our object detector. Experiments on two few-shot benchmarks (Pascal VOC, MS COCO) demonstrate that our proposed CRTED, without further fine-tuning, can achieve performance comparable to current state-of-the-art fine-tuned works. The codes and pre-trained models will be released.

1. Introduction

In recent years, object detection [1,2,3,4,5,6] has seen remarkable advances through deep neural models and large-scale training. Nevertheless, conventional object detection techniques typically depend on vast amounts of high-quality annotated data and require long training times, which has sparked the recent pursuit of few-shot object detection. The challenge in few-shot learning lies in the significant diversity of real-world objects, and despite noteworthy advances, existing methods [7,8,9,10] primarily focus on image classification and seldom delve into the complexities of few-shot object detection. This might be due to the non-trivial nature of transferring knowledge from few-shot classification to few-shot object detection.
The core difficulty in object detection with limited examples lies in pinpointing unseen objects against a cluttered background. This is essentially a general problem of locating objects from a few annotated examples in new categories. Potential bounding boxes often overlook unseen objects or generate numerous false detections in the background. We argue that this is due to the sub-optimal scoring of promising bounding boxes by a region proposal network (RPN), making it challenging to detect novel objects. This distinction underscores the inherent difference between few-shot classification and object detection. Additionally, recent efforts in few-shot object detection [11,12,13] necessitate fine-tuning, preventing their direct application to novel categories.
In this paper, we attempt to address the problem mentioned above in few-shot object detection. First, we propose a novel network structure named Correlation-RPN, built on a general RPN, to encourage the model to attend to object-relevant regions and to learn the matching correlation between query and support image features, generalizing the knowledge learned from base classes to novel classes. Second, we integrally migrate the transformer encoder–decoder into our framework: with a new feature coding mechanism, the decoder produces the correlational metric of feature representations after feature extraction in the backbone of our network. Finally, we introduce a four-way tuple-contrast training strategy to positively activate the training progress of our object detector.
The main contributions of this work include the following:
We propose a novel correlation-aware region proposal network structure called Correlation-RPN and migrate it to object detectors, improving detectors’ capacity of object localization and generalization;
We design a new feature coding mechanism and integrally migrate the encoder–decoder of the transformer into our model to effectively learn the support–query feature similarity representation;
With our presented four-way tuple-contrast training strategy, CRTED without further fine-tuning achieves performance comparable to most representative methods for few-shot object detection.

2. Related Work

2.1. General Object Detection

Object detection remains a key topic in computer vision, particularly with the rise of deep learning. CNN-based methods, typically pre-trained on large datasets, have gained popularity. These methods fall into two categories: proposal-based and proposal-free detectors. The RCNN series [14,15,16] belongs to the former, relying on pre-trained CNNs to classify region proposals produced by selective search. SPP-Net [17] and Fast-RCNN [15] evolved from RCNN, extracting regional features from convolutional maps via an RoI pooling layer. Faster-RCNN [16] introduced a region proposal network (RPN) to enhance proposal quality. In contrast, YOLO [3,18,19,20] pioneered the proposal-free approach, using a single CNN for classification and bounding box prediction. Later works refined YOLO with default anchors for shape adjustment or multi-scale training. Proposal-free methods are simpler and faster but still rely heavily on annotated samples, limiting their performance in few-shot scenarios.

2.2. Few-Shot Object Detection

The challenging few-shot object detection (FSOD) problem aims to detect objects of novel classes at the instance level with limited annotations. Prior works on few-shot object detection can be mainly categorized into three paradigms: meta-learning, transfer learning and metric learning approaches. Meta-learning methods devise a periodic, stage-wise meta-training paradigm to train a class-agnostic meta-model that helps transfer knowledge from base classes to novel classes with few annotated labels, known as "learning to learn". Attention-RPN [21] and Meta R-CNN [13] are proposed to generate class-relevant proposals while improving instance alignment. Transfer learning-based methods, also known as fine-tuning-based methods, first train the model on the base classes and then fine-tune either the whole model or only its last layers on a balanced set including both base classes and novel classes. MPSR [22] adopts a manually defined positive sample refinement branch to mitigate the object scale scarcity issue in few-shot object detection. In particular, TFA [23] pre-trains a base detector on abundant samples from a base set and fine-tunes it for novel classes. Metric learning approaches focus on learning good embedding spaces or appropriate metrics that facilitate downstream tasks, including cosine similarity [24], Euclidean distance to the class center, graph distances and so on. RepMet [25] achieves strong results for few-shot object detection by simultaneously learning the parameters of the backbone network, the embedding space and the multimodal distribution of each training category within it in an end-to-end manner. In [21], the authors exploit the similarity metric between the support set and query set in a few-shot setup to detect novel objects and suppress false detections in the background.

2.3. Transformer Encoder–Decoder

The transformer architecture, initially designed for machine translation, has been widely applied to various computer vision tasks. One notable example is DETR, a representative transformer-based object detector. These models employ a transformer encoder–decoder to understand and learn the relationships between the global image context and objects, taking CNN features as input and producing final predictions. As a variant of DETR, ViDT [26] introduced a pre-trained transformer to replace the CNN backbone while keeping a randomly initialized transformer neck. More recently, ViTDet [27] and MIMDet [28] have attempted to capitalize on the powerful network architectures pre-trained by MAE for object detection tasks. However, ViTDet only utilizes the pre-trained MAE encoder and discards the pre-trained decoder. In contrast, MIMDet retains the entire encoder–decoder for feature extraction, leveraging the reconstruction capability of the MAE decoder on masked input image patches to reduce additional inference costs. Unlike these approaches, imTED [29] employs a fully pre-trained transformer encoder–decoder that not only extracts features but also performs representation transformation, offering a comprehensive utilization of the transformer's capabilities and enhancing object detection performance.

3. Approach

In this section, we walk through the architecture design of our proposed CRTED step by step. The structure of CRTED is shown in Figure 1. We start with preliminaries on the few-shot object detection setup that motivate our method, then present our network architecture for few-shot object detection in detail, and finally describe the learning procedure of CRTED.

3.1. Preliminaries

  • Problem Definition In this work, we focus on the task of few-shot object detection. Given two sets of classes, we define a base set C_base and a novel set C_novel, where C_base ∩ C_novel = ∅. We define two datasets: a base dataset D_base with sufficient annotated objects of C_base and a novel dataset D_novel with few annotated objects of C_novel. A few-shot object detector aims at classifying and localizing objects of C_base ∪ C_novel by learning from D_base ∪ D_novel. In a task of N_n-way K-shot object detection with N_n = |C_novel|, there are exactly K annotated instances for each novel class in D_novel. The goal of this work is to train a model that can detect novel classes in C_novel given only K labeled samples per class of C_novel and abundant images from C_base. Images from C_base are split into a support image set S_b, containing support images s_c with a close-up of the target object, and a query image set Q_b, containing query images q_c that potentially contain objects belonging to the support class. Given all support images S_b, our model learns to detect objects in Q_b. For convenience, we denote C_base, C_novel, D_base and D_novel as C_b, C_n, D_b and D_n in the following sections.
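To make the N_n-way K-shot setup concrete, here is a minimal sketch of how a K-shot episode could be drawn from an annotation index; the dictionary layout and helper names (sample_episode, annotations) are illustrative assumptions rather than the authors' released code.

```python
import random
from collections import defaultdict

def sample_episode(annotations, novel_classes, k_shot, n_query=1, seed=None):
    """Sample one N_n-way K-shot episode.

    annotations:   dict mapping class name -> list of image ids containing that class
    novel_classes: list of class names forming C_n
    Returns a support dict {class: [K image ids]} and a flat list of query images.
    """
    rng = random.Random(seed)
    support, query = defaultdict(list), []
    for c in novel_classes:
        picked = rng.sample(annotations[c], k_shot + n_query)  # disjoint support/query images
        support[c] = picked[:k_shot]                           # exactly K annotated shots per class
        query.extend(picked[k_shot:])                          # query images to be detected
    return dict(support), query

# e.g., a 5-way 3-shot episode over the classes of novel split set 2:
# support, query = sample_episode(ann_index, ["bus", "horse", "motorbike", "cow", "sofa"], k_shot=3)
```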
  • Rethink Region-Based Object Detectors The majority of current few-shot object detection methods rely heavily on the Faster R-CNN framework [16], which leverages a region proposal network to generate potentially relevant bounding boxes to facilitate subsequent detection tasks. The RPN plays a pivotal role, as it must not only distinguish between objects and the background but also filter out negative objects belonging to non-matching support categories. However, under the few-shot detection setting, where support image information is extremely limited, the RPN often struggles. It tends to indiscriminately focus on every potential object with a high objectness score, regardless of whether they belong to the support category or not. This behavior can hamper the generalization of knowledge from base classes to novel classes, and it also places a significant burden on the subsequent classification task of the detector, as it has to deal with a large number of irrelevant objects. Previous studies [1,3,4,16,21,30] have attempted to address this challenge by generating more accurate region proposals. Nevertheless, the issue persists, stemming from the inherent limitations of region-based object detection frameworks within the few-shot learning context. To truly address this challenge, it is essential to develop novel strategies that can effectively leverage the limited support image information, enhancing the discriminative capabilities of the RPN and ensuring that it focuses only on relevant objects, thus improving the overall performance of few-shot object detection systems.
  • Rethink Transformer-Based Detection Frameworks The transformer [31] emerged as a revolutionary self-attention-based building block originally tailored for machine translation. This architecture revolutionizes the way sequences are processed, updating each element by scanning through the entire sequence and then aggregating information from it. Seeking to harness the potential of the transformer, DETR [1] integrated a transformer encoder–decoder architecture into an object detector, enabling the system to attend to multiple support classes within a single forward pass. Nevertheless, a notable issue persists: the vision transformers employed in DETR-style detectors are randomly initialized, limiting them to processing feature representations extracted by the backbone network. This constraint underscores the need for further advances to fully unlock the potential of vision transformers in object detection.

3.2. Architecture

  • Correlation-Aware Region Proposal Network In generic object detection, an RPN provides region proposals and generates object-relevant anchors, but it suffers a performance drop when used in few-shot object detection, due to low-quality region proposals for novel classes and its inability to capture inter-class correlation among different classes. Taking inspiration from the success of the RPN-based FSOD framework [21], we propose a novel network structure based on a general RPN which learns the matching correlation between the support set S_b and queries Q_b. Figure 2 shows the overall architecture of our proposed Correlation-RPN. Correlation-RPN can use the support information to be sensitively aware of the similarities and differences between S_b and Q_b, and it can provide high-quality region proposals for objects of target or novel classes while relatively suppressing proposals in the background or in other categories.
Figure 2. The overall view of the Correlation-RPN design. S_b: support set; Q_b: queries; X: support feature map; Y: query feature map. F_s and F_q denote the support–query feature extractor. F denotes the similarity of the support–query feature map, which includes a 1 × 1 × C and a 3 × 3 × C vector. The computed correlation map is then fed into the RPN to generate proposals.
Specifically, we compute the correlational metric between the feature maps of S_b and Q_b in a depth-wise manner. The similarity map is then utilized to build the region proposal generation. In particular, we denote the support features of S_b as X ∈ ℝ^{H×W×C} and the feature map of the query q_c as Y ∈ ℝ^{H×W×C}; the similarity is defined as follows:
$$ F_{h,w,c} = \sum_{i,j} \alpha X_{i,j,c} \cdot \beta Y_{h+i-1,\, w+j-1,\, c}, \qquad i, j \in \{1, \dots, K\} \tag{1} $$
where F is the resultant correlation feature map, and α, β are control coefficients that prevent overly favoring features of either side. Here, X is used as the kernel that slides over the query feature map [32,33] in a depth-wise cross-correlation manner [34]. Our work adopts the top architecture of the attention RPN [21]. We empirically find that a kernel size of K = 1 performs well in our case, since a global feature can provide a good object prior for objectness classification, consistent with [16]. In our case, the kernel is computed by averaging over the support feature map X. The correlation map is processed in parallel by a 1 × 1 convolution and a 3 × 3 convolution, followed by the objectness branch and the regression branch. Correlation-RPN is trained jointly with the network and elaborated in Section 4.3.
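As an illustration of Equation (1) and the K = 1 averaged support kernel described above, the following is a minimal PyTorch sketch of the depth-wise cross-correlation; the coefficients alpha and beta and the tensor shapes are assumptions for exposition, not the released implementation.

```python
import torch
import torch.nn.functional as F

def correlation_map(support_feat, query_feat, alpha=1.0, beta=1.0):
    """Depth-wise cross-correlation between support and query features.

    support_feat: (B, C, Hs, Ws) support feature map X
    query_feat:   (B, C, Hq, Wq) query feature map Y
    Returns a (B, C, Hq, Wq) correlation map F (kernel size K = 1).
    """
    B, C, _, _ = support_feat.shape
    # K = 1 kernel: global average of the support feature map, one kernel per channel
    kernel = support_feat.mean(dim=(2, 3)).view(B * C, 1, 1, 1)         # (B*C, 1, 1, 1)
    q = (beta * query_feat).view(1, B * C, *query_feat.shape[-2:])      # fold batch into channels
    # groups = B*C performs the correlation channel by channel (depth-wise)
    corr = F.conv2d(q, alpha * kernel, groups=B * C)
    return corr.view(B, C, *query_feat.shape[-2:])

# the resulting map would then pass through the parallel 1x1 and 3x3 convolutions
# of Correlation-RPN before the objectness and regression branches
```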
  • Feature Metric Matching Based on the ideas of [1,29], we integrally migrate the transformer encoder–decoder as the pillar of the correlational metric aggregation module into our object detector. Feature metric matching is accomplished in the transformer encoder by the multi-head attention mechanism. Specifically, given the support features of S_b, denoted as F_s ∈ ℝ^{H×d}, and the query image q_c, whose feature map is denoted as F_q ∈ ℝ^{H×W×d}, the matching coefficients M can be obtained as follows:
$$ M_{HW,d,c} = \mathrm{Match}(F_s, F_q) = \mathrm{Softmax}\!\left( (F_s)_{d,c} \cdot S \, (F_q)_{HW,d} \cdot S^{\top} \right)_{d,c}, \qquad c \in \{1, \dots, C\} \tag{2} $$
    where HW is the feature spatial size, d is the feature dimensionality, C is the number of support categories and S is a cosine similarity shared by F_s and F_q to ensure they are embedded into the same linear feature projection space. The cosine similarity, used as the correlational metric for each pair of feature representations from F_s and F_q, is calculated as follows:
$$ \cos(f_s, f_q) = \frac{1}{C} \sum_{i}^{C} \frac{f_s \cdot f_q}{|f_s| \cdot |f_q|}, \qquad i \in \{1, \dots, C\} \tag{3} $$
    where f_s and f_q denote single feature representations from F_s and F_q, respectively. Finally, for each q_c that may contain multiple complex instances, we ensure that the same m potential support cases are chosen for every one of these instances. The average correlation score over these m potential support objects can therefore be regarded as the similarity, or shared feature representation, between q_c and the m potential supports in S_b, from which we prefer the s_c containing the most similar support instance as the strongest support patch during training. This process has been experimentally demonstrated to be helpful for our training. The effectiveness of support–query feature similarity metric mining, i.e., distinguishing support objects similar to the query, is discussed in Section 4.3.
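The sketch below shows one way the matching coefficients of Equation (2) and the cosine metric of Equation (3) could be realized; the single shared nn.Linear projection standing in for S and the flattened feature shapes are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureMetricMatching(nn.Module):
    """Match flattened query features against per-class support prototypes."""

    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)  # shared projection so F_s and F_q live in one space

    def forward(self, support_feat, query_feat):
        # support_feat: (C, d) one prototype per support class
        # query_feat:   (HW, d) flattened query feature map
        s = F.normalize(self.proj(support_feat), dim=-1)   # unit norm -> dot product = cosine
        q = F.normalize(self.proj(query_feat), dim=-1)
        sim = q @ s.t()                                    # (HW, C) cosine similarities
        match = sim.softmax(dim=-1)                        # matching coefficients M
        return match, sim

# m, sim = FeatureMetricMatching(256)(torch.randn(5, 256), torch.randn(38 * 50, 256))
```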
  • Encoding Matching To achieve class-agnostic object prediction, we propose to use a carefully crafted set of predefined task encodings, which serve as a bridge between the given support classes and an abstract task encoding space. By mapping the support classes to these encodings, we ensure that the final object predictions are constrained within the task encoding space rather than being limited to predicting specific classes at the surface level. Drawing inspiration from the positional encodings employed in the transformer architecture, we implement task encodings T ∈ ℝ^{H×d} using sinusoidal functions. This allows us to capture both local and global patterns within the task encoding space, enhancing the representational power of our approach.
Furthermore, encoding matching and feature metric matching share the same matching coefficients. This ensures consistency across the different matching processes and simplifies the overall pipeline. The matched encodings Q_E are obtained through a straightforward process, further streamlining the prediction framework:
$$ Q_E = M \otimes T, \tag{4} $$
where ⊗ denotes sinusoidal functional multiplication. In essence, our approach offers a more flexible and generalizable framework for object prediction, enabling us to transcend the limitations of traditional class-specific prediction methods and move towards a more abstract and powerful representation of objects.
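As a rough illustration of the task-encoding idea, the sketch below builds sinusoidal task encodings in the spirit of transformer positional encodings and combines them with the matching coefficients from Equation (2); reading ⊗ in Equation (4) as a coefficient-weighted aggregation (a matrix product of M with T) is our assumption.

```python
import math
import torch

def sinusoidal_task_encodings(num_classes, dim):
    """One fixed sinusoidal encoding per support class, shape (num_classes, dim)."""
    assert dim % 2 == 0, "dim assumed even for the sin/cos interleaving"
    pos = torch.arange(num_classes, dtype=torch.float32).unsqueeze(1)        # (C, 1)
    div = torch.exp(torch.arange(0, dim, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / dim))                            # (dim/2,)
    enc = torch.zeros(num_classes, dim)
    enc[:, 0::2] = torch.sin(pos * div)
    enc[:, 1::2] = torch.cos(pos * div)
    return enc

def matched_encodings(match, task_enc):
    """Aggregate task encodings with the matching coefficients M (one reading of Q_E = M (x) T)."""
    # match: (HW, C) matching coefficients, task_enc: (C, d) -> matched encodings (HW, d)
    return match @ task_enc
```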
  • Modeling Background for Object Prediction Generally, under a few-shot object detection setup, the background does not belong to any target class and usually occupies a large portion of a support or query image. Images in which objects account for only a small proportion of the area while most of the area is complex background, which we call hard samples, are shown in Figure 3. For this reason, we propose a learnable prototype BG-P and a corresponding task encoding BG-E (fixed to zeros) to explicitly model the background class. This significantly reduces matching ambiguity when a query is very hard to match to any of the given support classes. We additionally introduce background suppression (BS) regularization as an auxiliary branch to help address this problem, which will be described in detail in the next section. The final output of the feature metric matching module is obtained via the following equation:
$$ Q_F = \tau\!\left( M \cdot \sigma(F_s),\ \mathrm{BS}(F_q) \right), \tag{5} $$
    where τ(·) denotes the Hadamard product, BS(·) denotes the background suppression operation and σ(·) denotes the sigmoid function. By applying the matching coefficients M, we filter out features not matched to S_b, producing a feature map Q_F, as in Equation (5), that inhibits the negative impact of hard samples and highlights class-related objects from the query set Q_b for each individual support class.
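The following sketch indicates how a learnable background prototype BG-P could be appended to the support prototypes and how the gated aggregation of Equation (5) might look; treating τ as a Hadamard product of M·σ(F_s) with a background-suppressed query feature, and implementing BS(·) as a simple soft mask, are assumptions.

```python
import torch
import torch.nn as nn

class BackgroundAwareAggregation(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.bg_proto = nn.Parameter(torch.zeros(1, dim))   # learnable BG-P prototype
        nn.init.normal_(self.bg_proto, std=0.02)

    def forward(self, match, support_feat, query_feat, bg_mask):
        # match:        (HW, C+1) coefficients including the background column
        # support_feat: (C, d) class prototypes; query_feat: (HW, d); bg_mask: (HW, 1) in [0, 1]
        protos = torch.cat([support_feat, self.bg_proto], dim=0)     # (C+1, d)
        class_term = match @ torch.sigmoid(protos)                   # M . sigma(F_s)
        suppressed_query = query_feat * (1.0 - bg_mask)              # BS(F_q): damp background
        return class_term * suppressed_query                         # Hadamard product -> Q_F

# the corresponding task encoding BG-E for the background column is kept fixed at zeros
```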

3.3. Training Procedure

  • Two-Stage Training Strategy Our training procedure consists of two stages: a base-class training stage on samples from C_b (C_train = C_b), followed by a K-shot few-shot fine-tuning stage on a balanced set of samples from both C_b and C_n (C_train = C_b ∪ C_n). More precisely, in the second stage, where only K labeled samples are available for each class in C_n, K samples are also randomly selected for each class in C_b to balance the training iterations between C_b and C_n.
Generally, a naive training strategy is to match objects of the same class by constructing a training pair p̃_c(q_c, s_c), where q_c and s_c both belong to the same c-th class. However, a powerful model should not only perform query–support feature similarity mining but also capture the inter-class correlation among different categories. For this reason, according to the different matching results in Figure 4, we present a novel four-way tuple-contrast training strategy to match the same category while distinguishing different categories. We randomly choose a query image q_c, a support image s_c and a hard sample s_h containing the same c-th category object, and one other support image s_n containing a different n-th category object, to construct the training tuple p̃_t(q_c, s_c, s_h, s_n), where c ≠ n. In the training tuple p̃_t(q_c, s_c, s_h, s_n), only the objects of the c-th category in q_c are annotated as foreground, while all other objects are neglected and treated as background.
During training, our model learns to match every proposal generated by Correlation-RPN in q_c with the object of s_c. Thus, the model needs to not only match same-category objects from p̃_c(q_c, s_c) and p̃_h(s_c, s_h) but also distinguish objects of different categories from p̃_n(q_c, s_n). Nevertheless, there is a massive number of background proposals, especially from s_h, which usually dominate the training. For this reason, we adjust these training pairs p̃ to balance the ratio of proposals between queries and supports. The ratio of p̃ is kept as 2:1:1 for p̃_c(q_c, s_c), p̃_h(s_c, s_h) and p̃_n(q_c, s_n). According to their matching scores, we pick all N p̃_c(q_c, s_c) and select the top 2N p̃_h(s_c, s_h) and top N p̃_n(q_c, s_n), respectively, and calculate the matching loss on the selected training pairs.
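A minimal sketch of the four-way tuple-contrast sampling and the score-based pair selection described above; the per-class image lists and the per-pair score fields are illustrative data structures, not the authors' pipeline.

```python
import random

def build_tuple(images_by_class, c, n, rng=random):
    """Construct one training tuple p_t = (q_c, s_c, s_h, s_n) with c != n."""
    assert c != n
    q_c, s_c = rng.sample(images_by_class[c]["easy"], 2)   # query and positive support, class c
    s_h = rng.choice(images_by_class[c]["hard"])            # hard sample of the same class
    s_n = rng.choice(images_by_class[n]["easy"])            # support image of a different class n
    return q_c, s_c, s_h, s_n

def select_training_pairs(scored_pc, scored_ph, scored_pn, n_pos):
    """Keep all N query-support pairs, the top-2N support-hard pairs and the top-N query-novel pairs."""
    def top(pairs, k):
        return sorted(pairs, key=lambda p: p["score"], reverse=True)[:k]
    return scored_pc[:n_pos] + top(scored_ph, 2 * n_pos) + top(scored_pn, n_pos)
```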
  • Detection Loss Function During training, we use a multi-task loss on each sampled proposal. It is worth mentioning that we use different loss functions when optimizing the network in the two stages. In the first stage, for each bounding box B, we predict a 2D classification vector p ∈ [0, 1]^2 representing the probabilities of the target object and the background, respectively. Inspired by [35,36,37], for a mini-batch of N RoI box features {(r^i_{x,y}, u^i_{x,y}, t^i_{x,y})}^N_{i=1}, our first-stage loss L_first is defined as follows, with considerations tailored for detection:
$$ \mathcal{L}_{first} = \frac{1}{N} \sum_{i}^{N} f\!\left(u^{i}_{x,y}, p^{i}\right) \cdot \mathcal{L}\!\left(r^{i}_{x,y}\right) \tag{6} $$
$$ \mathcal{L}\!\left(r^{i}_{x,y}\right) = \frac{-1}{N_{t_i} - 1} \sum_{j=1,\, j \neq i}^{N} \mathbb{1}\{t_i = t_j\} \cdot \log \frac{\exp(\tilde{r}_i \cdot \tilde{r}_j / \vartheta)}{\sum_{k=1}^{N} \mathbb{1}\{k \neq i\} \cdot \exp(\tilde{r}_i \cdot \tilde{r}_k / \vartheta)}, \tag{7} $$
    where x, y denote locations, and ϑ is the temperature hyper-parameter as in [36]. r^i_{x,y} refers to the encoded RoI feature of the detector head for the i-th region proposal generated by Correlation-RPN, u^i_{x,y} denotes the IoU score of r^i_{x,y} with a matched ground truth bounding box B and t^i_{x,y} denotes the ground truth annotation.
The second-stage output also includes the vector p ∈ [0, 1]^2 for distinguishing between background and target object classes. Different from the first stage, following the parameterization in [14], we use a regression vector t = (t_x, t_y, t_w, t_h) to specify a scale-invariant translation and a log-space height/width shift relative to a region proposal. In the second stage, we adopt binary cross-entropy (BCE) loss for classification and smooth L1 loss for regression. In combination,
$$ \mathcal{L}_{second} = \frac{\lambda_{cls}}{N} \sum_{i}^{N} \mathrm{BCE}\!\left(c^{i}, p^{i}\right) + \frac{1}{N} \sum_{i}^{N} \mathbb{1}\{c^{i} = 1\} \cdot \mathcal{L}_{smooth}\!\left(t^{i}, u^{i}_{x,y}\right), \tag{8} $$
where c refers to the class label of the target object, and λ_cls denotes a balancing factor, which we empirically set to 2.
Our total loss function is the combination of the first (Equations (6) and (7)) and second stage loss (Equation (8)):
$$ \mathcal{L}_{total} = \mathcal{L}_{first} + \rho \cdot \mathcal{L}_{second}, \tag{9} $$
where ρ = 3 is a balancing factor for the second-stage loss.
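A compact sketch of the two-stage loss in Equations (6)–(9): a supervised-contrastive term over RoI embeddings for the first stage, and BCE classification plus smooth L1 regression for the second. The IoU re-weighting f(·) is simplified to an indicator-times-IoU factor here, and the tensor shapes are assumptions; the hyper-parameters follow the values quoted in the text (λ_cls = 2, ρ = 3).

```python
import torch
import torch.nn.functional as F

def first_stage_loss(roi_feats, labels, ious, temperature=0.1, iou_thresh=0.75):
    """Supervised contrastive loss over RoI embeddings, re-weighted by IoU (Eqs. 6-7)."""
    z = F.normalize(roi_feats, dim=1)                            # (N, d) unit-norm RoI features
    logits = z @ z.t() / temperature                             # pairwise r_i . r_j / theta
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()   # 1 where t_i == t_j
    eye = torch.eye(len(labels), device=z.device)
    # log-softmax over the other proposals (denominator excludes k == i)
    log_prob = logits - torch.logsumexp(
        logits.masked_fill(eye.bool(), float("-inf")), dim=1, keepdim=True)
    pos = same - eye                                             # positives, excluding self
    per_roi = -(pos * log_prob).sum(dim=1) / pos.sum(dim=1).clamp(min=1)
    weight = (ious >= iou_thresh).float() * ious                 # simplified f(u, p)
    return (weight * per_roi).mean()

def second_stage_loss(cls_logits, cls_targets, box_deltas, box_targets, fg_mask, lambda_cls=2.0):
    """BCE classification + smooth L1 regression on foreground proposals (Eq. 8)."""
    cls = F.binary_cross_entropy_with_logits(cls_logits, cls_targets)
    reg = (F.smooth_l1_loss(box_deltas[fg_mask], box_targets[fg_mask])
           if fg_mask.any() else box_deltas.sum() * 0.0)
    return lambda_cls * cls + reg

def total_loss(l_first, l_second, rho=3.0):
    """Combine both stages as in Equation (9)."""
    return l_first + rho * l_second
```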
  • Background Suppression (BS) Regularization In our proposed CRTED, feature metric matching is built on the encoder architecture of the transformer through the multi-head attention mechanism. This design moderates the training stress for objects of various sizes, but it may still disturb the detector's localization performance on hard samples, especially in the few-shot condition. For this reason, we propose a novel background suppression (BS) regularization that utilizes object knowledge in the domain of ground truth bounding boxes for each training pair p̃_c(q_c, s_c). Specifically, for q_c in p̃_c(q_c, s_c), we first obtain the middle-level feature F_q of the target domain generated by Correlation-RPN. Then, we adopt a masking method that maps the ground truth labels of target objects in the image s_c to the convolutional cube. Consequently, we can identify the feature regions corresponding to the background, denoted R_BS. To minimize the adverse effects of background disturbances, we use L2 regularization to penalize the activation of R_BS:
$$ \mathcal{L}_{BS} = \mathrm{BS}(F_q) = \lVert R_{BS} \rVert_{2} \tag{10} $$
    With this L_BS, CRTED can suppress regions of no interest while paying more attention to the regions we care about, which is especially important for training in few-shot learning. More details and visualization results are shown in Section 4.3.
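A small sketch of the BS regularization term in Equation (10): ground-truth boxes are rasterized into a foreground mask on the feature grid and the L2 norm of the remaining background activations R_BS is penalized; the box format and the stride handling are assumptions.

```python
import torch

def bs_regularization(query_feat, gt_boxes, stride):
    """L2 penalty on background activations of the query feature map (Eq. 10).

    query_feat: (C, H, W) middle-level feature F_q from Correlation-RPN
    gt_boxes:   (M, 4) ground truth boxes in image coordinates (x1, y1, x2, y2)
    stride:     downsampling factor between the image and the feature map
    """
    _, H, W = query_feat.shape
    fg = torch.zeros(H, W, dtype=torch.bool, device=query_feat.device)
    for x1, y1, x2, y2 in (gt_boxes / stride).round().long():
        fg[y1.clamp(0, H):y2.clamp(0, H), x1.clamp(0, W):x2.clamp(0, W)] = True
    r_bs = query_feat * (~fg).float()      # keep only the background responses R_BS
    return r_bs.norm(p=2)                  # ||R_BS||_2
```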
  • Proposal Consistency Control One difference between image classification and object detection is that the former extracts semantics from the entire image, while the classification signals for the latter come from region proposals. We adopt a fixed IoU threshold T_iou to ensure the consistency of proposals, considering that low-IoU proposals may deviate excessively from the center of regressed objects and therefore include irrelevant semantic information. In the following formula, f(·) is responsible for controlling the consistency of proposals and is defined with the proposal consistency threshold φ:
$$ T_{iou} = f\!\left(u^{i}_{x,y}\right) = \frac{1}{N} \sum_{i}^{N} \mathbb{1}\{u^{i} \geq \varphi\} \cdot r\!\left(u^{i}_{x,y}\right), \tag{11} $$
    where r(·) re-weights object proposals with different levels of IoU scores. We experimentally find that φ = 0.75 is a good cutoff point, at which the detector head can be trained on the most centered object proposals.
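A short sketch of the proposal consistency control in Equation (11): proposals below the cutoff φ are removed from the classification signal and the surviving ones are re-weighted by their IoU; taking r(·) as an identity re-weighting is our simplification.

```python
import torch

def proposal_consistency(ious, phi=0.75):
    """Per-proposal weights and the aggregate T_iou of Equation (11)."""
    keep = (ious >= phi).float()       # indicator 1{u_i >= phi}
    weights = keep * ious              # r(.) taken as identity re-weighting
    t_iou = weights.mean()             # (1/N) * sum_i 1{u_i >= phi} * r(u_i)
    return weights, t_iou

# example: proposals with IoU below 0.75 get zero classification weight
# weights, t_iou = proposal_consistency(torch.tensor([0.30, 0.80, 0.95]))
```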

4. Experiments

We perform extensive experiments on both few-shot object detection benchmark datasets, Pascal VOC and MS COCO, to assess the effectiveness of our proposed CRTED.

4.1. Few-Shot Object Detection Benchmarks

  • Pascal VOC The Pascal VOC dataset [38] consists of images with object annotations for 20 categories, split into 15 categories for C_b and 5 for C_n. We use the training set D_b ∪ D_n from the Pascal VOC 07+12 trainval sets for training, where D_n is randomly sampled from previously unseen novel classes with K-shot in {1, 2, 3, 5, 10}. Following existing studies [12,39,40], we consider the same three random partitions of base/novel classes and the same samplings. Each split is as follows: Novel Split set 1: {"bottle", "aeroplane", "sofa", "cow", "horse"/others}; Novel Split set 2: {"bus", "horse", "motorbike", "cow", "sofa"/others}; and Novel Split set 3: {"boat", "cat", "aeroplane", "sheep", "sofa"/others}. For fair comparison, in each partition we use the same sampled novel instances and report AP50 detection precision for C_b (bAP50) and C_n (nAP50) on the Pascal VOC 07 test set. The results are averaged over 10 randomly sampled support datasets.
  • MS COCO MS COCO [41] is a large-scale and more challenging object detection dataset consisting of 80 categories, where |C_b| = 60, |C_n| = 20 and the classes in C_n are shared with Pascal VOC. The training set D_b ∪ D_n comes from the MS COCO 2017 training set, and we perform evaluations on 5K images from the COCO 2017 val dataset, with the number of shots set to 1, 2, 3, 5, 10 and 30. The COCO-style detection precision for C_b ∪ C_n (AP), C_b (bAP) and C_n (nAP) is reported. The results are averaged over five randomly sampled support datasets.

4.2. Implementation Details

We follow the training pipeline of [21]. The basic deep architecture of our CRTED is trained end-to-end in parallel on four V100 GPUs with the Adam optimizer and a batch size of eight tuples p̃_t(q_c, s_c, s_h, s_n). During training, we find that more training iterations may cause the model to over-fit the training set D_b ∪ D_n and hurt performance. The learning rate is therefore experimentally set to 0.01 during the first training stage and gradually decayed to 0.0002 for the later 500 iterations, which leads to a better convergence point. We first perform the evaluation on the MS COCO 2017 training dataset and ensure that only simple class objects appear in the images of each p̃_t(q_c, s_c, s_h, s_n). During few-shot training, for one q_c belonging to C_b ∪ C_n in p̃_t(q_c, s_c, s_h, s_n), we provide 40 support images, 30 belonging to C_b and 10 belonging to C_n, termed four-way 10-shot contrastive training.
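For orientation, a hedged sketch of the optimization setup quoted above (Adam, batch size 8, learning rate 0.01 decayed to 0.0002 over 500 iterations); the decay boundary decay_at and the model object are placeholders, not the released training script.

```python
import torch

def build_optimizer_and_scheduler(model, base_lr=0.01, final_lr=0.0002,
                                  decay_at=10000, decay_iters=500):
    """Adam with a step-down from base_lr to final_lr over the last `decay_iters` iterations."""
    optimizer = torch.optim.Adam(model.parameters(), lr=base_lr)

    def lr_lambda(it):
        if it < decay_at:
            return 1.0                                        # first stage: lr = 0.01
        # linearly decay to final_lr over `decay_iters` iterations, then hold
        frac = min((it - decay_at) / decay_iters, 1.0)
        return (1 - frac) + frac * (final_lr / base_lr)

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# optimizer, scheduler = build_optimizer_and_scheduler(model)  # batch size 8 is set in the dataloader
```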

4.3. Ablation Studies

  • Evaluation of Correlation-RPN We conduct extensive experiments to assess our Correlation-RPN under different training strategies. To evaluate proposal quality, we first compare the precision and recall on the top 50 proposals of the regular RPN, attention RPN and our proposed Correlation-RPN at a 0.5 IoU threshold. In addition, we also report the average best overlap (ABO) across ground truth bounding boxes as one of our evaluation metrics. As demonstrated in Table 1, our model with Correlation-RPN performs better than the other two counterparts under the same training pairs and K-shot, producing improvements in all evaluations, which indicates that our proposed RPN architecture can generate more object-relevant proposals that benefit the overall detection prediction.
Table 1. Ablation studies on the proposed Correlation-RPN and other counterparts.

| Method | Precision | Recall | AP | ABO |
|---|---|---|---|---|
| Regular RPN | 0.7923 | 0.8804 | 54.5 | 0.7127 |
| Attention RPN | 0.8345 | 0.9130 | 56.9 | 0.7282 |
| Correlation-RPN | 0.8509 | 0.9214 | 57.1 | 0.7335 |
In addition, Figure 5 offers a visual comparison of the attention from shallow layers between our Correlation-RPN and the other two counterparts. The results confirm that our Correlation-RPN has a better ability to attend to the target domain and provide more high-quality proposals. Especially when dealing with challenging samples, particularly heavily occluded objects, our method achieves accurate recognition with high confidence, ensuring reliable and robust performance in situations where traditional methods might struggle.
  • Analysis of Matching Procedure for CRTED Figure 6 shows the feature visualization of different object classes learned with and without the matching procedure under the same constraint. As demonstrated, with the matching procedure introduced to learn inter-class correlation and capture intra-class compactness, different classes are better separated from each other, which helps reduce model misclassification and boosts generalization among similar categories. Specifically, Table 2 verifies the effectiveness of our proposed matching procedure. When we adopt our method in the model, regardless of the number of support classes, CRTED still boosts the detection performance of novel classes under the one-shot setup, which indicates the capacity of our designed matching procedure for support–query feature similarity metric mining and the strong power of the transformer encoder–decoder. It is worth noting that when there are multiple support classes, MP is able to exploit the inter-class similarity of the support–query images to improve few-shot detection performance, by about 2.8% mAP under the five-shot setting.
    Figure 6. The t-SNE visualization of object classification in the feature space with and without our proposed matching procedure MP on the novel split set 2 of Pascal VOC.
    Table 2. Ablation studies to validate the effectiveness of our presented MP. C denotes the number of support classes; the shot columns report novel mAP (IoU = 0.5).

| Method | MP | C | 1-shot | 2-shot | 3-shot | 5-shot | 10-shot |
|---|---|---|---|---|---|---|---|
| CRTED | – | 1 | 28.2 | 43.3 | 51.6 | 54.0 | 60.3 |
| CRTED | ✓ | 1 | 31.3 | 45.2 | 53.1 | 56.8 | 63.0 |
| CRTED | – | 5 | 33.7 | 46.5 | 52.4 | 57.1 | 61.8 |
| CRTED | ✓ | 5 | 37.3 | 51.1 | 54.5 | 58.2 | 63.3 |
  • Impact of Background Suppression (BS) Regularization To enhance few-shot detection, we evaluate whether our proposed BS regularization method can boost transfer learning for CRTED. As shown in Figure 7, background suppression (BS) regularization effectively helps CRTED reduce background disturbances. Table 3 also clearly shows that our proposed regularization method significantly improves performance when the support images s_c in the training tuple p̃_t(q_c, s_c, s_h, s_n) are scarce in the few-shot domain, especially in the one-shot setting. Additionally, it is worth mentioning that we pick objects from as many categories as possible to verify the effect, which shows that BS is generally robust across different categories.
    Figure 7. Feature heatmap of background suppression (BS) regularization. Samples of different categories are selected from COCO val set. BS can effectively weaken negative influence from background disturbances and then activate CRTED to pay attention to object-relevant regions.
    Table 3. The regularized transfer learning for CRTED, AP50 on the Pascal VOC dataset over C_b ∪ C_n. BL: baseline. BS: background suppression (BS) regularization. The mAP results show that our proposed BS regularization method can significantly boost the baseline CRTED in few-shot object detection.

| Shots (Split 1) | 1 | 2 | 3 | 5 | 10 |
|---|---|---|---|---|---|
| CRTED BL | 68.7 | 69.4 | 70.8 | 73.6 | 75.5 |
| CRTED BL+BS | 69.8 | 70.2 | 72.0 | 75.4 | 76.5 |

| Shots (Split 2) | 1 | 2 | 3 | 5 | 10 |
|---|---|---|---|---|---|
| CRTED BL | 65.5 | 66.8 | 69.9 | 71.3 | 73.3 |
| CRTED BL+BS | 67.7 | 68.0 | 71.3 | 71.8 | 73.7 |

| Shots (Split 3) | 1 | 2 | 3 | 5 | 10 |
|---|---|---|---|---|---|
| CRTED BL | 67.7 | 68.7 | 71.4 | 72.7 | 74.8 |
| CRTED BL+BS | 68.6 | 70.4 | 72.7 | 73.9 | 75.1 |
  • Ablation of Training CRTED Refer to Table 4. We train our network with different training strategies and obtain a 1.3% AP50 improvement with the p̃_t 10-shot training strategy compared with the p̃_c 10-shot training strategy. The model clearly performs better with the p̃_t strategy than with p̃_c, which shows the importance of the training pairs p̃_n and p̃_h included in p̃_t. By adding p̃_n(q_c, s_n), even when the object in q_c belongs to an unseen class, the model can learn inter-class correlation and intra-class compactness from novel classes to enhance its generalization ability. The model also becomes more robust with p̃_h(s_c, s_h), especially when q_c is a hard sample. With larger K-shot training we achieve better performance, which also indicates that a certain number of support images is beneficial to few-shot learning. We find that limiting the number of s_h and s_n to one is sufficient for training the model to distinguish different classes. As shown in Figure 8, our proposed training strategy positively activates the training progress of our detector. Our full model therefore adopts the p̃_t 10-shot training strategy.

4.4. Comparison with State-of-the-Art Object Detectors

Comparisons between our approach and state-of-the-art few-shot object detectors on Pascal VOC and COCO are shown in Table 5 and Table 6. Following the default novel split setting with different K-shots used in previous research, our proposed CRTED without further fine-tuning shows performance comparable to fine-tuned methods, and with fine-tuning it even sets new SOTA results for few-shot object detection. Our CRTED outperforms A-RPN without fine-tuning by 2.6% on the AP metric, which demonstrates the strong generalization ability of our detector, especially in the few-shot scenario. Specifically, in terms of AP over different classes, our CRTED obtains comparable performance in several cases on Pascal VOC and performs better than most representative fine-tuned models in the one-shot setting on the COCO val dataset, with improvements ranging from about 0.3% to 3.8%.

5. Conclusions

This paper presents a novel few-shot object detection model, CRTED, which introduces a newly proposed object-relevant region-based module, Correlation-RPN, together with the powerful transformer encoder–decoder structure. Our model, without fine-tuning, has been trained and validated on the Pascal VOC and COCO datasets, and extensive experimental results have been reported. Specifically, with the proposed matching procedure, BS regularization and the novel four-way tuple-contrast training strategy, CRTED performs comparably to, or even better than, detectors that rely on exhaustive fine-tuning under the same evaluation. We hope that this work can provide inspiration for further work in few-shot object detection.

Author Contributions

Conceptualization, K.X.; methodology, K.X.; validation, K.X.; formal analysis, K.X., L.J., Z.X., Y.N. and J.C.; investigation, K.X.; resources, K.X.; data curation, K.X.; writing—original draft preparation, K.X., L.J. and Z.X.; writing—review and editing, K.X., L.J., Y.N. and Z.X.; visualization, K.X.; supervision, Z.X., Y.N. and J.C.; project administration, J.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Guangxi Science and Technology Development Project (AB23026135; AB21220011), the Guilin Science and Technology Plan Project (20210220).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-end object detection with transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer International Publishing: Cham, Switzerland, 2020; pp. 213–229. [Google Scholar]
  2. Han, G.; Zhang, X.; Li, C. Semi-supervised dff: Decoupling detection and feature flow for video object detectors. In Proceedings of the 26th ACM International Conference on Multimedia, Seoul, Republic of Korea, 22–26 October 2018; pp. 1811–1819. [Google Scholar]
  3. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  4. Sun, P.; Zhang, R.; Jiang, Y.; Kong, T.; Xu, C.; Zhan, W.; Tomizuka, M.; Li, L.; Yuan, Z.; Wang, C.; et al. Sparse r-cnn: End-to-end object detection with learnable proposals. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 14454–14463. [Google Scholar]
  5. Zhang, H.; Li, F.; Liu, S.; Zhang, L.; Su, H.; Zhu, J.; Ni, L.M.; Shum, H.Y. Dino: Detr with improved denoising anchor boxes for end-to-end object detection. arXiv 2022, arXiv:2203.03605. [Google Scholar]
  6. Yang, X.; Chai, L.; Bist, R.B.; Subedi, S.; Wu, Z. A deep learning model for detecting cage-free hens on the litter floor. Animals 2022, 12, 1983. [Google Scholar] [CrossRef] [PubMed]
  7. Ravi, S.; Larochelle, H. Optimization as a model for few-shot learning. In Proceedings of the International Conference on Learning Representations, San Juan, Puerto Rico, 2–4 May 2016. [Google Scholar]
  8. Finn, C.; Abbeel, P.; Levine, S. Model-agnostic meta-learning for fast adaptation of deep networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 1126–1135. [Google Scholar]
  9. Cai, Q.; Pan, Y.; Yao, T.; Yan, C.; Mei, T. Memory matching networks for one-shot image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4080–4088. [Google Scholar]
  10. Gidaris, S.; Komodakis, N. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 4367–4375. [Google Scholar]
  11. Chen, H.; Wang, Y.; Wang, G.; Qiao, Y. Lstd: A low-shot transfer detector for object detection. In Proceedings of the AAAI Conference on Artificial Intelligence, New Orleans, LA, USA, 2–7 February 2018; Volume 32. [Google Scholar]
  12. Kang, B.; Liu, Z.; Wang, X.; Yu, F.; Feng, J.; Darrell, T. Few-shot object detection via feature reweighting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8420–8429. [Google Scholar]
  13. Yan, X.; Chen, Z.; Xu, A.; Wang, X.; Liang, X.; Lin, L. Meta r-cnn: Towards general solver for instance-level low-shot learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9577–9586. [Google Scholar]
  14. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  15. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  16. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1–9. [Google Scholar] [CrossRef] [PubMed]
  17. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef] [PubMed]
  18. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  19. Jocher, G. yolov5. 2021. Available online: https://github.com/ultralytics/yolov5 (accessed on 18 May 2020).
  20. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  21. Fan, Q.; Zhuo, W.; Tang, C.K.; Tai, Y.W. Few-shot object detection with attention-RPN and multi-relation detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 4013–4022. [Google Scholar]
  22. Wu, J.; Liu, S.; Huang, D.; Wang, Y. Multi-scale positive sample refinement for few-shot object detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XVI 16. Springer International Publishing: Cham, Switzerland, 2020; pp. 456–472. [Google Scholar]
  23. Wang, X.; Huang, T.E.; Darrell, T.; Gonzalez, J.E.; Yu, F. Frustratingly simple few-shot object detection. arXiv 2020, arXiv:2003.06957. [Google Scholar]
  24. Wojke, N.; Bewley, A. Deep cosine metric learning for person re-identification. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), Lake Tahoe, NV, USA, 12–15 March 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 748–756. [Google Scholar]
  25. Karlinsky, L.; Shtok, J.; Harary, S.; Schwartz, E.; Aides, A.; Feris, R.; Giryes, R.; Bronstein, A.M. Repmet: Representative-based metric learning for classification and few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 5197–5206. [Google Scholar]
  26. Song, H.; Sun, D.; Chun, S.; Jampani, V.; Han, D.; Heo, B.; Kim, W.; Yang, M.H. Vidt: An efficient and effective fully transformer-based object detector. arXiv 2021, arXiv:2110.03921. [Google Scholar]
  27. Li, Y.; Mao, H.; Girshick, R.; He, K. Exploring plain vision transformer backbones for object detection. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 280–296. [Google Scholar]
  28. Fang, Y.; Yang, S.; Wang, S.; Ge, Y.; Shan, Y.; Wang, X. Unleashing vanilla vision transformer with masked image modeling for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6244–6253. [Google Scholar]
  29. Liu, F.; Zhang, X.; Peng, Z.; Guo, Z.; Wan, F.; Ji, X.; Ye, Q. Integrally migrating pre-trained transformer encoder-decoders for visual object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 6825–6834. [Google Scholar]
  30. Zhu, X.; Su, W.; Lu, L.; Li, B.; Wang, X.; Dai, J. Deformable detr: Deformable transformers for end-to-end object detection. arXiv 2020, arXiv:2010.04159. [Google Scholar]
  31. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 1–15. [Google Scholar]
  32. Bertinetto, L.; Valmadre, J.; Henriques, J.F.; Vedaldi, A.; Torr, P.H. Fully-convolutional siamese networks for object tracking. In Proceedings of the Computer Vision–ECCV 2016 Workshops, Amsterdam, The Netherlands, 8–10 and 15–16 October 2016; Proceedings, Part II 14. Springer International Publishing: Cham, Switzerland, 2016; pp. 850–865. [Google Scholar]
  33. Lu, E.; Xie, W.; Zisserman, A. Class-agnostic counting. In Proceedings of the Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, 2–6 December 2018; Revised Selected Papers, Part III 14. Springer International Publishing: Cham, Switzerland, 2019; pp. 669–684. [Google Scholar]
  34. Li, B.; Wu, W.; Wang, Q.; Zhang, F.; Xing, J.; Yan, J. Siamrpn++: Evolution of siamese visual tracking with very deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 16–20 June 2019; pp. 4282–4291. [Google Scholar]
  35. Sun, Y.; Chen, Y.; Wang, X.; Tang, X. Deep learning face representation by joint identification-verification. Adv. Neural Inf. Process. Syst. 2014, 27, 1–9. [Google Scholar]
  36. Khosla, P.; Teterwak, P.; Wang, C.; Sarna, A.; Tian, Y.; Isola, P.; Maschinot, A.; Liu, C.; Krishnan, D. Supervised contrastive learning. Adv. Neural Inf. Process. Syst. 2020, 33, 18661–18673. [Google Scholar]
  37. Oord, A.; Li, Y.; Vinyals, O. Representation learning with contrastive predictive coding. arXiv 2018, arXiv:1807.03748. [Google Scholar]
  38. Everingham, M.; Van Gool, L.; Williams, C.K.I.; Winn, J.; Zisserman, A. The pascal visual object classes (voc) challenge. Int. J. Comput. Vis. 2010, 88, 303–338. [Google Scholar] [CrossRef]
  39. Xiao, Y.; Lepetit, V.; Marlet, R. Few-shot object detection and viewpoint estimation for objects in the wild. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3090–3106. [Google Scholar] [CrossRef] [PubMed]
  40. Wang, Y.X.; Ramanan, D.; Hebert, M. Meta-learning to detect rare objects. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9925–9934. [Google Scholar]
  41. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part V 13. Springer International Publishing: Cham, Switzerland, 2014; pp. 740–755. [Google Scholar]
  42. Li, B.; Wang, C.; Reddy, P.; Kim, S.; Scherer, S. Airdet: Few-shot detection without fine-tuning for autonomous exploration. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer Nature: Cham, Switzerland, 2022; pp. 427–444. [Google Scholar]
  43. Ma, J.; Niu, Y.; Xu, J.; Huang, S.; Han, G.; Chang, S.F. Digeo: Discriminative geometry-aware learning for generalized few-shot object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 3208–3218. [Google Scholar]
  44. Zhang, W.; Wang, Y.X. Hallucination Improves Few-Shot Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 13008–13017. [Google Scholar]
  45. Sun, B.; Li, B.; Cai, S.; Yuan, Y.; Zhang, C. Fsce: Few-shot object detection via contrastive proposal encoding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 7352–7362. [Google Scholar]
  46. Cao, Y.; Wang, J.; Jin, Y.; Wu, T.; Chen, K.; Liu, Z.; Lin, D. Few-shot object detection via association and discrimination. Adv. Neural Inf. Process. Syst. 2021, 34, 16570–16581. [Google Scholar]
  47. Han, J.; Ren, Y.; Ding, J.; Yan, K.; Xia, G.S. Few-shot object detection via variational feature aggregation. In Proceedings of the AAAI Conference on Artificial Intelligence, Washington, DC, USA, 7–14 February 2023; Volume 37, pp. 755–763. [Google Scholar]
Figure 1. The whole framework of our proposed CRTED. The query–support features from the weight-shared feature extractor are processed with BS regularization and encoding and then fed into Correlation-RPN for further extraction. Finally, the few-shot detection result is obtained by mapping back to the query image with the decoder.
Figure 3. These are some hard samples selected from the Pascal VOC test set and the 1-shot detection results of our method.
Figure 4. The four-way tuple-contrast training tuple and the different training pairs. s_c: positive support image; s_h: hard sample; s_n: support image of a novel class. s_c and s_h both have the same class as the ground truth in q_c. The training tuple p̃_t(q_c, s_c, s_h, s_n) consists of the query–support training pair p̃_c(q_c, s_c), the support–hard training pair p̃_h(s_c, s_h) and the query–novel training pair p̃_n(q_c, s_n).
Figure 5. Visual comparison between models with Correlation-RPN and other RPN architectures on one-shot object detection on Pascal VOC novel split 1. Correlation-RPN focuses on the most representative regions, resulting in more precise proposals and regression boxes.
Figure 8. Comparison of performance gains (left) and object classification loss (right).
Table 4. Experimental results for different training strategies.

| Training Strategy | AP | AP50 | AP75 |
|---|---|---|---|
| p̃_c 1-shot | 48.5 | 55.9 | 41.1 |
| p̃_t 1-shot | 48.7 | 56.2 | 41.2 |
| p̃_c 5-shot | 57.7 | 68.1 | 47.3 |
| p̃_t 5-shot | 58.1 | 68.4 | 47.7 |
| p̃_c 10-shot | 58.8 | 69.2 | 48.4 |
| p̃_t 10-shot | 59.7 | 70.5 | 48.8 |
Table 5. A performance comparison of nAP50 on Pascal VOC; shot columns report nAP50 averaged over the splits for each shot. Red and green fonts denote the best and second-best performance, respectively, in the original. With fine-tuning, CRTED achieves a new SOTA performance in the 1-shot, 2-shot and 3-shot settings and obtains comparable performance at 5-shot and 10-shot, demonstrating its strong generalization capability.

| Method | Fine-Tune | 1 | 2 | 3 | 5 | 10 |
|---|---|---|---|---|---|---|
| FSRW [12] | ✓ | 16.6 | 17.5 | 25.0 | 34.9 | 42.6 |
| Meta R-CNN [13] | ✓ | 11.2 | 15.3 | 20.5 | 29.8 | 37.0 |
| TFAfc [23] | ✓ | 27.6 | 30.6 | 39.8 | 46.6 | 48.7 |
| TFAcos [23] | ✓ | 31.4 | 32.6 | 40.5 | 46.8 | 48.3 |
| FSDetView [39] | ✓ | 26.9 | 20.4 | 29.9 | 31.6 | 37.7 |
| A-RPN [21] | × | 18.1 | 22.6 | 24.0 | 25.0 | - |
| AirDet [42] | × | 21.3 | 26.8 | 28.6 | 29.8 | - |
| DiGeo [43] | ✓ | 31.6 | 36.1 | 45.8 | 51.2 | 55.1 |
| CRTED (Ours) | × | 21.6 | 27.0 | 29.2 | 30.2 | 35.4 |
| CRTED (Ours) | ✓ | 32.2 | 36.6 | 46.4 | 51.4 | 55.5 |
Table 6. Performance comparison of AP with K-shot on the COCO validation dataset. Red and green fonts denote the best and second-best performance, respectively, in the original. CRTED achieves comparable performance to the baseline without fine-tuning and outperforms most representative methods with fine-tuning, which indicates its strong power, especially in few-shot settings.

| Method | Venue | Fine-Tune | 1 | 2 | 3 | 5 | 10 | 30 |
|---|---|---|---|---|---|---|---|---|
| FSRW [12] | ICCV 2019 | ✓ | - | - | - | - | 5.6 | 9.2 |
| Meta R-CNN [13] | ICCV 2019 | ✓ | - | - | - | - | 8.7 | 12.4 |
| TFAfc [23] | ICML 2020 | ✓ | 2.8 | 4.1 | 6.3 | 7.9 | 9.1 | - |
| TFAcos [23] | ICML 2020 | ✓ | 3.1 | 4.2 | 6.1 | 7.6 | 9.1 | 12.1 |
| FSDetView [39] | ECCV 2020 | ✓ | 2.2 | 3.4 | 5.2 | 8.2 | 12.5 | - |
| MPSR [22] | ECCV 2020 | ✓ | 3.3 | 5.4 | 5.7 | 7.2 | 9.8 | - |
| A-RPN [21] | CVPR 2020 | × | 4.3 | 4.7 | 5.3 | 6.1 | 7.4 | - |
| W. Zhang et al. [44] | CVPR 2021 | ✓ | 4.4 | 5.6 | 7.2 | - | - | - |
| FSCE [45] | CVPR 2021 | ✓ | - | - | - | - | 11.1 | 15.3 |
| FADI [46] | NIPS 2021 | ✓ | 5.7 | 7.0 | 8.6 | 10.1 | 12.2 | 16.1 |
| AirDet [42] | ECCV 2022 | × | 6.0 | 6.6 | 7.0 | 7.8 | 8.7 | 12.1 |
| VFA [47] | AAAI 2023 | ✓ | - | - | - | - | 16.2 | 18.9 |
| CRTED | Ours | × | 5.9 | 6.6 | 7.2 | 8.0 | 8.9 | 12.5 |
| CRTED | Ours | ✓ | 6.3 | 7.3 | 8.7 | 9.6 | 12.8 | 19.3 |
