MSFFAL: Few-Shot Object Detection via Multi-Scale Feature Fusion and Attentive Learning

Few-shot object detection (FSOD) is proposed to solve the application problem of traditional detectors in scenarios lacking training samples. Meta-learning methods have attracted researchers' attention for their excellent generalization performance. They usually select support features of the same class according to the query labels to weight the query features. However, using only same-category support features cannot give the model the ability of active identification, and this feature selection causes difficulties during testing, when no labels are available. The model's single-scale features also lead to poor performance in small object detection. In addition, the hard samples in the support branch impair the backbone's representation of the support features, thus impacting the feature weighting process. To overcome these problems, we propose a multi-scale feature fusion and attentive learning (MSFFAL) framework for few-shot object detection. We first design the backbone with a multi-scale feature fusion and channel attention mechanism to improve the model's detection accuracy on small objects and its representation of hard support samples. Based on this, we propose an attention loss to replace the feature weighting module. The loss allows the model to consistently represent objects of the same category in the two branches and realizes the active recognition ability of the model. The model no longer depends on query labels to select features at test time, simplifying the testing process. The experiments show that MSFFAL outperforms the state-of-the-art (SOTA) by 0.7-7.8% on Pascal VOC and achieves 1.61 times the baseline result on MS COCO small object detection.


Introduction
Thanks to the development of large-scale computing devices, deep learning has made rapid progress. As a research branch of deep learning, object detection is widely used in production and daily life due to its excellent stability, high accuracy, and fast detection speed. It realizes the localization and classification of objects and marks them in images in the form of text and bounding boxes. However, object detection algorithms based on deep learning usually need to learn the representation of object features from large-scale labeled data before they can classify and locate objects, which consumes considerable human and material resources [1][2][3]. Additionally, it is challenging to obtain a large amount of training data in some application scenarios, such as rare species detection, industrial defect detection, and so on. Inspired by the cognitive characteristic that humans can recognize a new thing from only a few samples, researchers believe that since neural networks imitate the reasoning process of human neurons, they should also have similar learning capabilities [4]. Therefore, FSOD comes into being, which is dedicated to training detectors with only a few labeled samples.
Figure 1. Some support images used for training. The support objects account for only a small proportion of each image, and most of the area is background; we call these hard samples. It is difficult for the model to extract support features that can represent the desired object category.
To overcome the above deficiencies, this paper proposes MSFFAL, based on meta-learning. First, we adopt a multi-scale feature fusion strategy and design the backbone as ResNet + feature pyramid networks (FPNs) [21] to improve the model's recognition of small objects. Then, we optimize the model's representation of hard support samples by introducing the channel attention structure SENet [23] into the support branch to weight the features of foreground objects. Finally, we design an attention loss that lets the query features perform attention calculations with all support features. The computed attention scores constrain the model's representation of the query features. Through the attention loss, the model learns to actively focus on objects of the same category in the two branches and no longer depends on query labels. Experiments on the benchmark datasets Pascal VOC [24,25] and MS COCO [22] prove the effectiveness of our method.
To summarize, the main contributions of this paper are as follows: (1) We propose an MSFFAL framework for few-shot object detection. The backbone of our model mainly contains the multi-scale feature fusion and channel attention mechanisms. The former is introduced to improve the model's detection accuracy on the small objects. The latter is adopted to strengthen the model's representation of hard samples in the support branch and enhance the model attention to foreground object features. (2) We design an attention loss to enhance the active recognition ability of the model, realize the consistent representation of objects belonging to the same category in the two branches, and improve the model's generalization ability in novel classes. Based on this, the model no longer relies on the feature selection and avoids model testing difficulty. (3) We conduct extensive experiments on the benchmark datasets Pascal VOC and MS COCO to verify the effectiveness of our method. The experimental results show that our model is 0.7-7.8% ahead of the SOTAs on the Pascal VOC. We also achieve a substantial lead over the baseline model in MS COCO's small object detection.
This paper includes five sections. Section 1 is the introduction, which presents the research background of FSOD and the motivation for our work. Section 2 reviews the related work on FSOD and describes the problems and optimization opportunities of previous methods. Section 3 introduces our algorithm in detail. Section 4 first introduces the datasets selected in this paper and the relevant experimental settings and then presents extensive experimental results to demonstrate the reliability of our work. Section 5 summarizes the whole work and concludes.

Object Detection
The object detection algorithm realizes the detection of targets in an image or video: if there is a target to be detected, it returns the category and bounding box information and marks the target in the image. Conventional deep learning-based object detectors can be classified into one- and two-stage detectors. One-stage detectors directly regress the object bounding boxes and categories from the deep features through fully connected or convolutional layers, such as the YOLO [26][27][28] series and the SSD [20] detector. These are characterized by a high detection speed but are prone to misjudging background information. Two-stage detectors first generate object candidate regions and then refine the positions of and classify the candidate regions, such as the faster R-CNN [29][30][31][32] series. Compared with one-stage detectors, two-stage detectors are slower, but their detection accuracy is higher. Since the region proposal network (RPN) module in faster R-CNN only distinguishes foreground from background information, it has better class independence. This gives faster R-CNN a more significant advantage in generalizing to novel classes. Therefore, most current FSOD models take faster R-CNN as their base detector, and the method in this paper also evolves from this model.

Few-Shot Learning
To allow a deep model to generalize in a target domain with only a few samples, researchers have proposed a new machine learning method, namely few-shot learning (FSL) [33][34][35][36][37]. However, insufficient samples bring difficulties in model training, resulting in overfitting. Therefore, learning transferable abstract knowledge so that a deep model can be applied to a target scene with little or no training data has become a critical research problem in this field. Early FSL methods mainly focused on classification tasks. Li et al. [4] first proposed a method based on the Bayesian framework; they believed computers should learn to use prior knowledge, just as humans can recognize new things from a few examples. Later, Vinyals et al. [38] proposed the matching network, which encodes images as deeply embedded features and performs weighted nearest neighbor matching to classify query images. Snell et al. [39] proposed the prototype network, which converts the embedded features into feature vectors and classifies samples by measuring the distance between them. Recently, Xie et al. [40] found that few-shot classification accuracy can be improved by using the Brownian distance instead of the Euclidean distance or cosine similarity. These methods allow the model to no longer focus on the specific category of an object but to learn how to distinguish which objects belong to the same category; therefore, the model also generalizes well when facing unseen samples. However, compared with few-shot classification tasks, FSOD must consider both the classification and localization of objects. Thus, it is more challenging to implement and needs to be the focus of further work.

Few-Shot Object Detection and Meta-Learning Paradigm
FSOD aims to train the model with only a few labeled images to realize the localization and classification of objects. Among the various FSOD methods, meta-learning-based models have attracted wide attention for their abstract learning ability, which generalizes better to novel classes. Kang et al. [11] developed a dual-branch detection model based on YOLO. They proposed a reweighting module that weights the query features with the support features to amplify common object features and enhance the model's attention to objects belonging to the same category in the two branches. Similarly, Xiao and Yan et al. [14,16] built few-shot detectors based on faster R-CNN, raising the detection results to a higher level. Fan et al. [12] proposed the attention RPN based on faster R-CNN. They used support features to enhance the query features before feeding them into the RPN, improving the proposal quality for unseen novel class objects. Zhang et al. [13] generated a convolutional kernel from the support branch features and then performed convolution operations on the query features to enhance the object features belonging to the same category. All the above methods focus on enhancing the query features with the support features, ignoring the inherent defects of the model itself. Firstly, the hard samples in the support branch lead to an imprecise representation of the support features, affecting the weighting of the query features. Secondly, the single-level feature maps lead to poor performance in small object detection. Finally, weighting the query features only with same-category support features cannot endow the model with the ability to actively identify objects of the same category. To this end, we propose MSFFAL to overcome these shortcomings from the above three perspectives and verify the method's effectiveness through sufficient experiments.

Method
Our method is further innovated and optimized based on meta R-CNN [14]. We first improve the model's recognition of small objects by introducing a multi-scale mechanism into the feature extraction backbone. Then, we add a channel attention mechanism on top of the FPN to optimize the model's representation of hard samples in the support branch and improve detection precision. Finally, we design an attention loss that lets the model learn consistent representations of same-category objects in the two branches; the model thereby learns to actively identify objects from support samples, leading to an overall improvement in detection performance. In this section, we first give the problem definition for FSOD. Then, we introduce the overall architecture of MSFFAL and describe its modules and structures in detail.

Problem Definition
We follow the dataset setup, training strategy, and evaluation methods in [11,14]. We divide the dataset into C_b and C_n, where C_b is the base class data with thousands of annotations per class and C_n is the novel class data with only one to dozens of annotations per class. The base class and novel class data do not contain the same object categories, that is, C_b ∩ C_n = ∅. We first train the model on the base classes C_b and then fine-tune it on a balanced set of C_b and C_n with only K annotations per class. K is set to different values according to the evaluation indicators of the different datasets. For a given N-way K-shot learning task, in each iteration the model samples from the prepared dataset a query image and N × K support images, covering N categories with K objects per category, as input. Then, the model outputs the detection results for the objects in the query image. Finally, we evaluate the model's performance by the mAP on the novel classes of the test set.
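As a concrete illustration, the N-way K-shot episode sampling described above can be sketched as follows; the `support_pool` structure and the toy category names are hypothetical stand-ins for the real annotation index:

```python
import random

def sample_episode(support_pool, n_way, k_shot, seed=None):
    """Draw one N-way K-shot support set: N categories, K annotated
    objects per category (the query image is sampled separately)."""
    rng = random.Random(seed)
    classes = rng.sample(sorted(support_pool), n_way)
    return {c: rng.sample(support_pool[c], k_shot) for c in classes}

# toy pool: category -> ids of annotated support crops (hypothetical)
pool = {"bird": [1, 2, 3], "bus": [4, 5], "cow": [6, 7, 8], "sofa": [9, 10]}
episode = sample_episode(pool, n_way=3, k_shot=2, seed=0)
```

Each sampled episode therefore contains N × K support objects, matching the task definition above.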

Model Architecture
We choose meta R-CNN, whose base detector is faster R-CNN, as our baseline. The model architecture is shown in Figure 2; it is a Siamese network structure. The upper side of the network is the query branch, which takes the query image to be detected as input, while the lower side is the support branch, which takes the support image-mask pairs as input for auxiliary detection. We remove the meta learner module of meta R-CNN and realize the information interaction between the two branches through our attention loss. Compared with the baseline, we optimize the backbones of the query and support branches into ResNet + FPN and ResNet + FPN + SENet structures, respectively. The two backbones share weight parameters during the training stage. The query features are passed through the RPN and ROIAlign to obtain positive and negative proposal feature vectors. The support features are directly average pooled to obtain support feature vectors representing each support object category. Then, they are used to construct L_metacls to classify support objects and to compute the attention loss with the query positive proposal vectors. The model is trained with three losses, namely:

L = L_det + L_metacls + λ L_atten,

where L_det is the detection loss of faster R-CNN, L_metacls is the meta-classification loss in the support branch, L_atten is our attention loss, and λ is the weight parameter of the attention loss. The input of the model is a task sample consisting of a query image and a set of support images. The model learns from the input task to discover the objects belonging to the same category in the two branches so as to achieve generalization in the novel classes.
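The three-term training objective can be sketched as a one-line helper; the numeric values below are arbitrary illustrations, and the actual value of λ is a training hyperparameter:

```python
def total_loss(l_det, l_metacls, l_atten, lam=1.0):
    """L = L_det + L_metacls + lambda * L_atten, as defined above."""
    return l_det + l_metacls + lam * l_atten

# arbitrary illustrative values for the three loss terms
loss = total_loss(l_det=0.8, l_metacls=0.3, l_atten=0.5, lam=0.5)
```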

FPN and SENet
To improve the detection precision of the FSOD model for small objects and the representation effect for hard support samples, we design the feature extraction backbone as an FPN+SENet structure.
As shown in Figure 3, FPN mainly includes a bottom-up line (blue box), a top-down line (green box), and lateral connections (1 × 1 conv, 256 channels). Bottom-up is the forward process of the ResNet network: each layer down-samples the feature maps' length and width and increases the number of channels. Suppose the input image size is 224 × 224 × 3; then Layer0-Layer3 output feature maps of sizes 56 × 56 × 256, 28 × 28 × 512, 14 × 14 × 1024, and 7 × 7 × 2048, respectively. Top-down is the process of up-sampling the width and height of the feature maps by a factor of two. FPN combines high-level and low-level features through the lateral connections to obtain the M2-M5 features with 256 channels. Finally, a 3 × 3 convolution kernel is used to convolve the fused features to eliminate the aliasing effect of up-sampling and obtain the P2-P5 features. P6 is the feature map obtained from P5 by max-pooling with stride = 2. The features at each level output by FPN are fed to the RPN module for region proposals. Among them, the low-level features contribute more proposals for small objects, improving the detection of small objects.
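The stage shapes quoted above follow directly from the standard ResNet stride pattern (overall strides 4, 8, 16, 32); a small sketch that reproduces them (the function name is ours, not from any detection library):

```python
def resnet_fpn_shapes(h=224, w=224):
    """(height, width, channels) of the Layer0-Layer3 outputs, plus the
    pyramid levels after the lateral 1x1 convolutions."""
    channels = [256, 512, 1024, 2048]
    bottom_up = [(h // (4 * 2 ** i), w // (4 * 2 ** i), c)
                 for i, c in enumerate(channels)]
    # lateral 1x1 convolutions map every level to 256 channels (P2-P5)
    pyramid = [(hh, ww, 256) for hh, ww, _ in bottom_up]
    return bottom_up, pyramid

stages, pyramid = resnet_fpn_shapes()
```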

We add the SENet structure on top of FPN. Through this structure, the model achieves channel-level self-attention enhancement, and during training it continuously learns to improve the representation of hard support samples. The design of SENet is shown in Figure 4. This module adds a skip connection to the output feature layer of the ResNet forward network. In the connection, the feature maps are first average pooled. Then, the channel attention scores are obtained through the channel attention module. Finally, the original feature maps are weighted at the channel level by these scores. The internal structure of the channel attention module is shown on the right side of Figure 4. The input feature vector V_in ∈ R^(channel×1) is first dimensionally reduced through the first fully connected (FC) layer with a reduction rate of 4 to obtain V′_in ∈ R^(channel/4×1). Then, V′_in is passed through the first activation function, Tanh, to obtain V″_in ∈ R^(channel/4×1). Next, the dimension of V″_in is restored through the second FC layer to obtain V‴_in ∈ R^(channel×1). Finally, V‴_in is passed through the second activation function, Sigmoid, to obtain the weight score vector V_out ∈ R^(channel×1). The whole process can be summarized as:

V_out = Sigmoid(FC_2(Tanh(FC_1(V_in)))).

Two different activation functions are used to increase the network's nonlinearity and enrich its expressive ability. SENet allows the support branch to output high-quality support feature vectors for meta-classification and for constructing the attention loss, improving the model's detection performance.
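A minimal NumPy sketch of this squeeze-and-excite computation, with randomly initialized matrices `w1` and `w2` standing in for the two learned FC layers (reduction rate 4):

```python
import numpy as np

def channel_attention(feat, w1, w2):
    """Global average pool -> FC (reduce x4) -> Tanh -> FC (restore)
    -> Sigmoid -> channel-wise reweighting of the input feature map."""
    v = feat.mean(axis=(1, 2))            # squeeze: (C,)
    v = np.tanh(w1 @ v)                   # first FC + Tanh: (C/4,)
    v = 1.0 / (1.0 + np.exp(-(w2 @ v)))   # second FC + Sigmoid: (C,)
    return feat * v[:, None, None]        # excite: rescale each channel

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 7, 7))     # toy support feature map, C = 8
w1 = 0.1 * rng.standard_normal((2, 8))    # stand-in for FC_1 (8 -> 2)
w2 = 0.1 * rng.standard_normal((8, 2))    # stand-in for FC_2 (2 -> 8)
out = channel_attention(feat, w1, w2)
```

Because the sigmoid scores lie in (0, 1), each channel of the output is a damped copy of the input, with informative channels attenuated least.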


Attention Loss
Meta learner is the core module in meta R-CNN, which uses the same category support features to weight the query features. This weighting method causes the model to lack the ability to actively identify objects of the same category, and the dependence on the query labels to select the support features makes model testing difficult. To remedy these, we design an attention loss to replace the meta learner module in the baseline model meta R-CNN.
The essence of the attention loss lies in utilizing the support features to establish a mapping between query positive proposal features and their corresponding categories. Through training, a strong response is generated between objects of the same category in the two branches. The model learns to recognize objects of the same category in the two branches while also discriminating objects of different categories. As shown in Figure 5, we extract all query positive proposal feature vectors V_sheep, V_car ∈ R^(256×1) according to the intersection over union (IOU) between the predicted bounding boxes generated by the RPN and the ground truth. We then perform a matrix multiplication between each positive proposal feature vector and the transpose of the support feature vectors V^T_support ∈ R^(1×256) and put the results through softmax to obtain attention vectors V_atten ∈ R^N, where N denotes the number of support images input in each iteration. Each element in V_atten corresponds to a support category. Suppose the category of the positive proposal is consistent with that of a support vector; in that case, we expect the value at the element position corresponding to this category to be close to 1, and otherwise close to 0. To achieve this goal, we concatenate all the attention vectors V_atten to obtain the score matrix M_score ∈ R^(Np×N), where Np is the number of positive proposals, and use the Labels corresponding to each positive proposal to constrain the trend of M_score; that is, the proposed attention loss:

L_atten = −(1/Np) Σ_{i=1}^{Np} Σ_{j=1}^{N} M_labels(i, j) log M_score(i, j),

where M_labels ∈ R^(Np×N) represents the concatenation of the Labels.
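Under the cross-entropy formulation above, the attention loss can be sketched in NumPy as follows; the feature dimensions and the toy label assignment are illustrative only:

```python
import numpy as np

def attention_loss(pos_feats, support_feats, labels):
    """Cross-entropy between the softmaxed proposal-support similarity
    matrix (M_score) and the one-hot proposal labels (M_labels)."""
    logits = pos_feats @ support_feats.T          # (Np, N) raw scores
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    m_score = np.exp(logits)
    m_score /= m_score.sum(axis=1, keepdims=True) # softmax rows
    n_pos = len(labels)
    return -np.log(m_score[np.arange(n_pos), labels]).mean()

rng = np.random.default_rng(0)
pos = rng.standard_normal((6, 256))     # Np = 6 positive proposal vectors
sup = rng.standard_normal((4, 256))     # N = 4 support class vectors
labels = np.array([0, 1, 2, 3, 0, 1])   # support class of each proposal
loss = attention_loss(pos, sup, labels)
```

When a proposal feature aligns with its own class's support vector (e.g. `pos = sup[labels]`), the loss approaches zero, which is exactly the strong same-category response the text describes.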
Through the attention loss, on the one hand, the model can learn a consistent representation of objects belonging to the same category in the two branches during the training process. On the other hand, the model learns an abstract and easily transferable metaknowledge in this way. Thus, it can also show an excellent generalization performance when facing unseen novel class objects.

Datasets and Preparation
We validate our method on two benchmark object detection datasets, Pascal VOC and MS COCO. The few-shot object detection datasets are constructed by splitting these two datasets. Pascal VOC: The Pascal VOC dataset contains 20 object categories in total. The dataset is divided into a base set and a novel set in three splits. The base set of each split includes 15 categories, and the novel set contains 5 categories. The novel sets are: novel set 1: {"bird", "bus", "cow", "motorbike", "sofa"}; novel set 2: {"aircraft", "bottle", "cow", "horse", "sofa"}; novel set 3: {"boat", "cat", "motorbike", "sheep", "sofa"}. Novel sets 2 and 3 are more challenging to train on than novel set 1; we call them hard samples. The model is trained with only 1, 2, 3, 5, and 10 novel class samples provided per class, and the performance of the 1-, 2-, 3-, 5-, and 10-shot fine-tuned models is evaluated by the mAP on the test set. MS COCO: The MS COCO dataset contains 80 object categories, split into a base set of 60 categories and a novel set of 20 categories. The model is trained with only 10 and 30 novel class samples provided per class, and the performance of the 10- and 30-shot fine-tuned models is evaluated by the mAP on the test set.
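The split settings above can be collected into a small configuration table; the dictionary layout is our own bookkeeping, not part of any released code:

```python
# Few-shot split configuration for both benchmarks, as described above.
SPLITS = {
    "pascal_voc": {"classes": 20, "base": 15, "novel": 5,
                   "shots": [1, 2, 3, 5, 10]},
    "ms_coco":    {"classes": 80, "base": 60, "novel": 20,
                   "shots": [10, 30]},
}

NOVEL_SETS = {  # the three Pascal VOC novel splits (names as quoted)
    1: {"bird", "bus", "cow", "motorbike", "sofa"},
    2: {"aircraft", "bottle", "cow", "horse", "sofa"},
    3: {"boat", "cat", "motorbike", "sheep", "sofa"},
}
```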

Implementation Details
Firstly, we pre-train our feature extraction module on the large-scale dataset ImageNet [41] and then train and fine-tune our model end-to-end on the Pascal VOC and MS COCO. We use two NVIDIA RTX 3090 (24 GB) GPUs for model training. We choose stochastic gradient descent (SGD) as the training optimizer, with momentum and weight decay set to 0.9 and 0.0001, respectively. On the Pascal VOC dataset, we perform 18,000 iterations during the base class training phase, with a learning rate of 0.001 for the first 16,000 iterations and 0.0001 for the last 2,000 iterations. In the novel class fine-tuning stage, we perform 300, 600, 900, 1,200, and 1,500 iterations with a learning rate of 0.001 for the 1-, 2-, 3-, 5-, and 10-shot settings, respectively. On the MS COCO dataset, we perform 120,000 iterations during the base class training phase, with a learning rate of 0.005 for the first 110,000 iterations and 0.0005 for the last 10,000 iterations. In the novel class fine-tuning stage, we perform 5,000 and 8,000 iterations with a learning rate of 0.001 for the 10- and 30-shot settings, respectively.
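The piecewise-constant learning-rate schedules above can be expressed as a small lookup helper (a sketch; the real training code would hand the resulting rate to the SGD optimizer):

```python
def lr_at(iteration, schedule):
    """Piecewise-constant schedule: `schedule` maps the first iteration
    of each phase to its learning rate."""
    lr = None
    for start, rate in sorted(schedule.items()):
        if iteration >= start:
            lr = rate
    return lr

# Pascal VOC base training: 0.001 for 16,000 iters, then 0.0001 for 2,000
voc_base = {0: 1e-3, 16_000: 1e-4}
# MS COCO base training: 0.005 for 110,000 iters, then 0.0005 for 10,000
coco_base = {0: 5e-3, 110_000: 5e-4}
```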

Comparison with the State-of-the-Arts
In this section, we compare our model with popular approaches from recent years on the Pascal VOC and MS COCO datasets. On the Pascal VOC dataset, we only compare the mAP obtained by the models on the novel classes. The evaluation metrics on the MS COCO dataset are richer, mainly comparing the models' AP_50-95, AP_50, AP_S, AP_M, and AP_L.

Performance in the Three Novel Sets of the Pascal VOC
The detection results of our model on the three novel sets of the Pascal VOC are shown in Tables 1-3. In the tables, "shot" refers to the number of annotations provided during model training; "1-, 2-, 3-, 5-, and 10-shot" refer to the mAP of the model on novel class object detection in the test set when trained with only 1, 2, 3, 5, and 10 annotations provided per class, respectively. "Mean" denotes the average mAP across the five aforementioned scenarios. We compare our results with the SOTA methods of recent years, including metric learning models, data augmentation models, and meta-learning models. As shown in Table 1, our model leads the SOTA by 2.5%, 8.2%, 7.8%, and 2.6% in 1-, 2-, 3-, and 5-shot fine-tuning on novel set 1, respectively. As illustrated in Table 2, we outperform the SOTA by 1.4%, 2.1%, 4.3%, 6.3%, and 0.7% across the five settings on novel set 2. In Table 3, our model leads by 1.3% in 5-shot fine-tuning on novel set 3. We also lead the SOTA by 5.8% and 3.9% in the average of all fine-tuning results on novel sets 1 and 2, respectively. The results on Pascal VOC prove the effectiveness of our method. Our model achieves the best results on 12 of the detection metrics and ranks second by a slight margin on the rest. Even on the hard novel set 2, the model still performs well and obtains a comprehensive lead. Although only one of our results leads on novel set 3, the other metrics are close behind, which is still a vast improvement over the baseline model. This proves that the channel attention mechanism improves the model's representation of hard support samples and its detection effect. The attention loss enables the model to consistently represent the same category of objects in the two branches, enhancing the model's generalization ability on the novel classes.

Detailed Performance on the Novel Set 1
This section compares the model's average precision (AP) in 3- and 10-shot fine-tuning for each novel class in novel set 1, as well as the mean average precision (mAP) on the novel and base sets. Here, "shot" refers to the number of annotations provided during model training.
It can be observed from Table 4 that our model achieves the highest detection precision on most of the novel class objects, and the AP of "bus" and "cow" even surpasses the detection of the base class objects. However, our model lags behind previous methods in base class mAP. We attribute this to the fact that models such as TFA [8] and MPSR [43] inherit ideas from transfer learning and therefore retain more base class information, so they behave very differently on the base and novel classes. Our attention loss, by contrast, focuses on making the model actively discover the same class of objects in the two branches, so its performance on the novel and base classes is relatively balanced, and no massive gap appears when the detection category changes. In addition, since base class data can be obtained from open source datasets, FSOD should pay more attention to the model's performance on novel class data.

Performance on the MS COCO Dataset
This section presents the detection results of our model on the MS COCO dataset. In Table 5, AP50-95 represents the average mAP of the model over IoU thresholds ranging from 0.5 to 0.95. AP50 represents the mAP of the model at IoU = 0.5. APS, APM, and APL are the mAP of the model for small (area < 32 × 32), medium (32 × 32 < area < 96 × 96), and large (area > 96 × 96) object detection, respectively. Here, "shot" refers to the number of annotations provided during model training. Table 5 shows that, in 10-shot fine-tuning, MSFFAL outperforms previous works by 1.1% in APS. In 30-shot fine-tuning, it leads by 1.4% and 1.3% in AP50 and APS, respectively. The results on the remaining indicators are close to those of previous models. In particular, our model outperforms the baseline model Meta R-CNN on all metrics by a large margin, especially on small objects. This performance shows that the backbone combined with the FPN structure realizes multi-scale feature fusion and provides the RPN with features at more levels; among them, the low-level features yield a larger number of small object proposals, thereby effectively improving detection precision on small objects. In addition, SENet's optimization of the support features and the active identification ability endowed by the attention loss promote the improvement of the detection results. However, the overall results show that the performance of MSFFAL on MS COCO is not as good as that on Pascal VOC, falling behind previous models on many metrics. We consider the first reason to be that the MS COCO dataset is more complex than Pascal VOC, containing a large number of hard samples and small object samples. Moreover, since the attention loss is constructed from the support and query features, the representation quality of the support features directly affects the final result of the model. Although the corresponding strategies have been adopted to mitigate this problem, the model still cannot accurately represent such samples, revealing shortcomings when dealing with complex datasets. This will be part of our future work.
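The small/medium/large split quoted above follows the COCO area convention. A minimal helper, with a function name of our own choosing, that assigns a box to its scale bucket:

```python
def coco_scale_bucket(width, height):
    """Assign a box to the COCO small/medium/large bucket by pixel area.

    Thresholds follow the convention quoted in the text:
    small: area < 32*32; medium: 32*32 <= area < 96*96; large: otherwise.
    """
    area = width * height
    if area < 32 * 32:
        return "small"
    if area < 96 * 96:
        return "medium"
    return "large"
```

APS, APM, and APL are then the AP restricted to ground-truth objects in each bucket.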

Convergence Comparison of Attention Loss on Different Datasets
This section compares the convergence of the attention loss during model training on the Pascal VOC and MS COCO datasets. From Figure 6, it can be observed that the attention loss gradually converges on both datasets as the model iterates, demonstrating the effectiveness of the loss. We also find that, compared to its trend on the MS COCO dataset, the attention loss converges more quickly and to a greater extent on the Pascal VOC dataset. This reflects that the attention loss is not adept at handling complex datasets such as MS COCO, which has more object categories and contains many small objects. Although the multi-scale feature fusion strategy and the attention mechanism somewhat alleviate the impact of this issue, MSFFAL still has difficulty representing these objects, which in turn affects the subsequent attention loss. Nevertheless, in terms of overall detection performance, MSFFAL improves on the baseline model and surpasses previous algorithms in small object detection.

Ablation Study
In this part, we conduct detailed ablation experiments to verify and analyze the impact of each module on the detection results, with experimental validation on both the Pascal VOC and MS COCO datasets.

The Performance of the Proposed Modules
For a fairer comparison, we performed the module ablation experiments on the Pascal VOC hard sample novel set 2 (as shown in Table 6). In the table, FPN represents the multi-scale feature fusion module, and SENet represents the channel attention module. We also compare against the baseline model Meta R-CNN. It can be observed from the second and third rows of Table 6 that the model's precision increases by 1.8-24.8% with the addition of the attention loss. The fourth row shows that, when we remove the meta learner module from the baseline, the model's precision improves by 0.3-2.8%. These results reflect that our attention loss plays an essential role in enhancing the model's active identification ability and generalization on the novel classes and dramatically improves the detection precision of the baseline model. We attribute the accuracy improvement after removing the meta learner to the original weighting mechanism, which selects support features of the same category to weight all positive and negative proposal vectors; this may enhance some negative proposal features and thus impact the overall recognition performance of the model. In addition, the meta learner relies on the query labels for feature selection, which affects the model testing process. For these reasons, we directly replace the meta learner with the attention loss. Subsequently, the model's precision improves further with the addition of FPN and SENet. In particular, SENet is essential for enhancing model training on hard sample tasks. The ablation results in Table 6 prove that adding FPN allows the RPN to generate more relevant proposals and enhances detection precision, and that SENet effectively improves the model's representation of hard support samples. The attention loss gives the model an autonomous learning ability, effectively realizes the mining of the same class of objects in the two branches, and improves the detection effect of the few-shot model.
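The exact form of the attention loss is defined in Section 3; as a loose illustration only, a consistency objective of this flavor can be sketched as a softmax cross-entropy over cosine similarities between positive proposal features and per-class support prototypes. The names and the specific formulation below are our own hypothetical sketch, not the paper's definition:

```python
import math

def attention_loss(proposal_feats, proposal_labels, support_protos):
    """Hypothetical consistency loss sketch: each positive proposal
    feature should be most similar to the support prototype of its own
    class (negative log-softmax over cosine similarities).

    proposal_feats: feature vectors of positive proposals.
    proposal_labels: class index of each proposal.
    support_protos: one prototype vector per class from the support branch.
    """
    def cos(a, b):
        num = sum(x * y for x, y in zip(a, b))
        den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return num / den

    total = 0.0
    for feat, label in zip(proposal_feats, proposal_labels):
        sims = [cos(feat, proto) for proto in support_protos]
        logsum = math.log(sum(math.exp(s) for s in sims))
        total += logsum - sims[label]      # -log softmax of the true class
    return total / len(proposal_feats)
```

Minimizing such a term pulls the two branches' representations of the same category together, which is the intuition behind replacing the weighting module with a loss.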

Table 7 compares the contributions of the different modules to the detection results on the MS COCO dataset. This section mainly shows the APS and AP50 achieved by the model in 10- and 30-shot fine-tuning. It can be seen from Table 7 that, although the model faces a more complex dataset, the addition of the attention loss still effectively improves the detection accuracy, leading the baseline's AP50 by 7% and 7.2% in 10- and 30-shot fine-tuning, respectively.
In addition, the subsequent combination of SENet and FPN further improves the detection precision. However, compared to SENet, FPN contributes more to small object detection: regardless of how the modules are combined, adding FPN promotes small object detection by 0.3-0.5%. Finally, compared with the baseline model, MSFFAL achieves a substantial lead.

The Insertion Position of SENet
In Section 3 of this paper, we introduce how SENet is inserted. Since channel-wise attention enhancement is more suitable for deep features with higher semantic levels, the specific insertion position cannot be determined in advance. Thus, in this part, we conduct ablation experiments on the influence of the insertion position of SENet on the detection results on the Pascal VOC dataset. The experimental results show the average precision of the model over all fine-tuning settings on the three novel sets (as shown in Table 8). In the table, "Layer0-3" refers to the four convolutional layers in ResNet. It can be seen from Table 8 that, when the channel attention mechanism is added after Layer1, Layer2, and Layer3 of ResNet, the model achieves its best results. This also verifies our conjecture that the module does not work well on low-level features. All other experimental results in this paper are based on this setting.
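To make the insertion concrete, the following is a minimal, framework-free sketch of a squeeze-and-excitation block of the kind appended after a ResNet stage. Weight shapes are simplified and biases are omitted; an actual implementation would use a deep learning framework, so this is an illustration under those assumptions:

```python
import math

def se_block(feature_map, w1, w2):
    """Minimal squeeze-and-excitation sketch on a C x H x W feature map
    (nested lists). w1: C x (C/r) reduction weights, w2: (C/r) x C
    expansion weights; biases are omitted for brevity.
    """
    C = len(feature_map)
    # Squeeze: global average pooling, one scalar per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> sigmoid, producing one gate per channel.
    hidden = [max(0.0, sum(z[c] * w1[c][j] for c in range(C)))
              for j in range(len(w1[0]))]
    scale = [1.0 / (1.0 + math.exp(-sum(hidden[j] * w2[j][c] for j in range(len(hidden)))))
             for c in range(C)]
    # Reweight: multiply every channel by its learned gate.
    return [[[v * scale[c] for v in row] for row in feature_map[c]] for c in range(C)]
```

In MSFFAL, blocks of this kind follow Layer1-Layer3 of ResNet, where the channel semantics are rich enough for the gates to be informative.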

Comparison with the Baseline in Meta Accuracy
Meta accuracy is the classification accuracy of the support branch features. It can reflect the classification effect of the model on the novel classes in the fine-tuning stage.
As shown in Figure 7, we compare the meta accuracy achieved by our MSFFAL and the baseline model Meta R-CNN in 1-, 3-, and 10-shot fine-tuning on novel set 1 and in 10- and 30-shot fine-tuning on MS COCO, under the same number of iterations. We can observe from the figure that the meta accuracy achieved by MSFFAL exceeds the baseline by 9.0%, 11.0%, 24.1%, 9.9%, and 14.7%, respectively. This proves that MSFFAL can dramatically improve the classification of the novel classes and, to a large extent, enhance the model's representation of unseen objects. Such results indicate that the attention loss has potential for few-shot classification tasks. This will be our research direction in the next stage, with performance comparisons against mainstream algorithms such as Brownian distance [40].

Time Complexity Analysis of MSFFAL
Time complexity is an important performance evaluation metric for most deep learning methods and has significant reference value when deploying and applying models. The time complexity of a model is mainly reflected in its number of floating-point operations. Therefore, we conducted a complexity analysis of MSFFAL, as shown in Table 9. The second column of the table lists the modules included in MSFFAL: ResNet50, RPN, Predict head, FPN, and SENet. The third column shows each module's floating-point operation count, measured in giga floating-point operations (GFLOPs). The first three modules are the main components of the base detector Faster R-CNN and account for the majority of the model's computational complexity. The last two modules are introduced in Section 3. The time complexity of the attention loss depends on the number of positive proposals; its value is unstable but always small, so it can be ignored. The results show that introducing FPN increases the model's time complexity by 2.81 GFLOPs; in return, FPN enhances the model's detection of small objects. SENet adds only 0.003 GFLOPs while effectively improving the model's representation and detection of hard samples. In summary, improving a model's performance often comes at the expense of computational complexity, and reducing this cost is a direction for future exploration.
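GFLOPs figures of this kind are typically obtained by summing per-layer counts. As an illustration with a function name of our own, the count for a single convolution layer, treating each multiply-add as 2 FLOPs, can be estimated as:

```python
def conv_gflops(c_in, c_out, k, h_out, w_out):
    """Estimate the GFLOPs of one k x k convolution layer: every output
    element needs c_in * k * k multiply-adds, counted as 2 FLOPs each.
    """
    flops = 2 * c_in * k * k * c_out * h_out * w_out
    return flops / 1e9
```

For example, a ResNet-style 7 × 7 stem convolution mapping 3 channels to 64 over a 112 × 112 output comes to about 0.236 GFLOPs under this counting convention; a whole-model figure like those in Table 9 is the sum of such per-layer terms.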

Conclusions
We propose MSFFAL, a meta-learning-based framework for few-shot object detection. In addition to optimizing the feature extraction backbone with the FPN structure and SENet, we design an attention loss. FPN effectively improves the model's detection of small objects through multi-scale feature fusion. SENet relies on the channel attention mechanism to enhance the support branch's representation of hard samples. The attention loss replaces the weighting module in the baseline model, introduces a mutual constraint between query and support features, and achieves a consistent representation of objects belonging to the same class. Through this loss, the model learns to actively discover objects of the same class during training and no longer relies on query labels for feature selection. We validate the effectiveness of our method on the benchmark datasets Pascal VOC and MS COCO. Alongside the successful results mentioned above, we identify some limitations in our research. For example, the model's overall performance on complex data still has room for improvement, because complex datasets affect the model's effective representation of features, which in turn affects the subsequent attention loss. Moreover, there is a general lack of time complexity analysis for FSOD models; we will conduct an in-depth analysis of the complexity issue and research how to reduce the model's time complexity while maintaining its detection performance. In the future, we will optimize our method to solve the above problems and continue to explore challenging datasets for validation and performance comparison with representative algorithms.