Few-Shot Object Detection Method Based on Knowledge Reasoning

Abstract: Human beings have the ability to quickly recognize novel concepts with the help of scene semantics. This kind of ability is meaningful and challenging for the field of machine learning. At present, object recognition methods based on deep learning have achieved excellent results with the use of large-scale labeled data. However, the data scarcity of novel objects significantly affects the performance of these recognition methods. In this work, we investigated utilizing knowledge reasoning together with visual information in the training of a novel object detector. We trained a detector to project the image representations of objects into an embedding space. Knowledge subgraphs were extracted to describe the semantic relations of the specified visual scenes. Spatial relationships, functional relationships, and attribute descriptions were defined to realize the reasoning of novel classes. The designed few-shot detector, named KR-FSD, is robust and stable to variation in the number of shots of novel objects, and it also has advantages when detecting objects in a complex environment, due to the flexible extensibility of KGs. Experiments on the VOC and COCO datasets showed that the performance of the detector increased significantly when the novel class was strongly associated with some of the base classes, due to the better knowledge propagation between the novel class and the related groups of classes.


Introduction
The application of artificial intelligence technology in specific industrial contexts has become increasingly common [1]. Object detection is the basis of many computer vision tasks, such as instance segmentation, image captioning, and object tracking [2]. Object detection aims to find all objects of interest in an image by determining their categories and locating their positions. Due to the varying appearance and posture of different objects, coupled with interference from imaging light, occlusion, and other factors, object detection has always been a challenging problem. Driven by big data, deep learning models can be effectively trained with the help of abundant annotated data. In recent years, the performance of object detection algorithms based on deep learning has improved consistently. However, such methods show obvious shortcomings in open and complex scenes, partly because of a lack of labeled data. Insufficient labeled data leads to overfitting of the trained model. Although simple data augmentation and regularization techniques can alleviate this problem, it has not been completely solved.
Since data in the real world follow a long-tailed distribution, object detection in the open world is an urgent and difficult problem, and performance is often degraded by the scarcity of new data. Human beings can make good use of experience and knowledge to learn to solve new problems with the help of only a few examples. Few-shot learning aims to learn like human beings, by making use of prior knowledge and only a small number of samples of new problems. In recent years, many studies [3–9] have addressed this problem. The concept of few-shot learning first emerged from the field of computer vision [11] and has attracted extensive attention in recent years. There are many algorithmic models with excellent performance on image recognition tasks [12,13], such as the well-known prototypical network [14] and matching network [15]. Methods based on meta-learning not only train the model on the target task but also learn meta-knowledge from many different tasks. The meta-knowledge is used to adjust the model so that it can converge quickly when facing a new task.
In the task of few-shot image recognition, Gregory et al. [16] designed a Siamese (twin) network, with identical structure and shared weights, to extract features from two images and calculate their similarity. The relation network proposed by Flood et al. [17] replaced the predefined fixed similarity metric with a learnable nonlinear similarity function trained by a neural network.

Knowledge Graphs
In daily life, if we know in advance some static attributes of new things, such as color, texture, and shape, as well as relational attributes, such as their relationships with easily recognizable objects of base classes, it becomes easier to learn those new things. Therefore, when visual information is difficult to obtain, this explicit relational reasoning becomes more important. Such relationships can be constructed through knowledge graphs (KGs). The definition of a knowledge graph is usually based on heuristic methods over common-sense knowledge bases [18,19]. For multi-label recognition, [20] provided an object co-occurrence-based knowledge graph. Ref. [21] provided a reasoning method over knowledge graphs and showed that reasoning over knowledge graphs can derive conclusions from existing data. An increasing number of KGs have been constructed and published recently, by both academia and industry, such as Google Knowledge Graph, Microsoft Satori, and Facebook Entity Graph [22–24].

Object Detection
The challenges of object detection include, but are not limited to, the following aspects: different viewpoints, illumination and intra-class changes, scale changes, object rotation, dense and occluded object detection, small objects, and accurate object positioning [2]. For the detection of scarce objects, or objects under given conditions, the conventional detection model usually struggles to achieve ideal accuracy due to the lack of labeled data.
Some works [25–28] have focused on the problem of detecting objects in limited-data scenarios. LSTD [3] proposed a method of promoting the transfer of knowledge from the source domain to the target domain. RepMet adopted distance metric learning classifiers in the RoI classification head [4]. MSPLD proposed to iterate between model training and high-confidence sample selection [5]. Meta R-CNN and FSRW proposed using a per-class attention vector to reweight the feature maps of the corresponding class [6,7]. MetaDet used meta-level knowledge about model parameter generation to deal with category-specific components of new classes [8]. In FSOD, the similarity between a small support set and the query set was explored to detect new objects [9].
The performance of a few-shot detector is greatly affected by the scarcity of novel objects. However, the semantic relationship between the novel objects and the base objects is constant [10]. This kind of semantic relationship can be easily extracted from a knowledge graph of the real world. Therefore, we proposed a few-shot object detection method based on knowledge reasoning, shown in Figure 1, to detect and infer novel objects when some basic properties of the novel objects and the relationships with base objects are provided in advance.
We summarized the contributions as follows: (1) A few-shot object detection method based on knowledge reasoning was proposed. It applied knowledge graphs together with the visual information to the novel object detection. (2) We designed a general expression pattern of knowledge graphs, which can be flexibly applied to express the relationship between visible objects, and has good scalability. (3) By using GNN, a novel object can be recognized by the method of knowledge reasoning. The proposed methodology achieves state-of-the-art performance on object detection.

Few-Shot Object Detection
The set of known object classes is denoted as U = {C_1, C_2, ..., C_N}, where N is the number of recognized object classes in the image. We assume that unknown object classes exist in the image; their set is denoted as V = {C?_1, C?_2, ..., C?_j, ...}. It is assumed that the input image Im contains K_1 object instances with their class labels and locations, and K_2 unknown object instances with their locations only. The i-th object instance is denoted as O_i = [l_i, x_i, y_i, w_i, h_i], where l_i ∈ U and x_i, y_i, w_i, h_i denote the bounding-box center coordinates, width, and height, respectively.
The j-th unknown object instance is denoted as O?_j = [l?_j, x_j, y_j, w_j, h_j], where l?_j ∈ V and x_j, y_j, w_j, h_j denote the bounding-box center coordinates, width, and height, respectively.
In a dataset of novel classes, the number of objects for each class is k for the k-shot detection task. The few-shot detection model is built on a two-stage detection framework; at the second stage, the labels of some uncertain object instances can be inferred via KGs.
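The k-shot setup described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline; the flat `(image_id, class_label, bbox)` annotation format is an assumption for the sketch.

```python
import random
from collections import defaultdict

def build_k_shot_set(annotations, novel_classes, k, seed=0):
    """Sample at most k annotated object instances per novel class.

    `annotations` is a list of (image_id, class_label, bbox) tuples —
    a simplification of a real detection dataset's annotation format.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for ann in annotations:
        if ann[1] in novel_classes:
            by_class[ann[1]].append(ann)
    support = []
    for cls in novel_classes:
        pool = by_class[cls]
        # Take k instances when available, otherwise all of them.
        support.extend(rng.sample(pool, min(k, len(pool))))
    return support
```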

Few-Shot Detector
There are two training phases for a typical few-shot detector: the base training phase on a base dataset, and the fine-tuning phase on the union of the base dataset and a novel dataset. Differing from these methods, we designed a CNN model to detect the known object instances with their class labels and locations, and to further infer the class labels of novel objects through knowledge reasoning. Compared with one-stage detectors, two-stage detectors have better open-set performance. The framework of our detector is shown in Figure 2, where Faster R-CNN [29] was chosen as the baseline.

The role of the Region Proposal Network (RPN) is to generate numerous candidate anchors (a set of candidate bounding boxes on the image), and then to determine whether the region covered by each anchor contains a foreground object or only background. Feature maps are generated by the convolution layers, and each point of the feature maps corresponds to multiple anchors. There are k anchors in total, and each anchor must be classified as foreground or background. The foreground anchors are obtained by softmax classification; that is, the candidate region boxes are preliminarily extracted.
Each anchor has four position offsets corresponding to [x, y, w, h], which are corrected by bounding-box regression. In effect, the RPN performs preliminary object detection and localization.
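The offset correction can be sketched with the standard Faster R-CNN delta parameterization; the exact encoding below is the common convention and is an assumption, not necessarily the authors' implementation.

```python
import math

def apply_deltas(anchor, deltas):
    """Apply RPN regression offsets (dx, dy, dw, dh) to an anchor.

    The anchor and the returned box are (cx, cy, w, h). Center offsets
    are scaled by the anchor size; width/height are scaled exponentially,
    following the usual Faster R-CNN box parameterization.
    """
    cx, cy, w, h = anchor
    dx, dy, dw, dh = deltas
    return (cx + dx * w, cy + dy * h, w * math.exp(dw), h * math.exp(dh))
```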

Knowledge Graph
A knowledge graph aims to describe various entities and their relationships in the real world. Entities refer to things that are distinguishable and independent. The semantic knowledge graph is defined as G_K:
G_K = (E, R, AT) (1)
where E = {e_1, e_2, ..., e_N} is the entity set and N is the number of entities in the knowledge graph; R = {r_1, r_2, ..., r_M} is the relationship set and M is the number of relationships in the knowledge graph; and AT is the attribute set of entities. The attribute set of the i-th entity is defined as the following:
AT_i = {at_1, at_2, ..., at_nu} (2)
where nu is the number of attributes of the i-th entity. Two basic forms of R in the knowledge graph are expressed with Formulas (3) and (4):
r_u(e_i, e_j) (3)
at_v(e_k) (4)
There is r_u(e_i, e_j) when e_i and e_j satisfy relation r_u, and there is at_v(e_k) when the entity e_k has been assigned the value of attribute at_v.
In the knowledge graph, a relation is a function that maps graph nodes (entities, semantic classes, attribute values) to Boolean values.
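The two basic forms of R — relation facts r_u(e_i, e_j) and attribute assignments at_v(e_k) — can be captured by a minimal container. The class and method names here are ours, chosen for illustration.

```python
class KnowledgeGraph:
    """Minimal container for entities E, relation facts R, and attributes AT."""

    def __init__(self):
        self.entities = set()    # E
        self.relations = set()   # triples (r_u, e_i, e_j), Formula (3)
        self.attributes = {}     # e_k -> set of attribute values, Formula (4)

    def add_relation(self, r, e_i, e_j):
        self.entities.update((e_i, e_j))
        self.relations.add((r, e_i, e_j))

    def add_attribute(self, e, at):
        self.entities.add(e)
        self.attributes.setdefault(e, set()).add(at)

    def holds(self, r, e_i, e_j):
        # A relation maps graph nodes to Boolean values.
        return (r, e_i, e_j) in self.relations
```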

Scene Graph
In the object detection task, the scene graph describing semantics is defined as the following:
G_S = (O, C, Edge) (5)
where O is the set of object instances, C denotes the set of object classes, and Edge is the set of edges.
According to the definition of a knowledge graph, the attribute set of object instances is a subset of AT, and the relationship set of object instances is a subset of R. We describe an object instance with a triplet, shown as (6).
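The scene graph G_S = (O, C, Edge) and the triplet view of an object instance — its class, its attributes (a subset of AT), and its relation edges (a subset of R) — can be sketched as follows. The field and function names are illustrative, not from the paper.

```python
from dataclasses import dataclass, field

@dataclass
class ObjectInstance:
    """One node of the scene graph, viewed as a triplet:
    class label, attribute values, and relation edges."""
    cls: str
    attributes: dict = field(default_factory=dict)
    edges: list = field(default_factory=list)  # (relation, other_instance_id)

def build_scene_graph(instances):
    """Collect the sets O, C, and Edge from (instance_id, instance) pairs."""
    O = dict(instances)
    C = {inst.cls for inst in O.values()}
    Edge = {(src, r, dst) for src, inst in O.items() for (r, dst) in inst.edges}
    return O, C, Edge
```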

Object Detection
Given the scene graph G_S of the image Im, the object instances to be detected are represented by a set of candidate bounding boxes denoted as B, with the map γ : O → B. The initial knowledge graph is built and denoted by an adjacency matrix M_ob and a feature matrix M_AT, where M_ob ∈ {0, 1}^(n×n) and M_AT = [at_1, ..., at_n]^T ∈ R^(n×h).
A function f(·) is defined upon a graph neural network (GNN) to learn the graph; it can be implemented by the method in [30]. For the l-th layer of the GNN, the weight matrix is defined as W(l), and M_AT is used as the initial node features. The message passing function P(·) has the following structure: L(l+1) = P(M_ob, L(l), W(l)), where L(l+1) is the node embedding after l layers of the GNN, and L(1) = M_AT.
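One common choice for the message passing function P(·) is a GCN-style update ReLU(Â L(l) W(l)) with a symmetrically normalized adjacency. The sketch below makes that assumption; it is not necessarily the exact operator used in [30].

```python
import numpy as np

def gnn_layer(M_ob, L, W):
    """One message-passing step: L(l+1) = ReLU(A_norm @ L(l) @ W(l)).

    M_ob is the n×n adjacency matrix, L the n×h node features
    (initialized with M_AT), and W the layer weight matrix.
    """
    A_hat = M_ob + np.eye(M_ob.shape[0])               # add self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_hat.sum(axis=1)))
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt           # symmetric normalization
    return np.maximum(A_norm @ L @ W, 0.0)             # ReLU
```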

Reasoning Based on Knowledge Graph
A knowledge subgraph is a data structure that describes the semantics of a specific scene. It encodes object instances, attributes of objects, and relationships between objects. The simplest way to extract the knowledge subgraph of a scene from the large knowledge graph describing the objective world is to retrieve the subgraph in the KG according to the recognized object instances and their attributes. The remaining available visual information is used to supplement the knowledge subgraph to enhance the semantic expression of the scene. The knowledge subgraph of Figure 1 is shown in Figure 3.

Two types of relationships are defined in the knowledge subgraph: R_1 is the set of positional relationships and R_2 is the set of functional relationships, where R_1 ∪ R_2 = R and R_1 ∩ R_2 = ∅. Figure 3 shows the knowledge subgraph of Figure 1. In this example, the object instances and their attributes are shown in Table 1, and the relationships between instances are shown in Table 2.
Table 1. The object instances and their attributes in Figure 1.

Relationships Map
Table 2. The relationships between instances in Figure 1.
r_1 "on": r_1(people1, places), r_1(people2, places)
r_2 "over": r_2(object1, places)
r_3 "hold": r_3(people2, object2)
r_4 "wear": r_4(people1, shoes)
In Figure 3, the object instances (people 1, people 2, places, shoes) are easily detected, and most of the attributes can be recognized (the style of places, the style of shoes, the gender of people 1). Although two object instances (object 1, object 2) cannot be recognized and some attributes (the gender of people 2; the color, shape, and size of object 2) are unknown, we can clearly see that "people 2 holds object 2" and "object 1 is over places". M_ob and M_AT are defined as Formulas (11) and (12), respectively.
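The reasoning step above — using observed relations such as "people 2 holds ?" to narrow down what the unknown instance can be — can be illustrated with a toy scorer. This is only a sketch of the idea; the paper's actual inference is performed by the GNN, and the function name and scoring rule here are ours.

```python
def infer_label(candidates, observed_relations, kg_facts):
    """Score candidate classes for an unknown instance.

    Each observed relation is a triple like ("hold", "person", "?"),
    where "?" marks the unknown instance. A candidate scores one point
    for every observed relation that becomes a known fact in the
    knowledge graph when "?" is replaced by the candidate class.
    """
    def score(cls):
        return sum(
            (r, a if a != "?" else cls, b if b != "?" else cls) in kg_facts
            for (r, a, b) in observed_relations
        )
    return max(candidates, key=score)
```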

Space Projection
We projected the visual features into the constructed semantic space to recognize objects based on both visual information and semantic relations. In the second stage of the two-stage object detector, the extracted feature vectors of region proposals are forwarded to a classification subnet and a regression subnet. In the classification subnet, the feature vector is transformed into a d-dimensional vector v and forwarded through fully connected layers. Then, v is multiplied by a learnable weight matrix W ∈ R^(n×d) to produce a probability distribution, which is shown in (12):
p = softmax(Wv + b) (12)
where n is the number of classes and b ∈ R^n is a learnable bias vector. Cross-entropy loss is used during training.
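The final projection of the classification subnet — a linear map followed by softmax over the n classes — can be sketched as:

```python
import numpy as np

def classify(v, W, b):
    """Project a d-dimensional region feature v into class probabilities.

    Computes softmax(W @ v + b) with W in R^{n×d} and b in R^n,
    matching the description of the classification subnet.
    """
    logits = W @ v + b
    logits -= logits.max()        # subtract max for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum()
```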
To reduce the domain gap, semantic embeddings are needed. Drawing on the transformer, we implemented a dynamic graph with a self-attention architecture [31]. For a new class, it is only necessary to insert the corresponding embeddings of the new class and fine-tune the detector, because the graph is variable and is constructed from the word embeddings.

Datasets
To evaluate our method, we performed experiments on the VOC [32] and COCO [33] datasets. Before training the few-shot detector, we removed the novel classes from the dataset used to pretrain the classification backbone, to guarantee that the pretrained model had not seen these novel classes. Corresponding to the novel classes in VOC, the WordNet IDs to be removed are shown in Table 3.
In the COCO dataset, since the classes follow a long-tailed distribution, we selected the data-scarce classes on the distribution tail as the novel classes, and used Google Knowledge Graph, Microsoft Satori, and Facebook Entity Graph as the base from which to extract the knowledge subgraph for a specific scene.

Implementation Details
We trained the KR-FSD on the base of Faster R-CNN with Stochastic Gradient Descent (SGD) and a batch size of 16. In the base training phase, the learning rate was set to 0.02, the momentum to 0.9, and the weight decay to 0.0001. The learning rate was set to 0.001 in the fine-tuning phase. We sampled each input image by first choosing the base set or the novel set with 50% probability each, and then randomly selecting an image from the chosen set.
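The fine-tuning sampling scheme described above can be sketched as:

```python
import random

def sample_image(base_set, novel_set, rng=random):
    """Pick the base or novel set with equal probability, then draw an
    image uniformly at random from the chosen set."""
    chosen = base_set if rng.random() < 0.5 else novel_set
    return rng.choice(chosen)
```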

Results on VOC and COCO Datasets
In Table 4, we show the performance (AP50) of the novel classes on the VOC dataset. We used the same data splits and a fixed list of novel samples provided by [6]. In the VOC dataset, 5 of the 20 object classes were selected as novel classes; the remaining 15 classes were base classes. Each novel class had only a few annotated object instances (1, 5, or 10). Compared with the state-of-the-art methods (FSRW [6] and Meta R-CNN [7]), our approach achieves superior performance.
Table 5 shows the averaged APs of our method on the COCO dataset. In the COCO dataset, the minival set was used for testing and the rest was used for training. Twenty of the 80 classes were selected as novel classes; the remaining 60 classes were the base. The novel classes overlap with the classes in VOC. Each novel class had only a few annotated object instances (10, 20, or 30).
Table 6 shows the ablative performance, where mAP is evaluated at an IoU threshold of 0.5 (AP50). KR denotes the knowledge reasoning component.

Experiments on Relation Reasoning
In order to verify the effectiveness of KG-based reasoning in detection, we selected three groups of data for effectiveness experiments. In each data group, the performance of each class was measured by selecting one class as the novel class and treating the remaining classes as base classes.
In the first experiment, we primarily measured the performance of sofa, TV, cat, and chair. The object instances (such as TV, chair, and cat) in many scenes are strongly associated with the sofa: we frequently see "TV and sofa together", "cat sitting on chair", or "cat sitting on sofa". In KGs, the distances between these entities (sofa and TV, cat and sofa, cat and chair) are shorter. Because these classes are closely related, the performance of all classes improves significantly when knowledge reasoning is integrated into the visual features. The performance is shown in Figure 4.
In the other two experiments, sofa has almost no association with the other classes (mbike, bus, car, bird, cow, horse); therefore, the distances between sofa and the other entities are longer. However, we often see bus, car, and mbike at the same time, and sometimes see cow and horse together. Figures 5 and 6 show that almost all the performances of the correlated classes increase slightly, due to the better knowledge propagation between the two groups of classes.
Figure 6. The third experiment on relation reasoning. Sofa has almost no association with the other classes (bird, horse, cow). The performances of the correlated classes (horse, cow) increase slightly.

Experimental Analysis
This work combines visual information and knowledge reasoning in order to recognize novel classes. Experiments on the VOC dataset showed that performance increased slightly at lower shot levels, such as 1-shot, and was competitive with previous state-of-the-art methods at 5-shot and 10-shot, as shown in Table 4. The average APs of the novel classes on the COCO dataset increased slightly at 10-shot, 20-shot, and 30-shot, as shown in Table 5. Knowledge reasoning proved to be meaningful for recognition tasks, as shown in Figures 4-6. When the novel classes were strongly associated with the base classes, the performance noticeably increased, because there was better knowledge propagation between the novel classes and the related groups of classes.

Conclusions
In this paper, we proposed a few-shot object detection method based on knowledge reasoning. Since the semantic relations between the base classes and the novel classes in some scenes can be inferred by KGs, it is helpful to learn novel concepts by applying knowledge reasoning together with the available visual information. We built a few-shot detection model on the base of Faster R-CNN, and applied reasoning to some uncertain object instances at the second stage. To demonstrate the performance, we carried out experiments on the VOC and COCO datasets. Compared with state-of-the-art methods, our approach achieved better results at several few-shot detection settings. In future work, we will carry out further research on few-shot recognition and detection driven by knowledge and data.
Author Contributions: J.W. and D.C. designed the study; J.W. analyzed and interpreted the data; J.W. conducted the experiments; J.W. and D.C. provided technical and material support. All authors contributed to the writing of the manuscript and its final approval. All authors have read and agreed to the published version of the manuscript.
Funding: This research received no external funding.

Data Availability Statement:
The data are available from the authors upon reasonable request to the corresponding author.

Acknowledgments:
The authors thank the anonymous reviewers for their valuable comments and suggestions.

Conflicts of Interest:
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Abbreviations
The following abbreviations are used in this manuscript: