Object Detection Based on Multiple Information Fusion Net

Abstract: Object detection has been playing a significant role in computer vision for a long time, but it is still full of challenges. In this paper, we propose a novel object detection framework based on the relationships among different objects and the scene-level information of the whole image, to cope with the problem that some strongly correlated objects are difficult to recognize. Our motivation is to enrich the semantics of the object detection feature with a scene-level information branch and a relationship branch. Our framework makes three important changes over traditional detection methods: the representation of relationships, scene-level information as prior knowledge, and the fusion of these two kinds of information. Extensive experiments are carried out on the PASCAL VOC and MS COCO databases. The experimental results show that the detection performance can be improved by introducing relationship and scene-level information, and our proposed model achieves better performance than several classical and state-of-the-art methods.


Introduction
Object detection is a hot topic in the fields of computer vision and machine learning due to its wide applications in autonomous driving, robotics, video surveillance, pedestrian detection, and so on. Classical object detection techniques are mainly based on hand-crafted features and can be divided into three steps: (1) target region selection; (2) feature extraction; (3) classification. In the first step, the sliding-window strategy [1], which exhaustively searches for candidate regions using windows with different scales and aspect ratios, is widely adopted. In the second step, the candidate regions obtained in the first step are analyzed; several techniques can be used for feature extraction, such as the scale-invariant feature transform (SIFT) [2], the histogram of oriented gradients (HOG) [3] and speeded-up robust features (SURF) [4]. In the third step, the candidate regions are classified according to the extracted features using classifiers such as the support vector machine (SVM) [5] and AdaBoost [6]. Although the classical methods have been adopted in some object detection problems, there are still limitations that hinder breakthroughs in speed and accuracy. Firstly, since the sliding-window strategy captures many candidate regions in the original image, and the features of these regions need to be extracted one by one, the classical approaches are time-consuming. Secondly, the classical methods may lack robustness because artificially designed features are sensitive to variance in the morphology, illumination and occlusion of objects. Moreover, most detection methods only use the feature of the object itself for detection. Recently, some researchers have realized the importance of relations and proposed methods [29][30][31] that achieve better detection results by exploring the relationships between objects. In ION [32], Bell et al.
proposed spatial recurrent neural networks (RNNs) for exploring contextual information across the entire image. Xu et al. put forward a scene graph generation approach based on iterative message passing [33]; the network regards a single object as a point in a topology, the relationships between objects are considered as edges connecting points, and by passing information between the edges and points it is shown that the relationships between objects have a positive impact on detection. Georgia et al. proposed a human-centric model called InteractNet [34], in which the human is regarded as the main clue for establishing relationships with surrounding objects; InteractNet indicates that a person's external behavior can provide powerful information for locating the objects they are interacting with. Liu et al. proposed a structure inference net (SIN) [35], which explores the structural relationships between objects for detection. However, SIN only takes the spatial coordinates of object proposals into account, while the appearance features of the proposals are neglected. Han et al. presented a relation network [36], which considers both the appearance and geometry features of object proposals for relation construction. Nevertheless, the scene-level feature, which can provide rich context information for object detection [37], is ignored in the relation network.
This paper proposes a novel object detection algorithm based on a multiple information fusion net (MIFNet). Compared with existing techniques, our algorithm not only adaptively establishes relationships between objects through an attention mechanism [38], but also introduces scene-level information to make the proposed approach richer in semantics. In MIFNet, the relationships between an object and all other objects are obtained by relation channel modules. Besides, by introducing the scene-level context [21,39,40], the proposed network can enrich the object feature with scene information. The experimental results on the PASCAL VOC [41] and MS COCO [42] databases demonstrate the effectiveness of the proposed algorithm.
The paper is structured as follows. The related work is introduced in Section 2. The proposed MIFNet is described in Section 3. The experimental results are given in Section 4. The conclusion is provided in Section 5.

Related Work
Context information: In real life, an object is unlikely to exist alone. Visual objects occur in particular environments and usually coexist with other related objects [43]. When an object's appearance feature is insufficient because of small object size, occlusion, or poor image quality, proper modeling of context will facilitate the object detection and recognition task. Context information has been applied in many methods to enhance detection performance [44][45][46][47][48][49], and can be roughly divided into two categories [49,50]: global information [32,51], which refers to image-level or scene-level information, and local information [35,36], which considers object relationships or the interaction between an object and its surrounding area. Both global and local context information have been shown to have a positive impact on object detection. Our proposed MIFNet utilizes both global context (scene-level information) and local context (object relationships) to make the object's appearance feature richer.
Attention mechanism: The attention mechanism in deep learning is inspired by human attention in cognition and has been widely used in natural language processing [52]. In an attention module, an individual element can be influenced by aggregating information from other elements, and the dependency between elements is modeled without excessive assumptions about their locations and feature distributions. The aggregation weights are learned automatically, driven by the task goal. Recently, the attention mechanism has also been successfully applied to vision problems [37,53].
Attention mechanism can be represented as follows:
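The equation that followed was lost in extraction. The standard scaled dot-product attention of [38,52], which the description above matches, is one plausible reconstruction (the original notation is not recoverable):

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

where Q, K and V are the query, key and value matrices and d_k is the key dimension.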

The Proposed Method
The framework of the proposed multiple information fusion net (MIFNet) is shown in Figure 1. In MIFNet, the feature map of an input image is first obtained through a feedforward convolutional network (VGG or ResNet). The feature map is then used in two ways: one copy serves as part of the input of the first branch, while the other is used to generate region proposals through the RPN and then serves as the input of the second branch. In the first branch (I), a series of operations is performed on the feature map of the entire image to obtain the scene-level information, which is the input of the scene GRU (Gated Recurrent Unit, upper part of III). In the second branch (II), the attention mechanism is utilized to establish object relationships adaptively. For the purpose of classifying and regressing regions of interest (RoIs), the second branch uses not only the appearance feature extracted by the convolutional layers and the coordinate information of the object, but also the information of all surrounding objects, as the input of the relation GRU (lower part of III). In the message passing module (III), the scene GRUs and relation GRUs exchange information with each other in order to keep up with new information. In the last stage, we concatenate the information obtained by these two GRUs to refine the position of the corresponding RoI and predict the category of the object.

Scene-Level Information Processing Module
Contextual information is important for accurate object recognition. To extract the scene-level information, the image feature is first obtained by the convolutional network (VGG or ResNet) as the input of the first branch. Secondly, the image feature obtained by the RoI-pooling layer and the feature obtained by the RPN (without scene information) are concatenated as the input of a convolutional layer. Through this concatenation, the information about potential objects becomes richer; besides, the weight of potential objects can also be increased by training. In the end, the output f_s of the first network branch, which is called the scene feature, is input to the scene GRU to select information and update the object feature.

Relationship Module
In most previous object detection methods based on the convolutional neural network [7,16], each object is identified independently, and the relationship of objects is neglected. To overcome this limitation, the proposed approach models the relationship of objects by groups. That is, the feature vector of an object is obtained by fusing the features of itself and other objects to enrich the information, as shown in Figure 2.
In Figure 2, given an input set of N objects {(f_t^n, f_c^n)}, f_t^n is the original appearance feature of the nth object extracted by the convolutional neural network, and f_c^n denotes the location feature of the nth object, composed of the 4-dimensional feature of the object bounding box. In our study, the bounding box feature comprises the width (w), height (h) and center coordinates (x, y) of the box. The relation channel is a module that handles relationships among different objects, and N_r is the number of channels (N_r = 64). Through the object relation module, the features f_c^1(n), f_c^2(n), …, f_c^{N_r}(n), which fuse the location information of all surrounding objects, can be obtained. For the purpose of obtaining the output f_t^{n'} that is finally sent into the relation GRU, we concatenate the vectors of all channels f_c^1(n), f_c^2(n), …, f_c^{N_r}(n). Because the processing mechanisms of the relation channel modules are the same, we take one relation channel module as an example to explain how it works. Figure 3 shows the process of one relation channel module. Firstly, the dot product operation is applied to obtain the appearance weight w_t^{mn} between the mth and nth objects, as shown in Equation (2).
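Equation (2) itself did not survive extraction. Following the relation-network formulation of [38] that the surrounding text describes, it can plausibly be reconstructed as:

```latex
w^{mn}_{t} = \frac{\left(W_K f^{m}_{t}\right)\cdot\left(W_Q f^{n}_{t}\right)}{\sqrt{d_k}}
```

where the scaling by the subspace dimension d_k follows [38] and is an assumption here; the dot product itself is stated in the text.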
where W_K and W_Q are matrices that map the original appearance features f_t^m and f_t^n into subspaces, and · denotes the dot product, which measures the degree of matching between W_K f_t^m and W_Q f_t^n.
Secondly, the location weight w_g^{mn} is calculated by Equation (3).
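Equation (3) is missing from the extracted text; from the description that follows, it can plausibly be reconstructed in the style of [38] as:

```latex
w^{mn}_{g} = \mathrm{ReLU}\!\left(W_g\,\varepsilon_g\!\left(f^{m}_{c}, f^{n}_{c}\right)\right)
```

with the geometry embedding ε_g and the projection W_g described below.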
where f_c^m and f_c^n are geometry features that contain six items of relative position information: log(|x_m − x_n|/w_n), log(|y_m − y_n|/h_n), log(w_m/w_n), log(h_m/h_n), (x_m − x_n)/w_n and (y_m − y_n)/h_n, in which w_n, h_n, x_n and y_n are the width, height and center coordinates of the nth object. ε_g is a function based on sine and cosine that embeds the geometry features into a high-dimensional space [38]. We then use W_g to convert the embedded vector into a scalar weight. The ReLU activation function is utilized to ensure that only objects with a certain geometric relationship participate in the relationship calculation.
Next, the relationship weight is obtained by Equation (4).
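Equation (4) is also missing; the geometry-gated softmax used in [38], which matches the normalization described below, is one plausible reconstruction:

```latex
w^{mn} = \frac{w^{mn}_{g}\,\exp\!\left(w^{mn}_{t}\right)}{\sum_{k} w^{kn}_{g}\,\exp\!\left(w^{kn}_{t}\right)}
```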
where the relationship weight w^{mn} represents the impact of the mth object on the nth object, and softmax is employed for normalization. Finally, Equation (5) can be utilized to obtain a feature f_c^{N_r}(n) that incorporates the influence of the surrounding objects.
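Equation (5) can plausibly be reconstructed from the surrounding description as a weighted sum over the linearly transformed appearance features (the form used in [38]):

```latex
f^{N_r}_{c}(n) = \sum_{m} w^{mn}\left(W_V\, f^{m}_{t}\right)
```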
where W_V is used to transform the original appearance feature f_t^m linearly. Equation (5) integrates the information of the object and the other objects into the original appearance feature. The output f_c^{N_r}(n) is the weighted sum of the initial appearance features of the other objects, so it contains both the object's original appearance feature and the features of all objects around it.
In the end, through the relation channel module, the feature f_t^{n'}, which merges the features of multiple channels, can be obtained by Equation (6).
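Equation (6) is missing; a plausible reconstruction, assuming the residual fusion of [38] together with the fully connected merge described below, is:

```latex
f^{n\prime}_{t} = f^{n}_{t} + W_f\,\mathrm{Concat}\!\left[f^{1}_{c}(n), \ldots, f^{N_r}_{c}(n)\right]
```

where W_f is a hypothetical label for the fully connected fusion layer; its exact form is not recoverable from the text.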
where the fused feature f_t^{n'} includes the extracted original appearance feature f_t^n (the initial appearance feature after the convolutional layers) and the relationship features f_c^1(n), …, f_c^{N_r}(n) (each fusing the location information of all surrounding objects under a particular channel). In the relation channel, the features of the other objects are mixed together to identify the relationship between the current object and the other objects, and finally merged with the original appearance feature through a fully connected network.
The final output f_t^{n'} is the input of the relation GRU.
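The relation channel described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: `relation_channel` is a hypothetical name, the learned matrices W_K, W_Q, W_V and W_g are randomly initialized stand-ins, and the sinusoidal embedding ε_g is omitted (the raw geometry features are projected directly).

```python
import numpy as np

def relation_channel(f_t, f_c, d_k=16, seed=0):
    """One relation channel (Eqs. (2)-(5)), as a sketch.

    f_t: (N, d) appearance features; f_c: (N, 4) boxes as (x, y, w, h).
    W_K, W_Q, W_V, W_g stand in for learned weights (randomly set here),
    and the sinusoidal embedding of the paper is omitted.
    """
    rng = np.random.default_rng(seed)
    N, d = f_t.shape
    W_K = rng.standard_normal((d_k, d)) / np.sqrt(d)
    W_Q = rng.standard_normal((d_k, d)) / np.sqrt(d)
    W_V = rng.standard_normal((d, d)) / np.sqrt(d)
    eps = 1e-6

    # Eq. (2): appearance weight = dot product of the projected features,
    # scaled by sqrt(d_k) as in scaled dot-product attention.
    K, Q = f_t @ W_K.T, f_t @ W_Q.T
    w_t = (K @ Q.T) / np.sqrt(d_k)              # w_t[m, n]

    # Eq. (3): geometry weight from the six relative box features,
    # projected to a scalar and passed through ReLU.
    x, y, w, h = f_c.T
    dx, dy = x[:, None] - x[None, :], y[:, None] - y[None, :]
    geo = np.stack([np.log(np.abs(dx) / w[None, :] + eps),
                    np.log(np.abs(dy) / h[None, :] + eps),
                    np.log(w[:, None] / w[None, :]),
                    np.log(h[:, None] / h[None, :]),
                    dx / w[None, :],
                    dy / h[None, :]], axis=-1)  # (N, N, 6)
    W_g = rng.standard_normal(6)
    w_g = np.maximum(geo @ W_g, 0.0)            # ReLU gate

    # Eq. (4): softmax over m, gated by the geometry weight
    # (max subtracted for numerical stability).
    num = w_g * np.exp(w_t - w_t.max(axis=0, keepdims=True))
    w_mn = num / (num.sum(axis=0, keepdims=True) + eps)

    # Eq. (5): weighted sum of linearly transformed appearance features.
    return w_mn.T @ (f_t @ W_V.T)               # (N, d)
```

In a full model, one such channel would be run N_r times with separate weights and the outputs concatenated, as the text describes.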

Message Passing Module
As discussed previously, context information is important for accurate object detection. For example, in Figure 4a, if the road is considered as the global or scene-level information, the objects in this image are unlikely to be detected as ships or planes, since it is generally impossible for them to appear in a road scene. Similarly, in Figure 4b, when a dinner table appears, the probability of detecting chairs increases, because dinner tables and chairs usually appear together. Thus, the Gated Recurrent Unit (GRU) [54] is utilized in this study. Similar to the long short-term memory (LSTM) model [55], the GRU also adjusts the information flow within the unit, but it is more lightweight and effective [35]. In the message passing module, information is continuously passed between the scene GRU and the relation GRU so that useful information can be preserved.
The GRU has only two gates. It combines the input gate and the forget gate of the LSTM into one, called the update gate, which determines how much information from the previous time and the current time is passed on. The other gate is the reset gate, which controls how much past information is forgotten.
In our work, two parallel GRUs are applied to pass information to each other: one is the scene GRU and the other is the relation GRU. The scene GRU receives the whole-image information f_s as its input. The input of the relation GRU is the integrated object information f_t^{n'}, which includes the object's own information and the influence of the surrounding objects on it. We initialize the state h_i of the network with the original appearance feature (without any scene or relation information). Since the processing mechanisms of the scene GRU and the relation GRU are identical, we take the relation GRU as an example to show how the GRU works.
Firstly, the reset gate r_t of the tth moment is calculated by Equation (7), where σ is the logistic sigmoid function, [,] denotes the concatenation of vectors, and W_r is a weight matrix learned through the convolutional neural network. The output of the reset gate r_t determines whether the previous state is forgotten: when r_t is close to zero, the state information h_i of the previous moment is forgotten and the hidden state is reset to the current input. Similarly, the update gate z_t of the tth moment is computed by Equation (8), where z_t determines how much past information continues to be passed on and W_z is a weight matrix; the larger the value of the update gate, the more state information from the previous moment is introduced, and vice versa. The new hidden state h_t can then be obtained through Equation (9), where h_t is determined by the value of the reset gate, W is a weight matrix, and * denotes element-wise multiplication. The actual output h_{i+1} is then computed by Equation (10), in which some of the previous state h_i is passed on and the new hidden state h_t is selectively updated. Through the GRUs, the scene module and the relation module can pass information to each other and constantly update new information.
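The GRU update equations themselves did not survive extraction (only Equation (9) is numbered in the surviving text; the numbering (7), (8) and (10) is an inference). The standard GRU formulation [54], which matches the where-clauses term by term, is:

```latex
r_t = \sigma\!\left(W_r \cdot [h_i, x_t]\right) \\
z_t = \sigma\!\left(W_z \cdot [h_i, x_t]\right) \\
h_t = \tanh\!\left(W \cdot [r_t * h_i,\; x_t]\right) \\
h_{i+1} = (1 - z_t) * h_i + z_t * h_t
```

where x_t is the GRU input (f_s for the scene GRU, f_t^{n'} for the relation GRU).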
In this way, useful information will continue to be passed, and useless information will be ignored. Finally, richer information can be obtained through Equation (11).
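Equation (11) is missing from the extracted text and its exact form is not recoverable; one plausible reconstruction (an assumption, e.g. an element-wise mean of the two GRU outputs) is:

```latex
h_i = \frac{1}{2}\left(h^{i+1}_{s} + h^{i+1}_{r}\right)
```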
where h_s^{i+1} represents the information obtained by the scene GRU, and h_r^{i+1} denotes the information obtained by the relation GRU. The integrated information h_i will be sent into the next GRUs as the new initial state.
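The message passing described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: `gru_step` and `message_pass` are hypothetical names, the weight matrices are stand-ins for learned parameters, and the Equation (11) fusion is approximated by an element-wise mean.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x, W_r, W_z, W):
    """One GRU update: h is the previous state h_i, x is the input
    (the scene feature f_s or the relation feature f_t')."""
    hx = np.concatenate([h, x])
    r = sigmoid(W_r @ hx)                            # reset gate
    z = sigmoid(W_z @ hx)                            # update gate
    h_new = np.tanh(W @ np.concatenate([r * h, x]))  # new hidden state
    return (1.0 - z) * h + z * h_new                 # output h_{i+1}

def message_pass(h, f_s, f_rel, params, steps=2):
    """Scene GRU and relation GRU share parameters (as in the paper) and
    exchange information via the shared state h; the fusion of the two
    outputs is an assumed element-wise mean."""
    for _ in range(steps):
        h_s = gru_step(h, f_s, *params)    # scene GRU
        h_r = gru_step(h, f_rel, *params)  # relation GRU
        h = 0.5 * (h_s + h_r)              # assumed Eq. (11) fusion
    return h
```

Running more passing steps lets scene and relation information repeatedly refine the shared object state, which is the intent of module III.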

Experimental Settings
Databases and evaluation metrics: Our model was evaluated on two databases: PASCAL VOC [41] and MS COCO [42]. PASCAL VOC is a widely used image database for object detection and classification. In our work, the VOC 2007 and VOC 2012 subsets were utilized. The VOC 2007 data set contains 9963 annotated images and 24,640 annotated objects, composed of three parts: train, validation (val), and test sets. VOC 2012 is an updated version of VOC 2007, which includes 11,530 pictures with 20 categories covering people, animals (such as cats, dogs and birds), vehicles (such as cars, ships and planes), and furniture (such as chairs, tables, and sofas). Some examples from the PASCAL VOC database can be seen in Figure 5. The MS COCO database was built by Microsoft and contains 328,000 images with 2000 object labels. Compared with PASCAL VOC, MS COCO includes natural images and images of common objects in daily life, with more complex backgrounds, larger numbers of targets and smaller object sizes. Thus, the object detection task on the MS COCO database is more difficult and challenging. Sample images from MS COCO are shown in Figure 6.
In this study, we followed Ref. [35] in adopting average precision (AP) and mean average precision (mAP) as our evaluation metrics to compare the performances of different approaches. AP, which is derived from precision and recall, is one of the popular metrics for measuring the accuracy of object detectors; it computes the average precision value over recall values from 0 to 1. The precision and recall rates are defined as Precision = TP/(TP + FP) and Recall = TP/(TP + FN), where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively.
Implementation details: In our experiments, the proposed model was implemented based on Faster R-CNN [7], an open-source framework for object detection built on the TensorFlow [56] platform. VGG-16 [57] and ResNet-101 [19] pre-trained on ImageNet [58] were adopted as backbone networks in our model to extract image features.
The newly added fully connected and convolutional layers were randomly initialized from zero-mean Gaussian distributions with standard deviations of 0.01 and 0.001, respectively. The message passing module contained two parallel GRU units with shared parameters. All the parameters of the GRU units were initialized following SIN [35]. Non-maximum suppression (NMS) with intersection over union (IOU) was used for duplicate removal in all experiments.
Stochastic gradient descent (SGD) was applied to fine-tune our network. Each SGD mini-batch was composed of 256 randomly sampled object proposals from two randomly chosen images. In each mini-batch, 25% of the RoIs were selected as foreground from the object proposals that had an IOU overlap with a ground-truth bounding box of at least 0.5. We sampled the remaining RoIs from the object proposals that had a maximum IOU with the ground truth in the interval [0.1, 0.5]. We trained our model on a single NVIDIA GeForce GTX TITAN X GPU with 12 GB of memory. The experimental parameters, training and test times of our MIFNet can be seen in Table 1.

Performance Comparisons
PASCAL VOC database: The performance of our MIFNet was compared with several classical and state-of-the-art object detection methods, including Fast R-CNN [16], Faster R-CNN [7], SIN [35], ION [32], CPF [40], and so on. The experimental results on the VOC 2007 test and VOC 2012 test sets are shown in Tables 2 and 3, respectively. All the experimental results of the comparison approaches are quoted from the corresponding literature. From these tables, the following points can be observed. Firstly, Fast R-CNN and Faster R-CNN are classical two-stage approaches, while SSD is a classical one-stage approach; since the relationships between objects and context information are neglected in them, their performance was inferior to the other approaches. Secondly, by utilizing spatial recurrent neural networks and semantic segmentation, ION and CPF took global contextual information into account; therefore, they outperformed the classical approaches such as Fast R-CNN, Faster R-CNN, and SSD. Thirdly, SIN considered both the scene context information and object relationships; however, the relationships in SIN were established only from the geometric structure of the objects, neglecting the objects' appearance information, so its performance was still worse than that of the proposed MIFNet. At last, the proposed approach leverages the attention mechanism to adaptively establish the relationships between objects, considering both geometric and appearance information, and also introduces scene-level information into the model. Thus, our MIFNet greatly improved the detection accuracy for some small and highly correlated objects (such as chair, boat, plant, and TV) and achieved the best performance on the PASCAL VOC database.
MS COCO database: In order to further verify the effectiveness of our proposed method, the MS COCO database was utilized. The object detection results of different methods on this database are tabulated in Table 4.
In this table, AP is the precision averaged across all object categories and multiple intersection over union (IOU) values from 0.5 to 0.95; AP_50 denotes the mAP at IOU = 0.50, and AP_70 denotes the mAP at IOU = 0.70. Average recall (AR) represents the recall rate averaged over all categories and IOU thresholds; AR_1, AR_10 and AR_100 denote the maximum recall rates for a fixed number (1, 10, 100) of detections per image, and AR_S, AR_M and AR_L represent the recall rates for small (area smaller than 32^2), medium (area between 32^2 and 96^2) and large (area bigger than 96^2) objects, respectively. From Table 4, we can make the following observations. Firstly, since ION takes global context information into consideration, its performance is better than the classical approaches such as Fast R-CNN, Faster R-CNN and YOLOv2. Secondly, SIN outperforms ION, which indicates that object relationships are important for object detection. Finally, the proposed MIFNet performs best on the MS COCO database because it adaptively establishes the relationships between objects from both geometric and appearance information and takes the scene-level information into account. In summary, these observations are consistent with the experimental results on the PASCAL VOC database.
Effectiveness of scene-level information: For the purpose of verifying the effectiveness of each part of our proposed method, some ablation experiments were carried out. Here, we employ VGG-16 as the backbone of our MIFNet. In the first experiment, only the scene-level information is considered to update the object feature. As shown in Tables 5 and 6, applying scene-level information achieves a better mAP of 75.8% compared with the baseline (Faster R-CNN without scene-level information and object relationships) on PASCAL VOC 2007. On the bigger MS COCO database, a better mAP of 23.5% can be obtained.
We find that introducing the scene-level information improves the detection performance for certain categories, including bike, bottle, chair, plant, TV, and so on. In particular, the average precision for plant increases by more than 10%. These results are not surprising, since these categories are usually highly relevant to the context of the scene. From Figure 7, we can clearly see that the detection result in (b), with scene-level information, is more accurate than the result in (a), without scene-level information. This may be because the probability of a plant appearing increases with the introduction of the balcony information.

Table 4. Detection results on MS COCO 2014 minival. Train set: trainval35k (MS COCO train + 35k val). "V" and "R" denote that the model uses VGG-16 and ResNet-101 as the backbone network, respectively. The bold characters represent the best result in each column.
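One way to picture the scene-level branch is as a gated injection of a pooled scene descriptor into every object feature. The sketch below is an assumption-laden illustration, not the paper's exact design: the gate parameters `w_gate` and `b_gate` and the additive update are hypothetical.

```python
import numpy as np

def fuse_scene(roi_feats, scene_feat, w_gate, b_gate):
    """Modulate each ROI feature with a pooled scene-level descriptor.

    roi_feats:  (N, D) per-object features from the detection head
    scene_feat: (D,)   global descriptor of the whole image
    w_gate:     (2*D, D) learned gate weights; b_gate: (D,) bias
    (parameter names and shapes are assumptions for this sketch)
    """
    scene = np.tile(scene_feat, (roi_feats.shape[0], 1))        # one copy per object
    gate_in = np.concatenate([roi_feats, scene], axis=1)
    gate = 1.0 / (1.0 + np.exp(-(gate_in @ w_gate + b_gate)))   # sigmoid gate in (0, 1)
    return roi_feats + gate * scene                             # gated scene message
```

The gate lets each object decide how much scene context to absorb, which matches the intuition above: a "balcony" descriptor can raise the evidence for plant without overwhelming objects that do not need it.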

Effectiveness of Relation and Relation Settings: In the second ablation experiment, the validity of the relationship information is evaluated. Here, the scene-level information is removed from our model and we only use a set of Relation GRUs for object detection. Experiments are performed on the PASCAL VOC and MS COCO databases, respectively. From the experimental results shown in Tables 7 and 8, we can see that the performance of our model with only relationship information is still superior to the baseline (Faster R-CNN), especially for highly correlated objects. Take the detection results in Figure 8 as an example.
It is clear that, due to the introduction of the relation information, the tables and chairs, which are strongly correlated with each other, can be detected more accurately. This indicates that the object relationship is very important for detection.

Through the above experiments, we can clearly see that both the scene-level information and the object relationship are beneficial for detection, and both are indispensable. Figure 9 shows the detection results of some algorithms and our model. It can be seen that our model performs better when detecting small and highly correlated objects (such as driver, table, and chair) due to the message passing between the scene-level information and the object relationship. At the same time, for some objects with a strong correlation with the scene, the detection results are also good (such as the boats in the sea scene and the aeroplanes in the sky scene).
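The relation branch scores object pairs from both appearance and geometry. The following sketch illustrates that idea in a simplified form; the paper's exact parameterization differs, and the scaled dot-product appearance term and log-distance geometric term here are assumptions.

```python
import numpy as np

def relation_weights(app_feats, boxes):
    """Row-stochastic attention over object pairs from appearance and geometry.

    app_feats: (N, D) appearance features; boxes: (N, 4) as (x1, y1, x2, y2).
    """
    d = app_feats.shape[1]
    app = app_feats @ app_feats.T / np.sqrt(d)            # appearance similarity
    cx = (boxes[:, 0] + boxes[:, 2]) / 2.0                # box centres
    cy = (boxes[:, 1] + boxes[:, 3]) / 2.0
    dist = np.hypot(cx[:, None] - cx[None, :], cy[:, None] - cy[None, :])
    geo = -np.log1p(dist)                                 # nearer pairs score higher
    scores = app + geo
    scores -= scores.max(axis=1, keepdims=True)           # numerically stable softmax
    w = np.exp(scores)
    return w / w.sum(axis=1, keepdims=True)
```

Each object's relation message is then a weighted sum of the other objects' features, `w @ app_feats`, which is what a Relation GRU would consume: a nearby, visually consistent table strengthens the evidence for a chair.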
In order to test whether the number of relation modules influences the detection results of our MIFNet, we also conduct an experiment to compare the performance of the proposed model with different numbers of relation modules. As shown in Table 9, we find that as the number of relation modules increases, the detection accuracy of our model gradually decreases. This may be because too many relation modules make the network over-associate two objects; for example, once an object appears near a table, it is detected as a chair regardless of its own features. Therefore, we choose one relation module in our experiments.

Effectiveness of GRU Settings and the inputs of GRU: In our network, multiple parallel GRU units are used to fuse the scene-level context and the object relationship information. In order to study the effectiveness of different GRU settings, several experiments are conducted. Firstly, we build the message passing module with different numbers (1 to 3) of GRU units and test their performance. From the experimental results in Table 10, it can be found that when the number of stacked GRU units increases from 1 to 2, the mAP decreases, and when it increases from 2 to 3, no significant performance change can be observed. This indicates that one stacked GRU is enough for our proposed MIFNet. Then, for the purpose of verifying the effectiveness of the message passing module in our MIFNet, we compare the performance of two methods. The first method feeds the scene-level information and the object relationship information into different GRUs, which is the strategy employed in our approach. The second method concatenates the scene-level information and the object relationship information into one vector and then inputs this vector into a single GRU. From the experimental results in Table 11, the detection performances of the two methods are 77.6% and 76.2%, respectively. It is clear that the first method obtains better detection results, since the different kinds of information can be effectively transmitted to each other through the two groups of GRUs, whereas the second method, which directly concatenates the different information, cannot accomplish this information transmission.
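The two fusion strategies can be contrasted in a minimal numpy sketch. The weight layout and the final averaging of the two parallel hidden states are assumptions for illustration, not the paper's exact design; the GRU equations themselves are the standard update/reset-gate formulation.

```python
import numpy as np

def _sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def gru_step(x, h, p):
    """One standard GRU update; p maps Wz/Uz/Wr/Ur/Wh/Uh to weight matrices."""
    z = _sigmoid(x @ p["Wz"] + h @ p["Uz"])              # update gate
    r = _sigmoid(x @ p["Wr"] + h @ p["Ur"])              # reset gate
    h_tilde = np.tanh(x @ p["Wh"] + (r * h) @ p["Uh"])   # candidate state
    return (1 - z) * h + z * h_tilde

def parallel_fusion(obj_feat, scene_msg, rel_msg, p_scene, p_rel):
    """Strategy 1: scene and relation messages drive two parallel GRUs that
    share the object feature as hidden state; outputs are averaged (assumed)."""
    h_scene = gru_step(scene_msg, obj_feat, p_scene)
    h_rel = gru_step(rel_msg, obj_feat, p_rel)
    return (h_scene + h_rel) / 2.0

def concat_fusion(obj_feat, scene_msg, rel_msg, p_joint):
    """Strategy 2: concatenate both messages and use a single GRU."""
    return gru_step(np.concatenate([scene_msg, rel_msg]), obj_feat, p_joint)
```

In the first strategy, each information source updates the object representation through its own gates, so scene context and object relations can be weighted independently; concatenation forces one set of gates to serve both sources, which is consistent with the lower mAP reported for it above.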

Conclusions
This paper proposed a network that fuses both the scene-level and the relationship information for object detection in images. Compared with other methods, the most important advantage of our approach is that we leverage the attention mechanism to model the relationship between objects adaptively. Moreover, the relationship weights are obtained using not only the geometric structure but also the appearance features of the objects. Finally, the scene-level information is also considered in our model. Two widely used databases were employed in our experiments. From the experimental results, we can see that, by fusing the scene-level and object relationship information, our proposed MIFNet outperforms several classical and state-of-the-art approaches. Furthermore, some ablation experiments were also carried out to verify the effectiveness of each component of MIFNet.