Deep Spatial-Temporal Joint Feature Representation for Video Object Detection

With the development of deep neural networks, many object detection frameworks have shown great success in the fields of smart surveillance, self-driving cars, and facial recognition. However, the data sources are usually videos, and the object detection frameworks are mostly established on still images and only use the spatial information, which means that the feature consistency cannot be ensured because the training procedure loses temporal information. To address these problems, we propose a single, fully-convolutional neural network-based object detection framework that involves temporal information by using Siamese networks. In the training procedure, first, the prediction network combines the multiscale feature map to handle objects of various sizes. Second, we introduce a correlation loss by using the Siamese network, which provides neighboring frame features. This correlation loss represents object co-occurrences across time to aid the consistent feature generation. Since the correlation loss should use the information of the track ID and detection label, our video object detection network has been evaluated on the large-scale ImageNet VID dataset where it achieves a 69.5% mean average precision (mAP).


Introduction
Object detection in images has received much attention in recent years, with tremendous progress mostly due to the emergence of deep neural networks, especially deep convolutional neural networks [1][2][3][4] and their region-based descendants [5][6][7][8][9][10][11]. These methods achieve excellent results on still image datasets, such as Pattern Analysis, Statistical Modelling and Computational Learning Visual Object Classes (PASCAL VOC) and Microsoft Common Objects in Context (COCO).
With this success, computer vision tasks have been extended from the still image domain to the video domain because, in reality, the data sources of practical applications, such as smart surveillance, self-driving, and face recognition, are mostly videos. Video brings additional challenges [12]: (1) motion blur, due to rapid camera or object movement; (2) quality, because internet video clips are of lower quality than still images, even at the same resolution; (3) partial occlusion, due to position changes of the camera or object; and (4) pose, due to unconventional object-to-camera poses that frequently appear in video clips. To overcome this gap, most video object detection methods [13][14][15][16] use exhaustive post-processing in addition to still image detectors. For example, T-CNN [13] uses the two-stage Faster RCNN [8] detection framework for individual video frames. Then, context suppression and tracking are applied to the detection results. Since they do not actually involve temporal information, those video object detection methods do not achieve favourable results on video sources.
Figure 1 illustrates the entire pipeline of the proposed method. The pipeline is introduced in two parts: the training stream and the testing stream. First, to give the training procedure a better starting point, the backbone of the proposed fully-convolutional neural network is the VGG-16, which is pre-trained on the ImageNet CLS-LOC dataset. Second, by feeding forward a set of fully-convolutional layers cascaded after the backbone network, the multiscale feature representation is generated to handle multiscale objects. Third, a set of different anchor shapes is generated on the multiscale feature maps to adapt to objects with different scales and aspect ratios. Fourth, considering the predictions for each anchor and the intersection over union (IOU) of the anchor and ground truth, the detection loss is formed.
Finally, the above procedures are replicated twice to establish a Siamese network. By measuring the similarity with the neighboring frame feature, the correlation loss, which consists of the center-value loss and the anchor coordinate loss, is computed. For the testing procedure, after sharing the backbone multiscale feature representation, anchor generation, and prediction, the testing results are computed after the Soft-NMS optimization.

Network Architecture
The following section details the proposed network, which contains the backbone network, multiscale feature representation, anchor generation, anchor prediction, and training sample selection. The network architecture explains the data flow in the feed-forward procedure.

Backbone Network
In this paper, the backbone network is the VGG-16 [3], which is pre-trained on the ImageNet CLS-LOC dataset [23]. As shown in Figure 2, the original VGG-16 is a deep CNN that includes 13 convolutional layers and three fully-connected layers. The convolutional layers generate deep features, which are then fed into the fully-connected layers. Similar to SSD, we remove the fully-connected layers and only use the convolutional layers to generate the feature maps.
The following details the entire backbone network: (1) Input: images with RGB channels.

Multiscale Feature Representation
After the backbone network, the multiscale feature representation network, which is generated by feed-forward convolutional networks [4,30], is cascaded. According to SSD [10], conv4_3 has a different feature scale compared to the other layers in the backbone network. Therefore, the multiscale feature representation starts from conv4_3 after applying an L2 normalization that scales the feature norm at each location in the feature map to 20; the scale is then learned during back propagation.

Considering that the size of the input image is initialized to 300 × 300, conv4_3 has a 38 × 38 feature map size, and conv4_3 is the first feature scale map. The multiscale feature representation is generated after conv4_3 by applying feed-forward convolution.
Traditionally, the low-resolution feature map can be generated from the high-resolution feature map by a single convolutional layer, but the computational costs are high. To reduce the computational costs, each scale block contains two convolutional layers. The first convolutional kernel size is 1 × 1 × M × N. Here, M is the channel number of the upper layer, and N is the current channel number. By using a 1 × 1 × M × N kernel, the channel number of the middle feature map is decreased from M to N. The second convolution kernel size is 3 × 3 × N × K, where K is the channel number of the current scale feature. Table 1 shows the details of the multiscale feature representation. There are six scale feature maps with feature map sizes of 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3, and 1 × 1, where 38 × 38 is the conv4_3 in VGG-16. In the options list, s is the convolutional kernel stride, p is the padding size, and dilation denotes a dilated convolution. For the dilated convolution, the stride is 1, the padding size is 6, and the dilation is 6 with a 3 × 3 kernel.
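As a rough illustration of why the 1 × 1 layer reduces the cost, the sketch below counts the weights of a scale block with and without the 1 × 1 reduction, and checks that a plausible set of strides and paddings yields the feature map sizes listed above. Table 1 is not reproduced here, so the channel numbers (M = 1024, N = 256, K = 512), the per-block stride and padding values, and the dilation of 6 are assumptions borrowed from the usual SSD300 configuration; biases are ignored.

```python
def conv_out_size(size, kernel, stride=1, padding=0, dilation=1):
    """Spatial output size of a convolution (floor mode)."""
    effective = dilation * (kernel - 1) + 1  # extent covered by the dilated kernel
    return (size + 2 * padding - effective) // stride + 1


def scale_block_weights(m, n, k):
    """Weights of a 1x1xMxN reduction followed by a 3x3xNxK convolution."""
    return 1 * 1 * m * n + 3 * 3 * n * k


def direct_weights(m, k):
    """Weights of a single 3x3xMxK convolution with the same input and output channels."""
    return 3 * 3 * m * k


# Assumed (stride, padding) of the 3x3 layer in each block after the 19x19 map.
ASSUMED_BLOCKS = [(2, 1), (2, 1), (1, 0), (1, 0)]


def feature_map_sizes():
    """Reproduce the six map sizes: conv4_3 (38), the 19x19 map, then four blocks."""
    sizes = [38, 19]
    for stride, padding in ASSUMED_BLOCKS:
        sizes.append(conv_out_size(sizes[-1], 3, stride, padding))
    return sizes


print(feature_map_sizes())
print(scale_block_weights(1024, 256, 512), "vs", direct_weights(1024, 512))
```

Under these assumed channel numbers, the two-layer block needs roughly a third of the weights of a direct 3 × 3 convolution, and a 3 × 3 kernel with stride 1, padding 6, and dilation 6 keeps a 19 × 19 map at 19 × 19.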

Anchor Generation
The anchor in the proposed network plays two roles. The first role is in the data selection procedure. The IOU of the ground truth and an anchor decides whether this anchor is a positive sample [31]. Traditionally, if the IOU is more than 50%, the anchor is a positive sample. The second role is in the training and testing procedure. In the training procedure, the classification and location loss are computed by the anchors. In the testing procedure, the bounding box results are computed by the anchor location layers and, after the non-maximum suppression of the bounding box results, the detection results are computed. Figure 3 shows the flow of the anchor generation.
Traditional object detection methods suggest processing an image at different scales and then combining the results. Since we have the multiscale feature representation, we can utilize these feature maps with different scales in a single network to achieve an effect similar to that of the traditional image processing method.
Similar to the Faster RCNN and the SSD, the anchors are generated on the feature maps by dense sampling. First, to handle different sizes of objects, different scales of objects are detected on different feature maps. Therefore, we allocate different scales to different feature maps. In the feature map with low resolution, the anchor scale is large, and in the feature map with high resolution, the anchor scale is small. Second, for a certain scale, the anchor should have different aspect ratios. These aspect ratios are decided by the object's aspect ratios. Suppose that we have m feature maps and the detection is applied to these m feature maps; the anchor scale of each feature map is computed as follows:
$$s_n = s_{min} + \frac{s_{max} - s_{min}}{m - 1}(n - 1), \quad n \in [1, m]$$
In the formulation, $s_{min}$ is the minimum scale of the objects to be detected, and $s_{max}$ is the maximum scale of the objects to be detected. n represents the nth square feature map. $s_n$ is the scale of the certain feature map. According to the scales, the anchor area is obtained as $anchor_{area} = s_n^2$. For the aspects of the anchors, different aspect ratios are set according to the object's aspect ratios. Traditionally, the aspect ratios contain $a_r = [1, 2, 3, \frac{1}{2}, \frac{1}{3}]$. Furthermore, for the scale consistency, a scale of $s'_n = \sqrt{s_n s_{n+1}}$ and the aspect ratio of 1 are also considered. The width and height are computed as follows:
$$w = s_n \sqrt{a_r}, \quad h = \frac{s_n}{\sqrt{a_r}}$$
Finally, there are six anchors for each pixel in the feature maps. In the proposed model, we allocate the different scales from 0.1 to 0.95, in which 0.95 indicates that the object occupies the entire image and 0.1 indicates the down-sampling rate on conv4_3. Moreover, we set the starting scale of conv6_2 to 0.2. The aspect ratios contain $a_r = [1, 2, 3, \frac{1}{2}, \frac{1}{3}]$ and an aspect ratio of 1 for the scale of $\sqrt{s_n s_{n+1}}$. Table 2 shows the anchor details with the height, width and number of each feature map in Section 3.1.2. In the table, for clarity, the channel number of the feature map is hidden.
To include more objects with different aspect ratios and scales, there are six anchor shapes for each pixel on the feature map.
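The anchor layout described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the SSD-style linear spacing of scales between a minimum and a maximum, a width of s·√a and a height of s/√a for aspect ratio a (so that the area stays s²), and it reads the text as assigning 0.1 to conv4_3 and spreading the remaining five maps from 0.2 to 0.95. All function names are hypothetical.

```python
import math


def anchor_scales(s_min, s_max, m):
    """Linearly spaced scales s_1..s_m between s_min and s_max (assumed SSD-style rule)."""
    if m == 1:
        return [s_min]
    return [s_min + (s_max - s_min) * (n - 1) / (m - 1) for n in range(1, m + 1)]


def anchor_shapes(s_n, s_next, ratios=(1, 2, 3, 1 / 2, 1 / 3)):
    """(width, height) pairs for one feature map: five ratios plus the extra square."""
    shapes = [(s_n * math.sqrt(r), s_n / math.sqrt(r)) for r in ratios]
    extra = math.sqrt(s_n * s_next)  # aspect ratio 1 at the intermediate scale
    shapes.append((extra, extra))
    return shapes


FEATURE_MAP_SIZES = [38, 19, 10, 5, 3, 1]
# Assumption: conv4_3 uses 0.1; the other five maps are spread from 0.2 to 0.95.
SCALES = [0.1] + anchor_scales(0.2, 0.95, 5)


def total_anchors(sizes=FEATURE_MAP_SIZES, shapes_per_pixel=6):
    """Number of anchors when every pixel of every map carries the same number of shapes."""
    return sum(size * size * shapes_per_pixel for size in sizes)


print([round(s, 4) for s in SCALES])
print(total_anchors())
```

With six shapes on each of the six maps, the count comes to 11,640, which agrees with the total reported for the six-shape setting in the model analysis.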

Anchor Prediction
After associating a set of anchors with each feature map, the anchor prediction is the next key procedure during the algorithm. At each anchor, we predict the offsets relative to the anchor shapes and the per-class scores that indicate the presence of a class instance in this anchor. Suppose that the class number that we want to predict is n. For each anchor, the output size is (n + 1) + 4. (n + 1) is the class number and background, and 4 is the bounding box offset for this anchor. Figure 4 shows the prediction procedure of an anchor.
Referring to the multiscale feature representation in Section 3.1.2, Table 3 shows the detailed prediction kernels. The confidence output of each anchor is (n + 1) × 6 dimensions with (n + 1) classes and six shapes. The bounding box output of each anchor is (n + 1) × 4 dimensions with (n + 1) classes and (bx, by, bw, bh).

Figure 4. There are two kernels of 3 × 3 for the confidence prediction and the bounding box prediction for this anchor. The red kernel is the confidence kernel and the blue kernel is the bounding box prediction kernel.
Table 3. Details of the prediction kernel.

Training Sample Selection
In the training procedure, the training samples are generated from the anchors. A matching strategy between the anchors and the ground truths is applied. The strategy begins by matching each anchor to the ground truth. Different from Multibox, the positive samples are the anchors with a Jaccard overlap higher than a threshold. The others are the negative samples. This strategy simplifies the training sample selection problems and provides the network with more positive samples.
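The matching strategy can be sketched as below. This is an illustrative version under stated assumptions: boxes are given as (xmin, ymin, xmax, ymax) tuples, the threshold defaults to the 0.5 used elsewhere in the paper, and the function names are hypothetical.

```python
def jaccard(box_a, box_b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    iy = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = ix * iy
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_samples(anchors, ground_truths, threshold=0.5):
    """Split anchors into positives (anchor index -> best ground-truth index) and negatives."""
    positives, negatives = {}, []
    for a_idx, anchor in enumerate(anchors):
        overlaps = [jaccard(anchor, gt) for gt in ground_truths]
        best = max(range(len(overlaps)), key=overlaps.__getitem__) if overlaps else None
        if best is not None and overlaps[best] > threshold:
            positives[a_idx] = best  # overlap above the threshold: positive sample
        else:
            negatives.append(a_idx)  # everything else is treated as background
    return positives, negatives
```

Because every anchor above the threshold is kept, one ground truth can produce several positive samples, which is the property the text credits for providing more positives.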
Moreover, to improve performance, the network also incorporates hard example mining and data augmentation.

Hard Example Mining
Since the negative samples are selected from the background, the number of negative samples is much larger than the number of positive samples. This difference causes a significant imbalance between the positive and negative training samples. To overcome this difficulty, following OHEM [32], we sort the samples by their prediction scores. By selecting the top scores, we restrict the ratio between the negative and positive training samples to 3:1. In this way, we can avoid training mainly on the negative samples.
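A minimal sketch of the 3:1 selection is given below. It assumes that each negative anchor already has a scalar score expressing how hard it is (how that score is obtained is not specified here), and the function and argument names are hypothetical.

```python
def mine_hard_negatives(negative_scores, num_positives, ratio=3):
    """Keep the highest-scoring negatives, at most ratio * num_positives of them.

    negative_scores maps an anchor index to its (assumed) hardness score.
    """
    keep = ratio * num_positives
    ranked = sorted(negative_scores, key=negative_scores.get, reverse=True)
    return ranked[:keep]


# Example: one positive sample allows three negatives to be kept.
scores = {10: 0.1, 11: 0.9, 12: 0.5, 13: 0.7, 14: 0.2}
print(mine_hard_negatives(scores, num_positives=1))
```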

Data Augmentation
To improve the generalization of the network, we also apply data augmentation. Similar to SSD, each training image is augmented by one of the following options:
• Use the entire original input image;
• Sample a patch so that the minimum Jaccard overlap with the objects is 0.1, 0.3, 0.5, 0.7, or 0.9;
• Randomly sample a patch.
After the aforementioned sampling step, each augmented image is resized to a fixed size and is horizontally flipped with the probability of 0.5.
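The flipping step also has to mirror the ground truth boxes. The sketch below shows only that box update, assuming boxes are stored as normalized (xmin, ymin, xmax, ymax) tuples in [0, 1]; the image itself would be mirrored in the same branch. The function name and the `rng` argument are hypothetical.

```python
import random


def maybe_flip(boxes, prob=0.5, rng=random):
    """Horizontally flip normalized (xmin, ymin, xmax, ymax) boxes with probability prob.

    Returns the (possibly flipped) boxes and a flag telling whether the flip happened,
    so that the caller can mirror the image consistently.
    """
    if rng.random() >= prob:
        return boxes, False
    # Mirroring swaps the roles of xmin and xmax around the vertical image axis.
    flipped = [(1.0 - xmax, ymin, 1.0 - xmin, ymax) for xmin, ymin, xmax, ymax in boxes]
    return flipped, True
```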

Loss Function
In the proposed network, the loss function mainly contains two parts: the detection loss and the correlation loss. The detection loss relates to the object classification and the object bounding box regression. The correlation loss is the neighboring object feature correlation, which could involve the time-series information of the network.

Detection Loss
Detection loss relates to an object in a single frame, and it involves the frame information concerning the object. Similar to the Faster RCNN [6], detection loss consists of a combined classification loss $L_{cls}$ and a bounding box regression loss $L_{bbox}$. The overall detection loss is the sum of $L_{cls}$ and $L_{bbox}$. Suppose that f is a feature in a certain anchor with a Jaccard overlap that is higher than 0.5. In the detection loss, n is the number of anchors whose Jaccard overlap is higher than 0.5; f is the feature of the certain anchor; c is the classification score of the anchor; and l and g are the bounding box offsets and ground truth, respectively. Similar to SSD, $L_{cls}$ is formed by the multi-class softmax, in which $c_i^{label}$ is the classification prediction score of the certain label in the ith anchor box. $L_{bbox}$ is based on the Huber loss [33] between the ground truth and the bounding box. The Huber loss is less sensitive to outliers in the data than the squared error loss. Similar to R-CNN, α is a hyper-parameter and is set to 1.
Suppose that the bounding box has four parameters, which are (bx, by, bw, bh). (bx, by) is the center of the bounding box, and (bw, bh) are the width and height, respectively. Pre and g refer to the prediction box and the ground truth box, respectively. i and j are the same as in $L_{cls}$.
Moreover, when formulating $L_{bbox}$, we adopt the parameterizations of the anchor bounding box (a) and the ground truth box (g) following R-CNN [8]. This parameterization enhances the effect of the center (bx, by) and weakens the effect of the width (bw) and height (bh).
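For readers unfamiliar with these ingredients, the sketch below shows the commonly used R-CNN-style box parameterization and the Huber loss. It is an assumption that the paper uses exactly this form: the center offsets are divided by the anchor size, the width and height use a log ratio, and the Huber threshold `delta` is set to 1 (whether α in the text plays this role is not certain). All function names are hypothetical.

```python
import math


def encode_box(anchor, gt):
    """Offsets of a ground-truth box relative to an anchor; boxes are (bx, by, bw, bh)."""
    ax, ay, aw, ah = anchor
    gx, gy, gw, gh = gt
    return ((gx - ax) / aw, (gy - ay) / ah, math.log(gw / aw), math.log(gh / ah))


def decode_box(anchor, offsets):
    """Inverse of encode_box: recover (bx, by, bw, bh) from offsets."""
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = offsets
    return (ax + tx * aw, ay + ty * ah, aw * math.exp(tw), ah * math.exp(th))


def huber(x, delta=1.0):
    """Huber loss: quadratic for |x| <= delta, linear beyond it."""
    if abs(x) <= delta:
        return 0.5 * x * x
    return delta * (abs(x) - 0.5 * delta)


def bbox_loss(pred_offsets, target_offsets, delta=1.0):
    """Sum of Huber losses over the four box parameters of one positive anchor."""
    return sum(huber(p - t, delta) for p, t in zip(pred_offsets, target_offsets))
```

The log ratio compresses large width and height differences, which is one way to read the remark that the parameterization weakens the effect of (bw, bh) relative to the center.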

Correlation Loss
Detection loss is mainly concerned with intra-frame information. However, in video sources, detection loss does not consider inter-frame information. Correlation loss is a strategy that involves the inter-frame information by measuring the feature difference between frames.
As is known, the traditional detection frameworks based on deep neural networks are discriminative algorithms. The main concern is how to plot a line or surface in the feature space. If the feature is on the left side of the line, the feature is generated by the positive sample. According to the well-constructed feature space of a convolutional neural network, the linear classifier (Softmax) could easily differentiate the negative features from the positive features. Due to the discriminative algorithm characteristics, the feature in the feature space is scattered. In video object detection, the object is continuously moving. Therefore, when the training dataset is completed, the intra-frame-based detection framework may have a favorable result.
The proposed correlation loss is inspired by the tracking task. In the tracking task [16,[34][35][36][37], the feature consistency is the key point to judge the tracking result, especially in the correlation-based tracking algorithm such as the correlation filter [34]. In the deep feature [12] and flow-guided feature [38], according to the object's movement, the feature is copied or aggregated from the key frames of other frames. In this way, the feature stability has been guaranteed. Different from these approaches, we formulate a correlation loss to supervise the feature consistency.
As shown in Figure 1, the correlation loss is computed by the Siamese network, which is a two-path network that uses replication. Since we construct the multiscale feature map, the correlation loss is generated from it. The correlation loss ($L_{corr}$) is the combination of the center-value loss ($L_{center\_value}$) and the anchor coordinate loss ($L_{coordinate}$). In these losses, n is the number of ground truths; $f_t$ and $f_{t+1}$ are the anchor center features of the neighboring frames t and t + 1, respectively; and $a_{t+1}$ is the positive anchor box in frame t + 1. Each positive anchor box has four parameters (bx, by, bw, bh). $map_{t+1}$ is the entire scale feature map in frame t + 1. Moreover, the positive anchor here is selected differently from the positive samples. In the correlation loss procedure, the selected anchor is the one with the maximum Jaccard overlap with each ground truth box, which means that the number of positive anchors is the same as the number of ground truths. First, the center-value loss is a value that compares the correlation distance between the same labelled object features in the neighboring frames.
The center-value loss $L_{center\_value}$ is defined for the positive anchors for which $label_t = label_{t+1}$ and $track_t = track_{t+1}$. In its formulation, ⊗ is the correlation operation, which measures the consistency of the feature. $label_t$ and $track_t$ are the ground truth label and the track ID, respectively, that relate to this anchor. By searching all the positive anchors in the neighboring frame, we can obtain the one-to-one corresponding positive anchor with its label and track ID in frame t and frame t + 1. $f_{t,label_t,track_t}$ is the center value of the positive anchor.
Second, the anchor coordinate loss measures the tracking result loss on the neighboring frame feature. Inspired by the correlation filter, we take the positive anchor in frame t as the first-frame tracking ground truth. We obtain the response map in frame t + 1. Then, the coordinate of the highest response is the tracking result in frame t + 1. Figure 5 shows the $L_{coordinate}$ computation flow. The anchor coordinate loss is formed as:
$$L_{coordinate}(f_t, map_{t+1}, a_{t+1}) = \sum_{i \in positive}^{n} \frac{Dis\left(T\left(f^{i}_{t,label_t,track_t}, map_{t+1}\right), a_{t+1,label_{t+1},track_{t+1}}\right)}{map\_size}, \quad \text{if } (label_t = label_{t+1}) \text{ and } (track_t = track_{t+1}) \tag{11}$$
The T (Track) formulation obtains the center (bx, by) of the max response location. In $L_{coordinate}$, Dis computes the center Euclidean distance of the tracking result and the positive anchor in frame t + 1. Moreover, in case the coordinate loss is high, we normalize it by the size of the feature map. By applying the correlation loss, the object feature in the neighboring frame can be more consistent.
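The coordinate loss can be illustrated with a simplified sketch. It makes several assumptions that go beyond the text: the feature maps are single-channel 2D lists (a real feature map would sum the correlation over channels), the kernel is a 3 × 3 patch taken around the positive anchor center in frame t, the map of frame t + 1 is zero-padded by one pixel, the kernels and anchor centers passed in are already paired by label and track ID, and "size of the feature map" is taken to be its side length. All names are hypothetical.

```python
import math


def pad(feature_map, p=1):
    """Zero-pad a 2D feature map (list of rows) by p pixels on every side."""
    width = len(feature_map[0]) + 2 * p
    rows = [[0.0] * width for _ in range(p)]
    rows += [[0.0] * p + list(row) + [0.0] * p for row in feature_map]
    rows += [[0.0] * width for _ in range(p)]
    return rows


def response_map(kernel, feature_map):
    """Cross-correlate a 3x3 kernel with a feature map padded by one pixel."""
    padded = pad(feature_map, 1)
    h, w = len(feature_map), len(feature_map[0])
    return [[sum(kernel[i][j] * padded[y + i][x + j]
                 for i in range(3) for j in range(3))
             for x in range(w)] for y in range(h)]


def track(kernel, feature_map):
    """T: (x, y) location of the maximum response in the next frame."""
    resp = response_map(kernel, feature_map)
    best_x, best_y, best_v = 0, 0, resp[0][0]
    for y, row in enumerate(resp):
        for x, v in enumerate(row):
            if v > best_v:
                best_x, best_y, best_v = x, y, v
    return best_x, best_y


def coordinate_loss(kernels, map_next, anchor_centers_next):
    """Sum of center distances between tracked and annotated positions, divided by map size."""
    map_size = len(map_next)  # assumption: square map, normalized by its side length
    total = 0.0
    for kernel, (ax, ay) in zip(kernels, anchor_centers_next):
        tx, ty = track(kernel, map_next)
        total += math.hypot(tx - ax, ty - ay) / map_size
    return total
```

When the feature of an object is stable across the two frames, the response peaks at the annotated anchor center and the loss is zero; a drifting feature moves the peak away and increases the loss.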

Experiment and Results
We report the results on the ImageNet VID validation dataset. The training set contains the ImageNet VID training dataset. The performance of our method is compared to the R-CNN [8], Fast R-CNN [5], original SSD [10], T-CNN [15], TPN + LSTM [39], and the winner of the ImageNet VID 2015 competition [13]. The detailed evaluation metrics are described in Section 4.1. All the methods in the experiments were programmed based on the PyTorch deep learning framework. The computational resources include a TITAN X GPU, 128 GB of memory, and an Intel Xeon E5-2670 CPU (2.30 GHz). The operating system used is Ubuntu 14.04. Moreover, in order to show the effectiveness of the proposed method, we also evaluate our model on the YouTube Object (YTO) dataset, where the Unsupervised [40], YOLO [22], Context [41], a_LSTM [42], T-CNN [15], and Base [10] models are selected for comparison.

ImageNet Dataset
We evaluate our method by using the 2015 ImageNet object detection from a video (VID) [23] dataset that contains 30 classes in 3862 training and 555 validation videos. The 30 object categories in ImageNet VID are a subset of the 200 categories in the ImageNet DET dataset. The objects have ground truth annotations for their bounding box and a tracking ID in a video. Since the ground truth for the test set is not publicly available, we measure the performance as the mean average precision (mAP) over 30 classes on the validation set by following the protocols in [12,13,15,38,39], which is standard practice. Figure 6 shows the number of ground truths in each class, in which we can see that the training ground truth in each class is unbalanced. Then, we subsample the VID training set by using only 30 frames from each video. Figure 7 shows the object area statistical information in the dataset, in which we can see more than half of the objects have areas lower than 0.33.
The positive samples are anchors with Jaccard overlap of ground truths of more than 0.5. If the Jaccard overlaps between all of the anchors and ground truths are lower than 0.5, there would be no positive samples. Figure 8 shows the statistical information of small objects. In this dataset, if the scale of the smallest anchor is 0.1, the framework will miss about 10% of the ground truths.

Model Training
Baseline-single shot multibox: The baseline is SSD300, which means the input is resized to 300 × 300. The multiscale feature map sizes are 38 × 38, 19 × 19, 10 × 10, 5 × 5, 3 × 3, and 1 × 1. There are six anchor shapes in the multiscale feature map: the aspect ratios [1, 2, 3, 1/2, 1/3] and an aspect ratio of 1 for a scale of $\sqrt{s_n s_{n+1}}$. The backbone is the VGG-16 net with the fully-connected layers removed. Following SSD300, we change pool5 from 2 × 2 − s2 to 3 × 3 − s1 and use the à trous algorithm to fill the "holes". The baseline uses conv4_3, conv6_2, conv7_2, conv8_2, conv9_2, and conv10_2 to predict both the locations and confidences. The baseline sets the anchor with a scale of 0.1 on conv4_3, and the other layers are initialized by the "Xavier" method [43]. For all prediction layers, we included six anchor shapes as described in Section 3.1.3. The baseline uses a learning rate of $10^{-3}$ for 200,000 iterations. The training was continued for 200,000 iterations with $10^{-4}$ and $10^{-5}$. Moreover, the optimization uses a momentum of 0.9 and a weight decay of 0.0005.
Proposed network: Our network is established on SSD300, whose input image size is also resized to 300 × 300. The training differences between the proposed network and the baseline are as follows:

Input image
The input image is a pair of neighboring frames (frame t and frame t + 1). The neighboring frames must have at least one object pair with the same label and track ID. If one of the neighboring frames is empty, this frame pair is removed.
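The pair selection rule can be sketched as follows. The annotation layout is an assumption made for illustration: each frame is a list of objects, and each object is a dictionary with hypothetical `label` and `track_id` keys.

```python
def shared_objects(objects_t, objects_t1):
    """(label, track_id) keys that are annotated in both neighboring frames."""
    keys_t = {(o["label"], o["track_id"]) for o in objects_t}
    keys_t1 = {(o["label"], o["track_id"]) for o in objects_t1}
    return keys_t & keys_t1


def valid_frame_pairs(frames):
    """Indices t such that frames t and t + 1 share at least one object.

    A pair with an empty frame shares nothing and is therefore dropped.
    """
    return [t for t in range(len(frames) - 1)
            if shared_objects(frames[t], frames[t + 1])]
```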

Correlation loss computation
Correlation loss ($L_{corr}$) has two components: $L_{center\_value}$ and $L_{coordinate}$. $L_{center\_value}$ is computed by the center values of the positive anchors. In $L_{coordinate}$, when computing the response map, the kernel size is 3 × 3, and its center is the positive anchor center. Furthermore, frame t + 1 is padded by one pixel.

Learning rate
Since the loss in the proposed network is larger than in the baseline, we first warm up the network with a learning rate of $10^{-4}$ for 10,000 iterations. Then, the proposed network uses a learning rate of $10^{-3}$ for 200,000 iterations. The training was continued for 200,000 iterations with $10^{-4}$ and $10^{-5}$.

Training time
The training time of the baseline is almost three days. Due to the correlation loss computation and the detection loss that contains two images, the training time of our method is longer than the baseline. Following the above training process, the entire training time is approximately seven days.
Apart from the above differences, the proposed network training is mostly the same as the baseline. The proposed network also uses six layers to predict, and each layer uses six anchor shapes. The optimization uses a momentum of 0.9 and a weight decay of 0.0005.

Testing and Results
We test the baseline and the proposed network on the validation dataset with subsampling. We also adopt Soft-NMS [19] to refine the candidate bounding boxes. Moreover, Seq-NMS [14], which is a post-processing method, is applied after the detection process. Table 4 shows the mAP results on the validation dataset. From the table, the baseline SSD300 achieves a mAP of 67.9%. Our proposed method achieves 69.5%, which is an improvement of 1.6 percentage points over the baseline.
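For context, the idea behind Soft-NMS is to decay the scores of overlapping boxes instead of discarding them. The sketch below uses the Gaussian decay variant; the text does not state which variant or which parameters were used, so the Gaussian form, `sigma = 0.5`, and the pruning threshold of 0.001 are assumptions, and the function names are hypothetical.

```python
import math


def iou(a, b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001):
    """Gaussian Soft-NMS: returns (box index, final score) pairs in selection order."""
    remaining = list(range(len(boxes)))
    scores = list(scores)  # work on a copy so the caller's scores are untouched
    kept = []
    while remaining:
        best = max(remaining, key=lambda i: scores[i])
        remaining.remove(best)
        kept.append((best, scores[best]))
        for i in remaining:
            overlap = iou(boxes[best], boxes[i])
            scores[i] *= math.exp(-(overlap ** 2) / sigma)  # stronger decay for larger overlap
        remaining = [i for i in remaining if scores[i] >= score_threshold]
    return kept
```

A duplicate of the top box is kept but pushed far down the ranking, while a box with no overlap keeps its original score.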
In Table 4, we compare our proposed network with R-CNN [8], Fast R-CNN [5], T-CNN [15], TPN + LSTM [39], the baseline, and the winner of the ImageNet VID 2015 competition [13] (multi-model). The R-CNN and Fast R-CNN are baselines designed for still images. The last entry is the winner of the ImageNet VID 2015 competition, and its result is the fusion of DeepID net, CRAFT and post-processing procedures. Among the single models, the proposed network achieves the best score, which is a 1.1% improvement over TPN + LSTM.
Moreover, because our proposed network is based on SSD300 and the testing process is similar, the test time per image is similar. On our equipment (Titan X), the testing speed is approximately 32 fps because there are more anchors than in the original SSD.
Figure 9 shows some intermediate features produced during the network feed-forward pass, taken from conv4_3, conv6_2, conv7_2, conv8_2, conv9_2, and conv10_2. To measure the effect of the correlation loss on the features, we extract the features of the anchor with the maximum Jaccard overlap with the ground truth, both with and without the correlation loss. We then compute the similarity between neighbouring frames, using the Euclidean distance as the similarity metric. For each class, we choose 10 pairs of neighbouring frames and average their similarities. To avoid the effect of the channel size, the similarity index is normalized by the number of channels:

$$\text{Similarity} = \frac{\lVert f^{a}_{frame_t} - f^{a}_{frame_{t+1}} \rVert_2}{channel}$$

Here, $f^{a}_{frame_t}$ and $f^{a}_{frame_{t+1}}$ are the features of the given anchor on the neighbouring frames, and $channel$ is the number of channels of the feature map. Figure 10 shows the feature similarity of the proposed network and the baseline. We can see that our proposed method maintains a better feature similarity than the baseline.
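The measurement described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' code: boxes are assumed to be (xmin, ymin, xmax, ymax) tuples, an anchor feature is assumed to be a flat list with one value per channel, and the function names are hypothetical. Because the metric is a distance, a smaller value means the two features are closer.

```python
import math


def jaccard(box_a, box_b):
    """Jaccard overlap (IoU) of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def best_anchor(anchors, gt_box):
    """Index of the anchor with the maximum Jaccard overlap with the ground truth."""
    return max(range(len(anchors)), key=lambda i: jaccard(anchors[i], gt_box))


def similarity_index(feat_t, feat_t1):
    """Euclidean distance between two anchor features, divided by the channel count."""
    if len(feat_t) != len(feat_t1):
        raise ValueError("features must have the same number of channels")
    channel = len(feat_t)
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(feat_t, feat_t1)))
    return distance / channel


def mean_similarity(feature_pairs):
    """Average similarity index over pairs of neighbouring-frame features."""
    return sum(similarity_index(a, b) for a, b in feature_pairs) / len(feature_pairs)
```

In this sketch, `mean_similarity` would be called once per class and per feature map with the 10 neighbouring-frame feature pairs mentioned in the text.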

Figure 10. The feature similarity of the proposed method. The horizontal axis is the number of the multiscale feature map and the vertical axis is the similarity index.
Figure 11 shows the detection results on the validation dataset.

Model Analysis
The numbers of anchor shapes and multi-scale feature maps are the key hyper-parameters in the proposed framework. To evaluate the effects of different numbers of anchor shapes and feature maps, additional comparison experiments are conducted on the ImageNet VID dataset.

Number of Anchor Shapes
In the proposed framework, the number of anchor shapes for each pixel in the multi-scale feature maps is one of the key hyper-parameters, and it can affect both the detection performance and the speed. To analyze how the number of anchors affects the detection performance and speed, we conduct a comparison experiment between different numbers of anchors. Table 5 shows the details of the effect on detection performance and speed. The settings of this experiment are as follows:

1. Anchor-6: For each pixel in every multi-scale feature map, there are six anchor shapes: the aspect ratios [1, 2, 3, 1/2, 1/3] plus an additional anchor with an aspect ratio of 1 at a scale of $\sqrt{s_n s_{n+1}}$. In our framework, the total number of anchors is 11,640.
2. Anchor-4: For each pixel in every multi-scale feature map, there are four anchor shapes: the aspect ratios [1, 2, 1/2] plus an additional anchor with an aspect ratio of 1 at a scale of $\sqrt{s_n s_{n+1}}$. In our framework, the total number of anchors is 7760.
3. Anchor-4 and 6: In this setting, we follow the original SSD setting. In conv4_3, conv9_2, and conv10_2, there are four anchor shapes per pixel: the aspect ratios [1, 2, 1/2] plus an additional anchor with an aspect ratio of 1 at a scale of $\sqrt{s_n s_{n+1}}$. In conv6_2, conv7_2, and conv8_2, there are six anchor shapes per pixel: the aspect ratios [1, 2, 3, 1/2, 1/3] plus an additional anchor with an aspect ratio of 1 at a scale of $\sqrt{s_n s_{n+1}}$. In our framework, the total number of anchors is 8732.

From Table 5, we can see that more anchors yield a better mean average precision. However, more anchors also need more computational resources, so the detection speed is lower. If we remove the aspect ratios of 3 and 1/3, the performance drops by 1.6% and the detection speed increases by 18 fps.
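The anchor totals quoted for the three settings can be checked with a short calculation. The sketch below assumes the standard SSD300 feature-map resolutions (38, 19, 10, 5, 3, and 1), which are not stated in this section but reproduce the totals 11,640, 7760, and 8732. The helper `anchor_shapes` follows the SSD convention of width $s\sqrt{r}$ and height $s/\sqrt{r}$ for aspect ratio $r$; that this framework uses exactly this convention is an assumption, and all names are hypothetical.

```python
import math

# Assumed SSD300 feature-map resolutions (side length in pixels of each map).
FEATURE_MAP_SIZES = {
    "conv4_3": 38,
    "conv6_2": 19,
    "conv7_2": 10,
    "conv8_2": 5,
    "conv9_2": 3,
    "conv10_2": 1,
}

# Anchors per pixel on each feature map for the three settings in the text.
ANCHOR_6 = {name: 6 for name in FEATURE_MAP_SIZES}
ANCHOR_4 = {name: 4 for name in FEATURE_MAP_SIZES}
ANCHOR_4_AND_6 = {
    "conv4_3": 4,
    "conv6_2": 6,
    "conv7_2": 6,
    "conv8_2": 6,
    "conv9_2": 4,
    "conv10_2": 4,
}


def total_anchors(anchors_per_pixel):
    """Total anchors: (map side)^2 pixels times anchors per pixel, summed over maps."""
    return sum(
        FEATURE_MAP_SIZES[name] ** 2 * count
        for name, count in anchors_per_pixel.items()
    )


def anchor_shapes(scale, next_scale, aspect_ratios):
    """(width, height) pairs for one pixel, in the same units as the scales.

    One anchor per aspect ratio at `scale`, plus one square anchor at the
    geometric mean of `scale` and `next_scale`.
    """
    shapes = [(scale * math.sqrt(r), scale / math.sqrt(r)) for r in aspect_ratios]
    extra = math.sqrt(scale * next_scale)
    shapes.append((extra, extra))
    return shapes
```

For example, `total_anchors(ANCHOR_6)` gives 11640, and `anchor_shapes(s_n, s_next, [1, 2, 3, 1/2, 1/3])` returns six shapes for any pair of scales, matching the Anchor-6 description.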

Number of Multi-Scale Feature Maps
In our framework, the number of multi-scale feature maps is also a key hyper-parameter. To investigate the effect of the multi-scale feature representation, we conduct another comparison experiment between different numbers of multi-scale feature maps. Table 6 shows the details of the effect on detection performance and speed. In all settings, there are six anchor shapes per pixel: the aspect ratios [1, 2, 3, 1/2, 1/3] plus an additional anchor with an aspect ratio of 1 at a scale of $\sqrt{s_n s_{n+1}}$. The experimental settings, such as Feature-3 and Feature-6, differ in the number of multi-scale feature maps used. From Table 6, we can see that more feature maps significantly enhance the mean average precision. Since the anchors are mostly generated on feature maps with lower resolution, the speed does not differ greatly between the settings. Compared with Feature-3, Feature-6 improves the mean average precision by 2.7%.

Evaluation of the YouTube Object (YTO) Dataset
In order to show the effectiveness of the proposed network, we evaluate our model on a video object detection task with the YTO dataset [44].

YouTube Object Dataset
The YTO dataset contains 10 object classes: airplane, bird, boat, car, cat, cow, dog, horse, motorbike, and train. These 10 classes are a subset of the ImageNet VID classes, and they are all moving objects. Unlike the VID dataset, which contains full annotations on all video frames, the YTO training dataset is weakly annotated, i.e., each video is only guaranteed to contain one object of the corresponding class and only a few frames are annotated, whereas the objects in the YTO test dataset are all annotated. In total, the YTO dataset contains 155 videos but only 6087 annotated frames, of which 4306 are for training and 1781 are for testing. The weak annotation makes it infeasible to train the proposed network on the YTO dataset.
Following [13,15], since the YTO classes are a subset of the VID dataset classes, we use this dataset only for evaluation and directly apply the models trained on VID to the YTO dataset. The evaluation metric is the same as for the ImageNet VID dataset.

Evaluation Results
We evaluate the model trained on ImageNet on the YTO test dataset, and several state-of-the-art methods are selected for comparison. Table 7 shows the detailed AP lists of Unsupervised [40], YOLO [22], Context [41], a_LSTM [42], T-CNN [15], Base [10], and our method on the YTO test dataset. From the table, we can see that our proposed framework outperforms the compared methods by a large margin. Compared with the baseline, our proposed method achieves an improvement of around 4.4%, and compared with T-CNN, it achieves an improvement of 4.0%.

Conclusions
In this paper, we propose a fast and accurate object detection network for video sources. The proposed network is a single-shot object detection network.
Unlike traditional single-frame object detection networks, our proposed network incorporates frame-to-frame information by using the object pair relations among neighboring frames. Following the Siamese network, we formulate a correlation loss to constrain the deep features. In this way, we incorporate correlation into the discriminative algorithm. The backbone network is the VGG16 model without its fully-connected layers. Based on the backbone network, we establish a multiscale feature representation to predict detections on multiple layers. Different anchor scales are applied to different feature maps to handle different object scales. Hard example mining and data augmentation are also used to balance the training samples and to improve generalization. Our proposed model has been tested on a large object detection dataset, namely, the ImageNet VID dataset, which is the largest video object detection dataset. Compared to the baseline, our proposed network achieves a 1.6% improvement, and the test time does not increase. To further show the effectiveness of the proposed framework, we evaluate the model on the YTO dataset, where it achieves a 4.4% improvement over the baseline.
Although our proposed network performs better than the baseline, it still has some limitations. The first limitation concerns the use of time-series information: the correlation loss is rigid to the feature map. The second limitation is that we do not fully use the tracking information. The tracking process operates only on the feature map and does not output the exact tracking result. If the network could output the exact tracking result, the testing time could be faster.