A Review of Video Object Detection: Datasets, Metrics and Methods

Abstract: Although there are well-established object detection methods based on static images, their application to video data on a frame-by-frame basis faces two shortcomings: (i) lack of computational efficiency due to redundancy across image frames or by not using the temporal and spatial correlation of features across image frames, and (ii) lack of robustness to real-world conditions such as motion blur and occlusion. Since the introduction of the video object detection task in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2015, a growing number of methods have appeared in the literature on video object detection, many of which have utilized deep learning models. The aim of this paper is to provide a review of these papers on video object detection. An overview of the existing datasets for video object detection together with commonly used evaluation metrics is first presented. Video object detection methods are then categorized and each category is described. Two comparison tables are provided to show their differences in terms of both accuracy and computational efficiency. Finally, some future trends in video object detection to address the challenges involved are noted.


Introduction
Video object detection involves detecting objects using video data as compared to conventional object detection using static images. Two applications that have played a major role in the growth of video object detection are autonomous driving [1,2] and video surveillance [3,4]. In 2015, video object detection became a new task of the ImageNet Large Scale Visual Recognition Challenge (ILSVRC2015) [5]. With the help of ILSVRC2015, studies in video object detection have further increased.
However, using object detection on each image frame does not take into consideration the following attributes in video data: (1) Since there exist both spatial and temporal correlations between image frames, there are feature extraction redundancies between adjacent frames. Detecting features in each frame leads to computational inefficiency. (2) In a long video stream, some frames may have poor quality due to motion blur, video defocus, occlusion, and pose changes [22]. Detecting objects from poor quality frames leads to low accuracies. Video object detection approaches attempt to address the above challenges. Some approaches make use of the spatial-temporal information to improve accuracy, such as fusing features on different levels, e.g., [22][23][24][25]. Some other approaches focus on reducing information redundancy and improving detection efficiency, e.g., [26][27][28].
The value of video object detection approaches is further demonstrated in some specific applications. For example, hand segmentation [62,63] benefits from using optical flow to enhance the feature maps, following the video object detection method in [28]. Human pose estimation in videos [64] is another successful application, which draws on [22,28] to handle motion blur, occlusion and other challenges specific to videos. Furthermore, instance-level human parsing [65] builds on similar approaches. The mutual assistance of tracking and detection [26] is also employed in multiple people tracking [66].

Datasets
Many datasets have been provided for specific applications [91][92][93]. For video object detection, the most commonly used dataset is the ImageNet VID dataset [5], which is a prevalent benchmark for video object detection. The dataset is split into a training set and a validation set, containing 3862 video snippets and 555 video snippets, respectively. The video streams are annotated on each frame at the frame rate of 25 or 30 fps. In addition, this dataset contains 30 object categories, which are a subset of the categories in the ImageNet DET dataset [93].
In the ImageNet VID dataset, the number of objects in each frame is small compared with the datasets used for static image object detection such as COCO [92]. Though the ImageNet VID dataset is widely used, it has limitations in fully reflecting the effect of various video object detection methods. In [94], a large-scale dataset named YouTube-BoundingBoxes (YT-BB) was provided, which is human-annotated at one frame per second on video snippets from YouTube with high-accuracy classification labels and tight bounding boxes. YT-BB contains approximately 380,000 video segments with 5.6 million bounding boxes of 23 object categories, which are a subset of the COCO label set. However, the number of object categories is small and the image quality is relatively low because the videos were collected with hand-held mobile phones.
In 2018, a dataset named EPIC KITCHENS was provided in [95], which consists of 32 different kitchens in 4 cities with 11,500,000 frames containing 454,158 bounding boxes spanning 290 classes. However, its kitchen scenario poses limitations on performing generic video object detection. Moreover, there exist the following other datasets that reflect specific applications: the DAVIS dataset [96] for object segmentation, CDnet2014 [97] for moving object detection, VOT [98] and MOT [99] for object tracking, Sports-1M data set [100] with segment-level annotations, HMDB-51 data set [101] with segment-level annotations for various human action categories, TRECVID [102] for video retrieval and indexing, the Caltech Pedestrian Detection data set [103] for pedestrian detection, and the PASCAL VOC dataset [104,105] for object detection. In addition, some works based on semi-supervised or unsupervised methods have been considered in [106][107][108][109].
For video object detection with classification labels and tight bounding boxes annotation, currently there exists no public domain dataset offering dense annotations for various complex scenes. To enable the advancement of video object detection, more effort is thus needed to establish comprehensive datasets.

Evaluation Metrics
The mean Average Precision (mAP) metric is extensively used in conventional object detection, where it provides a performance evaluation in terms of regression and classification accuracies [9][10][11][12][13][14][15][17]. It is defined as the mean of the Average Precision (AP) over all categories. As per the PASCAL Visual Object Classes Challenge 2012 (VOC2012) Development Kit, it is computed as follows: (1) The Precision/Recall curve is obtained first. For each Recall $r$, the Precision is set to the maximum Precision achieved for any Recall $r' \geq r$. (2) The area under the Precision/Recall curve is taken as the Average Precision (AP). The mean of the AP over all categories is the mAP.
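The two VOC2012-style steps can be sketched in a few lines of Python. This is a minimal illustration rather than the development kit's own code; the function names and the assumption that the Precision/Recall points are already sorted by increasing Recall are ours.

```python
def average_precision(recalls, precisions):
    """Area under the interpolated Precision/Recall curve.

    Assumes `recalls` is sorted in increasing order and `precisions[i]`
    is the precision measured at `recalls[i]`.
    """
    # Sentinel points so that the curve spans recall 0 to 1.
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Step (1): precision at recall r becomes the maximum precision
    # achieved at any recall r' >= r (sweep from right to left).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Step (2): sum the rectangle areas wherever recall increases.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_category):
    """mAP is the mean of the per-category AP values."""
    return sum(ap_per_category.values()) / len(ap_per_category)
```

For instance, the two points (Recall 0.5, Precision 1.0) and (Recall 1.0, Precision 0.5) give an area of 0.5 × 1.0 + 0.5 × 0.5 = 0.75.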
Prior to 2010, AP was computed by sampling the curve at a set of uniformly spaced Recall values (0, 0.1, 0.2, ..., 1) and then averaging the corresponding Precision values. More specifically, $\mathrm{Recall} = \frac{TP}{TP+FN}$ and $\mathrm{Precision} = \frac{TP}{TP+FP}$, where the definitions of TP, FP and FN appear in Table 1. A prediction is counted as true when its IoU is larger than a set threshold. That is, $\mathrm{IoU} = \frac{\mathrm{area}(B_p \cap B_{gt})}{\mathrm{area}(B_p \cup B_{gt})}$, where $B_{gt}$ and $B_p$ indicate the ground truth and prediction box, respectively. More details are stated in the following example.
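As a hedged illustration of these quantities, the sketch below computes IoU, Precision/Recall from counts, and the pre-2010 sampled AP. The box layout (x1, y1, x2, y2) and all function names are assumptions made for this example.

```python
def iou(box_p, box_gt):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_p[0], box_gt[0]), max(box_p[1], box_gt[1])
    ix2, iy2 = min(box_p[2], box_gt[2]), min(box_p[3], box_gt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (box_p[2] - box_p[0]) * (box_p[3] - box_p[1])
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    union = area_p + area_gt - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP), Recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def eleven_point_ap(recalls, precisions):
    """Average of the best precision at recall levels 0, 0.1, ..., 1."""
    total = 0.0
    for i in range(11):
        level = i / 10
        candidates = [p for r, p in zip(recalls, precisions) if r >= level]
        total += max(candidates) if candidates else 0.0
    return total / 11
```

Two 2 × 2 boxes offset by one unit in each direction overlap in a 1 × 1 square, so their IoU is 1 / (4 + 4 − 1) = 1/7, which would be rejected at a typical threshold of 0.5.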
The detection results of one category are presented in Table 2, and the number of objects is 3, which means that TP + FN = 3. Confidence represents the confidence level of the prediction boxes. The confidence score is defined in Equation (1), with $P_r(\text{object})$ denoting the probability that the box contains an object and $\mathrm{IoU}^{truth}_{pred}$ measuring how well the predicted box surrounds the entire object:
$\text{confidence score} = P_r(\text{object}) \times \mathrm{IoU}^{truth}_{pred}, \quad P_r(\text{object}) \in [0, 1].$ (1)
The detection results are ranked according to the confidence score, as shown in Table 3. The Precision and Recall are computed by the equations noted above. The Precision/Recall curve of this category is shown in Figure 1. As a result, the AP of this category is ...33%, and mAP is the mean of the AP over all categories.
For video object detection, mAP is also directly used as an evaluation metric in [22,25,28,72,74]. Based on the object speed, it is labeled as mAP (slow), mAP (medium) and mAP (fast) [22]. The speed is determined from the average IoU (Intersection over Union) score between the object's box in the current frame and its boxes in the 10 preceding and 10 following frames: slow (score > 0.9), medium (score ∈ [0.7, 0.9]) and fast (score < 0.7).
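The speed grouping can be sketched as follows. This is an illustrative reading of the rule above, not the evaluation code of [22]; the track layout (one box per frame for a single object) and the helper names are assumptions.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def motion_iou(track, t, window=10):
    """Average IoU of the box at frame t with its boxes in nearby frames.

    `track` is a list with one box per frame for a single object
    (a simplifying assumption for this sketch).
    """
    lo, hi = max(0, t - window), min(len(track) - 1, t + window)
    scores = [box_iou(track[t], track[k]) for k in range(lo, hi + 1) if k != t]
    return sum(scores) / len(scores) if scores else 1.0


def speed_category(score):
    """Map the averaged IoU score to the slow / medium / fast groups."""
    if score > 0.9:
        return "slow"
    if score >= 0.7:
        return "medium"
    return "fast"
```

A static object yields a score of 1.0 and falls in the slow group, whereas a 10 × 10 box shifting by 5 pixels per frame yields an IoU of 1/3 with its immediate neighbors and falls in the fast group.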
In [110], it was pointed out that performance cannot be sufficiently evaluated using only Average Precision (AP), since it does not capture the temporal nature of video snippets. In the same paper, a new metric named Average Delay (AD) was introduced, based on the number of frames taken to detect an object starting from the frame in which it first appears. A subset of the ImageNet VID dataset, named ImageNet VIDT, was considered to verify the effectiveness of AD. It was reported that most methods with higher ADs still had good APs, i.e., good average detection accuracies. However, a higher AD means a larger detection delay, that is, more frames elapse between the first appearance of an object and its detection. AP alone therefore does not reflect this delay. As a result, AP is not sufficient to reflect the temporal characteristics of video object detectors, and AD provides a complementary performance indicator.
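The idea behind AD can be illustrated with a simplified sketch. The exact definition in [110] may differ (for example, in how missed objects or detection thresholds are handled); the `miss_penalty` parameter and the data layout below are assumptions introduced only for this example.

```python
def detection_delay(first_appearance, detected_frames):
    """Frames elapsed from the first appearance to the first detection.

    Returns None if the object is never detected.
    """
    later = [f for f in detected_frames if f >= first_appearance]
    return min(later) - first_appearance if later else None


def average_delay(instances, miss_penalty):
    """Mean delay over object instances.

    `instances` is a list of (first_appearance, detected_frames) pairs;
    undetected objects are charged `miss_penalty` frames (hypothetical rule).
    """
    delays = []
    for first_appearance, detected_frames in instances:
        delay = detection_delay(first_appearance, detected_frames)
        delays.append(miss_penalty if delay is None else delay)
    return sum(delays) / len(delays)
```

An object that appears at frame 0 and is first detected at frame 3 contributes a delay of 3, while one detected in the frame where it appears contributes 0, giving an average delay of 1.5 frames.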

Video Object Detection Methods
For video object detection, in order to make full use of the video characteristics, different methods are considered to capture the temporal-spatial relationship.

Flow-Based
Flow-based methods use optical flow in two ways. In the first way, as discussed in [28] (DFF, the acronym for Deep Feature Flow for Video Recognition; other acronyms similarly follow the references indicated), optical flow is used to propagate features from key frames to non-key frames in order to save computation. In the second way, as discussed in [22] (FGFA), optical flow is used to exploit the temporal-spatial information between adjacent frames to enhance the features of each frame. The second way yields higher detection accuracies but lower speeds. As a result, attempts were made to combine both ways in [68] (Impression Network), [69] (THP) and [27] (THPM). To obtain the difference between adjacent frames and utilize the temporal-spatial information at the pixel level, an optical flow algorithm was proposed in [29]. In [111], optical flow estimation was achieved by using the deep learning model FlowNet.
For video object detection, it is challenging to apply the state-of-the-art object detection approaches for still images directly to each image frame in video data for the reasons stated earlier. Therefore, based on FlowNet, the DFF method was proposed in [28] to address the following aspects: (i) computation time of feature map extraction for each frame in the video, (ii) similarity of the features obtained on two adjacent frames, (iii) propagation of feature maps from one frame to another. In [28], a convolutional neural sub-network, ResNet-101, was employed to extract the feature map on sparse key frames. Features on non-key frames were obtained by warping the feature map on key frames with the flow field generated by FlowNet [111] instead of being extracted by ResNet-101. The framework is shown in Figure 3. This method accelerates object detection on non-key frames. On the ImageNet VID dataset [5], DFF achieved an accuracy of 73.1% mAP at 20 fps, while the baseline accuracy on a single frame was 73.9% at 4 fps. This method significantly advanced the practical aspect of video object detection.
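The warping step can be illustrated with a minimal pure-Python sketch for a single-channel feature map. It is not the implementation of [28]: the flow convention (each entry is a displacement (dx, dy) pointing from a location in the current frame to the matching location in the key frame), the zero padding outside the map, and the function names are assumptions made for this example.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample the 2D list `feat` at real-valued (x, y); outside is zero."""
    h, w = len(feat), len(feat[0])
    x0, y0 = math.floor(x), math.floor(y)
    value = 0.0
    # Blend the four surrounding grid points by their distance to (x, y).
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(x - xx)) * (1 - abs(y - yy))
                value += weight * feat[yy][xx]
    return value


def warp_features(key_feat, flow):
    """Build the non-key-frame feature map by sampling the key-frame map.

    `flow[y][x]` is the assumed (dx, dy) displacement at location (x, y).
    """
    h, w = len(key_feat), len(key_feat[0])
    return [
        [bilinear_sample(key_feat, x + flow[y][x][0], y + flow[y][x][1])
         for x in range(w)]
        for y in range(h)
    ]
```

With a zero flow field the key-frame features are copied unchanged, and a half-pixel horizontal flow averages horizontally adjacent values. In a full pipeline this cheap sampling would replace the expensive backbone pass on every non-key frame.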
Appl. Sci. 2020, 10, x FOR PEER REVIEW
In [22], a flow-guided feature aggregation (FGFA) method was proposed to improve the detection accuracy degraded by motion blur, rare poses, video defocus, etc. Feature maps were extracted on each frame in the video using ResNet-101 [112]. In order to enhance the feature maps of the current frame, the feature maps of its nearby frames were warped to the current frame according to the motion information obtained by the optical flow network. The warped feature maps and the feature maps extracted on the current frame were then inputted into a small sub-network to obtain new embedding features, which were compared using the cosine similarity metric [113] to compute the weights. Next, the features were aggregated according to the weights. Finally, the aggregated feature maps were inputted into a shallow detection-specific sub-network to obtain the final detection outcome on the current frame. The framework of FGFA is shown in Figure 4. On the ImageNet VID dataset, FGFA achieved an accuracy of 76.3% mAP at 1.36 fps, which was more accurate than DFF.
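The weighting and aggregation steps of FGFA can be sketched as below. This is a simplified illustration, not the code of [22]: the embedding sub-network is omitted (the features themselves are compared), the weights are normalized across frames with a softmax over the cosine similarities, and each frame is represented as a list of per-location feature vectors. All of these are assumptions for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def aggregate(current, warped_neighbors):
    """Weighted average of the current and warped neighbor features.

    `current` is a list of feature vectors, one per spatial location;
    `warped_neighbors` is a list of such lists, already warped to the
    current frame.
    """
    frames = [current] + list(warped_neighbors)
    aggregated = []
    for loc in range(len(current)):
        # Similarity of every frame to the current frame at this location.
        sims = [cosine(frame[loc], current[loc]) for frame in frames]
        exps = [math.exp(s) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(current[loc])
        aggregated.append([
            sum(w * frame[loc][d] for w, frame in zip(weights, frames))
            for d in range(dim)
        ])
    return aggregated
```

Neighbor features that resemble the current frame at a location receive larger weights there, so a blurred or occluded neighbor contributes less to the aggregated feature.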
Although the feature fusion method of FGFA improved the detection accuracy, it considerably increased the computation time. On the other hand, feature propagation methods showed improved computational efficiency but at the expense of reduced detection accuracy. In 2017, the so-called Impression Network [68] was developed to improve both accuracy and computational speed simultaneously. Inspired by the idea that humans do not forget the previous frames when a new frame is observed, sparse key-frame features were aggregated with those of other key frames to improve the detection accuracy. Feature maps of non-key frames were obtained by a feature propagation method similar to that in [28] with the assistance of a flow field, which improved the inference speed. The feature aggregation method on the key frames used a small fully convolutional network to obtain the weight maps at each location, which was different from the method in [22]. The Impression Network achieved 75.5% mAP accuracy at 20 fps on the ImageNet VID dataset.
Besides the Impression Network, another combination method (THP) was introduced in [69]. Noting that all of the above methods utilized fixed-interval key frames, this method introduced temporally adaptive key frame scheduling to further improve the trade-off between speed and accuracy. With fixed-interval key frames, it is difficult to control the quality of the key frames. With temporally adaptive key frame scheduling, the key frames were selected dynamically according to the proportion of points with poor optical flow quality. If this proportion was greater than a prescribed threshold T, it indicated that the current frame had changed too much compared with the previous key frame. The current frame was then chosen as the new key frame and the feature maps were obtained from it.
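The adaptive key frame scheduling rule of [69] can be sketched as a simple threshold test. This is only an illustration under stated assumptions: the per-point flow quality flags are taken as given inputs (how they are estimated is not modeled here), and the function names are hypothetical.

```python
def is_new_key_frame(flow_quality_ok, threshold):
    """Decide whether the current frame should become a key frame.

    `flow_quality_ok` holds one boolean per point, True when the optical
    flow from the previous key frame is of acceptable quality there.
    """
    flags = list(flow_quality_ok)
    poor_ratio = flags.count(False) / len(flags) if flags else 0.0
    # Too many poorly matched points: the frame has changed too much.
    return poor_ratio > threshold


def schedule_key_frames(per_frame_quality, threshold):
    """Return the indices of the frames selected as key frames."""
    key_frames = [0]  # the first frame is always a key frame
    for index in range(1, len(per_frame_quality)):
        if is_new_key_frame(per_frame_quality[index], threshold):
            key_frames.append(index)
    return key_frames
```

Raising the threshold T selects fewer key frames and thus favors speed, while lowering it refreshes the features more often and favors accuracy.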
According to the results reported in [69], the mAP accuracy was 78.6% with a runtime of 13.0 and 8.6 fps on the NVIDIA Titan X and K40 GPUs (NVIDIA, California, USA), respectively. With a different T, the mAP slightly decreased to 77.8% at faster speeds (22.9 and 15.2 fps on Titan X and K40, respectively). The winning entry [114] of the ImageNet VID challenge 2017, which was based on feature propagation [28] and aggregation [22], achieved an mAP of 76.8% at 15.4 fps on Titan X; compared with it, the method in [69] obtained better performance in terms of both detection accuracy and speed.
Similarly, THPM [27] provided a lightweight network architecture for video object detection. A light image object detector was utilized on key frames, with the lightweight MobileNet [115] as the backbone network. Feature maps from key frames were propagated to non-key frames for detection by a light flow network. A flow-guided gated recurrent unit (GRU) module was provided to aggregate features effectively between key frames. On the ImageNet VID dataset, THPM achieved 60.2% mAP at a speed of 25.6 fps on mobile devices (e.g., HuaWei Mate 8 produced by HUAWEI TECHNOLOGIES CO., LTD, China).

LSTM-Based
In order to make full use of the temporal-spatial information, convolutional long short-term memory (LSTM [116]) was employed in [117] to process sequential data and select important information over a long duration. The methods reported in [70] and [71] are offline LSTM-based solutions, which utilize all the frames in the video, whereas the method in [72] is an online solution that only uses the current and previous frames.
In [71], a light model was proposed, which was designed to work on mobile phones and embedded devices. This method integrated SSD [9] (an efficient object detector network) with a convolutional LSTM, thereby extending an image-based object detector to video object detection. The convolutional LSTM was a modified version of the traditional LSTM that encodes both temporal and spatial information.
Considering a video snippet as a sequence of video frames $V = \{I_0, I_1, I_2, \ldots, I_t\}$, the model is viewed as a function $F(I_t, S_{t-1}) = (D_t, S_t)$, where $D_t$ denotes the detection outcome of the video object detector and $S_t$ represents a vector of feature maps up to the video frame $t$. Each feature map of $S_{t-1}$ is the state input to the LSTM and $S_t$ is the state output. The state unit $S_t$ of the LSTM contains the temporal information. The LSTM can combine the state unit with the input features, adaptively adding the temporal information to the input features and updating the state unit at the same time. In [71], it was stated that such convolutional LSTM layers could be added to any layer of the original object detector to refine the input features of the next layer. An LSTM layer could be placed immediately after any feature map. Placing the LSTM earlier would lead to larger input volumes and much higher computational costs. In [71], the convolutional LSTM was placed only after the Conv13 layer, which was shown to be most effective through experimental analysis. This method was evaluated on the ImageNet VID 2015 dataset [5] and achieved a good performance in terms of model size and computational efficiency (15 fps on a mobile CPU), with an accuracy comparable to those of more computationally demanding single-frame models.
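The recurrent formulation, in which a state is carried from frame to frame, can be sketched as a simple loop. The driver below follows the signature of F described above; `toy_step` is a purely hypothetical stand-in for the detector with its convolutional LSTM, operating on scalar "features" only to make the state update visible.

```python
def run_recurrent_detector(frames, step, initial_state):
    """Apply (D_t, S_t) = step(I_t, S_{t-1}) over a frame sequence."""
    state = initial_state
    detections = []
    for frame in frames:
        detection, state = step(frame, state)
        detections.append(detection)
    return detections, state


def toy_step(frame_feature, state, keep=0.5):
    """Hypothetical stand-in for F: blend the carried state with the
    new feature, then threshold the refined feature as the 'detection'."""
    new_state = keep * state + (1 - keep) * frame_feature
    return new_state > 0.5, new_state
```

With inputs 1.0, 1.0, 0.0 and an initial state of 0.0, the state evolves as 0.5, 0.75, 0.375, so only the second frame exceeds the threshold. The point of the sketch is that each output depends on the frame history through the state, not on the current frame alone.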
In 2019, the method in [71] was improved in [70] in terms of inference speed. Specifically, as shown in Figure 5, to exploit the high temporal redundancy in the video, the model proposed in [70] contained two feature extractors: a small feature extractor and a large feature extractor. The large, slow feature extractor was responsible for extracting highly accurate features, while the small, fast feature extractor was responsible for extracting less accurate features. The two feature extractors were used alternately. The feature maps were aggregated using a memory mechanism with the modified convolutional LSTM layer. Then, an SSD-style [9] detector was applied to the refined features to obtain the final regression and classification outcome.
For the methods mentioned above, image object detectors together with temporal context information enhancement were employed to detect objects in video. However, for online video object detection, succeeding frames cannot be utilized. In other words, non-causal video object detectors are not feasible for online applications. Noting that most video object detectors are non-causal, a causal recurrent method was proposed in [72] for online detection without using succeeding frames. In this case, the challenges of occlusion and motion blur remain, which requires the use of temporal information. For online video object detection, only the current frame and the previous frames are used. Based on the optical flow method [111], the short-term temporal information was utilized by warping the feature maps from the previous frame. However, image distortion or occlusion sometimes lasts for several video frames, and such situations are difficult to deal with using only the short-term temporal information.
The long-term temporal context information was also exploited via the convolutional LSTM, in which the feature maps of distant preceding frames obtained from the memory function were propagated to acquire more information. The key sub-network (temporal Conv LSTM) is shown in Figure 6. Given the feature map at the time step t and the state and output from the time step t−1, the output $H_t$ and the updated state $S_t$ at the current time step t are computed as in Equation (2). In this way, the long-term temporal information is stored, propagated and employed. Then, the feature maps extracted on the current frame, the warped feature maps and the output of the LSTM were concatenated to obtain the aggregated feature maps. Finally, the aggregated feature maps were inputted into a detection sub-network to obtain the detection outcome on the current frame.
By utilizing both the short- and long-term information, this method achieved an accuracy of 75.5% mAP at a high speed on the ImageNet VID dataset, indicating a competitive performance for online detection.
In Equation (2), $FG_t$, $I_t$, $O_t$ and $C_t$ denote the outputs of the forget gate, the input gate, the output gate and the information branch at time step $t$, respectively. Their weights are denoted by $W_{FG}$, $W_I$, $W_O$ and $W_C$. The symbols $\sigma(\cdot)$, $\times$, $+$ and $*$ denote the activation function, element-wise multiplication, element-wise addition and 3 × 3 convolution, respectively.
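Equation (2) itself is not reproduced in this text. As a hedged reconstruction only, a standard convolutional LSTM update that is consistent with the symbol definitions above can be written as follows. The symbol $F_t$ for the input feature map, the channel-wise concatenation $[F_t, H_{t-1}]$, the $\tanh$ nonlinearity and the omission of bias terms are all assumptions, and the exact form used in [72] may differ.

```latex
\begin{aligned}
FG_t &= \sigma\left(W_{FG} * [F_t, H_{t-1}]\right) \\
I_t  &= \sigma\left(W_{I} * [F_t, H_{t-1}]\right) \\
O_t  &= \sigma\left(W_{O} * [F_t, H_{t-1}]\right) \\
C_t  &= \tanh\left(W_{C} * [F_t, H_{t-1}]\right) \\
S_t  &= FG_t \times S_{t-1} + I_t \times C_t \\
H_t  &= O_t \times \tanh\left(S_t\right)
\end{aligned}
```

In this form, the state $S_t$ carries the long-term information forward, while $H_t$ is the output that is concatenated with the current and warped feature maps.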

Attention-Related
For video object detection, it is known that exploiting the temporal context relationship is quite important. This relationship needs to be established over a long-duration video, which requires a large amount of memory and computational resources. To reduce the computational cost, an attention mechanism was introduced for feature map alignment. This mechanism was first proposed for machine translation in [118,119] and was then applied to video object detection in [25,74–77].
Some methods only take the global or the local temporal information into consideration. Specifically, RDN [75] makes use of only the local temporal information, whereas SELSA [77] and OGEMN [74] utilize only the global temporal information. In contrast, PSLA [76] and MEGA [25] use both the global and local temporal information.

Relation Distillation Networks (RDN), presented in [75], propagate and aggregate the feature maps based on object relationships in video. In RDN, ResNet-101 [112] and ResNeXt-101-64 × 4d [120] are utilized as backbones to extract feature maps, and object proposals are generated by a Region Proposal Network (RPN) [15]. The feature maps of each proposal on the reference frame are augmented on the basis of supportive proposals. A prominent innovation in this work is to distill the relation through multi-stage reasoning consisting of a basic and an advanced stage. In the basic stage, the supportive proposals, consisting of the top-K proposals of the current frame and its adjacent frames, are used to measure the relation feature of each reference proposal obtained on the current frame, which generates refined reference proposals. In the advanced stage, supportive proposals with high objectness scores are selected to form the advanced supportive proposals. The features of the selected supportive proposals are aggregated with their relation to all supportive proposals. Then, these aggregated features are employed to strengthen the reference proposals obtained from the basic stage. Finally, the aggregated features of the reference proposals obtained from the advanced stage are used to generate the final classification and bounding box regression. In addition, detection box linking is used in a post-processing stage to refine the detection outcome. Evaluated on the ImageNet VID dataset, RDN achieved a detection accuracy of 81.8% and 83.2% mAP with ResNet-101 and ResNeXt-101, respectively, for feature extraction. With linking and rescoring operations, it achieved an accuracy of 83.8% and 84.7% mAP, respectively.
A module (SELSA) was introduced in [77] to exploit the relationship between proposals at the level of the entire sequence, after which the related feature maps were fused for classification and regression. More specifically, the features of the proposals were extracted on different frames, and then a clustering module and a transformation module were applied. The similarities between proposals were computed across frames, and the features were aggregated according to these similarities. Consequently, more robust features were generated for the final detection.
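The similarity-guided aggregation described above can be illustrated with a minimal sketch. It assumes cosine similarity followed by softmax normalization, and the function names and toy feature vectors are hypothetical; this is not the implementation in [77].

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(proposals):
    """Replace each proposal feature by a similarity-weighted average
    of all proposal features collected across the sequence."""
    out = []
    for p in proposals:
        weights = softmax([cosine(p, q) for q in proposals])
        out.append([sum(w * q[d] for w, q in zip(weights, proposals))
                    for d in range(len(p))])
    return out

# Toy proposal features pooled from different frames (hypothetical values).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
aggregated = aggregate(features)
```

Proposals that look alike reinforce each other, so a degraded view of an object borrows evidence from clearer views elsewhere in the sequence.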
In [74], OGEMN was presented, which uses an object-guided external memory to store pixel-level and instance-level features for further global aggregation. To improve storage efficiency, only the features within the bounding boxes were stored for further feature aggregation.
In [25], MEGA was introduced, which utilizes both global and local information, inspired by how humans detect objects in video using global semantic information and local localization information. When it was difficult to determine what the object in the current frame was, the global information was utilized to recognize the unclear object by reference to a clear object with a high similarity in another frame. When it was difficult to determine where the object was in a frame, the local localization information was used by taking the difference between adjacent frames if the object was moving. More specifically, RPN was used to generate candidate proposals from the local frames (the frames adjacent to the current frame) and the global frames. Then, a relation module was set up to aggregate the features of the candidate proposals on the global frames into those of the local frames. This was named the global aggregation stage, through which the global information was integrated into the local frames. Then, the features of the current frame were further augmented by the relation modules in the local aggregation stage. In order to expand the aggregation scale, an efficient module, Long Range Memory (LRM), was designed, in which all the intermediate features computed were saved and utilized in subsequent detections. Evaluated on the ImageNet VID dataset, MEGA with ResNet-101 as backbone achieved an accuracy of 82.9% mAP, a 1.1% improvement over the competitor RDN. With the stronger ResNeXt-101 backbone, MEGA obtained an accuracy of 84.1% mAP. With the help of post-processing, it achieved further improvements of 1.6% and 1.3% with ResNet-101 and ResNeXt-101, respectively.
Progressive Sparse Local Attention (PSLA) was proposed in [76] to make use of the long-term temporal information to enhance each feature cell in an attention manner. PSLA establishes correspondence by propagating features in a local region with a progressively sparser stride according to the spatial information across frames. Recursive Feature Updating (RFU) and Dense Feature Transforming (DenseFT) were also proposed based on PSLA to model the temporal relationship and enhance the features in the framework shown in Figure 7. More specifically, features were propagated in an attention manner. First, the correspondence between each feature cell in the embedding feature map of the current frame and its surrounding cells in the embedding feature map of a support frame was established, with a progressively sparser stride from the center outward. Second, the correspondence weights were used to compute the aligned feature maps, and the feature maps were aggregated with the aligned features. In addition, similar to other video object detectors, the features of key frames were propagated to non-key frames. A lightweight network was then applied to extract low-level features on non-key frames and fuse them with the features propagated from key frames (DenseFT). Feature propagation was also employed between key frames, and key frame features were updated recursively by an update network (RFU). Hence, features were enriched by the temporal information with DenseFT and RFU, and were then used for detection. In the experiments reported in [76], an accuracy of 81.4% mAP was achieved on the ImageNet VID dataset.
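To make the idea of a progressively sparser neighbourhood concrete, the following is a minimal sketch of attention-based alignment for a single feature cell. The offset pattern (rings whose radius and stride double at each level), the dot-product score and all names are assumptions chosen for illustration; they are not taken from [76].

```python
import math

def progressive_offsets(levels=3):
    """Offsets that are dense next to the centre and sparser further out.
    Ring radius and sampling stride both double at each level (assumed pattern)."""
    offsets = [(0, 0)]
    for level in range(levels):
        step = 2 ** level
        for dy in (-step, 0, step):
            for dx in (-step, 0, step):
                if (dy, dx) != (0, 0):
                    offsets.append((dy, dx))
    return offsets

def align_cell(cur_embed, sup_embed, sup_feat, y, x, offsets):
    """Softmax-weighted sum of support-frame features sampled around (y, x).
    Scores are dot products between embedding vectors."""
    h, w = len(sup_embed), len(sup_embed[0])
    scores, values = [], []
    for dy, dx in offsets:
        yy, xx = y + dy, x + dx
        if 0 <= yy < h and 0 <= xx < w:  # skip samples outside the map
            scores.append(sum(a * b for a, b in
                              zip(cur_embed[y][x], sup_embed[yy][xx])))
            values.append(sup_feat[yy][xx])
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [sum(e / total * v[d] for e, v in zip(exps, values))
            for d in range(len(values[0]))]

# Toy 5 x 5 maps with one channel each (hypothetical values).
embed = [[[1.0] for _ in range(5)] for _ in range(5)]
feat = [[[2.0] for _ in range(5)] for _ in range(5)]
offsets = progressive_offsets(levels=3)
aligned = align_cell(embed, embed, feat, 2, 2, offsets)
```

With this pattern, three levels cover a 9 × 9 window using 25 samples instead of 81, which is the kind of saving a sparse local region is meant to provide.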

Tracking-Based
Inspired by the fact that tracking is an efficient way to utilize the temporal information, several methods [78,79,81] have been developed to detect objects on frames at fixed intervals and track them in the frames in between. The improved methods in [26] and [80] select the frames to run detection on adaptively and track objects in the other frames.
A framework named CDT was presented in [79], combining detection and tracking for video object detection. This framework consisted of an object detector, a forward tracker and a backward tracker. Initially, objects were detected by the image object detector. Then, each detected object was tracked by the forward tracker, and undetected objects were stored by the backward tracker. In the entire process, the object detector and the tracker cooperated with each other to deal with the appearance and disappearance of objects.
Another framework, CaTDet, which offers high computational efficiency, was presented in [78]. As shown in Figure 8, this framework includes a tracker and a detector. CaTDet uses the tracker to predict, with high confidence, the position of objects in the next frame. The processing steps of CaTDet are: (i) every frame is fed into a proposal network to output potential proposals in the frame; (ii) the object position in the next frame is predicted with high confidence using the tracker; (iii) in order to obtain the calibrated object information, the outputs of the tracker and the proposal network are combined and fed into a refinement network.

More specifically, based on the observation that objects detected in one video frame would most likely appear in the next frame, a tracker was used to predict their positions in the next frame from the historical information. In case new objects appeared in the current frame, a computationally efficient proposal network similar to RPN was utilized to detect proposals. In addition, to address situations such as motion blur and occlusion, the temporal information was used by the tracker to predict future positions. The results obtained by combining the tracker and the proposal network were then refined by a refinement network. Only the regions of interest were refined by the refinement network to save computation time while maintaining accuracy.
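The way tracker predictions and proposals are combined before refinement can be sketched as follows. The constant-velocity motion model, the IoU-based de-duplication, the threshold value and all function names are assumptions made for illustration; they are not the components used in [78], and the refinement network itself is not modelled.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def predict_next(prev_box, cur_box):
    """Constant-velocity guess: shift the current box by its last displacement."""
    return tuple(c + (c - p) for p, c in zip(prev_box, cur_box))

def merge_regions(tracked, proposed, iou_thresh=0.5):
    """Union of tracker predictions and proposals; a proposal that overlaps
    a tracked box strongly is treated as a duplicate and dropped."""
    regions = list(tracked)
    for box in proposed:
        if all(iou(box, t) < iou_thresh for t in tracked):
            regions.append(box)
    return regions

# One object moving 5 px to the right per frame, plus a newly appearing object.
tracked = [predict_next((10, 10, 50, 50), (15, 10, 55, 50))]
proposed = [(21, 11, 61, 51), (200, 80, 240, 120)]
regions = merge_regions(tracked, proposed)  # regions passed on for refinement
```

Only the merged regions would be handed to the more expensive refinement stage, which is where the computational saving comes from.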
Like CDT and CaTDet, recent approaches for the detection and tracking of objects in video involve rather complex multistage components. In [81], a framework using a ConvNet architecture was deployed in a simple but effective way by performing tracking and detection simultaneously. More specifically, R-FCN [19] was first employed to extract the feature maps shared between detection and tracking. Then, proposals in each frame were obtained by using RPN based on anchors [15]. RoI pooling [15] was utilized for the final detection. In particular, a regressor was introduced to extend the architecture. Position-sensitive regression maps from both frames were used together with correlation maps as the input to an RoI tracking module, which output the box relationship between the two frames. For video object detection, the framework in [81] was evaluated on the ImageNet VID dataset, achieving an accuracy of 82.0% mAP.
Similarly, inspired by the observation that object tracking is more efficient than object detection, a framework (D or T) was presented in [80], see Figure 9, which includes a scheduler network to determine the operation (detecting or tracking) on a given frame. Compared with the frame-skipping baseline (detecting on frames at fixed intervals and tracking on intermediate frames), the lightweight and structurally simple scheduler network was found to be more effective on the ImageNet VID dataset. Moreover, an adaptive mechanism was used in [26] (TRACKING ASSISTED) to select key frames. Detection on key frames relied on an accurate detection network, and detection on non-key frames was assisted by the tracking module.
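The difference between the frame-skipping baseline and an adaptive scheduler can be sketched as follows. The per-frame `track_quality` scores stand in for the output of a scheduler network and, like the threshold, are hypothetical; the decision rule is an assumption and not the one learned in [80].

```python
def fixed_interval_schedule(num_frames, interval):
    """Frame-skipping baseline: detect every `interval` frames, track otherwise."""
    return ["detect" if i % interval == 0 else "track"
            for i in range(num_frames)]

def adaptive_schedule(track_quality, threshold=0.5):
    """Detect on the first frame and whenever the predicted tracking
    quality falls below the threshold; track on all other frames."""
    plan = []
    for i, quality in enumerate(track_quality):
        plan.append("detect" if i == 0 or quality < threshold else "track")
    return plan

fixed = fixed_interval_schedule(6, 3)
# Hypothetical scores: tracking becomes unreliable at frame 2 only.
adaptive = adaptive_schedule([1.0, 0.9, 0.4, 0.9, 0.8, 0.7])
```

In this toy case both plans run the detector twice, but the adaptive plan spends its second detection on the frame where tracking is predicted to fail rather than at a fixed position.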

Other Methods
Apart from the frameworks described above, some methods combine several of the above approaches [24,121,122]. The method in [24] is based on the optical flow and tracking methods. The methods in [121] (Attentional LSTM) and [122] (TSSD) are based on the attention and LSTM methods.
In addition, several other methods appear in the literature [36,83–90]. The methods in [83] and [87] address ways to align and enhance feature maps, while the method in [90] studies the effect of the input image size and selects a size that achieves a better speed-accuracy trade-off. The method in [83], named STSN (spatiotemporal sampling networks), is shown in Figure 10. This method aligns feature maps between adjacent frames. Similar to the FGFA method in [22], it relies on the idea that detection on a single frame has difficulties dealing with noise sources such as motion blur and video defocus. Multiple frames are thus utilized for feature enhancement to achieve better performance. Unlike FGFA, which uses the optical flow method to align feature maps, deformable convolution is employed for feature alignment in [83]. First, a shared feature extraction network is applied to extract feature maps on the current frame and adjacent frames. Then, the two feature maps are concatenated per channel and a deformable convolution is performed. The result of this deformable convolution is used as the offset for a second deformable convolution operation that aligns the feature maps. Furthermore, augmented feature maps are obtained by aggregating the features in the same way as FGFA. Compared with FGFA, STSN uses deformable convolution to align the features of two adjacent frames implicitly. Although this is not as intuitive as the optical flow method, it is also found to be effective. According to the experimental results reported, STSN achieved a slightly higher mAP than FGFA (78.9% vs. 78.8%) without relying on the optical flow information. In addition, without the assistance of temporal post-processing, STSN obtained a better performance than the D&T baseline [81], 78.9% vs. 75.8%.
Unlike [83], which uses deformable convolution to propagate the temporal information, the Spatial-Temporal Memory Network (STMN) in [87] adopts an RNN architecture with a Spatial-Temporal Memory Module (STMM) to incorporate the long-term temporal information. STMN operates in an end-to-end manner to model the long-term information and align the motion dynamics for video object detection. STMM, the core module of STMN, is a convolutional recurrent computation unit that fully utilizes the pretrained weights learned from static image datasets such as ImageNet [93]. This design is essential to address the practical difficulty of learning from video datasets, which largely lack diversity of objects within the same category. STMM receives the feature maps of the current frame at time step $t$ and the spatial-temporal memory $\overrightarrow{M}_{t-1}$, which carries the information of all the previous frames. Then, the spatial-temporal memory $\overrightarrow{M}_{t}$ of the current time step is updated. In order to capture the information of both later and previous frames, two STMMs are used for bidirectional feature aggregation to produce the memory $M$, which is employed for both classification and bounding box regression. In this way, the feature maps are propagated and aggregated by combining the information across multiple video frames. Evaluated on the ImageNet VID dataset, STMN achieved state-of-the-art accuracy.
All the algorithms described above start from the question of how to propagate and aggregate feature maps. In [90], video object detection was examined from another point of view. Similar to [123], the effect of the input image size on the performance of video object detection was studied in [90]. It was found that down-sampling images can sometimes yield better accuracy. Based on this finding, a framework named AdaScale was proposed to adaptively select the input image size. AdaScale predicts the best scale or size of the next frame according to the information of the current frame. One reason for the improvement is that the number of false positives is reduced. The other reason is that the number of true positives is increased by resizing overly large objects to a size that is suitable for the detector.
In [90], the optimal scale (pixels of the shortest side) of a given image is chosen from a predefined finite set of scales S (S = {600, 480, 360, 240} in [90]). A loss function consisting of the classification and regression loss is employed as the evaluation metric to compare the results across different scales. The regression loss for the background is expected to be zero. Hence, if the loss function is utilized directly to evaluate the results across different scales, the image scale which contains fewer foreground bounding boxes is favored. In order to deal with this issue, a new metric (a loss computed over the same number of foreground bounding boxes on every scale) is employed to compare across different scales. More specifically, the number of bounding boxes involved in computing the loss is set to the minimum number (m) of predicted foreground bounding boxes over all the scales. For each scale, the losses of the predicted foreground bounding boxes on the image are sorted in ascending order and the first m bounding boxes are chosen. The scale with the minimum loss is defined as the best scale. Inspired by R-FCN [19], which works on deep features for bounding box regression, the channels of the deep features are expected to contain the size information. Therefore, a scale regressor using deep features is built to predict the optimal scale. Evaluated on the ImageNet VID and mini YouTube-BB datasets, AdaScale achieved 1.3% and 2.7% mAP improvements with 1.6 and 1.8 times speedup compared with single-scale training and testing, respectively. Furthermore, combined with DFF [28], the speed was increased by 25% while maintaining mAP on the ImageNet VID dataset.
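The scale-selection metric described above can be sketched in a few lines. The per-box loss values below are made up for illustration, and the function names are hypothetical; only the procedure (take the m smallest losses per scale, with m the minimum box count over all scales) follows the description.

```python
def naive_best_scale(losses_per_scale):
    """Compare scales by the total loss over all foreground boxes.
    This favors scales that simply contain fewer foreground boxes."""
    totals = {s: sum(v) for s, v in losses_per_scale.items()}
    return min(totals, key=totals.get)

def best_scale(losses_per_scale):
    """Compare scales over the same number of foreground boxes.
    m is the minimum number of foreground boxes over all scales; for each
    scale, the m smallest per-box losses are summed."""
    m = min(len(v) for v in losses_per_scale.values())
    totals = {s: sum(sorted(v)[:m]) for s, v in losses_per_scale.items()}
    return min(totals, key=totals.get), totals

# Hypothetical per-box losses of predicted foreground boxes at each scale.
losses = {
    600: [0.2, 0.9, 0.4, 0.7],
    480: [0.3, 0.1, 0.5],
    360: [0.6, 0.2],
    240: [0.8, 0.9],
}
naive = naive_best_scale(losses)
scale, totals = best_scale(losses)
```

With these toy values, the naive comparison selects 360 because that scale has only two boxes, whereas the box-count-matched metric selects 480.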

Comparison of Video Object Detection Methods
The great majority of video object detection approaches use the ImageNet VID dataset [5] for performance evaluation. In this section, the timeline of video object detection methods in recent years is shown in Figure 11 together with a group listing of the methods in Figure 12. Then, a comparison is provided between the methods covered in the previous section. The comparison is presented in Tables 4 and 5, which report the results without and with post-processing, respectively. Figure 11 arranges the methods by time across the different groups, whereas Figure 12 arranges them by group across time. As can be seen from Figures 11 and 12, the methods based on optical flow were proposed earlier. During the same period, video object detection methods were assisted by tracking due to the effectiveness of tracking in utilizing the temporal-spatial information. The optical flow-based methods needed a large number of parameters and they were only suitable for small motions. In recent years, the methods based on attention have achieved much success, such as MEGA [25]. Using LSTM for feature propagation and aggregation has become an active research topic, and other new methods continue to be proposed, such as STSN [83], which uses deformable convolution to align the feature maps. The latest research is mostly based on attention, LSTM or a combination of methods such as Flow&LSTM [72].
Figure 12. Video object detection methods sorted in different groups.
Table 4 provides the outcomes without post-processing. In this table, the methods are divided into different groups according to the way the temporal and spatial information is utilized. Flow-guided methods propagate and align the feature maps according to the flow field obtained by optical flow. Both the accuracy and speed of the various frameworks are reported in this table. For example, DFF provides high computational efficiency and achieves a runtime of 20.25 fps using an NVIDIA K40 GPU, whereas FGFA achieves a high accuracy of 76.3% mAP at 1.36 fps. DFF is thus considerably faster than FGFA. Flow-guided methods offer an intuitive and well-understood way to propagate features. However, optical flow is deemed suitable only for small movement estimation. In addition, since optical flow reflects pixel-level displacement, it has difficulties when it is applied to high-level feature maps: a one-pixel movement on the feature maps may correspond to a movement of 10 to 20 pixels in the image.
Inspired by the LSTM-based solutions in natural language processing, LSTM methods are used to incorporate the sequence information. In the LSTM group, Flow&LSTM [72] achieved the highest accuracy of 75.5% mAP. Looking Fast and Slow [70] achieved a high speed but with low accuracy. LSTM captures the long-term information with a simple implementation.
However, since the sigmoid activations of the input and forget gates are rarely completely saturated, the state decays slowly and the long-term dependence is gradually lost. In other words, it is difficult to retain the complete previous state in the update.
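The effect of an unsaturated forget gate can be illustrated with a small calculation. The sketch assumes a constant forget-gate activation and no new input, which is a simplification; the gate values are illustrative only.

```python
def retained_fraction(forget_gate, steps):
    """Fraction of an initial state value that survives `steps` updates
    when the forget gate is constant and nothing new is written."""
    return forget_gate ** steps

# Retained fraction after 50 frames for a few illustrative gate values.
table = {f: retained_fraction(f, 50) for f in (1.0, 0.99, 0.95, 0.90)}
```

Under these assumptions, a forget-gate activation of 0.95 leaves less than 10% of the original state after 50 frames, whereas only a fully saturated gate of 1.0 retains it completely.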
Attention-based methods also show the ability to perform video object detection effectively. In the attention-related group, MEGA [25] with ResNeXt-101 as backbone achieved the highest accuracy of 84.1% mAP. As described, it achieved a very high accuracy with a relatively fast speed. Attention-based methods aggregate the features within the generated proposals, which decreases the computation time. Because only the features within the proposals are used, the performance depends to a certain extent on the quality of the RPN, and it is rather difficult to utilize more comprehensive information.
In the tracking-based group, the methods are assisted by tracking. D&T loss [81] achieved 75.8% mAP. Tracking is an efficient method to employ the temporal information with a detector assisted by a tracker. However, it cannot solve the problems created by motion blur and video defocus directly. As the detection performance relies on the tracking performance, the detector part suffers from tracking errors. There are also other standalone methods including TCNN [24], STSN [83] and STMN [87].
In order to further improve the performance in terms of detection accuracy, post-processing can be added to the above methods. The results with post-processing are shown in Table 5. One can easily see that with post-processing, the accuracy is noticeably improved. For example, the accuracy of MEGA is improved from 84.1% to 85.4% mAP.

Future Trends
Challenges still remain for further improving the accuracy and speed of the video object detection methods. This section presents the major challenges and possible future trends as related to video object detection.
At present, there is a lack of a comprehensive benchmark dataset containing the labels of each frame. The most widely used dataset, that is ImageNet VID, does not include complex real-world conditions as compared to the static image dataset COCO. The number of objects in each frame in the ImageNet VID dataset is limited, which is not the case under real-world conditions. In addition, in many real-world applications, videos include a large field of view and in some cases high resolution images. Lack of a well-annotated dataset representing actual or real-world conditions remains a challenge for the purpose of advancing video object detection. Hence, the establishment of a comprehensive benchmark dataset is considered a future trend of importance.
To date, the most widely used evaluation metric in video object detection is mAP, which is derived from static image object detection. This metric does not fully reflect the temporal characteristics of video object detection. Although Average Delay (AD) has been proposed to reflect these characteristics, it is not yet a fully developed metric. For example, it does not reflect the stability of detection across frames. Therefore, novel evaluation metrics that reflect detection stability and are better suited to video object detection are considered another important future trend.
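The gap between per-frame accuracy and temporal stability can be illustrated with a toy example. The sketch below is not a metric defined in this review or in the cited works; the function names `hit_rate` and `flicker_count` and the two detection sequences are hypothetical and serve only to show that two detectors with the same per-frame hit rate can differ greatly in how often a detection appears and disappears.

```python
from typing import Sequence


def hit_rate(hits: Sequence[bool]) -> float:
    """Fraction of frames in which the object was detected (a per-frame score)."""
    return sum(hits) / len(hits)


def flicker_count(hits: Sequence[bool]) -> int:
    """Number of detected/missed switches between consecutive frames.

    Hypothetical stability indicator used only for illustration.
    """
    return sum(1 for prev, cur in zip(hits, hits[1:]) if prev != cur)


# Two made-up 8-frame sequences for one object (True = detected in that frame).
stable = [True, True, True, True, False, False, False, False]
flickering = [True, False, True, False, True, False, True, False]

for name, hits in (("stable", stable), ("flickering", flickering)):
    print(f"{name}: hit rate = {hit_rate(hits):.2f}, switches = {flicker_count(hits)}")
```

Both sequences have the same hit rate, so a purely per-frame score cannot separate them, while the number of switches differs. A practical stability metric would also have to account for localization jitter and for multiple objects, which this sketch ignores.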
Most of the methods covered in this review utilize either local temporal information or global temporal information, but not both. Only a few methods, such as MEGA, use local and global temporal information at the same time; MEGA achieved an mAP of 85.4% with post-processing. As demonstrated by MEGA, it is worth developing future frameworks that utilize both local and global temporal information. Furthermore, for most of the existing video object detection algorithms, the number of frames used is too small to fully utilize the video information. Hence, as yet another future trend, it is important to develop methods that utilize long-term video information.
As can be observed from Tables 4 and 5, the attention-based frameworks achieved a relatively high accuracy. However, such methods demand very powerful GPUs, which makes them difficult to use in real-time applications. Although the Looking Fast and Slow method [70] achieved 72.3 fps on Pixel 3 phones, its accuracy is only 59.3% mAP, which poses challenges for actual deployment. Indeed, the trade-off between accuracy and speed needs to be further investigated. Real-time performance is important for practical applications such as autonomous driving and video surveillance. It is therefore important to pay more attention to methods for building light models while ensuring that the accuracy does not drop too much. Some light network design techniques used in classification, such as depthwise separable convolution [115] and channel shuffle [124], can serve as a reference for video object detection. In addition, model compression methods such as [125] can be considered as well.
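To give a sense of why depthwise separable convolution leads to lighter models, the following sketch compares weight counts for a single layer. It assumes the standard formulation (a k x k depthwise convolution followed by a 1 x 1 pointwise convolution) and ignores bias terms; the layer size of 256 input and 256 output channels with a 3 x 3 kernel is an arbitrary illustrative choice, not a configuration taken from any method in this review.

```python
def standard_conv_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(k: int, c_in: int, c_out: int) -> int:
    """Weights in a k x k depthwise conv plus a 1 x 1 pointwise conv (bias ignored)."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Illustrative layer size (assumption): 3 x 3 kernel, 256 -> 256 channels.
k, c_in, c_out = 3, 256, 256
standard = standard_conv_params(k, c_in, c_out)
separable = depthwise_separable_params(k, c_in, c_out)
ratio = separable / standard   # equals 1 / c_out + 1 / k**2

print(f"standard: {standard}, separable: {separable}, ratio: {ratio:.3f}")
```

For this layer the separable form needs roughly one ninth of the weights of the standard form. Whether the accuracy of a video object detector is preserved after such a substitution has to be verified empirically.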

Conclusions
Since the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) introduced the video object detection task in 2015, many deep learning-based video object detection solutions have been developed. This paper has provided a review of the video object detection methods developed so far. The review has covered the available datasets, the evaluation metrics and an overview of the different categories of deep learning-based methods for video object detection. The methods have been categorized according to the way temporal and spatial information is used. These categories include flow-based, LSTM-based, attention-based and tracking-based methods, as well as others. The performance of various detectors with and without post-processing is summarized in Tables 4 and 5 in terms of both detection accuracy and computation speed. Several important trends in video object detection have also been stated as directions for future work.