Open Access Article
Sensors 2018, 18(3), 774; doi:10.3390/s18030774

Deep Spatial-Temporal Joint Feature Representation for Video Object Detection

1,2,* , 1,2
School of Information and Electronics, Beijing Institute of Technology, Beijing 100081, China
Beijing Key Laboratory of Embedded Real-time Information Processing Technology, Beijing Institute of Technology, Beijing 100081, China
Author to whom correspondence should be addressed.
Received: 2 February 2018 / Revised: 26 February 2018 / Accepted: 27 February 2018 / Published: 4 March 2018
(This article belongs to the Special Issue Sensors Signal Processing and Visual Computing)
With the development of deep neural networks, many object detection frameworks have shown great success in smart surveillance, self-driving cars, and facial recognition. However, the data sources in these applications are usually videos, whereas most object detection frameworks are built on still images and use only spatial information. Because the training procedure discards temporal information, feature consistency across frames cannot be ensured. To address this problem, we propose a single, fully convolutional neural network-based object detection framework that incorporates temporal information by using Siamese networks. In the training procedure, first, the prediction network combines multiscale feature maps to handle objects of various sizes. Second, we introduce a correlation loss by using the Siamese network, which provides neighboring frame features. This correlation loss represents object co-occurrences across time to aid consistent feature generation. Because the correlation loss requires both track IDs and detection labels, we evaluate our video object detection network on the large-scale ImageNet VID dataset, where it achieves a 69.5% mean average precision (mAP).
Keywords: deep neural network; video object detection; temporal information; Siamese network; multiscale feature representation
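
The abstract does not give the formula for the correlation loss, so the following is only a minimal sketch of the general idea under stated assumptions: features of the same tracked object in two neighboring frames are pushed to agree, here by penalizing one minus their cosine similarity. The function names, the dictionary layout keyed by track ID, and the choice of cosine similarity are all hypothetical illustrations and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as uncorrelated
    return dot / (norm_a * norm_b)


def correlation_loss(feats_t, feats_neighbor):
    """Hypothetical temporal-consistency loss (an assumption, not the paper's).

    feats_t, feats_neighbor: dicts mapping a track ID to the feature vector
    of that object in frame t and in a neighboring frame. Only track IDs
    present in both frames contribute; the result is the mean of
    (1 - cosine similarity) over those shared IDs.
    """
    shared = sorted(set(feats_t) & set(feats_neighbor))
    if not shared:
        return 0.0  # no co-occurring objects, so nothing to penalize
    total = sum(
        1.0 - cosine_similarity(feats_t[k], feats_neighbor[k]) for k in shared
    )
    return total / len(shared)
```

In this sketch the track ID decides which feature pairs are compared, which mirrors why a dataset with track annotations is needed; in a real detector the two feature sets would come from the two weight-sharing branches of a Siamese network.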

Figure 1

This is an open access article distributed under the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.


MDPI and ACS Style

Zhao, B.; Zhao, B.; Tang, L.; Han, Y.; Wang, W. Deep Spatial-Temporal Joint Feature Representation for Video Object Detection. Sensors 2018, 18, 774.

Sensors, EISSN 1424-8220. Published by MDPI AG, Basel, Switzerland.