# Vehicle Speed Estimation Based on 3D ConvNets and Non-Local Blocks

## Abstract


## 1. Introduction

**Camera Calibration** [1,5,6]. In short, this method obtains an algorithm-generated scale and calculates the speed from vehicle trajectories acquired by a two-stage process of detection and tracking. The camera calibration process requires the intrinsic and extrinsic parameters as input. Because camera models, shooting angles, and installation positions differ, calibration is required separately for each camera. Some calibration-based methods need multiple manual measurements on the road [5,7,8], and some algorithms impose limitations on camera placement [1,5,9,10]. Moreover, the accuracy of the speed estimate relies heavily on the accuracy of the detection and tracking algorithms. A variety of suboptimal conditions (illumination variation, motion blur, background clutter, overlapping, etc.) can cause problems in vehicle detection and tracking, such as missed detections, false detections, and tracking loss [11]. When vehicle detection or tracking is subpar, the error of the speed estimate becomes significant. More detailed analyses are presented in Section 2.1.

- We propose 3D ConvNets to estimate average vehicle speed from video footage. In contrast to camera calibration-based methods, ours is an end-to-end method, independent of external factors. In particular, it does not entail detecting or tracking vehicles.
- We propose to include non-local blocks so as to capture spatiotemporal information from a ‘global’ perspective.
- We propose to add optical flow computed from the video as an additional model input. Optical flow carries useful information about vehicle motion, which improves model accuracy.
- We propose to employ multi-scale convolution to extract information about vehicles whose contours appear at different scales. This design addresses the variation in apparent vehicle size caused by the vehicles’ different distances from the camera.

## 2. Related Work

#### 2.1. Methods Based on Camera Calibration

**Camera Calibration**: A projection matrix $P = K\left[\, R \mid T \,\right]$ is obtained, where $K$ represents the intrinsic camera parameters, while $R$ and $T$ are the extrinsic parameters, representing the camera rotation and translation, respectively. The intrinsic parameters are related to the characteristics of the camera itself, such as the focal length and pixel size. The extrinsic parameters describe the position and orientation of the camera relative to some fiducial coordinate system. The orientation of the camera often changes slightly due to weather and other factors, so the parameters need to be reset frequently.

**Detection-Tracking**: Detect the contours of the vehicle in the image, then track the vehicle by the detected contours. This step plots the trajectory of the vehicle on the road.

**Vehicle Speed Calculation**: Calculate the traveling distance of the vehicle from the information obtained in stages one and two, and measure the elapsed time during the traveling period. With the distance and time duration, the vehicle speed follows from the basic formula $v = s/t$.
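The final stage can be sketched in a few lines of Python. This is a hypothetical illustration, not the paper's implementation: the metric scale (`meters_per_pixel`) and the tracked trajectory are assumed to come from the calibration and detection-tracking stages above.

```python
# Sketch of the speed-calculation stage (v = s / t), assuming a pre-computed
# metric scale and a vehicle trajectory produced by detection and tracking.
# The scale value and trajectory below are illustrative, not from the paper.

def speed_from_trajectory(trajectory, meters_per_pixel):
    """trajectory: list of (t_seconds, x_px, y_px) tracked vehicle positions."""
    distance_m = 0.0
    for (t0, x0, y0), (t1, x1, y1) in zip(trajectory, trajectory[1:]):
        # Accumulate the metric distance traveled between consecutive frames.
        distance_m += meters_per_pixel * ((x1 - x0) ** 2 + (y1 - y0) ** 2) ** 0.5
    elapsed_s = trajectory[-1][0] - trajectory[0][0]
    return distance_m / elapsed_s  # speed in m/s

track = [(0.0, 100, 50), (0.5, 180, 50), (1.0, 260, 50)]  # straight-line motion
v = speed_from_trajectory(track, meters_per_pixel=0.05)   # 160 px/s * 0.05 m/px
print(v * 3.6)  # convert m/s to km/h
```

Any error in the scale or in the trajectory propagates directly into $v$, which is exactly the fragility the camera calibration-based pipeline suffers from.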

#### 2.2. Video Action Recognition

#### 2.3. Non-Local Algorithm

## 3. Model

- Recognizing the asymmetry of spatial and temporal information, we introduce (2+1)D convolution to extract spatiotemporal features more effectively;
- We embed non-local blocks into the network to take ‘global’ information into consideration;
- We employ multi-scale convolution to better capture the information of vehicles whose apparent sizes vary with their distances from the camera.

We use the **Mean Square Error** (MSE) as the loss function:
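In standard form, with $v_i$ the labeled average speed of the $i$-th clip, $\hat{v}_i$ the model prediction, and $N$ the number of training samples (notation assumed here), the MSE loss reads:

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{v}_i - v_i\right)^2
```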

#### 3.1. Inflated (2+1)D Convolution
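As a sketch of the underlying idea, following Tran et al. [30]: a full $t\times d\times d$ 3D convolution is factorized into a $1\times d\times d$ spatial convolution followed by a $t\times 1\times 1$ temporal convolution, with the number of intermediate channels $M_i$ chosen so that the (2+1)D block's parameter count approximately matches the full 3D kernel. The parameter accounting can be checked in plain Python (the channel numbers below are illustrative):

```python
# Parameter-count sketch of the (2+1)D factorization (after Tran et al. [30]),
# for a t x d x d kernel mapping n_in input channels to n_out output channels.

def midplanes(t, d, n_in, n_out):
    # Intermediate channel count that keeps the (2+1)D block's parameter
    # count approximately equal to that of the full 3D convolution.
    return (t * d * d * n_in * n_out) // (d * d * n_in + t * n_out)

def params_3d(t, d, n_in, n_out):
    return t * d * d * n_in * n_out

def params_2plus1d(t, d, n_in, n_out):
    m = midplanes(t, d, n_in, n_out)
    spatial = d * d * n_in * m   # the 1 x d x d spatial convolution
    temporal = t * m * n_out     # the t x 1 x 1 temporal convolution
    return spatial + temporal

# Example: a 3x3x3 kernel mapping 64 -> 64 channels.
print(params_3d(3, 3, 64, 64), params_2plus1d(3, 3, 64, 64))  # both 110592
```

The factorization keeps the parameter budget while inserting an extra nonlinearity between the spatial and temporal convolutions.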

#### 3.2. Non-Local Blocks
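A minimal numpy sketch of a non-local operation in its embedded-Gaussian form (following Wang et al. [33]) is given below. The block computes, for every spatiotemporal position, a response that is a weighted sum over *all* positions, which is the ‘global’ perspective referred to above. All weight matrices are randomly initialized purely for illustration:

```python
import numpy as np

def nonlocal_block(x, w_theta, w_phi, w_g, w_z):
    """x: (n, c) array of n flattened spatiotemporal positions with c channels."""
    theta, phi, g = x @ w_theta, x @ w_phi, x @ w_g    # (n, c_mid) embeddings
    logits = theta @ phi.T                             # pairwise similarities (n, n)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)            # softmax over all positions
    y = attn @ g                                       # 'global' weighted aggregation
    return y @ w_z + x                                 # project back, residual add

rng = np.random.default_rng(0)
n, c, c_mid = 8, 16, 8                                 # illustrative sizes
x = rng.normal(size=(n, c))
w = [rng.normal(size=s) for s in [(c, c_mid)] * 3 + [(c_mid, c)]]
z = nonlocal_block(x, *w)
print(z.shape)  # output keeps the input shape, so the block can be embedded anywhere
```

Because input and output shapes match, such a block can be inserted between existing convolutional stages without changing the rest of the architecture.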

#### 3.3. Multi-Scale Convolution
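The idea can be illustrated with a simplified 1D sketch: the same signal is convolved with kernels of several sizes and the responses are concatenated, so features of differently sized vehicles are captured within one layer. The kernel sizes and averaging kernels here are assumptions for illustration, not the paper's configuration:

```python
import numpy as np

def multi_scale_conv1d(signal, kernel_sizes=(3, 5, 7)):
    """Convolve one signal with kernels of several sizes and stack the
    'same'-length responses, mimicking a multi-scale convolution branch."""
    branches = []
    for k in kernel_sizes:
        kernel = np.ones(k) / k                      # illustrative averaging kernel
        branches.append(np.convolve(signal, kernel, mode="same"))
    return np.stack(branches)                        # (num_scales, len(signal))

signal = np.sin(np.linspace(0, 3 * np.pi, 64))       # dummy input signal
features = multi_scale_conv1d(signal)
print(features.shape)
```

A larger kernel responds to larger structures (vehicles close to the camera), a smaller kernel to smaller ones (vehicles far away); concatenating the branches lets later layers use both.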

## 4. Experiments and Analysis

#### 4.1. BrnoCompSpeed Dataset

**Vehicle Speed Dataset**: We split each of the 18 original videos from the BrnoCompSpeed dataset into multiple short videos, each t seconds long. Each t-second video clip is used as a sample. We then calculate the average vehicle speed in each t-second video and label the clip with that average speed. The consideration behind splitting the original videos is a tradeoff in temporal length: the receptive field of 3D ConvNets in the time domain is limited, so they cannot process temporal information of long duration; on the other hand, videos that are too short are not sufficiently informative. On balance, we set $t=10$ in building VehSpeedDataset10. Table 2 shows the dataset partition into training and test sets: we distribute 80% of the total 5332 short videos to the training set and the remaining 20% to the test set.
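The labeling step can be sketched as follows. This is a hypothetical illustration: the record format `(timestamp_s, speed_kmh)` for per-vehicle ground-truth measurements is an assumption, not the BrnoCompSpeed annotation format.

```python
# Sketch of the clip-splitting step: cut one video's per-vehicle speed records
# into t-second clips and label each clip with the average speed inside it.

def label_clips(records, duration_s, t=10):
    """records: list of (timestamp_s, speed_kmh) measurements in one video."""
    labels = []
    for start in range(0, int(duration_s) - t + 1, t):
        in_clip = [v for (ts, v) in records if start <= ts < start + t]
        if in_clip:  # only keep clips that contain at least one vehicle
            labels.append((start, sum(in_clip) / len(in_clip)))
    return labels

records = [(2.0, 80.0), (6.5, 90.0), (14.0, 70.0)]   # dummy measurements
print(label_clips(records, duration_s=20))           # [(0, 85.0), (10, 70.0)]
```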

#### 4.2. Implementation Details

**Data Augmentation**: It is acknowledged that data augmentation is of great importance for deep neural networks, so we apply it in our model. In the training phase, we first resize each video image to $256\times 256$ and then randomly crop a $224\times 224$ patch from the $256\times 256$ image. We randomly choose the starting frame, provided it is early enough that the desired number of frames can still be sampled. The time intervals between the frames chosen as model inputs are fixed and uniform. We also apply random left-right flipping to each video in the training phase. In the test phase, we resize an image to $256\times 256$ and then take the $224\times 224$ center patch.
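The training-time sampling and augmentation described above can be sketched with numpy. This is a hedged sketch under stated assumptions: resizing is omitted (frames are assumed to be $256\times 256$ already) and `num_frames` and `stride` are illustrative values, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_and_augment(video, num_frames=16, stride=2, crop=224):
    """video: (T, 256, 256, 3) array; returns (num_frames, crop, crop, 3)."""
    span = (num_frames - 1) * stride + 1
    start = rng.integers(0, video.shape[0] - span + 1)  # early enough to fit
    clip = video[start : start + span : stride]         # uniform frame interval
    y, x = rng.integers(0, 256 - crop + 1, size=2)      # random 224x224 patch
    clip = clip[:, y : y + crop, x : x + crop]
    if rng.random() < 0.5:                              # random left-right flip
        clip = clip[:, :, ::-1]
    return clip

video = np.zeros((120, 256, 256, 3), dtype=np.uint8)    # dummy 120-frame video
print(sample_and_augment(video).shape)
```

At test time, the random crop and flip would simply be replaced by the deterministic $224\times 224$ center crop.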

#### 4.3. 3D ConvNets and Non-Local Neural Networks

#### 4.3.1. Different 3D ConvNets

#### 4.3.2. Adding Optical Flow Information

#### 4.3.3. Non-Local Blocks and Multi-Scale

#### 4.3.4. Comparison of Different Methods

## 5. Conclusions and Future Work

## Supplementary Materials

## Author Contributions

## Funding

## Conflicts of Interest

## Abbreviations

Abbreviation | Meaning
---|---
3D ConvNets | 3-dimensional convolutional networks
MAE | Mean absolute error
MSE | Mean square error
RoI | Region of interest

## References

- Lan, J.; Li, J.; Hu, G.; Ran, B.; Wang, L. Vehicle speed measurement based on gray constraint optical flow algorithm. Optik-Int. J. Light Electron Opt. **2014**, 125, 289–295.
- Mathew, T. Intrusive and Non-Intrusive Technologies; Tech. Rep.; Indian Institute of Technology Bombay: Mumbai, India, 2014.
- Luvizon, D.C.; Nassu, B.T.; Minetto, R. A video-based system for vehicle speed measurement in urban roadways. IEEE Trans. Intell. Transp. Syst. **2017**, 18, 1393–1404.
- Huang, T. Traffic speed estimation from surveillance video data. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Salt Lake City, UT, USA, 18–22 June 2018; pp. 161–165.
- Nurhadiyatna, A.; Hardjono, B.; Wibisono, A.; Sina, I.; Jatmiko, W.; Ma’sum, M.A.; Mursanto, P. Improved vehicle speed estimation using Gaussian mixture model and hole filling algorithm. In Proceedings of the 2013 International Conference on Advanced Computer Science and Information Systems (ICACSIS), Bali, Indonesia, 28–29 September 2013; pp. 451–456.
- Wang, H.; Liu, L.; Dong, S.; Qian, Z.; Wei, H. A novel work zone short-term vehicle-type specific traffic speed prediction model through the hybrid EMD–ARIMA framework. Transportmet. B Transp. Dyn. **2016**, 4, 159–186.
- Maduro, C.; Batista, K.; Peixoto, P.; Batista, J. Estimation of vehicle velocity and traffic intensity using rectified images. In Proceedings of the 2008 15th IEEE International Conference on Image Processing, San Diego, CA, USA, 12–15 October 2008; pp. 777–780.
- Sina, I.; Wibisono, A.; Nurhadiyatna, A.; Hardjono, B.; Jatmiko, W.; Mursanto, P. Vehicle counting and speed measurement using headlight detection. In Proceedings of the 2013 International Conference on Advanced Computer Science and Information Systems (ICACSIS), Bali, Indonesia, 28–29 September 2013; pp. 149–154.
- Dailey, D.J.; Cathey, F.W.; Pumrin, S. An algorithm to estimate mean traffic speed using uncalibrated cameras. IEEE Trans. Intell. Transp. Syst. **2000**, 1, 98–107.
- Grammatikopoulos, L.; Karras, G. Automatic estimation of vehicle speed from uncalibrated video sequences. In Proceedings of the FIG-ISPRS-ICA International Symposium on Modern Technologies, Education & Professional Practice in Geodesy & Related Fields, Sofia, Bulgaria, 9–10 November 2006.
- Nam, H.; Baek, M.; Han, B. Modeling and propagating CNNs in a tree structure for visual tracking. arXiv **2016**, arXiv:1608.07242.
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 770–778.
- Lin, T.Y.; Dollár, P.; Girshick, R.B.; He, K.; Hariharan, B.; Belongie, S.J. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; Volume 1, p. 4.
- Carreira, J.; Zisserman, A. Quo vadis, action recognition? A new model and the Kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4724–4733.
- He, X.C.; Yung, N.H. A novel algorithm for estimating vehicle speed from two consecutive images. In Proceedings of the 2007 IEEE Workshop on Applications of Computer Vision (WACV ’07), Austin, TX, USA, 21–22 February 2007; p. 12.
- You, X.; Zheng, Y. An accurate and practical calibration method for roadside camera using two vanishing points. Neurocomputing **2016**, 204, 222–230.
- He, X.; Yung, N.H.C. New method for overcoming ill-conditioning in vanishing-point-based camera calibration. Opt. Eng. **2007**, 46, 037202.
- Kumar, A.; Khorramshahi, P.; Lin, W.A.; Dhar, P.; Chen, J.C.; Chellappa, R. A semi-automatic 2D solution for vehicle speed estimation from monocular videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 2018, Salt Lake City, UT, USA, 18–22 June 2018.
- Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; pp. 3645–3649.
- Dubská, M.; Herout, A.; Sochor, J. Automatic camera calibration for traffic understanding. BMVC **2014**, 4, 8.
- Sochor, J.; Juránek, R.; Herout, A. Traffic surveillance camera calibration by 3D model bounding box alignment for accurate vehicle speed measurement. Comput. Vis. Image Underst. **2017**, 161, 87–98.
- Filipiak, P.; Golenko, B.; Dolega, C. NSGA-II based auto-calibration of automatic number plate recognition camera for vehicle speed measurement. In Proceedings of the European Conference on the Applications of Evolutionary Computation, Porto, Portugal, 30 March–1 April 2016; pp. 803–818.
- Sochor, J.; Juránek, R.; Špaňhel, J.; Maršík, L.; Širokỳ, A.; Herout, A.; Zemčík, P. Comprehensive data set for automatic single camera visual speed measurement. IEEE Trans. Intell. Transp. Syst. **2018**, 20, 1633–1643.
- Ji, S.; Xu, W.; Yang, M.; Yu, K. 3D convolutional neural networks for human action recognition. IEEE Trans. Pattern Anal. Mach. Intell. **2013**, 35, 221–231.
- Tran, D.; Bourdev, L.; Fergus, R.; Torresani, L.; Paluri, M. Learning spatiotemporal features with 3D convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 11–18 December 2015; pp. 4489–4497.
- Feichtenhofer, C.; Pinz, A.; Wildes, R. Spatiotemporal residual networks for video action recognition. In Proceedings of the Advances in Neural Information Processing Systems, Barcelona, Spain, 5–10 December 2016; pp. 3468–3476.
- Simonyan, K.; Zisserman, A. Two-stream convolutional networks for action recognition in videos. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; pp. 568–576.
- Tran, D.; Ray, J.; Shou, Z.; Chang, S.F.; Paluri, M. ConvNet architecture search for spatiotemporal feature learning. arXiv **2017**, arXiv:1708.05038.
- Qiu, Z.; Yao, T.; Mei, T. Learning spatio-temporal representation with pseudo-3D residual networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 5534–5542.
- Tran, D.; Wang, H.; Torresani, L.; Ray, J.; LeCun, Y.; Paluri, M. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 2018, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6450–6459.
- Buades, A.; Coll, B.; Morel, J.M. A non-local algorithm for image denoising. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005; Volume 2, pp. 60–65.
- Dabov, K.; Foi, A.; Katkovnik, V.; Egiazarian, K. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. Image Proc. **2007**, 16, 2080–2095.
- Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local neural networks. arXiv **2017**, arXiv:1711.07971.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Farnebäck, G. Two-frame motion estimation based on polynomial expansion. In Proceedings of the Scandinavian Conference on Image Analysis, Halmstad, Sweden, 29 June–2 July 2003; pp. 363–370.
- Loshchilov, I.; Hutter, F. Fixing weight decay regularization in Adam. arXiv **2017**, arXiv:1711.05101.
- Burton, A.; Radford, J. Thinking in Perspective: Critical Essays in the Study of Thought Processes; Routledge: Abingdon, UK, 1978; Volume 646.
- Warren, D.H.; Strelow, E.R. Electronic Spatial Sensing for the Blind: Contributions from Perception, Rehabilitation, and Computer Vision; Springer Science & Business Media: Berlin/Heidelberg, Germany, 2013; Volume 99.
- Rodger, J.A. Toward reducing failure risk in an integrated vehicle health maintenance system: A fuzzy multi-sensor data fusion Kalman filter approach for IVHMS. Exp. Syst. Appl. **2012**, 39, 9821–9836.

Name | 2D ResNet50 | R3D50 | R(2+1)D50
---|---|---|---
Conv1 | $7\times 7, 64$ | $7\times 7\times 7, 64$ | $5\times 7\times 7, 64$
Pooling | $3\times 3$ | $3\times 3\times 3$ | $1\times 3\times 3$
Conv2_x | $\begin{bmatrix}1\times 1, 64\\ 3\times 3, 64\\ 1\times 1, 256\end{bmatrix}$ | $\begin{bmatrix}1\times 1\times 1, 64\\ 3\times 3\times 3, 64\\ 1\times 1\times 1, 256\end{bmatrix}$ | $\begin{bmatrix}3\times 1\times 1, 64\\ 1\times 3\times 3, 64\\ 1\times 1\times 1, 256\end{bmatrix}$
Conv3_x | $\begin{bmatrix}1\times 1, 128\\ 3\times 3, 128\\ 1\times 1, 512\end{bmatrix}$ | $\begin{bmatrix}1\times 1\times 1, 128\\ 3\times 3\times 3, 128\\ 1\times 1\times 1, 512\end{bmatrix}$ | $\left\{\begin{bmatrix}3\times 1\times 1, 128\\ 1\times 3\times 3, 128\\ 1\times 1\times 1, 512\end{bmatrix},\ \begin{bmatrix}1\times 1\times 1, 128\\ 1\times 3\times 3, 128\\ 1\times 1\times 1, 512\end{bmatrix}\right\}$
Conv4_x | $\begin{bmatrix}1\times 1, 256\\ 3\times 3, 256\\ 1\times 1, 1024\end{bmatrix}$ | $\begin{bmatrix}1\times 1\times 1, 256\\ 3\times 3\times 3, 256\\ 1\times 1\times 1, 1024\end{bmatrix}$ | $\left\{\begin{bmatrix}3\times 1\times 1, 256\\ 1\times 3\times 3, 256\\ 1\times 1\times 1, 1024\end{bmatrix},\ \begin{bmatrix}1\times 1\times 1, 256\\ 1\times 3\times 3, 256\\ 1\times 1\times 1, 1024\end{bmatrix}\right\}$
Conv5_x | $\begin{bmatrix}1\times 1, 512\\ 3\times 3, 512\\ 1\times 1, 2048\end{bmatrix}$ | $\begin{bmatrix}1\times 1\times 1, 512\\ 3\times 3\times 3, 512\\ 1\times 1\times 1, 2048\end{bmatrix}$ | $\left\{\begin{bmatrix}1\times 1\times 1, 512\\ 1\times 3\times 3, 512\\ 1\times 1\times 1, 2048\end{bmatrix},\ \begin{bmatrix}1\times 1\times 1, 512\\ 1\times 3\times 3, 512\\ 1\times 1\times 1, 2048\end{bmatrix}\right\}$
Avgpooling | $7\times 7$ | $1\times 7\times 7$ | $1\times 7\times 7$

Dataset | Training Set | Test Set | Amount
---|---|---|---
VehSpeedDataset10 | 4266 | 1066 | 5332

3D ConvNets | MAE | MSE
---|---|---
IR3D18 | 5.04 | 44.35
IR3D34 | 4.97 | 42.21
IR3D50 | 4.83 | 41.14
I3D (Inception-v1) | 4.58 | 38.00
IR(2+1)D50 | 3.76 | 25.80

Input | $56\times 56$ (MAE/MSE) | $112\times 112$ (MAE/MSE) | $224\times 224$ (MAE/MSE)
---|---|---|---
RGB | 3.99 / 28.91 | 3.80 / 25.87 | 3.76 / 25.80
Optical flow | 3.60 / 23.39 | 3.16 / 18.15 | 3.02 / 17.35
cat | 3.50 / 21.60 | 3.08 / 17.92 | 3.01 / 17.24

Configuration | Conv_Add | Non-Local | Multi-Scale Conv
---|---|---|---
IR(2+1)D50 | × | × | ×
IR(2+1)D50-c | ✓ | × | ×
IR(2+1)D50-cNL | ✓ | ✓ | ×
IR(2+1)D50-cNLms | ✓ | ✓ | ✓

Input | IR(2+1)D50 (MAE/MSE) | IR(2+1)D50-c (MAE/MSE) | IR(2+1)D50-cNL (MAE/MSE) | IR(2+1)D50-cNLms (MAE/MSE)
---|---|---|---|---
Optical flow | 3.02 / 17.35 | 2.97 / 16.61 | 2.85 / 15.55 | 2.81 / 14.81
cat | 3.01 / 17.24 | 2.95 / 16.30 | 2.85 / 15.30 | 2.73 / 14.62

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Dong, H.; Wen, M.; Yang, Z.
Vehicle Speed Estimation Based on 3D ConvNets and Non-Local Blocks. *Future Internet* **2019**, *11*, 123.
https://doi.org/10.3390/fi11060123
