Lane and Road Marker Semantic Video Segmentation Using Mask Cropping and Optical Flow Estimation
Abstract
1. Introduction
- Providing two forms of prior information to the network. The first is an image preprocessing step based on morphological dilation and erosion: the previous-frame segmentation is dilated and used as a mask to crop the current frame before it is fed to the network, and the cropped image serves as one of the network inputs, indicating the regions that are most likely to contain the targets. The second is the optical flow computed between adjacent frames: exploiting the temporal consistency of neighbouring frames, the previous-frame segmentation result is warped to the current frame position and becomes another network input. A minimal code sketch of both priors is given after this list.
- Designing an end-to-end trainable multi-input, single-output network that uses this prior information to jointly segment lanes and road markers. No post-processing is introduced, which avoids the associated extra computational cost and allows real-time segmentation without frame delay.
- Evaluating our network in detail on the Apolloscape and CULane benchmarks. The experimental results show that our algorithm produces smoother results on video sequences, is more robust than competing algorithms, and offers better real-time performance. In particular, lanes and road markers that are difficult to observe in individual images can still be segmented by our algorithm.
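To make the two priors concrete, below is a minimal sketch in Python (OpenCV and NumPy) of the mask-cropping and flow-warping steps described above. The function names, the dilation kernel size, and the Farneback flow used in the usage comment are illustrative assumptions, not the authors' exact implementation; the paper obtains optical flow from a learned estimator (PWC-Net [34]) and trains the whole pipeline end to end.

```python
import cv2
import numpy as np


def masked_crop(frame, prev_seg, kernel_size=25):
    """Prior 1: dilate the previous-frame segmentation and use it as a mask
    to crop the current frame, keeping only regions that are likely to
    contain lanes or road markers. The kernel size is an illustrative value."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    mask = cv2.dilate((prev_seg > 0).astype(np.uint8), kernel)
    return frame * mask[..., None]  # zero out pixels outside the dilated mask


def warp_prev_seg(prev_seg, flow_cur_to_prev):
    """Prior 2: warp the previous-frame segmentation to the current frame.
    flow_cur_to_prev is a dense H x W x 2 field giving, for each current-frame
    pixel, its displacement to the corresponding previous-frame position."""
    h, w = flow_cur_to_prev.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow_cur_to_prev[..., 0]).astype(np.float32)
    map_y = (grid_y + flow_cur_to_prev[..., 1]).astype(np.float32)
    # nearest-neighbour sampling keeps the warped result a valid label map
    return cv2.remap(prev_seg, map_x, map_y, interpolation=cv2.INTER_NEAREST)


# Usage sketch (Farneback flow stands in for the learned flow network):
# flow = cv2.calcOpticalFlowFarneback(curr_gray, prev_gray, None,
#                                     0.5, 3, 15, 3, 5, 1.2, 0)
# net_inputs = (curr_frame,
#               masked_crop(curr_frame, prev_seg),
#               warp_prev_seg(prev_seg, flow))
```

Both prior inputs are cheap to compute per frame, which is consistent with the no-post-processing, no-frame-delay design goal stated above.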
2. Related Work
3. Model
3.1. Architecture Overview
3.2. Optical Flow Estimation
3.3. Mask and Crop
3.4. The Main Network
3.4.1. Feature Extraction from Different Input Sources
3.4.2. Feature Fusion and Preliminary Prediction
3.5. Warping the Previous-Frame Segmentation Result
3.6. Optimized Network
3.7. Loss Function
4. Experiment and Analysis
4.1. Dataset and Evaluation Metric
4.2. Training
4.3. Ablation Experiments
4.4. Evaluation
4.4.1. Apolloscape Lanemark Segmentation Dataset
4.4.2. CULane Dataset
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Andrade, D.C.; Bueno, F.; Franco, F.R.; Silva, R.A.; Neme, J.H.Z.; Margraf, E.; Omoto, W.T.; Farinelli, F.A.; Tusset, A.M.; Okida, S.; et al. A novel strategy for road lane detection and tracking based on a vehicle’s forward monocular camera. IEEE Trans. Intell. Transp. Syst. 2018, 20, 1497–1507.
2. Wu, C.B.; Wang, L.H.; Wang, K.C. Ultra-low complexity block-based lane detection and departure warning system. IEEE Trans. Circ. Syst. Video Technol. 2018, 29, 582–593.
3. Lee, C.; Moon, J.H. Robust lane detection and tracking for real-time applications. IEEE Trans. Intell. Transp. Syst. 2018, 19, 4043–4048.
4. Gu, S.; Lu, T.; Zhang, Y.; Alvarez, J.M.; Yang, J.; Kong, H. 3-D LiDAR + monocular camera: An inverse-depth-induced fusion framework for urban road detection. IEEE Trans. Intell. Veh. 2018, 3, 351–360.
5. Yuan, C.; Chen, H.; Liu, J.; Zhu, D.; Xu, Y. Robust lane detection for complicated road environment based on normal map. IEEE Access 2018, 6, 49679–49689.
6. Li, J.; Mei, X.; Prokhorov, D.; Tao, D. Deep neural network for structural prediction and lane detection in traffic scene. IEEE Trans. Neural Netw. Learn. Syst. 2016, 28, 690–703.
7. Oliveira, G.L.; Bollen, C.; Burgard, W.; Brox, T. Efficient and robust deep networks for semantic segmentation. Int. J. Robot. Res. 2018, 37, 472–491.
8. Teng, Z.; Kim, J.H.; Kang, D.J. Real-time lane detection by using multiple cues. In Proceedings of the ICCAS 2010, Gyeonggi-do, Korea, 27–30 October 2010; pp. 2334–2337.
9. Sotelo, M.A.; Rodriguez, F.J.; Magdalena, L.; Bergasa, L.M.; Boquete, L. A color vision-based lane tracking system for autonomous driving on unmarked roads. Auton. Robot. 2004, 16, 95–116.
10. Kaur, G.; Kumar, D. Lane detection techniques: A review. Int. J. Comput. Appl. 2015, 112, 4–8.
11. Huval, B.; Wang, T.; Tandon, S.; Kiske, J.; Song, W.; Pazhayampallil, J.; Andriluka, M.; Rajpurkar, P.; Migimatsu, T.; Cheng-Yue, R.; et al. An empirical evaluation of deep learning on highway driving. arXiv 2015, arXiv:1504.01716.
12. Kim, J.; Lee, M. Robust lane detection based on convolutional neural network and random sample consensus. In Proceedings of the International Conference on Neural Information Processing, Montreal, QC, Canada, 8–13 December 2014; pp. 454–461.
13. Zou, Q.; Jiang, H.; Dai, Q.; Yue, Y.; Chen, L.; Wang, Q. Robust lane detection from continuous driving scenes using deep neural networks. IEEE Trans. Veh. Technol. 2019, 69, 41–54.
14. Neven, D.; De Brabandere, B.; Georgoulis, S.; Proesmans, M.; Van Gool, L. Towards end-to-end lane detection: An instance segmentation approach. In Proceedings of the 2018 IEEE Intelligent Vehicles Symposium (IV), Changshu, China, 26–30 June 2018; pp. 286–291.
15. Qin, Z.; Wang, H.; Li, X. Ultra fast structure-aware deep lane detection. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Part XXIV; pp. 276–291.
16. Pan, X.; Shi, J.; Luo, P.; Wang, X.; Tang, X. Spatial as deep: Spatial CNN for traffic scene understanding. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI), New Orleans, LA, USA, 2–7 February 2018.
17. Zhang, J.; Xu, Y.; Ni, B.; Duan, Z. Geometric constrained joint lane segmentation and lane boundary detection. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 486–502.
18. Lee, S.; Kim, J.; Shin Yoon, J.; Shin, S.; Bailo, O.; Kim, N.; Lee, T.H.; Seok Hong, H.; Han, S.H.; So Kweon, I. VPGNet: Vanishing point guided network for lane and road marking detection and recognition. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 1947–1955.
19. Liao, Y.; Liu, Q. Multi-level and multi-scale feature aggregation network for semantic segmentation in vehicle-mounted scenes. Sensors 2021, 21, 3270.
20. Sediqi, K.M.; Lee, H.J. A novel upsampling and context convolution for image semantic segmentation. Sensors 2021, 21, 2170.
21. Fayyaz, M.; Saffar, M.H.; Sabokrou, M.; Fathy, M.; Klette, R.; Huang, F. STFCN: Spatio-temporal FCN for semantic video segmentation. arXiv 2016, arXiv:1608.05971.
22. Nilsson, D.; Sminchisescu, C. Semantic video segmentation by gated recurrent flow propagation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 6819–6828.
23. Jin, X.; Li, X.; Xiao, H.; Shen, X.; Lin, Z.; Yang, J.; Chen, Y.; Dong, J.; Liu, L.; Jie, Z.; et al. Video scene parsing with predictive feature learning. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5580–5588.
24. Gadde, R.; Jampani, V.; Gehler, P.V. Semantic video CNNs through representation warping. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4453–4462.
25. Li, Y.; Shi, J.; Lin, D. Low-latency video semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 5997–6005.
26. Liu, Y.; Shen, C.; Yu, C.; Wang, J. Efficient semantic video segmentation with per-frame inference. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; pp. 352–368.
27. Lu, S.; Luo, Z.; Gao, F.; Liu, M.; Chang, K.; Piao, C. A fast and robust lane detection method based on semantic segmentation and optical flow estimation. Sensors 2021, 21, 400.
28. Yang, K.; Zhang, J.; Reiß, S.; Hu, X.; Stiefelhagen, R. Capturing omni-range context for omnidirectional segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–24 June 2021; pp. 1376–1386.
29. Zheng, S.; Lu, J.; Zhao, H.; Zhu, X.; Luo, Z.; Wang, Y.; Fu, Y.; Feng, J.; Xiang, T.; Torr, P.H.; et al. Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 21–24 June 2021; pp. 6881–6890.
30. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
31. Liu, L.; Chen, X.; Zhu, S.; Tan, P. CondLaneNet: A top-to-down lane detection framework based on conditional convolution. arXiv 2021, arXiv:2105.05003.
32. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
33. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
34. Sun, D.; Yang, X.; Liu, M.Y.; Kautz, J. PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 8934–8943.
35. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988.
36. Huang, X.; Wang, P.; Cheng, X.; Zhou, D.; Geng, Q.; Yang, R. The ApolloScape open dataset for autonomous driving and its application. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 42, 2702–2719.
37. Kingma, D.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
38. Zhao, H.; Shi, J.; Qi, X.; Wang, X.; Jia, J. Pyramid scene parsing network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2881–2890.
39. Chen, L.C.; Zhu, Y.; Papandreou, G.; Schroff, F.; Adam, H. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 801–818.
40. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122.
41. Sun, K.; Xiao, B.; Liu, D.; Wang, J. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 5693–5703.
42. Hou, Y.; Ma, Z.; Liu, C.; Loy, C.C. Learning lightweight lane detection CNNs by self attention distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Korea, 27–28 October 2019; pp. 1013–1021.
43. Tabelini, L.; Berriel, R.; Paixao, T.M.; Badue, C.; De Souza, A.F.; Oliveira-Santos, T. Keep your eyes on the lane: Real-time attention-guided lane detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 294–302.
| Training Strategy | mIoU cls | mIoU cat | TC |
|---|---|---|---|
| Direct joint training | 0.708 | 0.891 | 0.719 |
| Two-stage training | 0.776 | 0.953 | 0.771 |
| Clipping-Net Branch | | mIoU cls | mIoU cat | TC |
|---|---|---|---|---|
| 0.1 | 1 | 0.693 | 0.877 | 0.632 |
| 0.1 | 8 | 0.699 | 0.879 | 0.678 |
| 0.5 | 8 | 0.713 | 0.892 | 0.704 |
| 0.5 | 25 | 0.769 | 0.946 | 0.763 |
| 1.2 | 25 | 0.776 | 0.953 | 0.771 |
| 1.2 | 50 | 0.773 | 0.951 | 0.770 |
| 2.3 | 50 | 0.768 | 0.947 | 0.770 |
| 2.8 | 50 | 0.767 | 0.945 | 0.769 |
| 2.8 | 80 | 0.754 | 0.928 | 0.761 |
| Full image (without cropping) | | 0.742 | 0.910 | 0.728 |
| Category | Class | No Clipping-Net Branch | Clipping-Net Branch | |
|---|---|---|---|---|
| Dividing | White solid | 0.995 | 0.997 | 0.987 |
| | Yellow solid | 0.871 | 0.903 | 0.873 |
| | Yellow double solid | 0.903 | 0.909 | 0.895 |
| | White solid and broken | 0.863 | 0.894 | 0.887 |
| Guiding | White broken | 0.705 | 0.715 | 0.706 |
| | Yellow broken | 0.847 | 0.878 | 0.851 |
| Stopping | White solid | 0.789 | 0.792 | 0.787 |
| Parking | White solid | 0.667 | 0.679 | 0.662 |
| Zebra | Crosswalk | 0.858 | 0.921 | 0.854 |
| Rotation arrow | White thru | 0.653 | 0.712 | 0.681 |
| | White thru and left | 0.668 | 0.698 | 0.663 |
| | White thru and right | 0.732 | 0.768 | 0.758 |
| | White left | 0.730 | 0.744 | 0.741 |
| | White right | 0.619 | 0.626 | 0.620 |
| | White left and right | 0.566 | 0.587 | 0.541 |
| Reduction | Speed bump | 0.571 | 0.617 | 0.579 |
| Attention | Zebra attention | 0.743 | 0.759 | 0.733 |
| No park | No parking | 0.759 | 0.761 | 0.752 |
| Average | | 0.752 | 0.776 | 0.754 |
| | Method | mIoU cls | mIoU cat | TC | FPS |
|---|---|---|---|---|---|
| Single frame | VPGNet [18] | 0.720 | 0.905 | 0.653 | 42 |
| | DeepLabv3+ [39] | 0.761 | 0.942 | 0.706 | 21 |
| | PSPNet [38] | 0.748 | 0.921 | 0.683 | 30 |
| | Dilation-CNN [40] | 0.673 | 0.859 | 0.611 | 59 |
| | HRNet [41] | 0.735 | 0.916 | 0.664 | 87 |
| Multiframe | PSPNet + GRFP [22] | 0.755 | 0.935 | 0.761 | 26 |
| | PSPNet + NetWarp [24] | 0.754 | 0.931 | 0.767 | 25 |
| | PSPNet + Eff. [26] | 0.762 | 0.938 | 0.793 | 30 |
| | Dilation + GRFP [22] | 0.687 | 0.864 | 0.679 | 54 |
| | Dilation + NetWarp [24] | 0.688 | 0.863 | 0.692 | 51 |
| | HRNet + Eff. [26] | 0.744 | 0.923 | 0.785 | 87 |
| | Ours | 0.776 | 0.953 | 0.771 | 74 |
| Category | SCNN [16] | ENet-SAD [42] | LaneATT [43] (Small) | LaneATT [43] (Medium) | LaneATT [43] (Large) | CondLaneNet [31] (Small) | CondLaneNet [31] (Medium) | CondLaneNet [31] (Large) | Ours |
|---|---|---|---|---|---|---|---|---|---|
| Normal | 90.60 | 90.10 | 91.17 | 92.14 | 91.74 | 92.87 | 93.38 | 93.47 | 92.69 |
| Crowded | 69.70 | 68.80 | 72.71 | 75.03 | 76.16 | 75.79 | 77.14 | 77.44 | 76.37 |
| Dazzle light | 58.50 | 60.20 | 65.82 | 66.47 | 69.74 | 70.72 | 71.17 | 70.93 | 71.34 |
| Shadow | 66.90 | 65.90 | 68.03 | 78.15 | 76.31 | 80.01 | 79.93 | 80.91 | 80.52 |
| No line | 43.40 | 41.60 | 49.13 | 49.39 | 50.46 | 52.39 | 51.85 | 54.13 | 51.71 |
| Arrow | 84.10 | 84.00 | 87.82 | 88.38 | 86.29 | 89.37 | 89.89 | 90.16 | 90.19 |
| Curve | 64.40 | 65.70 | 63.75 | 67.72 | 64.05 | 72.40 | 73.88 | 75.21 | 71.45 |
| Night | 66.10 | 66.00 | 68.58 | 70.72 | 70.81 | 73.23 | 73.92 | 74.80 | 74.93 |
| Crossroad (false positives; lower is better) | 1990 | 1998 | 1020 | 1330 | 1264 | 1364 | 1387 | 1201 | 1376 |
| Total | 71.60 | 70.80 | 75.13 | 76.68 | 77.02 | 78.14 | 78.74 | 79.48 | 78.32 |
| TC | 0.634 | 0.667 | 0.691 | 0.705 | 0.708 | 0.732 | 0.737 | 0.749 | 0.783 |
| FPS | 17.2 | 82 | 277 | 197 | 30 | 248 | 171 | 65 | 89 |