A Survey on Deep Learning Based Approaches for Scene Understanding in Autonomous Driving

Abstract: As a prerequisite for autonomous driving, scene understanding has attracted extensive research. With the rise of the convolutional neural network (CNN)-based deep learning technique, research on scene understanding has achieved significant progress. This paper aims to provide a comprehensive survey of deep learning-based approaches for scene understanding in autonomous driving. We categorize these works into four work streams, including object detection, full scene semantic segmentation, instance segmentation, and lane line segmentation. We discuss and analyze these works according to their characteristics, advantages and disadvantages, and basic frameworks. We also summarize the benchmark datasets and evaluation criteria used in the research community and make a performance comparison of some of the latest works. Lastly, we summarize the review work and provide a discussion on the future challenges of the research domain.


Introduction
Scene understanding, positioning and navigation, path planning, and control execution are the four basic modules in an autonomous driving system, among which scene understanding is the core module. As described in [1], the complex task of outdoor scene understanding involves several subtasks such as object detection, scene categorization, depth estimation, tracking, event categorization, and behavior analysis. Scene understanding plays the role that the eyes play for a human driver, and it is a prerequisite for autonomous driving.
In the past ten years, machine learning has moved from shallow models to deep models with the advancement of neural network theory and the improvement of hardware computing capabilities. Shallow machine learning models do not use distributed representations and require manually designed features, and the quality of these features largely determines the quality of the entire system. Deep learning is a kind of representation learning that can learn higher-level abstract representations of data and automatically extract features from the data. Each hidden unit computes a weighted combination of its inputs followed by a nonlinear activation, and the weights between the hidden layers and the input layer act as the weights of the input features in that combination. The representational capability of a deep model can increase exponentially with the depth of the model. The CNN is the most common deep learning model applied to autonomous driving. A CNN is parametrized by its weight vector θ = [W, b], where W is the set of weights governing the interneural connections and b is the set of neuron bias values. Both W and b are learned from the data during the training process. The convolutional layers within a CNN exploit the local spatial correlations of image pixels to capture image features. The last layer of a CNN is usually a fully connected layer, which acts as an object discriminator for a high-level abstract representation of objects. Due to these advantages, CNN-based deep learning approaches have been widely used in scene understanding for autonomous driving and have achieved great success.
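To make the parametrization θ = [W, b] concrete, the following is a minimal pure-Python sketch of a single convolutional unit. The function name `conv2d_valid`, the toy image, and the hand-picked kernel are illustrative assumptions; in a real CNN, W and b would be learned from data rather than set by hand.

```python
def conv2d_valid(image, W, b):
    """Slide kernel W over the image (stride 1, no padding), add bias b, apply ReLU."""
    kh, kw = len(W), len(W[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = b
            # Weighted sum over the local neighbourhood covered by the kernel.
            for u in range(kh):
                for v in range(kw):
                    s += W[u][v] * image[i + u][j + v]
            row.append(max(0.0, s))  # ReLU nonlinearity
        out.append(row)
    return out


# Toy image with a vertical edge between columns 1 and 2 (assumed data).
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
W = [[-1.0, 1.0]]  # hand-picked horizontal-difference kernel (assumption)
b = 0.0
feature_map = conv2d_valid(image, W, b)
```

The resulting feature map responds only at the column where the intensity changes, which illustrates how a convolutional layer exploits local spatial correlations.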
This paper presents a survey on research using deep learning-based approaches for scene understanding in autonomous driving, focusing especially on two main tasks: object detection and scene segmentation. Similar surveys can be found in [2,3], with [2] specifically providing an overview of deep learning-based applications covering almost all aspects of autonomous driving, including perception and localization, high-level path planning, behavior arbitration, and motion controllers. Our paper focuses on only one specific area, scene understanding, and therefore provides much more detail on the algorithms for the various scene understanding tasks. The deep learning-based approaches for scene text detection and recognition for the purpose of scene understanding were reviewed in [3]. Although that survey had the same goal as our paper, the algorithms covered are significantly different. To the best of our knowledge, our paper is the first comprehensive summary specifically focusing on vision-based deep learning algorithms for scene understanding in autonomous driving.
As illustrated in Figure 1, we categorized the research on scene understanding in autonomous driving into two streams: object detection and scene segmentation. Object detection is identifying and locating various obstacles in a road traffic scene in the form of bounding boxes, such as pedestrians, vehicles, and cyclists. Scene segmentation is assigning a semantic category label to each pixel in a scene image, and it can be regarded as a refinement of object detection. Scene segmentation can be further divided into three substreams: full-scene semantic segmentation, object instance segmentation, and lane line segmentation. A traffic scene normally contains object categories like obstacles, free space (roads), lane lines, and so on. Full-scene semantic segmentation is performing pixel-level semantic segmentation on these categories in a full image. Instance segmentation is designed to identify individual instances within a category area, and it can be regarded as a more elaborate semantic segmentation. As Neven et al. [4] mentioned, obstacles and free space are categories with relatively concentrated pixels, while lane lines are a pixel-continuous and non-dense category. It is difficult to segment lane lines and other objects at the same time in an image. Therefore, this paper surveys lane line segmentation as a special substream. As for the relations between the tasks, instance segmentation can not only provide pixel-level recognition but also distinguish individuals, which is more meaningful for autonomous driving but also a harder task. As a specific task, lane line recognition is indispensable for lane departure warnings and lane keeping applications. Figure 1 also illustrates the overall framework of this paper. The research on scene understanding is organized as four work streams, including object detection, full-scene semantic segmentation, instance segmentation, and lane line segmentation.
In terms of approaches, object detection is divided into two categories of methods: the two-stage method and the one-stage method. Full-scene semantic segmentation is divided into two categories of methods: encoder-decoder and modified convolution. As a special task in full-scene semantic segmentation, road segmentation is reviewed as a specific category. The approaches for instance segmentation are divided into region proposal and masking. Lane line segmentation is divided into the two-step method and the end-to-end method.
The remainder of the paper is structured as follows. In Section 2, the classification models (i.e., the basic deep CNN models) developed in the early years are reviewed, and the characteristics of the related algorithms are summarized. In Section 3, we discuss and analyze the research work of the four work streams (object detection, full-scene semantic segmentation, instance segmentation, and lane line segmentation) according to their characteristics, advantages and disadvantages, and basic frameworks. Section 4 gives a performance comparison of some of the latest algorithms. Section 5 introduces the benchmark datasets and evaluation criteria accepted in the research community. Section 6 concludes with remarks and provides a discussion of the future challenges of the research domain.

Basic CNN Models
Deep learning-based object detection and scene segmentation originated from object classification; that is, classification models developed in the early years formed the basic models of detection and segmentation. Therefore, we give a brief overview of the object classification models in this section. The basic structure of deep CNNs can be traced back to LeNet, proposed by LeCun et al. [5] in 1990. It is composed of a convolutional layer, a pooling layer, a fully connected layer, and an activation function. In 2012, Krizhevsky et al. [6] extended LeNet to a deeper network, called AlexNet, capable of learning more complex features with the use of the ILSVRC database [7]. This work significantly improved the accuracy of image classification and initiated a continuous boom of deep learning research. Subsequently, VGGNet [8], GoogleNet [9], and ResNet [10] came along successively. With continuous improvement of the network structure, many further networks followed, including lightweight ones, such as ResNeXt [11], ShuffleNet [12], and so on. These deep CNN models promoted continuous breakthroughs in computer vision tasks, such as object detection, semantic segmentation, and instance segmentation. Table 1 reviews the evolution of these CNNs in terms of the year, background, algorithm characteristics, and contributions.
These models are all suitable to be used as base models in autonomous driving to extract features for detection and classification purposes. In practice, the model should be selected via experiments according to applications. Generally, lightweight models are more suitable for tasks with larger datasets and can greatly reduce training time.
Table 1. Evolution of the basic CNN models (model, year, background, algorithm characteristics, contributions).
VGGNet [8], 2014. Background: champion of the localization task in ILSVRC 2014. Characteristics: repeatedly stacking convolutional layers and pooling layers. Contributions: the relationship between the depth of the CNN and the performance of the model is studied.
GoogleNet [9], 2014. Background: champion of the classification task in ILSVRC 2014. Characteristics: Inception V1 module. Contributions: efficient use of 1 × 1, 3 × 3, and 5 × 5 convolutions. The efficiency reached the human level, and the model subsequently developed into V2 [13], V3 [14], V4 [15], and Xception [16].
ResNet [10], 2015. Background: champion in ILSVRC 2015. Characteristics: residual unit. Contributions: learning the difference between the input and output. Subsequently, ResNeXt [11] was proposed by combining it with Inception.
DenseNet [17], 2016. Background: proposed by Gao et al. Characteristics: DenseBlock. Contributions: realization of reuse between features.
NASNet [18], 2017. Background: proposed by Google. Characteristics: ResNet + Inception. Contributions: combination of previous network structures.
ShuffleNet [12], 2017. Background: proposed by Zhang et al. Characteristics: channel shuffle. Contributions: improving network information blocking.
SeNet [19], 2017. Background: champion in ILSVRC 2017. Characteristics: squeeze and excitation module. Contributions: the relationship between the feature channels is studied.

Scene Understanding in Autonomous Driving
In this section, we review deep learning-based approaches for scene understanding in terms of four work streams: object detection, full-scene semantic segmentation, instance segmentation, and lane line segmentation.

Object Detection
The approaches for object detection are divided into the two-stage method and the one-stage method. Table 2 gives a summary of the representative work in terms of their characteristics, advantages and disadvantages, and basic frameworks.
Table 2 (two-stage methods). Methods: R-CNN [20], SPPNet [21], Fast R-CNN [22], Faster R-CNN [23], Cascade R-CNN [24], FPN [25]. Characteristics: (1) propose a large number of regions by selective searching; (2) detect objects in the region proposals.

Two-Stage Method
The two-stage line of work began with the R-CNN [20], which combines region proposal extraction and a CNN. It is the first algorithm that successfully applied deep learning to object detection, with features extracted by a CNN rather than from blockwise orientation histograms. The process of detection is shown in Figure 2. The method first generates a large number of regions using selective searching [35] and then extracts features on the region proposals using CNNs. However, the R-CNN has the following problems: (1) the extracted region proposals must be cropped or warped to a fixed size, resulting in missing information or a distorted image; (2) it is time-consuming, since it processes the region proposals separately; and (3) it is not an end-to-end network model. He et al. proposed the Spatial Pyramid Pooling Network (SPPNet) [21] in 2014 to address these issues. It still adopted region proposals, but it introduced a pyramid pooling module to improve the efficiency of feature extraction of the R-CNN. It avoided the repeated feature computation for overlapping region proposals and effectively solved the problems caused by cropping and warping. In 2015, Girshick et al. [22] proposed Fast R-CNN to combine the advantages of both the R-CNN and SPPNet. It extracted features on the entire image and replaced the pyramid pooling module in SPPNet with Region of Interest (ROI) pooling layers so that a fixed-dimensional feature map was extracted for each region proposal.
Furthermore, to address the problem of step-by-step training, it treated object detection as a bounding-box regression problem and proposed a multitask loss function for training. Since feature extraction was conducted on the entire image rather than on each region separately, the computing efficiency improved considerably. In addition, Fast R-CNN realized an end-to-end training process. The improvements within Fast R-CNN [22] revealed that region proposal generation was the remaining computational bottleneck. Ren et al. [23] therefore improved Fast R-CNN to a newer version (i.e., Faster R-CNN) that replaced selective searching with a region proposal network (RPN) to obtain high-quality region proposals. The RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position and can be trained end-to-end.
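ROI pooling, as used in Fast R-CNN, turns a region proposal of arbitrary size into a fixed-size feature map. The following is a minimal pure-Python sketch, assuming a single-channel feature map stored as a nested list and an ROI already expressed in feature-map cells. The function name `roi_max_pool`, the bin-edge rounding (floor for the start, ceiling for the end), and the toy values are illustrative assumptions, not details taken from [22].

```python
def roi_max_pool(feature, roi, out_h, out_w):
    """Max-pool the ROI (x0, y0, x1, y1), end-exclusive, into an out_h x out_w grid."""
    x0, y0, x1, y1 = roi
    h, w = y1 - y0, x1 - x0
    pooled = []
    for i in range(out_h):
        # Row range of bin i; every bin covers at least one cell.
        ys = y0 + (i * h) // out_h
        ye = max(ys + 1, y0 + ((i + 1) * h + out_h - 1) // out_h)
        row = []
        for j in range(out_w):
            xs = x0 + (j * w) // out_w
            xe = max(xs + 1, x0 + ((j + 1) * w + out_w - 1) // out_w)
            row.append(max(feature[y][x]
                           for y in range(ys, ye)
                           for x in range(xs, xe)))
        pooled.append(row)
    return pooled


# Toy 4 x 6 feature map (assumed data).
feature = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
pooled_a = roi_max_pool(feature, (0, 0, 4, 4), 2, 2)  # 4 x 4 proposal
pooled_b = roi_max_pool(feature, (1, 1, 6, 4), 2, 2)  # 3 x 5 proposal
```

Both proposals yield a 2 × 2 output despite their different shapes, which is what allows a single fully connected head to process every proposal.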
In this subsection, we introduced the development process of the two-stage method, starting from the R-CNN, which used selective searching to create region proposals. Although this affects computation efficiency, it provides a good idea for the object detection task. SPPNet, Fast R-CNN, and Faster R-CNN followed the idea. By designing the pyramid pooling module, ROI pooling, the RPN, and other modules, they simplified region proposal generation and continuously improved the model training effectiveness. In summary, the two-stage method achieves good object detection accuracy, but its detection speed is limited by its complex structure. It is suitable for scenes with small objects.

One-Stage Method
In 2015, a one-stage detection method called You Only Look Once-V1 (YOLO-V1) [26] was proposed by Redmon et al. It addressed the detection speed issue, which had been troublesome in the two-stage detection methods [20,21,23]. The authors framed object detection as a regression problem to spatially separated bounding boxes and associated class probabilities. A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation. Since the whole detection pipeline is a one-stage network, it can be optimized end-to-end directly for detection performance. The network sees the whole image during both training and prediction, which makes good use of the context information and therefore reduces the false detection rate.
In 2015, Liu et al. proposed the Single Shot MultiBox Detector (SSD) [27] by combining the advantages of Faster R-CNN [23] and YOLO-V1 [26]. The network combines predictions from multiple feature maps with different resolutions to handle objects of various sizes. SSD eliminates proposal generation and the subsequent pixel or feature resampling stages and encapsulates all computations in a single network. This makes SSD easy to train and straightforward to integrate into systems that require a detection component.
In 2017, Lin et al. [28] analyzed the gap in accuracy and speed between the two-stage and the one-stage methods. They discovered that the extreme foreground-background class imbalance encountered during the training of dense detectors was the central cause. They proposed focal loss to address this class imbalance by reshaping the standard cross-entropy loss so that it downweighted the loss assigned to well-classified examples.
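The reshaping of the cross-entropy loss can be sketched for the binary case as FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The following is a minimal sketch; the function names and the default values γ = 2 and α = 0.25 are assumptions made for illustration and are not stated in this survey.

```python
import math


def cross_entropy(p, y):
    """Binary cross-entropy for predicted foreground probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Focal loss: cross-entropy scaled by (1 - p_t) ** gamma and a class weight alpha_t."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy (well-classified) and a hard foreground example (assumed probabilities).
easy_ratio = focal_loss(0.9, 1, alpha=1.0) / cross_entropy(0.9, 1)
hard_ratio = focal_loss(0.1, 1, alpha=1.0) / cross_entropy(0.1, 1)
```

With γ = 2, the easy example keeps about 1% of its cross-entropy loss while the hard example keeps about 81%, so the many easy background samples no longer dominate training.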
In 2017, Redmon et al. [29] improved YOLO-V1 to YOLO-V2 and YOLO9000, the latter of which can detect over 9000 categories. YOLO-V2 proposed a new basic network, Darknet-19, which greatly improved the detection performance, especially on small objects. The following year, YOLO-V3 [30], based on V1 [26] and V2 [29], was proposed. In terms of the network structure, it drew on ResNet [10] and proposed a more complex backbone network (Darknet-53) to extract features. At the same time, the Feature Pyramid Network (FPN) [25] was added to improve the detection of multiscale objects. Compared with other models, it has obvious advantages in detection speed and further strengthens the detection ability for small objects.
In 2019, Tian et al. [31] proposed a fully convolutional one-stage object detector (FCOS) to solve object detection in a per-pixel prediction fashion. FCOS is anchor box free, as well as proposal free. By eliminating the predefined set of anchor boxes, FCOS completely avoids the complicated computations related to anchor boxes, such as calculating overlaps during training. More importantly, it also avoids all hyperparameters related to anchor boxes, to which the detection performance is often very sensitive. Chen et al. [33] used a RoIConv operator for alignment of the features and designed a fully convolutional architecture (AlignDet) for combining the flexibility of learned anchors and the preciseness of aligned features. Duan et al. [34] (CenterNet) modeled an object as a single point and used key point estimation to find the center point and then regressed to obtain object parameters, including size, location, and orientation. In 2020, GC-YOLOv3 [32] made YOLO-V3 more accurate with a global context block. The authors fused a feature extraction network with a feature pyramid network to improve detection accuracy.
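The per-pixel, anchor-free formulation of FCOS can be illustrated by the regression target of a single location: the distances from that location to the four sides of the ground-truth box. The following is a minimal sketch, assuming boxes in (x0, y0, x1, y1) form; the function names and the strict-inequality test for "inside the box" are illustrative assumptions.

```python
def fcos_targets(x, y, box):
    """Return (left, top, right, bottom) distances, or None if (x, y) is not inside the box."""
    x0, y0, x1, y1 = box
    l, t, r, b = x - x0, y - y0, x1 - x, y1 - y
    if min(l, t, r, b) <= 0:
        return None  # location is treated as background for this box
    return (l, t, r, b)


def decode_box(x, y, distances):
    """Invert fcos_targets: rebuild the box from a location and its four distances."""
    l, t, r, b = distances
    return (x - l, y - t, x + r, y + b)


gt_box = (2, 1, 10, 7)                 # assumed ground-truth box
targets = fcos_targets(5, 4, gt_box)   # a location inside the box
recovered = decode_box(5, 4, targets)
outside = fcos_targets(0, 0, gt_box)   # a location outside the box
```

No anchor shapes, scales, or overlap thresholds appear in this target assignment, which is the source of the reduction in hyperparameters mentioned above.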
Different from the two-stage method, the one-stage method does not have a separate stage for proposal generation. It typically considers all positions in the image as potential objects and tries to classify each region of interest as either the background or an object. In this subsection, we introduced the development process of the one-stage method starting from YOLO-V1. After that, YOLO V2-V4 followed. Later, a series of anchor-free object detectors (e.g., FCOS, AlignDet, and CenterNet) were developed, where the goal was to predict the key points of the bounding box instead of trying to fit an object to an anchor. In summary, the one-stage method reduced the difficulty of training and deployment. Compared with the two-stage method, the one-stage method is faster but has slightly poorer detection accuracy. It is more suitable for scenes with sparse populations, such as suburban villages.

Full-Scene Semantic Segmentation
Full-scene semantic segmentation segments object categories at the pixel level in a full image. The approaches for full-scene segmentation are divided into two categories: encoder-decoder structure models and modified convolution structure models. As a specific and important task, road segmentation has been widely studied. Thus, we review road segmentation as the third subsection. Table 3 gives a summary of the representative works in terms of their characteristics, core technology and functions, basic frameworks, and road segmentation.

Encoder-Decoder Structure Models
Unlike object detection and classification, semantic segmentation classifies at the pixel level and is thus more difficult. The traditional semantic segmentation methods [55][56][57][58] rely on hand-crafted features that are usually tailored for a specific task. These methods do not offer ideal performance in terms of speed and accuracy. A breakthrough occurred in 2014, when Long et al. [36] proposed the fully convolutional network (FCN) and realized end-to-end pixel-level semantic segmentation. Its key insight is to build fully convolutional layers to automatically extract features for segmentation purposes. There are two main modifications: (1) The last fully connected layer of the CNN [8][9][10] is replaced by a convolution layer that outputs a size-reduced heat map (segmented map). This part of the network is effectively an encoding process. (2) The segmented map is restored to the original size using bilinear interpolation. However, restoring the image in this way lacks sensitivity to image details and leads to rough and blurry segmentation. Moreover, the segmentation is based on local area information without consideration of global information.
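The bilinear restoration step of the FCN, and the blurring it introduces, can be seen in a minimal pure-Python sketch. The function name `bilinear_upsample`, the corner-aligned coordinate mapping, and the 2 × 2 toy score map are assumptions chosen for brevity.

```python
def bilinear_upsample(src, out_h, out_w):
    """Resize a 2D score map to out_h x out_w by bilinear interpolation (corner-aligned)."""
    in_h, in_w = len(src), len(src[0])
    out = []
    for i in range(out_h):
        # Map the output row to a fractional source row.
        fy = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(fy)
        y1 = min(y0 + 1, in_h - 1)
        wy = fy - y0
        row = []
        for j in range(out_w):
            fx = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(fx)
            x1 = min(x0 + 1, in_w - 1)
            wx = fx - x0
            # Interpolate along x on the two neighbouring rows, then along y.
            top = (1 - wx) * src[y0][x0] + wx * src[y0][x1]
            bottom = (1 - wx) * src[y1][x0] + wx * src[y1][x1]
            row.append((1 - wy) * top + wy * bottom)
        out.append(row)
    return out


coarse = [[0.0, 1.0], [1.0, 0.0]]  # assumed size-reduced score map
fine = bilinear_upsample(coarse, 3, 3)
```

Every interpolated cell in `fine` takes the value 0.5, i.e., the sharp class boundary of the coarse map is smeared out, which mirrors the rough and blurry segmentation noted above.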
Badrinarayanan et al. [37] improved on the FCN and proposed SegNet, which first adopted the encoder-decoder structure. SegNet consists of an encoder network and a corresponding decoder network, followed by a pixel-wise classification layer. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature maps. Specifically, the decoder uses pooling indices computed in the max pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample; it does not involve deconvolution and speeds up training. An illustration of the SegNet architecture is shown in Figure 3. Many scene segmentation models adopt the encoder-decoder network structure, such as U-Net [38], ENet [39], RefineNet [59], and so on. SegNet established the encoder-decoder structure and achieved significant progress in semantic segmentation. However, SegNet has a complex architecture and a large number of parameters, which makes the network too slow to run in real time. In 2016, Paszke et al. [39] proposed ENet to improve computation efficiency. It drew on [14] to optimize the architecture through modular network design. The model adopted early downsampling, a large encoder, a small decoder, and nonlinear operations. Experiments showed that the model ran much faster than SegNet.
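The index-based upsampling of the SegNet decoder can be sketched as follows. This is a minimal pure-Python illustration, assuming 2 × 2 pooling windows with stride 2 on a single-channel map whose height and width are even; the function names and the toy values are hypothetical.

```python
def max_pool_2x2_with_indices(x):
    """2 x 2 max pooling that also records where each maximum came from."""
    pooled, indices = [], []
    for i in range(0, len(x), 2):
        prow, irow = [], []
        for j in range(0, len(x[0]), 2):
            window = [(x[i + di][j + dj], (i + di, j + dj))
                      for di in (0, 1) for dj in (0, 1)]
            value, pos = max(window, key=lambda item: item[0])
            prow.append(value)
            irow.append(pos)
        pooled.append(prow)
        indices.append(irow)
    return pooled, indices


def max_unpool_2x2(pooled, indices, h, w):
    """Place each pooled value back at its recorded position; all other cells stay zero."""
    out = [[0.0] * w for _ in range(h)]
    for prow, irow in zip(pooled, indices):
        for value, (r, c) in zip(prow, irow):
            out[r][c] = value
    return out


x = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, indices = max_pool_2x2_with_indices(x)  # encoder side
restored = max_unpool_2x2(pooled, indices, 4, 4)  # decoder side
```

The unpooling step has no trainable parameters: it only reuses the stored indices, producing a sparse map that subsequent decoder convolutions densify.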
An FCN classifies objects at the pixel level, but it does not take context information into consideration. Thus, similarity between pixels may cause recognition confusion. In 2016, Zhao et al. [40] proposed the pyramid scene parsing network (PSPNet) to exploit the capability of global context information by different-region-based context aggregation. To reduce context information loss between different subregions, they used a hierarchical global prior, containing information with different scales. They called it the pyramid pooling module for global scene prior construction and put it upon the final layer feature map of the network. Recently, a novel pyramid self-attention module to overcome dilution problems in high-level semantic information was proposed in [60]. At the same time, a channel-wise attention module was also employed to reduce the redundant features of the FPN [25].
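The idea of aggregating context over subregions of different sizes can be sketched with average pooling at several grid resolutions. This is a minimal pure-Python illustration, not the PSPNet implementation: the function name `adaptive_avg_pool`, the bin sizes (1 and 2), and the toy feature map are assumptions, and the upsampling and concatenation that follow pooling in a full module are omitted.

```python
def adaptive_avg_pool(feature, bins):
    """Average-pool a 2D map into a bins x bins grid of subregion means."""
    h, w = len(feature), len(feature[0])
    pooled = []
    for i in range(bins):
        ys = (i * h) // bins
        ye = ((i + 1) * h + bins - 1) // bins
        row = []
        for j in range(bins):
            xs = (j * w) // bins
            xe = ((j + 1) * w + bins - 1) // bins
            cells = [feature[y][x] for y in range(ys, ye) for x in range(xs, xe)]
            row.append(sum(cells) / len(cells))
        pooled.append(row)
    return pooled


feature = [
    [1, 1, 3, 3],
    [1, 1, 3, 3],
    [5, 5, 7, 7],
    [5, 5, 7, 7],
]
# One level per bin size: 1 gives the global prior, 2 gives four subregion priors.
pyramid = {bins: adaptive_avg_pool(feature, bins) for bins in (1, 2)}
```

The coarsest level summarizes the whole scene in a single value, while finer levels keep region-specific context, which together form the hierarchical global prior described above.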
In this subsection, we introduced the development process of encoder-decoder models, starting from the FCN. SegNet, PSPNet, ENet, and U-Net followed the idea. By combining the PSP module, FPN, and attention module, the segmentation accuracy was continuously improved. Although these methods can achieve end-to-end pixel-level output, they are relatively slow and not ideal for segmenting small objects. They are more suitable for scenes with sparse populations such as suburban villages because they are based on local area information.

Modified Convolution Structure Models
In the basic CNN structure, the convolution layers are used to extract image features, and the pooling layers are used to gather image background information. However, the pooling layers cause problems, such as reducing the image resolution and losing local information. This leaves an open question of whether severe intermediate downsampling is truly necessary. Therefore, much research has been conducted to solve the above issues by modifying the convolution structure. A convolutional network module is needed that aggregates multiscale context information without losing resolution or analyzing resized images.
In 2014, Chen et al. [48] analyzed two problems of semantic segmentation models: (1) the information loss caused by reducing the image size through pooled downsampling, and (2) the spatial invariance of the CNN. They then developed the DeepLab model. They skipped subsampling after the last two max pooling layers in the FCN and modified the convolutional filters by inserting zeros between the filter taps to increase their length. As shown in Figure 4, the modified convolution with zeros increases the receptive field without increasing the number of nonzero kernel weights, so the computation remains the same. The DeepLab model overcomes the poor localization properties of deep networks by combining dilated convolution with a fully connected conditional random field (CRF) [61].
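The effect of inserting zeros into a filter can be checked in one dimension. The following is a minimal pure-Python sketch; the function names, the three-tap kernel, and the dilation rate of 2 are illustrative assumptions.

```python
def dilated_conv1d(signal, kernel, rate):
    """Valid 1D convolution (correlation form) whose taps are spaced `rate` samples apart."""
    span = (len(kernel) - 1) * rate + 1  # receptive field of one output value
    return [
        sum(k * signal[i + j * rate] for j, k in enumerate(kernel))
        for i in range(len(signal) - span + 1)
    ]


def dilate_kernel(kernel, rate):
    """Insert rate - 1 zeros between neighbouring taps of the kernel."""
    out = []
    for idx, k in enumerate(kernel):
        out.append(k)
        if idx < len(kernel) - 1:
            out.extend([0] * (rate - 1))
    return out


signal = list(range(10))
kernel = [1, 1, 1]

dense = dilated_conv1d(signal, kernel, rate=1)    # receptive field of 3 samples
dilated = dilated_conv1d(signal, kernel, rate=2)  # receptive field of 5 samples
zero_filled = dilated_conv1d(signal, dilate_kernel(kernel, 2), rate=1)
```

The dilated result equals an ordinary convolution with the zero-filled kernel, but it only multiplies by the three nonzero taps, so the receptive field grows from 3 to 5 samples at the same cost per output.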
In 2017, Yu et al. proposed the Dilated Residual Network (DRN) [53] by combining dilated convolution with ResNet [10] and studying the gridding artifacts introduced by dilation. The gridding problem was solved by removing the max pooling layer, adding network layers, and removing residual connections. Tests on the Cityscapes dataset [62] showed good performance. In 2017, Chen et al. proposed DeepLab-V2 [50], based on DeepLab-V1 [48]. The most significant improvement was the combination of the dilated convolution structure and the pyramid network structure [21,40]. They proposed atrous spatial pyramid pooling (ASPP) to segment objects at multiple scales. Atrous convolution is another name for dilated convolution. To further handle the problem of segmenting objects at multiple scales, Chen et al. [51] proposed DeepLab-V3, which employed atrous convolution in cascade or in parallel to capture the multiscale context by adopting multiple atrous rates. The following year, Chen et al. pointed out that ASPP [50] could extract denser features, but that atrous convolution causes the boundary information of the segmented object to be seriously lost. A decoder structure, in contrast, can gradually recover spatial information to capture clear object boundaries. Accordingly, they proposed an encoder-decoder model with ASPP known as DeepLab-V3+ [41]. The network is composed of several representative structures, which makes it a leader in the field of semantic segmentation. The framework structure of DeepLab-V3+ is shown in Figure 5, and its visual results on Cityscapes [62] are shown in Figure 6.

Road Segmentation
As an essential category in traffic scenes, the road is the area where a car can travel. Road recognition is of great significance for autonomous driving, and many researchers focus particularly on road segmentation. Up-Conv-Poly [45] proposes an up-convolutional network for road segmentation by combining an FCN with U-Net [38]. The significant improvement was that it increased the width of the up-convolutional side of the network to improve the model accuracy. LidCamNet [46] fused lidar point clouds and camera images for road detection. The sparse point clouds are first projected onto the image plane and then upsampled to obtain a set of dense 2D images encoding spatial information. Then, the FCN is trained to carry out road detection using the fused data. The authors designed three fusion strategies: early, late, and cross fusion. DEEP-DIG [47] used ResNet [10] with a fully convolutional architecture and multiple upscaling steps for image interpolation. On the basis of the FCN, it uses data augmentation, including geometric transformations (such as affine and perspective transformation), image clipping, deformation, noise, and pixel changes, together with a number of training variants, to obtain better road segmentation results. Figure 7 shows the road segmentation results of Up-Conv-Poly, LidCamNet, and DEEP-DIG on the Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) dataset [63].

Instance Segmentation
Full-scene semantic segmentation only segments the categories within an image without consideration of object instances. Instance segmentation is designed to further segment individual object instances within a category area. From this point of view, instance segmentation is somewhat similar to object detection. Obviously, instance segmentation is useful for determining the motion state of individual obstacles, mainly referring to pedestrians and cars. The methods for instance segmentation can be divided into two classes: region proposal-based methods and masking-based methods. Table 4 gives a summary of the typical work in terms of their characteristics, advantages and disadvantages, and basic frameworks.
Table 4. Summary of instance segmentation methods.
Region proposal-based methods (including Mask R-CNN [69] and PANet [70]). Basic frameworks: common detection methods such as R-CNN [20], SSD [27], R-FCN [24], FPN [25], and so on. Characteristics: pixel classification in identified regions. Advantages: (1) high positioning accuracy; (2) simultaneous detection and segmentation. Disadvantages: (1) lack of consideration of global scene information; (2) poor segmentation of occluded and small objects.
Masking-based methods. Characteristics: (2) a potential mask is generated for each small block, and the final segmentation instance is optimized from multiple potential masks. Advantages: the refining module optimizes the rough segmentation masks and can process the hidden information in various sizes and background pictures. Disadvantages: low positioning accuracy.

Region Proposal-Based Method
In 2014, Hariharan et al. [64] first proposed a network that was capable of detecting object instances and marking them at the pixel level. They called it simultaneous detection and segmentation (SDS). Unlike classical bounding box detection, SDS requires pixel-level segmentation for individual instances. The technical process is shown in Figure 8. The following year, they defined the hypercolumn [65] at a pixel as the vector of activations of all CNN units above that pixel. The main idea is to use the hypercolumns as pixel descriptors, which combine the low-level features with the high-level features to improve the refinement of details. SDS [64] and hypercolumns [65] are time-consuming because they must process a large number of region proposals. Moreover, they do not make the best use of the learned deep features and large-scale training data. In 2015, Dai et al. presented multitask network cascades (MNC) [66] for instance-aware semantic segmentation. Their model consisted of three networks for differentiating instances, estimating masks, and categorizing objects, respectively. These networks form a cascaded structure and are designed to share their convolutional features. The Instance-Sensitive Fully Convolutional Network (ISFCN) [67] further enhanced the method and drew on the position-sensitive score map in the R-FCN [24] to improve the local pixel segmentation. Such a cascading framework is also used in Fully Convolutional Instance-Aware Semantic Segmentation (FCIS) [68].
In 2017, He et al. [69] extended Faster R-CNN to Mask R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition. ROIs are generated in the first stage, and these ROIs are then classified and segmented in the second stage. An illustration of the Mask R-CNN architecture is shown in Figure 9.
In 2018, Liu et al. [70] proposed the path aggregation network (PANet), aimed at boosting information flow in a proposal-based instance segmentation framework. Specifically, they enhanced the entire feature hierarchy with accurate localization signals in the lower layers by bottom-up path augmentation, which shortened the information path between the lower layers and the topmost feature. As shown in Figure 10, three main improvements were made: (1) improving the FPN by using bottom-up path augmentation; (2) improving the pooling strategy by using adaptive feature pooling; and (3) improving the mask branch by using fully connected fusion. The visualization results of PANet are shown in Figure 11.
Figure 9. Mask R-CNN framework: (1) feature extraction using ResNet [10] and an FPN [25]; (2) ROI generation by a region proposal network (RPN) on the last layer of the convolutional feature map [23]; (3) an RoI Align layer that makes each proposal window generate a fixed-size feature map; and (4) generation of three outputs, the first being the softmax classification, the second being the coordinate regression of each class, and the third being the ROI segmentation mask.

Masking-Based Method
A masking-based method controls the region or process of image processing by occluding the selected image (entirely or partially). The specific image or object used for coverage is called a mask. The most important difference between this method and the region proposal-based method is that it does not need to detect the object with a bounding box, since producing rectangular boxes as in SDS [64] is very time-consuming. Therefore, Dai et al. proposed to exploit the shape information via convolutional feature masking (CFM) [71]. The proposal segments (e.g., superpixels) are treated as masks on the convolutional feature maps. The CNN features of the segments are directly masked out from these maps via SPPNet [21]. A comparison of the technical routes of CFM and SDS is shown in Figure 12. In 2015, Facebook AI Research put forward an instance segmentation method (DeepMask) [72] on the basis of the image masking method, which divides the image into blocks, determines whether a block contains an object, and accordingly segments the object masks. In the same year, SharpMask [73] was proposed, based on DeepMask, to refine the mask edges further. In 2016, MultipathNet [74] exploited Fast R-CNN to accurately locate objects and combined it with the characteristics of DeepMask and SharpMask to further improve the masking accuracy. The relationship between the three can be summarized as follows: (1) DeepMask generates the initial object masks; (2) SharpMask refines these masks; and (3) MultipathNet identifies the objects framed by each mask. In 2019, Lee et al. proposed a simple yet efficient form of anchor-free instance segmentation, called CenterMask, that added a novel spatial attention-guided mask (SAG-Mask) branch to an anchor-free one-stage object detector (FCOS) [31] in the same vein as Mask R-CNN [69].
In this section, we introduced region proposal-based methods and masking-based methods for instance segmentation. In general, the region proposal-based methods produced better accuracies than the masking-based methods. The challenges of instance segmentation still remain for small objects, as well as for efficient end-to-end models and their training schemes.

Lane Line Segmentation
In traffic scenes, lane lines cover continuous narrow-but-long distances and are pixel-sparse compared with other object categories. It is difficult to segment lane lines and other objects in an image at the same time. Most research treats lane line segmentation as a separate segmentation task. In the early years, there existed many traditional approaches to detect the lane lines using color features [78], edges [79], and other cues, combined with a Hough transform [80] or Kalman filtering [81]. In recent years, deep learning-based models have been developed for lane line segmentation. We divide these models into two categories: the two-step method and the end-to-end method.
The two-step method usually follows two steps: (1) lane line masks are generated by using deep learning-based semantic segmentation [39,41,82,83]; and (2) the generated lane line masks are fitted by using parametric fitting. For example, H-Net [4] is used to learn the projection matrix, and then the least squares method or a third-order spline curve is used to estimate the lane lines. The end-to-end method considers line fitting as a regression problem and uses a softmax layer at the end of the model to directly return the lane line parameters. Table 5 summarizes the typical work in terms of the dataset, characteristics, core technology and functions.

Table 5 (excerpt). VPGNet Dataset [84]: road marking detection guided by a vanishing point under adverse weather conditions; inducing grid-level annotation; vanishing point prediction task. SCNN [85], CULane [85]: (1) suitable for long continuous shape structures or large objects.

The two-step method accounts for a majority of the deep learning-based lane detection algorithms. They all follow a similar process, and the differences between them are given in Table 5. Here, we only describe two representative works: [4] and [11]. The most representative work of the two-step method was LaneNet+H-Net [4], proposed by Neven et al. in 2018. The technical pipeline is shown in Figure 13. They pointed out that traditional lane detection methods rely on highly specialized, hand-crafted features and are therefore computationally expensive. They pioneered treating the lane detection problem as an instance segmentation problem. They modified E-Net [39] to segment the lane lines and employed the discriminative loss function (DLF) [87] to cluster the lane line pixels into instances. The segmentation maps of the lane line instances are then output in parametric form. They also proposed H-Net to learn the perspective projection transformation.
Recently, a lane marking semantic segmentation method based on LiDAR and camera fusion, called FusionLane, was proposed by Yin et al. [93]. In order to precisely locate lane lines, semantic segmentation is conducted on the bird's-eye view map converted from LiDAR point clouds. FusionLane uses Deeplab-V3+ [41] to segment the image captured by the camera, and the segmentation result is merged with the point clouds as the input of the network. In addition, they used a recurrent structure, the long short-term memory (LSTM) network, to capture temporal variation.

End-to-End Method
The problem of the two-step method is that the parameters of the network are not optimized for the task of interest (estimating the lane curvature parameters), but for a proxy task (segmenting the lane markings), resulting in suboptimal performance. Fewer research results are available for end-to-end lane segmentation based on deep learning. For example, Van et al. [94] proposed a method to train a lane detector in an end-to-end manner, directly regressing the lane parameters. The architecture consisted of two components: a deep network that predicted a segmentation-like weight map for each lane line and a differentiable least squares fitting module that returned the parameters of the best fitting curve in the weighted least squares sense for each map. Because the least squares fitting step supports backpropagation, the model returns the lane line fitting parameters directly and can be trained from end to end. In addition, Qin et al. [96] proposed a form of fast, structure-aware deep lane detection. As shown in Figure 14, they treated the process of lane detection as a row-based selection problem using global features. This formulation can significantly improve the computational efficiency.
In this section, we introduced the two-step method and the end-to-end method for lane line recognition. Generally, the two-step method leverages semantic segmentation models for initial detection and then uses a fitting method to obtain complete lane lines. This type of method is relatively slow, but can detect curved lines. The end-to-end method is relatively fast, but it does not work well for curved lines.
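As an illustration of fitting "in the weighted least squares sense", the following minimal sketch solves the weighted normal equations for a polynomial lane model. It is not the implementation of [94]. It assumes that a lane is modeled as x = f(y) in image coordinates, that the weight of each pixel comes from a predicted weight map, and that the input is non-degenerate (enough distinct points with non-zero weight). The helper names `solve_linear` and `weighted_polyfit` are hypothetical.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def weighted_polyfit(ys, xs, weights, degree=2):
    """Return beta minimizing sum_i w_i * (x_i - sum_k beta[k] * y_i**k)**2."""
    n = degree + 1
    # Normal equations: (X^T W X) beta = X^T W x, with X[i][k] = y_i**k.
    lhs = [[sum(w * y ** (p + q) for y, w in zip(ys, weights))
            for q in range(n)] for p in range(n)]
    rhs = [sum(w * x * y ** p for y, x, w in zip(ys, xs, weights))
           for p in range(n)]
    return solve_linear(lhs, rhs)


# Points on x = 1 + 2y + 0.5y^2 plus one outlier that receives zero weight.
ys = [0, 1, 2, 3, 4, 2]
xs = [1.0, 3.5, 7.0, 11.5, 17.0, 100.0]
weights = [1, 1, 1, 1, 1, 0]
beta = weighted_polyfit(ys, xs, weights)
```

With all weights set to one, the same routine reduces to the ordinary least squares fit used in the second step of the two-step method. Since the solution is a closed-form function of the weights, gradients can in principle be propagated through it, which is the property that end-to-end training relies on.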

Datasets and Evaluation Criteria
In this section, we summarize the authoritative image datasets used in the autonomous driving research community. We also introduce the primary evaluation criteria for object detection, semantic segmentation, and lane detection. Table 6 makes a comparison of these datasets in terms of image size, scene weather, and annotation. The annotation gives information about 3D and 2D characteristics, whether the video is supported, and lane line annotation. The datasets are as follows:

Datasets
1. CamVid [97], or the Cambridge-driving Labeled Video Database, is the first video collection with semantic labels. There are 32 semantic categories with a total of 710 images. Most of the videos were taken with a fixed-position camera, which partly solved the need for experimental data. However, compared with the datasets released in recent years, there are gaps in the number of labels and the completeness of the labeling;
2. The Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI) [63] dataset was jointly created by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago. It is one of the most widely used datasets in the field of autonomous driving. It covers object detection, semantic segmentation, and object tracking, among other things. It consists of 389 pairs of stereo images and optical flow maps, a 39.2 km visual odometry sequence, 400 pixel-level segmented maps, and 15,000 traffic scene pictures labeled with bounding boxes;
3. The Cityscapes dataset [62] comprises a large, diverse set of stereo video sequences recorded in the streets of 50 different cities. It defines 30 visual classes for annotation, which are grouped into eight categories. Of these images, 5000 have high-quality, pixel-level annotations, and 20,000 additional images have coarse annotations. At present, many researchers use Cityscapes to evaluate algorithm performance in the field of autonomous driving;
4. The Mapillary Vistas dataset [98], or Mapillary, is a large-scale, street-level image dataset released in 2017. It has a total of 25,000 high-resolution color images divided into 66 categories, 37 of which carry instance-specific labels. Objects are annotated densely and finely using polygons. It also contains images from all over the world captured under various conditions, including different weather, seasons, and times of day;
5. BDD100K [99] is a large-scale, self-driving dataset with the most diverse content, released by UC Berkeley in 2019. The dataset includes a total of 100,000 videos in complex scenes, such as different weather and times, each about 40 s in length. It is divided into 10 categories with about 1.84 million calibration frames. There are a total of 100,000 pictures of high-definition and blurred real driving scenes with different weather, scenes, and times, including 70,000 for training, 20,000 for testing, and 10,000 for validation.

Evaluation Criteria for Object Detection and Semantic Segmentation
Standard evaluation criteria need to be used to measure the performance of an algorithm on a dataset. Currently, three main aspects are evaluated: runtime, memory consumption, and accuracy. First, the running time of the algorithm is a key indicator that determines whether the algorithm has real-time performance, which mainly depends on the rationality of the algorithm structure and the computing capacity of the running hardware. Second, the memory consumption is also a reference value under the same running time and hardware. Third, the accuracy criteria mainly include the average recall (AR) [102], average precision (AP) [102], mean average precision (mAP) [102], pixel accuracy (PA) [36], mean accuracy (MA) [36], mean intersection over union (mIoU) [36], and frequency-weighted intersection over union (FWIoU) [36]. Among them, the most commonly used evaluation criteria are the PA, mPA, MA, and mIoU. The specific definitions and calculation formulas are shown in Equations (1)-(4).
The PA is the ratio of the correctly labeled pixels to the total number of pixels, and the calculation formula can be written as

$$\mathrm{PA} = \frac{\sum_{i=1}^{k} P_{ii}}{\sum_{i=1}^{k} T_i}$$

The MA represents the average value of the pixel accuracy of all target categories, and the calculation formula can be written as

$$\mathrm{MA} = \frac{1}{k} \sum_{i=1}^{k} \frac{P_{ii}}{T_i}$$

The mIoU represents the average degree of overlap between the predicted area and the actual area, and the calculation formula can be written as

$$\mathrm{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{P_{ii}}{T_i + \sum_{j=1}^{k} P_{ji} - P_{ii}}$$

where $k$ is the number of pixel categories, $T_i$ is the total number of pixels of the $i$-th class, $P_{ii}$ is the total number of pixels with actual type $i$ and prediction type $i$, and $P_{ji}$ is the total number of pixels with actual type $j$ and prediction type $i$.
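The three criteria can be computed from a confusion matrix, as the following minimal sketch shows. It assumes that `conf[i][j]` counts the pixels whose actual class is i and whose predicted class is j, and that classes that never occur are skipped when averaging. Both conventions, like the function name `segmentation_metrics`, are choices made for this illustration.

```python
def segmentation_metrics(conf):
    """Return (PA, MA, mIoU) for a k x k confusion matrix.

    conf[i][j] = number of pixels with actual class i predicted as class j.
    """
    k = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(k))
    pixel_accuracy = correct / total

    per_class_acc, per_class_iou = [], []
    for i in range(k):
        actual_i = sum(conf[i])                          # pixels of class i
        predicted_i = sum(conf[j][i] for j in range(k))  # pixels predicted as i
        if actual_i > 0:
            per_class_acc.append(conf[i][i] / actual_i)
        union = actual_i + predicted_i - conf[i][i]
        if union > 0:
            per_class_iou.append(conf[i][i] / union)

    mean_accuracy = sum(per_class_acc) / len(per_class_acc)
    mean_iou = sum(per_class_iou) / len(per_class_iou)
    return pixel_accuracy, mean_accuracy, mean_iou


# Two classes, ten pixels in total.
pa, ma, miou = segmentation_metrics([[3, 1], [1, 5]])
```

In this toy example, 8 of 10 pixels are correct, so PA is 0.8, while the per-class IoU values are 3/5 and 5/7. The mIoU is lower than PA here because every misclassified pixel is penalized in the union of two classes.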
Recently, three evaluation indicators were proposed for panoptic segmentation [103]: recognition quality (RQ), segmentation quality (SQ), and panoptic quality (PQ). RQ represents the accuracy of object recognition for each instance in panoptic segmentation. SQ is simply the average IoU of the matched segments. PQ can be seen as the product of a segmentation quality (SQ) term and a recognition quality (RQ) term:

$$\mathrm{PQ} = \underbrace{\frac{\sum_{(p,q) \in TP} \mathrm{IoU}(p,q)}{|TP|}}_{\mathrm{SQ}} \times \underbrace{\frac{|TP|}{|TP| + \frac{1}{2}|FP| + \frac{1}{2}|FN|}}_{\mathrm{RQ}}$$

where IoU(p, q) is the intersection over union of a predicted segment p and a ground truth segment q, and TP, FP, and FN denote the sets of true positives (matched segment pairs), false positives (unmatched predicted segments), and false negatives (unmatched ground truth segments), respectively.

The pixel accuracy (PA) is an indispensable evaluation criterion for semantic segmentation, as it intuitively reflects the number of correctly predicted pixels. The mIoU is the most common criterion for segmentation and detection, and it efficiently reflects the correctly predicted area. Comparatively, PA presents a finer evaluation.
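The relation between PQ, SQ, and RQ can be made concrete with a short sketch. It assumes that the matching between predicted and ground truth segments has already been performed, so that the input is the list of IoU values of the matched pairs together with the numbers of unmatched predicted and unmatched ground truth segments. The function name `panoptic_quality` is hypothetical.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Return (PQ, SQ, RQ) from the IoUs of matched segment pairs.

    matched_ious: IoU of every true positive (matched) segment pair.
    num_fp: number of unmatched predicted segments.
    num_fn: number of unmatched ground truth segments.
    """
    num_tp = len(matched_ious)
    denominator = num_tp + 0.5 * num_fp + 0.5 * num_fn
    if num_tp == 0 or denominator == 0:
        return 0.0, 0.0, 0.0
    sq = sum(matched_ious) / num_tp  # average IoU of the matched segments
    rq = num_tp / denominator        # F1-style recognition term
    return sq * rq, sq, rq


# Two matched segments, one spurious prediction, and one missed segment.
pq, sq, rq = panoptic_quality([0.8, 0.6], num_fp=1, num_fn=1)
```

Here SQ is 0.7 and RQ is 2/3, which gives a PQ of roughly 0.47. The example shows that a method with well-aligned masks can still obtain a low PQ if it misses or hallucinates segments.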

Evaluation Criteria for Lane Detection
In the TuSimple lane dataset, the accuracy (Acc), false positive rate (FP), and false negative rate (FN) are used as the evaluation criteria. Besides that, the mean intersection over union (mIoU) is also used for evaluation. Figure 15 shows an example using the mIoU as the evaluation measure.
Figure 15. Evaluation based on the mean intersection over union (mIoU). Columns from left to right: ground truth, predicted result, and mIoU comparison. In the third column, the blue lines are the ground truth and the other colors are the predicted results of each line.
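The Acc, FP, and FN criteria named above are not defined in this section. The sketch below therefore follows the way the TuSimple benchmark is commonly described, which is an assumption here: accuracy is the number of correctly predicted lane points divided by the number of ground truth points, summed over all clips, FP is the fraction of predicted lanes that are wrong, and FN is the fraction of ground truth lanes that are missed. The function and parameter names are hypothetical.

```python
def lane_accuracy(correct_points_per_clip, gt_points_per_clip):
    """Acc: correctly predicted lane points over ground truth points, over all clips."""
    return sum(correct_points_per_clip) / sum(gt_points_per_clip)


def false_positive_rate(num_wrong_pred_lanes, num_pred_lanes):
    """FP: share of predicted lanes that match no ground truth lane."""
    return num_wrong_pred_lanes / num_pred_lanes


def false_negative_rate(num_missed_gt_lanes, num_gt_lanes):
    """FN: share of ground truth lanes that no prediction matches."""
    return num_missed_gt_lanes / num_gt_lanes


# Two clips with 50 and 56 annotated lane points, respectively.
acc = lane_accuracy([45, 50], [50, 56])
fp = false_positive_rate(num_wrong_pred_lanes=1, num_pred_lanes=5)
fn = false_negative_rate(num_missed_gt_lanes=1, num_gt_lanes=4)
```

Whether a predicted point or lane counts as correct depends on a distance threshold and a matching rule, which are deliberately left outside this sketch.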

Performance Comparison
It is difficult to compare object detection algorithms and lane line segmentation algorithms due to the lack of uniform datasets. This section therefore only compares the performance of some of the latest semantic segmentation work, including full-scene semantic segmentation and instance segmentation. Table 7 compares representative full-scene semantic segmentation algorithms in terms of the core technology, dataset, mIoU, inference time, and frames per second (fps).

Comparison of the Full-Scene Segmentation Algorithms
It should be noted that the algorithms in Table 7 were all tested on Cityscapes [62]. All results are from the original papers. It can be seen that Deeplab-V3+ gave the highest mIoU at 82.1%. Owing to the integration of Xception, ASPP, dilated convolution, and the encoder-decoder structure, Deeplab-V3+ can aggregate the context information of pixels, perform better inference, and better refine edge pixels. Therefore, Deeplab-V3+ reached the best level. Algorithms such as HDC, DANet, FastFCN, and Deeplab-V3 achieved an mIoU of more than 80%. These algorithms are better than the others at processing global image information and at segmenting multiscale objects. PSPNet, Deeplab-V1, DNR, and Deeplab-V2 are relatively fast, with a running speed of less than 1 fps. In general, the algorithms that achieve better results in both speed and accuracy are those of the PSPNet, FastFCN, and DeepLab series.
Table 8 compares representative instance segmentation algorithms in terms of the year, core technology, datasets, evaluation criteria, and accuracy. All results are from the original papers. The evaluation criteria used in different papers are not uniform, and thus it is difficult to compare them with a unified measure. However, in terms of the actual effect, PANet and CenterMask are superior to similar methods. Among them, adaptive feature (AF) pooling in PANet helps to aggregate context information and can improve the segmentation in cases of object occlusion. SAG-Mask in CenterMask can increase the network's attention to important features and therefore improve segmentation accuracy.

Concluding Remarks
This paper gives a comprehensive survey of the deep learning-based approaches for scene understanding in autonomous driving. The paper focused on two tasks of scene understanding: object detection and image segmentation. We first briefly introduced the object classification models that form the basis of detection and segmentation. Then, we sorted the object detection work into two categories, the two-stage method and the one-stage method, and accordingly reviewed the representative work. According to the particularity of the traffic scene, the image segmentation problem in autonomous driving was divided into full-scene semantic segmentation, instance segmentation, and lane line segmentation. We summarized and compared the up-to-date representative methods used in the three segmentation tasks from four aspects: typical work, characteristics, advantages and disadvantages, and basic frameworks. We also summarized the benchmark datasets and evaluation criteria used in the research community and made a performance comparison of some of the latest works.
Although the research community has made significant progress, there is still a long way to go before a vehicle can recognize the environment like a human. Based on the review above, we believe the following aspects are the challenges in this field.

3D Segmentation
The majority of the existing image segmentation work focuses on the two-dimensional segmentation of objects. It would be ideal if we could segment objects in three dimensions. Point clouds generated from LiDAR have obvious advantages for 3D segmentation, compared with image data generated from cameras. Some work has been initiated for this purpose by using LiDAR point clouds or by fusing LiDAR with cameras. However, since point clouds are unstructured data with an irregular format, the challenge is how to model the data by using CNN technology. The main work streams are (1) converting 3D point clouds into 2D images by using view transformation or coordinate transformation, such as in [106,107]; (2) voxelizing irregular 3D point clouds into regular 3D tensors and using a 3D-CNN [108] to process them, such as in [109-111]; and (3) direct modeling of the point clouds by using PointNet [112,113] or graph CNNs [114,115].
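The second work stream, voxelization, can be illustrated with a minimal sketch that maps an irregular list of points to a regular binary occupancy grid. It assumes points given as (x, y, z) tuples in meters, a cubic voxel, and a grid anchored at a chosen origin. The function name `voxelize` and all parameter values are hypothetical.

```python
def voxelize(points, voxel_size, origin, grid_shape):
    """Return a dense occupancy grid, grid[ix][iy][iz] in {0, 1}.

    points: iterable of (x, y, z) coordinates.
    origin: (x, y, z) of the grid corner with the smallest coordinates.
    grid_shape: number of voxels along x, y, and z.
    """
    nx, ny, nz = grid_shape
    grid = [[[0] * nz for _ in range(ny)] for _ in range(nx)]
    for x, y, z in points:
        ix = int((x - origin[0]) // voxel_size)
        iy = int((y - origin[1]) // voxel_size)
        iz = int((z - origin[2]) // voxel_size)
        # Points that fall outside the grid are dropped.
        if 0 <= ix < nx and 0 <= iy < ny and 0 <= iz < nz:
            grid[ix][iy][iz] = 1
    return grid


cloud = [(0.1, 0.1, 0.1), (0.9, 0.1, 0.1), (5.0, 5.0, 5.0)]
grid = voxelize(cloud, voxel_size=0.5, origin=(0.0, 0.0, 0.0), grid_shape=(2, 2, 2))
occupied = sum(v for plane in grid for row in plane for v in row)
```

The resulting grid has a fixed shape regardless of the number of points, which is what allows a 3D convolution to be applied. The price is quantization: all points inside one voxel collapse into a single value, and most voxels of a sparse LiDAR scan stay empty.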

Panoptic Segmentation
The majority of the existing vision-based works are designed for the detection or segmentation of particular objects such as roads, pedestrians, vehicles, and so on. It would be ideal if we could segment the scene with complete details, including various object categories such as roads, obstacles, the sky, plants, buildings, and traffic signs. That is called panoptic segmentation, as proposed by Kirillov et al. [103]. In this work, they developed a CNN-based approach to simultaneously segment free space such as the sky, roads, and grass, and obstacles such as pedestrians and cars within a scene. Some algorithms such as DeeperLab [116], JSIS-Net [117], and Panoptic FPN [118] have also emerged for a similar purpose. These algorithms usually use semantic segmentation for free space and instance segmentation for obstacles, and then fuse them together.
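The fusion step mentioned above can be sketched as follows. This is a simplified heuristic written for illustration, not the procedure of any of the cited algorithms. It assumes a semantic map of class ids for free space, a list of instance predictions given as (class id, score, binary mask), and the rule that instances are pasted in order of decreasing score so that a higher-scoring instance keeps contested pixels. The function name `fuse_panoptic` is hypothetical.

```python
def fuse_panoptic(semantic, instances):
    """Merge a semantic map with instance masks into one panoptic map.

    semantic: 2D list of class ids (used for free space regions).
    instances: list of (class_id, score, mask), mask being a 2D list of 0/1.
    Returns a 2D list of (class_id, instance_id); instance_id 0 means free space.
    """
    height, width = len(semantic), len(semantic[0])
    panoptic = [[(semantic[r][c], 0) for c in range(width)] for r in range(height)]
    taken = [[False] * width for _ in range(height)]

    ordered = sorted(instances, key=lambda inst: inst[1], reverse=True)
    for instance_id, (class_id, _score, mask) in enumerate(ordered, start=1):
        for r in range(height):
            for c in range(width):
                # A pixel is assigned to the first (highest scoring) instance only.
                if mask[r][c] and not taken[r][c]:
                    panoptic[r][c] = (class_id, instance_id)
                    taken[r][c] = True
    return panoptic


semantic = [[0, 0, 1],
            [0, 1, 1]]
instances = [
    (3, 0.5, [[0, 1, 1], [0, 0, 0]]),  # lower score, overlaps at pixel (0, 1)
    (2, 0.9, [[1, 1, 0], [0, 0, 0]]),  # higher score, keeps pixel (0, 1)
]
panoptic = fuse_panoptic(semantic, instances)
```

Every pixel ends up with exactly one class id and one instance id, which is the defining property of a panoptic output. Pixels not covered by any instance keep their semantic label.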

Multitasking Joint Model
The majority of the existing deep learning-based algorithms are designed for a single task. Multiple tasks such as detection and segmentation for various objects require multiple CNN models. This is not suitable for an in-vehicle system that is sensitive to cost and space. Thus, whether we can use a single CNN model to fulfill multiple tasks becomes a challenging issue. Some work has been conducted in this area. For example, MultiNet [119] is a multitask model with the encoder-decoder structure, which is capable of detecting obstacles and segmenting roads at the same time. Detection, segmentation, and depth estimation are combined in a single model to identify objects and estimate the depth of the objects in [120].

Tracking and Behavior Analysis
This paper largely focused on two tasks, object detection and scene segmentation, which provide a static understanding of the scene. In fact, tasks like object tracking, behavior analysis, and anomaly detection are also important and even more challenging, since they involve the continuous monitoring and dynamic analysis of single or multiple targets. A recurrent NN with long short-term memory modules for long-term intent prediction of pedestrians was employed in [121]. A method of behavior estimation based on contextual traffic information to recognize and predict lane change intention was proposed in [122]. A visual analytical framework that exploits large amounts of multidimensional road traffic data for anomaly detection was presented in [123].