Real-Time Semantic Image Segmentation with Deep Learning for Autonomous Driving: A Survey

Semantic image segmentation for autonomous driving is a challenging task due to its requirement for both effectiveness and efficiency. Recent developments in deep learning have demonstrated important performance improvements in terms of accuracy. In this paper, we present a comprehensive overview of the state-of-the-art semantic image segmentation methods using deep-learning techniques that aim to operate in real time, so that they can efficiently support an autonomous driving scenario. To this end, the presented overview puts particular emphasis on approaches that permit inference time reduction, while an analysis of the existing methods is provided by taking into account their end-to-end functionality, along with a comparative study that relies upon a consistent evaluation framework. Finally, a fruitful discussion is presented that provides key insights into current trends and future research directions in real-time semantic image segmentation with deep learning for autonomous driving.


Introduction
Semantic segmentation is the task of assigning each pixel of an image to a corresponding class label from a predefined set of categories [1]. Although it can be considered a pixel-level classification problem, it is a much more complex procedure than standard image classification, which targets predicting a single label for the entire image.
The enormous success of deep learning has made a huge impact in semantic segmentation methods, improving their performance in terms of accuracy. This promising progress has attracted the interest of many technological and research fields that require high-end computer vision capacities. Such an application is autonomous driving, in which self-driving vehicles must understand their surrounding environment, i.e., other cars, pedestrians, road lanes, traffic signs or traffic lights. Semantic segmentation based on deep learning is a key choice for accomplishing this goal, due to the phenomenal accuracy of deep neural networks in detection and multi-class recognition tasks.
Nevertheless, in applications that require low-latency operation, such as autonomous driving, the computational cost of these methods is still quite limiting, because of the crucial need to make decisions within tight time intervals. It is, therefore, necessary to improve the design of segmentation models towards achieving efficient architectures that will be able to perform in real time with the appropriate precision. To this end, in this paper, we review the best semantic segmentation architectures in terms of speed and accuracy. For the most complete evaluation of the examined architectures, all the models are compared based on their performance in a consistent evaluation framework.
The contribution of this paper is threefold: (1) it presents in a consistent way an exhaustive list of the most efficient methods for designing real-time semantic segmentation models that aim at both high accuracy and low latency; (2) it provides a comparative study of the state-of-the-art real-time semantic segmentation models based on their accuracy and inference speed; (3) it presents a fruitful discussion of current issues and improvements that results in key insights into the field of real-time semantic image segmentation.
This paper is organized as follows: Section 1 introduces our survey of real-time semantic image segmentation with deep learning for autonomous driving, while Section 2 presents the basic approaches that reduce the inference time of real-time models. In Section 3, we present an exhaustive list of the state-of-the-art models. In Section 4, the most popular datasets and metrics used in evaluation are summarized. Section 5 presents a discussion including our findings and key insights. In Section 6, promising future research issues aiming to improve the performance of real-time semantic segmentation are presented, and conclusions are drawn in Section 7.

Approaches for Inference Time Reduction
Although semantic segmentation models based on deep learning have achieved great accuracy in recent years, the need for efficiency, i.e., lower inference time, remains vital, especially for applications such as autonomous driving. In the following, we present the existing approaches that can be used in deep neural network architecture design, aiming to achieve a reduced response time in semantic image segmentation models.

Convolution Factorization-Depthwise Separable Convolutions
It is known that for most deep-learning models the convolutional layers are vital structural components. Therefore, transforming the convolutions that are performed in the layers of the network into more computationally efficient operations is an excellent way to improve the model's performance in terms of speed. A popular design choice for improving convolutions is the use of depthwise separable convolutions, which are a type of factorized/decomposed convolutions [2]. Standard convolution performs channel-wise and spatial-wise computation in a single step. On the contrary, depthwise separable convolution breaks the computation into two steps. In the first step, a single convolutional filter per input channel is applied (depthwise convolution), while in the second step, a linear combination of the output of the depthwise convolution is computed by means of a pointwise convolution [3]. These two different procedures are shown in Figure 1. It is important to compare the computational burden of these two tasks. More specifically, for N filters of size D × D, the ratio of computational complexity between depthwise separable convolutions and standard convolutions equals Ratio = 1/N + 1/D². For example, given 100 filters of size 256 × 256, the ratio of complexity is 0.01, which means that a series of depthwise separable convolution layers executes 100 times fewer multiplications than a corresponding block of standard convolutional layers.
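The ratio above can be verified by directly counting multiplications for the two schemes; the following is a minimal sketch (the layer sizes are illustrative assumptions, not values from the cited works):

```python
# Multiplication counts for a standard convolution vs. a depthwise separable
# convolution over an H x W feature map with c_in input channels.

def standard_conv_mults(h, w, c_in, n_filters, d):
    # Each of the N filters spans all input channels with a D x D kernel.
    return h * w * c_in * n_filters * d * d

def depthwise_separable_mults(h, w, c_in, n_filters, d):
    depthwise = h * w * c_in * d * d      # one D x D filter per input channel
    pointwise = h * w * c_in * n_filters  # 1 x 1 combination across channels
    return depthwise + pointwise

h = w = 64; c_in = 32; n = 100; d = 3
ratio = depthwise_separable_mults(h, w, c_in, n, d) / standard_conv_mults(h, w, c_in, n, d)
# The counted ratio matches the closed form 1/N + 1/D^2.
assert abs(ratio - (1 / n + 1 / d ** 2)) < 1e-12
```

Note that the ratio depends only on N and D, not on the spatial size of the feature map, which is why the saving holds at every layer of the network.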
This approach was first used in Xception [4], where the designers replaced the Inception modules [5] with depthwise separable convolutions. Likewise, MobileNets [6] comprise depthwise separable convolutions to improve efficiency. Overall, the efficiency by design offered by this approach is a strong argument for its use in any new efficiency-oriented network implementation.

Channel Shuffling
Another way to reduce computational cost significantly while preserving accuracy is channel shuffling, introduced in ShuffleNet [7]. In standard group convolution, each output channel only draws on the input channels within its own group; with channel shuffling, a group convolution acquires data from different input groups, so that every input channel can correlate with every output channel, as shown in Figure 2. Specifically, for the feature map generated by the previous group layer, we can first divide the channels in each group into several subgroups and then feed each group in the next layer with different subgroups.
This can be efficiently implemented by a channel shuffle operation described as follows: suppose a convolutional layer with g groups whose output has g × n channels; we first reshape the output channel dimension into (g, n), then transpose and flatten it back as the input of the next layer. Channel shuffle is also differentiable, which means it can be embedded into network structures for end-to-end training. Overall, this approach has appealing characteristics, but one should be careful with the selection of the parameter g, which defines the number of groups; experimentation is required to find the g value that yields optimal results.
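The reshape-transpose-flatten operation described above can be sketched in pure Python on channel indices (a minimal illustration, independent of any deep-learning framework):

```python
# Channel shuffle: reshape g*n channels into (g, n), transpose, flatten back.

def channel_shuffle(channels, g):
    n = len(channels) // g
    # reshape the flat channel list into (g, n)
    grouped = [channels[i * n:(i + 1) * n] for i in range(g)]
    # transpose to (n, g) and flatten
    return [grouped[j][i] for i in range(n) for j in range(g)]

# 2 groups of 3 channels: after shuffling, every new group of consecutive
# channels mixes channels originating from both old groups.
print(channel_shuffle([0, 1, 2, 3, 4, 5], g=2))  # -> [0, 3, 1, 4, 2, 5]
```

In a real network the same reshape/transpose is applied to the channel axis of a 4D feature map tensor, which makes the operation essentially free in terms of computation.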

Early Downsampling
In the ENet [8] model architecture, it is argued that, since processing large input frames is very expensive, a good solution is to downsample these frames in the early stages of the network, resulting in the use of only a small set of feature maps.
The primary role of the initial network layers should be feature extraction and the preprocessing of the input data for the following parts of the architecture, rather than contributing to the classification stage [8].
To prevent the spatial information loss caused by downsampling, ENet relied on the SegNet [9] approach of saving the indices of the elements chosen in max-pooling layers and using them to produce sparse upsampled maps in the decoder, as shown in Figure 3. This approach reduces memory requirements while recovering spatial information. Overall, since early downsampling relies upon spatial downsampling, it is not recommended for applications where the initial image contains fine details that may disappear after the corresponding max-pooling operation.
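The index-saving scheme can be sketched as follows; this is a minimal 1D illustration of the idea, not ENet's or SegNet's actual implementation:

```python
# SegNet-style unpooling: max-pooling stores the position of each chosen
# element, and the decoder uses those positions to build a sparse upsampled
# map, recovering where each maximum originally was.

def max_pool_with_indices(x, size=2):
    pooled, indices = [], []
    for start in range(0, len(x), size):
        window = x[start:start + size]
        best = max(range(len(window)), key=lambda i: window[i])
        pooled.append(window[best])
        indices.append(start + best)
    return pooled, indices

def sparse_upsample(pooled, indices, length):
    out = [0.0] * length          # every position not chosen stays zero
    for value, idx in zip(pooled, indices):
        out[idx] = value
    return out

x = [1.0, 3.0, 2.0, 0.5]
pooled, idx = max_pool_with_indices(x)      # -> [3.0, 2.0], indices [1, 2]
print(sparse_upsample(pooled, idx, len(x))) # -> [0.0, 3.0, 2.0, 0.0]
```

Storing only the indices (rather than full-resolution feature maps, as skip connections would require) is what keeps the memory footprint low.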

The Use of Small Size Decoders
The encoder-decoder network is one of the most standard architectures for semantic segmentation. As a result, it is crucial to optimize its performance, as it influences a plethora of models. In [8], it is suggested that the architecture of an encoder-decoder model can be simplified by reducing the decoder's size in order to save computational cost. In particular, the authors introduced an architecture in which the encoder is larger than the decoder [8], departing from the symmetric encoder-decoder architecture [9]. This design is based on the idea that the encoder should process the input data at a reduced resolution, while the decoder's sole role is to upsample the encoder's output, refining its details. Thus, reducing the decoder's size results in computational cost savings. Overall, this approach is appealing since, in most cases, the reduction in the decoder's size does not affect effectiveness.

Efficient Reduction of the Feature Maps' Grid Size
In CNNs, the reduction of the feature maps' grid size is achieved by the application of pooling operations. A problem that occurs is that the pooling operations can lead to representational bottlenecks, which can be avoided by expanding the activation dimension of the network filters. However, this process increases the computational cost. To remedy this, Szegedy et al. [10] suggested a pooling operation with a convolution of stride 2 performed in parallel, followed by the concatenation of the resulting filter banks. This technique of reducing the grid size of the feature maps (shown in Figure 4) has been shown in [8,10] to achieve a significant improvement in inference time. Overall, several approaches exist for reducing the feature map size, which have been proven to achieve not only efficiency but also state-of-the-art effectiveness [11].
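The saving can be illustrated by counting multiplications under illustrative sizes (the values of H, W, C and the 3 × 3 kernel are assumptions for the sketch, not figures from [10]):

```python
# Cost sketch for halving the grid size while doubling channels
# (H x W x C -> H/2 x W/2 x 2C) with 3 x 3 convolutions.

def conv_mults(h_out, w_out, c_in, c_out, k=3):
    return h_out * w_out * c_in * c_out * k * k

h = w = 32; c = 64
# Naive order that avoids the bottleneck: expand channels at full
# resolution first, then pool.
expand_then_pool = conv_mults(h, w, c, 2 * c)
# Parallel trick of [10]: a stride-2 convolution producing C channels,
# concatenated with a (multiplication-free) pooled branch of C channels.
parallel = conv_mults(h // 2, w // 2, c, c)
print(expand_then_pool / parallel)  # -> 8.0
```

The parallel scheme thus reaches the same output shape (half resolution, doubled channels) at a fraction of the multiplication count, while the concatenation keeps the representation wide enough to avoid the bottleneck.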

Increasing Network Depth While Decreasing Kernel Size
Research in [12] suggested that the use of very small (3 × 3) convolutional filters improves on the standard configurations of CNNs. Smaller convolutional filters permit an increase in the depth of the network by adding more convolutional layers while, at the same time, reducing the number of parameters of the network. This technique reduces computational cost and, at the same time, increases the accuracy of the network [13].
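A quick parameter count illustrates the saving: two stacked 3 × 3 convolutions cover the same 5 × 5 receptive field as a single 5 × 5 convolution, yet use fewer parameters (C is an illustrative channel width, with C channels in and out and biases ignored):

```python
# Parameters of a k x k convolution with c input and c output channels.

def conv_params(c, k):
    return c * c * k * k

c = 64
one_5x5 = conv_params(c, 5)      # 25 * C^2 parameters
two_3x3 = 2 * conv_params(c, 3)  # 18 * C^2 parameters, same receptive field
print(two_3x3 / one_5x5)         # -> 0.72: 28% fewer parameters
```

On top of the parameter saving, the stacked version inserts an extra nonlinearity between the two layers, which is one source of the accuracy gain reported in [12].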

Two-Branch Networks
The trade-off between accuracy and inference time has been addressed using two-branch networks: one branch captures spatial details and generates a high-resolution feature representation, while the other obtains high-level semantic context. Two-branch networks manage to achieve a beneficial balance between speed and accuracy because one of the two pathways is a lightweight encoder of sufficient depth and the other is a shallow, yet wide, branch consisting of a few convolutions [14][15][16]. At the same time, unlike the encoder-decoder architecture, two-branch networks preserve partial information that is otherwise lost after downsampling operations [16]. A standard two-branch network [15] is shown in Figure 5. Overall, this approach relies upon a lightweight architecture that achieves efficiency, coupled with the possibility of learning both low- and high-level feature representations, which results in improved effectiveness.

Block-Based Processing with Convolutional Neural Networks
Another method to speed up inference is block-based processing, where, as in [17], an image is split into blocks and the resolution of each block is adjusted by downsampling the less important ones. This reduction of the processing resolution reduces both the computational burden and the memory consumption.

Pruning
Pruning is a method for producing models that are faster and have a lower memory cost, while largely preserving accuracy. Visual information is highly spatially redundant, and thus can be compressed into a more efficient representation. Pruning is divided into two categories: weight pruning and filter (channel) pruning. Both are shown in Figure 6.
In weight pruning, individual parameters (connections), and hence the weights, are removed, generating a sparse model that preserves the high-dimensional features of the original network. The work of [18] suggested weight pruning with a three-step method: first, the network is trained to learn which connections are important; second, unessential connections are pruned; lastly, the network is retrained to fine-tune the weights of the remaining connections. This way, network pruning can reduce the number of connections by 9x to 13x without reducing effectiveness [18,19].
Filter (or channel) pruning is also a very successful approach. By removing filters that have a negligible effect on the output accuracy, along with their corresponding feature maps, the computation costs are significantly reduced [20]. He et al. [21] proposed an efficient method based on channel pruning which improves inference time while preserving the accuracy of the network. Overall, this approach emphasizes efficiency at the cost of losing information used in the network at the spatial and functional level, which may result in reduced effectiveness.
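The pruning step of the magnitude-based weight-pruning recipe can be sketched as follows (a simplified illustration of the second step of [18]; the surrounding train and fine-tune steps, and the sparsity level, are omitted or assumed):

```python
# Magnitude-based weight pruning: zero out the given fraction of weights
# with the smallest absolute value, producing a sparse weight vector.

def prune_weights(weights, sparsity):
    # Indices of the smallest-magnitude weights, to be removed.
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    drop = set(order[:int(len(weights) * sparsity)])
    return [0.0 if i in drop else w for i, w in enumerate(weights)]

w = [0.9, -0.05, 0.4, 0.01, -0.7, 0.1]
print(prune_weights(w, 0.5))  # -> [0.9, 0.0, 0.4, 0.0, -0.7, 0.0]
```

In filter pruning the same idea is applied at a coarser granularity: entire filters (and their output feature maps) are ranked and removed, which, unlike weight pruning, yields a smaller dense network rather than a sparse one.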

Quantization
An additional method to improve the efficiency of semantic segmentation is quantization, which reduces the number of bits needed to represent the weights of the network. A standard way to represent each weight is to employ 32 bits. However, 32-bit operations are slow and have large memory requirements [22]. Han et al. [19] proposed a quantization approach that reduces the number of bits representing each connection from 32 to 5 and, at the same time, restricts the number of effective weights by sharing the same weights between multiple connections and then fine-tuning them.
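The idea of weight sharing can be sketched as follows; note that this uses simple uniform quantization as a stand-in for the k-means clustering of [19], and the example weights are illustrative:

```python
# Weight-sharing quantization sketch: map each 32-bit weight to one of
# 2^5 = 32 shared values, so only a 5-bit index per connection plus a
# small codebook needs to be stored.

def quantize(weights, bits=5):
    levels = 2 ** bits
    lo, hi = min(weights), max(weights)
    step = (hi - lo) / (levels - 1)
    codebook = [lo + i * step for i in range(levels)]   # the shared weights
    indices = [round((w - lo) / step) for w in weights] # 5-bit index per weight
    return indices, codebook

def dequantize(indices, codebook):
    return [codebook[i] for i in indices]

w = [0.03, -0.98, 0.51, 0.52, 1.0]
idx, book = quantize(w)
restored = dequantize(idx, book)
# Every restored weight lies within half a quantization step of the original.
assert all(abs(a - b) <= (max(w) - min(w)) / 31 / 2 + 1e-12
           for a, b in zip(w, restored))
```

In [19], the shared values are additionally fine-tuned after quantization, which recovers most of the small accuracy loss introduced by the rounding.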

State-of-the-Art Deep-Learning Models
In this Section, we present the current state-of-the-art models of real-time semantic segmentation, based on their performance in terms of accuracy and inference time on the "Cityscapes" and "CamVid" datasets. In our presentation, we group the models based on the approaches that have been used to achieve efficiency.
In the case of using a two-branch network, several models appear in the state of the art. SS [23] belongs to the multi-branch networks, as it uses two branches, one receiving the input image and the other receiving a half-resolution version of that image. Moreover, SS introduces spatial sparsity into this type of architecture, managing to reduce the computational cost by a factor of 25, alongside using in-column and cross-column connections and removing residual units. Its architecture is shown in Figure 7.
Another model that consists of two branches is ContextNet [24]. Taking a deeper look at Figure 8, the first branch achieves cost-efficient and accurate segmentation at low resolution, while the second combines a sub-network at high resolution to provide detailed segmentation results. ContextNet also uses depthwise separable convolutions for speeding up inference, as well as bottleneck residual blocks. Very important factors for its performance are network compression and the pyramid representation, which allow it to segment in real time with low memory cost. Apart from these approaches, ContextNet also uses pruning to decrease the number of parameters of the network.
Fast-SCNN [25] proposes a "learning to downsample" module for computing low-level features for multiple resolution branches in parallel, as shown in Figure 9. Fast-SCNN also features a global feature extractor for capturing the global context for semantic segmentation. Finally, Fast-SCNN uses depthwise separable convolutions and residual bottleneck blocks [26] to increase speed and reduce the number of parameters and the computational cost.
BiSeNet (Bilateral Segmentation Network) [14] introduces a feature fusion module used for the efficient combination of the features and an attention refinement module to filter the features of each stage. In this fashion, it improves precision while upsampling operations are avoided, and thus the computational cost is kept low.
BiSeNet V2 [15] is the evolution of BiSeNet, presenting a good trade-off between speed and accuracy. It features a Guided Aggregation Layer to fuse the features extracted from the Detail and the Semantic branch, and proposes a booster training strategy. Its creators use fast downsampling in the Semantic Branch to advance the level of the feature representation and increase the receptive field rapidly. The structures of both networks are presented in Figures 10 and 11.
FasterSeg [27] employs neural architecture search [28,29] and proposes a decoupled and fine-grained latency regularization to balance the trade-off between inference speed and accuracy. The scope of neural architecture search is the design of deep-learning architectures in an automatic fashion, thus reducing the involvement of the human factor [30]. Another contribution is the use of knowledge distillation [31] and, specifically, the introduction of a co-searching teacher-student framework that improves accuracy. Knowledge distillation aims to transfer the features learned from a large and complex network (teacher network) to a smaller and lighter network (student network). In the proposed co-searching teacher-student framework, the teacher and student networks share the same weights, working as a supernet, thus creating no extra burden in terms of memory usage and size. The general structure is shown in Figure 12.
ESNet [32] follows a symmetric encoder-decoder architecture, as shown in Figure 13. It involves a parallel factorized convolution unit module with multiple-branch parallel convolutions, multi-branch dilated convolutions and pointwise convolutions. The symmetry of ESNet's architecture reduces the network's complexity and, as a result, leads to a reduction of the inference time. The parallel factorized convolution unit module with multiple-branch parallel convolutions manages to learn parallel feature representations in a powerful manner, without increasing the computational complexity.
The structure of ShelfNet18 [33] is presented in Figure 14. More specifically, it is composed of multiple encoder-decoder branches, uses shared weights and a residual block. To decrease inference time, ShelfNet18 proposes channel reduction to efficiently reduce the computation burden. The use of different encoder-decoder branches improves the computational process, increasing the segmentation accuracy. Weights are shared between convolutional layers of the same residual block in order to decrease the number of network parameters without decreasing accuracy.
ICNet [34] proposes a framework for saving operations in multiple resolutions and features a cascade feature fusion unit. Its architecture is shown in Figure 15. It simultaneously processes semantic information from low-resolution branches, along with details from high-resolution images, in an efficient manner.
In the case of using factorized convolutions to reduce latency, efficiency is achieved by modifying the way convolutions work in the network architecture to boost speed [2,35]. In the literature, several models employ such an approach. First, ESPNet [36] introduces a convolutional module, called efficient spatial pyramid, which functions efficiently in terms of computational cost, memory usage and power. ESPNet depends on the principle of convolution factorization. More analytically, convolutions are resolved into two steps: (1) pointwise convolutions and (2) a spatial pyramid of dilated convolutions. ESPNet's network structure, shown in Figure 16, is fast, small, and capable of low-power and low-latency operation, while preserving semantic segmentation accuracy. ESPNetv2 [37], based on ESPNet, outperforms the latter by 4-5% and has 2-4× fewer FLOPs on the PASCAL VOC and "Cityscapes" datasets.
ESPNetv2 modifies the efficient spatial pyramid module and introduces an Extremely Efficient Spatial Pyramid unit, replacing pointwise convolutions with group pointwise convolutions, and the computationally expensive 3 × 3 dilated convolutions with depthwise dilated separable convolutions. Lastly, it fuses the feature maps employing a computationally efficient hierarchical feature fusion method. According to [37], pruning and quantization are complementary methods for ESPNetv2. A structural unit of its architecture is presented in Figure 17.
ESSGG [38] improves the runtime of ERFNet by a factor of over 5×, by replacing its modules with more efficient ones, such as the aforementioned depthwise separable convolutions, grouped convolutions and channel shuffling. In this manner, the inference time is reduced efficiently. Another contribution of this work is the training method, called gradual grouping, where dense convolutions are transformed into grouped convolutions, optimizing the function of gradient descent. Furthermore, it is important to note that ESSGG uses a small decoder size. Finally, pruning and quantization are proposed as secondary options used in the later stages of the network to improve efficiency. The detailed network architecture is shown in Table 1.
DABNet [39] introduces a Depthwise Asymmetric Bottleneck module, which extracts combined local and contextual information and reduces the number of parameters of the network. This module uses depthwise asymmetric convolution and dilated convolution. DABNet is built on the idea of creating a sufficient receptive field, which aids the dense use of contextual information. DABNet's design, shown in Figure 18, achieves an efficient and accurate architecture with a reduced number of parameters compared to other real-time semantic segmentation models. It is of high importance to note that DABNet uses convolution factorization, choosing depthwise separable convolutions to speed up inference.
DABNet presents 70.1% mIoU on the "Cityscapes" test set, involving solely 0.76 million parameters, and can run at 104 fps on 512 × 1024 high-resolution images.
DFANet [40] focuses on deep feature aggregation, using several interconnected encoding paths to add high-level context into the encoded features. Its structure is shown in Figure 19.
ShuffleSeg [41] achieves satisfactory levels of accuracy by employing higher-resolution feature maps. A deeper analysis of Figure 20 shows that its encoder features grouped convolutions and channel shuffling, which improve performance. ShuffleSeg was one of the first works that involved these two approaches for decreasing inference time.
HarDNet [42] takes into account the dynamic random-access memory (DRAM) traffic for feature map access and proposes a new metric, Convolutional Input/Output. It also features a Harmonic Dense Block and depthwise separable convolution. HarDNet also states that inference latency is highly correlated with the DRAM traffic. HarDNet achieves a high accuracy over Convolutional Input/Output and a great computational efficiency by improving the computational density (MACs (number of multiply-accumulate operations or floating-point operations) over Convolutional Input/Output). HarDNet's architecture is shown in Figure 21.
ERFNet [43] is based on the encoder-decoder architecture. Taking a deeper look at Figure 22, it consists of layers that feature residual connections and factorized convolutions, aiming to sustain efficiency while being accurate enough. To speed up the processing time, its designers chose a small decoder size and deconvolutions, to simplify memory and computational costs.
In the case of channel shuffling, a considerable number of state-of-the-art models employ this approach to increase efficiency. LEDNet [44] is a novel lightweight network (shown in Figure 23) that focuses on reducing the number of network parameters.
It follows an asymmetric encoder-decoder architecture and uses channel shuffling for boosting inference speed. ESSGG and ShuffleSeg also use channel shuffling for improving efficiency. Furthermore, LEDNet's decoder involves an attention pyramid network to enlarge the receptive fields, while relieving the network of extra computational complexity. Moreover, the asymmetric encoder-decoder architecture reflects the efficiency-oriented approach of a small decoder size to improve performance in terms of speed.
In the case of early downsampling, several semantic segmentation models use this approach to boost efficiency. EDANet [45] follows an asymmetric convolution structure. Asymmetric convolution decomposes a standard 2D convolution into two 1D convolutions. In this fashion, the parameters are reduced without sacrificing accuracy. EDANet uses early downsampling and dense connectivity to improve efficiency, while keeping the computational cost low. It is important to note that BiSeNet V2 is also included in the networks which use early downsampling. Finally, EDANet does not use a classic decoder module to upsample the feature maps, in order to reduce computational costs. On the contrary, bilinear interpolation is used to upsample the feature maps by a factor of 8 to the size of the input images, as shown in Figure 24, between the block of the Precision Layer and the Output block. This method is based on the efficiency-oriented approach mentioned in Section 2. Although it reduces accuracy, the trade-off between accuracy and inference speed remains satisfactory.
ENet [8] is an optimized deep neural network designed for fast inference and high accuracy. It follows a compact encoder-decoder architecture; however, as shown in Table 2, ENet uses a small-size decoder for reducing computation cost and increasing inference speed. Furthermore, ENet introduces early downsampling to achieve low-latency operation.
More specifically, early downsampling is the process of downsampling the input image in the first layers of the network. The reason behind this technique is that a downsampled version of the input image can be processed much more efficiently without losing vital information, and thus without sacrificing accuracy. Last but not least, [8] was one of the first semantic image segmentation models that aimed at real-time performance, and a milestone for the research attempts that followed.
The presented state-of-the-art models are listed in Table 3, along with their backbones and the efficiency-oriented approaches they share in common. The role of Table 3 is to summarize the ameliorative features that are most used in real-time semantic image segmentation, with the intention of clearing the way for designing efficient real-time semantic image segmentation models. Furthermore, Table 4 presents a collection of links to the implementation code of the state-of-the-art models.

Datasets
In this Section, we present the most popular datasets used in the field of semantic segmentation for autonomous driving. The choice of a suitable dataset is of great importance for the training and the evaluation of the created models. The challenging task of dataset selection is one of the first major steps in research, especially for a difficult and demanding field, such as autonomous driving, in which the vehicle's operating environment can be complex and varied. Each of the following datasets has been used for training and evaluation of real-time semantic segmentation models. Example images from these datasets are shown in Figure 25.

Cityscapes
Cityscapes [55] is one of the most popular datasets in the fields of semantic segmentation and autonomous driving. It was originally recorded as video, so the images are specially selected frames captured from 50 different cities. The selection was based upon the need for a large number of objects and a variety of scenes and backgrounds. In total, 30 individual classes are provided, grouped into 8 categories. The Cityscapes dataset contains around 5000 finely annotated images and 20,000 coarsely annotated images. The contained urban street scenes were captured over several months, in spring, summer and fall, during daytime and in good weather conditions.

CamVid
CamVid [56] is an image dataset containing road scenes. It was originally recorded as video, in five sequences. The resolution of the images that CamVid consists of is 960 × 720. The dataset provides 32 classes in total. Some of the most important ones for road-scene understanding are: car, pedestrian, motorcycle, traffic light, traffic cone, lane markings, sign, road, truck/bus and child.

MS COCO-Common Objects in Context
COCO [57] is an extensive dataset suitable for tasks such as object detection and semantic image segmentation. It contains 328,000 images. From the total amount of images, 82,873 are specified for training, around 41,000 for validation and over 80,000 for testing.

KITTI
KITTI [58] is a hallmark dataset in the field of autonomous driving. It consists of a large-scale collection of traffic scenes. The data were collected with a diverse set of sensors, such as RGB and grayscale cameras and a 3D laser scanner.

KITTI-360
KITTI-360 [59] is a wide-reaching dataset consisting of well-crafted annotations and rich scene information. Data were captured from various suburbs of Karlsruhe, Germany. In total, it contains more than 320,000 images and 100,000 laser scans over a driving distance of 73.7 km. Its designers annotated both static and dynamic 3D scene elements with rough bounding primitives. The definition of the labels is consistent with the Cityscapes dataset. Finally, it employs 19 classes for evaluation.

SYNTHIA
SYNTHIA dataset [60] contains 9400 road scenes captured from a simulation of a city environment. It employs 13 classes. The resolution of the images is 1280 × 960.

Mapillary Vistas
Mapillary Vistas Dataset [61] is an exhaustive dataset of road scenes with human-crafted pixel-wise annotations. It is designed for road-scene understanding from images captured globally. It features 25,000 high-resolution images, 124 semantic object categories and 100 instance-specifically annotated categories. It covers scenes from 6 continents, and it provides a diversity of weather, season, time of day, camera, and viewpoint.

ApolloScape
ApolloScape [62] is a large dataset consisting of over 140,000 video frames (73 street scene videos) from various locations in China under varying weather conditions.

RaidaR
RaidaR [63] is a large-scale dataset of rainy road scenes, specifically designed for autonomous driving. RaidaR comprises 58,542 images of rainy weather, of which a subset of 5000 is annotated with semantic segmentation labels. Moreover, 4085 sunny images were also annotated with semantic segmentations. Thus, RaidaR is one of the most extensive datasets. Finally, it is one of the most promising ones, due to the challenges presented by its rainy weather conditions.

Metrics
In this Section, we summarize some of the most popular metrics used for evaluating the performance of semantic segmentation models. The evaluation of these models, and especially of those designed for real-time semantic segmentation, depends on two key factors: effectiveness and efficiency.

Metrics Related to Effectiveness
For k + 1 classes (the +1 corresponds to the background class) and p ij the number of pixels of class i predicted as belonging to class j, the following metrics are defined [64]:
• Pixel Accuracy: the ratio of correctly classified pixels to their total number.
• Mean Pixel Accuracy: an extension of Pixel Accuracy, which calculates the ratio of correct pixels on a per-class basis and then averages it over the total number of classes.
• Intersection over Union (IoU): a very popular metric in the field of semantic image segmentation. IoU is defined as the intersection of the predicted segmentation map and the ground truth, divided by the area of their union.
• mean Intersection over Union (mIoU): the most widely used metric for semantic segmentation. It is defined as the average IoU over all classes.
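The effectiveness metrics above can all be computed from a single class confusion matrix. A minimal sketch (in NumPy, with a made-up toy matrix for illustration):

```python
import numpy as np

# C[i, j] counts pixels of true class i predicted as class j (p_ij in the text).

def pixel_accuracy(C):
    # correctly classified pixels (diagonal) over all pixels
    return np.trace(C) / C.sum()

def mean_pixel_accuracy(C):
    # per-class accuracy, averaged over classes
    per_class = np.diag(C) / C.sum(axis=1)
    return per_class.mean()

def mean_iou(C):
    # per-class IoU = TP / (TP + FP + FN), averaged over classes
    tp = np.diag(C)
    union = C.sum(axis=1) + C.sum(axis=0) - tp
    return (tp / union).mean()

# toy 3-class confusion matrix (not from any real benchmark)
C = np.array([[5, 1, 0],
              [1, 3, 1],
              [0, 0, 4]], dtype=float)
```

Here `pixel_accuracy(C)` gives 12/15 = 0.8, while `mean_iou(C)` averages the per-class IoUs 5/7, 3/6 and 4/5.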

Metrics Related to Efficiency
To evaluate the efficiency of a semantic image segmentation model, it is vital to use metrics that capture the processing time of the model and its computational and memory burden.

• Frames per second (FPS): a standard metric for evaluating the time a deep-learning model needs to process a series of image frames of a video. Especially in real-time semantic segmentation applications, such as autonomous driving, it is crucial to know exactly how many frames a model can process within one second. It is a very popular metric and very helpful for comparing different segmentation methods and architectures.
• Inference time: another standard metric for evaluating the speed of semantic segmentation. It is the inverse of FPS, and it measures the execution time for a single frame.
• Memory usage: a significant parameter to take into consideration when comparing deep-learning models in terms of speed and efficiency. Memory usage can be measured in different ways: some researchers report the number of parameters of the network, others the memory size needed to represent the network, and, frequently, the number of floating-point operations (FLOPs) required for execution.
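How FPS and inference time relate in practice can be sketched with a simple timing loop; `run_model` below is a placeholder for a real forward pass, and the warm-up runs mimic the common practice of excluding initialization overhead from the measurement:

```python
import time

def run_model(frame):
    time.sleep(0.002)  # placeholder workload standing in for inference (~2 ms)
    return frame

def benchmark(frames, warmup=3):
    # warm-up runs are excluded from the timed section
    for f in frames[:warmup]:
        run_model(f)
    start = time.perf_counter()
    for f in frames:
        run_model(f)
    elapsed = time.perf_counter() - start
    inference_time = elapsed / len(frames)  # seconds per frame
    fps = 1.0 / inference_time              # frames per second (inverse)
    return inference_time, fps

t, fps = benchmark(list(range(20)))
```

As the metrics above state, the two quantities are inverses of each other, so reporting either one suffices.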

Discussion
In this Section, we present our findings and key insights after considering the comprehensive analysis of the state-of-the-art networks presented in Section 3. It constitutes a fruitful discussion that comprises the emergence of a common operational pipeline and a comparative performance analysis, enriched by a discussion of the dependencies on the hardware used and of the limitations imposed by the current benchmarking datasets.

Common Operational Pipeline
The description of the state-of-the-art models for real-time semantic segmentation, which has been presented in Section 3, has shown sound evidence that most models share a common operational pipeline.
In particular, the first type of semantic segmentation networks which achieved real-time performance is based upon the encoder-decoder architecture. A representative example is ENet [8]. The encoder module uses convolutional and pooling layers to perform feature extraction. On the other hand, the decoder module recovers the spatial details from the sub-resolution features, while predicting the object labels (i.e., the semantic segmentation) [25]. A standard choice for the encoder module is a lightweight CNN backbone, such as GoogLeNet [5] or a revised version of it, namely Inception-v3 [10]. The design of the decoder module usually consists of upsampling layers based on bilinear interpolation or transposed convolutions.
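The encoder-decoder data flow can be sketched at the level of tensor shapes. In this toy NumPy illustration (no learned weights), a 2×2 max pool stands in for the encoder's strided convolutions, and nearest-neighbour repetition stands in for the decoder's bilinear interpolation or transposed convolutions:

```python
import numpy as np

def encode(x):
    # encoder: halve spatial resolution, (H, W) -> (H/2, W/2)
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def decode(f, scale=2):
    # decoder: restore spatial resolution, (h, w) -> (h*scale, w*scale)
    return np.repeat(np.repeat(f, scale, axis=0), scale, axis=1)

x = np.arange(16.0).reshape(4, 4)  # toy single-channel input "image"
features = encode(x)               # sub-resolution features, shape (2, 2)
logits = decode(features)          # recovered resolution, shape (4, 4)
```

In a real network each stage would of course apply learned filters and produce per-class score maps, followed by an argmax over classes to obtain the segmentation.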
In the effort to design efficient and at the same time accurate models, two-branch and multi-branch networks have been proposed. Instead of a single-branch encoder, a two-branch network uses a deep branch to encode high-level semantic context information and a shallow branch to encode rich spatial details at higher resolution. Under the same concept, multi-branch architectures integrate branches handling different resolutions of the input image (high, low and medium). However, the features extracted from the different branches must be merged to produce the segmentation map. To this end, two-branch networks introduce a fusion module to combine the output of the encoding branches. The fusion module can be a Feature Fusion module in which the output features are joined by concatenation or addition, an Aggregation Layer (BiSeNet V2), a Bilateral Fusion module (DDRNet), or a Cascade Feature Fusion Unit (ICNet).
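The fusion step common to these two-branch designs can be illustrated with shapes alone: the low-resolution deep-branch features are upsampled to the shallow branch's resolution and then merged, either by element-wise addition or by channel concatenation. A hedged NumPy sketch (illustrative shapes, no real network):

```python
import numpy as np

def upsample(x, scale):
    # nearest-neighbour upsampling along the spatial axes of (C, H, W)
    return np.repeat(np.repeat(x, scale, axis=1), scale, axis=2)

C, H, W = 8, 4, 4
shallow = np.ones((C, H, W))         # shallow branch: high-resolution spatial detail
deep = np.ones((C, H // 4, W // 4))  # deep branch: low-resolution semantic context

deep_up = upsample(deep, 4)                          # bring branches to same resolution
fused_add = shallow + deep_up                        # fusion by addition: (C, H, W)
fused_cat = np.concatenate([shallow, deep_up], 0)    # fusion by concatenation: (2C, H, W)
```

Note the design consequence: addition keeps the channel count fixed (cheap), while concatenation doubles it and usually requires a follow-up convolution to mix the channels.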
In a nutshell, there are three types of modules used in the operational pipeline of real-time semantic segmentation: an encoding, a decoding and a fusion module. There are several configurations in which the fusion module is part of the decoder. Figure 26 shows the three modules and the corresponding standard, yet dominant, design choices.

Comparative Performance Analysis
Table 5 provides a detailed listing of the performance of state-of-the-art models in terms of effectiveness (expressed by the mIoU metric) and efficiency (expressed by the FPS metric) for both the "Cityscapes" and "CamVid" datasets. Concerning the results on the "Cityscapes" dataset, STDC1-50 presents the best result in terms of inference speed with 250.4 FPS and 71.9% mIoU in terms of accuracy. Moreover, FasterSeg achieves 163.9 FPS in terms of inference speed and 71.5% mIoU in terms of accuracy. Finally, BiSeNet V2 and Fast-SCNN achieve 156 FPS and 123.5 FPS with 72.6% and 68% mIoU, respectively. The results of these models are exceptional and very promising for the future of real-time semantic segmentation. As far as the results on the "CamVid" dataset are concerned, FasterSeg presents the outstanding result of 398.1 FPS in terms of inference speed and 71.1% mIoU in terms of accuracy. Moreover, DDRNet-23-slim achieves 230 FPS and 78% mIoU, while STDC1-Seg achieves 197.6 FPS and 73% mIoU.
After the aforementioned objective evaluation, we examined the presented models by taking into account the trade-off between speed and accuracy, which is shown in Figure 27a,b. In particular, in Figure 27a, where experimental results for the "Cityscapes" dataset are considered, we can observe that STDC1-50 [11] presents a satisfactory trade-off between mIoU and FPS compared to the other models. Although other models such as U-HardNet70, BiSeNet v2 Large, SwiftNetRN-18 and ShelfNet18 achieve better performance in terms of accuracy, they lag behind in terms of frame rate. Moreover, in Figure 27b, where experimental results for the "CamVid" dataset are considered, it is shown that FasterSeg [27] can be considered an appealing choice for real-time semantic image segmentation, taking into account the trade-off between accuracy and efficiency compared to all other models.

Dataset-Oriented Performance
Another issue concerning finding the best-suited model for a problem at hand is the selection of the dataset for training. This is a critical issue in the field of deep learning in particular, as the data used to train a model are one of the most influential factors for the performance of the neural network. Therefore, in autonomous driving, where the models work in real time, they should be trained beforehand with the most suitable data. Moreover, the comparison of semantic segmentation models should be made on the same dataset; thus, research papers use the same dataset to compare different models. Nonetheless, in the area of real-time semantic segmentation, most of the models are compared on the "Cityscapes" and "CamVid" datasets. Although this fact simplifies the search for the fastest, yet most accurate, segmentation model, it may not provide the most objective basis for choosing the most suitable method. In particular, in the area of autonomous driving, where every mistake may cause irreversible consequences, there is a practical necessity for training and testing the deep-learning models used in a wide variety of external conditions (day, light, traffic, etc.). For example, "Cityscapes" and "CamVid", which are among the most popular and most used datasets for benchmark tests, lack some vital features regarding their diversity. In detail, they contain images with good or medium weather conditions during daytime. However, an efficient and safe autonomous driving system must have the ability to function under adverse weather conditions, such as snowfall, and of course at night-time, especially in the case of an emergency. To this end, one could adopt transfer learning techniques, as will be discussed in Section 6, to remedy this limitation raised by currently available datasets. For a concise comparative study, it is crucial that the experiments be performed under the same conditions.
In particular, the inference time is measured for a particular hardware configuration. As a result, the comparisons between the different architectures should be examined carefully, as the specific GPU model, the RAM and other parameters play an important role in the efficiency the models present. For this reason, the vast majority of the examined research papers provide information about the hardware specifications and the experimental conditions under which their proposed models are evaluated. To this end, in Table 5 the performance of each model is coupled with the GPU used during the experimental work. Beyond modern GPUs, fast inference can be achieved with other powerful computing devices. In fact, on the subject of the hardware choice, [65] provides a thorough and concise list of the commercially available edge devices. The Edge TPU of Google, the Neural Compute Stick 2 of Intel, the Jetson Nano, TX1 and AGX Xavier of NVidia, the AI Edge of Xilinx and the Atlas 200 DK of Huawei are some of the modern commercially available edge computing devices for deep-learning inference on mobile and aerial robots.

Future Research Trends
In this Section, we present some promising techniques towards improving the performance of semantic image segmentation that can be adopted in future research efforts.

• Transfer learning: Transfer learning transfers the knowledge (i.e., weights) from a source domain to a target domain, with a great positive effect on many domains that are difficult to improve because of insufficient training data [66]. By the same token, transfer learning can be useful in real-time semantic segmentation by reducing the amount of training data needed and, therefore, the training time required. Moreover, as [47] proposes, transfer learning offers a greater regularization to the parameters of a pre-trained model. In [67,68], the use of transfer learning improved semantic segmentation performance in terms of accuracy.
• Domain adaptation: Domain adaptation is a subset of transfer learning. Its goal is to ameliorate a model's effectiveness on a target domain using the knowledge learned in a different, yet coherent, source domain [69]. In [70], the use of domain adaptation achieved a satisfactory increase in mIoU on unseen data without adding extra computational burden, which is one of the great goals of real-time semantic segmentation. Thus, domain adaptation might be a valuable solution for the future of autonomous driving, giving accurate results on unseen domains while functioning at low latency.
• Self-supervised learning: Human-crafting large-scale data has a high cost, is time-consuming and is sometimes an almost impracticable process. Especially in the field of autonomous driving, where millions of samples are required due to the complexity of street scenes, many hurdles arise in the annotation of the data. Self-supervised learning is a subcategory of unsupervised learning introduced to learn representations from extensive datasets without manually labeled data. Thus, any human actions (and involvement) are avoided, reducing the operational costs [71].
• Weakly supervised learning: Weakly supervised learning refers to learning methods characterized by coarse-grained or inaccurate labels. As reported in [71], the cost of obtaining weak supervision labels is generally much cheaper than that of fine-grained labels for supervised methods. In [72], a superior performance compared to other methods has been achieved in terms of accuracy, on a benchmark that uses the "Cityscapes" and "CamVid" datasets. Additionally, ref. [73], with the use of classifier heatmaps and a two-stream network, shows greater performance in comparison to other state-of-the-art models that use additional supervision.
• Transformers: The Segmenter approach of ref. [74] allows the modeling of a global context already at the first layer and throughout the network, contrary to ordinary convolution-based methods. It reaches a mean IoU of 50.77% on ADE20K [75], surpassing all previous state-of-the-art convolutional approaches by a gap of 4.6%. Thus, transformers appear to be promising methods for the future of semantic segmentation.
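The transfer learning idea above boils down to reusing pretrained encoder weights while reinitializing (and then fine-tuning) the task-specific head. A minimal, framework-agnostic sketch; the weight dictionaries and parameter names (`encoder.*`, `head.*`) are hypothetical stand-ins for real checkpoints:

```python
def transfer_weights(pretrained, target, prefix="encoder."):
    """Copy every parameter whose name starts with `prefix` from the
    pretrained weight dict into the target weight dict; return the
    names copied (these would typically be frozen during fine-tuning)."""
    copied = []
    for name, w in pretrained.items():
        if name.startswith(prefix) and name in target:
            target[name] = w
            copied.append(name)
    return copied

# toy weight dictionaries standing in for real model checkpoints
pretrained = {"encoder.conv1": [0.5, -0.2], "encoder.conv2": [0.1], "head.fc": [9.9]}
target     = {"encoder.conv1": [0.0, 0.0],  "encoder.conv2": [0.0], "head.fc": [0.0]}

copied = transfer_weights(pretrained, target)
frozen = set(copied)  # names excluded from gradient updates during fine-tuning
```

The head weights (`head.fc` here) are deliberately left untouched, so only the task-specific part is trained from scratch on the target dataset.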

Conclusions
In this paper, we present an overview of the best methods and architectures for real-time semantic segmentation. The collection of the possible methods used for real-time segmentation aims at helping researchers find the most suitable techniques for boosting the speed of deep-learning models while preserving their accuracy. Furthermore, tables listing the most accurate and efficient state-of-the-art real-time models are provided. The selection of the chosen models is based on experiments made using the "Cityscapes" and "CamVid" datasets, according to the mIoU, FPS and inference time they achieved. For study purposes, a list of extensively used datasets and metrics for real-time semantic segmentation was described. Finally, this survey discusses current issues, thus showing areas of improvement in real-time semantic segmentation regarding the needs of autonomous driving and other high-end technological fields.