Failure Detection for Semantic Segmentation on Road Scenes Using Deep Learning

Abstract: Detecting failure cases is essential for ensuring the safety of a self-driving system, since any fault in the system can directly lead to an accident. In this paper, we analyze the failures of semantic segmentation, which is crucial for autonomous driving systems, and detect failure cases in the predicted segmentation map by predicting its mean intersection over union (mIoU). Furthermore, we design a deep neural network that predicts the mIoU of a segmentation map without the ground truth, and we introduce a new loss function for training on imbalanced data. The proposed method not only predicts the mIoU but also detects failure cases using the predicted mIoU value. Experimental results on the Cityscapes dataset show that our network achieves a prediction accuracy of 93.21% and a failure detection accuracy of 84.8%. It also performs well on a challenging dataset generated from the vertical vehicle camera of the Hyundai Motor Group, with 90.51% mIoU prediction accuracy and 83.33% failure detection accuracy.

Based on vision methods, advanced driver assistance systems make autonomous driving possible. The National Highway Traffic Safety Administration [24] categorizes the development of autonomous driving technology into five levels:
• Level 0: The driver performs all operations.
• Level 1: Some functions are autonomous, but the driver's initiative is required.
• Level 2: Many of the essential functions are autonomous, but driving still requires attention.
• Level 3: This is an autonomous driving stage, but, when a signal is given in an unexpected situation, the driver must intervene.
• Level 4: This is an autonomous driving stage that does not require a driver to board.
These levels can be divided into two main stages: Levels 0-2 represent the autonomous driving assistance stage, and Levels 3-4 represent the fully autonomous driving stage. To achieve a Level 3 autonomous driving system, the driver must be notified when an unexpected situation occurs, which requires a failure detection system. In this paper, we focus on detecting failure cases of semantic segmentation, which is a core vision recognition component.

Safety Problem of Real-World Application
When semantic segmentation networks are applied to real-world, safety-critical applications, a key problem is that there are no clear criteria for judging failure cases. In other words, allowing the system to detect its own failures is important for self-driving [25][26][27] in autonomous driving systems. For example, the predicted semantic segmentation map from the neural network in Figure 1 (right) shows the sidewalk misrecognized as a lane. If such a misrecognition occurs in the network of an actual autonomous vehicle, it can act as a fatal flaw leading to a serious accident (e.g., a fatality or car crash). Therefore, when the driving system has delivered wrong results, notifying the driver and handing over authority is essential. To prevent such accidents, we propose a neural network method that predicts the mean intersection over union (mIoU), a widely used evaluation metric in semantic segmentation that indicates how accurately each pixel of the image is classified.

Background Theory
Before describing our proposed failure detection and mIoU prediction framework, this section briefly reviews the fundamental theories related to the proposed method. We first briefly examine the theory of deep learning and then explain the neural network architecture that exhibits good image classification performance. Then, we review the semantic segmentation network and failure detection network used in this paper.

Deep Neural Network
A deep neural network (DNN) is an artificial neural network consisting of several hidden layers between the input and output layers, which models complex nonlinear relationships. The additional layers allow features to converge by gradually assembling those of the lower layers.
Previous DNNs [28] were usually designed as feed-forward neural networks, but recent studies have successfully applied deep learning structures to various applications with the standard error backpropagation algorithm [29,30]. Moreover, the weights can be updated using stochastic gradient descent:

w ← w − η (∂C/∂w),

where η indicates the learning rate and C denotes the cost function. The choice of cost function depends on the learning objectives and the data.
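As a concrete illustration of the update rule above, the following minimal NumPy sketch applies stochastic gradient descent to a toy quadratic cost; the cost function, gradient, and learning rate are illustrative choices, not taken from the paper:

```python
import numpy as np

def sgd_step(w, grad_C, eta=0.1):
    """One stochastic gradient descent update: w <- w - eta * dC/dw."""
    return w - eta * grad_C

# Illustrative quadratic cost C(w) = (w - 3)^2 with gradient dC/dw = 2(w - 3).
w = 0.0
for _ in range(100):
    w = sgd_step(w, 2 * (w - 3.0), eta=0.1)
# w converges toward the minimizer w = 3
```

Each step contracts the distance to the minimizer by a constant factor, so the iterate approaches w = 3 geometrically.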

Convolutional Neural Network
While conventional machine learning methods extract hand-crafted features, a convolutional neural network (CNN) needs minimal preprocessing. A CNN consists of one or several convolutional layers with additional weights and pooling layers. This structure allows the CNN to make full use of the two-dimensional structure of the input data and to train using standard backpropagation. CNNs are used as a general-purpose structure in various imaging and signal processing techniques and hold multiple benchmark results on standard image datasets. This section details the structure and role of the CNN.
Convolutional layers are the basic building blocks of the CNN. As shown in Figure 2, the network is divided into a feature-extracting part and a classifying part. The feature extraction part comprises several convolutional and pooling layers. The convolutional layer is an essential element that applies a filter to the input data followed by an activation function, optionally with a pooling layer. After the convolutional layers extract features, the fully connected layer (FCL) classifies the image using the extracted features. A flattening layer is placed between the part that extracts the image features and the one that classifies the image.
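The convolution-pooling-flatten-FCL pipeline described above can be sketched in PyTorch as follows; the layer sizes and class count are illustrative and are not those of mIoUNet:

```python
import torch
import torch.nn as nn

# A minimal CNN illustrating the structure described above: a convolutional
# feature-extraction part, a flattening layer, and a fully connected
# classifying part. All layer sizes are illustrative only.
class TinyCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),  # filter + activation
            nn.ReLU(),
            nn.MaxPool2d(2),                             # optional pooling
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),                                # bridge to the FCL
            nn.Linear(32 * 8 * 8, num_classes),          # assumes 32x32 input
        )

    def forward(self, x):
        return self.classifier(self.features(x))

out = TinyCNN()(torch.randn(1, 3, 32, 32))  # logits of shape (1, 10)
```

A 32 × 32 input is halved by each pooling layer to 8 × 8, so the flattened feature vector has 32 · 8 · 8 entries before the final linear classifier.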

Research Objective and Contribution
This paper aims to predict mIoU of an image using semantic segmentation maps and design a network to determine whether it belongs to a failure case. In this study, we propose a two-stage network algorithm which evaluates the score to detect the failure case for the image input. First, the encoder network Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation (ESPNet) [31] extracts features for the image and semantic segmentation map. Second, failure cases of the segmentation network are detected by FCL for mIoU prediction of images through the proposed methods. As a result, the proposed method can be used as a basis for a fully autonomous driving system to allow self-diagnosis. Our main contributions can be summarized as follows:

• We secure safety by proposing a failure detection network for image segmentation.
• Our model simultaneously performs mIoU prediction and failure detection for a single image.
• We propose a modified loss function to solve the data imbalance problem of the generated ground-truth (GT) mIoU.
• The proposed model exhibits good performance not only on the Cityscapes dataset but also on the Hyundai Motor Group (HMG) dataset.
The remainder of the paper is organized as follows. Section 2 describes the complete process of our mIoUNet method and analyzes the structure of the model. Section 3 presents the experimental results using various network structures and the various input channels.
The robustness of the network is verified on a challenging dataset of surround-view monitoring (SVM) camera road images provided by HMG. Finally, the paper ends with the conclusion in Section 4.

Proposed Failure Detection Network
In this section, we propose a network called mIoUNet that predicts the mIoU for images and detects failures. The learning pipeline of the proposed algorithm is as follows. First, we explain how the training data are generated. Second, we show the optimized network structure that suits our purpose. Then, the details of the CNN and FCL are introduced, including the activation function and loss function. Finally, the modified loss function for the failure detection task is introduced.

Data Generation
We define the terms used in this paper as follows. The GT segmentation map is the ground truth of a semantic segmentation example. The GT mIoU is calculated by comparing the GT segmentation map with the segmentation map predicted by the segmentation network. An example can be seen in Figure 3; the right side of the figure shows a segmentation map generated using ESPNet.

Selection of the Convolutional Neural Network Structure
Our mIoUNet structure consists of a front part composed of the CNN, which performs feature extraction, and a back part composed of the FCL for failure detection. In this paper, we propose an end-to-end network with a sigmoid function at the end. Our main segmentation network is ESPNet, a segmentation network with a reduction-split-transform-merge structure using convolutional factorization. Similarly, mIoUNet uses the encoder part of ESPNet so that it extracts similar features from the same structure, as illustrated in Figure 4.

Selection of the Fully Connected Layer Structure
In image classification, the structure of the FCL is typically selected by referring to experimental results obtained while varying the numbers of layers and nodes [32]. To select the best structure for mIoUNet, experiments were conducted with various numbers of layers and nodes. Table 1 presents 10 different cases [32] of FCL structures and their results. The CNN-2 structure, which is most similar to ESPNet-C, is used. More details are provided in Section 3.3.

Selection of the Activation Function
The sigmoid function is used at the end of the FCL. As illustrated in Figure 5, the function outputs values between 0 and 1; therefore, it is suitable for predicting mIoU values.

Selection of the Loss Function
The difference between the predicted mIoU and GT mIoU values obtained through the model is defined as the loss, and the network learns by minimizing this loss. The mean squared error (MSE) is selected to calculate the loss between the scalar values:

MSE = (1/n) ∑ᵢ (yᵢ − ŷᵢ)²,

where yᵢ denotes the predicted mIoU value and ŷᵢ indicates the GT mIoU value.

Modified Loss Function
In the initial version, mIoUNet was trained with MSE as the loss function, but predictions on the testing set converged near 0.5. In other words, mIoUNet could not predict mIoU values outside this narrow range. Analyzing the data to determine the cause revealed an imbalanced distribution of GT mIoU values, as depicted in Table 2.
As shown in Table 2, 394 images, or 78.8% of the test set, had GT mIoU values between 0.4 and 0.6. This imbalance caused the model to overfit and predict mIoU values between 0.4 and 0.6: because the data are concentrated around 0.5, simply following the average minimizes the MSE loss. Therefore, the loss value was modified as follows:

L_modified = (L_MSE)² if |yᵢ − ŷᵢ| < 0.1, and L_modified = √(L_MSE) if |yᵢ − ŷᵢ| ≥ 0.1. (4)

The modified loss corrects the overfitting problem. When the difference between the predicted and GT values is less than the error margin of 0.1, the original loss value is squared, producing a smaller loss; when it is greater than 0.1, the original loss value is square-rooted, producing a larger loss, so that the optimizer backpropagates a stronger error signal (Figure 6).
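One plausible reading of this piecewise transformation can be sketched as follows; since the exact form of the paper's Equation (4) did not survive, the choice of squaring/square-rooting the per-sample squared error is an assumption:

```python
import numpy as np

def modified_loss(y_pred, y_true, margin=0.1):
    """Sketch of the modified MSE-style loss described above (one plausible
    reading): the per-sample squared error is squared when the prediction
    error is within the 0.1 margin (shrinking near-average losses) and
    square-rooted when it exceeds the margin (amplifying large mistakes)."""
    err = np.abs(y_pred - y_true)
    se = err ** 2                                   # ordinary squared error
    loss = np.where(err < margin, se ** 2, np.sqrt(se))
    return loss.mean()
```

For a small error of 0.05 the loss shrinks from 2.5e-3 to about 6.25e-6, while for a large error of 0.4 it grows from 0.16 to 0.4, matching the behavior sketched in Figure 6.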

Final Network Structure
Our proposed model structure is shown in Figures 7 and 8. Figure 7 depicts the training, validation, and testing pipeline. Figure 8 presents the overall architecture of the proposed model. Network details, such as the layer structure, are described in Table 3.
In semantic segmentation, the intersection over union (IoU) for a class is the ratio between the number of pixels correctly predicted for that class (the intersection of the prediction and the ground truth) and the number of pixels assigned to that class in either the prediction or the ground truth (their union). The average of this value over all classes is called the mIoU. In this paper, we compute the mIoU between the GT segmentation map and the segmentation map obtained via ESPNet to obtain the GT mIoU value.
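The mIoU computation described above can be sketched as follows; this is a minimal version in which handling of ignore labels and other dataset-specific details is omitted:

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU = |pred ∩ gt| / |pred ∪ gt| over integer class maps,
    averaged over the classes present in either map."""
    ious = []
    for c in range(num_classes):
        p, g = (pred == c), (gt == c)
        union = np.logical_or(p, g).sum()
        if union == 0:                      # class absent from both maps
            continue
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious))
```

For example, with two classes, a prediction that differs from the ground truth in one pixel of each class yields per-class IoUs of 1/2 and 2/3, giving an mIoU of 7/12.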

Experimental Results
We conducted extensive experiments to demonstrate the performance of the proposed network. For performance evaluation, the prediction accuracy of the mIoU value for each image in the testing set was calculated, and the failure detection accuracy was calculated from the predicted mIoU values. The mean absolute error (MAE) was used as the evaluation metric because, without the absolute value, positive and negative errors would cancel each other out.
The MAE is calculated as follows:

MAE = (1/n) ∑ᵢ |yᵢ − ŷᵢ|,

where yᵢ denotes the predicted mIoU value and ŷᵢ indicates the GT mIoU value. The mIoU prediction accuracy is then derived from the MAE, and the failure detection accuracy is calculated as the proportion of correctly detected cases.

The precision, recall, and F1-score values are considered simultaneously to determine whether the detection result is reasonable. The method of calculating these performance indicators can be easily understood from Figure 9. The results of the experiment are classified according to the following criteria. True positive (TP) and true negative (TN) are the cases in which the performance of ESPNet is correctly predicted by mIoUNet: TP is when both the predicted mIoU and the GT mIoU are greater than the threshold of 0.5, and TN is when both are smaller than 0.5. In both cases, mIoUNet successfully detects not only the failure cases but also the success cases of ESPNet. On the other hand, false positive (FP) and false negative (FN) are the cases in which mIoUNet fails to predict the failure or success of ESPNet: FP is when the predicted mIoU is greater than 0.5 but the GT mIoU is less than 0.5; conversely, FN is when the predicted mIoU is smaller than 0.5 but the GT mIoU is greater than 0.5. In both cases, mIoUNet fails to properly detect the failure and success cases.
The evaluation metrics for failure case detection are calculated as follows:
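Using the standard definitions of precision, recall, F1-score, and accuracy together with the TP/TN/FP/FN cases defined above, the computation can be sketched as:

```python
# Sketch of the failure-detection metrics around the 0.5 threshold.
# "Positive" means ESPNet is predicted/actually successful (mIoU > 0.5),
# following the TP/TN/FP/FN definitions in the text.
def detection_metrics(pred_miou, gt_miou, thr=0.5):
    tp = sum(p > thr and g > thr for p, g in zip(pred_miou, gt_miou))
    tn = sum(p <= thr and g <= thr for p, g in zip(pred_miou, gt_miou))
    fp = sum(p > thr and g <= thr for p, g in zip(pred_miou, gt_miou))
    fn = sum(p <= thr and g > thr for p, g in zip(pred_miou, gt_miou))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    accuracy = (tp + tn) / len(gt_miou)
    return precision, recall, f1, accuracy
```

The `thr` parameter corresponds to the 0.5 threshold used for Cityscapes; for the HMG dataset the text later raises it to 0.6.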

Experimental Setup
In this subsection, we describe the experimental setup used to evaluate the proposed algorithm on the Cityscapes and Hyundai Motor Group (HMG) datasets. After each convolutional layer in the CNN, batch normalization [33] and a rectified linear unit [5] layer are applied. The Adam optimizer [34] is used for training with a batch size of 12 for 30 epochs. The size of the network input is set to 1024 × 512 × 3. The initial learning rate is 0.0001 and is halved every 10 epochs. Dropout [35] is applied at a ratio of 0.5. All experiments were conducted on an NVIDIA Tesla V100 and implemented with the PyTorch library.
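The optimizer and learning-rate schedule described above can be sketched in PyTorch as follows; the model here is a placeholder standing in for mIoUNet, and the training loop body is elided:

```python
import torch

# Sketch of the training configuration: Adam, initial learning rate 1e-4,
# halved every 10 epochs, for 30 epochs. The model is a stand-in.
model = torch.nn.Linear(8, 1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

for epoch in range(30):
    # ... one epoch of training with batch size 12 would go here ...
    optimizer.step()      # placeholder step (no gradients accumulated)
    scheduler.step()      # halves the learning rate every 10 epochs
```

After 30 epochs the learning rate has been halved three times, ending at 1.25e-5.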

Datasets
Two datasets were used in the experiments. The first is the Cityscapes dataset, a widely used segmentation dataset, which includes 2975 images for the training/validation set; the testing set contains 500 images. The second is a vertical road-image dataset obtained from an SVM camera attached to an actual vehicle, provided by HMG. It consists of 863 images in the training/validation set and 216 in the testing set. To check the robustness of the network, we also applied our method to rain/haze images and real-world road images, which are treated as different domains.

Quantitative Results
Various FCL structures were applied to predict mIoU values. For each structure, we report the evaluation indicators used to select the most efficient model: mIoU prediction accuracy, number of parameters, and run time.
Similar to the image classification results in [32], as the number of nodes in the FCL increases, the mIoU prediction accuracy becomes higher. However, real-time detection must be guaranteed; therefore, the 256-20-1 structure, which has high accuracy and a run time below 1 s, was selected. The evaluation indices obtained using the selected network structure are shown in Table 4. The mIoU prediction accuracy was calculated using (5) and (6). Experiments were conducted in two directions using the selected network structure:
• Additional information: Additional information is learned by increasing the number of input channels to improve prediction performance. We experimented with RGB, RGB+hue, and RGB+segmentation (RGBSeg) inputs to observe the effect of learning additional information.
• Real-time application: The image size was reduced during training to secure real-time performance.
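A sketch of building the four-channel RGBSeg input from the first direction is shown below; the normalization of the segmentation channel and the helper name are assumptions for illustration:

```python
import numpy as np

# Hedged sketch: stack the RGB image with the (normalized) predicted
# segmentation map as a fourth input channel. Shapes follow the paper's
# 1024 x 512 input; the normalization scheme is an assumption.
def make_rgbseg(rgb, seg_map, num_classes=19):
    """rgb: (H, W, 3) float in [0, 1]; seg_map: (H, W) integer class ids."""
    seg_channel = seg_map.astype(np.float32) / (num_classes - 1)
    return np.concatenate([rgb, seg_channel[..., None]], axis=-1)

x = make_rgbseg(np.random.rand(512, 1024, 3).astype(np.float32),
                np.random.randint(0, 19, (512, 1024)))  # (512, 1024, 4)
```

The hue-channel variant would be built analogously by appending the H channel of the HSV conversion instead of the segmentation map.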
When one more channel with additional information was added, the accuracy improved both when adding a hue channel corresponding to color and when adding a segmentation map. The increase in run time due to the added network parameters was insignificant. We decided to use a four-channel input consisting of the RGB image and the segmentation map, which gave the best results (Table 5).

After fixing the input as four-channel RGBSeg, the input size was adjusted to reduce the network parameters. Experiments were conducted by reducing the original 1024 × 512 input image by factors of two and four. As the amount of computation decreased, the run time was similarly reduced toward real-time application, but the accuracy also decreased (Table 6). As expected, accuracy was highest at the largest image size, and when the image size was halved, the accuracy decreased by about 1.2%.

The failure detection results for each input size are listed in Table 7. The failure detection accuracy decreased significantly as the input size decreased. The precision, recall, and F1-score values in Table 7 were calculated to determine why the smallest input appeared to yield higher mIoU prediction accuracy than the 512 × 256 input. Checking these values revealed that the image size was reduced so much that the network could learn very little, and training merely minimized the loss function; consequently, all testing set images were recognized as failures due to overfitting on the dominant GT mIoU values. In addition, comparing the results for images of size 1024 × 512 and 512 × 256, the detection accuracy difference is almost 15%. Although the run time differed by about 2.7 times, the detection accuracy was significantly lower; therefore, reducing the input size of the image to achieve real-time performance was not efficient.
The accuracies obtained with the MSE loss function and the modified loss function are compared in Table 8. Using the loss function that fits the characteristics of the unbalanced data yielded a 4.6% accuracy improvement for failure detection and a 2.3% improvement for mIoU prediction. The mIoU prediction accuracy was obtained using (5) and (6), and the failure detection accuracy was calculated using (7).

Qualitative Results
This section shows examples of mispredicted Cityscapes dataset images among the results classified by mIoUNet. The qualitative results show the input image, GT segmentation map, and predicted segmentation map for the testing set, together with the GT mIoU, predicted mIoU, mIoU error value, and failure detection result of each image.
False-Negative Image
Figure 10 presents examples of results classified as false negatives; Figure 10 (left) displays the input images, and Figure 10 (middle) displays the segmentation maps. Out of the 500 images, 27 were detected as false negatives. A false negative is when the GT mIoU is greater than 0.5 but the predicted mIoU is less than 0.5. The MAE values for the mIoU of these images are given in Table 9. If the network were used for an actual autonomous driving function, convenience could be increased by allowing users to set the MAE value that defines the limit situation. Table 9 demonstrates that false-negative images have small errors; in particular, the average error was 0.075, which is smaller than the error value of 0.1 used to define failure cases.

False-Positive Image
Figure 11 shows examples of results identified as false positives. The first column of Figure 11 depicts the input images, and the second column depicts the segmentation maps. In total, 42 of the 500 samples were classified as false positives. A false positive is when the GT mIoU is less than 0.5 but the predicted mIoU is greater than 0.5. The MAE values for the mIoU of these images are given in Table 10, which shows that false-positive images can have large error values: the average error was calculated as 0.045, and the maximum error was 0.5. To determine the cause, the images that raise the average error were examined separately.

Figure 12 presents six images with large error values. The first column of Figure 12 displays the input images, and the second column displays the segmentation maps. First, some cases had many pixels in the image that were not assigned to any class in the segmentation map. In the segmentation result shown on the right of each picture, the areas in black are inferred as pixels with no class, which dramatically decreases the GT mIoU value when the mIoU is calculated.
This phenomenon could be addressed by randomly assigning classes through exception handling during semantic segmentation inference, or by using a better semantic segmentation network for inference. Second, some images contain few pixels in the general roadway area: for example, narrow roads, roads with severe curvature, and alleyways surrounded by many buildings and structures. This problem could be solved by assigning more weight during training to pixels of classes that are essential for safe driving, such as roads, people, and obstacles.

Experimental Results on the Rain/Haze Dataset
Experiments were conducted with images from the rain/haze composite version of the Cityscapes dataset (Figure 13). These can be viewed as the same domain but a different environment. The quantitative result is the average over all testing set images. The qualitative results show the input image, GT segmentation map, and predicted segmentation map, together with the GT mIoU, predicted mIoU, mIoU error value, and failure detection result of each image.

Quantitative Results
The official RainCityscapes dataset [36] is composed of 66 images rendered with different rain and haze settings. The experiments confirm that no significant difference exists on RainCityscapes. We randomly selected images corresponding to the various hyperparameters of the dataset. mIoUNet detected 14 of the 66 images as limit situations. Although a slight performance decline occurred due to the effects of rain and haze, the model is generally robust to other environments. The mIoU prediction accuracy and failure detection accuracy for the rain/haze dataset are shown in Table 11.

Qualitative Results
Examples of images detected as failure cases are presented in Figure 13. These images were detected as failure cases mainly when the road area was small or when there were many obstacles and buildings.

Experimental Results on the DeepLabV3+ Model
We experimented with DeepLabV3+ [20] to ensure that the proposed method is also applicable to other semantic segmentation models. We examined whether performance is maintained when using GT mIoU values generated by another segmentation model while fixing the structure of the second-stage network. Table 12 presents the distribution of GT mIoU values generated using the DeepLabV3+ model.
The results of training with MSE and with the modified loss function are shown in Table 13. As shown in the table, the proposed loss function improves performance, though not as much as with ESPNet in Table 8. This is likely because the distribution of GT mIoU values generated with DeepLabV3+ has lower variance than that of ESPNet. Nevertheless, a slight improvement was observed in both failure detection accuracy and mIoU prediction accuracy. We confirm that the proposed loss function not only yields significant performance improvements on unbalanced data but also performs well on data from other distributions.

Experimental Results on the Challenging Dataset
We experimented with the HMG dataset, which has a different viewpoint from Cityscapes. It consists of 1079 road images acquired by an SVM camera attached to a vehicle, provided by the Hyundai Motor Group. The experimental environment is the same as previously described; however, since the number of segmentation classes is 12, the corresponding layer dimension is set to 12. The training and validation sets comprise 80% of the images (863), and the remaining 20% (216) are used as the testing set. Considering that the total number of images (1079) is smaller than that of the Cityscapes dataset, the learning rate was halved every 50 epochs over 150 total epochs of training.

The GT mIoU distribution of the HMG dataset is depicted in Table 14. For the HMG dataset, we again compared training with the MSE loss and with the modified loss function (Table 15); the table shows that the modified loss function gives better performance on both Cityscapes and the HMG dataset. The results of experiments with different numbers of input channels are shown in Table 16. Due to the dataset characteristics, the mIoU threshold defining a failure case had to be changed: 0.6 was set as the threshold, at which all performance indicators are generally good, as shown in Table 17.

False-Negative Image in the HMG Dataset
The first column of Figure 14 depicts the input images, and the second column depicts the segmentation maps. Many of the false-negative images (Figure 14) were taken during the day, with large variations of light within a single image; for example, images with shadows and images with light shining into a tunnel were classified as false negatives.

False-Positive Image in the HMG Dataset
The first column of Figure 15 presents the input images, and the second column presents the segmentation maps. Many of the false-positive images were taken at night, and images with high light dispersion were detected. The difference from the false-negative images is that many detected images lacked adequately visible edge information, such as road markers or lanes, in the road area. To summarize, the proposed model has difficulty accurately predicting the mIoU when the lighting varies considerably in the vertical road image. However, the qualitative evaluation results suggest that using additional information, such as edges, could increase prediction accuracy.

Conclusions
In the image recognition system of an autonomous vehicle, it is crucial for safety that the system judges failure cases autonomously, which is a requirement for Level 3 autonomous driving. This paper proposes a failure detection network for road image segmentation using the mIoU. Our mIoUNet uses a CNN and FCL, which are commonly used in existing classification network structures.
The results on the Cityscapes dataset reveal 93.21% mIoU prediction accuracy and 84.8% failure detection accuracy. As a challenging task, HMG's SVM camera acquisition dataset, which is taken from different viewpoints, demonstrated 90.51% mIoU prediction accuracy and 83.33% failure detection accuracy.
As a result of experimenting with many different FCL structures, an efficient 256 × 12 × 1 FCL structure with high accuracy and fast inference speed was implemented and assessed with a model trained using the modified loss function. The performance of the network improved as the number of input channels increased, which indicates that providing additional information improves failure detection.
Finally, by analyzing the images with the largest error values, we observed that the proposed model successfully detects failure cases. We note that the performance of the proposed network may improve with the performance of the semantic segmentation model that creates the GT mIoU; a more accurate and robust semantic segmentation model should thus result in better performance of the proposed model.
Although it is intended to detect failure cases in road images for autonomous driving, the proposed method only evaluates the reliability of single images as a whole. As future work, we aim to study a failure detection network that can predict the reliability of each pixel in an image. Making the network structure lighter to ensure real-time performance is another direction for future work.