In the TLFC-RD method, a two-level fusion of classifiers is used to predict the road class for connected autonomous vehicles (CAVs). The road is classified into two classes: drivable area (i.e., road) and non-drivable area (i.e., background). The classification is performed on the video frames obtained from the CAVs. Four classifiers are used in the TLFC-RD method: LeNet-5, LSTM, ResNet and SVM. The classification of the road is improved by combining the deep learning classifiers with superpixel generation, appropriate features and multi-classifier feature fusion.
Figure 1 illustrates the block diagram for the TLFC-RD method.
  4.2. Preprocessing Using Superpixel Generation
In the TLFC-RD method, the superpixel method groups similar neighboring pixels into superpixel blocks, where the grouping is based on a similarity measure computed among the pixels. This preprocessing step minimizes redundant image information while preserving the boundary information of the image, and it reduces the difficulty of the subsequent road detection process. Here, simple linear iterative clustering (SLIC) is used, which considers both the location and the color of pixels in the neighborhood. Each pixel of the color image is transformed into a five-dimensional feature vector, Labxy, that combines the Lab color with the two-dimensional planar position. Lab refers to the color model of the CIELAB color space. The brightness is represented as $l$, the two color components are represented as $a$ and $b$, and the spatial location in the image plane is denoted as $(x, y)$. The similarity is calculated from the distance among the pixels: the similarity among the pixels is computed, labels are assigned to similar pixels and the overall algorithm is iterated until convergence.
Equations (1) and (2) show the color difference and spatial distance between two pixels $i$ and $j$.

$$d_c = \sqrt{(l_j - l_i)^2 + (a_j - a_i)^2 + (b_j - b_i)^2} \qquad (1)$$

$$d_s = \sqrt{(x_j - x_i)^2 + (y_j - y_i)^2} \qquad (2)$$

where $d_c$ and $d_s$ represent the color difference and spatial distance among the pixels, respectively. The distance $D$ is calculated by using Equation (3).

$$D = \sqrt{\left(\frac{d_c}{m}\right)^2 + \left(\frac{d_s}{S}\right)^2} \qquad (3)$$

where the step size among the pixels is represented as $S$ and the compactness parameter is denoted as $m$. The compactness parameter defines the relative weight of the color and location distances. Each pixel in the image is assigned to a superpixel. Converting pixels to superpixels in this pre-processing step reduces the complexity of the classification. After pre-processing, the image undergoes feature extraction to obtain the appropriate features. An example of a pre-processed image is shown in Figure 3.
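The distance computation described by Equations (1) to (3) can be sketched as follows. This is a minimal illustration that assumes the standard SLIC formulation; the function name and the parameter names `step` and `compactness` are hypothetical and are not taken from the original method.

```python
import math

def slic_distance(p, q, step, compactness):
    """Combined SLIC distance between two Labxy feature vectors.

    p, q:        tuples (l, a, b, x, y)
    step:        grid interval between superpixel centers (assumed name)
    compactness: weight balancing color and spatial proximity (assumed name)
    """
    l1, a1, b1, x1, y1 = p
    l2, a2, b2, x2, y2 = q
    # Color difference in the Lab space.
    d_color = math.sqrt((l2 - l1) ** 2 + (a2 - a1) ** 2 + (b2 - b1) ** 2)
    # Spatial distance in the image plane.
    d_space = math.sqrt((x2 - x1) ** 2 + (y2 - y1) ** 2)
    # Normalize each term before combining them.
    return math.sqrt((d_color / compactness) ** 2 + (d_space / step) ** 2)
```

A larger compactness value reduces the influence of the color term, so the spatial term dominates and the resulting superpixels become more regular in shape.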
  4.3. Feature Extraction from the Pre-Processed Image
The classification between the drivable and non-drivable areas is obtained by extracting appropriate features from the pre-processed image. In the TLFC-RD method, seven feature extraction methods are used: the spatial values of pixels, the RGB values of pixels, entropy, HSV color space, texton features, local distance distribution and LBP. Here, the spatial value, RGB value and HSV color space features are used in the TLFC to distinguish the background from the shadows over the road. Specifically, the spatial value of the pixels is used to identify the road from the background, mainly in sunlit image regions, whereas the HSV value plays an important role when the TLFC-RD is used for dark regions. The texton feature describes the local structure of the images and provides the features of intersections, corners and line terminators. However, uncontrolled outdoor lighting conditions and unpredictable weather have a great impact on region detection. The local texture of images extracted using the LBP is used to overcome this limitation. The feature extraction process is as follows:
- (a) Spatial and RGB values of pixels
Initially, the spatial values of the pixels (i.e., the $x$ and $y$ coordinates of the pixels) are taken from the input image. Subsequently, the $R$, $G$ and $B$ values of the pixels are taken as feature sets from the image.
- (b) Entropy
The information content of the image is described by the information (Shannon) entropy, which characterizes the global features of the source on average. The image entropy is calculated from the histogram of the image, because the histogram effectively captures the complexity of the gray value distribution. Equation (4) shows the entropy value of the input image $I$ of size $M \times N$.

$$E(I) = -\sum_{i=0}^{L-1} p_i \log_2 p_i, \qquad p_i = \frac{n_i}{M N} \qquad (4)$$

where the probability density function and number of pixels for the gray level $i$ are $p_i$ and $n_i$, respectively, and $L$ is the number of gray levels.
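As a minimal sketch of the histogram-based entropy described above, the following function computes the Shannon entropy (in bits) of a grayscale image given as a list of rows. The function name is hypothetical, and the base-2 logarithm is an assumption.

```python
import math
from collections import Counter

def image_entropy(gray):
    """Shannon entropy (bits) of a grayscale image given as a list of rows."""
    pixels = [value for row in gray for value in row]
    total = len(pixels)
    # Histogram: number of pixels observed for each gray level.
    counts = Counter(pixels)
    # Gray levels that never occur contribute nothing to the sum.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A constant image yields an entropy of zero, while an image whose pixels are spread evenly over many gray levels yields a high entropy.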
- (c) HSV color space
The main advantage of using HSV in road detection is that it closely matches the human conceptual understanding of colors. Moreover, it separates achromatic and chromatic components. Here, the color is differentiated by using the hue $H$, the percentage of white light included in the pure color is denoted as the saturation $S$ and the perceived light intensity is denoted as the value $V$. Equations (5)–(7) express $H$, $S$ and $V$, which are the components of the HSV color space.
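As an illustration of an RGB-to-HSV conversion, the sketch below uses the `colorsys` module of the Python standard library. It follows the common hexcone model, which may differ from the exact formulation of Equations (5)–(7); the wrapper name `pixel_to_hsv` is hypothetical.

```python
import colorsys

def pixel_to_hsv(r, g, b):
    """Convert an 8-bit RGB pixel to (hue in degrees, saturation, value)."""
    # colorsys expects and returns floats in the range [0, 1].
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    return h * 360.0, s, v
```

For a gray pixel the saturation is zero, which reflects the separation of achromatic and chromatic components mentioned above.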
        
- (d) Texton features
In general, the texton features are the output acquired from the filter bank. Here, the filter bank has Gaussians in different scales such as  and . After converting the input image  into the Lab color space, the Gaussian filters are applied in the  and  channels. Subsequently, the 18-dimension vector of the texton feature  is acquired from each pixel of the image.
- (e) Local distance distribution
The neighborhood space 
 of a pixel is divided into the grid of 
, which is a three-dimensional vector. Next, the distribution histogram 
 for a point 
 at 
 is expressed according to Equation (8):
        where 
 and 
 is a three-dimensional vector. Hence, the resultant feature of the 
 vector is 
.
- (f) Local binary pattern
In LBP, a neighborhood around each pixel is considered to generate a binary number for that pixel. A value of 1 is assigned to each neighboring pixel whose intensity is greater than or equal to that of the center pixel; otherwise, a value of 0 is assigned. The resulting labels are then read in rotational order around the center pixel to form an eight-bit number, as given in Equations (9) and (10).

$$LBP_{P,R} = \sum_{p=0}^{P-1} s(g_p - g_c)\, 2^p \qquad (9)$$

$$s(z) = \begin{cases} 1, & z \ge 0 \\ 0, & z < 0 \end{cases} \qquad (10)$$

where the neighbor radius and number of adjacent pixels for the center pixel are represented as $R$ and $P$, respectively; the brightness intensities of the center and neighborhood pixel are denoted as $g_c$ and $g_p$, respectively.
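The LBP computation can be sketched for the common case of eight neighbors at radius 1. The function name and the starting point of the neighbor ordering (top-left, clockwise) are assumptions; any fixed circular order gives an equally valid code.

```python
def lbp_code(image, row, col):
    """8-neighbor, radius-1 LBP code of an interior pixel.

    image is a list of rows of gray values; (row, col) must not lie on the border.
    """
    center = image[row][col]
    # Neighbors visited clockwise, starting at the top-left pixel.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    code = 0
    for bit, (dr, dc) in enumerate(offsets):
        # A neighbor at least as bright as the center sets its bit to 1.
        if image[row + dr][col + dc] >= center:
            code |= 1 << bit
    return code
```

Because only the sign of the intensity difference is used, the code is unchanged by any monotonic change in brightness, which is why LBP is robust to lighting variations.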
  4.4. Assumptions
Consider that $N$ vehicles are randomly located on the road, and that these vehicles are connected to share information about the drivable and non-drivable areas. Thus, $N$ feature vectors are generated during road detection. For example, the feature vector of a single vehicle extracted during detection is shown in Equation (11).
        
Each vehicle has its own preferred set of feature extraction methods. So, the remaining vehicles randomly select their feature vectors as ,  and . The extracted feature vectors are processed under the two-level fusion of the classifiers to classify the precise class of the input image.
  4.5. Two-Level Fusion of Classifiers for Road Detection
In the proposed method, the classifiers are fused to obtain a precise discrimination between the drivable area and the non-drivable area. Four classifiers are used in this two-level fusion: LeNet-5, LSTM, ResNet and SVM. Each input image frame acquired from the dataset is treated as a vehicle, and a cross fold is applied to present this input frame to any of the chosen classifiers. Hence, a particular input frame is never restricted to being processed by only one classifier. Subsequently, each vehicle in the CAVs uses its preferred artificial intelligence model to extract the feature maps from its feature vector. Specifically, each vehicle randomly selects one of the deep learning classifiers LeNet-5, LSTM and ResNet. Next, the extracted feature maps are given as an input to the SVM for the precise detection of the road. Therefore, the combination of the cross fold and TLFC is used to optimize the performance of road detection. The process of the two-level fusion of the classifiers is explained in the following sections.
  4.5.1. LeNet-5
Generally, LeNet-5 is a gradient-based learning CNN that is applied to extract feature maps from the feature vector. Apart from the input and output layers, LeNet-5 has six layers: three convolutional layers, two pooling layers and one fully connected layer. The number of parameters to be trained is reduced by minimizing the number of neurons in the fully connected layer. The process of LeNet-5 is as follows:
- 1. The convolutional layer is generally used to accomplish the feature extraction. In this layer, the input matrix is convolved with the convolution kernel. Let the feature vector be , where  represents the number of input images and  represents the amount of data in the respective . Moreover, the convolutional kernel is represented as , where the convolutional kernel’s size is denoted as . Equation (12) expresses the convolutional layer output , 
          where the offset term considered in each convolution is 
 and the activation function is denoted as 
.
- 2. Five activation functions are widely used: Gaussian, Rectified Linear Unit (ReLU), Softplus, Tanh and Sigmoid. Among these, ReLU does not suffer from gradient saturation and trains faster than the saturating nonlinear functions. Therefore, ReLU is used in the CNN.
- 3. Next, the pooling layer performs feature selection to reduce the dimensions of the data while preserving its main features. Within the local receptive field, the mean, a random value or the maximum value is extracted using mean, random or maximum pooling, respectively. The output of the pooling layer is expressed in Equation (13):
          where 
 represents the layer number and 
 represents the former layer’s result.
- 4. In general, the fully connected layer is the final layer of the CNN. Each neuron in this layer is linked with the neurons of the previous layer and uses the ReLU function. The local information is integrated in this layer and can be used to differentiate the classes. The output of the fully connected layer is expressed in Equation (14):
- 5. Further, multi-class classification is accomplished by the output (softmax) layer. Here, the softmax layer maps the output of the previous layer to values between 0 and 1. Each result represents a classification probability, and the results sum to 1. Next, the class with the highest probability is chosen as the output. Equation (15) provides the output of the output layer.
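The layer operations listed above can be sketched in one dimension as follows. This is a simplified illustration under stated assumptions (a single 1-D channel, a sliding dot product as the convolution, maximum pooling), not the LeNet-5 configuration used in the TLFC-RD; all function names are hypothetical.

```python
import math

def relu(x):
    """Rectified Linear Unit activation."""
    return max(0.0, x)

def conv1d(signal, kernel, bias=0.0):
    """Valid 1-D convolution (sliding dot product) followed by ReLU."""
    k = len(kernel)
    return [relu(sum(signal[i + j] * kernel[j] for j in range(k)) + bias)
            for i in range(len(signal) - k + 1)]

def max_pool1d(values, size=2):
    """Non-overlapping maximum pooling; a trailing partial window is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]

def softmax(scores):
    """Map scores to probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

For example, `softmax([0.0, 0.0])` assigns a probability of 0.5 to each of the two classes, and the predicted class is the index of the largest probability.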
  4.5.2. LSTM
LSTM [25] includes two activation functions and three gates, which are used to extract the feature maps from the feature vector. The gates included in the LSTM are the forget gate, input gate and output gate. Next, the long-term memory is included to create the black box of input and output. This improves the training process of the LSTM and helps it utilize the full historical sequence information. The result of the input gate 
, cell output 
 and output gate 
 for LSTM are expressed in Equations (16)–(18):
          where the sigmoid activation function is represented as 
, the weight matrix for the input and output gate is 
 and 
, the offset vectors/bias for the input and output gate are 
 and 
, memory is denoted as 
, and element-wise multiplication is denoted using 
. The output from the cell is used as a feature map for road detection in CAVs.
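The gate computations can be sketched with a single step of a scalar LSTM cell. The sketch assumes the standard LSTM formulation with sigmoid and tanh activations, which may differ in detail from Equations (16)–(18); the parameter layout in `params` is hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, params):
    """One time step of a scalar LSTM cell.

    params maps each name ('f', 'i', 'o', 'g') to a tuple (w_x, w_h, b)
    of input weight, recurrent weight and bias (assumed layout).
    """
    def pre_activation(name):
        w_x, w_h, b = params[name]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre_activation("f"))    # forget gate
    i = sigmoid(pre_activation("i"))    # input gate
    o = sigmoid(pre_activation("o"))    # output gate
    g = math.tanh(pre_activation("g"))  # candidate memory content
    c = f * c_prev + i * g              # updated memory (cell state)
    h = o * math.tanh(c)                # cell output
    return h, c
```

In a vector-valued cell the products `f * c_prev` and `i * g` become element-wise multiplications, and the scalar weights become weight matrices.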
  4.5.3. ResNet
ResNet [26], used in the TLFC, utilizes residual blocks to solve the issues of vanishing gradients and degradation that arise in deep convolutional neural networks (CNNs). The residual block does not depend on the depth of the network, which also improves network performance. Each residual block is formed by combining its input with its output.
The first layer’s activation is 
, where 
 is the residual, which is obtained after processing the linear transformation. In the second layer of ResNet, the 
 is added to the residual value. A direct connection channel between the input and output allows the parameterized layers to concentrate on learning the residual. The feature maps from the nonlinear function of ResNet are represented in Equation (19):
          where the residual block weight is represented as 
 and the nonlinear function is denoted as 
.
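The shortcut connection can be sketched with a small fully connected residual block. This is an illustrative simplification (two linear layers with ReLU, no convolution or batch normalization), and the function and weight names are hypothetical.

```python
def relu_vec(v):
    return [max(0.0, x) for x in v]

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]

def residual_block(x, w1, w2):
    """Compute relu(F(x) + x), where F(x) = w2 * relu(w1 * x)."""
    residual = matvec(w2, relu_vec(matvec(w1, x)))
    # The shortcut adds the unchanged input to the learned residual.
    return relu_vec([r + x_i for r, x_i in zip(residual, x)])
```

With all weights set to zero the block passes a non-negative input through unchanged, which illustrates why adding such blocks does not degrade the representation learned by earlier layers.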
  4.5.4. Decision Fusion Using Multiple Classifiers
The proposed TLFC-RD method uses the two-level fusion of diverse classifiers, which is performed based on a different set of feature vectors. At first, multiple classifier training is used to create the feature sets. Here, the feature maps are generated from the feature vector obtained for each vehicle in the CAVs. In the first level, feature maps are extracted from the fully connected layers of LeNet-5, the cell gate of LSTM and a residual output from ResNet. The extracted feature maps are fused together and given as an input to the next-level classifier.
A multi-classifier feature fusion model [27] is used in the TLFC-RD, where the feature maps from LeNet-5, LSTM and ResNet are used as inputs. This helps to improve the classification accuracy during road detection. Multi-classifier feature fusion is used because the pooling layer of a deep learning classifier discards some information while performing the classification. In contrast, the multiple features obtained from LeNet-5, LSTM and ResNet carry the supervision information about road scenes, which is utilized to differentiate the drivable and non-drivable classes. As the road scenes obtained from the vehicles are captured from different perspectives and angles, the proposed multi-classifier feature fusion model is used to extract appropriate feature maps from the road scenes. Therefore, the feature maps from LeNet-5, LSTM and ResNet are concatenated to fuse the features as shown in Equation (20), and the fused information is given as an input to the second-level classifier (i.e., SVM).
          
          where 
 represents the fused feature map vector.
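The concatenation step can be sketched as follows, assuming that each first-level classifier delivers its feature map as a flat sequence of numbers. The function name and the ordering of the three inputs are hypothetical.

```python
def fuse_feature_maps(lenet_map, lstm_map, resnet_map):
    """Concatenate the three first-level feature maps into one fused vector."""
    return list(lenet_map) + list(lstm_map) + list(resnet_map)
```

The length of the fused vector is the sum of the three input lengths, so the second-level classifier sees every first-level feature without any averaging or voting.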
  4.5.5. SVM
In general, SVM [28] is based on statistical learning theory and is used to perform road detection by using the fused feature maps. With the radial basis kernel function, the optimization problem of the SVM is posed as a convex problem, which avoids local minima and yields a globally optimal classification during the road detection of CAVs. Equation (21) shows the kernel function of the SVM:
          where 
 is the radius, and 
 is additionally used in the next-level classification to detect the road. Here, the SVM is also used as a second-level classifier to predict a road by using the feature maps from the first-level classifiers. 
Figure 4 shows the architecture of the two-level fusion of the classifiers.
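As a minimal sketch of a radial basis kernel, the function below evaluates the commonly used Gaussian form. The exact parameterization of Equation (21) may differ; the name `sigma` for the kernel radius is an assumption.

```python
import math

def rbf_kernel(u, v, sigma=1.0):
    """Gaussian RBF kernel exp(-||u - v||^2 / (2 * sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))
```

The kernel equals 1 for identical fused feature vectors and decays toward 0 as they move apart, with `sigma` controlling how quickly the similarity falls off.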
Hence, the TLFC allows better detection of roads on the KITTI and CamVid datasets. The performance of the TLFC-RD is mainly improved by the following four strategies: (i) the cross fold process at the input and pre-processing using superpixel generation, which minimize the complexity of the classification; (ii) the extraction of appropriate features from the images, which supports precise classification; (iii) multi-classifier feature fusion; and (iv) the TLFC-based classification of images, which improves the classification based on the generated feature maps.