Semi-Supervised Drivable Road Segmentation with Expanded Feature Cross-Consistency

Abstract: Drivable road segmentation aims to sense the surrounding environment to keep vehicles within safe road boundaries, which is fundamental in Advanced Driver-Assistance Systems (ADASs). Existing deep learning-based supervised methods achieve good performance in this field with large amounts of human-labeled training data. However, collecting sufficient finely human-labeled data is extremely time-consuming and expensive. To fill this gap, in this paper we propose a general yet effective semi-supervised method for drivable road segmentation with lower labeled-data dependency, high accuracy, and high real-time performance. Specifically, a main encoder and a main decoder are trained in the supervised mode with labeled data, generating pseudo labels for unsupervised training. Then, we innovatively set up both auxiliary encoders and auxiliary decoders in our model that yield feature representations and predictions based on unlabeled data subjected to different elaborated perturbations. Both auxiliary encoders and decoders leverage the information in unlabeled data by enforcing consistency between the predictions of the main modules and the perturbed versions from the auxiliary modules. Experimental results on two public datasets (Cityscapes and CamVid) verify that, in the field of drivable road segmentation, our proposed algorithm can almost reach the same performance, at a high FPS, as a fully supervised method trained with 100% labeled data while utilizing only 40% labeled data. In addition, our semi-supervised algorithm has good potential to be generalized to all models with an encoder-decoder structure.


Introduction
Autonomous driving is a future-oriented technology with broad markets and development prospects. With the development of artificial intelligence and automation in the automotive field, active safety systems in vehicles make transportation more efficient and safe [1]. Nowadays, more and more vehicles are equipped with Advanced Driver-Assistance Systems (ADASs), and among the subtasks of ADASs, drivable area segmentation is one of the important issues in all situations. The goal of drivable road segmentation is to sense the surrounding environment, keeping vehicles within safe road boundaries and preventing potential accidents, such as collisions with pedestrians or other vehicles in the driver's blind spot [2]. Therefore, perceiving complex scenarios and discerning the drivable areas while driving is fundamental.
In the past, most studies focused on inaccurate road segmentation caused by the obstruction of pedestrians or other vehicles, or by camera imaging distortion due to underexposure or halo effects. In the beginning, traditional image processing methods, including edge-based, texture-classification-based, illuminant-invariance-based, and geometric-vanishing-point-based methods, were proposed [3]. In subsequent years, machine learning classifiers were applied to the field of drivable area segmentation, such as the SVM method [4]. However, both image processing and machine learning methods rely on experiential hand-crafted feature extraction, which leads to poor robustness in complex scenes. With Convolutional Neural Network (CNN)-based models becoming mainstream in segmentation, hand-crafted feature extraction is no longer needed, and more advanced solutions have been designed for drivable area segmentation. However, making model performance as accurate as possible requires a massive collection of finely annotated data for training, which is time-consuming and expensive. Methods such as deep transfer learning and self-supervision can effectively reduce the dependence on labeled data [5][6][7], but they cannot fully balance real-time performance and accuracy in drivable area segmentation tasks. In contrast, feature-perturbation-based semi-supervised methods have been proven effective in segmentation tasks [8]. Therefore, we propose a semi-supervised method that leverages unlabeled data for drivable area segmentation by expanding the scope of features and enforcing consistency between the perturbed expanded features and pseudo labels, which better overcomes the insufficiency of labeled data.
The model we designed consists of a main-encoder, a main-decoder, auxiliary encoders (aux-encoders), and auxiliary decoders (aux-decoders). In the following, the main-encoder and main-decoder together are referred to as the main modules, and the aux-encoders and aux-decoders together as the auxiliary modules. Different kinds of perturbation are combined with the auxiliary modules and used in semi-supervised training.
As for training, fully supervised training is performed first, where the main modules are trained on labeled data. Semi-supervised training follows, as shown in Figure 1. Predictions on original unlabeled data are first made by the main modules, generating pseudo labels corresponding to the predictions made by the auxiliary modules on perturbed data. Then, an unsupervised loss is designed to enforce the expanded feature cross-consistency between the perturbed predictions and the pseudo labels so that the model can leverage the information in unlabeled data and improve the accuracy of the main modules. We conduct experiments on the CamVid and Cityscapes datasets. Our semi-supervised method, utilizing only 40% labeled data, almost reaches the same Intersection-over-Union (IoU) values as fully supervised methods achieve with 100% labeled data. Compared with other semi-supervised methods, ours achieves significantly better IoU and FPS values. The major contributions of our work can be summarized as follows:

• To the best of our knowledge, our work is the first to introduce semi-supervised deep learning methods for drivable area segmentation. We propose a semi-supervised drivable area segmentation method based on expanded feature cross-consistency. The method effectively utilizes the information hidden in unlabeled data, achieving performance close to that of a fully supervised model trained on all labeled data while using only part of the labeled data.

• We innovatively design a series of encoder-level feature perturbations and verify their effectiveness in our semi-supervised method through a series of ablation experiments.

• We conduct a wide range of experiments on changing proto-segmentation models and comparing our semi-supervised method with others on road segmentation. The results show that our method has good generalizability and robustness in the field of drivable road segmentation.
The rest of this paper is organized as follows: previous work related to drivable road area segmentation and semi-supervised methods is reviewed in Section 2. Our proposed semi-supervised segmentation method is elaborated in Section 3. The experimental settings are provided in Section 4. The experimental results, including ablation experiments and comparisons with other semi-supervised methods, are specified in Section 5. The generalizability and extensibility of our semi-supervised method are discussed in Section 6. We conclude this work and delineate potential future directions in Section 7.

Related Work
In this section, works related to traditional and deep learning-based methods for drivable road area segmentation will be presented, as well as advanced semi-supervised methods that are able to compensate for the insufficient amount of training data in practical scenarios.

Traditional Drivable Road Area Segmentation
Drivable road areas usually differ from surrounding pixels and have unique local visual features. Therefore, based on image features, traditional image-processing methods for drivable road area segmentation can be divided into edge-based, texture-classification-based, illuminant-invariance-based, and geometric-vanishing-point-based methods [3]. First, edges are a commonly used visual feature for drivable road area segmentation. For example, He et al. [9] proposed a color-feature- and edge-image-based algorithm that obtains road boundaries and delimits the area complying with a Gaussian distribution, improving accuracy and reducing computational complexity. Second, texture-classification-based methods are employed for drivable road area segmentation. Graovac et al. [10] innovatively divided a road picture into distinguishable regions and subsequently calculated their texture differences based on statistical numerical features. Third, illuminant-invariance-based methods have also been designed for drivable road area segmentation. Alvarez et al. [11] proposed a novel method based on shadow-invariant features, which takes full advantage of RGB distribution information and camera direction information for road segmentation, achieving more robust and efficient results. Furthermore, geometric-vanishing-point-based methods are also applied in this field. In [12], texture directions were extracted using confidence-weighted Gabor filters and clustered to estimate the vanishing point, and road boundaries were then obtained through calculation.
Moreover, machine learning methods based on hand-crafted visual features have also been leveraged for drivable road area segmentation. These methods usually consist of three steps: feature extraction, image classification, and post-processing. For example, Zhou et al. [13] extracted both color features and texture features, and then an SVM classifier was employed for classification. In [4], structured SVMs were utilized for learning geometric features based on edges, color, and homography. In [14], Foedisch et al. used a simple neural network to achieve real-time road area segmentation. Some researchers use a composite of multiple machine learning models [15,16]; both achieved promising segmentation results.
However, though both the aforementioned traditional image-processing methods and machine learning methods may work in some simple scenarios, they are vulnerable to various environmental factors, such as lighting and occlusion. They tend to fail in knotty but more common real-life scenarios where road shadows, vehicle obscuration, and picture defects exist.

Fully Supervised Drivable Road Area Segmentation
With the development of deep learning, Convolutional Neural Networks (CNNs) have become feasible solutions for semantic segmentation [17][18][19][20][21], and some of them have been adapted for drivable road area segmentation. For example, Holder et al. used a deep CNN to segment drivable road areas, and the experimental results showed that it outperformed conventional SVM-classifier-based techniques [22]. A fully convolutional residual network was further implemented for drivable road area segmentation, illustrating that deeper networks can achieve better results. Subsequent researchers made structural improvements based on prototype models. For example, an up-convolutional network was proposed in [23], all-layer and stage-layer modules were designed in [24], a siamesed fully convolutional network (s-FCN-loc) was proposed in [25], and a reverse attention network was designed in [26]. Furthermore, instead of improving the structure of a single model, in [27] a CNN was combined with Long Short-Term Memory (LSTM) for better drivable road area segmentation performance. In addition, for some special tasks such as segmentation with a fisheye lens, Ref. [28] used ResNet101 v2 as a feature extraction module to achieve accurate segmentation of road surfaces. To improve the real-time performance of CNNs, Ref. [29] used an uncertainty-aware symmetric network based on asymmetric dilated convolution and validated it on embedded devices. Yolo-based models [30] also perform well in this field: YoloP [31] and YoloPv2 [32] achieve segmentation accurately and efficiently, and are able to perceive lanes and traffic objects at the same time. Other methods that allow for the segmentation of the drivable area in multi-task scenarios include DLT-Net [33], HybridNets [34], and GBIP-Net [35]. ULODNet achieves the segmentation of drivable areas by detecting lanes and obstacles on roads [36].
These studies have demonstrated that CNN-based models can achieve remarkable accuracy in drivable road area segmentation. However, a prerequisite for fully supervised algorithms to achieve good results is a sufficiently large amount of high-quality data. As the related works summarized in Table 1 show, even if a large number of images can be obtained relatively easily, producing fine labels is a time-consuming and expensive task, especially in drivable road area segmentation. Therefore, to overcome this limitation, we propose a semi-supervised drivable road area method, which can achieve satisfactory performance with few annotations.

Table 1. Literature review: related works on road area segmentation.

| Work | Category | Method |
|---|---|---|
| He et al. [9] | Image processing | A color-feature- and edge-image-based method. |
| Graovac et al. [10] | Image processing | A texture-classification-based method. |
| Alvarez et al. [11] | Image processing | An illuminant-invariance-based method. |
| Alvarez et al. [12] | Image processing | A Gabor-filter- and clustering-based vanishing-point method. |
| Zhou et al. [13] | Machine learning | A color-feature- and texture-feature-based SVM method. |
| Yao et al. [4] | Machine learning | A geometric-feature (edges, color, and homography)-based structured SVM method. |
| Foedisch et al. [14] | Machine learning | A color-feature-based neural network. |
| Crisman et al. [15] | Machine learning | An edge-based modified clustering method. |
| Yun et al. [16] | Machine learning | A boosting-, SVM-, and random-forest-classifier-based composite method. |
| Holder et al. [22] | Deep learning | A deep CNN-based model. |
| Oliveira et al. [23] | Deep learning | An up-convolutional network-based model. |
| Reis et al. [24] | Deep learning | An all-layer- and stage-layer-module-based model. |
| Wang et al. [25] | Deep learning | A siamese fully convolutional network-based model. |
| Sun et al. [26] | Deep learning | An improved SegNet with reverse attention. |
| Lyu et al. [27] | Deep learning | A CNN combined with an LSTM-based model. |
| Scheck et al. [28] | Deep learning | A ResNet101 v2-based model for fisheye lenses. |
| Gong et al. [29] | Deep learning | An asymmetric dilated CNN-based model. |
| Shao et al. [35] | Deep learning (multi-task) | GBIP-Net: a method focused on interest points, built on the SAMT framework. |
| Zhang et al. [36] | Deep learning (multi-task) | ULODNet: a ResNet- or DarkNet-backbone-based network. |

Semi-Supervised Semantic Segmentation
As deep learning becomes mainstream, methods that can balance low data annotation and high accuracy deserve attention, such as deep transfer learning [5], domain adaptation [37], self-supervised learning [6,7], and semi-supervised learning [8]. However, the most suitable scenarios for deep transfer learning tend to be large models with fine-tuning, which may conflict with the computationally limited on-board edge devices used in drivable area segmentation tasks. Although self-supervised learning guarantees low inference computation and does not even require labeled data, it demands larger data volumes and longer training to ensure model accuracy. In the field of autonomous driving, safety, and in this case accuracy, is prioritized. Thus, semi-supervised methods are considered a balanced choice.
Furthermore, semi-supervised methods in semantic segmentation can be divided into five categories: adversarial methods, consistency regularization, pseudo-labeling, contrastive learning, and hybrid methods [8]. Among them, the idea behind consistency regularization is that a model should produce consistent outputs for the same input under different perturbations. Based on this, CCT [38] and CPS [39] perturb intermediate feature maps and model weight parameters, respectively, expecting the outputs to be consistent with the unperturbed ones. CutMix [40], ClassMix [41], and VAT [42] perturb the input data, and have been incorporated into data augmentation in a wide range of fields with favorable results. Inspired by CCT, CPS, and input perturbation methods, we propose a semi-supervised drivable road segmentation method with expanded feature cross-consistency, which combines input perturbations and feature perturbations.

Methodology
We consider that, for similar inputs, the model should produce the same output; this is the theoretical basis for extracting hidden information from unlabeled data in our method. Different types of perturbations are approaches to artificially creating similar inputs, and loss functions are used to constrain the consistency of the outputs.
We innovatively set up a set of auxiliary encoders and a set of auxiliary decoders, and by cross-connecting them with the main encoder and decoder, we ensure that all main modules are involved in gradient updates during unsupervised training. If only one set of auxiliary modules is employed (for example, only auxiliary decoders), then only the main module to which it is cross-linked (the main encoder, in this case) receives gradient updates, while the main module corresponding to this auxiliary module (the main decoder, in this case) does not leverage the information in the unlabeled data. Perturbations are artificially designed to generate similar inputs, and they are introduced at different nodes of the model: the encoder and the decoder. This dual auxiliary module structure allows our semi-supervised method to be applied to nearly all models based on an encoder-decoder structure. Furthermore, since only the main modules are involved during inference, our method is less computationally demanding than others, making it suitable for drivable area segmentation.
More details on the overall algorithm, generation of perturbations, model structure, and loss will be elaborated on in this section.

Overall Algorithm
Figure 2 shows the panorama of our proposed algorithm, which consists of the main-encoder, the main-decoder, auxiliary encoders (aux-encoders), and auxiliary decoders (aux-decoders). In the training stage, fully supervised training is performed first. Limited labeled data $x_l$ with corresponding annotations $y_l$ are fed into the main-encoder and main-decoder to learn to predict semantic segmentation results in a traditional supervised manner. Then, the remaining unlabeled data $x_{ul}$ are used to train the main-encoder, main-decoder, aux-encoders, and aux-decoders by enforcing consistency between the pseudo labels $\hat{y}_{ul}$ and the predictions of the auxiliary modules, namely $\hat{y}^{e}_{ul}$ and $\hat{y}^{d}_{ul}$. Each auxiliary encoder takes as input a perturbed version of the input data, and each auxiliary decoder takes as input a perturbed version of the encoder's output. In this way, the representation learning of the main-encoder and main-decoder, and hence of the segmentation network, is further enhanced by the unlabeled data. In the inference stage, only the main-encoder and main-decoder are used to predict segmentation results, so the model is not bloated during inference.
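To make this flow concrete, the sketch below shows one training iteration in PyTorch. It is a minimal illustration, assuming `main_enc`, `main_dec`, and the module lists `aux_encs` and `aux_decs` are built elsewhere; detaching the pseudo labels and applying softmax before the MSE consistency term are implementation assumptions not spelled out in the text.

```python
import torch
import torch.nn.functional as F

def train_step(x_l, y_l, x_ul, main_enc, main_dec, aux_encs, aux_decs, w_u):
    # Stage 1: supervised cross-entropy loss on labeled data.
    loss_sup = F.cross_entropy(main_dec(main_enc(x_l)), y_l)

    # Stage 2: pseudo labels from the main modules on unlabeled data
    # (detached, so no gradient flows through the pseudo-label branch).
    with torch.no_grad():
        pseudo = torch.softmax(main_dec(main_enc(x_ul)), dim=1)

    # Stage 3: aux-encoders perturb the input; this term updates the main decoder.
    loss_enc = sum(
        F.mse_loss(torch.softmax(main_dec(enc(x_ul)), dim=1), pseudo)
        for enc in aux_encs
    )

    # Stage 4: aux-decoders perturb the encoder output; this term updates the main encoder.
    feats = main_enc(x_ul)
    loss_dec = sum(
        F.mse_loss(torch.softmax(dec(feats), dim=1), pseudo)
        for dec in aux_decs
    )

    loss_unsup = (loss_enc + loss_dec) / (len(aux_encs) + len(aux_decs))
    return loss_sup + w_u * loss_unsup
```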

Perturbations
As mentioned in Section 3.1, in the training stage, perturbations are applied both to the input unlabeled data and to the encoder's output. In our implementation, nine types of perturbations are used; code sketches of several of them follow the list below.

VAT Perturbations: These are used to push the data distribution to be isotropically smooth around each data point based on virtual adversarial training; the process can be regarded as adding a kind of noise $n_{adv}$ with the greatest adversarial impact, computed from the gradient. We apply them in both aux-encoders and aux-decoders, formulated as $\tilde{t} = t + n_{adv}$, where $t$ represents the input tensor, $\tilde{t}$ the perturbed tensor, and $n_{adv}$ the VAT perturbation.
Dropout Perturbations: These randomly choose positions with probability $p$ and zero the elements at those positions.
Feature Noise Perturbations: These first generate a noise tensor $N \sim U(-0.3, 0.3)$ and add it onto the input tensor, formulated as $\tilde{t} = (t \odot N) + t$, where $\odot$ represents element-wise multiplication.
Feature Drop Perturbations: These first sample a threshold $\gamma \sim U(0.7, 0.9)$ and create a mask $M = \mathbb{1}\{\bar{t} < \gamma\}$, where $\bar{t}$ is the channel-wise mean of $t$ normalized by its batch-level maximum. Finally, we obtain the perturbed tensor by element-wise multiplication, formulated as $\tilde{t} = M \odot t$.
Cutout Perturbations: These reduce the feature dependency on certain contiguous elements of the input tensor by randomly zeroing the values of a cropped area, guided by the predictions of the main modules $\hat{y}_{ul}$.
Masking Perturbations: These contain two zero-one mask matrices, a non-road mask $M_{nr}$ to confine background relationships and a road mask $M_r$ to limit the road area [43], where $M_{nr} = 1 - M_r$. Each mask performs preliminary screening by utilizing the predictions of the main modules $\hat{y}_{ul}$.
Salt Noise Perturbations: These simulate the effect of black-and-white noise in low-resolution pictures. Some positions are randomly set to the maximum value of $t$ and some to the minimum, with a sampling rate of 0.3, which is achieved through element-wise multiplication of a mask $M_s$ with $t$. This process is formulated as $\tilde{t} = M_s \odot t$.
Color Jittering Perturbations: These consist of three types of transformation: brightness $B$, contrast $C$, and saturation $S$. In our implementation, they are applied sequentially to perturb the input tensor, formulated as $\tilde{t} = f_{B,C,S}(t)$.
Lighting Perturbations: After transforming images to tensors, the eigenvalues and eigenvectors of all channels are concatenated together. Then, a matrix $L$ of the same size as $t$ is generated based on them. Finally, we obtain the perturbed tensor by element-wise addition: $\tilde{t} = L \oplus t$.
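As referenced above, here are minimal sketches of three of these perturbations (feature noise, feature drop, and salt noise), written directly from the formulas; the sampling ranges follow the text, while tensor shapes (N, C, H, W) and normalization details are assumptions of this sketch.

```python
import torch

def feature_noise(t: torch.Tensor) -> torch.Tensor:
    # N ~ U(-0.3, 0.3);  t' = (t ⊙ N) + t
    noise = torch.empty_like(t).uniform_(-0.3, 0.3)
    return t * noise + t

def feature_drop(t: torch.Tensor) -> torch.Tensor:
    # gamma ~ U(0.7, 0.9); drop the most-activated spatial regions
    gamma = torch.empty(1, device=t.device).uniform_(0.7, 0.9)
    attention = t.mean(dim=1, keepdim=True)                        # channel-wise mean
    attention = attention / attention.amax(dim=(2, 3), keepdim=True).clamp_min(1e-6)
    mask = (attention < gamma).float()                             # keep low-activation areas
    return t * mask                                                # t' = M ⊙ t

def salt_noise(t: torch.Tensor, rate: float = 0.3) -> torch.Tensor:
    # randomly push elements to the tensor's min/max, mimicking salt-and-pepper noise
    r = torch.rand_like(t)
    out = t.clone()
    out[r < rate / 2] = t.min()
    out[(r >= rate / 2) & (r < rate)] = t.max()
    return out
```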

Model Structure
• Main Encoder: The main-encoder is based on ResNet-50 [44] with dilated convolutions, followed by one PSP module [45] for additional enhancement in feature extraction. It is a widely used general backbone that has been proven to perform well in different segmentation tasks. The input of the main-encoder is unperturbed images, and it outputs feature maps as the input for the main-decoder. The feature maps concatenate both high-dimensional and low-dimensional features, extracted by residual layers composed of a series of bottleneck blocks, as shown in Figure 3.

• Main Decoder: Feature maps generated by the main-encoder are fed into the main-decoder to predict the semantic segmentation results of the drivable road area. To maximize the robustness of decoding both original and perturbed features from different encoders, after one Conv2d layer we employ only a simple 1 × 1 2D convolution and three pixel shuffle modules as the main-decoder, where a pixel shuffle module consists of three layers: Conv2d, ReLU, and PixelShuffle, as shown in Figure 4 (a minimal sketch follows this list).

• Auxiliary Encoders: The auxiliary encoders consist of several aux-encoders with different perturbations, including VAT, dropout, feature noise, salt noise, color jittering, and lighting perturbations, with more than one aux-encoder for each kind of perturbation. They are denoted as $En_{aux} = \{En^1_{aux}, \ldots, En^i_{aux}, \ldots, En^K_{aux}\}$, where $K$ is the total number.

• Auxiliary Decoders: The auxiliary decoders $De_{aux}$ are composed of aux-decoders with perturbations including VAT, dropout, feature noise, feature drop, cutout, and masking, which are not exactly the same as those used in the aux-encoders because of differences in properties during training in different modules. In the same way as the aux-encoders, they can be formulated as $De_{aux} = \{De^1_{aux}, \ldots, De^i_{aux}, \ldots, De^K_{aux}\}$.
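A minimal sketch of this decoder layout follows. The 1 × 1 convolution and the three Conv2d + ReLU + PixelShuffle modules follow the description above; the first 3 × 3 convolution's kernel, the channel widths, and the input channel count are illustrative assumptions.

```python
import torch.nn as nn

def pixel_shuffle_block(in_ch: int, out_ch: int, scale: int = 2) -> nn.Sequential:
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch * scale * scale, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.PixelShuffle(scale),  # rearranges (C*s^2, H, W) -> (C, H*s, W*s)
    )

class MainDecoder(nn.Module):
    def __init__(self, in_ch: int = 512, num_classes: int = 2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(in_ch, 256, kernel_size=3, padding=1),  # the initial Conv2d layer
            nn.Conv2d(256, 256, kernel_size=1),               # the simple 1x1 2D convolution
            pixel_shuffle_block(256, 128),                    # three pixel shuffle modules,
            pixel_shuffle_block(128, 64),                     # giving 8x upsampling overall
            pixel_shuffle_block(64, num_classes),
        )

    def forward(self, x):
        return self.head(x)
```

The three 2× pixel shuffle stages give 8× upsampling in total, which matches the output stride commonly used with a dilated ResNet-50 backbone.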

Loss Functions
The loss function consists of two parts: a supervised part and an unsupervised part, which are computed using cross-entropy and MSE, respectively.The specific calculation of each part is as follows.

Supervised Loss
The input data are in the form of $((x_l, y_l), (x_{ul}))$, where $x_l$ is labeled data, $y_l$ is the corresponding label of $x_l$, and $x_{ul}$ denotes unlabeled data.
Supervised loss is calculated in the first stage. It is consistent with normal fully supervised training, where only the main modules, namely the main-encoder $En_m$ and the main-decoder $De_m$, participate in the prediction $\hat{y}_l$, as shown in (1):

$\hat{y}_l = De_m(En_m(x_l))$ (1)
Cross-entropy (CE) loss is used for this part, measuring the similarity between $\hat{y}_l$ and the labels $y_l$, as shown in (2):

$L_{sup} = \mathrm{CE}(\hat{y}_l, y_l)$ (2)

Unsupervised Loss
The design idea of the unsupervised loss is to enable the model used at inference to exploit the information in unlabeled data. For this purpose, the unsupervised loss $L_{unsup}$ is designed to consist of two terms: the loss of the aux-encoders $L^{e}_{unsup}$ and the loss of the aux-decoders $L^{d}_{unsup}$.
We first need to generate pseudo labels $\hat{y}_{ul}$ corresponding to the input $x_{ul}$, as shown in (4):

$\hat{y}_{ul} = De_m(En_m(x_{ul}))$ (4)

On completion of that, a copy of $x_{ul}$ is sent to each aux-encoder $En^i_{aux}$, where one specific perturbation is applied to it, and the result is then transferred to the main-decoder $De_m$ to make the prediction $\hat{y}^{e,i}_{ul}$. This process can be formulated as follows:

$\hat{y}^{e,i}_{ul} = De_m(En^i_{aux}(x_{ul}))$

The loss of the aux-encoders is then the MSE between these predictions and the pseudo labels:

$L^{e}_{unsup} = \frac{1}{K}\sum_{i=1}^{K}\mathrm{MSE}(\hat{y}^{e,i}_{ul}, \hat{y}_{ul})$

As for the predictions of the aux-decoders $\hat{y}^{d,i}_{ul}$, they are obtained by processing $x_{ul}$ through the main-encoder $En_m$ and each aux-decoder $De^i_{aux}$, which is mathematically expressed as follows:

$\hat{y}^{d,i}_{ul} = De^{i}_{aux}(En_m(x_{ul}))$

After obtaining the predictions $\hat{y}^{d,i}_{ul}$ of the aux-decoders, the MSE loss is calculated as follows:

$L^{d}_{unsup} = \frac{1}{K}\sum_{i=1}^{K}\mathrm{MSE}(\hat{y}^{d,i}_{ul}, \hat{y}_{ul})$

The loss $L^{e}_{unsup}$ is back-propagated through the aux-encoders and the main-decoder, and $L^{d}_{unsup}$ is back-propagated through the aux-decoders and the main-encoder. Thus, both the main-encoder and the main-decoder, the only modules used at inference, are able to exploit the information in the unlabeled data.

Total Loss
At the beginning of training, the model, trained in a fully supervised way, has learned only a very small amount of information from the labeled data, on the basis of which noisy pseudo labels are generated for unsupervised training. Therefore, the weight of the unsupervised loss is set small at the beginning of training and increases as training continues.
The unsupervised weighting parameter $\omega_u$ implements the idea above, increasing from 0 to 1 as training progresses. We denote the batch id as $i$, the proportion of labeled data as $p$, and the total number of images in the training set as $D$; $\omega_u$ is then expressed as a ramp-up function of these three quantities. The total loss $L$ is formulated as follows:

$L = L_{sup} + \omega_u \cdot (L^{e}_{unsup} + L^{d}_{unsup})$
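Since the exact ramp-up expression did not survive extraction, the sketch below is one plausible linear realization consistent with the description (growing from 0 to 1 with batch id $i$, labeled proportion $p$, and dataset size $D$); the paper's actual schedule may differ.

```python
def unsup_weight(i: int, p: float, D: int) -> float:
    """Linear ramp of the unsupervised weight w_u from 0 to 1.

    i: global batch id, p: proportion of labeled data, D: training-set size.
    The ramp length p * D is an assumption, not the paper's exact formula.
    """
    return min(1.0, i / max(1.0, p * D))
```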

Experiment Settings
The experiment settings contain the detailed configuration of implementation, the settings of the datasets, and the calculations of performance metrics.

Implementation Details
All experiments run on PyTorch 1.13.0 with Python 3.8 on one Nvidia RTX 3090. All training is performed for 150 epochs with the SGD optimizer. The supervised and unsupervised learning rates are set to 0.01 and 0.001, respectively. In the SGD settings, weight decay is set to 0.0001, momentum to 0.9, and the other parameters are kept at their defaults. We use the Poly mode of the lr-scheduler with the power parameter set to 1.2; other parameters are left at their defaults.
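The snippet below reproduces these stated hyperparameters. The dummy modules and the iteration count are placeholders, and assigning 0.01 to the main modules and 0.001 to the auxiliary modules is our reading of "supervised and unsupervised learning rates", not a detail confirmed by the text.

```python
import torch
import torch.nn as nn

main_modules = nn.Sequential(nn.Conv2d(3, 8, 3), nn.Conv2d(8, 2, 1))  # placeholder
aux_modules = nn.Sequential(nn.Conv2d(3, 8, 3))                       # placeholder
num_iters = 150 * 100  # epochs x batches per epoch (illustrative)

optimizer = torch.optim.SGD(
    [
        {"params": main_modules.parameters(), "lr": 0.01},   # supervised lr
        {"params": aux_modules.parameters(), "lr": 0.001},   # unsupervised lr
    ],
    lr=0.01,
    momentum=0.9,
    weight_decay=0.0001,
)

# Poly decay with pow = 1.2: lr(t) = lr0 * (1 - t / num_iters) ** 1.2
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda t: max(0.0, 1.0 - t / num_iters) ** 1.2
)
```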

Datasets
Cityscapes: Cityscapes is a large-scale, multi-city dataset that supports different vision tasks such as semantic segmentation and instance segmentation. We use only the semantic segmentation part of Cityscapes, which contains a total of 2975 training images from 18 cities and 500 validation images from 3 cities. Every image in Cityscapes has a native size of 2048 × 1024 pixels, which is cropped to 513 × 513 pixels in our experiments. The original semantic segmentation dataset has 34 classes, which are redundant for road segmentation. Therefore, only the road class is retained, and the remaining classes are merged into one non-road class.
CamVid: CamVid, short for The Cambridge-driving Labeled Video Database, is a lightweight semantic segmentation dataset. Compared with the Cityscapes dataset, the images in CamVid have more complex roads, more vehicles, and smaller dimensions, making prediction relatively more difficult. The CamVid dataset contains 701 images, of which 367 are in the training set, 101 belong to the validation set, and 233 are used in the test set. Each image in CamVid is 480 × 360 pixels in size and is cropped to 360 × 360 pixels as the input. The CamVid dataset provides 32 ground-truth semantic labels, which are merged into 11 broad categories when used in a semantic segmentation task. In our experiments, only the road class is preserved, and the remaining classes are merged to produce the road and non-road class labels applicable to drivable area segmentation.
Whether the Cityscapes or the CamVid dataset is used, all of their data are labeled. Thus, for unsupervised training, a certain percentage of images is randomly selected from the original training set and treated as unlabeled data by ignoring the corresponding labels, while the rest of the data are kept intact for supervised training. Figure 5 illustrates this operation.
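A sketch of these two dataset operations follows: collapsing the labels into a binary road/non-road mask, and hiding the labels of a random subset of the training images. `ROAD_ID = 7` is Cityscapes' "road" label id (CamVid would use its own id), and the 40% split mirrors the paper's main setting; both are assumptions of this sketch rather than code from the paper.

```python
import random
import numpy as np

ROAD_ID = 7  # Cityscapes "road"; an assumption for this sketch

def to_road_mask(label: np.ndarray) -> np.ndarray:
    """Map a full semantic label map to {1: road, 0: non-road}."""
    return (label == ROAD_ID).astype(np.uint8)

def split_semi(image_ids, labeled_ratio=0.4, seed=0):
    """Randomly pick `labeled_ratio` of ids as labeled; the rest lose their labels."""
    ids = list(image_ids)
    random.Random(seed).shuffle(ids)
    n_labeled = int(labeled_ratio * len(ids))
    return ids[:n_labeled], ids[n_labeled:]  # (labeled, treated-as-unlabeled)
```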

Performance Metrics
Given an image for drivable road segmentation, the output of the model is divided into two classes: "Road" and "Others". We use five performance metrics in all experiments: accuracy, recall, precision, F1-Score, and Intersection over Union (IoU); FPS is additionally reported to measure real-time performance.

Accuracy
Accuracy measures the proportion of correctly predicted pixels among all pixels and is calculated with (11):

$\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$ (11)

where $TP$, $TN$, $FP$, and $FN$ denote the true-positive, true-negative, false-positive, and false-negative pixel counts for the road class, respectively.

Recall
The value of recall is the proportion of correctly predicted road-class pixels among all road-class pixels, which can be expressed as Equation (12):

$\mathrm{Recall} = \frac{TP}{TP + FN}$ (12)

Precision
Precision is the proportion of correctly predicted road-class pixels among all pixels predicted as road, which is calculated with (13):

$\mathrm{Precision} = \frac{TP}{TP + FP}$ (13)

F1-Score
The F1-Score measures the balance between precision and recall, which is calculated with (14):

$\mathrm{F1} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$ (14)
IoU

The IoU measures the overlap between the ground-truth road-class pixels and the pixels predicted as road, which is calculated with (15):

$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$ (15)
FPS

FPS is the ratio of the number of images to the total inference time, which measures the real-time performance of the model. Denoting the inference time for each batch of test images as $t_i$ and the number of images in the test dataloader as $N$, the FPS is calculated with (16):

$\mathrm{FPS} = \frac{N}{\sum_{i} t_i}$ (16)
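The sketch below computes these metrics from binary road predictions, following Equations (11) through (16); the small epsilon guard against empty denominators is our addition.

```python
import numpy as np

def pixel_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-9) -> dict:
    # pred, gt: arrays of {0, 1} where 1 marks road pixels
    tp = float(np.sum((pred == 1) & (gt == 1)))
    tn = float(np.sum((pred == 0) & (gt == 0)))
    fp = float(np.sum((pred == 1) & (gt == 0)))
    fn = float(np.sum((pred == 0) & (gt == 1)))
    precision = tp / (tp + fp + eps)
    recall = tp / (tp + fn + eps)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),          # Eq. (11)
        "recall": recall,                                            # Eq. (12)
        "precision": precision,                                      # Eq. (13)
        "f1": 2 * precision * recall / (precision + recall + eps),   # Eq. (14)
        "iou": tp / (tp + fp + fn + eps),                            # Eq. (15)
    }

def fps(num_images: int, batch_times) -> float:
    return num_images / sum(batch_times)  # Eq. (16)
```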

Ablation Studies
The purpose of this experiment is to study the performance of the different modules in the model and to demonstrate the improvement brought by our method. We conducted the following experiments on different arrangements of auxiliary modules:
For the rigor of the experiment, we ensured that each semi-supervised model had the same number of auxiliary modules. To be specific, the model with only aux-encoders uses two aux-encoders of each type, twelve in total. Likewise, two aux-decoders of each type were appended to the model with only aux-decoders, keeping the total consistent with the former. As for the model using both aux-encoders and aux-decoders, one aux-encoder and one aux-decoder of each perturbation type were employed (six aux-encoders and six aux-decoders in total). In the training process, 40% of the images were selected as labeled data, and the results are shown in Table 2. Table 2 shows that all semi-supervised methods outperform fully supervised training. Among the semi-supervised methods, the performance of the model with only aux-encoders and that of the model with only aux-decoders (the CCT-structure method) are similar, but the model using both aux-encoders and aux-decoders shows a significant IoU improvement on the two datasets, reaching 0.871 and 0.865, respectively. This illustrates that combining multiple auxiliary modules yields more accurate predictions across datasets.

Perturbations
The purpose of this experiment is to assess the effectiveness of each kind of perturbation used in the aux-encoders. The structure of the aux-decoders remained unchanged during these experiments. In each experiment, one set of aux-encoders consists of six aux-encoders with the same perturbation, and each kind of perturbation was tested in turn on the two datasets, both with 40% labeled data. Moreover, we additionally evaluated a model comprising all kinds of perturbations as a comparison, denoted as "All" in Tables 3 and 4. The results show that although the effect of each kind of perturbation fluctuates across datasets, every perturbation improves over the model with only aux-decoders. The "All" method outperforms the rest on both datasets, proving that all perturbations are effective and that using multiple perturbations at the same time further improves performance.

Proportions of Labeled Data
This part of the experiment is designed to analyze the improvement from increasing the proportion of labeled data and to find a ratio that balances data volume and accuracy. Specifically, the percentage of labeled data is gradually expanded, beginning with 5% and increasing to 10%, 20%, and finally 40%; each larger labeled subset includes the previous one, which minimizes the impact of data diversity caused by appending new labeled data. After setting the proportion of labeled data in the semi-supervised datasets, two experiments were conducted for each ratio: fully supervised training on the labeled subset of the semi-supervised datasets and semi-supervised training on the whole of the semi-supervised datasets. All models use the same number of aux-encoders and aux-decoders, and the baseline is the result of fully supervised training on the original datasets with 100% of the data labeled. Moreover, to investigate convergence during training on the two datasets, the loss curves with 40% labeled data are recorded in Figure 6.

Figure 7 displays line charts of the relationship between the proportion of labeled data and IoU on the two datasets, and Figure 8 shows the corresponding visualization of the predictions by the semi-supervised models on a given input. As seen in Figure 7, there is a significant gap between the IoU of fully supervised and semi-supervised methods when the proportion of labeled data is lower than 20%. With 40% labeled data, no significant reduction in IoU is found for semi-supervised training compared with the baseline. For example, on the CamVid dataset, the IoU difference between the 40%-labeled semi-supervised model and the fully supervised one is only 0.022. In Figure 6, it can be observed that on both datasets the models approximately converge after the 110th epoch. The loss fluctuates more on CamVid than on Cityscapes because the latter has more and larger images than the former.
In Figure 8, it is apparent that using 40% labeled data obtains nearly the same performance as the fully supervised method on both datasets, and both predictions are quite close to the ground truth, especially within the yellow dotted boxes.
In general, therefore, our models with only 40% labeled data can be used as an alternative to fully supervised models when the amount of labeled data is not sufficient.

Comparison with Other Semi-Supervised Methods
We compared our semi-supervised method with others in the field of drivable road area segmentation. The experiments were conducted with the same percentage of labeled data (40%) on the Cityscapes and CamVid datasets, as demonstrated in Tables 5 and 6, respectively. The results on the two datasets are visualized in Figures 9 and 10. As can be seen from the two tables, our method outperforms the others. On the CamVid dataset, our method achieves the best results. Moreover, on the Cityscapes dataset, the IoU values of our method are significantly better than those of the second-best method. This is because our method is based on feature consistency, and we expand the range of feature consistency to cover both the input level and the feature level, which is broader than the feature consistency used by CCT. Therefore, all modules in the inference backbone network of our method are able to utilize the information of unlabeled data, achieving better performance.
It is also notable that our method has the highest FPS, which means it is more suitable for real-world autonomous driving scenarios with real-time requirements. We therefore conclude that our proposed semi-supervised method achieves the best performance in both accuracy and real-time metrics in the field of drivable area segmentation.

Discussion
To verify the generalizability and extensibility of our semi-supervised method, we added auxiliary modules to different basic segmentation models in the same way as our semi-supervised method does, and enforced that the perturbed predictions are consistent with the originals. The classical semantic segmentation models selected include UNet [46], ENet [47], ERFNet [48], and DeepLabV3+ [49], and the experiments were conducted on the CamVid dataset. During semi-supervised training, the proportion of labeled data was set to 40%, indicated by "semi" in Table 7. Fully supervised experiments were performed on both 40% and 100% of the total data for comparison, represented by "full" below the corresponding percentages in the header of Table 7. The results are shown in Table 7 and Figure 11. As Table 7 shows, when only 40% of the labeled data are used for semi-supervised and fully supervised training, the results of the semi-supervised method improve significantly over the fully supervised one for all basic models. Compared with supervised training using all the data, there is still a gap between the semi-supervised and fully supervised methods, but the gap is marginal, especially for the DeepLabV3+ model. This indicates the generalizability and potential of our method. As more segmentation models are proposed, employing more advanced networks in our semi-supervised method could achieve better performance, which is one focus of future work.

Conclusions and Future Work
In summary, we proposed a novel semi-supervised method for drivable road segmentation. Our method achieves good performance by enforcing cross-consistency between the perturbed expanded features and pseudo labels so that the model can leverage the information in unlabeled data. Our method almost reaches the same accuracy and IoU values using only 40% labeled data as fully supervised methods do with 100% labeled data. Furthermore, the experimental results demonstrate that, compared to other semi-supervised methods, ours has better accuracy and real-time performance in the field of drivable area segmentation. Moreover, our method remains effective when employing other networks, which illustrates its generalizability.
In the future, we will improve our method by investigating new encoder-decoder-structured backbones that could reach the same IoU with less labeled data. If possible, we will deploy the model to an edge computing device such as the Nvidia TX2 and perform experiments on real-scenario and noisy data. In addition, experimenting with where perturbations are placed is an appealing idea for models with a non-encoder-decoder structure, which can broaden the scope of applications of our semi-supervised method. Moreover, the main modules should be kept as compact as possible to ensure real-time prediction. Finally, designing more efficient perturbations is also one of the focuses of future work.

Figure 1 .
Figure 1. The overall flow of semi-supervised training: the main modules first generate pseudo labels on original unlabeled data (black arrows). Then, the auxiliary modules make predictions on perturbed data, and the unsupervised loss measures the discrepancy between the pseudo labels and the perturbed predictions, shown as light blue arrows (predictions perturbed by auxiliary encoders) and orange arrows (predictions perturbed by auxiliary decoders).

Figure 2 .
Figure 2. The overall algorithm can be divided into four stages. Stage 1: supervised training, illustrated by black lines. Stage 2: the predictions on unlabeled data $\hat{y}_{ul}$ are used to enforce consistency, via the MSE loss, with the perturbed predictions produced in stages 3 and 4, shown as gray lines. Stage 3: unlabeled data are reconstructed by the aux-encoders, generating perturbed tensors that are then transferred to the main decoder for prediction, shown as light blue lines. Stage 4: the main encoder transforms the unlabeled data into feature maps and distributes them to the aux-decoders with different perturbations to make predictions.

Figure 3 .
Figure 3. The detailed structure of the main-encoder.

Figure 4 .
Figure 4. The detailed structure of the main decoder.

Figure 5 .
Figure 5. Randomly selecting new labeled data and adding them to the previously selected data.

Figure 6 .
Figure 6. The loss curves obtained on the Cityscapes and CamVid datasets show that the models approximately converge after the 110th epoch.

Figure 7 .
Figure 7. As the labeled data increase, the IoU values approach the baseline. With 40% labeled data, no significant reduction in IoU is found for semi-supervised training compared with the baseline.

Figure 8 .
Figure 8. Predictions of semi-supervised methods on the two datasets with different labeled-data percentages.

Figure 9 .
Figure 9. Examples of predictions on images from Cityscapes using our method and others (CycleGAN, AdvNet, and CCT).

Figure 10 .
Figure 10. Examples of predictions on images from CamVid using our method and others (CycleGAN, AdvNet, and CCT).


Table 2 .
Results on models with different auxiliary modules with 40% labeled data.

Table 3 .
Results on different perturbations of aux-encoders on the CamVid dataset with 40% labeled data.

Table 4 .
Results on different perturbations of aux-encoders on the Cityscapes dataset with 40% labeled data.

Table 5 .
Performance of different semi-supervised methods on Cityscapes.

Table 6 .
Performance of different semi-supervised methods on CamVid.

Table 7 .
Performance of different base segmentation models on CamVid.