Lidar–Camera Semi-Supervised Learning for Semantic Segmentation

In this work, we investigated two issues: (1) how the fusion of lidar and camera data can improve semantic segmentation performance compared with the individual sensor modalities in a supervised learning context; and (2) how fusion can also be leveraged for semi-supervised learning in order to further improve performance and to adapt to new domains without requiring any additional labelled data. A comparative study was carried out by experimentally evaluating networks trained in different setups on a range of scenarios, from sunny days to rainy night scenes. The networks were tested on challenging and less common scenarios in which cameras or lidars individually would not provide a reliable prediction. Our results suggest that semi-supervised learning and fusion techniques increase the overall performance of the network in challenging scenarios while using fewer data annotations.


Introduction
Nowadays, data are considered a valuable asset and attract massive investments. However, how much data should autonomous vehicles collect to generate a reasonable driving model? Currently, Waymo has a fleet of around 55 vehicles driving over 1 million kilometres per year, roughly corresponding to 30,000 h. This is roughly the number of hours that one taxi driver spends driving over an entire working life. The collected data cover most of the common scenarios and a range of illumination and weather conditions, but they are still not enough to allow completely safe driving [1]. An autonomous road vehicle is expected to encounter a large variety of environmental conditions, which may be difficult to take fully into account during the development of its perception modules. Furthermore, specific situations may occur only rarely and, for this reason, be hard to capture in a dataset. Increasing the size of the dataset increases, in turn, the probability of encountering rare events; however, it does not guarantee that they are given appropriate weight. Single events may be treated as outliers and, for this reason, the overall network may perform poorly in such situations.
In this paper, we partitioned the datasets used for training and testing our networks by illumination condition (day and night) and by weather condition (sunny and rain). We demonstrate that data fusion techniques and semi-supervised learning may help in segmenting objects in such conditions, and show how the availability of large quantities of data, including non-annotated scenes, can improve the performance of AI-based algorithms. It is important to point out that these categories may be strongly unbalanced, as sunny days may be better represented in the dataset than rainy nights. For instance, the task of segmenting vehicles or people presents intrinsic difficulties because all classes of objects are very well represented in the dataset during the day, whereas some, such as people on rainy nights, are rarer (for obvious reasons) but remain important to detect with high confidence.
Our paper thus contributes to the body of knowledge in this field by investigating how sensor fusion and semi-supervised learning can be used to increase the overall performance of the network, with a particular emphasis on uncommon events and challenging scenarios. The objective was to carry out a fair evaluation of the techniques, also using cross-validation, and of how they can be used to improve machine learning in autonomous driving. The study investigated and integrated two lines of research: the use of an individual sensor setup versus sensor fusion, and how to use semi-supervised learning to improve overall network performance using unlabelled data coming from one of the sensors, in this case the RGB camera or the lidar. It is not in the scope of the present paper to surpass the current benchmarks in object segmentation, but rather to show readers how fusion and semi-supervised learning can be used to improve the performance of AI algorithms. Nevertheless, the results suggest a significant improvement in segmentation capability in night and rainy conditions, ranging from 10 to 30 percentage points.
To achieve this goal, we trained different models in a supervised fashion, with data fusion and with semi-supervised learning. Supervised learning was used to train both a baseline model and an upperbound model used for comparison. The upperbound model is expected to be the best performing one, as it benefits from full knowledge of the dataset, including all data annotations. However, acquiring real-world scenarios with full data annotation is not always achievable, especially in tasks such as autonomous driving. Thus, this paper shows that semi-supervised learning and co-training achieve comparable performance (a difference of about 2-3 percentage points) using fewer data annotations. An additional point of discussion is the cross-validation on different data splits. In this study, we trained models on 10 different train-validation-test splits to show the variance in the test results.
This paper is organized as follows: Section 2 introduces the reader to the topic, offering a review of the state of the art, including recent studies about sensor fusion and semi-supervised learning. The materials and datasets used for this study are thoroughly explained in Section 3, including the Waymo dataset used for this research. Section 4 addresses our method for building the neural networks, training, validation and testing modalities. Finally, our results are reported in Section 5 including a comparison with our previous method, and a discussion of our main findings.

Related Work
Most of the state-of-the-art methods for autonomous driving involve data-driven techniques at various levels, among which deep neural networks have shown promise in solving scene interpretation problems. With the aim of improving scene interpretation, this paper focuses on the intersection of two problems: sensor fusion and semi-supervised learning. Both topics have been extensively explored in the literature.
Semi-supervised learning is a widely explored idea for exploiting the availability of large unlabelled datasets to train various types of neural networks. In a recent review [2], Van Engelen et al. explored the topic from a broad, non-task-specific perspective. Although semi-supervised learning can be applied in different ways and to several sources of information, image classification and semantic segmentation are historically its most common applications. Ouali et al. [3] used labelled data to train the main encoder-decoder-based network for semantic segmentation using the PASCAL VOC dataset [4]. The unlabelled data were used in a second stage to train the same network with the addition of auxiliary decoders, and to perform a consistency check between the main decoder and the auxiliary decoders. Semi-supervised learning was also used in [5] to build a network for semantic segmentation where strong pixel-level annotation is only available for part of the dataset, and weak (image-level) annotation is available for the remainder. The semantic segmentation generated on the weakly annotated images was used to train the overall network. In this case too, the examples were taken from the PASCAL VOC dataset. The main driver for semi-supervised learning is to reduce the cost of labelling [6], which is time-consuming and labour-intensive work.
Focusing on autonomous driving, [7] offered a review of methods and datasets, identifying improvements in labelling efficiency, transfer learning and semi-supervised learning, among others, as open research questions, with the aim of enabling lifelong learning by updating networks through continual data collection instead of re-training from scratch. One example application is provided in [8], which proposes a semi-supervised learning method that uses labelled and unlabelled camera images to improve traffic sign recognition. Semi-supervised learning is also used in [9], where Zhu et al. defined a teacher model that is trained in a supervised manner using labelled camera images. The teacher was then used to generate labels on an unlabelled dataset, which was in turn used to train a student model. The authors showed that the student model outperformed the teacher model using data from the Cityscapes [10], CamVid [11] and KITTI [12] datasets.
As autonomous vehicles are nowadays equipped with different sensors, 3D lidar data are also used for semantic segmentation. A review [13] explores the available datasets and emphasises the importance of large quantities of labelled 3D lidar data, which are expensive to label manually but strongly needed for autonomous driving. A method for semi-supervised learning using 3D lidar data is described in [14], in which a set of manually labelled data and pairwise constraints are used to improve performance.
In addition to the many techniques for extracting relevant information from camera or lidar data individually, data fusion is a growing trend that integrates the information coming from both sensors so that each improves the other's segmentation performance. In [15], the authors offer a review of different deep-learning methods for sensor fusion in autonomous driving perception, focusing on fusion as a means of solving visual odometry, segmentation, detection and mapping problems, and pointing out in their conclusions how adverse weather can affect overall performance. A focused review of sensor performance under adverse weather conditions can be found in [16], in which the authors describe in more detail the individual strengths and limitations of sensors in the automotive field, providing a comprehensive list of data-driven methods and an open dataset. The literature is also rich in approaches to sensor fusion that use classical stochastic inference instead of neural networks. For instance, in [17], the author generalizes the approach in [18] with the objective of obtaining high-quality fused values from multiple sources of probabilistic distributions, where quality is related to the lack of uncertainty in the fused value and to the use of credible sources. Along a different line of research, the authors of [19] addressed the integration of sensor fusion and data analysis with emerging technologies and described several classic methods for sensor fusion, such as Kalman filtering and Bayesian inference. The strengths of these methods reside in their simplicity and in the high level of control they offer over the design process, with the drawback of low flexibility and adaptability. In contrast, convolutional neural networks (CNNs) have demonstrated high flexibility and adaptability to input variations, with the drawback of losing control over the design process, as CNNs are essentially black boxes.
Among the many dedicated lidar-camera fusion techniques that can be found in the literature, a relevant example is described in [20], where Li et al. defined the so-called "BiFNet", a bidirectional network for road segmentation that uses the camera image and the lidar bird's-eye view. In [21], a lidar-camera cross-fusion technique was presented, showing improved performance of fusion over an individual sensor on the KITTI dataset; it was later extended with a co-training method that included labelled and unlabelled examples [22].

Materials
In this work, we used the Waymo open dataset [23]. This section first presents a general overview of the dataset. Afterwards, we describe the procedure that was used for converting unordered point clouds into images, and for generating semantic masks from 3D bounding boxes.

Waymo Dataset
The Waymo open dataset includes 1110 driving sequences recorded with multiple cameras and lidars across a large variety of locations, road types, and weather and lighting conditions. Each driving sequence consists of a 20 s recording sampled at 10 Hz. Both 2D and 3D bounding boxes were manually generated for all frames, considering the following four categories of objects: vehicles, pedestrians, cyclists and traffic signs. Additionally, the driving sequences were partitioned into four broad subsets, namely day-fair, night-fair, day-rain and night-rain (see Table 1 for further details). The labels day and night indicate whether a sequence was collected during the day under good lighting conditions, or late in the day or at night under poor external illumination. The labels fair and rain instead refer to the weather conditions, with fair denoting good weather, and rain denoting active rain or a wet environment following recent precipitation.

Point Cloud Projection
The literature is rich in approaches to processing point clouds with deep neural networks; see, for example, [24,25]. In this work, the lidar point cloud is simply projected onto the camera plane in order to generate a three-channel tensor with the same width and height as the RGB image, such that each channel encodes one of the 3D spatial coordinates [21]. By doing so, it is straightforward to establish a one-to-one correspondence between the colour information contained in the RGB image and the spatial information contained in the point cloud. A point cloud acquired with a Velodyne HDL-64E consists of approximately 100,000 points, where each point $p$ is specified by its spatial coordinates in the lidar coordinate system, that is, $p = [x, y, z, 1]^T$. Given the lidar-camera transformation matrix $T$, the rectification matrix $R$ and the camera projection matrix $P$, it is possible to calculate the column position $u$, the row position $v$ and the scaling factor $\alpha$ at which the projection of $p$ intersects the camera plane by solving the expression $\alpha [u, v, 1]^T = P R T p$. This procedure is applied to every point in the point cloud, discarding points for which $\alpha < 0$ or for which $[u, v]$ falls outside the image. Using the above procedure, three images denoted as $X$, $Y$ and $Z$ are generated, where each pixel contains the $x$, $y$ and $z$ coordinates, respectively, of the 3D point that was projected onto it.
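The projection described above can be illustrated with a minimal sketch. It is not the implementation used in this work: the function names, the nested-list representation of matrices and images, the use of `None` for pixels without a lidar return, and the rule that the last point projected onto a pixel overwrites earlier ones are all assumptions made for illustration. Points with $\alpha = 0$ are also discarded here to avoid a division by zero.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def project_point_cloud(points, P, R, T, width, height):
    """Project lidar points into three coordinate images X, Y and Z.

    points: iterable of (x, y, z) tuples in the lidar frame.
    P: 3x4 camera projection matrix; R and T: 4x4 rectification and
    lidar-to-camera matrices. Pixels without a lidar return stay None.
    """
    M = matmul(matmul(P, R), T)  # combined 3x4 matrix P R T
    X = [[None] * width for _ in range(height)]
    Y = [[None] * width for _ in range(height)]
    Z = [[None] * width for _ in range(height)]
    for x, y, z in points:
        p = (x, y, z, 1.0)  # homogeneous coordinates
        su, sv, alpha = (sum(M[r][c] * p[c] for c in range(4)) for r in range(3))
        if alpha <= 0:  # point behind (or on) the camera plane
            continue
        u = math.floor(su / alpha)  # column position
        v = math.floor(sv / alpha)  # row position
        if 0 <= u < width and 0 <= v < height:
            X[v][u], Y[v][u], Z[v][u] = x, y, z
    return X, Y, Z
```

With an identity rectification and transformation and a unit-focal-length projection matrix, the point (2, 4, 2) has $\alpha = 2$ and lands at column 1, row 2.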

Sparse Semantic Masks from 3D Bounding Boxes
As mentioned in Section 3.1, the annotations provided for the Waymo dataset are 2D and 3D bounding boxes. Here, however, we are interested in carrying out semantic segmentation; for this reason, the 3D bounding boxes are converted into semantic masks. This can be easily achieved by using a procedure analogous to the point cloud projection described in Section 3.2. More specifically, given a 3D bounding box, we collect all the lidar points that fall within it, and then project them onto the image plane. Each projected point is drawn in the semantic mask as a disk with the same class as the bounding box. This procedure is repeated for all the 3D bounding boxes found in a given frame. In this work, only the vehicle class is considered, as it is the most represented and the most evenly distributed across the coarse categories described in Section 3.1. Some examples of semantic masks obtained using this method are shown in Figure 1. A limitation of this procedure is that only regions of an image where there are lidar detections can be assigned to a valid class. All remaining pixels are assigned to a do-not-care class that is ignored during training. As illustrated in the bottom three rows of Figure 1, poor illumination and rainy weather might affect the quality of the sensor data, which could have detrimental effects on downstream applications.
Figure 1. The left column shows four RGB images captured under various lighting and weather conditions, whereas the right column contains the corresponding semantic masks obtained using the procedure described in Section 3.3. Red pixels denote the vehicle class, while grey pixels denote the background (i.e., negative class). All other pixels in the semantic mask are ignored during training.
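The mask-generation procedure above can be sketched as follows. This is only an illustration under several assumptions that do not come from the text: the class identifiers, the disk radius, the use of axis-aligned boxes (real 3D bounding boxes also carry a heading angle), and the choice to draw background disks first so that vehicle disks are not overwritten. The projected pixel of each point is assumed to have been computed beforehand.

```python
IGNORE, BACKGROUND, VEHICLE = 255, 0, 1  # hypothetical class ids


def in_box(point, box):
    """Axis-aligned test; box = (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return all(box[i] <= point[i] <= box[i + 3] for i in range(3))


def draw_disk(mask, u, v, radius, label):
    """Set every pixel within `radius` of column u, row v to `label`."""
    height, width = len(mask), len(mask[0])
    for row in range(max(0, v - radius), min(height, v + radius + 1)):
        for col in range(max(0, u - radius), min(width, u + radius + 1)):
            if (row - v) ** 2 + (col - u) ** 2 <= radius ** 2:
                mask[row][col] = label


def sparse_mask(points, pixels, boxes, width, height, radius=1):
    """Build a sparse semantic mask from lidar points and 3D boxes.

    points[i] is a 3D lidar point and pixels[i] its (u, v) projection,
    or None if the point does not project into the image.
    """
    # Pixels without any lidar detection keep the do-not-care class.
    mask = [[IGNORE] * width for _ in range(height)]
    # Background is drawn first so that vehicle disks take precedence.
    for label in (BACKGROUND, VEHICLE):
        for point, pixel in zip(points, pixels):
            if pixel is None:
                continue
            inside = any(in_box(point, box) for box in boxes)
            if inside == (label == VEHICLE):
                draw_disk(mask, pixel[0], pixel[1], radius, label)
    return mask
```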

Method
This section describes the methodology proposed to provide a quantitative evaluation and a comparison of semi-supervised learning techniques against a baseline network and sensor fusion methodologies, including the models' design, data splits used for the cross validation and the training procedures used in this paper.

Model
The base network architecture used in this work is the well-known FCN-ResNet50 [26]. This CNN contains five stages, denoted as S1-S5, where each stage consists of several layers (e.g., convolutional, batch normalization, max-pooling, etc.). The proposed model, shown in Figure 2, contains three subnetworks, namely RGB, lidar and fusion. The RGB and lidar subnetworks have the same structure as the base FCN-ResNet50 and, as their names indicate, receive camera images and lidar images as input, respectively. The fusion subnetwork instead processes the concatenated features of the single modality branches after stage 4. This can be described as a late fusion strategy [21]. As illustrated in Figure 2, the fusion subnetwork shares some of its stages (S1-S4) with the single modality subnetworks. All networks' weights were initialized with an FCN-ResNet50 pretrained on a subset of COCO train2017.
Figure 2. Illustration of the model used in this work. The model can be decomposed into three subnetworks: RGB; lidar; and fusion. The fusion subnetwork receives as input the concatenated features generated by stages S1-S4 of the single modality subnetworks. This corresponds to a late fusion strategy. During a supervised learning step, the output of each subnetwork is used to compute a loss term with respect to a manually generated ground truth. During a semi-supervised step, the fusion subnetwork acts as teacher for the single modality ones. That is, the fusion subnetwork's output is used as ground truth for computing the losses of the RGB and lidar subnetworks. For co-training, only the single modality subnetworks are considered. In this case, the output relative to one sensor is used as ground truth for the other.

Dataset Splits
As mentioned in Section 3.1, in this work, we used the Waymo dataset, which consists of 1110 driving sequences collected under various weather and lighting conditions. Each sequence originally contains approximately 200 frames (i.e., 20 s sampled at 10 Hz), of which we kept every 10th frame. Out of the full dataset, for the experiments described in the following sections, we generated $N = 10$ random dataset splits $S_i = \{T_i, V_i, U_i, K_i\}$, with $i = 1, \ldots, N$. The randomization was carried out by considering driving sequences instead of individual frames in order to avoid any overlap between the subsets in any given $S_i$. Subset $T_i$ denotes a training set, which is a collection of annotated examples used for computing the loss function and updating the model parameters. Subset $V_i$ denotes a validation set, which is used to carry out early stopping and hyper-parameter tuning. Subset $U_i$ denotes a collection of examples for semi-supervised learning, that is, examples that are assumed to be unlabelled and for which the models should generate proxy labels. Lastly, each $K_i$ denotes a test set used for evaluating the generalization performance of the model on unseen data. With the exception of the training sets $T_i$, which only contain examples belonging to the coarse category day-fair, all other sets include examples belonging to all coarse categories. By only considering training data belonging to one category, it is possible to investigate the effectiveness of the proposed approach for carrying out domain adaptation. Table 2 provides more details regarding the dataset splits.
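A sequence-level split of this kind can be sketched as follows. The subset fractions, the function names and the rule used to keep the training set restricted to day-fair sequences (sequences of other categories drawn for training are moved to the unlabelled set) are assumptions made for illustration; the actual subset sizes are those reported in Table 2.

```python
import random


def make_split(sequences, seed, fractions=(0.25, 0.25, 0.25)):
    """Randomly partition driving sequences into T, V, U and K.

    sequences: dict mapping a sequence id to its coarse category,
    e.g. "day-fair". Splitting by sequence, not by frame, keeps all
    frames of one recording in the same subset.
    fractions: hypothetical shares of T, V and U; K takes the rest.
    """
    rng = random.Random(seed)
    ids = sorted(sequences)
    rng.shuffle(ids)
    n = len(ids)
    a = int(fractions[0] * n)
    b = a + int(fractions[1] * n)
    c = b + int(fractions[2] * n)
    split = {"T": ids[:a], "V": ids[a:b], "U": ids[b:c], "K": ids[c:]}
    # Assumption: the training set keeps only day-fair sequences and
    # the others are moved to the unlabelled set.
    moved = [s for s in split["T"] if sequences[s] != "day-fair"]
    split["T"] = [s for s in split["T"] if sequences[s] == "day-fair"]
    split["U"] = split["U"] + moved
    return split


def frames_of(sequence_id, n_frames=200, step=10):
    """Keep every 10th frame of a 200-frame sequence."""
    return [(sequence_id, f) for f in range(0, n_frames, step)]
```

Calling `make_split` with ten different seeds would then yield ten splits, each with non-overlapping subsets.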

Training
The training procedure was designed to iterate two phases for any given dataset split $S_i$.
Phase 1: The model is trained in a purely supervised fashion with training set $T_i$ and validation set $V_i$. The intersection over union (IoU) is computed on the validation set at the end of each epoch for all the subnetworks. If the IoU of any of the subnetworks has improved, a copy of the complete model is stored. The IoU index is commonly used for segmentation tasks in deep learning [27], and it is calculated as follows:
$$\mathrm{IoU} = \frac{|A_T \cap A_P|}{|A_T \cup A_P|}, \qquad (1)$$
where $A_T$ is the ground truth semantic area of a specific object, and $A_P$ is the area of the object predicted by the network. This index is maximized when the union is equal to the intersection.
Phase 2: The model containing the best performing fusion subnetwork, denoted as $\psi_i$, is used as the starting point for training four different models:
2. A model trained using a semi-supervised approach called co-training [22];
3. A model trained using a newly introduced sensor fusion-based semi-supervised approach;
4. A supervised-learning upperbound model.
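As an illustration of the IoU index used for model selection in Phase 1, the following minimal sketch computes it for a single class on a pair of masks, skipping pixels of the do-not-care class. The class identifiers and the convention of returning NaN when the union is empty are assumptions, not details taken from this work.

```python
def iou(pred, target, positive=1, ignore=255):
    """Intersection over union of the `positive` class.

    pred and target are 2D nested lists of class ids; pixels whose
    target equals `ignore` (do-not-care) are excluded from both counts.
    """
    intersection = union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if t == ignore:
                continue
            p_pos, t_pos = p == positive, t == positive
            intersection += p_pos and t_pos
            union += p_pos or t_pos
    # Assumption: the index is undefined (NaN) when the class is absent
    # from both the prediction and the ground truth.
    return intersection / union if union else float("nan")
```

For example, a prediction that covers all three ground truth pixels of an object plus one extra pixel gives an IoU of 3/4.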
For both phases, we used the cross-entropy loss function and Adam optimization [28]. The batch size was set to 32. The learning rate was decayed using the poly learning rate policy [29], implemented as
$$\eta_j = \eta_0 \left(1 - \frac{j}{N}\right)^{\alpha}, \qquad (2)$$
where $j$ denotes the current epoch number, $\eta_0$ is the starting learning rate, $N$ represents the total number of training epochs and $\alpha = 0.9$. In the first phase, we set $\eta_0 = 0.0003$ and $N = 50$, whereas in the second phase, we set $\eta_0 = 0.0001$ and $N = 100$. The values indicated above were empirically determined after a hyper-parameter search and fixed for all the training runs, though the full optimization of these parameters is beyond the scope of the paper. The images were first downsampled from 1920 × 1280 to 480 × 320 pixels. Afterwards, the top half region, usually containing buildings, vegetation and the sky, was discarded. The input images thus had a size of 240 × 320 pixels. Data augmentation consisted of three operations, specifically random square cropping, random rotations in the range [−20°, 20°] about the centre of the images, and random colour jittering (brightness, contrast, saturation and hue) applied to the RGB images.
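The poly learning rate policy can be written in a few lines. The sketch below assumes that epochs are counted from zero, so that the first epoch uses the starting learning rate; the function and argument names are illustrative.

```python
def poly_lr(epoch, base_lr, total_epochs, power=0.9):
    """Poly policy: base_lr * (1 - epoch / total_epochs) ** power."""
    return base_lr * (1.0 - epoch / total_epochs) ** power


# Schedules for the two phases, using the values stated in the text.
phase1 = [poly_lr(j, 0.0003, 50) for j in range(50)]
phase2 = [poly_lr(j, 0.0001, 100) for j in range(100)]
```

Both schedules start at their respective starting learning rate and decrease monotonically towards zero as the epoch number approaches the total number of epochs.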

Supervised Learning Baseline
Given a model $\psi$, the three subnetworks are denoted as $\psi_\beta \equiv \psi_\beta(\theta_\beta)$, where $\beta \in \{\mathrm{rgb}, \mathrm{lid}, \mathrm{fus}\}$ and $\theta_\beta$ represents the subnetwork's weights. For ease of notation, in the following discussion, we omit the weight vector from the equations. As mentioned in Section 4.1, some of the weights are shared between the subnetworks, therefore $\theta_{\mathrm{lid},\mathrm{rgb}} \cap \theta_{\mathrm{fus}} \neq \emptyset$. An input example has two views, lidar and camera, and is denoted as $x = \{x_{\mathrm{rgb}}, x_{\mathrm{lid}}\}$. The corresponding ground truth semantic mask is denoted as $y$. The cross entropy between a subnetwork's prediction and the ground truth is represented as $H(\psi_\beta(x), y)$. In this expression, we assume that each subnetwork extracts the appropriate input view, that is, $\psi_\beta(x) \equiv \psi_\beta(x_\beta)$. The total loss function used for supervised learning is then given by the following expression:
$$\mathcal{L}_{\mathrm{sup}} = \sum_{\beta \in \{\mathrm{rgb}, \mathrm{lid}, \mathrm{fus}\}} H(\psi_\beta(x), y). \qquad (3)$$
As indicated in Phase 1 of Section 4.3, the baseline model is trained in a supervised fashion using the loss function indicated in Equation (3). The backbone of our networks is always a pretrained FCN-ResNet50.
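A minimal sketch of a supervised loss of this kind is given below. It assumes an unweighted sum of the three cross-entropy terms, a mean over the valid pixels, and a do-not-care class id of 255; none of these details is confirmed by the text, and the function names are illustrative.

```python
import math


def cross_entropy(probs, labels, ignore=255):
    """Mean negative log-likelihood over pixels not labelled `ignore`.

    probs: one list of class probabilities per pixel.
    labels: one ground truth class id per pixel.
    """
    losses = [-math.log(p[y]) for p, y in zip(probs, labels) if y != ignore]
    return sum(losses) / len(losses) if losses else 0.0


def supervised_loss(outputs, labels):
    """Sum of the cross-entropy terms of the three subnetworks.

    outputs: dict with keys "rgb", "lid" and "fus", each holding the
    per-pixel class probabilities predicted by that subnetwork.
    """
    return sum(cross_entropy(outputs[beta], labels)
               for beta in ("rgb", "lid", "fus"))
```

Because the do-not-care pixels are filtered out before averaging, only pixels with a lidar detection contribute to the loss, in line with the sparse masks described in Section 3.3.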

Co-Training
Co-training is a semi-supervised learning algorithm that can be applied to problems where the instance space can be partitioned into two independent views. The instance space is an abstraction of the input space associated with a classification problem, whereas the views contain the actual data that will be consumed by the classifiers. The predictions obtained in one view can be employed as labels in the other view, with the final goal of boosting performance. This approach has two strengths: (1) reduced manual annotation effort; and (2) increased knowledge gained from the independent views. In this work, following a previous line of research described in [22], the instance space consists of urban driving scenes, and the views are provided in the form of RGB images and lidar point clouds. The predictions of a lidar-based semantic segmentation network, which is generally less affected by environmental illumination (day or night light), could be exploited by a camera-based network in order to learn more discriminative features in challenging conditions such as rainy days or night light.
Let us consider a teacher network $\psi_t$ and a student network $\psi_s$, parametrized by weights $\theta_t$ and $\theta_s$, respectively, and an example $(x_t, x_s, y) \in T_i$ consisting of the teacher view $x_t$, the student view $x_s$ and the ground truth labels $y$, drawn from the training set $T_i$ of a given data split. The fundamental building block of the loss function is the well-known cross-entropy loss, denoted as $H$, between the student prediction and the ground truth label. The student loss is calculated using the supervised loss in Equation (3) for each data view (RGB, lidar). For an unlabelled example in $U_i$, the student's prediction and the teacher's prediction both represent probability distributions over the possible classes, with the teacher's prediction being treated as the ground truth for training the student model. Since the Kullback-Leibler divergence [30], denoted as $D_{KL}$, is a measure of the difference between probability distributions, the co-training loss is implemented as follows:
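To illustrate how a Kullback-Leibler term can let a teacher prediction supervise a student on unlabelled pixels, a minimal sketch is given below. It is not the loss used in this work: the direction of the divergence (teacher relative to student), the averaging over pixels, the small constant `eps` and the function names are all assumptions.

```python
import math


def kl_divergence(teacher, student, eps=1e-12):
    """D_KL(teacher || student) for two discrete class distributions."""
    return sum(t * math.log((t + eps) / (s + eps))
               for t, s in zip(teacher, student) if t > 0)


def proxy_label_loss(teacher_probs, student_probs):
    """Mean per-pixel divergence, with the teacher as fixed target.

    teacher_probs and student_probs hold one class distribution per
    pixel, e.g. lidar predictions teaching the camera subnetwork.
    """
    terms = [kl_divergence(t, s)
             for t, s in zip(teacher_probs, student_probs)]
    return sum(terms) / len(terms)
```

The term vanishes when the student reproduces the teacher's distribution exactly and grows as the two predictions diverge, so minimizing it pulls the student towards the teacher on data for which no manual labels exist.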

Fusion Proxy Labels
Typically, in co-training, the proxy labels are obtained using the single modality subnetworks. However, it has been shown that a model generated using data fusion typically achieves higher performance [21]. For this reason, the proxy labels obtained by the fusion subnetwork are expected to be more reliable. The agreement between the fusion subnetwork and the lidar and camera subnetworks is measured for each single modality subnetwork using $D_{KL}$ as follows:

Results and Discussion
The model detailed in Section 4.1 was trained on the 10 random dataset splits $S_i$ described in Section 4.2 and extracted from the Waymo dataset (see Section 3.1 for more details). The full training procedure is outlined in Section 4.
Furthermore, training the model several times provides a reliable result in terms of repeatability, as individual training runs may perform differently owing to the randomization in data loading, dropout layers and weight initialization. The results on the test sets are summarized in Table 3.

Supervised Baseline
The supervised baseline models were trained as described in Section 4.4 using only sequences collected in day-time and fair weather conditions. As can be seen in Table 3, the fusion subnetwork performs significantly better than the single modality subnetworks in both the day-fair and day-rain categories. However, the lidar subnetwork performs best in the night-time categories where the performance of the camera subnetwork drops significantly. The fact that the lidar subnetwork performs better than the camera one at night-time is not surprising considering that the lidar is an active sensor. The fusion subnetwork also has access to the lidar input, so its poor performance in night-time sequences is an indication that, during training, it has learned to rely too strongly on the camera-based features. Another interesting result is that the lidar subnetwork performance is negatively affected by rainy weather. As shown in two examples in Figure 1, the water covering the surrounding surfaces degrades the density, and possibly the quality, of the point clouds captured with the lidar.

Co-Training
Co-training shows slight improvements for day-fair data, which is expected, as the full information content already provides reliable results for the supervised baseline. The improvement of co-training over the baseline is larger for night-fair data than for daylight data. Here, performance exceeds 80% in almost all generated models, an improvement of about 15 percentage points for the camera-based network. The only case in which performance is lower, though still a large improvement over the camera baseline, is the night-rain scenario, in which performance increases over the baseline in all cases, by between 5 percentage points for the lidar and almost 30 percentage points for the camera.

Fusion-Based Semi-Supervised Learning
The results in Table 3 for fusion-based semi-supervised learning and co-training are closely comparable, with only small differences in all scenarios. In most cases, fusion outperforms co-training on average, although in some cases the opposite happens. The variance, however, suggests that co-training is more stable, being lower in many cases. In all cases, there is an improvement, which is more significant for night and rainy weather and even comparable to the upperbound.
An important observation is that the upperbound model generally shows the best performance, which is not a surprise. It is well known that the availability of large quantities of well-annotated data is a fundamental issue, and that the best performance is achieved with increased information availability. However, real-world cases show that this is not always possible, achievable or cost-effective. This result shows that co-training and semi-supervised learning can help to fill this gap, in some cases even outperforming the upperbound model, for example, in the case of night-fair lidar data. It is reasonable to expect that semi-supervised learning and co-training would surpass the upperbound model if more data were available.

Cross-Validation
The results in Table 3 also show the variance across several models trained under the same conditions but on different data splits (as described in Section 4.2). Using cross-validation, these results confirm that fusion and co-training perform similarly, improving on the baseline and reaching upperbound-level performance. Figure 3 shows the performance of each subnetwork (RGB, lidar and fusion) on each data split, together with the average accuracy and the variance. From Figure 3a, one can observe that in the day-fair case, the camera performances are comparable in all cases (baseline, co-training, fusion and upperbound), while fusion and co-training clearly help in rain and at night for all data splits. The same behaviour can be seen in Figure 3b for the lidar subnetwork, though with (as expected) better performance at night-time. Finally, Figure 3c shows the performance distribution over all the data splits for the fusion subnetwork.

Discussion
The purpose of the experiments proposed in Section 5 is three-fold: (1) to evaluate the potential benefits of using a multi-sensor system over single modality approaches; (2) to investigate the model generalization capabilities to more challenging domains; and (3) to evaluate whether co-training and our newly introduced fusion-based strategy could be useful for domain adaptation.
According to the results shown in Section 5, several potential benefits support the use of semi-supervised learning, sensor fusion and co-training strategies. First, a multi-sensor system is shown to provide high reliability and redundancy in all the cases reported in Table 3 in which sensor fusion outperforms the single-sensor approach (RGB camera or lidar). Second, Table 3 also shows that co-training and semi-supervised learning help the model to generalize better to more challenging domains. The largest improvement from the semi-supervised learning techniques is seen in the night-rain case, with an IoU increase of over 10 percentage points, from 70% for the fusion baseline to 81% in the fusion-semi case.
The use of non-annotated data has clearly been shown to improve the overall performance, resulting in large savings in practical applications where data annotation is a heavy and complex task. However, data annotation cannot be neglected: the upperbound model is shown to be the best performing one, and hence full data annotation is still the best way to achieve high reliability and stability in neural networks.

Conclusions and Future Work
In conclusion, this paper offered a comparative study of two semi-supervised methods and sensor fusion techniques for lidar-camera data in deep learning, comparing the performance of the different networks with a baseline model and a supervised upperbound model. Our results confirm the overall trend that semi-supervised methods can boost performance by taking advantage of the availability of a large un-annotated dataset. The paper shows that the performance level of the upperbound model can be reached using methods such as semi-supervised learning and co-training, resulting in a cost-effective approach that uses less data annotation. This result is supported by cross-validation using 10 different data splits. Furthermore, the statistical analysis of individual training runs shows how benchmarking in autonomous driving could be affected by randomization in individual training.
Future work could extend the analysis to different conditions, including additional subcategories, such as adverse weather, in which the performance of deep learning for autonomous driving still suffers from limited data availability.

Conflicts of Interest:
The authors declare no conflict of interest.