Article

Off-Road Environment Semantic Segmentation for Autonomous Vehicles Based on Multi-Scale Feature Fusion

1 School of Instrument Science and Engineering, Southeast University, Nanjing 210096, China
2 Intelligence & Collaboration Laboratory, Beijing 100070, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
World Electr. Veh. J. 2023, 14(10), 291; https://doi.org/10.3390/wevj14100291
Submission received: 15 September 2023 / Revised: 6 October 2023 / Accepted: 9 October 2023 / Published: 13 October 2023
(This article belongs to the Special Issue Vehicle-Road Collaboration and Connected Automated Driving)

Abstract

For autonomous vehicles driving in off-road environments, reliable environmental perception is crucial. However, semantic segmentation in complex scenes remains a challenging task. Most existing methods for off-road environments are limited to a single scene type and achieve low accuracy. Therefore, this paper proposes a LiDAR-based semantic segmentation network called the Multi-scale Augmentation Point-Cylinder Network (MAPC-Net). The network uses a multi-layer receptive field fusion module to extract features from objects of different scales in off-road environments. Gated feature fusion is used to fuse the PointTensor and Cylinder branches for encoding and decoding. In addition, we use CARLA to build off-road environments for dataset generation and employ linear interpolation to augment the training data, addressing the problem of sample imbalance. Finally, we design experiments that verify the excellent semantic segmentation ability of MAPC-Net in off-road environments and demonstrate the effectiveness of the multi-layer receptive field fusion module and the data augmentation.

1. Introduction

Autonomous vehicles are an innovative mode of transportation based on advanced sensors, computer vision, and artificial intelligence technologies. They do not require human intervention and can therefore perform driving tasks in various environments more safely and efficiently. In addition to operating on regular, structured roads, autonomous vehicles sometimes need to work in off-road scenes such as battlefields and post-disaster areas. In these cases, the road is usually unstructured, and the setting is described as an off-road environment. It is therefore especially important to establish a perception algorithm for autonomous vehicles driving in off-road environments, which helps the unmanned platform achieve full-scene perception.
Environmental perception is an important part of the unmanned platform. Cameras and LiDAR (light detection and ranging) are the two main sensors used to obtain environmental information. Camera-based semantic segmentation algorithms mostly rely on the texture or color features of roads, such as boundaries [1], lane lines [2], or vanishing points [3]. Some depth-camera-based methods [2] use depth information as an auxiliary cue for semantic segmentation in off-road environments. Although good segmentation results have been achieved, these methods are not robust to illumination changes. Camera-based perception is fundamentally based on color and texture, which are strongly affected by light, so it fails at night. Off-road operation often requires working at night, so the camera is not suitable as the main sensor. As an active sensor, LiDAR is also widely used in environmental perception algorithms. Compared to other onboard sensors, LiDAR provides richer environmental information [4]. Yu et al. [5] used LiDAR to detect street light poles, and Liu et al. [6] used LiDAR to greatly extend the detection range for vehicles, pedestrians, and cyclists. Unlike a passive receiver such as a camera, LiDAR is suitable for rainy or foggy days and for scenes where the light intensity changes drastically, because its measurements depend very little on ambient conditions. LiDAR has therefore been widely used in on-vehicle environmental perception, and its high robustness makes it very suitable for off-road environments.
For point cloud semantic segmentation tasks and their application in off-road environments, researchers have carried out a great deal of related work (see Section 2: Related Work). However, due to the lack of datasets and the high cost of labeling, the problems of low semantic segmentation accuracy and few classes in off-road environments remain unsolved. Therefore, in this paper, focusing on off-road environments, we design MAPC-Net. We also apply the CARLA [7] simulator for scene construction and data simulation to achieve efficient and high-precision semantic segmentation. We summarize our contributions as follows:
  • A gated fusion module is used to fuse PointTensor and Cylinder to achieve an efficient end-to-end semantic segmentation network.
  • A multi-layer receptive field module is designed to effectively realize the feature extraction of objects of different scales, especially at small scales.
  • The CARLA simulator is used to build an off-road environment dataset. In addition, linear interpolation is used to enhance the off-road dataset to improve the training effect.

2. Related Work

2.1. Structured Road Scene 3D Semantic Segmentation

Semantic segmentation of a point cloud refers to assigning each input point to its corresponding class so that different types of objects can be distinguished. For the semantic segmentation of 3D point clouds in structured outdoor scenes, the input point cloud can be encoded in three ways: voxel-based, point-based, and projection-based.
In the projection-based algorithm, the 3D point cloud is projected into the 2D space, and then the semantic segmentation network is used to calculate the “pseudo image” in the 2D space. After that, the segmentation result is back-projected to the coordinate space of the 3D point cloud by interpolation to realize the semantic segmentation of the original point cloud. Among them, SqueezeSeg [8], SqueezeSegv2 [9], SqueezeSegv3 [10], Salsanext [11], etc., use spherical projection, while PolarNet [12] and VD3D-FCN [13] algorithms use bird’s-eye view projection for feature extraction.
Voxel-based semantic segmentation algorithms re-encode the 3D space with voxels. For example, VoxelNet [14] is a typical algorithm that uses voxels to achieve semantic segmentation of 3D point clouds. It divides the 3D space occupied by the point cloud into equally spaced grids, called voxels. Each voxel is converted into a unified feature representation vector by a VFE (voxel feature encoding) layer, and feature extraction is performed on this representation.
Point-based semantic segmentation extracts features directly from the original unordered point sequence, using a multi-layer perceptron to perform semantic encoding and spatial position calculation on the points themselves. PointNet [15], PointNet++ [16], RandLA-Net [17], and KPConv [18] are point-based algorithms.
These three feature extraction methods have their own advantages and disadvantages. Projection-based methods are usually faster than methods that extract features directly in three-dimensional space, but the precision loss caused by the forward and back projections cannot be ignored. Voxel-based methods are also widely used: after voxel encoding, whether deep learning or traditional clustering algorithms are applied, target recognition or segmentation tasks can be performed effectively. However, 3D convolution becomes less efficient as the data size increases. Algorithms based on point sequences are computationally efficient but have poor locality and tend to lose features, making it difficult to separate small objects from large ones. PVCNN [19] addressed this by fusing voxel and point-sequence features, greatly improving both accuracy and efficiency. RPVNet [20] later fused all three feature extraction methods and obtained excellent semantic segmentation results.

2.2. Off-Road Point Cloud Semantic Segmentation

The above methods for structured roads provide many valuable references for off-road segmentation. The biggest difference between off-road environments and structured road scenes is that the drivable area has no lane lines, no obvious road boundaries, and often no regular shape. Such terrain differs greatly from structured roads, so 3D point cloud semantic segmentation algorithms designed for structured road scenes are difficult to apply directly. At present, semantic segmentation algorithms for off-road scenes fall into three categories: feature engineering based on point clouds, weakly supervised learning, and transfer learning.
The feature engineering algorithms based on point clouds perform road segmentation by extracting the geometric features of roads in off-road scenes. Liu et al. [21] focus on identifying negative obstacles on the road: three LiDARs are installed directly above and on both sides of the vehicle, a mathematical model of the LiDAR scan line is established, and an adaptive filtering algorithm based on this model identifies negative obstacles; finally, the results of the three LiDARs are fused to detect the drivable area and negative obstacles. Chen et al. [22] project the LiDAR point cloud onto a two-dimensional image plane and generate a histogram from it. Water, positive obstacles, and drivable areas in off-road scenes are detected from the histogram, and the result is back-projected into the LiDAR coordinate system. Although feature engineering has achieved good results in specific off-road scenes, it has significant constraints: it can classify only a few specific elements in the scene and fails to adapt to diverse off-road scenes.
Gao et al. [23] projected the original point cloud onto the image plane through a bird's-eye view and then used the GPS information of the moving vehicle to obtain the driving trajectory. On the projected image, a region-growing algorithm is applied along the driving trajectory to automatically generate labels for the drivable region, which are combined with a small amount of manually labeled data to form the training dataset; a good segmentation result is finally achieved while greatly reducing the workload of manual annotation. Holder et al. [24] use an existing CNN framework pre-trained on a dataset of urban structured road scenes and then use a small dataset of off-road scenes to re-determine the segmentation classes for transfer learning. While achieving good results, this approach effectively reduces the labeling effort for off-road LiDAR point cloud data.
To sum up, the main problem in designing semantic segmentation algorithms for off-road scenes is the lack of datasets. Existing algorithms mainly use geometric features or combine specific algorithms with a small amount of data to perform semantic segmentation. However, lower accuracy is still a big problem. Therefore, on one hand, the research should focus on how to obtain a large amount of high-quality data. Relying on computer simulation technology, typical off-road scenes can be built to obtain a large number of accurately labeled datasets. On the other hand, more targeted algorithms should be designed according to the characteristics of off-road scenes. The above two aspects have important engineering value and academic significance for improving the semantic segmentation accuracy of off-road scenes.

3. Methods

3.1. Network Overview

The overall structure of our multi-scale augmentation Point-Cylinder network (MAPC-Net) is shown in Figure 1. In outdoor scenes, LiDAR point density decreases with distance: regions near the sensor are densely sampled, while distant regions are sparse. The traditional voxel-based algorithm divides the entire space evenly, which leads to a large proportion of empty voxels and makes voxel-based computation neither stable nor real-time. Inspired by the Cylinder algorithm [25], MAPC-Net uses a cylindrical coordinate system instead of the traditional Cartesian grid during the encoding stage; that is, it replaces the traditional voxel with the Cylinder representation and fuses it with the point cloud sequence for feature extraction. As shown in Figure 1, the Point-Cylinder structure designed in this paper combines the low computational complexity of point cloud sequences with the locality of the Cylinder representation, so the network can obtain better semantic segmentation results in real time.
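As a concrete illustration of this cylindrical partition, the following minimal sketch (in Python/NumPy, with grid dimensions and sensor ranges chosen as assumptions rather than taken from the paper) converts Cartesian LiDAR points into cylinder voxel indices:

```python
import numpy as np

def cylindrical_voxelize(points, grid_size=(480, 360, 32),
                         max_radius=50.0, z_range=(-3.0, 1.5)):
    """Assign each LiDAR point to a cylindrical voxel (rho, phi, z).

    Illustrative sketch only; grid_size, max_radius, and z_range are
    assumed values, not the parameters used in the paper.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    rho = np.sqrt(x ** 2 + y ** 2)          # radial distance from the sensor
    phi = np.arctan2(y, x)                  # azimuth angle in [-pi, pi]

    # Normalize each cylindrical coordinate to [0, 1) before binning.
    rho_n = np.clip(rho / max_radius, 0, 1 - 1e-6)
    phi_n = np.clip((phi + np.pi) / (2 * np.pi), 0, 1 - 1e-6)
    z_n = np.clip((z - z_range[0]) / (z_range[1] - z_range[0]), 0, 1 - 1e-6)

    idx = np.stack([
        (rho_n * grid_size[0]).astype(np.int32),
        (phi_n * grid_size[1]).astype(np.int32),
        (z_n * grid_size[2]).astype(np.int32),
    ], axis=1)
    return idx  # (N, 3) voxel index for every point
```

Because the angular bins cover the full circle regardless of range, nearby dense regions and distant sparse regions end up in voxels of more comparable occupancy, which is the motivation given above.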
The network adopts an encoder-decoder structure. The encoding stage is used for feature extraction, and the decoding stage is used to restore the original size and output semantic segmentation results. Besides, skip connections are added to the network so that the recovered feature maps can have more low-scale features to obtain better semantic segmentation results. We use an asymmetric encoder-decoder structure in the network, where the PointTensor and Cylinder voxels are fused once in the encoding stage and twice in the decoding stage. The PointTensor needs more information from the Cylinder voxels in the decoding stage to ensure that the recovered feature map has more details to make the final output semantic segmentation result more accurate.
In the cylindrical voxel branch, 3D sparse convolutions and two-layer residual modules are used as the basic units of the encoder, which effectively extract features; in the decoder, 3D deconvolutions and two-layer residual modules are the basic units. Cross-layer connections are added between the four down-sampling layers and the four up-sampling layers, and the high-level semantic features are fused with fine-grained, low-level features to enhance the up-sampling results. The point cloud sequence first passes through the MRFFM (multi-layer receptive field fusion module) to extract and fuse feature information of large-, medium-, and small-scale objects in the off-road scene, producing a PointTensor of shape (n, 32). After that, a multi-layer perceptron is applied directly. Although simple, this step still extracts fine-grained, point-wise features, which complement the coarse-grained neighborhood features extracted by the cylindrical voxels.
After feature extraction from the PointTensor and Cylinder voxels, the voxel-based features must be converted to point-based features for fusion. The most direct implementation is to assign the Cylinder feature to all points in the grid cell; however, this would give identical features to all points within the same voxel. We therefore use trilinear interpolation to convert the Cylinder features to the point cloud feature format, ensuring that the features of each point are distinct. Finally, the features extracted from the PointTensor and the point-based features obtained by trilinear interpolation are fused to obtain the final output.
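A minimal sketch of such a devoxelization step is shown below. It assumes the Cylinder branch has been densified into a regular feature volume (the actual network works on sparse tensors) and uses PyTorch's grid_sample, whose trilinear sampling gives each point a distinct feature:

```python
import torch
import torch.nn.functional as F

def devoxelize_trilinear(voxel_feats, point_coords_norm):
    """Trilinearly interpolate voxel features back to per-point features.

    voxel_feats:       (C, D, H, W) dense feature volume (e.g., the Cylinder branch output).
    point_coords_norm: (P, 3) point coordinates normalized to [-1, 1], ordered (w, h, d).
    Returns:           (P, C) per-point features.
    Sketch only; the real network may use a sparse gather instead of a dense volume.
    """
    vol = voxel_feats.unsqueeze(0)                 # (1, C, D, H, W)
    grid = point_coords_norm.view(1, 1, 1, -1, 3)  # (1, 1, 1, P, 3)
    # mode="bilinear" on a 5D input performs trilinear interpolation.
    sampled = F.grid_sample(vol, grid, mode="bilinear", align_corners=True)
    return sampled.view(voxel_feats.shape[0], -1).t()   # (P, C)
```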
For two feature tensors, there are two common fusion methods: addition and concatenation. The essence of feature fusion is to aggregate useful information in the presence of useless information, and both of these fusion methods are affected by non-informative features because the information sources are heterogeneous. Benefiting from related research on gating mechanisms [26], feature aggregation can be performed adaptively by measuring the importance of each feature, and the gated aggregation can be defined as:
$\tilde{X} = \sum_{i=0}^{L} \operatorname{split}\left(\operatorname{softmax}\left(\sum_{i=0}^{L} G_i\right)\right)_i \cdot X_i$ (1)
where $G_i \in [0,1]^{N \times L}$ is the gating vector of the $i$-th feature and $N$ denotes the number of points. Each gate vector has $L$ channels for each feature representation; the feature weights on each channel are summed by voting and converted to probability weights by Softmax. Finally, the results of the corresponding channels are split apart to weight the input features. The gating vector is calculated as:
$G_i = \operatorname{sigmoid}(w_i \cdot X_i)$ (2)
where $w_i$ represents the weight of the convolutional layer. The specific calculation method is shown in Figure 2.
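A hedged sketch of this gated fusion for two branches (PointTensor features and devoxelized Cylinder features) is given below; the linear gating layers and channel size are illustrative assumptions, and the softmax-over-branches voting is our reading of Equations (1) and (2):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Gated fusion of the point branch and the (devoxelized) Cylinder branch.

    Per-branch gates via sigmoid (Eq. (2)), softmax voting across branches,
    then a weighted sum of the input features (Eq. (1)).
    """

    def __init__(self, channels=64, num_branches=2):
        super().__init__()
        self.gates = nn.ModuleList(
            [nn.Linear(channels, channels) for _ in range(num_branches)]
        )

    def forward(self, feats):
        # feats: list of (N, C) tensors, one per branch.
        gate_logits = [torch.sigmoid(g(x)) for g, x in zip(self.gates, feats)]  # Eq. (2)
        weights = F.softmax(torch.stack(gate_logits, dim=0), dim=0)             # vote + softmax
        fused = sum(w * x for w, x in zip(weights, feats))                        # Eq. (1)
        return fused
```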
Using gated fusion, we realize the fusion of the PointTensor and the point-based features after trilinear interpolation, which is also the last step of the Point-Cylinder structure. At the end of the entire semantic segmentation network, the fused feature vector passes through the fully connected layer and finally outputs the semantic segmentation result.

3.2. Multi-Layer Receptive Field Fusion Module

In off-road environments, segmenting the drivable area is essential, and identifying obstacles such as stones and plants is another major task. Therefore, we designed a multi-layer receptive field fusion module (MRFFM) to fuse and stack the feature vectors obtained under different receptive fields, so that the network can more effectively perceive objects of different scales in off-road environments. The structure of the MRFFM is shown in Figure 3.
In the MRFFM, the original PointTensor is first passed through three convolutions with different kernel sizes to obtain feature tensors $x_1$, $x_2$, and $x_3$, which represent the feature extraction results under small, medium, and large receptive fields, respectively. We use $z$ to denote the fusion result of the small-scale and medium-scale features, which can be expressed as:
$z = P(x_1 \oplus x_2) \otimes x_1 + \left(1 - P(x_1 \oplus x_2)\right) \otimes x_2$ (3)
Similarly, the final fusion output of the entire module can be expressed as:
$Output = P(x_3 \oplus z) \otimes z + \left(1 - P(x_3 \oplus z)\right) \otimes x_3$ (4)
In the above formulas, $P(\cdot)$ represents the output of the PointTensor attention module (PAM); $\oplus$ represents feature fusion, implemented as element-wise summation; and $\otimes$ represents element-wise multiplication. The detailed flow of the PAM is shown in Figure 4. Since the module ends with a Sigmoid function, its output lies between 0 and 1, so the dotted line in Figure 3, which represents $1 - P(x_1 \oplus x_2)$, and the solid line, which represents $P(x_1 \oplus x_2)$, both lie in the range 0–1. Therefore, through the MRFFM, we achieve a weighted superposition of feature tensors at different scales.
Inspired by attentional feature fusion (AFF) [27], we extend the multi-scale channel attention module to point cloud sequences and design the PointTensor attention module (PAM), shown in Figure 4. The input of this module is a PointTensor. The global channel context, denoted $G(X)$, is obtained through global average pooling, and the other branch produces the local channel context, denoted $L(X)$. The Conv module in Figure 4 adopts the Minkowski convolution [28], which can quickly aggregate the local channel context of a PointTensor. The final output of the module is computed with a Sigmoid function and can be expressed as:
$Output = \operatorname{Sigmoid}\left(G(X) \oplus L(X)\right)$ (5)
where $\oplus$ here denotes broadcasting addition: the pooled global context $G(X)$ is broadcast to the shape of $L(X)$ before summation. As a result, the PAM output lies between 0 and 1, and the feature tensors of different scales can be weighted and superimposed.
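The following sketch puts Equations (3)-(5) together. It replaces the sparse Minkowski convolutions of the paper with simple per-point linear layers, so it should be read as an illustration of the weighting scheme rather than the actual MRFFM implementation:

```python
import torch
import torch.nn as nn

class PAM(nn.Module):
    """PointTensor attention module (Eq. (5)): global + local channel context -> sigmoid weights."""

    def __init__(self, channels):
        super().__init__()
        self.local = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                   nn.Linear(channels, channels))
        self.globl = nn.Sequential(nn.Linear(channels, channels), nn.ReLU(),
                                   nn.Linear(channels, channels))

    def forward(self, x):                                  # x: (N, C)
        g = self.globl(x.mean(dim=0, keepdim=True))        # global average pooling -> G(X)
        l = self.local(x)                                  # per-point local context -> L(X)
        return torch.sigmoid(g + l)                        # broadcasting addition, Eq. (5)

class MRFFM(nn.Module):
    """Multi-layer receptive field fusion (Eqs. (3)-(4)); per-point MLPs stand in for the
    three different-kernel convolutions of the paper."""

    def __init__(self, in_channels, channels=32):
        super().__init__()
        self.branches = nn.ModuleList([nn.Linear(in_channels, channels) for _ in range(3)])
        self.pam12 = PAM(channels)
        self.pam3 = PAM(channels)

    def forward(self, points):                             # points: (N, in_channels)
        x1, x2, x3 = [b(points) for b in self.branches]    # small / medium / large scales
        w = self.pam12(x1 + x2)
        z = w * x1 + (1 - w) * x2                          # Eq. (3)
        w3 = self.pam3(x3 + z)
        return w3 * z + (1 - w3) * x3                      # Eq. (4)
```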

3.3. Data Augmentation

In outdoor scenes, the classes of LiDAR points are often extremely unbalanced. In Semantic-KITTI [29], the number of points belonging to road classes is on the order of 10⁹, while smaller objects such as pedestrians and bicycles account for only about 10⁵ points, a difference of roughly four orders of magnitude. In addition, more than 80% of the points in RELLIS-3D [30] are grass, trees, and bushes, and the largest class contains about 1000 times as many points as the smallest. The same imbalance exists in the off-road scene dataset constructed with CARLA in this paper. In general, if rare objects with few points appear multiple times in the scene, the network's predictions for these classes become more accurate. Therefore, we use linear interpolation to perform super-resolution on the under-represented classes in the dataset and extract them into a sample library. During training, objects are randomly sampled from this library, randomly scaled and rotated, and then placed into the LiDAR point cloud sequence. To keep these objects consistent with reality, we randomly place them on the ground. This constitutes the training set for the current LiDAR point cloud scene.
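A minimal sketch of this paste-style augmentation is shown below; the sample-bank format, placement range, and scaling bounds are assumptions made for illustration, not values taken from the paper:

```python
import numpy as np

def augment_with_rare_objects(scene_points, scene_labels, sample_bank,
                              ground_z, num_inserts=5, rng=np.random.default_rng()):
    """Paste upsampled rare-class objects (e.g., people, mud patches) into a LiDAR frame.

    sample_bank: list of (points (M, 3), label_id) tuples extracted and upsampled
    offline via linear interpolation. ground_z estimates the local ground height.
    """
    for _ in range(num_inserts):
        obj_pts, obj_label = sample_bank[rng.integers(len(sample_bank))]
        obj = obj_pts.copy()

        # Random scale and random yaw rotation.
        obj *= rng.uniform(0.9, 1.1)
        theta = rng.uniform(0, 2 * np.pi)
        rot = np.array([[np.cos(theta), -np.sin(theta), 0],
                        [np.sin(theta),  np.cos(theta), 0],
                        [0, 0, 1]])
        obj = obj @ rot.T

        # Place the object at a random position, resting on the ground plane.
        target_xy = rng.uniform(-30, 30, size=2)
        obj[:, :2] += target_xy - obj[:, :2].mean(axis=0)
        obj[:, 2] += ground_z - obj[:, 2].min()

        scene_points = np.vstack([scene_points, obj])
        scene_labels = np.concatenate([scene_labels, np.full(len(obj), obj_label)])
    return scene_points, scene_labels
```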

3.4. Loss Function and Optimizer

We combine the cross-entropy loss function [31] and Lovász-Softmax [32] as the loss function and use stochastic gradient descent (SGD) as the optimizer of the network. The semantic segmentation problem is actually classifying each point in the point cloud, and the cross-entropy loss function can be applied to the multi-classification problem. Lovász loss is an effective additional loss term that can be used for different machine-learning tasks such as object detection and semantic segmentation. Therefore, we combine the Lovász loss with the cross-entropy loss function to achieve better model training results. The final loss function can be expressed by Equation (6):
$L_{total} = -\sum_i \frac{1}{v_i} P(y_i) \log P(\hat{y}_i) + \frac{1}{|C|} \sum_{c \in C} J\big(e(c)\big)$ (6)
where $v_i$ represents the frequency of each class, $P(\hat{y}_i)$ and $P(y_i)$ represent the corresponding predicted and true probabilities, and $J$ represents the Lovász extension of the semantic segmentation metric IoU (intersection over union). $C$ denotes the set of classes, and $e(c)$ represents the error vector of class $c$, which can be expressed as:
$e_i(c) = \begin{cases} 1 - x_i(c), & \text{if } c = y_i \\ x_i(c), & \text{otherwise} \end{cases}$ (7)
where $x_i(c) \in [0, 1]$ denotes the predicted probability that point $i$ belongs to class $c$, and $y_i$ denotes the corresponding ground-truth class of point $i$.
For the optimizer, SGD is chosen in this paper; the initial learning rate is set to 0.24, and the decay is set to 10⁻⁴ per training round. To speed up the learning process, momentum is introduced and set to 0.9.
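A sketch of this training objective and optimizer configuration is given below, combining a frequency-weighted cross-entropy with a standard Lovász-Softmax term in the style of Berman et al. [32]; treating the decay value as a weight decay is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

def lovasz_grad(gt_sorted):
    # Gradient of the Lovász extension w.r.t. sorted errors (Berman et al., 2018).
    gts = gt_sorted.sum()
    intersection = gts - gt_sorted.cumsum(0)
    union = gts + (1.0 - gt_sorted).cumsum(0)
    jaccard = 1.0 - intersection / union
    jaccard[1:] = jaccard[1:] - jaccard[:-1]
    return jaccard

def lovasz_softmax(probs, labels, num_classes):
    # probs: (N, C) softmax probabilities; labels: (N,) ground-truth class indices.
    losses = []
    for c in range(num_classes):
        fg = (labels == c).float()
        if fg.sum() == 0:
            continue  # class absent in this batch
        errors = (fg - probs[:, c]).abs()
        errors_sorted, perm = torch.sort(errors, descending=True)
        losses.append(torch.dot(errors_sorted, lovasz_grad(fg[perm])))
    return torch.stack(losses).mean()

def total_loss(logits, labels, class_freq, num_classes):
    # Eq. (6): frequency-weighted cross-entropy plus the Lovász-Softmax term.
    ce = F.cross_entropy(logits, labels, weight=1.0 / class_freq)
    return ce + lovasz_softmax(F.softmax(logits, dim=1), labels, num_classes)

# Optimizer configuration as described above (a hypothetical `model` is assumed):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.24, momentum=0.9, weight_decay=1e-4)
```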

4. Experiment

4.1. Data Preparation

The initial dataset in our work is RELLIS-3D [30], which includes 64-beam LiDAR point cloud data with semantic segmentation labels for five scenes. The label data contain 20 classes, including grass, mud, trees, people, etc. From this, we obtained a total of 13,556 labeled frames of off-road LiDAR semantic segmentation data.
However, as shown in Figure 5a, the vegetation in the RELLIS-3D scenes is low and undulating terrain is absent. We therefore use the CARLA simulator to build relevant off-road scenes and collect data from them, so that training covers more off-road conditions. The built scene is shown in Figure 5b. During collection, we use a single LiDAR semantic segmentation sensor. The segmentation classes are grass, mud, vegetation, people, water, and rocks. The driving mode is manual, the maximum speed is set to 5 m/s, the LiDAR has 64 beams, and 200,000 points are collected per second. Running the simulation with these parameters yields the corresponding off-road semantic segmentation dataset, which contains a total of 8000 frames after screening.
To ensure consistency between the simulated dataset and the real scene dataset, we map the labels of RELLIS-3D to six classes: grass, mud, vegetation, people, water, and rocks. Specifically, 'grass' in RELLIS-3D is mapped to 'grass'; 'tree' and 'bush' are mapped to 'vegetation'; 'mud' is mapped to 'mud'; 'rubble' and 'barrier' are mapped to 'rocks'; 'puddle' and 'water' are mapped to 'water'; 'person' is mapped to 'people'; and other objects are mapped to 'unlabeled'. This yields the real scene dataset.
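A sketch of this remapping as a simple lookup is shown below; the RELLIS-3D class names are written out as strings here for readability, whereas the real dataset uses numeric label IDs defined in its ontology file:

```python
# Remapping from RELLIS-3D classes to the six training classes described above.
RELLIS_TO_SIX = {
    "grass": "grass",
    "tree": "vegetation",
    "bush": "vegetation",
    "mud": "mud",
    "rubble": "rocks",
    "barrier": "rocks",
    "puddle": "water",
    "water": "water",
    "person": "people",
}

def remap_labels(labels):
    """Map raw RELLIS-3D class names to the six training classes ('unlabeled' otherwise)."""
    return [RELLIS_TO_SIX.get(name, "unlabeled") for name in labels]
```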
Finally, the simulated dataset and real scene dataset are augmented with the super-resolution-based data augmentation algorithm described in Section 3.3. For the simulated dataset, 80% of the data is selected as the training set, and 20% of the data is used as the test set. As a result, 6400 frames of simulated data are obtained as training set A, and 1600 frames of simulated data are used as test set A. For the real RELLIS-3D dataset, we select the first three scenes as the training set and the last scene as the test set, thus 11,497 frames of data are used as training set B, and 2059 frames of data are used as test set B. During training, we combine training set A and training set B. When testing, we test on test set A and test set B, respectively.

4.2. Evaluation Metrics and Experimental Setup

To quantitatively analyze the off-road semantic segmentation model constructed in this paper, we use IoU (intersection over union), recall, and precision to evaluate the inference accuracy of the network, and we evaluate its efficiency in terms of average inference time:
$\mathrm{IoU} = \dfrac{T_P}{T_P + F_P + F_N}$ (8)

$\mathrm{recall} = \dfrac{T_P}{T_P + F_N}$ (9)

$\mathrm{precision} = \dfrac{T_P}{T_P + F_P}$ (10)
Here, for a given class C, $T_P$ is the number of points correctly detected as C (true positives); $F_N$ is the number of points of class C that are missed (false negatives); and $F_P$ is the number of points incorrectly detected as C (false positives). Therefore, IoU evaluates the overlap between the segmentation result and the true label, recall reflects the missed detection rate, and precision reflects the false detection rate.
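These three metrics can be computed directly from the predicted and ground-truth label arrays, as in the following sketch:

```python
import numpy as np

def per_class_metrics(pred, gt, num_classes):
    """Compute per-class IoU, recall, and precision (Eqs. (8)-(10)) from label arrays."""
    ious, recalls, precisions = [], [], []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))   # true positives for class c
        fp = np.sum((pred == c) & (gt != c))   # false positives
        fn = np.sum((pred != c) & (gt == c))   # false negatives
        ious.append(tp / max(tp + fp + fn, 1))
        recalls.append(tp / max(tp + fn, 1))
        precisions.append(tp / max(tp + fp, 1))
    return np.array(ious), np.array(recalls), np.array(precisions)
```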
We use a computer running Ubuntu 18.04 with an Intel Xeon E5-2678 CPU, 128 GB of memory, and four NVIDIA RTX 3090 GPUs for neural network acceleration. In our experiments, a single graphics card is used for training and inference.

4.3. Experimental Results

4.3.1. Network Comparison Experiment

In this paper, MAPC-Net is trained on the combined training sets A and B and tested on test set A and test set B, respectively. During the experiment, the batch size is set to eight, the initial learning rate is set to 0.24, and the decay is 10⁻⁴ per training round. To speed up the learning process, momentum is introduced and set to 0.9, and stochastic gradient descent is used for a total of thirty training rounds. We select SalsaNext [11] and PVCNN [19] for comparison with the proposed MAPC-Net on test sets A and B. SalsaNext is a typical projection-based point cloud semantic segmentation algorithm, while PVCNN combines traditional voxels and point clouds and extracts features directly in three-dimensional space through an encoder-decoder structure. The qualitative semantic segmentation comparison is shown in Figure 6.
The results in Table 1 show that MAPC-Net achieves the highest mIoU in both Test A and Test B. Compared with the projection-based SalsaNext, it has clear advantages: the IoU of every class is improved, and the mIoU increases by 12% in Test A and 11% in Test B. This is because extracting features directly in three-dimensional space avoids the accuracy loss caused by the forward and back projections, making the segmentation results more accurate. Compared with the traditional voxel and point cloud fusion algorithm, MAPC-Net also has certain advantages: the IoU of most classes exceeds that of the point-voxel-based PVCNN, and the mIoU increases by 5% in Test A and 4% in Test B. This is because, in outdoor scenes, cylindrical voxels distribute the points more evenly than cubic voxels, making the network more robust and yielding better semantic segmentation across different off-road scenes.
MAPC-Net has a longer inference time than the other two models. However, since the LiDAR operates at 5–20 Hz and the main task in this setting is to identify rocks and other obstacles while the vehicle travels at low speed, an inference time below 200 ms can be regarded as real-time, so MAPC-Net is well suited to off-road environments.

4.3.2. Data Augmentation Experiment Results

After data augmentation, we calculated the proportion of points in each class, as presented in Table 2.
It can be seen from Table 2 that after data augmentation, the proportion of classes with fewer points is significantly increased, which provides a basis for training an effective semantic segmentation network. To verify the effectiveness of the data augmentation module, we train the same network with the same parameters on the datasets before and after augmentation. The resulting IoU and mIoU are shown in Table 3.
It can be seen from Table 3 that when linear interpolation is used for data augmentation, the network segments classes with fewer points more effectively. In Test A, the network trained on the augmented training set improved the IoU of "mud" by 5% and the IoU of "people" by 27%. In Test B, the IoU increased by 51% for "mud" and by 36% for "people". Although the data augmentation slightly lowers the IoU of some classes with many points, the decline is small, never exceeding 4%, and does not affect the overall segmentation accuracy; the mIoU is still significantly improved. The experiments demonstrate that the data augmentation module effectively improves the semantic segmentation accuracy of classes with few points.

4.3.3. Multilayer Receptive Field Module Experiment Result

To further verify the effectiveness of the MRFFM, that is, whether the module can effectively segment objects of different scales in the scene, this section compares the inference results of MAPC-Net with those of MAPC-Net-1, a variant in which the MRFFM is removed. The quantitative and qualitative results are shown in Table 4 and Figure 7.
It can be seen from Table 4 that MRFFM significantly improves the mIoU in both Test A and B. In Test A, the network trained with MRFFM improved by 2% on the IoU of “grass”, 4% on the IoU of “rocks”, and 4% on the IoU of “people”. In Test B, the network trained with MRFFM improved by 3% on the IoU of “grass”, 9% on the IoU of “rocks”, and 6% on the IoU of “people”. The results show the effectiveness of MRFFM in the segmentation of objects of different scales.
The results in Figure 7 demonstrate the effectiveness of the multi-layer receptive field fusion module. Because the module contains three convolution kernels of different sizes, information at different scales can be extracted more effectively, and objects of different scales can therefore be segmented more accurately. Especially for small and medium objects, the addition of the MRFFM markedly improves the segmentation results, which greatly improves performance in off-road environments.
To sum up, the performance of MAPC-Net, SalsaNext, and PVCNN are compared and tested in this section. As the experimental results show, relying on the method of directly extracting features in three-dimensional space and the design of the Cylinder, MAPC-Net performs better in off-road scenes compared with the projection-based method and the point-voxel-based method. Additionally, the experiments illustrate the positive effects of the data augmentation module and the MRFFM on MAPC-Net. Among them, the MRFFM can help the network capture the features of objects of different scales, to segment them more accurately. Additionally, the data augmentation module effectively improves the segmentation accuracy of objects with a small number of point clouds. Therefore, according to the results, MAPC-Net can effectively perform semantic segmentation tasks in off-road scenes.

5. Conclusions

This paper proposes a robust off-road environment semantic segmentation algorithm for autonomous vehicles called MAPC-Net. In our work, the multi-layer receptive field fusion module is used to extract information from objects of different scales so that they can be segmented more accurately. In addition, the dataset is augmented with linear interpolation, which improves the segmentation of objects with fewer points. In the experiments, we compared the segmentation results of MAPC-Net with those of the projection-based SalsaNext and the point-voxel-based PVCNN; the results show that MAPC-Net achieves better segmentation in off-road environments. We also designed experiments to verify the positive effects of the MRFFM and the data augmentation module. In addition, since MAPC-Net is intended for off-road environments in which the vehicle moves slowly, it can guarantee real-time semantic segmentation.
However, the limited number of constructed scenes restricts the range of applications of the algorithm. Nevertheless, since the basic elements of different off-road scenes are similar, the six classes defined in this paper retain a certain degree of extensibility.
In the future, we will continue our work to expand and construct more off-road scenes, making the algorithm realize real-time semantic segmentation in more scenes.

Author Contributions

Conceptualization, X.Z. and Y.F.; methodology, X.Z., Y.F. and Z.Z.; software, Y.F.; validation, X.Z., Y.F. and Y.H.; formal analysis, X.Z.; investigation, Y.F. and Z.Z.; resources, Y.H. and Z.Z.; data curation, X.Z.; writing—original draft preparation, X.Z.; writing—review and editing, X.Z. and X.L.; visualization, Y.F.; supervision, X.L.; project administration, X.Z. and Y.F.; funding acquisition, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by Collective Intelligence & Collaboration Laboratory (QXZ23012201).

Data Availability Statement

All data generated or analyzed during this study are included in this published article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Yuan, Y.; Jiang, Z.; Wang, Q. Video-based Road detection via online structural learning. Neurocomputing 2015, 168, 336–347. [Google Scholar] [CrossRef]
  2. Broggi, A.; Cardarelli, E.; Cattani, S.; Sabbatelli, M. Terrain mapping for off-road Autonomous Ground Vehicles using rational B-Spline surfaces and stereo vision. In Proceedings of the 2013 IEEE Intelligent Vehicles Symposium (IV), Gold Coast, Australia, 23–26 June 2013; IEEE: Piscataway, NJ, USA, 2013; pp. 648–653. [Google Scholar]
  3. Wang, P.-S.; Liu, Y.; Sun, C.; Tong, X. Adaptive O-CNN: A Patch-based Deep Representation of 3D Shapes. ACM Trans. Graph. 2018, 37, 1–11. [Google Scholar] [CrossRef]
  4. Xia, X.; Bhatt, N.P.; Khajepour, A.; Hashemi, E. Integrated Inertial-LiDAR-Based Map Matching Localization for Varying Environments. IEEE Trans. Intell. Veh. 2023, 1–12. [Google Scholar] [CrossRef]
  5. Yu, Y.; Li, J.; Guan, H.; Wang, C.; Yu, J. Semiautomated extraction of street light poles from mobile LiDAR point-clouds. IEEE Trans. Geosci. Remote Sens. 2014, 53, 1374–1386. [Google Scholar] [CrossRef]
  6. Liu, H.; Lin, C.; Gong, B.; Wu, D. Extending the Detection Range for Low-Channel Roadside LiDAR by Static Background Construction. IEEE Trans. Geosci. Remote Sens. 2022, 60, 1–12. [Google Scholar] [CrossRef]
  7. Dosovitskiy, A.; Ros, G.; Codevilla, F.; Lopez, A.; Koltun, V. CARLA: An open urban driving simulator. In Proceedings of the Conference on Robot Learning, PMLR, Mountain View, CA, USA, 13–15 November 2017; pp. 1–16. [Google Scholar]
  8. Wu, B.; Wan, A.; Yue, X.; Keutzer, K. Squeezeseg: Convolutional neural nets with recurrent CRF for real-time road-object segmentation from 3d lidar point cloud. In Proceedings of the ICRA, IEEE International Conference on Robotics and Automation, Brisbane, Australia, 21–25 May 2018; pp. 1887–1893. [Google Scholar]
  9. Wu, B.; Zhou, X.; Zhao, S.; Yue, X.; Keutzer, K. Squeezesegv2: Improved model structure and unsupervised domain adaptation for road-object segmentation from a lidar point cloud. In Proceedings of the IEEE International Conference on Robotics and Automation, Montreal, QC, Canada, 20–24 May 2019; pp. 4376–4382. [Google Scholar]
  10. Xu, C.; Wu, B.; Wang, Z.; Zhan, W.; Vajda, P.; Keutzer, K.; Tomizuka, M. Squeezesegv3: Spatially-adaptive convolution for efficient point-cloud segmentation. arXiv 2020, arXiv:2004.01803. [Google Scholar]
  11. Cortinhal, T.; Tzelepis, G.; Erdal Aksoy, E. SalsaNext: Fast, uncertainty-aware semantic segmentation of LiDAR point clouds. In Proceedings of the International Symposium on Visual Computing, San Diego, CA, USA, 5–7 October 2020; Springer: Cham, Switzerland, 2020; pp. 207–222. [Google Scholar]
  12. Zhang, Y.; Zhou, Z.; David, P.; Yue, X.; Xi, Z.; Gong, B.; Foroosh, H. Polarnet: An improved grid representation for online lidar point clouds semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 16–18 June 2020; pp. 9601–9610. [Google Scholar]
  13. Li, B.; Zhang, T.; Xia, T. Vehicle detection from 3d lidar using fully convolutional network. arXiv 2016, arXiv:1608.07916. [Google Scholar]
  14. Zhou, Y.; Tuzel, O. Voxelnet: End-to-end learning for point cloud based 3d object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4490–4499. [Google Scholar]
  15. Qi, C.R.; Su, H.; Mo, K.; Guibas, L.J. Pointnet: Deep learning on point sets for 3d classification and segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 652–660. [Google Scholar]
  16. Qi, C.R.; Yi, L.; Su, H.; Guibas, L.J. Pointnet++: Deep hierarchical feature learning on point sets in a metric space. arXiv 2017, arXiv:1706.02413. [Google Scholar]
  17. Hu, Q.; Yang, B.; Xie, L.; Rosa, S.; Guo, Y.; Wang, Z.; Trigoni, N.; Markham, A. Randla-net: Efficient semantic segmentation of large-scale point clouds. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11108–11117. [Google Scholar]
  18. Thomas, H.; Qi, C.R.; Deschaud, J.E.; Marcotegui, B.; Goulette, F.; Guibas, L.J. Kpconv: Flexible and deformable convolution for point clouds. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6411–6420. [Google Scholar]
  19. Liu, Z.; Tang, H.; Lin, Y.; Han, S. Point-voxel cnn for efficient 3d deep learning. arXiv 2019, arXiv:1907.03739. [Google Scholar]
  20. Xu, J.; Zhang, R.; Dou, J.; Zhu, Y.; Sun, J.; Pu, S. Rpvnet: A deep and efficient range-point-voxel fusion network for lidar point cloud segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 16024–16033. [Google Scholar]
  21. Liu, T.; Liu, D.; Yang, Y.; Chen, Z. Lidar-based traversable region detection in off-road environment. In Proceedings of the 2019 Chinese Control Conference (CCC), Guangzhou, China, 27–30 July 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 4548–4553. [Google Scholar]
  22. Chen, L.; Yang, J.; Kong, H. Lidar-histogram for fast road and obstacle detection. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 1343–1348. [Google Scholar]
  23. Gao, B.; Xu, A.; Pan, Y.; Zhao, X.; Yao, W.; Zhao, H. Off-road drivable area extraction using 3D LiDAR data. In Proceedings of the 2019 IEEE Intelligent Vehicles Symposium (IV), Paris, France, 9–12 June 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 1505–1511. [Google Scholar]
  24. Holder, C.J.; Breckon, T.P.; Wei, X. From on-road to off: Transfer learning within a deep convolutional neural network for segmentation and classification of off-road scenes. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016; Springer: Cham, Switzerland, 2016; pp. 149–162. [Google Scholar]
  25. Zhu, X.; Zhou, H.; Wang, T.; Hong, F.; Ma, Y.; Li, W.; Li, H.; Lin, D. Cylindrical and asymmetrical 3d convolution networks for lidar segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 9939–9948. [Google Scholar]
  26. Takikawa, T.; Acuna, D.; Jampani, V.; Fidler, S. Gated-scnn: Gated shape cnns for semantic segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 5229–5238. [Google Scholar]
  27. Dai, Y.; Gieseke, F.; Oehmcke, S.; Wu, Y.; Barnard, K. Attentional feature fusion. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Virtual, 5–9 January 2021; pp. 3560–3569. [Google Scholar]
  28. Choy, C.; Gwak, J.Y.; Savarese, S. 4d spatio-temporal convnets: Minkowski convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3075–3084. [Google Scholar]
  29. Behley, J.; Garbade, M.; Milioto, A.; Quenzel, J.; Behnke, S.; Stachniss, C.; Gall, J. Semantickitti: A dataset for semantic scene understanding of lidar sequences. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 9297–9307. [Google Scholar]
  30. Jiang, P.; Osteen, P.; Wigness, M.; Saripalli, S. Rellis-3d dataset: Data, benchmarks and analysis. In Proceedings of the 2021 IEEE International Conference on Robotics and Automation (ICRA), Xi’an, China, 30 May–5 June 2021; IEEE: Piscataway, NJ, USA, 2021; pp. 1110–1116. [Google Scholar]
  31. Zhang, Z.; Sabuncu, M. Generalized cross entropy loss for training deep neural networks with noisy labels. arXiv 2018, arXiv:1805.07836. [Google Scholar]
  32. Berman, M.; Triki, A.R.; Blaschko, M.B. The lovász-softmax loss: A tractable surrogate for the optimization of the intersection-over-union measure in neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4413–4421. [Google Scholar]
Figure 1. The overall structure of MAPC-Net. The original point clouds are first sent into the PointTensor and Cylinder feature extraction branches and then into the gated fusion module to obtain the semantic segmentation output.
Figure 2. The gated feature fusion module adaptively fuses the features of PointTensor and Cylinder.
Figure 3. The multi-layer receptive field fusion module can extract features of objects of different scales, so that various obstacles in off-road scenes can be effectively identified.
Figure 4. PointTensor attention module, which helps the multi-layer receptive field module achieve better feature extraction of PointTensor.
Figure 5. Simulated and real scenarios, where (a) is a real scenario from [30] and (b) is a simulated scenario built using the CARLA simulator.
Figure 6. Semantic segmentation results of different networks in test set A and test set B. Whether in test set A or test set B, the segmentation results of MAPC-Net are closer to the ground-truth labels. Among them, magenta represents roads.
Figure 7. The results in the figure demonstrate the effectiveness of the multi-layer receptive field fusion module. The MAPC-Net with multi-layer receptive field module can more effectively identify obstacles of different scales. Among them, red represents people in small objects, vehicles in medium objects, and roads in large objects.
Table 1. Semantic segmentation results of different networks in Test A and Test B (per-class IoU in %).

| Test | Network   | Grass | Mud   | Vegetation | Rock  | Water | People | mIoU (%) | Time (ms) | Params (M) |
|------|-----------|-------|-------|------------|-------|-------|--------|----------|-----------|------------|
| A    | SalsaNext | 99.82 | 74.85 | 20.06      | 39.71 | 51.71 | 54.26  | 56.74    | 38        | 6.7        |
| A    | PVCNN     | 99.56 | 82.35 | 24.19      | 41.93 | 56.54 | 57.04  | 60.35    | 151       | 2.5        |
| A    | Our work  | 99.98 | 83.70 | 23.57      | 50.76 | 68.11 | 56.30  | 63.74    | 165       | 58.6       |
| B    | SalsaNext | 64.74 | 9.58  | 79.04      | 75.89 | 23.20 | 83.17  | 55.94    | 42        | 6.7        |
| B    | PVCNN     | 63.58 | 13.27 | 86.79      | 82.93 | 27.65 | 84.63  | 59.81    | 152       | 2.5        |
| B    | Our work  | 65.18 | 16.09 | 85.40      | 84.58 | 31.09 | 88.51  | 61.81    | 159       | 58.6       |
Table 2. Changes in the point cloud proportion of each class after data augmentation.

| Class      | Before  | After   | Change Rate |
|------------|---------|---------|-------------|
| grass      | 32.609% | 30.874% | −5.321%     |
| vegetation | 27.945% | 26.458% | −5.321%     |
| rocks      | 22.204% | 21.023% | −5.319%     |
| water      | 11.452% | 10.843% | −5.318%     |
| mud        | 5.313%  | 9.922%  | 86.749%     |
| person     | 0.211%  | 0.628%  | 197.630%    |
Table 3. The IoU changes with the data augmentation module (per-class IoU in %).

| Test | Training Set | Grass | Mud   | Vegetation | Rocks | Water | People | mIoU (%) |
|------|--------------|-------|-------|------------|-------|-------|--------|----------|
| A    | before       | 99.98 | 76.56 | 22.34      | 51.89 | 64.78 | 44.25  | 59.97    |
| A    | after        | 99.98 | 83.70 | 23.57      | 50.76 | 68.11 | 56.30  | 63.74    |
| B    | before       | 65.40 | 10.67 | 82.07      | 81.99 | 30.61 | 64.94  | 55.95    |
| B    | after        | 65.18 | 16.09 | 85.40      | 84.58 | 31.09 | 88.51  | 61.81    |
Table 4. The IoU changes with the multilayer receptive field module (per-class IoU in %).

| Test | Network    | Grass | Mud   | Vegetation | Rocks | Water | People | mIoU (%) |
|------|------------|-------|-------|------------|-------|-------|--------|----------|
| A    | MAPC-Net-1 | 97.86 | 84.24 | 24.37      | 45.63 | 65.82 | 50.74  | 61.44    |
| A    | MAPC-Net   | 99.98 | 83.70 | 23.57      | 50.76 | 68.11 | 56.30  | 63.74    |
| B    | MAPC-Net-1 | 62.37 | 17.28 | 81.07      | 75.52 | 32.24 | 72.65  | 56.86    |
| B    | MAPC-Net   | 65.18 | 16.09 | 85.40      | 84.58 | 31.09 | 88.51  | 61.81    |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
