Asymmetric Encoder-Decoder Structured FCN Based LiDAR to Color Image Generation

In this paper, we propose a method of generating a color image from light detection and ranging (LiDAR) 3D reflection intensity. The proposed method is composed of two steps: projection of LiDAR 3D reflection intensity into 2D intensity, and color image generation from the projected intensity by using a fully convolutional network (FCN). The color image should be generated from a very sparse projected intensity image. For this reason, the FCN is designed to have an asymmetric network structure, i.e., the layer depth of the decoder in the FCN is deeper than that of the encoder. The well-known KITTI dataset for various scenarios is used for the proposed FCN training and performance evaluation. Performance of the asymmetric network structures are empirically analyzed for various depth combinations for the encoder and decoder. Through simulations, it is shown that the proposed method generates fairly good visual quality of images while maintaining almost the same color as the ground truth image. Moreover, the proposed FCN has much higher performance than conventional interpolation methods and generative adversarial network based Pix2Pix. One interesting result is that the proposed FCN produces shadow-free and daylight color images. This result is caused by the fact that the LiDAR sensor data is produced by the light reflection and is, therefore, not affected by sunlight and shadow.


Introduction
Light detection and ranging (LiDAR) sensors are widely used for measuring the distances to objects and their reflection information. The distance is calculated by using the round-trip time (RTT) of the light pulses emitted from the sensors. The sensors also provide the reflection intensity, which depends on the materials of the objects. The data obtained from the sensors are the locations of objects in 3D space and their reflection intensities, so the data are often called LiDAR 3D point-cloud data.
Because the LiDAR 3D point-cloud data, namely the range (or distance) and reflection, is independent of sunlight and shadows, the same data can be obtained whether it is day or night [1][2][3][4][5][6][7]. This environmental consistency of LiDAR data has a great advantage over conventional camera images for autonomous vehicle application because the quality of camera images is highly dependent on illumination [8].
One serious problem is that the LiDAR data is too sparse directly to use for the color image generation. To compensate for the sparsity of the LiDAR data, various attempts have been made in the literature. By incorporating the information from camera images, the accuracy of the distance map

Conventional Interpolation Methods
Ashraf et al. [8] proposed an adaptive interpolation method for LiDAR range data. A 2D interpolated reflection-intensity image is obtained from the LiDAR 3D reflection intensity using various interpolation methods, such as natural neighbor, nearest neighbor, bilinear, bi-cubic, inverse distance weighted, and kriging, and is compared with the corresponding camera-captured gray image to select the best interpolation method. The best method is applied for the range data interpolation. They empirically showed that the inverse distance-weighted interpolation method has the best performance. Chen et al. [12] applied the 2D interpolated reflection intensity image with a camera-based RGB image for lane detection. Asvadi et al. [13] also applied the interpolated reflection-intensity image with a color image for vehicle detection and compared the natural neighbor, nearest neighbor, and bilinear interpolation methods. The nearest neighbor interpolation method has the best performance for vehicle detection.
Even though various existing interpolation methods have been applied to reconstruct the reflection-intensity image, interpolated images still have very poor quality and are affected by severe noise. In addition, color image generation from the reflection intensity has not been reported in the literature.

Color Image-Generation Methods
Since the first fully automatic image colorization method was implemented by using a simple neural network [18], various studies on color image generation using end-to-end deep-learning networks have been reported [19][20][21][22][23][24][25][26][27]. The deep-learning-based color image generation methods commonly use an encoder-decoder-structured FCN [30,31]. The encoder network consists of a combination of a convolution layer, batch-normalization layer [32], dropout [33], activation function, and sub-sampling. The decoder network consists of a combination of up-sampling, convolution layer, batch-normalization layer, dropout, and activation function. The color image-generation network has been applied to a variety of applications, such as converting gray images to color images [19][20][21][22][23][24], sketch images to color images [23,25], and infrared images to color images [26,27]. Figure 1 shows the typical network architecture for gray-to-color image generation [19,20]. The 2-channel chrominance components are predicted from the input single channel luminance component by using the encoder-decoder-structured FCN. Finally, a color image is generated by combining the predicted chrominance components with the input luminance [19][20][21]. Most encoder-decoder-structured FCN have the same number of encoder layers and decoder layers, i.e., symmetric structure.

Chrominance Components
Luminance Component (Input Image) Color Image (Output Image) Encoder-Decoder Structured FCN

Proposed Color Image-Generation System
The proposed color image-generation system from LiDAR reflection intensity is composed of two steps: 3D-to-2D projection and color image generation, as shown in Figure 2. The 3D-to-2D projection reconstructs a 2D projected reflection image by projecting the LiDAR 3D reflection intensity onto the target image plane desired. The projected reflection image is very sparse, as shown in lower left picture of Figure 2, because of the different resolution and field of view (FOV) between LiDAR and the target image. The target image is assumed to have the same image plane that is captured by a camera installed on the vehicle. The goal of our work is to generate a color image that is as similar as possible to the image captured by the camera from the LiDAR 3D reflection intensity. In the image-generation step, RGB components are generated from the 2D reflection image using the encoder-decoder-structured FCN model. Unlike the conventional FCN application, the color image should be generated from the sparse 2D reflection image. This is why the proposed FCN is designed to have an asymmetric network structure, i.e., the layer depth of the decoder in the FCN is deeper than that of the encoder.  In the following subsection, we describe in detail the 3D-to-2D projection and color imagegeneration network. In addition, the training and inference processes are also described.

3D-to-2D Projection
The 2D reflection intensity image is reconstructed from the LiDAR 3D point cloud by using a LiDAR-to-camera projection matrix, which maps the reflection intensity value of LiDAR data onto the corresponding image coordinates [8,13,34].

Proposed Color Image-Generation Network
The proposed color image-generation network model is designed with an asymmetric encoder-decoder-structured FCN model as shown in Figure 3. This model generates a color image with 3 channels (size: 592 × 112 × 3) from the sparse 2D reflection image with 1 channel (592 × 112 × 1). The encoder network consists mainly of several convolution blocks and sub-sampling steps. Each convolution block is composed of a convolution layer, batch-normalization layer, and activation function, in consecutive order. The decoder also consists of several up-sampling steps and convolution blocks. As various numbers of convolution blocks can be applied before each sub-sampling or up-sampling, we introduce a new terminology called the convolution group, which consists of several convolution blocks. In this work, considering the size of the input and output images, the encoder network is designed with six convolution groups and four sub-sampling steps. The decoder is designed with four up-sampling steps and six convolution groups. In Figure 3, N e i and N d i represent the number of convolution blocks in the convolution group of the encoder and decoder, respectively. As results, the number of convolution blocks is ∑ 6 i=1 N e i and ∑ 6 i=1 N d i in the encoder and decoder networks, respectively. For sub-sampling, max-pooling with a factor of 2 is applied. For up-sampling, un-pooling [35] with a factor of 2 is applied. In each block, K convolution layers are applied, which is denoted as the convolution-K block. According to the research results that indicate that dropout is not needed when using batch normalization [32], dropout is not applied in the proposed network.  Whereas the convolution-8 block is applied in the first convolution group of the encoder, the convolution-3 block is applied in the last convolution group of the decoder because the output is a 3-channel color image. In all the convolution layers, stride 1 and the same zero padding are applied. Except for the convolution-3 blocks of the last convolution group in the decoder, a rectified linear unit (ReLU) is used for the activation function in all convolution blocks. In the last convolution-3 blocks, hyperbolic tangent (tanh) is used for the activation function and batch normalization is not applied. The ReLU and tanh functions are as follows:

Decoder Network Encoder Network
The role of the encoder network is to extract features from the sparse 2D reflection-intensity image. The role of the decoder network is to map the low-resolution feature maps to an RGB color image with full output resolution. Because the projected 2D reflection and generated color images have different amounts of information, it is necessary to apply different numbers of convolution blocks in the encoder and decoder. The proposed network can be regarded as a symmetrically structured FCN when it has the same the number of blocks between the ith convolution groups in the encoder and decoder (N e i = N d i ) as shown in Figure 4a. An asymmetrically structured FCN can be realized if the ith convolution groups in the encoder and decoder have different numbers of block (N e i = N d i ). Figure 4b shows two cases of asymmetrically structured networks: the first is the case of a decoder that has greater depth than the encoder (  The total number of layers in the proposed network is obtained by summing the number of convolutions and batch-normalization layers in both the encoder and decoder. As each convolution block consists of one convolution and one batch-normalization layer, the number of layers at encoder L e and decoder L d are calculated as follows:

Decoder Network Encoder Network
Note that batch normalization is not performed in the last convolution block of the decoder. Therefore, the number of total layers L t is calculated as follows: Assume that all convolution filters used in the proposed network have the same size (F × F). As all convolution blocks in the ith convolution group have the same number of convolution filters, each block belonging to the ith group has K e i (or K d i ) convolution filters in the ith group of the encoder (or decoder). The total number of parameters is obtained by summing the number of weights and biases of the convolution layers and the number of parameters of the batch-normalization layers. The number of parameters at encoder M e is calculated as follows: At the decoder, the number of parameters M d is given by: In Equations (6) and (7), the first and second terms on the right side indicate the number of weights and biases in the convolution layers and the third term is the number of parameters in the batch-normalization layers. Therefore, the total number of parameters in the proposed network M t is given by: As mentioned before, the number of convolution filters in the first group of the encoder and the last group of the decoder are fixed to eight and three, i.e., K e 1 = 8 and K d 1 = 3. For other groups, we design that the number of convolution filters is increased by a power of two (K e i = K d i = 2 i+2 , for 2 ≤ i ≤ 6). All convolution filters are designed to be of (3 × 3) size (F = 3). Figure 5 shows the training and inference processes of the proposed network. The training process is indicated by blue dashed arrows and the inference process by red solid arrows.  In the training process, the 2D projected reflection-intensity image and corresponding RGB color image are used as the dataset. The projected reflection image is used as input for the proposed model and the corresponding color image is used as the target image that is the ground truth (GT). Because the tanh function is used as the activation function of the last convolution group, the dynamic range of the output image to be generated is [−1, 1]. Thus, the GT color image is converted to the same dynamic range, where each color component is mapped to the dynamic range independently. The loss function is mean-squared error (MSE) between target T and generation G images, as follows:

Training and Inference Processes
where, m and n are the width and height of the image and c indicates the number of channels.
In the inference process, RGB components are generated through the proposed color image-generation network with training parameters. As the RGB components have a dynamic range of [−1, 1], the final generated RGB color image is obtained by conversion to the range of [0, 255].

Experimental Results
In this section, we describe the configuration of the dataset for the simulation, the hyperparameters for learning, and the quality measure metrics. We evaluate the performance of the proposed color image generation with varying depths of the encoder and decoder networks. The proposed method is also compared with conventional interpolation methods and GAN based Pix2Pix [23].

Evaluation Dataset
For simulations, our dataset is reconstituted from the raw KITTI dataset [34], which is recorded under various driving environments such as road, city, and residential areas during the day. For time consistency, we combine a pair of LiDAR 3D reflection intensity and stereo color image that are captured at the same time. A Velodyne HDL-64E rotating 3D laser scanner is used [4]. The 2D projected reflection image is obtained by projecting the LiDAR 3D reflection intensity onto the target image plane as mentioned in Section 3.1. The corresponding color image is obtained by randomly selecting one of the left and right images. Paired images with the 2D projected reflection and color image are used as the evaluation dataset for the color image-generation networks. When constructing the evaluation datasets, we manually exclude image pairs recorded under heavy shadows in order to generate shadow-free color images. Both the 2D projected reflection and color images are cropped to size of 1184 × 224 so that they have the same valid area. These are then resized into 592 × 112 by sub-sampling for simplicity of simulation. The evaluation dataset consists of 4308 image pairs, which are 2872 pairs for training, 718 for validation, and 718 for testing. The three sets have similar distributions of scene categories (city, residential, and road), color camera positions (left and right), and temporal index. Detailed descriptions of the evaluation dataset are summarized in Table 1. The number of image pairs for training, validation, and testing are listed and the numbers of left and right images are indicated separately.  Figure 6 shows the histogram of the number of valid LiDAR points in the 2D projected reflection image. The maximum and minimum numbers of valid points are 4916 (7.41%) and 1832 (2.76%). In average, the number of points is 3502, which means the sparseness ratio is 5.28%. The reflection images are very sparse and even irregular compared to the target color image.

Hyper-Parameters for Training
Because the amount of information on the 2D projected reflection image is very small, the number of blocks in the first convolution group of the proposed image-generation network is fixed to one (N e 1 = 1). Accordingly, the final group has also one convolution block (N d 1 = 1). The last encoder group and the first decoder group are also set to have one convolution block (N e 6 = N d 6 = 1). The remaining convolution groups are designed with different numbers of convolution blocks. For simplicity of the performance evaluation, however, we assume that these groups have the same number of blocks in the encoder and decoder, respectively, that is N e i N e and N d i N d for i = 2, 3, 4, 5. The performance of the proposed network are evaluated with different values of N e and N d . Hereinafter, we use the notation M-N e -N d , which denotes the proposed network model with block numbers N e and N d .

Measurement Metrics
To evaluate the performance of the generated image, PSNR [28] and SSIM [28,29] are used. The PSNR between the generated color and GT images is calculated based on the RGB components as follows: where MSE is given in Equation (9). SSIM is calculated using only the gray-scale Y component [37], as follows: where µ G and µ T represent the average of the generated and target images, σ 2 G and σ 2 T are variances, and σ GT is covariance. The positive constants C 1 (= 0.0001) and C 2 (= 0.0009) are used to avoid a null denominator.

Experiments with Symmetrically Structured Network
In this subsection, we analyze the performance of symmetrically structured networks, which consist of the same number of convolution blocks in both the encoder and decoder (N e = N d ). When varying the number of blocks from one to ten, the numbers of layers and of parameters are calculated as listed in Table 2. Note that M-N e -N d denotes a network with N e and N d blocks at each convolution group (i = 2, 3, 4, 5). The encoder has one more layer than the decoder because the sixth group of the decoder does not have a batch-normalization layer. The decoder has more parameters than the encoder because of the increasing filter dimension. Table 3 shows the performance results according to the total number of layers in the symmetric model. For the case of the symmetric structure, the proposed network with 103 layers, i.e., M-6-6 (N e = N d = 6), produces the best performance on average in terms of both PSNR and SSIM. If the network has more than 103 layers, the vanishing-gradient-problem [38] occurs, resulting in poor performance.

Experiments with Asymmetrically Structured Network
In this subsection, we analyze asymmetrically structured networks, which consist of different numbers of convolution blocks in the encoder and decoder. First, the numbers of layers and parameters are examined and listed in Table 4. This shows that the number of parameters as well as the number of layers are not changed when the sum of N e and N d is constant. For example, M-3-9, M-6-6, and M-9-3 have the same number of total layers (103) and the number of total parameters (3,350,243). Figure 7 shows the PSNR performance results with respect to N e for the case of using four different total numbers of layers. For example, the red color curve with cross shows the average PSNR over different N e when N e + N d = 12, i.e., when using a total of 103 layers. M-3-9 (N e = 3) achieves the best performance. In other cases, N e = 3 also achieves the best performance. In general, it is not necessary for the coverage of convolution filter to be much larger than the input image. For the encoder network, the filter coverage is determined by the number of sub-sampling R sp and the number of convolution blocks. Considering the encoder, except for the first and the last convolution groups, the coverage of convolution filters can be derived with respect to the input image size when using (3 × 3) filters. It is reasonable that the coverage is not greater than the size of the input image, as follows: where m and n are the width and height of the image. In this equation, the left term is the coverage of the convolution filters.  For our case, the image size is (592 × 112) and the coverage of filters is 112 when N e = 3. This can be why the performance decreases when N e is greater than 3. From Figure 7, it is observed that the performance can increase when the layer depth of the decoder is deeper than that of the encoder. For example, the red color curve with cross has better performance than the blue color curve with triangle at the same N e . Typically, M-3-9 has higher average PSNR than M-3-5. It is necessary to analyze in detail the performance according to the depth of the decoder. Table 5 shows the performance in terms of PSNR and SSIM according to the depth of the decoder, where the depth of the encoder is fixed to 28 layers (N e = 3). The performance monotonically increases until N d reaches nine. It decreases when N d is greater than nine. In particular, it drops abruptly from N d = 12. M-3-9 achieves the best performance. As in the case of the symmetrically structured network, the asymmetric network also has the vanishing-gradient-problem for more than 103 layers. Even though M-6-6 and M-3-9 have the same numbers of layers and parameters (103 layers and 3,350,243 parameters), M-3-9 has better performance than M-6-6. This is a typical example of the fact that asymmetric networks have better performance than symmetric networks. Therefore, it is recommended that the proposed FCN be designed to have a deeper decoder than encoder for sparse input data such as LiDAR 2D reflection intensity.

Comparison to State-of-the-Art
The two conventional interpolation methods, such as inverse distance weighted (IDW) and nearest neighbor (NN), and GAN based Pix2Pix [23] are tested for the performance evaluation of the proposed method. Because the interpolation results are only gray-scale images, PSNR is calculated using only the gray-scale Y component of the GT image. In case of the GAN based Pix2Pix, generator network depth is changed to adapt the spatial resolutions of input and output images used in the simulation. As shown in Table 6, the proposed FCN with M-3-9 has, on average, '8.23 dB and 3.41 dB' higher PSNR and '0.3 and 0.11' greater SSIM than IDW interpolation and Pix2Pix, respectively.  Figure 8 compares the subjective visual qualities for various methods. In case of Figure 8a, the two interpolation methods cannot generate the object colors in nature. Also, the bus located at the top-middle portion of image is hard to be recognizable. In case of Pix2Pix, objects are generated with their colors but are heavily blurred. One interesting phenomenon in Pix2Pix is that the bus is generated at the different location and that pedestrian is disappeared. On the contrary, the proposed method generates objects at the correct location and with similar colors, compared to others. All the methods produce poor visual qualities with low PSNR and SSIM in Figure 8b. The whole image generated by Pix2Pix is extremely blurred and it is completely unrecognizable. The curbstone is highly blurred and indistinguishable in the proposed method, but other objects are relatively well-generated. Figure 8c shows that IDW generates better objects such as car, bicycle, and person than NN. And the similar trends are observed in others.

2D Projected Reflection Intensity Images (Input Images)
Ground Truth Images (Target Images)   Figure 9 shows additional example images generated from LiDAR data that were captured under heavily-shadowed environments. Notice that these heavily-shadowed images are not used in training and validation. Heavy shadows can be seen in the GT images but is almost removed in the generated color images. This means that the proposed FCN-based color image-generation network can generate shadow-free color images from LiDAR sensor data.

Ground Truth Images
Proposed Method Generated Images 2D Projected Reflection Intensity Images Figure 9.
Additional inference examples of GT images with heavy shadows.

Conclusions
In this paper, we propose a color image-generation method from LiDAR 3D reflection intensity. The proposed method consists of 3D-to-2D projection and an image-generation network. For the image-generation network, an asymmetrically structured FCN is designed considering the sparseness of the projected reflection image.
Through simulations, it is shown that the proposed method generates fairly good visual quality of images while maintaining almost the same color as the GT image. In particular, the asymmetrically structured FCN with a deeper decoder than encoder generates a higher-quality color image. Until the total number of layers reaches a certain number, the quality of the generated image monotonically increases. We also prove that the proposed method produces improvements of '8.23 dB and 3.41 dB' in PSNR and '0.3 and 0.11' in SSIM over the conventional interpolation methods and Pix2Pix, respectively. In addition, the proposed FCN-based color image-generation network can generate shadow-free color images from LiDAR sensor data. We expect that the proposed method can generate daytime color images at night because the same LiDAR data can be obtained whether it is day or night. This means that the proposed method could be very useful for developing various nighttime driving assistance systems. These results can help developers design FCN-based image-generation networks from very limited information and/or different kinds of sensing data.

Conflicts of Interest:
The authors declare no conflict of interest.