Color Image Generation from LiDAR Reflection Data by Using Selected Connection UNET

In this paper, a modified encoder-decoder structured fully convolutional network (ED-FCN) is proposed to generate a camera-like color image from a light detection and ranging (LiDAR) reflection image. Previously, we showed the possibility of generating a color image from a heterogeneous source using the asymmetric ED-FCN. In addition, modified ED-FCNs, i.e., the UNET and the selected connection UNET (SC-UNET), have been successfully applied to biomedical image segmentation and concealed-object detection for military purposes, respectively. In this paper, we apply the SC-UNET to generate a color image from a heterogeneous image, and analyze various connections between the encoder and decoder. The LiDAR reflection image has only 5.28% valid values, i.e., its data are extremely sparse. This severe sparseness limits the generation performance when the UNET is applied directly to this heterogeneous image-generation task. We present a methodology for selecting connections in the SC-UNET that considers the sparseness of each level in the encoder network and the similarity between the same levels of the encoder and decoder networks. Simulation results show that the proposed SC-UNET with connections between the encoder and decoder at the two lowest levels yields improvements of 3.87 dB in peak signal-to-noise ratio and 0.17 in structural similarity over the conventional asymmetric ED-FCN. The methodology presented in this paper would be a powerful tool for generating data from heterogeneous sources.

There have been recent studies on generating camera-like images from LiDAR data [10,12]. LiDAR-to-color image generation is useful in various applications such as vehicle night-vision systems, night surveillance sensors, and military night-vision devices. An encoder-decoder structured fully convolutional network (ED-FCN) [15] is used for image generation from heterogeneous data in [10,12], as shown in Figure 1a. One interesting result discussed in [10,12] is that shadow-free images are generated, since the LiDAR reflection data are produced irrespective of illumination changes. This is a very useful property for visual assistance in night driving. Monochrome images can be generated from the LiDAR reflection data by using the ED-FCN [10], and color images by using an asymmetric ED-FCN [12].

In this paper, we propose to use SC-UNET structures for camera-like color image generation from LiDAR reflection data. It should be noted that the input reflection data are extremely sparse while the output image is dense. Because of this difference in sparseness, the feature maps in the encoder and decoder have different characteristics in terms of sparseness and similarity, and these differences also vary across levels due to the network structure. In this paper, the sparseness of the feature maps is analyzed based on receptive fields for each level of the ED-FCN. In addition, the similarities between feature maps of the encoder and decoder are empirically analyzed using a dataset recorded under various driving environments. Based on these analyses, we propose a methodology for selecting connections in the SC-UNET-based image-generation network. The connections between feature maps of the encoder and decoder parts in the proposed network are determined by considering the sparseness of each level in the encoder network and the similarity between the same levels of the encoder and decoder parts.
The rest of this paper is organized as follows. In Section 2, we propose a network structure to generate a camera-like 2D color image from the 3D LiDAR data. The training and inference processes are also described. In Section 3, the performance of the proposed network is compared with the conventional ED-FCN and UNET networks. Section 4 draws the conclusions.

Proposed Method
In this section, we propose an image-generation network that generates a color image from the heterogeneous LiDAR reflection intensity. First, the ED-FCN-based image-generation system [10,12] is analyzed with respect to sparseness and similarity. Then, the conventional SC-UNET architecture used for terahertz image segmentation [23] is re-purposed and adapted to heterogeneous image generation based on these analyses. Figure 2 shows the ED-FCN-based image-generation system proposed in our previous works [10,12] and its feature maps at each level. In the pre-processing stage, the 3D LiDAR point cloud is converted into a 2D LiDAR reflection-intensity image using a 3D-to-2D projection matrix. The reflection image has the same spatial resolution as the RGB color image to be generated. The color image is finally generated from the reflection image using the ED-FCN, which consists of five levels with two convolution blocks each. At each level of both the encoder and decoder parts, C_L (= 2^(4−L)·N) feature maps, denoted as F_e^L and F_d^L, are obtained, where L and N indicate the level number and the filter number of the convolutional block at level 4, respectively. The dimension of the feature maps and the kernel size of the convolution filter are W_L × H_L × C_L and 3 × 3 × C_L, respectively. The feature maps of the encoder and decoder parts are visualized with representative feature maps, R_e^L and R_d^L, respectively, in which each pixel is the maximum value over the feature-map channels:

R_e^L(i, j) = max_c F_e^L(i, j, c),   R_d^L(i, j) = max_c F_d^L(i, j, c).
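The channel-wise maximum used to build the representative feature maps can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' code; the toy feature-map values are invented for demonstration.

```python
import numpy as np

def representative_map(feature_map):
    """Collapse an (H, W, C) feature map to an (H, W) representative map
    by taking the channel-wise maximum at every pixel."""
    return feature_map.max(axis=-1)

# Toy feature map: 2 x 2 spatial resolution, 3 channels.
F = np.array([[[0.1, 0.5, 0.2], [0.0, 0.0, 0.9]],
              [[0.3, 0.1, 0.1], [0.7, 0.2, 0.4]]])
R = representative_map(F)   # shape (2, 2); e.g. R[0, 0] = 0.5
```

The same reduction is applied to both the encoder maps F_e^L and the decoder maps F_d^L before their similarity is measured.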

Sparseness and Similarity of ED-FCN
The input reflection image is extremely sparse, i.e., its sparseness is 94.72%. This means that only 5.28% of the pixels in the reflection image have non-zero valid values, and these are irregularly distributed. In the encoder, the sparseness of the feature map decreases as the level approaches the transition between the encoder and decoder parts, i.e., level 0. The feature map at the transition is completely dense (sparseness 0%). This is caused by the enlargement of the receptive field through a series of convolution and pooling operations. On the other hand, all the feature maps of the decoder part are dense. A detailed analysis of the relationship between the receptive field and sparseness is presented in Appendix A. If the UNET structure is directly applied to the image-generation network, the sparse feature map in the encoder is combined with the dense one in the decoder at a higher level. For example, given that the encoder feature map has n% non-zero values, (100 − n)/2% of the concatenated feature map is invalid and has an undesirable effect on generating the next feature map in the decoder. If the influence of the activation function is neglected, the percentage of non-zero values (n%) at each encoder level can be estimated by calculating the size of the receptive field. Accordingly, it is reasonable to apply the SC-UNET architecture, which concatenates feature maps only at the levels at which the sparseness is lower than a certain value.
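The effect of concatenating a sparse encoder map with a dense decoder map can be illustrated with a small NumPy sketch (not from the paper; the array shapes and the 25% fill rate are arbitrary toy values):

```python
import numpy as np

def sparseness(x):
    """Percentage of zero-valued entries in an array."""
    return 100.0 * np.count_nonzero(x == 0) / x.size

# Encoder map with n% non-zero values, concatenated channel-wise with a
# fully dense decoder map of the same shape: zeros then make up
# (100 - n)/2 percent of the concatenated tensor.
enc = np.zeros((4, 4, 8))
enc[..., :2] = 1.0           # 2 of 8 channels valid -> n = 25
dec = np.ones((4, 4, 8))     # decoder map is completely dense
cat = np.concatenate([enc, dec], axis=-1)

n = 100.0 - sparseness(enc)       # 25.0
invalid_share = sparseness(cat)   # (100 - 25) / 2 = 37.5
```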
As shown in Figure 2, the input reflection intensity and output color images have completely different visual characteristics. The encoder and decoder feature maps at higher levels have characteristics similar to those of the reflection intensity and camera image, respectively. On the contrary, the feature maps of the encoder and the decoder have more characteristics in common at a lower level. To verify these properties, the similarity S_L [24,25] between the representative feature maps at level L is measured as follows:

S_L = <R_e^L, R_d^L> / (||R_e^L||_2 · ||R_d^L||_2),

where <·,·> and ||·||_2 denote the inner product and the L2 norm, respectively.
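This similarity is the cosine similarity between the two maps flattened to vectors. A minimal NumPy sketch (illustrative values, not the paper's data):

```python
import numpy as np

def cosine_similarity(a, b):
    """S_L = <a, b> / (||a||_2 * ||b||_2) for flattened representative maps."""
    a, b = a.ravel(), b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy representative maps for an encoder/decoder pair.
r_enc = np.array([[1.0, 0.0], [0.0, 1.0]])
r_dec = np.array([[1.0, 1.0], [1.0, 1.0]])
s = cosine_similarity(r_enc, r_dec)   # 2 / (sqrt(2) * 2) ~ 0.707
```

A value near 1 means the two maps point in almost the same direction in pixel space; the levels with high S_L are the candidates for connection.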

Figure 2. Encoder-decoder structured fully convolutional network (ED-FCN)-based color image-generation network from light detection and ranging (LiDAR) reflection data; the network has five levels including the transition level (level 0); for each level, the similarities between representative feature maps of the encoder and decoder parts are provided; the kernel size of the convolution filter is 3 × 3 × C_L, where C_L = 2^(4−L) × N represents the number of channels at level L.

As shown in Figure 2, the similarity increases as the spatial resolution of the feature map decreases. For example, the similarity between the input reflection and output color images is very low, i.e., 0.192. However, the similarity at level 1 is quite high, i.e., 0.821. Clearly, it is reasonable to concatenate feature maps with high similarity.
From the above analysis, the sparseness of the encoder feature map and the similarity between the encoder and decoder feature maps should be considered when designing the concatenation structure in an image-generation network.

Proposed Network Architectures
In this section, we present five types of network architecture for color image generation, as shown in Figure 3. ED-FCN represents the conventional architecture without any connection. UNET is also a conventional architecture, with feature-map connections between the encoder and decoder parts at every level. The proposed architectures are image-generation networks based on SC-UNET structures and are denoted as SC-UNET w/Lv(a,b,c), which indicates an SC-UNET architecture with connections between the encoder and decoder at levels a, b, and c. Note that UNET is SC-UNET w/Lv(1,2,3,4). All architectures consist of fully convolutional networks and have the following common structure. A single-channel sparse 2D reflection-intensity image (592 × 112 × 1) is obtained from the 3D LiDAR points and is used as the input to the image-generation network. The output of the generation network is a three-channel color image (592 × 112 × 3). The encoder and decoder parts of the network are constructed with five levels, considering the size of the input image. Each level consists of two convolution blocks and one sampling layer. Each convolution block is composed of a convolution layer, an exponential linear unit (ELU) activation function [26], and a batch-normalization layer [27], in consecutive order. Each convolution layer consists of 2^(4−L)·N filters of size 3 × 3, as shown in Figure 3. In each convolution layer, stride 1 and zero-padding are applied. In the encoder part, max pooling with factor 2 is applied for downsampling. In the decoder part, deconvolution [28] with stride 2 and zero-padding is applied for upsampling. As the level number of the encoder decreases by one, the number of feature-map channels is doubled; as the level number of the decoder increases by one, the number of feature-map channels is halved.
At the end of the decoder part, the N-channel feature map is transformed into three color channels (R, G, B) by applying three 1 × 1 × N convolution layers with sigmoid activation, s(x) = 1/(1 + e^−x). Notably, batch normalization is not applied to these 1 × 1 × N convolution layers.
The conventional UNET and the three proposed architectures have connections between the encoder and decoder parts, unlike the ED-FCN. The encoder feature map at a certain level is connected to the decoder feature map at the same level in the form of concatenation. The concatenated feature map is fed to the convolution block of the decoder part.
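As a quick sketch (not the authors' code), the per-level channel counts and spatial sizes implied by the description above can be computed directly; only the 592 × 112 input size, the factor-2 sampling, and C_L = 2^(4−L)·N are taken from the text.

```python
def level_dims(L, N=16):
    """Channel count C_L = 2^(4-L) * N and spatial size at level L,
    assuming the 592 x 112 input is halved once per level below level 4."""
    f = 2 ** (4 - L)    # downsampling / channel-growth factor
    return f * N, 592 // f, 112 // f

# Level 4 is the input side; level 0 is the transition level.
dims = {L: level_dims(L) for L in range(5)}
# e.g. level 0 with N = 16 -> 256 channels at 37 x 7
```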
The results of the analysis in Section 2.1 and the number of weights in the encoder feature map at each level are summarized in Table 1. The following observations are derived:

Observation 1. At a low level, only a small amount of valid information can be transferred to the decoder side via concatenation. For example, the amount of feature-map data to be transferred is very limited if only level 1 is concatenated.

Observation 2. At a high level, the encoder feature map has high sparseness. For example, the structure having a single connection at level 4 is expected to have limited performance due to the small number of valid pixels.

Observation 3. At a low level, the similarity between the feature maps of the encoder and decoder parts increases. For example, the structure with a single connection at level 4 is expected to have limited performance due to the very different characteristics of the encoder and decoder feature maps.

In summary, it is necessary to concatenate multiple levels in terms of the amount of transferred information, and it is desirable to concatenate feature maps at the low levels. Accordingly, we propose the architectures SC-UNET w/Lv(1), w/Lv(1,2), and w/Lv(1,2,3).

Training and Inference Processes
In the training process, the 2D LiDAR reflection-intensity images and the corresponding RGB color images are used as the input data and target data of the image-generation network, respectively. Because the sigmoid function [29] is used as the activation function of the last convolution layer, the dynamic range of the generated output data is (0, 1). Thus, the target color images are converted to the same dynamic range for training. As in [10,12], the mean square error (MSE) is used as the loss function.
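The MSE loss over images scaled to (0, 1) is straightforward; this minimal NumPy sketch (toy 2 × 2 "images", not the paper's data) shows the computation:

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error between generated and target images,
    both scaled to the (0, 1) range as described above."""
    return float(np.mean((pred - target) ** 2))

target = np.array([[0.0, 1.0], [0.5, 0.5]])
pred   = np.array([[0.0, 0.5], [0.5, 1.0]])
loss = mse_loss(pred, target)   # (0 + 0.25 + 0 + 0.25) / 4 = 0.125
```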
For the training hyper-parameters, the proposed network architectures are trained for a maximum of 2000 epochs. The adaptive moment estimation (Adam) solver [30] is applied, with batch size 4, learning rate lr = 5 × 10^−4, and parameters β1 = 0.9, β2 = 0.999, and ε = 10^−8. The early stopping technique, with a patience parameter of 25 on the validation loss, is applied [31].
In the inference process, three-channel images with the dynamic range (0, 1) are generated through the proposed color image-generation network. Finally, RGB color images are obtained by converting each channel to the dynamic range of (0, 255).

Simulation Environment and Results
This section describes the simulation environments and evaluation metrics. The performance of the proposed architectures is evaluated and compared with the conventional architectures.

Simulation Environment
The evaluation dataset was reconstituted from the raw KITTI dataset [32], as in [10,12]. The dataset consisted of pairs of projected LiDAR reflection images and color images that were recorded simultaneously. Pairs recorded under heavy shadows were not included to enable shadow-free color image generation. For more details on the dataset, refer to [10,12]. The evaluation dataset consisted of a total of 4300 pairs. The pairs were randomly selected and divided into five folds for k-fold cross validation (k = 5) [33,34]. Both LiDAR reflection and color images had the same resolution of 592 × 112 (66,304 pixels). The reflection image had an average of 3502 valid values.
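The quoted dataset statistics are mutually consistent, as a quick arithmetic check shows (plain Python; numbers taken from the text above):

```python
# 592 x 112 = 66,304 pixels per image; an average of 3502 of them are
# valid, i.e. about 5.28% valid data (94.72% sparseness).
total_pixels = 592 * 112                  # 66,304
valid = 3502
valid_pct = 100.0 * valid / total_pixels  # ~5.28
sparseness_pct = 100.0 - valid_pct        # ~94.72
```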
The peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) were used to evaluate the image quality between the generated and target color images [35]. PSNRs were separately calculated for each R, G, and B channel and the average PSNR was used for evaluation. In contrast, only the gray-scale image was used for the measurement of SSIM.
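The per-channel PSNR averaging described above can be sketched as follows. This is an illustrative NumPy implementation, not the authors' evaluation code, and it covers only the PSNR part (SSIM, computed on the gray-scale image, is omitted here); the toy images are invented.

```python
import numpy as np

def psnr(channel_a, channel_b, peak=255.0):
    """PSNR of one channel in dB: 10 * log10(peak^2 / MSE)."""
    diff = channel_a.astype(np.float64) - channel_b.astype(np.float64)
    return 10.0 * np.log10(peak ** 2 / np.mean(diff ** 2))

def avg_rgb_psnr(img_a, img_b):
    """Average of the per-channel PSNRs for (H, W, 3) uint8 images,
    as in the evaluation described above."""
    return float(np.mean([psnr(img_a[..., c], img_b[..., c]) for c in range(3)]))

# Toy images differing by a constant offset of 16 in every channel.
a = np.zeros((2, 2, 3), dtype=np.uint8)
b = np.full((2, 2, 3), 16, dtype=np.uint8)
val = avg_rgb_psnr(a, b)   # MSE = 256 per channel -> ~24.05 dB
```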
The hardware used in the simulation was a workstation with Intel Core i7-6850 CPU 3.60GHz and Nvidia Titan X Pascal GPU. The software environments were Ubuntu 16.04, Python 3.5.6, Tensorflow 1.13.1 [36], and Keras 2.3.1 [37].

Performance of the Proposed SC-UNET-Based Architectures
The validity of the selected connections in the UNET structure was investigated for camera-like RGB color image generation from the sparse 2D LiDAR reflection image. The three proposed architectures, namely SC-UNET w/Lv(1,2,3), w/Lv(1,2), and w/Lv(1), as shown in Figure 3c-e, were evaluated in this simulation. Two conventional networks, ED-FCN and UNET, were used for the performance comparison. To determine the performance variation with respect to the number of filters N in the convolution layer, we conducted experiments for N = 16, 32, 48, and 64. As mentioned in Section 3.1, five-fold cross validation was applied in all the experiments. The PSNR, SSIM, and their corresponding standard deviations are summarized in Table 2. For the evaluation of computational complexity, the number of weights in the network and the processing time measured in milliseconds per frame were analyzed. To analyze the effect of a single-level connection in the proposed architecture, the simulation results for SC-UNET w/Lv(1), w/Lv(2), w/Lv(3), and w/Lv(4) were also summarized. For comparison with our previous work [12], all methods were also tested on the same dataset used in [12], and the performance of the asymmetric ED-FCN [12] is listed in Table 2.
As N increased, the PSNR and SSIM of all the architectures improved. Notably, the numbers of weights increased with respect to N; in other words, the computational complexity and memory requirements increased. Therefore, it was necessary to select an appropriate value of N according to the applications and available resources.
UNET provided better image-quality performance than ED-FCN. This demonstrated that the connection between the encoder and decoder was useful, even in heterogeneous image generation. Among the single-level connections of SC-UNET, SC-UNET w/Lv(1) showed the best performance, and both SC-UNET w/Lv(1) and SC-UNET w/Lv(2) outperformed UNET. This meant that connections at higher levels were not appropriate. SC-UNET w/Lv(1,2,3) showed better performance than UNET. On the contrary, the proposed architectures with connections at higher levels, i.e., SC-UNET w/Lv(3), w/Lv(4), and w/Lv(3,4), yielded better image quality than ED-FCN but worse quality than UNET. SC-UNET w/Lv(1,2) outperformed all the other architectures, including SC-UNET w/Lv(1,2,3). SC-UNET w/Lv(1,2) with N = 48 and 64 had better image-quality performance than the asymmetric ED-FCN. In particular, SC-UNET w/Lv(1,2) with N = 64 produced improvements of 3.87 dB in PSNR and 0.17 in SSIM over the asymmetric ED-FCN. These results confirmed the validity of the observations presented in Section 2.2.
As shown in Table 1, the feature map at level 1 was fully dense and the similarity between the encoder and decoder feature maps was 0.821. Similarly, the sparseness and similarity at level 2 were 0.92% and 0.573, respectively. Therefore, the encoder feature maps at levels 1 and 2 could provide useful information for image generation at the decoder part. In contrast, the sparseness and similarity at levels 3 and 4 were 8.72% and 0.567, and 42.63% and 0.355, respectively. Considering both sparseness and similarity, the encoder feature maps at levels 3 and 4 had less relevance to the decoder feature maps. This implies that connections at levels 3 and 4 could have an undesirable influence on the image-generation performance, and it explains why SC-UNET w/Lv(1,2) yielded the best performance and SC-UNET w/Lv(3,4) yielded the worst performance among the networks with connections, including UNET. These simulation results provide the insight that the connections in the SC-UNET should be selected by considering the sparseness of each level in the encoder network and the similarity between the same levels of the encoder and decoder networks.

In Figure 4a, the two networks without connections between encoder and decoder feature maps, i.e., the ED-FCN and the asymmetric ED-FCN, generate very blurry objects, such as the white vehicle and the white road pole with red stripes. Contrarily, the UNET and SC-UNET w/Lv(1,2) reproduce those objects in detail. The UNET distorts the short-distance black vehicle on the left side, but the proposed method faithfully generates it. Figure 4b shows that the ED-FCN does not generate the tire wheel and the small wall on the right side, whereas the proposed SC-UNET w/Lv(1,2) generates them more clearly than all the others. In Figure 4c, all networks generate images with high visual quality, and the trends mentioned above are again observed. In summary, the ED-FCN and asymmetric ED-FCN generate blurry images; the proposed method faithfully generates images, while the UNET occasionally produces serious distortion.

Conclusions
In this paper, we propose an SC-UNET architecture that effectively generates a camera-like RGB color image from a heterogeneous sparse LiDAR reflection-intensity image. The sparseness of the encoder feature map and the similarity between the encoder and decoder feature maps are analyzed at each level of the conventional ED-FCN. At high levels, the sparseness increases and the similarity decreases; it is therefore not reasonable to concatenate feature maps at high levels when designing an SC-UNET architecture for image generation. Accordingly, SC-UNET architectures with concatenation at low levels are proposed. Through simulations, we show that the proposed SC-UNET w/Lv(1,2), i.e., the SC-UNET with concatenations at levels 1 and 2, outperforms the other architectures, including the asymmetric ED-FCN, in terms of both the objective and subjective quality of the generated images. In particular, SC-UNET w/Lv(1,2) with N = 64 produces improvements of 3.87 dB in PSNR and 0.17 in SSIM over the asymmetric ED-FCN.
It is very important to consider the sparseness and similarity in determining the levels to be concatenated between feature maps of the encoder and decoder. The methodology is very useful in various applications where the input and output have different sparseness and heterogeneous characteristics.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A. Analysis of Sparseness Using Receptive Field
In this appendix, we provide the analysis of the relationship between sparseness and the receptive field in the ED-FCN network, as shown in Figure 2.
In general, neglecting the effect of the bias in the convolution layer and the effect of the activation layer, the size of the effective receptive field is [2^P(2B + 1)] × [2^P(2B + 1)], where B and P represent the cumulative number of convolution blocks with a 3 × 3 filter and the cumulative number of pooling operations, respectively, as shown in Table A1 [38,39].
In the case of Figure 2, the size of the receptive field RF_L at encoder level L is as follows:

RF_L = [2^(4−L)(2(10 − 2L) + 1)] × [2^(4−L)(2(10 − 2L) + 1)]. (A1)

Table A1. Numbers of pooling and convolution operations; P_L and B_L denote the numbers of pooling and convolution operations at level L, respectively; P and B denote the cumulative numbers of pooling and convolution operations, respectively.

L   P_L   B_L   P   B
4   0     2     0   2
3   1     2     1   4
2   1     2     2   6
1   1     2     3   8
0   1     2     4   10

For example, the size of the receptive field is 52 × 52 at encoder level 2. This means that if there is at least one non-zero value within the 52 × 52 square kernel centered at a certain pixel in the reflection-intensity image, the corresponding pixel in the feature map has a non-zero value after the series of convolution and pooling operations. Such a pixel is called a "valid pixel" in this appendix. For evaluation, 4300 projected reflection-intensity images are used. The percentage of valid pixels over all the pixels in the evaluation images is calculated with respect to the receptive field, as shown in Figure A1. For all the pixels in the feature map to be valid, the size of the receptive field should be larger than 101 × 101. The receptive-field size and sparseness at each level, according to these results, are summarized in Table A2. In the case of encoder level 2, the feature map has 99.08% valid pixels; in other words, the sparseness is 0.92%. Notably, the sparseness for the receptive field of size 52 × 52 is the average of the sparseness values for 51 × 51 and 53 × 53, as the size of the receptive field should be odd owing to the characteristics of the convolution operation.
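Equation (A1) can be checked numerically against Table A1. This short sketch (plain Python, illustrative only) evaluates the receptive-field side length at each encoder level:

```python
def receptive_field(L):
    """Side length of the receptive field at encoder level L, per (A1):
    RF_L = 2^(4-L) * (2 * (10 - 2L) + 1), using the cumulative pooling
    count P = 4 - L and convolution count B = 10 - 2L of Table A1."""
    P = 4 - L
    B = 10 - 2 * L
    return 2 ** P * (2 * B + 1)

sizes = {L: receptive_field(L) for L in range(5)}
# e.g. level 2 -> 52, matching the 52 x 52 receptive field quoted above
```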