Birds Eye View Look-Up Table Estimation with Semantic Segmentation

Abstract: In this work, we studied the estimation of a look-up table (LUT) that converts a camera image plane to a birds eye view (BEV) plane using a single camera. Traditional camera pose estimation methods incur high costs in the research and manufacturing of autonomous vehicles and may require pre-configured infrastructure. This paper proposes a camera calibration system for autonomous driving that is low cost and requires little infrastructure. We studied a network that estimates the camera pose under urban road driving conditions using a single camera and outputs an image in the form of an LUT that converts the input image into a BEV. We propose a network that predicts human-like poses from a single image. We collected synthetic data using a simulator, generated the BEV and LUT as ground truth, and used the proposed network and ground truth to train the pose estimation function. In the process, the network predicts the pose by deciphering semantic segmentation features, and its performance is increased by attaching a layer that handles the overall direction of the network. The network also outputs the camera angles (roll/pitch/yaw) in the 3D coordinate system so that the user can monitor learning. Since the network's output is an LUT, no additional calculation is needed, and real-time performance is improved.


Introduction
Autonomous driving is currently receiving much attention, with many research institutes and companies conducting related research. The autonomous vehicle field can be divided into three major categories: recognition, judgment, and control. In particular, the recognition area is developing rapidly along with machine learning. In addition, estimating 3D data from 2D recognition data obtained with a camera is an essential element of autonomous driving. Estimating the pose of a camera attached to an autonomous vehicle is one method for outputting 3D data. In this paper, we propose a network that predicts an LUT [1] that transforms a camera image plane to a BEV [2] plane, and we aim to estimate the pose of the camera through this.
A camera is a sensor based on 2D data projected through a lens, so it is almost impossible to estimate perfect 3D data. Using a single camera to obtain the distance to an object (that is, the depth) is possible only with non-occluded data, and the texture of the object with the Z-value cannot be accurately obtained. Therefore, we considered the free space to be a relatively strong feature for estimating a pose in the image plane, and as an initial goal we studied pose estimation based on the feature points in the area specified as free space. A semantic segmentation network [3][4][5][6] was selected as a backbone network to construct a network that outputs the LUT by utilizing the feature points of the free space domain.
A BEV was used to show depth intuitively from camera data. This BEV is similar to the around view monitoring (AVM) [7] used for parking assistance [8]. However, since an object located above the ground is projected onto the ground plane, any obstacle on the road surface appears to droop compared to its original shape. Moreover, since camera-based depth lacks a reference distance, the relative distance is valid, but the absolute distance cannot be calculated. In general, a ToF [9] sensor, the size of a surrounding artificial landmark, or the motion data of a driving vehicle is used to estimate the actual distance. However, the purpose of this paper was to obtain the camera pose of a driving vehicle by using only a camera, which is a low-cost sensor, so it does not matter that the BEV provides only the relative distance.
Traditionally, to measure the pose of a camera, sensor calibration with a ToF sensor is performed, or the camera is calibrated in a tolerance calibration room [10], where markers with specified actual distances are located. However, autonomous driving is being developed with mass production as the target, so the cost of the sensors must be kept low enough to mass produce them. Therefore, we propose a system based on a single camera that requires no other sensors or infrastructure.
The rest of this paper is organized as follows: Section 2 introduces related works; Section 3 discusses the DB used in the experiment; Section 4 describes the structure and details of the entire network utilized in machine learning; Section 5 presents experiments with different methods and their results; finally, Section 6 concludes the paper and presents future work.

Related Work
Le et al. [11] claimed that dynamic object detection and pose estimation are tightly coupled tasks. When a network is constructed and trained to perform both, the results of dynamic object detection and pose estimation complement each other. Applying this point, we used semantic segmentation, which expresses the contour of an object, rather than detection based on a bounding box. The LUT was predicted using the segmentation result as a feature, and we constructed the network to estimate the roll/pitch/yaw of the camera image and to induce an interaction between the segmentation and pose estimation.
Jaderberg et al. [12] proposed the STN, a layer that warps the original image to obtain an image containing better features than the original. In this process, a fully connected layer was placed inside the STN to consider the interrelationship of all features. Following this idea, we designed the entire network to be suitable for pose estimation by creating a layer that outputs the pose by itself and giving this layer the ability to convert encoded features into poses through a fully connected layer.
Ronneberger et al. [13] proposed a general encoder-decoder and described how to efficiently perform up-sampling after down-sampling. Many studies have utilized similar methods. Our paper also deals with an image-to-image task (original image to LUT), and since encoded data are used in this process, the overall network was composed as an encoder-decoder.

Synthetic Database
We needed to collect images, semantic segmentation ground truth, and camera pose ground truth to implement the proposed network. Representative datasets such as MS-COCO [14] and KITTI [15] have such data, but their variation in camera pose is not significant. The camera roll, pitch, and yaw can each range from 0 to 360 degrees, but the variety in the open datasets does not reach that level.
Our solution was to use a simulator that simulates real environments and places. We acquired data from various camera poses by using the MORAI Sim Standard [16] (Figure 1) as a simulator. When a camera is attached to an actual vehicle, experimental data at various angles cannot be acquired due to in-vehicle structures such as windshields. With such limited data, the learned results may overfit and fail to generalize. Since this paper proposes a network that predicts the LUT for producing a BEV using pose data, a simulator was used to construct and utilize data covering various poses.

Data Collection
By utilizing the simulator's ability to provide various ground truths, data related to the camera and camera pose were acquired. The process was configured so that no separate manual labeling operation was required.
We acquired an RGB camera image as the input of the whole system, the segmentation ground truth image used for the semantic segmentation backbone, and the pose (x, y, z, roll, pitch, and yaw) expressing the camera attachment position.
Among the pose data, the roll, pitch, and yaw, which express the angle about each 3D axis, ranged from 0 to 360 degrees. In a real vehicle, the windshield makes it difficult to rotate the camera, so only a tiny change in pose compared to the entire range can be expressed. In this paper, various camera poses were constructed using the simulator, because overfitting occurs when learning from data that cover only a small part of the actual range.

LUT Generation
Because the simulator provides a camera image without lens distortion, the distortion removal procedure is omitted, an ideal camera matrix is created, and the rotation and translation matrices are generated using the extracted camera pose (roll/pitch/yaw and translation information). Here, c_x/c_y denote the principal point of the camera and f_x/f_y denote the focal length. Based on these, the camera matrix K is constructed, which is the conversion matrix between the camera's original plane and the normalized plane.
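The construction of K and the rotation matrix can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes a pinhole model without skew and a Z-Y-X (yaw, pitch, roll) composition order, which the text does not specify, and the intrinsic values are made up for the example.

```python
import math

def camera_matrix(fx, fy, cx, cy):
    """Ideal pinhole camera matrix K (no skew, no lens distortion)."""
    return [[fx, 0.0, cx],
            [0.0, fy, cy],
            [0.0, 0.0, 1.0]]

def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def rotation_matrix(roll, pitch, yaw):
    """3x3 rotation from angles in radians.

    Composed as Rz(yaw) * Ry(pitch) * Rx(roll); this order is an assumption.
    """
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cw, sw = math.cos(yaw), math.sin(yaw)
    rx = [[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]]
    ry = [[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]]
    rz = [[cw, -sw, 0.0], [sw, cw, 0.0], [0.0, 0.0, 1.0]]
    return matmul(rz, matmul(ry, rx))

# Illustrative (hypothetical) intrinsics and a 10-degree pitch.
K = camera_matrix(fx=800.0, fy=800.0, cx=640.0, cy=360.0)
R = rotation_matrix(0.0, math.radians(10.0), 0.0)
```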
In the homography production stage of image calibration, existing methods combine the 3 × 3 rotation matrix (R) and the 3 × 1 translation matrix (T) into a single 3 × 4 matrix RT, but with this method the axes before rotation are applied. Due to the gimbal lock phenomenon that occurs when large angles are rotated about an axis, the conversion may not be performed properly. Therefore, we multiplied by the 3 × 4 matrix R|T_null first and by the 3 × 4 matrix R_i|T afterwards. That is, the rotation of the coordinate axes is considered first, and the translation matrix is then applied based on a new three-dimensional orthogonal axis (3 × 3 unit matrix) rather than the rotated axis (which may be affected by gimbal lock). The details are as follows.
\[ R|T_{null} = \begin{bmatrix} r_{11} & r_{12} & r_{13} & 0 \\ r_{21} & r_{22} & r_{23} & 0 \\ r_{31} & r_{32} & r_{33} & 0 \end{bmatrix} \]
Since the RT obtained in this way is a 3 × 4 matrix, which is a transformation between a 3D homogeneous coordinate system and a 2D homogeneous coordinate system, the z-axis data are removed, based on the camera coordinates, to be used in the BEV transformation (which is a 2D-to-2D transformation, z = 0). If column 3 is deleted in RT, an effect in which the z-axis data becomes 0 is derived, and a 3 × 3 matrix is obtained.
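The effect of deleting the third column can be checked with a small Python sketch. The matrix values below are arbitrary placeholders, not taken from the paper; the point is only that, for a ground-plane point with z = 0, the reduced 3 × 3 matrix gives the same result as the full 3 × 4 matrix.

```python
def remove_third_column(rt):
    """Drop the z column (index 2) of a 3x4 matrix, leaving a 3x3 matrix."""
    return [[row[0], row[1], row[3]] for row in rt]

def matvec(m, v):
    """Multiply a matrix (nested lists) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Placeholder 3x4 RT matrix (illustrative values only).
rt = [[1.0, 0.0, 0.0, 0.5],
      [0.0, 1.0, 0.0, -0.2],
      [0.0, 0.0, 1.0, 1.5]]
rt_2d = remove_third_column(rt)

# A point on the ground plane: (x, y, z=0) in homogeneous coordinates.
full = matvec(rt, [2.0, 3.0, 0.0, 1.0])
reduced = matvec(rt_2d, [2.0, 3.0, 1.0])
```

Both `full` and `reduced` hold the same three homogeneous components, which is why the z column can be dropped for the 2D-to-2D BEV transformation.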
The homography H between the original image and the BEV is the product of RT_3rdColRemoved (the rotation matrix between the image and the BEV), the camera matrix K, and a matrix that sets the scale and moves the top-left corner of the BEV to (0, 0).
Both the x and y coordinates of the original image corresponding to the BEV points can be obtained using the homography H. The x and y coordinate values are divided by the width and height of the original image, multiplied by 65,535 (the maximum 16-bit value), converted into 16-bit data, and then stored as images named LUT_X and LUT_Y, respectively.
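The LUT encoding step can be sketched in Python as follows. This is a hedged illustration under several assumptions: H is taken to map BEV pixel coordinates to original-image coordinates, the scale factor is exposed as a parameter whose default (65,535, the largest unsigned 16-bit value) is an assumption, and coordinates outside the image are simply clamped. The function name `build_lut` is hypothetical.

```python
def build_lut(h, bev_w, bev_h, img_w, img_h, scale=65535):
    """Map each BEV pixel through homography h and store normalized
    image coordinates as integers in the range 0..scale."""
    lut_x = [[0] * bev_w for _ in range(bev_h)]
    lut_y = [[0] * bev_w for _ in range(bev_h)]
    for v in range(bev_h):
        for u in range(bev_w):
            # Homogeneous projection of the BEV pixel (u, v, 1).
            x = h[0][0] * u + h[0][1] * v + h[0][2]
            y = h[1][0] * u + h[1][1] * v + h[1][2]
            w = h[2][0] * u + h[2][1] * v + h[2][2]
            if w == 0:
                continue  # point at infinity: leave the entry at 0
            # Normalize by image size, then clamp to [0, 1].
            nx = min(max((x / w) / img_w, 0.0), 1.0)
            ny = min(max((y / w) / img_h, 0.0), 1.0)
            lut_x[v][u] = int(round(nx * scale))
            lut_y[v][u] = int(round(ny * scale))
    return lut_x, lut_y

# Example with an identity homography (illustrative only).
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
lut_x, lut_y = build_lut(identity, bev_w=3, bev_h=3, img_w=2, img_h=2)
```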

Proposed Deep Learning Network
The network (Figure 2) consists of an encoder, a decoder, an LUT generator, and a pose regressor. As loss functions, segmentation loss in the encoder, LUT loss in the LUT generator, and pose loss in the pose regressor were used.
The input of the whole network is a three-channel RGB image, and the output is the predicted pose and LUT of the BEV. The predicted pose is output through the encoder and pose regressor, and the LUT is output through the encoder, decoder, and LUT generator.
The pose regressor is attached as an add-on; connecting it gives the network an overall direction and improves its performance. The predicted pose also helps the user to intuitively monitor progress during training.
Outputting the result in LUT form reduces the post-processing time, because the BEV can be produced immediately with only a memory copy and a simple interpolation operation, without additional computation.
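As an illustration of how a BEV could be produced from the LUT with only look-ups and interpolation, the sketch below samples a single-channel image bilinearly. It assumes the LUT stores normalized coordinates scaled to 0..65,535; the function name `apply_lut` and the choice of bilinear interpolation are assumptions, since the text only mentions a simple interpolation.

```python
def apply_lut(image, lut_x, lut_y, scale=65535):
    """Build a BEV image by sampling `image` at the LUT coordinates
    using bilinear interpolation."""
    img_h, img_w = len(image), len(image[0])
    bev = []
    for row_x, row_y in zip(lut_x, lut_y):
        out_row = []
        for qx, qy in zip(row_x, row_y):
            # Undo the normalization to recover sub-pixel coordinates.
            x = min(qx / scale * img_w, img_w - 1)
            y = min(qy / scale * img_h, img_h - 1)
            x0, y0 = int(x), int(y)
            x1, y1 = min(x0 + 1, img_w - 1), min(y0 + 1, img_h - 1)
            ax, ay = x - x0, y - y0
            top = image[y0][x0] * (1 - ax) + image[y0][x1] * ax
            bottom = image[y1][x0] * (1 - ax) + image[y1][x1] * ax
            out_row.append(top * (1 - ay) + bottom * ay)
        bev.append(out_row)
    return bev

# Tiny 2x2 grayscale image and a 1x3 LUT (illustrative values).
image = [[0, 10],
         [20, 30]]
bev = apply_lut(image, lut_x=[[0, 16384, 65535]], lut_y=[[0, 0, 0]])
```

No pose or homography computation is needed at this stage, which is the source of the reduced post-processing time described above.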
In Tables 1-5, C denotes the customizable channel, and H and W denote the height and width of the input image (or input feature map), respectively.

Encoder
To create an LUT that converts to a BEV using a single camera, we used semantic segmentation-based features [5] for the semantic segmentation backbone network. After fitting the original image and the segmentation result to the same scale, down-sampling is performed through the compress block. The output from each compress block comprises encoded data for multiple scales, and through this, a network robust to multiple scales is constructed.
The semantic segmentation layer aims to find precise boundaries for objects in the image. Since the proposed network aims to estimate the pose using image features, we inserted a semantic segmentation layer into the encoder (Figure 3, Table 1) to utilize the information on the precise boundaries as a feature. The compress block (Figure 4a, Table 2a,b) integrates features using the custom residual block (Figure 4b, Table 2c,d), which follows He et al. [17], together with a structure that fuses the original image and the segmentation features.

Decoder
Because a camera projects light from a specific space, far-distance data are sparse compared to near-distance data, which causes aliasing. We constructed a parallel path to efficiently perform anti-aliasing by utilizing the features delivered from the encoder.
A structure for restoring and up-scaling multi-scale encoded data was constructed by composing a parallel path through the transposed convolution [8] and pixel shuffle [18] to solve the aliasing phenomenon that may occur in the up-scaling process.
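For readers unfamiliar with pixel shuffle, the rearrangement it performs can be sketched in plain Python. This illustrates only the pixel shuffle operation on nested lists, using the channel ordering common in deep learning frameworks; it is not the authors' decoder and omits the transposed-convolution branch.

```python
def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) feature map into (C, H*r, W*r).

    Each group of r*r input channels fills one r x r block of output pixels.
    """
    c_in, h, w = len(x), len(x[0]), len(x[0][0])
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c_out = c_in // (r * r)
    out = [[[0] * (w * r) for _ in range(h * r)] for _ in range(c_out)]
    for c in range(c_out):
        for i in range(h * r):
            for j in range(w * r):
                # Channel that holds the value for sub-pixel offset (i % r, j % r).
                src = c * r * r + (i % r) * r + (j % r)
                out[c][i][j] = x[src][i // r][j // r]
    return out

# Four 1x1 channels become one 2x2 channel.
features = [[[1]], [[2]], [[3]], [[4]]]
upscaled = pixel_shuffle(features, r=2)
```

Because pixel shuffle only rearranges values that the preceding layers have already computed, it up-scales without interpolation, which is one reason it is often paired with a learned path such as a transposed convolution.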
The decoder (Figure 5, Table 3) does not output a separate loss, but delivers data to the LUT generator.

LUT Generator
Through converging and compressing the results obtained through the decoder, three LUT channels were finally generated. The first/second channels represent x/y coordinates of the original image, respectively, and the third channel represents the boundary for the camera's field of view (FoV) area during the LUT conversion process (the boundary value is the max value, while the rest is the min value). Figure 6 and Table 4 show the structure of the LUT generator.

Pose Regressor
In essence, the LUT is an interpretation of the geometrical information of the image (the geometrical information described in this paper is the camera pose, that is, the roll/pitch/yaw), so we must consider the camera pose regression from the network design stage. To effectively add pose regression information to the network, we aimed to make the network recognize the task of estimating the pose by attaching a pose regressor (Figure 7, Table 5) to the network. To estimate the pose, we compressed the encoder data and utilized the fully connected layer [19] to understand the correspondence of all the information between the encoder features [12].
When angles are expressed in degrees or radians, the discontinuities at 0 degrees/360 degrees and -π/π may hinder learning. Therefore, the cosine of the angle is used, and its range is shifted and scaled from -1~1 to 0~1. A sigmoid is used as the activation function of the final layer so that values in 0~1 can be inferred efficiently.
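The angle encoding described above can be sketched as follows. The exact shift and scale are not given in the text, so mapping cos θ to (cos θ + 1) / 2 is an assumption, and the function names are hypothetical.

```python
import math

def encode_angle(degrees):
    """Map an angle to [0, 1] via its cosine: (cos(theta) + 1) / 2."""
    return (math.cos(math.radians(degrees)) + 1.0) / 2.0

def sigmoid(z):
    """Logistic function; its output range (0, 1) matches the encoded target."""
    return 1.0 / (1.0 + math.exp(-z))

# 0 and 360 degrees receive the same target, removing the wrap-around jump.
target_0 = encode_angle(0.0)
target_360 = encode_angle(360.0)
target_180 = encode_angle(180.0)
```

Note that the cosine alone maps θ and -θ to the same value; the text does not state how the sign is handled, so this sketch leaves that open.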

Loss Cost
The loss cost is calculated at three points. The seg loss is located at the end of the segmentation backbone of the encoder, the LUT loss at the LUT generator output, and the pose loss at the pose regressor output.
The contour of the semantic segmentation feature must be precise to help improve the entire network's performance, so we output the segmentation loss using the pixel-wise cross-entropy [20] of the segmentation.
The LUT loss is calculated through the pixel-wise mean squared error (MSE). The first/second channels and the third channel have different basic properties, so they are given different weights, which were determined experimentally.
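A channel-weighted pixel-wise MSE of this kind can be sketched as below. The weight values are placeholders, since the text does not report them, and the function name `lut_mse` is hypothetical.

```python
def lut_mse(pred, target, channel_weights):
    """Pixel-wise MSE per channel, summed with one weight per channel.

    pred and target are nested lists shaped (channels, height, width).
    """
    total = 0.0
    for p_ch, t_ch, weight in zip(pred, target, channel_weights):
        count = sum(len(row) for row in p_ch)
        squared_error = sum((p - t) ** 2
                            for p_row, t_row in zip(p_ch, t_ch)
                            for p, t in zip(p_row, t_row))
        total += weight * squared_error / count
    return total

# Three channels (x, y, FoV boundary) of size 1x2; weights are placeholders.
pred = [[[0.0, 1.0]], [[0.5, 0.5]], [[1.0, 0.0]]]
target = [[[0.0, 0.0]], [[0.5, 0.5]], [[0.0, 0.0]]]
loss = lut_mse(pred, target, channel_weights=(1.0, 1.0, 0.5))
```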
For the result of the pose regressor, the pose loss is obtained by using the MSE. Due to the relatively small number of elements, a smaller value is output compared to the other losses.
Since the domain covered by each loss and the convergence speed in the learning process are different, the weight multiplied by each loss cost in calculating the total loss was experimentally obtained and is as follows.
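The general form of such a weighted total can be sketched in Python. The authors' actual weight values are not reproduced here; the defaults below are placeholders and the function name `total_loss` is hypothetical.

```python
def total_loss(seg_loss, lut_loss, pose_loss,
               w_seg=1.0, w_lut=1.0, w_pose=1.0):
    """Weighted sum of the three loss costs (weights are placeholders)."""
    return w_seg * seg_loss + w_lut * lut_loss + w_pose * pose_loss

# A larger pose weight can compensate for the small magnitude of the pose loss.
example = total_loss(1.0, 2.0, 3.0, w_pose=10.0)
```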

Quantitative Evaluation
In this paper, the seg loss is a learning measure for the segmentation features inside the network, and the LUT and pose losses are quantitative indicators for the LUT and the 3D spatial angle, respectively (Table 6). We changed the layers step by step to examine how performance changes with the structural evolution of our network (Table 7). The resolution representation of the encoder, decoder, and LUT generator was tested with 1 (single scale) and 4 (multi-scale), and the parallel path of the decoder was tested with two pixel shuffles or with a pixel shuffle and a transposed convolution. Finally, we tried to improve the performance by adding a pose regressor.
Inference for all tests was run on an NVIDIA GTX 1080, and the three losses were evaluated.
If the parallel path composed of two pixel shuffles is changed to one consisting of a pixel shuffle and a transposed convolution, inference is 1.11 times slower, but the segmentation loss falls to 0.89 times its previous value and the LUT loss is reduced significantly as well (to 0.06 times).
If the resolution representation considered by the encoder, decoder, and LUT generator is changed from single scale to multi-scale, inference is 3.7 times slower, while the segmentation loss falls to 0.38 times and the LUT loss to 0.44 times their previous values. The reduction in segmentation loss is particularly significant when changing to multi-scale.
When a pose regressor is added to the end of the encoder during training, inference is only 1.09 times slower, but the segmentation loss falls to 0.64 times and the LUT loss to 0.25 times their previous values. The loss cost is thus significantly reduced, while the processing time increases only slightly. This shows that it is meaningful to connect the pose regressor to the end of the encoder to give direction to the entire network.

Qualitative Evaluation
Since the coordinates of the original image corresponding to each coordinate of the BEV can be estimated using the LUT data obtained from the end of the network, the BEV was generated from these values. The network was tested on two maps, Chungbuk National University (CBNU) and KATRI K-city (Tables 8 and 9). As we progressed from v1 to v4, the aliasing decreased. In particular, comparing v3 (without a pose regressor) and v4 (with a pose regressor), we can see that the concept of the overall pose is added and the distant region is converted relatively well.

Conclusions
In this work, we studied BEV conversion based on a single camera image. We used segmentation backbone-based features and analyzed the difference in performance before and after adding a pose regressor. Since it is challenging to collect various camera poses using an actual camera, we tested the network using a simulator.
In the future, we plan to conduct research using actual camera data (or actual and synthetic data) with this network and to reduce aliasing by improving the network. In addition, to compensate for the difficulty of estimating scale with a single camera, we intend to produce a real distance-based BEV with an explicit unit, rather than a relative distance, by combining the network with a ToF sensor or other odometry methods, such as using adjacent frames of a single camera.