Efficient Stereo Depth Estimation for Pseudo-LiDAR: A Self-Supervised Approach Based on Multi-Input ResNet Encoder

Perception and localization are essential for autonomous delivery vehicles and are most often obtained from 3D LiDAR sensors due to their precise distance measurement capability. This paper presents a strategy to obtain a real-time pseudo point cloud from image sensors (cameras) instead of laser-based sensors (LiDARs). Previous studies (such as PSMNet-based point cloud generation) prioritized accuracy but failed to operate in real time as LiDAR does. We propose using a different depth estimator to obtain LiDAR-like pseudo point clouds with better runtime performance. Moreover, the depth estimator uses stereo imagery to achieve more accurate depth estimation as well as point cloud results. Our approach to generating depth maps outperforms other existing approaches on KITTI depth prediction while also yielding point clouds significantly faster. Additionally, the proposed approach is evaluated on the KITTI stereo benchmark, where it demonstrates effective runtime.


Introduction
Understanding the three-dimensional structure of the environment is possible for humans because of biological vision. Depth perception with computer vision remains one of the most challenging unsolved problems in this research field. More significantly, proper depth perception is required for an autonomous system such as an autonomous delivery vehicle. Such perception can be obtained from a LiDAR point cloud; however, LiDAR is a very costly technology that drastically increases the production cost of a delivery robot system [1]. Without a doubt, a depth-predicting system is required to locate obstacles and avoid collisions. Many researchers have already discussed alternatives to LiDAR because of its cost and because over-dependency on a single sensor leads to safety risks. For example, the PSMNet model used in the pseudo-LiDAR paper [2] is an image-based approach, but its architecture is too heavy and requires more time to produce a depth estimate. Therefore, the corresponding point cloud generation is slower (on average 1-6 Hz depending on the resolution) than LiDAR hardware (10 Hz). Our approach takes self-supervised stereo Monodepth2 [3] as the starting point and improves it by training the network with stereo pairs from the KITTI benchmark [4] datasets. We then use the generated disparity information to create the point cloud (shown in Figure 1). The main contributions of this paper are:
• Adopting a U-Net-based [5] encoder-decoder architecture as the depth network instead of the heavy PSMNet model to increase real-time performance.
• Modifying the encoder network for the training step; the final result outperforms all the modes used by Monodepth2 [3] in terms of depth prediction.
It is challenging to achieve a good balance between precision and latency. To address this significant issue, this work proposes an optimized approach based on self-supervised learning. The authors also carry out in-depth experiments to verify the effectiveness of the proposed method.

Related Works
Image-based depth estimation to perform perception or localization tasks can be achieved using monocular vision [6] or stereo vision [2]. An algorithm such as DORN [7] achieves lower depth estimation errors than previous work on monocular depth estimation [8][9][10]. On the other hand, stereo-based depth prediction systems [2] show more precision in estimating disparity. However, a promising solution requires real-time operating speed with greater efficiency. The BTS architecture [11] for depth estimation has local planar guidance layers in the decoding network. The method outperformed previous work on several evaluation metrics; however, it does not report the computational processing time for generating point clouds. Since its base networks (ResNet50 or larger) have 49.5 million or more parameters, the network is computationally very expensive: it has around five times more trainable parameters than our proposed encoder module.
Recent studies leverage deep neural networks to learn model priors based on pictorial depth cues, such as texture density or object perspective, directly from the training data [12]. Several technical breakthroughs in the past few years have made it possible to improve depth estimation based on ground truth depth datasets. If ground truth depth is not available, a possible alternative is to train models using image reconstruction. Here, the model is fed either monocular temporal frames or stereo image pairs as input, and is trained by reducing the image reconstruction error obtained when the predicted depth is used to project one image into a nearby view. Stereo pairs are one form of self-supervision: since stereo data are available and easy to obtain, a deep network can be trained to perform depth estimation using synchronized stereo pairs during training. For the problem of novel view synthesis, prior work proposed a model with discretized depth [13] and a model predicting continuous disparity [6]. Several advancements have also occurred in stereo-based approaches, including generative adversarial networks [14] and supervised datasets [15]. Moreover, there are approaches that predict depth by minimizing the photometric reprojection error, using the relative pose of a source image with respect to a target image [3]. In their stereo approach, the authors used stereo pairs to calculate losses; however, the neural architecture does not extract features from the other image of the pair.
A stereo group-wise correlation method [16] computes the cost volume by dividing the left and right features into groups along the channel dimension. Each group's correlation maps are calculated to obtain several matching cost proposals, which are packaged into a cost volume. X. Guo et al. proposed an improved 3D stacked hourglass network to reduce the computation cost [16]. The RAFT-Stereo [17] architecture employs multi-level convolutional GRUs for accurate real-time inference and provides cutting-edge cross-dataset generalization results. CasStereoNet [18] presented a cost volume based on a feature pyramid encoding geometry and context at smaller scales to improve stereo matching, and ranked first on the DTU benchmark [19] at the time of publication. A network based on cascaded and fused cost volumes [20] is used to increase the robustness of a stereo matching network by decreasing domain disparities and balancing the disparity distribution across datasets. StereoDRNet's depth architecture [21] predicts view-consistent disparity and occlusion maps, which aid the fusion system in producing geometrically consistent reconstructions; its EPE (0.98) and FPS (4.3) outperform PSMNet [2]. LEAStereo [22], a deep stereo matching architecture, establishes an outperforming result on the KITTI [4,23,24] test dataset with fewer parameters and significantly shorter inference time. ACVNet [25], a stereo matching network, presents outperforming results in both quantitative and qualitative aspects. However, the runtime of these algorithms is very high.
We demonstrate that the existing depth estimation model can be adapted to generate higher-quality results by combining the stereo pair in input layers rather than using the pair to calculate relative pose loss only. Moreover, we used the modified model to generate point clouds in real time.

Method
This section introduces the architecture of our modified deep network and then presents the strategy for splitting, point cloud generation, post-processing steps, and evaluation techniques used for comparison. The proposed pipeline is shown in Figure 2, and the modules are discussed in detail in this section.

Stereo Training Using Depth Network
The proposed architecture is a classic encoder-decoder U-Net (shown in Figure 3). The encoder is a pre-trained ResNet model [26], and the decoder converts the sigmoid output into a depth map. The primary training network, the U-Net architecture, merges features of various scales with varying receptive field sizes, upsampling the feature maps and concatenating them after pooling to distinct sizes. The ResNet encoder module usually accepts a single RGB image as input. In the proposed method, the input is designed to take the image pair and provide an estimate based on it; the modified network therefore works both for training and inference. The ResNet encoder is modified to accept a pair of stereo frames, i.e., six channels, as input, similar to the pose model. As a result, the first layer of the ResNet encoder operates on an input of shape (6, 192, 640) instead of the ResNet default of (3, 192, 640). The depth decoder is a fully convolutional network that takes advantage of feature maps at different scales and concatenates them after upsampling. A sigmoid activation at the last layer outputs a disparity map normalized between 0 and 1. Table 1 shows that the total number of trainable encoder parameters is 11,185,920 for a (192, 640) input, whereas a single-image encoder would have 11,176,512. The ResNet encoder has 20 Conv2d layers, 20 BatchNorm2d layers, 17 ReLU layers, 1 MaxPool2d layer, and eight basic blocks in total. The decoder has the same blocks, kernel sizes, and strides. In monocular mode, the Monodepth2 authors used temporal frames in PoseNet [3] instead of stereo pairs to calculate the extrinsic camera parameters and the pose of the image frame. Our approach does not rely on temporal frames for self-supervised prediction. In stereo training, the reprojection loss is calculated using SSIM [27] between prediction and target.
The metric reprojection error L_p is calculated from the relative pose T_s→t of a source view I_s with respect to its target image I_t. In our training, the other image of the stereo pair provides the relative pose T_s→t of the source image I_s. This rotation and translation information is used to map the source frame to the target frame. Simultaneously, the ResNet architecture is fed both images of the pair (shown in Figure 3): considering one image as the primary input, the other can be regarded as its stereo pair. The target image is reprojected from the predicted depth and the transformation matrix of the stereo pair using the intrinsic matrix. The method then uses bilinear sampling to reconstruct the target image from the source image. This loss aims to minimize the difference between the target image and the reconstructed target image, in which depth is the most crucial factor. Instead of averaging the photometric error across all source frames, the method uses the per-pixel minimum. The photometric loss L_p can be represented [3] as in Equation (1):

L_p = min_s RE(I_t, I_s→t)    (1)

Here, RE is the metric reconstruction error. I_s→t is obtained [28] from the projected depth D_t, the intrinsic parameters K, and the relative pose, as in Equation (2):

I_s→t = I_s ⟨ prj(D_t, T_s→t, K) ⟩    (2)

where ⟨·⟩ is the bilinear sampling operator and prj() denotes the 2D coordinates of the projected image.
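The reprojection of Equation (2) can be sketched as follows. This is an illustrative implementation, not the paper's code: target pixels are back-projected with the predicted depth, transformed by the 4 × 4 relative pose `T`, projected with the intrinsics `K`, and the source image is bilinearly sampled at those coordinates via `grid_sample`; batch handling and edge cases are simplified.

```python
import torch
import torch.nn.functional as F

def reproject(src_img, depth, K, T):
    """Warp src_img into the target view given depth, intrinsics K (3x3),
    and relative pose T (4x4). Shapes: src_img (B,3,H,W), depth (H,W)."""
    b, _, h, w = src_img.shape
    ys, xs = torch.meshgrid(torch.arange(h, dtype=torch.float32),
                            torch.arange(w, dtype=torch.float32), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).reshape(3, -1)
    cam = torch.linalg.inv(K) @ pix * depth.reshape(1, -1)   # back-project
    cam_h = torch.cat([cam, torch.ones(1, h * w)], dim=0)    # homogeneous
    proj = K @ (T @ cam_h)[:3]                               # into source view
    uv = proj[:2] / proj[2].clamp(min=1e-6)                  # 2D coordinates
    # grid_sample expects coordinates normalized to [-1, 1]
    grid = torch.stack([uv[0] / (w - 1), uv[1] / (h - 1)], dim=-1) * 2 - 1
    grid = grid.reshape(1, h, w, 2).expand(b, h, w, 2)
    return F.grid_sample(src_img, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```

As a sanity check, an identity pose with constant depth maps every pixel back onto itself, so the warped image equals the source image up to interpolation error.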
On the other hand, an edge-aware smoothness loss L_s is also calculated between the target frame and the mean-normalized inverse depth. It encourages the model to recognize sharp edges and suppress noise. The reprojection loss requires the correct output image and target frame; therefore, the method is designed to choose the proper target frame from the image pairs. Equation (3) gives the final training loss, which is the function used in [3]:

L = µ L_p + λ L_s    (3)

where µ ∈ {0, 1} is the per-pixel mask value obtained from the auto-masking method [3], and λ is the smoothness weight, set to 0.001. A learning rate of 10^−4, a batch size of 12, and 20 epochs are used while training models of both 640 × 192 and 1024 × 320 input size. The edge-aware smoothness [3] can be described as in Equation (4):

L_s = |∂_x d*_t| e^−|∂_x I_t| + |∂_y d*_t| e^−|∂_y I_t|    (4)

where d*_t = d_t / d̄_t is the mean-normalized inverse depth [29], used to discourage shrinking of the estimated depth.
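The loss of Equation (3) can be sketched as below. This is a simplified single-scale version under assumed settings: the photometric error combines SSIM and L1 with α = 0.85 as in Monodepth2 (α and the 3 × 3 SSIM window are assumptions here, not stated in this paper), and λ = 10^−3 as given above.

```python
import torch
import torch.nn.functional as F

def photometric_error(pred, target, alpha=0.85):
    """Per-pixel SSIM/L1 reconstruction error (Monodepth2-style)."""
    l1 = (pred - target).abs().mean(1, keepdim=True)
    mu_p = F.avg_pool2d(pred, 3, 1, 1)
    mu_t = F.avg_pool2d(target, 3, 1, 1)
    sig_p = F.avg_pool2d(pred * pred, 3, 1, 1) - mu_p ** 2
    sig_t = F.avg_pool2d(target * target, 3, 1, 1) - mu_t ** 2
    sig_pt = F.avg_pool2d(pred * target, 3, 1, 1) - mu_p * mu_t
    C1, C2 = 0.01 ** 2, 0.03 ** 2
    ssim = ((2 * mu_p * mu_t + C1) * (2 * sig_pt + C2)) / \
           ((mu_p ** 2 + mu_t ** 2 + C1) * (sig_p + sig_t + C2))
    dssim = ((1 - ssim) / 2).clamp(0, 1).mean(1, keepdim=True)
    return alpha * dssim + (1 - alpha) * l1

def smoothness_loss(disp, img):
    """Edge-aware smoothness on mean-normalized disparity, Equation (4)."""
    nd = disp / (disp.mean((2, 3), keepdim=True) + 1e-7)
    gx = (nd[..., :, :-1] - nd[..., :, 1:]).abs()
    gy = (nd[..., :-1, :] - nd[..., 1:, :]).abs()
    ix = (img[..., :, :-1] - img[..., :, 1:]).abs().mean(1, keepdim=True)
    iy = (img[..., :-1, :] - img[..., 1:, :]).abs().mean(1, keepdim=True)
    return (gx * torch.exp(-ix)).mean() + (gy * torch.exp(-iy)).mean()

def total_loss(pred, target, disp, img, mask, lam=1e-3):
    """Equation (3): masked photometric term plus weighted smoothness."""
    lp = (mask * photometric_error(pred, target)).mean()
    return lp + lam * smoothness_loss(disp, img)
```

A perfect reconstruction with constant disparity yields a loss of (numerically) zero, which is a useful unit test for both terms.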

Dataset Splitting
We use the data split of Zhou et al. [28], which has 39,810 images for training and 4424 for validation. The intrinsic parameters provided by KITTI [4], including the focal length and image center, are normalized with respect to the image resolution. A horizontal translation of fixed size is applied as the transformation between stereo frames. The neural network is fed the image from the split file along with its corresponding pair; however, the rest of the calculation is based on the first image taken from the split dataset, not its pair. In stereo supervision, median scaling is not needed, since the camera baseline provides a reference for scale.
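The intrinsics normalization above can be illustrated as follows. The normalized matrix values (0.58, 1.92, 0.5) are the typical resolution-normalized KITTI intrinsics used by Monodepth2 and serve only as example numbers here; rescaling recovers pixel-unit intrinsics for any training resolution.

```python
import numpy as np

# Resolution-normalized intrinsics: fx, cx divided by image width,
# fy, cy divided by image height (example values, KITTI-like).
K_norm = np.array([[0.58, 0.0, 0.5],
                   [0.0, 1.92, 0.5],
                   [0.0, 0.0, 1.0]])

def intrinsics_for(width, height, K=K_norm):
    """Rescale a normalized intrinsic matrix to pixel units for a given size."""
    K = K.copy()
    K[0] *= width   # fx, cx back to pixels
    K[1] *= height  # fy, cy back to pixels
    return K

K_640 = intrinsics_for(640, 192)  # matrix for the 640 x 192 training size
```

This lets the same camera model be reused for both the 640 × 192 and 1024 × 320 training resolutions.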

Point Cloud Back-Projection
The depth z can be obtained from a stereo disparity estimation system that requires a pair of left-right images with a horizontal baseline b. The depth estimation system considers the left image as the reference and saves the disparity map d with respect to the right image for each pixel (x, y). Given the focal length f of the camera, the depth transformation in Equation (5) can be obtained:

z = f · b / d    (5)

Point clouds have their own 3D coordinates with respect to a reference viewpoint and direction. These coordinates can be obtained by back-projecting all depth pixels into a three-dimensional coordinate system containing the point coordinates [(X_n, Y_n, Z_n)], n = 0, …, N, where N is the total number of points generated from the depth pixels. The back-projection was performed on the KITTI dataset images using their projection matrices. The 3D location of each point can be obtained using Equations (6)-(8) with respect to the left camera frame reference, which can be calculated from the calibration matrices:

X = (x − c_x) · z / f    (6)
Y = (y − c_y) · z / f    (7)
Z = z    (8)
where f is the focal length of the camera and (c_x, c_y) is the pixel location of the image center. Similar back-projection steps are used to generate a pseudo-LiDAR point cloud [30].
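Equations (5)-(8) can be implemented compactly with NumPy. The sketch below uses illustrative calibration values (f, b, c_x, c_y are not KITTI's actual parameters) and returns an N × 3 array of camera-frame points.

```python
import numpy as np

def disparity_to_points(disp, f, b, cx, cy):
    """Back-project a disparity map to an (N, 3) point cloud, Eqs. (5)-(8)."""
    h, w = disp.shape
    xs, ys = np.meshgrid(np.arange(w), np.arange(h))
    z = f * b / np.maximum(disp, 1e-6)  # Equation (5), guard against d = 0
    X = (xs - cx) * z / f               # Equation (6)
    Y = (ys - cy) * z / f               # Equation (7)
    Z = z                               # Equation (8)
    return np.stack([X, Y, Z], axis=-1).reshape(-1, 3)

# Example: uniform disparity of 2 px, f = 100 px, baseline b = 0.5 m
points = disparity_to_points(np.full((4, 4), 2.0), f=100.0, b=0.5, cx=2.0, cy=2.0)
```

With these numbers every point lies at depth z = 100 · 0.5 / 2 = 25 m, and the pixel at the principal point back-projects to (0, 0, 25).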

Post-Processing Step
The method can adopt a post-processing step during training to achieve more accurate results on the evaluation benchmark. This adaptation does not change the method itself; it is included for comparison with similar benchmarks that adopt such post-processing. Owing to the augmentation in the post-processing step, the model tends to improve its estimates. To obtain the model with post-processing, the stereo network is trained on each image twice, flipped and un-flipped. A threshold parameter randomizes this flipping during training, so the model can be prepared both with and without post-processing. The flip is applied both to the image and to its intrinsic parameters, including the baseline of the pair. This two-forward-pass technique was introduced by an unsupervised depth estimator to improve results [6].
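At inference, the flipped post-processing amounts to two forward passes whose results are fused. The sketch below shows the simplest variant: predict on the image and on its horizontal mirror, un-flip the second prediction, and average (the edge-aware blending used by some estimators is omitted; `predict` stands in for the trained network).

```python
import numpy as np

def predict_with_flip(predict, image):
    """Two-forward-pass post-processing: average the prediction on the
    image with the un-flipped prediction on its horizontal mirror."""
    d1 = predict(image)
    d2 = predict(image[:, ::-1])[:, ::-1]  # flip input, un-flip output
    return 0.5 * (d1 + d2)
```

If the predictor were perfectly flip-equivariant, the two passes would agree exactly; in practice, averaging suppresses left/right asymmetries such as occlusion artifacts near image borders.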

Evaluation Metric
The evaluation benchmark primarily reports the errors between ground truth and prediction: mean absolute relative error (Abs Rel), squared relative error (Sq Rel), linear root mean squared error (RMSE), and logarithmic root mean squared error (RMSE log); for these values, lower is better. On the other hand, δ < x denotes the fraction of pixels whose ratio between prediction and ground truth lies between x and 1/x; values closer to 1 are better. Instead of reprojected LiDAR, ground truth depth from the KITTI depth prediction dataset [31] is used to evaluate the prediction method. During the evaluation of our method, we used the same ground truth as Monodepth2 [3] while using stereo images as input in the encoder's input layer. Moreover, the KITTI Stereo 2015 benchmark is also used for comparison. Figure 4 presents the qualitative results on a specific KITTI scene. The first-, second-, and seventh-row results show that our method adequately recognizes poles. Moreover, the other results show that the size of pedestrians (for example, row 9), the shape of objects (for example, rows 6 and 8), and buildings (for example, rows 5 and 7) are more aligned with the original image. From these visual results, it is evident that our depth estimator can recover features such as poles and street signs, moving objects, and objects at far distances. Comparison is performed with the other Monodepth2 modes, monocular only (M), stereo only (S), and both (MS), along with the other self-supervised models presented in [3]. Table 2 shows that our method (highlighted in bold) outperforms all the variants, including self-supervised methods, except DORN [32]. Here, D refers to depth supervision, D* to auxiliary depth supervision, M to self-supervised mono supervision, S to self-supervised stereo supervision, and PP to post-processing. The other data for comparison were collected from [3].
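The metrics named above are standard in the depth-estimation literature and can be computed as follows (a sketch over valid ground-truth pixels; the masking convention is an assumption, since sparse ground truth marks missing pixels with zeros).

```python
import numpy as np

def depth_metrics(gt, pred):
    """Abs Rel, Sq Rel, RMSE, RMSE log, and delta accuracies over valid pixels."""
    mask = gt > 0                      # keep only pixels with ground truth
    gt, pred = gt[mask], pred[mask]
    thresh = np.maximum(gt / pred, pred / gt)  # max(ratio, 1/ratio) per pixel
    return {
        "abs_rel": np.mean(np.abs(gt - pred) / gt),
        "sq_rel": np.mean((gt - pred) ** 2 / gt),
        "rmse": np.sqrt(np.mean((gt - pred) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2)),
        "a1": np.mean(thresh < 1.25),        # delta < 1.25
        "a2": np.mean(thresh < 1.25 ** 2),   # delta < 1.25^2
        "a3": np.mean(thresh < 1.25 ** 3),   # delta < 1.25^3
    }
```

A perfect prediction drives all error terms to zero and all δ accuracies to 1, which matches the "lower is better" / "closer to 1 is better" reading given above.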
The higher accuracy is achieved due to the introduction of stereo pairs in the input layer of the ResNet architecture.

Experiment and Results

Since stereo pairs are introduced, Table 3 shows a comparison with the KITTI stereo benchmark, which presents a satisfactory runtime for the proposed model. For the stereo comparison, the disparity generated from the model is scaled by a scale factor and the image width, since the model output was normalized to the image width. We used a common system (11th Gen Intel i7-11800H CPU) to compare average processing speed. Rows in Table 4 marked with ** show an increase in FPS over the other image resolutions: if the model resolution and the image resolution are the same, the process does not use functional interpolation to resize the depth map to the image resolution and therefore requires less time to produce the final output. Other resolutions show lower FPS due to the computationally expensive rescaling of the depth map. PSMNet requires a much longer processing time than the U-Net-based architecture. The overall steps to produce the point cloud are presented in Algorithm 1.

Algorithm 1
1: Initialize the encoder and decoder model
2: Initialize the proper model and input size
3: Initialize the calibration parameters, such as the intrinsic and projection matrices
4: while image frames are available do
5:   Read image pairs
6:   Convert to torch tensor
7:   Concatenate the image pairs
8:   Extract the features using the encoder network
9:   Predict depth using the decoder network
10:  Apply functional interpolation to the result if the sizes differ
11:  Squeeze the output to an array
12:  Project the disparity to points
13:  Convert to a point field for visualization
14: end while

The main module responsible for FPS is the depth prediction network. We present results for the 640 × 192 and 1024 × 320 models in Table 4. These results include the processing time of point cloud projection, whereas Table 3 reports only the runtime of depth estimation. Point cloud back-projection does not require extra time when the model's input size matches the input image size: since no functional interpolation is needed, the runtime of the whole algorithm is low and nearly identical in that case. The computational cost therefore varies when the model input size and image size differ. Using the 1024 × 320 stereo model, we obtain higher accuracy in real time. Figure 5 shows the point cloud visualization on KITTI scenes. We used a ROS-converted bag of the KITTI dataset and RViz to visualize the point cloud data. The visualization shows the perception of pedestrians, bicycles, cars, and traffic signs in 3D space.
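The per-frame body of Algorithm 1 (steps 5-11) can be sketched as follows, with tiny stub convolutions standing in for the trained encoder and decoder; the model input size (192, 640) and image size (370, 1224) are illustrative, not the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def process_frame(left, right, encoder, decoder, image_hw=(370, 1224)):
    """One iteration of Algorithm 1: concatenate, predict, resize, squeeze."""
    pair = torch.cat([left, right], dim=1)   # step 7: concatenate stereo pair
    disp = decoder(encoder(pair))            # steps 8-9: depth network forward
    if disp.shape[-2:] != image_hw:          # step 10: only resize if needed
        disp = F.interpolate(disp, size=image_hw, mode="bilinear",
                             align_corners=False)
    return disp.squeeze()                    # step 11: H x W disparity map

# Stub "networks" so the sketch runs end to end.
enc = torch.nn.Conv2d(6, 8, 3, padding=1)
dec = torch.nn.Conv2d(8, 1, 3, padding=1)
left = torch.rand(1, 3, 192, 640)
right = torch.rand(1, 3, 192, 640)
disp = process_frame(left, right, enc, dec)
```

The `if` branch mirrors the runtime observation above: when the model input size equals the image size, `F.interpolate` is skipped entirely, which is where the FPS gain for matching resolutions comes from.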

Conclusions
In this paper, we have presented a strategy aimed at closing the gap between point clouds from real LiDAR devices and image-based point clouds. We presented performance results for operating the model with a U-Net architecture; both resolutions (averaging 57.2 and 31.9 FPS, respectively) indicate real-time operating performance. Moreover, we improved the network by introducing stereo pairs in the input layer. The improvement in the stereo-based network comes from the stereo information, which helps the network perceive both moving and stationary objects better. The final results are most accurate for the 1024 × 320 model and for post-processing-based training at 1024 × 320. The improvement achieved by the modified network is considerably greater than that of its previous versions. Initially, we took a different approach to introduce more pixel features into the model: we tried concatenating temporal frames in the input layers, but the results were poor. We then adopted the stereo-pair approach instead. Since the work uses stereo pairs, it is not technically suited for monocular depth estimation, and the use of stereo pairs also increases the computing cost. On the other hand, the LiDAR device is the most expensive commercial component of a delivery vehicle, and an image-based approach is a way to close this cost gap. The proposed pipeline depends entirely on the accuracy of depth estimation, and poor depth estimation will result in erroneous point cloud data. Our approach aims to increase FPS to generate a fast pseudo-LiDAR. The model is trained on the KITTI raw dataset, which consists of 39,810 unrectified images for training and 4424 for validation; therefore, a fully equivalent comparison with the KITTI stereo benchmark is not possible.
However, the stereo benchmark is used for comparison with stereo matching algorithms such as RAFT-3D [46], PSMNet [2], LEAStereo [22], CFNet [20], and ACVNet [25]; our result outperforms them in terms of runtime. The outcome of this work does not raise any privacy or fairness issues, and we do not believe our work or its possible applications pose any major security threats or human rights concerns. In future work, we aim to perform 3D object detection and SLAM over the point cloud obtained from depth prediction.