1. Introduction
The safety of autonomous driving depends on the perception system's ability to comprehensively understand the vehicle's surroundings. To accomplish this, an autonomous driving system must ascertain the relative distance between the vehicle and surrounding objects. This demand for high-quality omnidirectional ranging has led to the wide use of LiDAR in autonomous driving. However, the high cost of LiDAR hinders the popularization of advanced autonomous driving technology, and its sparse 3D scans provide only a limited understanding of scene depth. With the rapid development of computer vision, depth estimation for panoramic images in driving scenarios is expected to overcome the limitations of sparse LiDAR scans in scene understanding, leading to an affordable, high-density surround depth perception system.
Capturing scene depth from RGB images is a popular research topic in computer vision. Classical depth estimation methods obtain the scene depth from the underlying physical properties of images. The Multiple-View Stereo (MVS) algorithm uses triangulation to match two images and converts the disparity between them into depth [1]. The Structure from Motion (SFM) algorithm mainly uses matrix information to derive the motion parameters and estimates the depth of the feature points [2]. Moreover, earlier monocular depth estimation methods generally recovered depth from defocus (DFD) in RGB images based on optical principles [3]. Ground-breaking advances in deep learning for computer vision tasks have significantly enhanced the performance of depth estimation [4,5]. Pre-trained neural networks have been used for end-to-end dense depth estimation from a single image, and innovations in network structures, loss functions, and training strategies have markedly improved estimation accuracy.
The large-scale deployment of the Around View Monitor (AVM) in vehicles has further enhanced the application potential of depth estimation in autonomous driving. The AVM relies on a set of surround cameras to capture background images in different directions and constructs a panoramic image around the current vehicle position through an image synthesis algorithm. Depth estimation on panoramic images can break through the limited view of traditional monocular cameras and estimate the distance from the center camera to the surrounding environment with a 360° fused perspective. Panorama depth estimation is thus critical for comprehending the scene information of the real 3D world. It has strong potential for use in panorama parking systems, panorama reversing systems, traffic jam assistance systems, and 3D scene reconstruction.
Panoramic images offer a comprehensive view of the scene owing to their multi-view and wide-format characteristics. However, these characteristics also lead to distortion problems when such images are represented in the equirectangular projection (ERP). During the camera scanning and imaging process, the distortion of panoramic images increases from the center to the sides along the latitude direction, and because the object distance increases with the scanning angle, the scale of the image shrinks gradually from the center to both sides. These phenomena deform objects within the image and seriously damage its semantic information, which leads to biased depth estimation at the edges and catastrophic estimation errors.
Imperfect observation, in which limitations of algorithms and equipment make it difficult to obtain comprehensive information about a complex environment, is also a major challenge that hinders the application of panorama depth estimation in autonomous driving. To support the progress of panorama depth solutions, several datasets have been produced. However, current outdoor panoramic image datasets lack comprehensive coverage of road traffic scenes, making it challenging to fully address various driving conditions. As a result, advanced panorama depth estimation solutions are often developed and validated using indoor scene datasets, such as Matterport3D [6], Stanford2D3D [7], PanoSUNCG [8], and 360D [9].
The current methods of panoramic image acquisition face limitations in providing consistent and accurate depth information, especially in complex driving environments. These issues include inconsistent scene-depth understanding due to sparse or incomplete data. Additionally, existing panoramic depth estimation approaches struggle to effectively capture detailed depth information, limiting their reliability in real-world driving scenarios. To address these challenges, we propose an end-to-end panoramic depth estimation framework, which improves depth estimation performance through innovations in data augmentation and neural network structure.
The proposed framework includes a Patch Filling module, which employs mean interpolation on 3D point clouds to generate dense depth maps, effectively compensating for missing data. Additionally, we introduce the ViT-Fuse model, which leverages the spatial encoding power of a Vision Transformer [10] to optimize feature fusion and mitigate edge distortion in panoramic depth estimation. Together, these components enable more accurate and reliable depth perception, advancing the capability of autonomous vehicles to navigate complex environments with greater precision. The main contributions of this study are as follows:
In this study, we propose a novel framework for panoramic depth estimation that makes several key contributions to the field of autonomous driving. First, we introduce a Patch Filling module designed to address the sparsity inherent in 3D point cloud data by generating dense depth maps. These panoramic depth maps, integrated into the Ford Campus Vision and LiDAR Dataset [11], enhance the dataset’s usability for future research and demonstrate improved depth estimation accuracy in autonomous driving scenarios through comparative experiments. Additionally, we develop a new panoramic projection fusion model, ViT-Fuse, which leverages the robust spatial encoding capabilities of a Vision Transformer to mitigate the distortion challenges associated with equirectangular projection (ERP) images. This model outperforms conventional methods, particularly in outdoor environments, providing smoother and more accurate edge details in the depth maps. Our experimental results further highlight the practical potential of this framework, achieving a maximum error reduction of 19.36% compared to baseline models and demonstrating superior accuracy, especially under the tightest threshold conditions. These findings establish ViT-Fuse as an excellent solution for enhancing panoramic depth estimation in complex driving environments.
The remainder of the paper is structured as follows: Section 2 reviews related work on panorama depth perception schemes for autonomous driving. Section 3 presents the Patch Filling module and the processing method for autonomous driving datasets. Section 4 details the basic structure and inference principles of the ViT-Fuse model. Section 5 describes the experimental settings and analyzes and compares the experimental data. Section 6 discusses the experimental results, and Section 7 provides the conclusions.
  3. Patch Filling
We utilized the Ford Campus Vision and LiDAR Dataset [11] as our outdoor autonomous driving scene dataset. This dataset provides panoramic images with comprehensive scene information, created by stitching together images from a five-camera omnidirectional camera system, together with raw 3D point cloud data from a laser scanner. However, the Ford Campus Vision and LiDAR Dataset does not directly offer panoramic depth maps. Therefore, in this section, we generate panoramic depth maps from the 3D point cloud data and match them to their corresponding panoramas, with the aim of using these panoramic depth maps as the ground truth for subsequent analysis. Additionally, the generated panoramic depth maps are integrated into the dataset, enabling other researchers to use them directly. We project the raw 3D point cloud data onto a blank image that has the same size as the corresponding RGB image in the dataset. The points in the projection image are colored based on their distances. However, the point cloud data are too sparse and need to be supplemented in certain areas. As shown in Figure 1, the point cloud image has blank areas where the scene contains objects, such as the vehicles in the framed area of the RGB image, which indicates that the rendered point cloud data are incomplete. The lack of 3D point cloud data for critical surrounding objects can significantly impair the integrity of the ground truth for depth estimation. This presents a significant challenge for the subsequent development of panorama depth estimation.
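The projection step can be illustrated with a minimal sketch in plain Python. It is not the dataset's calibration pipeline: it assumes the points are already expressed in a frame centered on the panorama with x forward, y left, and z up, and it ignores the LiDAR-to-camera extrinsics. The function name `project_to_erp`, the `max_depth` parameter, and the 0 to 254 depth scaling are illustrative assumptions; the blank value 255 follows the value tested in Algorithm 1.

```python
import math

BLANK = 255  # marker for pixels that receive no LiDAR point (assumed convention)


def project_to_erp(points, height, width, max_depth):
    """Project 3D points (x, y, z) onto an equirectangular depth image.

    Assumed frame: x forward, y left, z up, origin at the panorama center.
    Depth is scaled to 0..254 so that 255 stays reserved for blank pixels.
    """
    image = [[BLANK] * width for _ in range(height)]
    for x, y, z in points:
        dist = math.sqrt(x * x + y * y + z * z)
        if dist == 0.0 or dist > max_depth:
            continue  # skip degenerate or out-of-range points
        lon = math.atan2(y, x)  # longitude, 0 = straight ahead
        lat = math.asin(max(-1.0, min(1.0, z / dist)))  # latitude
        # Longitude maps linearly to columns, latitude to rows (top row = up).
        u = int((0.5 - lon / (2 * math.pi)) * width) % width
        v = min(int((0.5 - lat / math.pi) * height), height - 1)
        value = int(round(dist / max_depth * 254))
        # Keep the nearest point when several fall on the same pixel.
        if image[v][u] == BLANK or value < image[v][u]:
            image[v][u] = value
    return image
```

With only one depth value per pixel and far fewer LiDAR returns than pixels, most entries of the returned image remain blank, which is the sparsity the text describes.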
In this paper, we propose the Patch Filling method, which utilizes a fixed-size square slider to fill the points within the range of point cloud projection. The filling conditions and process are determined by the information of the original points already present in the slider. The blank areas in the projection correspond to the positions where the distance information from point cloud data is missing. We can identify and fill the missing information in the matrix corresponding to the blank pixels using the Patch Filling method, by converting the pixels of the ERP image into a two-dimensional matrix. The specific filling formula is as follows:
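$$C_w = \frac{\sum_{i=1}^{r^2} C_i - 255\,N_w}{r^2 - N_w}$$
where $r$ is the number of rows or columns of the square slider matrix, $C_i$ is the matrix element value corresponding to the $i$th pixel in the slider area, $C_w$ is the matrix element value assigned to a blank pixel in the slider area after filling, and $N_w$ is the number of blank pixels in the slider area. Blank pixels carry the value 255 (Algorithm 1), so the numerator is the sum over the non-blank pixels only.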
The slider size and the filling conditions were selected by grid search. The search terminates once the slider size reaches the given maximum value, and a slider area is filled only if blank pixels make up less than 70% of all its pixels. This ensures the feasibility of the Patch Filling method and the quality of the depth map generated after filling. The result of filling the framed area is shown in Figure 1. After applying the Patch Filling method (Algorithm 1), the missing distance information from the point cloud data of the vehicle is supplemented, and the result is nearly consistent with the actual situation. The image obtained after slider filling demonstrates that the sparseness problem has been solved. However, research in autonomous driving generally pays little attention to the regions near the poles of the panorama, which are far away from the vehicle. Therefore, we mask the area outside the range of the 3D point cloud projection and normalize the remaining areas. The ground-truth image shown in Figure 1 represents the expected depth map corresponding to the panorama. The panoramic image is then combined with the ground-truth image within the panorama depth estimation network to carry out the subsequent estimation tasks.
      
| Algorithm 1 The Patch Filling algorithm for the proposed method | 
| Input: sparse depth map rgb (blank pixels have the value 255)
 Output: dense depth map
 for r = 5, 10, 20 do
    for m from 583 to 998 step r do
       for n from 0 to 2400 step r do
          B ← rgb[m:m + r, n:n + r]
          C ← B.reshape(−1)
          // count the blank pixels in the slider area
          num ← 0
          for i in C do
             if i = 255 then
                num ← num + 1
             end if
          end for
          su ← sum of all elements of C
          s ← 0
          // fill only if fewer than 70% of the pixels are blank
          if num < 0.7r² then
             for i in C do
                if i = 255 then
                   // mean of the non-blank pixels
                   C[s] ← (su − 255·num)/(r² − num)
                end if
                s ← s + 1
             end for
          end if
          D ← C.reshape(B.shape)
          rgb[m:m + r, n:n + r] ← D
       end for
    end for
 end for
 return rgb
 | 
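Algorithm 1 can be mirrored by a short runnable sketch in plain Python. It keeps the slider sizes (5, 10, 20), the blank value 255, and the 70% condition from the algorithm, but several details are assumptions made for illustration: the depth map is a nested list rather than an array, the sweep covers the whole image instead of rows 583 to 998 and columns 0 to 2400, and blocks at the image border are clipped, so the mean is taken over the pixels actually present. The function name `patch_fill` is hypothetical.

```python
BLANK = 255  # blank-pixel marker, as tested in Algorithm 1


def patch_fill(depth, sizes=(5, 10, 20), max_blank_ratio=0.7):
    """Return a copy of `depth` with blank pixels filled block by block.

    Each r x r block whose share of blank pixels is below `max_blank_ratio`
    has its blank pixels replaced by the mean of its non-blank pixels.
    """
    out = [row[:] for row in depth]  # leave the input untouched
    height, width = len(out), len(out[0])
    for r in sizes:
        for m in range(0, height, r):
            for n in range(0, width, r):
                # Pixel coordinates of the block, clipped at the border.
                block = [(i, j)
                         for i in range(m, min(m + r, height))
                         for j in range(n, min(n + r, width))]
                blanks = [(i, j) for i, j in block if out[i][j] == BLANK]
                if not blanks or len(blanks) >= max_blank_ratio * len(block):
                    continue  # nothing to fill, or too little evidence
                valid_sum = sum(out[i][j] for i, j in block
                                if out[i][j] != BLANK)
                mean = valid_sum / (len(block) - len(blanks))
                for i, j in blanks:
                    out[i][j] = mean
    return out


sparse = [[10, 255],
          [20, 30]]
dense = patch_fill(sparse, sizes=(2,))
# dense == [[10, 20.0], [20, 30]]
```

Because the sizes are applied in increasing order, small gaps are filled from close neighbors first, and only the blocks that are still partly blank are revisited with larger sliders.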
  4. Panorama Depth Estimation Network
The continuous development of deep learning has brought major breakthroughs to panorama depth estimation in feature fusion and the self-learning of panoramic images. We propose a panoramic feature fusion model, ViT-Fuse, which improves the way UniFuse extracts equirectangular image features. It integrates a ViT module in the encoding stage to learn image context for equirectangular panoramas, so that it obtains more comprehensive equirectangular features, enhances the fused features, and improves the accuracy of depth estimation. The following subsections explain in detail the structure and inference principles of the ViT-Fuse model.
  4.1. Panorama Projection Fusion
The proposed ViT-Fuse model introduces a new panorama projection fusion scheme. The overall structure of ViT-Fuse is shown in Figure 2; it uses the UniFuse network [31] as the backbone. ViT-Fuse is built primarily on the pre-trained ResNet-18 [39] and performs the fusion from the CMP branch to the ERP branch only at the decoding stage. The ERP image is typically chosen as the representative format for panoramic views and depth maps due to the limited field of view and discontinuous scenes captured by CMP. Therefore, similar to UniFuse, ViT-Fuse omits the decoder for predicting cube depth maps, which may keep the training from losing focus on the equirectangular depth. In the ViT-Fuse structure, two encoders extract feature maps from the equirectangular image and the cubemap. Moreover, the ERP branch obtains new ViT features from the additional ViT module, which are fused with the equirectangular features generated by the ResNet-18 network to supplement the feature information. The enhanced equirectangular features and the cubemap features then pass through the fusion module, which improves the feature fusion and yields more accurate depth estimation.
  4.2. Vision Transformer Module
Previous studies have focused on experiments on indoor scene datasets, such as Stanford2D3D, Matterport3D, and 360D. For outdoor tasks such as autonomous driving, panoramic images cover a wider range of targets and are therefore more comprehensive. However, critical objects that occupy only a small area of the image are difficult to capture: their edge features may be ignored, which makes panorama depth estimation challenging.
Research on ViT techniques has established the practicability and advantages of splitting an image into multiple patches of identical size and processing them with transformer encoders. Positional information is embedded into each patch before it enters the transformer encoder, which consists of multi-head self-attention (MSA) and Multi-Layer Perceptron (MLP) blocks. ViT facilitates the learning of global image features, thereby helping to capture detailed panorama information.
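The patch-splitting step can be made concrete with a small sketch in plain Python. It only shows how an image becomes a sequence of flattened patch vectors with a position index attached; the learned linear projection and the position embeddings of an actual ViT are not modeled. The function name `split_into_patches` and the nested-list image layout (height × width × channels) are assumptions made for illustration.

```python
def split_into_patches(image, patch):
    """Split an H x W x C image (nested lists) into flattened patch vectors.

    Returns a list of (position_index, vector) pairs in row-major patch
    order; each vector has patch * patch * C entries.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            # Flatten the patch pixel by pixel, channel by channel.
            vector = [channel
                      for row in image[top:top + patch]
                      for pixel in row[left:left + patch]
                      for channel in pixel]
            tokens.append((len(tokens), vector))
    return tokens


# A 4 x 8 image with 3 channels and 2 x 2 patches gives 8 tokens of length 12.
toy = [[[i, j, 0] for j in range(8)] for i in range(4)]
toy_tokens = split_into_patches(toy, 2)
```

In a transformer encoder, every token can attend to every other token, which is what allows features at opposite ends of a panorama to inform each other.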
In our framework, the ViT module works in conjunction with ResNet-18 to enhance depth estimation performance. ResNet-18 extracts initial feature maps from both ERP and CMP images, which are refined through its layers to capture local details. Simultaneously, ERP images are also fed into the ViT module, which splits the image into patches, flattens them, and encodes positional information for each patch. The resulting ViT-encoded features are further processed through convolutional layers to produce normalized equirectangular features. These features are crucial for maintaining accuracy in object edges and small-scale details that might otherwise be lost.
The integration of ViT with ResNet enhances the depth estimation model by combining ResNet’s local feature extraction capabilities with ViT’s strength in capturing global representations, leading to improved predictive performance. ResNet preserves essential local features, while ViT addresses the limited receptive field of convolutional layers by providing global context, enhancing the model’s robustness in diverse autonomous driving environments. This fusion helps the model mitigate distortions inherent in panoramic images and improves its ability to capture fine-grained details at object edges. Specifically, the self-attention mechanism of ViT further enables the model to learn global dependencies, overcoming CNNs’ limitations in handling long-range relationships. This leads to more precise depth predictions, especially in complex geometrical scenes.
To suit the depth prediction task and the model framework, we modified the original ViT: the classification head is removed, and no learnable embedding is prepended to the sequence of embedded patches. To facilitate the subsequent feature fusion, the ViT output is preprocessed and used as the initial ViT feature. The formulation is expressed as follows:
where $E$ represents the fully connected layer, $E_{pos}$ represents the position embeddings, $x_p$ is the sequence of flattened 2D patches, $P^2$ is the resolution of each image patch, $D$ represents the size of the constant latent vector used by the transformer through all of its layers, LN denotes LayerNorm, and $T(\cdot)$ denotes preprocessing.
  4.3. Encoding Stage
The ViT-Fuse model retains ResNet-18, wherein the ERP image and its corresponding CMP image are fed to the equirectangular encoder and cube encoder located within ResNet-18, respectively, and the corresponding initial features are obtained first. Subsequently, the initial features undergo a process of max-pooling before being channeled through the first ResNet layer, leading to the acquisition of the corresponding equirectangular and cubemap features. The formulation is expressed as follows:
where $M(\cdot)$ represents max-pooling of the feature map, $RL(\cdot)$ represents the ResNet layers, and $F_{equi}$ and $F_{cube}$ represent the equirectangular features and cubemap features output by the ResNet layers, respectively. Simultaneously, the ERP images also pass through the previously mentioned ViT module. Analogous to the ResNet layers in the equirectangular encoder and cube encoder, the initial equirectangular features obtained by ViT pass through convolutional layers and directly yield the related features. These features are normalized by the sigmoid function to give the normalized equirectangular features $F_{vit}$. $F_{vit}$ is multiplied elementwise with $F_{equi}$ to obtain the enhanced equirectangular features $F_{enc}$. Afterwards, $F_{enc}$ and $F_{cube}$ pass through the fusion module, which fuses the image features of the two branches and yields the fused features. Our fusion module retains the CEE module of UniFuse as the fusion layer. The formulation for the above process is expressed as follows:
where $Conv_2$ denotes convolution with a kernel size of two, $S(\cdot)$ denotes the sigmoid function, and ⊗ denotes elementwise multiplication.
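The enhancement step, in which sigmoid-normalized ViT features gate the ResNet features elementwise, can be sketched as follows. This is a simplified illustration only: the features are flat lists of numbers, the convolution that precedes the sigmoid is omitted, and the names `sigmoid` and `enhance` are hypothetical.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def enhance(equi_features, vit_features):
    """Gate equirectangular features with sigmoid-normalized ViT features."""
    if len(equi_features) != len(vit_features):
        raise ValueError("feature vectors must have the same length")
    # Each gate lies between 0 and 1, so every feature is rescaled, not replaced.
    return [f * sigmoid(v) for f, v in zip(equi_features, vit_features)]


enhanced = enhance([2.0, 4.0], [0.0, 0.0])
# enhanced == [1.0, 2.0], because sigmoid(0) = 0.5
```

Since the sigmoid output is bounded, the ViT branch acts as a soft attention mask over the convolutional features rather than as an additive term.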
  4.4. The Fusion Module
The structure of the fusion module we used is shown in Figure 3. Before fusion with $F_{enc}$, $F_{cube}$ is converted by the C2E module, which reprojects the cubemap features onto an equirectangular grid. The two sets of features are then input into the CEE module to be merged. The corresponding formulation is expressed as:
$$F_{fuse} = \mathrm{CEE}\left(F_{enc},\ \mathrm{C2E}\left(F_{cube}\right)\right)$$
where $F_{fuse}$ denotes the fused features, $\mathrm{C2E}(\cdot)$ represents the reprojection of the cubemap features onto the equirectangular image, and $\mathrm{CEE}(\cdot)$ denotes the process contained in the CEE module. CEE is one of the modules in the fusion framework of UniFuse designed by Jiang et al.; it uses the cubemap to enhance the equirectangular features. A residual module, which consists of a 1 × 1 conv and a 3 × 3 conv, is included in CEE to alleviate the hindrance caused by the discontinuity of CMP. In the fusion process, the enhanced equirectangular features and the C2E features output by the C2E module are first fed to the CEE module. The two are then concatenated, and the concatenated features are passed through the residual module, followed by two convolution modules. The 1 × 1 conv in the residual module reduces the number of channels, which is doubled by the concatenation, and the 3 × 3 conv then produces the residual features. The residual features are added to the C2E features. After this residual modulation, the features missing at the image edges due to the discontinuity of the CMP are filled in. After the enhanced C2E features and the equirectangular features are concatenated, a Squeeze-and-Excitation (SE) block, added in CEE before the final 1 × 1 conv, recalibrates the channel-wise feature responses to achieve better feature fusion. The process in CEE can be expressed as follows:
where $\mathrm{Relu}(\cdot)$ represents the ReLU activation function, Concat represents cross-channel concatenation, Bn represents BatchNorm, and ⊕ represents elementwise addition.
  4.5. Decoding Stage
During the decoding stage, each upsampling layer corresponds to a respective layer at the encoding stage. The fused features obtained by passing through the layers at the encoding stage can be provided to the corresponding upsampling layers. The output of the last layer is directly fed into the first upsampling layer, while the feature of each remaining layer is concatenated with the output of the previous upsampling layer before passing through its corresponding upsampling layer. ERP images are progressively enlarged layer by layer throughout the process until the output of the last upsampling layer is obtained. Finally, the depth map predicted by the entire model is generated upon processing this output. The formulation for the above process is expressed as follows:
where $Up_{\times 2}$ denotes upsampling by a factor of two, $\mathrm{Elu}(\cdot)$ denotes the ELU activation function, $X$ denotes the ERP image output by each layer, $X_{depth}$ denotes the depth prediction map before processing, $M_{depth}$ denotes the maximum depth, and $P_{depth}$ denotes the depth prediction map finally output by the model.
  4.6. Evaluation Metrics and Loss Function
The quantitative evaluation metrics we used to evaluate depth estimation include the Mean Absolute Error (MAE), Absolute Relative Error (AbsRel), Root Mean Squared Error (RMSE), and accuracy metric; the corresponding formulation is expressed as follows:
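$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|$$
$$\mathrm{AbsRel} = \frac{1}{m}\sum_{i=1}^{m}\frac{\left|y_i - \hat{y}_i\right|}{y_i}$$
$$\mathrm{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2}$$
$$\delta = \frac{1}{m}\left|\left\{\, i : \max\left(\frac{y_i}{\hat{y}_i}, \frac{\hat{y}_i}{y_i}\right) < Thr \right\}\right|$$
where $m$ represents the total number of pixels, $y_i$ represents the actual depth value of the $i$th pixel, $\hat{y}_i$ represents the estimated value of the $i$th pixel, and $Thr$ is the threshold. The accuracy metric usually uses three different thresholds: 1.25, 1.25², and 1.25³. The higher the accuracy, the better the prediction. In addition, this study retains the BerHu loss as the regression loss of the model; its formulation is expressed as follows:
$$\mathcal{B}(e) = \begin{cases} |e|, & |e| \le c \\ \dfrac{e^2 + c^2}{2c}, & |e| > c \end{cases}$$
where $e$ is the difference between the predicted and actual depth values and $c$ is a threshold.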
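A minimal Python sketch of these metrics and of the BerHu loss, under their standard definitions, may help when checking results by hand. It assumes flat lists of strictly positive depths for valid pixels only (masking is not modeled), and the threshold `c` of the BerHu loss is left as a parameter because its value is not stated here. The function names `depth_metrics` and `berhu` are illustrative.

```python
import math


def depth_metrics(y_true, y_pred, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Return MAE, AbsRel, RMSE and the threshold accuracies.

    Both inputs are equal-length sequences of strictly positive depths.
    """
    m = len(y_true)
    errors = [abs(t - p) for t, p in zip(y_true, y_pred)]
    mae = sum(errors) / m
    abs_rel = sum(e / t for e, t in zip(errors, y_true)) / m
    rmse = math.sqrt(sum(e * e for e in errors) / m)
    # A pixel counts as accurate if prediction and truth agree within a ratio.
    ratios = [max(t / p, p / t) for t, p in zip(y_true, y_pred)]
    accuracy = [sum(r < thr for r in ratios) / m for thr in thresholds]
    return mae, abs_rel, rmse, accuracy


def berhu(error, c):
    """Reverse Huber loss: L1 for |error| <= c, scaled L2 beyond c."""
    a = abs(error)
    return a if a <= c else (a * a + c * c) / (2 * c)


mae, abs_rel, rmse, accuracy = depth_metrics([2.0, 4.0], [2.0, 2.0])
# mae == 1.0, abs_rel == 0.25, accuracy == [0.5, 0.5, 0.5]
```

The two branches of `berhu` meet at `|error| == c`, so the loss is continuous while penalizing large residuals quadratically.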
  5. Experiments and Results
We validated the effectiveness of the Patch Filling method in addressing the issue of sparse point clouds and improving depth estimation by comparing the ViT-Fuse model with and without Patch Filling. Additionally, we evaluated the proposed ViT-Fuse and compared it with UniFuse by conducting experiments on the Ford Campus Vision and LiDAR Dataset.
  5.1. Datasets
The Ford Campus Vision and LiDAR Dataset is a real-world driving scenario dataset collected by an autonomous ground vehicle testbed. The test vehicle is equipped with a Velodyne 3D LiDAR scanner, push-broom forward-looking Riegl LiDARs, a Point Grey Ladybug3 omnidirectional camera system, and other sensors. Unlike datasets such as Matterport3D, this dataset does not directly provide raw depth maps but provides files such as 3D point cloud data. The Ford Campus Vision and LiDAR Dataset focuses on the part of the panorama that corresponds to the driving perspective, and the point cloud data are also mainly distributed in this area. The dataset contains 3817 panoramic RGB images, of which 3800 are used as the training set for the experiment. During data processing, the ground-truth panoramic depth maps are generated from the original 3D point cloud data.
  5.2. Implementation Details
We implemented our experiments in PyTorch and trained on a GeForce RTX 3090 Ti GPU. We used Adam with default parameters as the optimizer, with a constant learning rate of 1 × 10⁻⁴. During training, we set the input size to 512 × 1024 and the batch size to 1. At the encoding stage of the equirectangular images, the patch size of the added ViT module was set to 8 × 8, the embedding dimension to 1024, and the depth of the transformer encoder to 4.
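For a sense of scale, the sequence length implied by these settings can be worked out directly. The calculation assumes that the ViT module splits the full 512 × 1024 input into non-overlapping patches and that the input has three color channels ($C = 3$); neither assumption is stated explicitly above.

$$N = \frac{512}{8} \times \frac{1024}{8} = 64 \times 128 = 8192, \qquad P^2 \cdot C = 8 \cdot 8 \cdot 3 = 192$$

Under these assumptions, each panorama yields 8192 patch vectors of length 192, and each vector is projected to the 1024-dimensional embedding before entering the four transformer encoder layers.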
  5.3. Effectiveness of Patch Filling
We compared the effectiveness of Patch Filling (PF) in improving depth estimation by evaluating the ViT-Fuse model with and without PF on the Ford Campus Vision and LiDAR Dataset. The PF method was introduced to fill the gaps in sparse point cloud data, aiming to generate a denser and more accurate depth map. Without PF, the model suffered from incomplete depth information, particularly in regions where the point cloud data were sparse or missing. By applying PF, we can enhance the completeness of the depth map, leading to improved accuracy, especially in challenging areas such as object edges and distant objects.
Figure 4 shows a visualization of the effectiveness of the PF method. The PF method effectively reduced the gaps caused by sparse 3D point clouds, generating denser panoramic depth maps that more accurately reflect real-world scenarios. Depth prediction results obtained using ViT-Fuse with depth maps supplemented by the PF method showed clear improvements. By utilizing these enhanced depth maps, the model benefitted from more reliable input, leading to more accurate predictions and a significant reduction in the adverse effects caused by sparse or incomplete observations. This demonstrated the effectiveness of the PF method in improving the overall depth estimation process for complex driving environments.
 Table 1 demonstrates the quantitative results of our experiments. The ViT-Fuse model with PF achieved a significant reduction in error metrics compared to the version without PF. Specifically, the Mean Absolute Error (MAE) was reduced by 14.49%, the Absolute Relative Error (Abs Rel) was reduced by 18.03%, and the Root Mean Squared Error (RMSE) was reduced by 11.93%. These improvements indicated that PF effectively mitigated the gaps caused by sparse point cloud data, resulting in more accurate depth predictions. Furthermore, the accuracy metrics, particularly δ < 1.25, also showed substantial improvements, with an increase of 9.73%. This improvement highlighted the model’s ability to better capture fine details in the depth map, leading to more reliable predictions in autonomous driving scenarios. The results demonstrated that the inclusion of PF not only enhanced the overall accuracy of the depth estimation but also improved the model’s robustness in complex driving environments.
To further highlight the advantages of PF, we also conducted a comparative evaluation with Bilinear Interpolation (BI), a commonly used method for filling missing data in images [40]. BI excels in interpolation strength and is computationally efficient, which makes it suitable for simpler tasks. However, our experiments on the Ford Campus Vision and LiDAR Dataset, a complex outdoor dataset, revealed certain limitations of BI. Although BI can effectively fill sparse points, the interpolated depth maps it generates are often not smooth enough and lack the natural continuity found in real-world depth data. This discrepancy is especially evident in complex scenes, where the interpolation across the same object results in inconsistent depth values between neighboring regions. As a result, BI’s interpolation may appear fragmented, introducing artifacts that reduce the overall prediction quality. In contrast, PF addresses these issues by maintaining a smoother and more continuous depth map, as shown in Figure 4. PF not only fills missing data points but also preserves consistency across the depth values of objects, leading to more realistic depth predictions. This makes PF more effective in autonomous driving environments, where consistent depth estimation is crucial for tasks like object detection and obstacle avoidance.
The quantitative results in Table 1 further validate the advantages of PF over BI. PF achieved a reduction of 6.83% in MAE, 8.22% in Abs Rel, and 4.72% in RMSE, demonstrating improved overall performance in depth estimation. Additionally, the accuracy metric δ < 1.25 showed a 3.49% increase, indicating better prediction reliability. These results confirmed that PF offered a more effective solution for addressing the sparsity of point cloud data compared to BI, enhancing the performance and robustness of the ViT-Fuse model in real-world autonomous driving scenarios.
  5.4. Comparative Results Between ViT-Fuse and Baseline Models
The quantitative comparison between our model and several baseline models, including FCRN [19], BiFuse, and UniFuse, on the Ford Campus Vision and LiDAR Dataset is displayed in Table 2. As shown in Figure 5, the loss curves of ViT-Fuse and UniFuse differed little in the first 150 epochs, with UniFuse showing slightly better training efficacy. However, after 150 epochs, ViT-Fuse outperformed UniFuse across all metrics and maintained its lead. The inclusion of FCRN and BiFuse in the comparison provided a broader baseline, demonstrating that ViT-Fuse achieved superior performance not only against UniFuse but also against other standard depth estimation frameworks. As presented in Table 2, ViT-Fuse showed significant improvements across multiple metrics. In comparison to UniFuse, ViT-Fuse achieved reductions of 0.0042 and 0.0075 in MAE and RMSE, respectively. The most notable improvement was in the Abs Rel error, where ViT-Fuse outperformed UniFuse by 0.1300, equivalent to approximately 16.46%. On average, ViT-Fuse reduced the overall error metrics by approximately 9.15%, indicating superior predictive performance. Specifically, on the loosest accuracy metric of δ < 1.25³, ViT-Fuse showed a modest improvement of 0.34%, while on the tighter metric of δ < 1.25², it achieved a 0.68% increase. The most significant enhancement was observed on the tightest accuracy metric of δ < 1.25, where ViT-Fuse outperformed UniFuse by 0.99%. These gains in the accuracy metrics further support the conclusion that ViT-Fuse achieved greater precision.
In comparison to the additional baselines, FCRN and BiFuse, ViT-Fuse also demonstrated superior performance. The qualitative comparison between the FCRN, BiFuse, UniFuse, and ViT-Fuse models is illustrated in Figure 6. ViT-Fuse showed improved depth estimation accuracy and reduced errors, particularly in complex scene structures. The comparison revealed that both FCRN and BiFuse exhibited limitations in capturing detailed object boundaries, leading to less accurate depth maps. In contrast, the combination of the Vision Transformer module and Patch Filling method in ViT-Fuse enabled more precise depth predictions and enhanced the model’s robustness in real-world outdoor environments.
Moreover, since our model was primarily an improvement based on the UniFuse model, we conducted a more detailed comparative validation between the two models on different panoramic images within the dataset. The qualitative comparison between the UniFuse model and our proposed ViT-Fuse model on the Ford Campus Vision and LiDAR Dataset is exhibited in 
Figure 7. This comparison highlighted the advantages of ViT-Fuse in generating more detailed and accurate depth maps, particularly at the edges of target objects. In 
Figure 7b, the red-boxed area on the left showed two building groups resembling trees, while the right side contained a vehicle on the road. In the depth map generated by ViT-Fuse, the prediction of the building edges was more precise than that of UniFuse, as can be seen by comparing the two depth maps side by side. ViT-Fuse excelled in maintaining the integrity of edge information, minimizing the loss that was present in other models’ predictions. Similarly, in the right area, the contours of the vehicle generated by ViT-Fuse were clearer and more complete than those produced by the other models. Both FCRN and BiFuse struggled to capture such fine details, resulting in depth maps that were less accurate along object edges. The ability of ViT-Fuse to accurately represent these contours further demonstrated the effectiveness of the model in mitigating depth prediction errors and optimizing edge performance. These qualitative results, alongside the quantitative improvements in 
Table 2, further validated the robustness and precision of the proposed ViT-Fuse model.
6. Discussion
We designed a panorama depth estimation experiment and tested ViT-Fuse and baseline models on the driving scenario dataset. ViT-Fuse adds a ViT module, which mitigates the small receptive field of traditional convolution and retains more image information. This is consistent with previous research showing that Vision Transformers can effectively capture global features in image tasks, outperforming CNNs in terms of receptive field size and feature retention [
10]. Moreover, the unique self-attention mechanism in the ViT module enhances the ability to aggregate global information, a feature previously explored by Shen et al. [
35] in their PanoFormer model to mitigate image distortion in panoramic tasks. In theory, therefore, ViT-Fuse should improve the feature fusion process of UniFuse and perform better in panorama depth estimation tasks. The experimental results demonstrate that ViT-Fuse outperforms baseline models, such as UniFuse, with each error metric reduced to varying degrees, aligning with trends observed in prior studies on panoramic image depth estimation [
22]. On average, ViT-Fuse reduces overall error metrics by more than 9%, demonstrating superior performance in comparison to UniFuse, especially in handling complex outdoor scenes and object edges. This further verifies the advantages of using transformers in depth prediction tasks. Although the improvement of ViT-Fuse on the loosest accuracy metric is relatively modest, the model shows its largest gain on the tightest accuracy metric, achieving a 0.99% increase under the δ < 1.25 condition. Such improvements highlight the ability of the ViT-Fuse framework to capture fine details more accurately. Similar advancements in edge-detail handling have been reported in studies involving hybrid CNN–transformer architectures for vision tasks [
32].
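The global aggregation attributed to self-attention in the paragraph above can be illustrated with a minimal single-head sketch. It is a simplified illustration, not the ViT module used in ViT-Fuse: the query, key, and value projections are taken to be the identity, there is no multi-head splitting or positional encoding, and the names `softmax` and `self_attention` are chosen for this example only.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    tokens: list of patch embeddings, each a list of d floats.
    Simplification for this sketch: Q, K and V are the tokens themselves
    (identity projections), whereas a real ViT learns these projections.
    Returns (outputs, weights), where weights[i][j] is how much token i
    attends to token j.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted average of ALL input tokens.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights

if __name__ == "__main__":
    patches = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    out, attn = self_attention(patches)
    print(attn)
```

Every attention weight is strictly positive, so each output token depends on every patch in the image after a single layer, whereas a small convolution kernel only sees its immediate neighborhood.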
In addition to the ViT module, the Patch Filling method addresses the sparsity issues in point cloud data by filling gaps with mean interpolation. This complements previous works on depth completion for sparse LiDAR data [
11]. The results indicate that the inclusion of Patch Filling not only enhances the model’s accuracy but also improves its robustness in dealing with missing or incomplete observations, which is critical in real-world autonomous driving scenarios.
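The idea of filling gaps in a sparse depth map by mean interpolation can be sketched as follows. This is only one plausible reading of the description above, written under stated assumptions: missing pixels are marked with 0, each gap is filled with the mean of the valid values inside a square window around it, and the function name `patch_fill_mean` and the `radius` parameter are hypothetical. The actual Patch Filling procedure may differ in window shape, ordering, or iteration.

```python
def patch_fill_mean(depth, radius=1, missing=0.0):
    """Fill missing pixels with the mean of valid depths in a local patch.

    depth: 2D list of floats (rows of a sparse depth map).
    radius: half-width of the square patch, so the patch is (2*radius+1)^2.
    missing: value that marks a pixel with no LiDAR return (assumed 0 here).
    Pixels whose patch contains no valid depth are left as missing.
    """
    h, w = len(depth), len(depth[0])
    filled = [row[:] for row in depth]  # copy so reads use original values only
    for y in range(h):
        for x in range(w):
            if depth[y][x] != missing:
                continue  # keep measured depths untouched
            vals = [
                depth[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
                if depth[j][i] != missing
            ]
            if vals:
                filled[y][x] = sum(vals) / len(vals)
    return filled

if __name__ == "__main__":
    sparse = [
        [0.0, 2.0, 0.0],
        [4.0, 0.0, 0.0],
        [0.0, 0.0, 0.0],
    ]
    for row in patch_fill_mean(sparse):
        print(row)
```

Reading from the original map and writing to a copy keeps interpolated values from propagating into later patches, so every filled depth is an average of measured points only.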
Despite the promising results achieved by the proposed ViT-Fuse framework, several limitations remain. First, the experiments were conducted solely on the Ford Campus Vision and LiDAR Dataset, which, while comprehensive, may not fully capture the variety of real-world driving scenarios. This limits the generalizability of our findings across different environmental conditions and datasets. Future work will focus on expanding the scope of validation by conducting experiments on other suitable and available panoramic autonomous driving datasets or on newly created datasets tailored to our research needs. This will provide a more robust understanding of the model’s performance across diverse environments.
Additionally, while the ViT-Fuse model improves feature fusion and depth estimation, its reliance on a single modality, camera-based vision, presents challenges in certain driving conditions, such as poor lighting or adverse weather. In future research, we aim to explore the integration of other sensing modalities, including radar and ultrasonic sensors, to enhance the robustness and accuracy of depth perception. These multi-modal approaches have the potential to improve depth estimation, particularly in situations where visual data alone may be insufficient. Further investigations will also prioritize optimizing the computational efficiency of the ViT module so that the model can be deployed effectively in real-time autonomous driving systems.