U-ETMVSNet: Uncertainty-Epipolar Transformer Multi-View Stereo Network for Object Stereo Reconstruction

Abstract: The Multi-View Stereo (MVS) model, which uses 2D images from multiple perspectives for 3D reconstruction, is a crucial technique in the field of 3D vision. To address the poor correlation between 2D features and 3D space in existing MVS models, as well as the high sampling rate required by static sampling, we propose U-ETMVSNet in this paper. First, we employ an integrated epipolar transformer module (ET) to establish 3D spatial correlations along epipolar lines, thereby enhancing the reliability of aggregated cost volumes. Second, we devise a sampling module based on probability volume uncertainty to dynamically adjust the depth sampling range for the next stage. Finally, we utilize a multi-stage joint learning method based on multi-depth value classification to evaluate and optimize the model. Experimental results demonstrate that on the DTU dataset, our method achieves relative improvements of 27.01% and 11.27% in completeness error and overall error, respectively, compared to CasMVSNet, even at lower depth sampling rates. Moreover, our method achieves a score of 58.60 on the Tanks&Temples dataset, highlighting its robustness and generalization capability.


Introduction
Multi-View Stereo (MVS) is one of the fundamental tasks of 3D computer vision. It uses camera parameters and viewpoint poses to compute the mapping relationship of each pixel in an image for dense 3D scene reconstruction. In recent years, this technology has found widespread applications in areas such as robot navigation, cultural heritage preservation through digitization, and autonomous driving. Traditional methods rely heavily on manually designed similarity metrics for reconstruction [1][2][3][4]. While these approaches perform well in Lambertian surface scenarios, their effectiveness diminishes in challenging conditions characterized by complex lighting variations, lack of distinct textures, and non-Lambertian surfaces. Furthermore, these methods suffer from computational inefficiency, significantly increasing the time required for reconstructing large-scale scenes and thereby limiting practical applications.
Among deep learning-based MVS methods, Yao et al. [5] employed 2D Convolutional Neural Networks (CNNs) to extract image features. They utilized differentiable homography warping, 3D CNN regularization, and depth regression operations to achieve end-to-end depth map prediction. Finally, the reconstructed dense scene was obtained through depth map fusion. The introduction of CNNs allows for better extraction of global features, with excellent performance even in scenarios with weak textures and reflective environments. Additionally, Gu et al. [6], in CasMVSNet, adopt a cascading approach to construct the cost volume, gradually refining the depth value sampling range from coarse to fine. This stepwise refinement at higher feature resolutions generates more detailed depth maps, ensuring overall efficiency in reconstruction and a rational allocation of computational resources. However, conventional multi-stage MVS frameworks often lack flexibility in depth sampling, relying mostly on static or pre-defined ranges for depth value sampling. When the depth sampling for a certain pixel deviates from the true depth, the model cannot adaptively adjust the sampling range for the next stage, leading to erroneous depth inferences.
The core step of multi-view stereo vision is to construct a 3D cost volume, which can be summarized as computing the similarity between multi-view images. Existing methods mostly utilize variance [6,7] to build the 3D cost volume. For example, Yao et al. [5] assign the same weight to every view when constructing the matching cost volume and use the mean square deviation method to aggregate feature volumes from different perspectives. However, this approach overlooks pixel visibility under different viewpoints, limiting its effectiveness in dense pixel-wise matching. To address this issue, Wei et al. [8] introduced context-aware convolution in the intra-view aggregation module of AA-RMVSNet to aggregate feature volumes from different viewpoints. Additionally, Yi et al. [9] proposed an adaptive view aggregation module, utilizing deformable convolution networks to achieve pixel-wise and voxel-wise view aggregation with minimal memory consumption. Luo et al. [10] employed a learning-based block matching aggregation module, transforming individual volume pixels into pixel blocks of a certain size and facilitating information exchange at different depths. However, directly applying regularization to the cost volume fails to facilitate communication with feature information from adjacent depths. With the continuous development of attention mechanisms, Yu et al. [11] incorporated attention mechanisms into the feature extraction stage of the MVS network, resulting in noticeable improvements in experimental results. Li et al. [12] transformed the depth estimation problem into a correspondence problem between sequences and optimized it through self-attention and cross-attention. Unfortunately, the above methods focus solely on a single dimension, addressing only 2D local similarity issues and obtaining pixel weights through complex networks. This introduces additional computational overhead and neglects the correlation between 2D semantics and 3D space, ultimately compromising 3D consistency in the depth direction.
To address the above issues, this paper proposes an uncertainty-epipolar Transformer multi-view stereo network (U-ETMVSNet) for object stereo reconstruction. First, this paper uses an improved cascaded U-Net network to enhance the extraction of 2D semantic features. In addition, the cross-attention mechanism of the epipolar Transformer is used to construct the 3D association between different view feature volumes along the epipolar lines, enhancing the 3D consistency of the depth space without introducing additional learnable parameters that would increase the amount of model computation. The cross-scale cost volume information exchange module allows information contained in cost volumes at different stages to be progressively transmitted, strengthening the correlation between cost volumes and improving the quality of depth map estimation. Second, the depth sampling range is dynamically adjusted based on the uncertainty of the probability volume, which effectively reduces the number of depth samples required and enhances the accuracy of depth sampling. Finally, a multi-stage joint learning approach is proposed, replacing the conventional depth regression problem at each stage with a multi-depth value classification problem. This joint learning strategy significantly enhances the precision of the reconstruction. The proposed method is experimentally validated on the DTU and Tanks&Temples datasets, and its performance is compared with current mainstream methods. The method achieves high reconstruction accuracy even at lower sampling rates, confirming the effectiveness of the proposed approach for dense scene reconstruction.
The rest of the paper is organised as follows: Section 2 provides an overview of relevant methods in the field. Section 3 describes the proposed network and the entire process of object reconstruction in detail. Section 4 presents the experimental setup and the experiments conducted to validate the reliability and generalization capabilities of the proposed method. Finally, in Section 5, we summarize the contributions of the proposed network to multi-view reconstruction of objects and offer prospects for future work.

Traditional MVS
Early MVS methods can be categorized based on their technical characteristics into point-based methods [13,14], voxel-based methods [15], depth map-based methods [16][17][18], and polygon mesh-based methods [19]. Point-based approaches extend from initial matching points to surrounding pixels, iteratively refining feature points to achieve dense reconstruction. However, this method limits the capability of parallel data processing. In certain scenarios, such as those with uneven texture distribution, this approach relies heavily on accurate feature extraction, resulting in less-than-ideal outcomes. Voxel-based methods initially calculate the scene's bounding box and then identify voxels near irregular grids in 3D space. Vogiatzis et al. [20] proposed a method that partitions 3D space into "object" and "no-object" regions, enforcing photometric consistency between adjacent areas and expanding the "object" region. However, discrete spatial partitioning increases memory usage for improved accuracy, making this method suitable only for low-resolution small scenes. Depth map-based approaches decompose reconstruction into two parts, starting with multiple single-view depth estimations. This approach can be combined with the previous two methods, merging depth maps to obtain the final predicted point cloud. Compared to the methods mentioned earlier, this approach offers greater flexibility. Polygon mesh-based methods initialize the scene surface and iteratively enhance multi-view photometric consistency while evolving that surface. These early MVS methods have their advantages, but they also face limitations in parallel processing capability, robustness in specific scenarios, and applicability to different scene sizes.

Learning-Based MVS
In early research, end-to-end 3D reconstruction was addressed by Ji et al.'s SurfaceNet method [21], which encoded images and camera parameters into 3D voxels, yielding significant reconstruction results. Extending this idea, Huang et al. [22] proposed DeepMVS, employing plane-sweep sampling for each input image to construct the cost volume of the source images. To enhance the model's scalability and overcome limitations on the number of input images, max-pooling was employed to gather and aggregate information from neighboring images. Yao et al. [5] introduced an end-to-end multi-view reconstruction algorithm in MVSNet, combining plane sweep stereo, differentiable homographic warping, variance-based matching cost volume construction, and 3D regularization. This algorithm has become the standard procedure for MVS reconstruction. Building on this foundation, Yi et al. [9] proposed an adaptive view aggregation module, constructing the cost volume selectively by learning the contributions of different views. Ma et al. [23] introduced a coarse-to-fine MVS method based on a cascaded structure in EPP-MVSNet, allowing more accurate aggregation of high-resolution image features. On the other hand, addressing memory consumption concerns, Yang et al. [24] introduced a coarse-to-fine cost pyramid construction method, mitigating memory usage through distributed computing to enhance model efficiency. Yao et al. [25] proposed R-MVSNet, utilizing a GRU structure for cost volume regularization, effectively resolving excessive memory usage at the expense of increased training time. Chen et al.'s VA-Point-MVSNet [26] initially predicts a coarse depth map, followed by an iterative up-sampling and refinement process to generate depth maps with a narrower depth range. However, due to potential depth interval errors in the coarse estimation phase, this method performs suboptimally in high-resolution reconstruction. The coarse-to-fine strategy also struggles to capture crucial information for depth inference.
In multi-stage MVS frameworks, the initial stage typically employs a fixed depth sampling range to cover the entire range of depth values of the input scene. Subsequent stages then modify the depth sampling range based on the predicted depth values from the previous stage. Gu et al.'s CasMVSNet [6] gradually reduces the depth range using a reduction factor, achieving high-quality depth map inference. Yu et al.'s Fast-MVSNet [27] uses a sparse cost volume to learn a sparse, high-resolution depth map. It then applies data-adaptive propagation and a Gauss-Newton layer to iteratively optimize the sparse depth map into a high-resolution depth map. Cheng et al.'s UCS-Net [28] uses the variance of the depth space distribution to progressively narrow the depth scanning range, achieving a reasonable and fine-grained partition of depth space under limited memory usage. Wang et al.'s PatchmatchNet [29] optimizes each stage's depth sampling using an adaptive propagation and evaluation scheme. It reduces the number of depth hypotheses and removes regularization structures to improve model efficiency, though the overall performance is not highly satisfactory.
With the continuous development of attention mechanisms, Yu et al. [11] applied attention mechanisms to the feature extraction stage of MVS networks to capture long-term dependencies in depth inference tasks, achieving promising experimental results. Li et al. [12] formalized the depth estimation problem as a sequence-to-sequence correspondence problem. They utilized positional encoding, self-attention, and cross-view attention mechanisms to capture the features of the cost volume, enabling dense stereo estimation. Ding et al.'s TransMVSNet model [30] and Zhu et al.'s MVSTR model [31] introduced a global contextual Transformer, expanding the network's receptive field and reinforcing the 3D consistency of dense features, achieving robust dense feature matching. Sun et al. [32] proposed a Transformer-based local feature matching method that used attention mechanisms to obtain feature descriptors of images for precise matching. They demonstrated the effectiveness of dense matching even in areas with weak textures. However, these methods tend to focus excessively on 2D features, associating features of pixels within views through extensive computations, resulting in suboptimal overall model efficiency.

Method
In this section, we provide a detailed overview of the model proposed in this paper. The overall network architecture is depicted in Figure 1. The network processes the given images $\{I_i\}_{i=0}^{N-1} \in \mathbb{R}^{H\times W\times 3}$, utilizing an enhanced Cascaded U-Net to extract 2D features at various scales (Section 3.1). Subsequently, we employ differentiable homography warping to construct the source view feature volumes, initializing depth hypotheses through inverse depth sampling in the initial stage (Section 3.2). The epipolar Transformer (ET) is then utilized to aggregate feature volumes from different viewpoints, generating the stage-wise matching cost volume. The cost volume information exchange module (CVIE) enhances the utilization of information across different scales (Section 3.3). In stage 1 of the model, we dynamically adjust the depth sampling range based on the uncertainty in the current probability volume distribution, aiming to enhance the accuracy of depth inference (Section 3.4). Finally, we introduce the multi-stage joint learning approach proposed in this paper (Section 3.5).

Cascaded U-Net Network
Traditional methods, such as Yao et al. [5], employ 2D convolutional networks for feature extraction. However, this approach can only perceive image textures within a fixed field of view. In contrast, Chen et al. [33] utilize an improved U-Net network for feature extraction, achieving favorable results. In this section, an enhanced cascaded U-Net feature extraction module is designed. The first part of the structure is illustrated in Figure 2. The network selectively handles low-texture regions to preserve more intricate details.
The given reference image $I_{i=0}$ and adjacent source images $\{I_i\}_{i=1}^{N-1}$ are fed into the network to construct image features at different scales. In this cascaded U-Net network, the front-end feature encoder utilizes successive convolution and pooling operations, increasing the channel dimensions while reducing the spatial size to extract deep features from the images. However, as the network depth increases, more feature information tends to be lost. The back-end decoder functions inversely to the encoder, performing upsampling to restore the original size while also connecting with feature maps from earlier stages. This facilitates better reconstruction of target details. The key to this process lies in fusing high-level and low-level features to enrich the detailed information in the feature maps. Subsequently, the cascaded network repeats this process, and the second part of the cascaded structure appends convolution operations at the output ports, obtaining one feature map per stage, where $k = 0, 1, 2$ denotes the three different stages of the model (the stage index is omitted for simplicity in the following discussion). This cascaded U-Net feature extraction module aids in preserving richer detailed features, providing more accurate information for depth estimation.


Homography Warping
In deep learning-based multi-view stereo (MVS) methods [27,29,30], the construction of the cost volume often involves differentiable homographic warping, drawing inspiration from traditional plane sweep stereo. Homographic warping leverages camera parameters to establish mappings between each pixel on the source view and different depths under the reference view, within a depth range of $[d_{min}, d_{max}]$. The procedure entails warping source image features to the $d_j$ layer of the reference view's viewing frustum. This process is mathematically expressed as shown in Equation (1).
The pixel feature in the reference view is denoted as $p_r$. We use $\{K_i\}_{i=0}^{N-1}$ to represent the camera intrinsic parameters and $\{[R_{0,i} \mid t_{0,i}]\}_{i=1}^{N-1}$ to represent the motion transformation parameters from the source views to the reference view. By embedding camera parameters into features and performing mapping transformations, we establish the mapping relationship between $p_r$ and the corresponding pixel feature $p_{s_i,j}$ in the $i$-th source view. The features $p_{s_i,j}$ are distributed along the epipolar line in the source view. Warping the source features at every depth layer $d_j$ generates $N-1$ source feature volumes in $\mathbb{R}^{H\times W\times C\times D}$, where $D$ is the total number of depth hypotheses. This process accomplishes the conversion of features from two-dimensional to three-dimensional, thereby restoring depth information.
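The warping step can be illustrated for a single pixel. The sketch below assumes the standard plane-sweep form $p_{s_i,j} = K_i\,[R_{0,i}(K_0^{-1} p_r d_j) + t_{0,i}]$, which is consistent with the definitions above but is a reconstruction rather than the original Equation (1). It further assumes that $R$ and $t$ take reference-camera coordinates to source-camera coordinates. The function name and argument names are hypothetical.

```python
import numpy as np

def project_to_source(p_r, depths, K_ref, K_src, R, t):
    """Map one reference pixel to a source view at each depth hypothesis.

    p_r:    (u, v) pixel coordinates in the reference view.
    depths: (D,) depth hypotheses d_j.
    K_ref, K_src: (3, 3) intrinsic matrices.
    R, t:   (3, 3) rotation and (3,) translation, ASSUMED to map
            reference-camera coordinates to source-camera coordinates.
    Returns a (D, 2) array of source-view pixel coordinates.
    """
    depths = np.asarray(depths, dtype=float)
    p_h = np.array([p_r[0], p_r[1], 1.0])        # homogeneous pixel
    ray = np.linalg.inv(K_ref) @ p_h             # back-projected viewing ray
    pts_ref = ray[None, :] * depths[:, None]     # (D, 3) points along the ray
    pts_src = pts_ref @ R.T + t[None, :]         # move into the source frame
    proj = pts_src @ K_src.T                     # homogeneous source pixels
    return proj[:, :2] / proj[:, 2:3]            # perspective division
```

With a pure sideways translation between the cameras, the returned points share one image row, which is the epipolar line along which the source features are later sampled.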
Due to the absence of calibration in the input images, directly performing uniform sampling in depth space may result in spatial sampling points not being evenly distributed along the epipolar lines when projected onto the reference view. This is particularly evident in regions farther from the camera center, where the mapped features may be very close, leading to a loss of depth information, as illustrated in Figure 3. To address this issue, inspired by references [29,34], in the first stage of depth sampling, this paper employs the inverse depth sampling method to initialize depth. The specific operation involves uniformly sampling in inverse depth space, ensuring equidistant sampling in pixel space, as shown in Equation (2).
Employing this depth sampling method effectively avoids the loss of depth information, thereby significantly enhancing the reconstruction results.
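Inverse depth sampling can be sketched in a few lines. The code assumes the common formulation in which $1/d_j$ is spaced uniformly between $1/d_{min}$ and $1/d_{max}$; the exact indexing used in Equation (2) may differ, and the function name is hypothetical.

```python
import numpy as np

def inverse_depth_hypotheses(d_min, d_max, num_depths):
    """Return depth hypotheses spaced uniformly in inverse-depth (1/d) space.

    The first hypothesis equals d_min and the last equals d_max; the gaps
    between consecutive depths grow with distance from the camera.
    """
    inv = np.linspace(1.0 / d_min, 1.0 / d_max, num_depths)
    return 1.0 / inv

# Example using the depth range and first-stage sample count reported in
# the implementation details (425 mm to 935 mm, 16 hypotheses).
hypotheses = inverse_depth_hypotheses(425.0, 935.0, 16)
```

Compared with uniform depth sampling, this places more hypotheses near the camera, where a fixed depth step corresponds to a larger displacement along the epipolar line.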


Cost Volume Aggregation
The complete cost volume aggregation module consists of two components: the Epipolar Transformer aggregation module (ET) (Section 3.3.1) and the cross-scale cost volume information exchange module (CVIE) (Section 3.3.2). In this section, we introduce both components.


Epipolar Transformer
Cost volume construction is the process of aggregating feature volumes from different source views to obtain depth information for individual pixels in the reference view. As conventional variance-based aggregation methods often struggle to filter out noise effectively, this paper employs an epipolar Transformer to aggregate feature volumes from different views. Specifically, the Transformer's cross-attention mechanism is used to build a 3D correlation along the epipolar line direction between the reference feature $p_r$ (Query) and the source features $\{p_{s_i,j}\}_{j=0}^{D-1}$ (Keys). The resulting cross-dimensional attention then guides the aggregation of feature volumes from different views, ultimately achieving cross-dimensional cost volume aggregation. The detailed structure of the module is illustrated in Figure 4. Common shallow 2D CNNs can only extract texture features within a fixed receptive field and struggle to capture finer details in regions with weak textures. Therefore, this paper employs the computationally intensive cascaded U-Net for query construction. Guided by Equation (1), the projection transformation of source view features restores the depth information of 2D query features. To ensure 3D consistency in depth space, we adopt a cross-attention mechanism along the epipolar line direction to establish correlations between 2D semantics and 3D spatial depth. This relates the pixel feature of the reference view, $p_r$ (Query), to the source features mapped onto the epipolar line, $\{p_{s_i,j}\}_{j=0}^{D-1}$ (Keys). The attention weights $w_i$ are computed as shown in Equation (3).
where $t_e$ represents the temperature parameter. The features $\{p_{s_i,j}\}_{j=0}^{D-1}$ are stacked along the depth dimension to form $v_i \in \mathbb{R}^{C\times D}$. Previous studies [35,36] have indicated that utilizing group-wise correlations to group feature volumes can reduce the computational and storage requirements of the model during cost volume construction. Therefore, this paper employs group-wise correlation, partitioning the feature volumes into $G$ groups along the feature dimension, indexed by $g = 0, \ldots, G-1$. Based on the inner product calculation in Equation (4), the similarity $s_i \in \mathbb{R}^{G\times D}$ is computed between the source view feature volumes and the reference view feature volume. The obtained $s_i$ serves as the values for the cross-attention mechanism.
In this context, the $g$-th group of $v_i$ along the channel dimension is denoted $v_i^g$. Finally, the values of the epipolar attention mechanism are weighted by $w_i$ and aggregated for stage $n$, resulting in the stage-wise aggregated cost volume $C^n_{agg}$. The specific operations are detailed in Equation (5).
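The three steps above (attention weights, group-wise similarity, weighted aggregation) can be sketched for one reference pixel. Since Equations (3) to (5) are not reproduced here, the forms used are assumptions modeled on common epipolar-attention formulations: $w_i = \mathrm{softmax}(v_i^{\top} p_r / (t_e\sqrt{C}))$ over the depth axis, $s_i^g = \frac{1}{G}\langle v_i^g, p_r^g\rangle$, and an aggregate $\sum_i w_i s_i / \sum_i w_i$. All function and variable names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def epipolar_aggregate(p_r, v, groups, t_e=2.0):
    """Aggregate per-view similarities for one reference pixel.

    p_r:    (C,) reference pixel feature (query).
    v:      (N_src, C, D) source features sampled along the epipolar line
            (keys), one (C, D) slab per source view.
    groups: number of channel groups G; C must be divisible by G.
    t_e:    temperature parameter.
    Returns a (G, D) aggregated cost for this pixel.
    """
    n_src, C, D = v.shape
    # Assumed form of Eq. (3): attention over the D depth hypotheses.
    logits = np.einsum('ncd,c->nd', v, p_r) / (t_e * np.sqrt(C))
    w = softmax(logits, axis=-1)                         # (N_src, D)
    # Assumed form of Eq. (4): group-wise inner product (values).
    v_g = v.reshape(n_src, groups, C // groups, D)
    p_g = p_r.reshape(groups, C // groups)
    s = np.einsum('ngkd,gk->ngd', v_g, p_g) / groups     # (N_src, G, D)
    # Assumed form of Eq. (5): attention-weighted mean over source views.
    num = (w[:, None, :] * s).sum(axis=0)                # (G, D)
    den = w.sum(axis=0)[None, :]                         # (1, D)
    return num / den
```

Note that the attention weights in this sketch come from a plain inner product and a softmax, so the aggregation itself adds no learnable parameters, in line with the claim made in the Introduction.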

Cross-Scale Cost Volume Information Exchange
Traditional multi-view stereo (MVS) algorithms often overlook the correlation between cost volumes at different scales, resulting in a lack of information transfer across layers [5]. To address this limitation, we introduce a cross-scale cost volume information exchange module, outlined in Figure 5. The module adopts part of the Cascade Iterative Depth Estimation and Refinement (CIDER) network [34], applying a lightweight regularization to coarsely regularize the stage-wise cost volume. The regularized volume is then separated and fused into the next layer. This process not only eliminates noise but also integrates information from small-scale cost volumes into the next layer's cost volume, thereby enhancing the quality of depth map estimation. Taking the $(n-1)$-th layer as an example, the generated cost volume $C^{n-1}_{agg} \in \mathbb{R}^{B\times C\times D\times H\times W}$ undergoes initial regularization to acquire sufficient contextual information, followed by an upsampling operation, resulting in $C^{n-1}_{agg} \in \mathbb{R}^{B\times \frac{C}{2}\times D'\times 2H\times 2W}$, where $D'$ represents the upsampled depth samples. This size is consistent with the cost volume $C^{n}_{agg}$ generated in the next stage. Fusing the two volumes yields the final cost volume $C^{n} \in \mathbb{R}^{B\times C\times D'\times 2H\times 2W}$ for that stage.


Dynamic Depth Range Sampling
An appropriate depth sampling range is crucial for comprehensive coverage of real depth values, playing a vital role in generating high-quality depth maps. Conventional methods typically focus on the distribution of individual pixels in the probability volume, adjusting the depth sampling range for the next stage based on this information. Zhang et al. [37] introduced a novel approach leveraging the information entropy of a probability volume to fuse feature volumes from different perspectives. Motivated by this, we propose an uncertainty module to adapt the depth sampling range. This module takes the information entropy of the probability volume from Stage 1 as input to assess the reliability of depth inferences. A higher output from the Uncertainty Module indicates greater uncertainty in the current pixel's depth estimation. Consequently, in Stage 0, the depth sampling range is expanded correspondingly to comprehensively cover true depth values, as illustrated in Figure 1. The module comprises five convolutional layers and activation functions, producing output values between 0 and 1. Higher values signify increased uncertainty. The uncertainty interval $D(x)$ for the pixel $x$ in the next stage is calculated using Equation (6).
where $\lambda$ is the hyperparameter defining the confidence interval, $E_{est}$ represents the entropy map of the probability volume, $U(\cdot)$ denotes the uncertainty module for the probability volume, and $D_{est}$ is the predicted depth value for the current pixel.
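The range adjustment can be sketched as follows, assuming the interval takes the symmetric form $D(x) = [D_{est}(x) - \lambda\,U(E_{est}(x)),\; D_{est}(x) + \lambda\,U(E_{est}(x))]$, which matches the symbol definitions above but is not confirmed by the text. The learned five-layer uncertainty module is replaced here by a stand-in (entropy normalized by its maximum $\log D$) purely so the example runs; the winner-take-all depth estimate and all names are likewise illustrative assumptions.

```python
import numpy as np

def entropy_map(prob, eps=1e-12):
    """Per-pixel entropy of a (D, H, W) probability volume."""
    return -(prob * np.log(prob + eps)).sum(axis=0)

def next_stage_range(prob, depth_values, lam, uncertainty_fn=None):
    """Return per-pixel (lower, upper) depth bounds for the next stage.

    prob:           (D, H, W) probability volume, summing to 1 over D.
    depth_values:   (D,) depth hypotheses of the current stage.
    lam:            hyperparameter scaling the interval half-width.
    uncertainty_fn: callable mapping the entropy map to values in [0, 1];
                    stands in for the learned uncertainty module U(.).
    """
    num_depths = prob.shape[0]
    e_est = entropy_map(prob)
    if uncertainty_fn is None:
        # Placeholder for U(.): normalized entropy, NOT the learned network.
        u = np.clip(e_est / np.log(num_depths), 0.0, 1.0)
    else:
        u = uncertainty_fn(e_est)
    # Assumed depth estimate: the most probable hypothesis per pixel.
    d_est = np.asarray(depth_values)[prob.argmax(axis=0)]
    half_width = lam * u
    return d_est - half_width, d_est + half_width
```

A confident (peaked) distribution yields a narrow interval around the estimate, whereas a flat distribution widens the interval toward $2\lambda$ so that the true depth is more likely to be covered.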

Cross-Entropy Based Learning Objective
Regularization operations yield a probability volume with dimensions $H \times W \times D$, storing the matching probabilities between pixels and different depth values. This paper departs from using the Smooth $L_1$ loss to minimize the disparity between predicted and actual values. Instead, it treats depth estimation as a classification problem over the sampled depth values. In Stages 0 and 2, the cross-entropy loss function is employed to quantify the difference between the true probability distribution $P(x)$ and the predicted probability distribution $\hat{P}(x)$ for each pixel $x$.
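A minimal sketch of this classification objective follows. It assumes that the true distribution $P(x)$ is one-hot at the depth hypothesis nearest to the ground-truth depth and that the loss is averaged over valid pixels; both choices, as well as the function name, are assumptions made for illustration.

```python
import numpy as np

def depth_classification_loss(prob, depth_values, gt_depth, mask, eps=1e-12):
    """Cross-entropy between a one-hot depth label and predicted probabilities.

    prob:         (D, H, W) predicted probability volume.
    depth_values: (D, H, W) per-pixel depth hypotheses.
    gt_depth:     (H, W) ground-truth depth map.
    mask:         (H, W) boolean map of pixels with valid ground truth.
    """
    # Index of the hypothesis closest to the ground truth (assumed labeling).
    gt_idx = np.abs(depth_values - gt_depth[None]).argmin(axis=0)   # (H, W)
    # Predicted probability assigned to that hypothesis.
    p_true = np.take_along_axis(prob, gt_idx[None], axis=0)[0]      # (H, W)
    ce = -np.log(p_true + eps)
    return ce[mask].mean()
```

With a one-hot target, the cross-entropy reduces to the negative log-probability of the correct depth bin, so a uniform prediction over $D$ hypotheses gives a loss of $\log D$.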

Uncertainty-Based Learning Objectives
In Stage 1, the paper dynamically adjusts the depth sampling range from Stage 0 based on the uncertainty of pixel distribution in the probability volume. Additionally, a negative log-likelihood minimization constraint is incorporated into the loss function of Stage 0 to jointly learn depth value classification and its uncertainty $U(\cdot)$. The loss function for the second stage is outlined in Equation (8).

Joint Learning Objective
The constants $\lambda_1$, $\lambda_2$ and $\lambda_3$, all belonging to the interval (0, 1), represent the weights assigned to the learning objectives of the three stages. The overarching goal of multi-stage joint learning is to minimize the overall loss function, defined as follows:
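Read together with the stage-wise objectives above, the joint objective is presumably a weighted sum of the three stage losses. The form below is a reconstruction under that assumption; the symbols $L_{total}$ and $L^{(k)}$ are introduced here for illustration and are not taken from the original.

```latex
L_{total} = \lambda_1 L^{(0)} + \lambda_2 L^{(1)} + \lambda_3 L^{(2)}
```

Here $L^{(0)}$ and $L^{(2)}$ would be the cross-entropy objectives of Stages 0 and 2, and $L^{(1)}$ the uncertainty-based objective of Equation (8).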

Experiments
In this section, we evaluate our proposed model on the DTU [38] and Tanks&Temples [39] datasets. We begin by providing a comprehensive overview of the two experimental datasets and detailing the specifics of our experimental setup (Sections 4.1 and 4.2). Subsequently, we present and analyze the model's performance on the experimental datasets (Section 4.3). Additionally, we conduct an ablation study on the DTU dataset (Section 4.4) to thoroughly validate the effectiveness of our proposed model.

• DTU dataset [38]: This dataset uses an adjustable industrial robot arm to capture 124 scenes in a laboratory setting. Each scene comprises object views from 49 or 64 different angles under seven distinct lighting conditions, with recorded intrinsic and extrinsic camera parameters. The dataset is partitioned into 79 training scenes, 18 validation scenes, and 22 test scenes. We adopt the same dataset partitioning as CasMVSNet [6].

• Tanks&Temples dataset [39]: The dataset encompasses 14 indoor and outdoor scenes with varying resolutions. Due to the absence of intrinsic camera parameters in this dataset, we employ OpenMVG [40] (open multiple view geometry) to compute and generate sparse point clouds. Evaluation of the reconstructed point clouds is conducted using an F1 score that combines precision and recall for a comprehensive assessment.

Implementation Details
Following experimental conventions [30,41], this paper trains and evaluates the proposed model on the DTU dataset. To verify the model's generalization ability, the model trained on DTU is directly tested on the Tanks&Temples dataset without any modifications. The depth sampling numbers $\{D_k\}_{k=0,1,2}$ at different stages are set to 16, 8, and 4, with the depth sampling range ($d_{min}$ and $d_{max}$) configured as 425 mm and 935 mm. The temperature parameter ($t_e$) in the epipolar cross-attention mechanism is set to 2. We train the model for 14 epochs. The Adam optimizer [42] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ is employed to optimize the model. The experiment is conducted on one NVIDIA RTX3090 GPU with a batch size of 2. The initial learning rate is set to 0.001 and is reduced by a factor of 2 after 8, 10, and 12 epochs. For DTU dataset training, the input image resolution is 1600 × 1200 with N (number of input images) set to 5. On the Tanks&Temples dataset, N is set to 7, and the input image resolution is 1080 × 2048.
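The step-wise learning rate schedule described above can be written as a small function. This is a sketch under the assumption that `epoch` counts completed epochs (so the first reduction takes effect once 8 epochs have finished); the function name and default arguments are illustrative.

```python
def learning_rate(epoch, base_lr=1e-3, milestones=(8, 10, 12), gamma=0.5):
    """Learning rate after `epoch` completed epochs.

    The rate starts at base_lr and is multiplied by gamma (halved) each
    time a milestone epoch has been reached.
    """
    num_decays = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** num_decays

# Schedule over the 14 training epochs.
schedule = [learning_rate(e) for e in range(14)]
```

Under this reading, the final two epochs run at one eighth of the initial rate.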

Benchmark Performance

Evaluation on DTU Dataset
In this section, we compare the performance of our model with traditional methods, learning-based methods, and methods reported in the recent literature.
To better analyze the differences between methods, our approach is compared with Gipuma [4], Effi-MVSNet [43], DA-PatchmatchNet [44], and CasMVSNet [6]. Among these, Gipuma builds on the disparity propagation strategy of traditional 3D reconstruction methods and proposes a diffusion-like propagation scheme that exploits the GPU's multi-core architecture for multi-view 3D reconstruction. Effi-MVSNet utilizes a GRU based on 2D convolution to generate cost volumes. DA-PatchmatchNet combines data augmentation with the traditional multi-scale PatchMatch algorithm. CasMVSNet adopts a cascaded approach to construct cost volumes, gradually refining the depth sampling range from coarse to fine, which keeps reconstruction efficient and allocates computational resources sensibly.
We use an input image resolution of 1600 × 1200 with the number of views set to N = 5. Employing the official evaluation metrics provided by the DTU dataset, we compute the accuracy error (Acc.), the completeness error (Comp.), and their average, termed the overall error (Overall), which measure the reconstruction errors between the generated point cloud and the ground truth. Smaller values of these three metrics indicate better reconstruction performance.
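As a small worked illustration of how the overall error and a relative improvement are derived from these metrics, consider the sketch below. The helper names and all numeric inputs are hypothetical placeholders chosen for the example; they are not results from the tables of this paper.

```python
def overall_error(acc_mm, comp_mm):
    """Overall error: the average of accuracy error and completeness error."""
    return 0.5 * (acc_mm + comp_mm)


def relative_reduction(baseline, ours):
    """Relative error reduction with respect to a baseline, in percent."""
    return 100.0 * (baseline - ours) / baseline


# Hypothetical values, for illustration only (not taken from Table 1).
baseline_overall = overall_error(acc_mm=0.40, comp_mm=0.60)
ours_overall = overall_error(acc_mm=0.40, comp_mm=0.40)
print(f"overall (baseline) = {baseline_overall:.3f}")
print(f"overall (ours)     = {ours_overall:.3f}")
print(f"relative reduction = {relative_reduction(baseline_overall, ours_overall):.2f}%")
```

Because all three metrics are errors, a positive relative reduction corresponds to an improvement.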

Evaluation on Tanks and Temples Dataset
To assess the generalization capability of our approach across diverse scenarios, the model trained on DTU is tested directly on the Tanks&Temples dataset without any adjustments and compared with traditional methods as well as learning-based methods. During testing, the number of input views is set to N = 7, and the input image size is 1080 × 2048. The reconstructed point cloud is evaluated using the F-score, where higher values indicate superior performance.
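For readers unfamiliar with the metric, the F-score of a point cloud is the harmonic mean of precision and recall at a distance threshold. The following is a minimal brute-force sketch of that definition; the function name, the toy point sets, and the threshold are illustrative assumptions, and the official benchmark toolkit (which also handles alignment and cropping) should be used for reported numbers.

```python
import math


def f_score(reconstruction, ground_truth, threshold):
    """F-score (in percent) between two point sets at a distance threshold."""

    def fraction_within(src, dst):
        # Fraction of points in src whose nearest neighbour in dst is closer
        # than the threshold (brute force, fine for tiny examples).
        hits = sum(
            1 for p in src if min(math.dist(p, q) for q in dst) < threshold
        )
        return hits / len(src)

    precision = fraction_within(reconstruction, ground_truth)
    recall = fraction_within(ground_truth, reconstruction)
    if precision + recall == 0:
        return 0.0
    return 100.0 * 2.0 * precision * recall / (precision + recall)


# Toy example: one reconstructed point is an outlier far from the surface.
rec = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.05)]
print(f_score(rec, gt, threshold=0.1))
```

In the toy example, precision is 2/3 (the outlier is penalized) while recall is 1, so the score reflects both accuracy and completeness.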
Table 2 presents the performance comparison with different methods. Our approach maintains strong results, achieving a mean F-score of 58.60 on the challenging intermediate set even with a lower depth sampling rate. This places our method among the leading approaches, with a gap of only 2.91 points from the third-ranked AA-RMVSNet [8]. Notably, compared to mainstream methods such as CasMVSNet [6] and Fast-MVSNet [27], our approach shows relative performance improvements of 3.86% and 23.65%, respectively. These results affirm the robust generalization capability of our model.
To comprehensively evaluate the proposed model, we also report memory consumption and runtime and compare them with those of AA-RMVSNet [8], CasMVSNet [6], and CIDER [34], as depicted in Figure 7.
Figure 7a shows the GPU memory usage of the different methods on the DTU test set. Our method consumes only 4.53 GB of GPU memory, significantly less than the other methods, while reducing the overall error to 0.315, the best value among the compared methods. Figure 7b shows the time required for depth map prediction by the different methods on the Tanks&Temples dataset. Our method takes just 1.53 s, significantly faster than CIDER, while achieving an F-score of 58.60.

Ablation Study
In this section, a series of ablation studies is conducted on the DTU dataset to validate the effectiveness of each component. The official point cloud reconstruction metrics provided by the DTU dataset are employed as experimental benchmarks [38], with the default input image size set to 864 × 1152. A controlled-variable methodology is used to isolate the impact of individual components on the overall network performance, so that all other components are evaluated under the same experimental conditions.

Number of Views
This experiment, conducted on the DTU dataset, evaluates the impact of different input view counts (N = 3, 4, 5, 6) on reconstruction outcomes. As the results in Table 3 show, increasing the number of views allows more feature information to be extracted, thereby improving reconstruction accuracy and completeness. However, adding views indiscriminately is not a prudent choice, as it not only consumes computational resources but may also introduce unnecessary interference that degrades the overall reconstruction quality. Determining the optimal view count therefore requires a careful balance between performance improvement and efficiency.

Table 5 shows the comparative results of the ablation experiment for the dynamic sampling module on the DTU dataset. The dynamic depth range sampling module extends the depth sampling range from 13.12 mm to 28.43 mm in stage 0. The coverage of real depth values also improves, increasing from 0.8468% to 0.8934%. Additionally, even at lower sampling rates, the model's overall reconstruction error decreases from 0.320 to 0.315. This indicates that measuring the uncertainty of sampling with the entropy of the cost volume, and subsequently adjusting the depth sampling range, allows for more accurate predictions along object edges. This approach takes into consideration the correlation between contextual information, the features of neighboring pixels, and the depth sampling range of the current pixel, resulting in enhanced precision.

In this experiment, we compare the cost volume construction method proposed in this paper with two types of aggregation used in learning-based multi-view stereo (MVS): (1) variance fusing [6,7,28] and (2) CNN-based fusing [8,29].
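For reference, the variance fusing baseline can be sketched as follows. This is a simplified illustration of the commonly used variance metric over warped feature volumes, not the aggregation proposed in this paper; volumes are flattened to plain lists, and the function name is an assumption.

```python
def variance_fuse(volumes):
    """Element-wise variance across N feature volumes.

    volumes: list of N equally long flat lists, one (warped) feature volume
    per view, with the reference view included.
    """
    n = len(volumes)
    fused = []
    for values in zip(*volumes):
        mean = sum(values) / n
        fused.append(sum((v - mean) ** 2 for v in values) / n)
    return fused


# Three views, two feature entries each. The first entry agrees across all
# views (low matching cost); the second disagrees (high matching cost).
cost = variance_fuse([[1.0, 2.0], [1.0, 4.0], [1.0, 6.0]])
print(cost)
```

Every view contributes equally to the variance, which is one reason such a metric can be sensitive to occluded or noisy views.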
Our method primarily establishes semantic correlations in 3D space through cross-attention, enhancing the aggregation of image features from a greater number of input views during cost volume construction. Additionally, the cross-scale cost volume communication module boosts information utilization, strengthening correlations among cost volumes at different scales. As demonstrated in Table 6, our approach achieves relative improvements of 12.35%, 16.01%, and 16.19% in accuracy error, completeness error, and overall error, respectively, compared to CNN aggregation. The reconstruction performance of our method significantly surpasses that of the other two approaches. As depicted in Figure 8, compared to the original CasMVSNet [6] (with 48, 32, and 8 depth samples), the cost volume aggregation module introduced in our approach helps mitigate the impact of errors, resulting in sharper and smoother edges in the depth map. Figure 8e shows that the depth map predicted by our model is more complete, demonstrating superior performance in handling low-texture regions and physical edges.
In Table 7, we conduct an ablation study on the two components employed in the cost volume aggregation process. The results show that both components improve the Overall metric. It is noteworthy that introducing the CVIE module leads to higher memory usage, primarily due to the additional space required to store the cross-scale cost volume. This addition also increases computation time. However, given the overall performance improvement, the extra cost is deemed worthwhile.

In this section, we experimentally compare our classification-based cross-entropy loss (CE loss) with the commonly used regression-based L1 loss [5,6] on the DTU dataset. The experimental results are presented in Table 8, where the depth error is calculated as the average absolute difference between the predicted depth and the ground truth. Lower error values indicate better performance. Replacing the depth regression approach with the multi-depth classification method reduces the depth error from 8.53 to 6.79, and consequently the overall reconstruction error is further reduced. This validates the effectiveness of the proposed module.

In summary, the excellent reconstruction performance of our method is primarily attributed to the appropriate number of input views, the cascaded U-Net module, the epipolar Transformer module, the dynamic depth sampling module, and the multi-stage joint learning approach.
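The difference between the two loss formulations compared in Table 8 can be illustrated for a single pixel. The sketch below is a simplified one-hot version written for this illustration (the paper's multi-depth value classification may differ in detail); the function names, depth hypotheses, and logits are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l1_regression_loss(logits, depths, gt_depth):
    """L1 loss on the expected depth (soft-argmax regression)."""
    probs = softmax(logits)
    expected = sum(p * d for p, d in zip(probs, depths))
    return abs(expected - gt_depth)


def ce_classification_loss(logits, depths, gt_depth):
    """Cross-entropy with the hypothesis nearest to the ground truth as class."""
    probs = softmax(logits)
    target = min(range(len(depths)), key=lambda j: abs(depths[j] - gt_depth))
    return -math.log(probs[target])


# Hypothetical bimodal prediction: two equally likely peaks at 500 and 800 mm.
depths = [500.0, 600.0, 700.0, 800.0]
logits = [2.0, 0.0, 0.0, 2.0]
print(l1_regression_loss(logits, depths, gt_depth=500.0))
print(ce_classification_loss(logits, depths, gt_depth=500.0))
```

With a bimodal distribution, the regressed expectation falls between the two peaks, far from either plausible depth, whereas the classification loss directly rewards probability mass placed on the correct hypothesis.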

Conclusions
This paper proposes an uncertainty-epipolar Transformer multi-view stereo network (U-ETMVSNet) for object stereo reconstruction. First, an enhanced cascaded U-Net is employed to strengthen both feature extraction and query construction within the epipolar Transformer. The epipolar Transformer, together with the cross-scale information exchange module, enhances the correlation of cross-dimensional information during cost volume aggregation, ensuring 3D consistency in depth space. The dynamic adjustment of the depth sampling range based on the uncertainty of the probability volume also enhances stability when reconstructing weakly textured regions, and the reconstruction performance remains excellent even at lower depth sampling rates. Finally, the multi-stage joint learning method based on multi-depth value classification further improves reconstruction accuracy. The proposed method exhibits excellent performance in terms of completeness, accuracy, and generalization ability on the DTU and Tanks&Temples datasets, comparable to existing mainstream CNN-based MVS networks. However, the algorithm retains the common 3D CNN regularization modules and therefore offers no significant advantage in memory usage. Future work will explore the role of Transformers in dense feature matching as a replacement for CNN regularization, making the model more practical to deploy on mobile devices.

Figure 1. The network architecture of our proposed model. Initially, the multi-scale 2D features are extracted using the cascaded U-Net module. Subsequently, various operations, including homography warping, cost volume aggregation, 3D regularization, and multi-depth classification, are employed to obtain depth estimations at different scales. In Stage 1, the uncertainty module dynamically adjusts the sampling range for the subsequent stage.

Figure 2. Enhanced cascade U-Net feature extraction module. The given reference image and adjacent source images are fed into the network to construct image features at different scales. In this cascaded U-Net network, the front-end feature encoder utilizes successive convolution and pooling operations, increasing the channel dimensions while reducing the size to extract deep features from the

Figure 3. Schematic diagram of depth sampling. When sampling points are evenly spaced in depth space, they may not map uniformly onto the epipolar lines of the source view. This leads to sampling errors and the loss of certain depth information. (In the figure, the depth space and the epipolar lines are drawn at the same scale.)
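The effect described in this caption can be reproduced numerically in the simplest possible setting. The sketch assumes a rectified pair (a source camera translated sideways, with identical intrinsics), in which case the offset along the horizontal epipolar line is focal length times baseline divided by depth. The focal length and baseline are hypothetical; only the 425 mm to 935 mm depth range is taken from the experimental settings.

```python
def epipolar_positions(depths_mm, focal_px, baseline_mm):
    """Pixel offset along the epipolar line for each depth hypothesis.

    Assumes a rectified pair: pure sideways translation by baseline_mm and
    identical intrinsics, so offset = focal_px * baseline_mm / depth.
    """
    return [focal_px * baseline_mm / d for d in depths_mm]


# Evenly spaced depth hypotheses over the DTU range used in the experiments.
d_min, d_max, num = 425.0, 935.0, 4
step = (d_max - d_min) / (num - 1)
depths = [d_min + k * step for k in range(num)]

# Hypothetical camera parameters, for illustration only.
positions = epipolar_positions(depths, focal_px=1000.0, baseline_mm=100.0)
gaps = [a - b for a, b in zip(positions, positions[1:])]
print([round(g, 1) for g in gaps])
```

The gaps between consecutive projections shrink as depth grows, so uniform steps in depth sample the epipolar line densely far from the camera and sparsely near it.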

Epipolar Transformer

Cost volume construction is the process of aggregating feature volumes from different source views to obtain depth information for individual pixels in the reference view. As conventional variance-based aggregation methods often struggle to filter out noise effectively, this paper employs an epipolar Transformer for aggregating feature volumes from different views. Specifically, the Transformer's cross-attention mechanism is used to build a 3D correlation along the epipolar line between the reference feature $p_r$ (Query) and the source features $\{p_{s_i,j}\}_{j=0}^{D-1}$ (Keys). The resulting cross-dimensional attention then guides the aggregation of feature volumes from different views, ultimately achieving cross-dimensional cost volume aggregation. The detailed structure of the module is illustrated in Figure 4.
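A minimal single-pixel, single-group sketch of this kind of cross-attention is given below. It assumes that attention weights are a softmax, over the D depth samples on the epipolar line, of the inner product between query and keys divided by the temperature $t_e$; the exact scaling and the group-wise formulation used in the paper may differ, and the function name and toy features are hypothetical.

```python
import math


def epipolar_attention(query, keys, temperature=2.0):
    """Attention weights of one reference feature over D epipolar samples.

    query: reference feature p_r, a list of C floats.
    keys:  D source features sampled along the epipolar line, each of length C.
    """
    scores = [
        sum(q * k for q, k in zip(query, key)) / temperature for key in keys
    ]
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy example with C = 2 channels and D = 3 depth samples.
p_r = [1.0, 0.0]
p_s = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = epipolar_attention(p_r, p_s, temperature=2.0)
print([round(w, 3) for w in weights])
```

The depth sample whose source feature best matches the reference feature receives the largest weight, which is how attention can down-weight mismatched or noisy samples during aggregation.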

where $\langle \cdot, \cdot \rangle$ represents the inner product. $s_i \in \mathbb{R}^{G \times D}$ are obtained by stacking s

Figure 5. Structure of the cross-scale cost volume information exchange module.

Figure 7. Comparison of the GPU memory usage and runtime of different methods. (a) GPU memory usage. (b) Runtime.

Figure 8. Qualitative comparison of depth map predictions by different methods. (a) Real image. (b) Real depth map. (c) Depth map predicted by CasMVSNet. (d) Depth map predicted by CasMVSNet + ET. (e) Depth map predicted by our algorithm.

Table 1. Experimental results of different methods on the DTU evaluation set (lower values are better). The best and second-best results are highlighted in bold and underlined, respectively.

Table 2. Quantitative results on the intermediate set of the Tanks&Temples dataset (F-score, higher is better). The reported F-score is the average across all scenes, and the best and second-best results are highlighted in bold and underlined, respectively.

Table 3. Effect of different view numbers on the experimental results.

The cascaded U-Net feature extraction module (CU-Net) excels in extracting more precise and comprehensive multi-scale 2D features. It not only emphasizes overall features but also focuses on effectively capturing local details, particularly in low-texture areas. This addition enhances sensitivity to local details, leading to superior feature information acquisition. As demonstrated by the experiments in Table 4, this module significantly improves the model's performance.

Table 4. Effect of the CU-Net module on the experimental results.

Table 5. Quantitative comparison of ablation experiments on the dynamic sampling module on the DTU test set (this experiment mainly analyzes stage 0).
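As a rough illustration of the idea behind the dynamic sampling module evaluated in Table 5, the sketch below measures per-pixel uncertainty as the normalized entropy of the depth probability distribution and widens the next-stage sampling range accordingly. The linear widening rule, the function names, and all numeric values are hypothetical and are not the formula used in this paper.

```python
import math


def normalized_entropy(probs):
    """Entropy of a depth probability distribution, scaled to [0, 1]."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))


def next_stage_range(depth_estimate, probs, min_half_range, max_half_range):
    """Sampling interval for the next stage, centred on the current estimate.

    Hypothetical rule: interpolate the half-width linearly between a minimum
    (confident pixel) and a maximum (maximally uncertain pixel).
    """
    u = normalized_entropy(probs)
    half = min_half_range + u * (max_half_range - min_half_range)
    return depth_estimate - half, depth_estimate + half


# A confident pixel keeps a narrow range; an uncertain one gets a wide range.
confident = next_stage_range(600.0, [1.0, 0.0, 0.0, 0.0], 10.0, 30.0)
uncertain = next_stage_range(600.0, [0.25, 0.25, 0.25, 0.25], 10.0, 30.0)
print(confident, uncertain)
```

Pixels with a peaked distribution are refined within a tight interval, while ambiguous pixels, such as those near object edges, retain a wider interval so that the true depth is less likely to fall outside the sampled range.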

Table 6. Quantitative results of different aggregation methods.

Table 7. Ablation study on the components used in the cost volume aggregation process on the DTU dataset.

Table 8. Ablation study of the L1 loss and the cross-entropy loss on the DTU dataset.