Multi-View Stereo Network Based on Attention Mechanism and Neural Volume Rendering

Abstract: Due to the presence of regions with weak textures or non-Lambertian surfaces, feature matching in learning-based Multi-View Stereo (MVS) algorithms often leads to incorrect matches, resulting in a flawed cost volume and incomplete scene reconstruction. In response to this limitation, this paper introduces an MVS network based on attention mechanism and neural volume rendering. Firstly, we employ a multi-scale feature extraction module based on dilated convolution and attention mechanism. This module enables the network to accurately model inter-pixel dependencies, focusing on crucial information for robust feature matching. Secondly, to mitigate the impact of the flawed cost volume, we establish a neural volume rendering network based on multi-view semantic features and neural encoding volume. By introducing the rendering reference view loss, we infer 3D geometric scenes, enabling the network to learn scene geometry information beyond the cost volume representation. Additionally, we apply the depth consistency loss to maintain geometric consistency across networks. The experimental results indicate that on the DTU dataset, compared to the CasMVSNet method, the Completeness of reconstructions improved by 23.1%, and the Overall metric improved by 7.3%. On the intermediate subset of the Tanks and Temples dataset, the average F-score for reconstructions is 58.00, which outperforms other networks, demonstrating superior reconstruction performance and strong generalization capability.


Introduction
With the rapid development of computer vision technology, multi-view stereo (MVS) has become a highly prominent field of interest. Research in MVS aims to reconstruct three-dimensional information of a scene from multiple perspective images with known camera parameters, playing a crucial role in various domains such as virtual reality, augmented reality, and visual effects in the film industry.
In existing MVS methods, traditional methods based on geometric context [1][2][3][4][5] have achieved good reconstruction results in texture-rich areas, especially in terms of accuracy. However, challenges persist in areas with low texture, image occlusions, variations in radiance, or non-Lambertian surfaces. To address this issue, some researchers [6] have employed deep learning techniques, utilizing a Convolutional Neural Network (CNN) to extract image features. They perform robust feature matching within the field of view of the reference camera to construct a cost volume representing the geometric information of the scene. Subsequently, they employ a 3D U-Net for regularization to regress depth maps. Finally, the scene's three-dimensional information is reconstructed through depth map fusion. While this approach enhances the overall quality of the reconstructed scenes, it struggles in areas with low texture or non-Lambertian surfaces, where features at the same 3D position exhibit significant differences between different views. Incorrect feature matching leads the network to construct a flawed cost volume, resulting in poor completeness in the final reconstruction. This is because traditional CNNs have fixed receptive field sizes, which limit feature extraction networks to capturing only local features and hinder the perception of global contextual information. The lack of global contextual information often causes the network to exhibit local ambiguities in challenging regions, thus reducing matching robustness. Recent studies have employed self-attention mechanisms [7,8] to capture crucial information for cost volume computation by considering context similarity and spatial proximity. This has improved matching robustness and enhanced the ability of the cost volume to represent scene geometry information. However, there remains significant potential for enhancing the reconstruction quality, especially in challenging areas.
Electronics 2023, 12, 4603
Recently, the Neural Radiance Field (NeRF) [9] rendering technique has made significant advancements in the fields of computer vision and computer graphics. NeRF models view-dependent photometric effects using differentiable volume rendering, enabling it to reconstruct implicit 3D geometric scenes. Additionally, it learns volume density, which can be interpreted as depth, allowing it to explicitly represent the reconstructed scene geometry by indirectly rendering depth. Subsequent works [10][11][12][13][14][15] have focused on accelerating its rendering speed and on implicitly learning the 3D scene's geometry with strong generalization capability by taking a few views as input and combining them with the MVS network to synthesize higher-quality novel views or more accurate depth maps. However, these efforts have primarily advanced the development of the Neural Radiance Field while overlooking the quality of point cloud reconstruction by the MVS network. Therefore, our method leverages the precise neural volume rendering of the Neural Radiance Field to build 3D geometric information about the scene. This approach enables the rendering of depth, even in challenging areas with low texture or non-Lambertian surfaces, allowing the MVS network to learn rich scene geometry information beyond the cost volume that represents scene geometry. This overcomes issues arising from rough depth maps due to incorrect matching in the network, ultimately enhancing the quality of the reconstructed point cloud.
To this end, we propose an end-to-end MVS network based on attention mechanism and neural volume rendering. By combining dilated convolution and attention mechanism during feature extraction, we extract rich feature information. This allows the network to achieve reliable feature matching in challenging regions. Leveraging the capacity of neural volume rendering to resolve scene geometry information, our approach mitigates the impact of the flawed cost volume arising from incorrect feature matching. Our method exhibits high completeness in reconstructing point clouds on the competitive DTU dataset of indoor objects and demonstrates robust performance on the Tanks and Temples dataset of outdoor scenes. It outperforms many learning-based MVS networks, thus advancing 3D reconstruction based on MVS networks in domains such as virtual reality, augmented reality, and autonomous driving.
In summary, our primary contributions can be outlined as follows:
• We introduce a multi-scale feature extraction module based on triple dilated convolution and attention mechanism. This module increases the receptive field without adding model parameters, capturing dependencies between features to acquire global context information and enhance the representation of features in challenging regions;
• We establish a neural volume rendering network using multi-view semantic features and neural encoding volume. The network is iteratively optimized through the rendering reference view loss, enabling the precise decoding of the geometric appearance information represented by the radiance field. We introduce the depth consistency loss to maintain geometric consistency between the MVS network and the neural volume rendering network, mitigating the impact of the flawed cost volume;
• Our approach demonstrates state-of-the-art results on the DTU dataset and the Tanks and Temples dataset.
The remainder of this paper is structured as follows. In Section 2, we present an overview of related work on learning-based MVS networks and neural volume rendering. Subsequently, in Section 3, we describe the components of our proposed MVS network based on attention mechanism and neural volume rendering. Section 4 reports an extensive set of experimental results on the DTU dataset and the Tanks and Temples dataset, supplemented by ablation experiments to validate the effectiveness of the proposed modules. Finally, in Section 5, we offer the conclusion of the article.

Learning-Based Multi-View Stereo
In light of the flourishing progress in deep learning technologies, many researchers have harnessed CNNs to tackle MVS tasks. As a representative work, MVSNet [6] established a deep learning-based MVS pipeline. This pipeline generates a 3D cost volume by integrating features from various perspectives through differentiable homography transformations. Subsequently, 3D CNNs are employed to refine the cost volume to perform depth regression. Nonetheless, MVSNet consumes a substantial amount of memory, prompting subsequent efforts to seek more lightweight approaches. The study [16] employed a recurrent architecture, which regularizes two-dimensional feature maps sequentially along the depth direction using Gated Recurrent Units (GRUs). This approach avoids the memory consumption associated with regularizing the entire cost volume at once, enabling high-resolution reconstructions. Another approach [17] estimates and refines depth maps in a coarse-to-fine manner. Initially, it predicts low-resolution depth maps with a large depth interval. As the depth range decreases, the algorithm iteratively increases the depth map resolution. This algorithm effectively reduces memory consumption caused by excessively large cost volumes. However, due to the limitations of CNNs in capturing feature information in challenging regions, such as areas with weak textures and non-Lambertian surfaces, subsequent efforts have introduced attention mechanisms into the MVS network to enhance feature representations of images. Works like [7] have incorporated a self-attention mechanism at the feature extraction stage, enabling the network to focus more on crucial information and capture interdependencies between pixels. Nevertheless, there remains significant room for improvement in point cloud reconstruction, particularly in challenging areas. Due to the inherent advantage of the self-attention mechanism in Transformer models [18] in capturing global contextual information, subsequent works [19][20][21] have introduced it into MVS, enabling a comprehensive understanding of the global context within the MVS model to extract rich information from the environment. However, this often leads to increased computational time and memory consumption, especially in the reconstruction of high-resolution and large-scale scenes, incurring substantial computational costs.

Neural Volume Rendering Based on Multi-View Stereo
The Neural Radiance Field (NeRF) [9] represents scenes as continuous implicit functions of position and direction for high-quality view synthesis, achieving photorealistic rendering results at a pixel level. Subsequent works [15,22] have extended NeRF using MVS to support various other neural rendering tasks. MVSNeRF [15] utilizes the cost volume constructed by MVS for geometry-aware scene inference, combining it with neural volume rendering for radiance field reconstruction, enabling high-quality view synthesis even with a limited number of images. RC-MVSNet [22], on the other hand, leverages a strongly generalized cost volume derived from MVS, combining it with neural volume rendering to reconstruct implicit scenes. It introduces a neural volume rendering-based reference view synthesis loss to optimize implicit scene information, alleviating the photometric ambiguity on non-Lambertian reflecting surfaces encountered by unsupervised MVS networks. Our method leverages a strongly generalized cost volume and incorporates crucial 2D feature information from multiple views for neural volume rendering. In an end-to-end learning manner, it precisely conducts geometric inference for scene perception, mitigating the impact of flawed cost volumes that the MVS network constructs due to incorrect matches in challenging regions.

Methods
In this section, we describe the overall architecture of the proposed method, as illustrated in Figure 1. This architecture primarily comprises the MVS network and the neural volume rendering network. Specifically, in the feature extraction stage, we introduce the attention-aware feature extraction module. This module combines dilated convolutions with attention mechanisms to enhance multi-level feature-capturing capabilities, thereby extracting more comprehensive feature information. The MVS network progressively constructs a probability volume in a coarse-to-fine manner to estimate the depth maps and confidence maps. Subsequently, we design a novel neural volume rendering network.
The multi-layer perceptron (MLP) network uses multi-view 2D feature vectors along with the 3D neural encoding volume containing geometry-aware information as the mapping condition. Additionally, we adopt a uniform sampling strategy guided by depth maps and confidence maps to focus the scene sampling on the estimated depth surface region. Finally, we apply the rendering reference view loss $L_{RRV}$ to precisely resolve the geometric shape of the scene from the radiance field. We also introduce the depth consistency loss $L_{DC}$ to ensure geometric consistency between the MVS network and the neural volume rendering network. It is noteworthy that the proposed network architecture functions as a universal framework for training the MVS network, making it applicable to any learning-based MVS network. The two networks provide mutual supervision and are simultaneously optimized.

Attention-Aware Feature Extraction Module
We propose the attention-aware feature extraction module. This module resembles a 2D U-Net, with basic units that comprise an encoder and a decoder joined by skip connections. The encoder forms a residual network composed of dilated convolutional layers and an attention module, as depicted in Figure 2. In the encoder section, the output features from each layer are initially subsampled using a 3 × 3 convolutional layer with a stride of 2. Subsequently, dilated convolutional layers with 3 × 3 kernels are employed to expand the receptive field of the input features, facilitating the exploration of deep-level fine-grained features. To address potential information correlation issues associated with the use of triple dilated convolution, we adopt a strategy similar to that of [23], where feature maps are passed through a residual network structure with sigmoid functions after undergoing dilated convolutional layers with different dilation rates. To create the final feature map, the three fine-grained features are combined and run through a convolutional layer with an attention module. A convolutional layer with a 3 × 3 kernel and a deconvolutional layer with a stride of 2 make up the decoder. When provided with a reference image $I_1$ and source images $\{I_i\}_{i=2}^{N}$ at a resolution of H × W captured from different viewpoints, the attention-aware feature extraction module outputs features at three different scales, one for each stage k.
Figure 3 provides a visual representation of the attention module's architectural design. The features, which have undergone triple dilated convolution, are input into two convolutional layers with 3 × 3 kernels. Each of these layers is followed by Group Normalization (GN) and a ReLU activation function. Subsequently, we incorporate a LayerScale-based local attention layer [24]. The operational details of this local attention layer are shown in Figure 4, illustrating the mapping of queries and a collection of key-value pairs to generate an output, with pixel outputs computed via the Softmax operation.
In the local attention layer, $q_{ij} = W_q x_{ij}$, $k_{ab} = W_k x_{ab}$, and $v_{ab} = W_v x_{ab}$ represent the queries, keys, and values, respectively, with the matrices $W_n$ ($n = q, k, v$) composed of learnable parameters. Here, $R$ denotes a 3 × 3 local region of the input. To address the issue of permutation equivariance resulting from the lack of encoded positional information, we introduce relative positional embeddings by incorporating learnable parameters into the keys, as described in [25]. The relative distance vector $r_{ab}$ is partitioned along the dimensions, with half of the dimension of the output channel allocated for encoding the row direction and the remaining half for encoding the column direction. Furthermore, the features $x_{att}$ produced by the attention layer are multiplied by a learned diagonal weight matrix within the network, $\mathrm{diag}(s_1, \ldots, s_n)\, x_{att}$, where $s_1$ to $s_n$ are learnable weights.
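The local attention layer and the LayerScale step described above can be sketched for a single pixel as follows. This is a minimal, dependency-free sketch and not the authors' implementation: it assumes the attention logit for each neighbor is the sum of a content term $q_{ij} \cdot k_{ab}$ and a position term $q_{ij} \cdot r_{ab}$, normalized with Softmax over the 3 × 3 region. The function names, the toy two-channel features, and the identity projection matrices are illustrative assumptions.

```python
import math

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def local_attention(x_center, neighbors, rel_embed, Wq, Wk, Wv, scale):
    """Attention output for one pixel over its local region R.

    neighbors: feature vectors x_ab inside the 3x3 region.
    rel_embed: one relative positional embedding r_ab per neighbor.
    scale:     LayerScale weights s_1..s_n (entries of the diagonal matrix).
    """
    q = matvec(Wq, x_center)
    keys = [matvec(Wk, x) for x in neighbors]
    values = [matvec(Wv, x) for x in neighbors]
    # Assumed logit: content term q.k_ab plus position term q.r_ab.
    logits = [dot(q, k) + dot(q, r) for k, r in zip(keys, rel_embed)]
    weights = softmax(logits)
    x_att = [sum(w * v[c] for w, v in zip(weights, values))
             for c in range(len(values[0]))]
    # LayerScale: multiplying by diag(s_1, ..., s_n) is a per-channel scaling.
    return [s * a for s, a in zip(scale, x_att)]

# Toy example (hypothetical values): 9 neighbors, 2 channels, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
neighbors = [[float(i), 1.0] for i in range(9)]
rel_embed = [[0.0, 0.0] for _ in range(9)]  # zero position term for the demo
out = local_attention([0.0, 1.0], neighbors, rel_embed,
                      identity, identity, identity, scale=[0.5, 0.5])
```

In this toy setup every neighbor receives the same logit, so the attention reduces to an average over the region before the per-channel scaling is applied.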

Cost Volume Construction
Subsequently, we perform adaptive depth hypothesis sampling using J layers of depth hypothesis planes $\{D_j\}_{j=1}^{J}$. Based on these hypotheses, we construct feature volumes $\{V_i\}_{i=1}^{N}$ by differentiably warping the 2D features of the source views to the reference view. Under the depth plane hypothesis $d$, a pixel $p$ in the reference view is warped to its corresponding pixel $p_i$ in the i-th source view using $K_i$ and $K$, the intrinsic matrices of the i-th source camera and the reference camera, respectively, together with $R_i$ and $t_i$, the rotation and translation between the two views. Subsequently, we consolidate the multiple feature volumes $\{V_i\}_{i=1}^{N}$ into a 3D cost volume $V$ using the variance-based aggregation strategy. The cost volume is then regularized into a depth probability volume using a 3D U-Net. We determine the probability $P_j(p)$ on a specified depth plane $D_j(p)$ for the pixel $p$ in the reference view. Following this, we calculate the estimated depth value $D(p)$ for the pixel $p$ from these probabilities and depth planes.
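The three steps of this subsection can be illustrated with a small sketch. It is written under stated assumptions rather than taken from the paper: the warp is assumed to follow the usual plane-sweep form $p_i \sim K_i (R_i (K^{-1} p\, d) + t_i)$, the aggregation is the per-voxel variance across views, and the depth is assumed to be the expectation of the depth planes under the probability volume. All function names and the toy camera values are hypothetical.

```python
def matvec(M, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def warp_pixel(p, d, K_ref_inv, K_src, R, t):
    """Warp reference pixel p = (u, v) at depth d into a source view.

    Assumed form: p_i ~ K_i (R_i (K^-1 p d) + t_i), followed by
    division by the homogeneous coordinate.
    """
    ray = matvec(K_ref_inv, [p[0], p[1], 1.0])   # back-project the pixel
    point = [d * c for c in ray]                 # 3D point in the reference frame
    moved = [a + b for a, b in zip(matvec(R, point), t)]
    u, v, w = matvec(K_src, moved)
    return (u / w, v / w)

def variance_cost(features):
    """Variance of one feature channel across the N views at one voxel."""
    mean = sum(features) / len(features)
    return sum((f - mean) ** 2 for f in features) / len(features)

def expected_depth(probs, depths):
    """Expectation of the depth planes D_j(p) under the probabilities P_j(p)."""
    return sum(pj * dj for pj, dj in zip(probs, depths))

# Toy cameras (hypothetical): shared intrinsics, source camera shifted along x.
f, cx, cy = 100.0, 50.0, 50.0
K = [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]
K_inv = [[1 / f, 0.0, -cx / f], [0.0, 1 / f, -cy / f], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.1, 0.0, 0.0]

p_src = warp_pixel((50.0, 50.0), 2.0, K_inv, K, R, t)
cost = variance_cost([1.0, 1.0, 3.0, 3.0])
depth = expected_depth([0.1, 0.8, 0.1], [1.0, 2.0, 3.0])
```

A low variance means the views agree on the feature at that depth hypothesis, which is why the variance serves as a matching cost independent of the number of views.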

Neural Volume Rendering Network
To further alleviate the issue of incorrect feature matching in MVS caused by significant differences between adjacent views in the features at the same 3D location, we introduce a neural volume rendering network. This network is trained in a self-supervised manner to learn the scene geometry, providing the MVS network with rich scene geometry information. This addition helps mitigate the impact of the flawed cost volume generated by incorrect matching in the MVS network.

Scene Representation Based on Multi-View Features and Neural Encoding Volume
Our network extracts latent 2D feature vectors from the encoded contextual information of the source views. These multi-view 2D features provide additional semantic information about the scene, addressing 3D geometric ambiguity and enhancing the network's ability to handle occlusion. Inspired by PixelNeRF [13], we project 3D points from arbitrary space into the input multi-view images. Each of the N views $\{I_i\}_{i=1}^{N}$ has a corresponding extrinsic matrix $T_i = [R_i \mid t_i]$ relative to the target reference image and an intrinsic matrix $K_i$. To acquire the color $c$ and volume density $\sigma$ of a 3D point, we begin by converting its 3D position $x$ and its reference view direction $g$ into the coordinate system of each view, leading to the 3D point $x_i = R_i x + t_i$. Subsequently, we project this point onto the corresponding image and feature maps and employ bilinear interpolation to sample its pixel $p_i$ and feature vector $f_i$. We denote by $g_i$ the projection of the reference view direction of the 3D point $x$ onto the respective observation directions of the multi-view images.
We utilize a weighted pooling operator, denoted as $\psi$, to aggregate the multi-view feature vectors, as illustrated in Figure 5. Initially, we combine the feature vector $f_i$ with the pixel information $p_i$ to create a 2D feature vector. Subsequently, we compute the mean $\mu$ and variance $\nu$ of the 2D feature vectors to capture both local and global information. Then, the 2D feature vector is concatenated with $\mu$ and $\nu$ and fed into our specially designed lightweight MLP, which extracts the multi-view perceptual features and the pooling weights $w_i$. Finally, by applying a Softmax operation to the weight vector, we perform a weighted pooling operation on the multi-view perceptual features, resulting in the final feature vector $f_{img}$.
Subsequently, following the same approach as in RC-MVSNet [22], we perform trilinear interpolation on the 3D neural encoding volume constructed using MVS, resulting in a voxel-aligned 3D feature denoted as $f_{voxel}$. We then pass the weighted pooled final feature vector $f_{img}$ and the 3D voxel feature $f_{voxel}$ through an MLP network to obtain the RGB color $c$ and volume density $\sigma$ at 3D sampling points in the reference view direction.
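The weighted pooling operator can be sketched as below. This is an illustrative sketch only: the lightweight MLP is replaced by a hypothetical placeholder function, and each input vector is assumed to already hold the feature vector concatenated with the sampled pixel values.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_pool(view_vectors, mlp):
    """Aggregate per-view vectors into one pooled feature vector.

    view_vectors: one vector per view (feature and pixel values concatenated).
    mlp: stand-in for the lightweight MLP; maps the vector concatenated with
         the cross-view mean and variance to (perceptual_feature, weight_logit).
    """
    n = len(view_vectors)
    dim = len(view_vectors[0])
    mean = [sum(v[c] for v in view_vectors) / n for c in range(dim)]
    var = [sum((v[c] - mean[c]) ** 2 for v in view_vectors) / n
           for c in range(dim)]
    outputs = [mlp(v + mean + var) for v in view_vectors]
    weights = softmax([logit for _, logit in outputs])
    out_dim = len(outputs[0][0])
    return [sum(w * feat[c] for w, (feat, _) in zip(weights, outputs))
            for c in range(out_dim)]

def toy_mlp(z):
    """Hypothetical placeholder: pass two entries through, constant logit."""
    return z[:2], 0.0

f_img = weighted_pool([[1.0, 2.0], [3.0, 4.0]], toy_mlp)
```

Because the placeholder returns the same logit for every view, the pooling here degenerates to a plain average; a trained MLP would instead down-weight views that are occluded or inconsistent.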

Confidence and Depth-Guided Sampling for Volume Rendering
In the reference view $I_1$, each pixel $p$ corresponds to a ray defined in the world coordinate system. The 3D point on this ray at a distance $e$ from the origin $o$ can be represented as $r_p = o + e\,g$. To render the color $I_1(p)$ at the pixel $p$, the original NeRF uniformly samples the ray at $M$ discrete distances $e_m$ between the near and far planes $[e_n, e_f]$, and the radiance field $\phi$ is then queried at each sampled 3D point. Because this sampling is uniform over the whole range, the points may not be concentrated on the surface of the object, leading to a decrease in the quality of the rendered reference view. Therefore, for pixel $p$, we propose to sample candidate points under the guidance of the prior range defined by the depth estimation value $D(p)$ and its confidence from the MVS network.
We define the standard deviation $\hat{S}(p)$ as the degree of confidence for pixel $p$ with depth estimation value $D(p)$. The potential location of the object surface corresponding to each pixel should be confined within the interval defined by the depth estimation value $D(p)$ and the standard deviation $\hat{S}(p)$, represented as $\hat{U}(p)$. The confidence and depth range $\hat{U}(p)$ contain valuable signals to guide sampling along rays. Thus, for rendering the color of a 3D point $x$ in the geometric scene, we replace the coarse network used for hierarchical sampling in the original NeRF. We distribute half of the sampled points between the near plane $e_n$ and the far plane $e_f$. The second half of the sampled points are drawn within the range of the confidence and depth prior $\hat{U}(p)$. This ensures both the network's generalization capability and model convergence. Figure 6 presents a comparison between the two sampling methods.
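The sampling strategy can be sketched as follows. Two points are assumptions made for illustration and are not confirmed by the text: the standard deviation is computed from the depth hypotheses under the probability distribution, and the prior range is taken as one standard deviation on either side of the estimated depth. The function names and the toy values are hypothetical.

```python
import math

def depth_mean_std(probs, depths):
    """Mean and standard deviation of depth planes under their probabilities."""
    mean = sum(p * d for p, d in zip(probs, depths))
    var = sum(p * (d - mean) ** 2 for p, d in zip(probs, depths))
    return mean, math.sqrt(var)

def linspace(lo, hi, n):
    """n evenly spaced values from lo to hi inclusive."""
    if n == 1:
        return [(lo + hi) / 2.0]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]

def guided_samples(near, far, depth, std, num_samples):
    """Half of the samples span [near, far]; half span the prior range."""
    half = num_samples // 2
    coarse = linspace(near, far, half)
    # Assumed prior range: [depth - std, depth + std], clipped to [near, far].
    lo = max(near, depth - std)
    hi = min(far, depth + std)
    fine = linspace(lo, hi, num_samples - half)
    return sorted(coarse + fine)

depth, std = depth_mean_std([0.25, 0.5, 0.25], [1.0, 2.0, 3.0])
samples = guided_samples(0.5, 4.0, depth, std, 8)
```

Keeping half of the samples over the full range preserves coverage when the MVS depth estimate is wrong, while the other half densifies the region where the surface is expected.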
Next, we render the predicted colors and volume density values $\{(c_m, \sigma_m)\}_{m=1}^{M}$ of the sampling points into the predicted reference pixel by accumulating them along the ray, where $E_m$ represents the cumulative transmittance along the ray up to $e_m$, and $\delta_m = e_{m+1} - e_m$ is the distance between adjacent samples. Our objective is to precisely deduce the depth value corresponding to the reference view from the radiance field. Therefore, we obtain the depth value for pixel $p$ by performing a density integral along the rays in the direction of the reference view.
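The rendering of color and depth along one ray can be sketched as below. The sketch assumes the standard NeRF quadrature, in which each sample contributes with weight $E_m (1 - \exp(-\sigma_m \delta_m))$ and $E_m = \exp(-\sum_{k<m} \sigma_k \delta_k)$; the function name and the toy ray are hypothetical.

```python
import math

def render_ray(colors, sigmas, distances):
    """Composite color and depth along one ray.

    Assumes standard NeRF quadrature. The last sample has no following
    interval, so it only closes the final delta and adds no weight.
    """
    transmittance = 1.0          # E_m: fraction of light not yet absorbed
    color = [0.0, 0.0, 0.0]
    depth = 0.0
    for m in range(len(distances) - 1):
        delta = distances[m + 1] - distances[m]
        alpha = 1.0 - math.exp(-sigmas[m] * delta)
        weight = transmittance * alpha
        color = [c + weight * cm for c, cm in zip(color, colors[m])]
        depth += weight * distances[m]
        transmittance *= 1.0 - alpha
    return color, depth

# Toy ray (hypothetical): empty space, then an opaque red surface at distance 2.
colors = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sigmas = [0.0, 1000.0, 0.0]
distances = [1.0, 2.0, 3.0]
pixel, rendered_depth = render_ray(colors, sigmas, distances)
```

The same weights are reused for color and depth, which is what lets the rendered depth be supervised alongside the rendered reference view.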
Figure 6. Comparison of two sampling methods. In contrast to the uniform sampling employed in the original NeRF, the sampling method within the confidence and depth prior range concentrates the samples more on the surface of the object.

Loss Function
Within the neural volume rendering network, following the methodology established in the original NeRF, we introduce the rendering reference view loss. This loss function utilizes the mean squared error to quantify the disparity between the color obtained by volume rendering along rays from the reference view and the color of the corresponding ground truth reference view. By optimizing the pixel values of the rendered reference view $I_1(p)$, we enhance the implicit geometric representation capability of the 3D scene. To ensure geometric consistency between the two networks, we propose the depth consistency loss. This loss function employs the $L_1$ loss to minimize the difference between the rendered depth and the estimated depth from the MVS network, while also minimizing the difference between the rendered depth and the ground truth depth.
Within the MVS network, we utilize the $L_1$ loss as the training loss, quantifying the divergence between the ground truth depth and the estimated depth.
Finally, the overall training loss for the end-to-end network combines the rendering reference view loss, the depth consistency loss, and the MVS loss, each scaled by its own weight.
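The combination of the three losses can be sketched as a weighted sum. The default weights below are the values reported in the training details of this paper; how the two inner depth consistency weights map onto the two depth terms is an assumption made for illustration, as are the function names and toy inputs.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def l1(a, b):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def total_loss(rendered_rgb, gt_rgb, rendered_depth, mvs_depth, gt_depth,
               w_rrv=1.0, w_dc=0.01, w_mvs=1.0, w_dc1=0.8, w_dc2=0.2):
    """Weighted sum of the three training losses (assumed form)."""
    loss_rrv = mse(rendered_rgb, gt_rgb)
    # Assumption: w_dc1 weights rendered-vs-MVS depth and w_dc2 weights
    # rendered-vs-ground-truth depth; the paper text does not state the mapping.
    loss_dc = (w_dc1 * l1(rendered_depth, mvs_depth)
               + w_dc2 * l1(rendered_depth, gt_depth))
    loss_mvs = l1(mvs_depth, gt_depth)
    return w_rrv * loss_rrv + w_dc * loss_dc + w_mvs * loss_mvs

loss = total_loss(rendered_rgb=[0.5], gt_rgb=[0.0],
                  rendered_depth=[2.0], mvs_depth=[1.0], gt_depth=[1.5])
```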

Experiments
We comprehensively present the performance of our proposed method through a series of experiments. Additionally, we perform ablation experiments to validate the efficacy of our proposed attention-aware feature extraction module, loss functions, and the confidence and depth-guided sampling strategy.

Datasets
We conducted model training and evaluation using the DTU dataset [26] and the Tanks and Temples dataset [27]. The DTU dataset comprises 124 scenes captured from 49 distinct viewpoints, covering a range of 7 diverse lighting conditions, and collected using a robotic arm in indoor environments. We assess the reconstructed point cloud using three measurement criteria: Accuracy, Completeness, and Overall.
Accuracy represents the average distance between the reconstructed point cloud and the ground truth point cloud, calculated by Formula (20). Completeness indicates how much of the surface in the ground truth point cloud is captured by the reconstructed point cloud within the same world coordinates, computed using Formula (22). Overall is the average of Accuracy and Completeness, calculated as per Formula (23).
Here, RE denotes the reconstructed point cloud, GR represents the ground truth point cloud, and $dis_{re \to gr}$ signifies the shortest distance from a point in the reconstructed point cloud to the ground truth point cloud.
$$\mathrm{Comp.} = \frac{100}{|GR|} \sum_{gr \in GR} dis_{gr \to re} \qquad (22)$$
where $dis_{gr \to re}$ represents the shortest distance from a point in the ground truth point cloud to the reconstructed point cloud.
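The three DTU criteria can be illustrated with a brute-force sketch. It assumes Accuracy mirrors the Completeness formula with the roles of the two point clouds swapped, omits the constant scale factor, and uses a hypothetical helper name and toy point clouds; practical evaluations use spatial indexing rather than an exhaustive search.

```python
import math

def mean_nearest_distance(source, target):
    """Average, over source points, of the distance to the closest target point."""
    return sum(min(math.dist(p, q) for q in target) for p in source) / len(source)

# Toy point clouds (hypothetical): the reconstruction misses one surface point.
reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

accuracy = mean_nearest_distance(reconstructed, ground_truth)
completeness = mean_nearest_distance(ground_truth, reconstructed)
overall = (accuracy + completeness) / 2.0
```

In this example every reconstructed point lies on the ground truth, so the accuracy term is zero, while the missing point is penalized only by the completeness term.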
The Tanks and Temples dataset, on the other hand, captures complex real-world scenes, with 8 scenes in the intermediate subset and 6 scenes in the advanced subset.
We utilize the F-score as the evaluation metric for the Tanks and Temples dataset. The F-score takes into account the precision PR and recall RE of the reconstructed point cloud, with precision defined as in Equation (22) and recall as in Equation (23). The F-score is calculated according to the formula in Equation (24).
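The F-score can be sketched as below, assuming the usual Tanks and Temples definitions: precision is the fraction of reconstructed points within a distance threshold of the ground truth, recall is the fraction of ground truth points within the threshold of the reconstruction, and the F-score is their harmonic mean. The function names, threshold, and toy point clouds are hypothetical.

```python
import math

def fraction_within(source, target, threshold):
    """Fraction of source points whose nearest target point is closer than threshold."""
    hits = sum(1 for p in source
               if min(math.dist(p, q) for q in target) < threshold)
    return hits / len(source)

def f_score(reconstructed, ground_truth, threshold):
    """Harmonic mean of precision and recall at the given distance threshold."""
    precision = fraction_within(reconstructed, ground_truth, threshold)
    recall = fraction_within(ground_truth, reconstructed, threshold)
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

reconstructed = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
score = f_score(reconstructed, ground_truth, threshold=0.5)
```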

End-to-End Training Details
We fixed the number of input images at N = 4 and resized the original images to a resolution of 512 × 640 pixels during the training phase. We divided the MVS network into three stages, with each stage taking input images at 1/16, 1/4, and 1 of the original resolution, respectively. We assumed the same number of plane sweep depths and depth intervals as [17]. Specifically, for the three stages, we assumed 48, 32, and 8 plane sweep depths and depth intervals of 4, 2, and 1, respectively. In the neural volume rendering network, we set the number of ray samples to 1024. We used the Adam optimizer and set the loss weights to $\lambda_{DC_1} = 0.8$, $\lambda_{DC_2} = 0.2$, $\lambda_{RRV} = 1$, $\lambda_{DC} = 0.01$, and $\lambda_{MVS} = 1$. The training process comprised 16 epochs, commencing with an initial learning rate of 0.0001. This learning rate was halved at the 10th, 12th, and 14th epochs. Our method was trained with a batch size of 2 using 2 NVIDIA GeForce RTX 3090 Ti GPUs.
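The learning rate schedule described above can be written out as a short sketch. Whether the halving takes effect at the start or the end of each listed epoch is not stated, so the sketch assumes it applies from the start of the 10th, 12th, and 14th epochs; the function name is hypothetical.

```python
def learning_rate(epoch, base_lr=1e-4, milestones=(10, 12, 14), gamma=0.5):
    """Learning rate in effect during a 1-indexed epoch.

    Assumption: the rate is halved from the start of each milestone epoch.
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops

# Rates for the 16 training epochs.
schedule = [learning_rate(epoch) for epoch in range(1, 17)]
```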

Results on DTU Dataset
Our model was assessed with 5 neighboring views (N = 5) and input images at a resolution of 1152 × 864 pixels. We conducted a comparative analysis between our outcomes and those obtained from various traditional techniques as well as cutting-edge learning-based approaches. The quantitative evaluation results are presented in Table 1. Our method excelled in terms of completeness, exhibiting a significant 27% improvement compared to CVP-MVSNet [28]. Moreover, our approach outperformed existing advanced methods in terms of overall reconstruction quality. In addition to quantitative analysis, Figure 7 provides visual qualitative results of the reconstructed point clouds. Our model generated more complete point clouds with finer texture details in challenging regions characterized by weak textures and lighting reflections compared to CasMVSNet [17] and UCS-Net [29].

Results on Tanks and Temples Dataset
We conducted assessments using input images at a resolution of 1920 × 1080 and a neighboring view count of N = 5. Table 2 presents quantitative results for the intermediate subset. Our method demonstrates superior performance on most scenes of the intermediate subset, underscoring its effectiveness and generalization capability. Figure 8 offers qualitative visualizations of the reconstructed 3D point clouds, highlighting the robust reconstruction capabilities of our algorithm. Figure 9 showcases qualitative results for the "Train" and "Horse" scenes within the intermediate subset. Our method produces more precise and more complete points, particularly in regions with low texture or non-Lambertian surfaces. In the more complex advanced subset, as shown in Table 3, our approach performs better than previous advanced learning-based approaches on the "Ballroom" and "Palace" scenes.

Ablation Study
We conducted four comparative experiments on the DTU evaluation dataset. We investigated the impact of different loss functions and the attention-aware feature extraction module on the reconstruction results. Additionally, we examined the impact of varying dilation rates in the attention-aware feature extraction module. We also assessed the influence of the confidence and depth-guided sampling strategy under different view counts. Finally, we evaluated the network's performance when varying the number of rays used for sampling.

Table 4 clearly illustrates that the attention-aware feature extraction module, together with the two loss functions, significantly enhances the completeness of the point cloud reconstruction. When these components are combined with the baseline CasMVSNet model, the improvement is most prominent in completeness, while the overall score remains at a high level. During evaluation on the test dataset, our proposed model bypasses the neural volume rendering network and uses only the MVS network to estimate depth maps based on the learned feature weights. As a result, the gains in completeness and overall quality of the reconstructed point clouds come with only a minor increase in the number of parameters, inference time, and memory usage over the baseline model. We also visualize the influence of these components on the reconstruction results, as shown in Figure 10. By adding the neural volume rendering network to the baseline CasMVSNet and incorporating the rendering reference view loss and the depth consistency loss, the network learns additional scene geometry information beyond that represented by the cost volume. This leads to an enhancement in the completeness of the reconstructed point cloud. Additionally, the attention-aware feature extraction module extracts rich feature information to mitigate feature-matching errors, resulting in improved reconstruction results for point clouds in regions with weak texture and non-Lambertian surfaces.

Table 5 presents the influence of different dilation rates in the attention-aware feature extraction module on the reconstruction results. When the dilation rates of the three dilated convolutions are set to 2, 3, and 4, the overall point cloud reconstruction quality is the best. However, as the dilation rates increase, the continuity of the extracted feature information decreases, which reduces information coherence, and the overall point cloud reconstruction quality of the network deteriorates.

Table 6 demonstrates the impact of sampling within the confidence and depth prior range on the reconstruction results under varying numbers of views. The point cloud reconstruction achieves the best overall quality when the number of views is set to 4. Therefore, we adopted this view count for the other ablation analyses. Furthermore, the confidence and depth-guided sampling strategy concentrates on collecting points near the object's surface. This allows the network to accurately construct the geometric shape of the neural radiance field, thereby mitigating the impact of the flawed cost volume on the network. Consequently, the point cloud reconstruction exhibits an overall improvement in performance.

During volume rendering, we quantitatively assessed the impact of varying the number of sampled rays on the point cloud reconstruction results. As shown in Table 7, we conducted experiments with four different sampling quantities. The point cloud reconstruction achieved the best accuracy and completeness when the number of sampled rays reached 1024.
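
The dilation-rate trade-off reported in this ablation can be made concrete with the standard relation between dilation rate and kernel footprint. The sketch below assumes 3 × 3 kernels, which this section does not state, and uses hypothetical function names. It shows that a larger dilation rate widens the receptive field but also leaves more unsampled pixels between neighboring kernel taps, which is one way to read the loss of continuity described above.

```python
def effective_kernel_size(kernel_size, dilation):
    """Spatial extent covered by a dilated kernel along one axis."""
    return kernel_size + (kernel_size - 1) * (dilation - 1)


def skipped_pixels_between_taps(dilation):
    """Number of input pixels skipped between two adjacent kernel taps."""
    return dilation - 1


# Dilation rates reported as best in the ablation; the kernel size of 3 is an assumption.
for rate in (2, 3, 4):
    print(rate, effective_kernel_size(3, rate), skipped_pixels_between_taps(rate))
```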

Conclusions
In this research, we introduce an attention-aware feature extraction network to capture inter-pixel dependencies and adequately extract semantic information from the original views. Furthermore, we establish a novel neural volume rendering network based on multi-view semantic features and neural encoding volume, utilizing rendering reference view loss to reconstruct the 3D scene geometry. Additionally, we introduce depth consistency loss to maintain the consistency of scene geometry, alleviating the impact of incorrect matching in regions with weak texture or non-Lambertian surfaces. Extensive experimentation on both the DTU and Tanks and Temples datasets showcases the superior performance of our network compared to previous state-of-the-art approaches. Comprehensive ablation studies validate the effectiveness of the individual modules introduced.

Figure 1. Illustration of the proposed approach. Our network consists of the MVS network and the neural volume rendering network.

Figure 2. The design of the feature extraction module we propose.

Figure 3. The architecture of the attention module. This module is a residual structure composed of a mixture of convolutional layers and a local attention layer.

Figure 4. The architecture of the local attention layer.

Based on these assumptions, we construct feature volumes $\{V_i\}_{i=1}^{N}$, which are constructed by differentiably warping the 2D source-view features to

Figure 5. Weighted pooling operation. Here, N represents the number of input views. The notation below the MLP denotes the dimensions of input and output variables in the linear layer, respectively.

$\hat{U}_p$ contain valuable signals to guide sampling along rays. Thus, for rendering the color of a 3D point $x$ in the geometric scene, we replaced the coarse network used for hierarchical sampling in the original NeRF. We distribute half of the sampled points between the near plane $e_n$ and the far plane $e_f$. The second half of the sampled points are drawn within the range of the confidence and depth prior $\hat{U}_p$. This ensures both the network's generalization capability and model convergence.

Figure 6 presents a comparison between the two sampling methods.
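
The two-part sampling described above (half of the samples spread between the near and far planes, half concentrated in the prior range) can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `sample_depths`, the parameters `depth_prior` and `radius`, and the choice of a symmetric interval around the depth prior are assumptions, since this section does not specify how the confidence is converted into an interval width.

```python
import random


def sample_depths(n_samples, near, far, depth_prior, radius, rng=random):
    """Sample depths along one ray.

    Half of the samples are drawn uniformly in [near, far]; the other half are
    drawn uniformly in an interval around the depth prior. The interval
    [depth_prior - radius, depth_prior + radius], clipped to [near, far],
    is a hypothetical stand-in for the confidence and depth prior range.
    """
    n_uniform = n_samples // 2
    n_guided = n_samples - n_uniform
    lo = max(near, depth_prior - radius)
    hi = min(far, depth_prior + radius)
    uniform = [rng.uniform(near, far) for _ in range(n_uniform)]
    guided = [rng.uniform(lo, hi) for _ in range(n_guided)]
    return sorted(uniform + guided)


# Illustrative values only: 64 samples on a ray whose prior depth is 4.0.
samples = sample_depths(64, near=2.0, far=6.0, depth_prior=4.0, radius=0.1,
                        rng=random.Random(0))
print(len(samples), samples[0], samples[-1])
```

Keeping the uniform half means the ray is still covered when the depth prior is wrong, while the guided half places samples densely where the surface is expected to be.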

Figure 6. Comparison of two sampling methods. In contrast to the uniform sampling employed in the original NeRF, the sampling method within the confidence and depth prior range concentrates the samples more on the surface of the object.

Next, we render the predicted colors and volume density values

Figure 7. For DTU dataset scans 13 and 77, we compare the reconstruction results with CasMVSNet, UCS-Net, and ground truth.

Figure 8. Visualization of 3D point clouds of (a) the scene "Family", (b) the scene "Lighthouse", (c) the scene "Horse", (d) the scene "Train", (e) the scene "Playground", (f) the scene "Temple", and (g) the scene "Museum" on the intermediate and advanced subsets of the Tanks and Temples dataset.

Figure 9. Precision results for the "Train" scene (τ = 5 mm) and recall results for the "Horse" scene (τ = 3 mm) on the Tanks and Temples dataset, compared with CasMVSNet, R-MVSNet, and CVP-MVSNet. Here, τ represents the official distance threshold, and darker regions indicate higher errors relative to τ.

4.4.1. Influence of Attention-Aware Feature Extraction Module and Different Loss Functions
Building upon the baseline model CasMVSNet, we examined the impact of the attention-aware feature extraction module and different loss functions on the final point cloud reconstruction, together with the associated effects on model parameters, inference time, and memory usage during testing. The outcomes are displayed in Table 4.

Figure 10. Qualitative results of scan 48 on the DTU dataset using the attention-aware feature extraction module and various loss functions.

4.4.2. Impact of Different Dilation Rates in the Attention-Aware Feature Extraction Module

Table 1. Quantitative results for the DTU dataset (lower scores indicate better performance). The results are grouped into traditional methods and learning-based methods. Results for methods other than ours are taken from previously published work.

Table 2. Quantitative results of the F-score on the intermediate subset of the Tanks and Temples dataset (higher scores indicate better performance).

Table 3. Quantitative results of the F-score on the advanced subset of the Tanks and Temples dataset (higher scores indicate better performance).

Table 4. Ablation study of the attention-aware feature extraction module and different loss functions. The baseline is CasMVSNet. AM denotes the attention-aware feature extraction module, RRV the rendering reference view loss, and DC the depth consistency loss.

Table 5. Influence of different dilation rates of the dilated convolutions in the attention-aware feature extraction module on the reconstruction results.

Table 6. Ablation study of the confidence and depth-guided sampling strategy under different view counts. CDG denotes the confidence and depth-guided sampling strategy.

4.4.4. Performance of Sampling with Varying Numbers of Rays

Table 7. Quantitative performance with different numbers of sampled rays.