ED²IF²-Net: Learning Disentangled Deformed Implicit Fields and Enhanced Displacement Fields from Single Images Using Pyramid Vision Transformer

Abstract: Substantial research has addressed single-view 3D reconstruction, and the majority of state-of-the-art implicit methods employ CNNs as the backbone network. Transformers, on the other hand, have shown remarkable performance in many vision tasks, but it is still unknown whether they are suitable for single-view implicit 3D reconstruction. In this paper, we propose the first end-to-end single-view 3D reconstruction network based on the Pyramid Vision Transformer (PVT), called ED²IF²-Net, which disentangles the reconstruction of an implicit field into the reconstruction of topological structures and the recovery of surface details to achieve high-fidelity shape reconstruction. ED²IF²-Net uses a Pyramid Vision Transformer encoder to extract multi-scale hierarchical local features and a global vector from the single input image, which are fed into three separate decoders. A coarse shape decoder reconstructs a coarse implicit field based on the global vector, a deformation decoder iteratively refines the coarse implicit field using the pixel-aligned local features to obtain a deformed implicit field through multiple implicit field deformation blocks (IFDBs), and a surface detail decoder predicts an enhanced displacement field using the local features with hybrid attention modules (HAMs). The final output is a fusion of the deformed implicit field and the enhanced displacement field, with four loss terms applied to reconstruct the coarse implicit field, the structure details through a novel deformation loss, the overall shape after fusion, and the surface details via a Laplacian loss. The quantitative results obtained on the ShapeNet dataset validate the performance of ED²IF²-Net. Notably, ED²IF²-Net-L is the top-performing variant, achieving the best mean IoU, CD, EMD, ECD-3D, and ECD-2D scores of 61.1, 7.26, 2.51, 6.08, and 1.84, respectively.
Extensive experimental evaluations consistently demonstrate the state-of-the-art capabilities of ED²IF²-Net in reconstructing topological structures and recovering surface details, while maintaining competitive inference time.

The main idea of the earlier data-driven implicit approaches [18][19][20][21][22] is to learn latent vectors of the input images and to apply neural networks to fit the mapping from query points to an implicit scalar field. For example, DeepSDF [18] introduces latent codes that are able to represent similar objects and, in combination with query point coordinates, outputs signed distance functions (SDFs) approximating object shapes. IM-Net [19] encodes a single input image to extract a latent vector, which is then decoded together with the query point coordinates to generate an implicit scalar field value representing the spatial relationship between the point and the object shape. Occupancy Network [20] encodes different types of inputs into embeddings while converting query points into point features, and the decoder incorporates all the information and outputs a real number indicating the occupancy probability of the query point. Littwin et al. [21] use the encoding vector of the input image as the weight matrix of an MLP for binary classification of query points, resulting in the generation of an implicit field. These methods can only reconstruct the coarse shape and fail to reproduce the details of the object. Rather than predicting a single global implicit field, PQ-NET [23] outputs an SDF for each intrinsic structure of the object and fuses these implicit fields to generate the final SDF, producing more promising reconstruction results.
Recently, several novel CNN-based models have been proposed [24][25][26][27][28][29]. DISN [24] fuses global and local features of the input image with point features of query points to obtain the fused SDF. MDISN [25] deforms a randomly generated SDF based on local feature variations at the layer level to approximate the ground-truth SDF. Ladybird [26] considers pixel-aligned local features of query points and their symmetry points, combining them with global features to output the SDF. Ray-ONet [27] integrates global features, local features, and scaling parameters to estimate the occupancy probability of spatial query points along rays, reducing complexity and improving performance compared to Occupancy Network [20]. Peng et al. [28] merge global and local features extracted by an encoder, incorporate query points via linear interpolation, and use subsequent networks to predict occupancy values. In contrast to earlier works [18][19][20][21][22][23], such methods [24][25][26][27][28] can better capture structure details and recover finer shapes owing to the integration of local features. However, details at the surface level such as depth, which are equally critical for visual perception, are still poorly reconstructed. D²IM-Net [29] focuses on recovering surface details and often produces promising surface features, yet it is unable to reconstruct the correct topological structures. Furthermore, previous CNN-based approaches [24][25][26][27][28][29] often encounter two inherent limitations associated with the convolutional layer. Firstly, convolutional kernels treat all pixels equally, resulting in inefficiency when processing images. This uniform treatment fails to capture the varying importance and dependencies of different pixels within an image. Secondly, due to the local nature of convolution, long-range pixel relationships are not effectively modeled. As a result, crucial contextual information may be overlooked, hindering the ability to fully understand and exploit the complex dependencies and interactions between pixels across the entire image. The PIFu series [30][31][32] incorporates pixel-aligned local features and depth information into its paradigm, with a primary focus on human reconstruction.
With the advent of works such as ViT [36] and DeiT [37], transformers have recently attracted considerable attention in computer vision. Transformer-based vision models have achieved state-of-the-art performance in several downstream tasks, such as DETR [38] for object detection, SwinIR [39] for image restoration, Segmenter [40] for semantic segmentation, and MViTv2 [41] for image classification. While transformers have demonstrated preliminary success in many tasks including explicit 3D reconstruction [1][2][3][4][5][42], whether they can be successfully employed to improve implicit 3D reconstruction is still unknown.
To address the limitations of existing implicit methods, which struggle to simultaneously reconstruct the topological structure and surface details of objects, ED²IF²-Net is proposed in this paper. Our approach utilizes transformers, specifically the Pyramid Vision Transformer (PVT), to enable end-to-end single-view implicit 3D reconstruction. By leveraging PVT, we aim to mitigate the negative impacts of the underlying convolutional layers in CNN-based methods, allowing for comprehensive reconstruction of both topological structures and surface details from a single image. For an input image, local features and a global vector are extracted using a pre-trained Pyramid Vision Transformer encoder [43]. Subsequently, a coarse shape decoder reconstructs a coarse implicit field based on the global vector. A deformation decoder, incorporating symmetry priors that provide extra knowledge about the object shape, predicts a deformed implicit field with finer-grained structure details using pixel-aligned local features and multiple implicit field deformation blocks (IFDBs). Finally, a surface detail decoder equipped with hybrid attention modules (HAMs) [44] constructs an enhanced displacement field, enabling the recovery of enhanced surface details from the local features. The IFDB offers a lightweight and effective way to learn the implicit field deformation function. It refines the coarse implicit field by leveraging information from query points and pixel-aligned local features at neighboring scales. Through simple iterations, multiple IFDBs efficiently fit the deformation function, enabling the generation of a deformed implicit field that captures the finer topological structure of the object. In contrast to CBAM [45], HAM is a more recent and parameter-efficient module that significantly improves surface detail recovery performance. The output of the proposed ED²IF²-Net is a fusion of the deformed implicit field and the enhanced displacement field. The main contributions of this paper include:

1. A Pyramid-Vision-Transformer-based ED²IF²-Net is proposed for end-to-end single-view implicit 3D reconstruction, which disentangles implicit field reconstruction into accurate topological structures and enhanced surface details with competitive inference time. To our knowledge, it is the first method to utilize transformers for single-view implicit 3D reconstruction. Experimental results show superior performance in both overall reconstruction and detail recovery.

2. The finer topological structural details of the object are achieved through iterative refinement of the coarse implicit field using multiple IFDBs. IFDB deforms the implicit field from coarse to fine based on query point and pixel-aligned local feature variations at continuous scales. ED²IF²-Net also enhances surface detail representation at spatial and channel levels.

3. A novel loss function consisting of four terms is proposed, where the coarse shape loss and the overall shape loss allow the reconstruction of the coarse shape and the overall shape after fusion, and the novel deformation loss and the Laplacian loss enable ED²IF²-Net to reconstruct structure details and recover surface details, respectively.

Related Works
Since the proposed ED²IF²-Net is a single-view 3D reconstruction network based on the Pyramid Vision Transformer, this section reviews related work on implicit methods for single-view 3D reconstruction, Laplacian operators, and transformers in computer vision.

Implicit Methods for Single-View 3D Reconstruction
Since the object shapes in this work are represented with implicit functions, this section focuses on reviewing the implicit methods for single-view 3D reconstruction.
There are mainly two popular forms of data in deep-learning-based single-view implicit 3D reconstruction: occupancy probability and SDF. Specifically, the implicit models learn the scalar value of each query point under the supervision of the ground-truth occupancy probability or SDF. Earlier implicit methods, such as Occupancy Network [20] and IM-Net [19], tend to adopt a straightforward idea. The latent vectors of the input image are first extracted via an image encoder and subsequently combined with the features or coordinates of the query points as the MLP inputs, from which the occupancy probability or SDF for each query point is predicted. Recently, a few novel CNN-based implicit 3D reconstruction models [24][25][26][27][28][29] have been proposed that take local features into account, resulting in more promising reconstruction performance.
The most relevant works to ours are DISN [24], MDISN [25], and D²IM-Net [29]. Specifically, both DISN and MDISN predict the camera parameters of the input image to extract local features corresponding to each query point. In DISN, query point features are concatenated with global and local features. The two concatenated features are decoded to obtain two predicted values, which are summed to derive the final SDF. MDISN deforms the randomly generated SDF for each query point from coarse resolutions to fine ones depending on the variation of local features. D²IM-Net predicts the camera pose and decomposes the reconstruction of the object's implicit field into two parts: the reconstruction of coarse shapes and the recovery of details. DISN and MDISN achieve better experimental results than earlier approaches, yielding shapes with more structure details. However, they still fail to recover the surface details of an object. While D²IM-Net is capable of recovering good surface details, it often results in poor topological structures. Although a variant of D²IM-Net, called D²IM-Net_GL, uses both global and local features in the basic decoder, it still struggles to reconstruct a satisfactory shape and even produces blurry surface details.
Compared to DISN, MDISN, and D²IM-Net, ED²IF²-Net is the first to employ a transformer for single-view implicit 3D reconstruction, which alleviates the negative effects brought by convolution in CNN-based models. This paper proposes a novel paradigm that disentangles the reconstruction of an object into the reconstruction of more accurate topological structures and the recovery of enhanced surface details. The finer-grained topological structure details are obtained through iterative refinement of the coarse implicit field using multiple IFDBs, and the enhanced surface details are obtained by enhancing the surface detail features in both spatial and channel dimensions using HAMs, as shown in Figure 1. The core difference lies in the construction of a specific loss term for each learned field, including the coarse implicit field, the deformed implicit field, and the enhanced displacement field. This disentanglement of the deformed implicit field, which contains most of the topological structures, from the enhanced displacement field allows for better learning, resulting in the recovery of enhanced surface details. Specifically, a novel deformation loss for learning structure details from the ground truth is introduced in our combined loss function, while the surface details are learned from ground-truth normal maps by applying a Laplacian loss. Extensive qualitative and quantitative comparisons in Section 4 demonstrate the capabilities of the proposed ED²IF²-Net. In contrast to DISN and MDISN, which lack surface detail recovery, the results reconstructed by ED²IF²-Net exhibit significantly improved surface detail fidelity. Furthermore, our model addresses the problem of D²IM-Net and its variant D²IM-Net_GL, which reconstruct wrong topological structures. ED²IF²-Net is capable of generating visually attractive and high-quality 3D shapes.

The global vector is used in a coarse shape decoder to predict a coarse implicit field, which is then iteratively refined by a deformation decoder to obtain a deformed implicit field with finer structure details using multiple implicit field deformation blocks (IFDBs). A surface detail decoder with hybrid attention modules (HAMs) uses local features to recover an enhanced displacement field. The final output of ED²IF²-Net is a fusion of the deformed implicit field and the enhanced displacement field. Four combined loss terms are applied to reconstruct the coarse implicit field, structure details, overall shape, and surface details.

Laplacian Operators
Laplacian operators are frequently used to extract local variations in images and 3D shapes. Further, Laplacian pyramids have been used extensively in neural models for super-resolution image reconstruction [46,47] and generation [48] by extracting multi-scale structures from images. Li et al. [49] propose a Laplacian loss for image synthesis, which effectively preserves image details and eliminates artifacts. Recently, there have been some works on single-view 3D reconstruction using Laplacian operators. Wang et al. [17] apply a Laplacian loss to meshes by minimizing the loss between Laplacian coordinates before and after surface mesh deformation. Liu et al. [50] smooth the surface via a Laplacian regularization, which is prone to losing the surface details of the object. D²IM-Net [29] takes the disentangled detail information as a displacement field, which recovers the surface details well with a Laplacian loss. In this work, we follow and improve D²IM-Net regarding the Laplacian loss. The key difference is that, in our acquisition of the displacement field, the feature representation of surface details is enhanced using HAM [44], which effectively overcomes the lack of surface details in D²IM-Net.

Transformers in Computer Vision
Transformers [51] originate from natural language processing, and their core component is multi-head self-attention. Recently, transformers have received much attention in computer vision and have made a profound impact. For a comprehensive review of transformers in vision, the readers are referred to [52]. For applications in vision, transformers have achieved state-of-the-art performance in object detection [38], image classification [36], image restoration [39], and multi-view 3D reconstruction [2]. In this work, a Pyramid Vision Transformer [43] is used to extract multi-scale hierarchical local features and a global vector. PVT inherits the advantages of CNNs and transformers in that it can extract multi-scale hierarchical local features from images without inductive bias. Ablation studies also demonstrate that, when the Pyramid Vision Transformer is used as the encoder of ED²IF²-Net, fewer artifacts and better performance can be achieved compared to ResNet18 [53].

Overview
In this work, we aim at reconstructing high-fidelity 3D shapes with topological structures and surface details by means of a network that models the signed distance function (SDF), denoted as $g$, given a single RGB image $I \in \mathbb{R}^{H \times W \times 3}$ of the object and any spatial query point $P \in \mathbb{R}^{3}$. The network outputs the signed distance value $s = g(I, P)$, $s \in \mathbb{R}$. Each training sample for ED²IF²-Net consists of a single-view image of the object, a spatial query point, and the corresponding ground-truth SDF value, viz. $(I, P, \mathrm{SDF}(P))$.
ED²IF²-Net disentangles the SDF of the shape $T$ into the deformed implicit field with structure details and the enhanced displacement field that allows the surface details of the object to be reestablished. The pipeline of ED²IF²-Net is shown in Figure 1. ED²IF²-Net extracts image features with a Pyramid Vision Transformer encoder followed by three decoders reconstructing the coarse implicit field, the deformed implicit field, and the enhanced displacement field, respectively. Then the latter two scalar fields are fused to get the final SDF. Finally, the iso-surface with SDF = 0 can be extracted using Marching Cubes [54] for visualization.
The following sections describe in detail how ED²IF²-Net disentangles the implicit field, as well as its network architecture and loss function.

Disentanglement Method
The variations of the detailed information around the surface of the object (i.e., surface details) affect the Laplacian of the SDF [55]. Inspired by this, the surface details of an object can be detected through Laplacian operators and the remaining topological structures can be reconstructed according to an appropriate loss function, thus disentangling the implicit field reconstruction into topological structure reconstruction and surface detail recovery. As shown in Figure 2, the ground-truth SDF is disentangled into the deformed implicit field with structure details and the enhanced displacement field containing surface details of the object. The most similar work to ours is D²IM-Net [29], which only disentangles the ground-truth SDF into the sum of a coarse implicit field and a displacement field. Unlike D²IM-Net, our disentangled deformed implicit field is based on the coarse implicit field, where the deformed implicit field contains most of the topological structures. Given a query point $P$, our disentanglement solution can be denoted as:

$$\mathrm{SDF}(P) = g_{su}(P) + g_{st}(P) = g_{su}(P) + f(g_{co}(P)),$$

where SDF denotes the ground-truth SDF, $g_{su}$, $g_{st}$, and $g_{co}$ represent the enhanced displacement field with surface details, the deformed implicit field containing most of the topological structures, and the coarse implicit field, respectively, and $f$ defines the deformation function from $g_{co}$ to $g_{st}$. We can suppose that the shape embedded in the deformed implicit field is smooth and that the shape reconstructed from the deformed implicit field can only approximate the object surface. Therefore, the surface details can be further represented with the enhanced displacement field. As the enhanced displacement field is attached onto the smooth deformed implicit field near the iso-surface of the object, the Laplacian of the enhanced displacement field is approximately equal to the Laplacian of SDF:

$$\Delta \mathrm{SDF}(P) \approx \Delta g_{su}(P).$$

In order to accelerate the network training, the Laplacian is only taken into consideration for the sampling points whose minimum distance to the object shape $T$ is less than a predefined threshold $\alpha$.
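As a small numerical illustration of why the Laplacian of an SDF reacts to surface shape, the sketch below estimates the Laplacian by central differences on an analytic sphere, for which the exact value is 2/r. The helper names (`sphere_sdf`, `laplacian`, `near_surface`), the step size `h`, and the use of the absolute SDF value as the distance test against α are assumptions made for this example and are not taken from the paper's implementation.

```python
import math


def sphere_sdf(x, y, z, radius=1.0):
    """Signed distance to a sphere centred at the origin (negative inside)."""
    return math.sqrt(x * x + y * y + z * z) - radius


def laplacian(sdf, p, h=1e-3):
    """7-point central-difference estimate of the Laplacian of sdf at point p."""
    x, y, z = p
    centre = sdf(x, y, z)
    neighbours = (
        sdf(x + h, y, z) + sdf(x - h, y, z)
        + sdf(x, y + h, z) + sdf(x, y - h, z)
        + sdf(x, y, z + h) + sdf(x, y, z - h)
    )
    return (neighbours - 6.0 * centre) / (h * h)


def near_surface(sdf, p, alpha):
    """Assumed selection rule: keep points whose distance to the shape is below alpha."""
    return abs(sdf(*p)) < alpha


# For a sphere, the Laplacian of the SDF at distance r from the centre is 2 / r,
# so a more strongly curved (smaller) surface yields a larger Laplacian.
on_unit_sphere = laplacian(sphere_sdf, (1.0, 0.0, 0.0))
further_out = laplacian(sphere_sdf, (2.0, 0.0, 0.0))
```

Restricting the evaluation to points that pass `near_surface` mirrors the threshold α mentioned above: far from the surface the Laplacian carries little information about surface details.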
Motivated by the works [29,31,56] related to inference on the visible and invisible surfaces of objects, the forward and backward displacement maps are introduced for the visible and occluded parts of the object, respectively. Our forward displacement map recovers the visible surface details of the object based on a Laplacian loss, and the backward displacement map is used to fine-tune the deformed implicit field, further compensating for unreconstructed structure details and fixing incorrect topological structures. In short, we have

$$\mathrm{SDF}(P) = \begin{cases} g_{st}(P) + g_{suF}(p), & P \in P_V, \\ g_{st}(P) + g_{suB}(p), & \text{otherwise}, \end{cases} \tag{4}$$

where $g_{suF}$ and $g_{suB}$ represent the forward and backward displacement maps, respectively, $p = \pi(P)$ is the projection of $P$ on the single-view image, and $P_V$ is the point set which consists of points close to the visible surface of the object. Admittedly, 3D displacement fields would be more direct, since they are defined in 3D space. However, displacement maps are applied instead of 3D displacement fields because they enable alignment of the input image with the details, making it possible to calculate the Laplacian loss term. Additionally, it is more intuitive to observe the detailed information of the object in the displacement maps.
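The fusion rule described above can be sketched in a few lines: for each query point, the deformed implicit field value is combined with the forward displacement map if the point is near the visible surface, and with the backward map otherwise. The function name `fuse_sdf`, the nearest-pixel lookup `sample_map`, and the toy 2×2 maps are hypothetical and only illustrate the selection logic; the actual network predicts the maps and may sample them differently.

```python
def sample_map(displacement_map, u, v):
    """Nearest-pixel lookup in a 2D map stored as a list of rows (assumed sampling)."""
    row = min(max(int(round(v)), 0), len(displacement_map) - 1)
    col = min(max(int(round(u)), 0), len(displacement_map[0]) - 1)
    return displacement_map[row][col]


def fuse_sdf(g_st, forward_map, backward_map, pixel, near_visible):
    """Add the forward or backward displacement at pixel p to the deformed field value."""
    u, v = pixel
    chosen = forward_map if near_visible else backward_map
    return g_st + sample_map(chosen, u, v)


# Toy 2x2 displacement maps (values are made up for illustration).
forward = [[-0.02, 0.00], [0.01, 0.03]]
backward = [[0.05, 0.00], [0.00, -0.01]]

visible_value = fuse_sdf(0.10, forward, backward, pixel=(0, 0), near_visible=True)
occluded_value = fuse_sdf(0.10, forward, backward, pixel=(0, 0), near_visible=False)
```

Keeping the two maps separate lets the visible branch be supervised by image-aligned detail, while the occluded branch only corrects the deformed field.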

Network Architecture
The proposed ED²IF²-Net contains four main components: a Pyramid Vision Transformer encoder, a coarse shape decoder, a deformation decoder, and a surface detail decoder.
Based on the Pyramid Vision Transformer encoder, two variants of ED²IF²-Net are designed: ED²IF²-Net-T with lower computational complexity and ED²IF²-Net-L with higher computational complexity, the latter achieving better reconstruction results. ED²IF²-Net-T and ED²IF²-Net-L differ only in their encoders and share the same architecture for the other parts.

Pyramid Vision Transformer Encoder
In this work, a Pyramid Vision Transformer [43] is used as an encoder for image feature extraction, which consists of four stages. Each of these stages is composed of a patch embedding layer and transformer encoder layers extracting multi-scale local features. In the $k$-th stage, the patch embedding layer partitions its input of spatial size $H_{k-1} \times W_{k-1}$ into patches of size $B_k \times B_k$. Then, these patches are flattened, followed by a linear projection to the corresponding dimension $D_k$ of the current stage. After that, the embedded patches are reshaped to $\frac{H_{k-1}}{B_k} \times \frac{W_{k-1}}{B_k} \times D_k$, where the width and the height are scaled down by a factor of $B_k$, and later fed into transformer encoder layers together with the position embeddings. In this way, image features with different scales can be generated at different stages.
In addition, one of the core components of the transformer encoder layers in the Pyramid Vision Transformer [43] is Spatial-Reduction Attention (SRA), which can extract high-resolution features without too much computational complexity. The inputs of SRA are a query $Q$, a key $K$, and a value $V$. It differs from the standard multi-head self-attention (MSA) only in that an extra spatial reduction is performed on $K$ and $V$ before the attention is computed. Following [43], the spatial reduction can be described as:

$$\mathrm{SR}(X) = \mathrm{LN}\left(\mathrm{Reshape}(X, R_k)\, W^{S}\right),$$

where $X \in \mathbb{R}^{(H_k W_k) \times D_k}$ indicates the input to be reduced, $R_k$ is a hyperparameter that represents the reduction factor of the $k$-th stage, LN denotes layer normalization, $\mathrm{Reshape}(X, R_k)$ means the operation of transforming the input $X$ into a sequence $S \in \mathbb{R}^{\frac{H_k W_k}{R_k^2} \times (R_k^2 D_k)}$, and $W^{S} \in \mathbb{R}^{(R_k^2 D_k) \times D_k}$ is a linear projection that reduces the dimension of $S$ to $D_k$.

In our implementation, two pre-trained models of the Pyramid Vision Transformer [43] are used, PVT-Tiny and PVT-Large, as encoders for ED²IF²-Net-T and ED²IF²-Net-L, respectively. The dimensions of the local features for the four stages of both PVT-Tiny and PVT-Large are 64, 128, 320, and 512. The Pyramid Vision Transformer encoder finally outputs local features at four scales, denoted as $l_i$ ($i \in \{0, 1, 2, 3\}$), and a global vector $z$ of the input single-view image.
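To make the effect of the spatial reduction concrete, the following sketch computes the token counts involved in one SRA layer: the query keeps all H_k·W_k tokens, while the key and value are shortened by a factor of R_k². The function names are hypothetical, and the per-stage strides (4, 8, 16, 32) and reduction factors (8, 4, 2, 1) are assumed to follow the publicly released PVT configuration rather than being stated in this paper; only the feature dimensions 64, 128, 320, and 512 come from the text above.

```python
def sra_shapes(height, width, dim, reduction):
    """Return (query_len, key_value_len, flattened_dim) for one SRA layer.

    query_len      = H_k * W_k            tokens attending
    key_value_len  = H_k * W_k / R_k^2    tokens after spatial reduction
    flattened_dim  = R_k^2 * D_k          channel size before the projection W^S
    """
    if height % reduction or width % reduction:
        raise ValueError("feature map size must be divisible by the reduction factor")
    query_len = height * width
    key_value_len = (height // reduction) * (width // reduction)
    flattened_dim = reduction * reduction * dim
    return query_len, key_value_len, flattened_dim


def pvt_stage_shapes(image_size=224,
                     strides=(4, 8, 16, 32),        # assumed PVT strides
                     dims=(64, 128, 320, 512),      # dimensions given in the text
                     reductions=(8, 4, 2, 1)):      # assumed PVT reduction factors
    """Shapes of all four stages for a square input image."""
    return [
        sra_shapes(image_size // s, image_size // s, d, r)
        for s, d, r in zip(strides, dims, reductions)
    ]


stage_shapes = pvt_stage_shapes()
```

Under these assumed settings, a 224 × 224 input gives 3136 query tokens in the first stage but only 49 key/value tokens in every stage, which is where the savings in attention cost come from.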

Coarse Shape Decoder
Following IM-Net [19], which regards implicit 3D reconstruction as being by nature a classification problem, we use MLPs with ReLU as the nonlinear activation to fit the SDF of the object's coarse shape. Specifically, the global vector $z$ and the query point $P$ are concatenated as the input, and the coarse implicit field $g_{co}$ is output through the MLPs:

$$g_{co}(P) = \mathrm{MLPs}([z, P]),$$

where $[\cdot, \cdot]$ denotes concatenation. However, the coarse implicit field $g_{co}$ is merely capable of approximating the coarse shape of the object, and is unable to reconstruct the structure details or recover the surface details. Therefore, a deformation decoder and a surface detail decoder are applied to learn the structure details and the surface details of the object, respectively.

Deformation Decoder
The deformation function $f$ can be learnt via the deformation decoder, as illustrated in Figure 3. The deformation decoder first unifies the multi-scale local features through bilinear interpolation and then retrieves the local features for the query point $P$ at all scales in a pixel-aligned manner. Let $p = \pi(P)$ be the projection of $P$ on the image. Following a similar idea to Ladybird [26], two pixel-aligned local features $h_1$ and $h_2$ are provided for $P$. Specifically, $h_1$ and $h_2$ are extracted from the projections of $P$ and of its self-reflecting symmetric point $P_s$ on $l_i$, respectively, and are then concatenated as the final pixel-aligned local feature $l_i(p)$ of $P$. Finally, the continuous pixel-aligned local feature pairs are used to refine the coarse implicit field $g_{co}$ through the core lightweight component of the deformation decoder, i.e., the implicit field deformation block (IFDB). There are three IFDBs in our implementation of the deformation decoder, and the last one generates the deformed implicit field $g_{st}$ (see Algorithm 1):

$$s_j(P),\; c_j = \mathrm{IFDB}\left(s_{j-1}(P),\; l_{3-j}(p),\; l_{4-j}(p),\; P,\; c_{j-1}\right), \qquad s_0(P) = g_{co}(P),$$

where $s_j(P)$ and $c_j$ ($j \in \{1, 2, 3\}$) stand for the intermediate implicit field at $P$ and the state code generated by the $j$-th IFDB, respectively; in particular, $g_{st}(P) = s_3(P)$.

Algorithm 1 Deformation
Input: coarse implicit field $g_{co}$, multi-scale local features $l_i$ ($i \in \{0, 1, 2, 3\}$), query point $P$ and its projection $p$ on the image
Output: deformed implicit field $g_{st}$
function DEFORMATION($g_{co}$, $l_i$, $P$, $p$)
    $P_s$ ← Find_Symmetry_Point($P$)    // Find the symmetry point of the query point P
    for $i \in \{0, 1, 2, 3\}$ do
        $l_i$ ← Bilinear_Interpolation($l_i$, res)    // Unify the multi-scale local features to resolution res through bilinear interpolation
    end for
    $s_0(P)$ ← $g_{co}(P)$
    for $j \in \{1, 2, 3\}$ do
        $s_j(P)$, $c_j$ ← IFDB($s_{j-1}(P)$, $l_{3-j}(p)$, $l_{4-j}(p)$, $P$, $c_{j-1}$)
    end for
    $g_{st}(P)$ ← $s_3(P)$
    return $g_{st}(P)$
end function

Pixel-aligned local feature pairs not only enable the deformed implicit field to reconstruct the finer-grained topological structure details aligned with the image, but also guarantee that the surface details can be correctly recovered using the surface detail decoder. This is achieved by incorporating additional information about the query point and its symmetry point in the object shape into the features.
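The iteration in Algorithm 1 can be sketched as a plain loop in which each IFDB call consumes the previous field value, a pair of features from adjacent scales, and the previous state code. In this sketch the IFDB itself is passed in as a callable, and `toy_ifdb`, the scalar "features", and the initial state `None` are stand-ins chosen only so the example runs; the real block is a learned network whose internals are not reproduced here.

```python
def deformation(g_co_value, local_features, point, ifdb, initial_state=None):
    """Refine a coarse SDF value with three IFDB steps (index pattern of Algorithm 1).

    local_features: [l_0(p), l_1(p), l_2(p), l_3(p)], already pixel-aligned.
    ifdb: callable (s_prev, feat_a, feat_b, point, state) -> (s_next, new_state).
    Returns the deformed value g_st(P) and the list [s_1, s_2, s_3].
    """
    s = g_co_value          # s_0(P) = g_co(P)
    state = initial_state   # c_0; its initial value is an assumption here
    intermediates = []
    for j in (1, 2, 3):
        s, state = ifdb(s, local_features[3 - j], local_features[4 - j], point, state)
        intermediates.append(s)
    return s, intermediates


def toy_ifdb(s_prev, feat_a, feat_b, point, state):
    """Hypothetical stand-in: shift the field by a fraction of the feature difference."""
    delta = feat_a - feat_b
    new_state = (state or 0.0) + delta   # accumulate what has been applied so far
    return s_prev + 0.1 * delta, new_state


features = [1.0, 2.0, 4.0, 8.0]          # scalar placeholders for l_0(p)..l_3(p)
g_st, steps = deformation(0.5, features, point=(0.0, 0.0, 0.0), ifdb=toy_ifdb)
```

The loop visits the feature pairs (l_2, l_3), (l_1, l_2), and (l_0, l_1) in that order, so every step only sees the variation between two adjacent scales.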

Implicit Field Deformation Block
It is known that local features with a larger scale tend to produce an overall shape, while those with a smaller scale can keep fine-grained structure details. The IFDB takes advantage of this characteristic, as can be seen in Figure 4. To ensure a smooth implicit field deformation, IFDB deforms the input implicit field according to the variations of the pixel-aligned local features between adjacent scales. Moreover, a state code is used to record all the information of the current implicit field deformation, and it is updated at the end of each IFDB for the next IFDB. In contrast to MDISN [25], which inputs point features into its deformation module, IFDB takes the coordinates of the query points as input, and this policy performs better than the former.

Surface Detail Decoder
In order to recover the enhanced displacement field $g_{su}$ with surface details, a surface detail decoder is applied to recover the detailed displacement maps of an object. As demonstrated in Figure 5, the surface detail decoder takes all local features as input, which are processed by the hybrid attention module (HAM) [44], consisting of spatial and channel attention, to enhance the feature representation of surface details. Then, a 1 × 1 convolution layer and a ReLU activation layer are employed to decrease the number of channels, followed by an upsampling and a 1 × 1 convolution layer. After that, the outputs of the convolution are element-wise accumulated into the features of the next scale. After repeating the above workflow three times, the features are upsampled twice to match the size of the input image, followed by a series of convolution and HAM layers. Finally, the enhanced forward and backward displacement maps are output through a 1 × 1 convolution layer. According to Equation (4), if $P$ is near the visible surface, the deformed implicit field of $P$ is added to the forward displacement map at $p$. Otherwise, it is added to the backward one.
In our implementation, the gradient of the SDF at each query point is derived using a central difference approximation. If the direction of the gradient approaches the viewpoint orientation and the ground-truth SDF is less than a specific threshold, the point is considered as being close to the visible surface. Otherwise, the point is treated as being near the invisible surface. Moreover, a network similar to that of DISN [24] is also trained for estimating the camera parameters. Note that training uses the ground-truth camera parameters and the gradients derived from the ground-truth SDF, whereas testing uses the predicted values.
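A minimal sketch of the visibility test described above is given below, using an analytic sphere in place of a ground-truth SDF. The function names, the step size `h`, the camera position, the use of a positive dot product between the gradient and the direction towards the camera as the meaning of "approaching the viewpoint orientation", and the use of the absolute SDF value for the threshold test are all assumptions made for illustration.

```python
import math


def unit_sphere_sdf(x, y, z):
    """Analytic SDF of a unit sphere, standing in for a ground-truth SDF."""
    return math.sqrt(x * x + y * y + z * z) - 1.0


def sdf_gradient(sdf, p, h=1e-3):
    """Central-difference approximation of the SDF gradient at point p."""
    x, y, z = p
    return (
        (sdf(x + h, y, z) - sdf(x - h, y, z)) / (2.0 * h),
        (sdf(x, y + h, z) - sdf(x, y - h, z)) / (2.0 * h),
        (sdf(x, y, z + h) - sdf(x, y, z - h)) / (2.0 * h),
    )


def is_near_visible_surface(sdf, p, camera_position, threshold):
    """Assumed rule: gradient faces the camera and the point lies close to the surface."""
    gradient = sdf_gradient(sdf, p)
    to_camera = tuple(c - q for c, q in zip(camera_position, p))
    faces_camera = sum(g * t for g, t in zip(gradient, to_camera)) > 0.0
    return faces_camera and abs(sdf(*p)) < threshold


camera = (0.0, 0.0, 5.0)  # hypothetical camera position on the +z axis
front = is_near_visible_surface(unit_sphere_sdf, (0.0, 0.0, 1.01), camera, 0.05)
back = is_near_visible_surface(unit_sphere_sdf, (0.0, 0.0, -1.01), camera, 0.05)
far = is_near_visible_surface(unit_sphere_sdf, (0.0, 0.0, 1.5), camera, 0.05)
```

Points classified as near the visible surface would then use the forward displacement map, and all remaining points the backward one.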

Loss Function and Sampling Strategy
The total loss function of ED²IF²-Net consists of four components, $L = L_{Coa} + L_{Def} + L_{Ove} + L_{Lap}$, where $L_{Coa}$, $L_{Def}$, $L_{Ove}$, and $L_{Lap}$ represent the coarse shape loss, deformation loss, overall shape loss, and Laplacian loss, respectively. More specifically, the $L_2$-norm-based $L_{Coa}$ minimizes the distance between the coarse implicit field $g_{co}$ and the ground-truth SDF, and the $L_1$-norm-based $L_{Ove}$ minimizes the distance between the fused implicit field $g$ and the SDF, which regularizes the enhanced displacement field.
The structure details are evaluated through a novel deformation loss $L_{Def}$. Since the deformation decoder iteratively refines $g_{co}$, the intermediate implicit fields generated by all IFDBs and the deformed implicit field $g_{st}$ are all taken into consideration, and their $L_1$-distances to the SDF are accumulated through a weighted summation over the query points. Here, $P_i$ represents the $i$-th query point, $N$ denotes the number of query points, $M$ indicates the number of intermediate implicit fields, $s_j$ is the $j$-th intermediate implicit field, $\omega_j$ stands for the weight assigned to $s_j$ in the deformation loss, and $\omega_0$ is the weight of the deformed implicit field $g_{st}$.
$L_{Lap}$ is the $L_2$-distance between the Laplacian of the forward displacement map $g_{suF}$ and the Laplacian of the SDF. These two Laplacians are not in the same space: the forward displacement map $g_{suF}$ is a 2D image, whereas the SDF is defined in 3D space. This problem is addressed by using the Laplacian of the SDF with respect to the projected points on the image. Suppose $p_i(u_i, v_i)$ denotes the projection of $P_i$ on the image and $\mathbf{p}_i$ represents the coordinates of $P_i$ in the camera coordinate system. Similar to D²IM-Net [29], $L_{Lap}$ is evaluated at the projected points $p_i$. In case $P_i$ is close to the visible surface of an object, $N(p_i)$ is the unit normal from the ground-truth normal map, which is equivalent to the gradient of the SDF with respect to $\mathbf{p}_i$, so the Laplacian of the SDF can be defined from the ground-truth normal map.
To enhance the fidelity of the reconstructed object and capture richer small-scale details, ED²IF²-Net employs a weighted sampling strategy similar to D²IM-Net [29]. This strategy assumes dense sampling of the object, where the density of each sampling point is determined by the number of surrounding sampling points within a specified radius. A clipping policy keeps the sample densities within a compact range, both inside and outside the object. These densities then serve as the sampling weights of the corresponding samples during the training of ED²IF²-Net. The effectiveness of the weighted sampling strategy in reconstructing small-scale details is demonstrated through ablation studies.
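For reference, the four loss terms described above can be written out as follows. This is only a sketch reconstructed from the verbal description: the $1/N$ normalization, the squaring of the $L_2$ terms, and the omission of any sampling weights are assumptions, and the exact expressions in the original formulation may differ.

```latex
\begin{align}
L_{Coa} &= \frac{1}{N}\sum_{i=1}^{N} \left\| g_{co}(P_i) - SDF(P_i) \right\|_2^2, \\
L_{Ove} &= \frac{1}{N}\sum_{i=1}^{N} \left| g(P_i) - SDF(P_i) \right|, \\
L_{Def} &= \frac{1}{N}\sum_{i=1}^{N} \Big( \omega_0 \left| g_{st}(P_i) - SDF(P_i) \right|
           + \sum_{j=1}^{M} \omega_j \left| s_j(P_i) - SDF(P_i) \right| \Big), \\
L_{Lap} &= \frac{1}{N}\sum_{i=1}^{N} \left\| \Delta g_{suF}(p_i) - \Delta SDF(p_i) \right\|_2^2 .
\end{align}
```

In this form, $L_{Def}$ supervises every intermediate field $s_j$ as well as the final deformed field $g_{st}$, which matches the stated intent of taking all IFDB outputs into account.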

Experiment Results and Discussion
In Section 4.1, the utilized datasets and evaluation metrics are described, while in Section 4.2, the implementation details are outlined. Section 4.3 presents a qualitative and quantitative comparison of ED²IF²-Net with state-of-the-art methods for single-view implicit 3D reconstruction. Ablation studies are conducted in Section 4.4 to assess the impact of different factors, and the computational complexity of various methods is analyzed in Section 4.5. Examples showcasing the proposed applications are demonstrated in Section 4.6, and the influence of different camera sensors on ED²IF²-Net is discussed in Section 4.7.

Dataset and Metrics
ED²IF²-Net was trained and tested on a subset of ShapeNet [57], which comprises 13 classes and approximately 44,000 3D models. These models underwent pre-processing using the method proposed by DISN [24] to generate point coordinate-SDF pairs, as well as RGB images and normal maps from 36 random views at a resolution of 224 × 224. For the experiments, we adhered to the official training/validation/testing split.
For the overall quality of the reconstruction, intersection over union (IoU), Chamfer distance (CD), and earth mover's distance [58] (EMD) are computed. Moreover, the edge Chamfer distance [59] of the reconstructed shape (ECD-3D) and the edge Chamfer distance in the image [29] (ECD-2D) are used to measure the recovered detail information. The evaluation metrics are defined as follows.
IoU measures the similarity between the reconstructed object and the ground truth. It is computed on the voxel grids $\Gamma(PC_P)$ and $\Gamma(PC_Q)$, where $PC_P$ and $PC_Q$ denote two point clouds and $\Gamma$ denotes the operation that converts a point cloud into a voxel grid.
CD is a commonly used metric for measuring the distance between two point clouds $PC_P$ and $PC_Q$.
EMD is also frequently used to measure the distance between two point clouds $PC_P$ and $PC_Q$, by considering the distribution problem. It is defined through a bijection $\varphi: PC_P \to PC_Q$ between the two point clouds.
ECD-3D is calculated as the CD between the edge points on the ground-truth object and those on the reconstructed object. The "edgeness" $\psi(p_j)$ of each point $p_j$ sampled from a 3D object is defined in terms of $\Omega_j$, the set of neighboring points of $p_j$, and $n_j$ and $n_k$, the unit normal vectors at points $p_j$ and $p_k$, respectively.
In our implementation, we consider a set of 10 neighbouring points ($\Omega_j$) for each point, and we evaluate the edge feature recovery using points with an "edgeness" value $\psi(p_j)$ below 0.8.
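The edge-point selection described above can be sketched as follows. The sketch assumes the commonly used definition of edgeness as the minimum absolute dot product between a point's unit normal and the normals of its neighbours, $\psi(p_j) = \min_{k \in \Omega_j} |n_j \cdot n_k|$; this formula, the brute-force neighbour search, and the function names `edgeness` and `edge_points` are illustrative assumptions rather than the authors' implementation.

```python
import math

def edgeness(points, normals, k=10):
    """Assumed definition: psi(p_j) = min over the k nearest neighbours of |n_j . n_k|."""
    scores = []
    for j, (p_j, n_j) in enumerate(zip(points, normals)):
        # Brute-force k-nearest-neighbour search, excluding the point itself.
        neighbours = sorted(
            (i for i in range(len(points)) if i != j),
            key=lambda i: math.dist(p_j, points[i]),
        )[:k]
        dots = (abs(sum(a * b for a, b in zip(n_j, normals[i]))) for i in neighbours)
        # A point with no neighbours is treated as non-edge (score 1.0).
        scores.append(min(dots, default=1.0))
    return scores

def edge_points(points, normals, k=10, threshold=0.8):
    """Keep the points whose edgeness falls below the threshold (0.8 in the text)."""
    psi = edgeness(points, normals, k)
    return [p for p, s in zip(points, psi) if s < threshold]
```

Points on a flat region have neighbours with nearly parallel normals (edgeness close to 1), whereas points near a crease see at least one strongly deviating normal and are kept as edge points.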
ECD-2D represents the Chamfer distance (CD) between edge pixels on the rendered images. In our implementation, we utilize the Canny operator to extract the edges from the rendered normal map of the reconstructed object, which has a resolution of 224 × 224, in order to obtain the edge pixels.
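To make the point-cloud metrics above concrete, the following is a minimal pure-Python sketch of a symmetric Chamfer distance and a voxel-based IoU. The use of squared distances averaged in both directions, the voxel resolution, the bounding range `[-0.5, 0.5]`, and the helper names are assumptions for illustration; the exact normalization used for the reported numbers is not specified here. EMD is omitted because it requires solving for an optimal bijection.

```python
import math

def chamfer_distance(pc_p, pc_q):
    """Symmetric Chamfer distance: mean squared nearest-neighbour distance, both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(a, b) ** 2 for b in dst) for a in src) / len(src)
    return one_way(pc_p, pc_q) + one_way(pc_q, pc_p)

def voxelize(pc, resolution, lo=-0.5, hi=0.5):
    """Gamma: map each point to the index of the voxel that contains it."""
    cells = set()
    for point in pc:
        idx = tuple(
            min(resolution - 1, max(0, int((c - lo) / (hi - lo) * resolution)))
            for c in point
        )
        cells.add(idx)
    return cells

def iou(pc_p, pc_q, resolution=32):
    """Intersection over union of the occupied voxel sets of two point clouds."""
    vox_p, vox_q = voxelize(pc_p, resolution), voxelize(pc_q, resolution)
    union = vox_p | vox_q
    return len(vox_p & vox_q) / len(union) if union else 1.0
```

The same `chamfer_distance` routine can be applied to the selected edge points (ECD-3D) or to edge-pixel coordinates extracted from rendered normal maps (ECD-2D).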

Implementation Details
In ED²IF²-Net, an RGB image of size 224 × 224 is used as input, and the model outputs signed distance values of the query points. The iso-surface mesh is visualized using Marching Cubes with a resolution of 128 × 128 × 128. The network is implemented in PyTorch [60] and the training parameters are set as follows: batch_size of 16, Adam optimizer [61] with a learning rate of $5 \times 10^{-5}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, and weight decay of $10^{-5}$. During training, 2048 query points are randomly selected based on the weighted sampling strategy for loss calculation and back propagation. The experiments were conducted using PyCharm Community Edition, and the training of ED²IF²-Net was performed on two Nvidia RTX 3090 graphics cards, taking 1 to 3 days, depending on the specific settings. The hyperparameters $\omega_0$, $\omega_1$, and $\omega_2$ in the deformation loss term $L_{Def}$ of the network loss function were fixed at 0.5, 0.25, and 0.25, respectively. The value of 0.5 for $\omega_0$ was chosen to emphasize the influence of the deformed implicit field $g_{st}$ on the final reconstruction results. The training process involved 500 epochs, and the learning rate was adaptively optimized using the Adam optimizer as described above. Further details on the implementation can be found in Appendix A.
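The two training ingredients mentioned above, density-weighted selection of query points and the weighted deformation loss, can be illustrated with a small pure-Python sketch. It is a simplified illustration under stated assumptions, not the authors' PyTorch code: the radius, the clipping bounds `lo` and `hi`, the direct use of clipped densities as sampling weights, the mean over query points, and all function names are hypothetical choices; only the weights (0.5, 0.25, 0.25) and the count of 2048 query points come from the text.

```python
import math
import random

def sample_densities(points, radius, lo=1, hi=50):
    """Density = number of other samples within `radius`, clipped to [lo, hi] (assumed bounds)."""
    densities = []
    for i, p in enumerate(points):
        count = sum(
            1 for j, q in enumerate(points)
            if j != i and math.dist(p, q) <= radius
        )
        densities.append(min(hi, max(lo, count)))
    return densities

def draw_query_indices(weights, n_query=2048, seed=0):
    """Draw query-point indices with probability proportional to their weights."""
    rng = random.Random(seed)
    return rng.choices(range(len(weights)), weights=weights, k=n_query)

def deformation_loss(sdf_gt, g_st, intermediates, omega=(0.5, 0.25, 0.25)):
    """Weighted sum of mean L1 distances; omega[0] weights g_st, omega[j] weights s_j."""
    fields = [g_st, *intermediates]
    if len(fields) != len(omega):
        raise ValueError("need exactly one weight per implicit field")
    n = len(sdf_gt)
    return sum(
        w * sum(abs(f_i - t_i) for f_i, t_i in zip(field, sdf_gt)) / n
        for w, field in zip(omega, fields)
    )
```

With the default weights, an error in the final deformed field $g_{st}$ is penalized twice as strongly as the same error in either intermediate field, which reflects the stated emphasis on $g_{st}$.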

Comparison with SOTA Approaches
Our comparison concentrates on the implicit models that have achieved state-of-the-art results to date, mainly including IM-Net [19], MDISN [25], DISN [24], D²IM-Net [29], and its variant D²IM-Net$_{GL}$. IM-Net is similar to the coarse shape decoder of ED²IF²-Net, and MDISN and D²IM-Net are the corresponding baselines for the deformation decoder and the surface detail decoder, respectively. Moreover, DISN is, to date, the strongest single-view implicit 3D reconstruction method with respect to geometric details. For a fair comparison, all of the above networks were trained and tested on the same benchmarks.
Table 1 presents the quantitative comparison of all the aforementioned methods on ShapeNet. The results indicate that ED²IF²-Net-T and ED²IF²-Net-L outperform the other methods in most object categories, with significantly better mean values for each evaluation metric across the 13 object categories. Notably, ED²IF²-Net-L achieves state-of-the-art quantitative results on ShapeNet, with an IoU of 61.1, CD of 7.26, EMD of 2.51, ECD-3D of 6.08, and ECD-2D of 1.84. Compared to DISN, ED²IF²-Net-L exhibits a 7% increase in mean IoU and decreases of 34%, 12%, 12%, and 26% in mean CD, EMD, ECD-3D, and ECD-2D, respectively. These quantitative results demonstrate that both ED²IF²-Net-T and ED²IF²-Net-L excel not only in overall shape (topological structure) but also in recovering edge details (surface details). It is important to note that ED²IF²-Net may not achieve the best performance in every category, which could be attributed to the network being trained on all categories of ShapeNet. The network's sensitivity to the quantity and diversity of models within a single category may result in slightly inferior reconstruction results for categories with fewer models or predominantly similar models, such as phones.
The qualitative comparison of different methods is presented in Figure 6. From the figure, it is evident that IM-Net can only reconstruct the coarse shape of the object, resulting in a loss of significant topological structure details (such as holes of the sofa backrest and handles of the table drawers) as well as surface details (e.g., chair backrest). Compared to IM-Net, DISN performs better in rebuilding topological structures and surface details, although the results may contain geometric noise leading to blurry surfaces (e.g., sofa backrest surface). However, DISN struggles to recover details at small scales (bottom and backrest of the chair).
While MDISN can reconstruct more detailed topological structures, it fails to recover surface details and even introduces shape distortions to the object (e.g., speaker and table). On the other hand, D²IM-Net shows promise in surface detail recovery but often produces incorrect topological structures for highly curved shapes (e.g., armrests and bottom of the chair) and introduces numerous artifacts (such as the table). In contrast, both ED²IF²-Net-T and ED²IF²-Net-L produce more visually appealing results. They reconstruct more complex topologies (e.g., holes in sofa backrests, handles of table drawers) and capture finer small-scale surface details (e.g., chair backrests). These findings align with the quantitative comparison in Table 1 and validate that ED²IF²-Net can effectively generate high-fidelity 3D shapes with accurate topological structure and surface details.
Table 1. Quantitative comparison of all methods for single-view 3D reconstructions on ShapeNet. Evaluation metrics include IoU (%, the larger the better), CD (×0.001, the smaller the better), EMD (×100, the smaller the better), ECD-3D (×0.01, the smaller the better), and ECD-2D (the smaller the better). CD and EMD are calculated on 2048 sample points. ECD-3D is computed on 20K points. ECD-2D is calculated on the normal maps with a resolution of 224 × 224. Top scores are highlighted in bold and underlined, while the italic one is the second.

Ablation Studies
To validate the effectiveness of the individual components of ED²IF²-Net and the loss functions, extensive qualitative and quantitative ablation studies were carried out. All the networks used in the ablation studies were trained and tested on the chair class of ShapeNet. Specifically, the following network options were designed:
• Option 1: The original PVT encoder is kept in the network, together with the coarse shape decoder (CSD) and a random sampling strategy, and the loss function $L_{Coa}$ is applied. Figure 7 shows that the coarse shape decoder with random sampling can only reconstruct the coarse shape, with few structure details and no surface details. This is consistent with the quantitative results in Table 2.
• Option 2: On the basis of the first option, the network is trained with weighted sampling (WS). Figure 7 shows that WS enables the network to reconstruct more details, especially at small scales.
• Option 3: PVT is still used as the encoder. However, a random signed distance value is directly initialized for each query point and iteratively refined in the deformation decoder (DD). The network is then trained with WS, constrained only by the deformation loss $L_{Def}$. Figure 7 shows that the network without the coarse implicit field reconstructs poor surfaces and topologies. Moreover, quite a few surface artifacts emerge due to the absence of the coarse implicit field near the shape.
• Option 4: The CSD together with the DD serve as the decoders, and only $L_{Coa}$ with WS is used for the loss estimation. Figure 7 shows that such a network creates fewer shape artifacts and distortions, but it still fails to reconstruct the full structure of the shape, because the loss function takes no account of the intermediate implicit fields generated in the iterative deformation.
• Option 5: Based on the previous options, $L_{Coa}$ and $L_{Def}$ are applied to train the CSD and the DD, respectively. WS is also used here. Figure 7 illustrates that the network with this option is capable of reconstructing more accurate topological structures and producing a smoother shape.
• Option 6: The surface detail decoder in ED²IF²-Net with WS omits the prediction of the backward displacement map, and the deformed implicit field is fused only with the forward displacement map. The surface detail decoder in this case is denoted as SDD_S, and the normal case is denoted as SDD_N. Figure 7 shows that, without the backward displacement map, the surface details may be incorrectly reconstructed and distortions may occur at the structural level, possibly because the missing backward displacement map prevents fine-tuning.
• Option 7: Only $L_{Lap}$ is removed from the standard ED²IF²-Net loss functions, and the rest remains unchanged. Figure 7 shows that, in this case, the surface details of the reconstruction cannot be clearly recovered and distortions may be produced.
• Option 8: The encoder in ED²IF²-Net-T is replaced with ResNet18, keeping the rest of the settings fixed. The reconstruction results in Figure 7 contain plenty of artifacts, which may be caused by ResNet18 being slightly inferior to PVT in terms of feature extraction, indicating that PVT is the better encoder for ED²IF²-Net.
• Option 9: All HAMs in the SDD_N of the standard ED²IF²-Net are removed, and the rest of the network settings remain fixed. As shown in Figure 7, the surface details of the reconstructed objects become unclear without the HAM, which is consistent with the quantitative results in Table 2, where this variant leads to an increase in ECD-3D and ECD-2D. These results confirm the effectiveness of HAMs in enhancing surface details.
• Option 10: The DD is removed from the standard ED²IF²-Net and the $L_{Def}$ term is excluded from the loss function to create a variant pipeline similar to D²IM-Net. As shown in Figure 7, the shapes reconstructed by this variant are not comparable to the ones reconstructed by the standard ED²IF²-Net. It is worth noting that the quantitative comparisons in Tables 1 and 2 show that, although the variant (marked in orange) has slightly lower performance than the standard ED²IF²-Net, it still outperforms D²IM-Net, which confirms the superiority of our network pipeline.
• Option 11: To further validate the effectiveness of the deformation decoder (DD) and deformation loss $L_{Def}$ in reconstructing finer topological structures, the DD is added to the network of Option 10 while keeping the other settings unchanged. As shown in Figure 7, this variant generally reconstructs object shapes with more detailed topological structures compared to Option 10. This further demonstrates the contribution of the DD in reconstructing finer topological structures of objects. However, it is worth noting that the variant still struggles to generate visually appealing object shapes compared to the standard ED²IF²-Net.
The visualization and quantitative results of the ablation studies are presented in Figure 7 and Table 2, respectively. Overall, the weighted sampling strategy enables the network to reconstruct small-scale details effectively. Additionally, the deformation decoder, which refines the coarse implicit field, plays a crucial role in capturing the object's topology. The deformation decoder performs best when trained with the deformation loss term $L_{Def}$, while the coarse shape decoder benefits from $L_{Coa}$. The deformed implicit field, derived from the deformation decoder, serves as a solid foundation for reconstructing the object's surface; it is further fused with the forward displacement map generated by the surface detail decoder, trained with $L_{Lap}$, to recover the surface details of the object. Moreover, the backward displacement map from the surface detail decoder compensates for the deformed implicit field, ensuring the correct topology reconstruction. Furthermore, compared to using ResNet18 as an encoder, the standard ED²IF²-Net achieves higher-fidelity results. Importantly, the presence of the deformation decoder and the utilization of the deformation loss term $L_{Def}$ enable ED²IF²-Net to reconstruct finer topological structures. Furthermore, the PVT architecture, which generates multi-scale hierarchical local features, is more suitable as an image encoder for ED²IF²-Net.

Computational Complexity
In addition to the qualitative and quantitative experiments described above, we also provide the computational complexity of the various methods in Table 3, specifically in terms of training time and inference time. To ensure a fair comparison, all models were trained and tested using the same settings.
As shown in the table, ED

Surface Detail Transfer
Surface detail transfer is defined as the fusion of the disentangled enhanced displacement field of a source object with the deformed implicit field of another target object. In this application, the specified surface details can be transferred, and Figure 9 shows examples of surface detail transfer between different objects.

Pasting a Logo
We propose that a logo can be pasted on the target object image, and the modified image is then used to generate a model with the logo. Figure 10 shows examples of pasting a logo on a model.
In the previous experiments and applications, the images utilized were acquired using a standard camera sensor model, which allowed for capturing images without significant distortion. However, in various industries such as drone aerial photography, security surveillance, and automotive, wide-angle and fisheye imaging sensors are extensively employed. These sensors typically have a field of view (FOV) greater than 100 degrees, which is considerably larger than that of standard camera sensors. Hence, in this section, we primarily focus on discussing the effects of images captured by wide-angle and fisheye imaging sensors on the performance of ED²IF²-Net.
There are existing works [62][63][64] that utilize images captured by wide-angle or fisheye sensors for 3D reconstruction and other related tasks. For instance, Ma et al. [62] proposed a specific model for fisheye sensors and introduced sparse and dense multi-view 3D reconstruction methods based on this model. Strecha et al. [63] performed 3D reconstruction using images captured by fisheye sensors and standard lens models, respectively, employing the Pix4Dmapper software. Kakani et al. [64] proposed a self-calibration method for wide-angle and fisheye cameras to correct the captured images, allowing for their utilization in 3D reconstruction and other tasks.
In general, wide-angle sensors can capture images with a larger field of view compared to standard lenses, but they often introduce perspective distortion. This distortion can alter the shape of objects in the image, making it challenging for ED²IF²-Net trained on images acquired from the standard lens model to reconstruct high-fidelity object shapes. On the other hand, fisheye camera sensors can capture images with an extremely wide field of view but introduce barrel distortion, which causes even more severe distortion of objects in the image. Consequently, ED²IF²-Net faces difficulties in reconstructing accurate object shapes from such distorted images.
To mitigate the effects of perspective distortion and barrel distortion caused by wide-angle and fisheye camera sensors on ED²IF²-Net, pre-processing techniques such as camera calibration [64] and image correction [65] can be employed to reduce the degree of image distortion. Another approach is to consider training ED²IF²-Net on publicly available datasets of images captured by wide-angle sensor models and fisheye sensor models, enabling the network to learn about the different distortions using its powerful feature extraction and learning capabilities.

Limitations and Future Works
The proposed method has two main limitations. Firstly, although the surface detail decoder enhances surface information, some reconstructed object shapes, such as the speaker in Figure 6, lack prominent surface detail. This limitation may be attributed to the introduction of redundant features during the implicit field deformation procedure. To address this, future studies should explore adaptive neglect of unnecessary local features as an attractive direction for improvement. Secondly, while ED²IF²-Net outperforms similar methods in terms of inference speed and performance, it is not specifically designed for real-time 3D reconstruction. This may pose challenges for systems that require real-time reconstruction. To tackle this issue, we plan to leverage a sparse sphere rendering algorithm [33,66] to accelerate inference speed. Additionally, we aim to explore more advanced transformers, such as Swin Transformer V2 [67], to enhance the feature extraction capability of ED²IF²-Net.
In future work, we will optimize the proposed framework for embedded platforms, considering the following aspects: (1) reducing model parameters and computational complexity by minimizing the number of layers in the Pyramid Vision Transformer or reducing the number of channels in the deformation decoder's convolutional layer while maintaining performance; (2) improving the readout speed of implicit fields and displacement fields by utilizing more efficient data structures, such as hash tables, for data storage; (3) optimizing the training and prediction process through techniques such as distillation [68]; (4) deploying the framework on native embedded platforms to reduce communication and latency.

Conclusions
In this paper, we introduce ED²IF²-Net, the first single-view 3D reconstruction network based on the Pyramid Vision Transformer. Our network disentangles objects' implicit fields into deformed implicit fields and enhanced displacement fields. IFDBs refine the coarse implicit fields by analyzing pixel-aligned local features across scales, capturing finer topological structure details in the deformed implicit fields. Moreover, we enhance the displacement fields in both spatial and channel dimensions to preserve surface details.
By employing a novel deformation loss and Laplacian loss, ED²IF²-Net achieves high-fidelity reconstruction, capturing both the structure and surface details of objects. On the ShapeNet dataset, ED²IF²-Net delivers superior performance, with ED²IF²-Net-L achieving the best mean IoU, CD, EMD, ECD-3D, and ECD-2D values of 61.1, 7.26, 2.51, 6.08, and 1.84, respectively.
Compared to other methods, ED²IF²-Net excels in reconstructing finer topological structures while preserving enhanced surface details. It overcomes the limitations of alternative approaches that may compromise surface details or yield incorrect topology, resulting in higher-quality reconstructions.
Our research represents a significant milestone in single-view implicit 3D reconstruction. We propose the first transformer-based single-view implicit 3D reconstruction network, opening up new possibilities for solving such tasks using transformers. ED²IF²-Net achieves state-of-the-art performance on the ShapeNet dataset while maintaining competitive inference time. The proposed IFDB and deformation loss can be readily applied to future works, enabling better reconstruction results in single-view implicit 3D reconstruction. The disentangled deformed implicit fields and enhanced displacement fields in our network benefit downstream applications, including surface detail transfer and pasting a logo. Furthermore, our framework can be optimized for embedded platforms, shedding new light on industrial applications such as VR/AR. Beyond real-time rendering challenges, the framework holds promise for industries such as robotics and autonomous driving.

Figure 1. The overall pipeline of the proposed ED²IF²-Net, where $P$ is a 3D query point and $\pi(\cdot)$ represents the operation of projecting a 3D spatial query point to an image. PVT Enc means Pyramid Vision Transformer encoder. Coa Dec, Def Dec, and Sur Dec denote Coarse Shape Decoder, Deformation Decoder, and Surface Detail Decoder, respectively. ED²IF²-Net first extracts a global vector and the local features of the input image via a Pyramid Vision Transformer encoder. The global vector is used in a coarse shape decoder to predict a coarse implicit field, which is then iteratively refined by a deformation decoder to obtain a deformed implicit field with finer structure details using multiple implicit field deformation blocks (IFDBs). A surface detail decoder with hybrid attention modules (HAMs) uses local features to recover an enhanced displacement field. The final output of ED²IF²-Net is a fusion of the deformed implicit field and the enhanced displacement field. Four combined loss terms are applied to reconstruct the coarse implicit field, structure details, overall shape, and surface details.

Figure 2. An illustrative description of our disentanglement. ED²IF²-Net disentangles the ground-truth SDF of the chair into a deformed implicit field and an enhanced displacement field (visible surface), where the deformed implicit field is obtained by refining the coarse implicit field of the object. The red arrows in the deformed implicit field represent the deformation function $f$ from the coarse implicit field (green part) to the deformed implicit field (containing most of the topological structures of the object).

p_s ← Find_Symmetry_Point_Projection(P_s)  // Find the projection of the symmetry point of the query point P on the image
for i = 0 → 3 do
    res ← 224

Figure 3. Architecture of the deformation decoder, where $s_j$ and $c_j$ represent the intermediate implicit field and the state code of the $j$-th IFDB output, respectively.

Figure 4. Illustrations of the $j$-th IFDB, where Concat means concatenation operation.

Figure 5. Architecture of the surface detail decoder.

Figure 6. Qualitative comparison of various methods for single-view 3D reconstruction on ShapeNet.

Figure 7. Visualization of the qualitative ablation studies of ED²IF²-Net-T. It is best viewed magnified on the screen.

Compared to other conventional transformers such as DeiT, PVT is the more suitable image encoder for ED²IF²-Net. Lastly, in the surface detail decoder, the HAM module proves more effective than CBAM in improving the model's performance in recovering surface details and in ensuring the reconstruction of a correct topological structure.
The proposed ED²IF²-Net improves on D²IM-Net in several respects. Specifically, ED²IF²-Net significantly improves the network's ability to extract features by using PVT instead of ResNet18. Moreover, ED²IF²-Net iteratively refines the coarse implicit field via the deformation decoder with $L_{Def}$ to reconstruct finer topological structure details of the object, and employs HAM to enhance surface details, instead of predicting only the coarse implicit field and the ordinary displacement field as in D²IM-Net. Finally, when the deformation decoder and $L_{Def}$ are removed from the standard ED²IF²-Net, the quantitative results achieved by the network still outperform those of D²IM-Net.

Applications

Test on Online Product Images
ED²IF²-Net, after being trained on the rendered RGB images, can be further tested on online product images without ground-truth shapes. The qualitative reconstruction results of ED²IF²-Net for online product images are presented in Figure 8. This application demonstrates the generalization capability of ED²IF²-Net.

Figure 8. Examples of reconstruction from online images through ED²IF²-Net.

Figure 9. Two examples of surface detail transfer using ED²IF²-Net, where the backrest details of the source chair are transferred.

Figure 10. Examples of pasting a logo using ED²IF²-Net.

D²IM-Net [29] provides similar applications. A quantitative comparison of the different applications of D²IM-Net and ED²IF²-Net is shown in Table 4. It should be noted that, as there are no ground-truth models for the generated objects, the corresponding ground-truth models were created by traditional manual modeling for the generated objects shown, and the mean values of the evaluation metrics obtained by both D²IM-Net and ED²IF²-Net were computed. It can be observed from Table 4 that the proposed ED²IF²-Net achieves more promising performance than D²IM-Net in downstream applications, with ED²IF²-Net-L reaching state-of-the-art performance in these applications, further illustrating the superiority of ED²IF²-Net over D²IM-Net.

Figure A2. Performance of the models with different learning rates, with batch_size set to 16 and other settings kept fixed. Evaluation metrics are IoU, CD, EMD, ECD-3D, and ECD-2D.

Table 1 columns: Bench, Box, Car, Chair, Display, Lamp, Speaker, Rifle, Sofa, Table, Phone, Boat, Mean.
This observation emphasizes the importance of the deformation loss $L_{Def}$ in the reconstruction process.
• Option 12: To further demonstrate the superiority of the proposed method in feature extraction, the PVT-Tiny and PVT-Large image encoders in the standard ED²IF²-Net-T and ED²IF²-Net-L are replaced with DeiT-Tiny and DeiT-Base [37], respectively, while keeping the other settings unchanged. The qualitative results are presented in Figure 7. With the DeiT encoders, the network tends to reconstruct inferior results, which exhibit poor topological structure and surface details. This further confirms the effectiveness of ED²IF²-Net in feature extraction.
• Option 13: To further validate the effectiveness of HAM in the surface detail decoder for enhancing surface detail representation, all HAMs in the surface detail decoder of the standard model are replaced with CBAMs [45], while keeping the other settings unchanged. The qualitative reconstruction results are depicted in Figure 7. In comparison to the standard ED²IF²-Net, the variant encounters challenges in capturing and recovering clear surface details of the object, resulting in the presence of artifacts around the shape. This option further demonstrates that HAM is more effective than CBAM in enhancing the capability of ED²IF²-Net to handle surface details.
• Option 14: The standard ED²IF²-Net proposed in this paper, including all components and the loss function with WS.

Table 2 columns: Tiny, ResNet18, DeiT-Tiny, PVT-Large, DeiT-Base, CSD, DD, SDD_S, SDD_N, HAM, CBAM, WS, $L_{Coa}$, $L_{Def}$, $L_{Ove}$, $L_{Lap}$, IoU↑, CD↓, EMD↓, ECD-3D↓, ECD-2D↓.
ED²IF²-Net-T achieves the fastest training speed, with a training time of 47 h. Similarly, ED²IF²-Net-L has a relatively short training time of 66 h compared to most other models. In terms of inference time, both ED²IF²-Net-T and ED²IF²-Net-L outperform other methods, with inference times of 97.64 ms and 144.09 ms, respectively. The above comparison of computational complexity highlights the advantages of the proposed ED²IF²-Net in terms of faster training speed and shorter inference time.

Table 3. Training time and inference time. All training times as well as inference times are obtained with the same settings, where the inference times are tested with a batch_size of 1.

Table 4. Quantitative results of D²IM-Net and ED²IF²-Net for various applications. Evaluation metrics also include IoU (%), CD (×0.001), EMD (×100), ECD-3D (×0.01), and ECD-2D. The best results for each application are highlighted in bold and underlined, while the italic one is the second.

Discussion about the Effects of Camera Sensor Type on ED²IF²-Net

• Dataset compatibility: The majority of images in existing publicly available 3D reconstruction datasets are based on a resolution of 224 × 224. Therefore, selecting RGB images with a resolution of 224 × 224 as input ensures better alignment with the dataset, leading to improved training efficacy of the network.
• Resource constraints: Higher resolution images as input increase computational and memory requirements, resulting in longer training times and higher hardware demands. By opting for RGB images with a resolution of 224 × 224 as input, computational resource consumption is reduced while maintaining higher performance levels.
• Information preservation: 3D reconstruction involves processing and analyzing input images to extract relevant features. By choosing RGB images with a resolution of 224 × 224 as input, more detailed information can be preserved, resulting in enhanced 3D reconstruction performance.