# Three-Dimensional Reconstruction from a Single RGB Image Using Deep Learning: A Review


## Abstract


## 1. Introduction

- An overview of the neural networks proposed for monocular 3D object reconstruction in the last five years.

- A summary of the major 3D datasets that are used by the discussed neural networks.
- A description of common metrics used to evaluate the 3D reconstruction algorithms.
- A comparison of the performance of these methods using different evaluation metrics.

## 2. Related Work

## 3. Representing Shape in 3D

#### 3.1. Depth Map

#### 3.2. Normal Map

#### 3.3. Point Cloud

#### 3.4. 3D Mesh

#### 3.5. Voxel

## 4. Datasets

#### 4.1. Real Textureless Surfaces Data

#### 4.2. Synthetic 3D Point Cloud Data

#### 4.3. ShapeNet

#### 4.4. R2N2 Dataset

## 5. Networks

#### 5.1. Bednarik et al.

#### 5.2. Patch-Net

#### 5.3. HDM-Net

**Figure 4.** Patch-Net uses Bednarik et al. [3]’s network with only depth and normal decoders. The input image is divided into overlapping patches, and predictions for each patch are obtained separately. Patch predictions are stitched to form the complete depth and normal maps.
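As a rough illustration of this patch pipeline, the splitting and stitching steps can be sketched in numpy. The patch size, stride, and averaging of overlapping regions are assumptions for illustration, not Patch-Net’s exact scheme:

```python
import numpy as np

def split_into_patches(img, patch=128, stride=64):
    """Slide a square window over the image, returning overlapping patches
    together with their top-left coordinates."""
    H, W = img.shape[:2]
    patches = []
    for y in range(0, max(H - patch, 0) + 1, stride):
        for x in range(0, max(W - patch, 0) + 1, stride):
            patches.append((img[y:y + patch, x:x + patch], (y, x)))
    return patches

def stitch_predictions(preds, out_shape, patch=128):
    """Accumulate per-patch predictions into the full map, averaging
    wherever patches overlap (a simple blending assumption)."""
    acc = np.zeros(out_shape, dtype=np.float64)
    cnt = np.zeros(out_shape, dtype=np.float64)
    for pred, (y, x) in preds:
        acc[y:y + patch, x:x + patch] += pred
        cnt[y:y + patch, x:x + patch] += 1.0
    return acc / np.maximum(cnt, 1.0)
```

With an identity “predictor”, stitching the patches of a depth map reproduces the original map exactly, which is a quick sanity check of the split/stitch bookkeeping.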

#### 5.4. IsMo-GAN

#### 5.5. Pixel2Mesh

#### 5.6. Salvi et al.

#### 5.7. VANet

#### 5.8. 3D-VRVT

| Literature | Architecture | Output | Method | Dataset Type | Runtime [s] |
|---|---|---|---|---|---|
| Bednarik et al. [3] (2018) | VAE with one encoder and three decoders | normal map, depth map, and 3D mesh | based on SegNet [23] with VGG-16 [24] backbone | real, deformable, textureless surfaces | 0.016 |
| Patch-Net [4] (2019) | VAE with one encoder and two decoders | normal and depth maps | converts image to patches, gets 3D shape of patches using [3], and stitches them together | real, deformable, textureless surfaces | - |
| Hybrid Deformation Model Network (HDM-Net) [5] (2018) | VAE with one encoder and one decoder | 3D point cloud | simple autoencoder with skip connections like ResNet [34], combines 3D regression loss with an isometry prior and a contour loss | synthetic, deformable, well-textured surfaces | 0.005 |
| Isometry-Aware Monocular Generative Adversarial Network (IsMo-GAN) [6] (2019) | GAN with two sequential VAEs as a generator and a simple CNN as discriminator | 3D point cloud | integrates an OD-Net to segment the foreground, and trains in an adversarial setting along with the 3D loss and isometry prior from [5] | synthetic, deformable, well-textured surfaces and real, deformable, textureless surfaces | 0.004 |
| Pixel2Mesh [7] (2018) | two-lane network with a feature extractor and a graph-based mesh predictor (GCN) | 3D mesh | feature extractor based on VGG-16 [24], feeds cascaded features to the GCN that uses graph convolutions | synthetic, rigid, well-textured surfaces | 0.016 |
| View Attention Guided Network (VANet) [9] (2021) | two-lane feature extractor for both single- and multi-view reconstruction, followed by a mesh prediction network | 3D mesh | uses channel-wise attention and information from all available views to extract features, which are then sent to a Pixel2Mesh-based [7] mesh vertex predictor | synthetic, rigid, well-textured surfaces | - |
| Salvi et al. [8] (2020) | VAE based on ONets [39] with self-attention [37] in encoder | parametric representation | ResNet-18 [34] encoder with self-attention modules, followed by a decoder and an occupancy function | synthetic, rigid, well-textured surfaces | - |
| 3D-VRVT [41] (2021) | encoder-decoder architecture | voxel grid | encoder based on Vision Transformers [40], followed by a decoder made up of 3D deconvolutions | both synthetic and real, rigid, well-textured surfaces | 0.009 |

## 6. Comparison

#### 6.1. Metrics

- Depth Error (${\mathcal{E}}_{D}$): The depth error metric measures the accuracy of depth map predictions. Let ${\Theta}_{\mathbf{K}}$ and ${\Theta}_{\mathbf{K}}^{\prime}$ be the point clouds associated with the predicted and ground-truth depth maps respectively, under the camera matrix $\mathbf{K}$. To remove the inherent global scale ambiguity [42] in the prediction, ${\Theta}_{\mathbf{K}}$ is first aligned to the ground-truth depth map ${\mathbf{D}}^{\prime}$ to get an aligned point cloud $${\overline{\Theta}}_{\mathbf{K}}=\Omega ({\Theta}_{\mathbf{K}},{\mathbf{D}}^{\prime}),$$ and the error is then averaged over the $N$ samples as $${\mathcal{E}}_{D}=\frac{1}{N}\sum _{n=1}^{N}\frac{{\sum}_{i}{\parallel {\Theta}_{\mathbf{K}}^{\prime}-{\overline{\Theta}}_{\mathbf{K}}\parallel}_{i}\,{\mathbf{B}}_{i}^{n}}{{\sum}_{i}{\mathbf{B}}_{i}^{n}}.$$ The foreground mask $\mathbf{B}$ ensures that the error is only calculated for foreground pixels. Smaller depth errors are preferred.
- Mean Angular Error (${\mathcal{E}}_{MAE}$): The mean angular error ${\mathcal{E}}_{MAE}$ metric is used to calculate the accuracy of normal maps, by computing the average difference between the predicted and ground-truth normal vectors. The angular errors for all samples are calculated using Equation (3), and then averaged for all samples. Smaller angular errors indicate better predictions.
- Volumetric IoU (${\mathcal{E}}_{IOU}$): The Intersection over Union (IoU) metric for meshes is calculated as the volume of the intersection of ground-truth and predicted meshes, divided by the volume of their union. Larger values are better.
- Chamfer Distance (${\mathcal{E}}_{CD}$): Chamfer distance is a measure of similarity between two point clouds. It takes the distance of each point into account by finding, for each point in a point cloud, the nearest point in the other cloud, and summing their squared distances.$${\mathcal{E}}_{CD}=\frac{1}{|\Theta |}\sum _{x\in \Theta}\underset{y\in {\Theta}^{\prime}}{min}{\parallel x-y\parallel}^{2}+\frac{1}{|{\Theta}^{\prime}|}\sum _{x\in {\Theta}^{\prime}}\underset{y\in \Theta}{min}{\parallel x-y\parallel}^{2}$$
- Chamfer-L1 (${\mathcal{E}}_{C{D}_{1}}$): The Chamfer distance (CD) has a high computational cost for meshes because of the large number of points, so an approximation called Chamfer-L1 is used instead. It replaces the squared Euclidean distance with the L1-norm [8]. Smaller values are preferred.
- Normal Consistency (${\mathcal{E}}_{NC}$): The normal consistency score is defined as the average absolute dot product of the normals in one mesh and the normals at the corresponding nearest neighbors in the other mesh. It is computed similarly to Chamfer-L1, but with the L1-norm replaced by the dot product of the normal vectors on one mesh with their projections on the other mesh [8]. Normal consistency shows how similar the shapes of two volumes are, and is useful in cases where two meshes overlap significantly, giving a high IoU, but have different surface shapes. Higher normal consistency is preferred.
- Earth Mover’s Distance (${\mathcal{E}}_{EMD}$): The Earth Mover’s Distance computes the cost of transforming one pile of dirt, or one probability distribution, into another. It was introduced in [44] as a metric for image retrieval. In the case of 3D reconstruction, it computes the cost of transforming the set of predicted vertices into the ground-truth vertices. The lower the cost, the better the prediction.
- F-score (${\mathcal{E}}_{F}$): The F-score evaluates the distance between object surfaces [2,45]. It is defined as the harmonic mean of precision and recall. Precision measures reconstruction accuracy as the percentage of predicted points that lie within a certain distance of the ground truth. Recall measures completeness as the percentage of ground-truth points that lie within a certain distance of the prediction. The distance threshold $\tau $ can be varied to control the strictness of the F-score. In the results reported in this paper, $\tau ={10}^{-4}$.
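For concreteness, the point-based metrics above can be sketched with brute-force nearest-neighbor search. This is an illustrative numpy sketch only; published evaluations typically use KD-trees and each paper’s own normalization and thresholds:

```python
import numpy as np

def chamfer_distance(P, Q):
    """Symmetric Chamfer distance between point clouds P (N,3) and Q (M,3):
    mean squared distance from each point to its nearest neighbor in the
    other cloud, summed over both directions."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)  # (N, M) pairwise
    return (d.min(axis=1) ** 2).mean() + (d.min(axis=0) ** 2).mean()

def f_score(P, Q, tau=0.01):
    """F-score at threshold tau: harmonic mean of precision (fraction of
    predicted points within tau of the ground truth) and recall (fraction
    of ground-truth points within tau of the prediction)."""
    d = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    precision = (d.min(axis=1) < tau).mean()
    recall = (d.min(axis=0) < tau).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def voxel_iou(a, b):
    """Volumetric IoU between two boolean occupancy grids."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 1.0
```

A perfect prediction gives a Chamfer distance of 0, an F-score of 1, and an IoU of 1, which matches the preferred directions (↓, ↑, ↑) used in the tables below.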

#### 6.2. Experiments

## 7. Discussion

## 8. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Abbreviations

| Abbreviation | Meaning |
|---|---|
| 2D | Two-Dimensional |
| 3D | Three-Dimensional |
| CAD | Computer-Aided Design |
| CNN | Convolutional Neural Network |
| GAN | Generative Adversarial Network |
| NLP | Natural Language Processing |
| RGB | Red-Green-Blue |
| RGB-D | Red-Green-Blue-Depth |
| RNN | Recurrent Neural Network |
| VAE | Variational Autoencoder |
| VOC | Visual Object Classes |

## References

- Bautista, M.A.; Talbott, W.; Zhai, S.; Srivastava, N.; Susskind, J.M. On the generalization of learning-based 3d reconstruction. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Online, 5–9 January 2021; pp. 2180–2189. [Google Scholar]
- Tatarchenko, M.; Richter, S.R.; Ranftl, R.; Li, Z.; Koltun, V.; Brox, T. What do single-view 3d reconstruction networks learn? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 3405–3414. [Google Scholar]
- Bednarik, J.; Fua, P.; Salzmann, M. Learning to reconstruct texture-less deformable surfaces from a single view. In Proceedings of the 2018 International Conference on 3D Vision (3DV), Verona, Italy, 5–8 September 2018; pp. 606–615. [Google Scholar]
- Tsoli, A.; Argyros, A.A. Patch-Based Reconstruction of a Textureless Deformable 3D Surface from a Single RGB Image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, Seoul, Korea, 27 October–2 November 2019. [Google Scholar]
- Golyanik, V.; Shimada, S.; Varanasi, K.; Stricker, D. HDM-Net: Monocular Non-Rigid 3D Reconstruction with Learned Deformation Model. arXiv **2018**, arXiv:1803.10193. [Google Scholar]
- Shimada, S.; Golyanik, V.; Theobalt, C.; Stricker, D. IsMo-GAN: Adversarial Learning for Monocular Non-Rigid 3D Reconstruction. arXiv **2019**, arXiv:1904.12144. [Google Scholar]
- Wang, N.; Zhang, Y.; Li, Z.; Fu, Y.; Liu, W.; Jiang, Y.G. Pixel2mesh: Generating 3d mesh models from single rgb images. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 52–67. [Google Scholar]
- Salvi, A.; Gavenski, N.; Pooch, E.; Tasoniero, F.; Barros, R. Attention-based 3D Object Reconstruction from a Single Image. In Proceedings of the 2020 International Joint Conference on Neural Networks (IJCNN), Glasgow, UK, 19–24 July 2020; pp. 1–8. [Google Scholar]
- Yuan, Y.; Tang, J.; Zou, Z. Vanet: A View Attention Guided Network for 3d Reconstruction from Single and Multi-View Images. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6. [Google Scholar] [CrossRef]
- Chang, A.X.; Funkhouser, T.; Guibas, L.; Hanrahan, P.; Huang, Q.; Li, Z.; Savarese, S.; Savva, M.; Song, S.; Su, H.; et al. ShapeNet: An Information-Rich 3D Model Repository. arXiv **2015**, arXiv:1512.03012. [Google Scholar]
- Zollhöfer, M.; Thies, J.; Garrido, P.; Bradley, D.; Beeler, T.; Pérez, P.; Stamminger, M.; Nießner, M.; Theobalt, C. State of the Art on Monocular 3D Face Reconstruction, Tracking, and Applications. Comput. Graph. Forum **2018**, 37, 523–550. [Google Scholar] [CrossRef]
- Yuniarti, A.; Suciati, N. A review of deep learning techniques for 3D reconstruction of 2D images. In Proceedings of the 2019 12th International Conference on Information & Communication Technology and System (ICTS), Surabaya, Indonesia, 18 July 2019; pp. 327–331. [Google Scholar]
- Han, X.F.; Laga, H.; Bennamoun, M. Image-based 3D object reconstruction: State-of-the-art and trends in the deep learning era. IEEE Trans. Pattern Anal. Mach. Intell. **2019**, 43, 1578–1604. [Google Scholar] [CrossRef] [PubMed]
- Laga, H. A survey on deep learning architectures for image-based depth reconstruction. arXiv **2019**, arXiv:1906.06113. [Google Scholar]
- Liu, C.; Kong, D.; Wang, S.; Wang, Z.; Li, J.; Yin, B. Deep3D reconstruction: Methods, data, and challenges. Front. Inf. Technol. Electron. Eng. **2021**, 22, 652–672. [Google Scholar] [CrossRef]
- Maxim, B.; Nedevschi, S. A survey on the current state of the art on deep learning 3D reconstruction. In Proceedings of the 2021 IEEE 17th International Conference on Intelligent Computer Communication and Processing (ICCP), Cluj-Napoca, Romania, 28–30 October 2021; pp. 283–290. [Google Scholar]
- Fu, K.; Peng, J.; He, Q.; Zhang, H. Single image 3D object reconstruction based on deep learning: A review. Multimed. Tools Appl. **2021**, 80, 463–498. [Google Scholar] [CrossRef]
- Choy, C.B.; Xu, D.; Gwak, J.; Chen, K.; Savarese, S. 3D-R2N2: A Unified Approach for Single and Multi-view 3D Object Reconstruction. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar]
- Blender Online Community. Blender—A 3D Modelling and Rendering Package; Blender Foundation, Stichting Blender Foundation: Amsterdam, The Netherlands, 2018. [Google Scholar]
- Miller, G.A. WordNet: A Lexical Database for English. Commun. ACM **1995**, 38, 39–41. [Google Scholar] [CrossRef]
- Griffiths, D.; Boehm, J. A review on deep learning techniques for 3D sensed data classification. Remote Sens. **2019**, 11, 1499. [Google Scholar] [CrossRef]
- ShapeNet Research Team. About ShapeNet. Available online: https://shapenet.org/about (accessed on 30 May 2022).
- Badrinarayanan, V.; Handa, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Robust Semantic Pixel-Wise Labelling. arXiv **2015**, arXiv:1505.07293. [Google Scholar]
- Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv **2014**, arXiv:1409.1556. [Google Scholar]
- Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
- Chollet, F. Keras, 2015. GitHub. Available online: https://github.com/fchollet/keras (accessed on 31 July 2022).
- Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M.; et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015. Available online: tensorflow.org (accessed on 31 July 2022).
- Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv **2015**, arXiv:1505.04597. [Google Scholar]
- Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L.; et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32; Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024–8035. [Google Scholar]
- Otsu, N. A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybern. **1979**, 9, 62–66. [Google Scholar] [CrossRef]
- Suzuki, S. Topological structural analysis of digitized binary images by border following. Comput. Vision Graph. Image Process. **1985**, 30, 32–46. [Google Scholar] [CrossRef]
- Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27. [Google Scholar]
- Bronstein, M.M.; Bruna, J.; LeCun, Y.; Szlam, A.; Vandergheynst, P. Geometric deep learning: Going beyond euclidean data. IEEE Signal Process. Mag. **2017**, 34, 18–42. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv **2015**, arXiv:1512.03385. [Google Scholar]
- Fan, H.; Su, H.; Guibas, L.J. A point set generation network for 3d object reconstruction from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 605–613. [Google Scholar]
- Oh Song, H.; Xiang, Y.; Jegelka, S.; Savarese, S. Deep Metric Learning via Lifted Structured Feature Embedding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 4004–4012. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NIPS 2017); Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
- Zhang, H.; Goodfellow, I.; Metaxas, D.; Odena, A. Self-attention generative adversarial networks. In Proceedings of the International Conference on Machine Learning, Long Beach, CA, USA, 9–15 June 2019; pp. 7354–7363. [Google Scholar]
- Mescheder, L.; Oechsle, M.; Niemeyer, M.; Nowozin, S.; Geiger, A. Occupancy networks: Learning 3d reconstruction in function space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4460–4470. [Google Scholar]
- Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv **2020**, arXiv:2010.11929. [Google Scholar]
- Li, X.; Kuang, P. 3D-VRVT: 3D Voxel Reconstruction from A Single Image with Vision Transformer. In Proceedings of the 2021 International Conference on Culture-Oriented Science & Technology (ICCST), Beijing, China, 18–21 November 2021; pp. 343–348. [Google Scholar]
- Eigen, D.; Puhrsch, C.; Fergus, R. Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27. [Google Scholar]
- Stegmann, M.B.; Gomez, D.D. A brief introduction to statistical shape analysis. In Informatics and Mathematical Modelling; Technical University of Denmark: Kongens Lyngby, Denmark, 2002; Volume 15. [Google Scholar]
- Rubner, Y.; Tomasi, C.; Guibas, L.J. The earth mover’s distance as a metric for image retrieval. Int. J. Comput. Vis. **2000**, 40, 99–121. [Google Scholar] [CrossRef]
- Knapitsch, A.; Park, J.; Zhou, Q.Y.; Koltun, V. Tanks and temples: Benchmarking large-scale scene reconstruction. ACM Trans. Graph. (ToG) **2017**, 36, 1–13. [Google Scholar] [CrossRef]

**Figure 1.** Interest in using deep learning-based methods for 3D reconstruction is reflected in the number of publications on ScienceDirect matching the keywords “3d reconstruction” AND “deep learning”, which have been growing exponentially since 2015.

**Figure 2.** Examples of images in the datasets of Bednarik et al. [3] and Golyanik et al. [5], which are used to evaluate some of the networks in this paper. (**a**) The textureless surfaces dataset [3] contains RGB images and corresponding normal and depth maps for 5 different real objects. (**b**) The synthetic point cloud dataset of Golyanik et al. [5] has a deforming thin plate rendered with 4 different textures under 5 different illuminations.

**Figure 3.** The textureless surface reconstruction network [3] (**left**) consists of an encoder that takes a masked image ${\mathbf{I}}_{m}^{n}$ as input and outputs a latent representation $\Lambda $. This is followed by three parallel decoders ${\Phi}_{N},{\Phi}_{D},$ and ${\Phi}_{C}$ that use $\Lambda $ to reconstruct the normal map, depth map, and 3D mesh, respectively. The indices of all maxpool operations in the encoder are saved when downsampling (**right**). These indices are later used for non-linear upsampling in the corresponding decoder layers.
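The index-saving upsampling used by this SegNet-style encoder-decoder can be illustrated with a minimal numpy sketch. The 2×2 window size and single-channel layout are simplifying assumptions; in the real network this is applied per feature channel:

```python
import numpy as np

def maxpool_with_indices(x, k=2):
    """k x k max-pooling over a 2D array that also records the flat index
    (into x) of each window's maximum, as the encoder does."""
    H, W = x.shape
    pooled = np.zeros((H // k, W // k))
    idx = np.zeros((H // k, W // k), dtype=np.int64)
    for i in range(H // k):
        for j in range(W // k):
            win = x[i * k:(i + 1) * k, j * k:(j + 1) * k]
            a = win.argmax()                      # position within the window
            pooled[i, j] = win.flat[a]
            idx[i, j] = (i * k + a // k) * W + (j * k + a % k)  # flat index in x
    return pooled, idx

def unpool_with_indices(pooled, idx, out_shape):
    """Non-linear upsampling: place each pooled value back at the exact
    location it came from; all other positions stay zero."""
    out = np.zeros(out_shape)
    out.flat[idx.ravel()] = pooled.ravel()
    return out
```

Because the decoder reuses the encoder’s argmax locations instead of interpolating, sharp spatial detail survives the downsampling/upsampling round trip, which is the motivation for this design in SegNet [23].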

**Figure 5.** Overview of the HDM-Net [5] architecture. It has an encoder that takes an RGB image of size $224\times 224\times 3$ and encodes it into a latent representation of size $28\times 28\times 128$. This is then used by the decoder to reconstruct a 3D point cloud of the surface with ${73}^{2}$ points.

**Figure 6.** Overview of IsMo-GAN [6]. The generator network accepts a masked RGB image, segmented by the object detection network (OD-Net), and returns a 3D point cloud. The output and ground truth are fed to the discriminator, which serves as a surface regularizer.

**Figure 7.** The Pixel2Mesh [7] network consists of two parallel networks that take an RGB image and a coarse ellipsoid 3D mesh, and learn to regress the 3D shape of the object in the image. The key contribution is the graph-based convolutions and unpooling operators in the bottom half of the network.
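A graph convolution over mesh vertices combines each vertex’s own features with those aggregated from its neighbors along mesh edges. The sketch below is one common formulation for illustration; the actual Pixel2Mesh layer also normalizes the neighbor aggregation and uses learned weight matrices:

```python
import numpy as np

def adjacency_from_edges(edges, n):
    """Build a symmetric adjacency matrix for a mesh with n vertices."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

def graph_conv(X, A, W0, W1):
    """One graph-convolution layer: each vertex's output mixes its own
    features (via W0) with the sum of its neighbors' features (via W1)."""
    return X @ W0 + (A @ X) @ W1
```

Stacking such layers lets information propagate across the mesh, so a vertex’s predicted offset can depend on image features pooled at increasingly distant vertices.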

**Figure 9.** Overview of VANet [9], a unified approach for both single- and multi-view reconstruction with a two-branch architecture.

**Figure 10.** 3D-VRVT takes one image as input and uses a Vision Transformer encoder to extract a feature vector. This is then fed to a decoder that outputs the voxel representation of the object.

**Table 1.** Summary of the major 3D datasets in this paper. Most of the datasets contain textured surfaces and are generated from synthetic 3D models. Real datasets captured directly with 3D sensors are less common and smaller in size because of the difficulty associated with obtaining them.

| Dataset | Type | Surface | Texture | Groundtruth | Size |
|---|---|---|---|---|---|
| Bednarik et al. [3] | Real | Deformable | No | Depth and Normal Maps | 26,445 |
| Golyanik et al. [5] | Synthetic | Deformable | Yes | Point Clouds | 4648 |
| ShapeNet [10] | Synthetic | Mixed | Yes | 3D Mesh | >300 M |
| R2N2 [18] | Synthetic | Rigid | Yes | 3D Meshes and Voxelized Models | 50,000 |

**Table 2.** Summary of objects in the textureless surfaces dataset [3]. Sequences of data samples were captured using a Kinect device at 5 FPS with varying lighting conditions across sequences.

| | $\mathit{cloth}$ | $\mathit{tshirt}$ | $\mathit{sweater}$ | $\mathit{hoody}$ | $\mathit{paper}$ |
|---|---|---|---|---|---|
| sequences | 18 | 12 | 4 | 1 | 3 |
| samples | 15,799 | 6739 | 2203 | 517 | 1187 |

**Table 4.** The textureless surfaces dataset [3] is used to compare the performance of different depth and normal map reconstruction methods. Patches of size $128\times 128$ were used in Patch-Net. Bold values show the best results for each metric.

| Method | ${\mathcal{E}}_{\mathit{D}}$ (mm) ↓ Bednarik et al. [3] | ${\mathcal{E}}_{\mathit{D}}$ (mm) ↓ Patch-Net [4] | ${\mathcal{E}}_{\mathit{MAE}}$ (degrees) ↓ Bednarik et al. [3] | ${\mathcal{E}}_{\mathit{MAE}}$ (degrees) ↓ Patch-Net [4] |
|---|---|---|---|---|
| cloth-cloth | $17.53\pm 5.50$ | $\mathbf{12.80}\pm \mathbf{4.45}$ | $17.37\pm 12.51$ | $\mathbf{14.72}\pm \mathbf{3.39}$ |
| tshirt-tshirt | $17.18\pm 18.58$ | $\mathbf{13.70}\pm \mathbf{3.83}$ | $\mathbf{18.07}\pm \mathbf{12.71}$ | $18.63\pm 4.43$ |
| cloth-tshirt | $26.26\pm 7.72$ | $\mathbf{22.74}\pm \mathbf{7.20}$ | $25.74\pm 15.81$ | $\mathbf{24.29}\pm \mathbf{3.80}$ |
| cloth-sweater | $38.93\pm 10.36$ | $\mathbf{30.10}\pm \mathbf{10.00}$ | $31.52\pm 19.07$ | $\mathbf{27.94}\pm \mathbf{4.79}$ |
| cloth-hoody | $43.22\pm 24.81$ | $\mathbf{31.09}\pm \mathbf{8.73}$ | $32.54\pm 21.15$ | $\mathbf{29.73}\pm \mathbf{2.52}$ |
| cloth-paper | $24.16\pm 7.15$ | $\mathbf{14.53}\pm \mathbf{4.48}$ | $35.53\pm 22.16$ | $\mathbf{24.52}\pm \mathbf{5.96}$ |

**Table 5.** Comparison of different point cloud reconstruction methods using the synthetic thin plate dataset [5] under one illumination. Bold values show the best results for each metric.

| Method | ${\mathcal{E}}_{3\mathit{D}}$ (mm) ↓ HDM-Net [5] | ${\mathcal{E}}_{3\mathit{D}}$ (mm) ↓ IsMo-GAN [6] | ${\mathcal{E}}_{\mathit{\sigma}}$ (mm) ↓ HDM-Net [5] | ${\mathcal{E}}_{\mathit{\sigma}}$ (mm) ↓ IsMo-GAN [6] |
|---|---|---|---|---|
| endoscopy | 48.50 | 33.60 | 13.50 | 14.80 |
| graffiti | 49.90 | 33.30 | 22.00 | 20.80 |
| clothes | 48.90 | 35.30 | 26.40 | 24.20 |
| carpet | 144.20 | 110.50 | 26.90 | 26.80 |
| mean | 72.88 | 53.18 | 22.20 | 21.65 |

**Table 6.** Comparison of mesh reconstruction from textureless data. As can be seen, IsMo-GAN outperforms [3] and HDM-Net on the real textureless cloth data by $26.5\%$ and $10.5\%$ respectively, and it outperforms HDM-Net on the synthetic textureless thin-plate data by $31.9\%$. Bold values show the best results.

| ${\mathcal{E}}_{3\mathit{D}}$ (mm) ↓ | Bednarik et al. [3] | HDM-Net [5] | IsMo-GAN [6] |
|---|---|---|---|
| cloth [3] | 21.48 | 17.65 | 15.79 |
| plate [5,6] | - | 99.40 | 67.70 |

| Method | ${\mathcal{E}}_{\mathit{F}}$ (%) ↑ (A) | ${\mathcal{E}}_{\mathit{F}}$ (%) ↑ (B) | ${\mathcal{E}}_{\mathit{EMD}}$ ↓ (A) | ${\mathcal{E}}_{\mathit{EMD}}$ ↓ (B) | ${\mathcal{E}}_{\mathit{CD}}$ ↓ (A) | ${\mathcal{E}}_{\mathit{CD}}$ ↓ (B) |
|---|---|---|---|---|---|---|
| plane | 71.12 | 77.01 | 0.579 | 0.486 | 0.477 | 0.304 |
| bench | 57.57 | 67.69 | 0.965 | 0.770 | 0.624 | 0.362 |
| cabinet | 60.39 | 63.30 | 2.563 | 1.575 | 0.381 | 0.327 |
| car | 67.86 | 69.53 | 1.297 | 1.185 | 0.268 | 0.235 |
| chair | 54.38 | 60.74 | 1.399 | 0.957 | 0.610 | 0.443 |
| monitor | 51.39 | 60.35 | 1.536 | 1.269 | 0.755 | 0.459 |
| lamp | 48.15 | 56.26 | 1.314 | 1.086 | 1.295 | 0.879 |
| speaker | 48.84 | 53.49 | 2.951 | 2.283 | 0.739 | 0.562 |
| firearm | 73.20 | 77.24 | 0.667 | 0.473 | 0.453 | 0.333 |
| sofa | 51.90 | 56.83 | 1.642 | 1.376 | 0.490 | 0.400 |
| table | 66.30 | 70.78 | 1.480 | 1.173 | 0.498 | 0.334 |
| phone | 70.24 | 72.27 | 0.724 | 0.573 | 0.421 | 0.298 |
| watercraft | 55.12 | 62.12 | 0.814 | 0.718 | 0.670 | 0.450 |
| mean | 59.73 | 65.20 | 1.380 | 1.071 | 0.591 | 0.414 |

**Table 8.** Comparison of (**A**) Pixel2Mesh [7], (**C**) Salvi et al. [8], and (**D**) 3D-VRVT [41] on the Choy et al. [18] subset of the ShapeNet dataset [10]. With self-attention after ResNet layers, Ref. [8] improves IoU by $25\%$, normal consistency by $8.9\%$, and reduces the average Chamfer-L1 distance approximately 11 times. Ref. [41] further improves the IoU by $9\%$ with a Vision Transformer architecture containing multi-headed self-attention. Bold values show the best results for each metric.

| Method | ${\mathcal{E}}_{\mathit{IoU}}$ ↑ (A) | ${\mathcal{E}}_{\mathit{IoU}}$ ↑ (C) | ${\mathcal{E}}_{\mathit{IoU}}$ ↑ (D) | ${\mathcal{E}}_{{\mathit{CD}}_{1}}$ ↓ (A) | ${\mathcal{E}}_{{\mathit{CD}}_{1}}$ ↓ (C) | ${\mathcal{E}}_{\mathit{NC}}$ ↑ (A) | ${\mathcal{E}}_{\mathit{NC}}$ ↑ (C) |
|---|---|---|---|---|---|---|---|
| plane | 0.420 | 0.645 | 0.608 | 0.187 | 0.011 | 0.759 | 0.868 |
| bench | 0.323 | 0.493 | 0.563 | 0.201 | 0.016 | 0.732 | 0.813 |
| cabinet | 0.664 | 0.737 | 0.794 | 0.196 | 0.016 | 0.834 | 0.876 |
| car | 0.552 | 0.761 | 0.855 | 0.180 | 0.014 | 0.756 | 0.855 |
| chair | 0.396 | 0.534 | 0.553 | 0.265 | 0.021 | 0.746 | 0.829 |
| monitor | 0.490 | 0.520 | 0.555 | 0.239 | 0.026 | 0.830 | 0.863 |
| lamp | 0.323 | 0.379 | 0.436 | 0.308 | 0.045 | 0.666 | 0.722 |
| speaker | 0.599 | 0.660 | 0.725 | 0.285 | 0.028 | 0.782 | 0.839 |
| firearm | 0.402 | 0.527 | 0.597 | 0.164 | 0.012 | 0.718 | 0.804 |
| sofa | 0.613 | 0.689 | 0.716 | 0.212 | 0.019 | 0.820 | 0.866 |
| table | 0.395 | 0.535 | 0.617 | 0.218 | 0.019 | 0.784 | 0.861 |
| phone | 0.661 | 0.754 | 0.805 | 0.149 | 0.012 | 0.907 | 0.937 |
| watercraft | 0.397 | 0.568 | 0.604 | 0.212 | 0.018 | 0.699 | 0.801 |
| mean | 0.480 | 0.600 | 0.654 | 0.216 | 0.019 | 0.772 | 0.841 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Khan, M.S.U.; Pagani, A.; Liwicki, M.; Stricker, D.; Afzal, M.Z.
Three-Dimensional Reconstruction from a Single RGB Image Using Deep Learning: A Review. *J. Imaging* **2022**, *8*, 225.
https://doi.org/10.3390/jimaging8090225
