# Three-Dimensional Reconstruction from Single Image Base on Combination of CNN and Multi-Spectral Photometric Stereo

## Abstract

## 1. Introduction

## 2. Related Work

#### 2.1. Photometric Stereo

#### 2.2. Machine Learning in Depth Estimation

## 3. Methods

#### 3.1. Multi-Spectral Photometric Stereo

Here, $l_{j}$ is the $j$-th illumination direction vector, $n(x, y)$ is the surface normal at a point $(x, y)$ on the target, $E_{j}(\lambda)$ is the illumination intensity, $R(x, y, \lambda)$ is a parameter related to the albedo and chromaticity at that point, and $S_{i}(\lambda)$ is the color response of the camera's photosensitive element.
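
The equation these symbols belong to is not reproduced in this section. As a sketch only, a commonly used form of the multi-spectral photometric stereo image formation model that is consistent with the definitions above can be written as follows. It assumes a Lambertian surface; the symbol $c_{i}(x, y)$ for the intensity recorded in camera channel $i$ and the symbol $l_{j}$ for the illumination direction are assumed notation, not taken from the text:

$$c_{i}(x, y)=\sum_{j}\left(l_{j}^{\top}\,n(x, y)\right)\int E_{j}(\lambda)\,R(x, y, \lambda)\,S_{i}(\lambda)\,d\lambda$$

Under this form, each color channel mixes the shading contributions $l_{j}^{\top} n(x, y)$ of all light sources, weighted by how strongly the spectrum of light $j$ overlaps with the response of channel $i$.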

#### 3.2. Deep Convolutional Neural Network

#### 3.2.1. Architecture

#### 3.2.2. Training

#### 3.3. Combination of Deep Convolution Neural Network and Multi-Spectral PS

## 4. Experiments

#### 4.1. The Synthesis Dataset Rendered from ShapeNet

#### 4.2. Result of Our Network

#### 4.2.1. Experiment Results

#### 4.2.2. Quantitative Analysis

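Here, $d_{i}$ is the depth of the $i$-th point predicted by our network, $d_{i}^{*}$ is the corresponding ground-truth depth, and $N$ is the number of points. The parameters reported in the table are defined as follows: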

- Mean relative error (rel), calculated according to Equation (7): $$\mathrm{rel}=\frac{1}{N}\sum_{i}\frac{\left|d_{i}-d_{i}^{*}\right|}{d_{i}^{*}}$$
- Root mean squared error (rms), calculated according to Equation (8): $$\mathrm{rms}=\sqrt{\frac{1}{N}\sum_{i}\left(d_{i}-d_{i}^{*}\right)^{2}}$$
- Accuracy with threshold $t$ (δ), the fraction of pixels in the image whose depth ratio $T$ is below the threshold $t$. The result is reported at three grades: δ_{1} for $t = 1.25$, δ_{2} for $t = 1.25^{2}$, and δ_{3} for $t = 1.25^{3}$. It is calculated according to Equation (9): $$\delta=\frac{1}{N}\sum_{i}\eta_{i},\qquad \eta_{i}=\begin{cases}1 & \text{if } T<t\\ 0 & \text{if } T\ge t\end{cases},\qquad T=\max\left(\frac{d_{i}}{d_{i}^{*}},\frac{d_{i}^{*}}{d_{i}}\right),\qquad t\in\left\{1.25,\,1.25^{2},\,1.25^{3}\right\}$$
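
The three metrics above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' evaluation code: the function name `depth_metrics` and the flat-list input format are assumptions, and all depths are assumed to be strictly positive so that the ratios are defined.

```python
import math


def depth_metrics(pred, truth, thresholds=(1.25, 1.25 ** 2, 1.25 ** 3)):
    """Return (rel, rms, [delta_1, delta_2, delta_3]) for two depth lists.

    pred  -- predicted depths d_i (assumed strictly positive)
    truth -- reference depths d_i^* (assumed strictly positive)
    """
    if len(pred) != len(truth) or not pred:
        raise ValueError("pred and truth must be non-empty and of equal length")
    n = len(pred)
    pairs = list(zip(pred, truth))

    # Equation (7): mean relative error.
    rel = sum(abs(d - g) / g for d, g in pairs) / n

    # Equation (8): root mean squared error.
    rms = math.sqrt(sum((d - g) ** 2 for d, g in pairs) / n)

    # Equation (9): fraction of points whose ratio T is below each threshold t.
    ratios = [max(d / g, g / d) for d, g in pairs]
    deltas = [sum(1 for r in ratios if r < t) / n for t in thresholds]

    return rel, rms, deltas


rel, rms, deltas = depth_metrics([1.0, 1.5, 4.0], [1.0, 1.0, 2.0])
```

Lower values of `rel` and `rms` indicate a closer match, while the entries of `deltas` grow toward 1 as more points fall within the ratio threshold.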

#### 4.3. Result of Combination of Deep Convolution Neural Network and Multi-Spectral PS

#### 4.3.1. Experiment Results

#### 4.3.2. Quantitative Analysis

## 5. Conclusions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## References

**Figure 2.** The architecture of our network. Conv is a convolution operation, and deconv is a deconvolution operation.

**Figure 3.** The images generated using the train model. The first three columns are RGB images and the last column is the depth image.

**Figure 4.** The results using the same train model. The top row shows the test images, the middle row shows the ground truth of each image, and the bottom row shows the depth predicted by our network.

**Figure 5.** The results using a new train model. The top row shows the test images, the middle row shows the ground truth of each image, and the bottom row shows the depth predicted by our network.

**Figure 6.** Results of our network with real-world objects. The top row shows the input images, and the bottom row shows the corresponding depth estimated by our network, displayed with MATLAB's imagesc function.

**Figure 7.** The final results, displayed with MATLAB's imagesc function. (**a**) The input images. (**b**) The outputs of Kinect. (**c**) The result of the depth estimation of traditional multi-spectral PS. (**d**) The result of [10]. (**e**) The result of the depth estimation of our method.

**Figure 8.** (**a**) The input image. (**b**) The approximate ground truth depth images after processing. (**c**) The 3D representation of (**b**). (**d**) The 3D representation of the result of our deep convolutional neural network (DCNN) and multi-spectral photometric stereo (DCNN+MS-PS) method. (**e**) The error map between (**c**,**d**).

**Table 1.** Details of our deep convolutional neural network (DCNN). Conv is a convolution operation, and deconv is a deconvolution operation.

| Name | Input | Weights | Output Layers | Remarks |
|---|---|---|---|---|
| conv_1 | image | (5,5,2,2) | 32 | padding='VALID' |
| conv_2 | conv_1 | (5,5,1,1) | 32 | padding='VALID' |
| conv_3 | conv_2 | (5,5,2,2) | 64 | padding='VALID' |
| conv_4 | conv_3 | (5,5,1,1) | 64 | padding='VALID' |
| conv_5 | conv_4 | (5,5,2,2) | 128 | padding='VALID' |
| conv_6 | conv_5 | (5,5,1,1) | 128 | padding='VALID' |
| conv_7 | conv_6 | (5,5,2,2) | 256 | padding='VALID' |
| conv_8 | conv_7 | (5,5,1,1) | 256 | padding='VALID' |
| conv_9 | conv_8 | (5,5,2,2) | 256 | padding='VALID' |
| conv_10 | conv_9 | (5,5,1,1) | 256 | padding='VALID' |
| reshape | reshape conv_10 to 1×N | | | |
| fc_1 | conv_10 | N×4096 | / | keep_prob=0.5 |
| fc_2 | fc_1 | 4096×4096 | / | keep_prob=0.5 |
| fc_3 | fc_2 | 4096×4096 | / | keep_prob=0.5 |
| fc_4 | fc_3 | 4096×N | / | keep_prob=0.5 |
| reshape | reshape fc_4 to the shape of conv_10 | | | |
| deconv_1 | fc_4+conv_10 | (5,5,1,1) | 128 | padding='VALID' |
| deconv_2 | deconv_1 | (5,5,2,2) | 64 | padding='VALID' |
| deconv_3 | deconv_2 | (5,5,1,1) | 64 | padding='VALID' |
| deconv_4 | deconv_3+conv_7 | (5,5,2,2) | 32 | padding='VALID' |
| deconv_5 | deconv_4 | (5,5,1,1) | 32 | padding='VALID' |
| deconv_6 | deconv_5+conv_5 | (5,5,2,2) | 16 | padding='VALID' |
| deconv_7 | deconv_6 | (5,5,1,1) | 16 | padding='VALID' |
| deconv_8 | deconv_7+conv_3 | (5,5,2,2) | 8 | padding='VALID' |
| deconv_9 | deconv_8 | (5,5,1,1) | 8 | padding='VALID' |
| deconv_10 | deconv_9 | (5,5,2,2) | 1 | padding='VALID' |
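
To make the layer arithmetic in Table 1 concrete, the following sketch traces the spatial size of the feature maps through the ten convolutions and ten deconvolutions. It rests on assumptions that the table does not state explicitly: each Weights tuple is read as (kernel height, kernel width, vertical stride, horizontal stride), the standard size formulas for 'VALID' padding are used, and the input side length of 501 pixels is a hypothetical value chosen only because it divides evenly at every stride-2 layer.

```python
KERNEL = 5  # assumed kernel size, read from the (5,5,_,_) entries in Table 1

# Assumed strides, read from the last two entries of each Weights tuple.
CONV_STRIDES = [2, 1, 2, 1, 2, 1, 2, 1, 2, 1]    # conv_1 .. conv_10
DECONV_STRIDES = [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]  # deconv_1 .. deconv_10


def conv_valid(size, kernel, stride):
    """Output side length of a convolution with 'VALID' (no) padding."""
    return (size - kernel) // stride + 1


def deconv_valid(size, kernel, stride):
    """Output side length of a transposed convolution with 'VALID' padding."""
    return (size - 1) * stride + kernel


def trace_sizes(input_size):
    """Return the side lengths after every conv and every deconv layer."""
    size = input_size
    conv_sizes = []
    for stride in CONV_STRIDES:
        size = conv_valid(size, KERNEL, stride)
        conv_sizes.append(size)
    deconv_sizes = []
    for stride in DECONV_STRIDES:
        size = deconv_valid(size, KERNEL, stride)
        deconv_sizes.append(size)
    return conv_sizes, deconv_sizes


# Hypothetical 501x501 input, not a value taken from the paper.
conv_sizes, deconv_sizes = trace_sizes(501)

# Inputs listed as sums in Table 1 need matching spatial sizes.
skip_pairs = {"deconv_3+conv_7": (deconv_sizes[2], conv_sizes[6]),
              "deconv_5+conv_5": (deconv_sizes[4], conv_sizes[4]),
              "deconv_7+conv_3": (deconv_sizes[6], conv_sizes[2])}
```

Under these assumptions the decoder mirrors the encoder, so each pair in `skip_pairs` has equal side lengths and the final deconvolution returns to the input size. For input sizes that do not divide evenly at a stride-2 layer, the floor in `conv_valid` discards a row and column, and the mirrored sizes no longer line up without cropping or padding.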

| Image | rel ^{1} | rms ^{1} | δ_{1} ^{2} | δ_{2} ^{2} | δ_{3} ^{2} |
|---|---|---|---|---|---|
| Figure 4a | 0.5935 | 0.5083 | 0.0672 | 0.2120 | 0.4736 |
| Figure 4b | 0.5738 | 0.5083 | 0.0310 | 0.2215 | 0.5116 |
| Figure 4c | 0.5836 | 0.5070 | 0.0693 | 0.2339 | 0.4447 |
| Figure 4d | 0.6497 | 0.6010 | 0.0410 | 0.1205 | 0.2748 |
| Figure 4e | 0.8300 | 0.7292 | 0.0167 | 0.0591 | 0.1224 |
| Figure 4f | 0.7302 | 0.6282 | 0.0133 | 0.0771 | 0.2094 |
| Figure 5a | 0.4225 | 0.2899 | 0.3824 | 0.6693 | 0.7788 |
| Figure 5b | 0.6329 | 0.5032 | 0.1700 | 0.3004 | 0.3848 |
| Figure 5c | 0.6829 | 0.6607 | 0.0381 | 0.1571 | 0.2502 |
| Figure 5d | 0.6358 | 0.4792 | 0.1272 | 0.2744 | 0.4166 |
| Figure 5e | 0.4741 | 0.2979 | 0.3527 | 0.6156 | 0.7533 |
| Figure 5f | 0.3473 | 0.3353 | 0.1949 | 0.6677 | 0.8589 |

^{1} Lower is better. ^{2} Higher is better.

**Table 3.** The quantitative analysis of the results of Figure 7. MS-PS is an acronym for multi-spectral photometric stereo, and the parameters 'rel' and 'rms' are defined in Equations (7) and (8).

| Image | Method | rel ^{1} | rms ^{1} | δ_{1} ^{2} | δ_{2} ^{2} | δ_{3} ^{2} |
|---|---|---|---|---|---|---|
| aircraft | DCNN | 0.5716 | 0.2928 | 0.1262 | 0.2667 | 0.4165 |
| | MS-PS | 2.6899 | 0.6273 | 0.5221 | 0.5746 | 0.6315 |
| | DCNN+MS-PS | 2.8001 | 0.4118 | 0.5832 | 0.7078 | 0.8007 |
| train | DCNN | 0.8165 | 0.5204 | 0.2237 | 0.3637 | 0.4502 |
| | MS-PS | 1.8237 | 0.7501 | 0.0607 | 0.0915 | 0.1266 |
| | DCNN+MS-PS | 0.7368 | 0.5831 | 0.2300 | 0.4407 | 0.4979 |
| ship | DCNN | 2.1245 | 0.3684 | 0.2371 | 0.3756 | 0.4887 |
| | MS-PS | 1.1393 | 0.3393 | 0.1560 | 0.2689 | 0.3621 |
| | DCNN+MS-PS | 1.2453 | 0.3089 | 0.2184 | 0.4048 | 0.5548 |

^{1} Lower is better. ^{2} Higher is better.

© 2018 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

