Accurate 3D Shape Reconstruction from Single Structured-Light Image via Fringe-to-Fringe Network

Accurate three-dimensional (3D) shape reconstruction of objects from a single image is a challenging task, yet it is highly demanded by numerous applications. This paper presents a novel 3D shape reconstruction technique integrating a high-accuracy structured-light method with a deep neural network learning scheme. The proposed approach employs a convolutional neural network (CNN) to transform a color structured-light fringe image into multiple triple-frequency phase-shifted grayscale fringe images, from which the 3D shape can be accurately reconstructed. The robustness of the proposed technique is verified, and it can be a promising 3D imaging tool in future scientific and industrial applications.


Introduction
Three-dimensional (3D) shape or depth perception has been a favored long-term research topic in recent decades, driven by numerous scientific and engineering applications in many fields, due to its capability of perceiving depth information that cannot be fulfilled by two-dimensional (2D) imaging. The 3D shape and depth perception techniques are typically stereo vision- and optics-based, and the most widely used scheme involves employing structured light to facilitate image analysis. Representative commercial products in this category include industrial 3D scanners that provide accurate 3D shape measurements with fringe-pattern illumination and consumer-grade 3D depth sensors that deliver real-time depth maps with speckle-pattern or similar illuminations [1][2][3][4][5][6]. A fringe pattern-based technique generally requires capturing multiple fringe-shifted images in sequence to achieve high accuracy, but the measurement speed is consequently slow. By contrast, a speckle pattern-based technique usually uses two images simultaneously captured by two cameras; in this way, real-time measurement speed can be attained, but the corresponding accuracy is often relatively low. Therefore, a structured-light technique capable of providing not only high-accuracy but also fast-speed performance is always of great interest in the research and development of new 3D scanning systems.
Based on the successful applications of deep learning in the computer vision and optics fields, integrating the structured-light technique and the deep learning scheme for accurate 3D shape reconstruction should be achievable [25][26][27]. As a matter of fact, a few strategies have been proposed to transform a captured structured-light image into its corresponding 3D shape using deep learning. For instance, an autoencoder-based network named UNet can serve as an end-to-end network to acquire the depth map from a single structured-light image [28][29][30][31]. Works presented in [32][33][34][35][36] reveal that a phase map can be retrieved by one or multiple neural networks from structured-light images, and the phase map is then used to calculate the depth map.
In this paper, a novel technique of 3D shape reconstruction from a single structured-light fringe image is proposed. It is based on a high-accuracy 3D imaging technique known as fringe projection profilometry (FPP). The conventional FPP technique involves using a projector to project a set of phase-shifted fringe patterns with multiple frequencies onto the target, where the surface height or depth information is naturally encoded into the images captured by a camera at a different view from the projector. Through retrieving phase distributions from the captured images, the height or depth map can be determined and the 3D shape can be reconstructed. Inspired by Zhang's work [37] of encoding three fringe patterns of different frequencies into a single composite RGB image and a few other researchers' work [38,39] of using a deep learning method to transform one or two fringe images into multiple phase-shifted fringe images, the proposed technique aims to employ a convolutional neural network (CNN) model to transform a single-shot red-green-blue (RGB) fringe-pattern image into a number of multi-frequency phase-shifted grayscale fringe-pattern images, which are then used to reconstruct the depth and 3D shape map using the conventional FPP algorithm. Because its main purpose is to transform a single fringe pattern into multiple fringe patterns, the CNN model is called a fringe-to-fringe network. Compared with the conventional FPP technique, the essential difference is that the proposed approach requires only a single-shot color image instead of multiple (e.g., 12 to 20) images. Since the new technique requires only a single-shot image, the measurement speed can be remarkably improved while maintaining equivalent accuracy. Such a highly demanded capability is technically impossible from the conventional perspective, but is now practicable with a CNN approach. It is noteworthy that, unlike Yu's work [38], which uses multiple networks and one or multiple fringe images, the proposed technique uses a single network and a single image for fringe-to-fringe transformations. In addition, a favorable UNet-like network, rather than a simple autoencoder-like network, is employed to enhance the transformation performance. Furthermore, the proposed technique can deal with background fringes. More importantly, by using the multiple phase-shifted fringes as an intermediary, the proposed approach achieves substantially higher accuracy than the direct transformation from fringe to depth using deep learning. Figure 1 demonstrates the fringe-to-fringe network architecture and the pipeline of the proposed approach.

Materials and Methods
The key step of the proposed technique is to employ a CNN model to transform a single-shot RGB fringe-pattern image into a number of phase-shifted grayscale fringe-pattern images, which are then used to reconstruct the depth and 3D shape map using the conventional FPP algorithm.
The required training and validation datasets, including input images and their corresponding fringe-pattern outputs, are generated by a conventional FPP system. The test dataset is generated in this way as well for assessment purposes. Details of the FPP technique can be found in Ref. [40]. In this work, however, to cope with the essential phase unwrapping issue, a four-step triple-frequency phase-shifting (TFPS) scheme is adopted for high-accuracy phase determination. Figure 2 exhibits the schematic pipeline of the employed conventional FPP technique, which is elaborated as follows. The initial fringe patterns fed into the projector are evenly spaced vertical sinusoidal fringes, which are numerically generated with the following function [41]:

$$I^{p}_{i,n}(u, v) = I_0\left[1 + \cos\left(\varphi_i(u) + \delta_n\right)\right] \tag{1}$$

where $I^{p}$ is the intensity of the initial pattern at the pixel coordinate $(u, v)$; the subscript $i$ indicates the $i$th frequency with $i = \{1, 2, 3\}$, and $n$ denotes the $n$th phase-shifted image with $n = \{1, 2, 3, 4\}$; $I_0$ is a constant coefficient indicating the value of intensity modulation (e.g., $255/2$); $\delta$ is the phase-shift amount with $\delta_n = (n-1)\pi/2$; and $\varphi$ is the fringe phase. For vertical fringes, $\varphi$ is independent of the vertical coordinate and can be simply defined as $\varphi_i(u) = 2\pi f_i u / W$, where $f_i$ is the number of fringes in the $i$th frequency pattern and $W$ is the width of the generated image.
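As an illustration, the pattern-generation step above can be sketched in a few lines of NumPy. The image size (352 × 640) and the choice $I_0 = 255/2$ follow the values used elsewhere in this paper; the function name is ours:

```python
import numpy as np

def initial_fringe_pattern(f, n, width=640, height=352, i0=255.0 / 2.0):
    """Generate the n-th phase-shifted vertical sinusoidal pattern, Equation (1).

    f  : number of fringes across the image width (f_i)
    n  : phase-shift index, n in {1, 2, 3, 4}
    i0 : intensity-modulation constant I_0 (e.g., 255/2)
    """
    u = np.arange(width)                    # horizontal pixel coordinate
    delta = (n - 1) * np.pi / 2.0           # phase shift delta_n = (n - 1) * pi / 2
    phi = 2.0 * np.pi * f * u / width       # fringe phase phi_i(u) = 2*pi*f_i*u/W
    row = i0 * (1.0 + np.cos(phi + delta))  # one image row
    return np.tile(row, (height, 1))        # identical rows -> vertical fringes

# Twelve initial patterns: three frequencies x four phase shifts
patterns = {(f, n): initial_fringe_pattern(f, n)
            for f in (61, 70, 80) for n in (1, 2, 3, 4)}
```

Since $\varphi$ is independent of $v$, one row suffices and is tiled vertically.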
The fringes in the captured images are often distorted, following the 3D shapes of the targets. The fringe patterns can be described as

$$I_{i,n}(u, v) = A(u, v) + B(u, v)\cos\left[\varphi_i(u, v) + \delta_n\right] \tag{2}$$

where $A$, $B$, and $I$ are the background intensity, fringe amplitude, and pixel intensity of the captured fringe image at pixel coordinate $(u, v)$, respectively. The phase distribution $\varphi_i(u, v)$ in the captured images is now a function of both $u$ and $v$, and it can be determined by using a standard four-step phase-shifting scheme as

$$\varphi^{w}_{i} = \arctan\left(\frac{I_{i,4} - I_{i,2}}{I_{i,1} - I_{i,3}}\right) \tag{3}$$

where superscript $w$ signifies a wrapped phase because of the arc-tangent function. The coordinates $(u, v)$ will be omitted hereafter for simplicity.
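A minimal sketch of the four-step wrapped-phase computation follows, using `arctan2` so the correct quadrant is recovered; the synthetic phase map and intensity values are illustrative:

```python
import numpy as np

def wrapped_phase(i1, i2, i3, i4):
    """Four-step phase-shifting, Equation (3).

    With I_n = A + B*cos(phi + (n - 1)*pi/2):
      I4 - I2 = 2B*sin(phi)   (numerator N)
      I1 - I3 = 2B*cos(phi)   (denominator D)
    so arctan2 recovers phi wrapped into (-pi, pi].
    """
    return np.arctan2(i4 - i2, i1 - i3)

# Synthetic check: build four shifted images from a known phase map
phi_true = np.linspace(-3.0, 3.0, 500)   # stays within (-pi, pi]
a, b = 120.0, 80.0                       # background intensity and fringe amplitude
imgs = [a + b * np.cos(phi_true + n * np.pi / 2) for n in range(4)]
phi_w = wrapped_phase(*imgs)
```

Note that the unknown background $A$ and amplitude $B$ cancel out in the ratio, which is what makes the four-step scheme robust to illumination variations.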
If the three frequencies satisfy $(f_3 - f_2) - (f_2 - f_1) = 1$, the unwrapped phase of the highest-frequency fringe patterns can be calculated with the following hierarchical equations:

$$\varphi^{w}_{12} = \varphi^{w}_{2} - \varphi^{w}_{1} + 2\pi \langle \varphi^{w}_{1} - \varphi^{w}_{2} \rangle$$
$$\varphi^{w}_{23} = \varphi^{w}_{3} - \varphi^{w}_{2} + 2\pi \langle \varphi^{w}_{2} - \varphi^{w}_{3} \rangle$$
$$\varphi_{123} = \varphi^{w}_{23} - \varphi^{w}_{12} + 2\pi \langle \varphi^{w}_{12} - \varphi^{w}_{23} \rangle$$
$$\varphi_{23} = \varphi^{w}_{23} + 2\pi\,\mathrm{INT}\!\left[\frac{\varphi_{123}\,(f_3 - f_2) - \varphi^{w}_{23}}{2\pi}\right]$$
$$\varphi_{3} = \varphi^{w}_{3} + 2\pi\,\mathrm{INT}\!\left[\frac{\varphi_{23}\,f_3 / (f_3 - f_2) - \varphi^{w}_{3}}{2\pi}\right] \tag{4}$$

In the equations, $\langle x \rangle$ denotes a singularity function, which is 1 for $x > 0$ and 0 otherwise; INT indicates rounding to the nearest integer. The rationale of the algorithm is to generate an intermediate result of unwrapped phase $\varphi_{123}$ to start a hierarchical phase-unwrapping process. $\varphi^{w}_{12}$ and $\varphi^{w}_{23}$ are intermediate wrapped phases with $(f_2 - f_1)$ and $(f_3 - f_2)$ fringes in the pattern, respectively. $\varphi_{123}$ is both wrapped and unwrapped because there is only one fringe in the pattern since $(f_3 - f_2) - (f_2 - f_1) = 1$. $\varphi_{23}$ is an intermediate unwrapped phase to bridge $\varphi_{123}$ and $\varphi_{3}$ because the ratio of their fringe numbers is large. The phase distribution of the highest-frequency fringe patterns, $\varphi_{3}$, is adopted because it yields the highest accuracy for phase determination. In this work, the three frequencies are 61, 70, and 80. This combination gives a ratio of 1:10:80 for a balanced hierarchical calculation.
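The hierarchical unwrapping can be sketched as follows, assuming wrapped phases mapped into [0, 2π) and the frequency triplet (61, 70, 80). The implementation details (indicator handling, rounding) follow our reading of the algorithm rather than the authors' original code:

```python
import numpy as np

def unwrap_tfps(pw1, pw2, pw3, f=(61, 70, 80)):
    """Hierarchical triple-frequency phase unwrapping, Equation (4).

    pw1, pw2, pw3 : wrapped phases of the three patterns, mapped into [0, 2*pi)
    Requires (f3 - f2) - (f2 - f1) = 1, as with f = (61, 70, 80).
    """
    two_pi = 2.0 * np.pi
    f1, f2, f3 = f
    # Singularity function: add 2*pi wherever the raw difference is negative
    pw12 = pw2 - pw1 + two_pi * (pw1 - pw2 > 0)        # (f2 - f1) fringes
    pw23 = pw3 - pw2 + two_pi * (pw2 - pw3 > 0)        # (f3 - f2) fringes
    p123 = pw23 - pw12 + two_pi * (pw12 - pw23 > 0)    # single fringe: unwrapped
    # Bridge phases hierarchically; INT = round to the nearest integer
    p23 = pw23 + two_pi * np.round((p123 * (f3 - f2) - pw23) / two_pi)
    p3 = pw3 + two_pi * np.round((p23 * f3 / (f3 - f2) - pw3) / two_pi)
    return p3

# Noise-free simulation over one image row of width W = 640
w = 640
u = np.arange(w)
wrapped = [np.mod(2.0 * np.pi * fi * u / w, 2.0 * np.pi) for fi in (61, 70, 80)]
phi3 = unwrap_tfps(*wrapped)
```

On this noise-free row, the returned $\varphi_3$ matches the true unwrapped phase $2\pi f_3 u / W$ of the 80-fringe pattern.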
From the retrieved phase distribution, the depth map can be calculated [40] from

$$z_w = \frac{1 + c_1 \varphi + c_2 u + c_3 v + \cdots}{d_0 + d_1 \varphi + d_2 u + d_3 v + \cdots} \tag{5}$$

where $z_w$ is the physical height or depth at the point corresponding to the pixel $(u, v)$ in the captured image, and it is also the z-coordinate of the point in the reference world coordinate system; $\varphi$ is the unwrapped phase of the highest-frequency fringe pattern at the same pixel, which is determined from Equation (4); $c_1$–$c_{19}$ and $d_0$–$d_{19}$ are 39 constant coefficients of the numerator and denominator polynomials in $u$, $v$, and $\varphi$, which can be pre-determined by a calibration process [40,42]. Following this, the other two coordinates $x_w$ and $y_w$ can be easily determined upon knowing the camera parameters [42]. For this reason, the two terms, depth (or height) measurement and 3D shape reconstruction, can often be used interchangeably.
In the dataset generation, three original uniform fringe patterns with different frequencies (i.e., 61, 70, and 80) are respectively loaded into the RGB channels of the first projection image, and the following 12 projection images are the four-step TFPS grayscale images of the aforementioned three fringe patterns. For each sample, the system projects these 13 patterns onto the target and meanwhile captures 13 corresponding images. The first captured color image serves as the fringe-pattern input, and the remaining 12 grayscale images are used to generate the corresponding ground-truth labels as previously described. Dozens of small plaster sculptures are randomly oriented and positioned many times in the field of view of the capturing system to serve as a large number of different samples. In total, 1500 data samples are acquired. To ensure reliable network convergence and avoid biased evaluation, the samples are split by a ratio of 80%-10%-10% into the training, validation, and test datasets. Notably, a number of objects are captured solely for the validation and test datasets to ensure that a target, regardless of rotation and position, appears in only one of the three datasets.
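One way to enforce the object-level holdout described above is to partition the objects before partitioning the samples. The sketch below uses hypothetical object names and an assumed 30-sculpture/50-pose composition that reproduces the 1500-sample total:

```python
import random

def split_by_object(sample_ids, ratios=(0.8, 0.1, 0.1), seed=42):
    """Split samples roughly 80%-10%-10% so each object appears in only one subset.

    sample_ids: list of (object_name, capture_index) pairs; the naming and
    grouping scheme here are illustrative, not the paper's actual file layout.
    """
    objects = sorted({obj for obj, _ in sample_ids})
    random.Random(seed).shuffle(objects)          # reproducible shuffle
    n_train = int(ratios[0] * len(objects))
    n_val = int(ratios[1] * len(objects))
    groups = (set(objects[:n_train]),
              set(objects[n_train:n_train + n_val]),
              set(objects[n_train + n_val:]))
    return tuple([s for s in sample_ids if s[0] in g] for g in groups)

# Hypothetical composition: 30 plaster sculptures x 50 poses = 1500 samples
samples = [(f"sculpture_{k:02d}", p) for k in range(30) for p in range(50)]
train, val, test = split_by_object(samples)
```

Splitting by object rather than by sample guarantees that no sculpture leaks from the training set into the validation or test set through a different pose.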
Figure 3a shows a few selected input and output pairs contained in the datasets of the fringe-to-fringe network. Because a performance comparison of the proposed network with the existing fringe-to-depth and speckle-to-depth networks will be conducted later, a few examples of these two network datasets are presented in Figure 3b and Figure 3c, respectively. Furthermore, since the wrapped phase can be obtained from the numerator (N) and denominator (D) of the arctangent function shown in Equation (3), a network capable of determining the Ns and Ds of the captured fringe images can be suitable for 3D shape reconstruction [43]. Such a fringe-to-ND network will also be included in the comparisons with the proposed fringe-to-fringe network. Examples of the input and output data of the fringe-to-ND network are shown in Figure 3d.
In the proposed approach, a UNet-based CNN [11] is constructed for the fringe-to-fringe transformation. In particular, the CNN model is trained to transform each fringe pattern in the RGB channels of the input image into its corresponding four-step phase-shifted fringe images. The appearance of the target and background remains the same in the output images, and the only change is the shifting of the fringe patterns at an incremental step of π/2. The network consists of two main paths: an encoder path and a decoder path. The encoder path contains spatial convolution (kernel size of 3 × 3) and max-pooling (pool size 2 × 2) layers to extract representative features from the single-shot input image, whereas the decoder path reverses the operations of the encoder path with transposed convolution (strides = 2) and spatial convolution layers to upsample the previous feature map to a higher-resolution receptive map. It is noted that all the spatial convolution and transposed convolution layers use the "same" padding type, in which the output feature maps have exactly the same spatial resolution as the input feature maps. In addition, symmetric concatenations from the encoder path to the decoder path are established to ensure rigorous transformation of features at different sub-scale resolutions. Finally, a 1 × 1 convolution layer is attached to the last layer to transform the vector feature maps into the desired fringe-pattern outputs.
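The encoder-decoder structure can be sketched in Keras as below. The filter counts are illustrative assumptions (the paper's exact layer widths are given in Table 1), but the 3 × 3 kernels, 2 × 2 pooling, stride-2 transposed convolutions, "same" padding, symmetric skip concatenations, and the 12-channel 1 × 1 output head follow the description above:

```python
import tensorflow as tf
from tensorflow.keras import layers

def conv_block(x, filters):
    # Two 3x3 convolutions with "same" padding and LeakyReLU (alpha = 0.2)
    for _ in range(2):
        x = layers.Conv2D(filters, 3, padding="same")(x)
        x = layers.LeakyReLU(0.2)(x)
    return x

def fringe_to_fringe_net(height=352, width=640, filters=(32, 64, 128, 256)):
    """UNet-like fringe-to-fringe network: a sketch, not the exact published
    configuration (filter counts here are illustrative)."""
    inputs = layers.Input((height, width, 3))      # single-shot RGB fringe image
    skips, x = [], inputs
    for f in filters[:-1]:                         # encoder path
        x = conv_block(x, f)
        skips.append(x)
        x = layers.MaxPooling2D(2)(x)              # 2x2 max pooling
    x = conv_block(x, filters[-1])                 # bottleneck
    x = layers.Dropout(0.2)(x)                     # overfitting control
    for f, skip in zip(reversed(filters[:-1]), reversed(skips)):  # decoder path
        x = layers.Conv2DTranspose(f, 3, strides=2, padding="same")(x)
        x = layers.Concatenate()([x, skip])        # symmetric skip connection
        x = conv_block(x, f)
    outputs = layers.Conv2D(12, 1)(x)              # 12 phase-shifted grayscale maps
    return tf.keras.Model(inputs, outputs)

model = fringe_to_fringe_net()
```

The 352 × 640 input resolution divides evenly by 2 three times, so the transposed convolutions restore the original resolution exactly and each skip connection concatenates feature maps of matching size.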
The format of the training, validation, and test data, as well as the output data, is a four-dimensional (4D) tensor of size s × h × w × c, where the four variables are the number of data samples, the height and width of the input and output images, and the channel depth, respectively. Specifically, h and w are 352 and 640, respectively; c is set to 3 for the RGB input image and 12 for the output of multiple phase-shifted grayscale fringe images. The adopted resolution of the images is restricted by the computation system. In the network model, a dropout function with a rate of 0.2 is added to the network to prevent the overfitting issue. In addition, a nonlinear activation function named the leaky rectified linear unit (LeakyReLU) [44] is applied after each spatial convolution layer in the network to handle the zero-gradient issue. The LeakyReLU function is expressed as

$$h = \max(Wx + b,\ 0) + \alpha \min(Wx + b,\ 0)$$

where $x$ and $h$ are the input and output vectors, respectively; the weight parameters $W$ in matrix form and bias parameters $b$ in vector form are optimized by the training process; and $\alpha$ is a negative slope coefficient, which is set to 0.2 in the proposed work. Table 1 details the architecture and the number of parameters of the proposed network for fringe-to-fringe transformation.
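For illustration, the activation above applied to an affine layer can be written directly in NumPy (the function name is ours):

```python
import numpy as np

def leaky_relu_layer(x, w, b, alpha=0.2):
    """Affine layer followed by LeakyReLU:
    h = max(Wx + b, 0) + alpha * min(Wx + b, 0)."""
    z = w @ x + b
    # Positive part passes unchanged; negative part is scaled by alpha,
    # so the gradient never vanishes entirely
    return np.maximum(z, 0.0) + alpha * np.minimum(z, 0.0)

# With W = I and b = 0, the layer reduces to the bare activation
h = leaky_relu_layer(np.array([2.0, -5.0]), np.eye(2), np.zeros(2))
```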

Results and Discussion
The hardware components used for the training process consist of an Intel Xeon Gold 6140 2.3 GHz CPU, 128 GB RAM, and two Nvidia Tesla V100 SXM2 32 GB graphics cards. Furthermore, Nvidia CUDA Toolkit 11.0 and cuDNN v8.0.3 are adopted to enable the high-performance computing capability of the GPUs. TensorFlow and Keras, two widely used open-source software libraries for deep learning, are chosen in this work for network construction and subsequent learning tasks. The datasets are captured by an RVBUST RVC-X mini 3D camera that allows the capturing of the required fringe-pattern images and the use of user-defined functions to determine 3D shapes.
For reliable and efficient performance, both the model parameters and the hyperparameters that control the learning process are optimized during training. The network is trained through 300 epochs with a mini-batch size of 2. Adam optimization [45] with a step-decay schedule is implemented to gradually reduce the initial learning rate (0.0001) after the first 150 epochs. A data augmentation scheme (e.g., whitening) is implemented to overcome the overfitting problem. A few Keras built-in functions, such as ModelCheckpoint and LambdaCallback, are applied to monitor the training performance and save the model parameters that yield the best results.
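A plain-Python sketch of such a step-decay schedule, suitable for Keras's `LearningRateScheduler` callback, is given below. Only the initial rate (0.0001) and the 150-epoch hold are from the paper; the drop factor and interval are assumptions:

```python
def step_decay(epoch, initial_lr=1e-4, start=150, drop=0.5, every=50):
    """Hold the initial learning rate for the first `start` epochs, then
    decay it stepwise.  The drop factor (0.5) and interval (50 epochs) are
    hypothetical; the paper specifies only the initial rate and start epoch."""
    if epoch < start:
        return initial_lr
    # Number of completed decay steps after the hold period
    return initial_lr * drop ** (1 + (epoch - start) // every)
```

In a Keras setup this would be attached with `tf.keras.callbacks.LearningRateScheduler(step_decay)` alongside the ModelCheckpoint callback mentioned above.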
After the training is completed, the test datasets are fed into the trained CNN model to obtain the output fringe images. These images are then analyzed following Equations (3)-(5) to reconstruct the depth map and subsequent 3D shapes. Table 2 summarizes a few quantitative statistical metrics from the test dataset as well as the validation dataset. The metrics include root-mean-square error (RMSE), mean error, median error, trimean error, mean of the best 25%, and mean of the worst 25%. Because using an end-to-end neural network to directly transform a single-shot fringe or speckle image into its corresponding 3D shape or depth map has most recently gained a great deal of interest [46,47], a comparison with such a fringe-to-depth network and a speckle-to-depth network is conducted. Moreover, a comparison with a fringe-to-ND network [32,34,43,48] is carried out as well since it is an approach falling between the fringe-to-depth and the proposed fringe-to-fringe networks. The comparative work uses the same system configuration as well as the network architecture shown in Figure 1. For the fringe-to-depth network, the difference from the proposed fringe-to-fringe one is that the channel depth of the last convolution layer is changed from 12 to 1 to accommodate the depth map requirement (i.e., c = 1). For the speckle-to-depth network, the channel depth of the input is one instead of three since the speckle image is grayscale. For the fringe-to-ND network, the channel depth of the output is six because there are three numerator maps and three denominator maps. The results displayed in the table clearly show that the proposed technique works well and outperforms the fringe-to-depth and speckle-to-depth network methods. It also performs slightly better than the fringe-to-ND network scheme.
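The statistics listed above can be computed as follows, assuming a vector of per-sample depth errors and taking the trimean as (Q1 + 2·median + Q3)/4:

```python
import numpy as np

def depth_error_metrics(errors):
    """Statistics of the kind summarized in Table 2, over per-sample errors.

    The input is assumed to be one non-negative error value per test sample;
    "best/worst 25%" are the means of the lowest and highest quartiles.
    """
    e = np.sort(np.asarray(errors, dtype=float))
    q1, med, q3 = np.percentile(e, [25, 50, 75])
    n4 = max(1, len(e) // 4)                     # size of one quartile
    return {
        "rmse": float(np.sqrt(np.mean(e ** 2))),
        "mean": float(e.mean()),
        "median": float(med),
        "trimean": float((q1 + 2 * med + q3) / 4),
        "best25": float(e[:n4].mean()),          # mean of the best 25%
        "worst25": float(e[-n4:].mean()),        # mean of the worst 25%
    }
```

Reporting the best- and worst-quartile means alongside the RMSE separates typical-case accuracy from failure-case behavior, which a single aggregate number would hide.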

Figure 4 displays the 3D results of the proposed approach with the fringe-to-fringe network and the comparative fringe-to-depth network. In the figure, the first to fifth rows are the plain images, the test inputs, the 3D labels, the 3D shapes obtained from the fringe-to-fringe network, and the results from the fringe-to-depth network, respectively. For each of the two sample objects, a selected feature region is magnified for better comparison. It is evident that the proposed approach with the fringe-to-fringe network is capable of acquiring 3D results with detailed textures close to the ground-truth labels, whereas the technique with the direct fringe-to-depth network reconstructs the 3D shape of the target with fewer details and larger errors. Figure 5 demonstrates a qualitative comparison between the proposed fringe-to-fringe approach and the recently developed speckle-to-depth method. The top portion of the figure shows the fringe-pattern input and speckle-pattern input required by the fringe-to-fringe and speckle-to-depth methods, respectively. From the shape reconstruction results displayed in the bottom portion of the figure, it is evident that the speckle-to-depth scheme reconstructs fewer surface details than the proposed fringe-to-fringe approach.

To further illustrate the performance of the proposed approach, Figure 6 shows the results of another two representative test samples. The first column in the top portion of the figure displays the single-shot inputs, and the next two columns are the phase distributions retrieved from the predicted fringe patterns and their ground truth. The first two columns in the bottom portion of the figure are the phase differences between the predicted and ground-truth ones. The last two columns demonstrate the 3D shapes reconstructed by the proposed technique and the conventional structured-light FPP technique. The mean values of the phase differences obtained for the two test samples are −2.616 × 10⁻⁴ and 3.958 × 10⁻⁴, respectively. It can be seen again that the proposed technique can reconstruct high-quality 3D shapes comparable to the ones reconstructed by the state-of-the-art structured-light technique, except for some noise along the edges.
Figure 7 exhibits a comparative demonstration of the proposed fringe-to-fringe approach and the fringe-to-ND method. The first row of the figure includes the plain image, the fringe-pattern input, and the ground-truth unwrapped phase distribution. The second and third rows plot the unwrapped phase distributions obtained from the predicted outputs, the phase errors, and the 3D shapes reconstructed by the fringe-to-fringe and fringe-to-ND methods. It is observed that both techniques can acquire 3D shapes close to the 3D ground-truth label, with detailed surface information. Nevertheless, a closer inspection reveals that the fringe-to-fringe technique provides slightly better results than the fringe-to-ND approach. This performance is reflected in Table 2, previously presented.

The last experiment is accomplished to demonstrate the practicality of the proposed technique for the 3D shape reconstruction of multiple separated objects. Conventionally, such a task involving geometric discontinuities cannot be fulfilled by an automated single-shot fringe-projection-based technique due to the ambiguity in determining discontinuous fringe orders. Nevertheless, the discontinuity of fringe orders can be identified by the proposed network since it produces multiple TFPS fringe patterns. This can be observed in Figure 8.


Conclusions
In summary, a novel, accurate single-shot 3D shape reconstruction technique integrating the structured-light technique and deep learning is presented. The proposed fringe-to-fringe network transforms a single RGB color fringe pattern into the multiple phase-shifted grayscale fringe patterns demanded by the subsequent 3D shape reconstruction process. The single-shot input can help to substantially reduce the capturing time; meanwhile, the accuracy can be maintained since the final 3D reconstruction is based on the conventional state-of-the-art high-accuracy algorithm.
The conventional structured-light-based FPP technique consists of a few temporal steps to decode the fringe patterns and reconstruct the 3D shapes. Any of the intermediate data, including phase-shifted patterns, wrapped phase distributions, fringe orders, unwrapped phase distributions, and height maps, may serve as the output in a deep learning network for the 3D shape reconstruction. In other words, the deep learning approach may be applied to any stage of the conventional FPP technique to facilitate the analysis and processing. Technically, a direct and straightforward 2D-to-3D conversion technique such as the fringe-to-depth method is desired. However, the complex relations between a camera-captured 2D image and its corresponding 3D shapes make such a direct transformation challenging when high-accuracy reconstruction is demanded. The experimental results presented in this paper show that the performance of the proposed approach is superior to that of other deep-learning-based methods. The proposed fringe-to-fringe technique uses the predicted multiple phase-shifted patterns as an intermediary to bridge a single 2D image and its corresponding 3D shape. It benefits from the fact that the relation of the single-shot input with the predicted fringe patterns is less complex than its relation with other intermediate results.
Because RGB color images are used in the proposed technique, it is generally not suitable for working with objects of dark colors, particularly vivid red, green, and blue colors. This is a limitation of the proposed technique. In the meantime, it should be pointed out that the projector and camera were initially color-calibrated by the manufacturers, and no special handling was applied to deal with color cross-talk. The experimental results have shown that the inevitable minor color cross-talk can be handled well by the technique.
The real-time and high-accuracy 3D shape reconstruction capability of the proposed approach provides a great solution for future scientific research and industrial applications.

Figure 1. Pipeline of the proposed approach.

Figure 2. Pipeline of the FPP technique with the triple-frequency phase-shifting scheme.

Figure 3. Representative input and output data pairs of the fringe-to-fringe, fringe-to-depth, speckle-to-depth, and fringe-to-ND networks.

Figure 4. 3D shape reconstruction of the proposed technique and the comparative fringe-to-depth method.

Figure 5. 3D shape reconstruction of the proposed technique and the comparative speckle-to-depth method.

Figure 6. Phase distributions, phase errors, and 3D shape reconstruction of two representative test samples.

Figure 7. 3D shape reconstruction of the proposed technique and the comparative fringe-to-ND method.

Figure 8. 3D shape reconstruction of multiple separated objects with complex shapes.

Table 1. The proposed fringe-to-fringe architecture and layer parameters.