Monocular Real-Time Full-Resolution Depth Estimation Arrangement with a Tunable Lens

This work introduces a real-time, full-resolution depth estimation device that allows integral displays to be fed with a real-time light field. The core of the technique is a high-speed focal stack acquisition method combined with an efficient implementation of the depth estimation algorithm, allowing the generation of real-time, high-resolution depth maps. As the procedure does not depend on any custom hardware, the described method can turn any high-speed camera that meets the requirements into a 3D camera with true depth output. The concept was tested with an experimental setup consisting of an electronically variable focus lens, a high-speed camera, and a GPU for processing, plus a control board for lens and image sensor synchronization. A comparison with other state-of-the-art algorithms shows the advantages of our method in computational time and precision.


Introduction
Standard cameras extract 2D light information from a scene, namely intensity and wavelength. However, 3D details are lost during this process due to the lack of information regarding the direction of each ray of light. The 3D data are essential to a better understanding of the scene, as they can be used to locate objects more accurately in the scene [1,2], to construct 3D meshes [3][4][5][6], or for artistic purposes in disciplines such as cinematography or computer game development [7].
Several methods exist to obtain 3D volumes. They can be classified into two main groups: active and passive. While active methods require additional hardware that emits light from the device to sense the 3D information of the scene [7][8][9][10][11], passive methods use only the received light. Classical passive methods rely on stereo [12][13][14], structure from motion (SFM) [15][16][17], depth from focus (DFF) [18,19], or monocular approaches that apply deep learning techniques to single images [20,21]. Passive methods usually do not obtain the 3D information in real time, as they require a high level of computational processing, and single-capture approaches cannot recover true distances.
The vision system presented in this paper extracts 3D information in real time using a sensor coupled with a variable focus lens, providing a full pipeline for passive real-time 3D extraction. A comparison with other widespread methods shows that our algorithm is faster and more accurate.

Materials and Methods
Our proposal is a vision system designed to capture 3D images in real time. The full pipeline has two main components: the setup (hardware) and the algorithm (software). Both are described in the next sections. For the hardware part of our vision system, several components were selected (Figure 1): a variable focus lens, a high-speed camera, a synchronization module, and a processing system.
A liquid lens was chosen as the variable focus lens because of its fast response: the fastest model on the market can switch focus at least 156 times per second with high precision and repeatability. The camera captures Full HD (1920 × 1080) at a high frame rate (greater than 156 fps) and supports the lens mount (C-mount). As a synchronization module, a microprocessor is used to synchronize the lens and the camera with high accuracy; for this purpose, an FPGA was selected, since FPGAs can generate multiple clocks without losing cycles. Finally, the processing system must support reading all the camera frames and needs a GPU to execute the parallel algorithm.
Our algorithm extracts distances from the captured frames with high accuracy and deals with the low-frequency problem, a common issue in DFF algorithms [22], by using the input color information.

Camera
The camera requirements must be chosen according to the selected optics and the desired resolution of the depth information. In the setup used for this paper, the variable focus lens is C-mount. The lens acquires six different focus planes by sweeping from the nearest focus to the farthest focus (infinity) without overlapping depths of field. The camera must be C-mount compatible and reach a frame rate of at least 150 frames per second (fps) to emulate a real-time 25 fps camera. A global shutter is mandatory to avoid artifacts while capturing at high frame rates. Additionally, an opto-isolated trigger input/output is needed for correct synchronization with the lens, together with a high-speed interface for data transfer. To send this amount of data with a Bayer pattern, a bus speed of at least 3 Gb/s is needed, following Equation (1):

AD = (F · W · H · C · b) / 10⁹, (1)

where AD is the amount of data in Gb/s, F the number of frames per second, W, H, and C the width, height, and number of channels, respectively, and b the bit depth per pixel in bits. Usable interfaces that can transfer such amounts of data include, but are not limited to, Camera Link (CL), CoaXPress (CXP), and MIPI, depending on the version and number of lanes. In the setup presented here, a frame grabber was used to read the output data. The camera and frame grabber used are the Flare2MP [23] and the Matrox Radient eV-CL [24]. The capture is done at 156 fps with an exposure time of 6 milliseconds.
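As a rough check of Equation (1), the required bandwidth can be computed directly. This is a minimal sketch; the 8-bit depth assumed for the raw Bayer stream and the function name are illustrative, not taken from the original text:

```python
def data_rate_gbps(fps, width, height, channels, bit_depth=8):
    """Amount of data AD in Gb/s for a raw video stream (Equation (1)).

    bit_depth=8 is an assumed value for a raw Bayer-pattern stream.
    """
    return fps * width * height * channels * bit_depth / 1e9

# Full HD Bayer stream at 156 fps (1 channel before demosaicing)
ad = data_rate_gbps(156, 1920, 1080, 1)
print(f"{ad:.2f} Gb/s")  # roughly 2.6 Gb/s before protocol overhead
```

With protocol overhead on top of the raw payload, this is consistent with the "at least 3 Gb/s" bus requirement stated above.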

Variable Focus Lens
In order to capture multiple images of the same scene at different focus positions at high speed, a variable focus lens is needed. In our setup, an electronically focus controllable lens is used (Varioptic's C-C-39N0-250) [25].
This lens is controlled by a custom-made camera-lens synchronization module based on an FPGA, which accurately coordinates the lens focus distance with the image sensor trigger to capture the images of the focal stack. The focus sweep follows a "sawtooth" curve when the focus position is plotted against time from near to far focus. Between the near and far focus, six images with short exposure time are acquired, taking a total of 38.4 ms to capture one focal stack. This provides an effective frame rate of 26 fps (156 total frames acquired by the image sensor). Figure 2 shows the described sawtooth. In our case, two commands must be sent to the liquid lens to achieve focus repeatability for each captured image; Figure 3 displays the RMS voltage values used to modify the focal length as well as the optical power.
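The timing above can be sketched numerically. This is a minimal illustration, assuming the sensor period is shared evenly among the six focal planes; the constants are taken from the text:

```python
SENSOR_FPS = 156        # trigger rate of the image sensor
PLANES_PER_STACK = 6    # focus positions in one sawtooth sweep

frame_period_ms = 1000 / SENSOR_FPS                 # ~6.41 ms per frame
stack_time_ms = PLANES_PER_STACK * frame_period_ms  # ~38.5 ms per stack
effective_fps = SENSOR_FPS / PLANES_PER_STACK       # 26 depth maps per second

print(round(stack_time_ms, 1), effective_fps)
```

The result matches the figures quoted above: one full sweep takes roughly 38.4 ms, yielding an effective depth-map rate of 26 fps.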

Liquid lenses have been successfully used in other fields, for example in the medical and surgical sectors by increasing the depth of field [26], or by adding other kinds of information, such as polarization, to 3D scenes, resulting in a four-dimensional experimental space [27].

Synchronization Module
The camera, the lens, and the processing system must be synchronized to obtain the focal stack at the desired moment without losing a single cycle; common CPUs are not valid for this due to clock uncertainty. To meet the synchronization requirements, any microprocessor able to generate two clocks and an I2C signal is valid. An Arty Z7 [28] was chosen, using the VHDL language to generate the appropriate modules to control the lens via I2C and to generate the output clocks for the trigger and the lens, using a PC reset signal as the input to start capturing.

Processor System
Once the focal stack is captured, it is sent to the processing system to estimate depth. This system can be any hardware capable of reading and processing the incoming data at the needed speed. In our setup, the depth estimation algorithm (explained in Section 2.3) runs on a GeForce GTX 1080 GPU, and the data is transferred to the GPU through the Matrox Radient eV-CL already described.

Depth Estimation Algorithm
Dense depth estimation algorithms must first recover depth in the textured zones and then fill the textureless zones with neighbouring information; in this paper, the result of processing the high-frequency zones is called the sparse estimation of depth. The sparse estimation is computed over the focal stack using a defocus operator [29]. The amount of information estimated heavily depends on the scene; if the image lacks high frequencies, depth cannot be estimated. Once the sparse depth map has been computed, the unknown values must be filled. For this purpose, the sparse depth map and a composition of the focal stack are used as the starting point of the filling algorithm. Figure 4 shows the steps of the algorithm pipeline. Let I(z) be the focal stack with shape W × H × N, with W, H, N the width, height, and number of planes of the stack, respectively. The sparse depth map is defined as follows:

D_sparse = argmax_z Σ_{i=1}^{n} U_i(G_z(R_{1/i}(I))), (2)

where G_z(I) is the 2D gradient magnitude function for each image of the stack, U_i is an area upsample function, and R_{1/i} is an area downsample function, with i the resize factor and n the maximum number of pyramids (n = 3 for our results). The multiscale approach avoids noise artifacts by eliminating small noisy pixels; only the maximum gradients that exceed a tolerance factor with respect to the other gradient values of the stack are considered. As only gradients are used, the resulting sparse depth map contains information only along the edges; however, a complete depth map is needed. Figure 5 shows a real sparse depth map obtained with our camera.
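A minimal single-scale sketch of the sparse estimation step, in NumPy; the multiscale pyramid (U_i, R_{1/i}) and the defocus operator of [29] are omitted here, and the tolerance factor `tol` is an illustrative assumption:

```python
import numpy as np

def sparse_depth(stack, tol=1.5):
    """stack: (N, H, W) grayscale focal stack.

    Returns an (H, W) map of plane indices where focus is confident,
    and -1 where the gradient response is too weak (textureless zones).
    """
    # 2D gradient magnitude G_z for every plane of the stack
    gy, gx = np.gradient(stack, axis=(1, 2))
    g = np.hypot(gx, gy)
    best = g.argmax(axis=0)            # plane with the sharpest response
    peak = g.max(axis=0)
    mean = g.mean(axis=0) + 1e-8
    confident = peak > tol * mean      # keep only clearly focused pixels
    return np.where(confident, best, -1)
```

Only pixels whose maximum gradient clearly exceeds the average across the stack receive a depth index, which mirrors the tolerance criterion described above.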
The next step before generating the depth map is to evaluate intra-stack movement: capturing non-static scenes generates artifacts due to the displacement occurring between the stack planes. The procedure eliminates these artifacts from the sparse depth map using the Z-entropy function of Equation (3), since the regularization procedure by itself does not compensate for the movement:

H(S)(x, y) = −Σ_{z=1}^{N} p_z(x, y) log₂ p_z(x, y), (3)

where S is the stack, H the entropy, z the index of each plane, and p_z the frequency of the quantized value of plane z at pixel (x, y). For this procedure, the input values are quantized into bins; the idea is to use at most 2 bits to represent the different values of the same pixel along the z axis, avoiding large changes due to the movement. The last part of the algorithm infers the unknown values of the sparse depth map. The final solution should satisfy the following constraints:
• All the data points in the existing sparse depth map have to be present in the final solution.
• There are no missing depth values.
• The resulting depth map should be edge preserving and smooth within the map.
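The Z-entropy test of Equation (3) can be sketched as follows, assuming pixel values in [0, 1] quantized to four bins (2 bits); the bin edges are an illustrative choice:

```python
import numpy as np

def z_entropy(stack, bins=4):
    """Shannon entropy H along z for each pixel of stack S of shape (N, H, W).

    Pixel values (assumed in [0, 1]) are quantized to `bins` levels (2 bits)
    before computing the per-pixel histogram along the z axis.
    """
    n = stack.shape[0]
    q = np.clip((stack * bins).astype(int), 0, bins - 1)
    h = np.zeros(stack.shape[1:])
    for b in range(bins):
        p = (q == b).sum(axis=0) / n   # frequency of bin b along z
        with np.errstate(divide="ignore", invalid="ignore"):
            term = np.where(p > 0, -p * np.log2(p), 0.0)
        h += term
    return h  # high entropy flags pixels that changed between planes
```

A static pixel keeps the same bin in every plane and scores zero entropy; a pixel affected by movement spreads over several bins and scores high, so it can be masked out of the sparse depth map.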
Following the previous statements, the solution will be the one that minimizes the error function:

E(D) = Σ_{x,y} M(x, y) · (D(x, y) − D_sparse(x, y))² + λ · S(D, R), (4)

where D(x, y) is the resulting depth map and M(x, y) is the binary mask of Equation (5); this mask ensures that the original values of the sparse depth map are not modified. S(D, R) is the smoothing term, weighted by the parameter λ, which ensures that the depth map is smooth while preserving the edges present in the reference R. The idea of S(D, R) is to fill the depth map values starting from the sparse depth map, which is the first iteration, but also using the color distance provided by the reference.
The reference R is a simple composition of the input stack, presented in Equation (6), and the smoothing term is defined by Equation (7).
where Ŵ(R, x₁, y₁, x₂, y₂) is the bistochastized version of a bilateral affinity function W.
The affinity of each pixel depends not only on the spatial distance between pixels, but also on the color distance in the YUV color space. W is defined as follows:

W(R, x₁, y₁, x₂, y₂) = exp(−((x₁ − x₂)² + (y₁ − y₂)²) / (2σ²_xy) − (l₁ − l₂)² / (2σ²_l) − ((u₁ − u₂)² + (v₁ − v₂)²) / (2σ²_uv)), (8)

where σ²_xy, σ²_l, and σ²_uv are the spatial, luma, and uv variances, respectively, and (l, u, v) are the YUV values of the reference R at each pixel. The problem of minimizing the error function (4) is intractable; however, it can be modified as in [30,31] to solve the problem in bilateral space [32].
With the problem modified, the selected parameters to execute the algorithm are σ_xy = 16, σ_l = 16, and σ_uv = 16, which yield a 16 × 16 × 16 grid in bilateral space. This reduces the number of neighbours in the search for candidates, thereby reducing the computational time. To remove the square artifacts produced by our approach, a final Gaussian blur is applied using a kernel size of ks = 16 × 1.2 and σ = ks × 0.7, obtaining the result shown in Figure 6.

Results
This section presents the results of our arrangement. The working principle of our system is depth estimation from the focal stack acquired using a liquid lens.
The raw data obtained by our arrangement are presented in Figure 7, which shows the images of the focal stack used to calculate the depth map of Figure 6. With the resulting depth map it is also possible to extract an All-In-Focus (AIF) image, as shown in Figure 8, by indexing and interpolating the input images with the depth map indices. The fast capture acquisition reduces intra-stack movement, avoiding high-frequency artifacts in the depth map. Some frames extracted from a captured video are shown in Figure 9. As discussed in Section 2.3, the Z-entropy must be computed to remove movement artifacts: Figure 10 presents a stack with some movement in the hand and in the mouse, and Figure 11 shows the mask detected as movement and removed from the sparse depth map. Figure 12 shows a 3D point cloud generated from the depth map (Figure 6) and the all-in-focus image (Figure 8), using the intrinsic parameters of the camera to assign the Z distances. Figure 12. Point cloud generated from a frame of the video in Figure 9; the different views of the point cloud were generated with the MeshLab [33] interface.
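The AIF extraction described above can be sketched as follows; this is a minimal nearest-plane version that omits the interpolation between planes:

```python
import numpy as np

def all_in_focus(stack, depth_idx):
    """stack: (N, H, W) or (N, H, W, C) focal stack.
    depth_idx: (H, W) integer plane index per pixel (from the depth map).

    Picks, for every pixel, the value from the plane where it is in focus.
    """
    idx = depth_idx[None, ...]        # align index map with the z axis
    if stack.ndim == 4:
        idx = idx[..., None]          # broadcast over the color channel
    return np.take_along_axis(stack, idx, axis=0)[0]
```

Each output pixel is simply copied from the stack plane selected by the depth map, so sharp regions from every focus position are combined into one image.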

Discussion
The experimental setup is not directly comparable with other acquisition methods; therefore, different algorithms were chosen to compare the resulting depth maps.
Many algorithms have been proposed in the state of the art to estimate depth from a focal stack capture [19,22,[34][35][36][37]. Three were chosen for the discussion. The first was selected due to the similarity of the capture system [34]: Hui et al. present a camera with a liquid lens that obtains the different focus positions by varying the voltage, like the arrangement proposed in this document. The second algorithm is the most referenced in the literature [36] and uses classic computer vision to solve the problem, which also makes it comparable with the algorithm presented in Section 2.3. The last one was selected to compare a classical computer vision algorithm with neural networks, and is the first article that uses neural networks to solve the problem [19]. The article presented by our team [37] was compared visually with the algorithm presented here, in the same way as the others; a noise study was not applied to it, given that the ground truth used for this study was included in its training stage.
In the method presented in this paper, the focal stack acquisition and processing are done in real time, allowing for live video as presented in Section 2.3. In addition, our prototype works outdoors, even with direct sunlight from behind the objects. Figure 13 shows the comparison between the method presented in this paper and the methods mentioned above [19,34,36], using real images. Only the computer vision algorithms could be compared in this way, as the focal stack was obtained with our camera and we did not have access to the raw data collected in the referenced works.
In the images shown in Figure 13, our algorithm appears to be more edge preserving and smoother with the environment of the scene. Hui et al.'s implementation appears to work well with big objects inside the scene; the algorithm is edge preserving but not very smooth between the different distance levels. The DDFF algorithm has a very low-resolution output, and its neural network is trained with a mobile-captured dataset under conditions very different from those of the proposed camera. The VDFF results are similar to ours, but the algorithm executes more slowly and is less accurate under noisy conditions. The relative multiscale deep depth from focus approach [37] is compared with the same images as the others, as shown in Figure 14. The neural network approach shows better results in some scenes: in textureless zones, the areas without edges obtain good depth map values. However, if we use images from a camera with a different configuration, as in the three top images, the classical algorithms adapt better than the neural network approach; the same happens if we add a macro lens on top of the arrangement and capture the focal stack by varying the focus positions, as can be seen in the bottom images. (c) VDFF [36]. (d) Hui et al. [34]. (e) DDFF [19].
To evaluate the accuracy of our method, a PSNR metric was chosen:

PSNR(D, GT) = 10 log₁₀(M² / MSE(D, GT)), (9)

where M is the maximum pixel value and MSE is the mean squared error between the computed depth map D and the ground truth GT. The metric result is expressed in dB. Ground truth depth maps were obtained from the Middlebury dataset [39], using the original left image to generate synthetic defocused images following the procedure of J. Lee et al. [40]. The comparison shown in Figure 15 was made using different images of the cited dataset, adding simulated camera noise to evaluate the robustness of the different algorithms. The average PSNR was computed with the following equation:

PSNR_avg = (1/N) Σ_{i=1}^{N} PSNR(D(i), GT(i)), (10)

where N is the number of images in the dataset, five in our case. Figure 16 shows some of the results over one image with the different methods and the ground truth. Table 1 shows the average PSNR and the runtime comparison. These measurements use 10 planes due to a limitation of the DDFF algorithm. The comparison times of DDFF were extracted from their article, and the code is provided by the authors. Hui et al.'s method was implemented in Python following the paper's explanation, and the times were extracted from their article, taking the time they report for one plane and multiplying it by 10. The VDFF times were obtained in the same way as for our algorithm, and the code is provided by the authors. The hardware setup is an i7-9700 CPU and an NVIDIA GTX 1080 GPU. Table 1 demonstrates that our approach executes faster than the other methods, thanks to the parallel setup, and that the accuracy also improves, as evidenced by the enhancement in the shape of the images.
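Equations (9) and (10) can be sketched as:

```python
import numpy as np

def psnr(d, gt, max_val=255.0):
    """Peak signal-to-noise ratio in dB between depth map d and ground truth gt."""
    mse = np.mean((d.astype(float) - gt.astype(float)) ** 2)
    return 10 * np.log10(max_val**2 / mse)

def average_psnr(depths, gts, max_val=255.0):
    """Mean PSNR over a dataset of N depth maps (N = 5 in this paper)."""
    return sum(psnr(d, gt, max_val) for d, gt in zip(depths, gts)) / len(depths)
```

Here `max_val=255.0` assumes 8-bit depth maps; the value of M should match the encoding actually used for the maps being compared.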

Conclusions
The digital information provided by common cameras lacks part of the information of the captured scene. Knowing the distances from the lens to the scene would help to improve the perception of it. This paper proposes a passive method to extract the depth information in real time with a full pipeline.
The presented algorithm shows an improvement in the regularization term, filling textureless zones faster than the other methods, while the extraction of high frequencies is comparable with the other algorithms in computational time. Under poorly illuminated conditions, due to the low exposure time of the fast acquisition camera, the DFF algorithm does not behave as desired. Liquid lenses have a very small diameter because of the difficulty of moving large amounts of liquid electronically, which is a critical decision point for obtaining good results. Some of the compared algorithms provide a better PSNR than ours on synthetic, noise-free scenes: due to their selection of the defocus operator, their behavior is better on synthetic images with a known blur applied to the original image. Our defocus operator was chosen to avoid noise artifacts, and this causes a loss of information on purely synthetic images.
This work demonstrates that it is possible to build a setup without lasers or multiple cameras that converts a fast acquisition camera into a true 3D camera.