Iterative Refinement of Uniformly Focused Image Set for Accurate Depth from Focus

Estimating the 3D shape of a scene from a set of differently focused images has been a practical approach for 3D reconstruction with color cameras. However, depth reconstructed with existing depth from focus (DFF) methods still suffers from poor quality in textureless and object boundary regions. In this paper, we propose an improved DFF-based depth estimation method that iteratively refines the 3D shape using a uniformly focused image set (UFIS). We investigate the appearance changes in the spatial and frequency domains in an iterative manner. To achieve sub-frame accuracy in depth estimation, the optimal location of the focused frame is estimated by fitting a polynomial curve to the dissimilarity measurements. To avoid wrong depth values in textureless regions, we build a confidence map and use it to identify erroneous depth estimates. We evaluated our method on public datasets and our own datasets obtained from different types of devices, such as smartphones, medical cameras, and ordinary color cameras. Quantitative and qualitative evaluations on various test image sets show promising depth estimation performance for the proposed method.


Introduction
Estimating the three-dimensional shape of a scene from a color image is a challenging task [1]. Without any prior knowledge of the scene, recovering the three-dimensional shape of objects with a single color camera is an ill-posed problem. Many researchers have proposed diverse approaches for 3D depth acquisition. For example, in References [2,3], the authors estimate depth information based on a coded aperture framework. However, this requires hardware modification and therefore cannot be applied to off-the-shelf cameras, such as smartphone cameras. Depth from focus (DFF), or shape from focus (SFF), is a technique for estimating depth from a set of image frames with continuously changing focus that are taken at the same location and viewpoint. By identifying the best-focused frame for each pixel, the depth of each pixel can be estimated. In DFF, three factors determine the quality of the reconstructed depth: lens aperture, focal length, and focusing distance. While many computer vision techniques assume that the given images are obtained with a pinhole camera, DFF assumes that a real-aperture camera is used. Real-aperture cameras have a relatively short depth of field, so each image is in focus only over a small distance range of the scene. For instance, Darrell and Wohn [4] use Laplacian of Gaussian pyramids to find the best-focused frame at each pixel.
There are many traditional depth from focus methods [5][6][7][8][9]. Gaganov et al. [6] propose an SFF algorithm based on Markov random fields, and Mendapara et al. [7] use the SUSAN operator. Mahmood et al. [8] use the energy of high-frequency components in the S transform, and Mahmood et al. [9] propose 3D anisotropic nonlinear diffusion filtering for accurate SFF. Recently, DFF methods have been improved in various ways to produce more accurate depth maps and high-quality all-in-focus images. Muhammad and Choi [10] use a neural network to extract the 3D shape of objects based on DFF. In Reference [11], the authors use neural networks to learn the shape of the focused image surface (FIS). In References [12,13], the authors use dynamic programming optimization to obtain the depth (3D shape) of the scene from a set of focused images; their method works significantly faster than previous FIS algorithms. In Reference [14], the authors propose a local search algorithm for the SFF problem to reduce computational complexity. In Reference [15], the authors propose a focus measurement method based on an optical transfer function, implemented in the Fourier domain, for 3D shape recovery. Sun et al. [16] employ the entropy of high-frequency bands as the amount of blur, combined with texture information, for improved performance. Liu et al. [17] propose a semi-global DFF approach that enforces an adaptive smoothness constraint on the reconstructed depth of the scene. Moeller et al. [1] propose a variational approach using an efficient non-convex minimization scheme to produce a depth map. Suwajanakorn et al. [18] formulate the uncalibrated DFF problem and propose a new focal-stack alignment algorithm to estimate the depth of a given scene using hand-held cameras.
In Reference [19], the authors propose a new focus measure that is robust to noise and more accurate. They present the ring difference filter (RDF), which inserts a gap and looks at pixels located farther away from the point of interest (POI). They extended their work [20] by proposing RDF-based cost aggregation that utilizes both local and non-local characteristics by inheriting the structure of the RDF. Recently, Zhiqiang et al. [21] proposed a depth recovery framework that includes depth reconstruction and refinement. They use a non-local matting Laplacian prior and variance-based confidence computation, which produces depth maps that are robust in textureless regions and have cleaner edges. Hazirbas et al. [22] propose an auto-encoder-style network to predict depth from a focal stack. To train their network, they create the 12-Scene Benchmark dataset. For the encoder, they use VGG-16, a popular deep neural network for object recognition [23], without the fully connected layers; for the decoder, they simply use the flipped structure of the encoder network. The network provides a sharpness map (feature map) for each frame separately, with a single output depth map. However, such a deep learning approach has limitations, such as a fixed number of input frames (which depends on the training samples).
However, traditional DFF methods suffer from unstable estimates in textureless regions and object boundary regions. A textureless region offers only limited clues for estimating the focus amount from its visual appearance, and an object boundary contains sudden changes in depth that make patch-based focus estimation difficult. To resolve these limitations, we suggest maximizing depth inference from neighboring pixels and neighboring frames by globally optimizing the overall changes of focus amount along the spatial and frame domains. In this work, we propose a new DFF method using a single color camera. We investigate the appearance changes in the spatial and frequency domains over differently focused image frames. In the spatial domain, the change in visual appearance of a low-textured region along the frames is difficult to observe. In the frequency domain, however, a low-textured region gives a slightly better observation over the frames. Using both spatial and frequency domain observations, we obtain robust depth estimates at each pixel, even in low-textured regions. To achieve sub-frame accuracy in depth estimation at each pixel, the optimal location of the best-focused frame is estimated by fitting a second-order (quadratic) polynomial curve to the dissimilarity measurements. Based on the estimated initial depth, we build a uniformly focused image set by pixelwise adjustment of the depth level. In other words, an all-in-focus image and the accompanying uniformly out-of-focus images are created. We then repeat our depth estimation iteratively on this uniformly focused image set to refine the depth. During depth estimation, we create a confidence map that helps to fill erroneous depth points caused by textureless uniform regions. After filling the erroneous points, we apply a guided filter [24], guided by the all-in-focus image built from the predicted depth, to obtain a cleaner final depth image.
Each estimated depth point is treated as a 3D point, and the surface normal of the point is computed from its 8 neighbors [25]. We interpolate the points using Algebraic Point Set Surfaces (APSS) [26] and, finally, create a polygonal surface representation using the marching cubes algorithm [27].

Proposed Depth from Focus
The proposed method consists of three steps. In the first step, the initial depth of each pixel is estimated. The second step is iterative refinement, which builds a uniformly focused image set. The last step is post-processing, which improves the depth estimation in both textureless and object boundary regions.

Initial Depth Estimation
To find the best-focused frame for each pixel, we calculate the neighborhood patch dissimilarity. A small window causes noisy depth because it lacks enough observations for focus estimation. On the other hand, a bigger patch loses detail in depth. In our work, we use an adaptive patch size at each iterative estimation. In addition, we use a circular window to minimize artifacts in the estimated depth.
A circular window creates less perceivable noise. Figure 2 shows the difference in 3D reconstruction results between square and circular windows.

Figure 2. From left to right: all-in-focus image, proposed depth map, and 3D reconstruction results using circle and square windows (highlighted part zoomed in). The 3D reconstruction results show that the circular window produces fewer artifacts.
To find the best-focused frame using patch dissimilarity, we choose a reference frame arbitrarily from the focal stack. Then, we calculate the normalized average absolute difference of patches between the reference frame and all other frames at each pixel. The best-focused frame appears as a local maximum in this difference, so detecting the best-focused frame for each pixel becomes a simple local maximum detection task. However, if the best-focused frame is identical or very close to the reference frame, it cannot be detected easily. To avoid this singular case, we blur the reference frame before computing the difference. The blur amount of the reference frame is selected so that it does not match the blur amount of any other frame in the focal stack (Figure 3a). We use Gaussian blurring, assuming that it simulates the optical blur in out-of-focus images well.
The normalized average absolute difference between the reference frame and the kth frame is calculated on both the spatial and frequency domain information of the given patch, using Equation (1) for the spatial domain and Equation (2) for the frequency domain. In these equations, x and y are pixel indexes, w is the window size, R is the reference frame in the spatial domain, R̂ is the reference frame in the frequency domain (edge effects are neglected using the mask), k is the frame index (1 ≤ k ≤ n, where n is the number of frames in the focal stack), F is the kth frame in the spatial domain, F̂ is the kth frame in the frequency domain, and mask is the binary circular mask (used to obtain a circular window). Then, we create a distance vector over the frame indexes by weighting and normalizing the results of Equations (1) and (2) using Equation (3). In Equation (3), the prime mark (′) means that the variable is normalized between 0 and 1, and α is a weight that assigns a preference to either the spatial domain clue or the frequency domain clue (Figure 4).
Figure. Difference between the reference frame and the other frames on a less textured part of the image (marked with a rectangle). In each graph, the x-axis is the frame index and the y-axis is the difference between the reference frame and the other frames (normalized between 0 and 1). From left to right: one of the input images; frame difference in the frequency (blue) and spatial (red) domains; third to fifth: summation results after applying the weight to both domains.
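The masked patch difference and the weighted combination described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the difference is averaged over the pixels inside the circular mask (the exact normalization factor of Equation (1) is not recoverable here), α is applied to the spatial term and 1 − α to the frequency term, and border handling is omitted. The frequency-domain difference vector is assumed to have been computed in the same way on Fourier-transformed patches.

```python
def circular_mask(w):
    """Binary w x w mask (odd w): 1 inside the inscribed circle, 0 outside."""
    if w < 1 or w % 2 == 0:
        raise ValueError("window size must be a positive odd integer")
    r = w // 2
    return [[1 if (i - r) ** 2 + (j - r) ** 2 <= r * r else 0
             for j in range(w)] for i in range(w)]


def masked_patch_difference(ref, frame, x, y, mask):
    """Mean absolute difference between two images over the masked window
    centred at column x, row y. Assumes the window lies inside the image."""
    r = len(mask) // 2
    total, count = 0.0, 0
    for i in range(-r, r + 1):
        for j in range(-r, r + 1):
            if mask[i + r][j + r]:
                total += abs(ref[y + i][x + j] - frame[y + i][x + j])
                count += 1
    return total / count


def normalize01(values):
    """Min-max normalize a list to [0, 1]; a constant list maps to zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combined_distance(spatial, frequency, alpha=0.5):
    """Weighted sum of the normalized per-frame differences.
    Assumption: alpha weights the spatial clue, (1 - alpha) the frequency clue."""
    s = normalize01(spatial)
    f = normalize01(frequency)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(s, f)]
```

Calling `masked_patch_difference` once per frame k at a fixed pixel yields the spatial difference vector; the index of the largest entry of `combined_distance` is then the candidate best-focused frame for that pixel.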
The final step is finding the index of the maximum distance, max(d_k), which indicates the potential best-focused frame. When we calculate the difference between the reference frame and the other frames, the best-focused frame may give a bigger distance than its neighbors (Figure 5a). Note that, in DFF, the index of the best-focused frame indicates the depth of the corresponding point. To estimate the best-focused frame more accurately, at sub-frame level, we use parameterized quadratic curve fitting [40] around the maximum index (Figure 1 (curve fitting) and Figure 3b). We fit the second-order polynomial f(x) = a_0 + a_1 x + a_2 x^2 to d_k. Figure 3c shows our depth estimation with and without curve fitting.
(a) Choosing the reference frame arbitrarily (synthetic illustration and real example; x-axes: frame numbers). The panels illustrate reference frame 12 with focused frame 35, and the case where the reference frame is itself the focused frame (both 35). The real example shows the distance between the reference frame and the other frames. Red point: blur amount of the reference frame (approximate); green mark: focused frame.

(b) Initial depth maps of the proposed and recent methods. Panels: all-in-focus image, proposed, PAD, VAR, and F-RDF. The initial depth maps show that the proposed focus measure is more robust than state-of-the-art methods, such as PAD [39], VAR [1], and F-RDF [20].
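The sub-frame peak estimation described in this section (fitting f(x) = a_0 + a_1 x + a_2 x^2 around the maximum of d_k) can be illustrated with a short sketch. It assumes the parabola is fitted exactly through the maximum and its two immediate neighbors; the paper does not state how many samples around the maximum are used, so the three-point choice and the name `subframe_peak` are illustrative only.

```python
def subframe_peak(d):
    """Return the fractional frame index at which a parabola through the
    maximum of d and its two neighbours reaches its vertex."""
    k = max(range(len(d)), key=lambda i: d[i])
    if k == 0 or k == len(d) - 1:
        return float(k)  # a neighbour is missing: keep the integer index
    left, mid, right = d[k - 1], d[k], d[k + 1]
    # For samples at x = -1, 0, 1: a_2 = (left - 2*mid + right) / 2 and
    # a_1 = (right - left) / 2, so the vertex is at -a_1 / (2 * a_2).
    denom = left - 2.0 * mid + right
    if denom >= 0.0:
        return float(k)  # flat neighbourhood: no well-defined peak
    return k + 0.5 * (left - right) / denom
```

For distance values sampled from a parabola whose true peak lies at frame 3.3, the function recovers 3.3 even though the largest sample sits at frame 3, which is the sub-frame gain the curve fitting is meant to provide.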

Iterative Depth Refinement
In the previous section, the initial depth was estimated by investigating the appearance changes in both the spatial and frequency domains. To obtain a more accurate result, we build a uniformly focused image set (UFIS) by taking pixels from the focal stack according to the resulting depth values. In other words, an all-in-focus image and the accompanying uniformly out-of-focus images are created. Then, using the UFIS, we improve our depth estimation iteratively. Let n be the number of frames of the UFIS. n is an odd number, and the central frame c is the all-in-focus frame generated from the initial depth result (if n = 7, then c = 4; if n = 5, then c = 3). We fill all other frames of the UFIS using Equation (4), in which x and y are pixel indexes, T is the UFIS (synthetic focal stack), i is the index of the reconstructed frame of T, F is the initial focal stack, and f is the corresponding frame number calculated by Equation (5) from the initial depth estimation result. In Equation (5), i is the index of the reconstructed frame (as in Equation (4)), D_{x,y} is the estimated depth at position [x, y] (the same position as in Equation (4)), and c is the central frame index of T.
After the UFIS is constructed, we repeat the initial depth estimation process on it with a decreasing window size and update the depth with the result. We continue iterating with smaller window sizes until the change in the depth values between iterations falls below a stopping criterion. The depth is updated at each iteration using Equation (6), in which D^p_{x,y} is the previous depth at location (x, y), d_{x,y} is the depth result of the current iteration, and c is the center frame index. Figure 6 shows the results of 4 iterations.
The first image is one of the input frames, the second is a point cloud of the scene from the initial depth estimation, and images A, B, C, and D show the point clouds of each iteration: green is the first iteration with window size 7 × 7, orange is the second iteration with window size 5 × 5, blue is the third iteration with window size 3 × 3, and purple is the fourth iteration with window size 1 × 1. Image A shows the point clouds of all iterations; the green point cloud is flatter in the highlighted region (in Figure 6). Image B shows the point clouds of the second, third, and fourth iterations; the yellow point cloud is flatter than the blue and purple ones (in Figure 6). Image C shows that the blue point cloud has flatter regions than the purple one (in Figure 6). This shows that the point cloud becomes more accurate with each iteration (more curved in the highlighted region). The UFIS process tries to reach an optimal result starting from the corresponding initial (or previous) depth estimate. Figure 7 describes the creation of our UFIS.
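The UFIS construction and the per-iteration depth update can be sketched as follows. Because Equations (4) to (6) are not reproduced in this text, the sketch assumes the forms suggested by the variable definitions: T_i(x, y) = F_f(x, y) with f = D_{x,y} + (i − c), and an updated depth of D^p_{x,y} + (d_{x,y} − c). The function names, the 0-based indexing, the use of integer depth indexes when sampling the stack, and the clamping of f to the valid frame range are all assumptions made for illustration.

```python
def build_ufis(stack, depth, n=5):
    """Build a synthetic focal stack of n frames (n odd).
    stack[f][y][x] is the original focal stack; depth[y][x] is the integer
    index of the best-focused frame. The centre frame of the result is the
    all-in-focus image; frame i is uniformly (i - c) frames out of focus."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd integer")
    c = n // 2                      # 0-based centre (the paper uses 1-based c)
    last = len(stack) - 1
    height, width = len(depth), len(depth[0])
    ufis = []
    for i in range(n):
        frame = []
        for y in range(height):
            row = []
            for x in range(width):
                f = depth[y][x] + (i - c)
                f = min(max(f, 0), last)   # assumption: clamp at stack ends
                row.append(stack[f][y][x])
            frame.append(row)
        ufis.append(frame)
    return ufis


def update_depth(prev_depth, current, n):
    """Shift the previous depth by the offset of the new peak from the
    centre frame of the UFIS (assumed form of the update)."""
    c = n // 2
    return [[p + (d - c) for p, d in zip(prow, drow)]
            for prow, drow in zip(prev_depth, current)]
```

If the depth is already correct at a pixel, the peak found on the UFIS sits at the centre frame and the update leaves that pixel unchanged, so the iteration converges when all offsets d − c approach zero.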

Textureless Region and Post Processing
Estimating depth in textureless regions is one of the challenging problems in DFF. Most existing DFF methods fail in textureless regions because there is not enough information to observe. Figure 8 shows two different parts of the input scene: textured and textureless regions. The image shows that the textured surface has more information to observe, whereas the textureless surface lacks enough information to choose the best-focused frame. In Figure 9, the depth estimation result has many erroneous points in textureless regions. To fix these erroneous points, we have to use depth information from their neighbors.
To identify textureless regions, we investigate the all-in-focus image (generated using the depth map) and also find inaccurate (noisy) regions in the depth map. We create two binary masks, TRM and IDM.
TRM mask. Textureless regions have lower variance in color. To create the TRM mask, we scan a window over the all-in-focus image and calculate the color variance within the window. If the variance is lower than a threshold, we mark the point as textureless. The threshold is the squared window size (th = w^2, where w is the window size). This threshold was chosen empirically as the value that produces better results than other variations when pixel values are between 0 and 255. Figure 9D shows the TRM mask (black: textureless region).
IDM mask. Inaccurate depth points, which come from textureless parts, have higher variance, while accurate depth points have lower variance. Based on this assumption, we create the IDM mask. The process of generating the IDM mask is similar to that of TRM: if the variance is higher than a threshold, we mark the point as an inaccurate depth point. Figure 9E shows the IDM mask (black: inaccurate depth). IDM sometimes marks regions around object edges as inaccurate even where the depth is correct. We therefore overlap both masks to isolate textureless regions, and we denote erroneous depth from textureless regions as EDfT. Finally, hole filling is performed on the EDfT regions: we replace EDfT points with neighbors that have high confidence scores. Depth confidence is higher when the peak in the sharpness estimation is much greater than its neighbors in the focus measure process. Based on this idea, we create a confidence score map during depth estimation.
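The two variance-based masks and their overlap can be sketched as below. This is an illustration under assumptions rather than the authors' code: it works on a single-channel (grayscale) all-in-focus image instead of color, clips the window at image borders, uses True for flagged pixels (the figures show them in black), and takes the IDM threshold as a parameter because the text does not give its value. All function names are hypothetical.

```python
def local_variance(img, x, y, w):
    """Population variance of img in the w x w window centred at (x, y),
    with the window clipped at the image border."""
    r = w // 2
    rows = range(max(0, y - r), min(len(img), y + r + 1))
    cols = range(max(0, x - r), min(len(img[0]), x + r + 1))
    vals = [img[j][i] for j in rows for i in cols]
    mean = sum(vals) / len(vals)
    return sum((v - mean) ** 2 for v in vals) / len(vals)


def textureless_mask(gray, w):
    """TRM sketch: True where the intensity variance is below th = w^2."""
    th = w * w
    return [[local_variance(gray, x, y, w) < th
             for x in range(len(gray[0]))] for y in range(len(gray))]


def inaccurate_depth_mask(depth, w, depth_th):
    """IDM sketch: True where the depth variance exceeds depth_th."""
    return [[local_variance(depth, x, y, w) > depth_th
             for x in range(len(depth[0]))] for y in range(len(depth))]


def edft_mask(gray, depth, w, depth_th):
    """EDfT sketch: pixels flagged by both masks, i.e. noisy depth that
    lies in a textureless region rather than on a textured object edge."""
    trm = textureless_mask(gray, w)
    idm = inaccurate_depth_mask(depth, w, depth_th)
    return [[t and d for t, d in zip(trow, drow)]
            for trow, drow in zip(trm, idm)]
```

Requiring both conditions is what removes the false positives that IDM alone produces at correct depth discontinuities, since those edges usually coincide with visible texture in the all-in-focus image.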
After fixing the depth values of textureless regions, we perform post-processing with a guided filter [24] to obtain cleaner depth around object boundaries. The guided filter investigates intensity changes in the all-in-focus image (Figure 10b). Finally, we scale the estimated depth values by a scaling factor s to convert the frame index to a physically meaningful depth value. Median and Gaussian filters are then applied, in turn, to eliminate any remaining noisy values. After obtaining the depth information of the scene, we can obtain point clouds of the scene, as shown in Figure 1.

Experimental Setup
In the experimental evaluation, we use an 11 × 11 window and make it circular with the corresponding binary mask. In the iteration step, the window size is gradually reduced until it reaches 1 × 1. We use α = 0.5 in Equation (3). When we choose the peak point to find the focused frame, we use second-order polynomial fitting. The reference frame number is round(n/3) (n is the number of frames in the stack), and we apply Gaussian blurring (size: 11 × 11) to it.
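As a small illustration of these settings, the sketch below generates a decreasing window schedule and the reference frame number. The step of 2 between window sizes is an assumption (it keeps the sizes odd and matches the 7, 5, 3, 1 sequence shown for the iterations); the function names are hypothetical, and the frame number is treated as 1-based.

```python
def window_schedule(start=11, stop=1, step=2):
    """Window sizes from start down to stop (inclusive), e.g. 11, 9, ..., 1."""
    if start < stop or step <= 0:
        raise ValueError("expected start >= stop and a positive step")
    return list(range(start, stop - 1, -step))


def reference_frame_number(n):
    """Reference frame number round(n / 3) for a stack of n frames,
    kept at 1 or above so that it is a valid 1-based frame number."""
    return max(1, round(n / 3))
```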

Qualitative and Quantitative Evaluation
To evaluate our work, we use four different datasets: the LFSD dataset [28], the Lytro dataset [49], the mobile dataset [18], and our own synthetic and medical (teeth) datasets. We compare our method with existing focus measures [29][30][31][32][33][34][35][36][37] and state-of-the-art DFF methods [1,18,19,20,22,38,39]. Figure 11 shows results on the LFSD dataset [28], which provides focal stacks captured by a Lytro Illum camera with corresponding depth maps. It has 5 to 12 focus image frames for each subject. The depth map has darker intensity for farther depth. The proposed method estimates the depth of each sample reasonably well. Table 2 shows a quantitative evaluation of our method against previous focus measure methods (to obtain their depth results, we used our post-processing without filling textureless regions) and state-of-the-art DFF methods.
To evaluate our focus measure without post-processing, we collected a synthesized focus image set (the focus level was changed by adjusting the camera focus) from graphics models, including ground-truth depth. Figure 12 shows sample depth estimation results compared with the ground truth. The depth images are normalized with the min/max values of the ground truth and are taken from the initial depth result after applying a median filter (size: 3 × 3), without post-processing. In this test, we vary the patch size (Table 3). Each synthetic subject consists of 31 focus image frames of size 405 × 720. Accuracy in Table 3 is measured by the average absolute difference of depth values. Each window size shows the RMSE (root mean squared error) of the related sample, and the last row shows their average absolute difference. The 11 × 11 window size gives the best estimation accuracy; however, this depends on the complexity of the shape of the test objects. If objects have a complicated surface shape, a smaller window size works better, but it may cause a noisy result (as shown by the result for the 5 × 5 window size).
Figure. The first row shows all-in-focus images, rows 2-9 show results from previous focus measures and state-of-the-art DFF methods, row 10 shows the depth maps reconstructed by our proposed method, and the last row shows the depth provided with Reference [28].
We constructed a 3D mesh model of the target object using our teeth dataset (Figure 13). These results show the potential of single-color-camera-based 3D reconstruction with our proposed method. Our teeth dataset has 100 focus image frames for each subject, taken by a dental intraoral camera. Each estimated depth point is treated as a 3D point, and the surface normal of the 3D point is computed from its 8 neighbors [25]. We then interpolate the points using the Algebraic Point Set Surfaces (APSS) [26] method, which creates a smooth surface using local moving least-squares (MLS) approximations of the data [50]. Finally, the marching cubes algorithm [27] creates a polygonal surface representation (Figure 13). This evaluation shows the potential of the proposed method for medical 3D imaging applications with a simple color camera.
We also compare our method with Deep Depth from Focus Net (DDFFNet) [22]. Their model requires 10 input images of fixed size. The LFSD dataset [28] has several samples with 10 frames, so we run our method on the same inputs (10 frames). Figure 10a shows the comparison with DDFFNet [22]. On the LFSD dataset [28], DDFFNet produced blurry, poor depth results that barely preserved the structure and depth of the scene, whereas our proposed method provides cleaner and more accurate depth estimates.

Comparison with the State-of-the-Art
In this section, we compare the proposed method with state-of-the-art DFF methods: VAR [1], DFM [18], RDF [19], Composite Focus Measure (CFM) [38], PAD [39], and F-RDF [20]. For the experimental evaluation, we use the LFSD dataset [28], the Lytro dataset [49], and the mobile dataset [18]. Figure 14 shows the depth and point-cloud results of VAR [1], RDF [19], PAD [39], F-RDF [20], and the proposed method on the "Buddha" sample. The proposed method provides cleaner and more accurate depth, especially around object boundaries. The last row shows a zoomed-in part of the highlighted region, where the proposed method has more accurate depth than the others.

Figure 15 shows the peak signal-to-noise ratio (PSNR) between the predicted depth and the ground truth. To compute the PSNR, the mean-squared error is first calculated using Equation (12), and the PSNR is then computed from it using Equation (13). In Equations (12) and (13), M and N are the numbers of rows and columns in the input images, and R is the maximum fluctuation in the input image data type (the maximum possible pixel value of the image). The figure shows that the proposed method outperforms state-of-the-art methods in quality.
Figure 15. Quantitative evaluation on the 'Buddha' synthetic sample (with 25 frames) against state-of-the-art methods. From left to right: Variational DFF (VAR) [1], Ring Difference Filter (RDF) [19], Composite Focus Measure (CFM) [38], PAD method of multipliers (PAD) [39], Fast and noise-robust RDF (F-RDF) [20], proposed, ground truth, and in-focus image. The proposed method performs better than recent methods, which is visible in the images and reflected in the PSNR (in dB) shown below each depth result. Note that the depth image and PSNR value for CFM [38] were cropped from the original paper.
Figure 16 shows the qualitative performance of the proposed method on the LFSD dataset [28] compared with recent methods, such as VAR [1], RDF [19], PAD [39], and F-RDF [20]. Each sample has 5 to 12 frames of size 1080 × 1080. Our proposed method provides clean depth with detail and clean edges, especially on samples B (branch of the tree), D (wire on the window), F (boundary of the statue), and H (pens).
Figure 17 shows the results of VAR [1], RDF [19], PAD [39], F-RDF [20], and the proposed method on the Lytro dataset [49]. Our proposed method gives clean depth results, especially on sample E around the boundary of the lock.
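The PSNR used in the quantitative comparison can be sketched with the standard definitions, which match the symbols described above: the mean-squared error is the sum of squared pixel differences divided by M·N, and PSNR = 10·log10(R² / MSE). The function names and the default peak value of 255 (8-bit depth images) are assumptions for illustration.

```python
import math


def mean_squared_error(a, b):
    """MSE between two images of the same size, given as lists of rows."""
    rows, cols = len(a), len(a[0])
    total = sum((a[m][n] - b[m][n]) ** 2
                for m in range(rows) for n in range(cols))
    return total / (rows * cols)


def psnr(a, b, peak=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    err = mean_squared_error(a, b)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / err)
```

A higher PSNR means the predicted depth map is closer to the ground truth; note that the value depends on the chosen peak, so depth maps must be scaled to the same range before methods are compared.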

We evaluated the proposed method on many challenging samples. In Figures 18a and 19, the proposed method provides clean and detailed depth even in complex regions of the scene. In Figure 19, the proposed method provides cleaner edges than the other methods on all samples, and the second sample contains a leaf for which only our method was able to provide accurate depth. Figure 18a highlights the complex parts of each scene: the "Toy" sample has small details for which the proposed method provides accurate depth with clean edges; the "Buddha" sample has a column for which the proposed method provides accurate depth while the other methods fail; and on the "Flower" sample, with its small details, the proposed method gives a more accurate result.
In the evaluation on the mobile dataset [18], shown in Figure 20, we use three test samples and compare our work with state-of-the-art methods, namely DFM [18] and CFM [38]. The first row shows the "plants" sample with 30 frames, the second row the "fruits" sample with 23 frames, and the third row the "books" sample with 14 frames. Our method finds more accurate depth than prior works, especially around edges, on all samples. On the "books" sample, DFM [18] failed to provide clean depth in the background because it is textureless (the background is black and has no texture), and CFM [38] provided a noisy result, while the proposed method provided clean depth even in the textureless background. These results show that the proposed method can provide clean and accurate edges, even in textureless regions.
Experimental evaluations show that our method improves depth estimation over existing methods. The PAD [39] method is less robust in textureless regions and suffers from texture-copy artifacts. DDFFNet [22] fails to predict correct depth, produces very blurry depth, and is limited in the number of images in the stack. VAR [1] keeps the overall structure of the scene but cannot provide clean edges. RDF [19] and F-RDF [20] have a limitation: they cannot provide good results if the image contains a region that is never in focus by a wide margin. The proposed method generates depth using information extracted from both the spatial and frequency domains. The Fourier domain works more efficiently on low-textured areas, and the spatial domain performs better on textured areas. Figure 5b shows that our focus measure is more robust than state-of-the-art methods. PAD [39] and F-RDF [20] produced very noisy initial depth, and VAR [1] produced a less noisy depth result, while the proposed focus measure provides a cleaner depth result (although it still needs post-processing).
(a) Qualitative evaluation on three samples (toy: 6 frames, Buddha: 25 frames, flower: 12 frames). From top to bottom: in-focus image (with highlighted region), cropped results of the recent methods PAD, CFM, and RDF-20, and, in the last row, the result of the proposed method. Our result shows more accurate depth with details even in challenging parts of the scene. Note that the depth images for the CFM [38] method were cropped from the original paper.

(b) From top to bottom: in-focus frame (calculated using the initial depth result), initial depth, and final depth of the proposed method.
Figure 19. Qualitative evaluation on the LFSD dataset [28] against recent (state-of-the-art) methods. Panel labels: In-Focus, RDF, CFM, F-RDF, Proposed, VAR, PAD. The proposed method can provide depth with detailed information even for complex scenes. Note that the depth image for the CFM [38] method was cropped from the original paper.
Figure 20. Qualitative evaluation (dataset from DFM [18]) against state-of-the-art methods. The first column shows the all-in-focus image, the second the result from DFM [18], the third CFM [38], and the fourth the proposed method. The proposed method provides detailed and clean edges, and in the third row it gives a clean background while the other methods fail on the textureless region (background). Note that the depth images for the CFM [38] and DFM [18] methods were cropped from the original papers.

Conclusions
This paper proposes a new robust and accurate depth estimation method based on depth from focus (DFF) that iteratively reconstructs a uniformly focused image set. We investigated the appearance changes in the spatial and frequency domains in an iterative manner. We evaluated our method extensively on four datasets, both public and our own, including a synthetic dataset. Three-dimensional modeling experiments showed the potential of the proposed method for both consumer applications using a smartphone and medical applications in 3D reconstruction.