Multiple-Layer Visibility Propagation-Based Synthetic Aperture Imaging through Occlusion

Heavy occlusions in cluttered scenes pose significant challenges to many computer vision applications. Recent light field imaging systems provide new see-through capabilities through synthetic aperture imaging (SAI) to overcome the occlusion problem. Existing synthetic aperture imaging methods, however, emulate focusing at a specific depth layer and are incapable of producing an all-in-focus see-through image. Alternative in-painting algorithms can generate visually-plausible results, but cannot guarantee their correctness. In this paper, we present a novel depth-free all-in-focus SAI technique based on light field visibility analysis. Specifically, we partition the scene into multiple visibility layers to directly deal with layer-wise occlusion and apply an optimization framework to propagate the visibility information between the layers. On each layer, visibility and optimal focus depth estimation is formulated as a multiple-label energy minimization problem. The layer-wise energy integrates the visibility masks from all previous layers, multi-view intensity consistency and a depth smoothness constraint. We compare our method with state-of-the-art solutions, and extensive experimental results demonstrate the effectiveness and superiority of our approach.


Introduction
The capability of seeing through occlusions in heavily-cluttered scenes is beneficial to many practical computer vision application fields, ranging from hidden object imaging to detection, tracking and recognition in surveillance. Since traditional imaging methods use a simple camera to acquire the 2D projection of the 3D world from a single viewpoint, they are unable to directly resolve the occlusion problem.
A fundamental solution to the problem is to exploit new imaging procedures. For example, emerging computational photography techniques based on generalized optics provide plausible solutions to capture additional visual information. Many different camera arrays have been built over the past few years (as shown in Figure 1), and camera array synthetic aperture imaging, or SAI [1][2][3][4][5][6][7][8][9][10][11][12][13][14][15][16][17], provides a unique capability of seeing through occlusions. SAI warps and integrates the multiple view images to simulate a virtual camera with an ultra-large convex lens, and it can focus on different frontal-parallel [1] or oblique [2] planes with a narrow depth of field. As a result, objects lying on the virtual focus plane, even if occluded in reality, are clearly imaged.
In practice, however, obtaining a good imaging result is still challenging. First, without sufficient visibility analysis, the focus plane of an object cannot be reliably determined across all of the cameras. Moreover, the indiscriminate averaging of the projections of foreground occluders often significantly blurs the images of focused occluded objects. Additionally, objects lying on different focus planes cannot all be rendered sharply in a single synthetic image. Different from in-painting algorithms [18,19], which can generate visually-plausible results but cannot guarantee their correctness, our technique is based on light field visibility analysis. For every 3D point, we trace all rays passing through it back to the camera array and then construct a visibility layer in which the 3D point is visible in all active cameras. To recover the all-in-focus image behind a specific depth layer, we partition the scene into multiple visibility layers to directly deal with layer-wise occlusion and apply an optimization framework to propagate the visibility information between the layers. On each layer, visibility and optimal focus depth estimation is formulated as a multiple-label energy minimization problem. The energy integrates the visibility masks from previous layers, multi-view intensity consistency and a depth smoothness constraint. We compare our method with state-of-the-art solutions on the publicly-available Stanford [20] and UCSD (University of California, San Diego) [21] light field datasets and three datasets captured by ourselves with multiple occluders.
As illustrated in Figure 2, conventional synthetic aperture imaging algorithms only focus on a particular depth and result in poor quality (see Figure 2b), while our approach, which creatively segments the scene into multiple visibility layers and uses an optimization framework to propagate the visibility information between multiple layers, can produce the all-in-focus image even under occlusion (see Figure 2c). Extensive experimental results with qualitative and quantitative analysis further demonstrate the performance.

Related Work
Tremendous efforts have been made on developing light field imaging systems and post-processing algorithms. On the hardware front, light field camera arrays with different numbers of cameras, resolution and effective aperture size have been built (as shown in Figure 1), e.g., Stanford [3], CMU [4], UCSD [5], Alberta [6], Delaware [7], Northwestern Polytechnical University (NPU) [8,16], PiCam [15], Tokyo [17], etc., and the camera array synthetic aperture imaging technique has been proven to be a powerful way to see objects through occlusion. Similar camera array techniques have been adopted in producing movie special effects. For instance, in the 1999 movie The Matrix, a 1D camera array is used to create an impressive bullet dodging scene that freezes time, but changes viewpoint towards the character.
On the algorithm front, one of the most important techniques is synthetic aperture imaging (SAI). By integrating appropriate rays in the camera array, SAI can generate a view that would be captured by a virtual camera having a large aperture. In addition, through shearing or warping the camera array images before performing this integration, SAI can focus on different planes in the scene. For example, the Stanford LF (Light Field) camera array by Levoy et al. [3] consists of 128 Firewire cameras, and for the first time, aligns multiple cameras to a focus plane to approximate a camera with a very large aperture. The constructed synthetic aperture image has a shallow depth of field, so that objects off the focus plane disappear due to significant blur. This unique characteristic makes synthetic aperture imaging a powerful tool for occluded object imaging.
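The shift-and-add principle behind SAI can be sketched in a few lines; here the parallax-proportional-to-1/depth shift model, the integer `np.roll` warp and the function name are simplifying assumptions for illustration (a real system warps each view by a calibrated homography):

```python
import numpy as np

def synthetic_aperture_image(views, offsets, depth):
    """Shift-and-add refocusing: shift each view toward the reference
    by a parallax proportional to 1/depth, then average all views.

    views   : list of HxW grayscale images (float arrays)
    offsets : list of (dx, dy) camera baselines relative to the reference
    depth   : focus depth of the virtual focal plane
    """
    acc = np.zeros_like(views[0], dtype=np.float64)
    for img, (dx, dy) in zip(views, offsets):
        # Points lying at `depth` align across views after this shift,
        # so they stay sharp; off-plane points are blurred by the average.
        sx, sy = int(round(dx / depth)), int(round(dy / depth))
        acc += np.roll(np.roll(img, sy, axis=0), sx, axis=1)
    return acc / len(views)
```

Objects on the chosen focal plane reinforce each other in the average, while everything else, including occluders, is smeared out, which is exactly the see-through effect exploited by SAI.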
Taking advantage of the geometry constraints of the dense camera array, Vaish et al. [11] present a convenient plane + parallax method for synthetic aperture imaging. A downside of their work, however, is that all rays from the camera array are directly integrated without further analysis. Thus, the clarity and contrast of their imaging result would be reduced by rays from the foreground occluders.
Visibility analysis through occlusion is a difficult, but promising way to improve the occluded object imaging quality, and many algorithms have been developed in this way. Vaish et al. [12] study four cost functions, including color medians, entropy, focus and stereo, for reconstructing an occluded surface using synthetic apertures. Their method achieves encouraging results under slight occlusion; however, the cost functions may fail under severe occlusion. Joshi et al. [10] propose a natural video matting algorithm using a camera array. Their method uses high frequencies present in natural scenes to compute mattes by creating a synthetic aperture image that is focused on the foreground object. Their result is inspiring, and it has the potential to be used for visibility analysis. However, this algorithm may fail in the case of a textureless background and cannot deal with occluded object matting. Pei et al. [13] propose a background subtraction method for segmenting and removing foreground occluders before synthetic aperture imaging. Their result is encouraging in a simple static background; however, since this approach is built on background subtraction, it cannot handle static occluders. In addition, their performance is very sensitive to a cluttered background and may fail under crowded scenes.
The most relevant method to ours is the work of Pei et al. [6], which solves the foreground segmentation problem through binary labeling via graph cuts. Instead of labeling visibility and focusing depth, they label whether a point is in focus at a particular depth and aggregate these focus labels over a given depth range to get a visibility mask for occluded object imaging. Although the result is encouraging, this method can only label the front occluder (whose depth range needs to be provided as a prior) and may fail if the occluder has severe self-occlusion or there are multiple occluded objects, due to a lack of visibility propagation. In addition, the method of [6] can only focus on a particular depth of the scene instead of producing an all-in-focus image, and its performance degrades in textureless backgrounds.
Graph cuts [22][23][24][25] are among the most popular methods for solving Markov random field (MRF) energy minimization problems due to their efficiency. In this paper, we formulate synthetic aperture imaging of the entire scene as a multi-layer, multi-labeling problem, which can be solved via graph cuts. Different from most works that use graph cuts for 3D scene reconstruction, our algorithm applies graph cuts to both the visibility labeling and the optimal focus depth labeling energy minimization problems. For the visibility labeling problem, we carefully design an energy function whose data term measures the cost of assigning a visibility label to a pixel and whose smoothness term encourages neighboring pixels to share the same visibility label. For the optimal focus depth labeling problem, we design another energy function whose data term measures the cost of assigning a depth layer to a pixel and whose smoothness term encourages neighboring pixels to share the same label.
The organization of this paper is as follows. Section 2 presents the visibility layer propagation-based imaging model. Section 3 details the visibility optimization algorithm. Section 4 details the depth optimization algorithm. Section 5 describes the datasets, implementation details and the experimental results. We conclude the paper and point out future work in Section 6.

Visibility Layer Propagation-Based Imaging Model
In this section, we introduce our multiple-layer propagation-based synthetic aperture imaging method. The framework is an iterated process (as shown in Figure 3). The scene is divided into different visibility layers, and for each layer, we perform two steps, visibility labeling and depth estimation, which are discussed in Sections 3 and 4, respectively. Instead of segmenting the observed scene into various depth layers, our approach segments the entire scene into multiple visibility layers; for example, the Stanford Light Field dataset shown in Figure 3 is divided into two visibility layers. Each visibility layer is defined as the set of rays that are not occluded in any active camera and is computed by energy minimization. Points on a visibility layer do not necessarily correspond to the same object or surface. By modeling the scene as multiple visibility layers and propagating visibility information through the layers, we can obtain the focusing depth and the corresponding cameras for all of the objects in the scene, including the occluders and the occluded objects. Each visibility layer thus consists of pixels that are visible in all active cameras, where active means that the pixel position in that camera has not been labeled as visible by a previous visibility layer, e.g., not labeled as visible by Layer 1 in Figure 3. The extraction of each visibility layer is based on the information of the previous visibility layers. More precisely, according to the occlusion mask information of the previous layers, we first obtain the current visibility layer, then estimate the depth map of this layer and, finally, update the occlusion mask. For a better understanding of the proposed method, we provide an example workflow on the Stanford Light Field data in Figure 3.
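As a toy illustration of the layer-by-layer propagation (not the paper's graph-cut optimization), the peeling of visibility layers can be mimicked by depth peeling along each reference-camera ray; the function name, the array layout and the `np.inf` sentinel are all assumptions of this sketch:

```python
import numpy as np

def peel_visibility_layers(ray_depths, num_layers):
    """Toy depth-peeling analogue of layer-wise visibility propagation.

    ray_depths : HxWxK array; along each reference-camera ray, the depths
                 of the surfaces hit, ordered front to back (np.inf = none).
    Returns a list of HxW depth maps: layer l holds the depth of the
    front-most surface not yet claimed by previous layers (the occlusion
    mask is updated once per layer, as in the text).
    """
    h, w, k = ray_depths.shape
    claimed = np.zeros((h, w), dtype=int)   # how many surfaces per ray are labeled
    layers = []
    for _ in range(num_layers):
        depth = np.full((h, w), np.inf)
        for y in range(h):
            for x in range(w):
                if claimed[y, x] < k:
                    depth[y, x] = ray_depths[y, x, claimed[y, x]]
        # update the occlusion mask: rays with a finite hit advance one surface
        claimed = np.minimum(claimed + np.isfinite(depth).astype(int), k)
        layers.append(depth)
    return layers
```

In the actual method, each layer's visibility map and depth map are estimated by the two energy minimizations of Sections 3 and 4 rather than read off a known depth list.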

Multiple View Images
There are mainly two reasons why we introduce the concept of the visibility layer. First, by taking advantage of the introduced visibility layer, the occlusion problem can be tackled more directly. The visibility information is propagated from layer to layer, and in each layer, the occlusion mask needs to be updated only once. Second, segmenting the scene into visibility layers instead of depth layers is more beneficial, as neighboring pixels in the same layer tend to belong to the same object, and the depth smoothness constraint can be enforced when estimating the depth map.
Let L denote the number of visibility layers in the scene. For each layer, we need to find a labeling function f : Ω → L, where Ω refers to the set of all unmasked pixels in all images and L = {0, d_1, d_2, …, d_m} denotes the set of possible labels of these pixels. The labels d_i > 0 (i = 1, 2, …, m) are sampled depths within the depth range of our scene. For a pixel x, if f(x) > 0, then x is fully visible in all active camera views. Otherwise, if f(x) = 0, then x is partially occluded.
Considering the labeling redundancy of the camera array (the labels in different cameras are highly correlated), the problem can be further simplified. Instead of labeling all of the unmasked pixels of all of the cameras, we equivalently label all of the pixels of the reference camera (not only the unmasked pixels, as a masked pixel of the reference camera may still be fully visible in all of the other active cameras). This means that if there are N cameras in the camera array, we only label the pixels of the reference camera view instead of all of the unmasked pixels of all cameras. Specifically, instead of finding the above labeling function, we seek a more succinct labeling g : I_ref → L, where I_ref refers to the whole image area of the reference camera. In our implementation, the visibility and depth map are calculated first on the reference image; then, the visibility and depth maps of all of the other cameras are derived based on the calibration information of the camera array (as shown in Figure 3, the visibility-masked multiple view images).
Therefore, for each layer ℓ, the problem of estimating fully-visible pixels and corresponding depths can be formulated as the following energy minimization problem:

E(g^ℓ) = Σ_{x∈I_ref} E_d(g^ℓ(x); V^1, …, V^{ℓ−1}) + Σ_{(p,q)∈N} E_s(g^ℓ(p), g^ℓ(q)),   (1)

where the data term E_d is a penalty function that measures the cost of assigning the label g^ℓ(x) to a pixel, the smoothness term E_s is a regularizer that encourages neighboring pixels to share the same label, and the visibility information V^k (k = 1, 2, …, ℓ − 1) from previous layers is used to encode and block the occluded rays. Since the estimation of visibility is coupled to the depth value, we let the label g^ℓ carry two kinds of labeling information: visibility and depth. To obtain the visible map of a visibility layer, we set g^ℓ = V^ℓ, so that Equation (1) becomes Equation (2), which is solved in Section 3 through a carefully-designed data term and smoothness term. To obtain the depth map of a visibility layer, we set g^ℓ = D^ℓ, so that Equation (1) becomes Equation (9), which is solved in Section 4 with a well-designed data term and smoothness term.
As the estimation of visibility information is coupled to the depth value and can only be obtained by analyzing synthetic images at different depths of focus, it is difficult to minimize energy Function (1) directly. In this paper, we solve for g^ℓ = {V^ℓ, D^ℓ} via the following two optimization modules: (1) optimize the visible map V^ℓ in the reference camera (as shown in Figure 3, the fully-visible maps in Layers 1 and 2); (2) calculate the depth map D^ℓ of the visible pixels (as shown in Figure 3, the focus depth maps of the fully-visible regions in Layers 1 and 2).
In the first module, in order to obtain the fully-visible map, we formulate the problem as a binary energy minimization problem, which can be optimized by graph cuts [22]. The energy function used in the first module is composed of a unary data term, which represents the cost of assigning a visibility label to a pixel, and a pairwise smoothness term, which accounts for the smoothness prior of the visibility layer. More details about the first module are discussed in Section 3.
In the second optimization module, estimation of the optimal focus depth for pixels in each visibility layer is formulated as a multiple-label energy minimization problem and is also solved via graph cuts [22]. The energy function is composed of a unary data term, which indicates the cost of assigning a depth label to a pixel, and a pairwise smoothness term, which accounts for the smoothness constraint of the depth map. More details about the second optimization module are discussed in Section 4.
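To make the "unary data term plus pairwise smoothness term" formulation concrete, the following sketch minimizes a tiny binary visibility MRF by exhaustive search; brute force stands in for the graph-cut solver of [22] (feasible only on toy grids), and the constant `beta` simplifies the paper's per-edge weight β(p, q):

```python
import itertools
import numpy as np

def minimize_binary_mrf(data_cost, tau_v, beta):
    """Brute-force minimization of a tiny binary MRF energy: the sum of
    unary data costs plus a capped Potts-style smoothness penalty over
    4-connected neighbors.

    data_cost : HxWx2 array; data_cost[y, x, v] = cost of label v at (y, x)
    tau_v     : cap on the pairwise penalty
    beta      : pairwise weight (constant here for simplicity)
    """
    h, w, _ = data_cost.shape
    best_lab, best_e = None, float("inf")
    for bits in itertools.product((0, 1), repeat=h * w):
        lab = np.array(bits).reshape(h, w)
        # unary term: sum the chosen label's cost at every pixel
        e = data_cost[np.arange(h)[:, None], np.arange(w)[None, :], lab].sum()
        # pairwise term: penalize each 4-neighbor pair with differing labels
        diff = np.sum(lab[:, 1:] != lab[:, :-1]) + np.sum(lab[1:, :] != lab[:-1, :])
        e += min(tau_v, beta) * diff
        if e < best_e:
            best_lab, best_e = lab, e
    return best_lab, best_e
```

A small pairwise weight lets the data term dominate (each pixel takes its cheapest label); a large weight forces neighboring pixels toward the same label, which is the smoothness prior described above.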

Multiple-Layer Visibility Optimization
Since our method propagates the binary visible map between multiple layers, for a certain layer ℓ ∈ {1, 2, …, L}, occluders in front of this layer have already been labeled and can easily be removed from the images of all cameras. To keep the notation uncluttered, we do not write the previous visibility layers V^k (k = 1, 2, …, ℓ − 1) explicitly unless necessary. As a result, the visibility energy function can be written as follows:

E(V^ℓ) = Σ_{x∈I_ref} E_d(V^ℓ(x)) + Σ_{(p,q)∈N} E_s(V^ℓ(p), V^ℓ(q)).   (2)

Data term: If a pixel is fully visible in the current layer, it should be in focus at some depth value, and at the same time, the corresponding pixels that form the synthetic aperture image should come from the same point of an object (except those occluded by previous layers). If a scene point is in focus, its corresponding pixel in the synthetic aperture image will have good clarity and contrast, which can be measured by state-of-the-art focusing metrics. In addition, the corresponding pixels that form the synthetic aperture image should have similar intensity values, which can be measured by various intensity constancy metrics. In this paper, both focusing metrics and intensity constancy metrics are referred to as focusing metrics. We define the cost of labeling a pixel as fully visible based on its corresponding curve of focusing metric values over synthetic images at different depths of focus.
The ideal curve of a fully-visible pixel (Figure 4, Point A) should satisfy two constraints: (1) it is unimodal throughout the focus depth scope; and (2) it reaches its global minimum if and only if all visible rays intersect at the same point on an object in the scene. In contrast, a partially-occluded pixel or a free point without focus should have a large value throughout the entire focus depth scope (Figure 4, Point C). That is because such points are only visible in some of the cameras; thus, for unfocused depths and even for the focused depth, their cost is high. A textureless object pixel should have a small value over a small range of depths around the focusing depth (Figure 4, Point B), while a textureless background pixel should have a small value over a broad focus range due to its similarity with its neighboring pixels (Figure 4, Point D). Besides, Point D in Figure 4 shows a sharp peak near the origin. That is caused by the position of the focusing depth plane: when it is too close to the camera, the out-of-focus problem produces an unexpected value.
Naturally, we cannot estimate the depth of textureless background pixels. Thus, according to the width of the low-value depth range, we remove the textureless background region before our binary visibility optimization. Based on the above analysis, we have compared different kinds of metrics to obtain the desired ideal curve. Part of the comparison is shown in Figure 5. Figure 5a gives the input images of different cameras, while Figure 5b-d displays the synthetic aperture imaging result, the variance image and the maximal color difference (MCD) image at different depths. Comparing Figure 5c,d, we can see that the MCD image describes the minimal color difference more accurately than the variance image; thus, the MCD measurement is more suitable for visibility analysis. Figure 5e shows the corresponding curves of Points A-C marked in Figure 5a. The focus measures evaluated include the DCT energy ratio (DCTR) [26], diagonal Laplacian (LAPD) [27], steerable filters (SFIL) [28], variance and MCD. For the first three focus measures, we compute the focus metric using a 5 × 5 pixel block on one hundred sampled focus planes. All of the results are normalized and mapped to [0, 1], where a low value represents good focus. The results indicate that for a point in a textured region without occlusion, all focus measures successfully find the focus point (Point A in Figure 5e). However, when the textured point is occluded in some cameras (Point B in Figure 5e), the curves of the DCTR, LAPD, SFIL and variance measures are multimodal with multiple local minima. In contrast, the MCD metric is more stable and less sensitive to occlusion. In the low-texture region (Figure 5e, Point C), the first three measures are noisy, whereas both the variance and MCD measures reach the global minimum around the ground truth. In addition, the MCD curve is sharper than the variance curve and closer to the ideal curve.
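The textureless-background removal step described above amounts to thresholding the width of the low-cost portion of a pixel's focus curve. A hedged sketch, in which the function name and the `low_thresh`/`width_frac` tuning parameters are assumptions:

```python
import numpy as np

def is_textureless_background(cost_curve, low_thresh, width_frac=0.5):
    """Flag a pixel as textureless background when its focus-cost curve
    stays low over a broad fraction of the sampled depth range (cf. the
    broad low valley of Point D in Figure 4, versus the narrow dip of a
    textured, in-focus point such as Point A).
    """
    low = np.asarray(cost_curve) < low_thresh
    return low.mean() >= width_frac
```

Pixels flagged this way are excluded before the binary visibility optimization, since their depth cannot be estimated reliably.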
Based on the analysis above, we select the MCD measure to define the data cost E_d(V^ℓ) for each pixel x in the reference camera:

E_d(V^ℓ(x) = 1) = min_{d∈D} MCD_d(x),   (3)

where D = {d_1, d_2, …, d_m} is the depth range of the scene and MCD_d(x) (d ∈ D) is the MCD focus measure of pixel x at depth d, i.e., the difference between the maximum and minimum values of I_i^d(x) over the active cameras, normalized by C_max. Here, I_i^d(x) represents the value of pixel x on the warped image of camera i at depth d, and B_i(x) is a binary map of camera i that masks the fully-visible pixels of previous layers. C_max is the maximum gray intensity; for an eight-bit camera, its value is 255. The visibility layer V_i^k of camera i can be obtained easily from V^k of the reference camera. If B_i(x) = 0, x is occupied by previous layers; otherwise, B_i(x) = 1.
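Under this reading, the MCD-based data cost for a single reference pixel might be sketched as follows; the array layout and the function name are assumptions of the sketch:

```python
import numpy as np

def mcd_data_cost(warped_intensities, active_mask, c_max=255.0):
    """Illustrative MCD data cost for one reference pixel.

    warped_intensities[d, i] : the pixel's intensity on camera i's image
                               warped to focus depth d
    active_mask[i]           : binary map B_i marking cameras not masked
                               by previous visibility layers
    MCD at each depth is the max-minus-min intensity spread over the
    active cameras, normalized by c_max; the cost of a fully-visible
    label is the minimum MCD over all sampled depths.
    """
    active = warped_intensities[:, active_mask.astype(bool)]
    mcd = (active.max(axis=1) - active.min(axis=1)) / c_max
    return float(mcd.min())
```

A point that is truly visible in all active cameras reaches a near-zero spread at its focus depth, so the cost of labeling it visible is low; an occluded point keeps a large spread at every depth.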
A good energy function should reach a good solution when the energy is low. To achieve this, we design the data term of the visibility optimization model as Equation (3), which classifies all of the pixels as visible or invisible. When min(MCD) is small, i.e., the data term is small, the probability that the point is occluded is low; thus, the cost of assigning it a visible label is low. Conversely, according to the definition of MCD, even if only one of the camera views is occluded, min(MCD) takes a large value, and the cost of assigning this point a visible label is high by Equation (3). Thus, for visibility labeling, our data term achieves its minimum when the label is correctly assigned and a large value for an occluded point, which is exactly the behavior we require of the data term.
Smoothness term: The smoothness term E_s(V^ℓ) at layer ℓ is a prior regularizer that encourages the overall labeling to be smooth. The prior is that two neighboring pixels have a higher probability of belonging to the same object and should therefore be both visible or both occluded in the reference camera. Here, we adopt the standard four-connected neighborhood system and penalize two neighboring pixels whose labels differ:

S_{p,q}(V^ℓ(p), V^ℓ(q)) = min(τ_v, β(p, q) · |V^ℓ(p) − V^ℓ(q)|),

where τ_v and β(p, q) denote the maximum value and the weight of the smoothness term, respectively. The weight β(p, q) is given by a decreasing function h of the MCD measure difference between neighboring pixels: the more similar the MCD measures, the higher the weight and the stronger the smoothness constraint. Since neighboring pixels that belong to the same object should have similar MCD measures, the smoothness penalty between such pixels is large when their labels disagree. In our experiments, we simply use the inverse proportional function h(x) = k/x to represent this decreasing relationship; other decreasing functions should work equally well. With the above data term and smoothness term, our energy function can be minimized via graph cuts [22].
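A minimal sketch of this pairwise penalty, assuming the inverse proportional h(x) = k/x mentioned above and treating τ_v, k and the small eps guard as given parameters:

```python
def visibility_smoothness(v_p, v_q, mcd_p, mcd_q, tau_v, k=1.0, eps=1e-6):
    """Pairwise smoothness penalty between 4-connected neighbors p, q:
    zero when the two visibility labels agree; otherwise a penalty capped
    at tau_v and weighted by h(x) = k / x applied to the MCD difference,
    so that pixels with similar MCD (likely the same object) pay more
    for disagreeing labels.
    """
    if v_p == v_q:
        return 0.0
    beta = k / max(abs(mcd_p - mcd_q), eps)   # similar MCD -> larger weight
    return min(tau_v, beta)
```

The cap τ_v keeps the penalty bounded across genuine object boundaries, where a label change should remain affordable.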

Multiple-Layer Depth Optimization
As V^ℓ is obtained by the visibility optimization in Section 3, we can estimate the optimal depth of these fully-visible pixels by multiple depth label optimization. Different from the visibility optimization, which applies graph cuts to obtain the visibility layers, the multiple-layer depth map optimization model applies graph cuts to obtain the optimal focus depth of the fully-visible pixels in each visibility layer. In this model, we also need to find a labeling function D^ℓ : Ω^ℓ → D, where Ω^ℓ = {x : V^ℓ(x) = 1} represents the set of all fully-visible pixels in the reference camera and D is the depth range of the scene. This labeling function is found within the same energy minimization framework:

E(D^ℓ) = Σ_{x∈Ω^ℓ} E_d(D^ℓ(x)) + Σ_{(p,q)∈N} E_s(D^ℓ(p), D^ℓ(q)).   (9)

Data term: The data term should reflect defocusing and reach its global minimum at the optimal focusing depth. As analyzed in Section 3, the MCD metric outperforms the other focus measures; thus, we adopt the MCD measure on the synthetic aperture images as the cost function:

E_d(D^ℓ(x) = d) = MCD_d(x).

Smoothness term: The smoothness term E_s(D^ℓ) of layer ℓ is a regularizer that encourages the overall depth labeling to be smooth. Similar to Section 3, we adopt the standard four-connected neighborhood system and penalize two neighboring pixels whose depth labels differ:
S_{p,q}(D^ℓ(p), D^ℓ(q)) = min(τ_v, β(p, q) · |D^ℓ(p) − D^ℓ(q)|),

where τ_v and β(p, q) are defined as in Section 3. Again, the depth label optimization problem is solved via graph cuts [22]. Since all synthetic images at the different focus depths have already been computed, after obtaining V^ℓ and D^ℓ, we can read off every pixel's optimal imaging result and each layer's color information. Combining the visibility information in each visibility layer and removing the visibility layers that contain the occluders, we obtain the all-in-focus imaging result by simply compositing the remaining layers.
The above process can also be represented by:

I_syn(x) = (1/N) Σ_{k=1}^{N} I_k^{D^ℓ(x)}(x),

where x is a pixel in the reference camera, I_syn(x) is the synthetic value of pixel x in V^ℓ, D^ℓ(x) denotes the optimal depth of pixel x in V^ℓ and I_k^{D^ℓ(x)}(x) represents the value of pixel x on the warped image of camera k at depth D^ℓ(x). N is the number of camera views. Then, given a depth d_f, we generate the all-in-focus synthetic aperture image by combining the values I_syn(x) with D^ℓ(x) > d_f.
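The per-layer compositing behind a chosen depth d_f can be sketched as follows, assuming per-layer synthetic color and focus depth maps stored as numpy arrays, with `np.inf` marking pixels a layer does not cover (the function name and layout are ours):

```python
import numpy as np

def all_in_focus(layer_colors, layer_depths, d_f):
    """Composite the all-in-focus image behind depth d_f: each pixel takes
    its synthetic color from the front-most visibility layer whose
    estimated focus depth exceeds d_f, so occluder layers (depth <= d_f)
    are skipped entirely.

    layer_colors / layer_depths : lists of HxW arrays, front to back
    """
    out = np.zeros_like(layer_colors[0])
    filled = np.zeros(out.shape, dtype=bool)
    for color, depth in zip(layer_colors, layer_depths):
        keep = (~filled) & np.isfinite(depth) & (depth > d_f)
        out[keep] = color[keep]
        filled |= keep
    return out
```

Pixels not covered by any remaining layer stay empty, which matches the black-hole limitation noted in the experiments when a point is visible in too few cameras.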

Experimental Results
We have compared the performance of our method with the synthetic aperture imaging methods of Vaish et al. [11] and Pei et al. [6] on four datasets, including the CD case behind plants from Stanford, the crowd surveillance scene from UCSD and two datasets captured by ourselves. In addition, to illustrate that our method can be successfully applied when there are multiple visibility layers, we have captured another dataset where there are multiple occluders.
To avoid explicitly imaging all of the objects that are far away in the scene, we limit our search to a range of depths around the objects of interest. For the CD case behind plants from Stanford and our own dataset, the accuracy of each method is compared to the ground truth separately. More implementation details are given below.

Experiment 1: CD Case behind Plants
This dataset contains 105 views on a 21 × 5 grid (synthetic aperture size 60 cm by 10 cm), and the image resolution is 650 × 515. The scene contains some plants occluding two CD cases. Our goal is to estimate the depth of all of the objects in the scene and to image the scene behind the plants. Figure 6 shows the comparison of Vaish et al. [11], Pei et al. [6] and our method. From Figure 6c-e, we can see that all three methods can see through the plants and image the occluded CD cases. Additionally, for the particular focus plane containing the CD cases, the result of Pei et al. [6] is clearer than that of Vaish et al. [11]. However, objects off this focus plane are still blurred, including the CD on the right, which is just near the focus plane. The all-in-focus image in Figure 6e shows the much better clarity of our method and is closer to the ground truth in Figure 6a. Figure 6f compares the imaging results for several local regions. It can be seen that our method produces an all-in-focus image of all three objects at different depths, while the methods of Vaish et al. [11] and Pei et al. [6] can only focus on a given depth plane.
Besides, we use the peak signal-to-noise ratio (PSNR) to compare these methods quantitatively (see Table 1). The PSNR is computed as in Equations (14) and (15):

MSE = (1 / (w · h)) Σ_{x∈X} (I(x) − I_gt(x))²,   (14)

PSNR = 10 · log10(I_max² / MSE),   (15)

where w and h denote the image width and height, X is the image region, I_gt(x) is the pixel intensity value at x in the ground truth image and I denotes the image to be assessed. I_max = 255 is the maximal intensity value. A clearer imaging result, i.e., one closer to the ground truth, yields a higher PSNR. In this experiment, our all-in-focus synthetic aperture image achieves a PSNR of 31.1088, which is much higher than the 20.8774 of Pei et al. [6] and the 18.1225 of Vaish et al. [11]. Therefore, compared to the other methods, the all-in-focus imaging result of our method is clearer and closer to the ground truth.
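The PSNR assessment of Equations (14) and (15) can be implemented directly:

```python
import numpy as np

def psnr(ground_truth, image, i_max=255.0):
    """PSNR between a ground-truth image and an imaging result:
    mean squared error over the region, then 10 * log10(I_max^2 / MSE).
    """
    gt = np.asarray(ground_truth, dtype=np.float64)
    im = np.asarray(image, dtype=np.float64)
    mse = np.mean((gt - im) ** 2)
    return 10.0 * np.log10(i_max ** 2 / mse)
```

A higher value means the assessed image deviates less from the ground truth, which is how Table 1 ranks the three methods.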

Experiment 2: Crowd Surveillance Scene
The "Crowd" dataset is captured by UCSD with eight synchronous views on an 8 × 1 grid. There are 276 frames, and the image resolution is 640 × 480. The scene contains five people moving around, frequently occluded by each other. Our goal is to see through the occluder in front and to image all of the others continuously. Figure 7 shows the comparison of our method and Vaish et al. [11]. In Frame 210, the man in saffron clothing is occluded. In Vaish's result (Figure 7b), the occluded man is blurred by shadows from the occluder, and people off the focus plane are all blurred. In contrast, our approach sees through the occlusion and achieves a clear all-in-focus image (Figure 7c). Details of the local regions are shown in Figure 7d. Our method also shows better performance than Vaish's method in Frames 213 and 217. The success of our method comes from the observation that for the synthetic aperture imaging result on each frame, the scene can be regarded as static, with no moving objects. This is reasonable, as no object makes obvious movement within a single frame given the high frame rate of the camera array. Figure 7 shows the results of several subsequent frames.
One limitation of our approach is that a scene point needs to be visible at least in two camera views; otherwise, a black hole will appear in the all-in-focus image (Figure 7c).

Experiment 3: Complex Outdoor Scene
To further test our method on severe occlusion cases, we conducted another experiment with a complex outdoor scene. As shown in Figure 8, the street, trees and distant buildings are all occluded by nearby flowers. Our aim is to see the scene behind through the occlusion of the flowers in front. In this experiment, we adopt the ACTS system developed by Zhang et al. [29] for camera pose estimation. Comparison results of Vaish's method [11], Pei's method [6] and our method are shown in Figure 8b-d. As Vaish's method only focuses on a given depth plane and cannot eliminate the front occluders completely, it cannot provide an all-in-focus image of the scene behind. Additionally, although Pei's method can remove some foreground occluders through foreground occluder segmentation and obtain a clearer result for targets on the focus plane, targets off the focus plane are still very blurred, for example the building shown in Figure 8e. In comparison, our method provides a depth-free viewpoint and an all-in-focus image for any given depth range. For instance, Figure 8d shows the all-in-focus image of the scene behind the flowers. Note that although the depth change and occlusion in this scene are extremely complex, our method accurately produces the desired all-in-focus result.

Experiment 4: Seeing through Multiple Occluded Objects
Our method can be applied to a scene where there are multiple occluded objects. Due to visibility propagation between different layers, we can remove multiple occluders, focus on the occluded object and obtain an all-in-focus image of the occluded scene. Figure 9 shows the result of our synthetic aperture imaging method when there are multiple occluders. Figure 9a is the input image of the reference camera; it can be seen that the red book is occluded by the playing card, which is further occluded by the yellow box in front. The standard synthetic aperture imaging result of the playing card and red book is shown in Figure 9b,e, respectively. It can be seen that due to severe occlusion, Vaish's method [11] can only get a blurred image of the occluded object. The state-of-the-art synthetic aperture imaging result of the playing card and red book is shown in Figure 9c,f, respectively. It can be seen that in the case of severe occlusion, Pei's method [6] can only get a blurred image of the occluded object due to the inaccuracy of the estimated foreground label. In comparison, our method can remove front occluders completely and provide an all-in-focus image of the scene behind the yellow box ( Figure 9d) and even the playing card (Figure 9g).

Conclusions and Future Work
In this paper, we have presented a novel synthetic aperture imaging approach for creating all-in-focus images through occlusion. Different from existing synthetic aperture imaging algorithms, we segment the scene into multiple visibility layers and apply an optimization framework to propagate the visibility information between the layers, producing an all-in-focus image even under occlusion.
We believe this approach is useful in challenging applications, like surveillance of occluded people in crowded areas, where seeing the people's appearance may be of primary interest, or reconstructing hidden objects through severe occlusion, or even rendering a depth-free viewpoint image. In the future, we would like to design more robust cost functions for the focus depth estimation and to extend our work to unstructured light field imaging through occlusion with handheld mobile phones.