Article

U2-LFOR: A Two-Stage U2 Network for Light-Field Occlusion Removal

by Mostafa Farouk Senussi 1,2, Mahmoud Abdalla 1, Mahmoud SalahEldin Kasem 1,3, Mohamed Mahmoud 1,2 and Hyun-Soo Kang 1,*

1 Department of Information and Communication Engineering, School of Electrical and Computer Engineering, Chungbuk National University, Cheongju-si 28644, Republic of Korea
2 Information Technology Department, Faculty of Computers and Information, Assiut University, Assiut 71526, Egypt
3 Multimedia Department, Faculty of Computers and Information, Assiut University, Assiut 71526, Egypt
* Author to whom correspondence should be addressed.
Mathematics 2025, 13(17), 2748; https://doi.org/10.3390/math13172748
Submission received: 1 August 2025 / Revised: 21 August 2025 / Accepted: 25 August 2025 / Published: 26 August 2025

Abstract

Light-field (LF) imaging transforms occlusion removal by using multiview data to reconstruct hidden regions, overcoming the limitations of single-view methods. However, this advanced capability often comes at the cost of increased computational complexity. To overcome this, we propose the U2-LFOR network, an end-to-end neural network designed to remove occlusions in LF images without compromising performance, addressing the inherent complexity of LF imaging while ensuring practical applicability. The architecture employs Residual Atrous Spatial Pyramid Pooling (ResASPP) at the feature extractor to expand the receptive field, capture localized multiscale features, and enable deep feature learning with efficient aggregation. A two-stage U2-Net structure enhances hierarchical feature learning while maintaining a compact design, ensuring accurate context recovery. A dedicated refinement module, using two cascaded residual blocks (ResBlock), restores fine details to the occluded regions. Experimental results demonstrate its competitive performance, achieving an average Peak Signal-to-Noise Ratio (PSNR) of 29.27 dB and Structural Similarity Index Measure (SSIM) of 0.875, two widely used metrics for reconstruction fidelity and perceptual quality, on both synthetic and real-world LF datasets, confirming its effectiveness in accurate occlusion removal.

1. Introduction

Occlusion removal in LF imaging is a crucial task in computer vision, particularly for applications requiring precise scene reconstruction, such as object recognition, detection, and tracking [1,2,3,4,5,6,7]. Occlusions, which obscure vital scene details, hinder the performance of these applications, leading to misclassification, loss of precision, and defective tracking. This challenge intensifies in dynamic contexts where occlusions are transient and complex, further complicating restoration efforts [8,9]. Light-field imaging offers a transformative solution to these challenges by capturing not only spatial information but also the angular dimension of light [10,11]. This results in a more comprehensive representation of scenes compared to single-view methods. LF imaging leverages camera arrays to produce sub-aperture images, each representing the scene from a different viewpoint, thus enabling a variety of post-capture effects such as refocusing [12,13], depth estimation [14,15], and spatial super-resolution [16,17,18].
The multiview nature of LF imaging allows occluded regions to be reconstructed by synthesizing information from unobstructed views [19], providing a powerful tool for occlusion removal [20]. However, the complexity of LF data, particularly when removing large and intricate occlusions, poses significant challenges [21]. Traditional methods often fail to capture the full context and integrate features across scales, resulting in incomplete or artifact-laden reconstructions. Consequently, recent advancements have introduced innovative architectures that combine robust feature extraction, effective multiscale fusion, and fine-grained refinement techniques to overcome these limitations.
Early approaches, such as DeOccNet [22], adopted encoder–decoder architectures with ResASPP for multiscale feature extraction but faced difficulties with large occlusions and spatial dependencies. To address these gaps, Mask4D [23] employed 4D convolutions to preserve spatial–angular coherence, while Zhao et al. [24] utilized GANs for semantic inpainting, achieving realistic reconstructions even in challenging scenarios. Zhang et al. [25] proposed a filter to reconstruct occluded regions in sparse LFs but faced challenges with dense LFs due to high memory demands. ISTY [26] addressed this by modularizing de-occlusion into feature extraction, occlusion detection, and inpainting, enhancing performance across datasets. More recent methods have focused on hybrid architectures; for instance, Wang et al. [27] combined CNNs for local feature extraction and Swin Transformers for global context, enhancing occlusion removal in challenging scenarios, while SwinSccNet [28] integrates CNNs with Swin Transformers for efficient global and local feature modeling. Senussi et al. [29] enhanced multiscale feature extraction and fusion using CSPDarknet53 and the bidirectional feature pyramid network (BiFPN), allowing robust and consistent removal of occlusions in diverse LF datasets. These developments reflect the field’s shift toward efficient and scalable solutions, ensuring a balance between computational efficiency and performance without compromising quality.
In this paper, we introduce the U2-LFOR network, a compact neural network designed specifically for efficient occlusion removal in LF images. Our approach uses the ResASPP module as the feature extractor, applying atrous convolutions to expand the receptive field, which enables deep feature learning and multiscale information aggregation to capture both global context and fine details. The two-stage U2-Net structure enhances hierarchical feature learning, ensuring accurate occlusion reconstruction without increasing model complexity. Furthermore, the Cascaded Residual Refinement Module restores fine details to occluded regions, ensuring high-fidelity reconstructions that maintain both local and global structural consistency. Collectively, these components provide a mathematically grounded and algorithmically principled framework for LF occlusion removal, integrating multiscale feature modeling, hierarchical refinement, and residual estimation to achieve high-fidelity results.
As shown in Figure 1, the U2-LFOR structure distinguishes itself from existing U-shaped LF occlusion removal methods, such as DeOccNet and Mask4D, by incorporating intermediate feature fusion and a two-stage U2 encoder–decoder structure, which ensures more accurate and detailed reconstruction of occluded regions, even in complex scenarios. U2-LFOR, with just 11.06 M parameters and an inference time of 7.86 ms, attains a favorable trade-off between computational cost and occlusion removal accuracy. It achieves an average PSNR of 29.27 dB and SSIM of 0.875 across synthetic and real-world LF datasets, offering a highly efficient solution for LF occlusion removal while addressing both computational and occlusion complexities in real-world applications.
Our contributions are threefold:
  • Context Feature Extraction and Fusion: ResASPP integrates residual connections with atrous spatial pyramid pooling to enhance the receptive field and capture multiscale contextual features. It effectively captures both local and global image structures, enabling precise and robust removal of complex occlusion patterns in LF images.
  • Compact U2-Net Architecture: We design a two-stage U2-Net architecture, achieving a balance between performance and computational efficiency, with only 11.06 M parameters and an inference time of 7.86 ms, making it suitable for resource-constrained environments.
  • Comprehensive Evaluation: Extensive experiments on both synthetic and real-world LF datasets demonstrate the superior performance of our method, achieving an average PSNR of 29.27 dB and SSIM of 0.875, showing its effectiveness across diverse occlusion scenarios.
The structure of this paper is as follows: Section 2 reviews existing methods, contextualizes our contributions, and establishes the foundation for our approach. Section 3 outlines the design and technical details of our architecture. Section 4 presents both quantitative and qualitative evaluations, complemented by comparative analyses. Section 5 examines the contributions of individual components to overall performance. Section 6 highlights current challenges and suggests strategies for future research; it also summarizes the key findings and their broader significance.

2. Related Work

This section reviews previous approaches to addressing occlusions, covering both traditional and deep learning methods for single-view image inpainting and LF occlusion removal. It provides context for our work by analyzing key advancements in these areas.

2.1. Single-View Image Inpainting

Single-image inpainting fills in occluded regions by generating realistic content based on surrounding context. In LF imaging, it ensures smooth restoration across both spatial and angular dimensions, enabling accurate scene reconstruction and consistency across multiple viewpoints. The following subsections explore single-image inpainting techniques in LF imaging, highlighting their potential to enhance complex scene reconstruction.

2.1.1. Conventional Methods

Bertalmio et al. [30] introduced anisotropic diffusion as a foundational approach to inpainting occluded regions, propagating pixel values along local gradients to preserve edges while addressing occlusions. However, its limitations in handling complex textures and highly detailed backgrounds restrict its effectiveness in challenging scenarios. Building on this, Ballester et al. [31] proposed a variational method that interpolates pixel intensities and gradient directions, solving coupled second-order partial differential equations (PDEs) to extend isophotes across missing regions. PatchMatch [32] introduced randomized patch matching for efficient nearest-neighbor propagation, while Wexler et al. [33] extended this concept to dynamic scenes, ensuring spatio-temporal consistency. For LF imaging, methods such as those in [34,35] focus on inpainting the central view and propagating reconstructed content across all views but fail to fully capture the 4D nature of occlusions, often leading to structural inconsistencies in restored LFs.

2.1.2. Deep Learning-Based Methods

A novel approach to inpainting was introduced with Partial Convolution (PConv) [36] to address irregularly masked images, which enables inpainting by encoding contextual features while reducing artifacts from invalid pixels, leading to improved restoration quality. Li et al. [37] extended that approach with recurrent feature reasoning (RFR), a two-stage model that uses a vector-quantized variational auto-encoder (VQ-VAE) [38] for recurrent inpainting and feature refinement, improving the filling of large missing regions. Xie et al. [39] improved inpainting accuracy by incorporating attention mechanisms into their learnable bidirectional attention map (LBAM) model, which targets masked areas more precisely with soft attention maps. Nazeri et al. [40] proposed EdgeConnect, a two-stage adversarial network that generates edge maps to improve texture restoration and inpainting. Song et al. [41] employed semantic segmentation in a dual-stage approach to guide the inpainting, address boundary blurriness, and improve texture fidelity. Similarly, Ren et al. [42] divided the reconstruction of the structure and the texture into two stages, effectively preserving fine details during the inpainting process. Semantic guidance was integrated with the inpainting in SGE-Net [43], leading to improved boundary clarity and enhanced texture realism. Optimizing content and texture constraints for large-region inpainting was explored by Yang et al. [44], while Zeng et al. [45] proposed a generative model enhanced with iterative feedback. Lastly, Yi et al. [46] proposed Contextual Residual Aggregation (CRA) for high-resolution inpainting, overcoming memory limitations while preserving fine details.

2.2. LF Occlusion Removal

Recent advancements in LF imaging exploit rich 4D data to effectively address occlusions, enhancing image quality and scene reconstruction. In the subsequent subsections, we review traditional and deep learning-based approaches that mitigate these challenges.

2.2.1. Conventional Methods

Early research in synthetic aperture focusing began with [47], who developed a method to enhance visibility through partial occluders by resampling LFs. Their approach aligned 4D LF images to refocus across planes, enhancing background clarity and foreground blur. Building on this, Ref. [48] applied synthetic aperture focusing to 3D reconstruction, comparing it with traditional stereo methods and proposing improved multiview techniques that handled occlusions more effectively using color and entropy metrics. To support LF capture, Ref. [49] proposed a calibration method using planar parallax to estimate camera positions and reproject images onto different planes, achieving better accuracy than conventional calibration techniques. Later, Ref. [50] introduced a pixel-labeling method to remove occlusions by identifying and masking affected pixels. Their subsequent work [51] employed image matting to produce all-in-focus images but was limited by depth-specific focus ranges. Addressing these limitations, Ref. [4] segmented scenes into layers, enabling focus at all depths. Further advancements included an iterative reconstruction technique by [52], which used clustering to distinguish occlusions from the background and refined the results through optimization. Despite progress, challenges with large occlusions and depth accuracy persist, requiring further advances in LF imaging.

2.2.2. Deep Learning-Based Methods

Building on the limitations of traditional methods, Wang et al. [22] introduced DeOccNet, a deep learning model that combines an encoder–decoder structure with ResASPP to expand the receptive field and improve occlusion understanding. A mask embedding approach generates occluded LF images, but it struggles with large occlusions and blurry reconstructions due to poor spatial dependency modeling in the SAI stacking. To address these shortcomings, Li et al. [23] introduced Mask4D, which uses 4D convolution to preserve spatial coherence and angular consistency, enhancing the removal of complex occlusions. Similarly, Zhao et al. [24] used GANs for occlusion removal, enabling semantic inpainting that integrates occlusions with backgrounds for realistic reconstruction. Zhang et al. [25] developed a dynamic microlens filter to enhance feature extraction from shifted lenslet images in sparse LFs, but its rigid background assumptions and high memory use limit its application to dense LFs. In response, Zhang et al. [53] proposed LFORNet, which integrates Foreground Occlusion Location (FOL), Background Content Recovery (BCR), and a refinement module to effectively handle occlusions of varying sizes and scenarios through multi-angle view stacks (MVAS) processing. Song et al. [54] introduced a dual-pathway fusion network that separates center-view synthesis and occlusion prediction, combining their outputs for better reconstruction accuracy. Hur et al. [26] developed the ISTY framework, which integrates modules for feature extraction, occlusion detection, and inpainting to address challenges in both sparse and dense datasets. However, its reliance on CNNs limits its ability to handle complex occlusions due to limited receptive fields. Wang et al. [27] introduced a hybrid CNN and Swin Transformer approach, using CNNs for local features and Swin Transformers for global patterns, improving performance in large occlusions. Building on this, Zhang et al. [28] introduced SwinSccNet, combining ScConv blocks for feature compression and the Swin-Unet framework to balance computational efficiency and de-occlusion performance. Senussi et al. [29] advanced the field with a model combining CSPDarknet53 for multiscale feature extraction, BiFPN for feature fusion, and a refinement module with half-instance initialization, enhancing occlusion removal and reconstruction quality across varied datasets.

3. Proposed Method

In this section, we present our novel U2-LFOR network for occlusion removal in LF images. Designed to address the computational challenges of LF imaging, our network effectively reconstructs occluded regions from multiview data while maintaining low computational complexity. To achieve this, our network identifies occluded regions and replaces them with reconstructed background information. As depicted in Figure 2, the U2-LFOR architecture consists of three key components: feature extraction, the two-stage U2-Net, and the refinement module. The network processes 5 × 5 sub-aperture images (SAIs) as input. First, the feature extraction module captures contextual multiscale features from the SAIs using convolution and ResASPP blocks. These features provide the foundation for subsequent refinement. Next, the two-stage U2-Net progressively integrates and refines features through its hierarchical structure, enabling effective reconstruction of occluded regions. Finally, the refinement module combines the refined features with ResBlocks and a convolution layer to produce the final occlusion-free image. The network is trained using center-view (CV) SAIs as ground truth to ensure accurate supervision and reconstruction. The pseudo-code of U2-LFOR is presented in Algorithm 1.
Algorithm 1 Pseudo-Code of U2-LFOR for Occlusion Removal in LF Images
Input: Densely sampled occluded LF image $L_0 \in \mathbb{R}^{U \times V \times H \times W \times C_{in}}$
Output: Occlusion-free center-view LF image $I_{Out}$
 1: Feature Extraction:
 2: $F_C \leftarrow \mathrm{Conv}_{1\times 1}(L_0)$   ▹ Merge angular info into channels
 3: $F_R \leftarrow F_C + \mathrm{Conv}_{1\times 1}(\mathrm{Concat}(\mathrm{LReLU}(\mathrm{Conv}_d(F_C)))),\; d \in \{1, 2, 4, 8\}$   ▹ ResASPP features
 4: Two-Stage U2-Net:
 5: $F_E^{(1)} \leftarrow \mathrm{Encoder}^{(1)}(F_R)$   ▹ Stage 1 encoder output
 6: $F_E^{(2)} \leftarrow \mathrm{Encoder}^{(2)}(F_E^{(1)})$   ▹ Stage 2 encoder takes Stage 1 encoder output
 7: $F_D^{(2)} \leftarrow \mathrm{Decoder}^{(2)}(F_E^{(2)})$   ▹ Stage 2 decoder output
 8: $F_D^{(1)} \leftarrow \mathrm{Decoder}^{(1)}(F_E^{(1)} + F_D^{(2)})$   ▹ Stage 1 decoder = Stage 1 encoder + Stage 2 decoder
 9: $F_U \leftarrow F_R + F_D^{(1)}$   ▹ Skip connection between the feature extractor and U2-Net outputs
10: Refinement Module:
11: $F_{res1} \leftarrow \mathrm{ResBlock}_1(F_U)$
12: $F_{res2} \leftarrow \mathrm{ResBlock}_2(F_{res1})$
13: $I_{Out} \leftarrow \mathrm{Conv}_{1\times 1}(F_{res2})$   ▹ Final occlusion-free output
14: return $I_{Out}$

3.1. LF Feature Extractor

As illustrated in Figure 3, the feature extractor follows a carefully designed sequence of components: an initial convolutional layer followed by a ResASPP block. This layered design incrementally transforms input features to extract local details while progressively capturing global contextual information. The input tensor $L_0 \in \mathbb{R}^{U \times V \times H \times W \times C_{in}}$ (where $U$ and $V$ denote the angular dimensions, $H$ and $W$ correspond to the spatial height and width, and $C_{in}$ represents the number of input channels) is first processed through a convolutional layer. In this specific work, the input tensor was structured as $L_0 \in \mathbb{R}^{5 \times 5 \times 256 \times 192 \times 3}$, incorporating the angular, spatial, and channel-wise components of the input LF.
This convolutional layer applies a kernel size of $1 \times 1$, a stride of 1, and a padding of 1. The $1 \times 1$ kernel operates across the channel dimension of the input, merging the angular information across the $U$ and $V$ dimensions. By concatenating the angular information along the channel dimension, the network preserves both spatial and angular resolutions while enabling the efficient interaction of features across all channels. Formally, the convolution operation is expressed as:
$$F_C = \mathrm{Conv}_{1\times 1}(L_0),$$
where $F_C \in \mathbb{R}^{(3 \times U \times V) \times H \times W}$ represents the output tensor, setting the stage for deeper feature extraction. Subsequently, $F_C$ is fed into the ResASPP module depicted in Figure 3. This module employs four parallel atrous convolutional layers with dilation rates $d \in \{1, 2, 4, 8\}$, each followed by a LeakyReLU activation with a leaky factor of 0.1, enabling the feature extractor to encode multiscale contextual information. These parallel atrous convolutions expand the receptive field without increasing computational overhead. The outputs from each atrous convolution are concatenated, passed through a $1 \times 1$ convolution for channel reduction, and fused with the input tensor through a residual connection.
The result, $F_R$, is computed as:
$$F_R = F_C + \mathrm{Conv}_{1\times 1}\left(\mathrm{Concat}\left\{\mathrm{LReLU}\left(\mathrm{Conv}_d(F_C)\right)\right\}\right),$$
where $d \in \{1, 2, 4, 8\}$ represents the dilation rates. The output tensor, $F_R \in \mathbb{R}^{H \times W \times C_{out}}$, maintains essential multiscale and context-aware features, which are crucial for the subsequent restoration stages.
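To make the feature extractor concrete, the following is a minimal PyTorch sketch of the angular-merging $1 \times 1$ convolution followed by the ResASPP block described above. The channel widths (64 feature channels) and any hyperparameters beyond those stated in the text are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ResASPP(nn.Module):
    """Minimal sketch of the ResASPP feature extractor (channel widths assumed)."""
    def __init__(self, in_ch=75, feat_ch=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        # 1x1 convolution that merges the stacked angular views into the channel axis
        self.merge = nn.Conv2d(in_ch, feat_ch, kernel_size=1, stride=1)
        # Parallel atrous branches with increasing dilation rates, each with LeakyReLU(0.1)
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(feat_ch, feat_ch, kernel_size=3, padding=d, dilation=d),
                nn.LeakyReLU(0.1, inplace=True),
            )
            for d in dilations
        ])
        # 1x1 convolution that fuses the concatenated branches back to feat_ch channels
        self.fuse = nn.Conv2d(feat_ch * len(dilations), feat_ch, kernel_size=1)

    def forward(self, lf_stack):
        # lf_stack: (B, 3*U*V, H, W) -- SAIs stacked along the channel dimension
        f_c = self.merge(lf_stack)
        multi_scale = torch.cat([branch(f_c) for branch in self.branches], dim=1)
        return f_c + self.fuse(multi_scale)  # residual fusion: F_R = F_C + Conv1x1(Concat(...))

# Example: a 5x5 LF with 256x192 SAIs, as used in the paper
x = torch.randn(1, 3 * 5 * 5, 256, 192)
print(ResASPP()(x).shape)  # torch.Size([1, 64, 256, 192])
```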

3.2. Two-Stage U2-Net

The Two-Stage U2-Net architecture employs a hierarchical network structure that progressively refines features through two sequentially interconnected U2-Net modules, referred to as Stage 1 and Stage 2 in Figure 2. Each stage adopts a U-shaped encoder–decoder structure with skip connections, enabling the model to effectively capture and reconstruct spatial and semantic information across multiple resolutions. The key innovation lies in the multi-stage refinement process, where Stage 2 builds upon the outputs of Stage 1 to enhance semantic consistency and spatial resolution.
In Stage 1, the input is processed by an encoder consisting of three downsampling blocks. Each block performs a 3 × 3 convolution with a stride of 2 and padding of 1, reducing the spatial resolution by a factor of 2 while progressively increasing the number of channels. This is followed by batch normalization (BN) and LeakyReLU activation with a leaky factor of 0.1. Specifically, the first downsampling block increases the input channels from 64 to 128, the second block increases them to 256, and the third block outputs 512 channels. The decoder mirrors the encoder structure with three upsampling blocks that restore the spatial resolution using 4 × 4 transposed convolutions with a stride of 2 and padding of 1, followed by batch normalization (BN) and LeakyReLU activation with a leaky factor of 0.1. These upsampling blocks progressively decrease the number of channels: the first reduces the channels from 512 to 256, the second reduces them to 128, and the third restores them to 64. Element-wise addition is applied between the outputs of the encoder and decoder at corresponding levels through skip connections, which transfer spatial details and merge features to preserve semantic information across scales.
The output of Stage 1 undergoes further processing and is fed into Stage 2, which employs a similar encoder–decoder structure but with a reduced depth. The encoder in Stage 2 consists of two downsampling blocks: the first block transforms the input from 64 to 128 channels, and the second block increases the channels to 256. Both blocks use the same 3 × 3 convolution configuration (stride of 2 and padding of 1), followed by batch normalization (BN) and LeakyReLU activation. The decoder in Stage 2 includes two upsampling blocks instead of three. The first block reduces the channels from 256 to 128, while the second restores them to 64. Both blocks utilize 4 × 4 transposed convolutions with a stride of 2 and padding of 1. This reduction in depth enables Stage 2 to focus on extracting deeper, high-level feature representations while maintaining computational efficiency.
The architecture is mathematically defined as follows. Given the input $F_R$, the Stage 1 decoder combines the Stage 1 encoder features with the refined output of Stage 2:
$$F_U^{(1)} = \mathrm{Decoder}^{(1)}\left(\mathrm{Encoder}^{(1)}(F_R) + F_U^{(2)}\right),$$
where the Stage 2 output $F_U^{(2)}$ is obtained by passing the Stage 1 encoder features through the Stage 2 encoder–decoder:
$$F_U^{(2)} = \mathrm{Decoder}^{(2)}\left(\mathrm{Encoder}^{(2)}\left(\mathrm{Encoder}^{(1)}(F_R)\right)\right).$$
The output from the feature extractor, $F_R$, is then combined with the U2-Net output, $F_U^{(1)}$, using the skip connection represented by the dashed red line in Figure 2. This skip connection ensures the retention of crucial low-level features, preserving fine details and significantly enhancing the overall output quality:
$$F_U = F_R + F_U^{(1)}.$$
These design choices address challenges such as the vanishing gradient problem and improve convergence during training. The inclusion of batch normalization and LeakyReLU in both the encoder and decoder blocks ensures training stability and enhances feature discriminability.
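The sketch below illustrates one way to realize the two-stage structure in PyTorch, using the stated $3 \times 3$ stride-2 downsampling blocks, $4 \times 4$ stride-2 transposed-convolution upsampling blocks, BN + LeakyReLU(0.1), and additive skip connections. For simplicity, Stage 2 here refines the full-resolution output of Stage 1; Algorithm 1 instead feeds Stage 1 encoder features into Stage 2 and fuses the Stage 2 decoder output back before the Stage 1 decoder, which requires channel/resolution adapters not detailed in the text, so this wiring should be read as an assumption.

```python
import torch
import torch.nn as nn

def down_block(in_ch, out_ch):
    # 3x3 conv, stride 2 (halves the spatial resolution), then BN + LeakyReLU(0.1)
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.1, inplace=True))

def up_block(in_ch, out_ch):
    # 4x4 transposed conv, stride 2 (doubles the spatial resolution), then BN + LeakyReLU(0.1)
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
                         nn.BatchNorm2d(out_ch), nn.LeakyReLU(0.1, inplace=True))

class UStage(nn.Module):
    """One U-shaped stage with element-wise additive skip connections."""
    def __init__(self, channels=(64, 128, 256, 512)):
        super().__init__()
        self.encoders = nn.ModuleList([down_block(channels[i], channels[i + 1])
                                       for i in range(len(channels) - 1)])
        self.decoders = nn.ModuleList([up_block(channels[i + 1], channels[i])
                                       for i in reversed(range(len(channels) - 1))])

    def forward(self, x):
        skips = []
        for enc in self.encoders:
            skips.append(x)
            x = enc(x)
        for dec, skip in zip(self.decoders, reversed(skips)):
            x = dec(x) + skip  # additive skip connection at the matching level
        return x

class TwoStageU2(nn.Module):
    """Stage 1 (three levels, 64->128->256->512) followed by a shallower Stage 2 (64->128->256)."""
    def __init__(self):
        super().__init__()
        self.stage1 = UStage((64, 128, 256, 512))
        self.stage2 = UStage((64, 128, 256))

    def forward(self, f_r):
        f1 = self.stage1(f_r)   # Stage 1 reconstruction at 64 channels
        f2 = self.stage2(f1)    # Stage 2 refines Stage 1's output (simplified wiring)
        return f_r + f2         # skip from the feature extractor: F_U = F_R + F_U^(1)

feats = torch.randn(1, 64, 256, 192)
print(TwoStageU2()(feats).shape)  # torch.Size([1, 64, 256, 192])
```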

3.3. Refinement Module

The refinement module consists of two stacked ResBlocks, each containing a series of well-configured layers, followed by a 1 × 1 convolutional layer for dimensionality reduction and final output generation. As shown in Figure 4, each ResBlock contains three convolutional layers, all with a kernel size of 3 × 3 , a stride of 1, and padding of 1 to preserve spatial dimensions. These layers use standard convolution with groups set to 1 and are followed by LeakyReLU activation functions with a negative slope of 0.1, except for the final layer. The defining characteristic of the ResBlock is its skip connection, which directly adds the input to the block’s output, enabling the model to learn residual mappings rather than full transformations. This residual design accelerates convergence and enhances feature refinement. Mathematically, the output of the ResBlock can be expressed as:
$$F_{resB} = x + \mathcal{F}(x),$$
where $x$ is the input to the block and $\mathcal{F}(x)$ represents the output after the non-linear transformations.
Following the ResBlocks, a final 1 × 1 convolutional layer reduces the channel dimension from 64 (the feature depth after the ResBlocks) to the required number of output channels, which is 3 for RGB images. By applying a kernel size of 1 × 1 , a stride of 1, and no padding, this layer maintains the spatial resolution of the feature map while performing a channel-wise linear transformation to produce the final refined output.
The refinement pipeline can be summarized as:
$$I_{Out} = \mathrm{Conv}_{1\times 1}\left(\mathrm{ResBlock}_2\left(\mathrm{ResBlock}_1(F_U)\right)\right),$$
where $F_U$ is the input to the refinement module, $\mathrm{ResBlock}_1$ and $\mathrm{ResBlock}_2$ are the two residual blocks in sequence, and $\mathrm{Conv}_{1\times 1}$ represents the final $1 \times 1$ convolutional layer.
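A minimal PyTorch sketch of the refinement module follows, with two cascaded ResBlocks (three $3 \times 3$ convolutions each, LeakyReLU with slope 0.1 after all but the last) and a final $1 \times 1$ convolution mapping 64 feature channels to RGB; the 64-channel width is taken from the surrounding description, and everything else is a plain reading of the text rather than the authors' code.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block sketch: three 3x3 convs, LeakyReLU(0.1) after all but the last."""
    def __init__(self, ch=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1), nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(ch, ch, 3, stride=1, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)  # residual mapping: F_resB = x + F(x)

class Refinement(nn.Module):
    """Two cascaded ResBlocks followed by a 1x1 conv that maps 64 channels to RGB."""
    def __init__(self, ch=64, out_ch=3):
        super().__init__()
        self.res1, self.res2 = ResBlock(ch), ResBlock(ch)
        self.out = nn.Conv2d(ch, out_ch, kernel_size=1, stride=1, padding=0)

    def forward(self, f_u):
        return self.out(self.res2(self.res1(f_u)))  # I_Out = Conv1x1(ResBlock2(ResBlock1(F_U)))

f_u = torch.randn(1, 64, 256, 192)
print(Refinement()(f_u).shape)  # torch.Size([1, 3, 256, 192])
```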

3.4. Loss Function

In our approach, we employ a composite loss function that integrates three losses to guide the reconstruction process: Mean Absolute Error (MAE) loss, Structural Similarity Index Measure (SSIM) loss, and Perceptual Loss ($\mathcal{L}_{PER}$). The total loss $\mathcal{L}$ is expressed as a weighted sum of these individual losses:
$$\mathcal{L} = k_1 \cdot \mathcal{L}_{MAE} + k_2 \cdot \mathcal{L}_{SSIM} + (1 - k_1 - k_2) \cdot \mathcal{L}_{PER},$$
where $k_1 = 0.30$ and $k_2 = 0.35$ are the empirically chosen weights for MAE and SSIM loss, respectively, ensuring a balanced contribution from each loss function.
The MAE loss computes the average absolute pixel-wise differences between the ground truth $I$ and the reconstructed image $\hat{I}$:
$$\mathcal{L}_{MAE} = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| I_{i,j} - \hat{I}_{i,j} \right|,$$
where H and W denote the height and width of the image, respectively. This loss ensures that the pixel-level accuracy of the reconstruction is optimized.
The SSIM loss evaluates the perceptual quality of an image by comparing luminance, contrast, and structure between the ground truth $I$ and the reconstructed image $\hat{I}$. It better aligns with human vision and preserves finer details compared to traditional pixel-based losses like MAE. The SSIM loss is defined as:
$$\mathcal{L}_{SSIM} = 1 - \mathrm{SSIM}(I, \hat{I}),$$
where SSIM is computed using:
$$l(I, \hat{I}) = \frac{2\mu_I \mu_{\hat{I}} + C_1}{\mu_I^2 + \mu_{\hat{I}}^2 + C_1}, \quad C_1 = (K_1 L)^2,$$
$$c(I, \hat{I}) = \frac{2\sigma_I \sigma_{\hat{I}} + C_2}{\sigma_I^2 + \sigma_{\hat{I}}^2 + C_2}, \quad C_2 = (K_2 L)^2,$$
$$s(I, \hat{I}) = \frac{\sigma_{I\hat{I}} + C_3}{\sigma_I \sigma_{\hat{I}} + C_3}, \quad C_3 = \frac{C_2}{2},$$
where $\mu_I$ and $\mu_{\hat{I}}$ are the mean values of $I$ and $\hat{I}$, $\sigma_I$ and $\sigma_{\hat{I}}$ are the standard deviations of $I$ and $\hat{I}$, $\sigma_{I\hat{I}}$ is the covariance between $I$ and $\hat{I}$, $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images), and $K_1 = 0.01$, $K_2 = 0.03$ are small constants.
Perceptual loss aligns the high-level features and style of the reconstruction with the original image. It is the sum of the feature reconstruction loss $\mathcal{L}_{FEAT}^{\tau}$ and the style reconstruction loss $\mathcal{L}_{STYLE}^{\tau}$:
$$\mathcal{L}_{PER} = \mathcal{L}_{FEAT}^{\tau} + \mathcal{L}_{STYLE}^{\tau}.$$
The feature reconstruction loss is defined as:
$$\mathcal{L}_{FEAT}^{\tau} = \frac{1}{C_j H_j W_j} \sum_{c=1}^{C_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \left( \tau_j(I)_{c,h,w} - \tau_j(\hat{I})_{c,h,w} \right)^2,$$
where $C_j$, $H_j$, and $W_j$ denote the number of channels, height, and width of the feature map at layer $j$, and $\tau_j(I)$ represents the activations of the ground truth at that layer.
The style reconstruction loss is computed as:
$$\mathcal{L}_{STYLE}^{\tau} = \frac{1}{C_j H_j W_j} \left\| G_j^{\tau}(I) - G_j^{\tau}(\hat{I}) \right\|_F^2,$$
$$G_j^{\tau}(x) = \frac{1}{C_j H_j W_j} \sum_{h=1}^{H_j} \sum_{w=1}^{W_j} \tau_j(x)_{h,w}\, \tau_j(x)_{h,w}^{T},$$
where $G_j^{\tau}(x)$ is the Gram matrix at layer $j$, capturing correlations between feature maps. This loss helps preserve the perceptual style and texture of the image when removing occlusions and complex textures during image reconstruction.
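The composite loss can be sketched as follows. This is an illustrative approximation rather than the authors' code: SSIM is computed here from global image statistics instead of local windows, the perceptual term uses a single pretrained VGG16 layer (relu3_3) as $\tau$ with an MSE feature term and a Gram-matrix style term, images are assumed to be normalized to $[0, 1]$ (so $L = 1$), and the ImageNet input normalization usually applied before VGG is omitted for brevity.

```python
import torch
import torch.nn.functional as F
import torchvision

K1_W, K2_W = 0.30, 0.35  # loss weights from the paper: k1 (MAE), k2 (SSIM), rest perceptual

def ssim_loss(x, y, L=1.0, K1=0.01, K2=0.03):
    """Simplified SSIM from global image statistics (the paper's windowed variant differs)."""
    C1, C2 = (K1 * L) ** 2, (K2 * L) ** 2
    mu_x, mu_y = x.mean(dim=(1, 2, 3)), y.mean(dim=(1, 2, 3))
    var_x, var_y = x.var(dim=(1, 2, 3)), y.var(dim=(1, 2, 3))
    cov = ((x - mu_x.view(-1, 1, 1, 1)) * (y - mu_y.view(-1, 1, 1, 1))).mean(dim=(1, 2, 3))
    ssim = ((2 * mu_x * mu_y + C1) * (2 * cov + C2)) / \
           ((mu_x ** 2 + mu_y ** 2 + C1) * (var_x + var_y + C2))
    return (1.0 - ssim).mean()

class PerceptualLoss(torch.nn.Module):
    """Feature + style (Gram) loss on one frozen VGG16 layer; the layer choice is an assumption."""
    def __init__(self):
        super().__init__()
        vgg = torchvision.models.vgg16(weights=torchvision.models.VGG16_Weights.DEFAULT)
        self.tau = vgg.features[:16].eval()  # activations up to relu3_3
        for p in self.tau.parameters():
            p.requires_grad_(False)

    @staticmethod
    def gram(f):
        # Gram matrix of a feature map, normalized by C*H*W
        b, c, h, w = f.shape
        f = f.reshape(b, c, h * w)
        return f @ f.transpose(1, 2) / (c * h * w)

    def forward(self, pred, target):
        fp, ft = self.tau(pred), self.tau(target)
        feat = F.mse_loss(fp, ft)                          # feature reconstruction term
        style = F.mse_loss(self.gram(fp), self.gram(ft))   # style reconstruction term
        return feat + style

def total_loss(pred, target, perceptual):
    mae = F.l1_loss(pred, target)
    return K1_W * mae + K2_W * ssim_loss(pred, target) + (1 - K1_W - K2_W) * perceptual(pred, target)
```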

4. Experiments

4.1. Experimental Setup

In line with the approaches presented in [26,29], we trained and evaluated our network following the procedures and settings specified in these studies. Specifically, we followed their training procedures and evaluation strategies, ensuring consistency with their methodologies while making necessary adjustments for our experimental setup. The following subsections provide a detailed description of the steps involved in the training and testing process of our network. Additionally, Table 1 summarizes the datasets used in both the training and testing phases.

4.1.1. Training Dataset

We developed a robust training pipeline for our U2-LFOR network, which was trained on a carefully curated dataset that integrated both real-world and synthetically generated occlusion cases. Adopting the mask embedding strategy outlined in [22], which creates occluded LF images by embedding occlusion masks into occlusion-free LF images, we enabled the simulation of a wide range of occlusion patterns and complexities. To account for varying disparity conditions, one to three occlusion masks were randomly placed during the embedding process. To further enrich the dataset, we augmented the original 80 synthetic masks from [22] by adding 21 additional large and dense occlusion masks, derived from real-world scenes, which are particularly challenging to reconstruct. To ensure the availability of ground-truth occlusion-free images, only objects with negative disparity were included in the LF images. For this study, a total of 1418 LF images were selected from the DUTLF-V2 [55] dataset, which contains densely sampled LF images captured using the Lytro Illum camera [58]. Through deliberate augmentation and careful selection of training data, our model effectively learned and generalized occlusion removal methods, ensuring high adaptability to a wide range of complex real-world scenarios.
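For illustration, the snippet below sketches one plausible reading of the mask embedding strategy of [22]: an RGBA occluder patch is pasted into every SAI, shifted per view in proportion to the angular offset so as to mimic a chosen foreground disparity, and one to three masks are embedded per scene. The shift model, mask format, and disparity values are assumptions made for this sketch; the exact procedure is the one described in [22].

```python
import numpy as np

def embed_occlusion(lf, mask_rgba, disparity, rng):
    """Hypothetical mask-embedding sketch.
    lf:        (U, V, H, W, 3) occlusion-free light field, float in [0, 1]
    mask_rgba: (h, w, 4) occluder patch with an alpha channel
    """
    U, V, H, W, _ = lf.shape
    h, w = mask_rgba.shape[:2]
    uc, vc = U // 2, V // 2
    top = rng.integers(0, H - h)    # random placement in the center view
    left = rng.integers(0, W - w)
    occluded = lf.copy()
    for u in range(U):
        for v in range(V):
            # per-view shift proportional to the angular offset (foreground disparity)
            y0 = top + int(round(disparity * (u - uc)))
            x0 = left + int(round(disparity * (v - vc)))
            if 0 <= y0 and y0 + h <= H and 0 <= x0 and x0 + w <= W:
                alpha = mask_rgba[..., 3:4]
                occluded[u, v, y0:y0 + h, x0:x0 + w] = (
                    alpha * mask_rgba[..., :3]
                    + (1 - alpha) * occluded[u, v, y0:y0 + h, x0:x0 + w]
                )
    return occluded

# Embed one to three masks per scene, as described in the text
rng = np.random.default_rng(0)
lf_occ = rng.random((5, 5, 256, 192, 3))
mask = rng.random((40, 40, 4))
for d in rng.integers(1, 5, size=rng.integers(1, 4)):
    lf_occ = embed_occlusion(lf_occ, mask, float(d), rng)
```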

4.1.2. Testing Dataset

To assess the effectiveness of our network on sparse LF images, we used a combination of synthetic and real datasets. First, we tested our model on two synthetic sparse LF datasets, 4-Syn and 9-Syn, created by [22] and [27], respectively. These datasets contain sparse LF scenes designed to challenge the model’s ability to handle occlusions across varying disparity levels. Additionally, we included the Stanford CD scene [49], a real sparse LF dataset with ground truth, to provide a more accurate comparison of our network’s performance on real-world sparse LF occlusion data. For dense LF images, we selected 615 images from the DUTLF-V2 test dataset [55], along with 33 real occlusion images to test the model under more complex real-world conditions. To simulate realistic multi-disparity occlusion scenarios, we adopted a mask embedding technique that generated Single Occ and Double Occ cases with disparities ranging from one to four, enabling a thorough evaluation of the network’s performance across various occlusion types and disparity levels. In addition to synthetic datasets, we used publicly available real-world LF scenes for qualitative evaluation. The Stanford Lytro dataset [56] and the EPFL-10 dataset [57] are dense LF datasets that offer diverse occlusion patterns and disparity levels. By combining synthetic and real-world datasets, we ensured a comprehensive evaluation of our network across a wide range of occlusion types and LF data.

4.1.3. Training Details

The DUTLF-V2 [55] dataset provides LF images with an angular and spatial resolution of $9 \times 9 \times 600 \times 400$. For our experiments, we extracted the central $5 \times 5$ views, reducing the spatial resolution to $300 \times 200$. During training, images were center-cropped and horizontally flipped to achieve a final resolution of $256 \times 192$. To simulate occlusions, a mask embedding technique was employed, where one to three RGB masks were randomly selected, combined, and shuffled within the images. The model was trained using the ADAM optimizer with parameters $(\beta_1, \beta_2) = (0.5, 0.9)$ and a batch size of 18. Regularization parameters were set as $\lambda_1 = 0.01$ and $\lambda_2 = 120$. The initial learning rate of 0.001 was halved every 150 epochs. Training was conducted for 500 epochs using the PyTorch framework (version 2.1.1+cu118), requiring approximately 18 h on a single NVIDIA GeForce RTX 3090 GPU.
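The reported optimization settings translate into a short PyTorch configuration such as the sketch below; the model and data loader are placeholders, and the commented training step reuses the composite loss sketched in Section 3.4.

```python
import torch

# Sketch of the reported optimization setup (the model here is only a stand-in for U2-LFOR)
model = torch.nn.Conv2d(3, 3, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, betas=(0.5, 0.9))
# Learning rate halved every 150 epochs; 500 epochs in total; batch size 18
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=150, gamma=0.5)

for epoch in range(500):
    # for occluded, target in train_loader:   # batches of 18 center-cropped 256x192 LFs
    #     loss = total_loss(model(occluded), target, perceptual)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
```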

4.2. Experimental Results

We evaluated our model by conducting experiments on de-occluded images, comparing its performance against leading LF occlusion removal methods, including DeOccNet [22], ISTY [26], Zhang et al. [25], and Senussi et al. [29]. To further investigate the contribution of angular information in LFs, we extended the comparison to include single-image inpainting methods, such as RFR [37] and LBAM [39]. For consistency and a fair comparison, we retrained the DeOccNet [22] and Senussi et al. [29] methods from scratch on our dataset, while ISTY [26] was tested using the original pre-trained weights provided by the authors. Since the implementations of Zhang et al. [25], RFR [37], and LBAM [39] are not publicly available, their results were sourced directly from ISTY [26]. All evaluations were performed using a unified training approach that incorporated mask embedding on the dense LF dataset.

4.2.1. Quantitative Results

The quantitative results, presented in Table 2, highlight the strong performance of the proposed method compared to existing approaches for both sparse and dense LF datasets, evaluated using PSNR and SSIM, two standard metrics in LF occlusion removal research.
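For reference, PSNR follows directly from the mean squared error between the reconstruction and the ground truth, as in the minimal sketch below; pixel values are assumed to lie in $[0, 1]$, and SSIM is typically computed with a windowed implementation such as skimage.metrics.structural_similarity rather than the global statistics used here for PSNR.

```python
import torch

def psnr(pred, target, max_val=1.0):
    """PSNR in dB for images scaled to [0, max_val]."""
    mse = torch.mean((pred - target) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Example: a small Gaussian perturbation (sigma = 0.01) gives roughly 40 dB
gt = torch.rand(1, 3, 256, 192)
print(psnr(gt + 0.01 * torch.randn_like(gt), gt))
```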
For sparse LFs, the proposed method consistently delivered top results. Specifically, for 4-Syn, it achieved the second highest PSNR, slightly behind Senussi et al. [29]. However, the proposed method outperformed all other approaches in more complex scenarios, such as 9-Syn and CD, where it achieved the highest PSNR values. Furthermore, SSIM scores underscored its ability to reconstruct images with high structural and perceptual quality, even in complex and occluded regions, achieving the best results in all scenarios. In contrast, RFR [37] and LBAM [39] performed poorly on sparse LF datasets due to their reliance on single-image inpainting techniques, which fail to utilize LF-specific angular and background information effectively. DeOccNet [22] showed moderate results but lacked consistency across different scenarios, struggling to handle large occlusions effectively, which limited its overall performance. Zhang et al.’s method [25] showed improvements in some cases, but its assumptions about background visibility and reliance on shifted lenslet images limited its performance. Similarly, ISTY [26] faced challenges with sparsity and occlusions due to its use of local receptive fields in CNNs. Senussi et al.’s method [29], while achieving the best result for 4-Syn, fell short of the proposed method in most other sparse LF scenarios.

4.2.2. Qualitative Results

Figure 5 provides qualitative comparisons for sparse synthetic scenes and the real-world CD scene. In the sparse synthetic scenes (rows 1 to 4), the RFR [37] and LBAM [39] methods fail to reconstruct details in occlusion regions, resulting in blurred and incomplete outputs. DeOccNet [22] demonstrates slightly better performance but leaves prominent artifacts and struggles with complex textures, particularly in scenes 2 and 4. Zhang et al.’s method [25] produces overly smooth outputs, leading to a loss of important details, while ISTY [26] performs comparatively better but still retains residual artifacts in occluded areas. Senussi et al.’s method [29] delivers reasonable results but lacks precision in challenging scenes, such as scenes 1 and 2. In contrast, our method effectively removes occlusions and restores sharp details. For instance, in scene 2, our approach produces clear and accurate reconstructions, capturing intricate textures better than any other method. Similarly, in scene 4, our model restores structural details with great accuracy, outperforming all other methods. For the real-world CD scene (row 5), which is more complex due to real-world textures, most methods struggle.
Turning to the dense LF dataset, as shown in Figure 6, we compared the ability of different methods to handle single-occ and double-occ cases. In single-occ scenes (rows 1 and 3), the RFR [37] and LBAM [39] methods produce hazy outputs, while DeOccNet [22] struggles due to its simplistic stacking of SAIs, neglecting critical spatial relationships, which results in poor removal of large occlusions. Zhang et al.’s method [25] leads to a loss of fine detail, struggling to adapt to dynamic occlusion scenarios, which reduces sharpness and accuracy. ISTY [26] and Senussi et al.’s method [29] perform better, with ISTY providing the clearest results, as reflected in its higher SSIM values. Although our method achieves the highest PSNR in both single-occ and double-occ (rows 2 and 4) scenarios, it struggles with structural consistency, as indicated by its lower SSIM values compared to ISTY [26]. In double-occ scenarios, where overlapping occlusions present greater challenges, RFR [37], LBAM [39], and DeOccNet [22] fail to preserve structural details, while ISTY [26] better maintains overall structural integrity, visually outperforming our method. However, our method outperforms others in recovering fine details and reducing artifacts, producing clearer outputs in specific high-texture regions.
RFR [37], LBAM [39], and Zhang et al.’s method [25] produce blurry outputs, while ISTY [26] and Senussi et al.’s method [29] show improvements but lack full accuracy. Our method achieves the best results, removing occlusions and preserving the natural appearance of the scene.

4.2.3. Performance Evaluation on Real-World Scene Data

Figure 7 demonstrates the comparative performance of occlusion removal on real-world LF images. The first column displays the input occluded LF images, while the second, third, and fourth columns showcase the results of DeOccNet [22], Senussi et al. [29], and our proposed model, U2-LFOR, respectively. U2-LFOR achieves a clear advantage, particularly in scenarios with thin and repetitive occlusions. In the first row, our model effectively removes the bicycle spokes, restoring the background with high clarity and natural textures. By contrast, DeOccNet [22] produces blurred outputs with significant detail loss, while Senussi et al.’s method [29] leaves residual artifacts, reducing the overall reconstruction quality.
In the second and third rows, which feature complex, thin, and irregular occlusions such as the fence and tree branches, U2-LFOR outperforms DeOccNet and Senussi et al.’s method by maintaining better structural consistency and finer detail restoration of the background, resulting in better overall image quality. In the second row, U2-LFOR effectively removes the fence occlusion, preserving clarity and avoiding the blurriness and distortion seen in the other methods. Similarly, in the third row, it handles the tree branches well, minimizing distortion and preserving the scene’s structural integrity. In contrast, both DeOccNet [22] and Senussi et al.’s method [29] struggle with these complex occlusions, leaving artifacts and incomplete restorations.

4.2.4. Evaluation of Computational Efficiency

The results in Table 3, visually summarized in Figure 8, highlight the computational efficiency and optimized design of our proposed model compared to existing methods, demonstrating a balance between lightweight architecture, fast inference, and high-quality outputs. With a parameter count of 11.06 M, our model is significantly more compact than competing methods, such as LBAM [39] (69.3 M), ISTY [26] (80.6 M), and Senussi et al. [29] (52.59 M). Even compared to DeOccNet [22] (39.0 M), which is specifically designed as a lightweight model, our approach is considerably smaller, making it one of the most compact architectures. Although Zhang et al.’s method [25] has the smallest parameter count at 2.7 M, this comes at the cost of an impractically high inference time.
In terms of inference time, our model achieves the fastest performance at 7.86 ms, outperforming all other methods. The second-best method, DeOccNet [22], requires 10 ms, while LBAM [39] and ISTY [26] require 12 ms and 24 ms, respectively. Zhang et al.’s method [25], despite its small parameter count, suffers from an exceptionally high inference time of 3050 ms, rendering it unsuitable for real-time applications. Senussi et al.’s method [29], with an inference time of 138.8 ms, also falls far behind in computational speed.

5. Ablation Study

To gain deeper insights into the performance of our U2-LFOR architecture, we conducted an ablation study to assess the contributions of different network components. The results are summarized in Table 4, where we analyze the impact of removing key modules, including ResASPP, U2 Stage 1, U2 Stage 2, and the refinement module. Each model variant omitted one component, and the network was retrained using the same training data for each configuration to isolate the effects of these components. Our baseline model, U2-LFOR (ours), delivered the best results, outperforming the other configurations across all LF types, except for the 4-Syn Sparse case, as highlighted in red in the table.
In the 4-Syn Sparse scenario, U2-LFOR (ours) achieved the second-best results, just behind the ‘w/o U2 Stage 2’ configuration, with a PSNR of 27.33. For other configurations, such as 9-Syn Sparse, CD, and Dense (Single Occ, Double Occ), U2-LFOR (ours) outperformed the other variants, achieving the best results across all metrics. When the ResASPP module was omitted, the model’s performance dropped across all light-field types, as evidenced by the lower PSNR and SSIM scores. For example, in the 4-Syn Sparse scenario, the PSNR dropped from 27.22 to 27.01, and the SSIM dropped from 0.870 to 0.858, showing its crucial role in the model’s effectiveness. The removal of U2 Stage 1 and Stage 2 resulted in notable performance degradation. For example, in the 9-Syn Sparse scenario, excluding U2 Stage 1 reduced the PSNR from 28.22 (with all components) to 26.51 and the SSIM from 0.879 to 0.854. Similarly, while excluding U2 Stage 2 yielded a PSNR increase to 27.33 in the 4-Syn scenario, it led to declines across all other metrics in all LF types. These results underscore the critical contributions of U2 Stage 1 and Stage 2 to overall performance, particularly in enhancing feature aggregation and multiscale contextual understanding, enabling the model to effectively handle complex occlusions. Finally, the exclusion of the refinement module resulted in a significant reduction in performance, as evident in the Dense category, where the PSNR decreased from 32.83 (with all components) to 30.52, and the SSIM dropped from 0.872 to 0.826. This highlights the key role of the refinement module in enhancing the texture and detail in the final output, particularly in reconstructing occluded regions under complex occlusion scenarios. As illustrated in Figure 9, the visual results further corroborate the quantitative findings, showing clear improvements when all components are included in the architecture.

6. Conclusions and Future Work

This paper introduces U2-LFOR, a compact model for occlusion removal in LF images. By using ResASPP for multiscale feature extraction and expanding the receptive field, our model restores fine details in occluded regions while maintaining a balance between computational efficiency and reconstruction quality. The two-stage U2-Net structure refines hierarchical learning, while the Cascaded Residual Refinement Module enhances detail recovery. With just 11.06 M parameters and a 7.86 ms inference time, U2-LFOR achieved competitive performance, with a PSNR of 29.27 dB and SSIM of 0.875, demonstrating its scalability for real-time applications. Despite the strengths of our approach, many challenges remain that warrant further exploration. Our method currently faces limitations in removing more complex and larger occlusions, particularly in dynamic or cluttered scenes. Future work will explore advanced, adaptive feature fusion mechanisms, such as attention-based networks, to enhance the model’s ability to recover fine details in challenging scenarios. Additionally, as LF datasets become denser, we plan to explore memory-efficient structures to ensure scalability without sacrificing performance. Finally, we aim to investigate self-supervised or few-shot learning techniques to reduce reliance on extensive labeled datasets, thereby improving the model’s generalization to unseen datasets.

Author Contributions

Conceptualization, M.F.S. and M.S.K.; Methodology, M.F.S., M.S.K., and M.A.; Software, M.F.S., M.M., and M.A.; Validation, M.F.S. and H.-S.K.; Formal Analysis, M.F.S. and M.M.; Investigation, H.-S.K.; Resources, H.-S.K.; Data Curation, M.F.S.; Writing—Original Draft Preparation, M.F.S.; Writing—Review and Editing, M.F.S. and H.-S.K.; Visualization, H.-S.K.; Supervision, H.-S.K.; Project Administration, H.-S.K.; Funding Acquisition, H.-S.K. All authors have read and agreed to the published version of the manuscript.

Funding

This work was partially supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education under Grant 2020R1I1A3A04037680, and partly by the Innovative Human Resource Development for Local Intellectualization program through the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea Government [Ministry of Science and ICT (MSIT)] (IITP-2025-RS-2020-II201462, 50%).

Data Availability Statement

The datasets used in this paper are public datasets.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Samarakoon, T.; Abeywardena, K.; Edussooriya, C.U. Arbitrary Volumetric Refocusing of Dense and Sparse Light Fields. arXiv 2025, arXiv:2502.19238. [Google Scholar]
  2. Jiang, Y.; Li, X.; Fu, K.; Zhao, Q. Transformer-based light field salient object detection and its application to autofocus. IEEE Trans. Image Process. 2024, 33, 6647–6659. [Google Scholar] [CrossRef]
  3. Yang, T.; Zhang, Y.; Tong, X.; Zhang, X.; Yu, R. A new hybrid synthetic aperture imaging model for tracking and seeing people through occlusion. IEEE Trans. Circuits Syst. Video Technol. 2013, 23, 1461–1475. [Google Scholar] [CrossRef]
  4. Yang, T.; Zhang, Y.; Yu, J.; Li, J.; Ma, W.; Tong, X.; Yu, R.; Ran, L. All-in-focus synthetic aperture imaging. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part VI 13. Springer: Cham, Switzerland, 2014; pp. 1–15. [Google Scholar]
  5. Mahmoud, M.; Yagoub, B.; Senussi, M.F.; Abdalla, M.; Kasem, M.S.; Kang, H.S. Two-Stage Video Violence Detection Framework Using GMFlow and CBAM-Enhanced ResNet3D. Mathematics 2025, 13, 1226. [Google Scholar] [CrossRef]
  6. Kasem, M.S.; Mahmoud, M.; Yagoub, B.; Senussi, M.F.; Abdalla, M.; Kang, H.S. HTTD: A Hierarchical Transformer for Accurate Table Detection in Document Images. Mathematics 2025, 13, 266. [Google Scholar] [CrossRef]
  7. Abdalla, M.; Kasem, M.S.; Mahmoud, M.; Yagoub, B.; Senussi, M.F.; Abdallah, A.; Hun Kang, S.; Kang, H.S. ReceiptQA: A Question-Answering Dataset for Receipt Understanding. Mathematics 2025, 13, 1760. [Google Scholar] [CrossRef]
  8. Chen, Y.; Xia, R.; Yang, K.; Zou, K. Dual degradation image inpainting method via adaptive feature fusion and U-net network. Appl. Soft Comput. 2025, 174, 113010. [Google Scholar] [CrossRef]
  9. Mahmoud, M.; Kang, H.S. Ganmasker: A two-stage generative adversarial network for high-quality face mask removal. Sensors 2023, 23, 7094. [Google Scholar] [CrossRef]
  10. Senussi, M.F.; Abdalla, M.; Kasem, M.S.; Mahmoud, M.; Yagoub, B.; Kang, H.S. A Comprehensive Review on Light Field Occlusion Removal: Trends, Challenges, and Future Directions. IEEE Access 2025, 13, 42472–42493. [Google Scholar] [CrossRef]
  11. Senussi, M.F.; Abdalla, M.; SalahEldin, M.; Kasem, M.M.; Kang, H.S. Spectral Normalized U-Net for Light Field Occlusion Removal. Int. Conf. Future Inf. Commun. Eng. 2025, 16, 294–297. [Google Scholar]
  12. Li, J.; Hong, J.; Zhang, Y.; Li, X.; Liu, Z.; Liu, Y.; Chu, D. Light-Ray-Based Light Field Cameras and Displays. In Cameras and Display Systems Towards Photorealistic 3D Holography; Springer: Cham, Switzerland, 2023; pp. 27–37. [Google Scholar]
  13. He, R.; Hong, H.; Cheng, Z.; Liu, F. Neural Defocus Light Field Rendering. IEEE Trans. Pattern Anal. Mach. Intell. 2025, 47, 8268–8279. [Google Scholar] [CrossRef]
  14. Lee, J.Y.; Hur, J.; Choi, J.; Park, R.H.; Kim, J. Multi-scale foreground-background separation for light field depth estimation with deep convolutional networks. Pattern Recognit. Lett. 2023, 171, 138–147. [Google Scholar] [CrossRef]
  15. Yan, W.; Zhang, X.; Chen, H. Occlusion-aware unsupervised light field depth estimation based on multi-scale GANs. IEEE Trans. Circuits Syst. Video Technol. 2024, 34, 6318–6333. [Google Scholar] [CrossRef]
  16. Lu, Y.; Wang, S.; Wang, Z.; Xia, P.; Zhou, T. Lfmamba: Light field image super-resolution with state space model. arXiv 2024, arXiv:2406.12463. [Google Scholar] [CrossRef]
  17. Chao, W.; Zhao, J.; Duan, F.; Wang, G. Lfsrdiff: Light field image super-resolution via diffusion models. In Proceedings of the ICASSP 2025—2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Hyderabad, India, 6–11 April 2025; IEEE: New York, NY, USA, 2025; pp. 1–5. [Google Scholar]
  18. Gao, R.; Liu, Y.; Xiao, Z.; Xiong, Z. Diffusion-based light field synthesis. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 1–19. [Google Scholar]
  19. Han, K. Light Field Reconstruction from Multi-View Images. Ph.D. Thesis, James Cook University, Douglas, Australia, 2022. [Google Scholar]
  20. Chang, X. Method and Apparatus for Removing Occlusions from Light Field Images. 2021. Available online: https://patents.google.com/patent/US20210042898A1/en (accessed on 15 January 2025).
  21. Liu, Z.S.; Li, D.H.; Deng, H. Integral Imaging-Based Light Field Display System with Optimum Voxel Space. IEEE Photonics J. 2024, 16, 5200207. [Google Scholar] [CrossRef]
  22. Wang, Y.; Wu, T.; Yang, J.; Wang, L.; An, W.; Guo, Y. DeOccNet: Learning to see through foreground occlusions in light fields. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Snowmass Village, CO, USA, 1–5 March 2020; pp. 118–127. [Google Scholar]
  23. Li, Y.; Yang, W.; Xu, Z.; Chen, Z.; Shi, Z.; Zhang, Y.; Huang, L. Mask4D: 4D convolution network for light field occlusion removal. In Proceedings of the ICASSP 2021—2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; IEEE: New York, NY, USA, 2021; pp. 2480–2484. [Google Scholar]
  24. Pei, Z.; Jin, M.; Zhang, Y.; Ma, M.; Yang, Y.H. All-in-focus synthetic aperture imaging using generative adversarial network-based semantic inpainting. Pattern Recognit. 2021, 111, 107669. [Google Scholar] [CrossRef]
  25. Zhang, S.; Shen, Z.; Lin, Y. Removing Foreground Occlusions in Light Field using Micro-lens Dynamic Filter. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI), Virtual, 19–27 August 2021; pp. 1302–1308. [Google Scholar]
  26. Hur, J.; Lee, J.Y.; Choi, J.; Kim, J. I see-through you: A framework for removing foreground occlusion in both sparse and dense light field images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 2–7 January 2023; pp. 229–238. [Google Scholar]
  27. Wang, X.; Liu, J.; Chen, S.; Wei, G. Effective light field de-occlusion network based on Swin transformer. IEEE Trans. Circuits Syst. Video Technol. 2022, 33, 2590–2599. [Google Scholar] [CrossRef]
  28. Zhang, Q.; Fu, H.; Cao, J.; Wei, W.; Fan, B.; Meng, C.; Fang, Y.; Yan, T. SwinSccNet: Swin-Unet encoder–decoder structured-light field occlusion removal network. Opt. Eng. 2024, 63, 104102. [Google Scholar] [CrossRef]
  29. Senussi, M.F.; Kang, H.S. Occlusion Removal in Light-Field Images Using CSPDarknet53 and Bidirectional Feature Pyramid Network: A Multi-Scale Fusion-Based Approach. Appl. Sci. 2024, 14, 9332. [Google Scholar] [CrossRef]
  30. Bertalmio, M.; Sapiro, G.; Caselles, V.; Ballester, C. Image Inpainting. In Proceedings of the 27th Annual Conference on Computer Graphics and Interactive Techniques; ACM Press: New York, NY, USA, 2000. [Google Scholar]
  31. Ballester, C.; Bertalmio, M.; Caselles, V.; Sapiro, G.; Verdera, J. Filling-in by joint interpolation of vector fields and gray levels. IEEE Trans. Image Process. 2001, 10, 1200–1211. [Google Scholar] [CrossRef]
  32. Barnes, C.; Shechtman, E.; Finkelstein, A.; Goldman, D.B. PatchMatch: A randomized correspondence algorithm for structural image editing. ACM Trans. Graph. 2009, 28, 24. [Google Scholar] [CrossRef]
  33. Wexler, Y.; Shechtman, E.; Irani, M. Space-time completion of video. IEEE Trans. Pattern Anal. Mach. Intell. 2007, 29, 463–476. [Google Scholar] [CrossRef]
  34. Zhang, F.L.; Wang, J.; Shechtman, E.; Zhou, Z.Y.; Shi, J.X.; Hu, S.M. Plenopatch: Patch-based plenoptic image manipulation. IEEE Trans. Vis. Comput. Graph. 2016, 23, 1561–1573. [Google Scholar] [CrossRef]
  35. Le Pendu, M.; Jiang, X.; Guillemot, C. Light field inpainting propagation via low rank matrix completion. IEEE Trans. Image Process. 2018, 27, 1981–1993. [Google Scholar] [CrossRef] [PubMed]
  36. Liu, G.; Reda, F.A.; Shih, K.J.; Wang, T.C.; Tao, A.; Catanzaro, B. Image inpainting for irregular holes using partial convolutions. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 85–100. [Google Scholar]
  37. Li, J.; Wang, N.; Zhang, L.; Du, B.; Tao, D. Recurrent feature reasoning for image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 7760–7768. [Google Scholar]
  38. Zhu, M.; He, D.; Li, X.; Li, C.; Li, F.; Liu, X.; Ding, E.; Zhang, Z. Image inpainting by end-to-end cascaded refinement with mask awareness. IEEE Trans. Image Process. 2021, 30, 4855–4866. [Google Scholar] [CrossRef]
  39. Xie, C.; Liu, S.; Li, C.; Cheng, M.M.; Zuo, W.; Liu, X.; Wen, S.; Ding, E. Image inpainting with learnable bidirectional attention maps. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 8858–8867. [Google Scholar]
  40. Nazeri, K. EdgeConnect: Generative Image Inpainting with Adversarial Edge Learning. arXiv 2019, arXiv:1901.00212. [Google Scholar] [CrossRef]
  41. Song, Y.; Yang, C.; Shen, Y.; Wang, P.; Huang, Q.; Kuo, C.C.J. Spg-net: Segmentation prediction and guidance network for image inpainting. arXiv 2018, arXiv:1805.03356. [Google Scholar] [CrossRef]
  42. Ren, Y.; Yu, X.; Zhang, R.; Li, T.H.; Liu, S.; Li, G. Structureflow: Image inpainting via structure-aware appearance flow. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 181–190. [Google Scholar]
  43. Liao, L.; Xiao, J.; Wang, Z.; Lin, C.W.; Satoh, S. Guidance and evaluation: Semantic-aware image inpainting for mixed scenes. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXVII 16. Springer: Cham, Switzerland, 2020; pp. 683–700. [Google Scholar]
  44. Yang, C.; Lu, X.; Lin, Z.; Shechtman, E.; Wang, O.; Li, H. High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 6721–6729. [Google Scholar]
  45. Zeng, Y.; Lin, Z.; Yang, J.; Zhang, J.; Shechtman, E.; Lu, H. High-resolution image inpainting with iterative confidence feedback and guided upsampling. In Proceedings of the Computer Vision—ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XIX 16. Springer: Cham, Switzerland, 2020; pp. 1–17. [Google Scholar]
  46. Yi, Z.; Tang, Q.; Azizi, S.; Jang, D.; Xu, Z. Contextual residual aggregation for ultra high-resolution image inpainting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 7508–7517. [Google Scholar]
  47. Vaish, V.; Garg, G.; Talvala, E.; Antunez, E.; Wilburn, B.; Horowitz, M.; Levoy, M. Synthetic aperture focusing using a shear-warp factorization of the viewing transform. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05)-Workshops, San Diego, CA, USA, 21–23 September 2005; IEEE: New York, NY, USA, 2005; p. 129. [Google Scholar]
  48. Vaish, V.; Levoy, M.; Szeliski, R.; Zitnick, C.L.; Kang, S.B. Reconstructing occluded surfaces using synthetic apertures: Stereo, focus and robust measures. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), New York, NY, USA, 17–22 June 2006; IEEE: New York, NY, USA, 2006; Volume 2, pp. 2331–2338. [Google Scholar]
  49. Vaish, V.; Wilburn, B.; Joshi, N.; Levoy, M. Using plane + parallax for calibrating dense camera arrays. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004, Washington, DC, USA, 27 June–2 July 2004; IEEE: New York, NY, USA, 2004; Volume 1, p. I. [Google Scholar]
  50. Pei, Z.; Zhang, Y.; Chen, X.; Yang, Y.H. Synthetic aperture imaging using pixel labeling via energy minimization. Pattern Recognit. 2013, 46, 174–187. [Google Scholar] [CrossRef]
  51. Pei, Z.; Chen, X.; Yang, Y.H. All-in-focus synthetic aperture imaging using image matting. IEEE Trans. Circuits Syst. Video Technol. 2016, 28, 288–301. [Google Scholar] [CrossRef]
  52. Xiao, Z.; Si, L.; Zhou, G. Seeing beyond foreground occlusion: A joint framework for SAP-based scene depth and appearance reconstruction. IEEE J. Sel. Top. Signal Process. 2017, 11, 979–991. [Google Scholar] [CrossRef]
  53. Zhang, S.; Chen, Y.; An, P.; Huang, X.; Yang, C. Light field occlusion removal network via foreground location and background recovery. Signal Process. Image Commun. 2022, 109, 116853. [Google Scholar] [CrossRef]
  54. Song, C.; Li, W.; Pi, X.; Xiong, C.; Guo, X. A dual-pathways fusion network for seeing background objects in light field. In Proceedings of the International Conference on Image, Signal Processing, and Pattern Recognition (ISPP 2022), Guilin, China, 25–27 February 2022; SPIE: Bellingham, WA, USA, 2022; Volume 12247, pp. 339–348. [Google Scholar]
  55. Piao, Y.; Rong, Z.; Xu, S.; Zhang, M.; Lu, H. DUT-LFSaliency: Versatile dataset and light field-to-RGB saliency detection. arXiv 2020, arXiv:2012.15124. [Google Scholar]
  56. Raj, A.S.; Lowney, M.; Shah, R. Light-Field Database Creation and Depth Estimation; Stanford University: Palo Alto, CA, USA, 2016. [Google Scholar]
  57. Rerabek, M.; Ebrahimi, T. New light field image dataset. In Proceedings of the 8th International Conference on Quality of Multimedia Experience (QoMEX), Lisbon, Portugal, 6–8 June 2016. [Google Scholar]
  58. Bok, Y.; Jeon, H.G.; Kweon, I.S. Geometric calibration of micro-lens-based light field cameras using line features. IEEE Trans. Pattern Anal. Mach. Intell. 2016, 39, 287–300. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Differentiating U2-LFOR from existing U-shaped structures: DeOccNet (single-stage encoder–decoder), Mask4D (dual decoders for occlusion and mask processing), and our two-stage U2 network with intermediate feature fusion for refined reconstruction.
Figure 2. Architecture of the proposed U2-LFOR for occlusion removal in LF images. It comprises three main components: (1) a feature extractor with convolution and ResASPP for receptive-field expansion and contextual feature representation; (2) a two-stage U2-Net for progressive refinement via hierarchical feature fusion in Stage 1 and Stage 2; and (3) a refinement module with ResBlocks and a convolution layer to produce the final occlusion-free output I_out. Key structural elements include downsampling/upsampling blocks, skip connections, and supervision for enhanced reconstruction.
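For clarity, the data flow summarized in Figure 2 can be expressed as a minimal PyTorch-style sketch. The code below only illustrates the three-component pipeline (feature extractor, two U2 stages with intermediate feature fusion, and refinement); the placeholder sub-modules, channel width, input format, and activations are illustrative assumptions and do not reproduce the authors' implementation.

```python
import torch
import torch.nn as nn


class U2LFORSketch(nn.Module):
    """Illustrative three-part pipeline: feature extractor -> two U2 stages -> refinement.
    Every sub-module here is a stand-in (assumption) used only to show the data flow."""

    def __init__(self, in_ch=3, feat_ch=64):
        super().__init__()
        # (1) Feature extractor: convolution + ResASPP in the paper; a plain conv here.
        self.extractor = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
        )
        # (2) Two-stage U2-Net: each stage is a full encoder-decoder in the paper;
        # single convolutions stand in for them here.
        self.stage1 = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)
        self.stage2 = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)
        # (3) Refinement: two cascaded ResBlocks + a convolution in the paper.
        self.refine = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(feat_ch, in_ch, 3, padding=1),
        )

    def forward(self, x):
        f = self.extractor(x)
        s1 = self.stage1(f)
        s2 = self.stage2(s1 + f)   # intermediate feature fusion via a skip connection
        return self.refine(s2)     # occlusion-free output I_out


# Smoke test on a 256 x 192 input (sizes are illustrative):
# out = U2LFORSketch()(torch.randn(1, 3, 192, 256))
```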
Figure 3. Detailed architecture of the Residual Atrous Spatial Pyramid Pooling (ResASPP) module.
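The exact layout of ResASPP is defined by Figure 3; as a rough guide, a common formulation combines parallel atrous (dilated) convolutions with a shared residual connection, as sketched below. The dilation rates, channel width, and activation are assumptions rather than the values used in the figure.

```python
import torch
import torch.nn as nn


class ResASPP(nn.Module):
    """Residual Atrous Spatial Pyramid Pooling (sketch).
    Parallel dilated 3x3 convolutions capture multiscale context, a 1x1 convolution
    fuses the concatenated branches, and the input is added back as a residual."""

    def __init__(self, channels=64, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=d, dilation=d),
                nn.LeakyReLU(0.1, inplace=True),
            )
            for d in dilations
        ])
        self.fuse = nn.Conv2d(channels * len(dilations), channels, 1)

    def forward(self, x):
        multiscale = torch.cat([branch(x) for branch in self.branches], dim=1)
        return x + self.fuse(multiscale)  # residual aggregation keeps the input resolution
```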
Figure 4. Structure of the ResBlock in the refinement module.
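A plain residual block consistent with the description of the refinement module (two cascaded ResBlocks followed by a convolution) can be sketched as follows; the kernel size and activation are assumptions.

```python
import torch.nn as nn


class ResBlock(nn.Module):
    """Residual block (sketch): conv -> activation -> conv, plus an identity skip."""

    def __init__(self, channels=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)


# Assumed arrangement of the refinement head described in the caption:
# refine = nn.Sequential(ResBlock(64), ResBlock(64), nn.Conv2d(64, 3, 3, padding=1))
```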
Figure 5. Qualitative results on the sparse LF dataset, comparing our method with existing approaches (RFR [37], LBAM [39], DeOccNet [22], Zhang et al. [25], ISTY [26], and Senussi et al. [29]); our method achieves sharper and more accurate occlusion removal. Restored areas are marked with red and yellow boxes.
Figure 6. Qualitative comparisons on the dense LF dataset, analyzing the key performance differences and highlighting the strengths of each method (RFR [37], LBAM [39], DeOccNet [22], Zhang et al. [25], ISTY [26], Senussi et al. [29], and Ours). Red and yellow boxes indicate restored regions.
Figure 7. Visual results of the performance evaluation on real-world scenes, comparing DeOccNet [22], Senussi et al. [29], and our method. Red boxes mark occluded zones, and the enlarged crops show the restoration quality in these regions.
Figure 8. Accuracy versus computational cost for LF occlusion removal, comparing our method with DeOccNet [22], Zhang et al. [25], ISTY [26], and Senussi et al. [29]. Our method balances high accuracy with reduced computational cost.
Figure 9. Visual results of the ablation study, illustrating the contribution of each model component to the overall performance.
Table 1. A summary of the light-field datasets used for training and testing our network, categorized by scene density, dataset type, and the number of scenes.
| LF Density | Dataset | Category | # of Scenes |
|---|---|---|---|
| Sparse LF | DeOccNet Train [22] | Synthetic | 60 |
| Sparse LF | Stanford CD [49] | Real | 30 |
| Sparse LF | 4-synLFs [22] | Synthetic | 4 |
| Sparse LF | 9-synLFs [25] | Synthetic | 9 |
| Dense LF | DUTLF-V2 [55] | Real | 4204 |
| Dense LF | Stanford Lytro [56] | Synthetic | 71 |
| Dense LF | EPFL-10 [57] | Synthetic | 10 |
Table 2. Quantitative comparison on the sparse and dense LF datasets using PSNR and SSIM. Red indicates the best result, while blue indicates the second-best result. ↑ means higher values are better.
PSNR ↑

| Method | 4-Syn (Sparse, Syn) | 9-Syn (Sparse, Syn) | CD (Sparse, Real) | Single Occ (Dense, Syn) | Double Occ (Dense, Syn) |
|---|---|---|---|---|---|
| RFR [37] | 19.89 | 20.69 | 21.13 | 26.28 | 23.25 |
| LBAM [39] | 21.11 | 23.04 | 21.56 | 27.92 | 24.83 |
| DeOccNet [22] | 23.74 | 23.70 | 22.70 | 28.67 | 25.85 |
| Zhang et al. [25] | 14.46 | 22.00 | 20.19 | 23.15 | 18.01 |
| ISTY [26] | 26.42 | 27.04 | 25.17 | 32.44 | 28.31 |
| Senussi et al. [29] | 27.32 | 27.48 | 25.68 | 30.70 | 29.34 |
| U2-LFOR (Ours) | 27.22 | 28.22 | 26.29 | 32.83 | 31.77 |

SSIM ↑

| Method | 4-Syn (Sparse, Syn) | 9-Syn (Sparse, Syn) | CD (Sparse, Real) | Single Occ (Dense, Syn) | Double Occ (Dense, Syn) |
|---|---|---|---|---|---|
| RFR [37] | 0.668 | 0.672 | 0.646 | 0.867 | 0.801 |
| LBAM [39] | 0.677 | 0.725 | 0.803 | 0.899 | 0.827 |
| DeOccNet [22] | 0.701 | 0.715 | 0.741 | 0.914 | 0.847 |
| Zhang et al. [25] | 0.683 | 0.758 | 0.832 | 0.900 | 0.823 |
| ISTY [26] | 0.836 | 0.849 | 0.870 | 0.947 | 0.902 |
| Senussi et al. [29] | 0.862 | 0.853 | 0.886 | 0.838 | 0.850 |
| U2-LFOR (Ours) | 0.870 | 0.879 | 0.893 | 0.861 | 0.872 |
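For reference, PSNR and SSIM values of the kind reported in Table 2 can be computed with scikit-image as sketched below. The 8-bit data range and per-image averaging are assumptions about the evaluation protocol, not the authors' exact script.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def evaluate_pair(restored: np.ndarray, ground_truth: np.ndarray) -> tuple[float, float]:
    """PSNR (dB) and SSIM for one restored image against its occlusion-free ground truth.
    Both inputs are H x W x 3 uint8 arrays."""
    psnr = peak_signal_noise_ratio(ground_truth, restored, data_range=255)
    ssim = structural_similarity(ground_truth, restored, data_range=255, channel_axis=-1)
    return psnr, ssim


def evaluate_dataset(pairs) -> tuple[float, float]:
    """Average PSNR/SSIM over an iterable of (restored, ground_truth) pairs."""
    scores = [evaluate_pair(r, gt) for r, gt in pairs]
    psnrs, ssims = zip(*scores)
    return float(np.mean(psnrs)), float(np.mean(ssims))
```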
Table 3. Overview of model parameters and inference times for 256 × 192 LF images on an NVIDIA RTX 3090 GPU. Best performance in red, second-best in blue, with ↓ indicating lower is better.
| Model | # of Network Parameters ↓ | Inference Time ↓ |
|---|---|---|
| LBAM [39] | 69.3 M | 12 ms |
| DeOccNet [22] | 39.0 M | 10 ms |
| Zhang et al. [25] | 2.7 M | 3050 ms |
| ISTY [26] | 80.6 M | 24 ms |
| Senussi et al. [29] | 52.59 M | 138.8 ms |
| U2-LFOR (Ours) | 11.06 M | 7.86 ms |
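Parameter counts and GPU latencies such as those in Table 3 are typically obtained as sketched below; the warm-up and repetition counts, input size, and use of CUDA events are assumptions about the benchmarking setup rather than the authors' exact procedure.

```python
import torch


def count_parameters(model: torch.nn.Module) -> int:
    """Total number of trainable parameters (reported in millions in Table 3)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


@torch.no_grad()
def gpu_inference_time_ms(model, input_shape=(1, 3, 192, 256), warmup=10, runs=100):
    """Average forward-pass latency in milliseconds on the current CUDA device."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):          # warm-up iterations are excluded from timing
        model(x)
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(runs):
        model(x)
    end.record()
    torch.cuda.synchronize()         # wait for all queued kernels before reading the timer
    return start.elapsed_time(end) / runs
```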
Table 4. Ablation study: impact of network components, with red highlighting the best result.
PSNR

| Variant | 4-Syn (Sparse, Syn) | 9-Syn (Sparse, Syn) | CD (Sparse, Real) | Single Occ (Dense, Syn) | Double Occ (Dense, Syn) |
|---|---|---|---|---|---|
| w/o ResASPP | 27.01 | 27.35 | 25.19 | 32.35 | 31.28 |
| w/o U2 Stage 1 | 25.87 | 26.51 | 24.85 | 31.00 | 29.54 |
| w/o U2 Stage 2 | 27.33 | 28.02 | 25.74 | 31.19 | 29.77 |
| w/o Refinement | 27.19 | 27.78 | 25.92 | 30.52 | 28.88 |
| U2-LFOR (Ours) | 27.22 | 28.22 | 26.29 | 32.83 | 31.77 |

SSIM

| Variant | 4-Syn (Sparse, Syn) | 9-Syn (Sparse, Syn) | CD (Sparse, Real) | Single Occ (Dense, Syn) | Double Occ (Dense, Syn) |
|---|---|---|---|---|---|
| w/o ResASPP | 0.858 | 0.863 | 0.883 | 0.854 | 0.870 |
| w/o U2 Stage 1 | 0.847 | 0.854 | 0.872 | 0.833 | 0.836 |
| w/o U2 Stage 2 | 0.868 | 0.886 | 0.887 | 0.837 | 0.841 |
| w/o Refinement | 0.867 | 0.869 | 0.891 | 0.821 | 0.826 |
| U2-LFOR (Ours) | 0.870 | 0.879 | 0.893 | 0.861 | 0.872 |
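Ablation variants such as "w/o ResASPP" or "w/o Refinement" follow the common practice of replacing the corresponding sub-module with an identity mapping; the helper below only illustrates this pattern, and the attribute names are hypothetical rather than taken from the authors' code.

```python
import copy

import torch.nn as nn


def ablate(model: nn.Module, component: str) -> nn.Module:
    """Return a copy of `model` with the named sub-module replaced by nn.Identity().
    Valid only when that sub-module preserves the tensor shape; `component` is a
    hypothetical attribute name such as 'resaspp' or 'stage2'."""
    variant = copy.deepcopy(model)
    setattr(variant, component, nn.Identity())
    return variant


# Example with the sketch model defined earlier (attribute name is hypothetical):
# model_wo_stage2 = ablate(U2LFORSketch(), "stage2")
```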
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
