1. Introduction
Light field imaging enables 3D reconstruction by capturing both spatial and angular information [1,2]. Due to this property, light field imaging has been applied in various fields [3,4,5,6,7,8]. However, to extend its applications further, the angular resolution of light field imaging needs to be improved. One of the primary challenges in this task is processing a large number of views, which substantially increases the number of network parameters. In addition, parallax between views complicates the reconstruction process, and factors such as light reflections and occlusions make restoration more difficult. Consequently, enhancing light field resolution remains a highly challenging problem.
Conventional view synthesis methods [9,10,11,12,13] generate novel views using classical approaches. Geometry-based techniques [9,11] use estimated depth or disparity maps to guide pixel rearrangement. They perform well in simple scenes but have difficulty handling occlusions, complex lighting, or strong reflections. To address these limitations, Zhou et al. [14] proposed learning an appearance mapping from input images via appearance flow, thus avoiding explicit depth estimation. While this method can yield high-quality reconstructions under favorable conditions, it is still challenged by occlusions and regions not visible in the input views. Moreover, its reliance on flow-based warping limits flexibility in handling novel pixels and complex scenes.
Recently, deep learning-based methods [15,16,17,18,19,20,21,22,23,24,25,26] have been introduced for light field reconstruction. These approaches are categorized into depth-based and non-depth-based methods. Depth-based methods generate novel views by predicting a depth map from the input images. However, inaccurate depth estimation causes problems that reduce the quality of the novel views. Non-depth-based methods can avoid these issues since they do not rely on depth maps. However, due to the inherent complexity of light fields, the quality of the reconstructed images remains limited. To overcome this limitation, non-depth-based approaches focus on efficiently utilizing the intrinsic information across light field views without relying on explicit depth estimation.
In our previous work, we proposed Prex-Net [27], which progressively fused feature maps across views using a modified back-projection structure [28]. We extend our previous approach and propose Prex-NetII, an improved algorithm for high-quality light field reconstruction. Unlike our previous network [27], Prex-NetII employs pixel shuffle and spatial attention for initial feature extraction. The pixel shuffle approach efficiently aggregates multi-view information with fewer parameters than 3D convolution.
In addition, the spatial attention module improves the network’s capacity to capture spatial correlations across sub-aperture images. Furthermore, channel attention is incorporated into the up- and down-projection modules within the refinement network to better exploit inter-view dependencies and underlying 3D structural information. To further stabilize training and improve reconstruction performance, long skip connections are applied before and after the refinement network. The main contributions of this work are summarized as follows:
Efficient initial feature extraction using pixel shuffle, reducing the number of parameters compared to our previous work.
Enhanced training stability achieved by adopting long skip connections around the refinement network.
Improved cross-view representation through attention mechanisms that better capture structural dependencies across views.
2. Related Works
A common strategy for synthesizing novel views is to predict a depth map and warp the input images. The accuracy of the estimated depth map directly determines the quality of the resulting views. Traditional depth-based light field reconstruction methods have explored various approaches. Wanner and Goldluecke [9] introduced a variational framework for disparity estimation and angular super-resolution, deriving disparity maps from local slope estimations to generate warp maps. However, this approach is limited to local regions and shows reduced performance in areas with complex structures or strong specular reflections. Mitra and Veeraraghavan [10] proposed a patch-based method using a Gaussian mixture model to model disparity patterns, integrating patches to reconstruct high-resolution light field images. This method struggles when the patch sizes are smaller than the maximum disparity in the light field.
With the introduction of deep learning, CNN-based methods have been applied to light field reconstruction. Flynn et al. [15] demonstrated a method for synthesizing novel views from wide-baseline stereo images, providing guidelines for disparity-based CNN synthesis. However, generating views using only two input images limits the total number of novel views. Kalantari et al. [16] proposed a framework that estimates depth from densely sampled light field images, warps the inputs accordingly, and refines intermediate synthesis images through color estimation. Although effective, this method requires cropping boundary regions due to missing data in warped inputs and is computationally expensive when producing multiple views. Jin et al. (LFASR-geometry) [17] addressed speed limitations by predicting multiple depth maps simultaneously, but remaining inaccuracies in depth estimation still cause distortions. Jin et al. (LFASR-FS-GAF) [18] further enhanced intermediate synthesis by incorporating attention maps and plane sweep volumes (PSVs), improving synthesis quality, but limitations remain in improving the accuracy of depth maps. Chen et al. [19] proposed a hybrid method that combines depth-based and non-depth-based synthesis through region-wise disparity guidance. However, it still suffers from detail loss and artifacts due to disparity estimation errors.
Several methods reconstruct light fields without estimating depth explicitly, relying on structural cues or frequency-domain information instead. Shi et al. [12] restored the sparsity of light field images in the Fourier domain via nonlinear gradient descent, although their approach required specific sampling along image borders and diagonals. Vagharshakyan et al. [13] applied adaptive discrete shearlet transforms and EPI-based inpainting, but sequential synthesis along axes slowed processing and limited occlusion handling.
As in depth-based methods, CNN-based models have been proposed to reconstruct light fields without explicit depth estimation. Yoon et al. [20] proposed a network to synthesize an intermediate view from two adjacent views, but the locations of the synthesized views were restricted. Gul and Gunturk [21] fed lenslet stacks into a network to achieve angular super-resolution more efficiently, yet the simple architecture caused quality degradation. Wu et al. [22] reconstructed light fields via EPIs using a blur-restoration-deblur framework, but the reconstruction failed for scenes with disparities beyond a certain range. To address this, Wu et al. [23] proposed a shear-aware light field reconstruction network that integrates learnable shearing, downscaling, and prefiltering into the rendering process. It reduces aliasing in epipolar-plane images by processing and fusing multiple sheared inputs. Wang et al. (DistgASR) [24] combined MacPI structures and multiple filters to extract spatial and angular features simultaneously, achieving high quality with minor inference delays. Fang et al. (GLGNet) [25] proposed an EPI-based framework that incorporates a bilateral upsampling module to perform angular super-resolution at arbitrary interpolation rates. However, the method tends to lose fine details in scenes with complex textures. Salem et al. (LFR-DFE) [26] integrated dual feature extraction and macro-pixel upsampling, producing high-quality reconstructions under sparse inputs, but the approach struggled with complex occlusion boundaries.
In summary, inaccurate depth estimation causes distortions in the reconstructed views of depth-based methods. Although non-depth-based approaches can avoid this issue, they still suffer from image artifacts and loss of fine details in regions with complex occlusions or reflections. This is because existing models have difficulty capturing the spatial–angular relationships required for accurate reconstruction. To solve these problems, we propose an attention-based back-projection network that enhances feature interaction across views and improves reconstruction quality.
3. Proposed Network
The proposed network consists of an initial feature extraction network and a refinement network. The initial feature extraction network integrates spatial and angular information from multiple input views. It generates a fused feature representation of the corner sub-aperture images through pixel shuffle. This allows the use of 2D convolutions to capture spatio-angular dependencies with a low computational cost. A spatial attention module is then applied to focus on informative regions. The initial feature map is then fed into the refinement network for reconstruction. The refinement network adopts a back-projection structure to refine the extracted features. The refinement network consists of multiple up- and down-projection blocks that reduce reconstruction errors and recover fine spatial details while maintaining consistency across views.
As shown in Figure 1, the initial feature extraction network concatenates the four corner sub-aperture images of the 7 × 7 light field to form a four-channel feature map of size H × W, where H and W are the height and width of the light field images, respectively. This feature map is then rearranged by pixel shuffle with an upscaling factor r, which is set to 2. Consequently, the input images of the four corners are rearranged into a single-channel feature map with double the spatial resolution. This allows the network to efficiently extract spatio-angular features from multiple images using only 2D convolution.
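As a concrete illustration, the sketch below shows this rearrangement, assuming PyTorch as the framework; the function and tensor names are illustrative and not taken from the paper.

```python
import torch
import torch.nn as nn

def rearrange_corners(corners):
    # corners: list of the four corner sub-aperture images, each of shape (N, 1, H, W)
    stacked = torch.cat(corners, dim=1)   # (N, 4, H, W): one channel per corner view
    return nn.PixelShuffle(2)(stacked)    # (N, 1, 2H, 2W): single channel, doubled resolution

# Example with random data: four 1 x 64 x 64 corner views -> one 1 x 128 x 128 map.
views = [torch.rand(1, 1, 64, 64) for _ in range(4)]
print(rearrange_corners(views).shape)     # torch.Size([1, 1, 128, 128])
```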
The pixel-shuffled image passes through 2D convolution layers with 256, 128, 64, 49, and 49 filters, each using a 3 × 3 kernel. The spatial attention module is inserted after the first convolution layer to improve spatial features. As a result, the initial feature map is obtained. The stride and padding of all convolution layers are set to 1, except for the last layer. To match the spatial resolution of the initial feature map to that of the input light field image, the stride of the last convolution layer is set to 2 and the padding is set to 1. Each convolution layer is followed by LeakyReLU with a negative slope of 0.01.
In the spatial attention module, the input feature map is processed through both average and max pooling to produce two spatial maps. These maps are concatenated and passed to a 2D convolution layer with a 7 × 7 kernel, stride 1, and padding 1. This layer generates the spatial attention map. The 7 × 7 kernel size is adopted to provide a wider receptive field for capturing long-range spatial correlations. The attention map is normalized using a sigmoid activation and multiplied with the original feature map to obtain spatially refined features. The structure of the spatial attention module is shown in Figure 2.
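A minimal sketch of the spatial attention module and the surrounding feature extraction stack is given below, assuming PyTorch. The class and variable names are illustrative, and same-size padding (3 for the 7 × 7 kernel) is assumed so that the attention map can be multiplied element-wise with the input feature map.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    # Channel-wise average and max pooling -> 7 x 7 convolution -> sigmoid gate.
    def __init__(self, kernel_size=7):
        super().__init__()
        # padding = kernel_size // 2 keeps the spatial size (assumed here).
        self.conv = nn.Conv2d(2, 1, kernel_size, stride=1, padding=kernel_size // 2)

    def forward(self, x):
        avg_map = torch.mean(x, dim=1, keepdim=True)      # (N, 1, H, W)
        max_map, _ = torch.max(x, dim=1, keepdim=True)    # (N, 1, H, W)
        attn = torch.sigmoid(self.conv(torch.cat([avg_map, max_map], dim=1)))
        return x * attn                                   # spatially refined features

class InitialFeatureExtraction(nn.Module):
    # Conv stack of the initial feature extraction network: 256-128-64-49-49 filters,
    # spatial attention after the first layer, stride-2 last layer to return to H x W.
    def __init__(self):
        super().__init__()
        act = nn.LeakyReLU(0.01)
        self.head = nn.Sequential(nn.Conv2d(1, 256, 3, stride=1, padding=1), act)
        self.sa = SpatialAttention(kernel_size=7)
        self.body = nn.Sequential(
            nn.Conv2d(256, 128, 3, stride=1, padding=1), act,
            nn.Conv2d(128, 64, 3, stride=1, padding=1), act,
            nn.Conv2d(64, 49, 3, stride=1, padding=1), act,
            nn.Conv2d(49, 49, 3, stride=2, padding=1), act,  # stride 2: (2H, 2W) -> (H, W)
        )

    def forward(self, x):                        # x: pixel-shuffled corner views, (N, 1, 2H, 2W)
        return self.body(self.sa(self.head(x)))  # initial feature map: (N, 49, H, W)
```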
To capture the inherent correlations in the initial feature map, we introduce a refinement network based on a back-projection structure [28]. As shown in Figure 1, the up- and down-projection blocks of the refinement network are densely connected to each other. The first up-projection block takes the initial feature map as input. Each subsequent up-projection block concatenates the outputs of the down-projection blocks from the previous stages and uses them as input. Similarly, the first down-projection block takes the output of the first up-projection block as input, and each following down-projection block concatenates the outputs of the up-projection blocks from the previous stages and uses them as input.
The input of the last convolution layer is the concatenated outputs of the down-projection blocks. The last convolution layer has a filter size of 3 × 3 and is not followed by a ReLU. Finally, the output feature map of the refinement network is added to the initial feature map via a long skip connection to generate the final reconstructed light field images. This long skip connection helps to improve the quality of the reconstructed light field images and enables stable training of the proposed network.
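The dense connection pattern and the long skip connection can be summarized with the sketch below, assuming PyTorch; UpProjectionBlock and DownProjectionBlock refer to the blocks of Sections 3.1 and 3.2 (sketched there), and all names are illustrative rather than taken from the paper.

```python
import torch
import torch.nn as nn

class RefinementNetwork(nn.Module):
    # Densely connected back-projection refinement with a long skip connection.
    # Each block outputs 49 channels; block inputs grow by 49 channels per stage
    # because all previous complementary-block outputs are concatenated.
    def __init__(self, num_stages=12, channels=49):
        super().__init__()
        self.ups = nn.ModuleList(
            [UpProjectionBlock(channels * max(i, 1)) for i in range(num_stages)]
        )
        self.downs = nn.ModuleList(
            [DownProjectionBlock(channels * (i + 1)) for i in range(num_stages)]
        )
        # Last 3 x 3 convolution over all down-projection outputs, no activation.
        self.last = nn.Conv2d(channels * num_stages, channels, 3, padding=1)

    def forward(self, f0):                                      # f0: initial feature map, (N, 49, H, W)
        ups, downs = [], []
        for i in range(len(self.ups)):
            u_in = f0 if i == 0 else torch.cat(downs, dim=1)    # all previous down outputs
            ups.append(self.ups[i](u_in))
            downs.append(self.downs[i](torch.cat(ups, dim=1)))  # all up outputs so far
        out = self.last(torch.cat(downs, dim=1))
        return out + f0                                         # long skip connection with f0
```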
3.1. Up-Projection Block
The up-projection block is shown in Figure 3. The input of the i-th up-projection block is the initial feature map when i = 1; otherwise, it is the concatenation of the outputs of all preceding down-projection blocks. As i increases, the number of channels of the input feature map therefore grows by 49 per block. A 1 × 1 convolution layer is employed to reduce the channel number of the input feature map to 49. The following 3 × 3 convolution layer expands the 49-channel feature map to a 196-channel feature map. The expanded 196-channel feature map is passed through a channel attention module to assign a weight to each channel. This channel expansion enhances feature representation and captures more complex spatial relationships. Subsequently, the feature map is rearranged into a 49-channel feature map using pixel shuffle with an upscaling factor of 2. The pixel-shuffled 49-channel feature map is processed by a 6 × 6 convolution layer with a stride of 2 and padding of 2. This operation produces a 49-channel feature map with spatial resolution matching that of the input light field image. A residual feature map is obtained by subtracting the output of the 1 × 1 convolution layer from this feature map. The 49-channel residual feature map then passes through a 3 × 3 convolution layer for channel expansion, followed by pixel shuffle with an upscaling factor of 2. Each convolution layer is followed by LeakyReLU with a negative slope of 0.2. Finally, the resulting residual feature map is added to the pixel-shuffled 49-channel feature map obtained before the 6 × 6 convolution layer to produce the final output of the up-projection block.
The channel attention module is shown in Figure 4. First, a feature map with 196 channels is input and compressed into a channel-wise descriptor by applying average pooling. Although max pooling can also be used in this step, we used only average pooling to reduce the computational cost. The average-pooled feature map with a shape of (196 × 1 × 1) is passed through two fully connected layers to generate the attention map, where the reduction ratio is set to 14. Here, the feature dimension is reduced from 196 to 14 and then expanded back to 196. This enables the network to learn channel-wise weights that represent the importance of each channel. A ReLU activation is applied between the two layers. The resulting attention map is normalized using a sigmoid activation function and then multiplied element-wise with the input feature map. This produces the final channel-attended feature representation.
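Putting the pieces of Section 3.1 together, the following sketch illustrates the channel attention module and the up-projection block, assuming PyTorch. The class and tensor names, and the exact placement of the LeakyReLU activations relative to the attention module, are assumptions based on the description above.

```python
import torch.nn as nn

class ChannelAttention(nn.Module):
    # Average pooling -> FC (196 -> 14) -> ReLU -> FC (14 -> 196) -> sigmoid gate.
    def __init__(self, channels=196, reduction=14):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)            # (N, C, H, W) -> (N, C, 1, 1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(n, c)).view(n, c, 1, 1)  # per-channel weights in [0, 1]
        return x * w                                           # channel-attended features

class UpProjectionBlock(nn.Module):
    # 1x1 reduce -> 3x3 expand -> channel attention -> pixel shuffle -> 6x6 stride-2 conv
    # -> low-resolution residual -> 3x3 expand + pixel shuffle -> add to first upsampled map.
    def __init__(self, in_channels, channels=49, expansion=196):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, channels, 1)
        self.expand = nn.Conv2d(channels, expansion, 3, padding=1)
        self.ca = ChannelAttention(expansion, reduction=14)
        self.back = nn.Conv2d(channels, channels, 6, stride=2, padding=2)
        self.expand_res = nn.Conv2d(channels, expansion, 3, padding=1)
        self.ps = nn.PixelShuffle(2)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        l = self.act(self.reduce(x))                      # 49 x H x W
        h0 = self.ps(self.ca(self.act(self.expand(l))))   # 49 x 2H x 2W
        l0 = self.act(self.back(h0))                      # 49 x H x W
        e = l0 - l                                        # residual at the input resolution
        h1 = self.ps(self.act(self.expand_res(e)))        # 49 x 2H x 2W
        return h0 + h1                                    # output of the up-projection block
```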
3.2. Down-Projection Block
The down-projection block is shown in Figure 5. The input of the i-th down-projection block is the concatenation of the outputs of the up-projection blocks up to the i-th block. Similar to the up-projection block, each increment in i increases the number of channels of the input feature map by 49. Therefore, a 1 × 1 convolution layer is employed to reduce the channel number of the input feature map to 49. Unlike the up-projection block, the spatial resolution of this feature map is double that of the input light field image. Therefore, we resize it using a 6 × 6 convolution layer with a stride of 2 and padding of 2. The resized 49-channel feature map is expanded to a 196-channel feature map using a 3 × 3 convolution layer. The 196-channel feature map is then passed through the channel attention module and subsequently rearranged into a 49-channel feature map by pixel shuffle with an upscaling factor of 2. This feature map becomes a residual feature map by subtracting the feature map previously generated by the 1 × 1 convolution layer. The 49-channel residual feature map passes through a 6 × 6 convolution layer to match the spatial resolution of the input light field image and is then added to the resized feature map produced by the first 6 × 6 convolution layer, producing the output of the down-projection block.
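A corresponding sketch of the down-projection block is given below, under the same assumptions as the up-projection sketch (PyTorch, illustrative names, LeakyReLU after each convolution); ChannelAttention is the module sketched in Section 3.1.

```python
import torch.nn as nn

class DownProjectionBlock(nn.Module):
    # 1x1 reduce -> 6x6 stride-2 conv -> 3x3 expand -> channel attention -> pixel shuffle
    # -> high-resolution residual -> 6x6 stride-2 conv -> add to first downsampled map.
    def __init__(self, in_channels, channels=49, expansion=196):
        super().__init__()
        self.reduce = nn.Conv2d(in_channels, channels, 1)
        self.down1 = nn.Conv2d(channels, channels, 6, stride=2, padding=2)
        self.expand = nn.Conv2d(channels, expansion, 3, padding=1)
        self.ca = ChannelAttention(expansion, reduction=14)
        self.ps = nn.PixelShuffle(2)
        self.down2 = nn.Conv2d(channels, channels, 6, stride=2, padding=2)
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        h = self.act(self.reduce(x))                       # 49 x 2H x 2W
        l0 = self.act(self.down1(h))                       # 49 x H x W
        h0 = self.ps(self.ca(self.act(self.expand(l0))))   # 49 x 2H x 2W
        e = h0 - h                                         # residual at the doubled resolution
        l1 = self.act(self.down2(e))                       # 49 x H x W
        return l0 + l1                                     # output of the down-projection block
```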
4. Simulation Results
For training, we used the Stanford Lytro light field archive [29] and the Kalantari dataset [16]. These datasets have an angular resolution of 14 × 14 and a spatial resolution of 376 × 541. We used only the 7 × 7 light field images from the center of the 14 × 14 light field images. We converted RGB color images to YCbCr color images and conducted experiments using only the luminance channel. For testing, we used the real-world light field datasets 30Scenes [16], Occlusions [29], and Reflective [29]. The 30Scenes dataset consists of general images, the Occlusions dataset contains images with overlapping objects, and the Reflective dataset includes images with reflective areas. In particular, light field images with occlusions or diffuse reflections make it difficult to predict the pixels of the reconstructed light field images. We evaluated the performance of the proposed method using datasets with various characteristics.
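Since only the luminance channel is used, the RGB inputs are reduced to Y before training and evaluation. A minimal sketch is shown below, assuming images normalized to [0, 1] and the common ITU-R BT.601 luma weights; the exact YCbCr conversion used in the experiments is not reproduced here.

```python
import numpy as np

def rgb_to_y(rgb):
    # rgb: float array of shape (..., 3) with values in [0, 1].
    # Luminance via BT.601 weights (an assumed, commonly used conversion).
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
```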
We used randomly cropped patches with a spatial resolution of 96 × 96 for training, and the cropped patches were augmented by flipping and rotation. To avoid memory limitation issues, the batch size was set to 1. For optimization, we used the Adam optimizer, and the learning rate was decayed every 5000 epochs. The proposed network was trained using the Charbonnier loss [30], which minimizes the error between the original high-resolution (HR) light field image and the reconstructed light field image. This loss function is adopted for its robustness to outliers and its smooth approximation of the L1 norm. The loss is defined as sqrt(||X − X̂||² + ε²), where X and X̂ denote the ground-truth and predicted HR light field images, respectively, and ε is a small constant commonly used in image reconstruction tasks.
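A minimal sketch of this loss in PyTorch is given below; the value of ε is an assumption (a small constant such as 1e-3 is typical in image reconstruction), since the exact setting is not restated here.

```python
import torch

def charbonnier_loss(pred, target, eps=1e-3):
    # Charbonnier penalty: a smooth, outlier-robust approximation of the L1 loss.
    # eps is an assumed small constant; the paper's exact value is not restated here.
    return torch.mean(torch.sqrt((pred - target) ** 2 + eps ** 2))
```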
To evaluate the performance of the proposed method, we compared it with existing approaches using various metrics. Table 1 presents the PSNR comparison results. Non-depth-based methods [24,26,27] achieve better performance than early depth-based approaches [16,18], as they reconstruct images directly without relying on depth estimation. On average, the PSNR of the proposed method was 2.33 dB, 1.44 dB, 0.61 dB, 0.38 dB, and 0.27 dB higher than that of Kalantari et al. [16], LFASR-FS-GAF [18], DistgASR [24], Prex-Net [27], and LFR-DFE [26], respectively. The proposed method consistently outperformed competing approaches across all test datasets, regardless of their characteristics.
To analyze the influence of the number of projection blocks, we conducted experiments by varying the number of blocks in the refinement network. All models were trained under identical conditions, and their reconstruction performance was evaluated in terms of PSNR and SSIM. As shown in Table 2, the performance improved as the number of projection blocks increased; however, the performance gain gradually diminished as more blocks were added. We determined the optimal number of up- and down-projection blocks to be 12 each, considering the trade-off between performance and model parameters. In addition, the proposed model achieved better performance while reducing the number of parameters compared to our previous work [27].
To further validate the effectiveness of the proposed method, we conducted an ablation study to evaluate the contribution of the attention blocks to the performance improvement. We applied different attention modules to the initial feature extraction and projection blocks. As shown in Table 3, the results indicate that the best performance was achieved when spatial attention was applied at the initial feature extraction stage and channel attention was integrated within the projection block.
Spatial attention applied at an early stage helped the network to focus on relevant regions in the input image, improving spatial feature learning. Channel attention within the projection block refined the features by strengthening channel-wise correlations, which led to improved reconstruction performance.
Figure 6 presents a comparison between the proposed method and existing methods using both error maps and cropped image regions. The error maps show the pixel-wise differences between the reconstructed and ground-truth images. Smaller errors are shown in blue, while larger errors are shown in red. For clearer comparison, error values were clipped to the range of 0 to 0.1, with values above 0.1 truncated to 0.1. As shown in Figure 6, the proposed method more accurately reconstructs the ground-truth images than existing methods.
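For reference, an error-map visualization of this kind can be produced with a short sketch like the following, assuming NumPy/Matplotlib and luminance images in [0, 1]; the function and variable names are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

def show_error_map(pred_y, gt_y):
    # Pixel-wise absolute error, clipped to [0, 0.1]; blue = small error, red = large error.
    err = np.clip(np.abs(pred_y - gt_y), 0.0, 0.1)
    plt.imshow(err, cmap='jet', vmin=0.0, vmax=0.1)
    plt.axis('off')
    plt.colorbar()
    plt.show()
```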
Methods that rely on depth maps reconstruct light field images by warping input views based on estimated depth. As shown in Figure 6, blur and artifacts appear where the predicted depth map is inaccurate. In contrast, methods that do not rely on depth maps avoid such issues. However, the figure shows that they may not fully exploit the angular correlations in the light field data, which leads to reduced texture quality.
For example, in the first cropped region of the Flower2 scene, the proposed method recovers the background object between the petals more accurately than existing methods. In the second crop of the Occlusion25 scene, the proposed method recovers leaf textures more clearly than existing methods. This indicates that the proposed method provides consistently higher-quality reconstructions across these examples. In the first crop of the Reflective3 scene, the proposed method restores reflected light that closely matches the ground truth. Additionally, in the second crop of the same scene, existing methods tend to produce unwanted white lines in reflective regions due to interference from neighboring regions. The results show that the proposed method is more robust in handling reflections.