RCA-LF: Dense Light Field Reconstruction Using Residual Channel Attention Networks

Dense multi-view image reconstruction has played an active role in research for a long time and interest has recently increased. Multi-view images can solve many problems and enhance the efficiency of many applications. This paper presents a more specific solution for reconstructing high-density light field (LF) images. We present this solution for images captured by Lytro Illum cameras to solve the implicit problem related to the discrepancy between angular and spatial resolution resulting from poor sensor resolution. We introduce the residual channel attention light field (RCA-LF) structure to solve different LF reconstruction tasks. In our approach, view images are grouped in one stack where epipolar information is available. We use 2D convolution layers to process and extract features from the stacked view images. Our method adopts the channel attention mechanism to learn the relation between different views and give higher weight to the most important features, restoring more texture details. Finally, experimental results indicate that the proposed model outperforms earlier state-of-the-art methods for visual and numerical evaluation.


Introduction
Light fields (LF) record 3D scenes into uniform and dense image samples. These images contain spatial and angular information about the 3D scenes. As a result, many applications have developed and benefited greatly from this huge amount of information, such as de-occlusion [1,2], depth-sensing [3][4][5], saliency detection [6], and salient object detection [7]. In addition, LF could be promising to ease other applications such as the fruit-picking robot, where a robot traverses a whole field and harvests on its own [8,9]. LF images are caught using portable cameras or camera arrays in most situations. In order to use array cameras, several cameras are required, which is an expensive and laborious process [10]. A practical solution for capturing LF images with portable cameras can be provided by inserting a microlens array in front of the image sensor [11,12]. Despite the advantages of this solution, it comes with a major drawback: poor sensor resolution. Therefore, obtaining LF images with high spatial and angular resolution is difficult.
Recently, several learning-based approaches that considerably enhance the performance of LF reconstruction have been presented. The LF reconstruction challenge reconstructs dense LF images from sparse input views. Previous approaches using the convolutional neural network (CNN) without depth estimation [13,14] can only handle LFs with a small baseline. They explore the connection between the angular and spatial domains but fail to use the epipolar information fully.
Some approaches [15,16] estimate depth maps and warp views to investigate relationships between views. However, the wrongness of the calculated depth map greatly affects how the LF reconstruction turns out. There is another approach to mitigate the effect of limited sensor resolution through LF super-resolution [17][18][19], but this is outside this research's interest.
This article presents a unique learning-based methodology for rapidly reconstructing a densely sampled LF from a very sparsely sampled LF. Computationally efficient convolutions realize our end-to-end CNN model to understand spatial-angular relationships deeply. We up-sample the sparsely input LF to the required angular size using the bicubic interpolation in the preprocessing stage. The RCA-LF is then deployed to leverage the inherent LF structure in the up-sampled LF images. Notably, our method does not need disparity warping or intensive computations. In addition, it reconstructs a whole LF in a single forward pass. Specifically, we introduce the residual channel attention light field (RCA-LF) structure to solve different LF reconstruction tasks. In our approach, view images are grouped in one stack where epipolar information is available. We use 2D convolution layers to process and extract features from the stacked view images. Our method adopts the channel attention mechanism to learn the relation between different views and give higher weights to the most important features, restoring more texture details.
We propose a new way to process the multi-channel input, which comes from 2D convolution instead of 3D convolution. Two-dimensional convolution takes a single slice as an input and fails to leverage context from adjacent slices. Conversely, 3D convolution overcomes this issue by leveraging the slice context with 3D convolutional kernels, resulting in enhanced performance. However, 3D convolutions have a limited range depending on the kernel size (3 × 3 × 3 kernels can leverage depth information using only three consecutive slices).
In our proposed method, the input has a size of (B, H, W, 49) for the 3 × 3 to 7 × 7 reconstruction task where the 49 represents the number of input channels. For 2D convolution, the number of filters equals (filter_height × filter_width × in_channels × out_channels). Consequently, every output channel is a function of all input channels at each convolution. Adopting this method can fix the limited range issue of the 3D convolution and provide better quality.
The number of input channels is 49 for the 3 × 3 to 7 × 7 reconstruction task. Still, we extract more features on the subsequent convolution layers, meaning more interactions can be identified between the extracted features of input images, restoring more information and details. Because some of the extracted features might contain useless or redundant information, the channel attention mechanism rescales (gives different weights) for these extracted features depending on the information content.
We can summarize the contributions of this article as follows: (1) We adopt a channel attention mechanism to reconstruct LF images. (2) Our method increases the interaction between different LF images by processing LF images as input-output channels of 2D convolutions. (3) We design the RCA-LF to increase the interactions between input-output channels (parallel processing) and decrease the number of blocks (serial processing); hence, it can reconstruct LF images accurately and fast.

LF Representation
A wealth of information about the surrounding 3D space is revealed by LF imaging, contrary to traditional imaging methods. The Plenoptic function was initially described using seven variables that determine the view from any possible angle, for all wavelengths of light and at any time, as P = P θ, ϕ, λ, t, V x , V y , V z [20]. It was then simplified to a 4D description with the intersections of light rays with two planes L = L(u, v, x, y), where (u, v) and (x, y) denote the points of intersection with the first and second planes, respectively, as shown in Figure 1 [21].

LF Reconstruction
Many LF reconstruction approaches have been presented. These approaches are classified into three types: traditional, deep learning depth-based, and deep learning nondepth-based approaches.

Traditional Approaches
Wanner and Goldluecke improved the spatial and angular resolutions using the Epipolar Plane Image (EPI) for depth map estimation [22]. However, this variational framework has flaws since the input views only assess the disparity. Another approach was proposed to utilize the Gaussian mixture model for LF denoising, super-resolution, and refocusing [23]. In this approach, the patch prior was designed using the disparity pattern. However, their approach is vulnerable to low-quality LF images. Pujades et al. [24] proposed a novel cost function optimized by a Bayesian formulation to estimate the depth and reconstruct novel views. Chaurasia et al. [25] proposed a novel image-based rendering using superpixels to preserve depth discontinuities. The warped views are blended using a camera and depth information. Zhang et al. [26] proposed an interactive system adopting patch-based methods for LF editing. This technique models the collected images as overlapping layers with varying depths and uses back-to-front layered synthesis. Vagharshakyan et al. [27] utilized the EPIs in the shearlet domain to reconstruct dense images using large baseline-rectified images. Their method provided good results for non-Lambertian scenes of semi-transparent objects.

Deep Learning Depth-Based Approaches
Kalantari et al. [15] suggested decomposing the reconstruction process into disparity and color estimates independently evaluated by the relevant CNN network. Due to their separate reconstruction, connections between novel LF images were overlooked. Another approach was proposed to speed up Kalantari's method using a predefined CNN [28]. In addition, they proposed the estimation of two disparity maps to provide more accurate results. Shi et al. [16] used two reconstruction modules: pixel reconstruction to handle the occlusions explicitly, and feature reconstruction for high frequencies. However, this method was limited by the need for depth maps. In contrast to the previous methods designed for images with a small baseline, Jin et al. [29] designed a model for images with a large baseline. A CNN was employed to estimate depth maps to wrap input views, and these views were then blended using a SAS CNN [30]. Because the quality of synthesized views is dependent on the accuracy of estimated depth maps, unwanted artifacts often emerge in synthesized views.

LF Reconstruction
Many LF reconstruction approaches have been presented. These approaches are classified into three types: traditional, deep learning depth-based, and deep learning non-depth-based approaches.

Traditional Approaches
Wanner and Goldluecke improved the spatial and angular resolutions using the Epipolar Plane Image (EPI) for depth map estimation [22]. However, this variational framework has flaws since the input views only assess the disparity. Another approach was proposed to utilize the Gaussian mixture model for LF denoising, super-resolution, and refocusing [23]. In this approach, the patch prior was designed using the disparity pattern. However, their approach is vulnerable to low-quality LF images. Pujades et al. [24] proposed a novel cost function optimized by a Bayesian formulation to estimate the depth and reconstruct novel views. Chaurasia et al. [25] proposed a novel image-based rendering using superpixels to preserve depth discontinuities. The warped views are blended using a camera and depth information. Zhang et al. [26] proposed an interactive system adopting patch-based methods for LF editing. This technique models the collected images as overlapping layers with varying depths and uses back-to-front layered synthesis. Vagharshakyan et al. [27] utilized the EPIs in the shearlet domain to reconstruct dense images using large baseline-rectified images. Their method provided good results for non-Lambertian scenes of semi-transparent objects.

Deep Learning Depth-Based Approaches
Kalantari et al. [15] suggested decomposing the reconstruction process into disparity and color estimates independently evaluated by the relevant CNN network. Due to their separate reconstruction, connections between novel LF images were overlooked. Another approach was proposed to speed up Kalantari's method using a predefined CNN [28]. In addition, they proposed the estimation of two disparity maps to provide more accurate results. Shi et al. [16] used two reconstruction modules: pixel reconstruction to handle the occlusions explicitly, and feature reconstruction for high frequencies. However, this method was limited by the need for depth maps. In contrast to the previous methods designed for images with a small baseline, Jin et al. [29] designed a model for images with a large baseline. A CNN was employed to estimate depth maps to wrap input views, and these views were then blended using a SAS CNN [30]. Because the quality of synthesized views is dependent on the accuracy of estimated depth maps, unwanted artifacts often emerge in synthesized views.

Deep Learning Non-Depth-Based Approaches
Most of these methods extract information from EPIs for the reconstruction process. Wu et al. [31] divided the process into low-frequency restoration after a blur operation and high-frequency restoration by inverting the blur operation. However, they did not use the epipolar information efficiently, as they only extracted the EPIs in one direction. Using a CNN, Wu et al. [32] applied a shearing operation to input EPIs to eliminate the effect of significant disparities. Then, they employed a CNN to learn a fusion score. In this method, the authors misused the angular information by using EPIs horizontally or vertically for the reconstruction. In addition, they reconstructed rows and then columns hierarchically, leading to reconstruction error accumulation. Meng et al. [33] proposed an HDDRNet for LF spatial and angular super-resolution employing a high dimensional CNN. Although they used the provided angular information efficiently, employing the 4D convolutions, this was at the expense of model complexity. Mildenhall et al. [34] proposed reconstructing multi-plane images from input views and then blending them to reconstruct novel views. Wang et al. [35] used EPI and EPI stacking to create a pseudo-4D CNN. They used EPI structure-preserving loss to increase reconstruction quality. They wasted angular data by only using horizontal or vertical EPI stacks. Hu et al. [14] proposed LF reconstruction with hierarchical feature fusion. SAS layers were employed to extract features from 4D LF images, while the U-Net structure was adopted to generate both semantic and local feature representation. They integrated these two structures and proposed a U-SAS module to enable the extraction of spatial features and the correlation of SAIs. In addition, they adopted an enlarged patch size when training for the integrated information of objects. Liu et al. [36] proposed to extract EPI information in a horizontal, vertical, and angular manner to reconstruct LF images. However, each branch was processed alone, which affected the final quality. Zhang et al. [37] reconstructed LF images employing 2D and 3D CNNs on horizontal and vertical EPIs. However, they neglected the angular LF information, slightly affecting the final reconstruction quality. Salem et al. [38] mapped the LF reconstruction problem from the 4D into the 2D domain by transforming the 4D LF into a 2D raw LF image to ease the reconstruction. They provided satisfactory reconstruction quality using a model inspired by the RCAN [39,40]. Still, they used a heavy model, which affected the reconstruction time.

Problem Formulation
We can consider the LF images as a 2D array of view images, as shown in Figure 2a. These images have (H, W) and (U, V) spatial and angular resolutions. Our goal is to reconstruct dense LF images from their sparse input counterparts. Assume L LR ∈ R H×W×u×v represents the sparse input views with angular resolution (u, v). Using the LR input, our RCA-LF network can reconstruct a dense output L HR ∈ R H×W×U×V with (U, V) angular resolution. Before applying the LR input images, we up-sample the sparse input EPIs to the required output size utilizing the Bicubic interpolation to generate L LR ∈ R H×W×U×V . The last step before applying the LR input to the network is to rearrange it from the 4D representation L LR ∈ R H×W×U×V into the 3D representation L LR ∈ R H×W×UV , as shown in Figure 2b. We reconstruct the 3D L LR by stacking the view images in row-major order as indicated by the blue line in Figure 2a.

Network Architecture
We designed our network similarly to the RCAN network [39]. In terms of functionality, our model can be divided into primary feature extraction, deep feature extraction, and final output restoration, as shown in Figure 3a. The primary feature extraction is implemented using two convolutional layers (Conv). Each Conv is followed by a long skip connection to bypass the low-frequency components to the output part, allowing the network to concentrate on high-frequency component extraction. The deep feature extraction is implemented using ten residual channel attention blocks (RCAB), as shown in Figure  3b. The final part is implemented by summing the primary extracted features with the deep extracted features to reconstruct the final output. This is unlike the RCAN method, in which the input is a single-channel input. Then, channel-wise features are extracted from the input to be processed through the network. The input in our method is a stack of × images (multi-channel input), where (U, V) is the angular resolution. Then, more channel-wise features are extracted with the extraction ratio e to be × × . The RCAB is the main component of our network, as the RCA-LF consists of ten RCABs. The RCAB is a residual block (RB) with an integrated channel attention mechanism (CA). The first part of the RCAB, RB, is built by cascading two Conv layers with an activation function (ReLU) with a skip connection.

Network Architecture
We designed our network similarly to the RCAN network [39]. In terms of functionality, our model can be divided into primary feature extraction, deep feature extraction, and final output restoration, as shown in Figure 3a. The primary feature extraction is implemented using two convolutional layers (Conv). Each Conv is followed by a long skip connection to bypass the low-frequency components to the output part, allowing the network to concentrate on high-frequency component extraction. The deep feature extraction is implemented using ten residual channel attention blocks (RCAB), as shown in Figure 3b. The final part is implemented by summing the primary extracted features with the deep extracted features to reconstruct the final output.

Network Architecture
We designed our network similarly to the RCAN network [39]. In terms of func ality, our model can be divided into primary feature extraction, deep feature extrac and final output restoration, as shown in Figure 3a. The primary feature extraction i plemented using two convolutional layers (Conv). Each Conv is followed by a long connection to bypass the low-frequency components to the output part, allowing the work to concentrate on high-frequency component extraction. The deep feature extra is implemented using ten residual channel attention blocks (RCAB), as shown in F 3b. The final part is implemented by summing the primary extracted features wit deep extracted features to reconstruct the final output. This is unlike the RCAN method, in which the input is a single-channel input. T channel-wise features are extracted from the input to be processed through the netw The input in our method is a stack of × images (multi-channel input), where ( is the angular resolution. Then, more channel-wise features are extracted with the ex tion ratio e to be × × . The RCAB is the main component of our network, as the R LF consists of ten RCABs. The RCAB is a residual block (RB) with an integrated cha attention mechanism (CA). The first part of the RCAB, RB, is built by cascading two C layers with an activation function (ReLU) with a skip connection.  This is unlike the RCAN method, in which the input is a single-channel input. Then, channel-wise features are extracted from the input to be processed through the network. The input in our method is a stack of U × V images (multi-channel input), where (U, V) is the angular resolution. Then, more channel-wise features are extracted with the extraction ratio e to be e × U × V. The RCAB is the main component of our network, as the RCA-LF consists of ten RCABs. The RCAB is a residual block (RB) with an integrated channel attention mechanism (CA). The first part of the RCAB, RB, is built by cascading two Conv layers with an activation function (ReLU) with a skip connection.
The CA is adopted to allow the network to treat the extracted channel-wise features unequally and concentrate on the crucial features. A global average pooling is used to shrink the intermediate C feature map of size H × W into 1 × 1 to obtain the initial channel-wise statistics to determine which channels are more important. These channel statistics may be considered a collection of local descriptors to express the full-view stack [41]. A Conv then down-samples these initial statistics with a reduction ratio of r. A Conv up-samples these statistics with the same reduction ratio after being activated by ReLU, as shown in Figure 3 in [39]. Finally, a gate mechanism is applied to learn the nonlinear interactions between channels and the non-exclusive mutual relationship. The gate mechanism is applied with a sigmoid function to obtain the final channel-wise statistics.

Implementation Details
The luminance component is only used to train the RCA-LF network, while the EPIs of the chrominance components are up-sampled with the Bicubic interpolation. We trained our network to map the LR input images to the HR LF output images by minimizing the L 1 loss and optimizing the Adam optimizer with its default parameters [42]. The L 1 loss is defined as follows when a training set has N combinations of input and counterpart ground-truth pictures: f () represents the function responsible for mapping the LR input into the HR output and is implemented by the RCA-LF network. All the Conv layers used were of size 3 × 3 with zero padding, except for the Conv layers used for the CA, which were of size 1 × 1. Both the extraction ratio e and the reduction ratio were set to 8. We trained the network with patches of size 32 × 32 and a batch size of 128. We started the training with an initial learning rate of 10 −4 and decreased it exponentially by 0.1 every 100 epochs while we trained the network for 150 epochs. We used 100 full LF images to train our network [15,43], using TensorFlow [44] on an NVIDIA GeForce RTX 3090 GPU. PSNR and SSIM were used as reconstruction quality assessment indicators.

Experiments and Discussion
We conducted comprehensive experiments to validate the effect of the proposed RAC_LF network. We compared the RCA_LF numerically and visually with state-of-theart methods using real-world LF images. We used 30 LFs from the 30 scenes dataset [15], 31 LFs from the refractive and reflective surfaces dataset [43], and 43 LFs from the occlusions dataset [43]. The average PSNR and SSIM [45] over the reconstructed LF luminance were used for the numerical comparison. We compared the RCA_LF over two interpolation tasks (2 × 2-8 × 8 and 3 × 3-7 × 7) and two extrapolation tasks (2 × 2-8 × 8 extrapolations 1 and 2), as shown in Figure 4. The CA is adopted to allow the network to treat the extracted channel-wise features unequally and concentrate on the crucial features. A global average pooling is used to shrink the intermediate C feature map of size × into 1 × 1 to obtain the initial channel-wise statistics to determine which channels are more important. These channel statistics may be considered a collection of local descriptors to express the full-view stack [41]. A Conv then down-samples these initial statistics with a reduction ratio of r. A Conv up-samples these statistics with the same reduction ratio after being activated by ReLU, as shown in Figure 3 in [39]. Finally, a gate mechanism is applied to learn the nonlinear interactions between channels and the non-exclusive mutual relationship. The gate mechanism is applied with a sigmoid function to obtain the final channel-wise statistics.

Implementation Details
The luminance component is only used to train the RCA-LF network, while the EPIs of the chrominance components are up-sampled with the Bicubic interpolation. We trained our network to map the LR input images to the HR LF output images by minimizing the L1 loss and optimizing the Adam optimizer with its default parameters [42]. The L1 loss is defined as follows when a training set has N combinations of input and counterpart ground-truth pictures: = ∑ ( ) () represents the function responsible for mapping the LR input into the HR output and is implemented by the RCA-LF network. All the Conv layers used were of size 3 × 3 with zero padding, except for the Conv layers used for the CA, which were of size 1 × 1.
Both the extraction ratio e and the reduction ratio were set to 8. We trained the network with patches of size 32 × 32 and a batch size of 128. We started the training with an initial learning rate of 10 and decreased it exponentially by 0.1 every 100 epochs while we trained the network for 150 epochs. We used 100 full LF images to train our network [15,43], using TensorFlow [44] on an NVIDIA GeForce RTX 3090 GPU. PSNR and SSIM were used as reconstruction quality assessment indicators.

Experiments and Discussion
We conducted comprehensive experiments to validate the effect of the proposed RAC_LF network. We compared the RCA_LF numerically and visually with state-of-theart methods using real-world LF images. We used 30 LFs from the 30 scenes dataset [15], 31 LFs from the refractive and reflective surfaces dataset [43], and 43 LFs from the occlusions dataset [43]. The average PSNR and SSIM [45] over the reconstructed LF luminance were used for the numerical comparison. We compared the RCA_LF over two interpolation tasks (2 × 2-8 × 8 and 3 × 3-7 × 7) and two extrapolation tasks (2 × 2-8 × 8 extrapolations 1 and 2), as shown in Figure 4.  Tables 1-4 present numerical data indicating the proposed approach's effectiveness. Numerical comparisons are provided regarding peak-signal-to-noise ratio (PSNR) and the structural similarity index (SSIM) [45]. Figures 5-8 show a visual contrast highlighting our model's ability to recreate high-quality images with sharper edges around object boundaries, even in obscured areas and against complex backgrounds. However, we attribute the significant improvement in our model results to: (1) 3D representation (LF view stack), allowing the network to model and understand relations between different LFs; (2) the channel attention mechanism, which played an important role by allowing the network to concentrate on the crucial features.

Different Reconstruction Tasks
Wu et al. [31] underutilized angular data, using EPIs in just one direction. Utilizing EPIs in horizontal and vertical dimensions subsequently yielded superior outcomes. Nonetheless, they hierarchically up-sampled LF, increasing error accumulation on the most recently reconstructed views. In addition, they proposed a second paradigm based on sheared EPIs [32]. In particular, low-angular-resolution EPIs were sheared before being upsampled to the necessary angular resolution. The up-sampled EPIs with various shearing methods were fused by learning fusion scores using a CNN. Liu et al. [36] used angular information more effectively than earlier techniques, yet this was insufficient since they only employed one EPI stack in each direction. Zhang et al. [37] used micro-lens pictures and view image stacks to investigate further LF data. Salem et al. [38] used the raw LF representation to ease the reconstruction process. In addition, they initialized the input image using the nearest view initialization method. However, this method had a limitation for some reconstruction tasks. Additionally, it affected the quality of the final image.  [15] and Shi et al. [16] generate new views by distorting the input views by their assessed disparity/depth. On the other hand, depth estimation and warping are challenging, particularly for LF pictures with a tiny depth difference, making it possible for images to be flawed and seem out of place. Due to Yeung et al.'s disregard for the connections between distinct views, their approach generates false shadows and ghosting artifacts at the borders of reconstructed views [46].
Reconstructing 8 × 8 out of 2 × 2 views is a challenging task due to the sparseness of the input views. Yeung et al. [46] observed that the reconstruction quality of the center views is much worse than that of the views located near the input views. Because the center view is the farthest distance from any input views, inferring the details with greater accuracy presents the biggest problem. Therefore, they proposed different combinations of interpolation and extrapolation to reconstruct LF images. As a result, the average distance from all the novel views is shorter than before, increasing the reconstruction quality of the center views. Most available algorithms are optimized for interpolation tasks and cannot predict extrapolated views. That is why ghosting and artifacts often appear around thin structures and occluded regions. Extrapolation is more challenging than interpolation because certain portions of the reconstructed views are not present in the input. In addition, it cannot keep the slopes of the lines in the reconstructed EPIs the same. It is challenging to devise a method for dealing with different relationships between input and output views. However, the task becomes more feasible and efficient with our proposed approach.    Table 5 presents the average run-time to reconstruct a full LF image for the first task: 7 × 7 out of 3 × 3 views. We tested our model on an NVIDIA Geforce RTX 3090. The proposed model can reconstruct LF images faster due to its highly parallel design.

Reconstruction Time
Wang et al. [47] consume a lot of time as they do not reconstruct the entire scene in one feedforward pass. Instead, they reconstruct rows and then columns hierarchically. Yeung et al. [46] and Liu et al. [36] used MATLAB to build their code, which contains many time-consuming reshaping operations. Compared to Salem et al. [38], they used 15 residual blocks (RBs) compared to the 10 RBs in our proposed work. In addition, they process LFs in the raw representation of size 7H × 7W compared to H × W in our implementation.

Ablation Study
We compared three different architectures to validate the effect of the channel attention (CA) mechanism on the reconstruction process. Numerical comparison is presented in Table 6, where the first row indicates the simplest case without applying the CA mechanism. The second row gives the results for the block that is the same as the one proposed in [39] with the CA integrated inside the RCAB. The final row gives the results for the proposed block with the CA separated from the RB, as shown in Figure 9.

Ablation Study
We compared three different architectures to validate the effect of the channel attention (CA) mechanism on the reconstruction process. Numerical comparison is presented in Table 6, where the first row indicates the simplest case without applying the CA mechanism. The second row gives the results for the block that is the same as the one proposed in [39] with the CA integrated inside the RCAB. The final row gives the results for the proposed block with the CA separated from the RB, as shown in Figure 9.

Future Work
In this paper, we present a method for reconstructing light field images. The proposed method is characterized by its applicability to all reconstruction tasks for LF images with a small baseline. Although this model is efficient, it fails to reconstruct LF images with a broad baseline. In addition, it sometimes fails to reconstruct parts of the scenes with complex backgrounds or contains severe reflections. Therefore, we are trying to develop a method capable of reconstructing complex scenes and scenes with a broad baseline.

Conclusions
This research proposes an effective learning-based paradigm for increasing the angular resolution of LF images. We up-sampled input EPIs to the required angular size, which allows our network to be used for any reconstruction task. In addition, this allowed the network to comprehend and accurately represent the connection since the input and output were of the same size. Finally, we adopted the channel attention mechanism to help the network to concentrate on the important features by assigning higher weights. The proposed RCA_LF network reconstructs LF images by mapping the up-sampled lowresolution images into high-resolution 3D LF volumes. The RCA_LF outperforms other state-of-the-art methods in reconstructing LF images with a small baseline.  Data Availability Statement: The datasets used in this paper are public datasets. We also provide the proposed method's training and evaluation codes at: https://github.com/ahmeddiefy/RCA-LF, which was created (accessed on 24 May 2022).

Conflicts of Interest:
The authors declare no conflict of interest.