Light Field Image Quality Enhancement by a Lightweight Deformable Deep Learning Framework for Intelligent Transportation Systems

Abstract: Light field (LF) imaging has multi-view properties that enable many applications, including auto-refocusing, depth estimation and 3D reconstruction of images, which are required particularly for intelligent transportation systems (ITSs). However, cameras can present a limited angular resolution, which becomes a bottleneck in vision applications. Moreover, incorporating angular data is challenging because of the disparities in LF images. In recent years, different machine learning algorithms have been applied to both image processing and ITS research areas for different purposes. In this work, a Lightweight Deformable Deep Learning Framework is implemented, in which the problem of disparity in LF images is treated. To this end, an angular alignment module and a soft activation function are implemented in the Convolutional Neural Network (CNN). For performance assessment, the proposed solution is compared with recent state-of-the-art methods using different LF datasets, each one with specific characteristics. Experimental results demonstrated that the proposed solution achieved better image quality than state-of-the-art LF image reconstruction methods. Furthermore, our model presents a lower computational complexity, decreasing the execution time.


Introduction
A light field describes the distribution of light rays in space; thus, more information from our environment can be used to build an image. However, due to the high dimensionality of the data, capturing a scene is a difficult task [1].
Currently, the Light Field (LF) imaging [1] area has been explored by many studies [2,3] in the fields of Virtual Reality (VR), Augmented Reality (AR) and different industrial applications, such as commercial plenoptic cameras. In addition, different image-based solutions, which use different machine learning techniques, are employed in Intelligent Transportation Systems (ITSs) for several applications [4][5][6][7][8][9][10]. ITS solutions aim to improve the safety, mobility and efficiency of transport services, and visual information plays an important role in accomplishing these goals. Nowadays, there are many proposals of deep learning models [11][12][13][14][15][16][17][18][19] that are applied to [...] is performed for angular data incorporation. Feature extraction containing rich spatial data is performed to align with their original features, and a soft activation function is also used in the CNN model to decrease the computational complexity of the proposed deformable deep learning framework. Consequently, the proposed framework obtains an improvement in the final image quality.
The main contributions of this paper are listed below.
1. An improved framework that performs feature extraction and angular alignment using the deformable convolution network approach, ruling out the need to apply a loss function.
2. To reduce the computational complexity for LF SR images, a novel activation function is utilized in the proposed CNN model. Thus, a lightweight solution to process LF SR images is obtained.
3. The performance of the proposed model is assessed using recent databases. Experimental results demonstrated that our proposal reached a high accuracy for image reconstruction, obtaining a better performance in image quality than other similar works.
4. Our proposed framework improves the image content and its perceptual quality with reduced computational processing and execution time, which is relevant for different applications in the ITS research area [52][53][54].
Experimental results showed that our proposed CNN architecture obtains a low computational complexity, reducing the training time by 37% and the execution time by 40%, on average. Moreover, image quality was also evaluated, and the results demonstrated a superior performance of the proposed model in terms of objective image quality metrics, such as the Structural Similarity Index Measure (SSIM) and the peak signal-to-noise ratio (PSNR), reaching scores of 0.99 and above 45, respectively.
The remainder of this paper is organized as follows. In Section 2, related works are presented. The methodology and the details of the proposed method are presented in Section 3. Experimental results are presented in Section 4. Finally, the conclusions are presented in Section 5.

Related Works
In this section, some works about LF image representation, as well as frameworks based on Deep Learning algorithms, are discussed.

Light Field Representation and Images
The most common solution for the representation of a 4D LF is the light rays parameterized by the coordinates of their intersections with two planes in arbitrary positions. Thus, the coordinate system is represented by (u, v) for the first plane, and (s, t) is the representation for the second one.
The plenoptic function that describes a LF is reduced from seven to only four dimensions, and it is represented by Equation (1).
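Equation (1) did not survive in this version of the text. Under the two-plane parameterization described above, a plausible form, stated here as an assumption rather than as the original equation, is the following, where (u, v) and (s, t) are the intersection coordinates of a ray with the first and second plane, respectively:

```latex
\begin{equation}
L = L(u, v, s, t)
\end{equation}
```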
A 4D LF can be visualized in two ways: through an integral LF structure and through 2D slices [55,56]. Thus, the 4D LF can be represented as a 2D array of images. For LF rendering, the capture of insufficient samples can cause the ghosting effect in the views. However, it is impractical to acquire many samples of an LF [1]. The minimum number of samples needed for light field rendering is studied in [57,58], which concluded that the pixels must at least touch each other to render the views without producing the ghosting effect. Thus, a large number of samples is needed for producing a noise-free output, which is computationally expensive, even now.
Many methods and models have been developed for working with LF images. An approach was developed in [59] to estimate disparity from LF images. Farrugia et al. [43] proposed a linear subspace projection approach for LF image SR. In [60], LFBM5D is proposed for LF image denoising, extending the state-of-the-art Block-matching and 3D filtering (BM3D) image denoising filter to LFs. Another method, based on graph optimization, was used to achieve LF image SR in [44]. Although the LF images are well encoded in these cited studies, the spatial information is not fully exploited. Recently, deep learning methods [61] have achieved superior results when compared to traditional methods in spatial information exploitation. However, the computational models are much more complex and time-consuming to process.
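The idea of a 4D LF as a 2D array of images, and of extracting 2D slices from it, can be illustrated with a minimal NumPy sketch. The array layout `lf[a_row, a_col, y, x]` (two angular indices followed by two spatial indices), the toy sizes, and all function names are assumptions made for illustration only; they are not taken from the paper.

```python
import numpy as np

def make_toy_light_field(n_rows=3, n_cols=3, height=4, width=5):
    """Build a toy 4D light field indexed as lf[a_row, a_col, y, x] (assumed layout)."""
    values = np.arange(n_rows * n_cols * height * width, dtype=np.float32)
    return values.reshape(n_rows, n_cols, height, width)

def sub_aperture_image(lf, a_row, a_col):
    """Return the 2D view seen from angular position (a_row, a_col)."""
    return lf[a_row, a_col]

def view_mosaic(lf):
    """Tile all views into one 2D image, i.e. a 2D array of 2D images."""
    n_rows, n_cols, height, width = lf.shape
    # Reorder to (a_row, y, a_col, x) so that each view stays a contiguous tile.
    return lf.transpose(0, 2, 1, 3).reshape(n_rows * height, n_cols * width)

def epipolar_slice(lf, a_row, y):
    """Fix one angular row and one image row: a 2D (a_col, x) slice."""
    return lf[a_row, :, y, :]

lf = make_toy_light_field()
mosaic = view_mosaic(lf)
epi = epipolar_slice(lf, a_row=1, y=2)
print(mosaic.shape, epi.shape)
```

In the mosaic, the tile at block position (1, 1) is exactly the view returned by `sub_aperture_image(lf, 1, 1)`, which is what "a 2D array of images" means in practice.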
In our work, the feature extraction and angular alignment are performed to improve the image quality, reducing noise effects, and a soft activation function was used in the CNN model for decreasing computational expenses.

Frameworks Using Deep Learning Algorithms
Deep-learning methods [62][63][64] have been used for several applications [65,66], such as classification, detection, and recognition of images. For Single Image Super-Resolution (SISR), a framework is proposed in [67], which learns the mapping from LR to HR image using three layers, patch representation, non-linear mapping and reconstruction. Dong et al. [68] use the SRCNN structure [67] to achieve a speed up of more than 40 times with even superior restoration quality. A DRCN structure is proposed in [69], which improves the SR results without introducing new parameters. Lai et al. [70] propose a method that adopts a Laplacian pyramid to reconstruct residuals of high-resolution images. Hu et al. [71] propose a method to solve SISR of arbitrary scale factor with a single model. The cited studies work on obtaining a high-resolution image. However, they still have a large computational expense.
Currently, novel SISR methods are demonstrating superior performance to traditional methods in spatial information exploitation. The LFCNN approach is used in [72], improving both the efficiency of training and the quality of angular SR results by using weight sharing. In [73], the authors attempt to measure the degree of their LF coherence (LFC), obtaining consistent performance. Yuan et al. [47] use the LF-DCNN model for improving the LFCNN via a SISR network EDSR [74] and a specific EPI-enhancement network.
A bidirectional recurrent network, LFNet, is proposed in [49] by extending BRCN to LFs. Wang et al. [75] proposed another method, named LF-InterNet, for interacting spatial and angular information for LF image SR. LF-ATO [76] and LF-InterNet [75] have achieved a high reconstruction accuracy. Although recent studies have improved the network performance, the disparity problem has not been well explored in the literature. In the LFSSR [50] and LF-InterNet [75] models, the LF features are organized, and the angular information is incorporated in the model. However, the disparity problem continues to occur in these studies. LFNet [49] works with a video SR framework to address the problem of disparity in recurrent networks, but it considers only SAIs from the same row or column as its inputs.
Regular CNNs use a fixed kernel configuration, which does not explore long-range information. To resolve this problem, a deformable convolution is proposed in [77], which adds learned offsets that let the convolution kernel sample positions away from its regular neighborhood. However, deformable convolutions have been applied to video SR [78,79] or to more complex computational systems [77].
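The effect of learned offsets on the sampling positions of a kernel can be sketched as follows. This is a toy, single-channel 3 × 3 deformable convolution written with plain NumPy loops; it is not the implementation of [77] or of the proposed framework. In particular, the offsets are passed in as an array here, whereas in a real network they would be predicted by an additional convolutional branch. All names are hypothetical.

```python
import numpy as np

def bilinear_sample(image, y, x):
    """Sample a 2D array at a fractional position; pixels outside count as zero."""
    height, width = image.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    value = 0.0
    for yi, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < height and 0 <= xi < width:
                value += wy * wx * image[yi, xi]
    return value

def deformable_conv3x3(image, kernel, offsets):
    """Toy single-channel 3x3 deformable convolution.

    offsets[y, x, k] holds the (dy, dx) shift that is added to the k-th
    regular sampling position of the kernel centred at (y, x).
    """
    height, width = image.shape
    taps = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    output = np.zeros((height, width), dtype=np.float64)
    for y in range(height):
        for x in range(width):
            for k, (dy, dx) in enumerate(taps):
                off_y, off_x = offsets[y, x, k]
                output[y, x] += kernel[dy + 1, dx + 1] * bilinear_sample(
                    image, y + dy + off_y, x + dx + off_x)
    return output

image = np.arange(25, dtype=np.float64).reshape(5, 5)
identity_kernel = np.zeros((3, 3))
identity_kernel[1, 1] = 1.0

zero_offsets = np.zeros((5, 5, 9, 2))
half_shift = zero_offsets.copy()
half_shift[..., 1] = 0.5  # shift every sampling position half a pixel to the right

regular = deformable_conv3x3(image, identity_kernel, zero_offsets)
shifted = deformable_conv3x3(image, identity_kernel, half_shift)
```

With zero offsets the operation reduces to a regular zero-padded convolution; with non-zero offsets each tap reads an interpolated value from a displaced position, which is what allows features from views with different disparities to be aligned.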

Methodology
In this section, the main steps followed in building the proposed framework are described. We introduce the framework topology, the datasets used, and the evaluation of the proposed method through comparison with other methods. Figure 1 shows the topology of the Lightweight Deformable Deep Learning Framework, including the feature extraction, the angular alignment (AA) using the deformable convolution approach, and the reconstruction step. The LR data serve as input to the CNN model, which performs the feature extraction and, subsequently, the AA using the deformable approach; then, the reconstruction is performed. The deformable convolution network approach considers constrained pooling layer models to treat the information related to angular resolution in order to improve the image content and perceptual quality. Finally, the reconstructed data are generated, in which the LF data are represented as L(x, y, s, t).

Feature Extraction
A feature representation containing rich spatial context information is useful for the subsequent alignment and reconstruction steps. Thus, in this work, spatial pyramid pooling is used for performing the feature extraction.
The inputs are processed with a 1 × 1 convolution, for generating initial features. The residual modules and blocks are used for performing deep feature extraction. Then, 3 × 3 convolutions are combined in the residual blocks. Later, features of these branches are added in 1 × 1 convolution.
The activation function used in this work is defined by Equation (2).
in which α and β are a pair of trainable positive parameters. The activation function presents a non-monotonic region for t < 0, which gives it a zero-mean property. For t > 0, it rectifies the output distribution.
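Equation (2) is not reproduced in this version of the text. The description above (a pair of trainable positive parameters α and β, a non-monotonic region for t < 0, and a rectified output for t > 0) matches the Soft-Root-Sign activation, so the sketch below assumes that the "SR" function is Soft-Root-Sign, SRS(t) = t / (t/α + exp(−t/β)). This identification, the function names, and the values α = 2 and β = 3 are assumptions made for illustration; in the framework, α and β would be learned during training. Leaky ReLU is included for comparison.

```python
import numpy as np

def soft_root_sign(t, alpha=2.0, beta=3.0):
    """Assumed SR activation: t / (t / alpha + exp(-t / beta)), with alpha, beta > 0."""
    t = np.asarray(t, dtype=np.float64)
    return t / (t / alpha + np.exp(-t / beta))

def leaky_relu(t, negative_slope=0.01):
    """Leaky ReLU, used in the paper as the comparison activation."""
    t = np.asarray(t, dtype=np.float64)
    return np.where(t >= 0, t, negative_slope * t)

samples = np.array([-10.0, -5.0, -1.0, 0.0, 1.0, 100.0])
print(soft_root_sign(samples))
print(leaky_relu(samples))
```

Under these assumed parameters the output for large positive t approaches α, and the negative branch first decreases and then returns towards zero, which is the non-monotonic behaviour mentioned in the text.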
In the experiments, other activation functions are used, such as Leaky ReLU for comparison with the SR function.

Angular Alignment
After the feature extraction, an angular alignment using a deformable convolution network approach is performed, in which a bidirectional alignment incorporates angular data. Side-view features are warped to the center view, where they are aligned with the center-view feature. In this work, one deformable convolution performs the feature collection and another performs the feature distribution. The first convolution takes the (k − 1)-th side-view feature $R^{k-1}_i$ and the offsets $\Delta P^k_i$ to generate the k-th feature $R^k_{i \to c}$, as shown in Equation (3), where $H^k_{dcn}$ denotes the deformable convolution in the k-th block.
An offset generation branch is used in this work to learn the offset $\Delta P^k_i$. The side-view feature $R^{k-1}_i$ is added to the center-view feature $R_c$ and passed to a 1 × 1 convolution for feature reduction. Afterwards, a residual module is applied to enlarge the receptive field while maintaining a dense sampling rate. Thus, the residual module improves the angular dependencies between the center and side views. Finally, another 1 × 1 convolution is used for generating an offset feature.
A 1 × 1 convolution is then applied to the concatenated aligned features to incorporate the angular data, where [·, ·] denotes the concatenation and $H^k_{1 \times 1}$ denotes the 1 × 1 convolution. To super-resolve all LF images, the incorporated angular information needs to be encoded into each side view. Consequently, we perform feature distribution to propagate the incorporated angular information to the side views. Since the disparities between the side-view features and center-view features are mutual, we do not perform additional offset learning. Instead, we use the opposite offset $-\Delta P^k_i$ to warp the fused center-view feature $R^k_c$ to the i-th side view. Subsequently, the center-view feature $R^k_c$ and the side-view features $R^k_i$ ($i = 1, 2, \dots, A^2 - 1$) are generated by the k-th block.
In the proposed model, the alignment is performed between the center view and each side view. It is important to note that the number of alignments can influence the network model. Thus, the performance of the proposed model was analyzed for different numbers of alignments.
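The displayed equations of this subsection are missing from this version of the text. A plausible reconstruction from the surrounding definitions is given below for the collection, fusion and distribution steps. It is a sketch under stated assumptions, not the original equations: the inclusion of $R^{k-1}_c$ in the concatenation and the symbol $\tilde{H}^k_{dcn}$ for the second (distribution) deformable convolution are both assumptions.

```latex
\begin{align}
R^{k}_{i \to c} &= H^{k}_{dcn}\left(R^{k-1}_{i},\ \Delta P^{k}_{i}\right), \\
R^{k}_{c} &= H^{k}_{1 \times 1}\left(\left[R^{k}_{1 \to c},\ \dots,\ R^{k}_{(A^{2}-1) \to c},\ R^{k-1}_{c}\right]\right), \\
R^{k}_{i} &= \tilde{H}^{k}_{dcn}\left(R^{k}_{c},\ -\Delta P^{k}_{i}\right), \qquad i = 1, 2, \dots, A^{2} - 1.
\end{align}
```

The third line makes the design choice explicit: the offset learned for warping side view i to the center is reused with the opposite sign for warping the center back to side view i, so no second offset branch is needed.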

Reconstruction
For high reconstruction accuracy, spatial and angular data are used in the framework, and a reconstruction step was necessary to add the features for the LF image. Thus, multi-distillation blocks are used with a mechanism to extract and process hierarchical features with the aim to achieve a small number of parameters and, consequently, a low computational cost.
The outputs of the feature extraction and each alignment are processed by a 1 × 1 convolution. The coarsely fused feature goes to the stacked information blocks. In each information block, the input feature is processed by a 3 × 3 convolution and an activation function.
The narrow feature is fed to the bottleneck of the information block, and the wide feature goes to a 3 × 3 convolution. Subsequently, the features of the different stages are processed by a 1 × 1 convolution, and the feature of the last information block is processed by a 3 × 3 convolution that reduces its depth from 128 to 32.
A 1 × 1 convolution is applied to the reconstructed features, extending the depth to $\alpha^2 C$, where α is the upsampling factor. A pixel shuffle is then used for upscaling the reconstructed feature to a resolution of αH × αW. Finally, a 1 × 1 convolution is used to compress the number of feature channels.
Moreover, additional experiments have shown that the detail-restoration network can be substituted by a deeper or more complex network structure, which further improves the performance of LF reconstruction.
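The pixel shuffle step described above, which turns a feature of depth α²C at resolution H × W into a feature of depth C at resolution αH × αW, can be sketched in NumPy as follows. The channel-first layout and the particular channel ordering are assumptions (one common convention among deep learning libraries), and the function name is hypothetical.

```python
import numpy as np

def pixel_shuffle(features, factor):
    """Rearrange (factor**2 * C, H, W) features into (C, factor * H, factor * W)."""
    channels_in, height, width = features.shape
    if channels_in % (factor * factor) != 0:
        raise ValueError("channel count must be divisible by factor**2")
    channels_out = channels_in // (factor * factor)
    # Split the channel axis into (C, factor, factor) sub-pixel positions.
    blocks = features.reshape(channels_out, factor, factor, height, width)
    # Interleave the sub-pixel rows and columns with the spatial axes.
    blocks = blocks.transpose(0, 3, 1, 4, 2)
    return blocks.reshape(channels_out, height * factor, width * factor)

features = np.arange(4 * 2 * 3, dtype=np.float32).reshape(4, 2, 3)
upscaled = pixel_shuffle(features, factor=2)
print(upscaled.shape)
```

Each group of α² input channels fills one α × α block of output pixels, so the upscaling itself has no trainable parameters; the learning happens in the 1 × 1 convolution that produces the α²C channels.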

Model of the Network
In this work, input sparse views $S_0(x, y, s, t)$ with a resolution of (H, W, n, n) are used. Fixing one angular dimension, $t = t^*$ with $t^* \in \{1, 2, \dots, n\}$, a 3D volume with a resolution of (H, W, n) is extracted, as shown in Equation (6).
The volumes $Bl_{t^*}(x, y, s)$ are interspersed as $Bl_{t^*}(x, y, s)\uparrow$ to the resolution (H, W, N). Then, the details of $Bl_{t^*}(x, y, s)\uparrow$ are restored by $F_{r3d}(\cdot)$, forming the intermediate LF in Equation (7).
An angular domain conversion is performed to change from dimension t to dimension s. Fixing $s = s^*$ with $s^* \in \{1, 2, \dots, N\}$, 3D volumes are extracted from $Sl_{inter}(x, y, s^*, t)$, as shown in Equation (8).
These volumes, with a resolution of (H, W, n), are interspersed to $Bl_{s^*}(x, y, t)\uparrow$ at the same resolution as $Bl_{t^*}(x, y, s)\uparrow$. Then, the detail-restoration network, denoted as $F_{c3d}(\cdot)$, is used for recovering the details of $Bl_{s^*}(x, y, t)\uparrow$. The output $Sl_{out}(x, y, s, t)$, with a resolution of (H, W, N, N), is shown in Equation (9).

Details of Implementation of the CNN Model
In this work, a model with an angular resolution of 5 × 5 was used. The learning rate of our model was initially set to $4 \times 10^{-4}$ and was decreased by a factor of 0.5 every 10 epochs. The training phase finished at 50 epochs.
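The step-decay schedule described above can be written as a short function. This is a minimal sketch: the function name is hypothetical, and the assumption that epochs are counted from zero (so that the first halving takes effect at epoch 10) is not stated in the paper.

```python
def learning_rate(epoch, base_lr=4e-4, decay=0.5, step=10):
    """Step decay: multiply base_lr by `decay` once every `step` epochs (0-indexed)."""
    return base_lr * decay ** (epoch // step)

# Learning rate used in each of the 50 training epochs.
schedule = [learning_rate(epoch) for epoch in range(50)]
print(schedule[0], schedule[10], schedule[49])
```

Under this reading, the learning rate is halved four times during training and ends at one sixteenth of its initial value.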
The optimization of the training of the CNN model is performed by the mini-batch momentum Stochastic Gradient Descent (SGD) approach, and the filters of the CNN are initialized through a zero-mean Gaussian distribution.
Our model was implemented using Keras, a deep learning API written in Python, on a workstation with an Intel 3.6 GHz CPU and a Titan X GPU.
Tests are performed with the SR function and, for comparison, we also used the well-known activation function, Leaky ReLU.

Datasets
Some public LF datasets, namely INRIA [80], HCInew [81], EPFL [82], and HCIold [83], were used in this work. They are presented in Table 1 with their main characteristics. These datasets were chosen because they are the most used in the related works [24,75]. Table 1 presents the number of scenes for training and for testing of each dataset used in this work. The total number of scenes available in each dataset is given in the column named Scenes. The LFs of the datasets have an angular resolution, AngRes, of 9 × 9. In the training stage, each SAI was cropped into HR patches with a stride of 32. The bicubic downsampling approach was used for generating the LR patches with a resolution of 64 × 64. It is important to note that random horizontal flipping, vertical flipping and 90-degree rotation were performed in this work, augmenting the training data by eight times.
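The eight-fold augmentation mentioned above follows from combining an optional 90-degree rotation with optional horizontal and vertical flips (2 × 2 × 2 = 8 variants). The sketch below enumerates them for a single 2D patch; during training one variant would be drawn at random. The function name is hypothetical, and applying the transform to a single patch is a simplification: for a full LF, the same flip or rotation would presumably also have to be applied to the angular dimensions so that the views remain geometrically consistent.

```python
import numpy as np

def eight_fold_augment(patch):
    """Return the 8 variants from an optional 90-degree rotation and h/v flips."""
    variants = []
    for rotate in (False, True):
        base = np.rot90(patch) if rotate else patch
        for flip_h in (False, True):
            for flip_v in (False, True):
                out = base
                if flip_h:
                    out = out[:, ::-1]  # horizontal flip
                if flip_v:
                    out = out[::-1, :]  # vertical flip
                variants.append(np.ascontiguousarray(out))
    return variants

patch = np.arange(16, dtype=np.float32).reshape(4, 4)
variants = eight_fold_augment(patch)
print(len(variants))
```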
According to the related works [48,49,75], the Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) were used as quantitative metrics for image quality assessment.
The PSNR is determined using the following relation:

$$\mathrm{PSNR} = 10 \log_{10} \frac{(2^{d} - 1)^{2} \, W H}{\sum_{i=1}^{H} \sum_{j=1}^{W} \left(p[i, j] - p'[i, j]\right)^{2}}$$

where d represents the bit depth of a pixel, W represents the image width, H is the image height, and p[i, j] and p'[i, j] represent the pixel in the i-th row and j-th column of the original and reconstructed image, respectively. The SSIM is computed by:

$$\mathrm{SSIM}(P) = \frac{2 \mu_1(P) \mu_2(P) + C_1}{\mu_1(P)^2 + \mu_2(P)^2 + C_1} \cdot \frac{2\, \mathrm{cov}(P) + C_2}{s_1(P)^2 + s_2(P)^2 + C_2}$$

where $\mu_1(P)$ and $\mu_2(P)$ represent the mean values of seq1 and seq2 computed in a window located around the image position P; $s_1(P)$ and $s_2(P)$ represent the standard deviations of seq1 and seq2 computed over the same window; cov(P) is the covariance between seq1 and seq2 computed over the same window; $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$ represent the regularization constants, in which $K_1$ and $K_2$ are regularization parameters that must be > 0; and L is the dynamic range of the pixel values.
For measuring the computational efficiency, our proposed method was compared to the same methods used in the image quality assessment. For this performance comparison, the number of parameters, #Params, which measures the model size, and the number of FLOPs, which measures the computational cost, were captured.
Additionally, the training and execution times of our proposed method and of other related proposals are measured. It is important to note that the efficiency of our proposed method is measured for the 4 × SR scale.
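The two quality metrics can be sketched in NumPy as follows. This is a minimal illustration rather than the evaluation code used in the paper: the SSIM here treats the whole image as a single window instead of sliding a local window, and the values K1 = 0.01 and K2 = 0.03 are commonly used defaults assumed for illustration. Function names are hypothetical.

```python
import numpy as np

def psnr(reference, reconstructed, bit_depth=8):
    """PSNR in dB between two images of equal shape."""
    reference = np.asarray(reference, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    mse = np.mean((reference - reconstructed) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    peak = 2 ** bit_depth - 1
    return float(10.0 * np.log10(peak ** 2 / mse))

def ssim_single_window(reference, reconstructed, bit_depth=8, k1=0.01, k2=0.03):
    """SSIM with the whole image used as one window (simplifying assumption)."""
    x = np.asarray(reference, dtype=np.float64)
    y = np.asarray(reconstructed, dtype=np.float64)
    dynamic_range = 2 ** bit_depth - 1
    c1 = (k1 * dynamic_range) ** 2
    c2 = (k2 * dynamic_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = np.mean((x - mu_x) * (y - mu_y))
    luminance = (2 * mu_x * mu_y + c1) / (mu_x ** 2 + mu_y ** 2 + c1)
    structure = (2 * cov + c2) / (var_x + var_y + c2)
    return float(luminance * structure)

original = np.arange(64, dtype=np.float64).reshape(8, 8)
degraded = original + 1.0  # every pixel off by one grey level, so MSE = 1

print(psnr(original, degraded), ssim_single_window(original, degraded))
```

Both functions follow the formulas term by term: the mean squared error in `psnr` is the double sum divided by W·H, and the two factors in `ssim_single_window` correspond to the two fractions of the SSIM expression.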

Experimental Results
In this section, the main results about the use of AA in the network model, and the proposed model performance compared to other state-of-the-art models are presented.

Angular Alignment in the Network Model
In this subsection, we describe the tests used to define the network model of our proposed solution with respect to the AA.
The relation between the number of alignments (#AA), the average PSNR (Avg PSNR), and the average SSIM (Avg SSIM) is shown in Figure 2, which reports the average values for each dataset: INRIA [80], HCInew [81], EPFL [82], and HCIold [83].
Figure 2. Relation between the number of alignments and the average image quality scores using PSNR and SSIM, applied to the datasets INRIA [80], HCInew [81], EPFL [82], and HCIold [83].
It can be observed from Figure 2a,b that both metrics, PSNR and SSIM, converged when the number of AA reached five. It is important to note that the number of alignments represents the deformable convolutions in the feature distribution step, which play an important role in LF image SR. The reconstruction accuracy improves as the number of AA increases. However, the performance saturates at #AA = 5.

Image Quality Assessment
Images generated by our proposed method and by related methods from an image extracted from the INRIA [80] dataset are shown in Figure 3. It is important to note that the perceptual quality of the images generated by our proposed model, considering the 4 × SR scale, is compared with the ground-truth images.
The image quality assessment is quantitatively evaluated using objective metrics. In this test, all the images available in each dataset were used. Table 2 presents the PSNR scores obtained by our proposed model and other methods used for performance comparison. As can be observed in Table 2, the proposed model with the SR activation function achieved the highest PSNR scores.
Similar results are obtained using SSIM, in which our proposed method achieved the best performance as can be observed in Table 3.

Computational Efficiency
The comparison of our method to other methods was performed in terms of the number of parameters, #Params, and the number of FLOPs, in GFLOPs, and the results are shown in Table 4. As can be seen, our method uses a small number of parameters and a medium number of FLOPs. In addition, the simulation time of our proposed model is compared to that of other methods. The training takes around 7 h to converge using the Leaky ReLU activation function and around 5 h using our method with the SR activation function. Table 5 shows an average reduction of 37% in the training time and of 40% in the execution time using the SR activation function when compared to the related works. The execution time is measured as an average over the LF datasets INRIA [80], HCInew [81], EPFL [82], and HCIold [83]. It is worth noting that all methods used for performance comparison were run on the same workstation with an Intel 3.6 GHz CPU and a Titan X GPU.

Conclusions
In this work, we propose a new method to obtain better visual quality and to reduce the problem of disparity in LF images. The feature alignment procedure incorporates angular data and improves the image quality. Moreover, the experimental results verified the benefits of the proposed framework for the problem of depth estimation in LF images. In order to obtain more reliable and representative results, all methods used for comparison purposes were evaluated on datasets with different characteristics. Experimental results showed that our proposed framework obtained the best performance in relation to other methods, which demonstrates its versatility and good response under different image conditions. For the training and execution time of our proposed model, we verified an average reduction of 37% in the training time and of 40% in the execution time using the SR activation function when compared to the related works. We achieved a PSNR of 46.89 and an SSIM of 0.997 for a given dataset used in this work. Such values were achieved through the learning capacity provided by the multi-scale feature representation and the application of the activation function.
The choice of the number of AA proved to be of great importance in this work. Additionally, the SR activation function played a very important role in the network model in reducing the training and execution time.
In future work, we intend to explore the proposed model in the area of Biometrics LF Data, comparing the efficiency of our proposed model and recent state-of-the-art approaches. In addition, different activation functions will be tested to continue decreasing the time during the training and execution phases of the proposed framework.

Conflicts of Interest:
The authors declare that they have no conflict of interest.