Blind First-Order Perspective Distortion Correction Using Parallel Convolutional Neural Networks

In this work, we present a network architecture with parallel convolutional neural networks (CNNs) for removing perspective distortion in images. While other works generate corrected images through generative adversarial networks or encoder-decoder networks, we propose a method wherein three CNNs are trained in parallel, each predicting a certain element pair in the 3×3 transformation matrix, $\hat{M}$. The corrected image is produced by transforming the distorted input image using $\hat{M}^{-1}$. The networks are trained on a distorted image dataset that we generated from KITTI images. Experimental results show promise in this approach: our method corrects perspective distortions in images and outperforms other state-of-the-art methods. Our method also recovers the intended scale and proportion of the image, which is not observed in other works.


Introduction
Perspective distortion occurs if the objects in an image significantly differ in terms of scale and position, from how the objects are perceived by an observer [1]. This can be classified as first-order distortions modeled by multiplying an undistorted image with a transformation matrix M of size 3 × 3. First-order distortions can also be caused by an incorrect acquisition environment, such as capturing from an incorrect angle or motions of objects or the photographer. Higher-order distortions are typically caused by capturing a scene with an inappropriate focal length. For example, a wide-angle lens provides a greater angle of view than a normal lens but leads to objects appearing stretched and asymmetrical while the telephoto lens makes objects appear closer to one another than what is perceived in the scene [2].
To some extent, perspective distortion is intentionally applied to images to create artistic effects such as emphasizing a certain object in the scene by making it appear larger than others, and other artistic manipulations and scene editing proposed in the literature [3,4]. Distorted images affect the visual perception of objects in the scene and thus, perspective distortion correction is required on some aspects of photography and computer vision applications.
One area where perspective distortion correction is also needed is traffic surveillance, where distorted images affect the performance of vehicle recognition, license plate recognition [5], and other tasks such as speed estimation and distance measurement. Scanned documents may also appear warped or misaligned and need to be corrected for document analysis [6]. One or more distortion types may be present in a distorted image.
We present the following contributions of this study:

• Our network architecture corrects perspective distortion and produces visually better images than other state-of-the-art methods. In terms of pixel-wise reconstruction error, our method outperforms other works.
• Our method, to the best of our knowledge, is the first attempt to estimate the transformation matrix for correcting an image rather than using a reconstruction-based approach. Our method is straightforward and the network design is simpler compared to other works that mainly rely on deep generative models such as GANs or encoder-decoder networks, which are notoriously difficult to train and prone to instability.
• Our method also recovers the original scale and proportion of the image. This is not observed in other works. Recovering the scale and proportion is beneficial for applications that perform distance measurements.

Model-Based Techniques
Some works have been proposed where images are corrected, assuming distortion parameters are provided or available [14,15]. However, there are cases wherein information about the camera lens or acquisition system is unavailable, which inspired some studies on auto-calibration methods where distortion parameters are estimated [16][17][18][19]. Fitzgibbon proposed a single-image automatic distortion correction using a division model to approximate the radial distortion curve [20]. A lightweight auto-rectification method was proposed by Chaudhury et al. [21] where perspective distortions are corrected by performing a RANSAC-based vanishing point detector that restores parallelism of lines in the image. Similarly, the framework proposed by Santana-Cedrés et al. [22] uses a voting scheme for identifying vanishing points and performs perspective correction by simulating camera motion. More recently, an automatic perspective distortion correction for wide-angle portrait images captured on mobile devices was proposed, where a novel face objective term was introduced to properly correct face distortions and background distortions separately [23]. Some works use multiple images with different orientations to properly estimate distortion parameters [24][25][26]. To some extent, methods that combine multiple images for enhancement require some perspective transformation technique [27][28][29][30]. The same technique is implemented for performing image stitching [31][32][33][34][35].

Methods Using Low-Level Features
Low-level features, such as edges, lines, and vanishing points, have been explored for perspective distortion correction [20][21][22][36][37][38][39]. Wang et al. [18] used an improved Hough Transform for distortion correction, while Bukhari and Dailey [19] proposed a sampling method that robustly chooses the circular arcs and determines distortion parameters that are insensitive to outliers. Aside from using low-level features as parameters for distortion correction, some studies also rely on assumptions. For example, man-made structures in images are assumed to appear straight [40]. Lee et al. [41] proposed a set of criteria based on such assumptions for upright adjustment of photographs using an optimization-based calibration method. However, methods that rely on low-level features and assumptions do not work well across a variety of images and only work in specialized scenarios. Results from our experiments show that the method proposed by Chaudhury et al. [21] does not correctly rectify our distorted images.

Learning-Based Methods
Blind distortion correction is an ill-posed problem. Therefore, learning-based methods using only a single distorted image are being pursued [10][11][12][42][43][44][45][46]. Deep learning methods for correcting documents were proposed recently [12][44][45][46]; these implement convolutional neural networks, encoder-decoders, and U-net-based architectures [47]. Work on correcting portrait images used an encoder-decoder architecture [10]. The encoder-decoder architecture proposed by Li et al. [11], which is a predecessor of our work, aims to correct real-world images by predicting the distortion flow and further refining the correction by iterative resampling. Instead of using a multi-model network for predicting the distortion flow, we used multiple convolutional neural networks (CNNs) that run in parallel to predict the transformation matrix. Our network is trained purely for correcting perspective distortions, unlike the work of Li et al. [11], which corrects a wide range of distortion types, such as barrel and pincushion distortions. Furthermore, our results outperform the method of Li et al. [11], which occasionally generates incorrect rectifications even on the dataset used for its training (Places-365 dataset [48]). To some extent, our network properly generalizes to this dataset despite being trained on KITTI [49] images.

Empirical Analysis on the Transformation Matrix
The motivation behind having networks train in parallel to predict a certain element pair in M is discussed here. An image may be distorted under perspective imaging. A transformation mapping T is given by [13]:

$$T(\mathbf{x}) = M\mathbf{x} \tag{1}$$

where M is an m × n transformation matrix and $\mathbf{x}$ is a vector with n entries. The goal of all the networks is to learn a transformation matrix $\hat{M}$, given an H × W distorted image Ȋ. Ȋ is generated from an H × W original image I by creating a random 3 × 3 transformation M, then applying the said transformation to each (x, y) pixel in I. Given M, a pixel $(\breve{x}, \breve{y})$ in Ȋ can be represented as:

$$T(\mathbf{x}) = M \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} = \begin{bmatrix} m_{1,1} & m_{1,2} & m_{1,3} \\ m_{2,1} & m_{2,2} & m_{2,3} \\ m_{3,1} & m_{3,2} & m_{3,3} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{2}$$

Since M is homogeneous, $T(\mathbf{x})$ must be normalized to obtain the inhomogeneous equation [50]:

$$\breve{x} = \frac{m_{1,1}x + m_{1,2}y + m_{1,3}}{m_{3,1}x + m_{3,2}y + m_{3,3}}, \qquad \breve{y} = \frac{m_{2,1}x + m_{2,2}y + m_{2,3}}{m_{3,1}x + m_{3,2}y + m_{3,3}} \tag{3}$$

Given a single-entry matrix M ($m_{3,3} = 1$) and an input image I, we performed a frame-by-frame analysis of how each $m_{i,j} \in M$ ($1 \le i, j \le 3$) transforms I. In other words, we wanted to visualize the effect of each element in M and how these elements contribute to the overall distortion applied to I. The frames for $m_{i,j} \in M$ are generated by repeatedly incrementing that element. For example, the frames for $m_{1,1}$ are generated by repeatedly adding ∆ to $m_{1,1}$, where ∆ is chosen arbitrarily to produce observable frame animations. The origin point for all the generated frame animations is at the top left. Results are visualized in Figure 3.
Based on this experiment, we have identified the element pairs responsible for certain transformation behaviors (e.g., rotating or shearing an image) that a certain network can be trained to estimate. The elements are paired as follows:

• Since $m_{1,2}$ is multiplied by y and $m_{2,1}$ is multiplied by x in Equation (3), this creates a shearing effect along x and y, respectively. This is represented as an element pair, {$m_{1,2}$, $m_{2,1}$}.
• Since no other term is multiplied with $m_{1,3}$ and $m_{2,3}$ in Equation (3), increasing these entries results in pixel-wise displacements along x and y, respectively. These are not considered as input for the network as they are typically not observed in distorted images.

Based on this experiment, $m_{1,3}$, $m_{2,3}$ and $m_{3,3}$ can be excluded in training. Thus, Equation (2) can be simplified into the following:

$$T(\mathbf{x}) = \begin{bmatrix} m_{1,1} & m_{1,2} & 0 \\ m_{2,1} & m_{2,2} & 0 \\ m_{3,1} & m_{3,2} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{4}$$

The element pairs are used for training the network, and they also form the elements in M (seen in Equation (4)). Because M is invertible, we used M as ground-truth and $M^{-1}$ for removing distortion from image Ȋ.
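The effect of individual element pairs can be checked numerically with a short sketch. The code below is an illustration only, not the authors' implementation: it applies the normalized mapping of Equation (3) to a single pixel, and it assumes the excluded entries are fixed to $m_{1,3} = m_{2,3} = 0$ and $m_{3,3} = 1$. The function names and the keyword names (`scale`, `shear`, `third_row`) are hypothetical labels chosen for this example.

```python
def make_matrix(scale=(1.0, 1.0), shear=(0.0, 0.0), third_row=(0.0, 0.0)):
    """Build a 3x3 matrix from the three element pairs.

    Assumption: m13 = m23 = 0 and m33 = 1, i.e. the excluded entries are fixed.
    """
    m11, m22 = scale        # pair {m11, m22}
    m12, m21 = shear        # pair {m12, m21}
    m31, m32 = third_row    # pair {m31, m32}
    return [[m11, m12, 0.0],
            [m21, m22, 0.0],
            [m31, m32, 1.0]]


def warp_point(M, x, y):
    """Map pixel (x, y) through M and normalize by the third row (Equation (3))."""
    xh = M[0][0] * x + M[0][1] * y + M[0][2]
    yh = M[1][0] * x + M[1][1] * y + M[1][2]
    w = M[2][0] * x + M[2][1] * y + M[2][2]
    if w == 0:
        raise ZeroDivisionError("point maps to infinity")
    return xh / w, yh / w


# Increasing m12 shifts x in proportion to y, which is the shearing effect.
print(warp_point(make_matrix(), 10.0, 20.0))                  # (10.0, 20.0)
print(warp_point(make_matrix(shear=(0.5, 0.0)), 10.0, 20.0))  # (20.0, 20.0)
```

Varying one pair at a time while holding the others at identity values mirrors the frame-by-frame analysis described above, with a single pixel standing in for the image.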

Synthetic Distortion Dataset: dKITTI
Similar to our predecessor [11], where a synthetic distortion dataset is used, we used the KITTI dataset [49] to populate a set of distorted images and their corresponding M, which serves as the ground-truth transformation matrix. A distorted image in the dataset has a randomly generated M with respect to Equation (4). These image and M pairings form the distortion dataset, dKITTI. Figure 4 illustrates how we generated dKITTI for training. For each KITTI image, we generated a random M for distorting the image and automated the region selection to produce the final distorted image. The transformation matrix values used for generating dKITTI images are uniformly sampled from the ranges in Table 1. The region selection is performed by fitting a maximum bounding box (Figure 4) as follows:

1. Declare a bounding box B with a size of ($B_w$, $B_h$) in terms of width and height.
2. Iteratively decrease ($B_w$, $B_h$) until the number of zero pixels, P, becomes 0. B becomes the selected cropped image Ȋ.
3. Resize Ȋ by bilinear interpolation to the target width and height.

However, resizing the distorted image, Ȋ, implies that the 3D positioning of the image has changed and therefore M should be updated. Figure 5 illustrates this observation. $m_{1,1}$ and $m_{2,2}$ deal with the width and height of the image (seen in Figure 3), so these elements are updated to account for the resizing.

To avoid producing synthetic distorted images that are too extreme or far-fetched from real-world perspective distortions, we further refined our dataset generation by checking whether the edge distributions of the distorted and original images are about the same. More specifically, all distorted and original images go through an edge similarity check (using the Sobel operator [51]), where the difference in the total number of edge pixels between the distorted and original images should be less than 25%. This ensures that the loss of overall content from the original image is minimized. A distorted image is regenerated if it does not satisfy this threshold. Figure A3 shows some image samples used for training as well as those that were discarded.
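The region selection and the edge similarity check can be sketched as follows. This is a minimal illustration on grayscale images stored as nested lists, under several assumptions that the text does not specify: the bounding box stays centered and shrinks by a fixed step on each side, an edge pixel is one whose Sobel gradient magnitude exceeds a hypothetical `threshold`, and the 25% difference is measured relative to the edge count of the original image.

```python
def fit_bounding_box(img, step=1):
    """Shrink a centered box until it contains no zero pixels.

    Returns (left, top, width, height), or None if no such box exists.
    Assumption: the box is centered and shrinks symmetrically.
    """
    h, w = len(img), len(img[0])
    bw, bh = w, h
    while bw > 0 and bh > 0:
        left, top = (w - bw) // 2, (h - bh) // 2
        zeros = sum(1 for r in range(top, top + bh)
                    for c in range(left, left + bw) if img[r][c] == 0)
        if zeros == 0:
            return left, top, bw, bh
        bw -= 2 * step
        bh -= 2 * step
    return None


def sobel_edge_count(img, threshold=100):
    """Count interior pixels whose Sobel gradient magnitude exceeds threshold."""
    h, w = len(img), len(img[0])
    count = 0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = (img[r - 1][c + 1] + 2 * img[r][c + 1] + img[r + 1][c + 1]
                  - img[r - 1][c - 1] - 2 * img[r][c - 1] - img[r + 1][c - 1])
            gy = (img[r + 1][c - 1] + 2 * img[r + 1][c] + img[r + 1][c + 1]
                  - img[r - 1][c - 1] - 2 * img[r - 1][c] - img[r - 1][c + 1])
            if (gx * gx + gy * gy) ** 0.5 > threshold:
                count += 1
    return count


def edges_similar(original, distorted, max_diff=0.25):
    """True if the edge-pixel counts differ by less than max_diff (relative)."""
    n_orig = sobel_edge_count(original)
    n_dist = sobel_edge_count(distorted)
    if n_orig == 0:
        return n_dist == 0
    return abs(n_orig - n_dist) / n_orig < max_diff
```

In a generation loop, a sample failing `edges_similar` would be discarded and a new random M drawn, matching the regeneration rule described above.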

Proposed Network
Our proposed network consists of three sub-networks, each trained to produce a certain element pair in $\hat{M}$, the estimate of the transformation matrix that caused the distortion in the input image. The corrected image is obtained by transforming the distorted input image using $\hat{M}^{-1}$. More specifically, all three sub-networks take Ȋ, a cropped distorted image, as input (Figure 4), where the goal is to produce {$\hat{m}_{3,1}$, $\hat{m}_{3,2}$}, {$\hat{m}_{1,1}$, $\hat{m}_{2,2}$} and {$\hat{m}_{1,2}$, $\hat{m}_{2,1}$} and minimize their difference to {$m_{3,1}$, $m_{3,2}$}, {$m_{1,1}$, $m_{2,2}$} and {$m_{1,2}$, $m_{2,1}$} during training. The basis of the element pairs for each network is discussed in Section 3. We refer to these networks as N({$m_{3,1}$, $m_{3,2}$}), N({$m_{1,1}$, $m_{2,2}$}) and N({$m_{1,2}$, $m_{2,1}$}), respectively. This makes training faster and yields better results than having only one network produce $\hat{M}$. We justify this claim in Section 6.1.
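How the three predicted pairs combine into a correction can be illustrated with a small sketch. It is not the authors' code: the pair values are made-up numbers standing in for network outputs, the helper names are hypothetical, and the excluded entries are assumed fixed to $m_{1,3} = m_{2,3} = 0$ and $m_{3,3} = 1$. The inverse uses the standard adjugate formula for a 3 × 3 matrix.

```python
def assemble_matrix(pair_31_32, pair_11_22, pair_12_21):
    """Combine the three predicted element pairs into one 3x3 matrix."""
    m31, m32 = pair_31_32
    m11, m22 = pair_11_22
    m12, m21 = pair_12_21
    return [[m11, m12, 0.0],
            [m21, m22, 0.0],
            [m31, m32, 1.0]]


def invert_3x3(M):
    """Invert a 3x3 matrix with the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = M
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("matrix is singular")
    return [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]


def apply_matrix(M, x, y):
    """Map (x, y) through M with homogeneous normalization."""
    w = M[2][0] * x + M[2][1] * y + M[2][2]
    return ((M[0][0] * x + M[0][1] * y + M[0][2]) / w,
            (M[1][0] * x + M[1][1] * y + M[1][2]) / w)


# Hypothetical outputs of the three sub-networks for one image.
M_hat = assemble_matrix((1e-4, -2e-4), (1.1, 0.9), (0.05, -0.03))
distorted = apply_matrix(M_hat, 100.0, 50.0)
restored = apply_matrix(invert_3x3(M_hat), *distorted)
print(distorted, restored)
```

If the predicted pairs match the matrix that produced the distortion, mapping each distorted pixel through the inverse returns it to its original position, which is the round trip shown for one point here.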

Parallel CNN Model
The architectural design of our network is shown in Figure 6. Three instances of this network predict the element pairs in $\hat{M}$, and each instance is trained in parallel. Similarly, the three networks are used in parallel for inference. The CNN accepts an input image of size 1442 × 575. The input passes through the pre-trained DenseNet [52] layers, followed by 9 convolutional layers. Each layer uses max-pooling operations and ReLU activations. The last convolutional layer is connected to a fully connected layer which outputs {$\hat{m}_{i,j}$, $\hat{m}_{k,l}$} ∈ $\hat{M}$, with i, j, k, l ∈ {1, 2, 3}.

Training Details
Each network N({$m_{3,1}$, $m_{3,2}$}), N({$m_{1,1}$, $m_{2,2}$}) and N({$m_{1,2}$, $m_{2,1}$}) is trained to minimize the mean squared error (MSE) between its assigned element pair in $\hat{M}$ and the corresponding element pair in the ground-truth M. The total loss function L combines three terms, $L_1$, $L_2$ and $L_3$, one per network, each computed as the MSE of the respective element pair over n, the number of observed inputs. The penalty terms α, β and γ are applied to the corresponding element pairs based on the sensitivity observed in the experiment discussed in Section 3.
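One possible reading of this loss is sketched below. The exact placement of the penalty terms is not recoverable from the text, so the sketch assumes they act as multiplicative weights on the three per-pair MSE terms, and that each term sums the squared errors of both elements and divides by n. The function names are hypothetical.

```python
def pair_mse(predicted, target):
    """MSE of one element pair over n observations.

    predicted, target: lists of (first, second) tuples of equal length n.
    Assumption: squared errors of both elements are summed, then divided by n.
    """
    n = len(predicted)
    total = 0.0
    for (p1, p2), (t1, t2) in zip(predicted, target):
        total += (p1 - t1) ** 2 + (p2 - t2) ** 2
    return total / n


def total_loss(l1, l2, l3, alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of the three per-network losses (assumed form)."""
    return alpha * l1 + beta * l2 + gamma * l3


l1 = pair_mse([(0.001, 0.002)], [(0.001, 0.001)])   # pair {m31, m32}
l2 = pair_mse([(1.10, 0.90)], [(1.00, 1.00)])       # pair {m11, m22}
l3 = pair_mse([(0.05, -0.03)], [(0.05, -0.03)])     # pair {m12, m21}
print(total_loss(l1, l2, l3, alpha=2.0, beta=1.0, gamma=1.0))
```

Because each sub-network owns one term, the three terms can be minimized independently, which is consistent with training the networks in parallel.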
We trained the networks using an NVIDIA RTX 2080Ti GPU and the networks converge at around 20 epochs. We observed that during training, while some networks converge faster than the others, there were no overfitting incidents. Hence, we let all networks train until all networks have converged to an acceptable loss.

Evaluation
We evaluated our network architecture using the dKITTI dataset. The network is trained with 95,330 distorted images, while we performed an evaluation on the validation set containing 5018 images. The images are 1442 × 575 pixels in size.
We measured the following in terms of transformation matrix error: absolute relative error, square relative error, and root mean squared error (RMSE). The same metrics are used for measuring the pixel-wise error, while structural image similarity (SSIM) [54] is used for checking image reconstruction quality. We also measured the failure rate, which is the percentage of images in the validation set that are not properly corrected, such as in the case of homography estimation [55] where it fails to produce visually better images than the input. Performance results are shown in Table 2 and the best results are shown in Figure 7. Figure 8 shows the results of manually picked images that have observable distortion and only depict a small region from the original image. Additional image results are shown in Figures A1 and A2 in Appendix A.
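The three error metrics can be sketched as below. The text does not give their formulas, so the definitions here are assumptions borrowed from how these metrics are commonly defined in depth-estimation benchmarks: errors are averaged over all entries, and the relative errors are divided by the magnitude of the ground-truth value, with a small `eps` added to avoid division by zero.

```python
import math


def abs_rel(pred, gt, eps=1e-8):
    """Mean of |pred - gt| / |gt| over all entries (assumed definition)."""
    return sum(abs(p - g) / (abs(g) + eps) for p, g in zip(pred, gt)) / len(gt)


def sq_rel(pred, gt, eps=1e-8):
    """Mean of (pred - gt)^2 / |gt| over all entries (assumed definition)."""
    return sum((p - g) ** 2 / (abs(g) + eps) for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root mean squared error over all entries."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


# Flattened element pairs of a predicted and a ground-truth matrix (made-up values).
predicted = [1.08, 0.93, 0.04, -0.02]
truth = [1.10, 0.90, 0.05, -0.03]
print(abs_rel(predicted, truth), sq_rel(predicted, truth), rmse(predicted, truth))
```

The same functions apply to pixel-wise error by passing flattened pixel intensities of the corrected and original images instead of matrix elements.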
We compared our method with the following: the dataset transformation matrix mean, which is used as a baseline; the homography estimation method [55]; and the methods proposed by Li et al. [11] and Chaudhury et al. [21]. Homography estimation is computed by estimating $\hat{M}^{-1}$ for a given distorted image Ȋ (Equation (2)) such that the back-projection error to the corrected image I is minimized. Homography estimation is not a blind distortion correction technique, but it is included for comparison. We used the ORB detector [56] to detect feature points in Ȋ and I, then used RANSAC [57] to minimize the error. We set a threshold for considering matches only within a certain Euclidean distance, to minimize outliers. For the work of Li et al. [11], we used their pre-trained model, specifically their multi-model distortion network with resampling for generating the corrected image. For the work of Chaudhury et al. [21], we used their independent auto-rectifier algorithm with default parameters.

As shown in Table 2, our network architecture outperforms the other methods. To validate the robustness of our network, we input images with extreme distortions by sampling images with the minimum and maximum transformation matrix values in Table 1. Figure 9 shows that our network corrects images with extreme distortions and performs better than other methods. Since homography estimation relies on detecting feature points in the images, there are occasions where very few feature points are available (incorrect warping observed in Figure 9). Hence, the transformation matrix cannot be inferred properly for some images. In effect, homography estimation cannot be performed on 13.90% of the images in the validation set (specified in Table 2). Our method guarantees that $\hat{M}$ can be inferred for all images in the validation set.
The distortion parameters produced by the methods of Li et al. [11] and Chaudhury et al. [21] have some limitations and can be further improved as follows:

• Neither method considers the scaling of images as a possible factor in perspective distortion, unlike our method, as discussed in Section 3.
• Images with low texture and those with shearing, as seen in the examples in Figures 8 and 9, affect the correction. This is more evident in the method of Chaudhury et al. [21], which can only handle limited distortions on images. Our method is observed to be robust to these limitations.
• Some images are misclassified as a different distortion type by the method of Li et al. [11]. For example, in Figure 8, the third image of row A is misclassified as a barrel or pincushion distortion, which resulted in a different corrected image. Our method covers more cases of perspective distortion.

As seen in our results, our method consistently produces corrected images.

Experiment on Network Variants
We conducted an experiment to validate the effectiveness of parallel CNNs for perspective distortion correction. The network variants are described in Table 3. Model A uses DenseNet as the pre-trained layer, as proposed in Figure 6. Model B uses pre-trained ResNet-161 [58] layers instead of DenseNet layers. Model C does not use any pre-trained layer. Model D is similar to Model A except that only one instance is trained; its fully connected layer outputs {$\hat{m}_{1,1}$, $\hat{m}_{1,2}$, $\hat{m}_{2,1}$, $\hat{m}_{2,2}$, $\hat{m}_{3,1}$, $\hat{m}_{3,2}$}. The results are summarized in Table 4.
Model C appears to have the lowest transformation matrix error among the variants, but the lowest pixel-wise error and the highest SSIM of the corrected images were produced by Model A. The results also show that predicting grouped element pairs and training three network instances in parallel is better than using a single network instance. Hence, Model A is the primary network architecture used for correcting distorted images.

Closeness of Estimations to Ground-Truth
We randomly selected 500 images each from the training and validation sets, then compared the predicted $\hat{M}$ against the ground-truth M. The norms of $\hat{M}$ and M are plotted in Figure 10. In terms of norm value, our network's predictions fall short of the ground-truth by a mean margin of 0.0239. This difference is very small and visually negligible, as observed from the image results. The scatter plot also shows that our prediction distribution is almost the same as the ground-truth distribution of the training and test sets.

We validated whether our network can correct images with different $m_{1,1}$ and $m_{2,2}$ values. As stated in Section 3, these elements deal with the scaling of images and should be considered in modelling perspective distortion. We generated 276 distorted images from KITTI where only $m_{1,1}$ and $m_{2,2}$ are uniformly randomized and then used our proposed network for predicting $\hat{M}$. Table 5 summarizes the transformation matrix error and pixel-wise error metrics. The best image results are shown in Figure 11. Based on the results, our network can recover the original scale of the image, which cannot be performed by other methods.
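The norm comparison can be reproduced in outline with the sketch below. The text does not state which matrix norm is plotted, so the Frobenius norm is assumed here, and the margin is taken as the ground-truth norm minus the predicted norm, so that a positive value means the prediction falls short. The matrices are made-up values for illustration.

```python
import math


def frobenius_norm(M):
    """Square root of the sum of squared entries (assumed norm)."""
    return math.sqrt(sum(v * v for row in M for v in row))


def mean_norm_margin(ground_truth, predicted):
    """Mean of ||M|| - ||M_hat|| over paired matrices."""
    margins = [frobenius_norm(m) - frobenius_norm(m_hat)
               for m, m_hat in zip(ground_truth, predicted)]
    return sum(margins) / len(margins)


M = [[1.10, 0.05, 0.0], [-0.03, 0.90, 0.0], [1e-4, -2e-4, 1.0]]
M_hat = [[1.08, 0.04, 0.0], [-0.02, 0.89, 0.0], [1e-4, -1e-4, 1.0]]
print(mean_norm_margin([M], [M_hat]))  # positive when the prediction falls short
```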

Activation Visualization
We analyzed how our network behaves by visualizing the gradient-weighted activation maps of the convolutional layers, using the technique of Selvaraju et al. [59]. Figure 12 illustrates the feature maps. As observed in the visualizations, our network tends to extract edges, outlines, then certain regions of the images. The first layer gravitates towards the edges, lines, and contours. For each succeeding layer, the low-level features are being grouped where all edges, lines, and contours appear to be grouped on the 4th layer. Succeeding layers tend to activate on specific regions of the images where the last layer appears to focus on the overall orientation of the image.

Model Generalization
We tested our network on unseen data by using test images from the Places-205 [48] dataset. A total of 612 images from Places-205 were randomly selected and distorted, where the majority of images have little to no presence of cars and roads. Thus, the images are in an entirely different domain from the KITTI dataset. Table 6 shows the accuracy metrics. Figure 13 illustrates the best results. Notice that our network can still recover the corrected image properly as compared to other methods. Although the scene context is different, the distortion parameters are similar, and our network can still infer the transformation matrix to correct the image. We speculate that our network is invariant to scene composition because the activation maps (discussed in Section 6.3) focus more on edges and lines in the image.

Table 6. Accuracy metrics using the Places-205 dataset [48]. Best performance in bold. Our network was not trained using images from Places-205, but still outperforms other methods.


Limitations
We observed that our network could not properly correct outdoor images with repeating textures or indoor scenes with text or cluttered objects. These examples are shown in Figure 14. Since our network does not recognize specific objects or semantic information, it cannot correct images with a dense amount of objects and repetitive textures such as rocks. The network was not trained with any indoor scenes and thus produces incorrect distortion parameters for them. We believe that the straightforward solution is to retrain the network with a greater variety of images or to perform domain adaptation.
We also attempted to investigate the limits of our trained network, using panoramic images from the Internet. For an image to be compatible with our network, we either resized the image to 1442 × 575, assuming the original aspect ratio is preserved, or cropped an area of similar size in the image, with the center as the origin. Results are shown in Figure 15. Panoramic images will most often involve a combination of different distortions, some of which are higher-order distortions, such as barrel or pincushion distortions. However, the results visually show that our network has attempted to correct the images' orientation and reduced the stretching in some areas as compared to other methods.
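The choice between resizing and center cropping can be sketched as a small helper. This is an illustration under assumptions: the aspect-ratio tolerance `tol` is a hypothetical parameter, and the crop box is returned as (left, top, right, bottom) pixel coordinates rather than applied to an actual image.

```python
TARGET_W, TARGET_H = 1442, 575  # network input size stated in the text


def same_aspect_ratio(width, height, tol=0.01):
    """True if the image ratio is within tol (relative) of the target ratio."""
    target = TARGET_W / TARGET_H
    return abs(width / height - target) <= tol * target


def center_crop_box(width, height):
    """Box of the target size centered in the image."""
    if width < TARGET_W or height < TARGET_H:
        raise ValueError("image is smaller than the target size")
    left = (width - TARGET_W) // 2
    top = (height - TARGET_H) // 2
    return left, top, left + TARGET_W, top + TARGET_H


def prepare(width, height):
    """Decide how to fit an image to the network input."""
    if same_aspect_ratio(width, height):
        return ("resize", (TARGET_W, TARGET_H))
    return ("crop", center_crop_box(width, height))


print(prepare(2884, 1150))  # ('resize', (1442, 575))
print(prepare(2000, 1000))  # ('crop', (279, 212, 1721, 787))
```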

Conclusions
We proposed a blind first-order perspective distortion correction method by using three convolutional neural networks in inferring the transformation matrix for correcting an image where these networks are trained and used in parallel. We discovered that elements in the transformation matrix can be grouped because they perform a specific transformation to the image such as scaling or skewing, which is the rationale behind our approach and design of the network. Our proposed method shows promising results, as shown by outperforming other state-of-the-art methods. Our network can generalize properly on a different domain as well as recover the intended scale and proportion of the image, which could be used for images that appear stretched, making objects in the image appear close to their original scales.
Our network cannot correct images with repeating textures as well as indoor scenes with texts or cluttered objects. We speculate that this could be solved by adding more training samples that cover such cases. We plan to explore how images with higher-order distortions can be corrected, without relying on generative or encoder-decoder architectures which to some extent, was already performed by Li et al. [11] for reconstructing the intermediate flow representation of the distorted image. It would be interesting to use the same strategy (Section 3) that we proposed.

Conflicts of Interest:
The authors declare no conflict of interest.

Appendix A
Additional results are shown in the next figures. The source code for this work is available at: https://github.com/NeilDG/NeuralNets-ImageCorrection. The pre-trained model can be requested by emailing the authors.