1. Introduction
Perspective distortion occurs when the objects in an image significantly differ, in scale and position, from how they are perceived by an observer [1]. Such distortions can be classified as first-order distortions, modeled by multiplying an undistorted image with a $3 \times 3$ transformation matrix $M$. First-order distortions can also be caused by an incorrect acquisition environment, such as capturing from an incorrect angle or motion of the objects or of the photographer. Higher-order distortions are typically caused by capturing a scene with an inappropriate focal length. For example, a wide-angle lens provides a greater angle of view than a normal lens but makes objects appear stretched and asymmetrical, while a telephoto lens makes objects appear closer to one another than they are perceived in the scene [2].
To some extent, perspective distortion is intentionally applied to images to create artistic effects, such as emphasizing a certain object in the scene by making it appear larger than the others, as well as other artistic manipulations and scene edits proposed in the literature [3,4]. Distorted images affect the visual perception of objects in the scene; thus, perspective distortion correction is required in several photography and computer vision applications.
One area where perspective distortion correction is also needed is traffic surveillance, where distorted images degrade the performance of vehicle recognition, license plate recognition [5], and other tasks such as speed estimation and distance measurement. Scanned documents may also appear warped or misaligned and need to be corrected before document analysis [6].
Image registration algorithms typically use transformation matrices that map an image to a different position or orientation in Euclidean space [7]. In this study, we propose a framework for correcting first-order distortions using multiple convolutional neural networks trained in parallel, which together compose the $3 \times 3$ transformation matrix $\hat{M}$ of a distorted image, where the ground-truth matrix $M$ is the transformation that caused the distortion (Figure 1). The distortion types that can be corrected by our proposed network are shown in Figure 2. The key idea of our approach is that each network is trained to produce a specific element pair in $\hat{M}$, which contributes to a specific effect in the image, i.e., an element pair inducing a shear effect or a scale effect. The inverse $\hat{M}^{-1}$ is then applied to the distorted image to produce the corrected image. Since each network only produces a specific element pair in $\hat{M}$, our method corrects the image simply by applying a transformation, unlike approaches that generate corrected images using GAN or encoder-decoder architectures, which are more difficult to train and prone to instability such as mode collapse [8,9]. While our method requires multiple networks to correct an image, it results in a smaller computational footprint because each CNN has fewer hidden layers than other architectures [10,11,12] involving deep networks.
We present the following contributions of this study:
Our network architecture corrects perspective distortion and produces visually better images than other state-of-the-art methods. In terms of pixel-wise reconstruction error, our method outperforms other works.
Our method, to the best of our knowledge, is the first attempt to estimate the transformation matrix for correcting an image rather than using a reconstruction-based approach. Our method is straightforward and the network design is simpler compared to other works that mainly rely on deep generative models such as GANs or encoder-decoder networks, which are notoriously difficult to train and prone to instability.
Our method also recovers the original scale and proportion of the image. This is not observed in other works. Recovering the scale and proportion is beneficial for applications that perform distance measurements.
3. Empirical Analysis on the Transformation Matrix
The motivation behind training networks in parallel to predict specific elements in $M$ is discussed here. An image may be distorted under perspective imaging. A transformation mapping $M$ is given by [13]:

$$\mathbf{x}' = M\mathbf{x} \tag{1}$$

where $M$ is an $n \times n$ transformation matrix and $\mathbf{x}$ is a vector with $n$ entries.

The goal of all the networks is to learn the transformation matrix, given a distorted image as input. The distorted image is generated from the original image $I$ by creating a random $3 \times 3$ transformation $M$ and then applying this transformation to each pixel in $I$. Given $M$, the mapping of a pixel $(x, y)$ in homogeneous coordinates can be represented as:

$$\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} = \begin{bmatrix} m_{11} & m_{12} & m_{13} \\ m_{21} & m_{22} & m_{23} \\ m_{31} & m_{32} & m_{33} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{2}$$

Since $M$ is homogeneous, the result must be normalized to obtain the inhomogeneous equation [50]:

$$x' = \frac{m_{11}x + m_{12}y + m_{13}}{m_{31}x + m_{32}y + m_{33}}, \qquad y' = \frac{m_{21}x + m_{22}y + m_{23}}{m_{31}x + m_{32}y + m_{33}} \tag{3}$$
Given a matrix $M$ in which a single entry $m_{ij}$ is varied at a time, and an input image $I$, we performed a frame-by-frame analysis on how $m_{ij}$ transforms $I$. In other words, we wanted to visualize the effect of each element in $M$ and how these elements contribute to the overall distortion applied to $I$. The frames for each $m_{ij}$ are generated by repeatedly incrementing that element, i.e., by repeatedly adding a small constant $\delta$ to it, where $\delta$ is chosen arbitrarily to produce observable frame animations. The origin point for all the generated frame animations is the top left. The results are visualized in Figure 3.
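As an illustration of this procedure, the sketch below perturbs a single matrix element and repeatedly warps the image; the use of OpenCV's warpPerspective, the step size, and the file names are our own assumptions for illustration, not the exact code used in the experiment.

```python
import numpy as np
import cv2  # assumed here for the projective warp; any equivalent routine would do

def element_frames(image, row, col, step=1e-4, n_frames=10):
    """Yield warped frames while repeatedly incrementing M[row, col].

    Starts from the identity matrix, so each frame isolates the effect of the
    single element m_{row+1, col+1} on the image (cf. Figure 3).
    """
    h, w = image.shape[:2]
    M = np.eye(3, dtype=np.float64)
    for _ in range(n_frames):
        M[row, col] += step
        # warpPerspective applies the homogeneous mapping of Equation (3)
        yield cv2.warpPerspective(image, M, (w, h))

# Example: visualize the effect of m31 (first entry of the last row).
img = cv2.imread("example.png")  # hypothetical input image
for i, frame in enumerate(element_frames(img, row=2, col=0, step=1e-5)):
    cv2.imwrite(f"m31_frame_{i:02d}.png", frame)
```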
Based on this experiment, we have identified the element pairs responsible for certain transformation behaviors (e.g., rotating or shearing an image), each of which a dedicated network can be trained to estimate. The elements are paired as follows:
Small changes in $m_{31}$ result in a sideways rotation along the Y axis, while small changes in $m_{32}$ result in a shearing operation, where the image's bottom-left and bottom-right anchor points move sideways and upwards. Equation (3) shows that increasing $m_{31}$ and $m_{32}$ causes $x'$ and $y'$ to shrink, since both appear in the denominator. This is represented as the element pair $(m_{31}, m_{32})$.
Based on Equation (3), $m_{11}$, $m_{22}$, and $m_{33}$ deal with the scale of the image. The entries $m_{11}$ and $m_{22}$ deal with the width and height of the image, respectively. Since $m_{33}$ is part of the denominator, it changes both the width and height of the image. We do not need to use $m_{33}$ as input when training our network because $m_{11}$ and $m_{22}$ can be inferred instead. This is represented as the element pair $(m_{11}, m_{22})$.
Since $m_{12}$ is multiplied by $y$ and $m_{21}$ is multiplied by $x$ in Equation (3), these entries create shearing effects along $x$ and $y$, respectively. This is represented as the element pair $(m_{12}, m_{21})$.
Since no other term is multiplied with $m_{13}$ and $m_{23}$ in Equation (3), increasing these entries results in pixel-wise displacements along $x$ and $y$, respectively. These entries are not considered as input for the network because such translations are typically not observed in distorted images.
Based on this experiment, $m_{13}$, $m_{23}$, and $m_{33}$ can be excluded from training. Thus, Equation (2) can be simplified into the following:

$$\begin{bmatrix} x' \\ y' \\ w' \end{bmatrix} = \begin{bmatrix} m_{11} & m_{12} & 0 \\ m_{21} & m_{22} & 0 \\ m_{31} & m_{32} & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix} \tag{4}$$

The element pairs are used for training the network and also form the elements in $M$ (seen in Equation (4)). Because $M$ is invertible, we use $M$ as the ground-truth and $M^{-1}$ for removing the distortion from the distorted image.
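A minimal sketch of this correction step is given below, assuming OpenCV for the warp; the function and variable names are ours, not the notation of our implementation.

```python
import numpy as np
import cv2

def assemble_matrix(pair_scale, pair_shear, pair_perspective):
    """Build the matrix of Equation (4) from the three element pairs:
    (m11, m22) scale, (m12, m21) shear, (m31, m32) perspective."""
    m11, m22 = pair_scale
    m12, m21 = pair_shear
    m31, m32 = pair_perspective
    return np.array([[m11, m12, 0.0],
                     [m21, m22, 0.0],
                     [m31, m32, 1.0]], dtype=np.float64)

def correct(distorted, pair_scale, pair_shear, pair_perspective):
    """Apply the inverse of the (predicted) matrix to undo the distortion."""
    M_hat = assemble_matrix(pair_scale, pair_shear, pair_perspective)
    h, w = distorted.shape[:2]
    return cv2.warpPerspective(distorted, np.linalg.inv(M_hat), (w, h))
```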
4. Synthetic Distortion Dataset: dKITTI
Similar to our predecessor [11], where a synthetic distortion dataset is used, we used the KITTI dataset [49] to populate a set of distorted images and their corresponding ground-truth transformation matrices $M$. A distorted image in the dataset has a randomly generated $M$ with respect to Equation (4). These image and $M$ pairings form the distortion dataset, dKITTI.
Figure 4 illustrates how we generated dKITTI for training. For each KITTI image, we generated a random $M$ for distorting the image and automated the region selection to produce the final distorted image. The transformation matrix values used for generating dKITTI images are uniformly sampled from the ranges in Table 1. The region selection is performed by fitting a maximum bounding box (Figure 4) as follows (a sketch of this procedure is given after the steps):

Declare a bounding box $B$ with the same width and height as $W$, where $W$ refers to the generated distorted image.

Iteratively decrease the size of $B$ until the number of zero pixels, $P$, inside it becomes 0. $B$ then becomes the selected cropped image.

Resize the cropped image by bilinear interpolation to the size of the original image $O$.
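The sketch below captures these steps under our own assumptions: the box shrinks symmetrically around the image center, "zero pixels" are the black regions introduced by the warp, and the function names are ours.

```python
import numpy as np
import cv2

def max_bounding_box_crop(distorted, original, shrink_step=2):
    """Crop the largest centered box that contains no zero (empty) pixels,
    then resize it back to the original image size with bilinear interpolation."""
    h, w = distorted.shape[:2]
    top, bottom, left, right = 0, h, 0, w
    gray = cv2.cvtColor(distorted, cv2.COLOR_BGR2GRAY)
    while top < bottom and left < right:
        box = gray[top:bottom, left:right]
        if np.count_nonzero(box == 0) == 0:   # P == 0: no empty pixels remain
            break
        top += shrink_step; bottom -= shrink_step
        left += shrink_step; right -= shrink_step
    crop = distorted[top:bottom, left:right]
    oh, ow = original.shape[:2]
    return cv2.resize(crop, (ow, oh), interpolation=cv2.INTER_LINEAR)
```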
However, resizing the cropped distorted image implies that the 3D positioning of the image has changed; therefore, $M$ should be updated. Figure 5 illustrates this observation. The entries $m_{11}$ and $m_{22}$ deal with the width and height of the image (seen in Figure 3), and these elements are updated accordingly to account for the resizing.
To avoid producing synthetic distorted images that are too extreme or far removed from real-world perspective distortions, we further refined our dataset generation by checking whether the edge distributions of the distorted and original images are about the same. More specifically, every distorted and original image pair goes through an edge similarity check (using the Sobel operator [51]), where the difference in the total number of edge pixels between the distorted and original images must be below a set threshold. This ensures that the loss of overall content from the original image is minimized. Distorted images that do not satisfy this threshold are regenerated. Figure A3 shows some image samples used for training as well as samples that were discarded.
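A sketch of this check is shown below; the Sobel parameters, the gradient-magnitude threshold, and the numeric difference threshold are placeholders, since the exact values are not restated here.

```python
import numpy as np
import cv2

def edge_pixel_count(image, magnitude_threshold=100):
    """Count edge pixels using the Sobel gradient magnitude."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    magnitude = np.sqrt(gx ** 2 + gy ** 2)
    return int(np.count_nonzero(magnitude > magnitude_threshold))

def passes_edge_similarity(original, distorted, max_difference=5000):
    """Accept a distorted sample only if its edge-pixel count stays close to
    the original's; max_difference stands in for the unspecified threshold."""
    return abs(edge_pixel_count(original) - edge_pixel_count(distorted)) < max_difference
```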
5. Proposed Network
Our proposed network consists of three sub-networks, each trained to produce a specific element pair in $\hat{M}$, the transformation matrix that caused the distortion in the input image. The corrected image is obtained by transforming the distorted input image using $\hat{M}^{-1}$. More specifically, all three sub-networks take the cropped distorted image as input (Figure 4), and the goal is to produce the pairs $(\hat{m}_{31}, \hat{m}_{32})$, $(\hat{m}_{11}, \hat{m}_{22})$, and $(\hat{m}_{12}, \hat{m}_{21})$ while minimizing their difference to the corresponding pairs in the ground-truth $M$ during training. The basis of the element pairs for each network is discussed in Section 3. We refer to these as the $(m_{31}, m_{32})$, $(m_{11}, m_{22})$, and $(m_{12}, m_{21})$ networks, respectively. This makes training faster and yields better results than having only one network produce $\hat{M}$. We justify this claim in Section 6.1.
5.1. Parallel CNN Model
The architectural design of our network is shown in Figure 6. Three instances of this network predict the element pairs in $\hat{M}$, where each network is trained in parallel. Similarly, the three networks are used in parallel during inference. The CNN accepts a fixed-size input image. The input goes through pre-trained DenseNet [52] layers, followed by 9 convolutional layers. Each layer uses max-pooling operations and ReLU activations. The last convolutional layer is connected to a fully connected layer which outputs the assigned element pair.
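A minimal PyTorch sketch of one sub-network is given below. The backbone variant (densenet121), the truncation point of the pre-trained layers, the channel widths, and the use of only three convolutional blocks (rather than the nine described above) are our own simplifications for illustration.

```python
import torch
import torch.nn as nn
from torchvision import models

class ElementPairNet(nn.Module):
    """One of the three parallel sub-networks; it regresses a single element
    pair of M-hat, e.g., (m31, m32). Layer sizes here are illustrative."""
    def __init__(self):
        super().__init__()
        densenet = models.densenet121(weights="IMAGENET1K_V1")
        # Keep only the early pre-trained DenseNet layers as a feature extractor.
        self.backbone = nn.Sequential(*list(densenet.features.children())[:6])
        self.conv = nn.Sequential(
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        # Fully connected head that outputs the two values of the element pair.
        self.fc = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, 2))

    def forward(self, x):
        return self.fc(self.conv(self.backbone(x)))

# pair = ElementPairNet()(torch.randn(1, 3, 224, 224))  # -> tensor of shape (1, 2)
```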
5.2. Training Details
Each of the three networks is trained to minimize the mean squared error (MSE) of its assigned element pair in $\hat{M}$ with respect to the corresponding element pair in the ground-truth $M$. The total loss function $L$ is of the form:

$$L = \lambda_{(m_{31},m_{32})} L_{(m_{31},m_{32})} + \lambda_{(m_{11},m_{22})} L_{(m_{11},m_{22})} + \lambda_{(m_{12},m_{21})} L_{(m_{12},m_{21})}$$

where each per-pair term is the MSE over its two elements:

$$L_{(m_{a},m_{b})} = \frac{1}{n} \sum_{i=1}^{n} \left[ (\hat{m}_{a,i} - m_{a,i})^2 + (\hat{m}_{b,i} - m_{b,i})^2 \right]$$

where $n$ is the number of observed inputs. The penalty terms $\lambda$ are assigned to the corresponding element pairs based on the sensitivity observed in the experiment discussed in Section 3. The largest penalty term is assigned to the most sensitive element pair because minuscule differences between its predicted and ground-truth values result in noticeable misalignment between the ground-truth image $I$ and the generated image.
We implemented the network and performed the experiments in PyTorch. The three parallel networks are optimized using ADAM [53] with a fixed learning rate and a batch size of 8. We trained the networks on an NVIDIA RTX 2080Ti GPU, and the networks converge at around 20 epochs. We observed that, during training, some networks converge faster than others, but there were no overfitting incidents. Hence, we let all networks train until all of them converged to an acceptable loss.
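A compressed sketch of one training step under these settings is shown below, reusing the ElementPairNet sketch from Section 5.1; the penalty weights, learning rate, and batch interface are placeholders rather than the values used in our experiments.

```python
import torch
import torch.nn as nn

# One network and one penalty weight per element pair (placeholder values).
nets = {"scale": ElementPairNet(), "shear": ElementPairNet(), "persp": ElementPairNet()}
penalties = {"scale": 1.0, "shear": 1.0, "persp": 1.0}
optimizers = {k: torch.optim.Adam(n.parameters(), lr=1e-4) for k, n in nets.items()}
mse = nn.MSELoss()

def train_step(batch):
    """batch: dict with the distorted image tensor ('image') and the
    ground-truth element pairs of M, each of shape (B, 2)."""
    for key, net in nets.items():
        optimizers[key].zero_grad()
        predicted_pair = net(batch["image"])
        loss = penalties[key] * mse(predicted_pair, batch[key])
        loss.backward()
        optimizers[key].step()
```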
6. Evaluation
We evaluated our network architecture using the dKITTI dataset. The network is trained with 95,330 distorted images, and we performed the evaluation on the validation set containing 5018 images.
We measured the transformation matrix error in terms of absolute relative error, square relative error, and root mean squared error (RMSE). The same metrics are used for measuring the pixel-wise error, while the structural similarity index (SSIM) [54] is used for checking image reconstruction quality. We also measured the failure rate, which is the percentage of images in the validation set that are not properly corrected, such as the cases where homography estimation [55] fails to produce visually better images than the input. Performance results are shown in Table 2 and the best results are shown in Figure 7.
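These error metrics can be computed as in the sketch below; the SSIM implementation (scikit-image) and the epsilon guard are our own choices for illustration.

```python
import numpy as np
from skimage.metrics import structural_similarity  # assumed SSIM implementation

def error_metrics(prediction, ground_truth):
    """Absolute relative error, square relative error, and RMSE between two
    arrays (usable for both matrix entries and image pixels)."""
    prediction = prediction.astype(np.float64)
    ground_truth = ground_truth.astype(np.float64)
    eps = 1e-8  # avoid division by zero
    abs_rel = np.mean(np.abs(prediction - ground_truth) / (np.abs(ground_truth) + eps))
    sq_rel = np.mean((prediction - ground_truth) ** 2 / (np.abs(ground_truth) + eps))
    rmse = np.sqrt(np.mean((prediction - ground_truth) ** 2))
    return abs_rel, sq_rel, rmse

def ssim(corrected, reference):
    """Structural similarity between the corrected and ground-truth images."""
    return structural_similarity(corrected, reference, channel_axis=-1)
```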
Figure 8 shows results on manually picked images that have observable distortion and depict only a small region of the original image. Additional image results are shown in Figure A1 and Figure A2 in Appendix A.
We compared our method with the following: the dataset transformation matrix mean, which is used as a baseline; a homography estimation method [55]; the method proposed by Li et al. [11]; and the method of Chaudhury et al. [21]. Homography estimation is computed by estimating $\hat{M}$ for a given distorted image (Equation (2)) such that the back-projection error with respect to the corrected image $I$ is minimized. Homography estimation is not a blind distortion correction technique, but it is included for comparison. We used the ORB detector [56] to detect feature points in the distorted image and in $I$, and then used RANSAC [57] to minimize the error. We set a threshold so that only matches within a certain Euclidean distance are considered, to minimize outliers. For the work of Li et al. [11], we used their pre-trained model, specifically their multi-model distortion network with resampling, for generating the corrected image. For the work of Chaudhury et al. [21], we used their independent auto-rectifier algorithm with default parameters.
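The homography-estimation baseline can be sketched as follows with OpenCV; the matcher configuration, the descriptor-distance threshold standing in for the match filtering described above, and the RANSAC reprojection threshold are illustrative assumptions.

```python
import cv2
import numpy as np

def homography_baseline(distorted, reference, max_distance=40.0):
    """Estimate a homography between the distorted and reference images using
    ORB features and RANSAC; returns None when it cannot be inferred."""
    orb = cv2.ORB_create()
    kp1, des1 = orb.detectAndCompute(distorted, None)
    kp2, des2 = orb.detectAndCompute(reference, None)
    if des1 is None or des2 is None:
        return None                                   # too few feature points
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = [m for m in matcher.match(des1, des2) if m.distance < max_distance]
    if len(matches) < 4:
        return None                                   # homography cannot be inferred
    src = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
    return H
```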
As shown in Table 2, our network architecture outperforms the other methods. To validate the robustness of our network, we input images with extreme distortions by sampling images with the minimum and maximum transformation matrix values in Table 1. Figure 9 shows that our network corrects images with extreme distortions and performs better than the other methods.
Since homography estimation relies on detecting feature points in the images, there are occasions where very few feature points are available (see the incorrect warping in Figure 9). Hence, the transformation matrix cannot be inferred properly for some images. In effect, homography estimation cannot be performed on 13.90% of the images in the validation set (specified in Table 2). Our method guarantees that $\hat{M}$ can be inferred for all images in the validation set.
The distortion parameters produced by the methods of Li et al. [11] and Chaudhury et al. [21] have some limitations and can be further improved, as follows:
Both methods do not consider the scaling of images as a possible factor in perspective distortion, unlike our method, as discussed in Section 3.
Images with low texture and those with shearing, as seen in the examples in Figure 8 and Figure 9, affect the correction. This is more pronounced in the method of Chaudhury et al. [21], which can only handle limited distortions on images. Our method is observed to be robust to these limitations.
Some images are misclassified as a different distortion type by the method of Li et al. [11]. For example, in Figure 8, the third image of row A is misclassified as a barrel or pincushion distortion, which results in a different corrected image. Our method covers more cases of perspective distortion. As seen in our results, our method consistently produces corrected images.
6.1. Experiment on Network Variants
We conducted an experiment to validate the effectiveness of parallel CNNs for perspective distortion correction. The network variants are described in Table 3. Model A uses DenseNet as the pre-trained layers, as proposed in Figure 6. Model B uses pre-trained ResNet-161 [58] layers instead of DenseNet layers. Model C does not use any pre-trained layers. Model D is similar to Model A except that only one network instance is trained, and its fully connected layer outputs all the elements of $\hat{M}$. The results are summarized in Table 4.
Model C has the lowest transformation matrix error among the variants, but the lowest pixel-wise error and the highest SSIM accuracy of the corrected images were produced by Model A. The results also show that predicting grouped element pairs and training three network instances in parallel is better than using a single network instance. Hence, Model A is the primary network architecture used for correcting distorted images.
6.2. Closeness of Estimations to Ground-Truth
We randomly selected 500 images each from the training and validation sets, then compared the predicted $\hat{M}$ against the ground-truth $M$. The norms of $\hat{M}$ and $M$ are plotted in Figure 10. Our network's predictions fall short of the ground truth by a mean margin of 0.0239 in terms of norm value. This difference is very small and visually negligible, as observed from the image results. The scatter plot also shows that our prediction distribution is almost the same as the ground-truth distribution of the training and test sets.
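For reference, this margin can be computed as in the short sketch below, assuming the Frobenius norm of each matrix (the specific norm is not restated here).

```python
import numpy as np

def mean_norm_margin(predicted_matrices, ground_truth_matrices):
    """Mean difference between the Frobenius norms of the ground-truth and
    predicted matrices (positive when predictions fall short)."""
    margins = [np.linalg.norm(M) - np.linalg.norm(M_hat)
               for M_hat, M in zip(predicted_matrices, ground_truth_matrices)]
    return float(np.mean(margins))
```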
We also validated whether our network can correct images with different $m_{11}$ and $m_{22}$ values. As stated in Section 3, these elements deal with the scaling of images and should be considered when modeling perspective distortion. We generated 276 distorted images from KITTI where only $m_{11}$ and $m_{22}$ are uniformly randomized, and then used our proposed network to predict $\hat{M}$. Table 5 summarizes the transformation matrix error and pixel-wise error metrics. The best image results are shown in Figure 11. Based on these results, our network can recover the original scale of the image, which cannot be performed by the other methods.
6.3. Activation Visualization
We analyzed how our network behaves by visualizing the gradient-weighted activation maps of the convolutional layers using the technique of Selvaraju et al. [59]. Figure 12 illustrates the feature maps. As observed in the visualizations, our network tends to extract edges and outlines first, and then specific regions of the images. The first layer gravitates towards edges, lines, and contours. In each succeeding layer, these low-level features are progressively grouped, such that all edges, lines, and contours appear grouped by the 4th layer. Succeeding layers tend to activate on specific regions of the images, and the last layer appears to focus on the overall orientation of the image.
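A generic gradient-weighted activation-map sketch in PyTorch is shown below; hooking a single convolutional layer and summing the two regression outputs before back-propagation are assumptions of this sketch rather than the exact procedure of Selvaraju et al.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, conv_layer, image):
    """Gradient-weighted activation map for a regression output."""
    activations, gradients = [], []
    fwd = conv_layer.register_forward_hook(lambda m, i, o: activations.append(o))
    bwd = conv_layer.register_full_backward_hook(lambda m, gi, go: gradients.append(go[0]))
    output = model(image)                 # shape (1, 2): the predicted element pair
    output.sum().backward()               # back-propagate the summed pair
    fwd.remove(); bwd.remove()
    weights = gradients[0].mean(dim=(2, 3), keepdim=True)  # global-average-pool gradients
    cam = F.relu((weights * activations[0]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[2:], mode="bilinear", align_corners=False)
    return (cam / (cam.max() + 1e-8)).squeeze().detach()
```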
6.4. Model Generalization
We experimented with our network on unseen data using test images from the Places-205 [48] dataset. A total of 612 images from Places-205 were randomly selected and distorted, where the majority of the images have little to no presence of cars and roads. Thus, the images come from an entirely different domain than the KITTI dataset. Table 6 shows the accuracy metrics. Figure 13 illustrates the best results. Notice that our network can still recover the corrected image properly compared to the other methods. While the distortion parameters are similar, the scene context is different, yet our network can still infer the transformation matrix to correct the image. We speculate that our network is invariant to scene composition because the activation maps (discussed in Section 6.3) focus more on edges and lines in the image.
6.5. Limitations
We observed that our network could not properly correct outdoor images with repeating textures, as well as indoor scenes with text or cluttered objects. Examples are shown in Figure 14. Since our network does not recognize specific objects or semantic information, it cannot correct images with a dense amount of objects or repetitive textures such as rocks. The network was also not trained with any indoor scenes and thus produces incorrect distortion parameters on them. We believe that the straightforward solution is to retrain the network with a wider variety of images or to perform domain adaptation.
We also attempted to investigate the limits of our trained network using panoramic images from the Internet. For an image to be compatible with our network, we either resized the image to the network's input size, assuming the original aspect ratio is preserved, or cropped an area of similar size from the image, with the center as the origin. Results are shown in Figure 15. Panoramic images most often involve a combination of different distortions, some of which are higher-order distortions such as barrel or pincushion distortions. However, the results visually show that our network attempts to correct the images' orientation and reduces the stretching in some areas compared to the other methods.
7. Conclusions
We proposed a blind first-order perspective distortion correction method that uses three convolutional neural networks, trained and used in parallel, to infer the transformation matrix for correcting an image. We discovered that elements in the transformation matrix can be grouped because each group performs a specific transformation on the image, such as scaling or skewing, which is the rationale behind our approach and the design of the network. Our proposed method shows promising results, outperforming other state-of-the-art methods. Our network generalizes properly to a different domain and recovers the intended scale and proportion of the image, which could be used for images that appear stretched, bringing objects in the image close to their original scales.
Our network cannot correct images with repeating textures, as well as indoor scenes with text or cluttered objects. We speculate that this could be solved by adding more training samples that cover such cases. We plan to explore how images with higher-order distortions can be corrected without relying on generative or encoder-decoder architectures, which, to some extent, was already performed by Li et al. [11] for reconstructing the intermediate flow representation of the distorted image. It would be interesting to apply the same strategy that we proposed (Section 3).