Unsupervised Multi-Scale-Stage Content-Aware Homography Estimation

Abstract: Homography estimation is a critical component of many computer-vision tasks. However, most deep homography methods focus on extracting local features and ignore global features or the correspondence between features from two images or video frames. These methods are effective only for aligning image pairs with small displacement. In this paper, we propose an unsupervised Multi-Scale-Stage Content-Aware Homography Estimation Network (MS2CA-HENet). In the framework, we use multi-scale input images at different stages to cope with different scales of transformation. In each stage, we consider local and global features via our Self-Attention-augmented ConvNet (SAC). Furthermore, feature matching is explicitly enhanced using feature-matching modules. By shrinking the error residual of each stage, our network achieves coarse-to-fine results. Experiments show that our MS2CA-HENet achieves better results than other methods.


Introduction
Image/video homography estimation is the process of finding corresponding relationships by estimating a projective transformation. It is a basic task in a variety of applications, including visual SLAM [1,2], image/video stitching [3,4] and augmented reality [5,6]. Most traditional methods for homography estimation [7,8] employ matched features, such as SIFT [9], SURF [10] and ORB [11], to establish the corresponding relationship. These methods are highly dependent on the extracted features and can typically provide good results in scenes with rich, uniformly distributed features. In addition, the steps in traditional methods (feature detection, feature matching and homography estimation) are performed independently; the overall alignment performance can easily be limited by any one step.
Deep homography estimation methods have drawn increasing attention from researchers due to their excellent performance in feature representation. These methods are usually divided into two categories: supervised methods [12][13][14][15][16] and unsupervised methods [17][18][19][20]. These learning-based methods can often outperform traditional methods in difficult scenarios, such as images/videos with few features or little texture. However, these methods focus on local features, ignoring long-range relationships and the correspondence between features from two images or video frames. In addition, these approaches are effective only for image pairs or video frames with small displacements.
Previous research [13,15] has shown that using a multi-stage process to progressively predict and refine a homography can cope with large global displacement between two images/video frames. In this paper, we extend these methods to an unsupervised setting and propose an unsupervised Multi-Scale-Stage Content-Aware Homography Estimation Network (MS2CA-HENet). In this framework, images of different resolutions are used as input at different stages, starting with low-resolution input images and gradually increasing the input size. Large-scale, global transformations are estimated on low-resolution input images; small-scale, local transformations are estimated on high-resolution input images. The homography estimation network at each stage includes a feature-extraction module, a feature-matching module and a homography-estimation module. The feature-extraction module introduces a self-attention mechanism, which covers a larger scope during feature extraction and considers both local and global information in the extracted features. The feature-matching module enhances the matching relationship between features. By shrinking the error residual of each stage, our network achieves coarse-to-fine results and improves alignment performance. Compared with previous work, our contributions are as follows: (1) We design a novel unsupervised Multi-Scale-Stage Content-Aware Homography Estimation Network (MS2CA-HENet), which effectively addresses homography estimation for a pair of images with large displacement. (2) We propose a Self-Attention-augmented ConvNet (SAC) to capture local and global features. Moreover, a feature-matching module is introduced into the homography-estimation network to enhance the long-distance dependencies between two image feature maps. (3) We estimate the residual offsets of the displacement instead of the complete offsets, which estimates the homography from coarse to fine by minimizing the error residual at each stage. Experiments show that our method achieves superior performance compared to other methods.

Supervised Deep Homography Methods
DeTone et al. [12] made the first attempt at deep homography estimation, using a deep convolutional neural network to estimate the homography. The authors of [13,14,16] utilize a hierarchical architecture that extracts features from two image patches to perform homography estimation. Hierarchical approaches can gradually reduce the estimation error from coarse to fine. Le et al. [15] extend this approach to estimate a motion mask in order to address potentially large dynamic motion. However, these methods are supervised; they need a large number of ground-truth annotations, which are costly to gather from real-world data.

Unsupervised Deep Homography Methods
Nguyen et al. [17] propose an unsupervised method that uses a Spatial Transformation Layer (STL) [21] to calculate the pixel loss between two images/video frames. Their unsupervised method achieves performance comparable to the HomographyNet [12] method. Wang et al. [18] eliminate the need for ground-truth annotations and use invertibility constraints to improve previous unsupervised approaches. Ye et al. [22] use a homography flow rather than the typically used four-point parameterization to estimate the homography. Koguciuk et al. [19] extend this approach by adding a perceptual loss [23], which considerably increases the robustness of the model to variations in lighting. Liu et al. [20] propose a content-aware homography estimation method that learns a mask to eliminate outliers in a manner similar to RANSAC [24].

Self-Attention
In computer vision, attention mechanisms [25,26] highlight key elements of an image or feature map while ignoring the rest. Attention is a crucial component of deep convolutional networks owing to its ability to concentrate on important regions within a given context. Self-attention refers to paying attention within a single context rather than across several contexts. The advantage of self-attention is its ability to model long-range interactions; it has produced state-of-the-art models for a variety of tasks [27,28], e.g., image generation [29] and object detection [30]. Complementing convolutional models with self-attention has recently shown benefits in a variety of vision tasks. Wang et al. [31] demonstrate that self-attention is an instance of non-local [32,33] methods and that it can be used to improve video categorization and object recognition. Using a variation of non-local methods, Chen et al. [34] attain favorable outcomes in image classification and video action identification tasks. At the same time, Bello et al. [35] also observe large improvements in object detection and image classification by adding global self-attention features to convolutional features.

Overall Architecture
The overall network model consists of three stages: Stage-1, Stage-2 and Stage-3. In the first stage, we input the smallest-resolution images I 1 R , I 1 T and output the displacement D 1 of the four image corner points from I 1 R to I 1 T . Moreover, a Tensor Direct Linear Transform (Tensor DLT) [36] layer is applied to compute the differentiable mapping from the four-point parameterization D 1 to the 3 × 3 homography matrix H 1 . In the second stage, the reference image I 2 R and the warped target image I 2 T are input to a module similar to that of the first stage. The warped target image I 2 T is obtained as follows: I 2 T = W (I 2 T , (H 1 S) −1 ), where W () warps the target image using the homography transformation in the Spatial Transformation Layer and S is a scaling matrix between the two scales of the warped target images I 1 T and I 2 T . More specifically, the relationship of the homography offsets between two adjacent scales is used to scale the homography: small-scale offsets are expanded by two times to make them equivalent to the changes on the large-scale images.
For the output residual displacement ∆D 2 from the reference image I 2 R to the warped target image I 2 T in the second stage, the total displacement D 2 between the reference image I 2 R and the warped target image I 2 T is obtained from the displacement D 1 and the residual displacement ∆D 2 : D 2 = D 1 + ∆D 2 . Depending on the displacement D 2 from the reference image I 2 R to the warped target image I 2 T , the homography transformation H 2 can be obtained by the DLT method. The homography transformation calculated in the second stage is scaled and applied to the target image I 3 T . Similar to the second stage, the reference image I 3 R and the warped target image I 3 T are the inputs of the third stage, and the output is the residual displacement ∆D 3 . The displacement D 3 from the reference image I 3 R to the warped target image I 3 T is obtained from the displacement D 2 and ∆D 3 : D 3 = D 2 + ∆D 3 . Based on the displacement D 3 , the homography transformation H 3 can be obtained by the DLT method.
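To make the four-point DLT step and the scale transfer between stages concrete, the following is a minimal NumPy sketch (not the paper's implementation): a four-correspondence DLT solve and a homography rescaling helper. The function names and the S H S −1 scaling convention are illustrative assumptions.

```python
import numpy as np

def dlt_homography(src, dst):
    """Direct Linear Transform for exactly four point pairs.

    src, dst: (4, 2) arrays of corner coordinates; returns the 3x3
    homography H (with H[2,2] = 1) mapping src points to dst points."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, float), np.array(b, float))
    return np.append(h, 1.0).reshape(3, 3)

def scale_homography(H, factor=2.0):
    """Re-express a homography at a resolution `factor` times larger
    (one common convention: S H S^-1 with S = diag(factor, factor, 1))."""
    S = np.diag([factor, factor, 1.0])
    return S @ H @ np.linalg.inv(S)
```

For a pure translation of (1, 2) pixels at the coarse scale, rescaling by 2 yields a translation of (2, 4) at the finer scale, matching the "offsets are expanded by two times" rule above.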

Network Modules In Stage 1
In this section, we introduce the modules of our network in Stage 1 (see Figure 2) in detail. In the feature-extraction module F(.), the local and global features are extracted by combining convolution and self-attention operations. Moreover, the feature-matching module M(.) is designed to explicitly enhance feature matching from the feature maps. Finally, we estimate the homography matrix with the homography estimation module H(.).

Feature Extractor
In the feature-extraction module F(.), we combine convolution and self-attention operations to obtain local and global features. The process is described as F R = F (I R ), F T = F (I T ). We first employ ResNet34 [37] to extract the image features. Due to the limited receptive field of the convolution kernel, convolution processes data locally, which makes it inefficient at modeling long-range relationships in images. We therefore embed a self-attention module into the feature extractor, enabling it to efficiently model long-distance interactions in the image features, as shown in Figure 3. To be specific, we use the output of Layer2 in ResNet34 and embed a self-attention module after each layer of Layer2. For a pair of input images I R and I T of size 1 × H × W, the size of the feature maps is C × H/8 × W/8. The specific network structure of the feature-extraction module F(.) is shown in Table 1.
Specifically, assume the features of an image extracted by ResNet34 are denoted as x. The feature maps x can be transformed into different feature spaces by different 1 × 1 convolutions: K = W k (x), Q = W q (x), V = W v (x), where W k , W q and W v are three different 1 × 1 convolutions. The spatial relationship is calculated by β i,j = exp(s i,j )/Σ i exp(s i,j ), where s i,j = K 2D (:, i) T Q 2D (:, j), and K 2D and Q 2D denote the flattened results of the tensors K and Q, respectively, in a sample. The size of K 2D (and Q 2D ) is C × N, where N = H × W. β i,j indicates the correlation between the ith location and the jth position. The output is z = (z 1 , z 2 , . . . , z N ) ∈ R C×N , with z j = Σ i β i,j V 2D (:, i), where V 2D ∈ R C×N denotes the flattened matrix of the tensor V in a sample. In addition, the output is multiplied by a learnable parameter and added to the input feature maps. Consequently, the ultimate result is determined by y = γz + x, where γ is a learnable scaling factor with an initial value of 0. Through the learnable γ, the module first depends on neighborhood cues and can then progressively be trained to give non-local evidence more weight.
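A minimal PyTorch sketch of the self-attention block described above: 1 × 1 convolutions produce K, Q and V, a softmax over spatial positions yields β, and a learnable γ initialised to 0 gates the non-local term. The channel widths and class name are illustrative assumptions, since the paper does not fully specify them.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention2d(nn.Module):
    """Self-attention over a B x C x H x W feature map with a learnable
    residual gate gamma (initialised to 0, so the block starts as identity)."""
    def __init__(self, channels):
        super().__init__()
        self.wk = nn.Conv2d(channels, channels, 1)   # 1x1 conv for K
        self.wq = nn.Conv2d(channels, channels, 1)   # 1x1 conv for Q
        self.wv = nn.Conv2d(channels, channels, 1)   # 1x1 conv for V
        self.gamma = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        b, c, h, w = x.shape
        k = self.wk(x).flatten(2)                    # B x C x N, N = H*W
        q = self.wq(x).flatten(2)
        v = self.wv(x).flatten(2)
        # beta[:, i, j] = softmax over i of K_i^T Q_j
        beta = F.softmax(k.transpose(1, 2) @ q, dim=1)   # B x N x N
        z = v @ beta                                 # z_j = sum_i beta_ij V_i
        y = self.gamma * z + x.flatten(2)            # gated residual
        return y.view(b, c, h, w)
```

Because γ starts at 0, the block initially passes features through unchanged and only gradually learns to weight non-local evidence, as described above.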

Feature-Matching Module
Feature matching is an important step in traditional homography estimation methods. By comparing the distances between the feature-point descriptors of each pair of images, the feature points with the minimum distance between them are selected as matching points. In deep-learning methods, the convolution layer is inefficient at learning the matching relation between features, especially when the displacement between corresponding points is large, i.e., the offset of the matched feature is much larger than the receptive field of the convolution kernel. In the proposed feature-matching module M(.), the feature maps F R and F T are the inputs, and the output is a cost volume S 3D that stores the correlation values between features of the reference image and the target image at each spatial position. The process is presented as S 3D = M(F R , F T ). Specifically, we first reshape the extracted feature maps F R and F T output by the feature extractor into corresponding 2D matrices F R 2D and F T 2D , respectively. Then, the matching cost S 2D (i, j) between the ith feature vector in F R 2D and the jth feature vector in F T 2D is implemented as the correlation between the feature vectors: S 2D (i, j) = F R 2D (:, i) T · F T 2D (:, j), where S 2D ∈ R B×N×N , N denotes the size of the spatial resolution, C represents the dimension of the feature vectors, T stands for the transpose operator, and '·' stands for the dot product. Therefore, the full cost-volume calculation between the two feature maps F R and F T can be expressed as S 3D = F R 2D T ⊗ F T 2D , where ⊗ means matrix multiplication. As a result, the total cost volume S 3D is obtained from the 2D cost volume S 2D . The specific calculation process is shown in Figure 4. Compared with other parts of the model, the feature-matching module does not have any trainable parameters. The cost volume may be conceptualized as a 3D form of a similarity matrix: it keeps track of the cost of matching two sets of dense feature vectors.
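The cost-volume computation above reduces to a single batched matrix multiplication. The sketch below (a hypothetical `cost_volume` helper) has no trainable parameters, matching the description:

```python
import torch

def cost_volume(f_r, f_t):
    """Correlate every feature vector of the reference map with every
    vector of the target map.

    f_r, f_t: B x C x H x W feature maps; returns B x N x N with
    N = H * W, where entry (i, j) is the dot product of reference
    vector i and target vector j."""
    b, c, h, w = f_r.shape
    fr2d = f_r.flatten(2)                 # B x C x N
    ft2d = f_t.flatten(2)                 # B x C x N
    return fr2d.transpose(1, 2) @ ft2d    # B x N x N of dot products
```

With identical inputs, the diagonal of the result holds the squared norms of the feature vectors, which is a quick sanity check on the correlation layout.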

Homography Estimator
In the homography estimation module H(.), we employ three successive convolutional layers and two fully connected layers to obtain the displacement D of the four image corners from reference images to target images. To prevent over-fitting, we use dropout [38] between the last convolutional layer and the first fully connected layer with a drop probability of 0.5. Our homography estimator function between the cost volume S 3D and the displacement D is described as D = H(S 3D ). By applying the direct linear transform to D, we can obtain the 3 × 3 homography matrix between a pair of images.
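A sketch of this estimation head is below: three convolutions over the cost volume (treated as a one-channel 2D map), dropout at p = 0.5 between the last convolution and the first fully connected layer, and 8 outputs (x/y offsets of the four corners). The channel widths, strides and hidden size are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HomographyEstimator(nn.Module):
    """Cost volume (B x N x N) -> four-corner displacement (B x 4 x 2)."""
    def __init__(self, n):                      # n = spatial resolution N
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, stride=2, padding=1), nn.ReLU())
        self.drop = nn.Dropout(0.5)             # between last conv and first FC
        m = (n + 7) // 8                        # size after three stride-2 convs
        self.fc = nn.Sequential(nn.Linear(64 * m * m, 256), nn.ReLU(),
                                nn.Linear(256, 8))

    def forward(self, s3d):                     # s3d: B x N x N cost volume
        x = self.convs(s3d.unsqueeze(1))        # add a channel dimension
        return self.fc(self.drop(x.flatten(1))).view(-1, 4, 2)
```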

Loss Function
Utilizing the estimated H k of the kth stage, we warp the patch images I k R to W (I k R , H k ) via the Spatial Transformation Layer [21] and compute the L1 loss between the warped reference images and the target images. The network as a whole is differentiable and can be trained through back propagation.
At each stage, we minimize the average L1 pixel-wise photometric loss during training. According to previous studies [17,39], an L1-type loss function is more suitable for image alignment problems and makes the network easier to train, so we choose an L1-type loss function instead of an L2-type loss function. In addition, the images may contain artifacts due to the injection of random illumination offsets and distortion, and the L1-type loss function is more robust to outliers [40]. The total loss function can be expressed as L = α 1 ||W (I 1 R , H 1 ) − I 1 T || 1 + α 2 ||W (I 2 R , H 2 ) − I 2 T || 1 + α 3 ||W (I 3 R , H 3 ) − I 3 T || 1 , where the balancing weights are set to α 1 = 0.5, α 2 = 0.3 and α 3 = 0.2. Our loss function consists of three parts, which correspond to the homography estimation networks of the three stages with different weights. W () is an operation that applies the predicted homography of each stage to the input images using a Spatial Transformation Layer. In Stage 1 of the loss function, we warp the patch images I 1 R to W (I 1 R , H 1 ) by the predicted homography transformation H 1 . The average L1 pixel-wise photometric loss is used to minimize the difference in pixel values between the corresponding pixels of W (I 1 R , H 1 ) and I 1 T . In Stage 2, we minimize the difference between W (I 2 R , H 2 ) and the warped target I 2 T = W (I 2 T , (H 1 S) −1 ) instead of the difference between W (I 2 R , H 2 ) and the original input I 2 T . Since the warped I 2 T is closer to the ground truth than I 2 T , the loss shrinks the error residual at each stage. The third stage is similar to the second stage.
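The weighted multi-stage L1 loss above can be sketched in a few lines; `warped[k]` stands for W(I k R , H k ) and `targets[k]` for the (warped) target image of stage k. The helper name is an assumption for illustration.

```python
import torch

def multistage_l1_loss(warped, targets, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the average per-pixel L1 photometric losses of
    the three stages, using the paper's weights alpha = (0.5, 0.3, 0.2)."""
    return sum(a * (w - t).abs().mean()
               for a, w, t in zip(weights, warped, targets))
```

If every warped image matched its target exactly, the loss would be zero; in the degenerate all-ones vs. all-zeros case, each stage contributes its full weight.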

Dataset and Evaluation Metric
Due to the lack of publicly available datasets for homography estimation, we follow the method of DeTone et al. [12] to generate datasets from the MS-COCO [41] dataset. We select 82,783 images from MS-COCO train2014 for the training set and 5000 images from test2014 for the testing set. For each image, a 128 × 128 patch is cropped at a random position, and each corner point is then given a random disturbance within a range of 45 pixels, which provides the ground-truth four-point corner values used to evaluate the proposed method. The image is then warped using the inverse of the homography matrix defined by the four correspondences, and we crop a second patch from the same position in the warped image. Considering the multi-scale input of our network, we downsample the 128 × 128 patch pairs to resolutions of 64 × 64 and 32 × 32. We use the Mean Average Corner Error (MACE) [12] as the metric, which computes the L2 distance between the ground-truth corners and the predicted corners; a lower MACE means better performance.
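The corner-perturbation step and the MACE metric are both simple to state in code; this NumPy sketch uses hypothetical helper names and assumes uniform sampling of the disturbances.

```python
import numpy as np

def random_corner_offsets(rho=45, rng=None):
    """Synthetic ground truth as in DeTone et al. [12]: each of the four
    corners of a 128x128 patch is perturbed uniformly in [-rho, rho]."""
    rng = rng or np.random.default_rng()
    return rng.uniform(-rho, rho, size=(4, 2))

def mace(pred_corners, gt_corners):
    """Mean Average Corner Error: mean L2 distance over the four corners."""
    return float(np.linalg.norm(pred_corners - gt_corners, axis=1).mean())
```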

Implementation Details
Our network is implemented in PyTorch and trained with the Adam optimizer. The initial learning rate is l r = 5.0 × 10 −5 . We train our homography network for 60 epochs. All training and testing are carried out on a single NVIDIA Titan XP GPU.
Figure 5 shows the comparative MACE results on the MS-COCO dataset. Specifically, we make the following observations: the MACE of the traditional ORB+RANSAC method is higher than those of the learning-based homography estimation methods. The main reason is that deep learning methods can extract more robust features than traditional methods.
Compared with the deep learning models [12][13][14][17][18][19][20] without a feature-matching module, both SRHEN [16] and our model include one, leading to better results. This demonstrates the necessity of the feature-matching module in deep homography estimation models. Compared with the remaining method without a self-attention mechanism (i.e., SRHEN [16]), our model adopts a Self-Attention-augmented ConvNet to extract local and global features and enhance the long-distance dependencies of the features. Moreover, our model adopts a feature-matching module to strengthen the long-distance dependencies between the two feature maps, which better captures the spatial correspondence between the reference and target images. Our method reduces the MACE by 10.0% compared to SRHEN. As shown in Figure 5, the proposed MS2CA-HENet achieves the best performance.
The visual comparative results of different homography estimation methods are illustrated in Figure 6. As can be seen from the figure, compared with some related homography estimation methods [12,16,17,19], the proposed method obtains better alignment results, which is consistent with the MACE in Figure 5.
In the process of generating the synthetic images, we set different values of the point-perturbation parameter ρ to control the displacement of the four corner points of the image patches. The positions of the four corner points are disturbed by random values in the range [−ρ, ρ]. As the point-perturbation parameter ρ increases, so does the displacement of the corresponding corner points; the greater the distortion, the lower the overlap between the image patches cropped at the same position. The quantitative comparison results and the visual results are shown in Table 2 and Figure 7. As shown in the table, all methods perform well when the displacement is small. However, the performance of all methods degrades as the displacement increases. The approaches in [12,17] use the convolution operator to obtain features, which can only capture short-range features due to the limited receptive field. Correspondences established by convolution layers alone cannot bridge the gap between feature maps and homography; hence, their MACE values are higher than those of our proposed method. In contrast, our method maintains relatively low MACE values as the displacement increases. The visual results show the effectiveness of our method for a pair of images with large displacement.

Figure 6. Visual comparison with DeTone et al. (2016) [12], Self-SupervisedNet [17], SRHEN [16] and biHomE [19]. The red boxes are the ground-truth boxes, and the yellow boxes are the prediction results.
Since HierarchicalNet [13] uses a multi-stage network to estimate the homography, it is compared with the proposed method. As shown in Figure 8, the MACE values gradually decrease as the number of stacked stages increases, which shows that a multi-stage network can progressively estimate and refine a homography. Because of the Self-Attention-augmented ConvNet and the global feature-matching module between two images/video frames, the MACE of our MS2CA-HENet is lower than that of HierarchicalNet at each stage. From the figure, it can also be observed that the MACE of our method is higher when the hierarchy size is 4: due to the multi-scale input, the homography estimation network in the first stage would then deal with very small images and training becomes unstable. Hence, we use three stages to train our network.

Ablation Study
Module Selection: We conduct an ablation study in Table 3 to show the effectiveness of the local-global feature-extraction module F(.) and feature-matching module M(.). In the first row of the table, we use ResNet34 instead of the local-global feature-extraction module and feature-matching module. From the first row, we can see that the Mean Average Corner Error gradually decreases as the scale increases. However, the MACEs in the first row (only multi-scale images) are higher than the results in the other rows (with our designed modules F(.) and M(.)). In particular, the error rates without our F(.) and M(.) modules (the first row of the table) are higher than those of our method by 6.28, 3.72 and 2.87, respectively. This demonstrates the importance of the local-global feature-extraction module F(.) and the feature-matching module M(.) for homography estimation in our model.
Scale Selection: Our model adopts differently scaled images as the input of each stage. To verify this effectiveness, we compare same-scale images as input against our multi-scale images. The quantitative comparison and the visual results are shown in Table 4 and Figure 9, respectively. It can be observed that the MACEs of the networks with same-scale images are higher than those of our multi-scale network. With different input sizes, the model can capture homography transformations of different magnitudes by dividing the transformation space into different stages: high-resolution images contain more image detail, while low-resolution images focus on the overall information. The visual results (Figure 9) also show that our multi-scale method obtains better results.

Conclusions
In this paper, we design a novel unsupervised Multi-Scale-Stage Content-Aware Homography Estimation Network (MS2CA-HENet), which effectively copes with homography estimation for a pair of images with large displacement. In each stage, we consider local and global features via our Self-Attention-augmented ConvNet (SAC) and explicitly strengthen feature correspondences with a feature-matching module. The output of the homography estimation network in each stage is the residual value of the displacement for a pair of images. By shrinking the error residual of each stage, our network achieves coarse-to-fine results and improves alignment performance. Extensive experiments demonstrate that our method achieves favorable performance compared with other methods.

Figure 1
Figure 1 illustrates our overall framework. Our network takes the pyramid pairs generated from one initial pair of images or video frames as input, and outputs the homography transformation between the initial pair of images. The pyramid images are built by downsampling the original input images by factors of 2 k . The resolutions of the three pyramid levels are 128 × 128, 64 × 64 and 32 × 32, successively.
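The pyramid construction above can be sketched in a few lines; average pooling is used here as a simple downsampling stand-in, since the paper does not specify the filter, and the helper name is an assumption.

```python
import torch
import torch.nn.functional as F

def build_pyramid(img, levels=3):
    """Three-level input pyramid: each level halves the resolution,
    e.g. 128x128 -> 64x64 -> 32x32, returned coarsest-first so the
    first entry is the Stage-1 input."""
    pyr = [img]
    for _ in range(levels - 1):
        pyr.append(F.avg_pool2d(pyr[-1], 2))   # downsample by a factor of 2
    return pyr[::-1]                            # coarsest (Stage-1) first
```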

Figure 1 .
Figure 1. The proposed MS2CA-HENet architecture. The whole network consists of three parts, Stage-1, Stage-2 and Stage-3, for homography estimation. W () is an operation that performs a homography transformation on the input images.

Figure 2 .
Figure 2. The proposed architecture of Stage 1. In the feature-extraction module F(.), the local and global features are extracted by combining convolution and self-attention operations. Moreover, the feature-matching module M(.) is designed to explicitly enhance feature matching from the feature maps. Finally, we estimate the homography matrix with the homography estimation module H(.).

Figure 3 .
Figure 3. The architecture of the self-attention module.

Figure 4 .
Figure 4. The feature-matching module computes a cost volume between two feature maps, where C, H and W, respectively, represent the number of channels, the height and the width of the feature maps.

Figure 7 .
Figure 7. Visualization results for different displacements from 10 to 60. In each example, the first row shows the target image and the second row shows the warped target image. The red boxes are the ground-truth boxes, and the yellow boxes are the prediction results.

Figure 9 .
Figure 9. Visualization results with differently scaled images in different stages: (a) shows the target images; (b-d) show the warped target images produced with same-scale 128 × 128 inputs in the different stages; (e-g) show the results with differently scaled inputs of 32 × 32, 64 × 64 and 128 × 128.

Table 2 .
The MACEs of different displacements.

Table 3 .
Ablation study of the module selection.

Table 4 .
Ablation study of the image scale selection.