Regression-Based Camera Pose Estimation through Multi-Level Local Features and Global Features

Accurate and robust camera pose estimation is essential for high-level applications such as augmented reality and autonomous driving. Despite the development of global feature-based camera pose regression methods and local feature-based matching guided pose estimation methods, challenging conditions, such as illumination changes and viewpoint changes, as well as inaccurate keypoint localization, continue to affect the performance of camera pose estimation. In this paper, we propose a novel relative camera pose regression framework that uses global features with rotation consistency and local features with rotation invariance. First, we apply a multi-level deformable network to detect and describe local features, which can learn appearances and gradient information sensitive to rotation variants. Second, we process the detection and description processes using the results from pixel correspondences of the input image pairs. Finally, we propose a novel loss that combines relative regression loss and absolute regression loss, incorporating global features with geometric constraints to optimize the pose estimation model. Our extensive experiments report satisfactory accuracy on the 7Scenes dataset with an average mean translation error of 0.18 m and a rotation error of 7.44° using image pairs as input. Ablation studies were also conducted to verify the effectiveness of the proposed method in the tasks of pose estimation and image matching using the 7Scenes and HPatches datasets.


Background and Introduction
In recent years, the development of deep learning and computer vision technologies [1][2][3] has led to widespread research on camera pose estimation in both academia and industry [4][5][6]. Accurate and robust camera pose estimation is crucial for downstream tasks, such as object localization, size estimation, camera movement justification, activity recognition, and more, which can enable the development of smart living spaces. Examples of such applications include fire detection, locating ingredients for cooking robots, and planning routes to kitchens and offices. Estimating the camera's 6 degrees of freedom (6-DoF) pose from images captured by the camera can be achieved through end-to-end deep learning [7] or feature matching from structure-based approaches [8]. By integrating advanced deep learning technology with color and depth cameras as input sensors, multi-sensor systems can assist in intelligent living.
Current image-based camera pose estimation methods are greatly affected by challenging scenes, especially illumination changes and viewpoint changes. These problems lead to inaccurate image-based pose estimation. End-to-end methods based on features and
• We propose a novel end-to-end camera pose estimation framework that uses image pairs as input and leverages epipolar geometry to generate image pixel pairs for estimating the camera pose. The framework also includes the automatic fine-tuning of hyperparameters during the training process, resulting in improved accuracy and adaptability;
• We adopt a multi-level deformable convolutional network that simultaneously detects and describes local features. This addresses the issue of sensitivity to shape information (such as scale and orientation) and inaccurate keypoint positioning, leading to more robust and accurate camera pose estimation;
• We propose a novel loss that integrates the detection and description loss based on local features with the relative pose loss function based on global features. This novel loss function enhances the accuracy of camera pose estimation by jointly optimizing local and global feature representations, leading to improved performance compared to existing methods;
• The proposed method is evaluated on benchmark datasets, including HPatches and 7Scenes. The HPatches dataset provides diverse image patches for illumination, viewpoint, and scale evaluation, while the 7Scenes dataset offers realistic indoor sequences for accuracy and stability testing. The experimental results verify the effectiveness of the proposed method for image-matching tasks and camera pose estimation tasks, and demonstrate its superiority compared to state-of-the-art methods.

Organization
The remainder of this paper is organized as follows. Section 2 presents a review of the related works, including localization with sparse local feature matching and camera localization with global feature regression, which provides the context for the proposed approach; Section 3 provides an overview of the dataset preprocessing steps, such as updating the depth image using the position grid, associating pixels with the color image and depth image, and introducing epipolar geometry; Section 4 describes the proposed method of the multi-level deformable network and local feature extraction based on pixel matching for camera pose estimation; Section 5 presents the experiments and discussions on the settings, multi-step image pixel reprojection, image-matching experiment on the HPatches dataset, and pose estimation experiment on the 7Scenes dataset. Finally, Section 6 summarizes the findings and potential implications of the regression-based camera pose estimation approach using multi-level local and global features.

Localization with Sparse Local Feature Matching
According to the processing order of the descriptor and detector in the feature matching method, sparse local feature matching consists of the following branches: (1) The detect-then-describe approaches include a keypoint detection stage with robust and efficient handcrafted detectors (e.g., SIFT (scale-invariant feature transform) [16], SUSAN (smallest univalue segment assimilating nucleus) [17]) or CNN-based (convolutional neural network) variants [18][19][20][21][22][23][24][25][26], followed by descriptor extraction on a sparse set of the detected keypoints with the help of image patches [27], a Siamese CNN [28], the L2 distance [10], or second-order similarity regularization [29]. (2) The detect-and-describe approaches take an end-to-end approach to jointly learn keypoint locations and descriptors. LIFT (learned invariant feature transform) [9] uses a full-featured point-handling pipeline, including feature detection, orientation estimation, and feature description. LF-Net (local feature network) [30] proposes to confine a two-branch network into one branch for feature extraction in an end-to-end manner. SuperPoint [26] jointly learns keypoint detection and description, while R2D2 (repeatable and reliable detector and descriptor) [31] trains predictors of the local descriptor discriminator. ASLFeat [32] is based on D2-Net [33] and improves the perception ability of geometric invariance. DH3D [34] uses an embedding of detection and description modules in a Siamese network. (3) The describe-to-detect methods extract descriptors first and then detect keypoints. D2-Net [33] detects keypoints on a dense feature map for more stable detectors, while DELF [35] is proposed for training keypoints in a local maxima way. The above methods are computationally intensive due to their multi-stage processing, and they rely heavily on parameter assumptions and prior knowledge.
Our approach integrates the image-matching process, detection and description process, and global feature extraction process. The proposed framework can easily extract sparse local features in an end-to-end manner.

Camera Localization with Global Feature Regression
The regressed global features are used to compute the absolute camera pose through single monocular images or image sequences. PoseNet [11] initially regresses the 6-DoF pose through a single image. According to the loss function type, global feature-based regression methods include: (1) fixed Euclidean loss-based methods, which introduce the scaling factor for balancing the position item and orientation item [11], or add Bernoulli distributions to describe the uncertainty of localization [36]. Furthermore, LSTM [37,38] adds four LSTM units and SVS [39] adds a classification module to improve performance.
(2) Learnable pose loss-based methods learn the weights of the pose loss terms to make the results more stable [40]. Later, the adversarial network [41] and novel DNN [13] are added to share the same loss function. (3) Relative sequence loss-based methods learn the loss from a pair of images with a geometric constraint [12]. These methods combine the absolute pose loss and the relative pose loss from an image pair, and the two terms are added together [12] by extracting features from image pairs. The aforementioned methods lack accuracy in pose estimation when image pairs are used as input, and they require multiple parameters to be optimized. Our proposed method leverages local features from image correspondences and demonstrates robustness in changing environments.

Dataset Preprocessing and Epipolar Geometry
The proposed methodology aims to estimate the camera pose by utilizing the correspondences obtained from RGB and depth image pairs as input. Data processing strategies, such as random cropping and normalization operations, are consistently applied to the fixed-step input images. Epipolar geometry [43] is employed to calculate the pixel correspondences.

Update Depth Image Using Position Grid
The correspondences between color image pairs are determined based on the pixel positions and intensities of the corresponding depth image pairs. We designed a position grid to assist in the processing of corner pixel identification, depth information judgment, and interpolation.
Given the width and height of the first image in the depth image pairs, we create a corresponding position grid for further computation. Specifically, for a depth image size of h × w, we construct a vector of size (2, h × w) to represent the coordinates of the position grid. The vector contains two matrices of size (h, w) each, representing the horizontal and vertical coordinates respectively. The first matrix is formed by stacking column vectors of dimension (h, 1) with elements [0, h − 1] in the column space w times, while the second matrix is formed by stacking row vectors of dimension (1, w) with elements [0, w − 1] in the row space h times.
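The construction of the position grid can be sketched as follows. This is a minimal illustration in NumPy rather than the authors' code; the function name `make_position_grid` is hypothetical, and the example size (3 × 4) is chosen only to keep the output small.

```python
import numpy as np

def make_position_grid(h, w):
    """Return a (2, h*w) array of grid coordinates for an h x w depth image."""
    # i_coords[a, b] = a: the column vector [0, h-1] repeated w times.
    # j_coords[a, b] = b: the row vector [0, w-1] repeated h times.
    i_coords, j_coords = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Flatten both (h, w) matrices and stack them into one (2, h*w) array.
    return np.stack([i_coords.ravel(), j_coords.ravel()], axis=0)

grid = make_position_grid(3, 4)
```

Each column of `grid` is one pixel position, so the 1D index of a pixel is simply its column number in this array.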
To eliminate the coordinate positions with unqualified pixels and perform further pixel matching, we process the corner and depth value of the first depth image in the image pair. Specifically, given the first depth image and its corresponding position grid, the two dimensions of the position grid are defined as the i and j index values, respectively. Firstly, we check whether the index values of the four corners of the depth image are within the range of the image's width and height, as shown in Figure 1. Next, we check whether the depth of the pixels represented by the index values is greater than 0 (i.e., not occluded) and less than 65,535 (the maximum value for depth information storage, corresponding to a distance of 65 m), and update the index values that pass both the corner and depth checks in the position grid.
The output of this step is the depth map after the corner and depth checks, together with the coordinates used to eliminate unqualified pixels. The i and j index values each contain h × w entries. The torch.floor() function returns the largest integer less than or equal to each element, and the torch.ceil() function returns the smallest integer greater than or equal to each element. When i and j are integers, both operations return the element itself; when they are not, the combinations of the floored and ceiled values of i and j are assigned to the upper-left, upper-right, lower-left, and lower-right corners, which define the index values used to check the depth image.
After obtaining the filtered depth image and its corresponding position grid, we use weight coefficients, which are determined by the upper and lower bounds of the i and j index values, to compute new depth values as a weighted sum of the four nearest depth values; that is, we use bilinear interpolation to update the pixel values of the depth image. In addition, the 2D coordinates and 1D index values of the filtered depth images are stored for further conversion.
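The corner check, depth check, and bilinear update described above can be sketched for a single fractional position as follows. This is an illustrative NumPy version under stated assumptions: the function name `sample_depth` is hypothetical, the depth image is taken to hold raw values with 65,535 as the invalid maximum, and invalid positions are signalled by returning `None`.

```python
import numpy as np

def sample_depth(depth, i, j, max_depth=65535):
    """Bilinearly interpolate depth at fractional (i, j); None if unqualified."""
    h, w = depth.shape
    i0, i1 = int(np.floor(i)), int(np.ceil(i))
    j0, j1 = int(np.floor(j)), int(np.ceil(j))
    # Corner check: all four neighbours must lie inside the image.
    if i0 < 0 or j0 < 0 or i1 >= h or j1 >= w:
        return None
    corners = [depth[i0, j0], depth[i0, j1], depth[i1, j0], depth[i1, j1]]
    # Depth check: every neighbour must be strictly between 0 and max_depth.
    if any(d <= 0 or d >= max_depth for d in corners):
        return None
    # Weights come from the distance to the lower bounds of i and j.
    di, dj = i - i0, j - j0
    return ((1 - di) * (1 - dj) * corners[0] + (1 - di) * dj * corners[1]
            + di * (1 - dj) * corners[2] + di * dj * corners[3])

depth_example = np.array([[100.0, 200.0], [300.0, 400.0]])
center_value = sample_depth(depth_example, 0.5, 0.5)
```

When i and j are integers, the floor and ceiling coincide and the function returns the stored depth value itself.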

Associate Pixels of the Color Image and Depth Image
To obtain pixel matches between color images, we apply epipolar geometry to the depth map, camera intrinsics, and camera extrinsic parameters of the 7Scenes dataset. Epipolar geometry calculates the relationships between the 3D points and points on the projected 2D images from cameras taken from different views.
As the camera intrinsic parameters of the 7Scenes dataset were not calibrated, we followed the official instructions and set the focal length to 585, the coordinate axis tilt parameter to 0, and the principal point coordinates to (320, 240). KinectFusion provides the camera's extrinsic parameters in the 7Scenes dataset.
The pinhole camera model projects objects from the world coordinate system to the 2D pixel plane through the camera plane. In this model, $P_w = [x_w, y_w, z_w]^T$, $P_c = [x_c, y_c, z_c]^T$, $P_{xy} = [x, y]^T$, and $P_{uv} = [u, v]^T$ represent the same object in the world coordinate system, camera coordinate system, image coordinate system, and pixel coordinate system, respectively. The depth information is lost from the camera's coordinate system to the image coordinate system during the projection process. Through a rigid transformation, perspective transformation, and affine transformation, the coordinate transformation can be performed in different coordinate systems. The specific transformation method and equation are shown in Figures 2 and 3.

Figure 2. Coordinate transformation between the world coordinate system, camera coordinate system, image coordinate system, and pixel coordinate system.
Figure 3. Coordinate conversion formula between different coordinate systems, including the relation between image and pixel coordinates, $x = (u - u_0)\,dx$, $y = (v - v_0)\,dy$.
Since the same point in the real world has the same coordinates in the world coordinate system, it is possible to obtain the world coordinates from the pixel coordinates in the first image through coordinate transformation. Then, the pixel coordinates of the same point in the second image can be obtained from its world coordinates. This allows us to obtain pixel correspondences between the image pairs. Figure 4 depicts the process of obtaining pixel matches between image pairs with qualified depth information. Note that, since the position grid coordinate system and the pixel coordinate system are orthogonal, converting between them requires exchanging the positions of the two coordinate axes. Pixels of the second depth image for which the difference between the transformed depth value and the original depth value is greater than 0.05 are considered occluded and are filtered out. After these procedures, the filtered pixel correspondences between the color image pairs can be obtained.
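The chain pixel → camera → world → camera → pixel described above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name `reproject` is hypothetical, the intrinsics use the values stated in the text (focal length 585, principal point (320, 240), zero skew), depth is assumed to be in metres, and the poses are assumed to be 4 × 4 camera-to-world matrices.

```python
import numpy as np

# Intrinsic matrix built from the values given in the text.
K = np.array([[585.0, 0.0, 320.0],
              [0.0, 585.0, 240.0],
              [0.0, 0.0, 1.0]])

def reproject(uv1, depth1, pose1, pose2, K=K):
    """Map pixels of image 1 into image 2.

    uv1: (2, n) pixel coordinates (u, v); depth1: (n,) depth values;
    pose1, pose2: (4, 4) camera-to-world matrices (assumed convention).
    Returns the (2, n) pixel coordinates in image 2 and the (n,) depths
    of the points as seen from camera 2.
    """
    n = uv1.shape[1]
    pix_h = np.vstack([uv1, np.ones(n)])             # homogeneous pixels (3, n)
    cam1 = (np.linalg.inv(K) @ pix_h) * depth1       # camera 1 coordinates (3, n)
    world = pose1 @ np.vstack([cam1, np.ones(n)])    # world coordinates (4, n)
    cam2 = (np.linalg.inv(pose2) @ world)[:3]        # camera 2 coordinates (3, n)
    uv2 = (K @ cam2)[:2] / cam2[2]                   # perspective division
    return uv2, cam2[2]

# Camera 2 is shifted 0.1 m along the x axis; the point lies 2 m ahead.
pose_a = np.eye(4)
pose_b = np.eye(4)
pose_b[0, 3] = 0.1
uv2, z2 = reproject(np.array([[320.0], [240.0]]), np.array([2.0]), pose_a, pose_b)
```

The returned depth `z2` can then be compared with the value stored in the second depth image at `uv2`; a difference above the 0.05 threshold marks the pixel as occluded.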

Model structure
To exploit the advantages of simultaneous detection and description methods, such as robustness to challenging scenes and efficient storage and matching, while addressing their sensitivity to shape information (such as scale and direction) and their limited keypoint positioning accuracy, the method also uses geometric constraints between image pairs.
Figure 4. Process of obtaining pixel matches between image pairs with qualified depth information (pixel grid (2, n), camera intrinsic parameters K1, camera coordinate system (3, n), camera extrinsic parameters T1, world coordinate system (4, n)).

Method
This section presents the framework of the proposed method, which is an end-to-end camera pose estimation network based on relative pixel correspondences and the multi-level deformable network. We also introduce our designed loss function, which includes the global feature loss and the local description-detection feature loss. Figure 5 illustrates the architecture of the network, which combines the supervision of local and global features. The input to the network is an image pair together with the related depth images and pose ground truth. The multi-level deformable network based on L2-Net is used as the feature extractor, and different image resolutions are applied in multiple convolutional layers. The feature detection score map is obtained by sampling and weighting different feature maps. The extracted features are used to regress the absolute pose and relative pose through a fully connected layer. The whole process includes four stages: data preprocessing, image feature extraction, image feature fusion, and image pose regression. The algorithm is illustrated in Figure 6 with the training and testing periods.

Multi-Level Deformable Network
To enhance the modeling ability of convolutional neural networks with fixed geometric structures, a deformable convolution is introduced. It learns offset locations of spatial samples in target tasks [14,15] through back-propagation, with the network trained in an end-to-end manner. This allows for the estimation of pixel-level local feature transformations and global shape modeling using stacked convolutional networks.
Training process: The backbone network trained in this experiment is not pre-trained on a classification dataset; instead, the model is trained from scratch. The input image is first uniformly scaled so that its short side is 256 pixels, and a random 256 × 256 crop is then taken. In each iteration, a batch of image pairs with a frame index difference of 10 is selected from the dataset; a stochastic gradient descent optimizer is used, with Adam [104] for fine-tuning.

Experimental protocol and parameter settings
The experiment runs for a total of 1000 iterations. The learning rate for the first 100 iterations is 1 × 10^−5, and it is divided by 5 every 100 iterations thereafter. In addition, the weighting factors between the detection loss and the description loss, between the absolute loss and the relative loss, and between the local loss and the global loss are all set to 1. The batch size is set to 4, and the maximum number of matching
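The step schedule described above can be written as a one-line function. This sketch assumes that the garbled base rate in the text means 1e-5 and that "every 100 cycles" refers to the same iteration counter; the function name `learning_rate` is hypothetical.

```python
def learning_rate(iteration, base_lr=1e-5, step=100, factor=5.0):
    """Base rate for the first `step` iterations, then divided by `factor`
    once for every further block of `step` iterations."""
    return base_lr / factor ** (iteration // step)

# Learning rates at the start of the first three 100-iteration blocks.
schedule_preview = [learning_rate(it) for it in (0, 100, 200)]
```

Under these assumptions the rate falls from 1e-5 to 2e-6 after 100 iterations and to 4e-7 after 200.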

Dataset image preprocessing
Unlike camera pose regression from a single color image, this method uses depth images to assist in locating the input images in the network. This requires applying the same cropping strategy to both images of a pair and to the color map and the depth map, and normalizing the images to reduce the differences between image pixels. In addition, epipolar geometry is used to compute the pixel coordinates in the different coordinate systems in order to obtain the matching pairs.

Deformable Convolutional Network
The deformable network has the ability to densely estimate local changes in the images and model the transformation of CNNs by learning the offsets added in the spatial sampling locations. The framework of the deformable network is shown in Figure 7, and it can be trained directly from scratch. To reduce the amount of calculation, the network uses the lightweight L2-Net [10] as the backbone network while changing the last 8 × 8 convolution layer into three 3 × 3 convolution layers. The network outputs a 128-dimensional feature map, which is 1/4 of the input resolution.
The goal of the deformable convolutional network (DCN) [14] is to improve the ability to model geometric changes by dynamically learning the changing receptive field. To achieve this, we use a regular grid R to sample the input feature map x in a dense and local manner [14]. The location enumeration $p_k$ represents a specific location on R. The output of a single location, $p_0$, on the feature map y can be computed as follows:
$$y(p_0) = \sum_{k=1}^{K} w_k \cdot x(p_0 + p_k) \qquad (1)$$
where $w_k$ denotes the kernel weight at the k-th location. DCN enhances the regular convolution by additionally learning the sampling offsets [14] $\{\Delta p_k \mid k = 1, \ldots, K\}$, where $K = |R|$; Equation (1) can be rewritten as:
$$y(p_0) = \sum_{k=1}^{K} w_k \cdot x(p_0 + p_k + \Delta p_k) \cdot \Delta m_k$$
Here, $\Delta p_k$ and $\Delta m_k$ represent the learnable offset and modulation scale factor of the k-th position. The range of $\Delta m_k$ is [0, 1], and $\Delta p_k$ has no constraint on its range. Since the offset $\Delta p_k$ is usually fractional, bilinear interpolation is applied to compute $x(p_0 + p_k + \Delta p_k)$. In the training period, the initial values of $\Delta p_k$ and $\Delta m_k$ are set to 0 and 0.5, respectively [15].
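To make the sampling rule concrete, the sketch below evaluates a 3 × 3 deformable convolution at a single location with learnable offsets and modulation factors. It is a plain NumPy illustration under stated assumptions, not the network's implementation: the helper names `bilinear` and `deform_conv_at` are hypothetical, samples outside the feature map are treated as zero, and the kernel weights are passed in as a flat vector.

```python
import numpy as np

def bilinear(x, i, j):
    """Bilinearly sample the 2D array x at fractional (i, j); zero outside."""
    h, w = x.shape
    i0, j0 = int(np.floor(i)), int(np.floor(j))
    value = 0.0
    for a in (i0, i0 + 1):
        for b in (j0, j0 + 1):
            if 0 <= a < h and 0 <= b < w:
                value += (1 - abs(i - a)) * (1 - abs(j - b)) * x[a, b]
    return value

def deform_conv_at(x, weights, p0, offsets, scales):
    """Output of a 3x3 modulated deformable convolution at location p0.

    weights: (9,) kernel weights; offsets: (9, 2) learned offsets;
    scales: (9,) modulation factors in [0, 1].
    """
    # Regular grid R of the nine 3x3 sampling positions around p0.
    grid = [(di, dj) for di in (-1, 0, 1) for dj in (-1, 0, 1)]
    out = 0.0
    for k, (di, dj) in enumerate(grid):
        i = p0[0] + di + offsets[k, 0]
        j = p0[1] + dj + offsets[k, 1]
        out += weights[k] * bilinear(x, i, j) * scales[k]
    return out

feature = np.arange(25.0).reshape(5, 5)
plain = deform_conv_at(feature, np.ones(9), (2, 2), np.zeros((9, 2)), np.ones(9))
```

With zero offsets and unit modulation the result equals an ordinary convolution; setting every modulation factor to 0.5, the stated initial value, halves it.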

Multi-Level Feature Detection Network
Obtaining features from low-resolution feature maps may limit positioning accuracy. Restoring spatial resolution has proven effective in improving positioning accuracy, such as using other feature decoders (e.g., SuperPoint [26]) or employing dilation convolution (e.g., R2D2 [31]). However, these methods increase the number of learning hyperparameters and require significant GPU storage and computational resources. This method uses the multilevel detection method proposed by ASLFeat [32]. This method achieves the restoration of image spatial resolution in a simple and effective way by combining the multi-level feature detection using the inherent pyramid feature of the convolutional network.
Specifically, the method utilizes a feature hierarchy composed of several levels $\{t^{(1)}, t^{(2)}, \ldots, t^{(p)}\}$ with strides $\{1, 2, \ldots, 2^{(p-1)}\}$, and the detection network is applied at each level to obtain a set of detection scores $\{q^{(1)}, q^{(2)}, \ldots, q^{(p)}\}$. Each score map is up-sampled to the spatial resolution of the input image, and the maps are then combined by a weighted sum:
$$\hat{s} = \sum_{l=1}^{p} w_l\, q^{(l)}$$
The advantages of multi-level detection are embodied in three aspects. First, it conforms to the classic scale-space theory [44], because it uses receptive fields of different sizes to locate keypoints. Second, compared with U-Net [45], it recovers the spatial resolution without additional learnable weights and achieves pixel-wise accuracy. Finally, it keeps the low-level features unchanged but integrates multi-level semantic detection [46], which helps preserve low-level structures such as corners or edges.
The architecture of the entire network is shown in Table 1, where the initial resolution of the input image is 256 × 256. After feature extraction through the aforementioned multi-level deformable network, the subsequent multi-layer perceptron outputs the estimated position and rotation through the fully connected layer. Since the network operates on input image pairs, an identical network is copied to form a set of parallel networks that accept the image pair. Finally, for each image of the pair, the network outputs the pose, the feature map, and the score feature map, which are used in the subsequent calculations. Among them, the output of the last convolutional layer in the multi-level network is the feature map, and the weighted sum of the score maps forms the score feature map.
The specific calculation process is as follows: first, obtain the feature maps of the network conv1, conv3, and conv8 layers as input. Then, normalize the feature map by dividing each value by the largest value in the feature map. Next, fill the feature map with mirroring and perform two-dimensional average pooling with a step size of 1 and a pooling area size of 5 to obtain a feature map with the same size as the input. Subtract the normalized feature map and the pooled feature map from the average value of the pooled feature map to obtain the maximum scores on the channel and local levels, respectively. The maximum score multiplied by the maximum value is bilinearly interpolated to the original input image size to obtain the score feature map corresponding to the feature map; the weight coefficient is multiplied and the final score feature map is obtained.
The last three layers of L2-Net, conv6, conv7, and conv8, are replaced by DCN. To calculate multi-level features, conv1, conv3, and conv8 are selected. The weights of the multi-level score combination are $w_l = 1, 2, 3$, and the dilation rate used to search for neighboring pixels N(i, j) is set to 3, 2, 1, respectively. The method thus uses the multi-level deformable network as its feature extraction network.
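The weighted combination of the per-level score maps can be sketched as follows, using the weights 1, 2, 3 given above. Several points are assumptions made for brevity: the function name `combine_scores` is hypothetical, the maps are square, nearest-neighbour up-sampling via `np.kron` is swapped in for the bilinear interpolation used in the text, and the sum is left unnormalized.

```python
import numpy as np

def combine_scores(score_maps, weights, out_size):
    """Up-sample each score map to out_size x out_size and sum with weights."""
    combined = np.zeros((out_size, out_size))
    for q, w in zip(score_maps, weights):
        factor = out_size // q.shape[0]
        # Nearest-neighbour up-sampling: repeat every entry factor x factor times.
        upsampled = np.kron(q, np.ones((factor, factor)))
        combined += w * upsampled
    return combined

# Three levels at full, half, and quarter resolution of an 8 x 8 input.
levels = [np.ones((8, 8)), np.ones((4, 4)), np.ones((2, 2))]
score = combine_scores(levels, weights=(1, 2, 3), out_size=8)
```

Giving the deepest level the largest weight lets semantically stronger responses dominate, while the full-resolution level keeps the localization of corners and edges.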

Local Feature Extraction Based on Pixel Matching
In contrast to the traditional "detect first and then describe" approach, which consists of two separate stages, D2-Net [33] proposes a method that computes dense features of an image by simultaneously obtaining detector and descriptor representations. ASLFeat [32], in turn, improves the measurement method by calculating the loss of local detection and description features. On this basis, this section proposes a loss that fuses local features and global features. During global image training, the position and orientation losses are regressed. By weighting the global loss and the local loss and minimizing their sum, the positioning performance can be improved, satisfying both local rotation invariance and global rotation consistency. This section introduces the process and method of local feature extraction based on pixel matching.
The loss function module includes the global feature loss and the local feature loss. The global feature loss is the weighted sum of the absolute pose loss of the query image and the relative pose loss between image pairs. The local feature loss is the combination of the descriptor loss and the detector loss. The descriptor loss is a triplet loss over matched positive and negative samples, and the detector score is obtained by multiplying the local maximum score and the channel maximum score in the feature map, taking the maximum of this product, and normalizing it.

Loss of Feature Descriptor
After the input training image I passes through the multi-level deformable convolutional network $\mathcal{F}$, a three-dimensional tensor $F = \mathcal{F}(I)$, $F \in \mathbb{R}^{h \times w \times n}$ can be obtained, where h × w is the feature map size and n is the number of channels. The most direct representation of the three-dimensional tensor F is to set the descriptor vector d as a dense set where $d_{ij} = F_{ij:}$, $d \in \mathbb{R}^n$. Here, i = 1, . . . , h and j = 1, . . . , w. Through the descriptor vector, it is easier to compare the difference between images and establish corresponding relationships using the Euclidean distance. These descriptors are dynamically adjusted during the training phase. Even if the image contains strong appearance changes, the same set of points in the scene can produce similar descriptors. Before comparing the descriptors, it is necessary to apply L2 normalization to the descriptors:
$$\hat{d}_{ij} = d_{ij} / \lVert d_{ij} \rVert_2$$
First, we introduce the calculation method of the triplet margin ranking loss. Given a set of image pairs $(I_1, I_2)$ and a correspondence $c : A \leftrightarrow B$, where $A \in I_1$, $B \in I_2$, the positive distance is the distance between the descriptors of the corresponding pixels:
$$p(c) = \lVert \hat{d}_A^{(1)} - \hat{d}_B^{(2)} \rVert_2$$
The distance to the descriptor of the negative sample pixel in the other image is:
$$n(c) = \min\left( \lVert \hat{d}_A^{(1)} - \hat{d}_{N_2}^{(2)} \rVert_2,\ \lVert \hat{d}_{N_1}^{(1)} - \hat{d}_B^{(2)} \rVert_2 \right)$$
The negative sample points on the two images are defined as:
$$N_1 = \arg\min_{P \in I_1} \lVert \hat{d}_P^{(1)} - \hat{d}_B^{(2)} \rVert_2$$
where P is restricted to pixels outside the neighborhood of the true match A, and $N_2$ is defined in the same way on $I_2$. The calculation formula of the triplet margin ranking loss is:
$$m(c) = \max\left(0,\ M + p(c)^2 - n(c)^2\right)$$
The calculation diagram describing the loss is shown in Figure 8.
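The triplet margin ranking loss with hardest negatives can be sketched for a batch of correspondences as follows. This is a simplified NumPy illustration, not the training code: the function name `triplet_margin_loss` is hypothetical, the margin value M = 1 is an assumption, and negatives are searched only among the descriptors of the other correspondences rather than over all pixels outside a neighborhood of the true match.

```python
import numpy as np

def triplet_margin_loss(desc1, desc2, margin=1.0):
    """Per-correspondence loss m(c) = max(0, M + p(c)^2 - n(c)^2).

    desc1, desc2: (n, c) descriptors; row k of desc1 matches row k of desc2.
    """
    # L2-normalise every descriptor before comparing.
    d1 = desc1 / np.linalg.norm(desc1, axis=1, keepdims=True)
    d2 = desc2 / np.linalg.norm(desc2, axis=1, keepdims=True)
    # dist[a, b] = distance between descriptor a of image 1 and b of image 2.
    dist = np.linalg.norm(d1[:, None, :] - d2[None, :, :], axis=2)
    pos = np.diag(dist)                                   # p(c)
    # Exclude the true matches, then take the hardest negative in each image.
    masked = dist + np.diag(np.full(len(dist), np.inf))
    neg = np.minimum(masked.min(axis=1), masked.min(axis=0))  # n(c)
    return np.maximum(0.0, margin + pos ** 2 - neg ** 2)

perfect = triplet_margin_loss(np.eye(3), np.eye(3))
mismatched = triplet_margin_loss(np.eye(3), np.roll(np.eye(3), 1, axis=0))
```

The loss vanishes when matched descriptors coincide and the hardest negative is far enough away, and it grows when a wrong descriptor is closer than the true match.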


Feature Detection Sub-Loss
The three-dimensional tensor F can be represented by another set of two-dimensional responses D [47], $D^k = F_{::k}$, $D^k \in \mathbb{R}^{h \times w}$, where k = 1, . . . , n. In this interpretation, the feature extraction function $\mathcal{F}$ can be regarded as n different feature detection functions $\mathcal{D}^k$, each of which generates a two-dimensional response map $D^k$. These detection response maps are similar to the difference-of-Gaussians (DoG) response maps obtained in the scale-invariant feature transform (SIFT [9]), or the score maps obtained in the Harris corner detection algorithm [48].
Traditional feature detection methods (such as DoG) make the detection map sparse by suppressing the spatial non-maximum values. Selecting the detected point (i, j) from multiple detection maps $D^k$ (k = 1, . . . , n) requires meeting the following criteria: $D^k_{ij}$ is a local maximum in $D^k$, and $k = \arg\max_t D^t_{ij}$. It can be intuitively understood that, for each pixel (i, j), we first select the best detector $\mathcal{D}^k$ among the different channels, and then verify whether the response map $D^k$ of that detector has a local maximum at (i, j). Because backpropagation is required during network training, a series of soft scores is used to represent the detection information of pixels. First, the local maximum score is defined as the keypoint peak detection:
$$\alpha^k_{ij} = \frac{\exp(D^k_{ij})}{\sum_{(i',j') \in N(i,j)} \exp(D^k_{i'j'})}$$
where N(i, j) is a collection of nine pixels, including the pixel (i, j) and its surroundings. The channel selection is defined as the non-maximum suppression of each descriptor on the channel:
$$\beta^k_{ij} = \frac{D^k_{ij}}{\max_t D^t_{ij}}$$
In order to consider both the on-channel and local scores, the two scores are multiplied and maximized over all feature maps to obtain a score map:
$$\gamma_{ij} = \max_k \left( \alpha^k_{ij}\, \beta^k_{ij} \right)$$
The score is obtained by performing image-level normalization on the pixel (i, j):
$$s_{ij} = \frac{\gamma_{ij}}{\sum_{(i',j')} \gamma_{i'j'}}$$
The schematic of the detection loss is shown in Figure 9. To make the neural network more robust to scale changes and viewpoint changes, an image pyramid is used to send the input image to the neural network at three resolutions of 0.5, 1, and 2 times, respectively. For each resolution ρ, the feature map $F^\rho$ is calculated. Then, the feature map of the smaller-resolution image is propagated to the feature map of the larger-resolution image. Summing feature maps of different resolutions requires bilinear interpolation to bring them to the same resolution.
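The four scoring steps described above (local peak score, channel selection score, maximization over channels, and image-level normalization) can be sketched as follows. This NumPy version is an assumption-laden illustration in the style of the soft detection of D2-Net [33]: the function name `detection_scores` is hypothetical, the feature map is assumed to be strictly positive, the 3 × 3 neighborhood uses zero padding at the borders, and the softmax-style ratio for the local score is assumed.

```python
import numpy as np

def detection_scores(F):
    """F: (h, w, n) positive feature map -> (h, w) score map that sums to 1."""
    h, w, n = F.shape
    exp_f = np.exp(F)
    # Sum of exp responses over the nine pixels of N(i, j), per channel.
    padded = np.pad(exp_f, ((1, 1), (1, 1), (0, 0)))
    neigh_sum = sum(padded[a:a + h, b:b + w] for a in range(3) for b in range(3))
    alpha = exp_f / neigh_sum                     # local maximum (peak) score
    beta = F / F.max(axis=2, keepdims=True)       # channel selection score
    gamma = (alpha * beta).max(axis=2)            # best product over channels
    return gamma / gamma.sum()                    # image-level normalisation

features = np.full((5, 5, 2), 0.1)
features[2, 2, 0] = 5.0                           # one strong response
s = detection_scores(features)
peak = np.unravel_index(s.argmax(), s.shape)
```

Because every step is differentiable almost everywhere, the resulting scores can be used as weights in a loss, unlike hard non-maximum suppression.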
To train the detection and description processes simultaneously with a single neural network, a loss function is needed that jointly optimizes both for local features, so that the keypoints found during detection are repeatable under viewpoint changes and illumination changes. During description, the descriptors are deliberately made distinct from each other to avoid mismatches. The triplet margin ranking loss is used to optimize the descriptors while maintaining their distinctiveness. To additionally optimize the repeatability of the detector, the detection scores $s_{ij}$ are incorporated into the triplet margin ranking loss, so that the detection and description processes are optimized at the same time by a single local-feature loss.
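As an illustration of how detection scores can enter a triplet margin ranking loss, the following minimal sketch weights each correspondence by the product of its detection scores in the two images. The local loss equation is not reproduced in this text, so the weighting scheme is an assumption modeled on D2-Net [33]; the function name, the squared-distance margin term, and the default margin of 1 are hypothetical.

```python
def detection_weighted_triplet_loss(pos_dist, neg_dist, s1, s2, margin=1.0):
    """Triplet margin ranking loss weighted by detection scores.

    pos_dist[c]: descriptor distance of correspondence c (positive pair)
    neg_dist[c]: distance to the hardest negative sample for c
    s1[c], s2[c]: detection scores of c in the first and second image
    Assumed form: sum_c w_c * max(0, margin + pos^2 - neg^2),
    with w_c = s1_c * s2_c / sum_q (s1_q * s2_q).
    """
    weights = [a * b for a, b in zip(s1, s2)]
    total = sum(weights)
    loss = 0.0
    for p, n, w in zip(pos_dist, neg_dist, weights):
        loss += (w / total) * max(0.0, margin + p * p - n * n)
    return loss


# Two correspondences: the first is well separated, the second is ambiguous.
pos, neg = [0.2, 0.9], [1.5, 1.0]
uniform = detection_weighted_triplet_loss(pos, neg, [0.5, 0.5], [0.5, 0.5])
weighted = detection_weighted_triplet_loss(pos, neg, [0.9, 0.1], [0.9, 0.1])
```

With uniform scores the ambiguous correspondence contributes half of the loss; once the detector assigns it a low score, its contribution shrinks. Under this assumed form, minimizing the loss therefore rewards high detection scores on distinctive correspondences and low scores on ambiguous ones.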

Loss Function Based on the Image Sequence for Global Features
For global features, in addition to the loss function with learnable weights that can constrain geometric information, MapNet [12] proposes the use of temporal constraints on image pairs. These push the network to learn global features that achieve consistent overall localization accuracy. The method in this section uses both geometric constraints and temporal constraints in the loss function of the global features. In this loss, $i$ and $j$ are the indices of the two images in a pair, $I_{ij} = (p_i - p_j, q_i - q_j)$ is the relative pose between the images $I_i$ and $I_j$, and $\alpha$ is the weighting factor between the absolute pose loss obtained from a single image and the relative pose loss obtained from the image pair. The term $loss(I_i)$ describes the distance between the predicted camera pose and the pose ground truth, with $\beta$ and $\gamma$ as the learnable weights, and is defined as:
$$loss(I_i) = \|p - p^*\|_1 e^{-\beta} + \beta + \|q - q^*\|_1 e^{-\gamma} + \gamma \qquad (11)$$
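A small sketch may help to make the structure of this loss concrete. It implements Equation (11) directly and then combines absolute and relative terms as a sum of per-image losses plus $\alpha$ times the per-pair losses. That combination follows the MapNet [12] formulation and is an assumption here, as is the reuse of the same $\beta$ and $\gamma$ for the relative term; all function names are illustrative.

```python
import math


def l1(a, b):
    """L1 distance between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def pose_loss(p, q, p_gt, q_gt, beta, gamma):
    """Equation (11): weighted L1 position and orientation errors."""
    return (l1(p, p_gt) * math.exp(-beta) + beta
            + l1(q, q_gt) * math.exp(-gamma) + gamma)


def relative_pose(pose_i, pose_j):
    """I_ij = (p_i - p_j, q_i - q_j), as defined in the text."""
    (p_i, q_i), (p_j, q_j) = pose_i, pose_j
    return ([a - b for a, b in zip(p_i, p_j)],
            [a - b for a, b in zip(q_i, q_j)])


def global_loss(pred, gt, pairs, alpha, beta, gamma):
    """Absolute loss over all images plus alpha times the pairwise loss.

    Assumption: the relative term uses the same form as Equation (11).
    """
    absolute = sum(pose_loss(p, q, pg, qg, beta, gamma)
                   for (p, q), (pg, qg) in zip(pred, gt))
    relative = 0.0
    for i, j in pairs:
        v_p, v_q = relative_pose(pred[i], pred[j])
        g_p, g_q = relative_pose(gt[i], gt[j])
        relative += pose_loss(v_p, v_q, g_p, g_q, beta, gamma)
    return absolute + alpha * relative


# Two frames; the second predicted position is off by 0.5 m along x.
pred = [([0.0, 0.0, 0.0], [0.0, 0.0, 0.0]), ([1.0, 0.0, 0.0], [0.0, 0.0, 0.0])]
gt = [([0.0, 0.0, 0.0], [0.0, 0.0, 0.0]), ([1.5, 0.0, 0.0], [0.0, 0.0, 0.0])]
value = global_loss(pred, gt, pairs=[(0, 1)], alpha=1.0, beta=0.0, gamma=0.0)
```

In an actual training setup, $\beta$ and $\gamma$ would be trainable parameters updated together with the network weights rather than fixed floats as in this sketch.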

Datasets
The 7Scenes dataset [49], released by Microsoft, is an indoor dataset captured with a Kinect sensor. It provides color images, depth maps, and pose ground truth for seven scenes and is a popular benchmark for indoor camera pose estimation.
The HPatches dataset [50] includes 116 image sequences together with ground-truth homography matrices and can be used to evaluate the performance of local descriptors. Of these, 57 sequences exhibit illumination changes and 59 sequences exhibit viewpoint/occlusion changes.

Implementation Details
The experiment was implemented with PyTorch, using the following training settings:
• Training iterations of 1000. This value was determined empirically to achieve good convergence and performance;
• Balancing factors between the detection loss and description loss, the absolute loss and relative loss, and the local loss and global loss, all set to 1. These values give equal importance to the different components of the loss function and also performed well in our experiments;
• Initial learning rate of $1 \times 10^{-5}$ for the first 100 iterations, divided by 5 for every 100 iterations thereafter. This learning rate schedule was determined empirically to achieve stable training progress and convergence.
We use a batch size of 4, with 128 matching correspondences. The backbone network is trained from scratch without pretraining on a classification dataset. The input images are uniformly scaled to 256 pixels on the short side and then randomly cropped to 256 × 256. For every iteration, a pair of images with a frame index difference of 10 is selected. The network is optimized by stochastic gradient descent with the Adam [53] solver. During inference, the input images are scaled to 256 pixels on the short side and then center-cropped to 256 × 256.
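The step schedule and the pair selection described above are simple enough to state as code. The following sketch assumes that the learning rate is divided by 5 at every multiple of 100 iterations and that pairs are formed as (t, t + 10) within one sequence; both function names are illustrative.

```python
def learning_rate(iteration, base_lr=1e-5, step=100, factor=5.0):
    """Learning rate at a given iteration: base_lr for the first `step`
    iterations, then divided by `factor` after every further `step`."""
    return base_lr / factor ** (iteration // step)


def frame_pairs(num_frames, gap=10):
    """All index pairs (t, t + gap) that fit inside a sequence."""
    return [(t, t + gap) for t in range(num_frames - gap)]


schedule = [learning_rate(it) for it in (0, 99, 100, 250, 999)]
pairs = frame_pairs(num_frames=13, gap=10)  # [(0, 10), (1, 11), (2, 12)]
```

In PyTorch, the same decay can be obtained with `torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.2)` when the scheduler is stepped once per iteration. Note that after 1000 iterations the rate has been divided by 5 nine times, so the last iterations make only very small updates.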

Multi-Step Image Pixel Reprojection
Given an index gap of 10 and two images per pair, we obtain a set of image pairs from the 7Scenes dataset, which provides depth images, color images, and camera intrinsic and extrinsic parameters. Through image processing and epipolar geometry, the pixel correspondences of each image pair can be computed.
Taking an image pair with an index gap of 400 as an example, the process is illustrated in Figure 10. The initial number of pixels for each image in the pair is h × w = 480 × 640 = 307,200. First, we filter out invalid pixels using a depth check on the first depth image, which leaves 245,574 pixels. Second, we obtain the corresponding pixel positions in the second depth image using epipolar geometry, which still gives 245,574 pixels. Third, we filter out pixels with invalid depth values and corner indices, which leaves 148,857 pixels. Fourth, we filter out pixels whose projected depth differs from their own depth by more than 0.05 m, which leaves 126,411 pixel pairs. Finally, we randomly sample 512 pixel pairs. The matching correspondences are shown in Figure 11.
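The filtering steps above can be sketched for a single pixel as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it uses a pinhole model with intrinsics `(fx, fy, cx, cy)`, a 4 × 4 relative transform `T_12` from the first to the second camera (which can be composed from the two extrinsic matrices), nearest-pixel rounding, and a depth of 0 as the marker for invalid measurements. All function names are hypothetical.

```python
def backproject(u, v, depth, K):
    """Pixel (u, v) with depth in metres -> 3D point in the camera frame."""
    fx, fy, cx, cy = K
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def transform(T, P):
    """Apply a 4 x 4 rigid transform (nested lists, row-major) to a point."""
    x, y, z = P
    return tuple(T[r][0] * x + T[r][1] * y + T[r][2] * z + T[r][3]
                 for r in range(3))


def project(P, K):
    """3D point in the camera frame -> continuous pixel coordinates."""
    fx, fy, cx, cy = K
    x, y, z = P
    return (fx * x / z + cx, fy * y / z + cy)


def reproject_pixel(u, v, depth1, depth2, K, T_12, max_diff=0.05):
    """Return the matching pixel in image 2, or None if any check fails."""
    h, w = len(depth1), len(depth1[0])
    d1 = depth1[v][u]
    if d1 <= 0:                                  # step 1: invalid depth
        return None
    P2 = transform(T_12, backproject(u, v, d1, K))
    if P2[2] <= 0:                               # point behind camera 2
        return None
    u2f, v2f = project(P2, K)                    # step 2: position in image 2
    u2, v2 = int(round(u2f)), int(round(v2f))
    if not (0 <= u2 < w and 0 <= v2 < h):        # step 3: outside the image
        return None
    d2 = depth2[v2][u2]
    if d2 <= 0:                                  # step 3: invalid depth
        return None
    if abs(P2[2] - d2) > max_diff:               # step 4: depth consistency
        return None
    return (u2, v2)


K = (100.0, 100.0, 2.0, 2.0)
flat = [[2.0] * 5 for _ in range(5)]             # a wall 2 m from the camera
shift_x = [[1, 0, 0, 0.02], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
match = reproject_pixel(1, 3, flat, flat, K, shift_x)
```

Applying this check to every pixel and then drawing a fixed-size random subset of the surviving pairs (for example with `random.sample`) corresponds to the final sampling step.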

Image-Matching Experiment on HPatches Dataset
We evaluate the performance of local descriptors on the HPatches dataset using the following metrics: (1) keypoint repeatability (%Rep.): the ratio of potential matches to the minimum number of keypoints in the co-visible view; (2) descriptor matching score (%MS): the ratio of correct matches to the minimum number of keypoints in the co-visible view; (3) mean matching accuracy (%MMA): the ratio of correct matches to potential matches. A match is defined as a pair of mutual nearest neighbors whose point distance is below the error threshold.
For these metrics, Table 2 compares our results, averaged over the image pairs in the dataset, with SuperPoint [26] and D2-Net [33]. SuperPoint [26] is a widely used method for keypoint detection and description, known for its repeatability and accuracy in challenging scenarios. D2-Net [33] is another state-of-the-art method that has demonstrated excellent performance in local feature extraction, matching, and camera pose estimation tasks.
Figure 12 compares the keypoint matches obtained from SIFT features with those obtained from the multi-step matching method in the chess scene of the 7Scenes dataset. Here, the step denotes the frame index gap. The number of keypoints matched by SIFT features decreases as the step increases, whereas multi-step matching provides a constant number of matches within a given range (for example, 128 randomly selected pairs in our experiments), which improves the robustness of image matching and the reliability of the gradient values.
Obtaining robust and accurate image-matching results efficiently under challenging environmental conditions is essential. The most popular image-matching methods can be divided into sparse matching (which includes detection and description processes) and dense matching (which includes only a description process). Table 3 summarizes the process, advantages, and disadvantages of various public matching methods.
Detect-then-describe methods have low robustness because the low-dimensional features used by local detectors are sensitive to pixel intensities. Dense matching methods perform well in areas with changing illumination; however, their memory and time consumption for matching are high.
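For reference, the three HPatches metrics used in this section reduce to simple ratios once the keypoints and matches have been counted. The sketch below assumes that both %Rep. and %MS are normalized by the smaller of the two keypoint counts in the co-visible region, which is the convention commonly used with these metrics; the function name and the example counts are made up for illustration.

```python
def matching_metrics(n_kpts_1, n_kpts_2, n_potential, n_correct):
    """Return (%Rep., %MS, %MMA) from counts in the co-visible region.

    n_kpts_1, n_kpts_2: keypoints detected in each image
    n_potential: keypoint pairs that agree geometrically within the threshold
    n_correct: potential matches that are also descriptor nearest neighbors
    """
    denom = min(n_kpts_1, n_kpts_2)
    rep = 100.0 * n_potential / denom
    ms = 100.0 * n_correct / denom
    mma = 100.0 * n_correct / n_potential if n_potential else 0.0
    return rep, ms, mma


rep, ms, mma = matching_metrics(n_kpts_1=200, n_kpts_2=250,
                                n_potential=120, n_correct=90)
```

Because %MS is the product of %Rep. and %MMA (up to the factor of 100), a method can only reach a high matching score if both its detector and its descriptor perform well.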

Pose Estimation Experiment on 7Scenes Dataset
To verify the performance of our proposed network on the pose estimation task, we conducted experiments on the 7Scenes dataset and compared the results with several competing methods that use multiple images or videos as input. VidLoc [54], MapNet [12], and LSG [55] were selected for comparing the translation (in m) and rotation (in °) errors. As shown in Table 4, our method achieves smaller pose errors than the other related methods, which confirms the effectiveness of the proposed loss function and pixel constraints.
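The two error measures reported in this comparison can be computed as in the following sketch. It assumes the common convention for this benchmark: the translation error is the Euclidean distance between the predicted and ground-truth positions, and the rotation error is the angle of the rotation that separates the two unit quaternions. The text does not spell out the exact evaluation code, so the function names and conventions here are assumptions.

```python
import math


def translation_error(p, p_gt):
    """Euclidean distance between predicted and true positions (metres)."""
    return math.dist(p, p_gt)


def rotation_error_deg(q, q_gt):
    """Angle in degrees between two orientations given as quaternions.

    The absolute value of the dot product makes q and -q, which encode
    the same rotation, give an error of zero.
    """
    norm = (math.sqrt(sum(x * x for x in q))
            * math.sqrt(sum(x * x for x in q_gt)))
    dot = abs(sum(a * b for a, b in zip(q, q_gt))) / norm
    dot = min(1.0, dot)  # guard against rounding slightly above 1
    return math.degrees(2.0 * math.acos(dot))


t_err = translation_error((0.0, 0.0, 0.0), (0.3, 0.4, 0.0))
half = math.sqrt(0.5)
r_err = rotation_error_deg((1.0, 0.0, 0.0, 0.0), (half, 0.0, 0.0, half))
```

Here the first example is 0.5 m away from the ground truth, and the second differs from it by a 90° rotation about the z-axis.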
Furthermore, Table 5 compares the methods that use multiple images or video as input in terms of robustness, type of graphics card, input image size in pixels, processing time per image in milliseconds, and network model size. Our proposed method is competitive in time consumption, exploits temporal smoothness and local features, and is robust to motion blur (correspondences from image pairs can account for moving objects) and to drift (the relative pose imposes geometric constraints on image pairs, which reduces drift). Without pre-training, the size of our network model is 60 MB. Compared to VidLoc, the time consumption of our method for testing each image is significantly lower at 10.2 milliseconds.
To evaluate the effectiveness of each module used in the network while keeping the experiments short, we select the heads scene of the 7Scenes dataset, which has the smallest number of images, and take its results as representative of the whole dataset. The ablation study was conducted for 100 iterations with a learning rate of $1 \times 10^{-6}$. Since the local loss module can only be used with the output of the multi-level deformable network, we compared the pose estimation results of ResNet and the multi-level deformable network, as well as different weightings of the global loss and local loss modules. Table 6 shows that the multi-level deformable network and the combination of global loss and local loss yield smaller pose errors. Increasing the weight of the local loss slightly improves the pose estimation results.

Conclusions
In this paper, we propose a regression-based camera pose estimation framework that consists of a multi-level deformable network for feature extraction and a loss function that fuses multi-view features with both local rotation invariance and global rotation consistency. To address challenges such as changing environments and motion blur in datasets, we design the feature extraction network and multi-level network to be robust and accurate. Our experiments on the 7Scenes and HPatches datasets show that our proposed network outperforms competing methods in accuracy and robustness. We demonstrate that correspondences produced by camera sensors, including RGB and depth cameras, can improve the optimization of local detection and description when integrated with global feature supervision, which leverages the rotation consistency of global features and the rotation invariance of local features. Moreover, the features learned under global and local supervision are also suitable for image matching. In future work, we will apply a learnable balancing factor to the loss to improve the scalability and portability of the model, add other common sensors, e.g., IMUs, to improve indoor localization performance, and apply these methods to robot navigation and planning to enable smarter living.

Data Availability Statement: Publicly available datasets were analysed in this study. These datasets can be found here: 7Scenes dataset: https://www.microsoft.com/en-us/research/project/rgb-d-dataset-7scenes/, (accessed on 1 January 2013); HPatches dataset: https://github.com/hpatches/hpatchesdataset, (accessed on 19 April 2017).

Conflicts of Interest:
The authors declare no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:
$h$: height of the image
$w$: width of the image
$i$: index value of the image height
$j$: index value of the image width
$P_w$: the point coordinates $(x_w, y_w, z_w)$ in the world coordinate system
$P_c$: the point coordinates $(x_c, y_c, z_c)$ in the camera coordinate system
$P_{xy}$: the point coordinates $(x, y)$ in the image coordinate system
$P_{uv}$: the point coordinates $(u, v)$ in the pixel coordinate system
$f$: the focal length in the pinhole model
$T$: camera external parameter matrix
$K$: camera internal parameter matrix
$x$: feature map
$R$: regular grid
$p_k$: the enumeration of the location in $R$
$\Delta p_k$: the learnable offset
$\Delta m_k$: the module scale factor of the $k$-th position
$w_i$: the weight factor in different convolutional layers
$N(i, j)$: the expansion rate of searching for neighboring pixels
$F$: the tensor obtained through the multi-level deformable convolutional network
$n$: number of channels
$d$: descriptor vector
$N$: negative sample points in the image $I$
$D$: two-dimensional response
$\rho$: resolution