Automatic Registration for Panoramic Images and Mobile LiDAR Data Based on Phase Hybrid Geometry Index Features

: The registration of panoramic images and mobile light detection and ranging (LiDAR) data is quite challenging because different imaging mechanisms and viewing angle differences generate signiﬁcant geometric and radiation distortions between the two multimodal data sources. To address this problem, we propose a registration method for panoramic images and mobile LiDAR data based on the hybrid geometric structure index feature of phase. We use the initial GPS/IMU to transform the mobile LiDAR data into an intensity map and align the two images to complete registration. Firstly, a novel feature descriptor called a hybrid geometric structure index of phase (HGIFP) is built to capture the structural information of the images. Then, a set of corresponding feature points is obtained from the two images using the constructed feature descriptor combined with a robust false-match elimination algorithm. The average pixel distance of the corresponding feature points is used as the error function. Finally, in order to complete the accurate registration of the mobile LiDAR data and panoramic images and improve computational efﬁciency, we propose the assumption of local motion invariance of 3D–2D corresponding feature points and minimize the error function through multiple reprojections to achieve the best registration parameters. The experimental results show that the method in this paper can complete the registration of panoramic images and the mobile LiDAR data under a rotation error within 12 ◦ and a translation error within 2 m. After registration, the average error of rotation is about 0.15 ◦ , and the average error of translation is about 1.27 cm. Moreover, it achieves a registration accuracy of less than 3 pixels in all cases, which outperforms the current ﬁve state-of-the-art methods, demonstrating its superior registration performance.


Introduction
In recent years, with the rise of vehicle navigation [1], urban planning [2,3], cultural relics protection [4], and assisted driving [5,6], the demand for 3D laser point clouds has dramatically increased. The accuracy of the 3D laser point cloud is also getting higher and higher. However, although the 3D laser point cloud can accurately record objects' positions and surface features, it lacks a description of object texture information. Additionally, images have good spatial resolution and can provide rich textures and semantics, but lack a description of the depth information of objects. The fusion processing of the 3D laser point cloud and an image can effectively make up for the lack of a single data source, and this combination is widely used in building extraction [7], 3D reconstructions of large-scale scenes [8,9], point cloud classification [10,11] and dynamic object detection [12][13][14]. The precise registration of the 3D laser point cloud and the image is the premise of fusion processing. Only accurate registration can enable all the respective advantages of the two sensors.
The registration between the 3D laser point cloud and the image is necessary to calculate the rigid transformation relationship between the two. Through translation and rotation, the projected image of the 3D laser point cloud can be aligned with the image captured by the camera. Laser scanners are used to acquire 3D point cloud data. Depending on the acquisition method, laser scanning can be divided into terrestrial laser scanning (TLS), mobile laser scanning (MLS) and airborne laser scanning (ALS) [15]. In general, ALS can be used for large-scale building surveys, roads and forests; TLS can be used for more detailed but slower urban surveys in outdoor and indoor environments; and MLS is less accurate than TLS, but has a higher efficiency due to sensors mounted on the same vehicle. A mobile mapping system (MMS), usually equipped with a mobile laser scanner, panoramic camera, inertial measurement unit (IMU) and global positioning system (GPS), can easily obtain the LiDAR point cloud, image and orientation [16]. Theoretically, the LiDAR point cloud and the image obtained by the MMS are precisely registered. However, because pre-calibrated laser scanners and panoramic cameras are likely to have unpredictable shifts in their long-term use, an IMU or a GPS may not be able to obtain accurate poses due to environmental problems. Therefore, even if the panoramic image and the mobile LiDAR data have been georeferenced before fusion [17], it is still difficult to accurately calculate the rigid transformation relationship between the mobile LiDAR data and the panoramic image to complete the accurate registration between the two.
Since LiDAR point clouds are discrete space points, and the image is a continuous two-dimensional optical image, the different data structures make the two kinds of data have significant differences [18]. At the same time, the density of the LiDAR point cloud and the spatial resolution of the image [19], as well as the differences in geometry and radiation between the two types of data, will also affect registration. These situations make the registration of LiDAR point clouds and images more complex and challenging. The registration process of the LiDAR point cloud and images generally includes 2D-2D registration methods [18][19][20][21][22][23], 3D-2D registration methods [16,17,[24][25][26][27] and 3D-3D registration methods [28][29][30][31]. Most of the existing registration methods extract the same features from the image and the LiDAR point cloud to complete registration, but these methods generally have characteristics that require specific scenes, tedious extraction steps and complicated calculations. The 3D-3D registration methods are not popular because they require 3D image reconstruction first. The 2D-2D registration methods may not match in the finished work because measurements of the similarity of two images are not accurate enough.
The method proposed in this paper belongs to the 2D-2D registration method group. To realize the refined registration of the panoramic image and the mobile LiDAR data, we propose a registration method for the panoramic image and the mobile LiDAR data based on the hybrid geometric structure index feature of phase. First, the mobile LiDAR data are transformed into an intensity map using the initial GPS/IMU data and panoramic image. Then, the feature descriptor of the developed HGIFP combined with a robust false matching point elimination algorithm is used to obtain a set of corresponding feature points and a 2D-3D correspondence. Finally, according to the local motion invariance assumption of the 3D-2D corresponding feature points proposed by us, the registration of the mobile LiDAR data and the panoramic image is completed after several iterations.
The main contributions of this paper are as follows:

1.
A registration method for panoramic images and mobile LiDAR data based on the hybrid geometric structure index feature of phase is proposed.

2.
A multi-scale and multi-directional feature descriptor called HGIFP is developed to capture the shape and structural features of the image and combined with a robust false matching point elimination algorithm to complete the corresponding feature points extraction work of the multimodal images.  3.
The assumption of local motion invariance of 3D-2D corresponding feature points is proposed to solve the registration process of the mobile LiDAR data and panoramic image. 4.
The method in this paper can correct the rotation error within 12 • and the translation error within 2 m, and the average error is within 3 pixels.

Related Works
The registration process of the LiDAR point cloud and images has been widely studied and can be divided into the airborne LiDAR point cloud and the mobile LiDAR point cloud according to the different acquisition methods. The registration methods of these two acquisition methods can be divided into the three methods mentioned above: 2D-2D, 3D-2D and 3D-3D. However, the registration method for the airborne Lidar point cloud and images is likely unsuitable for when the mobile LiDAR point cloud and panoramic images are collected simultaneously. Compared with the airborne Lidar point cloud, the mobile LiDAR point cloud has a smaller angle perturbation with a more severe impact [22], which makes registration more complex and challenging. Below we present a detailed review of the registration process for mobile LiDAR data and panoramic images.
3D-2D registration methods. 3D-2D registration methods aim to extract the corresponding feature primitives from panoramic images and the mobile LiDAR point cloud and convert them into linear relations to complete registration. Cui et al. [26] completed registration by extracting line features from the LiDAR point cloud and the panoramic image, respectively, and projecting the line features in the LiDAR point cloud to the panoramic image. This method requires specific features, such as curbs, light poles and urban buildings. Zhu et al. [25] proposed to use the skyline features of panoramic images and the LiDAR point cloud to complete registration. This registration method is more generic than the method of finding city line features. However, this method is not applicable when the skyline cannot be obtained due to dense building scenes. Peng et al. [17] further used point features to constrain based on the single-use of line features and proposed an "infinitesimal feature analysis method" to refine edge lines. Compared with 2D-2D registration methods, 3D-2D registration methods are more dependent on the specified features in the registration scene, which means that when a scene cannot provide the specified features required by the registration method, or the number of features is not enough and the distribution is not uniform enough, the accuracy of the registration method will be greatly reduced.
2D-2D registration methods. 2D-2D registration methods need to transform mobile LiDAR data into an image, so 2D-2D registration methods can also be regarded as the registration between images of different modalities. These methods can generally be divided into region-based [21,22,32,33] and feature-based [16] registration methods. The region-based registration methods measure the similarity of two images by constructing a similarity measure from the space or frequency to realize the registration of the mobile LiDAR data and panoramic image. Commonly used similarity measures generally include mutual information (MI), normalized mutual information (NMI) and the sum of squared differences (SSD). Wang et al. [22] used the method of maximizing mutual information to complete the registration of mobile LiDAR data and panoramic images. Zachary et al. [33] introduced gradient components and normalized mutual information to complete the registration process. Zhao et al. [32] improved the robustness of registration by setting weights for different data sources according to different scenarios when calculating with mutual information. However, region-based approaches often require a lot of computation and are also subject to image noise and nonlinear radiation changes, limiting their application in the registration field. Feature-based methods generally complete registration by searching for corresponding features between two images. For example, Li et al. [16] completed the registration process by finding the corresponding vehicle unit on the panoramic image and the point cloud projection image. Feature-based registration methods have made significant progress with the development of multimodal image registration algorithms. Zhu et al. [18] and Shi et al. [20] used excellent multimodal image registration algorithms on point cloud projection images, completing registration with aerial images. However,  due to the difference between the airborne LiDAR point cloud and the mobile LiDAR point  cloud, these methods still need to be further improved to better register the mobile LiDAR  point cloud and the panoramic image. 3D-3D registration methods. 3D-3D registration methods use structure from motion (SFM) and multi-view stereo (MVS) to generate point clouds from adjacent images, transforming the registration of LiDAR point clouds and images into the registration between the point clouds [16]. These registration methods require sufficient overlap between the two images taken before and after and require multiple overlapping images to obtain densely connected points to generate reliable point clouds [34]. 3D matching algorithms, such as normal distribution transform (NDT) [35], iterative closest point (ICP) [36] and variants [37,38], are commonly used for registration between point clouds. Zhao et al. [30] aligned a point cloud computed from a video onto the point cloud directly obtained from a 3D sensor. However, accurately performing the automatic initialization of the ICP algorithm is an important issue. Li et al. [39] first utilized the global navigation satellite system (GNSS) and IMU-assisted SFM to obtain an accurate image orientation and correct the error of low-precision POS through a two-step registration method. Moreover, they then transformed the original 2D-3D registration into 3D-3D registration, minimizing the differences between the depth map derived from SFM and the original laser scan to achieve accurate registration. In general, 3D-3D registration methods may generate errors when using images to create point clouds and point clouds to register each other. A sequence of images is also required to complete registration.

Methodology
This paper proposes an automatic registration method for panoramic images and mobile LiDAR using GPS/IMU data, which includes four steps: (1) Transform the mobile LiDAR data into an intensity map according to the GPS/IMU; (2) Use a novel structural feature descriptor called HGIFP and a robust mismatch elimination algorithm to obtain corresponding feature points; (3) Iteratively obtain the optimal solution using a particle swarm optimization (PSO) algorithm; (4) Combine the results obtained in step (3) into new POS data, and repeat steps (1)-(3) twice to complete the registration of the panoramic image and the mobile LiDAR data. Figure 1 shows a flowchart of the steps of the proposed method.

Multimodal Image Corresponding Feature Points Extraction
The extraction of the corresponding feature points is generally realized by relying on the local grayscale characteristics of the feature points. However, for multimodal image data, the grayscale characteristics will change non-linearly or non-functionally. The traditional method of extracting corresponding feature points based on image grayscale features is no longer applicable. In contrast, the geometrical features of the image are more resistant to nonlinear radiation differences. In this paper, a new method called HGIFP is proposed to extract corresponding feature points from multimodal images by using the geometrical features of images, which includes four steps: (1) Construct a weighted moment of local features taking anisotrophy into account and use a FAST detector to detect the interest points on the weighted moment map of local features taking anisotropy into account; (2) Use a new structural feature descriptor called HGIFP to obtain panoramic images; (3) For the problem of the viewing angle difference, use the local consensus of matching points to remove outliers to obtain the correct corresponding feature points, and robustly complete the extraction of the corresponding feature points.

Multimodal Image Corresponding Feature Points Extraction
The extraction of the corresponding feature points is generally realized by relying on the local grayscale characteristics of the feature points. However, for multimodal image data, the grayscale characteristics will change non-linearly or non-functionally. The traditional method of extracting corresponding feature points based on image grayscale features is no longer applicable. In contrast, the geometrical features of the image are more resistant to nonlinear radiation differences. In this paper, a new method called HGIFP is proposed to extract corresponding feature points from multimodal images by using the geometrical features of images, which includes four steps: (1) Construct a weighted moment of local features taking anisotrophy into account and use a FAST detector to detect the interest points on the weighted moment map of local features taking anisotropy into account; (2) Use a new structural feature descriptor called HGIFP to obtain panoramic images; (3) For the problem of the viewing angle difference, use the local consensus of matching points to remove outliers to obtain the correct corresponding feature points, and robustly complete the extraction of the corresponding feature points.

Point Feature Detection
There are apparent nonlinear radiation differences (NRD) between multimodal images, which increases the difficulty of feature point detection. As a nonlinear filter, anisotropic filtering can achieve the smooth operation of the image and preserve the edge and texture information of the image. Therefore, a local feature map with good edges and corner features can be obtained by subtracting the to-be-registered and the smoothed images. There are apparent nonlinear radiation differences (NRD) between multimodal images, which increases the difficulty of feature point detection. As a nonlinear filter, anisotropic filtering can achieve the smooth operation of the image and preserve the edge and texture information of the image. Therefore, a local feature map with good edges and corner features can be obtained by subtracting the to-be-registered and the smoothed images. Meanwhile, phase consistency (PC) has been proven to detect feature points [40], and the edges and corner features of the image can be accurately obtained by using the maximum and minimum moments [41]. Therefore, our operation steps are as follows: (1) Smooth the image using optimized anisotropic filtering [42] to complete the anisotropic image diffusion; (2) Subtract the smoothed image from the to-be-registered image to obtain a local feature map that takes anisotropy into account; (3) Construct a weighted moment Equation and generate an anisotropic weighted moment map to extract feature points. The anisotropic diffusion Equation is: where t is the metric value of the time scale; div is the divergence operator; 1/ 1 + |∇I| 2 /k 2 is the diffusion coefficient; k is the contrast factor; ∇ is the gradient operator; ∆ represents the Laplacian operator; I is the current two-dimensional image; L is the difference image; m means the dimension of the image; A l (L n ) represents the diffusion coefficient matrix of the image in each dimension; L n+1 is the result after diffusion; and τ represents the diffusion time step, τ = t n+1 − t n . When the diffusion calculation is calculated, t n = 1/2σ 2 represents the scale, and n is the number of layers in the scale space.
After completing the anisotropic diffusion of the image, we calculate a local feature map taking anisotropy into account. It is calculated as follows: where I F (x, y) is the local feature map taking anisotropy into account, I(x, y) is the original image, G σ s is Gaussian filter, and the result of the anisotropic filtering of the original image is I pm (x, y). We use phase congruency to detect the edge features and corner features. Unlike the grey-based edge feature extraction method, this method is not affected by the image's local light and dark changes. It can include information, such as corners, lines, textures, etc., especially when the edge contrasts of the image are relatively low ( Figure 2). The formula of phase congruency is as follows: where PC(x, y) is the magnitude of phase congruency; (x, y) is the pixel coordinate of a specific point on the image; A no (x, y) is the amplitude at (x, y) at wavelet scale n and orientation o; and ∆φ no (x, y) is a more sensitive phase deviation. W o (x, y) is the weight factor for a frequency spread, T is a noise threshold and ε is a small constant to avoid division by zero. denotes that the enclosed quantity is equal to itself when its value is positive and zero otherwise. According to the moment analysis algorithm, the axis corresponding to the minimu moment is called the principal axis, and the principal axis indicates the direction info mation of the feature; the axis corresponding to the magnitude of the maximum mome generally reflects the distinctiveness of the feature. We construct a weighted moment m taking local anisotropy features into account according to the maximum moment and t minimum moment, and the calculation results are as follows: where According to the moment analysis algorithm, the axis corresponding to the minimum moment is called the principal axis, and the principal axis indicates the direction information of the feature; the axis corresponding to the magnitude of the maximum moment generally reflects the distinctiveness of the feature. We construct a weighted moment map taking local anisotropy features into account according to the maximum moment and the minimum moment, and the calculation results are as follows: where a = ∑ θ (pc(θ)cos(θ)) 2 where the maximum moment map M max represents the edge map of the image; the minimum moment map M min represents the corner map; M pm−pc is a weighted moment diagram taking anisotropy local features into account; k is the weight coefficient, and its value is between −1 and 1; and a, b, c are three intermediate quantities. Moreover, we use a FAST detector to detect interest points.

Feature Description Based on Mixed Index Map and Orientation of Phase Congruency Index Map
After the feature points are detected from the image, feature descriptors are needed to describe the difference between the feature points. The RIFT [34] has proved that compared with the classical feature descriptor, the feature descriptor of the multimodal image using the geometric structure information of the PC map still cannot describe features well. Thus, the RIFT constructs feature descriptors based on a maximum index map (MIM) to solve this problem. However, this method cannot provide enough corresponding feature points when matching visible light and LiDAR intensity images. We use the visible light and LiDAR intensity images for testing, as shown in Figure 3a. The detection of feature points is performed via the same operation, so the number of the initial feature points of the two methods is the same, and the distribution is consistent. Figure 2b shows that the number of the detected feature points is abundant and evenly distributed. However, in the matching process, the RIFT cannot obtain enough corresponding feature points, the distribution of the matching pairs is uneven and the matching effect is poor, as shown in Figure 3e. The reasons may be as follows. For images with severe NRD, the MIM-based method cannot describe the detail of features well, because each pixel has only six description latitudes, as shown in Figure 3c. However, if the number of phase indices increases by increasing the convolution direction, it will increase the time complexity. Therefore, we propose a method to construct HGIFP feature descriptors based on the mixed maximum index map and orientation of the phase congruency index map (MIM-OPCIM), as shown in Figure 3d; the extraction effect is shown in Figure 3f.  Figure 3d; the extraction effect is shown in Figure 3f. MIM is constructed via the log-Gabor convolution sequence. The convolution sequence was obtained in the PC map calculation stage, so the computational complexity of MIM is very small and does not entail too many additional calculations. The specific construction method of MIM is shown in Figure 4. Given an image ( , ), and represent the even-symmetric and odd-symmetric log-Gabor wavelets at scale and orientation . The two wavelet functions are convolved with the image signal to obtain the response components  MIM is constructed via the log-Gabor convolution sequence. The convolution sequence was obtained in the PC map calculation stage, so the computational complexity of MIM is very small and does not entail too many additional calculations. The specific construction method of MIM is shown in [ ote Sens. 2022, 14, x FOR PEER REVIEW 9 of (a) (b) (c) The orientation of phase congruency represents the fastest-changing direction of image's geometric structure features, which is similar to the gradient direction and is ve important for feature description. We collect the horizontal and vertical components ( ) at all scales and all orientations to obtain an orientation of phase congruen map (OPCM). It is calculated as follows: Due to the different modes of images, the orientations of phase congruency of ea pixel between the two OPCMs are not closely related. It is impossible to use the OPCM establish a feature description directly. Based on this, we construct the orientation of phase congruency index map (OPCIM) for feature description. We reclassify each pixe each OPCM image according to the numerical value, and the value of the reclassificati is to obtain the OPCIM. It is calculated as follows: After obtaining the OPCIM, to better describe the image, we use MIM to constr the OPCIM and then propose the MIM-OPCIM, calculated as follows: We arrange A o (x, y) to obtain the convolution graph {A o (x, y)} o 1 . In the calculation, O is the number of orientations. For each pixel position of the image, we can obtain an O-dimensional ordered array, and the subscript corresponding to the maximum value on each O-dimensional ordered array is selected to obtain the maximum index map.
The orientation of phase congruency represents the fastest-changing direction of the image's geometric structure features, which is similar to the gradient direction and is very important for feature description. We collect the horizontal and vertical components of o no (θ) at all scales S and all orientations O to obtain an orientation of phase congruency map (OPCM). It is calculated as follows: Due to the different modes of images, the orientations of phase congruency of each pixel between the two OPCMs are not closely related. It is impossible to use the OPCM to establish a feature description directly. Based on this, we construct the orientation of the phase congruency index map (OPCIM) for feature description. We reclassify each pixel of each OPCM image according to the numerical value, and the value of the reclassification is N to obtain the OPCIM. It is calculated as follows: Remote Sens. 2022, 14, 4783 9 of 25 After obtaining the OPCIM, to better describe the image, we use MIM to constrain the OPCIM and then propose the MIM-OPCIM, calculated as follows: where a is the constraint coefficient matrix of the OPCIM, µ is the size of the termination value for reclassifying MIM from 1, and the size of N is the same as that in Equation (14). For each pixel, the method based on the MIM-OPCIM has a better description than that based on the RIFT because the description dimension increases from 6 to N. Therefore, the HGIFP feature descriptor constructed based on the MIM-OPCIM is more reliable than the feature descriptor constructed by the RIFT. After obtaining the MIM-OPCIM, we generate feature descriptors for each feature point in the same way as the RIFT to obtain HGIFP feature descriptors. We build a local image patch of size J × J pixels centered on each feature point and assign weights to each pixel using a Gaussian function with a standard deviation of size J/2. For the sub-grid in each local pixel block, the size is set to r × r, r > 6. This way includes as much neighboring information as possible. Figure 5 shows the construction process of the HGIFP feature descriptor.

Elimination of Mismatched Points Based on Local Preserving Matching
Obtaining a reliable correspondence from two feature point sets is an essential step in extracting corresponding feature points. For panoramic images and intensity maps, the difference in imaging viewpoints makes it easier to have wrong matches in the matching results. Filtering out the correct matching pairs in the initial matching set is particularly critical. To address this problem, we use locality preserving matching (LPM) [43] to filter from the collection of matching points to obtain the correct matching pairs.
The local preserving matching of feature points means that the motion displacements of the two corresponding feature points in the neighborhood should be consistent. So, we can eliminate the wrong matching pairs by calculating the difference between the motion displacement values of each feature point and its neighborhood and the topological relationship.
Assuming that the unknown interior-point set is , and the initial set is , then the final matching set * has the following relationship: * = ( ; , ) The loss function ( ; , ) is defined as follows: where , are the neighborhoods of the feature points , ; is a certain distance metric, such as Euclidean distance; and is the cardinality of the neighbourhood points.
Here we use a nearest neighbor search; 1/2 is the normalization of the loss function;

Elimination of Mismatched Points Based on Local Preserving Matching
Obtaining a reliable correspondence from two feature point sets is an essential step in extracting corresponding feature points. For panoramic images and intensity maps, the difference in imaging viewpoints makes it easier to have wrong matches in the matching results. Filtering out the correct matching pairs in the initial matching set is particularly critical. To address this problem, we use locality preserving matching (LPM) [43] to filter from the collection of matching points to obtain the correct matching pairs.
The local preserving matching of feature points means that the motion displacements of the two corresponding feature points in the neighborhood should be consistent. So, we can eliminate the wrong matching pairs by calculating the difference between the motion displacement values of each feature point and its neighborhood and the topological relationship.
Assuming that the unknown interior-point set is I, and the initial set is S, then the final matching set I * has the following relationship: The loss function C(I; S, λ) is defined as follows: where N xi , N yi are the neighborhoods of the feature points x i , y i ; d is a certain distance metric, such as Euclidean distance; and K is the cardinality of the neighbourhood points.
Here we use a nearest neighbor search; 1/2K is the normalization of the loss function; λ is the penalty term coefficient, λ > 0; N is the number of all matching points; and |I| is the number of correct matching point pairs. If we use the numerical difference in the local motion displacement of the matching points to measure whether the point is a wrong matching point, this method will fail when there are non-rigid changes in the image. Therefore, we binarize the distance.
For Equation (18), a 1 × N vector P can be used to indicate whether each pair of matching points is correct. P i = 1 means that the match correctness of the pair of matching points is an inlier and P i = 0 means an outlier.
So, we can rewrite Equation (18) by combining P with Equation (12).
In order to better use the local motion consistency of the feature points to screen the error points, we can use the topological relationship of the neighborhood to distinguish the match correctness of correspondence further (x i , y i ) in the neighborhood N xi , N yi , such as in Figure 6. v i represents the displacement vectors corresponding to (x i , y i ), and v j represents the displacement vectors corresponding to x j , y j , more specifically In order to better use the local motion consistency of the feature points to screen the error points, we can use the topological relationship of the neighborhood to distinguish the match correctness of correspondence further ( , ) in the neighborhood , , such as in Figure 6.
represents the displacement vectors corresponding to ( , ), and represents the displacement vectors corresponding to , , more specifically The difference between and is defined as follows: where , represents the topological similarity between two matching point pairs in the neighborhood , ∈ [−1, 1]. (⋅,⋅) represents the inner product calculation of vectors. The larger the value of , , the higher the similarity between the two, and binarization is performed according to the similarity threshold: We combine Equations (21) and (22). Equation (20) can be written as the following The difference between v i and v j is defined as follows: where s v i , v j represents the topological similarity between two matching point pairs in the neighborhood v i , v j ∈ [−1, 1]. (·, ·) represents the inner product calculation of vectors. The larger the value of s v i , v j , the higher the similarity between the two, and binarization is performed according to the similarity threshold: We combine Equations (21) and (22). Equation (20) can be written as the following minimization problem: where For the above Equation (23), the initially given matching set S is known, so the positions of all feature points are known, and hence all the cost values {c i } N 1 can be calculated in advance. That is to say, the only unknown quantity is P i . If we want to minimize the value of the loss function, we need P i (c i − λ) < 0. For the wrong matching pairs P i = 0, the calculation result of (c i − λ) is meaningless. Only when P i = 1, (c i − λ) < 0 can the loss function reduce the value. Thus, we can relate the minimization of the loss function for solving P to λ: Thus, we can see that λ is a threshold for judging whether a pair of matching points is the correct interior point through Equation (25). According to the experiment, we use the above Equations for two iterations to obtain I * from the candidate matching set S.

Panoramic Image and the Mobile LiDAR Data Registration Model Based on Local Motion Invariance in 3D-2D Corresponding Feature Points
After obtaining a reliable set of corresponding feature points in the panorama image and the LiDAR intensity image, we can use the average pixel distance between the sets of corresponding feature points as the error equation. Therefore, the registration problem of panoramic images and mobile LiDAR data can be equivalent to the optimization problem, or minimizing the value of the error equation. However, in the optimization process, if the LiDAR intensity image is regenerated whenever extracting corresponding feature points and calculating the cost function, this will significantly increase the computational complexity and time. It also cannot converge to the optimal camera matrix if we realize registration by minimizing the distance between the projection points and the corresponding pixel points, because the number of extracted corresponding feature points is sparse and unevenly distributed due to the significant difference in angles of view and displacement between the matched images.
Based on the above problems, we propose the following hypothesis: Under a given pose, if the mapping of a point in the point cloud is the corresponding feature point of a pixel on the panoramic image, the mapping of this point should also be the corresponding feature point of the original pixel or within the minimal neighborhood of the corresponding feature point when the pose is changed to other similar viewing angles. This 3D-2D correspondence invariance is called the local motion invariance in the corresponding feature. The specific operation steps are as follows: 1. According to the panoramic image imaging model, there are the following formulas: where (X w , Y w , Z w ) is a certain point in the point cloud; (u, v) is the imaging point of the point on the two-dimensional plane; K is the inner orientation elements of the panoramic camera; F is the transformation matrix of the camera coordinate to the spherical polar coordinate; and T is the outer orientation elements of the panoramic camera. From this, we construct the imaging point cloud set P intensity . Moreover, there is a one-to-one correspondence between P intensity and each pixel in the I intensity (x, y). 2. After using the initial pose to obtain the LiDAR intensity image I intensity (x, y), we extract the corresponding feature points from the panoramic image I panoramic (x, y) and I intensity (x, y) to obtain the corresponding feature points set S panoramic , S intensity . Then, according to the corresponding relationship established in step 1, we obtain the point cloud set P panoramic−intensity which corresponds to S intensity . Finally, we construct the error equation with P panoramic−intensity and S panoramic . We minimize the error equation to complete the registration of the panoramic image and the mobile LiDAR data. The specific Equation is as follows: where K is the inner orientation elements of the panoramic camera; F is the transformationmatrix of the camera coordinate to the spherical polar coordinate; T is the outer orientation elements of the panoramic camera; S i is the point in the set S panoramic , and P i is the point in the point cloud set P panoramic−intensity corresponding to S i . 3. After constructing the error equation, we use the PSO algorithm for iterative optimization. We set the rotation parameter to ±0.2 of the input parameter and the translation parameter to ±2 of the input parameter. The number of iterations is 200, and the number of particles is 60. 4. A step-by-step approximation method is adopted based on the proposed assumptions to obtain the correct motion pose. After completing step 3, we repeat the process of step 1-step 3 with the obtained camera motion pose as the initial pose, and set the rotation and translation parameter thresholds to 1/2 of the previous process. When the pixel average distance is less than 3 pixels, the iteration terminates. For specific experiments, please refer to Section 4.3.

Experiment and Analysis
This section focuses on our experimental verification of the performance of the proposed registration method on panoramic images and the mobile LiDAR data. First, we provide the detailed setup of the experiment, including a description of the experimental dataset and the experiment environment. Secondly, the performance of our proposed algorithm for extracting corresponding feature points based on HGIFP is analyzed. The evaluation is completed by comparing four state-of-the-art methods: SIFT, 3MRS, PSO-SIFT and RIFT. For a fair comparison, we set these four methods' local description block sizes to be the same. Considering the characteristics of panoramic images, we will evaluate the corresponding feature point extraction algorithm in multimodal images by visual inspection. Then in the discussion of Section 5, the final results of the registration of the mobile LiDAR data and panoramic images are analyzed. We also introduce region-based MI methods for comparison in the discussion, and finally illustrate the robustness and usability of the proposed method by overlaying panoramic images and mobile LiDAR data.

Setting of Experiments
The experiments were performed using a personal computer (PC) configured with an Inter(R) Core(TM) i7-9750H CPU 2.60 GHz and 16.0 GB RAM. The mobile LiDAR data and panoramic images were collected from Guangzhou Ecological Design Town. In order to facilitate processing, the collected panoramic images were downsampled with a sampling interval of 3 pixels and converted into single-channel grayscale images. The parameters for image matching were set as follows. The log-Gabor filter was set to four scales and six directions (0, π 6 , 2π 6 , 3π 6 , 4 6 , 5π 6 ). Three pairs of panoramic images and the corresponding range of the mobile LiDAR data were selected as experimental data. The details are listed in Table 1. These experimental data cover intersections, trees, basketball courts and residential buildings.  Figure 7a shows the first set of experimental data, including residential houses on both sides of the road; Figure 7b shows the second set of experimental data, open intersection data, including high-rise buildings and residential dwellings; and Figure 7c shows the third set of experimental data, including trees, basketball courts and residential houses.

Parameter Study
The proposed corresponding feature points extraction algorithm in the multimodal image contains four main parameters, namely N, µ, J and r. The parameter N is the size of the reclassification of the phase congruency orientation. Generally speaking, the larger the N is, the higher the discrimination of the information in each pixel in the image and the higher the computational complexity. The parameter µ is the reclassification size of the MIM. The larger µ is, the greater the weight of the MIM in the constructed similarity descriptor. The parameters J and r are the sizes of the local image patch and sub-image block used for feature description. If the sizes of the local image patch and sub-image block are too small, the information is insufficient, which does not adequately reflect the distinctiveness of the feature. Conversely, if the sizes of the local image patch and sub-image block are too large, the computational complexity will be higher, and the matching time will be longer. Therefore, the correct parameter design dramatically influences the extraction effect of the corresponding feature points. We use a pair of LiDAR intensity and visible light images for parameter research. This is because the panoramic image is stitched in six directions. In order to accurately assess the extraction effect of the corresponding feature points, we need to calculate the extrinsic camera parameters through the corresponding feature points and then use the calculated extrinsic camera parameters to regenerate the LiDAR intensity image to realize the fusion of the panoramic image and the LiDAR intensity image. In contrast, extracting and verifying corresponding feature points from intensity projection images of mobile LiDAR data and images captured by flat cameras is simpler and more reliable. We only need to use the corresponding feature points to calculate the transformation matrix between the images to complete the evaluation of the image registration effect. The LiDAR intensity image and visible light image of the experiment are shown in Figure 8.  Figure 7a shows the first set of experimental data, including residential houses on both sides of the road; Figure 7b shows the second set of experimental data, open intersection data, including high-rise buildings and residential dwellings; and Figure 7c shows the third set of experimental data, including trees, basketball courts and residential houses.

Parameter Study
The proposed corresponding feature points extraction algorithm in the multimodal image contains four main parameters, namely , , and . The parameter is the size of the reclassification of the phase congruency orientation. Generally speaking, the image and the LiDAR intensity image. In contrast, extracting and verifying correspondi feature points from intensity projection images of mobile LiDAR data and images ca tured by flat cameras is simpler and more reliable. We only need to use the correspondi feature points to calculate the transformation matrix between the images to complete evaluation of the image registration effect. The LiDAR intensity image and visible lig image of the experiment are shown in Figure 8.   In order to obtain suitable parameters, we design four independent experiments to learn the parameters N, µ, J and r. Each experiment has only one parameter as a variable, and the other parameters are designed as fixed values. The overall setup details of the experiments are in Table 2. We use the number of correct matches (NCM) and the success ratio (SR) as evaluation metrics for each parameter. The results of the experiments are shown in Tables 3-6. To ensure the fairness of the comparison experiment, the mismatch elimination algorithm used in the parameter study is consistent with that in the RIFT.    The following conclusions can be drawn from the experimental results: (1) The larger the value of N, the richer the information of the MIM-OPCIM, so that there will be fewer mismatches and more NCM will be obtained. However, when N is greater than a threshold, due to local geometric distortions, the number of mismatches and the number of NCM will also decrease. As can be seen from Table 3, when N reaches 13 and 15, the SR of the MIM-OPCIM reaches 100%. Considering the number and computational complexity of extracting corresponding feature points for the MIM-OPCIM, we set N = 15. (2) It can be seen from Table 4 that the constructed the MIM-OPCIM mismatch decreases with the increase in µ, but when µ ≥ 1.5, due to the influence of local geometric distortions, the mismatch begins to occur. At the same time, the number of NCM is also decreasing. Thus, we set µ = 1.3.
(3) SR is the highest for all assumed values of parameter J. However, considering that our main purpose is to obtain more corresponding feature points, therefore, according to Table 5, we set J = 108. (4) The change in the parameter r will impact the NCM because more information is contained in the sub-image block. To balance the time complexity and the number of corresponding feature points extracted, as shown in Table 6, we set r = 8. According to the experimental results and analysis, in the following experiments, these parameters are fixed as N = 15, µ = 1.3, J = 108 and r = 8.

Matching Performance Analysis and Comparison
In this section, an analysis is performed to verify the matching effect of the proposed method. We use SIFT, 3MRS, PSO-SIFT, and RIFT and our method to extract corresponding feature points on three test data. All test data are subjected to two rotation and translation changes on the correct camera external parameters. The first change randomly adds a rotation error of 12 • and a translation error of 2 m, and the second change randomly adds a rotation error of 6 • and a translation error of 1 m (the settings of the first set of parameters are based on the errors encountered in our work, and the settings of the second and third sets of parameters are to check the registration results of our method under different distortions). The solid yellow line indicates all the correctly matched corresponding feature points in the images, and the solid red line indicates that there are incorrectly matched corresponding feature points. Figures 9 and 10 show the extraction results of corresponding feature points obtained by using five methods on three test data. Figure 9 shows the extraction results of the corresponding feature points of the first test data. Figure 10 shows the number and success rate of the corresponding feature points extracted on three test data. Despite local geometric distortions, NRD, noise and large deformations between images, our proposed method can perform well on all three-test data. In particular, at 12 • rotations and with 2 m translation errors, the corresponding feature points obtained are 25, 28 and 12, as seen in Figure 10. On three test data, SIFT and PSO-SIFT all fail to match because both SIFT and PSO-SIFT are feature descriptions based on gradient histograms. Although PSO-SIFT redefines the gradient to meet the needs of multimodal image matching, this does not change the gradient's sensitivity to NRD. rate of the corresponding feature points extracted on three test data. Despite local geometric distortions, NRD, noise and large deformations between images, our proposed method can perform well on all three-test data. In particular, at 12° rotations and with 2 m translation errors, the corresponding feature points obtained are 25, 28 and 12, as seen in Figure  10. On three test data, SIFT and PSO-SIFT all fail to match because both SIFT and PSO-SIFT are feature descri In contrast, the matching effects of the RIFT and 3MRS is significantly better those of than the SIFT and PSO-SIFT, which use gradients for feature description. The RIFT can better complete the extraction of corresponding feature points in multimodal images when the difference in viewing angle between the two images is not very large. On three test data, the numbers of corresponding feature points obtained by the RIFT under the second set of parameters are 19, 16 and 10, as seen in Figure 10. When there is no viewing angle difference between the two images, 3MRS obtains the largest number of matching points on the three test data, but once there is a viewing angle difference between the images, 3MRS cannot guarantee the accuracy of the obtained corresponding feature points, ss shown in Figure 10. These results show that the RIFT has good robustness for matching visible light and intensity LiDAR images but cannot handle significant geometric distortions between images well. 3MRS uses coarse matching combined with template matching to increase the number of extracted corresponding feature points significantly. Still, it is more sensitive to geometric distortions between images and unsuitable for registration between the mobile LiDAR data and panoramic images. In contrast, the matching effects of the RIFT and 3MRS is significantly better those of than the SIFT and PSO-SIFT, which use gradients for feature description. The RIFT can better complete the extraction of corresponding feature points in multimodal images when the difference in viewing angle between the two images is not very large. On three test data, the numbers of corresponding feature points obtained by the RIFT under the second set of parameters are 19, 16 and 10, as seen in Figure 10. When there is no viewing angle difference between the two images, 3MRS obtains the largest number of matching

Pixel Error and Runtime Evaluation
We illustrate the superiority of the proposed method by evaluating the pixel root mean square error (RMSE) and running time (RT) of the results using different methods. The GPS/IMU data used in the experiment are shown in Table 7. At the same time, we add the region-based MI method to measure the proposed method's effect more comprehensively. We choose root mean square error (RMSE) and running time (RT) to evaluate the proposed method and the other five methods. RMSE represents the accuracy of the overall match and is calculated by manually selecting 20 checkpoints. The final effects of all methods are shown in Figure 11. Figure 12a depicts the histograms of RMSE for all methods, and Figure 12b shows RT histograms for all methods. The smaller the RMSE and RT, the better the matching performance. Table 8 reports the matching results of all methods in terms of RMSE and RT. On three test sets, the matching effect of our proposed method is the best, and the errors of the three results are 0.997 pixels, 1.606 pixels and 2.531 pixels. The errors of other methods are more than 50 pixels, and the registration requirements of the mobile LiDAR data and panoramic image cannot be completed. The reasons for the analysis are as follows. Using the feature point matching method to complete the mobile LiDAR data registration and the panoramic image, it is necessary to accurately find the corresponding relationship between the coordinate and pixel points. In the case of a rotation error of 12 • and a translation error of 2 m, except for when using the method in this paper, only a few wrong corresponding feature points are obtained. Therefore, when using the PSO algorithm for iterative convergence, all methods except the method in this paper cannot converge to the correct camera extrinsic parameters. In addition, the resolution of the experimental data in this paper is 2731 × 1366. The high resolution is also one of the reasons that the pixel offset difference of other methods is too large and cannot be registered. Our experiments have also been tested using the histogram of absolute phase consistency gradients (HAPCGs) [44]. The algorithm cannot complete the matching due to the image's high resolution, which leads to memory overflow during calculations. When the image is processed by multiple down-sampling, the HAPCG still cannot extract the correctly corresponding feature points between the images. To maintain the clarity of the matching image as much as possible and achieve accurate registration, we do not recommend using multiple down-sampling to complete the extraction of corresponding feature points in multimodal images. Due to the large geometric deformation of the LiDAR intensity image and the original image in different poses, the MI method cannot converge to the correct camera extrinsic parameters for the region-based MI method. This is the reason why the MI method has significant pixel errors. Table 7. Experimental data of external orientation parameters: the (Initial value) column represents the initial value of the element, and the (Correct value) column represents the correct value of the element. In running time (RT), the three times required by our method are 184.4135 s, 187.1741 s and 192.0239 s. Among the comparative methods, the amount of time required by the RIFT is the least, and the three times of the RIFT are 118.803164 s, 100.742531 s and 104.78784 s. Although the RIFT algorithm takes the shortest amount of time among the six methods, the registration of the mobile LiDAR data and the panoramic image cannot be completed by the RIFT. This is because the RIFT cannot obtain correct and evenly distributed corresponding feature points when the viewing angle difference is too large. As shown in Table 8, the matching errors of the RIFT on the three test sets are 93.813 pixels, 51.153 pixels and 111.961 pixels. At the same time, comparing three test sets, we found that the proposed algorithm has the best effect on the first test data and the worst effect on the third test data. By analyzing the three test sets, it can be found that the structure information of the LiDAR intensity image in the first datum is the most obvious, so more uniformly distributed corresponding feature points can be obtained. The local geometric distortions and NRD of the LiDAR intensity image in the third test datum are the most serious, which brings significant challenges to the matching algorithm, so the matching result of the third test datum is the worst.   Figure 12a depicts the histograms of RMSE for all methods, and Figure 12b shows RT histograms for all methods. The smaller the RMSE and RT, the better the matching performance. Table 8 reports the matching results of all methods in terms of RMSE and RT. On three test sets, the matching effect of our proposed method is the best, and the errors of the three results are 0.997 pixels, 1.606 pixels and 2.531 pixels. The errors of other meth-  In running time (RT), the three times required by our method are 184.4135 s, 187.1741 s and 192.0239 s. Among the comparative methods, the amount of time required by the RIFT is the least, and the three times of the RIFT are 118.803164 s, 100.742531 s and 104.78784 s. Although the RIFT algorithm takes the shortest amount of time among the six methods, the registration of the mobile LiDAR data and the panoramic image cannot be completed by the RIFT. This is because the RIFT cannot obtain correct and evenly distributed corresponding feature points when the viewing angle difference is too large. As shown in Table 8, the matching errors of the RIFT on the three test sets are 93.813 pixels, 51.153 pixels and 111.961 pixels. At the same time, comparing three test sets, we found that the proposed algorithm has the best effect on the first test data and the worst effect on the third test data. By analyzing the three test sets, it can be found that the structure information of the LiDAR intensity image in the first datum is the most obvious, so more uniformly distributed corresponding feature points can be obtained. The local geometric

Registration Parameters Error Evaluation
In order to better verify the matching effect of the proposed method, Figure 13 shows the results of the overall registration, and Table 9 shows the parameters before and after registration. Although the point cloud projection images of the three test sets have different degrees of local geometric distortions, NRD and noise effects, the three pairs of panoramic images and the mobile LiDAR data are all correctly aligned. According to Table 9, we know that each parameter's errors after the registration of the first test datum are −0.016 m, 0.010 m, 0.008 m, −0.0017 rad, 0.002 rad and 0.005 rad. Each parameter's errors after the second test datum registration are 0.000 m, 0.011 m, 0.013 m, −0.002 rad, −0.001 rad and 0.005 rad. After the registration of the third test datum, the errors of each parameter are 0.007 m, −0.023 m, −0.026 m, −0.002 rad, −0.0005 rad and 0.005 rad. The average translation errors of three test data are all within 3 cm, and the average rotation errors are all within 0.005 rad. We found that the first test datum has the most accurate result of the three test sets, with an average translation error of 1.14 cm and an average rotation error of 0.003 rad. The third test datum has the worst result, with an average translation error of 1.87 cm and an average rotation error of 0.0025 rad. Comparing the point cloud projection images of the three test sets, we found that the LiDAR intensity image of the first test datum has the best quality and complete structural information. The LiDAR intensity image of the third test datum has the worst quality, lacking the necessary structural information and filled with a lot of noise. Therefore, high-quality mobile LiDAR data are crucial for successfully registering the mobile LiDAR data and panoramic images. (c) Figure 13. Effect diagram before and after registration. (a) comparison of the first test data before and after registration; (b) comparison of the second test data before and after registration; (c) comparison of the third test data before and after registration. Table 9. Optimization of the exterior orientation parameters: the (Initial value) column represents the initial value of the element, the (Correct value) column represents the correct value of the element, and the (Final value) column represents the final optimized value of the element.

Test Elements Correct Value Initial Value Final Value
Test 1 Figure 13. Effect diagram before and after registration. (a) comparison of the first test data before and after registration; (b) comparison of the second test data before and after registration; (c) comparison of the third test data before and after registration. Table 9. Optimization of the exterior orientation parameters: the (Initial value) column represents the initial value of the element, the (Correct value) column represents the correct value of the element, and the (Final value) column represents the final optimized value of the element.

Test Elements
Correct Value Initial Value Final Value

Conclusions
The registration between panoramic images and mobile LiDAR data is very important for urban applications, such as vehicle navigation, assisted driving, 3D urban change detection, etc. It is difficult to align accurately because pre-calibrated laser scanners and panoramic cameras are likely to have unpredictable shifts in long-term use, and IMU or GPS may not be able to obtain accurate poses due to environmental problems. This paper proposes a registration method for panoramic images and mobile LiDAR data based on the phase's hybrid geometric structure index feature. First, we use the initial GPS/IMU data and panoramic image to obtain an intensity LiDAR image. Secondly, we obtain the set of corresponding feature points on the panoramic image and the intensity LiDAR image and receive the 2D-3D correspondence. Finally, after three iterative calculations, the registration of the mobile LiDAR data and the panoramic image is completed. The results show that the method can accurately register the panoramic image and the mobile LiDAR data, and the average registration error is less than three pixels.
Compared with other registration methods, the method in this paper does not need to rely on specific environmental scenarios and has strong applicability. By designing a corresponding feature points extraction algorithm based on the hybrid geometric structure index feature of phase, we can obtain the corresponding feature points with a uniform distribution and in an abundant quantity. These well-distributed and abundant corresponding feature points are the keys for us to complete the registration of mobile LiDAR data and panoramic images. At the same time, when analyzing the three test data in this paper, we found that the difference in point cloud quality affected our matching results. The better the image quality of point cloud imaging, the higher the registration accuracy. Low-quality point cloud imaging images will bring greater challenges to the registration algorithm.
In future work, we will consider further reducing the registration time. For sequential panoramic images, we can use the relative motion relationship between sequential panoramic images to reduce the number of registrations between mobile LiDAR data and panoramic images. However, a more robust and fast corresponding feature points extraction algorithm will help us reduce the registration complexity, which would enable the better and faster registration of panoramic images and mobile LiDAR data.