Geometric- and Optimization-Based Registration Methods for Long-Wave Infrared Hyperspectral Images

Registration of long-wave infrared (LWIR) hyperspectral images with their thermal and emissivity components has so far received comparatively little attention with respect to hyperspectral images in the visible, near-infrared, and short-wave infrared ranges. In this paper, the registration of LWIR hyperspectral images is investigated to enhance applications of LWIR images such as change detection, temperature and emissivity separation, and target detection. The proposed approach first searches for the best features of hyperspectral image pixels for extraction and matching in the LWIR range and then performs a global registration over two-dimensional maps of the three-dimensional hyperspectral cubes. The performances of temperature and emissivity features in the thermal domain, along with the average energy and principal components of the spectral radiance, are investigated. The global registration performed over whole 2D maps is further improved by blockwise local refinements. Of the two proposed refinements, the geometric refinement seeks the best keypoint combination in the neighborhood of each block to estimate the transformation for that block, whereas the alternative optimization-based refinement iteratively finds the best transformation by maximizing the similarity of the reference and transformed blocks. The possible blocking artifacts due to blockwise mapping are finally eliminated by pixelwise refinement. The experiments are evaluated with respect to (i) the utilized similarity metrics in the LWIR range between transformed and reference blocks, (ii) the proposed geometric- and optimization-based methods, and (iii) image pairs captured on the same and different days. The better performance of the proposed approach compared to manual, GPS-IMU-based, and state-of-the-art image registration methods is verified.


Introduction
Image registration is a preprocessing stage in image analysis to geometrically align different sets of image data captured from different poses into the same coordinate system [1]. Such a process is used in various areas ranging from computer vision, medical imaging, and aerial and space research to military image analysis in order to increase the gathered information acquired from different cameras and sensors. In the case of hyperspectral images [2,3], the registration aims to align the three-dimensional (3D) hyperspectral data with two spatial dimensions and one spectral dimension, compared to the registration of two-dimensional (2D) images.
The previous studies on hyperspectral image registration [4][5][6][7][8][9][10][11][12][13][14][15][16][17] mainly achieve this goal in two different categories. The first category [4][5][6][7][8][9][10][11][12], namely feature-based methods, finds and matches the feature points on the given hyperspectral images and estimates the best geometrical transform between the two images by using the matched points. The mainstream in this category applies a transformation to convert the 3D hyperspectral data cube into a 2D image, and the registration is then performed over the resulting 2D image correspondences. The second category of hyperspectral image registration methods [13][14][15][16][17], namely optimization-based methods, iteratively estimates the transformation by maximizing a statistical similarity metric between the images. Recent learning-based image registration methods, on the other hand, have two main drawbacks for the problem at hand. First, they estimate a global transformation without taking local divergences into account. Second, they are very dependent on the generality of the training set, which is the main drawback of learning methods. Accordingly, their application to the 2D maps of LWIR hyperspectral images was unfortunately not very successful, as reported in this paper with the comparisons. Finally, the scarcity of LWIR hyperspectral images and the challenges in the dynamic thermal modeling for synthetic LWIR image generation make the development of new learning-based solutions infeasible for the moment for such data.
Given the mentioned challenges, this article investigates the registration of aerial LWIR hyperspectral images of the same scene taken at different times and from different positions. The proposed solutions are developed as an output of experiment-driven research in a related NATO research panel with the provided LWIR hyperspectral data in order to reveal the registration performance of different LWIR spectral sensors. While the performed research first begins with the performance of the conventional registration pipeline for LWIR hyperspectral registration, it evolves into an ultimate method revealing the best features in the thermal range for registration and solving the problems regarding local distortions in the global registration. In this regard, the main contributions of the performed study can be summarized as follows: • Considering the temperature variations in the scene from one acquisition to the other, the most appropriate and robust components among the temperature, emissivity, and radiance features of hyperspectral pixels are investigated to convert the 3D hyperspectral images to 2D images.

• In order to solve the local misalignment problems in global and GPS-IMU-based registrations, two blockwise methods are developed. While the geometric-based solution searches for the best local homography from the keypoints in the neighborhood of each block, the alternative optimization-based solution takes the global transformation as the initial estimation for each block and iteratively improves the homography with respect to the similarity of the transformed and reference blocks.

• In order to account for the variations in the radiance data due to the temperature changes in the scene, the performances of different metrics in multimodal image registration, namely the geometric mean square error between the keypoints and their correspondences (geometric MSE), the structural similarity index (SSIM) [37], and mutual information (MI) [38], are examined.

• The proposed method with local refinements is compared with manual and GPS-IMU-based registrations along with the state-of-the-art image registration methods by using the hyperspectral LWIR images captured on the same and different days.
The next section gives brief overviews of hyperspectral and image registration methods. Section 3 describes the LWIR hyperspectral image data utilized in the experiments. Section 4 presents the proposed hyperspectral registration method with the details of the mentioned stages. Section 5 presents the experimental results and comparisons. The discussions are given in Section 6, followed by the conclusions in Section 7.

Related Work
This section first provides a review of hyperspectral image registration methods. Then, a brief summary of state-of-the-art image registration methods is given in connection with hyperspectral image registration. Table 1 gives an overview of the hyperspectral registration methods with respect to the spectral regions of the utilized images, which mostly cover the VNIR, SWIR, and visible RGB ranges. Among the few studies in the LWIR range, registration is studied for broadband images of thermal infrared sensors in [19] and [20] for surveillance applications. As another work, Koz et al. [21] present some initial methods and results for the registration of middle-wave infrared (MWIR) and LWIR hyperspectral images with some limited data, which also triggered the research performed in this paper in its early phase. The methods in the table are also classified as feature-based methods and optimization-based methods. The feature-based methods mainly utilize a conversion from 3D hyperspectral cubes to 2D images and then use the resulting 2D images for further processing. Mukherjee et al. [4] perform such a transformation by using principal component analysis. The resulting principal components are ranked and selected with respect to smoothness criteria, and the selected components are summed up with a nonlinear function to obtain the 2D images. The scale-invariant feature transform (SIFT) features [39] are then utilized to find and match the correspondences on the resulting 2D images. The performance of this method is improved by Gonçalves et al. [5] by incorporating segmentation and outlier removal steps. Similar methods propose to regulate the parameter selection of the SIFT descriptors as in [6], to modify the orientation component of the SIFT [7,8], or to adopt a different descriptor such as speeded up robust features (SURF) [40] for better and more robust registration performances [9].

Hyperspectral Image Registration Methods
A different approach in feature-based methods [11,12] performs the registration by interpreting the 3D hyperspectral data cube as a vector image formed of vectors instead of pixels. Rather than finding the feature points on 2D images, these methods utilize the changes in the spectral dimension to find feature points in three-dimensional hyperspectral cubes. To this end, Dorado-Munoz et al. [11,12] develop a modified version of the SIFT descriptor for 3D data. The modified method describes the scale-space of the difference of Gaussians over smoothed vector images rather than pixel images, finds the extreme points by defining greater-than and smaller-than operations over vectors, and finally performs the extrema localization by defining a Hessian matrix over vectors. A major challenge in this approach is, however, describing a proper ordering operation over vectors, whose components are in fact hyperspectral bands with different significance. The optimization-based methods, as the second category in the table, utilize similarity metrics borrowed from medical imaging [13][14][15][16][17], which are based on statistical measures. For instance, Kern et al. [13] extend the description of mutual information from image pairs to a set of image pairs and utilize the extended metric for hyperspectral image registration. In [15], cross cumulative residual entropy is utilized to register RGB images with VNIR hyperspectral images by converting the images into grayscale 2D images. More sophisticated optimization-based methods are also proposed by Zhou et al. [16,17] for the registration of hyperspectral and color images having very large scale differences. Compared to the feature-based methods, the main disadvantage of iterative optimization methods is their computational complexity. Depending on the utilized search algorithm, getting stuck in local minima and performance variations with respect to the selected initial points can be regarded as further drawbacks of this category.

State-of-the-Art Image Registration Methods
The image registration pipeline mainly consists of feature detection, feature description, feature matching, and geometric transformation estimation stages. Recent developments in deep learning have also affected this pipeline with a significant amount of performance gain by focusing on one or more stages of the registration pipeline. Table 2 gives a brief overview of the state-of-the-art image registration methods by mentioning the particular stages that each method handles in the chain of the registration process along with the inputs and outputs of the algorithms. In accordance with Table 2, the first group [28][29][30] of these methods is able to find the correspondences of an image pair without explicitly detecting, describing, and matching the features. Deep feature matching (DFM) [28] benefits from a pretrained VGG-19 [41] extractor and, starting from the terminal layers' features, hierarchically matches features down to the first layers. NCNet [29] uses a neighborhood consensus network to process the 4D correlation map. Patch2Pix [30] starts to match the features at the patch level and, using fine and mid-level regressors, finds the pixel-level matches. The second group [31,32] in Table 2 consists of joint feature detectors and descriptors. SuperPoint [31] is an encoder-decoder network that outputs detected features and corresponding descriptors at the same time. D2-Net [32] also jointly extracts features by finding salient features in the trained network based on VGG-16 [41]. The third group focuses on producing good matches, either by rejecting outliers or by learning to match well. LPM [33] removes mismatches by benefitting from local neighborhood structures, while SuperGlue [34] learns both to refine descriptors for matching and to match features optimally. The last group of algorithms directly produces the homography between two images with regression. Deep image homography estimation [35] in this group takes two stacked grayscale images as input and outputs the geometric relationship between them. Unsupervised deep homography [36] improves the former network to make it appropriate for unsupervised training.
With regard to the given overviews, the first observation is that no previous journal-level study addresses the registration of LWIR hyperspectral images. The second observation is that the local misalignment problems in global and GPS-IMU-based registrations have not been studied in the previous literature on either image or hyperspectral image registration. The proposed algorithms in this paper focus on these two aspects.

Table 2. Overview of the state-of-the-art image registration methods (Det.: feature detection, Desc.: feature description, Match.: feature matching).

Method                      Year  Det.  Desc.  Match.  Input                    Output
NCNet [29]                  2020   +     +      +      Image Pair               Putative Matches
Patch2Pix [30]              2020   +     +      +      Image Pair               Putative Matches
SuperPoint [31]             2018   +     +             Image                    Features + Descriptors
D2-Net [32]                 2019   +     +             Image                    Features + Descriptors
LPM [33]                    2018                +      Features + Descriptors   Putative Matches
SuperGlue [34]              2020                +      Features + Descriptors   Putative Matches
Deep Homog. Est. [35]       2016                       Image Pair               Homography
Unsuper. Deep Homog. [36]   2018                       Image Pair               Homography

Experimental Dataset
Three sets of LWIR hyperspectral images could be utilized for the experiments due to the scarcity of publicly available hyperspectral LWIR data. Only the parts of the captured scenes without any sensitive information could be utilized for the experiments. Table 3 gives the details of the images with their abbreviations, and Figure 1 illustrates a sample band from each image. The first and second sets are captured with the SEBASS sensor [42] from a height of 500 m above ground level, and the third set is captured with TASI [43] from heights of about 610 m (2000 ft) and 762 m (2500 ft). The spectral ranges of the SEBASS and TASI sensors are 7.6-13.5 µm and 8.0-11.5 µm, involving 128 bands and 32 bands, respectively. All three sets consist of three images, two of which are captured on the same day at different times and the third of which is captured on a different day. The performances are reported for the image pairs on the same and different days.


Proposed Registration Methods
The proposed registration method is illustrated in Figure 2. The main stages of the method can be described as follows:
• 3D-2D Conversion: The hyperspectral images with two spatial and one spectral dimension, namely HSI1 and HSI2, are transformed to 2D maps in this stage. Different conversions based on temperature and emissivity components in the LWIR range are proposed and compared.
• Global Pose Estimation Based on Keypoint Matching: The keypoints of the 2D images are extracted and matched. The homography (2D planar projective) transformation [44] between the 2D images is estimated by random sample consensus (RANSAC) [45].
• Blockwise Refinement: The global planar projective transformation is refined over blocks to align the detail edges, objects, and contrast in local regions by using the proposed geometry-based or optimization-based methods.
• Pixelwise Refinement: Blocking artifacts that can occur due to the individual mapping of neighboring blocks are further refined by investigating the best homography among different candidates for each pixel at this final stage.
• 2D Mosaicing and Outputs: Given the 2D images, one image is geometrically transformed to the coordinate system of the other (reference) image by using the estimated transforms for each pixel. The resulting mosaic image and the similarity metrics, namely mutual information (MI) and structural similarity index (SSIM), between the transformed and reference images are returned as outputs for visual inspection and performance evaluation.
The main differences of the proposed scheme with respect to the conventional image registration pipeline are highlighted in bold in Figure 2. The contributions of the performed research are concentrated on the investigation of the temperature, emissivity, and radiance components in the thermal LWIR range for 3D-2D conversion for the purpose of feature extraction and matching, the development of the geometric- and optimization-based blockwise local refinements after global transformation, and the elimination of blocking artifacts using pixelwise refinements. The other parts of the proposed scheme, such as feature matching and RANSAC, follow the well-settled processing in 2D image registration.
The following sections describe these contributions in detail.


3D-2D Conversions
As the radiance signal in the thermal range is mainly affected by the temperature and emissivity of the materials, different approaches are proposed to convert the 3D hyperspectral images to 2D images. These images are then investigated with respect to their performances for keypoint extraction and matching. First, a method to estimate the brightness temperatures of hyperspectral pixels is developed [46] and a 2D image is formed by using this estimate for each pixel. Second, the average spectral energy and the principal components of the radiance signal for each hyperspectral pixel are used to generate 2D images. Finally, the average spectral energy and the principal components of the emissivity signal for each pixel are utilized as the other main components of the signal captured in the thermal range. The descriptions of the algorithms are as follows.

Brightness Temperature Estimation for Hyperspectral Pixels
Let us assume that the hyperspectral image with radiance information and of size m × n × p is denoted as I(x, y, λ), where x and y indicate the spatial horizontal and vertical coordinates and λ is the spectral wavelength. The brightness temperature of the hyperspectral pixels is optimally estimated by minimizing the mean square error (MSE) between the radiance of the hyperspectral pixels and the Planck curves generated for different temperatures. The proposed estimation is as follows:
1. The radiance of the hyperspectral pixel is assigned to a vector for each pixel of the hyperspectral image: v_x,y(λ) = I(x, y, λ).
2. Planck curves [47] for different temperatures are generated for the spectral range of the LWIR camera from a minimum temperature, T_min, to a maximum temperature, T_max, with a step size of temperature, ∆T:

B(λ, T) = (2hc²/λ⁵) / (e^(hc/(λkT)) − 1),

where, in cgs units, h = 6.626068 × 10⁻²⁷ erg s (Planck's constant), k = 1.380649 × 10⁻¹⁶ erg deg⁻¹ (Boltzmann's constant), c = 2.997925 × 10¹⁰ cm/s (speed of light in vacuum), and T is the object temperature in kelvin.
3. The mean square error (MSE) between the generated Planck curve for each temperature and the radiance of the hyperspectral pixel is computed. The temperature giving the minimum MSE is assigned as the brightness temperature of the hyperspectral pixel:

T_B(x, y) = argmin_T Σ_λ [v_x,y(λ) − B(λ, T)]², with T ∈ {T_min, T_min + ∆T, . . . , T_max}.

4. The algorithm takes the T_min, T_max, and ∆T values along with the hyperspectral image as inputs. The 2D image, T_B(x, y), involving the estimated brightness temperature for each pixel, is returned as the output.
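The brute-force search over candidate Planck curves can be sketched in Python with NumPy. This is a minimal sketch rather than the authors' implementation: it assumes the cube stores spectral radiance in cgs units consistent with the generated Planck curves, and the function and parameter names (`planck_radiance`, `brightness_temperature_map`, `t_min`, `t_max`, `dt`) are hypothetical.

```python
import numpy as np

# Physical constants in cgs units
H = 6.626068e-27   # Planck's constant, erg s
K = 1.380649e-16   # Boltzmann's constant, erg / K
C = 2.997925e10    # speed of light in vacuum, cm / s

def planck_radiance(wavelengths_cm, temperature):
    """Blackbody spectral radiance (Planck's law) at the given wavelengths."""
    lam = wavelengths_cm
    return (2.0 * H * C**2 / lam**5) / (np.exp(H * C / (lam * K * temperature)) - 1.0)

def brightness_temperature_map(cube, wavelengths_um, t_min=250.0, t_max=350.0, dt=0.5):
    """Assign to each pixel the temperature whose Planck curve best fits
    its radiance spectrum in the mean-square-error sense."""
    lam = np.asarray(wavelengths_um) * 1e-4                 # micrometers -> cm
    temps = np.arange(t_min, t_max + dt, dt)
    # Planck curves for all candidate temperatures, shape (n_temps, n_bands)
    curves = np.stack([planck_radiance(lam, t) for t in temps])
    m, n, p = cube.shape
    pixels = cube.reshape(-1, p)                            # (m*n, p) spectra
    # MSE between every pixel spectrum and every candidate curve
    mse = ((pixels[:, None, :] - curves[None, :, :]) ** 2).mean(axis=2)
    return temps[mse.argmin(axis=1)].reshape(m, n)
```

The vectorized pixel-versus-curve MSE keeps the search simple; for large cubes, looping over temperature candidates instead would reduce memory use.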

Average Spectral Energy and PCA Components of Radiance Spectra as 2D Maps
The average spectral energy of the radiance signal for each pixel of the hyperspectral image, E_R(x, y), is defined as follows:

E_R(x, y) = (1/p) Σ_λ I(x, y, λ)²,

where p is the number of spectral bands. In addition, principal component analysis (PCA) is applied to the LWIR images in accordance with the proposed registration methods for VNIR and SWIR images [4,5]. The corresponding images for the first and second principal components are used as 2D maps.
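A minimal sketch of these two radiance-based conversions follows, assuming a NumPy cube of shape m × n × p; the helper names are hypothetical, and the PCA is the standard eigen-decomposition of the band covariance matrix rather than any particular library routine.

```python
import numpy as np

def average_spectral_energy(cube):
    """2D map of the average spectral energy of each pixel's radiance spectrum."""
    return (cube ** 2).mean(axis=2)

def pca_component_maps(cube, n_components=2):
    """2D maps of the leading principal components of the spectra (PCA over bands)."""
    m, n, p = cube.shape
    X = cube.reshape(-1, p).astype(np.float64)
    X -= X.mean(axis=0)                       # center each band
    cov = X.T @ X / (X.shape[0] - 1)          # band covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)    # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    scores = X @ eigvecs[:, order]            # project spectra on top components
    return [scores[:, i].reshape(m, n) for i in range(n_components)]
```

The first returned map carries the largest spectral variance and is the usual candidate for keypoint extraction.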

Average Spectral Energy and PCA Components of Emissivity Signal as 2D Maps
The computation of the average spectral energy for the emissivity component of the hyperspectral image involves the following stages:
1. The given hyperspectral image, I(x, y, λ), is first separated into temperature and emissivity components by using a temperature emissivity separation (TES) algorithm [48]: [T_P(x, y), e(x, y, λ)] = TES(I(x, y, λ)), where T_P(x, y) is the estimate of the physical temperature for the hyperspectral pixels and e(x, y, λ) is the estimate of the three-dimensional emissivity signal with a size of m × n × p.
2. The average emissivity energy for each pixel, E_e(x, y), is computed as follows:

E_e(x, y) = (1/p) Σ_λ e(x, y, λ)².

The algorithm returns the 2D image, E_e(x, y), as the output. In addition to the average spectral energy, the corresponding images for the first and second components of the emissivity signal, e(x, y, λ), after the PCA transform are also investigated for feature extraction.

Global Pose Estimation Based on Keypoint Matching
The registration of the hyperspectral cubes turns into the registration of 2D images at this stage, which follows the conventional pipeline for pose estimation between images. The resulting 2D maps first undergo median filtering due to the dominant sensor noise in the thermal range, which usually emerges as spikes on the generated maps. As a second preprocessing step, the 2D image pairs are brought into the same range of [0, 255] by using the maximum and minimum of the images.
The widely known scale-invariant feature transform (SIFT) [39] is adopted for keypoint extraction. The initial tests on global pose estimation have revealed that the same conclusions are also valid for other keypoint extraction methods such as SURF [40], Harris [49], DOM-SIFT [7], and GOM-SIFT [8]. The keypoints from the two images are matched with respect to the mean square error (MSE) between their descriptors. These initial matches are input to the RANSAC algorithm to find the best planar projective transformation, namely the homography matrix of size 3 × 3.

Blockwise Refinement
The resulting global homographies and georeferencing (GPS-IMU)-based registrations [25] might exhibit local misalignments due to inaccuracies in the selection of keypoints and instabilities in the positions of the GPS and IMU devices during capture. Therefore, the global transformation should be further refined to align the local regions.

Geometric-Based Local Refinement
The proposed geometric-based method performs the refinement over each image block by defining a homography transformation with respect to the nearest keypoints to that block. Let us assume that P(x′, y′) and R(x, y) denote the utilized 2D maps for the two hyperspectral images, namely the original and reference images, with the coordinate systems (x′, y′) and (x, y). Let P(x, y) also denote the ultimate transformed image with the geometric-based local refinement, and let H_g be the obtained global homography from the coordinate system (x′, y′) to (x, y) as the output of the global pose estimation. The proposed blockwise geometric refinement involves the following stages:

1. First, divide the reference image, R(x, y), into nonoverlapping blocks of size N_B × N_B. Denote the resulting blocks as r_ij(x, y), where i and j refer to the horizontal and vertical indices of the block, as illustrated in Figure 3.
2. For each block, r_ij(x, y), find the spatial (Euclidean) distances of all keypoints to the center of that block. Find the closest N_K matched keypoints to the center of that block. As an example, the selected closest keypoints are illustrated with solid circles in Figure 3.
3. For each 4-point combination of the N_K points,
   i. Derive a homography matrix by using their corresponding matches [44].
   ii. Form the transformed block, p_ij(x, y), by finding the corresponding position, (x′, y′), for each pixel position, (x, y), inside the block with the given homography matrix and by performing a bilinear interpolation on P(x′, y′).
   iii. Compute the distance (with respect to a metric) between r_ij(x, y) and p_ij(x, y).
4. Assign the homography matrix giving the minimum distance among all the homography matrices, including also the inverse of the global homography, H_g⁻¹, as the homography of the block r_ij(x, y). The resulting homography for the block r_ij(x, y) is denoted as H_ij.
5. The ultimate transformed image, P(x, y), is formed by concatenating the transformed image blocks with the resulting homographies (H_ij s).

The block size, N_B, and the number of selected keypoints for each block, N_K, are selected as the design parameters in the proposed algorithm. The performances of three metrics, namely SSIM, mutual information, and the average spatial (Euclidean) distance between the coordinates of the selected N_K keypoints after the transformation and the coordinates of the corresponding keypoints on the reference image, are compared between the transformed and reference blocks. Note that the negatives of the similarity metrics MI and SSIM are utilized as distance metrics during implementation to align with the distance definition. The overall similarity of the ultimate transformed image, P(x, y), and the reference image, R(x, y), is evaluated in terms of SSIM.

Optimization-Based Local Refinement
The optimization-based local refinement also works over the image blocks. Given a reference block, the algorithm tries to find the best homography, which minimizes the distance metric, D, between the reference block and the corresponding transformed block formed by mapping the pixels of the other image with the given homography matrix. The optimization problem can be mathematically described as follows:

H_ij = argmin_H D(p_ij(x, y), r_ij(x, y)),    (7)

where H refers to the homography transformation and r_ij(x, y) is the (i, j)th block of the reference image R(x, y), as mentioned in the previous section. p_ij(x, y) corresponds to the transformed block, which is formed by mapping the corresponding pixels of P(x′, y′) to the coordinate system (x, y) with the given H. Therefore, the distance, D, is a function of H, which is a 3 × 3 matrix with 9 parameters. In the description of (7), the dependency of p_ij(x, y) on H is not shown in order not to make the equation too cumbersome. The distance metric, D, is selected as the negative of the SSIM between the transformed and reference blocks, as optimization problems are conventionally expressed as minimization rather than maximization problems. The optimization algorithm takes the global homography H_g as the initial guess for each block. The quasi-Newton method, one of the standard methods preferred in optimization toolboxes, is adopted for the solution of the minimization problem in (7). Given that the distance D in (7) is a function of H, the main stages of the proposed algorithm are as follows:
1. First, divide the reference image, R(x, y), into nonoverlapping blocks of size N_B × N_B. Denote the resulting blocks as r_ij(x, y), where i and j refer to the horizontal and vertical indices for the block number, respectively.

2. For each block, r_ij(x, y),
   i. Assign H as H_g at the initialization.
   ii. Form the transformed block, p_ij(x, y), by finding the corresponding position, (x′, y′), for each pixel position, (x, y), inside the block with the given homography matrix H and by performing a bilinear interpolation on P(x′, y′).
   iii. Compute the distance between r_ij(x, y) and p_ij(x, y), namely D(p_ij(x, y), r_ij(x, y)).
   iv. Compute the gradient ∇D and Hessian ∇²D of the cost function D with respect to H.
   v. Update H with respect to the quasi-Newton algorithm:

      H_{k+1} = H_k − [∇²D(H_k)]⁻¹ ∇D(H_k),

      where k is the iteration number.
   vi. If the number of iterations is smaller than the maximum number of iterations and the decrease in the cost function, D(H_k) − D(H_{k+1}), is greater than a threshold value, then update the homography H as H_{k+1} and go to step ii. Otherwise, finish the iteration.
   vii. The ultimate homography for the block r_ij(x, y) is denoted as H_ij.

The transformed image, P(x, y), is formed by concatenating the transformed image blocks with the resulting homographies (H_ij s).
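The blockwise optimization can be sketched with SciPy's BFGS, a standard quasi-Newton routine that approximates the Hessian internally and differentiates the cost numerically instead of evaluating ∇D and ∇²D in closed form; this substitution, the global-statistics SSIM used as the similarity, and the parameterization of the homography by its first eight entries (h33 fixed to 1) are all implementation assumptions of this sketch.

```python
import numpy as np
from scipy.ndimage import map_coordinates
from scipy.optimize import minimize

def ssim_global(a, b, L=1.0):
    """Single-value SSIM from global means, variances, and covariance."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2))

def warp_block(P, H, by, bx, nb):
    """Bilinearly sample P at the source positions that the block's reference
    pixels map to under H (reference -> source coordinates)."""
    ys, xs = np.mgrid[by:by + nb, bx:bx + nb]
    pts = H @ np.stack([xs.ravel(), ys.ravel(), np.ones(nb * nb)])
    xs_s, ys_s = pts[0] / pts[2], pts[1] / pts[2]
    return map_coordinates(P, [ys_s, xs_s], order=1).reshape(nb, nb)

def refine_block_opt(ref_block, P, by, bx, H_init, max_iter=50):
    """Quasi-Newton (BFGS) refinement of one block's homography, minimizing
    the negative SSIM between the reference block and the warped block."""
    nb = ref_block.shape[0]
    def cost(h8):
        H = np.append(h8, 1.0).reshape(3, 3)    # h33 fixed to 1
        return -ssim_global(ref_block, warp_block(P, H, by, bx, nb))
    h0 = (H_init / H_init[2, 2]).ravel()[:8]
    res = minimize(cost, h0, method="BFGS", options={"maxiter": max_iter})
    return np.append(res.x, 1.0).reshape(3, 3), -res.fun
```

Because the line search only accepts decreasing cost values, the refined block similarity never falls below that of the initial (global) homography.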

Pixelwise Refinement
Although blockwise refinement improves the performance of the global mapping, it can also produce blocking artifacts in some regions, as illustrated in Figure 12a-c, which is similar to the blocking problem in motion-compensated video coding [50]. As the algorithm maps all the pixels inside a block with the same homography, neighboring blocks having different but slightly inaccurate transformations can produce blocking distortions. The refinement is therefore further improved at the pixel level. The proposed pixelwise refinement updates the homography for each pixel by searching for the best homography from a candidate set of homographies. This set is formed of the homographies of the blocks giving the best-matched results over all the blocks in the image in terms of the resulting SSIM score between the transformed and reference blocks after the blockwise refinement.
Let us assume that the best M blockwise homographies with respect to the distance between the transformed and reference blocks are denoted as H_1, H_2, . . . , H_M and that P_pw(x, y) is the resulting image after pixelwise refinement. The proposed pixelwise refinement is described as follows:
1. Form the M transformed images, L_1(x, y), L_2(x, y), . . . , L_M(x, y), by transforming the image with each of the homographies H_1, H_2, . . . , H_M.
2. Compute the SSIM map between each of the transformed images and the reference image, R(x, y), resulting in the SSIM maps S_1(x, y), S_2(x, y), . . . , S_M(x, y). Note that S_i(x, y) indicates the similarity of the pixels R(x, y) and L_i(x, y) with respect to SSIM.
3. Apply an averaging filter to the resulting SSIM maps in order to also include the effect of neighboring pixels in the similarity computation.
4. Assuming that S_k(x, y) is the highest score at pixel (x, y), assign the pixel value of the transformed image, L_k(x, y), as the value of P_pw(x, y).
The algorithm ultimately computes the MI and SSIM between P_pw(x, y) and R(x, y) for the evaluation.
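A compact numpy sketch of steps 1-4, offered as an illustration rather than the paper's implementation: it computes a windowed SSIM map per candidate transformed image, smooths each map with an averaging filter, and picks the best candidate per pixel. The function names and the 3×3 window size are illustrative choices.

```python
import numpy as np

def box3(a):
    # 3x3 averaging filter via shifted sums (edge padding).
    p = np.pad(a, 1, mode="edge")
    return sum(p[i:i + a.shape[0], j:j + a.shape[1]]
               for i in range(3) for j in range(3)) / 9.0

def ssim_map(a, b, L=255.0, k1=0.01, k2=0.03):
    # Per-pixel SSIM computed from 3x3 local statistics.
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_a, mu_b = box3(a), box3(b)
    va = box3(a * a) - mu_a ** 2
    vb = box3(b * b) - mu_b ** 2
    cov = box3(a * b) - mu_a * mu_b
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (va + vb + c2))

def pixelwise_refine(ref, candidates):
    # Step 2: SSIM map per candidate; step 3: averaging filter;
    # step 4: per-pixel argmax selects the source candidate.
    maps = np.stack([box3(ssim_map(ref, c)) for c in candidates])
    k = maps.argmax(axis=0)
    cand = np.stack(candidates)
    return np.take_along_axis(cand, k[None], axis=0)[0]
```

With two candidates that are each correct on a different half of the image, the per-pixel selection recovers the reference almost everywhere except near the boundary between the two regions, which is exactly the blending behavior the refinement relies on.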

Outputs: 2D Mosaic and Objective Similarity Metrics
A 2D mosaic is formed by combining the transformed and reference images, and it is visually inspected to check whether the lines and texture exhibit continuous behavior after the applied transformations. In order to perform an objective assessment, the mutual information and SSIM between the transformed and reference images are also computed for the comparison of the proposed 2D conversion methods and algorithms.
Given that u and v correspond to the intensity values of I(x, y) and R(x, y), respectively, the mutual information (MI) [37] between the images is computed as follows:

MI(I, R) = Σ_{u=0}^{L} Σ_{v=0}^{L} P(u, v) log [ P(u, v) / (P_I(u) P_R(v)) ],

where P_I(u) and P_R(v) are the marginal distributions of the images, P(u, v) is the joint distribution of the images, and L is the maximum intensity of both images. Mutual information basically measures the dependence of two images by calculating the distance between the joint distribution of the intensity values, P(u, v), and the product of the marginal distributions, P_I(u)P_R(v). The structural similarity index measure (SSIM) [38] between the two images I(x, y) and R(x, y) is computed as follows:

SSIM(I, R) = [(2 μ_I μ_R + c_1)(2 σ_IR + c_2)] / [(μ_I² + μ_R² + c_1)(σ_I² + σ_R² + c_2)],

where μ_I and μ_R are the averages, σ_I² and σ_R² the variances, and σ_IR the covariance of the images I(x, y) and R(x, y). c_1 and c_2 are defined as c_1 = (k_1 L)² and c_2 = (k_2 L)², where L stands for the dynamic range of the intensity values, with k_1 = 0.01 and k_2 = 0.03. SSIM is basically an image quality assessment measure inspired by the human visual system's capability of extracting structural information. It is a weighted combination of three comparison measurements: luminance, contrast, and structure [38].
The MI metric is maximized when the images are precisely aligned and minimized when their statistics are independent, in which case it carries zero information. Therefore, MI takes values greater than or equal to zero and less than or equal to the minimum of the entropies of the two images [37]. SSIM, on the other hand, takes values in the range of zero to one. Larger values of MI and SSIM between the transformed and reference images indicate better registration performance. Hence, the cost function for the minimization problem is chosen as the negative of the SSIM throughout the article, in keeping with the conventional formulation of optimization problems.
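Both metrics can be computed directly from their definitions; the following numpy sketch (illustrative, not the paper's code) estimates MI from a joint intensity histogram and evaluates the single-window SSIM formula:

```python
import numpy as np

def mutual_information(a, b, bins=32):
    # Joint pmf P(u, v) from a 2D histogram; marginals P_a(u), P_b(v).
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    p = joint / joint.sum()
    pa = p.sum(axis=1, keepdims=True)          # P_a(u), shape (bins, 1)
    pb = p.sum(axis=0, keepdims=True)          # P_b(v), shape (1, bins)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (pa @ pb)[nz])).sum())

def ssim(a, b, L=255.0, k1=0.01, k2=0.03):
    # Single-window SSIM evaluated over the whole image.
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2))
```

A perfectly aligned pair gives an SSIM of one and an MI equal to the image entropy, while statistically independent images drive the MI toward zero, matching the bounds stated above.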

Results
The experiments were conducted in the MATLAB 16 environment on a 2.5 GHz computer with 16 GB RAM. The VLFeat library [51] and the hyperspectral library by Greg [52] were utilized for the baseline operations regarding feature matching and hyperspectral image processing. The quasi-Newton algorithm in the MATLAB optimization toolbox was adopted for the optimization-based registration method, selecting the tolerance in the change of the cost function as 10^−4 and the maximum number of iterations as 500. The results are presented in five subsections investigating the following main aspects of the proposed registration method:
• First, the performances of the proposed 3D-2D conversions are discussed and the best conversions for the global pose estimation are determined in Section 5.1.
• This is followed by the description of controlled experiments conducted to select the design parameters for the proposed blockwise local refinement in Section 5.2.
• The improvements of the pixelwise refinement in comparison with the blockwise refinement are then discussed in Section 5.3.
• The next subsection, Section 5.4, compares the performances of the proposed geometric-based and optimization-based local refinements.
• Finally, the improvements of the proposed registration method are revealed with respect to the conventional approaches, such as manual and direct georeferencing-based registration, and state-of-the-art image registration methods, including deep learning-based approaches, in Section 5.5.

Figure 4 illustrates the 2D images obtained with the proposed 3D-2D conversions for the images LWIR1a and LWIR1b, which were captured on the same day. The 2D images were scaled to the range of 0-255 by using the maximum and minimum values. The resulting images for the brightness temperature (a), average energy of the radiance spectra (b), and first principal component of the radiance spectra (c) have similar contrast. On the other hand, the images for the average emissivity maps (d) possess relatively low contrast. Therefore, histogram equalization was performed on these images to enhance the contrast, as illustrated in Figure 4e. The 2D image for the first principal component of the emissivity spectra also has comparatively less contrast with respect to the radiance-based transformations and brightness temperature. This can be linked with the fact that the temperature and related 2D maps for the radiance spectra correspond to the coarse part of the spectral changes, whereas the emissivity component can be interpreted as a detail component from the point of view of signal decomposition theory. Therefore, the spatial contrast is comparatively lower for the emissivity-based 2D maps.

Selection of 3D to 2D Conversion Method for Global Pose Estimation
Figure 5a-g illustrates the matched points for different 3D-2D conversions for the same-day pair LWIR1a and LWIR1b. The inliers after the RANSAC algorithm are illustrated. All the 2D maps have enough matches to estimate the geometric transformation between the pair LWIR1a-LWIR1b. Table 4 gives the ratios of the inliers after the matching to the total number of matched points for all the pairs, including same- and different-day captures. While the brightness temperature maps and average energy give high scores for the same-day pairs, namely LWIR1a-LWIR1b, LWIR2a-LWIR2b, and LWIR3a-LWIR3b, the principal components of the radiance and the emissivity-based maps do not achieve a sufficient number of matching points for the pose estimation, particularly for the pair LWIR2a-LWIR2b. Another important observation regarding the matching performances is a severe decrease in the inlier ratios for different days for all the 2D maps. While the ratios for the brightness temperatures and average radiance energy fall significantly, down to 10%, the matching results for the principal components of the radiance and the emissivity-based 2D maps do not indicate stable performances for the different pairs on different days.
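The inlier ratios reported in Table 4 come from RANSAC-style homography fitting on the matched keypoints. A self-contained sketch of that computation is given below; the names are illustrative, and the minimal DLT-plus-RANSAC loop stands in for the exact toolbox routines used in the paper.

```python
import numpy as np

def fit_homography(src, dst):
    # Direct linear transform (DLT) from four (or more) correspondences.
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, float))
    return vt[-1].reshape(3, 3)

def project(H, pts):
    q = np.c_[pts, np.ones(len(pts))] @ H.T
    return q[:, :2] / q[:, 2:3]

def ransac_homography(src, dst, iters=300, tol=2.0, seed=0):
    # Returns the best homography and the inlier ratio (as reported in Table 4).
    rng = np.random.default_rng(seed)
    best_H, best_inl = None, np.zeros(len(src), bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = fit_homography(src[idx], dst[idx])
        with np.errstate(divide="ignore", invalid="ignore"):
            err = np.linalg.norm(project(H, src) - dst, axis=1)
        inl = err < tol   # NaN/inf from degenerate samples compare as False
        if inl.sum() > best_inl.sum():
            best_H, best_inl = H, inl
    return best_H, best_inl.sum() / len(src)
```

Feeding the matched keypoints of any 2D map pair into `ransac_homography` yields both the global homography and the inlier ratio entry for that pair.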
Figure 6a-g gives the mosaic images obtained by overlapping the transformed image with the reference image. In particular, the continuity of the horizontal and vertical lines while passing from one image to the other indicates the success of the registration. The geometry of the transformed images is similar for all the proposed 2D conversion methods for the LWIR1a-LWIR1b pair. Figure 6h-k gives the mosaic images for the 2D maps that have enough matches for registration for the pair captured on different days, LWIR1a-LWIR1c, as indicated in Table 4. The mosaic images are coarsely aligned for the different-day image pairs compared to the same-day results. Note that mosaic images could not be formed for the other 2D maps, denoted with "X" in Table 4. Table 5 gives the mutual information and SSIM between the transformed and reference images, respectively. Both metrics indicate similar behaviors for the different pairs. While the brightness temperature and average radiance energy give the best results for the same-day pair, their performances significantly decrease for the different-day image pairs. The results for the emissivity-based maps are not very stable for either the same-day or different-day captures. Contrary to an initial assumption that the emissivity maps can survive temperature changes and achieve better matching, the experiments reveal that they do not possess enough contrast for feature extraction and matching. However, the temperature maps can still be used to extract and match features even though the temperatures vary from time to time. For the sake of completeness, Figure 7 also gives the mosaic images for the brightness temperature maps for the other four pairs, namely LWIR2a-LWIR2b, LWIR2a-LWIR2c, LWIR3a-LWIR3b, and LWIR3a-LWIR3c. In the rest of the experiments, the brightness temperature maps and SSIM were utilized for the global transformation and performance evaluation to further investigate the proposed local refinements.

Selection of Design Parameters and Distance Metrics for Blockwise Local Refinement
The global transformation defined in the previous section coarsely matches the images without considering a fine registration in local regions. As an example, the given transformations in Figures 6 and 7 indicate that the registration is in fact mostly achieved throughout the vertical and horizontal roads on the given scenes, while the performance for the other regions can still be improved. This section gives the results for the proposed blockwise local refinement for such a purpose along with the selection of the design parameters.
The design parameters in the proposed geometric-based blockwise local refinement in Section 4.3 are the number of neighbor points, N_K, used to determine the homography for a given block, and the size of the blocks, N_B. Figure 8a illustrates the change of the global SSIM between the reference image and the transformed image after the proposed blockwise refinement as a function of N_K, when the block size is kept fixed at 27. As expected, as the number of neighbor points increases, the number of combinations of four points that can generate a better homography transformation for a given block increases as well. However, this performance increase is limited by the execution time of the method, which is given in Figure 8b as a function of N_K. In order to hold the duration at the level of a few minutes, an N_K of 9 was selected in the experiments. The resulting SSIM scores in the given figure are all higher than the one obtained with the global homography, as the algorithm forces the selection of the better local SSIM for each candidate homography obtained with any four points. However, the selection of N_K as 6 or 7 can give local black regions, particularly in smooth areas, due to the similarities of those regions in terms of SSIM.
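The combination search described above can be sketched as follows (an illustration, not the paper's code): every 4-subset of the N_K neighboring matched keypoints yields a candidate homography via the DLT, the block is warped with nearest-neighbor sampling to keep the sketch short (the paper uses bilinear interpolation), and the candidate with the highest block SSIM wins. All names are illustrative.

```python
import itertools
import numpy as np

def fit_homography(src, dst):
    # DLT from four correspondences (reference coords -> source-image coords).
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, vt = np.linalg.svd(np.asarray(rows, float))
    return vt[-1].reshape(3, 3)

def warp_nn(img, H, h, w):
    # Map each block pixel through H and sample img (nearest neighbor).
    ys, xs = np.mgrid[0:h, 0:w]
    q = H @ np.stack([xs.ravel(), ys.ravel(), np.ones(h * w)])
    den = np.where(np.abs(q[2]) < 1e-12, 1e-12, q[2])
    u = np.clip(np.round(q[0] / den), 0, img.shape[1] - 1).astype(int)
    v = np.clip(np.round(q[1] / den), 0, img.shape[0] - 1).astype(int)
    return img[v, u].reshape(h, w)

def block_ssim(a, b, L=255.0, k1=0.01, k2=0.03):
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2))

def best_block_homography(ref_block, img, ref_pts, img_pts):
    # Try every combination of 4 of the N_K neighbor keypoints; keep the
    # homography whose warped block is most similar to the reference block.
    best_s, best_H = -2.0, None
    for comb in itertools.combinations(range(len(ref_pts)), 4):
        H = fit_homography(ref_pts[list(comb)], img_pts[list(comb)])
        s = block_ssim(warp_nn(img, H, *ref_block.shape), ref_block)
        if s > best_s:
            best_s, best_H = s, H
    return best_H, best_s
```

With N_K = 9 there are C(9, 4) = 126 candidate homographies per block, which is why the execution time in Figure 8b grows so quickly with N_K.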
Figure 9a gives the change of the global SSIM between the reference image and the transformed image after the blockwise refinement as a function of the block size, N_B, for the fixed N_K = 9. As the block size increases, the probability that there is a better homography than the global homography for the selected blocks decreases. Note that the ultimate case for the selection of a block is the whole image itself, where the global homography is applied. Therefore, the curve decreases toward the indicated line for the global homography as the block size increases. In order to hold the execution time shown in Figure 9b to a few minutes, the block size of 27 was selected, which gives a global SSIM score of about 0.35. Note that as the block size increases, the number of blocks in the whole image decreases. Accordingly, it can be observed that the decay in the computational time in Figure 9b is inversely proportional to the area of the blocks, N_B².
As the last experiment, the performances of different distance metrics for the selection of the blockwise homography matrices in the proposed local refinement were compared. The metrics selected for measuring the distance between the reference block and the transformed block were SSIM, mutual information, and the geometric mean square error (MSE) between the matched points, as in the conventional RANSAC algorithm. Figure 10 gives the resulting global SSIM as a function of block size for the fixed N_K = 9 for the mentioned distance metrics. The performance of the geometric MSE is the worst among the three methods, which indicates that the localization of the SIFT points is not accurate enough for better registration of the thermal images. As a second conclusion, the performance of the mutual information indicates a concave characteristic, which decreases for the lower block sizes due to the lower accuracies in the estimation of the probability density functions (pdfs) with fewer pixels. The mutual information then shows a coherent behavior with the SSIM for the higher block sizes. In the rest of the experiments, SSIM was selected as the distance metric for the blockwise refinement, the number of neighbor points was fixed at 9, and the block size of 27 was selected.
Figure 10. Global SSIM between transformed and reference images vs. block size for different metrics in blockwise local refinement, namely SSIM, mutual information, and geometric MSE.

Comparison of Blockwise and Pixelwise Local Refinements
Figure 11 illustrates the visual results for the pair LWIR1a-LWIR1b, which include the transformed images with the global transformation, the blockwise refinement, and the pixelwise refinement, and the corresponding SSIM maps between the reference image (LWIR1b) and the transformed images. Note that the brighter regions indicate higher SSIM scores in the given maps. As can be observed in Figure 11c,f, the global homography registers the parts along the road in the scene with high scores, but there are significant shifts in the regions shown with the red rectangles on the SSIM map in Figure 11f. The blockwise refinement, on the other hand, significantly improves the matching and the SSIM scores in those regions, as illustrated in Figure 11d and the corresponding SSIM map in Figure 11g. However, a major drawback of the blockwise refinement is the blocking artifacts, as the algorithm maps all the pixels inside a block with the same homography. Consequently, neighboring blocks having different but wrong transformations can produce blocking distortions, which is similar to the blocking problem in motion-compensated video coding [50]. The red rectangles in Figure 11d illustrate some of those distortions, whose zoomed versions are shown in Figure 12a-c.
The proposed pixelwise refinement given in Section 4.4 significantly compensates for them, as shown in Figures 11e and 12d-f. Accordingly, the global SSIM results given in Figure 13 as a function of block size indicate better performances for the pixelwise refinement compared to the blockwise refinement. Figure 11. (a,b) Brightness temperature (BT) maps for LWIR1a and LWIR1b. (c-e) Transformed images with global homography, after blockwise refinement, and after pixelwise refinement, respectively. (f-h) Corresponding SSIM maps between the transformed image and reference image (BT map of LWIR1b) for global homography, blockwise refinement, and pixelwise refinement, respectively.

Comparison of Geometric-Based and Optimization-Based Local Refinements
An alternative to the geometric feature-based blockwise refinement is the optimization-based blockwise refinement, as described in Section 4.3. The optimization algorithm takes the global homography as an initial point and then iteratively updates this matrix for each block to maximize the SSIM between the reference block and the transformed block. Figure 14a illustrates the change of the local SSIM for a sample block with respect to the iteration number. The saturation is quite observable as the iterations continue.
Figure 14b gives the resulting global SSIM with respect to the block size for both the optimization-based and geometric-based approaches for the pair LWIR1a-LWIR1b. While the optimization-based approach is better than the result for the global homography, it gives results comparable with the geometric approach. However, the duration illustrated in Figure 14c, for instance for the block size of 27, is about 5 times that of the geometric approach. The geometric approach, with its tolerable execution time, is well suited for practical usage. However, a possible limitation of the geometric-based method in practice is its dependency on the number of extracted keypoints in the scene. Figure 14d also gives the global SSIM results with respect to the block size for the pair LWIR1a-LWIR1c, which was captured on different days. There are no significant improvements in the registration results for either approach compared to the global homography. The ultimate results for the same- and different-day pairs are discussed in the next section.


Improvements of the Proposed Refinements with Respect to the Baseline and State-of-the-Art Methods
In the presentation of the proposed method in previous sections, the disadvantages of the mainstream global registration methods are first revealed. Afterward, we eliminate the misalignments in global registration by means of the blockwise refinement, which is followed by the pixelwise refinement at the last stage. The ultimate proposed method is formed of all these stages, namely global registration, blockwise registration, and pixelwise registration, as indicated in Figure 2. Note that the geometric-based or optimization-based method could be selected for the blockwise refinement in the whole registration chain. The experimental results are provided for both alternatives.
Manual registration, which is performed in typical remote sensing software, and georeferencing-based registration, which uses geo files obtained with the GPS-IMU system, were selected as the baseline methods for the experiments. The manual registration is realized by manually selecting a number of corresponding points in two images, as shown in Figure 15. These points are then used to estimate the transformation between two images. In the case of direct georeferencing, a corresponding pixel in the first image for a given pixel in the second image is found by performing a bilinear interpolation over the nearest pixels, whose corresponding distances are computed by using the longitudes and latitudes in the geo files.
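The georeferencing lookup can be sketched as follows; this is an illustration under assumptions, with an inverse-distance weighting over the nearest pixels standing in for the bilinear step, and all array names invented for the sketch. Each pixel of the geo file stores a latitude and longitude, and the value at a queried coordinate is interpolated from the spatially closest pixels of the first image.

```python
import numpy as np

def geo_lookup(lat_q, lon_q, lat1, lon1, img1, k=4):
    # lat1, lon1: per-pixel geocoordinates of image 1 (same shape as img1).
    # Find the k nearest pixels to the queried (lat_q, lon_q) and blend them
    # with inverse-distance weights.
    d2 = (lat1 - lat_q) ** 2 + (lon1 - lon_q) ** 2
    idx = np.unravel_index(np.argsort(d2, axis=None)[:k], d2.shape)
    w = 1.0 / (d2[idx] + 1e-12)
    return float((w * img1[idx]).sum() / w.sum())
```

Registering the whole second image then amounts to calling this lookup once per pixel, using the latitude and longitude recorded for that pixel in the second image's geo file.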
SuperGlue [34], D2-Net [33], and locality preserving matching (LPM) [32] were selected as the state-of-the-art methods due to their high performance and popularity in recent years. SuperGlue performs feature matching with a neural network that jointly finds correspondences and rejects outliers on local features. While the algorithm can also use regular keypoints extracted by classical extractors, the best-reported version, which uses the features extracted by the SuperPoint algorithm [31], was preferred in the implementation. D2-Net is another popular deep learning-based network designed for joint detection and description of features. LPM, on the other hand, improves the classical registration pipeline by integrating a mismatch removal phase using putative correspondences.
A common characteristic of the given methods and the other state-of-the-art methods in Table 2 is that they mainly focus on the improvement of feature extraction and matching rather than the improvement of the alignment after the registration of the images. Therefore, the extracted features were further processed by RANSAC to estimate the ultimate homography between the image pairs in the case of SuperGlue and D2-Net. LPM was adapted to the given framework by first eliminating the outliers and then again using RANSAC to estimate the homography. The parameters such as the maximum number of keypoints, keypoint extraction threshold, and matching thresholds were selected as default values for the implementation of SuperGlue. The standard pretrained networks provided by the authors were employed for the implementation of both SuperGlue and D2-Net. As a critical parameter, the number of nearest neighbors during outlier rejection was also set to the default value for the implementation of LPM.
It should be noted that all the experiments were performed on the normalized 8-bit images obtained after the 3D-2D conversions. The experiments have revealed that the application of the learning-based SuperGlue and D2-Net algorithms, with the provided pretrained networks, to the converted 8-bit images in the LWIR range achieves registration as in the case of standard RGB images. One complementary option in the experiments could be to further train the pretrained networks with the converted 2D images. However, the scarcity of publicly available hyperspectral LWIR data was one challenge for such a direction. The complexity of synthetic data generation for LWIR hyperspectral images was another challenge, as it requires not only the modeling of the emissivity, temperature, and radiance relations, but also dynamic temperature modeling for the heat transmission between different objects and regions.
Given the selected baseline and state-of-the-art methods, our claims for the experimental validation of the proposed improvements in this section are threefold. First, the manual and georeferencing-based methods can produce local misalignments. Second, a direct adaptation of the mentioned state-of-the-art methods to image registration by using conventional RANSAC and a global homography-based transformation can also produce local misalignments. Third, the proposed algorithm with the local refinements eliminates the local misalignments, which are not explicitly handled in the baseline and state-of-the-art methods. Table 6 gives the overall SSIM results for the proposed methods, baseline methods, and state-of-the-art methods for the image pairs captured on the same and different days. The proposed geometric-based local refinement is the best approach among all the methods. The performances of the state-of-the-art methods, namely SuperGlue, D2-Net, and LPM, are in comparable ranges with respect to the manual and global registration. As a specific case, LPM could not match the image pair 2a-2c due to the insufficient number of matches remaining after outlier rejection to estimate a regular homography. As all these compared methods mainly focus on feature extraction and matching rather than the alignment quality between the reference and transformed images, results similar to the global registration are expected in the experiments. Accordingly, the proposed geometric- and optimization-based refinements after the global transformation improve the registration performance in local regions and provide better scores.

Table 6. SSIM results between transformed and reference images for the proposed global, geometric-based, and optimization-based approaches with the baseline and state-of-the-art image registration methods. Note that there are no geo-coordinates provided for the images 3a, 3b, and 3c.

Table 6 also gives the results for the image pairs captured on different days.
In fact, all the methods give similar SSIM results for the different-day pairs, though with much lower values than for the same-day pairs. The results indicate that while the proposed global homography-based registration can still be used as an alternative to the manual and georeferencing-based approaches for different-day pairs, the proposed local refinements with the geometric and optimization methods are not very useful due to the dissimilarity of the utilized 2D maps on different days.

SSIM for Same-Day Pairs
The proposed geometric- and optimization-based refinements use the SSIM metric between the transformed and reference blocks to find the best blockwise homographies, and the overall registration quality is then reported in terms of the SSIM metric between the transformed and reference images, as shown in Table 6. In order to perform a cross-evaluation with a different metric, Table 7 also gives the results in terms of mutual information (MI) between the transformed and reference images. Although the gains in terms of MI are not as large as in the case of SSIM, over which the optimization is performed, they are still at significant levels, allowing similar conclusions regarding the performances of the proposed, baseline, and state-of-the-art methods.

Table 7. Mutual information (MI) results between transformed and reference images for the proposed global, geometric-based, and optimization-based approaches with the baseline and state-of-the-art image registration methods. Note that there are no geo-coordinates provided for the images 3a, 3b, and 3c.
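For reference, MI between two registered 8-bit images can be estimated from a joint histogram as sketched below; the bin count is an illustrative assumption, and the exact estimator used in the experiments may differ.

```python
import numpy as np

def mutual_information(a, b, bins=64):
    """MI (in bits) between two registered images, estimated from a joint histogram."""
    hist, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = hist / hist.sum()                       # joint probability estimate
    px = pxy.sum(axis=1, keepdims=True)           # marginal of a
    py = pxy.sum(axis=0, keepdims=True)           # marginal of b
    nz = pxy > 0                                  # avoid log(0)
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())
```

A useful sanity check: with identical binning, the MI of an image with itself equals its (binned) entropy, while two unrelated images score close to zero.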

Discussion
Different 3D-2D conversion methods have been proposed for the registration of LWIR hyperspectral images by considering the radiance, temperature, and emissivity components in the thermal infrared spectrum. The experiments first revealed that the 2D maps of the brightness temperature and the average radiance energy of the hyperspectral pixels are the most convenient for extracting and matching feature points and then estimating the global transformation to align hyperspectral LWIR images. Contrary to the preliminary assumption that the emissivity-based 2D maps would be more invariant to temperature changes in the fields, the resulting 2D images obtained from the emissivity components were found to have low spatial contrast and were therefore unsuitable for properly extracting and matching the feature points. As another observation, the number of extracted and matched points, and consequently the registration performance, decreased significantly for the pairs of images captured on different days with respect to the pairs captured on the same day. The sensor technology for thermal LWIR hyperspectral imaging is still evolving and not yet mature. The quality of the hyperspectral images strongly depends on sensor heating and other conditions during capture, as well as on the calibration and normalization processes performed at different times on such sensors.
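As a minimal sketch of one such 3D-2D conversion, the following collapses a radiance cube to an average-energy map and normalizes it to 8 bits, in the spirit of the normalized 8-bit maps used in the experiments; the exact normalization in the paper may differ.

```python
import numpy as np

def average_energy_map(cube):
    """Collapse a (rows, cols, bands) radiance cube to a 2D average-energy map,
    then stretch it to the full 8-bit range."""
    energy = np.mean(cube.astype(float) ** 2, axis=2)   # mean energy over bands
    lo, hi = energy.min(), energy.max()
    scaled = (energy - lo) / (hi - lo + 1e-12)          # guard against flat images
    return np.round(scaled * 255).astype(np.uint8)
```

The brightness-temperature map would be obtained analogously, with the per-pixel statistic replaced by an inversion of the Planck radiance at a reference band.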
After globally aligning the 2D maps of the LWIR images, the proposed method is further improved with blockwise and pixelwise registrations. The experiments for the blockwise refinement first indicated that the best similarity metric for the registration of the local blocks is SSIM, compared to the MSE and mutual information. While the performance of the MSE was always the lowest for the different block sizes, the performance of the mutual information also decreased for smaller blocks due to the lower number of pixels available for the estimation of the density functions. The pixelwise refinement as a follow-up stage significantly decreased the blocking artifacts of the blockwise refinement by properly assigning the correct transformation to each pixel inside the blocks. As another comparison in the tests, the performance of the proposed geometric-based registration method was found to be better than or comparable to that of the proposed optimization-based method, while providing a more suitable execution time for practical usage. However, it should be noted that the performance improvement of the geometric-based refinement is directly related to the number of extracted keypoints, which is one of its limitations with respect to the optimization-based refinement.
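The selection rule of the blockwise refinement can be sketched as follows, assuming the candidate transformations have already been applied to produce warped versions of the block; the single-window SSIM used here is a deliberate simplification of the full windowed SSIM.

```python
import numpy as np

def block_ssim(x, y, L=255.0):
    """SSIM evaluated over a whole block in a single window (simplified form)."""
    c1, c2 = (0.01 * L) ** 2, (0.03 * L) ** 2      # standard stabilizing constants
    x, y = x.astype(float), y.astype(float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def best_candidate(ref_block, candidate_blocks):
    """Pick the candidate warp (given as pre-warped blocks) maximizing SSIM
    against the reference block -- the selection rule of the blockwise refinement."""
    scores = [block_ssim(ref_block, c) for c in candidate_blocks]
    return int(np.argmax(scores)), max(scores)
```

In the actual method, each candidate block comes from a homography estimated over a different keypoint combination in the block's neighborhood (geometric variant) or from an iterative search over transformation parameters (optimization variant).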
Finally, the improvements of the proposed local refinements were presented in terms of SSIM with respect to the baseline and state-of-the-art methods. The cross-evaluation of the method with a metric different from the one used for optimization also supported the conclusions regarding the improvements. In order to further discuss the contribution of the proposed method with respect to the previous literature, a detailed visual inspection was also performed on different patches of the test images. Table 8 illustrates these examples for the different methods by overlapping the transformed patches with the reference patches. While the columns represent different patches, the rows indicate different methods. Gray regions in the overlapped patches indicate where the two images have similar values. Magenta and green colors correspond to the regions where the intensities are different. The regions in green or magenta, indicating the nonoverlapping parts of the patches, are quite visible for the manual, global, and GPS-IMU-based registration. The same observation is also valid for the compared methods, SuperGlue [34], D2-Net [33], and LPM [32], which perform the registration with a global transformation. Accordingly, while SuperGlue and D2-Net give mismatches in the lower part of the test images, they are better at aligning the middle parts, as the global homography is estimated with respect to those regions. The reverse situation, where the lower parts are aligned regularly but the upper parts are mismatched, can also occur, as in the given example for LPM.

Table 8. Examples of the overlapped reference and transformed patches for the proposed, baseline, and state-of-the-art methods. The first row shows the original patches in the reference image.
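Such a green/magenta overlay can be produced with a simple false-color composite, in the style of MATLAB's imshowpair: the reference image drives the green channel and the transformed image the red and blue channels, so equal intensities blend to gray while misaligned detail shows up green or magenta.

```python
import numpy as np

def overlay(reference, transformed):
    """False-color overlay of two registered 8-bit images.
    reference -> green channel; transformed -> red and blue (magenta).
    Pixels where the two agree render as gray."""
    ref = reference.astype(np.uint8)
    trn = transformed.astype(np.uint8)
    return np.dstack([trn, ref, trn])   # RGB order: (R, G, B)
```

Structures present only in the reference appear green; structures present only in the transformed image appear magenta, which is exactly the diagnostic read off in Table 8.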

The experiments reveal that the performances of the state-of-the-art methods such as SuperGlue, D2-Net, and LPM are in ranges similar to the global registration, as all these methods focus on feature extraction and matching. Local refinement after the global transformation is one of the neglected aspects of these methods. The proposed methods, on the other hand, successfully achieve registration in local regions, as indicated by the overlapped images, by performing automatic refinements after the global transformation. Without loss of generality, the proposed refinements can also take the resulting global homographies from the other methods as input and improve the performances of those methods as well. However, the performances of those methods were in comparable ranges, as given in Tables 6 and 7. Therefore, presenting the improvements with respect to the performed global transformation was found to be sufficient to draw the main conclusions from the experiments.

Conclusions
The experiments on the registration of LWIR hyperspectral images in the performed research first indicate that the temperature components are more convenient for feature extraction and matching in the thermal range compared to the emissivity component. The application of the mainstream global registration methods to the resulting 2D maps, however, produces local misalignments in the registered images. The proposed blockwise and pixelwise refinements eliminate those misalignments and give better results than the manual, global, and GPS-IMU-based registration. In addition, the improvements of the proposed method compared to state-of-the-art methods such as SuperGlue, D2-Net, and LPM are also reported both visually and in terms of objective similarity metrics, namely SSIM and mutual information. In view of the current trend in image analysis, the proposed framework for local refinements can be extended with deep learning-based methods. Future work will focus on the potential challenges of extending the proposed optimization framework to deep networks.
Author Contributions: A.K. carried out method development, implementation, analysis, interpretation, and manuscript writing. U.E. contributed to implementation of state-of-the-art methods, analysis, interpretation, and manuscript writing. All authors have read and agreed to the published version of the manuscript.
Funding: No funding to be reported.