IFSrNet: Multi-Scale IFS Feature-Guided Registration Network Using Multispectral Image-to-Image Translation

Abstract: Multispectral image registration is the process of spatially aligning two images with different intensity distributions. One of its main challenges is resolving the severe inconsistencies between the reference and target images. This paper presents a novel multispectral image registration network, the Multi-scale Intuitionistic Fuzzy Set Feature-guided Registration Network (IFSrNet), to address this problem. IFSrNet generates pseudo-infrared images from visible images using a Cycle Generative Adversarial Network (CycleGAN) equipped with a multi-head attention module. An end-to-end registration network encodes the input multispectral images with intuitionistic fuzzification, guided by an improved feature descriptor, the Intuitionistic Fuzzy Set–Scale-Invariant Feature Transform (IFS-SIFT). The network outputs the registered image directly. Specialised loss functions are also designed for this task. Experimental results demonstrate that IFSrNet outperforms existing registration methods on the Visible–IR dataset. IFSrNet has the potential to be employed as a novel image-to-image translation paradigm.


Introduction
The information derived from infrared and visible imaging is complementary: the former captures the intensity of thermal radiation emitted by the target, whereas the latter reflects the texture and contours of the target. Although multispectral images are captured concurrently within the same environment, disparities persist among them due to the diverse intensities, gradients, and structures associated with different wavelengths of light. Multispectral image registration [1] aims to establish the mapping relationship between the reference image and the image to be registered, achieving geometric calibration through various methods. The primary process involves selecting an appropriate image registration technique, extracting feature points, performing feature matching, and evaluating the registration accuracy. In the field of medical image diagnosis [2], the registration of images captured at different times or in different modalities can assist physicians in formulating more accurate treatment plans. In the context of remote sensing [3], the registration of multispectral images can provide detailed information about the ground. Additionally, this technology can be employed in monitoring the distribution and dispersion of environmental pollutants [4] and in precision agricultural management [5]. Research into multispectral image registration therefore has a wide range of possible applications.
The visible image is typically a composite image of the entire visible wavelength band (380-780 nm), whereas the infrared image involves separate imaging of a single band to capture the spectral information. Figure 1 depicts multispectral images captured at various wavelengths. Consequently, the radiometric difference between visible and infrared images is nonlinear. The proportion and repeatability of local feature points belonging to the same objects decrease across images of different bands. This elevates the mismatching rate of local features and compromises the quality and precision of image registration. The principal obstacle in multispectral image registration is the inconsistency in feature intensities between pairs of images, which precludes the direct registration of images from the same scene through the matching of feature descriptors. Leveraging Generative Adversarial Network (GAN)-based image-to-image translation [6] circumvents the intricate spectral difference issue and simplifies the registration process. However, owing to unstable data quality, it is challenging for a GAN to discern the part of interest from the vast array of features solely through fusion with a reference image. In essence, this prevents the model from generating images with accurate detail. Notably, the majority of current mainstream feature fusion techniques [7] rely on linear addition, as shown in Figure 2; their fitting capabilities require further enhancement.
CycleGAN [8] employs a cycle-consistency loss to facilitate the transformation of images from infrared to visible and vice versa. This bidirectional image-to-image mapping also ensures more accurate and complete preservation of the structural information of the object. Nevertheless, this bidirectional mapping is constrained in its ability to accommodate complex scenarios. This is because, regardless of whether the discriminator generates a score value or a score map for the input image, it merely provides an approximate score for the entire image. The judging process is too coarse because it fails to consider local features. Conversely, generating a score for each pixel in the image would be excessively detailed, making it challenging to maintain consistency in the judgement of both local and global features. Because visible images are rich in background information, it is not appropriate to employ either the entire image or individual pixels as a reference for details in the modal transformation. Given that fuzzy theory offers a solution to uncertain data aggregation, this study endeavours to refine the feature additivity assumption and incorporate the concept of intuitionistic fuzzy sets [9]. The IFS-SIFT feature maps contain richer details, as can be observed in Figure 3. This paper proposes a multi-scale [10] IFS feature-guided multispectral image registration method using image-to-image translation, termed IFSrNet. It performs a nonlinear fusion between features extracted at different scales and the reference to obtain relatively correct feature detail from the pseudo-images generated by the CycleGAN with a multi-head attention module. This paper constructs an end-to-end network [11], and the experiments demonstrate that the registration network exhibits comparable accuracy to other registration methods. The principal contributions of the proposed approach are as follows:
1. The paper uses pseudo-infrared (IR) images created from reference images to overcome the discrepancy between multispectral paired images. The training data and multi-head attention module facilitate the learning of the generative model, thereby enabling the utilisation of convolutional neural networks (CNNs) for image registration.
2. This paper introduces the concept of intuitionistic fuzzy set features, which serve as an extension of gradient information. Furthermore, the two channels of the target and reference images are integrated by a multi-scale concatenation.
3. A novel loss function is designed for the registration network to increase the weight of matching unambiguous feature information. This approach not only preserves the structural integrity of the generated images, but also mitigates the inherent limitation of generative models, which cannot be aligned pixel by pixel.

Related Work
In recent years, a number of methodologies have been developed to extract consistent features from disparate image modalities for image registration. Nunes et al. [12] developed a multispectral feature descriptor (MFD) to extract invariant gradient information in both the spatial and frequency domains via Log-Gabor filters. Gao et al. [13] constructed a partial principal orientation map to obtain robust orientation information, and simultaneously employed gradient location and orientation histogram (GLOH) descriptors to achieve intensity invariance. Furthermore, a number of methods for multimodal image registration based on deep features have been proposed. Xu et al. [14] adopted a coarse-to-fine approach to registration and employed image fusion techniques to facilitate multimodal image registration. Wei et al. [15] proposed RegiNet, a gradient-guided registration network for multispectral images based on a convolutional neural network; RegiNet is an end-to-end network that takes the gradient maps of both the target image and the reference image as inputs and generates the registered image as its output. Zhang et al. [16] proposed the histogram of weighted phase direction (HOWP), which is employed to reduce the discrepancy between multimodal contrasts.
The optical, geometric, and spatial features expressed by infrared and visible images are significantly different. Establishing spatial relationships between two or more points using convolutional neural networks is a challenging task. Registration methods that rely on image-to-image translation [17] can successfully map visible images to infrared images. Some attempts based on image-to-image translation [18,19] employed image generation techniques to transform visible images into infrared images. This approach provided training data and circumvented the issue of cross-modal matching. However, these methods still require traditional feature point extraction operators, such as Scale-Invariant Feature Transform (SIFT) [20], Speeded Up Robust Features (SURF) [21], and Partial Intensity Invariant Feature Descriptor (PIIFD) [22]. Kumari et al. [23] achieved the registration of infrared and visible light images by utilising a generative adversarial network equipped with a spatial transformer module. The adversarial loss compelled the generator to produce a pseudo-infrared image, which the discriminator then compared to the original infrared image to assess its realism. Mao et al. [24] leveraged the benefits of transfer learning to enhance feature matching. They used a parallel convolutional autoencoder to reconstruct images in both the visible and infrared branches before they are fed into the adversarial sub-network. Yang et al. [25] chose to harness the modal shifting capabilities of GAN to generate pseudo-infrared images from infrared images. They combined the SURF algorithm with the PIIFD feature descriptor to extract and generate image feature points.

IFS Feature Image
Infrared images are abundant in structural information. In the training of generative adversarial networks, it is imperative to separate out and eliminate distributional disparities while attending solely to spatial structural differences. This is crucial for simplifying the subsequent registration task. Traditionally, GAN-based methods strive to eradicate distributional differences by transforming images from the source domain into the target domain. Nevertheless, even though translated and target images may appear isospectral, residual distribution disparities can still be substantial. This renders the mean absolute error (MAE) [26] or mean squared error (MSE) [27] unsuitable for optimising the registration network.
To tackle this issue, this paper devises an intuitionistic fuzzy feature image grounded in gradient information. This preliminary design is intended to address the inconsistency in the output image of the GAN model. It amplifies the impact of the large gradient direction and reduces the noise in the small gradient direction. Intuitionistic fuzzy sets (IFS) represent the most significant expansion and development of fuzzy set theory. Fuzzy sets can be used to describe the concept of 'both positive and negative'. Intuitionistic fuzzy sets add non-membership and hesitancy to membership, which can be used to describe the neutral state of 'neither one nor the other'. In the application of IFS, the determination of the three fuzzy description measures is crucial. This is an active and difficult research topic and one of the main contributions of this paper. The feature vector of the $i$th feature of image $R$ is denoted by SIFT [28] as $R_i = (\varphi^i_1, \varphi^i_2, \ldots, \varphi^i_8)$, and the $j$th feature vector of image $S$ is represented by IFS-SIFT as $S_j = (\mu^j_1, \mu^j_2, \mu^j_3)$. The eight feature vectors $\varphi^i_k$ presented here are derived from the design in SIFT. As there are eight neighbouring pixels to the target pixel, the gradient between each neighbouring pixel and the target is defined as a feature vector in the direction of that gradient. Although multi-dimensional descriptors may be more robust in labelling features, some directions with small gradient differences are not worthy of consideration in this task. Furthermore, such descriptors place a significant computational burden on the network. For the $k$th moment $\varphi^i_k$ ($1 \le k \le 8$) and $\mu^j_k$, this article finds the direction with the highest magnitude in $\varphi^i_k$, designated as the direction of membership. To calculate the membership $\mu^j_1$, the magnitudes $\varphi^i_{k+1}$ and $\varphi^i_{k-1}$ of the two directions with the smallest adjacent angle are added to that of the main direction. The proportion of this sum to the total gradient magnitude of all eight directions gives the membership of the main direction. The non-membership degree $\mu^j_2$ is the corresponding proportion of the gradient magnitudes in the three opposite directions. The directions perpendicular to the main direction are the hesitancy directions, which give $\mu^j_3$. The membership $S_j(\mu^j_1)$, non-membership $S_j(\mu^j_2)$, and hesitancy $S_j(\mu^j_3)$ of $\mu^j_k$ in $S_j$ are defined accordingly; the idea of the computation is inspired by [29].
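The three measures can be written compactly as follows. This is a reconstruction from the verbal description only, not the paper's own equations: it assumes that the eight directions are indexed modulo 8 at 45° spacing, that $k^*$ denotes the index of the main direction, and that $\Phi^i$ (a symbol introduced here for illustration) is the total gradient magnitude.

```latex
\begin{aligned}
k^* &= \arg\max_{1 \le k \le 8} \varphi^i_k, \qquad
\Phi^i = \sum_{k=1}^{8} \varphi^i_k, \\
\mu^j_1 &= \frac{\varphi^i_{k^*-1} + \varphi^i_{k^*} + \varphi^i_{k^*+1}}{\Phi^i}
  && \text{(membership: main direction and its two neighbours)}, \\
\mu^j_2 &= \frac{\varphi^i_{k^*+3} + \varphi^i_{k^*+4} + \varphi^i_{k^*+5}}{\Phi^i}
  && \text{(non-membership: the three opposite directions)}, \\
\mu^j_3 &= \frac{\varphi^i_{k^*+2} + \varphi^i_{k^*+6}}{\Phi^i}
  && \text{(hesitancy: the two perpendicular directions)}.
\end{aligned}
```

Under this reading the eight directions are partitioned into groups of three, three, and two, so that $\mu^j_1 + \mu^j_2 + \mu^j_3 = 1$, which matches the usual constraint on an intuitionistic fuzzy triple.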
The IFS-SIFT feature descriptor reduces computational complexity compared to the SIFT feature descriptor while maintaining high accuracy and accounting for the correlation of neighbouring pixels. When defining the direction of key points in SIFT, the gradient direction and magnitude are counted for all pixels within a circle centred on the feature point, with a radius of 1.5 times the scale of the Gaussian image in which the feature point is located. The 8-direction histograms of the key points are obtained by Gaussian filtering. When constructing the key point descriptor, the region is divided into 4 × 4 sub-blocks. The gradient magnitude of each direction is obtained by computing an 8-direction histogram for each sub-block, resulting in a 128-dimensional descriptor vector. The IFS-SIFT feature descriptor instead uses 48-dimensional vectors (three measures for each of the 4 × 4 sub-blocks) and has the advantage of reflecting the correlation between adjacent pixels. As shown in Figure 4, although gradient maps are adept at conveying information regarding the edges of an image, disparities in residual distributions can increase the model's sensitivity to local features. Given the high performance of infrared images in the transmission of structural information, structural features should be accorded more attention in subsequent registration tasks. Intuitionistic fuzzy feature maps, which place more emphasis on global structural information, exhibit significant potential.
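The reduction from 128 to 48 dimensions can be illustrated with a minimal sketch. It assumes that each of the 4 × 4 sub-blocks contributes one (membership, non-membership, hesitancy) triple computed from its 8-direction histogram, with directions indexed modulo 8; the function names `ifs_triple` and `ifs_sift_descriptor`, and the handling of a zero-gradient block, are hypothetical choices made for illustration rather than the authors' implementation.

```python
import numpy as np

def ifs_triple(phi):
    """Map one 8-direction gradient histogram to (membership, non-membership, hesitancy)."""
    phi = np.asarray(phi, dtype=float)
    if phi.shape != (8,):
        raise ValueError("expected an 8-direction histogram")
    total = phi.sum()
    if total <= 0.0:
        # Assumption: a flat block carries no gradient evidence, so it is fully hesitant.
        return np.array([0.0, 0.0, 1.0])
    k = int(np.argmax(phi))  # main (membership) direction

    def pick(offsets):
        # Sum the magnitudes at the given offsets from the main direction (modulo 8).
        return phi[[(k + o) % 8 for o in offsets]].sum()

    membership = pick((-1, 0, 1)) / total      # main direction and its two neighbours
    non_membership = pick((3, 4, 5)) / total   # the three opposite directions
    hesitancy = pick((2, 6)) / total           # the two perpendicular directions
    return np.array([membership, non_membership, hesitancy])

def ifs_sift_descriptor(histograms):
    """Concatenate the triples of 16 sub-block histograms into a 48-dimensional vector."""
    histograms = np.asarray(histograms, dtype=float)
    if histograms.shape != (16, 8):
        raise ValueError("expected 4 x 4 = 16 sub-block histograms with 8 bins each")
    return np.concatenate([ifs_triple(h) for h in histograms])
```

For a SIFT-style input of shape (16, 8), i.e. 128 values, `ifs_sift_descriptor` returns 48 values, and each consecutive triple sums to one.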

Model Frame
The paper builds an end-to-end network that connects the features of an IR image with those of a pseudo-IR image to perform a registration transform on the IR image. Figure 5 presents the overall architectural framework of the network. Pseudo-infrared images are produced from visible images by CycleGAN [30]. Since the image-to-image translations involved in this paper are global cross-modal transformations, the sensitivity of the generative model to global information is crucial. The attention [31] mechanism is capable of mining long-range dependencies in order to obtain global information through global interaction. This approach is more advantageous for high-level semantic feature extraction. Specifically, this paper adopts the multi-head attention [32] mechanism proposed in the Transformer [33], whereby the three module inputs are passed through a convolutional block to obtain Q, K, and V. Subsequently, Q, K, and V are used to compute multi-head attention, which yields a more comprehensive feature representation. The generator in the modified CycleGAN framework is a standard encoder-decoder network composed primarily of a deep feature extraction module and an image reconstruction module. This generator is capable of producing high-quality images with rich detail. The U-shaped discriminator network, depicted in Figure 6, incorporates a multi-head attention module that performs true-false judgements for different patch sizes. This differs from the conventional approach of making a single global judgement over the whole generated image. It is accomplished by fusing the encoder and decoder output feature maps of varying spatial dimensions and passing them through distinct convolutional layers, thereby generating a single-channel feature map comprising patches at three distinct resolutions. This design improves classification accuracy. With regular updates, the generator can better simulate the distribution of real data, thereby generating images of superior quality.
The default input and output image size is 512 × 512, but other sizes can also be supported. Rectified Linear Unit (ReLU) activation functions follow each convolutional layer except the last, which produces the output. Skip connections are added to the network and combined by concatenation. The specific parameters of the registration network are listed in Table 1. The IFSrNet input consists of two branches that extract the features of the target image and the intuitionistic fuzzy set feature map of the reference image. Algorithm 1 presents the pseudo-code for model training. Dropout has proven effective in numerous deep learning vision tasks at addressing overfitting and weak generalisation in complex models. The Dropout mechanism randomly disables some units in a network, generating multiple sub-networks. This approach is intended to alleviate the overfitting problem and enhance the generalisation performance of the network. In this network, a dropout layer is incorporated after the activation layer, and distinct dropout parameters are set to perform the registration training. The quality of the registered images output from the network is evaluated by Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM) in order to identify the optimal Dropout parameters. The Dropout probabilities initially tested are 0.1, 0.2, and 0.3.
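The multi-head attention computation on Q, K, and V can be sketched as below. This is an illustrative NumPy version under stated assumptions, not the network's actual module: the feature map is assumed to be flattened to a (tokens, channels) matrix, the convolutional block that produces Q, K, and V is replaced by plain matrix projections (equivalent to 1 × 1 convolutions), the final output projection is omitted, and all names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, w_q, w_k, w_v, num_heads):
    """Scaled dot-product attention over several heads.

    x: (tokens, channels) flattened feature map.
    w_q, w_k, w_v: (channels, channels) projection matrices.
    Returns the attended features (tokens, channels) and the
    attention weights (heads, tokens, tokens).
    """
    tokens, channels = x.shape
    if channels % num_heads != 0:
        raise ValueError("channels must be divisible by num_heads")
    head_dim = channels // num_heads

    def split(t):
        # (tokens, channels) -> (heads, tokens, head_dim)
        return t.reshape(tokens, num_heads, head_dim).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, tokens, tokens)
    weights = softmax(scores, axis=-1)
    out = weights @ v                                       # (heads, tokens, head_dim)
    # Merge the heads back into a single channel dimension.
    return out.transpose(1, 0, 2).reshape(tokens, channels), weights
```

Because every token attends to every other token, each output row can draw on the whole image, which is the global interaction that the generator and discriminator rely on here.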

Design of Loss Function
This paper proposes a novel approach to the loss function. During the training of the registration network, no additional reference images are used to calculate the loss of the model; the original multispectral images and the generated images are the main components of the loss. The similarity measure between the reference image and the image to be registered is derived by calculating the inter-feature distance. Current research on distance is primarily concerned with the calculation of feature similarity, as in SSIM [34]. Intuitionistic fuzzy sets, however, also possess the capacity to describe dissimilarity. Following this concept, this paper employs loss functions to constrain the membership and non-membership between images, so that both similarity and dissimilarity are taken into account when calculating the distance between features. The specifics of the loss design originate from [9]. Suppose the intuitionistic fuzzy set is $A = \{\langle x, \theta_A(x), \nu_A(x), \xi_A(x)\rangle \mid x \in X\}$, where the membership $\theta_A(x)$, non-membership $\nu_A(x)$, and hesitancy $\xi_A(x)$ together form an ordered interval pair. The intuitionistic fuzzy distance measures the distance between features and defines the matching distance for $R_i$ and $S_j$, where $N$ is the dimension, $\omega$ is the normalised weighting factor, and $\sum_{k=1}^{N} \omega_k = 1$. Since the stability of the images to be registered is not uniform across the invariant moments, the weighting coefficients can be adjusted for moments with large changes in amplitude. $\delta(R_i, S_j)$ is the mismatch distance between features $R_i$ and $S_j$, which is defined with respect to hesitancy.
The total loss function is defined as follows:
$L_{total} = \alpha L_{membership} + \beta L_{non\text{-}membership} + \gamma L_{hesitancy}$ (7)
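The weighted combination in Equation (7) can be illustrated with a short sketch. The individual distance terms are an assumption here: the sketch substitutes a commonly used weighted Hamming-style distance between intuitionistic fuzzy triples (a weighted mean absolute difference per component) and should not be read as the paper's formulation. The function names, the uniform default weights, and the default values of α, β, and γ are likewise hypothetical.

```python
import numpy as np

def ifs_component_losses(r, s, weights=None):
    """Weighted absolute differences between two sets of IFS triples.

    r, s: (N, 3) arrays whose columns are membership, non-membership, hesitancy.
    weights: optional (N,) normalised weights that sum to 1 (uniform if omitted).
    Returns (L_membership, L_non_membership, L_hesitancy).
    """
    r = np.asarray(r, dtype=float)
    s = np.asarray(s, dtype=float)
    if r.shape != s.shape or r.ndim != 2 or r.shape[1] != 3:
        raise ValueError("r and s must both have shape (N, 3)")
    n = r.shape[0]
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    if w.shape != (n,) or not np.isclose(w.sum(), 1.0):
        raise ValueError("weights must have shape (N,) and sum to 1")
    l_mem, l_non, l_hes = w @ np.abs(r - s)  # one weighted sum per column
    return float(l_mem), float(l_non), float(l_hes)

def total_loss(r, s, alpha=1.0, beta=1.0, gamma=1.0, weights=None):
    """L_total = alpha * L_membership + beta * L_non_membership + gamma * L_hesitancy."""
    l_mem, l_non, l_hes = ifs_component_losses(r, s, weights)
    return alpha * l_mem + beta * l_non + gamma * l_hes
```

Raising a weight for a moment with large amplitude changes, as described above, amounts to passing a non-uniform `weights` vector while keeping its sum equal to one.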

Experiments

Experiment Settings
The TNO [35] multispectral image dataset and an expanded version based on sample modification were utilised in this study. To expand the training dataset, overlapping cropping and multi-angle rotation were applied manually during augmentation. The final augmented dataset for the experiments comprises nearly 90,000 infrared and visible image pairs. Figure 7 illustrates some of the augmented data samples. The image pairs are randomly divided into three groups: the training set, the validation set, and the test set. The size and parameter information for each of the datasets can be found in Table 2. The pseudo-IR images input to the registration network are generated by the generative model in the initial stage. The comparison in Figure 8 shows that the trained generative model used in this paper can output high-quality pseudo-IR images, which provides a basis for the subsequent image registration. For the image-to-image translation network, the batch size is 2. The entire CycleGAN model is optimised using the Adam optimiser [36], with the learning rates for the discriminator and generator set to 0.0004 and 0.0001, respectively. The model is trained for 200 epochs, with the learning rate remaining fixed for the first 100 epochs. Subsequently, from the 150th epoch onward, the learning rate gradually decays to 0. For the registration network, the model is trained for 100 epochs with a mini-batch size of 16. The study was conducted on a Windows 10 operating system, using a desktop computer equipped with 32 GB of memory, a Core i7-10700K CPU, and an NVIDIA RTX 3090 GPU. The experimental framework was TensorFlow 2.12.0 with CUDA 12.1, Python 3.11, and cuDNN 11.2.
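A hold-then-decay learning-rate schedule of the kind described for the CycleGAN can be sketched as follows. The decay is assumed to be linear, which the text does not state explicitly, and the epoch at which decay begins is exposed as the parameter `decay_start` so that either boundary mentioned above can be tried; the function name and defaults are hypothetical.

```python
import math

def learning_rate(epoch, base_lr, total_epochs=200, decay_start=100):
    """Keep base_lr until decay_start, then decay linearly to 0 at total_epochs."""
    if not 0 <= decay_start < total_epochs:
        raise ValueError("decay_start must lie in [0, total_epochs)")
    if epoch < decay_start:
        return base_lr
    remaining = max(total_epochs - epoch, 0)
    return base_lr * remaining / (total_epochs - decay_start)

# Discriminator and generator rates taken from the settings above.
for name, lr in (("discriminator", 4e-4), ("generator", 1e-4)):
    schedule = [learning_rate(e, lr) for e in (0, 100, 150, 200)]
    assert math.isclose(schedule[-1], 0.0, abs_tol=1e-12)
    print(name, schedule)
```

In a training loop the returned value would be assigned to the optimiser at the start of every epoch, once for the generator and once for the discriminator.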
To demonstrate the performance of the method proposed in this paper, the experiment compares its registration accuracy and the magnitude of network parameters against several existing registration methods, including DASC [37], DHN [38], MHN [39], NTG [40], SMILE [41], and MURF [42].

Evaluation Metrics
Since subjective evaluation is strongly influenced by the individual observer, quantitative metrics provide a more objective and uniform evaluation of image registration results. The following metrics are used: MAE, PSNR [43], Normalised Mutual Information (NMI) [44], SSIM, and Learned Perceptual Image Patch Similarity (LPIPS). MAE is the mean of the absolute errors between the predicted and observed values. PSNR is an image quality reference value that measures the ratio between the maximum signal power and the power of the background noise. The greater the PSNR value, the less image distortion is present. In this paper, NMI is employed to assess the accuracy of an image in comparison with the ground truth. NMI ranges between 0 and 1, with higher values denoting greater accuracy in image registration. SSIM is a quantitative measure of the structural similarity between two images. SSIM values lie between 0 and 1, with higher values indicating greater structural similarity. LPIPS, also known as perceptual loss, measures the difference between two images. The metric facilitates learning the inverse mapping from a generated image to the ground truth, thereby compelling the generator to learn the inverse mapping from a reconstructed real image to a pseudo-image and to prioritise the perceived similarity between them. A lower LPIPS value indicates that the two images are more similar, and vice versa.
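Three of these metrics are simple enough to sketch directly. The versions below are illustrative only: PSNR assumes 8-bit images unless `max_value` is changed, and NMI is computed from a joint histogram with the normalisation 2·I(A;B) / (H(A) + H(B)), chosen here because it yields values in the stated range of 0 to 1; the paper does not specify which NMI variant or bin count it uses, so both are assumptions.

```python
import numpy as np

def mae(a, b):
    """Mean absolute error between two images of equal shape."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.mean(np.abs(a - b)))

def psnr(a, b, max_value=255.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    mse = np.mean((a - b) ** 2)
    if mse == 0.0:
        return float("inf")
    return float(10.0 * np.log10(max_value ** 2 / mse))

def nmi(a, b, bins=32):
    """Normalised mutual information 2*I(A;B) / (H(A) + H(B)) from a joint histogram."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    joint, _, _ = np.histogram2d(a.ravel(), b.ravel(), bins=bins)
    pxy = joint / joint.sum()
    px, py = pxy.sum(axis=1), pxy.sum(axis=0)

    def entropy(p):
        p = p[p > 0]  # ignore empty bins, since 0 * log 0 is taken as 0
        return float(-(p * np.log(p)).sum())

    hx, hy, hxy = entropy(px), entropy(py), entropy(pxy.ravel())
    if hx + hy == 0.0:
        return 1.0  # both images are constant, hence trivially identical in information
    return 2.0 * (hx + hy - hxy) / (hx + hy)
```

SSIM and LPIPS are omitted from the sketch: SSIM depends on windowing choices and LPIPS on a pretrained network, neither of which is specified in the text.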

IFS-SIFT Validation
To compare the results before and after the integration of the multi-scale IFS-SIFT feature, the experiment employed an identical generator and discriminator with the same loss function as the optimisation target. All parameters and environmental settings were identical. The results of the image registration are presented in Figure 9, where the disparities in registration quality are displayed more intuitively through image fusion. The comparison shows that registration results guided by multi-scale IFS-SIFT features are closer to the ground truth. This is because multi-scale IFS-SIFT features enhance the neural network's feature representations and broaden its receptive field over the input. Fine-grained features at lower scales capture local details, while high-scale features capture global semantic information. Additionally, the multi-scale IFS-SIFT features share weights among convolutional layers and reduce feature dimensions, resulting in a significant reduction in network parameters. This implies that the registration results remain unaffected even with reduced memory usage, thereby lowering the hardware requirements for the model.

Visual Comparison
The registration results of visible and IR images using DASC, DHN, MHN, NTG, SMILE, MURF, and the network proposed in this paper are presented in Figure 10. DASC employs DSC+GAN, which is the first successful use of GAN for unsupervised clustering. However, it achieves low accuracy in multispectral image registration due to the presence of projection residuals in the generative model subspace. DHN has a fast registration speed, which is achieved by obtaining the homography as an intermediate variable through the neural network. This eliminates the need to separate feature point detection from transform estimation, as is done in traditional methods that use techniques such as ORB for corner detection and RANSAC [45] for matrix estimation. Denoising is added as a preprocessing step before homography estimation. However, this can result in the loss of feature information for noisy IR images, which can severely impact the registration accuracy. MHN is capable of handling large global motions and providing homography matrix estimation results, making it more suitable for image registration tasks that involve dynamic scenes, blurred scenes, or a lack of texture. The NTG method for multispectral image registration is based on the principle that the gradient of the difference image is sparsest when two images are perfectly registered. However, it may not be stable enough when dealing with dynamic and non-rigid objects. The SMILE method typically requires an initial registration estimate as a starting point. If the initial estimate is inaccurate, subsequent processing steps deviate from the correct direction, to the extent that the results of each experiment differ significantly. The MURF algorithm employs a coarse-to-fine registration strategy, whereby both global rigid and local non-rigid transformations are considered. However, the registration network may become less accurate if the input image contains serious noise, artefacts, or distortion. To display the registration results, the registered and reference images are interleaved in a checkerboard diagram. As shown in the figure, the method proposed in this paper achieves superior or comparable registration accuracy to other methods.

Quantitative Assessments
Table 3 presents the mean performance value of seven different algorithms on the same test set for each metric, allowing for quantitative analysis of registration. The best result for each evaluation metric is highlighted in red, while the second best is highlighted in blue. The study tested thirty sets of multispectral images using five registration algorithms and four evaluation metrics. Figure 11a,b show that DASC has the highest MAE, indicating that it is not well suited to the dataset. On the other hand, IFSrNet achieves better MAE and SSIM values than the other algorithms, indicating its superior registration accuracy on this dataset. Figure 11c shows that the proposed algorithm scores slightly lower than the others at individual points.
Noise interference during the encoding process, particularly in image pairs with intricate modal features, may account for this difference. However, as shown in Figure 11d, the proposed algorithm successfully preserves the intricate details of the image by using multi-scale IFS feature guidance, which enhances the robustness of image registration. It is worth noting that this approach achieves the highest NMI. Furthermore, IFSrNet also demonstrates strong performance in the comparison of inter-image perceptual similarity, as illustrated in Figure 11e, where it is comparable to MURF. In contrast to traditional metrics, LPIPS is more closely aligned with human perception. The efficiency of the models is also evaluated, with particular attention paid to parameter size and run speed. The experiments ensure that all models are run in the same environment and on the same equipment. Table 4 shows that IFSrNet has an advantage in the number of network parameters but only average inference time. If the network is equipped with GPU acceleration, there is potential for much higher computational efficiency.

Rotation Robust Experiment
To determine the effect of training on images with different rotation angles on the registration network's performance, three sets of images were randomly selected from the dataset. Each set of images was rotated in 25° intervals from 0° to 75°, resulting in three rotated images per group. Nine new IFS feature matching maps were generated and used to evaluate the algorithm's robustness under rotation. Simultaneously, the image to be registered was rotated in 1° steps up to 75°. The similarity index was calculated as the cosine distance between the features of the rotated image and the reference features. The experimental results are presented in Figure 12a, while Figure 12b shows the number of positive matches (NPM) of the four groups of rotated images. The matching results on individual instances are illustrated in Figure 13. The accuracy and NPM depend on the specific distortion angle of the floating image, so variations in the distortion angle change both. The analysis indicates that the algorithm maintains good feature similarity over a range of angles because the network learns invariance to small rotations during training, which is inseparable from the rotational invariance of SIFT feature descriptors. However, for rotation angles exceeding 25°, the NPM decreases substantially and the cosine distance between features increases significantly.
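The similarity index used in this experiment can be sketched in a few lines. The sketch assumes the common definition of cosine distance as one minus the cosine similarity of the flattened feature vectors; the function name and the choice to reject zero-norm inputs are hypothetical.

```python
import numpy as np

def cosine_distance(u, v, eps=1e-12):
    """Cosine distance 1 - cos(u, v) between two feature vectors of equal size."""
    u = np.asarray(u, dtype=float).ravel()
    v = np.asarray(v, dtype=float).ravel()
    if u.shape != v.shape:
        raise ValueError("feature vectors must have the same number of elements")
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom < eps:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - float(u @ v) / denom
```

A distance of 0 means the two feature vectors point in the same direction, so a curve of this value against rotation angle rises as the rotated features drift away from the reference features.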

Ablation Study
To assess the impact of the loss function on the registration network's performance, this paper conducts an ablation study using the traditional similarity metric SSIM, the single membership loss, and the full IFS loss. As a comparison, the SSIM loss [46] $L_{spatial}$ is defined as follows:
$L_{spatial} = 1 - \mathrm{SSIM}(h, p), \quad \mathrm{SSIM}(h, p) = \frac{(2\mu_h \mu_p + c_1)(2\sigma_{hp} + c_2)}{(\mu_h^2 + \mu_p^2 + c_1)(\sigma_h^2 + \sigma_p^2 + c_2)}$
where $h$ represents the high-resolution multispectral image obtained by the network, $p$ represents the original panchromatic image, $n$ represents the number of pixels in the image block, $\mu_h$ and $\mu_p$ are the means of $h$ and $p$, $\sigma_h^2$ and $\sigma_p^2$ are their variances, and $\sigma_{hp}$ is the covariance of $h$ and $p$. $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ are two constants that avoid division by zero, $L$ is the range of pixel values, and $k_1 = 0.01$ and $k_2 = 0.03$ are default values. The SSIM value ranges from −1 to 1, and the closer the SSIM value of two images is to 1, the more similar the images are. Because the network is optimised in the direction of smaller loss values, the texture loss of the image is calculated using $1 - \mathrm{SSIM}(h, p)$.
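A minimal sketch of the 1 − SSIM(h, p) loss is given below. It assumes the standard SSIM expression built from the means, variances, and covariance named in the text, and it computes those statistics over the whole image block at once instead of over sliding windows, which is a simplification; the function names and the 8-bit default for the pixel range are hypothetical.

```python
import numpy as np

def ssim_block(h, p, pixel_range=255.0, k1=0.01, k2=0.03):
    """SSIM of two image blocks using statistics over the whole block."""
    h = np.asarray(h, dtype=float)
    p = np.asarray(p, dtype=float)
    if h.shape != p.shape:
        raise ValueError("image blocks must have the same shape")
    c1 = (k1 * pixel_range) ** 2  # stabilises the luminance term
    c2 = (k2 * pixel_range) ** 2  # stabilises the contrast/structure term
    mu_h, mu_p = h.mean(), p.mean()
    var_h, var_p = h.var(), p.var()
    cov_hp = ((h - mu_h) * (p - mu_p)).mean()
    numerator = (2.0 * mu_h * mu_p + c1) * (2.0 * cov_hp + c2)
    denominator = (mu_h ** 2 + mu_p ** 2 + c1) * (var_h + var_p + c2)
    return float(numerator / denominator)

def ssim_loss(h, p, pixel_range=255.0):
    """Loss that decreases as structural similarity increases: 1 - SSIM(h, p)."""
    return 1.0 - ssim_block(h, p, pixel_range=pixel_range)
```

Identical blocks give a loss of 0, while a block and its intensity-inverted copy have negative covariance and therefore a loss greater than 1.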
Table 5 presents the performance of three loss functions: $L_{spatial}$, $L_{membership}$, and $L_{total}$. The results indicate that, although the membership loss outperforms SSIM in terms of similarity, there is room for improvement. When only $L_{spatial}$ is used, the registration network's NMI falls below 0.2. This highlights the difficulty of capturing the correlation between multispectral paired images when relying solely on the structural similarity loss. The best solution to this problem is to use the complete IFS loss. The data presented in red font indicate that $L_{total}$ performs best on all evaluation measures.
The transformation of multispectral image modalities is challenging in the absence of labelling. In practice, CycleGAN is a suitable approach for transformation tasks involving images with texture and colour variations. However, the estimated spatial error of GAN and CycleGAN cannot exclude the effect of the residual distribution between the moving and target images. In contrast, CycleGAN embedded with a multi-head attention mechanism can avoid the larger spatial error. To demonstrate the potential of combining CycleGAN and multi-head attention for the transformation of multispectral images, this paper explores the correlation between the model output and the Dice coefficient. The mean of the model output is computed over the joint region of the moving and target masks. Subsequently, the mean absolute value is subtracted from 1 to obtain a normalised value indicative of the generation performance of the model. The generative performance of the GAN and CycleGAN models is calculated in the same way. Figure 14 provides clear evidence of a positive correlation between the proposed model and Dice, thereby confirming the potential of combining CycleGAN and multi-head attention to accurately transform the image modality. By contrast, the GAN and CycleGAN show no significant positive correlation with Dice.
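The two quantities that are correlated in this analysis can be sketched as follows. The Dice coefficient uses its standard definition for binary masks. The normalised generation score follows the verbal description, with two assumptions that the text leaves open: the "joint region" is taken to be the union of the moving and target masks, and the model output is taken to be an error map scaled to [0, 1]. The function names are hypothetical.

```python
import numpy as np

def dice(mask_a, mask_b):
    """Dice coefficient 2|A ∩ B| / (|A| + |B|) of two binary masks."""
    a = np.asarray(mask_a, dtype=bool)
    b = np.asarray(mask_b, dtype=bool)
    size_sum = a.sum() + b.sum()
    if size_sum == 0:
        return 1.0  # two empty masks overlap perfectly by convention
    return float(2.0 * np.logical_and(a, b).sum() / size_sum)

def generation_score(output, moving_mask, target_mask):
    """1 - mean(|output|) over the union of the masks (union is an assumption)."""
    region = np.logical_or(np.asarray(moving_mask, dtype=bool),
                           np.asarray(target_mask, dtype=bool))
    if not region.any():
        raise ValueError("the joint region of the two masks is empty")
    values = np.abs(np.asarray(output, dtype=float))[region]
    return float(1.0 - values.mean())
```

Given one score and one Dice value per image pair, their correlation can then be estimated with `np.corrcoef(scores, dice_values)[0, 1]`.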
To ascertain the impact of the training set on the registration network, three datasets were established for separate experiments: the TNO native image set alone, the manufactured images after enhancement alone, and the hybrid dataset. As shown in Table 6, the highest mean number of matches among the three strategies is 286.8. The hybrid training set not only overcomes the insufficient size of the native dataset, but also ensures the number of features that can be matched.
In the field of image registration, research on multi-scale feature guidance has concentrated on gradient-based edge detection for features at each stage. To analyse the effectiveness of the multi-scale IFS feature map guidance, the experiment altered the type of multi-scale features without modifying the skeleton of the registration network. The application of deep learning to edge detection enables the DexNet and BDCN models to learn the singularity of edge information. Nevertheless, the failure to recognise weak edges may result in an uneven detection of features. The results of the ablation can be found in Table 7.
Sobel edge mapping describes changes in gradient and retains more edge information than Canny and Laplace edge mapping. IFS feature descriptors are better at handling the biases caused by the generative model, and their efficiency gap with Sobel is small.
The consistency between the matches identified by the registration method and the set of ground truth matches is verified, and pairs of true positive (TP), false positive (FP), and false negative (FN) matches are defined to calculate the precision and recall scores. The sum of TP and FN denotes all correct matches from putative feature points. Consequently, the values of precision and recall can be calculated as follows:
$$\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad \mathrm{Recall} = \frac{TP}{TP + FN}$$
The larger the precision and recall are, the more accurate the feature matching and the stronger the adaptability of the registration method to distinguish feature points, respectively. The F1-score is the harmonic mean and summary statistic of precision and recall, which can be calculated as follows:
$$F1 = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
The total summation of TP points from all image pairs in the constructed datasets is calculated as the number of correct matches, which is denoted as Summation of TP [34].
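The matching metrics above can be computed directly from the TP, FP, and FN counts. The sketch below is a minimal illustration: the function names and the per-pair counts are hypothetical, and returning 0 when a denominator is zero is a convention chosen here rather than something specified in the paper.

```python
def matching_scores(tp, fp, fn):
    """Precision, recall and F1-score from match counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

def summation_of_tp(per_pair_counts):
    """Total number of correct matches over all image pairs."""
    return sum(tp for tp, _fp, _fn in per_pair_counts)

# Hypothetical (TP, FP, FN) counts for two image pairs.
counts = [(40, 10, 10), (30, 0, 30)]
for tp, fp, fn in counts:
    p, r, f1 = matching_scores(tp, fp, fn)
    print(f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
print("Summation of TP:", summation_of_tp(counts))
```

For the first hypothetical pair, precision and recall are both 0.8, so the F1-score is also 0.8. For the second pair, perfect precision with a recall of 0.5 gives an F1-score of about 0.67, which shows how the harmonic mean penalises an imbalance between the two.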

Applications
The feasibility of applying IFSrNet to other multimodal image registration tasks has been tested on medical images [47]. Chemical exchange saturation transfer (CEST) is a magnetic resonance imaging (MRI) technique for enhancing image contrast, which indirectly identifies the metabolites in tissue at millimolar concentrations through the water proton signal. Because the samples must have a sufficient saturation frequency, a long acquisition time is usually required to acquire the spectrum. Subject movement during the scan can lead to errors in CEST quantification. Even minor movements can have a significant impact on CEST analysis, resulting in unusual peaks or dips in the spectrum and an uneven signal distribution on the image. Image registration is commonly employed to mitigate motion artefacts and ensure high-quality CEST-MRI images. Figure 15 shows the application of IFSrNet for registering T1 and Gd image pairs in CEST-MRI images of a rat brain. The lack of an objective gold standard for assessing the registration results of medical images means that the physician's judgement is crucial. After observation by senior experts, IFSrNet was judged to reconstruct the target images accurately, which illustrates the potential of the network's application. All the CEST-MRI images in this experiment were obtained from Johns Hopkins University.

Discussion
The image-to-image translation registration strategy effectively circumvents the complex multimodal issue while simultaneously reducing the difficulty of registration. The approach proposed in this paper requires the preparation of multimodal data for the training of the generator. When encountering previously unseen data, the performance of the trained model will inevitably decline, although the present network has been designed to mitigate this effect. Furthermore, the method in this paper introduces a small amount of bias and variance in the process of remapping the data distribution between different modalities, and subsequent work is required to reduce these negative effects. In future work, the authors will endeavour to extend this study to other modal image registration tasks beyond multispectral images. Additionally, a lightweight version of the network is a promising avenue for further investigation.
The precise registration of images is a fundamental precondition for multispectral image processing. Related extended work encompasses the fusion of image information, the localisation of targets, the detection of changes, and the reconstruction of high-resolution images. Image registration facilitates multispectral image fusion, because it allows the complementary information contained in each image dataset to be combined into a more comprehensive and accurate image. Image registration also benefits change detection, which enhances the accuracy of target positioning and reveals alterations in features or the ground surface by comparing image disparities at different times or under different circumstances in remote sensing operations. In high-resolution image reconstruction, the aim is to combine multiple low-resolution images into a single high-resolution image by means of registration, thereby enhancing the clarity and detail of the image.

Conclusions
In this paper, a Multi-scale IFS Feature-guided Registration Network Using Multispectral Image-to-image Translation is proposed. To tackle the challenge of matching infrared images with visible images during image registration, a pseudo-infrared image is generated from visible images using CycleGAN equipped with a multi-head attention module. The generative models can be interchanged between the two modalities due to the bidirectional feedback mechanism of CycleGAN, which allows for the transfer of information between the two domains. This approach reduces the difficulty of modal feature extraction by addressing the large modal differences between the two types of images. To avoid the problem that the generated images cannot achieve pixel-by-pixel correspondence with the ground truth, the feature vectors of the reference are first extracted by the improved robust IFS-SIFT feature descriptor in the case of scale transformation. Secondly, this paper establishes an end-to-end registration network model and designs a loss function that incorporates multi-scale feature guidance. The experiments demonstrate that modality-transformed infrared spectral information is effective in extracting both structural and textural features from the image. IFS-SIFT demonstrates superior performance in extracting multi-scale feature vectors, and the established registration model is robust in estimating the transformation despite the presence of discrete points in the reference. In comparison to other algorithms, the proposed model primarily focuses on addressing the challenges posed by significant scale differences between infrared and visible images, as well as the intricacies involved in image registration. In the future, this work can be further extended to encompass tasks such as image fusion and image enhancement with GAN.

Figure 1. Multispectral images. The images are presented in the following order from left to right: (a) RGB, (b) panchromatic, (c) NIR, and (d) LWIR image. Different numbers of channels and wavelengths are the criteria for classifying these images.

Figure 2. GAN-based methods can only ensure that the distribution of domain A corresponds to the distribution of domain B. Ideally, the two domains should be able to feed into each other.

Figure 3. (a) Visible image as a reference. (b) Infrared image to be registered. (c) IFS-SIFT feature image. (d) Infrared image that has been registered.

Figure 4. Results of feature maps in the registration network. The first row (a-d) shows the reference and the gradient maps of 60, 20, and 5 epochs, respectively. The second row (e-h) shows the opposite reference and the IFS maps for the same epochs.

Figure 5. The network architecture of IFSrNet. IFSrNet is an end-to-end network that utilises pseudo-IR images created by the generative model as the reference, which are fed into the registration network along with the target image to be registered.

Figure 6.Architecture of the multi-headed discriminator of the proposed IFSrNet.

Algorithm 1. Registration network training procedure
Initialise the input size i, the kernel size k, the stride s, and the output size o.
Initialise model parameters according to pre-training settings;
Input: training multispectral image dataset D;
Output: trained model N;
for epochs do
    for I_Vis, I_IR in D do
        # Forward propagation
        I_Vis' = DeCNN(I_Vis), I_IR' = DeCNN(I_IR)
        # Calculate IFS-SIFT feature map
        S_j = µ_j^1, µ_j^2, µ_j^3
        # Multi-scale feature concatenation
        O_Reg' = Concatenation(I_Vis' + I_IR + S_j)
        # Computation of loss
        L_total = (L_membership, L_non-membership, L_hesitancy)
        # Backward propagation and update parameters
        w = w − lr × dl/dw
    end for
    # Test model performance
end for

Figure 7. A preliminary presentation of the manually produced rotated dataset.

Figure 8. A portion of the pseudo-infrared images generated by the trained CycleGAN.

Figure 9. Fusion of the registration result with the reference. (a) shows the infrared image to be registered, (b) is the unguided registration result, (c) is the registration result achieved by multi-scale IFS-SIFT feature guidance, and (d) represents the reference.

Figure 10.The figure shows from left to right (a) image to be registered; (b) reference; (c) DASC; (d) DHN; (e) MHN; (f) NTG; (g) SMILE; (h) MURF; (i) IFSrNet proposed in this paper.To concentrate on the region of interest, the lack of information on the edges of the image after registration is marked by the red boxes, and the detailed structural information is marked by the green and blue boxes.

Figure 11. Experimental results on real datasets. Shown in order is the performance in each evaluation metric: (a) MAE, (b) SSIM, (c) PSNR, (d) NMI, and (e) LPIPS. The horizontal axis of each graph represents the sample number, while the vertical axis represents the value obtained. The algorithms are distinguished by different colours.

Figure 12. Quantitative statistics of rotational invariance. (a) shows the average feature cosine distances for different rotation angles, and (b) demonstrates the number of correctly matched features for the four rotation angles in the three datasets.

Figure 13. Example of feature matching of two images to be processed, where the left is a reference and the right is a rotated image to be registered. The matching result is shown in (a) with 0° of rotation, (b) shows the result with 25° of rotation, and (c,d) show the results with 50° and 75° of rotation, respectively.

Figure 14. Correlation of estimated spatial errors with Dice in three generation strategies. The distribution of spatial error points in (a,b) does not exhibit a strong positive correlation with the Dice coefficient. In contrast, the generative model employed in this study, as illustrated in (c), exhibits a clear positive correlation.

Figure 15. CEST-MRI images of a rat stroke lesion, in which (a) is T1 as the image to be registered, (b) is Gd as the reference, and (c) is the T1 image that has been registered.

Table 1. Summary of the registration network.

Table 2. A brief description of the datasets used in the study.

Table 3. Average MAE, SSIM, PSNR, NMI, and LPIPS values obtained from 30 sets of multispectral images. The best performance is marked in red font, and the second best in blue.

Table 4. Number of parameters and inference time for IFSrNet and other registration algorithms, where the parameters and times are in millions (M) and seconds, respectively. Red font marks the best performance, and blue font marks the second best.

Table 5. An ablation study on the IFSrNet loss function. Values marked in red represent the best performance.

Table 6. Ablation experiments conducted with different training sets in the initial training phase of the model, with the average number of matched pairs serving as the standard. The symbols ✓ and × indicate whether or not the dataset was included.

Table 7. Registration results of ablation studies using different feature map extraction methods. Red font marks the best performance, and blue font marks the second best.