Automatic and Robust Infrared-Visible Image Sequence Registration via Spatio-Temporal Association

To address the large differences in gray value and the inaccurate positioning of feature information in infrared-visible image registration, we propose an automatic and robust algorithm for registering planar infrared-visible image sequences through spatio-temporal association. In particular, we first create motion vector distribution (MVD) descriptors, which represent the temporal motion information of foreground contours in adjacent frames, to complete coarse registration without feature extraction. Then, for precise registration, we extract FAST corners of the foreground, describe them by the spatial location distribution of contour points based on connected blob detection, and match these corners using a bidirectional optimal maximum strategy. Finally, a reservoir updated by a Better-In, Worse-Out (BIWO) strategy is established to save matched point pairs and obtain the optimal global transformation matrix. Extensive evaluations on the LITIV dataset demonstrate the effectiveness of the proposed algorithm. In particular, our algorithm achieves lower registration overlapping errors than two other state-of-the-art algorithms.


Introduction
Multi-sensor image fusion [1][2][3][4], which can enhance the ability of target description and scene understanding, is widely used in target tracking [5], face recognition [6], night vision observation [7] and many other fields. Image registration, as an important procedure for image fusion, greatly determines the accuracy of target alignment in the scene, thus affecting the quality of infrared-visible image fusion.
There exist two challenging problems in infrared-visible image registration. On the one hand, images from different sensors manifest different phenomena [8]. Visible images record the reflected light information of objects, while infrared images record the thermal field distribution. Thus, the gray values of the two types of images are quite different. On the other hand, directly extracting feature information across different types of images for registration [8][9][10] may lead to inaccurate positioning, which would reduce the accuracy of the final transformation matrix for alignment.
To overcome the challenges above, many effective algorithms [11][12][13][14][15] have been proposed in recent years. Most of them are committed to using motion information to register targets in the scene. St. Charles et al. [14] proposed a framework based on shape matching, but its performance depends
(1) We propose a spatio-temporal associated registration algorithm for infrared-visible image sequences, which combines temporal motion information and an intra-frame feature matching scheme, achieving low registration overlapping errors.
(2) We create MVD descriptors of foreground contours for coarse registration without feature extraction. Thus, foreground targets can be roughly aligned to eliminate the impact of inaccurate positioning of feature points.
(3) We propose a description of feature points based on the spatial location distribution of connected blob contours, and perform feature matching using a bidirectional optimal maximum strategy. A robust reservoir updated by the BIWO strategy is proposed to improve the accuracy of the final global transformation matrix.
The rest of the paper is structured as follows: Section 2 provides a review of related work. Section 3 gives an overview of the algorithm and the details of each procedure. Section 4 presents qualitative and quantitative evaluations of the proposed algorithm as well as comparisons with other state-of-the-art algorithms. Section 5 presents the conclusions.

Related Work
Methods for image registration can be categorized into two types: intensity-based methods and feature-based methods. Intensity-based registration methods use the gray values of all pixels to calculate the similarity between two images, and adopt search methods such as genetic algorithm [19], ant colony algorithm [20], Powell algorithm [21], etc. to get the optimal transformation matrix. Cross-correlation methods [22] use the regional cross-correlation between two images as the similarity measure function for registration. However, the computational complexity of such methods is quite high. Mutual information methods [21,23] utilize the statistical correlation between images for registration, which are commonly used in multimodal medical image registration. Phase correlation methods [24,25] are effective complements to image registration, which compute the parameters of scaling, rotation and translation by calculating the maximum cross power spectrum of two images from frequency domain. Intensity-based registration methods can hardly be applied in infrared-visible images due to the huge difference in gray value. Furthermore, textures in visible images are often missed in infrared images as they seldom influence heat emitted by targets.
Feature-based registration methods extract feature information such as points [9,10], edges [26], contours [8,14], etc. of the images, which are robust to illumination, temperature, occlusion and many other disruptive factors. It is unreliable to directly use edges or contours as they are not completely consistent across infrared and visible images. Points are the simplest but most efficient feature information and FAST corner [17] has developed rapidly in computer vision tasks for its fast computing speed. Similarly, when there exists large spatial variation between infrared and visible images, the positioning of feature points will have a certain deviation, which means that methods of directly using feature points for registration [9,10] are inapplicable to our task.
Sensors 2019, 19, 997 3 of 17
Registration of infrared-visible image sequence has attracted much attention in recent years. Image sequence can provide motion information between adjacent frames, which can be complementary to image registration. Methods based on global optical flow information [11] need to calculate the optical flow of all frames, making their use impossible for real-time registration. Methods based on target trajectories were proposed in [12,13]. The positioning of the centroids or apexes of the targets may not be accurate enough, and thus trajectories formed by these points may not fully reflect the motion state of the targets. Sun et al. [15] proposed to combine motion vectors and feature information. However, the calculation of motion vectors requires a consistent number of targets in the scene so that such method is not suitable for situations in which targets enter or leave the scene. In the proposed registration algorithm, the problem of inaccurate positioning of feature points is overcome by coarsely registering foreground contours using MVD descriptors. We create a strict matching strategy as well as a robust reservoir to match and save point pairs extracted from relocated foregrounds, and to calculate the optimal global transformation matrix. Main procedures will be detailed below.

Overview of the Proposed Algorithm
Moving targets in the scene can provide available motion information. We extract the foregrounds of the images and accomplish registration on them. Figure 1 shows an overall flow of the proposed registration algorithm for infrared-visible image sequences. It mainly consists of three procedures: foreground extraction, coarse registration based on temporal motion information, and precise registration based on intra-frame feature matching.
In coarse registration, we calculate and rectify the motion vector field of two adjacent frames. The motion vectors on the foreground contours can fully reflect the motion state of the target. Motion vectors of different parts and different targets are discriminative. Inspired by the HOG [16], we create MVD descriptors for foreground contours and use corresponding foreground contour points to calculate coarse parameters of the transformation matrix.
In precise registration, FAST corners [17] of the relocated foregrounds are extracted. These points are then described by the shape context descriptors [18] of the connected blob contours to which they belong and matched by bidirectional optimal maximum strategy. A reservoir based on BIWO update strategy is established to save matched point pairs and to calculate precise parameters of the transformation matrix.
It should be noted that the infrared image is the floating image while the visible image is the reference image in the proposed algorithm. The transformation model we use is a similarity transformation comprising scaling, rotation and translation, given by Equation (1), where (x_IR, y_IR) and (x_VIS, y_VIS) are the positions of pixels in the infrared and visible images, respectively, σ is the scaling factor, θ is the rotation factor, and ∆y, ∆x are the vertical and horizontal translation factors. The proposed algorithm focuses on registering planar scenes, which means that the sensors capture images at a long distance from the targets.
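As an illustration, a similarity transformation with these four parameters can be applied to a pixel position as in the following minimal sketch. The sign convention for θ (counter-clockwise in a right-handed frame) and the function name `similarity_transform` are assumptions made for illustration; the signs in the paper's own equation may be arranged differently.

```python
import math

def similarity_transform(x_ir, y_ir, sigma, theta, dx, dy):
    """Map an infrared pixel position to visible-image coordinates.

    Assumed convention: scale by sigma, rotate by theta (radians),
    then translate by (dx, dy).
    """
    c, s = math.cos(theta), math.sin(theta)
    x_vis = sigma * (c * x_ir - s * y_ir) + dx
    y_vis = sigma * (s * x_ir + c * y_ir) + dy
    return x_vis, y_vis

# With theta = 0 the mapping reduces to scaling plus translation.
x_vis, y_vis = similarity_transform(10.0, 20.0, sigma=2.0, theta=0.0, dx=3.0, dy=-4.0)
```

Under this convention the four parameters σ, θ, ∆x and ∆y fully determine the mapping, which is why two matched point pairs are in principle enough to solve for them.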

Foreground Extraction
To make full use of the temporal motion information, we perform foreground extraction using the method in [27]. It uses color and texture information to define the pixels as "background words" in local dictionaries and introduces a feedback mechanism to continuously update the model. In the subsequent coarse registration process, we use the motion vector field of the foreground; in the precise registration process, we retain the FAST corners of the relocated foreground.

Coarse Registration
When the spatial transformation of two images differs greatly, direct extraction of feature points may result in inaccurate positioning. Coarse registration on foreground contours can solve this problem, which we will explain in detail.

Image Preprocessing
We perform some preprocessing on the images to obtain a more accurate motion vector field:

• The lighting condition may change while the sensors capture images, which greatly affects the accuracy of the motion vector field. For an image sequence with a resolution of M × N, the gray values of the pixels in the next frame, g_{t+1}(i, j), are rectified to those of the previous frame, g_t(i, j).
• Noise removal is necessary. We use a Gaussian filter (5 × 5 size, standard deviation of 3) to smooth each frame of the image sequence.
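The noise-removal step can be sketched as follows. This is a minimal pure-Python illustration of a 5 × 5 Gaussian filter with a standard deviation of 3; the function names and the border handling (replicating edge pixels) are assumptions, since the text does not specify them.

```python
import math

def gaussian_kernel(size=5, sigma=3.0):
    """Return a normalized size x size Gaussian kernel as nested lists."""
    half = size // 2
    raw = [[math.exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma))
            for dx in range(-half, half + 1)]
           for dy in range(-half, half + 1)]
    total = sum(sum(row) for row in raw)
    return [[v / total for v in row] for row in raw]

def smooth(image, kernel):
    """Filter a 2-D list image with the kernel, replicating border pixels."""
    h, w = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-half, half + 1):
                for dj in range(-half, half + 1):
                    ii = min(max(i + di, 0), h - 1)   # clamp row index
                    jj = min(max(j + dj, 0), w - 1)   # clamp column index
                    acc += kernel[di + half][dj + half] * image[ii][jj]
            out[i][j] = acc
    return out

kernel = gaussian_kernel()
flat = [[7.0] * 6 for _ in range(6)]
smoothed = smooth(flat, kernel)   # a constant image stays constant
```

Because the kernel is normalized to sum to one, smoothing preserves the mean gray level of each frame.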

Motion Vector Field Calculation
To obtain the motion vector field of the whole image, we calculate the optical flow, which represents the instantaneous motion velocity of each pixel. The methods in [28,29] are typical for optical flow calculation, but they are inapplicable when the speed of the moving targets is too low or too high, especially in our task, where the sensors capture images far from the moving targets in the scene. We therefore adopt the improved optical flow calculation method presented in [30], which defines an improved variational equation and introduces a smoothness constraint to minimize it. Figure 2 shows the motion vector diagrams obtained from adjacent frames in the image sequence.

Motion Vector Field Filtering and Re-Projection
• Motion vectors that belong to the background are set to zero, because only the motion vectors of the foreground are sufficiently distinguishable for registering foreground contours.
• Motion vectors near the image boundaries tend to be inaccurate and are not conducive to the establishment of the subsequent MVD descriptors. We remove the motion vectors near the boundaries with a threshold of 20 pixels.
• For a pixel with location (x, y), gray value g_t, and calculated motion vector u = (u_x, u_y), the offset of the gray value relative to the corresponding pixel in the next frame g_{t+1} can be obtained by re-projection (bilinear interpolation method). If the offset is greater than the threshold (5 pixels in our algorithm), the motion vector of this pixel is considered to be wrong and is set to zero.
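The three filtering rules above can be sketched as follows. This is an illustrative reading rather than the authors' implementation: the offset is assumed to be the absolute difference between g_t(x, y) and the bilinearly interpolated value of g_{t+1} at (x + u_x, y + u_y), the threshold is applied to that gray-level difference, and all function and parameter names are hypothetical.

```python
import math

def bilinear(img, x, y):
    """Sample img (2-D list, indexed img[row][col]) at real-valued (x, y)."""
    h, w = len(img), len(img[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    top = (1 - ax) * img[y0][x0] + ax * img[y0][x1]
    bottom = (1 - ax) * img[y1][x0] + ax * img[y1][x1]
    return (1 - ay) * top + ay * bottom

def filter_motion_field(flow, foreground, g_t, g_next, border=20, max_offset=5.0):
    """Keep a motion vector only if it is on the foreground, away from the
    image border, and consistent with the gray values after re-projection."""
    h, w = len(flow), len(flow[0])
    out = [[(0.0, 0.0)] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            inside = border <= y < h - border and border <= x < w - border
            if not (inside and foreground[y][x]):
                continue
            ux, uy = flow[y][x]
            offset = abs(bilinear(g_next, x + ux, y + uy) - g_t[y][x])
            if offset <= max_offset:
                out[y][x] = (ux, uy)
    return out

# Tiny example with a 1-pixel border so the effect is visible on a 6 x 6 image.
h = w = 6
g_t = [[0.0] * w for _ in range(h)]
g_next = [[0.0] * w for _ in range(h)]
g_t[2][2] = 100.0      # bright pixel at (x=2, y=2) in frame t
g_next[2][3] = 100.0   # it moves one pixel to the right in frame t+1
flow = [[(0.0, 0.0)] * w for _ in range(h)]
flow[2][2] = (1.0, 0.0)    # consistent with the gray values: kept
flow[3][3] = (0.0, -1.0)   # dark pixel mapped onto a bright one: rejected
flow[0][0] = (1.0, 1.0)    # inside the border margin: rejected
foreground = [[True] * w for _ in range(h)]
filtered = filter_motion_field(flow, foreground, g_t, g_next, border=1)
```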

Creation of Motion Vector Distribution Descriptor and Contour Matching
For a pixel with motion vector u = (u_x, u_y), we compute the magnitude and orientation of the motion vector; eps, the minimum floating-point precision, ensures that the denominator is not zero. The magnitude of the motion vector is invariant to rotation and translation. To obtain scaling invariance, we normalize all motion vector magnitudes Mag(i, j) by the maximum magnitude value Mag_max and encode them. The orientation of the motion vector is invariant to scaling and translation, but not to rotation. We first encode all motion vector orientations Ori(i, j) such that each code denotes an interval of 10° and code 0 represents a zero magnitude value.
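One way to realize this encoding is sketched below. Several details are assumptions, because the corresponding formulas are not reproduced here: the magnitude is taken as the Euclidean norm, the orientation is computed with `atan2` (which avoids the zero denominator that eps guards against) and mapped to [0°, 360°), and the normalized magnitude is quantized into 36 levels to match the 36-dimensional histograms mentioned later.

```python
import math

N_CODES = 36  # assumed number of codes, matching the 36-dimensional histograms

def encode_motion_vectors(vectors):
    """Return (mag_code, ori_code) pairs for a list of (u_x, u_y) vectors."""
    mags = [math.hypot(ux, uy) for ux, uy in vectors]
    mag_max = max(mags) if mags else 0.0
    codes = []
    for (ux, uy), mag in zip(vectors, mags):
        if mag == 0.0 or mag_max == 0.0:
            codes.append((0, 0))           # code 0: zero magnitude
            continue
        # Normalize by the maximum magnitude, then quantize into 1..36.
        mag_code = min(N_CODES, max(1, math.ceil(N_CODES * mag / mag_max)))
        # Orientation in [0, 360) degrees, one code per 10-degree interval.
        deg = math.degrees(math.atan2(uy, ux)) % 360.0
        ori_code = min(N_CODES, int(deg // 10) + 1)
        codes.append((mag_code, ori_code))
    return codes

codes = encode_motion_vectors([(0.0, 0.0), (4.0, 0.0), (1.0, 1.0), (-1.0, 1.0)])
```

Dividing by the maximum magnitude is what makes the magnitude code independent of a global scaling of the motion field.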
To obtain rotation invariance, we then set the main orientation, selected as the maximum of the orientation histogram, to code 1. The remaining orientations are recoded in a clockwise direction as codes 2 to 36, relative to M_code, the main orientation code before recoding. Inspired by HOG [16], taking a point on the foreground contour as the center, we first compute histograms of the non-zero magnitude and orientation codes of the motion vectors in a 9 × 9 pixel block. We then weight the histograms with a two-dimensional Gaussian distribution (9 × 9 size, standard deviation of 5). We finally obtain the encoded motion vector magnitude distribution vector H_Mag and orientation distribution vector H_Ori of the center point, both of which are 36-dimensional. Extensive experiments demonstrate that under the planar condition, the speed of the moving targets in the scene is low, which makes the motion vector magnitude distribution less discriminative. When multiple targets move in different directions in the scene, or when movements such as waving or turning occur, the distribution of motion vector orientation is more discriminative. Based on this, we add a weight factor ω (ω = 0.2 in our algorithm) and combine the two histogram vectors to create 72-dimensional motion vector distribution descriptors for the foreground contour points (N in total). We randomly select some foreground contour points (in our algorithm, 10% of the infrared contour point set in each frame). For a selected infrared point, the visible contour point with the smallest Euclidean distance between motion vector distribution descriptors is regarded as the corresponding point. The random sample consensus (RANSAC) algorithm [31] is adopted to calculate the optimal transformation matrix. Figure 3 shows the matching result of randomly selected foreground contour points.
There are some mismatches and inaccurately matched pairs, especially when targets in the scene are moving slowly and in almost the same direction. The RANSAC algorithm is able to calculate the optimal transformation matrix among all matched point pairs. Through the coarse registration process, we obtain the coarse parameters of the similarity transformation model: σ_c, θ_c, ∆y_c and ∆x_c.
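The descriptor construction and nearest-descriptor matching used in coarse registration can be sketched as follows. The recoding rule, the way ω weights the two histograms (ω for magnitude, 1 − ω for orientation), and all function names are assumptions made for illustration; the original formulas are not reproduced in this text.

```python
import math

def recode_orientation(ori_codes, n_codes=36):
    """Shift non-zero orientation codes so the most frequent one becomes code 1."""
    nonzero = [c for c in ori_codes if c > 0]
    if not nonzero:
        return list(ori_codes)
    main = max(set(nonzero), key=nonzero.count)   # main orientation code
    return [0 if c == 0 else (c - main) % n_codes + 1 for c in ori_codes]

def mvd_descriptor(h_mag, h_ori, omega=0.2):
    """Concatenate two 36-bin histograms into a 72-dimensional descriptor."""
    return [omega * v for v in h_mag] + [(1.0 - omega) * v for v in h_ori]

def nearest_descriptor(query, candidates):
    """Index of the candidate with the smallest Euclidean distance to query."""
    dists = [math.dist(query, c) for c in candidates]
    return dists.index(min(dists))

recoded = recode_orientation([0, 5, 5, 6, 4])
descriptor = mvd_descriptor([1.0] * 36, [2.0] * 36)
best = nearest_descriptor(descriptor, [[0.0] * 72, descriptor])
```

A small ω gives the orientation histogram most of the influence on the distance, which is in line with the observation that orientation is the more discriminative cue in planar scenes.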

Precise Registration
Coarse registration cannot obtain accurate registration parameters due to its randomness. Importantly, it can roughly align targets in the scene for more accurate positioning of feature points. In the following section, the proposed precise registration is introduced.

Relocation and Feature Point Extraction
We first relocate the floating image and foreground with the coarse parameters (bilinear interpolation method). Figure 4 shows the results of the relocated original image and foreground. Since the FAST corner detection algorithm [17] has low computational complexity and ensures accurate positioning, we extract FAST corners of the moving targets and match them. Similar to the motion vector field filtering in coarse registration, corners that belong to the background or lie near the image boundaries are discarded. Figure 5 shows the detected FAST corners.

Feature Points Description
In Reference [15], the number of targets in the scene needs to be consistent for homologous and heterogeneous feature point matching. Besides, in the process of normalized location description establishment, the centroid is obtained with all foregrounds. However, when moving targets just enter or leave the scene, the number of them tends to be inconsistent. As shown in Figure 6, there is one target entering or leaving the scene in the infrared while there are two in the visible. Reference [15] is unable to deal with these specific moments and the centroid calculation is inaccurate.
To register these special frames, we first use the two-pass algorithm to segment targets (foregrounds) in the scene, and merge the nearby small connected blobs caused by foreground extraction. In subsequent description, calculation of the centroid and establishment of the shape context descriptor are only performed in the connected foreground blob to which the feature point belongs. We then describe the feature point as follows:

• Position of the feature point: P = [x, y].
• Location of the feature point relative to the centroid of the connected foreground blob to which it belongs, where [x_c, y_c] is the position of the centroid.
• The shape context descriptor [18] of the feature point. It reflects the spatial location distribution of the neighboring points around the center. The contour points of the connected foreground blob to which the feature point belongs form the descriptor. In our algorithm, log-polar coordinates are used to divide the distance into 5 bins and the angle into 8 bins. The 40-dimensional shape context descriptor of the feature point is built from sc(dis, ang), the distribution statistical histogram of joint distance and angle.
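The relative location and the 5 × 8 log-polar shape context can be sketched as follows. The distance normalization (by the mean distance), the radial bin edges (log-spaced between 1/8 and 2 times the mean distance), the unnormalized relative location, and the function names are assumptions for illustration only.

```python
import math

def relative_location(point, blob_pixels):
    """Offset of the point from the centroid of its connected blob."""
    xc = sum(x for x, _ in blob_pixels) / len(blob_pixels)
    yc = sum(y for _, y in blob_pixels) / len(blob_pixels)
    return (point[0] - xc, point[1] - yc)

def shape_context(point, contour, n_r=5, n_theta=8):
    """40-bin log-polar histogram of contour points around `point`."""
    px, py = point
    offsets = [(cx - px, cy - py) for cx, cy in contour if (cx, cy) != (px, py)]
    hist = [0.0] * (n_r * n_theta)
    if not offsets:
        return hist
    dists = [math.hypot(dx, dy) for dx, dy in offsets]
    mean_d = sum(dists) / len(dists)
    lo, hi = math.log(0.125), math.log(2.0)   # assumed radial range
    for (dx, dy), d in zip(offsets, dists):
        r = (math.log(d / mean_d) - lo) / (hi - lo) * n_r
        r_bin = min(n_r - 1, max(0, int(r)))          # clamp outliers to end bins
        ang = math.atan2(dy, dx) % (2.0 * math.pi)
        a_bin = min(n_theta - 1, int(ang / (2.0 * math.pi) * n_theta))
        hist[r_bin * n_theta + a_bin] += 1.0
    return [v / len(offsets) for v in hist]       # counts normalized to sum to 1

loc = relative_location((2, 2), [(0, 0), (2, 0), (0, 2), (2, 2)])
sc = shape_context((0, 0), [(2, 1), (-1, 2), (-2, -1), (1, -2)])
```

Restricting `contour` to the blob that contains the feature point is what keeps the descriptor valid when the number of targets differs between the infrared and visible frames.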

Figure 6. Targets enter (a,b) and leave (c,d) the scenes of infrared (a,c) and visible (b,d), respectively, at the same time.

Matching
Three similarity metrics are used for feature point matching:
• S_P: Euclidean distance between the positions of the two feature points.
• S_L: Euclidean distance between the locations of the two feature points relative to the centroids.
• Chi-square test statistic between the two shape context descriptors.
Since the coarse registration process has roughly aligned the moving targets in the scene, we first treat a point pair as potentially matched if S_P < S_th1 and S_L < S_th2; otherwise we ignore it and consider another point pair. Then we calculate the chi-square test statistic between the two shape context descriptors. For an infrared feature point, the visible feature point with the smallest chi-square test statistic is regarded as its matched point.
More than one visible feature point may be matched to the same infrared feature point, or mismatches may occur. To resolve these cases, as introduced in Algorithm 1, we create a bidirectional optimal maximum strategy to filter the point pairs. Figure 7 shows the point pairs matched using the proposed matching strategy.
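The gating and filtering steps can be sketched as follows. Algorithm 1 is not reproduced in this text, so the bidirectional strategy is approximated here by a mutual-best check: a pair is kept only if each point is the other's lowest-cost candidate. The dictionary keys 'P', 'L' and 'SC' and the function names are hypothetical.

```python
import math

def chi_square(h1, h2):
    """Chi-square test statistic between two histograms."""
    return 0.5 * sum((a - b) ** 2 / (a + b) for a, b in zip(h1, h2) if a + b > 0)

def match_points(ir_pts, vis_pts, s_th1, s_th2):
    """Return mutually best (ir_index, vis_index) pairs that pass both gates.

    Each point is a dict with 'P' (position), 'L' (location relative to the
    blob centroid) and 'SC' (shape context histogram).
    """
    cost = {}
    for i, p in enumerate(ir_pts):
        for j, q in enumerate(vis_pts):
            if math.dist(p['P'], q['P']) < s_th1 and math.dist(p['L'], q['L']) < s_th2:
                cost[(i, j)] = chi_square(p['SC'], q['SC'])
    best_for_ir, best_for_vis = {}, {}
    for (i, j), c in cost.items():
        if i not in best_for_ir or c < cost[(i, best_for_ir[i])]:
            best_for_ir[i] = j
        if j not in best_for_vis or c < cost[(best_for_vis[j], j)]:
            best_for_vis[j] = i
    return sorted((i, j) for i, j in best_for_ir.items() if best_for_vis[j] == i)

ir = [{'P': (10, 10), 'L': (1, 1), 'SC': [1.0, 0.0]},
      {'P': (12, 10), 'L': (1, 1), 'SC': [0.0, 1.0]}]
vis = [{'P': (10, 11), 'L': (1, 1), 'SC': [0.9, 0.1]},
       {'P': (12, 11), 'L': (1, 1), 'SC': [0.1, 0.9]},
       {'P': (100, 100), 'L': (1, 1), 'SC': [1.0, 0.0]}]   # fails the position gate
matches = match_points(ir, vis, s_th1=5.0, s_th2=5.0)
```

The mutual check removes one-to-many assignments by construction, since a visible point can be the best partner of at most one infrared point.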

Reservoir Construction and Optimal Transformation Matrix Calculation
For image sequence registration, if we only use feature points of the current frame to compute the transformation matrix, parameters may not be obtained since there may not be enough feature points to be extracted, especially when moving targets enter or leave the scene. To solve this problem, we save the matched point pairs from different frames in a reservoir. Many approaches to constructing and updating reservoirs have been proposed, for example in reference [8], a 30 or 100 frame-wide reservoir with First-In, First-Out (FIFO) update strategy is created. The disadvantage of this reservoir is that there will not be enough matched point pairs if the targets disappear during the 30 or 100 frames. And FIFO strategy may replace the better-matched point pair with worse-matched or even wrong match. In reference [14], reservoir eliminates those pairs that are regarded as persistent outliers based on the RANSAC algorithm and a proposed voting scheme. This reservoir is robust, but with the cost of high computational complexity. In reference [15], a match whose HOE matching metric is greater than the median value is considered as outliers. When a new match is generated, one of the outliers is randomly selected and replaced. This reservoir is updated by the description of the feature points, which is not applicable to our algorithm.
Based on the feature point description method of our algorithm, as introduced in Algorithm 2, we create a new reservoir updated by the BIWO strategy. Once the reservoir is full, a new match is allowed to enter only when its similarity metrics meet the admission criteria. The proposed reservoir is therefore updated only when the new point pair is better matched, which prevents worse-matched and wrong matches from entering it. Once the reservoir holds sufficient matched point pairs, the precise parameters σ_p, θ_p, Δy_p and Δx_p can be obtained using the RANSAC algorithm, and the final optimal global parameters are then obtained by inverting Equation (1).
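The Better-In, Worse-Out update described in this section can be illustrated with a minimal sketch. It is not the authors' implementation: the class name `BIWOReservoir`, the method `offer`, and the use of a single scalar `similarity` (higher is better) in place of the paper's similarity metrics and admission criteria are all assumptions made for illustration.

```python
import heapq
from itertools import count


class BIWOReservoir:
    """Fixed-capacity store of matched point pairs that keeps the best ones.

    Assumption: one scalar `similarity` (higher is better) stands in for the
    paper's similarity metrics and admission criteria.
    """

    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self._heap = []           # min-heap: the worst-scoring pair is at index 0
        self._tiebreak = count()  # unique counter so equal scores never compare points

    def offer(self, similarity, ir_point, vis_point):
        """Try to add a matched pair; return True if it was admitted."""
        entry = (similarity, next(self._tiebreak), ir_point, vis_point)
        if len(self._heap) < self.capacity:
            heapq.heappush(self._heap, entry)
            return True
        if similarity > self._heap[0][0]:
            # Better in, worse out: evict the current worst pair.
            heapq.heapreplace(self._heap, entry)
            return True
        return False  # not better than anything stored: rejected

    def pairs(self):
        """Return the stored (ir_point, vis_point) pairs, best first."""
        ordered = sorted(self._heap, key=lambda e: e[0], reverse=True)
        return [(ir, vis) for _, _, ir, vis in ordered]


# Toy matches (made-up coordinates and scores, not data from the paper).
reservoir = BIWOReservoir(capacity=3)
for score, ir, vis in [(0.6, (10, 12), (14, 15)),
                       (0.9, (40, 8), (44, 11)),
                       (0.7, (25, 30), (29, 33)),
                       (0.5, (5, 5), (50, 60)),     # worse than all stored pairs
                       (0.8, (33, 21), (37, 24))]:  # better than the 0.6 pair
    reservoir.offer(score, ir, vis)
```

In this sketch the 0.5 match is rejected and the 0.8 match displaces the 0.6 match, so, unlike a FIFO reservoir, the age of a pair plays no role in eviction. The pairs returned by `pairs()` would then be passed to a RANSAC-based estimator, which is omitted here.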

Experiments and Analysis
In this section, we test and analyze the performance of the proposed algorithm on planar image sequences.

Dataset
For comparison with other algorithms related to our work, we choose the LITIV dataset provided by reference [32] to test our algorithm. It contains nine image sequences and provides ground-truth matrices obtained by manually selecting notable matched point pairs of moving targets in the scene. The sequences have a resolution of 240 × 320, a frame rate of 30 FPS, and lengths between 329 and 1238 frames.

Qualitative Results and Analysis
In the proposed algorithm, the infrared image is the floating image, which is registered using the computed transformation matrix, while the visible image serves as the reference image. To visualize the effectiveness of the proposed algorithm, Figure 8 shows mosaic results that incorporate the transformed infrared images, the original visible images and the ground-truth matrices of each image sequence. Owing to the coarse registration process, the deviation between the two images is not very large when targets appear in the scene. In LITIV-4, LITIV-6 and LITIV-9 in particular, the targets are already roughly aligned. In LITIV-6, LITIV-7, LITIV-8 and LITIV-9, the final registration matrix does not completely coincide with the ground-truth matrix. This does not mean that the proposed algorithm is unable to achieve high registration accuracy. There are two reasons for this: (1) The ground-truth matrix is obtained by manually selecting notable matched point pairs of targets, so it may contain errors. The subsequent quantitative results show that, in some frames, our registration achieves a lower overlapping error than the ground-truth matrix. (2) The ground-truth matrix is unique and suitable for the registration of planar scenes, but the LITIV dataset does not fully satisfy the planar condition. Especially in LITIV-4, LITIV-8 and LITIV-9, there are multiple moving targets in the scene and each target has its own depth of field, so we cannot directly register all the targets with the same transformation matrix. We provide further explanation in the following section.

Quantitative Results and Analysis
To quantitatively evaluate the proposed algorithm, we select two state-of-the-art algorithms for registering planar image sequences [14,15] for comparison. Charles et al. [14] create a framework based on shape matching and introduce a voting scheme that decides whether a matched point pair in the reservoir is a persistent outlier. Such pairs are eliminated to update the reservoir. Sun et al. [15] create a framework combining motion and feature information. It uses motion vectors to calculate the scaling and rotation factors in coarse registration and HOE descriptors to describe and match the feature points in precise registration, together with a reservoir updated based on the HOE matching metric.
We adopt the overlapping error of the foreground as the registration evaluation metric. It is defined in terms of F_VIS and F_IR, the visible and infrared foregrounds respectively, and Γ(F_IR, H), the infrared foreground transformed by the transformation matrix H.
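As an illustration of how such a metric can be evaluated, the sketch below assumes the common intersection-over-union form of the overlapping error, 1 - |Γ(F_IR, H) ∩ F_VIS| / |Γ(F_IR, H) ∪ F_VIS|. This form is an assumption, since the defining equation is not reproduced in this section. The function names, the representation of a foreground as a set of (x, y) pixels and the toy data are likewise hypothetical.

```python
def transform_foreground(pixels, H):
    """Map a set of (x, y) foreground pixels through a 3x3 matrix H (nested lists).

    Simplification: forward mapping with rounding. This is adequate for the
    pure translation used below but can leave holes under scaling or rotation.
    """
    out = set()
    for x, y in pixels:
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
        v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
        out.add((round(u), round(v)))
    return out


def overlap_error(f_vis, f_ir, H):
    """Assumed metric: 1 - |warped IR ∩ VIS| / |warped IR ∪ VIS|."""
    warped = transform_foreground(f_ir, H)
    union = warped | f_vis
    if not union:
        return 0.0  # both foregrounds empty: nothing to misalign
    return 1.0 - len(warped & f_vis) / len(union)


# Toy foregrounds: a 4x4 visible blob, and the same blob 2 px to the left in infrared.
f_vis = {(x, y) for x in range(10, 14) for y in range(5, 9)}
f_ir = {(x - 2, y) for x, y in f_vis}

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shift_right_2 = [[1, 0, 2], [0, 1, 0], [0, 0, 1]]

error_unregistered = overlap_error(f_vis, f_ir, identity)      # 8 shared of 24 pixels
error_registered = overlap_error(f_vis, f_ir, shift_right_2)   # blobs coincide
```

Under this assumed definition the error is 0 for perfectly coincident foregrounds and 1 for disjoint ones, so lower values indicate better registration.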
The overlapping error of the foreground can easily be influenced by the results of foreground extraction. Therefore, [14] proposed using binary polygons instead of the foreground to calculate the overlapping error. Binary polygons are formed by connecting the matched points of notable parts of moving targets (heads, shoulders, palms, etc.). After the binary polygon in the infrared image is transformed, the overlapping error of the two polygonal regions is calculated. Figure 9 shows the binary polygons in LITIV-1.
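The polygon-based variant can be sketched in the same spirit. The code below transforms the infrared polygon's vertices, rasterizes both polygons by testing pixel centers with even-odd ray casting, and again assumes an intersection-over-union form of the error. The rasterization approach, the function names and the example polygons are illustrative assumptions and are not taken from [14].

```python
import math


def transform_polygon(vertices, H):
    """Apply a 3x3 matrix H (nested lists) to a list of (x, y) vertices."""
    out = []
    for x, y in vertices:
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        out.append(((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
                    (H[1][0] * x + H[1][1] * y + H[1][2]) / w))
    return out


def rasterize(vertices):
    """Return the set of pixels whose centers fall inside the polygon."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    edges = list(zip(vertices, vertices[1:] + vertices[:1]))
    pixels = set()
    for py in range(math.floor(min(ys)), math.ceil(max(ys))):
        for px in range(math.floor(min(xs)), math.ceil(max(xs))):
            cx, cy = px + 0.5, py + 0.5
            inside = False
            # Even-odd rule: count edges crossed by a ray cast toward +x.
            for (x1, y1), (x2, y2) in edges:
                if (y1 > cy) != (y2 > cy):  # edge straddles the ray's height
                    if cx < x1 + (cy - y1) * (x2 - x1) / (y2 - y1):
                        inside = not inside
            if inside:
                pixels.add((px, py))
    return pixels


def polygon_overlap_error(poly_vis, poly_ir, H):
    """Assumed metric: 1 - intersection / union of the two rasterized polygons."""
    a = rasterize(transform_polygon(poly_ir, H))
    b = rasterize(poly_vis)
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


# Toy polygons: a 10x10 square, and the same square 4 px to the left in infrared.
poly_vis = [(10, 10), (20, 10), (20, 20), (10, 20)]
poly_ir = [(6, 10), (16, 10), (16, 20), (6, 20)]

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shift_right_4 = [[1, 0, 4], [0, 1, 0], [0, 0, 1]]

error_before = polygon_overlap_error(poly_vis, poly_ir, identity)
error_after = polygon_overlap_error(poly_vis, poly_ir, shift_right_4)
```

Because the polygons are built from a few matched landmark points rather than from segmented foreground masks, a measure of this kind does not depend on the quality of foreground extraction.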
To globally show the performance of the optimal transformation matrix calculated for each frame, we plot the overlapping error-time curves and compare them with the algorithms of [14,15]. Figure 10 shows the curves of nine image sequences in LITIV dataset.
Figure 10. Overlapping error-time curves of nine image sequences in LITIV dataset. The red is our algorithm, the green is the algorithm of [14], and the blue is the algorithm of [15]. The dotted black represents the ground-truth matrix.
Except for LITIV-8, the overlapping error of our algorithm is lower than that of [14] in all image sequences, and it converges faster. The algorithm of [14] registers targets by direct shape matching, which is susceptible to foreground detection results. When moving targets enter or leave the field of view, occlusion or overlap of the targets may occur and result in inaccurate foreground shapes. Our algorithm includes a coarse registration based on the motion vector distribution. Even if occlusion or target overlap occurs, the identical distribution of the motion vectors can still be used to align the targets, as long as the changes are simultaneous. LITIV-8 is a special case in which the spatial difference between the floating image and the reference image is large: according to the ground truth, the horizontal translation is more than 100 pixels. The algorithm of [14] directly registers the moving targets without relocation or distance constraints on feature points, so it performs better on this sequence. Although [14] is better in this specific case, the overall performance of our algorithm (on most of the LITIV sequences) is superior to that of [14].
For further quantitative evaluation, Table 1 gives the minimum overlapping error of each image sequence, which reflects the optimal registration performance, and Table 2 gives the average overlapping error of each image sequence, which reflects the robustness of the algorithms. Considering the curves together with the minimum and average overlapping errors, our algorithm achieves the best registration results compared with [14,15] in LITIV-1, LITIV-3, LITIV-4, LITIV-5 and LITIV-7. It should be noted that in LITIV-4 our algorithm fails to reach a low overlapping error for a long period. This happens because the number of connected blobs extracted by the foreground detection method is inconsistent during these frames: there is one blob in the infrared image but two in the visible image. Since the feature matching process relies on the connected blobs to which the feature points belong, this inconsistency hinders matching. Once the numbers of connected blobs agree again, the overlapping error begins to decrease.
Except for LITIV-4 and LITIV-8, our algorithm achieves the lowest average overlapping error compared with [14,15]. This is because the coarse registration roughly aligns the targets in the scene, ensuring that the spatial deviation is not particularly large. In LITIV-1, LITIV-3, LITIV-6 and LITIV-9, our algorithm already reaches a low error in the initial frames.
In all sequences except LITIV-2 and LITIV-3, our algorithm obtains lower errors than the ground truth. This indicates that manually selecting matched point pairs to calculate the ground-truth matrices can introduce errors, and it also explains why our final matrices do not completely coincide with the ground truth in the mosaic results.
LITIV-5, LITIV-6 and LITIV-7 fully reflect the superiority of our algorithm. In these image sequences, the overlapping error-time curves show that our algorithm stays at the lowest level most of the time. These sequences share common characteristics: the moving targets are far from the sensor, there is almost no overlap between moving targets, and the moving directions are quite distinguishable, all of which favor coarse registration based on the distribution of motion vectors. Figure 11 shows the infrared motion vector diagram obtained from adjacent frames in LITIV-5, which is clearly discriminative. In LITIV-1, LITIV-5, LITIV-6, LITIV-7 and LITIV-9, our algorithm converges faster than [14,15] and can achieve a lower overlapping error (0.25) in a shorter time.
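The three quantities used in this comparison (minimum error, average error and convergence time) can be computed from an overlapping error-time curve as in the following sketch. The function name `summarize_curve`, the reading of 0.25 as a threshold that must be reached, and the curve values are assumptions for illustration only. The numbers are made up and are not results from the LITIV dataset.

```python
from statistics import mean


def summarize_curve(errors, threshold=0.25):
    """Summarize one overlapping error-time curve (one value per frame).

    Returns the minimum error (best registration reached), the mean error
    (robustness over the sequence) and the index of the first frame whose
    error is at or below `threshold`, or None if that never happens.
    """
    first_below = next((i for i, e in enumerate(errors) if e <= threshold), None)
    return {"min": min(errors), "mean": mean(errors), "first_below": first_below}


# Made-up curve for illustration; not measured data.
toy_curve = [0.9, 0.7, 0.4, 0.2, 0.3, 0.1]
summary = summarize_curve(toy_curve)
```

Comparing `first_below` across algorithms on the same sequence gives one simple way to quantify which algorithm converges faster.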
As for computational complexity, the experiments are conducted on an Intel(R) Core(TM) i5-6500 CPU at 3.20 GHz with 16 GB RAM, running Windows 7 x64 and Matlab R2016a. Table 3 shows the average computing time per frame for each image sequence in the LITIV dataset.
In summary, the proposed registration algorithm is superior to the other two state-of-the-art algorithms in registering infrared-visible image sequences. When the moving targets in the scene are small and their motion states are distinguishable, the proposed algorithm achieves higher registration accuracy and faster convergence than the other two algorithms.

Conclusions
In this paper, we propose an automatic and robust infrared-visible image sequence registration algorithm based on spatio-temporal association. In the coarse registration step, we use the temporal motion information of the foreground, establish MVD descriptors for foreground contour points and roughly align the targets to eliminate inaccurate positioning of the feature points. In the precise registration step, we use the spatial location distribution of connected contour points to describe the detected feature points and match them under the bidirectional optimal maximum strategy. A BIWO strategy based on similarity metrics is created to update the reservoir, so that low registration overlapping errors can be obtained. Extensive evaluations demonstrate the effectiveness of the proposed algorithm, which outperforms the other two state-of-the-art algorithms in registering infrared-visible image sequences.