2.3.1. Calculation of Geometric Shifts within the Matching Window
To calculate geometric shifts between the input images within a single matching window, both subset images are transformed into the frequency domain. In this study, this is done using FFTW (Fastest Fourier Transform in the West), the fastest freely available implementation of the discrete Fourier transform (DFT), which is described in detail by Frigo and Johnson in 2005 [31] and 1997 [32]. The two resulting images in the frequency domain are phase-correlated to generate their cross-power spectrum, which is then transformed back into the spatial domain using the inverse transform of FFTW [9,13,22,33,34]. The normalized form of the cross-power spectrum in the spatial domain shows a distinct sharp peak at the point of registration of the input images (Figure 2), which can be used to quantify image displacements [7,9,16,19,24,33]. Integer shifts can be derived from the distance between the position of the maximum peak and the centre position of the spectrum in the X and Y directions [33].
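The integer shift calculation described above can be illustrated with a minimal sketch. It assumes NumPy's FFT routines in place of FFTW, square or rectangular subsets of identical shape, and a sign convention in which the returned shift is the displacement of the target relative to the reference; the function name `integer_shift` and this convention are assumptions made for illustration and are not taken from the implementation used in this study.

```python
import numpy as np


def integer_shift(reference: np.ndarray, target: np.ndarray) -> tuple[int, int]:
    """Return (x_shift, y_shift) of the correlation peak relative to the spectrum centre."""
    f_ref = np.fft.fft2(reference)
    f_tgt = np.fft.fft2(target)

    # Cross-power spectrum, normalised so that only the phase difference remains.
    cross = f_tgt * np.conj(f_ref)
    cross /= np.maximum(np.abs(cross), 1e-12)

    # Back to the spatial domain; fftshift moves zero displacement to the centre.
    scps = np.fft.fftshift(np.fft.ifft2(cross).real)

    peak_row, peak_col = np.unravel_index(np.argmax(scps), scps.shape)
    centre_row, centre_col = scps.shape[0] // 2, scps.shape[1] // 2
    return int(peak_col - centre_col), int(peak_row - centre_row)
```

For a target that is a circularly shifted copy of the reference, the peak offset from the centre equals the applied shift; for real image pairs the peak is broader and its height depends on the similarity of the image content.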
The subset image of the target dataset, i.e., the image content within the matching window, is then temporarily moved according to the calculated integer shifts. This is done by adjusting the corner coordinates of the matching window and recreating the subset image of the target dataset at the new position. To derive sub-pixel shifts, both subset images (reference and integer-corrected target) are again transformed into the frequency domain, and a refreshed cross-power spectrum is created. Sub-pixel shifts can then be estimated based on formulas 1 and 2, which relate the value of the refreshed cross-power spectrum at the peak position to the values of its direct neighbours in the X and Y directions. The corresponding mathematical derivations can be found in [19].
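As a hedged illustration of how a sub-pixel shift may be obtained from the peak and one direct neighbour, the sketch below assumes a two-point estimator of the form v_side / (v_peak + v_side), a form commonly used for peak interpolation in phase correlation. It is not a transcription of formulas 1 and 2; the function name, the argument names and the handling of the sign are hypothetical.

```python
def subpixel_shift(v_peak: float, v_side: float, direction: int) -> float:
    """Fractional shift along one axis.

    v_peak    -- value of the refreshed cross-power spectrum at the peak
    v_side    -- value of the larger direct neighbour along this axis
    direction -- +1 if that neighbour lies on the positive side of the peak, else -1
    """
    denominator = v_peak + v_side
    if denominator == 0:
        return 0.0  # flat spectrum: no sub-pixel information available
    return direction * v_side / denominator
```

Under this assumption, a neighbour that is one third as high as the peak corresponds to a shift of a quarter pixel towards that neighbour, and a neighbour of zero height corresponds to no sub-pixel shift.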
The calculated sub-pixel shifts are added to the integer shifts, and the total shifts in pixel units are converted into map units using the ground sampling distance of the reference image. In the case of the local shift correction approach, a tie point grid is generated by combining the total map shifts with the absolute map coordinate offsets, as provided by the geocoding information of the input images.
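The conversion from pixel units to map units amounts to a short calculation, sketched below. The function name and the tuple layout are hypothetical, and the sketch deliberately ignores the orientation of the Y axis, which in practice depends on the geocoding of the reference image.

```python
def total_map_shift(
    int_shift: tuple[int, int],
    sub_shift: tuple[float, float],
    gsd: tuple[float, float],
) -> tuple[float, float]:
    """Add integer and sub-pixel shifts (X, Y) and scale them by the ground sampling distance."""
    x_px = int_shift[0] + sub_shift[0]
    y_px = int_shift[1] + sub_shift[1]
    return x_px * gsd[0], y_px * gsd[1]
```

For example, an integer shift of (2, -1) pixels combined with a sub-pixel shift of (0.5, 0.25) pixels and a ground sampling distance of 10 map units gives a total shift of (25.0, -7.5) map units.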
2.3.2. Validation of Calculated Spatial Shifts
The reliability of the calculated X/Y displacements depends on the pixel value similarity of the input images, which is affected by, e.g., varying illumination, land cover changes, or different sensor noise levels. This is also supported by [9,22]. Potential effects are quantified by a total of five complementary validation techniques. Their individual performance depends on the image content of the input images and on the pattern of misregistration. They build upon each other and aim to avoid different sources of erroneous shift detection, but they may also be deactivated on demand. However, optimal validation results have been observed when combining all the validation techniques presented below.
First, a validity check of the calculated integer shifts has been implemented (Figure 3). It is assumed that correcting the integer shifts within the respective matching window by moving the target subset image must result in zero displacements. If a subsequent calculation still yields non-zero X/Y shifts, the match is considered a false positive and rejected. In this case, we apply the newly calculated integer shifts and then repeat the shift calculation to assess whether the remaining integer shifts are now zero. This continues until a maximum number of iterations (5, by default) is reached. If shifts remain, the matching fails for the respective geographic position. Otherwise, a valid match (in terms of integer displacements) is found, and the sub-pixel displacements are finally derived, as explained above.
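The iterative validity check can be sketched as a small loop. The callable `calc_shift` is a hypothetical stand-in for the phase correlation step: it is assumed to return the integer shift that remains after the target window has been moved by the given offset. How the iterations are counted is likewise an assumption.

```python
from typing import Callable, Optional, Tuple

Shift = Tuple[int, int]


def validate_integer_shift(
    calc_shift: Callable[[Shift], Shift],
    max_iter: int = 5,
) -> Optional[Shift]:
    """Return the accumulated integer shift, or None if the match is rejected."""
    offset = calc_shift((0, 0))  # initial integer shift estimate
    for _ in range(max_iter):
        residual = calc_shift(offset)  # shift remaining after correction
        if residual == (0, 0):
            return offset  # valid match in terms of integer displacements
        # Apply the newly calculated shift and check again.
        offset = (offset[0] + residual[0], offset[1] + residual[1])
    return None  # shifts remain after max_iter: matching fails
```

A window for which the residual never reaches zero, e.g., because of repetitive patterns, is rejected after the maximum number of iterations rather than producing an unstable tie point.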
After successful calculation of the sub-pixel X/Y shifts, a threshold check (second validation method) is applied to eliminate unrealistically large displacements. This allows the knowledge of the user to be incorporated by simply rejecting all tie points whose shift vector exceeds a certain length (by default 5 reference image pixels). However, this check has been left optional to maintain wide applicability of the proposed algorithm.
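The threshold check reduces to a comparison of the shift vector length with the maximum allowed length. The following sketch assumes shifts expressed in reference image pixels; the function name is hypothetical.

```python
import math


def within_max_shift(x_shift_px: float, y_shift_px: float, max_shift_px: float = 5.0) -> bool:
    """True if the shift vector does not exceed the maximum allowed length."""
    return math.hypot(x_shift_px, y_shift_px) <= max_shift_px
```

A tie point with shifts of (3, 4) pixels has a vector length of exactly 5 pixels and is kept under the default setting, whereas any longer vector is rejected.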
It is still possible that some registration points do not match correctly, e.g., due to repetitive patterns in one of the input images that introduce pseudo-matches. To address that, we combined another three validation techniques (3–5) to effectively filter outlier tie points in the case of local displacement correction. Two of those validation metrics are separately calculated for each tie point of the dense grid, directly after calculating sub-pixel shifts.
The first (validation method 3) analyses the three-dimensional shape of the cross-power spectrum and aims to quantify the sharpness of the peak (Figure 2) as a measure of the reliability of the respective tie point. It returns a reliability percentage R. In the underlying formulas, which are evaluated over the N pixels involved, the mean power within a 3 × 3 window centred at the peak position of the spectrum is related to the mean power of the remaining spectrum plus three times its standard deviation. This yields a reliability percentage for each tie point and allows points with a value below a certain threshold to be excluded from the later calculation of geometric transformation parameters. Empirical tests (not shown) indicated that a threshold of 30% can be recommended, below which tie points are rejected.
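One possible reading of this reliability measure is sketched below. The text states only that the mean power in the 3 × 3 peak window is related to the mean power of the remaining spectrum plus three times its standard deviation; expressing this as R = 100 · (1 − background / peak), clipped to the range 0 to 100, is an assumption, as is the function name.

```python
import numpy as np


def shift_reliability(scps: np.ndarray) -> float:
    """Reliability percentage R of a tie point from its spatial cross-power spectrum."""
    peak_row, peak_col = np.unravel_index(np.argmax(scps), scps.shape)

    # Boolean mask of the 3 x 3 window centred at the peak (cropped at the borders).
    peak_mask = np.zeros(scps.shape, dtype=bool)
    peak_mask[max(peak_row - 1, 0):peak_row + 2, max(peak_col - 1, 0):peak_col + 2] = True

    power_at_peak = scps[peak_mask].mean()
    rest = scps[~peak_mask]
    background = rest.mean() + 3.0 * rest.std()

    if power_at_peak <= 0:
        return 0.0
    # Assumed form of the ratio; the clipping keeps R within 0..100 %.
    return float(np.clip(100.0 * (1.0 - background / power_at_peak), 0.0, 100.0))
```

With this form, a spectrum consisting of a single isolated peak yields 100%, a completely flat spectrum yields 0%, and tie points would be rejected below the recommended threshold of 30%.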
The fourth validation method, which is also performed for each point of the dense grid, evaluates the similarity of the subset image content within the matching window before and after shift correction. It relies on the assumption that successful correction of an erroneous co-registration leads to increased similarity between the reference and target image. As a measure of image similarity, the Mean Structural Similarity Index (MSSIM) has been chosen, as it is reported to be highly sensitive to even marginal image displacements [35,36].
MSSIM returns a value between 0 and 1, where 0 represents no match and 1 represents a perfect match. MSSIM is first calculated for the input data of the phase correlation, i.e., before shift correction. The subset target image is then globally shifted according to the sub-pixel shifts calculated for the corresponding position, and MSSIM is calculated again. All points of the dense tie point grid where MSSIM decreases are flagged and later excluded from the estimation of transformation parameters.
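The before/after comparison can be sketched as follows. The sketch uses a simplified mean SSIM that averages the SSIM of non-overlapping 8 × 8 blocks, with the stabilising constants commonly used for SSIM (K1 = 0.01, K2 = 0.03); the block-wise averaging, the assumed data range of 1.0 and all function names are assumptions made for illustration and do not necessarily match the MSSIM variant used in this study.

```python
import numpy as np


def ssim_block(a: np.ndarray, b: np.ndarray, data_range: float = 1.0) -> float:
    """SSIM of two equally sized image blocks."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (a.var() + b.var() + c2)
    return float(numerator / denominator)


def mean_ssim(a: np.ndarray, b: np.ndarray, block: int = 8, data_range: float = 1.0) -> float:
    """Average SSIM over non-overlapping blocks (a simplification of MSSIM)."""
    scores = [
        ssim_block(a[r:r + block, c:c + block], b[r:r + block, c:c + block], data_range)
        for r in range(0, a.shape[0] - block + 1, block)
        for c in range(0, a.shape[1] - block + 1, block)
    ]
    return float(np.mean(scores))


def mssim_decreased(reference: np.ndarray, target_before: np.ndarray, target_after: np.ndarray) -> bool:
    """True if similarity to the reference dropped after shift correction (tie point is flagged)."""
    return mean_ssim(reference, target_after) < mean_ssim(reference, target_before)
```

A tie point is flagged only when the correction makes the two subsets less similar, so a correction that leaves the similarity unchanged is not penalised in this sketch.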
The fifth and final validation technique is not applied point by point but incorporates all calculated shift vectors at once. This is essential for avoiding artefacts in the co-registration result caused by outliers within the dense tie point grid. For that purpose, the widely used state-of-the-art algorithm RANSAC [37] has been chosen, coupled with the assumption that, in the case of remotely sensed satellite data, an erroneous co-registration can be roughly corrected using an affine transformation. In the AROSICS framework, RANSAC has been incorporated to automatically estimate the parameters of an assumed affine transformation between the reference and target image and thus to identify outliers among the previously calculated grid of shift vectors. RANSAC estimates the parameters automatically in an iterative process, but some of its input parameters have to be set. In our experience, the most important one is the threshold used to separate inliers from outliers, which is also supported by [38,39,40]. The optimal threshold value depends strongly on the heterogeneity of the input sample and, hence, on the variance and reliability of the calculated displacement vectors. This variance is, in turn, connected to the input images themselves: for example, extensive cloud coverage may strongly influence the calculated tie points, as it often causes repetitive image patterns leading to a greater proportion of false positives [9]. In addition, RANSAC suffers from high outlier proportions within the input sample [38]. To overcome this and to avoid an erroneous estimation of affine parameters, multiple strategies have been implemented in this validation step.
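Independently of these strategies, the basic use of RANSAC for flagging outliers under an affine model can be sketched as follows. This is a minimal, generic RANSAC loop written with NumPy, not the implementation used in AROSICS; the function names, the fixed number of trials and the random seed are assumptions. Tie points are given as source coordinates `src` and shift-corrected coordinates `dst`, both of shape (n, 2).

```python
import numpy as np


def fit_affine(src: np.ndarray, dst: np.ndarray) -> np.ndarray:
    """Least-squares 2 x 3 affine matrix mapping src (n, 2) onto dst (n, 2)."""
    design = np.hstack([src, np.ones((len(src), 1))])
    params, *_ = np.linalg.lstsq(design, dst, rcond=None)
    return params.T


def ransac_affine(src, dst, threshold, n_trials=200, seed=0):
    """Return (affine matrix, boolean inlier mask) for the largest consensus set found."""
    rng = np.random.default_rng(seed)
    design = np.hstack([src, np.ones((len(src), 1))])
    best_inliers = np.zeros(len(src), dtype=bool)
    for _ in range(n_trials):
        # Three point pairs are the minimum needed to define an affine transformation.
        sample = rng.choice(len(src), size=3, replace=False)
        model = fit_affine(src[sample], dst[sample])
        residuals = np.linalg.norm(design @ model.T - dst, axis=1)
        inliers = residuals < threshold  # threshold separates inliers from outliers
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    # Final estimate from all inliers of the best trial.
    return fit_affine(src[best_inliers], dst[best_inliers]), best_inliers
```

The sketch makes the role of the threshold explicit: a value that is too small flags valid tie points as outliers, while a value that is too large lets false matches influence the final affine estimate.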
First, the option to provide bad-data masks for the input images helps to effectively exclude such critical image areas from the matching process (Section 2.2). Thus, the inlier-outlier variance within the RANSAC input data is reduced, and the probability of an outlier-dominated input sample is significantly decreased. However, it must be noted that, on the one hand, cloud masks are usually generated automatically and may contain misclassifications. On the other hand, the provision of such masks has been intentionally kept optional, because they are not always available.
As a further way to increase the robustness of outlier determination, RANSAC has been implemented as the last level of validation. Consequently, the input sample of RANSAC is already pre-filtered, in the sense that tie points with low reliability value or decreasing MSSIM are no longer passed to RANSAC.
Finally, an iterative approach is used to automatically optimize the abovementioned RANSAC threshold separating inliers from outliers. RANSAC is run multiple times while observing the proportion of inliers and outliers in the result to identify a threshold that flags a realistic number of outliers within the input data. The first iteration starts with an initial guess of the threshold, which is then iteratively increased or decreased, depending on whether the flagged outlier proportion is too low or too high. Empirical sensitivity tests (not shown) indicated that a 10% proportion of outliers seemed to be suitable for most use cases. Thus, the algorithm terminates as soon as 10% ± 2% of the input tie points have been marked as outliers.
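The iterative threshold optimization can be sketched as a simple search. The callable `outlier_fraction` is a hypothetical stand-in for one RANSAC run that returns the proportion of tie points flagged as outliers for a given threshold. The update rule (doubling or halving until the target is bracketed, then bisecting) and the iteration limit are assumptions; the text specifies only that the threshold is increased or decreased until 10% ± 2% of the tie points are flagged.

```python
from typing import Callable, Optional


def tune_threshold(
    outlier_fraction: Callable[[float], float],
    initial_threshold: float,
    target: float = 0.10,
    tolerance: float = 0.02,
    max_iter: int = 50,
) -> float:
    """Adjust the inlier/outlier threshold until the flagged proportion is target +/- tolerance."""
    low: Optional[float] = None   # largest threshold known to flag too many outliers
    high: Optional[float] = None  # smallest threshold known to flag too few outliers
    threshold = initial_threshold
    for _ in range(max_iter):
        fraction = outlier_fraction(threshold)
        if abs(fraction - target) <= tolerance:
            break
        if fraction > target:
            # Too many outliers flagged: the threshold is too strict, so increase it.
            low = threshold
            threshold = threshold * 2 if high is None else (low + high) / 2
        else:
            # Too few outliers flagged: the threshold is too lenient, so decrease it.
            high = threshold
            threshold = threshold / 2 if low is None else (low + high) / 2
    return threshold
```

If the target proportion cannot be reached within the iteration limit, the sketch returns the last threshold tried; how such a case should be handled in practice is left open here.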