2.1.1. Dual-Channel Radiation Feature Alignment Network
We propose the EWAM model while retaining LoFTR's coarse-to-fine matching backbone. EWAM introduces a dedicated feature extraction branch that handles visible and infrared data separately (overall pipeline shown in Figure 1). The rationale is that a shared feature extraction pipeline for both modalities fails to account for their inherent disparities, preventing the network from capturing infrared-specific cues such as unique radiation signatures and distinct grayscale statistics. This problem is exacerbated by significant texture mismatch [32]. To mitigate the cross-modal gap, we isolate parameters by embedding the Environment-Adaptive Radiation Feature Extractor and the Wavelet Transform High-Frequency Enhancement Module into the infrared pathway. Each residual block is tagged with its modality, allowing convolutions to learn modality-specific kernels, sharpen edges, and boost target saliency (a minimal sketch of this parameter isolation follows). Infrared images are processed by the Wavelet Transform High-Frequency Enhancement Module and, in parallel, fed into the Environment-Adaptive Radiation Feature Extractor for selective radiation encoding. The infrared branch thus yields an aggregated radiation representation. Finally, both modalities are fused within the pipeline for joint feature extraction, followed by coarse-to-fine matching.
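As a minimal illustration of the modality tagging described above, the PyTorch sketch below keeps a separate stack of residual blocks per modality so that each branch learns its own kernels. All module names, channel widths, and block counts are our own illustrative assumptions, not taken from a released implementation.

```python
import torch.nn as nn

class ModalityResBlock(nn.Module):
    """Residual block tagged with a modality, so each branch learns its own kernels."""
    def __init__(self, channels: int, modality: str):
        super().__init__()
        self.modality = modality  # "visible" or "infrared" (hypothetical tag)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # residual connection

class DualBranchStem(nn.Module):
    """Separate parameter sets for visible and infrared features before fusion."""
    def __init__(self, channels: int = 64, depth: int = 2):
        super().__init__()
        self.vis = nn.Sequential(*[ModalityResBlock(channels, "visible") for _ in range(depth)])
        self.ir = nn.Sequential(*[ModalityResBlock(channels, "infrared") for _ in range(depth)])

    def forward(self, f_vis, f_ir):
        return self.vis(f_vis), self.ir(f_ir)
```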
As illustrated in Figure 1, features are separately extracted from visible images $I^{vis}$ and infrared images $I^{ir}$. The infrared image $I^{ir}$ is concurrently processed through the Environment-Adaptive Radiation Feature Extractor and the Wavelet Transform High-Frequency Enhancement Module. This feature fusion process is applied exclusively when generating coarse-level feature maps: the coarse-level feature map $\tilde{F}^{ir}$ for the infrared image is derived from the fused representation produced by these modules, whereas the coarse-level feature map $\tilde{F}^{vis}$ for the visible image is extracted directly from $I^{vis}$ without this fusion process. Both feature maps have a spatial resolution of $H/8 \times W/8$ relative to the original image. $\hat{F}^{vis}$ and $\hat{F}^{ir}$ denote the fine-level feature maps derived from $I^{vis}$ and $I^{ir}$ at a spatial resolution of $H/2 \times W/2$ relative to the original image. $N_c$ denotes the number of LoFTR local feature transformer modules; $\mathcal{M}_c$ denotes the locations of matches where the confidence matrix exceeds the given threshold; $\mathbb{E}[\cdot]$ denotes the expectation of a probability distribution; and $\hat{\mathcal{M}}$ denotes the predicted matching entries of the fine-level matching prediction matrix.
2.1.2. Design of the Environment-Adaptive Radiation Feature Extractor
Terrestrial features exhibit significant radiation inhomogeneity across the infrared spectrum in different scenes [33]. In a typical farmland scene, infrared radiation intensity is primarily governed by solar heating, which yields a relatively uniform daytime temperature distribution in which global thermal statistics, such as the mean and variance, provide limited discriminative power [34]. In environments such as the gobi and desert, however, rocks and water bodies exhibit pronounced differences in radiation intensity even at the same temperature. This discrepancy undermines the stability of the correlation between infrared features and visible textures [35]. To this end, we design an Environment-Adaptive Radiation Feature Extractor, as shown in Figure 2. The module incorporates a hierarchical radiation prior extractor together with a dynamic feature injection mechanism, enabling the extraction of global radiation statistics from raw infrared imagery. It encodes physical properties of the target, such as mean radiation intensity and entropy of the distribution, thereby compressing the radiation distribution into a compact physical prior vector. To improve computational efficiency, an adaptive dynamic control strategy is incorporated into the module: if the variance of the infrared image falls below a given threshold, as in a farmland scene, the radiation feature fusion process is skipped. The injection of radiation prior features is activated only in scenes with significant radiation heterogeneity, such as the gobi and desert, where the radiation prior acts as a self-supervised signal. This approach prevents the injection of invalid features by constraining the physical plausibility of infrared features, while simultaneously enhancing their physical interpretability.
When the variance of the infrared image is small, the radiation field is distributed uniformly and differences between radiation features are minimal and difficult to distinguish; the radiation feature extraction step is therefore skipped. High variance implies significant differences in radiation intensity across the image, corresponding to well-defined edges and structures.

During feature extraction, the model applies 7 × 7 and 3 × 3 depthwise-separable convolutional kernels for multi-scale convolution to capture edge, texture, and long-range radiation features. The larger receptive field facilitates the integration of neighborhood information, enabling robust estimation of the overall radiation distribution in the initial processing stage. The two feature maps are then combined by a 1 × 1 convolution to produce a 128-channel output. After normalization and ReLU activation, channel-wise and spatial attention refine radiation-related information while suppressing background noise, and a residual bottleneck enhances the nonlinear representation. Adaptive average pooling and covariance pooling are then applied to generate features; the latter employs a covariance matrix to preserve the correlation structure of the radiation distribution while enhancing the representation of radiation characteristics for objects such as the gobi, deserts, and rock formations. Finally, an MLP bottleneck (comprising ReLU, Dropout, and LayerNorm) performs nonlinear transformation and regularization, outputting the global radiation feature vector T, which is fed into downstream stages to highlight structural dissimilarities. In other scenes, radiation characteristics are not introduced. The module thereby maps infrared physical properties into a common representation space aligned with visible-light features, mitigating modal differences and achieving a compact, physically aware feature encoding through multi-stage transformations.
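A condensed PyTorch sketch of this pipeline is given below. The layer sizes, variance threshold, SE-style channel attention stand-in, and the diagonal summary of the covariance matrix are illustrative assumptions; the spatial attention and residual bottleneck are omitted for brevity.

```python
import torch
import torch.nn as nn

def depthwise_separable(c_in, c_out, k):
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),  # depthwise
        nn.Conv2d(c_in, c_out, 1, bias=False),                              # pointwise
    )

class RadiationFeatureExtractor(nn.Module):
    """Variance-gated global radiation encoder (hypothetical layer sizes)."""
    def __init__(self, c_in=1, c_mid=64, c_out=128, var_threshold=0.01, t_dim=128):
        super().__init__()
        self.var_threshold = var_threshold
        self.branch7 = depthwise_separable(c_in, c_mid, 7)  # long-range radiation context
        self.branch3 = depthwise_separable(c_in, c_mid, 3)  # local edges and texture
        self.fuse = nn.Sequential(
            nn.Conv2d(2 * c_mid, c_out, 1, bias=False),     # 1x1 fusion to 128 channels
            nn.BatchNorm2d(c_out),
            nn.ReLU(inplace=True),
        )
        # SE-style channel attention as a stand-in for the paper's attention block
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(c_out, c_out // 8, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c_out // 8, c_out, 1), nn.Sigmoid(),
        )
        # MLP bottleneck: first- and second-order pooled statistics -> prior vector T
        self.mlp = nn.Sequential(
            nn.Linear(2 * c_out, t_dim), nn.ReLU(inplace=True),
            nn.Dropout(0.1), nn.LayerNorm(t_dim),
        )

    def forward(self, ir):                      # ir: (B, 1, H, W)
        if ir.var() < self.var_threshold:       # uniform radiation field: skip the module
            return None
        f = self.fuse(torch.cat([self.branch7(ir), self.branch3(ir)], dim=1))
        f = f * self.se(f)                      # refine radiation-related channels
        avg = f.mean(dim=(2, 3))                                # (B, C) first-order statistics
        x = f.flatten(2) - f.flatten(2).mean(dim=2, keepdim=True)
        cov = torch.bmm(x, x.transpose(1, 2)) / x.shape[2]      # covariance pooling, (B, C, C)
        cov_diag = torch.diagonal(cov, dim1=1, dim2=2)          # compact (B, C) summary
        return self.mlp(torch.cat([avg, cov_diag], dim=1))      # radiation prior vector T
```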
Infrared Feature Formulation:

$$F_{ir}(x, y) = w(x, y) \cdot S(x, y) \tag{1}$$

In Equation (1), $S(x, y)$ denotes the local radiation saliency feature and $w(x, y)$ denotes the spatially adaptive weight.
The local variance is estimated on the input infrared image within a sliding window:

$$\sigma^2 = \frac{1}{K} \sum_{i \in \Omega} \left( x_i - \mu \right)^2 \tag{2}$$

In Equation (2), $\mu$ denotes the mean within the window, $K$ denotes the number of pixels within the window, and $x_i$ denotes the radiation intensity of pixel $i$ in the infrared image. $\bar{I}$ denotes the average value of the total thermal radiation across the entire image, calculated as:

$$\bar{I} = \frac{1}{HW} \sum_{x=1}^{H} \sum_{y=1}^{W} I(x, y) \tag{3}$$
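Using the identity $\mathrm{Var}[x] = \mathbb{E}[x^2] - (\mathbb{E}[x])^2$, Equations (2) and (3) can be computed efficiently with average pooling; a short sketch, where the window size is an assumed value:

```python
import torch
import torch.nn.functional as F

def local_variance(ir: torch.Tensor, window: int = 7) -> torch.Tensor:
    """Sliding-window variance of Eq. (2): Var = E[x^2] - (E[x])^2 per window.
    Expects a 4-D tensor (B, 1, H, W)."""
    pad = window // 2
    mean = F.avg_pool2d(ir, window, stride=1, padding=pad)          # window mean (mu)
    mean_sq = F.avg_pool2d(ir * ir, window, stride=1, padding=pad)  # window mean of squares
    return (mean_sq - mean * mean).clamp_min(0.0)

def global_mean(ir: torch.Tensor) -> torch.Tensor:
    """Eq. (3): average thermal radiation over the whole image."""
    return ir.mean(dim=(-2, -1))
```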
2.1.3. Design of the Wavelet Transform High-Frequency Enhancement Module
Infrared images are often degraded during acquisition by radiometric distortions and noise, and typically exhibit blurry edges, loss of texture information, and reduced spatial resolution. Consequently, accurately determining homologous points in corresponding visible-light imagery after feature extraction is challenging. To address this issue, we designed the Wavelet Transform High-Frequency Enhancement Module, shown in Figure 3, which combines the wavelet transform [36] with a learnable filter.
This module employs the discrete wavelet transform (DWT) to decompose the image into one low-frequency and three high-frequency subbands. Candidate bases for this decomposition include the Haar, Daubechies (Db), Coiflet (Coif), and Symlet (Sym) wavelets, as well as the dual-tree complex wavelet transform (DT-CWT) [37]. The selection of these bases is governed by properties such as support length, number of vanishing moments, smoothness, and frequency localization. Haar wavelets exhibit low computational complexity but poor smoothness, potentially reducing the accuracy of high-frequency edge extraction. Coiflet and higher-order Daubechies wavelets possess more vanishing moments, yielding smoother basis functions that enhance edge representation at the cost of increased computational complexity. The DT-CWT is approximately shift-invariant and directionally selective, yet it is computationally intensive.
The Daubechies-4 (Db4) wavelet is selected for its optimal trade-off among spatial localization, smoothness, and computational efficiency. The Db4 wavelet possesses a sufficient number of vanishing moments to capture fine-scale edge and texture information while maintaining low computational complexity, making it suitable for large-scale thermal datasets. This selection facilitates robust extraction of terrain structures such as ridgelines, riverbanks, cliffs, building facades, and corner lineaments in complex natural scenes. According to the principle of edge localization, the image signal is decomposed into high- and low-frequency components. Adaptive learning dynamically optimizes high-frequency enhancement in a computationally efficient manner. This approach overcomes the limitations of manually defined parameters, which often degrade efficiency and impair generalization.
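For reference, a single-level Db4 decomposition of the kind selected here can be obtained with PyWavelets. The input array and its size are placeholders; the subband names in the unpacking follow the convention used in this text (pywt itself labels them approximation, horizontal, vertical, and diagonal detail coefficients):

```python
import numpy as np
import pywt

ir = np.random.rand(256, 256).astype(np.float32)   # stand-in infrared image
# Eq. (4): one low-frequency and three high-frequency subbands via the Db4 basis
LL, (LH, HL, HH) = pywt.dwt2(ir, "db4")

# Perfect-reconstruction check for the inverse transform used at the output stage
rec = pywt.idwt2((LL, (LH, HL, HH)), "db4")
assert np.allclose(rec[:256, :256], ir, atol=1e-5)
```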
The decomposition and reconstruction process is described as follows:

$$\{LL,\ LH,\ HL,\ HH\} = \mathrm{DWT}(I) \tag{4}$$

In Equation (4), LL denotes the low-frequency subband, which reflects the average radiative properties of the target region and captures slow spatial variations in radiation caused by surface materials, solar irradiation, or environmental conditions. For example, in farmland scenes, LL exhibits a relatively uniform radiation field, primarily due to extensive vegetation coverage, whereas in desert or rocky terrains it highlights large-scale differences in radiation intensity among features such as sand dunes, rocks, and sparse vegetation. LH denotes the horizontal gradient (vertical edges); in terrain feature analysis, it is commonly used to identify boundaries with distinct horizontal extension, such as ridgelines and riverbanks [38]. HL denotes the vertical gradient (horizontal edges); in practical scenes, this component highlights vertical features such as building facades, steep cliffs, and tree outlines [39]. HH denotes the diagonal gradient (corners and diagonal edges); this information can reveal the arrangement of landforms along diagonal directions, such as sloping terrain surfaces and extended slope vegetation [40].

$$\hat{I} = \mathrm{IDWT}\bigl(LL',\ f_\theta(LH),\ f_\theta(HL),\ f_\theta(HH)\bigr) \tag{5}$$
In Equation (5), $f_\theta(\cdot)$ denotes the parameterized high-frequency enhancement function, and $LL'$ is the thresholded low-frequency subband obtained in Equation (11) below. The input high-frequency subband $X \in \{LH, HL, HH\}$ is processed to extract local edge features via 3 × 3 convolution kernels. The corresponding formula is:

$$F_k(i, j) = \sum_{m=-1}^{1} \sum_{n=-1}^{1} W_k(m, n)\, X(i + m,\ j + n) + b_k \tag{7}$$

In Equation (7), the summation over $W_k$ denotes the convolution sum and weights, with each kernel extracting edges in a specific direction; $X(i + m, j + n)$ is the pixel value of the input subband at position $(i + m, j + n)$; $k$ denotes the output channel index of the convolution kernel; and the learnable bias $b_k$ is used to adaptively learn the local edge statistics of infrared images.

$$A_k(i, j) = \max\bigl(0,\ F_k(i, j)\bigr) \tag{8}$$

In Equation (8), $A$ denotes the activated feature map, which preserves edges with positive gradients while filtering out noise with negative gradients, such as false edges caused by radiation diffusion. Radiation noise is more pronounced in regions with negative gradients; applying zero-thresholding therefore suppresses low-gradient noise while preserving positive-gradient structures such as ridgelines.
Feature Fusion and Edge Reconstruction:

$$H'(i, j) = \sum_{k} w_k\, A_k(i, j) + b \tag{9}$$

In Equation (9), $w_k$ denotes the fusion weight that strengthens diagonal edges and corner points, and $b$ denotes the output bias. Adaptive sharpening is achieved by dynamically learning the contribution weight of each edge response, while the balance between edge sharpening and low-gradient noise filtering is learned through gradient backpropagation, suppressing low-frequency noise.
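Equations (7)-(9) amount to a small learnable filter applied to each high-frequency subband. A minimal sketch, assuming a single-channel subband and a hypothetical count of eight directional kernels:

```python
import torch
import torch.nn as nn

class HighFreqEnhancer(nn.Module):
    """Learnable filter over one high-frequency subband: a sketch of Eqs. (7)-(9)."""
    def __init__(self, n_kernels: int = 8):
        super().__init__()
        # Eq. (7): k directional 3x3 kernels W_k with learnable biases b_k
        self.edge_conv = nn.Conv2d(1, n_kernels, kernel_size=3, padding=1)
        # Eq. (9): fusion weights w_k and output bias b as a 1x1 convolution
        self.fuse = nn.Conv2d(n_kernels, 1, kernel_size=1)

    def forward(self, subband: torch.Tensor) -> torch.Tensor:  # (B, 1, H, W)
        edges = self.edge_conv(subband)   # directional edge responses F_k
        edges = torch.relu(edges)         # Eq. (8): keep positive gradients, drop noise
        return self.fuse(edges)           # Eq. (9): adaptively sharpened subband
```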
An adaptive threshold is applied to the low-frequency subband LL:

$$\hat{\sigma}^2 = \frac{1}{N} \sum_{i \in \Omega} \left( LL_i - \mu_{LL} \right)^2 \tag{10}$$

In Equation (10), $\hat{\sigma}^2$ denotes the noise variance (estimated via local variance estimation), used to suppress low-frequency interference caused by texture diffusion, and $N$ is the number of pixels in the window.

$$LL'(i, j) = LL(i, j) \cdot \mathbb{1}\!\left[ V(i, j) > \lambda\, \hat{\sigma}^2 \right] \tag{11}$$

In Equation (11), $\lambda$ denotes the learnable noise sensitivity coefficient (optimized via backpropagation) and $V$ is a local sliding-window variance estimate. The purpose is to suppress radiation diffusion and filter out background noise in the low-frequency subband.
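A sketch of this adaptive low-frequency gating follows. Because the hard indicator in Equation (11) is non-differentiable, the gate is relaxed to a sigmoid here so that $\lambda$ remains trainable by backpropagation; the window size and the relaxation itself are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowFreqDenoiser(nn.Module):
    """Adaptive threshold on the LL subband: a sketch of Eqs. (10)-(11)."""
    def __init__(self, window: int = 7):
        super().__init__()
        self.window = window
        self.lam = nn.Parameter(torch.tensor(1.0))  # learnable noise sensitivity lambda

    def forward(self, ll: torch.Tensor) -> torch.Tensor:  # (B, 1, H, W)
        pad = self.window // 2
        mean = F.avg_pool2d(ll, self.window, stride=1, padding=pad)
        v = F.avg_pool2d(ll * ll, self.window, stride=1, padding=pad) - mean * mean  # Eq. (10)
        sigma2 = v.mean(dim=(-2, -1), keepdim=True)     # global noise-variance estimate
        gate = torch.sigmoid(self.lam * (v - sigma2))   # soft relaxation of the indicator
        return ll * gate                                # Eq. (11): suppress low-variance regions
```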
The learnable parameters are updated via edge-strengthened backpropagation:

$$w_k \leftarrow w_k - \eta\, \frac{\partial L}{\partial w_k} \tag{12}$$

$$\lambda \leftarrow \lambda - \eta\, \frac{\partial L}{\partial \lambda} \tag{13}$$

In Equations (12) and (13), $\eta$ denotes the learning rate and $L$ denotes the loss function, which combines mean squared error, noise suppression loss, and diagonal edge enhancement loss with different weighting ratios.
$$L = \alpha\, L_{MSE} + \beta\, L_{noise} + \gamma\, L_{dir} \tag{14}$$

In Equation (14), $L_{MSE}$ denotes the mean squared error and $L_{noise}$ denotes the noise suppression loss. The weighting coefficients are set as $\alpha = 1.0$, $\beta = 0.5$, $\gamma = 0.2$; these values were determined empirically to balance reconstruction accuracy, noise suppression, and directional edge enhancement across different thermal scenes. The $L_{MSE}$ term constrains overall quality and ensures structural consistency between the wavelet-transformed image and the original image.
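The weighted combination of Equation (14) is straightforward to express in code; in this sketch the noise-suppression and directional terms are passed in as precomputed scalars (the directional term is sketched after Equation (17) below):

```python
import torch
import torch.nn.functional as F

def total_loss(rec: torch.Tensor, target: torch.Tensor,
               l_noise: torch.Tensor, l_dir: torch.Tensor) -> torch.Tensor:
    """Eq. (14) with the stated weights alpha=1.0, beta=0.5, gamma=0.2."""
    l_mse = F.mse_loss(rec, target)  # structural consistency with the original image
    return 1.0 * l_mse + 0.5 * l_noise + 0.2 * l_dir
```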
$L_{dir}$ denotes the directional perception loss. To further strengthen the structural consistency of high-frequency reconstruction, this loss is defined on the diagonal high-frequency subband HH and constrains the reconstructed diagonal structures to align with their original orientation-dependent responses. Specifically, the HH subband is filtered with 3 × 3 directional kernels to extract responses along the two dominant diagonal directions, 45° and 135°, which correspond to the principal diagonal orientations represented in the HH subband of the single-tree DWT. Since the standard DWT combines both diagonal components within HH, supervision at 45° and 135° provides an effective approximation for enhancing diagonal and corner structures without introducing additional computational cost. The loss is formulated as:

$$L_{dir} = \sum_{d \in \{45^{\circ},\ 135^{\circ}\}} \left\| K_d * \widehat{HH} - K_d * HH \right\|_2^2 \tag{17}$$

In Equation (17), $K_d * HH$ denotes the high-frequency subband response in a specific direction $d$. The purpose is targeted edge enhancement with directional adaptability: data-driven learning of infrared edge characteristics outperforms fixed operators.
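A sketch of the directional term of Equation (17) follows. The two fixed 3 × 3 diagonal-difference kernels are illustrative choices for the 45° and 135° responses, not the paper's exact operators:

```python
import torch
import torch.nn.functional as F

def directional_loss(hh_rec: torch.Tensor, hh_orig: torch.Tensor) -> torch.Tensor:
    """Eq. (17) sketch: compare 45/135-degree responses of reconstructed vs. original HH."""
    k45 = torch.tensor([[2., -1., -1.], [-1., 2., -1.], [-1., -1., 2.]]) / 6.0   # main diagonal
    k135 = torch.tensor([[-1., -1., 2.], [-1., 2., -1.], [2., -1., -1.]]) / 6.0  # anti-diagonal
    kernels = torch.stack([k45, k135]).unsqueeze(1).to(hh_rec)  # (2, 1, 3, 3)
    resp_rec = F.conv2d(hh_rec, kernels, padding=1)             # directional responses K_d * HH_hat
    resp_orig = F.conv2d(hh_orig, kernels, padding=1)           # directional responses K_d * HH
    return F.mse_loss(resp_rec, resp_orig)
```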