Thermal Image Super-Resolution Based on Lightweight Dynamic Attention Network for Infrared Sensors

Infrared sensors capture infrared rays radiated by objects to form thermal images. They have a robust ability to penetrate smoke and fog, and are widely used in security monitoring, military applications, etc. However, civilian infrared detectors with lower resolution cannot compete with megapixel RGB camera sensors. In this paper, we propose a dynamic attention mechanism-based thermal image super-resolution network for infrared sensors. Specifically, the dynamic attention modules adaptively reweight the outputs of the attention and non-attention branches according to features at different depths of the network. The attention branch, which consists of channel- and pixel-wise attention blocks, is responsible for extracting the most informative features, while the non-attention branch is adopted as a supplement to extract the remaining ignored features. The dynamic weights block operates with 1D convolution instead of a full multi-layer perceptron on the global average pooled features, reducing parameters and enhancing information interaction between channels; the same structure is adopted in the channel attention block. Qualitative and quantitative results on three testing datasets demonstrate that the proposed network restores high-frequency details better while improving the resolution of thermal images. Moreover, the lightweight structure of the proposed network, with its lower computing cost, can be practically deployed on edge devices, effectively improving the imaging quality of infrared sensors.


Introduction
RGB sensors are universally used in smartphones, drones, laptops, and other devices due to their excellent imaging quality and speed. However, the image quality captured by RGB sensors degrades dramatically under harsh conditions [1]. The wavelength of longwave infrared radiation ranges between 7 and 14 µm, while the wavelength of visible light in the electromagnetic spectrum lies between 390 and 780 nm. Therefore, infrared sensors have a robust ability to penetrate smoke, fog, and haze, and can replace RGB sensors in the aforementioned scenarios. Nonetheless, high-resolution (HR) infrared focal plane arrays are expensive, and civilian uncooled infrared detectors output low-resolution (LR) thermal images that cannot match the megapixel RGB sensors [2]. The straightforward approach is to design complex hardware devices to enhance the resolution of thermal images, but the extended manufacturing time limits their practical application. In recent years, a method known as super-resolution (SR) has been extensively developed to enhance LR image resolution [3].
The purpose of image super-resolution (ISR) is to recover the corresponding HR images from the observed degraded LR counterparts. Current ISR methods can be classified into three categories based on technical approach: interpolation based, reconstruction based, and learning based. Linear interpolation algorithms (e.g., nearest, bilinear, and bicubic) design interpolation weighting functions based on the assumption of local smoothness of the image [4]. They are simple and fast but are prone to aliasing artifacts in areas with rich high-frequency information. To overcome the shortcomings of linear interpolation, more works focus on nonlinear interpolation algorithms, i.e., adaptive image interpolation. For example, adaptive interpolation methods based on edge guidance [5,6] or perception [7] and non-local means [8] improve the perceptual quality of interpolated images, but high-frequency details are still insufficient under large scale factors. ISR methods based on pixel domain reconstruction employ introduced prior knowledge as a constraint to iteratively solve the objective function until it converges to a local optimal solution. Projection onto convex sets, iterative back projection, and maximum a posteriori (MAP) estimation are representative algorithms [4]. The reconstruction-based ISR methods [4,9,10] first establish a degradation model from HR images to LR images, which is the inverse problem of ISR [11-13]. Therefore, the degradation model can be solved in reverse based on algorithms such as MAP estimation, and the HR image can be predicted. For instance, Greenbaum et al. [13] generated HR images based on super-resolved stacks of multiple shifted LR images to improve the field-of-view of dense samples captured by lensfree holographic microscopy. As a shallow learning algorithm, traditional sparse dictionary-based ISR methods suffer from slow speed and are limited by the size of the overcomplete dictionary, resulting in unsatisfactory performance [14]. Moreover, with the development of multimedia technology, the above methods have gradually reached performance bottlenecks and cannot meet the needs of generating high-definition or even ultra-high-definition images. Recently, ISR models based on deep convolutional neural networks (CNNs) have achieved impressive performance gains over traditional methods [1,3]. CNN-based ISR methods map LR images to HR images in an end-to-end manner according to the datasets used during training. Taking advantage of the powerful nonlinear fitting and automatic feature learning capabilities of CNNs, as well as the emergence of dedicated acceleration hardware such as the neural processing unit (NPU), the performance of ISR networks trained with massive training data is significantly better than that of the above traditional methods. SRCNN [15,16], as the first CNN-based ISR model, exhibits significantly superior performance to interpolation- and sparse coding-based methods. Relying on the powerful nonlinear fitting ability of CNNs, CNN-based ISR models for RGB sensors have been continuously proposed [17]. However, there are relatively few ISR methods specifically designed for infrared sensors. Therefore, there is an urgent need to develop a CNN-based thermal image SR model that can be applied in practice.
Inspired by the human visual system (HVS), ISR networks progressively employ attention mechanisms to improve performance [18]. The HVS does not process all the information contained in an observed scene equally but concentrates limited computing resources on the most information-rich regions. For instance, when humans gaze at the sky, their attention is likely to be focused on objects such as birds flying in the sky rather than on the background of the sky itself. Similarly, it is high-frequency details such as edges and textures, rather than smooth areas, that most affect the perceived quality of an image. The purpose of ISR is to maximize the recovery of high-frequency information while improving the resolution. Channel attention (CA) and spatial attention (SA) mechanisms are currently the most widely used attention methods and have been proven to improve many low-level computer vision tasks, including ISR [19-21]. CA and SA perform recalibration operations in the channel and pixel spaces of feature maps, respectively. To make full use of channel and spatial information interaction, many works stack CA and SA blocks to form attention modules and reuse them [1,22,23]. Previous studies [18,24] have shown that although LR images have insufficient resolution, they still contain a large amount of low-frequency information and high-frequency details. ISR networks without attention blocks process all channels and image regions equally and cannot effectively recover high-frequency details. Therefore, the attention mechanism can enrich the edges and texture details of the final super-resolved HR image, thereby improving the perceptual quality.
The few existing CNN-based thermal image SR models for infrared sensors mainly suffer from the following three shortcomings. (1) Inefficient sequential stacking of attention modules. As shown by Chen et al. [18], simply using the same attention module hierarchy to extract features is not always beneficial to the final RGB ISR performance. We prove in Section 3 that thermal image SR for infrared sensors should apply attention modules differently at different depths of the network. The results show that the early attention feature extraction modules enhance low-level information, the tail modules extract high-frequency details, and the middle attention modules enhance a mixture of the two. Therefore, it is necessary to dynamically adjust the attention weights according to the characteristics of different stages of the network. (2) Networks are complex and unwieldy. With the advent of residual structures [25], it became possible to stably train very deep networks. Since then, more and more ISR networks have improved performance by continuously increasing network capacity (i.e., increasing the number of layers of the network and the width of each layer). For example, EDSR [26] has more than 40M parameters, and its huge computing power requirement makes it almost impossible to deploy on embedded devices, which limits its practical application. Although well-designed networks with more parameters can continuously improve performance, the resulting gains and training/inference costs have to be considered. In other words, there should be a trade-off between performance and the number of parameters. (3) The designed attention modules are not compact. Inspired by SENet [27], many researchers [28-30] have invested in designing complex CA structures to enhance channel dimension feature extraction, or in combining convoluted SA blocks to improve performance. Although these methods enrich the details of the super-resolved images, the heavy computational burden leads to slow inference, limiting their practical applications. Our initial idea was to design an efficient attention module that could reduce the complexity of the network and achieve satisfactory performance, making it possible to deploy on edge devices.
To address the above issues, we propose a lightweight dynamic attention super-resolution network (LDASRNet) for infrared sensors to super-resolve LR thermal images. LDASRNet consists of a shallow feature extraction (SFE) module, a deep feature extraction (DFE) module and a feature reconstruction (FRec) module. The SFE module consists of only one convolutional layer with filter size 3 × 3, which is used to extract shallow features. The DFE module consists of sequentially stacked dynamic attention blocks (DABs). Different from previous works, our proposed DAB adaptively and dynamically assigns weights to the attention and non-attention branches according to the deep features at different stages of the network. In particular, the efficient CA block compactly aggregates channel dimension features, and the pixel attention block generates 3D instead of 2D attention maps compared with SA, obtaining more performance gains at less computational cost. The FRec module is used to reconstruct the final HR thermal image.
In this paper, our main contributions are as follows. The remainder of this paper is organized as follows. Section 2 describes the technical details of the existing attention mechanisms, as well as CNN-based SR methods for RGB and thermal images. Section 3 shows the motivation for and necessity of our proposed dynamic attention for thermal image SR. Section 4 details our proposed dynamic attention network structure. Section 5 qualitatively and quantitatively compares the proposed network with state-of-the-art lightweight ISR models. We conclude this paper in Section 6.

CNN-Based Image Super-Resolution
Traditional ISR methods based on the sparse coding framework represent HR image patches as sparse linear combinations of atoms in an over-complete dictionary [14]. Therefore, the ISR performance is limited by the dictionary size, and the inference speed is slow, resulting in unsatisfactory ISR performance under large scale factors [1].
As CNNs have demonstrated impressive accuracy in the field of image recognition [31], CNN-based ISR methods have emerged. SRCNN [15], as a pioneering CNN-based ISR work, achieves significantly superior performance to traditional ISR methods, and FSRCNN [16] further boosts SRCNN to obtain more gains at a lower computational cost. However, SRCNN only has three convolutional layers, resulting in a relatively small model capacity, which may not be able to recover sufficient details when faced with large upsampling scale factors. Following SRCNN, VDSR [32] explores multi-layer small-size convolution kernels to expand the receptive field, and the resulting 20-layer network greatly improves accuracy. As the residual connection in ResNet [25] makes it possible to stably train very deep networks, the residual structure is widely used in ISR models. Unlike SRResNet [33], EDSR [26] removes batch normalization from its 32 residual blocks, reducing memory consumption and artifacts. However, several of the above models adopt the pre-upsampling method, that is, the LR input is first interpolated to the desired size, which increases the difficulty and computational burden of subsequent deep feature extraction. Therefore, more works adopt a post-upsampling structure, which upsamples features to the expected resolution at the end of the model [3]. RDN [34] employs residual blocks with dense connections to make full use of the features of previous layers, and the formed persistent memory mechanism effectively utilizes abundant features. Inspired by SENet [27], RCAN [24] models interactions between channels to recalibrate the channel features. As a channel-attention-based network, RCAN adopts a two-level residual CA structure to deepen the network while utilizing more low-frequency information. Similarly, HAN [35] uses channel-spatial attention modules to explore the relationship between the pixel domain and the channel dimension, and uses layer attention to exploit the output features of all residual groups. The effectiveness of CA and SA in reweighting features in the channel and spatial dimensions, respectively, to enhance high-frequency information has been proven, and they have been extensively adopted in low-level (e.g., ISR) and high-level computer vision tasks [35-38].
Motivated by ISR in the visible spectral domain, Choi et al. [39] proposed TEN, a network consisting of four convolutional layers for end-to-end mapping of LR thermal images. Limited by the expensive HR thermal detectors at that time, it was difficult to obtain a large number of paired LR-HR thermal images, so TEN used RGB images as the training dataset. Similarly, Marivani et al. [40] proposed the multimodal SR models DMSC and DMSC+, using HR RGB images as an auxiliary to super-resolve LR near-infrared images. However, Rivadeneira et al. [41] demonstrated that a network trained with thermal images is superior for thermal image SR inference compared to one trained on RGB datasets. In addition, the authors also constructed a dataset consisting of 101 thermal images with 640 × 512 resolution to activate research in the field of thermal image enhancement. Aiming at the problem of severe thermal image noise caused by high clutter in the maritime environment, Bhattacharya et al. [42] proposed two CNN-based networks to perform denoising and SR tasks, respectively, to improve the perception of maritime thermal images. The idea of cascading two networks is also adopted in CDN_MRF [43]. The first residual network of CDN_MRF is used to extract the thermal image structure, and the second network is used to refine high-frequency details. As the champion of the Perception Beyond the Visible Spectrum (PBVS)-2020 Thermal Image SR Challenge, TherISuRNet [2] adopts a progressive feature extraction module to generate HR thermal images. In order to reduce redundant feature extraction in deep networks, ChaSNet [44] uses a channel separation method to eliminate redundant features in the trunk of the network. However, the limited receptive fields of the convolution kernels adopted in the above models limit their performance. Zhang et al. [1] proposed MPRANet, a thermal image SR network composed of residual blocks with parallel convolution kernels of different sizes to effectively extract local and global features.
Apart from discriminative models, some research efforts have focused on generative models, such as generative adversarial networks (GANs) [45], for thermal image SR. Liu et al. [46] integrated the gradient prior knowledge of natural scenes and trained a GAN-based thermal image SR network with RGB images as auxiliary style feature information. Rivadeneira et al. [47] proposed a CycleGAN-based [48] network with an unsupervised training method. In general, GAN-based thermal image SR model training is unstable and prone to mode collapse, so most thermal image SR methods are still based on CNNs.

Attention Mechanisms in Image Super-Resolution
ISR networks extensively employ CA and SA modules to enhance channel and spatial dimension feature maps, respectively.These two attention paradigms focus on rich patterns to reconstruct HR images.
Channel Attention. Channel attention can be divided into scalar-based CA [27] and covariance-based CA [49]. A schematic diagram of the two paradigms is shown in Figure 1. Scalar-based CA generates weights in the channel dimension for reweighting feature maps. This means that all pixels of each channel are rescaled by the same scalar, i.e., CA is an isotropic operator in the channel dimension. In contrast, covariance-based CA performs an inner product (i.e., self-attention) operation on the input features to generate a cross-covariance matrix that transfers information between channels.
Spatial Attention. Spatial attention can be viewed as an anisotropic operator, i.e., each pixel of all channels is multiplied by a different weight to highlight features. Spatial gate-based [29,30] and self-attention-based [50,51] SA are two representative paradigms. As shown in Figure 1, spatial gate SA generates channel-independent weight masks, while self-attention SA computes the cross-covariance in the pixel domain. The information interaction of the above two SAs is only in the spatial dimension, and there is no interaction between channels.
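To make the two paradigms concrete, the following PyTorch sketch (our own minimal illustration, not code from any of the cited works) implements a scalar CA block in the SENet style [27] and a spatial-gate SA block; the reduction ratio of 16 and the 7 × 7 gate kernel are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ScalarChannelAttention(nn.Module):
    """SE-style scalar CA: one shared weight per channel (isotropic in space)."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)              # global average pooling
        self.fc = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(self.pool(x))                 # per-channel rescale

class SpatialGateAttention(nn.Module):
    """Spatial-gate SA: one channel-shared weight per pixel (anisotropic)."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):
        return x * self.gate(x)                          # per-pixel rescale
```

Both blocks preserve the input shape, so they can be dropped into any residual trunk.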

Scalar-based Channel Attention
As shown in Figure 1, the CA operation is isotropic in the spatial dimension, so the computational complexity of CA is less than that of SA. Therefore, lightweight ISR networks tend to adopt CA rather than SA. However, some research works [1,21] have shown that the combination of CA and SA can help improve ISR performance. The aggregation capabilities of CA and SA in different feature dimensions complement each other. We propose to use pixel-wise attention, which is more effective than SA, combined with CA, to improve performance while maintaining a compact network structure.

Motivation
According to ISR research on RGB sensors [18], LR images mix low-frequency information and high-frequency details, such as edges and textures. A network without an attention mechanism handles all frequency bands equally at all layers, resulting in inefficient redundant computation. The attention mechanism can enhance high-frequency features and improve visual quality. However, there are few studies on the thermal image SR task for infrared sensors that prove the above assumptions.
We construct a network composed of attention modules to explore the properties of the attention mechanism in the thermal image SR task. The LR thermal image first passes through a convolutional layer to extract shallow features, then sequentially passes through 16 attention blocks to extract deep features, and finally outputs the HR thermal image through an upsampling module. The proposed attention module consists of sequential channel and spatial attention blocks such that each pixel in the feature map can be rescaled independently. We visualize the feature maps of the outputs of some attention modules in Figure 2. As shown in Figure 2, the behaviors of attention modules at different depths are quite different, even diametrically opposite. The shallow attention modules extract low-frequency features, i.e., flat regions are enhanced; the tail attention modules enhance high-frequency information; and the middle modules mix the above two operations. Furthermore, to verify whether all attention modules contribute to the final performance gain, we replace the attention modules of some layers with residual blocks, and the results are shown in Table 1. The peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) are the performance metrics. Table 1 shows that the attention module indeed improves ISR accuracy, and the location of the attention module is critical to the performance. The pure residual block structure achieved the same performance as the structure with attention modules only in the first half, and the quantitative results of the network with attention blocks only in the second half were consistent with those of the fully attentional network. From the above experiments, we can draw the following conclusions: (1) It is beneficial to adopt the attention mechanism for the ISR task. (2) It is not always optimal to apply the attention blocks equally. Therefore, we propose a dynamic attention network for infrared sensors to super-resolve LR thermal images.

Proposed Method
We describe the details of the proposed lightweight dynamic attention super-resolution network (LDASRNet) for infrared sensors in this section. The structure diagram of LDASRNet is shown in Figure 3.

Shallow Feature Extraction Module
Given an LR thermal image I_LR ∈ R^(C×H×W), where C is the number of channels, and H and W are the height and width, respectively, the function of the SFE module can be expressed mathematically as: x_0 = f_SFE(I_LR), where x_0 is the output of the SFE module and f_SFE(·) represents the SFE function, which consists of a single 3 × 3 convolution kernel. An alternative scheme is to utilize more convolutional layers to form the SFE module, but we found that this is inefficient and unnecessary. A single 3 × 3 convolutional layer can already extract low-frequency information well, and it balances ISR accuracy and computational burden.
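Assuming a single-channel thermal input and 64 feature channels (both hypothetical values for illustration), the SFE module reduces to one padded 3 × 3 convolution in PyTorch:

```python
import torch
import torch.nn as nn

# SFE sketch: a single 3x3 convolution lifts the LR thermal image
# (1 channel assumed) into a 64-channel shallow feature map x_0.
sfe = nn.Conv2d(in_channels=1, out_channels=64, kernel_size=3, padding=1)

lr_thermal = torch.randn(1, 1, 48, 48)   # dummy I_LR
x0 = sfe(lr_thermal)                     # x_0 = f_SFE(I_LR)
```

With padding of 1, the spatial size of x_0 matches that of I_LR, only the channel count changes.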

Deep Feature Extraction Module
As shown in Figure 3, the output x_0 of the SFE module is then used to extract high-level patterns through a DFE module composed of K DABs. Each DAB consists of a dynamic weights block (DWB) as well as an attention branch and a non-attention branch. The structure of the DAB is shown in Figure 4.

Dynamic Weights Block
As mentioned in Section 3, simply stacking attention blocks fails to lead to optimal performance gains. Therefore, we propose a dynamic attention mechanism to maximize the enhancement of high-frequency information. The specific implementation of the dynamic attention mechanism is shown in Figure 4.
The input x_{i−1} ∈ R^(C_{i−1}×H_{i−1}×W_{i−1}) passes through the global average pooling (GAP) layer of the i-th DAB to obtain a feature vector z_{i−1} ∈ R^(C_{i−1}×1×1), where the k-th statistic of z_{i−1} is calculated according to: z_k = (1 / (H_{i−1} × W_{i−1})) Σ_{m=1}^{H_{i−1}} Σ_{n=1}^{W_{i−1}} x_k(m, n), where x_k and z_k are the feature map of the k-th channel and its corresponding GAP output, respectively. There are many other sophisticated methods for aggregating global information; we use the simplest GAP to achieve this goal efficiently. Intuitively, the usual approach is to follow z_{i−1} with two fully connected layers, namely a channel reduction layer and a channel increase layer, to enhance the information interaction between channels [27]. However, we use 1D convolution to simplify the above operation and achieve superior performance while reducing the number of parameters. We demonstrate below the necessity of choosing 1D convolution instead of two fully connected layers.
If we choose two fully connected layers, the output w can be expressed as: w = σ(f_{W_1,W_2}(z_{i−1})), where w ∈ R^(C_{i−1}×1×1), σ is the sigmoid function, and the specific form of f_{W_1,W_2} is: f_{W_1,W_2}(z) = W_2 RELU(W_1 z), where RELU is the rectified linear unit [52]. As the weights of the channel reduction layer W_1 and the channel increase layer W_2 have sizes C × (C/r) and (C/r) × C, respectively, r is an adjustable attenuation parameter. The above operation reduces the parameter burden but destroys the direct one-to-one correspondence between channels and weights [53]. One weight element of a fully connected layer utilizes all channel information; however, the operation shown in Equation (4) first maps full-size features to a low-dimensional space and then maps them back to a high-dimensional space. The direct relationship between channels and weights is broken. We show that 1D convolution elegantly preserves the explicit correspondence between channels and weights.
We enhance the cross-channel interaction using 1D convolution as shown in the following equation: (w_attn, w_n−attn) = σ_softmax(f_FC(f_C1D^t(z_{i−1}))), where f_C1D^t(·) represents a 1D convolution with filter size t, and f_FC(·) and σ_softmax(·) are a fully connected layer and the softmax function, respectively. Our goal is to generate two weights, one each for the attention and non-attention branches, from the input feature maps. We use GAP based on the following two considerations: (1) As the depth of the network increases, GAP effectively increases the receptive field, thereby extracting global image information. (2) Compared with directly applying the fully connected layer, GAP drastically reduces parameters, suppresses overfitting, and can flexibly adapt to changes in the size of the input feature maps. The outputs of our proposed DWB are therefore the weights w_attn and w_n−attn, which we use to recalibrate the attention branch and non-attention branch, respectively. The above operation can be formulated as: x_i = f_1×1(w_attn,i−1 · x_attn,i−1 + w_n−attn,i−1 · x_n−attn,i−1), where w_attn,i−1 and w_n−attn,i−1 represent the weights of the attention branch and the non-attention branch, respectively, x_attn,i−1 is the output of the attention branch, and x_n−attn,i−1 is the output of the non-attention branch. f_1×1(·) is a convolution layer with a convolution kernel size of 1 × 1.
In order to reduce the learning difficulty and compress the filter space, we let w_attn + w_n−attn = 1, where the softmax function produces the normalized weights.
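The DWB described above can be sketched in PyTorch as follows; this is our own minimal reading of Figure 4, with a fixed 1D kernel size of t = 3 assumed for illustration (the adaptive choice of t is discussed next).

```python
import torch
import torch.nn as nn

class DynamicWeightsBlock(nn.Module):
    """DWB sketch: GAP -> 1D conv -> FC -> softmax over the two branches."""
    def __init__(self, channels: int, t: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)       # GAP: (N, C, 1, 1)
        self.conv1d = nn.Conv1d(1, 1, kernel_size=t, padding=t // 2, bias=False)
        self.fc = nn.Linear(channels, 2)          # logits for the two branches

    def forward(self, x):
        z = self.pool(x).squeeze(-1).transpose(1, 2)   # (N, 1, C)
        z = self.conv1d(z).flatten(1)                  # local cross-channel interaction
        w = torch.softmax(self.fc(z), dim=1)           # enforces w_attn + w_n-attn = 1
        return w[:, 0], w[:, 1]

w_attn, w_nattn = DynamicWeightsBlock(64)(torch.randn(2, 64, 16, 16))
```

Because the softmax normalizes the two logits, the branch weights always sum to one, matching the constraint above.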
The size t of the 1D convolution kernel determines the interaction range between local channels. In order to avoid manually determining t, which consumes time and resources, we determine t automatically in the following way: t = | log2(C_{i−1})/γ + b/γ |_odd, where |z|_odd represents the odd number closest to z, C_{i−1} denotes the number of channels, and γ and b are two hyper-parameters, which we empirically set to γ = 2 and b = 1, respectively. We use a nonlinear mapping instead of a linear mapping to extend the representation capacity, so that the local interaction range grows with the channel dimension size.
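Under one common rounding convention (assumed here, since the rounding of |·|_odd can be implemented in more than one way), the mapping from channel count to kernel size can be computed as:

```python
import math

def adaptive_kernel_size(channels: int, gamma: int = 2, b: int = 1) -> int:
    """t = |log2(C)/gamma + b/gamma|_odd; truncate, then bump even values to odd."""
    t = int(abs(math.log2(channels) / gamma + b / gamma))
    return t if t % 2 else t + 1    # force an odd kernel size

# With gamma = 2 and b = 1: C = 64 -> t = 3, C = 256 -> t = 5.
```

Larger channel dimensions thus automatically receive a wider local interaction range.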

Attention Branch
As shown in Figure 4, the attention branch consists of a channel attention block (CAB) and pixel attention block (PAB), and their structures are shown in Figure 5 and Figure 6b, respectively.
Similar to the DWB, we use GAP to obtain channel-by-channel global information, and then perform cross-channel interaction with a 1D convolution without channel dimensionality reduction. Our efficient CAB reduces model complexity while capturing local dependencies between channels. The characteristics of the feature maps vary greatly from channel to channel. As shown in Figure 6a, the conventional spatial attention block weights all pixels of each channel feature map equally, which cannot fully enhance the spatial information. Inspired by PAN [54], we propose pixel attention with residual connections, which weights the pixels of each channel independently, as shown in Figure 6b.
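The two blocks can be sketched as follows; this is a minimal PyTorch illustration assuming a fixed 1D kernel size of 3 for the CAB and a 1 × 1 convolution for the pixel attention map (as in PAN [54]) — the exact layer layout in Figures 5 and 6b may differ.

```python
import torch
import torch.nn as nn

class CAB(nn.Module):
    """Efficient channel attention sketch: GAP + 1D conv, no channel reduction."""
    def __init__(self, channels: int, t: int = 3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size=t, padding=t // 2, bias=False)

    def forward(self, x):
        z = self.pool(x).squeeze(-1).transpose(1, 2)                    # (N, 1, C)
        w = torch.sigmoid(self.conv(z)).transpose(1, 2).unsqueeze(-1)   # (N, C, 1, 1)
        return x * w                                                    # per-channel rescale

class PAB(nn.Module):
    """Pixel attention sketch: a 3D (C x H x W) map rescales every pixel independently."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        return x + x * torch.sigmoid(self.conv(x))   # residual pixel attention
```

Unlike SA's 2D mask shared across channels, the PAB's sigmoid map has full C × H × W extent.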

Non-Attention Branch
As a complement, we introduce the non-attention branch to extract information ignored by the attention branch. We use a single 3 × 3 convolutional layer to form the non-attention branch. It is worth noting that non-attention branches with more complex structures can be adopted as alternatives, but a 3 × 3 convolutional layer suits our proposed lightweight structure.
To sum up, the output of the DFE module is: x_K = f_DAB^K(f_DAB^{K−1}(· · · f_DAB^1(x_0) · · ·)), where f_DAB^i(·) denotes the function of the i-th DAB.
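Putting the pieces together, one DAB can be sketched end to end; for brevity the attention branch is collapsed into a single sigmoid-gated 1 × 1 convolution here, standing in for the CAB + PAB pair, so every name in this block is a simplified assumption rather than the paper's exact layout.

```python
import torch
import torch.nn as nn

class DAB(nn.Module):
    """One dynamic attention block (simplified, self-contained sketch)."""
    def __init__(self, c: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(c, 2)                                    # DWB: two branch logits
        self.attn = nn.Sequential(nn.Conv2d(c, c, 1), nn.Sigmoid())  # stand-in for CAB+PAB
        self.non_attn = nn.Conv2d(c, c, 3, padding=1)                # non-attention branch
        self.fuse = nn.Conv2d(c, c, 1)                               # f_1x1 fusion

    def forward(self, x):
        w = torch.softmax(self.fc(self.pool(x).flatten(1)), dim=1)   # (N, 2), sums to 1
        x_attn = x * self.attn(x)
        x_non = self.non_attn(x)
        out = self.fuse(w[:, 0, None, None, None] * x_attn
                        + w[:, 1, None, None, None] * x_non)
        return out + x                                               # local residual

feats = torch.randn(1, 64, 24, 24)
for _ in range(4):          # the DFE stacks K such blocks; K = 4 only for illustration
    feats = DAB(64)(feats)  # channel count and spatial size are preserved
```

Because each DAB preserves the feature shape, the K blocks compose directly as in the DFE equation above.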

Feature Reconstruction Module
The FRec module is used to reconstruct the outputs of the DFE module into the final HR thermal image. There are few works that carefully design upsampling modules. We design two FRec modules, for the ×2/×3 and ×4 scale factors, respectively, as shown in Figure 7. We use nearest neighbor interpolation in the FRec module to upsample the feature maps to the desired size, and leverage the PAB to enhance the information representation. Since the ISR task with the ×4 scale factor is more burdensome, we designed the FRec module shown in Figure 7b for the ×4 scale factor.
Overall, the generated HR thermal image output I_SR can be expressed as: I_SR = f_FRec(x_K) + f_up(I_LR), where f_FRec(·) and f_up(·) represent the FRec module and bilinear interpolation, respectively. We interpolate the LR thermal image to the desired size, allowing the network to learn only the residual information, thereby reducing the burden and improving the stability of network training.
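A minimal sketch of the reconstruction path (the PAB and the exact convolution count in Figure 7 are omitted; the two-convolution body is our assumption) combines nearest-neighbor feature upsampling with the bilinear global residual:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FRec(nn.Module):
    """FRec sketch: nearest upsampling of features plus a bilinear LR skip."""
    def __init__(self, c: int, scale: int, out_ch: int = 1):
        super().__init__()
        self.scale = scale
        self.body = nn.Sequential(
            nn.Conv2d(c, c, 3, padding=1), nn.LeakyReLU(0.2, inplace=True),
            nn.Conv2d(c, out_ch, 3, padding=1))

    def forward(self, feats, lr):
        up = F.interpolate(feats, scale_factor=self.scale, mode='nearest')
        base = F.interpolate(lr, scale_factor=self.scale,
                             mode='bilinear', align_corners=False)  # f_up(I_LR)
        return base + self.body(up)   # the network only learns the residual

sr = FRec(64, scale=2)(torch.randn(1, 64, 48, 48), torch.randn(1, 1, 48, 48))
```

The interpolated LR image supplies the low-frequency base, so the convolutional path concentrates its capacity on high-frequency detail.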

Training and Testing Datasets
We use the dataset proposed by Rivadeneira et al. [47] as the training dataset. This dataset served as the training and testing dataset for the PBVS [55] Thermal Image SR (TISR) challenge, so we abbreviate it as the Challenge dataset. The Challenge dataset was created by capturing thermal images from three thermal cameras mounted on a panel. The panel was installed on a car and controlled by a multithreaded script developed to acquire images simultaneously. The specifications of the three thermal cameras and the composition of the Challenge dataset are shown in Table 2 and Table 3, respectively. Since the medium-resolution (MR) Axis and LR Domo thermal images of the Challenge dataset are not completely aligned, we adopt the Flir HR subdataset as the training dataset, and the corresponding LR counterparts are obtained by the bicubic interpolation method. As for the testing datasets, in addition to the Challenge testing dataset, in order to reflect the superior generalization of the proposed LDASRNet, we also employ Iray (http://iray.iraytek.com:7813/apply/E_Super_resolution.html/, accessed on 20 September 2023) and FLIR (https://www.flir.in/oem/adas/adas-dataset-form/, accessed on 20 September 2023) as two additional testing datasets.

Evaluation Metrics
PSNR and SSIM are adopted to quantitatively evaluate the performance of the proposed LDASRNet and the compared models. All metrics are evaluated on the Y channel of the YCbCr color space. Following previous research [1], we crop s pixels around the border of the generated I_SR, where s = 2, 3, 4 is the corresponding scale factor.
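The evaluation protocol can be sketched as follows (a hypothetical helper, assuming 8-bit Y-channel arrays in [0, 255]; the paper's exact cropping code is not given):

```python
import numpy as np

def psnr_cropped(sr, hr, scale):
    """PSNR after cropping an s-pixel border, s being the scale factor."""
    s = scale
    sr = sr[s:-s, s:-s].astype(np.float64)
    hr = hr[s:-s, s:-s].astype(np.float64)
    mse = np.mean((sr - hr) ** 2)
    return float('inf') if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)
```

Cropping removes border pixels that the convolution padding reconstructs unreliably, so the metric reflects only the interior of the super-resolved image.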

Data Augmentation Method
Previous studies have shown that feature domain data augmentation (DA) methods harm the performance of the ISR task, while pixel domain DA methods boost ISR accuracy. Inspired by CutBlur [56] and MPRANet [1], we adopt a mixture of DA (MoDA) strategy when training the proposed LDASRNet. Specifically, in addition to random horizontal/vertical flipping and rotation of the LR-HR pairs in each training iteration, one of the following pixel domain DA methods is also randomly selected to augment the LR-HR image pairs: CutMixup [56], RGB permute [56], Blend [56], CutBlur [56], CutOut [57], CutMix [58] and Mixup [59]. A quantitative performance comparison of the proposed LDASRNet with various DA methods is shown in Table 4. Table 4 demonstrates that the pixel domain DA methods can effectively improve ISR performance. For example, compared with the baseline model, the proposed LDASRNet trained using RGB permute or Mixup improved the PSNR metric by at least 0.09 dB, while using the CutMixup or CutBlur method improved it by 0.12 dB. Furthermore, we obtained the highest performance gains using the MoDA strategy. The results show that the pixel domain MoDA strategy can effectively boost accuracy in the ISR task.
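The MoDA dispatch step can be sketched as below; only Mixup [59] is implemented here, the other six methods are named placeholders, and the Beta parameter alpha = 1.2 is an assumed hyper-parameter, not a value from the paper.

```python
import random
import numpy as np

DA_POOL = ['CutMixup', 'RGB permute', 'Blend', 'CutBlur', 'CutOut', 'CutMix', 'Mixup']

def mixup(lr_a, hr_a, lr_b, hr_b, alpha=1.2):
    """Mixup on an LR-HR pair: blend two pairs with one Beta-sampled weight."""
    lam = random.betavariate(alpha, alpha)
    return lam * lr_a + (1 - lam) * lr_b, lam * hr_a + (1 - lam) * hr_b

# Per training iteration: flips/rotation first, then one randomly drawn
# pixel-domain DA from the pool.
chosen = random.choice(DA_POOL)
```

Because the same blending weight is applied to the LR input and its HR target, the LR-HR correspondence that the network must learn is preserved under augmentation.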

Implementation Details
We train the proposed LDASRNet using the PyTorch [19] framework. We adopt the AdamW [60] optimizer instead of the Adam [61] optimizer, and ablation experiments show that AdamW can slightly improve performance compared to Adam. There are 2000 epochs in total, and the initial learning rate of 5 × 10^−4 is halved every 200 epochs. The cropped ground truth resolutions corresponding to the ×2, ×3 and ×4 scale factors are 96 × 96, 128 × 128 and 192 × 192, respectively.
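The schedule amounts to a simple step decay, which can be expressed as a closed-form helper (our own formulation of the rule stated above):

```python
def learning_rate(epoch: int, base_lr: float = 5e-4, step: int = 200) -> float:
    """Step decay: halve the learning rate every `step` epochs."""
    return base_lr * 0.5 ** (epoch // step)

# Over 2000 epochs the rate falls from 5e-4 to 5e-4 * 0.5**9 in the final stage.
```

In PyTorch this corresponds to a StepLR-style scheduler with step size 200 and gamma 0.5.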
We also propose a variant of LDASRNet named LDASRNet-T. Its only difference from LDASRNet is that the non-attention branch in LDASRNet-T consists of a convolution layer with a 1 × 1 kernel.

Ablation Experiments
AdamW vs. Adam. We find that LDASRNet trained with the AdamW optimizer achieves higher PSNR and SSIM scores than with Adam. The results are shown in Table 5. We speculate that this is because AdamW applies the weight decay term directly to the weights rather than folding it into the gradient. Table 5 shows that LDASRNet trained with the AdamW optimizer achieves PSNR improvements of 0.01 dB, 0.04 dB and 0.08 dB over Adam on the Challenge, FLIR and Iray testing datasets, respectively. Moreover, the MoDA training strategy improves accuracy with either the AdamW or the Adam optimizer, further confirming the rationality of MoDA for the ISR task.
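The speculated mechanism can be made concrete: in L2-style Adam the decay term passes through the adaptive moment normalization, whereas AdamW applies it directly to the weight [60]. The single-step sketch below (bias correction omitted for brevity; hyperparameter values illustrative) shows how the two updates diverge:

```python
import math

def adam_l2_step(w, grad, m, v, lr=1e-3, wd=1e-2,
                 b1=0.9, b2=0.999, eps=1e-8):
    """Adam with L2 regularization: weight decay is folded into the
    gradient and hence rescaled by the adaptive denominator."""
    g = grad + wd * w
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    return w - lr * m / (math.sqrt(v) + eps), m, v

def adamw_step(w, grad, m, v, lr=1e-3, wd=1e-2,
               b1=0.9, b2=0.999, eps=1e-8):
    """AdamW: decoupled weight decay applied directly to the weight,
    outside the adaptive update."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad * grad
    return w - lr * m / (math.sqrt(v) + eps) - lr * wd * w, m, v
```

With zero gradient and nonzero weight, AdamW shrinks the weight by exactly lr·wd·w, while the L2 form pushes the decay through the √v denominator and produces a much larger step, which is the decoupling effect.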
Validity of DAB structure. To verify the effectiveness of the proposed DAB structure, our ablation experiments compare the impact of a single path and two paths (i.e., the attention branch and the non-attention branch) on ISR performance. All experiments were performed on the Challenge testing dataset with the ×4 scale factor, and the results are shown in Table 6. Equipped with only the non-attention branch or only the attention branch, the PSNR metric is 0.28 dB and 0.51 dB lower than LDASRNet, respectively. This confirms that the attention and non-attention branches complement each other. Note that using only the non-attention branch yields a 0.23 dB improvement over using only the attention branch, but the former has 2.7 times as many parameters as the latter, which shows that our attention branch composed of CAB and PAB is an efficient lightweight structure.
Besides the element-wise addition we adopt, fusion methods for the two branches also include strategies such as concatenation [54] and adaptive weighting [18]. We compare the performance and parameter trade-offs of addition versus concatenation and adaptive weighting. As shown in cases 3 and 5 in Table 6, concatenation has 25.6 K more parameters than the addition and adaptive weighting fusion methods but yields the worst performance. We adopt the simplest addition strategy, achieving the best accuracy while maintaining the smallest number of parameters.
Model capacity. The capacity of the model, i.e., the width and depth of the network, is critical to ISR accuracy. The number of filters in each convolutional layer in the DAB of LDASRNet is 40. To ablate the impact of model capacity on performance, we propose two variant networks, LDASRNet w/Fewer Channels and LDASRNet-T. There are 32 feature channels in the DAB of LDASRNet w/Fewer Channels. The number of channels in LDASRNet-T remains the same as in LDASRNet, but the kernel size of the convolutional layer in the non-attention branch is 1 × 1. Cases 6 and 8 in Table 6 show that fewer channels or a small convolution kernel in the non-attention branch is not conducive to the final thermal image SR performance. Our LDASRNet achieves optimal performance while remaining lightweight.

Configuration of two-path structure. We show the ablation experimental results in Table 7 to verify the impact of different configurations of the two-path structure. The two pairs of configurations, case 1 vs. case 5 and case 4 vs. case 7, show that pairing the non-attention branch with a CAB or PAB lacking the dynamic attention block fails to always bring positive gains. The case 6 results indicate that dynamically assigning weights to CAB and PAB is 0.2 dB higher than the cascaded CAB and PAB structure, so the dynamic attention strategy obtains clear performance improvements. Our non-attention and attention branches, supplemented with the dynamic weight structure, achieve superior accuracy.
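The dynamic weight mechanism validated above can be sketched numerically: global average pooling reduces each channel to a scalar, a 1D convolution (k = 5, replacing the MLP) mixes neighboring channel descriptors, and a sigmoid yields the coefficients that reweight the two branches. The helper names and kernel values below are illustrative, not the released code:

```python
import math

def gap(feature_maps):
    """Global average pooling: one scalar descriptor per channel."""
    return [sum(sum(row) for row in fm) / (len(fm) * len(fm[0]))
            for fm in feature_maps]

def conv1d_same(x, kernel):
    """Zero-padded 1D convolution over the channel descriptors; a k=5
    kernel lets each channel interact with its four nearest neighbors
    without any dimensionality-reducing fully connected layer."""
    k, pad = len(kernel), len(kernel) // 2
    padded = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[j] * padded[i + j] for j in range(k))
            for i in range(len(x))]

def dynamic_weights(feature_maps, kernel):
    """GAP -> 1D conv -> sigmoid: per-channel coefficients used to
    reweight the attention and non-attention branch outputs."""
    return [1.0 / (1.0 + math.exp(-z))
            for z in conv1d_same(gap(feature_maps), kernel)]
```

Because the 1D convolution has only k parameters regardless of channel count, this block stays far lighter than a full multi-layer perceptron over the pooled features.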

Quantitative Experiments
The proposed LDASRNet is a lightweight ISR network, and we choose the following networks with fewer than 1M parameters for comparison: SRCNN [15], FSRCNN [16], SR-LUT [62], PAN [54], DRRN [63], A2F [64], AWSRN-S [65], IMDN [66], VDSR [32] and A2N [18]. To verify that LDASRNet has comparable or even better performance than larger networks, we select AWSRN [65], SRMDNF [67], CARN [68], ChaSNet [44], MPRANet [1] and MDSR [26], with parameters ranging from 1.4M to 6.5M, for comparison. In addition, we select RCAN [24] and EDSR [26], two networks with more than 10M parameters (EDSR has more than 40M), as references. The quantitative results of LDASRNet and the above models on the Challenge, FLIR and Iray datasets are shown in Table 8, comparing three metrics: PSNR, SSIM and FLOPs. Specifically, Figure 8 shows that on the thermal image img0005 from the Iray test dataset with the ×2 scale factor, our LDASRNet achieves a PSNR 0.11 dB higher than the second-best model AWSRN, i.e., 36.33 dB vs. 36.22 dB. Similarly, Figure 9 shows that LDASRNet outperforms the second-best and third-best networks by 0.05 dB/0.0012 and 0.09 dB/0.0027 in the PSNR/SSIM metrics, respectively, again achieving the highest quantitative measurements. Figure 10 indicates that LDASRNet achieves the maximum performance gain on img0037 at the ×4 scale factor of the Challenge test dataset, with a PSNR of 35.22 dB and an SSIM of 0.9195, outperforming A2N and AWSRN by 0.31 dB and 0.0042, respectively.

Compare with Lucy-Richardson-Rosen Algorithm
In addition, we further compare the proposed LDASRNet with the recent Lucy-Richardson-Rosen algorithm (LRRA) [69,70], which exhibits superior deblurring performance. Empirically, we set the maximum number of LRRA iterations to 8, and we use a synthetic Gaussian-type point-spread function (PSF) with standard deviation 0.3 and a filter size of 3 × 3. The quantitative results of our LDASRNet and LRRA on the ×2, ×3 and ×4 scale factors of the Challenge, FLIR and Iray test datasets are shown in Table 10, and the comparison of the visual perceptual quality of the generated images is shown in Figure 11. Note that, due to space limitations, we only show img0011 from Iray, img0006 from FLIR and img0035 from Challenge. Table 10 shows that our LDASRNet achieves the highest metrics compared to LRRA on all three test datasets at all scale factors. Specifically, on Challenge with the ×2, ×3 and ×4 scale factors, our LDASRNet is 7.07 dB, 6.4 dB and 6.23 dB higher than LRRA in the PSNR metric, respectively. Similarly, on the FLIR and Iray test datasets, LDASRNet is at least 3.84 dB/0.0782 and 0.98 dB/0.0310 higher than LRRA in the PSNR/SSIM metrics, respectively, i.e., 35.48 dB/0.8683 vs. 31.64 dB/0.7901 and 27.35 dB/0.8873 vs. 26.37 dB/0.8563. These results show that the proposed LDASRNet yields a higher signal-to-noise ratio and more complete structural recovery than the reconstruction results of LRRA.
Qualitatively, Figure 11 shows that the HR thermal images generated by our LDASRNet possess more high-frequency details than those of LRRA. For example, for the buildings in img0011 and img0006, the edges and textures in the results of our method are clearer, while the LRRA images are blurry. Similarly, for img0035 in Challenge, the fence generated by LRRA clearly lacks high-frequency details and has poor perceptual quality compared to our result.
Notably, LRRA demonstrated excellent deblurring performance in a previous study [70]. We believe that the input LR thermal images are almost free of blur, which is the main reason why LRRA does not perform as expected here. Accordingly, the comparisons with both deep learning-based networks and LRRA demonstrate the superior performance of the proposed LDASRNet for thermal image SR. The high HR thermal image reconstruction accuracy and compact model size of the proposed LDASRNet show its potential for deployment on edge devices.

Conclusions
In this paper, we show that simply stacking attention modules at different depths of a deep network is suboptimal. Based on this observation, we propose LDASRNet, a lightweight thermal image super-resolution network based on the dynamic attention mechanism for infrared sensors. The dynamic weight block in the proposed LDASRNet provides masks to the attention and non-attention branches according to the input features to enhance high-frequency detail extraction. Specifically, we use 1D convolution without dimensionality reduction to replace the fully connected layer, enriching the interactions between channels. The attention branch, consisting of efficient channel attention and pixel attention blocks, complements the non-attention branch to extract local and global features. Qualitative and quantitative experiments on three testing datasets covering various scenarios show that the proposed LDASRNet recovers high-frequency details accurately, and its lightweight structure has the potential to be deployed on edge devices.

Figure 1. Schematic diagram of channel attention and spatial attention in different paradigms.

Figure 2. Visualization of feature maps. In the input and output feature maps, white, red, and blue pixels indicate zero, positive, and negative values, respectively. The brighter the pixels in the attention maps, the larger the value of the attention coefficients.

Figure 3. The structure of the proposed lightweight dynamic attention super-resolution network (LDASRNet). The LDASRNet consists of the following three modules: (1) Shallow feature extraction (SFE) module. The input LR thermal image first passes through an SFE module consisting of a 3 × 3 convolution kernel to extract low-level features. Our experiments show that using a single 3 × 3 convolution is an acceptable balance between performance and parameters. (2) Deep feature extraction (DFE) module. The output of the SFE module is used to extract deeper features through the DFE module. As a key component of LDASRNet, the DFE module consists of K dynamic attention blocks for dynamically enhancing high-frequency feature extraction according to the input feature maps. (3) Feature reconstruction (FRec) module. The FRec module reconstructs the outputs of the DFE module into the final HR thermal image. Because the LR spatial resolution at the ×4 scale factor is lower than at the ×2 and ×3 scale factors, reconstruction is more difficult; we therefore designed two types of FRec modules for the ×2/×3 and ×4 scale factors, respectively.

Figure 4. The overall structure of the dynamic attention block. AvgPool and FC represent the global average pooling and fully connected operations, respectively.

Figure 5. The overall structure of the channel attention block. GAP represents global average pooling, and k = 5 is the 1D convolution kernel size.

Figure 6. The overall structures of the spatial and pixel attention blocks. (a) Spatial attention block. (b) The proposed residual pixel attention block.

Figure 7. The overall structures of the feature reconstruction blocks. NN Inter is the nearest neighbor interpolation operation. (a) Feature reconstruction block for the ×2 and ×3 scale factors. (b) Feature reconstruction block for the ×4 scale factor.

Figure 10. The qualitative results on the Challenge testing dataset with the ×4 scale factor.

Figure 11. Comparison of the visual quality of thermal images processed by the proposed LDASRNet and LRRA.

Table 1. Attention modules contribute to thermal image SR performance. The attention module index indicates that the i-th position is an attention module and the rest are residual blocks; PSNR and SSIM are the performance metrics. Red and blue text indicate the best and second-best performance, respectively.

Table 2. Thermal camera specifications for creating the Challenge dataset.

Table 3. The structure and composition of the Challenge dataset.

Table 4. Quantitative performance comparison of the proposed LDASRNet using various data augmentation methods. We take LDASRNet trained using horizontal/vertical flipping and random rotation without the MoDA strategy as the baseline. Results are reported on the Challenge testing dataset with the ×4 scale factor. Red and green text indicate the best metric and the gain relative to the baseline, respectively.

Table 5. Performance comparison between the AdamW and Adam optimizers. All results are reported at the ×2 scale factor. Red text indicates the best metrics.

Table 6. Ablation studies: effects of different DAB structural configurations. LDASRNet w/Fewer Channels means that the number of feature map channels in the DAB is reduced to 32. All experiments were performed on the Challenge testing dataset with the ×4 scale factor. Red text indicates the best metrics.

Table 7. Ablation studies: effects of various structural configurations of the two paths. C-PAB means the channel- and pixel-wise attention block. All experiments were performed on the Challenge testing dataset with the ×4 scale factor. Red text indicates the best metric.

Table 8. Quantitative comparison results of the PSNR/SSIM/FLOPs metrics. Red and blue text indicate the best and second-best performance, respectively (except those with parameters greater than 10 M). † indicates that the model has more parameters than our LDASRNet but worse accuracy. From Figures 8-10, the proposed LDASRNet not only obtains the best PSNR and SSIM metrics, but also recovers the most complete details compared with the compared networks.

Figure 8. The qualitative results on the Iray testing dataset with the ×2 scale factor.

Figure 9. The qualitative results on the FLIR testing dataset with the ×3 scale factor.

Table 10. Quantitative comparison results of the proposed LDASRNet and LRRA. PSNR (dB) and SSIM are the metrics.