Technical Note

A Feature-Driven Inception Dilated Network for Infrared Image Super-Resolution Reconstruction

1 Key Laboratory of Infrared System Detection and Imaging Technology, Chinese Academy of Sciences, Shanghai 200083, China
2 Shanghai Institute of Technical Physics, Chinese Academy of Sciences, Shanghai 200083, China
3 University of Chinese Academy of Sciences, Beijing 100049, China
* Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(21), 4033; https://doi.org/10.3390/rs16214033
Submission received: 5 August 2024 / Revised: 27 September 2024 / Accepted: 23 October 2024 / Published: 30 October 2024

Abstract:
Image super-resolution (SR) algorithms based on deep learning achieve good visual performance on visible images. Because infrared (IR) images have blurred edges and low contrast, methods transferred directly from visible images to IR images perform poorly and ignore the demands of downstream detection tasks. Therefore, an Inception Dilated Super-Resolution (IDSR) network with multiple branches is proposed. A dilated convolutional branch captures high-frequency information to reconstruct edge details, while a non-local operation branch captures long-range dependencies between any two positions to maintain the global structure. Furthermore, deformable convolution is utilized to fuse features extracted from the different branches, enabling adaptation to targets of various shapes. To enhance the detection performance on low-resolution (LR) images, we crop the images into patches based on target labels before feeding them to the network. This allows the network to focus on learning the reconstruction of the target areas only, reducing the interference of background areas in the reconstruction of the target areas. Additionally, a feature-driven module is cascaded at the end of the IDSR network to guide high-resolution (HR) image reconstruction with feature prior information from a detection backbone. The method has been tested on the FLIR Thermal Dataset and the M3FD Dataset and compared with five mainstream SR algorithms. The results demonstrate that our method effectively maintains image texture details. More importantly, our method achieves 80.55% mAP on the FLIR dataset and 74.70% mAP on the M3FD dataset, outperforming the other methods in detection accuracy.

1. Introduction

IR imaging technology utilizes an IR sensor that converts detected infrared radiation into observable images [1]. It is less affected by external light, capable of long detection distances, and strong at penetrating smoke and haze. However, IR imaging systems are often limited by electronic noise and degradation during transmission, resulting in low resolution and blurred details. These characteristics, which differ from those of visible images, make it necessary to propose an SR algorithm appropriate for IR images.
On the other hand, most reconstruction algorithms aim to improve the perceptual quality of reconstructed images from various perspectives, including changing sampling methods, introducing residual modules, establishing skip connections, improving loss functions, adding attention mechanisms and applying transformer models. For example, ESPCN [2] proposed an efficient sub-pixel convolution layer to upscale LR feature maps. EDSR [3] achieved improved SR reconstruction results by removing unnecessary modules from the conventional ResNet architecture. BSRN [4] introduced more effective attention modules to enhance the model's capacity. Liang et al. proposed SwinIR [5], an image SR algorithm based on the Swin Transformer [6], which uses the shifted-window mechanism to model long-range dependencies. In [7], the authors proposed the Recursive Generalization Transformer (RGT) for image SR.
However, these mainstream image SR reconstruction algorithms often overlook both the impact of reconstructed SR images on downstream detection tasks and the potential connection between the reconstruction and detection tasks [8]. To address these challenges, joint training strategies for multi-task learning have been proposed in other fields. For instance, in image fusion, Tang et al. [9] concatenated the image fusion network with a semantic segmentation network to guide image fusion with semantic information. In image denoising, Liu et al. [10] used the joint loss of a semantic segmentation network and a denoising network to update the denoising network via back-propagation; the final denoising network was able to remove image noise while retaining the semantic features required for high-level vision tasks.
In this paper, we report on the Inception Dilated Super-Resolution (IDSR) network. By constructing multiple branches, the network integrates detailed texture features extracted from the convolution branch with long-range dependencies captured by the non-local [11] operation branch, so that the two complement each other to extract enriched features. Meanwhile, to widen the network's receptive field without changing the image resolution or losing image details, dilated convolutions are used in the convolution branch to adapt to the low-contrast characteristics of IR images. Secondly, to improve the accuracy of an object detector on SR images, the network no longer learns from the entire image but is trained on selected target areas, reducing the reconstruction of flat and irrelevant regions of the image. Finally, the SR network and the detection model's backbone are cascaded to ensure that the SR images have good visual perceptual quality while retaining the high-level semantic features required for downstream detection tasks.

2. Materials and Methods

In this section, the proposed IR image SR reconstruction network for object detection will be introduced in detail. It mainly relies on three parts: a data preprocessing method, a multi-branch IR image SR network and a feature-driven module. The overall structure of the network is shown in Figure 1.
Firstly, LR infrared images are preprocessed to obtain image patches. Secondly, the super-resolution (SR) patch $I_{SR}$ generated by the SR network and the high-resolution (HR) patch $I_{HR}$ are simultaneously sent to the feature-driven module to extract prior knowledge of detection features. Furthermore, the feature-driven loss is leveraged to guide the flow of high-level semantic information from object detection back to the SR network.

2.1. Data Preprocessing

In general image SR algorithms, the typical operation is to randomly crop the image into patches. However, in natural images, the target pixels constitute a relatively small proportion of the entire image, and most pixels belong to flat areas that users are generally not concerned about. Random cropping therefore causes the network to learn more about the SR reconstruction of these irrelevant areas, which does not align with the needs of downstream detection tasks. It also splits complete targets across patches, preventing the network from effectively learning the structural features of the target.
Therefore, in this section, we propose a data preprocessing operation tailored to the target area. We crop the corresponding target areas based on the labels provided in the dataset. Because the environmental area around a target provides guidance for reconstruction, for targets that occupy only a small number of pixels we expand the target region outward to cut out patches of the specified size. For target areas larger than the specified size, we employ data augmentation to enhance the generalization ability of the model [12]; the specific operation is to resize the target region to the specified size.
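To make the preprocessing concrete, the following minimal sketch illustrates label-driven cropping for a small target; the 64-pixel patch size and the (x1, y1, x2, y2) box format are assumptions for illustration, and the resize branch for large targets is omitted.

```python
import numpy as np

def crop_target_patch(image: np.ndarray, box: tuple, patch: int = 64) -> np.ndarray:
    """Cut a fixed-size patch centred on a labelled target.

    `box` is (x1, y1, x2, y2) in pixels and `patch` = 64 is an assumed size;
    the image is assumed to be at least `patch` pixels in each dimension.
    Targets larger than the patch would instead be resized (omitted here).
    """
    h, w = image.shape[:2]
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2             # target centre
    left = int(np.clip(cx - patch // 2, 0, w - patch))  # expand outward, stay in bounds
    top = int(np.clip(cy - patch // 2, 0, h - patch))
    return image[top:top + patch, left:left + patch]
```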

2.2. IDSR Network

To address the issues of unclear texture and low contrast in infrared images, we design an SR network with multiple branches, called IDSR. Its overall structure is consistent with that of SwinIR [5]. As shown in Figure 2, it consists of three basic modules: a shallow feature extraction module, a deep feature extraction module and a reconstruction module.

2.2.1. Shallow Feature Extraction

The shallow feature extraction is composed of a 3 × 3 convolutional layer. Research has shown that using a convolutional layer in the early stage of the network can provide richer local feature information to the deep network [13] and increase the inductive bias of the network [14].

2.2.2. Deep Feature Extraction

The deep feature extraction is the crucial part of the entire network, which contains K Residual Inception Dilated Blocks (RIDBs) and a deformable convolutional layer. The deformable convolution introduces offsets in the receptive field, allowing the convolution region to be better matched with the target feature.
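As an illustration of how deformable convolution can fuse the branch features, the sketch below uses torchvision's DeformConv2d with offsets predicted from the input; the layer sizes and the offset-prediction convolution are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformFusion(nn.Module):
    """Fuse features with a 3x3 deformable convolution whose sampling offsets
    are predicted from the input itself (layer sizes are illustrative)."""
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        pad = kernel_size // 2
        # two offsets (dx, dy) per kernel sampling location
        self.offset_conv = nn.Conv2d(channels, 2 * kernel_size * kernel_size,
                                     kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offset = self.offset_conv(x)   # (N, 2*k*k, H, W)
        return self.deform(x, offset)  # sampling grid adapts to the target shape
```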
Residual Inception Dilated Block: As shown in Figure 2a, the Residual Inception Dilated Block (RIDB) is a residual structure that includes several Inception Dilated Layers (IDLs) and a deformable convolutional layer.
The detailed structure of the IDL is shown in Figure 2b. Like the iFormer [15], the IDL is equipped with a feed-forward network (FFN); unlike the iFormer, it incorporates the Inception Dilated Mixer (IDM). LayerNorm (LN) is applied before both the IDM and the FFN. The whole process is formulated as follows:
$$X' = \mathrm{IDM}(\mathrm{LN}(X)) + X, \qquad X'' = \mathrm{FFN}(\mathrm{LN}(X')) + X' \tag{1}$$
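A minimal sketch of the pre-norm residual structure in Equation (1) is given below; the IDM and FFN sub-modules are passed in as placeholders, and the token layout (N, H×W, C) is an assumption for illustration.

```python
import torch
import torch.nn as nn

class IDL(nn.Module):
    """Pre-norm residual layer of Equation (1): token mixing by the IDM,
    then channel mixing by the FFN, each with LayerNorm and a skip."""
    def __init__(self, dim: int, idm: nn.Module, ffn: nn.Module):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.idm, self.ffn = idm, ffn

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (N, H*W, C) tokens
        x = x + self.idm(self.norm1(x))   # X' = IDM(LN(X)) + X
        x = x + self.ffn(self.norm2(x))   # X'' = FFN(LN(X')) + X'
        return x
```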
The IDM is the key component of the IDL, and its detailed architecture is depicted in Figure 3. The IDM is a multi-branch feature extractor that combines convolution and a non-local layer, effectively aggregating high-frequency detail information and long-range dependencies. Unlike in the iFormer, which targets image classification, in image SR employing pooling operations to expand the receptive field would lose the accurate spatial positional relationships between target components. Therefore, dilated convolution is used instead of pooling operations and traditional convolution to widen the receptive field [16].
Technically, given an input feature map $F \in \mathbb{R}^{H \times W \times C}$ extracted by the front network, it is divided into $F_h \in \mathbb{R}^{H \times W \times C_h}$ and $F_l \in \mathbb{R}^{H \times W \times C_l}$ along the channel dimension. Then, $F_h$ and $F_l$ are utilized to extract the high-frequency information and the low-frequency information, respectively.
High-frequency extractor: Ref. [17] mentioned that a wider receptive field helps alleviate the problem of low contrast in adjacent regions of infrared images, since the feature response can be computed within a larger neighborhood, helping to restore the true information of the image. Ref. [18] confirmed that CNNs mine the high-frequency components of images through local convolution within their receptive fields, and dilated convolution inherits this property. Meanwhile, to better adapt to the scale diversity of targets in natural images, the dilated convolutions are stacked in a cascade. As shown in Figure 3, before concatenation, the outputs are hierarchically added to effectively remove the unwanted gridding artifacts caused by dilated convolutions [19]. With different dilation rates, each branch has a different receptive field: the branch with a large receptive field extracts more abstract feature information for large target areas, while the branch with a smaller receptive field is better suited to small target areas.
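The following sketch illustrates one possible form of the cascaded dilated-convolution branch with hierarchical addition before concatenation; the dilation rates (1, 2, 4) and channel widths are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class HighFreqExtractor(nn.Module):
    """Cascaded dilated 3x3 convolutions with hierarchical addition before
    concatenation; dilation rates (1, 2, 4) are illustrative assumptions."""
    def __init__(self, channels: int):
        super().__init__()
        self.d1 = nn.Conv2d(channels, channels, 3, padding=1, dilation=1)
        self.d2 = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
        self.d3 = nn.Conv2d(channels, channels, 3, padding=4, dilation=4)
        self.fuse = nn.Conv2d(3 * channels, channels, 1)

    def forward(self, f_h: torch.Tensor) -> torch.Tensor:
        y1 = self.d1(f_h)       # small receptive field: small targets
        y2 = self.d2(y1) + y1   # hierarchical addition suppresses gridding artifacts
        y3 = self.d3(y2) + y2   # largest receptive field: large targets
        return self.fuse(torch.cat([y1, y2, y3], dim=1))
```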
Low-frequency extractor: Low-frequency information is the main component of the image, serving as a comprehensive measure of the intensity of the entire image. However, the convolution operation tends to focus more on processing the local neighborhood and cannot fully utilize the contextual information provided by distant pixels. Therefore, a non-local operation is adopted to capture long-range dependencies [11]. This operation directly computes interactions between any two pixels as the feature response, allowing feature extraction to extend beyond adjacent pixels and provide richer global semantic information for the deep layers. The definition is as follows:
$$y_i = \frac{1}{C(x)} \sum_{\forall j} f(x_i, x_j)\, g(x_j) \tag{2}$$
Here, $x$ denotes the input feature map, and $i$ and $j$ are spatial indexes of the feature map. The function $f$ computes a relationship between position $i$ and all positions $j$, and the function $g$ computes a representation of the feature map at position $j$. The weighted sum is normalized by the response factor $C(x)$. In Equation (2), the response at position $x_i$ is computed over all positions $x_j$; this global response corresponds to the low-frequency feature.
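For reference, a minimal embedded-Gaussian non-local block in the spirit of [11] and Equation (2) could look as follows; the choice of half the channels for the embeddings and the residual output projection follow common practice and are not taken from the paper.

```python
import torch
import torch.nn as nn

class NonLocalBlock(nn.Module):
    """Embedded-Gaussian non-local block following Equation (2): the
    softmax-normalised pairwise response plays the role of f(.,.)/C(x)."""
    def __init__(self, channels: int):
        super().__init__()
        inter = max(channels // 2, 1)
        self.theta = nn.Conv2d(channels, inter, 1)
        self.phi = nn.Conv2d(channels, inter, 1)
        self.g = nn.Conv2d(channels, inter, 1)
        self.out = nn.Conv2d(inter, channels, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        theta = self.theta(x).flatten(2).transpose(1, 2)   # (N, HW, C')
        phi = self.phi(x).flatten(2)                       # (N, C', HW)
        g = self.g(x).flatten(2).transpose(1, 2)           # (N, HW, C')
        attn = torch.softmax(theta @ phi, dim=-1)          # response between all position pairs
        y = (attn @ g).transpose(1, 2).reshape(n, -1, h, w)
        return x + self.out(y)                             # residual keeps the input signal
```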
In Figure 4, we perform visualization on the feature frequency information extracted from different branches. The visualization results indicate that the dilated convolution branch extracts more high-frequency features, whereas the feature information captured by the non-local layer is mainly concentrated in the low-frequency information.

2.2.3. Reconstruction Module

The reconstruction module uses the sub-pixel convolution layer [2]. Compared with simple interpolation and deconvolution operations, sub-pixel upsampling alleviates artificial traces to a certain extent.
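A minimal sub-pixel reconstruction head, assuming a single-channel IR output and a ×4 scale, might look as follows; a staged ×2/×2 design is equally common, and the paper does not specify which variant is used.

```python
import torch.nn as nn

def make_upsampler(channels: int, scale: int = 4) -> nn.Sequential:
    """Sub-pixel reconstruction head: expand channels by scale^2, then
    PixelShuffle rearranges them into spatial resolution; the single-channel
    IR output is an assumption."""
    return nn.Sequential(
        nn.Conv2d(channels, channels * scale ** 2, 3, padding=1),
        nn.PixelShuffle(scale),                # (N, C*s^2, H, W) -> (N, C, s*H, s*W)
        nn.Conv2d(channels, 1, 3, padding=1),  # map features back to an image
    )
```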

2.3. Feature-Driven Module

Section 2.2 mainly focuses on building different branches to extract frequency information. However, it does not consider whether the reconstructed SR image is suitable for downstream object detection tasks. Therefore, we propose cascading a feature-driven module to the reconstruction network to reduce the feature gap between different tasks.
As shown in Figure 1, the backbone of the YOLOv7 model [20] is selected as the feature-driven module. We use the feature prior information extracted from the YOLOv7 backbone to guide the SR reconstruction. The original YOLOv7 feeds the feature maps from the 24th, 37th and 50th layers into the Neck network. Since image patches rather than full images are sent to the network in SR reconstruction, selecting the 50th layer would yield a feature map that is too small and carries little semantic information, so it no longer provides effective guidance. Furthermore, to avoid excessive coupling between the reconstructed image features and the selected detection backbone, the feature maps from the 11th, 24th and 37th layers of the YOLOv7 backbone are finally selected for feature alignment, which improves the adaptability of the reconstructed images to other detectors.
Technically, the pretrained YOLOv7 backbone is cascaded at the end of the SR network. The SR image patches and the HR image patches are simultaneously sent to the backbone, and then the similarity between the corresponding feature maps is calculated as the feature-driven loss. In this way, the SR image patches can share similar feature information with the HR image patches. Via loss back-propagation, the feature prior knowledge is leveraged to guide the training of the SR network. Finally, the module not only ensures the high quality of the generated SR images but also learns image features suitable for the object detection task.
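One way to obtain the intermediate feature maps from a frozen detection backbone is with forward hooks, as sketched below; indexing backbone.model[i] assumes the usual YOLOv7 module-list layout and should be adapted to the backbone implementation actually used.

```python
import torch

def extract_backbone_features(backbone, patches, layer_ids=(11, 24, 37)):
    """Grab intermediate feature maps from a frozen detection backbone with
    forward hooks; `backbone.model[i]` assumes the usual YOLOv7 module-list
    layout and is only an illustration."""
    feats = {}
    handles = [
        backbone.model[i].register_forward_hook(
            lambda _m, _x, out, idx=i: feats.__setitem__(idx, out))
        for i in layer_ids
    ]
    backbone(patches)        # only the hook side effects are needed
    for h in handles:
        h.remove()
    return feats             # {11: F_11, 24: F_24, 37: F_37}
```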

2.4. Joint Loss Function

Our method aims to reinforce the semantic features in the SR images while retaining visual quality. To achieve this, the overall loss function adopts a joint loss function, which is defined as follows:
$$L_{\mathrm{overall}} = \lambda L_{\mathrm{rec}} + \mu L_{\mathrm{feat}} \tag{3}$$
Here, $L_{\mathrm{rec}}$ represents the SR reconstruction loss and $L_{\mathrm{feat}}$ represents the feature-driven loss, with the parameters $\lambda$ and $\mu$ balancing the two terms. The reconstruction loss maintains the overall structure and edge details of the SR images, ensuring their visual fidelity. The feature-driven loss guides the SR network to generate image features that effectively facilitate downstream object detection tasks.

2.4.1. Super-Resolution Reconstruction Loss

The SR reconstruction loss measures the difference between the SR images and HR images at the pixel level. In this paper, we use the more stable Charbonnier loss function [21] to optimize our network, which is expressed as follows:
$$L_{\mathrm{rec}} = \frac{1}{N} \sum_{i=1}^{N} \sqrt{\left\| I_{SR}^{(i)} - I_{HR}^{(i)} \right\|^{2} + \epsilon^{2}} \tag{4}$$
Here, $N$ represents the number of training images and $i$ indexes the images in the training set. $I_{SR}$ denotes the generated SR image, while $I_{HR}$ represents the ground-truth HR image. The constant $\epsilon$ is usually set to $1 \times 10^{-3}$. The Charbonnier loss is a differentiable variant of the $L_1$ loss (i.e., mean absolute error) obtained by introducing this constant. Compared with the $L_2$ loss (i.e., mean squared error), it is less sensitive to outliers and avoids excessive amplification of errors.
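A direct implementation of Equation (4) is straightforward; the sketch below averages over all pixels in the batch, which is a common simplification.

```python
import torch

def charbonnier_loss(sr: torch.Tensor, hr: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier (smoothed L1) loss of Equation (4), averaged over the batch."""
    return torch.sqrt((sr - hr) ** 2 + eps ** 2).mean()
```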

2.4.2. Feature-Driven Loss

In Section 2.3, it is mentioned that the feature-driven module requires obtaining the feature maps from the 11th, 24th and 37th layers of the YOLOv7 backbone. Here, we use the mean squared error (MSE) loss to calculate the similarity between the feature maps extracted by the backbone from the SR images and the HR images. The expression is as follows:
$$L_{\mathrm{feat}} = \sum_{j \in \{11, 24, 37\}} \frac{1}{N} \sum_{i=1}^{N} \left\| \mathrm{Fea}_j\!\left(I_{SR}^{(i)}\right) - \mathrm{Fea}_j\!\left(I_{HR}^{(i)}\right) \right\|_{F} \tag{5}$$
Here, $\mathrm{Fea}_j(I_{SR}^{(i)})$ represents the feature map of the SR image extracted by the backbone at layer $j$, and $\mathrm{Fea}_j(I_{HR}^{(i)})$ is the corresponding feature map of the ground-truth HR image. $\| \cdot \|_F$ denotes the Frobenius norm, $i$ indexes the images, and $j$ takes only the values 11, 24 and 37; only the feature maps from these three layers are used in the calculation.
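A sketch of Equation (5) together with the joint objective of Equation (3) is given below; per-layer mean-squared error is used as the distance, and the HR features are detached so that gradients flow only through the SR branch.

```python
import torch

def feature_driven_loss(feats_sr: dict, feats_hr: dict) -> torch.Tensor:
    """Equation (5) as a per-layer mean-squared distance between SR and HR
    backbone features; HR features are detached so gradients only reach the
    SR branch."""
    return sum(torch.mean((feats_sr[j] - feats_hr[j].detach()) ** 2)
               for j in feats_sr)

# Joint objective of Equation (3):
# loss = lam * charbonnier_loss(sr, hr) + mu * feature_driven_loss(f_sr, f_hr)
```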

3. Results

In this section, we first introduce the training datasets we use in this paper. Then, we provide the environment and parameter settings we use in the experiment. Finally, our method is compared with some popular SR reconstruction algorithms in terms of SR reconstruction metrics and detection metrics.

3.1. Dataset

We adopt the public FLIR dataset [22] and the M3FD dataset [23] for our experimental studies.
The FLIR dataset consists of a training set of 7089 images, a validation set of 1773 images and a test set of 1366 images, where the labeled objects include persons and cars. The image size is 640 × 512 pixels. For the SR reconstruction task, we crop the FLIR images into patches using the data preprocessing method described in Section 2.1. We then generate LR patches from the HR patches using bicubic downsampling with a 4× factor and add Gaussian noise to blur the LR patches. For fairness, all SR networks are trained from scratch on the FLIR dataset without pretraining. The object detection model is fine-tuned using the original FLIR dataset.
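The degradation pipeline can be sketched as below; the Gaussian noise level is an assumption, since the paper does not report the exact value, and the input is assumed to be a (N, C, H, W) tensor with values in [0, 1].

```python
import torch
import torch.nn.functional as F

def make_lr_patch(hr_patch: torch.Tensor, scale: int = 4, noise_sigma: float = 0.01) -> torch.Tensor:
    """Bicubic 4x downsampling followed by additive Gaussian noise;
    `noise_sigma` is an illustrative value and `hr_patch` is assumed to be a
    (N, C, H, W) tensor with values in [0, 1]."""
    lr = F.interpolate(hr_patch, scale_factor=1.0 / scale,
                       mode="bicubic", align_corners=False)
    return (lr + noise_sigma * torch.randn_like(lr)).clamp(0.0, 1.0)
```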
To further verify the effectiveness of our method, we selected the M3FD dataset for generalization experiments. The M3FD dataset contains 4200 infrared images with person and car targets. We split it into training, validation and test sets with a ratio of 6:2:2. Unlike with FLIR, for the super-resolution task we directly use the SR model trained on FLIR to generate SR images for M3FD. For detection, we use the fine-tuned YOLOv7 to perform detection on the M3FD test set.

3.2. Experimental Environment and Parameter Settings

All experiments are conducted using PyTorch 1.12.1 on an NVIDIA GeForce RTX 3060. The batch size is set to 8, the learning rate is initialized to $1 \times 10^{-4}$ and the total number of training iterations is set to 300 k. Adam with a momentum of 0.9 is selected as the optimizer.
Note that our method employs a two-stage training strategy for the experiments on the FLIR dataset:
  • Firstly, the IDSR network is trained on the FLIR dataset. In this stage, $\lambda = 1$ and $\mu = 0$ in Equation (3), so the feature-driven module does not participate in training the reconstruction network.
  • Secondly, different weights are assigned to the two losses by adjusting the values of $\lambda$ and $\mu$. During this phase, the parameters of the feature-driven module are frozen, and only the joint loss is used to train the IDSR network via back-propagation.
The two-stage training strategy is summarized in Algorithm 1. Here, M = 300 k, p = 200 k and q = 100 k. The reason for adopting this training strategy is to allow the images to maintain basic textures without losing the applicability to the downstream object detection task.
Algorithm 1 Two-stage training strategy
Input: Training set $D = \{(I_n^{LR}, I_n^{HR})\}_{n=1}^{N}$
Output: Super-resolution patch $I_{SR}$
1:  for m ≤ max iterations M do
2:      for p iterations do
3:          Select b low-resolution patches $I_1^{LR}, I_2^{LR}, \ldots, I_b^{LR}$;
4:          Select b high-resolution patches $I_1^{HR}, I_2^{HR}, \ldots, I_b^{HR}$;
5:          Calculate the super-resolution reconstruction loss $L_{\mathrm{rec}}$ according to Equation (4);
6:          Update the parameters of the IDSR network;
7:      end for
8:      Generate super-resolution patches from the low-resolution and high-resolution patches in the training set;
9:      for q iterations do
10:         Select b low-resolution patches $I_1^{LR}, I_2^{LR}, \ldots, I_b^{LR}$;
11:         Select b high-resolution patches $I_1^{HR}, I_2^{HR}, \ldots, I_b^{HR}$;
12:         Calculate the joint loss $L_{\mathrm{overall}}$ according to Equation (3);
13:         Update the parameters of the IDSR network;
14:     end for
15: end for
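Read as code, Algorithm 1 amounts to switching the loss after the first stage. The sketch below assumes the charbonnier_loss, extract_backbone_features and feature_driven_loss helpers sketched earlier, and that the models, dataloader and optimizer are provided by the caller.

```python
def train_two_stage(idsr, backbone, dataloader, optimizer,
                    p_iters=200_000, lam=1.0, mu=1.0):
    """Two-stage schedule of Algorithm 1: reconstruction loss only for the
    first p_iters steps, joint loss afterwards. Helper functions are the
    sketches given earlier; models, dataloader and optimizer are assumed."""
    for step, (lr_patch, hr_patch) in enumerate(dataloader):
        sr_patch = idsr(lr_patch)
        loss = charbonnier_loss(sr_patch, hr_patch)
        if step >= p_iters:  # second stage: add frozen-backbone feature guidance
            f_sr = extract_backbone_features(backbone, sr_patch)
            f_hr = extract_backbone_features(backbone, hr_patch)
            loss = lam * loss + mu * feature_driven_loss(f_sr, f_hr)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```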

3.3. Experimental Analysis

The experiment on the FLIR dataset consists of four parts:
  • Compare the performance of SR images generated by IDSR with other existing SR reconstruction networks in terms of image super-resolution reconstruction and object detection tasks.
  • Explore the influence of different loss weight selections on the object detection task. Concurrently, the loss weight yielding the best performance on the detection task is chosen as the parameter for subsequent experiments.
  • Compare the performance of SR images generated by the feature-driven IDSR (our method) with other existing SR reconstruction networks in image super-resolution and object detection tasks.
  • Analyze the effect of different modules on the baseline (SwinIR), including the data preprocessing, the IDM module, the deformable convolution module (Deconv) and the feature-driven module.
To verify the generalization ability of the model trained on the FLIR dataset, the experiments on the M3FD dataset consist of two parts:
  • Use the SR model trained on the FLIR dataset to test the M3FD dataset and directly output SR images.
  • Compare the performance of SR images generated by the feature-driven IDSR (our method) with other existing SR reconstruction networks in image super-resolution and object detection tasks.

3.3.1. Comparison with State-of-the-Art SR Methods

To fairly compare the quality of SR images generated by different SR methods, all SR networks are trained for 200 k iterations on the FLIR dataset without pretraining, and the feature-driven module is not involved in this training. To verify the effectiveness of the proposed IDSR, we first compare IDSR with five other methods: bicubic, EDSR, BSRN, SwinIR and RGT.
For quantitative evaluation, we select four statistical evaluation metrics to quantify the quality of the SR images: peak signal-to-noise ratio (PSNR), multi-scale structural similarity index measure (MS-SSIM), information fidelity criterion (IFC) and spatial frequency (SF). PSNR is defined as the ratio of peak signal power to average noise power and calculates the reconstruction error between the SR images and the reference images by mean square error (MSE). MS-SSIM [24] computes the contrast comparison, the structure comparison and the luminance comparison between SR images and HR images at different resolutions. IFC [25] evaluates the amount of common information between the SR image and the HR image. SF reflects the rate of change in image grayscale. Generally, the clearer the image, the higher the spatial frequency. In particular, all evaluation metrics are calculated only on the target area.
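As a reference for the evaluation protocol, PSNR on a cropped target area can be computed as follows; MS-SSIM, IFC and SF require their own implementations and are omitted here.

```python
import torch

def psnr(sr: torch.Tensor, hr: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """Peak signal-to-noise ratio; in the paper the metric is evaluated only
    on the cropped target area, so `sr` and `hr` would be target patches."""
    mse = torch.mean((sr - hr) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)
```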
The SR images are used not only for visual observation but also for the object detection task. Therefore, we perform object detection on the SR images and compute the mean average precision (mAP) for the different SR reconstruction methods.
The evaluation results on the above metrics are reported in Table 1.
One can observe that the IDSR network achieves both the best and the second-best results in the SR metrics. The highest IFC and SF scores indicate that the SR images generated by IDSR contain the richest high-frequency information, which is closely related to the dilated convolution branch in the network structure. The excellent performance of MS-SSIM demonstrates that the non-local branch in IDSR effectively extracts the global structure information of the images.
From the detection metric results, we find that the simple bicubic method results in lower detection accuracy than the original LR images, while the detection accuracies of all other SR methods exceed that of the original LR images. This indicates that the information in the SR images has a significant impact on the detection task: a simple interpolation operation cannot reconstruct useful semantic information and may instead damage the original structure of the LR image. Even without the feature-driven module, the IDSR network yields significantly better results for both car and person targets, with the average mAP increasing by 5.7 points over the LR baseline. These results benefit from our data preprocessing operation and from the dilated convolution branch designed specifically for the low-contrast characteristics of infrared images, which effectively extracts image features, improves the edge clarity of target objects, and enhances the image quality of the target areas.
Additionally, we provide some visualized examples to show the SR reconstruction results and object detection results on the FLIR, as shown in Figure 5. From the results, we can observe that, in the FLIR-08989 scene, the distortion of the top of the reconstructed car in the lower left corner is alleviated. In the FLIR-08951 scene, the person’s head contour and the details of the rear edge of the car are clearer in the frame reconstructed by our method. For the target areas, namely persons and cars, IDSR generates results that are more visually attractive and more aligned with real-world scenarios.

3.3.2. Different Values of Loss Weights

As shown in Equation (3), by setting different values for $\lambda$ and $\mu$, we can control the contribution of the two loss functions during back-propagation. Therefore, we analyze the role of the feature-driven module by varying the values of $\lambda$ and $\mu$. Since the purpose of introducing the feature-driven module is not to achieve the highest PSNR but to improve detection performance, we use the mAP as the sole evaluation metric. The experimental results are presented in Figure 6. From the mAP values, we observe that when $\lambda = 1.0$ and $\mu = 0.0$ the mAP value is the lowest; at this setting, the feature-driven module is not involved during training. This indicates that, after fine-tuning by the feature-driven module, the generated SR images contain the semantic information necessary for the detection task. Consequently, this approach reduces the gap between the SR reconstruction task and the object detection task, thereby improving the accuracy of the downstream detection task.

3.3.3. Comparison with State-of-the-Art SR Methods After Fine-Tuning by the Feature-Driven Module

Our complete method, i.e., IDSR with the cascaded feature-driven module, is trained using the two-stage strategy detailed in Section 3.2. The other SR methods for comparison are trained directly for 300 k iterations without the feature-driven module. Table 2 presents the experimental results for each method after 300 k iterations.
Compared with Section 3.3.1, after an additional 100 k iterations, the detection metrics of the SR images generated by each SR reconstruction method improve to some extent. Our method achieves the highest mAP across all categories and ranks first in average mAP. For the SR metrics, our method shows the largest increase in IFC and SF but a slight decrease in PSNR and MS-SSIM. This indicates a certain disparity between the image features required by the low-level super-resolution task and those required by the high-level detection task, as well as a weak correlation between PSNR and detection performance. The feature-driven module enables IDSR to adaptively reconstruct semantic information guided by the feature-driven loss, allowing the SR images generated by our method to achieve the highest accuracy in the detection task.
We also select two other detectors, SSD and Cascade-RCNN, to evaluate object detection performance on the SR images. The evaluation results are presented in Table 3. From these results, we can find that our method achieves the highest mAP values for both person and car categories, indicating that the SR images generated by our method provide sufficient semantic information about targets to different detectors.
Moreover, we provide some visualized examples in Figure 7 and Figure 8 to illustrate the advantages of the feature-driven module in maintaining image quality and facilitating object detection. Our method detects more cars and persons, irrespective of their distance from the camera or whether they are occluded, and demonstrates stronger adaptability. In the FLIR-09401 scene, the leftmost five persons are occluded, and the SR image reconstructed by our method allows four of them to be detected, including persons with an occluded area exceeding 50%. In contrast, the other methods cannot reconstruct sharp target edges, making it difficult for the detector to determine whether a contour belongs to a pedestrian. A similar phenomenon occurs in the FLIR-09572 scene, where, for cars at a long imaging distance, the SR image reconstructed by our method contains not only the black window information but also the gray contour at the rear of the vehicle. This helps the detector to better distinguish the vehicle from the background and provides more semantic information for the detector.

3.3.4. Ablation Experiment

In order to verify the effectiveness of the different modules of our method, ablation experiments are conducted on the FLIR dataset. The experimental results are shown in Table 4. Firstly, SwinIR is used as the baseline and trained for 200 k iterations, after which the super-resolution metrics and mAP results are obtained. The data preprocessing operation yields a slight improvement in the detection metrics, indicating that the network learns more effectively how to reconstruct target areas, which is beneficial for detection accuracy. When the Deconv module is used as the feature fusion module, the detection accuracy increases by 1.4 percentage points and the PSNR improves by 0.28 dB, indicating the effectiveness of the Deconv module for both object detection and super-resolution reconstruction. When the IDM module is added to extract image features, IFC and SF achieve the highest scores, demonstrating the effectiveness of the IDM module in extracting high-frequency information. Secondly, we cascade the feature-driven module after both the SwinIR network and the proposed IDSR network, and continue training for an additional 100 k iterations. Whether the feature-driven module is cascaded onto SwinIR or onto IDSR, the metrics related to high-frequency information, such as IFC and SF, as well as the detection metrics, all improve, with our method yielding the best results. This demonstrates that the feature-driven module, by introducing prior knowledge of detection features, enables the super-resolution images generated by the front-end network to contain more high-frequency information and high-level semantic information. Combined with our IDSR network, it achieves the best detection accuracy.

3.3.5. Generalization Experiment on M3FD Dataset

We further conducted generalization experiments on the M3FD dataset, with the results presented in Table 5. Similar to the results on the FLIR dataset, our method achieves the best performance in the IFC, SF and mAP metrics on the M3FD dataset. This demonstrates the generalization ability of our SR reconstruction model, trained on the FLIR dataset, when applied to a road-scene dataset, with higher detection accuracy for cars and persons on the road.

4. Discussion

The reconstruction results of our method, presented in the previous section, have indicated that our IDSR network maintained reconstruction quality compared to several other SR reconstruction algorithms. The detection accuracy in SR images generated by the IDSR network outperformed standalone state-of-the-art methods such as SwinIR and RGT. When we cascaded the feature-driven module at the backend of the IDSR network, the results showed that this module increased the mAP value for person and car targets.
One limitation of our experiments is that they were conducted on infrared datasets that contain only two classes, person and car. We look forward to exploring the performance of our method on a broader range of objects from different scenes. Another limitation is that we used LR-HR pairs to train our SR network, where the LR images were generated artificially from the HR images. To our knowledge, there is currently no suitable aligned dataset that contains both HR and LR images for object detection. Therefore, in future work, we plan to explore methods for creating more realistic pseudo-LR images and to enable the trained SR network to better adapt to various image degradations.

5. Conclusions

In this paper, we propose a multi-branch image SR reconstruction network called IDSR, specially designed to address the low contrast and blurred edge details of infrared images. The multi-branch architecture enhances the SR network's descriptive ability through dilated convolutions and non-local layers, and the deformable convolution allows the features extracted by the different branches to be integrated more effectively. Furthermore, to bridge the gap between the SR reconstruction task and the object detection task, we propose a new dataset preprocessing operation that crops images into patches based on target labels. Additionally, we introduce a joint loss function, realized by cascading a feature-driven module, that enables semantic features to guide the learning of the reconstruction network and makes the SR images more conducive to the downstream detection task. Experiments on the FLIR infrared dataset demonstrate that the SR images generated by our method exhibit significant advantages in both image quality and detection accuracy. Meanwhile, experiments on the M3FD dataset demonstrate the generalization ability of our model.

Author Contributions

All of the authors contributed to this study. Conceptualization, J.H.; methodology, J.H.; software, J.H.; validation, J.H. and H.W.; data curation, J.H. and Y.L.; writing—original draft preparation, J.H.; writing—review and editing, J.H. and S.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The publicly available FLIR dataset was analyzed in this study and can be found here: https://www.flir.com/oem/adas/adas-dataset-form/, accessed on 1 May 2024. The publicly available M3FD dataset was also analyzed in this study and can be found here: https://github.com/JinyuanLiu-CV/TarDAL?tab=readme-ov-file, accessed on 1 May 2024.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Huang, Y.; Miyazaki, T.; Liu, X.F.; Omachi, S. Infrared Image Super-Resolution: Systematic Review, and Future Trends. arXiv 2022, arXiv:2212.12322. [Google Scholar] [CrossRef]
  2. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-Time Single Image and Video Super-Resolution Using an Efficient Sub-Pixel Convolutional Neural Network. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar] [CrossRef]
  3. Lim, B.; Son, S.; Kim, H.; Nah, S.; Lee, K.M. Enhanced Deep Residual Networks for Single Image Super-Resolution. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, USA, 21–26 July 2017; pp. 1132–1140. [Google Scholar] [CrossRef]
  4. Li, Z.; Liu, Y.; Chen, X.; Cai, H.; Gu, J.; Qiao, Y.; Dong, C. Blueprint Separable Residual Network for Efficient Image Super-Resolution. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPRW 2022, New Orleans, LA, USA, 19–20 June 2022; pp. 832–842. [Google Scholar] [CrossRef]
  5. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Gool, L.V.; Timofte, R. SwinIR: Image Restoration Using Swin Transformer. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar] [CrossRef]
  6. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin Transformer: Hierarchical Vision Transformer using Shifted Windows. In Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 11–17 October 2021; pp. 9992–10002. [Google Scholar] [CrossRef]
  7. Chen, Z.; Zhang, Y.; Gu, J.; Kong, L.; Yang, X. Recursive Generalization Transformer for Image Super-Resolution. In Proceedings of the ICLR, Vienna, Austria, 7 May 2024. [Google Scholar]
  8. Zamir, A.; Sax, A.; Shen, W.; Guibas, L.; Malik, J.; Savarese, S. Taskonomy: Disentangling Task Transfer Learning. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–23 June 2018; pp. 3712–3722. [Google Scholar] [CrossRef]
  9. Tang, L.; Yuan, J.; Ma, J. Image fusion in the loop of high-level vision tasks: A semantic-aware real-time infrared and visible image fusion network. Inf. Fusion 2022, 82, 28–42. [Google Scholar] [CrossRef]
  10. Liu, D.; Wen, B.; Liu, X.; Wang, Z.; Huang, T. When Image Denoising Meets High-Level Vision Tasks: A Deep Learning Approach. In Proceedings of the International Joint Conferences on Artificial Intelligence Organization, Stockholm, Sweden, 13–19 July 2018; pp. 842–848. [Google Scholar] [CrossRef]
  11. Wang, X.; Girshick, R.; Gupta, A.; He, K. Non-local Neural Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7794–7803. [Google Scholar] [CrossRef]
  12. Shorten, C.; Khoshgoftaar, T.M. A survey on Image Data Augmentation for Deep Learning. J. Big Data 2019, 6, 60. [Google Scholar] [CrossRef]
  13. Zhang, Q.; Xu, Y.; Zhang, J.; Tao, D. ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond. Int. J. Comput. Vis. 2023, 131, 1141–1162. [Google Scholar] [CrossRef]
  14. Xiao, T.; Singh, M.; Mintun, E.; Darrell, T.; Dollár, P.; Girshick, R.B. Early Convolutions Help Transformers See Better. arXiv 2021, arXiv:2106.14881. [Google Scholar]
  15. Si, C.; Yu, W.; Zhou, P.; Zhou, Y.; Wang, X.; Yan, S. Inception transformer. In Proceedings of the 36th International Conference on Neural Information Processing Systems, NIPS’22, New Orleans, LA, USA, 28 November–9 December 2022. [Google Scholar]
  16. Yu, F.; Koltun, V. Multi-Scale Context Aggregation by Dilated Convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  17. Kwasniewska, A.; Ruminski, J.; Szankin, M.; Kaczmarek, M. Super-resolved thermal imagery for high-accuracy facial areas detection and analysis. Eng. Appl. Artif. Intell. 2020, 87, 103263. [Google Scholar] [CrossRef]
  18. Wang, H.; Wu, X.; Huang, Z.; Xing, E.P. High-Frequency Component Helps Explain the Generalization of Convolutional Neural Networks. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 8681–8691. [Google Scholar] [CrossRef]
  19. Mehta, S.; Rastegari, M.; Caspi, A.; Shapiro, L.; Hajishirzi, H. ESPNet: Efficient Spatial Pyramid of Dilated Convolutions for Semantic Segmentation. In Proceedings of the Computer Vision—ECCV 2018, Munich, Germany, 8–14 September 2018; Springer International Publishing: Berlin/Heidelberg, Germany, 2018; pp. 561–580. [Google Scholar]
  20. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable Bag-of-Freebies Sets New State-of-the-Art for Real-Time Object Detectors. In Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar] [CrossRef]
  21. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Fast and Accurate Image Super-Resolution with Deep Laplacian Pyramid Networks. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 41, 2599–2613. [Google Scholar] [CrossRef] [PubMed]
  22. FLIR, T. Flir Thermal Dataset for Algorithm Training. DB/OL. 2018. Available online: https://www.flir.com/oem/adas/adas-dataset-form/ (accessed on 1 May 2024).
  23. Liu, J.; Fan, X.; Huang, Z.; Wu, G.; Liu, R.; Zhong, W.; Luo, Z. Target-aware Dual Adversarial Learning and a Multi-scenario Multi-Modality Benchmark to Fuse Infrared and Visible for Object Detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 5802–5811. [Google Scholar]
  24. Wang, Z.; Simoncelli, E.P.; Bovik, A.C. Multiscale structural similarity for image quality assessment. In Proceedings of the Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, Pacific Grove, CA, USA, 9–12 November 2003; Volume 2, pp. 1398–1402. [Google Scholar] [CrossRef]
  25. Sheikh, H.R.; Bovik, A.C.; Veciana, G.d. An information fidelity criterion for image quality assessment using natural scene statistics. IEEE Trans. Image Process. 2005, 14, 2117–2128. [Google Scholar] [CrossRef] [PubMed]
Figure 1. The overall structure of the proposed method, which mainly consists of three parts: a data preprocessing method to crop the images into patches, a SR reconstruction network to generate SR images and a feature-driven module to improve the detection accuracy.
Figure 2. The architecture of the proposed IDSR for image super-resolution.
Figure 3. The details of Inception Dilated Mixer (IDM).
Figure 4. Frequency magnitude from 8 output channels of high-frequency extractor and low-frequency extractor.
Figure 5. Super-resolution reconstruction results for LR images from the FLIR dataset (200 k iterations). Each two rows represent a scene, and from top to bottom are FLIR-08989 and FLIR-08951.
Figure 6. The analysis of loss weight ( λ , μ ) selection in our method.
Figure 7. Super-resolution reconstruction results for LR images from the FLIR dataset by feature-driven IDSR (our method, 300 k iterations). Each two rows represent a scene, and from top to bottom are FLIR-08989 and FLIR-08951.
Figure 8. Object detection (YOLOv7) results for SR images from the FLIR dataset by feature-driven IDSR (our method, 300 k iterations). Each two rows represent a scene, and from top to bottom are FLIR-09401 and FLIR-09572.
Table 1. Super-resolution performance and detection performance (mAP) on the FLIR dataset (200 k iterations). RED indicates the best result and BLUE represents the second-best result.
Method | PSNR | MS-SSIM | IFC | SF | Person mAP | Car mAP | Average mAP
LR | – | – | – | – | 70.30 | 77.10 | 73.70
Bicubic | 25.94 | 0.68 | 1.17 | 8.26 | 70.30 | 76.10 | 73.20
EDSR | 26.60 | 0.72 | 1.29 | 10.21 | 73.10 | 80.50 | 76.80
BSRN | 26.53 | 0.71 | 1.27 | 10.02 | 72.70 | 80.50 | 76.60
SwinIR | 26.65 | 0.72 | 1.31 | 10.44 | 73.70 | 80.90 | 77.30
RGT | 26.68 | 0.72 | 1.31 | 10.29 | 73.80 | 81.50 | 77.65
IDSR | 26.64 | 0.72 | 1.31 | 10.53 | 76.30 | 82.50 | 79.40
Table 2. After fine-tuning by the feature-driven module, the super-resolution performance and detection performance (mAP) on the FLIR dataset (300 k iterations). RED indicates the best result and BLUE represents the second-best result.
Method | PSNR | MS-SSIM | IFC | SF | Person mAP | Car mAP | Average mAP
Bicubic | 25.94 | 0.68 | 1.17 | 8.26 | 70.30 | 76.10 | 73.20
EDSR | 26.55 | 0.71 | 1.30 | 10.43 | 73.00 | 80.60 | 76.80
BSRN | 26.45 | 0.71 | 1.27 | 10.28 | 73.00 | 80.40 | 76.70
SwinIR | 26.69 | 0.72 | 1.32 | 10.43 | 73.90 | 81.50 | 77.70
RGT | 26.71 | 0.72 | 1.32 | 10.17 | 74.40 | 81.40 | 77.90
Our method | 26.50 | 0.71 | 1.36 | 11.17 | 77.30 | 83.80 | 80.55
Table 3. After fine-tuning by the feature-driven module, the detection performance (mAP) on the FLIR dataset (with SSD and Cascade-RCNN detectors) (300 k iterations). RED indicates the best result and BLUE represents the second-best result.
Method | SSD Person mAP | SSD Car mAP | SSD Average mAP | Cascade-RCNN Person mAP | Cascade-RCNN Car mAP | Cascade-RCNN Average mAP
Bicubic | 31.70 | 54.60 | 43.15 | 42.90 | 53.60 | 48.25
EDSR | 42.60 | 64.20 | 53.40 | 54.70 | 62.50 | 58.60
BSRN | 42.80 | 63.80 | 53.30 | 54.40 | 61.80 | 58.10
SwinIR | 44.50 | 66.00 | 55.25 | 56.70 | 64.40 | 60.55
RGT | 44.30 | 65.60 | 54.95 | 57.30 | 65.20 | 61.25
Our method | 49.50 | 71.50 | 60.50 | 63.60 | 73.20 | 68.40
Table 4. The results of ablation experiments.
Experiment | PSNR | MS-SSIM | IFC | SF | Person mAP | Car mAP | Average mAP
SwinIR (200 k iters) | 26.65 | 0.72 | 1.31 | 10.44 | 73.70 | 80.90 | 77.30
+DataPre (200 k iters) | 26.67 | 0.72 | 1.30 | 10.41 | 74.90 | 82.30 | 78.60
+Deconv (200 k iters) | 26.93 | 0.72 | 1.31 | 9.96 | 74.70 | 82.70 | 78.70
+IDM (200 k iters) | 26.69 | 0.70 | 1.37 | 11.68 | 74.40 | 83.60 | 79.00
+DataPre + Deconv + IDM = IDSR (200 k iters) | 26.64 | 0.72 | 1.31 | 10.53 | 76.30 | 82.50 | 79.40
SwinIR (300 k iters) | 26.69 | 0.72 | 1.32 | 10.43 | 73.90 | 81.50 | 77.70
+Feature-driven (300 k iters) | 26.62 | 0.70 | 1.34 | 11.00 | 75.20 | 84.40 | 79.80
+DataPre + Deconv + IDM + Feature-driven = Our method (300 k iters) | 26.50 | 0.71 | 1.36 | 11.17 | 77.30 | 83.80 | 80.55
Table 5. Super-resolution performance and detection performance (mAP) on the M3FD dataset. RED indicates the best result and BLUE represents the second-best result.
Method | PSNR | MS-SSIM | IFC | SF | Person mAP | Car mAP | Average mAP
LR | – | – | – | – | 68.00 | 74.60 | 71.30
Bicubic | 35.98 | 0.93 | 1.79 | 6.39 | 68.10 | 74.70 | 71.40
EDSR | 37.30 | 0.94 | 1.88 | 7.45 | 71.70 | 75.90 | 73.80
BSRN | 36.79 | 0.94 | 1.88 | 7.40 | 70.90 | 76.20 | 73.55
SwinIR | 37.38 | 0.94 | 1.92 | 7.50 | 71.70 | 76.20 | 73.95
RGT | 37.43 | 0.94 | 1.91 | 7.27 | 71.80 | 76.70 | 74.25
Our method | 37.11 | 0.93 | 1.95 | 7.81 | 72.50 | 76.90 | 74.70

