Our proposed approach first performs compressive sensing-based classification of the input remote sensing images, followed by deep texture extraction on the classified image blocks to enhance the representation capability of the super-resolved images. In Section 3.1, we present a comprehensive description of the proposed PCFI model structure. In Section 3.2, we provide a detailed explanation of ICPC, which classifies sub-image blocks according to predetermined ranges. Smooth blocks primarily contain background and contour information, whereas edge blocks are more complex: they carry detailed information but may suffer from information blur and partial feature loss caused by edge blurring. Therefore, in Section 3.3, we introduce the deep texture extraction module, which enhances the representation of features blurred by edge degradation, thereby improving the ability of the SR model to express details when dealing with complex remote sensing images.
3.2. Integrated Compressive Sensing-Based Perception Classifier Module
Our integrated compressive sensing-based perception classifier (ICPC) module utilizes compressed sensing for feature classification, which effectively reduces computational complexity compared with commonly employed deep learning approaches such as CNNs or self-attention mechanisms. Although these approaches demonstrate excellent performance in feature extraction, they often incur substantial computational and memory overhead. Remote sensing images are typically large-scale; when processed with self-attention mechanisms, the model must account for dependencies between every pixel and region, which significantly increases computational complexity. CNNs, in turn, rely on deep networks for layer-by-layer processing and require numerous convolutional kernels and layers to capture features of varying complexity. This often leads to processing considerable redundant information merely to achieve comprehensive feature classification, which greatly diminishes processing speed, increases the computational burden, and contradicts the ICPC module’s goal of accelerating super-resolution tasks.
Moreover, remote sensing images frequently contain a substantial amount of redundant information, with critical data often concentrated in a few specific regions or frequency bands. Many areas within remote sensing images, such as oceans and deserts, are usually uniform or exhibit extensive smooth textures. The pixel value variations in these regions are minimal and can be represented with fewer non-zero or significant values. By leveraging the spatial sparsity of such areas, the compressed sensing technique can save storage space and reduce processing complexity. When addressing complex textures, these regions typically exhibit sparse representations in certain transformation domains (e.g., wavelet transforms, discrete cosine transforms). CS effectively captures these sparse features, facilitating the identification of different texture categories in classification tasks.
In this section, we start by cropping the image into equally sized blocks and employ CS theory [35] to map pixel values to perceptual values. Specifically, the pixel values of each block are multiplied by the measurement matrix to obtain measurement values, and features are then derived within the measurement value range, allowing image blocks to be classified according to features in the perceptual domain [36]. This module categorizes image blocks into three classes: smooth, edge, and texture blocks. The specific structure is illustrated in Figure 2.
3.2.1. Definition of Correlation Between Signals in Two Domains
Since the mutual covariance measures the similarity between two signals, we utilize it to assess the correlation of frequency domain signals. Specifically, the correlation between two frequency domain image signals is defined as

$$R_{fg} = \operatorname{cov}(f, g),$$

where $R_{fg}$ denotes the frequency domain correlation of the signals, $\operatorname{cov}(\cdot,\cdot)$ represents the mutual covariance operator, and $f$ and $g$ are the image signals in the DCT domain.
When the value of $R_{fg}$ is positive, the two signals are positively correlated; conversely, a negative value indicates a negative correlation. The larger the absolute value of $R_{fg}$, the higher the similarity between the two signals and the stronger the correlation; a smaller absolute value indicates lower similarity and weaker correlation. In image analysis, we typically start from the pixel domain. However, direct analysis in the pixel domain can be challenging because of the large volume of data and the lack of intuitive insight, making it difficult to extract useful information effectively. Therefore, we project the image data into the frequency domain, where the signals more clearly reflect the pixel features of the image. According to the CS theory [37] presented in this section, there exists a linear relationship between the perceptual domain and the frequency domain. This means that the frequency domain can be regarded as an effective representation of the pixel domain, while the perceptual domain reflects the detail level and complexity of pixel domain images. Thus, by analyzing signals in the perceptual domain, we can achieve a certain degree of extraction and analysis of image pixel features.
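The chain pixel domain → frequency domain → perceptual domain can be made concrete with a small numerical sketch. The following Python snippet, written under illustrative assumptions (a 16 × 16 block, 64 Gaussian measurements, an orthonormal 2-D DCT as the sparse basis), shows that the compressive measurements of a block can be obtained equivalently from its pixel values or from its DCT coefficients through a projection matrix A = ΦΨ.

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
B, m = 16, 64                       # block side and measurement count (illustrative)
n = B * B

block = rng.random((B, B))          # pixel-domain image block
x = block.reshape(n)

# Frequency domain: orthonormal 2-D DCT of the block.
f = dctn(block, norm="ortho").reshape(n)

# Perceptual domain: compressive measurements y = Phi x, zero-mean Gaussian Phi.
Phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
y = Phi @ x

# Explicit DCT matrix D (f = D x); its transpose is the sparse basis Psi (x = Psi f).
D = np.stack([dctn(np.eye(n)[j].reshape(B, B), norm="ortho").reshape(n)
              for j in range(n)], axis=1)
Psi = D.T
A = Phi @ Psi                       # projection matrix linking the two domains

# The same measurements can be obtained from the frequency-domain signal: y = A f.
assert np.allclose(y, A @ f)
```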
3.2.2. Correlation of Linear Relationships Between Perceptual Domain and Frequency Domain
For each block $x_1, x_2, \ldots, x_N$, corresponding to frequency domain signals $f_1, f_2, \ldots, f_N$, independent compressed sensing measurements [38] are conducted using an identical sensing matrix. This process yields perception values $y_1, y_2, \ldots, y_N$, which are concatenated as row vectors to construct the perceptual domain matrix $Y$. Given $y_i = (y_{i1}, y_{i2}, \ldots, y_{im})^{T}$ for $i = 1, 2, \ldots, N$, the representation of the feature image in the perceptual domain is

$$Y = \begin{bmatrix} y_1^{T} \\ y_2^{T} \\ \vdots \\ y_N^{T} \end{bmatrix} = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1m} \\ y_{21} & y_{22} & \cdots & y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N1} & y_{N2} & \cdots & y_{Nm} \end{bmatrix}.$$

In this formulation, each $y_i$ is represented as a column vector, and $y_{ij}$ signifies the $j$-th perception value of the $i$-th block.
Covariance Matrix Vectorization. If $X_1, X_2, \ldots, X_K$ constitute a collection of random variables forming a random vector $X = (X_1, X_2, \ldots, X_K)^{T}$, and each random variable has $m$ samples, then there exists a sample matrix, as follows:

$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{K1} & x_{K2} & \cdots & x_{Km} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1^{T} \\ \vdots \\ \mathbf{x}_K^{T} \end{bmatrix} = \begin{bmatrix} \mathbf{x}^{(1)} & \cdots & \mathbf{x}^{(m)} \end{bmatrix},$$

where $\mathbf{x}_i$ (for $i = 1, \ldots, K$) is the vector of the $m$ sample values of the $i$-th random variable, while $\mathbf{x}^{(j)}$ (for $j = 1, \ldots, m$) is the $j$-th sample of the random vector $X$. Consequently, the mutual covariance between random variables $X_i$ and $X_j$ is given by

$$\operatorname{cov}(X_i, X_j) = E\big[(X_i - E[X_i])(X_j - E[X_j])\big].$$

Covariance estimates can be derived from the sample values, as follows:

$$\operatorname{cov}(X_i, X_j) \approx \frac{1}{m}\sum_{k=1}^{m}\big(x_{ik} - \bar{x}_i\big)\big(x_{jk} - \bar{x}_j\big), \qquad \bar{x}_i = \frac{1}{m}\sum_{k=1}^{m} x_{ik}.$$
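As a quick sanity check, the sample estimate above can be compared against a library estimator. The snippet below uses synthetic data and the 1/m normalisation; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
K, m = 4, 1000                       # number of random variables and samples (illustrative)
X = rng.normal(size=(K, m))          # sample matrix: one row per random variable

# Sample estimate of cov(X_i, X_j) with the 1/m normalisation used above.
i, j = 0, 2
est = np.mean((X[i] - X[i].mean()) * (X[j] - X[j].mean()))

# Matches the library estimator when the same normalisation (bias=True) is used.
assert np.isclose(est, np.cov(X, bias=True)[i, j])
```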
Perceptual Domain Covariance Matrix. We measure each image block with the same randomly generated measurement matrix. The measurement procedure is

$$y_i = \Phi x_i,$$

where $x_i$ represents the original signal of size $n \times 1$ and $y_i$ denotes the corresponding signal in the perceptual domain of size $m \times 1$. This procedure operates under the assumption that the entries of the random measurement matrix $\Phi \in \mathbb{R}^{m \times n}$ follow a Gaussian distribution with a mean of zero and a variance of $\sigma^{2}$. Each signal in the perceptual domain $y_i$ is associated with a signal in the frequency domain $f_i$, as represented by

$$y_i = \Phi\Psi f_i = A f_i,$$

where $A = \Phi\Psi$ represents the projection matrix and $\Psi$ signifies the sparse basis Discrete Cosine Transform (DCT) matrix. The elements of the projection matrix $A$ are derived as follows:

$$a_{jk} = \sum_{l=1}^{n} \phi_{jl}\,\psi_{lk}.$$

Since $\Psi$ is orthonormal, the variance of the elements of the projection matrix $A$ is determined as

$$\operatorname{Var}(a_{jk}) = \sum_{l=1}^{n} \psi_{lk}^{2}\,\operatorname{Var}(\phi_{jl}) = \sigma^{2}.$$

The frequency domain signals corresponding to the $N$ image blocks form the frequency domain sample matrix

$$F = \begin{bmatrix} f_1 & f_2 & \cdots & f_N \end{bmatrix}.$$
Treating each coefficient of the perceptual domain signal as a random variable, the sample matrix over the perceptual domain is obtained as follows:

$$Y_s = \begin{bmatrix} y_1 & y_2 & \cdots & y_N \end{bmatrix} \in \mathbb{R}^{m \times N},$$

where the $i$-th row of $Y_s$ is the sample vector of the $i$-th perceptual random variable, and the $j$-th column $y_j$ is the sample vector contributed by the $j$-th block. Let $a_i$ denote the $i$-th row vector of the projection matrix $A$, so that the $i$-th perceptual random variable is $a_i f$. Utilizing the vector form of the covariance matrix, we have

$$\operatorname{cov}\big(a_i f,\, a_j f\big) \approx \frac{1}{N}\sum_{k=1}^{N} (y_k)_i\,(y_k)_j = a_i \Big(\frac{1}{N}\sum_{k=1}^{N} f_k f_k^{T}\Big) a_j^{T}.$$

Finally, the covariance matrix in the perceptual domain can be approximated as

$$C_Y \approx A\, C_F\, A^{T},$$

where $C_F$ is the frequency domain covariance matrix derived below.
Frequency Domain Covariance Matrix. The frequency domain sample matrix is represented as

$$F = \begin{bmatrix} f_{11} & f_{12} & \cdots & f_{1N} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n1} & f_{n2} & \cdots & f_{nN} \end{bmatrix} = \begin{bmatrix} \tilde{f}_1^{T} \\ \vdots \\ \tilde{f}_n^{T} \end{bmatrix} = \begin{bmatrix} f_1 & \cdots & f_N \end{bmatrix},$$

where $\tilde{f}_i$ ($i = 1, \ldots, n$) denotes the vector containing all $N$ sample values of the $i$-th frequency domain random variable $F_i$, and $f_j$ ($j = 1, \ldots, N$) corresponds to the frequency domain signal of the $j$-th block. Then, the element in the $i$-th row and $j$-th column of the frequency domain covariance matrix depicts the mutual covariance between signals $F_i$ and $F_j$:

$$C_F(i, j) = \operatorname{cov}(F_i, F_j).$$

All Discrete Cosine Transform (DCT) coefficients within an image block adhere to a Gaussian distribution with a mean of zero; thus,

$$C_F(i, j) \approx \frac{1}{N}\sum_{k=1}^{N} f_{ik}\, f_{jk}, \qquad \text{i.e.,}\quad C_F \approx \frac{1}{N}\, F F^{T}.$$
The relationship above connects the signals in the perception and frequency domains: the perceptual domain covariance matrix is a linear transformation of the frequency domain covariance matrix, so the correlation structure of the two domains corresponds directly. Clearly, in compression-based image processing, the correlation analysis of frequency domain signals can therefore be performed directly on signals from the perceptual domain.
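This linear relationship is easy to verify numerically. The sketch below, using illustrative dimensions and zero-mean synthetic frequency-domain signals, checks that the perceptual domain covariance equals $A C_F A^{T}$ computed from the frequency domain.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, N = 64, 16, 5000               # signal length, measurements, blocks (illustrative)

# Zero-mean frequency-domain signals of N blocks (one per column) and their covariance.
F = rng.normal(size=(n, N))
C_F = (F @ F.T) / N                  # frequency domain covariance (zero-mean assumption)

# One shared projection matrix A maps every block into the perceptual domain.
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
Y = A @ F                            # perceptual domain sample matrix

C_Y = (Y @ Y.T) / N                  # perceptual domain covariance
# The covariances are linearly related: C_Y = A C_F A^T (exact for these sample
# estimates, since Y = A F implies Y Y^T = A F F^T A^T).
assert np.allclose(C_Y, A @ C_F @ A.T)
```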
3.2.3. Image Patch Classification Based on Perceptual Domain Features
The elements of the perceptual domain covariance matrix are correspondingly related to the variances of the respective frequency domain signals, emphasizing the close relationship between frequency domain signals and pixel domain signals. Specifically, smooth image patches correspond to frequency domain signals with lower sparsity, indicating a prevalence of uniform information across frequencies. Conversely, edge and texture patches are associated with frequency domain signals that exhibit higher sparsity, reflecting the presence of distinct high-frequency components that characterize sharp transitions and intricate textures in the image. This relationship underscores the interplay between spatial characteristics and their spectral representations. Consequently, the variance of the perceptual domain covariance matrix reflects the characteristics of image patches and can be utilized for their classification, with the classification thresholds computed from this variance via the 2σ principle described below.
When judging the categories of image blocks, we combine multi-directional block classification as follows:
(1) For the i-th block, take the feature vector of its horizontally adjacent block. Based on the covariance matrix of this feature vector, determine its variance, classify the block into one of the image block types, and assign a category parameter for the horizontal direction.
(2) For the same block, take the feature vector of its vertically adjacent block. Based on the covariance matrix of this feature vector, determine its variance, classify the block into one of the image block types, and assign a category parameter for the vertical direction.
(3) For the same image sub-block, consider the category parameters obtained from the horizontal and vertical directions. The priority order, from high to low, is edge block, texture block, and smooth block; choose the category with the highest priority as the final category of the image block.
The algorithm simultaneously takes into account the global statistical characteristics and local distribution properties of image segments, and it computes the classification threshold effectively by employing the 2σ principle. Smooth blocks primarily contain background and contour information, exhibiting relatively uniform features. In contrast, edge blocks are more complex, encompassing detailed information such as boundaries that correspond to high-frequency components of the image. Consequently, edge blocks demonstrate higher sparsity in the frequency domain than smooth blocks.
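To make the classification flow concrete, the following sketch implements one plausible reading of the procedure: a perceptual-domain variance statistic per block, 2σ-based thresholds, and the horizontal/vertical priority fusion. The statistic, the threshold placement, and the mapping from statistic ranges to classes are illustrative assumptions, not the paper's exact criteria.

```python
import numpy as np

rng = np.random.default_rng(3)

def perceptual_variance(block, Phi):
    """Variance of a block's compressive measurements (a stand-in for the
    perceptual-domain covariance statistic used by ICPC)."""
    return (Phi @ block.reshape(-1)).var()

def thresholds(stats):
    """Illustrative cut-points derived from the 2-sigma rule over all block statistics."""
    mu, sigma = stats.mean(), stats.std()
    return mu, mu + 2 * sigma

def classify(stat, t_low, t_high):
    # The mapping of statistic ranges to classes is an assumption for illustration.
    if stat < t_low:
        return "smooth"
    if stat < t_high:
        return "texture"
    return "edge"

PRIORITY = {"edge": 2, "texture": 1, "smooth": 0}

def fuse(label_h, label_v):
    """Keep the higher-priority class from the horizontal/vertical decisions."""
    return label_h if PRIORITY[label_h] >= PRIORITY[label_v] else label_v

# Toy usage: 100 random 16x16 blocks measured with one shared Gaussian matrix.
Phi = rng.normal(0.0, 1.0 / 8.0, size=(64, 256))
blocks = rng.random((100, 16, 16))
stats = np.array([perceptual_variance(b, Phi) for b in blocks])
t_low, t_high = thresholds(stats)
labels = [classify(s, t_low, t_high) for s in stats]
```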
3.3. Depth–Texture Interaction Fusion Module
The depth–texture interaction fusion (DTIF) module employs a traditional U-Net encoder-decoder architecture, which effectively extracts and integrates contextual information while flexibly handling features of varying scales, demonstrating strong adaptability. Before the image blocks are input into DTIF, they have already undergone feature classification via the ICPC module. This preprocessing allows us to dynamically adjust the number of DTIF modules based on the prior classification results. Specifically, image blocks with different feature types can be allocated different numbers of DTIF modules: simple, smooth regions may be assigned three DTIF modules; regions with pronounced edge features may utilize six; and complex texture regions may be assigned nine. This adjustment mechanism ensures that the model operates efficiently under different circumstances and optimizes feature extraction. In the DTIF, the primary task of the encoder is to extract detailed features of the image, while the decoder combines the encoder’s features with its own through skip connections, ensuring the retention of detail information. This design not only enhances reconstruction quality but also effectively reduces information loss, resulting in a clearer and more realistic final output image. By combining the advantages of both the encoder and decoder, the DTIF module is better equipped to adapt to various image features, thereby improving the performance of super-resolution reconstruction. The basic architecture of DTIF consists of a layered encoder, pooling layers, a compression layer, and a decoder, along with skip connection layers linking the encoder and decoder. This section first presents a comprehensive outline of the overall structure of DTIF and then offers in-depth explanations of the DTIT, CWIA, and MSFE blocks.
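As a minimal illustration of this class-dependent allocation, the mapping could be expressed as a lookup from the ICPC label to the number of DTIF modules; the counts below simply mirror the 3/6/9 example above and are not prescriptive.

```python
# Hypothetical mapping from ICPC block class to the number of DTIF modules applied;
# the 3/6/9 counts mirror the example in the text and are illustrative only.
DTIF_MODULES = {"smooth": 3, "edge": 6, "texture": 9}

def modules_for(block_label: str) -> int:
    """Return how many DTIF modules to stack for a classified block."""
    return DTIF_MODULES[block_label]

assert modules_for("edge") == 6
```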
Encoder. For a given LR image of height H, width W, and C channels, the initial feature map is extracted through pixel shuffle. To enrich the transferred feature details and enhance the expression of content features, we design a three-layer encoder. Specifically, the extracted feature map first undergoes two stages composed of two DTIT (DTI Transformer) blocks and soft pooling [39] with a kernel size of 2 × 2, followed by processing through a third DTIT block. The details of the DTIT are discussed in Section 3.3.1.
Pooling Layer. In contrast to conventional pooling operations such as average pooling [40] and max pooling [41], we utilize soft pooling [39] in this process, as shown in Figure 3. The calculation formula for soft pooling is as follows:

$$\tilde{x} = \sum_{i=1}^{N} w_i\, x_i,$$

where $x_i$ is the input data, $N$ is the number of inputs within the pooling region, and $w_i$ is the weight corresponding to the input $x_i$, usually calculated using the softmax function:

$$w_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}.$$
The objective of this pooling approach is to minimize the information loss that occurs during pooling while preserving the functionality of the pooling layer. By effectively retaining content features, soft pooling facilitates deeper extraction of blurred information, thus significantly enhancing the performance of traditional SR methods when handling images with varying levels of clarity.
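A compact way to realise soft pooling is to weight each activation by its softmax weight within the pooling window; the ratio-of-average-pools trick below implements exactly the formula above. This is a minimal PyTorch sketch (kernel size, stride, and tensor sizes are illustrative), not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def soft_pool2d(x: torch.Tensor, kernel_size: int = 2, stride: int = 2) -> torch.Tensor:
    """SoftPool: each activation is weighted by its softmax weight inside the
    pooling window and the weighted values are summed."""
    e = torch.exp(x)  # note: exp may overflow for very large activations; stabilisation omitted
    # sum(w_i * x_i) = sum(e^{x_i} * x_i) / sum(e^{x_j}); the window-size factor of
    # average pooling cancels in the ratio.
    return F.avg_pool2d(e * x, kernel_size, stride) / F.avg_pool2d(e, kernel_size, stride)

feat = torch.randn(1, 64, 32, 32)        # (batch, channels, H, W)
pooled = soft_pool2d(feat)               # -> (1, 64, 16, 16)
```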
Compression Layer. The design of a three-layer encoder and the choice of pooling layers enable the extraction of more image details, but they also convey a significant amount of image information. To address this, the introduction of the compression layer module (MSFE) effectively compresses input image information while extracting features. This means that, while compressing redundant information from simple image blocks, the compression layer also aids in capturing detailed features [42] from complex, blurred image blocks. The design of the compression layer not only reduces computational complexity and alleviates the computational burden on the subsequent decoder but also enhances the representation capability of similar features, effectively minimizing information loss during the feature transfer process to the decoder, thereby improving the quality of image reconstruction. The specific details of this module will be elaborated in Section 3.3.4.
Decoder. Similar to most U-Net architectures, the decoder adopts the same modular structure as the encoder; here, we also use a combination of DTIT and soft pooling blocks. Additionally, the final result of the first DTIT is designed to bypass the fully connected layer and instead enter the compression layer module; it is then combined with the output of the three-layer encoder through a residual connection [43] before being input into the decoder. This skip-connection [44] design aids in optimizing the model by reducing computational complexity while leveraging locality and dependency to perform multiscale feature extraction on the input.
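Putting the pieces together, the following heavily simplified PyTorch skeleton sketches the encoder-compression-decoder layout with one skip connection. DTITBlock is only a convolutional stand-in for the paper's DTI Transformer (Section 3.3.1), the compression layer is a 1 × 1 convolution standing in for MSFE (Section 3.3.4), and the channel widths, block counts, and exact wiring of the skip path are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DTITBlock(nn.Module):
    """Stand-in for the paper's DTI Transformer block (Section 3.3.1): a plain
    convolutional residual block so that the skeleton runs end to end."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class SoftPool2d(nn.Module):
    """2x2 soft pooling (see the formula in the Pooling Layer paragraph)."""
    def forward(self, x):
        e = torch.exp(x)
        return F.avg_pool2d(e * x, 2) / F.avg_pool2d(e, 2)

class DTIFSketch(nn.Module):
    """Rough encoder-compression-decoder layout with one skip connection;
    channel widths, block counts and the exact skip wiring are illustrative."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.enc1 = nn.Sequential(DTITBlock(channels), DTITBlock(channels))
        self.enc2 = nn.Sequential(DTITBlock(channels), DTITBlock(channels))
        self.enc3 = DTITBlock(channels)
        self.pool = SoftPool2d()
        self.compress = nn.Conv2d(channels, channels, 1)   # stand-in for MSFE (Section 3.3.4)
        self.up = nn.Upsample(scale_factor=4, mode="nearest")
        self.dec = nn.Sequential(DTITBlock(channels), DTITBlock(channels))

    def forward(self, x):
        e1 = self.enc1(x)                  # full resolution
        e2 = self.enc2(self.pool(e1))      # 1/2 resolution
        e3 = self.enc3(self.pool(e2))      # 1/4 resolution
        d = self.up(self.compress(e3))     # back to full resolution
        return self.dec(d + e1)            # skip connection from the encoder

feat = torch.randn(1, 64, 64, 64)          # (batch, channels, H, W)
out = DTIFSketch()(feat)                   # -> (1, 64, 64, 64)
```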
3.3.1. Depth–Texture Interaction Transformer
Because the Swin Transformer performs self-attention within fixed-size windows, features near window boundaries are prone to blurring or incoherence, particularly in areas with complex texture details, making it difficult to recover degraded edge pixels effectively. This local window division hinders seamless information flow between different windows, resulting in suboptimal performance on texture blocks in remote sensing images. Furthermore, although the shifted-window mechanism allows some inter-window interaction, its alternating local operations still limit information propagation, particularly for long-range dependencies across windows. Unlike global self-attention mechanisms, the Swin Transformer cannot capture global information at each layer; instead, it builds global perception gradually through multi-layer accumulation. Consequently, for tasks that require reasoning across multiple local regions, it may not sufficiently model long-range dependencies.
To overcome these challenges, we introduce an N–Gram window interaction module prior to the self-attention operations within the module. Borrowing the N–Gram concept from natural language processing, this module establishes tight feature relationships among multiple adjacent windows, breaking the limitation of the traditional Swin Transformer’s purely local window operations: features from different windows interact during computation, enabling more comprehensive information propagation and fusion. This enhancement not only strengthens information interaction within local windows but also alleviates edge blurriness and disjointed local features by capturing contextual information across windows. In the depth–texture interaction Transformer (DTIT) module, our Cross-Window Importance Aggregation (CWIA) block spans the combined areas of W-Trans and SW-Trans following the N–Gram interaction module and connects to the outputs of the self-attention mechanism through residual connections. This block considers feature importance in both directions simultaneously, employing different pooling strategies to extract finer texture information according to directional significance. Through this integrative feature extraction approach, the model can more accurately identify and restore critical structures and details within images, enhancing overall image quality and recognition accuracy. The DTIT module is thus able to capture complex textures that are difficult to handle with conventional convolutions or window mechanisms, transcending the local limitations of the Swin Transformer and ultimately improving overall image quality.
3.3.2. N–Gram Window Interaction Block
In the Swin Transformer, the original window self-attention (SA) and cross-window self-attention (WSA) are computed as follows:

$$\mathrm{SA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$, and $V$ denote the matrices representing the query, key, and value, respectively, and $d_k$ refers to the dimension of the key.
The cross-window self-attention (WSA) is computed as

$$\mathrm{WSA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}} + A\right)V,$$

where $A$ is the window shift (mask) matrix, which is used to address the limited receptive field issue caused by window shifting.
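For reference, a minimal PyTorch sketch of the windowed attention above, with the shift matrix handled as an additive mask on the attention scores; tensor shapes are illustrative.

```python
import torch

def window_self_attention(q, k, v, shift_mask=None):
    """Scaled dot-product attention within each window; an additive mask plays
    the role of the window shift matrix A in the shifted-window (WSA) case."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5       # (windows, tokens, tokens)
    if shift_mask is not None:
        scores = scores + shift_mask                    # suppress cross-border attention
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 4 windows of 7x7 = 49 tokens with 32-dimensional features.
q = k = v = torch.randn(4, 49, 32)
out = window_self_attention(q, k, v)                    # -> (4, 49, 32)
```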
To address the window shifting issue in Swin-V1, we adopt the concept proposed by Choi et al. [32], which utilizes Uni–Gram non-overlapping local windows based on the N–Gram language model, as shown in Figure 4. In this formulation, attention is restricted to each Uni–Gram window through an element-wise mask: the attention computation is multiplied element-wise (⊙) by the Uni–Gram window mask matrix $M$, which defines the window range.
Each adjacent Uni–Gram window can be combined into a larger N–Gram window by concatenating the query, key, and value matrices of multiple Uni–Gram windows. In the N–Gram language model, consecutive forward, backward, or bidirectional words are considered as target words. Using N–Gram to define the window for WSA allows pixels within the window to influence each other through WSA, thereby enlarging the receptive field. This expansion enhances the capability of the model by increasing the accuracy of extracting details through a broader contextual understanding.
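The following sketch illustrates this merging of adjacent Uni–Gram windows into a larger N–Gram window before attention. The grouping into non-overlapping runs of n windows and all tensor sizes are simplifying assumptions; the actual N–Gram interaction of Choi et al. [32] is more elaborate.

```python
import torch

def ngram_window_attention(q, k, v, n: int = 2):
    """Concatenate the tokens of n consecutive Uni-Gram windows into one larger
    N-Gram window before self-attention, so pixels of neighbouring windows can
    interact; the grouping (non-overlapping runs of n windows) is illustrative."""
    w, t, d = q.shape                                   # windows, tokens per window, dim
    assert w % n == 0, "number of windows must be divisible by n"
    merge = lambda x: x.reshape(w // n, n * t, d)       # fuse n adjacent windows
    qm, km, vm = merge(q), merge(k), merge(v)
    scores = qm @ km.transpose(-2, -1) / d ** 0.5
    out = torch.softmax(scores, dim=-1) @ vm
    return out.reshape(w, t, d)                         # split back into Uni-Gram windows

q = k = v = torch.randn(4, 49, 32)                      # 4 windows of 7x7 tokens, 32-dim
out = ngram_window_attention(q, k, v, n=2)              # -> (4, 49, 32)
```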
3.3.3. Cross-Window Importance Aggregation
Given that the Swin Transformer relies heavily on window alternation for information interaction, it tends to perform poorly on edge-blurred textures, making it challenging to precisely restore damaged pixels through content interaction. To resolve this problem, we developed the CWIA block. This module extends across the window-partitioned areas of both W-Trans and SW-Trans and is linked to the result of self-attention by residual connections. The CWIA block applies two-dimensional Local Importance Pooling (LIP) [45] in both the horizontal and vertical orientations, using the following calculation:
Let $X$ denote the input feature map of size $H \times W \times C$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels. For each position $(i, j)$ in the input feature map $X$, the LIP operation computes the importance score $S(i, j)$ as the local sum of absolute differences (LSAD) [46] between the pixel values within a neighborhood window centered at $(i, j)$. The importance score $S(i, j)$ for position $(i, j)$ is calculated as

$$S(i, j) = \sum_{u=-k}^{k}\sum_{v=-k}^{k}\big|X(i, j) - X(i+u, j+v)\big|,$$

where $k$ is the radius of the neighborhood window and $X(i+u, j+v)$ represents the pixel value at position $(i+u, j+v)$ in the input feature map $X$.
The two pooled outputs are multiplied and then subjected to self-attention; the result is added, via a residual connection, to the direct output of W-Trans and SW-Trans. The inclusion of this module enhances the representational capacity for details, allowing deep extraction of texture-block details and improving the accuracy of pixel recovery.
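A vectorised sketch of the LSAD importance score above is given below; it computes, for every position and channel, the sum of absolute differences to its (2k+1) × (2k+1) neighbourhood. The padding mode and tensor layout are assumptions, and the directional pooling and multiplication of the two pooled outputs described above are omitted.

```python
import torch
import torch.nn.functional as F

def lsad_importance(x: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Local sum of absolute differences: for each position, sum |centre - neighbour|
    over a (2k+1)x(2k+1) window; x has shape (batch, channels, H, W)."""
    b, c, h, w = x.shape
    pad = F.pad(x, (k, k, k, k), mode="replicate")
    # Unfold gathers every neighbourhood; reshape to (batch, channels, window, positions).
    patches = F.unfold(pad, kernel_size=2 * k + 1).view(b, c, (2 * k + 1) ** 2, h * w)
    centre = x.view(b, c, 1, h * w)
    return (patches - centre).abs().sum(dim=2).view(b, c, h, w)

feat = torch.randn(1, 8, 16, 16)
importance = lsad_importance(feat)        # importance score per position and channel
```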
3.3.4. Multi-Scale Feature Enhancer
The Multi-Scale Feature Enhancer (MSFE) block, serving as the compression layer of the network, consists of three steps: transpose interpolation, concatenation, and grouped convolution. First, the output from the third layer of the encoder is fed into the MSFE block, where it is upsampled by the transpose interpolation layer, yielding an expanded feature map of increased resolution. Unlike traditional interpolation methods, transpose interpolation integrates better with the feature representations inside convolutional neural networks, showing clear advantages in preserving spatial information and detail features. Additionally, transpose interpolation smooths the feature map during expansion, avoiding artifacts such as the checkerboard effect, while requiring no additional training weights. This allows simple parameter control over the output resolution, effectively alleviating the computational burden and enhancing the overall quality and efficiency of the upsampling. Subsequently, the MSFE block concatenates the results across channels in the channel domain to amplify and reinforce identical features. The combined features are then processed by the grouped convolution layer. Grouped convolution partitions the channels into groups and performs convolution within each group. This mechanism enables the network to learn different feature subspaces, producing richer feature representations upon final aggregation and particularly enhancing the representation of edge and complex texture blocks in intricate remote sensing images. Compared with standard convolution, grouped convolution significantly reduces the number of parameters and the computational load, further lowering the demand on computational resources. Through these steps, the MSFE block efficiently extracts multi-scale features while retaining the capability to handle complex textures. The final output is then fed into the decoder to complete the subsequent reconstruction tasks.
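A rough PyTorch sketch of the three MSFE steps is given below. The weight-free transposed upsampling is realised here with a fixed bilinear kernel, the second concatenation input is assumed to be a skip feature, and the x2 scale, kernel sizes, and group count are illustrative choices rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFESketch(nn.Module):
    """Rough MSFE layout: weight-free transposed upsampling, channel-wise
    concatenation, then grouped convolution."""
    def __init__(self, channels: int = 64, groups: int = 4):
        super().__init__()
        # Fixed bilinear kernel so the transposed convolution needs no training weights.
        k = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 4.0
        self.register_buffer("kernel", k.expand(channels, 1, 3, 3).clone())
        self.grouped = nn.Conv2d(2 * channels, channels, 3, padding=1, groups=groups)

    def forward(self, x, skip):
        # Step 1: stride-2 transposed upsampling with the fixed kernel, per channel.
        up = F.conv_transpose2d(x, self.kernel, stride=2, padding=1, output_padding=1,
                                groups=x.size(1))
        # Step 2: concatenate across channels to amplify matching features.
        fused = torch.cat([up, skip], dim=1)
        # Step 3: grouped convolution, cheaper than a standard convolution.
        return self.grouped(fused)

x = torch.randn(1, 64, 16, 16)                  # encoder output at lower resolution (toy sizes)
skip = torch.randn(1, 64, 32, 32)               # feature map at the target resolution
out = MSFESketch()(x, skip)                     # -> (1, 64, 32, 32)
```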