Article

Feature Intensification Using Perception-Guided Regional Classification for Remote Sensing Image Super-Resolution

1
Xi’an Key Laboratory of Image Processing Technology and Applications for Public Security, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
2
School of Artificial Intelligence, OPtics and ElectroNics, Northwestern Polytechnical University, Xi’an 710129, China
3
Northwest Land and Resource Research Center, Shaanxi Normal University, Xi’an 710119, China
*
Author to whom correspondence should be addressed.
Remote Sens. 2024, 16(22), 4201; https://doi.org/10.3390/rs16224201
Submission received: 5 September 2024 / Revised: 24 October 2024 / Accepted: 8 November 2024 / Published: 11 November 2024

Abstract
In recent years, super-resolution technology has gained widespread attention in the field of remote sensing. Despite these advancements, most current methods apply a uniform reconstruction strategy across an entire remote sensing image, ignoring the inherent variability of its spatial frequency distribution, particularly the distinction between high-frequency texture regions and smoother areas. This introduces redundant computation and fails to devote more capacity to regions of higher complexity. To address these issues, we propose the Perception-guided Classification Feature Intensification (PCFI) network. PCFI integrates two key components: a compressed sensing classifier that balances speed and accuracy, and a deep texture interaction fusion module that enhances content interaction and detail extraction. The network mitigates the tendency of Transformers to favor global information over local details, achieving better integration of image information through residual connections across windows. Furthermore, the classifier segments sub-image blocks prior to super-resolution, enabling efficient large-scale processing. Experimental results on the AID dataset indicate that PCFI achieves state-of-the-art performance, with a PSNR of 30.87 dB and an SSIM of 0.8131, while reducing computation time by approximately 4.33% compared to the second-best method.

1. Introduction

Remote sensing images, acquired using remote sensing technology, contain a wealth of geographical information and depict the surface or atmosphere of the Earth. These images are extensively utilized in many fields, including disaster monitoring, agricultural and forestry surveys, resource exploration, and even military surveillance. However, the visual clarity of remote sensing images acquired from satellites is frequently constrained by the capabilities of the imaging devices, optical sensors, and long-distance transmission, as well as environmental interference, resulting in noise that diminishes image quality [1]. This reduces the amount of useful information and impairs further analysis and processing of the images. Because hardware upgrades are expensive, research has turned towards software-based methods for processing remote sensing images at lower cost. As a widely used method for enhancing image resolution, image super-resolution (SR) technology can greatly improve the quality of remote sensing images.
Image super-resolution technology is a crucial technique for enhancing image resolution and quality. The objective of super-resolution is to generate high-resolution (HR) images from low-resolution (LR) images while preserving essential image details [2,3,4]. Traditional super-resolution techniques, such as the iterative back-projection method developed by Irani et al. [5], have achieved significant success in restoring image information and laid the foundation for this field. However, these traditional techniques are limited when dealing with complex scenes or lower-quality images, producing artifacts such as jagged edges and blurriness that degrade the overall reconstruction quality.
As an important part of deep learning, convolutional neural networks (CNNs) [6] have attracted a lot of interest from academics looking to improve super-resolution (SR) technology. Due to their superior feature learning and representation capabilities, CNNs have demonstrated excellent performance in super-resolution reconstruction [7,8]. The initial CNN-based SR reconstruction network, SRCNN [9], shows substantial improvements over traditional techniques. Lim et al. [10] make significant progress in this field by developing EDSR, which is built upon SRResNet and eliminates batch normalization in each module. However, because CNN models typically employ fixed-size convolutional kernels, they struggle to effectively capture long-range dependencies and allocate computational resources efficiently. This results in redundancy in smooth regions and insufficient attention to edges and texture regions, failing to adequately focus on high-frequency features. With the advent of the Transformer [11], this novel deep learning architecture has also been applied to SR tasks. The Transformer uses a self-attention mechanism as the core component of its encoder–decoder [12] structure, which helps the model efficiently capture comprehensive information from input sequences. This allows for more organized and effective feature extraction. Dosovitskiy et al. [13] propose the Vision Transformer (ViT) model as a solution for image classification. This model demonstrates superior capability in capturing holistic information from images, outperforming earlier CNN-based SR models.
However, even with the application of Transformers to super-resolution tasks, existing methods still exhibit significant shortcomings. Firstly, while the multi-scale self-attention mechanism employed by Transformers greatly enhances the capture of global information dependencies, it neglects local edge details. This oversight poses a challenge in effectively reconstructing precise features in the generated images, particularly when dealing with remote sensing images in complex terrains or dense urban environments, which may lead to inaccuracies in the restored information. Secondly, many models seek to improve performance by increasing depth; however, the complexity of self-attention grows quadratically with the size of the image blocks. Although this complexity allows for parallel computation, it also incurs higher computational overhead, thereby reducing the processing speed. This drawback becomes particularly pronounced in large-scale super-resolution tasks for satellite-acquired remote sensing images, further intensifying the demand for computational hardware. Consequently, balancing model depth with computational efficiency emerges as a key challenge in optimizing super-resolution models. To accelerate large-scale image SR tasks, Kong et al. [14] propose ClassSR, which first classifies [15] and then performs SR tasks. RL-Restore [16] and Path-Restore [17] methods decompose the image into sub-images and then use reinforcement learning to estimate and select suitable processing paths. However, these methods still suffer from limited receptive fields, resulting in inflexible partitioning, numerous partition errors, and overall lower model performance.
To tackle the previously mentioned concerns, we propose a novel super-resolution technique, the Perception-guided Classification Feature Intensification network (PCFI). PCFI consists of two key modules: the integrated compressive sensing (CS)-based perception classifier module, abbreviated as ICPC, and the depth–texture interaction fusion module, abbreviated as DTIF. ICPC adopts a concise and focused approach to feature classification. This module employs the CS technique to sample data from the image. We denote the mapping domain in which the measurements derived from compressed sensing are analyzed as the perceptual domain. Given a defined sampling rate, the resulting measurements contain only a limited quantity of sampled data. These measurements are then fed into a pre-trained model designed to extract and compile perceptual-domain features. By learning the perceptual features of the image for classification, ICPC can better capture semantic information and visual features within the image.
Since perceptual domain features are less sensitive to noise and redundant information in the image [18], ICPC can reduce the influence of these factors on classification results, thereby improving the stability and robustness of classification. In addition, ICPC offers more precise prior information for subsequent image super-resolution operations, hence enhancing the quality of image reconstruction. By integrating perceptual domain feature classification, ICPC not only boosts the accuracy of image classification but also enhances the quality of reconstructing large remote sensing images while accelerating SR tasks. Consequently, it improves the overall effectiveness and efficiency of image processing. The other primary module, DTIF, is based on the Swin Transformer framework and employs window interactions to extract deep texture information, thereby accurately restoring the detailed features of complex textured regions.
To address the inclination of the Transformer to prioritize overall information while neglecting specific details, we incorporate an N–Gram language model into the window sliding component. Window-based Self-Attention (WSA) enables pixels to interact with each other across windows, resulting in more precise capture of detailed information. Additionally, within DTIF, we incorporate the Cross-Window Importance Aggregation (CWIA) block that spans both W-Trans and SW-Trans, residually connecting with the output of the self-attention mechanism. This block applies two-dimensional channel-wise local significance pooling to improve the representation of edge-blur features. The well-designed PCFI network addresses the limitations of previous methods caused by the self-attention mechanism and their poor performance on large images. This design offers substantial benefits in handling extensive remote sensing images, consequently enhancing the overall quality and efficiency of image super-resolution operations.
In summary, this paper presents the following contributions:
  • Proposed Perception-guided Classification Feature Intensification Network. Unlike traditional super-resolution methods, PCFI significantly enhances processing speed and improves image reconstruction quality by simultaneously extracting detailed features for super-resolution tasks.
  • Constructed Integrated Compressive Sensing-based Perception Classifier Module. ICPC leverages perceptual domain features to classify image blocks, substantially improving classification accuracy and effectively accelerating large-scale image reconstruction tasks.
  • Designed Depth–Texture Interaction Fusion Module. DTIF integrates attention mechanisms and texture interactions, enhancing information exchange across windows and spatial dimensions, thereby strengthening the representation of local details in complex textures or edge areas. This approach achieves more precise restoration of degraded image details.
The remainder of this paper is organized as follows: Section 2 reviews related work, Section 3 describes the proposed methodology, Section 4 presents the experiments, and Section 5 concludes the paper.

2. Related Work

2.1. Super-Resolution in Natural Images

In recent years, advancements in super-resolution reconstruction technology have significantly propelled the development of low-resolution image processing. Most models have been improved and innovated based on natural image datasets such as ImageNet and DF2K. Concurrently, the construction of diverse datasets and enhancements in computer performance have led to the emergence of numerous models based on classic networks like CNNs and GANs. Dong et al. [9] introduce SRCNN, the first CNN model for single-image super-resolution reconstruction, which directly learns an end-to-end mapping from low-resolution to high-resolution images. This model features a simple network structure and high pixel accuracy, but its numerous parameters make training challenging. VDSR [19], a classic residual model proposed by Kim et al., uses fewer parameters and increases network depth to extract more feature maps, demonstrating that deeper networks facilitate image reconstruction. Although the model converges more easily and achieves higher reconstruction accuracy than SRCNN, training remains demanding and relatively slow. Lai et al. [20] propose LapSRN, which incorporates the traditional Laplacian Pyramid, representing a further enhancement over previous residual models. This model enlarges feature maps progressively through stepwise upsampling, allowing for residual prediction at each level, thus improving training speed. SRGAN [21], based on the GAN network, employs a deep residual network as the generative function and uses perceptual loss as the optimization target. By training both the generator and discriminator simultaneously, it produces more natural images. However, the high complexity of the network results in significant training difficulty and lower pixel accuracy of the reconstructed images. ESRGAN [22] builds on SRGAN by integrating multi-level residual networks and dense connections into dense residual blocks. This model is relatively more stable and produces reconstructed images with richer texture details.

2.2. Super-Resolution in Remote Sensing

As super-resolution technology in remote sensing imagery advances, researchers are increasingly concentrating on techniques to improve the resolution of these images. Remote sensing images differ notably from natural images because objects and their surroundings are inherently interdependent, necessitating the inclusion of environmental information. Liebel et al. [9] are the first to use the Sentinel-2 remote sensing image dataset to train SRCNN for single-image super-resolution reconstruction in remote sensing. However, due to the multi-scale nature of remote sensing images, the reconstruction results are still suboptimal. To address multi-scale feature extraction, Lei et al. [23] introduce LGCNet based on CNN for remote sensing images. This network cascades the results from different layers, employing a “multi-branch” structure to learn multi-level representations of remote sensing images. Compared to SRCNN, the quality of the reconstructed images is better, but the large number of parameters results in slower processing speeds. Xu et al. [24] propose DMCN, a symmetric hourglass-shaped CNN architecture. Through the design of multiple skip connections, the network complexity is significantly reduced, and the processing time is shortened. However, the reconstruction results do not show a substantial improvement and still exhibit artifacts. Jiang et al. [25] introduce EEGAN, a method that employs an adversarial learning approach to enhance remote sensing images by reducing noise sensitivity and improving the restoration of high-frequency edge details. These methods provide various solutions that improve the resolution of remote sensing images by utilizing distinct network topologies and learning algorithms. Therefore, they contribute to the advancement of remote sensing image processing.

2.3. Transformer in Super-Resolution

Transformer-based super-resolution reconstruction is an innovative approach developed in recent years. The Transformer model relies entirely on the self-attention mechanism, which replaces conventional RNN sequence processing with a parallelized self-attention mechanism to address the issue of long-range dependencies. Transformer-based super-resolution methods fall into two main categories: those that combine the attention mechanism with a CNN backbone, and those that consist of the attention mechanism alone. Liang et al. [26] propose SwinIR, which integrates the Swin Transformer with SR tasks for the first time. This model divides the image into blocks and applies self-attention mechanisms at the block level to learn global dependencies between pixels. RCAN [27] updates the weights according to the importance of different channels, strengthening the useful channels while suppressing the useless ones and improving the utilization of computational resources. CRAN [28], proposed by Zhang et al., utilizes contextual reasoning based on the global context and introduces channel and spatial interaction to generate attention masks that enhance the reconstruction performance of the network. To mitigate the traditional weakness of the Transformer in local detail localization caused by its global focus, TraFuse [29] employs a sequentially stacked coding structure of CNN and Transformer and uses simple progressive upsampling in the encoders of the branches to increase the reconstruction performance of the network.

2.4. N–Gram Language Model

N-gram is a statistical language model (LM) [30] based on the assumption that the occurrence of the $n$-th word is related to the previous $n-1$ words and unrelated to any other words. The probability of an entire sentence is the product of the probabilities of each word, which can be calculated through statistical analysis of the corpus. The first characteristic of the N-gram model is that the occurrence of a word depends on several preceding words. The second characteristic is that the more information we have, the more accurate the prediction becomes.
Assume a sentence $S$ consists of a sequence of words $w_1, w_2, w_3, \ldots, w_n$. The N-gram language model [31] can be expressed with the following formula (each word $w_i$ depends on the influence of the words from the first word $w_1$ to the preceding word $w_{i-1}$):
$P(S) = P(w_1, w_2, w_3, \ldots, w_n) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1, w_2) \cdots P(w_n \mid w_1, w_2, w_3, \ldots, w_{n-1})$
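As a toy illustration of this chain-rule factorization, the following Python sketch estimates a bigram ($N=2$) model from raw corpus counts and multiplies the conditional probabilities of a sentence. The corpus and helper names are illustrative and not part of the paper.

```python
from collections import Counter

# Toy corpus; in practice the counts come from a large text corpus.
corpus = [["the", "cat", "sat"], ["the", "cat", "ran"], ["a", "dog", "sat"]]

unigram = Counter(w for s in corpus for w in s)
bigram = Counter((s[i], s[i + 1]) for s in corpus for i in range(len(s) - 1))
total = sum(unigram.values())

def sentence_prob(sentence):
    """P(S) = P(w1) * prod_i P(w_i | w_{i-1}), with each conditional probability
    approximated by count(w_{i-1}, w_i) / count(w_{i-1})."""
    p = unigram[sentence[0]] / total
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigram[(prev, cur)] / max(unigram[prev], 1)
    return p

print(sentence_prob(["the", "cat", "sat"]))  # 2/9 * 1.0 * 0.5 ≈ 0.11
```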
Initially, N-gram language models were only used for text analysis tasks such as spell checking, grammar checking, and text generation. However, because the N-gram concept in images is analogous to that in text, in recent years some researchers have begun to apply N-gram language models to various computer vision tasks. Choi et al. [32] proposed the NGSwin network, which was the first to combine N-gram with super-resolution reconstruction tasks. They defined a Uni–Gram as a local window within the Swin Transformer, where pixels interact through self-attention, overcoming the limitations of the Swin Transformer that arise from its emphasis on global information.

3. Method

Our proposed approach primarily involves compressive sensing classification of input remote sensing images, followed by deep texture extraction on the classified images to enhance the representation capability of the super-resolved images. In Section 3.1, we present a comprehensive description of our proposed PCFI model structure. In Section 3.2, we provide a detailed explanation of ICPC, which involves classifying sub-image blocks according to predetermined ranges. Smooth blocks primarily contain background and contour information, while edge blocks are more complex, containing detailed information, but they may suffer from information blur and partial feature loss due to edge blurring. Therefore, in Section 3.3, we introduce the deep texture extraction module, which enhances the representation capability of blurred features caused by edge blur, thereby improving the ability of the SR model to express details when dealing with complex remote sensing images.

3.1. Overall Structure

Figure 1 provides a visual representation of the overall network architecture of our proposed PCFI network. Comprising two main modules, namely the ICPC and the DTIF, our PCFI architecture is designed to process single HR images. Upon input, the HR image is cropped into several equally sized sub-images. These sub-images are then individually fed into ICPC, where the compressive sensing technique [33] is applied and the resulting perceptual features are used to classify each sub-image into one of three classes: smooth blocks, margin blocks, and texture blocks, as detailed in Section 3.2. Subsequently, the classified sub-images are directed into the DTIF for deep content extraction. This module, a hybrid of the U-Net [34] architecture and the Swin Transformer architecture, adheres to the fundamental structure of U-Net. The specific composition of the DTIF will be elucidated in Section 3.3.

3.2. Integrated Compressive Sensing-Based Perception Classifier Module

Our integrated compressive sensing-based perception classifier (ICPC) module utilizes compressed sensing for feature classification, which effectively reduces computational complexity compared to the commonly employed deep learning methods, such as CNNs or self-attention mechanisms. Although CNNs demonstrate excellent performance in feature extraction, they often incur substantial computational and memory overhead. For instance, remote sensing images are typically large-scale; when processed using self-attention mechanisms, the model must account for dependencies between every pixel and region, significantly increasing the computational complexity. In contrast, CNNs rely on deep networks for layer-by-layer processing, requiring numerous convolutional kernels and network layers to capture features of varying complexities. This often leads to the processing of considerable redundant information to achieve comprehensive feature classification, which greatly diminishes processing speed, increases the computational burden, and runs counter to the ICPC module’s goal of accelerating super-resolution tasks.
Moreover, remote sensing images frequently contain a substantial amount of redundant information, with critical data often concentrated in a few specific regions or frequency bands. Many areas within remote sensing images, such as oceans and deserts, are usually uniform or exhibit extensive smooth textures. The pixel value variations in these regions are minimal and can be represented with fewer non-zero or significant values. By leveraging the spatial sparsity of such areas, the compressed sensing technique can save storage space and reduce processing complexity. When addressing complex textures, these regions typically exhibit sparse representations in certain transformation domains (e.g., wavelet transforms, discrete cosine transforms). CS effectively captures these sparse features, facilitating the identification of different texture categories in classification tasks.
In this section, we start by cropping the image into equally sized blocks and employ the CS theory [35] discussed in this section to map pixel values to perceptual values. Subsequently, the pixel values of each block are multiplied by the measurement matrix to obtain measurement values. These measurement values are then processed to derive features within the measurement value range, allowing for the classification of image blocks based on features within the perceptual domain [36]. This module categorizes image blocks into three classes: smooth, margin, and texture blocks. The specific structure is illustrated in Figure 2.
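As a rough sketch of this front end (block size, sampling rate, and variable names are illustrative assumptions, not values from the paper), the following Python code crops an image into equal blocks and multiplies each block by a shared Gaussian measurement matrix to obtain perceptual-domain measurement values:

```python
import numpy as np

def crop_blocks(img, block=32):
    """Split a single-channel image into flattened, non-overlapping block x block patches."""
    h, w = img.shape
    return np.stack([img[i:i + block, j:j + block].ravel()
                     for i in range(0, h - block + 1, block)
                     for j in range(0, w - block + 1, block)])      # (N, block*block)

def cs_measure(patches, rate=0.1, seed=0):
    """Project every patch through the same Gaussian sensing matrix (mean 0, variance 1/m)."""
    n = patches.shape[1]
    m = max(1, int(rate * n))
    rng = np.random.default_rng(seed)
    sigma = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
    return patches @ sigma.T                                        # (N, m) measurements

img = np.random.rand(128, 128).astype(np.float32)   # stand-in for one cropped sub-image
z = cs_measure(crop_blocks(img))                    # perceptual-domain values per block
```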

3.2.1. Definition of Correlation Between Signals in Two Domains

Since the mutual covariance measures the similarity between two signals, we utilize mutual covariance to assess the correlation of frequency domain signals. Specifically, the correlation between two frequency domain image signals is defined as
$\Gamma_F = \mathrm{cov}(\alpha, \theta)$
where $\Gamma_F$ denotes the frequency domain correlation of the signals, $\mathrm{cov}$ represents the mutual covariance operator, and $\alpha$ and $\theta$ are the two image signals in the DCT domain.
When the value of Γ F is positive, it indicates that the two signals are positively correlated; conversely, a negative value indicates a negative correlation. The larger the absolute value of Γ F , the higher the similarity between the two signals, meaning the correlation is stronger. Conversely, a smaller absolute value indicates lower similarity and weaker correlation. In image analysis, we typically start from the pixel domain. However, directly analyzing in the pixel domain can be challenging due to the large volume of data and the lack of intuitive insights, making it difficult to effectively extract useful information. Therefore, we project the image data into the frequency domain for analysis, where frequency domain signals can more clearly reflect the pixel features of the image. According to the CS theory [37] presented in this section, there exists a linear relationship between the perceptual domain and the frequency domain. This means that the frequency domain can be regarded as an effective representation of the pixel domain, while the perceptual domain can reflect the detail level and complexity of the pixel domain images. Thus, by analyzing the signals in the perceptual domain, we can achieve a certain degree of extraction and analysis of image pixel features.
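A minimal numerical illustration of this definition, using two randomly generated blocks as placeholders for real image patches:

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
a = rng.random((32, 32))              # placeholder image block
b = rng.random((32, 32))              # placeholder image block

alpha = dctn(a, norm="ortho").ravel() # DCT-domain signal of block a
theta = dctn(b, norm="ortho").ravel() # DCT-domain signal of block b

gamma_f = np.cov(alpha, theta)[0, 1]  # mutual covariance Gamma_F
print(gamma_f)                        # sign gives the direction, magnitude the strength
```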

3.2.2. Correlation of Linear Relationships Between Perceptual Domain and Frequency Domain

For each block $S_{z1}, S_{z2}, S_{z3}, S_{z4}$, corresponding to frequency domain signals $\alpha_1, \alpha_2, \alpha_3, \alpha_4$, independent compressed sensing measurements [38] are conducted utilizing an identical sensing matrix. This process yields perception values $z_1, z_2, z_3, z_4$, which are directly concatenated as row vectors to construct the perceptual domain $z = [z_1, z_2, z_3, z_4]^T$. Given $z_i = [z_{i1}, z_{i2}, \ldots, z_{in}]^T$ for $i = 1, 2, 3, 4$, the representation of the feature image in the perceptual domain is given by
$z = \begin{bmatrix} z_{11} & z_{12} & \cdots & z_{1n} \\ z_{21} & z_{22} & \cdots & z_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ z_{41} & z_{42} & \cdots & z_{4n} \end{bmatrix}$
In this formulation, each z i is represented as a column vector, where z i j signifies the j-th perception value of the i-th block.
Covariance Matrix Vectorization. If X 1 , X 2 , , X n constitute a collection of random variables forming a random vector X = [ X 1 , X 2 , , X n ] T , and each random variable has m samples, then there exists a sample matrix, as follows:
$O = [\gamma_1, \gamma_2, \ldots, \gamma_n]^T = [\zeta_1, \zeta_2, \ldots, \zeta_m]$
where $\gamma_i$ (for $i = 1, \ldots, n$) is the vector of sample values of the $i$-th random variable, while $\zeta_j$ (for $j = 1, \ldots, m$) is the $j$-th sample vector of the random vector $X$. Consequently, the expression for the mutual covariance between random variables $X_i$ and $X_j$ is given by
$M_{ij} = E[X_i X_j] - E(X_i)\,E(X_j)$
Covariance estimates can be derived from sample values, as follows:
$M_{ij} = \frac{1}{m}\,\gamma_i^{T}\gamma_j - \frac{1}{m^{2}} \sum_{a=1}^{m} O_{ia} \sum_{b=1}^{m} O_{jb}$
Perceptual Domain Covariance Matrix. We conduct measurements on each image block using the same randomly generated measurement matrix. The measurement process is as follows:
$z_i = \sigma x_i \quad (i = 1, 2, \ldots, N)$
where $x_i$ represents the original signal of size $n \times 1$ and $z_i$ denotes the corresponding signal in the perceptual domain of size $m \times 1$. This procedure operates under the assumption that the random measurement matrix $\sigma$ follows a Gaussian distribution with a mean of zero and a variance of $1/m$. Each signal in the perceptual domain $z$ is associated with a signal in the frequency domain $\alpha$, as represented by
$x = D\alpha, \qquad z = \sigma D \alpha = \Gamma \alpha$
where Γ = σ D represents the projection matrix and D signifies the sparse basis Discrete Cosine Transform (DCT) matrix. The elements of the projection matrix Γ are derived as follows:
$\Gamma_{ij} = \sum_{k=1}^{n} \sigma_{ik} D_{kj} \quad (i = 1, 2, \ldots, m;\ j = 1, 2, \ldots, n)$
The variance of the elements of the projection matrix $\Gamma$ is determined as
$D[\Gamma_{ij}] = D\!\left[\sum_{k=1}^{n} \sigma_{ik} D_{kj}\right] = \frac{1}{m}\sum_{k=1}^{n} D_{kj}^{2} = \frac{1}{m}$
The frequency domain signals corresponding to N image blocks form the frequency domain sample matrix:
$\alpha_{N \times n} = [\alpha_1, \alpha_2, \ldots, \alpha_N]^T$
Assuming each signal in the perceptual domain is a random variable, the sample matrix over the perceptual domain is obtained as follows:
$z = [z_1, z_2, \ldots, z_N]^T = [Z_1, Z_2, \ldots, Z_m]$
where $z_i$ represents the sample vector of the $i$-th random variable, and $Z_j$ ($j = 1, \ldots, m$) represents each sample vector of the random vector.
Let $\Gamma_i$ ($i = 1, 2, \ldots, m$) denote the row vectors of the projection matrix $\Gamma$. Utilizing the vector form of the covariance matrix, we have
$M_z = \frac{1}{m}[Z_1, Z_2, \ldots, Z_m][Z_1, Z_2, \ldots, Z_m]^T - Z_o Z_o^T$
$M_z \approx \frac{1}{m}(\Gamma \alpha^T)^T(\Gamma \alpha^T)$
Finally, the covariance matrix in the perceptual domain can be approximated as
$M_z \approx \frac{1}{m}\,\alpha\,\Gamma^T \Gamma\,\alpha^T \approx \frac{1}{m}\,\alpha \alpha^T$
Frequency Domain Covariance Matrix. The frequency domain sample matrix is represented as
$\sigma_{N \times n} = [\sigma_1, \sigma_2, \ldots, \sigma_N]^T = [\xi_1, \xi_2, \ldots, \xi_n]$
where $\sigma_i$ ($i = 1, 2, \ldots, N$) denotes the vector containing all sample values of the $i$-th random variable, and $\xi_j$ ($j = 1, 2, \ldots, n$) corresponds to each sample vector of the random vector. The element in the $i$-th row and $j$-th column of the covariance matrix is the mutual covariance between signals $\sigma_i$ and $\sigma_j$:
$M_\sigma(i, j) = E[\sigma_i \sigma_j] - E[\sigma_i]\,E[\sigma_j]$
All Discrete Cosine Transform (DCT) coefficients within an image block adhere to a Gaussian distribution with a mean of zero; thus,
$E[\sigma_i] = E[\sigma_j] = 0$
$M_z \approx \frac{n}{m} M_\sigma$
The equation above captures the correlation between the signals in the perceptual and frequency domains: the covariance matrix of the perceptual-domain signals is essentially a linearly scaled version of that of the frequency-domain signals. Consequently, in compression-based image processing, correlation analysis of frequency domain signals can be performed directly on signals from the perceptual domain.
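The linear relation above can be checked numerically. The sketch below generates zero-mean DCT-domain block signals, measures them with a shared Gaussian sensing matrix, and compares the diagonals of the perceptual-domain and frequency-domain covariance matrices; the block count and sizes are illustrative assumptions.

```python
import numpy as np
from scipy.fft import dct

N, n, m = 50, 256, 32                               # blocks, signal length, measurements
rng = np.random.default_rng(1)

D = dct(np.eye(n), norm="ortho", axis=0)            # orthonormal DCT basis, n x n
sigma = rng.normal(0.0, 1.0 / np.sqrt(m), (m, n))   # Gaussian sensing matrix, var 1/m
gamma = sigma @ D                                   # projection matrix Gamma = sigma D

alpha = rng.normal(size=(N, n))                     # zero-mean frequency-domain signals
z = alpha @ gamma.T                                 # perceptual-domain measurements, N x m

M_freq = (alpha @ alpha.T) / n                      # frequency-domain covariance (zero mean)
M_perc = (z @ z.T) / m                              # perceptual-domain covariance

ratio = np.diag(M_perc) / np.diag(M_freq)
print(ratio.mean(), n / m)                          # the two values should be close (≈ 8)
```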

3.2.3. Image Patch Classification Based on Perceptual Domain Features

The elements of the perceptual domain covariance matrix are correspondingly related to the variances of the respective frequency domain signals, emphasizing the close relationship between frequency domain signals and pixel domain signals. Specifically, smooth image patches correspond to frequency domain signals with lower sparsity, indicating a prevalence of uniform information across frequencies. Conversely, edge and texture patches are associated with frequency domain signals that exhibit higher sparsity, reflecting the presence of distinct high-frequency components that characterize sharp transitions and intricate textures in the image. This relationship underscores the importance of understanding the interplay between spatial characteristics and their spectral representations in image processing tasks. Consequently, the variance of the perceptual domain covariance matrix var ( M z ) reflects the characteristics of image patches and can be utilized for their classification. The criteria for classification are defined as follows:
$\text{Class} = \begin{cases} \text{Smooth Block}, & \mathrm{var}(M_z) < T_1 \\ \text{Margin Block}, & T_1 \le \mathrm{var}(M_z) \le T_2 \\ \text{Texture Block}, & \mathrm{var}(M_z) > T_2 \end{cases}$
When judging the categories of image blocks, we combine multi-directional block classification as follows:
(1) Let $\bar{z}_i = \Phi\{x_i + \lambda F_{i+1}^{h}\}$ ($i = 1, 2, \ldots, N$), where $F_{i+1}^{h}$ refers to the feature vector of the block adjacent to the $i$-th block in the horizontal direction. Based on the covariance matrix of $\bar{z}_i$, determine its variance $\mathrm{var}(M_z)$ to classify it into one of the image block types and assign a parameter $CLS_h$.
(2) Let $\bar{z}_j = \Phi\{x_j + \lambda F_{j+1}^{v}\}$ ($j = 1, 2, \ldots, N$), where $F_{j+1}^{v}$ refers to the feature vector of the block adjacent to the $j$-th block in the vertical direction. Based on the covariance matrix of $\bar{z}_j$, determine its variance $\mathrm{var}(M_z)$ to classify it into one of the image block types and assign a parameter $CLS_v$.
(3) For the same image sub-block, consider the category parameters $CLS_h$ and $CLS_v$ from both the horizontal and vertical directions. The priority order, from high to low, is margin (edge) block, texture block, and smooth block; choose the category with the highest priority as the final category of the image block.
The algorithm simultaneously takes into account the global statistical characteristics and local distribution properties of image segments. By employing the 2 σ principle, it effectively computes the threshold. Smooth blocks primarily contain background and contour information, exhibiting relatively uniform features. In contrast, edge blocks are more complex, encompassing detailed information, such as boundaries corresponding to high-frequency components in the image. Consequently, edge blocks demonstrate higher sparsity in the frequency domain compared to smooth blocks.
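A minimal sketch of this multi-directional decision rule follows. It is simplified with respect to the paper: each block is scored here by the variance of its perceptual-domain measurements, and the two thresholds are derived from a 2-sigma style rule over all blocks; the directional labels are then merged with the stated priority (margin > texture > smooth).

```python
import numpy as np

PRIORITY = {"margin": 2, "texture": 1, "smooth": 0}   # margin (edge) > texture > smooth

def label_from_variance(v, t1, t2):
    """Apply the threshold rule: below T1 smooth, between T1 and T2 margin, above T2 texture."""
    if v < t1:
        return "smooth"
    return "margin" if v <= t2 else "texture"

def classify_blocks(z_h, z_v):
    """z_h / z_v: (N, m) perceptual measurements of each block augmented with its
    horizontal / vertical neighbour. Returns one label per block."""
    var_h, var_v = z_h.var(axis=1), z_v.var(axis=1)
    # 2-sigma style thresholds derived from the statistics of all blocks
    t1 = var_h.mean() - 2.0 * var_h.std()
    t2 = var_h.mean() + 2.0 * var_h.std()
    labels = []
    for vh, vv in zip(var_h, var_v):
        cls_h = label_from_variance(vh, t1, t2)
        cls_v = label_from_variance(vv, t1, t2)
        labels.append(max(cls_h, cls_v, key=PRIORITY.get))   # keep the higher-priority class
    return labels
```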

3.3. Depth–Texture Interaction Fusion Module

The depth–texture interaction fusion (DTIF) module employs a traditional U-Net encoder-decoder architecture, which effectively extracts and integrates contextual information while flexibly handling features of varying scales, demonstrating strong adaptability. Before inputting the image blocks into DTIF, they have already undergone feature classification via the ICPC module. This preprocessing allows us to dynamically adjust the number of DTIF modules based on the prior classification results. Specifically, for image blocks with different feature types, we can allocate varying numbers of DTIF modules. For example, simple and smooth regions may be assigned three DTIF modules; regions with pronounced edge features may utilize six modules; and complex texture regions may be assigned nine modules. This adjustment mechanism ensures that the model operates efficiently under different circumstances and optimizes feature extraction. In the DTIF, the primary task of the encoder is to extract detailed features of the image, while the decoder combines the encoder’s features with its own through skip connections, ensuring the retention of detail information. This design not only enhances the quality of reconstruction but also effectively reduces information loss, resulting in a clearer and more realistic final output image. By combining the advantages of both the encoder and decoder, the DTIF module is better equipped to adapt to various image features, thereby improving the performance of super-resolution reconstruction. The basic architecture of DTIF consists of a layered encoder, pooling layers, a compression layer, and a decoder, along with skip connection layers linking the encoder and decoder. This section will begin by presenting a comprehensive outline of the overall structure of DTIF. It will then proceed to offer in-depth explanations of the DTIT, the N–Gram window interaction block, CWIA, and MSFE.
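The depth allocation described above can be expressed as a simple dispatch on the ICPC class. The sketch below follows the 3/6/9 example in the text; `make_block` stands in for a constructor of one DTIF block, which is defined elsewhere in the network and is therefore an assumption here.

```python
import torch.nn as nn

DTIF_DEPTH = {"smooth": 3, "margin": 6, "texture": 9}   # counts from the example above

def build_dtif_stack(block_cls, make_block):
    """Return a stack of DTIF blocks whose depth depends on the ICPC class.
    make_block: zero-argument callable, e.g. lambda: DTIFBlock(dim=64) (hypothetical)."""
    return nn.Sequential(*[make_block() for _ in range(DTIF_DEPTH[block_cls])])
```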
Encoder. For a given LR image $X \in \mathbb{R}^{H \times W \times 3}$, the extracted feature $Y_i \in \mathbb{R}^{H \times W \times C}$ is obtained through pixel shuffle, where $H$, $W$, and $C$ represent the height, width, and channels, respectively. To enrich the transferred feature details and enhance the expression of content features, we design a three-layer encoder. Specifically, the extracted $Y_i$ first undergoes two stages composed of two DTIT (DTI Transformer) blocks and soft pooling [39] with a kernel size of 2 × 2, followed by processing through a third DTIT block. The details of the DTIT will be discussed in Section 3.3.1.
Pooling Layer. In contrast to non-deep learning methods such as average pooling [40] and max pooling [41], we utilize soft pooling [39] in this process, as shown in Figure 3. The calculation formula for soft pooling is as follows:
$\mathrm{SoftPooling}(\gamma) = \sum_{i=1}^{N} \omega_i \gamma_i$
where $\gamma_i$ denotes the input values, $N$ is the number of input values, and $\omega_i$ is the weight corresponding to the input value $\gamma_i$, usually calculated using the softmax function:
$\omega_i = \frac{e^{\gamma_i}}{\sum_{j=1}^{N} e^{\gamma_j}}$
The objective of this pooling approach is to minimize the information loss that occurs during the pooling process, while yet preserving the functionality of the pooling layer. By retaining information content features effectively, soft pooling facilitates deeper extraction of blurred information, thus significantly enhancing the performance of traditional SR methods in handling images with varying levels of clarity.
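A minimal 2-D re-implementation of this pooling rule (not the authors' code): every value inside a pooling window is weighted by its softmax score before summation. The 2 × 2 kernel matches the encoder description; in practice the exponential may need numerical stabilization.

```python
import torch
import torch.nn.functional as F

def soft_pool2d(x, kernel_size=2, stride=2):
    """x: (B, C, H, W). Softmax-weighted sum over each pooling window.
    Using avg_pool2d for both numerator and denominator cancels the window-size factor."""
    w = torch.exp(x)                                   # unnormalized softmax weights e^x
    num = F.avg_pool2d(w * x, kernel_size, stride)     # ~ sum(e^x * x) / window_size
    den = F.avg_pool2d(w, kernel_size, stride)         # ~ sum(e^x) / window_size
    return num / (den + 1e-12)                         # = sum(softmax(x) * x) per window

y = soft_pool2d(torch.randn(1, 64, 32, 32))            # -> (1, 64, 16, 16)
```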
Compression Layer. The design of a three-layer encoder and the choice of pooling layers enable the extraction of more image details, but they also convey a significant amount of image information. To address this, the introduction of the compression layer module (MSFE) effectively compresses input image information while extracting features. This means that, while compressing redundant information from simple image blocks, the compression layer also aids in capturing detailed features [42] from complex, blurred image blocks. The design of the compression layer not only reduces computational complexity and alleviates the computational burden on the subsequent decoder but also enhances the representation capability of similar features, effectively minimizing information loss during the feature transfer process to the decoder, thereby improving the quality of image reconstruction. The specific details of this module will be elaborated in Section 3.3.4.
Decoder. Similar to most U-Net network architectures, the decoder adopts the same modular structure as the encoder. Here, we also use a combination of DTIT and soft pooling blocks. Additionally, the final result of the first DTIT is designed to bypass the fully connected layer and instead enter the compression layer module. It is then combined with the output of the three-layer encoder through a residual connection [43] before being fed into the decoder. This skip-connection [44] design aids in optimizing the model by reducing computational complexity while leveraging locality and dependency to perform multiscale feature extraction on the input.

3.3.1. Depth–Texture Interaction Transformer

Due to the fixed-size window mechanism employed by the Swin Transformer for self-attention operations, features near the window boundaries are prone to blurriness or incoherence, particularly in areas with complex texture details, making it difficult to effectively recover degraded edge pixels. This local window division hinders seamless information connectivity between different windows, resulting in suboptimal performance in texture blocks within remote sensing images. Furthermore, although the Swin Transformer facilitates information interaction through its window mechanism, its alternating local operations still impose limitations on information propagation, particularly in terms of long-range dependencies across windows. While the Swin Transformer employs a multi-layer shifted window approach for inter-window information interaction, its ability to capture global information remains constrained. Unlike global self-attention mechanisms, the Swin Transformer cannot capture global information at each layer; instead, it gradually builds global perception through multi-layer accumulation. Consequently, when addressing tasks requiring traversal across multiple local regions, it may not sufficiently model long-range dependencies.
To effectively overcome these challenges, we introduce an N–Gram window interaction module prior to self-attention operations within the module. This N–Gram window interaction module establishes tight feature relationships between multiple adjacent windows, breaking the limitations of the traditional Swin Transformer’s local window operations. It enables features from different windows to interact during the computation process, facilitating more comprehensive information propagation and fusion. Leveraging the concept of N–Gram from natural language processing, the N–Gram window interaction module fosters tighter feature relationships among adjacent windows, achieving efficient inter-window information dissemination and integration. This enhancement not only boosts the model’s information interaction capabilities within local windows but also effectively alleviates issues of edge blurriness and disjointed local features by capturing contextual information across windows. In the depth–texture interaction Transformer (DTIT) block, our designed Cross-Window Importance Aggregation (CWIA) block spans the combined areas of W-Trans and SW-Trans following the N–Gram interaction module, connecting with the outputs of the self-attention mechanism through residual connections. This module allows for simultaneous consideration of feature importance in both directions, employing different pooling strategies to extract finer texture information based on directional significance. Through this integrative feature extraction approach, the model can more accurately identify and restore critical structures and details within images, enhancing overall image quality and recognition accuracy. The DTIT module is adept at capturing complex textures that are challenging to process using traditional convolution or window mechanisms, thereby transcending the local limitations of the Swin Transformer and ultimately improving overall image quality.

3.3.2. N–Gram Window Interaction Block

In the Swin Transformer, the original sliding window self-attention (SA) and cross-window self-attention (WSA) are computed as follows:
$\mathrm{SA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$
where Q, K, and V denote the matrices representing the query, key, and value, respectively. Additionally, d k refers to the dimension of the key.
The cross-window self-attention (WSA) is computed as
$\mathrm{WSA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q (K^T + A)}{\sqrt{d_k}}\right) V$
where A is the window shift matrix, which is used to address the limited receptive field issue caused by window shifting.
To address the window shifting issue in Swin-V1, we adopt the concept proposed by Choi et al. [32], which utilizes a Uni–Gram non-overlapping local window based on the N–Gram language model, as shown in Figure 4. The accurate formula is as follows:
$\text{Uni-Gram}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q (K^T \odot M)}{\sqrt{d_k}}\right) V$
where ⊙ denotes element-wise multiplication, and M is the Uni–Gram window mask matrix used to define the window range.
Each adjacent Uni–Gram window can be combined into a larger N–Gram window by concatenating the query, key, and value matrices of multiple Uni–Gram windows. In the N–Gram language model, consecutive forward, backward, or bidirectional words are considered as target words. Using N–Gram to define the window for WSA allows pixels within the window to influence each other through WSA, thereby enlarging the receptive field. This expansion enhances the capability of the model by increasing the accuracy of extracting details through a broader contextual understanding.
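The window attention variants above share the same scaled dot-product core. The sketch below implements it for a batch of windows, with the window restriction expressed as an additive −∞ mask (a common equivalent of the element-wise mask in the Uni–Gram formula); tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def window_attention(q, k, v, mask=None):
    """q, k, v: (num_windows, tokens_per_window, dim). Optional boolean mask of shape
    (tokens_per_window, tokens_per_window); positions where mask == 0 are excluded."""
    d_k = q.size(-1)
    attn = q @ k.transpose(-2, -1) / d_k ** 0.5        # (nW, T, T) attention logits
    if mask is not None:
        attn = attn.masked_fill(mask == 0, float("-inf"))
    return F.softmax(attn, dim=-1) @ v                 # weighted sum of values

windows = torch.randn(4, 64, 32)                       # 4 windows of 8x8 tokens, dim 32
out = window_attention(windows, windows, windows)      # plain window self-attention (SA)
```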

3.3.3. Cross-Window Importance Aggregation

Given that the Swin Transformer relies heavily on window alternation for information interaction, it tends to perform poorly when dealing with edge-blurred textures. This makes it challenging to precisely restore damaged pixels through content interaction. In order to resolve this problem, we have developed the CWIA block. This module extends across the window-blocked area in both W-Trans and SW-Trans and is linked to the result of self-attention by residual connections. The CWIA block utilizes two-dimensional Local Importance Pooling (LIP) [45] in both the horizontal as well as vertical orientations, using the following calculation:
Let $X$ denote the input feature map of size $H \times W \times C$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels. For each position $(i, j)$ in the input feature map $X$, the LIP operation computes the importance score $I_{ij}$ as the local sum of absolute differences (LSAD) [46] between the pixel values within a neighborhood window centered at $(i, j)$.
The importance score $I_{ij}$ for position $(i, j)$ is calculated as
$I_{ij} = \sum_{p=-k}^{k} \sum_{q=-k}^{k} \left| X_{(i+p)(j+q)} - X_{ij} \right|$
where $k$ is the radius of the neighborhood window, and $X_{(i+p)(j+q)}$ represents the pixel value at position $(i+p, j+q)$ in the input feature map $X$.
The two outputs after pooling are multiplied and subjected to self-attention. The result is then added to the output after passing through W-Trans and SW-Trans directly via an adder. The inclusion of this module enhances the representational capacity of details, allowing for deep extraction of texture block details to improve the accuracy of pixel recovery.
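A direct sketch of the importance score defined above, computed densely with unfold; the zero padding at the borders is an assumption.

```python
import torch
import torch.nn.functional as F

def local_importance(x, k=1):
    """x: (B, C, H, W). Sum of absolute differences between each pixel and its
    (2k+1) x (2k+1) neighbourhood, returned at the same spatial resolution."""
    window = 2 * k + 1
    b, c, h, w = x.shape
    patches = F.unfold(x, kernel_size=window, padding=k)        # (B, C*window^2, H*W)
    patches = patches.view(b, c, window * window, h, w)
    center = x.unsqueeze(2)                                     # (B, C, 1, H, W)
    return (patches - center).abs().sum(dim=2)                  # importance map (B, C, H, W)

scores = local_importance(torch.randn(1, 64, 32, 32))           # -> (1, 64, 32, 32)
```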

3.3.4. Multi-Scale Feature Enhancer

The Multi-Scale Feature Enhancer (MSFE) block, serving as the compression layer of the network, consists of three steps: transpose interpolation, concatenation, and grouped convolution. First, the output $y_s$ from the third layer of the encoder is fed into the MSFE block, where it undergoes upsampling through the transpose interpolation layer, resulting in an expanded feature map with increased resolution. Unlike traditional interpolation methods, transpose interpolation better integrates the feature representations within convolutional neural networks, demonstrating significant advantages in preserving spatial information and detail features. Additionally, transpose interpolation smooths the feature map during expansion, avoiding artifacts such as the checkerboard effect, while requiring no additional training weights. This process allows for simple parameter control over the output image resolution, effectively alleviating computational burdens and enhancing the overall quality and efficiency of the upsampling. Subsequently, the MSFE block concatenates the results across channels in the channel domain to amplify and reinforce identical features. The combined features are then processed through the grouped convolution layer. Grouped convolution operates by partitioning channels into groups, performing 3 × 3 convolution operations within each group. This mechanism enables the network to learn different feature subspaces, resulting in richer feature representations upon final aggregation, particularly enhancing edge and complex texture block representations when processing intricate remote sensing images. Compared to standard convolution, grouped convolution significantly reduces the number of parameters and computational load, thereby further lowering the demands on computational resources. Through these steps, the MSFE block efficiently extracts multi-scale features while ensuring the capability to handle complex textures. The final output $Y_S$ is then fed into the decoder to complete the subsequent reconstruction tasks.
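A compact sketch of this compression layer under the description above: a parameter-free bilinear upsampling stands in for the transpose interpolation (consistent with the statement that no additional training weights are required), followed by channel concatenation with a skip feature and a 3 × 3 grouped convolution. Channel counts and the group number are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFE(nn.Module):
    def __init__(self, channels=64, groups=4):
        super().__init__()
        # grouped 3x3 convolution over the concatenated (doubled) channels
        self.group_conv = nn.Conv2d(2 * channels, channels, kernel_size=3,
                                    padding=1, groups=groups)

    def forward(self, y_s, skip):
        # parameter-free upsampling standing in for "transpose interpolation"
        y_up = F.interpolate(y_s, size=skip.shape[-2:], mode="bilinear",
                             align_corners=False)
        fused = torch.cat([y_up, skip], dim=1)       # amplify and reinforce identical features
        return self.group_conv(fused)                # richer grouped feature subspaces

out = MSFE()(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 64, 64))   # -> (1, 64, 64, 64)
```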

4. Experiment

4.1. Experiment Setup

Implementation details. PCFI is implemented using the PyTorch 1.10.0 framework and trained on a single NVIDIA RTX 3090 Ti GPU. The network training strategy utilizes the Adaptive Moment Estimation (Adam) [47] algorithm for parameter optimization, with hyperparameters set to $\beta_1 = 0.9$ and $\beta_2 = 0.999$. In addition, we apply cosine annealing to decay the learning rate. We initialize the learning rate to $2 \times 10^{-4}$ and train for 250,000 iterations. During training, we randomly select 16 LR-HR image pairs from the training set and randomly crop LR image patches of size $64 \times 64$. To increase the diversity of the dataset, we further augment each input image with vertical and horizontal flipping. The training loss is the standard L1 loss between the predicted image $I_{SR}$ and the high-resolution image $I_{HR}$: $\mathcal{L} = \lVert I_{HR} - I_{SR} \rVert_1$.
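The setup above can be summarized in a short training skeleton. The model and the data loading are placeholders (the real network and the LR-HR sampling pipeline are described elsewhere in the paper); only the optimizer, scheduler, crop size, batch size, and loss follow the stated configuration.

```python
import torch

model = torch.nn.Conv2d(3, 3, 3, padding=1)      # placeholder standing in for PCFI
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=250_000)
l1_loss = torch.nn.L1Loss()

for step in range(250_000):
    # stand-ins for 16 randomly cropped, flip-augmented 64x64 LR patches and HR targets
    lr_patch = torch.rand(16, 3, 64, 64)
    hr_patch = torch.rand(16, 3, 64, 64)          # placeholder keeps the input resolution

    sr = model(lr_patch)
    loss = l1_loss(sr, hr_patch)                  # L = || I_HR - I_SR ||_1

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
```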
Evaluation metrics. In conducting quantitative comparisons, we adopted Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), Learned Perceptual Image Patch Similarity (LPIPS), Feature Similarity Index (FSIM), and Spectral Angle Mapper (SAM) as evaluation metrics to assess our model’s performance in the SR task. PSNR, measured in decibels (dB), indicates that a higher value signifies greater similarity between the reconstructed image and the original image. The SSIM index evaluates the structural similarity of images, with higher values also indicating increased similarity between images. Increases in PSNR and SSIM values represent superior performance. The detailed calculation formulas are as follows:
$\mathrm{PSNR} = 10 \log_{10} \frac{R^2}{\mathrm{MSE}}$
$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (x_i - y_i)^2$
$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$
We employed LPIPS to better assess the quality of the reconstructed image from a perceptual perspective, where a lower score indicates better visual quality of the reconstructed image. The detailed calculation formula is as follows:
$\mathrm{LPIPS}(x, y) = \frac{1}{N} \sum_{i} \lVert \phi_i(x) - \phi_i(y) \rVert^2$
FSIM focuses on low-level features of images, such as the perceived quality of edges and textures, by evaluating key visual information through phase congruency and gradient magnitude. A higher score indicates better recovery of edge and texture details in the image. The calculation formula is as follows:
$\mathrm{FSIM}(x, y) = \mathrm{PC}(x, y) \cdot \mathrm{GM}(x, y)$
SAM is used to evaluate the similarity of spectral information by comparing the spectral angles of each pixel in hyperspectral bands, effectively detecting whether spectral information is lost during reconstruction. A lower SAM score indicates a smaller difference between the reconstructed image and the original image in the spectral dimension, reflecting higher fidelity of spectral information. The detailed calculation formula is as follows:
$\mathrm{SAM}(x, y) = \arccos\!\left(\frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}\right)$
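For reference, minimal NumPy implementations of two of these metrics, assuming images normalized to [0, 1] (so $R = 1$ for PSNR) and hyperspectral inputs of shape H × W × B for SAM:

```python
import numpy as np

def psnr(x, y, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    mse = np.mean((x - y) ** 2)
    return 10.0 * np.log10(data_range ** 2 / mse)

def sam(x, y, eps=1e-12):
    """Mean spectral angle (radians) between pixel spectra of two (H, W, B) images."""
    dot = np.sum(x * y, axis=-1)
    norm = np.linalg.norm(x, axis=-1) * np.linalg.norm(y, axis=-1)
    angle = np.arccos(np.clip(dot / (norm + eps), -1.0, 1.0))
    return float(angle.mean())
```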

4.2. Datasets

To systematically evaluate the performance of our model, we selected multiple datasets. First, we utilized the AID dataset, a widely used large-scale benchmark dataset for remote sensing image scene classification, comprising 10,000 remote sensing images of fixed size 600 × 600 pixels [48]. The AID dataset is primarily obtained through aerial imaging techniques and contains 30 different scene categories, such as airports, bare land, and baseball fields (as shown in Figure 5). We used the AID dataset as both the training and testing set to assess the model’s performance on standard remote sensing images. To further explore the adaptability of the model under different imaging modalities, we conducted experiments on SAR images and hyperspectral images. For SAR images, we used the SEN1-2 dataset [49], which contains data from the Sentinel-1 and Sentinel-2 satellites. The SAR images provided by Sentinel-1 are single-channel 8-bit images of size 256 × 256 pixels, while the optical images from Sentinel-2 are three-channel 8-bit images of the same size. We selected the “Summer” dataset, which includes registered pairs of SAR and optical images, as the training and testing sets. For hyperspectral images, we employed the AVIRIS dataset [50] developed by NASA. This dataset is specifically designed for capturing hyperspectral images of the Earth’s surface, with each image typically being 256 × 256 pixels, where pixel values represent reflectance with a range of [0, 1], reflecting the surface’s radiation reflection characteristics. The AVIRIS dataset was also used as the training and testing set to validate the model’s performance in processing hyperspectral images. Finally, to comprehensively demonstrate the model’s performance, we conducted tests on natural images. We selected five commonly used standard benchmark natural image datasets—Set5 [51], Set14 [52], BSD100 [53], Urban100 [54], and Manga109 [55]—as the testing set to evaluate the model’s performance on natural images.

4.3. Performance on Task

Quantitative evaluations. To better evaluate the performance of the PCFI network, we compared it with current efficient super-resolution (SR) models, such as Bicubic, EDSR [10], NLSN [56], and HAT-L [57], across 30 different scene categories in the AID dataset. As shown in Table 1, PCFI significantly outperforms other models in terms of SSIM and PSNR scores, demonstrating exceptional performance. However, since SSIM and PSNR primarily assess image quality from a data perspective, they do not comprehensively reflect human visual perception. Therefore, we introduced LPIPS and FSIM as additional metrics for the quality assessment of standard remote sensing images, to better illustrate the model’s capabilities in visual effectiveness. As presented in Table 2, PCFI outperformed other advanced models, including Bicubic, EDSR [10], NLSN [56], HAT-L [57], TransENet [58], and HAN [59], in both LPIPS and FSIM scores. This indicates that PCFI excels in the recovery of edge and texture details, especially when dealing with edge regions and complex textures. Through the PCFI approach, we effectively enhanced the local modeling capabilities of the Transformer model, facilitating efficient interactions between windows and improving performance in super-resolution reconstruction tasks, particularly in restoring complex features such as object edges and textures in standard remote sensing images.
To assess the model’s complexity, we conducted a computation time comparison on 100 random images at a scaling factor of 4. As shown in Figure 6, the scatter plot compares the number of parameters and FLOPs of six models, including PCFI. The bar chart displays the PSNR and SSIM comparisons for each model at a 2× upscaling factor on the AID dataset. The experimental results indicate that our model achieves the highest PSNR and SSIM scores at a 3× upscaling factor while maintaining a balance between the number of parameters and FLOPs. As shown in Table 3, our model outperforms other models in terms of image reconstruction quality while also demonstrating excellent processing speed. Compared to other equally outstanding models, such as EDSR [10], NLSN [56], HAT-L [57] and TransENet [58], as well as GRL [60], our model exhibits faster processing speeds. Specifically, relative to the second-best performing model on the AID dataset, our computation time is reduced by approximately 4.33%. This is particularly important when processing large-scale datasets such as remote sensing images, as it significantly decreases computational overhead and enhances processing efficiency.
In addition, we compared PCFI with EDSR [10], RRDB [61], SNGAN [62], SRGAN [21], and HSENet [63] on the SAR images from the SEN1-2 dataset on ×2 and ×4 scale. As shown in Table 4, while PCFI still maintains an advantage in PSNR scores, it performed slightly worse in SSIM and FSIM. This discrepancy is mainly due to the impact of typical speckle noise (multiplicative noise) present in SAR images, which cannot be easily removed through conventional filtering methods, significantly reducing image contrast and detail representation, thus making it challenging for the model to extract effective features. Additionally, the electromagnetic reflection characteristics of SAR images pose difficulties for methods designed for optical images. Finally, we performed a quantitative comparison of PCFI with Bicubic, RDN [64], ESPCN [65], TransENet, and HSENet on the AVIRIS hyperspectral dataset, using PSNR, SSIM, and SAM as evaluation metrics. As shown in Table 5, although PCFI continues to lead in PSNR and SSIM scores at ×2 and ×4, its performance on SAM scores is relatively poor. Hyperspectral images contain a large number of spectral channels, far exceeding the three channels of standard RGB remote sensing images, which poses challenges for PCFI, a model adept at processing low-dimensional optical images, in capturing hyperspectral information. Additionally, the signal-to-noise ratios of spectral edge channels (such as infrared and ultraviolet bands) are comparatively low, further impacting the overall performance of the model and leading to less accurate recovery of spectral information. Nevertheless, PCFI performs well in PSNR and SSIM scores, indicating its advantages in handling spatial details and pixel value differences, excelling in both spatial dimensions and visual effects.
To further assess the performance of this model, we also compared PCFI with IMDN [66], EDSR [10], HNCT [67], LatticeNet [68], SwinIR [26], ESRT [69], HAT [57], SRFormer [70], and GRL [60]. As shown in Table 6, we quantitatively evaluated multiple SR models at scaling factors of 2×, 3×, and 4× across five benchmark datasets. Our network outperformed other models in both SSIM and PSNR scores, demonstrating exceptional performance. This indicates that our network not only excels in enhancing the resolution of remote sensing images but also shows significant advantages in improving the quality of natural images.
Visual comparison. As shown in Figure 7, we selected representative images from several scene classes of the AID dataset, including airport, park, sparse residential, and viaduct. To evaluate the super-resolution reconstructions, we compared the outputs of Bicubic, EDSR [10], NLSN [56], HAT-L [57], and PCFI against the high-resolution (HR) references. The comparison clearly shows that images reconstructed by the PCFI network preserve details noticeably better than the other models, particularly in complex edge and texture regions, where competing methods often exhibit blurring or artifacts. Furthermore, as illustrated in Figure 8, we selected two typical “Summer” images from the SAR dataset SEN1-2, super-resolved them with EDSR [10], RRDB [61], SNGAN [62], SRGAN [21], HSENet [63], and our model, and compared the results in detail. We also selected two images from the hyperspectral dataset AVIRIS and compared the outputs of Bicubic, RDN [64], ESPCN [65], TransENet [58], HSENet [63], and our model. These comparisons demonstrate that, despite the differing imaging modalities, our model still performs well in terms of human visual perception, preserving details effectively and producing more realistic images. Finally, to validate the effectiveness of our model on natural images, we selected one representative image from each of the five benchmark datasets: Manga109 [55], BSD100 [53], Set5 [51], Set14 [52], and Urban100 [54]. We obtained outputs with Bicubic, SwinIR [26], HAT [57], and GRL [60] and compared them with the images processed by PCFI. As shown in Figure 9, PCFI clearly outperforms these state-of-the-art methods on natural images, verifying that the introduction of DTIF substantially strengthens the reconstruction capability of the super-resolution model.

4.4. Ablation Studies

The PCFI model comprises two main modules: the integrated compressive sensing-based perception classifier (ICPC) and the depth–texture interaction fusion module (DTIF), the latter of which contains two key blocks, CWIA and NWIB. In this section, we describe the ablation study conducted to validate the effectiveness and necessity of each module in the PCFI network. We used the AID dataset as the experimental basis, as it contains a large number of high-resolution aerial scene images that can adequately assess the performance of our model. We established a baseline model, denoted Model L, which excludes the ICPC, CWIA, and NWIB modules, while the complete model is referred to as Model O. Starting from Model O, we removed the ICPC module, the CWIA block, and the NWIB block in turn, obtaining three additional models: Model A, Model B, and Model C. For each model, we measured the PSNR and SSIM of the outputs on 100 random images and used these scores to evaluate the quality of the reconstructed images. To demonstrate the impact of the ICPC module, we also report the processing time of each model.
Integrated Compressive Sensing-based Perception Classifier Module. The ICPC module performs compressed sensing sampling on sub-images cropped according to perceptual domains and then classifies these sub-images by feature type. In the ablation study, we first removed the ICPC module, yielding Model A, to assess its contribution to the PCFI network. As shown in Table 7, compared with the full model, Model A exhibits a clear performance drop, with PSNR decreasing by 0.12 dB and SSIM by 0.0016. This confirms that removing the ICPC module degrades the quality of the reconstructed images and that classifying sub-image blocks before reconstruction allows local detail textures to be recovered more accurately, thereby improving reconstruction accuracy. To further validate the acceleration provided by the ICPC module, we also compared the computational speeds of the models in this study. Notably, the runtime of Model A, with the ICPC module removed, increased significantly; including the ICPC module improves the model’s efficiency by approximately 8.91%. This improvement arises because, after compressed sensing classification, the model can allocate different numbers of DTIF modules to edge, smooth, and texture regions instead of assigning the same number of modules to all three feature types. This clearly demonstrates that the ICPC module’s use of compressed sensing for feature classification is highly beneficial for accelerating super-resolution tasks.
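To illustrate the classify-then-route idea behind this allocation, the sketch below groups sub-image patches into smooth, edge, and texture categories using a simple gradient-energy heuristic and records which branch, with how many reconstruction blocks, each group would receive; the thresholds, the heuristic itself, and the branch depths are illustrative assumptions and do not reproduce the compressed sensing classifier of ICPC.

```python
import torch

# Hypothetical branch depths: more reconstruction blocks for harder regions.
BRANCH_DEPTH = {"smooth": 2, "edge": 4, "texture": 6}

def classify_patch(patch: torch.Tensor) -> str:
    """Assign a (C, H, W) patch to a feature type via gradient statistics.

    Illustrative stand-in for compressed-sensing classification: smooth
    regions have low gradient energy, edges have sparse but strong
    gradients, textures have dense gradients. Thresholds are made up.
    """
    gray = patch.mean(dim=0, keepdim=True).unsqueeze(0)       # (1, 1, H, W)
    gx = gray[..., :, 1:] - gray[..., :, :-1]                 # horizontal diffs
    gy = gray[..., 1:, :] - gray[..., :-1, :]                 # vertical diffs
    energy = gx.abs().mean() + gy.abs().mean()
    if energy < 0.02:
        return "smooth"
    dense = ((gx.abs() > 0.05).float().mean() + (gy.abs() > 0.05).float().mean()) / 2
    return "texture" if dense > 0.3 else "edge"

def route(patches: list) -> dict:
    """Group patch indices by predicted class so each group can be sent to a
    branch with BRANCH_DEPTH[label] reconstruction blocks."""
    groups = {"smooth": [], "edge": [], "texture": []}
    for i, p in enumerate(patches):
        groups[classify_patch(p)].append(i)
    return groups
```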
Depth–Texture Interaction Fusion Module. The N–Gram Window Interaction Block (NWIB) and the CWIA block are essential components of the DTIF and are crucial for improving the accuracy of the reconstructed images. First, we formed Model B by removing the CWIA block from Model O to validate the role of CWIA. As shown in Table 7, the PSNR and SSIM of Model O are approximately 1.38% and 0.53% higher than those of Model B, respectively, indicating better reconstruction quality. This improvement is attributed to the CWIA block’s ability to capture local features in both the vertical and horizontal directions; this bidirectional extraction enables the model to identify complex textures and edges within the image, improving the accuracy and quality of pixel reconstruction. The inclusion of CWIA therefore has a clear positive impact on model performance, further consolidating the advantages of the DTIF in super-resolution tasks. Next, we formed Model C by removing the NWIB block from Model O to assess the impact of NWIB. As shown in Table 7, compared with Model O, the PSNR and SSIM of Model C decrease by 0.21 dB and 0.0031, respectively. This indicates that NWIB plays a significant role in increasing the similarity between the reconstructed and original images. The gain comes primarily from the N–Gram window interaction introduced before the self-attention mechanism, which captures correlations within local regions and allows the model to better understand the relationships between adjacent elements when processing images. This precise capture of local features not only aids the restoration of complex textures but also improves detail retention during reconstruction, reducing blurring and artifacts. These results highlight the indispensable role of the NWIB block, which provides stronger support for image reconstruction and ensures that the final results are closer to the original images.
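As a rough illustration of extracting local context along both directions, the sketch below fuses horizontal and vertical depthwise convolution paths with a residual connection; the kernel size and fusion scheme are assumptions made for this example and only approximate the intuition behind CWIA, not its actual design.

```python
import torch
import torch.nn as nn

class BidirectionalStripBlock(nn.Module):
    """Illustrative sketch: gather local context along image rows and columns
    separately, then fuse the two paths. This approximates the bidirectional
    idea behind CWIA with simple depthwise 1xk / kx1 convolutions; it is not
    the block used in the paper."""

    def __init__(self, channels: int, k: int = 7):
        super().__init__()
        pad = k // 2
        # Horizontal path: mixes information along image rows.
        self.horizontal = nn.Conv2d(channels, channels, (1, k),
                                    padding=(0, pad), groups=channels)
        # Vertical path: mixes information along image columns.
        self.vertical = nn.Conv2d(channels, channels, (k, 1),
                                  padding=(pad, 0), groups=channels)
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.horizontal(x)
        v = self.vertical(x)
        # Residual connection keeps the original features available.
        return x + self.fuse(torch.cat([h, v], dim=1))

# Usage: y = BidirectionalStripBlock(64)(torch.rand(1, 64, 48, 48))
```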

5. Conclusions

In this study, we propose a novel network for remote sensing image super-resolution, called Perception-guided Classification Feature Intensification (PCFI). The network consists of two main modules: the integrated compressive sensing-based perception classifier (ICPC) module and the depth–texture interaction fusion (DTIF) module. The ICPC module employs a perception-guided feature classification strategy, using perception mechanisms to classify different regions of the input image, such as edges, textures, and smooth areas. This enhances the capture of local feature details and helps accelerate the super-resolution (SR) task. The DTIF module combines Transformer architectures with a window texture interaction mechanism to extract complex textures along two dimensions, allowing it to capture texture variations and edge information precisely, particularly high-frequency textures and fine edges, while avoiding information loss and blurring. Comprehensive experiments demonstrate that PCFI outperforms current state-of-the-art super-resolution reconstruction methods on standard remote sensing image datasets. It not only significantly improves reconstruction quality, especially in edge regions and complex textures, but also improves computational efficiency. This dual improvement in performance and efficiency gives PCFI a clear advantage when processing large-scale datasets such as remote sensing imagery.

Author Contributions

Conceptualization, Y.L.; methodology, Y.L.; software, J.X.; validation, K.C.; formal analysis, Y.D.; writing—original draft, J.X.; writing—review and editing, Y.D.; visualization, Y.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (Nos. 62071380 and 62102314) and the Natural Science Foundation of Shaanxi Province (No. 2022JQ-668).

Data Availability Statement

The SEN1-2 dataset used in this study is accessible from https://mediatum.ub.tum.de/1436631 (accessed on 7 October 2024). The dataset consists of 282,384 pairs of corresponding synthetic aperture radar and optical image patches, acquired by the Sentinel-1 and Sentinel-2 remote sensing satellites, respectively. It is shared under the open access license CC-BY.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

References

  1. Zhang, Q.; Yuan, Q.; Song, M.; Yu, H.; Zhang, L. Cooperated spectral low-rankness prior and deep spatial prior for HSI unsupervised denoising. IEEE Trans. Image Process. 2022, 31, 6356–6368. [Google Scholar] [CrossRef] [PubMed]
  2. Xia, B.; Tian, Y.; Zhang, Y.; Hang, Y.; Yang, W.; Liao, Q. Meta-learning based degradation representation for blind super-resolution. IEEE Trans. Image Process. 2023, 32, 3383–3396. [Google Scholar] [CrossRef] [PubMed]
  3. Cai, Q.; Qian, Y.; Li, J.; Lyu, J.; Yang, Y.H.; Wu, F.; Zhang, D. HIPA: Hierarchical patch transformer for single image super resolution. IEEE Trans. Image Process. 2023, 32, 3226–3237. [Google Scholar] [CrossRef] [PubMed]
  4. Guo, J.; Wen, L.; Zhou, Y.; Song, B.; Chi, Y.; Yu, F.R. SPACE: Self-supervised Dual Preference Enhancing Network for Multimodal Recommendation. IEEE Trans. Multimedia 2024. [Google Scholar] [CrossRef]
  5. Irani, M.; Peleg, S. Improving resolution by image registration. CVGIP Graph. Model. Image Process. 1991, 53, 231–239. [Google Scholar] [CrossRef]
  6. Li, Z.; Liu, F.; Yang, W.; Peng, S.; Zhou, J. A survey of convolutional neural networks: Analysis, applications, and prospects. IEEE Trans. Neural Networks Learn. Syst. 2021, 33, 6999–7019. [Google Scholar] [CrossRef]
  7. Ran, R.; Deng, L.J.; Jiang, T.X.; Hu, J.F.; Chanussot, J.; Vivone, G. GuidedNet: A general CNN fusion framework via high-resolution guidance for hyperspectral image super-resolution. IEEE Trans. Cybern. 2023, 53, 4148–4161. [Google Scholar] [CrossRef]
  8. Kim, J.; Lee, J.K.; Lee, K.M. Deeply-recursive convolutional network for image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1637–1645. [Google Scholar]
  9. Dong, C.; Loy, C.C.; He, K.; Tang, X. Image super-resolution using deep convolutional networks. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 38, 295–307. [Google Scholar] [CrossRef]
  10. Lim, B.; Son, S.; Kim, H.; Nah, S.; Mu Lee, K. Enhanced deep residual networks for single image super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 136–144. [Google Scholar]
  11. Shaw, P.; Uszkoreit, J.; Vaswani, A. Self-attention with relative position representations. arXiv 2018, arXiv:1803.02155. [Google Scholar]
  12. Badrinarayanan, V.; Kendall, A.; Cipolla, R. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  13. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S.; et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929. [Google Scholar]
  14. Kong, X.; Zhao, H.; Qiao, Y.; Dong, C. Classsr: A general framework to accelerate super-resolution networks by data characteristic. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 12016–12025. [Google Scholar]
  15. He, J.; Wang, Y.; Liu, H. Ship classification in medium-resolution SAR images via densely connected triplet CNNs integrating Fisher discrimination regularized metric learning. IEEE Trans. Geosci. Remote Sens. 2020, 59, 3022–3039. [Google Scholar] [CrossRef]
  16. Yu, K.; Dong, C.; Lin, L.; Loy, C.C. Crafting a toolchain for image restoration by deep reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2443–2452. [Google Scholar]
  17. Yu, K.; Wang, X.; Dong, C.; Tang, X.; Loy, C.C. Path-restore: Learning network path selection for image restoration. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 44, 7078–7092. [Google Scholar] [CrossRef] [PubMed]
  18. Guo, J.; Li, Z.; Song, B.; Chi, Y. TSFE: Two-Stage Feature Enhancement for Remote Sensing Image Captioning. Remote Sensing 2024, 16, 1843. [Google Scholar] [CrossRef]
  19. Kim, J.; Lee, J.K.; Lee, K.M. Accurate image super-resolution using very deep convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1646–1654. [Google Scholar]
  20. Lai, W.S.; Huang, J.B.; Ahuja, N.; Yang, M.H. Deep laplacian pyramid networks for fast and accurate super-resolution. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 624–632. [Google Scholar]
  21. Ledig, C.; Theis, L.; Huszár, F.; Caballero, J.; Cunningham, A.; Acosta, A.; Aitken, A.; Tejani, A.; Totz, J.; Wang, Z.; et al. Photo-realistic single image super-resolution using a generative adversarial network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4681–4690. [Google Scholar]
  22. Wang, X.; Yu, K.; Wu, S.; Gu, J.; Liu, Y.; Dong, C.; Qiao, Y.; Change Loy, C. Esrgan: Enhanced super-resolution generative adversarial networks. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  23. Lei, S.; Shi, Z.; Zou, Z. Super-resolution for remote sensing images via local–global combined network. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1243–1247. [Google Scholar] [CrossRef]
  24. Xu, W.; Guangluan, X.; Wang, Y.; Sun, X.; Lin, D.; Yirong, W. High quality remote sensing image super-resolution using deep memory connected network. In Proceedings of the IGARSS 2018–2018 IEEE International Geoscience and Remote Sensing Symposium, Valencia, Spain, 22–27 July 2018; pp. 8889–8892. [Google Scholar]
  25. Jiang, K.; Wang, Z.; Yi, P.; Wang, G.; Lu, T.; Jiang, J. Edge-enhanced GAN for remote sensing image superresolution. IEEE Trans. Geosci. Remote Sens. 2019, 57, 5799–5812. [Google Scholar] [CrossRef]
  26. Liang, J.; Cao, J.; Sun, G.; Zhang, K.; Van Gool, L.; Timofte, R. Swinir: Image restoration using swin transformer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 1833–1844. [Google Scholar]
  27. Zhang, Y.; Li, K.; Li, K.; Wang, L.; Zhong, B.; Fu, Y. Image super-resolution using very deep residual channel attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 286–301. [Google Scholar]
  28. Rahman, M.L.; Zhang, J.A.; Huang, X.; Guo, Y.J.; Heath, R.W. Framework for a perceptive mobile network using joint communication and radar sensing. IEEE Trans. Aerosp. Electron. Syst. 2019, 56, 1926–1941. [Google Scholar] [CrossRef]
  29. Zhang, Y.; Liu, H.; Hu, Q. Transfuse: Fusing transformers and cnns for medical image segmentation. In Proceedings of the Medical Image Computing and Computer Assisted Intervention–MICCAI 2021: 24th International Conference, Strasbourg, France, 27 September–1 October 2021; Proceedings, Part I 24. Springer: Berlin/Heidelberg, Germany, 2021; pp. 14–24. [Google Scholar]
  30. Chang, Y.; Wang, X.; Wang, J.; Wu, Y.; Yang, L.; Zhu, K.; Chen, H.; Yi, X.; Wang, C.; Wang, Y.; et al. A survey on evaluation of large language models. ACM Trans. Intell. Syst. Technol. 2024, 15, 1–45. [Google Scholar] [CrossRef]
  31. Lopez-Gazpio, I.; Maritxalar, M.; Lapata, M.; Agirre, E. Word n-gram attention models for sentence similarity and inference. Expert Syst. Appl. 2019, 132, 1–11. [Google Scholar] [CrossRef]
  32. Choi, H.; Lee, J.; Yang, J. N-gram in swin transformers for efficient lightweight image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2071–2081. [Google Scholar]
  33. Guo, J.; Sun, H.; Han, J.; Song, B.; Chi, Y.; Song, B. Multi-task Fine-grained Feature Mining for Multi-label Remote Sensing Image Classification. IEEE Trans. Geosci. Remote Sens. 2024, 26, 8849–8859. [Google Scholar]
  34. Ronneberger, O.; Fischer, P.; Brox, T. U-net: Convolutional networks for biomedical image segmentation. In Proceedings of the Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, 5–9 October 2015; proceedings, part III 18. Springer: Berlin/Heidelberg, Germany, 2015; pp. 234–241. [Google Scholar]
  35. Shi, W.; Jiang, F.; Liu, S.; Zhao, D. Image compressed sensing using convolutional neural network. IEEE Trans. Image Process. 2019, 29, 375–388. [Google Scholar] [CrossRef] [PubMed]
  36. Orović, I.; Papić, V.; Ioana, C.; Li, X.; Stanković, S. Compressive sensing in signal processing: Algorithms and transform domain formulations. Math. Probl. Eng. 2016, 2016, 7616393. [Google Scholar] [CrossRef]
  37. Guo, J.; Song, B.; Tian, F.; Liu, H.; Qin, H. Perception of image characteristics with compressive measurements. IEICE Trans. Inf. Syst. 2014, 97, 3234–3235. [Google Scholar] [CrossRef]
  38. Ravelomanantsoa, A.; Rabah, H.; Rouane, A. Compressed sensing: A simple deterministic measurement matrix and a fast recovery algorithm. IEEE Trans. Instrum. Meas. 2015, 64, 3405–3413. [Google Scholar] [CrossRef]
  39. Stergiou, A.; Poppe, R.; Kalliatakis, G. Refining activation downsampling with SoftPool. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 10357–10366. [Google Scholar]
  40. Hsiao, T.Y.; Chang, Y.C.; Chou, H.H.; Chiu, C.T. Filter-based deep-compression with global average pooling for convolutional networks. J. Syst. Archit. 2019, 95, 9–18. [Google Scholar] [CrossRef]
  41. Murray, N.; Perronnin, F. Generalized max pooling. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 2473–2480. [Google Scholar]
  42. Kim, H.; Khan, M.U.K.; Kyung, C.M. Efficient neural network compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 12569–12577. [Google Scholar]
  43. Quan, T.M.; Hildebrand, D.G.C.; Jeong, W.K. Fusionnet: A deep fully residual convolutional neural network for image segmentation in connectomics. Front. Comput. Sci. 2021, 3, 613981. [Google Scholar] [CrossRef]
  44. Ahn, N.; Kang, B.; Sohn, K.A. Efficient deep neural network for photo-realistic image super-resolution. Pattern Recognit. 2022, 127, 108649. [Google Scholar] [CrossRef]
  45. Gao, Z.; Wang, L.; Wu, G. Lip: Local importance-based pooling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 3355–3364. [Google Scholar]
  46. Dawoud, N.N.; Samir, B.B.; Janier, J. Fast template matching method based optimized sum of absolute difference algorithm for face localization. Int. J. Comput. Appl. 2011, 18, 0975–8887. [Google Scholar]
  47. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980. [Google Scholar]
  48. Xia, G.S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981. [Google Scholar] [CrossRef]
  49. Schmitt, M.; Hughes, L.H.; Zhu, X.X. The SEN1-2 dataset for deep learning in SAR-optical data fusion. arXiv 2018, arXiv:1807.01569. [Google Scholar] [CrossRef]
  50. Wang, W.; Zhao, B.; Feng, F.; Nan, J.; Li, C. Hierarchical sub-pixel anomaly detection framework for hyperspectral imagery. Sensors 2018, 18, 3662. [Google Scholar] [CrossRef] [PubMed]
  51. Bevilacqua, M.; Roumy, A.; Guillemot, C.; Alberi-Morel, M.L. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In Proceedings of the 23rd British Machine Vision Conference (BMVC), Surrey, UK, 3–7 September 2012. [Google Scholar]
  52. Zeyde, R.; Elad, M.; Protter, M. On single image scale-up using sparse-representations. In Proceedings of the Curves and Surfaces: 7th International Conference, Avignon, France, 24–30 June 2010; Revised Selected Papers 7. Springer: Berlin/Heidelberg, Germany, 2012; pp. 711–730. [Google Scholar]
  53. Martin, D.; Fowlkes, C.; Tal, D.; Malik, J. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Proceedings of the Eighth IEEE International Conference on Computer Vision. ICCV 2001, Vancouver, BC, Canada, 7–14 July 2001; Volume 2, pp. 416–423. [Google Scholar]
  54. Huang, J.B.; Singh, A.; Ahuja, N. Single image super-resolution from transformed self-exemplars. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, 7–12 June 2015; pp. 5197–5206. [Google Scholar]
  55. Matsui, Y.; Ito, K.; Aramaki, Y.; Fujimoto, A.; Ogawa, T.; Yamasaki, T.; Aizawa, K. Sketch-based manga retrieval using manga109 dataset. Multimed. Tools Appl. 2017, 76, 21811–21838. [Google Scholar] [CrossRef]
  56. Mei, Y.; Fan, Y.; Zhou, Y. Image super-resolution with non-local sparse attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 3517–3526. [Google Scholar]
  57. Chen, X.; Wang, X.; Zhou, J.; Qiao, Y.; Dong, C. Activating more pixels in image super-resolution transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 22367–22377. [Google Scholar]
  58. Lei, S.; Shi, Z.; Mo, W. Transformer-based multistage enhancement for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 60, 1–11. [Google Scholar] [CrossRef]
  59. Niu, B.; Wen, W.; Ren, W.; Zhang, X.; Yang, L.; Wang, S.; Zhang, K.; Cao, X.; Shen, H. Single image super-resolution via a holistic attention network. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 191–207. [Google Scholar]
  60. Li, Y.; Fan, Y.; Xiang, X.; Demandolx, D.; Ranjan, R.; Timofte, R.; Van Gool, L. Efficient and explicit modelling of image hierarchies for image restoration. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 18278–18289. [Google Scholar]
  61. Liu, T.J.; Chen, Y.Z. Satellite image super-resolution by 2d rrdb and edge-enhanced generative adversarial network. Appl. Sci. 2022, 12, 12311. [Google Scholar] [CrossRef]
  62. Zheng, C.; Jiang, X.; Zhang, Y.; Liu, X.; Yuan, B.; Li, Z. Self-normalizing generative adversarial network for super-resolution reconstruction of SAR images. In Proceedings of the IGARSS 2019–2019 IEEE International Geoscience and Remote Sensing Symposium, Yokohama, Japan, 28 July–2 August 2019; pp. 1911–1914. [Google Scholar]
  63. Lei, S.; Shi, Z. Hybrid-scale self-similarity exploitation for remote sensing image super-resolution. IEEE Trans. Geosci. Remote Sens. 2021, 60, 5401410. [Google Scholar] [CrossRef]
  64. Chen, G.; Zhang, L.; Sun, M.; Gao, Y.; Michelini, P.N.; Wu, Y. Single-image hdr reconstruction with task-specific network based on channel adaptive RDN. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 398–403. [Google Scholar]
  65. Shi, W.; Caballero, J.; Huszár, F.; Totz, J.; Aitken, A.P.; Bishop, R.; Rueckert, D.; Wang, Z. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 1874–1883. [Google Scholar]
  66. Hui, Z.; Gao, X.; Yang, Y.; Wang, X. Lightweight image super-resolution with information multi-distillation network. In Proceedings of the 27th ACM International Conference on Multimedia, Nice, France, 21–25 October 2019; pp. 2024–2032. [Google Scholar]
  67. Fang, J.; Lin, H.; Chen, X.; Zeng, K. A hybrid network of cnn and transformer for lightweight image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 1103–1112. [Google Scholar]
  68. Luo, X.; Xie, Y.; Zhang, Y.; Qu, Y.; Li, C.; Fu, Y. Latticenet: Towards lightweight image super-resolution with lattice block. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXII 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 272–289. [Google Scholar]
  69. Lu, Z.; Li, J.; Liu, H.; Huang, C.; Zhang, L.; Zeng, T. Transformer for single image super-resolution. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 19–20 June 2022; pp. 457–466. [Google Scholar]
  70. Zhou, Y.; Li, Z.; Guo, C.L.; Bai, S.; Cheng, M.M.; Hou, Q. Srformer: Permuted self-attention for single image super-resolution. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 12780–12791. [Google Scholar]
Figure 1. Overall structure of Perception-guided Classification Feature Intensification Network and integrated compressive sensing-based perception classifier module.
Figure 2. An illustration of depth–texture interaction fusion module. The figure is divided into three sections: the top-left represents the DTIF module, the top-right section represents the DTIT block, and the bottom section represents the CWIA block.
Figure 3. The process of soft pooling (The red arrows are for the forward operation, and the output value of the SoftPool operation is generated by passing the standard sum of all γ̃ in the kernel neighborhood N).
Figure 4. The process of N–Gram window sliding (When sliding the window over single-character paddings, forward N–Gram features are obtained through the WSA operation).
Figure 5. Some typical samples of the AID dataset from 30 different scene classes.
Figure 6. Comparison of FLOPs and parameters, as well as PSNR/SSIM performance, with other methods on the AID dataset at a ×2 scale.
Figure 7. Visual comparison on the AID dataset at a ×3 scale. The patches used for comparison are marked in red boxes.
Figure 8. Visual comparison on the SAR dataset and AVIRIS dataset at a ×2 scale. The patches used for comparison are marked in red boxes.
Figure 9. Visual comparison of images from the Manga109, BSD100, Set5, Set14, and Urban100 datasets at a ×3 scale. The patches used for comparison are marked in red boxes.
Table 1. Comparison on the AID dataset at a ×3 scale. We tested different models on the 30 scene classes of the AID dataset and mark the best-performing results in red and the second-best in blue.
Classes | Bicubic | EDSR [10] | NLSN [56] | HAT-L [57] | PCFI (Ours)
 | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM
Airport | 27.83/0.7554 | 29.93/0.8282 | 30.16/0.8322 | 30.15/0.8319 | 30.18/0.8325
Bare Land | 35.60/0.8564 | 36.94/0.8837 | 37.00/0.8845 | 36.88/0.8841 | 37.12/0.8849
Baseball Field | 31.00/0.8305 | 33.05/0.8765 | 33.24/0.8787 | 33.25/0.8789 | 33.28/0.8792
Beach | 32.90/0.8446 | 34.18/0.8727 | 34.31/0.8749 | 34.24/0.8756 | 34.43/0.8758
Bridge | 30.22/0.8283 | 32.93/0.8800 | 33.12/0.8818 | 33.04/0.8809 | 33.16/0.8823
Center | 26.51/0.6944 | 28.77/0.7921 | 28.95/0.7971 | 28.92/0.7956 | 28.98/0.7974
Church | 24.29/0.6333 | 26.30/0.7469 | 26.51/0.7528 | 26.56/0.7532 | 26.63/0.7536
Commercial | 27.33/0.7174 | 29.01/0.7940 | 29.21/0.7996 | 29.21/0.8007 | 29.24/0.8011
Dense Residential | 22.93/0.5671 | 24.38/0.6839 | 24.60/0.6936 | 24.67/0.6936 | 24.69/0.6939
Desert | 39.26/0.9100 | 40.20/0.9268 | 40.27/0.9278 | 40.37/0.9278 | 40.42/0.9280
Farmland | 33.10/0.8226 | 35.00/0.8683 | 35.10/0.8699 | 35.03/0.8691 | 35.15/0.8702
Forest | 28.79/0.6605 | 29.85/0.7315 | 29.98/0.7369 | 30.10/0.7363 | 30.13/0.7371
Industrial | 26.77/0.6952 | 28.88/0.7931 | 29.04/0.7982 | 29.04/0.7980 | 29.07/0.7986
Meadow | 33.86/0.7483 | 34.63/0.7804 | 34.69/0.7821 | 34.70/0.7815 | 34.73/0.7825
Medium Residential | 26.36/0.6335 | 28.34/0.7365 | 28.49/0.7418 | 28.46/0.7408 | 28.52/0.7422
Mountain | 29.51/0.7349 | 30.63/0.7885 | 30.74/0.7916 | 30.78/0.7923 | 30.81/0.7925
Park | 29.06/0.7530 | 30.54/0.8130 | 30.71/0.8177 | 30.71/0.8189 | 30.74/0.8191
Parking | 24.24/0.7060 | 27.25/0.8317 | 27.57/0.8408 | 27.56/0.8405 | 27.59/0.8412
Playground | 32.64/0.8450 | 35.37/0.8943 | 35.58/0.8967 | 35.49/0.8959 | 35.60/0.8969
Pond | 30.70/0.8167 | 32.11/0.8542 | 32.22/0.8559 | 32.18/0.8555 | 32.25/0.8563
Port | 26.67/0.7986 | 28.50/0.8596 | 28.71/0.8631 | 28.81/0.8638 | 28.84/0.8642
Railway Station | 26.78/0.6793 | 28.72/0.7738 | 28.89/0.7783 | 28.88/0.7780 | 28.91/0.7786
Resort | 26.79/0.7029 | 28.52/0.7799 | 28.68/0.7845 | 28.71/0.7849 | 28.73/0.7853
River | 30.37/0.7402 | 31.55/0.7891 | 31.64/0.7914 | 31.63/0.7909 | 31.68/0.7917
School | 27.41/0.7237 | 29.36/0.8044 | 29.55/0.8097 | 29.54/0.8104 | 29.59/0.8108
Sparse Residential | 26.66/0.6006 | 27.71/0.6728 | 27.84/0.6767 | 27.88/0.6759 | 27.91/0.6769
Square | 28.55/0.7391 | 30.84/0.8200 | 31.04/0.8244 | 31.00/0.8251 | 31.06/0.8257
Stadium | 27.16/0.7547 | 29.63/0.8387 | 29.79/0.8422 | 29.77/0.8422 | 29.83/0.8424
Storage Tanks | 25.65/0.6793 | 27.44/0.7664 | 27.61/0.7709 | 27.60/0.7698 | 27.65/0.7714
Viaduct | 26.97/0.6755 | 28.99/0.7757 | 29.17/0.7813 | 29.11/0.7794 | 29.21/0.7818
Average | 28.86/0.7382 | 30.65/0.8086 | 30.81/0.8126 | 30.81/0.8124 | 30.87/0.8131
Table 2. LPIPS and FSIM comparison of different networks on AID dataset. We mark the best-performing ones in red and the second-best ones in blue.
Metric | Bicubic | EDSR [10] | NLSN [56] | HAT-L [57] | TransENet [58] | HAN [59] | PCFI
LPIPS | 0.4802 | 0.3069 | 0.3041 | 0.3076 | 0.3133 | 0.3077 | 0.3038
FSIM | 0.7572 | 0.8108 | 0.8232 | 0.8214 | 0.8279 | 0.8267 | 0.8285
Table 3. Running time comparison of different networks.
Networks | EDSR [10] | NLSN [56] | HAT-L [57] | TransENet [58] | GRL [60] | PCFI
Running Time (ms) | 132.62 | 186.18 | 284.16 | 152.67 | 213.43 | 178.12
Table 4. PSNR, SSIM, and FSIM comparison of different networks on SEN1-2 dataset. We mark the best-performing ones in red and the second-best ones in blue.
Scale | Metric | EDSR [10] | RRDB [61] | SNGAN [62] | SRGAN [21] | HSENet [63] | PCFI
×2 | PSNR | 41.19 | 41.27 | 41.86 | 41.74 | 42.23 | 42.91
×2 | SSIM | 0.981 | 0.986 | 0.987 | 0.986 | 0.993 | 0.996
×2 | FSIM | 0.9989 | 0.9996 | 0.9994 | 0.9991 | 0.9989 | 0.9993
×4 | PSNR | 31.15 | 31.39 | 27.81 | 27.62 | 24.34 | 31.85
×4 | SSIM | 0.838 | 0.841 | 0.771 | 0.763 | 0.622 | 0.839
×4 | FSIM | 0.9848 | 0.9836 | 0.9653 | 0.9657 | 0.9718 | 0.9831
Table 5. PSNR, SSIM, and SAM comparison of different networks on the AVIRIS dataset. We mark the best-performing ones in red and the second-best ones in blue.
Model | ×2 (PSNR/SSIM/SAM) | ×4 (PSNR/SSIM/SAM) | ×8 (PSNR/SSIM/SAM)
Bicubic | 43.72/0.9733/0.1376 | 39.19/0.9369/0.9683 | 36.19/0.9029/2.8346
RDN | 45.69/0.9835/0.4558 | 40.61/0.9532/0.9436 | 37.27/0.9194/2.2317
ESPCN | 41.93/0.9681/1.9268 | 38.41/0.9246/1.8725 | 33.73/0.8324/3.0341
TransENet | 45.91/0.9843/0.6671 | 41.43/0.9621/1.0423 | 38.09/0.9281/2.1462
HSENet | 45.84/0.9839/0.4603 | 40.75/0.9546/0.9873 | 37.15/0.9175/2.2937
PCFI (Ours) | 45.97/0.9851/0.4037 | 41.65/0.9623/0.9752 | 38.12/0.9228/2.2731
Table 6. Quantitative comparison of super-resolution methods. We evaluate PCFI and other methods on five standard benchmark datasets at scale factors of 2, 3, and 4 and mark the best-performing results in red and the second-best in blue.
Method | Scale | Set5 [51] | Set14 [52] | BSD100 [53] | Urban100 [54] | Manga109 [55]
 | | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM | PSNR/SSIM
IMDN [66] | ×2 | 38.00/0.9605 | 33.63/0.9177 | 32.19/0.8996 | 32.17/0.9283 | 38.88/0.9774
EDSR [10] | ×2 | 38.11/0.9602 | 33.92/0.9195 | 32.32/0.9013 | 32.93/0.9351 | 39.10/0.9773
HNCT [67] | ×2 | 38.08/0.9608 | 33.65/0.9182 | 32.22/0.9001 | 32.22/0.9294 | 38.87/0.9774
LatticeNet [68] | ×2 | 38.06/0.9607 | 33.70/0.9187 | 32.20/0.8999 | 32.25/0.9288 | 38.94/0.9774
SwinIR [26] | ×2 | 38.14/0.9611 | 33.86/0.9206 | 32.31/0.9012 | 32.76/0.9340 | 39.12/0.9783
ESRT [69] | ×2 | - | - | - | - | -
HAT [57] | ×2 | 38.63/0.9630 | 34.86/0.9274 | 32.62/0.9053 | 34.45/0.9466 | 40.26/0.9809
SRFormer [70] | ×2 | 38.51/0.9627 | 34.44/0.9253 | 32.57/0.9046 | 34.09/0.9449 | 40.07/0.9802
GRL [60] | ×2 | 38.67/0.9647 | 35.08/0.9303 | 32.68/0.9087 | 35.06/0.9505 | 40.67/0.9818
PCFI (Ours) | ×2 | 38.69/0.9653 | 35.14/0.9311 | 32.72/0.9092 | 35.12/0.9509 | 40.71/0.9824
IMDN [66] | ×3 | 34.36/0.9270 | 30.32/0.8417 | 29.09/0.8046 | 28.17/0.8519 | 33.61/0.9445
EDSR [10] | ×3 | 34.65/0.9280 | 30.52/0.8462 | 29.25/0.8093 | 28.80/0.8653 | 34.17/0.9476
HNCT [67] | ×3 | 34.47/0.9275 | 30.44/0.8439 | 29.15/0.8067 | 28.28/0.8557 | 33.81/0.9459
LatticeNet [68] | ×3 | 34.40/0.9272 | 30.32/0.8416 | 29.10/0.8049 | 28.19/0.8513 | 33.63/0.9442
SwinIR [26] | ×3 | 34.62/0.9289 | 30.54/0.8463 | 29.20/0.8082 | 28.66/0.8624 | 34.78/0.9478
ESRT [69] | ×3 | 34.42/0.9268 | 30.43/0.8433 | 29.15/0.8063 | 28.46/0.8574 | 33.95/0.9455
HAT [57] | ×3 | 35.07/0.9329 | 31.08/0.8555 | 29.54/0.8167 | 30.23/0.8896 | 35.53/0.9552
SRFormer [70] | ×3 | 35.02/0.9323 | 30.94/0.8540 | 29.48/0.8156 | 30.04/0.8865 | 35.26/0.9543
GRL [60] | ×3 | - | - | - | - | -
PCFI (Ours) | ×3 | 35.11/0.9336 | 31.13/0.8560 | 29.58/0.8169 | 30.29/0.8991 | 35.58/0.9557
IMDN [66] | ×4 | 32.21/0.8948 | 28.58/0.7811 | 27.56/0.7353 | 26.04/0.7838 | 30.45/0.9075
EDSR [10] | ×4 | 32.46/0.8968 | 28.80/0.7876 | 27.71/0.7420 | 26.64/0.8033 | 30.02/0.9148
HNCT [67] | ×4 | 32.31/0.8957 | 28.71/0.7834 | 27.63/0.7381 | 26.20/0.7896 | 30.70/0.9112
LatticeNet [68] | ×4 | 32.30/0.8943 | 28.61/0.7812 | 27.57/0.7355 | 26.14/0.7844 | 30.54/0.9075
SwinIR [26] | ×4 | 32.92/0.9044 | 29.09/0.7950 | 27.92/0.7489 | 27.45/0.8254 | 32.03/0.9260
ESRT [69] | ×4 | 32.19/0.8947 | 28.69/0.7833 | 27.69/0.7379 | 26.39/0.7962 | 30.75/0.9100
HAT [57] | ×4 | 33.04/0.9056 | 29.23/0.7973 | 28.00/0.7517 | 27.97/0.8368 | 32.48/0.9292
SRFormer [70] | ×4 | 32.93/0.9041 | 29.08/0.7953 | 27.94/0.7502 | 27.68/0.8311 | 32.21/0.9271
GRL [60] | ×4 | 33.10/0.9094 | 29.37/0.8058 | 28.01/0.7611 | 28.53/0.8504 | 32.77/0.9325
PCFI (Ours) | ×4 | 33.19/0.9098 | 29.41/0.8069 | 28.06/0.7628 | 28.68/0.8508 | 32.83/0.9337
Table 7. Comparison of PSNR and SSIM scores and running times of PCFI with individual modules removed, on the AID dataset at a ×3 scale. Each reduced variant of PCFI is renamed as indicated. (×: the module is not used. ✓: the module is used.)
Method | Running Time (ms) | ICPC | DTIF: CWIA | DTIF: NWIB | AID: PSNR | AID: SSIM
Model L | 129.18 | × | × | × | 30.28 | 0.8078
Model A | 195.53 | × | ✓ | ✓ | 30.71 | 0.8119
Model B | 135.58 | ✓ | × | ✓ | 30.41 | 0.8092
Model C | 153.24 | ✓ | ✓ | × | 30.62 | 0.8104
Model O | 178.12 | ✓ | ✓ | ✓ | 30.83 | 0.8135
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
