Our proposed approach first performs compressive sensing-based classification of the input remote sensing images, followed by deep texture extraction on the classified image blocks to enhance the representation capability of the super-resolved images. In Section 3.1, we present a comprehensive description of the proposed PCFI model structure. In Section 3.2, we provide a detailed explanation of ICPC, which classifies sub-image blocks according to predetermined ranges. Smooth blocks primarily contain background and contour information, whereas edge blocks are more complex: they carry detailed information but may suffer from information blur and partial feature loss caused by edge blurring. Therefore, in Section 3.3, we introduce the deep texture extraction module, which enhances the representation of features blurred by edge degradation, thereby improving the ability of the SR model to express details when dealing with complex remote sensing images.
3.2. Integrated Compressive Sensing-Based Perception Classifier Module
Our integrated compressive sensing-based perception classifier (ICPC) module utilizes compressed sensing for feature classification, which effectively reduces computational complexity compared with commonly employed deep learning approaches such as CNNs or self-attention mechanisms. Although these approaches demonstrate excellent performance in feature extraction, they often incur substantial computational and memory overhead. Remote sensing images are typically large-scale; when processed with self-attention mechanisms, the model must account for dependencies between every pixel and region, which significantly increases computational complexity. CNNs, in turn, rely on deep networks for layer-by-layer processing and require numerous convolutional kernels and layers to capture features of varying complexity. This often leads to processing considerable redundant information merely to achieve comprehensive feature classification, which greatly diminishes processing speed, increases the computational burden, and contradicts the ICPC module’s goal of accelerating super-resolution tasks.
Moreover, remote sensing images frequently contain a substantial amount of redundant information, with critical data often concentrated in a few specific regions or frequency bands. Many areas within remote sensing images, such as oceans and deserts, are usually uniform or exhibit extensive smooth textures. The pixel value variations in these regions are minimal and can be represented with fewer non-zero or significant values. By leveraging the spatial sparsity of such areas, the compressed sensing technique can save storage space and reduce processing complexity. When addressing complex textures, these regions typically exhibit sparse representations in certain transformation domains (e.g., wavelet transforms, discrete cosine transforms). CS effectively captures these sparse features, facilitating the identification of different texture categories in classification tasks.
In this section, we start by cropping the image into equally sized blocks and employ CS theory [35] to map pixel values to perceptual values. Specifically, the pixel values of each block are multiplied by the measurement matrix to obtain measurement values, and features are then derived within the measurement value range, allowing image blocks to be classified according to features in the perceptual domain [36]. This module categorizes image blocks into three classes: smooth, edge, and texture blocks. The specific structure is illustrated in Figure 2.
3.2.1. Definition of Correlation Between Signals in Two Domains
Since the mutual covariance measures the similarity between two signals, we utilize it to assess the correlation of frequency domain signals. Specifically, the correlation between two frequency domain image signals is defined as

$$R_{fg} = \operatorname{cov}(f, g),$$

where $R_{fg}$ denotes the frequency domain correlation of the signals, $\operatorname{cov}(\cdot,\cdot)$ represents the mutual covariance operator, and $f$ and $g$ are the image signals in the DCT domain.
When the value of $R_{fg}$ is positive, the two signals are positively correlated; conversely, a negative value indicates a negative correlation. The larger the absolute value of $R_{fg}$, the higher the similarity between the two signals and the stronger the correlation; a smaller absolute value indicates lower similarity and weaker correlation. In image analysis, we typically start from the pixel domain. However, direct analysis in the pixel domain can be challenging because of the large volume of data and the lack of intuitive insight, making it difficult to extract useful information effectively. Therefore, we project the image data into the frequency domain, where the signals more clearly reflect the pixel features of the image. According to the CS theory [37] presented in this section, there exists a linear relationship between the perceptual domain and the frequency domain. This means that the frequency domain can be regarded as an effective representation of the pixel domain, while the perceptual domain reflects the detail level and complexity of pixel domain images. Thus, by analyzing signals in the perceptual domain, we can achieve a certain degree of extraction and analysis of image pixel features.
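The chain pixel domain → frequency domain → perceptual domain can be made concrete with a small numerical sketch. The following Python snippet, written under illustrative assumptions (a 16 × 16 block, 64 Gaussian measurements, an orthonormal 2-D DCT as the sparse basis), shows that the compressive measurements of a block can be obtained equivalently from its pixel values or from its DCT coefficients through a projection matrix A = ΦΨ.

```python
import numpy as np
from scipy.fft import dctn

rng = np.random.default_rng(0)
B, m = 16, 64                       # block side and measurement count (illustrative)
n = B * B

block = rng.random((B, B))          # pixel-domain image block
x = block.reshape(n)

# Frequency domain: orthonormal 2-D DCT of the block.
f = dctn(block, norm="ortho").reshape(n)

# Perceptual domain: compressive measurements y = Phi x, zero-mean Gaussian Phi.
Phi = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
y = Phi @ x

# Explicit DCT matrix D (f = D x); its transpose is the sparse basis Psi (x = Psi f).
D = np.stack([dctn(np.eye(n)[j].reshape(B, B), norm="ortho").reshape(n)
              for j in range(n)], axis=1)
Psi = D.T
A = Phi @ Psi                       # projection matrix linking the two domains

# The same measurements can be obtained from the frequency-domain signal: y = A f.
assert np.allclose(y, A @ f)
```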
3.2.2. Correlation of Linear Relationships Between Perceptual Domain and Frequency Domain
For each block $x_1, x_2, \ldots, x_N$, corresponding to frequency domain signals $f_1, f_2, \ldots, f_N$, independent compressed sensing measurements [38] are conducted using an identical sensing matrix. This process yields perception values $y_1, y_2, \ldots, y_N$, which are concatenated as row vectors to construct the perceptual domain matrix $Y$. Given $y_i = (y_{i1}, y_{i2}, \ldots, y_{im})^{T}$ for $i = 1, 2, \ldots, N$, the representation of the feature image in the perceptual domain is

$$Y = \begin{bmatrix} y_1^{T} \\ y_2^{T} \\ \vdots \\ y_N^{T} \end{bmatrix} = \begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1m} \\ y_{21} & y_{22} & \cdots & y_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ y_{N1} & y_{N2} & \cdots & y_{Nm} \end{bmatrix}.$$

In this formulation, each $y_i$ is represented as a column vector, and $y_{ij}$ signifies the $j$-th perception value of the $i$-th block.
Covariance Matrix Vectorization. If $X_1, X_2, \ldots, X_K$ constitute a collection of random variables forming a random vector $X = (X_1, X_2, \ldots, X_K)^{T}$, and each random variable has $m$ samples, then there exists a sample matrix, as follows:

$$\mathbf{X} = \begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1m} \\ x_{21} & x_{22} & \cdots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{K1} & x_{K2} & \cdots & x_{Km} \end{bmatrix} = \begin{bmatrix} \mathbf{x}_1^{T} \\ \vdots \\ \mathbf{x}_K^{T} \end{bmatrix} = \begin{bmatrix} \mathbf{x}^{(1)} & \cdots & \mathbf{x}^{(m)} \end{bmatrix},$$

where $\mathbf{x}_i$ (for $i = 1, \ldots, K$) is the vector of the $m$ sample values of the $i$-th random variable, while $\mathbf{x}^{(j)}$ (for $j = 1, \ldots, m$) is the $j$-th sample of the random vector $X$. Consequently, the mutual covariance between random variables $X_i$ and $X_j$ is given by

$$\operatorname{cov}(X_i, X_j) = E\big[(X_i - E[X_i])(X_j - E[X_j])\big].$$

Covariance estimates can be derived from the sample values, as follows:

$$\operatorname{cov}(X_i, X_j) \approx \frac{1}{m}\sum_{k=1}^{m}\big(x_{ik} - \bar{x}_i\big)\big(x_{jk} - \bar{x}_j\big), \qquad \bar{x}_i = \frac{1}{m}\sum_{k=1}^{m} x_{ik}.$$
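As a quick sanity check, the sample estimate above can be compared against a library estimator. The snippet below uses synthetic data and the 1/m normalisation; all sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
K, m = 4, 1000                       # number of random variables and samples (illustrative)
X = rng.normal(size=(K, m))          # sample matrix: one row per random variable

# Sample estimate of cov(X_i, X_j) with the 1/m normalisation used above.
i, j = 0, 2
est = np.mean((X[i] - X[i].mean()) * (X[j] - X[j].mean()))

# Matches the library estimator when the same normalisation (bias=True) is used.
assert np.isclose(est, np.cov(X, bias=True)[i, j])
```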
Perceptual Domain Covariance Matrix. We measure each image block with the same randomly generated measurement matrix. The measurement procedure is

$$y_i = \Phi x_i,$$

where $x_i$ represents the original signal of size $n \times 1$ and $y_i$ denotes the corresponding signal in the perceptual domain of size $m \times 1$. This procedure operates under the assumption that the entries of the random measurement matrix $\Phi \in \mathbb{R}^{m \times n}$ follow a Gaussian distribution with a mean of zero and a variance of $\sigma^{2}$. Each signal in the perceptual domain $y_i$ is associated with a signal in the frequency domain $f_i$, as represented by

$$y_i = \Phi\Psi f_i = A f_i,$$

where $A = \Phi\Psi$ represents the projection matrix and $\Psi$ signifies the sparse basis Discrete Cosine Transform (DCT) matrix. The elements of the projection matrix $A$ are derived as follows:

$$a_{jk} = \sum_{l=1}^{n} \phi_{jl}\,\psi_{lk}.$$

Since $\Psi$ is orthonormal, the variance of the elements of the projection matrix $A$ is determined as

$$\operatorname{Var}(a_{jk}) = \sum_{l=1}^{n} \psi_{lk}^{2}\,\operatorname{Var}(\phi_{jl}) = \sigma^{2}.$$

The frequency domain signals corresponding to the $N$ image blocks form the frequency domain sample matrix

$$F = \begin{bmatrix} f_1 & f_2 & \cdots & f_N \end{bmatrix}.$$
Treating each coefficient of the perceptual domain signal as a random variable, the sample matrix over the perceptual domain is obtained as follows:

$$Y_s = \begin{bmatrix} y_1 & y_2 & \cdots & y_N \end{bmatrix} \in \mathbb{R}^{m \times N},$$

where the $i$-th row of $Y_s$ is the sample vector of the $i$-th perceptual random variable, and the $j$-th column $y_j$ is the sample vector contributed by the $j$-th block. Let $a_i$ denote the $i$-th row vector of the projection matrix $A$, so that the $i$-th perceptual random variable is $a_i f$. Utilizing the vector form of the covariance matrix, we have

$$\operatorname{cov}\big(a_i f,\, a_j f\big) \approx \frac{1}{N}\sum_{k=1}^{N} (y_k)_i\,(y_k)_j = a_i \Big(\frac{1}{N}\sum_{k=1}^{N} f_k f_k^{T}\Big) a_j^{T}.$$

Finally, the covariance matrix in the perceptual domain can be approximated as

$$C_Y \approx A\, C_F\, A^{T},$$

where $C_F$ is the frequency domain covariance matrix derived below.
Frequency Domain Covariance Matrix. The frequency domain sample matrix is represented as

$$F = \begin{bmatrix} f_{11} & f_{12} & \cdots & f_{1N} \\ \vdots & \vdots & \ddots & \vdots \\ f_{n1} & f_{n2} & \cdots & f_{nN} \end{bmatrix} = \begin{bmatrix} \tilde{f}_1^{T} \\ \vdots \\ \tilde{f}_n^{T} \end{bmatrix} = \begin{bmatrix} f_1 & \cdots & f_N \end{bmatrix},$$

where $\tilde{f}_i$ ($i = 1, \ldots, n$) denotes the vector containing all $N$ sample values of the $i$-th frequency domain random variable $F_i$, and $f_j$ ($j = 1, \ldots, N$) corresponds to the frequency domain signal of the $j$-th block. Then, the element in the $i$-th row and $j$-th column of the frequency domain covariance matrix depicts the mutual covariance between signals $F_i$ and $F_j$:

$$C_F(i, j) = \operatorname{cov}(F_i, F_j).$$

All Discrete Cosine Transform (DCT) coefficients within an image block adhere to a Gaussian distribution with a mean of zero; thus,

$$C_F(i, j) \approx \frac{1}{N}\sum_{k=1}^{N} f_{ik}\, f_{jk}, \qquad \text{i.e.,}\quad C_F \approx \frac{1}{N}\, F F^{T}.$$
The relationship above connects the signals in the perception and frequency domains: the perceptual domain covariance matrix is a linear transformation of the frequency domain covariance matrix, so the correlation structure of the two domains corresponds directly. Clearly, in compression-based image processing, the correlation analysis of frequency domain signals can therefore be performed directly on signals from the perceptual domain.
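This linear relationship is easy to verify numerically. The sketch below, using illustrative dimensions and zero-mean synthetic frequency-domain signals, checks that the perceptual domain covariance equals $A C_F A^{T}$ computed from the frequency domain.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, N = 64, 16, 5000               # signal length, measurements, blocks (illustrative)

# Zero-mean frequency-domain signals of N blocks (one per column) and their covariance.
F = rng.normal(size=(n, N))
C_F = (F @ F.T) / N                  # frequency domain covariance (zero-mean assumption)

# One shared projection matrix A maps every block into the perceptual domain.
A = rng.normal(0.0, 1.0 / np.sqrt(m), size=(m, n))
Y = A @ F                            # perceptual domain sample matrix

C_Y = (Y @ Y.T) / N                  # perceptual domain covariance
# The covariances are linearly related: C_Y = A C_F A^T (exact for these sample
# estimates, since Y = A F implies Y Y^T = A F F^T A^T).
assert np.allclose(C_Y, A @ C_F @ A.T)
```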
3.2.3. Image Patch Classification Based on Perceptual Domain Features
The elements of the perceptual domain covariance matrix are correspondingly related to the variances of the respective frequency domain signals, emphasizing the close relationship between frequency domain signals and pixel domain signals. Specifically, smooth image patches correspond to frequency domain signals with lower sparsity, indicating a prevalence of uniform information across frequencies. Conversely, edge and texture patches are associated with frequency domain signals that exhibit higher sparsity, reflecting the presence of distinct high-frequency components that characterize sharp transitions and intricate textures in the image. This relationship underscores the interplay between spatial characteristics and their spectral representations. Consequently, the variance of the perceptual domain covariance matrix reflects the characteristics of image patches and can be utilized for their classification, with the classification thresholds computed from this variance via the 2σ principle described below.
When judging the categories of image blocks, we combine multi-directional block classification as follows:
(1) For the i-th block, take the feature vector of its horizontally adjacent block. Based on the covariance matrix of this feature vector, determine its variance, classify the block into one of the image block types, and assign a category parameter for the horizontal direction.
(2) For the same block, take the feature vector of its vertically adjacent block. Based on the covariance matrix of this feature vector, determine its variance, classify the block into one of the image block types, and assign a category parameter for the vertical direction.
(3) For the same image sub-block, consider the category parameters obtained from the horizontal and vertical directions. The priority order, from high to low, is edge block, texture block, and smooth block; choose the category with the highest priority as the final category of the image block.
The algorithm simultaneously takes into account the global statistical characteristics and local distribution properties of image segments, and it computes the classification threshold effectively by employing the 2σ principle. Smooth blocks primarily contain background and contour information, exhibiting relatively uniform features. In contrast, edge blocks are more complex, encompassing detailed information such as boundaries that correspond to high-frequency components of the image. Consequently, edge blocks demonstrate higher sparsity in the frequency domain than smooth blocks.
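To make the classification flow concrete, the following sketch implements one plausible reading of the procedure: a perceptual-domain variance statistic per block, 2σ-based thresholds, and the horizontal/vertical priority fusion. The statistic, the threshold placement, and the mapping from statistic ranges to classes are illustrative assumptions, not the paper's exact criteria.

```python
import numpy as np

rng = np.random.default_rng(3)

def perceptual_variance(block, Phi):
    """Variance of a block's compressive measurements (a stand-in for the
    perceptual-domain covariance statistic used by ICPC)."""
    return (Phi @ block.reshape(-1)).var()

def thresholds(stats):
    """Illustrative cut-points derived from the 2-sigma rule over all block statistics."""
    mu, sigma = stats.mean(), stats.std()
    return mu, mu + 2 * sigma

def classify(stat, t_low, t_high):
    # The mapping of statistic ranges to classes is an assumption for illustration.
    if stat < t_low:
        return "smooth"
    if stat < t_high:
        return "texture"
    return "edge"

PRIORITY = {"edge": 2, "texture": 1, "smooth": 0}

def fuse(label_h, label_v):
    """Keep the higher-priority class from the horizontal/vertical decisions."""
    return label_h if PRIORITY[label_h] >= PRIORITY[label_v] else label_v

# Toy usage: 100 random 16x16 blocks measured with one shared Gaussian matrix.
Phi = rng.normal(0.0, 1.0 / 8.0, size=(64, 256))
blocks = rng.random((100, 16, 16))
stats = np.array([perceptual_variance(b, Phi) for b in blocks])
t_low, t_high = thresholds(stats)
labels = [classify(s, t_low, t_high) for s in stats]
```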
3.3. Depth–Texture Interaction Fusion Module
The depth–texture interaction fusion (DTIF) module employs a traditional U-Net encoder-decoder architecture, which effectively extracts and integrates contextual information while flexibly handling features of varying scales, demonstrating strong adaptability. Before the image blocks are input into DTIF, they have already undergone feature classification via the ICPC module. This preprocessing allows us to dynamically adjust the number of DTIF modules based on the prior classification results. Specifically, image blocks with different feature types can be allocated different numbers of DTIF modules: simple, smooth regions may be assigned three DTIF modules; regions with pronounced edge features may utilize six; and complex texture regions may be assigned nine. This adjustment mechanism ensures that the model operates efficiently under different circumstances and optimizes feature extraction. In the DTIF, the primary task of the encoder is to extract detailed features of the image, while the decoder combines the encoder’s features with its own through skip connections, ensuring the retention of detail information. This design not only enhances reconstruction quality but also effectively reduces information loss, resulting in a clearer and more realistic final output image. By combining the advantages of both the encoder and decoder, the DTIF module is better equipped to adapt to various image features, thereby improving the performance of super-resolution reconstruction. The basic architecture of DTIF consists of a layered encoder, pooling layers, a compression layer, and a decoder, along with skip connection layers linking the encoder and decoder. This section first presents a comprehensive outline of the overall structure of DTIF and then offers in-depth explanations of the DTIT, CWIA, and MSFE blocks.
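As a minimal illustration of this class-dependent allocation, the mapping could be expressed as a lookup from the ICPC label to the number of DTIF modules; the counts below simply mirror the 3/6/9 example above and are not prescriptive.

```python
# Hypothetical mapping from ICPC block class to the number of DTIF modules applied;
# the 3/6/9 counts mirror the example in the text and are illustrative only.
DTIF_MODULES = {"smooth": 3, "edge": 6, "texture": 9}

def modules_for(block_label: str) -> int:
    """Return how many DTIF modules to stack for a classified block."""
    return DTIF_MODULES[block_label]

assert modules_for("edge") == 6
```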
Encoder. For a given LR image of height H, width W, and C channels, the initial feature map is extracted through pixel shuffle. To enrich the transferred feature details and enhance the expression of content features, we design a three-layer encoder. Specifically, the extracted feature map first undergoes two stages composed of two DTIT (DTI Transformer) blocks and soft pooling [39] with a kernel size of 2 × 2, followed by processing through a third DTIT block. The details of the DTIT are discussed in Section 3.3.1.
Pooling Layer. In contrast to conventional pooling operations such as average pooling [40] and max pooling [41], we utilize soft pooling [39] in this process, as shown in Figure 3. The calculation formula for soft pooling is as follows:

$$\tilde{x} = \sum_{i=1}^{N} w_i\, x_i,$$

where $x_i$ is the input data, $N$ is the number of inputs within the pooling region, and $w_i$ is the weight corresponding to the input $x_i$, usually calculated using the softmax function:

$$w_i = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}.$$
The objective of this pooling approach is to minimize the information loss that occurs during pooling while preserving the functionality of the pooling layer. By effectively retaining content features, soft pooling facilitates deeper extraction of blurred information, thus significantly enhancing the performance of traditional SR methods when handling images with varying levels of clarity.
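A compact way to realise soft pooling is to weight each activation by its softmax weight within the pooling window; the ratio-of-average-pools trick below implements exactly the formula above. This is a minimal PyTorch sketch (kernel size, stride, and tensor sizes are illustrative), not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def soft_pool2d(x: torch.Tensor, kernel_size: int = 2, stride: int = 2) -> torch.Tensor:
    """SoftPool: each activation is weighted by its softmax weight inside the
    pooling window and the weighted values are summed."""
    e = torch.exp(x)  # note: exp may overflow for very large activations; stabilisation omitted
    # sum(w_i * x_i) = sum(e^{x_i} * x_i) / sum(e^{x_j}); the window-size factor of
    # average pooling cancels in the ratio.
    return F.avg_pool2d(e * x, kernel_size, stride) / F.avg_pool2d(e, kernel_size, stride)

feat = torch.randn(1, 64, 32, 32)        # (batch, channels, H, W)
pooled = soft_pool2d(feat)               # -> (1, 64, 16, 16)
```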
Compression Layer. The design of a three-layer encoder and the choice of pooling layers enable the extraction of more image details, but they also convey a significant amount of image information. To address this, the introduction of the compression layer module (MSFE) effectively compresses input image information while extracting features. This means that, while compressing redundant information from simple image blocks, the compression layer also aids in capturing detailed features [42] from complex, blurred image blocks. The design of the compression layer not only reduces computational complexity and alleviates the computational burden on the subsequent decoder but also enhances the representation capability of similar features, effectively minimizing information loss during the feature transfer process to the decoder, thereby improving the quality of image reconstruction. The specific details of this module will be elaborated in Section 3.3.4.
Decoder. Similar to most U-Net architectures, the decoder adopts the same modular structure as the encoder; here, we also use a combination of DTIT and soft pooling blocks. Additionally, the final result of the first DTIT is designed to bypass the fully connected layer and instead enter the compression layer module; it is then combined with the output of the three-layer encoder through a residual connection [43] before being input into the decoder. This skip-connection [44] design aids in optimizing the model by reducing computational complexity while leveraging locality and dependency to perform multiscale feature extraction on the input.
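Putting the pieces together, the following heavily simplified PyTorch skeleton sketches the encoder-compression-decoder layout with one skip connection. DTITBlock is only a convolutional stand-in for the paper's DTI Transformer (Section 3.3.1), the compression layer is a 1 × 1 convolution standing in for MSFE (Section 3.3.4), and the channel widths, block counts, and exact wiring of the skip path are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DTITBlock(nn.Module):
    """Stand-in for the paper's DTI Transformer block (Section 3.3.1): a plain
    convolutional residual block so that the skeleton runs end to end."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.GELU(),
            nn.Conv2d(channels, channels, 3, padding=1),
        )

    def forward(self, x):
        return x + self.body(x)

class SoftPool2d(nn.Module):
    """2x2 soft pooling (see the formula in the Pooling Layer paragraph)."""
    def forward(self, x):
        e = torch.exp(x)
        return F.avg_pool2d(e * x, 2) / F.avg_pool2d(e, 2)

class DTIFSketch(nn.Module):
    """Rough encoder-compression-decoder layout with one skip connection;
    channel widths, block counts and the exact skip wiring are illustrative."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.enc1 = nn.Sequential(DTITBlock(channels), DTITBlock(channels))
        self.enc2 = nn.Sequential(DTITBlock(channels), DTITBlock(channels))
        self.enc3 = DTITBlock(channels)
        self.pool = SoftPool2d()
        self.compress = nn.Conv2d(channels, channels, 1)   # stand-in for MSFE (Section 3.3.4)
        self.up = nn.Upsample(scale_factor=4, mode="nearest")
        self.dec = nn.Sequential(DTITBlock(channels), DTITBlock(channels))

    def forward(self, x):
        e1 = self.enc1(x)                  # full resolution
        e2 = self.enc2(self.pool(e1))      # 1/2 resolution
        e3 = self.enc3(self.pool(e2))      # 1/4 resolution
        d = self.up(self.compress(e3))     # back to full resolution
        return self.dec(d + e1)            # skip connection from the encoder

feat = torch.randn(1, 64, 64, 64)          # (batch, channels, H, W)
out = DTIFSketch()(feat)                   # -> (1, 64, 64, 64)
```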
3.3.1. Depth–Texture Interaction Transformer
Because the Swin Transformer performs self-attention within fixed-size windows, features near window boundaries are prone to blurring or incoherence, particularly in areas with complex texture details, making it difficult to recover degraded edge pixels effectively. This local window division hinders seamless information flow between different windows, resulting in suboptimal performance on texture blocks in remote sensing images. Furthermore, although the shifted-window mechanism allows some inter-window interaction, its alternating local operations still limit information propagation, particularly for long-range dependencies across windows. Unlike global self-attention mechanisms, the Swin Transformer cannot capture global information at each layer; instead, it builds global perception gradually through multi-layer accumulation. Consequently, for tasks that require reasoning across multiple local regions, it may not sufficiently model long-range dependencies.
To overcome these challenges, we introduce an N–Gram window interaction module prior to the self-attention operations within the module. Borrowing the N–Gram concept from natural language processing, this module establishes tight feature relationships among multiple adjacent windows, breaking the limitation of the traditional Swin Transformer’s purely local window operations: features from different windows interact during computation, enabling more comprehensive information propagation and fusion. This enhancement not only strengthens information interaction within local windows but also alleviates edge blurriness and disjointed local features by capturing contextual information across windows. In the depth–texture interaction Transformer (DTIT) module, our Cross-Window Importance Aggregation (CWIA) block spans the combined areas of W-Trans and SW-Trans following the N–Gram interaction module and connects to the outputs of the self-attention mechanism through residual connections. This block considers feature importance in both directions simultaneously, employing different pooling strategies to extract finer texture information according to directional significance. Through this integrative feature extraction approach, the model can more accurately identify and restore critical structures and details within images, enhancing overall image quality and recognition accuracy. The DTIT module is thus able to capture complex textures that are difficult to handle with conventional convolutions or window mechanisms, transcending the local limitations of the Swin Transformer and ultimately improving overall image quality.
3.3.2. N–Gram Window Interaction Block
In the Swin Transformer, the original window self-attention (SA) and cross-window self-attention (WSA) are computed as follows:

$$\mathrm{SA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$$

where $Q$, $K$, and $V$ denote the matrices representing the query, key, and value, respectively, and $d_k$ refers to the dimension of the key.
The cross-window self-attention (WSA) is computed as

$$\mathrm{WSA}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}} + A\right)V,$$

where $A$ is the window shift (mask) matrix, which is used to address the limited receptive field issue caused by window shifting.
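For reference, a minimal PyTorch sketch of the windowed attention above, with the shift matrix handled as an additive mask on the attention scores; tensor shapes are illustrative.

```python
import torch

def window_self_attention(q, k, v, shift_mask=None):
    """Scaled dot-product attention within each window; an additive mask plays
    the role of the window shift matrix A in the shifted-window (WSA) case."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5       # (windows, tokens, tokens)
    if shift_mask is not None:
        scores = scores + shift_mask                    # suppress cross-border attention
    return torch.softmax(scores, dim=-1) @ v

# Toy usage: 4 windows of 7x7 = 49 tokens with 32-dimensional features.
q = k = v = torch.randn(4, 49, 32)
out = window_self_attention(q, k, v)                    # -> (4, 49, 32)
```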
To address the window shifting issue in Swin-V1, we adopt the concept proposed by Choi et al. [32], which utilizes Uni–Gram non-overlapping local windows based on the N–Gram language model, as shown in Figure 4. In this formulation, attention is restricted to each Uni–Gram window through an element-wise mask: the attention computation is multiplied element-wise (⊙) by the Uni–Gram window mask matrix $M$, which defines the window range.
Each adjacent Uni–Gram window can be combined into a larger N–Gram window by concatenating the query, key, and value matrices of multiple Uni–Gram windows. In the N–Gram language model, consecutive forward, backward, or bidirectional words are considered as target words. Using N–Gram to define the window for WSA allows pixels within the window to influence each other through WSA, thereby enlarging the receptive field. This expansion enhances the capability of the model by increasing the accuracy of extracting details through a broader contextual understanding.
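The following sketch illustrates this merging of adjacent Uni–Gram windows into a larger N–Gram window before attention. The grouping into non-overlapping runs of n windows and all tensor sizes are simplifying assumptions; the actual N–Gram interaction of Choi et al. [32] is more elaborate.

```python
import torch

def ngram_window_attention(q, k, v, n: int = 2):
    """Concatenate the tokens of n consecutive Uni-Gram windows into one larger
    N-Gram window before self-attention, so pixels of neighbouring windows can
    interact; the grouping (non-overlapping runs of n windows) is illustrative."""
    w, t, d = q.shape                                   # windows, tokens per window, dim
    assert w % n == 0, "number of windows must be divisible by n"
    merge = lambda x: x.reshape(w // n, n * t, d)       # fuse n adjacent windows
    qm, km, vm = merge(q), merge(k), merge(v)
    scores = qm @ km.transpose(-2, -1) / d ** 0.5
    out = torch.softmax(scores, dim=-1) @ vm
    return out.reshape(w, t, d)                         # split back into Uni-Gram windows

q = k = v = torch.randn(4, 49, 32)                      # 4 windows of 7x7 tokens, 32-dim
out = ngram_window_attention(q, k, v, n=2)              # -> (4, 49, 32)
```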
3.3.3. Cross-Window Importance Aggregation
Given that the Swin Transformer relies heavily on window alternation for information interaction, it tends to perform poorly on edge-blurred textures, making it challenging to precisely restore damaged pixels through content interaction. To resolve this problem, we developed the CWIA block. This module extends across the window-partitioned areas of both W-Trans and SW-Trans and is linked to the result of self-attention by residual connections. The CWIA block applies two-dimensional Local Importance Pooling (LIP) [45] in both the horizontal and vertical orientations, using the following calculation:
Let $X$ denote the input feature map of size $H \times W \times C$, where $H$ is the height, $W$ is the width, and $C$ is the number of channels. For each position $(i, j)$ in the input feature map $X$, the LIP operation computes the importance score $S(i, j)$ as the local sum of absolute differences (LSAD) [46] between the pixel values within a neighborhood window centered at $(i, j)$. The importance score $S(i, j)$ for position $(i, j)$ is calculated as

$$S(i, j) = \sum_{u=-k}^{k}\sum_{v=-k}^{k}\big|X(i, j) - X(i+u, j+v)\big|,$$

where $k$ is the radius of the neighborhood window and $X(i+u, j+v)$ represents the pixel value at position $(i+u, j+v)$ in the input feature map $X$.
The two pooled outputs are multiplied and then subjected to self-attention; the result is added, via a residual connection, to the direct output of W-Trans and SW-Trans. The inclusion of this module enhances the representational capacity for details, allowing deep extraction of texture-block details and improving the accuracy of pixel recovery.
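A vectorised sketch of the LSAD importance score above is given below; it computes, for every position and channel, the sum of absolute differences to its (2k+1) × (2k+1) neighbourhood. The padding mode and tensor layout are assumptions, and the directional pooling and multiplication of the two pooled outputs described above are omitted.

```python
import torch
import torch.nn.functional as F

def lsad_importance(x: torch.Tensor, k: int = 1) -> torch.Tensor:
    """Local sum of absolute differences: for each position, sum |centre - neighbour|
    over a (2k+1)x(2k+1) window; x has shape (batch, channels, H, W)."""
    b, c, h, w = x.shape
    pad = F.pad(x, (k, k, k, k), mode="replicate")
    # Unfold gathers every neighbourhood; reshape to (batch, channels, window, positions).
    patches = F.unfold(pad, kernel_size=2 * k + 1).view(b, c, (2 * k + 1) ** 2, h * w)
    centre = x.view(b, c, 1, h * w)
    return (patches - centre).abs().sum(dim=2).view(b, c, h, w)

feat = torch.randn(1, 8, 16, 16)
importance = lsad_importance(feat)        # importance score per position and channel
```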
3.3.4. Multi-Scale Feature Enhancer
The Multi-Scale Feature Enhancer (MSFE) block, serving as the compression layer of the network, consists of three steps: transpose interpolation, concatenation, and grouped convolution. First, the output from the third layer of the encoder is fed into the MSFE block, where it is upsampled by the transpose interpolation layer, yielding an expanded feature map of increased resolution. Unlike traditional interpolation methods, transpose interpolation integrates better with the feature representations inside convolutional neural networks, showing clear advantages in preserving spatial information and detail features. Additionally, transpose interpolation smooths the feature map during expansion, avoiding artifacts such as the checkerboard effect, while requiring no additional training weights. This allows simple parameter control over the output resolution, effectively alleviating the computational burden and enhancing the overall quality and efficiency of the upsampling. Subsequently, the MSFE block concatenates the results across channels in the channel domain to amplify and reinforce identical features. The combined features are then processed by the grouped convolution layer. Grouped convolution partitions the channels into groups and performs convolution within each group. This mechanism enables the network to learn different feature subspaces, producing richer feature representations upon final aggregation and particularly enhancing the representation of edge and complex texture blocks in intricate remote sensing images. Compared with standard convolution, grouped convolution significantly reduces the number of parameters and the computational load, further lowering the demand on computational resources. Through these steps, the MSFE block efficiently extracts multi-scale features while retaining the capability to handle complex textures. The final output is then fed into the decoder to complete the subsequent reconstruction tasks.
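A rough PyTorch sketch of the three MSFE steps is given below. The weight-free transposed upsampling is realised here with a fixed bilinear kernel, the second concatenation input is assumed to be a skip feature, and the x2 scale, kernel sizes, and group count are illustrative choices rather than the paper's settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MSFESketch(nn.Module):
    """Rough MSFE layout: weight-free transposed upsampling, channel-wise
    concatenation, then grouped convolution."""
    def __init__(self, channels: int = 64, groups: int = 4):
        super().__init__()
        # Fixed bilinear kernel so the transposed convolution needs no training weights.
        k = torch.tensor([[1., 2., 1.], [2., 4., 2.], [1., 2., 1.]]) / 4.0
        self.register_buffer("kernel", k.expand(channels, 1, 3, 3).clone())
        self.grouped = nn.Conv2d(2 * channels, channels, 3, padding=1, groups=groups)

    def forward(self, x, skip):
        # Step 1: stride-2 transposed upsampling with the fixed kernel, per channel.
        up = F.conv_transpose2d(x, self.kernel, stride=2, padding=1, output_padding=1,
                                groups=x.size(1))
        # Step 2: concatenate across channels to amplify matching features.
        fused = torch.cat([up, skip], dim=1)
        # Step 3: grouped convolution, cheaper than a standard convolution.
        return self.grouped(fused)

x = torch.randn(1, 64, 16, 16)                  # encoder output at lower resolution (toy sizes)
skip = torch.randn(1, 64, 32, 32)               # feature map at the target resolution
out = MSFESketch()(x, skip)                     # -> (1, 64, 32, 32)
```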