Dual-Ascent-Inspired Transformer for Compressed Sensing
Abstract
1. Introduction
2. Related Work
2.1. Deep Unfolding Networks
2.2. Vision Transformer
2.3. Dual-Ascent Method
3. Proposed Method
3.1. Inertial-Dual-Ascent Form of AADMM
3.2. Dual-Ascent-Inspired Transformer
3.2.1. Overall Architecture
- Compared to Equation (20), Equation (24) replaces the fixed coefficient with a learnable convolutional layer, which encodes the residual of the dual variables between adjacent iteration layers. Additionally, it replaces the linear addition operation with a Cross Attention module, which integrates the result of the gradient descent operator with the residual of the dual variables. This design more effectively incorporates inertial information from the dual space into the primal variable updates, thereby facilitating accelerated convergence while ensuring global stability;
- Equation (25) represents a proximal mapping step, which essentially functions as a high-pass filter in the sparse space. Following widely adopted practices in Deep Unfolding Networks (DUNs) [14,15,20], we replace the fixed matrix mappings in Equation (25) with convolutional layers containing learnable parameters, allowing for a more accurate transformation between the intermediate variable space and the sparse domain. Additionally, we employ the GELU activation function [31], which shares a similar role with the soft-thresholding function, to perform filtering. This process is encapsulated in the High-pass Filter (HPF) module;
- In Equations (26) and (27), we introduce as a fixed input to the right side of the Cross Attention module, forming a unified encoder that integrates the residual term into the dual variable update through the Dual Ascent (DA) module. Notably, this encoder not only ensures dimension consistency between the primal and dual variables but also reuses the parameters of the Cross Attention module, avoiding two independent CA modules. This design reduces computational complexity and memory consumption while providing better encoding performance than the fixed-coefficient approach used in Equation (22). A schematic sketch of one such unrolled iteration stage is given after this list.
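To make the wiring of these modules concrete, the following PyTorch sketch assembles one unrolled stage in the spirit described above. It is a minimal sketch, not the authors' implementation: the module names (`CrossAttention`, `HighPassFilter`, `DATStage`), the single-head attention, the 3×3 convolution sizes, the learnable step size, and the choice of which tensor feeds the query versus the key/value side are all assumptions, since Equations (20)–(27) are not reproduced in this excerpt.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossAttention(nn.Module):
    """Single-head cross attention over flattened feature maps: the query comes
    from one stream and the key/value pair from the other, so information from
    the two streams can be fused."""

    def __init__(self, channels: int):
        super().__init__()
        self.q = nn.Conv2d(channels, channels, 1)
        self.k = nn.Conv2d(channels, channels, 1)
        self.v = nn.Conv2d(channels, channels, 1)
        self.proj = nn.Conv2d(channels, channels, 1)

    def forward(self, x_q: torch.Tensor, x_kv: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x_q.shape
        q = self.q(x_q).flatten(2).transpose(1, 2)    # (B, HW, C)
        k = self.k(x_kv).flatten(2).transpose(1, 2)   # (B, HW, C)
        v = self.v(x_kv).flatten(2).transpose(1, 2)   # (B, HW, C)
        attn = torch.softmax(q @ k.transpose(1, 2) / c ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(b, c, h, w)
        return self.proj(out)


class HighPassFilter(nn.Module):
    """HPF module: learnable convolutions around a GELU nonlinearity, standing
    in for the soft-thresholding/filtering role of Equation (25)."""

    def __init__(self, channels: int):
        super().__init__()
        self.analysis = nn.Conv2d(channels, channels, 3, padding=1)
        self.synthesis = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.synthesis(F.gelu(self.analysis(x)))


class DATStage(nn.Module):
    """One unrolled iteration: a primal update (Cross Attention + HPF) followed
    by a dual update (DA module) that reuses the same Cross Attention weights."""

    def __init__(self, channels: int):
        super().__init__()
        # Learnable convolution replacing the fixed inertial coefficient.
        self.residual_encoder = nn.Conv2d(channels, channels, 3, padding=1)
        self.ca = CrossAttention(channels)   # shared by primal and dual updates
        self.hpf = HighPassFilter(channels)
        self.step = nn.Parameter(torch.tensor(0.1))  # dual-ascent step size

    def forward(self, grad_result, dual, dual_prev):
        # Primal update: fuse the gradient-descent result with the encoded
        # residual of the dual variables via cross attention (cf. Eq. (24)),
        # then apply the high-pass filtering step (cf. Eq. (25)).
        dual_residual = self.residual_encoder(dual - dual_prev)
        x = self.ca(grad_result, dual_residual)
        x = x + self.hpf(x)

        # Dual update (cf. Eqs. (26) and (27)): the same CA module acts as the
        # encoder that carries primal-space information into the dual ascent.
        dual_next = dual + self.step * self.ca(dual, x)
        return x, dual_next
```

Note that `DATStage` instantiates a single `CrossAttention` and calls it twice, mirroring the parameter reuse between the primal update and the DA module described above; a full network would stack several such stages inside the unrolled loop.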
3.2.2. Sampling and Initial Reconstruction
3.2.3. Cross Attention
3.2.4. High-Pass Filter
3.2.5. Dual Ascent
- Since the dimensions of the primal and dual spaces differ, to integrate information from the primal space into the dual space, we require an encoder with a dimension-matching mechanism. The dimensions of and are always the same, so using as a fixed input naturally matches the dimensions of the primal variable with those of the dual space via the preceding CA module. Additionally, this approach leverages the information lost during down-sampling via the operator, resulting in better performance than using fixed coefficients as the encoder;
- We note that, in Equation (22), the dual-ascent term requires the fixed coefficients and to be positive-definite. Our DA module employs an attention mechanism with identical K and V inputs. To guarantee strict positive definiteness when the encoder operates on the dual-ascent term, one option is the linear attention mechanism [34], which produces the linear attention variable . The process is outlined as follows: assume that for feature maps , it follows that , for ; the resulting coefficient term is clearly positive-definite. Since prior research [34] shows that the standard attention mechanism can achieve performance equal or even superior to that of linear attention, we argue that the encoder derived from the standard attention mechanism can approximately maintain positive definiteness, thereby supporting the preservation of the dual-ascent method’s acceleration property in DAT. A numerical illustration of the underlying positive-definiteness argument is given after this list.
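The positive-definiteness claim can be checked numerically. The snippet below is only an illustration under stated assumptions: it uses the elu(x) + 1 feature map chosen in [34], and it assumes that, with identical K and V inputs, the coefficient in question takes the Gram-matrix form phi(X) phi(X)^T; the exact symbols of the paper's derivation are not reproduced in this excerpt, and the token and feature dimensions below are arbitrary.

```python
import torch
import torch.nn.functional as F


def phi(x: torch.Tensor) -> torch.Tensor:
    """Positive feature map used by linear attention in [34]: elu(x) + 1 > 0."""
    return F.elu(x) + 1.0


torch.manual_seed(0)
tokens = torch.randn(16, 64)   # 16 tokens with 64-dimensional features (assumed shapes)

# When the K and V inputs are identical, the linear-attention coefficient acting
# on the value stream reduces to the Gram matrix phi(X) phi(X)^T, which is
# symmetric positive semi-definite by construction and positive definite
# whenever phi(X) has full row rank.
feat = phi(tokens)             # (16, 64), strictly positive entries
coeff = feat @ feat.T          # (16, 16) Gram matrix

eigvals = torch.linalg.eigvalsh(coeff)
print(f"smallest eigenvalue: {eigvals.min().item():.4e}")   # strictly positive here
assert torch.allclose(coeff, coeff.T)                        # symmetric
assert (eigvals > 0).all()                                   # positive definite for this draw
```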
3.2.6. Loss Function
4. Experimental Results
4.1. Details of Implementation
4.2. Early-Stage Performance
4.2.1. Training Process Analysis
4.2.2. Comparison of Reconstructions
4.2.3. Comparison of Convergences and Time Complexities
4.2.4. Comparison of Model Sizes
4.3. Influence of Initial Learning Rate
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
1. Donoho, D.L. Compressed sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306.
2. Candes, E.J.; Tao, T. Near-optimal signal recovery from random projections: Universal encoding strategies? IEEE Trans. Inf. Theory 2006, 52, 5406–5425.
3. Candès, E.J.; Romberg, J.; Tao, T. Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Trans. Inf. Theory 2006, 52, 489–509.
4. Lustig, M.; Donoho, D.; Pauly, J.M. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magn. Reson. Med. 2007, 58, 1182–1195.
5. Duarte, M.F.; Davenport, M.A.; Takhar, D.; Laska, J.N.; Sun, T.; Kelly, K.F.; Baraniuk, R.G. Single-pixel imaging via compressive sampling. IEEE Signal Process. Mag. 2008, 25, 83–91.
6. Ma, J.; Zhou, H.; Zhao, J.; Gao, Y.; Jiang, J.; Tian, J. Robust feature matching for remote sensing image registration via locally linear transforming. IEEE Trans. Geosci. Remote Sens. 2015, 53, 6469–6481.
7. Mallat, S. A Wavelet Tour of Signal Processing; Elsevier: Amsterdam, The Netherlands, 1999.
8. Tibshirani, R.J. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 1996, 58, 267–288.
9. Kulkarni, K.; Lohit, S.; Turaga, P.; Kerviche, R.; Ashok, A. ReconNet: Non-iterative reconstruction of images from compressively sensed measurements. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 449–458.
10. Sun, Y.; Chen, J.; Liu, Q.; Liu, B.; Guo, G. Dual-path attention network for compressed sensing image reconstruction. IEEE Trans. Image Process. 2020, 29, 9482–9495.
11. Zhang, J.; Ghanem, B. ISTA-Net: Interpretable optimization-inspired deep network for image compressive sensing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 1828–1837.
12. Chen, B.; Song, J.; Xie, J.; Zhang, J. Deep physics-guided unrolling generalization for compressed sensing. Int. J. Comput. Vis. 2023, 131, 2864–2887.
13. Chen, B.; Zhang, Z.; Li, W.; Zhao, C.; Yu, J.; Zhao, S.; Chen, J.; Zhang, J. Invertible diffusion models for compressed sensing. IEEE Trans. Pattern Anal. Mach. Intell. 2024.
14. Song, J.; Mou, C.; Wang, S.; Ma, S.; Zhang, J. Optimization-inspired cross-attention transformer for compressive sensing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 6174–6184.
15. Shen, M.; Gan, H.; Ning, C.; Hua, Y.; Zhang, T. TransCS: A transformer-based hybrid architecture for image compressed sensing. IEEE Trans. Image Process. 2022, 31, 6991–7005.
16. Ye, D.; Ni, Z.; Wang, H.; Zhang, J.; Wang, S.; Kwong, S. CSformer: Bridging convolution and transformer for compressive sensing. IEEE Trans. Image Process. 2023, 32, 2827–2842.
17. Beck, A.; Teboulle, M. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci. 2009, 2, 183–202.
18. Yang, Y.; Sun, J.; Li, H.; Xu, Z. ADMM-CSNet: A deep learning approach for image compressive sensing. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 42, 521–538.
19. Boyd, S.; Parikh, N.; Chu, E.; Peleato, B.; Eckstein, J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 2011, 3, 1–122.
20. You, D.; Xie, J.; Zhang, J. ISTA-Net++: Flexible deep unfolding network for compressive sensing. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; pp. 1–6.
21. Chen, W.; Yang, C.; Yang, X. FSOINET: Feature-space optimization-inspired network for image compressive sensing. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 2460–2464.
22. Zhang, Z.; Liu, Y.; Liu, J.; Wen, F.; Zhu, C. AMP-Net: Denoising-based deep unfolding for compressive image sensing. IEEE Trans. Image Process. 2020, 30, 1487–1500.
23. Ochs, P.; Chen, Y.; Brox, T.; Pock, T. iPiano: Inertial proximal algorithm for nonconvex optimization. SIAM J. Imaging Sci. 2014, 7, 1388–1419.
24. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 261–272.
25. Dosovitskiy, A. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
26. Tibshirani, R.J. The lasso problem and uniqueness. Electron. J. Stat. 2013, 7, 1456–1490.
27. Donoho, D.L. De-noising by soft-thresholding. IEEE Trans. Inf. Theory 1995, 41, 613–627.
28. Lin, R.; Hayashi, K. An approximated ADMM-based algorithm for the ℓ1-ℓ2 optimization problem. In Proceedings of the 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Chiang Mai, Thailand, 7–10 November 2022; pp. 1720–1724.
29. Clarkson, J.A.; Adams, C.R. On definitions of bounded variation for functions of two variables. Trans. Am. Math. Soc. 1933, 35, 824–854.
30. Boţ, R.I.; Csetnek, E.R.; László, S.C. An inertial forward–backward algorithm for the minimization of the sum of two nonconvex functions. EURO J. Comput. Optim. 2016, 4, 3–25.
31. Hendrycks, D.; Gimpel, K. Gaussian error linear units (GELUs). arXiv 2016, arXiv:1606.08415.
32. Agarap, A. Deep learning using rectified linear units (ReLU). arXiv 2018, arXiv:1803.08375.
33. Mitchell, D.P.; Netravali, A.N. Reconstruction filters in computer graphics. ACM SIGGRAPH Comput. Graph. 1988, 22, 221–228.
34. Katharopoulos, A.; Vyas, A.; Pappas, N.; Fleuret, F. Transformers are RNNs: Fast autoregressive transformers with linear attention. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 5156–5165.
35. Arbelaez, P.; Maire, M.; Fowlkes, C.; Malik, J. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2010, 33, 898–916.
36. Shi, W.; Jiang, F.; Liu, S.; Zhao, D. Image compressed sensing using convolutional neural network. IEEE Trans. Image Process. 2019, 29, 375–388.
37. Kingma, D.P.; Ba, J. Adam: A method for stochastic optimization. arXiv 2014, arXiv:1412.6980.
38. Loshchilov, I.; Hutter, F. SGDR: Stochastic gradient descent with warm restarts. arXiv 2016, arXiv:1608.03983.
39. Hore, A.; Ziou, D. Image quality metrics: PSNR vs. SSIM. In Proceedings of the 2010 20th International Conference on Pattern Recognition, Istanbul, Turkey, 23–26 August 2010; pp. 2366–2369.
PSNR/SSIM at CS ratios of 10%, 30%, and 50%, evaluated after Epoch 1 and Epoch 10:

| Method | Epoch 1, 10% | Epoch 1, 30% | Epoch 1, 50% | Epoch 1, Avg | Epoch 10, 10% | Epoch 10, 30% | Epoch 10, 50% | Epoch 10, Avg |
|---|---|---|---|---|---|---|---|---|
| Octuf | 25.51/0.7983 | 30.45/0.8842 | 33.44/0.9525 | 29.80/0.8783 | 30.18/0.8977 | 36.76/0.9650 | 40.76/0.9826 | 35.90/0.9485 |
| TransCS | 21.62/0.5815 | 24.33/0.7071 | 26.24/0.7752 | 24.06/0.6879 | 29.03/0.8776 | 35.71/0.9587 | 39.85/0.9794 | 34.86/0.9386 |
| CSformer | - | - | 22.67/0.6529 | 22.67/0.6529 | - | - | 24.22/0.7590 | 24.22/0.7590 |
| DAT (Ours) | 27.15/0.8351 | 33.93/0.9515 | 38.82/0.9782 | 33.33/0.9216 | 29.99/0.8928 | 36.77/0.9651 | 40.95/0.9830 | 35.90/0.9469 |
| Method | CS Ratio 10% | CS Ratio 30% | CS Ratio 50% | Run Time | TEP | Param |
|---|---|---|---|---|---|---|
| Octuf | 2/27.72 | 3/34.79 | 4/38.63 | 0.046 s | 0.184 | 0.82M |
| TransCS | 4/27.63 | 4/34.49 | 5/38.44 | 0.039 s | 0.195 | 2.28M |
| CSformer | - | - | 10+ | 0.021 s | 0.210+ | 1.76M |
| DAT (Ours) | 1/27.15 | 1/33.93 | 1/38.82 | 0.060 s | 0.060 | 0.76M |
| ISTA-Net+ | 20/26.64 | 20/33.82 | 20/38.07 | 0.016 s | 0.320 | 1.70M |
| Method | Indicator | Initial Learning Rate |  |  |  |  |  | Avg | Var |
|---|---|---|---|---|---|---|---|---|---|
| Octuf | MaxMSE | 64.82 | 3663.50 | 309.35 | 827.26 | 267.35 | 5077.60 | 1701.65 | 4,537,075.99 |
|  | MinMSE | 2.90 | 3.03 | 2.86 | 6.90 | 3.35 | 7.06 | 4.35 | 4.18 |
|  | FinalMSE | 2.96 | 3.03 | 2.86 | 9.60 | 3.58 | 7.86 | 4.98 | 8.80 |
| DAT (Ours) | MaxMSE | 24.71 | 20.33 | 20.90 | 95.44 | 15.12 | 25.30 | 33.63 | 930.20 |
|  | MinMSE | 2.94 | 2.95 | 3.25 | 2.89 | 2.91 | 3.07 | 3.00 | 0.02 |
|  | FinalMSE | 2.94 | 2.95 | 3.42 | 2.89 | 2.91 | 3.07 | 3.03 | 0.04 |