Article

An Invertible, Robust Steganography Network Based on Mamba

1 School of Computer and Electronic Information, Guangxi University, Nanning 530004, China
2 China-ASEAN School of Economics, Guangxi University, Nanning 530004, China
3 China-ASEAN Collaborative Innovation Center for Regional Development, Guangxi University, Nanning 530004, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(6), 837; https://doi.org/10.3390/sym17060837
Submission received: 2 April 2025 / Revised: 15 May 2025 / Accepted: 21 May 2025 / Published: 27 May 2025
(This article belongs to the Section Computer)

Abstract

Image steganography is a research field that focuses on covert storage and transmission technologies. However, current image hiding methods based on invertible neural networks (INNs) have limitations in extracting image features. Additionally, after experiencing the complex noise environment in the actual transmission channel, the quality of the recovered secret image drops significantly. The robustness of image steganography remains to be enhanced. To address the above challenges, this paper proposes a Mamba-based Robust Invertible Network (MRIN). Firstly, in order to fully utilize the global features of the image and improve the image quality, we adopted an innovative affine module, VMamba. Additionally, to enhance the robustness against joint attacks, we introduced an innovative multimodal adversarial training strategy, integrating fidelity constraints, adversarial games, and noise resistance into a composite optimization framework. Finally, our method achieved a maximum PSNR value of 50.29 dB and an SSIM value of 0.996 on multiple datasets (DIV2K, COCO, ImageNet). The PSNR of the recovered image under resolution scaling (0.5×) was 31.6 dB, which was 11.3% higher than with other methods. These results show that our method outperforms other current state-of-the-art (SOTA) image steganography techniques in terms of robustness on different datasets.

1. Introduction

Image steganography involves meticulously adjusting image data to embed secret information within it while minimizing perceptual or statistical artifacts that could lead to detection by steganalysis methods [1]. During the image steganography process, the original image serves as the cover image, which a steganographic algorithm processes to generate a stego image containing the hidden information. The recipient can then extract the secret information using a specific decoding method. The basic process of image steganography is shown in Figure 1.
In 2017, Baluja [2] proposed a deep neural network-based image-in-image steganography framework. First, a preprocessing network was used to process the secret image, modifying it to be the same size as the cover image, altering colour-related pixels, and extracting features. Then, the secret image was embedded into the cover image using an encoder and extracted from the stego image using a decoder. This method achieves steganography with images of the same size. In recent years, with the continuous advancement of deep learning technology, researchers have proposed various deep learning-based image steganography methods to enhance information-hiding capabilities and improve resistance to steganalysis detection. These methods can be broadly categorized into three main types: algorithms based on generative adversarial networks (GANs) [3,4,5,6], algorithms based on convolutional neural networks (CNNs) [2,7,8,9], and image steganography algorithms based on invertible neural networks (INNs) [10,11,12,13].
Lu et al. [10] were the first to apply INNs [14] to the field of image steganography. They developed a high-capacity invertible network using a bijective transformation model, employing the same network's forward and backward propagation operations to embed and extract secret images, significantly enhancing the payload capacity of steganography. However, this method is susceptible to JPEG compression [15], which degrades the quality of the recovered images.
Since INNs transform the embedding and extraction of information into a bidirectional mapping in a reversible feature space, they can theoretically achieve complete extraction of the secret image. Therefore, many researchers [16] have continued to study the application of invertible neural networks in image steganography. However, these methods struggle to fully exploit images' global features, the embedding of secret information is not natural enough, and robustness against complex noise environments needs improvement, especially under joint attacks, where performance is poor. To address these issues, our proposed method aims to enhance the feature representation of images in complex backgrounds and to strengthen the robustness of steganography against various typical attack scenarios through an innovative multimodal adversarial training strategy.
In response to the challenges mentioned above, this paper’s main contributions are as follows:
  • We propose an invertible steganography network with stronger robustness (MRIN).
  • We propose using a selective state space model (SSM) for efficient global feature modeling. By modeling with the differential equations of linear time-invariant systems, we maintain a global receptive field while reducing computational complexity to $O(N)$, significantly enhancing feature representation capability while ensuring lossless information transmission.
  • We propose a dynamic adversarial noise training strategy, introducing a composite optimization framework that integrates fidelity constraints, adversarial games, and noise immunity, achieving dynamic balance in the steganography system through multimodal joint training.
  • We introduce the Wasserstein GAN framework to establish a dynamic adversarial game, where the generator embeds secret information into the cover image through reversible transformations and the discriminator uses a multi-level wavelet convolution structure. When the game reaches Nash equilibrium, the distribution of the stego images coincides with that of natural images, achieving statistical invisibility.
The paper is organized as follows. Section 2 introduces the current research status of image steganography, as well as the technical foundations of invertible neural networks and the Mamba architecture. Section 3 provides a detailed exposition of our proposed novel image steganography method—MRIN—which includes the Invertible Feature Mapping Mechanism, the Imitation VMamba Module, and the Compound Loss Optimization for Multimodal Adversarial Training. In Section 4, the datasets used in the experiments and detailed descriptions of the algorithm implementation are outlined, and the main evaluation indicators of image steganography are presented. Section 5 shows the results of the seven experiments through charts and graphs, demonstrating the improvement of our proposed method on a number of evaluation metrics through comparative tests with other state-of-the-art methods. Finally, Section 6 concludes with a summary and an outlook on potential future research directions.

2. Related Works

2.1. Image Steganography

Traditional image steganography methods are commonly categorized into spatial domain steganography and transform domain steganography [17]. Spatial domain steganography methods, such as the Least Significant Bit (LSB) replacement method [18], directly modify pixel values to hide information. Although these methods provide a relatively high steganographic capacity, they are highly susceptible to attacks. On the other hand, transform domain steganography relies on methods like discrete cosine transform (DCT) [19] and discrete wavelet transform (DWT) [20], embedding information by adjusting the frequency coefficients.
In recent years, with the rapid development of deep learning technology, its powerful feature learning capabilities and end-to-end optimization abilities have provided new ideas for designing image steganography methods. Image steganography based on deep learning has gradually become an emerging research direction. Volkhonskiy [21] introduced generative adversarial networks (GANs) [22] into image steganography, proposing SGANs (steganographic generative adversarial networks). This method uses a ±1 embedding algorithm to embed random bit information of a secret image into a cover image. He designed two adversarial networks: one discriminator network to regularize the output, ensuring that the cover image is a sample from a real dataset, and one steganalyzer aimed at detecting whether the cover image contains a hidden secret image. However, the visual quality of the images generated by this network is poor, making it unsuitable for secure communication.
As researchers continued to explore image steganography in greater depth, Tang et al. [23] proposed ASDL-GAN, an automatic steganographic distortion learning framework based on a GAN. In the ASDL-GAN framework, the discriminator network employs the deep learning-based steganalysis model XuNet [24] to simulate the adversarial process between additive-distortion steganography and XuNet. However, this method uses the non-differentiable activation function TES (Ternary Embedding Simulator), which reduces training efficiency and increases time overhead. Wu [25] adopted a specially designed separable convolution architecture, combining Highway Network, ResNet, and ResNeXt. During backpropagation, the gradient is preserved in the earlier layers through skip connections [26], alleviating the gradient vanishing issue. Some researchers have improved image visual quality by modifying network structures. Duan et al. [27] applied the U-Net architecture to image steganography. Zhu et al. introduced noise layers in HiDDeN [9], simulating various real-world signal interferences (such as JPEG compression, cropping, and noise addition) during training, which enhanced the model's robustness to image processing operations. Additionally, by combining adversarial training, RIIS [12] is currently the optimal robust steganography method; it is designed with a container enhancement module and uses Distortion-Guided Modulation (DGM) to adjust network parameters to accommodate different levels of distortion, enabling the model to maintain robust performance in various distortion environments. However, the decoder of RIIS relies on the interference intensity coefficient to adjust its decoding strategy.
The attention mechanism can focus on task-relevant features and is thus beneficial for image steganography. Zhou et al. [28] proposed a channel attention transformer model for image steganography, which aims to construct long-range dependencies to identify unobtrusive locations to embed data. However, their scheme does not focus on the noise attack in the communication channel, and the robustness of steganography still needs to be improved. At the same time, the discrimination sub-network is too deep and the network is less efficient. To improve the security of the information embedding process, Xiao et al. [29] proposed a Transformer-based adversarial network watermark steganography framework, which introduced a WGAN [30] discriminator to cyclically adjust the steganographic image generator.
The cover-based modification steganography method embeds secret information into a cover image, but the modification traces inevitably remain in the cover image; in order to fundamentally solve this problem, researchers have proposed novel coverless steganography. In recent years, diffusion models have been widely used in many image fields due to their excellent generative ability. The strong control generation capability of conditional diffusion models makes stego images highly controllable. Guo et al. [31] combined an innovative mask extraction model, a conditional diffusion model, and deterministic DDIM technology. Hu et al. [32] encrypted the secret message as an initial noise input to the diffusion model to generate a stego image. Peng et al. [33] exploited the forward noise addition and reverse denoising process of DDPM. In this process, the secret information is hidden in the generated image, realizing the transformation from an image to noise and then to an image.

2.2. Invertible Neural Networks

Invertible neural networks (INNs) [14] have a symmetrical structure: the forward process maps the input data to the output, and the reverse process recovers the original input from that output, so no information is lost in the process. Dinh et al. [14] proposed the NICE (Non-linear Independent Components Estimation) model, which uses a coupling-layer structure to achieve this symmetry; each coupling layer splits its input into two parts, $x_1$ and $x_2$, and the NICE forward process produces outputs $y_1$ and $y_2$, which are concatenated to form the output $y$. To enable invertible neural networks to handle image-related tasks better, Dinh [34] introduced convolutional layers and proposed a more general affine coupling layer, Real NVP (Real-valued Non-Volume Preserving).
Currently, image steganography based on invertible neural networks has received extensive research attention. Jing et al. [13] proposed a new framework based on INN. Lu et al. [10] developed a large-capacity invertible network that uses the same network’s forward and backward propagation operations to respectively embed and extract secret images, significantly improving the payload capacity of steganography. However, further improvements are needed to balance image quality and security. Yang et al. [35] set up an image enhancement module at the beginning and end of the secret image extraction process. This approach reduces interference at both ends of the extraction process and enhances robustness. However, it causes significant distortion to the secret image, which can be problematic in some practical scenarios where the secret image is needed.

2.3. Mamba Architecture and the SSM

The advantages of the attention mechanism in image steganography are mainly reflected in its ability to focus on key areas of the image. By doing so, it can more accurately hide secret information and simultaneously enhance the imperceptibility of the hidden information. The attention mechanism has the ability to dynamically adjust the focus based on the image content. This enables the hidden information to blend better with the natural structure of the image, thus enhancing the quality of the stego image. Moreover, the attention mechanism can capture long-range dependencies within the image, aiding in more effectively embedding and extracting information in complex image backgrounds, thereby enhancing the robustness and security of steganographic methods. However, the traditional attention mechanism—the Transformer architecture [36]—has limitations in processing long sequences, as its self-attention mechanism has a computational complexity of $O(N^2)$, which leads to significant memory and computational bottlenecks when handling high-resolution images.
Mamba [37] is an advanced state space model (SSM) proposed by Albert Gu and Tri Dao, among others, aimed at efficiently handling complex, data-intensive sequences. It models the system using differential equations of linear time-invariant systems, incorporating selective state space, allowing the model to selectively propagate or suppress information based on each step’s input. The model enables the processing of sequences linearly according to their length, addressing the high computational complexity issue of traditional Transformers when handling long sequences. This provides a new theoretical framework for modeling global dependencies in high-resolution images.
Researchers have extended the application of the Mamba architecture to the field of computer vision. Liu et al. [38] adapted the state space model Mamba from the language domain into the visual backbone network VMamba, which has a linear time complexity. At the same time, a two-dimensional selective scan (SS2D) was introduced. This SS2D connects one-dimensional array scanning with two-dimensional plane traversal, enabling the state space model to be effectively applied to the processing of visual data. It divides the image into multi-directional scanning paths to simulate the spatial continuity of the image. State transfer is dynamically adjusted in 2D scanning through the selective mechanism of the SSM. VMamba has performed excellently in various visual tasks, such as image classification, object detection, and semantic segmentation.
Therefore, this paper is dedicated to studying the use of Mamba in image steganography tasks and proposes an invertible robust steganography network based on the Mamba architecture.

3. Proposed Method

This section proposes an invertible steganography network with stronger robustness (MRIN).

3.1. Invertible Feature Mapping Mechanism

This method constructs a bidirectional mapping system through an invertible neural network (INN) [14], modeling the secret image embedding and extraction processes as strictly reversible mathematical transformations. Given the cover image $I_{co} \in \mathbb{R}^{H \times W \times 3}$, it is decomposed by the discrete wavelet transform (DWT) into frequency-domain features $F_c \in \mathbb{R}^{H/2 \times W/2 \times 4C}$, where the low-frequency component $F_c^{LL}$ carries the main visual information and the high-frequency components $\{F_c^{LH}, F_c^{HL}, F_c^{HH}\}$ are used to hide secret information. The secret image $S_{sec} \in \mathbb{R}^{L \times L \times 3}$ is expanded by the preprocessing module into a feature map $F_{so} \in \mathbb{R}^{H/2 \times W/2 \times C_s}$, where the number of channels $C_s$ is adjusted to match the dimension of the cover image's frequency-domain features.
Figure 2 illustrates the Invertible Watermarking Module (IWM) process. As shown, DWT decomposes the cover image into multi-level Integration Features. At the same time, the secret image is converted into a residual flow by the encoder.
Specifically, the forward embedding process of the IWM is defined by the following formula:
$$F_e^i = F_e^{i-1} \odot \exp\!\left(\sigma(\phi(R^{i-1}))\right) + \varphi(R^{i-1}), \qquad R^i = R^{i-1} \odot \exp\!\left(\sigma(\eta(F_e^i))\right) + \rho(F_e^i)$$
The physical significance and mathematical properties of the key symbols in the equation are as follows:
The integration feature $F_e^i \in \mathbb{R}^{H/2 \times W/2 \times 4C}$ represents the cover–secret composite feature output by the $i$-th layer, whose three-dimensional tensor structure corresponds to the frequency-band decomposition in the wavelet domain. The significance of its specific components is shown in Table 1.
The residual flow $R^i \in \mathbb{R}^{H/2 \times W/2 \times C_r}$ encodes the multi-scale abstract features of the secret image. Its dynamic update mechanism is shown in Table 2. The residual flow can be regarded as the "memory carrier" of the secret information during the invertible transformation process.
The affine transformation group $\{\phi, \varphi, \eta, \rho\}$ constitutes a parameterized function family implemented by a learnable neural network architecture. The mathematical role of each function in the transformation equation and the details of the neural network implementation are presented in Table 3. Specifically, $\phi$ generates, via $\exp(\sigma(\cdot))$, the multiplicative terms that determine the degree of retention or suppression in each dimension of $F_e$, while $\varphi$ introduces conditional biases through additive terms, effectively compensating for the linear features not captured by $\phi$.
Each group of functions contains 2.1 M trainable parameters, accounting for 38.6% of the total model size.
The modulation function $\sigma(\cdot)$ is a specialized variant of the Sigmoid activation function, defined as $\sigma(x) = (1 + e^{-x})^{-1}$. This function compresses the results of the affine transformations into the interval $(0, 1)$, which ensures that the feature scaling coefficient satisfies $\exp(\sigma(\cdot)) \in (1, e)$, mitigating the risk of gradient explosion, and confines the information injection amount $\varphi(\cdot) \in (-1, 1)$ to limit the modification range. Importantly, the output values of $\varphi(\cdot)$ possess interpretability, as they represent pixel-level embedding confidence correlated with visual saliency.
This invertible architecture guarantees lossless information transmission through the strict mathematical constraint $\det(J) = 1$. The block-diagonal structure of the Jacobian matrix $J$ reduces the computational complexity of the forward/inverse processes to $O(N)$.
The geometric interpretation of this transformation is as follows: the exponential term $\exp(\sigma(\phi(R^{i-1})))$ generates a spatially adaptive mask, redistributing the energy of the cover's high-frequency components, while the additive term $\varphi(R^{i-1})$ injects the features of the secret image into human-eye-insensitive areas in the form of noise. Through $N$ layers of chained transformations, the secret information is encoded into the high-frequency sub-bands.
The reverse extraction process (Figure 3) is achieved through the inverse transformation of shared parameters:
$$F_{se}^{i-1} = \left(F_{se}^{i} - \rho(F_b^{i})\right) \odot \exp\!\left(-\sigma(\eta(F_b^{i}))\right), \qquad F_b^{i-1} = \left(F_b^{i} - \varphi(F_{se}^{i-1})\right) \odot \exp\!\left(-\sigma(\phi(F_{se}^{i-1}))\right)$$
where $F_b^i \in \mathbb{R}^{H/2 \times W/2 \times 4C}$ is the backpropagation state of the degraded features. The initial value $F_b^N = \mathrm{DWT}(I_{no})$ is obtained from the frequency-domain decomposition of the noisy cover. The auxiliary variable $Z \sim \mathcal{N}(0, I)$ is used as the initial value of $F_{se}^N$, compensating for information loss during transmission through random perturbation.
The strict mathematical symmetry is guaranteed by
$$\mathrm{IWM}^{-1}\!\left(\mathrm{IWM}(F_c, F_{so})\right) \equiv (F_c, F_{so})$$
Proof. 
Let the $i$-th level transformation of the forward process be $T_i: (F_e^{i-1}, R^{i-1}) \mapsto (F_e^i, R^i)$, with the mathematical expression
$$F_e^i = F_e^{i-1} \odot \exp\!\left(\sigma(\phi(R^{i-1}))\right) + \varphi(R^{i-1})$$
$$R^i = R^{i-1} \odot \exp\!\left(\sigma(\eta(F_e^i))\right) + \rho(F_e^i)$$
The corresponding inverse transformation $T_i^{-1}: (F_e^i, R^i) \mapsto (F_e^{i-1}, R^{i-1})$ can be constructed as
$$R^{i-1} = \left(R^i - \rho(F_e^i)\right) \odot \exp\!\left(-\sigma(\eta(F_e^i))\right)$$
$$F_e^{i-1} = \left(F_e^i - \varphi(R^{i-1})\right) \odot \exp\!\left(-\sigma(\phi(R^{i-1}))\right)$$
Symmetry is proved by mathematical induction:
1. Base case ($i = 1$): apply the inverse transformation to the first-level forward output $(F_e^1, R^1)$:
$$R^0 = \left(R^1 - \rho(F_e^1)\right) \odot \exp\!\left(-\sigma(\eta(F_e^1))\right)$$
$$F_e^0 = \left(F_e^1 - \varphi(R^0)\right) \odot \exp\!\left(-\sigma(\phi(R^0))\right)$$
Substituting the forward equations gives
$$R^0 = \left[\left(R^0 \odot s_1 + t_1\right) - t_1\right] \odot s_1^{-1} = R^0$$
$$F_e^0 = \left[\left(F_e^0 \odot s_0 + t_0\right) - t_0\right] \odot s_0^{-1} = F_e^0$$
where $s_0 = \exp(\sigma(\phi(R^0)))$, $t_0 = \varphi(R^0)$, and analogously $s_1 = \exp(\sigma(\eta(F_e^1)))$, $t_1 = \rho(F_e^1)$, verifying that the base case holds.
2. Inductive hypothesis: assume that the $k$-th layer satisfies $T_k^{-1} \circ T_k = \mathrm{Id}$.
3. Inductive step: for the $(k+1)$-th layer, the forward output is $(F_e^{k+1}, R^{k+1})$, and the inverse transformation is applied:
$$R^k = \left(R^{k+1} - \rho(F_e^{k+1})\right) \odot \exp\!\left(-\sigma(\eta(F_e^{k+1}))\right)$$
$$F_e^k = \left(F_e^{k+1} - \varphi(R^k)\right) \odot \exp\!\left(-\sigma(\phi(R^k))\right)$$
Replacing $R^{k+1}$ and $F_e^{k+1}$ according to the forward equations gives
$$R^k = R^k \odot \exp\!\left(\sigma(\eta(F_e^{k+1}))\right) \odot \exp\!\left(-\sigma(\eta(F_e^{k+1}))\right) = R^k$$
$$F_e^k = F_e^k \odot \exp\!\left(\sigma(\phi(R^k))\right) \odot \exp\!\left(-\sigma(\phi(R^k))\right) = F_e^k$$
By the induction hypothesis, it follows that after $N$ inverse transformations we must have $(F_b^0, F_{se}^0) = (F_c, F_{so})$.
In addition, the Jacobian determinant $\det(J)$ is computed as follows. The Jacobian matrix of each transformation layer $T_i$ has a block triangular structure:
$$J_{T_i} = \begin{bmatrix} \dfrac{\partial F_e^i}{\partial F_e^{i-1}} & \dfrac{\partial F_e^i}{\partial R^{i-1}} \\[6pt] 0 & \dfrac{\partial R^i}{\partial R^{i-1}} \end{bmatrix}$$
Its determinant therefore factorizes as
$$\det(J_{T_i}) = \det\!\left(\frac{\partial F_e^i}{\partial F_e^{i-1}}\right) \cdot \det\!\left(\frac{\partial R^i}{\partial R^{i-1}}\right)$$
Since the scaling operation $\exp(\sigma(\cdot))$ is always positive and the design constrains the absolute value of the determinant of each transformation layer to 1, the overall transform satisfies $\det(J) = \prod_{i=1}^{N} \det(J_{T_i}) = 1$, thus guaranteeing bijectivity.
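To make the coupled forward and inverse updates concrete, the following is a minimal PyTorch sketch of a single coupling step of this kind; the sub-networks standing in for the affine transformation group $\{\phi, \varphi, \eta, \rho\}$ are simple convolution stacks chosen only for illustration, not the architecture used in the paper.

```python
import torch
import torch.nn as nn

class CouplingStep(nn.Module):
    """One invertible affine coupling step (sketch of the IWM-style update).
    phi/varphi/eta/rho are illustrative stand-ins for the learnable
    affine transformation group (here: small conv stacks)."""
    def __init__(self, c_e, c_r, hidden=32):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, hidden, 3, padding=1), nn.ReLU(),
                nn.Conv2d(hidden, c_out, 3, padding=1))
        self.phi, self.varphi = block(c_r, c_e), block(c_r, c_e)
        self.eta, self.rho = block(c_e, c_r), block(c_e, c_r)

    def forward(self, f_e, r):
        # F_e^i = F_e^{i-1} * exp(sigma(phi(R^{i-1}))) + varphi(R^{i-1})
        f_e = f_e * torch.exp(torch.sigmoid(self.phi(r))) + self.varphi(r)
        # R^i = R^{i-1} * exp(sigma(eta(F_e^i))) + rho(F_e^i)
        r = r * torch.exp(torch.sigmoid(self.eta(f_e))) + self.rho(f_e)
        return f_e, r

    def inverse(self, f_e, r):
        # Undo the forward update exactly, reusing the shared parameters.
        r = (r - self.rho(f_e)) * torch.exp(-torch.sigmoid(self.eta(f_e)))
        f_e = (f_e - self.varphi(r)) * torch.exp(-torch.sigmoid(self.phi(r)))
        return f_e, r

# Round-trip check: inverse(forward(x)) recovers x up to floating-point error.
step = CouplingStep(c_e=12, c_r=12)
f_e0, r0 = torch.randn(1, 12, 64, 64), torch.randn(1, 12, 64, 64)
f_e1, r1 = step(f_e0, r0)
f_e_rec, r_rec = step.inverse(f_e1, r1)
print(torch.allclose(f_e_rec, f_e0, atol=1e-4), torch.allclose(r_rec, r0, atol=1e-4))
```

Because the inverse reuses the forward parameters, the round-trip error is limited to floating-point precision, which is the practical counterpart of the bijectivity argument above.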
The secret image is processed by the encoder to generate residuals, which, together with the integrated features of the cover image, are fed into the IWM. Subsequently, the stego image goes through the IWM extraction module and is then inputted into the decoder, resulting in the recovered secret image being obtained. The detailed architectures of the encoder and decoder in these two stages are illustrated in Figure 4.
Specifically, the encoder is responsible for feature extraction and for transforming the input information into an intermediate representation space. The encoder first performs an initial convolution on the input secret image $I_{se} \in \mathbb{R}^{H \times W \times C}$, generating a shallow feature map $F_0 = \mathrm{Conv}(I_{se})$; the spatial resolution remains unchanged and the number of channels is expanded to $D$. The feature map is then processed by the VMamba module, which employs the SSM to capture long-range dependencies, yielding $F_{vmamba1} = \mathrm{VMamba}(F_0)$ with resolution $H \times W \times D$. Next, a downsampling operation with a stride of 2 reduces the spatial dimensions of $F_{vmamba1}$, producing $F_{down1} \in \mathbb{R}^{H/2 \times W/2 \times 2D}$, while the number of channels is doubled to offset the information loss. After that, $F_{down1}$ passes through a second convolution and the VMamba module in sequence, yielding $F_{conv2} = \mathrm{Conv}(F_{down1})$ and $F_{vmamba2} = \mathrm{VMamba}(F_{conv2})$, further fusing global and local features. To construct the residual connection, $F_{vmamba2}$ passes through a convolutional layer, is processed by the VMamba module, and is adjusted to a spatial size of $H/2 \times W/2 \times 2D$ by the Resize operation. Finally, it is added to the downsampled feature $F_{down1}$ to form the residual output $F_{resize}$.
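This data flow can be summarized in the schematic sketch below; the VMamba blocks are placeholders (identity modules) for the SS2D-based module introduced in Section 3.2, and the channel width $D$ is illustrative.

```python
import torch
import torch.nn as nn

class EncoderSketch(nn.Module):
    """Schematic of the encoder data flow described above. The VMamba blocks
    are placeholders (nn.Identity) for the SS2D-based module of Section 3.2."""
    def __init__(self, c_in=3, d=32):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, d, 3, padding=1)             # F0: H x W x D
        self.vmamba1 = nn.Identity()                               # placeholder VMamba
        self.down = nn.Conv2d(d, 2 * d, 3, stride=2, padding=1)   # H/2 x W/2 x 2D
        self.conv2 = nn.Conv2d(2 * d, 2 * d, 3, padding=1)
        self.vmamba2 = nn.Identity()                               # placeholder VMamba
        self.conv3 = nn.Conv2d(2 * d, 2 * d, 3, padding=1)
        self.vmamba3 = nn.Identity()                               # placeholder VMamba

    def forward(self, secret):
        f0 = self.conv1(secret)
        f_vm1 = self.vmamba1(f0)                  # long-range context via the SSM
        f_down1 = self.down(f_vm1)                # stride-2 downsampling, channels doubled
        f_vm2 = self.vmamba2(self.conv2(f_down1))
        f_res = self.vmamba3(self.conv3(f_vm2))   # conv -> VMamba before the residual add
        return f_res + f_down1                    # residual output F_resize

x = torch.randn(1, 3, 64, 64)
print(EncoderSketch()(x).shape)  # torch.Size([1, 64, 32, 32])
```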
In contrast, the decoder takes on the opposite task. It aims to minimize the discrepancy between its output and the original input, leveraging the intermediate representation generated by the encoder.
The encoder and decoder are designed to follow the principle of symmetry: after the encoder processes the information in the forward direction, the decoder's symmetric structure allows it to accurately retrace the encoder's operation path and convert the intermediate representation back into the original information form, substantially reducing information loss during transmission and processing. Consequently, this design ensures the stability and reliability of the entire system.
The VMamba module is specifically introduced in the subsequent sections.

3.2. Imitation VMamba Module

Traditional convolutional neural network (CNN)-based affine modules are limited by local receptive fields and static weight parameters. This limitation makes it challenging to model long-range dependencies in images. On the other hand, Transformer-based implementations can capture the global context through a self-attention mechanism. However, they have a computational complexity that grows quadratically with sequence length ($O(N^2)$). As a result, their efficiency significantly decreases when processing high-resolution images. To address this issue, the selective state space model (SSM) was introduced to construct an efficient affine transformation module. Figure 5 shows that an SSM is a model for describing state representations and predicting the next state based on certain inputs. In this flowchart, matrix D serves to establish a direct signal pathway from the input to the output. Since its function is similar to that of a skip connection, in the situation where there is no skip connection, the part enclosed by the red dotted line in the figure is generally regarded as the simplified SSM.
Generally, an SSM includes the following components:
- The input sequence mapping $x(t)$, where $t$ refers to any given time.
- The latent state representation $h(t)$.
- The predicted output sequence $y(t)$.
The simplified SSM is described by the following equations. The first is the state equation (a differential equation), used to calculate the next latent state representation; the second is the output equation, which describes how the state is transformed into the output (through matrix C):
$$h'(t) = A h(t) + B x(t), \qquad y(t) = C h(t)$$
where $A \in \mathbb{R}^{C \times C}$ is the state transition matrix and $B, C \in \mathbb{R}^{C}$ are the projection parameters.
The primary representation of an SSM is the continuous-time representation. However, when dealing with discrete inputs (such as text sequences), Mamba employs the zero-order hold (ZOH) technique to handle discretized data:
$$\bar{A}_t = \exp(\Delta_t A), \qquad \bar{B}_t = (\Delta_t A)^{-1}\left(\exp(\Delta_t A) - I\right)\Delta_t B, \qquad \Delta_t = \mathrm{Softplus}(W_\delta x_t + b_\delta)$$
where the time step $\Delta_t \in \mathbb{R}^{+}$ is produced by a linear projection with Softplus activation, making it input-dependent and forming the selective mechanism. The matrices $\bar{A}_t$ and $\bar{B}_t$ now represent the discrete parameters of the model. Owing to the element-wise computation of the diagonalized state transition matrix $\bar{A}_t$ (complexity $O(N)$) and the scalar nature of the time step $\Delta_t$, the discretization avoids the overhead of traditional matrix multiplication (complexity $O(N^2)$). The discrete SSM is reformulated over discrete time steps:
$$h_k = \bar{A} h_{k-1} + \bar{B} x_k, \qquad y_k = C h_k$$
The matrix A constructs new states by capturing information from previous states, that is, matrix A generates hidden states. To address the long-range dependency problem, Mamba employs the High-order Polynomial Projection Operator (HiPPO) [39]. HiPPO aims to compress all input signals it has encountered so far into a vector of coefficients. This operation normalizes the parameter matrix A into a diagonal structure.
However, regardless of the sequence input to the SSM, the values of A, B, and C remain unchanged. This makes it impossible for the SSM to perform targeted reasoning for specific inputs. To address this issue, Mamba processes different inputs differently based on parameters learned during the training phase. The dynamic parameter generation process is as follows:
$$(B_t, C_t, \Delta_t) = \mathrm{Linear}(x_t) \in \mathbb{R}^{2C}$$
where a linear transformation layer maps the input x to a higher-dimensional representation, from which different parts are split to serve as the values of B, C, and Δ . Dynamic B and C make the SSM selective, allowing the model to control whether to let the input enter the state h or let the state enter the output y. The dynamic Δ can control the attention to the current input. When Δ is large, the model tends to focus on the current input and ignore previous information; when Δ is small, the model tends to retain more historical information.
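As a concrete illustration, the following sketch evaluates the selective SSM recurrence in its sequential form, with the ZOH-style discretization and input-dependent $B_t$, $C_t$, $\Delta_t$. Parameter names are illustrative, $\bar{B}_t$ is simplified to the first-order approximation $\Delta_t B_t$, and Mamba itself replaces the Python loop with a hardware-aware parallel scan.

```python
import torch
import torch.nn.functional as F

def selective_ssm(x, A, W_B, W_C, W_delta, b_delta):
    """Sequential (recurrent) form of a selective SSM (sketch).
    x:        (L, C) input sequence
    A:        (C, N) diagonal state matrix per channel (typically negative)
    W_B, W_C: (N, C) projections making B_t, C_t input-dependent
    W_delta:  (C, C), b_delta: (C,) projection producing the step size Delta_t
    Returns y: (L, C)."""
    L, C = x.shape
    N = A.shape[1]
    h = torch.zeros(C, N)
    ys = []
    for t in range(L):
        xt = x[t]
        delta = F.softplus(W_delta @ xt + b_delta)   # (C,) input-dependent step
        B_t, C_t = W_B @ xt, W_C @ xt                # (N,), (N,)
        A_bar = torch.exp(delta[:, None] * A)        # ZOH: exp(Delta * A)
        B_bar = delta[:, None] * B_t[None, :]        # simplified (first-order) B_bar
        h = A_bar * h + B_bar * xt[:, None]          # h_k = A_bar h_{k-1} + B_bar x_k
        ys.append((h * C_t[None, :]).sum(-1))        # y_k = C_t h_k
    return torch.stack(ys)

# Toy usage with random parameters.
L_seq, C, N = 16, 4, 8
y = selective_ssm(torch.randn(L_seq, C), -torch.rand(C, N),
                  torch.randn(N, C), torch.randn(N, C),
                  torch.randn(C, C), torch.zeros(C))
print(y.shape)  # torch.Size([16, 4])
```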
If trained in an RNN pattern, efficiency would be significantly reduced. Therefore, Mamba employs an innovative, convolution-free parallel training approach. This training strategy includes kernel fusion, parallel scan, and recomputation.
The core algorithm of Mamba is designed for processing one-dimensional sequential data. When applying it to images, a two-dimensional selective scan (SS2D) mechanism is required.
First, the input image is divided into patches to generate a 2D feature map with spatial dimensions of H / 4 × W / 4 . Subsequently, multiple network stages are used to partition the 2D feature map into hierarchical representations with resolutions of H / 8 × W / 8 , H / 16 × W / 16 , and H / 32 × W / 32 . Each stage includes a downsampling layer, followed by a stack of SS2D blocks.
The cross-scan transform (as shown in Figure 6) is then designed for the 2D image feature $F_i \in \mathbb{R}^{H \times W \times C}$:
$$\tilde{F}_i = \Gamma(F_i) \in \mathbb{R}^{(4HW) \times C}$$
where $\Gamma(\cdot)$ expands the feature map into four scanning directions along the spatial dimension. The four paths traverse the 2D matrix from the top-left corner to the right, from the top-left corner downward, from the bottom-right corner to the left, and from the bottom-right corner upward, respectively. Next, individual Mamba modules process each patch sequence in parallel. Finally, the generated sequences are reshaped and merged to form the output map.
The computational cost grows only linearly with the sequence length $4HW$ (complexity $O(4HW)$), rather than having the quadratic complexity of traditional 2D convolution.
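The cross-scan can be sketched as follows; `scan_1d` stands in for the 1-D selective scan (e.g., the sketch above), and the merge step here is a simple average of the four directional outputs.

```python
import torch

def cross_scan_2d(feat, scan_1d):
    """SS2D-style cross scan (sketch). feat: (H, W, C).
    Unfolds the feature map into four 1-D sequences (row-major, column-major,
    and their reverses), runs each through `scan_1d`, re-aligns, and merges."""
    H, W, C = feat.shape
    row = feat.reshape(H * W, C)                    # top-left -> right
    col = feat.permute(1, 0, 2).reshape(H * W, C)   # top-left -> down
    seqs = [row, col, row.flip(0), col.flip(0)]     # plus the two reversed paths
    outs = [scan_1d(s) for s in seqs]
    outs[2], outs[3] = outs[2].flip(0), outs[3].flip(0)   # undo the reversals
    for i in (1, 3):                                      # undo column-major ordering
        outs[i] = outs[i].reshape(W, H, C).permute(1, 0, 2).reshape(H * W, C)
    return torch.stack(outs).mean(0).reshape(H, W, C)     # merge four directions

# Toy usage: a causal running mean stands in for the SSM scan.
running_mean = lambda s: s.cumsum(0) / torch.arange(1, s.shape[0] + 1).view(-1, 1)
out = cross_scan_2d(torch.randn(8, 8, 4), running_mean)
print(out.shape)  # torch.Size([8, 8, 4])
```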
The computational complexity analysis shows that, for a feature map of size $H \times W$, the traditional Transformer attention complexity is $O((HW)^2)$, while the SSM version is reduced to $O(HW)$. The specific computational comparison is
$$\mathrm{FLOPs}_{\mathrm{Trans}} = 4HWC^2 + 2(HW)^2 C, \qquad \mathrm{FLOPs}_{\mathrm{SSM}} = 4HWC^2 + 6HWC$$
Transformer explicitly establishes a global dependency by calculating the similarity of all pairs of positions. The core innovation of Mamba is the replacement of explicit attention with implicit state memory and implementing adaptive long-range modeling through input-driven dynamic parameters.
The VMamba module can enhance the feature extraction of secret images. It filters out redundant information through a dynamic input selection mechanism. By doing so, it retains local texture features (e.g., boundary anomalies, statistical distribution perturbations, etc.), which are highly relevant to the steganography task. Meanwhile, it suppresses the noise interference from the natural image content. Additionally, it replaces traditional convolutional computation with hardware-aware parallel scanning, significantly boosting the training efficiency.

3.3. Compound Loss Optimization for Adversarial Training

A central challenge in steganography is achieving an optimal trade-off between imperceptibility, reconstruction accuracy, and robustness [40]. Traditional methods usually optimize each objective independently and cannot mount a collaborative defence against complex attack scenarios. Inspired by the multi-sensory integration mechanism in brain science, this paper proposes a composite optimization framework that integrates fidelity constraints, adversarial games, and noise immunity, achieving dynamic equilibrium of the steganographic system through joint multimodal training.
The underlying optimization objective contains three constraints:
The spatial-domain reconstruction error $L_i$ ensures pixel-level similarity between the stego image and the original cover through the mean-square error, defined as
$$L_i = \frac{1}{HW} \sum_{x=1}^{H} \sum_{y=1}^{W} \left\| I_{co}(x,y) - I_{en}(x,y) \right\|_2^2$$
The $\ell_2$-norm constraint effectively suppresses high-frequency artifacts.
The bit-level cross entropy $L_w$ then forces the recovered secret information to align with the original data through the Hamming distance:
$$L_w = \frac{1}{L} \sum_{k=1}^{L} \left\| S_0(k) - S_e(k) \right\|_{\mathrm{Hamming}}$$
Meanwhile, the wavelet low-frequency sub-band $\ell_1$ loss $L_q$ constrains the frequency-domain energy distribution:
$$L_q = \frac{1}{(H/2)(W/2)} \sum_{i=1}^{H/2} \sum_{j=1}^{W/2} \left\| F_c^{LL}(i,j) - F_e^{LL}(i,j) \right\|_1$$
These three terms constitute the basic fidelity conditions of the steganographic system, which maintain the stability of the embedding process from the spatial, bit, and frequency domains, respectively.
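A compact sketch of these three fidelity terms is given below; `dwt_ll` is assumed to return the LL sub-band of a one-level DWT (a wavelet library would provide this), and the secret is represented as a bit tensor.

```python
import torch
import torch.nn.functional as F

def fidelity_losses(cover, stego, secret_bits, recovered_bits, dwt_ll):
    """The three fidelity terms (sketch).
    cover, stego:    (B, 3, H, W) cover and stego images
    secret_bits:     (B, L) original secret bits in {0, 1}
    recovered_bits:  (B, L) recovered bits (or probabilities) in [0, 1]
    dwt_ll:          callable returning the LL sub-band of a one-level DWT"""
    l_i = F.mse_loss(stego, cover)                             # spatial MSE (L_i)
    l_w = torch.mean(torch.abs(secret_bits - recovered_bits))  # bit-level disagreement (L_w)
    l_q = F.l1_loss(dwt_ll(stego), dwt_ll(cover))              # low-frequency L1 (L_q)
    return l_i, l_w, l_q

# Toy usage: average pooling as a crude stand-in for the DWT low-pass (LL) band.
ll = lambda img: F.avg_pool2d(img, 2)
cover, stego = torch.rand(2, 3, 64, 64), torch.rand(2, 3, 64, 64)
bits, rec = torch.randint(0, 2, (2, 128)).float(), torch.rand(2, 128)
print(fidelity_losses(cover, stego, bits, rec, ll))
```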
To defend against deep network-based steganalysis attacks, the Wasserstein GAN framework is introduced to establish a dynamic adversarial game. The generator G (i.e., steganographic network) embeds secret information in the cover by reversible transformation, while the discriminator D adopts a multi-level wavelet convolution structure:
$$D(I) = \sum_{k=1}^{3} \alpha_k \cdot \mathrm{LeakyReLU}\!\left(\mathrm{Conv}_{3\times 3}\!\left(\mathrm{DWT}_k(I)\right)\right)$$
The adversarial loss $L_{\mathrm{adv}}$ contains three components:
$$L_{\mathrm{adv}} = \mathbb{E}_{I_{co}}\!\left[D(I_{co})\right] - \mathbb{E}_{I_{en}}\!\left[D(I_{en})\right] + 10 \cdot \mathbb{E}_{\hat{I}}\!\left[\left(\left\| \nabla_{\hat{I}} D(\hat{I}) \right\|_2 - 1\right)^2\right]$$
where the first two terms drive the generator to deceive the discriminator and the third, a gradient penalty term, enforces Lipschitz continuity to avoid mode collapse. Theoretical analysis shows that when the game reaches the Nash equilibrium, the distribution of the stego images coincides with the natural image manifold, achieving statistical invisibility.
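For reference, the following sketch computes the critic-side WGAN loss with gradient penalty corresponding to the terms above (written in the usual sign convention for training the critic); the toy critic is a stand-in for the multi-level wavelet-convolution discriminator and is not the architecture used in the paper.

```python
import torch
import torch.nn as nn

def wgan_gp_loss(D, cover, stego, gp_weight=10.0):
    """Critic-side WGAN loss with gradient penalty (sketch).
    cover, stego: (B, 3, H, W); D maps images to one scalar score per sample."""
    loss = D(stego).mean() - D(cover).mean()              # Wasserstein terms
    eps = torch.rand(cover.size(0), 1, 1, 1)              # random interpolation weights
    interp = (eps * cover + (1 - eps) * stego).requires_grad_(True)
    grad = torch.autograd.grad(D(interp).sum(), interp, create_graph=True)[0]
    gp = ((grad.flatten(1).norm(2, dim=1) - 1) ** 2).mean()   # 1-Lipschitz penalty
    return loss + gp_weight * gp

# Toy critic standing in for the wavelet-convolution discriminator.
D = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.LeakyReLU(),
                  nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 1))
cover, stego = torch.rand(4, 3, 32, 32), torch.rand(4, 3, 32, 32)
print(wgan_gp_loss(D, cover, stego))
```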
Noise interference in the actual transmission channel is modeled by a differentiable noise layer $N(\cdot)$ that dynamically combines six attacks:
$$N(I) = \bigoplus_{j=1}^{6} \tau_j(I), \quad \tau_j \in \left\{ \tau_{\mathrm{JPEG}}(Q \in [10, 50]),\ \tau_{\mathrm{Gauss}}(\sigma \in [0.1, 0.3]),\ \tau_{\mathrm{Crop}}(r \in [0.05, 0.15]),\ \tau_{\mathrm{Drop}}(p \in [0.2, 0.4]),\ \tau_{\mathrm{Rot}}(\theta \in [10^{\circ}, 30^{\circ}]),\ \tau_{\mathrm{Resize}}(s \in [0.3, 0.7]) \right\}$$
The noise robustness loss $L_{\mathrm{noise}}$ jointly optimizes information recovery and visual fidelity:
$$L_{\mathrm{noise}} = \frac{1}{B} \sum_{b=1}^{B} \left[ \mathrm{BCE}\!\left(S_0^{(b)}, S_e^{(b)}\right) + 0.5 \cdot \mathrm{SSIM}\!\left(I_{en}^{(b)}, N(I_{en}^{(b)})\right) \right]$$
End-to-end optimization of attack strategies via Gumbel–Softmax enables steganographic networks to learn to be adaptive to worst-case scenarios.
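A minimal sketch of such a differentiable noise layer with Gumbel–Softmax attack selection is shown below; the three lambda attacks are simplified placeholders for the six attack simulators listed above (a differentiable JPEG approximation, cropping, and rotation would be added in practice).

```python
import torch
import torch.nn.functional as F

def noise_layer(stego, attacks, logits, tau=1.0):
    """Differentiable noise layer (sketch): relaxed Gumbel-Softmax weights over a
    set of differentiable attack simulators, so gradients reach both the
    steganography network and the attack-selection logits.
    attacks: list of callables mapping (B, C, H, W) images to images
    logits:  (num_attacks,) learnable attack-selection scores"""
    weights = F.gumbel_softmax(logits, tau=tau, hard=False)    # relaxed one-hot
    attacked = torch.stack([atk(stego) for atk in attacks])    # (K, B, C, H, W)
    return (weights.view(-1, 1, 1, 1, 1) * attacked).sum(0)

# Illustrative attack set; real JPEG/crop/rotation simulators would replace these.
attacks = [
    lambda x: x + 0.2 * torch.randn_like(x),                                 # Gaussian noise
    lambda x: F.interpolate(F.interpolate(x, scale_factor=0.5, mode="bilinear"),
                            size=x.shape[-2:], mode="bilinear"),             # resize 0.5x
    lambda x: F.dropout(x, p=0.3),                                           # pixel dropout
]
logits = torch.zeros(len(attacks), requires_grad=True)
stego = torch.rand(2, 3, 32, 32)
print(noise_layer(stego, attacks, logits).shape)  # torch.Size([2, 3, 32, 32])
```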
The orthogonality of the invertible neural network is guaranteed by the regularization term $L_{\mathrm{orth}}$:
$$L_{\mathrm{orth}} = \left\| J_G^{\top} J_G - I \right\|_F^2 + 0.1 \cdot \left| \det(J_G) - 1 \right|$$
where the Frobenius norm term enforces orthogonality of the Jacobian matrix and the determinant constraint term maintains volume conservation, ensuring that the network remains stably invertible during training.
The total loss function achieves multimodal synergy through weighted fusion:
$$L_{\mathrm{total}} = 3 L_i + 10 L_w + 0.1 L_q + 2 L_{\mathrm{adv}} + 5 L_{\mathrm{noise}} + 0.5 L_{\mathrm{orth}}$$
The weighting configuration reflects a hierarchical design principle: the 10-fold weighting of $L_w$ emphasizes secret recovery accuracy, the 5-fold weighting of $L_{\mathrm{noise}}$ enhances interference immunity, and the 0.1-fold weighting of $L_q$ allows for controlled frequency-domain tuning.
A composite optimization framework combining a fidelity constraint, adversarial game, and noise immunity enables the steganographic system to maintain stability in dynamic attack environments. The strategy of joint multimodal training provides a new theoretical basis and technical path for the robust design of steganographic algorithms.

4. Experiments

This section delineates the dataset, implementation specifics of the hiding and extraction processes, and the evaluation metrics utilized.

4.1. Datasets

In order to comprehensively evaluate the performance of the algorithm, a multi-dimensional testing system was constructed in this study. Table 4 is a description of the dataset.

4.2. Implementation Details

The forward hiding network and backward reconstruction network were trained and validated with 16 Invertible Watermarking Modules (IWMs), and their parameters were optimized using the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. In the steganography network, the total number of training epochs was set to 3K. The initial learning rate was $1 \times 10^{-4}$, and the learning rate was halved every 1K epochs. In the noise adversarial training phase, the total number of training epochs was 1.5K; the initial learning rate was $1 \times 10^{-4}$, and the learning rate was halved every 200 epochs. The discriminator was first pre-trained for 500 epochs, with the learning rate halved every 100 epochs, and then the discriminator and MRIN were jointly trained for 1K epochs, with the learning rate halved every 200 epochs. The loss function hyperparameters in Equation (32) were set to $\lambda_i = 3$, $\lambda_w = 10$, $\lambda_q = 0.1$, $\lambda_{adv} = 2$, $\lambda_{noise} = 5$, and $\lambda_{orth} = 0.5$.
All the experiments were implemented in the PyTorch 2.0.1 framework and NVIDIA V100 GPUs were used to complete the computation acceleration.
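A sketch of this optimizer and step-wise learning-rate schedule using standard PyTorch components follows; the linear layer and placeholder loss are only stand-ins for the MRIN parameters and training objective.

```python
import torch
import torch.nn as nn

net = nn.Linear(8, 8)  # stand-in for the MRIN parameters
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4, betas=(0.9, 0.999))
# Steganography-network stage: 3K epochs, learning rate halved every 1K epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=1000, gamma=0.5)

for epoch in range(3000):
    optimizer.zero_grad()
    loss = net(torch.randn(4, 8)).pow(2).mean()  # placeholder loss computation
    loss.backward()
    optimizer.step()
    scheduler.step()
```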
The comparison experiments selected representative state-of-the-art methods from recent years.
The deep neural network steganography scheme proposed by Baluja et al. [2] served as the baseline model. HiNet [13] enhances information capacity through reversible embedding in the high-frequency domain. Lu et al. [10] developed a large-capacity invertible network with parameter sharing. RIIS [12] is the current optimal robust steganography method. Yang et al. [35] (PRIS) add an image enhancement module at the beginning and end of the secret image extraction process, which reduces interference at both ends of the extraction process and enhances robustness. All comparison models were reproduced from their official open-source code, and the same training strategy was used to ensure fairness.

4.3. Evaluation Metrics

For the performance analysis of image steganography algorithms, standardized evaluation metrics needed to be used, and several common evaluation metrics are listed below.

4.3.1. Structural Similarity Index

The Structural Similarity Index (SSIM) [44] evaluates visual quality by comparing the brightness, contrast, and structural information of image blocks $x$ and $y$:
$$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}$$
where $\mu$ denotes the mean, $\sigma^2$ the variance, $\sigma_{xy}$ the covariance, and $C_1$, $C_2$ are stability constants. The SSIM value range is [0, 1], with larger values indicating higher structural similarity.

4.3.2. Peak Signal-to-Noise Ratio

Peak Signal-to-Noise Ratio (PSNR) [45] is defined based on mean squared error (MSE):
$$\mathrm{MSE} = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( I(i,j) - K(i,j) \right)^2, \qquad \mathrm{PSNR} = 10 \cdot \log_{10} \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}$$
where $\mathrm{MAX}_I$ is the maximum pixel value (e.g., 255 for 8-bit images). Higher PSNR values indicate better image fidelity, but PSNR under-models the perceptual properties of the human visual system.
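A direct implementation of this definition is short; the example below assumes 8-bit images with a peak value of 255.

```python
import torch

def psnr(img, ref, max_val=255.0):
    """PSNR from the MSE definition above; img and ref share a pixel range [0, max_val]."""
    mse = torch.mean((img.float() - ref.float()) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

a = torch.randint(0, 256, (256, 256, 3))
b = (a + torch.randint(-5, 6, a.shape)).clamp(0, 255)   # mildly perturbed copy
print(float(psnr(a, b)))
```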

4.3.3. No-Reference Image Quality Evaluation

Natural Image Quality Evaluation (NIQE) [46] provides a no-reference metric for assessing the invisibility of image steganography. It does so by quantifying the degree of deviation of statistical features between the test image and natural images. The core of NIQE lies in constructing a multidimensional feature space to analyze the perturbation effect of the steganography process on the statistical characteristics of images. NIQE selects three types of features that are sensitive to the steganographic perturbation, and its mathematical modeling process is as follows:
1. Local luminance distribution modeling: steganographic embedding alters the local contrast, causing the shape parameter $\gamma$ to deviate from the typical range $[1.5, 2.5]$ of natural images. For example, LSB steganography reduces $\gamma$ by about 0.3, while frequency-domain embedding (e.g., DCT [19]) increases $\gamma$ by 0.2–0.5.
2. Texture correlation analysis: the steganography operation destroys inter-pixel correlation, increasing the contrast feature by 5–15% and decreasing the correlation feature by 8–20%. For example, the WOW algorithm increases the contrast by 9.2% and decreases the correlation by 12.7% at a 0.4 bpp embedding rate.
3. Multi-scale wavelet coefficient analysis: a three-level Daubechies wavelet decomposition of the image yields horizontal (LH), vertical (HL), and diagonal (HH) sub-bands at each scale. The high-frequency (HH) sub-band of a natural image typically has $\beta \in [0.5, 1.2]$, and steganographic noise pushes it toward a Gaussian distribution ($\beta \to 2$).
Let the natural-image feature distribution be $\mathcal{N}(\mu_{\mathrm{natural}}, \Sigma_{\mathrm{natural}})$ and the feature vector of the stego image be $\mathbf{f}_{\mathrm{stego}}$; then the NIQE value is computed as
$$\mathrm{NIQE} = \left(\mathbf{f}_{\mathrm{stego}} - \mu_{\mathrm{natural}}\right)^{\top} \Sigma_{\mathrm{natural}}^{-1} \left(\mathbf{f}_{\mathrm{stego}} - \mu_{\mathrm{natural}}\right)$$
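Following the quadratic form above, the NIQE-style distance reduces to a Mahalanobis-type computation; the sketch below omits the feature extraction itself (luminance, texture, and wavelet statistics) and uses random values purely for illustration.

```python
import numpy as np

def niqe_distance(f_stego, mu_natural, sigma_natural):
    """Feature-space distance between a test image and the natural-image model,
    following the quadratic form above.
    f_stego: (d,) features; mu_natural: (d,) mean; sigma_natural: (d, d) covariance."""
    diff = f_stego - mu_natural
    return float(diff @ np.linalg.inv(sigma_natural) @ diff)

# Toy usage with a random positive-definite covariance.
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 6))
sigma = A @ A.T + 6 * np.eye(6)
print(niqe_distance(rng.standard_normal(6), np.zeros(6), sigma))
```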

5. Results and Discussion

The experimental system of this study encompassed the following core evaluation dimensions: firstly, the invisibility of the steganography method was quantified using the NIQE metric, and SSIM/PSNR were used to evaluate the reconstruction quality of the secret image; secondly, the anti-detection capability was analysed using the StegExpose tool; subsequently, channel damage scenarios such as JPEG compression and noise interference were simulated to verify the robustness of the method; lastly, the contribution of each component was analyzed through module ablation experiments.

5.1. Image Hiding and Reconstruction Visualization

This section verifies the steganography and reconstruction effects of the algorithm through several sets of visualization and comparison experiments. Figure 7 illustrates the complete processing flow of four sets of typical samples: the original cover image, the stego image, the original secret image, and the recovered secret image, from left to right in each column. At the visual level, the texture details, color saturation, and lighting consistency of the stego image and the original cover image were highly consistent, with no common distortions such as pixel block effects, artifacts, or local blurring. For example, in the vegetation scene (first column), the sharpness of the leaf edges in the stego image perfectly matched that of the original cover. This result shows that the algorithm had a significant advantage in maintaining the visual quality of the cover image.
The accuracy of reconstructing the secret information was further analyzed. The original secret image (second row, second column) contained highly complex details (densely placed fruit). The recovered image (fourth row, second column) reproduced these features completely: fruit edges were clear and unbroken, rounded parts remained smooth, and color gradient transitions were natural. Notably, even for the heavily shaded secret image (third column, fourth row), the reconstruction still accurately recovered its gray levels, with no noise amplification in the shaded area. The above visual evidence validates, at a subjective level, the reliability of the algorithm in information hiding and recovery tasks.
This section experimentally verifies the algorithm's performance through histogram visualization. Figure 8 and Figure 9 show comparisons of the RGB channel histograms between the cover image and the stego image, and between the secret image and the recovered image. The pixel distribution pattern of a histogram intuitively reflects the magnitude and consistency of image modification and is an effective tool for assessing the imperceptibility and reconstruction accuracy of steganography algorithms [47].
From Figure 8, it can be seen that the histograms of the cover image (top) and the stego image (bottom) showed a high degree of similarity in all three color channels. Specifically, the peak area of the red channel (R) was concentrated in the [150, 200] grayscale interval, and the distribution pattern of the stego image in this interval almost completely overlapped that of the cover image; the pixel value distribution curve of the green channel (G) of the stego image maintained the same smooth transition characteristics as that of the cover image, with no apparent jagged fluctuations or abnormal spikes. The histogram of the blue channel (B) showed a minimal overall offset, especially in the low gray area (<50) and high gray area (>220), where the distribution ratio was almost identical to that of the original cover image.
The KL divergence measures the degree of consistency between two probability distributions and was used to evaluate the effect of the embedding process on the probability distribution. When this index is applied to the RGB channel histogram, counting the three channels jointly requires $256^3$ combinations, leading to a dimensionality explosion. Therefore, we introduced the dimensionality reduction method proposed by Song et al. [48]. After calculation, the distance between the cover image and the stego image was $6 \times 10^{-3}$. Given the rich textures of the images, such a short distance means the algorithm is not easily detected by steganalysis.
These results indicate that the embedding process of the secret information had a negligible effect on the global statistical properties of the cover image, thus ensuring the visual indistinguishability of the secret-containing image.
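For illustration, a simplified per-channel histogram KL divergence can be computed as below; this is only a proxy for the joint RGB measure with the dimensionality reduction of Song et al. [48] used in the paper.

```python
import numpy as np

def histogram_kl(cover, stego, bins=256, eps=1e-10):
    """Average per-channel KL divergence between pixel-value histograms.
    cover, stego: (H, W, 3) uint8 arrays. A simplified per-channel proxy for the
    joint RGB measure (with dimensionality reduction) discussed above."""
    kls = []
    for c in range(3):
        p, _ = np.histogram(cover[..., c], bins=bins, range=(0, 256))
        q, _ = np.histogram(stego[..., c], bins=bins, range=(0, 256))
        p = p.astype(float) + eps
        q = q.astype(float) + eps
        p, q = p / p.sum(), q / q.sum()
        kls.append(float(np.sum(p * np.log(p / q))))
    return float(np.mean(kls))

rng = np.random.default_rng(0)
cover = rng.integers(0, 256, (64, 64, 3), dtype=np.uint8)
stego = np.clip(cover.astype(int) + rng.integers(-2, 3, cover.shape), 0, 255)
print(histogram_kl(cover, stego))
```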
Figure 9 further compares the histogram distribution of the secret image (top) with that of the recovered image (bottom). It can be observed that the height, width, and symmetry of the peaks of the red channel (R) histogram of the recovered image in the [80, 160] interval closely matched those of the original secret image; the pixel value distribution curve of the green channel (G) in the recovered image completely replicated the bimodal feature of the original secret image, with no significant change in the spacing between the two peaks or their relative heights. The distribution density of the blue channel (B) histogram over the whole grayscale range was almost perfectly consistent with that of the original secret image, especially in the high-brightness region (>200), where the pixel percentage error was less than 1%.
We used the same KL divergence calculation method for the secret image and the recovered image. The result was $8 \times 10^{-3}$, indicating that the difference between the two images was negligible.
This result fully proves that the steganography algorithm proposed in this paper could accurately recover the statistical properties of the secret image and ensure consistency between the recovered image and the original secret information at the pixel distribution level.

5.2. Quality of Information Reconstruction

This section quantitatively shows that the MRIN method exhibits significant advantages regarding all relevant metrics for both cover/stego image pairs and secret/recovered image pairs.
Table 5 shows the comparison results of cover/stego image pairs for different steganographic methods on different datasets. As shown in Table 6, on the DIV2K standard test set, MRIN achieved a PSNR value of 50.29 dB and an SSIM of 0.996, which were 0.4% and 8.76% better than the current optimal method, HiNet. For the COCO dataset containing complex semantics, the recovered PSNR reached 40.74 dB, achieving an absolute improvement of 16.4% over the 35.01 dB of the traditional method Baluja [2], and the SSIM metric broke through 0.991, demonstrating that it can maintain structural similarity in a high-noise environment. Cross-dataset analysis showed that the PSNR of MRIN on ImageNet fluctuated in the range of only 48.15 dB–50.29 dB, with a standard deviation 65.5% lower than that of the suboptimal method, which verified the strong robustness of the state space model to domain offsets. This performance enhancement stemmed from the synergy between the dynamic noise training strategy and the reversible Mamba architecture, which effectively suppressed the loss of high-frequency information through the selective feature retention mechanism.

5.3. Robustness Against Degradation

Fidelity constraints establish a base defence layer that ensures the basic functional integrity of a steganography system. In adversarial games, the multilevel wavelet convolution structure of the discriminator D is employed to detect statistical features that may be missed by fidelity constraints. Noise immunity simulates real-world attack environments and enhances generalization capabilities.
Based on the DIV2K test set, this experiment simulated three channel impairment scenarios: noise interference, JPEG compression, and resolution scaling. The experimental data demonstrate that the MRIN method offers excellent stability in complex channel environments.
As shown in Table 7, under the strong noise attack ($\sigma$ = 0.1) scenario, the SSIM value of MRIN reached 0.918, which was 12.8% higher than that of the benchmark method Baluja. Additionally, its PSNR of 32.7 dB maintained HD reconstruction quality under noise interference. Especially under the limiting condition of resolution scaling (0.5×), the recovered image PSNR maintained 31.6 dB, an improvement of 11.3% over the HiNet method, which verified the effectiveness of the cross-scale feature fusion mechanism. Cross-attack scenario analysis revealed that the standard deviation of the PSNR of MRIN was only 2.3 dB, which was 58.5% lower than that of the traditional method. This stability advantage stemmed from the synergistic effect of the dynamic noise training strategy and the state space modeling, which balanced the spatial–frequency domain feature distribution through the adaptive frequency response mechanism.
To further investigate the impact of JPEG compression with varying quality factors on the reconstruction accuracy of secret images, we compared our method with RIIS, proposed by Xu et al., and PRIS, proposed by Yang et al. This comparison aimed to validate the stronger robustness of MRIN. The JPEG compression intensity was determined by the quality factor (QF), with a value ranging from 1 to 100. A higher QF meant that more image details were preserved during compression, resulting in less impact on image quality. Conversely, a lower QF indicated stronger compression. Table 8 shows the quality of the secret/recovered images obtained by different steganography methods under various QF values. Notably, HiNet, which does not account for JPEG compression interference, experienced a significant decline in image quality as compression strength increased. RIIS showed the lowest average decrease in quality. However, this method required the compression strength to be input as a parameter into the reconstruction network, meaning that the attack information needed to be known beforehand. In contrast, our proposed MRIN achieved optimal results across all compression strengths. Specifically, it outperformed the second best methods by margins of (0.07, 0.3 dB), (0.001, 0.69 dB), (0.099, 0.54 dB), and (0.126, 0.3 dB).
Under the interference of different noises and JPEG compression with different QF, the secret/recovered images obtained by MRIN achieved advanced results. This indicates that the method proposed in this paper has strong resistance to common noises and JPEG compression and possesses the characteristic of strong robustness.
Figure 10 shows the visualization of the recovered secret images under JPEG compression (QF = 30) for the different comparison methods. We chose several representative fruit images for comparison. As shown in the figure, the Baluja method exhibited obvious scattered noise points, while the secret image reconstructed by HiNet showed obvious texture degradation. PRIS performed better, though the fruit in the lower left corner was missing part of its shadow, indicating a small loss of information. Our method still maintained its reconstruction accuracy.

5.4. Invisibility Assessment

To quantify the visual naturalness of the steganographic images, this experiment used the no-reference quality assessment metric NIQE (Natural Image Quality Evaluator) for cross-dataset analysis.
As shown in Table 9, the MRIN method achieved a NIQE value of 3.81 on the DIV2K dataset, which was 29.8% lower than the traditional method Baluja [2]. MRIN maintained a leading position, with a metric value of 3.56 on the COCO dataset—which contained a complex scene—which was 10.8% lower than that of the second best method PRIS [35]. Notably, on the ImageNet test set with significant differences in data distribution, the fluctuation of the NIQE values of MRIN (3.56–3.89) was significantly smaller than that of other methods, and the standard deviation was 41.2% lower than that of Lu-INN [10], reflecting stronger cross-domain adaptability. The experimental data verified the effectiveness of selective state space modeling in maintaining the statistical properties of the generated images, which suppressed the periodic artifacts generated by the traditional methods in the texture region through global feature interactions.

5.5. Resistance to Steganalysis

StegExpose [49], as a mainstream steganalysis tool, detects steganographic traces through statistical distribution analysis. Four core metrics were used in this experiment to evaluate anti-detection performance.
ACC measured the ability of the classifier to correctly distinguish cover images from stego images, AUC was the area under the ROC curve, FPR reflected the proportion of normal images misclassified as steganographic, and the EER (equal error rate) represented the error rate when FPR and FNR were equal.
The results are shown in Table 10. The MRIN method achieved 55.8% in the detection accuracy (ACC) metric, which was 2.6 percentage points lower than the second best method, PRIS, approaching the random guess baseline (50%). In Figure 11, the blue curve represents our model. The AUC value reached 0.584, which was a decrease from the 0.745 of the conventional method Baluja [2] by 21.6%, indicating that its features are highly compatible with the statistical distribution of natural images. In a low detection rate scenario (FPR@5%), MRIN led to 47.6% of normal images being misdiagnosed, which was an improvement of 46.9% over Lu-INN [10], effectively circumventing the detection strategy based on frequency domain analysis. The equal error rate (EER) index reached 47.5%, which was only 2.5 percentage points different from the theoretical optimal value of 50%, verifying the effectiveness of the dynamic weight allocation mechanism in balancing the spatial distribution of features.
This improvement in anti-detection capability stems from the global smoothing effect of state space modeling on local statistical features: through its time-varying convolution kernel, it suppresses the periodic patterns that traditional methods generate in the DCT domain.
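One simple way to inspect such periodic artifacts is to average the magnitude of the 8 × 8 block DCT of the stego-minus-cover residual; strong off-DC peaks indicate periodic embedding patterns. The sketch below is a generic diagnostic, not part of the proposed method, and assumes grayscale images given as NumPy arrays.

```python
import numpy as np
from scipy.fft import dctn

def mean_block_dct_energy(cover: np.ndarray, stego: np.ndarray, block: int = 8) -> np.ndarray:
    """Average |DCT| of the embedding residual over all 8x8 blocks."""
    residual = stego.astype(np.float64) - cover.astype(np.float64)
    h = residual.shape[0] // block * block
    w = residual.shape[1] // block * block
    acc = np.zeros((block, block))
    count = 0
    for i in range(0, h, block):
        for j in range(0, w, block):
            acc += np.abs(dctn(residual[i:i + block, j:j + block], norm="ortho"))
            count += 1
    return acc / count  # bright off-DC bins hint at periodic artifacts
```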
In addition, deep learning-based steganalysis tools have achieved remarkable results in recent years. In this study, SRNet [50] was chosen as a steganalysis tool to compare the resistance of the different steganographic methods to learned steganalysis. SRNet (Steganalysis Residual Network) is a steganalysis model built on a deep residual network architecture. By expanding the "computational noise residual" part at the front end of the detector, it prevents the pooling operations from suppressing the steganographic signal and thereby significantly improves detection accuracy.
SRNet has an end-to-end design: it extracts and classifies features directly from an image without a hand-crafted feature extraction stage. In addition, the selection-channel mechanism introduced in the JPEG domain further improves detection performance, giving the model good generalization ability in both the spatial and JPEG domains. Leveraging these advantages, SRNet has been widely applied in image steganalysis, where it can effectively determine whether an image conceals hidden information and thus help ensure the security of digital images.
For each steganography method, we first generated 200 stego images as a training set for SRNet, and then generated 1500 stego images together with their corresponding clean cover images to form the respective validation sets.
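A sketch of this evaluation protocol under stated assumptions: the steganalysis network is treated as an opaque PyTorch binary classifier (the real SRNet, or any stand-in), the tensors for the 200-image training split and the 1500-pair validation split are assumed to be prepared elsewhere, and the hyperparameters are illustrative.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def make_split(covers: torch.Tensor, stegos: torch.Tensor) -> TensorDataset:
    """covers/stegos: N x 3 x H x W tensors produced with one steganography method."""
    images = torch.cat([covers, stegos], dim=0)
    labels = torch.cat([torch.zeros(len(covers)), torch.ones(len(stegos))]).long()
    return TensorDataset(images, labels)

def train_and_validate(model: nn.Module, train_set, val_set, epochs: int = 30, device: str = "cuda"):
    """Train the steganalysis model on the small split, report accuracy on the validation split."""
    model = model.to(device)
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()
    train_loader = DataLoader(train_set, batch_size=16, shuffle=True)
    val_loader = DataLoader(val_set, batch_size=16)
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            opt.zero_grad()
            loss = loss_fn(model(x.to(device)), y.to(device))
            loss.backward()
            opt.step()
    model.eval()
    correct = total = 0
    with torch.no_grad():
        for x, y in val_loader:
            pred = model(x.to(device)).argmax(dim=1).cpu()
            correct += (pred == y).sum().item()
            total += len(y)
    return correct / total  # detection accuracy for this steganography method
```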
Figure 12 compares the detection accuracy of SRNet against five steganography methods (Baluja, HiNet, PRIS, RIIS, and MRIN) for different numbers of training samples. The overall trend is that, for every method, the probability that SRNet flags the stego images as containing secret information increases with the number of training samples, but there are significant differences between the methods.
The Baluja method started with a relatively high accuracy (73% with 50 training samples) and reached 100% with 150 or more training samples. This indicates that SRNet can quickly learn its steganographic features from a small number of samples, so the method has weak resistance to steganalysis and is likely to expose the hidden information; this is consistent with its role as a baseline model. The accuracy for HiNet was lower, especially as the number of training samples increased, indicating some advantage in resisting steganalysis. The accuracies for PRIS and RIIS were lower still, and both were better than HiNet at the same number of training samples. RIIS, as the strongest existing robust steganography method, achieved an accuracy close to that of PRIS and slightly lower with larger training sets, indicating that its resistance to steganalysis is stronger than that of PRIS. MRIN had the lowest detection accuracy of all the compared methods, rising from 52.4% to 84.2%, indicating the strongest resistance to steganalysis and verifying the effectiveness of our method.

5.6. Ablation Experiment

To verify the synergistic effect of the modules, two variants of the model were constructed for comparative analysis: MRIN-M, in which the Mamba module is removed and replaced with a Transformer, and MRIN-T, in which the multimodal adversarial training is removed.
As shown in Table 11, the ablation experiments quantitatively reveal the contribution of each module. In terms of generative quality, removing the Mamba module (MRIN-M) led to a 6.3% decrease in SSIM and a 3.7 dB decrease in PSNR, verifying the central role of the selective state space model in global feature modeling. Removing the adversarial training mechanism (MRIN-T) led to a 7.4% decrease in SSIM under noise attack, demonstrating the critical contribution of the dynamic noise layer to robustness. For steganalysis resistance, MRIN-M increased the detection accuracy (ACC) by 6.9 percentage points over the full model, showing that the Mamba architecture effectively smooths out local statistical anomalies through long-range dependency modeling. In contrast, removing the adversarial training degraded the AUC by 0.08, demonstrating the necessity of the GAN framework for optimizing the distribution of steganographic features. In particular, removing the Mamba module led to a 4.5 dB decrease in PSNR under compression attack, highlighting the unique advantage of its cross-band feature fusion in resisting JPEG quantization errors.

5.7. Complexity Comparison

The efficiency of Mamba is verified in Table 12: Mamba significantly outperformed the Transformer in computational complexity, memory usage, and inference time. As the resolution increased, the computational cost of the Transformer grew dramatically, eventually causing an out-of-memory (OOM) failure. At a resolution of 128 × 128 pixels, the Transformer required 4.2 G FLOPs, 0.9 GB of memory, and 12.5 ms of inference time, whereas Mamba required only 2.3 G FLOPs, 0.5 GB of memory, and 6.8 ms. At 256 × 256 pixels, the Transformer reached 15.8 G FLOPs, 3.2 GB of memory, and 45.3 ms, while Mamba required only 8.7 G FLOPs, 1.8 GB of memory, and 22.1 ms.
At higher resolutions, the Transformer's overhead increased further. At 512 × 512 pixels, its FLOPs reached 63.2 G, with 12.8 GB of memory and an inference time of 181.2 ms, leaving it prone to OOM failures in high-resolution tasks, while Mamba needed only 34.8 G FLOPs, 7.2 GB of memory, and 88.4 ms and still ran efficiently. At 1024 × 1024 pixels, the Transformer's FLOPs rose to 252.8 G with 51.2 GB of memory, and it could not complete the inference task (OOM); Mamba required 139.2 G FLOPs and 28.8 GB of memory, with an inference time of 353.6 ms, and thus still ran stably.
These results show that Mamba's selective state space model (SSM) design significantly reduces computational overhead while maintaining high performance, making it especially suitable for steganography on high-resolution images. Although a Transformer could also improve the feature representation of images, its quadratic complexity, O(N²), makes it computationally inefficient in high-resolution tasks and can even prevent inference from completing.
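The memory and latency columns of Table 12 can be gathered for any pair of backbones with a short benchmarking loop. A minimal sketch follows, assuming CUDA is available; `build_mamba_block()` and `build_transformer_block()` are placeholder constructors for the two affine-module variants, and FLOPs counting would require an external profiler, which is omitted here.

```python
import time
import torch

def benchmark(model: torch.nn.Module, resolution: int, channels: int = 3,
              warmup: int = 5, runs: int = 20):
    """Measure peak GPU memory (GB) and mean inference time (ms) at a square resolution."""
    device = torch.device("cuda")
    model = model.to(device).eval()
    x = torch.randn(1, channels, resolution, resolution, device=device)
    torch.cuda.reset_peak_memory_stats(device)
    with torch.no_grad():
        for _ in range(warmup):
            model(x)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(runs):
            model(x)
        torch.cuda.synchronize()
    mem_gb = torch.cuda.max_memory_allocated(device) / 1024 ** 3
    ms = (time.perf_counter() - start) * 1000.0 / runs
    return mem_gb, ms

# Illustrative usage with placeholder builders for the two compared backbones:
# for res in (128, 256, 512, 1024):
#     print(res, benchmark(build_mamba_block(), res), benchmark(build_transformer_block(), res))
```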

6. Conclusions

In this study, we proposed MRIN, an invertible steganography network based on the Mamba architecture that outperforms existing state-of-the-art methods. We adopted state space modeling, which acts as a global smoother of local statistical features; this design provides better feature extraction in complex image backgrounds. To further enhance robustness, we introduced a composite optimization framework that fuses fidelity constraints, adversarial games, and noise immunity. The experimental results demonstrate that the method has significant advantages in maintaining the visual quality of images, achieves stronger visual naturalness in the invisibility assessment, and withstands simulated noise interference. The PSNR of the recovered image under resolution scaling (0.5×) was 31.6 dB, 11.3% higher than that of the other methods. The detection accuracy of StegExpose, a mainstream steganalysis tool, was 55.8% on our stego images, 2.6 percentage points lower than for existing advanced robust steganography methods. These results show that our method outperforms current state-of-the-art (SOTA) image steganography techniques in robustness and security across datasets. The proposed steganography method achieves a performance breakthrough within the cover-modification framework. We therefore propose the following directions for future research:
  • Coverless steganography: Coverless steganography embeds secret information without modifying a cover object. The Mamba architecture could be applied to coverless image steganography to weaken the dependence on cover images.
  • Lightweight network: Future work could address the high memory usage and degraded inference speed encountered when processing long sequences.
  • Multimodal steganography: An adaptive dynamic steganography system could be developed that adjusts the joint concealment of text, images, and video in real time.

Author Contributions

Methodology, L.H. and J.R.; software, J.R.; validation, J.R.; investigation, L.H. and J.R.; resources, J.L.; writing—original draft, L.H. and J.R.; visualization, J.R.; supervision, L.H. and J.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Major Bidding Project of the National Social Science Fund of China (24&ZD172).

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Cheddad, A.; Condell, J.; Curran, K.; Mc Kevitt, P. Digital image steganography: Survey and analysis of current methods. Signal Process. 2010, 90, 727–752. [Google Scholar] [CrossRef]
  2. Baluja, S. Hiding images in plain sight: Deep steganography. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Volume 30. [Google Scholar]
  3. Volkhonskiy, D.; Nazarov, I.; Burnaev, E. Steganographic generative adversarial networks. In Proceedings of the Twelfth International Conference on Machine Vision, Amsterdam, The Netherlands, 16–18 November 2019; Volume 11433, p. 114333M. [Google Scholar]
  4. Shi, H.; Dong, J.; Wang, W.; Qian, Y.; Zhang, X. SSGAN: Secure steganography based on generative adversarial networks. In Proceedings of the Advances in Multimedia Information Processing–PCM 2017: 18th Pacific-Rim Conference on Multimedia, Harbin, China, 28–29 September 2017; Revised Selected Papers, Part I 18. Springer: Berlin/Heidelberg, Germany, 2018; pp. 534–544. [Google Scholar]
  5. Yao, Y.; Wang, J.; Chang, Q.; Ren, Y.; Meng, W. High invisibility image steganography with wavelet transform and generative adversarial network. Expert Syst. Appl. 2024, 249, 123540. [Google Scholar] [CrossRef]
  6. Zhang, R.; Dong, S.; Liu, J. Invisible Steganography via Generative Adversarial Networks. arXiv 2018, arXiv:1807.0857. [Google Scholar]
  7. Weng, X.; Li, Y.; Chi, L.; Mu, Y. High-capacity convolutional video steganography with temporal residual modeling. In Proceedings of the 2019 on International Conference on Multimedia Retrieval, Ottawa, ON, Canada, 10–13 June 2019; pp. 87–95. [Google Scholar]
  8. Rahim, R.; Nadeem, S. End-to-end trained CNN encoder-decoder networks for image steganography. In Proceedings of the European Conference on Computer Vision (ECCV) Workshops, Munich, Germany, 8–14 September 2018. [Google Scholar]
  9. Zhu, J.; Kaplan, R.; Johnson, J.; Fei-Fei, L. Hidden: Hiding data with deep networks. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 657–672. [Google Scholar]
  10. Lu, S.P.; Wang, R.; Zhong, T.; Rosin, P.L. Large-capacity image steganography based on invertible neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 10816–10825. [Google Scholar]
  11. Cheng, K.L.; Xie, Y.; Chen, Q. Iicnet: A generic framework for reversible image conversion. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Virtual Event, 11–17 October 2021; pp. 1991–2000. [Google Scholar]
  12. Xu, Y.; Mou, C.; Hu, Y.; Xie, J.; Zhang, J. Robust Invertible Image Steganography. In Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 18–24 June 2022; pp. 7865–7874. [Google Scholar] [CrossRef]
  13. Jing, J.; Deng, X.; Xu, M.; Wang, J.; Guan, Z. Hinet: Deep image hiding by invertible network. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, BC, Canada, 11–17 October 2021; pp. 4733–4742. [Google Scholar]
  14. Dinh, L.; Krueger, D.; Bengio, Y. Nice: Non-linear independent components estimation. arXiv 2014, arXiv:1410.8516. [Google Scholar]
  15. Skodras, A.; Christopoulos, C.; Ebrahimi, T. The JPEG 2000 still image compression standard. IEEE Signal Process. Mag. 2001, 18, 36–58. [Google Scholar] [CrossRef]
  16. Yangjie, Z.; Jia, L.; Yan, K.; Meiqi, L. Image steganography based on generative implicit neural representation. arXiv 2024, arXiv:2406.01918. [Google Scholar]
  17. Mandal, P.C.; Mukherjee, I.; Paul, G.; Chatterji, B. Digital image steganography: A literature survey. Inf. Sci. 2022, 609, 1451–1488. [Google Scholar] [CrossRef]
  18. Mielikainen, J. LSB matching revisited. IEEE Signal Process. Lett. 2006, 13, 285–287. [Google Scholar] [CrossRef]
  19. Noda, H.; Niimi, M.; Kawaguchi, E. Application of QIM with dead zone for histogram preserving JPEG steganography. In Proceedings of the IEEE International Conference on Image Processing 2005, Genova, Italy, 11–14 September 2005; Volume 2, pp. II–1082. [Google Scholar] [CrossRef]
  20. Wang, W.; Tan, H.; Pang, Y.; Li, Z.; Ran, P.; Wu, J. A Novel Encryption Algorithm Based on DWT and Multichaos Mapping. J. Sens. 2016, 2016, 2646205. [Google Scholar] [CrossRef]
  21. Volkhonskiy, D.; Borisenko, B.; Burnaev, E. Generative Adversarial Networks for Image Steganography. arXiv 2017, arXiv:1703.05502. [Google Scholar] [CrossRef]
  22. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
  23. Tang, W.; Tan, S.; Li, B.; Huang, J. Automatic steganographic distortion learning using a generative adversarial network. IEEE Signal Process. Lett. 2017, 24, 1547–1551. [Google Scholar] [CrossRef]
  24. Xu, G.; Wu, H.Z.; Shi, Y.Q. Structural design of convolutional neural networks for steganalysis. IEEE Signal Process. Lett. 2016, 23, 708–712. [Google Scholar] [CrossRef]
  25. Wu, P.; Yang, Y.; Li, X. Stegnet: Mega image steganography capacity with deep convolutional network. Future Internet 2018, 10, 54. [Google Scholar] [CrossRef]
  26. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2818–2826. [Google Scholar]
  27. Duan, X.; Jia, K.; Li, B.; Guo, D.; Zhang, E.; Qin, C. Reversible image steganography scheme based on a U-Net structure. IEEE Access 2019, 7, 9314–9323. [Google Scholar] [CrossRef]
  28. Zhou, Y.; Luo, T.; He, Z.; Jiang, G.; Xu, H.; Chang, C.C. CAISFormer: Channel-wise attention transformer for image steganography. Neurocomputing 2024, 603, 128295. [Google Scholar] [CrossRef]
  29. Xiao, C.; Peng, S.; Zhang, L.; Wang, J.; Ding, D.; Zhang, J. A transformer-based adversarial network framework for steganography. Expert Syst. Appl. 2025, 269, 126391. [Google Scholar] [CrossRef]
  30. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning, PMLR, Sydney, Australia, 6–11 August 2017; pp. 214–223. [Google Scholar]
  31. Guo, Y.; Liu, Z. Coverless Steganography for Face Recognition Based on Diffusion Model. IEEE Access 2024, 12, 148770–148782. [Google Scholar] [CrossRef]
  32. Hu, X.; Li, S.; Ying, Q.; Peng, W.; Zhang, X.; Qian, Z. Establishing Robust Generative Image Steganography via Popular Stable Diffusion. IEEE Trans. Inf. Forensics Secur. 2024, 19, 8094–8108. [Google Scholar] [CrossRef]
  33. Peng, Y.; Hu, D.; Wang, Y.; Chen, K.; Pei, G.; Zhang, W. StegaDDPM: Generative Image Steganography based on Denoising Diffusion Probabilistic Model. In Proceedings of the 31st ACM International Conference on Multimedia, Ottawa, ON, Canada, 29 October–3 November 2023; MM ’23. pp. 7143–7151. [Google Scholar] [CrossRef]
  34. Dinh, L.; Sohl-Dickstein, J.; Bengio, S. Density estimation using Real NVP. arXiv 2017, arXiv:1605.08803. [Google Scholar]
  35. Yang, H.; Xu, Y.; Liu, X.; Ma, X. PRIS: Practical robust invertible network for image steganography. Eng. Appl. Artif. Intell. 2024, 133, 108419. [Google Scholar] [CrossRef]
  36. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 4–9 December 2017; NIPS’17. pp. 6000–6010. [Google Scholar]
  37. Gu, A.; Dao, T. Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv 2024, arXiv:2312.00752. [Google Scholar]
  38. Liu, Y.; Tian, Y.; Zhao, Y.; Yu, H.; Xie, L.; Wang, Y.; Ye, Q.; Jiao, J.; Liu, Y. VMamba: Visual State Space Model. In Proceedings of the Advances in Neural Information Processing Systems; Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2024; Volume 37, pp. 103031–103063. [Google Scholar]
  39. Gu, A.; Dao, T.; Ermon, S.; Rudra, A.; Ré, C. HiPPO: Recurrent Memory with Optimal Polynomial Projections. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 1474–1487. [Google Scholar]
  40. Wu, L.; Cheng, H.; Yan, W.; Chen, F.; Wang, M.; Wang, T. Reversible and colorable deep image steganography with large capacity. J. Electron. Imaging 2023, 32, 043006. [Google Scholar] [CrossRef]
  41. Agustsson, E.; Timofte, R. Ntire 2017 challenge on single image super-resolution: Dataset and study. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, Honolulu, HI, USA, 21–26 July 2017; pp. 126–135. [Google Scholar]
  42. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; Proceedings, Part v 13. Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  43. Russakovsky, O.; Deng, J.; Su, H.; Krause, J.; Satheesh, S.; Ma, S.; Huang, Z.; Karpathy, A.; Khosla, A.; Bernstein, M.; et al. Imagenet large scale visual recognition challenge. Int. J. Comput. Vis. 2015, 115, 211–252. [Google Scholar] [CrossRef]
  44. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612. [Google Scholar] [CrossRef]
  45. Welstead, S.T. Fractal and Wavelet Image Compression Techniques; SPIE Press: Bellingham, WA, USA, 1999; Volume 40. [Google Scholar]
  46. Mittal, A.; Soundararajan, R.; Bovik, A.C. Making a “completely blind” image quality analyzer. IEEE Signal Process. Lett. 2012, 20, 209–212. [Google Scholar] [CrossRef]
  47. Stamm, M.C.; Liu, K.R. Forensic detection of image manipulation using statistical intrinsic fingerprints. IEEE Trans. Inf. Forensics Secur. 2010, 5, 492–506. [Google Scholar] [CrossRef]
  48. Song, H.; Tang, G.; Sun, Y.; Gao, Z. Security Measure for Image Steganography Based on High Dimensional KL Divergence. Secur. Commun. Netw. 2019, 2019, 1–13. [Google Scholar] [CrossRef]
  49. Boehm, B. StegExpose-A Tool for Detecting LSB Steganography. arXiv 2014, arXiv:1410.6656. [Google Scholar]
  50. Boroumand, M.; Chen, M.; Fridrich, J. Deep residual network for steganalysis of digital images. IEEE Trans. Inf. Forensics Secur. 2018, 14, 1181–1193. [Google Scholar] [CrossRef]
Figure 1. The basic process of image steganography.
Figure 2. The hiding process of the invertible steganography module.
Figure 3. The extraction process of the invertible steganography module.
Figure 4. Encoder and decoder with excellent symmetry.
Figure 5. Main flowchart of SSM.
Figure 6. Workflow of SS2D.
Figure 7. Steganography/recovered visualization (four sets of samples: (a) Vegetation and squirrel; (b) Lion and fruits; (c) Lake and boat; (d) Penguin and plants).
Figure 8. RGB channel histogram comparison between cover image (top) and stego image (bottom).
Figure 9. RGB channel histogram comparison between secret image (top) and recovered image (bottom).
Figure 10. Recovered image comparison by method.
Figure 11. The ROC curve produced by StegExpose for our model.
Figure 12. Anti-analysis performance of each method under SRNet steganalysis.
Table 1. Mathematical properties of fusion feature F_e^i.
Property | Mathematical Description | Functional Significance
Spatial dimensions | H/2 × W/2 | Maintains resolution consistency with wavelet-decomposed features.
Channel composition | 4C = 1C_LL + 3C_LH/HL/HH | Preserves joint frequency representation (low- and high-frequency sub-bands).
Initialization condition | F_e^0 = F_c | Inherits complete spectral information from the original cover image.
Table 2. Mathematical properties of residual flow R_i.
Property | Mathematical Description | Functional Significance
Initialization | R_0 = F_s^o | Maps secret image features to a wavelet-compatible representation space.
Channel dimension | C_r = 64 | Experimentally optimized balance between capacity and concealment.
Update mechanism | R_i = f(R_{i-1}, F_e^i) | Establishes cross-layer feature feedback; dynamically adjusts embedding intensity.
Table 3. Affine transformation group.
Symbol | Role in Transformation | Description
ϕ(R_{i-1}) | Scaling function | Adjusts the weights of channels in F_e, similar to gating in attention mechanisms.
φ(R_{i-1}) | Bias injection | Provides conditionally dependent biases for F_e to enhance expression; encodes secret information into high-frequency components.
η(F_e^i) | State feedback controller | Generates a scaling factor for R_i based on the current state of feature F_e^i, thus controlling the propagation range of the steganographic trace.
ρ(F_e^i) | Residual transformation | Injects a nonlinear transformation of the current feature F_e^i into the state R_i; it deliberately introduces noise to mask traces of the secret information.
Table 4. Description of the dataset.
Data Type | Source | Size | Resolution | Preprocessing
Training set | DIV2K Train [41] | 800 images | 2K | Random flipping, random cropping, resized to 256 × 256 pixels
In-domain test | DIV2K Test [41] | 100 images | 2K | Center-cropped to 256 × 256 pixels
Cross-domain test | COCO [42] / ImageNet [43] | 120 images each | Varied | Center-cropped to 256 × 256 pixels
Table 5. Cover/stego image quality comparison, with the best results in red and the second best in blue. ↑ indicates that a higher value is better.
Method | DIV2K SSIM↑ | DIV2K PSNR↑ | COCO SSIM↑ | COCO PSNR↑ | ImageNet SSIM↑ | ImageNet PSNR↑
Baluja [2] | 0.938 | 35.88 | 0.934 | 35.01 | 0.925 | 34.13
Lu-INN [10] | 0.978 | 38.82 | 0.976 | 36.36 | 0.955 | 35.38
HiNet [13] | 0.993 | 46.53 | 0.985 | 41.21 | 0.986 | 41.60
MRIN | 0.995 | 48.75 | 0.986 | 40.68 | 0.991 | 42.45
Table 6. Secret/recovered image quality comparison, with the best results in red and the second best in blue. ↑ indicates that a higher value is better.
Method | DIV2K SSIM↑ | DIV2K PSNR↑ | COCO SSIM↑ | COCO PSNR↑ | ImageNet SSIM↑ | ImageNet PSNR↑
Baluja [2] | 0.938 | 35.88 | 0.934 | 35.01 | 0.925 | 34.13
Lu-INN [10] | 0.978 | 38.82 | 0.976 | 36.36 | 0.955 | 35.38
HiNet [13] | 0.992 | 46.24 | 0.985 | 40.10 | 0.981 | 40.03
MRIN | 0.996 | 50.29 | 0.991 | 40.74 | 0.990 | 48.15
Table 7. Multi-scene robustness assessment based on the DIV2K dataset, with the best results in red. ↑ indicates that a higher value is better.
Method | Noise Attack (σ = 0.1) SSIM↑ | Noise Attack (σ = 0.1) PSNR↑ | Resolution Scaling (0.5×) SSIM↑ | Resolution Scaling (0.5×) PSNR↑
Baluja [2] | 0.814 | 27.54 | 0.768 | 26.29
Lu-INN [10] | 0.861 | 30.91 | 0.812 | 29.54
HiNet [13] | 0.842 | 31.18 | 0.792 | 28.42
RIIS [12] | 0.876 | 29.84 | 0.825 | 29.59
PRIS [35] | 0.883 | 32.14 | 0.831 | 30.72
MRIN (Ours) | 0.918 | 32.74 | 0.857 | 31.63
Table 8. Secret/recovered image quality obtained after JPEG compression with different compression quality factors for different steganography methods, with the best results in red.
Method | QF = 95 SSIM↑ | QF = 95 PSNR↑ | QF = 85 SSIM↑ | QF = 85 PSNR↑ | QF = 75 SSIM↑ | QF = 75 PSNR↑ | QF = 65 SSIM↑ | QF = 65 PSNR↑
HiNet [13] | 0.841 | 26.98 | 0.813 | 25.43 | 0.779 | 23.29 | 0.744 | 20.07
RIIS [12] | 0.845 | 29.01 | 0.827 | 28.43 | 0.806 | 28.02 | 0.753 | 27.85
PRIS [35] | 0.881 | 32.84 | 0.832 | 31.13 | 0.810 | 30.09 | 0.755 | 28.14
MRIN (Ours) | 0.951 | 33.14 | 0.931 | 31.82 | 0.909 | 30.63 | 0.881 | 28.44
Table 9. Cross-dataset visual quality evaluation (NIQE, lower is better), with the best results in red and the second best in blue.
Method | DIV2K | COCO | ImageNet
Baluja (2017) [2] | 5.43 | 5.88 | 6.12
HiNet (2021) [13] | 4.87 | 5.15 | 5.34
Lu-INN (2021) [10] | 4.35 | 4.62 | 4.79
RIIS (2022) [12] | 3.97 | 4.53 | 4.56
PRIS (2024) [35] | 3.89 | 4.03 | 4.25
MRIN | 3.81 | 3.56 | 3.89
Table 10. Detection results based on StegExpose, with the best results in red and the second best in blue.
Method | ACC (%) | AUC | FPR@5% | EER (%)
Baluja | 71.2 | 0.745 | 18.5 | 29.3
HiNet | 68.5 | 0.712 | 23.7 | 32.1
Lu-INN | 63.9 | 0.683 | 32.4 | 36.8
RIIS | 59.6 | 0.679 | 38.5 | 40.1
PRIS | 58.4 | 0.631 | 41.2 | 43.5
MRIN | 55.8 | 0.584 | 47.6 | 47.5
Table 11. MRIN modular ablation experiment comparison (DIV2K dataset), with the best results in red.
Variant Model | Generative Quality SSIM | Generative Quality PSNR (dB) | Detection Resistance ACC (%) | Detection Resistance AUC | Robustness Noise SSIM | Robustness Compression PSNR (dB)
MRIN-M | 0.916 | 37.5 | 61.2 | 0.623 | 0.832 | 33.1
MRIN-T | 0.934 | 38.8 | 59.7 | 0.642 | 0.853 | 34.5
MRIN | 0.978 | 41.2 | 54.3 | 0.562 | 0.921 | 37.6
Table 12. Mamba vs. Transformer computational complexity, with the best results in bold.
Model | Resolution (pixels) | FLOPs (G) | Memory Footprint (GB) | Inference Time (ms) | OOM
Transformer | 128 × 128 | 4.2 | 0.9 | 12.5 | No
Mamba | 128 × 128 | 2.3 | 0.5 | 6.8 | No
Transformer | 256 × 256 | 15.8 | 3.2 | 45.3 | No
Mamba | 256 × 256 | 8.7 | 1.8 | 22.1 | No
Transformer | 512 × 512 | 63.2 | 12.8 | 181.2 | No
Mamba | 512 × 512 | 34.8 | 7.2 | 88.4 | No
Transformer | 1024 × 1024 | 252.8 | 51.2 | - | Yes
Mamba | 1024 × 1024 | 139.2 | 28.8 | 353.6 | No