1. Introduction
Steganalysis and steganography are two sides of the same coin. Steganography attempts to achieve confidential communication by hiding secret information in a public carrier, such as an image on the internet. Steganalysis attempts to determine whether the communicating parties are using steganography. In this communication game, neither player is supposed to know the other's method exactly, yet the two cannot be studied separately.
Steganography poses an increasing threat to network security because steganographic carriers are increasingly available on the internet and the payloads that steganography can carry are larger than ever before. With the widespread use of smartphones, the number of images transmitted on the internet continually grows. This provides a very convenient channel for the misuse of steganography, such as planning and coordinating criminal activities. The capabilities of steganography have also developed from hiding a small message within the Least Significant Bits (LSB) of an image to placing a full-sized image within another image across all of the available bits [1]. We focus on steganalysis, and our detection targets include message-into-image steganography and image-into-image steganography. The former hides a secret text message in a cover image, whereas the latter hides a secret image in a cover image. Other types of steganography, such as hiding messages or images in voice or video recordings, are beyond the scope of this study.
Steganographic algorithms have developed from the earliest LSB, through UNIWARD (Universal Wavelet Relative Distortion) [
2], WOW (Wavelet Obtained Weights) [
3], HUGO [
4], MG (Multivariate Generalized) [
5], etc., to state-of-the-art technology based on deep learning. Deep-learning-based steganography can be divided into three types. The first type operates in an image-into-image way. The authors of [
1] tried hiding a full-sized secret image within a cover image of the same size, using a neural network trained end to end. Hu et al. [6] proposed a method for image steganography without embedding. The secret image was mapped to a noise vector, and a trained generator network produced stego images from that vector. No modification or embedding operations were required during image generation, and after training, another neural network, called the extractor, successfully recovered the information contained in the image. The second type operates in a message-into-generated-cover way. In [
7], a secure steganography method based on neural network generated images from noise and then hid the secret message in them via a traditional embedding algorithm. The third type uses the neural network as an assessment method. In [
8], the neural network automatically learned embedding change probabilities for every pixel in a given spatial cover image. The learned embedding change probabilities were then converted to embedding distortions, which were adopted in the existing framework of minimal-distortion embedding. Steganographic neural networks based on adversarial training have evolved continuously and have made stego images increasingly difficult to detect, which poses a great challenge to steganalysis.
Steganalysis algorithms have also evolved from the analysis of manually defined features that distinguish cover images from stego images, including quality metrics [
9], binary similarity [
10], ensemble classifiers [
11], rich model [
12], embedding probabilities [
13], and so on, to deep learning. Various means are used to enhance the detection accuracy of deep learning for steganography. These means include transfer learning [
14,
15], residual architecture [
16,
17], absolute values [
18], channel selection [
19], diverse activation [
20], attention augmentation [
21], feature-guided adaptation [
22], etc. Deep learning is a promising framework that provides state-of-the-art performance for steganalysis; however, it is generally difficult to obtain enough information about the steganographic algorithms an adversary uses. Thus, a training-images shortage (TIS) is an unavoidable scenario and should be dealt with in advance.
There are two ways to solve the problem of TIS: one is data augmentation [23,24], and the other is to generate new images. The former increases the training set size by transforming the available images. Typical transformations include padding, rotation, re-scaling, flipping, translation, cropping, zooming, channel shuffle, and dropout. However, only those that do not remove or suppress the fragile steganographic signals are suitable for steganalysis, namely rotations by integer multiples of 90 degrees and vertical or horizontal flipping. Advanced methods include BitMix [
24] and Pixels-off [
25]. These methods exploit existing images more fully rather than generating brand new images with different features. In this paper, we pursue the second approach, generating new images.
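As an illustration of the augmentations described above as safe for steganalysis, the following minimal pure-Python sketch enumerates the eight variants of an image obtained from rotations by multiples of 90 degrees, each with and without a horizontal flip (a vertical flip is equivalent to a 180-degree rotation followed by a horizontal flip). It is not the augmentation code used in this work; the helper names `rot90`, `flip_horizontal`, and `lossless_augmentations` are hypothetical. These operations only permute pixel positions, so no pixel value is interpolated or altered.

```python
from typing import List

Image = List[List[int]]  # grayscale image stored as rows of pixel values


def rot90(img: Image) -> Image:
    """Rotate the image by 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]


def flip_horizontal(img: Image) -> Image:
    """Mirror the image left to right."""
    return [row[::-1] for row in img]


def lossless_augmentations(img: Image) -> List[Image]:
    """Return 8 variants: 4 rotations, each with and without a flip."""
    variants = []
    current = [list(row) for row in img]
    for _ in range(4):
        variants.append(current)
        variants.append(flip_horizontal(current))
        current = rot90(current)
    return variants


cover = [[1, 2], [3, 4]]
variants = lossless_augmentations(cover)
# Every variant contains exactly the same pixel values as the original.
same_pixels = all(sorted(sum(v, [])) == [1, 2, 3, 4] for v in variants)
```

Operations such as re-scaling or zooming would instead resample pixel values, which is why they risk suppressing the stego signal.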
The motivation behind this research was to generate new training images for spatial domain steganalysis. For convenience, we call the proposed network GLSNet, which is an abbreviation of Generative Learning Steganalysis Network. The main contributions are as follows: (1) We designed a new type of normalization layer to preserve the fragile pseudo stego signals for the generator and to accelerate the training. (2) Three shortcut connections from the input to three types of output layers were added to enhance recognition of the difference between cover image and stego image. (3) The ReLU activation function was replaced with Tanh to produce negative signals in the stego image. Such a structure is good for preserving the noise-like stego signals.
The rest of the paper is arranged as follows: In Section 2, we describe the architecture of GLSNet, the dataset, the steganographic methods, the training details, and the evaluation metrics. In Section 3, we present the experimental results. Finally, Section 4 discusses the conclusions and directions for future work.
2. Materials and Methods
The architecture of GLSNet, the dataset and steganographic method, and training details and evaluation metrics are addressed in this section.
2.1. Architecture
We began with a small set of cover images and the corresponding set of stego images. Training a steganalysis network on so few pairs usually leads to over-fitting.
Our goal was to expand the training image dataset, which would improve the diversity of the training data and delay premature convergence of the steganalysis network. We therefore sought a neural network structure, which we call a generator, that could learn to map a cover image to its corresponding stego image.
In theory, the optimal generator should produce an output distribution identical to the empirical distribution of the real stego images. However, such a generator is not easy to train because the difference between the cover and the stego is extremely subtle: steganography always tries to minimize it. This calls for a residual network, which better captures the subtle difference between the input and the output.
We designed a generator based on residual convolution blocks and added three shortcuts between the blocks to strengthen the retention of subtle differences. The details of the proposed generator are described in Section 2.2.
The overall architecture is called GLSNet (Generative Learning Steganalysis Network). The concept “generative learning” refers to both the training strategy used when steganalysis faces a shortage of training data and the stego image generator placed in front of the steganalysis module. The proposed GLSNet, as shown in Figure 1, is composed of two segments: the front segment, shown on the left of the figure, generates stego images, and the last segment, which detects stego images, is a deep residual convolutional network called SRNet [16]. SRNet provides state-of-the-art detection accuracy for spatial domain steganalysis while minimizing the use of heuristics and externally enforced elements. If we consider steganalysis as a binary classification problem, in which each input is either an original image or a stego image, the detection accuracy is the ratio of correct classifications when the input dataset is a 1:1 mixture of cover images and stego images.
In order to conduct a comparative study, we tested three kinds of generators, including a fully connected generator, as shown in
Figure 2, a CNN (Convolutional Neural Network) generator, as shown in
Figure 3, and a residual CNN generator named RES, as shown in
Figure 4, to evaluate the proposed architecture. The input of the generator was assumed to be a 256 × 256 original cover image. The output of the generator was a 256 × 256 generated stego image. The generator was trained with a small number of paired cover images and stego images. The generator was trained by reducing the error shown below (
and
are the generated stego image and original cover image, respectively,
is the real stego image.)
2.2. Details of the Generator
The RES generator borrows heavily from SRNet [16]; however, it includes three improvements for this application: (1) A new type of normalization layer was designed to preserve the fragile pseudo stego signals. (2) Three shortcut connections from the input to three types of output layers were added to enhance the learning of the difference between the cover image and the stego image, as depicted in
Figure 4. (3) The ReLU activation function was replaced with Tanh to produce negative signals as in the stego image. Such a structure can better preserve the noise-like stego signals.
The RES generator consisted of 12 layers, as shown in
Figure 4, and the details of each layer of the generator are shown in
Table 1.
Existing normalization techniques, such as Batch Normalization [
26], Layer Normalization [
27], Instance Normalization [
28], Group Normalization [
29], etc., were undesirable for the generator because they change the mean and variance of the input simultaneously. This causes destructive proportional distortion; in other words, it destroys the preconditions for generating pseudo steganography signals.
We designed a new normalization method, instance variance normalization (IVN), which achieved more satisfactory results in generating pseudo stego images.
The input of a network layer can be expressed as a multidimensional array $x \in \mathbb{R}^{N \times C \times H \times W}$, where $N$ is the number of images in a batch, $C$ is the number of channels for each image, $H$ is the height of the image, and $W$ is the width of the image.
The mean of an instance
is
and its variance is
We will normalize each pixel of the instance
where ϵ is a small value that prevents the denominator from being zero.
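The following is a minimal sketch of one plausible reading of IVN, namely that each instance is divided by its standard deviation without subtracting its mean. This reading is an assumption made for illustration: the exact formula is not reproduced here, and the function name `instance_variance_norm`, the per-channel scope, and the default value of ϵ are all hypothetical.

```python
import math
from typing import List

Plane = List[List[float]]  # one channel of one image (H x W)


def instance_variance_norm(plane: Plane, eps: float = 1e-5) -> Plane:
    """Scale a plane by its standard deviation WITHOUT removing the mean.

    Assumption: this is only one interpretation of "instance variance
    normalization"; it is not confirmed to be the authors' formula.
    """
    pixels = [p for row in plane for p in row]
    mean = sum(pixels) / len(pixels)
    var = sum((p - mean) ** 2 for p in pixels) / len(pixels)
    scale = math.sqrt(var + eps)  # eps keeps the denominator non-zero
    return [[p / scale for p in row] for row in plane]


# The output has (approximately) unit variance, but its mean is not zero.
normalized = instance_variance_norm([[2.0, 6.0]], eps=0.0)
```

Under this reading, the ratios between pixel values within an instance are unchanged, in contrast to schemes that also subtract the mean.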
The IVN transform was a differentiable transformation that introduced variance normalized activations into the network. The gradient was computed using the chain rule as follows:
Unlike SRNet, we used Tanh as the activation function because its output can be positive or negative and therefore matches the residuals better. The formula of Tanh is as follows:
$\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
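The effect of this choice can be seen in a small numeric sketch. The residual values below are made up for illustration and the `relu` helper is hypothetical; the point is only that ReLU discards every negative component, whereas Tanh keeps the sign.

```python
import math


def relu(x: float) -> float:
    """Rectified linear unit: negative inputs are clipped to zero."""
    return max(0.0, x)


# Illustrative residual values (not taken from the experiments).
residual = [-2.0, -0.5, 0.0, 0.5, 2.0]

relu_out = [relu(r) for r in residual]       # negative entries become 0.0
tanh_out = [math.tanh(r) for r in residual]  # sign of every entry is kept
```

Because a stego residual contains both positive and negative changes, an activation bounded below by zero cannot represent half of the signal.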
2.3. Dataset and Steganographic Method
The proposed architecture was evaluated and contrasted with different generators on the commonly used publicly available source BOSSBase 1.01, which contains 10,000 grayscale images selected from seven types of cameras.
The experiments were executed for S-UNIWARD and Baluja-Net [
1] to cover both the classic steganographic algorithm and the state-of-the-art deep-learning-based embedding network. S-UNIWARD is a well-known content adaptive steganographic algorithm that hides a small message within the texture or noisy regions of a larger image. Baluja-Net is a deep-learning-based steganography method that can place a full-sized image within another image of the same size. Unlike many popular steganographic methods that encode the secret message within the LSB of the cover image, Baluja-Net compresses and distributes the secret image’s representation across all of the available bits.
We place greater emphasis on Baluja-Net because deep-learning-based steganographic methods are highly variable, which makes it more difficult to obtain authentic cover and stego image pairs than with classic low-payload algorithms. To prepare training stego images for the generator, Baluja-Net was trained for 100 epochs with a batch size of 2. The training details are as follows. For each cover image, we randomly selected another image as the secret image for steganography. In this way, 5000 real stego images were generated. An Adam optimizer was used with a learning rate
and coefficient
. Baluja-Net contains a hiding network and a reveal network. After about forty passes over the full dataset of original cover images, the loss values of these two networks stabilized and converged, as shown in
Figure 5.
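The random pairing of each cover image with a different secret image, as described above, can be sketched as follows. This is an illustrative sketch only: the function name `pair_cover_with_secret`, the use of integer indices in place of image files, and the fixed seed are assumptions, not details of the actual training script.

```python
import random
from typing import List, Tuple


def pair_cover_with_secret(num_images: int, seed: int = 0) -> List[Tuple[int, int]]:
    """For every cover index, draw a secret index that differs from it.

    Requires num_images >= 2 so that another image always exists.
    """
    rng = random.Random(seed)
    pairs = []
    for cover_idx in range(num_images):
        # Draw from the num_images - 1 other indices, skipping the cover.
        secret_idx = rng.randrange(num_images - 1)
        if secret_idx >= cover_idx:
            secret_idx += 1
        pairs.append((cover_idx, secret_idx))
    return pairs


# One (cover, secret) pair per cover image, e.g. for 5000 covers.
pairs = pair_cover_with_secret(5000)
```

Drawing from the remaining indices and shifting past the cover index avoids rejection sampling while keeping the choice uniform over the other images.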
Figure 6 shows different scenes selected from BOSSBase. The third column is the residual between the cover and the stego; these residuals are concentrated mainly in regions with complex texture.
2.4. Parameters and Evaluation Metric
The network was trained on a PC with an Intel(R) Core(TM) i7-9700K CPU @ 3.60 GHz, 32 GB of DDR4 memory, and an NVIDIA GeForce RTX 2080 Ti graphics processing unit (GPU) with 11 GB of memory.
Due to limited hardware performance, the input size of our model was set to 256 × 256. Thus, we resized the BOSSBase images from their original size of 512 × 512 to 256 × 256 using cv2.resize with “interpolation = INTER_NEAREST”. We generated 5000 cover and stego image pairs with each of S-UNIWARD and Baluja-Net and split them into a training set and a testing set.
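To illustrate what nearest-neighbour resizing does, the pure-Python sketch below shrinks an image by copying one source pixel for each output pixel, without averaging. It is a simplified stand-in written for this explanation, not the cv2.resize implementation; the function name `resize_nearest` and the floor-based index mapping are assumptions, and the exact index convention of OpenCV may differ.

```python
from typing import List

Image = List[List[int]]  # grayscale image stored as rows of pixel values


def resize_nearest(img: Image, out_h: int, out_w: int) -> Image:
    """Resize by copying the source pixel at the scaled-down coordinate."""
    in_h, in_w = len(img), len(img[0])
    return [
        [img[(y * in_h) // out_h][(x * in_w) // out_w] for x in range(out_w)]
        for y in range(out_h)
    ]


# A 4 x 4 toy image halved to 2 x 2 (the same 2:1 ratio as 512 -> 256).
toy = [[r * 4 + c for c in range(4)] for r in range(4)]
small = resize_nearest(toy, 2, 2)
```

Every output value is an unmodified pixel of the input, whereas bilinear or bicubic interpolation would produce blended values.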
The performance of the GLSNet was measured with the total classification error probability on the testing set,
where
is the number of cover images that are identified as stego images, and
is the number of stego images that are identified as cover images.
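The error measure described above can be sketched in a few lines. The sketch assumes that the total classification error probability is the number of misclassified images (false alarms plus missed detections) divided by the total number of test images; the function and parameter names are hypothetical.

```python
def total_error_probability(
    false_alarms: int,
    missed_detections: int,
    num_cover: int,
    num_stego: int,
) -> float:
    """Fraction of test images that are misclassified.

    false_alarms: cover images identified as stego images.
    missed_detections: stego images identified as cover images.
    """
    return (false_alarms + missed_detections) / (num_cover + num_stego)


# Made-up counts for illustration, on a 1:1 mixture of 500 covers and 500 stegos.
p_e = total_error_probability(100, 150, 500, 500)
accuracy = 1.0 - p_e  # detection accuracy is the complement of the error
```

On a 1:1 mixture of cover and stego images, this quantity equals the average of the false-alarm rate and the missed-detection rate.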
3. Results
We conducted several comparative experiments and display the results in
Table 2. “Non” in the second column denotes steganalysis networks trained with real cover and stego image pairs only, without pseudo stego images.
As shown in Table 2, the proposed RES generator performs better than the FC generator, the CNN generator, and the non-generative training mode. All the models perform better on Baluja-Net than on S-UNIWARD. This is because Baluja-Net embeds a full-sized image in the cover image, which raises the payload to as much as 8 bits per pixel (bpp), whereas the payload of S-UNIWARD was at most 0.4 bpp in the experiment. Compared with the non-generative training mode, the proposed generative learning architecture achieves a significant performance gain when the number of cover-stego image pairs is small. This result is consistent with theoretical expectations. If fewer real cover and stego image pairs are used to train the steganalysis network, its generalization ability and detection accuracy will be greatly restricted. In contrast, the generated image dataset may contain a rich variety of textures, which can increase the generalization ability of the steganalysis network and its detection accuracy; our experiments have confirmed this.
In addition, we examined the impact of the quality and noise of the original cover images on detection performance in preliminary experiments. We used filtering operations to reduce the quality of the cover, or added noise to reduce its signal-to-noise ratio (SNR). These low-quality or low-SNR cover images were then used to generate stego images in the same way as their originals. The resulting cover-stego image pairs were then used to train the proposed GLSNet in the same configuration described in
Section 2. Experimental results show that the quality and SNR of original cover images do not affect the performance of GLSNet. This is because the generator extracts the residual information between the cover and the stego, and this information does not relate to the quality and SNR of the cover.
However, if the image quality or SNR is changed on the cover or the stego only, the generator will treat the change as part of the residual, which will greatly affect its performance. Strictly speaking, such image pairs are incomplete; handling them could be the object of future research.
4. Discussion
From the point of view of game theory, it is difficult to obtain enough information about deep-learning-based steganography. Thus, the Training-Images-Shortage (TIS) problem is prevalent when deep learning steganalysis is applied to deep learning steganography. We proposed a network called GLSNet to address the TIS problem in steganalysis, and the results show that it performs well. As shown in Table 2, even with just one pair of cover and stego images, GLSNet achieves a detection accuracy of 77.75%, compared with 62.83% for the non-generative training paradigm.
The generative learning architecture helps solve the problem of TIS and improves the detection accuracy of steganalysis when only a small number of training images is available. The concept of generative learning could also be applied to other situations where training data are hard to obtain in an adversarial environment, such as anomaly detection in network security or the recognition of enemy targets by radar.
Although this paper demonstrates the feasibility of generative learning for image steganalysis, much work remains. Future research directions include the development of new generative learning network structures and the improvement of their capability to mimic more steganographic methods when generating training stego images.