StegNet: Mega Image Steganography Capacity with Deep Convolutional Network

Traditional image steganography often leans interests towards safely embedding hidden information into cover images with payload capacity almost neglected. This paper combines recent deep convolutional neural network methods with image-into-image steganography. It successfully hides the same size images with a decoding rate of 98.2% or bpp (bits per pixel) of 23.57 by changing only 0.76% of the cover image on average. Our method directly learns end-to-end mappings between the cover image and the embedded image and between the hidden image and the decoded image. We~further show that our embedded image, while with mega payload capacity, is still robust to statistical analysis.


Introduction
Image steganography, aiming at delivering a modified cover image to secretly transfer hidden information inside with little awareness of the third-party supervision, is a classical computer vision and cryptography problem. Traditional image steganography algorithms go to their great length to hide information into the cover image while little consideration is tilted to payload capacity, also known as the ratio between hidden and total information transferred. The payload capacity is one significant factor to steganography methods because if more information is to be hidden in the cover, the visual appearance of the cover is altered further and thus the risk of detection is higher (The source code is available at: https://github.com/adamcavendish/StegNet-Mega-Image-Steganography-Capacitywith-Deep-Convolutional-Network).
The most commonly used image steganography for hiding large files during transmission is embedding a RAR archive (Roshal ARchive file format) after a JPEG (Joint Photographic Experts Group) file. In such way, it can store an infinite amount of extra information theoretically. However, the carrier file must be transmitted as it is, since any third-party alteration to the carrier is going to destroy all the hidden information in it, even just simply read out the image and save it again will corrupt the hidden information.
To maximize the payload capacity while still resistible to simple alterations, pixel level steganography is majorly used, in which LSB (least significant bits) method [1], BPCS [2] (Bit Plane Complexity Segmentation), and their extensions are in dominant. LSB-based methods can achieve a payload capacity of up to 50%, or otherwise, a vague outline of the hidden image would be exposed (see Figure 1). However, most of these methods are vulnerable to statistical analysis, and therefore it can be easily detected. Some traditional steganography methods with balanced attributes are hiding information in the JPEG DCT components. For instance, A. Almohammad's work [3] provides around 20% of payload capacity (based on the patterns) and still remains undetected through statistical analysis.
Most secure traditional image steganography methods recently have adopted several functions to evaluate the embedding localizations in the image, which enables content-adaptive steganography. HuGO [4] defines a distortion function domain by giving every pixel a changing cost or embedding impact based on its effect. It uses a weighted norm to represent the feature space. WOW (Wavelet Obtained Weights) [5] embeds information according to the textural complexity of the image regions. Work [6,7] have discussed some general ways of content-adaptive steganography to avoid statistical analysis. Work [8] is focusing on content-adaptive batched steganography. These methods highly depend on the patterns of the cover image, and therefore the average payload capacity can be hard to calculate.
The major contributions of our work are as follows: (i) We propose a methodology to apply neural networks for image steganography to embed image information into image information without any help of traditional steganography methods. (ii) Our implementation raises image steganography payload capacity to an average of 98.2% or 23.57 bpp (bits per pixel), changing only around 0.76% of the cover image (See Figure 2). (iii) We propose a new cost function named variance loss to suppress noise pixels generated by generator network. (iv) Our implementation is robust to statistical analysis and 4 other widely used steganography analysis methods.
The decoded rate is calculated by the cover changing rate is calculated by and the bpp (bits per pixel) is calculated by where C, H, E, D symbols stand for the cover image (C), the hidden image (H), the embedded image (E) and the decoded image (D) in correspondence, and "8, 3" stands for number of bits per channel and number of channels per pixel respectively.  This paper is organized as follows. Section 2 will describe traditional high-capacity steganography methods and the convolution neural network used by this paper. Section 3 will unveil the secret why the neural network can achieve the amount of capacity encoding and decoding images. The architecture and experiments of our neural network are discussed in Sections 4 and 5, and finally, we'll make a conclusion and put forward some future works in Section 6.

Steganography Methods
Most steganography methods can be grouped into three basic types, which is image domain steganography, transform domain steganography and file-format-based steganography. Image domain ones have an advantage of simplicity and better payload capacity while being more likely to be detected. Transform domain ones usually have a more complex algorithm but hides pretty well through third-party analysis. File-format-based ones depend very much on the file format which makes it quite fragile to alterations.

JPEG RAR Steganography
The JPEG RAR Steganography is a kind of file-format-based steganography, which uses a feature in these two file format specifications. (JPEG [9] and RAR [10]) After the JPEG file has scanned the segment of EOI (End Of Image) (0xd9 in hex format), all the remaining segments are ignored (skipped), and therefore any information is allowed to be appended afterward. A RAR file [10] has the magic file header "0x52 0x61 0x72 0x21 0x1a 0x07 0x00" in hex format ("Rar!" as characters) and the parser will ignore all the information before the file header. It is possible to dump the binary of the RAR file after the JPEG file, and it'll apparently act as if it is a JPEG image file while it is actually also a RAR archive. However, the method is very fragile to any file alterations. Third-party surveillance might truncate useless information to save transmission resource or apply some image alterations to attack potential steganography. Any alteration will crash the steganography, and all hidden information is lost.

LSB (Least Significant Bit) Method
LSB (Least Significant Bit)-based methods [1] are the most commonly used image domain steganography methods which hide information at the pixel level. Most LSB methods aim at altering parts of the cover image to such an extent that human visual system can barely notice. These methods are motivated by the fact that the visual part of most figures is dominated by the highest bits of each pixel, and the LSB bits (the underlined part of one pixel as shown in Figure 3) are statistically similar to randomly generated data, and therefore, hiding information via altering LSB cannot change the visual result apparently. The embedding operation of LSB method for the least bit single channel image is described as follows: where S i , C i and M i are the ith pixel of image after steganography, ith pixel of cover image and ith bit of the hiding message.
Since the least significant bits of the image data should look like random data, there are major two schemes in distributing the hiding data. The first kind of methods is to put in the hiding message sequentially after encrypting or compressing to achieve the randomness. The second kind of methods is scattering the hiding data by adopting a mutually acknowledged random seed by which generates the actual hiding sequence [11].

JPEG Steganography
JPEG steganography, i.e., Chang's work [12] and A. Almohammad's work [3], is a part of transform domain steganography. JPEG format examines an image in 8 × 8 blocks of pixels, converts from RGB color space into YCrCb (luminance and chrominance) color space, applies DCT (Discrete Cosine Transformation), quantizes the result, and entropy encodes the rest. After lossy compression, which is after quantization, the hidden information is hidden into the quantized DCT components, which serves as an LSB embedding in the transformed domain. As a result, it is quite hard to detect using statistical analysis and comparably lower payload capacity to LSB method.

Convolutional Neural Network
Convolutional neural network [13], though dates back to the 1990s, is now trending these years after AlexNet [14] won the championship of ImageNet competition. It has successfully established new records in many fields like classification [15], object segmentation [16], etc. A lot of factors boosted the progress including the development of modern GPU hardware, the work of ReLU (Rectified Linear Unit) [17] and its extensions, and finally, the abundance of training data [18]. Our work also benefits a lot from these factors.
The convolution operation is not solely used in neural networks, but also widely used in traditional computer vision methods. For instance, gaussian smoothing kernel is extensively used for image blurring and noise reduction, which, in implementation, is equal to applying a convolution between the original image and a gaussian function. Many other contributions in traditional methods are handcrafted patterns, kernels or filter combinations, i.e., the Sobel-Feldman filter [19] for edge detection, Log-Gabol filter [20] for texture detection, HOG [21] for object detection, etc.
However, designing and tuning handcrafted patterns are highly technical and might be effective for only some tasks. On the contrary, convolutional neural networks have the advantage of automatically creating patterns for specific tasks through back-propagation [22] on its own, and even further, high-level features can be easily learned through combinations of convolution operations [23][24][25].

Autoencoder Neural Network
Our method is inspired by traditional autoencoder neural networks [26], which was originally trained to generate an output image the same as input image in appearance. It is usually made up of two neural networks, one encoding network h = f (x) and one decoding network d = g(h), restricted under d = x, who finally can learn the conditional probability distribution of p(h|x) and p(x|h) correspondently. The autoencoder architecture has shown the ability to extract salient features in from images seen through shrinking hidden layer (h)'s dimension, which has been applied to various fields, i.e., denoising [27], dimension reduction [28], image generation [29], etc.

Neural Network for Steganography
Recently there are some works on applying neural networks for steganography. El-Emam [30] and Saleema [31] work on using neural networks to refine the embedded image generated via traditional steganography methods, i.e., LSB method. Volkhonskiy's [32] and Shi's [33] work focus on generating secure cover images for traditional steganography methods to apply image steganography. Baluja [34] is working on the same field as StegNet. However, the hidden image is slightly visible on residual images of the generated embedded images. Moreover, his architecture uses three networks which requires much more GPU memory and takes more time to embed.

High-order Transformation
In image steganography, we argue that we should not only focus on where to hide information, which most traditional methods work on, but we should also focus on how to hide it.
Most traditional steganography methods usually directly embed hidden information into parts of pixels or transformed correspondances. The transformation regularly occurs in where to hide, either actively applied in the steganography method or passively applied because of file format. As a result, the payload capacity is highly related and restricted to the area of the texture-rich part of the image detected by the handcoded patterns.
DCT-based steganography is one of the most famous transform domain steganography. We can consider the DCT process in JPEG lossy compression process as a kind of one-level high-order transformation which works at a block level, converting each 8 × 8 or 16 × 16 block of pixel information into its corresponding frequency-domain representation. Even hiding in DCT transformed frequency-domain data, traditional works [3,12] embed hidden information in mid-frequency band via LSB-alike methods, which eventually cannot be eluded.
While in contrast, deep convolution neural network makes multi-level high-order transformations possible for image steganography. Figure 4 shows the receptive field of one high-level kernel unit in a demo of a three-layer convolutional neural network. After lower-level features are processed by kernels and propagated through activations along middle layers, the receptive field of final higher-level kernel unit is able to absorb 5 lower-level features of the first layer and form its own higher-level feature throughout the training process.

Trading Accuracy for Capacity
Traditional image steganography algorithms mostly embed hidden information as it is or after applying lossless transformations. After decoding, the hidden information is extracted as it is or after the corresponding detransformations are applied. Therefore, empirically speaking, it is just as file compression methods, where lossless compression algorithms usually cannot outperform lossy compression algorithms in capacity.
We need to think in a "lossy" way in order to embed almost equal amount of information into the cover. The model needs to learn to compress the cover image and the hidden image into an embedding of high-level features and converts them into an image that appears as similar as the cover image, which comes to the vital idea of trading accuracy for capacity.
Trading accuracy for capacity means that we do not limit our model in reconstructing at a pixel-level accuracy of the hidden image, but aiming at "recreating" a new image with most of the features in it with a panoramic view, i.e., the spider in the picture, the pipes' position relatively correct, the outline of the mountain, etc.
In other words, the traditional approaches work in lossless ways, which after some preprocessing to the hidden image, the transformed data is crammed into the holes prepared in the cover image. However, StegNet approach decoded image has no pixel-wise relationship with the hidden image at all, or strictly speaking, there is no reasonable transformation between each pair of corresponding pixels, but the decoded image as a whole can represent the original hidden image's meaning through neural network's reconstruction.
In the encoding process, the model needs to transform from a low-level massive amount of pixel-wise information into some high-level limited sets of featurewise information with an understanding of the figure, and come up with a brand new image similar to the cover apparently but with hidden features embedded. In the decoding process, on the contrary, the model is shown only the embedded figure, from which both cover and hidden high-level features are extracted, and the hidden image is rebuilt according to network's own comprehension.
As shown in Figures 5 and 6, StegNet is not applying LSB-like or simple merging methods to embed the hidden information into the cover. The residual image is neither simulating random noise (LSB-based approach, see Figure 7) nor combining recognizable hidden image inside. The embedded pattern is distributed across the whole image and even magnified 5 to 10 times, the residual image is similar to the cover image visually which can help decrease the abnormality exposed to the human visual system and finally avoid to be detected. The residual image is computed via and the magnification or the enhancement operation is achieved via where I takes residual images, which are effectively normalized to [0, 1] and M is the magnification ratio, which 5 and 10 are chosen visualize the differences in this paper.

Architecture Pipeline
The whole processing pipeline is shown in Figure 8, which consists of two almost identical neural network structure responsible for encoding and decoding. The identical structures are taken from Autoencoder [26], GAN [35], etc., which help the neural network model similar high-level features of images in their latent space. The details of embedding and decoding structure are described in Figure 9. In the embedding procedure, the cover image and the hidden image are concatenated by channel while only the embedded image is shown to the network. Two parts of the network are both majorly made up of one lifting layer which lifts from figure channels to a uniform of 32 channels, six 3 × 3 basic building blocks raising features into high-dimensional latent space and one reducing layer which transforms features back to image space.  The basic building block named "Separable Convolution with Residual Block" (abbreviated as "SCR" in the following context) has the architecture as Figure 10. We adopt batch-normalization [36] and exponential linear unit (ELU) [37] for quicker convergence and better result.

Separable Convolution with Residual Block
Our work adopt the state of the art neural network structure, the skip connections in Highway Network [38], ResNet [39] and ResNeXt [40], and separable convolution [41] together to form the basic building block "SCR".
The idea behind separable convolution [41] originated from Google's Inception models [42,43] (see Figure 11 for its building blocks), and the hypothesis behind is that "cross-channel correlations and spatial correlations are decoupled". Further, in Xception architecture [41], it makes an even stronger hypothesis that "cross-channel correlations and spatial correlations can be mapped completely separately". Together with skip-connections [39] the gradients are preserved in backpropagation process via skip-connections to frontier layers and as a result, ease the problem of vanishing gradients.

Training
Learning the end-to-end mapping function from cover and hidden image to embedded image and embedded image to decoded image requires the estimation of millions of parameters in the neural network. It is achieved via minimizing the weighted loss of L 1 -loss between the cover and the embedded image, L 1 -loss between the hidden and the decoded image, and their corresponding variance losses (variance should be computed across images' height, width and channel). C, H, E, D symbols stand for the cover image (C), the hidden image (H), the embedded image (E) and the decoded image (D) in correspondence. (See Equations (6)- (8)) L CE is used to minimize the difference between the embedded image and the cover image, while L HD is for the hidden image and the decoded image. Choosing only to decode the hidden image while not both the cover and the hidden images are under the consideration that the embedded image should be a concentration of high-level features apparently similar to the cover image whose dimension is half the shape of those two images, and some trivial information has been lost. It would have pushed the neural network to balance the capacity in embedding the cover and the hidden if both images are extracted at the decoding process.
Furthermore, adopting variance losses helps to give a hint to the neural network that the loss should be distributed throughout the image, but not putting at some specific position (See Figure 12 for differences between. The embedded image without variance loss shows some obvious noise spikes (blue points) in the background and also some around the dog nose).

Environment
Our work is trained on one NVidia GTX1080 GPU and we adopt a batch size of 64 using Adam optimizer [44] with learning rate at 10 −5 . We use no image augmentation and restrict model's input image to 64 × 64 in height and width because of memory limit. Training with resized 64 × 64 ImageNet can yield pretty good results. We use 80% of the ImageNet dataset for training and the remaining for testing to verify the generalization ability of our model. Figure 13 shows the result of applying StegNet steganography method on a batch of images.

Statistical Analysis
The encoded and decoded images comparison between StegNet and LSB method are presented in Figure 2. They are very similar though, however, there is one critical flaw about the LSB method, in that it does not suffer through statistical analysis, and therefore LSB method is usually combined with transformations of the hidden image, i.e., compression, randomization, etc. Figure 14 is a comparison of histogram analysis between LSB method and our work. It shows a direct view of robustness of StegNet against statistical analysis, which the StegNet embedded's histogram and the cover image's histogram are much more matching.
A more all-around test is conducted through StegExpose [45], which combines several decent algorithms to detect LSB-based steganography, i.e., sample pair analysis [46], RS analysis [47], chi-square attack [48] and primary sets [49]. The detection threshold is its hyperparameter, which is used to balance true positive rate and false positive rate of the StegExpose's result. The test is performed with linear interpolation of detection threshold from 0.00 to 1.00 with 0.01 as the step interval. Figure 15 is the ROC curve, where true positive stands for an embedded image correctly identified that there are hidden data inside while false positive means a clean figure falsely classified as an embedded image. The figure is plotted in red-dash-line-connected scatter data, showing that StegExpose can only work a little better than random guessing, the line in green. In other words, the proposed steganography method can better resist StegExpose attack.

Conclusions and Future Work
We have presented a novel deep learning approach for image steganography. We show that the conventional image steganography methods mostly do not serve with good payload capacity. The proposed approach, StegNet, creates an end-to-end mapping from the cover image, hidden image to embedded image and from embedded image to decoded image. It has achieved superior performance than traditional methods and yet remains quite robust.
As seen in Figure 13, there is still some noise generated at non-texture-rich areas, i.e., plain white or plain black parts. The variance loss adopted by StegNet might not be the optimal solution to loss distribution.
In addition to the idea of "trading accuracy for capacity", the embedded image does not need to be even visually similar to the cover image. The only requirement to the embedded image is to pass the third-party supervision and the hidden image should be successfully decoded after the transmission is complete, and therefore the embedded image can look similar to anything that is inside the cover image dataset while can look nothing related to anything that is inside the hidden image dataset. Some of the state of the art generative models in neural networks can help achieve it, i.e., Variational Autoencoders [29,50], Generative Adversarial Networks [35,51,52], etc.
Some work is needed for non-equal sized images steganography since "1:1" image steganography is huge for traditional judgment; however, the ability of neural networks still remains to be discovered. Whether it is possible to generate approaches for even better capacity, or with a better visual quality for even safer from detections. Some other work is needed for non-image hidden information steganography, i.e., text information, binary data. In addition to changing the hidden information type, the cover information type may also vary from text information to even videos. Furthermore, some work is needed for applying StegNet on lossy-compressed image file formats or third-party spatial translations, i.e., cropping, resizing, stretching, etc.