BASN -- Learning Steganography with Binary Attention Mechanism

Secret information sharing through image carriers has attracted considerable research attention in recent years as images increasingly dominate the Internet and mobile applications. However, with the rise of convolutional neural networks, image steganography faces a more significant challenge from neural-network-automated tasks. To improve the security of image steganography and minimize distortion of task results, models must keep the feature maps generated by task-specific networks independent of any hidden information embedded in the carrier. This paper introduces a binary attention mechanism into image steganography to help alleviate this security issue while also increasing embedding payload capacity. The experimental results show that our method offers high payload capacity with little feature map distortion and still resists detection by state-of-the-art image steganalysis algorithms.


Introduction
Image steganography aims at delivering a modified cover image that secretly carries hidden information while attracting little awareness from third-party supervision. On the other side, steganalysis algorithms are developed to determine whether an image is embedded with hidden information, and therefore resisting steganalysis detection is one of the major indicators of steganography security. Meanwhile, with the rise of convolutional neural networks, a massive number of neural-network-automated tasks are entering industrial practice, such as image auto-labeling through object detection [5,15] and classification [8,21], face recognition [16], and pedestrian re-identification [29]. Image steganography now faces a more significant challenge from these automated tasks: the embedding distortion might greatly influence the task result and inevitably lead to suspicion. Figure 1 shows an example in which LSB-Matching [12] steganography completely alters the image classification result from goldfish to proboscis monkey. Under such circumstances, even a steganography model with outstanding invisibility to steganalysis methods cannot be called secure, since the spurious label might arouse suspicion again and render all efforts vain.

Figure 1: The cover image and the embedded image are both classified by an ImageNet-pretrained ResNet-18 [8] network. The percentage before the predicted class label represents the network's confidence in its prediction. The red, green and blue noisy images in the center represent the altered pixel locations in the corresponding channels during steganography. Each of these images contains only three colors: white stands for no modification, the lighter color for a +1 modification, and the darker color for a -1 modification.

Related Works
Most previous steganography models focus on resisting steganalysis algorithms or raising embedding payload capacity. BPCS [18,19] and PVD [24,25,22] use adaptive embedding based on local complexity to improve the visual quality of the embedded image. HuGO [14] and S-UNIWARD [9] resist steganalysis by minimizing a suitably defined distortion function. Hu [10] adopts a deep convolutional generative adversarial network to achieve steganography without embedding. Wu [26] and Baluja [1] achieve a vast payload capacity by focusing on image-into-image steganography.

Contributions of this work
In this paper, we propose a Binary Attention Steganography Network (abbreviated as BASN) architecture to achieve a relatively high payload capacity (2-3 bpp) with minimal distortion to other neural-network-automated tasks. It utilizes convolutional neural networks with two attention mechanisms, which minimize embedding distortion to the human visual system and to neural network feature maps respectively. Additionally, multiple attention fusion strategies are suggested to balance payload capacity with security, and a fine-tuning mechanism is put forward to improve the hidden information extraction accuracy.

Binary Attention Mechanism
The binary attention mechanism involves two attention models: the image texture complexity (ITC) attention model and the minimizing feature distortion (MFD) attention model. The ITC model mainly focuses on preventing the human visual system from noticing the differences caused by altered pixels. The MFD model minimizes the difference between the high-level features extracted from clean and embedded images so that neural networks do not produce divergent results. The attention in both models serves as a hint for steganography, showing where to embed and how much information the corresponding pixel might tolerate. The overall embedding and extraction architecture is shown in Figure 2. After the two attentions are found with the binary attention mechanism, we may adopt one of several fusion strategies to create the final attention used for embedding and extraction.

Evaluation of Image Texture Complexity
To evaluate an image's texture complexity, variance is adopted in most approaches. However, using variance as the evaluation mechanism enforces very strong pixel dependencies. In other words, every pixel is correlated with all other pixels in the image. We propose a variance pooling evaluation mechanism to relax these cross-pixel dependencies (see Equation 1). Variance pooling is applied to patches rather than to the whole image, which restricts the influence of a pixel value alteration to the corresponding patches. This matters especially during training: when local textures are optimized to reduce their complexity, pixels within the current area should change most, while distant pixels should be preserved so that the overall image contrast, brightness and visual patterns remain untouched.
In Equation 1, $X$ is a 2-dimensional random variable which can be either an image or a feature map, and $i$, $j$ are the indices of each dimension. The operator $E(\cdot)$ calculates the expectation of the random variable. VarPool2d applies a kernel mechanism similar to other 2-dimensional pooling or convolution operations, and $k_i$, $k_j$ indicate the kernel indices of each dimension.
To further show the difference in gradient updates between variance and variance pooling during backpropagation, we applied the backpropagated gradients directly to the image to visualize how they influence the image itself during training (see Equations 3 and 4 for the training loss and Figure 3 for the comparison).
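The locality property of variance pooling can be illustrated with a small sketch. It is not the authors' implementation: Equation 1 is not reproduced in this text, so the code assumes the population variance $E((X - E(X))^2)$ computed independently inside each kernel window, and the function name `var_pool2d`, the default non-overlapping stride, and the absence of padding are all illustrative assumptions.

```python
def var_pool2d(x, kernel=2, stride=None):
    """Population variance of every kernel x kernel patch of a 2-D list.

    Assumptions (not from the paper): no padding, and the stride defaults
    to the kernel size so that patches do not overlap.
    """
    stride = stride or kernel
    rows, cols = len(x), len(x[0])
    out = []
    for i in range(0, rows - kernel + 1, stride):
        out_row = []
        for j in range(0, cols - kernel + 1, stride):
            patch = [x[i + ki][j + kj]
                     for ki in range(kernel) for kj in range(kernel)]
            mean = sum(patch) / len(patch)
            out_row.append(sum((p - mean) ** 2 for p in patch) / len(patch))
        out.append(out_row)
    return out


image = [
    [1, 1, 5, 5],
    [1, 1, 5, 5],
    [2, 4, 0, 8],
    [4, 2, 8, 0],
]
pooled = var_pool2d(image, kernel=2)  # [[0.0, 0.0], [1.0, 16.0]]

# Altering one pixel only changes the variance of the patch that contains it.
altered = [row[:] for row in image]
altered[0][0] = 9
pooled_altered = var_pool2d(altered, kernel=2)  # [[12.0, 0.0], [1.0, 16.0]]
```

With a global variance, the same single-pixel change would shift the mean used for every pixel, which is the cross-pixel dependency that variance pooling is meant to relax.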

ITC Attention Model
The ITC (Image Texture Complexity) attention model aims to embed information without being noticed by the human visual system, or in other words, to make only a just noticeable difference (JND) to cover images while ensuring the largest embedding payload capacity [28]. In texture-rich areas, it is possible to alter pixels to carry hidden information without being noticed. Finding the ITC attention means finding the positions of the image pixels that tolerate mutations, together with their corresponding capacity.
Here we introduce two concepts:
1. A hyper-parameter $\theta$ representing the ideal embedding payload capacity that the input image might achieve.
2. An ideal texture-free image $C_\theta$ corresponding to the input image, which is visually similar but has the lowest texture complexity possible under the restriction of at most $\theta$ changes.
With the help of these concepts, we can formulate the aim of the ITC attention model as follows: for each cover image $C$, the ITC model $f_{itc}$ needs to find an attention $A_{itc} = f_{itc}(C)$ that minimizes the texture complexity evaluation function $V_{itc}$ (Equations 5 and 6). The $\theta$ in Equation 6 is used as an upper bound to limit the size of the attention area. If trained without it, the model $f_{itc}$ is free to output an all-ones matrix $A_{itc}$ to acquire an optimal texture-free image. An image with the least amount of texture is a solid-color image, which does not help find the correct texture-rich areas.
The detailed model architecture used in the actual training process is shown in Figure 6, and two parts of the formulation are slightly modified to ensure better training results. First, the ideal texture-free image $C_\theta$ in Equation 5 does not actually exist, but it can be approximated. In this paper, median pooling with a kernel size of 7 is used to simulate the ideal texture-free image. It helps eliminate detailed textures within patches without touching object boundaries (see Figure 4 for a comparison among different smoothing techniques). Second, we adopt soft bound limits in place of the hard upper bound, in the form of Equation 7 (visualized in Figure 9). Soft limits help generate smoothed gradients and provide optimizing directions.
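The following minimal sketch illustrates why median pooling is a reasonable stand-in for the texture-free image: it removes isolated detail while leaving a sharp boundary in place. Only the kernel size of 7 comes from the text; the function name `median_pool2d`, the stride of 1, and the edge-replication padding are assumptions made for the example.

```python
from statistics import median


def median_pool2d(x, kernel=7):
    """Stride-1 median filter over a 2-D list.

    Assumption (not from the paper): borders are handled by replicating
    the nearest edge pixel.
    """
    rows, cols = len(x), len(x[0])
    r = kernel // 2
    out = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            window = [
                x[min(max(i + di, 0), rows - 1)][min(max(j + dj, 0), cols - 1)]
                for di in range(-r, r + 1)
                for dj in range(-r, r + 1)
            ]
            out_row.append(median(window))
        out.append(out_row)
    return out


# An 8x8 image with a vertical object boundary and one textured speckle.
cover = [[10] * 4 + [200] * 4 for _ in range(8)]
cover[3][1] = 90
smoothed = median_pool2d(cover, kernel=7)
```

In this toy case the speckle at row 3, column 1 is removed and every output row is again four pixels of 10 followed by four pixels of 200, so the boundary between the two regions stays where it was.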
The overall loss for training the ITC attention model is listed in Equations 8 and 9, and Figure 5 shows the effect of ITC attention on image texture complexity.

$$\mathrm{Loss}_{itc} = \lambda \cdot \mathrm{VarLoss} + (1 - \lambda) \cdot \mathrm{AreaPenalty}_{itc} \tag{9}$$

MFD Attention Model
The MFD (Minimizing Feature Distortion) attention model aims to embed information with the least impact on the features extracted by neural networks. Its attention also indicates the positions of the image pixels that tolerate mutations, together with their corresponding capacity. For each cover image $C$, the MFD model $f_{mfd}$ needs to find an attention $A_{mfd} = f_{mfd}(C)$ that minimizes the distance between the cover image features $f_{nn}(C)$ and the embedded image features $f_{nn}(S)$ after information is embedded into the cover image according to this attention.

$$\text{minimize} \quad L_{fmrl}\big(f_{nn}(C), f_{nn}(S)\big) \tag{11}$$

Here, $C$ stands for the cover image and $S$ stands for the corresponding embedded image. $L_{fmrl}(\cdot)$ is the feature map reconstruction loss, and $\alpha$, $\beta$ are thresholds limiting the area of the attention map, playing the same role as $\theta$ in the ITC attention model.
The actual training of the MFD attention model is split into two phases (see Figure 6). The first training phase aims to initialize the weights of the encoder blocks, using the left path shown in Figure 6 as an autoencoder. In the second training phase, all the weights of the decoder blocks are reset and the model takes the right path to generate MFD attentions. The encoder and decoder block architectures are shown in Figure 8.
The overall training pipeline in the second phase is shown in Figure 7. The weights of the two MFD blocks colored in purple are shared, while the weights of the two task-specific neural network blocks colored in yellow are frozen. In the training process, the task-specific neural network works only as a feature extractor, and therefore the approach can be extended to multiple tasks simply by reshaping and concatenating feature maps together. Here we adopt ResNet-18 [8] as an example for minimizing embedding distortion to the classification task.
The overall loss for training the MFD attention model (phase 2) is listed in Equation 13. The $L_{fmrl}$ (Feature Map Reconstruction Loss) uses the $L_2$ loss to reconstruct between the feature maps extracted from the cover image and those extracted from the embedded image. The $L_{cerl}$ (Cover Embedded image Reconstruction Loss) and $L_{atrl}$ (Attention Reconstruction Loss) use the $L_1$ loss to reconstruct between the cover images and the embedded images, and between their corresponding attentions. The $L_{atap}$ (AT-  (visualized in Figure 9). The visual effect of MFD attention embedding with random noise is shown in Figure 10.
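As a rough illustration of the two reconstruction distances named above, the sketch below computes an $L_2$ distance between feature vectors and an $L_1$ distance between pixel values. It is not the paper's Equation 13: the weighting of the individual terms is not given in this text, so the terms are left uncombined, and the mean reduction, the flattened list inputs, and the function names are assumptions.

```python
def l2_loss(a, b):
    """Mean squared difference between two equally sized flat lists."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def l1_loss(a, b):
    """Mean absolute difference between two equally sized flat lists."""
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


# Hypothetical flattened feature maps f_nn(C) and f_nn(S).
cover_features = [0.5, 1.0, -2.0]
embedded_features = [0.5, 1.5, -1.0]
fmrl = l2_loss(cover_features, embedded_features)

# Hypothetical flattened pixel values of the cover and embedded images.
cover_pixels = [10, 20, 30, 40]
embedded_pixels = [10, 21, 29, 40]
cerl = l1_loss(cover_pixels, embedded_pixels)
```

The same `l1_loss` helper would apply to a pair of attention maps for the attention reconstruction term.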

Fusion Strategies, Finetune Process and Inference Techniques
The fusion strategies merge the ITC and MFD attention models into one attention model, and thus it is essential that they be consistent and stable. In this paper, two fusion strategies, minima fusion and mean fusion, are put forth as Equations 15 and 16. The minima fusion strategy aims to improve security, while the mean fusion strategy yields more payload capacity for embedding.
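The trade-off between the two strategies can be sketched as follows. Equations 15 and 16 are not reproduced in this text, so the code assumes the natural reading of the names: an element-wise minimum and an element-wise arithmetic mean of the two attention maps. The function names and the sample values are illustrative.

```python
def minima_fusion(a_itc, a_mfd):
    """Element-wise minimum of two attention maps (assumed form)."""
    return [[min(p, q) for p, q in zip(r1, r2)]
            for r1, r2 in zip(a_itc, a_mfd)]


def mean_fusion(a_itc, a_mfd):
    """Element-wise arithmetic mean of two attention maps (assumed form)."""
    return [[(p + q) / 2 for p, q in zip(r1, r2)]
            for r1, r2 in zip(a_itc, a_mfd)]


a_itc = [[0.25, 0.75], [0.5, 1.0]]
a_mfd = [[0.75, 0.25], [0.5, 0.0]]

fused_min = minima_fusion(a_itc, a_mfd)   # [[0.25, 0.25], [0.5, 0.0]]
fused_mean = mean_fusion(a_itc, a_mfd)    # [[0.5, 0.5], [0.5, 0.5]]
```

Under this reading, the minimum only allows embedding where both models agree that a pixel tolerates change, which is the conservative choice, while the mean is never smaller than the minimum at any pixel and therefore leaves more capacity.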
After a fusion strategy is applied, a finetuning process is required to improve attention reconstruction on embedded images. The finetuning process is split into two phases. In the first phase, the ITC model is finetuned as shown in Figure 11. The two ITC models colored in purple share the same network weights, and the MFD model weights are frozen. Besides the image texture complexity loss (Equation 8) and the ITC area penalty (Equation 7), the loss additionally involves an attention reconstruction loss using the $L_1$ loss, similar to $L_{atrl}$ in Equation 13. In the second phase, the new ITC model is frozen, and the MFD model is finetuned using its original loss (Equation 13).
After finetuning, the ITC model appears to be more interested in the texture-complex areas and ignores the areas that might introduce noise into the attention (see Figure 12). When using the model for inference after finetuning, two extra techniques are proposed to strengthen steganography security. The first technique is named Least Significant Masking (LSM), which masks the lowest several bits of the attention during embedding. After the hidden information is embedded, the masked bits are restored to the original data to disturb steganalysis methods. The second technique is called Permutative Straddling, which sacrifices some payload capacity to straddle between hidden bits and cover bits [23]. It is achieved by scattering the effective payload bit locations across the overall embedded locations using a random seed. The hidden bits are then arranged sequentially in the effective payload bit locations. The random seed is required to restore the hidden data.

Figure 12: ITC Attention After Finetune. The first column shows the original image, the second column shows the ITC attention before any finetuning, the third column shows the ITC attention after finetuning for the minima fusion strategy, and the fourth column shows the ITC attention after finetuning for the mean fusion strategy.
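Permutative straddling as described above can be sketched on a flat list of bits. This is an illustrative reading rather than the authors' code: the use of Python's `random.Random` as the seeded generator, the function names, and the representation of embeddable locations as a flat bit list are assumptions.

```python
import random


def straddle_positions(total_slots, payload_len, seed):
    """Choose payload_len of total_slots locations with a seeded shuffle."""
    if payload_len > total_slots:
        raise ValueError("payload does not fit into the available slots")
    order = list(range(total_slots))
    random.Random(seed).shuffle(order)
    # Sorting keeps the hidden bits in sequential order within the chosen slots.
    return sorted(order[:payload_len])


def embed(cover_bits, hidden_bits, seed):
    """Write hidden bits into the seeded locations; other slots keep cover bits."""
    positions = straddle_positions(len(cover_bits), len(hidden_bits), seed)
    stego = list(cover_bits)
    for pos, bit in zip(positions, hidden_bits):
        stego[pos] = bit
    return stego


def extract(stego_bits, payload_len, seed):
    """Recover the hidden bits; requires the same seed used for embedding."""
    positions = straddle_positions(len(stego_bits), payload_len, seed)
    return [stego_bits[p] for p in positions]


cover_bits = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0]
hidden_bits = [1, 0, 1, 1]
stego_bits = embed(cover_bits, hidden_bits, seed=2019)
recovered = extract(stego_bits, len(hidden_bits), seed=2019)
```

Only 4 of the 12 embeddable locations carry payload here, which is the capacity sacrificed in exchange for leaving the remaining locations as unmodified cover bits.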

Experiments Configurations
To demonstrate the effectiveness of our model, we conducted experiments on the ImageNet dataset [3]. Specifically, the ILSVRC2012 dataset is used, with 1,281,167 images for training and 50,000 for testing. Our work is trained on one NVIDIA GTX 1080 GPU, and we adopt a batch size of 32 for all models. The ITC model is trained with the Adam optimizer [11] at a learning rate of 0.01, the first phase of the MFD model with the Nesterov momentum optimizer [20] at a learning rate of 1e-5, and the second phase of the MFD model with the Adam optimizer at a learning rate of 0.01.
In the model names, the value after LSM is the number of bits masked during the embedding process, and the value after PS is the maximum payload capacity to which the embedded image is limited during permutative straddling.
All the validation processes use the compressed version of The Complete Works of William Shakespeare [17] provided by Project Gutenberg [7]. It can be downloaded from [6].

Steganalysis Experiments
To ensure that our model is robust to steganalysis methods, we test our models using StegExpose [2], sweeping the detection threshold linearly from 0.00 to 1.00 with a step of 0.01. The ROC curve is shown in Figure 14, where a true positive is an embedded image correctly identified as containing hidden data and a false positive is a clean image falsely classified as an embedded image. The figure compares several of our models with StegNet [26] and Baluja-2017 [1], plotted as scatter data connected by dashed lines. It demonstrates that StegExpose works only a little better than random guessing and that most BASN models perform better than StegNet and Baluja-2017. Our model is further examined with learning-based steganalysis methods [13,4,27]. All of these models are trained with our cover and embedded images. Their corresponding ROC curves are shown in Figure 14. The SRM [4] method works quite well on our models with larger payload capacity; however, in real-world applications we can always keep our dataset private and thus ensure high security against learning-based steganalysis methods. Figure 15 shows that our model has very little influence on the targeted neural-network-automated task, which in this case is classification. Most embedded images, even those carrying more than 3 bpp of hidden information, incur an average distortion of only 2%.
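The threshold sweep used for the StegExpose ROC curve can be sketched as below. The detector scores are made-up values for illustration, and the rule that an image is flagged when its score is greater than or equal to the threshold is an assumption about how the detector output is thresholded.

```python
def roc_points(clean_scores, embedded_scores, steps=100):
    """Return (false positive rate, true positive rate) for each threshold.

    Thresholds run from 0.00 to 1.00 in increments of 1/steps.
    Assumption: an image is flagged as embedded when score >= threshold.
    """
    points = []
    for n in range(steps + 1):
        threshold = n / steps
        true_pos = sum(s >= threshold for s in embedded_scores)
        false_pos = sum(s >= threshold for s in clean_scores)
        points.append((false_pos / len(clean_scores),
                       true_pos / len(embedded_scores)))
    return points


# Hypothetical detector scores in [0, 1) for clean and embedded images.
clean_scores = [0.1, 0.2, 0.3, 0.4]
embedded_scores = [0.2, 0.3, 0.5, 0.6]
curve = roc_points(clean_scores, embedded_scores)
```

A detector that cannot separate the two score distributions yields points close to the diagonal where the two rates are equal, which corresponds to random guessing.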

Conclusion
This paper proposes an image steganography method based on a binary attention mechanism to ensure that steganography has little influence on neural-network-automated tasks. The first attention mechanism, the image texture complexity (ITC) model, helps track down the pixel locations that can be modified without being noticed by the human visual system, together with their tolerance of modification. The second mechanism, the minimizing feature distortion (MFD) model, further keeps down the embedding impact through feature map reconstruction. Moreover, attention fusion and finetuning techniques are proposed to improve security and hidden information extraction accuracy. The experiments indicate that the secret information embedded by our method is imperceptible enough for the embedded images to effectively resist detection by several steganalysis algorithms.