Image Steganalysis via Diverse Filters and Squeeze-and-Excitation Convolutional Neural Network

Abstract: Steganalysis is a method to detect whether objects contain secret messages. With the popularity of deep learning, steganalytic schemes based on convolutional neural networks (CNNs) have become the chief method of combating steganography in recent years. However, the diversity of filters has not been fully utilized in current research. This paper constructs a new, effective network with diverse filter modules (DFMs) and squeeze-and-excitation modules (SEMs), which can better capture the embedding artifacts. As the essential parts, the DFMs combine convolution filters of three different scales to process information diversely, and the SEMs enhance the effective channels output by the DFMs. The experiments show that our CNN is effective against content-adaptive steganographic schemes, such as the S-UNIWARD and WOW algorithms, at different payloads. Moreover, our approach is compared with several state-of-the-art methods to demonstrate its performance.


Introduction
With the rapid development of information technology, covert communication methods using steganography have attracted increasing attention in recent years. As steganography improves, it becomes more difficult to find the embedding traces in objects, because content-adaptive steganographic algorithms hide the secret information in the textured areas of the image; examples include HUGO [1], S-UNIWARD [2], WOW [3], HILL [4], MiPoD [5], J-UNIWARD [6], UERD [7], ASO [8] and so on [9][10][11]. The core idea of these adaptive image steganography algorithms is to design an embedding distortion cost function that measures the impact of each pixel modification on steganographic security. The security of steganography is thereby turned into the problem of optimizing the distortion cost: the embedding operation is guided to minimize the total embedding distortion, which maximizes security. Since steganography can also be exploited by criminals, detecting it is an important task.
The aim of steganography is to hide secret information in objects for covert communication. In contrast, the aim of steganalysis is to detect the hidden messages. However, steganalysis is a challenging task, as the changes made to cover objects are almost impossible for human eyes to recognize.
The traditional steganalytic schemes are usually based on hand-crafted features designed with expert experience and paired with different machine-learning classifiers [12][13][14][15][16], for instance, the Subtractive Pixel Adjacency Matrix (SPAM) [13], the Spatial Rich Model (SRM) [14], ccPEV [17], DCTR [15] and their variants. In recent years, the popularity of Convolutional Neural Networks (CNNs) has promoted the study of steganalysis, and CNN schemes have shown great performance in image steganalysis. The spatial-domain schemes in current studies include Qian-Net [18], Xu-Net [19], Ye-Net [20], Yedroudj-Net [16], ReST-Net [21], SRNet [22], Yedroudj-Net [23] and so on. All the above-mentioned CNN approaches can effectively distinguish steganographic images, and some even perform better than the traditional approaches. An analysis of these network structures shows that a typical steganalytic architecture combines two elements: a feature extraction step and a classification step. The first extracts the noise residuals from the input image pairs as features. The second classifies the input images into two classes, cover and stego. Although such CNN-based approaches have achieved good performance in image steganalysis, the common thread in all these methods is that they use only one pipeline or do not combine the filters sufficiently.
In order to make full use of the diversity of filters, we propose an effective CNN architecture for steganalysis. Our method makes two contributions: the design of a diverse filter module (DFM) and of a squeeze-and-excitation module (SEM). Inspired by the Inception Network [24], which increases the width of the CNN structure, we use DFMs to obtain more varied residual features. Similarly, drawing on the Squeeze-and-Excitation Network [25], we use SEMs to strengthen the key channels. We therefore name our steganalysis CNN "DFSE-Net". Multiple convolution kernels extract more features from the input image pairs, while the important feature maps are highlighted at the same time, so the final fusion obtains a better representation. The experimental data set is BOSSBase 1.01 [26], and the experimental results demonstrate the outstanding performance of our network.
The rest of this paper is organized as follows. In Section 2, we review several related approaches of image steganalysis. Our CNN structure with DFMs and SEMs is presented in Section 3. In Section 4, the experimental results are reported. Section 5 concludes this paper.

Related Work
The first study of image steganalysis with a deep learning architecture was done by Tan and Li in 2014 using convolutional autoencoders [27]; although the method was barely effective, it was very innovative at that time. In 2015, Qian et al. [18] proposed their GNCNN architecture and were the first to introduce a KV kernel as the image preprocessing layer. This design improved the detection accuracy, but it was still not as good as traditional methods. A real breakthrough was achieved in 2016 by Xu et al. [19]: for the first time, the performance of Xu-Net was comparable with the traditional well-designed method consisting of SRM [14] and an ensemble classifier (EC) [28]. Xu-Net introduced the absolute activation (ABS) layer on the feature maps to facilitate the statistical modeling in the following layers; to prevent overfitting, it used the TanH function to limit the range of data values at the early stages of the network, and it used 1 × 1 convolutional layers to construct a deeper network. In 2017, Ye et al. [20] presented a CNN architecture with a preprocessing layer containing thirty SRM high-pass filters and designed an efficient activation function, the truncated linear unit (TLU), that can better reveal the embedding artifacts. In 2018, Li et al. [21] proposed ReST-Net, a wide CNN architecture with diverse activation modules and parallel subnets; however, the network used only one type of convolution kernel, 3 × 3. In 2019, Zeng et al. [29] proposed a separate-then-reunion network for steganalysis of color images. These architectures have shown that a wider CNN can improve detection performance. However, parallel subnets that combine kernels of different sizes have not been extensively explored in steganalysis so far. This motivates us to design a wider CNN with kernel units of different sizes.

Architecture Overview
In order to make full use of the diversity of filters, we designed DFSE-Net, whose overall structure is presented in Figure 1. Since DFMs can extract more diverse feature maps and SEMs can make the network focus on the effective feature maps, combining the two modules can improve the detection effectiveness, which the experimental results support. As shown in Figure 1, DFSE-Net consists of one image preprocessing layer, seven convolutional layers (six of them inside the DFSE modules), one fully-connected layer and one softmax layer. Due to the limitations of computing power, the size of the input image is 256 × 256.
The layer types and parameters are displayed inside the boxes in Figure 1. Conv(x_1, a × a, x_2) means that the kernel size of the convolution layer is a × a, the number of input feature maps is x_1 and the number of output feature maps is x_2. ABS stands for absolute activation, BN for batch normalization, TLU for truncated linear unit and ReLU for rectified linear unit. The data sizes (x × (a × a)), shown on the right side of each box, denote the number and size of the output feature maps.
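As a reading aid for the Conv(x_1, a × a, x_2) notation, the following sketch counts the trainable weights of such a layer. The function name, the optional bias term and the example sizes are illustrative assumptions and are not taken from Figure 1.

```python
def conv_params(x1: int, a: int, x2: int, bias: bool = True) -> int:
    """Trainable parameters of Conv(x1, a x a, x2).

    x1: number of input feature maps
    a:  kernel height and width
    x2: number of output feature maps
    """
    weights = x1 * a * a * x2          # one a x a kernel per (input, output) pair
    return weights + (x2 if bias else 0)


# Hypothetical layer: 30 input maps, 5 x 5 kernels, 30 output maps
print(conv_params(30, 5, 30))  # 22530
```

The count grows with the square of the kernel size, which is one reason the choice between 3 × 3 and 5 × 5 kernels trades receptive field against computational cost.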
The whole DFSE-Net can be simply divided into three steps. The first step is an image preprocessing layer with thirty high-pass filters of SRM [14], which can make our CNN concentrate on the embedding areas rather than the contents of images. Feature extraction is the second step, which consists of three DFMs and SEMs with seven convolutional layers. In this step, the feature maps are transformed into a 240-D feature vector. The third step is a linear classification module with one fully-connected layer and one softmax layer. In this step, the feature vectors are transformed into the output probabilities for each class. Each basic element is made of the following different layer types:

Convolution Layer
In our proposed architecture, we use three different convolution kernels, instead of only the 3 × 3 kernel, to extract local features of different sizes. The 3 × 3 and 5 × 5 kernels are computed in parallel in each DFM to capture more features. For the first convolution layer, the kernel size is 5 × 5, so as to obtain a larger view of the local features. The 1 × 1 kernel is used after each SEM to integrate the rich feature sets. The number of channels in each convolution layer balances computational complexity against network performance.

ABS Layer
The ABS layer [19] is only used after the first convolution layer. It discards the signs of the elements in the noise residuals, so that the statistical features of sign symmetry are forced to be considered in the feature maps. To show the performance of the ABS layer for image steganalysis, the comparisons are conducted based on the DFSE-Net with and without the ABS layer in the first convolution layer. Both models are trained for the WOW steganography algorithm at the payload of 0.2 bpp and 0.4 bpp. From Table 1, the DFSE-Net with ABS layer has a lower error rate of detecting the WOW steganography algorithm, and the ABS layer also accelerates the convergence and shows better performance, as shown in Figure 2.

BN Layer
The BN layer [30] is essentially a normalization layer. It normalizes the distribution of each mini-batch to zero mean and unit variance. There are several advantages to using a BN layer. First, it can shift and rescale the distribution of the input feature maps. Second, it allows a larger learning rate to speed up learning, as it makes networks less sensitive to the initialization parameters. Third, it can effectively mitigate vanishing or exploding gradients and overfitting in the training phase [30]. Hence, we use a BN layer after each convolution layer in our proposed network.
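The normalization step described above can be illustrated for a single channel as follows. This is a minimal sketch of the training-time computation only; the learnable scale `gamma`, shift `beta` and the constant `eps` follow the usual batch normalization formulation, and their values here are assumptions rather than settings reported in this paper.

```python
import math


def batch_norm(values, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one channel's activations over a mini-batch."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n   # biased (population) variance
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in values]


batch = [2.0, 4.0, 6.0, 8.0]
normed = batch_norm(batch)
print(normed)  # values centred on 0 with variance close to 1
```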

Nonlinear Activation Layer
The activation layer introduces nonlinearity into CNNs, which can help prevent vanishing or exploding gradients, increase the capability of feature representation and so on. Various activation functions can be chosen, such as the conventional sigmoid function, the ReLU (Rectified Linear Unit) function, the hyperbolic tangent function, the Truncated Linear Unit (TLU) function and so on. Among them, the ReLU function is commonly used in CNNs and can be formulated as Equation (1):

$$f(x) = \max(0, x) \tag{1}$$
Except for the first layer in DFSE-Net, we apply the classical ReLU as the activation function in all blocks. Using the ReLU function after the convolutional layer can make networks respond selectively to embedded signals among the input feature maps and produce more effective features. To a certain extent, the steganographic embedding procedure can be viewed as adding low-amplitude noise to cover images, and the embedding signals are usually in the range of [−1, 1]. Therefore, in the first layer we select the TLU, which is slightly modified from ReLU, as it suppresses image content and extracts embedding signals more effectively. It can be defined as Equation (2):

$$f(x) = \begin{cases} -T, & x < -T \\ x, & -T \le x \le T \\ T, & x > T \end{cases} \tag{2}$$

where T > 0 is the threshold determined by experiments. In this paper, the value of T is set to 3, the same as the value in Ye-Net [20].
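To make the difference between the two activations concrete, the sketch below applies ReLU and the TLU with T = 3 to a few residual values. The sample values are arbitrary and chosen only for illustration.

```python
def relu(x: float) -> float:
    """Rectified linear unit: negative inputs are set to zero."""
    return max(0.0, x)


def tlu(x: float, T: float = 3.0) -> float:
    """Truncated linear unit: identity on [-T, T], clipped outside."""
    return max(-T, min(T, x))


residuals = [-7.5, -1.0, 0.0, 0.5, 12.0]
print([tlu(r) for r in residuals])   # [-3.0, -1.0, 0.0, 0.5, 3.0]
print([relu(r) for r in residuals])  # [0.0, 0.0, 0.0, 0.5, 12.0]
```

Unlike ReLU, the TLU keeps small negative residuals, which also carry embedding signal, while clipping large-magnitude values that mostly reflect image content.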
To compare the performance of TLU with the ReLU function for image steganalysis, we conducted comparisons based on the network shown in Figure 1. One DFSE-Net is trained with TLU (T = 3) in the first layer, and another is trained with ReLU replacing TLU in the first layer. Both models are trained against the WOW steganography algorithm at payloads of 0.2 and 0.4 bpp. As Table 2 shows, DFSE-Net with TLU has a lower error rate in detecting the WOW steganography algorithm, and the TLU function also accelerates convergence and shows better performance, as shown in Figure 3.

Pooling Layer
The average pooling layer is used in each DFM. It calculates the average value over a certain area of the feature maps. It reduces the size of the feature maps according to the stride, and thus reduces the number of parameters and the amount of computation while retaining the main features. Furthermore, the average pooling layer can help prevent over-fitting in training. Following [31], we do not use a pooling layer after the first convolution layer, to avoid the loss of information.
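The averaging operation can be sketched on a single small feature map as follows. The window size and stride of 2 are assumptions made for the example; the values used in DFSE-Net are those given in the figures.

```python
def avg_pool2d(fmap, size=2, stride=2):
    """Average-pool a 2-D feature map (list of rows) without padding."""
    h, w = len(fmap), len(fmap[0])
    out = []
    for i in range(0, h - size + 1, stride):
        row = []
        for j in range(0, w - size + 1, stride):
            window = [fmap[i + di][j + dj]
                      for di in range(size) for dj in range(size)]
            row.append(sum(window) / len(window))
        out.append(row)
    return out


fmap = [[1, 2, 3, 4],
        [5, 6, 7, 8],
        [9, 10, 11, 12],
        [13, 14, 15, 16]]
pooled = avg_pool2d(fmap)
print(pooled)  # [[3.5, 5.5], [11.5, 13.5]]
```

With a stride of 2, each spatial dimension is halved, so a 4 × 4 map becomes 2 × 2.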

Diverse Filter Module
As the Inception Network [24] has achieved a series of excellent results, it has been widely accepted that wider convolutional networks can capture more information from images. Inspired by this, we designed the diverse filter modules (DFMs). Each DFM consists of convolutional kernels of three different sizes, 3 × 3, 5 × 5 and 1 × 1, as shown in Figure 4. The 3 × 3 and 5 × 5 convolutional kernels process the output of the previous layer in parallel. Their outputs are then concatenated and sent to the 1 × 1 convolutional layer, which integrates the feature maps of both branches. In order to improve the performance, we follow Xu-Net and Ye-Net and use BN and ReLU layers in the DFMs. To further improve the effect, we design the SEMs to cooperate with the DFMs.
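The data flow of a DFM (two parallel branches, channel concatenation, then a 1 × 1 convolution) can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used in the experiments: BN, ReLU, pooling and the SEM are omitted, zero "same" padding is assumed, and the channel counts, weights and helper names are hypothetical.

```python
import random


def conv2d_same(x, kernels):
    """Naive multi-channel convolution with zero 'same' padding.

    x:       list of C_in feature maps, each H x W
    kernels: list of C_out filters, each C_in x k x k (k odd)
    """
    h, w = len(x[0]), len(x[0][0])
    k = len(kernels[0][0])
    pad = k // 2
    out = []
    for filt in kernels:
        fmap = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for c, plane in enumerate(filt):
                    for di in range(k):
                        for dj in range(k):
                            ii, jj = i + di - pad, j + dj - pad
                            if 0 <= ii < h and 0 <= jj < w:
                                acc += plane[di][dj] * x[c][ii][jj]
                fmap[i][j] = acc
        out.append(fmap)
    return out


def random_kernels(c_out, c_in, k, rng):
    """Random filters of shape c_out x c_in x k x k (illustration only)."""
    return [[[[rng.gauss(0, 0.1) for _ in range(k)] for _ in range(k)]
             for _ in range(c_in)] for _ in range(c_out)]


def dfm_forward(x, k3, k5, k1):
    branch3 = conv2d_same(x, k3)    # 3 x 3 branch
    branch5 = conv2d_same(x, k5)    # 5 x 5 branch on the same input
    merged = branch3 + branch5      # concatenate along the channel axis
    return conv2d_same(merged, k1)  # 1 x 1 convolution fuses both branches


rng = random.Random(0)
c_in, h, w = 2, 6, 6
x = [[[rng.random() for _ in range(w)] for _ in range(h)] for _ in range(c_in)]
k3 = random_kernels(4, c_in, 3, rng)
k5 = random_kernels(4, c_in, 5, rng)
k1 = random_kernels(3, 8, 1, rng)   # 8 = 4 + 4 concatenated channels
y = dfm_forward(x, k3, k5, k1)
print(len(y), len(y[0]), len(y[0][0]))  # 3 6 6
```

Because both branches use "same" padding, their outputs have identical spatial size and can be concatenated directly; the 1 × 1 layer then sets the number of output channels.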

Squeeze-and-Excitation Module
The Squeeze-and-Excitation (SE) Module is not a complete network structure. It is a substructure that can be embedded in other classification or detection networks. In our architecture, each SEM follows the concatenation layer in the DFM, as shown in Figure 4. The core idea of the SEM is to learn feature weights according to the loss during training, so that effective feature maps receive large weights and invalid or ineffective feature maps receive small weights. Therefore, the network can pay more attention to key channels. The overall design of the SEM is presented in Figure 5. According to [25], the module can be divided into two steps: embedding global information and recalibrating the filter responses, also known as squeeze and excitation. As Figure 5 shows, the main operations are global average pooling (GAP), full connection (FC), ReLU, FC and sigmoid. In the squeeze step, the global average pooling layer reduces each output channel to a scalar, so the C channels yield C scalars. In the excitation step, the operations FC-ReLU-FC-Sigmoid map the C scalars into [0, 1] to serve as the channel weights. Finally, the scale operation rescales each channel by multiplying it by its weight.
To demonstrate the performance of the SEM in image steganalysis, we compared DFSE-Net with and without an SEM after each DFM. Both models are trained against the WOW steganography algorithm at 0.2 and 0.4 bpp. As Table 3 shows, DFSE-Net with SEMs has a lower error rate in detecting the WOW steganography algorithm. The SEM also accelerates convergence and shows better performance, as shown in Figure 6.

Squeeze
Each convolutional filter operates on a small local region, so each unit of its output is unable to use contextual information outside of this region. To exploit the channel dependencies, we therefore use a global average pooling layer to squeeze the global spatial information of each channel into a scalar. Formally, the statistic z is generated by squeezing the input features U through their spatial dimensions H × W, such that the c-th element of z is calculated by Equation (3):

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j) \tag{3}$$

Excitation
To make full use of the information aggregated in the squeeze step, it is followed by the excitation step, which captures channel-wise dependencies. To achieve this objective, the excitation step has to meet two criteria. First, it must be able to learn the nonlinear interaction between channels. Second, it must learn a non-mutually-exclusive relationship. To meet these criteria, the sigmoid activation is employed. The operations of the excitation step can be formulated as Equation (4):

$$s = \sigma(W_2 \, \delta(W_1 z)) \tag{4}$$

where W_1 refers to the first FC operation, δ refers to the ReLU function [32], W_2 refers to the second FC operation and σ refers to the sigmoid function.
To limit the complexity of the module, the first FC layer reduces the dimension of its input by a reduction ratio r. The second FC layer then restores the original channel dimension. The final output of the module is obtained by rescaling U with the sigmoid activations s. The scale step can be formulated as Equation (5):

$$\tilde{x}_c = F_{scale}(u_c, s_c) = s_c \cdot u_c \tag{5}$$

where F_scale(u_c, s_c) refers to the multiplication between the scalar s_c and the corresponding feature map u_c, and x̃_c is one of the channels of the output X̃.
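The squeeze, excitation and scale steps can be put together in a short plain-Python sketch. It is an illustration under stated assumptions: the FC layers have no bias, and the toy input, the weight matrices `w1` and `w2` and the reduction ratio of 2 are made-up values, not parameters of DFSE-Net.

```python
import math


def squeeze(u):
    """Global average pooling: one scalar per channel."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in u]


def dense(vec, weights):
    """Fully connected layer without bias; one weight row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def excite(z, w1, w2):
    """FC -> ReLU -> FC -> sigmoid, giving one weight in (0, 1) per channel."""
    hidden = [max(0.0, a) for a in dense(z, w1)]          # C -> C/r
    return [1.0 / (1.0 + math.exp(-a)) for a in dense(hidden, w2)]  # C/r -> C


def se_module(u, w1, w2):
    """Return the rescaled feature maps and the channel weights."""
    s = excite(squeeze(u), w1, w2)
    scaled = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, u)]
    return scaled, s


# Toy input: C = 4 constant channels of size 2 x 2, reduction ratio r = 2
u = [[[float(c + 1)] * 2 for _ in range(2)] for c in range(4)]
w1 = [[0.1, -0.2, 0.3, 0.0], [0.0, 0.2, -0.1, 0.1]]        # shape 2 x 4
w2 = [[0.5, -0.5], [0.2, 0.1], [-0.3, 0.4], [0.0, 0.0]]    # shape 4 x 2
x_tilde, s = se_module(u, w1, w2)
print(s)
```

Each channel of the output keeps its spatial pattern and is only multiplied by a single learned weight, so the module changes the relative importance of channels without altering their spatial content.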
To investigate the impact of the parameter r in our network, we conducted several experiments with DFSE-Net over a range of r values. The comparison results in Table 4 show the performance at each reduction ratio. There is only a slight difference in the detection accuracy. Setting r = 6 achieves a good balance between complexity and accuracy.
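The effect of r on module complexity can be estimated by counting the weights of the two FC layers. The sketch below ignores biases, and both the channel count of 48 and the list of r values are hypothetical choices made for illustration, not the settings of Table 4.

```python
def se_fc_params(channels: int, r: int) -> int:
    """Weights in the two FC layers of one SE module (biases ignored)."""
    hidden = channels // r                 # bottleneck width C / r
    return channels * hidden + hidden * channels


for r in (2, 4, 6, 8):
    print(r, se_fc_params(48, r))
```

The count is roughly 2C²/r, so a larger r makes the module cheaper at the price of a narrower bottleneck.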

Experiments
In this section, several experiments are carried out to demonstrate the feasibility and effectiveness of DFSE-Net. For fair comparison, all methods are trained and tested on the same data sets.

The Steganographic Schemes and Datasets
In this paper, two content-adaptive image steganographic algorithms in the spatial domain, S-UNIWARD and WOW, were employed to produce standard data sets. The two steganographic schemes were implemented with an STC simulator, and the code files are available at http://dde.binghamton.edu/download/. The image source, BOSSBase 1.01, can be found at the same URL. It is widely used in research fields such as information hiding, forensics and steganalysis, and contains 10,000 8-bit grayscale images with a size of 512 × 512.
In consideration of the GPU computing power in our lab, the experiments on cover images with a size of 512 × 512 can be extremely time-consuming. Therefore, we decided to evaluate the effectiveness of all methods on the images with a size of 256 × 256. To this end, we refer to other models [20,23,33] and adopt the same approach. Therefore, we resampled all the images from 512 × 512 to 256 × 256 using the Matlab function of imresize() with default parameters.
Then, all 256 × 256 BOSSBase 1.01 images were embedded with messages using the S-UNIWARD and WOW steganographic algorithms, respectively, at payloads of 0.1, 0.2, 0.3, 0.4 and 0.5 bpp to generate the stego data sets. Two algorithms at five payloads give 10 different steganographic data sets. Finally, every data set, together with the cover set, was randomly split into three subsets: 40% of the cover/stego pairs formed the training set, 10% formed the validation set and the rest formed the testing set. The testing set was untouched during the whole training phase.
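A random 40/10/50 split at the level of cover/stego pairs can be sketched as follows. Splitting by pair index keeps each cover and its stego version in the same subset. The function name and the fixed seed are assumptions for reproducibility of the example and do not describe the exact procedure used in the experiments.

```python
import random


def split_pairs(n_pairs=10_000, train_frac=0.4, val_frac=0.1, seed=0):
    """Randomly split cover/stego pair indices into train/val/test lists."""
    idx = list(range(n_pairs))
    random.Random(seed).shuffle(idx)
    n_train = round(n_pairs * train_frac)
    n_val = round(n_pairs * val_frac)
    train = idx[:n_train]
    val = idx[n_train:n_train + n_val]
    test = idx[n_train + n_val:]           # everything that remains
    return train, val, test


train, val, test = split_pairs()
print(len(train), len(val), len(test))  # 4000 1000 5000
```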

Hyper-Parameters
We used Keras v2.24 with the backend of Tensorflow v1.15.3 for implementation. The optimizer of stochastic gradient descent (SGD) was applied to train our model. The momentum was set to 0.9 and the weight decay was fixed to 0.0001. No regularization and dropout were used. The batch size was fixed to 50 with 25 cover/stego pairs in the training procedure. For the preprocessing layer, thirty high-pass SRM filters were used without normalization. As the first layer, the TLU activation function was used and the threshold was set to three. All convolutional layers used the 'glorot_normal' normal distribution initializer, also called the Xavier method. The fully-connected and softmax layers were initialized with the 'RandomNormal' method of zero mean and standard deviation 0.01 and the initial bias was set to be zero. In addition to the above settings, the loss of our network was to minimize the cross-entropy. During the training phase, we set the maximum epoch of 500. Nevertheless, we usually cut short the training phase most of the time when the over-fitting phenomenon appeared. The learning rate (lr) was initialized to 0.01 and when the val_loss failed to improve after 10 epochs, the lr dropped by 10%. The minimum value of lr is 0.00001.
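The learning-rate rule described above can be simulated with a few lines of plain Python. This is a sketch under an explicit assumption: "dropped by 10%" is read here as multiplying the learning rate by 0.9; if it instead means reducing it to 10% of its value, `factor` would be 0.1. Keras provides a `ReduceLROnPlateau` callback with `monitor`, `factor`, `patience` and `min_lr` arguments that implements a rule of this kind; the class below is a hypothetical stand-in, not that callback.

```python
class PlateauLR:
    """Reduce the learning rate when the validation loss stops improving."""

    def __init__(self, lr=0.01, factor=0.9, patience=10, min_lr=1e-5):
        self.lr = lr
        self.factor = factor        # assumed interpretation, see the text above
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, val_loss: float) -> float:
        """Call once per epoch; returns the lr to use for the next epoch."""
        if val_loss < self.best:
            self.best, self.wait = val_loss, 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr


sched = PlateauLR()
# The loss is constant, so it never improves after the first epoch
lrs = [sched.step(1.0) for _ in range(11)]
print(lrs[9], lrs[10])
```

After ten consecutive epochs without improvement the rate is reduced once, and it is never allowed to fall below `min_lr`.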
The performance was measured with the total classification error probability on the same testing set using the formula $P_E = \min_{P_{FA}} \frac{1}{2}(P_{FA} + P_{MD})$, where $P_{FA}$ and $P_{MD}$ represent the probabilities of false alarm and missed detection, respectively.
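For a detector that outputs a score per image, this error measure can be computed by scanning the decision threshold, as in the sketch below. The scores are made-up values, and the convention that a higher score means "more likely stego" is an assumption of the example.

```python
def min_error(cover_scores, stego_scores):
    """P_E = min over thresholds of (P_FA + P_MD) / 2."""
    thresholds = sorted(set(cover_scores) | set(stego_scores)) + [float("inf")]
    best = 1.0
    for t in thresholds:
        # False alarm: a cover image scored at or above the threshold
        p_fa = sum(s >= t for s in cover_scores) / len(cover_scores)
        # Missed detection: a stego image scored below the threshold
        p_md = sum(s < t for s in stego_scores) / len(stego_scores)
        best = min(best, 0.5 * (p_fa + p_md))
    return best


cover = [0.1, 0.2, 0.35, 0.6]
stego = [0.3, 0.7, 0.8, 0.9]
print(min_error(cover, stego))  # 0.125
```

In this toy example the best threshold is 0.7, which misclassifies one stego image and no cover image, giving (0 + 0.25) / 2 = 0.125.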

Results
In this subsection, the experimental results are presented to verify the feasibility and demonstrate the effectiveness of our method. For fair comparison, we conducted all the experiments on the same data sets generated in Section 4.1, and the data sets in this paper are divided as follows. The 10,000 256 × 256 BOSSBase images were randomly split into three sets. The training set contains 4000 cover/stego image pairs, the validation set contains 1000 image pairs, and the testing set contains the remaining 5000 image pairs.

Feasibility
We verified the validity of DFSE-Net on the data sets generated in Section 4.1; the experimental results are shown in Figure 7 and Table 5. Figure 7 shows that DFSE-Net converges quickly for both the WOW and S-UNIWARD steganographic algorithms at a payload of 0.4 bpp. Table 5 reports the detection performance of DFSE-Net for the different steganography methods at different payloads, and the results show that it can detect stego images effectively.

Comparison with Existing Methods
To verify the superiority of our method, we compared it with state-of-the-art approaches: the classical traditional method combining the Spatial Rich Model (SRM) [14] with an Ensemble Classifier (EC), Xu-Net [19], Ye-Net [20] without the selection-channel information (also called TLU-CNN), and Yedroudj-Net [23]. These methods are typical current approaches. All methods were trained and tested on the same data sets and run on an Nvidia P5000 GPU card.
In Table 6, we recorded the P E compared with other state-of-the-art steganalyzers, and all the methods are compared against the steganographic algorithms WOW and S-UNIWARD, respectively, with the payload of 0.2 and 0.4 bpp. From Table 6, we can see that our method has better detection performance than other methods in terms of the steganographic algorithms WOW and S-UNIWARD, respectively, at payloads of 0.2 and 0.4 bpp. Since we have well designed DFSE-Net with DFMs and SEMs, the error rate of our proposed architecture is reduced by 8.5% compared with the traditional method of SRM+EC, by 6.7% compared with the Xu-Net, by 3.9% compared with the Ye-Net and by 3% compared with the Yedroudj-Net against WOW at 0.2 bpp. The results in Table 6 also show that our proposed network can effectively extract image features and classify input images.
As shown in Figure 8, we can see more intuitively that our proposed network has better performance than other methods on different steganographic algorithms at different payloads. The good performance also demonstrates the effectiveness of the network structure of DFMs and SEMs.

Conclusions
This paper presents the architecture of DFSE-Net, with carefully designed diverse filter and Squeeze-and-Excitation modules, for image steganalysis. DFSE-Net gathers several recent design propositions, such as ABS, BN and TLU, to build an efficient architecture that outperforms the state-of-the-art methods. The experiments show that, against WOW at 0.2 bpp, P_E is reduced by 8.5% compared with the traditional SRM+EC method, by 6.7% compared with Xu-Net, by 3.9% compared with Ye-Net and by 3% compared with Yedroudj-Net. To summarize, the contributions of our method are reflected in two aspects: (i) proposing DFMs to capture more steganographic traces in a diverse way; (ii) proposing SEMs to enhance the effective features obtained from DFMs. Several experiments demonstrate the effectiveness and better performance of our method. There are also some limitations to our work. For example, our network can only deal with input images of the same size, while real-life images come in all sizes. In the future, we will consider adding more diverse structures to improve the detection efficiency and adding new modules to handle multi-size images.