Spatial Domain-Based Nonlinear Residual Feature Extraction for Identification of Image Operations

In this paper, a novel deep-learning-based approach is proposed to detect and identify a variety of image operations. First, we propose the spatial domain-based nonlinear residual (SDNR) feature extraction method, which constructs residual values from locally supported filters in the spatial domain. Applying minimum and maximum operators introduces diversity and nonlinearity; moreover, this construction brings nonsymmetry to the distribution of SDNR samples. Then, we apply a deep learning technique to the extracted SDNR features to detect and classify a variety of image operations. Extensive experiments have been conducted to verify the performance of the proposed approach, and the results indicate that it performs well in detecting and identifying various common image postprocessing operations. Furthermore, comparisons with existing methods show the superiority of the proposed approach.


Introduction
Currently, forgeries of digital images are widely propagated, and tampered images are expected to be used more and more in society, including in social media and even scientific research. This poses a serious threat to political and social stability. At the same time, as postprocessing techniques and operations for digital images develop rapidly, the imperceptible modification of digital images is becoming easier. Therefore, ever-increasing attention is being paid to digital image forensics, including the detection of forgery and postprocessing operations.
To date, many approaches to image forensics have been proposed, such as tracking the history of JPEG compression [1][2][3][4]; revealing image operations, including contrast enhancement [5,6], resampling [7][8][9][10], and median filtering [11][12][13][14]; detecting image splicing [15,16]; revealing frequency-domain filtering [17]; and identifying image forgery [18,19]. However, most of these state-of-the-art studies have considered only specific operations, and some of the approaches perform only binary classification. In [1][2][3][4], the authors proposed block-based methods for detecting the history of JPEG compression and revealing artifacts caused by image coding. In [6], Stamm and Liu proposed blindly detecting digital image modification caused by contrast enhancement operations. In [7], Popescu and Farid proposed detecting resampling traces and interpolations based on a derivative operator and the Radon transformation. In [20], Rao and Ni proposed detecting forgeries in digital images using a deep learning method. In [21], two-class 3D-convolutional neural network (3D-CNN) classifiers were employed for video copy detection. However, in most cases, the features show low feasibility.
To avoid the abovementioned shortcomings and to effectively identify a variety of image operations, in this paper we propose a novel approach that extends our previous work [25]. In [25], we proposed a framework for identifying various image operations, and the experimental results showed the strong potential of that framework for this application. In that work, the submodel of the spatial rich model (SRM) [23], sub_SRM, was applied for feature extraction. On that basis, in this work we propose the spatial domain-based nonlinear residual (SDNR) feature extraction method, which constructs residual values from locally supported filters in the spatial domain.
By applying minimum and maximum operators, diversity and nonlinearity are introduced; moreover, this construction brings nonsymmetry to the distribution of SDNR samples. Then, similarly, we apply a deep learning method, a five-layer CNN, to the extracted SDNR features to produce the detection and identification results. Many experiments are conducted to evaluate the performance of the improved approach, and the results show that it can identify more image postprocessing operations than the previous work [25]: the previous work identified seven operation types, while the improved method identifies more than twice that number. Furthermore, results obtained with BPNN and AlexNet are reported to show the superiority of the proposed method. The remainder of this paper is arranged as follows: Section 2 presents the proposed approach for the identification of image operations, with Section 2.1 explaining the principle of the proposed SDNR method and Section 2.2 introducing the employed CNN classifier in detail. Section 3 presents the experiments and discussions. Finally, Section 4 concludes this paper.

Proposed Approach for Identification of Image Operations
The design of the feature set plays an important part in classification problems. Fortunately, we can borrow powerful features from other research fields, such as image classification, computer vision, and image steganalysis. An example is the modern universal steganalytic features, such as SRM [23], which consist of statistics derived from a number of image residuals. In such features, different high-pass filters are used to suppress the image content in different ways, so the obtained features can represent different local properties. In this paper, we propose the SDNR, which constructs residual values from locally supported filters in the spatial domain. Then, to make use of the extracted SDNR features, we apply a deep learning method, a five-layer CNN, to the extracted SDNR features to produce the detection and classification results. The framework of the proposed approach is shown in Figure 1. As Figure 1 shows, there are two major steps: feature extraction with the proposed SDNR method, and feature learning with a five-layer CNN model. In the proposed approach, we first apply a variety of postprocessing operations, including spatial enhancement, spatial filtering, frequency filtering, and lossy compression, to the original images; in this way, the corresponding postprocessed images are produced. Then, the SDNR method is applied to the generated images to extract the feature sets. Next, we employ a five-layer CNN model to extract CNN features from the extracted SDNR feature sets. A patch-sized sliding window is used to scan the extracted SDNR feature sets, and feature fusion is adopted to aggregate the CNN features and thus obtain the discriminative feature of the image. Afterward, the softmax classifier of the CNN is trained for the detection and identification of the various operations.

Spatial Domain-Based Nonlinear Residual (SDNR) Feature
The SDNR is constructed from locally supported linear filters in the spatial domain. Applying minimum and maximum operators introduces diversity and nonlinearity; in this way, the construction brings nonsymmetry to the distribution of SDNR samples. Equation (1) expresses the calculation of the residual value, where I and f are the host image and the filter, respectively. The residual of each pixel is calculated by multiplying the pixel and its adjacent pixels by the filter coefficients and summing the products. Considering that for edge pixels of the image the filter support extends beyond the image, mirror-symmetric filling is applied to the images before the filtering operations; that is, the image borders are padded with mirrored pixel values. After this operation, the size of the residual map does not change.
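For concreteness, the residual computation of Equation (1) with mirror-symmetric filling can be sketched in NumPy. This is a minimal sketch; the high-pass filter in the usage example below is a hypothetical choice, not one taken from the paper's Figure 2.

```python
import numpy as np

def residual_map(image, filt):
    """Equation (1) sketch: residual R from host image I and filter f, with
    mirror-symmetric border filling so the residual map keeps the image size.
    (Implemented as cross-correlation; flip the kernel for strict convolution.)"""
    image = np.asarray(image, dtype=np.float64)
    fh, fw = filt.shape
    ph, pw = fh // 2, fw // 2
    # Mirror-symmetric filling of the borders before filtering.
    padded = np.pad(image, ((ph, ph), (pw, pw)), mode="reflect")
    R = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            # Multiply the pixel neighbourhood by the filter coefficients and sum.
            R[i, j] = np.sum(padded[i:i + fh, j:j + fw] * filt)
    return R
```

For instance, with the first-order difference filter `[[-1, 1]]`, a constant image yields an all-zero residual map of the same size as the input.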

where R denotes the residual map, which is calculated from the host image I and the filter f; ⊗ denotes the convolution process; A and B indicate the size of the host image; and M = A/2 and N = B/2. Figure 2 graphically shows the structure of the proposed SDNR, where the symbol '•' indicates the central pixel X_{i,j}, and the other two symbols denote the neighboring pixels and, at the same time, indicate the two employed filters. The SDNR can thus be formed with Equation (2).
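The min/max construction of Equation (2) can be sketched as follows, assuming two first-order difference residuals as the inputs; this filter choice is illustrative, since the paper's Figure 2 defines the actual filter supports.

```python
import numpy as np

def sdnr(I):
    """Equation (2) sketch: form the nonlinear 'min'/'max' residuals from two
    directional residuals (illustrative first-order differences)."""
    I = np.asarray(I, dtype=np.float64)
    pad = np.pad(I, 1, mode="reflect")   # mirror-symmetric border filling
    R_h = pad[1:-1, 2:] - I              # right neighbour minus centre
    R_v = pad[2:, 1:-1] - I              # bottom neighbour minus centre
    # Elementwise min and max introduce the nonlinearity and nonsymmetry.
    return np.minimum(R_h, R_v), np.maximum(R_h, R_v)
```

Because the minimum and maximum of two residuals are kept separately, the two resulting maps have asymmetric sample distributions even when each input residual is symmetric.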
To curb the residual's dynamic range and make the residual more sensitive to processing changes at the spatial discontinuities in the image, especially at edges and textures, quantization and truncation are applied to the residual map SDNR{(min),(max)} using Equation (3).
where the calculated SDNR_TQ{(min),(max)} denotes the corresponding truncated and quantized residual map; q denotes the quantization step (in our method, we set q to the residual order, i.e., the absolute value of the central-position coefficient of the filter); ⌊·⌋ indicates the round-down (floor) operation; and T_trunc(·) denotes the truncation function, which is defined in Equation (4).
where a is the input to the truncation function, and T denotes the truncation coefficient; in this paper, we set T = 2, the same value used in [22].
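Under the definitions above, Equations (3) and (4) reduce to a floor division by the quantization step q followed by clipping to [-T, T]; a minimal sketch:

```python
import numpy as np

def quantize_truncate(R, q=1, T=2):
    """Equations (3)-(4) sketch: divide the residual by the quantization step q,
    round down, then truncate the result to the range [-T, T]."""
    return np.clip(np.floor(R / q), -T, T).astype(int)
```

For example, with q = 2 and T = 2, the residuals [-5, -1, 0, 3] map to [-2, -1, 0, 1].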
Next, the horizontal co-occurrence CO is constructed from four consecutive residual samples generated from Equation (3) using Equation (5); it is therefore a four-dimensional array of size (2T + 1)^4. With T = 2, (2T + 1)^4 = 625, so each co-occurrence array CO has 625 elements. Z denotes the normalization factor, which is chosen to satisfy Equation (6).
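The co-occurrence of Equation (5) can be sketched as a normalized 4-D histogram over horizontally adjacent quadruples; the row-by-row scan order below is our assumption.

```python
import numpy as np

def horizontal_cooccurrence(Rtq, T=2):
    """Equation (5) sketch: a normalized joint histogram of four horizontally
    consecutive truncated/quantized residual samples. With T = 2 there are
    (2T + 1)**4 = 625 bins, and normalization enforces Equation (6)."""
    Rtq = np.asarray(Rtq, dtype=int)
    d = 2 * T + 1
    CO = np.zeros((d, d, d, d))
    shifted = Rtq + T                    # map values in [-T, T] to indices [0, 2T]
    for i in range(shifted.shape[0]):
        for j in range(shifted.shape[1] - 3):
            a, b, c, e = shifted[i, j:j + 4]
            CO[a, b, c, e] += 1
    return CO / CO.sum()                 # the 1/Z normalization of Equation (6)
```

An all-zero residual map, for instance, places the entire probability mass in the single central bin.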
Considering the fact that the symmetries can increase the statistical robustness of the model while decreasing its dimensionality, we use the sign symmetry (which means that taking a negative of an image does not change its statistical properties) and the directional symmetry of the images. It can be easily seen from Figure 2 that the SDNR is directionally symmetric while it is not sign symmetric, and we propose employing Property I to achieve sign symmetry.
With the symmetrization process in Equations (7) and (8), which turns the "min" co-occurrence CO (min) and "max" co-occurrence CO (max) into a single matrix, the dimensionality is thus reduced from 2 × 625 to 1 × 325. Therefore, the dimensionality of our proposed feature SDNR is 1 × 325.

Employed Convolutional Neural Network Model for Classification
With the extracted SDNR features, we next apply the five-layer CNN shown in Figure 3 to detect and identify a variety of image operations. The applied five-layer CNN includes two convolutional layers, two pooling layers, and one fully connected layer, followed by a softmax classifier. Compared with conventional methods, which take pixel values as input, the proposed method improves the generalization ability and accelerates network convergence by replacing the pixel values with extracted features. To match the feature size of 1 × 325 explained above, we set the size of the input layer to 20 × 20 and apply a padding operation to the extracted features. The convolutional layers extract the feature maps, and each of their neurons is connected to a local neighborhood of neurons in the preceding layer. As shown in Figure 3, the employed CNN model involves two convolutional layers and two pooling layers. Convolutional layer 1 has six kernels with a receptive field size of 5 × 5, and each of its feature maps is 16 × 16 in size, while convolutional layer 2 has twelve kernels of size 5 × 5, and each of its feature maps has a size of 4 × 4. The subsampling (pooling) layer, an important component of the CNN model, is placed between the two convolutional layers. By reducing the connections between the two convolutional layers, the subsampling layer helps to reduce the computational complexity. Both pooling layers have one kernel of size 2 × 2 that resamples the input spatially and discards 75% of the activations. The pooling methods most frequently used are mean pooling and max pooling.
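As a sketch of the pooling step, a 2 × 2 max pooling keeps the largest value in each 2 × 2 block, which discards 75% of the activations as noted above; mean pooling would simply replace max with mean.

```python
import numpy as np

def max_pool2x2(fm):
    """2x2 max-pooling sketch: keep the largest of each 2x2 block, reducing
    the number of activations by 75%."""
    h, w = fm.shape
    trimmed = fm[:h - h % 2, :w - w % 2]          # drop odd trailing row/column
    return trimmed.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
```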
The fully connected layer transforms the feature map extracted from the former layer into a vector of size 1 × 48 and feeds this vector into the softmax classifier to perform identification. Softmax regression is the generalization of the logistic regression model to problems with multiple classes. The estimation of probability values is achieved through the hypothesis function h_θ(x) in Equation (9). Given x as the input and y as the class label, the output of the function is a k-dimensional vector, which means that there are k possible values for the class label y, and the sum of the vector components, representing the k estimated probability values, is 1.
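A minimal sketch of the hypothesis function in Equation (9); subtracting the maximum score before exponentiation is a standard numerical-stability device, not part of the paper's formulation.

```python
import numpy as np

def softmax_hypothesis(x, theta):
    """Equation (9) sketch: h_theta(x) returns k estimated class probabilities
    that sum to 1. theta has shape (k, n); x has shape (n,)."""
    scores = theta @ x
    scores = scores - scores.max()   # numerical stability; does not change the result
    e = np.exp(scores)
    return e / e.sum()
```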
Usually, the softmax regression algorithm can be solved by minimizing the cost function; however, it has been proven that there is more than one minimization solution of the cost function of a softmax regression algorithm. To solve the multisolution phenomenon, we propose employing the method that adds a weight attenuation term into the cost function. The cost function after adding the weight attenuation term is shown in Equation (10).
where I{·} is the indicator function: I{y^(i) = j} = 1 when y^(i) = j is true, and I{y^(i) = j} = 0 otherwise. In addition, γ denotes the weight attenuation coefficient, with γ > 0.
To minimize the cost function J(θ), the iterative gradient descent method is used; since the weight attenuation term makes J(θ) strictly convex, this guarantees convergence to the global optimum. The derivative of J(θ) is given in Equation (11).
By minimizing the J(θ), the softmax regression model can be achieved.
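The weight-attenuated cost of Equation (10) and its derivative in Equation (11) can be sketched for batch gradient descent as follows; the vectorized form and variable names are our own, and γ is the weight attenuation coefficient.

```python
import numpy as np

def softmax_cost_grad(theta, X, y, k, gamma=1e-3):
    """Equations (10)-(11) sketch: cross-entropy cost with weight-attenuation
    term gamma/2 * ||theta||^2, and its gradient.
    theta: (k, n) parameters; X: (m, n) inputs; y: (m,) labels in [0, k)."""
    m = X.shape[0]
    scores = X @ theta.T                          # (m, k) class scores
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    P = np.exp(scores)
    P /= P.sum(axis=1, keepdims=True)             # predicted probabilities
    Y = np.eye(k)[y]                              # one-hot indicator I{y_i = j}
    cost = -np.mean(np.sum(Y * np.log(P), axis=1)) + 0.5 * gamma * np.sum(theta ** 2)
    grad = -(Y - P).T @ X / m + gamma * theta     # Equation (11) with decay term
    return cost, grad
```

Each gradient descent iteration then updates theta by `theta -= lr * grad` until the cost converges.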

Experiments and Discussions
In the experiments conducted to test the performance of the proposed strategy, we randomly selected a large number of raw images from the BOSSbase v1.0 dataset [29]. For each of the original images, 15 counterparts were created by applying the image processing operations with random parameters from a predefined range. Table 1 lists the 15 image operations that were tested and the predefined parameter ranges of the corresponding operations, including spatial enhancement, e.g., gamma correction (GC) and histogram equalization (HE); spatial filtering, e.g., mean filtering (MeanF) and Wiener filtering (WF); geometric operations, e.g., scaling (Sca) and rotation (Rot); lossy compression, e.g., JPEG and JPEG2000 (JP2); and frequency filtering, e.g., high-pass filtering (HPF) and homomorphic filtering (HF). All of these images were then divided randomly into two halves: one for training and the other for testing.

Parameter Settings
In the employed CNN architecture, the number of epochs has a large effect on the identification accuracy and the computational time. Therefore, in order for the proposed strategy to achieve good performance, it is important to set the parameters appropriately. Figure 4 shows the relationship between the detection accuracy and the number of epochs (left) and between the computational time and the number of epochs (right). The results clearly indicate that both the detection accuracy and the computational cost increase with the number of epochs. To achieve a balance between computational cost and detection accuracy, we set the number of iterations to 600, i.e., num_epochs = 600.

As explained in Section 2.2, the pooling methods that are frequently used are mean pooling and max pooling. According to the respective results with mean pooling and max pooling shown in Figure 4, max pooling always obtains a better result than mean pooling; therefore, we use max pooling in the following experiments. In addition to the number of epochs and the pooling type, the kernel size plays an important role in our method as well. With the number of iterations set to 600, we tested the detection accuracies with kernel sizes of 3 × 3, 5 × 5, and 7 × 7, obtaining 92.2%, 95.9%, and 91.7%, respectively. Therefore, we set the kernel size to 5 × 5 to achieve the highest accuracy.


Detection and Classification of Various Image Postprocessing Operations
In this section, we evaluate the performance of the proposed method by measuring the accuracy of the detection and classification of the various image operations. Table 2 shows the detection accuracy for the 15 operations using different classifiers. We applied the ensemble classifier [24], the backpropagation neural network (BPNN) [26], AlexNet [28], and the proposed CNN as classifiers within the proposed framework to calculate the corresponding detection accuracy. Additionally, to show the superiority of the proposed SDNR feature, we applied the subtractive pixel adjacency matrix (SPAM) [30] for comparison. When applying the SPAM feature, the average detection accuracy is 96.5% with the ensemble classifier, 94% with the BPNN, 96.8% with AlexNet, and 97.7% with the employed CNN, while the corresponding detection accuracies with the proposed SDNR are 97.1%, 95.3%, 97.5%, and 98.9%, respectively. The comparison demonstrates that the proposed SDNR feature outperforms the SPAM feature across the different classifiers. Among the classifiers, the employed CNN performs better than the other tested classifiers. The last row of Table 2 shows the average detection results with the different features and classifiers; these results are highlighted in italics. The results demonstrate that the proposed approach performs very well in detecting the various image postprocessing operations. In addition to detecting whether or not images have been processed, the approach can also identify a variety of operations. Tables 3-5 show the confusion matrices of the multiclass identification results using the proposed SDNR feature paired with BPNN [26], AlexNet [28], and the employed CNN classifier.
In Tables 3-5, the symbol '*' indicates that the predicted percentage is under 0.1%, meaning that such misclassifications are negligible. The diagonal entries show the multiclass classification results and are highlighted in bold for easy reading. According to these results, the average identification accuracy is 91.3% with the proposed SDNR features and BPNN [26], 92.5% with the proposed SDNR features and AlexNet [28], and 95.9% with the proposed SDNR features and the employed CNN classifier. These very good results indicate the effectiveness of the proposed method.
Furthermore, in addition to the comparison of the detection of various image postprocessing operations, a comparison of the classification of various image postprocessing operations using different features with different classifiers is shown in Table 6. As with the results shown in Table 3, to show the superiority of the proposed SDNR feature, we applied SPAM [30] for comparison, and the BPNN [26], AlexNet [28], and the employed CNN classifier were respectively applied within the proposed framework to calculate the corresponding classification accuracy. When applying the SPAM feature, the average classification accuracy is 89.7% with the BPNN, 90.6% with AlexNet, and 85% with the employed CNN, while the corresponding classification accuracies with the proposed SDNR are 88.89%, 92.2%, and 95.9%, respectively. The results demonstrate that the proposed method performs best in classifying a variety of image postprocessing operations.

Conclusions
In summary, we have proposed the SDNR feature extraction method, which constructs features from locally supported linear filters in the spatial domain. Applying minimum and maximum operators introduces diversity and nonlinearity, and the construction thus brings nonsymmetry to the distribution of the extracted SDNR features. By applying the proposed SDNR method to the original images and the corresponding processed images, we can extract the feature sets accordingly; this accelerates network convergence. Then, by scanning the extracted SDNR feature sets with a patch-sized sliding window, we employ a five-layer CNN and train a softmax classifier to detect and identify a variety of image postprocessing operations. The main contributions of this paper are summarized as follows: (1) We have considered both binary classification and multiclass identification and solved both with the proposed approach; extensive experiments on up to 15 image postprocessing operations indicate the effectiveness of the proposed method. (2) We have used extracted SDNR features instead of pixel values as the input of our deep learning model; in this way, the generalization ability is enhanced and network convergence is promoted. (3) We have employed the five-layer CNN as the classifier, which achieves higher detection accuracy than conventional classifiers such as SVM and the ensemble classifier. The experimental results demonstrate that the proposed approach performs well in classifying and identifying image postprocessing operations.

Conflicts of Interest:
The authors declare no conflict of interest.