Distinguishing Computer-Generated Graphics from Natural Images Based on Sensor Pattern Noise and Deep Learning

Computer-generated graphics (CGs) are images generated by computer software. The rapid development of computer graphics technologies has made it easier to generate photorealistic computer graphics, and these graphics are quite difficult to distinguish from natural images (NIs) with the naked eye. In this paper, we propose a method based on sensor pattern noise (SPN) and deep learning to distinguish CGs from NIs. Before being fed into our convolutional neural network (CNN)-based model, the images, both CGs and NIs, are clipped into image patches. Furthermore, three high-pass filters (HPFs) are used to remove low-frequency signals, which represent the image content. These filters also reveal the residual signal as well as the SPN introduced by the digital camera device. Different from traditional methods of distinguishing CGs from NIs, the proposed method utilizes a five-layer CNN to classify the input image patches. Based on the classification results of the image patches, we deploy a majority vote scheme to obtain the classification results for the full-size images. Experiments demonstrate that (1) the proposed method with three HPFs achieves better results than with only one HPF or no HPF and that (2) the proposed method with three HPFs achieves 100% accuracy, even when the NIs undergo JPEG compression with a quality factor of 75.


I. INTRODUCTION
Computer-generated graphics (CG) are images generated by computer software such as 3ds Max, Maya, and Cinema 4D. In recent years, with the aid of such software, it has become easier to generate photorealistic computer graphics (PRCG), which are quite difficult to distinguish from natural images (NI) with the naked eye. Some examples of computer graphics are shown in Figure 1. Although these rendering software suites help us to create images and animation conveniently, PRCG could also pose serious security issues to the public if used in fields such as justice and journalism [1]. Therefore, as an essential topic in the domain of digital image forensics [2], distinguishing CG from NI has attracted increasing attention in the past decade.
Several algorithms have recently been proposed to distinguish computer-generated graphics from natural images. Xiaofen Wang et al. [4] present a customized statistical model based on the homomorphic filter and use support vector machines (SVMs) as a classifier to discriminate photorealistic computer graphics (PRCG) from natural images. Zhaohong Li et al. [5] present a multiresolution approach to distinguishing CG from NI based on local binary pattern (LBP) features and an SVM classifier. Jinwei Wang et al. [6] present a classification method based on the first four statistical features extracted from the quaternion wavelet transform (QWT) domain. Fei Peng et al. [7] present a method that extracts 24 dimensions of features based on multi-fractal and regression analysis for the discrimination of computer-generated graphics and natural images. However, all of these methods depend on handcrafted features from computer-generated graphics and natural images, and they also depend on SVMs as the classifier.

Ye Yao is with the School of Cyberspace, Hangzhou Dianzi University, Hangzhou 310018, China (e-mail: yyaoprivate@gmail.com). Yun-Qing Shi is with the Department of Electrical and Computer Engineering, New Jersey Institute of Technology, Newark, NJ 07102 USA (e-mail: shi@njit.edu).
Deep learning has been used in many new fields and has achieved great success in recent years. Deep neural networks such as the convolutional neural network (CNN) have the capacity to automatically obtain high-dimensional features and reduce their dimensionality efficiently [8]. Some researchers have begun to utilize deep learning to solve problems in the domain of image forensics, such as image manipulation detection [9], camera model identification [11], [12], steganalysis [13], [14], image copy-move forgery detection [10], and so on.
In this paper, we propose a method based on sensor pattern noise and deep learning to distinguish computer-generated graphics (CG) from natural images (NI). The main contributions are summarized as follows:
1) Different from traditional methods of distinguishing CG from NI, the proposed approach utilizes a five-layer convolutional neural network (CNN) to classify the input images.
2) Before being fed into the CNN-based model, the images, both CG and NI, are clipped into image patches.
3) Several high-pass filters (HPFs) are used to remove the low-frequency signal, which represents the image content. These filters also enhance the residual signal as well as the sensor pattern noise introduced by the digital camera device.
4) The experimental results show that the proposed method with three high-pass filters achieves 100% accuracy, even when the natural images undergo JPEG compression with a quality factor of 75.

II. RELATED WORKS
In this paper, we propose a method of distinguishing computer-generated graphics from natural images based on sensor pattern noise and deep learning. There are several studies related to deep learning as well as to sensor pattern noise used for forensics.

A. Methods Based on Deep Learning
Gando et al. [15] presented a deep learning method based on a fine-tuned deep convolutional neural network. This method can automatically distinguish illustrations from photographs and achieves 96.8% accuracy. It outperforms other models, including custom CNN-based models trained from scratch and traditional models using handcrafted features.
Rahmouni et al. [3] presented a custom pooling layer to extract statistical features and a CNN framework to distinguish computer-generated graphics from real photographic images. A weighted voting scheme was used to aggregate the local estimates of class probabilities and predict the label of the whole picture. The best accuracy in [3] is 93.2%, obtained by the proposed Stats-2L model.

B. Sensor Pattern Noise Used for Forensics
Different digital cameras introduce different noise into their output digital images. The main noise sources are imperfections of the CCD or CMOS sensors. This noise is known as sensor pattern noise (SPN) and is used as a fingerprint to characterize an individual camera. In particular, SPN has been used in image forgery detection [16] and source camera identification [17].
Villalba et al. [18] presented a method for video source acquisition identification based on sensor noise extraction from video key frames. Photo response non-uniformity (PRNU) is the primary component of the sensor pattern noise in an image. In [18], the PRNU is used to calculate the sensor pattern noise and characterize the fingerprints as feature vectors. The feature vectors are then extracted from the video key frames and used to train an SVM-based classifier.

III. PROPOSED METHOD
The proposed method consists of two primary steps: image preprocessing and CNN-based model training. In the first step, the input images, including the computer-generated graphics and the natural images, are clipped into image patches, and then three types of high-pass filter (HPF) are applied to the image patches. These filtered image patches constitute the positive and negative training samples. In the second step, the filtered image patches are fed to the proposed CNN-based model for training. The proposed CNN-based model is a five-layer CNN. In this section, we introduce these steps of our method in detail.

A. Image Preprocessing
1) Clipped into Image Patches:
The natural images taken by cameras and the computer graphics generated by software often have a large resolution. Due to hardware memory limitations, we need to clip these full-size images into smaller image patches before they are fed into our neural network for training. This is also a data augmentation strategy in deep learning approaches to computer vision [8]. Data augmentation [19] helps to increase the number of training samples used for deep learning and improves the generalization capability of the trained model. Therefore, we propose to clip all of the full-size images into image patches. The resolution of each image patch is 650×650. We chose this size as a trade-off between processing time and computational limitations.
Both the computer-generated graphics and the natural images are clipped into image patches. All of the clipping operations are label-preserving. That is to say, we prepare the positive samples by drawing image patches from the full-size natural images. In a similar way, we obtain negative samples from the full-size computer-generated graphics. However, natural images taken by cameras usually have a larger resolution than computer-generated graphics. If we want the number of negative samples and the number of positive samples to be approximately equal, we need to clip more image patches from each computer-generated graphic than from each natural image. In light of this, we set the stride size for natural images to the width of the image patches (i.e., 650). After analyzing the resulting number of image patches, we set the stride size for computer-generated graphics to a smaller value (i.e., 65).

Fig. 2. Three types of high-pass filter (HPF) used in the proposed method.
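As a concrete illustration, the clipping step can be sketched as follows. The patch size (650) and the per-class strides (650 for natural images, 65 for computer-generated graphics) come from the text above; the function name and the example resolutions are illustrative only.

```python
import numpy as np

def clip_to_patches(image, patch_size=650, stride=650):
    """Clip a full-size grayscale image (H x W array) into
    patch_size x patch_size blocks, sliding by `stride` pixels."""
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - patch_size + 1, stride):
        for x in range(0, w - patch_size + 1, stride):
            patches.append(image[y:y + patch_size, x:x + patch_size])
    return patches

# A natural image is clipped with a non-overlapping stride of 650,
# while a computer-generated graphic uses a stride of 65 so that the
# two classes yield comparable numbers of patches.
natural = np.zeros((2000, 3008))   # a typical RAISE-like resolution
graphic = np.zeros((650, 1280))    # a cropped game screenshot
ni_patches = clip_to_patches(natural, stride=650)
cg_patches = clip_to_patches(graphic, stride=65)
```

With these example resolutions, a single natural image yields 12 non-overlapping patches, while the smaller graphic yields 10 heavily overlapping ones, illustrating how the smaller stride balances the two classes.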
2) Filtered with High-Pass Filter: Since natural images and computer-generated graphics are created by different pipelines, there should exist some distinct differences between them. As is well known, sensor pattern noise (SPN) has been used to identify the source camera of a natural image and has achieved excellent performance [11], [12], [17]. However, there is no sensor pattern noise in computer-generated graphics. Based on this observation, we propose our method to discriminate computer-generated graphics from natural images.
Fridrich et al. [20] designed several high-pass filters for the steganalysis of digital images. As mentioned in [20], these filters have the ability to obtain the noise residuals and suppress the low-frequency component, which represents the image content. Qian et al. [21] proposed a customized convolutional neural network for steganalysis. This customized deep learning approach starts with a predefined high-pass filter, proposed as the SQUARE5×5 noise residual model in [20]. Furthermore, this noise residual model has been applied to deep learning-based camera model identification [12] as well as to deep learning-based video forgery detection [8], and has achieved excellent performance.
In this paper, we utilize several high-pass filters in our method to obtain the sensor noise residuals and reduce the impact of the image content. The image patches are convolved with these predefined high-pass filters. Furthermore, in order to reduce the computational complexity, the image patches are first converted to grayscale. The predefined high-pass filters are applied to the grayscale image patches, and the resulting noise residuals are then piped into the proposed convolutional neural network.
The proposed high-pass filters are shown in Figure 2. Three types of high-pass filter are used in our method. The SQUARE5×5 and SQUARE3×3 were proposed as noise residual models in [20]. The EDGE3×3 was designed by us, with a structure different from all the other filters in [20]. In order to give the three filters the same size, the elements outside the central 3×3 region of the SQUARE3×3 and EDGE3×3 kernels are set to zero.
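To make the filtering step concrete, the sketch below convolves a grayscale patch with the SQUARE5×5 kernel, whose coefficients follow the noise residual model of [20]. The SQUARE3×3 coefficients shown are an assumption in the spirit of [20] (the exact kernels appear only in Figure 2), and the zero padding to 5×5 mirrors the text.

```python
import numpy as np

# SQUARE5x5 noise residual kernel from the SRM filter bank of [20].
SQUARE5x5 = np.array([[-1,  2,  -2,  2, -1],
                      [ 2, -6,   8, -6,  2],
                      [-2,  8, -12,  8, -2],
                      [ 2, -6,   8, -6,  2],
                      [-1,  2,  -2,  2, -1]]) / 12.0

def pad_to_5x5(k3):
    """Zero-pad a 3x3 kernel to 5x5, as described in the text."""
    out = np.zeros((5, 5))
    out[1:4, 1:4] = k3
    return out

# Assumed SQUARE3x3 coefficients (illustrative, after [20]).
SQUARE3x3 = pad_to_5x5(np.array([[-1,  2, -1],
                                 [ 2, -4,  2],
                                 [-1,  2, -1]]) / 4.0)

def high_pass(img, kernel):
    """Valid 2-D convolution of a grayscale patch with a HPF kernel,
    returning the noise residual."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

# A constant (content-only, noise-free) patch yields a zero residual,
# illustrating how the HPF suppresses the low-frequency content.
residual = high_pass(np.full((16, 16), 128.0), SQUARE5x5)
```

Because each kernel's coefficients sum to zero, flat image regions are suppressed entirely and only high-frequency residuals such as sensor noise survive.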

B. CNN-Based Model Training
The proposed convolutional neural network architecture is illustrated in Figure 3. The input to the proposed neural network consists of image patches clipped from the full-size natural images or computer-generated graphics, with a resolution of 1×(650×650), where 1 is the number of channels (grayscale) and 650 is the width and height.
There is a high-pass filter layer at the top of the proposed CNN-based model. This filter layer has three possible combinations of high-pass filters, one of which must be selected for training. The High Pass Filter×3 combination consists of all three proposed filters, i.e., the SQUARE5×5, the EDGE3×3, and the SQUARE3×3. The High Pass Filter×1 combination contains only the SQUARE5×5 filter. The High Pass Filter×0 combination utilizes an average pooling layer instead of the high-pass filter layer. The number of feature maps output by the filter layer depends on the chosen combination: with High Pass Filter×3, the layer outputs three feature maps of size 325×325; otherwise, it outputs a single feature map of size 325×325.
The proposed CNN architecture consists of five convolutional layers. Each convolutional layer is followed by a batch normalization (BN) [22] layer, a rectified linear unit (ReLU) [23] layer, and an average pooling layer. At the bottom of the proposed model, a fully-connected layer and a softmax layer transform the 128-dimensional feature vectors into classification probabilities for the image patches.
The kernel sizes of the convolution layers in the proposed CNN-based model are 5×5, 5×5, 3×3, 3×3, and 1×1, respectively. The numbers of feature maps output by each layer are 8, 16, 32, 64, and 128, respectively, and the sizes of the feature maps are 325×325, 162×162, 81×81, 40×40, and 20×20, respectively. The kernel size of the average pooling in each layer is 5×5 and the stride is 2. Note that the last average pooling layer has a global kernel size of 20×20.
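The feature map sizes listed above are consistent with Caffe-style ceil-mode pooling arithmetic. The sketch below reproduces the sequence under the assumption of a padding of 1 in each pooling layer (the padding value is not stated in the text).

```python
import math

def pool_out(size, kernel=5, stride=2, pad=1):
    # Caffe computes pooled output sizes with ceiling division:
    # out = ceil((in + 2*pad - kernel) / stride) + 1
    return math.ceil((size + 2 * pad - kernel) / stride) + 1

sizes = [325]
for _ in range(4):
    sizes.append(pool_out(sizes[-1]))
# sizes -> [325, 162, 81, 40, 20], matching the feature map sizes
# above; the final 20x20 map is collapsed by the global average pool.
```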

IV. EXPERIMENTS
A. Dataset
We compared our deep learning approach with the state-of-the-art method in [3]. The dataset used in this paper is the same as the dataset in [3]. It consists of 1800 computer-generated graphics and 1800 natural images. The computer-generated graphics were downloaded from the Level-Design Reference Database [26], which contains more than 60,000 screenshots of photorealistic video games. The game information was removed by cropping the images to a resolution of 1280×650. The preprocessed images can be downloaded from the link on GitHub [27]. Some computer graphics samples are shown in Figure 1. The natural images are taken from the RAISE dataset [24]. The resolution of these natural images ranges from 3008×2000 to 4900×3200. All of these natural images were downloaded in RAW format and converted to JPEG with a quality factor of 95.
In our experiment, 900 natural images and 900 computer-generated graphics were randomly selected from the dataset for training, 800 natural images and 800 computer-generated graphics were set aside for testing, and 100 natural images and 100 computer-generated graphics were used for validation. All of these full-size images were then clipped into image patches of size 650×650. The number of image patches obtained for training was about 44,000.

B. Experiment Setup
We implemented the proposed convolutional neural network based on the Caffe framework [25]. All of the experiments were conducted on a GeForce GTX 1080 Ti GPU. The stochastic gradient descent algorithm was used to optimize the proposed CNN-based model. The initial learning rate was set to 0.001. The learning rate update policy was set to inv with a gamma of 0.0001 and a power of 0.75. The momentum and weight decay were set to 0.9 and 0.0005, respectively. The training batch size was set to 64; that is, 64 image patches were fed to the CNN-based model in each iteration. After 80 epochs of training, the trained CNN-based model was obtained for testing.
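For reference, Caffe's inv learning rate policy decays the rate as lr = base_lr * (1 + gamma * iter)^(-power); a minimal sketch with the hyperparameters above:

```python
def inv_lr(iteration, base_lr=0.001, gamma=0.0001, power=0.75):
    """Caffe's "inv" policy: lr = base_lr * (1 + gamma*iter)^(-power)."""
    return base_lr * (1.0 + gamma * iteration) ** (-power)

# The rate starts at base_lr (0.001) and decays smoothly with the
# iteration count rather than in discrete steps.
```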
To evaluate the performance of the proposed CNN-based model, we applied the trained model to the testing dataset. All of the full-size images in the testing dataset were preprocessed in a manner similar to the training images: after preprocessing, the testing images were clipped into image patches. These image patches were then fed to the trained CNN-based model to obtain prediction results for the image patches. Based on the prediction results of the image patches, we deployed a majority vote scheme to obtain the classification results for the full-size images.
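The majority vote over patch-level predictions can be sketched as follows; the function name and the 0/1 label encoding are illustrative.

```python
from collections import Counter

def majority_vote(patch_labels):
    """Aggregate per-patch predictions (e.g., 0 = CG, 1 = NI) into a
    single label for the full-size image by simple majority."""
    return Counter(patch_labels).most_common(1)[0][0]

# If 9 of 12 patches of an image are predicted as natural image,
# the full-size image is classified as a natural image.
label = majority_vote([1] * 9 + [0] * 3)
```

Because the full-size label only flips when a majority of patches are misclassified, this scheme absorbs isolated patch-level errors, which explains the robustness results reported below.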

C. Experimental Results
1) Different Numbers of High-Pass Filters: As shown in Figure 3, the proposed convolutional neural network has three possible combinations for the high-pass filter layer, each with a different number of high-pass filters. We trained each combination for 80 epochs and saved two models per combination. In other words, for the High Pass Filter×3 combination we obtained a model at 50 epochs and a model at 80 epochs, and likewise for the other two combinations. Figures 4 and 5 show the evolution of the training loss and validation accuracy during training: the validation accuracy is shown in Figure 4, and the training loss in Figure 5. It is observed that the proposed method with High Pass Filter×3 converges much faster than the others and achieves much higher prediction accuracy.
To evaluate the classification performance of the proposed method with different numbers of high-pass filters, we tested the models obtained during training on the testing dataset. The classification accuracy is shown in Table I. Note that the size of the image patches in the method of Rahmouni et al. [3] is 100×100. In our experiments, we set the size of the image patches to 650×650 to meet the requirement of our neural network architecture. A majority vote scheme was applied to the testing results of the image patches to obtain the classification results for the full-size images.
Compared with the state-of-the-art method of Rahmouni et al. [3], our method with the high-pass filter obtained better performance. Furthermore, the proposed method with High Pass Filter×3 outperformed the other combinations and obtained the best performance: the classification accuracy for the full-size images reached 100%. These experimental results demonstrate the effectiveness of the high-pass filter in the preprocessing procedure of our proposed deep learning approach.

2) Different Quality Factors of Natural Images: We also evaluated the robustness of our proposed method to different JPEG quality factors. In this experiment, 2000 natural images in RAW format were downloaded from the RAISE-2k dataset [24], and 1800 of them were randomly selected. These RAW images were converted to JPEG format with quality factors of 95, 85, and 75, yielding three sub-datasets of natural images. Each sub-dataset was then divided into training (50%), testing (40%), and validation (10%) sets to form the datasets for the robustness experiment. Note that the computer-generated graphics in this experiment remained untouched; they had already been compressed with a reasonable quality factor when the dataset was collected.
For the filter layer, we utilized High Pass Filter×3, the combination that achieved the best performance in the previous experiment. Figures 6 and 7 show the evolution of the training loss and validation accuracy during training: the validation accuracy is shown in Figure 6, and the training loss in Figure 7. The classification accuracy for different quality factors of natural images is shown in Table II. It is observed that the proposed method with High Pass Filter×3 achieves excellent performance. Although compression with different quality factors affects the classification accuracy for individual image patches, thanks to the majority vote scheme applied to the full-size images, the classification accuracy for every quality factor of the natural images is 100%.

V. CONCLUSION
In this paper, we develop an approach to distinguishing computer-generated graphics from natural images based on sensor pattern noise and a convolutional neural network. The experimental results show that the proposed method obtains better performance than the method in [3] on the same dataset. Currently, there are several computer-generated graphics datasets [5], [7] for forensics research. However, many images in these datasets are smaller than 650 pixels in width or height, which cannot meet the input size requirement of the proposed convolutional neural network. In the future, we will focus on improving our CNN-based model to handle smaller images. Furthermore, applying a single trained CNN-based model to discriminate computer-generated graphics across all existing datasets (namely, one model for all datasets) would be another interesting direction for future work.
This work was supported by the Zhejiang Provincial Natural Science Foundation of China (No. LY14F020044), the Key Research and Development Program of Zhejiang Province (No. 2017C01062), the Public Technology Application Research Project of Zhejiang Province (No. 2017C33146), the Humanities and Social Sciences Foundation of the Ministry of Education of China (No. 17YJC870021), and the National Natural Science Foundation of China (No. 61772165). (Corresponding authors: magherozhw@hdu.edu.cn (W.Z.), wuting@hdu.edu.cn (T.W.))

Fig. 3. The proposed convolutional neural network architecture. Names and parameters of each layer are displayed in the boxes. Kernel sizes in each convolution layer are shown as number of kernels × (width × height × number of inputs). Sizes of feature maps between consecutive layers are shown as number of feature maps × (width × height). Padding is used in each convolutional layer to keep the shape of the image patches. BN: batch normalization; ReLU: rectified linear units.

Fig. 4. Evolution of validation accuracy with different numbers of high-pass filters.
Fig. 5. Evolution of training loss with different numbers of high-pass filters.

TABLE I: CLASSIFICATION ACCURACY WITH DIFFERENT NUMBERS OF HIGH-PASS FILTERS.

TABLE II: CLASSIFICATION ACCURACY FOR DIFFERENT QUALITY FACTORS OF NATURAL IMAGES.