3.1. Motivation
In this study on CG forensics, the pretext task for self-supervised learning should be useful for the main task of classifying CG images and NIs. As mentioned above, existing self-supervised learning methods tend to favor typical computer vision tasks related to image content understanding, i.e., semantic classification and object detection. Image forensic tasks are, in general, different from these tasks [4,47,48], because in image forensics we are more interested in forensic traces that can be telltales of the image generation or processing history. These traces are often not related to image semantics and are sometimes not even easily visible. For example, as discussed in the previous section and noticed by other researchers, the differences between CG images and NIs may reside in subtle traces of image color, noise, or relative relationships between pixel values (e.g., the discussions presented in [6,29]).
The above observations motivated the design of a new pretext task that can cope better with the problem of distinguishing between CG images and NIs. A natural idea for the design of such a pretext task is to be able to detect different kinds of deviations from natural images. The success in achieving this pretext objective would be beneficial for reaching a satisfying classification performance of CG images and NIs, because, intuitively, CG images can be considered a special kind of synthetic images, the generation procedure of which deviates from the acquisition process of natural images. By following this idea and intuition, in our proposed pretext task, we would like to automatically generate, for each NI in the training set, a number of distorted versions by using various image manipulation operations. These distorted versions of NIs represent different variants of deviations from the authentic NIs and would hopefully act as a kind of useful proxy of CG images that the detector would encounter in practical scenarios. With a training step on such a pretext task, the learned representations of neural networks would be sensitive to various deviations from a large number of NIs, which would later be useful for the classification of NIs and CG images.
Natural images are readily available, e.g., those included in the well-known ImageNet dataset [49]. Therefore, the main “ingredient” of the proposed pretext task was the design of manipulation operations to modify NIs. In order to introduce diverse modifications to NIs, we considered eight manipulation operations with different properties: color jitter with adjustment of saturation and hue, RGB color rescaling, Gaussian blurring, noise addition, sharpness enhancement, histogram equalization, pixel value quantization, and Gamma correction. The main motivation for choosing these manipulations was to cover a large spectrum of diverse changes/deviations from natural images. Natural images have their specific statistics, but understanding and modeling natural image statistics is inherently a very difficult problem [50]. Accordingly, modeling deviations from NIs (including the deviations of CG images from NIs) is also very difficult. Here we chose a workaround and adopted a simple strategy: considering a group of different manipulations that simulate deviations by modifying NIs in different color components (e.g., jitter of color saturation and hue vs. rescaling in RGB space), in different frequency parts (e.g., Gaussian blurring vs. noise addition vs. sharpness enhancement), and with different processing operations on pixel values (e.g., quantization vs. equalization vs. exponential Gamma correction). Hopefully, these various deviations would cover, at least partially, the real deviations of CG images from NIs. Although intuitive and certainly sub-optimal, our strategy appears to be a rational first attempt at designing a self-supervised pretext task for the forensic analysis of CG images and NIs. Similar to how pretext tasks for image understanding (such as solving jigsaw puzzles [41] or predicting image rotation [38]) are sub-optimal for downstream computer vision tasks (as reflected by a performance gap between self-supervised and supervised methods), we could not guarantee that the simulated deviations in our method foresee all possible deviations of CG images from NIs. However, as shown later in our experimental studies, the proposed self-supervised method was in general able to accurately detect CG images produced by four advanced rendering engines that remained unseen during the self-supervised training. This can be considered evidence of the appropriateness of the designed pretext task, although, of course, we cannot ensure that our method works well on any unseen CG images created by other untested rendering algorithms. In addition, we show in the following that these manipulations indeed introduce diverse deviations in terms of simple yet intuitive first- and second-order image statistics.
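To make the eight manipulations concrete, the sketch below implements simplified NumPy versions of each. The parameter values are fixed illustrative choices (in the paper they are drawn at random from the ranges in Table 1), and the color jitter here adjusts saturation only, since a full hue adjustment would require an RGB-to-HSV round-trip that we omit for brevity.

```python
import numpy as np

def gaussian_blur(x, sigma=1.0):
    """Separable Gaussian blur applied per channel of an HxWx3 uint8 image."""
    r = int(3 * sigma)
    t = np.arange(-r, r + 1)
    k = np.exp(-t**2 / (2 * sigma**2))
    k /= k.sum()
    y = x.astype(np.float32)
    for axis in (0, 1):
        y = np.apply_along_axis(lambda v: np.convolve(v, k, mode="same"), axis, y)
    return y.clip(0, 255).astype(np.uint8)

def equalize(x):
    """Per-channel histogram equalization via the cumulative distribution."""
    out = np.empty_like(x)
    for c in range(x.shape[2]):
        hist = np.bincount(x[..., c].ravel(), minlength=256)
        cdf = hist.cumsum()
        cdf = 255 * cdf / cdf[-1]
        out[..., c] = cdf[x[..., c]].astype(np.uint8)
    return out

def manipulate(x, op):
    """Apply one (simplified) manipulation to an HxWx3 uint8 image array."""
    xf = x.astype(np.float32)
    if op == "saturation_jitter":      # part of the color-jitter operation
        gray = xf.mean(axis=2, keepdims=True)
        return (gray + 1.4 * (xf - gray)).clip(0, 255).astype(np.uint8)
    if op == "rgb_rescale":            # independent scaling factor per channel
        return (xf * np.array([0.9, 1.1, 1.05])).clip(0, 255).astype(np.uint8)
    if op == "gaussian_blur":
        return gaussian_blur(x, sigma=1.0)
    if op == "noise_addition":         # additive white Gaussian noise
        return (xf + np.random.normal(0, 5, x.shape)).clip(0, 255).astype(np.uint8)
    if op == "sharpen":                # unsharp masking
        return (2 * xf - gaussian_blur(x, 1.0)).clip(0, 255).astype(np.uint8)
    if op == "equalize":
        return equalize(x)
    if op == "quantize":               # produces a comb-like histogram
        return ((x // 32) * 32).astype(np.uint8)
    if op == "gamma":                  # gamma < 1 stretches toward high values
        return (255 * (xf / 255) ** 0.7).clip(0, 255).astype(np.uint8)
    raise ValueError(op)
```

Each operation maps a uint8 image to a distorted uint8 image of the same shape, so the distorted versions can be fed to the network exactly like authentic NIs.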
In Figure 2 we show a natural image from the ImageNet dataset and five modified versions of this NI. We can observe that, visually, the modified images are quite different from each other, which illustrates the good diversity of the introduced modifications. In order to demonstrate that the various modified versions can indeed represent different deviations from the real distributions of NIs, we show in Figure 3 the empirical marginal distribution of pixel values (i.e., the conventional image pixel histogram) and in Figure 4 an empirical joint distribution of pixel values (here embodied by the co-occurrences of two horizontally neighboring pixels). The results in these two figures are for the green channel of the images; we have similar observations for the other channels. It can be seen from Figure 3 that different manipulations tend to introduce different types of distortions in terms of the pixel value histogram. For example, the quantization operation results in a comb-like histogram, and the Gamma correction operation (here with a factor smaller than 1) stretches the histogram towards large values, with gaps and peaks appearing in the histogram after manipulation. Similar differences in distribution modifications can be observed for the pixel co-occurrences in Figure 4. For instance, Gaussian blurring makes the distribution less dispersed, with a thinner concentration band around the diagonal direction. Noise addition has somewhat the inverse effect, which, to some extent, expands the band near the diagonal. The diversity of the considered manipulations, as demonstrated in Figure 3 and Figure 4, would be helpful for learning appropriate representations for the main task of differentiating between CG images and NIs.
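The two statistics plotted in Figures 3 and 4 can be computed in a few lines. The sketch below derives both from the green channel of a uint8 image: the 256-bin pixel histogram (first-order) and the 256x256 co-occurrence matrix of horizontally neighboring pixel pairs (second-order).

```python
import numpy as np

def green_histogram(img):
    """256-bin histogram of the green channel of an HxWx3 uint8 image."""
    return np.bincount(img[..., 1].ravel(), minlength=256)

def horizontal_cooccurrence(img):
    """256x256 counts of (left pixel, right pixel) green-value pairs."""
    g = img[..., 1]
    left, right = g[:, :-1].ravel(), g[:, 1:].ravel()
    cooc = np.zeros((256, 256), dtype=np.int64)
    np.add.at(cooc, (left, right), 1)   # unbuffered accumulation of pair counts
    return cooc
```

Comparing these statistics before and after a manipulation reproduces the qualitative observations above: blurring concentrates the co-occurrence mass in a thin band around the diagonal, while noise addition spreads it away from the diagonal.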
3.2. Self-Supervised Learning Task
The objective of our method was, first of all, to carry out self-supervised training on an appropriate pretext task: correctly classifying authentic images and several kinds of processed images obtained by applying different manipulation operations to authentic images. This allowed us to learn features/representations that were then utilized to solve the main task of discriminating between synthetic CG images created by advanced rendering engines and natural images acquired by digital cameras. We would like to point out that both the CG images and the NIs involved in the main task remained unseen during the self-supervised learning stage. As mentioned in the previous subsection, in our proposed pretext task we considered eight manipulation operations, which are listed in the first column of Table 1. The second column presents the specifications concerning their parameters, if any. In practice, when applying a manipulation to an image, the random parameters, if any, are drawn uniformly from the ranges/sets given in Table 1. For example, in the color jitter operation, two random scaling factors are drawn, respectively, for the scaling of the color saturation and the color hue, while in the RGB color rescaling operation three factors are drawn randomly and independently and then applied, respectively, to the R, G, and B channels. As mentioned in the last subsection, we chose the eight manipulations in Table 1 to introduce a large spectrum of diverse modifications to NIs, in order to simulate, to some extent, the deviations of CG images from NIs. The selection of manipulations is rather empirical, but as shown in Section 4 it leads to good forensic performance in different scenarios for the discrimination of CG images and NIs. In general, the selected manipulations are believed to be related to the differences between CG images and NIs. For example, several previous methods extracted color features which proved to be effective for detecting CG images [18,21]. Therefore, the color jitter, color rescaling, or even Gamma correction operations would allow us to learn representations sensitive to color changes and, thus, discriminative for distinguishing CG images from NIs. Furthermore, in the literature, researchers have found that CG images and NIs differ in terms of edge statistics [16] and local smoothness [6]; accordingly, learned features sensitive to Gaussian blurring, noise addition, and sharpness enhancement would be useful for the classification of CG images and NIs. In addition, intuitively and perceptually, CG images appear in general less “colorful” than NIs, probably due to computational or algorithmic limitations of rendering engines. Hence, pixel value quantization might also be helpful for the considered CG forensic problem. Finally, including other manipulations not as closely related to deviations of CG images (one possible example being histogram equalization) would still help to learn representations sensitive to more changes from NIs in general, which would later be more easily adapted to the main task of CG forensics.
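The per-manipulation parameter drawing described above can be sketched as a small lookup of ranges plus uniform sampling. The ranges below are purely illustrative placeholders (the actual ranges/sets are specified in the second column of Table 1, which we do not reproduce here), and a discrete parameter such as the quantization step would in practice be drawn from a finite set rather than a continuous range.

```python
import random

# Hypothetical parameter ranges, for illustration only; see Table 1 in the
# paper for the actual specifications of each manipulation.
PARAM_RANGES = {
    "color_jitter":      {"saturation": (0.5, 1.5), "hue": (-0.1, 0.1)},  # two factors
    "rgb_rescale":       {"r": (0.8, 1.2), "g": (0.8, 1.2), "b": (0.8, 1.2)},  # three
    "gaussian_blur":     {"sigma": (0.5, 2.0)},
    "noise_addition":    {"std": (1.0, 10.0)},
    "sharpness":         {"factor": (1.0, 3.0)},
    "hist_equalization": {},                       # parameter-free
    "quantization":      {"step": (8.0, 64.0)},    # a discrete set in practice
    "gamma_correction":  {"gamma": (0.5, 2.0)},
}

def draw_params(op, rng=random):
    """Draw each parameter of a manipulation uniformly from its range."""
    return {k: rng.uniform(lo, hi) for k, (lo, hi) in PARAM_RANGES[op].items()}
```

Drawing fresh parameters on every application keeps the distorted training samples varied even when the same NI and the same manipulation are revisited.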
One may design the pretext task as a binary classification problem between authentic images and modified ones generated by any of the eight manipulations. However, in practice, this binary-classification pretext task leads to limited performance on the main task of CG forensics, probably due to the somewhat over-simple nature of such a binary task. A more effective, and also more challenging, pretext task is the multi-class classification of authentic images and their eight modified versions obtained after applying the eight manipulations listed in Table 1. Success in this multi-class classification task requires learning more sophisticated and discriminative representations, not only capable of classifying between NIs and their modified versions, but also able to accurately predict the type of manipulation of each modified image. Similar observations were made in [38], where the authors found that multi-class classification of a number of rotation angles of an image was a more effective pretext task than the simple binary classification between original un-rotated images and any rotated ones.
Therefore, in our self-supervised pretext task there were in total 9 classes of training samples, i.e., the NI (with ground-truth label 0) and the modified images corresponding to the eight kinds of manipulations (with ground-truth label in the set $\{1, 2, \dots, 8\}$, as shown in the last column of Table 1). Training images of different classes for the 9-class pretext task were generated dynamically. For each image $x$ in a batch of NIs drawn from the ImageNet dataset [49] (for the sake of simplicity, and with a little abuse of notation, we omit the batch index), a pseudo-random number $t$ was drawn from the uniform distribution on the set $\{0, 1, 2, \dots, 8\}$, and then a training sample $\tilde{x}$ was constructed for the pretext task as:

$$\tilde{x} = \begin{cases} x & \text{if } t = 0, \\ T_t(x) & \text{if } t \in \{1, 2, \dots, 8\}. \end{cases} \quad (1)$$

In the above equation, $T_t$ stands for the manipulation indexed by $t$ (as mentioned above, the indices of manipulations are given in the last column of Table 1), with randomly drawn parameters, if any. It can be seen that with the dynamic sample generation procedure of Equation (1) and with a large number of NIs (which was the case in our pretext task training), we could ensure that the considered 9-class classification problem was well balanced. Usually, a balanced problem facilitates the learning and implementation, but it is possible that an unbalanced self-supervised learning problem would be more beneficial for downstream tasks; e.g., during the learning process we could dynamically generate more samples for difficult manipulations with a higher training/validation error. In the future we plan to carry out studies to gain a better understanding and make improvements regarding this point.
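The dynamic sample generation of Equation (1) amounts to drawing a uniform class index and applying the corresponding manipulation, as in the sketch below. Here `manipulations` is a hypothetical list of eight callables indexed 1 to 8 as in the last column of Table 1.

```python
import random

def make_pretext_sample(image, manipulations, rng=random):
    """Return (sample, label): the label is the uniformly drawn index t."""
    t = rng.randint(0, 8)      # uniform over the 9 classes, hence balanced
    if t == 0:
        return image, 0        # authentic NI, kept unchanged
    return manipulations[t - 1](image), t
```

Because $t$ is uniform and samples are generated on the fly for every batch, the empirical class frequencies converge to 1/9 each over a large number of NIs, which is exactly the balance property noted above.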
It can be seen that $t$ was, in fact, the ground-truth label of the generated sample $\tilde{x}$. We can also observe that the self-supervised labels $t$ were generated automatically and dynamically, without any effort in terms of human manual labeling. Afterward, the pretext task training was carried out by minimizing the conventional cross-entropy loss of the 9-class classification problem as (again with a little abuse of notation):

$$\mathcal{L} = -\frac{1}{M} \sum_{m=1}^{M} \sum_{c=0}^{8} \mathbb{1}\left[t^{(m)} = c\right] \log \frac{\exp\left(s_c^{(m)}\right)}{\sum_{j=0}^{8} \exp\left(s_j^{(m)}\right)}, \quad (2)$$

where $M$ represents the number of training samples used for the pretext task training, the class index $c \in \{0, 1, \dots, 8\}$ reflects that what we tried to solve was a 9-class classification problem, $s_c^{(m)}$ with $c \in \{0, 1, \dots, 8\}$ are the raw (non-normalized) class scores computed by the neural network for the $m$-th sample, and $\mathbb{1}[\cdot]$ stands for the indicator function, with $\mathbb{1}[\text{true}] = 1$ and $\mathbb{1}[\text{false}] = 0$.
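The cross-entropy loss above can be computed in a numerically stable way directly from the raw scores, as in the sketch below: `scores` is a hypothetical (M, 9) array of non-normalized network outputs and `labels` holds the self-supervised ground truth $t$ for each sample.

```python
import numpy as np

def cross_entropy(scores, labels):
    """Mean negative log-softmax probability assigned to the true class."""
    shifted = scores - scores.max(axis=1, keepdims=True)   # stability shift
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())
```

Subtracting the row-wise maximum before exponentiating leaves the softmax unchanged while avoiding overflow; with all scores equal, the loss is log 9, the value of random guessing over the 9 classes.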
It can be observed that our proposed self-supervised training procedure did not require any human manual labeling effort and did not use any CG images, which ensured that our method was very flexible and useful in practical application scenarios. After the self-supervised training on the pretext task, the obtained neural network could later be used to solve the main task of classifying CG images and NIs. The evaluation protocols for the main task and the obtained results are presented in the next section.