Structural Correlation Based Method for Image Forgery Classiﬁcation and Localization

Abstract: In image forgery problems, previous works have chiefly been designed considering only one of two forgery types: copy-move and splicing. In this paper, we propose a scheme that handles both copy-move and splicing forgery by concurrently classifying the image forgery type and localizing the forged regions. Structural correlations between images are employed in a forgery clustering algorithm to assemble relevant images into clusters. We then search for matching image regions inside each cluster to classify and localize tampered images. Comprehensive experiments on three datasets (MICC-600, GRIP, and CASIA 2) demonstrate that the proposed method outperforms state-of-the-art methods in forgery classification and localization. Furthermore, in copy-move localization, the source and target regions are explicitly distinguished.


Introduction
In an era of globalization, social networks such as Facebook, Twitter, and Instagram are widely used in our daily lives, and a huge number of photos are uploaded to these networks every day. Further, it has become easy even for unpracticed users to manipulate digital images without leaving any perceptible trace. Copy-move and image splicing are the two most popular image manipulation methods. In copy-move forgery (CMF), one or more regions are copied from an authentic image and then pasted into other regions of the same image. The authentic image used to compose the copy-move image is called the host image. In image splicing, on the other hand, some regions are copied from a source image (the donor image) and pasted into a target image (the host image) [1]. Examples of copy-move and spliced images are given in Figure 1.
In the image forgery scenario, a tampered region might not be exactly the same as the original region since it usually undergoes a sequence of post-processing operations such as rotation, scaling, edge softening, blurring, denoising, and smoothing for a better visual appearance [1]. Therefore, human beings may easily be deceived by tampered images and it is difficult to manually verify the authenticity of images.
Many researchers have put considerable effort into detecting and localizing tampered regions of image forgery. However, in most cases, forgery detection and localization algorithms were designed considering only one of the two forgery types, copy-move and image splicing. In this paper, we propose an image forgery detection and localization algorithm that can handle both types of image forgery simultaneously. The proposed method utilizes the bag-of-features (BOF) image representation and Hamming Embedding (HE) based image retrieval. The image forgery clustering algorithm classifies input images into distinct clusters, each of which consists of one authentic image and all the spliced and copy-move images that were composed using that authentic image as the host. The algorithm also determines the authentic image based on the structure and luminance similarity between images and assigns it as the centroid of the cluster. The cluster centroid is then used to classify the image forgeries and localize the tampered regions. The experimental results show that the proposed method outperforms state-of-the-art techniques in image forgery classification and localization accuracy. In addition, we distinguish the source and target regions in copy-move tampering localization.

The rest of this paper is organized as follows. Section 2 provides a brief review of image splicing and copy-move detection and localization methods. In Section 3, we present the image retrieval algorithm based on HE and BOF. The proposed image clustering algorithm is introduced in Section 4. Section 5 presents the image forgery classification and localization. The experimental results are discussed in Section 6. Finally, Section 7 concludes the paper.

Related Works
In the literature, the image splicing forgery detection problem has been addressed efficiently [2][3][4]. In recent years, substantial attention has been paid to deep learning based approaches [5,6] for localizing image splicing [7][8][9][10][11][12][13][14][15], wherein convolutional neural networks (CNNs) have been widely used [8][9][10][11][12]. Bondi et al. [8] extracted and employed features capturing characteristic traces from different camera models to localize a tampered mask with an iterative clustering algorithm. A region proposal network and a conditional random field are the main components of the model developed by Chen et al. [10]. The noise-level difference between spliced and original regions was utilized to find splicing traces [11,13]. A non-linear camera response function was used by Yao et al. [11] and combined with a noise level function, exploiting the strong relationship between the two functions to localize forged edges using a CNN. Mayer et al. [12] used a similarity network and a CNN-based feature extractor to determine whether image patches contain different traces or were captured by different camera models. Zeng et al. [13] estimated noise levels using principal component analysis and then clustered them with the k-means algorithm to localize the spliced regions. Matern et al. [15] utilized a gradient-based illumination descriptor to detect illumination inconsistency and object color change, which helped localize image splicing traces. Wang et al. [16] used gamma transformation to detect splicing forgery and localized the spliced region by estimating the probabilities that sliding-window based overlapping blocks had been gamma transformed.
Park et al. [17] introduced the upsampled log-polar Fourier descriptor, which is invariant to rotation and scaling, to robustly detect various types of tampering attacks. Wu et al. [18] proposed a two-branch deep neural network to detect potential manipulation via visual artifacts and visually similar regions, which helps specify the copied and pasted regions. Park et al. [19] used the scale space representation of the scale-invariant feature transform (SIFT) to handle different geometric transformations. PatchMatch, an algorithm for approximate nearest-neighbor search, was combined with Zernike moments to detect copy-move attacks in [23,24], whereas SIFT was utilized in [25][26][27]. In segmentation-based approaches, the input image was semantically segmented into non-overlapping regions [29][30][31][32]. Li et al. [29] developed two stages of matching to detect the copy-move regions: first, the affine transformation matrix between segmented regions was roughly estimated, and then it was iteratively refined using an expectation-maximization based probability model. However, the major disadvantage of this method is its high computational complexity. Zheng et al. [30] classified regions as smooth or non-smooth (keypoint) regions in order to apply two different techniques. On the one hand, SIFT was used in a keypoint-based method to detect forgery in non-smooth regions. On the other hand, Zernike moments were extracted in a block-based method to handle smooth regions. Copy-move forgery localization (CMFL) was effectively performed by fusing the above-mentioned techniques.

Bag-of-Features and Hamming Embedding Based Image Retrieval
In image retrieval, images are represented by descriptive features, which are used to evaluate the similarity or dissimilarity between images. In image forgery, since the forged regions may be rotated, scaled, and translated in different manners, the image features should be invariant to these transformations. The features generated by SIFT [33] have these noteworthy characteristics, and the proposed algorithm utilizes SIFT features to represent images [34,35].
In this section, we briefly review the image retrieval method based on BOF [36][37][38] and HE encoding [38,39]. Suppose that a query image Q is represented by a set of N descriptors X^Q = {x_1^Q, x_2^Q, . . . , x_N^Q}. All of these descriptors are mapped into a visual vocabulary W = {w_1, w_2, . . . , w_K} by a K-means vector quantizer q, which maps each x_n^Q (n = 1, 2, . . . , N) to its closest visual word w_k (k = 1, 2, . . . , K), i.e., q(x_n^Q) = w_k ∈ W. We define the set of indexes of the descriptors of Q assigned to a particular visual word w_k as I_k^Q = {n : q(x_n^Q) = w_k}. The HE matching model is used to estimate the matching of descriptors within a visual word; HE represents each descriptor as a D-dimensional binary signature [38].
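As an illustration, the hard assignment of descriptors to visual words can be sketched as follows. This is a minimal NumPy sketch, not the authors' implementation; the toy two-dimensional "descriptors" stand in for 128-dimensional SIFT vectors.

```python
import numpy as np

def quantize(descriptors, vocabulary):
    """Assign each descriptor to its nearest visual word (hard k-means
    quantization q).  Returns, for every word index k, the index set
    I_k = {n : q(x_n) = w_k}."""
    # pairwise squared distances between descriptors (N x d) and words (K x d)
    d2 = ((descriptors[:, None, :] - vocabulary[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)               # q(x_n) for every n
    K = vocabulary.shape[0]
    return {k: np.flatnonzero(nearest == k).tolist() for k in range(K)}

# toy example: four 2-D "descriptors", two visual words
X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [0.9, 1.1]])
W = np.array([[0.0, 0.0], [1.0, 1.0]])
I = quantize(X, W)   # {0: [0, 1], 1: [2, 3]}
```

In practice the vocabulary would be learned offline with k-means on a large descriptor sample.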
The Hamming distance between two descriptors x_m and x_n is computed from their D-dimensional binary signatures b_m and b_n as follows:

h(x_m, x_n) = Σ_{d=1}^{D} |b_m(d) − b_n(d)|. (1)

Let X^P denote the set of descriptors of a database image P. The matching score of the two sets of descriptors X^Q and X^P under the same visual word w_k is defined as:

f_k(X^Q, X^P) = Σ_{m ∈ I_k^Q} Σ_{n ∈ I_k^P} w(h(x_m^Q, x_n^P)), (2)

where the weighting function for a Hamming distance h is calculated as a Gaussian function [38]:

w(h) = exp(−h² / σ²). (3)

The number of dimensions of the binary signatures is typically set to D = 64, and the Gaussian bandwidth parameter [38,40] is set to σ = D/4 = 16.
In order to retrieve images, an inverted index file is built in the image indexing process. The inverted file consists of a list of entries. Each entry stores a visual word together with the identifiers of the associated images, the descriptors of those images assigned to the visual word, and the HE binary signatures used for matching. When a query Q is performed, the entries of the visual words associated with Q are looked up in the inverted file. The score of a database image P in this query is calculated by accumulating the weighted Hamming distances between the two sets of descriptors' signatures over all the visual words shared by the two images. Specifically, the similarity between Q and P is defined as follows:

sim(Q, P) = Σ_{k=1}^{K} α_k f_k(X^Q, X^P), (4)

where the constant α_k is the inverse document frequency [41] of the visual word w_k in W: if p(w_k) is the probability of w_k occurring in W, then α_k = − log p(w_k).
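A minimal sketch of HE-based scoring, assuming the standard Gaussian kernel exp(−h²/σ²) accumulated over shared visual words; the dictionary layout, IDF weights, and toy signatures below are illustrative only, not the authors' implementation.

```python
import numpy as np

def hamming(b1, b2):
    """Hamming distance between two D-dimensional binary signatures."""
    return int(np.count_nonzero(b1 != b2))

def he_similarity(sig_Q, sig_P, alpha, sigma=16.0):
    """Score a database image P against a query Q.  sig_Q / sig_P map each
    visual word index k to the list of binary signatures of the descriptors
    of Q / P assigned to w_k; alpha[k] is the IDF weight -log p(w_k).
    Each signature pair contributes a Gaussian weight exp(-h^2 / sigma^2)."""
    score = 0.0
    for k in sig_Q.keys() & sig_P.keys():      # shared visual words only
        for bq in sig_Q[k]:
            for bp in sig_P[k]:
                h = hamming(bq, bp)
                score += alpha[k] * np.exp(-h ** 2 / sigma ** 2)
    return score

# toy example with D = 4 for readability (the paper uses D = 64, sigma = 16)
sQ = {0: [np.array([0, 1, 1, 0])]}
sP = {0: [np.array([0, 1, 1, 0])], 1: [np.array([1, 1, 1, 1])]}
s = he_similarity(sQ, sP, alpha={0: 1.0, 1: 1.0}, sigma=1.0)  # h = 0 -> 1.0
```

Word 1 appears only in sP, so it contributes nothing; only shared words are scored, mirroring the inverted-file lookup.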

Image Forgery Clustering
In this section, we give an exposition of the proposed image forgery clustering algorithm. Suppose that we have an input dataset including authentic and tampered images. The proposed algorithm classifies the images into separate clusters, where each cluster consists of all the tampered images composed from an identical host image, together with that host image itself. Subsequently, the proposed algorithm identifies the host image as the centroid of each image cluster. The details of image clustering and centroid determination are provided in Algorithm 1.
First, we randomly select a query image Q from the dataset. The ranking score of a database image P for the query Q is denoted by Ω_P^Q and calculated as the similarity between the two images according to Equation (4). The retrieval result is a list of images arranged in descending order of ranking score. A cut-off threshold θ is set to obtain the set of retrieved images. Let us denote by Q̂ the host image of image Q in the dataset; an authentic image is considered the host image of itself. We need to retrieve all the images R relevant to the query Q, i.e., those satisfying R̂ = Q̂. To this end, we set the threshold θ to a relatively low value. With such a low threshold, some irrelevant images may also be retrieved; these will be discarded in the last step of the iteration. Owing to the insignificant processing time of these operations, we can easily handle the case of a large number of images in a cluster. Further, we perform an additional query to ensure that all the images relevant to Q are retrieved. Notice that the top-ranked image in the retrieved list L_1, image D_1, is identical to the query image Q. Therefore, the second-highest-ranked result in L_1, image D_2, is selected as the next query image. The score threshold θ is also used in this query, yielding the set of retrieved images L_2.
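The two-pass retrieval described above can be sketched as follows. This is a simplified illustration with a precomputed score table standing in for the HE retrieval scores; function and variable names are ours, not the authors'.

```python
def form_cluster(db, query, score, theta):
    """Two-pass retrieval: query with Q, then re-query with the
    second-ranked result D2, and take the union of both result sets."""
    L1 = [p for p in sorted(db, key=lambda p: score(query, p), reverse=True)
          if score(query, p) >= theta]
    # L1[0] is Q itself; re-query with the runner-up D2 to catch
    # relevant images missed by the first pass
    if len(L1) > 1:
        d2 = L1[1]
        L2 = [p for p in db if score(d2, p) >= theta]
    else:
        L2 = []
    return set(L1) | set(L2)

# toy score table: querying "A" misses "D", but re-querying with "B" finds it
S = {("A", "A"): 10, ("A", "B"): 5, ("A", "C"): 4, ("A", "D"): 0,
     ("B", "A"): 5, ("B", "B"): 10, ("B", "C"): 1, ("B", "D"): 4}
cluster = form_cluster(["A", "B", "C", "D"], "A",
                       lambda q, p: S[(q, p)], theta=3)   # {"A","B","C","D"}
```

The union deliberately over-collects; the structural-correlation refinement step later discards the irrelevant members.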
The image cluster C is the union of the two sets of retrieved images, i.e., C = L_1 ∪ L_2. The centroid of C is determined based on two criteria that measure the correlations in structure and luminance among the images in the cluster. In this work, we extract SIFT features [33] from the images and use Random Sample Consensus (RANSAC) [42] to find the keypoint matches.
Let K_UV = {(k_i^U, k_i^V)} denote the set of matched keypoints between two images U and V, where (k_i^U, k_i^V) is a pair of matched keypoints. Then s_UV = |K_UV| is the number of matching keypoints between U and V. We denote by c_i^U the pixel coordinates of k_i^U in U. The number of matching keypoints located at corresponding positions of U and V, denoted by ŝ_UV, is calculated as follows:

ŝ_UV = Σ_{i=1}^{s_UV} δ(c_i^U, c_i^V), (5)

where δ is the Kronecker delta function:

δ(a, b) = 1 if a = b, and 0 otherwise. (6)

We define the ratio ŝ_UV / s_UV as the structural similarity between U and V. In addition, we denote by U_Y(x, y) the luminance value of image U at pixel (x, y), which is calculated as follows [43]:

U_Y(x, y) = 0.299 U_R(x, y) + 0.587 U_G(x, y) + 0.114 U_B(x, y), (7)

where U_R(x, y), U_G(x, y), and U_B(x, y) are the red, green, and blue color values of U at pixel (x, y), respectively. We define l_UV, the luminance similarity between U and V, as follows:

l_UV = (1 / (H W)) Σ_{x=1}^{W} Σ_{y=1}^{H} λ(x, y), (8)

where H and W are the height and width of U, respectively, and

λ(x, y) = δ(U_Y(x, y), V_Y(x, y)). (9)

We determine T, the centroid of the image cluster C, as the image with the highest total structural and luminance similarity to the other images in the cluster:

T = arg max_{U ∈ C} Σ_{V ∈ C, V ≠ U} ( ŝ_UV / s_UV + l_UV ). (10)

Afterwards, we refine the image cluster by discarding the images V that are irrelevant to T (i.e., V̂ ≠ T), namely those satisfying

ŝ_TV / s_TV < 0.5. (11)
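The structure and luminance criteria can be sketched as follows. This is a minimal NumPy sketch; the helper names are ours, and the additive way `centroid` combines the two pairwise criteria is one natural reading of the centroid rule, not a verbatim transcription of the paper.

```python
import numpy as np

def structural_similarity(coords_U, coords_V):
    """Ratio s_hat/s: fraction of matched keypoint pairs whose pixel
    coordinates coincide in U and V (Kronecker-delta count)."""
    s = len(coords_U)
    s_hat = sum(1 for cu, cv in zip(coords_U, coords_V) if tuple(cu) == tuple(cv))
    return s_hat / s

def luminance(rgb):
    """Y = 0.299 R + 0.587 G + 0.114 B (ITU-R BT.601 weights)."""
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

def luminance_similarity(U, V):
    """Fraction of pixel positions whose luminance values coincide."""
    return float(np.mean(luminance(U) == luminance(V)))

def centroid(cluster, pair_score):
    """Pick as centroid the image maximizing the total pairwise score
    (structure + luminance) against every other cluster member."""
    return max(cluster,
               key=lambda u: sum(pair_score(u, v) for v in cluster if v is not u))

# toy examples
r = structural_similarity([(1, 2), (3, 4)], [(1, 2), (5, 6)])   # 0.5
S = {("U", "V"): 1.0, ("U", "W"): 1.0, ("V", "U"): 1.0,
     ("V", "W"): 0.2, ("W", "U"): 1.0, ("W", "V"): 0.2}
best = centroid(["U", "V", "W"], lambda a, b: S[(a, b)])        # "U"
```

In a real pipeline `pair_score(u, v)` would return `structural_similarity(...) + luminance_similarity(...)` computed from SIFT matches and pixel data.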
Therefore, all the retrieved authentic images, with the exception of the centroid image T, are discarded from the cluster. In other words, T is the unique authentic image in C. Figure 2 illustrates an example of discarding an image from the cluster according to Equation (11).

Algorithm 1. Forgery image clustering. Input: image database and image scoring matrix. Output: image clusters and cluster centroids.

Image Forgery Classification and Localization
Given the centroid T and an image U_i in the cluster, we can easily estimate the mask of the forged regions of U_i from the luminance difference T_Y − U_iY. Specifically, U_i ∩ T denotes the image region containing all pixels that U_i and T have in common, and U_i \ T denotes the region present in U_i but not in T.
Consequently, the two image regions U_i ∩ T and U_i \ T are extracted, as shown in Figure 4. These image regions are refined using a median filter to remove salt-and-pepper noise.
We use SIFT to find matched regions between U_i ∩ T and U_i \ T. Three pairs of matched keypoints are used to calculate an affine transformation matrix, and a warped image is subsequently generated for each transformation matrix. To localize the duplicated regions, the zero-mean normalized cross-correlation (ZNCC) method is adopted [19]. If such regions are found, the image U_i is classified as a copy-move image; otherwise, it is classified as a spliced image. In Figure 4, images U_1 and U_2 are classified as copy-move images and the detected forgery regions are illustrated in the last column. The previously detected regions, U_i ∩ T, are the target regions in white, and the newly found matched regions are the sources of the copy-move operation, represented in green. In the last two examples of Figure 4, the spliced regions of images U_3 and U_4 are highlighted in white.
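A minimal sketch of the ZNCC-based decision: if the tampered region correlates strongly with some region of the host image, the forgery is a copy-move, otherwise a splicing. The correlation threshold (0.9 here) is illustrative, not taken from the paper.

```python
import numpy as np

def zncc(a, b):
    """Zero-mean normalized cross-correlation between two equal-size patches."""
    a = a - a.mean()
    b = b - b.mean()
    denom = np.sqrt((a ** 2).sum() * (b ** 2).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def classify(region_a, region_b, threshold=0.9):
    """Copy-move if the two candidate regions correlate strongly,
    splicing otherwise (threshold is an illustrative assumption)."""
    return "copy-move" if zncc(region_a, region_b) >= threshold else "splicing"

patch = np.arange(16, dtype=float).reshape(4, 4)
classify(patch, patch + 5.0)   # identical up to a brightness offset -> "copy-move"
```

Note that ZNCC is invariant to additive brightness changes, which is why the offset patch still matches perfectly; in the full method the candidate regions are first aligned with the estimated affine transformation.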

Figure 4. Cluster centroid T, copy-move images U_1 and U_2, spliced images U_3 and U_4, and the forged regions detected by the proposed method.

Datasets
There exist several benchmarking datasets for evaluating the performance of image forgery detection algorithms. In our experiments, we used three challenging datasets MICC-600 [25], GRIP [23], and CASIA 2.0 [44] for the evaluation.

MICC-600
MICC-600 is a dataset of 600 high-resolution images of various sizes. There are 440 original images and 160 copy-move images. As shown in Figure 5, multiple scenarios of copy-move operations were performed in this dataset:

• Single source region and single target region (Figure 5c,d,f)
• Single source region and multiple target regions (Figure 5b)

GRIP

GRIP is a small dataset of copy-move and original images. All the images in this dataset have a resolution of either 1024 × 768 or 768 × 1024 pixels. The target regions in the copy-move images were composed using different attacks, such as compression, noise addition, rotation, and scaling.
CASIA 2

CASIA 2 is a large dataset with more than 12,000 images in three categories: authentic, spliced, and copy-move images. The images in this dataset are of low resolution, with sizes varying from 240 × 160 to 900 × 600 pixels. Among the three datasets in our experiments, CASIA 2 is the only one containing both types of forgery: splicing and copy-move.

Figure 5. Examples of CMFL in the MICC-600 dataset. First row: original images; second row: copy-move images; third row: ground-truth images; fourth row: source regions (green) and target regions (white) detected by the proposed method.

Evaluation Metrics
In the experiments, we evaluate the performance of image retrieval and image forgery classification and localization.

Metrics for Image Retrieval
To evaluate the performance of the proposed image forgery clustering algorithm, we use the mean average precision (MAP) metric from the image retrieval literature. For a query q, let us denote by N_q the number of retrieved images, M_q the number of relevant images, and Rel_q(k) the number of relevant images among the top k retrieved results. The precision and recall of query q at cut-off k, denoted by P_q(k) and R_q(k), are calculated as follows:

P_q(k) = Rel_q(k) / k,     R_q(k) = Rel_q(k) / M_q.

Then, the average precision for query q is computed as follows:

AP_q = Σ_{k=1}^{N_q} P_q(k) ΔR_q(k),

where ΔR_q(k) = R_q(k) − R_q(k − 1) is the change in recall from item k − 1 to k. Note that R_q(0) = 0. Finally, the MAP over all queries is defined as follows:

MAP = (1/Q) Σ_{q=1}^{Q} AP_q,

where Q is the number of queries.
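The MAP computation can be sketched directly from these definitions (a minimal sketch; helper names are ours):

```python
def average_precision(is_relevant, num_relevant):
    """AP_q = sum_k P_q(k) * dR_q(k), where is_relevant[k-1] indicates
    whether the k-th retrieved image is relevant and num_relevant is M_q."""
    ap, rel_so_far, prev_recall = 0.0, 0, 0.0
    for k, rel in enumerate(is_relevant, start=1):
        rel_so_far += rel
        precision = rel_so_far / k          # P_q(k)
        recall = rel_so_far / num_relevant  # R_q(k)
        ap += precision * (recall - prev_recall)
        prev_recall = recall
    return ap

def mean_average_precision(queries):
    """queries: list of (is_relevant, num_relevant) pairs, one per query."""
    return sum(average_precision(r, m) for r, m in queries) / len(queries)

# ranked list: relevant, irrelevant, relevant; 2 relevant images in total
ap = average_precision([1, 0, 1], 2)   # 0.5 * 1.0 + 0.5 * (2/3) = 0.8333...
```

Only ranks where a relevant image appears contribute, since ΔR_q(k) = 0 elsewhere.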

Metrics for Image Forgery Classification and Localization
Since we concurrently classify image forgery types and localize the forged regions, the evaluation is performed at both the image and pixel levels.
To quantitatively evaluate the localization performance, we adopt two metrics for the tampered regions of an image classified as tampered [19], the localization precision L_P and localization recall L_R, defined as follows:

L_P = # correctly detected pixels / # all detected pixels,
L_R = # correctly detected pixels / # all tampered pixels.
Similarly, we define the classification precision C_P and recall C_R at the image level:

C_P = # correctly detected tampered images / # all detected tampered images,
C_R = # correctly detected tampered images / # all tampered images.
In order to balance precision and recall, we consider both quantities through their harmonic mean, the localization F-measure:

L_F = 2 · L_P · L_R / (L_P + L_R),

and analogously the classification F-measure C_F = 2 · C_P · C_R / (C_P + C_R). The pixel-level precision, recall, and F-measure are used for all three datasets in this work. However, the image-level metrics are only used to evaluate the performance of the proposed method on the MICC-600 and GRIP datasets. To evaluate the classification performance on CASIA 2, which contains three classes, we use a confusion matrix.
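The pixel-level metrics can be sketched with boolean masks as follows (a minimal sketch; helper names are ours):

```python
import numpy as np

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def pixel_metrics(detected, ground_truth):
    """Localization precision/recall/F-measure from boolean pixel masks:
    precision over all detected pixels, recall over all tampered pixels."""
    tp = (detected & ground_truth).sum()          # correctly detected pixels
    p = tp / detected.sum() if detected.sum() else 0.0
    r = tp / ground_truth.sum() if ground_truth.sum() else 0.0
    return p, r, f_measure(p, r)

det = np.array([[True, True], [False, False]])    # detected mask
gt  = np.array([[True, False], [True, False]])    # ground-truth mask
p, r, f = pixel_metrics(det, gt)                  # p = 0.5, r = 0.5, f = 0.5
```

The same `f_measure` helper applies at the image level by counting images instead of pixels.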

Image Retrieval Results
To evaluate the performance of the proposed image forgery clustering algorithm, we carry out experiments estimating the MAP of image retrieval in three different scenarios related to the cluster formation of Algorithm 1. In the first case, only one query is performed to compose the cluster. In the second case, a second query is performed to augment the retrieved results. In the third case, the cluster refinement using the structural correlation of Algorithm 1 is conducted after the two queries to form the image forgery cluster. We denote these cases by case A, case B, and case C, respectively. Figure 6 shows the retrieval performance of the three cases on the three datasets. MAP increases significantly from case A to case C on all three datasets, confirming the efficiency of the proposed image retrieval based clustering algorithm.
Table 1 presents the average ratios at which the host image of the query is retrieved in the three cases. The results confirm that, using the image forgery clustering algorithm, we can generally retrieve the host images of query images into the clusters.

Forgery Classification and Localization Results

Table 2 presents the performance of the proposed method in comparison with the state of the art on the MICC-600 dataset. Our classification F-measure outperforms Li et al. [29] and is slightly lower than Li et al. [27]. In terms of localization performance, our method surpasses the other methods with L_F = 93.1%. Visual examples of CMFL are shown in Figure 5, where we distinguish the source and target regions in green and white, respectively.

Table 3 summarizes the performance on the GRIP dataset, where the proposed method exceeds the other methods in both classification and localization indexes. Evaluations under different types of copy-move attacks are also considered. Specifically, four attacks are examined: Gaussian noise addition and JPEG compression applied to the copy-move images, and rotation and scaling applied to the copied regions. Figure 7a indicates that our method outperforms the other methods in terms of localization F-measure at different levels of Gaussian noise added to the copy-move images. Figure 7b shows that the CMFL methods handle the JPEG compression situation with only slight differences. When the copied regions are rotated or scaled, Chen et al. [20], Chen et al. [32], and our method perform better than the rest (Figure 7c,d). The proposed method performs best when the changes to the copied regions are small; its performance declines for larger rotation angles and scaling factors. Figure 8 illustrates CMFL examples of the proposed method on the GRIP dataset.

Table 4 summarizes the three-class classification results of the proposed method on the CASIA 2 dataset.
To the best of our knowledge, all previous research on forgery detection for this dataset performs binary classification; therefore, only the results of our work are reported. The detection accuracy for authentic images reaches 96.9%, which is higher than for the two image forgery types. 6.7% of copy-move images are classified as spliced images; conversely, 4.4% of spliced images are mistakenly detected as copy-move images. Tables 5 and 6 compare the proposed method with other works on the localization performance for spliced images and copy-move images of the CASIA 2 dataset, respectively. Examples of CMFL results of the proposed method are shown in Figure 9. Since CASIA 2 is the most challenging dataset in our experiments, with many small and smooth tampered regions, the proposed method occasionally fails to find matching regions.

Table 6. Performance of copy-move image localization on CASIA 2 (%).


Conclusions
This paper introduces a novel method to classify authentic images and two types of tampered images, copy-move and spliced, and to localize the forged regions. We propose a robust algorithm that divides relevant images into clusters using BOF and HE based image retrieval. Within each image cluster, by exploiting the structural correlations between images, the proposed algorithm determines the cluster centroid, which is the only authentic image in the cluster. Afterwards, the image forgeries are classified and the forged regions are localized. The experimental results show that this method achieves higher performance in both forgery classification and localization in comparison with state-of-the-art methods. Notably, the proposed method can indicate the source and target regions of copy-move images.

Conflicts of Interest:
The authors declare no conflict of interest.