1. Introduction
In an era of globalization, social networks such as Facebook, Twitter, and Instagram are widely used in daily life, and a huge number of photos are uploaded to these networks every day. At the same time, it has become easy even for unpracticed users to manipulate digital images without leaving any perceptible trace. Copy-move and image splicing are the two most popular image manipulation methods. In a copy-move forgery (CMF), one or more regions are copied from an authentic image and then pasted into other regions of the same image. The authentic image used to compose the copy-move image is called the host image. In image splicing, by contrast, regions are copied from a source image (the donor image) and pasted into a target image (the host image) [1]. Examples of copy-move and spliced images are given in Figure 1.
In the image forgery scenario, a tampered region might not be exactly the same as the original region, since it usually undergoes a sequence of post-processing operations such as rotation, scaling, edge softening, blurring, denoising, and smoothing for a better visual appearance [1]. Therefore, human observers can easily be deceived by tampered images, and it is difficult to verify the authenticity of images manually.
Many researchers have put considerable effort into detecting and localizing tampered regions of image forgery. However, in most cases, forgery detection and localization algorithms were designed considering only one of the two forgery types, copy-move or image splicing. In this paper, we propose an image forgery detection and localization algorithm that handles both types of image forgery simultaneously. The proposed method utilizes the bag-of-features (BOF) image representation and Hamming Embedding (HE) based image retrieval. The image forgery clustering algorithm classifies input images into distinct clusters, each of which consists of one authentic image and all the spliced and copy-move images composed using that authentic image as the host image. The algorithm also identifies the authentic image, based on structure and luminance similarity between images, and assigns it as the centroid of the cluster. The cluster centroid is then used to classify the image forgeries and localize the tampered regions. The experimental results show that the proposed method outperforms state-of-the-art techniques in image forgery classification and localization accuracy. In addition, we distinguish the source and target regions in copy-move tampering localization.
The remainder of this paper is organized as follows. Section 2 provides a brief review of image splicing and copy-move detection and localization methods. In Section 3, we present the image retrieval algorithm based on HE and BOF. The proposed image clustering algorithm is introduced in Section 4. Section 5 presents the image forgery detection and localization. The experimental results are discussed in Section 6. Finally, Section 7 concludes the paper.
2. Related Works
In the literature, the image splicing detection problem has been addressed efficiently [2,3,4]. In recent years, substantial attention has been paid to deep learning based approaches [5,6] for localizing image splicing [7,8,9,10,11,12,13,14,15], wherein convolutional neural networks (CNNs) have been widely used [8,9,10,11,12]. Bondi et al. [8] extracted features capturing characteristic traces of different camera models and employed an iterative clustering algorithm to localize the tampered mask. A region proposal network and a conditional random field are the main components of the model developed by Chen et al. [10]. The difference in noise levels between spliced and original regions was utilized to find splicing traces [11,13]. A non-linear camera response function was used by Yao et al. [11] and combined with a noise level function, exploiting the strong relationship between the two functions to localize forged edges using a CNN. Mayer et al. [12] used a similarity network and a CNN-based feature extractor to determine whether image patches contain different traces or were captured by different camera models. Zeng et al. [13] estimated the noise levels using principal component analysis and then clustered them with the k-means algorithm to localize the spliced regions. Matern et al. [15] utilized a gradient-based illumination descriptor to detect illumination inconsistency and object color change, which helps localize image splicing traces. Wang et al. [16] used gamma transformation to detect splicing forgery and localize the spliced region by estimating the probabilities that overlapping, sliding-window-based blocks had been gamma transformed.
CMF detection is the problem of detecting the tampered regions in copy-move images; in this paper, it is called CMF localization (CMFL) to distinguish it from CMF classification. CMFL has also been actively studied, and existing methods can be divided into three categories: block-based methods [17,18,19,20,21,22], keypoint-based methods [23,24,25,26,27,28], and segmentation-based methods [29,30,31,32]. Park et al. [17] introduced an upsampled log-polar Fourier descriptor, which is invariant to rotation and scaling, to robustly detect various types of tampering attacks. Wu et al. [18] proposed a two-branch deep neural network to detect potential manipulation via visual artifacts and visually similar regions, which helps specify the copied and pasted regions. Park et al. [19] used the scale space representation of the scale-invariant feature transform (SIFT) to handle different geometric transformations. PatchMatch, an algorithm for approximate nearest-neighbor search, was combined with Zernike moments to detect copy-move attacks in [23,24], whereas SIFT was utilized in [25,26,27]. In segmentation-based approaches, the input image is semantically segmented into non-overlapping regions [29,30,31,32]. Li et al. [29] developed a two-stage matching scheme to detect the copy-move regions: the affine transformation matrix between segmented regions is first roughly estimated and then iteratively refined using an expectation-maximization-based probability model. However, the major disadvantage of this method is its high computational complexity. Zheng et al. [30] classified regions into smooth and non-smooth (keypoint) regions so that two different techniques could be applied. On the one hand, SIFT was used in a keypoint-based method to detect forgery in non-smooth regions. On the other hand, Zernike moments were extracted in a block-based method to handle smooth regions. CMFL was then performed effectively by fusing the two techniques.
3. Bag-of-Features and Hamming Embedding Based Image Retrieval
In image retrieval, images are represented by descriptive features, which are used to evaluate the similarity or dissimilarity between images. In image forgery, since the forged regions may be rotated, scaled, and translated in different manners, the image features should be invariant to these transformations. The features generated by SIFT [33] have such noteworthy characteristics, and the proposed algorithm utilizes SIFT features to represent images [34,35].
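As an illustration, the following minimal Python sketch shows one possible way to extract such rotation- and scale-invariant descriptors with OpenCV; it is only an example toolchain and is not taken from the authors' implementation.

```python
import cv2

def extract_sift(image_path):
    """Extract SIFT keypoints and their 128-D descriptors from an image."""
    img = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(img, None)
    return keypoints, descriptors
```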
In this section, we briefly review the image retrieval method based on BOF [36,37,38] and HE encoding [38,39]. Suppose that a query image $I_q$ is represented by a set of $N$ descriptors, $X_q = \{x_1, x_2, \ldots, x_N\}$. All of these descriptors are mapped into a visual vocabulary set $V = \{v_1, v_2, \ldots, v_K\}$ by a $K$-means vector quantizer $q$. For example, $q$ maps a descriptor $x_i$ to its closest visual word $q(x_i) = v_k$, where $v_k \in V$. We define the set of descriptor indexes that assigns the descriptors of $I_q$ to a particular visual word $v_k$ as $N_q(v_k) = \{\, i \mid q(x_i) = v_k \,\}$.
The HE matching model is used to estimate the matching of descriptors within a visual word. HE represents each descriptor as a $D$-dimensional binary signature [38]. Let $b_t(x) \in \{0, 1\}$ be the single-bit binary code of the $t$-th dimension used to represent a descriptor $x$; then $b(x) = \left(b_1(x), b_2(x), \ldots, b_D(x)\right)$ is the binary signature of descriptor $x$. The Hamming distance between two descriptors, $x$ and $y$, is computed using their binary signatures as follows:
$$h(x, y) = \sum_{t=1}^{D} \left| b_t(x) - b_t(y) \right|. \tag{1}$$
Let us denote by $X_d = \{y_1, y_2, \ldots\}$ the set of descriptors of a database image $I_d$, with $N_d(v_k)$ defined analogously to $N_q(v_k)$. The probability that the two sets of descriptors, $X_q$ and $X_d$, are assigned to the same visual word $v_k$ is defined as:
$$P(X_q, X_d \mid v_k) = \sum_{i \in N_q(v_k)} \sum_{j \in N_d(v_k)} w\!\left(h(x_i, y_j)\right), \tag{2}$$
where the weighting function for a Hamming distance $h$ is calculated as a Gaussian function [38]:
$$w(h) = \exp\!\left(-\frac{h^2}{\sigma^2}\right). \tag{3}$$
The number of dimensions $D$ of the binary signatures and the Gaussian bandwidth parameter $\sigma$ are set to fixed values following [38,40].
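For concreteness, a small sketch of the Hamming distance of Equation (1) and the Gaussian weighting of Equation (3); the signatures are assumed to be 0/1 NumPy arrays, and `sigma` is a placeholder for whatever bandwidth value the method actually uses.

```python
import numpy as np

def hamming_distance(sig_x, sig_y):
    """Hamming distance between two D-bit binary signatures (Eq. (1))."""
    return int(np.sum(sig_x != sig_y))

def gaussian_weight(h, sigma):
    """Gaussian weighting of a Hamming distance h (Eq. (3))."""
    return float(np.exp(-(h ** 2) / (sigma ** 2)))
```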
In order to retrieve images, an inverted index file is built in the image indexing process. The inverted file consists of a list of entries; each entry stores a visual word together with the identifiers of the associated images, the descriptors of those images assigned to the visual word, and the HE signatures used for matching. When a query with $I_q$ is performed, the entries of the visual words associated with $I_q$ are searched in the inverted file. The score of a database image $I_d$ in this query is calculated by accumulating the weighted Hamming distances between the two sets of descriptor signatures over all the visual words shared by the two images. Specifically, the similarity between $I_q$ and $I_d$ is defined as follows:
$$s(I_q, I_d) = \sum_{v_k \in V} idf(v_k)\, P(X_q, X_d \mid v_k), \tag{4}$$
where the constant $idf(v_k)$ is the inverse document frequency [41] of a visual word $v_k$ in the image database $\mathcal{D}$. Suppose that $p_k$ is the probability of $v_k$ occurring in $\mathcal{D}$; then $idf(v_k) = \log\!\left(1 / p_k\right)$.
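The scoring of Equation (4) can be sketched as a voting procedure over the inverted file; the data layout (`inverted_file` mapping a visual word to `(image_id, signature)` entries) and the helper names are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
import numpy as np

def score_query(query_entries, inverted_file, idf, sigma):
    """
    Accumulate similarity scores of database images for one query (Eq. (4)).

    query_entries: list of (visual_word, binary_signature) pairs of the query image.
    inverted_file: dict mapping a visual word to a list of (image_id, binary_signature).
    idf: dict mapping a visual word to its inverse document frequency.
    """
    scores = defaultdict(float)
    for word, q_sig in query_entries:
        for image_id, d_sig in inverted_file.get(word, []):
            h = int(np.sum(q_sig != d_sig))       # Hamming distance, Eq. (1)
            w = np.exp(-(h ** 2) / (sigma ** 2))  # Gaussian weight, Eq. (3)
            scores[image_id] += idf[word] * w     # weighted vote, Eq. (4)
    return scores
```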
4. Image Forgery Clustering
In this section, we describe the proposed image forgery clustering algorithm. Suppose that we have an input dataset containing authentic and tampered images. The proposed algorithm classifies the images into separate clusters, where each cluster consists of a host image and all the tampered images composed using that host image. Subsequently, the proposed algorithm selects the host image as the centroid of each image cluster. The details of image clustering and centroid determination are provided in Algorithm 1.
Firstly, we randomly select a query image $I_q$ from the dataset $\mathcal{D}$. The ranking score of a database image $I_d$ in the query of $I_q$ is denoted as $s(I_q, I_d)$ and calculated as the similarity between the two images according to Equation (4). The retrieval result is a list of images arranged in descending order of ranking score. A cut-off threshold $\theta$ is set to obtain the set of retrieved images $R_1$. Let us denote $H(I)$ as the host image of an image $I$ in the dataset; an authentic image is considered the host image of itself. We need to retrieve all the images $I_d$ relevant to the query $I_q$, i.e., those satisfying $H(I_d) = H(I_q)$. To this end, we set the threshold $\theta$ to a relatively low value. This low threshold means that some irrelevant images may also be retrieved; note that these irrelevantly retrieved images will be discarded in the last step of the iteration. Due to the insignificant processing time of these operations, we can easily handle the case of a large number of images in a cluster. Further, we perform an additional query to ensure that all the images relevant to $I_q$ are retrieved. Notice that the top-ranked image in the retrieved list $R_1$ is identical to the query image $I_q$. Therefore, the second highest ranked result in $R_1$, denoted $I_{q'}$, is selected as the new query image. The score threshold $\theta$ is also used in this query, and we then obtain a second set of retrieved images $R_2$. The image cluster $C$ is the union of the two sets of retrieved images, i.e., $C = R_1 \cup R_2$.
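The double-query step can be sketched as follows; `retrieve(query_id, theta)` is a hypothetical helper that returns image identifiers ranked by the similarity of Equation (4) and already cut off at the threshold $\theta$.

```python
def build_cluster(query_id, retrieve, theta):
    """One clustering iteration: query twice and take the union of the results."""
    r1 = retrieve(query_id, theta)                 # first retrieved list R1
    # the top-ranked image is the query itself, so re-query with the second one
    second_query = r1[1] if len(r1) > 1 else r1[0]
    r2 = retrieve(second_query, theta)             # second retrieved list R2
    return set(r1) | set(r2)                       # cluster C = R1 ∪ R2
```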
The centroid of $C$ is determined based on two criteria, which measure the correlations in structure and luminance among the images in the cluster. In this work, we extract SIFT features [33] from the images and use Random Sample Consensus [42] to find the matching. Let $M_{ij}$ denote the set of matched keypoints between two images $I_i$ and $I_j$ in $C$, where each element $(k, k') \in M_{ij}$ is a pair of keypoints. Then $|M_{ij}|$ is the number of matching keypoints between $I_i$ and $I_j$. We denote by $p_i(k)$ the pixel coordinates of a keypoint $k$ in $I_i$. The number of matching keypoints located at corresponding positions in $I_i$ and $I_j$, denoted by $m_{ij}$, is calculated as follows:
$$m_{ij} = \sum_{(k, k') \in M_{ij}} \delta\!\left(p_i(k), p_j(k')\right), \tag{5}$$
where $\delta$ is the Kronecker delta function:
$$\delta(a, b) = \begin{cases} 1, & a = b, \\ 0, & a \neq b. \end{cases} \tag{6}$$
We define the ratio $S^{st}_{ij} = m_{ij} / |M_{ij}|$ as the structural similarity between $I_i$ and $I_j$.
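A sketch of the structural similarity computation, assuming RANSAC-verified matches are available as OpenCV `DMatch` objects; the exact-coordinate test mirrors the Kronecker delta of Equation (6).

```python
import numpy as np

def structural_similarity(matches, kps_i, kps_j):
    """Ratio of matched keypoints lying at identical coordinates in both images."""
    if not matches:
        return 0.0
    same_position = 0
    for m in matches:
        p_i = np.array(kps_i[m.queryIdx].pt)   # coordinates of the keypoint in I_i
        p_j = np.array(kps_j[m.trainIdx].pt)   # coordinates of its match in I_j
        if np.array_equal(p_i, p_j):           # Kronecker delta of Eq. (6)
            same_position += 1
    return same_position / len(matches)
```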
Algorithm 1: Image forgery clustering.
In addition, we denote by $L_i(p)$ the luminance value of image $I_i$ at pixel $p$, which can be calculated as follows [43]:
$$L_i(p) = 0.299\,R_i(p) + 0.587\,G_i(p) + 0.114\,B_i(p), \tag{7}$$
where $R_i(p)$, $G_i(p)$, and $B_i(p)$ are the red, green, and blue color values of $I_i$ at pixel $p$, respectively. We define $S^{lu}_{ij}$, the luminance similarity between $I_i$ and $I_j$, as follows:
$$S^{lu}_{ij} = \frac{1}{HW} \sum_{p} \delta_L\!\left(L_i(p), L_j(p)\right), \tag{8}$$
where $H$ and $W$ are the height and width of the images, respectively, and
$$\delta_L\!\left(L_i(p), L_j(p)\right) = \begin{cases} 1, & L_i(p) = L_j(p), \\ 0, & \text{otherwise.} \end{cases} \tag{9}$$
We determine $I_c$, the centroid of the image cluster $C$, as the image with the highest overall structural and luminance similarity to the other images in the cluster:
$$I_c = \arg\max_{I_i \in C} \sum_{I_j \in C,\, j \neq i} \left( S^{st}_{ij} + S^{lu}_{ij} \right). \tag{10}$$
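A sketch of the luminance similarity of Equation (8); the Rec. 601 luminance weights and the per-pixel equality test are assumptions consistent with Equations (7) and (9), not verified settings.

```python
import numpy as np

def luminance(img_rgb):
    """Per-pixel luminance from the R, G, B channels (Eq. (7), Rec. 601 weights assumed)."""
    r, g, b = [img_rgb[..., c].astype(np.float64) for c in range(3)]
    return 0.299 * r + 0.587 * g + 0.114 * b

def luminance_similarity(img_i, img_j):
    """Fraction of pixels whose luminance values coincide in the two images (Eq. (8))."""
    l_i, l_j = luminance(img_i), luminance(img_j)
    h, w = l_i.shape
    return float(np.sum(np.isclose(l_i, l_j))) / (h * w)
```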
Afterwards, we refine the image cluster by discarding the images $I_i \in C$ that are irrelevant to the centroid $I_c$, i.e., those whose host image differs from that of the centroid, as follows:
$$C \leftarrow C \setminus \left\{\, I_i \in C \mid H(I_i) \neq H(I_c) \,\right\}. \tag{11}$$
Therefore, all the retrieved authentic images, with the exception of the centroid image $I_c$, are discarded from the cluster. In other words, $I_c$ is the unique authentic image in $C$. Figure 2 illustrates an example of discarding an image from the cluster according to Equation (11).
Figure 3 depicts an example of image database indexing and one iteration of the proposed image forgery clustering algorithm to obtain one image cluster. After each iteration of the proposed clustering algorithm, all the images of the new cluster are excluded from the image database. We repeatedly perform the querying and clustering process until the database $\mathcal{D}$ is empty. Finally, each input image belongs to exactly one cluster.
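Putting the pieces together, the overall clustering loop can be sketched as below; `retrieve`, `centroid_of`, and `refine` stand in for the retrieval of Equation (4), the centroid selection of Equation (10), and the refinement of Equation (11), and `build_cluster` is the double-query sketch given earlier. These helpers are assumed placeholders, not the paper's exact routines.

```python
import random

def cluster_database(image_ids, retrieve, centroid_of, refine, theta):
    """Repeat querying and clustering until every image belongs to exactly one cluster."""
    remaining = set(image_ids)
    clusters = []
    while remaining:
        query_id = random.choice(tuple(remaining))
        cluster = build_cluster(query_id, retrieve, theta) & remaining
        center = centroid_of(cluster)
        cluster = refine(cluster, center)          # discard irrelevant images, Eq. (11)
        clusters.append((center, cluster))
        remaining -= cluster | {query_id}          # exclude clustered images from the database
    return clusters
```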
5. Image Forgery Classification and Localization
Given the centroid $I_c$ and an image $I_t$ in the cluster, we can easily estimate the mask of the forged regions of $I_t$ based on $I_c$. Specifically, $A_{tc}$ denotes the image region including all the pixels that $I_t$ and $I_c$ jointly have, and $B_{tc}$ denotes the image region that is in $I_t$ but not in $I_c$. Consequently, the two image regions $A_{tc}$ and $B_{tc}$ are extracted as shown in Figure 4. These image regions are refined by using a median filter to remove salt-and-pepper noise.
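A rough sketch of how the two regions and their median-filter refinement could be obtained by differencing the tampered image against the centroid; the intensity tolerance and kernel size are illustrative values, and the thresholded absolute difference is only one plausible way to realize the region split described above.

```python
import cv2
import numpy as np

def forged_region_masks(tampered, centroid, tol=10, ksize=3):
    """Split a tampered image into pixels shared with the centroid and pixels unique to it."""
    diff = cv2.absdiff(tampered, centroid)                        # per-channel difference
    changed = (diff.max(axis=2) > tol).astype(np.uint8) * 255     # in I_t but not in I_c
    shared = cv2.bitwise_not(changed)                             # jointly present pixels
    # a median filter removes salt-and-pepper noise in the masks
    return cv2.medianBlur(shared, ksize), cv2.medianBlur(changed, ksize)
```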
We use SIFT to find the regions of $I_t$ that match the extracted forged regions. Three pairs of matched keypoints are utilized to calculate an affine transformation matrix, and subsequently, a warped image is generated for each transformation matrix. To localize the duplicated regions, the zero-mean normalized cross-correlation method is adopted [19]. If such regions can be found, the image $I_t$ is classified as a copy-move image; otherwise, the tampered image is classified as a spliced image. In Figure 4, two of the examples are classified as copy-move images, and the detected forgery regions are illustrated in the last column. The previously detected regions are the target regions, shown in white, and the newly found matched regions are the source of the copy-move operation, shown in green. In the last two examples of Figure 4, the spliced regions are highlighted in white.
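Finally, a sketch of the copy-move test: an affine transform is estimated from three matched keypoint pairs, the image is warped, and a zero-mean normalized cross-correlation response (OpenCV's `TM_CCOEFF_NORMED`) is compared against a threshold. The threshold value and helper names are illustrative assumptions rather than the paper's settings.

```python
import cv2
import numpy as np

def is_copy_move(image, kps_src, kps_dst, matches, ncc_threshold=0.8):
    """Classify as copy-move if a matched, affine-warped region correlates strongly."""
    if len(matches) < 3:
        return False
    src = np.float32([kps_src[m.queryIdx].pt for m in matches[:3]])
    dst = np.float32([kps_dst[m.trainIdx].pt for m in matches[:3]])
    M = cv2.getAffineTransform(src, dst)               # affine matrix from 3 keypoint pairs
    h, w = image.shape[:2]
    warped = cv2.warpAffine(image, M, (w, h))
    # TM_CCOEFF_NORMED implements zero-mean normalized cross-correlation
    response = cv2.matchTemplate(image, warped, cv2.TM_CCOEFF_NORMED)
    return float(response.max()) >= ncc_threshold
```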