Hybrid Image-Retrieval Method for Image-Splicing Validation

Abstract: Recently, the task of validating the authenticity of images and the localization of tampered regions has been actively studied. In this paper, we go one step further by providing solid evidence for image manipulation. If a certain image is proved to be a spliced image, we try to retrieve the original authentic images that were used to generate it. Specifically for the retrieval of spliced images, we propose a hybrid image-retrieval method exploiting Zernike moment and Scale Invariant Feature Transform (SIFT) features. Owing to the symmetry and antisymmetry properties of the Zernike moment, the scale-invariant property of SIFT, and their common rotation-invariant property, the proposed hybrid image-retrieval method is efficient in matching regions that have undergone different manipulation operations. Our simulation shows that the proposed method significantly increases the retrieval accuracy of the spliced images.


Introduction
Digital images, portable cameras, and photo-editing software have become increasingly popular in the past decades. This popularization has enabled the easy manipulation of digital images even by unpracticed users. Nowadays, image splicing, which copies one or more regions from the source image and pastes them to the destination image, is one of the most popular image-manipulation methods.
In the image-splicing scenario, a spliced region might not be exactly the same as the original region since it usually undergoes a sequence of postprocessing operations such as rotation, scaling, edge softening, blurring, denoising, and smoothing for a better visual appearance. Therefore, human beings may easily be deceived by spliced images.
In the literature, a large number of algorithms have been introduced to effectively detect image splicing [1-6], and some algorithms have achieved nearly perfect detection performance [2,3]. In recent years, researchers have focused more on image-splicing localization and have achieved promising results [7] thanks to advances in machine learning and deep learning [8-10]. Tampered regions of grayscale images were localized by different image-authentication techniques [11,12]. In Reference [6], a convolutional neural network (CNN)-based algorithm was proposed to extract features capturing characteristic traces from different camera models. These features were then utilized as the input for an iterative clustering algorithm to estimate the tampering mask. The difference between the noise levels of tampered and original regions was employed to find the splicing traces [8,13]. The noise level was estimated using principal component analysis and then clustered by a k-means algorithm to localize the spliced regions [13]. The nonlinear camera response function was used on its own in Reference [14] or combined with the noise-level function, exploiting their strong relationship, to localize the forged edges using a CNN [8]. Gamma transformation was used to detect splicing forgery, while the spliced region was localized based on the probabilities of overlapping blocks and pixels of the input image being gamma-transformed [15]. In Reference [16], the spatial structure on the boundary between spliced and background regions was learned to localize splicing, copy-move, and removal manipulations.
In contrast to the abovementioned blind splicing-localization methods, context-based search-and-compare approaches were proposed in References [17,18] to localize spliced regions. Specifically, a spliced image was used as a query image for image retrieval over a database of authentic images. The retrieved result was then compared with the corresponding query image to find the difference, which is the localization result. A novel deep CNN was developed to compute the probability that one image had been spliced into another image, and splicing masks were then localized in both images [9]. These approaches achieved high localization performance in comparison with blind methods, but they can only be applied if the original authentic image dataset is given.
In this paper, we go one step further by providing solid evidence for image manipulation. If a certain image is proved to be a spliced image, the proposed method retrieves the original authentic images that were used to generate it, which yields a clue about how the spliced image was made. First, the proposed algorithm localizes tampered regions in the spliced image by using one of the state-of-the-art localization algorithms [19]. Then, the proposed algorithm retrieves the original authentic images using a Hamming Embedding (HE) encoding-based retrieval algorithm with Zernike [20] and Scale Invariant Feature Transform (SIFT) [21] descriptors.

In the literature, a few studies have been proposed to find the provenance of manipulated images. In Reference [22], the provenance of spliced images was searched in two tiers using Speeded-Up Robust Features. The first tier was responsible for finding the host (target) images, and the search was then refined in the second tier to find the donor (source) images. Provenance graph-construction methods were proposed in References [23,24] to show the relationships among images with multiple manipulation operations, such as splicing, spatial operations, and enhancement. The experiments in References [22-24] were performed using only a small number of test images, and the results show that these methods achieved moderate retrieval performance for the source images. This paper aims to address the shortcomings of previous works and provide a convincing provenance of spliced images.

The rest of this paper is organized as follows. Section 2 describes Zernike moment features and SIFT features, including their applications in previous works. Then, the image-splicing localization method is presented in Section 3 and the proposed retrieval algorithm is introduced in Section 4. In Section 5, we provide the comparative experimental results of the proposed method. Finally, our conclusions are drawn in Section 6.

Related Work
In this Section, we give a brief overview of Zernike moment features [20,25] and SIFT features [21]. Given a continuous image function $f(x, y)$, Zernike moments are the projection of the image function onto a set of basis functions called Zernike polynomials. This set of complex polynomials is complete and orthogonal over the interior of the unit circle $x^2 + y^2 = 1$. Let us denote these polynomials as $V_m^n(x, y)$, where $m$ is a non-negative integer and $n$ is an integer such that $m - |n|$ is a non-negative even number. Zernike polynomials are defined as follows:

$$V_m^n(x, y) = V_m^n(\rho, \theta) = R_m^n(\rho)\, e^{jn\theta}, \tag{1}$$

where $(\rho, \theta)$ represents the polar coordinates of $(x, y)$, $j$ is the imaginary unit, and the radial polynomial $R_m^n(\rho)$ is given by:

$$R_m^n(\rho) = \sum_{s=0}^{(m-|n|)/2} \frac{(-1)^s\,(m-s)!}{s!\,\left(\frac{m+|n|}{2}-s\right)!\,\left(\frac{m-|n|}{2}-s\right)!}\,\rho^{m-2s}. \tag{2}$$

Note that $R_m^n = R_m^{-n}$. The Zernike moment with order $m$ and repetition $n$ for an image $f(x, y)$ that vanishes outside the unit circle is defined as:

$$A_m^n = \frac{m+1}{\pi} \iint_{x^2+y^2 \le 1} f(x, y)\, \overline{V_m^n(x, y)}\, dx\, dy, \tag{3}$$

where $\overline{z}$ denotes the complex conjugate of $z$.
The symmetric and antisymmetric properties of Zernike moments help reduce their computational complexity [26]. We replace the integrals in Equation (3) with summations to obtain the formula for a digital image $I(x, y)$ as follows:

$$A_m^n = \frac{m+1}{\pi} \sum_{x} \sum_{y} I(x, y)\, \overline{V_m^n(x, y)}, \tag{4}$$

where $x^2 + y^2 \le 1$. The magnitudes of the Zernike moments, $|A_m^n|$, are used as the image features. According to References [20,25-29], these values are invariant against rotation; hence, Zernike moment features can handle rotated spliced regions well.
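As a rough illustration of the discrete Zernike moment and of the rotation invariance of its magnitude, the following minimal sketch evaluates the moment of a small square image in pure Python. The pixel-to-unit-disk mapping (pixel centres scaled to $[-1, 1]$), the function names, and the use of a 90-degree rotation for the check are assumptions made for this example, not details taken from the paper.

```python
import cmath
import math


def radial_poly(m: int, n: int, rho: float) -> float:
    """Radial polynomial R_m^n(rho) for order m and repetition n."""
    n = abs(n)
    total = 0.0
    for s in range((m - n) // 2 + 1):
        num = (-1) ** s * math.factorial(m - s)
        den = (math.factorial(s)
               * math.factorial((m + n) // 2 - s)
               * math.factorial((m - n) // 2 - s))
        total += num / den * rho ** (m - 2 * s)
    return total


def zernike_moment(img, m: int, n: int) -> complex:
    """Discrete Zernike moment of a square image (list of rows)."""
    if abs(n) > m or (m - abs(n)) % 2 != 0:
        raise ValueError("m - |n| must be a non-negative even number")
    size = len(img)
    acc = 0j
    for row in range(size):
        for col in range(size):
            # Assumed mapping: pixel centres are scaled to [-1, 1] x [-1, 1].
            x = (2 * col + 1 - size) / size
            y = (2 * row + 1 - size) / size
            rho = math.hypot(x, y)
            if rho > 1.0:
                continue  # only pixels inside the unit circle contribute
            theta = math.atan2(y, x)
            # Multiply by the complex conjugate of R(rho) * exp(j*n*theta).
            acc += img[row][col] * radial_poly(m, n, rho) * cmath.exp(-1j * n * theta)
    return (m + 1) / math.pi * acc


def rotate90(img):
    """Rotate a square image by 90 degrees (exact on the pixel grid)."""
    return [list(r) for r in zip(*img[::-1])]
```

A 90-degree rotation only changes the phase of the moment, so `abs(zernike_moment(img, 2, 2))` and `abs(zernike_moment(rotate90(img), 2, 2))` agree up to floating-point error; for arbitrary angles, interpolation on the pixel grid makes the invariance approximate.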
The SIFT keypoints are invariant to image rotation and scaling, and robust to noise and illumination changes [21]. Such keypoints of an input image $I(x, y)$ are detected by finding the local extrema of $D(x, y, \sigma)$, which is the convolution of that image with a difference of Gaussians between adjacent image scales $k\sigma$ and $\sigma$:

$$D(x, y, \sigma) = \big(G(x, y, k\sigma) - G(x, y, \sigma)\big) * I(x, y), \tag{5}$$

where $*$ denotes the convolution operation, $k$ is a constant multiplicative factor, and the Gaussian function $G(x, y, \sigma)$ at scale $\sigma$ is calculated as follows:

$$G(x, y, \sigma) = \frac{1}{2\pi\sigma^2}\, e^{-(x^2 + y^2)/(2\sigma^2)}. \tag{6}$$

As shown in Figure 1, the red pixel is marked as a keypoint if it is greater or smaller than all of its 26 neighbors, including 8 neighbors in the same scale (yellow pixels) and 9 neighbors at the corresponding positions in each of the scales above and below (green pixels). The corresponding descriptors of all the keypoints are extracted by computing their gradient magnitudes and orientations to form the features [21].
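The 26-neighbor extremum test can be sketched in a few lines. This is only an illustration of the comparison step: the difference-of-Gaussian stack is assumed to be precomputed and stored as nested lists indexed by scale, row, and column, and the function names are hypothetical.

```python
import math


def gaussian(x: float, y: float, sigma: float) -> float:
    """2-D Gaussian G(x, y, sigma)."""
    return math.exp(-(x * x + y * y) / (2 * sigma * sigma)) / (2 * math.pi * sigma * sigma)


def dog_kernel(x: float, y: float, sigma: float, k: float) -> float:
    """Difference-of-Gaussian kernel value G(x, y, k*sigma) - G(x, y, sigma)."""
    return gaussian(x, y, k * sigma) - gaussian(x, y, sigma)


def is_extremum(dog, s: int, r: int, c: int) -> bool:
    """True if dog[s][r][c] is strictly greater or smaller than all 26 neighbours.

    dog is a list of 2-D difference-of-Gaussian images ordered by scale; the
    caller must pass an interior position (not on the border of any axis).
    """
    centre = dog[s][r][c]
    neighbours = [
        dog[s + ds][r + dr][c + dc]
        for ds in (-1, 0, 1)
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (ds, dr, dc) != (0, 0, 0)
    ]
    return all(centre > v for v in neighbours) or all(centre < v for v in neighbours)
```

The list comprehension visits 8 neighbours in the same scale and 9 in each adjacent scale, which gives the 26 comparisons described above.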
However, in the image-splicing scenario, the straightforward application of the Zernike moment features or the SIFT features is not efficient to retrieve spliced images because of the wide variety of shapes, sizes, and textures of the spliced regions. Therefore, in this paper, we propose a hybrid method that combines two advanced feature types, the Zernike moment features and the SIFT features, for solving the image-splicing retrieval problem.

Image-Splicing Localization
Suppose that a spliced image I(x, y) is composed of two authentic images, S(x, y) and T(x, y), which are the source and target images, respectively. Here, one or more regions of the source image can be copied and pasted into the target image to create the spliced image. Afterward, the resultant image I(x, y) is usually edited by some postprocessing operations to make it look realistic. Given a set of spliced images, this study aims to retrieve their original source images and target images. Our proposed approach to this problem is presented in Sections 3 and 4.

In Figure 2, we depict the overall framework of the proposed method, which consists of two main stages: image-splicing localization and image retrieval. In the first stage, we segment the input spliced image I(x, y) into the spliced regions (foreground) F(x, y) and the background region B(x, y). The detailed splicing-localization algorithm is described in Section 3. The segmented regions are subsequently utilized as the input of the second stage. The proposed image-retrieval method is presented in Section 4, where we query F(x, y) and B(x, y) to find S(x, y) and T(x, y), respectively.

In order to retrieve all images that were used to compose the spliced image, we need to segment the background and spliced regions of that spliced image so that features can be extracted from them separately. In this paper, we adopted a fast and efficient splicing-localization algorithm introduced in Reference [19]. In this algorithm, the input image is first converted to the YUV space, and each color component of the image is divided into nonoverlapping 8 × 8 blocks. These blocks of the three color components are then transformed to the Discrete Cosine Transform (DCT) domain. The histograms of DCT coefficients of all blocks in the different color channels are subsequently used to construct a block posterior probability map. Afterward, a Support Vector Machine classifier utilizes this probability map as the feature to classify whether each 8 × 8 block is tampered or not. Note that we carry out some further postprocessing operations, including median filtering, morphological erosion, and dilation, to remove noise and to bridge the gaps in the generated tampering map in order to localize the spliced regions. This tampering map is detected at the block level; hence, it is then upscaled to obtain the final localized splicing image.

To quantitatively evaluate the performance of splicing localization, we adopted two metrics for the spliced regions of a detected spliced image [31,44], precision $M_P$ and recall $M_R$, which are defined as follows:

$$M_P = \frac{TP}{TP + FP} \tag{7}$$

and

$$M_R = \frac{TP}{TP + FN}, \tag{8}$$

where $TP$, $FP$, and $FN$ denote the true positives, false positives, and false negatives of the detected spliced regions with respect to the ground truth. There is a trade-off between precision and recall; consequently, to consider both measures, we computed their harmonic mean $M_F$, called the $F_1$-score, as follows:

$$M_F = \frac{2\, M_P\, M_R}{M_P + M_R}. \tag{9}$$

In the proposed method, the localization of spliced regions is considered correct if $M_F > 0.7$. These correctly localized spliced regions and the corresponding background of the same spliced images are then used for retrieval in the next stage. Examples of spliced images, segmentation of spliced and background regions, and ground truth images are presented in Figures 3 and 4.
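The localization metrics and the acceptance rule can be sketched as follows. This is a minimal example that assumes the scores are computed pixel by pixel on binary masks of equal size; the function names and the mask representation (lists of 0/1 rows) are hypothetical.

```python
def localization_scores(detected, truth):
    """Return (precision, recall, F1) for two binary masks of equal size.

    Assumption: evaluation is done per pixel, with 1 marking a spliced pixel.
    """
    tp = fp = fn = 0
    for det_row, gt_row in zip(detected, truth):
        for d, g in zip(det_row, gt_row):
            if d and g:
                tp += 1          # spliced pixel correctly detected
            elif d:
                fp += 1          # authentic pixel flagged as spliced
            elif g:
                fn += 1          # spliced pixel that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def is_correct_localization(detected, truth, threshold=0.7):
    """Apply the acceptance rule: the F1-score must exceed the threshold."""
    return localization_scores(detected, truth)[2] > threshold
```

Only masks that pass `is_correct_localization` would be forwarded to the retrieval stage under the rule described above.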

Image-Splicing Retrieval
Rotation and scaling are two operations that affect the shape and direction of spliced regions. Therefore, we take advantage of the rotation-invariant property of the Zernike moment features and the rotation- and scale-invariant properties of the SIFT features to trace spliced regions. Whereas the SIFT features-based method (the SIFT method) effectively extracts features in textured regions, it usually fails in smooth regions. The Zernike moment features-based method (the Zernike method), in contrast, addresses this shortcoming of the SIFT method. In addition, Zernike moment feature extraction is a block-based method, while SIFT feature extraction is a keypoint-based method. For those reasons, in the proposed method, the Zernike moment features and the SIFT features are independently extracted to handle different situations in the spliced-image retrieval problem.

Bag-of-Features-Based Image-Retrieval Using Hamming Embedding
In image retrieval, images are represented by descriptive features, which are used to evaluate similarity or dissimilarity between a pair of images. Since the splicing operation may rotate, scale, or translate the forged regions, the extracted features of the images should be invariant to these transformations. The features generated from Zernike moments and SIFT have such characteristics. In this Section, we present the image-retrieval method based on HE encoding [40,48] and bag-of-features [49-51].

Suppose that query region $Q$ is described by a set of $n$ local descriptors, $X^Q = \{x_1^Q, x_2^Q, \ldots, x_n^Q\}$. All these descriptors are mapped into a visual vocabulary set $W = \{w_1, w_2, \ldots, w_k\}$ by a k-means vector quantizer $q$. In other words, $q$ maps $x_i^Q$ to a visual word, where $q(x_i^Q) \in W$. It is worthwhile to note that, in the proposed design, both the spliced region and the background region could be the query region. We define the set of indices of descriptors in $X^Q$ that are assigned to a particular visual word $w$ as $I_w^Q = \{i \mid q(x_i^Q) = w\}$.

In order to match different descriptors, each descriptor is represented as a binary signature [40]. We define $d$ as the number of bits used for descriptor representation, i.e., $b(x) \in \{0, 1\}^d$. The Hamming distance between the binary signatures of two descriptors, $u$ and $v$, is computed as follows:

$$h(u, v) = \sum_{i=1}^{d} \left| b_i(u) - b_i(v) \right|, \tag{10}$$

where $b_i(\cdot)$ is the $i$-th bit of the signature. Let us denote $M_w(X^Q, X^D)$ as a matching function between two sets of descriptors $X^Q$ and $X^D$ assigned to the same visual word $w$ as:

$$M_w(X^Q, X^D) = \sum_{i \in I_w^Q} \sum_{j \in I_w^D} f\big(h(x_i^Q, x_j^D)\big), \tag{11}$$

where weighting function $f$ is calculated as a Gaussian function [40] with bandwidth $\sigma$:

$$f(h) = e^{-h^2/\sigma^2}. \tag{12}$$

Finally, the similarity between two images, $Q$ and $D$, is defined as:

$$\mathrm{sim}(Q, D) = \sum_{w \in W} c_w\, M_w(X^Q, X^D), \tag{13}$$

where constant $c_w$ is the inverse document frequency [52] of visual word $w$ in $W$.
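The scoring scheme described in this Section can be sketched as follows. This is a minimal illustration under several assumptions that are not taken from the paper: each descriptor is given as a pair of a visual-word id and a bit-packed integer signature, the Gaussian weight has the form exp(-h^2 / sigma^2), the default `sigma=16.0` is a placeholder value, and no score normalization is applied.

```python
import math
from collections import defaultdict


def hamming(sig_u: int, sig_v: int) -> int:
    """Hamming distance between two bit-packed binary signatures."""
    return bin(sig_u ^ sig_v).count("1")


def gaussian_weight(h: int, sigma: float) -> float:
    """Gaussian weighting of a Hamming distance (assumed form)."""
    return math.exp(-(h * h) / (sigma * sigma))


def group_by_word(descriptors):
    """Group (visual_word, signature) pairs into {visual_word: [signatures]}."""
    groups = defaultdict(list)
    for word, signature in descriptors:
        groups[word].append(signature)
    return groups


def similarity(query, candidate, idf, sigma=16.0):
    """IDF-weighted sum of Gaussian-weighted matches over shared visual words."""
    q_groups = group_by_word(query)
    d_groups = group_by_word(candidate)
    score = 0.0
    for word, q_sigs in q_groups.items():
        d_sigs = d_groups.get(word, [])
        # Only descriptors quantized to the same visual word are compared.
        match = sum(gaussian_weight(hamming(a, b), sigma)
                    for a in q_sigs for b in d_sigs)
        score += idf.get(word, 0.0) * match
    return score
```

Ranking every database image by `similarity(query, image, idf)` in descending order gives the top-hit lists used in the rest of this Section; a production system would use an inverted file rather than this exhaustive loop.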

Hybrid Features-Based Image Retrieval
In this paper, we propose a hybrid image-retrieval method exploiting Zernike and SIFT features. In order to effectively combine these two methods in the retrieval process, we evaluate their retrieval performance separately. Figure 5 illustrates clusters of spliced regions that are correctly retrieved by either the Zernike method, the SIFT method, or both methods. Two measurements of spliced regions, the extent and the smoothness, are represented by the number of pixels and the number of SIFT keypoints, respectively. The Zernike method correctly retrieves a higher number of images than the SIFT method. In addition, for small and smooth regions, the former also shows better performance than the latter. Therefore, the proposed method first performs image retrieval using the Zernike moment features of the query regions. Subsequently, based on the intermediate retrieval results, the proposed method decides whether the SIFT features need to be utilized to improve the retrieval performance.
We introduce a method to effectively utilize two kinds of features in image retrieval. Firstly, we apply the localization algorithm [19] to the spliced image I(x, y) to segment foreground (spliced region) and background, say F(x, y) and B(x, y), respectively. These regions are sequentially used as the queries for the image-retrieval process. Assume that the current query region is F(x, y). Then, the proposed method extracts the Zernike moment features of F(x, y) and evaluates the certainty score of the top-hit retrieval image using the extracted features. We define s as the certainty score in Equation (15). If the score is positive, the top-hit retrieval image is considered as the source image S(x, y), which was used for image splicing. In contrast, if the certainty score s is smaller than or equal to 0, the SIFT features are subsequently used to retrieve the original source image of spliced region F(x, y). Figure 6 shows the pipeline of the proposed spliced-image retrieval method for F(x, y). Similarly, if the background region B(x, y) is queried, the target image of image splicing, T(x, y), is retrieved.
In the first stage of the proposed image-retrieval process using the Zernike moment features, each retrieved image has a score that represents the similarity of that image with the input image. Suppose that $I_1$ and $I_2$ are the first and second top-hit retrieval images of the input image with voting scores $s_1$ and $s_2$, respectively. It is worth pointing out that, in the spliced-image retrieval problem, the top-hit retrieved image should be the only correct result for each query. We consider $I_1$ as the correctly retrieved image if the ratio $s_1/s_2$ is greater than a specific value in different intervals of $s_1$.
In contrast, the SIFT features are taken into account in the second stage of the proposed method if one of the following conditions happens: and s 1 /s 2 < λ 1 , . . .
where we use $\alpha_i$ ($i = 1, 2, \ldots, N$) as bounds of the different intervals for the top-hit score $s_1$ and $\lambda_i$ ($i = 1, 2, \ldots, N$) as thresholds for $s_1/s_2$ in each corresponding interval of $s_1$. Since $s_1$ is a positive integer, in the experiments we avoid setting $\alpha_i$ ($i = 1, 2, \ldots, N$) to integer values. We define $s$ as the certainty score for the top-hit image in the first retrieval stage, which is calculated based on Equation (14) as follows: where we define 2 more parameters for this general formula, $\lambda_0 = 0$ and α 0 ∈ α 1 , α 1 . If $s$ is positive, $I_1$ is considered as the correct retrieval of the query image region; otherwise, the proposed method repeats the retrieval process using SIFT features.
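The two-stage decision can be sketched as below. Because the exact forms of Equations (14) and (15) are not reproduced here, the score used in this sketch is a stand-in: it assumes that the bounds partition the range of $s_1$ into consecutive intervals, that one ratio threshold is supplied per interval, and that the certainty score is the ratio $s_1/s_2$ minus the threshold of the interval containing $s_1$. The function names, the search callbacks, and the threshold values in the usage note are all hypothetical.

```python
import bisect


def certainty_score(s1, s2, alphas, lambdas):
    """Stand-in certainty score: s1/s2 minus the threshold of s1's interval.

    alphas  -- sorted, non-integer interval bounds for s1 (assumed layout)
    lambdas -- one ratio threshold per interval, len(alphas) + 1 values
    """
    if len(lambdas) != len(alphas) + 1:
        raise ValueError("need exactly one threshold per interval")
    ratio = s1 / s2 if s2 > 0 else float("inf")
    interval = bisect.bisect_left(alphas, s1)  # index of the interval holding s1
    return ratio - lambdas[interval]


def retrieve_original(region, zernike_search, sift_search, alphas, lambdas):
    """Return the id of the retrieved image for one query region.

    zernike_search and sift_search are caller-supplied functions that return
    a list of (image_id, score) pairs sorted by descending score.
    """
    ranked = zernike_search(region)
    (top_id, s1), (_, s2) = ranked[0], ranked[1]
    if certainty_score(s1, s2, alphas, lambdas) > 0:
        return top_id                   # first stage is confident enough
    return sift_search(region)[0][0]    # otherwise fall back to SIFT features
```

For instance, with the illustrative values `alphas=[10.5, 20.5]` and `lambdas=[3.0, 2.0, 1.5]`, a top hit with scores 15 and 5 is accepted in the first stage, whereas scores 8 and 4 trigger the SIFT stage.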

Dataset and Experimental Setup
There exist several benchmarking datasets for evaluating the performance of image-splicing algorithms. In our simulations, we used CASIA v2.0, a realistic and challenging dataset introduced in Reference [53]. In this dataset, there are 7491 authentic, 1849 spliced, and 3274 copy-move color images with various sizes from 240 × 160 to 900 × 600 pixels. However, since this dataset lacks ground truth images, we generated ground truth images of the spliced regions for the evaluation of the localization algorithm (Equations (7)-(9)). The ground truth dataset was contributed to the research community and is available online at https://github.com/namtpham/casia2groundtruth. For the retrieval testing set, we randomly chose 225 spliced images whose spliced regions were correctly localized, covering all the categories defined in the dataset, such as animal, architecture, character, plant, nature, indoor, and texture. The diversity of characteristics of spliced regions is depicted in Figure 5. Finally, 225 query images, which are equivalent to 450 query regions, are retrieved among the 7491 authentic database images. We trained 100,000 visual words for the experiments.
The parameters in Equation (15) were empirically set as presented in Table 1. We carried out the simulations by independently querying spliced regions and background regions. In each case, we performed the image retrieval of the Zernike method and the SIFT method, separately, to compare with the proposed method.

Evaluation Measures
In regular image-retrieval problems, a set of images that are visually similar to the query image are considered as the correct retrieval results. On the contrary, in the image-splicing retrieval problem of this paper, there is only one correct source image and one correct target image when we query a spliced image. Therefore, we defined a metric quantity $TP_i@N$ as the correctness of the retrieval of the $i$-th query. If the original image of the $i$-th query is retrieved in the top $N$ results, then $TP_i@N = 1$; otherwise, $TP_i@N = 0$. The number of correctly retrieved images in the top $N$ retrieval results is defined as follows:

$$TP@N = \sum_{i=1}^{\#Q} TP_i@N, \tag{16}$$

where $\#Q$ is the total number of queries. The recall at top $N$ retrieval is computed as follows:

$$R@N = \frac{TP@N}{\#Q}. \tag{17}$$
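As a small illustration of these two measures, the sketch below counts the queries whose single correct original appears among the top N results and divides by the number of queries. The function names and the representation of results as ranked lists of image identifiers are assumptions made for this example.

```python
def tp_at_n(ranked_results, originals, n):
    """Number of queries whose original image appears in the top n results.

    ranked_results -- one ranked list of image ids per query
    originals      -- the single correct image id for each query
    """
    return sum(1 for ranked, truth in zip(ranked_results, originals)
               if truth in ranked[:n])


def recall_at_n(ranked_results, originals, n):
    """Fraction of queries answered correctly within the top n results."""
    return tp_at_n(ranked_results, originals, n) / len(originals)
```

With one correct image per query, recall at top 1 coincides with the fraction of queries whose top hit is the true original.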

Image-Splicing Retrieval Results
Table 2 shows the number of query spliced images whose original source and target images are retrieved in the top 1, 10, 50, and 100. As can be seen from the table, the Zernike method performs better than the SIFT method when each of them is independently used for retrieval. The proposed method, a hybrid algorithm of these two features, notably improved the performance of spliced-region retrieval and achieved 84.89% accuracy in comparison with 59.56% and 76.44% of the SIFT and the Zernike methods, respectively. Additionally, the proposed method correctly retrieved all the background regions.

Table 2. Columns: Query Region, Features, TP@1, TP@10, TP@50, TP@100, R@1.

The performance of the image-retrieval methods in terms of R@N for background and spliced query regions is given in Figure 7.

Analysis of Top-Hit Retrieved Source Image
Image-retrieval methods succeed in finding target images, whereas their performance in spliced-region retrieval is much lower. To evaluate the efficiency of these retrieval algorithms on spliced regions, we classified the 225 spliced images into 5 categories based on the ratio of the total area of all spliced regions in one spliced image to the area of that image. These ratios ranged from 0.21% to 67.60% across the 225 images, and 84% of the testing images have small spliced regions that account for less than 20% of the whole image in terms of area. Figure 8 shows the superiority of the proposed method over the SIFT method and the Zernike method in all the above-mentioned categories by giving the numbers of correctly retrieved spliced regions (TP@1). In general, the accuracy of retrieving tiny spliced regions, whose areas are less than 5% of the corresponding spliced images, is lower than that of bigger spliced regions.
Figure 9 gives four examples of spliced-region retrieval by the three methods: the Zernike method, the SIFT method, and the proposed hybrid method. In the first two cases, both the Zernike method and the SIFT method failed to retrieve the source image. On the other hand, the hybrid method correctly retrieved the images in both cases. All methods correctly retrieved the source image in the third case. In contrast, in the last example, all methods retrieved the wrong source image; here, the Zernike method found the correct spliced region (the houses) in another authentic image, which was taken of the same scene as the source image but from a different view.

Top-hit images retrieved for spliced-region queries by the Zernike method, the SIFT method, and the proposed hybrid method, respectively; (f) source images.

Image-Splicing Validation
An example of image-splicing validation is presented in Figure 10. Two different spliced images were segmented into spliced and background regions. Subsequently, these regions were queried by the hybrid image-retrieval algorithm to find their source and target images. The result shows that they were created by copying different regions from the source image and pasting them into the target image. Therefore, the two input images were verified to not be authentic.

Conclusions
This paper addressed the problem of retrieving the original authentic images that were used to compose spliced images. Prior to the image-retrieval stage, we segmented the spliced images into spliced regions and background regions in the image-splicing localization stage in order to improve retrieval accuracy. These segmented regions were then separately queried to find the provenance of the spliced image. We proposed a hybrid method that can effectively employ Zernike moment features and SIFT features for image retrieval. Since the former is invariant to rotation and the latter is invariant to scaling and rotation, the proposed method can handle different splicing operations to find the matching features. By retrieving the source and the target images of the spliced images with high accuracy, the proposed method can validate the authenticity of the spliced images with more certainty (see Figure 10). In future work, we will extend our research to improve the performance of image-splicing localization and the retrieval of small and smooth spliced regions.

Figure 1. Detecting SIFT keypoints in one scale by finding local maxima or minima.

Figure 2. General framework of the proposed method for image-splicing retrieval.

Figure 3. Image-splicing localization example. Spliced images are shown in the top row, and their corresponding spliced masks and ground truth images are presented in the second row and the third row, respectively. Finally, precision, recall, and $F_1$-score of each localized splicing image are given.

Figure 4. Image-splicing localization example. Spliced images are shown in the top row, and their corresponding spliced masks and ground truth images are presented in the second row and the third row, respectively. Finally, precision, recall, and $F_1$-score of each localized splicing image are given.

Figure 5. Illustration of spliced regions that are correctly retrieved by Zernike-based, SIFT-based, or both methods. The number of pixels of spliced regions and the number of SIFT keypoints in the corresponding regions are presented.

Figure 7. R@N of image-retrieval methods for querying (a) background regions; (b) spliced regions.

Figure 10. Example of two images verified as spliced images, where the same source image and the same target image were retrieved.

Table 1. Parameters used in the proposed hybrid method.