Image Splicing Location Based on Illumination Maps and Cluster Region Proposal Network

Abstract: Splicing is the most common operation in image forgery, where tampered regions are imported from different images. Illumination maps are an inherent attribute of images and provide significant clues when searching for splicing locations. This paper proposes an end-to-end dual-stream network for splicing localization, where the illumination stream, which includes Grey-Edge (GE) and Inverse-Intensity Chromaticity (IIC), extracts inconsistency features, and the image stream extracts global unnatural tampering features. The dual-stream features in our network are fused through a Multiple Feature Pyramid Network (MFPN), which captures richer context information. Finally, a Cluster Region Proposal Network (C-RPN) with spatial attention and adaptive cluster anchors is proposed to generate potential tampered regions with greater retention of location information. Extensive experiments on the NIST16 and CASIA standard datasets show that the proposed algorithm is superior to some state-of-the-art algorithms: it achieves accurate tampered-region localization at the pixel level and is robust to post-processing operations such as noise, blur and JPEG recompression.


Introduction
With the increasing popularity of editing tools, image content can be easily and discreetly edited. Forged images are often disseminated to pervert the truth in settings that require original images, such as news organizations, academia and courts of law. Forgery incidents have become frequent in modern society and bring serious negative consequences. Splicing is one of the most common image tampering operations, in which a target from one image is placed into another image in order to falsify reality. To cover up the traces of tampering, post-processing is often applied, which makes blind splicing forensics very challenging.
Traditional splicing forensics methods can be divided into two categories: equipment inconsistency and image attributes. Methods based on equipment inconsistency use discrepancies in the image device information, such as the Color Filter Array (CFA) [1], Error Level Analysis (ELA) [2] and Noise Inconsistency (NI) [3]. With the development of image post-processing technology, the image device information has become easy to edit, causing inaccuracy in splicing detection. Methods based on image attributes look for inconsistency features between the original and tampered regions, such as the gray world [4], the 3-D lighting environment [5] and the two-color reflection model [6]. The principles of these methods are relatively simple, but their assumptions about the imaging conditions limit their applicability.
Convolutional Neural Networks (CNNs) have made significant contributions to the field of computer vision and also provide a novel approach to image splicing forensics. Zhang first proposed a two-stage deep learning detection approach, which included a stacked autoencoder model to learn block features and a fusion model to integrate the context information of each block [7]. Inspired by this, a first CNN layer with basic high-pass filters to calculate residual maps was applied in a Spatial Rich Model (SRM), which can detect splicing and copy-move [8]. Bondi considered a CNN-based method that extracts camera features from image patches and then clusters them to detect whether an image has been forged [9]. However, the above methods extract features from patches, so their localization accuracy is only patch-level.
In constrained image splicing localization, the Deep Matching and Validation Network (DMVN) [10] and the Deep Matching network based on Atrous Convolution (DMAC) [11] were proposed to generate probability estimates and locate constrained splicing regions. Liu further introduced a novel attention-aware encoder-decoder deep matching network with atrous convolution, which showed superior performance [12]. A combination of four features has been extracted and trained using a logistic regression classification model [13]. The Ringed Residual U-Net (RRU-Net) [14], the Coarse-to-Refined Network [15] and Mask R-CNN [16] were directly applied to splicing localization. The Spatial Pyramid Attention Network (SPAN) is a self-attentive hierarchical structure that constructs a pyramid of local self-attention blocks to locate tampered regions [17]. These blind forensics methods extract features from a single stream, so their tampered-region localization accuracy is relatively low. In contrast, dual-stream methods usually require image preprocessing to extract manipulation traces.
Salloum proposed the Multi-Task Fully Convolutional Network (MFCN) with image-based and edge-based streams to localize splicing regions, which explored the validity of the semantic segmentation framework in forgery detection [18]. Further dual-stream networks with different inputs were proposed, such as spatial and frequency domains [19], RGB and noise images [20], and RGB and SRM features [21]. The skip-connection architecture based on the Long Short-Term Memory Encoder-Decoder (LSTM-EnDec-Skip) [22] improves on LSTM-EnDec [23] by applying a skip-connection structure to the multitask network. In summary, the input of a dual-stream framework always includes the RGB image and another tampering-attribute image. Even though the above methods can locate spliced regions, the localization precision still needs improvement, and tampering-attribute traces are often hidden by post-processing.
The illumination maps are an inherent attribute of the image, which is difficult to modify but can help locate the splicing regions. We propose a novel splicing blind forensics framework, where the image stream is used to extract semantic features and the illumination stream is used to extract tampered features. The two streams are fused by a Multiple Feature Pyramid Network (MFPN), and a Cluster Region Proposal Network (C-RPN) is used to locate tampered regions, as shown in Figure 1. In general, our contributions can be summarized in three points. The overall structure of this article is as follows: Section 2 briefly describes related work. In Section 3, we introduce the proposed model and its components. Section 4 presents the experimental results and analyses. Section 5 summarizes the paper.

Related Works
Illumination maps can establish the illumination source; estimation methods are generally categorized into two classes: statistics-based and physics-based. Since the pristine and tampered regions come from different images, inconsistency in the illumination maps is a critical clue for splicing detection. The illumination color [24] and transformed spaces [25] have been applied to detect image forgery, distinguishing original and manipulated images at the image level. Beyond traditional methods, illumination maps combined with CNNs can detect splicing forgery at the pixel level [26,27], where the first step is to classify the image as pristine or fake, and the second step is to locate the tampered region. However, the localization precision remains low. In order to make full use of illumination maps, we adopt Grey-Edge (GE) [28] and Inverse-Intensity Chromaticity (IIC) [29] as the illumination stream, which extracts rich features together with the RGB image stream.
Multiple-scale feature fusion typically merges low-level feature maps with high-level feature maps to capture global information. U-Net exploits lateral/skip connections that associate low-level feature maps across resolutions and semantic levels [30]. HyperNet concatenates the features of multiple layers before prediction, but this brings additional computation [31]. The Feature Pyramid Network (FPN) designs horizontal connections to fuse feature maps at multiple scales in bottom-up and top-down stages, obtaining more robust semantic information [32]. Since these multiple-scale feature fusion methods are applied in single-stream frameworks, we propose a Multiple Feature Pyramid Network (MFPN) for dual-stream feature fusion.

The Proposed Framework
With the help of the color inconsistency of the illumination maps and the image semantic features, the tampered regions can be classified and located. As shown in Figure 1, the framework can be divided into the following parts: the dual-stream framework, the multiple feature pyramid network and the cluster region proposal network. The image stream extracts features of unnatural tampering boundaries, while the illumination stream focuses on inconsistency features as a supplement. Then, the multiple feature maps of the two streams are fused through MFPN. Finally, the Regions of Interest (ROI) are obtained by C-RPN and the tampered regions are located through data training.

Illumination Maps
The illumination map is an inherent attribute of the image and is difficult to process uniformly, which makes it a major indicator for splicing forgery detection. GE and IIC are two state-of-the-art illumination maps, which are effective for revealing the tampered regions in splicing. GE assumes that the average reflectance of the object surface is achromatic; it is easy to compute and has low computational complexity. The GE illumination estimate $GE_{ill}(x)$ of pixel $x$ can be formulated as:

$$GE_{ill}(x) = k \left( \int \left| \frac{\partial^{n} I_{\sigma}(x)}{\partial x^{n}} \right|^{p} dx \right)^{1/p} \tag{1}$$

where $k$ is the scale factor, $|\cdot|$ denotes the absolute value, $I_{\sigma}(x)$ denotes the intensity smoothed by a Gaussian filter with kernel $\sigma$, $p$ is the Minkowski norm, and $n$ is the order of the derivative.
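As an illustration (not the paper's implementation), a minimal NumPy sketch of first-order Grey-Edge estimation might look like the following; the Gaussian pre-smoothing step is omitted for brevity (the $\sigma = 0$ case), and the choice $p = 6$ is a common default rather than a value stated in the paper:

```python
import numpy as np

def grey_edge(img, p=6):
    # First-order Grey-Edge (n = 1); the Gaussian pre-smoothing step is
    # omitted here for brevity (i.e. the sigma = 0 case).
    # img: float array of shape (H, W, 3) with values in [0, 1].
    est = np.zeros(3)
    for c in range(3):
        gy, gx = np.gradient(img[..., c])          # spatial derivatives
        deriv = np.sqrt(gx ** 2 + gy ** 2)         # gradient magnitude
        est[c] = np.sum(deriv ** p) ** (1.0 / p)   # Minkowski p-norm
    norm = np.linalg.norm(est)
    return est / norm if norm > 0 else est         # unit-norm illuminant
```

The per-pixel illumination map used by the network would then be obtained by applying such an estimator over local windows rather than the whole image.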
The IIC applies inverse-intensity and chromaticity estimation to recover the illumination color of each pixel, which requires few surface colors and has good robustness. The IIC illumination estimation $IIC_{ill}(x)$ on channel $c \in \{R, G, B\}$ is formulated as:

$$\sigma_{c}(x) = m(x) \frac{1}{\sum_{i \in \{R,G,B\}} I_{i}(x)} + IIC_{ill}(x) \tag{2}$$

where $m(x)$ is a parameter that depends on the surface orientation, diffuse chromaticity and specular chromaticity, and $\sigma_{c}(x)$ is the chromaticity on channel $c$.
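To make the linear relation in Equation (2) concrete, here is a simplified NumPy sketch that estimates the illumination chromaticity as the intercept of a per-channel least-squares line fit in inverse-intensity space. This substitutes least squares for the Hough-style voting of the original IIC method, so it is an illustrative stand-in rather than the paper's procedure:

```python
import numpy as np

def iic_illuminant(img, eps=1e-6):
    # Simplified Inverse-Intensity Chromaticity sketch: for each channel c,
    # the chromaticity sigma_c = I_c / sum(I) is linear in 1 / sum(I),
    #   sigma_c(x) = m(x) * (1 / sum_i I_i(x)) + Gamma_c,
    # so a least-squares line fit per channel yields the intercept Gamma_c,
    # taken here as the illumination chromaticity. (The original method
    # estimates the intercept from specular pixels via Hough voting.)
    I = img.reshape(-1, 3).astype(float)
    s = I.sum(axis=1) + eps
    inv = 1.0 / s
    A = np.stack([inv, np.ones_like(inv)], axis=1)
    gamma = np.empty(3)
    for c in range(3):
        slope, intercept = np.linalg.lstsq(A, I[:, c] / s, rcond=None)[0]
        gamma[c] = intercept
    return gamma / gamma.sum()  # normalized illumination chromaticity
```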

The Image and Illumination Stream
The image stream, with the input IMG(x), extracts features such as strong contrast differences and unnatural tampering boundaries, while the illumination stream, with the input GEill(x)/IICill(x), focuses on inconsistency features as a supplement. The inputs IMG(x) and GEill(x)/IICill(x) are shown in Figure 2, where the illumination maps of GE and IIC reveal the tampered regions. The backbones of the image and illumination streams are ResNet-50, whose feature maps are recorded as {C1, C2, C3, C4, C5} and {P1, P2, P3, P4, P5}, respectively.

The Dual-Stream Framework
The Feature Pyramid Network (FPN) is a classical method to fuse pyramid features at all scales, but it does not accommodate a dual-stream framework. Inspired by FPN, we propose the Multiple Feature Pyramid Network (MFPN) for two-stream feature fusion, as shown in Figure 3. The MFPN has three pathways: the bottom-up pathway on the left, the bottom-up pathway on the right and the top-down pathway in the middle. Since the feature map of the first layer has a large resolution, only the feature maps of the second to fifth layers of the image and illumination streams are used. In the top-down pathway in the middle, the fused image and illumination feature maps are recorded as {K2, K3, K4, K5, K6}. The MFPN feature map $K_i$ is expressed as (3):

$$K_{i} = \mathrm{Conv}(C_{i}) + \mathrm{Conv}(P_{i}) + \mathrm{Up}(K_{i+1}) \tag{3}$$

where Conv() is the 1 × 1 convolution function that adjusts the dimension and connects horizontally, and Up() is the up-sampling function with a factor of 2.
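A minimal NumPy sketch of the top-down fusion rule in Equation (3) follows; the shapes, weight matrices, and the use of nearest-neighbour upsampling are illustrative assumptions (the network learns its 1 × 1 convolutions during training):

```python
import numpy as np

def conv1x1(x, w):
    # 1x1 convolution = per-pixel linear map over channels
    # x: (H, W, C_in), w: (C_in, C_out)
    return x @ w

def upsample2(x):
    # nearest-neighbour upsampling by a factor of 2
    return x.repeat(2, axis=0).repeat(2, axis=1)

def mfpn_fuse(Cs, Ps, Wc, Wp):
    # Cs, Ps: image / illumination feature maps (finest first, coarsest
    # last); Wc, Wp: 1x1 conv weights per level.
    # Top level: K_top = Conv(C_top) + Conv(P_top);
    # lower levels: K_i = Conv(C_i) + Conv(P_i) + Up(K_{i+1}).
    Ks = [None] * len(Cs)
    Ks[-1] = conv1x1(Cs[-1], Wc[-1]) + conv1x1(Ps[-1], Wp[-1])
    for i in range(len(Cs) - 2, -1, -1):
        Ks[i] = (conv1x1(Cs[i], Wc[i]) + conv1x1(Ps[i], Wp[i])
                 + upsample2(Ks[i + 1]))
    return Ks
```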

Cluster Region Proposal Network (C-RPN)
The default anchor sizes in the Region Proposal Network (RPN) have a large span, which is unreasonable for splicing localization. A Cluster Region Proposal Network (C-RPN) is therefore proposed, where the cluster anchors adapt to the varying sizes of tampered regions and the attention module focuses on the tampered regions. The spatial attention feature map $\tilde{K}_i$ is defined as:

$$\tilde{K}_{i} = M_{s}(K_{i}) \otimes K_{i} \tag{4}$$

where $M_{s}()$ is the spatial attention weight matrix and $\otimes$ is matrix multiplication. To generate adaptive anchors, the K-means cluster algorithm is applied to the widths and heights of the tampered regions. Since the MFPN feature maps have 5 layers, the number of K-means cluster centers is initialized as 5. The maximum and minimum tampered-region sizes are marked as $M_{max}$ and $M_{min}$, respectively. The adaptive anchor size set $S$ is expressed as (5):

$$S = \{ M_{i} \mid M_{min} \le M_{i} \le M_{max},\ i = 1, \dots, 5 \} \tag{5}$$

where $M_{i}$ is the $i$-th cluster center, and the aspect ratios are set as $Q = \{0.3, 0.5, 1, 2, 3\}$.
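The anchor-clustering step can be sketched with a plain k-means over ground-truth box sizes; this is a self-contained illustration (not the paper's code), with hypothetical initialization and iteration counts:

```python
import numpy as np

def cluster_anchor_scales(boxes, k=5, iters=50, seed=0):
    # boxes: (N, 2) array of (width, height) of ground-truth tampered
    # regions. Plain k-means on (w, h); the k centres become the anchor
    # base sizes, paired with the aspect ratios Q = {0.3, 0.5, 1, 2, 3}
    # when proposals are generated.
    boxes = np.asarray(boxes, dtype=float)
    rng = np.random.default_rng(seed)
    centres = boxes[rng.choice(len(boxes), k, replace=False)].copy()
    for _ in range(iters):
        d = np.linalg.norm(boxes[:, None, :] - centres[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(k):
            pts = boxes[assign == j]
            if len(pts):
                centres[j] = pts.mean(axis=0)
    return centres[np.argsort(centres.prod(axis=1))]  # sorted by area
```

Sorting by area lets one anchor scale be assigned per MFPN level, from the finest map (smallest anchors) to the coarsest.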

Training Loss
The candidate tampered regions acquire fixed-size feature maps through Region of Interest (ROI) alignment. The processing of the network is divided into three branches: the classification of the fixed-size feature map, the regression of the bounding box, and the segmentation of the mask. The loss for classifying real and fake, $L_{cls}$, is expressed as (6):

$$L_{cls} = -\log p_{u} \tag{6}$$

where $p_{u}$ is the confidence coefficient of the fake class. The loss for regressing the bounding box, $L_{bbox}$, is computed as (7):

$$L_{bbox} = \sum_{i \in \{x, y, w, h\}} \mathrm{smooth}_{L1}(t_{i} - v_{i}) \tag{7}$$

where $x$ and $y$ are the center position of the ROI, and $w$ and $h$ indicate the width and height, respectively; $t_{i}$ and $v_{i}$ are the label and the prediction of the regression box. The function $\mathrm{smooth}_{L1}()$ is an improvement of the L1 distance, computed as (8):

$$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^{2} & \text{if } |x| < 1 \\ |x| - 0.5 & \text{otherwise} \end{cases} \tag{8}$$

The loss $L_{mask}$ for the predicted mask $y_{pred}$ and the label mask $y_{label}$ is the average binary cross-entropy, computed as (9):

$$L_{mask} = -\operatorname{mean}\left[ y_{label} \log y_{pred} + (1 - y_{label}) \log(1 - y_{pred}) \right] \tag{9}$$

In summary, the total loss for the dual streams is defined as Equation (10):

$$L = L_{cls} + L_{bbox} + L_{mask} \tag{10}$$
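The three branch losses and their sum can be written out directly; the sketch below assumes the standard Mask R-CNN forms (cross-entropy, smooth-L1, binary cross-entropy) and an unweighted sum, which matches the description above but is not taken verbatim from the paper:

```python
import numpy as np

def smooth_l1(x):
    # Eq. (8): 0.5 x^2 for |x| < 1, |x| - 0.5 otherwise
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def cls_loss(p_u):
    # Eq. (6): cross-entropy on the confidence of the true class
    return -np.log(p_u)

def bbox_loss(t, v):
    # Eq. (7): smooth-L1 over the (x, y, w, h) offsets
    return smooth_l1(np.asarray(t, float) - np.asarray(v, float)).sum()

def mask_loss(y_pred, y_label, eps=1e-7):
    # Eq. (9): average binary cross-entropy over mask pixels
    y_pred = np.clip(np.asarray(y_pred, float), eps, 1.0 - eps)
    y_label = np.asarray(y_label, float)
    return -np.mean(y_label * np.log(y_pred)
                    + (1.0 - y_label) * np.log(1.0 - y_pred))

def total_loss(p_u, t, v, y_pred, y_label):
    # Eq. (10): unweighted sum of the three branch losses
    return cls_loss(p_u) + bbox_loss(t, v) + mask_loss(y_pred, y_label)
```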

Datasets and Evaluation Metrics
The training set is the Bappy synthetic dataset with 11k spliced images [21], where the tampered regions are scaled and rotated by different factors. To compare with state-of-the-art methods, we also evaluate on the standard datasets NIST16 [33] and CASIA [34]. NIST16 contains three tampering operations: splicing, copy-move and removal. CASIA is a commonly used splicing dataset, where the ground-truth masks are obtained by subtracting the tampered images from the corresponding host images. This paper uses CASIA 2.0 for training and CASIA 1.0 for testing. The division of the datasets is shown in Table 1. In order to compare the experimental results more objectively, this paper uses the image-level Average Precision (AP) and the pixel-level F1 score to quantitatively evaluate model performance. The definition of the image-level AP is the same as the evaluation standard of the COCO dataset [35], and it serves as an evaluation indicator for splicing blind forensics.
The pixel-level F1 score evaluates the accuracy of tampered-region localization. We calculate the F1 score of each picture and take the average as the final score for each dataset. For each tampered picture, F1 is defined as:

$$F1 = \frac{2TP}{2TP + FP + FN}$$

where $M_{out}$ and $M_{gt}$ represent the final predicted mask and the ground-truth mask, TP is the number of tampered pixels that are correctly predicted, FP is the number of genuine pixels incorrectly predicted as tampered, and FN is the number of tampered pixels incorrectly predicted as genuine.

Training Setting
The experimental model is optimized with stochastic gradient descent, and the input image is cropped to 512 × 512 pixels to avoid the extra computation caused by excessive resolution. The batch size is 4; the initial learning rate is 0.001 and is reduced to 0.0001 after 25k iterations. The maximum number of iterations is 50k, and weights pretrained on ImageNet are used to initialize the network. All experiments are conducted on a machine with an Intel(R) Xeon W2123 CPU @ 3.60 GHz, 32 GB RAM and a single GTX 1080Ti GPU. The training losses for IIC and GE are visualized in Figure 4; both converge within a few iterations.
In Table 2, the bold entries denote the best performance. The AP score of the dual-stream network with the input 'IIC + Image' is 8.1%, 7.7% and 2.6% higher than the single streams with the inputs GE, IIC and image, respectively, which indicates that the illumination inconsistency features and the image features are complementary. In addition, our proposed method with 'IIC' is 3.6% higher than the dual stream with 'IIC + Image', which indicates that MFPN fuses low-level and high-level features to provide sufficient tampering features. In addition, the adaptive anchor sizes from C-RPN help regress appropriate bounding boxes, improving splicing forgery localization.
In this part, the robustness to post-processing operations, such as noise, blur and JPEG compression, is evaluated on the CASIA and NIST16 datasets. The post-processing parameters are set as follows: for Gaussian noise, the mean is zero and the variances are 0.001, 0.005, 0.01 and 0.02; for Gaussian blur, the window size is 3 × 3 and the variances are 0.01, 0.1, 0.3 and 0.5; for JPEG recompression, the quality factors are 80, 70, 60 and 50. The robustness results (F1 scores) for noise, blur and JPEG are shown in Figure 5.
Since the tampering trace features are hidden by JPEG compression and noise, the F1 score decreases slightly. However, the proposed methods with GE and IIC remain robust to JPEG compression and noise within a certain range, and for Gaussian blur our network framework maintains strong robustness. The effects of the different robustness experiments on IIC and GE are shown in Figure 6, from which it can be seen that the proposed dual-stream framework is highly robust to noise, blur and JPEG compression.
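For reference, the Gaussian-noise perturbation used in the robustness tests can be reproduced with a small helper like the one below (a sketch under the stated parameters; blur and JPEG recompression would be applied analogously with an image-processing library):

```python
import numpy as np

def add_gaussian_noise(img, var, seed=0):
    # img: float array in [0, 1]; adds zero-mean Gaussian noise with the
    # given variance (e.g. 0.001, 0.005, 0.01 or 0.02) and clips back
    # to the valid range.
    rng = np.random.default_rng(seed)
    noisy = img + rng.normal(0.0, np.sqrt(var), img.shape)
    return np.clip(noisy, 0.0, 1.0)
```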
Our method is significantly better than the other methods, and even reaches 81.0% on the NIST16 dataset when using IIC. Compared with RGB-N [19], the F1 score increases by at least 2.9% using GE and 4.4% using IIC. This shows that the illumination maps (GE and IIC) represent the tampered area better than the noise feature map. Since LSTM-EnDec [23] and LSTM-EnDec-Skip [22] report no experiments on the Columbia and NIST16 datasets, we compare the results on CASIA, which demonstrates that our method with IIC and GE performs better. However, our method does not reach the best performance on the Columbia dataset, since most tampering in Columbia occurs on the background. The visualization results are shown in Figure 7, where the first two rows are from the CASIA dataset and the last two rows are from the NIST16 dataset; the illumination maps (IIC and GE) yield better pixel-level localization on both datasets.

Conclusions
We propose an end-to-end splicing localization network, which includes an image stream and an illumination stream. The image stream extracts global features such as strong contrast differences and unnatural tampered boundaries, while the illumination stream extracts inconsistent illumination color features from the IIC and GE illumination maps. In addition, MFPN is used to fuse the multi-scale features of the dual-stream network, and C-RPN is proposed to generate candidate tampered regions with greater retention of location information. Extensive experiments on the NIST16 and CASIA datasets show that our algorithm is superior to some state-of-the-art algorithms: it achieves accurate tampered-region localization at the pixel level and is robust to post-processing operations such as noise, blur and JPEG recompression. However, illumination maps are effective mainly for splicing forgery; in the future, we will explore features common to broader image forensics tasks.

Figure 1.
Figure 1. Overview of the proposed framework, where the image stream and the illumination stream extract semantic features and tampered features, respectively. The two streams are fused by MFPN, and C-RPN locates the tampered regions. ⊗ denotes matrix multiplication.

(1) The illumination maps are applied in the illumination stream to extract inconsistent lighting color features, which demonstrates the effectiveness of the illumination maps.
(2) A Multiple Feature Pyramid Network (MFPN) is proposed for deep multi-scale dual-stream feature fusion, which provides sufficient tampered features for the tampered-region proposal.
(3) A Cluster Region Proposal Network (C-RPN) is proposed, where the spatial attention mechanism retains more position information and clustering adaptively selects the anchor sizes.

Experiments and Comparative Analysis

Ablation Experiments
In order to verify the effectiveness of the illumination maps in the dual stream and of C-RPN, ablation experiments with various frameworks are evaluated on the synthetic dataset. The 'GE stream', 'IIC stream' and 'image stream' are single-stream Mask R-CNN networks with the inputs GE, IIC and image, respectively. 'Dual-stream (GE + Image)' and 'Dual-stream (IIC + Image)' are dual streams with MFPN. 'Ours (GE)' and 'Ours (IIC)' are the proposed framework with the inputs GE and IIC. The comparison results are shown in Table 2.

Figure 6.
Figure 6. Exemplar visual results on the NIST16 dataset. From left to right: the first two columns are the tampered images and the Ground-Truth; the remaining columns are the results on the tampered images and on the tampered images with noise (0, 0.02), blur (3 × 3, 0.5) and JPEG (50).

Figure 7.
Figure 7. Examples of visual results on the CASIA and NIST16 datasets. From left to right: the first three columns are the original image, the tampered image and the Ground-Truth; the last two columns are the results of Ours (GE) and Ours (IIC).

Table 1.
Comparison of image datasets.

Table 2.
Ablation experiments results of AP on synthetic dataset (%).