GP-Net: Image Manipulation Detection and Localization via Long-Range Modeling and Transformers

: With the rise of image manipulation techniques, an increasing number of individuals ﬁnd it easy to manipulate image content. Undoubtedly, this presents a signiﬁcant challenge to the integrity of multimedia data, thereby fueling the advancement of image forgery detection research. A majority of current methods employ convolutional neural networks (CNNs) for image manipulation localization, yielding promising outcomes. Nevertheless, CNN-based approaches possess limitations in establishing explicit long-range relationships. Consequently, addressing the image manipulation localization task necessitates a solution that adeptly builds global context while preserving a robust grasp of low-level details. In this paper, we propose GPNet to address this challenge. GPNet combines Transformer and CNN in parallel which can build global dependency and capture low-level details efﬁciently. Additionally, we devise an effective fusion module referred to as TcFusion, which proﬁciently amalgamates feature maps generated by both branches. Thorough extensive experiments conducted on diverse datasets showcase that our network outperforms prevailing state-of-the-art manipulation detection and localization approaches.


Introduction
In recent years, due to rapid advancements in deep generative models [1][2][3], a multitude of image editing applications [4,5] have gained widespread popularity among the general public.However, there is growing concern regarding the potential misuse of these editing techniques for image forgery.The act of image forgery carries significant implications across various domains, including instances of academic misconduct and the dissemination of counterfeit images.Such occurrences emphasize the urgent need to intensify our focus on detecting image forgery.Consequently, the development of effective methodologies for accurately identifying tampered regions in images has become of utmost importance.
Generally, image manipulation techniques are categorized into three groups.For each type of manipultion, researchers have proposed numerous approaches for localizing image manipulations, such as splicing [6][7][8][9][10], copy-and-paste [11][12][13], and removal [14,15].As depicted in Figure 1, these methods often operate at the object level to create semantically coherent and persuasive images, involving the addition or removal of objects within the image.Drawing inspiration from this, we contend that image manipulation detection and localization should consider consistency of objects with one another, necessitating the establishment of long-range relations.While certain CNN-based methods [16][17][18][19][20] have achieved some success in this regard, building long-range relations using these methods requires continuous downsampling and convolution operations.This approach presents several drawbacks: (1) excessively deep networks can encounter the problem of gradient disappearance, and (2) the gradual reduction in spatial resolution results in the disappearance of crucial local information essential for image manipulation detection and localization tasks.The Transformer architecture [21], originally employed in the realm of Natural Language Processing (NLP), has garnered significant attention within the computer vision community.One notable application is the SETR model, which replaces the encoder layers of traditional encoder-decoder networks with transformers.This modification has led to impressive outcomes in image segmentation tasks, establishing SETR as a state-of-the-art solution.While the Transformer excels at establishing long-range relationships, it does exhibit limitations in capturing intricate details.In the context of image forgery detection, the ability to capture fine details is crucial.Consequently, a pure Transformer-based approach akin to SETR does not yield satisfactory results in this domain.
In recent research efforts, there have been attempts to combine CNNs and Transformers to leverage their respective strengths [22,23].For example, ObjectFormer [22] utilizes an RGB stream and a frequency stream, employing CNNs to extract features that are then processed by Transformers to establish long-range relationships and generate prediction masks.However, this approach mainly focuses on replacing convolutional layers with Transformers without effectively leveraging the advantages of both architectures.To fully harness the strength of CNNs combined with Transformers in manipulation detection, we propose a novel two-stream network called GP-Net.In GP-Net, the CNN-based branch and the Transformer-based branch operate in parallel, allowing each branch specialization in its respective strengths.To combine feature maps from both branches in an effective way, the TcFusion module is introduced by us, which efficiently fuses the feature maps to make predictions.Through an extensive ablation study, we optimize our network structure, resulting in improved performance compared to SOAT approaches across various datasets.In addition, we summarize the main contributions of our work:

•
GPNet represents a two-branch network which integrates CNNs and Transformers for image manipulation detection.It overcomes the limitations of excessively deep networks, addressing issues such as gradient vanishing and feature diminishment.

•
We introduce the TcFusion module, a novel feature fusion mechanism that effectively combines features from the CNN branch and the Transformer branch.This fusion module enables the integration of complementary information from both branches.
• Through extensive experiments conducted on multiple baselines, we prove that our approach achieves SOAT performance in both detection and localization of image manipulations.

Image Manipulation Detection
Image manipulation detection involves the task of differentiating tampered images from original images through image-level binary classification.Previous studies [11,24], approached the task by obtaining detection scores based on statistics, such as the mean [24] or the maximum [11].These methods focus primarily on image-level detection without explicitly considering pixel-level manipulation localization.However, recent works have increasingly emphasized manipulation localization on a pixel level while overlooking the importance of imge detection.In contrast, our study takes a holistic approach by jointly considering both the manipulation detection and localization.We recognize the significance of accurately identifying manipulated regions within an image while also providing a binary classification of the overall image.By integrating detection and localization, our approach aims to provide a comprehensive solution for image manipulation analysis.

Image Manipulation Localization
Early related works primarily focused on localizing specific types of operations.For example, some studies [6][7][8][9][10] concentrated on detecting splicing, while others [11][12][13] focused on copy-move manipulation, and [14,15] dealt with removal operations.But in reality, the type of manipulation is not always known, which has motivated recent research interest in general manipulation detection.Consequently, recent works have attempted to develop models capable of handling multiple types of forgeries within a single framework.RGB-N [16], for instance, adopts a two-stream Faster R-CNN-based network that includes an RGB stream and a noise stream generated by the SRM filter [25].However, when images are taken by the same type of camera, their noise patterns tend to be consistent, leading to the failure of the noise branch in distinguishing manipulations.Another approach, SPAN [19], introduces a spatial pyramidal attention network which enables efficient as well as explicit comparison of patches at multiple scales through local self-attention blocks.While SPAN leverages local area correlations, it does not fully exploit spatial correlation and thus has limited generality.PSCC-Net [20] takes a stepwise method to enhance multiscale feature representation and incorporates the Spatial and Channel Correlation Module (SCCM) to better explore the correlation between space and channel.However, the structure of PSCC-Net becomes complex in order to capture more spatial information.ObjectFormer [22] introduces the Transformer to image tampering detection, but it simply concatenates CNN and Transformer in a sequential manner without effectively leveraging the strengths of both approaches.In this paper, we leverage the modeling capability of Transformers to improve the identification of tampered areas, aiming to benefit from its advantages in capturing global relationships and context.

Method
GPNet is specifically designed to address the challenge of effectively localizing various types of image manipulations.Unlike image-level detection, which focuses on identifying the presence of tampering, pixel-level localization presents a greater difficulty.Consequently, GPNet places a stronger emphasis on solving the localization problem.We explain the architecture of GPNet in greater detail below.The tampered image is fed into GP-Net, and the feature maps are obtained by the CNN branch and the Transformer branch in parallel.Then, feature maps from both branches are fused by the TcFusion module we propose.The final tampered region is obtained through a process of upsampling.The architecture of GPNet is depicted in Figure 2. The green area represents the CNN stream, as discussed in detail in Section 3.1.The CNN stream is responsible for the extraction of relevant feature maps from the input images.The orange area represents the Transformer stream.This stream is explained comprehensively in Section 3.2.The Transformer stream focuses on capturing long-range dependencies and refining the extracted features.The feature maps obtained from both the CNN stream and the Transformer stream, with the same resolution, are then fed into the TcFusion module, depicted in the pink area of Figure 2. The TcFusion module combines the features from both branches to make the final prediction.The specifics of this module are introduced in Section 3.3.By integrating both the CNN and Transformer branches and utilizing the TcFusion module, GPNet aims to leverage the strengths of each approach to enhance the overall performance in pixel-level localization of image manipulations.

CNN Branch
In the CNN branch of GPNet, the input image undergoes a series of convolutional and downsampling operations to generate the feature map.Traditionally, the feature map is progressively downsampled to capture a global receptive field, which often requires a very deep network.However, in GPNet, we aim to leverage the advantages provided by Transformers while addressing the issue of gradient disappearance.Therefore, we choose not to use an excessively deep network.To tackle the gradient disappearance problem further, we introduce two algorithms: residual propagation [26] and residual feedback [7].Algorithms are as Figure 3 shows.Residual propagation operates similarly to the recall mechanisms of our brain, mitigating the degradation problem in deeper networks by recalling input feature information.The definition of residual propagation is presented in Equation ( 1).This equation describes the way in which the residual propagation algorithm modifies the input features.However, without the specific details of Equation (1), we are unable to provide further explanation.
In the component block, x represents the input, y f represents the output of the block and W i represents the parameters of layer i. F(x, W i ) is a learnable function that denotes the feature maps obtained after applying two convolutional layers and ReLU activation.Following that, a learnable linear mapping layer M is employed to adjust the dimension of input x, ensuring it matches the dimension of (x, W i ) so that they can be added together.The residual feedback mechanism aims to integrate the input feature information and amplify the differences in essential image properties between the untampered and tampered regions.Equation ( 2) presents the definition of residual feedback.
where x is input of the block and y b represents the enhanced output of block.In addtion, y f is the output of the previous step of residual Equation (1).W is a linear mapping that changes the dimension of y f , and s denotes the sigmoid activation function.

Transformer Branch
In the Transformer branch of our network, we adopt a typical encoder-decoder structure.In this branch, the input image x ∈ R H×W×3 is initially divided into N = H M × W M patches (M often set to 16).These patches are then flattened and fed into a linear layer with an output dimension of D, resulting in an embedding sequence S ∈ R N×D .To enable long-range modeling, a learnable position embedding of the same dimension as S is added to the embedding sequence.This position embedding helps the model capture spatial relationships between the patches.Subsequently, the embedding sequence S 1 ∈ R (N+1)×D is input into the Transformer encoder, which consists of n layers of Multi-Head Self-Attention (MSA) and Multilayer Perceptron (MLP), which includes n Layers of Multi-Head Self-Attention (MSA) and Multilayer Perceptron (MLP).The Self-Attention (SA) mechanism is the core principle of Transformer, which updates the state of each embedded patch by aggregating information in each layer.Its definition is presented in Equation (3).

SA(s
where [Q, K, V] = sW QKV , W QKV ∈ R D×3D k represents the projection matrix.MSA is a combination of multiple self-attentive mechanisms, and MLP is a series of layers (for details on MSA and MLP, please refer to [27]).In the decoder layers, we employ a progressive upsampling approach.As depicted in Figure 2, we take the feature map obtained from the encoder layers and apply a sequence of upsampling operations to generate feature maps of different sizes.Subsequently, these upscaled feature maps are fused with feature maps of the same size from the CNN branch.The fusion process combines the complementary information captured by both branches, leveraging the strengths of each to enhance the localization of image manipulations.

TcFusion Module
To achieve long-range modeling relationships within feature maps, we introduce a feature fusion module known as the TcFusion module.As shown in Figure 2, the feature map from Transformer branch is calculated by the Spatial Attention Module (SAM) to obtain a spatial attention map M(T) ∈ R H×W and then multiplied with the feature map from the CNN branch.Next, we describe the operation in detail.We use two pooling operations to summarize the information of the feature maps and generate two 2D feature maps: T avg ∈ R 1×H×W and T max ∈ R 1×H×W .They represent the average-pooling feature and the max-pooling feature across the channel.These features are concatenated and passed through a convolutional operation with a 3 × 3 kernel size to generate a 2D spatial attention feature map.In a word, the definition of SAM is presented in Equation (4).
where σ is the sigmoid function, and f represents the convolution operation with a kernel size of 3 × 3. Indeed, traditional CNN-based approaches achieve a global receptive field by repeatedly downsampling the input.While this approach is effective, it often requires very deep networks.The TcFusion module addresses these challenges by enabling the modeling of global relationships without the need for an excessively deep network.

Experimental Settings
Experimental Dataset: We assess the performance of our network using three widely recognized datasets: CASIA [28], Nist Nimble 2016 (NIST16) [29], and Coverage [30].Table 1 provides an overview of the forgery types present in each dataset, along with the number of images utilized in the evaluation of our network.Experimental Metrics: In the experiment, selecting appropriate evaluation metrics is crucial for effectively assessing the model's performance.When it comes to image tampering detection, accurately identifying the manipulation regions at a pixel level is of utmost importance.In this study, we adopt evaluation metrics that have been commonly used in previous related works [22].The F1 score combines recall and precision to provide a comprehensive measure of network performance.Meanwhile, the AUC score represents the area under the Receiver Operating Characteristic (ROC) curve.ROC curves are widely used to evaluate the classification effectiveness of a given classifier.In the context of manipulaiton localization, we classify each pixel in images as either tampered or un-tampered.Therefore, we utilize the AUC score as the experimental metric to assess model performance.
Compared Methods: To evaluate the performance of the proposed method and ensure the reliability of the experiment, we select a set of well-regarded manipulation detection and localization approaches as comparative methods.These approaches include RGB-N (detailed in [16]), SPAN ( [19]), RRU-Net ( [7]), PSCCNet ( [20]), and ObjectFormer ( [22]).
Implementation Details: The implementation of our method is executed on a server equipped with an NVIDIA GeForce RTX 3090 GPU.We utilize PyTorch 1.8.2 to implement both RRU-Net, PSCC-Net, and GP-Net.

Compared Detection Methods
To evaluate the effectiveness of our network, we conduct a comparative analysis with some excellent methods: SPAN [21], RGB-N [18], RRU-Net [7], PSCCNet [19] and ObjectFormer [22].To ensure fairness in the experiments, we fine-tune the parameters of the compared methods to their optimal values.Table 2 presents quantitative results (%) obtained by each method.From the table, it is evident that the network we propose outperforms other methods on the CASIA dataset and the NIST dataset.However, on the Coverage dataset, the F1-score and AUC are slightly lower compared to ObjectFormer.We speculate that this discrepancy may be due to the limited number of images available on the Coverage dataset.It does not allow the network optimal convergence.While our network does not require a large number of images, having fewer than 100 training images may hinder the network's convergence.We provide further evidence for this in Section 4.6.Overall, our network is superior to other related methods in terms of performance.
Moreover, we present visual forecasts of the different approaches in Figure 4. Unfortunately, we are unable to provide predictions for ObjectFormer [22] as the code for that method was not made available to us.Each column from left to right represents a distinct aspect, where the initial column showcases the forged images, the second column illustrates the visualization outcomes of the RRU-Net method's predictions, the third column displays the prediction results of the PSCC-Net method, the fourth column represents the prediction results of our method, and the last column depicts the ground truth image.Based on the visualized results, it is apparent that the RRU-Net method exhibits commendable performance in identifying general tampered regions and providing rough localization, although it still exhibits some instances of false detections and missed areas.Moreover, the method struggles to perform well when confronted with small tampered regions.In comparison, the PSCC-Net method demonstrates superior performance with fewer instances of false and missed detections.However, even in its predicted visualization results, the method does not exhibit remarkable performance in identifying minute tampered areas.Subjectively speaking, our model's predicted visualization results demonstrate its superiority and stability amongst the various methods.Not only can it accurately locate manipulation areas, but it also generates clearer boundaries.

Image Manipulation Detection
As recent works only consider pixel-level tampering detection and not image-level detection, there are no standard datasets for benchmarking.
Since the CASIA dataset and the RTD (Realistic Tampering Dataset) [31] contain both forgery images and its corresponding original image, we select 500 images from the CASIA dataset, 300 images from the NIST dataset and 220 images from the RTD with 50% forged and 50% original as a test dataset.From Table 3, it cn be seen that our model achieves good performance, especially on the nist dataset: our model achieves an F1 score of 84.8% and an AUC of 92.2%, which shows that our method is effective in capturing the manipulation artifacts.

Robustness Evaluation
In addition to the previous comparisons, we conduct further evaluations to assess the robustness of our method.To achieve this, we subject the images from the NIST dataset to various image attack methods.These methods consist of resizing, JPEG compression with the quality factor η and Gaussian blur with the kernel size κ.The parameters used and the performance of manipulation localization, measured by the F1-score and AUC, are presented in Table 4.Our model performs well under a variety of distortions.Especially on compressed images, the F1-score is only 0.4% lower than without the distortion when the quality factor is 100 and 1.0% lower than without the distortion when the quality factor is 80.It is evident that our network demonstrates robustness against different image attack methods.

Ablation Study
We employ the CNN branch and the Transformer branch, both branches running in parallel.Then, we fuse feature maps from both branches by the TcFusion module.To assess the usefulness of the branch-parallel approach to designed, we conduct experiments where each branch is removed separately from GPNet.Additionally, we replace the TcFusion module with a simple summation of feature maps to highlight the impact of the TcFusion module.We evaluate the manipulation localization performance on NIST and CASIA datasets.
The quantitative results are presented in Table 5.It is evident that when using only the CNN branch, both the F1 score and AUC decreased by 13.2% and 8.7%, respectively, on the NIST dataset and the CASIA dataset.Conversely, when employing only the Transformer branch, the AUC scores decreased by 5.3% on CASIA and 4.3% on NIST.Furthermore, without the TcFusion block, the F1-score and AUC experienced a decline of 6.8% and 7.1%, respectively.These deteriorations in the quantitative results demonstrate that the adoption of the parallel-in-branch design and the TcFusion module effectively contributes to the performance improvement of our method.GPNet indeed encounters a challenge when dealing with small datasets.To investigate this issue, we conduct a series of experiments on the CASIA and NIST16 datasets, and the results are displayed in Figure 5.We observe an overall upward trend in forgery localization performance as the number of training images increases.The AUC metric shows significant improvement, reaching its peak when training images reach 300.These findings indicate that the performance of GPNet benefits from a larger number of training samples.

Conclusions
GPNet is introduced as an innovative method for detecting and localizing image manipulation.It combines a CNN branch for feature extraction and a Transformer branch for modeling global relationships.By leveraging these two branches, GPNet achieves precise manipulation localization without the need for a deep network.Extensive experiments on diverse datasets demonstrate that GPNet outperforms existing SOAT approaches.Our approach offers a fresh perspective on harnessing the synergies between CNN and Transformer architectures.Future work will focus on further enhancing the structure of the Transformer-based branch.

Figure 1 .
Figure 1.Image tampering detection and localization often requires capturing traces left on objects.Hence, it is crucial to leverage object-level consistency for effective forgery detection and localization.

Figure 2 .
Figure 2. Overview of the GP-Net architecture for manipulating detection and localization tasks.

Figure 3 .
Figure 3.The structure of the residual propagation and residual feedback.

Figure 4 .
Figure 4. Visual comparisons of original models.From left to right, we present forged images, the predictions of RRUnet, PSCCNet and ours, GT masks.

Figure 5 .
Figure 5. AUC score (%) of our method with different numbers of training images.

Table 1 .
Summary of test datasets for our network (number denotes the number of images used for training and testing, and represent whether or not the manipulation type is involved).

Table 2 .
The F1/AUC results (%) of our method and the other methods on various datasets.

Table 3 .
F1 scores and AUC (%) of manipulation detection results on various datas.