Article

Symmetry-Guided Dual-Branch Network with Adaptive Feature Fusion and Edge-Aware Attention for Image Tampering Localization

School of Cyberspace Security, Gansu University of Political Science and Law, Lanzhou 730070, China
* Author to whom correspondence should be addressed.
Symmetry 2025, 17(7), 1150; https://doi.org/10.3390/sym17071150
Submission received: 16 June 2025 / Revised: 10 July 2025 / Accepted: 16 July 2025 / Published: 18 July 2025

Abstract

When faced with diverse types of image tampering and image quality degradation in real-world scenarios, traditional image tampering localization methods often struggle to balance boundary accuracy and robustness. To address these issues, this paper proposes a symmetry-guided dual-branch image tampering localization network, FENet (Fusion-Enhanced Network), that integrates adaptive feature fusion and edge-aware attention mechanisms. The method is built on a structurally symmetric dual-branch architecture that extracts RGB semantic features and SRM noise residual information to comprehensively capture the fine-grained differences of tampered regions at both the visual and statistical levels. To fuse these heterogeneous features effectively, we design a self-calibrating fusion (SCF) module, which introduces a content-aware dynamic weighting mechanism to adaptively adjust the importance of the different feature branches, thereby enhancing the discriminative power and expressiveness of the fused features. Furthermore, considering that image tampering often introduces abnormal changes in edge structures, we propose an edge-aware coordinate attention mechanism (ECAM). By jointly modeling spatial position information and edge-guided information, it guides the model to focus more precisely on potential tampering boundaries, thereby enhancing boundary detection and localization. Experiments on the public Columbia, CASIA, and NIST16 datasets demonstrate that FENet achieves significantly better results than existing methods. We also analyze the model's performance under various image quality conditions, such as JPEG compression and Gaussian blur, demonstrating its robustness in real-world scenarios. Experiments in Facebook, Weibo, and WeChat scenarios show that our method achieves average F1 scores that are 2.8%, 3%, and 5.6% higher than those of existing state-of-the-art methods, respectively.

1. Introduction

The rapid development of digital image tampering technology has significantly lowered the threshold for generating highly realistic forged content. These carefully designed tampered images may spread rapidly on social media, leading to serious social, political, and ethical issues [1]. Therefore, developing effective image tampering localization technology has become particularly important, as it not only helps maintain the integrity of digital media but also plays a significant role in combating the spread of misinformation [2]. Image tampering detection aims to identify and locate tampered regions within an image precisely. Unlike simple binary classification tasks, it provides more granular tampering information [3].
However, image tampering localization faces numerous challenges: first, modern image editing tools can achieve high-quality tampering without leaving obvious visual traces; second, the same image may contain multiple types of tampering operations, such as copy-move, splicing, and inpainting; finally, post-processing operations may further obscure tampering traces [4]. Traditional image tampering localization methods primarily rely on handcrafted features (such as noise consistency [5] and JPEG compression traces [6]). Although effective in limited scenarios, their limitations manifest in two aspects: first, insufficient generalization ability, as they rely on prior statistical models and struggle to detect unknown tampering types; second, limited feature expression capability, as they are sensitive to post-processing operations such as double JPEG compression and resampling [7].
In recent years, end-to-end image tampering detection methods based on deep learning have gradually become the mainstream of research. Wu et al. proposed ManTra-Net [8], which combines an encoder–decoder structure with an attention mechanism to achieve joint modeling of global and local features. Liu et al. designed PSCC-Net [9], introducing a spatial channel correlation module to capture the inherent inconsistencies in tampered regions. While these methods have improved the accuracy of tampering detection to some extent, they overly rely on RGB semantic features, leading to a lack of diversity in feature space representation. Additionally, most methods do not explicitly model the frequency domain anomalies of tampering boundaries, resulting in blurry and discontinuous prediction mask boundaries.
Following its success in computer vision, the Transformer architecture has demonstrated significant potential in modeling long-range dependencies, offering new insights into image tampering localization. Hu et al. proposed the SPAN model [10], which utilizes a spatial pyramid attention mechanism to enhance feature discriminative capabilities; Zhu et al. developed the ProFact Transformer network [11] based on a progressive feedback mechanism, enabling a stepwise tampering localization process from coarse to fine.
Although these methods have made some progress, they still face two key technical challenges: first, feature extraction in the tampering boundary region remains inadequate, limiting the model’s fine localization capability; second, existing models lack robustness for different types of tampering, especially visually imperceptible minor tampering. It is worth noting that the edges of tampered regions typically contain rich abnormal information. For example, in splicing tampering, the edges may exhibit phenomena such as blurred transitions, sharpness changes, or color discontinuities. Therefore, effectively mining and utilizing the structural and frequency domain clues of the edge regions is of great significance for improving the accuracy and robustness of tampering localization.
To address the above issues, this paper proposes a symmetry-guided image tampering localization network, FENet, that integrates dual-branch features and edge-aware mechanisms to achieve high-precision, high-robustness tampering region perception. The main contributions are as follows:
  • Symmetrical Dual-Branch Architecture: We designed a dual-branch network featuring a symmetrical structure and complementary functions. The RGB branch extracts semantic features and content-level inconsistencies, while the SRM noise branch focuses on statistical anomalies and tampering artifacts. This architecture enables the model to analyze tampered regions in both the visual and noise domains, significantly enhancing its detection accuracy and robustness against various types of tampering.
  • ECAM: An attention module is proposed that combines coordinate attention with edge-awareness. Position-sensitive features are captured through independent pooling operations along horizontal and vertical directions. Additionally, the Sobel operator is employed to extract edge information, effectively enhancing the model’s perception of tampered boundaries and significantly improving feature representation and localization accuracy along edges. This design effectively addresses common issues of blurred and discontinuous boundaries in existing methods.
  • SCF: A content-aware dynamic weighting strategy is designed to adaptively fuse RGB semantic features with SRM noise residual features. This approach effectively suppresses redundant information and enhances the model’s robustness and discriminative power in complex tampering scenarios.
  • Channel Attention Enhancement Module: A channel attention mechanism based on global context modeling is introduced, integrating dilated convolution and global average pooling to jointly capture features from both the RGB and SRM branches. This mechanism emphasizes informative channels while suppressing irrelevant ones, thereby enhancing the localization accuracy and discriminative capability of features in tampered regions.
The rest of this paper is organized as follows: Section 2 describes the related work in detail. Section 3 elaborates on the experimental methods. Section 4 presents the experiments and analysis. Section 5 summarizes this study and discusses future research directions.

2. Related Work

Image tampering localization is a critical task in digital image forensics, aiming to identify areas of an image that have been maliciously tampered with and restore their spatial location and boundary range as accurately as possible. With the development of deep learning technology, research in this field has gradually transitioned from traditional methods based on artificial features to end-to-end deep neural network modeling systems, achieving significant progress in feature representation capabilities and detection accuracy. However, existing methods still face challenges in terms of insufficient boundary modeling capabilities, low efficiency in integrating multiple feature information, and weak robustness in complex real-world scenarios.

2.1. Traditional Methods and Feature-Driven Strategies

Early image tampering localization methods primarily relied on manually designed image statistical features or low-level artifact clues. For example, JPEG compression artifacts, edge discontinuities, texture differences, color inconsistencies, and abnormal camera parameters are widely used to locate tampered regions. Fridrich et al. proposed using JPEG compression noise residual detection for image splicing [12]; Popescu and Farid [13] introduced principal component analysis filtering and bilinear interpolation models for detecting interpolation tampering; Mahdian et al. [14] employed edge inconsistency and natural image noise analysis for tampering localization.
Additionally, traditional feature fusion strategies heavily rely on image segmentation operations and manually set thresholds, lacking the ability to model high-level semantic and spatial context, thus yielding limited effectiveness when dealing with complex tampering types and post-processing interference.

2.2. Image Tampering Localization Method Based on Deep Learning

In recent years, deep neural networks have been widely applied in the field of image forensics. Wu et al. proposed ManTra-Net [8], one of the first models to introduce end-to-end training into image tampering localization, utilizing an encoder–decoder architecture combined with an anomaly detection mechanism to achieve coarse-grained detection of tampered regions; Liu et al. proposed PSCC-Net [9], constructing a spatial and channel cross-correlation network, effectively modeling the internal structural inconsistencies of tampered regions and enhancing the model’s ability to detect non-semantic anomalies.
To enhance the model’s ability to distinguish hidden tampering, Bappy et al. [15] proposed using the SRM (Spatial Rich Model) noise residual information of the image as one of the network input channels, extracting high-frequency disturbance signals from the frequency domain to improve sensitivity to small tampering regions. This approach has been widely adopted and has sparked research into multi-modal input and fusion mechanisms. However, most methods still have the following issues:
  • Over-reliance on RGB semantic features leads to weakened localization capabilities in texture-consistent, semantically coherent tampering scenarios.
  • Lack of boundary perception modeling mechanisms makes it difficult to accurately locate mask edges.
  • Insufficient robustness to post-processing operations such as JPEG compression and Gaussian blurring in real complex images.

2.3. Transformer Architecture and Global Modeling Methods

With the introduction of Vision Transformer and its improved structures, global modeling capabilities have begun to play an important role in image tampering detection. Hu et al. proposed SPAN [10], which utilizes a spatial pyramid attention module to extract multi-scale contextual features, thereby enhancing global semantic modeling capabilities in image multi-scale fusion. Zhu et al. designed ProFact [11] using a progressive feedback mechanism, guiding the Transformer network to progressively refine the boundaries of tampered regions layer by layer, significantly enhancing the accuracy of boundary localization.
Chen et al. [16] further proposed the ViT-VAE network, which combines a Vision Transformer with a Variational Autoencoder and introduces an unsupervised anomaly detection mechanism to locate forged areas without tampering masks, effectively improving the model’s adaptability in blind detection scenarios; Ma et al. proposed IML-ViT [2], which combines a ViT encoder with cross-layer feature fusion strategies to improve robustness under various tampering types and image disturbances.
Additionally, variant structures such as the Swin Transformer and Pyramid Vision Transformer have been widely adopted in tampering detection tasks, enhancing the network’s ability to model local structures and multi-scale edges. However, the Transformer architecture also has two significant limitations:
  • Lack of a specialized modeling mechanism for image edge structures, resulting in weak response to high-frequency boundary disturbances.
  • Highly sensitive to training data scale and computing resources, making it difficult to train adequately on small to medium-sized annotated datasets and prone to overfitting.
To this end, some studies have proposed combining the local feature extraction capabilities of convolutional neural networks with the global modeling advantages of Transformers to construct hybrid structures that achieve a balance in performance. For example, Guo et al. [1] introduced an enhanced Transformer module and combined it with a channel-space collaborative attention mechanism to improve the network’s ability to perceive the boundaries of tampered regions.

2.4. Boundary Modeling and Robustness Study

The boundaries of image tampering regions typically contain rich artifacts, such as sharpness discontinuities, texture breaks, or abnormal color gradients. Therefore, the effective extraction and enhancement of boundary features are crucial for achieving precise localization. Zheng et al. [17] proposed DSSE-net, which effectively improves the accuracy of boundary detection and overall performance in image tampering localization by designing a dual-stream jumping edge enhancement network structure and introducing a tampering loss function, providing a new approach for refined tampering region segmentation.
Additionally, to address the common issue of image degradation in real-world scenarios, robustness has become a key metric in the design of image forensics models. Bappy et al. [15] combined residual signals with deep structure enhancement to detect weak texture responses in tampered regions, enabling the detection of invisible tampering. Guo et al.’s enhanced Transformer [1] combines dilated convolutions with attention mechanisms to achieve stable performance across multiple compression levels.
In summary, although current image tampering localization research has made significant progress in network structure design, feature modeling, and performance metrics, it still faces the following core challenges:
  • Lack of modeling mechanisms specifically targeting tampered boundary areas, resulting in blurred and discontinuous edge localization.
  • Insufficient robustness against different types of tampering in complex degraded scenarios, limiting the model’s application in real-world scenarios.
To address the above issues, this paper proposes a symmetry-guided dual-branch network structure that integrates semantic and noise information, combined with a self-calibrating fusion strategy and an edge-aware attention mechanism, to achieve accurate perception and boundary modeling of tampered areas, thereby improving the model’s localization capability and stability under various types of tampering and complex image quality conditions.

3. Experimental Methods

3.1. Model Overview

The FENet architecture, as shown in Figure 1, features a dual-branch parallel processing structure. The overall architecture consists of four key components: the RGB branch, the SRM noise branch, the SCF module, and the ECAM module.
The core idea of FENet is to leverage the complementary nature of RGB information and SRM noise residual information. The RGB branch is based on the SegFormer-B2 pre-trained Transformer backbone network [18], which extracts high-level semantic features, focusing on capturing the global content inconsistencies between the tampered region and the original background. The SRM noise branch uses a fixed-parameter SRM high-pass filter [19] to extract noise features, which are sensitive to local statistical anomalies introduced by tampering operations. The features from both branches are dynamically fused through the SCF to suppress redundant information. Finally, the ECAM enhances boundary localization through multi-stage processing, ultimately outputting a boundary-optimized tampering mask.
The network takes an RGB image of size H × W × 3 as input and outputs a binary mask with the same spatial resolution, where regions with pixel values of 1 indicate tampered parts, and regions with pixel values of 0 indicate untampered parts.
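To make the overall data flow concrete, the following PyTorch-style sketch shows one way the components described above could be composed end to end. The module interfaces (`rgb_backbone`, `srm_filter`, `srm_backbone`, `scf_blocks`, `ecam`, `head`) are assumptions introduced purely for illustration; this is a sketch, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FENetSketch(nn.Module):
    """Structural sketch of the dual-branch pipeline (illustration only)."""

    def __init__(self, rgb_backbone, srm_filter, srm_backbone, scf_blocks, ecam, head):
        super().__init__()
        self.rgb_backbone = rgb_backbone              # e.g., SegFormer-B2 encoder -> 4 feature maps
        self.srm_filter = srm_filter                  # fixed-weight SRM high-pass filtering
        self.srm_backbone = srm_backbone              # same hierarchical structure, unshared weights
        self.scf_blocks = nn.ModuleList(scf_blocks)   # one SCF module per feature level
        self.ecam = ecam                              # edge-aware coordinate attention
        self.head = head                              # 1-channel prediction head

    def forward(self, x):                             # x: (B, 3, H, W)
        rgb_feats = self.rgb_backbone(x)                    # list of 4 semantic feature maps
        srm_feats = self.srm_backbone(self.srm_filter(x))   # list of 4 noise feature maps
        fused = [scf(g, f) for scf, g, f in zip(self.scf_blocks, rgb_feats, srm_feats)]
        z = self.ecam(fused)                                # boundary-enhanced features
        logits = self.head(z)
        logits = F.interpolate(logits, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return torch.sigmoid(logits)                        # per-pixel tampering probability
```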

3.2. Symmetry-Guided Dual-Branch Feature Extraction

To improve the efficiency and generalization ability of feature extraction, this study uses a pre-trained Transformer backbone network based on SegFormer-B2 and performs targeted fine-tuning on the tampering localization task. Through a dual-branch non-shared weight architecture, semantic features and noise features are effectively decoupled, balancing global context modeling and local statistical anomaly analysis, thereby significantly enhancing the robustness of the model.
The RGB branch adopts hierarchical Mix Transformer (MiT) blocks as its core architecture. We choose a Transformer over a traditional convolutional neural network because of its powerful global modeling capability: it can capture long-range dependencies between pixels, which is important for identifying complex tampered regions. In particular, detecting copy-move tampering requires comparing similar regions at different positions in the image, a task at which the Transformer’s self-attention mechanism excels.
We use the pre-trained SegFormer-B2 model, which contains four levels of MiT blocks, each of which contains multiple Transformer layers. Unlike the standard Transformer, the MiT block uses a hierarchical design that gradually reduces the spatial resolution of the feature map while increasing the number of channels. This design maintains the global modeling capabilities of the Transformer while improving computational efficiency. Specifically, the feature map sizes and channel numbers output by the four MiT blocks are as follows:
$\frac{H}{4} \times \frac{W}{4} \times 64, \quad \frac{H}{8} \times \frac{W}{8} \times 128, \quad \frac{H}{16} \times \frac{W}{16} \times 320, \quad \frac{H}{32} \times \frac{W}{32} \times 512$
Given an input RGB image, the workflow of the RGB branch is as follows:
$F_1^{\mathrm{RGB}} = \mathrm{MiT}_1(X), \quad F_2^{\mathrm{RGB}} = \mathrm{MiT}_2(F_1^{\mathrm{RGB}}), \quad F_3^{\mathrm{RGB}} = \mathrm{MiT}_3(F_2^{\mathrm{RGB}}), \quad F_4^{\mathrm{RGB}} = \mathrm{MiT}_4(F_3^{\mathrm{RGB}})$
where $X$ is the input RGB image and $\mathrm{MiT}_i$ denotes the $i$-th MiT block used in SegFormer.
The SRM noise branch first extracts the noise feature map of the image through the SRM high-pass filter and then uses the same Transformer structure as the RGB branch for feature extraction. The SRM filter [19] originates from the traditional field of image forensics and was originally used to enhance and analyze camera images. It can effectively capture statistical anomalies in image noise, which are often important clues for tampering operations.
We employ three fixed-weight $3 \times 3$ high-pass filters $K_1$, $K_2$, and $K_3$, which are sensitive to image details of different directions and frequencies. Specifically, $K_1$ is sensitive to diagonal directions, $K_2$ is sensitive to horizontal directions, and $K_3$ is a Laplacian filter sensitive to local changes. When processing color images, we apply these filters to each color channel separately and then concatenate the results along the channel dimension:
$X_c' = \operatorname{Concat}_{j=1}^{3}\left(X_c \ast K_j\right), \quad j \in \{1, 2, 3\}, \; c \in \{R, G, B\}$
Here, $X_c$ denotes the $c$-th color channel of the input image $X$, $X_c'$ is the filtering result of the corresponding channel, and $X'$ is the final noise residual feature with $3 \times 3 = 9$ channels. The filtered features $X'$ are fed into the same Transformer structure as the RGB branch to extract multi-scale noise features that capture tampering traces in the image.
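As an illustration, a fixed-weight SRM-style filtering layer of this kind can be written as a grouped convolution in PyTorch. The sketch below is an example under assumptions rather than the authors' code, and the concrete kernel values are placeholders chosen only to match the stated directional sensitivities of $K_1$, $K_2$, and $K_3$.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SRMFilter(nn.Module):
    """Fixed-weight high-pass filtering of each colour channel (illustrative kernels)."""

    def __init__(self):
        super().__init__()
        k1 = [[-1., 0., 0.], [0., 1., 0.], [0., 0., 0.]]     # diagonal residual (placeholder)
        k2 = [[0., 0., 0.], [-1., 1., 0.], [0., 0., 0.]]     # horizontal residual (placeholder)
        k3 = [[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]]  # Laplacian
        kernels = torch.tensor([k1, k2, k3])                 # (3, 3, 3)
        # One copy of the filter bank per colour channel -> 3 x 3 = 9 output channels.
        weight = kernels.repeat(3, 1, 1).unsqueeze(1)        # (9, 1, 3, 3)
        self.register_buffer("weight", weight)               # fixed, not trainable

    def forward(self, x):                                    # x: (B, 3, H, W)
        # groups=3 applies the three kernels to R, G, and B separately, and the
        # per-channel results are stacked along the channel dimension.
        return F.conv2d(x, self.weight, padding=1, groups=3)  # (B, 9, H, W)
```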
The features extracted from the dual-branch structure pass through multiple layers and contain rich semantic information. As noted in [20], high-frequency detail information is gradually lost during layer-by-layer transmission, so models perform poorly on tasks that require fine-grained features. In view of this, we adaptively weight the RGB features by combining dilated convolution with global average pooling. For the feature map $G_3 \in \mathbb{R}^{H_3 \times W_3 \times C_3}$ produced by Transformer block 3 of the RGB branch, we add a $3 \times 3$ dilated convolution [21] to effectively expand the receptive field, enabling the model to capture a wider range of contextual information [22]. A spatial attention mechanism is used to generate the map $M_s(G_3)$:
$M_s(G_3) = \mathrm{BN}\left(\mathrm{Conv}_{1 \times 1} \circ \mathrm{ConvD}_{3 \times 3}^{d=2} \circ \mathrm{ConvD}_{3 \times 3}^{d=2} \circ \mathrm{Conv}_{1 \times 1}(G_3)\right)$
where $\mathrm{ConvD}_{3 \times 3}^{d=2}$ denotes a $3 \times 3$ dilated convolution with dilation rate 2, BN denotes batch normalization, and $\circ$ is the function composition operator.
The channel attention map $M_c(G_3)$ is generated by global average pooling (GAP) and a multilayer perceptron (MLP):
$M_c(G_3) = \mathrm{BN}\left(\mathrm{MLP}\left(\mathrm{GAP}(G_3)\right)\right)$
A gating mechanism is applied to fuse the spatial and channel attention and generate the final integrated attention map $M(G_3)$:
$M(G_3) = \sigma\left(M_s(G_3) + M_c(G_3)\right)$
where σ denotes the Sigmoid activation function.
Ultimately, the enhanced feature map $G_3'$ is obtained by adding the original feature map to the feature map modulated by the attention map:
$G_3' = G_3 + G_3 \otimes M(G_3)$
where $\otimes$ denotes element-wise multiplication. The same process is also applied to the feature map $F_3$ of the SRM noise branch to generate the enhanced feature map $F_3'$.
The design of this branch effectively improves the accuracy of feature representation and enables the model to better capture key feature information.
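A minimal PyTorch sketch of this enhancement step is given below; it composes the spatial attention map $M_s$, the channel attention map $M_c$, the gated map $M$, and the residual enhancement. The reduction ratio `r` and the use of $1 \times 1$ convolutions to realize the MLP are assumptions, not reported settings.

```python
import torch
import torch.nn as nn

class FeatureEnhancement(nn.Module):
    """Dilated spatial attention + channel attention, fused by a sigmoid gate (sketch)."""

    def __init__(self, channels, r=4):
        super().__init__()
        mid = channels // r
        self.spatial = nn.Sequential(                 # M_s: 1x1 -> two dilated 3x3 -> 1x1 -> BN
            nn.Conv2d(channels, mid, 1),
            nn.Conv2d(mid, mid, 3, padding=2, dilation=2),
            nn.Conv2d(mid, mid, 3, padding=2, dilation=2),
            nn.Conv2d(mid, 1, 1),
            nn.BatchNorm2d(1),
        )
        self.channel = nn.Sequential(                 # M_c: GAP -> MLP (as 1x1 convs) -> BN
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, mid, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1),
            nn.BatchNorm2d(channels),
        )

    def forward(self, g):                             # g: (B, C, H, W), e.g. G_3
        m = torch.sigmoid(self.spatial(g) + self.channel(g))  # M = sigmoid(M_s + M_c), broadcast
        return g + g * m                                       # enhanced map G_3'
```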
In this study, the SRM filters are designed with fixed weights rather than learnable parameters. This approach ensures stable extraction of predefined statistical features and prevents the filters from degenerating into conventional edge detectors during training, which would reduce their sensitivity to subtle statistical anomalies. Furthermore, it establishes a dataset-independent feature extraction mechanism that significantly enhances the model’s generalization ability to unseen tampering types. The fixed-weight design also effectively reduces the number of trainable parameters by approximately 9%, thereby decreasing model complexity and mitigating overfitting risks. Implemented via depthwise separable convolutions with dedicated filters for each channel, this scheme achieves significant computational efficiency improvements while maintaining robust feature extraction capability.

3.3. Self-Calibrating Fusion Module

The SCF module dynamically adjusts the importance of the features from the two branches, avoiding the information redundancy caused by simple concatenation. Traditional feature fusion methods, such as channel concatenation or element-wise addition, cannot adaptively adjust the importance of different features according to the input content and tend to introduce noise and redundant information. The SCF module addresses this problem by learning adaptive weights that dynamically fuse the features from the different branches.
As shown in Figure 2, the SCF module contains a feature transformation layer, an initial fusion layer, and a calibration layer. The inputs to the module are the same-level features $F_i^{\mathrm{RGB}}$ and $F_i^{\mathrm{SRM}}$ from the RGB branch and the SRM noise branch, and the output is the fused feature $F_i^{\mathrm{SCF}}$.
In the feature transformation stage, the multiscale features $G_i$ and $F_i$ from the RGB branch and the SRM noise branch are first passed through parallel feature transformation layers to unify their dimensionality and perform initial feature enhancement:
$G_i^{\mathrm{trans}} = \delta\left(\mathrm{BN}\left(\mathrm{Conv}_{1 \times 1}(G_i)\right)\right), \quad F_i^{\mathrm{trans}} = \delta\left(\mathrm{BN}\left(\mathrm{Conv}_{1 \times 1}(F_i)\right)\right)$
where $\delta$ denotes the ReLU activation function and BN denotes batch normalization. In this way, information from different feature spaces can subsequently be processed in a unified representation space.
The transformed features are initially fused through element-wise addition, and Dropout2D is introduced to suppress overfitting [23], resulting in the initial fused feature $Z$. The Dropout2D probability is set to 0.2, which in our experiments provides a good balance between suppressing overfitting and preserving useful features.
$Z = \mathrm{Dropout2D}\left(G_i^{\mathrm{trans}} + F_i^{\mathrm{trans}}\right)$
We choose addition rather than concatenation as the initial fusion operation for two reasons: first, addition keeps the feature dimensions unchanged and is more computationally efficient; second, additive fusion can be regarded as the average of the two features, which provides a natural initial state for the subsequent calibration operation. This not only enhances the robustness of the model but also provides a base fusion representation for the subsequent calibration mechanism. The introduction of Dropout2D effectively suppresses the risk of overfitting while encouraging diversity in the feature representations.
The calibration layer generates calibration weights after the initial fusion, which are used to dynamically adjust the importance of the two branch features. The calibration weights are generated as follows:
The first is global information aggregation: global context information is extracted using global average pooling (GAP).
$W = \sigma\left(\mathrm{Conv}_{\mathrm{expand}}\left(\delta\left(\mathrm{Conv}_{\mathrm{reduce}}\left(\mathrm{GAP}(Z)\right)\right)\right)\right), \quad W = [W_1, W_2]$
where GAP denotes the global average pooling operation, $\mathrm{Conv}_{\mathrm{reduce}}$ and $\mathrm{Conv}_{\mathrm{expand}}$ denote $1 \times 1$ convolutions for dimensionality reduction and expansion, respectively, and $\sigma$ denotes the Sigmoid activation function. GAP compresses the spatial dimensions by retaining a single global response value for each channel, effectively capturing the statistical characteristics of the entire feature map. The dimensionality reduction convolution decreases the number of channels, typically to $1/r$ of the original with $r = 4$, which not only reduces computational complexity but also encourages the network to learn more compact and informative feature representations. The weights $W$ are decomposed into weight vectors $W = [W_1, W_2]$ for the two branches, which regulate the contributions of the different feature branches.
Adaptive feature integration is performed next, where the generated weights are applied to each of the two feature branches, and the final fused features are obtained by weighted summation:
$F_{\mathrm{output}} = G_i^{\mathrm{trans}} \odot W_1 + F_i^{\mathrm{trans}} \odot W_2$
where $\odot$ denotes channel-wise multiplication. This adaptive weighting mechanism enables the model to dynamically adjust the importance of different features according to the characteristics of the input samples, which effectively enhances the ability to localize tampered regions, especially in scenarios with complex backgrounds and subtle tampering.
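A minimal PyTorch sketch of the SCF module under the assumptions stated above ($1 \times 1$ transformation with BN and ReLU, additive fusion with Dropout2D of 0.2, and GAP-based calibration with reduction ratio $r = 4$) could look as follows; it is illustrative rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class SCF(nn.Module):
    """Self-calibrating fusion of RGB and SRM features (sketch)."""

    def __init__(self, in_rgb, in_srm, channels, r=4, p_drop=0.2):
        super().__init__()
        self.t_rgb = nn.Sequential(nn.Conv2d(in_rgb, channels, 1),
                                   nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.t_srm = nn.Sequential(nn.Conv2d(in_srm, channels, 1),
                                   nn.BatchNorm2d(channels), nn.ReLU(inplace=True))
        self.drop = nn.Dropout2d(p_drop)
        self.calib = nn.Sequential(                # W = sigmoid(expand(ReLU(reduce(GAP(Z)))))
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // r, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // r, 2 * channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, g, f):                       # g: RGB features, f: SRM noise features
        g, f = self.t_rgb(g), self.t_srm(f)        # G_i^trans, F_i^trans
        z = self.drop(g + f)                       # initial fusion Z
        w1, w2 = torch.chunk(self.calib(z), 2, dim=1)  # per-branch channel weights W_1, W_2
        return g * w1 + f * w2                     # F_output
```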
The SCF module offers three significant advantages over traditional feature fusion methods, such as simple concatenation or element-wise summation:
  • Achieving dynamic adjustment of the importance of features and effectively reducing the redundancy of information.
  • Adaptive fusion based on content, significantly improving the discriminative performance of the model.
  • Optimizing the gradient flow process for efficient learning in multi-branch networks.
These improvements enable the SCF module to learn the relative importance between different feature branches and different channels based on the global contextual information of the features, thus realizing content-aware adaptive feature fusion. This design not only reduces computational complexity while maintaining strong expressive power but also enables the model to dynamically adjust the fusion strategy according to the characteristics of different input samples.

3.4. Edge-Aware Coordinate Attention Module

The ECAM integrates coordinate attention mechanisms with edge-aware characteristics to enhance the model’s ability to represent features of tampered boundary regions. This module takes as input a set of multi-scale features $Z = \{Z_1, Z_2, Z_3, Z_4\}$ and combines adaptive fusion with edge enhancement strategies to achieve precise localization of tampered regions [24]. Compared with traditional attention mechanisms, the advantages of ECAM lie mainly in the following three aspects:
  • Explicit modeling of spatial positional information rather than focusing solely on inter-channel relationships.
  • Introduction of edge structures as prior knowledge to guide the network in strengthening tampered boundary features.
  • Adaptive adjustment of feature importance, enabling dynamic focus on key regions based on the input content.
As shown in Figure 3, the ECAM module contains an input feature concatenation layer, a coordinate attention sub-module, an edge-aware sub-module, and a final feature fusion step. This multi-layer design significantly enhances the model’s ability to perceive the boundaries of tampered regions and improves the accuracy of localizing the tampered region.
Feature fusion: the ECAM module first concatenates the input multiscale features to generate the fused feature $Z'$, i.e.,
$Z' = \mathrm{Concat}(Z_1, Z_2, Z_3, Z_4)$
Coordinate Attention Submodule: The fused feature $Z'$ is first average-pooled along the horizontal and vertical directions, respectively [25]:
$X_v = P_v(Z') = \frac{1}{H} \sum_{i=0}^{H-1} Z'(:, i, :), \quad X_h = P_h(Z') = \frac{1}{W} \sum_{i=0}^{W-1} Z'(:, :, i)$
where $X_h$ denotes horizontal mean pooling, $X_v$ denotes vertical mean pooling, $W$ is the feature map width, and $H$ is the feature map height.
This is followed by information encoding: the pooled features are concatenated to obtain $\mathrm{Concat}(X_h, X_v)$, which is then linearly transformed by a $1 \times 1$ convolutional layer, followed by batch normalization to speed up training and stabilize the model, and finally a Sigmoid activation function $\sigma$:
$Y = \sigma\left(\mathrm{BN}\left(\mathrm{Conv}_{1 \times 1}\left(\mathrm{Concat}(X_h, X_v)\right)\right)\right)$
Finally, attention weights are generated and applied: the encoded features are separated into horizontal and vertical components $Y_h$ and $Y_v$, each of which is processed by an independent $1 \times 1$ convolutional layer, and the attention weights are generated by the Sigmoid activation function:
$M_h = \sigma\left(\mathrm{Conv}_{1 \times 1}(Y_h)\right), \quad M_v = \sigma\left(\mathrm{Conv}_{1 \times 1}(Y_v)\right)$
These attention weights are then applied to the fused feature $Z'$ by element-wise multiplication to obtain:
$Z'' = Z' \otimes M_h \otimes M_v$
This step improves the model’s ability to perceive spatial location information by weighting the original features with learned attention weights, enhancing important features, and suppressing irrelevant ones.
The edge-aware sub-module is one of the important components of the ECAM module. First, the features processed by the coordinate attention mechanism are divided into two branches: the edge branch and the main branch. The edge branch is processed through a $3 \times 3$ convolutional layer, a batch normalization layer, and a ReLU activation function to extract edge features $Z_{\mathrm{edge}}$. The main branch passes through a $1 \times 1$ convolutional layer, a batch normalization layer, and a ReLU activation function to preserve the main features $Z_{\mathrm{main}}$.
Then, edge detection is performed in the edge branch: the horizontal and vertical Sobel operators are used to detect edges, extracting edge features $E_h$ and $E_v$. Based on these two directional edge responses, the edge magnitude is calculated as follows:
$E_{\mathrm{mag}} = \sqrt{E_h^2 + E_v^2}$
The last step of the ECAM module is feature fusion and adaptive adjustment: the edge information is combined with the main features, and the weights of the edge features are dynamically adjusted through the attention mechanism to obtain the output features. First, the main branch feature $Z_{\mathrm{main}}$ is concatenated with the horizontal and vertical edge features $E_h$ and $E_v$ of the edge branch to obtain $\mathrm{Concat}(Z_{\mathrm{main}}, E_h, E_v)$. Then, global information aggregation is performed on this concatenated feature by a GAP operation to obtain $\mathrm{GAP}(\mathrm{Concat}(Z_{\mathrm{main}}, E_h, E_v))$. Next, edge attention weights are generated by a $1 \times 1$ convolutional layer and a Sigmoid activation function:
$W_{\mathrm{edge}} = \sigma\left(\mathrm{Conv}_{1 \times 1}\left(\mathrm{GAP}\left(\mathrm{Concat}(Z_{\mathrm{main}}, E_h, E_v)\right)\right)\right)$
The edge features are weighted using this weight to obtain:
$E_h' = E_h \otimes W_{\mathrm{edge}}, \quad E_v' = E_v \otimes W_{\mathrm{edge}}$
Finally, the main branch feature $Z_{\mathrm{main}}$ is concatenated with the weighted edge features $E_h'$ and $E_v'$ and fused by a $3 \times 3$ convolutional layer to obtain the final output features:
$Z_{\mathrm{out}} = \mathrm{Conv}_{3 \times 3}\left(\mathrm{Concat}\left(Z_{\mathrm{main}}, E_h', E_v'\right)\right)$
This process improves the localization accuracy by adaptively adjusting the importance of the edge features so that the model can dynamically focus on the boundary features of the tampered region according to different input images.
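To make the data flow of ECAM concrete, the condensed PyTorch sketch below chains the coordinate attention and edge-aware sub-modules described above. The internal channel width `c` and the resizing of the multi-scale inputs to a common resolution before concatenation are assumptions not specified in the text, and the edge magnitude $E_{\mathrm{mag}}$ is omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECAM(nn.Module):
    """Edge-aware coordinate attention module (condensed sketch)."""

    def __init__(self, in_channels, c):
        super().__init__()
        self.coord_enc = nn.Sequential(nn.Conv2d(in_channels, c, 1),
                                       nn.BatchNorm2d(c), nn.Sigmoid())        # Y
        self.attn_h = nn.Conv2d(c, in_channels, 1)                             # -> M_h
        self.attn_v = nn.Conv2d(c, in_channels, 1)                             # -> M_v
        self.main = nn.Sequential(nn.Conv2d(in_channels, c, 1),
                                  nn.BatchNorm2d(c), nn.ReLU(inplace=True))    # Z_main
        self.edge = nn.Sequential(nn.Conv2d(in_channels, c, 3, padding=1),
                                  nn.BatchNorm2d(c), nn.ReLU(inplace=True))    # Z_edge
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        self.register_buffer("sx", sobel_x.view(1, 1, 3, 3))
        self.register_buffer("sy", sobel_x.t().contiguous().view(1, 1, 3, 3))
        self.edge_gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                       nn.Conv2d(3 * c, c, 1), nn.Sigmoid())   # W_edge
        self.out = nn.Conv2d(3 * c, c, 3, padding=1)

    def sobel(self, x, k):
        # Depthwise Sobel filtering applied to every channel of the edge branch.
        return F.conv2d(x, k.repeat(x.size(1), 1, 1, 1), padding=1, groups=x.size(1))

    def forward(self, feats):                      # feats: [Z1, Z2, Z3, Z4]
        size = feats[0].shape[-2:]
        z = torch.cat([F.interpolate(f, size, mode="bilinear", align_corners=False)
                       for f in feats], dim=1)     # Z' = Concat(Z1..Z4)
        xh = z.mean(dim=2, keepdim=True)           # horizontal average pooling (B, C, 1, W)
        xv = z.mean(dim=3, keepdim=True)           # vertical average pooling   (B, C, H, 1)
        y = self.coord_enc(torch.cat([xh.transpose(2, 3), xv], dim=2))
        yh, yv = torch.split(y, [z.size(3), z.size(2)], dim=2)
        mh = torch.sigmoid(self.attn_h(yh.transpose(2, 3)))                    # M_h
        mv = torch.sigmoid(self.attn_v(yv))                                    # M_v
        z = z * mh * mv                            # coordinate-attended features Z''
        z_main, z_edge = self.main(z), self.edge(z)
        eh, ev = self.sobel(z_edge, self.sx), self.sobel(z_edge, self.sy)      # E_h, E_v
        w = self.edge_gate(torch.cat([z_main, eh, ev], dim=1))                 # W_edge
        return self.out(torch.cat([z_main, eh * w, ev * w], dim=1))            # Z_out
```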
In summary, the ECAM module, by virtue of its unique design, demonstrates excellent performance in image tampering localization tasks. By integrating the coordinate attention mechanism and edge-aware features, the module can effectively enhance the model’s ability to characterize the boundary features of the tampered region and improve the tampered region localization accuracy.

3.5. Loss Function

In the task of image tampering localization, it is crucial to distinguish tampered and non-tampered regions accurately. The tampered region usually occupies only a small portion of the image, and the tampering boundary has a decisive impact on the localization accuracy. To cope with the category imbalance problem and enhance the model’s ability to perceive the tamper boundary, this paper adopts the combination of Dice loss and Focal loss as the optimization objective.
Dice loss [26] directly optimizes the degree of region overlap between the predicted mask and the true mask based on the Dice similarity coefficient. For the prediction result P and the true label G, the Dice loss is defined as:
$\mathcal{L}_{\mathrm{dice}} = 1 - \frac{2 \sum_{i}^{N} p_i g_i}{\sum_{i}^{N} p_i^2 + \sum_{i}^{N} g_i^2 + \epsilon}$
where $p_i \in P$ and $g_i \in G$ denote the predicted and true labels of the $i$-th pixel, respectively, $N$ is the total number of pixels, and $\epsilon$ is a small constant used to prevent a zero denominator, usually set to $1 \times 10^{-5}$. The Dice loss is insensitive to category imbalance, handles cases where the tampered region is small, and directly optimizes a region-overlap metric closely related to IoU.
The Focal loss [27] is designed to improve the model’s focus on hard-to-categorize regions such as tampering boundaries. It is defined as follows:
$\mathcal{L}_{\mathrm{Focal}} = -\alpha (1 - p)^{\gamma} \, y \log(p) - (1 - \alpha) \, p^{\gamma} (1 - y) \log(1 - p)$
where $p$ is the predicted tampering probability, $y \in \{0, 1\}$ is the true label, $\alpha$ is the weight assigned to positive samples (with $1 - \alpha$ for negative samples), and $\gamma$ is a modulating factor that reduces the weight of easy-to-classify samples.
The composite loss function is the final loss function in this paper and is the weighted sum of the Dice loss and the Focal loss:
$\mathcal{L} = \lambda_1 \mathcal{L}_{\mathrm{dice}} + \lambda_2 \mathcal{L}_{\mathrm{Focal}}$
where $\lambda_1$ and $\lambda_2$ are weighting coefficients. This combined loss fully exploits the complementary advantages of the two losses: the Dice loss optimizes the overlap of the overall region, while the Focal loss focuses on pixels that are difficult to classify, especially those in the tampering boundary region.
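A compact PyTorch sketch of this composite objective is given below; the weighting coefficients are set to $\lambda_1 = \lambda_2 = 1$ as an assumed default, since their exact values are not reported here.

```python
import torch

def dice_loss(p, g, eps=1e-5):
    """Soft Dice loss over predicted probabilities p and binary ground truth g."""
    inter = (p * g).sum()
    return 1 - 2 * inter / (p.pow(2).sum() + g.pow(2).sum() + eps)

def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-7):
    """Binary focal loss on probabilities p with labels y in {0, 1}."""
    p = p.clamp(eps, 1 - eps)
    pos = -alpha * (1 - p).pow(gamma) * y * torch.log(p)
    neg = -(1 - alpha) * p.pow(gamma) * (1 - y) * torch.log(1 - p)
    return (pos + neg).mean()

def total_loss(p, g, lam1=1.0, lam2=1.0):
    """Weighted sum of Dice and Focal losses (lambda values assumed)."""
    return lam1 * dice_loss(p, g) + lam2 * focal_loss(p, g)
```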

4. Experimentation and Analysis

4.1. Experimental Setup

Training dataset: a total of 60,971 synthetic falsified images are selected to train the proposed FENet based on the previous study [28]. These images are derived from the public, synthetic image dataset [29] and CASIAv2 [30]. The synthetic forgery images are generated by stitching the original images with the object regions in MS-COCO [31].
To comprehensively evaluate the adaptability and generalization performance of the proposed method under varying tampering types and complexity levels, five representative and widely adopted datasets were selected as test benchmarks, as summarized in Table 1:
  • CASIA-v1 [30]: This dataset includes two typical tampering types: copy-move and splicing. The images are of moderate resolution and do not involve post-processing operations such as compression or blurring, making them suitable for evaluating a model’s ability to detect basic tampering operations.
  • Columbia [7]: This dataset primarily consists of uncompressed spliced images, where tampering is performed by combining two source images along boundary regions. With high-quality images and minimal artifacts, it serves as a benchmark for assessing baseline detection performance under ideal conditions.
  • NIST16 [32]: Released by the National Institute of Standards and Technology (NIST), this dataset simulates real-world, professional-grade tampering scenarios. It includes a variety of complex manipulations—such as compositing, occlusion, and affine transformations—and offers high-resolution images with pixel-level annotations, making it a critical benchmark for evaluating practical performance.
  • IMD [33]: Manually constructed, this dataset covers various tampering types, including splicing, copy-move, and inpainting. It is well-suited for assessing the model’s generalization ability and robustness across heterogeneous forgery scenarios.
  • DSO [34]: This dataset focuses on small, blurry, and low-contrast tampered regions that are challenging to detect. It simulates difficult visual conditions, such as weak obfuscation and compression artifacts, serving as a valuable supplement for evaluating boundary sensitivity and fine-grained localization capabilities.
Data preprocessing: To enhance the model’s adaptability to heterogeneous image inputs and ensure training stability, a standardized preprocessing pipeline was established, comprising the following steps:
  • Input format compatibility: A unified input pipeline was constructed to support mainstream image formats (JPG, PNG, TIF), improving the model’s generalizability and deployment flexibility.
  • Spatial resolution normalization: All images were resized to a fixed resolution of 512 × 512. For high-resolution inputs, a sliding-window cropping strategy was employed to preserve spatial details while maintaining computational efficiency.
  • Pixel normalization and standardization: A two-stage normalization was applied—first linearly scaling pixel values to [0, 1], followed by standardization using ImageNet statistics (μ = [123.675, 116.28, 103.53], σ = [58.395, 57.12, 57.375]) to promote convergence and numerical stability.
  • Label binarization: Tampering masks were strictly binarized ({0,1}), and a threshold-based decision was applied to boundary pixels to ensure label accuracy and consistency.
Data Augmentation Strategy: To improve robustness against diverse tampering patterns, scale variations, and quality degradation, the following data augmentation techniques were adopted:
  • Geometric transformations: Including random multi-scale resizing (scaling factor in [0.5, 2.0]), random cropping (512 × 512), horizontal flipping (probability 0.5), vertical flipping (probability 0.3), and random rotation within ±15° (probability 0.3), enhancing spatial invariance.
  • Quality degradation simulation: Additive Gaussian noise ($\mu = 0$, $\sigma = 10$), Gaussian blur (5 × 5 kernel, probability 0.15), JPEG compression (quality factor in [70, 100], probability 0.5), and brightness/contrast variation (±10%/±20%, probability 0.5) were applied to simulate realistic distortions, as sketched below.
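The configuration sketch below mirrors the probabilities and ranges listed above, assuming the Albumentations library (pre-2.0 parameter names); the noise probability of 0.5 is an assumption, since the text does not specify it, and this is not the authors' training script.

```python
import albumentations as A

train_aug = A.Compose([
    A.RandomScale(scale_limit=(-0.5, 1.0), p=1.0),            # rescale factor in [0.5, 2.0]
    A.PadIfNeeded(min_height=512, min_width=512),
    A.RandomCrop(height=512, width=512),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.3),
    A.Rotate(limit=15, p=0.3),
    A.GaussNoise(var_limit=(100.0, 100.0), mean=0, p=0.5),     # sigma = 10 -> variance = 100
    A.GaussianBlur(blur_limit=(5, 5), p=0.15),
    A.ImageCompression(quality_lower=70, quality_upper=100, p=0.5),
    A.RandomBrightnessContrast(brightness_limit=0.1, contrast_limit=0.2, p=0.5),
])

# Usage: out = train_aug(image=image, mask=mask); geometric transforms are applied
# to the mask as well, keeping the tampering labels aligned with the image.
```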

4.2. Experimental Parameters

The experimental environment includes the Ubuntu 20.04 operating system, NVIDIA A800 GPU, and a deep learning framework based on PyTorch 1.13 and Python 3.8.
Table 2 summarizes the most important parameter settings in this study. These parameters have been experimentally verified and have a significant impact on model performance and stability. The RGB branch backbone network uses ImageNet pre-training weights, and the SRM filter weights are fixed.
During the first five training epochs, we froze the parameters of the SRM filtering layer and the low-level layers of the backbone network. A cosine annealing schedule was adopted for the learning rate, starting from $1 \times 10^{-4}$ and gradually decaying to $1 \times 10^{-6}$ over 100 epochs. A higher learning rate resulted in unstable training, whereas a lower value led to overly slow convergence. To mitigate overfitting and improve training efficiency, we employed an early stopping strategy with a patience value of 8.
The batch size was set to 8, providing an optimal trade-off between gradient estimation accuracy and memory efficiency while maintaining the effectiveness of batch normalization. We adopted the AdamW optimizer, which has demonstrated strong performance in image processing tasks. The momentum parameter was set to 0.9, selected based on comparisons with values of 0.8 and 0.95, aligning with best practices in the computer vision literature.
For the Focal loss function, we used α = 0.25 and γ = 2.0 , which effectively balanced the importance of positive and negative samples (i.e., tampered versus untampered regions), and enhanced the model’s sensitivity to hard-to-classify areas, particularly along tampering boundaries. Our sensitivity analysis further confirmed that the γ parameter has a significant impact on the accuracy of boundary localization.
From the perspective of resource consumption, the model’s weight file size is approximately 974 MB (standard SegFormer-B2 configuration), indicating a moderate parameter scale. To balance computational efficiency and performance, we adopted several optimization strategies:
  • A fixed-weight design was employed for the SRM filter, reducing the number of trainable parameters by approximately 9%.
  • A self-calibrating feature fusion mechanism was introduced to dynamically adjust feature importance, thereby improving resource utilization efficiency.
  • For large-sized images, we implemented a 512 × 512 sliding window processing strategy, where weighted fusion is used to effectively eliminate window boundary effects, allowing the model to handle inputs of arbitrary size within limited GPU memory (a minimal sketch is given below).
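The sliding-window strategy above can be sketched as follows; uniform averaging over overlapping windows is used as a stand-in for the unspecified weighted fusion, and `model` is assumed to map a (1, 3, 512, 512) patch to a (1, 1, 512, 512) probability map.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def sliding_window_predict(model, image, win=512, stride=256):
    """Tile a (1, 3, H, W) image, run the model per window, and average the overlaps."""
    _, _, h, w = image.shape
    H = max(win, -(-h // stride) * stride)            # pad up to a stride multiple, >= one window
    W = max(win, -(-w // stride) * stride)
    image = F.pad(image, (0, W - w, 0, H - h))        # zero padding on the right/bottom
    prob = image.new_zeros(1, 1, H, W)
    count = image.new_zeros(1, 1, H, W)
    for top in range(0, H - win + 1, stride):
        for left in range(0, W - win + 1, stride):
            patch = image[:, :, top:top + win, left:left + win]
            prob[:, :, top:top + win, left:left + win] += model(patch)
            count[:, :, top:top + win, left:left + win] += 1
    return (prob / count.clamp(min=1))[:, :, :h, :w]  # crop back to the original size
```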
Test results show that the standard SegFormer-B2 configuration achieves a peak memory usage of approximately 12 GB during training on a single A800 GPU.

4.3. Comparison Experiment

This study evaluates the performance of the proposed tamper localization scheme against a variety of current state-of-the-art methods, including ManTra-Net [8], DFCN [35], MVSS-Net [36], OSN [37], and ConvNeXt [28]. The competing methods are evaluated with their officially released code and pre-trained models so that all methods are tested under the same conditions. For the processing of the test images, the framework of the previous study [6] is followed, and some of the larger images in the NIST16 [32] dataset are cropped to 2048 × 1440 pixels to meet the input requirements of the ManTra-Net model.
Finally, experiments on several public datasets yielded comparative performance results of the methods based on two key localization metrics, F1 and IoU [37].
As presented in Table 3, the proposed FENet achieves the best overall performance among all compared methods, with an average F1 score that exceeds the second-best method, ConvNeXt, by 3.2%. Across all individual datasets, FENet consistently ranks among the top two methods in both F1 and IoU metrics, and notably attains the highest scores on the DSO dataset in both evaluations.
These results highlight the effectiveness of FENet’s architecture, which combines a dual-branch design for extracting complementary RGB semantic and SRM noise features, with the SCF and ECAM modules that support adaptive feature calibration and boundary enhancement. The DSO dataset, characterized by small, low-contrast tampered regions, poses challenges for many existing methods. FENet’s superior performance on this dataset underscores its robustness in detecting subtle manipulations and precisely localizing tampered boundaries.
In contrast, a slight decline in the F1 score is observed on the Columbia dataset. This can be attributed to the relatively smooth transitions and low texture contrast in its tampered regions, which reduce the discriminability of both semantic and statistical features. Nonetheless, FENet still achieves a high IoU on this dataset, indicating that its boundary localization remains accurate despite the reduced feature contrast.
These observations are further supported by the qualitative comparisons in Figure 4, which visually validate FENet’s consistent performance and strong generalization across diverse tampering scenarios.

4.4. Ablation Experiment

A series of ablation experiments was conducted to deeply explore the mechanism of action of each core design element in our approach. All experiments were trained on the same dataset to ensure consistency of experimental conditions, thus enhancing the comparability and credibility of the experimental results.
Our baseline model consists of four main components: the RGB branch, the SRM noise branch, the Transformer module for modeling global dependencies and feature interactions, and the dual-attention feature enhancement mechanism. On this basis, in order to construct a relatively simplified comparative frame of reference, we deliberately exclude the SCF module and the ECAM module in order to analyze the impact of each core component on the overall performance more clearly.
As shown in Table 4, when the SCF module is introduced and compared with the simple feature-connected average fusion approach, the F1 score of the model is significantly improved by 4.2%, while the IoU score also achieves an improvement of 3.3%. The results fully validate the effectiveness of the SCF module in enhancing the feature fusion of the model, reflecting its important role in the overall model performance enhancement.
In addition, the last row of data in Table 4 demonstrates the change in model performance after the introduction of the ECAM module. After integrating the ECAM module, the average performance of the model is significantly improved, as evidenced by an F1 score of 55.3% and an IoU score of 47.9%. Such a significant performance improvement indicates that the ECAM module has outstanding performance in enhancing the model’s ability to sense tampered regions, further proving its key position in constructing efficient tampering localization models.
In summary, through this group of layer-by-layer and well-designed ablation experiments, we systematically analyze the specific roles of the SCF module and the ECAM module in improving the model detection performance, and these experimental data provide a solid theoretical basis and experimental support for the subsequent further optimization of the model architecture, and the enhancement of tampering localization accuracy and robustness.

4.5. Robustness Assessment

In order to evaluate the robustness of the proposed model in the real social platform environment, especially in the face of multiple tampering operations and the challenges posed by social media compression processing [37], we conducted targeted experimental analysis based on the datasets proposed by OSN.
Specifically, four standard image forensic datasets were uploaded to Facebook, Weibo, and WeChat, which are the mainstream social platforms, to simulate the compression and re-encoding operations experienced during the image dissemination process in the actual social networks and analyze the model performance. Thus, test conditions close to real application scenarios were constructed, and model performance was comprehensively evaluated.
As shown in Table 5, our model performs particularly well on the DSO dataset, exhibiting stable and excellent performance in the post-propagation image tests on all the aforementioned social platforms. It is especially noteworthy that, compared to the second-ranked ConvNeXt method, the F1 scores on Facebook, Weibo, and WeChat improve by 2.7%, 3%, and 5.6%, respectively, and the method consistently remains in the top two of the compared methods in terms of IoU. This result fully verifies the robustness and stability of the proposed method in dealing with common image compression, quality loss, and other disturbing factors on social platforms.
In summary, FENet demonstrates outstanding performance in tampering localization tasks following image transmission on social media platforms, maintaining stable accuracy even under severe JPEG compression. This robustness stems from three key design components:
First, the symmetry-guided dual-branch architecture concurrently extracts RGB semantic features and SRM noise residuals. While JPEG compression may degrade visual cues in the RGB domain, the SRM branch remains effective in capturing statistical inconsistencies in compression artifacts, thereby providing crucial complementary information for tampering detection.
Second, the SCF module dynamically adjusts the weighting of different feature streams based on the compression level of the input image. In highly compressed images, it increases the contribution of the SRM branch to compensate for the loss of discriminative information in the RGB features.
Third, the ECAM module integrates Sobel edge detection with coordinate attention to effectively restore boundary details compromised by compression.
Collectively, these modules enable FENet to achieve high localization and boundary accuracy, even under the common challenges of repeated compression and quality degradation on social media platforms, highlighting its strong robustness and practical applicability in real-world scenarios.

5. Conclusions

This paper proposes a symmetry-guided dual-branch network that integrates adaptive feature fusion and edge-aware attention mechanisms, aiming to achieve high-precision localization of tampered regions. The network effectively enhances the model’s ability to perceive tampering traces through the synergy of the SRM filter, which extracts noise residual features, and the RGB branch, which extracts semantic features. The introduced SCF module effectively alleviates the information redundancy problem of traditional feature concatenation and improves feature interaction efficiency and expressive capability, while the designed ECAM module further strengthens the model’s attention to tampering boundaries, significantly improving localization accuracy in edge regions.
Extensive experiments verify the superior performance of the proposed method on several public benchmark datasets for image tampering detection and localization, especially in the accurate identification of tampering boundaries, reflecting the robustness and practical value of the method in complex tampering scenarios.
Future research work will focus on further exploring the deep correlation between image content features and tampering traces, exploring more efficient multi-scale fusion strategies and lightweight network design so as to construct more accurate, efficient, and adaptable image tampering localization models.

Author Contributions

Conceptualization, Z.H.; Methodology, Z.H. and L.L.; Software, L.L.; Validation, L.L.; Formal analysis, L.L. and H.W.; Investigation, L.L.; Resources, Z.H.; Data curation, L.L.; Writing—original draft, L.L.; Writing—review and editing, Z.H. and L.L.; Visualization, L.L.; Supervision, Z.H. and H.W.; Project administration, Z.H. and L.L.; Funding acquisition, Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by Gansu Province Higher Education Institutions Industrial Support Program under Grant 2020C-29, and in part by the National Natural Science Foundation of China under Grant 61562002.

Data Availability Statement

All data used in this study are publicly available and accessible online. The OSN Image Forensics Dataset, available via GitHub at https://github.com/HighwayWu/ImageForensicsOSN (accessed on 8 January 2025), includes several standard benchmark datasets, such as CASIA-v1, Columbia, NIST16, and DSO, as well as manipulated samples from platforms including Facebook, Weibo, and WeChat for robustness evaluation. The IMD dataset is separately available from its official website at https://staff.utia.cas.cz/novozada/db/ (accessed on 8 January 2025).

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

  1. Guo, K.; Zhu, H.; Cao, G. Effective image tampering localization via enhanced transformer and co-attention fusion. In Proceedings of the ICASSP 2024—2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Seoul, Republic of Korea, 14–19 April 2024; IEEE: New York, NY, USA, 2024; pp. 4895–4899. [Google Scholar]
  2. Ma, X.; Du, B.; Jiang, Z.; Hammadi, A.Y.A.; Zhou, J. IML-ViT: Benchmarking Image Manipulation Localization by Vision Transformer. arXiv 2023, arXiv:2307.14863. [Google Scholar]
  3. Chen, X.; Dong, C.; Ji, J.; Cao, J.; Li, X. Image manipulation detection by multi-view multi-scale supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 14185–14193. [Google Scholar]
  4. Zhai, Y.; Luan, T.; Doermann, D.; Yuan, J. Towards generic image manipulation detection with weakly supervised self-consistency learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 2–3 October 2023; pp. 22390–22400. [Google Scholar]
  5. Pan, X.; Zhang, X.; Lyu, S. Exposing image forgery with blind noise estimation. In Proceedings of the Thirteenth ACM Multimedia Workshop on Multimedia and Security, Buffalo, NY, USA, 29–30 September 2011; pp. 15–20. [Google Scholar]
  6. Kwon, M.J.; Nam, S.H.; Yu, I.J.; Lee, H.K.; Kim, C. Learning jpeg compression artifacts for image manipulation detection and localization. Int. J. Comput. Vis. 2022, 130, 1875–1895. [Google Scholar] [CrossRef]
  7. Hsu, Y.F.; Chang, S.F. Detecting image splicing using geometry invariants and camera characteristics consistency. In Proceedings of the 2006 IEEE International Conference on Multimedia and Expo, Toronto, ON, Canada, 9–12 July 2006; IEEE: New York, NY, USA, 2006; pp. 549–552. [Google Scholar]
  8. Wu, Y.; AbdAlmageed, W.; Natarajan, P. Mantra-net: Manipulation tracing network for detection and localization of image forgeries with anomalous features. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 9543–9552. [Google Scholar]
  9. Liu, X.; Liu, Y.; Chen, J.; Liu, X. PSCC-Net: Progressive spatio-channel correlation network for image manipulation detection and localization. IEEE Trans. Circuits Syst. Video Technol. 2022, 32, 7505–7517. [Google Scholar] [CrossRef]
  10. Hu, X.; Zhang, Z.; Jiang, Z.; Chaudhuri, S.; Yang, Z.; Nevatia, R. SPAN: Spatial pyramid attention network for image manipulation localization. In Proceedings of the Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, 23–28 August 2020; Proceedings, Part XXI 16. Springer: Berlin/Heidelberg, Germany, 2020; pp. 312–328. [Google Scholar]
  11. Zhu, H.; Cao, G.; Huang, X. Progressive feedback-enhanced transformer for image forgery localization. arXiv 2023, arXiv:2311.08910. [Google Scholar]
  12. Fridrich, J.; Kodovsky, J. Rich models for steganalysis of digital images. IEEE Trans. Inf. Forensics Secur. 2012, 7, 868–882. [Google Scholar] [CrossRef]
  13. Popescu, A.C.; Farid, H. Exposing digital forgeries by detecting traces of resampling. IEEE Trans. Signal Process. 2005, 53, 758–767. [Google Scholar] [CrossRef]
  14. Mahdian, B.; Saic, S. Using noise inconsistencies for blind image forensics. Image Vis. Comput. 2009, 27, 1497–1503. [Google Scholar] [CrossRef]
  15. Bappy, J.H.; Simons, C.; Nataraj, L.; Manjunath, B.; Roy-Chowdhury, A.K. Hybrid LSTM and encoder–decoder architecture for detection of image forgeries. IEEE Trans. Image Process. 2019, 28, 3286–3300. [Google Scholar] [CrossRef]
  16. Chen, T.; Li, B.; Zeng, J. Learning traces by yourself: Blind image forgery localization via anomaly detection with ViT-VAE. IEEE Signal Process. Lett. 2023, 30, 150–154. [Google Scholar] [CrossRef]
  17. Zheng, A.; Huang, T.; Huang, W.; Huang, L.; Ye, F.; Luo, H. DSSE-net: Dual stream skip edge-enhanced network with forgery loss for image forgery localization. Int. J. Mach. Learn. Cybern. 2024, 15, 2323–2335. [Google Scholar] [CrossRef]
  18. Xie, E.; Wang, W.; Yu, Z.; Anandkumar, A.; Alvarez, J.M.; Luo, P. SegFormer: Simple and efficient design for semantic segmentation with transformers. Adv. Neural Inf. Process. Syst. 2021, 34, 12077–12090. [Google Scholar]
  19. Zhuo, L.; Tan, S.; Li, B.; Huang, J. Self-adversarial training incorporating forgery attention for image forgery localization. IEEE Trans. Inf. Forensics Secur. 2022, 17, 819–834. [Google Scholar] [CrossRef]
  20. Yang, C.; Wang, Z.; Shen, H.; Li, H.; Jiang, B. Multi-modality image manipulation detection. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021; IEEE: New York, NY, USA, 2021; pp. 1–6. [Google Scholar]
  21. Yu, F.; Koltun, V. Multi-scale context aggregation by dilated convolutions. arXiv 2015, arXiv:1511.07122. [Google Scholar]
  22. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. BAM: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  23. Srivastava, N.; Hinton, G.; Krizhevsky, A.; Sutskever, I.; Salakhutdinov, R. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 2014, 15, 1929–1958. [Google Scholar]
  24. Luo, X.; Ai, Z.; Liang, Q.; Xie, Y.; Shi, Z.; Fan, J.; Qu, Y. EdgeFormer: Edge-aware Efficient Transformer for Image Super-resolution. IEEE Trans. Instrum. Meas. 2024, 73, 5029312. [Google Scholar] [CrossRef]
  25. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 19–25 June 2021; pp. 13713–13722. [Google Scholar]
  26. Wei, Q.; Li, X.; Yu, W.; Zhang, X.; Zhang, Y.; Hu, B.; Mo, B.; Gong, D.; Chen, N.; Ding, D.; et al. Learn to segment retinal lesions and beyond. In Proceedings of the 2020 25th International Conference on Pattern Recognition (ICPR), Milan, Italy, 10–15 January 2021; IEEE: New York, NY, USA, 2021; pp. 7403–7410. [Google Scholar]
  27. Lin, T.Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  28. Zhu, H.; Cao, G.; Zhao, M.; Tian, H.; Lin, W. Effective image tampering localization with multi-scale convnext feature fusion. J. Vis. Commun. Image Represent. 2024, 98, 103981. [Google Scholar] [CrossRef]
  29. Bappy, J.H.; Roy-Chowdhury, A.K.; Bunk, J.; Nataraj, L.; Manjunath, B. Exploiting spatial structure for localizing manipulated image regions. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 4970–4979. [Google Scholar]
  30. Dong, J.; Wang, W.; Tan, T. CASIA image tampering detection evaluation database. In Proceedings of the 2013 IEEE China Summit and International Conference on Signal and Information Processing, Beijing, China, 6–10 July 2013; IEEE: New York, NY, USA, 2013; pp. 422–426. [Google Scholar]
  31. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common Objects in Context. In Proceedings of the European Conference on Computer Vision, Zurich, Switzerland, 6–12 September 2014; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755. [Google Scholar]
  32. Guan, H.; Kozak, M.; Robertson, E.; Lee, Y.; Yates, A.N.; Delgado, A.; Zhou, D.; Kheyrkhah, T.; Smith, J.; Fiscus, J. MFC datasets: Large-scale benchmark datasets for media forensic challenge evaluation. In Proceedings of the 2019 IEEE Winter Applications of Computer Vision Workshops (WACVW), Waikoloa Village, HI, USA, 7–11 January 2019; IEEE: New York, NY, USA, 2019; pp. 63–72. [Google Scholar]
  33. De Carvalho, T.J.; Riess, C.; Angelopoulou, E.; Pedrini, H.; de Rezende Rocha, A. Exposing digital image forgeries by illumination color classification. IEEE Trans. Inf. Forensics Secur. 2013, 8, 1182–1194. [Google Scholar] [CrossRef]
  34. Novozamsky, A.; Mahdian, B.; Saic, S. IMD2020: A large-scale annotated dataset tailored for detecting manipulated images. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, Snowmass, CO, USA, 1–5 March 2020; pp. 71–80. [Google Scholar]
  35. Zhuang, P.; Li, H.; Tan, S.; Li, B.; Huang, J. Image tampering localization using a dense fully convolutional network. IEEE Trans. Inf. Forensics Secur. 2021, 16, 2986–2999. [Google Scholar] [CrossRef]
  36. Dong, C.; Chen, X.; Hu, R.; Cao, J.; Li, X. MVSS-Net: Multi-view multi-scale supervised networks for image manipulation detection. IEEE Trans. Pattern Anal. Mach. Intell. 2022, 45, 3539–3553. [Google Scholar] [CrossRef]
  37. Wu, H.; Zhou, J.; Tian, J.; Liu, J.; Qiao, Y. Robust image forgery detection against transmission over online social networks. IEEE Trans. Inf. Forensics Secur. 2022, 17, 443–456. [Google Scholar] [CrossRef]
  38. Cozzolino, D.; Verdoliva, L. Noiseprint: A CNN-based camera model fingerprint. IEEE Trans. Inf. Forensics Secur. 2019, 15, 144–159. [Google Scholar] [CrossRef]
Figure 1. Model architecture overview of FENet.
Figure 2. Model architecture overview of SCF.
Figure 3. Overview of the ECAM module architecture.
Figure 4. From left to right: tampered images, ground truth, and the localization maps produced by ManTra-Net, Noiseprint, DFCN, MVSS-Net, OSN, ConvNeXt, and FENet.
Table 1. Distribution of image counts in the test datasets.

Dataset Name | Columbia | CASIAv1 | DSO | NIST | IMD
Image Count | 160 | 920 | 100 | 564 | 2010
Table 2. Model parameter settings.

Parameter | Notation | Value
Batch Size | b | 8
Max Epoch | E | 100
Encoder Learning Rate | α_enc | 1 × 10⁻⁴ for the first 5 epochs
Decoder Learning Rate | α_dec | 1 × 10⁻⁶
Weight Decay | λ_decay | 5 × 10⁻²
Momentum Coefficient | β | 0.9
Focal Gamma | γ | 2.0
Focal Alpha | α | 0.25
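To make the settings in Table 2 concrete, the following minimal PyTorch sketch shows one way they could be wired into an optimizer and the focal loss [27]. The attribute names (model.encoder, model.decoder), the choice of AdamW, and the handling of the encoder learning rate after the first 5 epochs are illustrative assumptions rather than the released implementation.

```python
import torch
from torch import nn
from torch.nn import functional as F

def build_optimizer(model):
    # Encoder and decoder branches use separate learning rates (Table 2);
    # beta1 = 0.9 plays the role of the momentum coefficient and the weight
    # decay is 5e-2. How the encoder rate changes after epoch 5 is not
    # reproduced here.
    return torch.optim.AdamW(
        [
            {"params": model.encoder.parameters(), "lr": 1e-4},  # first 5 epochs
            {"params": model.decoder.parameters(), "lr": 1e-6},
        ],
        betas=(0.9, 0.999),
        weight_decay=5e-2,
    )

class PixelFocalLoss(nn.Module):
    """Per-pixel binary focal loss [27] with gamma = 2.0 and alpha = 0.25."""

    def __init__(self, gamma: float = 2.0, alpha: float = 0.25):
        super().__init__()
        self.gamma = gamma
        self.alpha = alpha

    def forward(self, logits, targets):
        # logits, targets: (B, 1, H, W); targets are binary tampering masks.
        bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
        p_t = torch.exp(-bce)  # probability assigned to the true class
        alpha_t = self.alpha * targets + (1 - self.alpha) * (1 - targets)
        return (alpha_t * (1 - p_t) ** self.gamma * bce).mean()
```

Training would then run with batch size 8 for up to 100 epochs, as listed in Table 2.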
Table 3. Comparative experimental results on the test datasets (F1 / IoU); "–" indicates that the result is not available.

Methods | Columbia | CASIAv1 | DSO | NIST | IMD | Average
ManTra-Net [8] | 0.357 / 0.258 | 0.130 / 0.086 | 0.332 / 0.243 | 0.088 / 0.054 | 0.183 / 0.124 | 0.228 / 0.159
Noiseprint [38] | 0.364 / 0.262 | – | 0.339 / 0.253 | 0.122 / 0.081 | 0.179 / 0.120 | 0.230 / 0.161
DFCN [35] | 0.419 / 0.280 | 0.181 / 0.119 | 0.320 / 0.217 | 0.082 / 0.055 | 0.233 / 0.161 | 0.250 / 0.165
MVSS-Net [36] | 0.684 / 0.596 | 0.451 / 0.397 | 0.271 / 0.188 | 0.294 / 0.240 | 0.200 / 0.401 | 0.401 / 0.333
OSN [37] | 0.707 / 0.608 | 0.509 / 0.465 | 0.436 / 0.308 | 0.332 / 0.255 | 0.456 / 0.367 | 0.456 / 0.367
ConvNeXt [28] | 0.885 / 0.857 | 0.581 / 0.548 | 0.270 / 0.228 | 0.370 / 0.318 | 0.500 / 0.432 | 0.521 / 0.477
FENet | 0.858 / 0.803 | 0.585 / 0.543 | 0.451 / 0.335 | 0.355 / 0.281 | 0.514 / 0.431 | 0.553 / 0.479
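The F1 and IoU values reported in Table 3 (and in the tables that follow) are pixel-level scores computed between the predicted binary tampering mask and the ground-truth mask. The sketch below shows the standard per-image computation; the 0.5 binarization threshold and the per-image averaging are stated assumptions, included only to make the metrics unambiguous.

```python
import numpy as np

def pixel_f1_iou(pred_prob: np.ndarray, gt_mask: np.ndarray,
                 threshold: float = 0.5, eps: float = 1e-8):
    """Pixel-level F1 and IoU for a single image.

    pred_prob: predicted tampering probabilities in [0, 1], shape (H, W).
    gt_mask:   ground-truth mask, nonzero = tampered pixel, shape (H, W).
    The 0.5 threshold is an assumed default, not necessarily the paper's protocol.
    """
    pred = (pred_prob >= threshold).astype(np.uint8)
    gt = (gt_mask > 0).astype(np.uint8)
    tp = int(np.sum((pred == 1) & (gt == 1)))
    fp = int(np.sum((pred == 1) & (gt == 0)))
    fn = int(np.sum((pred == 0) & (gt == 1)))
    f1 = 2 * tp / (2 * tp + fp + fn + eps)
    iou = tp / (tp + fp + fn + eps)
    return f1, iou
```

Dataset-level scores are then typically obtained by averaging the per-image values over all tampered images in that dataset.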
Table 4. F1 and IoU scores of the model with different module configurations across the benchmark datasets (F1 / IoU).

Methods | Columbia | CASIAv1 | DSO | NIST | IMD | Average
Baseline | 0.787 / 0.736 | 0.505 / 0.466 | 0.317 / 0.221 | 0.286 / 0.223 | 0.455 / 0.365 | 0.470 / 0.402
Baseline + SCF | 0.839 / 0.777 | 0.541 / 0.491 | 0.366 / 0.256 | 0.316 / 0.249 | 0.498 / 0.403 | 0.512 / 0.435
Baseline + SCF + ECAM | 0.858 / 0.803 | 0.585 / 0.543 | 0.451 / 0.335 | 0.355 / 0.281 | 0.514 / 0.431 | 0.553 / 0.479
Table 5. Comparison of experimental results on images transmitted through social-media platforms (F1 / IoU).

Platform | Methods | Columbia | CASIAv1 | DSO | NIST | Average
Facebook | DFCN | 0.315 / 0.214 | 0.161 / 0.102 | 0.049 / 0.030 | 0.116 / 0.077 | 0.160 / 0.106
Facebook | MVSS-Net | 0.691 / 0.603 | 0.387 / 0.334 | 0.277 / 0.193 | 0.264 / 0.213 | 0.405 / 0.336
Facebook | OSN | 0.724 / 0.611 | 0.462 / 0.417 | 0.447 / 0.320 | 0.329 / 0.253 | 0.488 / 0.400
Facebook | ConvNeXt | 0.873 / 0.841 | 0.534 / 0.496 | 0.260 / 0.216 | 0.362 / 0.311 | 0.507 / 0.466
Facebook | FENet | 0.809 / 0.757 | 0.536 / 0.489 | 0.417 / 0.308 | 0.375 / 0.310 | 0.534 / 0.466
Weibo | DFCN | 0.172 / 0.107 | 0.159 / 0.101 | 0.056 / 0.032 | 0.075 / 0.050 | 0.115 / 0.072
Weibo | MVSS-Net | 0.689 / 0.601 | 0.403 / 0.353 | 0.258 / 0.183 | 0.251 / 0.200 | 0.400 / 0.334
Weibo | OSN | 0.724 / 0.626 | 0.466 / 0.421 | 0.370 / 0.253 | 0.294 / 0.219 | 0.463 / 0.380
Weibo | ConvNeXt | 0.882 / 0.853 | 0.529 / 0.497 | 0.262 / 0.216 | 0.357 / 0.308 | 0.507 / 0.468
Weibo | FENet | 0.827 / 0.778 | 0.506 / 0.469 | 0.481 / 0.362 | 0.333 / 0.269 | 0.537 / 0.470
WeChat | DFCN | 0.404 / 0.278 | 0.196 / 0.126 | 0.167 / 0.104 | 0.050 / 0.032 | 0.204 / 0.135
WeChat | MVSS-Net | 0.690 / 0.603 | 0.248 / 0.209 | 0.214 / 0.150 | 0.212 / 0.165 | 0.341 / 0.282
WeChat | OSN | 0.727 / 0.631 | 0.405 / 0.358 | 0.366 / 0.252 | 0.286 / 0.214 | 0.446 / 0.364
WeChat | ConvNeXt | 0.882 / 0.854 | 0.358 / 0.324 | 0.230 / 0.187 | 0.349 / 0.298 | 0.455 / 0.416
WeChat | FENet | 0.824 / 0.772 | 0.408 / 0.361 | 0.449 / 0.332 | 0.361 / 0.283 | 0.511 / 0.437
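The results in Table 5 are obtained on images that were actually transmitted through Facebook, Weibo, and WeChat (the OSN benchmark [37]), so the degradation reflects each platform's real processing. For readers who want a quick offline approximation of such degradation, for example as a sanity check, the sketch below applies mild Gaussian blur followed by JPEG recompression with Pillow; the quality factor and blur radius are illustrative assumptions and do not reproduce any platform's actual pipeline.

```python
from io import BytesIO
from PIL import Image, ImageFilter

def approximate_osn_degradation(path: str, jpeg_quality: int = 75,
                                blur_radius: float = 0.5) -> Image.Image:
    """Rough stand-in for social-network transmission: slight blur + lossy JPEG.
    Parameter values are assumptions, not measured platform behavior."""
    img = Image.open(path).convert("RGB")
    img = img.filter(ImageFilter.GaussianBlur(radius=blur_radius))
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=jpeg_quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```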
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

