SIDWA: Synthetic Image Detection Based on Discrete Wavelet Transform Stem and Deformable Sliding Window Cross-Attention

Li, Luo; Lu, Tianyi; Song, Jiaxin; Cheng, Ke

doi:10.3390/electronics15040891



School of Computer, Jiangsu University of Science and Technology, Zhenjiang 212100, China

* Author to whom correspondence should be addressed.

Electronics 2026, 15(4), 891; https://doi.org/10.3390/electronics15040891

Submission received: 28 January 2026 / Revised: 19 February 2026 / Accepted: 20 February 2026 / Published: 21 February 2026


Abstract

With the rapid evolution of Generative Adversarial Networks (GANs) and diffusion models (DMs), the detection of synthetic images faces significant challenges due to non-rigid artifacts and complex frequency biases. In this paper, we propose SIDWA, a novel dual-branch detection framework that leverages the synergy between frequency and spatial domains. Within the spatial branch, we design a Deformable Sliding Window Cross-Attention (DSWA) module, which utilizes a learnable offset mechanism to dynamically warp the receptive field, effectively capturing distorted edges and non-linear texture features. Simultaneously, the Discrete Wavelet Transform (DWT) Stem decomposes input images into multi-scale sub-bands to preserve crucial high-frequency residues. Through a Frequency-Semantic Resonance Projector (FSRP) strategy, the semantic priors from the spatial branch act as queries to guide the model toward localized frequency anomalies, achieving a unified “where to look” and “how to analyze” approach. Experimental results on the SIDataset (SIDset) benchmark demonstrate that SIDWA achieves superior performance, with an average accuracy exceeding 95% and a competitive inference time of 18.2 ms on an NVIDIA A100 GPU. Ablation studies further validate the critical role of learnable offsets and frequency integration in enhancing robustness and generalization. SIDWA offers an efficient and reliable forensic solution for combating the growing threats of sophisticated generative forgeries.

Keywords:

AIGC detection; image generation; attention mechanism; deformable convolution; wavelet transform

1. Introduction

Amid the rapid advancement of AI-generated content (AIGC) technologies, breakthroughs in text-to-image and image editing models have enabled large-scale production of high-quality synthetic images. While these technologies show immense potential in content marketing, they have sharply blurred the line between authentic and generated visual content. Current detection technologies face severe challenges. From a technical perspective, sophisticated Generative Adversarial Networks (GANs) [1] and advanced diffusion models (DMs) [2,3] continue to narrow the gap between synthetic and real photos. Recent studies highlight that DMs, characterized by their iterative denoising process, introduce subtle non-rigid artifacts that are harder to capture than the periodic checkerboard patterns of GANs [4,5]. This progress has led to a rapid decline in the effectiveness of traditional detectors [6,7].

Current mainstream methods heavily rely on single-modal RGB analysis [8] and fixed-frequency extraction techniques like DCT or DFT [9,10]. However, these approaches demonstrate significant limitations: (1) At the spatial level, visual features alone struggle to capture contradictions in higher-level semantic logic, while being highly sensitive to adversarial perturbations. (2) At the frequency level, fixed-frequency priors face dual challenges. Advanced architectures now synthesize content with natural spectral distributions, eliminating traditional fingerprint features. Moreover, postprocessing techniques like JPEG compression and Gaussian blur actively obscure spectral traces. More fundamentally, static receptive fields in traditional CNNs or standard Transformers fail to dynamically adapt to the irregular, non-linear structural anomalies produced by modern generators.

To address these gaps, researchers have increasingly turned to multi-scale representation and advanced attention mechanisms. For instance, recent studies in diverse domains, such as sEMG-based motion recognition [11], industrial defect detection [12], and hyperspectral image classification [13], have demonstrated that capturing long-range temporal–spatial dependencies and multi-scale features is crucial for identifying subtle pattern anomalies. Furthermore, advanced decomposition strategies, such as multi-scale fusion networks [14] and perturbation defense via hybrid clustering [15], offer promising alternatives to simple frequency analysis. Inspired by these advancements, we propose SIDWA, a collaborative dual-branch framework. Unlike conventional detectors that treat frequency and RGB features in isolation, SIDWA facilitates a deep synergy between two specialized pathways, a spatial semantic branch and a wavelet-domain frequency branch. Our main contributions are as follows:

1. We propose SIDWA, a novel dual-branch framework that integrates RGB spatial features with frequency-domain artifacts. We provide a comprehensive discussion of performance evaluation criteria, including precision, recall, and F1-score, following recent methodological guidelines in high-stakes recognition tasks [16].
2. We introduce a Discrete Wavelet Transform (DWT) Stem. Unlike DCT-based methods or standard downsampling that lose high-frequency residues, our DWT-based multi-scale decomposition preserves crucial “stepping noise” and local frequency anomalies, providing a more robust foundation than traditional signal decomposition techniques [17].
3. We develop the Deformable Sliding Window Cross-Attention (DSWA) mechanism. By incorporating learnable offsets, DSWA dynamically warps its receptive field to capture elusive, non-rigid artifact regions. This adaptive approach offers superior flexibility compared to the grid search-based optimization strategies commonly used in hyperparameter selection [18].
4. Extensive experiments demonstrate that SIDWA achieves state-of-the-art performance across multiple benchmarks, showing superior robustness and statistical significance (p < 0.05) against common image perturbations.

The remainder of this paper is organized as follows: First, Section 2 builds on the Introduction with an overview of related work in synthetic image detection. Section 3 outlines the proposed SIDWA model and its key components. Section 4 follows with the experimental setup, results, and ablation studies. Finally, Section 5 concludes with a summary and discussion of future research directions.

2. Related Work

Generation Models: Since Goodfellow et al. [1] introduced them in 2014, GANs [1,19,20,21,22,23,24,25,26] and their adversarial training framework have pioneered a data-driven paradigm for image generation. Subsequent improvements focused on training stability; WGAN [20], LSGAN [21], and InfoGAN [23] are notable examples. In 2018, ProGAN [24] and BigGAN [25] overcame high-resolution bottlenecks through progressive training and parameter scaling, generating $1024 \times 1024$ HD images. Iterative versions, such as StyleGAN2 and StyleGAN3, addressed artifacts and enabled dynamic generation. Diffusion models developed along a separate path. Early explorations in 2015, based on physical diffusion principles, were primarily used for image denoising and remained underdeveloped due to low generation efficiency. In 2020, DDPM [3] established a theoretical foundation by constructing a Markov chain framework for forward noise injection and backward denoising via variational inference. Still, it remained inefficient, requiring thousands of sampling steps. The year 2021 marked a pivotal turning point. On one hand, DDIM [27] achieved significant efficiency gains by compressing the number of sampling steps to dozens through non-Markov processes. On the other hand, conditional generation models such as GLIDE [28] and DALL·E 2 [29] incorporated text encoders and broke through text-to-image limitations. Classifier-free guidance further enhanced controllability via latent space contrastive learning. These advances drove the widespread adoption of open-source models like Stable Diffusion [2]. By 2025, large-scale diffusion models such as LLaDA [4] demonstrated parallel inference advantages in text generation. This progress signaled their evolution from image-generating tools to a universal generation paradigm. Core breakthroughs consistently advanced along three main axes: sampling acceleration, controllable generation, and cross-modal generalization.

Dynamic convolution: By dynamically adjusting convolution kernels, this approach enables models to better adapt to input data variations, particularly excelling in processing complex textures, variable object scales, or highly individualized data. It simultaneously enhances the model’s representation capabilities and generalization performance. For example, Yang et al. [30], pioneers in input-driven convolutional weight generation, introduced conditional parameter convolution. This method increases model capacity through expert aggregation, allowing the model to learn more discriminative convolution kernels. Building on this, Chen et al. [31] innovatively incorporated attention mechanisms that dynamically adjust convolutional parameter values based on input images to aggregate multiple parallel convolutional kernels, thereby significantly improving the performance of lightweight networks. Expanding further, Li et al. [32] achieved cross-layer shared kernel libraries by constructing frequency-diverse weights in the Fourier domain, enabling dynamic convolution with multiple experts and substantially boosting parameter efficiency. Continuing the trend of frequency-based methods, Chen et al. [33] proposed a novel frequency-dynamic convolution method that generates frequency-diverse weights without increasing parameter costs, thereby enhancing performance in dense prediction tasks. Notably, dynamic adjustment mechanisms have shown great potential across various domains. For instance, Chen et al. [34] utilized mutual information-driven connectivity in graph attention networks for EEG-based driver fatigue detection, demonstrating that adaptive kernel adjustments can effectively capture complex temporal–spatial dependencies.

Frequency-Domain Transform: The inherent limitations of RGB-based detectors, which are often prone to overfitting on low-level textures, have driven significant research interest in the frequency domain. Scientific evidence suggests that generative models leave unique spectral fingerprints due to their upsampling architectures. Early studies primarily used the Discrete Fourier Transform (DFT) to identify global artifacts. For example, Frank et al. [35] and Durall et al. [36] showed that CNN-based generators fail to replicate the spectral distribution of real images, often producing periodic inconsistencies in high-frequency components. Building on these findings, researchers shifted to Discrete Cosine Transform (DCT)-based features to further mitigate JPEG compression artifacts. Liu et al. [37] analyzed the statistical distributions of DCT coefficients to distinguish between authentic and synthesized content. However, despite the success of Fourier-based methods, they often lack the spatial localization needed to pinpoint manipulations. As a result, this limitation prompted the adoption of wavelet-based analysis. Unlike the global nature of DFT, the Discrete Wavelet Transform (DWT) provides a multi-resolution framework that decomposes an image into frequency sub-bands (e.g., LL, LH, HL, and HH), preserving both spatial and frequency information. In this context, Li et al. [38] leveraged wavelet packets to capture subtle inconsistencies in high-frequency details. Similarly, Wolter et al. [39] used the wavelet domain to improve detection robustness against perturbations. Thus, by operating in the wavelet domain, models can more effectively isolate high-frequency noise from the image’s semantic content.

Attention Mechanism: Attention mechanisms have become a cornerstone in computer vision, enabling models to focus on salient regions (important areas of an image) and capture long-range dependencies (relationships between distant image regions). Early work, such as the Squeeze-and-Excitation network proposed by Hu et al. [40] and the Convolutional Block Attention Module by Woo et al. [41], introduced channel and spatial attention via global pooling. These advances laid the foundation for recalibrating features in lightweight architectures. The emergence of the Vision Transformer (ViT) [42] further shifted the paradigm toward self-attention. However, its quadratic computational complexity remains a significant bottleneck for resource-constrained detection tasks. Recently, research has pivoted toward optimizing attention for efficiency. Cai et al. [43] introduced EfficientViT, which decomposes the attention operation into smaller subsets via cascaded group attention. This approach significantly reduces memory redundancy. Similarly, Yun et al. [44] proposed SHViT. This work challenges the necessity of multi-head structures in lightweight models and demonstrates that a single-head design combined with parallel convolutions can achieve faster inference. In the latest frontier, Wang et al. [45] presented LSNet, which mimics the human visual system. It employs large-scale dynamic convolutions to achieve global perception while maintaining local precision. In a related vein, attention mechanisms have revolutionized the ability of deep learning models to prioritize informative regions, yet their application is rapidly evolving toward capturing more complex, non-linear dependencies. Recent advancements emphasize the integration of local inductive biases with global contextual modeling. To address the intrinsic limitations of global modeling, Yin et al. [46] introduced a Convolution–Transformer hybrid that specifically addresses the lack of inductive bias in standard Transformers for image feature extraction. In specialized domains, researchers are now moving toward dual-branch or cyclic interaction structures to handle multi-modal or high-dimensional data. Zhang et al. [13] proposed a spectral–spatial dual-branch fusion Transformer using multi-head self-attention to enhance the interactive fusion of spectral and spatial features in hyperspectral imagery. Similarly, in the field of industrial defect detection, Shen et al. [12] developed a vision–language cyclic interaction model to bridge the gap between domain priors and visual features through recursive guidance modules. Furthermore, the necessity of capturing both temporal and spatial characteristics through attention has been demonstrated in sEMG-based recognition tasks [11], where attention mechanisms are coupled with non-linear feature decomposition to identify subtle motion patterns.

Fake Image Detection: This field focuses on detecting generated images by analyzing inherent features in pixel-level spatial representations. These methods leverage spatial patterns such as texture irregularities, unnatural edge formation, and color inconsistencies, which are common artifacts of image synthesis. In a pivotal study, Wang et al. [47] demonstrated that detectors trained on images from a single generative model (particularly a GAN) can generalize to synthetic images from various unseen Convolutional Neural Network (CNN)-based models. Li et al. [48] proposed the GASE-Net framework, which detects GAN-generated images by estimating the similarity of artifacts. The method uses a two-phase strategy: representation learning and representation comparison. This addresses cross-domain generalization and postprocessing robustness, ensuring consistency across domains while preserving class-level distinctions. Domain prototypes are formed by the element-wise aggregation of feature maps from reference images. Baraheem et al. [49] introduced a framework that applies transfer learning to pre-trained classifiers for detecting GAN-generated images. Wang et al. [50] proposed DIRE (DIffusion REconstruction Error), which detects diffusion-generated images by leveraging reconstruction errors from pre-trained diffusion models. It addresses the limitations of previous detectors, which struggled to generalize across different diffusion models. Transforming image data into the spectral domain helps identify periodic artifacts, noise distributions, and frequency-component variations related to synthetic image generation. Zhang et al. [51] developed a GAN simulator that mimics common artifacts across GAN architectures. They combined this with a classifier that operates on spectral rather than spatial features, yielding a frequency-domain analysis framework for detecting GAN-generated images. Yan et al. [8] introduced AIDE (AI-generated Image Detector with Hybrid Features). This method combines low-frequency-domain features with high-level semantic embeddings from the CLIP [52] model. It addresses the issue of detectors misclassifying synthetic images as real. A further line of work, patch-based detection, identifies synthetic images by analyzing local regions rather than the entire image. These methods leverage inconsistencies or artifacts that may occur in smaller regions to detect subtle generation patterns. By segmenting images into multiple patches and examining features such as texture, edge consistency, or pixel-level anomalies, they enhance detection precision. Patch-based strategies are particularly effective when global analysis may overlook fine-grained artifacts from generative models. Chai et al. [53] proposed a patch-based classification framework that employs truncated residual networks (ResNet) [10] and the Xception backbone [54] to classify local image patches as real or forged. By limiting the model’s receptive field, this framework focuses on local artifacts rather than global image structures, which improves generalization across datasets and generative model types. Ju et al. [55] introduced a detection method that combines global and local features. It uses a ResNet-50 backbone to extract global feature maps and a Patch Selection Module (PSM) to capture subtle local artifacts, thereby improving synthetic image detection.

3. Methodology

3.1. Overall Architecture

In this section, we present the technical details of the proposed SIDWA framework, a dual-branch architecture combining multi-spectral analysis with adaptive spatial attention, as illustrated in Figure 1. The model consists of two core components: a spatial-domain semantic branch and a wavelet-domain frequency branch. The semantic branch extracts global consistency and generates a Context Prior, while the frequency branch captures high-frequency artifacts using DWT.

Semantic Stream: A deep backbone network extracts multi-level semantic representations. A Feature Pyramid Network (FPN) is utilized to aggregate these features, ensuring that the model maintains spatial hierarchy. This process yields the guided semantic feature map $F_{s}$.

Frequency Stream: Concurrently, to extract low-level manipulation artifacts often invisible in the RGB domain, we employ the DWT layer. The image is decomposed into multi-scale sub-bands, capturing high-frequency residuals. These features undergo Artifact Alignment to normalize for distributional shifts across different compression scales, yielding a guided frequency feature map $F_{f}$.

3.2. Discrete Wavelet Transform (DWT) Stem

DWT Stem: Unlike traditional multi-scale fusion methods [14] or autoencoder-based suppression strategies [17] that may suffer from information loss or over-smoothing of delicate textures, our choice of DWT is motivated by its inherent losslessness and invertibility [56,57]. Specifically, while existing autoencoder methods often fail to exploit spatial-frequency consistency and may inadvertently reconstruct forgery artifacts as backgrounds [57], DWT enables a clean separation of multi-frequency components [57,58]. This process preserves high-frequency residues where generative fingerprints are predominantly located, preventing the “information dropping” common in aggressive downsampling designs [56]. Furthermore, leveraging wavelet transforms allows the model to achieve a larger effective receptive field with minimal parameter overhead [59], enhancing the robustness against common perturbations [15,59], which is critical for identifying non-rigid artifacts in diffusion-generated images.

As illustrated in Figure 2, by explicitly decomposing the input into distinct sub-bands, the DWT serves as a principled frequency-aware stem that balances local texture preservation with global semantic awareness. This structured decomposition ensures that even subtle generative traces, which are typically obscured in standard spatial convolutions, are systematically isolated into specific directional components. To provide a rigorous formulation of how these multi-frequency features are extracted and preserved, the transformation process can be mathematically expressed as

$$(x_{LL}, x_{LH}, x_{HL}, x_{HH}) = \mathrm{DWT}(x) \quad (1)$$

where $x$ represents the input image, $x_{LL}$ denotes the low-frequency sub-band, and $x_{LH}$, $x_{HL}$, and $x_{HH}$ denote the high-frequency sub-bands obtained from the DWT. The DWT operation decomposes the image into these four components, capturing both approximation and detail information at different frequency levels.
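To make Eq. (1) concrete, the following is a minimal NumPy sketch of a single-level 2D DWT. It assumes the Haar wavelet, which this section does not specify, and the function names `haar_dwt2` and `haar_idwt2` are hypothetical. The LH/HL naming convention also varies between libraries, so the assignment below is one possible choice.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT (assumed wavelet) of an array with even height and width.

    Returns (x_ll, x_lh, x_hl, x_hh), each of shape (H/2, W/2).
    """
    x = np.asarray(x, dtype=np.float64)
    a = x[0::2, 0::2]  # top-left pixel of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    x_ll = (a + b + c + d) / 2.0  # approximation (low-pass in both directions)
    x_lh = (a + b - c - d) / 2.0  # detail: difference between the two rows
    x_hl = (a - b + c - d) / 2.0  # detail: difference between the two columns
    x_hh = (a - b - c + d) / 2.0  # diagonal detail
    return x_ll, x_lh, x_hl, x_hh

def haar_idwt2(x_ll, x_lh, x_hl, x_hh):
    """Inverse of haar_dwt2; rebuilds the (H, W) array from the four sub-bands."""
    h, w = x_ll.shape
    x = np.empty((2 * h, 2 * w))
    x[0::2, 0::2] = (x_ll + x_lh + x_hl + x_hh) / 2.0
    x[0::2, 1::2] = (x_ll + x_lh - x_hl - x_hh) / 2.0
    x[1::2, 0::2] = (x_ll - x_lh + x_hl - x_hh) / 2.0
    x[1::2, 1::2] = (x_ll - x_lh - x_hl + x_hh) / 2.0
    return x
```

Under this assumption the transform is exactly invertible, which illustrates the losslessness argument above: a flat image produces zero detail sub-bands, while any local irregularity appears in $x_{LH}$, $x_{HL}$, or $x_{HH}$.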

The three high-frequency sub-bands, which contain crucial generative fingerprints (e.g., checkerboard artifacts or upsampling residues), are concatenated along the channel dimension. A $1 \times 1$ convolution is then applied to project these sub-bands into a joint frequency feature map $F_{f} \in \mathbb{R}^{C \times \frac{H}{2} \times \frac{W}{2}}$, ensuring dimensional compatibility with the semantic branch. These decomposed frequency components are then projected into the latent space of the DSWA module via the Frequency-Semantic Resonance Projector (FSRP). Specifically, the FSRP acts as a bridge: it uses the global semantic context to dynamically weight the frequency sub-bands, ensuring that only the most discriminative artifacts are passed to the spatial attention stages. By embedding frequency-domain priors directly into the input tokens of the DSWA, the model achieves a synergy where the DWT Stem provides “where to look” (frequency anomalies) and the DSWA provides “how to analyze” (adaptive spatial inspection).
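The concatenation and $1 \times 1$ projection described above can be sketched as follows. This is an illustration only: `project_high_freq`, the weight layout, and the optional bias are assumptions, and the actual layer may include normalization or activation that the text does not describe.

```python
import numpy as np

def project_high_freq(x_lh, x_hl, x_hh, weight, bias=None):
    """Concatenate the high-frequency sub-bands and apply a 1x1 convolution.

    Each sub-band has shape (C_in, H/2, W/2); weight has shape (C, 3 * C_in).
    Returns F_f with shape (C, H/2, W/2).
    """
    stacked = np.concatenate([x_lh, x_hl, x_hh], axis=0)  # (3 * C_in, H/2, W/2)
    # A 1x1 convolution is a per-pixel linear map over the channel axis.
    f_f = np.einsum("oc,chw->ohw", weight, stacked)
    if bias is not None:
        f_f = f_f + bias[:, None, None]
    return f_f
```

Because the kernel is $1 \times 1$, the projection mixes sub-bands at each location without blurring across neighboring positions.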

3.3. Artifact Alignment Module

The Artifact Alignment Module functions as a spatial–spectral bridge, mapping frequency-domain anomalies back to their original physical coordinates. By leveraging bilinear upsampling followed by a $1 \times 1$ convolution layer, the module re-aligns the high-frequency residual features with the semantic backbone’s spatial grid. This ensures that the model precisely localizes the source of a frequency anomaly within the image geometry, effectively pinpointing the forgery’s location.

3.4. Frequency-Semantic Resonance Projector (FSRP)

The FSRP treats semantic context as a prior to calibrate the frequency response. As illustrated in Figure 3, the FSRP utilizes a global average pooling (GAP) layer to extract a global semantic descriptor from the parallel stream. When upsampling is required, the architecture employs bilinear upsampling to resize feature maps to the desired spatial dimensions. Subsequently, this descriptor is passed through a Multi-Layer Perceptron (MLP) to generate a set of resonance weights $W_{res}$. The calibration process is defined as follows:

$$F_{f}^{out} = \sigma(\mathrm{MLP}(\mathrm{GAP}(F_{s}))) \otimes F_{f}^{in} \quad (2)$$

where $\sigma$ denotes the sigmoid activation and $\otimes$ represents element-wise multiplication. By dynamically re-weighting the frequency sub-bands based on semantic context, the FSRP emphasizes those spectral components that are most relevant to the suspected forged regions, thereby achieving a deep fusion of frequency and semantic modalities.
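A minimal NumPy sketch of Eq. (2) is given below. Several details are assumptions rather than facts from the text: the MLP is taken to be two layers with a ReLU in between, the gate is treated as one weight per frequency channel broadcast over all spatial positions, and the function name `fsrp` and its parameters are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fsrp(f_s, f_f_in, w1, b1, w2, b2):
    """Sketch of Eq. (2): F_f_out = sigmoid(MLP(GAP(F_s))) * F_f_in.

    f_s:    semantic feature map, shape (C_s, H, W)
    f_f_in: frequency feature map, shape (C_f, H', W')
    w1, b1: first MLP layer, shapes (hidden, C_s) and (hidden,)  [assumed]
    w2, b2: second MLP layer, shapes (C_f, hidden) and (C_f,)    [assumed]
    """
    descriptor = f_s.mean(axis=(1, 2))               # GAP -> (C_s,)
    hidden = np.maximum(w1 @ descriptor + b1, 0.0)   # ReLU hidden layer (assumed)
    gate = sigmoid(w2 @ hidden + b2)                 # per-channel weights in (0, 1)
    # Broadcast the per-channel gate over every spatial position.
    return gate[:, None, None] * f_f_in, gate
```

Since the gate lies in (0, 1), this form can only attenuate frequency channels; it re-weights them relative to one another rather than amplifying any of them.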

3.5. Deformable Sliding Window Cross-Attention (DSWA)

To address non-rigid artifacts in DM and GAN models and the spatial inflexibility of standard cross-attention, we propose the DSWA module. Unlike fixed-grid attention, DSWA uses learnable offsets to shift its receptive field towards informative regions. This adaptiveness helps capture the subtle, irregular artifacts of generated images.

Our DSWA centers on learning the offset $\Delta p_{mqk}$ via a small network. These learnable offsets let the model move beyond fixed windows and sample features from more discriminative locations. The key mathematical expression is as follows:

$$\mathrm{DeformAttn}(q, p_{q}, x) = \sum_{m=1}^{M} W_{m} \left[ \sum_{k=1}^{K} A_{mqk} \cdot x(p_{q} + \Delta p_{mqk}) \right] \quad (3)$$

where

$q$ denotes the query feature representing the semantic information of the target pixel;
$p_{q}$ is the 2D reference point (central coordinates) of the current query $q$ within the feature map;
$x$ represents the input feature map from which features are sampled;
$M$ is the number of attention heads, each focusing on different spectral or spatial patterns;
$K$ signifies the number of sampled keys per head, which is significantly smaller than the total number of pixels in a window to reduce computational overhead;
$\Delta p_{mqk}$ denotes the learnable offset for the $k$-th sampling point in the $m$-th head, allowing the receptive field to adaptively shift towards forgery-sensitive regions (e.g., distorted edges);
$W_{m}$ represents the learnable weight matrix used to project the aggregated multi-head features into the output space.

The attention weight $A_{mqk}$ is computed by integrating the content-aware similarity and a spatial position bias. Specifically, it is defined as follows:

$$A_{mqk} = \mathrm{Softmax}\left(\frac{Q_{q} K_{k}^{T}}{\sqrt{d}} + B(p_{k})\right) \quad (4)$$

where $Q_{q} \in \mathbb{R}^{d}$ is the query feature, $K_{k} \in \mathbb{R}^{d}$ represents the key feature sampled at the deformed location $p_{q} + \Delta p_{mqk}$, and $d$ is the head dimension. The term $B(p_{k})$ denotes the relative position bias, which provides a structural prior for the spatial distribution of the sampling points. This formulation ensures that the attention mechanism is sensitive to both the semantic inconsistency (via dot-product) and the geometric deformation (via position bias).

By defining $A_{mqk}$ as a function of both learned content and spatial bias, DSWA functions as a Dual-Gated Inspector. The offsets $\Delta p$ act as searchers to find suspicious regions, while the attention weights act as verifiers to confirm the forensic significance of the sampled features.
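Eqs. (3) and (4) can be illustrated for a single query with the NumPy sketch below. It is not the authors' implementation: the key projection `w_k`, the per-head bias table `bias` standing in for $B(p_{k})$, and the border clamping in `bilinear_sample` are all assumptions, and the softmax is taken over the $K$ sampling points of each head.

```python
import numpy as np

def bilinear_sample(x, p):
    """Sample x of shape (C, H, W) at a fractional point p = (row, col), clamped to the map."""
    _, h, w = x.shape
    r = float(np.clip(p[0], 0, h - 1))
    s = float(np.clip(p[1], 0, w - 1))
    r0, s0 = int(np.floor(r)), int(np.floor(s))
    r1, s1 = min(r0 + 1, h - 1), min(s0 + 1, w - 1)
    fr, fs = r - r0, s - s0
    top = (1 - fs) * x[:, r0, s0] + fs * x[:, r0, s1]
    bottom = (1 - fs) * x[:, r1, s0] + fs * x[:, r1, s1]
    return (1 - fr) * top + fr * bottom

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def deform_attn(q, p_q, x, offsets, w_k, w_out, bias):
    """Sketch of Eq. (3) for one query.

    q: (M, d) per-head query; p_q: (2,) reference point; x: (C, H, W)
    offsets: (M, K, 2) values of Delta p; w_k: (M, d, C) key projection (assumed)
    w_out: (M, C_out, C) plays the role of W_m; bias: (M, K) stands in for B(p_k)
    """
    num_heads, num_points, _ = offsets.shape
    d = q.shape[1]
    out = np.zeros(w_out.shape[1])
    for m in range(num_heads):
        # Sample the feature map at the deformed locations p_q + Delta p_{mqk}.
        sampled = np.stack([bilinear_sample(x, p_q + offsets[m, k])
                            for k in range(num_points)])   # (K, C)
        keys = sampled @ w_k[m].T                          # (K, d)
        a = softmax(keys @ q[m] / np.sqrt(d) + bias[m])    # Eq. (4), weights sum to 1
        out += w_out[m] @ (a @ sampled)                    # W_m [ sum_k A * x(...) ]
    return out
```

Bilinear interpolation keeps the sampled value differentiable with respect to the offset, which is what allows the offsets to be learned by gradient descent.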

To effectively capture the non-rigid artifacts and irregular forgery boundaries characteristic of GANs and DMs, the DSWA module, as illustrated in Figure 4, replaces the rigid rectangular windows of traditional fixed-grid attention mechanisms with an elastic receptive field through three key technical designs:

Learnable Offsets: The core innovation of DSWA is the introduction of a learnable offset $\Delta p_{mqk}$ for each window. To implement this adaptive receptive field, we utilize a lightweight sub-network $F_{off}(\cdot)$, typically implemented as a $3 \times 3$ depth-wise separable convolution, to predict coordinate shifts based on the input feature maps $X$. Formally, the offsets are computed as follows:

$$\Delta p_{mqk} = \tanh(F_{off}(X_{q})) \cdot \theta \quad (5)$$

where $\tanh$ constrains the offset range to ensure training stability and $\theta$ is a predefined scaling factor that governs the maximum allowable deformation within the sliding window.
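The bounded-offset rule of Eq. (5), $\Delta p_{mqk} = \tanh(F_{off}(X_{q})) \cdot \theta$, can be sketched as below. For brevity the sub-network $F_{off}$ is replaced by a plain linear map, which is an assumption made only for illustration (the text describes a $3 \times 3$ depth-wise separable convolution), and `bounded_offsets` is a hypothetical name.

```python
import numpy as np

def bounded_offsets(x_q, w_off, theta, num_heads, num_points):
    """Sketch of Eq. (5) with a linear stand-in for F_off.

    x_q:   query feature vector, shape (C,)
    w_off: weights of the stand-in offset predictor, shape (num_heads * num_points * 2, C)
    theta: maximum displacement allowed along each axis
    Returns offsets of shape (num_heads, num_points, 2).
    """
    raw = w_off @ x_q                  # unbounded prediction (stand-in for F_off(X_q))
    delta_p = np.tanh(raw) * theta     # tanh squashes to [-1, 1]; theta rescales
    return delta_p.reshape(num_heads, num_points, 2)
```

Whatever the predictor outputs, every coordinate of the offset stays within $[-\theta, \theta]$, so a sampling point cannot drift arbitrarily far from its reference position early in training.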

Window Size Adaptation: We employ a multi-scale approach where window sizes are predefined based on the network stage (e.g., $7 \times 7$ for early stages to capture fine-grained textures and $12 \times 12$ for deeper stages to encompass global structures). Within these windows, the deformable sampling points are dynamically adjusted, allowing the receptive field to dilate or constrict based on the content’s complexity. The window size $k_{\ell}$ at layer $\ell$ can be adaptively determined by the following:

$$k_{\ell} = \min\left(k_{base} \cdot 2^{\lfloor \frac{\ell}{N} \rfloor}, \min(H_{\ell}, W_{\ell})\right) \quad (6)$$

where $k_{base}$ is the initial size and $N$ denotes the number of layers per stage. This ensures that the model maintains a consistent relative receptive field while effectively aggregating multi-scale forgery clues.
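Eq. (6) is simple enough to evaluate directly. The sketch below uses illustrative values ($k_{base} = 7$, two layers per stage, a fixed $56 \times 56$ feature map) that are assumptions chosen for the example, not settings reported in the paper.

```python
def window_size(layer, k_base, layers_per_stage, height, width):
    """Eq. (6): double the base window once per stage, capped by the feature-map size."""
    grown = k_base * 2 ** (layer // layers_per_stage)
    return min(grown, min(height, width))

# Illustrative schedule for six layers on a 56 x 56 feature map.
sizes = [window_size(layer, 7, 2, 56, 56) for layer in range(6)]
print(sizes)  # [7, 7, 14, 14, 28, 28]
```

The inner `min` matters in deep stages: once the feature map is smaller than the grown window, the window is clipped to the map, so attention never indexes outside it.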

Computational Efficiency: By constraining the cross-attention within a sliding window rather than the global image, we significantly reduce the computational burden. While global self-attention suffers from quadratic complexity $O(N^{2})$, where $N = H \times W$ is the number of pixels (not the layers per stage of Equation (6)), the complexity of our DSWA is restricted to the following:

$$\Omega(\mathrm{DSWA}) = 2HW(C^{2} + k_{\ell}^{2} C) \quad (7)$$

where $k_{\ell}$ represents the window size and $C$ denotes the number of feature channels (embedding dimension). By keeping $k_{\ell} \ll \min(H, W)$, the proposed module achieves a superior trade-off between detection accuracy and inference throughput, making it highly efficient for processing high-resolution synthetic images.
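As a quick numerical check of Eq. (7), the sketch below evaluates the cost for a hypothetical configuration ($56 \times 56$ feature map, $C = 96$, $k_{\ell} = 7$; these values are assumptions for illustration) and shows that doubling both spatial dimensions multiplies the cost by four, that is, the cost grows linearly in $HW$.

```python
def dswa_cost(height, width, channels, window):
    """Evaluate Eq. (7): 2 * H * W * (C^2 + k^2 * C)."""
    return 2 * height * width * (channels ** 2 + window ** 2 * channels)

base = dswa_cost(56, 56, 96, 7)
doubled = dswa_cost(112, 112, 96, 7)
print(base, doubled // base)  # 87306240 4
```

By contrast, any term proportional to $(HW)^{2}$, as in global self-attention, would grow by a factor of sixteen under the same doubling.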

3.6. Attention Distillation and Gated Integration

To further refine this interaction, we introduce Attention Distillation. Frequency artifacts can be sparse; therefore, the distillation module ($\beta$) compels attention maps to focus on the most discriminative regions, effectively suppressing natural high-frequency noise.

The final fusion is governed by a Dynamic Gate ($\alpha$), which controls how much information from the frequency domain is included in the model’s output. This gate calculates a scaling factor by comparing the confidence in frequency artifacts to the semantic context. When the DSWA detects a high-probability forgery artifact, the gate increases the influence of the frequency features in the final prediction. As a result, the model remains sensitive to localized manipulations while preserving overall semantic consistency.

In sum, this approach works as follows: For each position $p_{q}$, the model shifts from examining fixed neighbors to using $\Delta p$ to select sampling points most relevant to forgery detection. It extracts features $x$ from these points, weights them by importance $A$, and integrates the results using $W_{m}$.

The proposed DSWA transcends the limitations of conventional rigid-grid mechanisms. It does so through four pivotal technical designs, as formalized in Algorithm 1:

(i) Scale-Aware Value Projection: As delineated in Lines 1–4, the frequency features $F_{F}$ are not directly utilized but are first projected into a multi-scale value space $V_{ms}$. By synergizing global spectral distributions $V_{coarse}$ with localized structural details $V_{fine}$, the model ensures that the subsequent attention mechanism is sensitive to both macro-level statistical anomalies and micro-level pixel inconsistencies characteristic of generative models.
(ii) Content-Aware Offset Generation: In Step 2 (Line 6), the sampling offsets $\Delta p$ are dynamically predicted from the semantic query $F_{S}$ via an OffsetGen module. This data-dependent design enables a content-aware receptive field; the model adaptively shifts its focus toward irregular manipulation boundaries or distorted semantic regions (e.g., unnatural skin textures or blending artifacts) that do not align with a standard rectangular grid.
(iii) Attention Distillation and Refinement: To address the sparsity of frequency artifacts and suppress natural high-frequency noise, we introduce an Attention Distillation phase (Lines 11–15). By generating a refinement weight $\beta$, the model calibrates the raw cross-modal attention scores $A$. This ensures that the attention focus is distilled onto the most discriminative forgery-sensitive regions, further guided by an auxiliary distillation loss $L_{aux}$ during training.
(iv) Dynamic Gated Integration: In the final stage (Lines 17–18), a GateGen module computes a dynamic gating factor $\alpha$. This mechanism modulates the intensity of frequency information injection based on the semantic significance of the queried region. This ensures a robust fusion that prevents noise from the frequency stream from overwhelming the semantic consistency of the backbone, followed by a dual-residual update to preserve gradient stability.

In summary, data-dependent offset generation, differentiable resampling, and multi-head parallel exploration work together in SIDWA to overcome fixed-window constraints. This synergy enables elastic and precise scrutiny of suspicious regions in generative forgeries.

Algorithm 1 DSWA

Require: Semantic feature map $F_{S} \in \mathbb{R}^{C \times H \times W}$, frequency sub-bands $F_{F} \in \mathbb{R}^{C \times H \times W}$
Ensure: Guided output feature map $F_{F}^{out}$, distillation loss $L_{aux}$
1: Step 1: Multi-Scale Value Projection
2: $V_{coarse} \leftarrow \mathrm{AvgPool}(\mathrm{Linear}_{c}(F_{F}))$ {Capture global artifacts}
3: $V_{fine} \leftarrow \mathrm{Conv}_{3 \times 3}(\mathrm{Linear}_{f}(F_{F}))$ {Capture local details}
4: $V_{ms} \leftarrow V_{coarse} \oplus V_{fine}$ {Concatenate and fuse multi-scale features}
5: Step 2: Offset Generation & Sampling
6: $Δ p \leftarrow \mathrm{OffsetGen}(F_{S})$ {Predict M-head deformable offsets from query}
7: for each head $m \in \{1, \dots, M\}$ do
8: $p_{target} \leftarrow p_{grid} + Δ p_{m}$ {Coordinate deformation}
9: $X_{sampled} \leftarrow \mathrm{BilinearInterpolation}(V_{ms}, p_{target})$
10: end for
11: Step 3: Attention Distillation
12: $A \leftarrow \mathrm{Softmax}\left(\frac{Q_{S} K_{F}^{T}}{\sqrt{d}}\right)$ {Compute cross-modal attention scores}
13: $β \leftarrow \mathrm{Sigmoid}(\mathrm{MLP}(A))$ {Generate distillation refinement weights}
14: $A' \leftarrow \mathrm{Normalize}(A ⊙ β)$ {Calibrated attention distribution}
15: $L_{aux} \leftarrow \| A' - A_{target} \|^{2}$ {Calculate auxiliary distillation loss}
16: Step 4: Gated Aggregation & Residual Update
17: $Y_{attn} \leftarrow \sum_{m = 1}^{M} \mathrm{Linear}_{m}(A'_{m} \cdot X_{sampled})$
18: $α \leftarrow \mathrm{Sigmoid}(\mathrm{GateGen}(Q_{S}, V_{ms}))$ {Dynamic gating factor}
19: $F_{F}^{out} \leftarrow \mathrm{LN}(F_{S} + \mathrm{FFN}(α \cdot Y_{attn}))$ {Residual connection with LN}
20: return $F_{F}^{out}, L_{aux}$
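Steps 3 and 4 of Algorithm 1 can be sketched for a single query in plain Python. This is an illustrative reduction, not the published implementation: `refine_logits` and `gate_logit` are hypothetical stand-ins for the outputs of the MLP and GateGen modules, and the FFN and layer normalization of Line 19 are omitted.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(scores):
    m = max(scores)                      # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def calibrate(attn, refine_logits):
    """A' = Normalize(A * beta), with beta = sigmoid(refine_logits)."""
    beta = [sigmoid(z) for z in refine_logits]
    weighted = [a * b for a, b in zip(attn, beta)]
    total = sum(weighted)
    return [w / total for w in weighted]

def gated_residual(f_s, y_attn, gate_logit):
    """f_s + alpha * y_attn, with alpha = sigmoid(gate_logit); FFN and LN omitted."""
    alpha = sigmoid(gate_logit)
    return [s + alpha * y for s, y in zip(f_s, y_attn)]

attn = softmax([2.0, 1.0, 0.0])                  # raw cross-modal attention A
attn_cal = calibrate(attn, [3.0, 0.0, -3.0])     # hypothetical refinement logits
sampled = [1.0, 2.0, 3.0]                        # hypothetical sampled values
y_attn = sum(a * x for a, x in zip(attn_cal, sampled))
out = gated_residual([0.5], [y_attn], gate_logit=0.0)   # alpha = 0.5
```

In this toy case the refinement weights push more probability mass onto the first sampling point while the calibrated distribution still sums to one.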

4. Experiments and Results

In this section, we first describe the experimental setup and then present comprehensive results to validate the effectiveness of SIDWA. To evaluate the proposed method’s detection performance in varied postprocessing scenarios, we select multiple representative datasets. These include both GAN-generated and Diffusion-based images, as well as real samples. Our evaluation focuses on the model’s robustness to common image artifacts and perturbations, such as JPEG compression, Gaussian blurring, and rescaling. These challenges mimic real-world transmission and editing processes. This approach enables a thorough assessment of the model’s reliability under degraded conditions. Following standard forensic protocols, we partitioned the datasets into training and test sets. The training set was used for feature learning and classifier optimization. The test set, augmented with various distortion levels, provided an objective assessment of the model’s generalization and cross-manipulation detection capabilities.

4.1. Dataset

To better evaluate the performance of diffusion generation detectors, we compiled a dataset named SIDataset (SIDset) comprising three components, as shown in Table 1. The images can be broadly categorized into three types based on their sources: DiffusionForensics [50], part of GenImage [60], and images we collected ourselves.

DiffusionForensics is a relatively simple open-source benchmark. We selected the LSUN Bedroom dataset and the ImageNet subset for experiments. The LSUN Bedroom subset collects bedroom images from LSUN-Bedroom and generates fakes using multiple diffusion models, including ADM [61], PNDM [62], and IDDPM [63]. The training set comprises 30,000 real images and 10,000 images generated by each of ADM, PNDM, and IDDPM. For testing, we selected 10,000 real images and 10,000 generated images from each subset.

GenImage is a standardized, large-scale benchmark for AI-generated image detection. It includes 2,681,167 images, of which 1,331,167 are real and 1,350,000 are generated. The dataset covers eight leading generators: Midjourney, Stable Diffusion (v1.4 and v1.5) [4], ADM, GLIDE [28], Wukong [64], VQDM [65], and BigGAN [25]. These represent various GAN and diffusion models. The image generation system uses 1000 ImageNet labels, ensuring nearly equal numbers of real and generated images within each category. We selected three subsets (ADM, Midjourney, and VQDM) as foundational data for training and evaluation. Only part of the dataset was used for training due to computational resource constraints. Specifically, we randomly selected 40,000 real and 40,000 generated images from each subset for training.

Our collected dataset comprises two components. For the real image collection, we developed a dedicated web crawler to gather photographs from photo-sharing platforms and authoritative media outlets, including Unsplash, Pexels, Flickr, Xinhua News Agency, and BBC. The content spanned multiple news categories, including politics, sports, culture, disasters, and technology, with a focus on socially sensitive events, fraud scenarios, and risk communication cases. After initial image collection through the crawler, we employed the OpenCLIP [65] model for preliminary classification and manually filtered images into ten categories, including people, bedrooms, fruits, and animals. A preprocessing program was also used to remove low-resolution images and those containing Not Safe For Work (NSFW) content, resulting in a final dataset of 5000 images. For image generation, we selected three models: HunyuanImage [66], Seedream [67], and FLUX.2 [68]. We collected 5000 generated images per model using prompts of the form “A photo of bedroom”, where “bedroom” stands for one of ten ImageNet-1k categories.

4.2. Experimental Setup

During data preprocessing, we randomly applied standardization treatments, including flipping, cropping, grayscale conversion, and JPEG compression at specified ratios. These steps remove lighting and contrast effects while preserving important details, establishing a strong foundation for feature extraction and object tracking. To balance computation and retain local micro-textures, we downsampled to $512 \times 512$ and then randomly cropped to $224 \times 224$. This encouraged the model to learn pixel-level fingerprints rather than global semantics. For stable convergence and strong performance, we used a cosine annealing learning rate schedule, starting at $1 \times 10^{-4}$ and decaying to $1 \times 10^{-6}$ over 100 epochs. Additionally, a linear warm-up in the first five epochs prevented early gradient instability. This adjustment helped the model to avoid local minima early and converge precisely later. Training ran for 100 epochs on four NVIDIA A100 GPUs (80 GB each), with a batch size of 64. We selected hyperparameters via grid search on the GenImage validation set for optimal micro-artifact sensitivity and global semantic coherence.
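The learning-rate schedule described above can be written as a small per-epoch function. The sketch below rests on assumptions not stated in the text: the five warm-up epochs are counted inside the 100 training epochs, the warm-up ramps linearly up to the peak rate, and the rate is updated once per epoch.

```python
import math

def learning_rate(epoch, total_epochs=100, warmup_epochs=5,
                  lr_max=1e-4, lr_min=1e-6):
    """Linear warm-up followed by cosine annealing from lr_max to lr_min."""
    if epoch < warmup_epochs:
        # Ramp linearly so that the last warm-up epoch reaches lr_max.
        return lr_max * (epoch + 1) / warmup_epochs
    # progress runs from 0 (first cosine epoch) to 1 (final epoch).
    progress = (epoch - warmup_epochs) / (total_epochs - 1 - warmup_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

schedule = [learning_rate(e) for e in range(100)]
```

Under these assumptions the rate peaks at $1 \times 10^{-4}$ at the end of warm-up and reaches $1 \times 10^{-6}$ in the final epoch.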

Regarding classifier optimization, we adopted the AdamW [69] optimizer instead of standard SGD. AdamW effectively decouples weight decay from the gradient update, which is particularly beneficial for our dual-branch SIDWA framework. This approach ensures that the high-frequency features captured by the DWT branch are not prematurely suppressed by aggressive regularization. While SGD with momentum often provides better generalization in some vision tasks, we observed that AdamW converged faster and offered superior stability when handling the diverse artifacts in our SIDset dataset. The learning rate was governed by a cosine annealing scheduler to avoid local minima during late-stage training.

For the SIDset dataset, we split the data into training, validation, and test sets at 8:1:1. Specifically, the training set comprises 280,000 images, while the validation and test sets each comprise 35,000 images. Crucially, within each subset, we maintain a balanced distribution between generated and real images to ensure data consistency. To ensure the evaluation’s objectivity, there is no overlap between the training and test sets.
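A class-balanced 8:1:1 split with disjoint subsets can be sketched as follows. The function name, the seed, and the toy file names are hypothetical; the actual SIDset partitioning procedure is not described beyond the ratio and the balance constraint.

```python
import random

def stratified_split(items_by_class, ratios=(8, 1, 1), seed=0):
    """Split each class separately so every subset keeps the same class balance."""
    rng = random.Random(seed)
    splits = {"train": [], "val": [], "test": []}
    total = sum(ratios)
    for label, items in items_by_class.items():
        shuffled = list(items)
        rng.shuffle(shuffled)
        n_train = len(shuffled) * ratios[0] // total
        n_val = len(shuffled) * ratios[1] // total
        splits["train"] += [(x, label) for x in shuffled[:n_train]]
        splits["val"] += [(x, label) for x in shuffled[n_train:n_train + n_val]]
        splits["test"] += [(x, label) for x in shuffled[n_train + n_val:]]
    return splits

# Toy data: 100 real and 100 generated file names (hypothetical).
data = {"real": [f"real_{i}" for i in range(100)],
        "fake": [f"fake_{i}" for i in range(100)]}
splits = stratified_split(data)
```

Applied to 350,000 images, the same ratio yields the 280,000 / 35,000 / 35,000 partition reported above.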

To comprehensively evaluate the proposed method, we use Accuracy (ACC) and Area Under the ROC Curve (AUC) as the primary metrics. Furthermore, to assess robustness and generalization, we conduct cross-dataset evaluations and test performance under various postprocessing attacks (e.g., JPEG compression and Gaussian blurring).
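For reference, ACC and AUC can be computed from detector scores as in the sketch below. The 0.5 decision threshold and the example labels and scores are assumptions for illustration; AUC is computed with its rank interpretation, the probability that a randomly chosen synthetic image scores higher than a randomly chosen real one.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def auc(labels, scores):
    """Pairwise estimate of the area under the ROC curve (ties count as 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [1, 1, 1, 0, 0, 0]                 # 1 = synthetic, 0 = real (toy data)
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]     # hypothetical detector outputs
acc_value = accuracy(labels, scores)
auc_value = auc(labels, scores)
```

The pairwise form is quadratic in the number of samples, so it suits small checks; a sort-based computation would be preferable at the scale of the test sets used here.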

Baseline: As a benchmark, we selected representative models in the field of generative image detection over the past few years, spanning from early artifact monitoring to the latest diffusion model feature analysis. (1) CNNSpot [47] identified common flaws in different generative models and proposed a relatively simple method for effectively detecting images generated by various CNNs, which significantly enhanced detector generalization capabilities through data augmentation strategies. This laid an important foundation for subsequent research in generative image detection. (2) FreDect [35] revealed a key characteristic of generative images: compared to the variability in spatial-domain pixels, frequency-domain features often exhibit higher stability. Based on this discovery, the model mapped images from the spatial domain to the frequency domain using techniques such as Discrete Cosine Transform (DCT), accurately capturing periodic statistical artifacts introduced by upscaling operations during reconstruction that are difficult for the human eye to detect. Because this underlying mathematical fingerprint is common across different generative frameworks, FreDect can effectively identify forged content generated by unknown algorithms, achieving significant improvements in cross-model generalization. (3) UnivFD [7] proposed a universal detection framework to address domain differences in generative image detection. Its core approach involves multi-scale feature fusion and attention mechanisms to extract features with strong semantic representation from pre-trained models [65], thereby compensating for the limitations of traditional detectors, which are overly sensitive to local details. UnivFD’s primary contribution lies in its exceptional universal applicability: by incorporating cross-modal prior knowledge, it achieves unified high-precision recognition across diverse generative architectures (from GANs to diffusion models) and multiple image categories. 
(4) LNP [70] discovered that, unlike the natural physical relationships between pixels in real photos, generated images often exhibit minor statistical inconsistencies in fine texture details. Building on this, the LNP model employs a local neighborhood propagation mechanism, treating images as graph structures or using local convolutional operators to detect deviations in spatial correlations between pixels and their surrounding neighborhoods. This approach not only captures global artifacts but also precisely locates local modification traces in images, demonstrating exceptional sensitivity to locally altered or inpainted images. (5) AIDE [8] pioneered a detection paradigm based on reconstruction capability differences. Its core logic is that, with existing editing models, real images are generally harder than generated images to reconstruct perfectly without detail loss. The model amplifies the inherent structural instability in generated images by comparing consistency losses between original images and their slightly perturbed reconstructions. This transforms the detection task from a traditional classification problem into a robustness evaluation of generative models, effectively addressing the challenge of distinguishing between high-quality generated content and real content. (6) DIRE [50] proposed a detection paradigm based on reconstruction error in diffusion models, whose core logic is grounded in the reversibility difference between generated and real images during diffusion. DIRE demonstrates that images generated by diffusion models exhibit significantly smaller reconstruction errors compared to real images when subjected to identical noise addition and removal processes. By computing this specific reconstruction residual, DIRE projects detected images into the distribution space of diffusion models, thereby dramatically enhancing the structural features of the generated images.

Hyperparameter Optimization Strategy: The hyperparameters of the SIDWA framework, including the learning rate, batch size, and weight decay of the optimizer, were determined via a systematic grid search. We selected grid search primarily due to its deterministic nature and high reproducibility within a well-defined search space, which ensure a stable baseline for evaluating the structural contributions of the DWT Stem and DSWA modules. However, we acknowledge the limitations of grid search in terms of computational efficiency compared to more advanced self-optimizing frameworks. For instance, recent studies have introduced self-optimized Gaussian kernel-based radial basis function extreme learning machines (ELMs) [18], which offer superior adaptability in dynamic parameter landscapes. While our current study prioritizes the transparency of the grid-based search to validate the collaborative dual-branch architecture, exploring self-adaptive optimization strategies to further enhance the convergence efficiency of the cross-attention mechanism remains a promising direction for our future work.

Our model evolves from the OverLoCK [71] backbone, transitioning from three synergistic sub-networks to a collaborative dual-branch architecture consisting of a wavelet-domain frequency branch and a space-domain semantic branch. We fundamentally re-engineer the feature extraction stage and the interaction blocks to incorporate multi-spectral and adaptive spatial awareness.

To further investigate the interpretability and localization capabilities of our proposed framework, we provide a qualitative analysis using heatmaps that highlight response regions for several representative generative models. Specifically, we examine the detector's responses on images from BigGAN (object structures), ADM (semantic information), IDDPM (fine-grained features), and PNDM (areas of generative focus). As illustrated in Figure 5, the heatmaps generated by our model identify the structural inconsistencies and statistical anomalies inherent in synthesized images with high precision. For BigGAN, our method captures the checkerboard patterns and grid-like artifacts common in GAN-based architectures. For the diffusion-based models (ADM, IDDPM, and PNDM), the detector highlights the subtle, non-uniform noise distributions and high-frequency discrepancies that often elude standard spatial-domain detectors.

4.3. Evaluation Metrics

While accuracy is a commonly used metric, it possesses significant limitations in the context of forensic image detection. As highlighted in recent comprehensive research on complex signal recognition, accuracy can be highly misleading when class distributions are imbalanced or when the costs of misclassification are asymmetric. In digital forensics, a high accuracy might hide a model’s failure to detect specific types of sophisticated forgeries (low recall), which is unacceptable in security-sensitive applications. Inspired by Chen et al. [16], to provide a multifaceted assessment of the SIDWA, we adopt a comprehensive set of performance metrics including accuracy, precision, recall, and F1-score. Following the methodological recommendations for robust pattern recognition in complex signal processing, these metrics are defined based on the confusion matrix components: True Positives (TP), False Positives (FP), True Negatives (TN), and False Negatives (FN).

1. Precision measures the proportion of correctly identified synthetic images among all samples predicted as forged:

$Precision = \frac{TP}{TP + FP}$ (8)

2. Recall (Sensitivity) quantifies the model’s ability to capture all actual synthetic images, which is critical in forensic scenarios to avoid missing potential threats:

$Recall = \frac{TP}{TP + FN}$ (9)

3. F1-score is the harmonic mean of precision and recall, providing a balanced evaluation especially when dealing with potential class imbalances across different generative model datasets:

$F1\text{-}score = 2 \times \frac{Precision \cdot Recall}{Precision + Recall}$ (10)

The introduction of these metrics follows recent advances in physiological and image-based recognition, where relying solely on accuracy can be misleading if the distribution of real and fake samples is uneven or if the cost of False Negatives significantly outweighs False Positives. By reporting these scores, we ensure the methodological completeness and scientific rigor of our experimental evaluation.
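Equations (8)–(10) translate directly into code. In the sketch below, the function name and the confusion-matrix counts are hypothetical, and returning zero when a denominator vanishes is a convention chosen for this example rather than something the text specifies.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from confusion-matrix counts (Eqs. 8-10)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy counts: 90 synthetic images caught, 10 false alarms, 30 misses.
precision, recall, f1 = precision_recall_f1(tp=90, fp=10, fn=30)
```

With these toy counts, precision is 0.90 while recall is only 0.75, the kind of gap that a single accuracy figure can hide.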

4.4. Performance Comparison

In this section, we conduct an extensive quantitative evaluation to assess the efficacy of the proposed SIDWA model. To ensure a rigorous and holistic appraisal, our experiments span seven diverse generative architectures, ranging from traditional GANs to the latest diffusion-based and auto-regressive models. We compare SIDWA against six baselines representing various detection methodologies, including spatial-domain learning, frequency analysis, and reconstruction-based approaches. As evidenced by the results, SIDWA exhibits superior cross-model generalization, successfully mitigating the performance degradation that specialized detectors often encounter when transitioned between disparate forgery distributions.

To further investigate the interpretability and localization capabilities of our proposed framework, we provide a qualitative analysis using heatmaps across several representative generative models, including BigGAN, ADM, IDDPM, and PNDM. As shown in Figure 5, this localization effectiveness is primarily attributed to the cross-guidance mechanism between the DWT and DSWA. Specifically, the DWT serves as a frequency-selective filter, decomposing the image into multiple sub-bands to isolate high-frequency residues where generative fingerprints are most prominent. In tandem, the DSWA enables the model to adaptively adjust its receptive field, focusing on irregular textures and non-grid-aligned artifacts that traditional fixed-window attentions might overlook. Together, these two components create a synergy that enables our model to distinguish natural from artificial high-frequency textures and artifacts.

As shown in Table 2, while many contemporary detectors optimized for diffusion models experience a catastrophic performance drop when applied to GAN-based architectures (e.g., DIRE achieving only 49.7% on BigGAN), SIDWA maintains a robust accuracy of 88.78%. This stability originates from the inherent cross-domain sensitivity of our dual-path design.

Specifically, GAN-generated images exhibit periodic, grid-like artifacts and high-frequency spectral peaks arising from transposed convolution (upsampling). Our DWT Stem decomposes the input into multi-scale sub-bands, where these systematic ‘checkerboard artifacts’ are significantly magnified in the HH and HL components. Unlike reconstruction-based methods that search for sampling inconsistencies, SIDWA captures these architecture-level fingerprints directly.
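The way a wavelet decomposition isolates checkerboard patterns can be seen with a single-level 2-D Haar transform. The Haar wavelet is an assumption made for this sketch (the wavelet used in the DWT Stem may differ), and the LH/HL naming convention varies between libraries.

```python
def haar_dwt2(img):
    """Single-level orthonormal 2-D Haar transform of an even-sized grid."""
    h, w = len(img), len(img[0])
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for r in range(0, h, 2):
        rows = {name: [] for name in bands}
        for c in range(0, w, 2):
            a, b = img[r][c], img[r][c + 1]
            d, e = img[r + 1][c], img[r + 1][c + 1]
            rows["LL"].append((a + b + d + e) / 2)   # local average
            rows["LH"].append((a + b - d - e) / 2)   # difference across rows
            rows["HL"].append((a - b + d - e) / 2)   # difference across columns
            rows["HH"].append((a - b - d + e) / 2)   # diagonal difference
        for name in bands:
            bands[name].append(rows[name])
    return bands

def energy(band):
    return sum(v * v for row in band for v in row)

# Toy 4x4 checkerboard of +1 / -1 values, mimicking an upsampling artifact.
checkerboard = [[1.0 if (r + c) % 2 == 0 else -1.0 for c in range(4)]
                for r in range(4)]
bands = haar_dwt2(checkerboard)
energies = {name: energy(band) for name, band in bands.items()}
```

For an ideal pixel-level checkerboard, all of the signal energy lands in the HH band, whereas horizontal or vertical stripe patterns would concentrate in the other two detail bands.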

Moreover, the DSWA further enhances this robustness. In GAN-based forgeries, structural anomalies often manifest as global symmetry inconsistencies or local boundary blurriness. DSWA’s ability to adaptively shift its sampling offsets allows the model to perceive these long-range geometric dependencies more flexibly than fixed-kernel Convolutional Neural Networks (CNNs). Consequently, SIDWA successfully integrates low-level frequency cues with high-level structural awareness, ensuring that the detector remains effective even as it transitions from diffusion-based noise patterns to GAN-based upsampling traces.

The superior performance of SIDWA over specialized detectors such as DIRE and AIDE on diffusion-based models (e.g., achieving 98.56% on ADM and 94.73% on SDv1.5) can be attributed to its unique dual-domain feature-capture mechanism.

Specifically, while DIRE relies heavily on the reconstruction error from a predefined diffusion reverse process, its effectiveness is often bottlenecked by the stochastic nature of the reverse sampling, which may inadvertently ‘heal’ subtle structural inconsistencies. In contrast, SIDWA takes a different approach by utilizing DWT to directly extract high-frequency fingerprints (HH sub-band) from the original input. Whereas reconstruction-based methods like DIRE might overlook these inherent ‘stepping noises’ in the diffusion trajectories, SIDWA captures them explicitly through its fingerprint extraction.

Furthermore, compared to AIDE, which uses a fixed-grid attention mechanism, our DSWA provides a dynamic receptive field. Because diffusion-generated artifacts, such as those in SDv1.5, are typically non-rigid and locally concentrated (e.g., unnatural textures in complex backgrounds or skin pores), DSWA can adaptively warp its sampling points to cluster around these elusive localized anomalies. This ‘adaptive zoom’ capability ensures that SIDWA captures fine-grained textural decoherence more effectively than AIDE’s rigid scanning patterns, leading to a more robust and precise detection boundary.

As shown in Table 3, the proposed SIDWA achieves a remarkable balance between detection precision and recall across various generative models. The F1-score consistently aligns with the overall accuracy, particularly on advanced diffusion models such as ADM and GLIDE, where all metrics exceed 96%. This equilibrium indicates that SIDWA effectively minimizes both false alarms (high precision) and missed detections (high recall), demonstrating its robustness as a reliable forensic tool for high-fidelity synthetic image detection.

To evaluate the effectiveness of the proposed SIDWA, we conduct a comprehensive comparison with eight state-of-the-art (SOTA) detection methods. These include fingerprint-based methods (e.g., CNNDet [47] and FreqDet [35]), reconstruction-based methods (e.g., DIRE [50]), and recent large-scale pre-training or multi-modal-inspired methods (e.g., UnivFD [7], DeFake [72], RINE [73], and SPAI [74]). For a fair comparison, the performance metrics of all baseline methods are directly sourced from the latest comprehensive study SPAI [74]. Our SIDWA is trained and evaluated on the same benchmark datasets. We evaluate performance across nine representative generative models, ranging from classic GANs to the latest diffusion-based architectures.

As summarized in Table 4, our method demonstrates robust overall performance across the nine generative models, achieving an average accuracy of 84.6% and outperforming traditional spatial- and frequency-based baselines.

A key observation is that when early detection methods, such as CNNDet and LGrad, are applied to advanced diffusion models like SD3 and DALLE3, their performance drops significantly (accuracies as low as 12.7%). In contrast, SIDWA maintains high detection stability, achieving 73.2% on SD3 and 78.3% on DALLE3. This stability suggests that our dual-branch architecture, which combines semantic features with DWT-based frequency cues, effectively captures the intrinsic forgery fingerprints that remain consistent across evolving generative technologies.

Performance vs. Computational Trade-off: Regarding model complexity, while SIDWA has a higher parameter count (98.3 M) than lightweight models like SPAI or RINE, it offers a superior balance between capacity and accuracy. Specifically, our model achieves a 2.2% to 3.5% improvement in average accuracy over UnivFD and RINE. Furthermore, compared to the heavy-duty DIRE model (150.2 M Params, 85.6 G FLOPs), SIDWA attains a 10.0% higher average accuracy while consuming approximately 71.3% fewer FLOPs. This indicates that the parameters in SIDWA are more efficiently utilized through the Deformable Window and Cross-Attention mechanisms.

Computational Complexity and Efficiency: To address concerns regarding the complexity of the dual-branch architecture, we report the operational efficiency of SIDWA. Evaluated on an NVIDIA A100 GPU, the model maintains a relatively low computational footprint with 24.6 G FLOPs and achieves a competitive inference time of 18.2 ms per image, confirming its potential for real-time forensic applications. In terms of memory consumption, SIDWA requires only 1.8 GB of VRAM for single-image inference, making it highly accessible for deployment on mid-range hardware. During our high-performance training phase (batch size of 64 distributed across 4×A100 GPUs), the model exhibits stable convergence and efficient memory utilization—occupying approximately 35.5% of available VRAM per GPU. These metrics collectively demonstrate that SIDWA strikes an optimal balance between architectural capacity and practical throughput, effectively mitigating the overhead typically associated with deformable attention and wavelet decomposition.
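As a quick check on the reported figures, the FLOPs reduction relative to DIRE follows from the two values quoted above (24.6 G for SIDWA and 85.6 G for DIRE):

$1 - \frac{24.6}{85.6} \approx 1 - 0.287 = 0.713$

which corresponds to the stated reduction of approximately 71.3%.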

Robustness to Diverse Generators: While RINE and SPAI exhibit exceptional performance on specific generators such as Midjourney (MJ) and SD2, their performance fluctuates considerably across the entire spectrum (e.g., RINE drops to 39.1% on SD3). SIDWA provides a more balanced profile, consistently staying above or near the 80% mark for most test sets. This cross-generator robustness is critical for real-world applications where the source of a synthetic image is often unknown.

4.5. Ablation Study

To verify the efficacy of the key components in our proposed dual-branch framework, we conduct a series of ablation studies on SIDset. The baseline and its variants are defined as follows:

1. Semantic Branch (CNN): The model only retains the semantic branch with the FPN, excluding all frequency-related components and the cross-attention mechanism.
2. +DWT: The frequency branch is maintained, but the DWT is replaced by standard convolutional layers to extract frequency-domain features.
3. +Deformable Window: The dual-branch architecture is kept, but the DSWA module is replaced by a standard window-based attention mechanism.
4. +Cross-Attention: The dual-branch architecture is kept, but the DSWA module is replaced by a standard cross-attention mechanism without dynamic offset generation.
5. SIDWA (Full): Combines both the DWT Stem and the DSWA module.

To intuitively evaluate the discriminative power of the proposed framework, we visualize the attention maps generated by different model variants. As illustrated in Figure 6, several key observations can be made regarding the localization of generative artifacts.

The Baseline model, which relies solely on the semantic branch, produces highly diffused attention maps that fail to localize specific forgery traces. This is because semantic features primarily capture global object structures and textures, which are often similar across real and generated images, leading to a lack of focus on subtle, high-frequency artifacts that are critical for detection. The addition of the DWT branch significantly enhances the model’s ability to capture high-frequency components, resulting in more focused attention on potential forgery regions. However, the heatmaps still exhibit some degree of center shift, likely due to the lack of spatial guidance from the semantic branch, leading to misalignment between the frequency features and the actual artifact locations. The variant with only Deformable Window shows improved localization compared to the baseline, as the adaptive attention mechanism allows for better focus on irregular textures. However, without the frequency branch, it may still struggle to capture high-frequency artifacts, leading to less precise localization and more scattered attention across the image. The cross-attention variant, while providing some interaction between the branches, lacks the dynamic offset generation of DSWA, leading to suboptimal alignment between semantic and frequency features. This results in attention maps that are somewhat more focused than the baseline but still fail to precisely localize the forgery traces, indicating that both the frequency branch and the adaptive attention mechanism are crucial for optimal performance.

In contrast, our full model integrates both DWT and DSWA. It achieves the most precise and concentrated localization of high-frequency forgeries. This superiority stems from the frequency-semantic resonance calibration mechanism. The semantic branch provides essential spatial priors. The DSWA module acts as a bridge, aligning these priors with the frequency footprints from the DWT branch. By constraining cross-modal interaction within a sliding window, our model filters out redundant background noise. This forces the network to focus on content-aware artifacts. Such synergy ensures that attention is sharp and accurately anchored to localized generative traces. Ultimately, this demonstrates the necessity of dual-branch collaborative learning.

Significance of Frequency Branch: The integration of the frequency branch with Discrete Wavelet Transform (Freq) yields a substantial performance gain over the baseline (only semantic branch). Specifically, the accuracy improves remarkably from 84.62% to 91.25%. This significant boost underscores the critical role of frequency-domain cues in capturing subtle, high-frequency forgery traces that are often overlooked in the spatial domain.

Impact of DW and CA Modules: Beyond the frequency analysis, both the Deformable Window (DW) and cross-attention (CA) modules are shown to independently enhance the model’s discriminative power. The introduction of the CA mechanism, in particular, contributes a 3.94% improvement in accuracy compared to the base semantic model. This enhancement suggests that global dependency modeling and adaptive feature alignment are effective in refining feature representations.

Synergistic Integration and Global Optimality: As shown in Table 5, the full SIDWA model performs best, reaching an accuracy of 95.74% and an AUC of 98.52%. The fact that the full configuration outperforms all partial combinations confirms a synergistic effect among the proposed components. This indicates that the joint modeling of semantic-frequency features, coupled with flexible spatial attention, provides the most robust and comprehensive representation for complex forgery detection.

Statistical Significance Analysis: To verify that the performance gains of the proposed SIDWA are consistent and not a result of random initialization, we conduct a rigorous statistical analysis by repeating each ablation experiment five times with different random seeds. As summarized in Table 6 and Figure 7, the full variant consistently outperforms all other configurations, achieving a mean accuracy of 94.40% ± 0.94%. Specifically, a two-tailed t-test is performed to compare the full variant against each sub-configuration. The results indicate that the integration of both the DWT-based frequency branch and the DSWA module provides a statistically significant improvement over the next-best variants (Variant 5 and Variant 6), with p-values of 0.046 and 0.027, respectively (p < 0.05). The substantial gap between the full variant and the Baseline (Variant 1, p < 0.001) further highlights the robust synergy of the proposed components. This statistical evidence confirms that the collaborative design of our dual-branch framework is essential for achieving optimal performance in synthetic image detection.
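A two-tailed comparison over five seeds per configuration can be sketched with a pooled-variance two-sample t statistic. The text does not say whether the test was paired or unpaired, so the unpaired, equal-variance form here is an assumption; the scores are toy numbers, not values from Table 6, and the decision uses the standard two-tailed critical value for 8 degrees of freedom at the 0.05 level instead of an exact p-value.

```python
import math
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Student's two-sample t statistic (equal variances assumed) and its d.o.f."""
    na, nb = len(sample_a), len(sample_b)
    pooled_var = ((na - 1) * variance(sample_a)
                  + (nb - 1) * variance(sample_b)) / (na + nb - 2)
    t_stat = (mean(sample_a) - mean(sample_b)) / math.sqrt(
        pooled_var * (1 / na + 1 / nb))
    return t_stat, na + nb - 2

# Toy scores for two configurations over five seeds (hypothetical values).
config_a = [10.0, 11.0, 12.0, 13.0, 14.0]
config_b = [7.0, 8.0, 9.0, 10.0, 11.0]
t_stat, dof = pooled_t(config_a, config_b)

T_CRIT_DOF8 = 2.306   # two-tailed critical value, alpha = 0.05, 8 degrees of freedom
significant = abs(t_stat) > T_CRIT_DOF8
```

If the same seeds are shared across configurations, a paired test on the per-seed differences would be the more natural choice and typically has more power.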

5. Conclusions

In this paper, we present SIDWA, a novel dual-branch framework designed to combat the growing challenges of generative image forgery. By integrating a DWT Stem with a Deformable Sliding Window Cross-Attention (DSWA) module, our model achieves a sophisticated synergy between frequency-domain artifact localization and adaptive spatial inspection. Specifically, the introduced learnable offset mechanism within DSWA allows the network to dynamically warp its receptive field toward non-rigid forgery traces, such as distorted textures and inconsistent boundaries. Furthermore, the Frequency-Semantic Resonance Projector (FSRP) strategy effectively bridges the gap between low-level spectral anomalies and high-level semantic context, ensuring that the attention mechanism remains focused on forgery-specific footprints.

Methodologically, we justified the selection of DWT over alternative signal decomposition techniques on the grounds of its losslessness and frequency separability, and validated a deterministic grid search strategy for hyperparameter optimization to ensure high reproducibility. Experimental evaluations on the SIDset benchmark demonstrate that SIDWA significantly outperforms existing methods. Moving beyond simple accuracy, our comprehensive evaluation using precision, recall, and F1-score, supported by statistical significance testing (p < 0.05 across five independent trials), confirms that the performance gains are both robust and scientifically meaningful. The model maintains a competitive inference time of 18.2 ms, striking a favorable balance between detection efficacy and computational efficiency.
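The losslessness and frequency separability attributed to the DWT can be illustrated with a one-level 2D Haar transform. This is a minimal sketch, not the DWT Stem itself: the choice of the Haar wavelet, the sub-band naming (LL, LH, HL, HH), and the function names `haar_dwt2` and `haar_idwt2` are assumptions made here for illustration.

```python
def haar_dwt2(img):
    """One-level orthonormal 2D Haar transform of a 2D list with even height and width."""
    h, w = len(img), len(img[0])
    ll = [[0.0] * (w // 2) for _ in range(h // 2)]  # low-frequency approximation
    lh = [[0.0] * (w // 2) for _ in range(h // 2)]  # detail across columns
    hl = [[0.0] * (w // 2) for _ in range(h // 2)]  # detail across rows
    hh = [[0.0] * (w // 2) for _ in range(h // 2)]  # diagonal detail
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = img[i][j], img[i][j + 1]
            c, d = img[i + 1][j], img[i + 1][j + 1]
            r, s = i // 2, j // 2
            ll[r][s] = (a + b + c + d) / 2
            lh[r][s] = (a - b + c - d) / 2
            hl[r][s] = (a + b - c - d) / 2
            hh[r][s] = (a - b - c + d) / 2
    return ll, lh, hl, hh


def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the image from its four sub-bands."""
    h2, w2 = len(ll), len(ll[0])
    img = [[0.0] * (2 * w2) for _ in range(2 * h2)]
    for r in range(h2):
        for s in range(w2):
            p, q, u, v = ll[r][s], lh[r][s], hl[r][s], hh[r][s]
            img[2 * r][2 * s] = (p + q + u + v) / 2
            img[2 * r][2 * s + 1] = (p - q + u - v) / 2
            img[2 * r + 1][2 * s] = (p + q - u - v) / 2
            img[2 * r + 1][2 * s + 1] = (p - q - u + v) / 2
    return img


# Toy 4x4 "image" with integer intensities (illustrative values only).
image = [
    [10, 12, 200, 202],
    [11, 13, 201, 203],
    [50, 50, 50, 50],
    [50, 50, 90, 90],
]
bands = haar_dwt2(image)
restored = haar_idwt2(*bands)
print(restored == image)
```

Flat regions of the toy image produce zero detail coefficients, so only the LL band carries their energy, while the inverse transform recovers every pixel from the four sub-bands.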

In the future, we aim to address the limitations of current optimization strategies by exploring self-adaptive hyperparameter tuning frameworks to further enhance convergence in complex parameter landscapes. Additionally, we plan to extend the generalizability of SIDWA to video-based deepfake detection, where temporal–spatial dependencies are paramount. Inspired by recent advances in vision–language cyclic interactions and non-linear feature decomposition, we also intend to explore large-scale pre-training to enhance the forensic interpretability of the model, providing more granular explanations for detected generative artifacts.

Author Contributions

Conceptualization, L.L.; Methodology, L.L.; Software, L.L.; Validation, L.L. and J.S.; Formal analysis, L.L.; Investigation, L.L.; Resources, T.L. and J.S.; Data curation, T.L.; Writing—original draft, L.L.; Visualization, L.L., T.L. and J.S.; Supervision, K.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The data presented in this study are available from multiple sources. The publicly available datasets GenImage and DIRE were obtained from their respective official repositories: GenImage is available at https://genimage-dataset.github.io/ (accessed on 27 January 2026) and DIRE dataset components can be found at https://github.com/ZhendongWang6/DIRE (accessed on 27 January 2026). The custom dataset collected specifically for this research is available on request from the corresponding author due to ongoing research or privacy concerns.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144.
2. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, New Orleans, LA, USA, 18–24 June 2022; pp. 10684–10695.
3. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
4. Nie, S.; Zhu, F.; You, Z.; Zhang, X.; Ou, J.; Hu, J.; Zhou, J.; Lin, Y.; Wen, J.R.; Li, C. Large language diffusion models. arXiv 2025, arXiv:2502.09992.
5. Chu, B.; Xu, X.; Wang, X.; Zhang, Y.; You, W.; Zhou, L. Fire: Robust detection of diffusion-generated images via frequency-guided reconstruction error. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 12830–12839.
6. Qian, Y.; Yin, G.; Sheng, L.; Chen, Z.; Shao, J. Thinking in frequency: Face forgery detection by mining frequency-aware clues. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 86–103.
7. Ojha, U.; Li, Y.; Lee, Y.J. Towards universal fake image detectors that generalize across generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 24480–24489.
8. Yan, S.; Li, O.; Cai, J.; Hao, Y.; Jiang, X.; Hu, Y.; Xie, W. A sanity check for ai-generated image detection. arXiv 2024, arXiv:2406.19435.
9. Zou, Y.; Li, P.; Li, Z.; Huang, H.; Cui, X.; Liu, X.; Zhang, C.; He, R. Survey on ai-generated media detection: From non-mllm to mllm. arXiv 2025, arXiv:2502.05240.
10. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
11. Wei, C.; Alenezi, F.; Chen, J.; Wang, H.; Polat, K. Nonlinear Feature Decomposition and Deep Temporal-Spatial Learning for Single-Channel sEMG-Based Lower Limb Motion Recognition. IEEE Sens. J. 2025, 26, 4120–4126.
12. Shen, X.; Li, L.; Ma, Y.; Xu, S.; Liu, J.; Yang, Z.; Shi, Y. VLCIM: A vision-language cyclic interaction model for industrial defect detection. IEEE Trans. Instrum. Meas. 2025, 74, 2538713.
13. Zhang, Y.; Wang, Z.; Huang, M.; Li, M.; Zhang, J.; Wang, S.; Zhang, J.; Zhang, H. S2DBFT: Spectral-spatial dual-branch fusion transformer for hyperspectral image classification. IEEE Trans. Geosci. Remote Sens. 2025, 63, 5525517.
14. Chen, J.; Wang, Q.; Peng, W.; Xu, H.; Li, X.; Xu, W. Disparity-based multiscale fusion network for transportation detection. IEEE Trans. Intell. Transp. Syst. 2022, 23, 18855–18863.
15. Xue, B.; Zheng, Q.; Li, Z.; Wang, J.; Mu, C.; Yang, J.; Feng, X.; Fan, H.; Li, X. Isar weak feature enhancement with perturbation defense using hybrid clustering oversegmentation. IEEE Trans. Aerosp. Electron. Syst. 2024, 60, 6256–6274.
16. Chen, J.; Cui, Y.; Wei, C.; Polat, K.; Alenezi, F. Advances in EEG-based emotion recognition: Challenges, methodologies, and future directions. Appl. Soft Comput. 2025, 180, 113478.
17. Tu, B.; Zhou, T.; Liu, B.; He, Y.; Li, J.; Plaza, A. Multi-scale autoencoder suppression strategy for hyperspectral image anomaly detection. IEEE Trans. Image Process. 2025, 34, 5115–5130.
18. Chen, J.; Fan, F.; Wei, C.; Polat, K.; Alenezi, F. Decoding driving states based on normalized mutual information features and hyperparameter self-optimized Gaussian kernel-based radial basis function extreme learning machine. Chaos Solitons Fractals 2025, 199, 116751.
19. Radford, A.; Metz, L.; Chintala, S. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv 2015, arXiv:1511.06434.
20. Arjovsky, M.; Chintala, S.; Bottou, L. Wasserstein generative adversarial networks. In Proceedings of the International Conference on Machine Learning. PMLR, Sydney, Australia, 6–11 August 2017; pp. 214–223.
21. Mao, X.; Li, Q.; Xie, H.; Lau, R.Y.; Wang, Z.; Paul Smolley, S. Least squares generative adversarial networks. In Proceedings of the IEEE international Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2794–2802.
22. Mirza, M.; Osindero, S. Conditional generative adversarial nets. arXiv 2014, arXiv:1411.1784.
23. Chen, X.; Duan, Y.; Houthooft, R.; Schulman, J.; Sutskever, I.; Abbeel, P. Infogan: Interpretable representation learning by information maximizing generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems 29, Barcelona, Spain, 5–10 December 2016.
24. Karras, T.; Aila, T.; Laine, S.; Lehtinen, J. Progressive growing of gans for improved quality, stability, and variation. arXiv 2017, arXiv:1710.10196.
25. Donahue, J.; Simonyan, K. Large scale adversarial representation learning. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019.
26. Karras, T.; Laine, S.; Aila, T. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 4401–4410.
27. Song, J.; Meng, C.; Ermon, S. Denoising diffusion implicit models. arXiv 2020, arXiv:2010.02502.
28. Nichol, A.; Dhariwal, P.; Ramesh, A.; Shyam, P.; Mishkin, P.; McGrew, B.; Sutskever, I.; Chen, M. Glide: Towards photorealistic image generation and editing with text-guided diffusion models. arXiv 2021, arXiv:2112.10741.
29. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical text-conditional image generation with clip latents. arXiv 2022, arXiv:2204.06125.
30. Yang, B.; Bender, G.; Le, Q.V.; Ngiam, J. Condconv: Conditionally parameterized convolutions for efficient inference. In Proceedings of the Advances in Neural Information Processing Systems 32, Vancouver, BC, Canada, 8–14 December 2019.
31. Chen, Y.; Dai, X.; Liu, M.; Chen, D.; Yuan, L.; Liu, Z. Dynamic convolution: Attention over convolution kernels. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11030–11039.
32. Li, C.; Yao, A. Kernelwarehouse: Rethinking the design of dynamic convolution. arXiv 2024, arXiv:2406.07879.
33. Chen, L.; Gu, L.; Li, L.; Yan, C.; Fu, Y. Frequency Dynamic Convolution for Dense Image Prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 30178–30188.
34. Chen, J.; Cui, Y.; Wei, C.; Polat, K.; Alenezi, F. Driver fatigue detection using EEG-based graph attention convolutional neural networks: An end-to-end learning approach with mutual information-driven connectivity. Appl. Soft Comput. 2025, 186, 114097.
35. Frank, J.; Eisenhofer, T.; Schönherr, L.; Fischer, A.; Kolossa, D.; Holz, T. Leveraging frequency analysis for deep fake image recognition. In Proceedings of the International Conference on Machine Learning, PMLR, Online, 13–18 July 2020; pp. 3247–3258.
36. Durall, R.; Keuper, M.; Keuper, J. Watch your Up-Convolution: CNN Based Generative Deep Neural Networks are Failing to Reproduce Spectral Distributions. arXiv 2020, arXiv:2003.01826.
37. Liu, Z.; Qi, X.; Torr, P.H. Global texture enhancement for fake face detection in the wild. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8060–8069.
38. Liu, L.; Liu, J.; Yuan, S.; Slabaugh, G.; Leonardis, A.; Zhou, W.; Tian, Q. Wavelet-based dual-branch network for image demoiréing. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 86–102.
39. Wolter, M.; Blanke, F.; Heese, R.; Garcke, J. Wavelet-packets for deepfake image analysis and detection. Mach. Learn. 2022, 111, 4295–4327.
40. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141.
41. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
42. Dosovitskiy, A. An image is worth 16 × 16 words: Transformers for image recognition at scale. arXiv 2020, arXiv:2010.11929.
43. Cai, H.; Li, J.; Hu, M.; Gan, C.; Han, S. Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 17302–17313.
44. Yun, S.; Ro, Y. SHViT: Single-Head Vision Transformer with Memory Efficient Macro Design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 16–22 June 2024; pp. 5756–5767.
45. Wang, A.; Chen, H.; Lin, Z.; Han, J.; Ding, G. LSNet: See Large, Focus Small. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 9718–9729.
46. Yin, L.; Wang, L.; Lu, S.; Wang, R.; Yang, Y.; Yang, B.; Liu, S.; AlSanad, A.; AlQahtani, S.A.; Yin, Z.; et al. Convolution-Transformer for Image Feature Extraction. Comput. Model. Eng. Sci. (CMES) 2024, 141, 87–106.
47. Wang, S.Y.; Wang, O.; Zhang, R.; Owens, A.; Efros, A.A. CNN-generated images are surprisingly easy to spot... for now. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 8695–8704.
48. Li, W.; He, P.; Li, H.; Wang, H.; Zhang, R. Detection of GAN-generated images by estimating artifact similarity. IEEE Signal Process. Lett. 2021, 29, 862–866.
49. Baraheem, S.S.; Nguyen, T.V. Ai vs. ai: Can ai detect ai-generated images? J. Imaging 2023, 9, 199.
50. Wang, Z.; Bao, J.; Zhou, W.; Wang, W.; Hu, H.; Chen, H.; Li, H. Dire for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 22445–22455.
51. Zhang, X.; Karaman, S.; Chang, S.F. Detecting and simulating artifacts in gan fake images. In Proceedings of the 2019 IEEE International Workshop on Information Forensics and Security (WIFS), Delft, The Netherlands, 9–12 December 2019; IEEE: New York, NY, USA, 2019; pp. 1–6.
52. Radford, A.; Kim, J.W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. Learning transferable visual models from natural language supervision. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 18–24 July 2021; pp. 8748–8763.
53. Chai, L.; Bau, D.; Lim, S.N.; Isola, P. What makes fake images detectable? understanding properties that generalize. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020; Springer: Cham, Switzerland, 2020; pp. 103–120.
54. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
55. Ju, Y.; Jia, S.; Ke, L.; Xue, H.; Nagano, K.; Lyu, S. Fusing global and local features for generalized ai-synthesized image detection. In Proceedings of the 2022 IEEE International Conference on Image Processing (ICIP), Bordeaux, France, 16–19 October 2022; IEEE: New York, NY, USA, 2022; pp. 3465–3469.
56. Yao, T.; Pan, Y.; Li, Y.; Ngo, C.W.; Mei, T. Wave-vit: Unifying wavelet and transformers for visual representation learning. In Proceedings of the European Conference on Computer Vision, Tel Aviv, Israel, 23–27 October 2022; Springer: Cham, Switzerland, 2022; pp. 328–345.
57. Huang, J.; Huang, R.; Xu, J.; Peng, S.; Duan, Y.; Deng, L.J. Wavelet-assisted multi-frequency attention network for pansharpening. In Proceedings of the AAAI Conference on Artificial Intelligence, Philadelphia, PA, USA, 25 February–4 March 2025; Volume 39, pp. 3662–3670.
58. Wang, Q.; Li, Z.; Zhang, S.; Chi, N.; Dai, Q. WaveFusion: A novel wavelet vision transformer with saliency-guided enhancement for multimodal image fusion. IEEE Trans. Circuits Syst. Video Technol. 2025, 35, 7526–7542.
59. Finder, S.E.; Amoyal, R.; Treister, E.; Freifeld, O. Wavelet convolutions for large receptive fields. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 363–380.
60. Zhu, M.; Chen, H.; Yan, Q.; Huang, X.; Lin, G.; Li, W.; Tu, Z.; Hu, H.; Hu, J.; Wang, Y. Genimage: A million-scale benchmark for detecting ai-generated image. Adv. Neural Inf. Process. Syst. 2023, 36, 77771–77782.
61. Kumari, N.; Zhang, B.; Wang, S.Y.; Shechtman, E.; Zhang, R.; Zhu, J.Y. Ablating concepts in text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Paris, France, 1–6 October 2023; pp. 22691–22702.
62. Liu, L.; Ren, Y.; Lin, Z.; Zhao, Z. Pseudo numerical methods for diffusion models on manifolds. arXiv 2022, arXiv:2202.09778.
63. Nichol, A.Q.; Dhariwal, P. Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning. PMLR, Virtual, 18–24 July 2021; pp. 8162–8171.
64. Zhang, B.; Luo, L.; Chen, Y.; Nie, J.; Liu, X.; Guo, D.; Zhao, Y.; Li, S.; Hao, Y.; Yao, Y.; et al. Wukong: Towards a scaling law for large-scale recommendation. arXiv 2024, arXiv:2403.02545.
65. Cherti, M.; Beaumont, R.; Wightman, R.; Wortsman, M.; Ilharco, G.; Gordon, C.; Schuhmann, C.; Schmidt, L.; Jitsev, J. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 2818–2829.
66. Cao, S.; Chen, H.; Chen, P.; Cheng, Y.; Cui, Y.; Deng, X.; Dong, Y.; Gong, K.; Gu, T.; Gu, X.; et al. Hunyuanimage 3.0 technical report. arXiv 2025, arXiv:2509.23951.
67. Seedream, T.; Chen, Y.; Gao, Y.; Gong, L.; Guo, M.; Guo, Q.; Guo, Z.; Hou, X.; Huang, W.; Huang, Y.; et al. Seedream 4.0: Toward next-generation multimodal image generation. arXiv 2025, arXiv:2509.20427.
68. Black Forest Labs. FLUX.2: Frontier Visual Intelligence. 2025. Available online: https://bfl.ai/ (accessed on 12 January 2026).
69. Loshchilov, I.; Hutter, F. Decoupled Weight Decay Regularization. In Proceedings of the International Conference on Learning Representations (ICLR), New Orleans, LA, USA, 6–9 May 2019.
70. Zhong, N.; Xu, Y.; Qian, Z.; Zhang, X. Rich and poor texture contrast: A simple yet effective approach for ai-generated image detection. arXiv 2023, arXiv:2311.12397.
71. Lou, M.; Yu, Y. OverLoCK: An Overview-first-Look-Closely-next ConvNet with Context-Mixing Dynamic Kernels. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 128–138.
72. Sha, Z.; Li, Z.; Yu, N.; Zhang, Y. De-fake: Detection and attribution of fake images generated by text-to-image generation models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, Copenhagen, Denmark, 26–30 November 2023; pp. 3418–3432.
73. Koutlis, C.; Papadopoulos, S. Leveraging representations from intermediate encoder-blocks for synthetic image detection. In Proceedings of the European Conference on Computer Vision, Milan, Italy, 29 September–4 October 2024; Springer: Cham, Switzerland, 2024; pp. 394–411.
74. Karageorgiou, D.; Papadopoulos, S.; Kompatsiaris, I.; Gavves, E. Any-resolution ai-generated image detection by spectral learning. In Proceedings of the Computer Vision and Pattern Recognition Conference, Nashville, TN, USA, 11–15 June 2025; pp. 18706–18717.
75. Tan, C.; Zhao, Y.; Wei, S.; Gu, G.; Wei, Y. Learning on gradients: Generalized artifacts representation for gan-generated images detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 12105–12114.

Figure 1. SIDWA: A collaborative dual-branch framework that re-examines generative artifact detection through the lens of multi-spectral decomposition and adaptive cross-modal interaction.

Figure 2. DWT Stem.

Figure 3. Frequency-Semantic Resonance Projector.

Figure 4. DSWA block.

Figure 5. Visualization of frequency artifacts in different generative models. The top row shows generated images from BigGAN, ADM, IDDPM and PNDM. The bottom row presents the corresponding heatmaps that highlight the frequency artifacts detected by our SIDWA model. These heatmaps reveal distinct artifact patterns characteristic of each generative model, demonstrating the effectiveness of our approach in capturing subtle frequency discrepancies.

Figure 6. Heatmaps from the ablation study on the SIDset dataset. SB: semantic branch; Freq: frequency branch with DWT; DW: Deformable Window; CA: cross-attention; Ours: full model.

Figure 7. Ablation study of the proposed components on the SIDset dataset. Base: semantic branch only; Freq: frequency branch with DWT; DW: Deformable Window; CA: cross-attention.

Table 1. SIDset contents.

Image Source	Generator	Images
DiffusionForensics	ADM	40,000
	PNDM	40,000
	IDDPM	40,000
	Real	40,000
GenImage	ADM	40,000
	Midjourney	40,000
	VQDM	40,000
	Real	40,000
Ours	HunyuanImage	5000
	Seedream	5000
	FLUX.2	5000
	Real	15,000

Table 2. Comparison of detection performance (ACC/AUC) between our method and baseline methods.

Generator	CNNSpot	FreDect	UnivFD	LNP	AIDE	DIRE	SIDWA (Ours)
	ACC/AUC	ACC/AUC	ACC/AUC	ACC/AUC	ACC/AUC	ACC/AUC	ACC/AUC
BigGAN	71.17/0.78	81.97/0.88	95.08/0.98	77.75/0.84	83.95/0.90	49.70/0.51	88.78/0.94
ADM	60.39/0.66	63.42/0.69	66.87/0.72	84.73/0.91	93.43/0.97	98.25/0.99	98.56/0.99
Glide	58.07/0.62	54.13/0.59	62.46/0.68	80.52/0.87	95.09/0.98	92.42/0.96	96.45/0.99
Midjourney	51.39/0.53	45.87/0.48	56.13/0.61	65.55/0.72	77.20/0.83	89.45/0.94	86.84/0.91
VQDM	56.46/0.60	77.80/0.83	85.31/0.91	74.46/0.80	95.16/0.98	91.90/0.96	96.32/0.99
wukong	51.03/0.52	40.30/0.44	70.93/0.78	82.06/0.88	93.55/0.97	90.90/0.95	92.65/0.97
SDv1.5	50.53/0.52	39.21/0.42	63.49/0.68	85.67/0.92	92.85/0.97	91.63/0.96	94.73/0.98

Table 3. Comprehensive performance evaluation of SIDWA across different image generators.

Generator	ACC (%)	AUC	Precision (%)	Recall (%)	F1-Score (%)
BigGAN	88.78	0.94	89.12	88.45	88.78
ADM	98.56	0.99	98.40	98.72	98.56
Glide	96.45	0.99	96.78	96.12	96.45
Midjourney	86.84	0.91	87.25	86.43	86.84
VQDM	96.32	0.99	96.05	96.59	96.32
wukong	92.65	0.97	93.02	92.28	92.65
SDv1.5	94.73	0.98	95.10	94.36	94.73

Table 4. Performance comparison with state-of-the-art methods across various generative models. We report accuracy (%) for detection performance and parameters (M) and FLOPs (G) for computational complexity.

Approach	Params (M)↓	FLOPs (G)↓	Glide	Flux	DALLE2	SD2	SD3	GigaGAN	MJv5	MJv6.1	DALLE3	Avg.↑
CNNDet. [47]	23.5	4.1	59.2	39.8	71.5	57.5	30.2	73.4	48.8	56.7	23.5	51.2
FreqDet. [35]	26.7	5.8	43.6	36.5	47.4	42.5	69.8	63.2	36.9	27.5	42.2	45.5
LGrad [75]	44.2	10.5	76.5	74.9	85.7	60.7	12.7	89.9	69.2	79.6	30.0	64.4
DIRE [50]	150.2	85.6	82.4	78.5	81.2	85.6	42.3	79.1	80.5	76.2	65.4	74.6
UnivFD [7]	45.8	12.3	88.4	85.2	89.1	90.3	55.4	84.2	88.7	82.1	78.5	82.4
DeFake [72]	35.6	9.4	86.1	90.5	41.4	66.2	87.7	71.7	67.0	87.5	93.3	76.8
RINE [73]	32.4	8.2	95.6	93.0	93.0	96.6	39.1	92.9	96.4	81.2	41.8	81.1
SPAI [74]	29.1	7.4	90.2	83.0	91.1	96.5	75.9	85.4	94.5	84.0	90.2	87.9
SIDWA (Ours)	98.3	24.6	96.4	79.4	88.3	88.6	73.2	87.8	86.4	83.7	78.3	84.6

Table 5. Ablation study of the proposed components on the SIDset dataset. SB: semantic branch; Freq: frequency branch with DWT; DW: Deformable Window; CA: cross-attention.

Variant	SB	Freq	DW	CA	Accuracy (%)	AUC (%)
1 (Base)	✓				84.62	89.15
2	✓	✓			91.25	94.88
3	✓		✓		87.40	91.63
4	✓			✓	88.56	92.41
5	✓	✓	✓		92.84	95.92
6	✓	✓		✓	93.41	96.57
7	✓		✓	✓	90.12	93.84
Full	✓	✓	✓	✓	95.74	98.52

Table 6. Ablation study results reported as accuracy (%) across five independent runs with different random seeds. p-values are calculated relative to the full variant using a two-tailed t-test to indicate statistical significance (n = 5).

Variant	Seed 1	Seed 2	Seed 3	Seed 4	Seed 5	Mean ± SD	p-Value
1 (Base)	84.62	84.22	84.75	84.69	84.66	84.59 ± 0.21	<0.001
2	91.25	90.70	91.35	91.40	91.18	91.18 ± 0.28	<0.001
3	87.40	87.15	87.55	87.42	87.47	87.40 ± 0.15	<0.001
4	88.56	88.23	88.70	88.62	87.52	88.33 ± 0.48	<0.001
5	92.84	92.17	92.95	93.89	93.85	93.14 ± 0.74	0.046
6	93.41	92.21	93.55	91.45	93.76	92.70 ± 0.81	0.027
7	90.12	89.67	89.22	90.08	91.15	90.05 ± 0.71	<0.001
Full	95.74	93.26	94.82	93.87	94.32	94.40 ± 0.94	-

Note: All seed values and means represent classification accuracy in percentage (%). p-values were derived from a two-tailed t-test comparing each variant to the full model.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Li, L.; Lu, T.; Song, J.; Cheng, K. SIDWA: Synthetic Image Detection Based on Discrete Wavelet Transform Stem and Deformable Sliding Window Cross-Attention. Electronics 2026, 15, 891. https://doi.org/10.3390/electronics15040891

