In this section, we first describe the experimental setup and then present comprehensive results to validate the effectiveness of SIDWA. To evaluate the proposed method’s detection performance in varied postprocessing scenarios, we select multiple representative datasets. These include both GAN-generated and Diffusion-based images, as well as real samples. Our evaluation focuses on the model’s robustness to common image artifacts and perturbations, such as JPEG compression, Gaussian blurring, and rescaling. These challenges mimic real-world transmission and editing processes. This approach enables a thorough assessment of the model’s reliability under degraded conditions. Following standard forensic protocols, we partitioned the datasets into training and test sets. The training set was used for feature learning and classifier optimization. The test set, augmented with various distortion levels, provided an objective assessment of the model’s generalization and cross-manipulation detection capabilities.
4.1. Dataset
To better evaluate the performance of diffusion generation detectors, we compiled a dataset named SIDataset (SIDset), summarized in Table 1. Its images come from three sources: DiffusionForensics [50], part of GenImage [60], and images we collected ourselves.
DiffusionForensics is a relatively simple open-source benchmark. We selected the LSUN Bedroom dataset and the ImageNet subset for experiments. The LSUN Bedroom subset collects bedroom images from LSUN-Bedroom and generates fakes using multiple diffusion models, including ADM [61], PNDM [62], and IDDPM [63]. The training set comprises 30,000 real images and 10,000 images generated by each of ADM, PNDM, and IDDPM. For testing, we selected 10,000 real images and 10,000 generated images from each subset.
GenImage is a standardized, large-scale benchmark for AI-generated image detection. It includes 2,681,167 images: 1,331,167 real and 1,350,000 generated. The dataset covers eight leading generators, namely Midjourney, Stable Diffusion (v1.4/v1.5) [4], ADM, GLIDE [28], Wukong [64], VQDM [65], and BigGAN [25]. These represent various GAN and diffusion models. Images are generated from the 1000 ImageNet class labels, which keeps the numbers of real and generated images nearly equal within each category. We selected three subsets (ADM, Midjourney, and VQDM) as foundational data for training and evaluation. Only part of the dataset was used for training due to computational resource constraints. Specifically, we randomly selected 40,000 real and 40,000 generated images from each subset for training.
Our collected dataset comprises two key components. For the real image collection, we developed a dedicated web crawler to gather photographs from stock-photo and photo-sharing platforms (Unsplash, Pexels, and Flickr) and from news outlets (Xinhua News Agency and BBC). The content spanned multiple news categories, including politics, sports, culture, disasters, and technology, with a focus on socially sensitive events, fraud scenarios, and risk communication cases. After initial image collection through the crawler, we employed the OpenCLIP [65] model for preliminary classification and manually filtered images into ten categories, including people, bedrooms, fruits, and animals. A preprocessing program was also used to remove low-resolution images and those containing Not Safe For Work (NSFW) content, resulting in a final dataset of 5000 images. For image generation, we selected three models: HunyuanImage [66], Seedream [67], and FLUX.2 [68]. For each model, we collected 5000 generated images using prompts of the form “A photo of [class]” (e.g., “A photo of bedroom”), where the class is drawn from ten ImageNet-1k categories.
4.2. Experimental Setup
During data preprocessing, we randomly applied standardization treatments, including flipping, cropping, grayscale conversion, and JPEG compression at specified ratios. These steps remove lighting and contrast effects while preserving important details, establishing a strong foundation for feature extraction and object tracking. To balance computation and retain local micro-textures, we downsampled each image to a fixed resolution and then took a random crop of fixed size. This encouraged the model to learn pixel-level fingerprints rather than global semantics. For stable convergence and strong performance, we used a cosine annealing learning rate schedule that decays from its initial value to a small final value over 100 epochs. Additionally, a linear warm-up over the first five epochs prevented early gradient instability. This adjustment helped the model avoid local minima early and converge precisely later. Training ran for 100 epochs on four NVIDIA A100 GPUs (80 GB each), with a batch size of 64. We selected hyperparameters via grid search on the GenImage validation set for optimal micro-artifact sensitivity and global semantic coherence.
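The warm-up and cosine annealing schedule can be illustrated with a minimal sketch. The function name `lr_at_epoch` and the values of `base_lr` and `min_lr` are placeholder assumptions chosen for illustration; they are not the settings used in our experiments. Only the 100-epoch horizon and the five-epoch linear warm-up come from the description above.

```python
import math

def lr_at_epoch(epoch, total_epochs=100, warmup_epochs=5,
                base_lr=1e-4, min_lr=1e-6):
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine decay.

    base_lr and min_lr are placeholder values, not the paper's settings.
    """
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr on the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the cosine phase completed, in [0, 1].
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs - 1)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

# One learning rate per epoch for the full 100-epoch run.
schedule = [lr_at_epoch(e) for e in range(100)]
```

Under these assumptions the rate rises linearly during the first five epochs, peaks at `base_lr`, and then decreases monotonically to `min_lr` at the final epoch.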
Regarding classifier optimization, we adopted the AdamW [
69] optimizer instead of standard SGD. AdamW effectively decouples weight decay from the gradient update, which is particularly beneficial for our dual-branch SIDWA framework. This approach ensures that the high-frequency features captured by the DWT branch are not prematurely suppressed by aggressive regularization. While SGD with momentum often provides better generalization in some vision tasks, we observed that AdamW converged faster and offered superior stability when handling the diverse artifacts in our SIDset dataset. The learning rate was governed by a cosine annealing scheduler to avoid local minima during late-stage training.
For the SIDset dataset, we split the data into training, validation, and test sets at 8:1:1. Specifically, the training set comprises 280,000 images, while the validation and test sets each comprise 35,000 images. Crucially, within each subset, we maintain a balanced distribution between generated and real images to ensure data consistency. To ensure the evaluation’s objectivity, there is no overlap between the training and test sets.
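A balanced 8:1:1 split of this kind can be sketched as follows. This is an illustrative sketch, not our data pipeline: the function name `stratified_split`, the fixed seed, and the toy list of 2000 sample identifiers are assumptions. Splitting each class separately is one simple way to keep real and generated images balanced inside every subset and to guarantee that no sample appears in two subsets.

```python
import random

def stratified_split(items, ratios=(8, 1, 1), seed=0):
    """Split (sample_id, label) pairs into train/val/test, class by class."""
    rng = random.Random(seed)
    by_label = {}
    for sample_id, label in items:
        by_label.setdefault(label, []).append(sample_id)
    total = sum(ratios)
    splits = ([], [], [])
    for label, ids in sorted(by_label.items()):
        rng.shuffle(ids)
        n_train = len(ids) * ratios[0] // total
        n_val = len(ids) * ratios[1] // total
        # The remainder after train and val goes to the test subset.
        parts = (ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:])
        for split, part in zip(splits, parts):
            split.extend((sample_id, label) for sample_id in part)
    return splits

# Toy stand-in: 1000 real (label 0) and 1000 generated (label 1) identifiers.
toy = [(f"img_{i}", i % 2) for i in range(2000)]
train, val, test = stratified_split(toy)
```

Applied to 350,000 images, the same ratios give the 280,000 / 35,000 / 35,000 partition reported above.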
To comprehensively evaluate the proposed method, we use Accuracy (ACC) and Area Under the ROC Curve (AUC) as the primary metrics. Furthermore, to assess robustness and generalization, we conduct cross-dataset evaluations and test performance under various postprocessing attacks (e.g., JPEG compression and Gaussian blurring).
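For reference, the two primary metrics can be computed as in the following sketch. The labels and scores are toy values invented for illustration, the 0.5 decision threshold is an assumption, and the pairwise AUC formulation is chosen for clarity rather than speed.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples whose thresholded score matches the label."""
    correct = sum((s >= threshold) == bool(y) for y, s in zip(labels, scores))
    return correct / len(labels)

def roc_auc(labels, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: label 1 = generated, label 0 = real.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
acc = accuracy(labels, scores)  # 4 of 6 samples are classified correctly
auc = roc_auc(labels, scores)   # 8 of 9 positive/negative pairs are ranked correctly
```

ACC depends on the chosen threshold, whereas AUC summarizes ranking quality over all thresholds, which is why both are reported.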
Baseline: As a benchmark, we selected representative models in the field of generative image detection over the past few years, spanning from early artifact monitoring to the latest diffusion model feature analysis. (1) CNNSpot [
47] identified common flaws in different generative models and proposed a relatively simple method for effectively detecting images generated by various CNNs, which significantly enhanced detector generalization capabilities through data augmentation strategies. This laid an important foundation for subsequent research in generative image detection. (2) FreDect [
35] revealed a key characteristic of generative images: compared to the variability in spatial-domain pixels, frequency-domain features often exhibit higher stability. Based on this discovery, the model mapped images from the spatial domain to the frequency domain using techniques such as Discrete Cosine Transform (DCT), accurately capturing periodic statistical artifacts introduced by upscaling operations during reconstruction that are difficult for the human eye to detect. Because this underlying mathematical fingerprint is common across different generative frameworks, FreDect can effectively identify forged content generated by unknown algorithms, achieving significant improvements in cross-model generalization. (3) UnivFD [
7] proposed a universal detection framework to address domain differences in generative image detection. Its core approach involves multi-scale feature fusion and attention mechanisms to extract features with strong semantic representation from pre-trained models [
65], thereby compensating for the limitations of traditional detectors, which are overly sensitive to local details. UnivFD’s primary contribution lies in its exceptional universal applicability: by incorporating cross-modal prior knowledge, it achieves unified high-precision recognition across diverse generative architectures (from GANs to diffusion models) and multiple image categories. (4) LNP [
70] discovered that, unlike the natural physical relationships between pixels in real photos, generated images often exhibit minor statistical inconsistencies at fine texture details. Building on this, the LNP model employs a local neighborhood propagation mechanism, treating images as graph structures or using local convolutional operators to detect deviations in spatial correlations between pixels and their surrounding neighborhoods. This approach not only captures global artifacts but also precisely locates local modification traces in images, demonstrating exceptional sensitivity to locally altered or patched (Inpainting) images. (5) AIDE [
8] pioneered a detection paradigm based on differences in reconstruction capability. Its core logic is that, with existing editing models, real images are generally harder than generated images to reconstruct perfectly without detail loss. The model amplifies the inherent structural instability in generated images by comparing consistency losses between original images and their slightly perturbed reconstructions. This transforms the detection task from a traditional classification problem into a robustness evaluation of generative models, effectively addressing the challenge of distinguishing between high-quality generated content and real content. (6) DIRE [50] proposed a detection paradigm based on reconstruction error in diffusion models, grounded in the difference in reversibility between generated and real images during diffusion. DIRE demonstrates that images generated by diffusion models exhibit significantly smaller reconstruction errors than real images when subjected to identical noise addition and removal processes. By computing this reconstruction residual, DIRE projects the images under test into the distribution space of diffusion models, thereby markedly enhancing the structural features of generated images.
Hyperparameter Optimization Strategy: The hyperparameters of the SIDWA framework, including the learning rate, batch size, and weight decay of the optimizer, were determined via a systematic grid search. We selected grid search primarily due to its deterministic nature and high reproducibility within a well-defined search space, which ensure a stable baseline for evaluating the structural contributions of the DWT Stem and DSWA modules. However, we acknowledge the limitations of grid search in terms of computational efficiency compared to more advanced self-optimizing frameworks. For instance, recent studies have introduced self-optimized Gaussian kernel-based radial basis function extreme learning machines (ELMs) [
18], which offer superior adaptability in dynamic parameter landscapes. While our current study prioritizes the transparency of the grid-based search to validate the collaborative dual-branch architecture, exploring self-adaptive optimization strategies to further enhance the convergence efficiency of the cross-attention mechanism remains a promising direction for our future work.
Our model evolves from the OverLoCK [71] backbone, transitioning from three synergistic sub-networks to a collaborative dual-branch architecture consisting of a wavelet-domain frequency branch and a space-domain semantic branch. We fundamentally re-engineer the feature extraction stage and the interaction blocks to incorporate multi-spectral and adaptive spatial awareness.
To further investigate the interpretability and localization capabilities of our proposed framework, we provide a qualitative analysis using heatmaps that highlight response regions for several representative generative models. Specifically, we analyze how BigGAN localizes object structures, ADM captures semantic information, IDDPM maps fine-grained features, and PNDM isolates areas of generative focus. As illustrated in Figure 5, the heatmaps generated by our model demonstrate superior precision in identifying the structural inconsistencies and statistical anomalies inherent in synthesized images. For BigGAN, our method effectively captures the checkerboard patterns and grid-like artifacts common in GAN-based architectures. For the diffusion-based models (ADM, IDDPM, and PNDM), the detector accurately highlights the subtle, non-uniform noise distributions and high-frequency discrepancies that often elude standard spatial-domain detectors.
4.4. Performance Comparison
In this section, we conduct an extensive quantitative evaluation to assess the efficacy of the proposed SIDWA model. To ensure a rigorous and holistic appraisal, our experiments span seven diverse generative architectures, ranging from traditional GANs to the latest diffusion-based and auto-regressive models. We compare SIDWA against six baselines representing various detection methodologies, including spatial-domain learning, frequency analysis, and reconstruction-based approaches. As evidenced by the results, SIDWA exhibits superior cross-model generalization, successfully mitigating the performance degradation that specialized detectors often encounter when transitioned between disparate forgery distributions.
The localization behavior visualized in Figure 5 for BigGAN, ADM, IDDPM, and PNDM is primarily attributed to the cross-guidance mechanism between the DWT and DSWA. Specifically, the DWT serves as a frequency-selective filter, decomposing the image into multiple sub-bands to isolate high-frequency residues where generative fingerprints are most prominent. In tandem, the DSWA enables the model to adaptively adjust its receptive field, focusing on irregular textures and non-grid-aligned artifacts that traditional fixed-window attention might overlook. Together, these two components enable our model to distinguish natural from artificial high-frequency textures and artifacts.
As shown in Table 2, while many contemporary detectors optimized for diffusion models experience a catastrophic performance drop when applied to GAN-based architectures (e.g., DIRE achieving only 49.7% on BigGAN), SIDWA maintains a robust accuracy of 88.78%. This stability originates from the inherent cross-domain sensitivity of our dual-path design.
Specifically, GAN-generated images exhibit periodic, grid-like artifacts and high-frequency spectral peaks arising from transposed convolution (upsampling). Our DWT Stem decomposes the input into multi-spectral sub-bands, where these systematic ‘checkerboard artifacts’ are significantly magnified in the HH and HL components. Unlike reconstruction-based methods that search for sampling inconsistencies, SIDWA captures these architecture-level fingerprints directly.
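The way periodic upsampling patterns concentrate in the detail sub-bands can be illustrated with a single-level 2D Haar transform. This is a toy sketch rather than the DWT Stem itself: the choice of the Haar wavelet, the helper names `haar_dwt2` and `energy`, and the synthetic 8×8 patterns are all assumptions made for illustration. Sub-band naming conventions also differ between libraries; here HL denotes high-pass across columns and low-pass across rows.

```python
def haar_dwt2(image):
    """Single-level orthonormal 2D Haar transform of an even-sized 2D list."""
    bands = {"LL": [], "LH": [], "HL": [], "HH": []}
    for i in range(0, len(image), 2):
        rows = {name: [] for name in bands}
        for j in range(0, len(image[0]), 2):
            a, b = image[i][j], image[i][j + 1]
            c, d = image[i + 1][j], image[i + 1][j + 1]
            rows["LL"].append((a + b + c + d) / 2)  # local average
            rows["LH"].append((a + b - c - d) / 2)  # difference between rows
            rows["HL"].append((a - b + c - d) / 2)  # difference between columns
            rows["HH"].append((a - b - c + d) / 2)  # diagonal difference
        for name in bands:
            bands[name].append(rows[name])
    return bands

def energy(band):
    """Sum of squared coefficients in one sub-band."""
    return sum(v * v for row in band for v in row)

# Synthetic patterns: a one-pixel checkerboard and one-pixel vertical stripes.
checkerboard = [[(i + j) % 2 for j in range(8)] for i in range(8)]
stripes = [[j % 2 for j in range(8)] for i in range(8)]
cb = {name: energy(band) for name, band in haar_dwt2(checkerboard).items()}
st = {name: energy(band) for name, band in haar_dwt2(stripes).items()}
```

In this toy setting all detail energy of the checkerboard falls into HH and all detail energy of the stripes falls into HL, while LH stays empty in both cases, so a classifier that reads these sub-bands sees the periodic pattern in isolation from the image content.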
Moreover, the DSWA further enhances this robustness. In GAN-based forgeries, structural anomalies often manifest as global symmetry inconsistencies or local boundary blurriness. DSWA’s ability to adaptively shift its sampling offsets allows the model to perceive these long-range geometric dependencies more flexibly than fixed-kernel Convolutional Neural Networks (CNNs). Consequently, SIDWA successfully integrates low-level frequency cues with high-level structural awareness, ensuring that the detector remains effective even as it transitions from diffusion-based noise patterns to GAN-based upsampling traces.
The superior performance of SIDWA over specialized detectors such as DIRE and AIDE on diffusion-based models (e.g., achieving 98.56% on ADM and 94.73% on SDv1.5) can be attributed to its unique dual-domain feature-capture mechanism.
Specifically, while DIRE relies heavily on the reconstruction error from a predefined diffusion reverse process, its effectiveness is often bottlenecked by the stochastic nature of the reverse sampling, which may inadvertently ‘heal’ subtle structural inconsistencies. In contrast, SIDWA takes a different approach by utilizing DWT to directly extract high-frequency fingerprints (HH sub-band) from the original input. Whereas reconstruction-based methods like DIRE might overlook these inherent ‘stepping noises’ in the diffusion trajectories, SIDWA captures them explicitly through its fingerprint extraction.
Furthermore, compared to AIDE, which uses a fixed-grid attention mechanism, our DSWA provides a dynamic receptive field. Because diffusion-generated artifacts, such as those in SDv1.5, are typically non-rigid and locally concentrated (e.g., unnatural textures in complex backgrounds or skin pores), DSWA can adaptively warp its sampling points to cluster around these elusive localized anomalies. This ‘adaptive zoom’ capability ensures that SIDWA captures fine-grained textural decoherence more effectively than AIDE’s rigid scanning patterns, leading to a more robust and precise detection boundary.
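The difference between a fixed window and a deformed one can be illustrated with a small sketch. It is not the DSWA module: the offsets are hand-set here, whereas in DSWA they would be predicted by the network, and the helper names `bilinear_sample` and `sample_window` are hypothetical. The sketch only shows the sampling step, in which each grid point of a window is shifted by its own offset and read from the feature map by bilinear interpolation.

```python
import math

def bilinear_sample(feature, y, x):
    """Bilinearly interpolate a 2D list at fractional (y, x), clamped to the border."""
    h, w = len(feature), len(feature[0])
    y = min(max(y, 0.0), h - 1.0)
    x = min(max(x, 0.0), w - 1.0)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

def sample_window(feature, top, left, size, offsets):
    """Sample a size x size window whose grid points are shifted by (dy, dx) offsets."""
    return [[bilinear_sample(feature,
                             top + i + offsets[i][j][0],
                             left + j + offsets[i][j][1])
             for j in range(size)] for i in range(size)]

# Toy 4x4 feature map with value 4*row + col.
feature = [[float(4 * r + c) for c in range(4)] for r in range(4)]
zero = [[(0.0, 0.0)] * 2 for _ in range(2)]     # fixed grid, no deformation
shifted = [[(0.5, 0.5)] * 2 for _ in range(2)]  # every point moved by half a pixel
fixed_window = sample_window(feature, 0, 0, 2, zero)
deformed_window = sample_window(feature, 0, 0, 2, shifted)
```

With zero offsets the window reads the grid values unchanged; with non-zero offsets the same window reads interpolated values from between grid positions, which is what allows the sampling points to follow irregular textures.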
As shown in Table 3, the proposed SIDWA achieves a remarkable balance between detection precision and recall across various generative models. The F1-score consistently aligns with the overall accuracy, particularly on advanced diffusion models such as ADM and GLIDE, where all metrics exceed 96%. This equilibrium indicates that SIDWA keeps both false alarms and missed detections low (high precision and high recall, respectively), demonstrating its robustness as a reliable forensic tool for high-fidelity synthetic image detection.
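The relationship between these three metrics can be made concrete with a short sketch. The labels and predictions are toy values invented for illustration and the function name `precision_recall_f1` is an assumption; generated images are treated as the positive class.

```python
def precision_recall_f1(labels, predictions):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example: one false alarm and one missed detection among eight images.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
predictions = [1, 1, 1, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(labels, predictions)
```

Because F1 is the harmonic mean of precision and recall, it stays close to accuracy on a balanced test set only when false alarms and missed detections are both rare.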
To evaluate the effectiveness of the proposed SIDWA, we conduct a comprehensive comparison with eight state-of-the-art (SOTA) detection methods. These include fingerprint-based methods (e.g., CNNDet [47] and FreqDet [35]), reconstruction-based methods (e.g., DIRE [50]), and recent large-scale pre-training or multi-modal-inspired methods (e.g., UnivFD [7], DeFake [72], RINE [73], and SPAI [74]). For a fair comparison, the performance metrics of all baseline methods are directly sourced from the latest comprehensive study SPAI [74]. Our SIDWA is trained and evaluated on the same benchmark datasets. We evaluate performance across nine representative generative models, ranging from classic GANs to the latest diffusion-based architectures.
As summarized in Table 4, our method demonstrates robust overall performance across the nine generative models, achieving an average accuracy of 84.6% and outperforming traditional spatial- and frequency-based baselines.
A key observation is that when early detection methods, such as CNNDet and LGrad, are applied to advanced diffusion models like SD3 and DALLE3, their performance drops significantly (accuracies as low as 12.7%). In contrast, SIDWA maintains high detection stability, achieving 73.2% on SD3 and 78.3% on DALLE3. This stability suggests that our dual-branch architecture, which combines semantic features with DWT-based frequency cues, effectively captures the intrinsic forgery fingerprints that remain consistent across evolving generative technologies.
Performance vs. Computational Trade-off: Regarding model complexity, while SIDWA has a higher parameter count (98.3 M) than lightweight models like SPAI or RINE, it offers a superior balance between capacity and accuracy. Specifically, our model achieves a 2.2% to 3.5% improvement in average accuracy over UnivDiff and RINE. Furthermore, compared to the heavy-duty DIRE model (150.2 M Params, 85.6 G FLOPs), SIDWA attains a 10.0% higher average accuracy while consuming approximately 71.3% fewer FLOPs. This indicates that the parameters in SIDWA are more efficiently utilized through the Deformable Window and Cross-Attention mechanisms.
Computational Complexity and Efficiency: To address concerns regarding the complexity of the dual-branch architecture, we report the operational efficiency of SIDWA. Evaluated on an NVIDIA A100 GPU, the model maintains a relatively low computational footprint with 24.6 G FLOPs and achieves a competitive inference time of 18.2 ms per image, confirming its potential for real-time forensic applications. In terms of memory consumption, SIDWA requires only 1.8 GB of VRAM for single-image inference, making it highly accessible for deployment on mid-range hardware. During our high-performance training phase (batch size of 64 distributed across 4×A100 GPUs), the model exhibits stable convergence and efficient memory utilization—occupying approximately 35.5% of available VRAM per GPU. These metrics collectively demonstrate that SIDWA strikes an optimal balance between architectural capacity and practical throughput, effectively mitigating the overhead typically associated with deformable attention and wavelet decomposition.
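The efficiency figures quoted above can be cross-checked with simple arithmetic. The sketch below uses only numbers reported in the text; the variable names are illustrative, and the throughput estimate assumes images are processed one at a time at the reported latency.

```python
sidwa_gflops = 24.6  # reported for SIDWA
dire_gflops = 85.6   # reported for DIRE
# Relative FLOPs saving of SIDWA with respect to DIRE, in percent.
flops_reduction_pct = 100.0 * (1.0 - sidwa_gflops / dire_gflops)

a100_vram_gb = 80.0          # per-GPU memory stated in the experimental setup
train_vram_fraction = 0.355  # reported training occupancy per GPU
train_vram_gb = a100_vram_gb * train_vram_fraction

# Sequential single-image throughput implied by 18.2 ms per image.
images_per_second = 1000.0 / 18.2
```

These values correspond to a FLOPs reduction of roughly 71.3%, about 28.4 GB of memory per GPU during training, and close to 55 images per second at inference.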
Robustness to Diverse Generators: While RINE and SPAI exhibit exceptional performance on specific generators such as Midjourney (MJ) and SD2, their performance fluctuates considerably across the entire spectrum (e.g., RINE drops to 39.1% on SD3). SIDWA provides a more balanced profile, consistently staying above or near the 80% mark for most test sets. This cross-generator robustness is critical for real-world applications where the source of a synthetic image is often unknown.
4.5. Ablation Study
To verify the efficacy of the key components in our proposed dual-branch framework, we conduct a series of ablation studies on SIDset. The baseline and its variants are defined as follows:
1. Semantic Branch (CNN): The model only retains the semantic branch with the FPN, excluding all frequency-related components and the cross-attention mechanism.
2. +DWT: The frequency branch is maintained, but the DWT is replaced by standard convolutional layers to extract frequency-domain features.
3. +Deformable Window: The dual-branch architecture is kept, but the DSWA module is replaced by a standard window-based attention mechanism.
4. +Cross-Attention: The dual-branch architecture is kept, but the DSWA module is replaced by a standard cross-attention mechanism without dynamic offset generation.
5. SIDWA (Full): Combines both the DWT Stem and the DSWA module.
To intuitively evaluate the discriminative power of the proposed framework, we visualize the attention maps generated by different model variants. As illustrated in Figure 6, several key observations can be made regarding the localization of generative artifacts.
The Baseline model, which relies solely on the semantic branch, produces highly diffused attention maps that fail to localize specific forgery traces. This is because semantic features primarily capture global object structures and textures, which are often similar across real and generated images, leading to a lack of focus on subtle, high-frequency artifacts that are critical for detection. The addition of the DWT branch significantly enhances the model’s ability to capture high-frequency components, resulting in more focused attention on potential forgery regions. However, the heatmaps still exhibit some degree of center shift, likely due to the lack of spatial guidance from the semantic branch, leading to misalignment between the frequency features and the actual artifact locations. The variant with only Deformable Window shows improved localization compared to the baseline, as the adaptive attention mechanism allows for better focus on irregular textures. However, without the frequency branch, it may still struggle to capture high-frequency artifacts, leading to less precise localization and more scattered attention across the image. The cross-attention variant, while providing some interaction between the branches, lacks the dynamic offset generation of DSWA, leading to suboptimal alignment between semantic and frequency features. This results in attention maps that are somewhat more focused than the baseline but still fail to precisely localize the forgery traces, indicating that both the frequency branch and the adaptive attention mechanism are crucial for optimal performance.
In contrast, our full model integrates both DWT and DSWA. It achieves the most precise and concentrated localization of high-frequency forgeries. This superiority stems from the frequency-semantic resonance calibration mechanism. The semantic branch provides essential spatial priors. The DSWA module acts as a bridge, aligning these priors with the frequency footprints from the DWT branch. By constraining cross-modal interaction within a sliding window, our model filters out redundant background noise. This forces the network to focus on content-aware artifacts. Such synergy ensures that attention is sharp and accurately anchored to localized generative traces. Ultimately, this demonstrates the necessity of dual-branch collaborative learning.
Significance of Frequency Branch: Integrating the DWT-based frequency branch (Freq) yields a substantial performance gain over the baseline (semantic branch only). Specifically, accuracy improves from 84.62% to 91.25%, a gain of 6.63 percentage points. This boost underscores the critical role of frequency-domain cues in capturing subtle, high-frequency forgery traces that are often overlooked in the spatial domain.
Impact of DW and CA Modules: Beyond the frequency analysis, both the Deformable Window (DW) and cross-attention (CA) modules are shown to independently enhance the model’s discriminative power. The introduction of the CA mechanism, in particular, contributes a 3.94% improvement in accuracy compared to the base semantic model. This enhancement suggests that global dependency modeling and adaptive feature alignment are effective in refining feature representations.
Synergistic Integration and Global Optimality: As shown in Table 5, the full SIDWA model performs best, with an accuracy of 95.74% and an AUC of 98.52%. The fact that the full configuration outperforms all partial combinations confirms a synergistic effect among the proposed components. This indicates that the joint modeling of semantic-frequency features, coupled with flexible spatial attention, provides the most robust and comprehensive representation for complex forgery detection.
Statistical Significance Analysis: To verify that the performance gains of the proposed SIDWA are consistent and not a result of random initialization, we conduct a rigorous statistical analysis by repeating each ablation experiment five times with different random seeds. As summarized in Table 6 and Figure 7, the full variant consistently outperforms all other configurations, achieving a mean accuracy of 94.40% ± 0.94%. Specifically, a two-tailed t-test is performed to compare the full variant against each sub-configuration. The results indicate that the integration of both the DWT-based frequency branch and the DSWA module provides a statistically significant improvement over the next-best variants (Variant 5 and Variant 6), with p-values of 0.046 and 0.027, respectively (p < 0.05). The substantial gap between the full variant and the Baseline (Variant 1, p < 0.001) further highlights the robust synergy of the proposed components. This statistical evidence confirms that the collaborative design of our dual-branch framework is essential for achieving optimal performance in synthetic image detection.
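The form of such a comparison can be sketched as follows. The accuracy lists are hypothetical numbers invented for illustration and are not the values behind Table 6. The sketch also assumes an unpaired two-sample Student t-test with pooled variance, which is one of several possible two-tailed t-tests, and the function name `pooled_t_statistic` is an assumption.

```python
import math
import statistics

def pooled_t_statistic(a, b):
    """Two-sample Student t statistic with pooled variance, plus degrees of freedom."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)
    pooled = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled * (1 / na + 1 / nb))
    return t, na + nb - 2

# Hypothetical accuracies (%) over five seeds; NOT values from the paper.
full_runs = [95.0, 96.0, 94.0, 97.0, 95.0]
variant_runs = [92.0, 93.0, 91.0, 93.0, 92.0]

t_stat, dof = pooled_t_statistic(full_runs, variant_runs)
T_CRIT_DF8 = 2.306  # two-tailed critical value at alpha = 0.05 for 8 degrees of freedom
significant = abs(t_stat) > T_CRIT_DF8
```

With five runs per configuration the test has 8 degrees of freedom, so a difference is significant at the 0.05 level when the absolute t statistic exceeds 2.306.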