Localizing Perceptual Artifacts in Synthetic Images for Image Quality Assessment via Deep-Learning-Based Anomaly Detection

Yin, Zijin

doi:10.3390/electronics15050916


by

Zijin Yin

School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing 100876, China

Electronics 2026, 15(5), 916; https://doi.org/10.3390/electronics15050916

Submission received: 30 January 2026 / Revised: 14 February 2026 / Accepted: 22 February 2026 / Published: 24 February 2026

(This article belongs to the Special Issue Computer Vision and AI Algorithms for Diverse Scenarios)


Abstract

While deep generative models, such as text-to-image diffusion, demonstrate strong capabilities in synthesizing photorealistic images, they frequently produce perceptual artifacts (e.g., distorted structures or unnatural textures) that require manual correction. Existing artifact localization methods typically rely on fully supervised training with large-scale pixel-level annotations, which incurs high labeling costs. To address these challenges, we propose a novel framework based on the core insight that perceptual artifacts can be fundamentally modeled as “semantic outliers”: regions that inherently fail to match any pre-defined semantic category. Instead of learning specific artifact features, we introduce a Mask-based Semantic Rejection (MSR) mechanism within a semantic segmentation architecture. This mechanism leverages the “one-vs-all” property of object queries to identify regions that are consistently rejected by all pre-trained semantic categories. Furthermore, we design a flexible adaptation strategy that supports both zero-shot inference using pre-trained semantic knowledge and fine-tuning with a margin-based suppression objective to explicitly optimize the rejection boundary using minimal supervision. Comprehensive experiments across ten synthesis tasks demonstrate that MSR significantly outperforms state-of-the-art methods, particularly in data-efficient scenarios. Specifically, the framework achieves mIoU improvements of 6.52% and 13.06% on the text-to-image task using only 10% and 50% of labeled samples, respectively, underscoring its superior capability.

Keywords:

image quality assessment; artifacts localization; anomaly detection

1. Introduction

The recent proliferation of deep generative models, particularly diffusion models [1,2] within large-scale text-to-image (T2I) frameworks [3,4], has revolutionized the landscape of digital content creation. State-of-the-art architectures, such as Stable Diffusion [5] and DALL-E 3 [6], are now capable of synthesizing photorealistic images that exhibit intricate details and high semantic alignment with user prompts. However, despite these remarkable capabilities, these “black box” generators are not infallible. They frequently suffer from hallucinations and generate perceptual artifacts, ranging from structural distortions (e.g., asymmetrical faces and disordered limbs) to unnatural textures and logical inconsistencies, as shown in Figure 1. These quality degradation issues severely hinder the deployment of generative models in professional workflows, necessitating manual inspection and correction by human experts.

To automate the quality assurance process, the research community has recently formulated the task of Perceptual Artifacts Localization (PAL) [7,8], which aims to segment defective regions within generated images pixel-by-pixel, as shown in Figure 1. Existing approaches typically treat this task as a fully supervised semantic segmentation problem. For instance, the pioneering work by Zhang et al. [8] constructed a dataset of over 10,000 annotated images to train a dedicated segmentation network. While effective within their specific training domains, such supervised methods face two critical limitations. First, the scalability bottleneck: obtaining pixel-level annotations for artifacts is notoriously labor-intensive and expensive. Second, the generalization gap: the visual characteristics of artifacts evolve rapidly with model updates (e.g., artifacts from a GAN differ significantly from those of a diffusion model). A supervised model trained on one generator often fails to recognize the novel artifact patterns produced by another, requiring frequent retraining and data re-labeling.

To overcome the scalability limitations of fully supervised approaches, we propose a paradigm shift from discriminating specific artifact features to identifying semantic inconsistencies. We posit that perceptual artifacts can be fundamentally modeled as semantic outliers—regions that deviate from the manifold of pre-defined semantic concepts found in real-world data. Unlike valid objects (e.g., “person”, “car”, and “sky”) that possess coherent semantic representations, artifacts are characterized by their incompatibility with established semantic categories. This perspective suggests that instead of learning a dedicated artifact detector, one can harness the rich discriminative knowledge encapsulated in pre-trained segmentation models to identify these anomalies via a mechanism of semantic rejection.

To realize this, we introduce the Mask-based Semantic Rejection (MSR) framework, instantiated upon the state-of-the-art Mask2Former architecture [9]. We exploit the query-based mask classification paradigm, where object queries inherently function as “one-vs-all” discriminators for specific semantic concepts. Rather than training a classifier to explicitly recognize artifacts, we formulate artifact localization as an out-of-distribution detection problem within the pre-trained semantic space. Specifically, we aggregate the prediction scores from the semantic query bank; regions that yield consistently low classification confidence across all pre-defined categories are identified as anomalies.

This formulation enables a highly label-efficient workflow. Our framework supports zero-shot inference by directly utilizing the generalized semantic priors of the base model. Furthermore, to adapt to model-specific artifact patterns, we design a few-shot adaptation strategy utilizing a margin-based suppression loss. Inspired by the rejection philosophy [10], this objective explicitly forces the model to suppress the activation of all semantic categories for the annotated artifact regions, ensuring their maximum classification scores fall below a pre-defined margin. This optimization effectively “pushes” artifacts further away from the in-distribution semantic manifold, sharpening the rejection boundary with minimal human supervision.

The main contributions of this work are summarized as follows:

We propose a novel perspective that models perceptual artifacts as outliers, shifting the focus from artifact pattern recognition to semantic rejection.
We develop the Mask-based Semantic Rejection (MSR) framework, a flexible architecture that effectively localizes artifacts by aggregating rejection signals from pre-trained semantic queries.
We introduce a margin-based suppression loss for few-shot adaptation, enabling the model to achieve superior localization performance with minimal human supervision.
Extensive experiments across multiple generative tasks (T2I generation and inpainting) demonstrate that our approach outperforms existing methods in terms of generalization and data efficiency.

2. Related Works

2.1. Artifacts Localization in Image Synthesis

The landscape of image synthesis has evolved rapidly, driven by the transition from generative adversarial networks (GANs) [11,12] to large-scale text-to-image (T2I) frameworks and diffusion models [2,3]. Modern architectures, such as Stable Diffusion [5] and Qwen-Image [13], demonstrate unprecedented capabilities in terms of generating high-fidelity, photorealistic images that align closely with complex text prompts. Despite these advancements, these models are not immune to failure modes. They frequently exhibit perceptual artifacts [8], which are structural distortions, unnatural textures, mismatched semantics, or logical inconsistencies (e.g., asymmetrical eyes, disordered limbs, or floating objects) [14]. As generative models are increasingly integrated into content creation workflows, the ability to automatically identify and localize these quality degradation issues has become essential for human-in-the-loop editing and model evaluation [8].

In the broader context of synthesis analysis, prior research efforts were predominantly directed towards Deepfake Detection [15,16,17,18,19] or Image Forensics. The primary objective of these methods is to perform binary classification, distinguishing between authentic real-world photographs and synthesized content. Prior works, such as CNNgenerates [14], demonstrated that generative models leave distinct frequency-domain fingerprints that can be detected by classifiers trained with aggressive data augmentation. Similarly, Patch Forensics [20] utilized a local patch-based analysis to identify statistical anomalies characteristic of GAN-generated textures. While these forensic methods are effective for security and authentication purposes, they are fundamentally limited in the context of image quality assessment. They typically output a global “fake” probability or coarse-grained heatmaps designed to flag the entire image, rather than providing the pixel-accurate segmentation required to pinpoint specific defective regions for correction.

To achieve fine-grained quality assessment, researchers have shifted focus from global scoring to Perceptual Artifacts Localization (PAL) [7,8]. Unlike binary detection, PAL aims to explicitly segment defective regions in generated content. This task has been extensively explored across various synthesis applications, such as text-to-image generation and image inpainting [8,21,22]. For instance, Zhang et al. [8] established a pioneering benchmark with human-annotated masks to train fully supervised segmenters. Recent works in 2024 and 2025 have further expanded this direction, introducing multimodal models to describe and localize defects [23,24]. However, relying on fully supervised training faces two major bottlenecks: (1) collecting pixel-level artifact labels is prohibitively expensive; (2) supervised models struggle to generalize to unseen architectures (e.g., transferring from GANs to diffusion models) because artifact patterns evolve constantly.

2.2. Anomaly Detection

The field of anomaly detection (AD) has historically focused on identifying image-level outliers, that is, determining whether an entire image deviates from the training distribution. Early works primarily relied on post hoc statistics of discriminative classifiers. For instance, the maximum softmax probability (MSP) [25] and energy-based scores served as simple baselines to reject unknown samples. More recently, the paradigm has shifted towards leveraging large-scale pre-trained Vision–Language Models (VLMs) to enhance detection capabilities. Methods like WinCLIP [26] and AnomalyCLIP [27] exploit the rich semantic space of CLIP to define normality via textual prompts, enabling zero-shot detection. Furthermore, novel backbones such as State Space Models (SSMs) have been explored; for example, MambaAD [28] demonstrates that modeling long-range dependencies is crucial for capturing subtle structural anomalies that traditional CNNs might overlook.

To extend these capabilities to pixel-level localization (i.e., Anomaly Segmentation), substantial efforts have been directed towards Outlier Exposure (OE) strategies to explicitly regularize decision boundaries at a dense level. Contemporary methods typically leverage auxiliary datasets, such as COCO [29] or ADE20K [30], to simulate anomalies via cut-and-paste augmentation [31,32]. For example, Meta-OoD [33] maximizes prediction entropy on these synthetic outliers, whereas PEBAL [31] employs an abstention learning framework with adaptive energy-based penalties. Although DenseHybrid [32] has pushed state-of-the-art methods by synthesizing likelihood and posterior evaluations, these approaches often suffer from high training complexity—they necessitate fine-tuning on multiple auxiliary datasets with significant distribution shifts for each specific benchmark. This heavy reliance on extensive supervision contrasts with the urgent need for data-efficient frameworks capable of achieving robust generalization with minimal outlier supervision.

Recent advancements have also explored specialized segmentation-based architectures for anomaly localization. For instance, the Normal Image Guided Segmentation framework [34] utilizes reference images to identify deviations in unsupervised settings. VRP-SAM [35] incorporates visual reference prompts into the Segment Anything Model (SAM) to achieve zero-shot defect detection. While these segmentation-based methods are highly effective for industrial anomaly detection, they typically rely on reference images or visual prompts of “perfect” objects to define normality. Such constraints do not apply to perceptual artifacts in open-world synthesis, where no ground truth reference exists and semantic diversity is significantly higher.

3. Method

In this section, we present the proposed Mask-based Semantic Rejection (MSR) framework for Perceptual Artifacts Localization. Our approach fundamentally reformulates the task from supervised feature learning to semantic outlier detection. As illustrated in Figure 2, the framework operates in two stages: an inference stage that leverages pre-trained semantic priors to reject anomalies, and an optional adaptation stage that recalibrates the decision boundary using minimal supervision. The section is organized as follows: Section 3.1 reviews the preliminary mask classification architecture. Section 3.2 details the core mechanism of semantic rejection and anomaly score aggregation. Section 3.3 introduces the margin-based suppression loss for data-efficient fine-tuning.

3.1. Preliminaries

Our proposed framework is instantiated upon Mask2Former [9], a universal image segmentation architecture that fundamentally shifts the segmentation paradigm from per-pixel classification to mask classification. Unlike traditional fully convolutional networks (FCNs) [36] that assign a semantic label to every pixel independently, Mask2Former decomposes an image into a set of N binary masks and their associated class probability distributions using a Transformer-based query decoder. This object-centric formulation is pivotal to our approach, as it decouples the localization of a region from its semantic categorization.

Formally, given an input generated image $I \in \mathbb{R}^{H \times W \times 3}$, the architecture first employs a backbone network (e.g., Swin Transformer [37]) followed by a pixel decoder to extract multi-scale high-resolution pixel embeddings, denoted as $F \in \mathbb{R}^{C \times H \times W}$, where $C$ is the channel dimension. The core of the recognition process relies on a fixed set of $N$ learnable object queries, $Q = \{q_{i}\}_{i=1}^{N}$, where each $q_{i} \in \mathbb{R}^{D}$ represents a latent global representation of a potential image segment. These queries interact with the dense pixel embeddings $F$ through a Transformer decoder. A key design is the masked cross-attention mechanism, which constrains the attention of each query $q_{i}$ to within the foreground region predicted by the preceding decoder layer. This mechanism ensures that queries focus on extracting localized semantic features rather than global context, significantly enhancing convergence and precision.

Upon passing through the decoder layers, the refined query embeddings are projected into final predictions via two specialized heads. For semantic classification, a linear classifier transforms the query $q_{i}$ into class logits, followed by a softmax normalization. This yields a probability distribution over $K$ semantic categories (e.g., person, car, and sky) and a special “no object” class ∅:

$$P(c \mid q_{i}) = \mathrm{Softmax}(W_{cls}\, q_{i} + b_{cls}) \in \mathbb{R}^{K+1} \tag{1}$$

where $W_{cls} \in \mathbb{R}^{(K+1) \times D}$ and $b_{cls}$ denote the learnable weights and bias of the classification head. Simultaneously, for mask generation, a Multi-Layer Perceptron (MLP) projects the query $q_{i}$ into a mask embedding vector. The binary mask prediction $M_{i} \in [0, 1]^{H \times W}$ is then generated via a dot product between this query-specific embedding and the pixel embeddings $F$, passed through a sigmoid activation function:

$$M_{i}(x, y) = \sigma\left(\mathrm{MLP}_{mask}(q_{i})^{T} \cdot F(x, y)\right) \tag{2}$$

where $\sigma(\cdot)$ is the sigmoid function, and $(x, y)$ represents spatial coordinates.
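Equations (1) and (2) can be sketched in a few lines of NumPy. This is a minimal illustration with random tensors, not the Mask2Former implementation: the function name `prediction_heads`, the toy dimensions, and the single linear layer standing in for $\mathrm{MLP}_{mask}$ are all assumptions made for the example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def prediction_heads(queries, pixel_emb, W_cls, b_cls, mask_mlp):
    """Hypothetical helper mirroring Eqs. (1)-(2).

    queries:   (N, D) refined query embeddings
    pixel_emb: (C, H, W) per-pixel embeddings F
    W_cls:     (K+1, D), b_cls: (K+1,) classification head
    mask_mlp:  callable mapping (N, D) -> (N, C)
    """
    class_probs = softmax(queries @ W_cls.T + b_cls)           # Eq. (1): (N, K+1)
    mask_emb = mask_mlp(queries)                                # (N, C)
    logits = np.einsum("nc,chw->nhw", mask_emb, pixel_emb)      # dot product per pixel
    masks = sigmoid(logits)                                     # Eq. (2): (N, H, W)
    return class_probs, masks

rng = np.random.default_rng(0)
N, D, C, H, W, K = 5, 8, 8, 4, 6, 3          # toy sizes, chosen arbitrarily
W_mlp = rng.normal(size=(D, C))              # one linear layer as a stand-in MLP
class_probs, masks = prediction_heads(
    rng.normal(size=(N, D)),
    rng.normal(size=(C, H, W)),
    rng.normal(size=(K + 1, D)),
    np.zeros(K + 1),
    lambda q: q @ W_mlp,
)
```

Each row of `class_probs` sums to one over the $K$ categories plus the “no object” class, while each entry of `masks` is an independent per-pixel probability for one query.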

One-vs-All Property: Crucially for our framework, this architecture treats each query $q_{i}$ as an independent discriminator. During pre-training, each query is supervised to perform a “one-vs-all” classification task: it must predict a high probability for a specific class $c \in \{1, \dots, K\}$ if it attends to a valid object, and predict the background class ∅ otherwise. This intrinsic property implies that regions failing to activate any of the $K$ valid semantic categories (and therefore yielding uniformly low confidence scores across the semantic manifold) can be theoretically characterized as semantic voids or anomalies.

3.2. Mask-Based Semantic Rejection (MSR)

The core hypothesis of our framework is that perceptual artifacts manifest as semantic voids, i.e., regions that cannot be confidently explained by any pre-defined semantic category within the model’s training distribution. Drawing inspiration from outlier detection [10], we propose a Mask-based Semantic Rejection (MSR) mechanism. Unlike standard segmentation, which assigns the most probable class to each pixel, MSR quantifies the degree to which all semantic queries reject a specific region.

Let $\mathcal{K} = \{1, 2, \dots, K\}$ denote the set of valid foreground semantic categories (e.g., person and building), excluding the background class ∅. For an input image $I$, the pre-trained Mask2Former outputs a set of $N$ query predictions $\{(M_{i}, P_{i})\}_{i=1}^{N}$, where $M_{i} \in [0, 1]^{H \times W}$ is the predicted binary mask map and $P_{i} \in \mathbb{R}^{K+1}$ is the class probability distribution for the $i$-th query.

3.2.1. Semantic Existence Quantification

First, we must quantify the confidence of each query in identifying a valid semantic object. For the $i$-th query, we define its semantic confidence score $s_{i}$ as the maximum probability assigned to any valid foreground category:

$$s_{i} = \max_{c \in \mathcal{K}} P_{i}(c) \tag{3}$$

A low $s_{i}$ implies that the query represents either background context or an ambiguous region that fails to match the feature distribution of any category in $\mathcal{K}$.
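A small numerical sketch of Equation (3) follows. It assumes, as an illustrative convention not stated in the text, that the “no object” class ∅ occupies the last column of the probability matrix; the function name `semantic_confidence` and the toy probabilities are likewise hypothetical.

```python
import numpy as np

def semantic_confidence(class_probs):
    """Eq. (3): max probability over foreground classes for each query.

    class_probs: (N, K+1); the last column is assumed to be 'no object'.
    Returns an array of shape (N,).
    """
    return class_probs[:, :-1].max(axis=1)

# Toy distributions for three queries over K = 2 classes plus 'no object'.
P = np.array([
    [0.85, 0.05, 0.10],  # confidently matches a foreground class
    [0.10, 0.15, 0.75],  # dominated by 'no object'
    [0.30, 0.30, 0.40],  # ambiguous between classes
])
s = semantic_confidence(P)
```

Only the first query receives a high score; the other two are candidates for rejection because no foreground category explains them well.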

3.2.2. Pixel-Wise Rejection Aggregation

Since Mask2Former performs segmentation via overlapping mask predictions, a single pixel $(x, y)$ may be covered by multiple queries. To determine the semantic consistency of a pixel, we aggregate the confidence scores of all queries that are spatially active at that location.

We define the pixel semantic confidence $E(x, y)$, which represents the likelihood that the pixel $(x, y)$ belongs to the semantic manifold defined by $\mathcal{K}$. This is computed by projecting the query-level confidence $s_{i}$ onto the pixel grid, weighted by the mask activation $M_{i}(x, y)$:

$$E(x, y) = \max_{i = 1, \dots, N} \left(M_{i}(x, y) \cdot s_{i}\right) \tag{4}$$

Here, the multiplication $M_{i}(x, y) \cdot s_{i}$ measures the contribution of the $i$-th query to pixel $(x, y)$. If a query predicts a region as a foreground mask (high $M_{i}$) but lacks classification confidence (low $s_{i}$), its contribution to $E(x, y)$ will be suppressed. The max operator ensures that we consider the most confident semantic explanation available for that pixel.
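The aggregation in Equation (4) reduces to a broadcasted product followed by a maximum over the query axis. The sketch below uses a two-query, two-pixel toy case; the function name and the numbers are illustrative assumptions.

```python
import numpy as np

def pixel_semantic_confidence(masks, s):
    """Eq. (4): E(x, y) = max_i M_i(x, y) * s_i.

    masks: (N, H, W) mask activations in [0, 1]
    s:     (N,) query-level semantic confidence
    Returns E with shape (H, W).
    """
    weighted = masks * s[:, None, None]  # broadcast s_i over the pixel grid
    return weighted.max(axis=0)          # most confident explanation per pixel

# Query 0 covers the left pixel but has low class confidence;
# query 1 covers the right pixel with high class confidence.
masks = np.array([[[0.9, 0.1]],
                  [[0.1, 0.8]]])         # shape (N=2, H=1, W=2)
s = np.array([0.2, 0.9])
E = pixel_semantic_confidence(masks, s)  # left: 0.18, right: 0.72
```

The left pixel ends up with low semantic confidence even though a mask covers it strongly, which is exactly the case the rejection mechanism is meant to expose.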

3.2.3. Artifact Localization via Inversion

Finally, we formally define the artifact map $A \in [0, 1]^{H \times W}$. Artifacts correspond to pixels that are consistently rejected by the semantic ensemble (i.e., those that have low semantic confidence $E$). Therefore, the anomaly score is derived as the complement of the semantic confidence:

$$A(x, y) = 1 - E(x, y) = 1 - \max_{i = 1, \dots, N} \left(M_{i}(x, y) \cdot \max_{c \in \mathcal{K}} P_{i}(c)\right) \tag{5}$$

This formulation yields a high anomaly score $A(x, y) \approx 1$ only when no query can confidently claim the pixel as a valid semantic class. This effectively segments regions characterized by structural distortions or logical inconsistencies, as these patterns inherently lie outside the support of the pre-trained semantic distribution.
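The inversion in Equation (5), followed by binarization, can be sketched as follows. The threshold value 0.5 is the one reported in Section 4.1; the use of a strict inequality and the helper names are assumptions made for this example.

```python
import numpy as np

def artifact_map(E):
    """Eq. (5): anomaly score as the complement of semantic confidence."""
    return 1.0 - E

def binarize(A, tau=0.5):
    """Threshold the anomaly map (strict '>' is an assumption here)."""
    return A > tau

# Toy semantic confidence map, e.g. produced by Eq. (4).
E = np.array([[0.18, 0.72],
              [0.50, 0.05]])
A = artifact_map(E)
mask = binarize(A)  # True where no query explains the pixel confidently
```

Pixels with confidence 0.18 and 0.05 are flagged, while the pixel sitting exactly at the threshold is not, which shows how the choice between strict and non-strict comparison matters only at the boundary.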

3.3. Adaptation via Margin-Based Suppression

While the zero-shot MSR framework effectively identifies general anomalies by leveraging pre-trained semantic priors, it faces two primary challenges in specific deployment scenarios. First, specific generative models often exhibit unique artifact patterns (e.g., the “checkerboard” artifacts in GANs versus “limb distortion” in diffusion models). Second, the closed-set vocabulary of the pre-trained backbone (e.g., ADE20K) may fail to cover all valid objects in open-world generation, potentially misclassifying Out-of-Vocabulary (OOV) objects as anomalies due to the lack of semantic matching. To bridge this domain gap and mitigate vocabulary limitations, we introduce an adaptation stage built around a margin-based suppression loss. Using only a few labeled examples, the model learns to suppress semantic confidence on annotated artifact regions while keeping anomaly scores low for valid but unknown semantics, thereby enhancing localization precision and reducing false positives on OOV objects.

The primary objective is to recalibrate the rejection boundary of the model using a limited set of annotated samples, denoted as $D_{few} = \{(I_{j}, Y_{j})\}_{j=1}^{M}$, where $Y_{j} \in \{0, 1\}^{H \times W}$ represents the binary ground truth artifact mask.

3.3.1. Optimization Setup

To prevent the catastrophic forgetting of established semantic knowledge, we adopt a parameter-efficient fine-tuning strategy. We freeze the weights of the backbone network, the pixel decoder, and the mask prediction MLP. The optimization is restricted solely to the classification head $W_{cls}$ and the object query embeddings $Q$. This constraint ensures that the model retains its capability to segment valid objects while adjusting its sensitivity to distribution shifts in the generated domain.

3.3.2. Semantic Suppression Margin Loss

Standard cross-entropy loss is unsuitable for this task because artifacts do not constitute a single coherent semantic class; rather, they are defined by their non-conformity to existing classes. Therefore, we propose the Semantic Suppression Margin Loss ($L_{SSM}$). This objective is designed to enforce a strict upper bound on the semantic confidence scores of pixels annotated as artifacts.

Recalling Equation (4), the pixel-wise semantic confidence $E(x, y)$ quantifies the maximum likelihood that a pixel belongs to the manifold of valid categories $\mathcal{K}$. For a pixel located within an artifact region (i.e., $Y(x, y) = 1$), an ideal anomaly detector should yield a semantic confidence close to zero. To enforce this, we formulate $L_{SSM}$ as a hinge-based objective that penalizes semantic activations exceeding a pre-defined margin $m$:

$$L_{SSM} = \frac{1}{|\Omega_{art}|} \sum_{(x, y) \in \Omega_{art}} \max\left(0, E(x, y) - m\right) \tag{6}$$

where $\Omega_{art} = \{(x, y) \mid Y(x, y) = 1\}$ denotes the set of pixels labeled as artifacts, and $m \in [0, 1]$ is a hyperparameter governing the strictness of the rejection boundary.
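Equation (6) can be checked numerically with a short NumPy sketch. The function name `ssm_loss`, the toy arrays, and the choice to return zero when an image contains no artifact pixels are assumptions for illustration; a trainable version would need an autodiff framework.

```python
import numpy as np

def ssm_loss(E, Y, m=0.5):
    """Eq. (6): mean hinge penalty on semantic confidence over artifact pixels.

    E: (H, W) pixel semantic confidence in [0, 1]
    Y: (H, W) binary ground-truth artifact mask
    m: rejection margin
    """
    art = Y.astype(bool)
    if not art.any():   # no artifact pixels: nothing to suppress (assumed behavior)
        return 0.0
    return float(np.maximum(0.0, E[art] - m).mean())

E = np.array([[0.9, 0.3],
              [0.6, 0.8]])
Y = np.array([[1, 1],
              [1, 0]])
loss = ssm_loss(E, Y)  # hinge terms 0.4, 0.0, 0.1 averaged over 3 artifact pixels
```

Only the artifact pixels with confidence above the margin (0.9 and 0.6) contribute; the pixel at 0.3 already satisfies the bound, and the non-artifact pixel at 0.8 is ignored by this term.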

3.3.3. An Interpretation

The optimization mechanism of $L_{SSM}$ operates by modifying the projection of query features in the classification space. When the model encounters an artifact region, the queries active in that region initially predict a distribution over $\mathcal{K}$ based on their pre-trained initialization. If the maximum probability among these classes exceeds the margin $m$, Equation (6) produces a positive gradient. This gradient updates the classifier weights $W_{cls}$ to orthogonalize the query embedding with respect to the weight vectors of all valid classes in $\mathcal{K}$. Consequently, the probability distribution $P(c \mid q)$ becomes uniform or suppressed towards the background class ∅, thereby decreasing $E(x, y)$ and increasing the resultant anomaly score $A(x, y)$.
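As a hedged illustration of the gradient argument above, the subgradient of Equation (6) can be written out explicitly. The index $i^{*}$ (the query attaining the maximum in Equation (4)) is notation introduced here, and the masks are treated as fixed for simplicity, which is an assumption rather than a statement from the method.

```latex
\frac{\partial L_{SSM}}{\partial E(x, y)} =
\begin{cases}
\dfrac{1}{|\Omega_{art}|}, & (x, y) \in \Omega_{art} \text{ and } E(x, y) > m, \\[4pt]
0, & \text{otherwise,}
\end{cases}
\qquad
\frac{\partial E(x, y)}{\partial s_{i^{*}}} = M_{i^{*}}(x, y),
\quad
i^{*} = \arg\max_{i} \, M_{i}(x, y)\, s_{i}.
```

Under these assumptions, only artifact pixels whose confidence exceeds $m$ contribute, and a descent step lowers the class confidence $s_{i^{*}}$ of the query that currently offers the strongest semantic explanation for that pixel.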

To maintain the detection capability for valid regions, we combine this suppression objective with a standard Dice Loss $L_{dice}$ computed on the background regions (non-artifacts). The final total loss $L_{total}$ is formulated as follows:

$$L_{total} = L_{SSM} + \lambda L_{dice} \tag{7}$$

where $\lambda$ balances the contribution of the regularization term. This formulation explicitly optimizes the decision boundary to exclude artifact patterns from the in-distribution semantic manifold without requiring the model to learn a specific feature representation for the artifacts themselves.
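The combination in Equation (7) can be sketched as below. The text does not spell out the exact inputs to the Dice term, so this example assumes a soft Dice loss between the semantic confidence $E$ and the non-artifact mask $1 - Y$; that pairing, the smoothing constant `eps`, and all function names are assumptions made for illustration only.

```python
import numpy as np

def hinge_suppression(E, Y, m=0.5):
    """Eq. (6): mean hinge penalty over artifact pixels (0 if there are none)."""
    art = Y.astype(bool)
    return float(np.maximum(0.0, E[art] - m).mean()) if art.any() else 0.0

def dice_loss(pred, target, eps=1e-6):
    """Standard soft Dice loss between a soft prediction and a binary target."""
    inter = (pred * target).sum()
    return float(1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps))

def total_loss(E, Y, m=0.5, lam=1.0):
    """Eq. (7). ASSUMPTION: Dice compares E with the non-artifact mask 1 - Y."""
    return hinge_suppression(E, Y, m) + lam * dice_loss(E, 1.0 - Y)

E = np.array([[0.9, 0.3],
              [0.6, 0.8]])
Y = np.array([[1.0, 1.0],
              [1.0, 0.0]])
loss = total_loss(E, Y)

# An ideal prediction: zero confidence on artifacts, full confidence elsewhere.
ideal = total_loss(1.0 - Y, Y)
```

With this pairing, the hinge term pushes confidence down on artifact pixels while the Dice term rewards high confidence on valid pixels, so the ideal map $E = 1 - Y$ drives both terms to zero.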

4. Experiments

4.1. Implementation Details

Architecture. Our framework is built upon the official implementation of Mask2Former [9]. Unless otherwise specified, we employ a Swin-Base (Swin-B) [37] backbone initialized with weights pre-trained on the ADE20K [30] semantic segmentation dataset. We use a vocabulary of $|\mathcal{K}| = 150$ standard semantic categories to form the in-distribution manifold. The input images are resized such that the shorter side is 640 pixels, preserving the aspect ratio.

Inference. For the zero-shot experiments, we freeze the entire Mask2Former network. The artifact localization is performed purely via the proposed MSR mechanism (Section 3.2), which aggregates rejection signals from the pre-trained object queries. The anomaly maps are binarized using a fixed threshold $\tau = 0.5$.

Training. In the adaptation stage, we freeze the backbone and the pixel decoder (together with the mask prediction MLP, as described in Section 3.3.1) to prevent catastrophic forgetting of general semantic features. We only update the parameters of the object queries $Q$ and the classification head $W_{cls}$. The model is fine-tuned with the total objective $L_{total}$ (Equation (7)), which combines the proposed Semantic Suppression Margin Loss with the Dice regularizer. We empirically set the rejection margin $m = 0.5$ and the balancing weight $\lambda = 1.0$. Training is conducted using the AdamW optimizer with a learning rate of $5 \times 10^{-5}$ and a weight decay of $0.01$. The model is fine-tuned for 200 iterations with a batch size of 4 on a single NVIDIA A100 GPU.

4.2. Experimental Setups

Dataset. We conduct our experiments on the Perceptual Artifacts Localization (PAL) benchmark [8], utilizing the official train/validation/test split (80%/10%/10%). The dataset contains 10,168 synthesized images fully annotated with pixel-level artifact masks. To ensure a comprehensive evaluation, these images cover ten diverse synthesis tasks spanning three major generative paradigms: unconditional generation (e.g., StyleGAN2 [38]), text-to-image synthesis, and image restoration (e.g., inpainting and super-resolution).

Evaluation Metrics. Following the standard protocol [8], we adopt mean Intersection over Union (mIoU) as the primary evaluation metric to quantify the pixel-level localization accuracy of perceptual artifacts.
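For concreteness, a minimal sketch of an IoU-based score for binary artifact masks is given below. It assumes the metric is the artifact-class IoU computed per image and then averaged, and that two empty masks count as a perfect match; the benchmark's exact protocol may differ, and the function names are hypothetical.

```python
import numpy as np

def binary_iou(pred, gt):
    """Intersection over Union between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return float(np.logical_and(pred, gt).sum() / union)

def mean_iou(preds, gts):
    """Average the per-image IoU over a list of (prediction, ground truth) pairs."""
    return float(np.mean([binary_iou(p, g) for p, g in zip(preds, gts)]))

preds = [np.array([[1, 1], [0, 0]]), np.array([[0, 1], [0, 1]])]
gts = [np.array([[1, 0], [0, 0]]), np.array([[0, 1], [0, 1]])]
score = mean_iou(preds, gts)  # (0.5 + 1.0) / 2
```

The first pair overlaps on one of two flagged pixels and the second pair matches exactly, so the averaged score sits halfway between the two.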

Comparison Methods. To demonstrate the effectiveness of our framework, we compare against the following baseline methods, covering both generic forgery detection and task-specific localization:

CNNgenerates [14] + Grad-CAM [39]: This method uses a classifier specifically trained to distinguish between generated and real images. Following [8], we employ Grad-CAM [39] to visualize the gradient activations of the specific “fake” class, serving as a proxy for artifact localization maps.
Patch Forensics [20]: This is a patch-based classifier designed to detect local anomalies in generated images. We utilize its truncation-based variation to predict “fake” regions based on local patch scores.
PAL4Inpaint [7]: This is a specialized segmentation model designed specifically for localizing perceptual artifacts in inpainting tasks. This serves as a strong baseline for task-specific performance comparison.
PAL [8]: The current state-of-the-art unified segmentation model trained on the full PAL dataset, serving as the direct competitor to our approach.

4.3. Results on Image Synthesis Tasks

Table 1 and Figure 3 present the quantitative comparison of artifact localization performance across ten synthesis tasks. Existing forensic baselines, including CNNgenerates [14] and Patch Forensics [20], exhibit suboptimal localization accuracy, with mIoU scores generally remaining below 10%. This performance deficit suggests that methods relying on low-level frequency anomalies or local texture statistics are insufficient for capturing high-level semantic inconsistencies. In comparison, the proposed MSR framework achieves state-of-the-art performance under both the Unified and Specialist settings. Specifically, the Unified MSR model demonstrates robust generalization, surpassing the previous best method, PAL (Unified) [8], particularly on semantically complex tasks. For instance, on the text-to-image (T2I) and mask-to-image (M2I) datasets, MSR improves mIoU by 3.85% (from 19.65% to 23.50%) and 1.73% (from 39.37% to 41.10%), respectively. By leveraging object masks instead of individual pixels, our method effectively filters out background noise in complex scenes, accurately identifying artifacts as unrecognized instances. Regarding the training settings, the Specialist models generally function as a performance upper bound; for the inpainting task, the Specialist MSR achieves 44.20% mIoU, outperforming the task-specific baseline PAL4Inpaint [7] by over 2%. A notable exception is observed in the AnyRes task, where the Unified model slightly outperforms its Specialist counterpart (36.80% vs. 36.00%), suggesting that the diversity of the unified training data may offer robustness benefits for tasks with high structural variance.

4.4. Results on Unseen Methods

Rapid advancements in generative models necessitate artifact detectors that can generalize to novel, unseen architectures. To evaluate this capability, we follow the protocol in [8] and test on five held-out models: StyleGAN3 [45], BlobGAN [46], Stable Diffusion v2.0 (SD2) [5], Versatile Diffusion (VD) [47], and Diffusion Transformer (DiT) [48].

As shown in Table 2, conventional forensic methods (CNNgenerates and Patch Forensics) fail to transfer to these new domains, yielding near-random performance. While the previous state of the art, PAL, shows reasonable zero-shot performance on GAN-based models (e.g., StyleGAN3), it struggles significantly with modern diffusion models (e.g., 6.75% on SD2). In contrast, our MSR method demonstrates superior cross-model generalization. Even in the zero-shot setting, MSR outperforms PAL on DiT by nearly 5 percentage points (21.30% vs. 16.46%), suggesting that the proposed semantic rejection mechanism is less dependent on domain-specific texture artifacts. Furthermore, the efficacy of our method is most pronounced in the fine-tuning setting (ten-shot adaptation). By leveraging the geometric constraints of the proposed margin loss, MSR adapts rapidly to the target distribution. Notably, on the challenging Stable Diffusion v2.0, MSR achieves an improvement of 13.46 percentage points over PAL (24.50% vs. 11.04%), effectively bridging the domain gap with minimal supervision.

4.5. Results on Real Images

Since our MSR framework is trained exclusively on synthetic data to reject semantic inconsistencies, a natural question arises: how does the model behave when presented with real, natural images? To investigate this, we conduct inference on real images sampled from the COCO-Stuff-164k [29] dataset.

As illustrated in the bottom row of Figure 4, for the vast majority of natural scenes—ranging from food and aircraft to animals—the MSR detector exhibits robust behavior, outputting empty masks. This confirms that the model maintains a low false positive rate and does not blindly overfit to low-level texture statistics common in natural images. Intriguingly, in the rare cases where the model does flag regions in real images (top row, Figure 4), the detections are not random noise. Instead, the model shows a distinct tendency to localize semantically dense, high-frequency details, specifically small text (e.g., bus schedules), complex signage (e.g., neon street signs), and fine-grained interface elements (e.g., microwave buttons).

We attribute this phenomenon to the nature of the Mask-based Semantic Rejection mechanism. These regions—characterized by sharp edges and complex geometries—are successfully isolated as candidate objects by the mask proposal network. However, because they often lack a canonical semantic category in the pre-trained vocabulary or resemble the high-frequency structural distortions often found in generated artifacts (e.g., garbled text in T2I models), the semantic scoring function rejects them, effectively classifying them as perceptual anomalies.

We also conduct a quantitative analysis using two metrics: (1) the false positive rate (FPR), defined as the percentage of real images for which the model predicts a non-empty artifact mask, and (2) the average predicted artifact area (APAA), defined as the ratio of the predicted artifact area to the total image area, averaged across all images. As reported in Table 3, the MSR framework is selective: the overall FPR is 3.82%, with an APAA of only 0.14%. When broken down by scene category, the model is most robust in “Nature” and “Animal” scenes (FPR < 2%), where semantic structures are well-defined in the pre-trained vocabulary. Even in complex “Indoor” and “Urban” scenes, the FPR remains controlled (4.10% and 6.25%, respectively). These quantitative findings complement our qualitative observations and indicate that the MSR mechanism does not suffer from high false positive rates across diverse natural image distributions.
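The two metrics defined above can be computed directly from the predicted binary masks. The following is a minimal sketch in plain Python, assuming each prediction is a 2D list of 0/1 values; the function names `artifact_area_ratio` and `false_positive_stats` are illustrative and do not come from the released code.

```python
from typing import List, Sequence, Tuple

Mask = Sequence[Sequence[int]]  # binary mask, 1 = pixel predicted as artifact


def artifact_area_ratio(mask: Mask) -> float:
    """Fraction of pixels flagged as artifact in a single image."""
    total = sum(len(row) for row in mask)
    flagged = sum(sum(row) for row in mask)
    return flagged / total if total else 0.0


def false_positive_stats(masks: List[Mask]) -> Tuple[float, float]:
    """Return (FPR, APAA) in percent for predictions made on real images."""
    if not masks:
        raise ValueError("need at least one predicted mask")
    ratios = [artifact_area_ratio(m) for m in masks]
    # FPR: share of images whose predicted mask is non-empty.
    fpr = 100.0 * sum(r > 0 for r in ratios) / len(ratios)
    # APAA: predicted artifact area ratio, averaged over all images.
    apaa = 100.0 * sum(ratios) / len(ratios)
    return fpr, apaa


# Toy example: four 2x2 "real" images, one of which has a single flagged pixel.
empty = [[0, 0], [0, 0]]
flagged = [[0, 1], [0, 0]]
fpr, apaa = false_positive_stats([empty, empty, empty, flagged])
print(f"FPR = {fpr:.2f}%, APAA = {apaa:.2f}%")
```

Because APAA averages over all images, including those with empty masks, it can be much smaller than the FPR when the flagged regions are small.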

4.6. Ablation Study

Mask-level vs. Pixel-level Rejection. To validate the proposed mask-based formulation, we compare MSR against standard pixel-wise segmentation baselines. As reported in Table 4, pixel-level approaches, including UPerNet and the standard Mask2Former pixel decoder head, achieve suboptimal performance (38.42% and 40.25% mIoU, respectively). We attribute this gap to the inherent noise in generated images: pixel-wise classifiers often overfit to high-frequency local artifacts, resulting in fragmented predictions. In contrast, our mask-level rejection leverages object-level priors to propose coherent regions, suppressing background noise and improving mIoU by approximately 8 percentage points over the strongest pixel-wise baseline (48.20% vs. 40.25%).

Anomaly Scoring Functions. We further investigate the design of the anomaly scoring function defined in Equation (5). Table 4 presents the comparison between our proposed metric and conventional uncertainty measures. Standard entropy $H(P)$ and max probability inversion ($1 - \max P$) yield mIoU scores of 42.50% and 44.12%, respectively. While effective for out-of-distribution detection, these metrics do not explicitly account for the geometric quality of the predicted regions. By incorporating the binary mask prediction M as a weighting factor (“Ours w/o Mask Weighting” vs. “Full Equation (5)”), we observe a performance gain of 2.4% (45.80% → 48.20%). This confirms that modulating the semantic rejection score with the mask confidence effectively filters out low-quality proposals where the model is uncertain about the object boundaries.
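To make the difference between these scoring strategies concrete, the sketch below evaluates the two conventional uncertainty measures on a toy class distribution and adds a mask-weighted variant. The form of `mask_weighted_score` is an assumption chosen only to illustrate the idea of modulating a rejection score by mask confidence; it is not a reproduction of Equation (5).

```python
import math
from typing import Sequence


def entropy_score(probs: Sequence[float]) -> float:
    """Shannon entropy H(P) of a class distribution (natural logarithm)."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def max_prob_inversion(probs: Sequence[float]) -> float:
    """1 - max P: large when no class claims the region confidently."""
    return 1.0 - max(probs)


def mask_weighted_score(probs: Sequence[float], mask_conf: float) -> float:
    """ASSUMED illustrative form: rejection score scaled by mask confidence.

    The exact weighting used in the paper's Equation (5) may differ.
    """
    return mask_conf * max_prob_inversion(probs)


confident = [0.90, 0.05, 0.05]  # region clearly belongs to one known class
rejected = [0.34, 0.33, 0.33]   # no known class claims the region

for name, p in [("confident", confident), ("rejected", rejected)]:
    print(name, round(entropy_score(p), 3), round(max_prob_inversion(p), 3))

# A rejected region with a weak mask contributes less than one with a strong mask.
print(mask_weighted_score(rejected, 0.2), mask_weighted_score(rejected, 0.9))
```

Under this assumed form, a proposal whose mask confidence is near zero contributes almost nothing to the anomaly map regardless of how strongly the classes reject it, which is the filtering behavior described above.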

Margin m in Semantic Suppression Margin Loss. We analyze the sensitivity of the proposed framework to the margin hyperparameter m within the Semantic Suppression Margin Loss ($L_{SSM}$). This parameter controls the minimum distance enforced between the artifact embeddings and the semantic queries. As shown in Table 5, the model performance exhibits an inverted-U trend with respect to m. When the margin is set to a low value (e.g., $m = 0.1$), the regularization effect is insufficient, resulting in a suboptimal mIoU of 45.54% due to the limited discriminability between artifacts and background content. Conversely, an excessively large margin (e.g., $m = 0.9$) imposes overly stringent constraints on the latent space, making the optimization objective difficult to satisfy and degrading the performance to 44.80%. The model achieves optimal segmentation accuracy (48.20% mIoU) at $m = 0.5$, indicating a balanced trade-off between feature compactness and inter-class separability.
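The role of the margin can be illustrated with a generic hinge-style penalty. The sketch below is not the paper's $L_{SSM}$; it assumes cosine distance between an artifact embedding and each semantic query, and penalizes any query that lies closer than the margin m. All names are hypothetical.

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / (norm_a * norm_b)


def margin_hinge_loss(
    artifact_emb: Sequence[float],
    queries: Sequence[Sequence[float]],
    m: float,
) -> float:
    """ASSUMED generic hinge: mean of max(0, m - distance) over all queries."""
    penalties = [max(0.0, m - cosine_distance(artifact_emb, q)) for q in queries]
    return sum(penalties) / len(penalties)


artifact = [1.0, 0.0]
queries = [[1.0, 0.0], [0.0, 1.0]]  # one query coincides with the artifact, one is orthogonal

for m in (0.1, 0.5, 0.9):
    print(m, margin_hinge_loss(artifact, queries, m))
```

In this toy setting a small margin produces only a weak penalty even when a query coincides with the artifact embedding, while a large margin demands more separation, mirroring the trade-off discussed above.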

Anomaly map binarization threshold $τ$. We investigate the sensitivity of the final localization performance to the binarization threshold $τ$, which is applied to the continuous anomaly map $A(x, y)$ to generate binary artifact masks. As reported in Table 6, mIoU follows an inverted-U trajectory with respect to the threshold. With a low threshold (e.g., $τ = 0.1$), the MSR framework adopts a highly sensitive detection strategy. While this ensures the inclusion of most potential artifact regions, it also admits a broader range of weakly rejected semantic query responses, which lowers mIoU to 22.16%. Conversely, at the high end of the range (e.g., $τ = 0.9$), the model retains only the most salient structural distortions with the highest rejection confidence, and mIoU is 28.51%. The best localization accuracy of 48.20% is reached at $τ = 0.5$. This result indicates that the default threshold balances capturing diverse, subtle artifact patterns against maintaining precise localization boundaries. Performance stays within about 6 points of the optimum across the intermediate range (45.83% at $τ = 0.3$ and 42.58% at $τ = 0.7$), which supports the reliability of our semantic rejection signals.
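The thresholding step and its effect on overlap can be sketched as follows. This is a minimal plain-Python illustration that assumes a strict `>` comparison against $τ$ and computes the IoU of the artifact class for a single image; the benchmark's exact mIoU averaging convention is not restated here, and the toy values are invented for illustration.

```python
from typing import List, Sequence


def binarize(anomaly_map: Sequence[Sequence[float]], tau: float) -> List[List[int]]:
    """Turn a continuous anomaly map A(x, y) into a binary artifact mask."""
    # Assumption: a pixel is flagged when its score is strictly above tau.
    return [[1 if v > tau else 0 for v in row] for row in anomaly_map]


def iou(pred: Sequence[Sequence[int]], gt: Sequence[Sequence[int]]) -> float:
    """Intersection over union of two binary masks of equal shape."""
    inter = union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            inter += p & g
            union += p | g
    return inter / union if union else 1.0  # two empty masks agree perfectly


# Toy 2x3 anomaly map and ground-truth artifact mask.
A = [[0.05, 0.20, 0.60],
     [0.40, 0.80, 0.95]]
gt = [[0, 0, 1],
      [0, 1, 1]]

for tau in (0.1, 0.5, 0.9):
    print(tau, round(iou(binarize(A, tau), gt), 3))
```

On this toy map the low threshold over-segments and the high threshold under-segments, so the overlap peaks at the intermediate value, which is the same qualitative shape reported in Table 6.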

Training data size. Figure 5 illustrates the relationship between segmentation performance and the volume of training data. We observe a consistent upward trend across all tasks as the data ratio increases from 0% (zero-shot) to 100% (full-data fine-tuning). The gain is particularly pronounced for semantically complex synthesis tasks, which rely heavily on supervision to resolve ambiguities. For instance, the text-to-image (T2I) task rises from 10.00% to 26.80%, while the mask-to-image (M2I) task more than doubles its accuracy, climbing from 20.05% to 43.50% and nearly matching the performance of easier tasks. Conversely, tasks with stronger structural constraints, such as inpainting and StyleGAN2, grow steadily and roughly logarithmically, starting at 32.51% and 28.12%, respectively. Crucially, despite the significant improvements yielded by fine-tuning, the model already achieves non-trivial zero-shot performance.

Efficiency analysis. To assess the scalability and potential for real-time deployment, we evaluated our MSR framework using backbones with varying computational costs. As shown in Table 7, we compared the Swin-B (Transformer) against the lightweight ResNet-101 (CNN). Replacing Swin-B with ResNet-101 significantly reduces the computational burden, decreasing the GFLOPs from 268 to 160 (approximately a 40% reduction) and model parameters from 107 M to 63 M. Despite this substantial reduction in model capacity, the MSR framework equipped with ResNet-101 still achieves a robust mIoU of 44.57%. This result confirms two critical findings: first, our Semantic Rejection mechanism is a generic paradigm that functions effectively across both CNN and Transformer architectures; second, the framework offers a flexible trade-off between maximum precision and inference efficiency, making it suitable for diverse resource constraints.
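The trade-off quoted above follows from simple arithmetic on the Table 7 entries. The short sketch below recomputes the relative savings and the accuracy gap; the helper name `relative_reduction` is illustrative.

```python
def relative_reduction(before: float, after: float) -> float:
    """Percentage reduction when moving from `before` to `after`."""
    return 100.0 * (before - after) / before


# Values taken from Table 7 (Swin-B vs. ResNet-101).
gflops_cut = relative_reduction(268, 160)
params_cut = relative_reduction(107, 63)
miou_drop = 48.20 - 44.57  # absolute difference in mIoU points

print(f"GFLOPs reduced by {gflops_cut:.1f}%")
print(f"Parameters reduced by {params_cut:.1f}%")
print(f"mIoU lower by {miou_drop:.2f} points")
```

Both compute and parameter count fall by roughly 40%, while mIoU decreases by fewer than 4 points.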

5. Limitation and Future Work

In this work, we proposed a novel perspective on artifact localization by shifting the paradigm from pixel-wise regression to Mask-based Semantic Rejection. While our MSR framework demonstrates superior performance and data efficiency, there are several avenues for future exploration and improvement.

Limitations. Our method relies on the premise that artifacts manifest as “unrecognized” or “rejected” semantic entities. Consequently, a limitation arises when an artifact forms a semantically valid object that appears in an implausible context (e.g., a perfectly rendered car floating in the sky), as shown in Figure 6. Because the semantic query may classify such a region as a valid “car” with high confidence, our rejection mechanism can overlook these context-dependent anomalies. The primary objective of our framework is the localization of “perceptual artifacts”, defined as structural distortions and unnatural textures that directly degrade the pixel-level visual quality of generated images. Therefore, while logical inconsistencies are indeed an issue in generative models, they fall outside the specific scope of this study. We acknowledge this limitation and plan to address such scene-level anomalies in future work, for example by integrating Vision–Language Models (VLMs) to verify not just the existence of objects but also their semantic compatibility with the global scene description.

From Detection to Correction. Beyond localization, the precise masks generated by MSR offer a promising foundation for image restoration and editing. As we localize defective regions, these masks can serve as automatic guidance for inpainting models to perform targeted fixing operations. We envision a self-correcting generative framework, where MSR acts as a quality control agent in the generation loop: detecting artifacts in the initial output and automatically triggering an inpainting refinement step. This closed-loop system could significantly enhance the yield rate of high-quality samples in text-to-image generation pipelines without human intervention.
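The envisioned closed loop can be sketched as a simple detect-then-inpaint iteration. Everything here is hypothetical: `detect` stands in for an artifact localizer such as MSR and `inpaint` for any inpainting model, both passed in as plain callables, and the stopping rule (empty mask or a fixed round budget) is an assumption.

```python
from typing import Callable, List, Tuple

Image = List[List[float]]
Mask = List[List[int]]


def refine(
    image: Image,
    detect: Callable[[Image], Mask],
    inpaint: Callable[[Image, Mask], Image],
    max_rounds: int = 3,
) -> Tuple[Image, int]:
    """Repeat detection and inpainting until no artifact is flagged."""
    rounds = 0
    for _ in range(max_rounds):
        mask = detect(image)
        if not any(any(row) for row in mask):
            break  # empty mask: nothing left to fix
        image = inpaint(image, mask)
        rounds += 1
    return image, rounds


# Toy stand-ins: negative pixels count as "artifacts" and are reset to 0.0.
def toy_detect(img: Image) -> Mask:
    return [[1 if v < 0 else 0 for v in row] for row in img]


def toy_inpaint(img: Image, mask: Mask) -> Image:
    return [[0.0 if m else v for v, m in zip(row, mrow)]
            for row, mrow in zip(img, mask)]


fixed, used = refine([[1.0, -1.0], [0.5, 0.2]], toy_detect, toy_inpaint)
print(fixed, used)
```

Bounding the number of rounds keeps the loop from running indefinitely when the inpainting step keeps introducing new flagged regions.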

Reward Modeling for RLHF. Furthermore, the pixel-level anomaly scores provided by MSR can be aggregated into a scalar quality metric. This metric has the potential to serve as a fine-grained reward signal for Reinforcement Learning from Human Feedback (RLHF) or Direct Preference Optimization (DPO). By penalizing the generation of artifacts during the training phase, we can steer generative models toward synthesizing more structurally coherent and semantically valid images.
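One possible aggregation is sketched below. It assumes per-pixel anomaly scores in [0, 1] and defines the reward as one minus their mean; this particular choice, and the name `artifact_reward`, are illustrative assumptions rather than a rule proposed in the paper.

```python
from typing import Sequence


def artifact_reward(anomaly_map: Sequence[Sequence[float]]) -> float:
    """ASSUMED aggregation: 1 minus the mean per-pixel anomaly score."""
    values = [v for row in anomaly_map for v in row]
    if not values:
        raise ValueError("anomaly map is empty")
    return 1.0 - sum(values) / len(values)


clean = [[0.0, 0.0], [0.0, 0.0]]    # no pixel is rejected
flawed = [[0.0, 0.8], [0.0, 0.8]]   # half of the image is strongly rejected

print(artifact_reward(clean), artifact_reward(flawed))
```

A mean-based reward penalizes large artifact regions more than small ones; other aggregations (for example, the maximum score or the thresholded area) would weight small but severe defects differently.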

6. Conclusions

We have presented the Mask-based Semantic Rejection (MSR) framework, a novel approach for localizing perceptual artifacts in diverse image synthesis tasks. By leveraging the geometric priors of Mask2Former and introducing a Semantic Suppression Margin Loss, MSR effectively identifies artifacts as semantic outliers rather than merely texture anomalies. Extensive experiments on the PAL benchmark demonstrate that our method significantly outperforms state-of-the-art baselines under both unified and specialist settings, achieving new benchmarks for localization accuracy. Critically, our analysis highlights the data efficiency and generalization capability of MSR. The framework maintains robust zero-shot expressiveness and adapts rapidly to complex generative distributions (e.g., text-to-image) with minimal supervision. We hope that MSR will serve as a strong baseline for future forensic research and pave the way for more reliable, self-refining generative systems.

Funding

This research was supported by BUPT Excellent Ph.D. Students Foundation No. CX20242081.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data were published in https://github.com/owenzlz/PAL4VST (accessed on 25 September 2025).

Conflicts of Interest

The author declares no conflicts of interest.

References

1. Ho, J.; Jain, A.; Abbeel, P. Denoising diffusion probabilistic models. Adv. Neural Inf. Process. Syst. 2020, 33, 6840–6851.
2. Dhariwal, P.; Nichol, A. Diffusion Models Beat GANs on Image Synthesis. arXiv 2021, arXiv:2105.05233.
3. Ramesh, A.; Dhariwal, P.; Nichol, A.; Chu, C.; Chen, M. Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv 2022, arXiv:2204.06125.
4. Saharia, C.; Chan, W.; Saxena, S.; Li, L.; Whang, J.; Denton, E.L.; Ghasemipour, K.; Gontijo Lopes, R.; Karagol Ayan, B.; Salimans, T.; et al. Photorealistic text-to-image diffusion models with deep language understanding. Adv. Neural Inf. Process. Syst. 2022, 35, 36479–36494.
5. Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; Ommer, B. High-Resolution Image Synthesis with Latent Diffusion Models. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: New Orleans, LA, USA, 2022; pp. 10674–10685.
6. Betker, J.; Goh, G.; Jing, L.; Brooks, T.; Wang, J.; Li, L.; Ouyang, L.; Zhuang, J.; Lee, J.; Guo, Y.; et al. Improving image generation with better captions. Comput. Sci. 2023, 2, 8. Available online: https://cdn.openai.com/papers/dall-e-3.pdf (accessed on 21 September 2023).
7. Zhang, L.; Zhou, Y.; Barnes, C.; Amirghodsi, S.; Lin, Z.; Shechtman, E.; Shi, J. Perceptual artifacts localization for inpainting. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 146–164.
8. Zhang, L.; Xu, Z.; Barnes, C.; Zhou, Y.; Liu, Q.; Zhang, H.; Amirghodsi, S.; Lin, Z.; Shechtman, E.; Shi, J. Perceptual Artifacts Localization for Image Synthesis Tasks. In 2023 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE: Paris, France, 2023; pp. 7545–7556.
9. Cheng, B.; Misra, I.; Schwing, A.G.; Kirillov, A.; Girdhar, R. Masked-attention mask transformer for universal image segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: New Orleans, LA, USA, 2022; pp. 1290–1299.
10. Nayal, N.; Yavuz, M.; Henriques, J.F.; Güney, F. Rba: Segmenting unknown regions rejected by all. In IEEE/CVF International Conference on Computer Vision; IEEE: Paris, France, 2023; pp. 711–722.
11. Goodfellow, I.J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.C.; Bengio, Y. Generative Adversarial Nets. In Neural Information Processing Systems; Association for Computing Machinery: Montreal, QC, Canada, 2014.
12. Karras, T.; Laine, S.; Aila, T. A Style-Based Generator Architecture for Generative Adversarial Networks. In 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); IEEE: Long Beach, CA, USA, 2019; pp. 4396–4405.
13. Bai, S.; Chen, K.; Liu, X.; Wang, J.; Ge, W.; Song, S.; Dang, K.; Wang, P.; Wang, S.; Tang, J.; et al. Qwen2.5-VL Technical Report. arXiv 2025, arXiv:2502.13923.
14. Wang, S.Y.; Wang, O.; Zhang, R.; Owens, A.; Efros, A.A. CNN-generated images are surprisingly easy to spot... for now. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Seattle, WA, USA, 2020; pp. 8695–8704.
15. Lanzino, R.; Fontana, F.; Diko, A.; Marini, M.R.; Cinque, L. Faster than lies: Real-time deepfake detection using binary neural networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Seattle, WA, USA, 2024; pp. 3771–3780.
16. Agarwal, A.; Ratha, N. Deepfake catcher: Can a simple fusion be effective and outperform complex dnns? In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Seattle, WA, USA, 2024; pp. 3791–3801.
17. Lin, L.; He, X.; Ju, Y.; Wang, X.; Ding, F.; Hu, S. Preserving fairness generalization in deepfake detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Seattle, WA, USA, 2024; pp. 16815–16825.
18. Ciamarra, A.; Caldelli, R.; Del Bimbo, A. Temporal surface frame anomalies for deepfake video detection. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Seattle, WA, USA, 2024; pp. 3837–3844.
19. Baraldi, L.; Cocchi, F.; Cornia, M.; Baraldi, L.; Nicolosi, A.; Cucchiara, R. Contrasting deepfakes diffusion via contrastive learning and global-local similarities. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2024; pp. 199–216.
20. Chai, L.; Bau, D.; Lim, S.N.; Isola, P. What makes fake images detectable? understanding properties that generalize. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2020; pp. 103–120.
21. Liang, Y.; He, J.; Li, G.; Li, P.; Klimovskiy, A.; Carolan, N.; Sun, J.; Pont-Tuset, J.; Young, S.; Yang, F.; et al. Rich human feedback for text-to-image generation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Seattle, WA, USA, 2024; pp. 19401–19411.
22. You, Z.; Gu, J.; Li, Z.; Cai, X.; Zhu, K.; Dong, C.; Xue, T. Descriptive image quality assessment in the wild. arXiv 2024, arXiv:2405.18842.
23. Wu, H.; Zhang, Z.; Zhang, W.; Chen, C.; Liao, L.; Li, C.; Gao, Y.; Wang, A.; Zhang, E.; Sun, W.; et al. Q-Align: Teaching LMMs for Visual Scoring via Discrete Text-Defined Levels. In International Conference on Machine Learning; Association for Computing Machinery: Vienna, Austria, 2024; pp. 54015–54029.
24. Wang, S.; Li, C.; Zhang, Z.; Zhou, H.; Dong, W.; Chen, J.; Zhai, G.; Liu, X. AU-IQA: A Benchmark Dataset for Perceptual Quality Assessment of AI-Enhanced User-Generated Content. In 33rd ACM International Conference on Multimedia; Association for Computing Machinery: Dublin, Ireland, 2025; pp. 12737–12744.
25. Hendrycks, D.; Gimpel, K. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In International Conference on Learning Representations; ICLR: Toulon, France, 2017.
26. Jeong, J.; Zou, Y.; Kim, T.; Zhang, D.; Ravichandran, A.; Dabeer, O. Winclip: Zero-/few-shot anomaly classification and segmentation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Vancouver, BC, Canada, 2023; pp. 19606–19616.
27. Zhou, Q.; Pang, G.; Tian, Y.; He, S.; Chen, J. AnomalyCLIP: Object-agnostic Prompt Learning for Zero-shot Anomaly Detection. In Twelfth International Conference on Learning Representations; ICLR: Vienna, Austria, 2024.
28. He, H.; Bai, Y.; Zhang, J.; He, Q.; Chen, H.; Gan, Z.; Wang, C.; Li, X.; Tian, G.; Xie, L. Mambaad: Exploring state space models for multi-class unsupervised anomaly detection. Adv. Neural Inf. Process. Syst. 2024, 37, 71162–71187.
29. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft coco: Common objects in context. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 740–755.
30. Zhou, B.; Zhao, H.; Puig, X.; Fidler, S.; Barriuso, A.; Torralba, A. Scene parsing through ade20k dataset. In IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Honolulu, HI, USA, 2017; pp. 633–641.
31. Tian, Y.; Liu, Y.; Pang, G.; Liu, F.; Chen, Y.; Carneiro, G. Pixel-wise energy-biased abstention learning for anomaly segmentation on complex urban driving scenes. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 246–263.
32. Grcić, M.; Bevandić, P.; Šegvić, S. Densehybrid: Hybrid anomaly detection for dense open-set recognition. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 500–517.
33. Chan, R.; Rottmann, M.; Gottschalk, H. Entropy maximization and meta classification for out-of-distribution detection in semantic segmentation. In IEEE/CVF International Conference on Computer Vision; IEEE: Montreal, QC, Canada, 2021; pp. 5128–5137.
34. Xing, P.; Sun, Y.; Zeng, D.; Li, Z. Normal image guided segmentation framework for unsupervised anomaly detection. IEEE Trans. Circuits Syst. Video Technol. 2023, 34, 4639–4652.
35. Sun, Y.; Chen, J.; Zhang, S.; Zhang, X.; Chen, Q.; Zhang, G.; Ding, E.; Wang, J.; Li, Z. VRP-SAM: SAM with visual reference prompt. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Seattle, WA, USA, 2024; pp. 23565–23574.
36. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition; IEEE: Boston, MA, USA, 2015; pp. 3431–3440.
37. Liu, Z.; Lin, Y.; Cao, Y.; Hu, H.; Wei, Y.; Zhang, Z.; Lin, S.; Guo, B. Swin transformer: Hierarchical vision transformer using shifted windows. In IEEE/CVF International Conference on Computer Vision; IEEE: Montreal, QC, Canada, 2021; pp. 10012–10022.
38. Karras, T.; Laine, S.; Aittala, M.; Hellsten, J.; Lehtinen, J.; Aila, T. Analyzing and improving the image quality of stylegan. In IEEE/CVF Conference on Computer Vision and Pattern Recognition; IEEE: Seattle, WA, USA, 2020; pp. 8110–8119.
39. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-cam: Visual explanations from deep networks via gradient-based localization. In IEEE International Conference on Computer Vision; IEEE: Venice, Italy, 2017; pp. 618–626.
40. Wang, X.; Xie, L.; Dong, C.; Shan, Y. Real-esrgan: Training real-world blind super-resolution with pure synthetic data. In IEEE/CVF International Conference on Computer Vision; IEEE: Montreal, QC, Canada, 2021; pp. 1905–1914.
41. Wang, T.; Zhang, T.; Zhang, B.; Ouyang, H.; Chen, D.; Chen, Q.; Wen, F. Pretraining is all you need for image-to-image translation. arXiv 2022, arXiv:2205.12952.
42. Chai, L.; Wulff, J.; Isola, P. Using latent space regression to analyze and leverage compositionality in GANs. In International Conference on Learning Representations; ICLR: Vienna, Austria, 2021.
43. Fele, B.; Lampe, A.; Peer, P.; Struc, V. C-vton: Context-driven image-based virtual try-on network. In IEEE/CVF Winter Conference on Applications of Computer Vision; IEEE: Waikoloa, HI, USA, 2022; pp. 3144–3153.
44. Zhang, X.; Barron, J.T.; Tsai, Y.T.; Pandey, R.; Zhang, X.; Ng, R.; Jacobs, D.E. Portrait shadow manipulation. ACM Trans. Graph. (TOG) 2020, 39, 78.
45. Karras, T.; Aittala, M.; Laine, S.; Härkönen, E.; Hellsten, J.; Lehtinen, J.; Aila, T. Alias-free generative adversarial networks. Adv. Neural Inf. Process. Syst. 2021, 34, 852–863.
46. Epstein, D.; Park, T.; Zhang, R.; Shechtman, E.; Efros, A.A. Blobgan: Spatially disentangled scene representations. In European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2022; pp. 616–635.
47. Xu, X.; Wang, Z.; Zhang, G.; Wang, K.; Shi, H. Versatile diffusion: Text, images and variations all in one diffusion model. In IEEE/CVF International Conference on Computer Vision; IEEE: Paris, France, 2023; pp. 7754–7765.
48. Peebles, W.; Xie, S. Scalable diffusion models with transformers. In IEEE/CVF International Conference on Computer Vision; IEEE: Paris, France, 2023; pp. 4195–4205.

Figure 1. Localization of several typical artifacts (marked by red areas) in synthetic images. Perceptual artifacts in generated content manifest in diverse forms, categorized here as follows. (Top) Distortion: Structural malformations, such as anatomical deformities in human bodies. (Middle) Unknown Region: Unrecognizable areas that fail to form coherent objects. (Bottom) Unusual Texture: Non-physical noise patterns or unnatural surface degradations. The red overlays highlight the artifact regions, illustrating the challenge of localizing such heterogeneous defects.

Figure 2. Using a pre-trained Mask2Former, we extract object masks and class probabilities. The MSR module identifies artifacts as “semantic voids”—regions with salient masks but low classification confidence (see “Reject by all classes”). These signals are aggregated pixel-wise to generate the final anomaly score heatmap for artifact localization.

Figure 3. Qualitative comparison of artifact localization on different tasks in the PAL benchmark. Detected artifacts are marked as red areas.

Figure 4. Qualitative evaluation on real-world images. Red contours indicate the regions localized by our model as artifacts. (Top row): The model sometimes tends to detect small text. (Bottom row): For the majority of natural images, our model exhibits robust behavior, correctly predicting no artifacts (empty masks).

Figure 5. Ablation results on the percentage of training data used with our proposed loss.

Figure 6. Visual analysis of representative failure cases. The proposed MSR framework is challenged by scenarios involving logical inconsistencies, where objects are semantically valid in isolation but are contextually misplaced (e.g., a boy appearing in the sky).

Table 1. Quantitative mIoU (↑) of Perceptual Artifacts Localization on eleven image synthesis tasks. We compare our MSR framework against state-of-the-art methods under both Specialist and Unified settings. We use the following abbreviations for the tasks: LDMs → latent diffusion models [5], SR → super resolution [40], E2I → edge-to-image [41], M2I → mask-to-image [41], T2I → text-to-image [3,5], Comp. → image composition [42], VTON → virtual try-on [43], and PSR → portrait shadow removal [44]. Bold indicates best performance, and underline indicates the second best.

Methods	StyleGAN2	LDMs	AnyRes	SR	Inpaint	E2I	M2I	T2I	Comp.	VTON	PSR
CNNgenerates [14] + Grad-CAM	4.38	2.43	1.39	0.86	3.54	0.95	0.48	0.51	7.13	2.56	0.0
Patch Forensics [20]	3.81	9.08	8.76	1.34	5.35	14.19	10.54	2.71	9.63	2.14	0.66
PAL4Inpaint [7]	6.12	0.98	1.03	0.81	42.07	0.86	0.31	0.51	14.42	15.94	0.0
PAL (Specialist) [8]	37.85	35.39	9.15	14.41	42.07	45.56	35.01	21.79	25.31	37.44	21.33
PAL (Unified) [8]	38.53	30.86	34.74	11.92	41.81	46.01	39.37	19.65	29.53	38.07	5.10
MSR (Specialist)	42.50	37.10	36.00	16.50	44.20	49.10	43.50	26.80	34.20	41.80	23.00
MSR (Unified)	41.20	34.50	36.80	13.50	41.95	48.20	41.10	23.50	32.10	40.50	18.20

Table 2. Quantitative mIoU (↑) evaluation on five unseen generative models. “Fine-tuning” denotes adaptation using ten labeled samples. We use the following abbreviations: SD2 → Stable Diffusion v2.0 [5], VD → Versatile Diffusion [47], DiT → Diffusion Transformer [48]. Bold indicates best performance, and underline indicates the second best.

Methods	StyleGAN3 [45]	BlobGAN [46]	SD2 [5]	VD [47]	DiT [48]
CNNgenerates [14]	2.30	3.67	0.12	0.57	1.42
Patch Forensics [20]	11.43	5.96	3.08	2.97	3.18
PAL4Inpaint [7]	5.18	13.00	0.85	0.63	1.15
PAL (Zero-shot) [8]	46.45	25.39	6.75	5.92	16.46
PAL (Fine-tuning) [8]	40.81	33.33	11.04	22.18	31.76
MSR (Zero-shot)	49.20	28.50	9.80	10.50	21.30
MSR (Fine-tuning)	48.10	38.90	24.50	29.80	39.20

Table 3. Quantitative false positive evaluation on real-world images. We report the false positive rate (FPR) and average predicted artifact area (APAA) across different scene categories. ↓ indicates that lower values are better.

Category	FPR (%) ↓	APAA (%) ↓
Nature (Sky, Water, Plant)	1.45	0.04
Animals	1.82	0.07
Indoor Scenes	4.10	0.21
Urban / Street Scenes	6.25	0.28
Overall	3.82	0.14

Table 4. Ablation study on rejection mechanisms and anomaly scoring functions. We compare our mask-based approach against pixel-wise baselines and analyze the impact of different scoring strategies on the PAL benchmark (Unified setting). The models are evaluated using mIoU and Average Precision (AP).

Method/Variant	Mechanism	mIoU (%)
Rejection Level Analysis
Pixel-wise Baseline (UPerNet)	Per-pixel Classification	38.42
Pixel-wise Baseline (Mask2Former)	Per-pixel Classification	40.25
MSR (Ours)	Mask-level Rejection	48.20
Scoring Function Analysis
Entropy Uncertainty $H(P)$	Uncertainty Quantification	42.50
Max Probability ($1 - \max P$)	Uncertainty Quantification	44.12
Ours w/o Mask Weighting	–	45.80
Ours (Full Equation (5))	Mask-weighted Inversion	48.20

Table 5. Sensitivity analysis of the margin hyperparameter m. We evaluate the segmentation performance (mIoU) on the unified benchmark by varying the margin threshold in the Semantic Suppression Margin Loss ($L_{SSM}$). The default setting is highlighted in gray.

Margin Value (m)	mIoU (%)
0.1	45.54
0.3	47.85
0.5 (Default)	48.20
0.7	46.45
0.9	44.80

Table 6. Sensitivity analysis of the binarization threshold $τ$. We evaluate the segmentation performance (mIoU) on the unified benchmark by varying the threshold used to generate binary artifact masks. The default setting is highlighted in gray.

Threshold Value ($τ$)	mIoU (%)
0.1	22.16
0.3	45.83
0.5 (Default)	48.20
0.7	42.58
0.9	28.51

Table 7. Efficiency analysis of our MSR framework with different backbones. We compare the detection performance (mIoU) against computational cost (GFLOPs) and model size (parameters). The ResNet-101 variant offers a significant efficiency gain while maintaining competitive accuracy.

Method	Backbone	mIoU (%)	GFLOPs	Parameters
MSR (Ours)	ResNet-101	44.57	160	63 M
MSR (Ours)	Swin-B	48.20	268	107 M

