1. Introduction
Generative AI has rapidly expanded into creative fields, transforming how visual art is produced, modified, and experienced. Early breakthroughs, such as StyleGAN [1,2], laid the foundation for high-quality image synthesis, but the field has since been revolutionized by the rapid rise of diffusion-based models [3,4,5]. These newer techniques have significantly enhanced the ability to generate realistic images, mimic artistic styles [6,7], and even create entirely new visual compositions [8,9], establishing diffusion models as the dominant paradigm in generative artistry. Among these capabilities, style replication has emerged as a key area of interest, allowing users to apply diverse historical and modern artistic styles to AI-generated images [10,11,12]. This technology enables greater artistic expression and personalization, bridging the gap between computational creativity and traditional artistry and empowering artists, designers, and hobbyists to explore and reinterpret visual styles in ways that were previously highly specialized or time-intensive [13,14,15,16].
The purpose of this study is to provide a critical assessment of the capabilities and limitations of current generative tools in effectively replicating styles. By examining both the technical performance and aesthetic outcomes of these tools, the study aims to highlight their strengths, identify areas where they fall short, and offer insights into the potential improvements needed to enhance their application in creative fields.
Specifically, we compared twelve modern generative models: DALL·E 3 (https://openai.com/index/dall-e-3/, accessed on 3 July 2025), Stable Diffusion 1.5 (https://huggingface.co/stable-diffusion-v1-5/stable-diffusion-v1-5, accessed on 3 July 2025), Stable Diffusion 3.5 Large (https://stabledifffusion.com/tools/sd-3-5-large, accessed on 3 July 2025), Flux 1.1 Pro (https://flux1.ai/flux1-1, accessed on 3 July 2025), Flux 1 Schnell (https://fluxaiimagegenerator.com/flux-schnell, accessed on 3 July 2025), Omnigen (https://omnigenai.org/, accessed on 3 July 2025), Ideogram (https://ideogram.ai/login, accessed on 3 July 2025), Kolors 1.5 (https://klingai.com/text-to-image/new, accessed on 3 July 2025), Firefly Image 3 (https://firefly.adobe.com/, accessed on 3 July 2025), Leonardo Phoenix (https://leonardo.ai, accessed on 3 July 2025), Midjourney V6.1 (https://www.midjourney.com/imagine, accessed on 3 July 2025), and Auto-Aesthetics v1 (https://neural.love/blog/auto-aesthetics-v1-ai-art-revolution, accessed on 3 July 2025).
The models were compared using 73 uniform prompts that span a broad range of painting styles from the past five centuries. This resulted in the creation of a large supervised dataset of AI-generated artworks: the AI-Pastiche dataset. The dataset was explicitly conceived to highlight the multimodal interplay between language and image, focusing on how textual prompts guide visual style and content. The dataset can also offer a valuable resource for advancing research in areas such as deepfake detection, digital forensics, and the ethical study of AI-generated content. By supplying a controlled, high-quality set of AI-generated images, the dataset aids in training and testing models for improved detection accuracy, robustness against manipulation, and broader exploration of generative AI capabilities across fields ranging from security to digital art.
The quality of the generated images was evaluated based on two criteria: the ability of the models to faithfully replicate human-crafted artwork and their capacity to faithfully adhere to the style and content specified in the prompts, preserving coherence and integrity of the composition. The first criterion was assessed through a public survey in which participants were asked to distinguish between human-created and AI-generated images. The second criterion, which involved a per-prompt comparison of the samples generated by different models, was evaluated directly by members of our team along with a few additional volunteers.
The novelty of our approach consists precisely in the focus on human perception and qualitative evaluation across multiple generators, offering a blend of subjective assessment, style focus, and prompt–image alignment, all grounded in human judgment.
The results of our investigation reveal that while modern generative models demonstrate remarkable artistic capabilities, they still encounter significant challenges in faithfully replicating historical styles. Rather than a lack of detail, hyperrealism emerges as the primary obstacle—AI-generated images often display excessive sharpness and unnatural precision, making them visually striking but historically inconsistent. According to our evaluation, state-of-the-art models successfully produce images that non-expert users misidentify as human-created in less than 30% of cases, highlighting the persistent gap between AI-generated and traditionally crafted artworks.
This work is part of a larger and ambitious project that aims to assess whether Large Language Models possess an aesthetic sense and, if so, to identify the aesthetic principles that guide their preferences. This investigation represents a significant advancement in understanding the emergent abilities of LLMs [17,18,19,20] and their social implications. In evaluating the aesthetic sense of LLMs, it is essential to bypass any potential familiarity that the models may have with specific artworks, as this could allow them to draw on pre-existing evaluations or learned data. By using a dataset of fictional or AI-generated artworks, such as the one created as part of this study, we can ensure that LLMs rely solely on the information provided in the dataset, thus offering a more controlled evaluation of their aesthetic judgment.
In summary, this work makes two major contributions:
- The creation of a well-curated, richly annotated dataset of AI-generated images, covering some of the most widely used generative models, suitable for a wide range of applications;
- An in-depth evaluation through targeted human surveys, primarily assessing the perceived authenticity of generated images and their adherence to the prompt.
There are a few important disclaimers to make. First, we emphasize that the aim of this research study is not to conduct a comparative qualitative evaluation of different models—a task that would require a significantly larger dataset—but rather to assess the overall performance of these models in the style replication task. The current limitations of such systems are often evident even in a small number of examples; thus, the scale of AI-Pastiche is largely sufficient for a meaningful and focused assessment, demonstrating that problematic cases are not cherry-picked anomalies but rather recurring issues.
We see AI-Pastiche as a foundation for future comparative and diagnostic research across different modeling approaches and aesthetic criteria. The surveys included in this work are designed to enrich the dataset with human evaluations that would be impossible to obtain through automated methods. The choice to involve non-expert participants reflects our interest in the broader interpretability and public reception of AI-generated art. It is not in the scope of this article to carry out in-depth analyses using specific metrics or automatic image–text alignment scores; such investigations are precisely the kind of research we hope to enable and encourage through the release of this dataset.
The survey evaluation was conducted with a population of educated though non-expert users. We contend that at the current stage of development, state-of-the-art style replication does not yet require expert evaluation to identify its major limitations. In any case, expert and non-expert perspectives provide different yet equally valuable insights. No personal or identifiable information was collected at any point during the study. Participants were informed about the purpose and scope of the survey on the introduction page and were given the option to either proceed or opt out. They were also free to discontinue their participation at any time, and any incomplete responses were excluded from the dataset. We carefully considered the ethical implications of conducting a public survey of this scale and ensured that participation remained fully anonymous, voluntary, and non-invasive throughout the process.
Finally, we acknowledge that paintings are not merely flat images—they are anaglyphic, possessing 2½D qualities that play a critical role in perception through a complex interaction of texture, viewer, and context [21,22]. That said, the current quality of generative outputs does not yet reach a level where such subtle perceptual dimensions come into play in a meaningful way. As generative models continue to improve, perceptual features such as surface texture, depth, and material realism may gradually become more relevant for evaluation.
This article has the following structure: In Section 3, we describe our methodology, the way the dataset was created, the selection of models, and the way surveys were formulated and conducted. Section 4 gives a detailed description of the dataset and the associated metadata. In Section 5, we give a detailed description of the surveys, the target audience, and the frameworks used to publish and collect data. Section 6 describes the results of the evaluation. An in-depth discussion of some of the main critical aspects of the style transfer capabilities of generative tools is given in Section 7. In Section 8, we offer a few ideas for future developments and outline some possible applications of our dataset.
2. Related Works
AI-driven artistic style transfer has grown significantly in recent years, driven by advances in deep learning and generative models. Several works have explored the capabilities, limitations, and applications of AI-generated imagery. Our work contributes a comprehensive evaluation of multiple generative models, emphasizing their adherence to artistic style and prompt fidelity.
Early works such as Gatys et al.’s [23] laid the foundation for neural style transfer, introducing methods that blend content and style representations from convolutional neural networks. Subsequent research expanded on these concepts, improving efficiency and control over style application [24]. More recently, diffusion-based models have demonstrated superior results in high-fidelity artistic synthesis, allowing for more nuanced style adaptation. Our study builds upon these advancements but diverges in its focus on evaluating multiple state-of-the-art models across diverse artistic styles and historical periods. This allows for a broader assessment of model performance.
One major area of focus has been how to evaluate and detect AI-generated images. For instance, studies like CIFAKE by Bird and Lotfi [25] and GenImage by Zhu et al. [26] have worked on measuring how realistic synthetic images are and on developing techniques to tell them apart from human-made art. Similarly, Li et al. [27] explored the world of adversarial AI-generated art, shedding light on the challenges of authentication and detection. These efforts are vital to assessing the authenticity of generated works, particularly in contexts where human perception plays a critical role.
To support this kind of research, several large-scale datasets have been created:
- ArtiFact Dataset [28]: This is a diverse mix of real and synthetic images, covering everything from human faces to animals to landscapes, vehicles, and artworks. It includes images synthesized by 25 different methods, including 13 GAN-based models and 7 diffusion models.
- WildFake Dataset [29]: A dataset designed to assess the generalizability of AI-generated image detection models. It contains fake images sourced from the open-source community, covering various styles and synthesis methods.
- TWIGMA Dataset [30]: A large-scale collection of AI-generated images scraped from Twitter from 2021 to 2023, including metadata such as tweet text, engagement metrics, and associated hashtags.
While these studies focus on detecting AI-generated images, we focus on examining how convincingly these images replicate human-created art. Through public perception surveys, we assess whether generated paintings can be mistaken for human artwork, providing insights into the models’ ability to deceive a human viewer.
Beyond detection, generated images are increasingly used as data sources for synthetic training and research applications. The work by Yang et al. [31] discusses the implications of using AI-generated images for training machine learning models. They explore the potential of synthetic datasets to enhance machine learning capabilities while also addressing concerns related to biases, authenticity, and ethical challenges.
Another direction in the field is the use of diffusion models for artistic style transfer. Researchers such as Chung et al. [11] and Zhang et al. [7,12] have introduced training-free methods and pre-trained diffusion models specifically designed for style adaptation. These works highlight the effectiveness of modern diffusion-based architectures in achieving high-fidelity artistic synthesis while maintaining flexibility for style injection. Furthermore, the work by Png et al. [10] proposes a feature-guided approach that improves control over the stylistic aspects of the generated output.
The creative applications of generative AI have also been widely discussed. Haase et al. [15] explored the role of generated imagery in inspiring human creativity, particularly in design workflows. Similarly, Barros and Ai [13] investigated the integration of text-to-image models in industrial design, while Vartiainen and Tedre [16] examined their use in craft education. We complement these works by examining the limitations of generative tools in artistic fidelity, particularly their struggle with maintaining compositional balance, avoiding anachronisms, and ensuring stylistic coherence. We highlight critical shortcomings such as the overuse of hyperrealism, anatomical distortions, and misinterpretations of historical context, which could be key obstacles to seamless integration into professional artistic workflows.
Furthermore, a growing body of work focuses on understanding the emergent capabilities of Large Language Models and their application in aesthetic evaluation. Studies such as those by Wei et al. [17] and Du et al. [19] discuss how LLMs develop new abilities, such as the preference for certain artistic styles. Wang et al. [32] analyzed evaluation metrics for generative images, offering insights into how to assess AI-generated art both quantitatively and qualitatively. These studies can be expanded with our proposed dataset, which, unlike other existing ones, is a controlled dataset of synthetic artworks.
3. Methodology
In this section we outline our methodology for the creation of the dataset, the selection of models, and their evaluation. Our goal was to build a controlled, art-style-focused dataset to evaluate whether modern diffusion models can produce images that are both visually coherent and stylistically faithful to well-defined art styles. To make results comparable across models, we standardized the prompts, generation settings, and metadata.
3.1. Creation of the Dataset, Aims, and Methodology Used for Data Acquisition
Crafting high-quality prompts is crucial because prompt design directly governs the relevance and stylistic fidelity of outputs [33,34,35]. The prompt creation process began with obtaining detailed descriptions of notable artworks drawn from authoritative art history sources and online museum archives. Using these references, we employed a two-stage, reverse-engineering-inspired method to fine-tune the prompts, adapting the descriptors to match the structural and length constraints typically supported by contemporary generative models.
In Stage 1, we manually drafted an initial set of prompts, with a common structure. They typically began with an indication of the style and historical period to imitate, sometimes reinforced by referencing a specific painter. This was followed by a detailed description of the subject, including suggestions for lighting, colors, and tones. Finally, each prompt concluded with a hint about the overall sentiment or emotion the artwork was intended to convey. These prompts were then tested on a subset of the diffusion models selected for the study to identify potential failure modes, including style drift, incorrect motifs, and anachronistic elements.
In Stage 2, the manual prompts were iteratively revised with GPT-4o to strengthen style cues by adding period-specific vocabulary, removing unnecessary adjectives, and clarifying composition. After each revision, a small batch of images was regenerated per model using the refined prompts. A prompt was retained for final art generation only if (i) the outputs consistently displayed the intended properties across a majority of models and (ii) GPT-4o, when given only the generated image, could correctly identify the target style described in the original prompt. Here are a couple of example prompts:
“Generate a detailed winter landscape painting in the Flemish Renaissance style of the second half of the 16th century. Depict a snow-covered village with small, rustic houses nestled into a hilly landscape. Include bare, slender trees in the foreground with hunters walking through the snow, accompanied by dogs. The scene should feature frozen lakes or ponds in the background, where villagers are skating and engaging in winter activities. The sky is a muted, wintry blue-gray, and the overall tone of the painting should evoke a peaceful, yet somewhat melancholic atmosphere, with intricate details showing rural life during winter.”
“Generate a view of Venice in the Vedutism style of the first half of the 18th century, focusing on a scene along the Grand Canal. The composition features detailed classical architecture with grand domes and facades, and gondolas moving along the canal. Add soft clouds to the sky and ensure there is little fading in the horizon, providing clear visibility of distant buildings. The color palette should include very soft blues and warm earth tones, avoiding saturated colors. The atmosphere remains calm and luminous, with minimal light-and-shadow effects, capturing the beauty and grandeur of Venice from a broad perspective.”
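To make the Stage 2 acceptance check concrete, the following sketch illustrates criterion (ii): asking GPT-4o to identify the style of a generated image without access to the original prompt. It assumes the OpenAI Python client; the question wording, the majority threshold, and the helper names are illustrative rather than the exact procedure used in the study.

```python
# Minimal sketch of the Stage 2 style-identification check (criterion ii).
# Assumes the OpenAI Python client; prompt wording and helper names are illustrative.
import base64
from openai import OpenAI

client = OpenAI()

def identified_style(image_path: str) -> str:
    """Ask GPT-4o to name the artistic style of an image, without showing it the prompt."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Name the artistic style and approximate period of this painting, in a few words."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def passes_style_check(image_paths: list[str], target_style: str) -> bool:
    # Simple majority heuristic, used here only to illustrate the acceptance idea:
    # the declared target style (e.g., "Flemish Renaissance") must be recognized
    # in the outputs of most models before the prompt is retained.
    hits = sum(target_style.lower() in identified_style(p).lower() for p in image_paths)
    return hits >= len(image_paths) / 2
```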
To ensure comparability, each accepted prompt was used unchanged across all diffusion models. For each (prompt, model) pair, we generated a fixed number of samples at the model’s native target resolution, using consistent sampler hyperparameters set to the model’s default values. We avoided post-processing and upscaling.
Each image is stored with detailed generation metadata, including the model, prompt text, subject, style, and period. This allows for the regeneration of similar samples and supports controlled ablation studies across models and settings. Further details on the dataset are provided in Section 4.
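As an illustration of this protocol, the sketch below shows how a fixed batch of samples per (prompt, model) pair could be generated with default settings and stored alongside its metadata. It assumes the Hugging Face diffusers library for the open models; the paths, batch size, and field names are illustrative, not the exact pipeline used in the study.

```python
# Illustrative generation loop for an open model, assuming Hugging Face diffusers.
# Default sampler settings are used; no post-processing or upscaling is applied.
import json, torch
from pathlib import Path
from diffusers import StableDiffusionPipeline

SAMPLES_PER_PROMPT = 4  # fixed number of samples per (prompt, model) pair (illustrative)

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

def generate(prompt_record: dict, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    for i in range(SAMPLES_PER_PROMPT):
        image = pipe(prompt_record["prompt"]).images[0]  # model defaults, native resolution
        stem = f"{prompt_record['id']}_{i:02d}"
        image.save(out_dir / f"{stem}.png")
        # Store the metadata needed to regenerate similar samples.
        (out_dir / f"{stem}.json").write_text(json.dumps({
            "model": "stable-diffusion-1.5",
            "prompt": prompt_record["prompt"],
            "subject": prompt_record.get("subject"),
            "style": prompt_record.get("style"),
            "period": prompt_record.get("period"),
        }, indent=2))
```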
3.2. Models
Image generative models are a class of machine learning algorithms designed to synthesize novel images by learning the underlying patterns in existing data. By approximating the underlying distribution of visual data, these models generate outputs that form the foundation of various creative AI applications.
Within the domain of image generation, models are broadly categorized into Text-to-Image (Text2Img) and Image-to-Image (Img2Img) frameworks [36,37], although hybrid and specialized approaches also exist. Text2Img models generate entirely new images based on textual descriptions, effectively translating linguistic cues into visual representations. In contrast, Img2Img models modify or enhance existing images by leveraging an input image as a reference while applying stylistic or contextual transformations. This study primarily focuses on Text2Img models due to their ability to create images purely from descriptive text prompts, making them particularly suited for analyzing artistic style recreation.
To systematically evaluate the artistic fidelity and limitations of state-of-the-art (SOTA) commercial generative models, 12 diffusion-based models were selected, among the most widely used and highly regarded in the field. These models were identified based on their popularity and performance, as detailed in the Introduction. The selection was motivated by three key considerations:
1. Benchmarking Established Models: Using well-established models enables the creation of a high-quality AI-generated art dataset, which could serve as a valuable resource for future research.
2. Avoiding Training and Fine-Tuning Biases: Training a model from scratch or fine-tuning an existing open-source model would not provide a fair assessment of the out-of-the-box capabilities of these models. Our goal was to evaluate their pre-trained performance rather than their adaptability to new training objectives.
3. Computational Constraints: Training or fine-tuning diffusion models is highly resource-intensive. Proprietary models, in particular, are trained on vast datasets with ongoing refinements by dedicated research teams, making them the most suitable candidates for assessing the current peak capabilities of image generative AI.
Initially, 15 diffusion models were considered. Each model was tested using three standardized prompts to evaluate its ability to generate visually coherent and stylistically accurate images. Five researchers independently assessed the outputs based on realism, artifact minimization, and adherence to the prompt. A model was discarded if all five unanimously agreed that it failed to meet these criteria. For example, DeepFloyd IF [38] was among the initial 15 models considered but was excluded from further experimentation. Its generated outputs frequently failed to align with the described artistic movements, particularly struggling with facial features and even simple object shapes (e.g., dogs and other animals).
The final selection of 12 models used in our study is listed in the Introduction, with key specifications summarized in Table 1. It is important to note that many of these models are proprietary, and as a result, their architectural details and training methodologies remain undisclosed.
3.3. Evaluation Criteria
Models are evaluated based on two distinct and orthogonal criteria, each addressing a crucial aspect of their performance:
Authenticity. The first criterion evaluates the model’s ability to generate samples that are sufficiently realistic and convincing, such that they could be mistaken for artifacts created by a human. This involves assessing the quality of the generated output in terms of visual coherence, attention to detail, and overall believability. A high score in this area indicates that the model produces outputs that closely mimic human creativity and craftsmanship.
Adherence to Prompt Instructions. The second criterion focuses on the model’s capacity to accurately follow the detailed instructions specified in the prompt. This involves assessing how well the generated outputs align with the intended artistic style, thematic elements, or any specific requirements outlined. Success in this area demonstrates the model’s ability to interpret and faithfully execute complex and nuanced instructions.
These two evaluation criteria are deliberately designed to be independent. While a model may excel at producing outputs that faithfully mimic human-crafted art, it might still fail to adhere to the stylistic constraints of the prompt, or vice versa. By assessing these dimensions separately, this research study aims to obtain a comprehensive understanding of each model’s strengths and weaknesses across both realism and prompt alignment.
The way these criteria are addressed in our surveys will be described in Section 5.
5. Human Surveys
To evaluate model performance according to the criteria outlined in Section 3, two distinct human surveys were designed and conducted, and their results were collected for analysis.
5.1. Authenticity
With authenticity, we refer to the extent to which a model generates outputs that convincingly resemble human-made creations.
The evaluation was conducted using a survey-based approach, where participants were asked to classify images as either AI-generated or human-made. For the human-made paintings, a subset of open-access images from the National Gallery of Art in Washington (https://www.nga.gov/open-access-images.html, accessed on 3 July 2025) was used. Participants were shown a set of 20 images, one at a time and in sequence, comprising a random mix of genuine and AI-generated works, and were asked to classify each image individually.
To ensure unbiased and reliable responses, the survey presented the images in randomized order, without any metadata or contextual information that could hint at their origin. This design encouraged participants to base their judgments solely on the visual and stylistic qualities of the images.
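A minimal sketch of how such a randomized, metadata-free presentation could be assembled is given below. The 20-image session length follows the survey design described above; the file layout, directory names, and mixing heuristic are assumptions made for illustration.

```python
# Illustrative assembly of one survey session: a shuffled mix of genuine and
# AI-generated images, stripped of any metadata that could hint at their origin.
import random
from pathlib import Path

QUIZ_LENGTH = 20  # images per participant, as in the survey design

def build_session(real_dir: Path, ai_dir: Path, seed: int) -> list[dict]:
    rng = random.Random(seed)
    real = [{"path": p, "label": "human"} for p in real_dir.glob("*.jpg")]
    fake = [{"path": p, "label": "ai"} for p in ai_dir.glob("*.png")]
    n_real = rng.randint(QUIZ_LENGTH // 3, 2 * QUIZ_LENGTH // 3)  # random mix, not a fixed 50/50 split
    session = rng.sample(real, n_real) + rng.sample(fake, QUIZ_LENGTH - n_real)
    rng.shuffle(session)  # randomized order, shown one image at a time
    return session
```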
The survey was conducted anonymously, and no personal information was collected in accordance with privacy considerations. Given the focus on European painting, it was acknowledged that cultural background could influence participants’ perceptions. To account for this, participants were asked whether they identified with a European cultural background, with the option to decline to answer.
The survey reached approximately 600 participants, selected from a diverse pool to capture a wide range of perspectives. Most of the participants were students and colleagues, suggesting a relatively high level of education and some familiarity with artistic aesthetics, though typically without formal training in art critique. This selection was intentional, as it reflected the anticipated audience for AI-generated art in real-world scenarios. The study aimed to evaluate perceptual authenticity as it might be experienced by the general public.
5.2. Adherence to Prompt Instructions
The purpose of this evaluation is to assess each generated image based on its alignment with the requirements specified in the given prompt.
This classification task is significantly more complex than the previous one, as it requires a careful reading and thorough understanding of the prompt, as well as a comparative evaluation of outputs from different models. For this reason, participation was restricted to a selected group of evaluators, comprising members of our research group, colleagues from the department of fine arts, and some of their students. While the overall number of participants was considerably smaller than for the first survey, each person evaluated multiple prompts, resulting in 5706 entries, with an average of about 475 assessments per model.
In a companion study [48], the potential of fully automated evaluation using models such as CLIP [49] was explored, focusing on its perceptual capabilities across both human-made and AI-generated artworks. While CLIP performs well in anchoring images to broad semantic categories and proves effective in discriminative tasks such as matching a generated sample to its corresponding prompt, it often struggles with the more subjective dimensions of artistic evaluation, including style, historical period, and cultural context. As a result, its assessment of prompt adherence tends to be strongly biased toward content features and does not align closely with human judgments of artistic fidelity.
Our evaluation metric relies on subjective assessments of how well each image satisfies the requirements of the prompt, taking into account content, stylistic fidelity, and technical quality, penalizing the presence of visible defects or artifacts, as well as the lack of cohesion or integrity. Although it is theoretically possible to rank the generated images on a continuous scale, the inherent complexity of the task and the subjective nature of the evaluations led us to simplify the process. Instead, images are categorized into three broad classes: low, medium, and high alignment with the prompt.
These classifications—low, medium, and high—are not absolute or universal but are defined relative to the specific set of images generated for each prompt. This relative approach ensures that the evaluation accounts for the context and inherent variability within each batch of images.
6. Results of the Surveys
This section presents and analyzes the results of the surveys. As previously noted, the goal is not to compare the performance of different models but rather to offer a clearer understanding of the current state of the field. Our focus is on identifying the persistent challenges faced by generative models, highlighting specific problem areas, and discussing potential directions for improvement. By examining these limitations, we aim to contribute to the broader discourse on how these models can be refined and enhanced for more reliable and aesthetically convincing outputs.
6.1. Authenticity Results
Figure 1 depicts the confusion matrix resulting from the survey: overall, around 29% of AI-generated images were mistakenly attributed to humans. Interestingly, a slightly lower but still substantial share of human-created paintings was attributed to AI: in this case, the misclassification percentage is around 20%.
Figure 2 illustrates the frequency distribution of misclassification percentages for AI-generated images in the dataset. The distribution is skewed toward lower misclassification percentages, with a small subset of images achieving a perfect authenticity score.
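As an illustration of how these figures could be derived from the raw responses, a short sketch follows. The layout of the response table and its column names are assumptions; only the aggregation logic mirrors the description above.

```python
# Illustrative aggregation of authenticity survey responses, assuming pandas.
# Columns ("image_id", "true_label", "answer") are hypothetical.
import pandas as pd

resp = pd.read_csv("authenticity_responses.csv")

# Confusion matrix over all responses (cf. Figure 1): true origin vs. participant answer.
confusion = pd.crosstab(resp["true_label"], resp["answer"], normalize="index")
print(confusion)

# Per-image misclassification rate for AI-generated images (cf. Figure 2).
ai = resp[resp["true_label"] == "ai"]
misclassified = (ai["answer"] == "human").groupby(ai["image_id"]).mean()
print(misclassified.describe())
```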
Since the purpose of this work is not to rank models but merely to understand the overall state of the art, we only provide a summary evaluation of the six best models, as indicated by our survey. The results are summarized in Table 5.
The best-performing model appears to be Ideogram, achieving an impressive authenticity rate close to 50%. It is also noteworthy that relatively older models, such as Stable Diffusion 1.5 and Omnigen, perform comparatively well against more recent competitors. As apparent from the results of the second survey (see Section 6.2), this is partly due to these models adopting a more liberal interpretation of the prompt, often sacrificing strict prompt adherence in favor of aesthetic quality.
Figure 3 shows some examples of AI-generated artifacts, spanning different styles and periods, that were among the most frequently classified as human-made in our survey.
A per-period investigation (see Table 6) shows that, not surprisingly, generative models perform particularly well in mimicking art of the last century and (some styles of) the 19th century. They clearly have much more trouble producing convincing artifacts of earlier periods.
Analyzing results according to artistic styles is complicated by the current underrepresentation of certain movements: values are reported for the sake of completeness, but their statistical significance is modest. For instance, as shown in Table 7, the style with the highest model performance is “Art Nouveau”.
However, we have only a single prompt associated with this label, depicting a pencil sketch of a seated man in a pensive attitude. A few examples are shown in Figure 4. Due to the schematic simplicity of both the subject and the technique, it is not surprising that many of the AI-generated artifacts have been mistakenly perceived as human-made.
A similar problem arises with the “Satirical” style. Again, we have only one prompt for this category, referring to a caricature of Otto von Bismarck in the style of the satirical magazine La Lune at the end of the 19th century. Many models created convincing artifacts, as illustrated in Figure 5.
Apart from these cases, generative models appear to be more adept at imitating modern artistic styles, such as Impressionism, Cubism, Dadaism, Futurism, and similar movements. These styles often emphasize abstraction, bold shapes, and expressive brushwork, which align well with the strengths of generative models.
Conversely, models face greater challenges when attempting to replicate older artistic styles, such as Renaissance, Baroque, and Rococo. These styles are characterized by intricate details, realistic depictions, and complex compositions, which demand a level of precision and semantic interpretation that many models struggle to achieve.
Interestingly, the worst performance is observed when models attempt to imitate naïve art. One key reason for this difficulty is the challenge most models face in handling the “flat” perspective typical of this style, as discussed in Section 7.2.3. Unlike classical or modern styles, naïve art often employs a lack of depth, disproportionate figures, and an intuitive rather than rule-based approach to composition. This contradicts the implicit biases of generative models, which are often trained to prioritize realism, shading, and perspective consistency.
6.1.1. Distinction of Results for Cultural Background
As described in Section 5.1, participants in the survey were asked to disclose their cultural background to assess its potential impact on the perception of European paintings.
In this regard, the collected data are highly unbalanced, with European participants outnumbering non-European participants by approximately six to one. As a result, any analysis of this factor must be approached with caution, as the sample distribution may limit the reliability of our findings.
The only notable result concerns the misclassification rate across different historical periods, shown in Figure 6.
Not surprisingly, non-European participants tend to misclassify images from the 15th, 16th, and 17th centuries more frequently, likely due to a lower level of familiarity with the artistic movements of those periods. European art from these centuries is deeply rooted in specific cultural and historical contexts, with stylistic conventions that may not be as immediately recognizable to those who have not been extensively exposed to them.
A similar analysis across different artistic styles did not reveal any additional trends significant enough to report.
6.1.2. Influence of the Subject
Our final investigation examines the influence of subject matter on the models’ ability to generate artifacts that can be mistaken for human-made creations. For this analysis, the tags described in Section 4.2 are used. Specifically, each prompt is represented as a multilabel binarization over its associated set of tags. A linear regression was performed to predict the average degree of “authenticity”, as determined by the survey for all entries associated with the selected tags. The analysis was limited to tags appearing in at least two different prompts.
A high degree of predictive accuracy was not expected, given that other factors—such as the intended style and historical period—also influence outcomes. The primary focus of the analysis lies not in the predicted values themselves but in the weights assigned by the model to individual tags, particularly negative ones, which may highlight categories that pose challenges for current generative systems.
After standardizing the target with Gaussian normalization, a prediction error of approximately 0.4 was obtained (against a unit standard deviation). As expected, the prediction accuracy is not particularly high, but it is sufficient to demonstrate a correlation between tags and perceived authenticity.
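As a sketch of this analysis, the snippet below shows how tags could be binarized and regressed against per-prompt authenticity scores using scikit-learn. The data file, column names, and tag separator are illustrative assumptions, not the actual pipeline used in the study.

```python
# Illustrative tag-vs-authenticity regression, assuming pandas and scikit-learn.
# The CSV path and column names ("tags", "authenticity") are hypothetical.
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.linear_model import LinearRegression

df = pd.read_csv("ai_pastiche_prompts.csv")           # one row per prompt
df["tags"] = df["tags"].str.split(";")                 # e.g., "portrait;person;night"

# Keep only tags appearing in at least two different prompts.
counts = df["tags"].explode().value_counts()
keep = set(counts[counts >= 2].index)
df["tags"] = df["tags"].apply(lambda ts: [t for t in ts if t in keep])

mlb = MultiLabelBinarizer()
X = mlb.fit_transform(df["tags"])                      # multilabel binarization
y = (df["authenticity"] - df["authenticity"].mean()) / df["authenticity"].std()  # z-score

reg = LinearRegression().fit(X, y)
weights = pd.Series(reg.coef_, index=mlb.classes_).sort_values()
print(weights)  # negative weights flag subjects that hurt perceived authenticity
```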
In Figure 7, the weights associated with the different tags are shown. The investigation is repeated for all models (blue) and for a restricted subset of models comprising Ideogram, Midjourney, Stable Diffusion 3.5 Large, and DALL·E 3, which obtained high scores both in authenticity and prompt adherence.
Looking at the negative scores, a notable group is composed of tags related to humans: “crowd”, “person”, “persons”, “child”, and “portrait”. This provides strong evidence that generative models still struggle to represent humans convincingly when mimicking artistic painting. In addition, portraits of women tend to present more challenges compared with those of men.
This difficulty may arise from several, sometimes contrasting factors. For example, generative models may fail to achieve realism in highly complex and dynamic scenes involving multiple people or crowds, while at the same time, they may adopt exaggerated hyperrealism in portraits. We discuss these issues in more detail in Section 7.
From the naturalistic point of view, “clouds”, “flowers”, and “water” seem to have a negative impact. The tag “flower” contrasts with “still_life”, which, by comparison, has a significantly more positive score. In our dataset, the negative perception associated with flowers seems to be primarily linked to paintings in the Naïve style—one of the styles where generative models, as observed in the previous section, tend to perform the worst. The negative scores for “clouds” and “water” appear to stem from the inherent complexity of rendering these elements in a way that aligns with the stylistic and historical constraints specified in the prompt. It is also interesting to observe that while the “best” models seem able to cope with water in an acceptable way, their performance on “clouds” is even worse than average. We discuss this subject in more detail in Section 7.2.3, where we also provide a few examples.
Other tags related to nature, such as “fog”, “snow”, and “trees”, do not appear to pose significant challenges for generative models. These elements are often rendered convincingly, likely due to their relatively uniform structures and the abundance of high-quality reference images available in training datasets. However, the situation changes when considering specific moments of the day. Night scenes can easily suffer from inconsistencies in lighting and contrast or from the hyperrealistic rendering of specific elements, such as the moon. More notably, dawn and sunset present particular difficulties, as generative models often struggle to capture the complex interplay of warm and cool tones, the gradual transitions in atmospheric lighting, and the way natural and artificial light sources interact during these times. These shortcomings can lead to unnatural gradients, misplaced highlights, or an overall loss of realism, making these scenarios more challenging than other elements related to nature.
The explicit request in the prompt to add visible brushstrokes frequently increases the perception of authenticity. In addition, models generally perform better when prompted to adopt soft, muted tones rather than vibrant or dramatic color schemes. When working with softer tones, the model is more likely to produce balanced, harmonious compositions that align well with a wide range of artistic styles. In contrast, when tasked with generating highly saturated or dramatic lighting effects, models often tend to over-interpret the request, leading to exaggerated contrasts, unnatural color blending, or an overuse of artificial-looking highlights and shadows.
Finally, models appear to struggle with subjects related to mythology and religion, due to a combination of the inherent complexity of these themes and content moderation filters that may constrain or influence their performance.
6.2. Adherence to Prompt Instructions and Stylistic Fidelity Results
This survey measured user satisfaction based on the alignment of the generated images with the requirements specified in the given prompt, considering both content and style. Reviewers rated each image in one of three categories: “good”, “medium”, and “low”. The evaluation was not intended to be absolute, but rather comparative, assessing how each model’s output performed relative to others.
For instance, if a particular image was unanimously classified as “Good”, this does not necessarily imply that it was a highly satisfactory interpretation of the prompt. Rather, it simply indicates that in the collective judgment of the reviewers, it outperformed the outputs of other models.
Reviewers were encouraged to distribute their ratings in a balanced way across the three categories, to reduce the impact of prompt complexity and the inherent variability within each batch of images.
To derive a summary score, we computed a weighted average, assigning a value of 1 to “good”, 0 to “medium”, and −1 to “low”.
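The following short sketch shows one way this summary score could be computed from the individual ratings; the ratings table layout and file name are assumptions, while the weighting scheme follows the description above.

```python
# Illustrative computation of the per-model adherence score from individual ratings.
# The ratings table layout (columns "model", "rating") is hypothetical.
import pandas as pd

ratings = pd.read_csv("adherence_ratings.csv")         # 5706 entries in the study
score_map = {"good": 1, "medium": 0, "low": -1}
ratings["score"] = ratings["rating"].map(score_map)

# Weighted average per model: values near 1 indicate consistently "good" ratings,
# values near -1 consistently "low" ones.
summary = ratings.groupby("model")["score"].mean().sort_values(ascending=False)
print(summary)
```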
The results are summarized in Figure 8 and Table 8. Again, in the table, we only list the best-performing models, according to our investigation.
It is worth noting that prompt adherence leads to a substantially different ranking compared with authenticity scores, which were evaluated without knowledge of the corresponding prompts. A model that was instructed to generate a Renaissance painting but instead produced a convincing Cubist artwork would likely receive a high authenticity score, despite failing to follow the intended artistic style.
This suggests that some models prioritize aesthetic quality over strict prompt adherence, opting for visually compelling outputs even at the expense of accuracy. This tendency is particularly evident in early-generation generative models, such as Stable Diffusion 1.5 and Omnigen, which frequently take creative liberties with prompt instructions.
Despite their loose interpretation of prompts and their occasional introduction of artifacts and distortions, these models remain among the most creative and surprising in our tests. Their ability to produce unexpected yet visually engaging results highlights a trade-off in generative AI: while newer models may achieve higher precision in style replication, earlier models often exhibit a greater degree of unpredictability and artistic exploration, which can sometimes lead to unexpectedly compelling outputs.
6.3. Survey Results Integration
Summary results from the human evaluation surveys have been integrated into the AI-Pastiche dataset available on Kaggle. These include aggregated ratings for each image based on subjective assessments of its perceived authenticity (i.e., the proportion of respondents who believed the sample was created by a human) and its adherence to the prompt. In addition, we introduced a metadata column labeled defects, which captures the presence and severity of visible artifacts in the generated image (with 0 indicating no visible defects and 1 indicating major or repeated defects). All metadata values are expressed as floating-point numbers in the range [0, 1].
The inclusion of these annotations is intended to support further research in perceptual evaluation and prompt–image alignment.
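As a usage example, the snippet below sketches how these annotations could be loaded and filtered. The metadata file name and exact column names are assumptions; the semantics of the authenticity, adherence, and defects fields follow the description above.

```python
# Illustrative use of the AI-Pastiche survey annotations (all values in [0, 1]).
# The metadata file name and column names are hypothetical.
import pandas as pd

meta = pd.read_csv("ai_pastiche_metadata.csv")

# Images most often mistaken for human-made, with no visible defects.
convincing = meta[(meta["authenticity"] >= 0.5) & (meta["defects"] == 0.0)]

# Trade-off check: mean authenticity vs. mean prompt adherence per model.
print(meta.groupby("model")[["authenticity", "adherence"]].mean())
print(convincing[["model", "style", "period"]].head())
```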
8. Conclusions
In this work, we have explored the capabilities and limitations of modern generative models in replicating historical artistic styles. Our analysis is structured around two main contributions: (1) the creation of a well-curated, supervised dataset of AI-generated artworks—the AI-Pastiche dataset—and (2) a comprehensive evaluation of generative models through user surveys assessing perceptual authenticity and prompt adherence.
The AI-Pastiche dataset is a richly annotated collection of AI-generated images, categorized by model, style, period, and subject matter. While its current size is not very large (73 prompts, 12 models, and 753 images), it offers a valuable resource for analyzing the strengths and weaknesses of different generative approaches, with potential for diverse applications and for serving as a benchmark in future research on AI-driven artistic replication.
Using the AI-Pastiche dataset, we conducted a systematic evaluation of generative models based on extensive user surveys. We separately assessed perceptual authenticity—how convincingly an artwork mimics human-created paintings—and prompt adherence—how faithfully the output aligns with the given instructions. The results reveal a key trade-off: some models prioritize aesthetic quality over strict adherence to the prompt, while others sacrifice visual refinement for greater accuracy. This discrepancy underscores the challenges in balancing creative flexibility and control in generative image synthesis.
Our study highlights both the progress and ongoing challenges in generative AI for artistic style replication. While models can produce visually compelling outputs, a major obstacle remains their tendency toward hyperrealism. When attempting to reproduce historical styles, these models focus on surface-level details, such as textures and brushwork, yet fail to capture the deeper artistic principles that define each period. Artistic style is more than a sum of textures: it involves composition, narrative intent, spatial relationships, and cultural context. Given the limited availability of training data for many historical styles, achieving a truly contextually accurate AI-generated artwork remains a difficult task.
Another fundamental limitation is the rigid inference time of generative models. Unlike human artists, who naturally allocate more effort to complex compositions, these models operate under fixed computational budgets, leading to missed opportunities for adaptive refinement. Future improvements may involve confidence-based step adjustments, allowing the model to extend or shorten the generation process depending on the complexity of the scene. More advanced conditioning mechanisms could also enable models to better integrate structural coherence and artistic intent rather than simply mimicking surface features.
Ultimately, our findings point to the next frontier in generative AI for art: moving beyond simple visual reproduction toward models that can understand and interpret artistic traditions in a more holistic and historically grounded way. While significant challenges remain, improvements in training strategies, dataset curation, and adaptive inference methods could help bridge the gap between style imitation and true artistic coherence, bringing generative AI closer to meaningful contributions in digital artistry.
Our dataset could help trace progress in this direction. The field is evolving very rapidly, and even within the short time required for the publication of this article, new systems have emerged with improved capabilities. We plan to integrate AI-Pastiche with these new models in the near future while also extending the number of prompts to improve stylistic coverage and provide a more balanced dataset. We are open to collaborations for future improvements to the dataset.