Article

Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction

by Gerardo Aleman Manzanarez 1,*, Nora de la Cruz Arana 1, Jorge Garcia Flores 2,*, Yobany Garcia Medina 1, Raul Monroy 1 and Nathalie Pernelle 2

1 Escuela de Ingenieria y Ciencias, Tecnologico de Monterrey, Carr. Lago de Guadalupe Km.3.5, Col. Margarita M. de Juarez, Atizapan 52926, Mexico
2 Laboratoire d’Informatique de Paris Nord, Centre National de la Recherche Scientifique, Université Sorbonne Paris Nord, 99 av. Jean-Baptiste Clément, 93430 Villetaneuse, France
* Authors to whom correspondence should be addressed.
Appl. Sci. 2025, 15(12), 6802; https://doi.org/10.3390/app15126802
Submission received: 17 April 2025 / Revised: 9 June 2025 / Accepted: 9 June 2025 / Published: 17 June 2025

Abstract

Automated story writing has been a subject of study for over 60 years. Today, large language models can generate narratively consistent and linguistically coherent short fiction texts. Despite these advancements, rigorous assessment of such outputs in terms of literary merit—especially concerning aesthetic qualities—has received scant attention. In this paper, we address the challenge of evaluating AI-generated microfiction (MF) and argue that this task requires consideration of literary criteria across various aspects of the text, including thematic coherence, textual clarity, interpretive depth, and aesthetic quality. To facilitate this, we present GrAImes: an evaluation protocol grounded in literary theory; specifically, GrAImes draws from a literary perspective to offer an objective framework for assessing AI-generated microfiction. Furthermore, we report the results of our validation of the evaluation protocol as answered by both literature experts and literary enthusiasts. This protocol will serve as a foundation for evaluating automatically generated microfiction and assessing its literary value.

1. Introduction

Technological progress in artificial intelligence (AI) has led to systems capable of advanced reasoning [1,2,3], multimodal understanding [3] and creative writing [4]. Additionally, improvements in knowledge distillation [5] and reductions in inference-time computational costs [3,6] suggest that generative AI systems are becoming more accessible and affordable, with some scientific studies even suggesting that AI-generated texts may surpass human-written ones in literary quality [7,8]. However, the assessment of literary quality has always been a subjective matter.
A review of current methods for evaluating automatically generated fiction [9] reveals that literary criteria are seldom considered. In particular, the concept of reception [10,11] plays a crucial role. As the above studies indicate, the reception of literary texts is significantly influenced by readers’ experiences and literary expertise. Consequently, an evaluation framework grounded in literary theory is essential for effectively comparing human-authored texts with those produced by generative models.
Moreover, beyond direct human–AI comparison, it becomes relevant to assess the aesthetic quality of AI-generated fiction through a literary evaluation protocol. Current evaluation methods mostly come from the fields of natural language processing and machine translation, and exhibit significant limitations when applied to literary texts. These methods rely heavily on quantitative metrics such as BLEU, ROUGE, and perplexity, which fail to capture nuanced aspects of literary language such as metaphor, symbolism, and stylistic creativity. These metrics prioritize surface-level similarity over deeper semantic and aesthetic qualities, leading to inadequate assessments of text richness. While AI mechanisms can generate consistent and coherent fiction, little attention has been paid to producing texts with literary value. This raises a fundamental question: what makes literature literature?
This paper introduces the Grading AI and Human Microfiction Evaluation System, or GrAImes. The GrAImes evaluation protocol is named after Joseph E. Grimes (born 1922), an American linguist known for his work in discourse analysis, computational linguistics, and the indigenous languages of the Americas. He developed the first automatic story generator in the early 1960s [12], a pioneering system that used Vladimir Propp’s analysis of Russian folk stories [13] as a grammar for story generation. Our novel evaluation protocol is specifically designed to assess the literary value of microfiction. GrAImes incorporates literary criteria into the assessment process, aiming to evaluate the literary quality of microfiction generated by AI or created by human authors with or without AI assistance.
GrAImes’ reception situation is inspired by the editorial process used to accept or reject stories submitted to the publishing industry [14]. We chose to work with microfiction (see Section 2.1) as a model of literary narrative due to the limitations of language models in generating short narratives and because brevity facilitates the evaluation process. GrAImes consists of a questionnaire with fifteen questions designed to assess literary originality, the impact of microfiction on its readers, and to some extent its commercial potential.
The protocol’s development involved an initial validation phase during which a group of experts (literary scholars holding a PhD and an academic position) assessed human-written microfiction. Subsequently, GrAImes was applied to AI-generated microfiction, with evaluations conducted by a community of reading and literature enthusiasts. In both experiments, it was necessary to take into consideration that human evaluators base their assessments on their imagination and prior knowledge, often assigning higher ratings to stories they find more familiar [15]. From a historical creativity perspective [16], their evaluations are shaped by comparisons with other narratives that they have encountered throughout their lives. With GrAImes, we propose an evaluation method that aims to assess not only a text’s resemblance to familiar narratives but also its aesthetic, technical, editorial, and commercial quality and value.
Our findings indicate that GrAImes could become a reliable framework for assessing the literary quality of both human-written and AI-generated microfiction. Results from our first experiment, in which literature experts evaluated anonymous human-written microfiction, suggest a correlation between an author’s expertise and the evaluation outcomes, with the texts of more experienced writers yielding good to acceptable internal consistency in the evaluators’ judgments. This was further corroborated in a second experiment involving reading enthusiasts and experts, whose evaluations slightly favored microfiction generated by ChatGPT-3.5 over that produced by a fine-tuned GPT-2 baseline language model (see Section Monterroso). The cumulative results of these experiments position GrAImes as a valuable tool for aiding the validation process of microfiction.
However, it is important to acknowledge the limitations of the present study. The experimental dataset of our experiments is small, and some statistical methods used for validation (ICC, Cronbach’s alpha, and Kendall’s W) are sensitive to the sample size, which might introduce bias or instability in the results. Furthermore, additional research is needed to test the applicability of GrAImes to other literary genres and other languages, as the present experiments were performed only in Spanish. Despite these limitations, we hope that our work shows the value of introducing concepts, methods, and expertise from the literary field, especially in order to challenge recent research results [4] suggesting that average readers prefer AI-generated poetry to classical texts.
To summarize, this paper introduces GrAImes, a novel evaluation protocol specifically designed for assessing the literary value of microfiction. The protocol’s development involved an initial validation phase in which literary experts assessed human-written microfiction. Subsequently, GrAImes was applied to AI-generated microfiction, with evaluations conducted by both literary experts and literature enthusiasts. The cumulative results of these experiments position GrAImes as a valuable tool for aiding the validation process of microfiction.
This paper is structured as follows. Section 2.1 provides a definition and illustrative examples of microfiction, and Section 2.2 reviews existing approaches for evaluating automatically generated text. Section 3 presents the GrAImes evaluation protocol and details the two evaluation processes: literary experts first evaluated human-written microfiction, after which literature enthusiasts evaluated AI-generated microfiction. The same section specifies the systems used to generate the AI microfiction in our experiments, namely the Monterroso baseline (Section Monterroso) and ChatGPT. Finally, the paper presents and discusses the results of the conducted experiments, highlighting the insights gained from applying GrAImes to both human-written and AI-generated microfiction.

2. Microfiction and Evaluation Methods

2.1. Microfiction

Microfiction is a genre that mimics the narrative process through various significant mechanisms—transtextual, metafictional, parodic, and ironic—to construct its structure at both the syntactic and semantic levels. A microfiction is an exceptionally brief story with a highly literary purpose [17], far surpassing the ambition of generating readable and coherent narratives. Additionally, this literary genre challenges narrative norms by intentionally disarticulating the plot, requiring the reader’s narrative intelligence to navigate. While its brevity may be justified by its limited word count, this characteristic alone does not determine its textual functioning. Instead, microfiction relies on the literary system as its primary reference, offering a reinterpretation of previously explored fictional concepts. It deliberately disrupts its plot, creating gaps in the narrative framework that the reader must fill in order to engage with the story. Therefore, microfiction cannot be classified solely based on word count, but must also take into account the strategic use of information that prompts transtextual relationships, enabling the reader to reactivate the text’s signifying process [18].
Microfiction follows a structured sequence consisting of opening, development, and closing. Each action transitions from one state to another—from point A to point C, passing through B—to form a cohesive narrative unit (see Figure 1).
Reading microfiction requires the reader not only to interpret its meaning but also to reconstruct its structure. The text prompts the reader to complete the narrative by providing cues that suggest a storyline. As [19] suggests, “fictions re-describe what conventional language has already described.” Therefore, microfiction reinterprets previously explored literary themes. Consequently, while literary forms may differ in nomenclature, they also exhibit structural distinctions and inevitable variations in reception. An example of microfiction is provided in Figure 2.
In the realm of microfiction, where syntax is condensed into a dense network of signs, the reader’s role becomes indispensable; while every text requires interpretation, microfiction demands particularly active engagement to decipher its tacitly encoded information. As defined by [20], these codes represent “A system of associative informational fields referring to various spheres of culture” which shape the narrative structure and construct its semantics through a hermeneutical process. This initial dimension shapes the aesthetic experience of the second by introducing interpretative ambiguities. Thus, reading reactivates both explicit textual information and implicit transtextual references.
As explored further in Section 2.2, the aesthetic dimension is rarely considered in the design of text evaluation protocols. Therefore, it is essential to consider the aspects of literature in this context. In our study, microfiction is defined as a narrative text limited to 300 words, aligning with the established parameters articulated by Ana Maria Shua [21]. These parameters are characterized by concision, suggestive narrative, and complete story conveyance within a constrained format.
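The 300-word constraint above can be expressed as a trivial operational check. Splitting on whitespace is a simplifying assumption here, since the paper does not specify how words are counted.

```python
# Minimal word-count check for the study's operational definition of
# microfiction (a narrative text of at most 300 words, following Shua).
# NOTE: whitespace tokenization is an assumption, not the paper's method.

MAX_WORDS = 300

def within_microfiction_limit(text: str, max_words: int = MAX_WORDS) -> bool:
    """Return True if the text fits the 300-word microfiction limit."""
    return len(text.split()) <= max_words

print(within_microfiction_limit("El dinosaurio todavía estaba allí."))
```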

2.2. How Has Text Generation Been Evaluated?

Research on text generation spans a variety of approaches, with each study aiming to advance the field in a unique manner. Table 1 presents an overview of different methodologies, highlighting the objectives pursued by various researchers and the evaluation mechanisms used to assess their effectiveness.
In the domain of narrative fiction generation, ref. [22] focused on generating structured narratives using a fusion model, ensuring coherence across different hierarchical levels. The evaluation of this approach involved human evaluation and perplexity measures. Similarly, ref. [23] aimed to improve long-text generation by enhancing sentence and discourse coherence through deep learning (DL) techniques. Their evaluation metrics included perplexity, bilingual evaluation understudy (BLEU) [37], lexical and semantic repetition, distinct-4, context relatedness, and sentence order, which collectively assess the model’s ability to maintain logical flow and coherence.
Shorter-form text generation has also been explored, with [24] proposing an encoder–decoder structure for generating short texts based on images while optimizing for succinctness and relevance. Notably, this approach lacked an explicit evaluation mechanism. Meanwhile, ref. [25] utilized GPT-2 [38] to generate structured poetry adhering to specific poetic constraints such as the AABBA rhyming scheme of limerick poems. The effectiveness of this approach was evaluated using lexical diversity, subject continuity, BERT-based embedding distances [39], WordNet-based similarity metrics, and content classification.
In addition to traditional story generation, several studies have focused on interactive and rule-based systems. In [26], the authors introduced a character-based interactive storytelling mechanism aimed at enhancing computer entertainment evaluated through a quantification of the system’s generative potential. In [27], an ontology-based system was employed to generate plots matching specific user queries, although without specifying an evaluation mechanism.
Further refining narrative outcomes, ref. [28] investigated the impact of sentiment-driven story endings, leveraging neural networks to generate positive emotional resolutions. Their evaluation relied on human judgment to assess the effectiveness of emotional storytelling. Similarly, ref. [29] developed a symbolic encoding method for automated story and fable creation, which they evaluated by using Levenshtein distance and BLEU scores to measure textual similarity and fluency.
Annotation tools also play a role in text generation research, as exemplified by [30], who introduced a semantic encoding system designed for textual annotation. Their study was evaluated using human assessment to verify the quality of semantic encodings. In a different vein, ref. [31] explored character-level language modeling with gated recurrent neural networks (RNNs) to improve text synthesis at a granular level, employing bits per character as the primary evaluation metric.
Efforts in coherence and interpolation-based generation are represented by [32], who proposed a model that dynamically adjusts interpolation between a language model and attention mechanisms to maintain global coherence. Their evaluation relied on human judgment as well as BLEU-4 and METEOR (Metric for Evaluation of Translation with Explicit Ordering) scores to assess linguistic accuracy and coherence. Additionally, ref. [33] investigated alignment techniques for rich descriptive explanations, aiming to generate textual content that effectively bridges books and their movie adaptations. This approach was evaluated using BLEU and term frequency–inverse document frequency (TF-IDF) similarity measures.
Several researchers have instead focused on dialogue-based storytelling. In [34], the authors developed a statistical model for generating film dialogues based on character archetypes while ensuring that the generated dialogues maintain consistency with established personas. Evaluation relied on human assessments of the generated dialogues. Expanding upon this, ref. [35] explored automated scriptwriting using neural networks, resulting in the creation of a short film script; however, no explicit evaluation mechanism was reported. Other approaches such as [36] have delved into sentence planning techniques, employing neural network-based methodologies to refine parameterized sentence structure generation. In this instance, the evaluation used Levenshtein distance and BLEU scores to measure textual structure and fluency.
Lastly, a critical stand is needed concerning recent research results from [4] suggesting that average readers prefer AI-generated poetry to classical texts. While the statistical rigor and scale of this study are noteworthy, its interpretive claims are weakened by a fundamental oversight, namely, the absence of literary concepts, particularly the theory of reception [10,11]. By relying exclusively on a crowd-sourced evaluation platform populated by self-reported non-experts (90.4% of whom read poetry infrequently and two-thirds of whom were unfamiliar with the assigned poets), the authors equated statistical significance with literary insight. Reception theory emphasizes that meaning is co-produced by readers, whose interpretive frameworks are shaped by their literary competence, cultural background, and historical awareness [19,20]. Thus, the study’s conclusion that AI-generated poems are preferred over canonical works rests on a narrow understanding of preference that neglects the interpretive depth and aesthetic endurance that define literary reception. Without accounting for the respondents’ literary literacy, the finding that AI poems are judged to be superior risks privileging superficial readability over poetic complexity, potentially reframing the appreciation of literature in terms of immediate appeal rather than interpretive depth and enduring value.
Overall, these diverse research efforts illustrate the breadth of text generation methodologies, encompassing deep learning, rule-based systems, and symbolic encoding, each targeting unique challenges in narrative coherence, stylistic constraints, and interactivity. The evaluation mechanisms vary widely; some studies rely on automated metrics such as BLEU and perplexity, while others emphasize human evaluation to assess narrative quality and coherence.

2.3. Literary Text Evaluation Methods

The evaluation of literary text generation remains an open challenge rooted in both longstanding literary traditions and formalist approaches [13,40,41]. Despite the growing interest in computational creativity, there is no clear consensus on how to effectively assess creative text generation or measure the contribution of different stages in the process. These stages range from knowledge-based planning [42] to structuring the temporal flow of events [43] and producing linguistic realizations. Among these subtasks, evaluation is arguably the least developed and requires further research efforts [44]. The widespread adoption of commercial LLM-based chatbots has resulted in increased scientific efforts around evaluating AI-generated text reception, with a particular emphasis on human evaluation methodologies. Notable contributions in this area include the works of Porter and Machery [4], Koziev [45], and Franceschelli and Musolesi [46].
In summary, the evaluation of literary text generation remains a complex and evolving challenge. While human evaluation provides the most reliable assessments, untrained and machine-learned metrics offer scalable alternatives with varying degrees of effectiveness. Future research must focus on refining these methodologies in order to better capture the nuances of creative and narrative text generation.

3. Materials and Methods

3.1. Evaluation Protocol

It is important to note that the definition of literarity is ideological and sociohistorical; hence, it is not fixed in time but rather embedded within a cultural context [47]. In educational settings, the communicative approach to language, drawn from linguistic pragmatics, is applied [48], with literature characterized as a form of communication distinguished by four elementary features.
The first feature is verisimilitude; a literary text is grounded in everyday reality, but represents that reality instead of replicating it. Consequently, it does not rely on external references but rather creates them, demanding that readers engage in a cooperative pact in which they accept the proposed universe as plausible. The second distinguishing aspect of literature is its codification; it is a message in which every component is chosen to convey meaning, with each perceived as intentional and linked to the total meaning of the work. Thus, a literary text cannot be summarized, translated, or paraphrased without significant loss of its essence [49].
Derived from these, the third feature is the deliberate breaking of rules and conventions of everyday language, and even strict grammar, in favor of aesthetic effect or meaning. This creates a tension between literature and language, emphasizing how something is said over what is said. Lastly, the deferred character of literarity [50] refers to the fact that the sender and receiver(s) of a work rarely share context; although this influences reception, it is not decisive for understanding the text. Autonomy largely depends on the integrity and cohesion of the diegetic world.
Yet, one question remains: are these elements enough to consider a text to be literary? Perhaps functionally; however, literary communication holds a significant ideological component that is dependent on sociohistorical context. In cultural studies, a distinction between ‘literary’ and ‘consumer’ fiction is made based on one variable, namely, prestige [51]. Traditionally, canonical literature was determined by academia or critics; since the 19th century, the publishing industry has also played a pivotal role in validation [52]. Each participant uses different parameters and perspectives; however, in mediating between author, reader, and time [53], the editor’s role is arguably the most operational and inclusive. Editors assess the clarity of content, technical value based on genre, and relevance, which can be thematic, formal, or commercial [14]. Hence, when applied to the evaluation of microfiction, these three parameters of clarity, technical value, and relevance gauge its potential for publication and integration into the contemporary literary landscape.
Initial assessment by a publisher for the publication of an unsolicited manuscript typically involves an evaluation process known as opinion. This usually entails a report prepared by a specialized reader focusing on the content’s technical value, commercial potential, and possible marketing strategies. To systematically address these aspects, we propose an evaluation instrument for microfiction consisting of questions that can be answered by both specialized and non-specialized readers.
The evaluation framework outlined in Table 2 presents the GrAImes evaluation protocol, which includes 15 items specifically tailored to assess Spanish microfiction, with broader applicability to narrative productions across genres and languages (for which further research is needed).
The protocol is organized into three distinct dimensions, each addressing specific criteria for systematic analysis of the texts assigned to evaluators. The first dimension, labeled “story overview and text complexity,” evaluates literary quality through an assessment of thematic coherence, textual clarity, interpretive depth, and aesthetic merit, incorporating both quantitative metrics (e.g., scoring scales) and qualitative judgments (e.g., textual commentary) to appraise literary value. The second dimension, “technical evaluation,” focuses on technical aspects such as linguistic proficiency, narrative plausibility, stylistic execution, genre-specific conventions, and the effective use of language to convey meaning. The final dimension, “editorial/commercial quality,” examines the commercial potential and editorial suitability of the microfiction, assessing factors such as audience appeal, market relevance, and feasibility for publication or dissemination. This tripartite structure ensures a comprehensive and multidimensional evaluation of the artistic and practical qualities inherent to the microfiction genre.
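As a compact summary, the tripartite structure described above can be sketched as a simple data record. The dimension labels are those given in the paper, while the criteria listed under each are paraphrased from this section; the per-dimension split of the 15 questionnaire items is not reproduced here (the full questionnaire appears in Table 2).

```python
# Sketch of the GrAImes protocol's three evaluation dimensions.
# Dimension labels follow the paper; the criteria under each dimension
# paraphrase the description in Section 3.1.

GRAIMES_DIMENSIONS = {
    "story overview and text complexity": [
        "thematic coherence",
        "textual clarity",
        "interpretive depth",
        "aesthetic merit",
    ],
    "technical evaluation": [
        "linguistic proficiency",
        "narrative plausibility",
        "stylistic execution",
        "genre-specific conventions",
        "effective use of language",
    ],
    "editorial/commercial quality": [
        "audience appeal",
        "market relevance",
        "feasibility for publication or dissemination",
    ],
}

for dimension, criteria in GRAIMES_DIMENSIONS.items():
    print(f"{dimension}: {', '.join(criteria)}")
```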

3.2. Experiments

To test GrAImes’ validity, we conducted two experiments. In the first, GrAImes was applied to stories written by humans and evaluated by experts in university-level literary studies. In the second, we applied GrAImes to stories generated by language models and evaluated by a community of reading and literature enthusiasts (see Figure 3).

3.2.1. Evaluators

We gathered two groups of evaluators: the ‘Experts’ and the ‘Enthusiasts’. The Expert group consisted of five literary scholars, each holding a PhD in Spanish or Latin American literature and occupying a permanent academic position at a public university. The participants were affiliated with institutions in Mexico, France, and the United States. All experts taught literature at the graduate level and were fluent in Spanish. On the other hand, the Enthusiasts group comprised 16 evaluators recruited from a reading club of literature enthusiasts who actively shared their opinions through a YouTube channel and a Telegram group. This group is led by a published Mexican writer and booktuber.

3.2.2. Datasets

The microfiction evaluation dataset for the Expert group consisted of six microfictions written in Spanish by human authors (see Table 3). We presented these microfictions to the experts along with the fifteen questions from the GrAImes evaluation protocol. The six microfictions included two written by an expert and well-known author with published books (MF 1 and 2), two by a medium-experience author who has been published in magazines and anthologies (MF 3 and 6), and two by an emerging writer (MF 4 and 5). Two extra questions were applied to the literary experts evaluating human written microfiction, as follows: “Is this microfiction evaluation protocol clear enough for you?” (Yes or No) and “Do you think that this protocol can be used to evaluate the literary value of microfiction?” (Yes or No).
The microfiction evaluation dataset for the Enthusiasts group was generated by two distinct generative AIs: a state-of-the-art large language model (ChatGPT-3.5) and an in-house baseline model (Monterroso), specifically trained on a hand-crafted dataset of Spanish microfiction using the GPT-2 architecture. Fine-tuning Monterroso on this hand-curated dataset was expected to make it better attuned to the structural, thematic, and linguistic elements prevalent in this literary form.
The Enthusiasts dataset consisted of six AI-generated microfictions in Spanish: three generated by ChatGPT-3.5, and three generated by Monterroso. The Enthusiasts group was composed of 16 literary enthusiasts plus the group leader/booktuber. Each evaluator was assigned six different microfictions in order to answer the GrAImes questionnaire.

3.2.3. Microfiction Generation Systems

Monterroso
Most story generation systems focus on developing a structured framework of narrative elements such as narrator, character, setting, and time in order to enhance story coherence and verisimilitude [54]. However, they often overlook what [55] termed “singularization” and what poststructuralist theorists describe as ‘literariness’. The Monterroso baseline system was built by fine-tuning an existing language model on microfiction; in this case, we utilized GPT-2 [38] as the base model, which employs a deep learning transformer architecture [56] for training and validation. Monterroso is available in both Spanish and English. With these pretrained models and a hand-made corpus of microfiction, Monterroso was able to produce literary-specific content.
Using the resulting Monterroso model, we input a prompt word, which served as the title. Additionally, we specified the desired length of the microfiction, with a maximum of 300 words. The resulting Monterroso GPT-2 baseline microfictions were used in our experiments. To develop the Monterroso model in Spanish, we leveraged a corpus of 1377 Spanish microfictions: 1222 for training and 155 for validation. The corpus had a size of 1.4 MB and comprised 411,287 tokens, which were used to generate the language model alongside a publicly available GPT-2 language model specifically tailored for Spanish [57].
ChatGPT-3.5
ChatGPT-3.5 [58] was used to generate 300-word Spanish microfictions with the same prompts used for the Monterroso baseline system.

3.2.4. Statistical Measures

The reliability of the evaluation protocol questions was assessed using the intra-class correlation coefficient (ICC), which measures the degree of consistency or agreement among responses. Higher ICC values indicate strong reliability, while lower or negative values suggest inconsistencies in response patterns. Additionally, the average score (AV) provides insight into the perceived difficulty or clarity of each question. The internal consistency of the responses by microfiction was evaluated using Cronbach’s alpha, with the values categorized into standard reliability thresholds. Given the sensitivity of both the ICC and Cronbach’s alpha to sample size, we additionally used Kendall’s W to evaluate the concordance coefficient. Kendall’s W is more appropriate for smaller sample sizes, ensuring a more robust assessment of inter-rater agreement in our study.
Intra-class Correlation Coefficient:
$$\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}$$
where $\sigma_b^2$ is the variance between subjects and $\sigma_w^2$ is the variance within subjects.
Cronbach’s Alpha:
$$\alpha = \frac{p}{p-1}\left(1 - \frac{\sum_{i} \sigma_{ii}}{\sum_{i} \sigma_{ii} + 2\sum_{i<j} \sigma_{ij}}\right)$$
where $p$ is the number of items in the scale, $\sigma_{ii}$ is the variance of item $i$, and $\sigma_{ij}$ is the covariance between items $i$ and $j$.
Kendall’s W:
$$W = \frac{12S}{m^2(n^3 - n)}$$
where $n$ is the number of objects, $m$ is the number of judges, and $S$ is the sum of the squared deviations of the rank sums.

This study aims to evaluate the literary quality of Spanish-language microfiction, focusing on both human and AI-generated texts. Thus, our purpose is to assess the effectiveness of a standardized evaluation protocol (GrAImes) in capturing literary value across different types of microfiction, including both those written by established and emerging authors and those generated by AI models. To achieve this, we selected two distinct generative AIs, namely, ChatGPT and Monterroso, the latter consisting of a baseline model trained specifically on a curated dataset of Spanish microfiction. This decision was made in order to compare the quality of texts generated by both a state-of-the-art language model and a model fine-tuned on the specific structural, thematic, and linguistic elements characteristic of Spanish microfiction. We aimed to explore whether such AI-generated texts could hold up to human evaluations in terms of literary quality as well as whether an evaluation protocol designed for human-written microfiction could be effectively applied to AI-generated works.
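For illustration, the three reliability measures can be reimplemented in a few lines of Python. The rating matrix below is hypothetical (it is not data from this study), and Kendall’s W is computed without the tie-correction term for brevity.

```python
# Illustrative reimplementation of ICC, Cronbach's alpha, and Kendall's W
# on a hypothetical rating matrix: rows are judges, columns are the rated
# microfictions, scores on a 1-5 scale.

from statistics import mean, pvariance

ratings = [
    [4, 3, 5, 2, 4],
    [5, 3, 4, 2, 4],
    [4, 2, 5, 3, 5],
    [5, 3, 4, 2, 3],
]

def transpose(matrix):
    return [list(col) for col in zip(*matrix)]

def icc_oneway(matrix):
    """ICC = sigma_b^2 / (sigma_b^2 + sigma_w^2); subjects are the columns."""
    cols = transpose(matrix)
    grand = mean(v for col in cols for v in col)
    sigma_b = pvariance([mean(col) for col in cols], mu=grand)  # between-subject
    sigma_w = mean(pvariance(col) for col in cols)              # within-subject
    return sigma_b / (sigma_b + sigma_w)

def cronbach_alpha(matrix):
    """alpha = p/(p-1) * (1 - sum of item variances / total-score variance)."""
    p = len(matrix[0])
    item_var = sum(pvariance(col) for col in transpose(matrix))
    total_var = pvariance([sum(row) for row in matrix])
    return (p / (p - 1)) * (1 - item_var / total_var)

def kendalls_w(matrix):
    """W = 12*S / (m^2 * (n^3 - n)); tie correction omitted for brevity."""
    m, n = len(matrix), len(matrix[0])

    def avg_ranks(row):
        order = sorted(range(n), key=lambda j: row[j])
        r = [0.0] * n
        i = 0
        while i < n:  # assign average ranks to tied scores
            j = i
            while j + 1 < n and row[order[j + 1]] == row[order[i]]:
                j += 1
            for k in range(i, j + 1):
                r[order[k]] = (i + j) / 2 + 1
            i = j + 1
        return r

    rank_rows = [avg_ranks(row) for row in matrix]
    rank_sums = [sum(rr[j] for rr in rank_rows) for j in range(n)]
    s = sum((rs - mean(rank_sums)) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

print(f"ICC            = {icc_oneway(ratings):.3f}")
# For inter-rater consistency, the judges act as the 'items' of the scale,
# so alpha is computed on the transposed matrix (cases = microfictions).
print(f"Cronbach alpha = {cronbach_alpha(transpose(ratings)):.3f}")
print(f"Kendall's W    = {kendalls_w(ratings):.3f}")
```

With this toy matrix the three measures agree qualitatively, and the perfect-agreement case (all judges ranking identically) yields W = 1.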

3.3. Limitations

Because both the ICC and Cronbach’s alpha are sensitive to sample size, their results may introduce bias or instability into our conclusions. To address this, we incorporated Kendall’s W, a measure less affected by sample size, to assess inter-annotator agreement. Although annotators’ aesthetic judgments may vary due to individual differences in reception and corresponding biases, the ICC analysis confirmed a consistent pattern of agreement, with texts authored by more experienced writers receiving higher scores.
Further experiments are still required in order to assess the applicability of the proposed evaluation protocol outside the microfiction literary genre as well as to microfiction written in other languages. Involving literary experts in each target language may be necessary in order to mitigate cross-cultural validity issues, particularly when translating the questions from Spanish into other languages.

3.4. Repeatability

All elements to reproduce the experiment can be found at https://github.com/Manzanarez/GrAImes (accessed on 8 June 2025).

4. Results

4.1. GrAImes Evaluation of Human-Written Microfiction by the Experts Group

GrAImes was evaluated by literary experts, all of whom were Spanish speakers, with one non-native speaker among them. We selected six microfictions written in Spanish and provided them to the experts along with the fifteen questions from our evaluation protocol. The six microfictions included two written by a well-known author with published books (MF 1 and 2), two by an author who has been published in magazines and anthologies (MF 3 and 6), and two by an emerging writer (MF 4 and 5).
From the responses obtained and displayed in Table 4 and Table 5 and Figure 4 and Figure 5, it can be concluded that the literary experts rated microfictions 1 and 2 (authored by the expert writer) more favorably. However, the responses show a high standard deviation, indicating that while the evaluations were generally positive, there was significant variation among the experts. The lowest-ranked microfictions were 4 and 6; in addition to a lower response average, these exhibited a lower standard deviation, suggesting greater agreement among the judges. These texts were written by the emerging author (MF 4) and the medium-experience author (MF 6).
These results suggest a direct correlation between the authors’ expertise and the internal consistency of the texts. The microfictions written by the expert author (MF 1 and MF 2) exhibited the highest Cronbach’s alpha values of 0.80 and 0.79, respectively, indicating good to acceptable internal consistency (see Table 6 and Figure 6). This suggests that the Expert group evaluated the microfiction written by expert writers with higher coherence and internal consistency.
Microfictions MF 4 and MF 6, written by the emerging author and the medium-experience author, respectively, displayed Cronbach’s alpha values of 0.75 and 0.67 (see Table 7), which fall within the acceptable to questionable range. While the evaluations of these microfictions maintained moderate internal consistency, they exhibited higher standard deviations (SD = 1 and SD = 1.1) than the microfictions by the expert author. This could imply that expert evaluators can coherently distinguish between a good text from an emerging author and a less effective one from a medium-experience author.
Conversely, MF 3 and MF 5 demonstrated the lowest internal consistency, with Cronbach’s alpha values of 0.34 and 0.13, respectively. These values are classified as unacceptable, suggesting significant inconsistencies in the evaluations of these texts. Their standard deviation (SD = 0.9 for both) was lower than that of the other microfictions, which may indicate a lack of variability in linguistic structures or a more rigid and less developed writing style. The low consistency of these texts highlights the challenges faced by less experienced authors in maintaining logical coherence and plot structure.
The results of the Kendall’s W analysis (see Figure 7) indicate varying levels of agreement among the experts, with the highest concordance observed for MF1 and MF2, both of which were written by the expert author. In contrast, MF3 and MF5, which were authored by less experienced writers, showed lower levels of agreement. These findings suggest that the author’s expertise aligns with the evaluations made by literary experts.
Additionally, the average values of the microfictions provide further insight (see Table 7). The expert-authored texts had the highest AV (3 and 3.1), followed by the medium-experience author (2.5 and 2.4), while the less experienced author scored the lowest (2 and 2.9). This pattern reinforces the idea that writing expertise influences not only internal consistency but also the overall perception of text quality.
These findings align with existing research [59] on the relationship between writing expertise and textual coherence. Higher expertise leads to better-structured and logically consistent texts, whereas lower expertise results in fragmented and inconsistent writing. The judges provided higher ratings to microfictions written by the more experienced author and lower ratings to those written by the emerging author. This is consistent with the purpose of our evaluation protocol, which aims to provide a tool for quantifying and qualifying a text based on its literary purpose as a microfiction.
Among the evaluated questions (see the Likert scale answer column in Table 2), Question 3 exhibited the highest ICC (0.87), indicating excellent reliability and strong agreement among respondents. Its relatively high average score (AV = 3.5) and moderate standard deviation (SD = 1) suggest that participants consistently rated this question favorably. Similarly, Question 11 (ICC = 0.75) demonstrated good reliability, although its AV (2.4) was lower, suggesting that respondents agreed on a more moderate evaluation of the item (see Table 7).
Moderate reliability was observed for Questions 10 and 6, with ICC values of 0.67 and 0.65, respectively. The AV scores (3.6 and 3.4) suggest that the MFs were generally well rated; however, the higher standard deviation of Question 10 (SD = 1.7) indicates a greater spread of responses, possibly due to varying interpretations or differences in respondent perspectives. Questions 5 and 8, with ICC values of 0.57 and 0.55, respectively, fall into the questionable reliability range. Notably, Question 8 had the lowest AV (1.8), indicating that respondents found it more difficult or unclear, which may have contributed to the reduced agreement among responses.
In contrast, Questions 7, 12, and 9 exhibited low ICC values (0.29, 0.21, and 0.16, respectively), suggesting weak reliability and higher response variability. The AV values for these items ranged from 2.2 to 2.3, further indicating inconsistent interpretations among the participants. The standard deviations for these questions (SD = 1.1–1.4) suggest a broad range of opinions, reinforcing the need for potential revisions to improve clarity and consistency.
A particularly notable finding is the negative ICC value for Question 13 (−0.72). Negative ICC values typically indicate systematic inconsistencies which may stem from ambiguous wording, multiple interpretations, or flaws in question design. With an AV of 2.0 and an SD of 1.2, it is evident that responses to this item lacked coherence.
Regarding the responses to the five open-answer questions (numbers 1, 2, 4, 14, and 15 in Table 2), we used Sentence-BERT [60] and semantic cosine similarity [61] to look for lexical and semantic similarities between the judges’ answers. These results reveal key insights into evaluation agreement and interpretation variability across the six microfictions. For Question 1 (plot comprehension), agreement was often weak (e.g., J1-J4 semantic cosine similarity = 0.21 for MF1), suggesting narrative ambiguity or divergent reader focus. Question 2 (theme identification) showed inconsistent alignment (e.g., J2-J3 similarity = 0.67 for MF2 vs. J1-J3 = 0.10 for MF1; see Figure 8), indicating subjective thematic interpretation. Question 4 (interpretation specificity) drew polarized responses, with perfect agreement in some cases (e.g., J1-J2 = 1.00 for MF3) and stark divergence in others (J2-J3 = 0.00 for MF4), reflecting conceptual or terminological disparities. Questions 14 (gifting suitability) and 15 (publisher alignment) demonstrated higher consensus (e.g., perfect agreement among four judges for MF4 on Question 14), likely due to more objective criteria. However, J5 consistently emerged as an outlier (e.g., similarity ≤ 0.11 for MF1 on Question 15), underscoring individual bias. Our protocol’s value lies in quantifying these disparities; the more concrete questions (14–15) reduced variability, while the open-ended ones (1–2) highlighted the need for structured guidelines to mitigate judge-dependent subjectivity, particularly in the case of ambiguous or complex microfictions.
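Assuming the judges' free-text answers have already been embedded (e.g., with a Sentence-BERT model), the pairwise semantic comparison reduces to cosine similarity between embedding vectors; a minimal sketch of that final step:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors:
    dot(a, b) / (||a|| * ||b||), in [-1, 1] for real-valued vectors.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

Orthogonal embeddings score 0, while vectors pointing in the same direction score 1 regardless of magnitude, which is why cosine similarity is a standard choice for comparing sentence embeddings.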
On the two extra questions given to the literary experts (see Section 3.2.2), the majority of experts (3 out of 5) found the microfiction evaluation protocol sufficiently clear for use, while a minority (2 out of 5) expressed concerns regarding ambiguous or unclear criteria. A strong consensus (4 out of 5 experts) agreed that the protocol can effectively evaluate the literary value of microfiction. However, the presence of one dissenting opinion highlights the need for adjustments in specific criteria to ensure more precise assessment.

4.2. GrAImes Evaluation of Monterroso- and ChatGPT3.5-Generated Microfiction Evaluated by the Enthusiast Group

Next, we applied GrAImes to assess a collection of six microfictions crafted by advanced AI tools. These tools comprised two models: Monterroso, a short story generator inspired by the style of the Guatemalan author Augusto Monterroso, and ChatGPT-3.5. The literature enthusiasts who participated in this study evaluated the microfictions on parameters such as coherence, thematic depth, stylistic originality, and emotional resonance.
A total of six microfictions were generated, three by the Monterroso tool (MFs 1, 2, and 3) and three by ChatGPT-3.5 (MFs 4, 5, and 6). The microfictions were evaluated on a Likert scale ranging from 1 to 5, with ratings provided by a panel of 16 reader enthusiasts. The average and standard deviation (SD) of the ratings were calculated for each microfiction. The results of the analysis are presented in Table 8, and Figure 9 and Figure 10.
The results indicate that the ChatGPT-generated microfictions (4, 5, 6) have slightly higher average ratings (2.7–2.9) compared to the Monterroso-generated microfictions (1, 2, 3), which have average ratings ranging from 2.4 to 2.7 (see Table 9, Table 10, Table 11 and Table 12 and Figure 11). The standard deviation values are consistent across most microfictions, indicating a relatively narrow range of ratings.
The most consistent responses pertained to the credibility of the stories (AV = 3.1, SD = 1.0), indicating strong agreement among participants on the narratives’ believability. This suggests that regardless of their other literary attributes, the microfictions maintained a sense of realism that resonated with readers. The question regarding whether the text required the reader’s participation or cooperation to complete its form and meaning received the highest average rating (AV = 3.6, SD = 1.3). This suggests that the generated microfictions actively engaged readers, requiring interpretation and involvement in order to fully grasp their meaning. The relatively low SD indicates moderate consensus on this aspect.
Questions concerning literary innovation, e.g., whether the texts proposed a new vision of language (AV = 2.7, SD = 1.3), reality (AV = 2.6, SD = 1.4), or genre (AV = 2.4, SD = 1.4), showed moderate variation in responses. This suggests that while some readers perceived novelty in these areas, others did not find the texts to be particularly innovative. Similarly, answers to the question of whether the texts reminded readers of other books (AV = 3.2, SD = 1.4) present a comparable level of divergence in opinions. The lowest-rated questions relate to the desire to read more texts of this nature (AV = 2.3), readers’ willingness to recommend them (AV = 2.2), and their inclination to gift them to others (AV = 2.1), all with SD = 1.4. These results suggest that while the generated microfictions may have some engaging qualities, they do not strongly motivate further exploration or endorsement.
Interestingly, the question about whether the texts propose interpretations beyond the literal received the highest standard deviation (SD = 1.6, AV = 2.9). This indicates significant variation in responses, suggesting that some readers found deeper layers of meaning while others perceived the texts as more straightforward.
The intra-class correlation coefficient (ICC) analysis of the GrAImes answers (see Table 13 and Figure 12) revealed varying degrees of reliability among the 16 literature enthusiast raters when assessing texts generated by Monterroso and ChatGPT-3.5. Three questions demonstrated poor reliability (ICC < 0.50), reflecting high variability in responses, with Question 8 exhibiting a negative ICC (−0.44), suggesting severe inconsistency, possibly due to misinterpretation or extreme subjectivity. In contrast, Questions 5 and 6 showed excellent reliability (ICC > 0.90), indicating strong inter-rater agreement, while Questions 9, 11, 12, and 13 displayed moderate reliability (ICC = 0.60–0.70), implying acceptable but imperfect consensus. These findings highlight the need to refine ambiguous or subjective questions in order to improve evaluative consistency in microfiction assessment.
This study assessed the microfictions’ ability to propose interpretations beyond the literal meaning, finding notable differences between texts generated by the Monterroso (MFs 1–3) and ChatGPT-3.5 (MFs 4–6) models. Monterroso’s MF 2 had the highest average score (AV = 3.2), showing a stronger ability to suggest multiple interpretations, while ChatGPT-3.5’s MF 4 had the lowest score (AV = 2.4), indicating limited interpretive depth. Standard deviation values were consistent across all MFs (1.5 to 1.7), showing moderate response variability among the literature enthusiasts. Thus, while certain MFs were seen as being more interpretively rich, the response variability was similar for all texts.
The technical quality of the MFs was assessed through questions related to credibility (Question 5), reader participation (Question 6), and innovation in reality, genre, and language (Questions 7–9). MF 6, generated by ChatGPT-3.5, scored highest in credibility (AV = 4.3), while MF 1 generated by Monterroso scored the lowest (AV = 1.9). This indicates a clear distinction in perceived realism between the two sources as evaluated by the literature enthusiasts. In terms of reader participation, MF 1 scored highest (AV = 4.6), suggesting that it effectively engaged readers in completing its form and meaning. MF 4 scored the lowest in this category (AV = 2.4), highlighting potential weakness in ChatGPT-3.5’s generated texts. Innovation in language (Question 9) was rated highest for MF 1 (AV = 3.4), while MF 5 scored the lowest (AV = 2.4). Overall, the technical quality of the MFs generated by Monterroso (MFs 1–3) was slightly higher (AV = 2.7–3.0) compared to those generated by ChatGPT-3.5 (AV = 2.8–3.0), with MF 3 scoring the lowest (AV = 2.7). The consistent SD values (ranging from 0.9 to 1.7) reflect similar levels of variability in responses from the literature enthusiasts.
The editorial and commercial appeal of the MFs was evaluated based on their resemblance to other texts (Question 10), reader interest in similar texts (Question 11), and willingness to recommend or gift the texts (Questions 12–13). MF 4, generated by ChatGPT-3.5, scored highest in resemblance to other texts (AV = 3.9), while MF 2 scored the lowest (AV = 2.8). This suggests that the texts generated by ChatGPT-3.5 may be more reminiscent of existing literature as perceived by literature enthusiasts. In terms of reader interest, MF 4 also scored the highest (AV = 3.0) while MF 3 scored the lowest (AV = 1.7); similarly, MF 4 was the most recommended (AV = 2.8) and most likely to be gifted (AV = 2.8), indicating stronger commercial appeal compared to Monterroso’s generated texts. Overall, the MFs generated by ChatGPT-3.5 (MFs 4–6) outperformed the Monterroso-generated MFs (MFs 1–3) in editorial and commercial appeal, with MF 4 achieving the highest average score (AV = 3.1) and MF 3 the lowest (AV = 1.9). The SD values (ranging from 0.9 to 1.7) indicate moderate variability in the responses from the literature enthusiasts.
The total analysis of the MFs (see Table 14) reveals that the ChatGPT-3.5 texts (MFs 4–6) generally outperformed the Monterroso-generated texts (MFs 1–3) in terms of editorial and commercial appeal, while Monterroso’s texts showed slightly better technical quality. MF 4, generated by ChatGPT-3.5, achieved the highest overall score (AV = 2.9), while MF 3 generated by Monterroso scored the lowest (AV = 2.4). The standard deviation values were consistent across all categories (SD ≈ 1.3–1.4), indicating similar levels of variability in responses from literature enthusiasts. These findings suggest that while the ChatGPT-3.5 texts may have stronger commercial potential, the Monterroso texts exhibit slightly higher technical sophistication. The evaluation by literature enthusiasts provides valuable insights into how general audiences perceive and engage with these microfiction examples.
The evaluation of six microfictions by the Enthusiast group leader revealed notable differences between the microfictions generated by Monterroso (MFs 1–3) and those generated by ChatGPT-3.5 (MFs 4–6); see Table 15 and Figure 13, Figure 14, Figure 15 and Figure 16. In terms of story overview and text complexity, MFs 4 and 5 scored the highest (AV = 4) for proposing multiple interpretations (see Table 15), while MF 3 scored the lowest (AV = 1, SD = 0), indicating a lack of depth. On technical aspects, MF 1 and MFs 4–6 were rated highly for credibility (AV = 5, SD = 0), whereas MF 3 scored poorly (AV = 2, SD = 1.4). MFs 1 and 5 excelled in requiring reader participation (AV = 5, SD = 0), while MFs 4 and 6 scored lower (AV = 3.5, SD = 0.7). However, all microfictions struggled to propose new visions of reality, language, or genre, with most scores ranging between 1 and 2.
In the editorial/commercial category, MFs 4–6 outperformed MFs 1–3. MFs 4 and 6 were most reminiscent of other texts (AV = 5 and 4.5, respectively) and were more likely to be recommended or given as presents (AV = 4, SD = 1.4). In contrast, MFs 1–3 scored poorly in these areas, with MF 3 consistently receiving the lowest ratings (AV = 1, SD = 0). Overall, the ChatGPT-3.5 microfictions (MFs 4–6) achieved higher total scores (AV = 3.4, SD = 0.8) compared to those generated by Monterroso (AV = 2.2, SD = 1.1).
One of the most notable results in this evaluation concerns the interpretative engagement of readers. The highest-rated question, “Does the text require your participation or cooperation to complete its form and meaning?”, received an average score (AV) of 4.3 with a standard deviation (SD) of 0.5. This suggests that the evaluated texts demand significant reader interaction, a crucial trait of literary complexity (see Table 16).
Conversely, aspects related to innovation in language and genre were rated lower. The question “Does it propose a new vision of the language itself?” obtained an AV of 1.2, which was the lowest among all items, along with an SD of 1.2, indicating high variability in responses. Similarly, the question “Does it propose a new vision of the genre it uses?” received an AV of 1.6 and an SD of 0.6, further emphasizing the expert perception that the generated texts did not significantly redefine literary conventions.
Regarding textual credibility, the question “Is the story credible?” was rated highly, with an AV of 4.2 and an SD of 0.7. This suggests that the narratives effectively maintain verisimilitude, an essential criterion for reader immersion. Additionally, evaluators were asked whether the texts reminded them of other literary works, yielding an AV of 3.4 and an SD of 0.8, indicating a moderate level of intertextuality.
GrAImes was also used to examine subjective aspects of reader appreciation. The questions “Would you recommend it?” and “Would you like to read more texts like this?” received AV scores of 2.7 and 2.8, respectively, with higher SD values (1.4 and 1.6), reflecting diverse expert opinions. Similarly, the willingness to offer the text as a gift scored an AV of 2.3 with an SD of 0.9, suggesting a moderate level of appreciation but not a strong endorsement.
There are currently no existing state-of-the-art references available for direct comparison with the present study, highlighting the novelty of our approach. This study introduces an innovative evaluation protocol for microfiction validated by literary experts, and contributes to the field by assessing both human-written and AI-generated texts. The absence of direct SOTA comparisons is due to the lack of prior work that simultaneously addresses the evaluation of microfiction across these two distinct origins. This gap in the literature underscores the significance of our research, which seeks to establish a comprehensive framework for assessing literary quality that can be applied to both human-written and AI-generated works. Our focus on this specific research area is justified by the increasing prominence of AI in literary production and the need for reliable expert-validated evaluation tools to assess the literary merit of AI-generated texts in comparison to traditional human-authored literature.

5. Discussion

In this study, the evaluators consisted of both literary experts and literature enthusiasts. The evaluation of AI-generated and human-written literary texts was found to differ significantly depending on the evaluators’ expertise and reading habits. Literature scholars possess a familiarity with narrative structures, stylistic devices, and literary traditions, enabling them to assess texts from a critical and informed perspective. These evaluators are more likely to recognize intertextual references, thematic depth, and the subtleties of language that contribute to literary quality. In contrast, evaluations conducted by enthusiast readers, most of whom engage with literature only occasionally, tend to focus on immediate readability, entertainment value, and emotional impact rather than on formal or aesthetic complexity. While this broader audience provides valuable insights into general reception and accessibility, their assessments may lack the depth needed to critically engage with intricate literary techniques. This divergence underscores the importance of distinguishing between different evaluator groups when developing assessment methodologies for AI-generated texts. A balanced evaluation framework should account for both perspectives, ensuring that AI-generated literature is judged not only on its mass appeal but also on its adherence to or innovation within established literary traditions.
The evaluation of AI-generated literary texts poses significant challenges due to the inherently subjective nature of aesthetic judgment. Traditional assessment frameworks in computational linguistics often rely on automated metrics such as perplexity, coherence, and sentiment analysis. However, these metrics fail to capture the nuanced and context-dependent qualities that define literary excellence. As a result, reader-based evaluation has emerged as a crucial methodological approach, leveraging human perception to assess the artistic and stylistic value of AI-generated narratives. The variability in responses between random readers and literary experts highlights the necessity of a structured framework that integrates both perspectives to ensure a more comprehensive and reliable evaluation process.
A key aspect of aesthetic evaluation is the distinction between general audience reception and expert literary critique. While non-expert readers provide insights into accessibility, engagement, and emotional resonance, experts apply specialized knowledge of literary traditions, narrative structures, and stylistic innovation. This distinction becomes particularly relevant when assessing AI-generated texts, as algorithmic authorship often lacks intentionality and depth in its construction of meaning. Consequently, the presence of literary experts within our evaluation design process is not only beneficial but essential for identifying higher-order textual attributes such as intertextuality, originality, and thematic complexity.
In light of these considerations, our proposed GrAImes hybrid evaluation model integrating both expert critique and broader audience participation is more likely to yield a balanced assessment of AI-generated literature. The inclusion of expert evaluators ensures that texts are measured against established literary standards, while the involvement of general readers provides valuable feedback on accessibility and reader engagement. This dual approach underscores the need for interdisciplinary collaboration between computational linguists and literary scholars in the design of methodologies for AI-generated and human-written text evaluation. Ultimately, a hybrid evaluation model that harmonizes expert insight with audience feedback will not only enrich the assessment of AI-generated literature but also foster a collaborative dialogue between disciplines, paving the way for a more inclusive understanding of literary value.
Expert consensus on the capacity of our GrAImes protocol to evaluate the literary value of microfiction was predominantly positive, with 80% of reviewers affirming its effectiveness. However, the presence of a dissenting perspective underscores the importance of continuous methodological refinement. The minor reservations primarily centered on the precision of specific evaluation criteria, indicating that a nuanced approach to protocol development is required. In summary, while the strong endorsement from experts highlights our protocol’s promise, the dissenting voices serve as a crucial reminder that ongoing refinement is essential to ensuring that its evaluation of microfiction remains both precise and relevant.
The proposed evaluation protocol encountered significant methodological challenges, particularly regarding criterion ambiguity. Key issues included interpretative inconsistencies in assessing linguistic creativity, intertextual references, and expressive quality. Experts specifically highlighted problematic areas such as the subjective interpretation of “novel language” and the complex evaluation of literary influences. Recommended methodological improvements include replacing vague recall-based assessments with more explicit and structured inquiries about literary lineage as well as more explicit criteria for measuring expressive quality. In further experiments, we aim to tackle these methodological challenges by implementing clearer and more structured criteria, which we expect will significantly enhance the GrAImes evaluation protocol and ensure that assessments of literary creativity and influence are conducted with both rigor and significance.
Both evaluation groups found the microfiction examples generated by ChatGPT-3.5 to be more commercially appealing, while Monterroso’s texts were judged to show slightly better technical execution (as assessed by the literature enthusiasts) and reader engagement (according to expert opinion). These findings highlight that while AI-generated microfiction can compete with human authorship in commercial and structural aspects, it falls short in terms of innovative and deeply interpretive literary qualities. Thus, our study underscores the potential of AI in replicating certain narrative techniques while also emphasizing the enduring challenges of achieving true creative originality. In this study, expert opinion focused on depth and originality, while the literature enthusiasts prioritized readability and appeal, reinforcing that AI-generated texts may satisfy casual readers more than experts in literature. In essence, while AI-generated microfiction demonstrates commercial viability and structural competence, it ultimately struggles to match the depth and originality that defines truly exceptional literature, revealing a persistent gap between algorithmic creations and human artistry.
Deep lexical and semantic comparison metrics such as BERT-based measures have been employed in previous studies to approximate human judgment in poetry generation by assessing lexical diversity and semantic similarity [25]. However, metrics such as BLEU and ROUGE were originally developed for tasks such as automatic translation and summarization, where a reference gold standard text, e.g., a human translation or summary, already exists. Given the highly subjective nature of evaluating literary qualities such as depth, originality, innovation, disruption, and structural complexity along with the lack of a gold standard text, BERT-based measures alone may not be a sufficient measure for assessing these dimensions.
Incorporating literary theory and engaging readers with expertise in literature (whether scholars or enthusiasts) could challenge recent findings suggesting that AI-generated poetry is indistinguishable from human-written classical poetry [4]. This is particularly relevant when evaluators are sourced from anonymous crowdsourcing platforms such as Prolific (https://www.prolific.com/, accessed on 8 June 2025). If reader experience significantly influences text reception, then the results of poetry evaluation experiments might differ substantially when conducted with literary experts or enthusiasts rather than general participants.
Our results indicate that integrating literary knowledge and evaluators who are familiar with or even experts in literature into the evaluation protocol enhances the assessment of texts produced by experienced human authors. Additionally, this approach improves the evaluation of microfiction generated by more advanced and better-trained generative language models.

6. Conclusions

This study introduces a literary evaluation protocol for human-written and AI-generated microfiction. The proposed GrAImes protocol integrates literary theory and expert input to ensure rigorous assessment of narrative quality, stylistic coherence, and creative depth. By grounding the proposed framework in established literary principles, GrAImes is able to provide enhanced adaptability across genres along with improved validity compared to more superficial metrics. The inclusion of domain-specific expertise enables meaningful comparisons between human-written and AI-generated texts, offering a scalable tool for research into computational creativity.
Our assessment of GrAImes revealed divergent evaluative priorities; literary experts emphasized technical execution and originality, while enthusiasts favored accessibility and enjoyment. ChatGPT-3.5’s high-data training yielded coherent but unoriginal outputs, whereas Monterroso’s limited dataset produced inconsistent results. Standard deviation analysis showed strong expert consensus (SD ≈ 0) versus moderate enthusiast variability (SD ≈ 1.3–1.4), highlighting how evaluator background shapes perception of literary quality.
GrAImes was systematically employed to benchmark AI systems, exposing gaps such as innovation deficits and training data dependencies. While future refinements could integrate qualitative open-ended questions, the current framework proved successful in measuring the performance of AI microfiction generation, demonstrating that evaluation design critically influences interpretations of computational creativity.
A forthcoming experiment will apply the protocol to AI-generated microfiction while incorporating expert and reader assessments along with AI self-assessments; expert readers will evaluate literary nuance, casual readers will assess engagement, and AI self-critiques will be used to provide comparative insights. Through this multi-perspective approach, we aim to validate our protocol’s robustness while exploring discrepancies between human and machine evaluative standards, thereby advancing computational literary analysis.

Author Contributions

Conceptualization, J.G.F., R.M., G.A.M., N.P., N.d.l.C.A. and Y.G.M.; methodology, R.M. and J.G.F.; software, G.A.M.; validation, G.A.M., R.M. and J.G.F.; formal analysis, R.M. and J.G.F.; investigation, G.A.M. and J.G.F.; resources, R.M. and J.G.F.; data curation, G.A.M. and J.G.F.; writing—original draft preparation, G.A.M.; writing—review and editing, R.M., J.G.F., G.A.M., N.P., N.d.l.C.A. and Y.G.M.; visualization, G.A.M.; supervision, R.M. and J.G.F.; project administration, R.M.; funding acquisition, R.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by ECOS NORD grant number 321105.

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

All materials needed to replicate the experiment are available at: https://github.com/Manzanarez/GrAImes (accessed on 8 June 2025).

Acknowledgments

We are grateful to the writers, literary experts, and enthusiast readers who participated in this research: Abril Albarran, Angelica, Brenda, Davide Buscaldi, Elisa, Marcos Eymar, Fernanda, Sandra Huerta, Iris, Alejandro Lambarry, Joseph Le Roux, Maria Elisa Leyva Gonzalez, Luis Roberto, Lupita Mejia Alvarez, Maria Mendoza, Ivan Vladimir Meza, David Nava, Florence Olivier, Diana Leticia Portillo Rodriguez, Guadalupe Monserrat Ramirez Santin, Brenda Rios, Adriana Azucena Rodriguez, Janik Rojas, Alma Sanchez, Miguel Tapia, Abraham Truxillo, Valeria, Dennis G. Wilson, and Oswaldo Zavala. Jorge Luis Borges (1899–1986) was an Argentine writer, poet, and essayist widely regarded as one of the most influential literary figures of the 20th century. Known for his intricate short stories exploring themes such as infinity, mirrors, labyrinths, and the nature of authorship, Borges played a foundational role in modern literature and philosophical fiction. Among Spanish-speaking readers, he is often considered one of the two most important authors in the history of the Spanish language, alongside Miguel de Cervantes.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv 2025, arXiv:2501.12948. [Google Scholar]
  2. Jaech, A.; Kalai, A.; Lerer, A.; Richardson, A.; El-Kishky, A.; Low, A.; Helyar, A.; Madry, A.; Beutel, A.; Carney, A.; et al. Openai o1 system card. arXiv 2024, arXiv:2412.16720. [Google Scholar]
  3. Team, G.; Georgiev, P.; Lei, V.I.; Burnell, R.; Bai, L.; Gulati, A.; Tanzer, G.; Vincent, D.; Pan, Z.; Wang, S.; et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv 2024, arXiv:2403.05530. [Google Scholar]
  4. Porter, B.; Machery, E. AI-generated poetry is indistinguishable from human-written poetry and is rated more favorably. Sci. Rep. 2024, 14, 26133. [Google Scholar] [CrossRef]
  5. Hinton, G.; Vinyals, O.; Dean, J. Distilling the knowledge in a neural network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  6. Leslie, D.; Ashurst, C.; González, N.M.; Griffiths, F.; Jayadeva, S.; Jorgensen, M.; Katell, M.; Krishna, S.; Kwiatkowski, D.; Martins, C.I.; et al. ‘Frontier AI’, Power, and the Public Interest: Who benefits, who decides? Harv. Data Sci. Rev. 2024. [Google Scholar] [CrossRef]
  7. Clark, E.; Ji, Y.; Smith, N.A. Neural text generation in stories using entity representations as context. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers); Association for Computational Linguistics: New Orleans, LA, USA, 2018; pp. 2250–2260. [Google Scholar]
  8. Jakesch, M.; Hancock, J.T.; Naaman, M. Human heuristics for AI-generated language are flawed. Proc. Natl. Acad. Sci. USA 2023, 120, e2208839120. [Google Scholar] [CrossRef]
  9. Alhussain, A.I.; Azmi, A.M. Automatic story generation: A survey of approaches. ACM Comput. Surv. (CSUR) 2021, 54, 103. [Google Scholar] [CrossRef]
  10. Iser, W. The act of reading: A theory of aesthetic response. J. Aesthet. Art Crit. 1979, 38. [Google Scholar] [CrossRef]
  11. Ingarden, R. Concretización y reconstrucción. In En Busca del Texto: Teoría de la Recepción Literaria; ENALTTA: Mexico City, Mexico, 1993; pp. 31–54. [Google Scholar]
  12. Ryan, J. Grimes’ Fairy Tales: A 1960s Story Generator; Springer International Publishing: Cham, Switzerland, 2017; pp. 89–103. [Google Scholar]
  13. Propp, V.Y. The Russian Folktale by Vladimir Yakovlevich Propp; Wayne State University Press: Detroit, MI, USA, 2012. [Google Scholar]
  14. Ginna, P. What Editors Do: The Art, Craft, and Business of Book Editing; University of Chicago Press: Chicago, IL, USA, 2017. [Google Scholar]
  15. Peinado, F.; Gervás, P. Evaluation of automatic generation of basic stories. New Gener. Comput. 2006, 24, 289–302. [Google Scholar] [CrossRef]
  16. Boden, M.A. The Creative Mind: Myths and Mechanisms; Routledge: London, UK, 2004. [Google Scholar]
  17. Tomassini, G.; Maris, S. La minificción como clase textual transgenérica. Rev. Interam. Bibliogr. Rev. Interam. Bibliogr. 1996, 46, 6. [Google Scholar]
  18. Medina, Y.d.J.G. Microrrelato o minificción: De la nomenclatura a la estructura de un género literario. Microtextualidades. Rev. Int. Microrrelato Minificción 2017, 89–102. [Google Scholar]
  19. Ricoeur, P. La función narrativa. Rev. Semiót. 1989, 1, 69–90. [Google Scholar]
  20. Barthes, R.; Alcalde, R. La Aventura Semiológica; Paidós: Barcelona, Spain, 1990. [Google Scholar]
  21. Shua, A.M. Cómo Escribir un Microrrelato; Siglo XXI Editores: Mexico City, Mexico, 2023. [Google Scholar]
  22. Fan, A.; Lewis, M.; Dauphin, Y. Hierarchical neural story generation. arXiv 2018, arXiv:1805.04833. [Google Scholar]
  23. Guan, J.; Mao, X.; Fan, C.; Liu, Z.; Ding, W.; Huang, M. Long text generation by modeling sentence-level and discourse-level coherence. arXiv 2021, arXiv:2105.08963. [Google Scholar]
  24. Min, K.; Dang, M.; Moon, H. Deep Learning-Based Short Story Generation for an Image Using the Encoder-Decoder Structure. IEEE Access 2021, 9, 113550–113557. [Google Scholar] [CrossRef]
  25. Lo, K.L.; Ariss, R.; Kurz, P. GPoeT-2: A GPT-2 Based Poem Generator. arXiv 2022, arXiv:2205.08847. [Google Scholar]
  26. Cavazza, M.; Charles, F.; Mead, S.J. Character-based interactive storytelling. IEEE Intell. Syst. 2002, 17, 17–24. [Google Scholar] [CrossRef]
  27. Gervás, P.; Díaz-Agudo, B.; Peinado, F.; Hervás, R. Story plot generation based on CBR. J. Knowl.-Based Syst. 2005, 18, 235–242. [Google Scholar] [CrossRef]
  28. Mori, Y.; Yamane, H.; Mukuta, Y.; Harada, T. Toward a Better Story End: Collecting Human Evaluation with Reasons. In Proceedings of the 12th International Conference on Natural Language Generation, Tokyo, Japan, 29 October–1 November 2019; pp. 383–390. [Google Scholar]
  29. Rishes, E.; Lukin, S.M.; Elson, D.K.; Walker, M.A. Generating different story tellings from semantic representations of narrative. In Proceedings of the International Conference on Interactive Digital Storytelling, Istanbul, Turkey, 6–9 November 2013; Springer: Cham, Switzerland, 2013; pp. 192–204. [Google Scholar]
  30. Elson, D.K.; McKeown, K.R. A Tool for Deep Semantic Encoding of Narrative Texts. In Proceedings of the ACL-IJCNLP 2009 Software Demonstrations; Association for Computational Linguistics: Singapore, 2009; pp. 9–12. [Google Scholar]
  31. Sutskever, I.; Martens, J.; Hinton, G.E. Generating Text with Recurrent Neural Networks. In Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 28 June–2 July 2011. [Google Scholar]
  32. Kiddon, C.; Zettlemoyer, L.; Choi, Y. Globally coherent text generation with neural checklist models. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–5 November 2016; pp. 329–339. [Google Scholar]
  33. Zhu, Y.; Kiros, R.; Zemel, R.; Salakhutdinov, R.; Urtasun, R.; Torralba, A.; Fidler, S. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 19–27. [Google Scholar]
  34. Walker, M.A.; Grant, R.; Sawyer, J.; Lin, G.I.; Wardrip-Fruin, N.; Buell, M. Perceived or not perceived: Film character models for expressive nlg. In Proceedings of the International Conference on Interactive Digital Storytelling, Vancouver, BC, Canada, 28 November–1 December 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 109–121. [Google Scholar]
  35. Goodwin, R.; Sharp, O. Sunspring. YouTube Website. 2016. [Google Scholar]
  36. Lukin, S.M.; Reed, L.I.; Walker, M.A. Generating sentence planning variations for story telling. arXiv 2017, arXiv:1708.08580. [Google Scholar]
  37. Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, PA, USA, 7–12 July 2002; pp. 311–318. [Google Scholar]
  38. Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language models are unsupervised multitask learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
  39. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186. [Google Scholar]
  40. Genette, G. Points. In Figures I; Editions de Seuil: Paris, France, 1976; Volume 1. [Google Scholar]
  41. Bal, M.; Van Boheemen, C. Narratology: Introduction to the Theory of Narrative; University of Toronto Press: Toronto, ON, Canada, 2009. [Google Scholar]
  42. Gardent, C.; Perez-Beltrachini, L. A statistical, grammar-based approach to microplanning. Comput. Linguist. 2017, 43, 1–30. [Google Scholar] [CrossRef]
  43. Dorr, B.; Gaasterland, T. Summarization-Inspired Temporal-Relation Extraction: Tense-Pair Templates and Treebank-3 Analysis. 2007. Available online: https://apps.dtic.mil/sti/citations/ADA460392 (accessed on 8 June 2025).
  44. Zhu, J. Towards a Mixed Evaluation Approach for Computational Narrative Systems. In Proceedings of the ICCC’12: 2012 1st IEEE International Conference on Communications in China, Beijing, China, 15–17 August 2012; pp. 150–154. [Google Scholar]
  45. Koziev, I. Automated Evaluation of Meter and Rhyme in Russian Generative and Human-Authored Poetry. arXiv 2025, arXiv:2502.20931. [Google Scholar]
  46. Franceschelli, G.; Musolesi, M. Thinking Outside the (Gray) Box: A Context-Based Score for Assessing Value and Originality in Neural Text Generation. arXiv 2025, arXiv:2502.13207. [Google Scholar]
  47. Ludmer, J. Clases 1985: Algunos Problemas de Teoría Literaria; Paidós: Barcelona, Spain, 2015. [Google Scholar]
  48. Austin, J. How to Do Things with Words; Harvard University Press: Cambridge, MA, USA, 1962. [Google Scholar]
  49. Bertochi, D. La aproximación al texto literario en la enseñanza obligatoria. In Textos de Didáctica de la Lengua y la Literatura; Grao: Barcelona, Spain, 1995; pp. 23–38. [Google Scholar]
  50. Huamán, M.Á. Educación y Literatura; Mantaro: Lima, Peru, 2003. [Google Scholar]
  51. Lizaur Guerra, M.B.d. La telenovela mexicana: Forma y contenido de un formato narrativo de ficcíon de alcance mayoritario. Ph.D. Thesis, Universidad Nacional Autónoma de México, Mexico City, Mexico, 2003. [Google Scholar]
  52. Thompson, J.B. Merchants of Culture: The Publishing Business in the Twenty-First Century; John Wiley & Sons: Hoboken, NJ, USA, 2013. [Google Scholar]
  53. Calasso, R. La Marca del Editor; Anagrama: Madrid, Spain, 2014. [Google Scholar]
  54. Bremond, C.; Cancalon, E.D. The logic of narrative possibilities. New Lit. Hist. 1980, 11, 387–411. [Google Scholar] [CrossRef]
  55. Shklovsky, V. Art as technique. In Literary Theory: An Anthology; Wiley-Blackwell: Oxford, UK, 1917; Volume 3. [Google Scholar]
  56. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems 2017: 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  57. Oñate Latorre, A.; Ortiz Fuentes, J. GPT2-Spanish. 2018. Available online: https://huggingface.co/DeepESP/gpt2-spanish (accessed on 8 June 2025).
  58. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language models are few-shot learners. arXiv 2020, arXiv:2005.14165. [Google Scholar]
  59. McCutchen, D. From novice to expert: Implications of language skills and writing-relevant knowledge for memory during the development of writing skill. J. Writ. Res. 2011, 3, 51–68. [Google Scholar] [CrossRef]
  60. Reimers, N.; Gurevych, I. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv 2019, arXiv:1908.10084. [Google Scholar]
  61. Rahutomo, F.; Kitasuka, T.; Aritsugi, M. Semantic cosine similarity. In Proceedings of the 7th International Student Conference on Advanced Science and Technology ICAST, Seoul, Republic of Korea, 29–30 October 2012; Volume 4, p. 1. [Google Scholar]
Figure 1. This figure presents an example of a microfiction authored by Yobany Garcia Medina. It is divided into three parts: opening (A), development (B), and closing (C), collectively forming a cohesive narrative unit.
Figure 2. Microfiction example.
Figure 3. Microfiction evaluation process.
Figure 4. Expert group evaluation averages.
Figure 5. Expert group evaluation of human-written microfiction using GrAImes (showing averages by section).
Figure 6. ICC and Cronbach’s alpha line charts of the Experts group’s evaluation of human-written microfiction. Blue Dashed Line (0.75): This line represents the threshold for good reliability. Red Dashed Line (0.5): This line indicates the threshold for moderate reliability.
Figure 7. Kendall’s W for the Experts group evaluation of human-written microfiction, ordered by GrAImes sections.
Figure 8. Inter-rater agreement within the Expert group on clarity (q1), structure (q2), complexity (q4), giftability (q14), and commerciality (q15) for MF2.
Figure 9. Literary enthusiasts group’s evaluation of AI-generated microfiction.
Figure 10. Enthusiast group’s evaluation of AI-generated microfiction by section.
Figure 11. Line charts of literature enthusiasts’ summarized AV and SD by GrAImes questionnaire section.
Figure 12. ICC line chart for literature enthusiasts’ responses to AI-generated microfiction. Blue Dashed Line (0.75): This line represents the threshold for good reliability. Red Dashed Line (0.5): This line indicates the threshold for moderate reliability.
Figure 13. Literary experts group’s evaluation of AI-generated microfiction. Bar colors: Blue, Overview and text complexity section. Red, Technical section. Yellow, Editorial/commercial sections. Orange, Total score.
Figure 14. Literary experts group’s evaluation of AI-generated microfiction (average by section).
Figure 15. Line charts showing literary expert summarized AV and SD by GrAImes questionnaire section.
Figure 16. Comparison of literary expert and enthusiast ratings of AI-generated microfiction by GrAImes questionnaire section.
Table 1. Automatic text generation evaluation methods.
Author | Goal | Approach | Evaluation Method | Results
Fan et al., 2018 [22] | Hierarchical story generation using a fusion model | Deep Learning | Human evaluation, Perplexity | Story generation with a given prompt.
Guan et al., 2021 [23] | Long text generation | Deep Learning | Perplexity, BLEU, Lexical Repetition, Semantic Repetition, Distinct-4, Context Relatedness, Sentence Order | Generation of long texts using sentence and discourse coherence.
Min et al., 2021 [24] | Short text generation for an image | Deep Learning | None | Generation of short texts using an image and encoder-decoder structure.
Lo et al., 2022 [25] | Poem generation using GPT-2 | Deep Learning | Lexical diversity, Subject continuity, BERT-based Embedding Distances, WordNet-based Similarity Metric, Content Classification | Limerick poems with AABBA rhyming scheme.
Cavazza et al., 2002 [26] | Character-based interactive story generation | Rule-based | Quantification of system’s generative potential | Computer entertainment story generation.
Gervás et al., 2005 [27] | Story that matches a given query | Ontology-based | None | Sketch of a story plot.
Mori et al., 2019 [28] | Story generation with better story endings | Neural Network-based | Human evaluation | Endings containing positive emotions, supported by sentiment analysis.
Rishes et al., 2013 [29] | Story generation | Symbolic Encoding | Levenshtein Distance, BLEU | Stories and fables.
Elson et al., 2009 [30] | Annotation tool for the semantic encoding of texts | Symbolic Encoding | Human evaluation | Short fables.
Sutskever et al., 2011 [31] | Character-level language modeling | Neural Network-based | Bits Per Character | Text generation with gated RNNs.
Kiddon et al., 2016 [32] | Output generation by dynamically adjusting interpolation among a language model and attention models | Neural Network-based | Human evaluation, BLEU-4, METEOR | Text generation with global coherence.
Zhu et al., 2015 [33] | Rich descriptive explanations and alignment between book and movie | Neural Network-based | BLEU, TF-IDF | Story generation from images and texts.
Walker et al., 2011 [34] | Generation of story dialogues from film characters | Statistical Model | Human evaluation | Dialogues generated based on given film characters.
Sharp et al., 2016 [35] | Generation of a film script | Neural Network-based | None | Short film script.
Lukin et al., 2017 [36] | Sentence planning | Neural Network-based | Levenshtein Distance, BLEU | Parameterized sentence planner.
Porter et al., 2024 [4] | Poem generation using ChatGPT | Deep Learning | Human (crowd evaluation with Prolific) | Evaluators preferred ChatGPT poems to those from well-known human authors.
Table 2. List of questions in the evaluation protocol provided to the evaluators tasked with assessing the literary, linguistic, and editorial quality of microfiction pieces. OA = Open Answer, Likert = scale ranging from 1 to 5.
GrAImes Evaluation Protocol Questions
# | Question | Answer | Description
Story Overview and text complexity
1 | What happens in the story? | OA | Evaluates how clearly the generated microfiction is understood by the reader.
2 | What is the theme? | OA | Assesses whether the text has a recognizable structure and can be associated with a specific theme.
3 | Does it propose other interpretations, in addition to the literal one? | Likert | Evaluates the literary depth of the microfiction. A text with multiple interpretations demonstrates greater literary complexity.
4 | If the above question was affirmative, which interpretation is it? | OA | Explores whether the microfiction contains deeper literary elements such as metaphor, symbolism, or allusion.
Technical Assessment
5 | Is the story credible? | Likert | Measures how realistic and distinguishable the characters and events are within the microfiction.
6 | Does the text require your participation or cooperation to complete its form and meaning? | Likert | Assesses the complexity of the microfiction by determining the extent to which it involves the reader in constructing meaning.
7 | Does it propose a new perspective on reality? | Likert | Evaluates whether the microfiction immerses the reader in an alternate reality different from their own.
8 | Does it propose a new vision of the genre it uses? | Likert | Determines whether the microfiction offers a fresh approach to its literary genre.
9 | Does it give an original way of using the language? | Likert | Measures the creativity and uniqueness of the language used in the microfiction.
Editorial/Commercial Quality
10 | Does it remind you of another text or book you have read? | Likert | Assesses the relevance of the text and its similarities to existing works in the literary market.
11 | Would you like to read more texts like this? | Likert | Measures the appeal of the microfiction and its potential marketability.
12 | Would you recommend it? | Likert | Indicates whether the microfiction has an audience and whether readers might seek out more works by the author.
13 | Would you give it as a present? | Likert | Evaluates whether the microfiction holds enough literary or commercial value for readers to gift it to others.
14 | If the last answer was yes, to whom would you give it as a present? | OA | Identifies the type of reader the evaluator believes would appreciate the microfiction.
15 | Can you think of a specific publisher that you think would publish a text like this? | OA | Assesses the commercial viability of the microfiction by determining whether respondents associate it with a specific publishing market.
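As a rough illustration of how the protocol can be operationalized, the questionnaire maps naturally onto a small data structure that separates Likert items (scored 1 to 5) from open answers (analyzed qualitatively). The class, field, and section names below are our own sketch and are not code shipped with GrAImes.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Question:
    number: int
    text: str
    answer_type: str  # "Likert" (1-5) or "OA" (open answer)
    section: str

# A few representative GrAImes items, keyed by their numbers in Table 2.
PROTOCOL = [
    Question(1, "What happens in the story?", "OA", "Overview"),
    Question(3, "Does it propose other interpretations, in addition to the literal one?",
             "Likert", "Overview"),
    Question(5, "Is the story credible?", "Likert", "Technical"),
    Question(11, "Would you like to read more texts like this?", "Likert", "Editorial"),
]

def section_averages(responses):
    """Average Likert score per section; open answers are skipped.

    `responses` maps question number -> answer (int for Likert, str for OA)."""
    by_section = {}
    for q in PROTOCOL:
        if q.answer_type == "Likert" and q.number in responses:
            by_section.setdefault(q.section, []).append(responses[q.number])
    return {section: mean(scores) for section, scores in by_section.items()}

print(section_averages({1: "A dream collapses.", 3: 4, 5: 2, 11: 3}))
# {'Overview': 4, 'Technical': 2, 'Editorial': 3}
```

Keeping the section label on each question is what allows the per-section aggregates reported in the results tables to be computed mechanically.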
Table 3. Authors of human-written microfiction evaluated by the Expert group.
Author | Experience | Microfiction
Expert | Well known, with books published | MF1, MF2
Medium | Published in magazines and anthologies | MF3, MF6
Emerging | Little experience, starting author | MF4, MF5
Table 4. Responses, AV, and SD of literary experts assessing human-written microfiction (MF1–MF6). Answers are measured on a Likert scale and grouped by GrAImes questionnaire sections, with the total average response in the final column.
Literary Experts’ Responses to Human-Written Microfiction
Question | MF 1 (AV, SD) | MF 2 (AV, SD) | MF 3 (AV, SD) | MF 4 (AV, SD) | MF 5 (AV, SD) | MF 6 (AV, SD) | Average (AV, SD)
Story Overview and text complexity
3. Does it propose other interpretations, in addition to the literal one? | 4, 1 | 4.4, 0.9 | 2.2, 0.8 | 2.4, 1.4 | 4.4, 0.5 | 3.4, 1.6 | 3.5, 1
Technical
5. Is the story credible? | 2.2, 1.8 | 3.2, 1.8 | 4, 0.5 | 4.4, 1.7 | 3.4, 1.8 | 2, 0.9 | 3.2, 1.4
6. Does the text require your participation or cooperation to complete its form and meaning? | 4.4, 0.9 | 3.6, 1.3 | 2.6, 1.5 | 3, 0.9 | 3.2, 0.9 | 3.8, 1.1 | 3.4, 1.1
7. Does it propose a new vision of reality? | 2.4, 1.1 | 2.6, 1.5 | 1.2, 0.9 | 1.8, 1.1 | 3, 1 | 2.2, 1.2 | 2.2, 1.1
8. Does it propose a new vision of the genre it uses? | 2, 1.2 | 2.4, 1.5 | 1.4, 0.5 | 1.2, 0.4 | 2.2, 1.1 | 1.6, 0.9 | 1.8, 0.9
9. Does it propose a new vision of the language itself? | 2.8, 1.8 | 2.6, 2.2 | 2.2, 0.9 | 2.8, 1.3 | 1.4, 0.5 | 2, 1.4 | 2.3, 1.4
Editorial/commercial
10. Does it remind you of another text or book you have read? | 4.4, 0.5 | 4, 1.1 | 3.2, 0.4 | 2.8, 0.8 | 3.6, 0.4 | 3.4, 0.8 | 3.6, 0.7
11. Would you like to read more texts like this? | 3, 1 | 3, 0.7 | 1.4, 0.9 | 2, 0.8 | 3, 0.4 | 2, 0.8 | 2.4, 0.8
12. Would you recommend it? | 2.8, 1.6 | 3, 1.2 | 1.2, 0.9 | 2, 0.8 | 2.8, 1.1 | 1.6, 1 | 2.2, 1.1
13. Would you give it as a present? | 2.2, 1.6 | 2.4, 1.3 | 1, 0.9 | 2.2, 1.2 | 2.4, 1.1 | 1.8, 0.8 | 2, 1.2
Table 5. Responses of literary experts to GrAImes questions evaluating human-written microfiction, organized by ascending order of standard deviation.
Literary Experts’ Responses to Microfiction Written by Humans, Ordered by SD
Question | AV | SD
10. Does it remind you of another text or book you have read? | 3.6 | 0.7
11. Would you like to read more texts like this? | 2.4 | 0.8
8. Does it propose a new vision of the genre it uses? | 1.8 | 0.9
3. Does it propose other interpretations, in addition to the literal one? | 3.5 | 1
6. Does the text require your participation or cooperation to complete its form and meaning? | 3.4 | 1.1
7. Does it propose a new vision of reality? | 2.2 | 1.1
12. Would you recommend it? | 2.2 | 1.1
13. Would you give it as a present? | 2 | 1.2
5. Is the story credible? | 3.2 | 1.4
9. Does it propose a new vision of the language itself? | 2.3 | 1.4
Table 6. Intra-class correlation coefficient and average values of literary experts’ responses to GrAImes questions.
Questions ICC—AVG
Question | ICC | AV | SD
3 | 0.87 | 3.5 | 1
11 | 0.75 | 2.4 | 0.8
10 | 0.67 | 3.6 | 1.7
6 | 0.65 | 3.4 | 1.1
5 | 0.57 | 3.2 | 1.4
8 | 0.55 | 1.8 | 0.9
7 | 0.29 | 2.2 | 1.1
12 | 0.21 | 2.2 | 1.1
9 | 0.16 | 2.3 | 1.4
13 | −0.72 | 2 | 1.2
Table 7. Responses of literary experts to GrAImes questions evaluating human-written microfiction, including Cronbach’s alpha for internal consistency, AV, and SD.
MF, Cronbach’s Alpha, Internal Consistency (IC), AV, SD
MF | Alpha | IC | AV | SD
1 | 0.8 | Good | 3 | 1.3
2 | 0.79 | Acceptable | 3.1 | 1.4
4 | 0.75 | Acceptable | 2.5 | 1
6 | 0.67 | Questionable | 2.4 | 1.1
3 | 0.34 | Unacceptable | 2 | 0.9
5 | 0.13 | Unacceptable | 2.9 | 0.9
Table 8. Responses, AV, and SD, of the Enthusiasts group to AI-generated microfiction (MFs 1–6) measured on a Likert scale and grouped by GrAImes questionnaire section, with the total average responses in the final column.
Literature Enthusiasts’ Responses to Microfiction from Monterroso and ChatGPT-3.5
Question | MF 1 (AV, SD) | MF 2 (AV, SD) | MF 3 (AV, SD) | MF 4 (AV, SD) | MF 5 (AV, SD) | MF 6 (AV, SD) | Average (AV, SD)
Story Overview and text complexity
3. Does it propose other interpretations, in addition to the literal one? | 2.9, 1.5 | 3.2, 1.6 | 2.9, 1.7 | 2.4, 1.5 | 3.2, 1.6 | 3.1, 1.6 | 2.9, 1.6
Technical
5. Is the story credible? | 1.9, 0.9 | 1.7, 0.9 | 2.2, 1.1 | 4.2, 1.2 | 4, 1.2 | 4.3, 0.9 | 3.1, 1
6. Does the text require your participation or cooperation to complete its form and meaning? | 4.6, 1 | 4.3, 1.4 | 4.3, 1.2 | 2.4, 1.2 | 3.1, 1.4 | 2.9, 1.4 | 3.6, 1.3
7. Does it propose a new vision of reality? | 2.7, 1.7 | 2.9, 1.5 | 2.4, 1.5 | 2.3, 1.4 | 2.7, 1.3 | 2.5, 1.3 | 2.6, 1.4
8. Does it propose a new vision of the genre it uses? | 2.3, 1.4 | 2.7, 1.6 | 2.1, 1.3 | 2.4, 1.5 | 2.4, 1.5 | 2.4, 1.1 | 2.4, 1.4
9. Does it propose a new vision of the language itself? | 3.4, 1.3 | 2.7, 1.4 | 2.6, 1.3 | 2.6, 1.3 | 2.4, 1.4 | 2.7, 1.3 | 2.7, 1.3
Editorial/commercial
10. Does it remind you of another text or book you have read? | 2.9, 1.5 | 2.8, 1.3 | 2.9, 1.5 | 3.9, 1.3 | 3.2, 1.5 | 3.2, 1.5 | 3.2, 1.4
11. Would you like to read more texts like this? | 2, 1.2 | 2.3, 1.7 | 1.7, 0.9 | 3, 1.5 | 2.1, 1.3 | 2.6, 1.5 | 2.3, 1.4
12. Would you recommend it? | 2.1, 1.5 | 2.1, 1.5 | 1.6, 0.9 | 2.8, 1.4 | 2.1, 1.4 | 2.6, 1.6 | 2.2, 1.4
13. Would you give it as a present? | 2.1, 1.6 | 1.7, 1.3 | 1.4, 1 | 2.8, 1.5 | 2, 1.4 | 2.3, 1.4 | 2.1, 1.4
Table 9. Story overview and text complexity section: AV and SD by MF.
Overview and Complexity
# | MF | AV | SD
1 | 2 | 3.2 | 1.6
2 | 5 | 3.2 | 1.6
3 | 6 | 3.1 | 1.6
4 | 3 | 2.9 | 1.7
5 | 1 | 2.9 | 1.5
6 | 4 | 2.4 | 1.5
Table 10. Technical section: AV and SD by MF.
Technical
MF | AV | SD
6 | 3 | 1.2
1 | 3 | 1.3
2 | 2.9 | 1.4
5 | 2.9 | 1.3
4 | 2.8 | 1.3
3 | 2.7 | 1.3
Table 11. Editorial/commercial section: AV and SD by MF.
Editorial/Commercial
MF | AV | SD
4 | 3.1 | 1.5
6 | 2.7 | 1.5
1 | 2.3 | 1.4
5 | 2.3 | 1.4
2 | 2.2 | 1.4
3 | 1.9 | 1.1
Table 12. Total analysis: AV and SD by MF.
Total Analysis
MF | AV | SD
4 | 2.9 | 1.4
6 | 2.9 | 1.4
5 | 2.7 | 1.4
1 | 2.7 | 1.4
2 | 2.6 | 1.4
3 | 2.4 | 1.3
Table 13. Internal consistency analysis: ICC and Cronbach’s alpha for microfiction from Monterroso and ChatGPT-3.5 evaluated by literature enthusiasts.
Literature Enthusiasts’ ICC–AVG–SD Analysis
# | Question | ICC | AV | SD
1 | 5 | 0.97 | 3.1 | 1
2 | 6 | 0.95 | 3.6 | 1.3
3 | 13 | 0.70 | 2.1 | 1.4
4 | 9 | 0.67 | 2.7 | 1.3
5 | 11 | 0.67 | 2.3 | 1.4
6 | 12 | 0.62 | 2.3 | 1.4
7 | 10 | 0.57 | 3.2 | 1.4
8 | 3 | 0.28 | 2.9 | 1.6
9 | 7 | 0.01 | 2.6 | 1.4
10 | 8 | −0.44 | 2.4 | 1.4
MF, Alpha, Internal Consistency (IC), AV, SD
# | MF | Alpha | IC | AV | SD
1 | 4 | 0.90 | Excellent | 2.9 | 1.4
2 | 5 | 0.89 | Good | 2.7 | 1.4
3 | 6 | 0.89 | Good | 2.9 | 1.4
4 | 1 | 0.88 | Good | 2.7 | 1.4
5 | 2 | 0.84 | Good | 2.6 | 1.4
6 | 3 | 0.79 | Acceptable | 2.4 | 1.3
Table 14. Average values (AV) and standard deviations (SD) of literary expert responses to microfiction by Monterroso and ChatGPT-3.5.
Literary Experts’ Responses to Microfiction from Monterroso and ChatGPT-3.5
Question | MF 1 (AV, SD) | MF 2 (AV, SD) | MF 3 (AV, SD) | MF 4 (AV, SD) | MF 5 (AV, SD) | MF 6 (AV, SD) | Average (AV, SD)
Story Overview and text complexity
3. Does it propose other interpretations, in addition to the literal one? | 3, 2.8 | 3, 2.8 | 1, 0 | 4, 1.4 | 4, 1.4 | 3.5, 2.1 | 3.1, 1.8
Technical
5. Is the story credible? | 5, 0 | 3, 2.8 | 2, 1.4 | 5, 0 | 5, 0 | 5, 0 | 4.2, 0.7
6. Does the text require your participation or cooperation to complete its form and meaning? | 5, 0 | 5, 0 | 4, 1.4 | 3.5, 0.7 | 5, 0 | 3.5, 0.7 | 4.3, 0.5
7. Does it propose a new vision of reality? | 2, 1.4 | 2, 1.4 | 1, 0 | 2, 1.4 | 2, 1.4 | 2.5, 0.7 | 1.9, 1.1
8. Does it propose a new vision of the genre it uses? | 2, 1.4 | 1.5, 0.7 | 1, 0 | 1, 0 | 2, 1.4 | 2, 0 | 1.6, 0.6
9. Does it propose a new vision of the language itself? | 2, 1.4 | 1, 0 | 1, 0 | 1, 0 | 1, 0 | 1, 0 | 1.2, 1.2
Editorial/commercial
10. Does it remind you of another text or book you have read? | 4, 1.4 | 2, 1.4 | 1, 0 | 5, 0 | 4, 1.4 | 4.5, 0.7 | 3.4, 0.8
11. Would you like to read more texts like this? | 3, 1.4 | 2, 1.4 | 1, 0 | 4, 1.4 | 3, 2.8 | 4, 1.4 | 2.8, 1.6
12. Would you recommend it? | 3, 2.8 | 1, 0 | 1, 0 | 4, 1.4 | 3, 2.8 | 4, 1.4 | 2.7, 1.4
13. Would you give it as a present? | 1, 0 | 1, 0 | 1, 0 | 4, 1.4 | 3, 2.8 | 4, 1.4 | 2.3, 0.9
Table 15. Literary expert evaluation of microfiction generated by Monterroso and ChatGPT-3.5 by GrAImes questionnaire section. Each cell gives MF: AV/SD, ranked by descending AV within each section.

Rank | Story overview and text complexity | Technical | Editorial/commercial | Total analysis
1 | MF 4: 4.0/1.0 | MF 1: 3.2/0.8 | MF 4: 4.3/1.1 | MF 4: 3.4/0.8
2 | MF 5: 4.0/1.0 | MF 5: 3.0/0.6 | MF 6: 4.1/1.2 | MF 6: 3.4/0.8
3 | MF 6: 3.5/2.1 | MF 6: 2.8/0.3 | MF 5: 3.3/2.5 | MF 5: 3.2/1.4
4 | MF 1: 3.0/2.8 | MF 4: 2.5/0.4 | MF 1: 2.8/1.8 | MF 1: 3.0/1.4
5 | MF 2: 3.0/2.8 | MF 2: 2.5/1.0 | MF 2: 1.5/0.7 | MF 2: 2.2/1.1
6 | MF 3: 1.0/0 | MF 3: 1.8/0.6 | MF 3: 1.0/0 | MF 3: 1.4/0.3
Table 16. Literary expert responses to microfiction generated by Monterroso and ChatGPT-3.5, ordered by ascending standard deviation.

Question | AV | SD
6. Does the text require your participation or cooperation to complete its form and meaning? | 4.3 | 0.5
8. Does it propose a new vision of the genre it uses? | 1.6 | 0.6
5. Is the story credible? | 4.2 | 0.7
10. Does it remind you of another text or book you have read? | 3.4 | 0.8
13. Would you give it as a present? | 2.3 | 0.9
7. Does it propose a new vision of reality? | 1.9 | 1.1
9. Does it propose a new vision of the language itself? | 1.2 | 1.2
12. Would you recommend it? | 2.7 | 1.4
11. Would you like to read more texts like this? | 2.8 | 1.6
3. Does it propose other interpretations, in addition to the literal one? | 3.1 | 1.8
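Table 16 treats the standard deviation as a proxy for evaluator agreement: the lower the SD, the more the experts converged on a score. Reproducing the ordering is a one-line sort; the (question, AV, SD) triples below are a subset taken from the table itself:

```python
# (question, AV, SD) triples from Table 16 (subset shown).
rows = [
    ("Q3", 3.1, 1.8),
    ("Q5", 4.2, 0.7),
    ("Q6", 4.3, 0.5),
    ("Q8", 1.6, 0.6),
]

# Ascending SD = strongest evaluator agreement first.
by_agreement = sorted(rows, key=lambda r: r[2])
order = [q for q, _av, _sd in by_agreement]  # ['Q6', 'Q8', 'Q5', 'Q3']
```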
Share and Cite

MDPI and ACS Style

Aleman Manzanarez, G.; de la Cruz Arana, N.; Garcia Flores, J.; Garcia Medina, Y.; Monroy, R.; Pernelle, N. Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction. Appl. Sci. 2025, 15, 6802. https://doi.org/10.3390/app15126802