Literary Language Mashup: Curating Fictions with Large Language Models
Abstract
1. Introduction
2. Related Work
2.1. LLMs as Judges
2.2. LLM Benchmarks and Quality Assessment
3. Concepts, Methods and Materials Used
3.1. Evaluation Protocols
3.1.1. GrAImes
3.1.2. TTCW
3.2. Experiments
3.2.1. GrAImes Expert Evaluation: Human Experts vs. LLM Judges
3.2.2. GrAImes Enthusiast Evaluation: Human Enthusiasts vs. LLM Judges
3.2.3. TTCW and GrAImes Protocol Comparison
3.2.4. LLMs Within-Models Stability
3.2.5. Inter-Rater Reliability
3.3. Datasets
Large Language Models as Judges
3.4. Limitations
4. Results
4.1. GrAImes Expert Evaluation Results (Experiment I)
4.2. Enthusiasts’ GrAImes Evaluation Results (Experiment II)
LLM Benchmark on the GrAImes Enthusiasts Dataset (Experiment II)
4.3. TTCW and GrAImes Protocol Comparison Results (Experiment III)
5. Discussion
6. Conclusions
Author Contributions
Funding
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Appendix A. Figures and Tables of LLM Responses to GrAImes with Microfictions
Appendix A.1. Krippendorff’s α

| Krippendorff’s α and Confidence Interval (CI) for Literary Expert 2 | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM | MF1 α | MF1 CI | MF2 α | MF2 CI | MF3 α | MF3 CI | MF4 α | MF4 CI | MF5 α | MF5 CI | MF6 α | MF6 CI |
| 1 | −0.163 | (0.088, 0.388) | 0.0 | (−0.046, 0.282) | −0.094 | (0.004, 0.460) | −0.357 | (0.190, 0.693) | 0.159 | (−0.046, 0.262) | −0.125 | (0.070, 0.386) |
| 2 | −0.008 | (−0.017, 0.344) | 0.208 | (−0.039, 0.188) | −0.226 | (0.033, 0.404) | −0.275 | (0.153, 0.578) | −0.163 | (0.015, 0.296) | 0.101 | (0.018, 0.311) |
| 3 | 0.095 | (0.003, 0.331) | 0.066 | (−0.046, 0.215) | 0.015 | (0.034, 0.531) | −0.329 | (0.134, 0.502) | 0.017 | (−0.046, 0.252) | −0.213 | (0.156, 0.592) |
| 4 | 0.027 | (0.035, 0.480) | −0.081 | (0.051, 0.549) | −0.094 | (0.056, 0.476) | −0.293 | (0.153, 0.578) | −0.196 | (−0.002, 0.337) | −0.329 | (0.144, 0.571) |
| 5 | −0.132 | (0.066, 0.551) | −0.216 | (−0.017, 0.648) | −0.109 | (0.081, 0.561) | −0.103 | (0.100, 0.488) | −0.147 | (−0.046, 0.262) | −0.101 | (0.054, 0.464) |
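As a reading aid for the α tables in this appendix, the sketch below shows one way a Krippendorff’s α and a confidence interval could be computed for a single human rater paired with a single LLM on one microfiction. It is a minimal illustration under stated assumptions, not the authors’ procedure: the interval (squared-difference) metric, the percentile bootstrap over questions, the function names, and the sample ratings are all hypothetical choices made for this example.

```python
import random
from itertools import permutations


def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared-difference) metric.

    `units` holds one list of ratings per rated item, e.g.
    [human_score, llm_score] for each Likert question.
    Returns None when alpha is undefined (no pairable ratings, or all equal).
    """
    units = [u for u in units if len(u) >= 2]  # unpaired ratings are not usable
    if not units:
        return None
    values = [v for u in units for v in u]
    n = len(values)
    # Observed disagreement: squared differences within each item.
    d_o = sum(
        sum((a - b) ** 2 for a, b in permutations(u, 2)) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences across all pooled ratings.
    d_e = sum((a - b) ** 2 for a, b in permutations(values, 2)) / (n * (n - 1))
    if d_e == 0:
        return None
    return 1.0 - d_o / d_e


def bootstrap_ci(units, n_resamples=1000, level=0.95, seed=0):
    """Percentile bootstrap CI; resampling items with replacement is an assumption."""
    rng = random.Random(seed)
    alphas = []
    for _ in range(n_resamples):
        sample = [rng.choice(units) for _ in units]
        alpha = krippendorff_alpha_interval(sample)
        if alpha is not None:  # skip degenerate resamples
            alphas.append(alpha)
    if not alphas:
        raise ValueError("every resample was degenerate")
    alphas.sort()
    lo_idx = int((1 - level) / 2 * len(alphas))
    hi_idx = min(len(alphas) - 1, int((1 + level) / 2 * len(alphas)))
    return alphas[lo_idx], alphas[hi_idx]


# Hypothetical ratings: ten Likert answers from one human and one LLM.
human = [4, 2, 5, 3, 3, 4, 2, 1, 2, 1]
llm = [5, 3, 4, 4, 4, 5, 4, 5, 5, 4]
pairs = [list(p) for p in zip(human, llm)]
alpha = krippendorff_alpha_interval(pairs)
low, high = bootstrap_ci(pairs)
print(f"alpha = {alpha:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With only ten items per microfiction, bootstrap intervals of this kind tend to be wide, which is worth keeping in mind when comparing individual cells.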

| Krippendorff’s α and Confidence Interval (CI) for Literary Expert 3 | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM | MF1 α | MF1 CI | MF2 α | MF2 CI | MF3 α | MF3 CI | MF4 α | MF4 CI | MF5 α | MF5 CI | MF6 α | MF6 CI |
| 1 | 0.050 | (0.002, 0.287) | −0.063 | (0.017, 0.286) | −0.188 | (0.023, 0.488) | −0.047 | (0.013, 0.332) | 0.057 | (0.004, 0.356) | 0.131 | (0.016, 0.571) |
| 2 | −0.027 | (0.023, 0.373) | 0.095 | (−0.029, 0.235) | −0.096 | (0.011, 0.397) | −0.257 | (−0.022, 0.273) | 0.007 | (−0.033, 0.181) | −0.027 | (0.001, 0.480) |
| 3 | −0.020 | (−0.013, 0.268) | −0.132 | (−0.003, 0.296) | −0.196 | (0.079, 0.578) | −0.230 | (−0.025, 0.296) | −0.056 | (0.000, 0.286) | −0.267 | (0.146, 0.578) |
| 4 | −0.126 | (−0.032, 0.410) | −0.188 | (−0.015, 0.385) | −0.179 | (0.084, 0.485) | 0.162 | (−0.022, 0.273) | −0.132 | (−0.026, 0.263) | −0.178 | (0.073, 0.665) |
| 5 | 0.022 | (−0.031, 0.344) | 0.029 | (−0.025, 0.336) | −0.078 | (0.084, 0.565) | −0.284 | (−0.026, 0.251) | −0.319 | (0.004, 0.356) | −0.148 | (0.091, 0.578) |

| Krippendorff’s α and Confidence Interval (CI) for Literary Expert 4 | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM | MF1 α | MF1 CI | MF2 α | MF2 CI | MF3 α | MF3 CI | MF4 α | MF4 CI | MF5 α | MF5 CI | MF6 α | MF6 CI |
| 1 | −0.039 | (−0.028, 0.214) | −0.063 | (0.017, 0.297) | 0.445 | (−0.048, 0.117) | 0.026 | (0.016, 0.648) | 0.029 | (−0.027, 0.273) | 0.007 | (0.010, 0.336) |
| 2 | 0.050 | (0.008, 0.408) | −0.148 | (−0.016, 0.307) | −0.118 | (−0.027, 0.214) | −0.047 | (0.029, 0.634) | −0.103 | (−0.019, 0.221) | −0.210 | (0.032, 0.316) |
| 3 | 0.029 | (−0.039, 0.245) | 0.095 | (−0.014, 0.253) | −0.08 | (−0.031, 0.357) | −0.008 | (0.026, 0.485) | −0.230 | (−0.020, 0.331) | 0.066 | (−0.013, 0.372) |
| 4 | −0.197 | (−0.035, 0.324) | −0.197 | (−0.028, 0.438) | −0.109 | (−0.013, 0.257) | −0.346 | (0.029, 0.634) | −0.163 | (−0.033, 0.256) | −0.275 | (0.144, 0.500) |
| 5 | 0.240 | (−0.047, 0.239) | −0.257 | (−0.025, 0.383) | 0.015 | (0.008, 0.332) | −0.348 | (0.042, 0.468) | −0.109 | (−0.027, 0.273) | −0.293 | (0.022, 0.476) |

| Krippendorff’s α and Confidence Interval (CI) for Literary Expert 5 | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM | MF1 α | MF1 CI | MF2 α | MF2 CI | MF3 α | MF3 CI | MF4 α | MF4 CI | MF5 α | MF5 CI | MF6 α | MF6 CI |
| 1 | −0.293 | (0.144, 0.513) | −0.218 | (0.225, 0.430) | −0.326 | (−0.004, 0.434) | −0.101 | (0.060, 0.500) | −0.143 | (−0.013, 0.368) | −0.007 | (0.054, 0.341) |
| 2 | −0.275 | (0.134, 0.592) | −0.226 | (0.119, 0.430) | −0.020 | (0.026, 0.372) | −0.140 | (0.022, 0.500) | −0.195 | (0.060, 0.296) | −0.203 | (0.083, 0.388) |
| 3 | −0.250 | (0.111, 0.464) | −0.125 | (0.077, 0.430) | 0.120 | (−0.025, 0.209) | 0.191 | (−0.005, 0.306) | 0.015 | (−0.019, 0.350) | 0.120 | (−0.026, 0.369) |
| 4 | −0.319 | (0.095, 0.675) | −0.301 | (0.193, 0.578) | −0.090 | (0.022, 0.340) | −0.301 | (0.022, 0.500) | 0.007 | (0.028, 0.306) | −0.310 | (0.167, 0.551) |
| 5 | −0.230 | (0.070, 0.551) | −0.132 | (0.134, 0.538) | −0.169 | (0.078, 0.549) | −0.218 | (0.052, 0.325) | −0.286 | (−0.013, 0.368) | −0.284 | (0.083, 0.480) |

| Krippendorff’s α for Literary Enthusiast 2 (16 LLMs) | ||||||
|---|---|---|---|---|---|---|
| LLM | MF1 | MF2 | MF3 | MF4 | MF5 | MF6 |
| gemini-2.5-pro | 0.296 | 0.296 | 0.305 | 0.181 | 0.296 | −0.25 |
| chatgpt-4o-latest-20250326 | 0.116 | −0.073 | −0.213 | −0.258 | −0.293 | −0.295 |
| claude-opus-4-20250514 | −0.118 | −0.027 | −0.027 | −0.109 | 0.05 | −0.258 |
| grok-4-0709 | 0.252 | 0.29 | 0.081 | −0.118 | 0.007 | −0.23 |
| gemini-2.5-flash | −0.056 | −0.056 | −0.188 | 0.05 | −0.163 | −0.407 |
| gemini-2.5-flash_ai | 0.01 | −0.027 | −0.056 | −0.118 | 0.081 | −0.258 |
| gpt-4.1-2025-04-14_AI | −0.157 | −0.056 | 0.095 | −0.152 | 0.208 | −0.258 |
| o3-2025-04-16_AI | −0.127 | 0.073 | −0.293 | 0.057 | −0.171 | −0.407 |
| grok-3-preview-02-24_AI | 0 | 0.283 | −0.143 | −0.034 | −0.204 | −0.14 |
| deepseek-v3-0324_AI | −0.137 | 0.183 | −0.301 | −0.02 | −0.155 | −0.462 |
| deepseek-r1-0528 | −0.056 | 0.174 | −0.179 | 0.029 | −0.284 | −0.545 |
| claude-sonnet-4_ai | −0.137 | 0.073 | −0.204 | 0.057 | −0.196 | −0.407 |
| kimi-k2-0711-preview | −0.127 | −0.056 | −0.293 | −0.196 | −0.204 | −0.407 |
| hunyuan-turbos-20250416 | −0.118 | 0.073 | −0.204 | 0.057 | −0.213 | −0.407 |
| qwen3-235b-a22b-no-thinking | 0.112 | 0.174 | −0.221 | −0.063 | −0.196 | −0.407 |
| Mistral-medium-2505 | −0.056 | 0.174 | −0.179 | 0.029 | −0.284 | −0.545 |

| Krippendorff’s α for Literary Enthusiast 3 (16 LLMs) | ||||||
|---|---|---|---|---|---|---|
| LLM | MF1 | MF2 | MF3 | MF4 | MF5 | MF6 |
| gemini-2.5-pro | 0.252 | 0.29 | 0.081 | −0.118 | 0.007 | −0.23 |
| chatgpt-4o-latest-20250326 | −0.056 | −0.056 | −0.188 | 0.05 | −0.163 | −0.407 |
| claude-opus-4-20250514 | 0.01 | −0.027 | −0.056 | −0.118 | 0.081 | −0.258 |
| grok-4-0709 | −0.157 | −0.056 | 0.095 | −0.152 | 0.208 | −0.258 |
| gemini-2.5-flash | −0.127 | 0.073 | −0.293 | 0.057 | −0.171 | −0.407 |
| gemini-2.5-flash_ai | 0 | 0.283 | −0.143 | −0.034 | −0.204 | −0.14 |
| gpt-4.1-2025-04-14_AI | −0.137 | 0.183 | −0.301 | −0.02 | −0.155 | −0.462 |
| o3-2025-04-16_AI | −0.056 | 0.174 | −0.179 | 0.029 | −0.284 | −0.545 |
| grok-3-preview-02-24_AI | −0.137 | 0.073 | −0.204 | 0.057 | −0.196 | −0.407 |
| deepseek-v3-0324_AI | −0.127 | −0.056 | −0.293 | −0.196 | −0.204 | −0.407 |
| deepseek-r1-0528 | −0.118 | 0.073 | −0.204 | 0.057 | −0.213 | −0.407 |
| claude-sonnet-4_ai | 0.112 | 0.174 | −0.221 | −0.063 | −0.196 | −0.407 |
| kimi-k2-0711-preview | −0.056 | 0.174 | −0.179 | 0.029 | −0.284 | −0.545 |
| hunyuan-turbos-20250416 | 0.149 | 0.073 | −0.047 | −0.041 | −0.039 | 0.131 |
| qwen3-235b-a22b-no-thinking | 0.143 | 0.05 | 0.156 | 0.113 | 0 | −0.179 |
| Mistral-medium-2505 | −0.027 | −0.134 | −0.371 | −0.013 | 0.029 | 0.197 |

| Krippendorff’s α for Literary Enthusiast 4 (16 LLMs) | ||||||
|---|---|---|---|---|---|---|
| LLM | MF1 | MF2 | MF3 | MF4 | MF5 | MF6 |
| gemini-2.5-pro | −0.178 | 0 | 0.112 | 0.406 | 0.296 | 0.007 |
| chatgpt-4o-latest-20250326 | 0.278 | 0.345 | −0.132 | 0.245 | −0.188 | −0.248 |
| claude-opus-4-20250514 | 0.107 | −0.145 | −0.118 | 0.18 | 0.202 | −0.027 |
| grok-4-0709 | 0.399 | 0.855 | 0.186 | 0.032 | −0.25 | −0.188 |
| gemini-2.5-flash | 0.007 | 0 | 0.013 | 0.604 | −0.196 | −0.169 |
| gemini-2.5-flash_ai | −0.02 | −0.145 | 0.084 | 0.036 | 0.066 | −0.027 |
| gpt-4.1-2025-04-14_AI | 0.142 | 0 | 0.128 | 0.136 | 0.042 | −0.027 |
| o3-2025-04-16_AI | 0.153 | 0.159 | −0.226 | 0.607 | −0.179 | −0.169 |
| grok-3-preview-02-24_AI | 0.007 | 0.01 | −0.163 | 0.362 | −0.248 | −0.179 |
| deepseek-v3-0324_AI | 0.147 | 0.361 | −0.226 | 0.255 | −0.188 | −0.236 |
| deepseek-r1-0528 | 0.013 | 0.095 | −0.125 | 0.59 | −0.319 | −0.333 |
| claude-sonnet-4_ai | 0.269 | 0.159 | −0.125 | 0.607 | −0.213 | −0.169 |
| kimi-k2-0711-preview | 0.153 | 0 | −0.226 | 0.203 | −0.239 | −0.169 |
| hunyuan-turbos-20250416 | 0.153 | 0.159 | −0.125 | 0.607 | −0.248 | −0.169 |
| qwen3-235b-a22b-no-thinking | 0.131 | 0.095 | −0.132 | 0.469 | −0.196 | −0.169 |
| Mistral-medium- | 0.013 | 0.095 | −0.125 | 0.59 | −0.319 | −0.333 |

| Krippendorff’s α for Literary Enthusiast 5 (16 LLMs) | ||||||
|---|---|---|---|---|---|---|
| LLM | MF1 | MF2 | MF3 | MF4 | MF5 | MF6 |
| gemini-2.5-pro | 0.084 | −0.101 | 0.073 | 0.457 | 0.303 | −0.118 |
| chatgpt-4o-latest-20250326 | 0.208 | 0.116 | 0.029 | −0.179 | −0.155 | −0.23 |
| claude-opus-4-20250514 | 0.202 | −0.075 | −0.027 | 0.109 | 0.081 | −0.155 |
| grok-4-0709 | 0.321 | 0.026 | −0.108 | −0.148 | −0.013 | −0.188 |
| gemini-2.5-flash | 0.252 | −0.101 | 0.063 | 0 | −0.056 | −0.248 |
| gemini-2.5-flash_ai | 0.35 | −0.075 | −0.075 | −0.047 | 0.095 | −0.155 |
| gpt-4.1-2025-04-14_AI | 0.116 | −0.101 | 0.084 | 0.073 | 0.081 | −0.155 |
| o3-2025-04-16_AI | 0 | −0.2 | −0.056 | 0 | −0.034 | −0.248 |
| grok-3-preview-02-24_AI | 0.264 | −0.134 | −0.015 | −0.086 | −0.078 | −0.125 |
| deepseek-v3-0324_AI | −0.008 | −0.086 | −0.213 | −0.063 | −0.188 | −0.305 |
| deepseek-r1-0528 | 0.095 | −0.145 | −0.078 | −0.031 | −0.188 | −0.39 |
| claude-sonnet-4_ai | 0.136 | −0.2 | 0.043 | 0 | −0.048 | −0.248 |
| kimi-k2-0711-preview | 0 | −0.101 | −0.063 | −0.126 | −0.188 | −0.248 |
| hunyuan-turbos-20250416 | 0.015 | −0.2 | 0.043 | 0 | −0.07 | −0.248 |
| qwen3-235b-a22b-no-thinking | 0.392 | −0.145 | −0.101 | −0.109 | −0.041 | −0.248 |
| Mistral-medium- | 0.095 | −0.145 | −0.078 | −0.031 | −0.188 | −0.39 |

| # | LLM Evaluators | Organization | License |
|---|---|---|---|
| 1 | claude-sonnet-4-5-20250929 | Anthropic | Proprietary |
| 2 | Gemini-2.5-Flash-Preview-04-17 | Google | Proprietary |
| 3 | o3-2025-04-16 | OpenAI | Proprietary |
| 4 | DeepSeek-V3-0324 | DeepSeek | MIT |
| 5 | GPT-4.1-2025-04-14 | OpenAI | Proprietary |

| # | LLM Evaluators | Organization | License |
|---|---|---|---|
| 1 | gemini-2.5-pro | Google | Proprietary |
| 2 | chatgpt-4o-latest-20250326 | OpenAI | Proprietary |
| 3 | claude-opus-4-20250514 | Anthropic | Proprietary |
| 4 | grok-4-0709 | xAI | Proprietary |
| 5 | gemini-2.5-flash | Google | Proprietary |
| 6 | gemini-2.5-flash-lite-preview-06-17-thinking | Google | Proprietary |
| 7 | gpt-4.1-2025-04-14 | OpenAI | Proprietary |
| 8 | o3-2025-04-16 | OpenAI | Proprietary |
| 9 | grok-3-preview-02-24 | xAI | Proprietary |
| 10 | deepseek-v3-0324 | DeepSeek | MIT |
| 11 | deepseek-r1-0528 | DeepSeek | MIT |
| 12 | claude-sonnet-4-20250514-thinking-32k | Anthropic | Proprietary |
| 13 | kimi-k2-0711-preview | Moonshot | Modified MIT |
| 14 | hunyuan-turbos-20250416 | Tencent | Proprietary |
| 15 | qwen3-235b-a22b-no-thinking | Alibaba | Apache 2.0 |
| 16 | mistral-medium-2505 | Mistral | Proprietary |

Appendix A.1.1. Experiment Using TTCW Data with GrAImes to Compare Two Evaluation Protocols

| # | LLM Evaluators | Organization | License |
|---|---|---|---|
| 1 | claude-opus-4-1-20250805 | Anthropic | Proprietary |
| 2 | chatgpt-4o-latest-20250326 | OpenAI | Proprietary |
| 3 | gpt-5-high | OpenAI | Proprietary |

Appendix B. Verbatim Prompts
Appendix B.1. Expert Prompt (Translated from Spanish)
Appendix B.2. Enthusiast Prompt (Translated from Spanish)
Appendix B.3. Prompt
Appendix C. GrAImes Literary Evaluation Protocol
| GrAImes Evaluation Protocol Questions | |||
|---|---|---|---|
| # | Question | Answer | Description |
| Story Overview and Text Complexity | |||
| 1 | What happens in the story? | OA | Evaluates how clearly the generated microfiction is understood by the reader. |
| 2 | What is the theme? | OA | Assesses whether the text has a recognizable structure and can be associated with a specific theme. |
| 3 | Does it propose other interpretations, in addition to the literal one? | Likert | Evaluates the literary depth of the microfiction. A text with multiple interpretations demonstrates greater literary complexity. |
| 4 | If the answer to the previous question was affirmative, which interpretation is it? | OA | Explores whether the microfiction contains deeper literary elements such as metaphor, symbolism, or allusion. |
| Technical Assessment | |||
| 5 | Is the story credible? | Likert | Measures how realistic and distinguishable the characters and events are within the microfiction. |
| 6 | Does the text require your participation or cooperation to complete its form and meaning? | Likert | Assesses the complexity of the microfiction by determining the extent to which it involves the reader in constructing meaning. |
| 7 | Does it propose a new perspective on reality? | Likert | Evaluates whether the microfiction immerses the reader in an alternate reality different from their own. |
| 8 | Does it propose a new vision of the genre it uses? | Likert | Determines whether the microfiction offers a fresh approach to its literary genre. |
| 9 | Does it give an original way of using the language? | Likert | Measures the creativity and uniqueness of the language used in the microfiction. |
| Editorial/Commercial Quality | |||
| 10 | Does it remind you of another text or book you have read? | Likert | Assesses the relevance of the text and its similarities to existing works in the literary market. |
| 11 | Would you like to read more texts like this? | Likert | Measures the appeal of the microfiction and its potential marketability. |
| 12 | Would you recommend it? | Likert | Indicates whether the microfiction has an audience and whether readers might seek out more works by the author. |
| 13 | Would you give it as a present? | Likert | Evaluates whether the microfiction holds enough literary or commercial value for readers to gift it to others. |
| 14 | If the last answer was yes, to whom would you give it as a present? | OA | Identifies the type of reader the evaluator believes would appreciate the microfiction. |
| 15 | Can you think of a specific publisher that you think would publish a text like this? | OA | Assesses the commercial viability of the microfiction by determining if respondents associate it with a specific publishing market. |
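For readers who want to process responses programmatically, the protocol table above can be encoded as a small data structure. The sketch below is one possible encoding, not part of GrAImes itself: the `Question` class and the constant names are hypothetical, and reading "OA" as an open (free-text) answer is an assumption. Section boundaries and answer types are taken from the table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Question:
    number: int
    section: str
    answer_type: str  # "OA" (assumed to mean open answer) or "Likert"


# Section boundaries as listed in the protocol table.
SECTIONS = {
    "Story Overview and Text Complexity": range(1, 5),
    "Technical Assessment": range(5, 10),
    "Editorial/Commercial Quality": range(10, 16),
}
# Questions marked "OA" in the table; all others are Likert items.
OPEN_ANSWER = {1, 2, 4, 14, 15}

PROTOCOL = [
    Question(n, section, "OA" if n in OPEN_ANSWER else "Likert")
    for section, numbers in SECTIONS.items()
    for n in numbers
]

# Only Likert items can be averaged or fed into agreement statistics.
likert_items = [q.number for q in PROTOCOL if q.answer_type == "Likert"]
print(likert_items)
```

The ten Likert items selected this way (questions 3 and 5 to 13) are the same ones that appear in the results tables.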
References
- Alhussain, A.I.; Azmi, A.M. Automatic story generation: A survey of approaches. ACM Comput. Surv. (CSUR) 2021, 54, 1–38.
- Aleman Manzanarez, G.; de la Cruz Arana, N.; Garcia Flores, J.; Garcia Medina, Y.; Monroy, R.; Pernelle, N. Can Artificial Intelligence Write Like Borges? An Evaluation Protocol for Spanish Microfiction. Appl. Sci. 2025, 15, 6802.
- Chakrabarty, T.; Laban, P.; Agarwal, D.; Muresan, S.; Wu, C.S. Art or artifice? large language models and the false promise of creativity. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–34.
- Zheng, L.; Chiang, W.L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. Judging llm-as-a-judge with mt-bench and chatbot arena. Adv. Neural Inf. Process. Syst. 2023, 36, 46595–46623.
- Verga, P.; Hofstatter, S.; Althammer, S.; Su, Y.; Piktus, A.; Arkhangorodsky, A.; Xu, M.; White, N.; Lewis, P. Replacing judges with juries: Evaluating llm generations with a panel of diverse models. arXiv 2024, arXiv:2404.18796.
- Liu, Y.; Iter, D.; Xu, Y.; Wang, S.; Xu, R.; Zhu, C. G-eval: NLG evaluation using gpt-4 with better human alignment. arXiv 2023, arXiv:2303.16634.
- Huang, F.; Kwak, H.; An, J. Is chatgpt better than human annotators? potential and limitations of chatgpt in explaining implicit hate speech. In Proceedings of the Companion Proceedings of the ACM Web Conference 2023, Austin, TX, USA, 30 April–4 May 2023; pp. 294–297.
- Wang, P.; Li, L.; Chen, L.; Cai, Z.; Zhu, D.; Lin, B.; Cao, Y.; Liu, Q.; Liu, T.; Sui, Z. Large language models are not fair evaluators. arXiv 2023, arXiv:2305.17926.
- Chiang, C.H.; Lee, H.y. Can large language models be an alternative to human evaluations? arXiv 2023, arXiv:2305.01937.
- Chhun, C.; Suchanek, F.M.; Clavel, C. Do language models enjoy their own stories? prompting large language models for automatic story evaluation. Trans. Assoc. Comput. Linguist. 2024, 12, 1122–1142.
- Pan, Q.; Ashktorab, Z.; Desmond, M.; Cooper, M.S.; Johnson, J.; Nair, R.; Daly, E.; Geyer, W. Human-Centered Design Recommendations for LLM-as-a-judge. arXiv 2024, arXiv:2407.03479.
- Li, Z.; Xu, X.; Shen, T.; Xu, C.; Gu, J.C.; Lai, Y.; Tao, C.; Ma, S. Leveraging large language models for NLG evaluation: Advances and challenges. arXiv 2024, arXiv:2401.07103.
- Li, D.; Jiang, B.; Huang, L.; Beigi, A.; Zhao, C.; Tan, Z.; Bhattacharjee, A.; Jiang, Y.; Chen, C.; Wu, T.; et al. From generation to judgment: Opportunities and challenges of llm-as-a-judge. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Suzhou, China, 4–9 November 2025; pp. 2757–2791.
- Qin, Y. A survey on quality evaluation of machine generated texts. Comput. Eng. Sci. 2022, 44, 138.
- Iser, W. The act of reading: A theory of aesthetic response. J. Aesthet. Art Crit. 1979, 38, 88–91.
- Ingarden, R. Concretización y reconstrucción. In En Busca del Texto: Teoría de la Recepción Literaria; Universidad Nacional Autónoma de México: Mexico City, Mexico, 1993; pp. 31–54.
- Chakrabarty, T.; Padmakumar, V.; Brahman, F.; Muresan, S. Creativity support in the age of large language models: An empirical study involving emerging writers. arXiv 2023, arXiv:2309.12570.
- McCutchen, D. From novice to expert: Implications of language skills and writing-relevant knowledge for memory during the development of writing skill. J. Writ. Res. 2011, 3, 51–68.
- Chiang, W.L.; Zheng, L.; Sheng, Y.; Angelopoulos, A.N.; Li, T.; Li, D.; Zhang, H.; Zhu, B.; Jordan, M.; Gonzalez, J.E.; et al. Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference. arXiv 2024, arXiv:2403.04132.
- Krippendorff, K. Estimating the reliability, systematic error and random error of interval data. Educ. Psychol. Meas. 1970, 30, 61–70.
- Welivita, A.; Pu, P. Are Large Language Models More Empathetic than Humans? arXiv 2024, arXiv:2406.05063.









| I. GrAImes Expert Evaluation: Human Experts vs. LLM Judges | ||
|---|---|---|
| Dataset | Human Evaluators | LLM Evaluators |
| Six writers’ microfictions: two expert, two medium, two amateur | Five experts with PhDs in literature | Five “expert” LLMs |
| II. GrAImes Enthusiast Evaluation: Human Enthusiasts vs. LLM Judges | ||
| Dataset | Human Evaluators | LLM Evaluators |
| Six AI-generated microfictions: three by ChatGPT, three by Monterroso | Sixteen literature enthusiasts | Sixteen “enthusiast” LLM prompts |
| III. TTCW and GrAImes Protocol Comparison | ||
| Dataset | Human Evaluators | LLM Evaluators |
| Twelve stories by TNY, twelve by ChatGPT, twelve by GPT4, twelve by Claude | Ten human experts | Four “expert” LLMs |
| i. GrAImes Expert Evaluation: LLMs-as-Judges Stability | ||
| Dataset | LLM Evaluators | Experiments |
| Six writers’ microfictions: two expert, two medium, two amateur | Claude-sonnet-4-5-20250929, Gemini-2.5-pro, o3-2025-04-16, GPT-4.1-2025-04-14, Deepseek-v3.2-exp-thinking | Five runs as “expert” LLM prompts |
| ii. GrAImes Enthusiast Evaluation: LLMs-as-Judges Stability | ||
| Dataset | LLM Evaluators | Experiments |
| Six AI-generated microfictions: three by ChatGPT, three by Monterroso | Gemini-2.5-pro, Chatgpt-4o-latest-20250326, Claude-opus-4-20250514, Grok-4-0709, Gemini-2.5-flash, Llama-4-maverick-17b-128e-instruct, GPT-4.1-2025-04-14, o3-2025-04-16, GPT-5-high, Deepseek-v3-0324, Kimi-k2-thinking, Claude-sonnet-4-20250514-thinking-32k, Mistral-medium-2508, GLM-4.6, QWEN3-235b-a22b-no-thinking, Mistral-medium-2505 | Five runs as “enthusiast” LLM prompts |
| iii. TTCW and GrAImes Protocol Comparison: LLMs-as-Judges Stability | ||
| Dataset | LLM Evaluators | Experiments |
| Twelve stories by TNY, twelve by ChatGPT, twelve by GPT4, twelve by Claude | Claude-opus-4-1-20250805, ChatGPT-4o-latest-20250326, GPT-5-high | Five runs as “expert” LLM prompts |

| LLM Responses to Human-Written Microfictions | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | MF1 | | MF2 | | MF3 | | MF4 | | MF5 | | MF6 | | Avg. | |
| Question | AV | SD | AV | SD | AV | SD | AV | SD | AV | SD | AV | SD | AV | SD |
| Story Overview and Text Complexity | ||||||||||||||
| 3.-Does it propose other interpretations, in addition to the literal one? | 5.0 | 0.0 | 5.0 | 0.0 | 3.8 | 0.4 | 4.0 | 0.7 | 3.6 | 0.5 | 4.4 | 0.9 | 4.3 | 0.4 |
| Technical Assessment | ||||||||||||||
| 5.-Is the story credible? | 2.4 | 1.1 | 2.6 | 1.1 | 4.0 | 0.7 | 4.0 | 0.7 | 3.0 | 0.7 | 2.2 | 0.8 | 3.0 | 0.9 |
| 6.-Does the text require your participation or cooperation to complete its form and meaning? | 4.4 | 0.5 | 4.6 | 0.5 | 3.2 | 0.4 | 4.0 | 0.7 | 3.2 | 0.8 | 4.6 | 0.9 | 4.0 | 0.7 |
| 7.-Does it propose a new vision of reality? | 4.4 | 0.5 | 4.6 | 0.5 | 3.0 | 1.0 | 3.8 | 0.4 | 3.6 | 1.1 | 3.6 | 0.5 | 3.8 | 0.7 |
| 8.-Does it propose a new vision of the genre it uses? | 4.2 | 0.8 | 4.0 | 0.7 | 2.6 | 0.5 | 3.4 | 0.5 | 3.4 | 0.5 | 4.0 | 1.2 | 3.6 | 0.7 |
| 9.-Does it propose a new vision of the language itself? | 4.8 | 0.4 | 4.8 | 0.4 | 3.6 | 0.5 | 4.4 | 0.5 | 3.2 | 0.4 | 4.0 | 0.7 | 4.1 | 0.5 |
| Editorial/Commercial Quality | ||||||||||||||
| 10.-Does it remind you of another text or book you have read? | 4.0 | 0.0 | 4.0 | 0.7 | 3.2 | 1.1 | 3.6 | 0.5 | 4.0 | 1.0 | 3.8 | 0.8 | 3.8 | 0.7 |
| 11.-Would you like to read more texts like this? | 4.8 | 0.4 | 4.6 | 0.5 | 3.0 | 0.7 | 3.0 | 0.0 | 4.0 | 0.8 | 4.2 | 0.7 | 3.9 | 0.5 |
| 12.-Would you recommend it? | 4.8 | 0.4 | 4.6 | 0.5 | 3.4 | 0.9 | 4.2 | 0.4 | 4.2 | 0.8 | 4.0 | 0.7 | 4.2 | 0.6 |
| 13.-Would you give it as a present? | 4.4 | 0.5 | 4.2 | 0.8 | 2.6 | 1.1 | 3.8 | 0.8 | 3.8 | 1.1 | 3.4 | 1.1 | 3.7 | 0.9 |

| LLM Responses to Human-Written Microfictions, Ordered by SD | ||
|---|---|---|
| Question | AV | SD |
| 3.-Does it propose other interpretations, in addition to the literal one? | 4.3 | 0.4 |
| 11.-Would you like to read more texts like this? | 3.9 | 0.5 |
| 9.-Does it propose a new vision of the language itself? | 4.1 | 0.5 |
| 12.-Would you recommend it? | 4.2 | 0.6 |
| 6.-Does the text require your participation or cooperation to complete its form and meaning? | 4.0 | 0.7 |
| 7.-Does it propose a new vision of reality? | 3.8 | 0.7 |
| 10.-Does it remind you of another text or book you have read? | 3.8 | 0.7 |
| 8.-Does it propose a new vision of the genre it uses? | 3.6 | 0.7 |
| 13.-Would you give it as a present? | 3.7 | 0.9 |
| 5.-Is the story credible? | 3.0 | 0.9 |
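The AV and SD columns above summarize Likert ratings across the LLM judges, and the second table orders questions by SD. The sketch below illustrates that arithmetic. The ratings are made up for illustration (they are not the study’s raw data), and the use of the sample standard deviation rather than the population one is an assumption.

```python
from statistics import mean, stdev

# Hypothetical Likert ratings (1-5) from five LLM judges, keyed by question number.
ratings = {
    3: [4, 4, 4, 4, 3],
    5: [3, 4, 4, 4, 5],
    11: [3, 3, 3, 3, 3],
}


def summarize(scores):
    """Return (AV, SD) rounded to one decimal; SD is the sample standard deviation."""
    return round(mean(scores), 1), round(stdev(scores), 1)


summary = {q: summarize(s) for q, s in ratings.items()}

# Order questions from most to least consistent across judges (smallest SD first).
by_sd = sorted(summary, key=lambda q: summary[q][1])
for q in by_sd:
    av, sd = summary[q]
    print(f"Q{q}: AV={av} SD={sd}")
```

A low SD here means the judges gave similar scores on that question; it says nothing about whether those scores agree with the human evaluators.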
| | Story Overview | | | Technical Assessment | | | Editorial/Commercial | | | Total Analysis | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # | MF | AV | SD | MF | AV | SD | MF | AV | SD | MF | AV | SD |
| 1 | 1 | 5.0 | 0.0 | 2 | 4.0 | 0.7 | 1 | 4.5 | 0.4 | 1 | 4.3 | 0.5 |
| 2 | 2 | 5.0 | 0.0 | 1 | 4.1 | 0.7 | 2 | 4.4 | 0.7 | 2 | 4.3 | 0.6 |
| 3 | 6 | 4.4 | 0.9 | 4 | 3.9 | 0.6 | 5 | 4.0 | 0.9 | 4 | 3.8 | 0.5 |
| 4 | 4 | 4.0 | 0.7 | 6 | 3.7 | 0.8 | 6 | 3.9 | 0.8 | 6 | 3.8 | 0.8 |
| 5 | 3 | 3.8 | 0.4 | 3 | 3.3 | 0.6 | 4 | 3.7 | 0.5 | 5 | 3.6 | 0.8 |
| 6 | 5 | 3.6 | 0.5 | 5 | 3.3 | 0.7 | 3 | 3.1 | 1.0 | 3 | 3.2 | 0.8 |
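The ranking above can be reproduced from the per-question averages. The sketch below does this for the Editorial/Commercial column, assuming that a section score is the unweighted mean of that section’s question averages (questions 10 to 13); that aggregation rule is an inference from the tabulated values rather than something stated in the tables. The input numbers are copied from the "LLM Responses to Human-Written Microfictions" table, and the variable names are hypothetical.

```python
from statistics import mean

# Per-question averages for questions 10, 11, 12 and 13, keyed by microfiction number.
editorial = {
    1: [4.0, 4.8, 4.8, 4.4],
    2: [4.0, 4.6, 4.6, 4.2],
    3: [3.2, 3.0, 3.4, 2.6],
    4: [3.6, 3.0, 4.2, 3.8],
    5: [4.0, 4.0, 4.2, 3.8],
    6: [3.8, 4.2, 4.0, 3.4],
}

# Section score: unweighted mean of the question averages (assumed aggregation).
section_score = {mf: mean(avgs) for mf, avgs in editorial.items()}

# Rank microfictions from highest to lowest section score.
ranking = sorted(section_score, key=section_score.get, reverse=True)
print(ranking)
```

This prints `[1, 2, 5, 6, 4, 3]`, the same order as the MF entries in the Editorial/Commercial column of the ranking table.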
| Krippendorff’s α and Confidence Interval (CI) for Literary Expert 1 | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM | MF1 α | MF1 CI | MF2 α | MF2 CI | MF3 α | MF3 CI | MF4 α | MF4 CI | MF5 α | MF5 CI | MF6 α | MF6 CI |
| 1 | 0.112 | (−0.028, 0.468) | 0.136 | (−0.027, 0.199) | 0.197 | (−0.022, 0.252) | −0.236 | (0.007, 0.442) | 0.058 | (−0.045, 0.206) | −0.520 | (−0.038, 0.392) |
| 2 | 0.010 | (−0.044, 0.168) | 0.584 | (−0.048, 0.135) | 0.158 | (−0.021, 0.268) | −0.008 | (−0.030, 0.410) | 0.101 | (−0.013, 0.286) | 0.022 | (−0.001, 0.315) |
| 3 | 0.128 | (−0.056, 0.296) | −0.118 | (−0.047, 0.243) | −0.163 | (0.052, 0.444) | −0.126 | (−0.013, 0.332) | 0.228 | (−0.046, 0.296) | −0.291 | (−0.024, 0.454) |
| 4 | −0.288 | (−0.038, 0.678) | −0.008 | (0.007, 0.413) | −0.258 | (0.070, 0.386) | −0.008 | (−0.030, 0.410) | −0.171 | (0.039, 0.340) | −0.078 | (0.138, 0.634) |
| 5 | 0.188 | (−0.022, 0.620) | 0.168 | (−0.019, 0.353) | −0.293 | (0.075, 0.465) | −0.188 | (−0.019, 0.309) | 0.058 | (−0.045, 0.206) | 0.202 | (−0.044, 0.467) |

| LLM Responses to Human-Written Microfictions | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | MF1 | | MF2 | | MF3 | | MF4 | | MF5 | | MF6 | | Avg. | |
| Question | AV | SD | AV | SD | AV | SD | AV | SD | AV | SD | AV | SD | AV | SD |
| Story Overview and Text Complexity | ||||||||||||||
| 3.-Does it propose other interpretations, in addition to the literal one? | 3.0 | 0.8 | 1.7 | 0.6 | 3.3 | 1.3 | 2.8 | 0.7 | 3.5 | 0.9 | 4.6 | 0.5 | 3.1 | 0.8 |
| Technical Assessment | ||||||||||||||
| 5.-Is the story credible? | 2.0 | 0.4 | 1.1 | 0.3 | 2.3 | 0.9 | 4.9 | 0.5 | 2.8 | 0.4 | 5.0 | 0.0 | 3.0 | 0.4 |
| 6.-Does the text require your participation or cooperation to complete its form and meaning? | 3.8 | 1.0 | 2.1 | 1.1 | 3.4 | 1.2 | 2.0 | 0.4 | 2.7 | 0.6 | 3.6 | 0.6 | 2.5 | 0.8 |
| 7.-Does it propose a new vision of reality? | 2.1 | 1.0 | 1.2 | 0.5 | 3.2 | 1.1 | 1.7 | 0.7 | 2.3 | 0.7 | 3.8 | 0.8 | 2.4 | 0.8 |
| 8.-Does it propose a new vision of the genre it uses? | 1.6 | 0.6 | 1.1 | 0.5 | 2.1 | 0.8 | 1.1 | 0.3 | 1.6 | 0.8 | 3.4 | 1.0 | 2.4 | 0.7 |
| 9.-Does it propose a new vision of the language itself? | 2.6 | 0.8 | 1.6 | 0.7 | 2.3 | 0.9 | 3.0 | 0.7 | 3.3 | 0.9 | 4.3 | 0.9 | 2.9 | 0.8 |
| Editorial/Commercial Quality | ||||||||||||||
| 10.-Does it remind you of another text or book you have read? | 1.3 | 0.6 | 1.2 | 0.4 | 2.7 | 0.9 | 3.7 | 0.5 | 2.6 | 0.7 | 4.3 | 0.9 | 2.6 | 0.7 |
| 11.-Would you like to read more texts like this? | 1.0 | 0.0 | 1.0 | 0.0 | 1.9 | 0.8 | 2.0 | 0.4 | 2.1 | 0.8 | 4.4 | 0.8 | 2.1 | 0.5 |
| 12.-Would you recommend it? | 1.0 | 0.0 | 1.0 | 0.0 | 1.7 | 0.5 | 2.2 | 0.5 | 2.1 | 0.8 | 4.3 | 1.2 | 2.3 | 0.5 |
| 13.-Would you give it as a present? | 1.0 | 0.0 | 1.0 | 0.0 | 1.3 | 0.5 | 1.9 | 0.6 | 1.9 | 0.8 | 4.3 | 1.2 | 2.4 | 0.5 |

| | Story Overview | | | Technical Assessment | | | Editorial/Commercial | | | Total Analysis | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| # | MF | AV | SD | MF | AV | SD | MF | AV | SD | MF | AV | SD |
| 1 | 6 | 4.6 | 0.5 | 6 | 4.0 | 0.7 | 6 | 4.3 | 1.0 | 6 | 4.2 | 0.8 |
| 2 | 5 | 3.5 | 0.9 | 3 | 2.7 | 1.0 | 4 | 2.5 | 0.5 | 4 | 2.5 | 0.5 |
| 3 | 3 | 3.3 | 1.3 | 4 | 2.5 | 0.5 | 5 | 2.2 | 0.8 | 5 | 2.5 | 0.7 |
| 4 | 1 | 3.0 | 0.8 | 5 | 2.5 | 0.7 | 3 | 1.9 | 0.7 | 3 | 2.4 | 0.9 |
| 5 | 4 | 2.8 | 0.7 | 1 | 2.4 | 0.7 | 1 | 1.1 | 0.1 | 1 | 1.9 | 0.5 |
| 6 | 2 | 1.7 | 0.6 | 2 | 1.4 | 0.6 | 2 | 1.0 | 0.1 | 2 | 1.3 | 0.4 |

| Krippendorff’s α and Confidence Interval (CI) for Literature Enthusiast 1 (16 LLMs) | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| LLM | MF1 α | MF1 CI | MF2 α | MF2 CI | MF3 α | MF3 CI | MF4 α | MF4 CI | MF5 α | MF5 CI | MF6 α | MF6 CI |
| 1 | −0.348 | (0.268, 0.605) | −0.188 | (0.106, 0.500) | −0.078 | (0.054, 0.376) | −0.013 | (0.062, 0.443) | −0.163 | (0.135, 0.564) | −0.126 | (0.095, 0.476) |
| 2 | 0.007 | (−0.019, 0.210) | 0.125 | (−0.022, 0.185) | 0.250 | (−0.032, 0.213) | 0.095 | (0.014, 0.397) | −0.163 | (0.161, 0.476) | 0.021 | (−0.040, 0.419) |
| 3 | −0.242 | (0.139, 0.399) | −0.299 | (0.192, 0.551) | −0.299 | (0.195, 0.578) | −0.125 | (0.087, 0.408) | −0.242 | (0.234, 0.488) | −0.163 | (0.087, 0.430) |
| 4 | −0.082 | (0.007, 0.282) | 0.255 | (−0.018, 0.243) | 0.265 | (−0.019, 0.311) | 0.070 | (−0.002, 0.369) | 0.143 | (−0.013, 0.291) | 0.128 | (−0.030, 0.361) |
| 5 | −0.242 | (0.080, 0.355) | −0.188 | (0.115, 0.501) | 0.395 | (−0.045, 0.185) | −0.007 | (0.026, 0.378) | −0.027 | (0.054, 0.434) | 0.208 | (−0.056, 0.324) |
| 6 | −0.242 | (0.101, 0.484) | −0.299 | (0.192, 0.556) | −0.197 | (0.100, 0.529) | −0.226 | (0.147, 0.441) | −0.218 | (0.225, 0.476) | −0.163 | (0.087, 0.434) |
| 7 | −0.195 | (0.089, 0.365) | −0.188 | (0.128, 0.488) | −0.063 | (0.034, 0.390) | −0.148 | (0.103, 0.531) | −0.242 | (0.234, 0.488) | −0.163 | (0.087, 0.476) |
| 8 | −0.195 | (0.059, 0.316) | −0.056 | (0.076, 0.345) | 0.119 | (−0.020, 0.235) | 0.007 | (0.029, 0.378) | −0.163 | (0.076, 0.373) | 0.208 | (−0.056, 0.340) |
| 9 | −0.242 | (0.087, 0.397) | −0.216 | (0.084, 0.500) | 0.019 | (−0.001, 0.276) | −0.007 | (0.038, 0.378) | −0.242 | (0.234, 0.488) | 0.228 | (−0.046, 0.232) |
| 10 | −0.195 | (0.053, 0.349) | −0.020 | (0.067, 0.335) | −0.125 | (−0.001, 0.336) | 0.113 | (0.001, 0.365) | −0.221 | (0.104, 0.531) | −0.044 | (−0.056, 0.467) |
| 11 | −0.203 | (0.071, 0.393) | −0.109 | (0.095, 0.419) | 0.095 | (−0.019, 0.183) | −0.007 | (0.045, 0.489) | −0.221 | (0.105, 0.486) | −0.357 | (−0.056, 0.648) |
| 12 | −0.075 | (0.043, 0.297) | −0.056 | (0.073, 0.360) | 0.255 | (−0.033, 0.183) | 0.007 | (0.034, 0.365) | −0.267 | (0.144, 0.430) | 0.208 | (−0.056, 0.340) |
| 13 | −0.195 | (0.057, 0.325) | −0.188 | (0.126, 0.500) | −0.132 | (−0.006, 0.248) | −0.140 | (0.076, 0.488) | −0.284 | (0.296, 0.578) | 0.208 | (−0.056, 0.340) |
| 14 | −0.218 | (0.057, 0.287) | −0.056 | (0.077, 0.356) | 0.255 | (−0.035, 0.159) | 0.007 | (0.034, 0.397) | −0.242 | (0.242, 0.476) | 0.208 | (−0.056, 0.324) |
| 15 | −0.089 | (0.054, 0.388) | −0.109 | (0.095, 0.438) | 0.007 | (−0.008, 0.258) | 0.101 | (0.015, 0.442) | −0.179 | (0.077, 0.441) | 0.208 | (−0.056, 0.340) |
| 16 | −0.203 | (0.071, 0.393) | −0.109 | (0.086, 0.419) | 0.095 | (−0.016, 0.205) | −0.007 | (0.037, 0.461) | −0.221 | (0.104, 0.501) | 0.357 | (−0.056, 0.648) |

| # | LLM | Q3 | Q5 | Q6 | Q7 | Q8 | Q9 | Q10 | Q11 | Q12 | Q13 | Avg |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | deepseek-r1-0528 | 0.2 | 0.1 | −0.4 | −0.1 | −0.2 | 0.6 | −0.2 | 0.2 | 0.1 | 0.3 | 0.1 |
| 2 | mistral-medium-2505 | 0.2 | 0.1 | −0.4 | −0.1 | −0.2 | 0.6 | −0.2 | 0.2 | 0.1 | 0.3 | 0.1 |
| 3 | o3-2025-04-16_AI | 0.6 | 0.1 | −0.4 | −0.3 | −0.6 | 0.6 | −0.3 | −0.1 | −0.1 | 0.1 | 0.1 |
| 4 | deepseek-v3-0324_AI | 0.6 | −0.1 | −0.6 | 0.4 | −0.2 | 0.6 | −0.2 | 0.2 | 0.3 | 0.5 | 0.1 |
| 5 | kimi-k2-0711-preview | 0.2 | −0.1 | −0.6 | 0.1 | −0.2 | 0.1 | −0.7 | 0.1 | 0.1 | 0.1 | 0.1 |
| 6 | claude-sonnet-4_ai | 0.6 | 0.3 | −0.4 | −0.3 | −0.6 | 0.4 | −0.5 | −0.1 | −0.1 | 0.0 | 0.1 |
| 7 | gemini-2.5-flash | 1.2 | 0.1 | −0.6 | −0.3 | −0.7 | 0.3 | −0.3 | 0.1 | 0.1 | 0.0 | 0.2 |
| 8 | hunyuan-turbos-20250416 | 0.6 | 0.1 | −0.4 | −0.3 | −0.6 | 0.4 | −0.5 | −0.1 | −0.1 | 0.0 | 0.2 |
| 9 | chatgpt-4o-latest-20250326 | 0.6 | 0.3 | 0.2 | 1.1 | 0.3 | 0.9 | −0.5 | −0.3 | −0.2 | −0.2 | 0.2 |
| 10 | qwen3-235b-a22b-no-thinking | 0.6 | −0.1 | −0.3 | −0.3 | −0.7 | 0.3 | −0.5 | −0.1 | −0.1 | 0.0 | 0.2 |
| 11 | grok-4-0709 | 0.6 | −0.2 | 0.1 | 0.6 | 0.1 | 1.1 | 0.2 | −0.1 | −0.1 | 0.1 | 0.2 |
| 12 | grok-3-preview-02-24_AI | −0.6 | −0.1 | −0.4 | −0.4 | −0.7 | −0.2 | −0.3 | −0.1 | 0.1 | −0.2 | 0.3 |
| 13 | gemini-2.5-pro | −1.1 | −0.4 | −1.8 | −1.3 | −1.2 | −1.1 | −0.7 | −0.8 | −0.7 | −0.4 | 0.9 |
| 14 | gpt-4.1-2025-04-14_AI | −0.1 | −0.4 | −1.4 | −0.6 | −1.1 | −0.9 | −1.3 | −0.8 | −0.9 | −0.9 | 0.9 |
| 15 | claude-opus-4-20250514 | −0.8 | −0.2 | −1.8 | −0.9 | −1.1 | −0.9 | −1.3 | −0.8 | −0.9 | −0.9 | 1.0 |
| 16 | gemini-2.5-flash_ai | 0.1 | −0.6 | −1.8 | −0.9 | −1.2 | −0.9 | −1.3 | −0.8 | −0.9 | −0.9 | 1.0 |
| GPT-5-High LLM Responses to Short Stories (Likert Scale 1–5) | |||||
|---|---|---|---|---|---|
| | The New Yorker | Claude | GPT-3.5-Turbo | GPT-4 | ALL |
| Question | AV | AV | AV | AV | AV |
| Story Overview and Text Complexity | |||||
| 3.-Does it propose other interpretations, in addition to the literal one? | 5 | 4.08 | 4.25 | 5 | 4.53 |
| Technical Assessment | |||||
| 5.-Is the story credible? | 4.73 | 4.42 | 4 | 4.17 | 4.26 |
| 6.-Does the text require your participation or cooperation to complete its form and meaning? | 4.09 | 2.92 | 3.42 | 4.83 | 3.62 |
| 7.-Does it propose a new vision of reality? | 4 | 3.08 | 3.25 | 3.92 | 3.36 |
| 8.-Does it propose a new vision of the genre it uses? | 2.91 | 2.25 | 2.58 | 3.42 | 2.6 |
| 9.-Does it propose a new vision of the language itself? | 4.18 | 2.42 | 2.75 | 4.17 | 3.17 |
| Editorial/Commercial Quality | |||||
| 10.-Does it remind you of another text or book you have read? | 4.18 | 3.75 | 3.83 | 4.08 | 3.96 |
| 11.-Would you like to read more texts like this? | 4.82 | 3.92 | 3.92 | 4.83 | 4.15 |
| 12.-Would you recommend it? | 4.91 | 3.83 | 3.92 | 4.42 | 4.13 |
| 13.-Would you give it as a present? | 3.91 | 3.50 | 3.25 | 3.67 | 3.45 |
| Total by short story source | 4.27 | 3.42 | 3.52 | 4.25 | 3.72 |
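
The "Total by short story source" row in the table above appears to be the unweighted mean of the ten per-question averages in each column. That reading is inferred from the numbers rather than stated in the text, so the short check below is a sketch under that assumption. The values are copied from the GPT-5-High table, and the helper name `column_total` is hypothetical.

```python
# Per-question averages (AV) for questions 3 and 5-13, copied from the
# GPT-5-High Likert table above, one list per story source.
averages = {
    "The New Yorker": [5, 4.73, 4.09, 4, 2.91, 4.18, 4.18, 4.82, 4.91, 3.91],
    "Claude": [4.08, 4.42, 2.92, 3.08, 2.25, 2.42, 3.75, 3.92, 3.83, 3.50],
    "GPT-3.5-Turbo": [4.25, 4, 3.42, 3.25, 2.58, 2.75, 3.83, 3.92, 3.92, 3.25],
    "GPT-4": [5, 4.17, 4.83, 3.92, 3.42, 4.17, 4.08, 4.83, 4.42, 3.67],
    "ALL": [4.53, 4.26, 3.62, 3.36, 2.6, 3.17, 3.96, 4.15, 4.13, 3.45],
}


def column_total(scores):
    """Unweighted mean of the per-question averages, rounded to two decimals.

    ASSUMPTION: the table's total row is computed this way.
    """
    return round(sum(scores) / len(scores), 2)


totals = {source: column_total(scores) for source, scores in averages.items()}
print(totals)
# {'The New Yorker': 4.27, 'Claude': 3.42, 'GPT-3.5-Turbo': 3.52, 'GPT-4': 4.25, 'ALL': 3.72}
```

The recomputed totals match the reported row, which supports the unweighted-mean reading for this table.
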
| Claude-Opus-4-1-20250805 LLM Responses to Short Stories (Likert Scale 1–5) | |||||
|---|---|---|---|---|---|
| | The New Yorker | Claude | GPT-3.5-Turbo | GPT-4 | ALL |
| Question | AV | AV | AV | AV | AV |
| Story Overview and Text Complexity | |||||
| 3.-Does it propose other interpretations, in addition to the literal one? | 4.91 | 3.5 | 2.83 | 4.25 | 3.87 |
| Technical Assessment | |||||
| 5.-Is the story credible? | 4.45 | 3.58 | 2.42 | 3.75 | 3.55 |
| 6.-Does the text require your participation or cooperation to complete its form and meaning? | 4.55 | 2.67 | 2.08 | 3.83 | 3.28 |
| 7.-Does it propose a new vision of reality? | 4 | 2.25 | 1.83 | 3.25 | 2.83 |
| 8.-Does it propose a new vision of the genre it uses? | 3.55 | 1.75 | 1.5 | 3.25 | 2.51 |
| 9.-Does it propose a new vision of the language itself? | 3.82 | 2.08 | 1.75 | 3.92 | 2.89 |
| Editorial/Commercial Quality | |||||
| 10.-Does it remind you of another text or book you have read? | 4 | 4.08 | 4.25 | 3.58 | 3.98 |
| 11.-Would you like to read more texts like this? | 4.27 | 2.75 | 2.08 | 3.92 | 3.26 |
| 12.-Would you recommend it? | 4.27 | 2.67 | 2.25 | 3.92 | 3.28 |
| 13.-Would you give it as a present? | 3.27 | 2 | 2 | 3.08 | 2.59 |
| Total by short story source | 4.11 | 2.73 | 2.3 | 3.68 | 3.2 |
| GPT-5-High LLM Responses to Short Stories 1/0 (Y/N) | ||||||||
|---|---|---|---|---|---|---|---|---|
| | The New Yorker | | Claude | | GPT-3.5-Turbo | | GPT-4 | |
| Question | Y/N | APR | Y/N | APR | Y/N | APR | Y/N | APR |
| Story Overview and Text Complexity | ||||||||
| 3.-Does it propose other interpretations, in addition to the literal one? | 1 | 100% | 1 | 91.7% | 1 | 100% | 1 | 100% |
| Technical Assessment | ||||||||
| 5.-Is the story credible? | 1 | 90.9% | 1 | 91.7% | 1 | 91.7% | 1 | 100% |
| 6.-Does the text require your participation or cooperation to complete its form and meaning? | 1 | 90.9% | 0 | 57.1% | 1 | 85.7% | % | |
| 7.-Does it propose a new vision of reality? | 1 | 90.9% | 1 | 66.7% | 1 | 100% | 1 | 100% |
| 8.-Does it propose a new vision of the genre it uses? | 0 | 61.11% | 0 | 100% | 0 | 100% | 1 | 100% |
| 9.-Does it propose a new vision of the language itself? | 1 | 100% | 0 | 100% | 0 | 100% | 1 | 100% |
| Editorial/Commercial Quality | ||||||||
| 10.-Does it remind you of another text or book you have read? | 1 | 100% | 1 | 100% | 1 | 100% | 1 | 100% |
| 11.-Would you like to read more texts like this? | 1 | 100% | 1 | 100% | 1 | 100% | 1 | 100% |
| 12.-Would you recommend it? | 1 | 100% | 1 | 100% | 1 | 100% | 1 | 100% |
| 13.-Would you give it as a present? | 1 | 90.9% | 1 | 100% | 1 | 71.4% | 1 | 100% |
| Total by short story LLM generator | 1 | 92.42% | 1 | 70.95% | 1 | 74.88% | 1 | 100% |
| Claude-Opus-4-1-20250805 LLM Responses to Short Stories 1/0 (Y/N) | ||||||||
|---|---|---|---|---|---|---|---|---|
| | The New Yorker | | Claude | | GPT-3.5-Turbo | | GPT-4 | |
| Question | Y/N | APR | Y/N | APR | Y/N | APR | Y/N | APR |
| Story Overview and Text Complexity | ||||||||
| 3.-Does it propose other interpretations, in addition to the literal one? | 1 | 100% | 1 | 87.5% | 0 | 66.7% | 1 | 100% |
| Technical Assessment | ||||||||
| 5.-Is the story credible? | 1 | 81.8% | 1 | 100% | 0 | 100% | 1 | 100% |
| 6.-Does the text require your participation or cooperation to complete its form and meaning? | 1 | 100% | 0 | 83.33% | 0 | 100% | 1 | 87.5% |
| 7.-Does it propose a new vision of reality? | 1 | 100% | 0 | 100% | 1 | 100% | 1 | 100% |
| 8.-Does it propose a new vision of the genre it uses? | 1 | 100% | 0 | 100% | 0 | 100% | 1 | 100% |
| 9.-Does it propose a new vision of the language itself? | 1 | 100% | 0 | 100% | 0 | 100% | 1 | 100% |
| Editorial/Commercial Quality | ||||||||
| 10.-Does it remind you of another text or book you have read? | 1 | 100% | 1 | 100% | 1 | 100% | 1 | 100% |
| 11.-Would you like to read more texts like this? | 1 | 100% | 0 | 80% | 0 | 100% | 1 | 100% |
| 12.-Would you recommend it? | 1 | 100% | 1 | 100% | 0 | 100% | 1 | 100% |
| 13.-Would you give it as a present? | 1 | 80% | 1 | 100% | 0 | 100% | 1 | 60% |
| Total by short story LLM generator | 1 | 98% | 0 | 68% | 0 | 86.7% | 1 | 90.8% |
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Manzanarez, G.A.; Monroy, R.; Flores, J.G.; Calvo, H. Literary Language Mashup: Curating Fictions with Large Language Models. Mathematics 2026, 14, 210. https://doi.org/10.3390/math14020210
