Evaluating Large Language Models’ Ability Using a Psychiatric Screening Tool Based on Metaphor and Sarcasm Scenarios
Abstract
1. Introduction
1.1. Previous Research
1.2. Research Questions
- RQ1: How do current large language models (LLMs) compare in their ability to comprehend metaphors versus sarcasm?
- RQ2: How does the ability to comprehend metaphors and sarcasm vary among LLMs with different parameter counts?
1.3. Significance of the Topic
2. Methods
2.1. Material
2.2. Procedure
3. Results
4. Discussion
5. Conclusions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
1. While Asperger syndrome has been subsumed under the broader category of Social Pragmatic Communication Disorder in the DSM-5 (American Psychiatric Association 2013), this paper uses the term in its original context as it pertains to the MSST. This is to maintain coherence with the terminology used in the MSST, given that our focus is on analyzing the behavior of LLMs rather than diagnosing human conditions.
2. While we recognize that this term is outdated and its use is now discouraged, for reference, we note that Adachi et al. (2004) used the label “children without mental retardation” in their results.
3. The content of all questions is available in the source code published at doi:10.5281/zenodo.10981763.
References
- Adachi, Taeko, Shinichi Hirabayashi, Madoka Shiota, Shuhei Suzuki, Eiji Wakamiya, Shinji Kitayama, Masaki Kono, Yukinori Maeoka, and Tatsuya Koeda. 2006. Study of situational recognition of attention deficit/hyperactivity disorders, Asperger’s disorder and high functioning autism with the metaphor and sarcasm scenario test (MSST). Brain and Development 38: 177–81. (In Japanese). [Google Scholar] [PubMed]
- Adachi, Taeko, Tatsuya Koeda, Shinichi Hirabayashi, Yukinori Maeoka, Madoka Shiota, Edward Charles Wright, and Ayako Wada. 2004. The metaphor and sarcasm scenario test: A new instrument to help differentiate high functioning pervasive developmental disorder from attention deficit/hyperactivity disorder. Brain and Development 26: 301–6. [Google Scholar] [CrossRef] [PubMed]
- Aghazadeh, Ehsan, Mohsen Fayyaz, and Yadollah Yaghoobzadeh. 2022. Metaphors in pre-trained language models: Probing and generalization across datasets and languages. Paper presented at 60th Annual Meeting of the Association for Computational Linguistics, Stroudsburg, PA, USA, May 22–27; pp. 2037–50. [Google Scholar] [CrossRef]
- American Psychiatric Association. 2013. Diagnostic and Statistical Manual of Mental Disorders, 5th ed. Washington: American Psychiatric Association. [Google Scholar] [CrossRef]
- Atarere, Joseph, Haider Naqvi, Christopher Haas, Comfort Adewunmi, Sumanth Bandaru, Rakesh Allamneni, Onyinye Ugonabo, Olachi Egbo, Mfoniso Umoren, and Priyanka Kanth. 2024. Applicability of online chat-based artificial intelligence models to colorectal cancer screening. Digestive Diseases and Sciences 69: 1–7. [Google Scholar] [CrossRef] [PubMed]
- Baron-Cohen, Simon, Howard A. Ring, Edward T. Bullmore, Sally Wheelwright, Chris Ashwin, and Steve C. R. Williams. 2000. The amygdala theory of autism. Neuroscience & Biobehavioral Reviews 24: 355–64. [Google Scholar] [CrossRef]
- Baron-Cohen, Simon, Howard A. Ring, Sally Wheelwright, Edward T. Bullmore, Mick J. Brammer, Andrew Simmons, and Steve C. R. Williams. 1999. Social intelligence in the normal and autistic brain: An fMRI study. European Journal of Neuroscience 11: 1891–98. [Google Scholar] [CrossRef] [PubMed]
- Bavelas, Janet, Jennifer Gerwing, Chantelle Sutton, and Danielle Prevost. 2008. Gesturing on the telephone: Independent effects of dialogue and visibility. Journal of Memory and Language 58: 495–520. [Google Scholar] [CrossRef]
- Bubeck, Sébastien, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott M. Lundberg, and et al. 2023. Sparks of artificial general intelligence: Early experiments with GPT-4. arXiv arXiv:2303.12712. [Google Scholar]
- Caucheteux, Charlotte, Alexandre Gramfort, and Jean-Rémi King. 2023. Evidence of a predictive coding hierarchy in the human brain listening to speech. Nature Human Behaviour 7: 430–41. [Google Scholar] [CrossRef]
- Chahboun, Sobh, Øyvind Kvello, and Alexander Gamst Page. 2021. Extending the field of extended language: A literature review on figurative language processing in neurodevelopmental disorders. Frontiers in Communication 6: 1–14. [Google Scholar] [CrossRef]
- Chakrabarty, Tuhin, Arkadiy Saakyan, Debanjan Ghosh, and Smaranda Muresan. 2022. FLUTE: Figurative language understanding through textual explanations. Paper presented at 2022 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, December 7–11; pp. 7139–59. [Google Scholar] [CrossRef]
- Clark, Herbert H., and Richard J. Gerrig. 1984. On the pretense theory of irony. Journal of Experimental Psychology: General 113: 121–26. [Google Scholar] [CrossRef]
- Conover, Mike, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. 2023. Free Dolly: Introducing the World’s First Truly Open Instruction-Tuned LLM. Available online: https://www.databricks.com/blog/2023/04/12/dolly-first-open-commercially-viable-instruction-tuned-llm (accessed on 1 July 2024).
- Di Paola, Simona, Filippo Domaneschi, and Nausicaa Pouscoulous. 2020. Metaphorical developing minds: The role of multiple factors in the development of metaphor comprehension. Journal of Pragmatics 156: 235–51. [Google Scholar] [CrossRef]
- DiStefano, Paul V., John D. Patterson, and Roger E. Beaty. 2024. Automatic scoring of metaphor creativity with large language models. Creativity Research Journal, 1–15, forthcoming. [Google Scholar] [CrossRef]
- Eloundou, Tyna, Sam Manning, Pamela Mishkin, and Daniel Rock. 2023. GPTs are GPTs: An early look at the labor market impact potential of large language models. arXiv arXiv:2303.10130. [Google Scholar]
- Evanson, Linnea, Yair Lakretz, and Jean-Rémi King. 2023. Language acquisition: Do children and language models follow similar learning stages? Paper presented at Findings of the Association for Computational Linguistics: ACL 2023, Stroudsburg, PA, USA, July 9–14; pp. 12205–18. [Google Scholar] [CrossRef]
- Fanari, Rachele, Sergio Melogno, and Roberta Fadda. 2023. An experimental study on sarcasm comprehension in school children: The possible role of contextual, linguistics and meta-representative factors. Brain Sciences 13: 863. [Google Scholar] [CrossRef]
- Filippova, Eva, and Janet Wilde Astington. 2008. Further development in social reasoning revealed in discourse irony understanding. Child Development 79: 126–38. [Google Scholar] [CrossRef] [PubMed]
- Frith, Uta. 1989. Autism: Explaining the Enigma. Oxford: Blackwell Publishing. [Google Scholar]
- Frith, Uta. 2008. Autism: A Very Short Introduction. Oxford: Oxford University Press. [Google Scholar]
- Fuchs, Julia. 2023. Ironic, isn’t it!? a review on irony comprehension in children and adolescents with ASD. Research in Autism Spectrum Disorders 108: 102248. [Google Scholar] [CrossRef]
- Garmendia, Joana. 2018. Irony. Cambridge: Cambridge University Press. [Google Scholar] [CrossRef]
- Gernsbacher, Morton Ann, and Sarah R. Pripas-Kapit. 2012. Who’s missing the point? a commentary on claims that autistic persons have a specific deficit in figurative language comprehension. Metaphor and Symbol 27: 93–105. [Google Scholar] [CrossRef]
- Gibbs, Raymond W. 1994. The Poetics of Mind: Figurative Thought, Language, and Understanding. Cambridge: Cambridge University Press. [Google Scholar]
- Gibbs, Raymond W. 2011. Evaluating conceptual metaphor theory. Discourse Processes 48: 529–62. [Google Scholar] [CrossRef]
- Gilovich, Thomas, Dale Griffin, and Daniel Kahneman. 2002. Heuristics and Biases. Cambridge: Cambridge University Press. [Google Scholar] [CrossRef]
- Golchin, Shahriar, and Mihai Surdeanu. 2024. Time travel in LLMs: Tracing data contamination in large language models. Paper presented at 12th International Conference on Learning Representations, Portland, OR, USA, May 7–12; pp. 1–22. [Google Scholar]
- Gold, Rinat, and Miriam Faust. 2010. Right hemisphere dysfunction and metaphor comprehension in young adults with Asperger syndrome. Journal of Autism and Developmental Disorders 40: 800–11. [Google Scholar] [CrossRef]
- Guo, Biyang, Xin Zhang, Ziyuan Wang, Minqi Jiang, Jinran Nie, Yuxuan Ding, Jianwei Yue, and Yupeng Wu. 2023. How close is ChatGPT to human experts? comparison corpus, evaluation, and detection. arXiv arXiv:2301.07597. [Google Scholar]
- Hagendorff, Thilo. 2023. Machine psychology: Investigating emergent capabilities and behavior in large language models using psychological methods. arXiv arXiv:2303.13988. [Google Scholar]
- Hagendorff, Thilo, Sarah Fabi, and Michal Kosinski. 2022. Thinking fast and slow in large language models. arXiv arXiv:2212.05206. [Google Scholar]
- Happé, Francesca G. E. 1993. Communicative competence and theory of mind in autism: A test of relevance theory. Cognition 48: 101–19. [Google Scholar] [CrossRef]
- Happé, Francesca G. E. 1995. Understanding minds and metaphors: Insights from the study of figurative language in autism. Metaphor and Symbolic Activity 10: 275–95. [Google Scholar] [CrossRef]
- Heyes, Cecilia M., and Chris D. Frith. 2014. The cultural evolution of mind reading. Science 344: 1243091. [Google Scholar] [CrossRef]
- Hu, Edward J., Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. 2022. LoRA: Low-rank adaptation of large language models. Paper presented at 10th International Conference on Learning Representations, Portland, OR, USA, April 24–29; pp. 1–13. [Google Scholar]
- Ichien, Nicholas, Dušan Stamenković, and Keith J. Holyoak. 2023. Large language model displays emergent ability to interpret novel literary metaphors. arXiv arXiv:2308.01497. [Google Scholar]
- Jolliffe, Therese, and Simon Baron-Cohen. 2000. Linguistic processing in high-functioning adults with autism or Asperger’s syndrome: Is global coherence impaired? Psychological Medicine 30: 1169–87. [Google Scholar] [CrossRef]
- Kalandadze, Tamar, Courtenay Norbury, Terje Nærland, and Kari-Anne B Næss. 2016. Figurative language comprehension in individuals with autism spectrum disorder: A meta-analytic review. Autism 22: 99–117. [Google Scholar] [CrossRef]
- Kenny, A., trans. 2013. Poetics. Oxford: Oxford University Press. [Google Scholar]
- Kim, Jae Hyuk, Sun Kyung Kim, Jongmyung Choi, and Youngho Lee. 2024. Reliability of ChatGPT for performing triage task in the emergency department using the Korean Triage and Acuity Scale. Digital Health 10: 1–9. [Google Scholar] [CrossRef]
- Kosinski, Michal. 2023. Theory of mind might have spontaneously emerged in large language models. arXiv arXiv:2302.02083. [Google Scholar]
- Kung, Tiffany H., Morgan Cheatham, Arielle Medenilla, Czarina Sillos, Lorie De Leon, Camille Elepaño, Maria Madriaga, Rimel Aggabao, Giezel Diaz-Candido, James Maningo, and et al. 2023. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLoS Digital Health 2: e0000198. [Google Scholar] [CrossRef] [PubMed]
- Lal, Yash Kumar, and Mohaddeseh Bastan. 2022. SBU figures it out: Models explain figurative language. Paper presented at 3rd Workshop on Figurative Language Processing (FLP), Stroudsburg, PA, USA, December 8; pp. 143–49. [Google Scholar] [CrossRef]
- Le Sourn-Bissaoui, Sandrine, Stéphanie Caillies, Fabien Gierski, and Jacques Motte. 2011. Ambiguity detection in adolescents with Asperger syndrome: Is central coherence or theory of mind impaired? Research in Autism Spectrum Disorders 5: 648–56. [Google Scholar] [CrossRef]
- Li, Changmao, and Jeffrey Flanigan. 2024. Task contamination: Language models may not be few-shot anymore. Paper presented at 38th AAAI Conference on Artificial Intelligence, Washington, DC, USA, February 20–27; pp. 18471–80. [Google Scholar] [CrossRef]
- Liu, Emmy, Chenxuan Cui, Kenneth Zheng, and Graham Neubig. 2022. Testing the ability of language models to interpret figurative language. Paper presented at 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Stroudsburg, PA, USA, July 10–15; pp. 4437–52. [Google Scholar] [CrossRef]
- Loconte, Riccardo, Graziella Orrù, Mirco Tribastone, Pietro Pietrini, and Giuseppe Sartori. 2023. Challenging ChatGPT ‘intelligence’ with human tools: A neuropsychological investigation on prefrontal functioning of a large language model. SSRN 4377371: 1–33. [Google Scholar] [CrossRef]
- Loukusa, Soile, and Irma Moilanen. 2009. Pragmatic inference abilities in individuals with Asperger syndrome or high-functioning autism: A review. Research in Autism Spectrum Disorders 3: 890–904. [Google Scholar] [CrossRef]
- Marchetti, Antonella, Cinzia Di Dio, Angelo Cangelosi, Federico Manzi, and Davide Massaro. 2023. Developing ChatGPT’s theory of mind. Frontiers in Robotics and AI 10: 1189525. [Google Scholar] [CrossRef] [PubMed]
- Mazzarella, Diana, and Nausicaa Pouscoulous. 2021. Pragmatics and epistemic vigilance: A developmental perspective. Mind & Language 36: 355–76. [Google Scholar] [CrossRef]
- Mazzarella, Diana, and Nausicaa Pouscoulous. 2023. Ironic speakers, vigilant hearers. Intercultural Pragmatics 20: 111–32. [Google Scholar] [CrossRef]
- Mention, Boris, Frederic Pourre, and Julie Andanson. 2024. Humor in autism spectrum disorders: A systematic review. L’Encéphale 50: 200–10. [Google Scholar] [CrossRef] [PubMed]
- Muraoka, Masayasu, Bishwaranjan Bhattacharjee, Michele Merler, Graeme Blackwood, Yulong Li, and Yang Zhao. 2023. Cross-lingual transfer of large language model by visually-derived supervision toward low-resource languages. Paper presented at 31st ACM International Conference on Multimedia, New York, NY, USA, October 29–November 3; pp. 3637–46. [Google Scholar] [CrossRef]
- Nematzadeh, Aida, Kaylee Burns, Erin Grant, Alison Gopnik, and Thomas L. Griffiths. 2018. Evaluating theory of mind in question answering. Paper presented at 2018 Conference on Empirical Methods in Natural Language Processing, Stroudsburg, PA, USA, October 31–November 4; pp. 2392–400. [Google Scholar] [CrossRef]
- OpenAI. 2022. Introducing ChatGPT. Available online: https://openai.com/blog/chatgpt (accessed on 1 July 2024).
- OpenAI. 2023. GPT-4 technical report. arXiv arXiv:2303.08774. [Google Scholar]
- Ouyang, Long, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, and et al. 2022. Training language models to follow instructions with human feedback. Paper presented at 36th Annual Conference on Neural Information Processing Systems, Red Hook, NY, USA, November 28–December 9; pp. 27730–44. [Google Scholar]
- Pexman, Penny M. 2023. Irony and Thought: Developmental Insights. Cambridge: Cambridge University Press, Chapter 11. pp. 181–96. [Google Scholar] [CrossRef]
- Potamias, Rolandos Alexandros, Georgios Siolas, and Andreas-Georgios Stafylopatis. 2020. A transformer-based approach to irony and sarcasm detection. Neural Computing and Applications 32: 17309–20. [Google Scholar] [CrossRef]
- Pouscoulous, Nausicaa, and Michael Tomasello. 2020. Early birds: Metaphor understanding in 3-year-olds. Journal of Pragmatics 156: 160–67. [Google Scholar] [CrossRef]
- Prystawski, Ben, Paul H. Thibodeau, and Noah D. Goodman. 2022. Psychologically-informed chain-of-thought prompts for metaphor understanding in large language models. arXiv arXiv:2209.08141. [Google Scholar]
- Rao, Patricia A., Deborah C. Beidel, and Michael J. Murray. 2007. Social skills interventions for children with Asperger’s syndrome or high-functioning autism: A review and recommendations. Journal of Autism and Developmental Disorders 38: 353–61. [Google Scholar] [CrossRef] [PubMed]
- Safdari, Mustafa, Greg Serapio-García, Clément Crepy, Stephen Fitz, Peter Romero, Luning Sun, Marwa Abdulhai, Aleksandra Faust, and Maja Mataric. 2023. Personality traits in large language models. arXiv arXiv:2307.00184. [Google Scholar]
- Schreiner, Maximilian. 2023. GPT-4 Architecture, Datasets, Costs and More Leaked. Available online: https://the-decoder.com/gpt-4-architecture-datasets-costs-and-more-leaked/ (accessed on 1 July 2024).
- Semino, Elena, and Zsófia Demjén. 2016. The Routledge Handbook of Metaphor and Language. London: Routledge. [Google Scholar] [CrossRef]
- Shi, Freda, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. Paper presented at 40th International Conference on Machine Learning, Cambridge, MA, USA, July 23–29; pp. 31210–27. [Google Scholar]
- Sperber, Dan, and Deirdre Wilson. 1981. Irony and the Use-Mention Distinction. New York: Academic Press, Chapter 12. pp. 295–318. [Google Scholar]
- Sravanthi, Settaluri Lakshmi, Meet Doshi, Tankala Pavan Kalyan, V. Rudra Murthy, Pushpak Bhattacharyya, and Raj Dabre. 2024. PUB: A pragmatics understanding benchmark for assessing LLMs’ pragmatics capabilities. arXiv arXiv:2401.07078. [Google Scholar]
- Su, Chuandong, Fumiyo Fukumoto, Xiaoxi Huang, Jiyi Li, Rongbo Wang, and Zhiqun Chen. 2020. Deepmet: A reading comprehension paradigm for token-level metaphor detection. Paper presented at 2nd Workshop on Figurative Language Processing, Stroudsburg, PA, USA, July 9; pp. 30–39. [Google Scholar] [CrossRef]
- Touvron, Hugo, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and et al. 2023. Llama 2: Open foundation and fine-tuned chat models. arXiv arXiv:2307.09288. [Google Scholar]
- Trott, Sean, Cameron Jones, Tyler A. Chang, James A. Michaelov, and Benjamin K. Bergen. 2023. Do large language models know what humans know? Cognitive Science 47: e13309. [Google Scholar] [CrossRef]
- Ullman, Tomer D. 2023. Large language models fail on trivial alterations to theory-of-mind tasks. arXiv arXiv:2302.08399. [Google Scholar]
- Vulchanova, Mila, Sobh Chahboun, Beatriz Galindo-Prieto, and Valentin Vulchanov. 2019. Gaze and motor traces of language processing: Evidence from autism spectrum disorders in comparison to typical controls. Cognitive Neuropsychology 36: 383–409. [Google Scholar] [CrossRef] [PubMed]
- Wang, Shuo, and Xin Li. 2023. A revisit of the amygdala theory of autism: Twenty years after. Neuropsychologia 183: 108519. [Google Scholar] [CrossRef] [PubMed]
- Wei, Jason, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, and et al. 2022. Emergent abilities of large language models. Transactions on Machine Learning Research 209: 1–30. [Google Scholar]
- Willinger, Ulrike, Matthias Deckert, Michaela Schmöger, Ines Schaunig-Busch, Anton K. Formann, and Eduard Auff. 2017. Developmental steps in metaphorical language abilities: The influence of age, gender, cognitive flexibility, information processing speed, and analogical reasoning. Language and Speech 62: 207–28. [Google Scholar] [CrossRef]
- Wilson, Deirdre, and Dan Sperber. 2012. Explaining Irony. Cambridge: Cambridge University Press, Chapter 6. pp. 123–46. [Google Scholar]
- Yaghoobian, Hamed, Hamid R. Arabnia, and Khaled Rasheed. 2021. Sarcasm detection: A comparative study. arXiv arXiv:2107.02276. [Google Scholar]
| Model | Metaphor Score | Sarcasm Score |
| --- | --- | --- |
| Dolly v2 12B | 1 | 1 |
| Llama 2 7B | 1 | 1 |
| Llama 2 13B | 2 | 0 |
| Llama 2 70B | 3 | 1 |
| GPT-3.5 | 4 | 1 |
| GPT-4 | 4 | 1 |
| Children without intellectual disability (Adachi et al. 2004) | 4.1 | 3.3 |
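The scoring behind the table can be illustrated with a minimal sketch: assuming, per the MSST of Adachi et al. (2004), that Q1–Q10 comprise five metaphor items and five sarcasm items each marked correct or incorrect, the two scores are simply the per-category counts of correct responses. The item split and all names below are illustrative assumptions, not the published evaluation code (available at doi:10.5281/zenodo.10981763).

```python
# Illustrative sketch of aggregating per-item correctness into the metaphor
# and sarcasm scores reported above. The assignment of Q1-Q10 to the two
# categories is a hypothetical example, not the actual MSST item order.

METAPHOR_ITEMS = {"Q1", "Q3", "Q5", "Q7", "Q9"}   # hypothetical split
SARCASM_ITEMS = {"Q2", "Q4", "Q6", "Q8", "Q10"}   # hypothetical split


def score_model(correct_items: set[str]) -> tuple[int, int]:
    """Return (metaphor_score, sarcasm_score) from the set of correctly answered items."""
    metaphor = len(correct_items & METAPHOR_ITEMS)
    sarcasm = len(correct_items & SARCASM_ITEMS)
    return metaphor, sarcasm


# Example: a model answering four metaphor items and one sarcasm item correctly
print(score_model({"Q1", "Q3", "Q5", "Q7", "Q2"}))  # -> (4, 1)
```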
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).