Large Language Models as Coders of Pragmatic Competence in Healthy Aging: Preliminary Results on Reliability, Limits, and Implications for Human-Centered AI
Abstract
1. Introduction
2. Materials and Methods
2.1. Clinical Protocol: The Assessment Battery for Communication (ABaCo)
2.2. Participants and Dataset
2.3. LLM Model and Prompt Engineering
2.3.1. Rationale for Using GPT-4o
2.3.2. Prompt Development
2.3.3. Using the Prompt
2.4. Statistical Analysis
2.4.1. Inter-Rater Reliability Between Human Coder and GPT-4o
2.4.2. Distribution of Discrepancies by Pragmatic Act
3. Results
3.1. Inter-Rater Agreement
3.2. Analysis of Discrepancies
4. Discussion
4.1. General Discussion
4.2. Suggestions for Design
5. Limitations
6. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Appendix A
| Pragmatic Act | Error | Obs. | Exp. | z | \|z\| | p (Two-Sided) | q_BH | sig_BH |
|---|---|---|---|---|---|---|---|---|
| Assertion | FN | 2.00 | 1.00 | 1 | 1 | p > 0.05 | 0.57 | q > 0.05 |
| Assertion | FP | 0.00 | 1.00 | −1 | 1 | p > 0.05 | 0.57 | q > 0.05 |
| Command | FN | 0.00 | 9.10 | −3 | 3 | p < 0.01 * | 0.02 | q < 0.05 * |
| Command | FP | 18.00 | 8.90 | 3 | 3 | p < 0.01 * | 0.02 | q < 0.05 * |
| Conv.—Topic | FN | 0.00 | 1.00 | −1 | 1 | p > 0.05 | 0.57 | q > 0.05 |
| Conv.—Topic | FP | 2.00 | 1.00 | 1 | 1 | p > 0.05 | 0.57 | q > 0.05 |
| Conv.—Turn-taking | FN | 0.00 | 1.00 | −1 | 1 | p > 0.05 | 0.57 | q > 0.05 |
| Conv.—Turn-taking | FP | 2.00 | 1.00 | 1 | 1 | p > 0.05 | 0.57 | q > 0.05 |
| Deceit | FN | 58.00 | 37.80 | 3.3 | 3.3 | p < 0.001 * | 0.02 | q < 0.05 * |
| Deceit | FP | 17.00 | 37.20 | −3.3 | 3.3 | p < 0.001 * | 0.02 | q < 0.05 * |
| Emotion | FN | 1.00 | 0.50 | 0.7 | 0.7 | p > 0.05 | 0.63 | q > 0.05 |
| Emotion | FP | 0.00 | 0.50 | −0.7 | 0.7 | p > 0.05 | 0.65 | q > 0.05 |
| Incongruity | FN | 4.00 | 3.00 | 0.6 | 0.6 | p > 0.05 | 0.63 | q > 0.05 |
| Incongruity | FP | 2.00 | 3.00 | −0.6 | 0.6 | p > 0.05 | 0.63 | q > 0.05 |
| Irony | FN | 32.00 | 30.30 | 0.3 | 0.3 | p > 0.05 | 0.78 | q > 0.05 |
| Irony | FP | 28.00 | 29.70 | −0.3 | 0.3 | p > 0.05 | 0.78 | q > 0.05 |
| Norm | FN | 8.00 | 11.60 | −1.1 | 1.1 | p > 0.05 | 0.57 | q > 0.05 |
| Norm | FP | 15.00 | 11.40 | 1.1 | 1.1 | p > 0.05 | 0.74 | q > 0.05 |
| Question | FN | 1.00 | 2.50 | −1 | 1 | p > 0.05 | 0.57 | q > 0.05 |
| Question | FP | 4.00 | 2.50 | 1 | 1 | p > 0.05 | 0.57 | q > 0.05 |
| Request | FN | 1.00 | 4.00 | −1.5 | 1.5 | p > 0.05 | 0.46 | q > 0.05 |
| Request | FP | 7.00 | 4.00 | 1.5 | 1.5 | p > 0.05 | 0.46 | q > 0.05 |
| Social Norm | FN | 0.00 | 6.10 | −2.5 | 2.5 | p = 0.01 * | 0.06 | q > 0.05 |
| Social Norm | FP | 12.00 | 5.90 | 2.5 | 2.5 | p = 0.01 * | 0.06 | q > 0.05 |
| Standard Act | FN | 4.00 | 3.00 | 0.6 | 0.6 | p > 0.05 | 0.63 | q > 0.05 |
| Standard Act | FP | 2.00 | 3.00 | −0.6 | 0.6 | p > 0.05 | 0.63 | q > 0.05 |
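
The z values in this table are consistent with Pearson residuals, z = (Obs. − Exp.)/√Exp., computed on the 13 × 2 table of false negatives and false positives by pragmatic act. For Deceit FN, for example, (58 − 37.84)/√37.84 ≈ 3.3. The following is a minimal sketch of that calculation. It assumes normal-approximation two-sided p-values and a standard Benjamini-Hochberg step-up adjustment over all 26 cells; these choices are inferred from the tabulated values rather than stated here, and the function names are hypothetical. The q values it produces may therefore differ slightly from the tabulated ones.

```python
import math

# (FN, FP) discrepancy counts per pragmatic act, copied from the
# table of discrepancies by pragmatic act reported in this article.
counts = {
    "Assertion": (2, 0), "Command": (0, 18), "Conv.-Topic": (0, 2),
    "Conv.-Turn-taking": (0, 2), "Question": (1, 4), "Emotion": (1, 0),
    "Incongruity": (4, 2), "Deceit": (58, 17), "Irony": (32, 28),
    "Norm": (8, 15), "Social Norm": (0, 12), "Request": (1, 7),
    "Standard Act": (4, 2),
}

def pearson_residuals(table):
    """Return {(act, error_type): (observed - expected) / sqrt(expected)}."""
    col_totals = [sum(row[j] for row in table.values()) for j in range(2)]
    n = sum(col_totals)
    out = {}
    for act, row in table.items():
        row_total = sum(row)
        for j, label in enumerate(("FN", "FP")):
            expected = row_total * col_totals[j] / n
            out[(act, label)] = (row[j] - expected) / math.sqrt(expected)
    return out

def two_sided_p(z):
    """Two-sided p-value under a standard normal approximation (assumption)."""
    return math.erfc(abs(z) / math.sqrt(2))

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (q), in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        q[i] = running_min
    return q

residuals = pearson_residuals(counts)
keys = list(residuals)
q_values = benjamini_hochberg([two_sided_p(residuals[k]) for k in keys])
for key, q in zip(keys, q_values):
    print(key, round(residuals[key], 1), round(q, 2))
```

Under these assumptions, only the Command and Deceit cells fall below q = 0.05, which agrees with the sig_BH column.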
| Failure Scenario | Act | Error Type | ABaCo Stimulus | Response | Human Score | Model Score | Model Behavior |
|---|---|---|---|---|---|---|---|
| Over-generalization of directive force | Command | False Positive (FP) | “Please issue some commands. Order me to stop asking you questions.” | “Stop asking questions, please.” | 0 | 1 | The model over-ascribes command force to a polite request (a mitigated directive) and scores the response as a correct Command, although the human coder judges it inadequate for the target act. |
| Literal–intent mismatch | Deceit | False Negative (FN) | “We’ll watch some clips. Pay attention, I’ll ask questions later. Luca hides his skateboard. His mother asks, ‘What was that noise?’ The boy does not want to be found out. What could he say? (Borderline cases: what would that mean?)” | “The best is to say nothing. Or, ‘I heard it too; I don’t know.’” | 1 | 0 | The model misses deceit: silence/evasive reply constitutes deceptive intent (successful target act), but the model labels it as inadequate. |
| Scale (n) | Comprehension (n) | Production (n) |
|---|---|---|
| Linguistic (44) | Basic speech acts → Questions = 4; Standard communicative acts = 4; Non-standard communicative acts → Deceit = 4, Irony = 4. | Basic speech acts → Assertions = 4; Questions = 4; Commands = 4; Requests = 4; Standard communicative acts = 4; Non-standard communicative acts → Deceit = 4, Irony = 4. |
| Extralinguistic (30) | Basic speech acts → Assertions = 4; Questions = 4; Commands = 4; Requests = 4; Standard communicative acts = 4; Non-standard communicative acts → Deceit = 4, Irony = 4. | Standard communicative acts = 1; Non-standard communicative acts → Deceit = 1. |
| Paralinguistic (22) | Basic communicative acts → Assertion = 2, Question = 2, Request = 2, Command = 2; Emotions = 8; Paralinguistic contradiction = 4. | Basic communicative acts → Request = 1, Command = 1. |
| Context (20) | Discourse norms = 8; Social norms = 8. | Social norms = 4. |
| Conversational (8) | Topic maintenance = 4; Turn-taking = 4. | |
| Human Score | GPT-4o Score = 0 | GPT-4o Score = 1 | Total |
|---|---|---|---|
| 0 | 136 | 109 | 245 |
| 1 | 111 | 1669 | 1780 |
| Total | 247 | 1778 | 2025 |
| Measure | Value | Asymptotic Std. Error a | Approximate T b | Approx. Significance |
|---|---|---|---|---|
| Measure of Agreement: Kappa | 0.491 | 0.030 | 22.096 | <0.001 |
| Valid cases (N) | 2025 | | | |
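
The kappa value can be recovered directly from the 2 × 2 cross-tabulation of human and GPT-4o scores. Raw agreement is (136 + 1669)/2025 ≈ 0.891, and kappa corrects this for the agreement expected by chance from the marginal totals. The sketch below shows the arithmetic; the function name `cohen_kappa` is hypothetical and the code is illustrative, not the software used for the reported analysis.

```python
def cohen_kappa(matrix):
    """Cohen's kappa for a square table.

    matrix[i][j] is the number of items scored i by the first rater
    (here the human coder) and j by the second rater (here GPT-4o).
    """
    k = len(matrix)
    n = sum(sum(row) for row in matrix)
    observed = sum(matrix[i][i] for i in range(k)) / n
    row_totals = [sum(matrix[i]) for i in range(k)]
    col_totals = [sum(matrix[i][j] for i in range(k)) for j in range(k)]
    # Chance agreement: product of the marginal proportions per category.
    expected = sum(row_totals[i] * col_totals[i] for i in range(k)) / n ** 2
    return (observed - expected) / (1 - expected)

# Rows: human score 0, 1; columns: GPT-4o score 0, 1.
confusion = [[136, 109], [111, 1669]]
kappa = cohen_kappa(confusion)
print(round(kappa, 3))  # 0.491
```

Because 1780 of the 2025 items received a human score of 1, chance agreement is already high (about 0.787), which is why a raw agreement near 0.89 corresponds to a kappa of only 0.491.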
| Pragmatic Act | False Negative | False Positive | Total |
|---|---|---|---|
| Assertion | 2 | - | 2 |
| Command | - | 18 | 18 |
| Conversation—Topic | - | 2 | 2 |
| Conversation—Turn-Taking | - | 2 | 2 |
| Question | 1 | 4 | 5 |
| Emotion | 1 | - | 1 |
| Incongruity | 4 | 2 | 6 |
| Deceit | 58 | 17 | 75 |
| Irony | 32 | 28 | 60 |
| Norm | 8 | 15 | 23 |
| Social Norm | - | 12 | 12 |
| Request | 1 | 7 | 8 |
| Standard communicative acts | 4 | 2 | 6 |
| Total | 111 | 109 | 220 |
| Test | Value | df | Asymptotic Significance (2-Sided) |
|---|---|---|---|
| Pearson Chi-square | 69.431 a | 12 | <0.001 |
| Likelihood Ratio | 85.744 | 12 | <0.001 |
| N of valid cases | 220 | | |
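
Both statistics follow from the 13 × 2 table of discrepancies by pragmatic act. The Pearson statistic sums (O − E)²/E over all cells, the likelihood ratio sums 2·O·ln(O/E) over non-empty cells, and the degrees of freedom are (13 − 1)(2 − 1) = 12. The sketch below is illustrative only (the function name `independence_tests` is hypothetical and this is not the software used for the reported analysis); applied to the tabulated counts, it should reproduce the reported values up to rounding.

```python
import math

# (False Negative, False Positive) counts per pragmatic act, in the
# row order of the table of discrepancies; "-" entries are taken as 0.
discrepancies = [
    (2, 0), (0, 18), (0, 2), (0, 2), (1, 4), (1, 0), (4, 2),
    (58, 17), (32, 28), (8, 15), (0, 12), (1, 7), (4, 2),
]

def independence_tests(rows):
    """Return (Pearson chi-square, likelihood-ratio G, degrees of freedom)."""
    n_cols = len(rows[0])
    col_totals = [sum(r[j] for r in rows) for j in range(n_cols)]
    n = sum(col_totals)
    chi2 = 0.0
    g = 0.0
    for r in rows:
        row_total = sum(r)
        for j, observed in enumerate(r):
            expected = row_total * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
            if observed > 0:  # an empty cell contributes 0 to G
                g += 2 * observed * math.log(observed / expected)
    df = (len(rows) - 1) * (n_cols - 1)
    return chi2, g, df

chi2, g, df = independence_tests(discrepancies)
print(f"Pearson chi-square = {chi2:.2f}, likelihood ratio = {g:.2f}, df = {df}")
```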
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Boldi, A.; Gabbatore, I.; Bosco, F.M. Large Language Models as Coders of Pragmatic Competence in Healthy Aging: Preliminary Results on Reliability, Limits, and Implications for Human-Centered AI. Electronics 2025, 14, 4411. https://doi.org/10.3390/electronics14224411

