Article

Large Language Models as Coders of Pragmatic Competence in Healthy Aging: Preliminary Results on Reliability, Limits, and Implications for Human-Centered AI

by Arianna Boldi 1,2,*, Ilaria Gabbatore 2,3 and Francesca M. Bosco 1,2,4,*
1 Department of Psychology, University of Turin, 10124 Turin, Italy
2 GIPSI Research Group, University of Turin, 10124 Turin, Italy
3 Department of Humanities, University of Turin, 10124 Turin, Italy
4 Neuroscience Institute of Turin—NIT, University of Turin, 10124 Turin, Italy
* Authors to whom correspondence should be addressed.
Electronics 2025, 14(22), 4411; https://doi.org/10.3390/electronics14224411
Submission received: 10 October 2025 / Revised: 5 November 2025 / Accepted: 9 November 2025 / Published: 12 November 2025

Abstract

Pragmatics concerns how people use language and other expressive means, such as nonverbal and paralinguistic cues, to convey intended meaning in context. Difficulties in pragmatics are common across distinct clinical conditions, motivating validated assessments such as the Assessment Battery for Communication (ABaCo); whether Large Language Models (LLMs) can serve as reliable coders remains uncertain. In this exploratory study, we used Generative Pre-trained Transformer (GPT)-4o as a rater on 2025 item × dimension units drawn from the responses given by 10 healthy older adults (mean age = 69.8 years) to selected ABaCo items. Expert human coders served as the reference standard against which GPT-4o scores were compared. Agreement metrics included exact agreement, Cohen’s κ, and a discrepancy audit by pragmatic act. Agreement was 89.1% with κ = 0.491. Errors were non-random across acts (χ2(12) = 69.4, p < 0.001). After Benjamini–Hochberg False Discovery Rate correction across 26 cells, only two categories remained significant: false positives concentrated in Command and false negatives in Deceit. Missing prosodic and gestural cues likely exacerbate command-specific failures. In conclusion, in text-only settings, GPT-4o can serve as a supervised second coder for healthy-aging assessments of pragmatic competence, under human oversight. Safe clinical deployment requires population-specific validation and multimodal inputs that recover nonverbal cues.

1. Introduction

Human language understanding cannot be reduced to literal meaning: it requires inferring intent and meaning grounded in context [1]. This requirement defines pragmatics, a core human communicative ability that enables both the comprehension and production of communicative acts within social contexts, encompassing linguistic and extralinguistic expression [2,3,4,5].
Operationally, pragmatics includes a wide range of behaviors related to the context-sensitive use of language and other expressive means to convey meaning [6,7], spanning both literal and non-literal communication, including linguistic, paralinguistic, and extralinguistic modes of interaction [5]. Specifically, pragmatic processing entails complex inferential reasoning that goes beyond literal decoding and syntactic analysis, drawing on cognitive capacities such as recognition of communicative intentions, attribution of beliefs to interlocutors, and integration of contextual information [7,8]. Clinically, pragmatic competence is assessed through a set of validated instruments [8,9,10] and observations [7,11], as it captures skills that inform screening, diagnosis and prognosis, and guide and verify training programs aimed at rehabilitation and enhancement.
A deficit in pragmatic ability disrupts individuals’ daily functioning. Several conditions are frequently associated with disruptions in the cognitive mechanisms underlying pragmatic interpretation [12,13], such as schizophrenia [14,15,16], autism spectrum disorder [17,18,19], atypical development [17,20], as well as traumatic brain injury [21,22,23]. These deficits often manifest in difficulties with indirect requests, irony, or adherence to social and conversational norms. Accordingly, pragmatic assessment plays a critical role in the evaluation of communicative impairments and offers the starting point for planning focused rehabilitation interventions.
This motivates standardized assessment protocols (e.g., MEC, Protocole Montréal d’Évaluation de la Communication [24]; APACS, Assessment of Pragmatic Abilities and Cognitive Substrates [9]; RHLB, Right Hemisphere Language Battery [25]; CADL-2 [26]; TASIT, The Awareness of Social Inference Test [27]), each with strengths but also limitations in scope, modality coverage, and scoring objectivity. The Assessment Battery for Communication (ABaCo; [8,10,28]) addresses several of these gaps by assessing both comprehension and production across five scales (linguistic, extralinguistic, paralinguistic, context, conversational). Items cover standard communicative acts (basic acts, i.e., assertion, question, request, and command, realized in indirect and direct forms) and non-standard acts (deceit, irony), as well as adherence/violations of social norms and norms referring to Grice’s Cooperative Principle (Quantity, Quality, Relation, Manner; see [29]). Scoring uses explicit dichotomous (0/1) criteria, and the battery reports solid psychometric properties and adult norms. These features, together with text-based comprehension items, make ABaCo well-suited to test whether instruction-tuned LLMs can act as assistive or even autonomous coders under human oversight.
The idea of automating qualitative coding and scoring in the clinical context is not new. In automated clinical coding (ICD/DRG), methods have progressed from rules-based and classical ML to deep-learning pipelines for document-level code assignment from notes [30] and, more recently, to contemporary LLMs. Recent reviews map LLM uptake across health research and clinical tasks, highlighting opportunities and safety constraints for assessment and qualitative coding (e.g., [31,32,33]). These studies converge on a clear message: the question is not whether LLMs can produce plausible outputs, but how to evaluate them credibly for specific clinical roles under human oversight. Park et al. (2024) [31] argue that evidence on real deployments should be organized around clinical utility rather than generic benchmarks. Tam et al. (2024) [34] expose recurring weaknesses in current studies, such as limited reliability checks and weak generalizability, while [35] emphasize standardized reporting so that results are interpretable across settings. Although LLMs have recently shown impressive linguistic fluency and domain adaptability, concerns about their capabilities in nuanced interpretive tasks and qualitative coding have become prominent (e.g., [36,37]). A converging literature documents LLMs’ pragmatic limitations, as these systems struggle with non-literal language, violations of conversational norms [38,39], indirect speech and irony [39,40]. On controlled tests of manner/scalar implicatures, state-of-the-art models (e.g., GPT-4o, Gemini-Flash) exhibit inconsistent behavior and sub-optimal accuracy, suggesting reliance on surface heuristics rather than generalized pragmatic representations [40].
Early clinical-pragmatics evaluations on a clinical pragmatic language battery (e.g., APACS [9]) likewise found near-human performance on some tasks but systematic weaknesses, tending towards overinformative responses, struggling with physical metaphors, and showing reduced humor comprehension [41]. Similarly, Ma et al. (2025) [42] provided a comprehensive survey of LLM evaluation in pragmatics, underscoring failures in scalar implicature, speech act recognition, and discourse coherence.
Taken together, these findings suggest clear potential but also non-trivial inferential gaps, raising questions about the reliability of LLM-based assessments in sensitive contexts such as clinical settings. To support safe deployment, we need precise estimates of where LLMs succeed and fail under a clinically validated rubric. Therefore, in this study, we test whether current LLMs, when appropriately prompted and constrained, can serve as raters of pragmatic performance in naturalistic data. More specifically, we investigate the reliability, limitations, and interpretive patterns of GPT-4o when applied to a corpus of ABaCo responses from older adults, a population that is comparatively understudied for detailed pragmatic profiling yet shows measurable discourse-pragmatic challenges (e.g., narrative/discourse production, see [43,44,45,46]). To our knowledge, no studies have evaluated LLMs against ABaCo, despite its wide application in the clinical context.
We pose two empirical Research Questions (RQs):
RQ1. 
To what extent does GPT-4o align with expert human raters in scoring pragmatic competence in healthy older adults using ABaCo criteria?
RQ2. 
Which specific pragmatic act types—standard (i.e., direct and indirect speech acts), non-standard communicative acts (i.e., irony and deceit), social norms, and Grice norm violations—show systematic convergence or divergence between GPT-4o and human coding?
By answering these questions, we offer an evaluation of current LLM performance. This exploratory study provides a contribution to current research that is methodological and setting-specific: (1) a human-anchored reliability estimate of an LLM’s performance on a structured pragmatic task; (2) an act-level discrepancy analysis that illuminates systematic bias; and (3) implications for hybrid human–AI systems in health. Ultimately, by situating pragmatic competence as a benchmark for interpretive AI, this study advances both empirical understanding and theoretical reflection on the integration of generative models in human-centered assessment contexts.

2. Materials and Methods

2.1. Clinical Protocol: The Assessment Battery for Communication (ABaCo)

Grounded in Cognitive Pragmatics theory [4] and informed by Grice’s Cooperative Principle (Quantity, Quality, Relation, Manner; [29]) and speech-act theory [47,48], ABaCo delivers a modular evaluation of communicative–pragmatic ability across different expressive means resulting in five assessment scales (linguistic, extralinguistic, paralinguistic, context, conversational). Comprehension and production are assessed for all scales except the conversational scale, which captures conversation holistically. We used the full version of the protocol, which includes 172 items, combining 100 brief film clips (~20–25 s each) with 72 structured examiner–participant interactions (administration time: ~90 min, full battery); however, as explained in Section 2.2, we then restricted analyses to items amenable to text-only presentation.
ABaCo maps cleanly onto LLM evaluation for several reasons. First, it targets pragmatic phenomena that are central to a wide assessment of pragmatic competence. Across modalities, ABaCo probes: (a) basic communicative acts (assertion, question, request, command); (b) standard communicative acts (direct/indirect); (c) non-standard communicative acts (deceit, irony); (d) contextual/social norms judgments; and (e) conversational skills (topic management, turn-taking). Second, psychometric support allows calibrated interpretation of agreement. Adult norms (N = 300) [8] are stratified by age and education, with inter-rater agreement ICC = 0.89 in the normative study, plus percentile/equivalent-score tables by scale. Third, the scoring scheme is binary and criterion-referenced, enabling direct human–LLM comparisons. ABaCo uses explicit dichotomous (0/1) scoring applied to standardized stimuli. For non-standard acts in comprehension, coding requires a score for each dimension (Expressed Content, Violation, Purpose, see Table S1 in the Supplementary Materials). Production tasks apply parallel, act-appropriate criteria. Finally, its stimulus format yields standardized, repeatable inputs for text-constrained evaluation.
Although ABaCo is a clinical battery, it is routinely applied to normative/control samples and provides normative data for ground interpretation. In this study, we evaluate GPT-4o on typical (healthy-aging) responses to establish a baseline uncontaminated by disorder-specific patterns, before attempting clinical generalization.

2.2. Participants and Dataset

Several ABaCo subtasks require the examiner to observe and score features that a text-only LLM cannot express or perceive, such as gesture execution (extralinguistic production) and prosodic/mimicry marking (paralinguistic production). Evaluating these with a text-only model would force the model to guess in the absence of the relevant cues. We therefore restricted our material to items that can be presented in written form and whose scoring depends on propositional form, excluding items whose scoring depends exclusively on gesture or prosody/mimicry. The rationale behind this choice was to retain items whose pragmatic meaning could be inferred from verbal content, both to (i) avoid forcing the model to “guess” missing information and (ii) prevent experimenter bias from injecting subjective textual translations of nonverbal signals into the prompt. Moreover, at present, there is no standardized or validated method to encode the prosody or gesture captured by the ABaCo protocol in a textual format suitable for LLM input.
The final categories included were linguistic (n = 44), extralinguistic (n = 30), paralinguistic (n = 22), context (n = 20), and conversational (n = 8). Further details are provided in Table 1.
The original administered battery comprised 1240 items (124 × 10). A total of 70 responses were unreadable or missing; these were marked as NA and excluded, leaving 1170 items. Because several items are scored on two to three dimensions (among Expressed Content, Intended Meaning/Purpose, Violation), we expanded items to item × dimension units, yielding 2054 records. For analysis, we further excluded units with missing or unreadable human scores in specific dimensions (n = 29), yielding a final dataset of 2025 cases. The cases were drawn from the responses given to the protocol by 10 healthy older adults (mean age = 69.8 years, SD = 4.6; 5 women).
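To make the data-structuring step transparent, the sketch below illustrates, in Python and with hypothetical column names, how scored items can be expanded into item × dimension units while dropping missing or unreadable codes. The study performed this preparation outside the model and analyzed the result in SPSS, so this is an illustrative equivalent rather than the authors’ pipeline.

```python
# Illustrative sketch only (hypothetical column names): expand one row per item
# into one row per item x dimension unit, dropping missing/unreadable scores.
import pandas as pd

DIMENSIONS = ["expressed_content", "intended_meaning", "violation"]  # assumed labels

def expand_to_units(items: pd.DataFrame) -> pd.DataFrame:
    long = items.melt(
        id_vars=["participant_id", "item_id", "pragmatic_act"],
        value_vars=[d for d in DIMENSIONS if d in items.columns],
        var_name="dimension",
        value_name="human_score",
    )
    # Remove dimensions not scored for a given item (NaN) and unreadable codes ("nd"/"ND")
    long = long.dropna(subset=["human_score"])
    long = long[~long["human_score"].astype(str).str.lower().eq("nd")]
    return long.reset_index(drop=True)
```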

2.3. LLM Model and Prompt Engineering

2.3.1. Rationale for Using GPT-4o

We used GPT-4o for three main reasons. First, it maximizes external validity: GPT-4o is currently the most widely used consumer LLM via ChatGPT, with hundreds of millions of weekly active users reported in 2025 [49]. Based on these figures, we assumed that testing pragmatic coding on this model would best reflect real-world exposure for clinicians and lay users who already interact with it in daily practice. Second, GPT-4o has public, up-to-date system documentation (system card) describing capabilities and limitations across modalities and known safety issues; building on a model with transparent, versioned documentation strengthens replicability [50]. Finally, GPT-4o offers broad accessibility (free/low-cost tiers and a widely available API/platform), which lowers barriers to adoption for research groups and clinical teams and improves cross-study comparability: using a common, well-documented baseline facilitates synthesis across studies.

2.3.2. Prompt Development

The chosen LLM was employed as a coder through a rigorous and iteratively defined prompt-engineering approach: we ran a small feasibility pretest on a subset of items to ensure the model returned the required output schema without errors. Wording adjustments were made only to remove formatting failures. After this check, the final prompt was locked and used invariantly for the full study to eliminate experimenter degrees of freedom and ensure reproducibility. The final prompt codifies the scoring rules, admissible outputs (binary per dimension), and the exact tab-separated reporting format. The verbatim text is provided in Supplementary File S1.
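For illustration, a formatting check of the kind used in this pretest might look like the sketch below; the field order and dimension count are assumptions, since the verbatim prompt and output schema are reported only in Supplementary File S1.

```python
# Hypothetical sketch: verify that a model output line is tab-separated and that
# every dimension score is binary (0/1). The field layout is assumed, not the
# verbatim schema of Supplementary File S1.
def validate_output_line(line: str, n_dimensions: int) -> bool:
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 1 + n_dimensions:  # item identifier + one score per dimension
        return False
    return all(score in {"0", "1"} for score in fields[1:])

assert validate_output_line("item_12\t1\t0\t1", n_dimensions=3)
assert not validate_output_line("item_12\t1\tmaybe", n_dimensions=2)
```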

2.3.3. Using the Prompt

Interactions were conducted via the ChatGPT web interface (OpenAI) using the model shown in the UI at the time, i.e., GPT-4o, with default system parameters. Human scores were not exposed to the model, and a human-model comparison was performed post hoc outside the model (see Section 2.4). Moreover, to avoid potential context carryover across items, which could happen when multiple items are processed within one session, conversations were refreshed and restarted between batches (every 10 inputs).

2.4. Statistical Analysis

All analyses were performed using SPSS Statistics (version 29.0.1.0 (171), IBM Corp. (New York, NY, USA)).

2.4.1. Inter-Rater Reliability Between Human Coder and GPT-4o

The analysis was computed at the response level (item × dimension units). Human coding was performed by clinical researchers extensively trained in ABaCo administration and scoring, and inter-rater reliability between GPT-4o and the human coders was evaluated using multiple metrics. First, direct agreement was calculated as the absolute number and percentage of cases in which both raters assigned the identical categorical score (0 or 1) to each item, excluding all cases with missing or indeterminate codes. Second, a full contingency table (confusion matrix) was constructed to examine the joint and marginal distributions of scores, providing insight into rating balance and error structure.
The primary chance-corrected metric was Cohen’s kappa (κ), computed on the complete set of paired ratings using standard formulas. Kappa values were interpreted according to [51]: κ < 0 (poor), 0–0.20 (slight), 0.21–0.40 (fair), 0.41–0.60 (moderate), 0.61–0.80 (substantial), 0.81–1.00 (almost perfect). To provide statistical inference, 95% confidence intervals for κ were estimated using nonparametric bootstrapping (5000 resamples).
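All computations were run in SPSS; purely for illustration, a minimal Python sketch of the same agreement metrics (exact agreement, confusion matrix, and Cohen’s κ with a 5000-resample bootstrap interval) is shown below.

```python
# Minimal illustrative sketch (the study used SPSS v29, not this code).
# `human` and `model` are equal-length sequences of 0/1 codes with missing cases removed.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

def agreement_summary(human, model, n_boot=5000, seed=0):
    human, model = np.asarray(human), np.asarray(model)
    exact = float(np.mean(human == model))              # raw exact agreement
    kappa = cohen_kappa_score(human, model)             # chance-corrected agreement
    cm = confusion_matrix(human, model, labels=[0, 1])  # joint distribution of codes
    rng = np.random.default_rng(seed)
    n = len(human)
    boot = [cohen_kappa_score(human[idx], model[idx])
            for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    ci_low, ci_high = np.percentile(boot, [2.5, 97.5])  # nonparametric bootstrap CI
    return exact, kappa, (ci_low, ci_high), cm
```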

2.4.2. Distribution of Discrepancies by Pragmatic Act

In addition to agreement statistics, a categorical frequency analysis was conducted to examine the distribution of discrepancies by pragmatic act. All discrepant cases (n = 220) were labeled as either False Negatives (FNs) or False Positives (FPs). A 2 × 13 contingency table was constructed, cross-tabulating error type (FNs vs. FPs) by pragmatic act category (13 levels). A Pearson χ2 test of independence was then performed to assess whether error distributions differed significantly across act types [52]. This analysis tested the hypothesis that classification errors are non-randomly associated with specific communicative acts, potentially indicating model-specific interpretive biases. Assumptions for the chi-square test were evaluated, including the requirement that all expected frequencies be ≥1 and that fewer than 20% of cells have expected counts < 5 [53].
To identify the contribution of individual cells to the overall χ2 statistic, Pearson standardized residuals (z) were computed for each cell in the contingency table. Two-tailed p-values were calculated for each residual to assess cellwise significance, using the normal distribution. Cells with |z| ≥ 1.96 (p < 0.05) were considered significantly deviant from the expected value under the null hypothesis. Because 26 cellwise tests were performed (13 acts × 2 error types), with sparsity risk, we controlled multiplicity using the Benjamini–Hochberg false discovery rate (BH–FDR) with α = 0.05. Implementation followed the standard step-up procedure, and cellwise significance was defined as q ≤ 0.05. We chose BH–FDR because it controls the expected proportion of false discoveries while offering greater power than familywise-error procedures (e.g., Bonferroni) in exploratory, post hoc localization of effects [54].
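As an illustrative summary of this procedure (the study implemented it in SPSS), the sketch below computes standardized residuals from the 2 × 13 table, derives two-sided p-values from the normal distribution, and applies the Benjamini–Hochberg step-up across all 26 cellwise tests.

```python
# Illustrative Python sketch of the cellwise residual + BH-FDR analysis.
# `table` is the 2 x 13 contingency table of discrepancy counts (rows: FN, FP).
import numpy as np
from scipy import stats

def cellwise_residuals_fdr(table, alpha=0.05):
    table = np.asarray(table, dtype=float)
    chi2, p_global, df, expected = stats.chi2_contingency(table)
    z = (table - expected) / np.sqrt(expected)         # Pearson standardized residuals
    p_cell = 2 * stats.norm.sf(np.abs(z)).ravel()      # two-sided p from the normal CDF

    # Benjamini-Hochberg step-up across m cellwise tests (m = 26 here)
    m = p_cell.size
    order = np.argsort(p_cell)
    ranked = p_cell[order] * m / np.arange(1, m + 1)
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]  # enforce step-up monotonicity
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)
    q = q.reshape(table.shape)

    return chi2, p_global, z, q, q <= alpha             # last element: significance flags
```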
Beyond hypothesis testing, which localized cellwise discrepancies, we computed Cramér’s V as an effect-size index to quantify the magnitude of the act × error association (i.e., how strongly error type varies by act on a 0–1 scale). Finally, we calculated per-act disagreement rates with 95% Wilson intervals to quantify the prevalence of GPT-4o’s failures by pragmatic act.
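Again for illustration only, these two indices can be written as short helpers: Cramér’s V from the overall χ2, and the Wilson score interval for each per-act disagreement proportion.

```python
# Illustrative helpers for the two descriptive indices reported in the text.
import numpy as np
from scipy import stats

def cramers_v(table):
    table = np.asarray(table, dtype=float)
    chi2 = stats.chi2_contingency(table)[0]
    n = table.sum()
    k = min(table.shape) - 1                  # smaller table dimension minus one
    return float(np.sqrt(chi2 / (n * k)))

def wilson_ci(disagreements, n, conf=0.95):
    z = stats.norm.ppf(1 - (1 - conf) / 2)
    p_hat = disagreements / n
    centre = (p_hat + z**2 / (2 * n)) / (1 + z**2 / n)
    half = (z / (1 + z**2 / n)) * np.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2))
    return centre - half, centre + half
```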
All computations were performed in SPSS v29 as follows: (i) compute two-sided p from z via the normal CDF; (ii) apply BH–FDR across the 26 p-values with step-up monotonicity; (iii) retain cells with q ≤ 0.05.

3. Results

A total of 2025 cases were analyzed after excluding responses with missing or indeterminate values (“nd”, “ND”, or missing). Agreement metrics were computed to evaluate the reliability and limitations of GPT-4o’s pragmatic coding. The analysis included computation of raw agreement, Cohen’s kappa statistic (with 95% bootstrap confidence interval), full contingency table, marginal distributions, and an evaluation of the reliability coefficient according to established interpretive guidelines.

3.1. Inter-Rater Agreement

Exact agreement between the human rater and GPT-4o was observed in 1805 out of 2025 cases, corresponding to an exact agreement proportion of 0.89 (89.1%). While this level of agreement appears high, raw agreement is sensitive to prevalence and does not adjust for agreement occurring by chance. Therefore, we calculated a full confusion matrix (see Table 2), which displays both concordant and discordant classification frequencies for each rater: the human rater assigned a rating of 0 in 245 cases (12.1%) and a rating of 1 in 1780 cases (87.9%), while GPT-4o assigned a rating of 0 in 247 cases (12.2%) and a rating of 1 in 1778 cases (87.8%).
These results indicate a substantial imbalance in the distribution of ratings, with the vast majority of items classified in category 1 by both raters. This means that most items were judged as “1” (which represents “correct”), while only a small proportion were classified as “0”. Although the direct agreement is high, the pronounced imbalance in category frequencies necessitates the use of chance-corrected statistics. Therefore, Cohen’s kappa was calculated to assess the degree of agreement beyond chance (see Table 3).
The observed kappa coefficient was 0.491 (95% bootstrap confidence interval: CI: 0.437–0.553; B = 5000, N = 2025), indicating a moderate level of inter-rater reliability according to the scale of [51], where values between 0.41 and 0.60 are interpreted as “moderate agreement”. The confidence interval excludes zero, demonstrating that the observed agreement is significantly greater than would be expected by chance. These results address RQ1, showing that at the response level, GPT-4o’s codes align with expert judgments to the extent reflected by the agreement estimates reported above.

3.2. Analysis of Discrepancies

The comparative analysis between the human rater and the GPT-4o automated rater for pragmatic abilities assessed via the ABaCo revealed a total of 220 discrepancies, which were further classified as false negatives (FN; missed by GPT-4o) or false positives (FP; over-attributed by GPT-4o), as shown in Table 4.
Across the 220 disagreements, 111 (50.5%) were false negatives and 109 (49.5%) were false positives, indicating a relatively balanced distribution of errors. The chi-square test indicated a non-random association between error type and pragmatic act, χ2(12) = 69.431, p < 0.001 (Table 5).
However, since the minimum expected count was ≥1, but >20% of cells had expected counts < 5 (61.5%), χ2 was interpreted cautiously. To further examine the distributional structure, Pearson residuals were computed for each cell in the 2 × 13 contingency table crossing error type (FN, FP) with pragmatic act. Significant residuals (|z| ≥ 1.96, p < 0.05) indicate cells with observed frequencies deviating significantly from expected. Significant over-representation of FP was observed for the pragmatic acts “Command” (z = 3.0, p < 0.01) and “Social Norm” (z = 2.5, p = 0.01), while FN were significantly over-represented for “Deceit” (z = 3.3, p < 0.001). Other pragmatic acts showed no significant cellwise deviations (all p ≥ 0.05). The overall magnitude of the act × error association was non-trivial (Cramér’s V = 0.562). To complement structural magnitude with prevalence, we also calculated per-act disagreement rates (with 95% Wilson CIs), which show that disagreement rates concentrated in Deceit and Irony (for details, see Table S2 in the Supplementary Materials).
Because 26 cellwise tests were examined (13 acts × 2 error types), we controlled multiplicity with the Benjamini–Hochberg false discovery rate (BH–FDR, α = 0.05), reporting adjusted q-values and defining cellwise significance as q ≤ 0.05. Only two categories of pragmatic acts survived multiplicity correction, confirming an over-representation of FP in “Command” (z = 3.0, p < 0.01, q < 0.05) and an over-representation of FN in “Deceit” (z = 3.3, p < 0.001, q < 0.05). A complete residual table with Observed, Expected, z, two-sided p, BH-adjusted q, and a binary significance flag (q ≤ 0.05) is provided in Appendix A (Table A1).
In conclusion, for RQ2, the evidence indicates that discrepancies follow act-specific patterns, with over-attribution of Command and under-attribution of Deceit after correction.

4. Discussion

This exploratory study asked whether an instruction-tuned LLM can code pragmatic performance under a validated clinical rubric (ABaCo) and where, precisely, it succeeds or fails.

4.1. General Discussion

In response to RQ1, we observed high raw exact agreement (89.1%) with moderate chance-corrected reliability (κ = 0.491) between GPT-4o and the human rater across 2025 items. In line with the class imbalance observed in our contingency table (Table 2, Results), kappa (Table 3, Results) reflects a conservative, chance-corrected view of reliability rather than simple concordance. Practically, these figures indicate that a carefully prompted LLM can produce structured, human-comparable codes.
In response to RQ2, we observed that performance varied significantly across the pragmatic phenomena investigated, but such deviations were localized (Table 4, Results). While effective in coding direct communicative acts (assertions, questions, and requests), where literal meaning corresponds to the intended one, the model exhibited systematic deficits. The significant χ2 (Table 5, Results) and the corresponding effect size (Cramér’s V) both indicate that these error patterns are non-random and display a non-trivial, act-specific structure. These results align partially with recent prior works [38,41], which reported similar strengths in literal understanding but noted degradation in GPT-4o’s interpretive performance for non-literal utterances based on the Gricean maxims (e.g., quantity, text-based inference, physical metaphors, humor, and irony).
However, after controlling for multiplicity with Benjamini–Hochberg FDR across 26 cellwise tests, only two pragmatic act categories showed robust deviations from the human benchmark: Command and Deceit. All other acts, including Social Norm, did not survive FDR correction (q > 0.05); therefore, these act-level discrepancies are not treated as practically important (in line with [55]). In sum, the model was effective on direct/literal acts (e.g., assertions, questions, requests) but exhibited systematic bias in acts requiring either unmarked directive force (Command) or non-literal, intention-ascribing inference (Deceit). To synthesize the discrepancy patterns observed, Table A2 (in Appendix A) summarizes the main failure scenarios identified across pragmatic acts. These categories formalize the narrative description of errors provided above and can inform follow-up studies on model robustness and prompt design.
Compared to prior works, our findings align with evidence that LLMs lack sensitivity to context-dependent pragmatic cues [40] and to manner implicatures, i.e., pragmatic inferences triggered by apparent violations of Grice’s maxim of Manner [56], and that they fabricate attributions, especially when the prompt is ambiguous or open-ended [57]. The elevated false-negative rate for Deceit suggests difficulties for the AI model in capturing subtle non-literal and pragmatic cues, while the over-representation of false positives for Command may indicate a tendency toward over-attribution of pragmatic intents.
However, given our mixed results (i.e., good performance in the majority of areas and weaker in others), alternative explanations merit consideration.
First, disagreements on Command are noteworthy, considering that in classical speech-act taxonomies, directives (which include commands) are a basic illocutionary class and are canonically realized by the imperative. Crucially, their perceived force is often signaled or disambiguated by prosody and other paralinguistic cues [58,59]: because our inputs were plain text with no intonation, emphasis, or timing, part of the error pattern may, instead, reflect a cue-loss effect rather than a principled limitation of the model. This hypothesis is strengthened by both prior research in human–robot collaboration, showing that real-time fusion of speech and gestures can improve command interpretation [60], and research in automated language understanding (e.g., [61]), emphasizing the need to anchor linguistic meaning in extralinguistic events and context to understand language effectively.
Second, prior works showed that GPT-4o performed at or above human levels on several Theory of Mind tasks [62]—a cognitive domain only partially overlapping with pragmatics [63,64]—including indirect requests, false beliefs and misdirection. However, the model struggled with faux pas, which are communicative situations where a speaker says something they should not have said, not knowing or realizing that they should not have said it. To recognize a faux pas, one must represent two mental states: (1) that the speaker does not know or does not remember the relevant information, and (2) that the person hearing it would feel hurt or insulted [62]. In this latter case, the model is subject to failures attributable to hyperconservatism rather than inference failure. Therefore, a further alternative reading is that our mixed pattern reflects design-imposed caution rather than a domain-general pragmatic deficit. The OpenAI Model Spec encodes a deontic hierarchy (“Instructions with higher authority override those with lower authority”) and explicitly instructs systems to “ask clarifying questions as appropriate” [50]: these defaults could plausibly encourage a risk-averse, question-first response style that can suppress more straightforward pragmatic inferences. Separately, preference-based post-training (RLHF) is a standard approach to align assistants to human judgments [65]. Recent evidence indicates that assistants fine-tuned with human feedback can exhibit sycophancy, i.e., alignment with user views at the expense of correctness, at least in some settings [66]. Applied to our study, these mechanisms plausibly account for the error structure: hyperconservatism with the preponderance of false negatives in acts requiring negative or mind-reading attributions (e.g., Deceit), consistent with a question-first reluctance to endorse strong pragmatic interpretations that would in effect “accuse” the speaker.

4.2. Suggestions for Design

By demonstrating both the potential and the limits of GPT-4o for pragmatic evaluation on ABaCo, we provide a twofold contribution to the current literature: (i) a human-anchored reliability estimate for an LLM evaluated with a clinically validated pragmatic battery (ABaCo); and (ii) an act-level discrepancy map that localizes where disagreements concentrate, thereby revealing interpretive blind spots.
The practical implications for human-centered AI are clear. LLMs can support assessments, yielding human-comparable results when prompts are well specified, but researcher oversight remains essential, especially in high-stakes or context-sensitive applications. Based on our data, we propose that GPT-4o can operate as a reflective second coder, especially in resource-constrained settings (for instance, when only one human coder can be recruited in the research team). In this role, the model may serve as a quality-assurance AI companion, flagging any detected inconsistencies, typos, or highlighting borderline cases for human re-review, which help researchers and clinicians revisit uncertain decisions, while the final adjudication remains theirs. This “second-reader” pattern is well-established in current human-in-the-loop workflows: recent studies document that double-reading practices can improve error detection and confidence in clinical domains: drawing from the medical field, [67] introduced the concept of diagnostic complementarity, meaning that human and AI readers exhibit different strengths and limitations, and their combined performance can exceed either working alone. Similarly, Harada et al. (2024) [68] demonstrate that LLM-assisted review can surface diagnostic or coding inconsistencies for targeted reassessment, while [69] show that human–AI complementarity can enhance decision-making and support safer, more consistent clinical workflows.
Looking ahead, use of domain-specific data is expected to improve qualitative coding performance [37], especially in digital health. In the clinical and research-based psychological domain, AI tools should be calibrated to population-specific communication profiles; concretely, prompts and policies should (i) abstain when uncertainty is high or key cues are missing; (ii) lower the decision threshold for Deceit (FN-prone); and (iii) require explicit textual markers (e.g., deontic verbs, imperative form) for Command (FP-prone). Moreover, multimodal integration, capable of capturing prosody, timing, gesture, and other paralinguistic cues, is needed to approach human-level judgment.
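As a purely hypothetical sketch of how recommendations (i)–(iii) could be operationalized downstream of the model, consider the following policy; the thresholds, marker list, and the availability of a per-judgment confidence score are illustrative assumptions, not part of the present study.

```python
# Hypothetical calibration policy (illustrative values only).
DECISION_THRESHOLDS = {"default": 0.50, "Deceit": 0.35}   # (ii) lower bar for FN-prone Deceit
COMMAND_MARKERS = ("stop", "order", "must", "imperative")  # (iii) overt directive markers (assumed)

def adjudicate(act: str, confidence: float, response_text: str) -> str:
    # (i) abstain and route to a human when uncertainty is high
    if confidence < 0.20:
        return "abstain: human review"
    # (iii) FP-prone Command: require explicit textual markers before crediting the act
    if act == "Command" and not any(m in response_text.lower() for m in COMMAND_MARKERS):
        return "abstain: human review"
    threshold = DECISION_THRESHOLDS.get(act, DECISION_THRESHOLDS["default"])
    return "1" if confidence >= threshold else "0"
```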

5. Limitations

Despite its merits, our study also has several limitations. First, although participants’ responses were the unit of analysis, the cases were drawn from only 10 participants: this may have introduced clustering and sampling bias and limits generalizability. Second, we provided text-only inputs: items were presented to the model without prosodic/paralinguistic cues (tone, emphasis, timing, gesture). While necessary to maintain textual consistency and avoid subjective annotation, this choice restricts ecological validity, since pragmatic comprehension in real interaction depends on multimodal integration. This is even more true for phenomena such as Irony and some Command/indirect requests, whose gold labels in ABaCo often rely on these cues. Under text-only conditions, the task is underspecified; some apparent model “errors” may thus reflect missing signals rather than a lack of competence. Third, coding was single-shot: we accepted the model’s first output without permitting clarifying questions or self-correction, a choice that reduces ecological validity when evaluating pragmatic interactions; future protocols should allow short, pre-specified clarification turns and a constrained self-check before final coding. Moreover, prompt robustness was not assessed: reliability should be estimated over multiple runs (different random seeds), prompt and stimulus paraphrases, and small prompt perturbations, reporting stability intervals and disagreement rates. Finally, because ChatGPT UI models are periodically updated, all results are contingent on the model snapshot used. Future studies may benchmark our approach against other LLM-based chatbots (e.g., Gemini, Microsoft Copilot), newer model versions (e.g., GPT-5), and non-LLM baselines (e.g., supervised classifiers trained on ABaCo-labeled data).

6. Conclusions

As LLMs become increasingly embedded in human-facing systems, their pragmatic limitations must be explicitly acknowledged and systematically addressed to ensure ethical and effective use. This exploratory study provides a systematic evaluation of GPT-4o’s ability to assess pragmatic competence in older adults using the clinically validated ABaCo framework. The model achieved moderate overall agreement with human raters, and most pragmatic acts showed no significant misclassification bias. Statistically reliable deviations were limited to two act types: false positives for Command, and false negatives for Deceit. These inferential failures suggest structured limitations in the model’s pragmatic reasoning. However, such localized effects may also reflect missing paralinguistic/extralinguistic information in the text-only setup, in addition to model-level constraints. In conclusion, our findings support the use of LLMs such as GPT-4o as second coders within human-in-the-loop workflows, provided human oversight is maintained. For practical application of LLMs in clinical coding, future work should explore model adaptation through targeted training on pragmatic corpora and the integration of multimodal cues (e.g., paralinguistic and extralinguistic information) to overcome the blind spots of a text-only setup. Moreover, in the context of pragmatics, contextual calibration is critical to ensuring trustworthy deployment. Finally, while GPT-4o was tested here on a healthy-aging baseline, future work could address clinical evaluations using single-case or small case-series designs, minimizing heterogeneity and reducing the risk of conflating disorder-specific communicative profiles with model limitations.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/electronics14224411/s1. Supplementary information includes File S1—Full LLM prompt (TXT, verbatim); Table S1—Example ABaCo items with single, double, and triple dimension scoring criteria; Table S2—per act disagreement rates.

Author Contributions

Conceptualization: A.B., I.G. and F.M.B.; Methodology: A.B.; Formal analysis: A.B.; Investigation: A.B.; Data curation: A.B.; Writing—original draft: A.B.; Writing—review and editing: A.B., I.G. and F.M.B.; Supervision: F.M.B.; Funding acquisition: F.M.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by PRIN 2022, Prot. n. 2022CZF8KA, project title: “ACTIVe communication: Assessment and enhancement of pragmatic and narrative skills in hEaLthY aging (ACTIVELY)” Avviso pubblico n. 104 del 02/02/2022—PRIN 2022 PNRR M4C2 Inv. 1.1. Ministero dell’Università e della Ricerca (Financed by EU, NextGenerationEU)—CUP G53D23003110006.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the University of Turin (Protocol no. 202174, approved on 20 February 2024).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The anonymized dataset is openly available on OSF at: https://osf.io/3rjux/?view_only=aee82aa022b944aaa2bc45b34744cfa6 (accessed on 30 September 2025).

Conflicts of Interest

The authors declare no conflicts of interest.

Appendix A

Table A1. Cellwise residual analysis of error type by pragmatic act with Benjamini–Hochberg FDR–adjusted q-values (m = 26).
Pragmatic Act | Error | Obs. | Exp. | z | p (Two-Sided) | q_BH | sig_BH
Assertion | FN | 2.00 | 1.00 | 1 | p > 0.05 | 0.57 | q > 0.05
Assertion | FP | 0.00 | 1.00 | −1 | p > 0.05 | 0.57 | q > 0.05
Command | FN | 0.00 | 9.10 | −3 | p < 0.01 * | 0.02 | q < 0.05 *
Command | FP | 18.00 | 8.90 | 3 | p < 0.01 * | 0.02 | q < 0.05 *
Conv.—Topic | FN | 0.00 | 1.00 | −1 | p > 0.05 | 0.57 | q > 0.05
Conv.—Topic | FP | 2.00 | 1.00 | 1 | p > 0.05 | 0.57 | q > 0.05
Conv.—Turn-taking | FN | 0.00 | 1.00 | −1 | p > 0.05 | 0.57 | q > 0.05
Conv.—Turn-taking | FP | 2.00 | 1.00 | 1 | p > 0.05 | 0.57 | q > 0.05
Deceit | FN | 58.00 | 37.80 | 3.3 | p < 0.001 * | 0.02 | q < 0.05 *
Deceit | FP | 17.00 | 37.20 | −3.3 | p < 0.001 * | 0.02 | q < 0.05 *
Emotion | FN | 1.00 | 0.50 | 0.7 | p > 0.05 | 0.63 | q > 0.05
Emotion | FP | 0.00 | 0.50 | −0.7 | p > 0.05 | 0.65 | q > 0.05
Incongruity | FN | 4.00 | 3.00 | 0.6 | p > 0.05 | 0.63 | q > 0.05
Incongruity | FP | 2.00 | 3.00 | −0.6 | p > 0.05 | 0.63 | q > 0.05
Irony | FN | 32.00 | 30.30 | 0.3 | p > 0.05 | 0.78 | q > 0.05
Irony | FP | 28.00 | 29.70 | −0.3 | p > 0.05 | 0.78 | q > 0.05
Norm | FN | 8.00 | 11.60 | −1.1 | p > 0.05 | 0.57 | q > 0.05
Norm | FP | 15.00 | 11.40 | 1.1 | p > 0.05 | 0.74 | q > 0.05
Question | FN | 1.00 | 2.50 | −1 | p > 0.05 | 0.57 | q > 0.05
Question | FP | 4.00 | 2.50 | 1 | p > 0.05 | 0.57 | q > 0.05
Request | FN | 1.00 | 4.00 | −1.5 | p > 0.05 | 0.46 | q > 0.05
Request | FP | 7.00 | 4.00 | 1.5 | p > 0.05 | 0.46 | q > 0.05
Social Norm | FN | 0.00 | 6.10 | −2.5 | p = 0.01 * | 0.06 | q > 0.05
Social Norm | FP | 12.00 | 5.90 | 2.5 | p = 0.01 * | 0.06 | q > 0.05
Standard Act | FN | 4.00 | 3.00 | 0.6 | p > 0.05 | 0.63 | q > 0.05
Standard Act | FP | 2.00 | 3.00 | −0.6 | p > 0.05 | 0.63 | q > 0.05
Legend: Error: FN = False Negative; FP = False Positive; Obs. = Observed frequency; Exp. = Expected frequency; z = Standardized residual; p-value = Two-tailed p based on z; q_BH = p-values adjusted for multiple testing using the Benjamini–Hochberg false discovery rate (FDR) with m = 26 tests; sig_BH = significance of q_BH values (<0.05). * indicates significant values.
Table A2. Significant failure scenarios observed in GPT-4o’s pragmatic coding.
Failure Scenario | Act | Error Type | ABaCo Stimulus | Response | Human Score | Model Score | Model Behavior
Over-generalization of directive force | Command | False Positive (FP) | “Please issue some commands. Order me to stop asking you questions.” | “Stop asking questions, please.” | 0 | 1 | The model over-ascribes command force to a polite request/mitigated directive, incorrectly crediting a correct Command when the human coder judges it inadequate for the target act.
Literal–intent mismatch | Deceit | False Negative (FN) | “We’ll watch some clips. Pay attention, I’ll ask questions later. Luca hides his skateboard. His mother asks, ‘What was that noise?’ The boy does not want to be found out. What could he say? (Borderline cases: what would that mean?)” | “The best is to say nothing. Or, ‘I heard it too; I don’t know.’” | 1 | 0 | The model misses deceit: the silence/evasive reply constitutes deceptive intent (successful target act), but the model labels it as inadequate.
Legend: Score: 1 = appropriate/successful for the target act; 0 = inappropriate/unsuccessful.

References

  1. Bender, E.M.; Koller, A. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 6–8 July 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 5185–5198. [Google Scholar] [CrossRef]
  2. Levinson, S.C. Pragmatics; Cambridge University Press: Cambridge, UK, 1983. [Google Scholar] [CrossRef]
  3. Holler, J.; Levinson, S.C. Multimodal language processing in human communication. Trends Cogn. Sci. 2019, 23, 639–652. [Google Scholar] [CrossRef]
  4. Bara, B.G. Cognitive Pragmatics: The Mental Processes of Communication; MIT Press: Cambridge, MA, USA, 2010. [Google Scholar] [CrossRef]
  5. Bara, B.G. Cognitive pragmatics: The mental processes of communication. Intercult. Pragmat. 2011, 8, 443–485. [Google Scholar] [CrossRef]
  6. Bosco, F.M.; Bucciarelli, M.; Bara, B.G. The fundamental context categories in understanding communicative intention. J. Pragmat. 2004, 36, 467–488. [Google Scholar] [CrossRef]
  7. Adams, C. Practitioner review: The assessment of language pragmatics. J. Child Psychol. Psychiatry Allied Discip. 2002, 43, 973–987. [Google Scholar] [CrossRef] [PubMed]
  8. Angeleri, R.; Bosco, F.M.; Gabbatore, I.; Bara, B.G.; Sacco, K. Assessment battery for communication (ABaCo): Normative data. Behav. Res. Methods 2012, 44, 845–861. [Google Scholar] [CrossRef]
  9. Arcara, G.; Bambini, V. A Test for the Assessment of Pragmatic Abilities and Cognitive Substrates (APACS): Normative data and psychometric properties. Front. Psychol. 2016, 7, 70. [Google Scholar] [CrossRef]
  10. Bosco, F.M.; Angeleri, R.; Zuffranieri, M.; Bara, B.G.; Sacco, K. Assessment Battery for Communication: Development of two equivalent forms. J. Commun. Disord. 2012, 45, 290–303. [Google Scholar] [CrossRef]
  11. Bishop, D.V.M. The Children’s Communication Checklist, Second Edition—CCC-2 Manual; Pearson: London, UK, 2003. [Google Scholar]
  12. Parola, A.; Salvini, R.; Gabbatore, I.; Colle, L.; Berardinelli, L.; Bosco, F.M. Pragmatics, Theory of Mind and executive functions in schizophrenia: Disentangling the puzzle using machine learning. PLoS ONE 2020, 15, e0229603. [Google Scholar] [CrossRef]
  13. Gabbatore, I.; Bosco, F.M.; Mäkinen, L.; Ebeling, H.; Hurtig, T.; Loukusa, S. Investigating pragmatic abilities in young Finnish adults using the Assessment Battery for Communication. Intercult. Pragmat. 2019, 16, 27–56. [Google Scholar] [CrossRef]
  14. Bosco, F.M.; Berardinelli, L.; Parola, A. The ability of patients with schizophrenia to comprehend and produce sincere, deceitful, and ironic communicative intentions: The role of theory of mind and executive functions. Front. Psychol. 2019, 10, 827. [Google Scholar] [CrossRef]
  15. Bambini, V.; Arcara, G.; Bechi, M.; Buonocore, M.; Cavallaro, R.; Bosia, M. The communicative impairment as a core feature of schizophrenia: Frequency of pragmatic deficit, cognitive substrates, and relation with quality of life. Compr. Psychiatry 2016, 71, 106–120. [Google Scholar] [CrossRef] [PubMed]
  16. Colle, L.; Angeleri, R.; Vallana, M.; Sacco, K.; Bara, B.G.; Bosco, F.M. Understanding the communicative impairments in schizophrenia: A preliminary study. J. Commun. Disord. 2013, 46, 294–308. [Google Scholar] [CrossRef] [PubMed]
  17. Angeleri, R.; Gabbatore, I.; Bosco, F.M.; Sacco, K.; Colle, L. Pragmatic abilities in children and adolescents with autism spectrum disorder: A study with the ABaCo battery. Minerva Psichiatr. 2016, 57, 93–103. [Google Scholar]
  18. Gabbatore, I.; Longobardi, C.; Bosco, F.M. Improvement of communicative-pragmatic ability in adolescents with Autism Spectrum Disorder: The adapted version of the Cognitive Pragmatic Treatment. Lang. Learn. Dev. 2022, 18, 62–80. [Google Scholar] [CrossRef]
  19. Loukusa, S.; Moilanen, I.K. Pragmatic inference abilities in individuals with Asperger syndrome or high-functioning autism. A review. Res. Autism Spectr. Disord. 2009, 3, 890–904. [Google Scholar] [CrossRef]
  20. Gabbatore, I.; Marchetti Guerrini, A.; Bosco, F.M. Looking for social pragmatic communication disorder in the complex world of Italian special needs: An exploratory study. Sci. Rep. 2025, 15, 348. [Google Scholar] [CrossRef]
  21. Angeleri, R.; Bosco, F.M.; Zettin, M.; Sacco, K.; Colle, L.; Bara, B.G. Communicative impairment in traumatic brain injury: A complete pragmatic assessment. Brain Lang. 2008, 107, 229–245. [Google Scholar] [CrossRef]
  22. Bosco, F.M.; Angeleri, R.; Sacco, K.; Bara, B.G. Explaining pragmatic performance in traumatic brain injury: A process perspective on communicative errors. Int. J. Lang. Commun. Disord. 2015, 50, 63–83. [Google Scholar] [CrossRef]
  23. Bosco, F.M.; Parola, A.; Sacco, K.; Zettin, M.; Angeleri, R. Communicative-pragmatic disorders in traumatic brain injury: The role of theory of mind and executive functions. Brain Lang. 2017, 168, 73–83. [Google Scholar] [CrossRef]
  24. Joanette, Y.; Ska, B.; Côté, H. Protocole Montréal d’Évaluation de la Communication (MEC); Ortho Édition: Isbergues, France, 2004. [Google Scholar]
  25. Bryan, K. The Right Hemisphere Language Battery, 2nd ed.; Whurr Publishers: London, UK, 1995. [Google Scholar]
  26. Holland, A.L.; Frattali, C.; Fromm, D. Communication Activities of Daily Living, 2nd ed.; CADL-2; PRO-ED: Austin, TX, USA, 1999. [Google Scholar]
  27. McDonald, S.; Flanagan, S.; Rollins, J.; Kinch, J. TASIT: A new clinical tool for assessing social perception after traumatic brain injury. J. Head Trauma Rehabil. 2003, 18, 219–238. [Google Scholar] [CrossRef]
  28. Angeleri, R.; Bara, B.G.; Bosco, F.M.; Colle, L.; Sacco, K. ABaCo—Assessment Battery for Communication, 2nd ed.; Giunti OS: Florence, Italy, 2015. [Google Scholar]
  29. Grice, H.P. Logic and conversation. In Syntax and Semantics, Volume 3: Speech Acts; Cole, P., Morgan, J.L., Eds.; Academic Press: New York, NY, USA, 1975; pp. 41–58. [Google Scholar]
  30. Yan, C.; Fu, X.; Liu, X.; Zhang, Y.; Gao, Y.; Wu, J.; Li, Q. A survey of automated International Classification of Diseases coding: Development, challenges, and applications. Intell. Med. 2022, 2, 161–173. [Google Scholar] [CrossRef]
  31. Park, Y.J.; Pillai, A.; Deng, J.; Zhou, L.; Zhang, Z.; Yu, K.-H.; Wang, Y.; Wang, L.; Luo, Y. Assessing the research landscape and clinical utility of large language models: A scoping review. BMC Med. Inform. Decis. Mak. 2024, 24, 72. [Google Scholar] [CrossRef]
  32. Tian, S.; Jin, Q.; Yeganova, L.; Lai, P.-T.; Zhu, Q.; Chen, X.; Yang, Y.; Chen, Q.; Kim, W.; Comeau, D.C.; et al. Opportunities and challenges for ChatGPT and large language models in biomedicine and health. Brief. Bioinform. 2024, 25, bbad493. [Google Scholar] [CrossRef] [PubMed]
  33. Vrdoljak, J.; Boban, Z.; Vilović, M.; Kumrić, M.; Božić, J. A review of large language models in medical education, clinical decision support, and healthcare administration. Healthcare 2025, 13, 603. [Google Scholar] [CrossRef] [PubMed]
  34. Tam, T.Y.C.; Sivarajkumar, S.; Kapoor, S.; Stolyar, A.V.; Polanska, K.; McCarthy, K.R.; Osterhoudt, H.; Wu, X.; Visweswaran, S.; Fu, S.; et al. A framework for human evaluation of large language models in healthcare derived from literature review. NPJ Digit. Med. 2024, 7, 258. [Google Scholar] [CrossRef] [PubMed]
  35. Ho, C.N.; Tian, T.; Ayers, A.T.; Aaron, R.E.; Phillips, V.; Wolf, R.M.; Mathioudakis, N.; Dai, T.; Klonoff, D.C. Qualitative metrics from the biomedical literature for evaluating large language models in clinical decision-making: A narrative review. BMC Med. Inform. Decis. Mak. 2024, 24, 357. [Google Scholar] [CrossRef]
  36. Chew, R.; Bollenbacher, J.; Wenger, M.; Speer, J.; Kim, A. LLM-assisted content analysis: Using large language models to support deductive coding. arXiv 2023. [Google Scholar] [CrossRef]
  37. Tai, R.H.; Bentley, L.R.; Xia, X.; Sitt, J.M.; Fankhauser, S.C.; Chicas-Mosier, A.M.; Monteith, B.G. An examination of the use of large language models to aid analysis of textual data. Int. J. Qual. Methods 2024, 23, 16094069241231168. [Google Scholar] [CrossRef]
  38. Hu, J.; Floyd, S.; Jouravlev, O.; Fedorenko, E.; Gibson, E. A fine-grained comparison of pragmatic language understanding in humans and language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Long Papers), Toronto, ON, Canada, 9–14 July 2023; Association for Computational Linguistics: Stroudsburg, PA, USA, 2023; pp. 4194–4213. [Google Scholar] [CrossRef]
  39. Yerukola, A.; Vaduguru, S.; Fried, D.; Sap, M. Is the pope Catholic? Yes, the pope is Catholic. Generative evaluation of non-literal intent resolution in LLMs. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Short Papers), Bangkok, Thailand, 11–16 August 2024; Association for Computational Linguistics: Stroudsburg, PA, USA, 2024; pp. 265–275. [Google Scholar] [CrossRef]
  40. Cong, Y. Pre-trained language models’ interpretation of evaluativity implicature: Evidence from gradable adjectives usage in context. In Proceedings of the Second Workshop on Understanding Implicit and Underspecified Language, Seattle, WA, USA, 15 July 2022; Association for Computational Linguistics: Stroudsburg, PA, USA, 2022; pp. 1–7. Available online: https://aclanthology.org/2022.unimplicit-1.1/ (accessed on 14 July 2025).
  41. Barattieri di San Pietro, C.; Frau, F.; Mangiaterra, V.; Bambini, V. The pragmatic profile of ChatGPT: Assessing the communicative skills of a conversational agent. Sist. Intell. 2023, 35, 379–400. [Google Scholar] [CrossRef]
  42. Ma, B.; Li, Y.; Zhou, W.; Gong, Z.; Liu, Y.J.; Jasinskaja, K.; Friedrich, A.; Hirschberg, J.; Kreuter, F.; Plank, B. Pragmatics in the era of large language models: A survey on datasets, evaluation, opportunities and challenges. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Long Papers), Vienna, Austria, 27 July–1 August 2025; Association for Computational Linguistics: Stroudsburg, PA, USA, 2025; pp. 8679–8696. [Google Scholar] [CrossRef]
  43. Hilviu, D.; Parola, A.; Bosco, F.M.; Marini, A.; Gabbatore, I. Grandpa, tell me a story! Narrative ability in healthy aging and its relationship with cognitive functions and Theory of Mind. Lang. Cogn. Neurosci. 2025, 40, 103–121. [Google Scholar] [CrossRef]
  44. Hilviu, D.; Gabbatore, I.; Parola, A.; Bosco, F.M. A cross-sectional study to assess pragmatic strengths and weaknesses in healthy ageing. BMC Geriatr. 2022, 22, 699. [Google Scholar] [CrossRef] [PubMed]
  45. Marini, A.; Petriglia, F.; D’Ortenzio, S.; Bosco, F.M.; Gasparotto, G. Unveiling the dynamics of discourse production in healthy aging and its connection to cognitive skills. Discourse Process. 2025, 62, 479–501. [Google Scholar] [CrossRef]
  46. Gabbatore, I.; Conterio, R.; Vegna, G.; Bosco, F.M. Longitudinal assessment of pragmatic and cognitive decay in healthy aging, and interplay with subjective cognitive decline and cognitive reserve. Sci. Rep. 2025, 15, 30835. [Google Scholar] [CrossRef] [PubMed]
  47. Searle, J.R. Indirect speech acts. In Syntax and Semantics (Volume 3): Speech Acts; Cole, P., Morgan, J.L., Eds.; Academic Press: New York, NY, USA, 1975; pp. 59–82. [Google Scholar]
  48. Kasher, A. Modular speech act theory: Programme and results. In Foundations of Speech Act Theory: Philosophical and Linguistic Perspectives; Tsohatzidis, S.L., Ed.; Routledge: Oxfordshire, UK, 1994; pp. 312–322. [Google Scholar] [CrossRef]
  49. Reuters. OpenAI’s Weekly Active Users Surpass 400 Million. 2025. Available online: https://www.reuters.com/technology/artificial-intelligence/openais-weekly-active-users-surpass-400-million-2025-02-20/?utm_source=chatgpt.com (accessed on 4 April 2025).
  50. OpenAI. Model Spec (Version 2025-04-11). Available online: https://model-spec.openai.com/2025-04-11.html (accessed on 11 April 2025).
  51. Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef]
  52. Agresti, A. An Introduction to Categorical Data Analysis, 2nd ed.; Wiley: Hoboken, NJ, USA, 2007. [Google Scholar] [CrossRef]
  53. Cochran, W.G. Some methods for strengthening the common χ2 tests. Biometrics 1954, 10, 417–451. [Google Scholar] [CrossRef]
  54. Benjamini, Y.; Hochberg, Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 1995, 57, 289–300. [Google Scholar] [CrossRef]
  55. Wasserstein, R.L.; Lazar, N.A. The ASA statement on p-values: Context, process, and purpose. Am. Stat. 2016, 70, 129–133. [Google Scholar] [CrossRef]
  56. Cong, Y. Manner implicatures in large language models. Sci. Rep. 2024, 14, 29113. [Google Scholar] [CrossRef]
  57. Zuccon, G.; Koopman, B.; Shaik, R. ChatGPT hallucinates when attributing answers. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP ’23), Beijing, China, 26–28 November 2023; Association for Computing Machinery: New York, NY, USA, 2023; pp. 46–51. [Google Scholar] [CrossRef]
  58. Searle, J.R. A classification of illocutionary acts. Lang. Soc. 1976, 5, 1–23. [Google Scholar] [CrossRef]
  59. Searle, J.R.; Vanderveken, D. Foundations of Illocutionary Logic; Cambridge University Press: Cambridge, UK, 1985. [Google Scholar]
  60. Chen, H.; Leu, M.C.; Yin, Z. Real-time multi-modal human–robot collaboration using gestures and speech. J. Manuf. Sci. Eng. 2022, 144, 101007. [Google Scholar] [CrossRef]
  61. Bisk, Y.; Holtzman, A.; Thomason, J.; Andreas, J.; Bengio, Y.; Chai, J.; Lapata, M.; Lazaridou, A.; May, J.; Nisnevich, A.; et al. Experience Grounds Language. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Online, 16–20 November 2020; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 8718–8735. [Google Scholar] [CrossRef]
  62. Strachan, J.W.A.; Albergo, D.; Borghini, G.; Pansardi, O.; Scaliti, E.; Gupta, S.; Saxena, K.; Rufo, A.; Panzeri, S.; Manzi, G.; et al. Testing theory of mind in large language models and humans. Nat. Hum. Behav. 2024, 8, 1285–1295. [Google Scholar] [CrossRef]
  63. Bosco, F.M.; Tirassa, M.; Gabbatore, I. Why pragmatics and theory of mind do not (completely) overlap. Front. Psychol. 2018, 9, 1453. [Google Scholar] [CrossRef]
  64. Gabbatore, I.; Bosco, F.M.; Tirassa, M. What are they all doing in that restaurant? Perspectives on the use of theory of mind. Front. Psychol. 2024, 15, 1507298. [Google Scholar] [CrossRef] [PubMed]
  65. Christiano, P.F.; Leike, J.; Brown, T.B.; Martic, M.; Legg, S.; Amodei, D. Deep reinforcement learning from human preferences. In Advances in Neural Information Processing Systems; Guyon, I., Von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, Available online: https://proceedings.neurips.cc/paper_files/paper/2017/file/d5e2c0adad503c91f91df240d0cd4e49-Paper.pdf (accessed on 8 September 2025).
  66. Sharma, M.; Tong, M.; Korbak, T.; Duvenaud, D.K.; Askell, A.; Bowman, S.R.; Cheng, N.; Durmus, E.; Hatfield-Dodds, Z.; Johnston, S.; et al. Towards understanding sycophancy in language models. arXiv 2025. [Google Scholar] [CrossRef]
  67. Wagner, I.; Chakradeo, K. Human-AI Complementarity in Diagnostic Radiology: The Case of Double Reading. Philos. Technol. 2025, 38, 1–31. [Google Scholar] [CrossRef]
  68. Harada, Y.; Suzuki, T.; Harada, T.; Sakamoto, T.; Ishizuka, K.; Miyagami, T.; Kawamura, R.; Kunitomo, K.; Nagano, H.; Shimizu, T.; et al. Performance evaluation of ChatGPT in detecting diagnostic errors and their contributing factors: An analysis of 545 case reports of diagnostic errors. BMJ Open Qual. 2024, 13, e002654. [Google Scholar] [CrossRef]
  69. Artsi, Y.; Sorin, V.; Glicksberg, B.S.; Korfiatis, P.; Nadkarni, G.N.; Klang, E. Large language models in real-world clinical workflows: A systematic review of applications and implementation. Front. Digit. Health 2025, 7, 1659134. [Google Scholar] [CrossRef]
Table 1. Included ABaCo Items Across Scales and Subscales.
Linguistic (44)
Comprehension: Basic speech acts → Questions = 4; Standard communicative acts = 4; Non-standard communicative acts → Deceit = 4, Irony = 4.
Production: Basic speech acts → Assertions = 4, Questions = 4, Commands = 4, Requests = 4; Standard communicative acts = 4; Non-standard communicative acts → Deceit = 4, Irony = 4.
Extralinguistic (30)
Comprehension: Basic speech acts → Assertions = 4, Questions = 4, Commands = 4, Requests = 4; Standard communicative acts = 4; Non-standard communicative acts → Deceit = 4, Irony = 4.
Production: Standard communicative acts = 1; Non-standard communicative acts → Deceit = 1.
Paralinguistic (22)
Comprehension: Basic communicative acts → Assertion = 2, Question = 2, Request = 2, Command = 2; Emotions = 8; Paralinguistic contradiction = 4.
Production: Basic communicative acts → Request = 1, Command = 1.
Context (20)
Comprehension: Discourse norms = 8; Social norms = 8.
Production: Social norms = 4.
Conversational (8)
Conversation (assessed holistically): Topic maintenance = 4; Turn-taking = 4.
Table 2. Confusion Matrix.
 | GPT-4o Score 0 | GPT-4o Score 1 | Total
Human Score 0 | 136 | 109 | 245
Human Score 1 | 111 | 1669 | 1780
Total | 247 | 1778 | 2025
Table 3. Cohen’s Kappa and Symmetric Measures.
Measure of Agreement (Kappa): Value = 0.491; Asymptotic Std. Error a = 0.030; Approximate T b = 22.096; Approx. Significance < 0.001.
Valid cases (N) = 2025.
a The null hypothesis is not assumed. b The asymptotic standard error is calculated under the assumption of the null hypothesis.
Table 4. Error Directionality (False Positive and False Negative) by Pragmatic Act.
Pragmatic Act | False Negative | False Positive | Total
Assertion | 2 | - | 2
Command | - | 18 | 18
Conversation—Topic | - | 2 | 2
Conversation—Turn-Taking | - | 2 | 2
Question | 1 | 4 | 5
Emotion | 1 | - | 1
Incongruity | 4 | 2 | 6
Deceit | 58 | 17 | 75
Irony | 32 | 28 | 60
Norm | 8 | 15 | 23
Social Norm | - | 12 | 12
Request | 1 | 7 | 8
Standard communicative acts | 4 | 2 | 6
Total | 111 | 109 | 220
Table 5. Chi-square.
Test | Value | df | Asymptotic Significance (2-Sided)
Pearson Chi-square | 69.431 a | 12 | <0.001
Likelihood Ratio | 85.744 | 12 | <0.001
N of valid cases | 220
a 16 cells (61.5%) have an expected count less than 5. The minimum expected count is 0.50.
