Large Language Models as Coders of Pragmatic Competence in Healthy Aging: Preliminary Results on Reliability, Limits, and Implications for Human-Centered AI
Round 1
Reviewer 1 Report
Comments and Suggestions for Authors
This manuscript evaluated GPT-4o's reliability as a coder for pragmatic communication assessment using the ABaCo battery, finding moderate agreement with human raters but systematic failures in detecting deceit and over-attributing commands. There are some questions and suggestions as follows:
The introduction lacks a systematic review of the current state of LLM applications in clinical assessment.
Regarding participants, why does this study test GPT-4o on healthy older adults but claims applicability to clinical assessments?
Although 2025 scoring units were generated, these data are highly dependent on only 10 participants, who may have individual biases. Moreover, the paper does not report inter-participant variance or consistency.
The results section merely reports statistical findings one by one without summarizing key points at the end of the section, and lacks direct answers to RQ1 and RQ2.
Author Response
- Comment 1: The introduction lacks a systematic review of the current state of LLM applications in clinical assessment.
- Response 1: Thank you for allowing us to further verify whether we had missed some important piece of recent literature. We expanded the already cited literature (Park et al., 2024; Tian, 2024; Vrdoljak, 2025) and looked for new references in the Introduction, which now contains a concise paragraph synthesizing LLM use in health/assessment and qualitative coding, while also briefly pointing to surveys of non-LLM automated coding (Yan et al., 2022) for baseline context; this also helps us comply with other reviewers’ comments. Change implemented in: Introduction
- Comment 2: Regarding participants, why does this study test GPT-4o on healthy older adults but claims applicability to clinical assessments?
- Response 2: ABaCo is a clinical battery used with both normative/control and clinical samples; the normative framework is integral to its scoring and interpretation. We therefore considered it methodologically more solid to begin with typical, yet still potentially nuanced (healthy-aging), responses to establish a clean baseline and avoid conflating model limits with disorder-specific profiles. Clinical validation is a next step, ideally in staged designs (e.g., single-case or small case-series) before broader generalization. Accordingly, we tempered the wording and clarified this in several sections of the paper. Change implemented in: Abstract; §2.2 Participants and Dataset; §6 Conclusion
- Comment 3: Although 2025 scoring units were generated, these data are highly dependent on only 10 participants, who may have individual biases. Moreover, the paper does not report inter-participant variance or consistency.
- Response 3: Thank you for providing the opportunity to further clarify this point. In our study the analytic unit is the individual scored response (ABaCo item × pragmatic dimension), not the participant. Agreement metrics are computed at the response level. While participants contribute by providing the responses, they are not the unit of analysis. This was explained in 2.2 (Participants and Dataset) but, in light of your comment, we have now further clarified this more explicitly in Methods. That said, the reviewer is right and we acknowledge the 10-participant constraint; we have therefore listed this point among the study limitations. Finally, we believe that estimating inter-participant variance, as suggested by the reviewer in the last comment, would address an interesting but different question, that is the variability of agreement between people, which was not the aim of this study. We considered proceeding with this analysis but we realized that, with only 10 participants, such estimates would be statistically unstable and risk over-interpreting small-sample idiosyncrasies. This point could be addressed in a further study, with a
wider sample. Change implemented in: §2.4 Statistical Analysis; §6 Limitations.
- Comment 4: The results section merely reports statistical findings one by one without summarizing key points at the end of the section, and lacks direct answers to RQ1 and RQ2.
- Response 4: Thank you for this comment, which gave us the opportunity to verify the way we presented the results and discussion. Following the reviewer’s suggestion, we inserted one concise sentence at the end of §3.1 (answering RQ1) and §3.2 (answering RQ2) stating the take-home numbers and the multiple-testing outcome. We thought this was better than adding a separate recap subsection, which would have duplicated the Discussion. A fuller synthesis and interpretation remains in the Discussion, as per the standard structure of scientific papers. Change implemented in: §3.1 and §3.2 Results
Reviewer 2 Report
Comments and Suggestions for Authors
The article, "Large Language Models as Encoders of Pragmatic Competencies in Healthy Aging: Reliability, Limitations, and Implications for Human-Centered Artificial Intelligence," is timely and relevant to the intersection of linguistics, cognitive science, and AI technologies. The empirical findings presented in the paper are interesting and well-documented. The authors employ appropriate methods to assess the applicability of large language models to pragmatic coding tasks. The text is clear and well-written.
I suggest that the authors address the problem of LLM error analysis in the discussion section.
The article demonstrates solid research quality and is ready for publication.
Author Response
We thank the reviewer for their appreciative comments and hope that our response effectively addresses their only suggestion regarding our work.
- Comment 1: I suggest that the authors address the LLM error analysis problem to gain access to the discussion section.
- Response 1: Following the reviewer’s suggestion, in the present version of the paper we have ameliorated and expanded the quantitative analysis. Specifically, we:
- (i) quantify discrepancies by adding Cramér’s V (an effect size for the act×error association) and per-act disagreement rates with 95% Wilson CIs (a brief computational sketch follows this list);
- (ii) make the Results–RQ link explicit by adding a one-sentence answer to each RQ at the end of the relevant Results subsection;
- (iii) add a compact summary (Table 7) with a failure-scenario matrix highlighting the two BH–FDR–surviving cells (Command-FP and Deceit-FN) and two minimal, anonymized exemplars illustrating these cases, in the Discussion/Appendix;
- (iv) discuss these new results in the Discussion section.
Changes implemented in: §2. Methods (§2.4.2 Distribution of discrepancies by pragmatic act); §3. Results (§3.1 Inter-Rater Agreement; §3.2 Analysis of Discrepancies); §4. Discussion (plus Table 7 in the Appendix); Table S3 (Supplementary Material)
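For illustration, a minimal sketch of how these quantities (Cramér’s V for the act×error association and per-act disagreement rates with 95% Wilson CIs) can be computed; the data frame, values, and column names below are illustrative assumptions, not our actual analysis code or data:

```python
# Illustrative sketch: Cramér's V and per-act disagreement rates with Wilson CIs.
# Data and column names are invented for illustration only.
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency
from statsmodels.stats.proportion import proportion_confint

# One row per scored response: pragmatic act, human code, model code (binary)
df = pd.DataFrame({
    "act":   ["Command", "Command", "Deceit", "Irony", "Deceit", "Norm"],
    "human": [0, 1, 1, 1, 1, 0],
    "model": [1, 1, 0, 1, 0, 0],
})
df["error"] = np.select(
    [(df.model == 1) & (df.human == 0), (df.model == 0) & (df.human == 1)],
    ["FP", "FN"], default="agree",
)

# Cramér's V from the act × error contingency table
table = pd.crosstab(df["act"], df["error"])
chi2, p, dof, _ = chi2_contingency(table)
n = table.values.sum()
cramers_v = np.sqrt(chi2 / (n * (min(table.shape) - 1)))

# Per-act disagreement rate with a 95% Wilson confidence interval
for act, g in df.groupby("act"):
    k, m = int((g["error"] != "agree").sum()), len(g)
    lo, hi = proportion_confint(k, m, alpha=0.05, method="wilson")
    print(f"{act}: {k}/{m} disagreements, 95% Wilson CI [{lo:.2f}, {hi:.2f}]")

print(f"Cramér's V = {cramers_v:.2f} (chi2 = {chi2:.1f}, p = {p:.3f})")
```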
Reviewer 3 Report
Comments and Suggestions for Authors
- Why is the evaluation limited to only ten participants? A small sample size may limit generalizability and statistical power; please clarify whether this pilot nature was intended or if recruitment constraints applied.
- How does the exclusion of prosodic and gestural cues affect the conclusions? Since these cues are critical for commands and deceit, omitting them may confound interpretation of GPT-4o’s limitations. Please discuss this as a boundary condition.
- Please justify the decision to use only text-based ABaCo items. While understandable for LLM input constraints, some pragmatic skills might not be fairly assessed without multimodal context—please add a rationale.
- The kappa value of 0.491 is interpreted as “moderate,” but what threshold would make this acceptable in clinical use? Please elaborate on how this reliability compares to inter-human coding baselines in ABaCo applications.
- Some categories like 'Irony' and 'Norm' showed high discrepancies but were not significant after correction. Could these still be practically important? Please consider reporting effect sizes or practical significance even when q > 0.05.
- Clarify how the prompt was constructed and whether it introduced bias. Since prompts shape LLM responses, explain if alternative phrasing was tested and how robustness was validated—please add a sensitivity analysis or brief justification.
- The current study emphasizes how crucial it is to incorporate real-time interaction and multimodal signals when evaluating pragmatic competence. However, interpretive robustness is limited by the absence of prosodic and gestural inputs. Citing earlier research on multimodal human–robot collaboration, which integrated speech and gestures for in-the-moment comprehension and decision-making, would help the authors' discussion. In order to design more context-sensitive and interactive AI systems for clinical or assistive settings, this paper, for instance, shows real-time multimodal human–robot collaboration using speech and gestures. https://doi.org/10.1115/1.4054297
Author Response
- Comment 1: Why is the evaluation limited to only ten participants? A small sample size may limit generalizability and statistical power; please clarify whether this pilot nature was intended or if recruitment constraints applied.
- Response 1: We thank the reviewer for this comment. In our study the analytic unit is the individual scored response (ABaCo item × pragmatic dimension), not the participant. Agreement metrics are computed at the response level. While participants contribute by providing the responses, they are not the unit of analysis. This was already explained in §2.2 (Participants and Dataset), but we have now clarified it explicitly in the Methods by adding a sentence. This methodological choice affords high precision for κ and a robust discrepancy test (see the χ² and Cramér’s V reported in the Results: both values indicate a large association, supporting detection of a non-random error structure). However, despite these promising results, we now explicitly frame the study as exploratory/pilot and acknowledge the recruitment of 10 participants as a limitation. Moreover, we added “Preliminary Results” to the title to make this clearer. Changes implemented in: Abstract; §1. Introduction; §2.4 Statistical Analysis; §4. Discussion; §5. Limitations; §6. Conclusion
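For clarity, a minimal sketch of what response-level agreement means computationally, where each entry corresponds to one scored response (ABaCo item × pragmatic dimension); the binary codes below are illustrative, not our actual analysis script or data:

```python
# Illustrative sketch: response-level agreement (Cohen's kappa).
# Each element is one scored response, not one participant; values are invented.
from sklearn.metrics import cohen_kappa_score

human_codes = [1, 0, 1, 1, 0, 1, 0, 1]  # binary human ratings, one per response
model_codes = [1, 1, 1, 0, 0, 1, 0, 1]  # GPT-4o ratings for the same responses

kappa = cohen_kappa_score(human_codes, model_codes)
print(f"Cohen's kappa = {kappa:.3f}")  # chance-corrected agreement at the response level
```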
- Comment 2: How does the exclusion of prosodic and gestural cues affect the conclusions? Since these cues are critical for commands and deceit, omitting them may confound interpretation of GPT-4o’s limitations. Please discuss this as a boundary condition.
- Response 2: We intentionally excluded items whose scoring depends solely on prosodic/gestural cues, both to avoid forcing the model to guess at absent cues and to avoid injecting subjective experimenter translations of non-verbal signals into the input. Moreover, to our knowledge there is as yet no standardized, validated way to encode prosody/gesture into text for LLM consumption; more importantly, including both the verbatim response and our interpretive paraphrases would confound the test. We reported these changes in the Methods. Please note that the observation that these cues matter for commands is post hoc (derived from the observed residuals and multimodal theory), not an a priori assumption. In fact, deception research suggests that overt non-verbal cues are often weak and unreliable, and deceivers frequently aim to mask rather than display their true intentions (e.g., Hartwig & Bond, 2011, https://doi.org/10.1037/a0023589). That said, we have now expanded the Limitations section. Changes implemented in: §2.2 Participants and Dataset; §5 Limitations
- Comment 3: Please justify the decision to use only text-based ABaCo items. While understandable for LLM input constraints, some pragmatic skills might not be fairly assessed without multimodal context—please add a rationale.
- Response 3: Thank you for this comment. We have now addressed and justified this point in the previous reply to the reviewer’s comment (Response 2).
- Comment 4: The kappa value of 0.491 is interpreted as “moderate,” but what threshold would make this acceptable in clinical use? Please elaborate on how this reliability compares to inter-human coding baselines in ABaCo applications
- Response 4: Thank you for this comment, which gives us the opportunity to further explain our point. We explore and support the use of the model only as a supervised second coder, not as a stand-alone clinical rater; we had already explained this view in the original version of the paper. Nevertheless, to make it clearer, in this revised version of the manuscript we specify that in resource-constrained settings (e.g., a single human coder), GPT-4o can serve as a reflective/QA second coder, for instance by helping clinicians re-examine decisions, flag potential inconsistencies, typos, and borderline cases, and improve consistency, while the final judgment remains human. Change implemented in: §4.2 Suggestions for Design
- Comment 5: Some categories like 'Irony' and 'Norm' showed high discrepancies but were not significant after correction. Could these still be practically important? Please consider reporting effect sizes or practical significance even when q > 0.05
- Response 5: Because these acts did not survive FDR correction, we do not claim practical importance; following ASA guidance (Wasserstein & Lazar, 2016), we agree that results with p > 0.05 should not be ignored but, at the same time, claims drawn on such results should be avoided (otherwise, every non-significant result would need to be discussed, a decision that would create confusion for the reader). That said, to comply with the reviewer’s suggestion, we now report a global effect size (Cramér’s V), which quantifies the magnitude (and not only the significance) of the act×error association; moreover, we report per-act disagreement rates in the Supplementary Material (to avoid mixing these results, which provide practical context, with those that are essential for the contribution). Therefore, the manuscript now includes: (i) FDR-adjusted Pearson standardized residuals, which localize where discrepancies occur; (ii) Cramér’s V, which summarizes how strong the act×error pattern is; and (iii) per-act disagreement rates, which describe how often failures occur by act. This should provide enough detail to cover all possible concerns on this point. Change implemented in: §2.4.2 Methods; §3.2 Results; Supplementary Material (Table S3); Discussion §4.1
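For illustration, a minimal sketch of how FDR-adjusted standardized Pearson residuals can be obtained for an act×error table; the contingency table, labels, and counts below are illustrative assumptions, not our data or analysis code:

```python
# Illustrative sketch: adjusted standardized Pearson residuals per cell,
# with Benjamini-Hochberg FDR correction across cells. Counts are invented.
import numpy as np
from scipy.stats import chi2_contingency, norm
from statsmodels.stats.multitest import multipletests

acts = ["Command", "Deceit", "Irony", "Norm"]
errors = ["FP", "FN"]
observed = np.array([[30,  5],
                     [ 4, 28],
                     [10,  9],
                     [ 8, 11]])

chi2, p, dof, expected = chi2_contingency(observed)

# Adjusted residuals: (O - E) / sqrt(E * (1 - row proportion) * (1 - column proportion))
n = observed.sum()
row_p = observed.sum(axis=1, keepdims=True) / n
col_p = observed.sum(axis=0, keepdims=True) / n
adj_resid = (observed - expected) / np.sqrt(expected * (1 - row_p) * (1 - col_p))

# Two-sided p-values per cell, then BH-FDR across all cells
pvals = 2 * norm.sf(np.abs(adj_resid)).ravel()
reject, qvals, _, _ = multipletests(pvals, alpha=0.05, method="fdr_bh")

for (i, j), q, rej in zip(np.ndindex(observed.shape), qvals, reject):
    flag = " *" if rej else ""
    print(f"{acts[i]:>8s} {errors[j]}: residual = {adj_resid[i, j]:+.2f}, q = {q:.3f}{flag}")
```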
- Comment 6: Clarify how the prompt was constructed and whether it introduced bias. Since prompts shape LLM responses, explain if alternative phrasing was tested and how robustness was validated—please add a sensitivity analysis or brief justification
- Response 6: We used an invariant, pre-registered prompt for all codings to maximize reproducibility and avoid mid-study drift. Before locking the prompt, we ran a small feasibility pretest on a subset of items to verify that the model consistently returned the exact output schema (one row per item, binary scores for each pragmatic dimension, and the requested formatting) without errors; we iteratively refined the wording only to eliminate formatting failures. We did not conduct a full robustness/sensitivity analysis (as acknowledged in the Limitations section of the original paper), which we see as a separate study: adding it here would have exceeded the scope and risked analytic multiplicity. We now (i) document the prompt-construction workflow and feasibility check in more detail, and (ii) state this limitation more explicitly. The verbatim prompt was already provided in Supplementary File S1 in the original version of the paper, so we thought no further modifications were needed for this part. Change implemented in: Methods §2.3 (new subsections, §2.3.2 Prompt Development; §2.3.3 Using the Prompt); Limitations §5
- Comment 7: The current study emphasizes how crucial it is to incorporate real-time interaction and multimodal signals when evaluating pragmatic competence. However, interpretive robustness is limited by the absence of prosodic and gestural inputs. Citing earlier research on multimodal human–robot collaboration, which integrated speech and gestures for in-the-moment comprehension and decision-making, would help the authors' discussion. In order to design more context-sensitive and interactive AI systems for clinical or assistive settings, this paper, for instance, shows real-time multimodal human–robot collaboration using speech and gestures. https://doi.org/10.1115/1.4054297
- Response 7: We agree that real-time, multimodal cues (prosody, gesture) are central to interaction; we now cite the suggested HRC work and Bisk et al. on language understanding (Chen et al., 2022; Bisk et al., 2020). At the same time, our study evaluates a text-only LLM as a coder on plain text, whereas HRC/robotics involves embodied, sensorimotor control loops; embodiment is a distinct requirement for robots (Asada & Cangelosi, 2024, 10.1016/j.device.2024.100605). Hence, inferences from that literature do not transfer straightforwardly to our setting. LLMs can of course be integrated into robots (e.g., Wang et al., 2024, 10.1016/j.jai.2024.12.003), but expanding this strand is beyond our scope. Changes implemented: §4. Discussion
Reviewer 4 Report
Comments and Suggestions for Authors
This research is interesting, as the authors proposed how to use an LLM to "automate" medical coding and evaluated whether it can be really effective and valid in the medical setting by assessing GPT-4o's ability to perform in this context. However, I point out some of my concerns below in bullet points:
- To date, there are already a lot of commercially available LLMs, whether paid or free. What is the reasoning why the authors chose GPT-4o over other LLMs, given that the authors also mentioned that GPT-4o "exhibit[s] inconsistent behavior and sub-optimal accuracy"?
- When does GPT fail at coding? What kind of scenario? If the authors can clearly define the different types of combinations that would likely lead to failure, that would be a great contribution so future research can think of a way to resolve those issues.
- I am not much knowledgeable in medical coding, but are there cases in which the coder needs to ask for more details or clarification about the case on what was done? If this is the case, I am skeptical of the robustness of LLMs to ask for clarifications because LLMs tend to hallucinate during these ambiguous cases, or give in to forcefulness, as the authors reported in their results for the pragmatic acts of Command and Deceit, respectively. I do not think "ask clarifying questions as appropriate" as quoted by the authors is enough, because I believe that LLMs do not have the ability to judge whether the currently available information is enough or needs to be clarified, with "as appropriate" being a very vague line to determine.
- Why is there no related work, or any discussion of previous studies about "automated" medical coding but with non-LLM systems? I think this is a very important baseline to cover; I share one such example of an article below. I also think that this study would benefit much more if there was not only a comparison between the proposed LLM system and the human benchmark, but also with respect to an automated non-LLM system. https://www.sciencedirect.com/science/article/pii/S2667102622000092
- I am a little bit on the fence about the originality of this research. The core idea of using LLMs in this kind of clinical assessment is not entirely new, although the use of the ABaCo framework may be somewhat new. The results are what we expect of how LLMs perform on vague scenarios that need high context, so the originality is not groundbreaking but rather incremental. As said above, I would like to see a comparative study not only with the human benchmark but also with non-LLM systems, and see the difference in performance.
As an overall recommendation, I would reconsider after major revision. The authors first need to defend their novelty and originality, make a new section to describe non-LLM systems that accomplish the same task, justify the use of this specific LLM (GPT-4o), and describe the conditions that would likely lead to the failure of the LLM.
Author Response
- Comment 1: To date, there are already a lot of commercially available LLMs, whether paid or free. What is the reasoning why the authors chose GPT-4o over other LLMs, given that the authors also mentioned that GPT-4o "exhibit[s] inconsistent behavior and sub-optimal accuracy"?
- Response 1: Thank you for giving us the opportunity to further elaborate on this choice. We selected GPT-4o because it is the most widely used consumer LLM (via ChatGPT), so testing its pragmatic coding has high external validity for real-world use: clinical staff and the general public frequently interact with this model. We also chose GPT-4o for its broad availability (free/low-cost tiers) and documented safety/limitations. These criteria are reported in the Methods. Prior reports that GPT-4o can be inconsistent under ambiguity are exactly why a clinically validated pragmatic instrument (ABaCo) is needed, i.e., to localize act-specific failure modes under controlled conditions rather than infer them from general impressions. One study’s observation does not foreclose further inquiry; rather, it motivates a targeted evaluation. In line with the reviewer’s comment, in the Limitations section we note that multi-model comparisons are recommended as future work beyond this first exploratory step. Changes implemented in: Methods §2.3.1 (new subsection, Rationale for Using GPT-4o); §5. Limitations
- Comment 2: When does GPT fail at coding? What kind of scenario? If the authors can clearly define the different types of combinations that would likely lead to failure, that would be a great contribution so future research can think of a way how to resolve those issues.
- Response 2: We described, in a narrative way, the failure patterns in the Discussion, which specifically pertain to Command and Deceit. Following the reviewer's suggestion and to make this clearer, we have now added a brief summary of the failure patterns in Table 7; we believe this more explicitly summarizes the Discussion’s narrative. Changes implemented in: Discussion §4 (new Table 7, Appendix)
- Comment 3: I am not much knowledgeable in medical coding, but are there cases in which the coder needs to ask for more details or clarification about the case on what was done? If this is the case, I am skeptical of the robustness of LLMs to ask for clarifications because LLMs tend to hallucinate during these ambiguous cases, or give in to forcefulness, as authors reported in their results of the pragmatic acts of Command and Deceit, respectively. I do not think "ask clarifying questions as appropriate" as quoted by the authors is enough, because I believe that LLMs do not have the ability to judge whether the current available information is enough or needs to be clarified, with "as appropriate" being a very vague line to determine
- Response 3: Thank you for raising this point. To address the reviewer’s concerns, we clarify that: (1) our task involves psychological/pragmatic scoring under the ABaCo protocol, which should not be equated with medical coding: coders (human or LLM) follow the exact structure of the standardized assessment battery and do not ask for additional details; they score the verbatim written responses. (2) No model-initiated clarifications occurred in our workflow. GPT-4o received a textual input, as stated in the previous version; we have now made this more explicit in the Methods. (3) The phrase “ask clarifying questions as appropriate” was not an instruction in our experiment but a quotation from OpenAI’s public documentation describing how GPT-4o can interact in general contexts. It belongs to the Discussion section and helps explain and interpret our results. Changes implemented in: §2.3 LLM Model and Prompt Engineering
- Comment 4: Why is there no related work, or any discussion of previous studies about "automated" medical coding but with non-LLM systems? I think this is a very important baseline to cover, one such example of article I share below. I also think that this study would benefit way more if there was not only a comparison between the proposed LLM system with the human benchmark, but also with respect to the automated non-LLM system. https://www.sciencedirect.com/science/article/pii/S2667102622000092
- Response 4: Thank you for raising this point, which prompted us to reflect further on this aspect. We have now added a concise related-work note on non-LLM automated clinical coding, citing Yan et al. (2022). However, we also wish to highlight that this literature addresses a different task/construct (document-level billing/code classification), whereas we evaluate item×dimension pragmatic scoring under ABaCo on an older-adult corpus. Because the second point (comparison with non-LLM systems) is contiguous with the next comment, for the sake of clarity further information on how it was addressed is provided in the following response. Changes implemented in: §1. Introduction
- Comment 5: I am a little bit on the fence on the originality of this research. The core idea of using LLMs in this kind of clinical assessment is not entirely new, although the use of ABaCo framework maybe somewhat new. The results are what we expect how LLMs perform on vague scenarios that need high context, so originality is not groundbreaking but rather incremental. As said above, I would like to see a comparative study not only with human benchmark but also on non-LLM systems and see their difference in performance
- Response 5: We agree that the idea of probing LLMs on clinical/pragmatic abilities is not entirely new, as already acknowledged in the Introduction. However, with particular reference to the conceptual originality of the study, our contribution is method- and setting-specific: (i) we are, to our knowledge, the first to apply the clinically validated, multi-dimensional, and widely used ABaCo battery for act-level evaluation on an older-adult corpus; (ii) we derive a theory-grounded discrepancy map (e.g., FP-Command; FN-Deceit) that localizes strengths/weaknesses across pragmatic dimensions; and (iii) we release open prompts/materials (OSF) to enable replication and secondary analysis. Regarding the suggested non-LLM baseline, we would respectfully note that this targets a different problem. A comparison with non-LLM systems is beyond the scope of this exploratory study. Moreover, rule-based or classical ML systems would require supervised training on ABaCo-labeled data and would evaluate an automatic scoring system, not the intrinsic pragmatic competence of a general-purpose model under zero/few-shot prompting. Note that we did not train or provide feedback to the LLM. As it would constitute a different study, not feasible within the present work, pursuing the reviewer's suggestion would require the following design: (1) establish whether an LLM can score ABaCo items reliably (as we do here); (2) only then test pre-specified hypotheses against purpose-built non-LLM classifiers trained on ABaCo-labeled data. Sections addressing this point are indicated below. Contribution and originality: §1. Introduction; exploratory nature of the paper and future work: Abstract, with other changes in §1. Introduction, §4. Discussion, §5. Limitations, §6. Conclusion
Round 2
Reviewer 1 Report
Comments and Suggestions for Authors
I think this revision has addressed all the comments and aligned with the publication standard.
Reviewer 3 Report
Comments and Suggestions for Authors
Accept.
Reviewer 4 Report
Comments and Suggestions for Authors
The authors addressed my concerns and revised the manuscript accordingly.
I can recommend the manuscript to be published as is, but one concern left is the fit of this paper as it will be published under "Computer Science and Engineering".
The contribution of this paper is more towards "applied psychology" or "linguistics", not computer science, especially since the authors claim that their "contribution is method- and setting-specific".
The quality is ok, but I will leave it up to the editors to decide the fit in their journal track.