1. Introduction
The phenotype of multiple sclerosis (MS) can be described in terms of both disease course and clinical presentation [1, 2, 3]. Whereas the disease course phenotype distinguishes relapsing-remitting from progressive patterns, the clinical presentation phenotype describes the neurologic signs and symptoms of an individual patient, including motor, sensory, visual, cerebellar, brainstem, sphincter, and cognitive manifestations [3, 4, 5]. In routine care, the clinical presentation phenotype of MS is typically documented in free-text neurology notes, creating challenges for systematic extraction and comparison across patients [6].
Extraction of the clinical presentation phenotype from electronic health records (EHRs) is important for MS research and routine clinical care [6]. In both clinical trials and patient care, assessment of symptom burden is central to evaluating the response to disease-modifying therapy and informing therapeutic decision-making [7, 8]. Swetlik et al. [9] have recommended including standardized discrete data elements in each clinical note (e.g., relapse status, disease course, and initial presentation), but such recommendations have not been widely adopted, and substantial relevant information remains embedded in clinician notes as unstructured free text.
The automated identification and extraction of patient phenotypes from EHRs has emerged as a central problem in biomedical informatics, owing to the large amount of clinically relevant information that resides in unstructured narrative text rather than in standardized discrete fields [10, 11]. Such phenotypic data support cohort discovery, longitudinal modeling, and secondary use of clinical data for observational and translational research. In the MS domain, prior work has demonstrated that detailed disease traits, including clinical subtypes and disability measures, can be extracted from routine electronic records with high reliability [6]. However, precise phenotype characterization in most settings still relies heavily on manual review of clinician notes, which is labor-intensive and difficult to scale.
Prior work has shown that high levels of inter-rater agreement can be achieved for neurologic concept annotation in clinical notes when structured annotation tools are used with clear guidelines. Oommen et al. [12] reported that agreement between human annotators for neurological signs and symptoms exceeded agreement between a convolutional neural network and human annotators, underscoring the difficulty of automated neurologic concept annotation.
Traditional approaches to phenotype extraction from clinical text have relied on rule-based systems, dictionary matching, or supervised machine learning [13, 14]. More recent approaches have incorporated neural networks and transformer encoders. Although these methods can be effective for well-defined extraction tasks, they may be limited when phenotype recognition requires integration of contextual information distributed across a clinical note. In contrast, large language models (LLMs) can perform zero-shot inference from natural-language task instructions, leveraging extensive pretraining on general and biomedical text to interpret clinical narratives without task-specific fine-tuning [15]. This capability is particularly relevant for MS phenotype identification, where clinical presentation is heterogeneous and often described indirectly rather than through standardized terminology; for example, a note may describe a patient as “clumsy on finger-to-nose testing” rather than explicitly stating “ataxia.” Accordingly, zero-shot LLM-based approaches may provide a scalable alternative for extracting clinically meaningful phenotypes from unstructured narrative documentation.
Because clinical interpretation of narrative documentation involves subjective judgment, human inter-rater reliability provides an important benchmark for evaluating automated annotation systems. Cohen’s κ is widely used to quantify agreement beyond chance and remains a standard measure in clinical annotation studies [16, 17]. Establishing the level of agreement achievable between trained human annotators for MS phenotype features is therefore essential for interpreting the performance of automated methods. Perfect concordance is unlikely in either human–human or human–machine comparisons, particularly for phenotypes that are rare, ambiguously documented, or variably expressed in clinical language. Human–human agreement should therefore be interpreted as a practical benchmark rather than as a strict upper bound, because automated systems may exceed individual annotators on some dimensions, particularly recall, while still differing in their false-positive error profiles.
LLMs have demonstrated strong performance across a range of medical natural language processing (NLP) and clinical reasoning tasks, often without task-specific fine-tuning [18, 19]. Related large clinical language models and biomedical transformers have also shown the value of domain-specific pretraining for medical NLP [20]. More recent work applying LLMs to high-throughput phenotyping of physician notes suggests that LLM-based approaches can outperform traditional deep learning and classical machine learning methods for some clinical phenotyping tasks [21]. These models may enable scalable extraction of phenotypes and disease status directly from narrative EHR documentation in zero-shot or low-supervision settings.
In this study, phenotype identification was performed at the level of the complete clinical note rather than at the level of individual text spans. The task was document-level phenotype recognition: for each note, the annotator or model determined whether each of 17 clinically meaningful phenotype categories was present or absent anywhere in the document. The task did not require localization of the exact supporting text span, counting repeated mentions, normalization to a canonical ontology term, or assignment of a machine-readable ontology identifier.
This distinction is important because many biomedical NLP pipelines combine several separable tasks under the broad label of concept or phenotype extraction. Span-level named entity recognition identifies the location of a relevant mention in text. Concept normalization maps that mention to a canonical term in a controlled vocabulary such as SNOMED CT or the Human Phenotype Ontology (HPO). Identifier linking assigns the corresponding ontology code. Each of these steps introduces distinct sources of variability and error. Prior work has shown that LLMs may accurately interpret biomedical or clinical concepts while failing to reliably retrieve the corresponding standardized identifiers [22, 23, 24]. By restricting the primary task to high-level note-level phenotype classification, the present study isolates clinical interpretation from the downstream challenge of ontology normalization.
The central question addressed in this study is whether a large language model can read a complete neurology note and arrive at the same high-level clinical interpretation of MS phenotype features as a human observer, as measured by inter-rater agreement and performance relative to an adjudicated reference annotation set. We additionally compare a frontier LLM (GPT-5.2) with a locally run open-source instruction-tuned LLM (Llama-3.1 8B), a supervised transformer encoder (BioClinical ModernBERT), and several HPO-based phenotype extraction tools (Doc2Hpo, PhenoSnap, and ClinPhen). Because the HPO-based tools produce more granular span- or concept-level ontology outputs, their results were post-processed into the same 17-category note-level representation used for human and LLM annotation.
2. Methods
2.1. Neurology Notes
A total of 12,661 de-identified neurology clinical notes from the University of Illinois Hospital, the primary teaching hospital of UI Health, were obtained from a REDCap database for the period 14 January 2016 through 2 September 2022. Notes were filtered to include only those longer than 600 words, associated with a diagnosis of multiple sclerosis (ICD-10-CM G35), and classified as Progress Note encounters. After deduplication, 4617 notes remained for analysis.
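The inclusion criteria above can be sketched as a simple filter. This is a minimal illustration, not the study's pipeline: the `Note` fields, the whitespace-based word count, and deduplication on whitespace-normalized text are all assumptions, since the source does not state how words were counted or how duplicates were identified.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Note:
    # Hypothetical record layout; field names are assumptions for illustration.
    note_id: str
    text: str
    icd10: str
    encounter_type: str


def select_eligible(notes, min_words=600):
    """Keep MS progress notes longer than min_words, dropping duplicate texts."""
    seen = set()
    kept = []
    for note in notes:
        # "Longer than 600 words": a note with exactly 600 words is excluded.
        if len(note.text.split()) <= min_words:
            continue
        if note.icd10 != "G35" or note.encounter_type != "Progress Note":
            continue
        # Assumed dedup key: the note text with whitespace normalized.
        key = " ".join(note.text.split())
        if key in seen:
            continue
        seen.add(key)
        kept.append(note)
    return kept
```

Applied in this order, the filter keeps the first occurrence of each duplicated note and discards later copies.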
Each clinical note represents a narrative summary of a patient encounter and serves as the unit of analysis in this study. Prior to annotation, notes were converted to JavaScript Object Notation (JSON) format, with each complete note represented as a single JSON object. This note-level representation preserves the full clinical context of the note, enabling evaluation of disease status and phenotype-feature identification as a task of clinical interpretation rather than isolated text-span extraction. Use of the de-identified clinical documentation for research was approved by the Institutional Review Board of the University of Illinois (Protocol No. 2017-0520Z).
2.2. Human Annotation Process
Human annotation was performed using Prodigy (version 1.18.0, Explosion AI, Berlin, Germany), an annotation platform for natural language processing workflows. Prodigy provides a locally hosted web interface and integrates with the spaCy library. Each clinical note was presented as a single annotation unit. This study was designed as an inter-rater agreement experiment on a subset of 100 notes, randomly sampled from the 4617 eligible MS progress notes. The objective was to estimate human–human and human–LLM agreement under controlled conditions rather than to perform large-scale annotation.
Two annotators independently labeled the notes: Annotator 1 (A1), a pre-medical student, and Annotator 2 (A2), a senior neurologist, enabling assessment across differing levels of clinical expertise.
Written definitions of phenotype features (Appendix C) were provided before annotation, and joint training sessions using representative notes were conducted to promote consistency. Disease diagnoses (e.g., multiple sclerosis, optic neuritis) were not annotated, and modifiers such as laterality, severity, and duration were not separately coded.
Annotation was performed using Prodigy’s textcat.manual recipe. Seventeen phenotype features (Appendix A) were annotated using a multi-label text classification interface, allowing multiple non-exclusive categories to be assigned to each complete note. In this annotation mode, Prodigy records document-level category assignments rather than start and end character offsets for individual text spans. Annotations were stored in an SQLite database and exported in JSON format for analysis.
The annotation framework, therefore, operated at the level of the complete clinical note. Phenotypes were recorded as present or absent, without counting repeated mentions, localizing exact text spans, or normalizing mentions to ontology identifiers. Span-based named entity recognition was not performed.
2.3. Annotation by LLM
The same set of 100 clinical notes used for human annotation was independently evaluated using a frontier large language model, GPT-5.2 (OpenAI; gpt-5.2-2025-12-11), accessed via the OpenAI Responses API. Hereafter, we refer to this model as “the LLM.” The temperature was set to 0 to minimize sampling variability, and no explicit maximum token limit was specified; default model settings were used for response length. The task was framed as multi-label classification at the level of the complete clinical note, using the same phenotype definitions provided to human annotators. Each note was processed independently using a fixed prompt. Model outputs were constrained using a predefined JSON schema to ensure structured and reproducible annotations. For each note, the model returned a list of phenotype items, each containing
- label: phenotype category (from predefined list)
- present: boolean indicating presence or absence
- evidence: supporting text excerpt from the note
- detail: optional clarification
An optional warning field was included to capture uncertainty or ambiguity in the model output. The full prompt, label definitions, and implementation details are provided in Appendix A, Appendix B and Appendix C.
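To illustrate how a structured response of this kind can be reduced to binary note-level labels, the sketch below converts one JSON response into a category-to-0/1 row. It is a hedged example only: the top-level `items` key and the three category names are assumptions made for illustration (the study's 17 labels and its actual schema are given in the appendices), and no model API call is shown.

```python
import json

# Hypothetical subset of category names, used only for this illustration.
CATEGORIES = ["gait impairment", "weakness", "pain"]


def items_to_row(raw_json, categories=CATEGORIES):
    """Convert one structured model response into a {category: 0/1} row."""
    payload = json.loads(raw_json)
    row = {category: 0 for category in categories}
    # "items" is an assumed key name for the returned list of phenotype items.
    for item in payload.get("items", []):
        label = item.get("label")
        if label not in row:
            continue  # ignore labels outside the predefined list
        if item.get("present") is True:
            row[label] = 1
    return row


example = json.dumps({
    "items": [
        {"label": "weakness", "present": True,
         "evidence": "4/5 strength in right leg", "detail": None},
        {"label": "pain", "present": False, "evidence": "", "detail": None},
    ]
})
print(items_to_row(example))
```

Categories that the response does not mention default to absent, which matches a present-or-absent coding at the level of the complete note.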
To assess the robustness of LLM annotation to prompt formulation, we performed a prompt-sensitivity analysis. We repeated annotation under alternative prompt conditions that varied three components of the LLM configuration: (1) temperature setting, (2) length and specificity of phenotype definitions, and (3) length and specificity of annotation rules. These experiments were intended to evaluate whether performance was strongly dependent on a particular prompt wording rather than to optimize the prompt. Each sensitivity run used the same predefined 10-note subset, the same 17 phenotype categories, and the same structured JSON output schema. Performance was compared using the same macro- and micro-averaged precision, recall, F1, and Matthews correlation coefficient metrics used in the primary analysis.
2.4. Adjudication Process to Create the Reference Annotation Set
A reference annotation set was constructed through a structured adjudication process to resolve discrepancies between annotators. The dataset comprised 100 clinical notes and 17 phenotypic features, yielding 1700 note–feature evaluations. For 1535 note–feature pairs (90.3%), both human annotators (A1 and A2) were in agreement, and the reference label was assigned directly based on this consensus. In 165 cases (9.7%), the human annotators were discordant. These discordant cases were re-reviewed and adjudicated by the senior neurologist (A2). Adjudication resulted in agreement with A2’s original label in 93 cases and with A1’s original label in 72 cases. This process yielded a complete adjudicated reference annotation set comprising 1700 note–feature labels.
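The adjudication logic and the reported counts can be checked with a short sketch. The function below is an illustration under assumed data structures (dictionaries keyed by note and feature); it is not the study's code.

```python
def build_reference(a1, a2, adjudicated):
    """Combine two annotators' labels into a reference set.

    a1, a2: dict mapping (note_id, feature) -> 0/1 for each annotator.
    adjudicated: dict with the adjudicator's label for each discordant pair.
    Returns the reference labels and the number of discordant pairs.
    """
    reference = {}
    n_discordant = 0
    for key in a1:
        if a1[key] == a2[key]:
            reference[key] = a1[key]  # consensus is accepted directly
        else:
            n_discordant += 1
            reference[key] = adjudicated[key]
    return reference, n_discordant


# Consistency check on the counts reported in the text.
total = 100 * 17
assert total == 1700
assert 1535 + 165 == total
assert 93 + 72 == 165
print(round(100 * 1535 / total, 1), round(100 * 165 / total, 1))  # 90.3 9.7
```

The arithmetic confirms that the concordant and discordant counts sum to the 1700 note–feature evaluations and that the two adjudication outcomes account for all 165 discordant cases.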
2.5. Review of GPT-5.2 Assignments Discordant with Reference Assignments
After the primary adjudicated reference annotation set was finalized, we performed a post hoc secondary review of GPT-5.2 assignments that were discordant with the reference set. This review was conducted to characterize apparent GPT-5.2 false positives and to distinguish hallucination from other sources of disagreement, including unsupported inference and potential omissions in the human-derived reference set. The review was not used to modify the primary reference set or to recompute the primary performance metrics.
For each apparent false-positive GPT-5.2 note–category assignment, we reviewed the model-provided evidence field and the corresponding clinical note context. Apparent false positives were categorized qualitatively as negation or symptom-resolution errors, historical-current conflation, medication-based inference, weakly supported inference from indirect evidence, or likely valid phenotype evidence not incorporated into the reference set. Apparent false-negative assignments were not reviewed in this secondary analysis because the absence of GPT-5.2 labels did not include model-provided evidence fields. This secondary review was used only for error characterization and interpretation of the primary results.
2.6. Comparator Methods
The adjudicated reference annotation set was used as the ground truth for evaluating automated comparator methods on the 17-category note-level phenotype classification task. Comparator methods were selected to represent three approaches to clinical phenotype extraction: HPO-based concept extraction tools, instruction-following LLMs, and supervised transformer encoders. The HPO-based tools included Doc2Hpo [25], ClinPhen [6], and PhenoSnap [26, 27]. The open-source LLM comparator was Llama-3.1 8B Instruct, run locally through Ollama. The supervised transformer comparator was BioClinical ModernBERT [28].
Doc2Hpo was evaluated as an HPO-based concept-extraction comparator. We used the standard Doc2Hpo API endpoint (https://doc2hpo.wglab.org/; accessed on 22 May 2026) to process the clinical notes. Doc2Hpo accepts free-text input and returns recognized Human Phenotype Ontology (HPO) terms, ontology identifiers, and span-level information, including detected text position and span length [25]. The tool was run with default settings, including matching against HPO terms and synonyms within the Phenotypic Abnormality subtree (HP:0000118), negation detection, and longest-match selection for overlapping annotations. Doc2Hpo therefore represents a traditional HPO-oriented extraction system that identifies span-level phenotype concepts and maps them to ontology terms and identifiers.
ClinPhen was evaluated as a rule-based HPO phenotype extraction comparator. We downloaded and installed ClinPhen locally and ran it on macOS. Each of the 100 MS progress notes was passed to ClinPhen as a separate free-text file. ClinPhen extracts phenotypes from clinical notes by mapping text to HPO terms. Its pipeline segments notes into sentences and subsentences, lemmatizes words, matches phenotype names and synonyms, and applies rule-based filters to exclude likely false-positive mentions, including negated findings and mentions referring to family members rather than the patient. ClinPhen returns HPO phenotype terms sorted by frequency of occurrence, first mention position, and HPO identifier.
PhenoSnap was evaluated as an additional HPO-based phenotype extraction comparator. Because no peer-reviewed manuscript describing PhenoSnap and no formal software release number were available at the time of analysis, we used the open-source implementation available from the WGLab GitHub repository at https://github.com/WGLab/PhenoSnap (accessed on 22 May 2026; commit a4199d55147347ffc9a26b9e3d282448b9b04774). The repository was downloaded and run locally on macOS. PhenoSnap performs local phenotype extraction from free text using spaCy and a locally downloaded HPO OBO file, without cloud-based LLMs or remote APIs. The tool uses phrase matching against HPO labels and synonyms, includes dependency-based negation detection, and outputs matched phenotype mentions with HPO identifiers, labels, character offsets, and negation status.
For all three HPO-based systems, native span- or concept-level outputs were post-processed into the same 17-category note-level representation used for human and LLM annotation. HPO terms and identifiers returned by these systems were mapped to the corresponding high-level phenotype categories using the curated lookup table described below. A phenotype category was coded as present for a note if one or more mapped HPO terms assigned to that category were detected.
To evaluate whether a smaller open-source LLM could perform the same note-level phenotype abstraction task, we ran Llama-3.1 8B Instruct locally using Ollama version 0.24.0 on macOS. The model was accessed via the local Ollama chat API using the model tag llama3.1:8b (model ID 46e0c10c039e; 8.0B parameters; Q4_K_M quantization), with a temperature of 0.0. The same 100 MS notes, phenotype definitions, and structured JSON output schema used for GPT-5.2 were used without fine-tuning, prompt optimization, or retrieval-augmented generation. Each note was processed independently as a complete note-level multilabel classification task. Model outputs were normalized to the same 17 binary phenotype categories and converted to a note-feature matrix for comparison with the adjudicated reference annotation set.
To include a contemporary supervised transformer baseline, we evaluated BioClinical ModernBERT (https://huggingface.co/collections/thomas-sounack/bioclinical-modernbert; accessed on 22 May 2026) using the Hugging Face Transformers framework. Unlike instruction-following LLMs, BioClinical ModernBERT is a pretrained clinical transformer encoder and does not directly perform phenotype classification without task-specific supervision. We therefore added a randomly initialized multilabel sequence-classification head with 17 output nodes, corresponding to the 17 phenotype categories, and fine-tuned the model on the annotated notes.
Evaluation was performed using 5-fold cross-validation with random shuffling and a fixed random seed. In each fold, the model was trained on 80 notes and evaluated on 20 held-out notes. The maximum input length was set to 4096 tokens to reduce truncation of long clinical notes; the model configuration supported sequences up to 8192 tokens. Training used 3 epochs, learning rate , weight decay 0.01, and batch size 1. Multilabel predictions were generated by applying a sigmoid transformation to the model logits and thresholding probabilities at 0.30. Macro- and micro-averaged precision, recall, F1, and Matthews correlation coefficient were computed across folds.
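The decision rule and fold structure described above can be sketched with the standard library alone. This is an illustration, not the training code: the seed value, the use of `>=` at the threshold, and the contiguous slicing of the shuffled indices are assumptions, and no model is trained here.

```python
import math
import random


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_labels(logits, threshold=0.30):
    """Turn one note's 17 logits into binary labels (inclusive threshold assumed)."""
    return [1 if sigmoid(z) >= threshold else 0 for z in logits]


def five_fold_indices(n_notes=100, k=5, seed=0):
    """Shuffle note indices with a fixed seed and return (train, test) pairs."""
    idx = list(range(n_notes))
    random.Random(seed).shuffle(idx)  # seed=0 is an assumed value
    fold_size = n_notes // k
    folds = []
    for i in range(k):
        test = idx[i * fold_size:(i + 1) * fold_size]
        held_out = set(test)
        train = [j for j in idx if j not in held_out]
        folds.append((train, test))
    return folds


for train, test in five_fold_indices():
    print(len(train), len(test))  # 80 20 for each of the five folds
```

A threshold of 0.30 rather than 0.50 makes the classifier more willing to assign a positive label, which trades precision for recall when positive examples are sparse.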
2.7. Mapping to High-Level Categories
Human annotators, GPT-5.2, Llama-3.1 8B, and BioClinical ModernBERT generated outputs directly as the 17 predefined note-level phenotype categories. In contrast, ClinPhen, Doc2Hpo, and PhenoSnap generated granular concept-level outputs as Human Phenotype Ontology (HPO) terms and identifiers. To enable comparison across methods, outputs from these HPO-based systems were mapped to the same 17 high-level phenotype categories used for human and LLM annotation.
Across the HPO-based systems, 382 unique HPO term–identifier pairs were identified. Each unique pair was manually reviewed using a curated lookup table. Of these, 147 were mapped to one of the 17 predefined phenotype categories, whereas 235 did not correspond to any target category and were excluded from the 17-category note-level grid.
For each note, a phenotype category was coded as present if one or more HPO terms mapped to that category were detected by the extraction system; otherwise, the category was coded as absent. False positives and false negatives were defined at the note–category level. A false positive occurred when a comparator method assigned a phenotype category as present for a note but that category was absent in the adjudicated reference annotation set. A false negative occurred when a phenotype category was present in the reference set but was not assigned by the comparator method. HPO terms that did not map to any of the 17 target categories were outside the predefined scoring schema and were therefore not counted as false positives.
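The mapping and note-level coding rule can be illustrated as follows. The lookup entries below are hypothetical examples chosen for illustration and are not taken from the study's curated table; the input is assumed to contain only non-negated HPO identifiers detected in a single note.

```python
# Hypothetical lookup entries for illustration only.
HPO_TO_CATEGORY = {
    "HP:0001288": "gait impairment",  # Gait disturbance
    "HP:0001324": "weakness",         # Muscle weakness
    "HP:0000707": None,               # reviewed, but outside the target categories
}


def note_row(detected_ids, categories, lookup=HPO_TO_CATEGORY):
    """Code each category as present if any mapped HPO term was detected."""
    row = {category: 0 for category in categories}
    for hpo_id in detected_ids:
        category = lookup.get(hpo_id)
        if category in row:
            row[category] = 1
        # Unmapped or unknown terms are ignored, so they never count as
        # false positives in the 17-category grid.
    return row


# Consistency check on the reported mapping counts.
assert 147 + 235 == 382

print(note_row(["HP:0001288", "HP:0000707"], ["gait impairment", "weakness"]))
```

Because presence requires only one mapped term, repeated mentions of the same phenotype within a note do not change the resulting row.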
2.8. Statistical Analysis
Agreement was assessed at the note–category level, with each clinical note contributing 17 binary phenotype decisions. Human–human and human–LLM agreement were summarized using unadjusted percent agreement and Cohen’s κ [16, 17]. Each annotation method, including A1, A2, GPT-5.2, Llama-3.1 8B, BioClinical ModernBERT, Doc2Hpo, ClinPhen, and PhenoSnap, was compared with the adjudicated reference annotation set.
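For two raters making binary decisions, Cohen’s κ is (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's marginal rates. The sketch below computes both quantities from two flattened lists of note–category decisions; it is a minimal illustration and returns NaN in the degenerate case where chance agreement equals 1, which is a convention chosen here rather than one stated in the source.

```python
def percent_agreement(r1, r2):
    """Fraction of decisions on which two raters agree."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)


def cohens_kappa(r1, r2):
    """Cohen's kappa for two equal-length lists of 0/1 decisions."""
    n = len(r1)
    p_o = percent_agreement(r1, r2)
    p1 = sum(r1) / n  # rater 1 positive rate
    p2 = sum(r2) / n  # rater 2 positive rate
    p_e = p1 * p2 + (1 - p1) * (1 - p2)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters are constant and equal
    return (p_o - p_e) / (1 - p_e)


rater_a = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
rater_b = [1, 0, 0, 0, 1, 0, 1, 1, 0, 0]
print(percent_agreement(rater_a, rater_b))  # 0.8
print(round(cohens_kappa(rater_a, rater_b), 3))  # 0.583
```

In the toy example the raters agree on 8 of 10 decisions, yet κ is about 0.58 because roughly half of that agreement would be expected by chance given their marginal rates.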
For each phenotype category, true positives, false positives, false negatives, and true negatives were computed relative to the reference set. Precision, recall, F1 score, and Matthews correlation coefficient (MCC) were then calculated using standard definitions. Macro-averaged metrics were computed as the unweighted mean across the 17 phenotype categories. Micro-averaged metrics were computed by pooling all note–category decisions across phenotypes. False-positive and false-negative rates were additionally summarized as mean false-positive and false-negative phenotype assignments per note.
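The difference between macro- and micro-averaging can be made concrete with a small sketch. The code below is an illustration under two assumptions: labels are stored as per-category lists of 0/1 values across notes, and any metric with a zero denominator is set to 0.0, a convention the source does not specify.

```python
import math


def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for two equal-length lists of 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def scores(tp, fp, fn, tn):
    """Precision, recall, F1, and MCC; zero denominators yield 0.0 (assumed)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


def macro_micro(reference, predicted):
    """reference, predicted: dict mapping category -> list of 0/1 per note."""
    counts = [confusion(reference[c], predicted[c]) for c in reference]
    per_category = [scores(*c) for c in counts]
    # Macro: unweighted mean of per-category metrics.
    macro = {k: sum(s[k] for s in per_category) / len(per_category)
             for k in per_category[0]}
    # Micro: pool the confusion counts first, then compute metrics once.
    pooled = tuple(sum(c[i] for c in counts) for i in range(4))
    return macro, scores(*pooled)
```

Macro-averaging gives rare categories the same weight as common ones, whereas micro-averaging is dominated by categories with many positive decisions.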
All analyses were performed in Python (3.10.20) using pandas (2.3.3), scikit-learn (1.7.1), matplotlib (3.10.9), seaborn (0.13.2), json, collections, requests, and the openai (2.30.0) packages.
Global performance metrics for the human annotators and automated comparator methods were computed against the adjudicated reference annotation set. Macro-averaged metrics are unweighted means across the 17 phenotype categories, and micro-averaged metrics are pooled across all note–category decisions. MCC denotes the Matthews correlation coefficient computed over pooled note–category decisions. Values in parentheses are note-level bootstrap 95% confidence intervals, estimated by resampling notes with replacement. Each bootstrap sample contained the same number of notes as the original analysis for that method, preserving the 17 phenotype decisions within each sampled note. Metrics were recomputed for 5000 bootstrap samples, and the 2.5th and 97.5th percentiles were used as confidence limits (Appendix D, Table A1, Table A2 and Table A3).
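The note-level bootstrap can be sketched as follows. This is a simplified illustration: the seed, the nearest-rank percentile indexing, and the example metric `micro_recall` are assumptions introduced here, and each record is assumed to hold one note's reference and predicted label vectors so that the decisions within a note are always resampled together.

```python
import random


def micro_recall(sample):
    """Pooled recall over a list of (reference_vector, predicted_vector) notes."""
    tp = fn = 0
    for ref, pred in sample:
        for r, p in zip(ref, pred):
            if r == 1 and p == 1:
                tp += 1
            elif r == 1:
                fn += 1
    return tp / (tp + fn) if tp + fn else 0.0


def bootstrap_ci(note_rows, metric, n_boot=5000, seed=0, alpha=0.05):
    """Percentile bootstrap interval, resampling whole notes with replacement."""
    rng = random.Random(seed)  # seed=0 is an assumed value
    n = len(note_rows)
    stats = []
    for _ in range(n_boot):
        # Each resample has the same number of notes as the original set.
        sample = [note_rows[rng.randrange(n)] for _ in range(n)]
        stats.append(metric(sample))
    stats.sort()
    # Simplified nearest-rank percentiles for the 2.5th and 97.5th limits.
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lower, upper
```

Resampling notes rather than individual note–category decisions respects the correlation among the 17 decisions made on the same note, which would otherwise make the intervals too narrow.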
4. Discussion
This study evaluated whether large language models can perform note-level recognition of clinically meaningful MS phenotype categories from narrative neurology notes. The principal finding is that GPT-5.2 achieved the strongest automated performance and approached the performance of human annotators when evaluated against an adjudicated reference annotation set. GPT-5.2 showed particularly high recall, identifying most reference-positive phenotype categories across the 100-note set. Its macro-F1 and macro-averaged MCC were close to those of Annotator 1 and approached those of Annotator 2, the senior neurologist. These results suggest that frontier LLMs can perform clinically meaningful document-level phenotype recognition in a zero-shot setting.
The performance pattern differed between human and LLM annotations. Annotator 2 showed the most conservative profile, with the highest precision and the fewest false-positive assignments per note, but more false-negative omissions than GPT-5.2. In contrast, GPT-5.2 showed the lowest false-negative count per note and the highest recall, but generated more apparent false positives than either human annotator. This tradeoff is clinically plausible. Manual annotation of complete clinical notes across 17 simultaneous phenotype categories imposes a substantial vigilance burden on human annotators, creating opportunities for omission errors. LLMs are not subject to vigilance decrement in the same way and may therefore apply the annotation schema more exhaustively. However, this advantage was offset by a greater tendency toward overinclusive assignment.
The qualitative review of GPT-5.2 discordant assignments provided further insight into this error profile. We did not identify clear hallucinations among the apparent false-positive assignments. Instead, discordances reflected recognizable mechanisms, including negation or symptom-resolution errors, medication-based inference, weak inference from indirect evidence, historical–current state conflation, and cases in which GPT-5.2 appeared to identify valid phenotype evidence that had not been incorporated into the human-derived reference set. The largest subgroup consisted of likely valid assignments that had been missed during human annotation or adjudication. This finding suggests that the measured precision of GPT-5.2 may be conservative, because some apparent false positives may reflect omissions in the reference annotation set rather than true model errors. Importantly, we did not recompute the primary performance metrics after this secondary review, and the original adjudicated reference set remained the basis for all primary analyses.
The comparator analyses help place the GPT-5.2 results in context. Llama-3.1 8B, run locally through Ollama without fine-tuning or retrieval-augmented generation, performed substantially better than the HPO-based extraction systems and the supervised transformer comparator, although below GPT-5.2. This finding suggests that note-level phenotype abstraction is not unique to proprietary frontier models, but that model scale and capability remain important. The performance of Llama-3.1 8B is notable because it was run locally using the same structured prompt and output schema as GPT-5.2. Locally executable LLMs may therefore be attractive in settings where cost, privacy, reproducibility, or infrastructure constraints limit reliance on cloud-based APIs.
The HPO-based tools, including Doc2Hpo, ClinPhen, and PhenoSnap, performed less well on the present task. This should not be interpreted simply as a failure of those systems. They were designed primarily for span- or concept-level phenotype extraction and ontology linking, not for broad note-level classification into our 17 MS phenotype categories. Their native outputs include HPO terms and identifiers, which are more granular than the target labels used in this study. To compare them with human and LLM annotations, we mapped their HPO outputs to the 17 high-level categories. This post-processing necessarily changed the nature of their native task. Nonetheless, the results highlight an important practical limitation: tools optimized for lexical or ontology-based span extraction may miss phenotype evidence when it is expressed indirectly, distributed across the note, or described in clinical-functional terms rather than as a canonical ontology label. For example, a note may support gait impairment through descriptions of cane use, poor balance, or difficulty ambulating, even if it does not contain an exact phrase that maps cleanly to a specific HPO gait term.
BioClinical ModernBERT showed limited performance, likely reflecting the small number of notes available for supervised training. This result should be interpreted cautiously. BioClinical ModernBERT is a pretrained transformer encoder, not an instruction-following generative model. To apply it to the present task, we added a randomly initialized 17-label classification head and fine-tuned the model on the annotated notes. In 5-fold cross-validation, each fold provided only 80 training notes, with sparse positive examples for several phenotype categories. Thus, the limited performance likely reflects the difficulty of fine-tuning a supervised classifier in a low-resource setting rather than a lack of clinically useful representations in the pretrained encoder. By contrast, instruction-following LLMs (GPT-5.2 and Llama-3.1 8B) can apply natural-language task definitions directly, which may make them more practical when few training examples are available.
These findings support a distinction between document-level phenotype recognition and conventional span-level concept extraction. Many biomedical NLP pipelines, including tools such as ClinPhen, Doc2Hpo, and PhenoSnap, decompose phenotyping into sequential steps: span detection, concept normalization, ontology identifier assignment, and aggregation into analysis-ready variables. This fine-grained approach is essential when exact mention localization, ontology alignment, or knowledge representation is the primary goal. The present study addressed a different task: determining whether a clinically meaningful phenotype category was present in any part of the complete note. For this document-level task, exact span boundaries and ontology identifiers were not required. Instead, the relevant challenge was clinical interpretation: whether the note supported a high-level phenotype such as gait impairment, weakness, pain, cognitive symptoms, bladder dysfunction, or hyperreflexia. Thus, the unit of analysis was the clinical note, not the individual phenotype mention.
The note-level approach may be particularly useful for population-health and research applications. Large EHR corpora contain thousands or millions of narrative notes that cannot feasibly be annotated manually. Converting these notes into lower-dimensional phenotype variables could support cohort discovery, epidemiologic studies, longitudinal outcome tracking, real-world evidence generation, quality improvement initiatives, and downstream machine learning. In this setting, the goal is not to replace clinical judgment for an individual patient, but to create scalable, interpretable variables for aggregate analysis. This intended use is distinct from real-time clinical decision support or individualized precision medicine, where false positives and false negatives may have immediate consequences and would require prospective validation, monitoring, and human oversight.
The prompt-sensitivity analysis suggested that GPT-5.2’s performance was not highly dependent on a narrowly optimized prompt formulation. Across variations in temperature, rule length, and phenotype-definition length on a predefined 10-note subset, performance differences were modest. This finding should be interpreted cautiously because the sensitivity analysis was small, but it suggests that the main result was not simply an artifact of a highly tuned prompt. Future work should examine whether additional rules, such as explicitly prohibiting phenotype inference from medication use alone, can reduce false positives without compromising recall.
This study has several limitations. First, the dataset was small: only 100 notes were manually annotated, although each note contributed 17 binary phenotype decisions. No formal power calculation was performed. The sample was sufficient for a controlled inter-rater agreement and proof-of-concept comparator study, and uncertainty was estimated using note-level bootstrap confidence intervals. However, larger datasets are needed to obtain more stable phenotype-specific estimates, particularly for rare categories such as hyperreflexia, spasticity, dysphagia, and bowel symptoms.
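A note-level bootstrap resamples whole notes, rather than individual note-feature pairs, so that the correlated phenotype decisions within a note stay together. The sketch below shows one common way to do this with a percentile interval. It is not the study's analysis code: the choice of micro-averaged recall as the metric, the number of resamples, the percentile method, and the toy data are all assumptions.

```python
import random


def micro_recall(gold, pred):
    """Micro-averaged recall over all note-feature pairs."""
    tp = fn = 0
    for gold_row, pred_row in zip(gold, pred):
        for g, p in zip(gold_row, pred_row):
            if g == 1:
                if p == 1:
                    tp += 1
                else:
                    fn += 1
    return tp / (tp + fn) if (tp + fn) else float("nan")


def bootstrap_ci(gold, pred, metric, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI, resampling notes (rows) with replacement."""
    rng = random.Random(seed)
    n_notes = len(gold)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n_notes) for _ in range(n_notes)]
        value = metric([gold[i] for i in idx], [pred[i] for i in idx])
        if value == value:  # drop NaN from resamples with no positives
            stats.append(value)
    if not stats:
        raise ValueError("metric was undefined in every resample")
    stats.sort()
    lower = stats[int((alpha / 2) * (len(stats) - 1))]
    upper = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lower, upper


# Toy data: 6 notes x 3 binary phenotype labels (synthetic, not study data).
gold = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0], [0, 1, 1]]
pred = [[1, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 0], [0, 1, 0]]
print(micro_recall(gold, pred))
print(bootstrap_ci(gold, pred, micro_recall))
```

With only a handful of notes containing a rare phenotype, many resamples contain few or no positive examples, which is one reason phenotype-specific intervals are wide in small samples.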
Second, the reference annotation set was derived from two human annotators, and discordant labels were adjudicated by A2, who was also one of the original annotators. An independent adjudication committee would have been preferable and may have reduced potential incorporation bias. Although 90.3% of note–feature pairs were concordant before adjudication, this limitation should be considered when interpreting human–machine comparisons.
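Pre-adjudication concordance over note-feature pairs can be computed as sketched below; with 100 notes and 17 features there are 1,700 such pairs. This is a minimal illustration under assumed inputs (two equally shaped binary matrices, one per annotator), with toy data rather than the study annotations, and the function name is hypothetical.

```python
def pairwise_agreement(annotator_1, annotator_2):
    """Return (fraction of concordant pairs, list of discordant (note, feature))."""
    total = 0
    discordant = []
    for note_idx, (row_1, row_2) in enumerate(zip(annotator_1, annotator_2)):
        for feature_idx, (x, y) in enumerate(zip(row_1, row_2)):
            total += 1
            if x != y:
                discordant.append((note_idx, feature_idx))
    if total == 0:
        raise ValueError("no note-feature pairs to compare")
    return (total - len(discordant)) / total, discordant


# Toy data: 2 notes x 3 features (synthetic).
a1 = [[1, 0, 1], [0, 0, 1]]
a2 = [[1, 1, 1], [0, 0, 1]]
agreement, to_adjudicate = pairwise_agreement(a1, a2)
print(agreement, to_adjudicate)
```

The list of discordant pairs is the set that would be passed to adjudication. Raw percent agreement does not correct for chance agreement, which can be high when most labels are negative.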
Third, the study was performed on MS progress notes from a single academic health system. Notes shorter than 600 words were excluded because they were often task-specific encounters that lacked a complete history and neurological examination. Performance may differ in other institutions, specialties, note types, EHR templates, shorter notes, or patient populations. Fourth, the HPO-based comparators did not produce usable outputs for all notes in our workflow, and the causes of failure were not systematically investigated. Fifth, the supervised transformer comparator was limited by the small number of training examples. The BioClinical ModernBERT results should therefore not be interpreted as a definitive benchmark of supervised transformer performance, but as an illustration of the challenge of training such models under low-resource conditions. Sixth, the secondary review of GPT-5.2 discordant assignments was exploratory and was not used to revise the primary reference set or recompute the main performance metrics.
Finally, this study did not evaluate the full ontology normalization pipeline. We did not require exact span localization, canonical HPO term selection, SNOMED CT mapping, or ontology identifier assignment. Strong performance on note-level phenotype recognition should therefore not be interpreted as evidence that LLMs can reliably perform fine-grained ontology normalization or identifier linking. Conversely, the lower performance of HPO-based systems in this study should not be interpreted as evidence that those tools are ineffective for their native span-level or ontology-linking tasks. All methods were evaluated only on a high-level, note-level phenotype classification endpoint.
In summary, GPT-5.2 performed at a level comparable to human annotators for document-level MS phenotype recognition and achieved particularly high recall across phenotype categories. Llama-3.1 8B demonstrated that smaller locally run instruction-tuned LLMs can also perform this task with useful accuracy, although below GPT-5.2. HPO-based extraction tools and a supervised transformer encoder were less effective for this low-resource note-level classification task. These results support the feasibility of LLM-assisted document-level phenotyping as a potentially scalable approach for research and population-health analysis of large EHR note corpora, while emphasizing the need for careful error characterization, transparent reference standards, and prospective validation before use in clinical decision support.