Article

Prior Knowledge Shapes Success When Large Language Models Are Fine-Tuned for Biomedical Term Normalization

1 Department of Neurology & Rehabilitation, University of Illinois at Chicago, Chicago, IL 60612, USA
2 Laboratory for Applied Artificial Intelligence, Loyola University Chicago, Chicago, IL 60611, USA
* Author to whom correspondence should be addressed.
Information 2025, 16(9), 776; https://doi.org/10.3390/info16090776
Submission received: 6 August 2025 / Revised: 29 August 2025 / Accepted: 5 September 2025 / Published: 7 September 2025

Abstract

Large language models (LLMs) often fail to correctly associate biomedical terms with their standardized ontology identifiers, posing challenges for downstream applications that rely on accurate, machine-readable codes. These linking failures can compromise the integrity of data used in precision medicine, clinical decision support, and population health. Fine-tuning can partially remedy these issues, but the degree of improvement varies across terms and terminologies. Focusing on the Human Phenotype Ontology (HPO), we show that a model’s prior knowledge of term–identifier pairs, acquired during pre-training, strongly predicts whether fine-tuning will enhance its linking accuracy. We evaluate prior knowledge in three complementary ways: (1) latent probabilistic knowledge, revealed through stochastic prompting, captures hidden associations not evident in deterministic output; (2) partial subtoken knowledge, reflected in incomplete but non-random generation of identifier components; and (3) term familiarity, inferred from annotation frequencies in the biomedical literature, which serve as a proxy for training exposure. We then assess how these forms of prior knowledge influence the accuracy of deterministic identifier linking. Fine-tuning performance varies most for terms in what we call the reactive middle zone of the ontology: terms with intermediate levels of prior knowledge that are neither absent nor fully consolidated. Fine-tuning was most successful when prior knowledge, as measured by partial subtoken knowledge, was ‘weak’ or ‘medium’, or when prior knowledge, as measured by latent probabilistic knowledge, was ‘unknown’ or ‘weak’ (p < 0.001). These terms from the ‘reactive middle’ exhibited the largest gains or losses in accuracy during fine-tuning, suggesting that the success of knowledge injection critically depends on the level of term–identifier pair knowledge in the LLM before fine-tuning.

1. Introduction

Biomedical term normalization—the process of mapping natural language expressions to standardized ontology concepts with machine-readable identifiers—is a cornerstone of precision medicine and biomedical research. Accurate normalization enables clinical and scientific text to be aligned with structured knowledge resources, thereby supporting reproducible analyses and computational reasoning [1,2,3,4,5,6,7]. Beyond biomedicine, there is growing interest in leveraging ontologies, knowledge graphs, and large language models for knowledge management across a wide range of domains, including ergonomics, construction safety, and information systems [8,9,10,11,12,13,14,15].
Two widely adopted biomedical ontologies are SNOMED CT, maintained by SNOMED International, with more than 360,000 concepts used to document patient care, and the Gene Ontology (GO), with more than 39,000 concepts supporting the functional annotation of genes and proteins [16,17]. The Human Phenotype Ontology (HPO) is a structured vocabulary for describing phenotypic abnormalities associated with human disease [7,18,19]. It plays an important role in elucidating genetic disease mechanisms: its more than 18,000 terms have been used to create 268,000 annotations, each linking a phenotypic feature to a specific disease, enabling large-scale biomedical research into the genetic basis of human disease [20].
Successful use of the HPO for disease annotation relies on the accurate linking of each phenotype term to its corresponding ontology identifier [7]. For example, Charcot–Marie–Tooth disease type 2B2 is annotated with phenotypic features such as Distal muscle weakness (HP:0002460), Areflexia (HP:0001284), and Distal sensory impairment (HP:0002936). Each HPO term is expressed in sentence case and uniquely identified by a seven-digit code prefixed with HP. Precise alignment of the term–identifier pair is essential for ensuring semantic accuracy and interoperability, enabling downstream applications such as diagnostic decision support, cohort discovery, and automated biomedical data analysis [7,20].
Large language models (LLMs) are increasingly used in biomedical natural language processing (NLP) tasks such as entity recognition, relation extraction, and term normalization [21]. Despite their broad success, LLMs often struggle with the seemingly straightforward task of biomedical term normalization [22]. Even state-of-the-art models frequently fail to link Human Phenotype Ontology (HPO) terms to their correct identifiers. A core limitation lies in the pre-training paradigm: autoregressive LLMs such as GPT-4 and LLaMA 3.1 are optimized for next-token prediction, not explicit fact memorization. While these models are exposed to vast biomedical corpora during pre-training, many rare biomedical concepts remain sparsely represented. As a result, model performance drops sharply across the long tail of biomedical vocabularies, where domain-specific terms are underexposed [23]. For instance, when GPT-4 was queried for identifiers corresponding to 18,880 HPO terms, it returned the correct identifier for only 8% [22].
Fine-tuning offers a potential remedy. Targeted adaptation using parameter-efficient methods such as LoRA has shown promise for injecting missing knowledge into large language models (LLMs) [24,25,26,27,28]. However, recent evaluations caution that these gains are often inconsistent: smaller models struggle to generalize reliably, and improvements in newly injected content may come at the cost of degrading existing knowledge—particularly when that knowledge is fragile or only weakly anchored [24,29,30,31,32,33,34].
A growing body of work suggests that while fine-tuning enables models to memorize new term–identifier mappings, it confers only a limited ability to generalize from these mappings to unseen expressions [32,33]. Rather than promoting conceptual integration, fine-tuning may act as a form of rote injection, reinforcing isolated facts without building robust representations. Consequently, the success of fine-tuning appears to depend not only on the added data but also on how well the target concept is already embedded in the model’s pre-training knowledge [33,35]. It remains an open question whether certain biomedical ontologies, by virtue of their structure or the prevalence of their concepts in training data, support broader conceptual generalization during fine-tuning than others.
This raises a core question: Which biomedical terms are most likely to benefit from fine-tuning, and which are most vulnerable to degradation? We hypothesize that both improvement and degradation are systematically shaped by the model’s prior knowledge of each term–identifier pair.
To test this hypothesis, we evaluate the predictive value of three dimensions of prior knowledge, defined as the model’s ability to produce or approximate correct identifier mappings before fine-tuning:
  • Latent probabilistic knowledge. Hidden or partially accessible knowledge revealed through probabilistic querying of the model, even when deterministic (greedy) decoding fails [36]. For example, if an LLM is queried 100 times and returns the correct ontology ID only 5% of the time, this indicates latent probabilistic knowledge, which is distinct from the identifier being entirely unknown.
  • Partial subtoken knowledge. Incomplete but non-random knowledge of the subtoken sequences comprising ontology identifiers, reflected in deterministic outputs that are close to, but not exactly, correct. For example, an LLM that predicts HP:0001259 for Ataxia instead of the correct HP:0001251 demonstrates partial subtoken knowledge, even though greedy decoding produces an incorrect response.
  • Term familiarity. The likely exposure of the model to specific term–identifier pairs during pre-training, estimated using external proxies such as annotation frequency in OMIM and Orphanet [37,38], and identifier frequency in the PubMed Central (PMC) corpus [39]. For example, the LLM is more likely to be familiar with the term Hypotonia (decreased tone), which has 1783 disease annotations in HPO, than with Mydriasis (dilated pupils), which has only 25 annotations.
Across these three dimensions (latent probabilistic knowledge, partial subtoken knowledge, and term familiarity), we uncover a striking pattern: terms with intermediate levels of prior knowledge, neither fully consolidated nor entirely absent [36], are the most responsive to fine-tuning. These reactive middle terms exhibit both the largest improvements and the greatest degradations, whereas terms at the extremes remain more stable and less affected by fine-tuning. This dual effect suggests that fine-tuning is both most effective and most disruptive in regions of moderate prior knowledge. Our findings extend prior work by Pletenev et al. [33] and Gekhman et al. [35,36] and underscore the importance of considering term susceptibility to fine-tuning when attempting knowledge injection.
The remainder of this paper is organized as follows: Section 2 describes the materials and methods, including the experimental setup, probabilistic querying protocol, subtoken analysis, and familiarity scoring. Section 3 presents the results of our evaluation of fine-tuning performance and the predictive value of the three dimensions of prior knowledge. Section 4 discusses the implications of these findings, including how latent knowledge and term amenability shape fine-tuning outcomes, and outlines directions for future research. Section 5 concludes with practical recommendations for designing effective knowledge injection strategies for biomedical term normalization.

2. Materials and Methods

2.1. Workflow

Figure 1 summarizes the overall workflow. A total of 1618 de-identified clinical notes were mapped to the Human Phenotype Ontology (HPO, 18,988 terms), from which we curated a test set of 799 ground-truth term–identifier pairs. From this set, we derived four measures of prior knowledge: (1) annotation counts from curated HPO disease annotations, (2) PMC identifier counts obtained via the PubMed Central API, (3) latent probabilistic knowledge estimated from LLaMA 3.1 8B (base model, temperature = 1.0, top-k = 50), and (4) partial subtoken knowledge from deterministic queries (temperature = 0.0, top-k = 1). Each term was also assigned a fine-tuning outcome class (Correct, Incorrect, Gainer, or Loser) based on the fine-tuned LLaMA 3.1 8B model. These measures were then integrated into unified statistical analyses (ANOVA, t-tests, and chi-square tests) and compared across groups.

2.2. HPO Dataset

We obtained the Human Phenotype Ontology (HPO) in CSV and OBO formats from BioPortal (https://bioportal.bioontology.org/ontologies/HP, 1 August 2025) and the OBO Foundry (http://purl.obolibrary.org/obo/hp.obo, 1 August 2025), respectively. As of 5 May 2025, the ontology comprised 23,065 classes (18,988 were active phenotype terms used for disease annotations). Each entry includes the standardized term label, unique HPO identifier, hierarchical parent–child relationships, synonyms, and definitions.
Phenotype–disease associations were obtained from the HPO annotation file (https://hpo.jax.org/data/annotations, 1 August 2025), containing 272,061 annotations that span 12,691 diseases. Most annotations are derived from OMIM and Orphanet. On average, each disease has 21 phenotype annotations.
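The following sketch illustrates this data-preparation step; it assumes the obonet and pandas packages and a locally downloaded phenotype.hpoa annotation file, and the hpo_id column name is an assumption that may differ between HPO annotation releases.

```python
# Minimal sketch: load HPO term labels and count disease annotations per
# identifier. Assumes obonet and pandas; the "hpo_id" column name is an
# assumption that may differ between HPO annotation releases.
import obonet
import pandas as pd

# Ontology graph: nodes are HPO identifiers, node attributes hold labels and synonyms.
graph = obonet.read_obo("http://purl.obolibrary.org/obo/hp.obo")
id_to_label = {hpo_id: data["name"]
               for hpo_id, data in graph.nodes(data=True) if "name" in data}

# Disease-phenotype annotations (tab-separated; metadata lines start with '#').
hpoa = pd.read_csv("phenotype.hpoa", sep="\t", comment="#")
annotation_counts = hpoa["hpo_id"].value_counts()

print(f"{len(id_to_label)} terms loaded;",
      f"{annotation_counts.get('HP:0001251', 0)} disease annotations for HP:0001251 (Ataxia)")
```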

2.3. Test Terms

To evaluate model performance on samples of realistic clinical language, we curated a test set of 799 HPO term–identifier pairs extracted from 1618 de-identified neurology physician notes from the University of Illinois electronic health record (EHR) system [40]. Use of these notes was approved by the Institutional Review Board (IRB) of the University of Illinois at Chicago. Clinical terms were extracted with the natural language processing pipeline described in [40], which employed GPT-4 with prompting strategies to identify 2718 candidate phrases corresponding to HPO concepts. The prompts were designed to capture both explicit phenotype mentions and implicit clinical descriptions. After filtering and mapping, 799 unique HPO term–identifier pairs formed the test set for model evaluation.

2.4. Large Language Models and Fine-Tuning

We used LLaMA 3.1 8B as our base model, a transformer-based autoregressive language model with 8 billion parameters, pre-trained on a diverse mixture of web data, books, code, and scientific text. Comprehensive details of the model architecture and pre-training are available in [41], and the foundational transformer architecture is described in [42]. All experiments employed consistent software versions and configuration parameters to ensure reproducibility. Fine-tuning was performed locally using Hugging Face Transformers with Unsloth and Low-Rank Adaptation (LoRA) [27,43], a parameter-efficient method that inserts rank-decomposed matrices into attention layers without altering base weights. While LoRA reduces the number of trainable parameters and can preserve prior capabilities, catastrophic forgetting can still occur [44,45].
Training was conducted on a multi-GPU workstation equipped with three NVIDIA Quadro RTX 8000 GPUs (48 GB VRAM each), running CUDA 12.0 with mixed precision (fp16) for optimized memory usage. Each fine-tuning run required more than 12 hours of compute time, reflecting the scale of the HPO vocabulary. The model was fine-tuned on the full HPO vocabulary (18,988 terms) using five prompt variations per term–identifier pair. Following Unsloth’s recommendations, fine-tuning was limited to three epochs to mitigate overfitting.
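As an illustration of a comparable setup, the sketch below performs parameter-efficient LoRA adaptation with the generic Hugging Face transformers, peft, and datasets libraries rather than the authors’ exact Unsloth pipeline; the LoRA rank, target modules, learning rate, and single toy training example are illustrative assumptions.

```python
# Minimal LoRA fine-tuning sketch (not the authors' exact Unsloth configuration).
# LoRA rank, alpha, target modules, learning rate, and the toy example are
# illustrative assumptions; the full run would use 5 prompt variants for each
# of the 18,988 HPO term-identifier pairs, trained for 3 epochs as described above.
from datasets import Dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

base = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(base, device_map="auto")

# Rank-decomposed adapters on the attention projections; base weights stay frozen.
lora = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM",
                  target_modules=["q_proj", "k_proj", "v_proj", "o_proj"])
model = get_peft_model(model, lora)

examples = [{"text": "What is the HPO ID for Ataxia?\n"
                     "Return only the code in format HP:1234567\nHP:0001251"}]
dataset = Dataset.from_list(examples).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    train_dataset=dataset,
    args=TrainingArguments(output_dir="hpo-lora", num_train_epochs=3,
                           per_device_train_batch_size=8, learning_rate=2e-4, fp16=True),
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In a full run, the saved LoRA adapters would be loaded alongside the frozen base weights for the post-fine-tuning evaluations described below.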

2.5. Top-1 Accuracy (Deterministic Inference)

To evaluate baseline model performance, we measured top-1 accuracy using deterministic inference. For each HPO term, we prompted the model to return the corresponding HPO identifier in a strict format:
What is the HPO ID for {term}?
Return only the code in format HP:1234567
A response was scored as Correct if the predicted identifier exactly matched the ground-truth HPO ID. All other responses were considered Incorrect. This approach corresponds to the model’s most confident prediction, with no sampling or temperature variation. We refer to this strict evaluation metric as top-1 accuracy, in contrast to accuracy under stochastic inference (e.g., temperature-based sampling), which we report separately.
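A minimal sketch of this evaluation loop is given below; it assumes a loaded Hugging Face causal language model and tokenizer, and the regular-expression parsing of the reply is an implementation assumption rather than the authors’ exact scoring code.

```python
# Sketch of deterministic top-1 scoring: one greedy query per term, exact match
# against the ground-truth identifier. Assumes a loaded Hugging Face causal LM
# and tokenizer; parsing the reply with a regular expression is an assumption.
import re
import torch

def top1_correct(model, tokenizer, term: str, true_id: str) -> bool:
    prompt = (f"What is the HPO ID for {term}?\n"
              "Return only the code in format HP:1234567")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=12, do_sample=False)  # greedy decoding
    reply = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                             skip_special_tokens=True)
    match = re.search(r"HP:\d{7}", reply)
    return bool(match) and match.group(0) == true_id

# top-1 accuracy over the test set:
# accuracy = sum(top1_correct(model, tokenizer, t, i) for t, i in pairs) / len(pairs)
```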

2.6. Outcome Classification for Fine-Tuning

We hypothesized that a term’s prior knowledge state would systematically influence its response to fine-tuning. To test this, we classified each of the 799 test terms based on deterministic accuracy before and after fine-tuning:
Correct: Correct both before and after fine-tuning;
Gainer: Incorrect before but correct after fine-tuning;
Loser: Correct before but incorrect after fine-tuning;
Incorrect: Incorrect both before and after fine-tuning.
This classification scheme provided the four outcome groups (Correct, Incorrect, Gainer, and Loser) used in subsequent analyses.
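In code, this classification reduces to a simple mapping over the two correctness flags, as in the sketch below (function and variable names are illustrative).

```python
# Four-way outcome classification from deterministic correctness before and
# after fine-tuning (names are illustrative).
def classify_outcome(correct_before: bool, correct_after: bool) -> str:
    if correct_before and correct_after:
        return "Correct"
    if not correct_before and correct_after:
        return "Gainer"
    if correct_before and not correct_after:
        return "Loser"
    return "Incorrect"

# e.g., classify_outcome(False, True) -> "Gainer"
```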

2.7. Latent Probabilistic Knowledge

Latent probabilistic knowledge was quantified following the method of Gekhman et al. [36]. The base model was prompted 50 times for each term at a temperature of 1.0, enabling diverse probabilistic outputs. Probabilistic accuracy (proportion of correct responses) was computed for each term. Model latent probabilistic knowledge of each term was categorized as follows:
Unknown: probabilistic accuracy = 0.0;
Weak: 0.0 < accuracy < 0.2;
Medium: 0.2 ≤ accuracy < 0.5;
Strong: accuracy ≥ 0.5.
Figure 2 shows the distribution of probabilistic accuracy (log-scaled y-axis).
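A minimal sketch of this estimation procedure, reusing the prompt and parsing conventions of the deterministic evaluation above, is shown below; it follows the stated sampling settings (50 queries per term, temperature = 1.0, top-k = 50) but is not the authors’ exact implementation.

```python
# Sketch of latent probabilistic knowledge: sample 50 completions per term at
# temperature 1.0 (top-k = 50), compute the fraction of correct identifiers,
# and bin the result into the four categories defined above.
import re

def probabilistic_accuracy(model, tokenizer, term, true_id, n_samples=50):
    prompt = (f"What is the HPO ID for {term}?\n"
              "Return only the code in format HP:1234567")
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    hits = 0
    for _ in range(n_samples):
        output = model.generate(**inputs, max_new_tokens=12,
                                do_sample=True, temperature=1.0, top_k=50)
        reply = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)
        match = re.search(r"HP:\d{7}", reply)
        hits += int(bool(match) and match.group(0) == true_id)
    return hits / n_samples

def latent_category(accuracy: float) -> str:
    if accuracy == 0.0:
        return "Unknown"
    if accuracy < 0.2:
        return "Weak"
    if accuracy < 0.5:
        return "Medium"
    return "Strong"
```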

2.8. Partial Subtoken Knowledge

Partial subtoken knowledge measures the model’s deterministic knowledge of the three subtokens that make up each HPO identifier. The LLaMA 3.1 tokenizer splits the seven digits of an HPO identifier (e.g., HP:1234567) into three numeric subtokens in the format 123-456-7. Deterministic model outputs were scored from 0 to 3 based on the number of correctly predicted numeric subtokens. The four categories of partial subtoken knowledge were as follows:
None: 0 of 3 correct;
Weak: 1 of 3 correct;
Medium: 2 of 3 correct;
Complete: 3 of 3 correct.
Figure 3 shows the counts of terms in each of the partial subtoken knowledge categories.
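The scoring logic can be sketched as follows; the fixed 3-3-1 digit grouping hard-codes the LLaMA 3.1 split described above rather than invoking the tokenizer itself.

```python
# Sketch of partial subtoken scoring. The 3-3-1 digit grouping hard-codes the
# reported LLaMA 3.1 split of HP:1234567 into 123-456-7.
def subtoken_matches(predicted_id: str, true_id: str) -> int:
    """Count matching numeric subtokens (0-3) between two HP:NNNNNNN codes."""
    def split(hpo_id: str):
        digits = hpo_id.split(":")[1]            # "1234567"
        return digits[0:3], digits[3:6], digits[6:7]
    return sum(p == t for p, t in zip(split(predicted_id), split(true_id)))

SUBTOKEN_CATEGORY = {0: "None", 1: "Weak", 2: "Medium", 3: "Complete"}
# e.g., subtoken_matches("HP:0001259", "HP:0001251") == 2  ->  "Medium"
```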

2.9. Term Familiarity

We hypothesized that a model’s familiarity with term–identifier pairs reflects its exposure to them during pre-training. We therefore used two surrogate measures of term familiarity. The first was the number of times each HPO identifier appeared in the PubMed Central (PMC) full-text database, as accessed via the PMC API. The second was the number of times each HPO term was used to annotate the phenotype of a disease in the curated dataset available at https://hpo.jax.org/data/annotations, 1 August 2025. Histograms of annotation counts and PMC identifier counts are shown in Figure 4. Both distributions follow a long-tail pattern, with many terms having very few annotations or PMC identifiers. The 799-term test set is shifted to the right relative to the full HPO, indicating that the test set is enriched for more commonly used terms. These familiarity measures were then used to compare the outcome groups (Correct, Incorrect, Gainers, and Losers) in order to test the hypothesis that fine-tuning effects are concentrated in a reactive middle zone.
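One way to obtain PMC counts of this kind is through the NCBI E-utilities interface to PubMed Central, as sketched below; the authors’ exact PMC API calls and query form may differ.

```python
# Sketch of querying PubMed Central for the number of full-text records that
# mention a given HPO identifier, via the NCBI E-utilities esearch endpoint.
# The authors' exact PMC API usage and query form may differ.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pmc_identifier_count(hpo_id: str) -> int:
    response = requests.get(EUTILS,
                            params={"db": "pmc", "term": f'"{hpo_id}"', "retmode": "json"},
                            timeout=30)
    response.raise_for_status()
    return int(response.json()["esearchresult"]["count"])

# e.g., pmc_identifier_count("HP:0001251")  # Ataxia
```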

2.10. Statistical Analysis

We compared deterministic accuracy before and after fine-tuning within each latent probabilistic knowledge category (Unknown, Weak, Medium, and Strong) using paired t-tests to assess significance and Cohen’s d to measure effect sizes. Bar plots with standard errors visualized group differences, and chi-square tests compared the proportion of terms correctly linked after fine-tuning with the baseline proportion. The same approach was applied to the four partial subtoken knowledge groups (None, Weak, Medium, and Complete). The significance of the differences in gainers and losers across latent probabilistic knowledge and partial subtoken knowledge categories was also assessed with chi-square tests. Finally, to evaluate how term familiarity is related to fine-tuning outcomes, we compared annotation counts and PMC identifier counts across the four outcome groups (Correct, Incorrect, Gainers, and Losers). Because both predictors were long-tailed, counts were log-transformed prior to analysis. Group differences were evaluated using one-way ANOVA with Tukey’s HSD post hoc comparisons. As a robustness check against non-normality, we also conducted Kruskal–Wallis tests on the raw values. Summary statistics (mean, standard deviation, median, and interquartile range) were reported for each group.
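These analyses map onto standard SciPy and statsmodels routines, as in the sketch below; the toy data frame and its column names are placeholders for the real per-term measurements.

```python
# Sketch of the statistical comparisons; the toy data frame and column names
# are placeholders for the real per-term measurements.
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)
df = pd.DataFrame({                                   # toy placeholder data
    "outcome": rng.choice(["Correct", "Incorrect", "Gainer", "Loser"], size=400),
    "annotation_count": rng.integers(1, 2000, size=400),
    "knowledge_bin": rng.choice(["Unknown", "Weak", "Medium", "Strong"], size=400),
})

df["log_count"] = np.log10(df["annotation_count"] + 1)        # long-tailed counts -> log scale
log_groups = [g["log_count"].to_numpy() for _, g in df.groupby("outcome")]
raw_groups = [g["annotation_count"].to_numpy() for _, g in df.groupby("outcome")]

f_stat, p_anova = stats.f_oneway(*log_groups)                 # one-way ANOVA (log scale)
tukey = pairwise_tukeyhsd(df["log_count"], df["outcome"])     # Tukey HSD post hoc comparisons
h_stat, p_kw = stats.kruskal(*raw_groups)                     # non-parametric robustness check

# Chi-square test of association between prior-knowledge bins and outcome classes.
chi2, p_chi, dof, _ = stats.chi2_contingency(pd.crosstab(df["knowledge_bin"], df["outcome"]))
```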

3. Results

3.1. Baseline and Fine-Tuned Performance of LLaMA 3.1 8B on HPO Term Normalization

The baseline LLaMA 3.1 8B model demonstrated significantly higher accuracy on the curated test set of 799 clinically relevant HPO terms compared to the full HPO vocabulary. When evaluated on all 18,988 HPO terms, the model correctly linked only 96 terms (0.5%), whereas on the curated test set, it correctly linked 32 terms (4.0%). A chi-square test confirmed that this difference was significant (χ² = 140.7, p < 1 × 10⁻³⁰). This suggests that the curated test terms are more common clinical terms that the model was more likely to have encountered during pre-training, making them easier to normalize than the broader and rarer long-tail terms in the complete HPO.
Fine-tuning improved model performance on the curated set of 799 HPO terms. The baseline model correctly linked 32 terms (4.0%), whereas the fine-tuned model correctly linked 118 terms (14.8%). This represents a more than threefold improvement in deterministic accuracy. A chi-square test confirmed that the increase in correct mappings after fine-tuning was significant (χ² = 58.9, p < 1 × 10⁻¹⁴).

3.2. Latent Probabilistic Knowledge Predicts Fine-Tuning Success

Latent probabilistic knowledge, as adapted from Gekhman et al. [36], was quantified by computing the proportion of correct responses across 50 probabilistic queries (temperature = 1.0). Each term was classified into one of four categories: Unknown, Weak, Medium, or Strong (Figure 2).
Figure 5 shows the mean deterministic accuracy of both the base and fine-tuned models for each latent probabilistic knowledge category, with standard error bars and significance markers from two-sample t-tests. Fine-tuning significantly improved accuracy for terms with Weak latent probabilistic knowledge (p < 0.001) and modestly for those classified as Unknown (p < 0.001), though absolute accuracy for Unknown terms remained low. For terms with Medium latent probabilistic knowledge, accuracy improvements did not reach statistical significance, and for Strong latent probabilistic knowledge, both models already performed near ceiling levels with no measurable gain.
These results suggest that fine-tuning is most beneficial for terms where the model possesses partial but incomplete latent probabilistic knowledge, the hypothesized reactive middle. Gains are limited for completely unknown terms and negligible for strongly represented terms.

3.3. Partial Subtoken Knowledge Predicts Fine-Tuning Success

The LLaMA 3.1 tokenizer divides each 7-digit HPO identifier into three numeric subtokens in the format 123-456-7. We assessed partial knowledge by comparing the model’s predicted identifier against the ground-truth identifier at the subtoken level. Based on the number of correctly predicted numeric subtokens (0–3), each term was assigned to one of four partial subtoken knowledge categories: None, Weak, Medium, or Complete. Fine-tuning significantly improved deterministic accuracy in the Weak and Medium categories (p < 0.001) but showed no improvement in the None category, where both models performed poorly. In the Complete category, the base model slightly outperformed the fine-tuned model (p < 0.05), indicating mild degradation of already well-consolidated knowledge. These results suggest that fine-tuning is most effective when partial knowledge is present but not yet fully established (Figure 6).

3.4. Term Familiarity

We used annotation counts (uses of phenotype terms for disease annotation) and PubMed Central (PMC) identifier counts as proxies for prior model exposure to HPO term–identifier pairs during pre-training. Annotations and PMC ID counts were highest for Correct terms (terms correctly linked to their identifiers before and after fine-tuning) and lowest for Incorrect terms (terms consistently mislinked before and after fine-tuning). Gainer terms (incorrect before but corrected after fine-tuning) and Loser terms (correct before but incorrect after fine-tuning) occupied intermediate positions, consistent with a ‘reactive middle’ (Figure 7).
Statistical analysis confirmed these patterns. For annotation counts, ANOVA on the log-transformed scale showed a highly significant effect of outcome class (F(3, 686) = 88.1, p < 10⁻⁴⁷), with Tukey’s HSD indicating that both Gainers and Losers had significantly higher counts than Incorrect terms but did not differ significantly from Correct terms. A non-parametric Kruskal–Wallis test also supported group differences (H = 186.8, p < 10⁻³⁹). For PMC identifier counts, results were similar (F(3, 686) = 87.6, p < 10⁻⁴⁷; Kruskal–Wallis H = 150.2, p < 10⁻³¹). Tukey’s HSD indicated that Correct terms had significantly higher identifier counts than both Gainers and Incorrect terms, while Losers again showed intermediate values not significantly different from either extreme.

3.5. Positive and Negative Knowledge Flows During Fine-Tuning

Pletenev et al. [33] emphasized that fine-tuning can introduce new knowledge while also degrading previously consolidated knowledge. To characterize these flows, we classified terms as Gainers (incorrect before but correct after fine-tuning) or Losers (correct before but incorrect after). Terms that remained consistently correct or incorrect are included in the figures for context but are not the focus of this analysis.
Figure 8 shows outcomes stratified by latent probabilistic knowledge. Gains were most common among terms in the Unknown and Weak bins, with progressively fewer gains in the Medium and Strong bins. Losses were less frequent overall but appeared across bins, including in the Strong category, indicating that even well-represented terms may degrade under fine-tuning. The association between latent probabilistic knowledge and outcome class was highly significant (χ²(9) = 667.1, p < 10⁻¹³⁷).
Figure 9 shows outcomes stratified by partial subtoken knowledge. Here, gains were concentrated in the Weak and Medium bins, while losses occurred most often in the Complete bin. This suggests that while partial structural knowledge of identifiers makes terms more likely to improve with fine-tuning, even fully correct identifier encodings are susceptible to disruption. The association between partial subtoken knowledge and outcome class was also highly significant (χ²(9) = 868.9, p < 10⁻¹⁸⁰).
Together, these results support a reactive middle interpretation: fine-tuning has its largest positive impact on terms with intermediate prior knowledge, while both very weak and very strong prior knowledge states exhibit resistance to change. However, the presence of Losers in the Complete and Strong categories underscores the risk of negative knowledge transfer during fine-tuning.

4. Discussion

Our findings demonstrate that fine-tuning improves term-to-identifier linking in biomedical ontologies in a systematic rather than random manner, with outcomes strongly influenced by the model’s prior knowledge of each term–identifier pair. We evaluated three dimensions of prior knowledge—latent probabilistic knowledge, partial subtoken knowledge, and term familiarity—and found each to be predictive of fine-tuning outcomes. Latent probabilistic knowledge refers to hidden knowledge accessible under stochastic sampling; partial subtoken knowledge reflects partial mastery of an identifier’s numeric structure; and term familiarity captures prior exposure in pre-training, estimated from annotation counts and PubMed Central identifier frequencies.
A consistent pattern emerged: terms with intermediate levels of prior knowledge were the most responsive to fine-tuning. This reactive middle showed both the largest gains and the greatest losses in deterministic accuracy, while terms that were either completely unknown or highly consolidated changed little. Specifically, gains clustered in the intermediate categories of latent probabilistic and subtoken knowledge, as well as among terms with moderate annotation and identifier counts. In contrast, terms with high familiarity tended to remain stable, and terms with no prior signal showed limited improvement.
These observations align with the frameworks proposed by Gekhman et al. [36] and Pletenev et al. [33], which emphasize that fine-tuning can reinforce weakly consolidated representations while also destabilizing existing ones. Our results extend this theory by quantifying how gains and losses are distributed across distinct forms of prior knowledge, showing that both are concentrated where prior knowledge is partial.
We also find partial alignment with Wang et al. [46], who reported near-perfect knowledge injection when fine-tuning LLaMA 2 (7B) for 99 epochs using LoRA. In contrast, our shorter three-epoch training of LLaMA 3.1 (8B) produced modest gains alongside degradations of previously known terms. This contrast highlights how training depth and capacity influence whether fine-tuning consolidates knowledge or risks destabilization.

4.1. Limitations

This study has several limitations. We evaluated only one model architecture (LLaMA 3.1 8B) on a single ontology (HPO), and our test set of 799 terms represents only a subset of the 18,988-term vocabulary. Findings may not generalize to larger models, domain-specific pre-training, or other ontologies such as GO, SNOMED CT, or RxNorm. In addition, because all terms were included in training but only a subset was evaluated, training–test mismatch may have influenced results. Finally, our fine-tuning setup was restricted to three epochs with five prompt variants per term. Alternative designs—including broader prompt diversity, longer training, or larger context windows—may yield different outcomes. Ablation studies could further clarify which factors most strongly influence performance.
We also did not address whether parameter-efficient fine-tuning (PEFT) methods such as LoRA enable genuine generalization or primarily reinforce memorized mappings [31]. Our findings suggest the latter: fine-tuning largely strengthens partially known associations but does not necessarily promote deeper semantic abstraction beyond the training data.

4.2. Future Work

A key next step is to test whether these patterns of fine-tuning responsiveness generalize to other biomedical ontologies, larger models, and alternative training regimes. Systematic comparisons across fine-tuning methods—including QLoRA [47], Direct Preference Optimization (DPO), and full-model fine-tuning—are needed to assess their differential effects on knowledge injection and retention.
Future studies should also examine how temperature settings affect estimates of latent probabilistic knowledge, potentially refining predictive models of fine-tuning responsiveness. A particularly promising direction is targeted fine-tuning focused on the reactive middle—terms with partial prior knowledge—where gains are most likely and risks of loss are present. Finally, investigating the velocity of fine-tuning, i.e., how rapidly different terms adapt during training, may reveal predictors of responsiveness and inform more efficient training strategies.

5. Conclusions

Fine-tuning effectiveness for term-to-identifier linking in biomedical ontologies is governed by a model’s prior knowledge rather than being uniform across terms. Across three dimensions of prior knowledge—latent probabilistic knowledge, partial subtoken knowledge, and term familiarity—we found consistent evidence for a reactive middle: terms with intermediate prior knowledge undergo the most substantial changes, showing both the largest gains and the greatest degradations. In contrast, well-consolidated and completely unknown terms remain relatively unaffected.
These results suggest practical opportunities to improve how large language models are deployed for biomedical term normalization, and they may also provide insight into how such models internalize mappings between terms and identifiers.
More broadly, our findings challenge the assumption that fine-tuning is universally beneficial. By explicitly modeling prior knowledge, we can anticipate which terms are most susceptible to gains and losses, enabling fine-tuning strategies that maximize benefits while minimizing degradation of previously accurate knowledge.

Author Contributions

Conceptualization, D.B.H. and S.K.P.; Methodology, D.B.H., S.K.P. and A.N.; Software, A.N. and D.B.H.; Investigation, all authors; Writing—original draft, A.N. and D.B.H.; Writing—review and editing, all authors; Project administration, S.K.P.; Funding acquisition, S.K.P. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded in part by the National Science Foundation, Award Number 2423235.

Institutional Review Board Statement

The study was conducted in accordance with the Declaration of Helsinki, and approved by the Institutional Review Board (IRB) of the University of Illinois (Protocol code 2017-0520; approved on 1 September 2017 and extended on 24 June 2022).

Informed Consent Statement

Patient informed written consent was waived due to the use of de-identified data, as determined by the Institutional Review Board (Protocol code 2017-0520).

Data Availability Statement

Data and code supporting this study are available from the corresponding author upon reasonable request.

Acknowledgments

We thank the UIC Neuroimmunology Biobank Team for their support. The Biobank did not provide financial funding for this study.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Groza, T.; Köhler, S.; Doelken, S.; Collier, N.; Oellrich, A.; Smedley, D.; Couto, F.M.; Baynam, G.; Zankl, A.; Robinson, P.N. Automatic concept recognition using the human phenotype ontology reference and test suite corpora. Database 2015, 2015, bav005. [Google Scholar] [CrossRef]
  2. Fu, S.; Chen, D.; He, H.; Liu, S.; Moon, S.; Peterson, K.J.; Shen, F.; Wang, L.; Wang, Y.; Wen, A.; et al. Clinical concept extraction: A methodology review. J. Biomed. Inform. 2020, 109, 103526. [Google Scholar] [CrossRef] [PubMed]
  3. Luo, Y.F.; Henry, S.; Wang, Y.; Shen, F.; Uzuner, O.; Rumshisky, A. The 2019 n2c2/UMass Lowell shared task on clinical concept normalization. J. Am. Med. Inform. Assoc. 2020, 27, 1529-e1. [Google Scholar] [CrossRef]
  4. Wang, Y.; Wang, L.; Rastegar-Mojarad, M.; Moon, S.; Shen, F.; Afzal, N.; Liu, S.; Zeng, Y.; Mehrabi, S.; Sohn, S.; et al. Clinical information extraction applications: A literature review. J. Biomed. Inform. 2018, 77, 34–49. [Google Scholar] [CrossRef] [PubMed]
  5. Zheng, J.G.; Howsmon, D.; Zhang, B.; Hahn, J.; McGuinness, D.; Hendler, J.; Ji, H. Entity linking for biomedical literature. BMC Med. Inform. Decis. Mak. 2015, 15, 1–9. [Google Scholar] [CrossRef]
  6. Krauthammer, M.; Nenadic, G. Term identification in the biomedical literature. J. Biomed. Inform. 2004, 37, 512–526. [Google Scholar] [CrossRef]
  7. Robinson, P.N. Deep phenotyping for precision medicine. Hum. Mutat. 2012, 33, 777–780. [Google Scholar] [CrossRef]
  8. Bispo, L.G.M.; Amaral, F.G.; da Silva, J.M.N.; Neto, I.R.; Silva, L.K.D.; da Silva, I.L. Ergonomic adequacy of university tablet armchairs for male and female: A multigroup item response theory analysis. J. Saf. Sustain. 2024, 1, 223–233. [Google Scholar] [CrossRef]
  9. Yun, W.; Zhang, X.; Li, Z.; Liu, H.; Han, M. Knowledge modeling: A survey of processes and techniques. Int. J. Intell. Syst. 2021, 36, 1686–1720. [Google Scholar] [CrossRef]
  10. Haslinda, A.; Sarinah, A. A review of knowledge management models. J. Int. Soc. Res. 2009, 2, 9. [Google Scholar]
  11. Pan, J.Z.; Razniewski, S.; Kalo, J.C.; Singhania, S.; Chen, J.; Dietze, S.; Jabeen, H.; Omeliyanenko, J.; Zhang, W.; Lissandrini, M.; et al. Large language models and knowledge graphs: Opportunities and challenges. arXiv 2023, arXiv:2308.06374. [Google Scholar] [CrossRef]
  12. Abellanosa, A.D.; Pereira, E.; Lefsrud, L.; Mohamed, Y. Integrating Knowledge Management and Large Language Models to Advance Construction Job Hazard Analysis: A Systematic Review and Conceptual Framework. J. Saf. Sustain. 2025; in press, corrected proof. [Google Scholar]
  13. Chandak, P.; Huang, K.; Zitnik, M. Building a knowledge graph to enable precision medicine. Sci. Data 2023, 10, 67. [Google Scholar] [CrossRef]
  14. Liu, Q.; Yang, R.; Gao, Q.; Liang, T.; Wang, X.; Li, S.; Lei, B.; Gao, K. A Review of Applying Large Language Models in Healthcare. IEEE Access 2024, 13, 6878–6892. [Google Scholar] [CrossRef]
  15. Wang, Y.; Zhao, Y.; Petzold, L. Are Large Language Models Ready for Healthcare? A Comparative Study on Clinical Language Understanding. arXiv 2023, arXiv:2304.05368. [Google Scholar] [CrossRef]
  16. Chang, E.; Mostafa, J. The use of SNOMED CT, 2013–2020: A literature review. J. Am. Med. Inform. Assoc. 2021, 28, 2017–2026. [Google Scholar] [CrossRef]
  17. The Gene Ontology Consortium. The gene ontology resource: 20 years and still GOing strong. Nucleic Acids Res. 2019, 47, D330–D338. [Google Scholar] [CrossRef]
  18. Zhou, G.; Zhang, J.; Su, J.; Shen, D.; Tan, C. Recognizing names in biomedical texts: A machine learning approach. Bioinformatics 2004, 20, 1178–1190. [Google Scholar] [CrossRef] [PubMed]
  19. Köhler, S.; Vasilevsky, N.A.; Engelstad, M.; Foster, E.; McMurry, J.; Aymé, S.; Baynam, G.; Bello, S.M.; Boerkoel, C.F.; Boycott, K.M.; et al. The human phenotype ontology in 2017. Nucleic Acids Res. 2017, 45, D865–D876. [Google Scholar] [CrossRef] [PubMed]
  20. Robinson, P.N.; Köhler, S.; Bauer, S.; Seelow, D.; Horn, D.; Mundlos, S. The Human Phenotype Ontology: A tool for annotating and analyzing human hereditary disease. Am. J. Hum. Genet. 2008, 83, 610–615. [Google Scholar] [CrossRef]
  21. Jahan, I.; Laskar, M.T.R.; Peng, C.; Huang, J.X. A comprehensive evaluation of large language models on benchmark biomedical text processing tasks. Comput. Biol. Med. 2024, 171, 108189. [Google Scholar] [CrossRef] [PubMed]
  22. Do, T.S.; Hier, D.B.; Obafemi-Ajayi, T. Mapping Biomedical Ontology Terms to IDs: Effect of Domain Prevalence on Prediction Accuracy. In Proceedings of the 2025 IEEE Conference on Artificial Intelligence (CAI), Santa Clara, CA, USA, 5–7 May 2025; IEEE: New York, NY, USA, 2025; pp. 1–6. [Google Scholar]
  23. Kandpal, N.; Deng, H.; Roberts, A.; Wallace, E.; Raffel, C. Large language models struggle to learn long-tail knowledge. In Proceedings of the International Conference on Machine Learning, Honolulu, HI, USA, 23–29 July 2023; PMLR: Cambridge, MA, USA, 2023; pp. 15696–15707. [Google Scholar]
  24. Wu, E.; Wu, K.; Zou, J. FineTuneBench: How well do commercial fine-tuning APIs infuse knowledge into LLMs? arXiv 2024, arXiv:2411.05059. [Google Scholar] [CrossRef]
  25. Braga, M. Personalized Large Language Models through Parameter Efficient Fine-Tuning Techniques. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, Washington, DC, USA, 14–18 July 2024. [Google Scholar] [CrossRef]
  26. Tinn, R.; Cheng, H.; Gu, Y.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Fine-tuning large neural language models for biomedical natural language processing. Patterns 2023, 4, 100729. [Google Scholar] [CrossRef]
  27. Ding, N.; Qin, Y.; Yang, G.; Wei, F.; Yang, Z.; Su, Y.; Hu, S.; Chen, Y.; Chan, C.M.; Chen, W.; et al. Parameter-efficient fine-tuning of large-scale pre-trained language models. Nat. Mach. Intell. 2023, 5, 220–235. [Google Scholar] [CrossRef]
  28. Wang, C.; Yan, J.; Zhang, W.; Huang, J. Towards Better Parameter-Efficient Fine-Tuning for Large Language Models: A Position Paper. arXiv 2023, arXiv:2311.13126. [Google Scholar] [CrossRef]
  29. Wu, E.; Wu, K.; Zou, J. Limitations of Learning New and Updated Medical Knowledge with Commercial Fine-Tuning Large Language Models. NEJM AI 2025, 2, AIcs2401155. [Google Scholar] [CrossRef]
  30. Mecklenburg, N.; Lin, Y.; Li, X.; Holstein, D.; Nunes, L.; Malvar, S.; Silva, B.; Chandra, R.; Aski, V.; Yannam, P.K.R.; et al. Injecting New Knowledge Into Large Language Models Via Supervised Fine-Tuning. arXiv 2024, arXiv:2404.00213. [Google Scholar] [CrossRef]
  31. Chu, T.; Zhai, Y.; Yang, J.; Tong, S.; Xie, S.; Schuurmans, D.; Le, Q.V.; Levine, S.; Ma, Y. SFT memorizes, RL generalizes: A comparative study of foundation model post-training. arXiv 2025, arXiv:2501.17161. [Google Scholar]
  32. Pan, X.; Hahami, E.; Zhang, Z.; Sompolinsky, H. Memorization and Knowledge Injection in Gated LLMs. arXiv 2025, arXiv:2504.21239. [Google Scholar] [CrossRef]
  33. Pletenev, S.; Marina, M.; Moskovskiy, D.; Konovalov, V.; Braslavski, P.; Panchenko, A.; Salnikov, M. How Much Knowledge Can You Pack into a LoRA Adapter without Harming LLM? arXiv 2025, arXiv:2502.14502. [Google Scholar] [CrossRef]
  34. Orgad, H.; Toker, M.; Gekhman, Z.; Reichart, R.; Szpektor, I.; Kotek, H.; Belinkov, Y. LLMS Know More Than They Show: On The Intrinsic Representation of LLM Hallucinations. arXiv 2025, arXiv:2410.02707. [Google Scholar]
  35. Gekhman, Z.; Yona, G.; Aharoni, R.; Eyal, M.; Feder, A.; Reichart, R.; Herzig, J. Does Fine-Tuning LLMs on New Knowledge Encourage Hallucinations? In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Miami, FL, USA, 12–16 November 2024; pp. 7765–7784. [Google Scholar] [CrossRef]
  36. Gekhman, Z.; David, E.B.; Orgad, H.; Ofek, E.; Belinkov, Y.; Szpektor, I.; Herzig, J.; Reichart, R. Inside-out: Hidden factual knowledge in LLMs. arXiv 2025, arXiv:2503.15299. [Google Scholar]
  37. Amberger, J.S.; Bocchini, C.A.; Schiettecatte, F.; Scott, A.F.; Hamosh, A. OMIM. org: Online Mendelian Inheritance in Man (OMIM®), an online catalog of human genes and genetic disorders. Nucleic Acids Res. 2015, 43, D789–D798. [Google Scholar] [CrossRef]
  38. Maiella, S.; Rath, A.; Angin, C.; Mousson, F.; Kremp, O. Orphanet and its consortium: Where to find expert-validated information on rare diseases. Rev. Neurol. 2013, 169, S3–S8. [Google Scholar] [CrossRef] [PubMed]
  39. Beck, J.; Sequeira, E. PubMed Central (PMC): An Archive for Literature from Life Sciences Journals. In The NCBI Handbook [Internet]; McEntyre, J., Ostell, J., Eds.; National Center for Biotechnology Information (US): Bethesda, MD, USA, 2002; Chapter 9. Available online: https://www.ncbi.nlm.nih.gov/sites/books/NBK21087/pdf/Bookshelf_NBK21087.pdf (accessed on 3 September 2025).
  40. Hier, D.B.; Carrithers, M.A.; Platt, S.K.; Nguyen, A.; Giannopoulos, I.; Obafemi-Ajayi, T. Preprocessing of Physician Notes by LLMs Improves Clinical Concept Extraction Without Information Loss. Information 2025, 16, 446. [Google Scholar] [CrossRef]
  41. Grattafiori, A.; Dubey, A.; Jauhri, A.; Pandey, A.; Kadian, A.; Al-Dahle, A.; Letman, A.; Mathur, A.; Schelten, A.; Vaughan, A.; et al. The llama 3 herd of models. arXiv 2024, arXiv:2407.21783. [Google Scholar] [CrossRef]
  42. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30, pp. 5998–6008. [Google Scholar]
  43. Hu, E.J.; Shen, Y.; Wallis, P.; Allen-Zhu, Z.; Li, Y.; Wang, S.; Wang, L.; Chen, W. Lora: Low-rank adaptation of large language models. ICLR 2022, 1, 3. [Google Scholar]
  44. Luo, Y.; Yang, Z.; Meng, F.; Li, Y.; Zhou, J.; Zhang, Y. An empirical study of catastrophic forgetting in large language models during continual fine-tuning. arXiv 2023, arXiv:2308.08747. [Google Scholar] [CrossRef]
  45. Kalajdzievski, D. Scaling laws for forgetting when fine-tuning large language models. arXiv 2024, arXiv:2401.05605. [Google Scholar] [CrossRef]
  46. Wang, A.; Liu, C.; Yang, J.; Weng, C. Fine-tuning large language models for rare disease concept normalization. J. Am. Med. Inform. Assoc. 2024, 31, 2076–2083. [Google Scholar] [CrossRef] [PubMed]
  47. Dettmers, T.; Pagnoni, A.; Holtzman, A.; Zettlemoyer, L. Qlora: Efficient finetuning of quantized llms. Adv. Neural Inf. Process. Syst. 2023, 36, 10088–10115. [Google Scholar]
Figure 1. End-to-end workflow.
Figure 2. Distribution of latent probabilistic knowledge across 799 HPO terms. Probabilistic accuracy was estimated from 50 stochastic model outputs at temperature 1.0. Note that most terms are categorized as Unknown to the model based on latent probabilistic knowledge (n = 721).
Figure 3. Partial subtoken knowledge categories for predicted HPO identifiers. Bars show counts of terms with None (no matching subtokens), Weak (1 of 3 subtokens match), Medium (2 of 3 match), or Complete (all 3 subtokens match). The 7-digit HPO identifier is tokenized into three numeric subtokens by LLaMA 3.1 in the format of 123-456-7.
Figure 4. Distribution of annotation counts and PMC ID counts across all 18,988 HPO terms (blue) and the 799-term test set (orange). Both axes are plotted on a base-10 logarithmic scale. The test set distribution is shifted to the right, indicating enrichment for terms with higher annotation counts and PMC ID counts compared to the full ontology.
Figure 5. Fine-tuning effects across latent probabilistic knowledge categories. Error bars show standard errors; t-test significance: *** indicates p < 0.001, ns indicates not significant.
Figure 6. Fine-tuning effects across partial subtoken knowledge categories. Error bars show standard errors; t-test significance: * indicates p < 0.05, *** indicates p < 0.001, ns indicates not significant.
Figure 7. Mean annotation counts and PMC ID counts by outcome class. Annotation counts and PMC ID counts serve as proxies for LLM familiarity with terms during pre-training. Gainer and Loser terms occupy an intermediate ‘reactive middle’ zone, consistent with the hypothesis that fine-tuning primarily affects terms with moderate prior familiarity. Error bars represent standard error of the mean.
Figure 8. Gains and Losses after fine-tuning by term latent probabilistic knowledge. Terms were categorized according to latent probabilistic knowledge, and their status before and after fine-tuning was labeled as either a Gainer, a Loser, as continuing Correct, or as continuing Incorrect. Counts (left panel) and Proportions (right panel) are shown separately.
Figure 9. Gains and losses after fine-tuning by partial subtoken knowledge. Terms were categorized according to partial subtoken knowledge, and their status before and after fine-tuning was labeled as either a Gainer, a Loser, as continuing Correct, or as continuing Incorrect. Counts (left panel) and proportions (right panel) are shown separately.
