Article

Elevating Clinical Semantics: Contrastive Pre-Training Beyond Cross-Entropy in Discharge Summaries †

Department of AI Convergence & Engineering, Open Cyber University of Korea, C-9F, 353, Mangu-ro, Jungnang-gu, Seoul 02087, Republic of Korea
* Author to whom correspondence should be addressed.
This article is an expanded version of a paper entitled “Contrastive Representations Pre-Training for Enhanced Discharge Summary BERT”, which was presented at the 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI), Victoria, BC, Canada, 9–12 August 2021.
Appl. Sci. 2025, 15(12), 6541; https://doi.org/10.3390/app15126541
Submission received: 30 April 2025 / Revised: 6 June 2025 / Accepted: 8 June 2025 / Published: 10 June 2025

Abstract

Despite remarkable advances in neural language models, a substantial gap remains in precisely interpreting the complex semantics of Electronic Medical Records (EMR). We propose Contrastive Representations Pre-Training (CRPT) to address this gap, replacing the conventional Next Sentence Prediction task’s cross-entropy loss with contrastive loss and incorporating whole-word masking to capture multi-token domain-specific terms better. We also introduce a carefully designed negative sampling strategy that balances intra-document and cross-document sentences, enhancing the model’s discriminative power. Implemented atop a BERT-based architecture and evaluated on the Biomedical Language Understanding Evaluation (BLUE) benchmark, our Discharge Summary CRPT model achieves significant performance gains, including a natural language inference precision of 0.825 and a sentence similarity score of 0.775. We further extend our approach through Bio+Discharge Summary CRPT, combining biomedical and clinical corpora to boost downstream performance across tasks. Our framework demonstrates robust interpretive capacity in clinical texts by emphasizing sentence-level semantics and domain-aware masking. These findings underscore CRPT’s potential for advancing semantic accuracy in healthcare applications and open new avenues for integrating larger negative sample sets, domain-specific masking techniques, and multi-task learning paradigms.

1. Introduction

Natural Language Processing (NLP) has rapidly evolved, becoming a cornerstone of computational linguistics and playing a critical role across various fields, including biomedical and clinical research. A key challenge in this domain is transforming unstructured textual data, such as patient narratives and clinical records, into structured, machine-readable formats. This transformation enables the extraction of actionable insights and is essential for tasks like patient record analysis and eligibility criteria evaluation. Information extraction (IE), mainly through techniques like named entity recognition (NER) and relation extraction (RE), forms the backbone of this process. NER identifies entities within the text (e.g., diseases, treatments), while RE discerns relationships between these entities. Together, these components form the foundation for successful NLP applications in healthcare.
At the forefront of these advancements is BERT (Bidirectional Encoder Representations from Transformers) [1], a model pre-trained on large corpora such as Wikipedia and books. BERT has significantly improved performance in natural language understanding (NLU) tasks, particularly when benchmarked against the General Language Understanding Evaluation (GLUE) dataset [2]. In the biomedical domain, specialized versions of BERT, such as BioBERT [4] and Clinical BERT [3], have emerged, further refining BERT’s capabilities by pre-training on domain-specific corpora. BioBERT, for example, is trained on datasets like PubMed and PMC, which include specialized biomedical terminology, leading to better downstream task performance in biomedical applications [5,6,7]. Similarly, Clinical BERT has demonstrated superior performance in clinical tasks, especially when trained on discharge summaries from the MIMIC-III database, comprising over 200,000 notes.
Despite these advancements, traditional BERT models, including their domain-specific adaptations, struggle to capture sentence-level semantics effectively. BERT’s pre-training involves Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). While MLM is adept at learning token-level context by masking random tokens in a sequence and predicting them based on their surroundings, it struggles to capture broader sentence-level semantics. Similarly, NSP focuses primarily on determining the sequential relationship between sentences, which limits its ability to understand the deeper semantic relationships between sentences. Moreover, the cross-entropy loss function used in the NSP task fails to reflect semantic similarities accurately, leading to challenges in capturing nuanced sentence interplays [8]. These limitations become even more pronounced in complex clinical documents, where terms can be multi-token, and sentence continuity alone often does not suffice for accurate context modeling.
Recent studies have explored contrastive self-supervised learning frameworks to address these limitations, demonstrating significant improvements in capturing sentence-level semantics in generic NLP tasks [9,10]. For example, models like CERT (Contrastive Self-Supervised Encoder Representations from Transformers) have outperformed BERT in several tasks benchmarked on GLUE, emphasizing the potential of contrastive learning in enhancing language representations. CERT employs back-translation for text augmentation, which generates new sentence representations by translating them into another language and then back to the original language [9]. However, in domain-specific contexts such as biomedical text, back-translation presents challenges, as minor shifts in meaning during translation can introduce critical errors [10,11,12]. Furthermore, negative sampling strategies based solely on randomly paired sentences may not sufficiently differentiate subtle clinical topics, highlighting a need for more nuanced approaches—such as leveraging metadata or domain-specific cues—to create effective negative samples.
Given these constraints, our research introduces the Discharge Summary CRPT (Contrastive Representations Pre-Training) model, which enhances clinical language representations without relying on text augmentation techniques like back-translation. Our model replaces the traditional cross-entropy loss in the NSP task with a contrastive loss to better capture sentence-level semantics. In addition, we carefully design negative pairs from both intra-document and cross-document sentence pools to ensure that the model discriminates between closely related and truly unrelated content. Furthermore, we integrate the Whole-Word Masking (WWM) technique into the MLM phase, as proposed in ERNIE [13], to enhance performance by considering entire words rather than subword tokens in isolation. This masking strategy can be further tailored to domain-specific entities or medical concepts, potentially allowing for “concept-level” masking that captures the continuity of specialized terms. Such enhancements aim to preserve the nuanced semantics in clinical corpora, where partial sub-word masking can otherwise undermine context understanding.
Unlike CERT, which combines back-translation with MoCo-style (momentum contrast) learning on general-domain text, and ERNIE, which introduces Whole-Word Masking (WWM) without any contrastive component, our CRPT model is the first to fuse contrastive loss with WWM directly on clinical discharge summaries. This tight coupling preserves multi-token medical entities while simultaneously enforcing sentence-level discrimination. As a result, CRPT yields substantially richer semantic representations than existing biomedical NLP models. A preliminary version of this work was previously presented as a short paper at the IEEE International Conference on Healthcare Informatics (ICHI 2021), focusing on the initial development of the CRPT model [14].
In this study, we focus on the Discharge Summary CRPT model, inspired by Discharge Summary BERT [3], but with significant improvements in language representation. Tested on the BLUE (Biomedical Language Understanding Evaluation) benchmark [15], our model achieved superior performance in natural language inference and sentence similarity tasks compared to traditional models. Additionally, we introduce Bio+Discharge Summary CRPT, which combines biomedical and clinical domain pre-training, further boosting performance in medical NLP tasks. Throughout our experiments, we also examine how varying the margin parameter in contrastive loss and adjusting other hyperparameters impact performance, and we highlight representative error cases to clarify where the model still struggles—an essential step for real-world clinical deployment. Moreover, we briefly discuss how future work could extend CRPT to a multi-task learning setup (e.g., integrating classification, summarization, and QA) or increase the batch size to broaden negative sample diversity.
Our contributions are as follows:
  • We propose the Contrastive Representations Pre-Training (CRPT) model for clinical language tasks, enhancing sentence-level semantics by integrating contrastive loss and WWM;
  • We demonstrate significant improvements in medical NLP tasks using the BLUE benchmark, including natural language inference and sentence similarity tasks;
  • We introduce the Bio+Discharge Summary CRPT model, which combines domain-specific pre-training in biomedical and clinical corpora and outperforms existing models on various tasks;
  • We discuss practical considerations for negative sampling, domain-specific masking, and model interpretability, highlighting potential avenues for improved accuracy and better clinical integration.

2. Materials and Methods

2.1. Data Collection

In this study, we utilized the publicly available MIMIC-III [16] dataset (developed by the MIT Lab for Computational Physiology, Cambridge, MA, USA), consisting of de-identified critical care patient data, which includes 15 types of clinical notes. Among these, we primarily selected the Discharge Summary note type for pre-training our BERT-based models using the CRPT framework, given its comprehensive nature and substantial mention of diagnoses, treatments, and outcomes. This subset contains 59,652 discharge notes collected from the Beth Israel Deaconess Medical Center between 2001 and 2012. It is a valuable resource for NLP tasks due to its structured yet context-rich content.
Discharge summaries are particularly significant in clinical NLP, encapsulating end-to-end patient information. They are frequently used in downstream tasks such as medical condition classification and patient outcome prediction, as shown in Clinical BERT [3]. To facilitate more robust negative sampling and better contrastive learning, we also leveraged a small subset of other MIMIC-III note types to introduce genuinely unrelated sentence pairs, ensuring that the model could discriminate between closely related and entirely distinct contexts. The data were pre-processed using a standardized script from the Clinical BERT implementation, which includes tokenization, lowercasing, and removal of special characters, thereby preparing the notes for both whole-word masking and contrastive objectives. This process also helped preserve domain-specific medical terminology, allowing our approach to capture multi-token clinical entities more effectively during CRPT pre-training.
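As a rough illustration of the pre-processing described above, the following Python sketch lowercases a note, strips MIMIC-style de-identification placeholders and special characters, and splits the result into sentences. The function name and the exact regular expressions are illustrative assumptions, not the actual Clinical BERT script.

```python
import re

def preprocess_note(text: str) -> list[str]:
    """Illustrative clean-up of a discharge note: lowercase, strip special
    characters, and split into sentences. This only approximates the kind of
    pre-processing described above; it is not the Clinical BERT script."""
    text = text.lower()
    # Remove de-identification placeholders such as [** ... **] used in MIMIC-III.
    text = re.sub(r"\[\*\*.*?\*\*\]", " ", text)
    # Drop characters outside a conservative whitelist (assumed, for illustration).
    text = re.sub(r"[^a-z0-9.,;:()/%#\-\s]", " ", text)
    # Collapse whitespace and split on sentence-ending periods (very rough).
    text = re.sub(r"\s+", " ", text).strip()
    return [s.strip() for s in re.split(r"(?<=[.])\s+", text) if s.strip()]
```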

2.2. Pre-Training BERT Model

Based on the Transformer architecture [17], BERT produces contextualized word representations by capturing bidirectional context from large-scale corpora. Its pre-training process involves two key tasks: Masked Language Modeling (MLM) and Next Sentence Prediction (NSP). In MLM, random tokens from the input sequence are masked, and the model learns to predict these masked tokens using their surrounding context. Meanwhile, NSP helps BERT determine whether a given sentence pair follows a sequential relationship, aiding in sentence-level understanding.
Pre-trained BERT models—trained on large general-domain datasets such as Wikipedia and BooksCorpus—are publicly available, providing robust initial parameters for a wide range of language tasks. These models can be further adapted to specific domains via transfer learning. In our experiments, we initialized parameters using the publicly released BERT-Base model, which consists of 12 transformer layers, 768 hidden units, and 12 attention heads. This setup offered a strong baseline for subsequent fine-tuning of clinical domain data.
Figure 1 illustrates how MLM and NSP collectively contribute to BERT’s learning capacity. While MLM captures token-level context by predicting masked tokens, and NSP classifies whether sentence pairs are sequentially related, these tasks alone often fail to model deeper sentence-level semantics. Such limitations have motivated our introduction of Contrastive Representations Pre-Training (CRPT), detailed in the following sections, to more effectively capture the nuanced relationships prevalent in clinical texts.

2.3. Pre-Training Discharge Summary CRPT Model

Clinical BERT is a variant of BERT pre-trained on clinical data from the MIMIC-III v1.4 corpus [16]. It is available in two versions: Clinical BERT, trained on all note types from MIMIC-III, and Discharge Summary BERT, which specializes in discharge summary notes. Table 1 shows that, despite relying on significantly less data, Discharge Summary BERT often achieves comparable or superior performance to Clinical BERT in various clinical NLP tasks [3].
In this study, we build upon the Discharge Summary BERT approach by replacing the Next Sentence Prediction cross-entropy loss with a contrastive loss function and integrating WWM to capture domain-specific multi-token terms more effectively. We refer to this new model as Discharge Summary CRPT—Contrastive Representations Pre-Training, reflecting our emphasis on sentence-level semantics within discharge summaries. Specifically, the same subset of 59,652 discharge notes from MIMIC-III [16] was used, but the pre-training objective was altered to maximize similarity for consecutive or contextually linked sentences while minimizing similarity for negative pairs (e.g., sentences from unrelated sections or different patient notes). This strategy aims to better capture nuanced clinical relationships—such as symptom evolution or treatment outcomes—and provides a more robust foundation for downstream tasks. By focusing on discharge summaries, which distill crucial patient information, we highlight how contrastive learning methods can substantially enhance a model’s clinical language representation, even when trained on relatively smaller, specialized corpora.

2.4. Pre-Training Bio+Discharge Summary CRPT Model

The BlueBERT model (National Center for Biotechnology Information, Bethesda, MD, USA) [7] combines PubMed and PMC data—both rich biomedical text sources—with the MIMIC-III dataset [17] from the clinical domain, initializing its parameters from a pre-trained BERT model. Following the original BERT pre-training methodology, BlueBERT achieved state-of-the-art (SOTA) performance on the BLUE benchmark, which consists of ten NLP tasks assessed by four evaluation metrics. In addition, BlueBERT provides pre-processed PubMed text comprising over four billion words in ASCII format, thereby offering extensive biomedical domain coverage.
Building upon this foundation, we pre-trained our Bio+Discharge Summary CRPT model by combining the discharge summary notes from MIMIC-III [17] with the pre-processed PubMed data available through BlueBERT [7]. We employed the same contrastive objectives and whole-word masking approach as in Discharge Summary CRPT but extended the training corpus to include broader biomedical contexts. This hybrid strategy aims to capture detailed clinical semantics from discharge summaries while leveraging a large-scale biomedical vocabulary that spans diseases, molecular biology, and pharmacology. Consequently, the Bio+Discharge Summary CRPT model benefits from both domain-specific clinical expressions and more general biomedical terminology, enhancing its capacity to handle diverse downstream tasks. By uniting these two corpora under a contrastive learning framework, we ensure that the model not only aligns related sentences within each domain but also effectively distinguishes between clinically distinct or biomedically unrelated sentence pairs, paving the way for robust and flexible performance across multiple medical NLP applications.

2.5. Contrastive Self-Supervised Learning

BERT’s pre-training method has proven highly effective for a range of NLP tasks, yet it often struggles to generalize to other domains, such as computer vision, where pretext tasks rely on heuristics [18]. In contrast, contrastive learning offers a more adaptable framework by forming positive and negative sample pairs. Positive pairs are typically created through augmentations of the same data instance, while negative pairs derive from different data instances. By minimizing the distance between positive pairs and maximizing the distance between negative pairs in latent space, the model acquires robust and generalizable feature representations.
CERT (Contrastive Self-Supervised Encoder Representations from Transformers) builds on this principle, employing back-translation as a global sentence-level augmentation method. Specifically, sentences are translated into an intermediate language (e.g., German or Chinese) and then translated back, generating augmented sentence pairs that approximate real variations in text. In contrast, Easy Data Augmentation (EDA) [19] relies on local word-level techniques—such as synonym replacement, random insertion, and swapping—to form augmented samples. Figure 2 demonstrates how CERT pulls positive pairs (semantically similar sentences) closer while pushing negative pairs farther apart in latent space. By combining back-translation with the Momentum Contrast (MoCo) loss function [20], CERT outperforms BERT in multiple GLUE tasks, illustrating the effectiveness of contrastive learning for sentence-level understanding.
In domain-specific contexts—like clinical or biomedical NLP—such augmentation strategies must be carefully adapted to preserve precise terminology. Our approach avoids back-translation errors by leveraging negative pairs sampled from unrelated sections or different documents, ensuring accurate domain context. This design choice aligns with the contrastive principle while mitigating semantic drift issues that can arise from purely generative augmentations in specialized fields.

3. Proposed Approach

Enhancing language models remains pivotal in the rapidly evolving field of medical natural language processing. This study introduces Contrastive Representations Pre-Training (CRPT) to address the limitations of conventional BERT-based models in capturing sentence-level semantics within clinical texts. Specifically, we replace the standard cross-entropy loss in the Next Sentence Prediction (NSP) task with a contrastive loss function, aiming to better reflect nuanced semantic relationships between sentences rather than mere continuity. We also move from random token masking to WWM, which preserves multi-token clinical terms and, thus, enriches the model’s ability to interpret discharge summaries.
We implement our approach in a Discharge Summary CRPT model built on top of BERT-Base, fine-tuning the model using carefully designed positive–negative pairs derived from discharge summaries and, in some cases, unrelated notes. This procedure ensures robust discrimination between sentences that are contextually similar and those that are semantically unaligned. By focusing on high-level context rather than sequence order, our proposed framework effectively addresses the shortcomings of NSP in clinical environments. The subsequent sections detail how we revise the NSP objective, highlight the advantages of contrastive loss, and discuss the integration of WWM to capture domain-specific tokens essential for clinical applications.

3.1. NSP and Its Limitations

The NSP task in the original BERT model was designed to predict whether two sentences are sequentially related [21,22]. In this process, the model feeds two input sentences through the BERT encoder and uses the CLS token to classify their relationship. As illustrated in Figure 3, only the representation of the CLS token from the encoder’s final hidden layer is used for the sentence prediction task. The sentence-continuity label is set to 1 when the second sentence is drawn at random from a different document (random next) and 0 when it is the actual consecutive sentence from the same document (actual next).
The model learns to maximize the probability of the correct sentence label via cross-entropy loss. While this approach succeeds in identifying sequence continuity, it cannot capture more intricate relational dynamics between sentences. Because the cross-entropy objective focuses only on whether two sentences follow in order, it undervalues deeper semantic similarities. Consequently, the model often struggles with complex context shifts, cause–effect relationships, and other nuanced clinical inferences that are critical for tasks such as patient condition analysis and discharge summary comprehension. The proposed CRPT method directly addresses these shortcomings by leveraging contrastive learning to align semantically similar sentences and separate those that are semantically divergent.
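For concreteness, the sketch below shows how NSP training pairs of the kind described above can be assembled, with label 0 for the actual next sentence and label 1 for a random next sentence. The helper name and the 50/50 sampling ratio are illustrative assumptions rather than BERT’s exact data-creation pipeline.

```python
import random

def make_nsp_examples(documents, rng=random.Random(0)):
    """Build (sentence_a, sentence_b, label) triples as described above:
    label 0 for the actual next sentence, label 1 for a sentence drawn at
    random from a different document. `documents` is a list of sentence
    lists; this is an illustrative sketch, not BERT's data-creation script."""
    examples = []
    for doc_idx, doc in enumerate(documents):
        others = [d for j, d in enumerate(documents) if j != doc_idx]
        for i in range(len(doc) - 1):
            if others and rng.random() < 0.5:
                # "Random next": a sentence from a different document.
                examples.append((doc[i], rng.choice(rng.choice(others)), 1))
            else:
                # "Actual next": the immediately following sentence.
                examples.append((doc[i], doc[i + 1], 0))
    return examples
```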

3.2. Contrastive Representation Learning (CRL)

The conventional next-sentence prediction (NSP) objective in BERT captures only coarse sentence adjacency, which is often insufficient for the subtle discourse structure of long discharge summaries. We, therefore, replace NSP with a margin-based contrastive objective that pulls semantically coherent sentences together and pushes unrelated sentences apart in the latent space.
Positive pairs were defined not only as immediately consecutive sentences but also as clinically linked sentences that share at least one UMLS clinical concept (CUI) within a two-sentence window; pilot studies showed that this broader criterion yields richer supervision than simple adjacency.
For each anchor sentence, the immediately following sentence in the same summary is treated as a positive partner, whereas negative partners are selected in two ways: (i) a non-consecutive sentence drawn from a different section of the same summary (intra-document negative), and (ii) a sentence sampled from a different patient’s summary (inter-document negative). This hybrid sampling scheme supplies the model with semantically unambiguous negative examples while preserving domain context.
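The sketch below illustrates this hybrid pair construction under simplifying assumptions: `summaries` (a note-to-sentences map) and `cuis` (per-sentence concept sets produced by an external UMLS tagger) are hypothetical inputs, and the window and distance thresholds are only indicative of the strategy described above.

```python
import random

def sample_crpt_pairs(summaries, cuis, rng=random.Random(0)):
    """Sketch of the hybrid pair construction described above. `summaries`
    maps note_id -> list of sentences; `cuis` maps (note_id, sent_idx) -> set
    of UMLS CUIs from an external concept tagger (hypothetical here).
    Returns (anchor, positive, intra_negative, inter_negative) tuples."""
    pairs = []
    note_ids = list(summaries)
    for note_id in note_ids:
        sents = summaries[note_id]
        other_ids = [n for n in note_ids if n != note_id]
        for i in range(len(sents) - 1):
            anchor, positive = sents[i], sents[i + 1]
            # Widen positives to a two-sentence window when a CUI is shared.
            if i + 2 < len(sents) and cuis.get((note_id, i), set()) & cuis.get((note_id, i + 2), set()):
                positive = rng.choice([sents[i + 1], sents[i + 2]])
            # Intra-document negative: a non-adjacent sentence from the same note.
            far = [s for j, s in enumerate(sents) if abs(j - i) > 2]
            intra_neg = rng.choice(far) if far else None
            # Inter-document negative: a sentence from a different patient's note.
            inter_neg = rng.choice(summaries[rng.choice(other_ids)]) if other_ids else None
            pairs.append((anchor, positive, intra_neg, inter_neg))
    return pairs
```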
After encoding each sentence with BERT and projecting it through a tanh-activated dense layer, we obtain three vectors, $z_i$, $z_i^{+}$, and $z_i^{-}$, representing the anchor, positive, and negative sentences in a mini-batch of size $N$. Using the Euclidean distance $D(u, v) = \lVert u - v \rVert_2$, we minimize the margin-based contrastive loss in Equation (1) [23]:
$$\mathcal{L}_{\mathrm{contrast}} = \frac{1}{N}\sum_{i=1}^{N}\left[\,D\!\left(z_i, z_i^{+}\right)^{2} + \alpha\,\max\!\left(0,\; m - D\!\left(z_i, z_i^{-}\right)\right)^{2}\right] \quad (1)$$
We evaluated margins m ∈ {1.0, 1.5, 2.0, 2.5} on the MedNLI development split and found that accuracy peaked at m = 2.0. Therefore, we fixed m = 2.0 for all experiments. We also set α = 0.5 based on the same pilot study. Detailed numbers are available upon request.
Figure 3 compares the traditional NSP approach with our CRL method, highlighting how CRL shifts from a soft-max next-sentence classifier to a distance-based objective for more robust semantic representation.
Figure 4 illustrates how Equation (1) contracts distances between consecutive or contextually similar sentences (positive pairs) while expanding those between non-consecutive or unrelated sentences (negative pairs).
To leverage unlabelled text more effectively, we combine Equation (1) with whole-word masked-language modeling (WWM-MLM) and optimize the joint objective:
$$\mathcal{L}_{\mathrm{total}} = \lambda\,\mathcal{L}_{\mathrm{MLM}} + (1-\lambda)\,\mathcal{L}_{\mathrm{contrast}}, \qquad \lambda = 0.5$$
where λ balances the MLM and contrastive losses (fixed to 0.5). Both objectives share the same masked tokens so that the gradient signals are coherent. The margin m and coefficient α were tuned on the MedNLI development split and then held constant in every subsequent experiment. All remaining hyper-parameters, including the number of pre-training steps, learning rate, batch size, and sequence length, follow exactly the configuration reported in Section 4.1 and Table 2.
Notation. z—sentence embedding after the projection head; D—Euclidean distance; N—mini-batch size; m—margin; α—scaling coefficient; λ—MLM/contrastive mixing weight.
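A minimal PyTorch sketch of Equation (1) and the joint objective follows; the function names are illustrative, and the tanh projection head and the MLM loss are assumed to be computed elsewhere.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(z, z_pos, z_neg, margin=2.0, alpha=0.5):
    """Margin-based contrastive loss of Equation (1). z, z_pos, z_neg are
    (N, d) tensors holding the anchor, positive, and negative sentence
    embeddings after the projection head."""
    d_pos = F.pairwise_distance(z, z_pos, p=2)   # D(z_i, z_i^+)
    d_neg = F.pairwise_distance(z, z_neg, p=2)   # D(z_i, z_i^-)
    loss = d_pos.pow(2) + alpha * torch.clamp(margin - d_neg, min=0.0).pow(2)
    return loss.mean()

def total_loss(mlm_loss, z, z_pos, z_neg, lam=0.5):
    """Joint objective: lam * L_MLM + (1 - lam) * L_contrast."""
    return lam * mlm_loss + (1.0 - lam) * contrastive_loss(z, z_pos, z_neg)
```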

3.3. Whole-Word Masking (WWM)

Traditional masking methods in pre-trained language models (e.g., BERT) frequently produce fragmented or partial word masking, where sub-word tokens are masked independently. Such fragmentation can impair the model’s ability to fully capture semantic meaning, particularly in medical texts, whose terminology is often split into multiple sub-word tokens (e.g., “hematuria”). WWM directly addresses this issue by masking entire words rather than isolated sub-tokens, thereby preserving the linguistic integrity and contextual unity of domain-specific vocabulary.
This technique was initially introduced in ERNIE and later adopted by BERT, demonstrating improved performance across various NLP tasks. In our Discharge Summary CRPT pre-training, we integrated WWM into the Masked Language Modeling phase, ensuring that multi-token clinical terms remain semantically cohesive. Figure 5 contrasts Random Masking (RM) with WWM, illustrating how RM inadvertently splits tokens like “##am” and “##mon,” thus obscuring the original word structure (e.g., “phil ##am ##mon”). WWM, in contrast, masks the entire word at once, preventing vital medical terms from being diluted into multiple sub-tokens.
By employing WWM, our approach achieves more context-aware representations of complex medical terminology, a critical advantage in tasks where terminological precision (e.g., diagnoses or treatments) directly impacts model performance. Indeed, this cohesive masking strategy proves especially crucial for discharge summaries and clinical notes, where the correct interpretation of multi-token medical entities can significantly influence downstream applications such as patient outcome prediction, information extraction, and clinical decision support.
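The following sketch illustrates whole-word masking over WordPiece tokens: sub-tokens prefixed with “##” are grouped with their preceding token and masked together. It is a simplified stand-in for the actual pre-training data pipeline (for instance, it omits BERT’s 80/10/10 mask/replace/keep split).

```python
import random

def whole_word_mask(tokens, mask_prob=0.15, rng=random.Random(0)):
    """Illustrative whole-word masking over WordPiece tokens: sub-tokens
    starting with '##' are grouped with the preceding token, and every
    sub-token of a selected word is replaced with [MASK] together."""
    # Group sub-token indices into whole words.
    words, current = [], []
    for idx, tok in enumerate(tokens):
        if tok.startswith("##") and current:
            current.append(idx)
        else:
            if current:
                words.append(current)
            current = [idx]
    if current:
        words.append(current)
    masked, labels = list(tokens), [None] * len(tokens)
    for word in words:
        if rng.random() < mask_prob:
            for idx in word:
                labels[idx] = tokens[idx]   # targets for the MLM objective
                masked[idx] = "[MASK]"
    return masked, labels

# Example: ["the", "patient", "denies", "hema", "##turia"] ->
# the span "hema ##turia" is masked as one unit whenever it is selected.
```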

4. Experimental Result

4.1. Experimental Setup

This section describes how we trained and fine-tuned various BERT-based models for clinical NLP tasks, focusing on MIMIC-III discharge summaries and the BLUE benchmark. As a foundational baseline, BERT-Base was used with its standard 12-layer Transformer, 768 hidden units, and 12 attention heads, pre-trained on Wikipedia and BookCorpus. By contrast, Discharge Summary BERT shares the same architecture but specializes in the MIMIC-III Discharge Summary dataset (59,652 notes), improving its handling of domain-specific terms and contexts.
All experiments were run on an eight-core Google Cloud Platform TPU. We set the sequence length to 128 and the batch size to 128 for pre-training, with a learning rate of 5 × 10−5. During fine-tuning, the batch size was reduced to 32 while retaining the same learning rate, and each model was trained for 30 epochs. Every experiment was repeated five times with different random seeds, and we report the median accuracy for consistency. While BERT-Base served solely as a fine-tuning baseline, Discharge Summary CRPT and Bio+Discharge Summary CRPT were pre-trained under a contrastive objective combined with whole-word masking. To ensure a fair comparison, all models were trained and fine-tuned under identical hyperparameters: 200K pre-training steps, 30 fine-tuning epochs, maximum sequence length 128, learning rate 5 × 10−5, and batch sizes of 128/32. Full settings are summarized in Table 2.
Table 2 outlines the configuration of all BERT-based models, including initial checkpoints, datasets, and training steps. This setup allowed us to systematically compare how each initialization, ranging from general-domain BERT-Base to specialized CRPT variations, affects performance on multiple clinical downstream tasks. By maintaining consistent hyperparameters across models, we isolate the impact of domain-specific pre-training and contrastive learning strategies on clinical NLP outcomes.
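For reference, the settings above can be collected into a single configuration block. The dictionary keys and the checkpoint identifier are illustrative assumptions (the experiments used the official BERT-Base checkpoint on TPUs), not an exact copy of our training scripts.

```python
# Hyper-parameters as reported in Section 4.1 and Table 2; keys are illustrative.
PRETRAIN_CONFIG = {
    "init_checkpoint": "bert-base-uncased",   # assumed identifier: 12 layers, 768 hidden, 12 heads
    "max_seq_length": 128,
    "pretrain_batch_size": 128,
    "finetune_batch_size": 32,
    "learning_rate": 5e-5,
    "pretrain_steps": 200_000,
    "finetune_epochs": 30,
    "contrastive_margin": 2.0,                # m, tuned on the MedNLI dev split
    "contrastive_alpha": 0.5,                 # alpha in Equation (1)
    "mlm_mix_lambda": 0.5,                    # lambda in the joint objective
    "num_seeds": 5,                           # five runs, median reported
}
```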

4.2. Evaluation of the CRPT Model

To assess the performance of our proposed Discharge Summary CRPT model, we utilized the BLUE benchmark dataset, which is designed for medical language understanding tasks across both biomedical and clinical domains. The BLUE benchmark evaluates models on various sentence-level and token-level tasks, such as inference, sentence similarity, named entity recognition (NER), and relation extraction. Table 3 provides a summary of the experimental datasets used for evaluation, including MedNLI for inference, BIOSSES for sentence similarity, ShARe/CLEFE for NER, and i2b2-2010 for relation extraction. Standard deviations from five independent runs are reported alongside median scores, underscoring the robustness of our models.
Our evaluation compared the performance of BERT-Base, BioBERT, Discharge Summary BERT, Bio+Discharge Summary BERT, BlueBERT, and the Discharge Summary CRPT models on the BLUE benchmark tasks. Table 4 presents the performance results of the models across various NLP tasks. The Bio+Discharge Summary CRPT model achieved the highest performance across most tasks, outperforming other baseline models, especially in MedNLI and BIOSSES tasks.
In the second experiment, we analyzed the models’ pre-training loss on the training set. Comparing the training loss of Discharge Summary BERT and Discharge Summary CRPT over 150K training steps, both models exhibited similar loss patterns at the beginning of training. However, the Discharge Summary CRPT model demonstrated a more stable and consistent decrease in loss throughout training and maintained lower loss values as training progressed, indicating a more efficient and stable pre-training process.
These findings suggest that applying contrastive representation learning within the CRPT model enhances its ability to capture and refine semantic relationships in clinical texts, leading to improved performance. The Discharge Summary CRPT model, in particular, benefits from this learning method, showing more refined understanding capabilities for clinical domain tasks. This outcome provides essential insights for advancing NLP in the medical field, particularly in discharge summary analysis and other clinical NLP tasks.

4.3. Quantitative Analysis

4.3.1. Sentence-Level Evaluation

MedNLI: The MedNLI dataset comprises sentence pairs extracted from the MIMIC-III database [16]. For each pair, one sentence serves as a premise and the other as a hypothesis. The task for the models is to predict whether the relationship between the two sentences is contradictory, neutral, or entailment. Accuracy was the chosen evaluation metric for assessing model performance on this task.
BIOSSES: BIOSSES is a corpus containing sentence pairs from the Biomedical Summarization Track Training Dataset, focusing on the biomedical domain [24]. The task involves determining the similarity between two sentences. The Pearson correlation coefficient was employed as the evaluation metric to measure the model’s ability to capture the degree of similarity between sentences.

4.3.2. Token-Level Evaluation

ShARe/CLEF: The ShARe/CLEF dataset consists of 299 de-identified clinical free-text notes derived from the MIMIC-II database [25]. The task assigned to the models is identifying entity names within the text that map to specific UMLS (Unified Medical Language System) codes. This task is designed to evaluate the model’s capacity to capture the meaning of each word in a sentence. The F1 score was used to compare the baseline and proposed models’ entity extraction performance.
i2b2-2010: This dataset focuses on shared NLP tasks designed to improve language understanding within the clinical domain [26]. The task requires models to predict relationships between medical problems, tests, and treatments within sentences. As in the ShARe/CLEF evaluation, the F1 score was used to measure and compare the performance of the baseline and proposed models on the extraction task.
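A compact sketch of the metrics used across these tasks is shown below. It relies on scikit-learn and SciPy and glosses over the span-level matching performed by the official BLUE evaluation scripts, so it should be read as an approximation rather than the exact evaluation code.

```python
from sklearn.metrics import accuracy_score, f1_score
from scipy.stats import pearsonr

def evaluate(task, y_true, y_pred):
    """Metrics used for the BLUE tasks above: accuracy for MedNLI, Pearson
    correlation for BIOSSES, and F1 for ShARe/CLEF and i2b2-2010. The official
    BLUE scripts additionally handle span matching for NER; this is a sketch."""
    if task == "mednli":
        return accuracy_score(y_true, y_pred)
    if task == "biosses":
        return pearsonr(y_true, y_pred)[0]
    # ShARe/CLEF and i2b2-2010 report F1 over entity/relation labels.
    return f1_score(y_true, y_pred, average="micro")
```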

4.4. Quantitative Analysis of CRPT Models

Five consecutive sentence pairs (A/B) were selected from discharge summary notes to evaluate the similarity between sentence pairs in the latent space. As shown in Figure 3, the sentences were processed by separating them based on segment ID. Each sentence pair was input into the model, and the similarity-based distance between their representations was measured. This analysis compares the BERT baseline model [1] and our CRPT models, with the results summarized in Table 5.
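The distance computation can be approximated as in the sketch below, which mean-pools encoder hidden states into sentence vectors and compares them with a negative Euclidean distance. The Hugging Face interface, the placeholder checkpoint name, and the mean-pooling choice are assumptions for illustration and differ from the segment-ID-based pairing used in our experiments.

```python
import torch
import seaborn as sns
from transformers import AutoTokenizer, AutoModel

# Placeholder checkpoint; the released CRPT weights are not assumed here.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
enc = AutoModel.from_pretrained("bert-base-uncased")

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the final hidden states as a simple sentence representation."""
    with torch.no_grad():
        batch = tok(text, return_tensors="pt", truncation=True, max_length=128)
        hidden = enc(**batch).last_hidden_state        # (1, seq_len, 768)
        return hidden.mean(dim=1).squeeze(0)

def similarity_matrix(sentences_a, sentences_b):
    """Pairwise negative Euclidean distance; larger values mean more similar."""
    za = torch.stack([sentence_embedding(s) for s in sentences_a])
    zb = torch.stack([sentence_embedding(s) for s in sentences_b])
    return -torch.cdist(za, zb)                        # (len_a, len_b)

# Example: sns.heatmap(similarity_matrix(a_sents, b_sents).numpy(), annot=True)
```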
For qualitative analysis, similarity heatmaps were generated to visualize the degree of similarity between sentence pairs. These heatmaps are presented for the baseline BERT model (Figure 6a) and the CRPT model (Figure 6b). Three salient patterns emerge.
  • Local coherence (main diagonal). In both models, the principal diagonal—corresponding to consecutive sentence pairs—displays the highest similarity. However, CRPT consistently yields darker cells and larger numeric values (e.g., 3.7 vs. 3.3), indicating that the contrastive objective tightens local semantic cohesion;
  • Global discourse (off-diagonal intra-document blocks). Cells linking non-adjacent sentences from the same summary retain moderate similarity under CRPT, whereas BERT often collapses these to near 0. This suggests that CRPT balances sentence-level specificity with broader document context;
  • True negatives (inter-document region). Rows and columns labeled ‘5-A’ and ‘5-B’ originate from different patient notes. In CRPT, these cross-document pairs cluster around 0 or even negative values (e.g., −3.7), forming a pale band that contrasts sharply with the intra-document area. The separation verifies that the mixed negative-sampling scheme effectively pushes unrelated sentences apart.

5. Discussion

Our model was pre-trained using Whole-word masking (WWM) and Contrastive Representation Learning (CRL) techniques. As evidenced by the experimental results, our Bio+Discharge Summary CRPT model outperformed BlueBERT [7] in three out of the four tasks. The Accuracy, Pearson correlation, and F1 score of the Bio+Discharge Summary CRPT model were 0.847, 0.848, and 0.761, respectively, compared to BlueBERT, which achieved 0.841, 0.839, and 0.757. Notably, the performance improvement was more pronounced in sentence-level tasks than in token-level tasks. In particular, the Named Entity Recognition (NER) task demonstrated comparable results, with a slight edge favoring BlueBERT in performance.
MedNLI, the sentence-level task, exhibited improved performance as the overall data used for training increased. The Discharge Summary CRPT model achieved results close to BlueBERT’s for the ShARe/CLEFE and i2b2-2010 tasks, with a 2.2% gap observed in MedNLI. However, compared to Discharge Summary BERT, our CRPT model showed a significant improvement of 3.4%. Furthermore, Bio+Discharge Summary CRPT outperformed BlueBERT [7] despite being trained with fewer datasets. The BIOSSES task exhibited a notable performance difference based on the biomedical data, and because this task evaluates sentence similarity, models utilizing contrastive representation learning achieved better performance than the baseline model.
For the ShARe/CLEFE task, although MIMIC-III [17] contains various note types, we utilized only the discharge summary note type in this study. Since WWM selects tokens at random, certain domain-specific entities in the ShARe/CLEFE dataset may have been treated as unimportant during pre-training. Although the CRPT method improved performance for models trained on the same data, a comparison of the precision and recall underlying the F1 score shows that Bio+Discharge Summary CRPT scored 0.849 in precision and 0.825 in recall, whereas BlueBERT scored 0.844 and 0.832, respectively.
This suggests that in the CRPT model, instances that should be labeled as true positives are sometimes mistakenly treated as negatives. Future research could address this by incorporating all available MIMIC-III data [16] or by designing masking methods tailored to domain-specific entities [27,28,29]. Meanwhile, the i2b2-2010 task, which evaluates token-level predictions, could benefit from extending entity-level relations to capture global relationships between sentences. From this perspective, CRPT offers advantages over traditional learning methods. The Discharge Summary CRPT model performed similarly to BlueBERT on sentence-level tasks, while Bio+Discharge Summary CRPT surpassed it.
We also performed additional experiments to isolate the effects of WWM and CRL. Discharge Summary BERT [3] served as our baseline, and WWM and CRL were individually applied to compare their performance, as shown in Table 6.
WWM involves masking entire tokens that form a word simultaneously, enhancing language understanding. As shown in Table 6, WWM improved model performance in both sentence-level and token-level tasks. While Discharge Summary BERT with WWM and NSP did not use CRL, our model still demonstrated improved word embeddings, mainly reflecting sentence-level semantic similarity. In the i2b2-2010 task, this model employing only WWM outperformed the model that only used CRL.
The application of CRL significantly enhanced model performance, particularly for sentence-level evaluations. We observed substantial performance improvements when CRL was applied, regardless of the masking method. Moreover, additional performance gains were realized when CRL was combined with WWM.
We visualized the similarity between the five sentence pairs listed in Table 5 as heatmaps for qualitative evaluation [30]. We hypothesized that consecutive sentences would exhibit closer proximity after CRPT. Figure 6 confirms that the similarity between consecutive sentences increased compared to BERT [1]. However, non-consecutive sentences also became slightly closer, likely because individual statements cannot be neatly classified. Increasing the batch size in future studies may enhance performance by incorporating a wider variety of negative samples into the loss calculation.

6. Conclusions

In this paper, we demonstrated the effectiveness of applying whole-word masking (WWM) and Contrastive Representation Learning (CRL) to improve the performance of clinical and biomedical language models. Our proposed CRPT models, which leverage WWM and CRL, showed superior results on the BLUE benchmark dataset compared to baseline models such as BERT and BlueBERT, particularly in sentence-level tasks. The inclusion of CRL allowed for better semantic understanding between sentences, while WWM preserved word-level meaning, improving the overall contextual comprehension in medical texts.
For qualitative evaluation, we analyzed random pairs of consecutive sentences from discharge summary notes and observed a noticeable reduction in the distance between consecutive sentences, signifying enhanced sentence relation modeling. Additionally, CRPT models proved highly effective in capturing nuanced semantic relationships, contributing to better performance across sentence and token-level tasks.
These findings confirm that CRPT models provide a more refined and specialized understanding of medical language, which is crucial for improving clinical NLP tasks like named entity recognition (NER), relation extraction, and sentence inference. Our approach can enhance clinical decision support systems and medical documentation processing, potentially leading to more accurate diagnoses and treatments.
Future Work will focus on expanding our dataset to include all 15 note types from the MIMIC-III database, allowing for more diverse training data and a broader evaluation across various medical downstream tasks. We also plan to explore the integration of domain-specific masking techniques to further enhance the model’s ability to understand medical terminology. Lastly, we will investigate the potential of scaling the batch size during training to leverage a wider variety of negative samples in CRL, potentially improving model performance in large-scale applications.

Author Contributions

S.K. and Y.J. developed the main idea and conducted the data collection, analysis, and evaluations. Y.J. provided high-level guidance, medical domain expertise, and supervision of this study. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education (grant number NRF-2022R1I1A1A01073591), and by a grant from the Korea Health Technology R&D Project through the Korea Health Industry Development Institute (KHIDI), funded by the Ministry of Health and Welfare, Republic of Korea (grant number HI21C1137).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Informed consent was obtained from all subjects involved in this study.

Data Availability Statement

We used the BLUE benchmark dataset, which can be downloaded from the GitHub repository https://github.com/ncbi-nlp/BLUE_Benchmark (accessed in September 2023). The trained model and the code will be available from the corresponding author upon reasonable request.

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI o3, accessed April 2025) to assist with translation from Korean to English and to refine the flow of several sentences. The authors have reviewed and edited the resulting text and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
BERT	Bidirectional encoder representations from transformers
NLP	Natural language processing
CRPT	Contrastive representations pre-training
CRL	Contrastive representation learning
WWM	Whole-word masking

References

  1. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2018, arXiv:1810.04805. [Google Scholar]
  2. Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; Bowman, S.R. GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding. arXiv 2018, arXiv:1804.07461. [Google Scholar]
  3. Alsentzer, E.; Murphy, J.R.; Boag, W.; Weng, W.-H.; Jin, D.; Naumann, T.; McDermott, M. Publicly Available Clinical BERT Embeddings. arXiv 2019, arXiv:1904.03323. [Google Scholar]
  4. Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.-H.; Kang, J. BioBERT: A Pre-Trained Biomedical Language Representation Model for Biomedical Text Mining. Bioinformatics 2020, 36, 1234–1240. [Google Scholar] [CrossRef]
  5. Gu, Y.; Tinn, R.; Cheng, H.; Lucas, M.; Usuyama, N.; Liu, X.; Naumann, T.; Gao, J.; Poon, H. Domain-Specific Language Model Pretraining for Biomedical Natural Language Processing. arXiv 2020, arXiv:2007.15779. [Google Scholar] [CrossRef]
  6. Beltagy, I.; Lo, K.; Cohan, A. SciBERT: A Pretrained Language Model for Scientific Text. arXiv 2019, arXiv:1903.10676. [Google Scholar]
  7. Peng, Y.; Yan, S.; Lu, Z. Transfer Learning in Biomedical Natural Language Processing: An Evaluation of BERT and ELMo on Ten Benchmarking Datasets. arXiv 2019, arXiv:1906.05474. [Google Scholar]
  8. Fang, H.; Xie, P. ConSERT: Contrastive Self-Supervised Learning for Language Understanding. arXiv 2020, arXiv:2005.12766. [Google Scholar]
  9. Giorgi, J.M.; Nitski, O.; Bader, G.D.; Wang, B. DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. arXiv 2020, arXiv:2006.03659. [Google Scholar]
  10. Wei, J.; Zou, K. EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks. arXiv 2019, arXiv:1901.11196. [Google Scholar]
  11. Sennrich, R.; Haddow, B.; Birch, A. Improving Neural Machine Translation Models with Monolingual Data. arXiv 2015, arXiv:1511.06709. [Google Scholar]
  12. He, K.; Fan, H.; Wu, Y.; Xie, S.; Girshick, R. Momentum Contrast for Unsupervised Visual Representation Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 9729–9738. [Google Scholar]
  13. Chu, C.; Wang, R. A Survey of Domain Adaptation for Neural Machine Translation. arXiv 2018, arXiv:1806.00258. [Google Scholar]
  14. Won, D.; Lee, Y.; Choi, H.-J.; Jung, Y. Contrastive Representations Pre-Training for Enhanced Discharge Summary BERT. In Proceedings of the 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI), Victoria, BC, Canada, 9–12 August 2021; pp. 507–508. [Google Scholar] [CrossRef]
  15. Koehn, P.; Knowles, R. Six Challenges for Neural Machine Translation. arXiv 2017, arXiv:1706.03872. [Google Scholar]
  16. Zhang, Z.; Han, X.; Liu, Z.; Jiang, X.; Sun, M.; Liu, Q. ERNIE: Enhanced Language Representation with Informative Entities. arXiv 2019, arXiv:1905.07129. [Google Scholar]
  17. Johnson, A.E.W.; Pollard, T.J.; Shen, L.; Li, L.-W.H.; Feng, M.; Ghassemi, M.; Moody, B.; Szolovits, P.; Celi, L.A.; Mark, R.G. MIMIC-III, a Freely Accessible Critical Care Database. Sci. Data 2016, 3, 160035. [Google Scholar] [CrossRef]
  18. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008. [Google Scholar]
  19. Chen, T.; Kornblith, S.; Norouzi, M.; Hinton, G. A Simple Framework for Contrastive Learning of Visual Representations. In Proceedings of the International Conference on Machine Learning, Virtual, 13–18 July 2020; pp. 1597–1607. [Google Scholar]
  20. Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. arXiv 2016, arXiv:1609.08144. [Google Scholar]
  21. Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; Soricut, R. ALBERT: A Lite BERT for Self-Supervised Learning of Language Representations. arXiv 2019, arXiv:1909.11942. [Google Scholar]
  22. Kalyan, K.S.; Sangeetha, S. SecNLP: A Survey of Embeddings in Clinical Natural Language Processing. J. Biomed. Inform. 2020, 101, 103323. [Google Scholar] [CrossRef]
  23. Hadsell, R.; Chopra, S.; LeCun, Y. Dimensionality Reduction by Learning an Invariant Mapping. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR ’06), New York, NY, USA, 17–22 June 2006; pp. 1735–1742. [Google Scholar]
  24. Sogancıoğlu, G.; Ozturk, H.; Ozgür, A. BIOSSES: A Semantic Sentence Similarity Estimation System for the Biomedical Domain. Bioinformatics 2017, 33, 49–58. [Google Scholar] [CrossRef]
  25. Suominen, H.; Salanterä, S.; Velupillai, S.; Chapman, W.W.; Savova, G.; Elhadad, N.; Pradhan, S.; South, B.R.; Mowery, D.L.; Jones, G.J.; et al. Overview of the SHARE/CLEF eHealth Evaluation Lab 2013. In Proceedings of the International Conference on Cross-Language Evaluation Forum for European Languages, Valencia, Spain, 23–26 September 2013; pp. 212–231. [Google Scholar]
  26. Uzuner, O.; South, B.R.; Shen, S.; DuVall, S.L. 2010 i2b2/VA Challenge on Concepts, Assertions, and Relations in Clinical Text. J. Am. Med. Inform. Assoc. 2011, 18, 552–556. [Google Scholar] [CrossRef]
  27. Lample, G.; Ballesteros, M.; Subramanian, S.; Kawakami, K.; Dyer, C. Neural Architectures for Named Entity Recognition. arXiv 2016, arXiv:1603.01360. [Google Scholar]
  28. Settles, B. Biomedical Named Entity Recognition Using Conditional Random Fields and Rich Feature Sets. In Proceedings of the International Joint Workshop on Natural Language Processing for Biomedicine and Its Applications, Geneva, Switzerland, 28–29 August 2004; pp. 107–110. [Google Scholar]
  29. Habibi, M.; Weber, L.; Neves, M.; Wiegandt, D.L.; Leser, U. Deep Learning with Word Embeddings Improves Biomedical Named Entity Recognition. Bioinformatics 2017, 33, 37–48. [Google Scholar] [CrossRef] [PubMed]
  30. Waskom, M.L. Seaborn: Statistical Data Visualization. J. Open Source Softw. 2021, 6, 3021. [Google Scholar] [CrossRef]
Figure 1. Pre-training tasks in BERT.
Figure 2. Framework for Contrastive Self-Supervised Learning in CERT.
Figure 3. Comparison of NSP and CRL Procedures in BERT. NSP relies on a softmax classifier to determine sentence continuity, while CRL applies a distance-based objective to better capture underlying semantic relationships between sentences.
Figure 4. Contrastive Learning of Representations. Positive and negative sentence pairs are mapped into a latent space through BERT and a projection head. Positive pairs (consecutive or semantically aligned sentences) are drawn closer, while negative pairs (non-consecutive or semantically unrelated sentences) are pushed apart, enforcing a margin m.
Figure 5. Comparison of Random Masking (RM) vs. Whole-Word Masking (WWM). RM independently masks subword tokens, leading to fragmented context, while WWM masks entire words, preserving semantic meaning and context.
Figure 6. Sentence-level similarity heatmaps: (a) baseline BERT; (b) CRPT. Darker shades denote higher similarity. The pale horizontal/vertical bands in panel (b) correspond to sentences from a different patient (5-A/5-B).
Table 1. Performance comparison of Clinical BERT and Discharge Summary BERT across various clinical NLP tasks.

Model                        MedNLI   i2b2 2006   i2b2 2010   i2b2 2012   i2b2 2014
Clinical BERT                80.8     91.5        86.4        78.5        92.6
Discharge Summary BERT       80.6     91.9        86.4        78.4        92.8
Table 2. Configuration of BERT-based models [standard hyper-parameters (steps, epochs, LR, batch size, seq-len) were identical for every model].

Model                          Initialize From   Data and Training Steps
BERT-Base                      -                 Wikipedia + BookCorpus, 1M steps
BioBERT-Base                   BERT-Base         PubMed 200K steps + PMC 270K steps
Discharge Summary BERT         BERT-Base         Discharge Summary Notes, 150K steps
Bio+Discharge Summary BERT     BioBERT-Base      Discharge Summary Notes, 150K steps
BlueBERT                       BERT-Base         PubMed 5M steps + MIMIC 200K steps
Discharge Summary CRPT         BERT-Base         Discharge Summary Notes, 200K steps
Bio+Discharge Summary CRPT     BERT-Base         PubMed 3M steps + Discharge Summary Notes 200K steps
Table 3. Summary of experimental datasets.

Corpus        Task                   Metrics    Train     Dev    Test
MedNLI        Inference              Accuracy   11,232    1395   1422
BIOSSES       Sentence similarity    Pearson    64        16     20
ShARe/CLEFE   NER                    F1         4628      1075   5195
i2b2-2010     Relation extraction    F1         3110      11     6293
Table 4. Performance across BLUE clinical NLP tasks (Median ± Standard Error).

Model                         MedNLI (Acc)     BIOSSES (r)      ShARe/CLEFE (F1)   i2b2-2010 (F1)
BERT-Base                     0.757 ± 0.011    0.800 ± 0.089    0.798 ± 0.006      0.728 ± 0.006
BioBERT                       0.819 ± 0.010    0.821 ± 0.086    0.812 ± 0.005      0.737 ± 0.006
Discharge Summary BERT        0.785 ± 0.011    0.718 ± 0.101    0.814 ± 0.005      0.736 ± 0.006
Bio+Discharge Summary BERT    0.814 ± 0.010    0.819 ± 0.086    0.819 ± 0.005      0.745 ± 0.005
BlueBERT                      0.841 ± 0.010    0.839 ± 0.082    0.838 ± 0.005      0.757 ± 0.005
Discharge Summary CRPT        0.819 ± 0.010    0.773 ± 0.094    0.827 ± 0.005      0.754 ± 0.005
Bio+Discharge Summary CRPT    0.847 ± 0.010    0.848 ± 0.080    0.837 ± 0.005      0.761 ± 0.005
Table 5. Randomly selected sentence pairs from discharge summary notes.

Pair 1
A: On admission, her daughters were in disagreement over her code status, and her original, long-standing DNR/DNI status was changed to allow for intubation if needed. However, when the patient’s respiratory status continued to decline to the point of need for intubation, the patient refused intubation.
B: Her family was notified and agreed that their mother’s wishes should be fulfilled. She was started on IV morphine, then converted to morphine drip on HD #3 for comfort, and all other medications were discontinued. Her family was at her bedside, and their Rabbi was called. She passed away.

Pair 2
A: She was seen by renal, who felt that her increase in creatinine may have been secondary to ATN/hypotension and recommended avoiding aggressive overdiuresis.
B: She subsequently required aggressive diuresis given her rapid afib/chf with lasix and niseritide drips. However, her creatinine remained at baseline of 1.7–2.0 with diuresis.

Pair 3
A: She was admitted to the hospital after having a procedure to open up her trachea, and she had some difficulty breathing after the procedure.
B: The breathing tube was placed back in her throat, and she was admitted to the ICU for a couple of days and had the breathing tube removed.

Pair 4
A: With clearing of her mental status, patient also began expressing suicidal and homicidal ideation with auditory hallucinations.
B: She admitted to hearing voices telling her to kill herself by overdosing on pills, as well as voices telling her to hurt others, though no one in particular.

Pair 5
A: A section of the cell block demonstrates a lymphoid infiltrate comprised of small–medium-sized lymphocytes with small–moderate amounts of cytoplasm and round–oval nuclei with mostly vesicular chromatin.
B: Admixed are smaller lymphocytes with scant cytoplasm and hyperchromatic nuclei. A review of the cytology prep (alcohol fixed, pap stain) demonstrates aggregates of mildly enlarged lymphocytes.
Table 6. Ablation experiments.

Model                     Masking       Method   MedNLI   BIOSSES   ShARe/CLEFE   i2b2-2010
Discharge Summary BERT    Random        NSP      0.785    0.740     0.814         0.736
Discharge Summary BERT    Whole Word    NSP      0.811    0.761     0.823         0.750
Discharge Summary CRPT    Random        CRL      0.816    0.767     0.825         0.743
Discharge Summary CRPT    Whole Word    CRL      0.819    0.773     0.827         0.754
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
