Adapting BERT for Medical Information Processing with ChatGPT and Contrastive Learning

Calculating semantic similarity is paramount in medical information processing; it aims to assess the similarity of medical professional terminologies within medical databases. Natural language models based on Bidirectional Encoder Representations from Transformers (BERT) offer a novel approach to semantic representation for semantic similarity calculations. However, due to the specificity of medical terminologies, these models often struggle to represent semantically similar medical terms accurately, leading to inaccuracies in term representation and consequently affecting the accuracy of similarity calculations. To address this challenge, this study employs Chat Generative Pre-trained Transformer (ChatGPT) and contrastive loss during the training phase to adapt BERT, enhancing its semantic representation capabilities and improving the accuracy of similarity calculations. Specifically, we leverage ChatGPT-3.5 to generate semantically similar texts for medical professional terminologies, incorporating them as pseudo-labels into the model training process. Subsequently, contrastive loss is utilized to minimize the distance between relevant samples and maximize the distance between irrelevant samples, thereby enhancing the performance of medical similarity models, especially with limited training samples. Experimental validation is conducted on the open Electronic Health Record (OpenEHR) dataset, randomly divided into four groups to verify the effectiveness of the proposed methodology.


Introduction
Semantic similarity computation of medical information, as one of the cores of medical information processing, aims to calculate the similarity of medical professional terms within massive datasets [1][2][3][4]. It finds essential applications in various areas, including medical information modeling [5], semantic retrieval [6], intelligent decision support [7], and medical knowledge graph construction [8].
In the modern healthcare sector, electronic health (e-health) refers to the use of information and communication technologies to support and enhance the delivery and management of healthcare services. E-health encompasses various forms, including electronic health records (EHR), telemedicine, and mobile health applications. An EHR is a digital system for managing and storing patient medical information, and accurate understanding and processing of medical terminology within EHR systems is essential. Methods for calculating medical semantic similarity help EHR systems analyze and compare the semantic similarity between different medical terms more precisely, thereby improving the efficiency and accuracy of medical information processing. Additionally, Digital Decision Support Systems (DDSSs) [9] depend heavily on accurate and efficient medical information processing; medical semantic similarity calculations can enhance DDSSs by offering reliable recommendations and insights for clinical decision-making. However, due to differences in expertise and backgrounds, the representation of medical terms exhibits complexity, diversity, and dynamics, posing challenges to understanding and utilizing medical information [10,11]. For instance, in the open Electronic Health Record (OpenEHR) dataset [12], the term "care plan" has several semantically similar expressions, such as "Treatment plan", "Care regimen", and "Health management plan". Manual comparison methods struggle to handle massive data, thus affecting the ability to identify and understand different expressions of medical professional terms. Therefore, utilizing semantic similarity calculations to address the complexity and diversity of terms in the medical field is crucial for improving the efficiency and accuracy of medical information processing.
In contrast to inefficient manual comparison methods, natural language models like Bidirectional Encoder Representations from Transformers (BERT) provide new modeling approaches for semantic similarity calculations in medical information [13,14]. However, due to the specificity of medical terminology, natural language models have limitations in handling semantically similar medical terms, leading to inaccurate representations and consequently affecting the accuracy of similarity calculations [15][16][17][18][19]. In particular, medical terms may involve rich domain knowledge and highly specific contexts.
In this work, we adapt BERT during the training phase using Chat Generative Pre-trained Transformer (ChatGPT) and contrastive loss, as illustrated in Figure 1, to enhance BERT's semantic representation capability and thereby improve the accuracy of similarity calculation. Specifically, we utilize ChatGPT-3.5 to generate semantically similar texts for each target text and incorporate them into the model training process as pseudo-labels. Subsequently, we adapt the semantic representation parameters of BERT. Finally, we employ contrastive loss to minimize the distance between relevant samples and maximize the distance between irrelevant samples, enhancing the performance of medical similarity models under limited training samples. In the experimental section, we simulate massive datasets, conduct multiple sets of experiments, and validate and evaluate the model, confirming the effectiveness of this approach in improving the accuracy of semantic similarity calculation in medical information.

Related Work
Semantic representation modeling is one of the main methods used to enhance the intelligence of machine language and is widely applied in information retrieval and natural language processing [1,5,6]. After pre-training on a large-scale unlabeled corpus, the Transformer-based [21] BERT semantic representation model dramatically improves the performance of information retrieval and natural language processing tasks [22][23][24].
In the medical field, due to the specificity of medical information, it is often necessary to adapt BERT to different tasks [25][26][27][28][29][30]. Suneetha et al. [31] proposed a cardiovascular disease diagnosis method based on fine-tuning BERT, providing effective and accurate diagnosis for cardiovascular disease patients. Kim et al. [32] proposed a BERT-based medical specialty prediction model for patient-side medical problem texts, assisting doctors in making decisions. Su et al. [33] proposed pre-training and fine-tuning BERT, improving the accuracy of automatic extraction of biomedical relationships. Ding et al. [34] proposed a crop disease diagnosis model based on a bidirectional encoder representation Transformer and RCNN, assisting plant doctors in diagnosing crop diseases. Babu et al. [35] developed a medical chatbot based on BERT, significantly enhancing healthcare communication and accessibility using advanced deep-learning techniques. Meanwhile, Chen et al. [36] used coronary heart disease, also known as angina, as an example to construct a pre-trained diagnostic model for traditional Chinese medicine texts based on BERT, completing text classification tasks for different types of coronary heart disease medical cases. Faris et al. [37] designed a BERT-based method for symptom identification and diagnosis to assist doctors in handling multilingual consultations from users. BERT-based and CNN-based medical application methods are summarized in Table 1. In this paper, we utilize ChatGPT-3.5 to create a small-sample dataset and adapt the language expression parameters of BERT to enhance its ability to express semantic information in information retrieval and medical contexts.

CNN-based
Zheng et al. [38]. Problem: manually retrieving and comparing imaging and pathology reports with overlapping exam body sites is time-consuming. Method: a convolutional neural network model used to calculate similarities among report pairs.
Liang et al. [39]. Problem: calculating the semantic similarity of noisy short medical question texts for an intelligent QA system. Method: a shared-layer-based CNN combined with TF-IDF for feature extraction and noise reduction.
Li et al. [40]. Problem: enhancing the efficiency of online medical QA by accurately matching user questions with professional medical answers. Method: a bidirectional gated recurrent unit network with CNN for feature extraction to improve matching accuracy.

BERT-Based
Suneetha et al. [31]. Problem: improving the diagnosis of cardiovascular disease. Method: fine-tuning BERT to provide an effective and accurate diagnosis.
Kim et al. [32]. Problem: using natural language processing technology to improve the efficiency and accuracy of outpatient diagnosis and treatment initiation. Method: a BERT-based medical specialty prediction model for patient-side medical question texts.
Faris et al. [37]. Problem: handling multilingual consultations from users. Method: a BERT-based method for symptom identification and diagnosis to assist doctors.

Preliminary Knowledge
BERT is a widely used semantic information representation model consisting of a BertTokenizer and multiple layers of Transformer encoders. Accurate representation of semantic information is crucial for diagnosis, treatment, and research in the medical field. Therefore, we adopt BERT as the base model to utilize its powerful semantic learning ability to enhance the quality of medical text representation. Firstly, we preprocess medical text using the BertTokenizer to convert it into three types of embeddings: token embedding, segment embedding, and position embedding. Token embedding captures the semantic information of each word, segment embedding distinguishes between different sentences or text segments, and position embedding encodes the positional information of words in sentences. Combining these three embedding forms provides rich semantic and positional information to the model, improving the expression capability of textual information. Next, we sum these three embeddings to obtain a vector sequence E^j_1, E^j_2, ..., E^j_N, the input to the Transformer encoder. In the multi-layer encoder, textual information undergoes multiple self-attention and feed-forward operations, gradually transforming into high-level semantic representations. Specifically, we employ a multi-head attention mechanism to simultaneously focus on different positions and semantic features of the input sequence, thus better capturing the contextual relationships and semantic information in medical text. Through this approach, the model can comprehensively understand the text content, thereby enhancing the quality of semantic information representation. In the multi-head attention module, as shown in Figure 2, the vectors E^j_1, E^j_2, ..., E^j_N undergo linear transformations to generate three matrices: Q (query), K (key), and V (value), which are then used to compute attention weights and obtain the final output:

Attention(Q, K, V) = softmax(QK^T / √d_k) V,

where d_k is the dimension of the key vectors. This attention mechanism enables the model to be more flexible in handling medical text, adjusting weights based on the correlation between different words to better express semantic information. However, despite BERT being a powerful semantic representation model, it still has limitations when dealing with medical terminologies. Due to the complex diversity of medical terms and common phenomena such as abbreviations and acronyms in medical texts, BERT may not accurately capture subtle differences between terms. Therefore, we adapt BERT to enhance its ability to represent medical terminologies. Specifically, we adjust the parameters of token embedding, segment embedding, and position embedding in semantic modeling to better adapt the model to the characteristics of the medical domain.
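The scaled dot-product attention described above can be sketched as follows. This is a minimal single-head NumPy illustration of softmax(QK^T/√d_k)V; the token count and dimensions are toy values, not the paper's configuration.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single attention head."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (N, N) token-pair relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax over positions
    return weights @ V                              # (N, d_v) contextualized tokens

rng = np.random.default_rng(0)
N, d_k = 4, 8                                       # 4 tokens, toy dimension
Q, K, V = (rng.standard_normal((N, d_k)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)
```

In the multi-head version, this computation is repeated with separately projected Q, K, and V matrices per head and the outputs are concatenated.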

Overview
To address the challenge posed by the diversity of medical terminology and contextual differences in processing medical information, we adapt BERT using ChatGPT and contrastive loss. This adjustment aims to enhance the accuracy of semantic similarity recognition of medical information under similar semantic bases. As illustrated in Figure 1, we first utilize ChatGPT-3.5 to generate pseudo-labels for the OpenEHR dataset, creating the OpenEHR-S dataset with similar semantic expressions. Subsequently, the BERT and BERT-S models separately process the OpenEHR and OpenEHR-S datasets to obtain output vectors T^j_1, T^j_2, ..., T^j_N and T'^j_1, T'^j_2, ..., T'^j_N, which are averaged to obtain text vector representations O_j and O'_j for the OpenEHR and OpenEHR-S datasets, respectively. Finally, we compute the text similarity s_j and utilize contrastive loss to minimize the distance between relevant samples in the OpenEHR and OpenEHR-S datasets while maximizing the distance between irrelevant samples.
Specifically, we create a dataset with semantics similar to medical professional terminologies as pseudo-labels. We utilize the large-scale language model ChatGPT-3.5 to generate a dataset of medical professional terminologies with semantics similar to the OpenEHR dataset, serving as pseudo-labels GT_j for both the training and testing phases. The true label is designated as 1, while the erroneous label is set as 0. Subsequently, to process the medical professional terminology texts from OpenEHR, we employ the BertTokenizer for tokenization. The sentences are segmented into appropriate words or subword units, and special tokens (such as [CLS] and [SEP]) are added. The text is then encoded into a numerical sequence E^j_1, E^j_2, ..., E^j_N, serving as the input for BERT. Next, this sequence is processed through the Transformer encoder to compute the output T^j_1, T^j_2, ..., T^j_N of BERT. Finally, the medical text representation O_j is obtained by averaging the output:

O_j = (1/N) Σ_{i=1}^{N} T^j_i.    (1)

It is worth noting that, to learn similar medical knowledge information during training, the embeddings in BERT participate in training, while all other parameters are frozen. Processing pseudo-labels for the OpenEHR-S dataset follows a similar procedure. Firstly, the pseudo-labels are tokenized using the BertTokenizer to obtain E'^j_1, E'^j_2, ..., E'^j_N, which serves as input for BERT-S. Subsequently, this sequence is passed through the Transformer encoder to compute the output of BERT-S, denoted as T'^j_1, T'^j_2, ..., T'^j_N. Finally, the medical text representation O'_j is obtained using Formula (1). It is worth noting that during this computation, all parameters of BERT-S are frozen and do not participate in training. The similarity s_j is then calculated as the cosine similarity between the two text vectors:

s_j = (O_j · O'_j) / (‖O_j‖ ‖O'_j‖).    (2)

Furthermore, we present the algorithm for calculating the similarity score between two individual medical terms in Algorithm 1.
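The mean-pooling and similarity steps above can be sketched in a few lines. This toy example uses random arrays as stand-ins for the token-level BERT outputs; the token count (12) and hidden size (768) are illustrative.

```python
import numpy as np

def mean_pool(token_vecs):
    """Formula (1): O = (1/N) * sum_i T_i, averaging token representations."""
    return token_vecs.mean(axis=0)

def cosine_similarity(o1, o2):
    """s = (O . O') / (||O|| * ||O'||), the similarity between text vectors."""
    return float(o1 @ o2 / (np.linalg.norm(o1) * np.linalg.norm(o2)))

rng = np.random.default_rng(1)
T = rng.standard_normal((12, 768))                   # 12 token vectors from "BERT"
T_prime = T + 0.01 * rng.standard_normal((12, 768))  # near-identical "pseudo-label" text
O, O_prime = mean_pool(T), mean_pool(T_prime)
s = cosine_similarity(O, O_prime)
print(s > 0.99)  # near-identical representations score close to 1
```

A semantically similar pseudo-label should thus yield a similarity near 1, which the contrastive loss then reinforces during training.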

Pseudo-Label Generation with ChatGPT
ChatGPT [41][42][43][44] learns language models by pre-training on large-scale textual data and can adapt to different scenarios through prompts. To enhance the semantic expression ability of BERT, we utilize ChatGPT-3.5 to generate pseudo-label texts with similar semantic expressions, as illustrated in Figure 3. Specifically, we use the prompt "Generate words with similar semantics based on the following medical phrases" for each medical professional phrase and manually select appropriate phrases as pseudo-labels. Additionally, we prioritize selecting phrases that share fewer duplicate words with the input. For example, taking "self-test result" as input for ChatGPT-3.5, we obtain the following phrases in sequence: "Self-assessment outcome", "Self-diagnostic findings", "Personal evaluation outcome", "Individual screening result", and "Self-examination outcome". We select "Personal evaluation outcome" as the pseudo-label.
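The fewer-duplicate-words preference can be approximated with a simple word-overlap heuristic. The prompt string and candidate phrases below are the ones quoted above; the automated selection is our reading of the manual criterion, not the paper's actual procedure.

```python
import re

PROMPT = "Generate words with similar semantics based on the following medical phrases"

def words(phrase):
    """Lowercased word set, splitting on non-letters so 'self-test' yields {'self', 'test'}."""
    return {w for w in re.split(r"[^a-z]+", phrase.lower()) if w}

def pick_pseudo_label(target, candidates):
    """Prefer the candidate that shares the fewest words with the target."""
    return min(candidates, key=lambda c: len(words(target) & words(c)))

target = "self-test result"
candidates = [
    "Self-assessment outcome",
    "Self-diagnostic findings",
    "Personal evaluation outcome",
    "Individual screening result",
    "Self-examination outcome",
]
print(pick_pseudo_label(target, candidates))  # Personal evaluation outcome
```

On this example, the heuristic reproduces the paper's manual choice: "Personal evaluation outcome" is the only candidate sharing no words with "self-test result".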

Adapting BERT with Contrastive Loss
During training, we employ contrastive loss to minimize the distance between relevant samples and maximize the distance between irrelevant samples:

Loss = GT_j (1 − s_j)² + (1 − GT_j) max(δ − (1 − s_j), 0)²,    (3)

where GT_j represents the true label of the jth sample, with positive samples denoted as 1 and negative samples denoted as 0, and δ is the margin of the loss function, which defaults to 1 and controls the gap between the similarity score and the loss. In summary, our proposed medical information similarity model leverages natural language models such as BERT to enhance the recognition of medical information similarity under similar semantic contexts. By adapting these models and utilizing contrastive loss to optimize similarity calculations, we aim to bridge the gap in understanding between diverse medical terminologies and contextual differences in medical information processing. This approach contributes to more accurate medical information retrieval and analysis and lays the foundation for advancing natural language processing techniques in healthcare.
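A sketch of this loss for a single pair, taking the distance as d_j = 1 − s_j with s_j the similarity score. The exact parameterization in the paper is not fully specified here, so this follows the classic squared-margin form implied by the margin δ and the square term discussed in the ablation: positives pull d_j toward 0, negatives push d_j beyond the margin.

```python
def contrastive_loss(s_j, gt_j, delta=1.0):
    """gt_j is 1 for a relevant pair, 0 for an irrelevant one; delta is the margin."""
    d_j = 1.0 - s_j                                      # distance derived from similarity
    positive = gt_j * d_j ** 2                           # pull relevant pairs together
    negative = (1 - gt_j) * max(delta - d_j, 0.0) ** 2   # push irrelevant pairs apart
    return positive + negative

print(contrastive_loss(0.9, 1))  # relevant pair already similar: near-zero loss
print(contrastive_loss(0.9, 0))  # irrelevant pair too similar: large loss
```

The squaring magnifies large errors, matching the ablation's observation that removing the square term (Contrastive Loss*) weakens the improvement.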

Experiment Setup
OpenEHR is an international standard and open-source specification designed for creating, storing, sharing, and exchanging electronic health records. It provides a robust framework for managing and utilizing medical terminology, and its standardized, structured data models ensure the interoperability and consistency of medical information. In this study, we leverage the strengths of OpenEHR to extract medical-specific terminology. Specifically, we collect 708 medical text entries from the OpenEHR website as the dataset for this study. From these, we randomly select 40 samples for training and testing, maintaining a 1:1 ratio between training and testing samples. As shown in Table 2, we present some target texts and their corresponding pseudo-labels. In addition to evaluating the testing samples, we conduct more challenging experiments. To assess the search capability for semantically similar texts, we randomly divide the remaining 668 samples into groups labeled Group1, Group2, Group3, and Group4, each containing 167 negative samples. Each group is combined with the 20 pseudo-labeled samples from the testing set, forming 187 query samples. We calculate the similarity between the 20 test texts and the 187 query samples, take the top five similarity scores as the prediction results for each test text, and evaluate the model performance. To assess the model's search capability for similar semantic expressions, we employ Top-1 Accuracy, Mean Reciprocal Rank (MRR), Precision, Recall, F1, Area Under the Receiver Operating Characteristic Curve (AUC), and Matthews Correlation Coefficient (MCC) as evaluation metrics. Top-1 Accuracy refers to the proportion of predictions matching the true labels among the highest similarity scores. MRR is the average of the reciprocals of the ranks at which the correct answer first appears among the top five predictions given by the model:

MRR = (1/M) Σ_{j=1}^{M} (1 / rank_j),

where M represents the number of real texts and rank_j represents the position at which the correct answer first appears for the jth sample; it is used to evaluate ranking performance. F1 is an evaluation metric that comprehensively considers Precision and Recall:

F1 = (2 × Precision × Recall) / (Precision + Recall).

In addition, we use AUC and MCC to evaluate our model further.
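The ranking metrics above can be sketched as follows. Each entry in `ranks` is the 1-based position at which the correct pseudo-label appears for one test text (`None` if it falls outside the top five, contributing zero credit); the values are illustrative, not the paper's results.

```python
def top1_accuracy(ranks):
    """Fraction of test texts whose correct answer ranks first."""
    return sum(r == 1 for r in ranks if r is not None) / len(ranks)

def mean_reciprocal_rank(ranks):
    """MRR = (1/M) * sum_j 1/rank_j, with zero credit outside the top five."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

ranks = [1, 1, 2, 5, None]  # ranks of the correct answer for 5 test texts
print(top1_accuracy(ranks))         # 2 of 5 rank first
print(mean_reciprocal_rank(ranks))  # (1 + 1 + 1/2 + 1/5 + 0) / 5
```

In the paper's setup, M = 20 test texts are ranked against 187 query samples per group, and these metrics are computed over the resulting top-five lists.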
We conduct our experiments with PyTorch 1.13.0 and Python 3.10.0. Both training and evaluation are conducted on an NVIDIA RTX 4090 GPU. The batch size for training is set to 4, and we use the Adam optimizer with a learning rate of 1 × 10⁻⁴.

Method Comparison
Results on the Test Dataset: Table 3 shows the comparative results of TF-IDF [45,46], Count Vector [47], Levenshtein Distance [48], Damerau-Levenshtein Distance [49], BERT, and our model. BERT performs better on this test dataset than TF-IDF, Count Vector, Levenshtein Distance, and Damerau-Levenshtein Distance: its Top-1, MRR, and F1 reach 85%, 86.7%, and 84.1%, respectively, demonstrating BERT's excellent semantic representation ability in small-scale similarity calculation. Our model improves on BERT by 0.6% and 2.5% on MRR and F1, respectively, validating its effectiveness in the semantic representation of medical vocabulary. Results on the Simulated Massive Dataset: Table 4 presents the comparative experiments of TF-IDF, Count Vector, Levenshtein Distance, Damerau-Levenshtein Distance, BERT, and our model on the four groups. BERT performs poorly when faced with a large number of negative samples, with Top-1 percentages of 25%, 35%, 35%, and 30% in the four groups, respectively. This indicates that BERT's similarity calculation is not sufficiently accurate under the interference of many negative samples, resulting in many erroneous detections. In contrast, our model achieves Top-1 percentages of 30%, 50%, 40%, and 30% in the four groups. Additionally, on the MRR metric representing retrieval capability, BERT achieves scores of 35.6%, 47.2%, 46.8%, and 39.9% on the four groups, and our model improves on these by 4.6%, 9.5%, 1.1%, and 0.3%, respectively. This indicates that our model improves the accuracy of similarity calculation compared to BERT, validating its effectiveness. Results on the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM) [50] and International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) [51] Datasets: To further validate the effectiveness of our model, we conduct experiments on the ICD-9-CM and ICD-10-CM datasets, randomly selecting 25 terms from each dataset as the validation dataset. Moreover, to assess the algorithmic efficiency of the improved model, we calculate the time required to process each text. It is worth noting that these datasets are not involved in the training process. The experimental results, shown in Table 5, indicate that compared to BERT, our model achieves 3.8% and 7.7% improvements in MRR and F1, respectively, on the ICD-9-CM dataset, and 0.6% and 0.2% improvements in MRR and AUC, respectively, on the ICD-10-CM dataset. The identical parameter counts and FLOPs are attributed to our fine-tuning approach, which introduces no additional parameters. In conclusion, our model exhibits improvement on both the ICD-9-CM and ICD-10-CM datasets, affirming its efficacy.

Qualitative Analysis
Table 6 presents qualitative analysis experiments on the target texts, where the top five similarity scores are considered the predicted results for each target text. We observe that the BERT model's top predictions for the "self-test result" and "classification of glaucoma" texts are "Personal assessment outcome" and "Glaucoma categorization", respectively, consistent with the pseudo-labels. For the "inspection of the rectum" and "imaging examination of a placenta" texts, its top predictions are "Examination of the thyroid" and "Imaging examination of a body structure", respectively, inconsistent with the pseudo-labels. This indicates that BERT focuses more on common vocabulary (inspection) and overlooks the specificity of medical terms (rectum and placenta), especially when dealing with large datasets. In contrast, our model's top predictions for the "self-test result", "classification of glaucoma", and "imaging examination of a placenta" texts are consistent with the pseudo-labels. Although our model's top prediction for the "inspection of the rectum" text remains inconsistent with the pseudo-label, it improves the pseudo-label's ranking. This validates that our approach enhances BERT's expressive capability for medical professional terms.

Ablation Studies
Contrastive Loss: Table 7 presents the results of our ablation study on the four groups. When employing the strategy of minimizing positive sample distance alone, there are improvements of 5% and 3.6% in the Top-1 and MRR metrics for Group 1, respectively, and of 2.6% and 3.3% in the MRR and F1 metrics for Group 2, respectively; there is no improvement for Group 3 and a decrease for Group 4. This suggests that solely minimizing positive sample distance is of limited effectiveness, likely because the influence of negative samples is not considered during positive sample learning. When solely employing the strategy of maximizing negative sample distance, there are improvements of 5% and 3.7% in the Top-1 and MRR metrics for Group 1, respectively, and of 15% and 9.5% in the Top-1 and MRR metrics for Group 2, respectively; there is a decrease for Group 3 and no improvement for Group 4. This indicates that solely maximizing negative sample distance is similarly limited, likely because the distance between positive samples is not reduced during negative sample learning. When simultaneously employing both strategies, there are improvements of 5%, 4.6%, and 4% in the Top-1, MRR, and F1 metrics for Group 1, respectively; of 15% and 9.5% in the Top-1 and MRR metrics for Group 2, respectively; of 5% and 1.1% in the Top-1 and MRR metrics for Group 3, respectively; and of 0.3% in the MRR metric for Group 4.
This validates the effectiveness of combining the strategies of minimizing positive sample distance and maximizing negative sample distance. Different Losses: Table 8 illustrates the results of adapting BERT with different loss functions. When employing the cross-entropy loss, we observe decreases of 5% and 2.9% in the Top-1 and MRR metrics, respectively; with the smooth cross-entropy loss, decreases of 10% and 5.4%, respectively. This indicates that cross-entropy and smooth cross-entropy losses lead to an adapted BERT model that is less effective at learning information from medical texts, performing worse than the original BERT model. In contrast, the contrastive loss strategy of maximizing negative and minimizing positive sample distances better facilitates learning the semantic representation parameters. Furthermore, we compare against Contrastive Loss*, which omits the square term. It shows an improvement over BERT, although the improvement is not significant. The square function is smooth and continuous, possessing good mathematical properties; moreover, squaring magnifies larger errors, making the model pay more attention to them during training.

Limitations
The experimental results indicate that while our model shows some improvement over BERT, there is still significant room for enhancement. On the one hand, this may be attributed to our relatively small training dataset, comprising only 20 pairs of similar medical domain terms, which may prevent the model from fully capturing the specificity of medical terminologies and thus limits its semantic expression capability. On the other hand, generating pseudo-labels using ChatGPT entails considerable time and effort, leading to high costs. Future research will focus on further enhancing similarity modeling of medical domain terms.

Conclusions
This paper examines the challenges faced by the medical information field in the context of big data and artificial intelligence technology development, focusing on the diversity of medical professional terms, contextual differences, and the challenges posed by massive datasets to traditional manual comparison methods and the BERT model. Addressing these challenges, we propose a method for semantic similarity computation of medical information by adapting the BERT model to enhance its understanding and processing of medical professional terms. We utilize ChatGPT-3.5 to generate semantically similar texts for medical professional terms, which are then incorporated into the model training process as pseudo-labels. Subsequently, by adjusting the semantic representation parameters of the BERT model, we enhance its adaptability to semantic features specific to the medical domain. We employ contrastive loss during training to minimize the distance between relevant samples and maximize the distance between irrelevant samples, thereby improving the model's performance under limited training samples. Through validation on test sets and simulated massive datasets, we find that our model outperforms the baseline BERT model in the expressive capability of medical professional terms and the accuracy of semantic similarity computation in medical information. In future research, we plan to explore fine-tuning BERT and other Transformer-based models for applications in medical information processing, aiming to enhance decision-making processes in healthcare settings. Additionally, we aim to investigate the performance of these models in traditional Chinese medicine and their adaptability in handling medical terminologies in minority languages.

Figure 1. Illustration of the proposed method. During training, pseudo-label texts with similar semantic expressions are generated using ChatGPT-3.5 for a given target text. Subsequently, the target text and pseudo-label texts are separately processed by BERT and BERT-S to obtain semantic representation vectors T^j_1, T^j_2, ..., T^j_N and T'^j_1, T'^j_2, ..., T'^j_N, respectively, with initialization weights provided by Hugging Face [20]. Then, the semantic representation vectors are averaged to obtain text vector representations O_j and O'_j. Finally, the similarity score is computed using contrastive loss, which minimizes the distance between relevant samples and maximizes the distance between irrelevant samples.

Table 1. Summary of BERT-based and CNN-based methods in medical applications.

Table 2. Target texts and pseudo-labels. The first column represents the target text of OpenEHR, and the second column represents the pseudo-labels generated using ChatGPT-3.5.

Table 4. Results on the OpenEHR simulated massive dataset.

Table 6. Qualitative analysis experiment on target texts, considering the top five similarity scores as predicted results. The first column represents the target text, the second column the pseudo-label, the third column BERT's predicted results, and the fourth column our model's predicted results.

Table 7. Ablation experiments on the contrastive loss strategies of minimizing positive sample distance and maximizing negative sample distance. Min Positive denotes the former strategy, and Max Negative denotes the latter.

Table 8. Adapting BERT with different losses. Contrastive Loss* indicates that the square term is not used.