Filtered BERT: Similarity Filter-Based Augmentation with Bidirectional Transfer Learning for Protected Health Information Prediction in Clinical Documents

For the secondary use of clinical documents, it is necessary to de-identify protected health information (PHI) in documents. However, the difficulty lies in the fact that there are few publicly annotated PHI documents. To solve this problem, in this study, we propose a filtered bidirectional encoder representation from transformers (BERT)-based method that predicts a masked word and validates the word again through a similarity filter to construct augmented sentences. The proposed method effectively performs data augmentation. The results show that the augmentation method based on filtered BERT improved the performance of the model. This suggests that our method can effectively improve the performance of the model in the limited data environment.


Introduction
With the advent of the Fourth Industrial Revolution, the medical field is developing by responding sensitively and rapidly to technological advances [1]. In particular, data analysis and artificial intelligence technology based on clinical medical data are attracting attention because they can be used as clinical decision support systems (CDSSs) that help experts make decisions [2]. A key component of clinical medical data is clinical documents, which are very important for medical data analysis because they contain information written by the clinician [3]. However, to maintain patient confidentiality during secondary use of clinical documents, such as research and data analysis, it is necessary to de-identify the personal information they contain, that is, the protected health information (PHI). In the United States, the Health Insurance Portability and Accountability Act (HIPAA) defined guidelines for the secondary use of medical records, and guidelines for de-identifying medical records were defined accordingly.
Existing PHI removal and de-identification processes are performed manually. In this process, an annotator directly identifies and labels the PHI in a document. However, this method is expensive, and mistakes occur frequently when humans perform annotations directly.
The rule-based system was devised as an automatic annotation method to compensate for the problems of manual annotation. The rule-based system is generally implemented using regular expressions that are portable, fast, and easy to use, as they are standardized in most programming implementations. Shin et al. proposed a de-identification method using regular expressions in a free-text clinical document of an electronic medical record (EMR) in Korea, and obtained high recall and precision [4]. A machine learning system was proposed to solve the labor-intensive problem of the rule-based system and improve its generalization ability. Support vector machine (SVM) [5] and conditional random field (CRF) [6] have been used in previous studies as automatic PHI label identification methods. Although the SVM method is a classic machine learning method, it has been frequently used to identify PHIs. In particular, CRF has received attention because of its promising performance. Aramaki et al. proposed a PHI de-identification system using CRF [7]. Bin et al. proposed WI-deld, a de-identification algorithm based on the CRF algorithm, and achieved a high level of Micro F1-score [8]. However, a machine learning method that shows satisfactory performance requires detailed feature engineering for model tuning, and for this, it must be appropriately preprocessed.
Additionally, artificial neural network (ANN)-based deep learning methods are being actively considered as part of machine learning methods. This deep learning method has the advantage of automatically extracting features without the detailed feature engineering required in machine learning methods. In particular, models that use recurrent neural networks (RNNs) and long short-term memory (LSTM) [9] have attracted significant attention for their high performance, which has not been achieved before. Liu et al. achieved a high level of PHI entity identification performance using an LSTM model [10]. Yang et al. also proposed an anonymization method based on deep learning, using LSTM with a conditional random field [11].
However, a difficulty with these studies is that only a few appropriately annotated public PHI datasets exist, which poses challenges in the generalization stage [11,12]. Creating large-scale open datasets is difficult because doing so risks patients' privacy. Therefore, technologies that can derive good results using a very small amount of data are required.
To overcome data limitations, data augmentation and transfer learning are mainly used in the machine learning field. Data augmentation is a technique that increases the amount of training data by adding noise to existing data or by generating synthetic data based on the existing data. The field in which data augmentation is most actively applied is the image field, where techniques with promising performance range from geometric transformations, such as cropping and rotation, to deep learning-based data augmentation [13,14]. Data augmentation has also been considered for time-series data, such as signal data [15]. Likewise, in the natural language processing (NLP) field, data augmentation that replaces words with synonyms or inserts and deletes random words has shown effective performance improvements in limited data environments [16,17].
Additionally, pre-training [18] and transfer learning [19] have been actively considered as methods to overcome limited data. In the image field, pre-training has been used to train a network on a large dataset such as ImageNet [20] and then solve other problems with the pretrained weights. Furthermore, in the NLP field, pretrained embeddings such as Word2vec [21], GloVe [22], and fastText [23] provide effective features. Embeddings from language models (ELMo) [24] and bidirectional encoder representations from transformers (BERT) [25] are the most representative examples of transfer learning in the NLP field. In particular, BERT is attracting attention because it supports fine-tuning and can be applied to various NLP tasks that require strong performance [26].
In this study, we propose a filtered BERT augmentation method to overcome limited data. The aim is to further improve prediction performance under limited data by adding to transfer learning an augmentation step that combines BERT with a similarity filter. Using a representative public dataset, we compared the PHI prediction performance of the existing BERT with that of BERT trained with the proposed filtered BERT-based augmentation.

Materials and Methods
Section 2 describes the filtered BERT proposed in this study. We first introduce the dataset used in this work in Section 2.1 and explain how data are properly preprocessed in the form of free text in Section 2.2. Section 2.3 describes our proposed filtered BERT augmentation method. Section 2.4 describes the process of fine-tuning BERT to recognize PHI with datasets created through the augmentation method, and the evaluation metrics are presented in Section 2.5. Figure 1 shows the schematic data pipeline used in this study.

Datasets
The i2b2 2014 dataset [27] was used in this study. It is one of the most representative datasets publicly available for PHI anonymization of medical documents. This dataset is a part of Track 1 of the 2014 i2b2/UTHealth natural language processing shared task and consists of 1304 longitudinal medical records of 296 diabetic patients. All PHIs were anonymized by the organizer. Each PHI was annotated by three annotators, and the annotations were manually examined [28].
The dataset was annotated using the i2b2-PHI categories, which are an expanded form of the HIPAA-PHI categories. Table 1 lists the i2b2-PHI categories.

Data Preprocessing
We built a preprocessing pipeline for proper PHI recognition. First, tokenization was performed in units of words, and tokenized words were labeled using the inside-outside-beginning (IOB) tagging scheme [29]. This method is the same as the existing method found in [10,11]. In the general BERT model, the input is limited to 512 tokens. Therefore, each clinical note was divided into segments of at most 250 words, which were used as input data. For example, if a clinical document contained 700 words, it was divided into three segments of 250, 250, and 200 words.
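The segmentation step can be illustrated with a minimal sketch. It assumes the note has already been tokenized into a list of words; the function name `split_into_chunks` and the placeholder words are hypothetical and not taken from the original implementation.

```python
def split_into_chunks(words, chunk_size=250):
    """Split a list of word tokens into consecutive chunks of at most chunk_size words."""
    return [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]


# A 700-word note (placeholder words) yields segments of 250, 250, and 200 words.
note = [f"w{i}" for i in range(700)]
chunks = split_into_chunks(note)
print([len(chunk) for chunk in chunks])  # [250, 250, 200]
```

Because the IOB labels are stored per word, the same slicing can be applied to the label list so that words and labels stay aligned.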

Filtered BERT for Augmentation Structure
We propose a filtered BERT method for effective data augmentation of clinical documents that include PHI, by applying a word-similarity-based filtering algorithm. The overall structure of the filtered BERT augmentation model is similar to that of the BERT-based augmentation method proposed in a previous study [30]. However, to the augmentation method that predicts the masked word through BERT, we added a filter that checks the predicted word. The filter compares the similarity between the predicted word and the original word that was masked in the sentence. Thus, only words with a certain degree of similarity pass through the filter. The detailed structure of the proposed filtered BERT is shown in Figure 2.
The similarity filter was designed to check whether the word predicted by the BERT model for a masked position was similar to the original word. Figure 3 shows the detailed processing of the filter algorithm. To apply the filter, the word of the original sentence that was masked (Xmasked) and the word predicted through context-based reasoning in BERT (Xpredicted) were converted into word vectors based on fastText [23] word embedding. The cosine similarity of the two word vectors was then calculated and used as the measure of word similarity. The cosine similarity takes a value from −1 to 1, where −1 means that the masked and predicted words differ in meaning and 1 means that they are the same. If the calculated cosine similarity was within the preset range, that is, above the threshold εcossim, the predicted word was returned to replace the masked word. For example, if εcossim is set to 0 and the cosine similarity of Xmasked and Xpredicted is −0.7, the prediction cannot pass the filter, and Xpredicted has to be predicted again. By contrast, if the cosine similarity of Xmasked and Xpredicted is 0.5, the prediction passes through the filter to form an augmented sentence. These algorithms were implemented in Python.
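A minimal sketch of the filter decision is given below. It is not the authors' code: the function names are hypothetical, and the three-dimensional vectors are invented toy values standing in for the fastText embeddings of Xmasked and Xpredicted. The threshold of 0 follows the example in the text.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def passes_filter(masked_vec, predicted_vec, threshold=0.0):
    """Accept the predicted word only if its similarity to the masked word exceeds the threshold."""
    return cosine_similarity(masked_vec, predicted_vec) > threshold


# Toy vectors (assumed values, not real fastText embeddings).
toy_vectors = {
    "denies": [0.9, 0.1, 0.0],
    "denied": [0.8, 0.2, 0.1],
    "reports": [-0.7, -0.2, 0.1],
}
print(passes_filter(toy_vectors["denies"], toy_vectors["denied"]))   # True
print(passes_filter(toy_vectors["denies"], toy_vectors["reports"]))  # False
```

In practice the vectors would come from the pretrained fastText model; a prediction that fails the check is discarded and another prediction is requested.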

Filtered BERT-Based Clinical Documents Augmentation
We applied a pretrained model suited to the data for filtered BERT-based data augmentation. Bio-Clinical BERT [31], which was pretrained using biomedical and clinical data, was used as the BERT model that predicted the masked word. It was confirmed in a previous study that a BERT model pretrained with a corpus suitable for the data showed good performance. To calculate the cosine similarity at the filter stage, words were converted into vectors using fastText embedding trained using the BioWordVec corpora [32].

Named Entity Recognition with BERT

Tokenization and Labeling for the BERT Model
To train the BERT model, words were re-tokenized using WordPiece [33]. WordPiece tokenizes long words into multiple subparts. If a token was a continuation of the preceding token, two marks (##) were attached to the front of the token to indicate its continuity. Additionally, two special tokens, [CLS] and [SEP], were added to mark the beginning and end of a sentence. To prevent the loss of IOB labels during the re-tokenization process, each subword token inherited the IOB label that its word had before tokenization.
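The label propagation can be sketched as follows. The tokenizer here, `toy_wordpiece`, is a hypothetical stand-in that simply cuts words into fixed-size pieces; a real WordPiece tokenizer splits according to its learned vocabulary. The sketch follows one reading of the text, in which every piece receives the unchanged label of its source word, and it omits the special tokens.

```python
def toy_wordpiece(word, max_len=4):
    """Hypothetical stand-in for WordPiece: fixed-size pieces, '##' on continuations."""
    if not word:
        return []
    pieces = [word[i:i + max_len] for i in range(0, len(word), max_len)]
    return [pieces[0]] + ["##" + piece for piece in pieces[1:]]


def align_labels(words, labels, tokenize=toy_wordpiece):
    """Re-tokenize words and give every subword piece the IOB label of its source word."""
    tokens, token_labels = [], []
    for word, label in zip(words, labels):
        pieces = tokenize(word)
        tokens.extend(pieces)
        token_labels.extend([label] * len(pieces))
    return tokens, token_labels


words = ["Dr", "Goldberg", "called"]
labels = ["O", "B-DOCTOR", "O"]
tokens, token_labels = align_labels(words, labels)
print(tokens)        # ['Dr', 'Gold', '##berg', 'call', '##ed']
print(token_labels)  # ['O', 'B-DOCTOR', 'B-DOCTOR', 'O', 'O']
```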

Fine-Tuning BERT
In this study, a PHI entity recognition model based on the i2b2 dataset was constructed to verify the performance of the proposed filtered BERT augmentation method. We fine-tuned a pretrained BERT model, an approach that showed good named entity recognition (NER) performance in a previous study [34]. The model consisted of the pretrained BERT embedding with an added token classification layer for named entity recognition. Figure 4 shows the BERT architecture and fine-tuning method. A PyTorch version of the publicly available BERT implementation was used in the experiment. The BERT configuration had 12 layers, a hidden size of 768, and 12 attention heads. We used a pretrained Bio-Clinical BERT. The Adam optimizer [35] was used for model training, and the learning rate was 3e-5 (0.00003). Training was repeated for five epochs.

Evaluation
Various evaluation indices were used to evaluate the performance of the constructed model. We first considered accuracy, which is derived from the confusion matrix [36] and calculated as (correct predictions)/(total number of data). Accuracy is often used in general deep learning classification problems, but it cannot properly evaluate a model on data with a severe class imbalance, as is the case in NER. Accordingly, we used (1) precision, (2) recall, and (3) F1-Score, which is obtained from recall and precision, to evaluate the performance of the model. True positive (TP) refers to the case where a PHI label is predicted as a PHI label, whereas a false positive (FP) refers to a case in which a normal label O is predicted as a PHI label. True negative (TN) is where a normal label O is predicted as a normal label, while false negative (FN) refers to the case where the PHI label is predicted as a normal label O. The F1-Score is the harmonic mean of recall and precision, which makes it suitable for evaluating the model even when the data are unevenly distributed. Therefore, in this study, recall, precision, and F1-Score were used as evaluation indicators to ensure accurate model evaluation.
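The three metrics follow the standard definitions: precision = TP/(TP + FP), recall = TP/(TP + FN), and F1 = 2 · precision · recall/(precision + recall). The sketch below computes them at the token level. The counting function uses a simplified PHI-versus-O reading of the definitions above and does not check whether the predicted PHI type matches the true one; this simplification, like the function names and the toy labels, is an assumption for illustration only.

```python
def count_binary(true_labels, pred_labels, outside="O"):
    """Count TP, FP, FN, treating every non-O label as PHI (simplified reading)."""
    tp = fp = fn = 0
    for true, pred in zip(true_labels, pred_labels):
        true_is_phi = true != outside
        pred_is_phi = pred != outside
        if true_is_phi and pred_is_phi:
            tp += 1
        elif pred_is_phi:
            fp += 1
        elif true_is_phi:
            fn += 1
    return tp, fp, fn


def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1); a metric is 0.0 when its denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


true = ["O", "B-DATE", "O", "B-PATIENT", "O"]
pred = ["O", "B-DATE", "B-DATE", "O", "O"]
print(precision_recall_f1(*count_binary(true, pred)))  # (0.5, 0.5, 0.5)
```

True negatives are not counted because none of the three metrics uses them, which is why these metrics are not inflated by the large number of O tokens.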

Data Augmentation
In this study, we created augmented text by transforming certain words in the original document using filtered BERT. Five words were substituted in each instance, and only predictions with a cosine similarity greater than 0 passed through the filter and were selected. Table 2 shows some examples of the original instances entered into the filtered BERT and the resulting augmented instances. For example, the word "denies" in one instance was replaced with "denied", a similar word (its past tense). We performed data augmentation in this form for all samples of the training data; that is, the size of the augmented dataset was equal to that of the training data.
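The substitution procedure can be outlined with a short sketch. Several details are assumptions made for illustration and are not confirmed by the text: positions are chosen at random, the masked-word predictor is a caller-supplied function `predict_candidates` that returns ranked candidates, a candidate identical to the original word is skipped, and a word is left unchanged if no candidate passes the filter. In the demo, lookup tables with invented values stand in for Bio-Clinical BERT and the fastText-based similarity.

```python
import random


def augment_instance(words, predict_candidates, similarity,
                     n_substitutions=5, threshold=0.0, seed=0):
    """Replace up to n_substitutions words with predictions that pass the similarity filter."""
    rng = random.Random(seed)
    augmented = list(words)
    positions = rng.sample(range(len(words)), k=min(n_substitutions, len(words)))
    for pos in positions:
        original = words[pos]
        # Try candidates in ranked order; a rejected candidate triggers the next one.
        for candidate in predict_candidates(words, pos):
            if candidate != original and similarity(original, candidate) > threshold:
                augmented[pos] = candidate
                break
    return augmented


# Hypothetical stand-ins for the masked-word predictor and the embedding similarity.
candidate_table = {"denies": ["reports", "denied"]}
similarity_table = {("denies", "reports"): -0.7, ("denies", "denied"): 0.5}


def toy_predict(words, pos):
    return candidate_table.get(words[pos], [])


def toy_similarity(original, candidate):
    return similarity_table.get((original, candidate), 0.0)


sentence = "patient denies chest pain".split()
print(augment_instance(sentence, toy_predict, toy_similarity))
# ['patient', 'denied', 'chest', 'pain']
```

Because each substitution is one word for one word, the sentence length is unchanged, so word-level IOB labels could be carried over position by position; whether the original pipeline does this is not stated in the text.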

Results of Named Entity Recognition
In this study, we compared the NER performance of the model fine-tuned using only the training data before augmentation with that of the model fine-tuned using the augmented instances. Table 3 shows the results of the NER performance evaluation before and after the application of filtered BERT. To further elaborate on the results, we evaluated the detailed performance for each PHI tag, which is shown in Table 4. We evaluated the precision, recall, and F1-score for each label; the support indicates the number of occurrences of each label.

Discussion
For the secondary use of clinical documents, it is important to de-identify PHIs. Data augmentation and transfer learning methods can be used as effective methods to overcome the problem of a limited dataset. They can create a high-performance model when there is a problem with a small number of publicly available datasets for de-identification.
A major novelty of this study lies in the process of verifying the masked words predicted by BERT through the similarity filter during the data augmentation process. PHI is a set of individual data points, such as a person's name or registration number, which is difficult to augment with the existing medical data knowledge base, such as the unified medical language system (UMLS). Therefore, it is difficult to apply knowledge-based augmentation methods. Additionally, BERT effectively predicts a masked word through context-based inference, but there is a risk that the word may be predicted as a word with a completely different meaning without being verified. We filtered words with completely opposite meanings by verifying the words for a second time through the similarity filter based on fastText embedding, which was trained in advance.
When comparing BERT before and after augmentation, the overall performance improved. Labels including DOCTOR, PATIENT, and USERNAME showed a performance improvement of more than 5%. Furthermore, labels with fewer instances, such as PROFESSION, ORGANIZATION, and STREET, showed a large performance improvement. However, for labels such as LOCATION-OTHER and DEVICE, sufficient data for learning could not be secured because the number of instances was too small.
Although the absolute number of datasets affects the occurrence of this problem, the fact that the configured dataset was unbalanced had a significant impact. The data augmentation method, which increases the absolute number of datasets, was effective in improving the overall performance; however, because the number of major classes is augmented together, there may be a limit to the learning of the minor classes.
In previous studies, the learning problem of unbalanced data was addressed by hybrid methods combining rule-based systems and machine learning [8,11,37]. Over-sampling methods for resolving class imbalance have also been considered, including deep learning-based methods such as the variational autoencoder (VAE), which is regarded as an effective solution to the unbalanced data problem [38]. If an over-sampling method is applied in the future, it may be possible to present a model with superior performance.
The generalization ability of the proposed model can be considered as a limitation of this study. Since the training and testing data consisted of data collected from the same institution, applying it to other types of clinical documents may raise questions about the generalization of the model. Therefore, further evaluation using other organizations and other types of data sources is required.
Future work will focus on generalizing the results and the methodology by using more samples, including clinical documents of the same type from other institutions.