Keyword Extraction Algorithm for Classifying Smoking Status from Unstructured Bilingual Electronic Health Records Based on Natural Language Processing

Featured Application: The study presents an improved and easily obtainable method for automatic smoking classification from unstructured bilingual electronic health records.

Abstract: Smoking is an important variable for clinical research, but few studies address the automatic extraction of smoking status from unstructured bilingual electronic health records (EHR). We aim to develop an algorithm to classify smoking status based on unstructured EHRs using natural language processing (NLP). Using acronym replacement and the Python package soynlp, we normalized 4711 bilingual clinical notes. Each note was classified into one of four categories: current smoker, past smoker, never smoker, and unknown. Subsequently, SPPMI (Shifted Positive Pointwise Mutual Information) was used to vectorize the words in the notes. By calculating the cosine similarity between these word vectors, keywords denoting the same smoking status were identified. Compared to other keyword extraction methods (word co-occurrence-, PMI-, and NPMI-based methods), our proposed approach improves keyword extraction precision by as much as 20.0%. The extracted keywords were then used to classify the four smoking statuses in our bilingual EHRs. Given an identical SVM classifier, the F1 score improved by as much as 1.8% over the unigram and bigram Bag of Words. Our study shows the potential of SPPMI for classifying smoking status from bilingual, unstructured EHRs and demonstrates how smoking information can be easily acquired for clinical practice and research.


Introduction
Smoking is a major risk factor for developing coronary artery disease, chronic kidney disease, cancer, and cardiovascular disease (CVD) [1,2]. It is also considered a modifiable risk factor for CVDs and other conditions associated with premature death worldwide [3][4][5][6]. Consequently, smoking status can be used to assess the risk of certain diseases and to suggest first-line interventions based on clinical guidelines.
Despite the effectiveness and importance of smoking cessation for disease prevention, smoking information is under-utilized and not easily measured. It is often buried in narrative text rather than stored in a consistent coded form. The rapid adoption of electronic health

Data
We applied our keyword extraction method to 4711 clinical notes collected from Seoul National University Hospital (SNUH) from 1 January to 31 December 2017, through the clinical data warehouse (CDW) of SNUH, SUPREME (Seoul National University Hospital Patients Research EnvironMEnt). Of those, 3512 notes were collected from the department of family medicine (including the patients with diabetes), and the rest were from the department of pulmonary and critical care medicine (including the patients with chronic obstructive pulmonary disorders). Each clinical note contains a patient's overall medical history as recorded by the doctors of each department. Each note contains, on average, 157.04 tokens, although token lengths range from 1 to 589.
As the notes are written as free text, different doctors express identical terms or concepts differently. The notes contain both English and Korean words, which is very common practice in Korea and further complicates keyword extraction. Although several researchers have worked with bilingual or multilingual EHRs [12][13][14], our paper is the first to focus on extracting bilingual keywords from EHRs. Based on patient smoking status, each note was manually labeled with one of four categories: current smoker, past smoker, never smoker, and unknown. The clinical notes were manually labeled by 3 medical students and 1 nurse, and errors were reviewed by 1 physician. Although these class labels exist, the sentences or words that suggest a patient's smoking status were not annotated; consequently, the labels are not used in the keyword extraction process. However, including all notes regardless of their class labels introduces additional difficulty in extracting meaningful smoking-related keywords, which tests the robustness of the proposed algorithm. Table 1 shows the overall statistics of our data. This study was approved by the Institutional Review Board of Seoul National University Hospital (Institutional Review Board number: N-1906-076-1040).

SPPMI-Based Keyword Extraction
In this work, we introduce SPPMI (Shifted Positive Pointwise Mutual Information) [15]-based keyword expansion to extract smoking status-related keywords from bilingual EHRs. It is an end-to-end unsupervised method and thus requires no annotated data or model training. Therefore, it is easily applicable in biomedical practice, as manual annotation is both time-consuming and especially costly in the medical field. SPPMI-based keyword extraction uses three main steps: text preprocessing, seed word preparation, and keyword extraction (Figure 1).

Table 1. Overall statistics of our data.

Smoking Status    Family Medicine    Pulmonary and Critical Care Medicine    Total
Current smokers   1046               84                                      1130
Past smokers      547                431                                     978
Never smokers     399                144                                     543
Unknown           1520               540                                     2060
Total             3512               1199                                    4711

During the text preprocessing step, we identified 170 commonly used medical acronyms and replaced them with their full expressions. Unlike in English, not all words in Korean are delimited by spaces. As an agglutinative language, Korean words are often delimited by a set of special words or particles [16]. To tokenize Korean phrases (eojeol) into the most semantically relevant words, we applied the Python package soynlp (https://github.com/lovit/soynlp, accessed on 8 September 2021) to the texts written in Korean. Without relying on any predefined dictionaries, soynlp finds the boundaries between Korean words by estimating the probability of those boundaries at the character level.
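As an illustration of the acronym replacement sub-step, the expansion can be implemented as a single regular-expression pass. This is a minimal sketch: the three dictionary entries below are hypothetical stand-ins for the 170 acronyms used in the study.

```python
import re

# Hypothetical fragment of the acronym dictionary; the study used ~170 entries.
ACRONYMS = {
    "smk": "smoking",
    "hx": "history",
    "copd": "chronic obstructive pulmonary disease",
}

# One alternation pattern with word boundaries, longest keys first,
# so multiple acronyms in a note expand in a single pass.
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, ACRONYMS), key=len, reverse=True)) + r")\b",
    flags=re.IGNORECASE,
)

def expand_acronyms(note: str) -> str:
    """Replace each known acronym with its full expression."""
    return _PATTERN.sub(lambda m: ACRONYMS[m.group(1).lower()], note)

print(expand_acronyms("Smk hx: 10 pack-years"))
# smoking history: 10 pack-years
```

The case-insensitive match matters in practice, since clinicians capitalize abbreviations inconsistently.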
After text preprocessing, we prepared a list of known smoking status-related seed keywords. They served as a basis for finding other keywords that describe a patient's smoking status. Two medical professionals analyzed frequently occurring words in our data and generated 50 smoking status-related keywords, including 11 never-smoker-related keywords, 16 past-smoker-related keywords, and 13 current-smoker-related keywords. We limited our seed words to unigrams or bigrams for clarity and computational efficiency. For Korean keywords, their unigrams and bigrams are defined in terms of the words identified by soynlp.
A few examples of seed keywords from each smoking status are provided in Table 2.

Table 2. Examples of our seed keywords. Both English and Korean words were selected as seed words. The two Korean keywords superscripted 1 and 2 translate to "non-smoking" and "stopped smoking", respectively.

Never Smoker      Past Smoker      Current Smoker
smk never         smk ex           current smoker
smk negative      smoker ya        smk yr

During the keyword extraction step, we identified words semantically similar to each of our seed words. To calculate the semantic similarity, we applied SPPMI to represent both our seed words and all the words in our EHR data as numerical vectors. We then calculated and ranked the pairwise cosine similarity between each seed word vector and the other word vectors. The words with the highest cosine similarity to the seed words are identified as the extracted keywords.
In SPPMI, each word is initially represented as a vector of its pointwise mutual information (PMI) scores [17] with every other word in the dataset. As described in Equation (1), PMI provides a probabilistic measure of association between two words by comparing their joint occurrence probability with their individual probabilities. Because they can effectively capture word similarity, PMI and its variants, such as normalized PMI (NPMI) [18], are frequently used in NLP [19][20][21].
Pointwise mutual information (PMI) is defined as

PMI(x, y) = log( P(x, y) / ( P(x) P(y) ) )    (1)

As shown in Equation (2), SPPMI shifts the PMI value of each word pair by a global constant (log k) and keeps only the positive values:

SPPMI(x, y) = max( PMI(x, y) − log k, 0 )    (2)

If lower-dimensional word vectors are desired, matrix factorization, such as singular value decomposition (SVD), is additionally applied. Depending on the value of k or the number of singular values used in the SVD, SPPMI can capture word similarity better than other neural-network-inspired word representation methods [22]. For example, Levy et al. [15,22] showed that word2vec implicitly factorizes a word co-occurrence matrix in which each co-occurrence is calculated as PMI. As the underlying mechanisms of word2vec [23] and SPPMI are identical, they experimentally showed that SPPMI can achieve a similar level of performance as word2vec. In this work, we chose the same values of k (1, 5, 15) and the same numbers of singular values (100, 500, 1000) as used in the original paper.
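The two equations above, together with the cosine-similarity ranking used in the keyword extraction step, can be sketched as follows. This is a minimal NumPy sketch under the standard SPPMI definition; the function names and toy inputs are ours, not from the study's implementation.

```python
import numpy as np

def sppmi_matrix(cooc: np.ndarray, k: float = 5.0) -> np.ndarray:
    """Build SPPMI word vectors from a word-word co-occurrence count matrix.

    PMI(x, y) = log(P(x, y) / (P(x) * P(y))) per Equation (1); SPPMI shifts
    it by log k and clips negatives to zero, per Equation (2).
    """
    total = cooc.sum()
    p_xy = cooc / total
    p_x = cooc.sum(axis=1, keepdims=True) / total
    p_y = cooc.sum(axis=0, keepdims=True) / total
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_xy / (p_x * p_y))
    pmi[~np.isfinite(pmi)] = 0.0          # pairs that never co-occur contribute nothing
    return np.maximum(pmi - np.log(k), 0.0)

def rank_by_cosine(vectors: np.ndarray, seed_idx: int, top_n: int = 5):
    """Indices of the top_n words most cosine-similar to the seed word."""
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0] = 1.0               # guard against all-zero rows
    sims = (vectors @ vectors[seed_idx]) / (norms * norms[seed_idx])
    order = np.argsort(-sims)
    return [i for i in order if i != seed_idx][:top_n]
```

For the lower-dimensional variants, `np.linalg.svd` can be applied to the SPPMI matrix and the vectors truncated to the desired number of singular values.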

Experiment Setting
To objectively measure the performance of SPPMI-based keyword expansion, we compare its extracted keywords with those from six other models. Although they share identical text preprocessing and seed word preparation steps, they all use different keyword extraction steps. The word co-occurrence, PMI vector, and NPMI vector models represent each word as a vector of its co-occurrence counts, PMI scores, and NPMI scores, respectively, with every other word in the dataset. Based on those vector representations, these baseline models rank pairwise cosine similarity to extract keywords that belong to the same smoking status as their seed words. For the PMI and NPMI score-based keyword extraction models, we did not create any word vectors. Instead, we calculated the pairwise PMI and NPMI scores, respectively, between each seed word and all other words in the dataset. By ranking these scores, the two models similarly extract relevant keywords for each seed word. In the word2vec models, each word vector is represented by the weights of a neural network trained to predict a word given its neighboring words. As one of the most basic word embedding methods, word2vec is widely applied to various word-level NLP tasks [24][25][26][27]. In our experiment, we trained nine word2vec models with different hyperparameters: the dimension of the word vectors was set to 100, 200, or 300, and the context size to 2, 4, or 6. To extract relevant keywords with word2vec, we once again used the pairwise cosine similarity measure.
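For the score-based baselines, the pairwise PMI and NPMI scores can be computed directly from co-occurrence counts. This is a minimal sketch under the standard definitions (the function names are ours); NPMI normalizes PMI by −log P(x, y) so that scores fall in [−1, 1].

```python
import math

def pmi_score(pair_count: int, x_count: int, y_count: int, total: int) -> float:
    """PMI between two words from raw counts, per Equation (1)."""
    p_xy = pair_count / total
    p_x = x_count / total
    p_y = y_count / total
    return math.log(p_xy / (p_x * p_y))

def npmi_score(pair_count: int, x_count: int, y_count: int, total: int) -> float:
    """NPMI: PMI divided by -log P(x, y), bounded in [-1, 1]."""
    p_xy = pair_count / total
    return pmi_score(pair_count, x_count, y_count, total) / -math.log(p_xy)
```

Ranking all candidate words by either score against a seed word yields the score-based baselines' extracted keywords.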

Keyword Extraction Precision
We extracted the top 1, 5, 10, and 20 keywords from each of these unsupervised keyword extraction models for each of our 50 initial seed words. As the complete set of smoking-related keywords is not available in the dataset, we used precision to measure the performance of the keyword extraction. To compare the precision of the models, two human annotators independently assessed the extracted keywords. Each extracted keyword was deemed correct only when both annotators agreed that it described the same smoking status as its input seed word.
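The precision measure described above reduces to a small helper. This is a hypothetical sketch: we assume each annotator's judgments are given as the set of extracted keywords they marked as matching the seed word's smoking status.

```python
def keyword_precision(extracted, annotator_a, annotator_b):
    """Precision of extracted keywords: a keyword counts as correct only
    when both annotators independently agreed that it describes the same
    smoking status as the seed word."""
    correct = [kw for kw in extracted if kw in annotator_a and kw in annotator_b]
    return len(correct) / len(extracted) if extracted else 0.0
```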
As shown in Table 3, our SPPMI-based keyword expansion method showed superior performance in extracting smoking status-related keywords from bilingual EHRs. It is interesting to note that the basic SPPMI model does not significantly improve keyword extraction precision. Instead, it was the lower-dimensional word vectors created from the SVD that exhibited superior and robust performance. As the number of generated keywords increased, the precision inevitably decreased because the number of words related to smoking status in our dataset is limited.

It is also interesting to note that word2vec shows poor precision. There are several reasons why word embedding methods generally do not work effectively on our data. One of the most critical issues is the huge number of unique words relative to the size of the data. When doctors write clinical notes, they are often simultaneously interacting with their patients. Due to this real-time nature, the expressions in the notes are often short and abbreviated. Consequently, they do not strictly adhere to standardized expressions, and expression styles often differ between doctors. As a result, semantically similar yet structurally different terms are prevalent in our dataset. Without normalizing these terms, word embedding methods fail to generalize. However, bilingual term normalization is itself another critical future research topic that is beyond the scope of this paper.

The detailed precision for each of the three smoking classes (never smoker, past smoker, and current smoker) is included in Tables S1-S3 in the Supplementary Materials. Additionally, Table 4 includes a few examples of extracted keywords from each smoking status. As a reference, all 50 of our seed keywords, the extracted keywords, and their statistics are publicly available at https://github.com/hank110/smoking_status_keywords (accessed on 8 September 2021).

Table 4. Examples of extracted keywords. Both English and Korean keywords were simultaneously extracted for each of our seed keywords. The two bilingual keywords superscripted 1 and 2 translate to "smoking negative" and "years ago ppd (packs per day)", respectively. The two Korean keywords superscripted 3 and 4 translate to "still cigarette" and "haven't quit", respectively.

Smoking Status Classification
Our SPPMI-based keyword extraction method can also be applied to training a smoking status classifier from EHR data. Several previous works have applied machine learning algorithms or statistical analysis to classify smoking status from EHR [28][29][30][31][32][33]. Among these works, a linear support vector machine (SVM) trained from unigram and bigram bag of words has consistently shown the highest classification accuracy [7,[34][35][36].
However, this smoking status classification accuracy can be further improved by preprocessing the unigram and bigram bag of words with the keywords extracted by SPPMI. Instead of representing each EHR record by the frequencies of every word in the dataset, we represent it as a bag of keywords. All non-keywords within a record are simply treated as a single identical word. For example, if we decide to represent each record with five keywords extracted from SPPMI, each record becomes a vector with six dimensions (five for the keywords and one for all other non-keywords).
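The bag-of-keywords representation can be sketched as follows. This is a minimal example with hypothetical unigram keywords; the study's extracted keywords also include bigrams.

```python
import numpy as np

def bag_of_keywords(tokens, keywords):
    """Represent a note as counts over the extracted keywords, plus one
    shared bucket (the last dimension) for every non-keyword token."""
    index = {kw: i for i, kw in enumerate(keywords)}
    vec = np.zeros(len(keywords) + 1)
    for tok in tokens:
        vec[index.get(tok, len(keywords))] += 1
    return vec

# A note tokenized into four tokens, represented with two keywords:
# the resulting vector has 2 + 1 = 3 dimensions.
v = bag_of_keywords(["smoking", "denied", "cough", "smoking"], ["smoking", "denied"])
```

These low-dimensional vectors then feed directly into the linear SVM classifier.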
To test the impact of our SPPMI-based extracted keywords on smoking status classification, we fixed the classifier to be a linear SVM. Subsequently, we compare the classification accuracy resulting from different clinical note vector representations. For all classification methods, we trained the classifiers on 80% of our SNUH clinical notes and used the rest as test data. As all EHR records were initially annotated with each patient's actual smoking status, accuracy can be measured in terms of the F1 score to compare the impact of different vector representations on the classifier's performance (Table 5). Compared to the unigram and bigram-based Bag of Words approach used in [7,[34][35][36], classifying smoking status solely with the keywords extracted by our SPPMI-based approach improves the overall accuracy by as much as 1.8% (Table 5). This improvement becomes more evident when we observe the classification accuracy for each smoking status (Tables S4-S7 in the Supplementary Materials). For classifying smokers, our approach improves the F1 score by as much as 9.04%. Furthermore, the improved classification result also signifies that our method is capable of expanding meaningful keywords from our seed words.
In terms of machine learning, preprocessing clinical note vectors with our keywords serves as a form of dimension reduction or feature engineering. However, our approach outperforms conventional dimension reduction techniques for document vectors, such as Latent Semantic Analysis (LSA) [37,38] and Latent Dirichlet Allocation (LDA) [39]. This superior smoking status classification result once again emphasizes the capability of our approach to expand keywords that are truly relevant to each smoking status.

Frequency Distribution of the Expanded Keywords
As our method relies solely on vector similarity, it is capable of extracting even infrequently occurring keywords. As shown in Figure 2, approximately 60% of our extracted keywords occur fewer than five times in the entire dataset. Keyword extraction methods that utilize statistical measures [40] or the feature importance of classifiers [41] will not be as effective as our method in capturing these infrequent keywords: because such keywords have less impact on the overall classification accuracy, they will simply be disregarded or replaced by more frequent features or keywords. However, these infrequent keywords provide equally meaningful insight into EHR records, especially when the amount of data or the degree of standardization is limited. They capture different ways of expressing smoking status and may even represent typos in the expressions.
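The frequency analysis behind Figure 2 amounts to counting keyword occurrences over the whole corpus. This is a minimal sketch with hypothetical inputs.

```python
from collections import Counter

def infrequent_fraction(corpus_tokens, keywords, threshold=5):
    """Fraction of extracted keywords that occur fewer than `threshold`
    times across the entire tokenized corpus."""
    counts = Counter(corpus_tokens)
    rare = [kw for kw in keywords if counts[kw] < threshold]
    return len(rare) / len(keywords) if keywords else 0.0
```

Applied to our extracted keywords, this statistic is what yields the roughly 60% figure reported above.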


Limitations of Pre-Trained Language Models
In this paper, we have excluded pre-trained language models such as BERT [42] and GPT [43] from our experiment due to our data's domain difference and bilingual property. These pre-trained models learn effective language models from general-purpose datasets such as English Wikipedia and BooksCorpus. However, the word-generating distribution in the medical domain cannot simply be assumed to be similar to those of Wikipedia or BooksCorpus. As medical notes are generated while doctors are interviewing or examining their patients, the notes are often succinct, largely containing abbreviations, domain-specific jargon, and incomplete sentences. Therefore, without fine-tuning, a pre-trained model cannot effectively capture the domain-specific word-generating distribution present in our medical notes. Although domain-specific models such as BioBERT [44] offer fine-tuned language models for biomedical applications, the terms expressed in clinical notes significantly differ from the normalized and preprocessed terms used in training these models. Due to the small number of notes relative to the huge number of unique words, the benefits of additionally fine-tuning these existing models with our data are also limited.
Most importantly, there are no bilingual or multilingual language models in the biomedical domain at the moment. Without models that are simultaneously trained on multilingual medical text data, aligning the embedding spaces of different language models in the medical domain remains an open research topic that we hope to address in the future. Furthermore, the biggest bottleneck in this multilingual language model approach is the limited medical corpus available in Korean. Consequently, to the best of the authors' knowledge, there is no large-scale pre-trained language model trained on Korean medical text data. This paper's experiment with word2vec emphasizes not only the need for but also the difficulty of creating a large-scale language model for Korean medical data. Despite various hyperparameter settings, the low precision of our word2vec models indicates that they failed to capture the semantic similarity between terms. A longer and larger set of medical notes would provide additional contextual cues to improve the performance of word2vec and other neural network-based language models. However, creating a large medical corpus is extremely costly, as such data are not easily available and require medical professionals' input during processing. We hope that this paper will serve as a starting point for this expensive yet necessary process.
Language models and deep learning algorithms are effective in the medical domain only when sufficient medical data are available. For example, Arnaud et al. [45] use 260,000 emergency department records to train their CNN text classifier, while Yao et al. [46] fine-tune their BERT model on a Chinese clinical corpus that contains over 18 million tokens. As more Korean medical notes are currently being collected in a structured format, we also plan to improve our bilingual keyword extraction in the future. Once a large-scale Korean medical corpus is ready, training a more sophisticated language model or aligning word embedding spaces based on transfer learning will be interesting research topics to pursue.

Implications to Bilingual EHRs
The main purpose of an EHR is to support patient care and administrative tasks related to treatment as a repository of clinical data. Therefore, EHRs are not optimized for the accurate retrieval of large amounts of data. Consequently, using EHRs for research purposes raises problems such as low accessibility, poor performance, and a lack of data analysis functions. In particular, clinicians often use free text when recording clinical findings in EHRs. Natural language processing is a representative method of extracting meaningful data from documents recorded as free text. Attempts are being made to extract drug prescriptions, problem lists, and comprehensive clinical information from clinical documents recorded as free text using natural language processing, or to use such methods for document classification and retrieval [47][48][49]. However, although current research on medical images or bio-signals has progressed considerably, research on analyzing textual medical data remains limited. In particular, studies on text composed of multiple languages are not common [50]. Korean medical institutes face additional difficulties in natural language processing, as their EHRs contain both Korean and English. This study provides a meaningful basis for extracting insights from the clinical data warehouse, mapping documents to a standardized terminology system, and classifying bilingual EHR documents.

Strengths and Limitations
In previous studies, smoking data were mainly collected through two sources: a structured questionnaire or a manual review of the clinical notes by researchers. Using a structured questionnaire may require considerable effort to increase the response rate. Similarly, manual review requires a significant amount of time and human resources, with the possibility of human error. Our proposed method allows researchers and practitioners to easily obtain smoking information from EHRs. It is also the first work to extract smoking information from clinical free text that contains two different languages, Korean and English. Previous works on smoking status classification focused on binary classification: their algorithms classified either smokers and non-smokers or past smokers and non-smokers. In contrast, our classification algorithm based on keyword extraction performs multiclass classification that distinguishes among current smokers, past smokers, never smokers, and unknowns.

Various treatment guidelines for noncommunicable diseases [51,52] recommend examining smoking history during every outpatient visit and educating patients on lifestyle modification. If there is no smoking-related information in the patient's previous EHR, the medical staff can perceive it as a cue to collect new information. Moreover, it was confirmed that smoking-related keywords were expressed in various ways even when the previous chart contents were copied and pasted. Our proposed algorithm will have a practical application in automatically mapping and preprocessing unstructured clinical notes composed of the various keywords that arise under the copy-and-paste practice.
One limitation of this study is the possibility that some clinicians did obtain patients' smoking history but failed to enter it into the EHR. Therefore, simply labeling patients with unrecorded smoking status as unknown may be insufficient. Second, our study was conducted only with free-text clinical notes obtained from two departments in one tertiary hospital. The clinical notes of diabetes patients in family medicine and COPD patients in pulmonary and critical care medicine often include the patients' smoking history, allowing us to build and validate our approach. A further validation study using data from other centers is required to test our proposed method's robustness and applicability.

Conclusions
Our study showed the great potential of classifying smoking status from bilingual, unstructured EHRs. To the best of our knowledge, this paper is one of the first works to confirm the possibility of extracting meaningful keywords from bilingual unstructured EHRs. Instead of medical staff manually perusing the notes, our proposed algorithm explored the possibility of replacing this time-consuming and expensive approach with an automated methodology leveraging NLP. Due to the limited amount of data available and the relatively small portion of EHRs dedicated to patients' smoking history, we could not train or apply sophisticated bilingual language models from scratch for our keyword extraction task. However, as the size of EHR data continues to increase, we plan to apply recent advancements in NLP to improve the accuracy of keyword extraction in the near future. With our current findings, smoking information can be easily acquired and used for clinical practice and research.
Supplementary Materials: Python implementation of the study is available online at: https://www.mdpi.com/article/10.3390/app11198812/s1 or https://github.com/hank110/smoking_status_keywords. Table S1: Comparison of precision between SPPMI-based keyword expansion and five other baseline models on extracting never smoker-related keywords (word co-occurrence, PMI vector, NPMI vector, PMI score, and NPMI score models); the values of d represent the number of singular values used in SVD. Table S2: Comparison of precision between SPPMI-based keyword expansion and five other baseline models on extracting past smoker-related keywords (word co-occurrence, PMI vector, NPMI vector, PMI score, and NPMI score models); the values of d represent the number of singular values used in SVD. Table S3: Comparison of precision between SPPMI-based keyword expansion and five other baseline models on extracting current smoker-related keywords (word co-occurrence, PMI vector, NPMI vector, PMI score, and NPMI score models); the values of d represent the number of singular values used in SVD. Table S4: Comparison of never smoker classification accuracy; all accuracies are reported as F1 scores. Table S5: Comparison of past smoker classification accuracy; all accuracies are reported as F1 scores. Table S6: Comparison of current smoker classification accuracy; all accuracies are reported as F1 scores. Table S7: Comparison of unknown smoking status classification accuracy; all accuracies are reported as F1 scores.