MARIE: A Context-Aware Term Mapping with String Matching and Embedding Vectors

Featured Application: With its term mapping capability, MARIE can be used to improve data interoperability between different biomedical institutions. It can also be applied to text data pre-processing or normalization in non-biomedical domains.

Abstract: With growing interest in machine learning, text standardization is becoming an increasingly important aspect of data pre-processing within biomedical communities. As the performance of machine learning algorithms is affected by both the amount and the quality of their training data, effective data standardization is needed to guarantee consistent data integrity. Furthermore, biomedical organizations, depending on their geographical locations or affiliations, rely on different sets of text standards in practice. To facilitate easier machine learning-related collaborations between these organizations, an effective yet practical text data standardization method is needed. In this paper, we introduce MARIE (a context-aware term mapping method with string matching and embedding vectors), an unsupervised learning-based tool, to find standardized clinical terminologies for queries, such as a hospital's own codes. By incorporating both string matching methods and term embedding vectors generated by BioBERT (bidirectional encoder representations from transformers for biomedical text mining), it utilizes both structural and contextual information to calculate similarity measures between source and target terms. Compared to previous term mapping methods, MARIE shows improved mapping accuracy. Furthermore, it can be easily expanded to incorporate any string matching or term embedding methods. Without requiring any additional model training, it is not only effective, but also a practical term mapping method for text data standardization and pre-processing.


Introduction
Due to the growing interest in text mining and natural language processing (NLP) in the biomedical field [1][2][3], data pre-processing is becoming an increasingly crucial issue for many biomedical practitioners. As datasets used in research and practice are often different, it is challenging to directly apply recent advancements in biomedical NLP to existing IT systems without effective data pre-processing. One of the key issues addressed during pre-processing is concept normalization, which refers to the task of aligning different text datasets or corpora into a common standard.

Methods
Since MARIE utilizes both string matching methods and BERT, we will briefly discuss these methods. Subsequently, we will provide details of our proposed method.

String Matching Methods
String matching methods define a similarity score between terms based on the number of commonly shared letters or substrings. Among the many available methods, edit distance [26,27], the Jaccard index [28,29] and Ratcliff/Obershelp similarity [30] are some of the most commonly used. Given two terms, T1 and T2, edit distance calculates their distance based on the number of insertions, deletions and substitutions of characters required to transform T1 into T2. While edit distance captures the structural information of terms at the character level, the Jaccard index utilizes token-level structural information: it computes the similarity between T1 and T2 as the number of tokens shared by both terms in proportion to the total number of distinct tokens in T1 and T2. Ratcliff/Obershelp (R/O) similarity is regarded as an extension of the Jaccard index. Initially, it finds the longest common substring between T1 and T2. Subsequently, it finds the next longest common substring from the non-matching regions of T1 and T2. This process of finding the longest common substring from the non-matching regions is repeated until there is no common substring left between T1 and T2. Finally, it uses the combined length of all common substrings in proportion to the overall length of T1 and T2 to calculate their similarity.
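As a concrete sketch of these three scores, each normalized to [0, 1] (Python's difflib.SequenceMatcher implements Ratcliff/Obershelp matching; the token-level Jaccard variant shown here is one common formulation, not necessarily the exact implementation used by MARIE):

```python
from difflib import SequenceMatcher

def edit_distance(t1, t2):
    # dynamic-programming Levenshtein distance (insertions, deletions, substitutions)
    m, n = len(t1), len(t2)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1, dp[j - 1] + 1, prev + (t1[i - 1] != t2[j - 1]))
            prev = cur
    return dp[n]

def edit_similarity(t1, t2):
    # normalize the distance into a [0, 1] similarity (1 means identical)
    return 1 - edit_distance(t1, t2) / max(len(t1), len(t2), 1)

def jaccard_similarity(t1, t2):
    # token-level Jaccard index: shared tokens over all distinct tokens
    a, b = set(t1.lower().split()), set(t2.lower().split())
    return len(a & b) / len(a | b) if a | b else 1.0

def ro_similarity(t1, t2):
    # difflib's ratio() implements Ratcliff/Obershelp matching
    return SequenceMatcher(None, t1.lower(), t2.lower()).ratio()
```

Note that the token-level Jaccard score is insensitive to word order, while the character-level scores are not; this difference matters for the discrepancy analysis later in the paper.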

BERT
As a word embedding model, BERT learns to generate continuous numerical vector representations of words from text data. To generate these embedding vectors, it relies on a bi-directional transformer architecture [31]. Although Embeddings from Language Model (ELMo) [32] similarly utilizes bi-directional contextual information to train its word vectors, BERT is one of the first embedding methods that simultaneously captures bi-directional contextual information during its training. Furthermore, its multi-head attention mechanisms serve as a computationally efficient solution for retaining long-range contextual information, thereby improving the quality of the embedding vectors compared with those generated from recurrent neural network-based models [33]. Additionally, it uses WordPiece embeddings [34] to tokenize the input words during training. Subsequently, a word vector is generated by aggregating these tokenized embedding vectors, thereby addressing the issue of out-of-vocabulary words. Recently, BioBERT [35] has been introduced in the field of biomedical NLP. Despite its near-identical architecture to BERT, it has been trained on biomedical corpora such as PubMed abstracts and PubMed Central full-text articles. Due to its domain-specific training data, it has improved the performance of named entity recognition, relation extraction and question answering within the biomedical domain. Based on its trained word embedding vectors, the similarity between terms T1 and T2 is measured by calculating the cosine similarity between their respective embedding vectors.

MARIE
As described in Equation (1), MARIE calculates the similarity between terms T1 and T2 as a weighted average of a string matching score and the cosine similarity of their BioBERT embedding vectors E1 and E2:

similarity(T1, T2) = α · cos(E1, E2) + (1 − α) · sim_string(T1, T2),    (1)

where 0 ≤ α ≤ 1. As long as the range of the resulting similarity score is between 0 and 1, any type of string matching method can be incorporated into MARIE.
To generate word embedding vectors, MARIE extracts the activations from the last few layers of BioBERT. When a term is composed of multiple words or tokens, MARIE uses the average of the constituent word or token embedding vectors to generate a final term vector. As both similarity measures range between 0 and 1, the final similarity score computed by MARIE also lies between 0 and 1. Furthermore, the hyperparameter α enables users to adjust the relative importance of the similarity scores calculated from the string matching method and the BioBERT embedding vectors, allowing MARIE to flexibly adjust its mapping capability to different datasets. Figure 1 provides the overall flow diagram of the entire mapping process.
To find a mapping between the input and the target datasets, MARIE initially compares a clinical term T1 from the input dataset with a clinical term T2 from the target dataset.
To capture the contextual similarity between T1 and T2, it generates BioBERT embedding vectors for both T1 and T2 and calculates their vector similarity based on a cosine similarity measure. Along with their structural similarity, computed from the string matching method, MARIE calculates the weighted average similarity score between T1 and T2. By ranking these final similarity scores, T1 is mapped to the T2 with the highest similarity score.
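This ranking step can be sketched as follows, with `similarity` standing in for any scoring function; the function and parameter names are illustrative, not MARIE's actual implementation:

```python
def map_terms(input_terms, target_terms, similarity, top_k=1):
    # for each input term, rank every target term by its similarity score
    # and keep the top_k highest-scoring candidates
    mapping = {}
    for t1 in input_terms:
        ranked = sorted(target_terms, key=lambda t2: similarity(t1, t2), reverse=True)
        mapping[t1] = ranked[:top_k]
    return mapping
```

With top_k greater than 1, the same routine yields the ranked candidate lists used for the top 3, 5 and 10 accuracies reported later.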
The string matching methods and BioBERT use distinctly different criteria when calculating a similarity score between terms T1 and T2. The string matching methods rely purely on the structural similarities between the terms. BioBERT, on the other hand, uses the distance between the embedding vectors E1 and E2. As these embedding vectors are trained to capture contextual dependencies between terms, the distance between E1 and E2 is determined by the contextual similarities between T1 and T2. By combining these two approaches, MARIE utilizes both structural and contextual dependencies, thereby employing a richer set of criteria to define the similarity between T1 and T2.
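A minimal sketch of this weighted combination, assuming α weights the contextual (embedding) component, as the later discussion of α suggests, and with `string_sim` standing in for any string matching score in [0, 1]:

```python
import math

def cosine_similarity(e1, e2):
    # cosine similarity between two equal-length, non-zero embedding vectors
    dot = sum(a * b for a, b in zip(e1, e2))
    norm = math.sqrt(sum(a * a for a in e1)) * math.sqrt(sum(b * b for b in e2))
    return dot / norm

def marie_similarity(t1, t2, e1, e2, string_sim, alpha=0.5):
    # weighted average of the contextual (embedding) score and the
    # structural (string matching) score; alpha on the embedding term
    # is an assumption of this sketch
    return alpha * cosine_similarity(e1, e2) + (1 - alpha) * string_sim(t1, t2)
```

Because both components lie in [0, 1] and α is a convex weight, the combined score also lies in [0, 1], matching the range requirement stated above.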
Previously, word embedding-based edit distance [36] similarly attempted to incorporate contextual information into edit distance. However, MARIE is a generalized improvement upon this work. Besides edit distance, MARIE can incorporate a wider selection of string matching methods, thereby further expanding its applicability in practice. The biggest improvement of MARIE, however, is its choice of term embedding method. Among various term embedding methods, the previous work relies on word2vec [37] to generate term embedding vectors. However, word2vec does not capture richer bidirectional contextual information between words. Furthermore, it does not employ WordPiece embedding and therefore suffers from an out-of-vocabulary issue: when the embedding vector of a term T1 has not been trained, word2vec cannot infer its embedding vector. For this type of out-of-vocabulary term, word embedding-based edit distance simply ignores the contextual information and reverts to a string matching method. BioBERT relies on WordPiece embedding to overcome this weakness of word2vec. During training, BioBERT learns the embedding vectors of both input words and their subwords. Thus, when it needs to infer the embedding vector of T1, it breaks T1 into known subwords and estimates its embedding vector by adding up the subword embedding vectors. By using BioBERT instead of other term embedding methods such as word2vec or GloVe [38], MARIE utilizes richer contextual dependencies between terms and is robust to new words. Therefore, MARIE is capable of leveraging contextual information even when the resources for training term embedding vectors are limited.
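A toy illustration of this subword fallback; the vocabulary and two-dimensional vectors below are hypothetical stand-ins, not real BioBERT weights:

```python
# hypothetical subword vocabulary with two-dimensional toy vectors
subword_vectors = {
    "cardio": [1.0, 0.0],
    "##myopathy": [0.0, 2.0],
}

def oov_term_vector(pieces, vectors):
    # estimate an out-of-vocabulary term vector by adding up the embedding
    # vectors of its known subword pieces, as described for WordPiece above
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for piece in pieces:
        total = [t + v for t, v in zip(total, vectors[piece])]
    return total

# "cardiomyopathy" is unseen as a whole word but decomposes into known pieces
vec = oov_term_vector(["cardio", "##myopathy"], subword_vectors)
```

A word2vec-style lookup table would have no entry for the whole word and would have to fall back to string matching alone at this point.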

Dataset and Experiment Setups
To validate the performance of MARIE, we tested it by mapping 3489 medical terms currently used in Seoul National University Hospital (SNUH corpus) to the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM). Among the various code standards available in the CDM, we limited our mapping to SNOMED CT, Logical Observation Identifiers Names and Codes (LOINC), RxNorm and RxNorm Extension [39,40]. With Usagi software and manual reviews from six medical professionals, we created a true mapping between SNUH corpus and the CDM. Given the terms from SNUH corpus as inputs, the performance of MARIE was evaluated on how accurately it could reconstruct this true mapping between SNUH corpus and the CDM. For example, given a term T1 from SNUH corpus, we calculated its similarity scores with all of the terms in a target dataset and selected the term with the highest similarity score, T1*, as its predicted mapped term. The mapping between T1 and T1* was considered correct if it corresponded to the true mapping between SNUH corpus and the CDM.
For robust evaluation, we tested MARIE on three different artificial target datasets. Each of these target datasets contained 3489 correctly mapped terms, as expressed in the CDM. In addition to these correctly mapped terms, each of these target datasets also contained 5000, 10,000 and 50,000 randomly sampled terms from the CDM, respectively. These random samples included only the terms that were excluded or ignored during the true mapping between SNUH corpus and the CDM. The sampled target datasets increased the number of possible mapping candidates that MARIE needed to explore, thereby introducing additional complexity to accurately mapping the terms in SNUH corpus.
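The construction of such a target dataset can be sketched as follows; the function and parameter names are illustrative:

```python
import random

def build_target_dataset(mapped_terms, cdm_terms, n_distractors, seed=0):
    # a target dataset = all correctly mapped terms plus n_distractors
    # CDM terms randomly sampled from those excluded from the true mapping
    pool = sorted(set(cdm_terms) - set(mapped_terms))
    rng = random.Random(seed)
    return list(mapped_terms) + rng.sample(pool, n_distractors)
```

Larger distractor pools enlarge the candidate space each input term must be ranked against, which is what makes the 50,000-sample dataset the hardest of the three.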
As for BioBERT, we used the BioBERT-Large v1.1 model from its official GitHub repository. Without additional model training, we extracted the term embedding vectors from its pre-trained weights. Table 1 shows the top 1, 3, 5 and 10 mapping accuracies of MARIE on each of the three target datasets. To provide objective comparisons, we also report the mapping accuracies of the string matching methods and an embedding vector-based mapping method. For the string matching methods, we used the Jaccard index, edit distance and R/O similarity. For the embedding vector-based mapping method, we calculated the similarity score between terms T1 and T2 based on the cosine similarity of their respective embedding vectors E1 and E2. This embedding vector-based mapping is a variant of the concept normalization method investigated by Karadeniz et al. [41]; the only difference is that the embedding vectors were generated from BioBERT instead of word2vec.
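The top-k accuracies reported in Table 1 can be computed along these lines (a sketch, assuming each input term has a ranked list of predicted candidates):

```python
def top_k_accuracy(true_mapping, ranked_predictions, k):
    # fraction of input terms whose true target term appears among the
    # top-k ranked candidates returned by the mapping method
    hits = sum(
        1 for t1, true_t2 in true_mapping.items()
        if true_t2 in ranked_predictions[t1][:k]
    )
    return hits / len(true_mapping)
```

By construction, the accuracy is non-decreasing in k, which is why the top 10 figures in Table 1 bound the top 1 figures from above.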

Experiment Results
The interplay of structural and contextual information within terms enabled MARIE to outperform the other mapping methods across all target datasets. Considering the process involved in mapping terms, this outcome is not surprising. Assume that a correct mapping between datasets A and B maps a term T1 ∈ A to a term T2 ∈ B. Unless T1 and T2 are identical strings, two types of discrepancies occur between these two terms: given T1, either the order of words in T2 will be different, or T2 will be missing some of the words or subwords in T1. The term embedding vectors and the string matching methods are effective in addressing each of these discrepancies, respectively. As the embedding vector of a term is defined by the average of its word vectors, the similarity score computed from the BioBERT embedding vectors is invariant to changes in the word order within a term. Therefore, the contextual information of an entire term is preserved, regardless of changes in its word order. On the other hand, similarity scores from string matching methods depend on the length of common characters or substrings. Consequently, as long as the missing words are short compared with the overall lengths of the terms, string matching methods are effective in tolerating these missing words. As MARIE simultaneously resolves these two discrepancies, it achieves effective mapping performance. This characteristic is also evident in Table 2, which compares the actual mapping outcomes of MARIE, R/O similarity and the embedding vector-based mapping between SNUH local concepts and Observational Health Data Sciences and Informatics (OHDSI) Athena (https://athena.ohdsi.org/).

Impact of α
Hyperparameter α balances the effect of structural and contextual information on the overall similarity score computed by MARIE, thereby altering the mapping accuracy, as shown in Table 1. Therefore, we additionally analyzed the changes in accuracy with respect to various values of α on the target dataset with 50,000 random samples, varying α from 0.2 to 0.9 in increments of 0.1.
As shown in Figure 2, there is an optimal value of α that maximizes the mapping accuracy, thereby suggesting that solely relying on either structural or contextual information is not sufficient for accurate mapping. However, the optimal value of α will vary depending on the input and target datasets, and should be determined based on the type of discrepancy that is prevalent in them. If the word order discrepancy is a major difference between the input and target datasets, an α value greater than 0.5 should be used for better mapping accuracy. On the other hand, if words missing from the corresponding terms are prevalent, an α value less than 0.5 is recommended.
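Selecting α by the simple grid sweep used in this experiment might look like the following sketch, where `evaluate` stands in for a function that returns the mapping accuracy for a given α:

```python
def sweep_alpha(evaluate, start=0.2, stop=0.9, step=0.1):
    # enumerate alpha values on a grid and return the one with the
    # highest accuracy according to evaluate(alpha)
    alphas, a = [], start
    while a <= stop + 1e-9:
        alphas.append(round(a, 1))
        a += step
    return max(alphas, key=evaluate)
```

In practice, `evaluate` would rerun the mapping on a held-out portion of the input dataset, since the best α depends on which discrepancy type dominates.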

Impact of BioBERT Layers
Depending on the number of BioBERT layers used to represent a word embedding vector, the amount of contextual information captured by the vector can vary. When we utilized multiple layers, the final embedding vector was generated by taking the average of the embedding vectors from each layer. Table 3 summarizes the changes in the mapping accuracy of MARIE when we used the last two to four layers of BioBERT to generate a word embedding vector. Although the improvement was not significant, using more layers increased the overall mapping accuracy.
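This layer averaging can be sketched as follows, where `layer_activations` is assumed to be a list of per-layer activation vectors for one token, ordered from the first to the last layer:

```python
def layer_averaged_vector(layer_activations, n_layers):
    # average a token's activation vectors across the last n_layers;
    # with n_layers=1 this reduces to using the final layer alone
    selected = layer_activations[-n_layers:]
    dims = len(selected[0])
    return [sum(layer[d] for layer in selected) / len(selected) for d in range(dims)]
```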

Limitations of MARIE
Despite its superior mapping capability, MARIE does suffer from a few limitations. One is its inability to handle biomedical terms that mix English with other languages, such as Korean. To map such multilingual terms, we would need a BioBERT model that has been simultaneously trained on these different languages. However, to the best of our knowledge, such a heterogeneous model or training corpus for biomedical applications does not yet exist.
Furthermore, the computation speed of MARIE is bounded by the computation speed of its string matching methods. As MARIE extracts pre-trained BioBERT vectors, the majority of its computation time is spent on applying string matching methods. However, this computation time varies, based on the choice of string matching methods and their implementations.
This computation time issue becomes especially critical when mapping between large datasets. Therefore, we highly recommend using string matching methods with low computational complexities if fast computation time is critical.

Conclusions
In this paper, we introduced a new biomedical term mapping method that incorporates both structural and contextual information to calculate a term similarity score. As pre-processing and standardizing datasets are crucial in applying the latest text mining and machine learning models in practice, MARIE will serve as an essential tool for many biomedical professionals. As a generalized mapping method, it does not require any additional model training and can incorporate other string matching and embedding methods not discussed in this paper.
Furthermore, MARIE will make significant contributions to the process of mapping local clinical concept codes to standardized codes. Specifically, this mapping generated from MARIE will be useful for deploying the OMOP CDM outside of the United States or the United Kingdom. Based on the data standardization achieved by MARIE, applications related to health information exchange and personal health records will also benefit from improved interoperability.
For the direction of future research, it will be interesting to incorporate recent advances in machine translation or transfer learning to MARIE, thereby enabling it to map biomedical terms across datasets recorded in different languages. Furthermore, we will also explore ways to apply MARIE to create new standardized biomedical text datasets. By aligning and aggregating biomedical text datasets from numerous organizations, we will be able to create a large, yet highly standardized, training dataset for machine learning applications.