MARIE: A Context-Aware Term Mapping with String Matching and Embedding Vectors
Abstract
Featured Application
1. Introduction
2. Methods
2.1. String Matching Methods
2.2. BERT
2.3. MARIE
3. Results
3.1. Dataset and Experiment Setups
3.2. Experiment Results
4. Discussion
4.1. Impact of α
4.2. Impact of BioBERT Layers
4.3. Limitations of MARIE
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Huang, C.C.; Lu, Z. Community challenges in biomedical text mining over 10 years: Success, failure and the future. Brief. Bioinform. 2016, 17, 132–144.
- Wei, C.H.; Leaman, R.; Lu, Z. Beyond accuracy: Creating interoperable and scalable text-mining web services. Bioinformatics 2016, 32, 1907–1910.
- Junge, A.; Jensen, L.J. CoCoScore: Context-aware co-occurrence scoring for text mining applications using distant supervision. Bioinformatics 2020, 36, 264–271.
- Aronson, A.R. Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. In Proceedings of the AMIA Symposium; American Medical Informatics Association: Bethesda, MD, USA, 2001; p. 17.
- Bodenreider, O. The Unified Medical Language System (UMLS): Integrating biomedical terminology. Nucleic Acids Res. 2004, 32, D267–D270.
- Rindflesch, T.C.; Fiszman, M. The interaction of domain knowledge and linguistic structure in natural language processing: Interpreting hypernymic propositions in biomedical text. J. Biomed. Inform. 2003, 36, 462–477.
- Leaman, R.; Islamaj Doğan, R.; Lu, Z. DNorm: Disease name normalization with pairwise learning to rank. Bioinformatics 2013, 29, 2909–2917.
- Xu, D.; Zhang, Z.; Bethard, S. A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8452–8464.
- Li, H.; Chen, Q.; Tang, B.; Wang, X.; Xu, H.; Wang, B.; Huang, D. CNN-based ranking for biomedical entity normalization. BMC Bioinform. 2017, 18, 79–86.
- Ji, Z.; Wei, Q.; Xu, H. BERT-based ranking for biomedical entity normalization. AMIA Summits Transl. Sci. Proc. 2020, 2020, 269.
- Schumacher, E.; Mulyar, A.; Dredze, M. Clinical Concept Linking with Contextualized Neural Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8585–8592.
- Dai, W.; Yang, Q.; Xue, G.R.; Yu, Y. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning; Association for Computing Machinery: New York, NY, USA, 2007; pp. 193–200.
- Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2009, 22, 1345–1359.
- Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; Vaughan, J.W. A theory of learning from different domains. Mach. Learn. 2010, 79, 151–175.
- Donnelly, K. SNOMED-CT: The advanced terminology and coding system for eHealth. Stud. Health Technol. Inform. 2006, 121, 279.
- Dogan, R.I.; Lu, Z. An inference method for disease name normalization. In Proceedings of the 2012 AAAI Fall Symposium Series, Arlington, VA, USA, 2–4 November 2012.
- Kate, R.J. Normalizing clinical terms using learned edit distance patterns. J. Am. Med. Inform. Assoc. 2016, 23, 380–386.
- Turian, J.; Ratinov, L.; Bengio, Y. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010; pp. 384–394.
- Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537.
- Wang, P.; Xu, B.; Xu, J.; Tian, G.; Liu, C.L.; Hao, H. Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification. Neurocomputing 2016, 174, 806–814.
- Kim, H.K.; Kim, H.; Cho, S. Bag-of-concepts: Comprehending document representation through clustering words in distributed representation. Neurocomputing 2017, 266, 336–352.
- Tang, D.; Wei, F.; Yang, N.; Zhou, M.; Liu, T.; Qin, B. Learning sentiment-specific word embedding for Twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, 23–25 June 2014; pp. 1555–1565.
- Nikfarjam, A.; Sarker, A.; O'Connor, K.; Ginn, R.; Gonzalez, G. Pharmacovigilance from social media: Mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J. Am. Med. Inform. Assoc. 2015, 22, 671–681.
- Xing, C.; Wang, D.; Liu, C.; Lin, Y. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA, 31 May–5 June 2015; pp. 1006–1011.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Wagner, R.A.; Fischer, M.J. The string-to-string correction problem. J. ACM 1974, 21, 168–173.
- Hyyrö, H. Explaining and Extending the Bit-Parallel Approximate String Matching Algorithm of Myers; Technical Report A-2001-10; Department of Computer and Information Sciences, University of Tampere: Tampere, Finland, 2001.
- Jaccard, P. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bull. Soc. Vaud. Sci. Nat. 1901, 37, 241–272.
- Gower, J.C.; Warrens, M.J. Similarity, dissimilarity, and distance, measures of. In Wiley StatsRef: Statistics Reference Online; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2014; pp. 1–11.
- Black, P.E. Ratcliff/Obershelp pattern recognition. In Dictionary of Algorithms and Data Structures; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2004.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. arXiv 2018, arXiv:1802.05365.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 8–13 December 2014; pp. 3104–3112.
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Dean, J. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144.
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
- Niu, Y.; Qiao, C.; Li, H.; Huang, M. Word embedding based edit distance. arXiv 2018, arXiv:1810.10752.
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 3111–3119.
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Liu, S.; Ma, W.; Moore, R.; Ganesan, V.; Nelson, S. RxNorm: Prescription for electronic drug information exchange. IT Prof. 2005, 7, 17–23.
- Nelson, S.J.; Zeng, K.; Kilbourne, J.; Powell, T.; Moore, R. Normalized names for clinical drugs: RxNorm at 6 years. J. Am. Med. Inform. Assoc. 2011, 18, 441–448.
- Karadeniz, I.; Özgür, A. Linking entities through an ontology using word embeddings and syntactic re-ranking. BMC Bioinform. 2019, 20, 156.
| Metric | Jaccard Index | Edit Distance | R/O | MARIE (BioBERT + Edit Distance) | MARIE (BioBERT + R/O) | Embedding Vector Similarity |
|---|---|---|---|---|---|---|
| Top 1 accuracy | 33.68% | 44.71% | 49.44% | 46.49% / 48.29% / 51.16% | 51.25% / 53.48% / 54.66% | 35.83% |
| Top 3 accuracy | 40.64% | 55.49% | 60.05% | 58.56% / 62.77% / 67.58% | 63.37% / 66.12% / 69.30% | 55.69% |
| Top 5 accuracy | 42.65% | 59.50% | 63.92% | 63.26% / 67.73% / 72.43% | 66.72% / 70.71% / 74.63% | 61.45% |
| Top 10 accuracy | 46.09% | 64.35% | 69.07% | 68.64% / 73.40% / 78.48% | 71.80% / 75.67% / 79.02% | 67.04% |

(a) Target dataset with 5000 random samples

| Metric | Jaccard Index | Edit Distance | R/O | MARIE (BioBERT + Edit Distance) | MARIE (BioBERT + R/O) | Embedding Vector Similarity |
|---|---|---|---|---|---|---|
| Top 1 accuracy | 33.02% | 41.93% | 47.38% | 43.57% / 45.69% / 48.15% | 49.33% / 50.47% / 51.22% | 33.51% |
| Top 3 accuracy | 40.04% | 53.71% | 58.24% | 56.75% / 60.33% / 64.66% | 61.48% / 64.12% / 67.27% | 52.97% |
| Top 5 accuracy | 41.90% | 58.13% | 62.85% | 61.22% / 65.84% / 70.39% | 65.63% / 69.02% / 72.14% | 58.30% |
| Top 10 accuracy | 45.34% | 63.54% | 67.96% | 67.10% / 71.74% / 76.33% | 70.34% / 73.77% / 77.27% | 64.26% |

(b) Target dataset with 10,000 random samples

| Metric | Jaccard Index | Edit Distance | R/O | MARIE (BioBERT + Edit Distance) | MARIE (BioBERT + R/O) | Embedding Vector Similarity |
|---|---|---|---|---|---|---|
| Top 1 accuracy | 28.58% | 34.80% | 40.76% | 35.31% / 37.63% / 38.32% | 41.82% / 43.31% / 41.99% | 26.37% |
| Top 3 accuracy | 37.17% | 47.49% | 52.59% | 49.21% / 52.02% / 55.69% | 54.54% / 56.78% / 58.01% | 43.85% |
| Top 5 accuracy | 39.18% | 51.76% | 57.12% | 53.48% / 57.38% / 61.31% | 59.16% / 62.28% / 64.37% | 49.47% |
| Top 10 accuracy | 42.19% | 57.55% | 63.00% | 60.22% / 64.40% / 68.24% | 65.12% / 68.16% / 70.31% | 56.32% |

(c) Target dataset with 50,000 random samples
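For readers who want to reproduce the string-matching baselines compared in the table above, the sketch below is a minimal Python illustration of the Jaccard index, a length-normalized edit-distance similarity, and Ratcliff/Obershelp (R/O) matching. It is not the paper's implementation; in particular, token-level Jaccard and the edit-distance normalization are assumptions made for this example.

```python
# Minimal sketch of the string-matching baselines from the table above.
from difflib import SequenceMatcher


def jaccard_similarity(a: str, b: str) -> float:
    """Jaccard index over whitespace-separated tokens (token level assumed)."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein (edit) distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost  # substitution
                            ))
        prev = curr
    return prev[-1]


def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to a 0..1 similarity (one possible normalization)."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def ro_similarity(a: str, b: str) -> float:
    """Ratcliff/Obershelp similarity as implemented by difflib's ratio()."""
    return SequenceMatcher(None, a, b).ratio()


# Toy usage with a local term and one candidate standard term.
local = "bm biopsy"
candidate = "bone marrow pathology biopsy report narrative"
print(jaccard_similarity(local, candidate),
      edit_similarity(local, candidate),
      ro_similarity(local, candidate))
```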
| Mapping Methods | Rank 1 | Rank 2 | Rank 3 |
|---|---|---|---|
| MARIE | cobicistat 150 mg/elvitegravir 150 mg/emtricitabine 200 mg/tenofovir alafenamide 10 mg oral tablet (RxNorm 1721612) (0.8538) | tenofovir alafenamide 25 mg oral tablet (RxNorm 1858261) (0.8536) | emtricitabine 200 mg/tenofovir disoproxil fumarate 300 mg oral tablet (RxNorm 476445) (0.8433) |
| R/O | atazanavir 300 mg/cobicistat 150 mg oral tablet (RxNorm 1601654) (0.4371) | amylases 10 mg/betaine 300 mg/bromelains 10 mg/papain 100 mg oral tablet (RxNorm Extension OMOP1092499) (0.4333) | tenofovir alafenamide 25 mg oral tablet (RxNorm 1858261) (0.4255) |
| BioBERT Embedding Vectors | cobicistat 150 mg/elvitegravir 150 mg/emtricitabine 200 mg/tenofovir alafenamide 10 mg oral tablet (RxNorm 1721612) (0.9895) | emtricitabine 200 mg/tenofovir disoproxil fumarate 300 mg oral tablet (RxNorm 476445) (0.9616) | tenofovir alafenamide 25 mg oral tablet (RxNorm 1858261) (0.9606) |

(a) tenofovir alafenamide/emtricitabine/elvitegravir/cobicistat 10 mg/200 mg/150 mg/150 mg tab (local concept)

| Mapping Methods | Rank 1 | Rank 2 | Rank 3 |
|---|---|---|---|
| MARIE | cortisone (mass/time) in 24 h urine (LOINC Lab Test 14044-2) (0.8700) | color of urine (LOINC Lab Test 5778-6) (0.8599) | glucose (mass/time) in 24 h urine (LOINC Lab Test 2351-5) (0.8475) |
| R/O | color of urine (LOINC Lab Test 5778-6) (0.5882) | cortisone (mass/time) in 24 h urine (LOINC Lab Test 14044-2) (0.5862) | creatinine (mass/time) in 24 h urine (LOINC Lab Test 2162-6) (0.5574) |
| BioBERT Embedding Vectors | somatotropin^15th specimen post xxx challenge (LOINC component) (0.9509) | insulin ab (titer) in serum (LOINC Lab Test 11087-4) (0.9484) | glucose (presence) in urine by test strip (LOINC 25428-4) (0.9481) |

(b) cortisol (24 h urine) (local concept)

| Mapping Methods | Rank 1 | Rank 2 | Rank 3 |
|---|---|---|---|
| MARIE | does manage ileostomy (SNOMED CT Clinical Finding 1073731000000109) (0.8203) | bone marrow pathology biopsy report narrative (LOINC 66119-9) (0.8143) | dome osteotomy (SNOMED CT Procedure 447761008) (0.8133) |
| R/O | pilopos (RxNorm Extension OMOP2011960) (0.5) | minor blood groups (SNOMED CT Procedure 143157006) (0.4440) | dome osteotomy (SNOMED CT Procedure 447761008) (0.4348) |
| BioBERT Embedding Vectors | her3 ag|tissue and smears (LOINC Hierarchy LP132424-5) (0.9449) | procedure on head (SNOMED CT Procedure 118690002) (0.9435) | cells.estrogen receptor|tissue and smears (LOINC Hierarchy LP262344-7) (0.9397) |

(c) bm biopsy (local concept)
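The scores in parentheses above come from different scales (string-matching ratios for R/O, cosine similarities for the embedding vectors). As a rough illustration of how a MARIE-style score could blend the two signals with a weight α and produce rankings like those shown, the hypothetical sketch below combines a string similarity with an embedding cosine similarity. The weighting form, the helper names (string_sim, embed), and the default α are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of an alpha-weighted combination of string matching and
# embedding similarity, used to rank candidate standard terms for a local term.
import numpy as np


def cosine_similarity(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def combined_score(local_term: str,
                   candidate: str,
                   string_sim,          # callable: (str, str) -> float, e.g. an R/O ratio
                   embed,               # callable: str -> np.ndarray, e.g. a BioBERT encoder
                   alpha: float = 0.5) -> float:
    """Weighted mix of string similarity and embedding cosine similarity."""
    s = string_sim(local_term, candidate)
    e = cosine_similarity(embed(local_term), embed(candidate))
    return alpha * s + (1.0 - alpha) * e


def rank_candidates(local_term, candidates, string_sim, embed, alpha=0.5, top_k=3):
    """Return the top_k candidates by combined score, as in the ranked table above."""
    scored = [(c, combined_score(local_term, c, string_sim, embed, alpha))
              for c in candidates]
    return sorted(scored, key=lambda x: x[1], reverse=True)[:top_k]


# Example usage (illustration only; biobert_embed would be a real encoder):
# from difflib import SequenceMatcher
# ranked = rank_candidates("bm biopsy", candidate_terms,
#                          string_sim=lambda a, b: SequenceMatcher(None, a, b).ratio(),
#                          embed=biobert_embed, alpha=0.4)
```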
| Number of Layers | Top 1 Accuracy | Top 3 Accuracy | Top 5 Accuracy | Top 10 Accuracy |
|---|---|---|---|---|
| 1 | 54.66% | 69.30% | 74.63% | 79.02% |
| 2 | 54.71% | 69.27% | 74.23% | 79.34% |
| 3 | 55.09% | 69.39% | 74.29% | 79.48% |
| 4 | 55.35% | 69.99% | 74.75% | 79.79% |

(a) Target dataset with 5000 random samples

| Number of Layers | Top 1 Accuracy | Top 3 Accuracy | Top 5 Accuracy | Top 10 Accuracy |
|---|---|---|---|---|
| 1 | 51.22% | 67.27% | 72.14% | 77.27% |
| 2 | 51.19% | 67.13% | 72.03% | 77.30% |
| 3 | 51.68% | 67.33% | 72.17% | 77.59% |
| 4 | 52.05% | 67.64% | 72.48% | 78.07% |

(b) Target dataset with 10,000 random samples

| Number of Layers | Top 1 Accuracy | Top 3 Accuracy | Top 5 Accuracy | Top 10 Accuracy |
|---|---|---|---|---|
| 1 | 41.99% | 58.01% | 64.37% | 70.31% |
| 2 | 42.08% | 58.01% | 64.12% | 70.31% |
| 3 | 42.33% | 58.41% | 64.17% | 70.48% |
| 4 | 43.31% | 58.87% | 64.55% | 71.02% |

(c) Target dataset with 50,000 random samples
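The table above varies the number of BioBERT layers used to build the term embeddings. One common way to do this with the Hugging Face transformers library is to average the last N hidden layers over all tokens of a term, as in the sketch below. The checkpoint name and the mean-pooling choice (including special tokens) are assumptions for illustration and may differ from the paper's exact setup.

```python
# Sketch: build a fixed-size term embedding by averaging the last N BioBERT layers.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "dmis-lab/biobert-base-cased-v1.1"  # a public BioBERT checkpoint (assumed)
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


def term_embedding(term: str, num_layers: int = 1) -> torch.Tensor:
    """Mean-pool the last `num_layers` hidden layers over all tokens of `term`."""
    inputs = tokenizer(term, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    # hidden_states: tuple of (num_hidden_layers + 1) tensors, each [1, seq_len, hidden]
    layers = torch.stack(outputs.hidden_states[-num_layers:])  # [N, 1, seq_len, hidden]
    return layers.mean(dim=(0, 2)).squeeze(0)                  # [hidden]


# Example: embeddings built from the last 1 vs. 4 layers, as compared in the table.
vec_1 = term_embedding("cortisol (24 h urine)", num_layers=1)
vec_4 = term_embedding("cortisol (24 h urine)", num_layers=4)
```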
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kim, H.K.; Choi, S.W.; Bae, Y.S.; Choi, J.; Kwon, H.; Lee, C.P.; Lee, H.-Y.; Ko, T. MARIE: A Context-Aware Term Mapping with String Matching and Embedding Vectors. Appl. Sci. 2020, 10, 7831. https://doi.org/10.3390/app10217831