MARIE: A Context-Aware Term Mapping with String Matching and Embedding Vectors
Abstract
Featured Application
1. Introduction
2. Methods
2.1. String Matching Methods
2.2. BERT
2.3. MARIE
3. Results
3.1. Dataset and Experiment Setups
3.2. Experiment Results
4. Discussion
4.1. Impact of α
4.2. Impact of BioBERT Layers
4.3. Limitations of MARIE
5. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
- Huang, C.C.; Lu, Z. Community challenges in biomedical text mining over 10 years: Success, failure and the future. Brief. Bioinform. 2016, 17, 132–144.
- Wei, C.H.; Leaman, R.; Lu, Z. Beyond accuracy: Creating interoperable and scalable text-mining web services. Bioinformatics 2016, 32, 1907–1910.
- Junge, A.; Jensen, L.J. CoCoScore: Context-aware co-occurrence scoring for text mining applications using distant supervision. Bioinformatics 2020, 36, 264–271.
- Aronson, A.R. Effective mapping of biomedical text to the UMLS Metathesaurus: The MetaMap program. In Proceedings of the AMIA Symposium; American Medical Informatics Association: Bethesda, MD, USA, 2001; p. 17.
- Bodenreider, O. The unified medical language system (UMLS): Integrating biomedical terminology. Nucleic Acids Res. 2004, 32, D267–D270.
- Rindflesch, T.C.; Fiszman, M. The interaction of domain knowledge and linguistic structure in natural language processing: Interpreting hypernymic propositions in biomedical text. J. Biomed. Inform. 2003, 36, 462–477.
- Leaman, R.; Islamaj Doğan, R.; Lu, Z. DNorm: Disease name normalization with pairwise learning to rank. Bioinformatics 2013, 29, 2909–2917.
- Xu, D.; Zhang, Z.; Bethard, S. A Generate-and-Rank Framework with Semantic Type Regularization for Biomedical Concept Normalization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8452–8464.
- Li, H.; Chen, Q.; Tang, B.; Wang, X.; Xu, H.; Wang, B.; Huang, D. CNN-based ranking for biomedical entity normalization. BMC Bioinform. 2017, 18, 79–86.
- Ji, Z.; Wei, Q.; Xu, H. BERT-based ranking for biomedical entity normalization. AMIA Summits Transl. Sci. Proc. 2020, 2020, 269.
- Schumacher, E.; Mulyar, A.; Dredze, M. Clinical Concept Linking with Contextualized Neural Representations. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, 5–10 July 2020; pp. 8585–8592.
- Dai, W.; Yang, Q.; Xue, G.R.; Yu, Y. Boosting for transfer learning. In Proceedings of the 24th International Conference on Machine Learning; Association for Computing Machinery: New York, NY, USA, 2007; pp. 193–200.
- Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2009, 22, 1345–1359.
- Ben-David, S.; Blitzer, J.; Crammer, K.; Kulesza, A.; Pereira, F.; Vaughan, J.W. A theory of learning from different domains. Mach. Learn. 2010, 79, 151–175.
- Donnelly, K. SNOMED-CT: The advanced terminology and coding system for eHealth. Stud. Health Technol. Inform. 2006, 121, 279.
- Dogan, R.I.; Lu, Z. An inference method for disease name normalization. In Proceedings of the 2012 AAAI Fall Symposium Series, Arlington, VA, USA, 2–4 November 2012.
- Kate, R.J. Normalizing clinical terms using learned edit distance patterns. J. Am. Med. Inform. Assoc. 2016, 23, 380–386.
- Turian, J.; Ratinov, L.; Bengio, Y. Word representations: A simple and general method for semi-supervised learning. In Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, Uppsala, Sweden, 11–16 July 2010; pp. 384–394.
- Collobert, R.; Weston, J.; Bottou, L.; Karlen, M.; Kavukcuoglu, K.; Kuksa, P. Natural language processing (almost) from scratch. J. Mach. Learn. Res. 2011, 12, 2493–2537.
- Wang, P.; Xu, B.; Xu, J.; Tian, G.; Liu, C.L.; Hao, H. Semantic expansion using word embedding clustering and convolutional neural network for improving short text classification. Neurocomputing 2016, 174, 806–814.
- Kim, H.K.; Kim, H.; Cho, S. Bag-of-concepts: Comprehending document representation through clustering words in distributed representation. Neurocomputing 2017, 266, 336–352.
- Tang, D.; Wei, F.; Yang, N.; Zhou, M.; Liu, T.; Qin, B. Learning sentiment-specific word embedding for twitter sentiment classification. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics, Baltimore, MD, USA, 23–25 June 2014; pp. 1555–1565.
- Nikfarjam, A.; Sarker, A.; O'Connor, K.; Ginn, R.; Gonzalez, G. Pharmacovigilance from social media: Mining adverse drug reaction mentions using sequence labeling with word embedding cluster features. J. Am. Med. Inform. Assoc. 2015, 22, 671–681.
- Xing, C.; Wang, D.; Liu, C.; Lin, Y. Normalized word embedding and orthogonal transform for bilingual word translation. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, CO, USA, 31 May–5 June 2015; pp. 1006–1011.
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv 2018, arXiv:1810.04805.
- Wagner, R.A.; Fischer, M.J. The string-to-string correction problem. J. ACM 1974, 21, 168–173.
- Hyyrö, H. Explaining and Extending the Bit-Parallel Approximate String Matching Algorithm of Myers; Technical Report A-2001-10; Department of Computer and Information Sciences, University of Tampere: Tampere, Finland, 2001.
- Jaccard, P. Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines. Bull. Soc. Vaud. Sci. Nat. 1901, 37, 241–272.
- Gower, J.C.; Warrens, M.J. Similarity, dissimilarity, and distance, measures of. In Wiley StatsRef: Statistics Reference Online; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2014; pp. 1–11.
- Black, P.E. Ratcliff/Obershelp pattern recognition. In Dictionary of Algorithms and Data Structures; National Institute of Standards and Technology: Gaithersburg, MD, USA, 2004.
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 5998–6008.
- Peters, M.E.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; Zettlemoyer, L. Deep contextualized word representations. arXiv 2018, arXiv:1802.05365.
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural Information Processing Systems, Montréal, QC, Canada, 8–13 December 2014; pp. 3104–3112.
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Dean, J. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv 2016, arXiv:1609.08144.
- Lee, J.; Yoon, W.; Kim, S.; Kim, D.; Kim, S.; So, C.H.; Kang, J. BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020, 36, 1234–1240.
- Niu, Y.; Qiao, C.; Li, H.; Huang, M. Word embedding based edit distance. arXiv 2018, arXiv:1810.10752.
- Mikolov, T.; Sutskever, I.; Chen, K.; Corrado, G.S.; Dean, J. Distributed representations of words and phrases and their compositionality. In Proceedings of the Advances in Neural Information Processing Systems, Lake Tahoe, NV, USA, 5–10 December 2013; pp. 3111–3119.
- Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
- Liu, S.; Ma, W.; Moore, R.; Ganesan, V.; Nelson, S. RxNorm: Prescription for electronic drug information exchange. IT Prof. 2005, 7, 17–23.
- Nelson, S.J.; Zeng, K.; Kilbourne, J.; Powell, T.; Moore, R. Normalized names for clinical drugs: RxNorm at 6 years. J. Am. Med. Inform. Assoc. 2011, 18, 441–448.
- Karadeniz, I.; Özgür, A. Linking entities through an ontology using word embeddings and syntactic re-ranking. BMC Bioinform. 2019, 20, 156.
| Metric | Jaccard Index | Edit Distance | R/O | MARIE (BioBERT + Edit Distance) | MARIE (BioBERT + R/O) | Embedding Vector Similarity |
|---|---|---|---|---|---|---|
| Top 1 accuracy | 33.68% | 44.71% | 49.44% | 46.49% / 48.29% / 51.16% | 51.25% / 53.48% / 54.66% | 35.83% |
| Top 3 accuracy | 40.64% | 55.49% | 60.05% | 58.56% / 62.77% / 67.58% | 63.37% / 66.12% / 69.30% | 55.69% |
| Top 5 accuracy | 42.65% | 59.50% | 63.92% | 63.26% / 67.73% / 72.43% | 66.72% / 70.71% / 74.63% | 61.45% |
| Top 10 accuracy | 46.09% | 64.35% | 69.07% | 68.64% / 73.40% / 78.48% | 71.80% / 75.67% / 79.02% | 67.04% |

(a) Target dataset with 5000 random samples

| Metric | Jaccard Index | Edit Distance | R/O | MARIE (BioBERT + Edit Distance) | MARIE (BioBERT + R/O) | Embedding Vector Similarity |
|---|---|---|---|---|---|---|
| Top 1 accuracy | 33.02% | 41.93% | 47.38% | 43.57% / 45.69% / 48.15% | 49.33% / 50.47% / 51.22% | 33.51% |
| Top 3 accuracy | 40.04% | 53.71% | 58.24% | 56.75% / 60.33% / 64.66% | 61.48% / 64.12% / 67.27% | 52.97% |
| Top 5 accuracy | 41.90% | 58.13% | 62.85% | 61.22% / 65.84% / 70.39% | 65.63% / 69.02% / 72.14% | 58.30% |
| Top 10 accuracy | 45.34% | 63.54% | 67.96% | 67.10% / 71.74% / 76.33% | 70.34% / 73.77% / 77.27% | 64.26% |

(b) Target dataset with 10,000 random samples

| Metric | Jaccard Index | Edit Distance | R/O | MARIE (BioBERT + Edit Distance) | MARIE (BioBERT + R/O) | Embedding Vector Similarity |
|---|---|---|---|---|---|---|
| Top 1 accuracy | 28.58% | 34.80% | 40.76% | 35.31% / 37.63% / 38.32% | 41.82% / 43.31% / 41.99% | 26.37% |
| Top 3 accuracy | 37.17% | 47.49% | 52.59% | 49.21% / 52.02% / 55.69% | 54.54% / 56.78% / 58.01% | 43.85% |
| Top 5 accuracy | 39.18% | 51.76% | 57.12% | 53.48% / 57.38% / 61.31% | 59.16% / 62.28% / 64.37% | 49.47% |
| Top 10 accuracy | 42.19% | 57.55% | 63.00% | 60.22% / 64.40% / 68.24% | 65.12% / 68.16% / 70.31% | 56.32% |

(c) Target dataset with 50,000 random samples
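The string-matching baselines compared above (Jaccard index, edit distance, and Ratcliff/Obershelp, abbreviated R/O) are all standard measures. A minimal sketch of each, using token-set Jaccard, Wagner–Fischer edit distance rescaled to a [0, 1] similarity, and Python's `difflib.SequenceMatcher` (which implements a Ratcliff/Obershelp-style ratio), might look like this; exact tokenization and normalization choices are assumptions, not the paper's implementation:

```python
from difflib import SequenceMatcher

def jaccard_index(a: str, b: str) -> float:
    """Jaccard index over lowercased word tokens: |A ∩ B| / |A ∪ B|."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not (sa | sb):
        return 0.0
    return len(sa & sb) / len(sa | sb)

def edit_distance(a: str, b: str) -> int:
    """Wagner–Fischer dynamic-programming (Levenshtein) edit distance."""
    prev = list(range(len(b) + 1))
    for i in range(1, len(a) + 1):
        cur = [i] + [0] * len(b)
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return prev[len(b)]

def edit_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to a [0, 1] similarity by the longer length."""
    if not a and not b:
        return 1.0
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))

def ro_similarity(a: str, b: str) -> float:
    """Ratcliff/Obershelp-style similarity via difflib's ratio()."""
    return SequenceMatcher(None, a, b).ratio()
```

All three return values in [0, 1] (after rescaling the edit distance), which makes them directly comparable and combinable with an embedding-based similarity.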
| Mapping Methods | Rank 1 | Rank 2 | Rank 3 |
|---|---|---|---|
| MARIE | cobicistat 150 mg/elvitegravir 150 mg/emtricitabine 200 mg/tenofovir alafenamide 10 mg oral tablet (RxNorm 1721612) (0.8538) | tenofovir alafenamide 25 mg oral tablet (RxNorm 1858261) (0.8536) | emtricitabine 200 mg/tenofovir disoproxil fumarate 300 mg oral tablet (RxNorm 476445) (0.8433) |
| R/O | atazanavir 300 mg/cobicistat 150 mg oral tablet (RxNorm 1601654) (0.4371) | amylases 10 mg/betaine 300 mg/bromelains 10 mg/papain 100 mg oral tablet (RxNorm Extension OMOP1092499) (0.4333) | tenofovir alafenamide 25 mg oral tablet (RxNorm 1858261) (0.4255) |
| BioBERT Embedding Vectors | cobicistat 150 mg/elvitegravir 150 mg/emtricitabine 200 mg/tenofovir alafenamide 10 mg oral tablet (RxNorm 1721612) (0.9895) | emtricitabine 200 mg/tenofovir disoproxil fumarate 300 mg oral tablet (RxNorm 476445) (0.9616) | tenofovir alafenamide 25 mg oral tablet (RxNorm 1858261) (0.9606) |

(a) tenofovir alafenamide/emtricitabine/elvitegravir/cobicistat 10 mg/200 mg/150 mg/150 mg tab (local concept)

| Mapping Methods | Rank 1 | Rank 2 | Rank 3 |
|---|---|---|---|
| MARIE | cortisone (mass/time) in 24 h urine (LOINC Lab Test 14044-2) (0.8700) | color of urine (LOINC Lab Test 5778-6) (0.8599) | glucose (mass/time) in 24 h urine (LOINC Lab Test 2351-5) (0.8475) |
| R/O | color of urine (LOINC Lab Test 5778-6) (0.5882) | cortisone (mass/time) in 24 h urine (LOINC Lab Test 14044-2) (0.5862) | creatinine (mass/time) in 24 h urine (LOINC Lab Test 2162-6) (0.5574) |
| BioBERT Embedding Vectors | somatotropin^15th specimen post xxx challenge (LOINC component) (0.9509) | insulin ab (titer) in serum (LOINC Lab Test 11087-4) (0.9484) | glucose (presence) in urine by test strip (LOINC 25428-4) (0.9481) |

(b) cortisol (24 h urine) (local concept)

| Mapping Methods | Rank 1 | Rank 2 | Rank 3 |
|---|---|---|---|
| MARIE | does manage ileostomy (SNOMED CT Clinical Finding 1073731000000109) (0.8203) | bone marrow pathology biopsy report narrative (LOINC 66119-9) (0.8143) | dome osteotomy (SNOMED CT Procedure 447761008) (0.8133) |
| R/O | pilopos (RxNorm Extension OMOP2011960) (0.5) | minor blood groups (SNOMED CT Procedure 143157006) (0.4440) | dome osteotomy (SNOMED CT Procedure 447761008) (0.4348) |
| BioBERT Embedding Vectors | her3 ag|tissue and smears (LOINC Hierarchy LP132424-5) (0.9449) | procedure on head (SNOMED CT Procedure 118690002) (0.9435) | cells.estrogen receptor|tissue and smears (LOINC Hierarchy LP262344-7) (0.9397) |

(c) bm biopsy (local concept)
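The examples above show MARIE blending a string-matching score with a BioBERT embedding similarity, so that it can recover both lexical matches (panel a) and cases where neither signal alone ranks the correct concept first. The paper's exact scoring code is not reproduced in this extract; a minimal sketch of such a weighted combination, with `alpha` as the mixing weight between string similarity and cosine similarity of embedding vectors (names and signature are illustrative assumptions), might look like this:

```python
import numpy as np

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def rank_candidates(string_sims, embed_sims, names, alpha=0.5, top_k=3):
    """Rank candidate standard terms by a weighted blend of a string
    similarity and an embedding similarity, both assumed to lie in [0, 1].

    combined_i = alpha * string_sim_i + (1 - alpha) * embed_sim_i
    """
    combined = [alpha * s + (1 - alpha) * e
                for s, e in zip(string_sims, embed_sims)]
    order = sorted(range(len(names)), key=lambda i: combined[i], reverse=True)
    return [(names[i], round(combined[i], 4)) for i in order[:top_k]]
```

With `alpha = 1` this degenerates to pure string matching and with `alpha = 0` to pure embedding similarity, which is one way to read the failure modes of the R/O-only and BioBERT-only rows above.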
| Number of Layers | Top 1 Accuracy | Top 3 Accuracy | Top 5 Accuracy | Top 10 Accuracy |
|---|---|---|---|---|
| 1 | 54.66% | 69.30% | 74.63% | 79.02% |
| 2 | 54.71% | 69.27% | 74.23% | 79.34% |
| 3 | 55.09% | 69.39% | 74.29% | 79.48% |
| 4 | 55.35% | 69.99% | 74.75% | 79.79% |

(a) Target dataset with 5000 random samples

| Number of Layers | Top 1 Accuracy | Top 3 Accuracy | Top 5 Accuracy | Top 10 Accuracy |
|---|---|---|---|---|
| 1 | 51.22% | 67.27% | 72.14% | 77.27% |
| 2 | 51.19% | 67.13% | 72.03% | 77.30% |
| 3 | 51.68% | 67.33% | 72.17% | 77.59% |
| 4 | 52.05% | 67.64% | 72.48% | 78.07% |

(b) Target dataset with 10,000 random samples

| Number of Layers | Top 1 Accuracy | Top 3 Accuracy | Top 5 Accuracy | Top 10 Accuracy |
|---|---|---|---|---|
| 1 | 41.99% | 58.01% | 64.37% | 70.31% |
| 2 | 42.08% | 58.01% | 64.12% | 70.31% |
| 3 | 42.33% | 58.41% | 64.17% | 70.48% |
| 4 | 43.31% | 58.87% | 64.55% | 71.02% |

(c) Target dataset with 50,000 random samples
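The table above varies how many BioBERT hidden layers contribute to each term's embedding vector, with accuracy improving modestly as more layers are used. This extract does not show how those layers are combined; one common scheme, sketched here purely as an illustrative assumption, is to average the last `num_layers` hidden states and then mean-pool over tokens to get a single term vector:

```python
import numpy as np

def term_embedding(hidden_states: np.ndarray, num_layers: int = 4) -> np.ndarray:
    """Pool a (layers, tokens, dim) stack of per-layer token vectors
    into one term vector: average the last `num_layers` layers,
    then mean-pool over the token axis.
    """
    layer_avg = hidden_states[-num_layers:].mean(axis=0)  # (tokens, dim)
    return layer_avg.mean(axis=0)                         # (dim,)
```

Other pooling choices (concatenation, weighted sums, [CLS]-token vectors) are equally plausible readings of "number of layers"; the sketch only fixes the shapes involved.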
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Kim, H.K.; Choi, S.W.; Bae, Y.S.; Choi, J.; Kwon, H.; Lee, C.P.; Lee, H.-Y.; Ko, T. MARIE: A Context-Aware Term Mapping with String Matching and Embedding Vectors. Appl. Sci. 2020, 10, 7831. https://doi.org/10.3390/app10217831