Data-Driven Lexical Normalization for Medical Social Media†
1 Leiden Institute for Advanced Computer Science, Leiden University, 2333 CA Leiden, The Netherlands
2 Department of Biomedical Informatics, Emory University, Atlanta, GA 30322, USA
* Author to whom correspondence should be addressed.
† This paper is an extended version of our paper published in Social Media Mining for Health Applications workshop (SMM4H), ACL 2019.
Multimodal Technol. Interact. 2019, 3(3), 60; https://doi.org/10.3390/mti3030060
Received: 30 June 2019 / Revised: 9 August 2019 / Accepted: 13 August 2019 / Published: 20 August 2019
(This article belongs to the Special Issue Text Mining in Complex Domains)
In the medical domain, user-generated social media text is increasingly used as a valuable
complementary knowledge source to scientific medical literature. The extraction of this knowledge is
complicated by colloquial language use and misspellings. However, lexical normalization of such
data has not been addressed effectively. This paper presents a data-driven lexical normalization
pipeline with a novel spelling correction module for medical social media. Our method significantly
outperforms state-of-the-art spelling correction methods and can detect mistakes with an F1 of 0.63
despite extreme imbalance in the data. We also present the first corpus for spelling mistake detection
and correction in a medical patient forum.
Keywords:
spelling correction; social media; health; natural language processing; lexical normalization
This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
MDPI and ACS Style
Dirkson, A.; Verberne, S.; Sarker, A.; Kraaij, W. Data-Driven Lexical Normalization for Medical Social Media. Multimodal Technol. Interact. 2019, 3, 60. https://doi.org/10.3390/mti3030060
AMA Style
Dirkson A, Verberne S, Sarker A, Kraaij W. Data-Driven Lexical Normalization for Medical Social Media. Multimodal Technologies and Interaction. 2019; 3(3):60. https://doi.org/10.3390/mti3030060
Chicago/Turabian Style
Dirkson, Anne; Verberne, Suzan; Sarker, Abeed; Kraaij, Wessel. 2019. "Data-Driven Lexical Normalization for Medical Social Media" Multimodal Technol. Interact. 3, no. 3: 60. https://doi.org/10.3390/mti3030060
Note that from the first issue of 2016, MDPI journals use article numbers instead of page numbers.