Automatic Classification and Visualization of Text Data on Rare Diseases
Abstract
1. Introduction
1.1. Motivation
1.2. Related Work
1.3. Contributions
2. Materials and Methods
2.1. Preliminary Analysis of Rare Diseases in Research and the News
2.2. Rare Disease Terms
2.3. Dataset
- If an article carried any MeSH heading from the list of 709 rare disease terms (see Section 2.2), it was assigned to the rare disease category;
- Otherwise, if it carried any MeSH heading in the Diseases (C) tree or the Mental Disorders (F03) tree, it was assigned to the non-rare disease category;
- Otherwise, it was assigned to the “Other” category.
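
The three rules above form an ordered decision list, which can be sketched in a few lines of Python. This is only an illustration of the rule order, not the authors' code: the function names, the category strings, and the toy lookup from heading to MeSH tree numbers are all assumptions made for the example.

```python
from typing import Dict, Iterable, List, Set

# Category names are placeholders chosen for this sketch.
RARE = "rare disease"
NON_RARE = "non-rare disease"
OTHER = "other"


def in_disease_trees(tree_number: str) -> bool:
    """True for MeSH tree numbers under Diseases (C...) or Mental Disorders (F03)."""
    return (
        tree_number.startswith("C")
        or tree_number == "F03"
        or tree_number.startswith("F03.")
    )


def assign_category(
    headings: Iterable[str],
    rare_terms: Set[str],
    tree_numbers: Dict[str, List[str]],
) -> str:
    """Apply the three rules in order; the first rule that matches decides."""
    headings = set(headings)
    # Rule 1: any heading on the rare disease list takes priority.
    if headings & rare_terms:
        return RARE
    # Rule 2: otherwise, any heading located in the Diseases or F03 trees.
    for heading in headings:
        if any(in_disease_trees(t) for t in tree_numbers.get(heading, [])):
            return NON_RARE
    # Rule 3: everything else.
    return OTHER


# Toy inputs for illustration only (not the paper's term list or data).
rare_terms = {"Rett Syndrome"}
tree_numbers = {
    "Rett Syndrome": ["C10.574.500.775", "F03.625.374"],
    "Asthma": ["C08.127.108"],
    "Hospitals": ["N02.278.421"],
}

print(assign_category(["Rett Syndrome", "Asthma"], rare_terms, tree_numbers))
print(assign_category(["Asthma"], rare_terms, tree_numbers))
print(assign_category(["Hospitals"], rare_terms, tree_numbers))
```

Because the rare disease check runs first, an article indexed with both a rare and a common disease heading is labeled as rare disease.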
2.4. Text Classification Model
2.5. Metrics
3. Results
4. Community-Driven Exploration of Rare Diseases Data
5. Discussion
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| ML | Machine Learning |
| NDDs | Neurodevelopmental Disorders |





| Rare Disease | Languages | News Articles | Scientific Articles | MeSH Year | 
|---|---|---|---|---|
| Kleefstra syndrome | 9 | 105 | 127 | 2012 | 
| Angelman syndrome | 35 | 487 | 2036 | 1992 | 
| Dravet syndrome | 18 | 676 | 1555 | 1976 | 
| Cornelia de Lange syndrome | 19 | 136 | 855 | 1999 | 
| Phelan–McDermid syndrome | 14 | 55 | 337 | 2010 | 
| Fragile X syndrome | 32 | 1091 | 7603 | 1982 | 
| Pitt–Hopkins syndrome | 11 | 92 | 180 | 2010 | 
| Prader–Willi syndrome | 38 | 1296 | 4315 | 1976 | 
| FOXG1 syndrome | 4 | 22 | 51 | 1994 | 
| Koolen–de Vries syndrome | 5 | 6 | 50 | 2012 | 
| Wiedemann–Steiner syndrome | 5 | 10 | 79 | 2009 | 
| Kabuki syndrome | 14 | 67 | 592 | 2010 | 
| Rett syndrome | 39 | 1280 | 4381 | 1989 | 
| SYNGAP1 syndrome | 3 | 75 | 63 | 2004 | 
| SATB2 syndrome | 5 | 79 | 92 | 2007 | 
| CTNNB1 syndrome | 14 | 289 | 3 | 2005 | 
| Ontology | MeSH Terms | 
|---|---|
| GARD | 1265 | 
| ORDO | 1052 | 
| Mondo | 637 | 
| Wikidata | 476 | 
| Subset | Samples per Class | Total | 
|---|---|---|
| Training | 20,000 | 60,000 | 
| Validation | 2,000 | 6,000 | 
| Test | 2,000 | 6,000 | 
| Total | 24,000 | 72,000 | 
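
A class-balanced split like the one in the table above (the same number of samples per class in each subset) could be produced along the following lines. The source does not describe the sampling procedure, so the uniform random sampling, the seed, and the function name `balanced_split` are assumptions of this sketch; only the per-class sizes come from the table.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple

# Per-class subset sizes taken from the table.
SPLIT_SIZES = {"train": 20_000, "validation": 2_000, "test": 2_000}


def balanced_split(
    labeled_ids: Iterable[Tuple[Hashable, str]],
    split_sizes: Dict[str, int] = SPLIT_SIZES,
    seed: int = 0,
) -> Dict[str, List[Tuple[Hashable, str]]]:
    """Draw the same number of items per class for every subset, without overlap."""
    by_class = defaultdict(list)
    for item_id, label in labeled_ids:
        by_class[label].append(item_id)

    rng = random.Random(seed)
    needed = sum(split_sizes.values())
    splits = {name: [] for name in split_sizes}
    for label in sorted(by_class):
        ids = by_class[label]
        if len(ids) < needed:
            raise ValueError(f"class {label!r} has {len(ids)} items, needs {needed}")
        # Sample once per class, then cut the sample into consecutive subsets.
        sample = rng.sample(ids, needed)
        start = 0
        for name, size in split_sizes.items():
            splits[name].extend((i, label) for i in sample[start:start + size])
            start += size
    return splits


# Small demonstration with toy sizes and synthetic identifiers.
toy = [(f"{label}-{i}", label) for label in ("rare", "non_rare", "other") for i in range(10)]
demo = balanced_split(toy, {"train": 4, "validation": 1, "test": 1}, seed=1)
print({name: len(items) for name, items in demo.items()})
```

With the default sizes, each class contributes 24,000 items, giving 72,000 in total, which matches the totals row.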
| Class | Samples | 
|---|---|
| Rare Diseases | 41 | 
| Non-Rare Diseases | 72 | 
| Other | 27 | 
| Total | 140 | 
| Hyperparameter | Value | 
|---|---|
| Batch Size | 32 | 
| Learning Rate | |
| Max Epochs | 10 | 
| Class | Precision | Recall | F1 | 
|---|---|---|---|
| Rare Diseases | 0.88 | 0.88 | 0.88 | 
| Non-rare Diseases | 0.82 | 0.82 | 0.82 | 
| Other | 0.89 | 0.89 | 0.89 | 
| Averages | | | |
| Micro | 0.86 | 0.86 | 0.86 | 
| Macro | 0.86 | 0.86 | 0.86 | 
| Averages without “Other” | | | |
| Micro | 0.85 | 0.85 | 0.85 | 
| Macro | 0.85 | 0.85 | 0.85 | 
| Class | Precision | Recall | F1 | 
|---|---|---|---|
| Rare Diseases | 0.69 | 0.54 | 0.60 | 
| Non-rare Diseases | 0.75 | 0.79 | 0.77 | 
| Other | 0.59 | 0.70 | 0.64 | 
| Averages | | | |
| Micro | 0.70 | 0.70 | 0.70 | 
| Macro | 0.68 | 0.68 | 0.67 | 
| Averages without “Other” | | | |
| Micro | 0.73 | 0.70 | 0.71 | 
| Macro | 0.72 | 0.66 | 0.68 | 
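
The micro and macro rows in the results tables follow the standard definitions, which the sketch below computes from per-class true positive, false positive, and false negative counts. The counts are toy values invented for illustration, and the helper names are assumptions; the source does not state which tooling produced its figures.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class Counts:
    tp: int  # true positives for the class
    fp: int  # false positives for the class
    fn: int  # false negatives for the class


def prf(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Precision, recall, and F1 from raw counts (0.0 when undefined)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def macro_average(counts: Dict[str, Counts], labels: Iterable[str]) -> Tuple[float, ...]:
    """Unweighted mean of the per-class precision, recall, and F1."""
    scores = [prf(counts[l].tp, counts[l].fp, counts[l].fn) for l in labels]
    return tuple(sum(s[i] for s in scores) / len(scores) for i in range(3))


def micro_average(counts: Dict[str, Counts], labels: Iterable[str]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 computed on counts pooled over the selected classes."""
    selected = [counts[l] for l in labels]
    return prf(
        sum(c.tp for c in selected),
        sum(c.fp for c in selected),
        sum(c.fn for c in selected),
    )


# Toy counts (invented for illustration; not the paper's results).
toy = {
    "rare": Counts(tp=8, fp=2, fn=2),
    "non_rare": Counts(tp=6, fp=3, fn=4),
    "other": Counts(tp=5, fp=3, fn=2),
}

print(micro_average(toy, ["rare", "non_rare", "other"]))
print(micro_average(toy, ["rare", "non_rare"]))
print(macro_average(toy, ["rare", "non_rare"]))
```

When every class is included in a single-label task, each misclassification is one false positive and one false negative, so micro precision, recall, and F1 coincide. Restricting the average to a subset of classes, as in the rows without “Other”, breaks that symmetry, which is why those micro values can differ from one another.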
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Rei, L.; Pita Costa, J.; Zdolšek Draksler, T. Automatic Classification and Visualization of Text Data on Rare Diseases. J. Pers. Med. 2024, 14, 545. https://doi.org/10.3390/jpm14050545