Open Access Article

Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language

1 Romanian Academy Research Institute for Artificial Intelligence, 13 Calea 13 Septembrie, Bucharest 050711, Romania
2 National Institute of Diabetes and Metabolic Diseases “N.C. Paulescu”, 5-7 Ion Movilă Street, Bucharest 020475, Romania
* Authors to whom correspondence should be addressed.
These authors contributed equally to this work.
Received: 29 September 2018 / Revised: 29 October 2018 / Accepted: 16 November 2018 / Published: 23 November 2018
(This article belongs to the Special Issue Curative Power of Medical Data)

Abstract

Gold standard corpora (GSCs) are essential for the supervised training and evaluation of systems that perform natural language processing (NLP) tasks. Currently, most of the resources used in biomedical NLP tasks are in English. Little effort has been reported for other languages, including Romanian, and thus access to such language resources is poor. In this paper, we present the construction of the first morphologically and terminologically annotated biomedical corpus of the Romanian language (MoNERo), meant to serve as a gold standard for biomedical part-of-speech (POS) tagging and biomedical named entity recognition (bioNER). It contains 14,012 tokens distributed across three medical subdomains (cardiology, diabetes and endocrinology), extracted from books, journals and blog posts. The corpus was automatically annotated with POS tags from a Romanian tag set of 715 labels, and this automatic annotation was then manually checked. For bioNER, four entity types (diseases, anatomy, procedures, and chemicals and drugs) were manually annotated, with a Cohen's kappa coefficient of 92.8%; this annotation revealed 1877 medical named entities. The corpus is publicly available and can be used to facilitate the development of NLP algorithms for the Romanian language.
Keywords: corpus; biomedical; Romanian; part-of-speech tags; named entities
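
The abstract reports inter-annotator agreement as a Cohen's kappa coefficient. As an illustration only, the following minimal Python sketch shows how such a coefficient can be computed from two annotators' token-level labels. The label names (`DISO`, `ANAT`, `PROC`, `CHEM`, `O`) and the toy label sequences are hypothetical and are not taken from the corpus; the authors' actual annotation format and tooling are not described on this page.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal proportions, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level labels from two annotators (not real corpus data).
annotator_a = ["DISO", "ANAT", "O", "O", "CHEM", "PROC", "O", "DISO"]
annotator_b = ["DISO", "ANAT", "O", "O", "CHEM", "O", "O", "DISO"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.3f}")
```

Kappa corrects raw agreement for the agreement expected by chance, which matters in entity annotation because most tokens typically fall outside any entity and would inflate a plain percentage-agreement figure.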
This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).

MDPI and ACS Style

Mitrofan, M.; Barbu Mititelu, V.; Mitrofan, G. Towards the Construction of a Gold Standard Biomedical Corpus for the Romanian Language. Data 2018, 3, 53.

Data EISSN 2306-5729 Published by MDPI AG, Basel, Switzerland