An Arabic Dataset for Disease Named Entity Recognition with Multi-Annotation Schemes

Abstract: This article outlines a novel data descriptor that provides the Arabic natural language processing community with a dataset dedicated to named entity recognition tasks for diseases. The dataset comprises more than 60 thousand words, which were annotated manually by two independent annotators using the inside–outside (IO) annotation scheme. To ensure the reliability of the annotation process, the inter-annotator agreement rate was calculated, and it scored 95.14%. Due to the lack of research efforts in the literature dedicated to studying Arabic multi-annotation schemes, a distinguishing and novel aspect of this dataset is the inclusion of six more annotation schemes that will bridge the gap by allowing researchers to explore and compare the effects of these schemes on the performance of Arabic named entity recognizers. These annotation schemes are IOE, IOB, BIES, IOBES, IE, and BI. Additionally, five linguistic features, including part-of-speech tags, stopwords, gazetteers, lexical markers, and the presence of the definite article, are provided for each record in the dataset.


Summary
Named entity recognition (NER) is a prominent subfield of natural language processing (NLP). The objective of NER is to recognize specific and predefined entities in a text. In the last decade, Arabic NER has gained considerable interest and focus from the research community due to the popularity of the language, as it is the native tongue for more than 325 million people [1]. While a substantial amount of research work has been dedicated to different domains, such as recognizing people's names, locations, crimes, organization names, and so on, few research studies have been dedicated to the medical domain. This shortcoming can be attributed to the lack of digital Arabic resources, such as datasets [2,3].
While there is a fair amount of research studying multi-annotation schemes for the task of NER in languages such as English, Spanish, Dutch, Czech [4], Greek [5], Russian [6], and Punjabi [7], the Arabic language suffers from a lack of efforts in this domain. This work is an attempt to rectify this shortcoming by providing the Arabic NLP research community with a dataset designated for NER tasks in the medical domain.
The dataset, provided as supplementary material, was annotated independently by two annotators, and the inter-annotator agreement score was calculated. Another contribution of this work is providing this dataset with different annotation schemes, which will allow researchers to study the effect of using different annotation schemes on NER tasks. Seven well-known annotation schemes were used to annotate the dataset: IO, IOE, IOB, BIES, IOBES, IE, and BI.

Data Description
This dataset consists of 62,506 records, and each record represents a single word/token from our corpus. Each word is described in terms of six features/columns: annotation labels, part-of-speech tags, stopwords, gazetteers, lexical markers, and definiteness. Each column is described as follows:

Annotation Labels
This column determines whether the word of interest is a disease entity or not. Two labels were used to annotate the words. The label I is used to tag disease entities, whereas the label O is used to tag irrelevant words. This annotation mechanism is well known in the literature and is referred to as the IO annotation scheme. However, other annotation schemes are used in the literature, and each has advantages and disadvantages. In this work, we annotated our data using seven different annotation schemes, resulting in seven files, each corresponding to a unique annotation scheme. Further details regarding the annotation process and schemes are given in Section 3.3. In addition, a sample sentence of the dataset is presented with each annotation scheme in Figure 1. The literal translation of the sample sentence is "Leukemia (White blood cells cancer) is considered one of the most common kinds". It is worth noting that the Arabic language is written from right to left. Therefore, the annotation tags are ordered accordingly.
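Under the IO scheme, tagging reduces to a per-token binary decision. The sketch below uses hypothetical English stand-ins for the Arabic tokens; it also illustrates the scheme's main limitation, namely that two adjacent entities would merge into a single span:

```python
# Minimal sketch of IO tagging: each token is labeled I (inside a
# disease entity) or O (outside). Tokens are hypothetical English
# placeholders standing in for the Arabic words.
tokens = ["considered", "leukemia", "blood", "cancer", "one", "most", "common"]
tags   = ["O",          "I",        "I",     "I",      "O",   "O",    "O"]

def io_entities(tokens, tags):
    """Group consecutive I-tagged tokens into entity spans."""
    entities, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "I":
            current.append(tok)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities

print(io_entities(tokens, tags))  # → ['leukemia blood cancer']
```

Because consecutive I tags always collapse into one span, the IO scheme alone cannot delimit two entities that directly follow each other, which motivates the boundary-aware schemes listed above.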

Part-of-Speech Tags
This column represents the part-of-speech (POS) tags of each word. The POS tags are labels assigned to a word to identify its part of speech (e.g., noun, pronoun, verb, etc.) in a given context. The POS tags are prevalently used in NER tasks due to their ability to reveal the grammatical structure of the sentence. Furthermore, according to [8], a strong correlation exists between POS tags and NER for the Arabic language.

Stopwords
This column indicates whether the word of interest exists in the predefined list of stopwords. Stopwords are words that carry little information for the given task; they are primarily conjunctions, prepositions, pronouns, demonstratives, and so on [9]. In the natural language processing literature, lists of stopwords are commonly used for several tasks, including NER. The dataset includes the list of 198 stopwords that was used in this work.

Gazetteers
This column specifies whether the given word is listed in the disease entity gazetteer, which is a dictionary that collects frequently used entities. Gazetteers play a significant role in improving the performance of NE recognizers [10], typically improving precision at the expense of recall. However, creating gazetteers from scratch can be a challenging and time-consuming process [11].
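A sketch of how such a binary lookup column can be derived is shown below; the gazetteer entries are hypothetical English stand-ins for the Arabic terms, and the stopword column can be computed the same way against the stopword list:

```python
# Sketch of the binary gazetteer feature: 1 if the token appears in
# the disease gazetteer, 0 otherwise. Entries are hypothetical
# English stand-ins for the Arabic gazetteer terms.
disease_gazetteer = {"leukemia", "diabetes", "anemia"}

def gazetteer_feature(token):
    return 1 if token.lower() in disease_gazetteer else 0

print(gazetteer_feature("Leukemia"))  # → 1
print(gazetteer_feature("common"))    # → 0
```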

Lexical Marker Lists
This column determines whether the word exists in the lexical marker list. Lexical markers, also known as lexical triggers, are words or parts of a word that usually exist in the vicinity of the named entity and can help to recognize an entity. Analyzing the context of NEs can reveal the existence of such markers [12].
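Since lexical markers are defined by their occurrence in the vicinity of an entity, the corresponding feature is naturally a context-window check. A minimal sketch (the marker list and tokens are hypothetical English stand-ins for the Arabic data):

```python
# Sketch of a lexical-marker feature: flag a token if any marker word
# occurs within a small context window around it.
markers = {"disease", "syndrome", "infection"}

def marker_nearby(tokens, i, window=2):
    """True if a lexical marker appears within `window` tokens of position i."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    neighbors = tokens[lo:i] + tokens[i + 1:hi]
    return any(t in markers for t in neighbors)

tokens = ["chronic", "kidney", "disease", "affects", "many", "people", "worldwide"]
print(marker_nearby(tokens, 1))  # "kidney" has marker "disease" nearby → True
print(marker_nearby(tokens, 6))  # "worldwide" has no marker nearby    → False
```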

Definiteness (Existence of 'AL')
This column indicates the presence of the definite article at the beginning of each word. This article can be translated directly to mean "the" in English; however, in Arabic, it appears as a prefix attached to nouns. The column has four possible values, as shown in Table 1.
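As a simplified illustration, the prefix can be checked directly on the surface form. Note this sketch only distinguishes presence/absence of the article, whereas the actual dataset column distinguishes the four values of Table 1:

```python
# Simplified sketch: detect whether a word starts with the Arabic
# definite article "ال" (AL). The real dataset column encodes four
# cases (Table 1); this only checks for the prefix itself.
def has_definite_article(word: str) -> bool:
    return word.startswith("\u0627\u0644")  # "ال"

print(has_definite_article("السرطان"))  # "the cancer" → True
print(has_definite_article("سرطان"))    # "cancer"     → False
```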

Methods
This section describes the main methods used to generate the dataset in its final form. The process includes collecting and preprocessing the data, data labeling, and feature engineering.

Data Collection
King Abdullah Bin Abdulaziz Arabic Health Encyclopedia (KAAHE) [13] was the source for building this dataset. This encyclopedia is considered a reliable provider of health information. The administration and finance of KAAHE are provided by the Ministry of National Guard Health Affairs and King Saud bin Abdulaziz University for Health Sciences in Saudi Arabia. It follows the Executive Regulations of the Electronic Publishing Activity of the Ministry of Media in Saudi Arabia. The content of KAAHE was originally provided by the UK National Health Service (NHS). The data consist of 27 Arabic medical articles, totaling around 50,000 words.

Data Preprocessing
To prepare our raw dataset for further processing, a crucial step to consider is data preprocessing. This step includes data cleansing and tokenization. During the data cleansing step, irrelevant information in the articles, such as hypertext links, images, and so on, was excluded from the dataset. Only the text of interest was included in the body of the dataset. Afterwards, the data were tokenized using the AMIRA tool [14], a natural language processing tool devoted to the Arabic language that provides several NLP functionalities, such as a POS tagger, clitic tokenizer, and base phrase chunker. Moreover, AMIRA has several profiles for carrying out the tokenization process. In our dataset, all prefixes except the definite article were tokenized. The definite article was exempted from tokenization because it is used in later stages to engineer the definiteness column in the final dataset. After the tokenization step, the dataset size increased from around 50,000 words to around 62,500 tokens.

Data Annotation
Data annotation is the process of tagging the data into predefined categories. It is an important process, especially for supervised learning, and can be done for different types of datasets, such as image, video, and textual datasets. Data annotation plays a crucial role in evaluating the performance of supervised models because it provides the ground-truth target labels. According to [15], the process of annotating any dataset should be conducted by at least two independent annotators to make it possible to validate the reliability of the annotation process. Several statistical metrics are used to measure the reliability of annotator labeling, and a well-known measure is Cohen's kappa metric [16].
Our dataset was annotated independently by two annotators, and each word was classified as either a disease entity or not. To check the reliability of the annotation process, we adopted Cohen's kappa statistic. The Cohen's kappa score was 95.14%, which indicates a high agreement between the annotators and is well above the minimum acceptable score of 80% argued for by many researchers in the literature [17]. Although the agreement score is high, we analyzed the differences between the two annotated datasets and found that the disparity can primarily be attributed to the classification of adjectives within disease names. For example, in a sentence translated as "Chronic Lymphocytic Leukemia," the adjective "chronic" was considered part of the NE by the first annotator, while the second annotator decided to exclude it. In the NER literature, several well-known annotation schemes are used to perform the annotation task. We decided to annotate our dataset using seven frequently used annotation schemes, which are listed and described in Table 2.
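For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance. A minimal sketch of the computation (the toy IO labels below are hypothetical, not the actual dataset annotations):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa: (observed - chance agreement) / (1 - chance agreement)."""
    assert len(a) == len(b)
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n                  # observed agreement
    ca, cb = Counter(a), Counter(b)
    pe = sum(ca[l] * cb[l] for l in set(a) | set(b)) / n ** 2   # chance agreement
    return (po - pe) / (1 - pe)

# Hypothetical toy IO labels from two annotators:
ann1 = ["O", "I", "I", "O", "O", "I", "O", "O"]
ann2 = ["O", "I", "O", "O", "O", "I", "O", "O"]
print(round(cohens_kappa(ann1, ann2), 3))  # → 0.714
```

Note that a raw percentage agreement would be 87.5% here, while kappa is lower because much of that agreement is expected by chance given the label distribution.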
The process of annotating our dataset using the aforementioned schemes is illustrated in Figure 2. The annotation step was performed on the preprocessed data. At the beginning, each annotator independently and manually labeled the dataset using the IO annotation scheme due to its simplicity. However, the IO scheme cannot determine the boundaries of consecutive entities. Therefore, the data were annotated automatically using the IOB scheme, except for the consecutive entities, which the annotators marked manually. Afterwards, the annotation process was performed automatically for the rest of the schemes using a Python script developed especially for this purpose. The script relies on identifying the named entity boundaries in order to generate the subsequent annotation schemes; based on the rules designated for each annotation scheme, the script generates the required tags. For example, the code listing shown in Listing 1 highlights some of the important rules and steps performed to generate the IOBES annotation scheme.
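While Listing 1 itself is not reproduced here, the IOB-to-IOBES conversion rules it describes can be sketched as follows; this is a minimal reconstruction of the standard conversion, not the authors' actual script:

```python
def iob_to_iobes(tags):
    """Convert IOB tags to IOBES: a B not followed by I is a single-token
    entity (S); an I not followed by I ends its entity (E)."""
    out = []
    for i, tag in enumerate(tags):
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        if tag == "B":
            out.append("S" if nxt != "I" else "B")
        elif tag == "I":
            out.append("E" if nxt != "I" else "I")
        else:
            out.append("O")
    return out

print(iob_to_iobes(["O", "B", "I", "I", "O", "B", "O"]))
# → ['O', 'B', 'I', 'E', 'O', 'S', 'O']
```

The same boundary information supports the other schemes as well, e.g., dropping the B/S distinction yields IOE, and dropping E/S yields IOB again.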

Feature Engineering
In this section, we describe the method used to derive the features/columns that were mentioned in Section 2. The MADAMIRA tool was used to obtain the POS tags and the definiteness columns. MADAMIRA [18] is an Arabic morphological analyzer developed by combining two previous NLP tools: MADA [19] and AMIRA [14].
Regarding the lexical markers, stopwords, and gazetteers columns, several statistical methods were used to analyze the dataset, such as frequency, concordance, and n-gram analyses. These methods are essential for exploring and understanding the context in which the entities exist, and they allowed us to derive the lexical markers, stopwords, and gazetteers.
Frequency analysis is considered one of the most prominent techniques used to study any corpus. Analyzing the words in the dataset based on their frequency can reveal the most important keywords present in the dataset. However, the most frequent words tend to be stopwords, which are less informative; we used this observation to derive the stopword list used in this work. Removing the stopwords from the frequency list then revealed the informative keywords in this domain and assisted in the gazetteer creation process. Next, concordance analysis was carried out based on these keywords. Concordance analysis allows us to explore the context of a given word/named entity and the structures in which it frequently appears; it aided in extracting the most common verbs and nouns in the surrounding context, which have the potential of being lexical markers. Another statistical technique used in this work is n-gram analysis, whose basic idea is the probability of a given set of words appearing together in the text. This technique has several applications even outside of NLP, and it gives an additional perspective for understanding and discovering the structure of the text. It was helpful in this work specifically for recognizing disease entities, as they are usually composed of more than one word appearing together.
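The n-gram analysis described above can be sketched with a simple counter: frequent bigrams hint at multi-word disease names. The tokens below are a hypothetical English stand-in for the Arabic corpus:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count contiguous n-grams; frequent ones suggest multi-word
    entities such as compound disease names."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical English stand-in for the Arabic corpus:
tokens = ("blood cancer is one of the most common blood cancer "
          "types and blood cancer treatment varies").split()
print(ngram_counts(tokens, 2).most_common(1))  # → [(('blood', 'cancer'), 3)]
```

Frequency analysis is the degenerate case n = 1 of the same routine, so one function covers both steps of the pipeline.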