Developing an Urdu Lemmatizer Using a Dictionary-Based Lookup Approach

Shaukat, Saima; Asad, Muhammad; Akram, Asmara

doi:10.3390/app13085103

by Saima Shaukat ¹,*, Muhammad Asad ¹ and Asmara Akram ²

¹ Graduate School of Information Science and Technology, Department of Creative Informatics, The University of Tokyo, Tokyo 113-8654, Japan

² Department of Computer Science & Information Technology, The University of Lahore, Lahore 54590, Pakistan

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(8), 5103; https://doi.org/10.3390/app13085103

Submission received: 28 March 2023 / Revised: 14 April 2023 / Accepted: 18 April 2023 / Published: 19 April 2023

(This article belongs to the Special Issue Natural Language Processing (NLP) and Applications)


Abstract

Lemmatization aims at returning the root form of a word. The lemmatizer is envisioned as a vital instrument that can assist in many Natural Language Processing (NLP) tasks. These tasks include Information Retrieval, Word Sense Disambiguation, Machine Translation, Text Reuse, and Plagiarism Detection. Previous studies in the literature have focused on developing lemmatizers using rule-based approaches for English and other highly-resourced languages. However, there have been no thorough efforts for the development of a lemmatizer for most South Asian languages, specifically Urdu. Urdu is a morphologically rich language with many inflectional and derivational forms. This makes the development of an efficient Urdu lemmatizer a challenging task. A standardized lemmatizer would contribute towards establishing much-needed methodological resources for this low-resourced language, which are required to boost the performance of many Urdu NLP applications. This paper presents a lemmatization system for the Urdu language, based on a novel dictionary lookup approach. The contributions made through this research are the following: (1) the development of a large benchmark corpus for the Urdu language, (2) the exploration of the relationship between parts of speech tags and the lemmatizer, and (3) the development of standard approaches for an Urdu lemmatizer. Furthermore, we experimented with the impact of Part of Speech (PoS) on our proposed dictionary lookup approach. The empirical results showed that we achieved the best accuracy score of 76.44% through the proposed dictionary lookup approach.

Keywords: NLP; Urdu language; lemmatizer; dictionary approach

1. Introduction

Morphological analysis studies the structure of the words in a language, and lemmatization is a major part of this. The lemmatization process aims to identify the root form of a word from its inflected form(s). Inflected words are words that are derived from the actual root word. The root form is called the lemma of a word. Essentially, lemmatization provides a word’s root or base form. For example, in English, the words “shine” and “shining” produce the lemma “shine”. Similarly, in the Urdu language, the words “[Urdu word image i008]” (shining) and “[Urdu word image i009]” (shines) reduce to “[Urdu word image i010]” (shine) as their lemma.

Part-of-Speech (PoS) tags play an important role in the lemmatization process, as a strong relationship exists between the lemmas and PoS tags [1,2]. Due to the rich morphology of the Urdu language, one word may have more than one PoS tag, and when the PoS tag changes, the lemma of the word can change as well. For example, in English, the word “fly” can be a noun or a verb. Similarly, in the Urdu language, [Urdu word image i011] can be a noun as well as a verb. If [Urdu word image i011] is a noun (food), its lemma is [Urdu word image i011]; if it is a verb (eat), its lemma is [Urdu word image i012]. Lemmatization has potential applications in many Natural Language Processing (NLP) tasks, including Information Retrieval (IR), Word Sense Disambiguation (WSD), Machine Translation, and Text Reuse Detection.

Lemmatization of morphologically rich languages has significant influence compared to that of less inflectional languages. A single word in morphologically rich languages can have many inflectional forms. The Urdu language contains words from many different languages, such as Persian and Turkish. This makes Urdu a complex and highly rich morphological language. Further, it is one of the most important languages in South Asia, as it is spoken by more than 175 million people in Pakistan, India, and other South Asian countries [3,4,5].

Lemmatization algorithms are categorized into two approaches: (1) the manual approach and (2) the automatic approach [6]. In the manual approach, lemmas are assigned by a human; it is also called the dictionary-based approach. Because humans assign the lemmas, this approach is accurate and less prone to wrongly predicted lemmas, but it is labor-intensive and impractical for large amounts of data. In the automatic approach, lemmas are generated by rules, so it is also called the rule-based approach. This approach takes less time, but it is not as accurate as the manual approach.

As part of our contribution to NLP research in the Urdu language, this study presents an Urdu lemmatization system, based on a manually developed dictionary-based approach. In this approach, every word of Urdu contains its PoS tag and its suitable root word (lemma). The proposed dictionary contains 25,000 unique words with their most frequent parts of speech tags. The main contributions of this research are described below.

The first significant contribution of this study is the development of a large benchmark corpus for the Urdu dictionary. The proposed corpus was developed in the following steps: (1) data collection from two different sources, the UrMono corpus [7] and the Urdu Wikipedia dump; (2) pre-processing and tokenizing of the collected data; (3) frequency counts of words; (4) selection of the most frequent words; (5) assignment of parts of speech tags to the selected words; (6) manual annotation of data (assignment of a lemma to each word); (7) standardization of the CSV storage format.
The second significant contribution is in the exploration of the relationship between the PoS tag and the lemma of a word. This is achieved through training the PoS tagger and assigning the most frequently used tag to the word.
The third significant contribution is the proposed dictionary-based approach for the Urdu lemmatizer.

The rest of this paper is organized as follows. Section 2 presents the related work on lemmatization. Section 3 describes the dictionary generation process for the Urdu lemmatization system. Section 4 presents the proposed dictionary lookup approach. Section 5 explains the experimental setup and evaluation measures. Section 6 discusses the evaluation measures, results, and their analysis. Finally, Section 7 concludes the paper.

2. Related Work

The classical approaches to lemmatizing highly resourced languages are based on two main approaches: (1) the rule-based approach and (2) the dictionary-based approach. Recently, neural network-based approaches have been used, with limited success.

Researchers have focused widely on rule-based lemmatizers for English and other languages. In [8], researchers proposed a rule-based English lemmatizer using the ripple-down approach. In this approach, suffixes are added and removed from a word to get the root form. Similarly, ref. [9] developed a rule-based lemmatizer for the Hindi language. They introduced rules, based on the structure of the Hindi language, to perform operations on the suffix of an input word to generate the lemma of the word. This method extracts the suffix from the given word and adds the characters (generated by rules) at the end of the word. They reported 89.08% accuracy with this lemmatizer. In [10], the authors developed a rule-based lemmatizer for Icelandic. They used suffix substitution rules derived from a database to find the lemma. They experimented with both PoS-tagged text and non-PoS-tagged text. The non-PoS-tagged text produced the best accuracy score of 99.55%. A neural lemmatizer for the Bengali language was developed in [11], using a feed-forward neural network with k-fold cross-validation. This lemmatizer achieved an accuracy of 67.57%.

Similarly, there have been efforts to develop lemmatizers using a dictionary-based approach in recent years. In [12], the authors developed a lemmatizer using a dictionary-based approach. The lemmatizer mainly dealt with the issue of vocabulary words. It was developed for four languages: Finnish, Swedish, German, and English. The highest F_{1} score they achieved was 70%. A Turkish lemmatizer was developed in [13]. They used a dictionary-based approach to derive the lemma of a word. They prepared a dictionary of 1,000,000 words and achieved 91.90% accuracy. Similarly, an Arabic lemmatizer was proposed in [14], based on the non-statistical approach. This lemmatizer was developed for information retrieval tasks. The accuracy of this lemmatizer was reported as being 89.15%. They used different resources from the Arabic language to build the dictionary and to generate more accurate lemmas. In [15], the researchers developed a multilingual lemmatizer. It was based on the Helsinki finite-state transducer technology, and lemma dictionaries were automatically generated from the proposed approach. They developed this lemmatizer for the English, German, Dutch, French, and Spanish languages.

The authors in [16] proposed a lemmatization and POS tagging model based on a Bi-LSTM architecture to analyze 11th-century Paleographic Tamil stone inscriptions. The model was designed to accurately classify and predict tags for words, providing 96.43% accuracy. Other research, [17], presented a supervised machine learning approach using a Naive Bayes algorithm to lemmatize the Setswana language. This model shifted away from hand-programmed rules and could lemmatize words contextually, based on how they were used. Experiments yielded an accuracy of 70.32% by using a medium-sized dataset for training and testing. The results showed that machine learning is more reliable than rule-based approaches for lemmatizing Setswana words in context.

The authors in [18] presented a lemmatization algorithm for the Uzbek language. They used an algorithm that utilized a finite state machine to remove affixes, a database of affixes, and parts of speech knowledge to identify the lemma of a given word. Their model uses a process that consists of general rules, affix classification, affix removal, and definition of the lemma. The algorithm was tested with an Uzbek corpus containing 80,000 words and phrases. Recently, a framework, [19], was proposed for extracting lemmas from inflected Bangla words, considering their parts of speech as context. Bangla is a language with complex morphology, similar to Urdu. This paper presented a new, bigger Bangla dataset and an encoder–decoder-based sequence-to-sequence framework for lemmatization. After adjusting the hyper-parameters, the testing split of the dataset showed 95.75% character accuracy and 91.81% exact match accuracy. The research in [20] presented a hybrid lemmatizer and POS-tagger for Akkadian, an ancient language spoken from 2350 BCE to 100 CE. The authors used TurkuNLP to initially POS-tag and lemmatize the text and then improved the lemmatization quality by post-correcting with dictionary-based methods. The post-correction also assigned confidence scores to flag the most suspicious lemmatizations for manual validation. The results showed that the tool achieved a Lemma+POS labeling accuracy of 94% and a lemmatization accuracy of 95% in a held-out test set.

For the Urdu language, there have been very limited efforts. In [21], the authors proposed a rule-based lemmatizer for the Urdu language. The lemmatizer works by adding or removing suffixes from the end of the words. The evaluation was carried out on only 1000 words. Further, the evaluation corpus for their approach is not publicly available. They reported an accuracy of 86.5% for their lemmatizer. Another paper, [22], presented a lemmatization algorithm based on recurrent neural network models for the Urdu language to complete normalization and morphological processes effectively. The proposed model was trained and tested on two datasets, the Urdu Monolingual Corpus (UMC) and the Universal Dependencies Corpus of Urdu (UDU), and outperformed existing models, achieving accuracy, precision, recall, and F-score of 0.96, 0.95, 0.95, and 0.95, respectively.

After a detailed analysis of related work, some significant challenges in the lemmatization process were identified. The rule-based approach is not appropriate for morphologically rich languages. Rules can also generate lemmas that are non-words and do not always provide an appropriate lemma. The results for most rule-based approaches are low compared to dictionary-based approaches. The only available Urdu lemmatizers are rule-based and are limited to only 1000 words, which may not be accurate, given the morphological structure of the Urdu language. Information about part of speech tags is also missing in existing lemmatizers. A strong relationship exists between the PoS tag and the lemma of a word, so a PoS tag is needed to select the correct lemma. No large dictionaries have yet been built for the Urdu lemmatization system. To foster NLP research in resource-poor languages and to overcome the limitations of the existing Urdu lemmatizer, we propose a state-of-the-art and open-sourced Urdu lemmatizer.

3. Dictionary Generation Process

This section discusses the dictionary generation process, which includes raw data collection, the annotation process (annotation guidelines, annotations, and Inter-annotator Agreement), dictionary characteristics, standardization, and examples from the proposed dictionary. Below we describe the proposed dictionary generation process in detail.

3.1. Data Source

To create our proposed dictionary, we collected data from two different sources: (1) the Urdu Mono-lingual (UrMono) Corpus (https://lindat.mff.cuni.cz/repository/xmlui/handle/11858/00-097C-0000-0023-65A9-5, accessed on 14 March 2023) by [23] and (2) the Urdu Wikipedia dump (http://wikipedia.c3sl.ufpr.br/urwiki/20191120/, accessed on 14 March 2023). In the first step, the source text from the UrMono corpus was pre-processed (we identified and removed stop words, URLs, digits, and English alphabets) and tokenized (tokenization was done with the help of the Urdu word tokenizer developed by [24]). After tokenization, a total of 95.4 million words were obtained, from which the 20,000 most frequent words were selected (The list of the 20,000 most frequent words can be downloaded from the following link: https://github.com/Syma711/Development-of-Urdu-Lemmatizer, accessed on 14 March 2023). The lexical coverage of these 20,000 most frequent words in the UrMono corpus was 90.54%, and the selected words were already PoS tagged.

In the second step, the raw data was collected from the Urdu Wikipedia dump corpus, and the Urdu Wikipedia articles were cleaned by removing extra spaces, numbers, special characters, and words from other languages, using the Natural Language Toolkit (NLTK), from [25]. The cleaned Urdu Wikipedia articles were tokenized and PoS tagged using the Urdu tokenizer and Urdu PoS tagger from [24].

After tokenization, a total of 17.89 million words were obtained, from which the 5000 most frequent words were selected (The list of the 5000 most frequent words can be downloaded from the following link: https://github.com/Syma711/Development-of-Urdu-Lemmatizer, accessed on 14 March 2023). The lexical coverage of these 5000 words in the Urdu Wikipedia dump was 82.35%. After obtaining a PoS-tagged list of 20,000 words from the UrMono corpus and 5000 words from the Urdu Wikipedia dump, these PoS tags were mapped to the PoS tag set used by our proposed Urdu lemmatizer [26]. The 20,000 unique words from the UrMono corpus and the 5000 unique words from the Wikipedia dump were then manually annotated (The list of the 25,000 final selected words can be downloaded from the following link: https://github.com/Syma711/Development-of-Urdu-Lemmatizer, accessed on 14 March 2023).
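The frequency-based selection and the lexical coverage figures described above can be illustrated with a short Python sketch. It assumes that lexical coverage means the share of all corpus tokens accounted for by the selected words, which the text does not define explicitly. The function name and the transliterated toy tokens are hypothetical and are not taken from the corpus.

```python
from collections import Counter


def top_k_with_coverage(tokens, k):
    """Return the k most frequent tokens and the share (in %) of the corpus they cover."""
    counts = Counter(tokens)
    most_common = counts.most_common(k)
    covered = sum(freq for _, freq in most_common)
    coverage = 100.0 * covered / len(tokens) if tokens else 0.0
    return [word for word, _ in most_common], coverage


# Toy token stream (transliterated placeholders, not corpus data).
tokens = ["kitab", "kitab", "kitab", "larka", "larka", "ghar", "shehr", "pani"]
words, coverage = top_k_with_coverage(tokens, k=2)
print(words, coverage)
```

In this toy example the two selected words account for five of the eight tokens.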

3.2. Annotation Process

The annotation process was divided into three main steps: (1) preparation of annotation guidelines, (2) annotations, and (3) calculation of the Inter-Annotator Agreement.

3.2.1. Annotation Guidelines

Our main goal was to create an Urdu dictionary lemmatizer. The following annotation guidelines were prepared and provided to all annotators:

Read the word and identify its root word in the dictionary.
Assign the root word which satisfies the language rules.

3.2.2. Annotations

Annotation guidelines prepared in the previous step were used to annotate the Urdu dictionary lemmatizer manually. Annotations were carried out by three annotators, A, B, and C. All annotators are native Urdu speakers and post-graduate level Computer Science students. Furthermore, they were also provided with tutorials by domain experts in the lemmatization training process and dictionary generation. The fundamental goal of the preparation was to show the process of lemmatizer generation.

The proposed dictionary was annotated in such a way that the first two annotators annotated a subset of a dictionary. The annotators discussed the agreed and conflicting pairs in the subset, and annotation guidelines were revised (if needed). The revised annotation guidelines were used to annotate the complete dictionary, and the inter-annotator agreement was computed for the entire dictionary. The conflicted pairs of the dictionary were annotated by Annotator C, and conflicted words were resolved using the third annotator’s Urdu grammar knowledge. The most suitable lemmas were assigned to the conflicted words, based on Urdu grammar knowledge.

3.2.3. Inter-Annotator Agreement

Table 1 shows the detailed statistics of the Inter-Annotator Agreement (IAA). The IAA of the corpus was 88%, which is a high agreement level. This highlights that the annotation guidelines were well defined, which assisted annotators in distinguishing between various levels in the proposed dictionary. In addition, this also showed that the annotators were well-trained and had expertise in the relevant field. In the proposed dictionary, there were 25,000 instances requiring agreement. Annotators A and B agreed on the lemmas of 22,069 of the 25,000 words and disagreed on the remaining 2931 words.
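The text does not state which agreement measure was used. The reported counts are consistent with simple percent agreement, as the following Python sketch shows. The function name and the toy labels are hypothetical; only the counts 22,069 and 25,000 come from the text.

```python
def percent_agreement(labels_a, labels_b):
    """Share (in %) of items to which two annotators assigned the same label."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    agreed = sum(a == b for a, b in zip(labels_a, labels_b))
    return 100.0 * agreed / len(labels_a)


# Toy lemma annotations from two annotators (transliterated placeholders).
annotator_a = ["kitab", "larka", "kha", "ghar"]
annotator_b = ["kitab", "larka", "khana", "ghar"]
toy_agreement = percent_agreement(annotator_a, annotator_b)

# Counts reported for the dictionary: 22,069 agreements out of 25,000 words.
reported_agreement = 100.0 * 22069 / 25000
print(toy_agreement, round(reported_agreement, 2))
```

Percent agreement does not correct for chance agreement, so it should be read as an upper-level summary rather than a chance-adjusted statistic.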

3.3. Dictionary Characteristics and Standardization

The dictionary was standardized in CSV format and is publicly accessible to download for research purposes under a Creative Commons license (https://github.com/Syma711/Development-of-Urdu-Lemmatizer, accessed on 14 March 2023).

The proposed corpus consists of two versions. Version 1 comprises raw data without annotation that can be utilized for other NLP tasks in the Urdu language. Version 2 is composed of data with annotation for the Urdu Lemmatization task. The final dictionary consists of 25,000 words with their PoS tags and lemmas. Table 2 shows the detailed statistics of the dictionary.

3.4. Example from Proposed Dictionary

Table 3 shows a sample from the annotated dictionary. In the dictionary, each word contains an original word, the PoS tag assigned to it, and the most appropriate lemma of the original word.
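As an illustration of how such a CSV dictionary could be loaded for look-up, the sketch below reads rows into a Python mapping keyed by the (word, PoS tag) pair. The column names `word`, `pos`, and `lemma` and the transliterated sample rows are assumptions made for this example; the header of the published CSV file may differ.

```python
import csv
import io


def load_dictionary(csv_file):
    """Read rows of (word, pos, lemma) into a {(word, pos): lemma} mapping."""
    lookup = {}
    for row in csv.DictReader(csv_file):
        lookup[(row["word"], row["pos"])] = row["lemma"]
    return lookup


# In-memory stand-in for the dictionary file (hypothetical rows and header).
sample = io.StringIO(
    "word,pos,lemma\n"
    "kitabein,NN,kitab\n"
    "larkay,NN,larka\n"
    "khaya,VB,kha\n"
)
dictionary = load_dictionary(sample)
print(dictionary[("khaya", "VB")])
```

Keying on the pair rather than on the word alone reflects the requirement that both the word and its PoS tag match a dictionary entry.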

4. Proposed Dictionary Lookup Approach for Urdu Lemmatization

This section discusses the proposed approach for Urdu lemmatization, based on the dictionary lookup approach. The proposed approach is dependent on the corpus and dictionary presented in Section 3.

Algorithm 1 shows the proposed algorithm, known as the dictionary look-up, table look-up, or lexical look-up approach. Each entry of the proposed dictionary contains a word, its part-of-speech (PoS) tag, and its lemma (root). In step 1, the algorithm takes the input text for lemmatization. There are four input variations: two for a sequence of words (with or without PoS tags) and two for a single word (with or without a PoS tag). The algorithm first checks whether the input already carries PoS tags. If it does, two possibilities occur: (1) a single word or (2) a sequence of words. If the input is a single word, the algorithm jumps to step 4. Otherwise, the input sequence is tokenized into single words, based on the white space in the sequence. In the final step, the lemma is returned only if both the word and its PoS tag are present in the dictionary. If, in step 1, the input is not already PoS tagged, the tagger automatically assigns a PoS tag to the input (step 2), and the remaining steps are the same.

Algorithm 1: The execution steps of the proposed algorithm.
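The steps described for Algorithm 1 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: untagged input is assumed to arrive as a plain string, tagged input as a list of (word, tag) pairs, and the tagger as any callable that maps a word to a tag. The names `lemmatize` and `toy_tagger` and the transliterated toy entries are hypothetical.

```python
def lemmatize(text, dictionary, tagger):
    """Return one lemma per input word, or None where no (word, tag) entry exists.

    text is either a string (untagged word or whitespace-separated sequence)
    or a list of (word, tag) pairs (already PoS tagged).
    """
    if isinstance(text, str):
        # Untagged input: tokenize on white space, then let the tagger assign tags.
        tagged = [(word, tagger(word)) for word in text.split()]
    else:
        # Input already carries PoS tags.
        tagged = list(text)
    # A lemma is returned only if both the word and its tag match an entry.
    return [dictionary.get((word, tag)) for word, tag in tagged]


# Toy dictionary and tagger (hypothetical, transliterated placeholders).
DICTIONARY = {("kitabein", "NN"): "kitab", ("khaya", "VB"): "kha"}
MOST_FREQUENT_TAG = {"kitabein": "NN", "khaya": "VB"}


def toy_tagger(word):
    return MOST_FREQUENT_TAG.get(word)


print(lemmatize("kitabein khaya", DICTIONARY, toy_tagger))      # untagged sequence
print(lemmatize([("kitabein", "VB")], DICTIONARY, toy_tagger))  # tagged word, tag mismatch
```

The second call illustrates the failure mode discussed in the error analysis: the word exists in the dictionary, but under a different PoS tag, so no lemma is returned.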

5. Experimental Setup

This section describes the test dataset, evaluation methodology, and evaluation measures used for the Urdu lemmatization experiments.

5.1. Test Dataset Creation

We created a test corpus (called the test dataset) to evaluate the performance of the proposed algorithm for the Urdu lemmatizer. The dataset comprises 8590 unique words, collected from the following four different sources: (1) tweets, (2) the COUNTER corpus [27] (http://ucrel.lancs.ac.uk/textreuse/counter.php, accessed on 14 March 2023), (3) the Urdu Wikipedia dump, and (4) Urdu blogs. The statistical detail of these sources is provided in Table 4 for the test dataset. Two articles from each category of the COUNTER corpus were selected for the test dataset. Two documents from the Wikipedia dump were selected, concerning news and current affairs. Blogs on social media, motivation, the importance of prayer, Pakistan, and coronavirus were selected for the test dataset.

After selecting all the articles, pre-processing was conducted through data cleaning (in the data cleaning step, we identified and removed stop words, URLs, digits, and English alphabets from all articles) and tokenization. The tokenization used the same process as described in Section 3. After tokenization, we had a total of 8590 words/tokens. The collected words for evaluation were not PoS tagged, so the test dataset was tagged using an Urdu PoS tagger [24].

Annotations

The same annotation guidelines were followed as discussed in Section 3.2.1 to develop the test dataset. In the next step, the IAA was computed for the test dataset. Table 5 shows the detailed statistics of the IAA for the test dataset. The IAA of the test dataset was 91%, which was high. The proposed test dataset was standardized in CSV format. The test dataset can be downloaded from the following link (https://github.com/Syma711/Development-of-Urdu-Lemmatizer, accessed on 14 March 2023).

5.2. Approaches

The dictionary look-up approach can be applied in two ways. In the first case, a word and its part of speech are given to the system, which searches the proposed dictionary and returns the lemma of the word if both the word and the PoS tag match a dictionary entry. In the second case, the process is divided into two further phases.

PoS Tagging Phase: Only a word is given to the system, and the PoS tagger assigns the PoS tag to the word, based on the most frequently used tag.
Lemma Generation Phase: After the tag is assigned, the system searches the proposed dictionary and returns the lemma of the word if both the word and the PoS tag match a dictionary entry.
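The idea of assigning each word its most frequently used tag can be illustrated with the sketch below. It is not the implementation of the tagger in [24], which is an existing tool; it only shows, under the assumption that a PoS-tagged corpus is available as (word, tag) pairs, how a most-frequent-tag table could be derived. All names and toy data are hypothetical.

```python
from collections import Counter, defaultdict


def build_most_frequent_tag_table(tagged_tokens):
    """Map each word to the tag it carries most often in a tagged corpus."""
    counts = defaultdict(Counter)
    for word, tag in tagged_tokens:
        counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}


# Toy tagged corpus (transliterated placeholders, not real corpus data).
tagged_corpus = [
    ("khana", "NN"), ("khana", "NN"), ("khana", "VB"),
    ("kitab", "NN"),
]
tag_table = build_most_frequent_tag_table(tagged_corpus)
print(tag_table)
```

A table of this kind ignores sentence context, so a word that is ambiguous between a noun and a verb always receives its majority tag.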

5.3. Evaluation Measure

Accuracy was used to measure the performance of the proposed Urdu lemmatizer. Accuracy can be defined as the ratio of correctly lemmatized words to the total number of words. The mathematical description of accuracy is:

\mathrm{Accuracy} = \frac{\text{Number of Correctly Lemmatized Words}}{\text{Total Number of Words}} \times 100 \qquad (1)
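As a worked check of Equation (1), the sketch below recomputes the accuracy from the test set size (8590 words) and the numbers of incorrectly lemmatized words reported in the error analysis of Section 6 (2852 for with-PoS-DLA and 2023 for without-PoS-DLA). The function name is hypothetical, and deriving the number of correct words as the total minus the errors is an assumption of this example.

```python
def accuracy(correct, total):
    """Equation (1): percentage of correctly lemmatized words."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * correct / total


TOTAL_WORDS = 8590
with_pos = accuracy(TOTAL_WORDS - 2852, TOTAL_WORDS)
without_pos = accuracy(TOTAL_WORDS - 2023, TOTAL_WORDS)
print(f"with-PoS-DLA: {with_pos:.2f}%, without-PoS-DLA: {without_pos:.2f}%")
```

The computed values are about 66.80% and 76.45%, which agree with the reported 66.79% and 76.44% to within 0.01 percentage points.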

6. Results and Analysis

Table 6 shows the accuracy scores obtained on our proposed test dataset. A total of 8590 words were processed to evaluate the Urdu lemmatizer. The proposed algorithm was analyzed in two variations: (1) with-PoS-DLA (dictionary look-up approach with part of speech tag) refers to an input method wherein words are given to the Urdu lemmatizer with their respective part of speech tags, and the dictionary look-up approach returns the lemma of each word; (2) without-PoS-DLA (dictionary look-up approach without part of speech tag) refers to an input method in which only words are given to the Urdu lemmatizer. In this case, the PoS tagger assigns the most frequent PoS tag to each word, and the dictionary look-up approach then returns its lemma.

The best result (accuracy = 76.44%) was obtained by using the without-PoS-DLA approach. The reason for this best performance was that the USAS PoS tagger [24] assigned the most suitable and frequently used input tag. Using the with-PoS-DLA approach, we achieved an accuracy of 66.79%. A possible reason for the poorer performance is that the tags were assigned by humans (native users, not domain experts). Urdu is rich in morphology, so it is challenging to create a complete dictionary with lemmas of all words used in the Urdu language. A detailed description is provided in Section 6.1.

6.1. Error Analysis

Our proposed approach searches the dictionary for the user input word(s). Spelling mistakes and inconsistent white space cause the look-up to fail with no match in the dictionary. Although the test corpus is pre-processed, it contains many irregular words and compound words, which cause problems for word boundary identification during pre-processing, e.g., [Urdu word examples, image i013]. These entries in the test dataset reduce the system’s accuracy because such junk entries and compound words should not be present in the proposed dictionary. The unavailability of exact words leads to no match in the proposed dictionary and a decrease in performance. Another reason for the poor performance of the proposed lemmatizer concerns proper nouns, i.e., names of persons, places, and things. It seems impossible to cover all entity names in the proposed dictionary, which also reduces the performance of the proposed approach.
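The effect of white-space differences on an exact-match look-up can be illustrated as follows. The clean-up step shown here (Unicode NFC normalization plus collapsing white space) is an assumption of this example and is not part of the pre-processing described in the paper. The function name and toy entry are hypothetical.

```python
import unicodedata


def normalize_token(token):
    """Apply Unicode NFC normalization and collapse all runs of white space."""
    token = unicodedata.normalize("NFC", token)
    return " ".join(token.split())


# Toy dictionary entry (transliterated placeholder).
dictionary = {("kitabein", "NN"): "kitab"}

raw = " kitabein\u00a0"  # leading space and trailing no-break space
before = dictionary.get((raw, "NN"))                   # exact match on the raw token
after = dictionary.get((normalize_token(raw), "NN"))   # match after the clean-up
print(before, after)
```

Clean-up of this kind would not help with genuine spelling mistakes, compound words, or proper nouns, which need other remedies.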

Error Analysis for the With-PoS-DLA Approach

Table 7 shows the statistical details of the error analysis for the with-PoS-DLA approach concerning single and multi-word expressions. A total of 2852 words were incorrectly lemmatized, out of which 65% were single words and 35% were multi-word expressions. The portion of single-word expressions is significant in this approach because of the PoS tags assigned during the annotation process. When a single word is supplied with its PoS tag, the word may be present in the dictionary with a different PoS tag. This mismatch reduces the performance of the proposed system.

Table 8 shows the statistical details of the error analysis for the with-PoS-DLA approach concerning PoS tags. A total of 2852 words were incorrectly lemmatized, out of which 36% were words having noun tags (NNs), 11% were proper nouns (PNs), 13% were case marker words (P), 8% were personal pronouns (PPs), 13% were verbs (VBs), 7% were adjectives (ADJs), and 8% were aspectual auxiliaries (AAs). The remaining 4% of the words had other tags. The portion of words with noun tags (NNs) is significant because of the PoS tags assigned during the annotation process. The user-assigned tag in the test dataset may differ from the USAS PoS tag (the tag present in the dictionary for the respective word), and this causes a mismatch. Consequently, it reduces the performance of the proposed system.
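A per-tag breakdown of this kind can be produced with a simple tally, sketched below. The test items, the dictionary, and the function name are hypothetical toy values; the sketch only shows the bookkeeping and does not reproduce the figures in Table 8.

```python
from collections import Counter


def error_share_by_tag(test_items, dictionary):
    """Return the share (in %) of lemmatization errors attributed to each PoS tag."""
    errors = Counter()
    for word, tag, gold_lemma in test_items:
        # An item counts as an error when the look-up misses or returns a wrong lemma.
        if dictionary.get((word, tag)) != gold_lemma:
            errors[tag] += 1
    total = sum(errors.values())
    return {tag: 100.0 * n / total for tag, n in errors.items()} if total else {}


# Toy data (transliterated placeholders): (word, tag, gold lemma).
dictionary = {("kitabein", "NN"): "kitab"}
test_items = [
    ("kitabein", "NN", "kitab"),  # correct
    ("kitabein", "VB", "kitab"),  # tag mismatch
    ("ghar", "NN", "ghar"),       # out of vocabulary
    ("shehr", "NN", "shehr"),     # out of vocabulary
]
shares = error_share_by_tag(test_items, dictionary)
print(shares)
```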

6.2. Error Analysis of the Without-PoS-DLA Approach

Table 9 shows the statistical details of the error analysis of the without-PoS-DLA approach concerning single-word and multi-word expressions. A total of 2023 words were incorrectly lemmatized, out of which 45% were single-words, and 55% were multi-word expressions. The possible reason for the low performance (55% incorrectly classified instances) of multi-word expression was the presence of junk, compound words that cause a boundary identification issue during segmentation, and proper nouns. As expected, this may cause a decrease in the proposed system’s performance.

Similarly, Table 10 shows the statistical details of the error analysis of the without-PoS-DLA approach concerning PoS tags. A total of 2023 words were incorrectly lemmatized, out of which 39% had noun tags (NNs), 9% were proper nouns (PNs), 3.5% were case marker words (P), 5% were personal pronouns (PPs), 15% were verbs (VBs), 6.3% were adjectives (ADJs), and 5.5% were aspectual auxiliaries (AAs). The remaining words had other tags. The portion of words with noun tags (NNs) is significant because of the PoS tags assigned during the annotation process. When words are used with their respective PoS tags, the respective word may be present in the dictionary with another PoS tag. Some of the potential causes we delve into are the following:

Morphological variations: The complex morphological structure of the Urdu language, which includes prefixes, suffixes, and infixes, can result in multiple valid forms of a single lemma. This can lead to mismatches in the dictionary lookup process.
Ambiguity in tokenization: Tokenization in Urdu can be challenging due to the lack of clear word boundaries and the presence of compound words. This can cause issues in accurately identifying individual words, leading to errors in the lemmatization process.
Out-of-vocabulary words: Words not present in the dictionary due to the dynamic nature of language or limited dictionary coverage can result in errors, as they cannot be matched to any lemma.
Homographs: Urdu words with the same spelling but different meanings can confuse the lemmatization process, as it may be difficult to determine the correct lemma without proper context.
Proper nouns and named entities: The vast number of proper nouns and named entities in the language makes it challenging to include them all in the dictionary, leading to reduced performance when encountering these words.

These reasons explain the decrease in the accuracy of the proposed system and show that lemmatization is a challenging task for morphologically rich languages. Since an Urdu word can have many variants, compiling a concrete and comprehensive dictionary is not easy. For better performance, we need to develop an extensive dictionary that covers all variants of words and their PoS tags.

7. Conclusions and Future Work

In this study, we address the problem of lemmatization for a resource-poor but morphologically rich language, namely Urdu. We propose a dictionary lookup approach to solve the problem. We manually developed an extensive dictionary consisting of words, their part of speech (PoS) tags, and their lemmas (roots). A total of 25,000 words were selected from two corpora, the UrMono corpus and the Urdu Wikipedia dump. In addition, we also prepared a test dataset containing 8590 words to evaluate the lemmatizer. This test data was collected from different sources and manually annotated. The results were obtained using two different cases. In the first case, words and PoS tags were provided to the system, and we achieved an accuracy of 66.79%. In the second case, only words without PoS tags were provided to the system. The PoS tagger assigned the PoS tags to the given words, and lemmas were then generated by using the dictionary lookup approach. For the second case, we achieved an accuracy of 76.44%.

Poor grammatical structuring of the Urdu vocabulary, the improper joining of Urdu characters, the use of extra white spaces, and missing words contribute toward the increased complexity of the Urdu lemmatization task. Therefore, an increase in the dictionary size to improve the accuracy of the Urdu lemmatizer, and the use of a spell checker alongside the lemmatizer will be a part of future work. In addition, adding the English translation to Urdu words and their lemmas in the corpus is also a part of our future work.

Author Contributions

Conceptualization, M.A.; Methodology, S.S.; Formal analysis, S.S. and A.A.; Investigation, S.S.; Writing—original draft, S.S.; Writing—review & editing, M.A. and A.A.; Visualization, A.A.; Supervision, M.A. All authors have read and agreed to the published version of the manuscript.

Funding

These research results were obtained from research commissioned by the National Institute of Information and Communications Technology (NICT), Japan.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The original contributions presented in the study are included in the article; the corpus data can be accessed through our GitHub repository.

Conflicts of Interest

The authors declare no conflict of interest regarding the publication of this research article.

Table 1. Detailed statistics of inter-annotator agreement.

Data Source	Total Number of Words	Total Number of Conflicted Words	Total Number of Same Annotated Words	IAA
UrMono Corpus	20,000	2791	17,209
Wikipedia Dump	5000	140	4860
Total	25,000	2931	22,069	22,069/25,000 = 0.88 (88%)
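The agreement figure in Table 1 is the share of words that were annotated identically. A minimal sketch of that arithmetic, assuming simple percentage agreement (the function name `percent_agreement` is hypothetical):

```python
def percent_agreement(same_annotated: int, total: int) -> float:
    """Fraction of items on which the annotators agreed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return same_annotated / total


# Counts taken from Table 1.
same = 17_209 + 4_860   # UrMono + Wikipedia words annotated identically
total = 20_000 + 5_000  # all dictionary words

print(same)                                      # 22069
print(round(percent_agreement(same, total), 2))  # 0.88
```

Percentage agreement does not correct for agreement expected by chance, so it should be read as an upper-level summary rather than a chance-adjusted statistic.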

Table 2. Detailed statistics of the dictionary.

Total number of unique words	25,000
Total number of words from UrMono corpus	20,000
Total number of words from Urdu Wikipedia Dump	5000
Lexical Coverage from UrMono Corpus	33,500,000 (33.5 million)
Lexical Coverage from Urdu Wikipedia Dump	7,000,000 (7 million)

Table 3. Sample of the standard dictionary.

Word (English Translation)	PoS	Lemma (English Translation)
Seen	VBF	Look
Municipal	JJ	Municipalities
Side	NN	Side

Table 4. Source details of the test dataset.

Source	Articles	Words per Article	Total Words
Tweets	News tweets	1500	2000
	Election tweets	500
COUNTER Corpus	2 documents from sports domain	330	2042
	2 documents from showbiz domain	339
	2 documents from foreign domain	405
	5 documents from business domain	600
	2 documents from national domain	368
Wikipedia dump	2 documents from Wikipedia dump	2409	2409
Blogs	Social media	340	2139
	Motivation	437
	Importance of Urdu Blogs	348
	Importance of Dua	132
	Pakistan	349
	Corona Virus	533
Total words			8590

Table 5. Detailed statistics of IAA for the test dataset.

Corpus	Total Number of Words	Total Number of Conflicted Words	Total Number of Same Annotated Words	IAA
Test Dataset	8590	763	7827	7827/8590 = 0.91 (91%)

Table 6. Accuracy obtained by proposed Urdu lemmatizer.

Approach	Accuracy
With PoS-DLA	66.79%
Without PoS-DLA	76.44%

Table 7. Error analysis of the with-PoS-DLA approach with respect to the single-word and multi-word expressions.

Error Analysis	Total Words
Total Incorrectly Lemmatized Words	2852
Incorrectly Lemmatized Single Word Expression	1864 (65%)
Incorrectly Lemmatized Multi-Word Expression	988 (35%)

Table 8. Error analysis of the with-PoS-DLA approach with respect to PoS tags.

PoS Tag	Error	PoS Tag	Error
PD	43 (1%)	OR	0
RD	29 (1%)	FR	3 (0.07%)
KD	0	MUL	0
AD	0	U	0
NN	954 (36%)	CC	33 (1%)
PN	306 (11%)	SC	65 (2%)
PP	136 (8%)	I	17 (0.59%)
RP	8 (0.28%)	AP	0
REP	2 (0.07%)	KER	0
AD	0	PRT	0
KP	0	POT	0
AKP	0	P	366 (13%)
GR	11 (0.3%)	SE	0
G	0	WALA	9 (0.3%)
VB	367 (13%)	NEG	16 (0.4%)
ADJ	200 (7%)	INT	9 (0.3%)
Q	47 (1%)	QW	0
AA	164 (8%)	SM	9
TA	0	PM	0
ADV	58 (2%)	DATE	0
CA	0	EXP	0

Table 9. Error analysis of the without-PoS-DLA approach with respect to single-word and multi-word expressions.

Error Analysis	Total Words
Total Incorrectly Lemmatized Words	2023
Incorrectly Lemmatized Single-Word Expression	915 (45%)
Incorrectly Lemmatized Multi-Word Expression	1108 (55%)
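The error totals in Tables 7 and 9 can be cross-checked against the accuracies in Table 6. The sketch below assumes that accuracy is the share of correctly lemmatized words in the 8590-word test dataset; the function name `accuracy_percent` is hypothetical. Under that assumption the reported figures are reproduced when the percentage is truncated, rather than rounded, to two decimals, which is an observation from the arithmetic and not something stated elsewhere in the article.

```python
TEST_WORDS = 8590  # size of the test dataset


def accuracy_percent(errors: int, total: int = TEST_WORDS) -> float:
    """Percentage of correctly lemmatized words, truncated to two decimals."""
    correct = total - errors
    # Integer arithmetic avoids floating-point surprises before truncation.
    return (correct * 10_000 // total) / 100


print(accuracy_percent(2852))  # 66.79 (with-PoS-DLA, errors from Table 7)
print(accuracy_percent(2023))  # 76.44 (without-PoS-DLA, errors from Table 9)
```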

Table 10. Error analysis of the without-PoS-DLA approach with respect to PoS tags.

PoS Tag	Error	PoS Tag	Error
PD	43 (2%)	OR	0
RD	20 (0.9%)	FR	3 (0.14%)
KD	0	MUL	0
AD	0	U	0
NN	789 (39%)	CC	30 (1.4%)
PN	180 (9%)	SC	64 (3.1%)
PP	111 (5%)	I	16 (0.8%)
RP	8 (0.3%)	AP	0
REP	2 (0.09%)	KER	0
AD	0	PRT	0
KP	0	POT	0
AKP	0	P	72 (3.5%)
GR	11 (0.5%)	SE	0
G	0	WALA	9 (0.4%)
VB	311 (15%)	NEG	16 (0.8%)
ADJ	129 (6.3%)	INT	7 (0.3%)
Q	45 (2.2%)	QW	0
AA	113 (5.5%)	SM	9 (0.4%)
TA	0	PM	0
ADV	35 (1.7%)	DATE	0
CA	0	EXP	0

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
