Developing an Urdu Lemmatizer Using a Dictionary-Based Lookup Approach
Abstract
1. Introduction
- The first significant contribution of this study is the development of a large benchmark corpus for the Urdu dictionary. The proposed corpus was developed in the following steps: (1) data collection from two sources, the UrMono corpus [7] and a Wikipedia dump; (2) pre-processing and tokenization of the collected data; (3) word frequency counts; (4) selection of the most frequent words; (5) assignment of part-of-speech (PoS) tags to the selected words; (6) manual annotation of the data (assignment of a lemma to each word); (7) standardization of the CSV storage format.
- The second significant contribution is the exploration of the relationship between the PoS tag and the lemma of a word. This is achieved by training a PoS tagger and assigning each word its most frequently used tag.
- The third significant contribution is the proposed dictionary-based approach for the Urdu lemmatizer.
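
Steps (2) to (4) of the corpus construction (tokenization, frequency counting, and selection of the most frequent words) can be pictured with a short Python sketch. It is illustrative only: it assumes plain whitespace tokenization in place of whatever tokenizer the authors used, and the function name `most_frequent_words` and the toy input are hypothetical.

```python
from collections import Counter


def most_frequent_words(texts, top_n):
    """Return the top_n most frequent tokens across all texts."""
    counts = Counter()
    for text in texts:
        # Assumption: whitespace splitting stands in for the real tokenizer.
        counts.update(text.split())
    # most_common sorts by descending count.
    return [word for word, _count in counts.most_common(top_n)]


# Toy stand-in for the collected corpus text (not real data).
sample_texts = ["a b a c", "b a d"]
selected = most_frequent_words(sample_texts, top_n=2)
print(selected)
```

In the study the selected words would then be PoS-tagged and manually assigned a lemma; that manual step is outside the scope of this sketch.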
2. Related Work
3. Dictionary Generation Process
3.1. Data Source
3.2. Annotation Process
3.2.1. Annotation Guidelines
- Read the word and identify its root word in the dictionary.
- Assign the root word which satisfies the language rules.
3.2.2. Annotations
3.2.3. Inter-Annotator Agreement
3.3. Dictionary Characteristics and Standardization
3.4. Example from Proposed Dictionary
4. Proposed Dictionary Lookup Approach for Urdu Lemmatization
Algorithm 1: The execution steps of the proposed algorithm.
5. Experimental Setup
5.1. Test Dataset Creation
Annotations
5.2. Approaches
- PoS Tagging Phase: Only the word is given to the system; the PoS tagger assigns it the tag most frequently used for that word.
- Lemma Generation Phase: After the tag is assigned, the system searches the proposed dictionary and returns the lemma of the word if both the word and its PoS tag match a dictionary entry.
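
The two phases can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the dictionary is modeled as a mapping from `(word, tag)` pairs to lemmas, the tagger is a simple most-frequent-tag table, and all function names and the English toy entries are hypothetical. A word-only lookup is included for comparison with the Without-PoS-DLA setting; how the original system resolves multiple entries for one word is not stated here, so the sketch simply returns the first match.

```python
from collections import Counter, defaultdict


def build_most_frequent_tagger(tagged_corpus):
    """Map each word to the tag it carries most often in the tagged corpus."""
    tag_counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}


def lemmatize_with_pos(word, tagger, dictionary):
    """Return the lemma only if both the word and its assigned tag match an entry."""
    tag = tagger.get(word)
    return dictionary.get((word, tag))


def lemmatize_without_pos(word, dictionary):
    """Return the lemma of the first entry whose word matches, ignoring the tag."""
    for (entry_word, _tag), lemma in dictionary.items():
        if entry_word == word:
            return lemma
    return None


# Hypothetical toy data (English placeholders, not taken from the paper).
tagged_corpus = [("books", "NN"), ("books", "NN"), ("books", "VB"),
                 ("walks", "NN"), ("walks", "NN"), ("walks", "VB")]
dictionary = {("books", "NN"): "book", ("walks", "VB"): "walk"}

tagger = build_most_frequent_tagger(tagged_corpus)
print(lemmatize_with_pos("books", tagger, dictionary))
print(lemmatize_with_pos("walks", tagger, dictionary))
print(lemmatize_without_pos("walks", dictionary))
```

In the toy data, "walks" is tagged NN most often but is stored in the dictionary under VB, so the tag-sensitive lookup finds nothing while the word-only lookup still returns a lemma. This is one way a tagging error can turn into a lemmatization error.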
5.3. Evaluation Measure
6. Results and Analysis
6.1. Error Analysis
Error Analysis for the With-PoS-DLA Approach
6.2. Error Analysis of the Without-PoS-DLA Approach
- Morphological variations: The complex morphological structure of the Urdu language, which includes prefixes, suffixes, and infixes, can result in multiple valid forms of a single lemma. This can lead to mismatches in the dictionary lookup process.
- Ambiguity in tokenization: Tokenization in Urdu can be challenging due to the lack of clear word boundaries and the presence of compound words. This can cause issues in accurately identifying individual words, leading to errors in the lemmatization process.
- Out-of-vocabulary words: Words not present in the dictionary due to the dynamic nature of language or limited dictionary coverage can result in errors, as they cannot be matched to any lemma.
- Homographs: Urdu words with the same spelling but different meanings can confuse the lemmatization process, as it may be difficult to determine the correct lemma without proper context.
- Proper nouns and named entities: The vast number of proper nouns and named entities in the language makes it challenging to include them all in the dictionary, leading to reduced performance when encountering these words.
7. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
References
- Toutanova, K.; Cherry, C. A global model for joint lemmatization and part-of-speech prediction. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, Singapore, 2–7 August 2009; pp. 486–494.
- Bonatti, R.; de Paula, A.G.; Lamarca, V.S.; Cozman, F.G. Effect of part-of-speech and lemmatization filtering in email classification for automatic reply. In Proceedings of the Workshops at the Thirtieth AAAI Conference on Artificial Intelligence, Phoenix, AZ, USA, 12–13 February 2016.
- Abbas, Q. Morphologically rich Urdu grammar parsing using Earley algorithm. Nat. Lang. Eng. 2016, 22, 775–810.
- Jabbar, A.; Iqbal, S.; Khan, M.U.G.; Hussain, S. A survey on Urdu and Urdu like language stemmers and stemming techniques. Artif. Intell. Rev. 2018, 49, 339–373.
- Riaz, K. Concept search in Urdu. In Proceedings of the 2nd PhD Workshop on Information and Knowledge Management, Napa Valley, CA, USA, 30 October 2008; pp. 33–40.
- Kanis, J.; Skorkovská, L. Comparison of different lemmatization approaches through the means of information retrieval performance. In Proceedings of the International Conference on Text, Speech and Dialogue; Springer: Berlin/Heidelberg, Germany, 2010; pp. 93–100.
- Jawaid, B.; Kamran, A.; Bojar, O. A Tagged Corpus and a Tagger for Urdu. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; Calzolari, N., Choukri, K., Declerck, T., Loftsson, H., Maegaard, B., Mariani, J., Moreno, A., Odijk, J., Piperidis, S., Eds.; European Language Resources Association (ELRA): Reykjavik, Iceland, 2014.
- Plisson, J.; Lavrač, N.; Mladenić, D.; Erjavec, T. Ripple Down Rule learning for automated word lemmatisation. AI Commun. 2008, 21, 15–26.
- Paul, S.; Joshi, N.; Mathur, I. Development of a Hindi lemmatizer. arXiv 2013, arXiv:1305.6211.
- Ingólfsdóttir, S.L.; Loftsson, H.; Daðason, J.F.; Bjarnadóttir, K. Nefnir: A high accuracy lemmatizer for Icelandic. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, Turku, Finland, 30 September–2 October 2019; pp. 310–315.
- Chakrabarty, A.; Chaturvedi, A.; Garain, U. A neural lemmatizer for Bengali. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 2558–2561.
- Loponen, A.; Järvelin, K. A dictionary- and corpus-independent statistical lemmatizer for information retrieval in low resource languages. In Proceedings of the International Conference of the Cross-Language Evaluation Forum for European Languages, Padua, Italy, 20–23 September 2010; Springer: Berlin/Heidelberg, Germany, 2010; pp. 3–14.
- Civriz, M. Dictionary-Based Effective and Efficient Turkish Lemmatizer. Ph.D. Thesis, DEÜ Fen Bilimleri Enstitüsü, Izmir, Turkey, 2011.
- El-Shishtawy, T.; El-Ghannam, F. An accurate Arabic root-based lemmatizer for information retrieval purposes. arXiv 2012, arXiv:1203.3584.
- Aker, A.; Petrak, J.; Sabbah, F. An extensible multilingual open source lemmatizer. In Proceedings of the International Conference Recent Advances in Natural Language Processing, RANLP 2017, ACL, Varna, Bulgaria, 2–8 September 2017; pp. 40–45.
- Ezhilarasi, S.; Maheswari, P.U. Depicting a Neural Model for Lemmatization and POS Tagging of Words from Palaeographic Stone Inscriptions. In Proceedings of the 2021 5th International Conference on Intelligent Computing and Control Systems (ICICCS), Madurai, India, 6–8 May 2021; pp. 1879–1884.
- Bafitlhile, K.D. A Context-Aware Lemmatization Model for Setswana Language Using Machine Learning. M.Sc. Thesis, Botswana International University of Science and Technology, Palapye, Botswana, 2022.
- Sharipov, M.; Sobirov, O. Development of a rule-based lemmatization algorithm through Finite State Machine for Uzbek language. arXiv 2022, arXiv:2210.16006.
- Islam, M.A.; Towhiduzzaman, M.; Bhuiyan, M.T.I.; Maruf, A.A.; Ovi, J.A. BaNeL: An encoder-decoder based Bangla neural lemmatizer. SN Appl. Sci. 2022, 4, 138.
- Sahala, A.; Alstola, T.; Valk, J.; Linden, K. BabyLemmatizer: A Lemmatizer and POS-tagger for Akkadian. In Proceedings of the CLARIN Annual Conference Proceedings, 2022, CLARIN ERIC, Prague, Czech Republic, 10–12 October 2022.
- Gupta, V.; Joshi, N.; Mathur, I. Design and development of a rule-based Urdu lemmatizer. In Proceedings of the International Conference on ICT for Sustainable Development; Springer: Berlin/Heidelberg, Germany, 2016; pp. 161–169.
- Hafeez, R.; Anwar, M.W.; Jamal, M.H.; Fatima, T.; Espinosa, J.C.M.; López, L.A.D.; Thompson, E.B.; Ashraf, I. Contextual Urdu Lemmatization Using Recurrent Neural Network Models. Mathematics 2023, 11, 435.
- Jawaid, B.; Kamran, A.; Bojar, O. A Tagged Corpus and a Tagger for Urdu. In Proceedings of the LREC, Reykjavik, Iceland, 26–31 May 2014; pp. 2938–2943.
- Shafi, J. An Urdu Semantic Tagger-Lexicons, Corpora, Methods and Tools. Ph.D. Thesis, Lancaster University, Lancaster, UK, 2019.
- Loper, E.; Bird, S. NLTK: The natural language toolkit. arXiv 2002, arXiv:cs/0205028.
- Sajjad, H.; Schmid, H. Tagging Urdu Text with Parts of Speech: A Tagger Comparison. In Proceedings of the 12th Conference of the European Chapter of the ACL (EACL), Athens, Greece, 30 March–3 April 2009; Association for Computational Linguistics: Athens, Greece, 2009; pp. 692–700.
- Sharjeel, M.; Nawab, R.M.A.; Rayson, P. COUNTER: Corpus of Urdu news text reuse. Lang. Resour. Eval. 2017, 51, 777–803.

Data Source | Total Number of Words | Total Number of Conflicted Words | Total Number of Same Annotated Words | IAA |
---|---|---|---|---|
UrMono Corpus | 20,000 | 2791 | 17,209 | |
Wikipedia Dump | 5000 | 140 | 4860 | |
Total | 25,000 | 2931 | 22,069 | 22,069/25,000 = 0.88 (88%) |

Characteristic | Value |
---|---|
Total number of unique words | 25,000 |
Total number of words from UrMono corpus | 20,000 |
Total number of words from Urdu Wikipedia Dump | 5000 |
Lexical Coverage from UrMono Corpus | 33,500,000 (33.5 million) |
Lexical Coverage from Urdu Wikipedia Dump | 7,000,000 (7 million) |

| Word | English Translation | PoS | Lemma | English Translation |
|---|---|---|---|---|
| | Seen | VBF | | Look |
| | Municipal | JJ | | Municipalities |
| | Side | NN | | Side |

Source | Articles | Words per Article | Total Words |
---|---|---|---|
Tweets | News tweets; Election tweets | 1500; 500 | 2000 |
COUNTER Corpus | 2 documents from sports domain; 2 documents from showbiz domain; 2 documents from foreign domain; 5 documents from business domain; 2 documents from national domain | 330; 339; 405; 600; 368 | 2042 |
Wikipedia dump | 2 documents from Wikipedia dump | 2409 | 2409 |
Blogs | Social media Motivation Importance of Urdu Blogs Importance of Dua Pakistan Corona Virus | 340 437 348 132 349 533 | 2139 |
Total words | | | 8590 |

Corpus | Total Number of Words | Total Number of Conflicted Words | Total Number of Same Annotated Words | IAA |
---|---|---|---|---|
Test Dataset | 8590 | 763 | 7827 | 7827/8590 = 0.91 (91%) |
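
The inter-annotator agreement (IAA) figures in the two annotation tables are the share of words that received the same lemma from the annotators, i.e. raw agreement rather than a chance-corrected statistic. The short Python sketch below reproduces that arithmetic from the reported counts; the function name `observed_agreement` is hypothetical.

```python
def observed_agreement(same_annotated, total_words):
    """Fraction of words on which the annotators assigned the same lemma."""
    return same_annotated / total_words


# Counts taken from the dictionary and test-dataset annotation tables.
dictionary_iaa = observed_agreement(22_069, 25_000)
test_iaa = observed_agreement(7_827, 8_590)

print(f"Dictionary IAA: {dictionary_iaa:.2f}")
print(f"Test dataset IAA: {test_iaa:.2f}")
```

Rounded to two decimals, the values match the reported 0.88 and 0.91.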
Approach | Accuracy |
---|---|
With PoS-DLA | 66.79% |
Without PoS-DLA | 76.44% |
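
As a cross-check, the accuracies can be recomputed from the error counts given in the error-analysis tables (2852 and 2023 incorrectly lemmatized words). The Python sketch below assumes those counts refer to the full 8590-word test dataset and that accuracy is the share of correctly lemmatized words; the function name `accuracy` is hypothetical.

```python
def accuracy(total_words, incorrect_words):
    """Fraction of words whose predicted lemma is correct."""
    return (total_words - incorrect_words) / total_words


TEST_WORDS = 8590  # assumption: errors are counted over the whole test dataset

with_pos = accuracy(TEST_WORDS, 2852)
without_pos = accuracy(TEST_WORDS, 2023)

print(f"With PoS-DLA: {with_pos:.2%}")
print(f"Without PoS-DLA: {without_pos:.2%}")
```

Under these assumptions, both recomputed values agree with the reported 66.79% and 76.44% to within 0.01 percentage points.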
Error Analysis | Total Words |
---|---|
Total Incorrectly Lemmatized Words | 2852 |
Incorrectly Lemmatized Single Word Expression | 1864 (65%) |
Incorrectly Lemmatized Multi-Word Expression | 988 (35%) |
PoS Tag | Error | PoS Tag | Error |
---|---|---|---|
PD | 43 (1%) | OR | 0 |
RD | 29 (1%) | FR | 3 (0.07%) |
KD | 0 | MUL | 0 |
AD | 0 | U | 0 |
NN | 954 (36%) | CC | 33 (1%) |
PN | 306 (11%) | SC | 65 (2%) |
PP | 136 (8%) | I | 17 (0.59%) |
RP | 8 (0.28%) | AP | 0 |
REP | 2 (0.07%) | KER | 0 |
AD | 0 | PRT | 0 |
KP | 0 | POT | 0 |
AKP | 0 | P | 366 (13%) |
GR | 11 (0.3%) | SE | 0 |
G | 0 | WALA | 9 (0.3%) |
VB | 367 (13%) | NEG | 16 (0.4%) |
ADJ | 200 (7%) | INT | 9 (0.3%) |
Q | 47 (1%) | QW | 0 |
AA | 164 (8%) | SM | 9 (0.3%) |
TA | 0 | PM | 0 |
ADV | 58 (2%) | DATE | 0 |
CA | 0 | EXP | 0 |
Error Analysis | Total Words |
---|---|
Total Incorrectly Lemmatized Words | 2023 |
Incorrectly Lemmatized Single-Word Expression | 915 (45%) |
Incorrectly Lemmatized Multi-Word Expression | 1108 (55%) |
PoS Tag | Error | PoS Tag | Error |
---|---|---|---|
PD | 43 (2%) | OR | 0 |
RD | 20 (0.9%) | FR | 3 (0.14%) |
KD | 0 | MUL | 0 |
AD | 0 | U | 0 |
NN | 789 (39%) | CC | 30 (1.4%) |
PN | 180 (9%) | SC | 64 (3.1%) |
PP | 111 (5%) | I | 16 (0.8%) |
RP | 8 (0.3%) | AP | 0 |
REP | 2 (0.09%) | KER | 0 |
AD | 0 | PRT | 0 |
KP | 0 | POT | 0 |
AKP | 0 | P | 72 (3.5%) |
GR | 11 (0.5%) | SE | 0 |
G | 0 | WALA | 9 (0.4%) |
VB | 311 (15%) | NEG | 16 (0.8%) |
ADJ | 129 (6.3%) | INT | 7 (0.3%) |
Q | 45 (2.2%) | QW | 0 |
AA | 113 (5.5%) | SM | 9 (0.4%) |
TA | 0 | PM | 0 |
ADV | 35 (1.7%) | DATE | 0 |
CA | 0 | EXP | 0 |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Shaukat, S.; Asad, M.; Akram, A. Developing an Urdu Lemmatizer Using a Dictionary-Based Lookup Approach. Appl. Sci. 2023, 13, 5103. https://doi.org/10.3390/app13085103