A Comparative Study of Arabic Part of Speech Taggers Using Literary Text Samples from Saudi Novels
Abstract
:1. Introduction
2. Related Works
3. Methodology
3.1. Data Set
which represents a legendry narrative novel and it means the following in English:
We start with the eastern region, where it seems that the Jinn move very easily between it and the rest of the Gulf regions. There is, first of all, the most famous fairy in the Al-Ahsa region, Umm the Leaves and the Leaf. Every child was an ehsa’i - or hasawy, a word lighter on the tongue! He shivered in bed every night when he heard a rustle…
which represents a social narrative novel and it means the following in English:
Amousha realized that a new history, different from the history of her mother and father, had occurred when King Faisal passed the law for the liberation of slaves in the sixties of the twentieth century. At that time, Jawhar and his wife Nuer, after the light in her eyes had gone out and her face had eaten the remains of smallpox, ran towards his uncle Abdul Rahman, and asked him: What does this law mean?…
3.2. Method
- 1.
- Pre-processing: We first collected the tag sets of the five taggers and then grouped the tags that are related to nouns, verbs, adjectives and remaining POS categories into four main categories: Noun, Verb, Adjective, Other. Mapping different tag sets of the five taggers into only four simplified and unified tags allow for efficient comparison. Table 3 presents a sample of tag sets from the selected taggers and their corresponding simplified tags. This mapping was done for simplification purposes, particularly for validating and evaluating the outcomes.
- 2.
- Implementation: We tagged the original ten Arabic texts via the five taggers included in this study (Stanford, CAMeL, Farasa, MADAMIRA and ALP). It should be mentioned here that the selected texts were tagged without segmentation. This is due to existing reports that segmentation increases ambiguity of the words tagged [9,39].
- 3.
- Post-processing of results: We used the tag sets mapping to transform all the taggers’ results into the four POS categories: Noun, Verb, Adjective and Other, as explained earlier. This simplification is necessary for comparison purposes.
- 4.
- Results comparison: we compared the outputs of the tagging results with those performed using classification report function from Pandas library in the Python Language.
- 5.
- Evaluation: after applying the classification report function, we evaluated the generated precision, recall and F1 scores for each text and for all taggers as explained in the in the following section.
4. Results and Discussion
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Albared, M.; Omar, N.; Ab Aziz, M.J. Developing a competitive HMM arabic POS tagger using small training corpora. In Proceedings of the Asian Conference on Intelligent Information and Database Systems, Daegu, Korea, 20–22 April 2011; Springer: Berlin/Heidelberg, Germany, 2011; pp. 288–296. [Google Scholar]
- El Hadj, Y.; Al-Sughayeir, I.; Al-Ansari, A. Arabic part-of-speech tagging using the sentence structure. In Proceedings of the Second International Conference on Arabic Language Resources and Tools, Cairo, Egypt, 22–23 April 2009; pp. 241–245. [Google Scholar]
- Habash, N.Y. Introduction to Arabic natural language processing. Synth. Lect. Hum. Lang. Technol. 2010, 3, 1–187. [Google Scholar] [CrossRef]
- Marquez, L.; Padro, L.; Rodriguez, H. A machine learning approach to POS tagging. Mach. Learn. 2000, 39, 59–91. [Google Scholar] [CrossRef] [Green Version]
- Randi, R. Building a corpus: What are the key considerations? In The Routledge Handbook of Corpus Linguistics; Routledge: London, UK, 2010; pp. 31–37. [Google Scholar]
- Alkhazi, I.S.B. Compression-Based Parts-of-Speech Tagger for the Arabic Language; Bangor University: Bangor, UK, 2019. [Google Scholar]
- Zeroual, I.; Lakhouaja, A.; Belahbib, R. Towards a standard Part of Speech tagset for the Arabic language. J. King Saud Univ.-Comput. Inf. Sci. 2017, 29, 171–178. [Google Scholar] [CrossRef]
- Khoja, S. APT: Arabic part-of-speech tagger. In Proceedings of the Student Workshop at NAACL, Pittsburgh, PA, USA, 2–7 June 2001; Citeseer: State College, PA, USA, 2001; pp. 20–25. [Google Scholar]
- Alosaimy, A.; Atwell, E. Tagging classical Arabic text using available morphological analysers and part of speech taggers. J. Lang. Technol. Comput. Linguist. 2017, 32, 1–26. [Google Scholar]
- Alashqar, A.M. A comparative study on Arabic POS tagging using Quran corpus. In Proceedings of the 2012 8th International Conference on Informatics and Systems (INFOS), Giza, Egypt, 14–16 May 2012; p. NLP-29. [Google Scholar]
- Albogamy, F.; Ramsay, A. Fast and robust POS tagger for Arabic tweets using agreement-based bootstrapping. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 1500–1506. [Google Scholar]
- Alrabiah, M.; Al-Salman, A.; Atwell, E.; Alhelewh, N. KSUCCA: A key to exploring Arabic historical linguistics. Int. J. Comput. Linguist. (IJCL) 2014, 5, 27–36. [Google Scholar]
- Al Khalil, M.; Habash, N.; Jiang, Z. A large-scale leveled readability lexicon for Standard Arabic. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 11–16 May 2020; pp. 3053–3062. [Google Scholar]
- Inoue, G.; Alhafni, B.; Baimukan, N.; Bouamor, H.; Habash, N. The interplay of variant, size, and task type in Arabic pre-trained language models. arXiv 2021, arXiv:2103.06678. [Google Scholar]
- Alkhazi, I.S.; Teahan, W.J. BAAC: Bangor Arabic Annotated Corpus. Mach. Transl. 2018, 22, 23. [Google Scholar] [CrossRef]
- Freihat, A.A.; Bella, G.; Mubarak, H.; Giunchiglia, F. A single-model approach for Arabic segmentation, POS tagging, and named entity recognition. In Proceedings of the 2018 2nd International Conference on Natural Language and Speech Processing (ICNLSP), Algiers, Algeria, 25–26 April 2018; pp. 1–8. [Google Scholar]
- Green, S.; Manning, C.D. Better Arabic parsing: Baselines, evaluations, and analysis. In Proceedings of the 23rd International Conference on Computational Linguistics (Coling 2010), Beijing, China, 23–27 August 2010; pp. 394–402. [Google Scholar]
- Toutanova, K.; Klein, D.; Manning, C.D.; Singer, Y. Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of the 2003 Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics, Edmonton, AB, Canada, 27 May–1 June 2003; pp. 252–259. [Google Scholar]
- Toutanova, K.; Manning, C. Enriching the knowledge sources used in a maximum entropy part-of-speech tagger. In Proceedings of the 2000 Joint SIGDAT Conference EMNLP/VLC, Hongkong, China, 7–8 October 2000; pp. 63–71. [Google Scholar]
- El-Haj, M.; Koulali, R. KALIMAT a multipurpose Arabic Corpus. In Proceedings of the Second Workshop on Arabic Corpus Linguistics (WACL-2), Lancaster, UK, 22 July 2013; pp. 22–25. [Google Scholar]
- Pasha, A.; Al-Badrashiny, M.; Diab, M.T.; El Kholy, A.; Eskander, R.; Habash, N.; Pooleery, M.; Rambow, O.; Roth, R. Madamira: A fast, comprehensive tool for morphological analysis and disambiguation of arabic. In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), Reykjavik, Iceland, 26–31 May 2014; Citeseer: State College, PA, USA, 2014; Volume 14, pp. 1094–1101. [Google Scholar]
- Habash, N.; Rambow, O.; Roth, R. MADA+ TOKAN: A toolkit for Arabic tokenization, diacritization, morphological disambiguation, POS tagging, stemming and lemmatization. In Proceedings of the 2nd International Conference on Arabic Language Resources and Tools (MEDAR), Cairo, Egypt, 21–23 April 2009; Volume 41, p. 62. [Google Scholar]
- Diab, M. Second generation AMIRA tools for Arabic processing: Fast and robust tokenization, POS tagging, and base phrase chunking. In Proceedings of the 2nd International Conference on Arabic Language Resources and Tools, Cairo, Egypt, 22–23 April 2009; Volume 110, p. 198. [Google Scholar]
- Abdelali, A.; Darwish, K.; Durrani, N.; Mubarak, H. Farasa: A fast and furious segmenter for arabic. In Proceedings of the 2016 conference of the North American Chapter of the Association for Computational Linguistics: Demonstrations, San Diego, CA, USA, 12–17 June 2016; pp. 11–16. [Google Scholar]
- Darwish, K.; Mubarak, H. Farasa: A new fast and accurate Arabic word segmenter. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia, 23–28 May 2016; pp. 1070–1074. [Google Scholar]
- Darwish, K.; Mubarak, H.; Abdelali, A.; Eldesouki, M. Arabic pos tagging: Don’t abandon feature engineering just yet. In Proceedings of the Third Arabic Natural Language Processing Workshop, Valencia, Spain, 3–4 April 2017; pp. 130–137. [Google Scholar]
- Obeid, O.; Zalmout, N.; Khalifa, S.; Taji, D.; Oudah, M.; Alhafni, B.; Inoue, G.; Eryani, F.; Erdmann, A.; Habash, N. CAMeL tools: An open source python toolkit for Arabic natural language processing. In Proceedings of the 12th Language Resources and Evaluation Conference, Marseille, France, 13–15 May 2020; pp. 7022–7032. [Google Scholar]
- Ghazi, A. Aljinniyah. In Almu assaah Al arabiyyah li-Dirāsāt wa Alnashr; Beirut, Lebanon, 2016; Available online: http://airpbooks.com/?url=ar/BookDetails?BookID:2397 (accessed on 5 October 2020).
- Badria, A. Hind wa Al skar. In Dār Alsāqī; Beirut, Lebanon, 2013; Available online: https://www.daralsaqi.com/book/%D9%87%D9%86%D8%AF-%D9%88%D8%A7%D9%84%D8%B9%D8%B3%D9%83%D8%B1 (accessed on 5 October 2020).
- Abdulrahman, M. Umm Alnudhūr. In Almu assaah Al arabiyyah li-Dirāsāt wa Alnashr; Beirut, Lebanon, 2005; Available online: https://www.neelwafurat.com/itempage.aspx?id=lbb136677-96909&search=books (accessed on 5 October 2020).
- Abdulrahman, M. Urwat Alzmān Albāhī. In Almu assaah Al arabiyyah li-Dirāsāt wa Alnashr; Beirut, Lebanon, 2007; Available online: https://www.neelwafurat.com/itempage.aspx?id=lbb29569-27908&search=books (accessed on 5 October 2020).
- Omaima, A. Msrā Alghrānīq. In Dār Alsāqī; Beirut, Lebanon, 2017; Available online: https://www.daralsaqi.com/book/%D9%85%D8%B3%D8%B1%D9%89-%D8%A7%D9%84%D8%BA%D8%B1%D8%A7%D9%86%D9%8A%D9%82-%D9%81%D9%8A-%D9%85%D8%AF%D9%86-%D8%A7%D9%84%D8%B9%D9%82%D9%8A%D9%82 (accessed on 5 October 2020).
- Monther, Q. Qarīn. In Aldār Al arabiyyah li-l ulūm; Beirut, Lebanon, 2016; Available online: http://www.aspbooks.com/books/bookpage.aspx?id=254828-237549 (accessed on 5 October 2020).
- Magbol, A. Ziryāb. In Dār Alsāqī; Beirut, Lebanon, 2014; Available online: https://www.daralsaqi.com/book/%D8%B2%D8%B1%D9%8A%D8%A7%D8%A8 (accessed on 5 October 2020).
- Qmasha, A. Unthā Al ankabūt. In Dār Alkifāḥ li-Nashr wa Altawzī; Dammam, Saudi Arabia, 2000; Available online: https://www.neelwafurat.com/itempage.aspx?id=lbb87629-0&search=books (accessed on 5 October 2020).
- Khalifa, S.; Habash, N.; Abdulrahim, D.; Hassan, S. A large scale corpus of Gulf Arabic. arXiv 2016, arXiv:1609.02960. [Google Scholar]
- Khalifa, S.; Habash, N.; Eryani, F.; Obeid, O.; Abdulrahim, D.; Al Kaabi, M. A morphologically annotated corpus of Emirati Arabic. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan, 7–12 May 2018. [Google Scholar]
- Diab, M.; Hacioglu, K.; Jurafsky, D. Automatic tagging of Arabic text: From raw text to base phrase chunks. In Proceedings of the HLT-NAACL 2004, Boston, MA, USA, 2–7 May 2004; pp. 149–152, Short Papers. [Google Scholar]
- Mohamed, E.; Kübler, S. Arabic Part of Speech Tagging. In Proceedings of the Seventh International Conference on Language Resources and Evaluation, Valletta, Malta, 17–23 May 2010. [Google Scholar]
- Maamouri, M.; Bies, A.; Buckwalter, T.; Mekki, W. The penn arabic treebank: Building a large-scale annotated arabic corpus. In Proceedings of the NEMLAR Conference on Arabic Language Resources and Tools, Cairo, Egypt, 22–23 September 2004; Volume 27, pp. 466–467. [Google Scholar]
Item | Stanford | CAMeL | Farasa | MADAMIRA | ALP |
---|---|---|---|---|---|
Number of tags | 32 | 35 | 16 | 35 | 58 |
Past tense verb tag | VB | verb | V | verb | PSTV |
Preposition tag | IN | prep | PREP | prep | P |
No | Narration Type | Number of Words |
---|---|---|
1 | Legendry narrative [28] | 237 |
2 | Dialogue [28] | 211 |
3 | Social narrative [29] | 192 |
4 | Place description [30] | 181 |
5 | Preamble to the Novel [31] | 195 |
6 | Descriptive narrative [32] | 205 |
7 | Science Fiction [33] | 218 |
8 | Using poetry in historical narrative [33] | 212 |
9 | Historical narrative [34] | 205 |
10 | Social narrative [35] | 208 |
Total: 2059 |
Sample of Tag Sets Mapping | |||||
---|---|---|---|---|---|
Stanford | Farasa | MADAMIRA | CAMeL | ALP | Simplified Tags |
DTNN, DTNNP, NN, NNS. | NOUN, NSUFF, FOREIGN. | noun, noun_prop, noun_quant. | noun, noun_prop, noun_quant. | SMN, SFN, DMN, DFN, PMN, PFN. | Noun |
VB, VBD, VBN, VBP. | V, VSUFF. | verb, verb_pseudo. | verb, verb_pseudo. | PRSV, PSTV, PPRSV, PPSTV, IMPV. | Verb |
JJ, JJR, DTJJ. | ADJ. | adj, adj_comp. | adj, adj_comp. | SMAJ, SFAJ, DMAJ, DFAJ, PMAJ. | Adjective |
PUNC, WP, RB, WP. | CONJ, RP, WP, CD. | pron_rel, conj_sub, adv, pron. | prep, part_voc, part_neg, pron_dem. | PX, REL, C, C+LC. | Other |
Recall Comparison | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Taggers/No.Text | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
Stanford | 65 | 76 | 73 | 70 | 71 | 57 | 73 | 77 | 68 | 74 | 70 |
CAMeL | 77 | 78 | 79 | 88 | 90 | 82 | 90 | 83 | 86 | 80 | 83 |
Farasa | 81 | 90 | 83 | 92 | 92 | 83 | 90 | 82 | 88 | 82 | 86 |
MADAMIRA | 85 | 94 | 82 | 89 | 93 | 84 | 89 | 88 | 87 | 87 | 88 |
ALP | 91 | 97 | 87 | 98 | 94 | 95 | 94 | 87 | 88 | 93 | 92 |
Precision Comparison | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Taggers/No.Text | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
Stanford | 69 | 70 | 67 | 77 | 70 | 67 | 73 | 68 | 80 | 70 | 71 |
CAMeL | 80 | 78 | 79 | 91 | 88 | 87 | 93 | 83 | 89 | 84 | 85 |
Farasa | 86 | 90 | 86 | 94 | 89 | 86 | 90 | 83 | 88 | 82 | 87 |
MADAMIRA | 86 | 93 | 81 | 92 | 91 | 88 | 92 | 88 | 93 | 85 | 89 |
ALP | 89 | 99 | 85 | 99 | 94 | 94 | 91 | 90 | 95 | 95 | 93 |
Comparison | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|
Taggers/No.Text | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | Average |
Stanford | 61 | 69 | 66 | 68 | 67 | 59 | 71 | 69 | 72 | 68 | 67 |
CAMeL | 76 | 78 | 79 | 89 | 89 | 84 | 91 | 82 | 87 | 81 | 84 |
Farasa | 83 | 90 | 84 | 93 | 90 | 84 | 89 | 82 | 88 | 81 | 86 |
MADAMIRA | 85 | 94 | 81 | 90 | 92 | 85 | 90 | 88 | 89 | 86 | 88 |
ALP | 90 | 98 | 86 | 99 | 94 | 95 | 92 | 88 | 91 | 94 | 92 |
Recall Comparison | ||||
---|---|---|---|---|
POS Taggers | Noun | Verb | Adjective | Other |
Stanford | 91 | 73 | 68 | 50 |
CAMeL | 93 | 92 | 60 | 89 |
Farasa | 94 | 87 | 73 | 89 |
MADAMIRA | 95 | 95 | 70 | 91 |
ALP | 95 | 96 | 84 | 95 |
Precision Comparison | ||||
---|---|---|---|---|
POS Taggers | Noun | Verb | Adjective | Other |
Stanford | 57 | 67 | 65 | 93 |
CAMeL | 80 | 91 | 68 | 97 |
Farasa | 81 | 96 | 74 | 97 |
MADAMIRA | 84 | 96 | 79 | 98 |
ALP | 91 | 96 | 85 | 98 |
Comparison | ||||
---|---|---|---|---|
POS Taggers | Noun | Verb | Adjective | Other |
Stanford | 69 | 69 | 63 | 66 |
CAMeL | 86 | 92 | 64 | 70 |
Farasa | 87 | 92 | 74 | 80 |
MADAMIRA | 89 | 69 | 63 | 66 |
ALP | 93 | 97 | 84 | 88 |
Text 1 | Text 6 |
---|---|
متناهية | حجري |
الأشهر | منحوت |
إحسائي | الغابرين |
حساوي | كاغد |
أخف | ثمين |
الثقيل | المتداينون |
الباطن | المتشاكلون |
مباشر | المتباغضون |
شابة | المأكولة |
حسناء | المتنازعون |
عجوز | جلدي |
غاية | سميك |
الحسناء | |
الشوهاء | |
المؤقت | |
الحمارية |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations. |
© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Alluhaibi, R.; Alfraidi, T.; Abdeen, M.A.R.; Yatimi, A. A Comparative Study of Arabic Part of Speech Taggers Using Literary Text Samples from Saudi Novels. Information 2021, 12, 523. https://doi.org/10.3390/info12120523
Alluhaibi R, Alfraidi T, Abdeen MAR, Yatimi A. A Comparative Study of Arabic Part of Speech Taggers Using Literary Text Samples from Saudi Novels. Information. 2021; 12(12):523. https://doi.org/10.3390/info12120523
Chicago/Turabian StyleAlluhaibi, Reyadh, Tareq Alfraidi, Mohammad A. R. Abdeen, and Ahmed Yatimi. 2021. "A Comparative Study of Arabic Part of Speech Taggers Using Literary Text Samples from Saudi Novels" Information 12, no. 12: 523. https://doi.org/10.3390/info12120523
APA StyleAlluhaibi, R., Alfraidi, T., Abdeen, M. A. R., & Yatimi, A. (2021). A Comparative Study of Arabic Part of Speech Taggers Using Literary Text Samples from Saudi Novels. Information, 12(12), 523. https://doi.org/10.3390/info12120523