Developing a POS Tagged Corpus of Urdu Tweets

Processing social media text such as tweets is challenging for traditional Natural Language Processing (NLP) tools developed for well-edited text, due to the noisy nature of such text. However, demand for tools and resources to correctly process such noisy text has increased in recent years because of its usefulness in various applications. The literature reports various efforts to develop tools and resources for processing such noisy text in several languages, notably for part-of-speech (POS) tagging, an NLP task with a direct effect on the performance of successive text processing activities. Still, no attempt has been made to develop a POS tagger for Urdu social media content. Thus, the focus of this paper is POS tagging of Urdu tweets. We introduce a new tagset for POS tagging of Urdu tweets along with a POS-tagged Urdu tweets corpus. We also investigate bootstrapping as a potential solution for overcoming the shortage of manually annotated data and present a supervised POS tagger achieving 93.8% precision, 92.9% recall and a 93.3% F-measure.


Introduction
Recent years have witnessed the immense popularity of social media platforms among Internet users, researchers and organizations across several domains. Furthermore, micro-blogging websites facilitate and inspire several aspects of modern life such as business, education, technology and government affairs, to name a few [1]. With around 326 million users to date, Twitter is a popular micro-blogging web service which is nowadays a major source of information for all major events and the latest happenings around the world. Twitter allows its users to write or share tweets of up to 280 characters about countless topics, such as their opinions about certain aspects of life, reviews of products, films and games, and discussions about relationship issues, government affairs, pandemics, etc. These tweets can be further utilized for a variety of activities, such as using opinion mining to forecast or explain real-world outcomes, mining users' interests for targeted advertisement campaigns, and acquiring customer opinions about brands, government policies, etc.
Language on Twitter, however, is quite different from well-edited text of news, books, etc., due to the presence of unconventional orthography, punctuation and grammatical mistakes, along with Twitter-specific conventions such as hashtags, emoticons, usernames and retweet tokens [2]. Such language style variation is often characterized as noisy user-generated text [3]. Since the performance of Natural Language Processing (NLP) applications depends on the type of text being processed [4], the effect of this language style variation of user-generated text on the performance of standard NLP tools has been explored by Foster et al. [5] and Petrov and McDonald [6]. Similarly, studies by Owoputi et al. [7], Gimpel et al. [8], Ritter, Clark and Etzioni [9], Seddah et al. [10] and Kong et al. [11] have shown that adaptation of NLP tools and resources is necessary to accommodate language differences in such noisy text.
(SNLTR) is used by transforming the SNLTR tagset into their proposed tagset. The SNLTR system was trained on 1200 tweets and tested on 100 tweets, achieving an accuracy of 86.99%.
The authors of [24] used an existing POS tagger for formal Indonesian text [25] to automatically annotate Indonesian tweets and added five new tags for Twitter data. The tagger employs semi-automatic data annotation: new data are annotated automatically and the annotation results are manually corrected. The model is then rebuilt by adding the resulting data to the training data. The model was trained several times, with data volumes of 1000, 1600 and 1800 tweets, achieving 66.36% accuracy.
As far as we know, there is currently no research study available in the literature on POS tagging of Urdu tweets. This paper is the first step towards filling this gap.

Urdu Tweet Part-of-Speech Tagset
There are various POS tagsets available for Urdu, including Hardie's tagset [26], the Sajjad and Schmid tagset [27] and the CLE POS tagset [28], to name a few. All these tagsets are designed for well-edited Urdu text. However, the performance of taggers trained on well-edited text decreases on out-of-domain data such as tweets [8]. We evaluated the accuracy of two publicly available Urdu POS taggers, the IIIT Urdu Shallow Parser [29] and CLE's Statistical POS Tagger for Urdu [30], on well-edited news text (1856 tokens) as well as on tweets (1862 tokens). The results of this experiment are presented in Table 1. For accuracy evaluation, precision (the fraction of correct POS tags out of all tagged tokens), recall (the ratio of correctly identified labels to the total number of correct tags in the input data) and F-measure (the harmonic mean of precision and recall) are used. Equations (1)-(3) describe their calculations, respectively:

Precision (P) = Correctly Tagged Tokens / Total Tagged Tokens, (1)
Recall (R) = Correctly Tagged Tokens / Total Possible Correctly Tagged Tokens, (2)
F-Measure (F) = (2 × P × R) / (P + R). (3)

While high accuracy was achieved by both taggers in tagging news text, the same was not the case with tweets. The experimental results clearly show a performance drop for both taggers on tweets. Both taggers failed to properly tag typographical divergences, unknown words and bad segmentation in tweets. Similarly, the presence of emoticons, hashtags and other Twitter-specific elements was also problematic for both taggers; these tokens never or hardly ever appear in news text. The experimental results show that POS tagging for Twitter is quite different from tagging more formal texts due to the informal, less grammatical nature and lexical divergence of tweets as compared to well-edited Urdu text, confirming that the findings of [16] also hold true for Urdu tweets. Still, we could have used any of the existing tagsets and trained a statistical POS tagger for tweets.
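These token-level scores can be computed with a short helper (an illustrative sketch; the function and variable names are ours, and marking untagged tokens with None is our convention, not the taggers'):

```python
def pos_scores(gold, predicted):
    """Token-level precision, recall and F-measure for POS tagging.

    gold and predicted are aligned lists of (token, tag) pairs;
    tokens the tagger left untagged carry the tag None.
    """
    correct = sum(1 for g, p in zip(gold, predicted)
                  if p[1] is not None and g[1] == p[1])
    tagged = sum(1 for _, tag in predicted if tag is not None)
    precision = correct / tagged if tagged else 0.0
    recall = correct / len(gold) if gold else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return precision, recall, f_measure
```

When every token receives a tag, precision and recall coincide; they diverge only when some tokens are left untagged.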
However, the case of tweets is not just a problem of plain domain adaptation where transfer learning can improve tagger accuracy. The lexical divergence of tweets makes them a whole new genre compared to the standard, well-edited text on which these taggers are trained. Moreover, the tagsets used by these taggers do not have appropriate tags for Twitter-specific elements of tweets. These reasons motivated us to propose a new Twitter-specific tagset for Urdu. This tagset contains 33 part-of-speech tags for annotating standard parts of speech (nouns, verbs, etc.) along with groups of token variations found largely in Urdu tweets. The tagset is inspired by the Google Universal POS tagset [31] and the CLE POS tagset [28]. We refer to this tagset as the Urdu Noisy Text POS (UNTPOS) tagset. Tags and their descriptions are given in the following subsections, whereas the complete list of tags is provided in Appendix A.

ADJ: Adjective
Adjectives are used to modify nouns by specifying their properties or characteristics. Examples of Urdu adjectives are: " /good", " /heavy", " /old", etc.

AUX: Auxiliary Verb
Auxiliary verbs in Urdu are the verbs that can form a compound verb together with the main verb. Examples are: " /is", " /was", " /will", etc.

CONJ: Coordinating Conjunction
These are the words that are used to join two independent clauses in a compound sentence. Some examples are: " /and", " /also", " /however".

DET: Determiner
In Urdu, determiners are not considered a separate word class, as most determiners are treated as demonstrative pronouns. However, determiners are terms that narrow down the referents of the following noun in the scope of a conversation, whereas demonstrative pronouns can entirely replace a noun in a sentence. Examples of DET are: " /this", " /many", " /few", and so on.

INTJ: Interjection
Interjections are used to express emotion, volition and mood. There are two sub-categories for interjections in the UNTPOS tagset: "INTJ", which expresses emotions in the form of words, and "INTJE", which is used to mark smiley emoticons/emojis, as they also show emotions but in image form.
Examples of INTJ are: " /well done", " /oh", " /hey", etc. Examples of INTJE are: " ", " ", " ".

ADV: Adverb
An adverb modifies a verb. Sub-categories of adverb include general adverbs. An example is: " /away".

NOUN: Noun
Nouns are parts of speech used to name people, places, things or ideas. Examples include: "آدمی/man", etc.


PART: Particle
Particles do not belong to any of the inflected grammatical word classes; they usually lack a grammatical function of their own and express relationships between other parts of speech or clauses. Examples are: " /ones", " /also", etc.

PRON: Pronoun
The Urdu Tweet tagset uses five subcategories of pronoun. Personal pronoun (PRON) is used to replace a noun. Examples are: " /me", " /that", etc. A possessive pronoun (PRONP) is a pronoun that shows the ownership relation. Examples are: " /yours", " /mine". A reflexive pronoun (PRONR) is used for referring to oneself. An example is: " /self". A demonstrative pronoun (PROND) points to specific objects within a sentence and comes before a noun. A few examples are: " /the same", " /they", " /who". A relative pronoun (PRONRD) is one that is used to refer to nouns mentioned previously. Some examples are: " /which", " /like", " /that".

PROPN: Proper Noun
A proper noun denotes the name of a specific person, thing or place. Examples are: " /Pakistan", " /Taj Mahal", " /Muhammad Ahmed", etc. Tweet mentions of the form "@mshaanshahid" are also tagged as PROPN, as these usernames represent a real person in the social media world. Similarly, in the case of multiword proper nouns such as " /Firdous Jamal", both words will be tagged as PROPN.

SCONJ: Subordinating Conjunction
These are words used to join and show the relationship between the words, clauses or phrases that they join. Some examples of SCONJ are: " /that", " /if", " /because".

SYM: Symbol
A symbol is an entity like a word and is different from normal words in its function, form or both. Examples are: "$/dollar", "%/percentage".

VERB: Verb
Verbs are a class of words used to denote an event or action. They can form the smallest predicate in a clause and control the types and number of other components that may appear in the clause. Examples are: " /keep", " /said", " /remember", etc.

RET: Retweet
A retweet is used to mark a reposted or forwarded message on Twitter. An example tweet is shown below: "RT@faheemabbasii: ".

REP: Reply
Reply is used to mark a response to another person's tweet. An example tweet is shown below: "RP @SulemanZartasha: "

HASH: Hash
Twitter hashtags are sometimes used as ordinary words and other times as topic markers. HASH is used to mark both. Some examples are "#StopKillingsOnLOC" and "#خاموشی".

X: Others
X is used for foreign words, i.e., words from languages other than Urdu, or words which do not fall under any of the specified part-of-speech tags. Examples are: "DP", "ﷺ", "اکثراوقات".

LINK: Link
A link is used to mark email addresses and web links in the noisy text. An example tweet is shown below: " https://youtu.be/oqW6IWVAYSg".

Urdu Tweet Corpus
Since no annotated or unannotated corpus for Urdu tweets is publicly available, a new corpus was created for this study. Figure 1 shows the process of corpus creation and the following sections describe the process in detail.

Dataset for Corpus Creation
A total of 5000 Urdu tweets comprising a multitude of topics such as business, politics, sports, entertainment, etc., have been collected using the Twitter search and stream API.


Pre-Processing
Pre-processing prepares the dataset for machine learning [32]. Proper pre-processing improves the effectiveness of machine learning while reducing its training time. This involves data cleaning and sentence segmentation.
In our research, in the data cleaning stage, all duplicate tweets were removed from the raw corpus after data collection. Similarly, tweets written in Urdu but also containing words from languages such as Sindhi, Punjabi and Pashto were discarded. Very short tweets of 2 to 3 words were also discarded, leaving a corpus of 3420 tweets. This raw corpus was then normalized using the UrduHack library [33].
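The cleaning stage can be sketched as follows (a minimal illustration with our own helper names; mixed-language filtering and the UrduHack normalization step are omitted):

```python
def clean_tweets(tweets, min_words=4):
    """Remove duplicates and very short tweets from a raw tweet list.

    Exact duplicates are dropped (keeping the first occurrence) and
    tweets shorter than min_words are discarded.
    """
    seen = set()
    cleaned = []
    for text in tweets:
        normalized = " ".join(text.split())  # collapse whitespace
        if normalized in seen:
            continue  # duplicate tweet
        seen.add(normalized)
        if len(normalized.split()) < min_words:
            continue  # 2-3 word tweets carry too little signal
        cleaned.append(normalized)
    return cleaned
```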
An important pre-processing step, tokenization, breaks long text strings into linguistic units, or tokens. We developed our own tokenizer to perform tokenization of special cases such as "RP@SulemanZ:" and "https://youtu.be/oqW6IWVAYSg" in Urdu tweets.
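A tokenizer that keeps such Twitter-specific elements intact can be sketched with a prioritized regular expression (our own pattern, not the tokenizer used in this work):

```python
import re

# Match URLs, retweet/reply markers with their mentions, hashtags and
# bare mentions before falling back to any run of non-space characters.
TOKEN_RE = re.compile(
    r"https?://\S+"        # web links
    r"|R[TP]\s*@\w+:?"     # retweet/reply markers such as RP@SulemanZ:
    r"|#\w+"               # hashtags
    r"|@\w+"               # bare mentions
    r"|\S+"                # any other non-space run
)

def tokenize(tweet):
    """Split a tweet into tokens, keeping special elements whole."""
    return TOKEN_RE.findall(tweet)
```

Because Python's `\w` matches Unicode word characters, the same pattern also covers Urdu-script hashtags such as #خاموشی.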

Manual Annotation
Correctly annotated tweets are a basic requirement for evaluating POS tagger output against a gold standard. Since the new POS tagset is designed specifically for Urdu tweets, no such gold standard corpus exists, highlighting the need to manually develop a correctly annotated, gold standard POS-tagged Urdu tweets corpus.
To develop the initial gold standard corpus, manual annotation of POS tags was performed on 300 randomly sampled tokenized Urdu tweets (8034 tokens) by two annotators; this set was then checked and corrected by an expert who is a native Urdu speaker with a linguistics background. Rectification of these 300 tweets formed the foundation for evaluating the tagset's intuitiveness. The tagset was finalized after numerous discussions and revisions. The next 200 randomly sampled tweets (4689 tokens) were annotated in accordance with the revised tagset, using the first 300 tagged tweets as a reference. An example of a tokenized, part-of-speech tagged Urdu tweet is shown in Figure 2. Nonetheless, manual annotation of the text was quite time intensive. Therefore, we opted for bootstrapping, a form of semi-supervised learning which creates annotated training data from large amounts of unannotated data [34], to speed up the annotation process, as discussed in the next section.
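Annotated tweets must be serialized into the format the tagger expects for training. A sketch of converting (token, tag) pairs into the one-sentence-per-line word/tag format commonly used with the Stanford tagger (the separator is configurable through its `tagSeparator` property; the underscore shown here is an assumption, not necessarily the setting used in this work):

```python
def to_training_line(tagged_tweet, sep="_"):
    """Serialize one annotated tweet as 'token_TAG token_TAG ...'."""
    return " ".join(f"{token}{sep}{tag}" for token, tag in tagged_tweet)

def write_training_file(tweets, path):
    """Write one annotated tweet per line as the tagger's training file."""
    with open(path, "w", encoding="utf-8") as f:
        for tweet in tweets:
            f.write(to_training_line(tweet) + "\n")
```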

Bootstrapping
We performed five-fold cross validation on the manually annotated dataset to check its consistency and correctness. We divided the dataset into five complementary sets, each with one validation file (gold standard file) containing 10% of the texts and one training file containing the remaining 90%. Each set was evaluated by training an instance of the Stanford POS tagger on the training file and then tagging the validation file with the trained model (the outcome being the test file). The test file was then evaluated against the gold standard file. This was done five times, and the average gives the final five-fold cross validation result, shown in Table 2. The average score of five-fold cross validation gave us the baseline score, which was then used to evaluate all subsequent models.

For the bootstrapping experiments, the corpus of 500 manually annotated tweets (12,723 tokens) was split into a seed training set of 300 tweets (8034 tokens) and development (2383 tokens) and test (2306 tokens) sets of 100 tweets each. To prevent subjective and accidental bias, a further 500 tweets (13,643 tokens) were sampled randomly from the corpus for the bootstrapping iterations. An initial POS tagger model was trained using the Stanford POS tagger on the 300-tweet training data and evaluated against the development set to establish the model's accuracy at this stage. Then, five iterations of the bootstrapping experiment were performed, using the model to tag 100 sentences at each iteration. At the end of each iteration, the tagged tweets were manually corrected and added to the training set to retrain a new model of the tagger. The newly trained model was then used to tag the next 100 sentences, and the accuracy of the model was checked against the development set using precision, recall and F-measure.
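The five-fold split described above can be sketched as follows (a minimal illustration; training and evaluating the Stanford tagger itself are outside the scope of this snippet):

```python
def five_fold_splits(tweets, k=5):
    """Yield (train, validation) pairs for k-fold cross validation.

    Each fold holds out a contiguous ~1/k slice as the gold-standard
    validation file and uses the remaining tweets as the training file.
    """
    n = len(tweets)
    for i in range(k):
        start, stop = i * n // k, (i + 1) * n // k
        validation = tweets[start:stop]
        train = tweets[:start] + tweets[stop:]
        yield train, validation
```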
The overall bootstrapping process is described in the following steps:

1. Divide the manually tagged gold standard corpus of 500 sentences into a training set (300 sentences), a development set (100 sentences) and a test set (100 sentences).
2. On the training set, train the initial Stanford tagger model and evaluate its accuracy against the development set.
3. Use the baseline model of step 2 to tag 100 sentences and correct the output manually.
4. Train a new tagger model by adding the 100 automatically tagged and manually rectified sentences to the training set.
5. Use the new tagger model to tag an additional 100 sentences and check model accuracy against the development set.
6. At the end of the fifth iteration, check the final model's accuracy against the test set.
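The iterative loop above can be sketched as follows. `train_tagger`, `tag` and `manually_correct` are stand-ins for training a Stanford tagger model, tagging with it, and the human correction pass; they are not real APIs:

```python
def bootstrap(train_set, unlabeled_batches, train_tagger, tag, manually_correct):
    """Iteratively grow the training set with corrected tagger output.

    train_set: list of annotated sentences (the 300-tweet seed set).
    unlabeled_batches: iterable of raw-sentence batches (100 each).
    The three callables stand in for model training, tagging and
    manual correction; each iteration retrains on the enlarged set.
    """
    model = train_tagger(train_set)      # step 2: initial model
    for batch in unlabeled_batches:      # steps 3-5, repeated per batch
        auto_tagged = [tag(model, sentence) for sentence in batch]
        corrected = manually_correct(auto_tagged)
        train_set = train_set + corrected
        model = train_tagger(train_set)  # step 4: retrain on enlarged set
    return model, train_set
```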

Discussion
The results of the bootstrapping experiments are presented in Table 3. The precision of the initial model (84.3%) was higher than the average precision (82.6%) of five-fold validation. However, the recall (80.4%) and f-measure (82.3%) of the initial model were lower than the average five-fold scores (83.3% and 82.9%, respectively); still, the results of the initial evaluation were encouraging. At every iteration, we observed a steady increase in the accuracy of the newly induced models, and the maximum-scoring model occurred at the fifth iteration, attaining 92.5% precision, 93.5% recall and 93% f-measure for the Stanford tagger. In the final evaluation, the test set showed tendencies similar to those of the development set, with 93.8% precision, 92.9% recall and 93.3% f-measure. The total percentage of errors made by the Stanford tagger on the test set of 2306 tokens was 12.4%. Based on our analysis, we categorized the sources of POS tagging errors into three major categories, shown statistically in Figure 3: low-frequency words, unseen words and ambiguous words.
It is well known that low-frequency words are more difficult to learn and accurately predict. In our test set, one source of tagging errors was low-frequency words, i.e., English words transliterated in Urdu such as ‫"ﭨﺮﻭﻝ"‬ (troll) and ‫"ﺍﻭﺳﻢ"‬ (awesome), slang, hashtags and links. Additionally, frequent misclassification of emoticons and, in some cases, punctuation was also observed. These errors were mostly due to sequences of multiple emoticons and punctuation marks (?????) in the test set, which were absent from the training data, where emoticons and punctuation always occur in isolation. The error rate for low-frequency words is 3.4%.
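The precision, recall and f-measure figures reported throughout this section are related by the standard harmonic-mean formula; a minimal sketch (the function is illustrative, not from the paper's implementation):

```python
def f_measure(precision, recall):
    """F1 score: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Final test-set scores reported for the Stanford tagger.
precision, recall = 0.938, 0.929
print(round(f_measure(precision, recall) * 100, 1))  # 93.3, the reported f-measure
```

This confirms the reported 93.3% f-measure is consistent with the 93.8% precision and 92.9% recall.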
Unseen words are words that do not occur in the training data but are present in the test set. The majority of these words in our test set are named entities, emoticons, slang, English words transliterated into Urdu, etc. On these unseen words, the error rate of the Stanford tagger is 4.4%.
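A common mitigation for unseen words, which the error analysis above motivates, is a suffix-based back-off guesser. The sketch below is hypothetical (the paper does not describe the Stanford tagger's internal unknown-word model); the suffix table and the NOUN fallback are illustrative assumptions:

```python
def guess_tag(word, suffix_tags):
    """Back-off tagger for unseen words: try the longest matching known
    suffix, then fall back to NOUN, the open class most unknown tokens join."""
    for i in range(len(word)):          # word[0:] is the longest candidate
        if word[i:] in suffix_tags:
            return suffix_tags[word[i:]]
    return "NOUN"

# Hypothetical suffix table; a real one would be learned from training data.
SUFFIXES = {"ing": "VERB", "ly": "ADV"}
print(guess_tag("trolling", SUFFIXES))  # VERB
print(guess_tag("zindabad", SUFFIXES))  # NOUN (no known suffix)
```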
The largest source of tagging errors in our test set was ambiguous words, with a 4.6% error rate for the Stanford tagger. Words become ambiguous for several reasons. Firstly, Urdu tweets are written conversation, and consequently some words have multiple written versions. Some examples are: ‫ﺍﻣﺮﻳﮑہ"‬ vs. ‫,"ﺍﻣﺮﻳﮑﺎ‬ ‫ﻟﻴﮓ"‬ ‫ﻥ‬ vs. ‫ﻟﻴﮓ‬ ‫,"ﻧﻮﻥ‬ ‫ﺳﻌﻮﺩﻳہ"‬ vs. ‫ﻋﺮﺏ‬ ‫,"ﺳﻌﻮﺩی‬ etc. Similarly, the insertion of unnecessary spaces also confuses the tagger. Two such cases are the words ‫"ﻧﺎﺍﮨﻞ"‬ and ‫,"ﺑﮯﺿﺮﺭ"‬ which are adjectives (ADJ) but were wrongly tagged as two separate words, PART and NOUN, because of a space inserted between their parts. Another case of tagger confusion is where two or more words are joined together, such as "‫ﻟ ﺍﺋ ﻴ ﮯ‬" written as "‫ﺍﺳﻠ ﻴ ﮯ‬", causing PRON and ADP to be mistakenly marked as PRON. The same happened for the words ‫ﻁﺮﺡ"‬ ‫"ﺍﺋ‬ written as ‫,"ﺍﺳﻄﺮﺡ"‬ causing the form to be tagged as NOUN instead of DET and NOUN. Similarly, there are issues with spelling mistakes and the treatment of punctuation and other special characters. One spelling mistake that frequently occurred in the test set was the SCONJ ‫"ﮐہ"‬ written as ‫,"ﮐﮯ"‬ causing the tagger to tag it as ADP.
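One pragmatic mitigation for such multiple written versions of the same word is a small canonicalization table applied before tagging. The sketch below is a hypothetical illustration (the paper does not prescribe such a table; the romanized variant map stands in for the Urdu-script pairs discussed above):

```python
# Hypothetical variant map, romanized for readability; a real table would
# map Urdu-script variants (e.g., the two spellings of "America") onto
# one canonical written form.
VARIANTS = {
    "amreeka": "amrika",  # alternative spelling of "America"
    "kay": "kah",         # ADP spelling mistyped for the SCONJ
}

def normalize(tokens):
    """Replace known variant spellings with their canonical form."""
    return [VARIANTS.get(t, t) for t in tokens]

print(normalize(["amreeka", "zindabad"]))  # ['amrika', 'zindabad']
```

Normalizing before tagging collapses the variants into a single training-time form, so the tagger sees each word in only one spelling.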
Another difficulty that the tagger faced was confusion between words that have the same word form but multiple meanings depending on usage. One such case is the word ‫,"ﺗﻮ"‬ which occurs both as a particle and as a pronoun. The same is the case for the DET " " in " " and the CCONJ " ". Additionally, the most frequent mistakes encountered in tagging were confusions between proper nouns (PROP), common nouns (NOUN) and adjectives (ADJ). This is a common mistake in tagging, since there are many nouns that occur both as proper nouns and as common nouns. For example, in Urdu " ", " ", " ", etc., can be proper nouns as well as common nouns.
Overall, the results in Table 3 confirm that bootstrapping POS taggers is useful. Compared to manual annotation from scratch, automatic tagging followed by manual correction requires much less time and effort, and the final induced model achieved satisfactory results.
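The bootstrapping procedure evaluated above can be sketched as the following loop. The toy most-frequent-tag trainer is a stand-in, purely for illustration; the paper trains the Stanford tagger at each iteration, and the `correct` step is the manual correction pass:

```python
from collections import Counter, defaultdict

def train_tagger(corpus):
    """Toy stand-in for a real trainer (e.g., the Stanford tagger):
    learn the most frequent tag per word; unknown words default to NOUN."""
    counts = defaultdict(Counter)
    for sent in corpus:
        for word, tag in sent:
            counts[word][tag] += 1
    def model(words):
        return [(w, counts[w].most_common(1)[0][0] if w in counts else "NOUN")
                for w in words]
    return model

def bootstrap(seed_corpus, unlabeled_batches, correct):
    """Train on the seed, auto-tag each new batch, apply a (manual)
    correction pass, fold the corrected batch in, and retrain."""
    corpus = list(seed_corpus)
    model = train_tagger(corpus)
    for batch in unlabeled_batches:       # one batch per iteration
        auto = [model(words) for words in batch]
        corpus.extend(correct(auto))      # corrected batch joins the corpus
        model = train_tagger(corpus)      # induce the next model
    return model

# Tiny demo: the batch teaches the model a word absent from the seed.
seed = [[("good", "ADJ"), ("day", "NOUN")]]
batches = [[["good", "morning"]]]
model = bootstrap(seed, batches, correct=lambda auto: auto)  # no edits needed
print(model(["good"]))  # [('good', 'ADJ')]
```

Each iteration grows the training set at the cost of a correction pass rather than annotation from scratch, which is where the time saving reported above comes from.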

Conclusions
In this paper, a new POS-tagged dataset constructed from Urdu tweets is presented along with its tagging scheme, thereby expanding Urdu NLP research to the processing of Urdu social media text. We performed an experiment in which we evaluated the performance of two pre-trained Urdu taggers on well-edited Urdu text as well as on Urdu tweets. The results showed a significant decrease in the performance of these taggers on Urdu tweets, thereby highlighting the need for tools and resources specific to this domain. We report on the development of a manually tagged dataset of 500 Urdu tweets, the consistency of which was evaluated using five-fold cross-validation. We also produced a trained model for the Stanford POS tagger achieving 93.8% precision, 92.9% recall and 93.3% f-measure. Further, we showed how bootstrapping can be used to compensate for the lack of annotated data for a less-resourced language.
The POS-tagged corpus developed in this research is publicly available [35] for the research community. We plan to include data from other social media platforms in order to create a more balanced corpus. We used a normalization module in our pre-processing stage, but error analysis of our corpus showed that a customized normalization model designed for the requirements of noisy data is needed, just like the tokenizer we developed for such data. Additionally, we plan to investigate and compare the performance of other statistical taggers on our dataset in the future.

ADV: Adverb
An adverb modifies a verb, an adjective or another adverb. The UNT tagset has two types of adverb: general adverb (ADV) and negation (NEG). Examples include "…/away".

CONJ: Coordinating Conjunction
These are words that are used to join two independent clauses in a sentence. Some examples are "‫/ﺍﻭﺭ‬and", "‫ﻧﻴﺰ‬/also", "‫/ﺗﺎﮨﻢ‬however".

DET: Determiner
In Urdu, determiners are not considered separate word classes and are treated as demonstrative pronouns. However, determiners are terms that narrow down the reference of the following noun in the scope of a conversation, whereas demonstrative pronouns replace a noun in a sentence. Examples of DET are: "‫ﻳہ‬/this", "‫ﮐﺌﯽ‬/many", …
