AraSenCorpus: A Semi-Supervised Approach for Sentiment Annotation of a Large Arabic Text Corpus

Abstract: At a time when research in the field of sentiment analysis tends to study advanced topics in languages such as English, other languages such as Arabic still suffer from basic problems and challenges, most notably the limited availability of large corpora. Furthermore, manual annotation is time-consuming and difficult when the corpus is large. This paper presents a semi-supervised self-learning technique to extend an Arabic sentiment-annotated corpus with unlabeled data. We use a neural network to train a set of models on a manually labeled dataset containing 15,000 tweets. We use these models to extend the corpus to a large Arabic sentiment corpus called "AraSenCorpus". AraSenCorpus contains 4.5 million tweets and covers both modern standard Arabic and some of the Arabic dialects. A long short-term memory (LSTM) deep learning classifier is used to train and test the final corpus. We evaluate our proposed framework on two external benchmark datasets to verify the improvement in Arabic sentiment classification. The experimental results show that our corpus outperforms the existing state-of-the-art systems.


Introduction
Several tasks in natural language processing require annotated corpora for training, for evaluating methods, and for comparing different systems [1]. The process of manual annotation of corpora is usually costly and becomes prohibitive when scaled to a larger dataset [2]. For popular natural language processing tasks, such as sentiment analysis, we can find widely used corpora that serve as baselines for the approaches and methods proposed for the task. For example, SemEval 2017 [3] and the Arabic Sentiment Tweets Dataset (ASTD) [4] are used for evaluating state-of-the-art models because their annotation is reliable and both corpora contain a large number of annotated documents.
In the literature, there are several corpora for the task of Arabic sentiment analysis, but the high costs associated with manual annotation limit these resources to be either small or obtained through entirely automatic methods such as user ratings or Arabic sentiment lexicons. Furthermore, the data presented in these corpora are outdated, incomplete, or small.
In this paper, we introduce AraSenCorpus, a semi-supervised framework that annotates a large Arabic text corpus using a small set of manually annotated tweets (15,000 tweets) and extends it from a large set of unlabeled tweets (34.7 million tweets). This reduces the human effort in annotation and provides a middle ground between manual and automatic labeling of a large dataset. We used the FastText neural network [5] to expand the manually annotated corpus and a long short-term memory (LSTM) deep learning classifier to ensure the quality of the newly created corpus. The developed corpus is tested using a set of Arabic benchmark datasets. FastText is an open-source NLP library developed by Facebook AI. It is a fast tool for building NLP models and generating live predictions [6]. LSTM with word embeddings is used to perform the sentiment classification, as this classifier outperforms the traditional techniques in text classification [7]. The classifier performed well with embeddings, especially when dealing with the sentiment classification of Arabic dialects [8].
The rest of the paper is organized as follows: in Section 2, we present a literature survey of Arabic sentiment corpus construction and sentiment analysis and discuss their approaches and performance. In Section 3, we present our methodology for the corpus generation process and show in detail the data collection, the sentiment annotation, and statistical information about the corpus. In Section 4, we present the experimental results and analysis of the manual and automatic annotation on our corpus. In Section 5, we discuss our work and future plans, and in Section 6, we sum up our contributions and present the conclusion.

Related Work
The creation of corpora for sentiment analysis was largely addressed in the related work. Several types of research have been made using automatic labeling techniques due to the challenges involved in manual annotation such as training annotators, measuring the inter-annotator agreement, writing the annotation guidelines, and developing the annotation interfaces [9]. In the following two sections, we will present literature studies about sentiment annotation and classification approaches, with the main focus being on Arabic sentiment analysis studies.

Sentiment Annotation
Arabic text can be classified into three categories: (a) classical Arabic, the version in which the Quran (the holy book of Islam) is written in, (b) modern standard Arabic (MSA), which is used in all Arabic countries for official and formal purposes such as newspapers, schools, and universities, and (c) dialectical Arabic (DA), which is used in everyday life among the people in different regions. For sentiment analysis, most of the existing textual corpora are either MSA or DA. The sentiment annotation of Arabic text can be classified into three approaches: (1) automatic annotation, (2) semi-automatic annotation, and (3) manual annotation. These three annotation approaches are summarized in Figure 1. Several studies have been conducted using automatic techniques for corpus construction and annotation. For this purpose, three main techniques have been used: (1) automatic annotation based on rating reviews [10][11][12][13][14], (2) sentiment lexicon [15][16][17][18], and (3) external application programming interfaces (APIs) [19]. In the context of sentiment annotation based on rating reviews, a Large-Scale Arabic Book Reviews (LABR) corpus was proposed in [10] for a specific domain. This corpus is a collection of book reviews containing 63,257 book reviews that had been written by readers. Each review is on a scale from 1 to 5 based on the user's rating of books. The authors considered reviews with a score of 1 or 2 as negative, 4 or 5 as positive, and 3 as neutral. Another sentiment corpus is called "BRAD 1.0", a large sentiment corpus that was created in [12] and motivated by the LABR corpus in [10]. This corpus consists of 510,600 book reviews. It was collected from the goodreader.com website. The user's rating of each book review on a scale of 1-5 was used to annotate the corpus reviews by following [10]. Along with modern standard Arabic, the corpus also includes some dialects such as Egyptian dialect content. BRAD 2.0 is an extension of BRAD 1.0. 
It contains more than 200K extra reviews to account for more Arabic dialects [11]. To classify the reviews, they followed the same procedures as in [10,12]. User ratings are used to annotate each review in the corpus into positive, negative, or neutral classes. Similarly, among the presented corpora in [12,14], HARD is the most recent sentiment corpus for hotel reviews [14]. This corpus consists of more than 370,000 reviews. It was collected from the booking.com website and automatically annotated based on the review ratings, but on a scale of 1-10. Each review is annotated as positive, negative, or neutral based on the user's rating of the review. Similar work was done in [11]. The authors annotated a collection of reviews from different domains such as hotels, restaurants, movies, and product reviews. The reviews were extracted from different websites. The annotated corpus contains 33,000 reviews.
The second approach to automatic annotation is the use of sentiment lexicons to automatically annotate the textual corpora. In [15], an Algerian sentiment lexicon was automatically constructed by translating English sentiment lexicons. Based on the created sentiment lexicon, a corpus containing 8000 messages, which were written in Arabic and Arabizi (writing Arabic using English characters), was automatically annotated into positive and negative classes. Another sentiment corpus was proposed in [16]. It contains 151,548 tweets in modern standard Arabic (MSA) and Egyptian dialects. The tweets were classified into positive and negative classes. They relied on the corpus proposed in [10] by manually extracting and annotating a list of 4404 phrases that are commonly used in expressing sentiment, to be used in the annotation of their corpus. The annotation was performed based on the manually annotated phrases and the frequency of positive and negative words appearing in the text. Another work was done in [20] by collecting 10 million tweets from Twitter and using a list of positive and negative Arabic words as search keywords. The annotation was performed automatically by considering tweets containing positive search keywords as positive tweets and, similarly, tweets containing negative search keywords as negative tweets. They used the Twitter API to collect tweets during June and July 2017. Before classifying the tweets, they performed further text preprocessing steps to clean the collected dataset and obtain better results. The use of distant supervision is another approach for automatic annotation. TEAD is a dataset for Arabic sentiment analysis proposed in [17]. The dataset contains more than 6M tweets that were collected from Twitter using a list of emojis as search keywords.
After filtering the tweets, automatic annotation grounded in a lexicon-based approach was performed for sentiment analysis (Ar-SeLn sentiment lexicon). The annotation method in this research was evaluated manually by extracting 1000 tweets in each class.
The third approach to automatic annotation is the use of external APIs, such as a tool called "AYLIE", which was used to annotate a sentiment corpus for the Sudanese dialect [19]. The corpus was collected from Twitter. It contains 5456 tweets that were automatically classified into three classes: positive, negative, and neutral.
The semi-automatic approach is the second type of corpus annotation. AraSenTi-Tweet is a sentiment corpus that contains 17,573 tweets [21]. The corpus text is written in the Saudi dialect. A sentiment lexicon was used to collect tweets that contain such words. After preprocessing and cleaning the tweets, three annotators were employed to review the constructed corpus. Similar work was proposed in [22] for the Saudi dialect as well. They used a sentiment lexicon to collect the corpus text from Twitter. The extracted tweets were filtered and manually annotated into two different classes, positive and negative. The resulting corpus contains 4000 tweets. The semi-supervised annotation approach has been applied for other languages, such as the Brazilian Portuguese language. In [23], the authors extended a small sentiment corpus, which was annotated manually, to annotate a large unlabeled corpus. They used only one classifier to predict the classes of the unlabeled documents and added those documents which have a confidence value above a predefined threshold. Another large unlabeled dataset of tweets was presented for the English language in [24]. A total of 384 million tweets were collected in the entirety of the year 2015. They used two semi-supervised learning approaches, self-learning and co-training, to annotate a huge collection of tweets.
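The threshold-based self-training idea described above can be illustrated with a minimal sketch. It is not the implementation used in [23] or in this paper: the tiny naive Bayes model stands in for whatever classifier is actually used, and the names `train_nb`, `predict_proba`, `self_train`, the threshold of 0.9, and the round limit are all hypothetical choices made for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a tiny multinomial naive Bayes model on whitespace-tokenized text."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    for doc, label in zip(docs, labels):
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def predict_proba(model, doc):
    """Return {label: posterior probability} for one document."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    log_scores = {}
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for word in doc.split():
            if word in vocab:  # words never seen in training are ignored
                # Laplace (add-one) smoothing
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        log_scores[label] = score
    top = max(log_scores.values())
    exp_scores = {k: math.exp(v - top) for k, v in log_scores.items()}
    norm = sum(exp_scores.values())
    return {k: v / norm for k, v in exp_scores.items()}


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Move confidently predicted documents from the pool into the labeled set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        model = train_nb([d for d, _ in labeled], [y for _, y in labeled])
        accepted, remaining = [], []
        for doc in pool:
            probs = predict_proba(model, doc)
            label, conf = max(probs.items(), key=lambda kv: kv[1])
            if conf >= threshold:
                accepted.append((doc, label))
            else:
                remaining.append(doc)
        if not accepted:  # nothing passed the threshold, so stop early
            break
        labeled.extend(accepted)
        pool = remaining
    return labeled, pool
```

Documents whose best class probability stays below the threshold remain in the pool, so a stricter threshold trades corpus size for label quality.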
The last approach for corpus annotation is the manual annotation approach. ASTD is an Arabic sentiment corpus that contains 10,000 tweets which were classified as positive, negative, mixed, and objective [19]. They used Amazon Mechanical Turk [25] to annotate the tweets in the dataset. The crowdsourcing technique is a method for manual annotation [26,27]. In [26], crowdsourcing was used to classify the tweets into two classes, positive or negative. This corpus contains 32,063 tweets written in the Saudi dialect. Human annotation is an approach related to manual annotation. ArSEntD-LEV is a Levantine dialect sentiment corpus that was proposed in [27]. The corpus contains 4000 tweets. It was annotated manually with different annotations including the overall sentiment of the tweet. The corpus was annotated with five-point scale classes: very positive, positive, neutral, negative, and very negative. The annotation process was manually carried out via crowdsourcing using the CrowdFlower platform [28]. In [29], a sentiment corpus SANA for the Algerian dialect was collected from the web. The corpus contains 513 comments which are classified into positive or negative classes. Two Algerian Arabic native speakers were employed to annotate the corpus. In [30], a sentiment and emotion corpus was constructed from Twitter. Three annotators engaged in the classification process where they labeled each tweet according to its sentiment polarity (e.g., positive, negative, and neutral). The corpus contains 5400 tweets. Another dialect sentiment corpus was proposed in [31] for Jordanian dialect tweets. The corpus was manually annotated into three classes-positive, negative, and neutral-by Arab Jordanian students. It contains 1000 tweets. A customer review corpus called "MASC" was collected from multiple websites and social media platforms [32]. It was manually annotated by two native speakers into positive and negative reviews. 
The corpus covered 15 different domains such as art and culture, bakeries and goodies, cafes, fashion, financial services, hotels, restaurants, etc. The corpus contains 8860 reviews. MSAC is a multi-domain sentiment corpus that covers different domains, such as sport, social issues, and politics [33]. The corpus contains 2000 tweets that were manually annotated into two classes: positive and negative. A Tunisian dialect sentiment corpus called "TSAC" was presented in [34]. The corpus contains 17,000 user comments from Facebook that were collected and annotated manually into two classes, positive and negative. The proposed corpus is a multi-domain corpus consisting of vocabulary from the education, social, and political domains. AWATIF, a multi-genre sentiment corpus, was introduced in [35]. The corpus contains 10,723 Arabic sentences retrieved from three resources: the Penn Arabic Treebank, web forums, and Wikipedia. The corpus was manually annotated as objective and subjective (both positive and negative). Another two Arabic sentiment corpora from Twitter were introduced in [36,37]. In [36], the corpus contains 2300 tweets that were manually annotated, while in [37], the corpus contains 2000 tweets. The tweets were classified as positive and negative by native annotators.
To summarize, several Arabic sentiment corpora have been developed for the Arabic sentiment classification task. Table 1 shows most of the existing Arabic sentiment corpora. The annotation of these corpora was conducted using three approaches: manual, automatic, and semi-automatic. The data of such corpora were collected either from social media platforms, customer reviews, or comments. The corpora based on social media texts are mostly generated from Twitter. It is observed that the size of automatically annotated corpora is larger than those which were manually or semi-automatically annotated. The semi-automatic annotation was performed using sentiment lexicon terms to extract data and manual annotation to annotate the extracted data. This paper contributes a semiautomatic Arabic sentiment corpus which contains 34.7 million tweets and spans 14 years. A semi-automatic approach was applied to annotate the corpus using a self-training approach which is, to our knowledge, not used in any of the existing Arabic sentiment corpora. This approach helps in building large-scale labeled datasets that span large periods of time, as reported in [24].

Sentiment Classification
The sentiment classification is the task of classifying a document into positive, negative, or neutral classes [38]. There are different sentiment classification approaches and tools used for the sentiment classification task on the document level that considers the whole document as a basic information unit. In this part, we will describe the approaches, features, and techniques used along with types of testing and validation of the existing sentiment corpora with a focus on the Arabic sentiment corpora.
From the reviewed literature, machine learning approaches (e.g., support vector machine, naïve Bayes, and logistic regression) are commonly used classifiers for sentiment classification, as shown in Table 2. These classifiers are suitable for Twitter data [4,15,21,24,25,28,33,34] and YouTube data [39]. Table 2 also lists lexicon-based approaches that rely on lexicon terms [10,13,30]. Many studies used deep learning models to recognize and understand data such as text, audio, and images. Long short-term memory (LSTM) and convolutional neural network (CNN) deep learning classifiers were used in [17]. They used both classifiers with a huge amount of training data to achieve better results. In [26], Bi-LSTM and LSTM achieved 94% and 92% accuracy, respectively, on a dataset of 32,063 tweets. CNN and LSTM are two deep learning models that were used in [33]. The models achieved F-measures of 95% and 93%, respectively, by using a word2vec pre-trained embedding as input features for both models. An online system for Arabic sentiment analysis called "Mazakak" was presented in [41]. The system was developed based on a CNN followed by an LSTM deep learning model. It achieved state-of-the-art results on two benchmark datasets, SemEval 2017 and ASTD. The proposed approaches represent documents using count-based, N-gram, TF-IDF, and word embedding features.
Lexicon-based approaches were also used to perform sentiment analysis. In [42], an Arabic lexicon was constructed and combined with a named entity recognition (NER) system to show the importance of including NER in the process of sentiment analysis. The experimental results, on different Arabic sentiment corpora, obtained by using NER outperformed the results obtained without using NER. NileULex is an Arabic sentiment lexicon for modern standard Arabic and Egyptian dialect [43]. The lexicon was used over two manually annotated datasets to ensure that the developed lexicon is useful for Arabic sentiment analysis. The authors reported that the use of their lexicon, which is manually constructed and annotated, gives better results in comparison with the translated or automatically constructed lexicons. In [44], the sentiment polarity was calculated by looking for the polarity of each term of the message in the constructed lexicon. They performed several preprocessing steps including tokenization, normalization, repeat letters and stop words removal, light stemming, and negation and intensification handling. Two Arabic sentiment corpora were used to perform classification experiments, and the best accuracy result was 70%.
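As a rough illustration of the lexicon lookup with negation and intensification handling described for [44], consider the following sketch. It is not the method of [44]: the function name `lexicon_polarity`, the English stand-in tokens, the negator set, and the intensifier weights are all hypothetical, and negation is assumed to affect only the next scored token.

```python
# Hypothetical stand-ins for Arabic negation particles and intensifiers.
NEGATORS = {"not", "no"}
INTENSIFIERS = {"very": 2.0}


def lexicon_polarity(tokens, lexicon):
    """Sum per-term polarities, flipping or scaling the term that follows
    a negator or an intensifier, and map the total to a class label."""
    score = 0.0
    negate = False
    boost = 1.0
    for tok in tokens:
        if tok in NEGATORS:
            negate = True
            continue
        if tok in INTENSIFIERS:
            boost = INTENSIFIERS[tok]
            continue
        value = lexicon.get(tok, 0) * boost  # unknown terms contribute 0
        score += -value if negate else value
        negate, boost = False, 1.0  # modifiers apply to one token only
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

For example, with a toy lexicon `{"comfortable": 1}`, the tokens `["not", "comfortable"]` are scored as negative because the negator flips the positive term.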

Data Collection Process
This section describes the process of data collection and the challenges we faced during the data gathering. It also presents our data collection methodology, representatives of our collected data, characteristics, and distribution of the sentiment corpus, and the potential applications of AraSenCorpus.

Challenges in Data Collection from Twitter
Twitter serves around 330 million monthly active users [45] and about 500 million tweets per day. While Twitter is public and tweets are viewable and searchable by anyone around the world, collecting Twitter data poses specific challenges. For example, the standard API only allows retrieving tweets from up to 7 days ago and limits the number of tweets that can be retrieved per 15-minute window. To overcome this problem, we developed a Python script that can retrieve tweets over a long period.

Challenges in Using Twitter as a Data Source
Using Twitter as a data source in academic research leads to some challenges that may be faced, such as:

• Ethical issues: Reproducing tweets in an academic publication has to be handled with care, especially concerning tweets related to sensitive topics. In our research, our topic targets sentiment analysis for Arabic text where the extracted tweets are not sensitive.
• Legal issues: Under Twitter's API Terms of Service, it is prohibited to share Twitter datasets. Our corpus will be available for the research community by sharing the IDs of tweets only, which can be used by other researchers to obtain the tweets, along with their sentiment labels.
• Retrieving datasets: Using certain keywords may not retrieve all of the tweets related to a topic. In our research, it is also a challenge to build a sentiment corpus by searching for a limited number of keywords in Arabic, while there are many Arabic dialects as well. We overcame this problem by collecting sentiment terms/phrases from multiple Arabic sentiment lexicons in modern standard Arabic and Arabic dialects.
• Cost: Twitter data cost a lot of money when they are obtained from a licensed reseller of Twitter data. It is also difficult to obtain Twitter data using the free API. However, our developed system can obtain a large amount of data.
• Spam: Twitter attracts a large amount of spam. In our research, we found a lot of tweets that contained sexual expressions that had to be excluded from the sentiment corpus.

Data Collection Methodology
To build the AraSenCorpus corpus, we selected Twitter as the data source for data collection. Twitter is a rich platform to learn about people's opinions and sentiments on different topics as they can share their opinions and thoughts. Twitter is considered a rich resource for sentimental text, containing views on many different topics: social issues, politics, business, economics, etc. A list of sentiment terms/phrases was obtained from Arabic sentiment lexicons in modern standard Arabic and Arabic dialects, as shown in Table 4. The list was verified and considered to represent search keywords to obtain tweets from Twitter social media platform. The selection of data sources and search query keywords is a very important part of the study.
For data collection, the list of sentiment terms/phrases helped us to ensure that the collected tweets are diversified and representative of sentiment tweets from different topics. The collection system used these sentiment terms/phrases and retrieved tweets spanning from 2007 to 2020. The statistics about the collected tweets are shown in Table 3. To collect a large corpus while ensuring that it covered many sentiment terms and phrases, five sentiment lexicons were used to extract the corpus tweets. The lexicons used in this research were: (1) 230 Arabic words (modern standard Arabic) [46], (2) Large Arabic Resources for Sentiment Analysis (modern standard Arabic) [47], (3) MorLex lexicon (modern standard Arabic and Egyptian dialect), (4) NileULex lexicon (modern standard Arabic and Egyptian dialect) [48], and (5) Arabic senti-lexicon (modern standard Arabic) [49]. Details about these lexicons are shown in Table 4.

Corpus Cleaning and Preprocessing
The text of tweets is known to be noisy and should be cleaned and preprocessed before performing sentiment classification in order to get better results. While collecting tweets, those that contained URLs, hashtags, mentions, or media were cleaned before being added to the corpus. The tweets were then tokenized and normalized. Normalization is the process of unifying the shape of Arabic letters that have different shapes. For example, the Arabic letters ( , , , ) are normalized to convert multiple shapes of the letter to one shape, the different forms of "Alef" ( , , ) are converted into ( ), the different forms of "Ya'a" ( , ) are converted into ( ), the letter "Ta'a" ( ) is converted into ( ), and the letters and are converted to ( ). Non-Arabic characters such as (!, -, &, *) are removed by iterating over all the words of a tweet to remove the noise from the text.
Appl. Sci. 2021, 11, 2434
Repeated characters add noise and make the mining process more difficult. For example, a word may be written like " " instead of " ", which means "wonderful". To resolve this, we restored the word to its correct form by removing the extra repeated characters. Diacritics are rarely used on social media websites and, in most cases, serve only to decorate the text. While preprocessing, we removed all diacritics from the tweets. Furthermore, duplicated tweets were excluded from the corpus in the collection phase by checking each tweet's ID and discarding the tweet if it already existed in the corpus.
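The cleaning steps in this section can be sketched roughly as follows. This is only an illustration, not the authors' script: the exact letter forms are not legible in this copy of the text, so the letter mapping below is an assumption based on common Arabic normalization practice, and the function name `normalize_tweet` and the "three or more repeats" rule are hypothetical choices.

```python
import re

# ASSUMED mapping (common practice; the source letters are not legible here).
_LETTER_MAP = str.maketrans({
    "\u0623": "\u0627",  # alef with hamza above -> bare alef
    "\u0625": "\u0627",  # alef with hamza below -> bare alef
    "\u0622": "\u0627",  # alef with madda       -> bare alef
    "\u0649": "\u064A",  # alef maqsura          -> ya
    "\u0629": "\u0647",  # ta marbuta            -> ha
})
_DIACRITICS = re.compile("[\u064B-\u0652\u0640]")  # short vowels etc. plus tatweel
_REPEATS = re.compile(r"(.)\1{2,}")                # same character 3+ times in a row
_NON_ARABIC = re.compile("[^\u0621-\u064A\\s]")    # anything but Arabic letters/space


def normalize_tweet(text):
    """Strip diacritics, unify letter shapes, collapse elongated characters,
    and drop non-Arabic characters from one tweet."""
    text = _DIACRITICS.sub("", text)
    text = text.translate(_LETTER_MAP)
    text = _REPEATS.sub(r"\1", text)
    text = _NON_ARABIC.sub(" ", text)
    return " ".join(text.split())
```

Collapsing only runs of three or more characters is a conservative choice, since legitimately doubled letters are left untouched.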

Corpus Characteristics and Potential Applications
The collected corpus contains more than 34.7 million tweets. We used more than 22,000 terms/phrases to collect tweets from Twitter. These terms/phrases were extracted and manually verified from 5 different Arabic sentiment lexicons. The total number of tokens in the corpus exceeds 479 million tokens, while the unique number of tokens is more than 9 million tokens. More than 7.6 million users participated in the collected corpus. Table 3 shows statistics about the collected tweets.
Zipf's law states that the frequency of word tokens in a large corpus is inversely proportional to their rank [52]. The law states that if f is the frequency of a word in the corpus and r is the rank, then:
$$f = \frac{k}{r}$$
where k is a constant for the corpus. We follow [53] to verify Zipf's law by calculating the log of the frequency of word tokens and their ranking in our corpus. Figure 2 shows the Zipf's law curve of unigram frequencies and their ranking. The curve suggests that our corpus does not have anomalous biases.
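One simple way to check this relation numerically is to fit a line to log-frequency against log-rank; under Zipf's law the slope should be close to -1. The sketch below is an illustration only, not the procedure of [53]; the function name `zipf_slope` and the use of an ordinary least-squares fit are assumptions.

```python
import math
from collections import Counter


def zipf_slope(tokens):
    """Least-squares slope of log(frequency) against log(rank)."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct tokens")
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var
```

A slope far from -1, or a curve with sharp kinks in the log-log plot, would point to sampling artifacts such as heavy duplication.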
Most of the proposed Twitter sentiment corpora in the literature were collected within a few months [4,17,21,27], or less than two years [30]. This does not ensure the coverage of multiple topics in social media. Our corpus contains tweets that were collected over 14 years, starting from 2007 and ending in 2020. Table 5 shows the distribution of the collected tweets. The highest number of tweets in the corpus was collected in the year 2020, while a low number of tweets were collected in the year 2007. This reflects the growth in the number of social media users over time, which increases the amount of data on social media. Another reason is the use of the Twitter API, which retrieves tweets from the last 7 days, and that the collection started during the year 2020.
We believe that our proposed corpus can be used in different applications. In particular, the proposed sentiment corpus can be very helpful in (1) enhancing the existing Arabic sentiment analysis approaches and (2) building domain-specific pre-trained sentiment models (such as word2vec, GloVe, and BERT) for Arabic sentiment analysis.

Manually Annotated Dataset
Two Arabic native speakers were asked to annotate our manual corpus. A set of annotation guidelines was given to the annotators to achieve the best degree of consistency in the obtained results.
Annotation Guidelines: The annotation guidelines were defined to label tweets in our corpus. We first surveyed the existing work on annotation guidelines to define the baseline guidelines. Second, we improved these guidelines. Two annotators were asked to independently annotate 3000 tweets under the provided guidelines. A third annotator was employed to resolve ambiguous cases found during the annotation process. The ambiguous tweets included tweets with mixed sentiments in a single tweet and special cases such as sarcastic tweets. Three main aspects were formulated before performing the annotation.

1. What to annotate: Tweets that bear a positive or negative sentiment and tweets that do not bear any positivity or negativity (neutral tweets).
2. What not to annotate: Tweets containing both positive and negative sentiments.
3. How to handle special cases such as negations, sarcasm, or quotations.
The following are the guidelines given to the annotators for manual annotation:

3. Positive or negative situations or events, for example, a tweet translated as "corona virus caused heavy losses to most of the world countries". This tweet will be marked as negative since it has two negative terms, translated as "losses" and "fatal".
4. All objective tweets that do not contain any sentiment terms/phrases will be considered neutral and marked as "neutral".
5. Tweets containing both positive and negative terms/phrases with the same intensity will be marked as "mixed" and will not be considered in this research.
6. Negations before positive or negative terms/phrases flip the polarity of the sentiment. For example, "غير مريحة" (uncomfortable) is a phrase with a negation and a positive term. Tweets containing such phrases should be marked as negative tweets.
A small .NET application was designed to facilitate the annotation process, as shown in Figure 3.

Figure 3. Sentiment annotation interface. The English translation of the tweet is: Although the concept that he wishes to convey is very clear from the beginning of the article, it will draw you to read it to the end, with all eloquence, eloquence and elegance, a wonderful person.
Inter-annotator Agreement: To verify the completeness of the annotation guidelines, we gave 3000 tweets to two different annotators and asked them to annotate the given dataset. They independently annotated these tweets into "positive", "negative", and "neutral" classes based on the given annotation guidelines and using the annotation application shown in Figure 3. The inter-annotator agreement was 0.93 (using Cohen's kappa coefficient), which indicates almost perfect agreement. The remaining 12,000 tweets were divided equally between both annotators to speed up the annotation process. We included tweets that belonged to the positive, negative, and neutral classes only.
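To make the agreement figure above concrete, the sketch below computes Cohen's kappa as (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each annotator's label distribution. The function name `cohens_kappa` and the six toy labels are illustrative assumptions and do not come from the paper's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of tweets")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both annotators.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' label proportions, per class.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations of six tweets by two annotators.
annotator_1 = ["positive", "positive", "negative", "negative", "neutral", "neutral"]
annotator_2 = ["positive", "positive", "negative", "neutral", "neutral", "neutral"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 2))
```

In this toy case the annotators agree on five of six tweets, yet kappa is lower than the raw agreement because part of that agreement would be expected by chance.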

Semi-Supervised Annotated Corpus
Initially, we used the self-learning technique by following [54] to expand the manually annotated dataset (15,000 tweets). As shown in Figure 4, we trained three classifiers on the manually annotated dataset. These classifiers were built using the FastText neural network algorithm. This algorithm was selected as it achieves similar results to the machine/deep learning classifiers while training a lot faster, as reported in the initial paper [6]. Using this algorithm, we can train and test a model, predict sentiment classes of tweets, and predict the probability of tweets towards sentiment classes.
Each tweet from the unlabeled tweets passed through three classifiers to be annotated with the sentiment class and the probability of the assigned class. If a tweet had a confidence value greater than or equal to a threshold (e.g., 90%) in all classifiers, it was added to the training data and removed from the unlabeled data. We repeated this process three times to increase the size of the labeled data. Moreover, we checked the quality of the labeled data using a set of benchmark datasets after every iteration.
AraSenCorpus differs from other semi-supervised approaches in the literature. Usually, the authors use a sentiment lexicon to annotate unlabeled data and then manually revise a sample of the annotated data. Since revising the annotation in this way takes more time, we propose a self-learning approach to automate the annotation and reduce human effort, as shown in Algorithm 1. The intuition behind AraSenCorpus is that manual annotation is necessary for subjective tasks such as sentiment analysis and should be part of the process. In addition, the iterative addition of new tweets provides new information for the classifiers, resulting in better classification models for labeling the remaining unlabeled tweets.
Algorithm 1:
for each iteration do
    train Model1, Model2, and Model3 on TrainingSet;
    for each tweet t in UnlabeledData do
        //predict most likely sentiment probabilities of t from model1, model2, and model3
        P1 = Model1.predict-prob(UnlabeledData(t));
        P2 = Model2.predict-prob(UnlabeledData(t));
        P3 = Model3.predict-prob(UnlabeledData(t));
        if P1, P2, and P3 are all greater than or equal to the threshold then
            TrainingSet.Add(t);
            UnlabeledData.Remove(t);
        end
    end
end
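The self-learning loop described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: `TokenVoteClassifier` is a toy stand-in for the FastText models, the three models differ only in a hypothetical smoothing value, and the requirement that all three models agree on the label is an added assumption (the text only states that the confidence must reach the threshold in all classifiers).

```python
from collections import Counter, defaultdict


class TokenVoteClassifier:
    """Toy stand-in for a FastText model: scores labels by smoothed token counts."""

    def __init__(self, smoothing):
        self.smoothing = smoothing
        self.token_counts = defaultdict(Counter)
        self.labels = []

    def fit(self, examples):
        self.token_counts = defaultdict(Counter)
        self.labels = sorted({label for _, label in examples})
        for text, label in examples:
            self.token_counts[label].update(text.split())
        return self

    def predict_prob(self, text):
        """Return (best_label, normalized score of best_label)."""
        scores = {}
        for label in self.labels:
            score = 1.0
            for token in text.split():
                score *= self.token_counts[label][token] + self.smoothing
            scores[label] = score
        best = max(scores, key=scores.get)
        return best, scores[best] / sum(scores.values())


def self_train(training_set, unlabeled, threshold=0.9, iterations=3):
    """Move confidently labeled tweets from the unlabeled pool to the training set."""
    training_set = list(training_set)
    unlabeled = list(unlabeled)
    for _ in range(iterations):
        # Retrain the three models on the current (expanded) training set.
        models = [TokenVoteClassifier(s).fit(training_set) for s in (0.01, 0.05, 0.1)]
        still_unlabeled = []
        for text in unlabeled:
            predictions = [model.predict_prob(text) for model in models]
            labels = {label for label, _ in predictions}
            confident = all(prob >= threshold for _, prob in predictions)
            # Assumption: besides the threshold, the three models must agree.
            if confident and len(labels) == 1:
                training_set.append((text, labels.pop()))
            else:
                still_unlabeled.append(text)
        unlabeled = still_unlabeled
    return training_set, unlabeled


# Hypothetical seed annotations and unlabeled pool.
seed = [("great wonderful", "positive"), ("awful terrible", "negative")]
pool = ["great day", "terrible day", "wonderful great", "day"]
expanded, remaining = self_train(seed, pool)
print(len(expanded), remaining)
```

Tweets that never reach the threshold, such as the uninformative "day" in this toy pool, simply stay unlabeled, which is how the threshold limits error propagation.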

Dataset
We prepared our dataset to train a deep learning model and tested the outcomes of the model using external benchmark datasets. Statistics about our corpora are shown in Table 6. We performed undersampling on the dataset by randomly removing tweets from the majority class to match the size of the minority class. We considered 1 million tweets in each sentiment class to make the dataset balanced. We evaluated AraSenCorpus using external benchmark Arabic sentiment corpora after each iteration. The benchmark datasets were SemEval 2017 and ASTD. The SemEval 2017 dataset contains tweets written in Arabic dialects and is one of the most popular benchmarks for Arabic sentiment classification. ASTD is another dataset that contains 10,000 tweets written in modern standard Arabic and the Egyptian dialect. Because we were performing two-way and three-way sentiment classification, we included only the tweets/reviews that were classified as positive, negative, or neutral. The statistics of these datasets are presented in Table 7.
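The random undersampling step described above could look like the following sketch. The function name `undersample`, the `per_class` cap (standing in for the 1 million tweets per class mentioned in the text), and the toy data are assumptions for illustration only.

```python
import random
from collections import Counter, defaultdict


def undersample(examples, per_class=None, seed=0):
    """Randomly drop examples so that every class keeps the same number of items."""
    by_label = defaultdict(list)
    for text, label in examples:
        by_label[label].append((text, label))
    smallest = min(len(items) for items in by_label.values())
    # Optionally cap the class size, never exceeding the minority class size.
    keep = smallest if per_class is None else min(per_class, smallest)
    rng = random.Random(seed)
    balanced = []
    for label in sorted(by_label):
        balanced.extend(rng.sample(by_label[label], keep))
    rng.shuffle(balanced)
    return balanced


# Hypothetical unbalanced dataset: 5 positive, 3 negative, 2 neutral tweets.
toy = ([(f"pos {i}", "positive") for i in range(5)]
       + [(f"neg {i}", "negative") for i in range(3)]
       + [(f"neu {i}", "neutral") for i in range(2)])
balanced = undersample(toy)
print(Counter(label for _, label in balanced))
```

Fixing the random seed keeps the sampled subset reproducible across runs, which matters when results on the balanced and unbalanced variants are compared.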

Experimental Setup
In this research, we performed two-way (positive and negative) and three-way (positive, negative, and neutral) sentiment classification. We trained a model using the LSTM deep learning classifier and tested the developed model on the benchmark datasets. The model was used with the hyper-parameter values as listed in Table 8.

Evaluation Metrics
All the results are reported using the F1-measure as follows:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$
where TP is the number of correctly predicted positive tweets (the actual class is positive and the predicted class is also positive), FP is the number of tweets whose actual class is negative but whose predicted class is positive, and FN is the number of tweets whose actual class is positive but whose predicted class is negative. F1 is usually more useful than accuracy when the class distribution is uneven, as it is here, because it takes both precision and recall into account. In our case, the F1 score is used to measure the results and compare them with state-of-the-art systems.
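As a small worked illustration of the F1-measure defined above, the sketch below counts TP, FP, and FN for one class and combines precision and recall. The function name `f1_score` and the toy labels are assumptions; how the paper averages F1 across classes is not stated, so only the single-class case is shown.

```python
def f1_score(true_labels, predicted_labels, positive_class):
    """F1 for one class, computed from TP, FP, and FN counts."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(t == positive_class and p == positive_class for t, p in pairs)
    fp = sum(t != positive_class and p == positive_class for t, p in pairs)
    fn = sum(t == positive_class and p != positive_class for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical gold labels and predictions for five tweets.
gold = ["positive", "positive", "positive", "negative", "negative"]
predicted = ["positive", "positive", "negative", "positive", "negative"]
score = f1_score(gold, predicted, positive_class="positive")
print(round(score, 3))
```

Here TP = 2, FP = 1, and FN = 1, so precision and recall are both 2/3 and F1 is 2/3 as well.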

Results
To evaluate the AraSenCorpus framework, we ran it with the manually annotated dataset (15,000 tweets) as the labeled input and extended it with the unlabeled dataset (more than 34 million tweets). In the last iteration, the full corpus contained more than 3.2 million and 4.5 million tweets for two-way and three-way sentiment classification, respectively. After obtaining each of the expanded annotated datasets, we trained an LSTM deep learning model and tested it using the two benchmark datasets. The aim of these experiments was to show the quality of the proposed semi-automatic annotation approach.
The evaluation of AraSenCorpus was carried out using two benchmark datasets, SemEval 2017 and ASTD, to assess the effectiveness of the semi-supervised annotation. All experiments on these datasets were performed using two-way classification (positive and negative classes) and three-way sentiment classification (positive, negative, and neutral) using both balanced and unbalanced datasets. Figure 5 shows the F1-score results for two-way sentiment classification over the two benchmark datasets after expanding the manually labeled dataset. Adding more tweets labeled with high confidence leads to better results. For example, classification with the manually annotated dataset (15,000 tweets) alone gives a lower score than the results obtained in the third iteration (3.2 million tweets), using both balanced and unbalanced datasets. The last iteration shows a clear improvement overall on the benchmark datasets.
Figure 6 shows the F1-score results for three-way sentiment classification. The results obtained in the third iteration are higher than those obtained with the manual, first-iteration, and second-iteration datasets. The results of two-way sentiment classification are higher than those of three-way sentiment classification, which is due to the presence of the neutral class in the three-way task.
The above results show the performance of our system on the benchmark datasets. Next, we compare the best results obtained by our system with previous studies in both two-way and three-way classification, as shown in Tables 9 and 10, respectively. For the two-way classification, our system outperforms two recent studies on all datasets, as shown in Table 9. Our system improves the sentiment classification results from 80.37% to 87.4% on the SemEval 2017 dataset and from 79.77% to 85.2% on the ASTD dataset. In both tables, the bold values represent the best obtained results.
In the three-way classification, our system also achieves the best results, as shown in Table 10. Our system achieves an F1-score of 69.4% on the SemEval 2017 dataset, while the best previous system achieves 63.38%. It also achieves 68.1% on the ASTD dataset, while the best performing previous system achieves 64.10%.

Discussion and Future Work
The main contribution of our research is a large-scale Arabic sentiment corpus named "AraSenCorpus". The techniques introduced in building our corpus help to reduce human effort and automate the annotation process. The corpus contains 4.5 million tweets and achieves promising results when compared with state-of-the-art systems on the same benchmark datasets.
Two-way sentiment classification using our corpus improves the results by about 7 and 5 percentage points on the SemEval 2017 and ASTD benchmark datasets, respectively. Three-way classification using our corpus improves the results by about 6 and 4 percentage points on the same datasets, respectively. The tweets in our corpus are mostly subjective, as our corpus was collected using terms/phrases from Arabic sentiment lexicons. From the results and analysis, we observe that the two studies with which we compared our results used a mix of subjective and objective tweets. In addition, our collected tweets span 14 years to ensure the coverage of different topics, while in the two studies used in the comparison, the tweets span 2 months and 3 years, respectively. This semi-supervised learning approach suggests that expanding a small set of carefully annotated samples leads to better results, as shown with the presented corpus. As the expanded corpus has millions of tweets, we used an LSTM deep learning classifier, which gives better results when sufficient data are available, as in our corpus.
AraSenCorpus is limited by the weaknesses of self-learning, such as skewed class distributions and error propagation. To mitigate these problems, we added to the training set only those tweets from the unlabeled dataset whose confidence was at or above a certain threshold. Alternative semi-supervised approaches, such as co-training, could also be used.
Improvements to AraSenCorpus could include improvement of the guidelines of the manual annotation, which could improve the classification results. Further preprocessing steps will be added to improve the overall performance of the sentiment classification.

Conclusions
In this paper, we present AraSenCorpus, a semi-supervised framework to annotate a large corpus for Arabic sentiment analysis. In doing so, we use the self-learning approach by training three neural network models to expand the training set with new data after applying a certain confidence threshold. We use a manually annotated dataset containing 15,000 tweets to annotate more than 4 million tweets. The sentiment classification results obtained using the LSTM deep learning classifier trained on AraSenCorpus outperform the existing state-of-the-art models. The text of the corpus is a mix of modern standard Arabic and some of the Arabic dialects, including Gulf, Yemeni, Egyptian, Iraqi, and Levantine dialects.
The proposed sentiment corpora in the literature are either small in size, not publicly available, or were annotated using fully automatic annotation methods. We overcome such issues by offering a large-scale sentiment corpus that is freely available for research purposes. The semi-supervised annotation method helps in automating the annotation and reducing the human effort during the annotation process.
As there is a lack of large sentiment corpora in the Arabic language, our corpus contributes to tackling the scarcity of Arabic corpora for sentiment analysis. The developed corpus contains more than 4.5 million tweets that were labeled into three classes (positive, negative, and neutral).
In the future, we plan to add more dialects, such as Sudanese, Moroccan, Algerian, and Tunisian dialects, by training models on freely available sentiment corpora in these dialects. We will also experiment with more classification algorithms and use different inputs such as BERT to improve the sentiment classification of Arabic. Along with this, we will include more statistical analysis on the obtained and future results. Our intention is to build the largest sentiment corpus that covers modern standard Arabic and all dialectal Arabic.
Acknowledgments: The authors are thankful to Prince Sultan University, Saudi Arabia, for providing the funding to carry out this work.

Conflicts of Interest:
The authors declare that they have no conflicts of interest to report regarding the present study.