A New Ontology-Based Method for Arabic Sentiment Analysis

Arabic sentiment analysis aims to extract the subjective opinions of users about different subjects, since these opinions and sentiments reveal their perspectives and judgments in a particular domain. Few studies have addressed semantic-oriented approaches for Arabic sentiment analysis based on domain ontologies and feature importance. In this paper, we propose a semantic orientation approach that calculates the overall polarity of Arabic subjective texts using a purpose-built domain ontology and an available sentiment lexicon. We used the ontology concepts to extract and weight the semantic domain features by considering their levels in the ontology tree and their frequencies in the dataset, and we compute the overall polarity of a given textual review based on the importance of each domain feature. For evaluation, an Arabic dataset from the hotels' domain was selected to build the domain ontology and to test the proposed approach. The overall accuracy and F-measure reach 79.20% and 78.75%, respectively. The results show that the approach outperforms the other semantic orientation approaches compared and that it is an appealing option for Arabic sentiment analysis.


Introduction
The Web offers a massive virtual space where users can express and publish their opinions and experiences. People use social media daily for a wide range of purposes, not only for social life but also for e-learning, e-commerce, politics, and many other applications. In the Middle East (where Arabic is the mother language), Facebook and Twitter have been identified as the most prevalent social media websites affecting youth [1,2]. While web content has witnessed an unprecedented increase in size, extracting useful information from it is becoming more challenging as well [3,4]. Sentiment analysis, or opinion mining, is a type of text mining research that depends mainly on Machine Learning (ML) and Natural Language Processing (NLP) approaches for mining subjective texts [4][5][6][7][8][9]. The scope of sentiment analysis research in computer science is growing very quickly [10]. The semantic web is a logical expansion of the World Wide Web, intended to make the web more machine-understandable [11]. The ontology is an essential semantic technology used widely for data handling in the semantic web [12,13]. Ontologies facilitate communication between humans and agents; they also describe domain theories for the explicit representation of the semantics of the data [14] and support web interoperability [15]. An ontology is a systematic account of existence [16]; it can be used to formalize and model specific domain knowledge to be represented and applied in different fields, such as the semantic web, artificial intelligence, system engineering, information architecture, enterprise bookmarking, and biomedical informatics. Furthermore, the ontology concept is valuable in text mining applications, such as in-text
1. Building a semantic orientation approach that uses an ontology to mine opinions, reducing the effort ordinary users or organizations need to classify sentiments more accurately. The approach works at the level of semantic features, which are extracted and weighted using the domain ontology.
2. Using the domain features' levels to determine the polarity of the overall review. In addition, the domain features that are important from the users' point of view are used to calculate the overall semantic polarity of a subjective text efficiently. This approach differs from previous ontology-based approaches in using a weighting method with two factors to assign a different importance weight to each semantic domain feature.
3. Evaluating the proposed approach with an Arabic dataset from the hotels' domain, which was selected to build the domain ontology.
The rest of the paper is organized as follows: The next section discusses related work. Section 3 presents the research methodology and proposed sentiment analysis approach. Evaluation and results are presented and discussed in Section 4, followed by conclusions in Section 5.

Related Work
Sentiment classification approaches can be categorized into three fields: Semantic Orientation (SO), Machine Learning (ML), and hybrid approaches. Nithish et al. [36] proposed a feature-based sentiment analysis model for the English language using product reviews. They applied feature-level analysis to mobile product reviews and reached 70% accuracy. Thakor and Sasi [19] presented an ontology-based sentiment analysis approach for classifying negative sentiments in social media content. The proposed approach successfully classified 253 negative tweets out of 494 tweets. In [21], the authors proposed a sentiment analysis approach based on Latent Dirichlet Allocation (LDA) topic clusters, domain ontology, and SentiWordNet for Nokia 6610 cellular phone reviews. The precision of the extracted product features was 76.1%.
Alfonso and Sardinha [22] proposed an approach for holding the relationships between aspects, associations of aspects, and their expressions of opinion for aspect-based sentiment analysis using a fuzzy ontology. They tested their approach on the hotels' domain, where each aspect of the hotel got a score, and then they calculated the total score for the hotel by accumulating the scores of each aspect.
Zehra et al. [23] proposed an approach to construct a recommendation system based on sentiment analysis using ontology. The researchers focused on a Facebook closed group that includes posts and comments about various schools collected randomly. Salas-Zárate et al. [24] proposed an aspect-level opinion mining approach to the diabetes domain using ontologies to identify the aspects related to diabetes in the tweets.
Lazhar and Yamina [37] examined the effectiveness of domain ontologies in Arabic sentiment analysis (ASA). Mahyoub et al. [38] presented a sentiment lexicon for the Arabic language, in which the proposed system assigns a sentiment score to each word included in the Arabic WordNet. The accuracy of the classification reached 96%. Soliman et al. [39] presented an approach to building a Slang Sentimental Words and Idioms Lexicon (SSWIL) of opinion words. They also categorized Arabic news comments on Facebook, using an SVM classifier to separate them into two classes, satisfy and dissatisfy, with an accuracy rate of 86.86%.
Hybrid approaches combine different semantic orientation approaches with different machine learning approaches to improve the results of the sentiment analysis process [59][60][61]. Several studies proposed hybrid approaches for sentiment analysis of tweets in different Arabic dialects [62][63][64] and of tweets containing product reviews [65].
We benefit from these studies to build an enhanced approach for ASA using the ontology model. Tartir and Abdul-Nabi [29] focused on the semantic relations between sentiments and their instances to present a semantic orientation approach. Other semantic orientation approaches, such as the studies in [19][20][21][22][23][24], focused on the use of ontology for feature identification and extraction without considering any other information from the ontology tree, such as the levels of features. El-Halees and Al-Asmar [25], in contrast, used the levels of features to calculate the polarity by multiplying each feature level by its sentiment polarity, where the levels indicate the feature importance. In this research, we used the ontology to identify and extract the domain features and their levels, while the frequencies of these features in the review dataset are also used to identify the importance of each feature.

Method
This section presents and discusses the methodology followed in this study. The first subsection describes the overall approach design. The Arabic resources used in this work are described in Section 3.2, while the third subsection describes the main research phases and the steps in each phase in more detail.

Overall Approach Design
The overall methodology to classify Arabic textual reviews based on sentiment analysis using ontology is divided into five main phases, and each phase has several steps, as illustrated in Figure 1.

Description of the Arabic Resources
To illustrate the steps of the proposed method, we first introduce the Arabic resources that were used in the evaluation. We used the dataset of ElSahar and El-Beltagy [66] to extract the domain-specific ontology and to evaluate and test the model. The overall dataset comprises around 33 thousand automatically annotated reviews in four domains: movies, restaurants, hotels, and products. It also includes domain-specific lexicons containing about two thousand entries semi-automatically generated from the reviews.
The hotel reviews dataset contains around 15 thousand Arabic user reviews extracted from the TripAdvisor website. The authors employed the open-source Scrapy framework to build custom web crawlers. Table 1 describes the general statistics of the hotels' datasets of ElSahar and El-Beltagy [66]. Table 2 holds a sample hotel review from the dataset, where each row is considered as a user opinion on a particular hotel and the identified polarity for that review. We added the review translation.
Translation: Excellent-wonderful-I recommend it to everyone. In October 2013 I stayed 3 days in the hotel-it was more than wonderful-suitable prices-excellent service-rooms are very clean and bathrooms are wonderful-hotel management and staff are more than excellent-security guards are excellent-the restaurant, bar and nightclub are wonderful-really-I've never seen a better hotel than that in Addis Ababa. (Positive)
Previous ASA studies suffered from the unavailability of adequate resources that classify opinion words (sentiment lexicons). Although there have been some efforts to build lexicons in Arabic, they still have limitations such as unclear usability, small size, and not being publicly shared. ArSenL is an Arabic SentiWordNet lexicon developed by Badaro et al. [67] to address these limitations. The developers created the first large-scale publicly shared resource for opinion mining in standard Arabic. Their lexicon was built from three available resources: English SentiWordNet, the Standard Arabic Morphological Analyzer (SAMA), and Arabic WordNet.
Two values are attached to each lemma entry in the lexicon, indicating its positive and negative polarity scores. The lexicon covers four Part of Speech (POS) tags (adjective, noun, verb, and adverb). The lemmas are presented in Buckwalter's (2004) format to facilitate NLP processing. ArSenL contains a total of around 28,760 lemmas and 157,969 synsets, which makes it a large-scale Arabic sentiment lexicon. Table 3 provides a sample of the ArSenL lexicon content; we added a column that represents each sentiment word in Arabic script together with its English translation.

Main Phases of the Approach
This section briefly describes the main phases depicted in Figure 1, explaining the steps and processes used in each phase.

Ontology Building
The proposed semantic orientation approach requires a domain ontology, which serves as a domain concept dictionary for extracting the domain features together with their importance. In this phase, we built the domain ontology by extracting the concepts relevant to the hotel domain using Latent Dirichlet Allocation (LDA) combined with manual extraction. Two lists of domain concepts are generated: one is extracted using the LDA algorithm, and the other is extracted from the dataset manually, because LDA ignores concepts with low frequencies [26]. Figure 2 provides a graphical representation of LDA topic modeling. The LDA model, proposed by Blei et al. [68], is an unsupervised method that is well known in text mining applications. It can automatically recognize the latent topics in a collection of documents [26] and is used to arrange document text into topics. It generates a topics-per-document model and a words-per-topic model using Dirichlet distributions [69]. Each topic is a collection of keywords, and each keyword contributes a specific weight to the topic [68]. The variables and parameters that appear in Figure 2 are interpreted as follows: D is the number of documents in the corpus, N is the number of words in a specified document, α is the Dirichlet prior parameter on the topic distributions per document, β is the Dirichlet prior parameter on the word distribution per topic, Θ is the topic distribution for a specified document, Φ is the word distribution for a specified topic k, TP is the topic assignment for a word in the specified document, and W is the specified word.
In the proposed approach, LDA is first used to generate topic clusters from the dataset, where each topic contains a group of keywords. Algorithm 1 is used to implement the LDA model in Python. The portion of the dataset assigned to ontology building is imported into Python. Several preprocessing steps normalize the review sentences, tokenize them into words, and remove unnecessary words. Running LDA requires two inputs, the dictionary and the corpus, which record the distinct words and their counts in the training data. The Term Frequency-Inverse Document Frequency (TF-IDF) transformation is applied to the entire corpus, and then LDA is run. The resulting topics contain keywords that are unlikely to be domain concepts, such as sentiment words [21], so human evaluation is used to filter these topics and to judge each keyword to determine suitable domain concepts. Table 4 provides a sample of LDA-generated topics from the dataset of ElSahar and El-Beltagy [66], where the keywords in bold represent possible domain concepts.

Algorithm 1. Building the LDA topic model
Input: Hotel Reviews Dataset
Output: Topics with Keywords
1- Load the hotel reviews dataset.
4- Apply TF-IDF transformation to the entire corpus.
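The dictionary, corpus, and TF-IDF inputs mentioned for the LDA step can be illustrated with a minimal standard-library sketch. It does not run LDA itself and is not the authors' Gensim code: the function names (build_dictionary, to_bow, tfidf) and the English toy tokens standing in for preprocessed Arabic reviews are hypothetical, and the plain count × log(N/df) weighting is an assumption about the TF-IDF variant.

```python
import math
from collections import Counter

def build_dictionary(docs):
    """Map each distinct token to an integer id (order of first appearance)."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id

def to_bow(doc, token2id):
    """Bag-of-words view of one document: sorted (token_id, count) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

def tfidf(corpus):
    """Weight each count by log(N / document frequency). ASSUMED variant."""
    n_docs = len(corpus)
    df = Counter(tid for bow in corpus for tid, _ in bow)
    return [[(tid, cnt * math.log(n_docs / df[tid])) for tid, cnt in bow]
            for bow in corpus]

# Toy, already-tokenized reviews (hypothetical stand-ins for the hotel dataset).
docs = [["hotel", "room", "clean"],
        ["hotel", "staff", "friendly"],
        ["room", "price", "room"]]
token2id = build_dictionary(docs)
corpus = [to_bow(d, token2id) for d in docs]
weighted = tfidf(corpus)
```

A topic model would then be trained on `weighted` together with `token2id`; words that occur in every document receive a weight of zero under this weighting.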
For the manual list of concepts, human evaluators extracted domain concepts manually from a set of reviews; the extracted concepts were then compared with the list obtained using LDA, and the two lists were combined. The evaluators read the final list to identify the distinct concepts and their synonyms, and they identified the relationships between the concepts to determine their positions from the top to the bottom of the ontology tree. The final ontology is presented using the Protégé tool [70] to facilitate identifying the level of each concept, where the classes and subclasses represent the concepts and sub-concepts of the domain [71]. We used the Protégé tool only to draw the ontology instead of drawing it manually.
After identifying the concepts, the Arabic WordNet browser and Google Translate are used to search for more semantic Arabic synonyms for each concept. This step aims to extract all semantic domain features and all words that have the same meaning as the domain features. Table 5 provides an example of semantic synonyms for hotel concepts extracted from the dataset of ElSahar and El-Beltagy [66]. Table 6 shows the total number of distinct domain concepts and the total number of levels in the constructed hotel ontology.
Table 6. Characteristics of the constructed hotel ontology model.
For each concept, the level is identified using the Protégé structure. We assume that the highest level (6) is at the root of the ontology tree and the lowest level (1) is at the bottom-most feature in the tree. Furthermore, for each concept, we compute the total frequency as the sum of the concept's own frequency and the frequencies of its synonyms, and then two importance weights are calculated for the concept. All the information needed from the ontology is stored in a separate file as a domain concepts dictionary. Each row in the domain concepts dictionary consists of Domain_Concept, Concept_Level_Importance, Concept_Frequency_Importance, and List_of_Synonyms.
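A minimal sketch of such a domain concepts dictionary as a nested Python dictionary is given below. The field names follow the row layout described above; everything else is hypothetical. In particular, the concepts, levels, and counts are toy values, and splitting concepts into five equal-sized groups by frequency rank is an assumption, since the exact grouping rule is not stated here.

```python
FREQ_WEIGHTS = [0.1, 0.25, 0.5, 0.75, 1.0]  # lowest to highest frequency group

def frequency_importance(total_freqs):
    """Assign one of five weights to each concept by frequency rank.
    ASSUMPTION: the five groups are equal-sized rank bins."""
    ranked = sorted(total_freqs, key=total_freqs.get)  # ascending frequency
    n = len(ranked)
    return {c: FREQ_WEIGHTS[min(i * 5 // n, 4)] for i, c in enumerate(ranked)}

def build_concept_dictionary(levels, synonyms, token_counts):
    """Build {concept: {level importance, frequency importance, synonyms}}."""
    # Total frequency = the concept's own count plus the counts of its synonyms.
    totals = {
        c: token_counts.get(c, 0)
           + sum(token_counts.get(s, 0) for s in synonyms.get(c, []))
        for c in levels
    }
    f_imp = frequency_importance(totals)
    return {
        c: {
            "Concept_Level_Importance": levels[c],
            "Concept_Frequency_Importance": f_imp[c],
            "List_of_Synonyms": list(synonyms.get(c, [])),
        }
        for c in levels
    }

# Toy inputs (hypothetical English stand-ins for Arabic hotel concepts).
levels = {"hotel": 6, "room": 5, "bathroom": 4, "staff": 5, "price": 5}
synonyms = {"hotel": ["inn"], "price": ["cost"]}
token_counts = {"hotel": 50, "inn": 5, "room": 30, "bathroom": 4,
                "staff": 20, "price": 10, "cost": 2}
concepts = build_concept_dictionary(levels, synonyms, token_counts)
```

Keying the outer dictionary by concept keeps lookups constant-time during feature extraction, and the synonym list allows any surface form to be mapped back to its concept.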

Text Preprocessing
The reviews dataset is unstructured and contains stopwords, so it needs to be preprocessed. Text preprocessing is intended to make the reviews consistent and to represent them in a standard form that facilitates systematic processing. Several NLP processes were used to preprocess the textual reviews: sentence tokenization, normalization, stopword removal, word tokenization, POS tagging, and stemming. Table 7 provides an example of each of them, with an English translation of the Arabic input added.
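The first four of these steps can be sketched with the Python standard library alone; POS tagging and stemming need external tools and are left out. This is an illustrative sketch, not the authors' pipeline: the normalization rules (dropping diacritics and tatweel, unifying alef forms), the three-word stopword set, and the sample review are all assumptions.

```python
import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")  # tashkeel marks and tatweel
ALEF_FORMS = re.compile(r"[\u0622\u0623\u0625]")   # alef with madda or hamza
SENT_SPLIT = re.compile(r"[.!?\u061F\n]+")         # includes Arabic question mark
STOPWORDS = {"في", "من", "على"}                    # hypothetical tiny stopword list

def normalize(text):
    """Drop diacritics/tatweel and map alef variants to bare alef (ASSUMED rules)."""
    text = DIACRITICS.sub("", text)
    return ALEF_FORMS.sub("\u0627", text)

def preprocess(review):
    """Return a list of sentences, each a list of normalized non-stopword tokens."""
    sentences = [s.strip() for s in SENT_SPLIT.split(review) if s.strip()]
    result = []
    for sentence in sentences:
        tokens = re.findall(r"\w+", normalize(sentence))  # \w is Unicode-aware
        result.append([t for t in tokens if t not in STOPWORDS])
    return result

# Hypothetical sample review: "The hotel in the city is very clean. The service is excellent!"
review = "الفندق في المدينة نظيف جداً. الخدمة ممتازة!"
sentences = preprocess(review)
```

Normalizing before word tokenization matters here, because combining diacritics are not word characters and would otherwise split tokens.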

Domain Features and Initial Polarity Identification
This phase aims to distinguish the domain features from the sentiment words using POS tags, where nouns are considered candidate domain features to be identified and extracted using the domain dictionary. The noun tags produced by the Stanford POS tagger [72] are NN, DTNN, NNP, DTNNP, NNS, DTNNS, NNPS, DTNNPS, NOUN, and NOUN_QUANT. The other words, such as adjectives, verbs, and the remaining nouns that were not found in the domain dictionary, are considered candidate sentiment words to be matched against the lexicon [37].
To extract the sentiment words around each domain feature, we use the N-gram-around method, which has achieved good results in identifying the sentiment words related to each domain feature [20,24]. The initial polarity of a domain feature is calculated from the sum of the positive scores and the sum of the negative scores of the sentiment words extracted using the N-gram-around method.
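The windowing idea behind the N-gram-around method can be sketched as follows. The window size n = 2, the toy lexicon, and the English tokens are hypothetical, and combining the two sums as positive minus negative is an assumption, since the text only states that both sums are used.

```python
def ngram_around(tokens, feature_index, n=2):
    """Return up to n tokens on each side of the feature (feature excluded)."""
    start = max(0, feature_index - n)
    left = tokens[start:feature_index]
    right = tokens[feature_index + 1:feature_index + 1 + n]
    return left + right

def score_sums(tokens, feature_index, lexicon, n=2):
    """Sum positive and negative lexicon scores over the window."""
    pos_sum = neg_sum = 0.0
    for word in ngram_around(tokens, feature_index, n):
        pos, neg = lexicon.get(word, (0.0, 0.0))  # unknown words score zero
        pos_sum += pos
        neg_sum += neg
    return pos_sum, neg_sum

def initial_polarity(tokens, feature_index, lexicon, n=2):
    """ASSUMPTION: initial polarity = positive sum minus negative sum."""
    pos_sum, neg_sum = score_sums(tokens, feature_index, lexicon, n)
    return pos_sum - neg_sum

# Toy example (hypothetical tokens and scores).
tokens = ["room", "very", "clean", "but", "service", "slow"]
lexicon = {"clean": (0.5, 0.0), "slow": (0.0, 0.25)}
room_polarity = initial_polarity(tokens, 0, lexicon)     # window: very, clean
service_polarity = initial_polarity(tokens, 4, lexicon)  # window: clean, but, slow
```

Note that windows of neighbouring features can overlap, as "clean" does here, so the window size trades recall of relevant sentiment words against leakage between features.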
To match each sentiment word against the lexicon, three lookups are tried in order: first the original word; if it is not found, the word stem; and if that is not found either, the word root. If neither the word nor its stem nor its root is found, its sentiment polarity is considered zero. For this step, we used the Tashaphyne stemmer [73], which is available in Python, to generate both stems and roots.
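The word, stem, root fallback can be sketched as a short lookup chain. The stemming and root-extraction functions are passed in as parameters, so no particular stemmer API is assumed; the toy stemmers and lexicon entries in the example are hypothetical English stand-ins.

```python
def lookup_polarity(word, lexicon, stem_fn, root_fn):
    """Try the word, then its stem, then its root; fall back to zero scores."""
    for transform in (lambda w: w, stem_fn, root_fn):
        candidate = transform(word)  # computed lazily, only if earlier forms missed
        if candidate in lexicon:
            return lexicon[candidate]
    return (0.0, 0.0)  # not found in any form: polarity is treated as zero

# Toy stand-ins for a real Arabic stemmer (hypothetical).
def toy_stem(word):
    return word.rstrip("s")

def toy_root(word):
    return word[:4]

lexicon = {"clean": (0.5, 0.0), "comf": (0.3, 0.1)}
direct = lookup_polarity("clean", lexicon, toy_stem, toy_root)
by_stem = lookup_polarity("cleans", lexicon, toy_stem, toy_root)
by_root = lookup_polarity("comfortable", lexicon, toy_stem, toy_root)
missing = lookup_polarity("xyz", lexicon, toy_stem, toy_root)
```

Trying the surface form first keeps the most specific lexicon entry when one exists, and the stem and root are only computed when the earlier lookups fail.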
Negations and intensifiers are handled while identifying the polarities of sentiment words. Negation in the Arabic language is expressed by adding ( / / / ) "not" before a verb, noun, or adjective. If any of the negation terms appears before a sentiment word, it reverses the meaning of that word: adding a negation particle before a positive word makes it negative, and vice versa. For example, in the sentence ( ), the word ( ) is positive, and its positive and negative scores in the ArSenL lexicon are (0.083, 0.05), respectively. When the negation particle ( ) comes before it, its scores change to (0.05, 0.083), which is negative. Intensifiers in the Arabic language, such as ( / ), are added after a sentiment word to emphasize and strengthen its meaning. Therefore, when an intensifier appears after a sentiment word, the polarity scores of that word are doubled. For example, ( ) is a sentiment word with positive and negative scores of (0.402, 0.069), respectively. After adding ( ) to the sentence, its scores change to (0.804, 0.138). Table 8 provides an example of this phase.
Table 8. Example of domain features and initial polarity identification.
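The two rules for negation and intensification can be sketched in a few lines using the score pairs quoted above. The function name and the placeholder tokens "NEG", "INT", "word_a", and "word_b" are hypothetical; they stand in for the actual Arabic particles and sentiment words.

```python
def adjust_scores(tokens, index, scores, negations, intensifiers):
    """Apply the negation and intensifier rules to one sentiment word.

    tokens: the sentence tokens; index: position of the sentiment word;
    scores: its (positive, negative) lexicon scores.
    """
    pos, neg = scores
    # A negation particle directly before the word swaps its two scores.
    if index > 0 and tokens[index - 1] in negations:
        pos, neg = neg, pos
    # An intensifier directly after the word doubles both scores.
    if index + 1 < len(tokens) and tokens[index + 1] in intensifiers:
        pos, neg = 2 * pos, 2 * neg
    return pos, neg

NEGATIONS = {"NEG"}      # placeholder for the Arabic negation particles
INTENSIFIERS = {"INT"}   # placeholder for the Arabic intensifiers

negated = adjust_scores(["NEG", "word_a"], 1, (0.083, 0.05),
                        NEGATIONS, INTENSIFIERS)
intensified = adjust_scores(["word_b", "INT"], 0, (0.402, 0.069),
                            NEGATIONS, INTENSIFIERS)
```

With these inputs, the negated word yields (0.05, 0.083) and the intensified word yields (0.804, 0.138), matching the two examples in the text.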

Overall Semantic Review Polarity Calculation
Based on the semantic domain features extracted from each review, we calculate a total semantic review polarity. The initial polarity of each domain feature is weighted by the importance of that feature. Formula (1) is used to calculate the overall semantic review polarity based on the semantic features' importance. In Formula (1), n is the number of domain features extracted from a review, DFi is a specific domain feature that has an initial polarity, L is the level importance of the domain feature DFi, identified from its level in the ontology tree, and F is the frequency importance of the domain feature DFi, which takes one of the values 0.1, 0.25, 0.50, 0.75, or 1 to indicate its importance from the domain users' point of view. Since features' levels do not depend on the dataset, we use the domain feature frequency to represent the importance of the domain features according to how often they are repeated in the dataset. Highly frequent domain features indicate that users are more interested in those features than in the other features of that domain. Domain features are divided into five groups based on their frequencies: the most frequent features in the dataset get the highest importance value (1), and so on, whereas the least frequent features get the lowest importance value (0.1). We experimented with assigning different weights to this factor for each group of domain features and noticed that these importance weights improved the performance of the semantic orientation sentiment analysis.
The review label is positive (+1) if the overall semantic review polarity is greater than or equal to zero. The neutral class is ignored in the proposed approach, and very few reviews in the dataset have a total semantic polarity exactly equal to zero, so we treat them as positive reviews. Conversely, the review label is negative (−1) if the overall semantic review polarity is less than zero. Table 9 illustrates the calculation of the overall semantic review polarity using the example from the previous phase.
Table 9. Example of calculating overall semantic review polarity.
The overall review polarity is considered positive, although the review contains one positive feature with an initial polarity of (+0.33757) and one negative feature with an initial polarity of (−0.51666), where the negative feature has the larger initial magnitude. This is because the positive feature has higher importance than the negative feature: the total importance of the positive feature is (6), whereas that of the negative feature is (2.25).
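As a rough illustration of the weighting step, the sketch below recomputes the Table 9 example in Python. It assumes that each feature's initial polarity is multiplied by its total importance and that the weighted values are then summed; this combination rule and the function names `overall_review_polarity` and `review_label` are assumptions made for illustration, not a restatement of Formula (1).

```python
from typing import Iterable, Tuple


def overall_review_polarity(features: Iterable[Tuple[float, float]]) -> float:
    """Sum the initial polarities, each weighted by its total importance.

    `features` holds (initial_polarity, total_importance) pairs.
    Multiplying polarity by importance is an assumption of this sketch.
    """
    return sum(polarity * importance for polarity, importance in features)


def review_label(score: float) -> int:
    """Return +1 for a score >= 0 (ties count as positive), otherwise -1."""
    return 1 if score >= 0 else -1


# Polarity and total-importance values quoted in the text for the Table 9 example.
table9_features = [(0.33757, 6.0), (-0.51666, 2.25)]
score = overall_review_polarity(table9_features)
label = review_label(score)
print(round(score, 4), label)
```

Under these assumptions, the weighted positive contribution (about 2.03) outweighs the weighted negative one (about −1.16), which is consistent with the positive label reported for this review.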

Performance Evaluation
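In this phase, performance evaluation metrics are used to measure the performance of the proposed approach and to compare it with other semantic orientation approaches used by researchers in the literature. The performance evaluation measures are accuracy, recall, precision, and f1-measure. Referring to [54], the precision and recall measures can be computed for the positive class using the following equations:
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}$$
where:
• TP (True Positive): the number of reviews that are classified as positive in both the original and the predicted classifications.
• FP (False Positive): the number of reviews that are negative in the original classification but predicted as positive.
• FN (False Negative): the number of reviews that are positive in the original classification but predicted as negative.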

Results and Discussion
This section describes and discusses the experiments conducted for performance evaluation. We implemented an automatic framework that combines several tools and libraries. The software architecture is depicted in Figure 3. We used Python 3.7 on Anaconda 3 with the following libraries and modules: Pandas, Gensim, NLTK, CLTK, PyArabic, PyAramorph, Stanford POS Tagger, and Tashaphyne. Pandas offers a DataFrame object for quick and effective data handling with integrated indexing, as well as tools for reading and writing data between in-memory data structures and various formats such as text, Excel, and CSV files [74]. A Python dictionary is a data structure that consists of a set of (key: value) pairs, where the keys are unique within one dictionary. The main functions of a dictionary are storing and extracting values using their keys [75]. We used nested dictionaries, where a collection of dictionaries is held inside one single dictionary.
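A nested dictionary of this kind might look like the following minimal sketch. The concept names, levels, and frequency weights are hypothetical placeholders, not entries from the hotel ontology built in this work, and `lookup_concept` is an illustrative helper name.

```python
# Hypothetical concept dictionary: outer keys are ontology concepts,
# inner dictionaries hold the attributes attached to each concept.
hotel_concepts = {
    "room": {"level": 2, "frequency_importance": 1.0},
    "breakfast": {"level": 3, "frequency_importance": 0.75},
    "parking": {"level": 4, "frequency_importance": 0.25},
}


def lookup_concept(concepts: dict, word: str):
    """Return the attribute dictionary for `word`, or None if it is not a concept."""
    return concepts.get(word)


# Storing a value: add a new concept together with its attributes.
hotel_concepts["pool"] = {"level": 3, "frequency_importance": 0.5}

# Extracting values by key.
room = lookup_concept(hotel_concepts, "room")
print(room["level"], room["frequency_importance"])
print(lookup_concept(hotel_concepts, "airport"))
```

Keeping the level and the frequency weight under one key means a single lookup per token is enough to decide whether it is a domain feature and how much it should count.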

Dataset Balancing
We examine the proposed approach using the hotel reviews dataset of ElSahar and El-Beltagy [66], which was presented in Section 3.2. The dataset is imbalanced because it contains different numbers of positive, negative, and neutral reviews. First, the neutral reviews were excluded based on the assumption that neutral texts are located close to the boundary of the binary classifier. Moreover, neutral texts are supposed to be less informative than clearly positive or negative texts [76].
After that, we balanced the remaining positive and negative reviews using the under-sampling method. The objective of under-sampling is to achieve high classification performance and to prevent the classifier from being biased toward the majority class examples [77]. Random under-sampling is a non-heuristic method that balances class sizes by randomly eliminating majority class examples until the classes match the size of the smallest class [78]. Table 10 shows the size of each class before and after class balancing. The balanced reviews dataset consists of 5294 hotel reviews (2647 positive reviews and 2647 negative reviews). Of these, 3294 hotel reviews are used for domain ontology extraction using LDA and the manual approach. The main goal is to extract the domain ontology from the available reviews dataset. The remaining 2000 hotel reviews (1000 positive reviews and 1000 negative reviews) are used for the ASA experiments to evaluate the effectiveness of the proposed approach. The authors of [20] divided their reviews dataset in a similar way for the same purposes.
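The random under-sampling step can be sketched with the Python standard library alone. The function name `random_undersample`, the fixed seed, and the toy reviews are assumptions made for illustration; the code actually used in the experiments is not shown in this paper.

```python
import random
from collections import defaultdict


def random_undersample(reviews, seed=0):
    """Randomly drop majority-class examples until all classes match the smallest.

    `reviews` is a list of (text, label) pairs.
    """
    by_label = defaultdict(list)
    for text, label in reviews:
        by_label[label].append((text, label))
    smallest = min(len(items) for items in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for items in by_label.values():
        # Sample without replacement so no review is duplicated.
        balanced.extend(rng.sample(items, smallest))
    rng.shuffle(balanced)
    return balanced


# Toy data: 6 positive and 3 negative reviews (placeholders, not real reviews).
toy = [(f"pos {i}", 1) for i in range(6)] + [(f"neg {i}", -1) for i in range(3)]
balanced = random_undersample(toy)
counts = {label: sum(1 for _, l in balanced if l == label) for label in (1, -1)}
print(counts)
```

Fixing the seed makes the selection reproducible, which matters when the same balanced subset is later split between ontology extraction and evaluation.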

Lexicon Baseline Evaluation
The lexicon baseline approach is selected for comparison because it does not consider domain concepts when identifying review polarity; it simply uses a sentiment lexicon to extract all the words of the review along with their polarities. The ArSenL lexicon of Badaro et al. [67] is used in this experiment. Tables 11 and 12 present the confusion matrix and performance measures of the lexicon baseline approach. The confusion matrix shows that 898 positive reviews and 596 negative reviews were classified correctly, whereas 404 negative reviews were incorrectly classified as positive and 102 positive reviews were incorrectly classified as negative. The overall precision of the lexicon baseline approach is 77.17%, with a higher precision value for the negative reviews; the opposite holds for recall, where the positive class has the higher value and the overall recall is 74.70%. The overall f-measure value is 74.10%.

Ontology Baseline Evaluation
The hotel ontology, which comprises 203 concepts across 6 levels, is used in this experiment as a dictionary of domain concepts for feature selection. The domain features are considered the best semantic features to represent each review. The hotel concepts, along with the noun POS tags, are used to identify the domain features and calculate their polarities using the N-gram around method with N = 4: the four words before and the four words after each domain feature are extracted and looked up in the ArSenL lexicon to identify the feature's polarity. The confusion matrix of this approach is shown in Table 13. The number of true-positive reviews is 929, and the number of true-negative reviews is 638. The number of false-positive reviews is 362, and the number of false-negative reviews is 71. Table 14 presents the performance measures of the ontology baseline approach. The overall precision is 80.96%, with a higher precision value for the negative reviews. The overall recall is 78.35%, where the positive class obtained the higher recall value. The overall f-measure is 77.87%, with the higher f-measure value for the positive class.
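The N-gram around step with N = 4 can be illustrated with a short Python sketch. The helper names, the English stand-in tokens, and the toy lexicon scores are hypothetical; a plain dictionary stands in for ArSenL, and summing the context scores is an assumption, since this section does not state how the scores within the window are aggregated.

```python
def ngram_around(tokens, index, n=4):
    """Return up to `n` tokens before and `n` tokens after position `index`."""
    before = tokens[max(0, index - n):index]
    after = tokens[index + 1:index + 1 + n]
    return before + after


def feature_polarity(tokens, index, lexicon, n=4):
    """Sum the lexicon scores of the context words around a domain feature.

    Summing is an assumption of this sketch; words missing from the
    lexicon contribute nothing.
    """
    return sum(lexicon.get(word, 0.0) for word in ngram_around(tokens, index, n))


# Hypothetical English stand-ins for a tokenized review and lexicon scores.
tokens = "the staff were friendly but the room was very small and noisy".split()
toy_lexicon = {"friendly": 0.6, "small": -0.4, "noisy": -0.5}
room_index = tokens.index("room")
print(ngram_around(tokens, room_index))
print(feature_polarity(tokens, room_index, toy_lexicon))
```

In this toy review the word "noisy" lies five tokens after "room", so it falls outside the window and does not affect the feature's polarity, which shows how the choice of N bounds the context each feature can draw on.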

Ontology with Level Importance Evaluation
The ontology with level importance approach uses the ontology both to extract the domain features and to identify their importance based on their levels in the ontology tree. The hotel dictionary, which was built from the extracted ontology, is used to determine the hotel features and their levels. The confusion matrix of this approach is depicted in Table 15. This approach correctly predicted 938 of the 1000 original positive reviews and 644 of the 1000 original negative reviews. Of the original negative reviews, 356 were falsely predicted, and of the original positive reviews, 62 were falsely predicted. The performance measures of the ontology with level importance approach are shown in Table 16. The negative class precision is 91.21%, which is higher than the precision of the positive class, and the average precision of this approach is 81.84%. The positive class recall is 93.80%, which is higher than the negative class recall, and the average recall for both classes is 79.10%. The average f-measure is 78.63%, with a higher value for the positive reviews.

Ontology with Level and Frequency Importance Evaluation
In this experiment, we extracted the hotel features by matching the ontology concepts and identified both their levels and their frequency importance, so the hotel concepts dictionary is used to identify all three elements. Tables 17 and 18 present the confusion matrix and performance measures of the ontology with level and frequency importance approach. Using this approach, 937 positive reviews and 647 negative reviews were classified correctly, whereas 353 negative reviews were incorrectly classified as positive and 63 positive reviews were incorrectly classified as negative. The performance measures presented in Table 18 demonstrate that the proposed approach achieved an overall precision of 81.87%, with a higher precision value for the negative reviews, and an overall recall of 79.20%, with a higher recall value for the positive class. The f-measure value is 78.75%.
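The reported measures can be cross-checked from the confusion counts with a few lines of Python. The sketch assumes that the overall precision, recall, and f-measure are unweighted averages of the per-class values; the helper name `class_metrics` is illustrative.

```python
def class_metrics(tp, fp, fn):
    """Precision, recall, and F1 for one class from its confusion counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Counts reported for the ontology with level and frequency importance approach.
tp, tn, fp, fn = 937, 647, 353, 63

positive = class_metrics(tp, fp, fn)
# For the negative class the roles swap: TN are its correct predictions,
# FN are reviews wrongly predicted as negative, FP are negatives it missed.
negative = class_metrics(tn, fn, fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)
# Unweighted (macro) average over the two classes: an assumption of this sketch.
macro = [(p + n) / 2 for p, n in zip(positive, negative)]
print(f"accuracy={accuracy:.4f}")
print("macro precision, recall, f1:", [round(m, 4) for m in macro])
```

With these counts the sketch gives an accuracy and average recall of 79.20% and an average f-measure of about 78.75%, in line with the reported values; the average precision comes out at roughly 81.88%, a rounding-level difference from the reported 81.87%.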

Results Summary and Discussion
Using ontology with domain features' importance in the two approaches, we observed the following: the ontology with level importance approach and the ontology with level and frequency importance approach achieve the best results among all the semantic orientation approaches, with only a minor difference between them. Figure 4 summarizes the results of the four schemes described earlier, averaged over the positive and negative classes. A comparison between the different state-of-the-art approaches for ASA is depicted in Figure 5. It reveals that the first approach (ontology with level importance) yields an accuracy of 79.10%, and the second approach (ontology with level and frequency importance) yields an accuracy of 79.20%. This small gap may indicate that the way the concepts' frequencies are used in the formula needs refinement before the frequency factor can contribute a larger gain. Although the difference between their performances is small, the suggested method, which incorporates two factors to represent the importance of the semantic domain features, still achieves results comparable to those of other approaches. Combining the domain ontology with the lexicon baseline approach improved accuracy by 3.65 percentage points. The lexicon baseline approach did not apply any feature selection method; it just extracted all the review words. Adding the two-factor domain feature importance to the ontology baseline approach improved accuracy by a further 0.85 percentage points. Overall, the proposed approach improved on the accuracy of the lexicon baseline approach by 4.5 percentage points.
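The accuracy figures and the improvements quoted above follow directly from the correctly classified counts reported for each scheme, as the following Python sketch shows. The dictionary keys and the helper name `gain` are illustrative labels, not names used in the experiments.

```python
# (correct positives, correct negatives) reported for each scheme,
# each evaluated on 1000 positive and 1000 negative reviews.
correct = {
    "lexicon baseline": (898, 596),
    "ontology baseline": (929, 638),
    "ontology + level": (938, 644),
    "ontology + level + frequency": (937, 647),
}
TOTAL_REVIEWS = 2000

# Accuracy in percent for every scheme.
accuracy = {
    name: 100 * (pos + neg) / TOTAL_REVIEWS for name, (pos, neg) in correct.items()
}


def gain(better: str, worse: str) -> float:
    """Accuracy difference in percentage points, rounded to two decimals."""
    return round(accuracy[better] - accuracy[worse], 2)


for name, value in accuracy.items():
    print(f"{name}: {value:.2f}%")
print(gain("ontology baseline", "lexicon baseline"))
print(gain("ontology + level + frequency", "ontology baseline"))
print(gain("ontology + level + frequency", "lexicon baseline"))
```

Because the test set is balanced, accuracy here coincides with the average recall of the two classes, which is why the overall recall and accuracy values agree for every scheme.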
A comparison of the proposed approach with some state-of-the-art deep learning, machine learning, and aspect-based classifiers used for ASA is provided in Table 19. We selected approaches that share either the same sentiment lexicon [67] or, for aspect-level methods, the same hotels domain dataset [66]. Al-Sallab et al. [79] presented A Recursive Deep Learning Model for Opinion Mining in Arabic (AROMA). AROMA was tested on three Arabic datasets that varied in writing style and genre. On the second dataset, their method obtained an accuracy similar to that of our approach (79.2%). Baly et al. [80] presented another deep learning approach for opinion mining using Recursive Neural Tensor Networks (RNTN). Their method obtained a slightly higher accuracy than our approach, with a best value of 80%. The methods of Mataoui et al. [81] and Mohammad et al. [83] were based on aspect detection and extraction on hotel datasets. Compared with their experimental results, which were 74.39% and 76.42% accuracy, respectively, our proposed approach, which is based on domain aspect detection, outperformed the first method by 4.81 percentage points and the second by 2.78 percentage points.

Conclusions
In this paper, we proposed a semantic orientation approach for ASA using ontology. It incorporates a weighting method for the importance of semantic domain features. The approach works at the feature level, using an ontology of the domain concepts to extract the semantic features. It combines two factors, the features' levels in the ontology tree and the features' frequencies in the dataset, to generate an overall semantic review polarity based on the domain features' importance. The experiment conducted for the ontology with level and frequency importance approach demonstrated that using the frequency importance factor together with the level importance factor as an indication of domain feature importance improves on the performance of the lexicon baseline and ontology baseline approaches, with overall accuracy and f-measure values reaching 79.20% and 78.75%, respectively. The proposed approach is comparable with state-of-the-art methods for sentiment analysis in the Arabic language.
During this work, we faced several limitations, including the unavailability of a suitable Arabic ontology for the selected domain and the unavailability of adequate lexicons for the different Arabic dialects. Future work can be derived from these limitations: (1) using a fully automatic approach to extract the domain ontology from the available dataset; (2) building and using sentiment lexicons for the different dialects of the Arabic language, in addition to the lexicon used for standard Arabic; (3) building and using domain-specific sentiment lexicons for different domains.