AraMAMS: Arabic Multi-Aspect, Multi-Sentiment Restaurants Reviews Corpus for Aspect-Based Sentiment Analysis

Abstract: The abundance of data on the internet makes automatic analysis a necessity. Aspect-based sentiment analysis helps extract valuable information from textual data. Because Arabic resources are limited, this paper enriches the Arabic dataset landscape by creating AraMA, the first and largest Arabic multi-aspect corpus. AraMA comprises 10,750 Google Maps reviews of restaurants in Riyadh, Saudi Arabia. It covers four aspect categories (food, environment, service, and price) along with four sentiment polarities: positive, negative, neutral, and conflict. All AraMA reviews are labeled with at least two aspect categories. A second version, named AraMAMS, includes only reviews labeled with at least two different sentiments, making it the first Arabic multi-aspect, multi-sentiment dataset. AraMAMS has 5312 reviews covering the same four aspect categories and sentiment polarities. Both corpora were evaluated using naïve Bayes (NB), support vector classification (SVC), linear SVC, and stochastic gradient descent (SGD) models. On AraMA, the aspect category task achieved an F1 measure of 91.41% using the SVC model, while on AraMAMS, the best F1 measure for the aspect category task reached 91.70% using the linear SVC model.


Introduction
With the recent growth of social media usage, it is essential to discover and reap the benefits of online user-generated content to enhance products and services and to create more effective marketing efforts. For instance, analyzing consumers' feelings and opinions in reviews on e-commerce platforms is very important, as it provides insight into customers' satisfaction levels. This type of analysis can give businesses valuable insights into customer sentiment, brand perception, market trends, and investment opportunities. By leveraging these insights, businesses can enhance customer satisfaction, brand reputation, market competitiveness, and financial performance. Overall, assessing customer reviews is a critical component of constructing a strong and equitable infrastructure that promotes economic growth and improves quality of life and wellbeing.
However, analyzing this opinion data manually would be impossible, given the enormous volume of textual content. As a result, the field of sentiment analysis (SA) has emerged as an AI tool that allows the automatic extraction of the opinions, emotions, and attitudes concealed within unstructured texts. Yet, SA only provides a coarse view of what people like or dislike: it classifies a given text into positive, negative, or neutral sentiment [1]. Aspect-based sentiment analysis (ABSA) is a subfield of SA that goes one step further by automatically assigning sentiments to specific features or aspects mentioned in the text. The primary goal of ABSA is to extract the relevant aspects and then classify them into different sentiment polarities [1]. This entails breaking text data down into smaller fragments in order to obtain deeper, more granular insights. As such, all relationships between the entities involved must be appropriately identified and linked to the conveyed sentiment. Thus, the main challenge of this task is to distinguish between the different opinion contexts of different aspects or targets. ABSA has therefore become one of the most important SA tasks, since it extracts deeper insight from text, supports better decision making, and provides a clearer picture of weaknesses.
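The distinction between SA and ABSA output can be illustrated with a small sketch; the record layout below is our own illustration for a translated review, not a standard format:

```python
# Illustrative only: sentence-level SA output vs. ABSA output
# for the same (translated) restaurant review.
review = "The food is delicious, but the service is slow."

# Sentence-level SA collapses everything into one coarse label.
sa_output = {"review": review, "sentiment": "conflict"}

# ABSA keeps one sentiment per aspect, preserving the finer structure.
absa_output = {
    "review": review,
    "aspects": [
        {"category": "food", "sentiment": "positive"},
        {"category": "service", "sentiment": "negative"},
    ],
}

print(sa_output["sentiment"])
print([(a["category"], a["sentiment"]) for a in absa_output["aspects"]])
```

The ABSA record makes it possible to answer "what exactly did the customer dislike?", which the single SA label cannot.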
ABSA can play a significant role in supporting the Sustainable Development Goals (SDGs) of the 2030 Agenda, which were adopted by the United Nations to address global challenges and promote sustainable development [2]. This can be accomplished by providing insights into the sustainability performance of businesses. For instance, ABSA for customer reviews can help in assessing the sustainability efforts of restaurants, hotels, retail stores, or service providers by analyzing sentiments related to specific aspects of sustainability, and this allows for a comprehensive assessment. Analyzing restaurant customer reviews is critical for constructing high-quality, long-lasting, and robust infrastructure to promote economic development and human wellbeing. If restaurant owners analyze customer reviews, they will be more aware of their needs, enabling them to direct efforts and money to improve these aspects more quickly and with less effort. As a result, client happiness and loyalty rise, resulting in higher revenue and economic progress. Furthermore, restaurant owners can focus more on enabling cheap and equitable access to high-quality food and improve eating experiences for all consumers in order to obtain more positive comments which, in turn, will increase the number of visitors to the restaurant. Overall, assessing customer reviews is a critical component of constructing a strong and equitable infrastructure that promotes economic growth and the wellbeing of all people.
In recent years, a number of researchers have carried out a great deal of work in the field of SA and its applications. However, ABSA studies are still scarce compared to SA research, especially in the Arabic language. This is for two main reasons: the lack of labelled dataset resources in Arabic, and the complexity of the Arabic language [3]. There are three varieties of the Arabic language: Classical Arabic (CA), which is used in the Holy Qur'an of Islam; Modern Standard Arabic (MSA), which is used in official contexts such as newspapers and education; and Dialectal Arabic (DA), which is used in daily conversation and in most social media content. DA also differs from one Arab nation to the next, and does not have standard orthographies [4].
Researchers have become more interested in ABSA in the past few years. There are different studies in the literature regarding ABSA in the English language; however, there is a lack of Arabic research in this field. In addition, it is clear that the field of Arabic ABSA suffers from the small number of available well-created corpora that would help the Arabic research community. Therefore, to bridge this gap, we aim to enhance the Arabic dataset resources available for serving ABSA studies. In this study, we create two versions of Arabic ABSA corpora, and provide an in-depth analysis of restaurant reviews in the city of Riyadh by identifying the most important aspects, such as price, environment, food, and service quality, that affect restaurants. This will make it easier to pinpoint exactly what customers like and dislike, and thus improve the business in question. We collected 21,330 Arabic reviews from Google Maps about restaurants in Riyadh, Saudi Arabia. These reviews have been manually annotated by Arab annotators who can understand Saudi DA. To make data more reliable, annotation guidelines were created. The process was carried out in two rounds. To our knowledge, this work is the first study targeting the field of restaurant ABSA in Arabic.
In summary, this paper makes the following contributions:
1. Creating the first and largest Arabic multi-aspect corpus (AraMA), with a total of 10,739 reviews from Google Maps related to restaurants in Riyadh, and analyzing its suitability for the ABSA task.
2. The authors of [5] showed that if an ABSA corpus contains sentences whose aspects all carry the same sentiment, ABSA reduces to sentence-level classification, and classifiers can obtain good results without considering aspects (i.e., it falls back to classical sentiment analysis). We therefore investigate the effect of this on Arabic datasets by generating a second version of AraMA that only includes reviews labeled with different sentiments. This is the first Arabic corpus of Google Maps reviews of Riyadh restaurants with multi-aspect, multi-sentiment properties; it is named AraMAMS.
3. The annotation guidelines are highlighted to help researchers in future studies.
The rest of the paper is organized as follows. Section 2 reviews the related work. Section 3 describes the process of collecting and cleaning the data. Section 4 explains the workflow of the annotation process. Section 5 contains an exploratory data analysis of both corpora. Section 6 concerns data evaluation. Section 7 concludes the paper and discusses potential future work. Section 8 contains corpus availability details.

Related Work
There are several Arabic-language studies on ABSA. We reviewed the Arabic corpora that have been created and labeled for ABSA in the literature. Table 1 provides a summary of the datasets regarding their domain, size, Arabic language type, public availability, predefined aspect categories, sentiment polarities, and, if applicable, the platform used during annotation. For a comprehensive review of the available Arabic ABSA research, we refer the reader to [3]. In 2015, Ref. [6] provided the first research benchmark dataset for ABSA: the human-annotated Arabic dataset for book reviews (HAAD) [15]. It consists of 1513 annotated book reviews taken from the large Arabic book review (LABR) dataset, which was originally created for SA [16]. The authors annotated aspect terms, aspect categories, and sentiment polarities. In Ref. [7], published in the same year, the authors collected 200 reviews from forums, Facebook, YouTube, and Google search. They then extracted aspects using part-of-speech (POS) tagging and manually annotated sentiment polarity.
In 2016, Al-Sarhan et al. [8] collected 2265 Arabic news posts related to the Gaza conflict, associated comments from Al Jazeera and Al Arabiya (well-known Arabic news networks), and related posts on Facebook. They annotated the posts' aspect categories, aspect terms, and sentiment polarities, as well as comment categories and sentiment polarities, choosing only the most dominant aspect category for both posts and comments. The BRAT tool was used in this study to ease annotation. In the same year, Semantic Evaluation (SemEval) launched a workshop to create an Arabic hotel reviews dataset for ABSA; it has been used as a benchmark ever since. The dataset includes a total of 2291 annotated review sentences, of which 1839 were used for training and 452 for testing. The sentences were gathered from hotel booking websites such as Booking.com and TripAdvisor.com (both accessed on 23 June 2023). The selected reviews belong to hotels in different Arab cities such as Dubai, Mecca, Amman, and Beirut. The authors annotated aspect terms, aspect categories, and sentiment polarities. The aspect category annotations were more detailed: annotators were required to identify entities and attributes, and the category field had to be completed using the (Entity#Attribute) syntax. Entities were predefined as hotel, rooms, room_amenities, facilities, service, location, and food&drinks. Attributes were defined as general, prices, design&features, cleanliness, comfort, quality, style&options, and miscellaneous. For example, in the sentence "the rooms are comfortable", the entity is the room and the attribute is comfort, so the category field would be (room#comfort) [17].
A much simpler corpus was subsequently created. The authors of [10] compiled a total of 5000 tweets related to service on Saudi airlines and annotated aspect categories and sentiment polarities. Additionally, in Ref. [11], customers' sentiments were extracted, using machine learning and deep learning approaches, from 1098 tweets collected by the authors regarding the Saudi telecommunication companies STC, Mobily, and Zain. The paper was part of an ongoing project. The authors extracted the available aspects, such as internet, customer service, network, billing, packages, and general, and manually annotated sentiment polarities using the DataTracking website.
In 2020, a total of 7934 tweets related to Qassim University were collected by the author of [12]; annotators labelled aspect categories and sentiment polarities. In [13], 1000 Arabic book reviews were selected and annotated from the LABR dataset; annotators labelled aspect terms and sentiment polarity terms. Since a review can contain more than one aspect, they assigned one sentiment to the entire review, much like SA. Additionally, in [14], a total of 2071 Arabic reviews of 60 different mobile apps created by the United Arab Emirates' government were selected from the Apple Store and Google Play. Annotation was carried out with a specially designed computer application named "GARSA"; annotators labelled aspect terms, sentiment words, and aspect categories.
In our review of the literature, we noted that most of the datasets were collected from multiple sources such as Twitter, YouTube, Facebook, application reviews, and various websites. Datasets were mostly annotated manually by the researchers, although most are not publicly available. Currently, the SemEval 2016 Arabic hotel reviews dataset is the published dataset that best represents a benchmark for ABSA in Arabic. However, it contains many records with only one aspect category. Using it, we would therefore lose the advantage of ABSA, since the task is reduced to sentence-level sentiment analysis, according to the study of Jiang et al. [5]. That study proved that if a dataset contains sentence-level reviews, classifiers can still achieve competitive results without considering aspects. Furthermore, advanced ABSA methods trained on such datasets can hardly distinguish the sentiment polarities of the various aspects in sentences that contain multiple aspects and multiple sentiments. This encouraged us to enrich the field with a well-constructed ABSA Arabic corpus for Riyadh restaurants. The creation process is described in detail in the following sections.

Dataset Creation
In this paper, the aim is to create an Arabic multi-aspect (AraMA) corpus of Riyadh restaurant reviews for ABSA. We decided to remove sentences with one aspect category from the original dataset to prevent the reduction of ABSA to the sentence level, as proven in [5]. All sentences in AraMA have at least two aspect categories. After that, we created an Arabic multi-aspect, multi-sentiment (AraMAMS) version, which contains only sentences with different sentiment polarities. Both datasets have sentences with at least two aspect categories, but they differ in their sentiments: AraMA may contain sentences with the same sentiments, while AraMAMS only contains sentences with different sentiments. Figure 1 summarizes the creation workflow.

Data Collection
We wanted to target the dialectal Arabic that Saudi people use in daily life. Thus, we decided to collect Google Maps reviews of Riyadh restaurants. We used the Instant Data Scraper extension for Google Chrome to collect reviews into an Excel sheet [18]. We collected the most recent reviews of famous restaurants in a highly visited area of Riyadh from the Google Maps website. A total of 21,330 reviews were collected from 61 restaurants. The Arabic reviews were mostly in DA; there were few English reviews. The data included the reviewer's username, number of reviews, user title, review, an image link if available, review date, thumbs up, and a reply from the restaurant owner if available. We were only interested in the reviews. Figure 2 illustrates one comment section in Google Maps, with the user review highlighted in red. The translation of the text inside the red box is: "A Turkish restaurant.. very good.. baking and potteries are good.. restaurant environment is good.. we came to taste beef and cheese pottery, but it was less than expected."

Corpus Cleaning and Preprocessing
To increase the accuracy of the opinion-mining process and to prevent excessive processing overhead, we first removed empty reviews from the Excel sheet. Following that, a regular expression tool (regex [19]) in Python was used to preprocess the reviews on the Google Colab platform [20]. Regex was used to check whether a string contained a specified search pattern. The preprocessing steps included multiple tasks, e.g., the removal of unnecessary characters such as punctuation, diacritics, numbers, emojis, and all English letters. After that, stemming was performed to remove repeated characters, and normalization was carried out. Table 2 shows examples of parts of the reviews before and after the preprocessing task. At this step, we removed about 2678 reviews that had become empty.
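The cleaning steps described above can be sketched in Python. The authors' exact patterns are not published, so the character ranges and normalization choices below (unifying alef variants, collapsing three or more repeated characters) are common conventions for Arabic preprocessing, not the paper's actual code:

```python
import re

# Assumed patterns for the cleaning pipeline described in the text.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")   # tashkeel (short-vowel) marks
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")      # digits, Latin letters, emoji, punctuation
REPEATS    = re.compile(r"(.)\1{2,}")               # 3+ repeated characters -> 1

def normalize(text: str) -> str:
    # Unify alef variants; a common (but not universal) normalization choice.
    return re.sub("[\u0622\u0623\u0625]", "\u0627", text)

def clean(text: str) -> str:
    text = DIACRITICS.sub("", text)        # remove diacritics
    text = NON_ARABIC.sub(" ", text)       # remove non-Arabic characters
    text = REPEATS.sub(r"\1", text)        # collapse repeated characters
    text = normalize(text)
    return re.sub(r"\s+", " ", text).strip()

print(clean("Good food 123!!"))       # becomes empty, matching Table 2
print(clean("مطعممم رائع"))           # repeated characters collapsed
```

Reviews that become empty strings after cleaning (such as English-only reviews) correspond to the ~2678 reviews removed at this step.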

Annotation
Although manual annotation consumes a great deal of time and resources, we performed it to ensure a more accurate data-labeling process [14]. Figure 3 illustrates the annotation process. This section is divided into three parts. The first is the aspect-based approach, in which we explain how we defined categories and sentiments before starting annotation. The second part describes the annotation platform and how the platform we created was used during annotation. The third part is annotator recruitment, wherein we explain in detail all the steps we went through in choosing the annotators.

Aspect-Based Approach
Before starting the annotation process, the aspect categories had to be identified. The research in [21] aimed to identify the most important factors affecting restaurant guests' satisfaction, in order to help restaurant owners address them. The study concluded that the quality dimensions can be summarized as food quality, service quality, physical environment, and price fairness. Based on this, we selected our aspect categories: food, service, environment, and price. Table 3 shows the topics related to each aspect category. For sentiment annotation, we added conflict polarity alongside positive, negative, and neutral, because of the nature of restaurant reviews: a user frequently likes one dish but hates another. In such cases, reviews were marked with the conflict sentiment, wherein both positive and negative sentiment polarities apply to the same aspect. Examples of each aspect category under each sentiment are displayed in Table 4.
Table 4. Examples of reviews for each sentiment from all aspect categories.

Food
- Positive: "Very good in terms of food quality" / "Honestly, the food is suspicious"
- Negative: "Quantities are small" / "Pizza is bad"
- Neutral: "My experience with food is okay" / "The taste is normal"
- Conflict: "The food is delicious, but there are few items on the menu" / "The pasta is delicious, but the salad has a strange taste"

Environment
- Positive: "The décor is very nice" / "The place is clean"
- Negative: "Unfortunately, it is not suitable for families, as there is no privacy ever" / "The furniture is very old and worn out"
- Neutral: "Indoor and outdoor tables are normal" / "The place and decor are okay"
- Conflict: "The place is cool, the music is annoying" / "The inside tables are bad; the outside tables are nice and tidy"

Price
- Positive: "Their prices are cheap compared to competing restaurants" / "Their prices are very good for large families"
- Negative: "They do not have card payment" / "Prices are much higher than before"
- Neutral: "Prices are suitable" / "The pizza price is average"
- Conflict: "Water is expensive, the dish prices are good" / "The food is reasonably priced, but the drinks are so expensive"

Service
- Positive: "The staff are friendly, as soon as you enter, they welcome you" / "Quick serving"
- Negative: "Staff are not respectful at all" / "The waiter did not come to receive our orders until after a quarter of an hour"
- Neutral: "The waiters are like those in any other restaurant, nothing special" / "Service is normal"
- Conflict: "The staff is cooperative, but the language difference makes things difficult" / "The service is fast, but I wish that I could order by phone or website"
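The label scheme above can be captured in a small validity check, useful when loading annotations; the function and set names are illustrative, not the authors' code:

```python
# Assumed helper for validating annotation labels against the scheme
# defined in this section (four categories, four polarities).
CATEGORIES = {"food", "environment", "service", "price"}
SENTIMENTS = {"positive", "negative", "neutral", "conflict"}

def valid_label(category: str, sentiment: str) -> bool:
    """Return True if (category, sentiment) is an allowed label pair."""
    return category in CATEGORIES and sentiment in SENTIMENTS

print(valid_label("food", "conflict"))   # an allowed pair
print(valid_label("menu", "positive"))   # "menu" is not a defined category
```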

Annotation Platform
Within this step, we had collected all reviews in an Excel sheet. In some other SA tasks (e.g., emotion detection, stance detection), annotation is easy to manage in Excel, since each review needs only one tag; for example, in stance detection, the annotator labels a review as with, against, or neutral. ABSA datasets, on the other hand, are very difficult to maintain in an Excel file, because annotators need to identify all opinions in a sentence, the categories they fall under, and their sentiments. For example, the sentence " " has two categories: food and service. The food category shows a positive sentiment, while the service category shows a neutral sentiment. Thus, to start our annotation, we converted the Excel file containing the reviews into an XML file using the Python programming language. For the convenience of the annotators, and to obtain the best results in terms of accuracy and time, we built a website for the annotations. The website interface is very simple, as shown in Figure 4. The website allowed the annotators to sequentially obtain reviews from a database containing all reviews. Each review was read, analyzed, and annotated by marking checkboxes representing the categories included in the sentence; sentiments were then selected from the corresponding drop-down list. The review background color was red, but after the annotators clicked "save", the background color changed to grey, to ensure that annotators would not miss any review.
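The Excel-to-XML conversion step can be sketched as follows. The element and attribute names are assumptions, since the paper does not publish its schema, and in practice the review texts would be read from the spreadsheet (e.g., with openpyxl) rather than hard-coded:

```python
import xml.etree.ElementTree as ET

# Hypothetical review texts standing in for the spreadsheet column.
reviews = [
    "The food is delicious, but the waiter was slow.",
    "Nice decor, average prices.",
]

root = ET.Element("Reviews")
for i, text in enumerate(reviews, start=1):
    review = ET.SubElement(root, "Review", id=str(i))
    ET.SubElement(review, "text").text = text
    # Annotators later fill one opinion entry per aspect they identify.
    ET.SubElement(review, "Opinions")

xml_string = ET.tostring(root, encoding="unicode")
print(xml_string)
```

An XML record per review gives each annotation a natural place to live (one opinion element per aspect), which a flat spreadsheet row cannot represent cleanly.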

Annotator Recruitment
Since all the reviews we collected were in Arabic, the annotators were required to be Arab, with Arabic as their mother tongue and a wide knowledge of the Arabic language and its dialects. Due to the complexity of the reviews and users' opinions, the annotators were required to be over 20 years old. We recruited four Arab annotators whose mother tongue was Arabic and who understood the dialects in the reviews well. Their ages ranged from 26 to 32, and all had a computer-related background (information technology and software engineering). First, an overview and explanation of the aim of the research were given to the annotators, followed by the annotation guidelines (Supplementary Material). We then assessed their understanding of the task by reviewing their annotations of 20 review sentences. After confirming that these annotations were correct, we sent them the link to the annotation website. The reviews were divided evenly; each annotator was responsible for about 5330 reviews. They were told that they could make contact at any time if they had doubts about their work. After the annotations were completed, we exchanged sets between annotators to ensure that each sentence was reviewed by at least two individuals.
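The cross-review round can be supported by a simple disagreement check, such as the sketch below; the paper does not describe its tooling for this step, so the record layout is illustrative:

```python
# Hypothetical label sets from two annotation rounds, keyed by review id.
# Each label is a (category, sentiment) pair.
ann_a = {1: {("food", "positive"), ("service", "neutral")},
         2: {("price", "negative")}}
ann_b = {1: {("food", "positive"), ("service", "negative")},
         2: {("price", "negative")}}

# Flag any review whose two label sets differ, for discussion and resolution.
disagreements = {rid for rid in ann_a if ann_a[rid] != ann_b.get(rid)}
print(sorted(disagreements))   # review 1: annotators disagree on the service polarity
```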

Exploratory Data Analysis
During the annotation process, sentences with one aspect and unrelated reviews were excluded. Thus, from the 21,330 collected reviews, we obtained 10,739 annotated reviews to create the AraMA dataset (the original dataset). Annotators identified 25,653 aspect categories mentioned in sentences, with corresponding sentiment polarities. The dataset contains a total of 16,551 positive, 6154 negative, 1439 neutral, and 1509 conflict tags, and a total of 9539 tags in the food aspect category, 6395 in the environment category, 5660 in the service category, and 4059 in the price category. Table 5 provides more statistics of the dataset. Figure 5 shows the percentage of the total number of reviews in each aspect category. Following that, we used Python code to extract reviews with multiple sentiments into a separate file to create the AraMAMS corpus, which is another version of AraMA. AraMAMS contains 5312 extracted reviews, with 13,387 annotated aspect category and sentiment polarity tags. It contains a total of 6483 positive, 4056 negative, 1403 neutral, and 1445 conflict tags.
This includes a total of 4791 tags in the food aspect category, 3183 in the environment category, 2282 in the service category, and 3131 in the price category. Table 6 provides more statistics of the AraMAMS dataset. Figure 6 shows the percentages of the total number of reviews in each aspect category.

Table 6. Sentiment tag counts per aspect category in the AraMAMS corpus.

Category     Positive  Negative  Neutral  Conflict  Total
Food         3034      510       409      838       4791
Environment  1793      856       78       456       3183
Service      1526      619       52       85        2282
Price        130       2071      864      66        3131
Total        6483      4056      1403     1445      13,387

Figure 6. Visual representation of the tag percentages of the aspect categories within the AraMAMS corpus.
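The multi-sentiment extraction step mentioned above can be sketched as a simple filter; the record layout below is an assumed simplification of the XML format described in the paper:

```python
# Assumed in-memory form of AraMA records: each review carries a list of
# (category, sentiment) opinion pairs.
arama = [
    {"id": 1, "opinions": [("food", "positive"), ("service", "positive")]},
    {"id": 2, "opinions": [("food", "positive"), ("price", "negative")]},
    {"id": 3, "opinions": [("environment", "neutral"), ("service", "conflict")]},
]

def is_multi_sentiment(review):
    """Keep only reviews whose aspects carry at least two distinct sentiments."""
    sentiments = {sent for _, sent in review["opinions"]}
    return len(sentiments) >= 2

aramams = [r for r in arama if is_multi_sentiment(r)]
print([r["id"] for r in aramams])   # review 1 is dropped: both aspects are positive
```

Applying this filter to the 10,739 AraMA reviews yields the 5312 multi-sentiment reviews of AraMAMS.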
AraMA and AraMAMS are in XML format, containing a record for each review. The record information includes the user review, aspect category, and the corresponding sentiment. Figure 7 shows an example of review records from both datasets.

In both datasets, there were more positive sentiments than negative, conflict, and neutral sentiments, respectively. On the category side, food aspects dominated in both datasets, followed by the environment category. In AraMA, the service category ranked ahead of the price category, while in AraMAMS, the price category ranked ahead of the service category.
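As a sketch of how such XML records can be consumed, the snippet below parses a toy record with Python's standard library. The element and attribute names (`review`, `aspect`, `category`, `polarity`) are illustrative assumptions, not the exact AraMA/AraMAMS schema:

```python
import xml.etree.ElementTree as ET

# Toy record with a hypothetical layout: one <review> holding the text
# and a list of (aspect category, sentiment polarity) annotations.
sample = """
<reviews>
  <review id="1">
    <text>sample review text</text>
    <aspects>
      <aspect category="food" polarity="positive"/>
      <aspect category="price" polarity="negative"/>
    </aspects>
  </review>
</reviews>
"""

root = ET.fromstring(sample)
for review in root.iter("review"):
    # Collect every annotated (category, polarity) pair for this review.
    tags = [(a.get("category"), a.get("polarity"))
            for a in review.iter("aspect")]
    print(review.get("id"), tags)
```

A loader along these lines would yield, per review, the multi-label targets used in the validation experiments below.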
When comparing the datasets (the statistics of AraMA are shown in Table 5 and those of AraMAMS in Table 6), the greatest difference is seen in the positive sentiment tags. The total number of positive sentiment tags in Table 5 is 16,551, while in Table 6 it is 6483, a difference of 10,068. Further, there are 6154 negative sentiment tags in Table 5 and 4056 in Table 6, a difference of 2098. On the other hand, there was only a small difference between the total numbers of neutral and conflict sentiment tags.

Corpus Validation
In order to validate both corpora, we applied supervised ML classifiers to provide baseline results. Using the Python language on the Google Colab platform, we ran four different classifiers: naïve Bayes (NB); support vector classification (SVC) with a linear kernel, as well as linear SVC; and stochastic gradient descent (SGD). In this study, we treated the datasets as a multi-class, multi-label text classification problem. Thus, we calculated the micro-averages of the precision, recall, and F1 measures.
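Micro-averaging pools true positives, false positives, and false negatives over all labels before computing each score, so frequent labels weigh more than rare ones. A minimal illustration with scikit-learn on toy multi-label matrices (not the corpus data; rows are reviews, columns the four aspect categories):

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy binary indicator matrices: 3 reviews x 4 aspect categories.
y_true = np.array([[1, 1, 0, 0],
                   [0, 1, 1, 0],
                   [1, 0, 0, 1]])
y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 1, 0],
                   [1, 0, 1, 1]])

# Micro-averaging counts TP/FP/FN across all cells of the matrix.
p = precision_score(y_true, y_pred, average="micro")
r = recall_score(y_true, y_pred, average="micro")
f1 = f1_score(y_true, y_pred, average="micro")
print(p, r, f1)
```

Here 5 of the 6 predicted tags are correct and 5 of the 6 true tags are recovered, so precision, recall, and F1 all equal 5/6.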
Both corpora were divided into training (70%) and testing (30%) data. Table 7 shows the number of reviews in the two corpora after splitting. After that, four data frames were created: one for the aspect categories and three for the sentiments. The positive and negative sentiments each had an individual data frame, while the neutral and conflict sentiments were gathered into the same data frame because they carry almost the same sentimental meaning and have fewer tags. The evaluation results of the classifiers on the AraMA corpus (original) and the AraMAMS corpus (second version) are provided in Table 8. The results can be viewed on a color scale between green and red for easier reading; green represents good results, while red represents bad results.
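A minimal sketch of the 70/30 split followed by a one-vs-rest linear SVC over a multi-label target, using toy stand-ins for the reviews and the binary aspect-category labels (the real pipeline would load the annotated XML reviews instead):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

# Toy stand-ins: 10 "reviews" and a binary label matrix over the four
# aspect categories (food, environment, service, price).
texts = ["review text number %d" % i for i in range(10)]
labels = np.array([[i % 2, (i + 1) % 2, (i // 2) % 2, ((i // 2) + 1) % 2]
                   for i in range(10)])

# 70% training / 30% testing, as in the corpus validation setup.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.30, random_state=42)

# TF-IDF features + one binary linear SVC per aspect category.
vec = TfidfVectorizer()
clf = OneVsRestClassifier(LinearSVC())
clf.fit(vec.fit_transform(X_train), y_train)
preds = clf.predict(vec.transform(X_test))
print(preds.shape)  # one binary prediction per review per category
```

The other baselines (NB, SVC with a linear kernel, SGD) slot into the same one-vs-rest wrapper by swapping the estimator.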
Starting with the results for the AraMA corpus, it can be seen from Table 8 that SVC with a linear kernel achieved the best performance for the aspect categories and for the negative, neutral, and conflict sentiments. The highest F1 measure result was 91.41%, obtained for the aspect category. On the other hand, in the AraMAMS corpus, the best F1 measure results across all categories were obtained using the linear SVC model (an F1 measure of 91.70% for the aspect category).
Overall, the NB model achieved the worst results in both corpora. In addition, the results of the linear SVC and SVC with linear kernel models were similar in all categories, except for the negative sentiment category, where there was a minor difference of 0.15%.
When comparing the results of the two corpora, the best F1 measure result in AraMA was 91.41% for the aspect category, while the best in AraMAMS was 91.70% for the same category. Therefore, we can say that there was a slight improvement in the F1 measure in AraMAMS, yet the difference is not significant. In addition, the precision, recall, and F1 measure results were worst for the neutral and conflict sentiments in both corpora. This is due to the small number of tags available for them in both corpora.