Multi-Aspect Sentiment Classification of Arabic Tourism Reviews Using BERT and Classical Machine Learning

Samar Zaid; Amal Hamed Alharbi; Halima Samra

doi:10.3390/data10110168

Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia

* Author to whom correspondence should be addressed.

Data 2025, 10(11), 168; https://doi.org/10.3390/data10110168
(registering DOI)

Abstract

Understanding visitor sentiment is essential for developing effective tourism strategies, particularly as Google Maps reviews have become a key channel for public feedback on tourist attractions. Yet, the unstructured format and dialectal diversity of Arabic reviews pose significant challenges for extracting actionable insights at scale. This study evaluates the performance of traditional machine learning and transformer-based models for aspect-based sentiment analysis (ABSA) on Arabic Google Maps reviews of tourist sites across Saudi Arabia. A manually annotated dataset of more than 3500 reviews was constructed to assess model effectiveness across six tourism-related aspects: price, cleanliness, facilities, service, environment, and overall experience. Experimental results demonstrate that multi-head BERT architectures, particularly AraBERT, consistently outperform traditional classifiers in identifying aspect-level sentiment. AraBERT achieved an F1-score of 0.97 for the cleanliness aspect, compared with 0.91 for the best-performing classical model (LinearSVC), indicating a substantial improvement. The proposed ABSA framework facilitates automated, fine-grained analysis of visitor perceptions, enabling data-driven decision-making for tourism authorities and contributing to the strategic objectives of Saudi Vision 2030.

Keywords:

Arabic aspect-based sentiment analysis; Google Maps reviews; BERT; transformer-based models; tourism analytics; Arabic natural language processing; multi-aspect classification

1. Introduction

The rapid growth of social media and review platforms has generated an abundance of user-generated content that reflects public opinion and individual experiences []. Among these platforms, Google Maps, used by more than two billion people globally, has emerged as a major source of travel information, providing extensive reviews of hotels, attractions, and various leisure activities []. Its vast collection of user comments, ratings, photos, and detailed place descriptions, spanning restaurants to museums, offers a valuable corpus for understanding visitor experiences [,].

In the tourism sector, sentiment analysis plays an increasingly vital role in understanding customer perceptions and improving service quality. However, deriving actionable insights from Arabic-language reviews remains challenging due to the unstructured nature of the text, the linguistic complexity of Arabic, and its wide dialectal variation [,]. These challenges have constrained the effectiveness of sentiment analysis in tourism analytics across the Arab world. Despite the expanding Arabic-speaking online population, which now exceeds 185 million users and accounts for 4.8% of global internet users, aspect-based sentiment analysis (ABSA) in Arabic, particularly for tourism-related data from platforms such as Google Maps, remains substantially underexplored [,]. Most prior research has concentrated on domains such as social media, finance, or healthcare, leaving tourism applications relatively neglected.

To address this gap, the present study develops and evaluates advanced models for Arabic ABSA on Google Maps reviews related to tourist attractions in Saudi Arabia. A manually annotated dataset of more than 3500 reviews was constructed, covering six key aspects: price, cleanliness, facilities, service, environment, and overall experience. The study benchmarks a multi-head classification architecture based on a pre-trained Arabic BERT model (asafaya/bert-base-arabic) against traditional machine learning pipelines, including ensemble classifiers and optimized SVMs using TF-IDF features. This hybrid approach enables a systematic comparison between deep learning and classical methods for multi-aspect sentiment classification within Arabic tourism data.

Although the study does not introduce a new ABSA architecture, its main contribution lies in the thoughtful adaptation of advanced transformer-based methods to an underexplored domain, Arabic tourism reviews from Google Maps. The originality of this work stems from the development of a domain-specific, manually annotated dataset and the presentation of the first comprehensive benchmark comparing classical and transformer-based models for Arabic aspect-based sentiment analysis in the tourism context.

This research makes three primary contributions.

Novelty in Domain and Data: To the best of our knowledge, this is the first systematic study to conduct ABSA on Arabic Google Maps reviews within the tourism sector. Prior research has rarely addressed ABSA for Arabic Google Maps data or examined tourism as a domain.
Resource Creation: The study introduces a new manually annotated Arabic dataset of Google Maps reviews covering six critical tourism-related aspects (price, cleanliness, facilities, service, environment, and overall experience). This dataset fills a major gap, as no publicly available, aspect-annotated corpus currently exists for Arabic tourism reviews.
Benchmarking and Practical Insights: The research benchmarks state-of-the-art transformer-based models (multi-head BERT) against classical machine learning approaches, providing a practical evaluation of their effectiveness on real-world, dialect-rich Arabic data. The findings offer actionable insights for tourism analytics and policy development in Saudi Arabia.

2. Background

2.1. Sentiment Analysis

The primary goal of sentiment analysis is to identify and categorize the emotional tone or attitude expressed in text. This process typically involves classifying user-generated content as positive, neutral, or negative based on the viewpoints expressed [].

Sentiment can be modeled using two main paradigms. The categorical approach assigns discrete labels such as positive, neutral, or negative, whereas the dimensional approach represents emotion along continuous scales of valence and arousal to capture more nuanced affective variations. While dimensional models offer greater emotional granularity, categorical representation remains dominant in aspect-based sentiment analysis (ABSA) due to its interpretability and compatibility with standard evaluation metrics. Accordingly, this study adopts a categorical framework to support transparent aspect-level analysis while recognizing the potential of dimensional modeling for future research that aims to capture continuous emotional intensity [].

ABSA encompasses several sub-tasks, including Aspect Sentiment Classification (ASC), Aspect Sentiment Triplet Extraction (ASTE), Aspect Sentiment Quad Prediction (ASQP), and the emerging Dimensional ABSA (DimABSA), which models sentiment on continuous dimensions. The present study focuses on the ASC task, addressing six predefined aspects within the tourism domain to align with practical analytical needs [].

Sentiment analysis can be performed at multiple levels of granularity, each serving distinct analytical purposes [,,]:

Document-level analysis assigns an overall sentiment (positive, neutral, or negative) to an entire text based on its global tone.
Sentence-level analysis evaluates each sentence independently, enabling the detection of mixed opinions within a single document.
Phrase-level analysis targets specific expressions or keywords that reflect sentiment toward a particular feature or topic.
Aspect-level analysis offers the most fine-grained understanding by detecting multiple aspects within the same text and assigning each a separate sentiment polarity, facilitating precise interpretation of user opinions.

The techniques employed in sentiment analysis are generally grouped into three primary categories: lexicon-based, machine learning-based, and hybrid approaches, as illustrated in Figure 1. Hybrid approaches are designed to overcome the limitations of individual methods by integrating their complementary strengths [].

Figure 1. Sentiment analysis approaches.

Lexicon-Based Approach: This method depends on sentiment dictionaries or predefined word lists that categorize terms as positive, negative, or neutral. Each word is given a sentiment score, and the overall sentiment of the text is determined by summing these scores [].
Machine Learning-Based Approach: This approach uses algorithms trained on labeled datasets to detect sentiment patterns. It includes techniques like Naive Bayes, Support Vector Machines (SVM), and advanced deep learning models such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), including LSTM and BiLSTM, which have shown strong performance in sentiment classification tasks [].
Hybrid Approach: This technique combines elements of both lexicon-based and machine learning-based methods to take advantage of the strengths of each. It aims to enhance accuracy, adaptability, and performance in various sentiment analysis applications [].
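As a minimal sketch of the lexicon-based approach described above, the following Python snippet sums per-word scores and maps the total to a polarity label. The tiny English lexicon, its scores, and the function name are hypothetical and chosen only for illustration; practical lexicons are far larger and usually need to handle negation and, for Arabic, rich morphology.

```python
# Toy lexicon: word -> sentiment score (hypothetical values for illustration).
TOY_LEXICON = {"clean": 1, "beautiful": 1, "friendly": 1,
               "dirty": -1, "expensive": -1, "crowded": -1}


def lexicon_polarity(text, lexicon=TOY_LEXICON):
    """Sum the scores of known words and map the total to a label."""
    tokens = text.lower().split()
    # Strip trailing punctuation so "beautiful." still matches "beautiful".
    total = sum(lexicon.get(token.strip(".,!?"), 0) for token in tokens)
    if total > 0:
        return "positive"
    if total < 0:
        return "negative"
    return "neutral"


print(lexicon_polarity("The garden is clean and beautiful."))
print(lexicon_polarity("Tickets are expensive and the place is crowded!"))
print(lexicon_polarity("We visited on Friday."))
```

A hybrid approach could, for example, feed such lexicon totals into a trained classifier as additional features.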

2.2. Aspect-Based Sentiment Analysis

Aspect-Based Sentiment Analysis (ABSA) enables detailed examination of opinions by identifying specific attributes or aspects within a text and determining the sentiment polarity associated with each. Unlike document- or sentence-level sentiment analysis, ABSA extracts fine-grained insights into multiple features of a product or service simultaneously, making it especially valuable for domains such as tourism, e-commerce, and hospitality [].

ABSA generally comprises two main tasks:

Aspect extraction: identifies the specific features or attributes discussed in the text, such as price, cleanliness, or facilities in tourism-related reviews.
Sentiment classification: assigns a sentiment polarity (positive, neutral, or negative) to each extracted aspect based on the user’s expressed opinion.

This two-step process provides actionable, aspect-specific insights that help researchers and practitioners understand user satisfaction and preferences at a granular level [,].
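To make the output of this two-step process concrete, the sketch below shows one possible way to store a review together with its aspect-level labels. The class name, the method name, and the example review are hypothetical; the review is invented for illustration and is not taken from the dataset. The aspect and polarity sets follow those named in this paper.

```python
from dataclasses import dataclass, field

ASPECTS = ("price", "cleanliness", "facilities", "service",
           "environment", "overall experience")
POLARITIES = ("positive", "neutral", "negative")


@dataclass
class AnnotatedReview:
    """One review plus the sentiment of each aspect it mentions."""
    text: str
    aspect_sentiment: dict = field(default_factory=dict)

    def add(self, aspect, polarity):
        # Reject labels outside the predefined aspect and polarity sets.
        if aspect not in ASPECTS or polarity not in POLARITIES:
            raise ValueError(f"unknown label: {aspect!r}, {polarity!r}")
        self.aspect_sentiment[aspect] = polarity


# Hypothetical example, not taken from the dataset.
review = AnnotatedReview("Lovely park, but the entry ticket is overpriced.")
review.add("environment", "positive")
review.add("price", "negative")
print(review.aspect_sentiment)
```

A single review can thus carry opposing polarities for different aspects, which a document-level label would hide.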

2.2.1. Challenges of ABSA in Arabic

While ABSA has achieved notable success in English and other languages, applying it to Arabic presents distinct linguistic challenges [,]. Arabic exhibits rich morphology, diverse regional dialects, and frequent omission of diacritics, all of which increase lexical ambiguity. For instance, the word “حلو” can mean “sweet” when describing taste (e.g., a dessert) or “nice” when praising an experience (e.g., the atmosphere), depending on context. Moreover, dialectal variation across regions leads to differing expressions for the same aspect, complicating both aspect extraction and sentiment polarity detection. These factors make Arabic ABSA considerably more complex than its counterparts in morphologically simpler languages.

2.2.2. Suitability of Multi-Head Architectures for ABSA

Recent advances in deep learning have introduced multi-head architectures, particularly multi-head BERT models, which are highly suitable for ABSA tasks [,]. In this configuration, a shared encoder learns general linguistic representations while multiple classification heads specialize in predicting sentiment for individual aspects. This setup enables simultaneous learning of shared and aspect-specific features, improving efficiency and accuracy across multiple aspects within a single review.
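One common way to formalize this configuration is sketched below. The notation is introduced here for illustration, and the summed cross-entropy objective is an assumption about how such models are typically trained, not a statement of the objective used in this study.

```latex
\begin{align}
  \mathbf{h} &= \mathrm{Enc}_{\theta}(x), \\
  \hat{\mathbf{y}}^{(a)} &= \mathrm{softmax}\!\left(W_a \mathbf{h} + \mathbf{b}_a\right),
      \qquad a = 1, \dots, A, \\
  \mathcal{L} &= -\sum_{a=1}^{A} \sum_{c \in \mathcal{C}} y^{(a)}_{c} \log \hat{y}^{(a)}_{c}.
\end{align}
```

Here $x$ is the input review, $\mathrm{Enc}_{\theta}$ is the shared encoder with parameters $\theta$, $\mathbf{h}$ is its pooled representation, $(W_a, \mathbf{b}_a)$ are the parameters of the head for aspect $a$, $\mathcal{C}$ is the set of polarity classes, and $y^{(a)}_{c}$ is the one-hot gold label. Because every head's loss flows back into $\theta$, the encoder is updated by all aspects, while each $(W_a, \mathbf{b}_a)$ is updated only by its own aspect.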

Beyond traditional transformer encoders, recent research has explored Large Language Model (LLM) approaches for ABSA. These methods leverage instruction tuning and retrieval-augmented strategies such as retrieval-based example ranking to enhance aspect detection and contextual sentiment reasoning. While this study focuses on transformer-based architectures (AraBERT, MARBERT, QARiB), instruction-tuned LLMs represent a promising direction for improving cross-domain generalization and performance on dialect-rich Arabic reviews [].

3. Related Work

Aspect-based sentiment analysis (ABSA) has received considerable scholarly attention in recent years. This section provides an integrated synthesis of prior research on Arabic-language ABSA across multiple domains, followed by studies applying ABSA to the tourism sector and, more specifically, to Google Maps reviews.

3.1. Aspect-Based Sentiment Analysis in Arabic

Recent years have seen significant advancements in Arabic aspect-based sentiment analysis (ABSA), encompassing a range of approaches including deep learning architectures and ontology-based methods. However, Arabic text poses distinctive challenges such as dialectal variation, complex morphology, and the scarcity of large, annotated corpora.

Abdelgwad et al. [] introduced a BERT-based model for Arabic aspect sentiment polarity classification, fine-tuning separate classifiers on three datasets (hotel reviews, book reviews, and news comments) and achieving accuracy scores of 89.5%, 73%, and 85.7%, respectively. Although their approach relied on single-task classification per aspect, it demonstrated strong generalization across domains. In contrast, the present study employs a multi-head architecture to classify sentiment across six aspects simultaneously within a unified model.

Alnasser et al. [] developed the HEAR dataset, which includes more than 31,000 Arabic Google Maps reviews in the healthcare domain. The authors fine-tuned several pre-trained transformer models—including MARBERT, AraBERT, and CAMeLBERT—and reported that the MARBERT+SVM combination achieved 92.14% accuracy and a 92.06% F1-score. While the focus was healthcare, their results confirmed the effectiveness of transformer-based models for handling Arabic user-generated content.

Shafiq et al. [] proposed an end-to-end AraBERT-CRF model that performs both aspect term extraction and sentiment classification. The model achieved high F1-scores on the SemEval-2016 hotel review dataset: 95.11% for aspect extraction and 98.34% for sentiment classification. Unlike our work, which uses predefined aspects, their model dynamically learned aspect terms directly from the input text.

Similarly, Mohammad et al. [] introduced a gated recurrent unit (GRU)-based deep learning model enhanced with the Multilingual Universal Sentence Encoder (MUSE). Applied to Arabic hotel reviews, their model showed strong performance in both aspect extraction and sentiment classification, outperforming traditional baselines despite the dataset being unavailable publicly.

In the telecommunication sector, Alshammari et al. [] designed a hybrid framework for aspect-based sentiment analysis and location detection using Arabic tweets. Human annotators labeled 1277 tweets by aspect and sentiment polarity. The authors compared several classifiers, including logistic regression, SVM, random forest, and CNN, and reported that the CNN model, augmented with part-of-speech tagging and named entity recognition, achieved an F1-score of 0.81 and 75% accuracy for aspect classification. However, the narrow domain and limited dataset reduce the generalizability of their results to broader contexts such as tourism.

In a separate study, Alshaikh et al. [] employed a BERT-based model to analyze open-ended survey responses in the higher education sector. Their dataset of 1506 manually annotated responses contained aspect terms, sentiment polarities, and predefined categories. The model achieved an F1-score of 58% for aspect extraction, 86% for polarity classification, and 98% for aspect-to-category mapping, demonstrating promise for domain-specific applications with room for improvement using richer datasets.

Abdelgwad et al. [] also reframed aspect sentiment polarity classification as a sentence-pair classification task. Using Arabic hotel, book, and news review datasets, they trained a BERT-based model that effectively captured contextual meaning and achieved competitive performance across all domains.

Building on semi-supervised learning strategies, Almasri et al. [] combined the “Noisy Student” framework with AraBERT, augmenting the SemEval 2016 dataset using pseudo-labeling and the HARD dataset. Their model achieved a micro-level F1-score of 68.2%, showing that leveraging unlabeled data can partially compensate for limited annotated corpora, though effectiveness depends heavily on dataset diversity.

Finally, Khabour et al. [] adopted an ontology-based approach for sentiment classification in Arabic hotel reviews. Manual annotation was combined with a domain-specific ontology, yielding 79.20% accuracy and a 78.7% F1-score. While semantic enrichment improved classification, the ontology construction process was labor-intensive and domain-dependent.

Collectively, these studies illustrate the diversity of methods applied to Arabic ABSA, though most focus on narrow domains such as hotels or healthcare using relatively small or nonpublic datasets. Consequently, their findings are difficult to generalize, particularly within the tourism sector. This limitation underscores the need for scalable, aspect-aware models based on large, real-world corpora such as Google Maps reviews—an objective directly addressed in the present study.

3.2. Aspect-Based Sentiment Analysis in the Tourism Sector

Recent studies have applied ABSA to various areas of the tourism domain, exploring different modeling and annotation strategies to analyze reviews of hotels, airlines, restaurants, and other travel-related services.

Recurrent neural networks, particularly LSTM and BiLSTM architectures, have demonstrated promising results in analyzing aspects such as cleanliness, pricing, and accommodation. For instance, ref. [] analyzed tourist reviews from ten Indonesian destinations on TripAdvisor to support sustainable tourism initiatives. BiLSTM outperformed other models in both aspect detection and sentiment classification. However, the exclusive focus on Indonesian reviews limits the transferability of these results to Arabic contexts, and the lower F1-scores reflect possible dataset imbalance or noise.

Expanding on this, ref. [] introduced a hybrid model combining word sense disambiguation (WSD), BiLSTM, BERT, and graph convolutional networks to enhance ABSA in hotel reviews. The model achieved F1-scores near 90% for both positive and negative classes. Nonetheless, its confinement to English-language hotel reviews restricts its relevance to multilingual tourism datasets.

In a broader analysis, ref. [] examined tourist dissatisfaction in Granada, Spain, using over 50,000 reviews from Twitter, Instagram, and TripAdvisor. A BERT-based entity recognition and clustering approach provided insights into recurring negative experiences, showing how hybrid techniques can support destination management.

Similarly, ref. [] proposed an ACOS LLM framework to extract quadruples of aspect–category–opinion–sentiment, integrating external knowledge and manual annotations to improve the F1-score by 7.49% on a tourism dataset. This work highlighted the utility of domain-specific augmentation for refining sentiment predictions.

In another study, ref. [] used ChatGPT to summarize TripAdvisor hotel reviews, focusing on negative service experiences. Combined with a BERT-based sentiment model, the system produced aspect-level summaries that enabled hotel managers to identify service deficiencies rapidly.

The LiFeBERT model introduced in [] assessed airline service satisfaction before and during the COVID-19 pandemic. Reviews were manually annotated across eight service aspects, achieving an F1-score of 60.1%. The study demonstrated the contextual challenges of modeling sentiment during global disruptions.

Beyond hospitality, ref. [] leveraged aspect-level sentiment data from Yelp to forecast restaurant longevity using a conditional survival forest model. Their results confirmed that aspect-specific sentiment signals provide greater predictive power than overall ratings.

Likewise, ref. [] analyzed 12,396 Booking.com hotel reviews to examine the role of technological amenities. Aspects such as location, smart-room features, and staff behavior were manually labeled for sentiment. The findings showed that technology-driven services enhance guest satisfaction, reinforcing the managerial value of ABSA insights.

Finally, ref. [] developed an Indonesian BERT-based ABSA model for hotel reviews, comparing sentence-pair and traditional classification structures. The model achieved an F1-score of approximately 91.94%, reaffirming the robustness of transformer-based architectures in resource-scarce settings.

Despite these advances, ABSA research in tourism remains dominated by English and other non-Arabic datasets, leaving limited understanding of the linguistic and dialectal challenges in Arabic user reviews. The present work addresses this gap through a systematic evaluation of ABSA models on a manually annotated Arabic Google Maps dataset, offering practical benchmarks and new insights for both academia and industry.

3.3. Aspect-Based Sentiment Analysis in the Tourism Sector Using Google Maps

While platforms such as TripAdvisor and Booking.com have long been used to collect tourism reviews, Google Maps has emerged as a particularly valuable source of user-generated feedback due to its vast, real-time coverage of diverse locations, from restaurants and cafés to hotels and major landmarks []. Reviews are often posted immediately after an experience, offering spontaneous and context-rich sentiment data [].

In [], the authors performed ABSA on Google Maps reviews of airport services in Dubai and Doha. Human annotators labeled aspects such as security and accessibility, and a bidirectional transformer-based model achieved around 80% accuracy, demonstrating the potential of contextual models for domain-specific ABSA.

Similarly, ref. [] examined Google Maps reviews of Indonesia’s Borobudur and Prambanan temples. Aspects such as attractions, amenities, pricing, and accessibility were manually labeled, and several machine learning models—including Random Forest, SVM, Logistic Regression, and Extra Trees—were compared. Performance varied across models, underscoring how aspect granularity and data distribution affect outcomes.

In the Arabic context, ref. [] introduced the AraMA corpus, which contains 10,739 Arabic Google Maps restaurant reviews from Riyadh. Aspects included food, environment, service, and price, annotated with sentiment labels spanning positive, negative, neutral, and conflict. The follow-up AraMAMS corpus extended this to multi-aspect, multi-sentiment annotation. A linear SVM achieved 91.70% accuracy, though the dataset’s focus on a single city limits its broader applicability.

Another Arabic study [] analyzed Saudi-dialect Google Maps reviews using SVM, LSTM, and RNN models for overall sentiment classification. SVM achieved 98% accuracy, but the lack of aspect-level granularity restricted the depth of sentiment insights.

Additionally, ref. [] investigated Korean Google Maps restaurant reviews across four aspects (food, price, service, and atmosphere) using Random Forest for classification. The study showed that aspect-based sentiment analysis provides richer, more actionable insights than overall sentiment categorization.

Finally, ref. [] explored multi-class sentiment prediction for airport service reviews sourced from Google Maps, Twitter, and Airline Quality. Annotators labeled seven aspects, and surprisingly, traditional machine learning methods outperformed deep learning models, emphasizing that dataset structure and size strongly influence ABSA performance.

Although recent studies have begun to explore Google Maps as a data source for ABSA, research on Arabic reviews, particularly in tourism and across multiple regions, remains scarce. Existing datasets are typically city-specific or limited in thematic coverage. The current study addresses these gaps by constructing a comprehensive, multi-aspect, annotated Arabic dataset spanning various Saudi regions and evaluating the performance of transformer-based models in this underexplored domain.

3.4. Summary and Future Directions

Research on Arabic and tourism-related ABSA has advanced considerably, employing a variety of annotation strategies, ranging from manual to semi-supervised, and leveraging models such as SVM, LSTM, and transformer-based architectures. Nevertheless, the scarcity of large-scale, dialectally diverse annotated corpora continues to hinder progress. Although Google Maps offers a rich source of user-generated data, inconsistencies in annotation and the absence of standardized benchmarks limit cross-study comparability.

A comprehensive review of previous studies related to aspect-based sentiment analysis (ABSA) is summarized in Table 1, Table 2 and Table 3. Table 1 presents existing research on ABSA in Arabic, Table 2 focuses on studies applying ABSA in the tourism sector, and Table 3 highlights works that utilized Google Maps reviews for ABSA in tourism-related contexts.

Table 1. ABSA in Arabic.

Table 2. ABSA in the Tourism Sector.

Table 3. ABSA in the Tourism Sector Using Google Maps.

Future work should pursue more sophisticated transformer-based designs, real-time sentiment tracking, and cross-dialect transfer learning to improve robustness and scalability. These directions would expand the applicability of ABSA across diverse domains, including tourism, healthcare, education, and e-commerce.

More recently, large language model (LLM)-based approaches have begun shaping the next phase of ABSA. Through instruction tuning and retrieval-based example ranking, these models demonstrate strong potential for improving contextual sentiment reasoning, particularly for dialect-rich Arabic data, and open new avenues for adaptable, cross-domain applications [].

4. Methodology

This section describes the methodology used to perform multi-aspect sentiment classification on Arabic Google Maps reviews of tourist attractions in Saudi Arabia. It outlines the data collection, manual annotation, and exploratory analysis procedures, followed by the development of transformer-based and traditional machine learning models. The section concludes with an explanation of the evaluation metrics used to assess model performance.

4.1. Dataset Source and Collection

This study focuses on dialectal Arabic (DA) as used across Saudi Arabia. Reviews were collected from Google Maps for 39 tourist attractions distributed across 15 major cities using the Instant Data Scraper Chrome extension [].

In total, 20,000 reviews were gathered, most written in dialectal Arabic, with a smaller portion in English. Metadata such as usernames, timestamps, likes, and owner replies were collected but excluded from analysis; only the review text was kept. Figure 2 shows a sample review that illustrates the type of user feedback analyzed. To promote transparency and future research in Arabic sentiment and tourism-related NLP, the annotated dataset has been released publicly on Zenodo (https://zenodo.org/records/17011574 (accessed on 20 October 2025), DOI: 10.5281/zenodo.17011573).

Figure 2. A preview of a Google Maps review text. The translation of the text inside the red box is: “Great, but the ticket price is a bit expensive—50 SAR per person—and the options are limited. There are cafés, and the space is a bit small”.

4.2. Dataset Overview

To identify key tourist destinations, a keyword-based search on Google Maps was conducted using terms such as “museums,” “parks,” “farms,” and “heritage.” This process yielded 39 attractions across 15 Saudi cities, representing a diverse mix of venues and dialects. The dataset captures several regional dialects, including Hijazi, Najdi, Northern, Eastern, and Southern Arabic. Most reviews were written in Najdi and Hijazi dialects, along with instances of Modern Standard Arabic (MSA).

4.3. Aspect-Based Approach

Prior to annotation, key aspect categories were determined based on previous research [], which identified common visitor concerns such as amenities, pricing, and staff service. This study extended those dimensions to include additional aspects relevant to tourism, specifically cleanliness, environment, and overall experience. The predefined aspect categories and their corresponding topics used in this study are summarized in Table 4.

Table 4. Aspect Categories and Related Topics.

4.4. Annotation Process and Inter-Annotator Agreement

Two native Arabic speakers, aged between 26 and 32, served as annotators. Both possessed strong proficiency in dialectal Arabic and demonstrated technical competence in linguistic annotation. Prior to the main task, they were trained using comprehensive annotation guidelines and a preliminary 20-sentence calibration exercise designed to ensure consistency. Each annotator independently labeled the same dataset. Only reviews containing at least two identifiable aspects were retained, while non-Arabic reviews were removed.

From the initial collection of 20,000 Google Maps reviews, approximately 16,460 were excluded during the preprocessing and annotation stages, primarily due to non-Arabic content, duplication, or insufficient identifiable aspects with clear sentiment expressions. The final dataset comprised 3540 multi-aspect reviews, accurately reflecting both the linguistic diversity and the sparsity characteristic of user-generated Arabic tourism data.
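The exclusion criteria above (non-Arabic content, duplication, and too few identifiable aspects) can be sketched as a simple filter. This is an illustrative approximation only: in the study, aspects were identified by human annotators, so the aspect lists are assumed to be given as input. The function names, the Arabic-script heuristic, and both thresholds are hypothetical and are not values reported by the authors.

```python
def arabic_ratio(text):
    """Share of alphabetic characters that fall in the basic Arabic block."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    arabic = sum(1 for ch in letters if "\u0600" <= ch <= "\u06FF")
    return arabic / len(letters)


def filter_reviews(reviews, min_arabic=0.5, min_aspects=2):
    """Keep unique, mostly-Arabic reviews that mention enough aspects.

    `reviews` is a list of (text, aspects) pairs; both thresholds are
    hypothetical defaults, not values reported in the study.
    """
    seen, kept = set(), []
    for text, aspects in reviews:
        key = " ".join(text.split())  # collapse whitespace before de-duplicating
        if key in seen:
            continue
        seen.add(key)
        if arabic_ratio(key) >= min_arabic and len(aspects) >= min_aspects:
            kept.append((key, aspects))
    return kept


# Hypothetical inputs: one valid review, its duplicate, an English review,
# and an Arabic review that mentions only one aspect.
samples = [
    ("المكان جميل والأسعار مرتفعة", ["environment", "price"]),
    ("المكان جميل  والأسعار مرتفعة", ["environment", "price"]),
    ("Nice place, fair prices", ["environment", "price"]),
    ("المكان جميل", ["environment"]),
]
print(len(filter_reviews(samples)))
```

With the figures reported above, 3540 of 20,000 collected reviews were retained, which corresponds to 17.7% of the initial collection.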

To assess annotation consistency, Cohen’s kappa coefficient was calculated for each aspect. The agreement scores are presented in Table 5.

Table 5. Cohen’s Kappa Inter-Annotator Agreement for Each Aspect.

High kappa values confirm strong inter-annotator agreement, validating the reliability of the labeling process. Discrepancies were resolved by adopting Samar’s annotations as the reference standard, ensuring the creation of a consistent, high-quality “gold” dataset for subsequent modeling.
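For reference, Cohen's kappa compares the observed agreement between two annotators with the agreement expected by chance from their individual label frequencies. The sketch below is a minimal, dependency-free implementation; the function name and the example labels are hypothetical and do not reproduce the study's annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical labels for one aspect from two annotators.
annotator_1 = ["pos", "pos", "neg", "neu", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "pos", "pos", "neg"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))  # 0.7
```

In this toy case the annotators agree on 5 of 6 items (0.833), chance agreement is 16/36 (0.444), and kappa is therefore 0.7. Computing the coefficient separately per aspect, as done in Table 5, means calling the function once per aspect column.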

4.5. Exploratory Data Visualization

Figure 3 presents the distribution of reviews across attraction types. Gardens received the highest number of reviews, followed by historical and heritage sites, parks, and museums, reflecting strong public engagement with natural and culturally significant destinations.

Figure 3. Distribution of reviews by attraction type.

Figure 4 illustrates the frequency of aspect mentions. Environment and overall experience were most frequently discussed, followed by facilities and price, whereas cleanliness and HR & service appeared less often.

Figure 4. Frequency of mentions per aspect: Environment, Experience, Facilities, Price, HR & Service, and Cleanliness.

The final dataset contains 9085 aspect-sentiment pairs: 5777 positive, 2406 negative, and 902 neutral. The sentiment distribution per aspect is shown in Table 6.
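The totals reported above can be checked with a few lines of Python, which also make the class imbalance explicit. The snippet uses only the counts stated in the text; the variable names are illustrative.

```python
# Aspect-sentiment pair counts as reported in the text.
counts = {"positive": 5777, "negative": 2406, "neutral": 902}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}

print(total)   # 9085
print(shares)  # {'positive': 63.6, 'negative': 26.5, 'neutral': 9.9}
```

Positive pairs thus make up close to two thirds of the data and neutral pairs about one tenth, which is why class imbalance matters for the choice of evaluation metric.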

Table 6. Sentiment Distribution per Aspect Category.

To illustrate annotation practices, Table 7 provides representative examples in Saudi dialectal Arabic with corresponding English translations and sentiment labels.

Table 7. Sample Annotated Reviews with Saudi Dialect and Translation.

4.6. Model Architecture

To evaluate the trade-offs between deep learning and traditional machine learning methods, four model architectures were implemented for multi-aspect sentiment classification. These included both transformer-based and classical models.

The same base encoder, asafaya/bert-base-arabic, was used in two configurations:

Manual BERT (single-aspect): a separate single-head classifier fine-tuned individually for each aspect.
Multi-Head BERT: a shared encoder with multiple classification heads, one per aspect, enabling simultaneous prediction across all aspects.

The Multi-Head BERT model was fine-tuned using the pre-trained asafaya/bert-base-arabic encoder available on Hugging Face. It consists of a shared encoder and six independent heads, each responsible for predicting sentiment polarity (positive, negative, or neutral) for one aspect (price, cleanliness, facilities, service, environment, and overall experience). This structure captures both general linguistic features and aspect-specific sentiment cues, consistent with contemporary multi-task ABSA frameworks [,]. Figure 5 presents the overall model architecture.

Figure 5. Architecture of the proposed multi-head BERT model. The model consists of a shared BERT encoder that processes the input review, followed by six parallel classification heads—each dedicated to predicting sentiment for one specific aspect (price, cleanliness, facilities, service, environment, and overall experience).
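One way to write down the shared-encoder design in Figure 5 is sketched below. It is an illustration under stated assumptions rather than the authors' own formulation: the symbols h, W_a, and b_a are introduced here, and summing the six per-aspect cross-entropy terms with equal weight is an assumed (though common) joint objective that the text does not specify. How reviews that do not mention a given aspect are handled is likewise left open.

```latex
\begin{aligned}
\mathbf{h} &= \mathrm{Encoder}(x) \in \mathbb{R}^{d}
  && \text{(shared review representation)} \\
\hat{\mathbf{p}}_a &= \mathrm{softmax}\!\left(W_a \mathbf{h} + \mathbf{b}_a\right),
  \quad W_a \in \mathbb{R}^{3 \times d},\; a = 1, \dots, 6
  && \text{(one three-class head per aspect)} \\
\mathcal{L} &= \sum_{a=1}^{6} \mathrm{CE}\!\left(y_a, \hat{\mathbf{p}}_a\right)
  && \text{(assumed joint objective)}
\end{aligned}
```

Under this reading, the single-aspect "Manual BERT" setting corresponds to training six separate encoders, each with only one of the six terms in the sum.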

4.7. Performance Measurement

The performance of all models was evaluated using four widely adopted metrics in sentiment classification: Accuracy, Precision, Recall, and F1-score []. These metrics were derived from the confusion matrix to provide a balanced assessment of each model’s ability to correctly identify positive, neutral, and negative instances. Since these measures are standard in the machine learning and NLP literature, their detailed mathematical definitions are omitted for brevity.

All Precision, Recall, and F1-Score values reported throughout this study (Table 8, Table 9, Table 10, Table 11, Table 12 and Table 13) were computed using the macro-averaging method to account for class imbalance across sentiment categories.
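For readers who want to reproduce the macro-averaged figures, the following minimal sketch shows how accuracy and macro Precision, Recall, and F1 can be derived from true and predicted labels. The function name `macro_scores` and the toy labels are illustrative, and the label coding (−1, 0, 1) follows the convention used later in Table 15; this is not the authors' evaluation script.

```python
from collections import Counter


def macro_scores(y_true, y_pred, labels=(-1, 0, 1)):
    """Return (accuracy, macro precision, macro recall, macro F1)."""
    # Count every (true label, predicted label) pair once.
    pairs = Counter(zip(y_true, y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        tp = pairs[(c, c)]
        fp = sum(n for (t, p), n in pairs.items() if p == c and t != c)
        fn = sum(n for (t, p), n in pairs.items() if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    accuracy = sum(pairs[(c, c)] for c in labels) / len(y_true)
    k = len(labels)
    # Macro averaging: every class counts equally, regardless of its size.
    return accuracy, sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


# Toy example with made-up labels.
y_true = [1, 1, 1, 0, -1, -1]
y_pred = [1, 1, 0, 0, -1, 1]
accuracy, macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
```

Because each class contributes equally to the macro average, a model that neglects the small neutral class is penalized even when its overall accuracy remains high.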

5. Experiments and Results

This section presents the evaluation of seven models for aspect-based sentiment analysis (ABSA) on Arabic Google Maps reviews of Saudi tourist attractions.

Both BERT-based variants share the same base encoder. Manual BERT refers to single-aspect fine-tuning (a separate classifier per aspect), while Multi-Head BERT denotes a shared encoder with multiple classification heads (one per aspect).

The experiments compared traditional machine learning models (TF-IDF + LinearSVC, LinearSVC with oversampling, and a Voting Classifier) with transformer-based models (AraBERT, MARBERT, QARiB, and Manual BERT). All models were implemented in Python 3.11.6 using scikit-learn, PyTorch, and Hugging Face Transformers. The dataset was divided using an 80/20 stratified split to ensure balanced aspect and sentiment representation.
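The idea behind a stratified 80/20 split can be sketched in a few lines of standard-library Python. The function `stratified_split` is a hypothetical helper that stratifies on a single label per example; how the study combined aspect and sentiment into the stratification key is not stated, so that part is an assumption left to the reader.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_frac=0.2, seed=42):
    """Split example indices so each class keeps roughly the same share in both parts."""
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        by_class[lab].append(idx)
    rng = random.Random(seed)
    train, test = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        # Take the same fraction from every class.
        n_test = round(len(idxs) * test_frac)
        test.extend(idxs[:n_test])
        train.extend(idxs[n_test:])
    return sorted(train), sorted(test)


# Toy label vector: 50 positive, 30 negative, 20 neutral (made-up counts).
labels = [1] * 50 + [-1] * 30 + [0] * 20
train_idx, test_idx = stratified_split(labels)
```

In this toy case the held-out part keeps the 50/30/20 proportions of the full label vector, which is the property a stratified split is meant to guarantee.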

Model training used consistent hyperparameters: maximum sequence length = 128, batch size = 16 or 32, learning rate = 2 × 10⁻⁵, three epochs, and the AdamW optimizer. The TF-IDF + LinearSVC model employed GridSearchCV for hyperparameter optimization, while LinearSVC (Oversampling) used manual resampling to ensure class balance. The Voting Classifier combined SVC, SGD, and Logistic Regression through hard voting.
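The "manual resampling" step is not described in detail, so the sketch below shows one plausible reading: random oversampling with replacement until every class matches the largest one. The helper name `oversample` and the toy data are hypothetical, and in practice such resampling would be applied to the training portion only, never to the held-out test set.

```python
import random
from collections import Counter


def oversample(texts, labels, seed=0):
    """Duplicate minority-class examples at random until all classes are equal in size."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_texts, out_labels = list(texts), list(labels)
    for lab, n in counts.items():
        pool = [t for t, l in zip(texts, labels) if l == lab]
        # Sample with replacement; k is 0 for the majority class.
        extra = rng.choices(pool, k=target - n)
        out_texts.extend(extra)
        out_labels.extend([lab] * len(extra))
    return out_texts, out_labels


# Toy training set (placeholder strings stand in for reviews).
texts = ["a", "b", "c", "d", "e", "f"]
labels = [1, 1, 1, 1, -1, 0]
balanced_texts, balanced_labels = oversample(texts, labels)
```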

Transformer-based models included Manual BERT (single-aspect) fine-tuned individually per aspect and Multi-Head BERT (AraBERT) with a shared encoder and six classification heads for joint prediction. MARBERT + Focal Loss integrated class weights and dropout (0.3), while QARiB (Attention) utilized attention pooling with class-weighted loss, early stopping (patience = 2), and linear learning-rate scheduling. Each experiment was repeated with multiple random seeds to ensure result stability.

5.1. Performance Evaluation by Aspect

Price: AraBERT achieved the highest performance on the Price aspect (Accuracy = 0.93, F1 = 0.92), followed closely by MARBERT (F1 = 0.90), as shown in Table 8. Traditional classifiers such as TF-IDF + LinearSVC and the Voting Classifier also performed reasonably well (F1 = 0.87) but lagged behind AraBERT and MARBERT in both precision and recall. Manual BERT produced strong accuracy (0.92) but lower recall (Recall = 0.72, F1 = 0.76), suggesting that single-aspect fine-tuning is less effective than shared contextual encoding for nuanced sentiment detection.

Table 8. Performance comparison for aspect: Price.

Model	Accuracy	Precision	Recall	Macro-F1
TF-IDF + LinearSVC (GridSearch)	0.88	0.87	0.88	0.87
LinearSVC (Oversampling)	0.88	0.86	0.88	0.87
VotingClassifier	0.88	0.87	0.88	0.87
Multi-Head BERT (AraBERT)	0.93	0.92	0.93	0.92
MARBERT + Focal Loss	0.90	0.89	0.90	0.90
QARiB	0.92	0.79	0.86	0.81
Manual BERT (single-aspect)	0.92	0.84	0.72	0.76

Cleanliness: AraBERT achieved the strongest overall results for the Cleanliness aspect (Accuracy = 0.97, F1 = 0.97), as summarized in Table 9. MARBERT followed closely (F1 = 0.96), and Manual BERT also performed competitively (F1 = 0.91). Traditional classifiers achieved solid but slightly lower results, reaffirming that transformer models are better at capturing explicit cleanliness cues in text.

Table 9. Performance comparison for aspect: Cleanliness.

Model	Accuracy	Precision	Recall	Macro-F1
TF-IDF + LinearSVC (GridSearch)	0.92	0.92	0.92	0.91
LinearSVC (Oversampling)	0.93	0.92	0.93	0.92
VotingClassifier	0.93	0.92	0.93	0.92
Multi-Head BERT (AraBERT)	0.97	0.97	0.97	0.97
MARBERT + Focal Loss	0.94	0.95	0.94	0.96
QARiB	0.96	0.86	0.90	0.88
Manual BERT (single-aspect)	0.97	0.94	0.88	0.91

Facilities: AraBERT and QARiB obtained the highest accuracy (0.83), demonstrating the importance of contextual embeddings in modeling infrastructure-related sentiments (see Table 10). Classical classifiers performed respectably (F1 ≈ 0.78) but were less capable of identifying subtle or implicit opinions expressed through descriptive language.

Table 10. Performance comparison for aspect: Facilities.

Model	Accuracy	Precision	Recall	Macro-F1
TF-IDF + LinearSVC (GridSearch)	0.78	0.77	0.78	0.77
LinearSVC (Oversampling)	0.79	0.78	0.79	0.78
VotingClassifier	0.79	0.77	0.79	0.78
Multi-Head BERT (AraBERT)	0.83	0.85	0.83	0.84
MARBERT + Focal Loss	0.78	0.79	0.78	0.79
QARiB	0.83	0.76	0.83	0.79
Manual BERT (single-aspect)	0.82	0.74	0.79	0.76

Service & Staff: QARiB achieved the highest accuracy for the Service & Staff aspect (0.93) but a lower Macro-F1 (0.85), whereas AraBERT delivered the best Macro-F1 with nearly identical accuracy (Accuracy = 0.92, F1 = 0.92), as shown in Table 11. AraBERT’s result highlights the capacity of contextual models to interpret interpersonal tone and nuanced language, where traditional methods plateaued around F1 = 0.86 to 0.87. Manual BERT’s lower recall (0.69) underscores the benefits of multi-aspect contextual learning.

Table 11. Performance comparison for aspect: Service & Staff.

Model	Accuracy	Precision	Recall	Macro-F1
TF-IDF + LinearSVC (GridSearch)	0.87	0.85	0.87	0.86
LinearSVC (Oversampling)	0.87	0.86	0.87	0.86
VotingClassifier	0.88	0.87	0.88	0.87
Multi-Head BERT (AraBERT)	0.92	0.92	0.92	0.92
MARBERT + Focal Loss	0.77	0.81	0.77	0.79
QARiB	0.93	0.82	0.88	0.85
Manual BERT (single-aspect)	0.92	0.88	0.69	0.75

Environment: As reported in Table 12, QARiB achieved the highest accuracy for the Environment aspect (0.88), while AraBERT provided a better precision–recall balance (F1 = 0.86). This indicates complementary strengths between the two models—QARiB excels in recognizing strong sentiment cues, whereas AraBERT handles contextually mixed expressions more effectively.

Table 12. Performance comparison for aspect: Environment.

Model	Accuracy	Precision	Recall	Macro-F1
TF-IDF + LinearSVC (GridSearch)	0.82	0.80	0.82	0.81
LinearSVC (Oversampling)	0.82	0.80	0.82	0.81
VotingClassifier	0.82	0.80	0.82	0.81
Multi-Head BERT (AraBERT)	0.87	0.86	0.87	0.86
MARBERT + Focal Loss	0.84	0.79	0.84	0.75
QARiB	0.88	0.76	0.77	0.77
Manual BERT (single-aspect)	0.85	0.71	0.60	0.64

Overall Experience: Performance across all models declined on the more abstract Overall Experience aspect, reflecting its conceptual complexity (see Table 13). QARiB and Manual BERT delivered the highest scores (Accuracy = 0.77, F1 = 0.77), followed closely by AraBERT (F1 = 0.76). These results suggest that modeling generalized impressions requires deeper contextual inference than surface-level sentiment cues can provide.

Table 13. Performance comparison for aspect: Overall Experience.

Model	Accuracy	Precision	Recall	Macro-F1
TF-IDF + LinearSVC (GridSearch)	0.75	0.75	0.75	0.75
LinearSVC (Oversampling)	0.73	0.73	0.73	0.73
VotingClassifier	0.72	0.72	0.72	0.72
Multi-Head BERT (AraBERT)	0.76	0.77	0.76	0.76
MARBERT + Focal Loss	0.65	0.60	0.65	0.62
QARiB	0.77	0.76	0.78	0.77
Manual BERT (single-aspect)	0.77	0.78	0.76	0.77

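In summary, the strongest transformer-based models, AraBERT and QARiB, exceeded the traditional machine learning baselines in accuracy on every aspect, and AraBERT also achieved the highest or near-highest Macro-F1 throughout. QARiB’s Macro-F1, by contrast, fell below the classical baselines on Price, Cleanliness, Service & Staff, and Environment. Cleanliness and Price yielded the highest scores, owing to their clear lexical sentiment indicators, while Overall Experience remained the most challenging due to its abstract, context-dependent nature.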

5.2. Model Analysis and Discussion

This section presents a comparative analysis of the seven evaluated models for multi-aspect sentiment classification on Arabic Google Maps reviews of Saudi tourist attractions. The discussion interprets the quantitative results summarized in Tables 8–13, emphasizing each model’s strengths, limitations, and practical implications for tourism analytics.

5.2.1. TF-IDF + LinearSVC (GridSearchCV)

Despite its relative simplicity, the TF-IDF + LinearSVC pipeline achieved competitive F1-scores across several aspects, most notably Cleanliness (F1 = 0.91), Price (F1 = 0.87), and Service & Staff (F1 = 0.86), as shown in Table 8, Table 9, and Table 11. Systematic hyperparameter tuning via GridSearchCV (regularization strength, n-gram range, vocabulary size) was essential to these results, and the use of class_weight = balanced helped mitigate class imbalance by ensuring equitable treatment of all sentiment classes. The model’s transparent feature weights and low computational footprint make it particularly suitable for rapid deployment or scenarios with limited resources. However, its reliance on bag-of-words features limits its ability to capture subtle linguistic nuances frequently present in user-generated Arabic reviews.
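To make the effect of class_weight = balanced concrete, the sketch below applies the heuristic documented by scikit-learn, n_samples / (n_classes × class count), to the pooled sentiment counts reported for the dataset (5777 positive, 2406 negative, 902 neutral). Using pooled counts is an illustrative simplification: in the per-aspect classifiers the weights would be computed from each aspect's own training labels, which are not reproduced here.

```python
def balanced_class_weights(counts):
    """Weights following the 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


# Pooled aspect-sentiment counts from the dataset description (illustration only).
pooled = {"positive": 5777, "negative": 2406, "neutral": 902}
weights = balanced_class_weights(pooled)
# Roughly 0.52 for positive, 1.26 for negative, and 3.36 for neutral.
```

With these weights, an error on a neutral example costs roughly six times as much as an error on a positive one, which is how the rarer classes are kept from being ignored during training.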

5.2.2. LinearSVC with Oversampling vs. Voting Classifier

Applying manual oversampling to the LinearSVC model raised the F1-score for Cleanliness slightly, from 0.91 to 0.92 (Table 9), suggesting that data-level balancing offers a modest benefit in multi-class sentiment tasks. The Voting Classifier, an ensemble of LinearSVC, SGDClassifier, and LogisticRegression, yielded at most marginal improvements over the single LinearSVC models (e.g., Service & Staff F1: 0.87 vs. 0.86; Price F1 unchanged at 0.87), as shown in Table 8 and Table 11, and did not justify the added complexity. These findings highlight that a well-tuned single classifier can rival simple ensemble strategies in Arabic ABSA. However, while traditional classifiers such as LinearSVC performed well on explicit aspects like Cleanliness and Price, their accuracy declined on more abstract dimensions such as Overall Experience. This limitation stems from their reliance on bag-of-words features, which overlook contextual meaning and subtle sentiment cues in user-generated Arabic text. Such nuances are better captured by transformer-based models, explaining the superior results of AraBERT and QARiB on complex, context-dependent aspects.
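Hard voting itself is simple enough to sketch without any library. In the example below, `hard_vote` is a hypothetical helper that takes one prediction list per base model and returns the majority label for each review; ties are resolved in favor of the smallest label, mirroring the ascending-order tie-break that scikit-learn documents for its VotingClassifier. The three prediction lists are made up for illustration.

```python
from collections import Counter


def hard_vote(predictions):
    """Majority vote over per-model label lists of equal length."""
    voted = []
    for votes in zip(*predictions):
        counts = Counter(votes)
        top = max(counts.values())
        # On a tie, pick the smallest label among the most frequent ones.
        voted.append(min(lab for lab, n in counts.items() if n == top))
    return voted


# Made-up predictions from three base models on four reviews.
svc_pred = [1, 0, -1, 1]
sgd_pred = [1, 1, -1, 0]
lr_pred = [0, 1, -1, -1]
ensemble_pred = hard_vote([svc_pred, sgd_pred, lr_pred])
```

When the base models are all linear classifiers over the same TF-IDF features, their errors tend to coincide, which is one plausible reason the ensemble adds little over a single tuned model.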

5.2.3. Multi-Head BERT (AraBERT)

AraBERT’s multi-task architecture, leveraging a shared BERT encoder and parallel classification heads, achieved the strongest overall results. Highest F1-scores were recorded for Cleanliness (0.97), Service & Staff (0.92), and Price (0.92), as indicated in Table 8, Table 9, and Table 11. The ability of contextualized embeddings to capture complex polarity expressions across aspects proved especially beneficial for the nuanced language of tourism reviews. AraBERT’s superior performance in aspects critical to tourism, such as cleanliness and pricing, demonstrates its practical value for extracting actionable insights that support service improvement and strategic planning for Saudi tourist destinations.

5.2.4. MARBERT with Focal Loss

Fine-tuning MARBERT with a Focal Loss objective (γ = 2.0) improved minority-class recall, particularly for Cleanliness (F1 = 0.96) and Price (F1 = 0.90), by down-weighting well-classified examples and focusing on underrepresented cases. However, its weaker result on Overall Experience (F1 = 0.62) reflects the inherent ambiguity and multi-faceted nature of this aspect, which often conflates several sentiment cues. Table 13 highlights the persistent challenge of modeling broad and abstract aspects within real-world ABSA tasks. Effectively capturing sentiment related to general impressions or mixed experiences demands richer contextual modeling, larger and more balanced datasets, and the integration of advanced LLM-based techniques capable of inferring underlying intent rather than relying solely on surface-level lexical cues.
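The down-weighting effect of Focal Loss can be seen directly from its standard definition, FL = −α_t (1 − p_t)^γ log(p_t), where p_t is the probability assigned to the true class. The sketch below evaluates it for a single example with γ = 2.0; the probability values and the optional `alpha` class weights are illustrative assumptions, not values taken from the experiments.

```python
import math


def cross_entropy(probs, target):
    """Standard cross-entropy for one example; probs maps label -> probability."""
    return -math.log(probs[target])


def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Focal loss for one example: -alpha_t * (1 - p_t)**gamma * log(p_t)."""
    p_t = probs[target]
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - p_t) ** gamma * math.log(p_t)


# An easy example (true class predicted with 0.9) and a hard one (0.2).
easy = {1: 0.90, 0: 0.05, -1: 0.05}
hard = {1: 0.20, 0: 0.50, -1: 0.30}
easy_ratio = focal_loss(easy, 1) / cross_entropy(easy, 1)  # (1 - 0.9)**2
hard_ratio = focal_loss(hard, 1) / cross_entropy(hard, 1)  # (1 - 0.2)**2
```

With γ = 2.0 the easy example retains about 1% of its cross-entropy loss while the hard one retains 64%, so gradient updates concentrate on examples the model still gets wrong. Setting γ = 0 recovers ordinary cross-entropy.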

5.2.5. QARiB with Attention, Scheduling, and Early Stopping

The attention-enhanced QARiB model, equipped with class-weighted loss, linear warmup scheduling, and early stopping (patience = 2), demonstrated robust performance across aspects, particularly for the previously challenging Overall Experience aspect (F1 = 0.77). It balanced precision and recall across all aspects, especially Cleanliness (F1 = 0.88) and Service & Staff (F1 = 0.85), while effectively mitigating overfitting. This confirms that targeted architectural enhancements and regularization significantly boost model stability for practical, multi-aspect sentiment analysis in Arabic tourism data.
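The two training controls named above can be sketched independently of any deep learning framework. In the example below, `linear_warmup_decay` reproduces the usual shape of a linear schedule (ramp up, then decay linearly to zero) and `EarlyStopping` halts training after two epochs without improvement. The number of warmup steps, the total step count, and the choice of validation Macro-F1 as the monitored quantity are assumptions, since the text does not report them; the validation scores are made up.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, base_lr=2e-5):
    """Learning rate that rises linearly during warmup, then decays linearly to zero."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


class EarlyStopping:
    """Signal a stop after `patience` consecutive epochs without improvement."""

    def __init__(self, patience=2):
        self.patience = patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Record one epoch's validation score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


# Made-up validation scores: the best value is reached in the second epoch.
stopper = EarlyStopping(patience=2)
stop_flags = [stopper.step(s) for s in [0.70, 0.74, 0.73, 0.74]]
```

Note that with only three training epochs, as used in this study, a patience of 2 can trigger a stop at the third epoch at the earliest.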

5.2.6. Manual BERT (Single-Aspect)

Even without specialized attention modules or custom loss weighting, the vanilla asafaya/bert-base-arabic model performed competitively on well-represented aspects such as Cleanliness (F1 = 0.91) and Overall Experience (F1 = 0.77) but delivered lower performance on subtler aspects (Environment F1 = 0.64; Service & Staff F1 = 0.75). The absence of shared contextual learning across aspects limits its ability to generalize, reinforcing the advantage of multi-head architectures for integrated sentiment classification.

5.2.7. Comparative Insights and Practical Implications

Transformer-based models clearly outperformed classical baselines across nearly all aspects. Their contextual embeddings captured dialectal variation, idiomatic phrasing, and implicit sentiment cues that linear models could not. High inter-annotator agreement on aspects like Cleanliness and Service & Staff (Cohen’s κ > 0.8) correlated with stronger model performance, demonstrating that consistent annotation directly supports predictive accuracy. Conversely, aspects with ambiguous or composite sentiment boundaries—Environment and Overall Experience—remained the most challenging.
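Cohen's κ, the agreement statistic cited above, compares the observed agreement between two annotators with the agreement expected by chance from their label frequencies. The sketch below computes it for two hypothetical annotators; the label sequences are invented for illustration and are not drawn from the annotated dataset.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two annotators' marginal label rates.
    expected = sum(freq_a[k] * freq_b[k] for k in set(freq_a) | set(freq_b)) / (n * n)
    # Undefined (division by zero) if both annotators use a single identical label.
    return (observed - expected) / (1 - expected)


# Two hypothetical annotators who disagree on one of ten reviews.
annotator_a = [1, 1, 0, -1, 1, 0, -1, 1, 1, 0]
annotator_b = [1, 1, 0, -1, 1, 0, -1, 1, 0, 0]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In this toy case the raw agreement is 90%, but κ is lower because part of that agreement would be expected by chance given how often each label is used.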

For tourism stakeholders, AraBERT and QARiB represent the most effective tools for deriving actionable insights from large-scale Arabic reviews. Their precision in assessing Cleanliness, Facilities, and Price can inform operational improvements and policy decisions. Integrating these models into tourism analytics dashboards would enable real-time sentiment monitoring across attractions and cities, supporting the evidence-based goals of Saudi Vision 2030.

5.3. Comprehensive Evaluation of the Best-Performing Model

To provide a holistic assessment of aspect-based sentiment analysis in Arabic tourism reviews, this section presents a detailed quantitative evaluation of the best-performing model, AraBERT, across all six targeted aspects: Price, Cleanliness, Facilities, Service & Staff, Environment, and Overall Experience. This analysis extends beyond overall accuracy, incorporating additional metrics and qualitative diagnostics to assess real-world applicability within the Saudi tourism context.

5.3.1. Quantitative Evaluation

Table 14 summarizes AraBERT’s performance across all aspects in terms of Accuracy, Macro-F1, ROC-AUC, and Average Precision (AP). As shown, aspects with explicit sentiment cues, such as Cleanliness and Price, consistently achieve the highest scores, while more abstract aspects such as Overall Experience and Environment remain more challenging. This disparity reflects the inherent ambiguity and mixed sentiment typical of real user-generated reviews.

Table 14. Summary of performance metrics per aspect (AraBERT).

5.3.2. Class Distribution Support

Table 15 illustrates the number of true instances per sentiment class (−1 = Negative, 0 = Neutral, 1 = Positive). A clear class imbalance is evident, particularly the limited positive reviews for Price and the smaller negative set for Overall Experience. These disparities partly explain the lower F1-scores observed for these aspects and emphasize the importance of balanced datasets for reliable multi-aspect classification.

Table 15. Number of instances per sentiment class (−1 = Negative, 0 = Neutral, 1 = Positive) for each aspect.

5.3.3. Training Dynamics

Figure 6 displays the batch-wise training loss curve for AraBERT over three epochs. The steady downward trajectory and absence of early overfitting confirm that the training schedule and regularization strategy were appropriately tuned for this dataset.

Figure 6. Training loss curve (batch-wise) over three epochs for AraBERT.

5.3.4. Confusion Matrices and Error Analysis

To better understand model behavior, Figure 7 displays normalized confusion matrices for two representative aspects: Cleanliness and Overall Experience. The Cleanliness aspect demonstrates near-perfect classification, with 99% of neutral reviews correctly identified and only minor confusion between negative and neutral labels. In contrast, Overall Experience reveals more ambiguity: around 25% of true positives were misclassified as neutral, highlighting the difficulty of interpreting sentiment when lexical cues are abstract or context-dependent.

Figure 7. Normalized confusion matrices for AraBERT on (a) Cleanliness and (b) Overall Experience.
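The percentages quoted for Figure 7 read as row-normalized values, that is, each row (true class) of the confusion matrix is divided by its total so that the entries give the share of that class assigned to each prediction. The sketch below shows this normalization under that assumption; `normalized_confusion` and the toy labels are illustrative.

```python
def normalized_confusion(y_true, y_pred, labels=(-1, 0, 1)):
    """Row-normalized confusion matrix: rows are true classes, columns are predictions."""
    index = {lab: i for i, lab in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    rows = []
    for row in counts:
        total = sum(row)
        # Leave an all-zero row for classes absent from y_true.
        rows.append([c / total if total else 0.0 for c in row])
    return rows


# Toy example: one of four true positives is predicted as neutral.
y_true = [1, 1, 1, 1, 0, 0, -1]
y_pred = [1, 1, 1, 0, 0, 0, -1]
matrix = normalized_confusion(y_true, y_pred)
positive_row = matrix[2]  # shares of true positives predicted as -1, 0, 1
```

Here the positive row contains 0.25 in the neutral column, the same kind of positive-to-neutral leakage described for Overall Experience.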

5.3.5. Representative Misclassification Patterns

A qualitative analysis of misclassified samples revealed several recurring error patterns, most notably within the Environment and Overall Experience aspects, which were the most challenging due to their abstract and context-dependent nature. These aspects often involve subjective evaluations that rely on implicit cues or subtle contrasts in sentiment.

The first pattern, mixed sentiment, was common in reviews expressing both praise and criticism within the same sentence. The model tended to emphasize the dominant sentiment word rather than the overall contextual meaning. For instance, “التجربة جميلة لكن لا أعتقد أني أكرر الزيارة” (“The experience was nice, but I don’t think I would visit again”) should be classified as neutral because it conveys both satisfaction and reluctance to revisit, yet the model incorrectly labeled it as positive. Similarly, the review “مكان يستحق الزيارة مرة واحدة فقط” (“A place worth visiting once only”) expresses a limited or cautious recommendation, but the model misinterpreted it as fully positive.

The second pattern, aspect sentiment masking, occurred when strong sentiment toward one aspect overshadowed another. For example, “المكان نظيف ومرتب لكن حر جدًا ومزدحم” (“The place is clean and organized, but very hot and crowded”) combines positive comments about cleanliness with negative impressions of the environment. However, the model predicted a positive label because it relied on strong positive cues like “نظيف ومرتب,” overlooking the environmental complaint. A similar error appeared in “المنظر جميل بس الصوت مزعج” (“The view is beautiful, but the noise is disturbing”), where the negative auditory experience was masked by the positive visual expression.

The third pattern, keyword overreliance, reflected the model’s tendency to assign sentiment labels based on emotionally charged words such as “جميل” (beautiful), “رائع” (wonderful), or “ممتع” (enjoyable), even when the broader tone was neutral or mildly negative. For instance, “المكان جميل لكنه مبالغ في الأسعار” (“The place is beautiful but overpriced”) was classified as positive because the model focused on “جميل,” disregarding the negative judgment on pricing. Likewise, “الجو لطيف بس مافي شي مميز” (“The weather is pleasant, but there’s nothing special”) was misclassified as positive due to its reliance on surface-level lexical cues.

A final pattern, ambiguity and contextual subtlety, was frequently observed in reviews describing the Environment and Overall Experience aspects through vague, sarcastic, or understated expressions. The phrase “المكان عادي جدًا” (“The place is very ordinary”) was often misclassified as neutral despite carrying a mildly negative tone in Arabic. Similarly, “تجربة لن تُنسى للأسف” (“An unforgettable experience—unfortunately”) was incorrectly predicted as positive because the model interpreted it literally and failed to capture the sarcastic tone. These cases highlight the difficulty of recognizing irony, understatement, and implied sentiment within Arabic text, especially when meaning depends heavily on cultural and contextual nuances. A summary of the most frequent error patterns and representative examples is provided in Table 16, illustrating typical misclassification causes across the Environment and Overall Experience aspects.

Table 16. Summary of the most frequent error types and representative examples observed in AraBERT predictions.

Addressing these limitations requires more context-aware modeling approaches. Future work may benefit from data augmentation techniques such as paraphrasing, back-translation, and dialectal variation to enrich training data and reduce dependence on lexical cues [,]. Incorporating contextual or dialect-sensitive embeddings and adapter layers could also enhance the model’s ability to represent subtle, region-specific sentiment patterns [,]. Additionally, adversarial training on ambiguous or borderline samples may strengthen robustness to mixed or implicit sentiment [,]. Finally, continuous manual error analysis and fine-tuning on recurrent misclassifications can progressively align model predictions with the nuanced nature of Arabic tourism reviews. Together, these strategies provide a practical pathway toward improving Arabic ABSA systems for real-world tourism analytics.

5.3.6. Ablation and Sensitivity Analysis

To evaluate the robustness of the proposed transformer-based architecture, a focused ablation and sensitivity analysis was conducted on the best-performing AraBERT configuration (sequence length = 128, batch size = 16, class-weighted loss). Several variants were implemented by systematically modifying or removing key components to assess their contribution to performance. These included:

(i) Removing the attention pooling layer and relying solely on the [CLS] token representation (No attention);
(ii) Removing class weighting from the loss function (No class weights);
(iii) Increasing the maximum sequence length to 256 (MaxLen = 256);
(iv) Doubling the batch size to 32 (Batch = 32); and
(v) Training six independent single-head models instead of a shared multi-head encoder (Single-Head).

Figure 8 illustrates the Macro-F1 scores across all aspects and model variants. The results confirm that the multi-head, class-weighted baseline achieves the best overall performance (average Macro-F1 = 0.79), with consistent improvements on Price, Cleanliness, and Service & Staff.

Figure 8. Macro-F1 heatmap across all ablation variants and aspects. The multi-head, class-weighted baseline achieves the highest performance, while removing attention or class weights leads to a clear drop in Macro-F1.

Removing attention pooling caused a moderate decline (average Macro-F1 = 0.77), showing that lightweight attention contributes to capturing aspect-specific cues beyond the global [CLS] representation.
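The difference between the two pooling strategies compared in this ablation can be illustrated with plain Python lists. The exact form of the attention pooling layer is not given, so the sketch assumes a common variant: a learned query vector scores each token embedding, the scores are passed through a softmax, and the embeddings are averaged with those weights. `attention_pool`, `cls_pool`, and the tiny two-dimensional vectors are hypothetical.

```python
import math


def cls_pool(token_vectors):
    """Use the first token's vector (the [CLS] position) as the review representation."""
    return list(token_vectors[0])


def attention_pool(token_vectors, query):
    """Softmax-weighted average of token vectors, scored by a dot product with `query`."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in token_vectors]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]  # subtract the max for stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(token_vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, token_vectors)) for d in range(dim)]
    return pooled, weights


# Three toy token embeddings; the first stands in for [CLS].
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pooled, weights = attention_pool(tokens, query=[0.0, 0.0])
cls_only = cls_pool(tokens)
```

With an all-zero query the weights are uniform and the result is a plain mean; a trained query would instead shift weight toward the tokens most informative for the aspect, which is information the [CLS] vector alone may not expose.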

Eliminating class weights led to the largest performance drop on the Price aspect (from 0.84 to 0.62), confirming the importance of balancing sentiment classes in highly imbalanced aspects.

Extending the sequence length or batch size yielded no meaningful gains, suggesting that most reviews are short and already well represented at 128 tokens.

Finally, training independent single-head models slightly reduced overall robustness (average Macro-F1 = 0.77), highlighting the advantage of shared feature representations in multi-aspect learning.

5.3.7. Feature Importance Visualization (Classical Models)

To enhance the interpretability of the classical machine-learning baseline, the most influential features identified by the LinearSVC classifier were visualized using TF-IDF coefficients.

Figure 9 presents the top positive and negative unigrams for the Cleanliness aspect.

Figure 9. Top positive and negative unigrams for the Cleanliness aspect identified by the LinearSVC model.

Words such as clean, organized, and neat are strongly associated with positive sentiment, while dirty, filthy, and bad odor correspond to negative predictions.

This visualization demonstrates the transparency of feature-based models and clarifies how they capture sentiment polarity compared with contextual transformer representations.
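A ranking of this kind can be produced by sorting the vocabulary by the classifier's learned coefficients. The sketch below shows the sorting step only; the feature names are the English glosses mentioned above plus a neutral filler word, and the weights are invented purely for illustration. For a three-class linear model there is typically one coefficient vector per class, so which vector underlies Figure 9 is an open detail.

```python
def top_features(feature_names, coefficients, k=3):
    """Return the k most positive and k most negative (feature, weight) pairs."""
    ranked = sorted(zip(feature_names, coefficients), key=lambda fw: fw[1])
    negative = ranked[:k]        # most negative weights first
    positive = ranked[::-1][:k]  # most positive weights first
    return positive, negative


# Hypothetical weights; not values from the trained model.
names = ["clean", "dirty", "place", "neat", "filthy"]
coefs = [1.2, -1.5, 0.1, 0.9, -1.1]
top_positive, top_negative = top_features(names, coefs, k=2)
```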

5.3.8. Precision–Recall Curves

Finally, Figure 10 presents the precision–recall curves for AraBERT on Cleanliness and Overall Experience. The Cleanliness aspect maintains high precision across recall levels, indicating clearly separable sentiment boundaries, while Overall Experience exhibits a sharper precision decline, further evidence of the inherent ambiguity in modeling generalized impressions within user reviews.

Figure 10. Precision–Recall curves for AraBERT on (a) Cleanliness and (b) Overall Experience, showing performance across classes −1, 0, and 1.
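Per-class curves such as those in Figure 10 are typically obtained one-vs-rest: the examples are ranked by the model's score for one class, and precision and recall are recorded as the decision threshold is lowered. The sketch below follows that recipe and also computes Average Precision as the recall-weighted sum of precisions. It assumes distinct scores and at least one example of the class of interest; the labels and scores are made up.

```python
def precision_recall_points(y_true, scores, positive_label):
    """One-vs-rest (recall, precision) pairs as the threshold is lowered."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(1 for y in y_true if y == positive_label)
    tp = 0
    points = []
    for rank, i in enumerate(order, start=1):
        if y_true[i] == positive_label:
            tp += 1
        points.append((tp / n_pos, tp / rank))
    return points


def average_precision(points):
    """Sum of precision values weighted by the increase in recall at each step."""
    ap, prev_recall = 0.0, 0.0
    for recall, precision in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Made-up scores for the positive class on five reviews.
y_true = [1, 0, 1, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.2]
points = precision_recall_points(y_true, scores, positive_label=1)
ap = average_precision(points)
```

A class with well-separated scores keeps precision near 1.0 until recall is almost complete, whereas overlapping score distributions produce the earlier precision drop described for Overall Experience.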

6. Discussion and Key Insights

The results clearly show that transformer-based models outperform traditional approaches, especially on well-defined aspects such as Cleanliness and Price. This advantage stems from their ability to capture deep contextual relationships within text rather than relying on isolated words. In other words, these models understand meaning in context—how a single phrase or adjective can shift depending on the sentence or sentiment around it. However, even the most advanced transformers struggle when the sentiment becomes less direct. Broader and more abstract aspects, such as Overall Experience, often contain blended tones or nuanced expressions that blur the line between positive and neutral sentiment. This explains the frequent misclassifications observed in those categories. Interestingly, we also found that aspects with high inter-annotator agreement (Cohen’s κ > 0.8) consistently led to stronger model performance, emphasizing how much consistent labeling and clear annotation guidelines contribute to model reliability.

While transformer models clearly dominate in most scenarios, well-tuned classical methods, such as SVMs or simple ensemble classifiers, can still perform competitively under the right conditions. When the dataset is balanced and dialectal variation is limited, these traditional models often achieve reliable results. They are lightweight, fast, and easy to interpret, which makes them practical choices when computational resources are constrained. However, their reliance on surface-level word patterns limits their ability to capture deeper contextual and semantic nuances, particularly in reviews where sentiment is implied or mixed. This is precisely where transformer-based models such as AraBERT and QARiB excel, leveraging contextual embeddings to understand meaning within complex, multi-aspect Arabic text.

The strength of AraBERT and QARiB lies in their extensive pretraining on large, diverse Arabic corpora. This exposure enables them to generalize across both Modern Standard Arabic and various regional dialects, something traditional models simply cannot achieve. Their multi-head architectures allow the models to learn shared language representations while simultaneously focusing on aspect-specific sentiment cues. QARiB goes a step further by incorporating explicit dialectal coverage and attention mechanisms, making it especially effective for handling the informal, colloquial language found in user-generated tourism reviews.

Beyond performance metrics, the proposed ABSA framework offers tangible value for the Saudi tourism sector. When integrated into analytical dashboards, AraBERT and QARiB can provide real-time insights into how visitors perceive specific aspects of their experiences, such as cleanliness, pricing, and service quality. This enables decision-makers to move from anecdotal feedback to actionable data. For instance, if sentiment around cleanliness in a particular city drops, managers can immediately identify the issue and direct resources to address it. The framework is flexible enough to be retrained for new cities, attraction types, or even adjacent sectors such as hospitality or events. Over time, it can support longitudinal monitoring, helping authorities evaluate how changes in policy or service delivery affect public sentiment and satisfaction.

The insights drawn from the error analysis offer meaningful lessons for tourism authorities seeking to better understand how visitors express their experiences. The misclassifications observed in aspects such as overall experience and price reveal that tourists often blend emotions or use indirect expressions when sharing opinions, particularly across different Arabic dialects. Recognizing these subtle linguistic and contextual nuances can help organizations interpret feedback more accurately and design surveys that truly capture visitors’ intentions. When sentiment insights are aligned with real-world service observations, decision-makers can not only identify which aspects require improvement but also understand the underlying causes behind misinterpreted praise or criticism, enabling more empathetic and targeted service enhancements.

Moreover, error analysis serves as a key contribution, offering practical insights into how contextual models handle dialectal Arabic and mixed sentiment, an essential step toward more robust and inclusive ABSA systems.

7. Conclusions and Future Work

This study addressed a critical gap in tourism analytics by developing and evaluating advanced models for aspect-based sentiment analysis (ABSA) on Arabic Google Maps reviews of tourist attractions in Saudi Arabia. The findings demonstrate that transformer-based architectures, particularly AraBERT and QARiB, achieve superior accuracy in capturing visitors’ opinions across key aspects such as cleanliness, price, and facilities. The publicly released, manually annotated dataset developed for this research offers a valuable foundation for future studies in Arabic ABSA and tourism-related NLP applications.

Future work will extend these models to better accommodate the linguistic diversity of Arabic dialects and to amplify underrepresented voices within the tourism discourse. Further exploration of few-shot and zero-shot learning approaches will enhance model adaptability to new attraction types and evolving review trends. Integrating these models into real-time monitoring systems can enable the timely identification of operational strengths and weaknesses, allowing for evidence-based service optimization.

A primary limitation of this study lies in the dialectal imbalance within the dataset: certain regional dialects (e.g., Najdi and Hijazi) are more prevalent than others, which may constrain model generalizability across less represented linguistic varieties. Addressing this limitation will be crucial for achieving broader dialectal inclusivity and performance stability.

Beyond its applied contributions, this research establishes a methodological foundation for multi-aspect sentiment modeling in underrepresented languages, contributing to the advancement of multilingual NLP and supporting the broader objectives of Saudi Vision 2030 for a more responsive, data-driven, and globally competitive tourism sector.

Author Contributions

Conceptualization, S.Z., A.H.A. and H.S.; Methodology, S.Z.; Software, S.Z.; Validation, S.Z.; Formal analysis, S.Z.; Investigation, S.Z., A.H.A. and H.S.; Data curation, S.Z.; Writing—original draft preparation, S.Z.; Writing—review and editing, A.H.A. and H.S.; Visualization, S.Z.; Supervision, A.H.A. and H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This project was funded by the KAU Endowment (WAQF) under King Abdulaziz University, Jeddah, Saudi Arabia. The authors gratefully acknowledge the KAU Endowment (WAQF) and the Deanship of Scientific Research (DSR) for their technical and financial support.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset generated and analyzed during the current study is publicly available on Zenodo at https://doi.org/10.5281/zenodo.17011574 (accessed on 20 October 2025).

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI, GPT-5, 2025), Grammarly, and QuillBot for proofreading, grammar correction, and language refinement. The authors have reviewed and edited all generated content and take full responsibility for the final version of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ABSA	Aspect-Based Sentiment Analysis
NLP	Natural Language Processing
MSA	Modern Standard Arabic
DA	Dialectal Arabic
ROC-AUC	Receiver Operating Characteristic—Area Under Curve
F1	F1-Score (Harmonic Mean of Precision and Recall)
TF-IDF	Term Frequency–Inverse Document Frequency
BoW	Bag of Words
SVM	Support Vector Machine
RF	Random Forest
CNN	Convolutional Neural Network
RNN	Recurrent Neural Network
LSTM	Long Short-Term Memory
BiLSTM	Bidirectional Long Short-Term Memory
GRU	Gated Recurrent Unit
CRF	Conditional Random Field
BERT	Bidirectional Encoder Representations from Transformers
AraBERT	Arabic BERT
MARBERT	Multilingual Arabic BERT (Twitter-based)
QARiB	QCRI Arabic and Dialectal BERT
MUSE	Multilingual Universal Sentence Encoder
LLM	Large Language Model

References

Mirzaalian, F.; Halpenny, E. Social media analytics in hospitality and tourism: A systematic literature review and future trends. J. Hosp. Tour. Technol. 2019, 10, 764–790.
Li, S.; Li, Y.; Liu, C.; Fan, N. How do different types of user-generated content attract travelers? Taking story and review on Airbnb as the example. J. Travel Res. 2023, 63, 371–387.
Schröter Freitas, F. The Impact of Google Maps’ Reviews and Algorithms on Young Adults’ Choices of Museums to Visit in Prague; Univerzita Karlova, Filozofická Fakulta: Prague, Czechia, 2023.
Putri, T.A.; Hamami, F.; Alam, E.N. Aspect-based sentiment analysis on natural tourism in West Bandung using multinomial logistic regression algorithm. In Proceedings of the 2023 1st International Conference on Advanced Informatics and Intelligent Information Systems (ICAI3S); Atlantis Press: Dordrecht, The Netherlands, 2024; pp. 116–127.
Oueslati, O.; Cambria, E.; HajHmida, M.B.; Ounelli, H. A review of sentiment analysis research in Arabic language. Future Gener. Comput. Syst. 2020, 112, 408–430.
Abo, M.E.M.; Raj, R.G.; Qazi, A. A review on Arabic sentiment analysis: State-of-the-art, taxonomy and open research challenges. IEEE Access 2019, 7, 162008–162024.
Alqurashi, T. Arabic sentiment analysis for twitter data: A systematic literature review. Eng. Technol. Appl. Sci. Res. 2023, 13, 10292–10300.
Abd-Elshafy, M.F.; Aly, T.; Gheith, M. Analyse the enhancement of sentiment analysis in Arabic by doing a comparative study of several machine learning techniques. Int. J. Res. Appl. Sci. Eng. Technol. 2024, 12, 2007–2027.
Liu, B. Sentiment Analysis and Opinion Mining; Springer Nature: Cham, Switzerland, 2022.
Lee, L.-H.; Li, J.-H.; Yu, L.-C. Chinese EmoBank: Building valence–arousal resources for dimensional sentiment analysis. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 21, 1–18.
Lee, L.-H.; Yu, L.-C.; Wang, S.; Liao, J. Overview of the SIGHAN 2024 shared task for Chinese dimensional aspect-based sentiment analysis. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), Bangkok, Thailand, 11–16 August 2024; pp. 165–174.
Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 55, 5731–5780.
Habimana, O.; Li, Y.; Li, R.; Gu, X.; Yu, G. Sentiment analysis using deep learning approaches: An overview. Sci. China Inf. Sci. 2020, 63, 111102.
Alslaity, A.; Orji, R. Machine learning techniques for emotion detection and sentiment analysis: Current state, challenges, and future directions. Behav. Inf. Technol. 2024, 43, 139–164.
Dang, N.C.; Moreno-García, M.N.; De la Prieta, F. Sentiment analysis based on deep learning: A comparative study. Electronics 2020, 9, 483.
Nandwani, P.; Verma, R. A review on sentiment analysis and emotion detection from text. Soc. Netw. Anal. Min. 2021, 11, 81.
Mehta, P.; Pandya, S. A review on sentiment analysis methodologies, practices and applications. Int. J. Sci. Technol. Res. 2020, 9, 601–609.
Alshaikh, K.A.; Almatrafi, O.A.; Abushark, Y.B. BERT-based model for aspect-based sentiment analysis for analyzing Arabic open-ended survey responses: A case study. IEEE Access 2023, 12, 2288–2302.
Abdelgwad, M.M.; Azmi, A.M. Arabic aspect sentiment polarity classification using BERT. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 788–796.
Gupta, S.; Ranjan, R.; Singh, S.N. Comprehensive study on sentiment analysis: From rule-based to modern LLM-based systems. arXiv 2024, arXiv:2409.09989.
AlNasser, A.; AlMuhaideb, A. Listening to patients: Advanced Arabic aspect-based sentiment analysis using transformer models towards better healthcare. Information 2024, 8, 156.
Shafiq, M.; Anwar, U.; Adeel, A. Enhancing Arabic aspect-based sentiment analysis using end-to-end model. Future Internet 2023, 15, 342.
Mohammad, A.-S.; Hammad, M.M.; Sa’ad, A.; AL-Tawalbeh, S.; Cambria, E. Gated recurrent unit with multilingual universal sentence encoder for Arabic aspect-based sentiment analysis. Knowl.-Based Syst. 2023, 261, 107540.
AlShammari, N.; AlMansour, A. Aspect-based sentiment analysis and location detection for Arabic language tweets. Appl. Comput. Syst. 2022, 27, 119–127.
Almasri, M.; Al-Malki, N.; Alotaibi, R. A semi-supervised approach to Arabic aspect category detection using BERT and teacher-student model. PeerJ Comput. Sci. 2023, 9, e1425.
Khabour, S.M.; Al-Radaideh, Q.A.; Mustafa, D. A new ontology-based method for Arabic sentiment analysis. Big Data Cogn. Comput. 2022, 6, 48.
Af’idah, D.I.; Anggraeni, P.D.; Rizki, M.; Setiawan, A.B.; Handayani, S.F. Aspect-based sentiment analysis for Indonesian tourist attraction reviews using bidirectional long short-term memory. JUITA J. Inform. 2023, 11, 27–36.
Nadeem, A.; Missen, M.M.S.; Al Reshan, M.S.; Memon, M.A.; Asiri, Y.; Nizamani, M.A.; Alsulami, M.; Shaikh, A. Resolving ambiguity in natural language for enhancement of aspect-based sentiment analysis of hotel reviews. PeerJ Comput. Sci. 2025, 11, e2635.
Viñán-Ludeña, M.S.; De Campos, L. Evaluating tourist dissatisfaction with aspect-based sentiment analysis using social media data. Adv. Hosp. Tour. Res. 2024, 12, 254–286.
Xu, C.; Wang, M.; Ren, Y.; Zhu, S. Enhancing aspect-based sentiment analysis in tourism using large language models and positional information. arXiv 2024, arXiv:2409.14997.
Jeong, N.; Lee, J. An aspect-based review analysis using ChatGPT for the exploration of hotel service failures. Sustainability 2024, 16, 1640.
Chang, Y.-C.; Ku, C.-H.; Le Nguyen, D.-D. Predicting aspect-based sentiment using deep learning and information visualization: The impact of COVID-19 on the airline industry. Inf. Manag. 2022, 59, 103587.
Li, H.; Bruce, X.B.; Li, G.; Gao, H. Restaurant survival prediction using customer-generated content: An aspect-based sentiment analysis of online reviews. Tourism Manag. 2023, 96, 104707.
Özen, İ.A.; Katlav, E.Ö. Aspect-based sentiment analysis on online customer reviews: A case study of technology-supported hotels. J. Hosp. Tour. Technol. 2023, 14, 102–120.
Yulianti, E.; Nissa, N.K. ABSA of Indonesian customer reviews using IndoBERT: Single-sentence and sentence-pair classification approaches. Bull. Electr. Eng. Inform. 2024, 13, 3579–3589.
Phuangsuwan, P.; Siripipatthanakul, S.; Limna, P.; Pariwongkhuntorn, N. The impact of Google Maps application on the digital economy. Corp. Bus. Strategy Rev. 2024, 5, 192–203.
Sahagun, M.A.; Flores, J.; Jocson, J. Utilizing Google Map reviews and sentiment analysis: Knowing customer experience in coffee shops. Quest J. Multidiscip. Res. Dev. 2022, 1, 29.
Alaydaa, M.S.M.; Li, J.; Jinkins, K. Aspect-based sentimental analysis for travellers’ reviews. arXiv 2023, arXiv:2308.02548.
Arianto, D.; Budi, I. Aspect-based sentiment analysis on Indonesia’s tourism destinations based on Google Maps user code-mixed reviews. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation, Hanoi, Vietnam, 24–26 October 2020; pp. 359–367.
AlMasaud, A.; Al-Baity, H.H. AraMAMS: Arabic multi-aspect, multi-sentiment restaurants reviews corpus for aspect-based sentiment analysis. Sustainability 2023, 15, 12268.
Alharbi, B.A.; Mezher, M.A.; Barakeh, A.M. Tourist reviews sentiment classification using deep learning techniques: A case study in Saudi Arabia. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 717–726.
Shin, B.; Ryu, S.; Kim, Y.; Kim, D. Analysis on review data of restaurants in Google Maps through text mining: Focusing on sentiment analysis. J. Multimed. Inf. Syst. 2022, 9, 61–68.
Alanazi, M.S.M.; Li, J.; Jenkins, K.W. Multiclass sentiment prediction of airport service online reviews using aspect-based sentimental analysis and machine learning. Mathematics 2024, 12, 781.
Web Robots. Instant Data Scraper. Available online: https://chrome.google.com/webstore/detail/instant-data-scraper/ofaokhiedipichpaojbibbnahnkdoiiah (accessed on 4 July 2025).
Fadaee, M.; Bisazza, A.; Monz, C. Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 567–573.
Wei, J.; Zou, K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 6383–6389.
Antoun, W.; Baly, F.; Hajj, H. AraELECTRA: Pre-training text discriminators for Arabic language understanding. arXiv 2021, arXiv:2104.07704.
El, K.M.; Antoun, W.; Baly, F.; Hajj, H.; Bouamor, H.; Glass, J. Arabic-BERT: Pre-training Arabic transformers for natural language understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual, 7–11 November 2021.
Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
Miyato, T.; Dai, A.M.; Goodfellow, I. Adversarial training methods for semi-supervised text classification. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.

Figure 1. Sentiment analysis approaches.

Figure 2. A preview of a Google Maps review text. The translation of the text inside the red box is: “Great, but the ticket price is a bit expensive—50 SAR per person—and the options are limited. There are cafés, and the space is a bit small”.

Figure 3. Distribution of reviews by attraction type.

Figure 4. Frequency of mentions per aspect: Environment, Experience, Facilities, Price, HR & Service, and Cleanliness.

Figure 5. Architecture of the proposed multi-head BERT model. The model consists of a shared BERT encoder that processes the input review, followed by six parallel classification heads—each dedicated to predicting sentiment for one specific aspect (price, cleanliness, facilities, service, environment, and overall experience).
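
The wiring described in the Figure 5 caption can be sketched in plain Python. This is an illustrative sketch only, not the authors' implementation: the shared BERT encoder is replaced by a hypothetical stand-in (`toy_encode`) that maps text to a small deterministic vector, the head weights are random rather than trained, and the names `Head` and `MultiHeadClassifier` are assumptions introduced here. Only the structure (one shared representation feeding six independent three-way heads) follows the caption.

```python
import math
import random

ASPECTS = ["price", "cleanliness", "facilities",
           "service", "environment", "overall_experience"]
LABELS = [-1, 0, 1]   # negative, neutral, positive
HIDDEN = 8            # toy size; an assumption, far smaller than a real encoder


def toy_encode(text: str) -> list[float]:
    """Hypothetical stand-in for the shared encoder: text -> fixed-size vector."""
    rng = random.Random(text)  # seeded by the text, so the output is deterministic
    return [rng.uniform(-1.0, 1.0) for _ in range(HIDDEN)]


def softmax(logits: list[float]) -> list[float]:
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


class Head:
    """One linear classification head: HIDDEN features -> 3 class probabilities."""

    def __init__(self, rng: random.Random) -> None:
        self.w = [[rng.gauss(0.0, 0.5) for _ in range(HIDDEN)] for _ in LABELS]
        self.b = [0.0 for _ in LABELS]

    def __call__(self, h: list[float]) -> list[float]:
        logits = [sum(wi * hi for wi, hi in zip(row, h)) + b
                  for row, b in zip(self.w, self.b)]
        return softmax(logits)


class MultiHeadClassifier:
    """Shared encoder followed by one head per aspect."""

    def __init__(self, seed: int = 0) -> None:
        rng = random.Random(seed)
        self.heads = {aspect: Head(rng) for aspect in ASPECTS}

    def predict(self, text: str) -> dict[str, int]:
        h = toy_encode(text)  # the encoder runs once per review
        out = {}
        for aspect, head in self.heads.items():
            probs = head(h)   # every head reads the same shared vector
            out[aspect] = LABELS[probs.index(max(probs))]
        return out


model = MultiHeadClassifier()
# Illustrative input: "The place is clean but the price is high"
pred = model.predict("المكان نظيف لكن السعر مرتفع")
print(pred)
```

Because the encoder output is computed once and reused, adding an aspect in this layout only adds one small head rather than a second full model.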

Figure 6. Training loss curve (batch-wise) over three epochs for AraBERT.

Figure 7. Normalized confusion matrices for AraBERT on (a) Cleanliness and (b) Overall Experience.

Figure 8. Macro-F1 heatmap across all ablation variants and aspects. The multi-head, class-weighted baseline achieves the highest performance, while removing attention or class weights leads to a clear drop in macro-F1.

Figure 9. Top positive and negative unigrams for the Cleanliness aspect identified by the LinearSVC model.

Figure 10. Precision–Recall curves for AraBERT on (a) Cleanliness and (b) Overall Experience, showing performance across classes −1, 0, and 1.

Table 1. ABSA in Arabic.

Ref.	Algorithm	Domain/Data/Language	Metrics	Results
[]	BERT (single head per aspect)	Hotel, Book, News (Arabic)	Accuracy	89.5%, 73%, 85.7%
[]	MARBERT + SVM	HEAR dataset (31K Arabic Google Maps reviews, healthcare)	Accuracy, F1	92.14%, 92.06%
[]	AraBERT-CRF (End-to-End)	SemEval-2016 Hotel reviews (Arabic)	F1 (AE, Sentiment)	95.11%, 98.34%
[]	Pooled GRU + MUSE	Hotel reviews (Arabic)	F1 (AE, PC)	93.0%, 90.86%
[]	LR, SVM, RF, CNN	Arabic tweets (Telecommunications)	F1 (sentiment), Accuracy (aspect)	0.81, 75%
[]	BERT-based	Education survey responses (Arabic)	F1 (AE), Polarity, Category	58%, 86%, 98%
[]	BERT (sent-pair)	Hotel, Book (HAAD), News (Arabic)	Accuracy	89.51%, 73.23%, 85.73%
[]	Noisy Student + AraBERT	SemEval2016 + HARD (Arabic)	Micro F1	68.2%
[]	Ontology-based	Hotel reviews (Arabic)	Accuracy, F1	79.20%, 78.7%

Table 2. ABSA in the Tourism Sector.

Ref.	Algorithm	Domain/Data/Language	Metrics	Results
[]	LSTM, Bi-LSTM	Tourist reviews (Indonesia)	Accuracy, F1	92.22%, 71.06%
[]	WSD + BERT + BiLSTM + GCN	Refined RABSA (English)	F1 (pos, neg, neut)	90.52%, 89.12%, 84.32%
[]	ABSA + BERT (entity rec.)	Twitter, Instagram, TripAdvisor (English)	Clustering, Summarization	Tourist dissatisfaction
[]	ACOS LLM	Self-created + Rest15/16 (Low-resource)	F1 score	+7.49%, +0.05%, +1.06%
[]	ChatGPT + BERT (neg refinement)	TripAdvisor (English)	Topic modeling, Summarized sentiments	Service failures (cleanliness, staff)
[]	LiFeBERT	Airline reviews (Global pandemic)	F1 (8 aspects)	60.1%
[]	CSF + BERT-based extraction	Restaurant survival (Yelp, Eng)	C-index	0.7715 (↑ 38.66%)
[]	ABSA (various)	Booking.com (English)	Not stated (qualitative)	Tech integration affects satisfaction
[]	IndoBERT	AiryRooms (Indonesian)	F1 score	91.94%

Table 3. ABSA in the Tourism Sector Using Google Maps.

Ref.	Algorithm	Domain/Data/Language	Metrics	Results
[]	Contextual model (BERT-like)	Google Maps, Airport (Dubai, Doha), English	Accuracy	≈80%
[]	RF, NB, LR, DT, ET	Borobudur, Prambanan (Indonesia), Google Maps	F1 range	45.5–91.5%
[]	ML (SVC, Linear SVC)	AraMA/AraMAMS (Arabic), Google Maps (Riyadh)	Accuracy	Up to 91.70%
[]	SVM, LSTM, RNN	Google Maps (Saudi dialect)	Accuracy	98% (SVM)
[]	RF + TF-IDF	Google Maps (Korean)	Balanced accuracy?	Aspect-based (food, price, etc.)
[]	RF vs. deep learning	Twitter, Google Maps, Airline Quality (English)	Pos, neg, nonexistent	RF > deep models

Table 4. Aspect Categories and Related Topics.

Aspect Category	Topics
Facilities	Availability of essential amenities such as restaurants, restrooms, prayer areas, children’s playgrounds, seating areas, and the presence of organized events or activities.
Experience	Overall tourist satisfaction, willingness to revisit the location, recommendation likelihood, and perceived value.
Cleanliness	Maintenance of public spaces, odor control, cleanliness of restrooms and green areas, and waste management practices.
HR & Service	Staff behavior, guide helpfulness, professionalism, service quality, and responsiveness to visitor needs.
Environment	Visual appeal, site atmosphere, weather conditions, and natural surroundings.
Price	Perceived fairness of entry fees, value for money, clarity of pricing, and availability of offers or discounts.

Table 5. Cohen’s Kappa Inter-Annotator Agreement for Each Aspect.

Aspect	Cohen’s Kappa	Level of Agreement
Price	0.72	Substantial
Cleanliness	0.87	Almost Perfect
Facilities	0.76	Substantial
Service/Staff	0.92	Almost Perfect
Environment	0.80	Substantial
Overall Experience	0.90	Almost Perfect
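
For readers who want to reproduce agreement scores of the kind shown in Table 5, Cohen's kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance. The sketch below is a minimal plain-Python version; the two annotator label lists are invented for illustration and are not taken from the dataset. The `agreement_level` helper assumes the commonly used Landis and Koch bands, which are consistent with the labels in Table 5 but are not stated in this section.

```python
from collections import Counter


def cohens_kappa(a: list, b: list) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n           # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_e = sum(ca[k] * cb[k] for k in set(ca) | set(cb)) / (n * n)  # chance agreement
    if p_e == 1.0:                                         # both used a single identical label
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


def agreement_level(kappa: float) -> str:
    """Interpretation bands (assumed: Landis and Koch)."""
    if kappa > 0.80:
        return "Almost Perfect"
    if kappa > 0.60:
        return "Substantial"
    if kappa > 0.40:
        return "Moderate"
    if kappa > 0.20:
        return "Fair"
    if kappa > 0.0:
        return "Slight"
    return "Poor"


# Invented labels for ten reviews (-1 = negative, 0 = neutral, 1 = positive).
annotator_1 = [1, 1, 0, -1, 1, 0, -1, 1, 0, 1]
annotator_2 = [1, 1, 0, -1, 0, 0, -1, 1, 1, 1]

kappa = cohens_kappa(annotator_1, annotator_2)
print(round(kappa, 3), agreement_level(kappa))
```

In this toy case the annotators agree on 8 of 10 items (p_o = 0.80) and chance agreement is p_e = 0.38, so kappa is noticeably lower than the raw agreement rate.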

Table 6. Sentiment Distribution per Aspect Category.

Aspect Category	Positive	Negative	Neutral	Total
Environment	1390	539	199	2128
Overall Experience	1339	456	193	1988
Facilities	1021	421	146	1588
Price	725	439	137	1301
Service (HR & Staff)	690	368	132	1190
Cleanliness	612	183	95	890
Total	5777	2406	902	9085
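
Table 6 shows that positive labels outnumber neutral ones by more than six to one, which is the kind of skew that class weighting is meant to offset. The sketch below derives per-aspect weights from the Table 6 counts using the "balanced" inverse-frequency rule w_c = N / (K · n_c). That rule is an assumption chosen for illustration; this section does not state which weighting formula the authors used.

```python
# Counts copied from Table 6, ordered (positive, negative, neutral).
COUNTS = {
    "Environment": (1390, 539, 199),
    "Overall Experience": (1339, 456, 193),
    "Facilities": (1021, 421, 146),
    "Price": (725, 439, 137),
    "Service (HR & Staff)": (690, 368, 132),
    "Cleanliness": (612, 183, 95),
}
CLASSES = ("positive", "negative", "neutral")


def balanced_weights(counts: tuple) -> list[float]:
    """Inverse-frequency weights: N / (K * n_c) for each class c (assumed scheme)."""
    total = sum(counts)
    k = len(counts)
    return [total / (k * c) for c in counts]


weights = {aspect: balanced_weights(c) for aspect, c in COUNTS.items()}

for aspect, w in weights.items():
    pretty = ", ".join(f"{name}={value:.2f}" for name, value in zip(CLASSES, w))
    print(f"{aspect:22s} {pretty}")
```

Under this rule the rarest class in each aspect (neutral) receives the largest weight, for example roughly 3.12 for Cleanliness against roughly 0.48 for its positive class.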

Table 7. Sample Annotated Reviews with Saudi Dialect and Translation.

Aspect Category	Sentiment	Arabic Review (Saudi Dialect)	English Translation
Experience	Positive	المكان رهيب وأنصح الكل يجربه	The place is awesome, I recommend everyone to try it.
	Negative	مايستاهل المشوار بصراحة	Honestly, it’s not worth the trip.
	Neutral	تجربتي كانت عادية، لا جديد	My experience was average, nothing new.
Environment	Positive	الجو يفتح النفس والمكان يشرح الصدر	The atmosphere is uplifting and the place is heartwarming.
	Negative	زحمة وازعاج المكان ومافي راحة	Crowded, noisy, and no comfort at all.
	Neutral	المكان عادي مرة.	The atmosphere is okay but not wow.
Facilities	Positive	كل شي متوفر حتى الجلسات حلوة.	Everything’s available, even nice seating areas.
	Negative	مافي دورة مياه قريبة وتعبنا ندور.	No nearby restroom, we got tired looking for one.
	Neutral	الخدمات مقبولة.	The facilities are acceptable.
Price	Positive	السعر على قد الخدمة، مرة مناسب.	The price matches the service, very reasonable.
	Negative	خمسين ريال على المكان؟ كثير صراحة.	50 riyals for this place? Honestly, that’s too much.
	Neutral	السعر عادي زي باقي الأماكن.	The price is normal like other places.
HR & Service	Positive	الموظفين محترمين وتعاملهم راقي.	The staff were respectful and had classy behavior.
	Negative	سألت عن شيء وتجاهلوني!	I asked about something and they ignored me!
	Neutral	تعاملهم كان عادي.	Their service was just okay.
Cleanliness	Positive	المكان نظيف وكل شي مرتب.	The place is clean and everything is organized.
	Negative	المكان مو نظيف ابدا ورائحته كريهة.	The place was not clean at all and smelled bad.
	Neutral	النظافة متوسطة، لا بأس.	Cleanliness is average, not bad.

Table 14. Summary of performance metrics per aspect (AraBERT).

Aspect	Accuracy	Macro-F1	ROC-AUC	AP
Price	0.92	0.80	0.96	0.81
Cleanliness	0.96	0.88	0.99	0.94
Facilities	0.82	0.76	0.94	0.86
Service & staff	0.93	0.78	0.97	0.87
Environment	0.87	0.71	0.92	0.76
Overall Experience	0.75	0.75	0.90	0.83
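
Table 14 reports both accuracy and macro-F1, and the two can differ noticeably (for example, 0.96 versus 0.88 for Cleanliness). The sketch below shows how both are computed from a single confusion matrix and why macro-F1 is lower when minority classes are harder to predict. The matrix values are invented for illustration and are not the paper's results.

```python
def accuracy(cm: list[list[int]]) -> float:
    """Share of all predictions that fall on the diagonal."""
    correct = sum(cm[i][i] for i in range(len(cm)))
    total = sum(sum(row) for row in cm)
    return correct / total


def macro_f1(cm: list[list[int]]) -> float:
    """Unweighted mean of per-class F1; rows are true classes, columns predictions."""
    k = len(cm)
    scores = []
    for i in range(k):
        tp = cm[i][i]
        fp = sum(cm[r][i] for r in range(k)) - tp  # predicted i, true class differs
        fn = sum(cm[i]) - tp                       # true i, predicted otherwise
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / k


# Invented 3-class matrix, class order: negative (-1), neutral (0), positive (1).
cm = [
    [6, 4, 0],
    [2, 96, 2],
    [0, 4, 6],
]

print(f"accuracy = {accuracy(cm):.2f}")
print(f"macro-F1 = {macro_f1(cm):.2f}")
```

Here the dominant neutral class drives accuracy to 0.90, while the two small classes each score an F1 of about 0.67, pulling macro-F1 down to about 0.76.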

Table 15. Number of instances per sentiment class (−1 = Negative, 0 = Neutral, 1 = Positive) for each aspect.

Aspect	Negative (−1)	Neutral (0)	Positive (1)
Price	175	498	36
Cleanliness	39	603	67
Facilities	80	467	162
Service & Staff	36	590	83
Environment	64	99	546
Overall Experience	99	310	300
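
The counts in Table 15 sum to 709 instances for every aspect and are strongly skewed toward one class. A quick way to put the accuracies in Table 14 in context is to compute the accuracy of a trivial classifier that always predicts the most frequent class. The sketch below does this directly from the Table 15 counts; the function name `majority_baseline` is introduced here for illustration.

```python
# Counts copied from Table 15, ordered (negative, neutral, positive).
TEST_COUNTS = {
    "Price": (175, 498, 36),
    "Cleanliness": (39, 603, 67),
    "Facilities": (80, 467, 162),
    "Service & Staff": (36, 590, 83),
    "Environment": (64, 99, 546),
    "Overall Experience": (99, 310, 300),
}


def majority_baseline(counts: tuple) -> float:
    """Accuracy obtained by always predicting the most frequent class."""
    return max(counts) / sum(counts)


baselines = {aspect: majority_baseline(c) for aspect, c in TEST_COUNTS.items()}

for aspect, acc in baselines.items():
    print(f"{aspect:20s} majority-class accuracy = {acc:.3f}")
```

For Cleanliness, always predicting the neutral class already yields 603/709, about 0.85 accuracy, which is one reason macro-F1 and per-class curves are more informative than accuracy alone for the skewed aspects.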

Table 16. Summary of the most frequent error types and representative examples observed in AraBERT predictions.

Error Type	Description/Likely Cause	Example (Arabic/English)	True → Predicted
Mixed Sentiment	Review contains both praise and criticism; model focuses on dominant positive cue.	Arabic: التجربة جميلة لكن لا أعتقد أني أكرر الزيارة. English: The experience was nice, but I wouldn’t visit again.	Neutral → Positive
Aspect Sentiment Masking	Strong sentiment toward one aspect (e.g., cleanliness) overshadows another (e.g., environment).	Arabic: المكان نظيف ومرتب لكن حر جدًا ومزدحم. English: The place is clean and organized, but very hot and crowded.	Negative (Environment) → Positive
Keyword Overreliance	Model depends on strong sentiment words like “جميل” or “رائع” regardless of context.	Arabic: المكان جميل لكنه مبالغ في الأسعار. English: The place is beautiful but overpriced.	Neutral → Positive
Ambiguity & Contextual Subtlety	Sarcastic or understated tone misinterpreted as literal positivity.	Arabic: تجربة لن تُنسى للأسف. English: An unforgettable experience—unfortunately.	Negative → Positive

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
