Data
  • Article
  • Open Access

23 October 2025

Multi-Aspect Sentiment Classification of Arabic Tourism Reviews Using BERT and Classical Machine Learning

Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
*
Author to whom correspondence should be addressed.
Data 2025, 10(11), 168; https://doi.org/10.3390/data10110168
(registering DOI)

Abstract

Understanding visitor sentiment is essential for developing effective tourism strategies, particularly as Google Maps reviews have become a key channel for public feedback on tourist attractions. Yet, the unstructured format and dialectal diversity of Arabic reviews pose significant challenges for extracting actionable insights at scale. This study evaluates the performance of traditional machine learning and transformer-based models for aspect-based sentiment analysis (ABSA) on Arabic Google Maps reviews of tourist sites across Saudi Arabia. A manually annotated dataset of more than 3500 reviews was constructed to assess model effectiveness across six tourism-related aspects: price, cleanliness, facilities, service, environment, and overall experience. Experimental results demonstrate that multi-head BERT architectures, particularly AraBERT, consistently outperform traditional classifiers in identifying aspect-level sentiment. AraBERT achieved an F1-score of 0.97 for the cleanliness aspect, compared with 0.91 for the best-performing classical model (LinearSVC), indicating a substantial improvement. The proposed ABSA framework facilitates automated, fine-grained analysis of visitor perceptions, enabling data-driven decision-making for tourism authorities and contributing to the strategic objectives of Saudi Vision 2030.

1. Introduction

The rapid growth of social media and review platforms has generated an abundance of user-generated content that reflects public opinion and individual experiences []. Among these platforms, Google Maps, used by more than two billion people globally, has emerged as a major source of travel information, providing extensive reviews of hotels, attractions, and various leisure activities []. Its vast collection of user comments, ratings, photos, and detailed place descriptions, spanning restaurants to museums, offers a valuable corpus for understanding visitor experiences [,].
In the tourism sector, sentiment analysis plays an increasingly vital role in understanding customer perceptions and improving service quality. However, deriving actionable insights from Arabic-language reviews remains challenging due to the unstructured nature of the text, the linguistic complexity of Arabic, and its wide dialectal variation [,]. These challenges have constrained the effectiveness of sentiment analysis in tourism analytics across the Arab world. Despite the expanding Arabic-speaking online population, which now exceeds 185 million users and accounts for 4.8% of global internet users, aspect-based sentiment analysis (ABSA) in Arabic, particularly for tourism-related data from platforms such as Google Maps, remains substantially underexplored [,]. Most prior research has concentrated on domains such as social media, finance, or healthcare, leaving tourism applications relatively neglected.
To address this gap, the present study develops and evaluates advanced models for Arabic ABSA on Google Maps reviews related to tourist attractions in Saudi Arabia. A manually annotated dataset of more than 3500 reviews was constructed, covering six key aspects: price, cleanliness, facilities, service, environment, and overall experience. The study benchmarks a multi-head classification architecture based on a pre-trained Arabic BERT model (asafaya/bert-base-arabic) against traditional machine learning pipelines, including ensemble classifiers and optimized SVMs using TF-IDF features. This hybrid approach enables a systematic comparison between deep learning and classical methods for multi-aspect sentiment classification within Arabic tourism data.
Although the study does not introduce a new ABSA architecture, its main contribution lies in the thoughtful adaptation of advanced transformer-based methods to an underexplored domain, Arabic tourism reviews from Google Maps. The originality of this work stems from the development of a domain-specific, manually annotated dataset and the presentation of the first comprehensive benchmark comparing classical and transformer-based models for Arabic aspect-based sentiment analysis in the tourism context.
This research makes three primary contributions.
  • Novelty in Domain and Data: To the best of our knowledge, this is the first systematic study to conduct ABSA on Arabic Google Maps reviews within the tourism sector. Prior research has rarely addressed ABSA for Arabic Google Maps data or examined tourism as a domain.
  • Resource Creation: The study introduces a new manually annotated Arabic dataset of Google Maps reviews covering six critical tourism-related aspects (price, cleanliness, facilities, service, environment, and overall experience). This dataset fills a major gap, as no publicly available, aspect-annotated corpus currently exists for Arabic tourism reviews.
  • Benchmarking and Practical Insights: The research benchmarks state-of-the-art transformer-based models (multi-head BERT) against classical machine learning approaches, providing a practical evaluation of their effectiveness on real-world, dialect-rich Arabic data. The findings offer actionable insights for tourism analytics and policy development in Saudi Arabia.

2. Background

2.1. Sentiment Analysis

The primary goal of sentiment analysis is to identify and categorize the emotional tone or attitude expressed in text. This process typically involves classifying user-generated content as positive, neutral, or negative based on the viewpoints expressed [].
Sentiment can be modeled using two main paradigms. The categorical approach assigns discrete labels such as positive, neutral, or negative, whereas the dimensional approach represents emotion along continuous scales of valence and arousal to capture more nuanced affective variations. While dimensional models offer greater emotional granularity, categorical representation remains dominant in aspect-based sentiment analysis (ABSA) due to its interpretability and compatibility with standard evaluation metrics. Accordingly, this study adopts a categorical framework to support transparent aspect-level analysis while recognizing the potential of dimensional modeling for future research that aims to capture continuous emotional intensity [].
ABSA encompasses several sub-tasks, including Aspect Sentiment Classification (ASC), Aspect Sentiment Triplet Extraction (ASTE), Aspect Sentiment Quad Prediction (ASQP), and the emerging Dimensional ABSA (DimABSA), which models sentiment on continuous dimensions. The present study focuses on the ASC task, addressing six predefined aspects within the tourism domain to align with practical analytical needs [].
Sentiment analysis can be performed at multiple levels of granularity, each serving distinct analytical purposes [,,]:
  • Document-level analysis assigns an overall sentiment (positive, neutral, or negative) to an entire text based on its global tone.
  • Sentence-level analysis evaluates each sentence independently, enabling the detection of mixed opinions within a single document.
  • Phrase-level analysis targets specific expressions or keywords that reflect sentiment toward a particular feature or topic.
  • Aspect-level analysis offers the most fine-grained understanding by detecting multiple aspects within the same text and assigning each a separate sentiment polarity, facilitating precise interpretation of user opinions.
The techniques employed in sentiment analysis are generally grouped into three primary categories, lexicon-based, machine learning-based, and hybrid approaches, as illustrated in Figure 1. Hybrid approaches are designed to overcome the limitations of individual methods by integrating their complementary strengths [].
Figure 1. Sentiment analysis approaches.
  • Lexicon-Based Approach: This method depends on sentiment dictionaries or predefined word lists that categorize terms as positive, negative, or neutral. Each word is given a sentiment score, and the overall sentiment of the text is determined by summing these scores [].
  • Machine Learning-Based Approach: This approach uses algorithms trained on labeled datasets to detect sentiment patterns. It includes techniques like Naive Bayes, Support Vector Machines (SVM), and advanced deep learning models such as Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN), including LSTM and BiLSTM, which have shown strong performance in sentiment classification tasks [].
  • Hybrid Approach: This technique combines elements of both lexicon-based and machine learning-based methods to take advantage of the strengths of each. It aims to enhance accuracy, adaptability, and performance in various sentiment analysis applications [].
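The lexicon-based approach described above can be illustrated with a minimal sketch. The word list and scores below are invented for illustration and are not taken from any published sentiment lexicon; the tokenization is deliberately naive.

```python
# Toy sentiment lexicon: words and scores are illustrative assumptions,
# not entries from any published lexicon.
LEXICON = {
    "great": 1, "clean": 1, "nice": 1,
    "expensive": -1, "dirty": -1, "small": -1,
}


def lexicon_sentiment(text: str) -> str:
    """Sum per-word scores and map the total to a polarity label."""
    # Naive tokenization: lowercase, strip commas and periods, split on spaces.
    tokens = text.lower().replace(",", " ").replace(".", " ").split()
    score = sum(LEXICON.get(token, 0) for token in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"


print(lexicon_sentiment("Great place, very clean"))         # positive
print(lexicon_sentiment("Expensive tickets, small space"))  # negative
```

A scheme of this kind cannot resolve context-dependent words, which is one motivation for the machine learning and hybrid approaches listed above.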

2.2. Aspect-Based Sentiment Analysis

Aspect-Based Sentiment Analysis (ABSA) enables detailed examination of opinions by identifying specific attributes or aspects within a text and determining the sentiment polarity associated with each. Unlike document- or sentence-level sentiment analysis, ABSA extracts fine-grained insights into multiple features of a product or service simultaneously, making it especially valuable for domains such as tourism, e-commerce, and hospitality [].
ABSA generally comprises two main tasks:
  • Aspect extraction: identifies the specific features or attributes discussed in the text, such as price, cleanliness, or facilities in tourism-related reviews.
  • Sentiment classification: assigns a sentiment polarity (positive, neutral, or negative) to each extracted aspect based on the user’s expressed opinion.
This two-step process provides actionable, aspect-specific insights that help researchers and practitioners understand user satisfaction and preferences at a granular level [,].

2.2.1. Challenges of ABSA in Arabic

While ABSA has achieved notable success in English and other languages, applying it to Arabic presents distinct linguistic challenges [,]. Arabic exhibits rich morphology, diverse regional dialects, and frequent omission of diacritics, all of which increase lexical ambiguity. For instance, the word “حلو” can mean “sweet” when describing taste (e.g., a dessert) or “nice” when praising an experience (e.g., the atmosphere), depending on context. Moreover, dialectal variation across regions leads to differing expressions for the same aspect, complicating both aspect extraction and sentiment polarity detection. These factors make Arabic ABSA considerably more complex than its counterparts in morphologically simpler languages.

2.2.2. Suitability of Multi-Head Architectures for ABSA

Recent advances in deep learning have introduced multi-head architectures, particularly multi-head BERT models, which are highly suitable for ABSA tasks [,]. In this configuration, a shared encoder learns general linguistic representations while multiple classification heads specialize in predicting sentiment for individual aspects. This setup enables simultaneous learning of shared and aspect-specific features, improving efficiency and accuracy across multiple aspects within a single review.
Beyond traditional transformer encoders, recent research has explored Large Language Model (LLM) approaches for ABSA. These methods leverage instruction tuning and retrieval-augmented strategies such as retrieval-based example ranking to enhance aspect detection and contextual sentiment reasoning. While this study focuses on transformer-based architectures (AraBERT, MARBERT, QARiB), instruction-tuned LLMs represent a promising direction for improving cross-domain generalization and performance on dialect-rich Arabic reviews [].

4. Methodology

This section describes the methodology used to perform multi-aspect sentiment classification on Arabic Google Maps reviews of tourist attractions in Saudi Arabia. It outlines the data collection, manual annotation, and exploratory analysis procedures, followed by the development of transformer-based and traditional machine learning models. The section concludes with an explanation of the evaluation metrics used to assess model performance.

4.1. Dataset Source and Collection

This study focuses on dialectal Arabic (DA) as used across Saudi Arabia. Reviews were collected from Google Maps for 39 tourist attractions distributed across 15 major cities using the Instant Data Scraper Chrome extension [].
In total, 20,000 reviews were gathered, most written in dialectal Arabic, with a smaller portion in English. Metadata such as usernames, timestamps, likes, and owner replies were collected but excluded from analysis; only the review text was kept. Figure 2 shows a sample review that illustrates the type of user feedback analyzed. To promote transparency and future research in Arabic sentiment and tourism-related NLP, the annotated dataset has been released publicly on Zenodo (https://zenodo.org/records/17011574 (accessed on 20 October 2025), DOI: 10.5281/zenodo.17011573).
Figure 2. A preview of a Google Maps review text. The translation of the text inside the red box is: “Great, but the ticket price is a bit expensive—50 SAR per person—and the options are limited. There are cafés, and the space is a bit small”.

4.2. Dataset Overview

To identify key tourist destinations, a keyword-based search on Google Maps was conducted using terms such as “museums,” “parks,” “farms,” and “heritage.” This process yielded 39 attractions across 15 Saudi cities, representing a diverse mix of venues and dialects. The dataset captures several regional dialects, including Hijazi, Najdi, Northern, Eastern, and Southern Arabic. Most reviews were written in Najdi and Hijazi dialects, along with instances of Modern Standard Arabic (MSA).

4.3. Aspect-Based Approach

Prior to annotation, key aspect categories were determined based on previous research [], which identified common visitor concerns such as amenities, pricing, and staff service. This study extended those dimensions to include additional aspects relevant to tourism, specifically cleanliness, environment, and overall experience. The predefined aspect categories and their corresponding topics used in this study are summarized in Table 4.
Table 4. Aspect Categories and Related Topics.

4.4. Annotation Process and Inter-Annotator Agreement

Two native Arabic speakers, aged between 26 and 32, served as annotators. Both possessed strong proficiency in dialectal Arabic and demonstrated technical competence in linguistic annotation. Prior to the main task, they were trained using comprehensive annotation guidelines and a preliminary 20-sentence calibration exercise designed to ensure consistency. Each annotator independently labeled the same dataset. Only reviews containing at least two identifiable aspects were retained, while non-Arabic reviews were removed.
From the initial collection of 20,000 Google Maps reviews, approximately 16,460 were excluded during the preprocessing and annotation stages, primarily due to non-Arabic content, duplication, or insufficient identifiable aspects with clear sentiment expressions. The final dataset comprised 3540 multi-aspect reviews, accurately reflecting both the linguistic diversity and the sparsity characteristic of user-generated Arabic tourism data.
To assess annotation consistency, Cohen’s kappa coefficient was calculated for each aspect. The agreement scores are presented in Table 5.
Table 5. Cohen’s Kappa Inter-Annotator Agreement for Each Aspect.
High kappa values confirm strong inter-annotator agreement, validating the reliability of the labeling process. Discrepancies were resolved by adopting Samar’s annotations as the reference standard, ensuring the creation of a consistent, high-quality “gold” dataset for subsequent modeling.
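For reference, Cohen's kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for one aspect; the two label lists are invented for illustration and are not drawn from the study's annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical annotations for six reviews on one aspect.
annotator_1 = ["pos", "pos", "neg", "neu", "pos", "neg"]
annotator_2 = ["pos", "pos", "neg", "neg", "pos", "neg"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))  # 0.714
```

In this toy case the annotators agree on five of six items (p_o ≈ 0.833) while chance agreement is p_e ≈ 0.417, giving κ = 5/7.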

4.5. Exploratory Data Visualization

Figure 3 presents the distribution of reviews across attraction types. Gardens received the highest number of reviews, followed by historical and heritage sites, parks, and museums, reflecting strong public engagement with natural and culturally significant destinations.
Figure 3. Distribution of reviews by attraction type.
Figure 4 illustrates the frequency of aspect mentions. Environment and overall experience were most frequently discussed, followed by facilities and price, whereas cleanliness and HR & service appeared less often.
Figure 4. Frequency of mentions per aspect: Environment, Experience, Facilities, Price, HR & Service, and Cleanliness.
The final dataset contains 9085 aspect-sentiment pairs: 5777 positive, 2406 negative, and 902 neutral. The sentiment distribution per aspect is shown in Table 6.
Table 6. Sentiment Distribution per Aspect Category.
To illustrate annotation practices, Table 7 provides representative examples in Saudi dialectal Arabic with corresponding English translations and sentiment labels.
Table 7. Sample Annotated Reviews with Saudi Dialect and Translation.

4.6. Model Architecture

To evaluate the trade-offs between deep learning and traditional machine learning methods, four model architectures were implemented for multi-aspect sentiment classification. These included both transformer-based and classical models.
The same base encoder, asafaya/bert-base-arabic, was used in two configurations:
  • Manual BERT (single-aspect): a separate single-head classifier fine-tuned individually for each aspect.
  • Multi-Head BERT: a shared encoder with multiple classification heads, one per aspect, enabling simultaneous prediction across all aspects.
The Multi-Head BERT model was fine-tuned using the pre-trained asafaya/bert-base-arabic encoder available on Hugging Face. It consists of a shared encoder and six independent heads, each responsible for predicting sentiment polarity (positive, negative, or neutral) for one aspect (price, cleanliness, facilities, service, environment, and overall experience). This structure captures both general linguistic features and aspect-specific sentiment cues, consistent with contemporary multi-task ABSA frameworks [,]. Figure 5 presents the overall model architecture.
Figure 5. Architecture of the proposed multi-head BERT model. The model consists of a shared BERT encoder that processes the input review, followed by six parallel classification heads—each dedicated to predicting sentiment for one specific aspect (price, cleanliness, facilities, service, environment, and overall experience).
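The shared-encoder, six-head layout can be sketched in PyTorch as follows. This is a structural illustration only: `ToyEncoder` is a hypothetical stand-in for the pre-trained BERT encoder (so the example runs without downloading weights), and the dropout rate, the abbreviated aspect keys, and the summation of per-aspect losses are assumptions that the text does not specify.

```python
import torch
from torch import nn

ASPECTS = ["price", "cleanliness", "facilities", "service", "environment", "overall"]


class MultiHeadSentimentClassifier(nn.Module):
    """Shared encoder followed by one 3-way classification head per aspect."""

    def __init__(self, encoder: nn.Module, hidden_size: int,
                 num_classes: int = 3, dropout: float = 0.1):
        super().__init__()
        self.encoder = encoder              # shared across all aspects
        self.dropout = nn.Dropout(dropout)  # assumed rate
        self.heads = nn.ModuleDict(
            {aspect: nn.Linear(hidden_size, num_classes) for aspect in ASPECTS}
        )

    def forward(self, input_ids: torch.Tensor) -> dict:
        pooled = self.dropout(self.encoder(input_ids))  # (batch, hidden_size)
        return {aspect: head(pooled) for aspect, head in self.heads.items()}


class ToyEncoder(nn.Module):
    """Hypothetical stand-in for BERT: embeds tokens and mean-pools them."""

    def __init__(self, vocab_size: int = 100, hidden_size: int = 32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(input_ids).mean(dim=1)


model = MultiHeadSentimentClassifier(ToyEncoder(), hidden_size=32)
batch = torch.randint(0, 100, (4, 128))  # 4 reviews, sequence length 128
logits = model(batch)

# Joint objective (assumed): sum of per-aspect cross-entropy losses.
targets = {aspect: torch.randint(0, 3, (4,)) for aspect in ASPECTS}
loss = sum(nn.functional.cross_entropy(logits[a], targets[a]) for a in ASPECTS)
```

Because every head back-propagates through the same encoder, gradients from all six aspects update the shared representation in a single pass.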

4.7. Performance Measurement

The performance of all models was evaluated using four widely adopted metrics in sentiment classification: Accuracy, Precision, Recall, and F1-score []. These metrics were derived from the confusion matrix to provide a balanced assessment of each model’s ability to correctly identify positive, neutral, and negative instances. Since these measures are standard in the machine learning and NLP literature, their detailed mathematical definitions are omitted for brevity.
All Precision, Recall, and F1-Score values reported throughout this study (Tables 8–13) were computed using the macro-averaging method to account for class imbalance across sentiment categories.
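As a brief illustration of macro-averaging, the sketch below computes per-class precision, recall, and F1 and then takes their unweighted mean, so that each sentiment class contributes equally regardless of its frequency. The label lists are invented for illustration.

```python
def macro_scores(y_true, y_pred, labels=("positive", "neutral", "negative")):
    """Return macro-averaged (precision, recall, F1) over the given labels."""
    precisions, recalls, f1s = [], [], []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    k = len(labels)
    return sum(precisions) / k, sum(recalls) / k, sum(f1s) / k


# Hypothetical predictions: the single neutral review is missed entirely.
y_true = ["positive"] * 4 + ["negative", "negative", "neutral"]
y_pred = ["positive"] * 4 + ["negative", "positive", "positive"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
macro_p, macro_r, macro_f1 = macro_scores(y_true, y_pred)
print(round(accuracy, 3), round(macro_f1, 3))  # 0.714 0.489
```

Here accuracy is 5/7 while macro-F1 is about 0.49, because the missed minority class contributes an F1 of zero to the average.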

5. Experiments and Results

This section presents the evaluation of seven models for aspect-based sentiment analysis (ABSA) on Arabic Google Maps reviews of Saudi tourist attractions.
Both BERT-based variants share the same base encoder. Manual BERT refers to single-aspect fine-tuning (a separate classifier per aspect), while Multi-Head BERT denotes a shared encoder with multiple classification heads (one per aspect).
The experiments compared traditional machine learning models (TF-IDF + LinearSVC, Voting Classifier) with transformer-based models (AraBERT, MARBERT, QARiB, and Manual BERT). All models were implemented in Python 3.11.6 using scikit-learn, PyTorch, and Hugging Face Transformers. The dataset was divided using an 80/20 stratified split to ensure balanced aspect and sentiment representation.
Model training used consistent hyperparameters: maximum sequence length = 128, batch size = 16 or 32, learning rate = 2 × 10⁻⁵, three epochs, and the AdamW optimizer. The TF-IDF + LinearSVC model employed GridSearchCV for hyperparameter optimization, while LinearSVC (Oversampling) used manual resampling to ensure class balance. The Voting Classifier combined SVC, SGD, and Logistic Regression through hard voting.
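A minimal sketch of the TF-IDF + LinearSVC baseline with grid search, for a single aspect, might look as follows. The six toy reviews, the grid values, the macro-F1 scoring choice, and the two-fold cross-validation are assumptions made to keep the example small; the study's actual search space is not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Invented single-aspect (cleanliness) examples, two per class:
# "the place is very clean", "the garden is clean and tidy",
# "the place is very dirty", "the toilets are dirty",
# "cleanliness is ordinary", "cleanliness is somewhat acceptable".
texts = [
    "المكان نظيف جدا",
    "الحديقة نظيفة ومرتبة",
    "المكان وسخ جدا",
    "دورات المياه وسخة",
    "النظافة عادية",
    "النظافة مقبولة نوعا ما",
]
labels = ["positive", "positive", "negative", "negative", "neutral", "neutral"]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", LinearSVC(class_weight="balanced")),
])

# Assumed grid: n-gram range and regularization strength.
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],
    "svm__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipeline, param_grid, scoring="f1_macro", cv=2)
search.fit(texts, labels)

print(search.best_params_)
prediction = search.predict(["المكان نظيف"])  # "the place is clean"
print(prediction[0])
```

In a multi-aspect setting, one such pipeline would be fitted per aspect, since each aspect has its own label column.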
Transformer-based models included Manual BERT (single-aspect) fine-tuned individually per aspect and Multi-Head BERT (AraBERT) with a shared encoder and six classification heads for joint prediction. MARBERT + Focal Loss integrated class weights and dropout (0.3), while QARiB (Attention) utilized attention pooling with class-weighted loss, early stopping (patience = 2), and linear learning-rate scheduling. Each experiment was repeated with multiple random seeds to ensure result stability.

5.1. Performance Evaluation by Aspect

Price: AraBERT achieved the highest performance on the Price aspect (Accuracy = 0.93, F1 = 0.92), followed closely by MARBERT (F1 = 0.90), as shown in Table 8. Traditional classifiers such as TF-IDF + LinearSVC and the Voting Classifier also performed reasonably well but lagged behind transformer-based models in recall and precision. Manual BERT produced strong accuracy (0.92) but lower recall (Recall = 0.72, F1 = 0.76), suggesting that single-aspect fine-tuning is less effective than shared contextual encoding for nuanced sentiment detection.
Table 8. Performance comparison for aspect: Price.
| Model | Accuracy | Precision | Recall | Macro-F1 |
| --- | --- | --- | --- | --- |
| TF-IDF + LinearSVC (GridSearch) | 0.88 | 0.87 | 0.88 | 0.87 |
| LinearSVC (Oversampling) | 0.88 | 0.86 | 0.88 | 0.87 |
| VotingClassifier | 0.88 | 0.87 | 0.88 | 0.87 |
| Multi-Head BERT (AraBERT) | 0.93 | 0.92 | 0.93 | 0.92 |
| MARBERT + Focal Loss | 0.90 | 0.89 | 0.90 | 0.90 |
| QARiB | 0.92 | 0.79 | 0.86 | 0.81 |
| Manual BERT (single-aspect) | 0.92 | 0.84 | 0.72 | 0.76 |
Cleanliness: AraBERT achieved the strongest overall results for the Cleanliness aspect (Accuracy = 0.97, F1 = 0.97), as summarized in Table 9. MARBERT followed closely (F1 = 0.96), and Manual BERT also performed competitively (F1 = 0.91). Traditional classifiers achieved solid but slightly lower results, reaffirming that transformer models are better at capturing explicit cleanliness cues in text.
Table 9. Performance comparison for aspect: Cleanliness.
| Model | Accuracy | Precision | Recall | Macro-F1 |
| --- | --- | --- | --- | --- |
| TF-IDF + LinearSVC (GridSearch) | 0.92 | 0.92 | 0.92 | 0.91 |
| LinearSVC (Oversampling) | 0.93 | 0.92 | 0.93 | 0.92 |
| VotingClassifier | 0.93 | 0.92 | 0.93 | 0.92 |
| Multi-Head BERT (AraBERT) | 0.97 | 0.97 | 0.97 | 0.97 |
| MARBERT + Focal Loss | 0.94 | 0.95 | 0.94 | 0.96 |
| QARiB | 0.96 | 0.86 | 0.90 | 0.88 |
| Manual BERT (single-aspect) | 0.97 | 0.94 | 0.88 | 0.91 |
Facilities: AraBERT and QARiB obtained the highest accuracy (0.83), demonstrating the importance of contextual embeddings in modeling infrastructure-related sentiments (see Table 10). Classical classifiers performed respectably (F1 ≈ 0.78) but were less capable of identifying subtle or implicit opinions expressed through descriptive language.
Table 10. Performance comparison for aspect: Facilities.
| Model | Accuracy | Precision | Recall | Macro-F1 |
| --- | --- | --- | --- | --- |
| TF-IDF + LinearSVC (GridSearch) | 0.78 | 0.77 | 0.78 | 0.77 |
| LinearSVC (Oversampling) | 0.79 | 0.78 | 0.79 | 0.78 |
| VotingClassifier | 0.79 | 0.77 | 0.79 | 0.78 |
| Multi-Head BERT (AraBERT) | 0.83 | 0.85 | 0.83 | 0.84 |
| MARBERT + Focal Loss | 0.78 | 0.79 | 0.78 | 0.79 |
| QARiB | 0.83 | 0.76 | 0.83 | 0.79 |
| Manual BERT (single-aspect) | 0.82 | 0.74 | 0.79 | 0.76 |
Service & Staff: QARiB achieved the highest accuracy for the Service & Staff aspect (Accuracy = 0.93, F1 = 0.85), while AraBERT achieved the highest macro-F1 at nearly the same accuracy (Accuracy = 0.92, F1 = 0.92), as shown in Table 11. These results highlight the models’ capacity to interpret interpersonal tone and nuanced language, where traditional methods plateaued around F1 = 0.86. Manual BERT’s lower recall (0.69) underscores the benefits of multi-aspect contextual learning.
Table 11. Performance comparison for aspect: Service & Staff.
| Model | Accuracy | Precision | Recall | Macro-F1 |
| --- | --- | --- | --- | --- |
| TF-IDF + LinearSVC (GridSearch) | 0.87 | 0.85 | 0.87 | 0.86 |
| LinearSVC (Oversampling) | 0.87 | 0.86 | 0.87 | 0.86 |
| VotingClassifier | 0.88 | 0.87 | 0.88 | 0.87 |
| Multi-Head BERT (AraBERT) | 0.92 | 0.92 | 0.92 | 0.92 |
| MARBERT + Focal Loss | 0.77 | 0.81 | 0.77 | 0.79 |
| QARiB | 0.93 | 0.82 | 0.88 | 0.85 |
| Manual BERT (single-aspect) | 0.92 | 0.88 | 0.69 | 0.75 |
Environment: As reported in Table 12, QARiB achieved the highest accuracy for the Environment aspect (0.88), while AraBERT provided a better precision–recall balance (F1 = 0.86). This indicates complementary strengths between the two models—QARiB excels in recognizing strong sentiment cues, whereas AraBERT handles contextually mixed expressions more effectively.
Table 12. Performance comparison for aspect: Environment.
| Model | Accuracy | Precision | Recall | Macro-F1 |
| --- | --- | --- | --- | --- |
| TF-IDF + LinearSVC (GridSearch) | 0.82 | 0.80 | 0.82 | 0.81 |
| LinearSVC (Oversampling) | 0.82 | 0.80 | 0.82 | 0.81 |
| VotingClassifier | 0.82 | 0.80 | 0.82 | 0.81 |
| Multi-Head BERT (AraBERT) | 0.87 | 0.86 | 0.87 | 0.86 |
| MARBERT + Focal Loss | 0.84 | 0.79 | 0.84 | 0.75 |
| QARiB | 0.88 | 0.76 | 0.77 | 0.77 |
| Manual BERT (single-aspect) | 0.85 | 0.71 | 0.60 | 0.64 |
Overall Experience: Performance across all models declined on the more abstract Overall Experience aspect, reflecting its conceptual complexity (see Table 13). QARiB and Manual BERT delivered the highest scores (Accuracy = 0.77, F1 = 0.77), followed closely by AraBERT (F1 = 0.76). These results suggest that modeling generalized impressions requires deeper contextual inference than surface-level sentiment cues can provide.
Table 13. Performance comparison for aspect: Overall Experience.
| Model | Accuracy | Precision | Recall | Macro-F1 |
| --- | --- | --- | --- | --- |
| TF-IDF + LinearSVC (GridSearch) | 0.75 | 0.75 | 0.75 | 0.75 |
| LinearSVC (Oversampling) | 0.73 | 0.73 | 0.73 | 0.73 |
| VotingClassifier | 0.72 | 0.72 | 0.72 | 0.72 |
| Multi-Head BERT (AraBERT) | 0.76 | 0.77 | 0.76 | 0.76 |
| MARBERT + Focal Loss | 0.65 | 0.60 | 0.65 | 0.62 |
| QARiB | 0.77 | 0.76 | 0.78 | 0.77 |
| Manual BERT (single-aspect) | 0.77 | 0.78 | 0.76 | 0.77 |
In summary, transformer-based models, particularly AraBERT and QARiB, consistently outperformed traditional machine learning baselines across all aspects. Cleanliness and Price yielded the highest scores, owing to their clear lexical sentiment indicators, while Overall Experience remained the most challenging due to its abstract, context-dependent nature.

5.2. Model Analysis and Discussion

This section presents a comparative analysis of the seven evaluated models for multi-aspect sentiment classification on Arabic Google Maps reviews of Saudi tourist attractions. The discussion interprets the quantitative results summarized in Tables 8–13, emphasizing each model’s strengths, limitations, and practical implications for tourism analytics.

5.2.1. TF-IDF + LinearSVC (GridSearchCV)

Despite its relative simplicity, the TF-IDF + LinearSVC pipeline achieved competitively high F1-scores across several aspects, most notably Cleanliness (F1 = 0.91), Price (F1 = 0.87), and Service & Staff (F1 = 0.86), as shown in Tables 8, 9, and 11. Systematic hyperparameter tuning via GridSearchCV (regularization strength, n-gram range, vocabulary size) was essential in mitigating class imbalance, and the use of class_weight = "balanced" further ensured equitable treatment of all sentiment classes. The model’s transparent feature weights and low computational footprint make it particularly suitable for rapid deployment or scenarios with limited resources. However, its reliance on bag-of-words features limits its ability to capture subtle linguistic nuances frequently present in user-generated Arabic reviews.
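To make the effect of balanced class weighting concrete, the sketch below applies the scikit-learn "balanced" heuristic, weight = n_samples / (n_classes × class_count), to the pooled sentiment counts reported for the dataset (5777 positive, 2406 negative, 902 neutral). Pooling across aspects is a simplification for illustration; in practice the weights would be derived from each aspect's own training labels.

```python
# Pooled aspect-sentiment counts reported for the final dataset.
class_counts = {"positive": 5777, "negative": 2406, "neutral": 902}


def balanced_class_weights(counts: dict) -> dict:
    """Weights per the 'balanced' heuristic: n_samples / (n_classes * count)."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


weights = balanced_class_weights(class_counts)
for label, weight in weights.items():
    print(f"{label}: {weight:.3f}")
# positive: 0.524
# negative: 1.259
# neutral: 3.357
```

Under this heuristic, an error on a neutral example is penalized roughly six times as heavily as an error on a positive one.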

5.2.2. LinearSVC with Oversampling vs. Voting Classifier

Applying manual oversampling to the LinearSVC model raised the F1-score for Cleanliness slightly, from 0.91 to 0.92 (Table 9), underscoring the role of data-level balancing in multi-class sentiment tasks. The Voting Classifier, an ensemble of LinearSVC, SGDClassifier, and LogisticRegression, yielded at most marginal improvements (e.g., Service & Staff F1: 0.87 vs. 0.86 in Table 11, while Price F1 was unchanged at 0.87 in Table 8), and did not justify the added complexity. These findings highlight that a well-tuned single classifier can rival simple ensemble strategies in Arabic ABSA. However, while traditional classifiers such as LinearSVC performed well on explicit aspects like Cleanliness and Price, their accuracy declined on more abstract dimensions such as Overall Experience. This limitation stems from their reliance on bag-of-words features, which overlook contextual meaning and subtle sentiment cues in user-generated Arabic text. Such nuances are better captured by transformer-based models, explaining the superior results of AraBERT and QARiB on complex, context-dependent aspects.
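The exact resampling procedure is not detailed in the text; one common reading of manual oversampling is random duplication of minority-class examples until every class matches the majority class size. The sketch below follows that assumption, with a fixed seed and invented sample identifiers.

```python
import random
from collections import Counter, defaultdict


def oversample(samples, labels, seed=42):
    """Randomly duplicate minority-class samples up to the majority class size."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_label[label].append(sample)
    target = max(len(items) for items in by_label.values())
    out_samples, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra copies with replacement to reach the target size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        out_samples.extend(items + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


# Hypothetical training split: 6 positive, 3 negative, 1 neutral review.
train_x = [f"review_{i}" for i in range(10)]
train_y = ["positive"] * 6 + ["negative"] * 3 + ["neutral"]

balanced_x, balanced_y = oversample(train_x, train_y)
print(Counter(balanced_y))
```

Resampling of this kind should be applied to the training split only, so that duplicated reviews never appear in the held-out 20%.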

5.2.3. Multi-Head BERT (AraBERT)

AraBERT’s multi-task architecture, leveraging a shared BERT encoder and parallel classification heads, achieved the strongest overall results. Its highest F1-scores were recorded for Cleanliness (0.97), Service & Staff (0.92), and Price (0.92), as indicated in Tables 8, 9, and 11. The ability of contextualized embeddings to capture complex polarity expressions across aspects proved especially beneficial for the nuanced language of tourism reviews. AraBERT’s superior performance in aspects critical to tourism, such as cleanliness and pricing, demonstrates its practical value for extracting actionable insights that support service improvement and strategic planning for Saudi tourist destinations.

5.2.4. MARBERT with Focal Loss

Fine-tuning MARBERT with a Focal Loss objective (γ = 2.0) improved minority-class recall, particularly for Cleanliness (F1 = 0.96) and Price (F1 = 0.90), by down-weighting well-classified examples and focusing on underrepresented cases. However, its weaker result on Overall Experience (F1 = 0.62) reflects the inherent ambiguity and multi-faceted nature of this aspect, which often conflates several sentiment cues. Table 13 highlights the persistent challenge of modeling broad and abstract aspects within real-world ABSA tasks. Effectively capturing sentiment related to general impressions or mixed experiences demands richer contextual modeling, larger and more balanced datasets, and the integration of advanced LLM-based techniques capable of inferring underlying intent rather than relying solely on surface-level lexical cues.
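The down-weighting effect can be seen directly from the standard focal loss definition, FL(p_t) = −α (1 − p_t)^γ log(p_t), where p_t is the probability the model assigns to the true class. The sketch below evaluates it for a single example with γ = 2.0 as stated above; the weighting factor α = 1.0 and the probabilities are illustrative assumptions.

```python
import math


def focal_loss(p_true: float, gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Focal loss for one example, given the probability of its true class."""
    return -alpha * (1.0 - p_true) ** gamma * math.log(p_true)


for p in (0.9, 0.5, 0.1):
    cross_entropy = -math.log(p)
    print(f"p_t={p:.1f}  cross-entropy={cross_entropy:.3f}  focal={focal_loss(p):.3f}")
# p_t=0.9  cross-entropy=0.105  focal=0.001
# p_t=0.5  cross-entropy=0.693  focal=0.173
# p_t=0.1  cross-entropy=2.303  focal=1.865
```

A confidently correct example (p_t = 0.9) has its loss scaled by 0.01, whereas a badly misclassified one (p_t = 0.1) retains 81% of its cross-entropy, so training concentrates on hard cases. Setting γ = 0 recovers ordinary cross-entropy.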

5.2.5. QARiB with Attention, Scheduling, and Early Stopping

The attention-enhanced QARiB model, equipped with class-weighted loss, linear warmup scheduling, and early stopping (patience = 2), demonstrated robust performance across aspects, particularly for the previously challenging Overall Experience aspect (F1 = 0.77). It balanced precision and recall across all aspects, especially Cleanliness (F1 = 0.88) and Service & Staff (F1 = 0.85), while effectively mitigating overfitting. This confirms that targeted architectural enhancements and regularization significantly boost model stability for practical, multi-aspect sentiment analysis in Arabic tourism data.
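The early-stopping rule with patience = 2 can be sketched as follows. The monitored quantity is assumed here to be a validation score where higher is better (for example, macro-F1); the text does not state which metric was monitored, and the score sequence is invented.

```python
def early_stopping_epoch(val_scores, patience=2):
    """Return (stop_epoch, best_epoch) for a higher-is-better validation metric."""
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch  # stop; restore the best checkpoint
    return len(val_scores) - 1, best_epoch


# Hypothetical validation scores per epoch.
scores = [0.70, 0.74, 0.73, 0.72, 0.75]
stop_epoch, best_epoch = early_stopping_epoch(scores, patience=2)
print(stop_epoch, best_epoch)  # 3 1
```

Training halts at epoch index 3, after two consecutive epochs without improvement, and the checkpoint from epoch index 1 would be kept; the later score of 0.75 is never reached.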

5.2.6. Manual BERT (Single-Aspect)

Even without specialized attention modules or custom loss weighting, the vanilla asafaya/bert-base-arabic model performed competitively on well-represented aspects such as Cleanliness (F1 = 0.91) and Overall Experience (F1 = 0.77) but delivered lower performance on subtler aspects (Environment F1 = 0.64; Service & Staff F1 = 0.75). The absence of shared contextual learning across aspects limits its ability to generalize, reinforcing the advantage of multi-head architectures for integrated sentiment classification.

5.2.7. Comparative Insights and Practical Implications

Transformer-based models clearly outperformed classical baselines across nearly all aspects. Their contextual embeddings captured dialectal variation, idiomatic phrasing, and implicit sentiment cues that linear models could not. High inter-annotator agreement on aspects like Cleanliness and Service & Staff (Cohen’s κ > 0.8) correlated with stronger model performance, demonstrating that consistent annotation directly supports predictive accuracy. Conversely, aspects with ambiguous or composite sentiment boundaries—Environment and Overall Experience—remained the most challenging.
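For reference, Cohen's κ compares the observed agreement between two annotators with the agreement expected by chance: κ = (p_o - p_e) / (1 - p_e). The sketch below computes it for two hypothetical annotators; the label sequences are invented for illustration and are not drawn from the annotated dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


# Hypothetical annotations (-1 = Negative, 0 = Neutral, 1 = Positive).
annotator_a = [1, 1, 0, -1, 0, 1, -1, 0, 1, 0]
annotator_b = [1, 1, 0, -1, 0, 1, -1, 0, 0, 0]

print(round(cohen_kappa(annotator_a, annotator_b), 3))
```

Here the annotators agree on 9 of 10 items (p_o = 0.90) with chance agreement p_e = 0.36, giving κ ≈ 0.84, which would fall in the κ > 0.8 range discussed above.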
For tourism stakeholders, AraBERT and QARiB represent the most effective tools for deriving actionable insights from large-scale Arabic reviews. Their precision in assessing Cleanliness, Facilities, and Price can inform operational improvements and policy decisions. Integrating these models into tourism analytics dashboards would enable real-time sentiment monitoring across attractions and cities, supporting the evidence-based goals of Saudi Vision 2030.

5.3. Comprehensive Evaluation of the Best-Performing Model

To provide a holistic assessment of aspect-based sentiment analysis in Arabic tourism reviews, this section presents a detailed quantitative evaluation of the best-performing model, AraBERT, across all six targeted aspects: Price, Cleanliness, Facilities, Service & Staff, Environment, and Overall Experience. This analysis extends beyond overall accuracy, incorporating additional metrics and qualitative diagnostics to assess real-world applicability within the Saudi tourism context.

5.3.1. Quantitative Evaluation

Table 14 summarizes AraBERT’s performance across all aspects in terms of Accuracy, Macro-F1, ROC-AUC, and Average Precision (AP). As shown, aspects with explicit sentiment cues, such as Cleanliness and Price, consistently achieve the highest scores, while more abstract aspects such as Overall Experience and Environment remain more challenging. This disparity reflects the inherent ambiguity and mixed sentiment typical of real user-generated reviews.
Table 14. Summary of performance metrics per aspect (AraBERT).
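Because Accuracy and Macro-F1 can diverge sharply on imbalanced aspects, the following sketch shows how the two are computed for the three-class setting (-1, 0, 1). The labels are a toy example constructed for illustration; ROC-AUC and AP are omitted here, and nothing below reproduces values from Table 14.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=(-1, 0, 1)):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy case: a classifier that always predicts Neutral on a skewed sample.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, -1]
y_pred = [0] * 10

print(f"accuracy={accuracy(y_true, y_pred):.2f}")
print(f"macro_f1={macro_f1(y_true, y_pred):.2f}")
```

The always-Neutral classifier reaches 0.80 accuracy but a Macro-F1 of only about 0.30, which is why Macro-F1 is the more informative figure for aspects dominated by a single class.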

5.3.2. Class Distribution Support

Table 15 illustrates the number of true instances per sentiment class (−1 = Negative, 0 = Neutral, 1 = Positive). A clear class imbalance is evident, particularly the limited positive reviews for Price and the smaller negative set for Overall Experience. These disparities partly explain the lower F1-scores observed for these aspects and emphasize the importance of balanced datasets for reliable multi-aspect classification.
Table 15. Number of instances per sentiment class (−1 = Negative, 0 = Neutral, 1 = Positive) for each aspect.
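One common way to turn such class counts into loss weights is the inverse-frequency ("balanced") heuristic, w_c = N / (K · n_c), where N is the number of samples, K the number of classes, and n_c the count of class c. The paper does not state which weighting formula was used, so the formula choice and the counts below are assumptions made purely for illustration.

```python
def balanced_class_weights(counts):
    """Inverse-frequency weights: w_c = N / (K * n_c)."""
    total = sum(counts.values())
    k = len(counts)
    return {label: total / (k * n) for label, n in counts.items()}


# Hypothetical per-class counts for one aspect (not the values in Table 15).
counts = {-1: 50, 0: 900, 1: 50}

weights = balanced_class_weights(counts)
for label, w in sorted(weights.items()):
    print(f"class {label:+d}: weight {w:.3f}")
```

Under this heuristic the rare classes receive weights well above 1 and the dominant Neutral class a weight below 1, so that each class contributes equally to the total weighted loss.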

5.3.3. Training Dynamics

Figure 6 displays the batch-wise training loss curve for AraBERT over three epochs. The steady downward trajectory and absence of early overfitting confirm that the training schedule and regularization strategy were appropriately tuned for this dataset.
Figure 6. Training loss curve (batch-wise) over three epochs for AraBERT.

5.3.4. Confusion Matrices and Error Analysis

To better understand model behavior, Figure 7 displays normalized confusion matrices for two representative aspects: Cleanliness and Overall Experience. The Cleanliness aspect demonstrates near-perfect classification, with 99% of neutral reviews correctly identified and only minor confusion between negative and neutral labels. In contrast, Overall Experience reveals more ambiguity: around 25% of true positives were misclassified as neutral, highlighting the difficulty of interpreting sentiment when lexical cues are abstract or context-dependent.
Figure 7. Normalized confusion matrices for AraBERT on (a) Cleanliness and (b) Overall Experience.
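A row-normalized confusion matrix of the kind shown in Figure 7 divides each row of raw counts by the number of true instances of that class, so each row sums to 1. The sketch below assumes row normalization (the paper does not state the normalization axis) and uses a small invented label set in which a quarter of the true positives are predicted as neutral.

```python
def normalized_confusion(y_true, y_pred, labels=(-1, 0, 1)):
    """Row-normalized confusion matrix; rows = true class, columns = predicted."""
    index = {c: i for i, c in enumerate(labels)}
    counts = [[0] * len(labels) for _ in labels]
    for t, p in zip(y_true, y_pred):
        counts[index[t]][index[p]] += 1
    matrix = []
    for row in counts:
        total = sum(row)
        # Guard against classes with no true instances.
        matrix.append([v / total if total else 0.0 for v in row])
    return matrix


# Toy labels (assumed): one of four true positives is predicted as neutral.
y_true = [1, 1, 1, 1, 0, 0, -1, -1]
y_pred = [1, 1, 1, 0, 0, 0, -1, 0]

for label, row in zip((-1, 0, 1), normalized_confusion(y_true, y_pred)):
    print(f"true {label:+d}:", [round(v, 2) for v in row])
```

Reading along the last row gives the share of true positives assigned to each predicted class, which is how a figure such as "25% of true positives misclassified as neutral" is obtained.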

5.3.5. Representative Misclassification Patterns

A qualitative analysis of misclassified samples revealed several recurring error patterns, most notably within the Environment and Overall Experience aspects, which were the most challenging due to their abstract and context-dependent nature. These aspects often involve subjective evaluations that rely on implicit cues or subtle contrasts in sentiment.
The first pattern, mixed sentiment, was common in reviews expressing both praise and criticism within the same sentence. The model tended to emphasize the dominant sentiment word rather than the overall contextual meaning. For instance, “التجربة جميلة لكن لا أعتقد أني أكرر الزيارة” (“The experience was nice, but I don’t think I would visit again”) should be classified as neutral because it conveys both satisfaction and reluctance to revisit, yet the model incorrectly labeled it as positive. Similarly, the review “مكان يستحق الزيارة مرة واحدة فقط” (“A place worth visiting once only”) expresses a limited or cautious recommendation, but the model misinterpreted it as fully positive.
The second pattern, aspect sentiment masking, occurred when strong sentiment toward one aspect overshadowed another. For example, “المكان نظيف ومرتب لكن حر جدًا ومزدحم” (“The place is clean and organized, but very hot and crowded”) combines positive comments about cleanliness with negative impressions of the environment. However, the model predicted a positive label because it relied on strong positive cues like “نظيف ومرتب,” overlooking the environmental complaint. A similar error appeared in “المنظر جميل بس الصوت مزعج” (“The view is beautiful, but the noise is disturbing”), where the negative auditory experience was masked by the positive visual expression.
The third pattern, keyword overreliance, reflected the model’s tendency to assign sentiment labels based on emotionally charged words such as “جميل” (beautiful), “رائع” (wonderful), or “ممتع” (enjoyable), even when the broader tone was neutral or mildly negative. For instance, “المكان جميل لكنه مبالغ في الأسعار” (“The place is beautiful but overpriced”) was classified as positive because the model focused on “جميل,” disregarding the negative judgment on pricing. Likewise, “الجو لطيف بس مافي شي مميز” (“The weather is pleasant, but there’s nothing special”) was misclassified as positive due to its reliance on surface-level lexical cues.
A final pattern, ambiguity and contextual subtlety, was frequently observed in reviews describing the Environment and Overall Experience aspects through vague, sarcastic, or understated expressions. The phrase “المكان عادي جدًا” (“The place is very ordinary”) was often misclassified as neutral despite carrying a mildly negative tone in Arabic. Similarly, “تجربة لن تُنسى للأسف” (“An unforgettable experience—unfortunately”) was incorrectly predicted as positive because the model interpreted it literally and failed to capture the sarcastic tone. These cases highlight the difficulty of recognizing irony, understatement, and implied sentiment within Arabic text, especially when meaning depends heavily on cultural and contextual nuances. A summary of the most frequent error patterns and representative examples is provided in Table 16, illustrating typical misclassification causes across the Environment and Overall Experience aspects.
Table 16. Summary of the most frequent error types and representative examples observed in AraBERT predictions.
Addressing these limitations requires more context-aware modeling approaches. Future work may benefit from data augmentation techniques such as paraphrasing, back-translation, and dialectal variation to enrich training data and reduce dependence on lexical cues [,]. Incorporating contextual or dialect-sensitive embeddings and adapter layers could also enhance the model’s ability to represent subtle, region-specific sentiment patterns [,]. Additionally, adversarial training on ambiguous or borderline samples may strengthen robustness to mixed or implicit sentiment [,]. Finally, continuous manual error analysis and fine-tuning on recurrent misclassifications can progressively align model predictions with the nuanced nature of Arabic tourism reviews. Together, these strategies provide a practical pathway toward improving Arabic ABSA systems for real-world tourism analytics.

5.3.6. Ablation and Sensitivity Analysis

To evaluate the robustness of the proposed transformer-based architecture, a focused ablation and sensitivity analysis was conducted on the best-performing AraBERT configuration (sequence length = 128, batch size = 16, class-weighted loss). Several variants were implemented by systematically modifying or removing key components to assess their contribution to performance. These included:
(i) Removing the attention pooling layer and relying solely on the [CLS] token representation (No attention);
(ii) Removing class weighting from the loss function (No class weights);
(iii) Increasing the maximum sequence length to 256 (MaxLen = 256);
(iv) Doubling the batch size to 32 (Batch = 32); and
(v) Training six independent single-head models instead of a shared multi-head encoder (Single-Head).
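The five variants listed above each change exactly one setting of the baseline. A minimal way to express such a one-factor-at-a-time design is sketched below; the `RunConfig` class and its field names are hypothetical and are not taken from the study's code, while the values (128, 16, 256, 32) follow the text.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical container for the ablated settings."""
    max_len: int = 128
    batch_size: int = 16
    class_weights: bool = True
    attention_pooling: bool = True
    multi_head: bool = True


BASELINE = RunConfig()

# Each variant differs from the baseline in a single field.
VARIANTS = {
    "Baseline": BASELINE,
    "No attention": replace(BASELINE, attention_pooling=False),
    "No class weights": replace(BASELINE, class_weights=False),
    "MaxLen = 256": replace(BASELINE, max_len=256),
    "Batch = 32": replace(BASELINE, batch_size=32),
    "Single-Head": replace(BASELINE, multi_head=False),
}


def changed_fields(cfg, reference=BASELINE):
    """Return the fields in which cfg differs from the reference config."""
    ref = asdict(reference)
    return {k: v for k, v in asdict(cfg).items() if v != ref[k]}


for name, cfg in VARIANTS.items():
    print(f"{name:17s} {changed_fields(cfg)}")
```

Keeping every variant one edit away from the baseline means any change in Macro-F1 can be attributed to that single component.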
Figure 8 illustrates the Macro-F1 scores across all aspects and model variants. The results confirm that the multi-head, class-weighted baseline achieves the best overall performance (average Macro-F1 = 0.79), with consistent improvements on Price, Cleanliness, and Service & Staff.
Figure 8. Macro-F1 heatmap across all ablation variants and aspects. The multi-head, class-weighted baseline achieves the highest performance, while removing attention or class weights leads to a clear drop in Macro-F1.
Removing attention pooling caused a moderate decline (average Macro-F1 = 0.77), showing that lightweight attention contributes to capturing aspect-specific cues beyond the global [CLS] representation.
Eliminating class weights led to the largest performance drop on the Price aspect (from 0.84 to 0.62), confirming the importance of balancing sentiment classes in highly imbalanced aspects.
Extending the sequence length or batch size yielded no meaningful gains, suggesting that most reviews are short and already well represented at 128 tokens.
Finally, training independent single-head models slightly reduced overall robustness (average Macro-F1 = 0.77), highlighting the advantage of shared feature representations in multi-aspect learning.
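The difference between [CLS] pooling and attention pooling in variant (i) can be illustrated with a small pure-Python sketch. The exact form of the attention layer is not described in the paper, so the version below (dot-product scores against a single query vector, followed by a softmax-weighted sum of token vectors) is an assumed, commonly used formulation; the token vectors and query are toy values.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cls_pool(hidden):
    """Use only the first token's vector ([CLS]) as the sentence vector."""
    return hidden[0]


def attention_pool(hidden, query):
    """Weighted sum of token vectors; weights = softmax(h_i . query)."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden]
    weights = softmax(scores)
    dim = len(hidden[0])
    pooled = [sum(w * h[d] for w, h in zip(weights, hidden)) for d in range(dim)]
    return pooled, weights


# Toy 2-dimensional token vectors; index 0 plays the role of [CLS].
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [0.0, 2.0]  # assumed query (learned in practice); favors dimension 2

pooled, weights = attention_pool(hidden, query)
print("CLS      :", cls_pool(hidden))
print("attention:", [round(v, 3) for v in pooled])
print("weights  :", [round(w, 3) for w in weights])
```

In a multi-head setup, one plausible design is to give each aspect its own query vector so that each head can weight different tokens, whereas [CLS] pooling hands every head the same single vector.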

5.3.7. Feature Importance Visualization (Classical Models)

To enhance the interpretability of the classical machine-learning baseline, the most influential features identified by the LinearSVC classifier were visualized using TF-IDF coefficients.
Figure 9 presents the top positive and negative unigrams for the Cleanliness aspect.
Figure 9. Top positive and negative unigrams for the Cleanliness aspect identified by the LinearSVC model.
Words such as clean, organized, and neat are strongly associated with positive sentiment, while dirty, filthy, and bad odor correspond to negative predictions.
This visualization demonstrates the transparency of feature-based models and clarifies how they capture sentiment polarity compared with contextual transformer representations.
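For a linear model, this kind of ranking amounts to sorting vocabulary terms by their learned coefficients. The sketch below shows the ranking step only; the vocabulary uses the English glosses mentioned above, and the coefficient values are invented toy numbers, not the coefficients behind Figure 9.

```python
def top_features(vocab, coefs, k=3):
    """Return the k most positive and k most negative (term, coef) pairs."""
    ranked = sorted(zip(vocab, coefs), key=lambda pair: pair[1])
    negative = ranked[:k]          # most negative first
    positive = ranked[-k:][::-1]   # most positive first
    return positive, negative


# Toy vocabulary and coefficients (assumed values for illustration only).
vocab = ["clean", "organized", "neat", "dirty", "filthy", "bad odor", "place"]
coefs = [1.8, 1.2, 0.9, -1.7, -1.4, -1.1, 0.1]

positive, negative = top_features(vocab, coefs)
print("positive:", positive)
print("negative:", negative)
```

With scikit-learn, the inputs would typically come from `TfidfVectorizer.get_feature_names_out()` and the relevant row of `LinearSVC.coef_`; for a three-class problem, `coef_` holds one row per class, so the row for the class of interest has to be selected first.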

5.3.8. Precision–Recall Curves

Finally, Figure 10 presents the precision–recall curves for AraBERT on Cleanliness and Overall Experience. The Cleanliness aspect maintains high precision across recall levels, indicating clearly separable sentiment boundaries, while Overall Experience exhibits a sharper precision decline, further evidence of the inherent ambiguity in modeling generalized impressions within user reviews.
Figure 10. Precision–Recall curves for AraBERT on (a) Cleanliness and (b) Overall Experience, showing performance across classes −1, 0, and 1.
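Each per-class curve in such a plot treats one sentiment class as the positive class (one-vs-rest) and ranks reviews by the model's predicted probability for that class. Average Precision (AP) summarizes the curve as the mean of the precision values at the ranks where a true instance is retrieved. The sketch below assumes this one-vs-rest setup and uses invented scores with no ties.

```python
def average_precision(y_true, scores):
    """AP for one class: y_true is 1 for the target class, else 0."""
    # Rank items from highest to lowest predicted score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_pos = sum(y_true)
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            ap += hits / rank  # precision at this recall step
    return ap / total_pos if total_pos else 0.0


# Toy one-vs-rest labels and scores (assumed values).
y_true = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]

print(round(average_precision(y_true, scores), 4))
```

In this toy case the two true instances are retrieved at ranks 1 and 3, giving precisions of 1 and 2/3 and an AP of about 0.83; a perfectly separable class would score 1.0.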

6. Discussion and Key Insights

The results clearly show that transformer-based models outperform traditional approaches, especially on well-defined aspects such as Cleanliness and Price. This advantage stems from their ability to capture deep contextual relationships within text rather than relying on isolated words. In other words, these models understand meaning in context—how a single phrase or adjective can shift depending on the sentence or sentiment around it. However, even the most advanced transformers struggle when the sentiment becomes less direct. Broader and more abstract aspects, such as Overall Experience, often contain blended tones or nuanced expressions that blur the line between positive and neutral sentiment. This explains the frequent misclassifications observed in those categories. Interestingly, we also found that aspects with high inter-annotator agreement (Cohen’s κ > 0.8) consistently led to stronger model performance, emphasizing how much consistent labeling and clear annotation guidelines contribute to model reliability.
While transformer models clearly dominate in most scenarios, well-tuned classical methods, such as SVMs or simple ensemble classifiers, can still perform competitively under the right conditions. When the dataset is balanced and dialectal variation is limited, these traditional models often achieve reliable results. They are lightweight, fast, and easy to interpret, which makes them practical choices when computational resources are constrained. However, their reliance on surface-level word patterns limits their ability to capture deeper contextual and semantic nuances, particularly in reviews where sentiment is implied or mixed. This is precisely where transformer-based models such as AraBERT and QARiB excel, leveraging contextual embeddings to understand meaning within complex, multi-aspect Arabic text.
The strength of AraBERT and QARiB lies in their extensive pretraining on large, diverse Arabic corpora. This exposure enables them to generalize across both Modern Standard Arabic and various regional dialects, something traditional models simply cannot achieve. Their multi-head architectures allow the models to learn shared language representations while simultaneously focusing on aspect-specific sentiment cues. QARiB goes a step further by incorporating explicit dialectal coverage and attention mechanisms, making it especially effective for handling the informal, colloquial language found in user-generated tourism reviews.
Beyond performance metrics, the proposed ABSA framework offers tangible value for the Saudi tourism sector. When integrated into analytical dashboards, AraBERT and QARiB can provide real-time insights into how visitors perceive specific aspects of their experiences, such as cleanliness, pricing, and service quality. This enables decision-makers to move from anecdotal feedback to actionable data. For instance, if sentiment around cleanliness in a particular city drops, managers can immediately identify the issue and direct resources to address it. The framework is flexible enough to be retrained for new cities, attraction types, or even adjacent sectors such as hospitality or events. Over time, it can support longitudinal monitoring, helping authorities evaluate how changes in policy or service delivery affect public sentiment and satisfaction.
The insights drawn from the error analysis offer meaningful lessons for tourism authorities seeking to better understand how visitors express their experiences. The misclassifications observed in aspects such as overall experience and price reveal that tourists often blend emotions or use indirect expressions when sharing opinions, particularly across different Arabic dialects. Recognizing these subtle linguistic and contextual nuances can help organizations interpret feedback more accurately and design surveys that truly capture visitors’ intentions. When sentiment insights are aligned with real-world service observations, decision-makers can not only identify which aspects require improvement but also understand the underlying causes behind misinterpreted praise or criticism, enabling more empathetic and targeted service enhancements.
Moreover, error analysis serves as a key contribution, offering practical insights into how contextual models handle dialectal Arabic and mixed sentiment, an essential step toward more robust and inclusive ABSA systems.

7. Conclusions and Future Work

This study addressed a critical gap in tourism analytics by developing and evaluating advanced models for aspect-based sentiment analysis (ABSA) on Arabic Google Maps reviews of tourist attractions in Saudi Arabia. The findings demonstrate that transformer-based architectures, particularly AraBERT and QARiB, achieve superior accuracy in capturing visitors’ opinions across key aspects such as cleanliness, price, and facilities. The publicly released, manually annotated dataset developed for this research offers a valuable foundation for future studies in Arabic ABSA and tourism-related NLP applications.
Future work will extend these models to better accommodate the linguistic diversity of Arabic dialects and to amplify underrepresented voices within the tourism discourse. Further exploration of few-shot and zero-shot learning approaches will enhance model adaptability to new attraction types and evolving review trends. Integrating these models into real-time monitoring systems can enable the timely identification of operational strengths and weaknesses, allowing for evidence-based service optimization.
A primary limitation of this study lies in the dialectal imbalance within the dataset: certain regional dialects (e.g., Najdi and Hijazi) are more prevalent than others, which may constrain model generalizability across less represented linguistic varieties. Addressing this limitation will be crucial for achieving broader dialectal inclusivity and performance stability.
Beyond its applied contributions, this research establishes a methodological foundation for multi-aspect sentiment modeling in underrepresented languages, contributing to the advancement of multilingual NLP and supporting the broader objectives of Saudi Vision 2030 for a more responsive, data-driven, and globally competitive tourism sector.

Author Contributions

Conceptualization, S.Z., A.H.A. and H.S.; Methodology, S.Z.; Software, S.Z.; Validation, S.Z.; Formal analysis, S.Z.; Investigation, S.Z., A.H.A. and H.S.; Data curation, S.Z.; Writing—original draft preparation, S.Z.; Writing—review and editing, A.H.A. and H.S.; Visualization, S.Z.; Supervision, A.H.A. and H.S. All authors have read and agreed to the published version of the manuscript.

Funding

This project was funded by the KAU Endowment (WAQF) under King Abdulaziz University, Jeddah, Saudi Arabia. The authors gratefully acknowledge the KAU Endowment (WAQF) and the Deanship of Scientific Research (DSR) for their technical and financial support.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The dataset generated and analyzed during the current study is publicly available on Zenodo at the following DOI: https://doi.org/10.5281/zenodo.17011574 (accessed on 20 October 2025).

Acknowledgments

During the preparation of this manuscript, the authors used ChatGPT (OpenAI, GPT-5, 2025), Grammarly, and QuillBot for proofreading, grammar correction, and language refinement. The authors have reviewed and edited all generated content and take full responsibility for the final version of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
ABSA: Aspect-Based Sentiment Analysis
NLP: Natural Language Processing
MSA: Modern Standard Arabic
DA: Dialectal Arabic
ROC-AUC: Receiver Operating Characteristic Area Under the Curve
F1: F1-Score (Harmonic Mean of Precision and Recall)
TF-IDF: Term Frequency–Inverse Document Frequency
BoW: Bag of Words
SVM: Support Vector Machine
RF: Random Forest
CNN: Convolutional Neural Network
RNN: Recurrent Neural Network
LSTM: Long Short-Term Memory
BiLSTM: Bidirectional Long Short-Term Memory
GRU: Gated Recurrent Unit
CRF: Conditional Random Field
BERT: Bidirectional Encoder Representations from Transformers
AraBERT: Arabic BERT
MARBERT: Multilingual Arabic BERT (Twitter-based)
QARiB: Qatari Arabic BERT (Dialect-focused)
MUSE: Multilingual Universal Sentence Encoder
LLM: Large Language Model

References

  1. Mirzaalian, F.; Halpenny, E. Social media analytics in hospitality and tourism: A systematic literature review and future trends. J. Hosp. Tour. Technol. 2019, 10, 764–790.
  2. Li, S.; Li, Y.; Liu, C.; Fan, N. How do different types of user-generated content attract travelers? Taking story and review on Airbnb as the example. J. Travel Res. 2023, 63, 371–387.
  3. Schröter Freitas, F. The Impact of Google Maps’ Reviews and Algorithms on Young Adults’ Choices of Museums to Visit in Prague; Univerzita Karlova, Filozofická Fakulta: Prague, Czechia, 2023.
  4. Putri, T.A.; Hamami, F.; Alam, E.N. Aspect-based sentiment analysis on natural tourism in West Bandung using multinomial logistic regression algorithm. In Proceedings of the 2023 1st International Conference on Advanced Informatics and Intelligent Information Systems (ICAI3S); Atlantis Press: Dordrecht, The Netherlands, 2024; pp. 116–127.
  5. Oueslati, O.; Cambria, E.; HajHmida, M.B.; Ounelli, H. A review of sentiment analysis research in Arabic language. Future Gener. Comput. Syst. 2020, 112, 408–430.
  6. Abo, M.E.M.; Raj, R.G.; Qazi, A. A review on Arabic sentiment analysis: State-of-the-art, taxonomy and open research challenges. IEEE Access 2019, 7, 162008–162024.
  7. Alqurashi, T. Arabic sentiment analysis for twitter data: A systematic literature review. Eng. Technol. Appl. Sci. Res. 2023, 13, 10292–10300.
  8. Abd-Elshafy, M.F.; Aly, T.; Gheith, M. Analyse the enhancement of sentiment analysis in Arabic by doing a comparative study of several machine learning techniques. Int. J. Res. Appl. Sci. Eng. Technol. 2024, 12, 2007–2027.
  9. Liu, B. Sentiment Analysis and Opinion Mining; Springer Nature: Cham, Switzerland, 2022.
  10. Lee, L.-H.; Li, J.-H.; Yu, L.-C. Chinese EmoBank: Building valence–arousal resources for dimensional sentiment analysis. ACM Trans. Asian Low-Resour. Lang. Inf. Process. 2022, 21, 1–18.
  11. Lee, L.-H.; Yu, L.-C.; Wang, S.; Liao, J. Overview of the SIGHAN 2024 shared task for Chinese dimensional aspect-based sentiment analysis. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10), Bangkok, Thailand, 11–16 August 2024; pp. 165–174.
  12. Wankhade, M.; Rao, A.C.S.; Kulkarni, C. A survey on sentiment analysis methods, applications, and challenges. Artif. Intell. Rev. 2022, 55, 5731–5780.
  13. Habimana, O.; Li, Y.; Li, R.; Gu, X.; Yu, G. Sentiment analysis using deep learning approaches: An overview. Sci. China Inf. Sci. 2020, 63, 111102.
  14. Alslaity, A.; Orji, R. Machine learning techniques for emotion detection and sentiment analysis: Current state, challenges, and future directions. Behav. Inf. Technol. 2024, 43, 139–164.
  15. Dang, N.C.; Moreno-García, M.N.; De la Prieta, F. Sentiment analysis based on deep learning: A comparative study. Electronics 2020, 9, 483.
  16. Nandwani, P.; Verma, R. A review on sentiment analysis and emotion detection from text. Soc. Netw. Anal. Min. 2021, 11, 81.
  17. Mehta, P.; Pandya, S. A review on sentiment analysis methodologies, practices and applications. Int. J. Sci. Technol. Res. 2020, 9, 601–609.
  18. Alshaikh, K.A.; Almatrafi, O.A.; Abushark, Y.B. BERT-based model for aspect-based sentiment analysis for analyzing Arabic open-ended survey responses: A case study. IEEE Access 2023, 12, 2288–2302.
  19. Abdelgwad, M.M.; Azmi, A.M. Arabic aspect sentiment polarity classification using BERT. Int. J. Adv. Comput. Sci. Appl. 2021, 12, 788–796.
  20. Gupta, S.; Ranjan, R.; Singh, S.N. Comprehensive study on sentiment analysis: From rule-based to modern LLM-based systems. arXiv 2024, arXiv:2409.09989.
  21. AlNasser, A.; AlMuhaideb, A. Listening to patients: Advanced Arabic aspect-based sentiment analysis using transformer models towards better healthcare. Information 2024, 8, 156.
  22. Shafiq, M.; Anwar, U.; Adeel, A. Enhancing Arabic aspect-based sentiment analysis using end-to-end model. Future Internet 2023, 15, 342.
  23. Mohammad, A.-S.; Hammad, M.M.; Sa’ad, A.; AL-Tawalbeh, S.; Cambria, E. Gated recurrent unit with multilingual universal sentence encoder for Arabic aspect-based sentiment analysis. Knowl.-Based Syst. 2023, 261, 107540.
  24. AlShammari, N.; AlMansour, A. Aspect-based sentiment analysis and location detection for Arabic language tweets. Appl. Comput. Syst. 2022, 27, 119–127.
  25. Almasri, M.; Al-Malki, N.; Alotaibi, R. A semi-supervised approach to Arabic aspect category detection using BERT and teacher-student model. PeerJ Comput. Sci. 2023, 9, e1425.
  26. Khabour, S.M.; Al-Radaideh, Q.A.; Mustafa, D. A new ontology-based method for Arabic sentiment analysis. Big Data Cogn. Comput. 2022, 6, 48.
  27. Af’idah, D.I.; Anggraeni, P.D.; Rizki, M.; Setiawan, A.B.; Handayani, S.F. Aspect-based sentiment analysis for Indonesian tourist attraction reviews using bidirectional long short-term memory. JUITA J. Inform. 2023, 11, 27–36.
  28. Nadeem, A.; Missen, M.M.S.; Al Reshan, M.S.; Memon, M.A.; Asiri, Y.; Nizamani, M.A.; Alsulami, M.; Shaikh, A. Resolving ambiguity in natural language for enhancement of aspect-based sentiment analysis of hotel reviews. PeerJ Comput. Sci. 2025, 11, e2635.
  29. Viñán-Ludeña, M.S.; De Campos, L. Evaluating tourist dissatisfaction with aspect-based sentiment analysis using social media data. Adv. Hosp. Tour. Res. 2024, 12, 254–286.
  30. Xu, C.; Wang, M.; Ren, Y.; Zhu, S. Enhancing aspect-based sentiment analysis in tourism using large language models and positional information. arXiv 2024, arXiv:2409.14997.
  31. Jeong, N.; Lee, J. An aspect-based review analysis using ChatGPT for the exploration of hotel service failures. Sustainability 2024, 16, 1640.
  32. Chang, Y.-C.; Ku, C.-H.; Le Nguyen, D.-D. Predicting aspect-based sentiment using deep learning and information visualization: The impact of COVID-19 on the airline industry. Inf. Manag. 2022, 59, 103587.
  33. Li, H.; Bruce, X.B.; Li, G.; Gao, H. Restaurant survival prediction using customer-generated content: An aspect-based sentiment analysis of online reviews. Tourism Manag. 2023, 96, 104707.
  34. Özen, İ.A.; Katlav, E.Ö. Aspect-based sentiment analysis on online customer reviews: A case study of technology-supported hotels. J. Hosp. Tour. Technol. 2023, 14, 102–120.
  35. Yulianti, E.; Nissa, N.K. ABSA of Indonesian customer reviews using IndoBERT: Single-sentence and sentence-pair classification approaches. Bull. Electr. Eng. Inform. 2024, 13, 3579–3589.
  36. Phuangsuwan, P.; Siripipatthanakul, S.; Limna, P.; Pariwongkhuntorn, N. The impact of Google Maps application on the digital economy. Corp. Bus. Strategy Rev. 2024, 5, 192–203.
  37. Sahagun, M.A.; Flores, J.; Jocson, J. Utilizing Google Map reviews and sentiment analysis: Knowing customer experience in coffee shops. Quest J. Multidiscip. Res. Dev. 2022, 1, 29.
  38. Alaydaa, M.S.M.; Li, J.; Jinkins, K. Aspect-based sentimental analysis for travellers’ reviews. arXiv 2023, arXiv:2308.02548.
  39. Arianto, D.; Budi, I. Aspect-based sentiment analysis on Indonesia’s tourism destinations based on Google Maps user code-mixed reviews. In Proceedings of the 34th Pacific Asia Conference on Language, Information and Computation, Hanoi, Vietnam, 24–26 October 2020; pp. 359–367.
  40. AlMasaud, A.; Al-Baity, H.H. AraMAMS: Arabic multi-aspect, multi-sentiment restaurants reviews corpus for aspect-based sentiment analysis. Sustainability 2023, 15, 12268.
  41. Alharbi, B.A.; Mezher, M.A.; Barakeh, A.M. Tourist reviews sentiment classification using deep learning techniques: A case study in Saudi Arabia. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 717–726.
  42. Shin, B.; Ryu, S.; Kim, Y.; Kim, D. Analysis on review data of restaurants in Google Maps through text mining: Focusing on sentiment analysis. J. Multimed. Inf. Syst. 2022, 9, 61–68.
  43. Alanazi, M.S.M.; Li, J.; Jenkins, K.W. Multiclass sentiment prediction of airport service online reviews using aspect-based sentimental analysis and machine learning. Mathematics 2024, 12, 781.
  44. Web Robots. Instant Data Scraper. Available online: https://chrome.google.com/webstore/detail/instant-data-scraper/ofaokhiedipichpaojbibbnahnkdoiiah (accessed on 4 July 2025).
  45. Fadaee, M.; Bisazza, A.; Monz, C. Data augmentation for low-resource neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 567–573.
  46. Wei, J.; Zou, K. EDA: Easy data augmentation techniques for boosting performance on text classification tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Hong Kong, China, 3–7 November 2019; pp. 6383–6389.
  47. Antoun, W.; Baly, F.; Hajj, H. AraELECTRA: Pre-training text discriminators for Arabic language understanding. arXiv 2021, arXiv:2104.07704.
  48. El, K.M.; Antoun, W.; Baly, F.; Hajj, H.; Bouamor, H.; Glass, J. Arabic-BERT: Pre-training Arabic transformers for natural language understanding. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Virtual, 7–11 November 2021.
  49. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. In Proceedings of the International Conference on Learning Representations (ICLR), San Diego, CA, USA, 7–9 May 2015.
  50. Miyato, T.; Dai, A.M.; Goodfellow, I. Adversarial training methods for semi-supervised text classification. In Proceedings of the International Conference on Learning Representations (ICLR), Toulon, France, 24–26 April 2017.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
