A Security-Aware Ambient Intelligence Framework for Detecting Violent Language in Airline Customer Reviews

Alanazi, Fahad; Rabie, Osama

doi:10.3390/fi18050224


by Fahad Alanazi ¹,* and Osama Rabie ¹,²

¹ Department of Information Systems, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah 21589, Saudi Arabia
² Center of Research Excellence for Smart Environment, King Abdulaziz University, Jeddah 21589, Saudi Arabia
* Author to whom correspondence should be addressed.

Future Internet 2026, 18(5), 224; https://doi.org/10.3390/fi18050224

Submission received: 8 March 2026 / Revised: 10 April 2026 / Accepted: 17 April 2026 / Published: 22 April 2026


Abstract

The aviation industry operates in a security-sensitive environment where customer feedback may contain not only expressions of satisfaction or dissatisfaction but also threatening or violent language with potential security implications. While conventional sentiment analysis effectively captures customer opinions, it remains insufficient for identifying security-relevant linguistic cues that could signal risks requiring proactive intervention. This study addresses this gap by introducing a security-aware ambient intelligence framework for detecting violent language in airline customer reviews. This framework supports intelligent internet-based monitoring systems and real-time threat detection. We present the first annotated dataset of airline reviews specifically labeled for violent and threatening content, derived from 3629 reviews and balanced through manual resampling to achieve equal representation across positive, neutral, negative, and violent classes. The proposed framework employs VADER-based sentiment analysis for initial polarity estimation, combined with a validated annotation process to identify violent or threat-related content, followed by comprehensive feature engineering combining TF-IDF (2000 features) with text statistics and sentiment scores. We systematically evaluate individual classifiers (Random Forest, Decision Tree, SVM, Naive Bayes) against ensemble methods (Voting, Stacking, Boosting) using accuracy, precision, recall, F1-score, and ROC AUC metrics. Results demonstrate that Stacking achieves the highest raw performance (98.57% accuracy, F1-macro 0.9856), while Naive Bayes offers an optimal balance between effectiveness and computational efficiency (81.79% accuracy, F1-macro 0.8172, training time 0.03 s). This is the first dataset and framework designed for security-aware analysis of airline reviews. 
The selected Naive Bayes model achieves per-class F1-scores of 0.9978 for neutral, 0.7814 for negative, 0.7482 for violent, and 0.7415 for positive reviews, with a macro-average ROC AUC of 0.7123. The framework is deployed with serialized components enabling real-time prediction, supporting both single-review analysis and batch processing for integration into airline security monitoring systems. This work establishes a foundation for security-aware natural language processing in critical infrastructure contexts, bridging the gap between conventional sentiment analysis and proactive threat detection.

Keywords: airline customer reviews; ambient intelligence; decision support systems; text mining; machine learning; internet applications


1. Introduction

Scientific progress is closely linked to the development of analytical tools that extend human perceptual limits when investigating complex and dynamic phenomena [1]. In this context, customer satisfaction describes how satisfied customers are with the product or service they received compared to their expectations. One of the most crucial aspects of identifying the performance and growth of an organization is the satisfaction of customers. Customer feedback has a significant impact on the development of new products and services [2]. Customers are satisfied in the airline industry when they receive what they expect from the airline during their entire journey. Attaining customer satisfaction in this sector of economic growth may have positive effects on how consumers perceive, trust, and relate to the airline, as well as encourage them to suggest the airline to others [3]. Many research findings indicate that customer satisfaction is a significant factor in driving consumers’ behavioral loyalty, which manifests into positive reviews, repeat business, or word-of-mouth recommendations for a product or service [4].

The Middle East serves a diverse range of passengers, including business travelers, tourists, pilgrims, and students, and each group has distinct needs and expectations of the airline sector. For example, business travelers want efficient facilities and services, pilgrims seek safe, cost-effective transportation to holy sites, and students frequently face financial difficulties and may require support. Ensuring passenger satisfaction is therefore challenging in the aviation sector. Airlines compete to attract more passengers and boost their profits by offering excellent services at reasonable prices.

Thus, airlines must regularly evaluate customer satisfaction across the travel experience, including the plane’s comfort and cleanliness, the effectiveness and behavior of the cabin crew, the quality of in-flight meals and entertainment, the punctuality and dependability of flights, and the ease with which reservations can be made at the airport [5]. By understanding which factors positively influence passenger satisfaction, airlines can develop focused plans to improve service quality and encourage customer loyalty. The airline sector needs to address these issues with innovative, technology-driven approaches to ensure customer satisfaction [6]. However, accomplishing these goals is not always easy, since passengers differ in their travel preferences.

Airlines operate in a high-risk environment, where customer threats (real or perceived) could signal impending violence or policy violations. While traditional sentiment analysis captures dissatisfaction, it often misses threats or aggressive intent. In this study, violent language is treated as a security-relevant intent that may occur within negative sentiment but is analytically distinct from sentiment polarity. Detecting violent or threatening language in airline reviews serves a dual role: customer experience management and aviation security [7]. The authors in [8] emphasize the importance of monitoring user-generated content for risks in critical infrastructure contexts. Likewise, ref. [9] introduces a proactive text-based threat detection approach for early warning systems and analyzes how intelligent systems improve operational resilience in security-sensitive settings. Early warning indicators of violence or hostility in publicly accessible comments can be used to assess threats and enable airlines to take preventative action.

Numerous studies, e.g., [10,11,12], have proposed various machine learning (ML) classifiers, such as SVM, Naive Bayes (NB), and deep learning models, to automate text classification and reduce human effort in opinion analysis and toxic content detection. However, detecting violent content in customer reviews remains an under-explored area in text data, as much of the violence detection research has focused on video or multimedia data rather than textual content. While some work has begun to address violence-inciting text classification with specific datasets and models, the availability of large diverse violent text datasets is limited, which contributes to a research gap in this domain.

The existing analytical approaches to airline reviews rely predominantly on sentiment analysis, which classifies opinions as positive, negative, or neutral. However, sentiment polarity alone is insufficient for identifying security-relevant linguistic cues. Violent or threatening language represents a distinct form of intent that may coexist with negative sentiment but is not reducible to it. A review may express dissatisfaction without posing any risk, while another may contain implicit or explicit threats that require human review and intervention despite moderate sentiment scores. This conceptual gap limits the effectiveness of conventional sentiment-based monitoring systems in safety-sensitive contexts.

1.1. Motivation

Despite extensive research on sentiment analysis, toxic speech, and hate speech detection, textual violence detection in customer review contexts, particularly within the aviation domain, remains underexplored. Much of the existing violence detection literature focuses on video or multimedia data, while text-based approaches are often developed for social media platforms rather than operational environments such as airlines. Furthermore, the scarcity of labeled violent textual data in aviation settings has hindered systematic investigation. These limitations motivate the need for a security-aware analytical framework capable of detecting violent language in airline customer reviews while supporting human-centered decision-making.

1.2. Contributions

The main contributions of this study are summarized as follows:

Presents a first-of-its-kind airline review dataset labeled for violent and threatening language.
Reframes airline review analysis as a security-focused proactive risk monitoring task.
Introduces an ensemble machine learning framework that improves violent language detection.
Empirically evaluates ensemble methods for handling imbalanced and diverse review data.

1.3. Paper Structure

Following this introduction, the remainder of the paper is organized as follows. Section 2 reviews previous studies on the topic and establishes the basis for this work. Section 3 describes the materials and methods used in the proposed framework. Section 4 presents the main results together with an in-depth evaluation. Section 6 concludes the study by summarizing the findings, making recommendations, and suggesting directions for further research.

2. Literature Review

As mentioned above, satisfying customers requires an evaluation of a specific service experience that results in increasing business, client loyalty, and, eventually, revenue [13]. Online platforms allow users to express their thoughts, data, opinions, and expertise about products, services, and businesses, and online assessments that represent the way people express and discuss their opinions in various forms are invaluable tools for learning how consumers feel [14]. Users want diverse sorts of knowledge to feel confident in their decisions, which reduces risk perception [15].

Recent studies on customer review analysis can be categorized in two forms: (a) video-based analysis [16,17,18,19] and (b) text-based analysis. This section focuses on text-based analysis as most relevant to the goal of this manuscript.

Ref. [20] employed machine learning algorithms to predict whether customers would reuse an airline service. The authors collected feedback comments and customer satisfaction levels, which were then transformed into sentiment analysis features. The study reported an accuracy of 83.42% in predicting customer revisit intentions using seven different classifiers. Although the study employed innovative models, it did not clearly explain how these models compared with other approaches. Furthermore, the study did not identify the key factors influencing customer revisit behavior or provide recommendations for improving customer satisfaction and loyalty.

In [21], a deep learning framework called CRNet was proposed to predict customer revisit intentions. To evaluate the proposed framework, the study employed two state-of-the-art multimodal deep learning models for binary classification tasks and used two datasets related to customer revisit behavior. The approach involved constructing separate text and image modules and then combining them to determine whether a customer would return. The experimental results demonstrated that the CRNet model achieved significantly higher performance compared with the text–image fusion model and MVAN, reaching an accuracy of 95.75% and an F1-score of 0.9730.

In [22], the authors addressed the challenge of analyzing large volumes of unstructured customer feedback collected from social media platforms such as Instagram, Twitter, and Facebook. It is difficult to derive clear sentiment insights from such unstructured feedback. The study proposed the use of machine learning techniques to analyze unstructured customer feedback, with a particular focus on the airline industry. The paper also provides an overview of several machine learning approaches used over the past eight years to analyze tweets and comments related to aviation services.

The effectiveness of deep learning (DL) and machine learning (ML) approaches was compared in [23] to evaluate their performance in sentiment analysis tasks. The study used two datasets: “Sentiment140” and the “Twitter US Airline Sentiment” dataset. Prior to analysis, the datasets were preprocessed to remove noise such as stop words, URLs, hashtags, and punctuation, followed by tokenization to improve analytical performance. The comparison included five models: Multinomial Naive Bayes (NB), Logistic Regression (LR), Support Vector Machine (SVM), Long Short-Term Memory (LSTM), and an ensemble model combining NB, LR, and SVM using majority voting. The results showed that LSTM achieved the highest accuracy of 82% on the Sentiment140 dataset.

On the Twitter US Airline Sentiment dataset, the SVM model achieved the highest accuracy of 68.9%. In another experiment, several classification algorithms were tested to evaluate sentiment in product review tweets. The study applied methods such as Latent Dirichlet Allocation (LDA), Random Forest (RF), k-Nearest Neighbors (KNN), Classification and Regression Trees (CART), Support Vector Machine (SVM), and C5.0 to analyze 1150 tweets. The dataset was preprocessed through case folding, stemming, and the removal of punctuation and stop words. The cleaned text was converted into a term–document matrix, which served as input for the classification algorithms. Among the evaluated models, the CART algorithm achieved the best accuracy of 88.99%.

In [24], seven traditional machine learning algorithms, including Decision Tree (DT), Random Forest (RF), Support Vector Machine (SVM), k-Nearest Neighbors (KNN), Logistic Regression (LR), Gaussian Naive Bayes (GNB), and AdaBoost, were used to perform sentiment analysis on a dataset collected from Indian Airlines. The dataset was preprocessed by removing stop words and applying lemmatization. The experimental results indicated that the AdaBoost model achieved the best performance with an accuracy of 84.5%.

Current Research Gaps

Based on the review of the relevant literature, the following gaps have been identified:

Most of the violence detection methodologies focus on physical violence; however, violent and threatening textual reviews may be as harmful as physical violence and may lead to psychological complications.
Airline services are a critical domain, and harmful actions by passengers may have destructive consequences; however, none of the airline service studies reviewed here addresses threatening tone. Consequently, no datasets are available for reviews with a threatening tone.
The detection of the threatening location is crucial in preventing undesired actions; however, none of the customer review methodologies has addressed this during their studies.

Table 1 presents a summary of 10 recent relevant studies. To sum up, this paper aims to review a variety of ML classifiers on a dataset from Skytrax to create an efficient ensemble learning model for sentiment analysis of airline passenger evaluations and violence detection [25]. The attitudes in the dataset were categorized as neutral, negative, positive and violent. Three ensemble approaches (Voting, Stacking, and Boosting Classifier) and three traditional ML techniques (SVC, DT, and GNB) were explored. To determine the best model for sentiment categorization in this context, each method’s performance was compared and assessed using rigorous evaluation parameters.

3. The Proposed Framework

This section presents the methodology developed for an in-depth study of the factors that influence airline passenger evaluations of top airlines in the Middle East. Airline passenger review data were gathered from various online platforms.

Figure 1 outlines the sequential phases of the proposed framework. Rigorous preprocessing steps, including cleaning, normalization, and tokenization, were carried out to ensure data quality and consistency. To extract meaningful features from the textual data, techniques such as sentiment analysis and topic modeling were employed. These features were carefully selected to contribute significantly to sentiment analysis. A combination of ensemble methods and hyperparameter tuning was utilized to enhance the effectiveness and precision of the analysis. The Support Vector Machine (SVM) model was selected due to its robust performance in text classification tasks and its ability to handle high-dimensional feature spaces.

To ensure the generalizability of the developed model, a rigorous feature selection process and cross-validation were employed. The dataset used in this study comprises a diverse collection of airline passenger reviews, encompassing various airlines, routes, and travel periods. This diversity helps to improve the model’s ability to generalize new unseen data. To enhance the detection of violent content or the inciting of violence due to dissatisfaction with the service provided in airline reviews, a violence detection component was incorporated into the framework. Data augmentation techniques were employed to expand the training dataset and improve model robustness. Additionally, ensemble methods were utilized to combine the strengths of multiple models and enhance the overall performance.

The sentiment analysis methodology is presented in Algorithm 1, which follows a structured process to classify passenger reviews as positive, negative, or neutral and to flag strongly negative reviews as candidates for the violent class, since a negative review may indicate violence.

Algorithm 1 Sentiment–Violence Classification Pipeline.

1: Load reviews dataset
2: for each review do
3:     s ← VADER compound score
4:     if s > 0.05 then label ← positive
5:     else if s < −0.5 then label ← candidate_violent
6:     else if s < −0.05 then label ← negative
7:     else label ← neutral
8:     end if
9:     clean(text) ← lowercase + remove noise + lemmatize
10: end for
11: Manual annotation → assign final labels (violent vs. negative)
12: Balance class distribution → newbalanceddata.csv
13: Split: train (80%), test (20%)
14: Features = TF-IDF (2000) + length + VADER scores
15: Train: Logistic Regression, RF, SVM, NB
16: Train: Soft/Hard Voting Ensemble
17: Evaluate: Accuracy, P, R, F1 (per-class + averaged)
18: Create comparison tables and visualizations
19: Save best model + vectorizer(s)
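The thresholding in steps 4 to 7 of Algorithm 1 can be sketched as a small function. This is a minimal illustration rather than the authors' code: the function name `label_from_compound` is hypothetical, and the compound score is assumed to have been computed already by a VADER implementation.

```python
def label_from_compound(score: float) -> str:
    """Map a VADER compound score to a provisional label (Algorithm 1, steps 4-7)."""
    if score > 0.05:
        return "positive"
    if score < -0.5:
        # Only a candidate: the final violent/negative label comes from manual annotation.
        return "candidate_violent"
    if score < -0.05:
        return "negative"
    return "neutral"


# Example scores (made up for illustration).
provisional = {s: label_from_compound(s) for s in (0.62, 0.0, -0.2, -0.8)}
```

Note that the −0.5 check has to come before the −0.05 check; otherwise every strongly negative review would be labeled negative and never reach the annotators as a candidate.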

The remainder of this section discusses the phases in detail.

3.1. Dataset

This study presents the first annotated dataset of violent airline reviews, opening new avenues in textual threat detection. The data used in this article were derived from customer reviews that were collected from the “airlinequality.com” website available at Kaggle.

The dataset includes 22 features with 131,895 customer reviews. The dataset includes both numerical ratings that represent consumers’ evaluations of the many services these airlines provide and texts that come from customer reviews of airline services. The review dates ranged from 31 days to 5 years (2015 to 2019); however, the trip dates covered a longer period, showing that evaluations were filed up to 7 years after the travel date.

The original dataset contained 131,895 reviews. After removing missing and NaN values, it was reduced to 18% (24,563 rows). Following the exclusion of duplicates, the dataset further decreased to 17% (22,822 rows). Filtering to include only Middle Eastern airlines resulted in 4012 reviews. Four countries were then selected for dataset balancing (as shown in Figure 2), bringing the final number of reviews to 3629. Class imbalance was addressed through resampling and controlled synthetic data generation (Section 3.2). While these preprocessing steps improve model training, they may introduce sampling bias, which should be considered when interpreting the results.

3.2. Synthetic Data Generation and Validation

Due to the scarcity of explicitly violent expressions in publicly available airline customer reviews, synthetic data generation was employed to mitigate the class imbalance. Synthetic samples were generated using a large language model under carefully controlled prompting strategies designed to produce realistic airline-related complaint scenarios, while explicitly avoiding exaggerated, implausible, or overly extreme language that could distort the distribution of real-world data.

To ensure strict separation between training and evaluation data, the dataset was first partitioned into training (80%) and testing (20%) subsets using only authentic reviews. Synthetic data generation was then applied exclusively to the training subset, and no synthetic samples were included in the test set under any circumstances. This design guarantees that all evaluation results reflect model performance on real-world data and prevents any form of data leakage.
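The ordering described above (split first, augment afterwards) can be illustrated with a short sketch. It assumes reviews are stored as dictionaries carrying a hypothetical `synthetic` flag, and the helper name `split_then_augment` is likewise illustrative; the 80/20 ratio and the seed value 42 are taken from the paper.

```python
import random


def split_then_augment(real_reviews, synthetic_reviews, test_fraction=0.2, seed=42):
    """Split authentic reviews first, then add synthetic samples to the training part only."""
    rng = random.Random(seed)
    shuffled = list(real_reviews)
    rng.shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    test = shuffled[:n_test]                              # authentic reviews only
    train = shuffled[n_test:] + list(synthetic_reviews)   # synthetic data joins training only
    return train, test


# Toy data standing in for real and generated reviews.
real = [{"text": f"real review {i}", "synthetic": False} for i in range(10)]
generated = [{"text": f"generated review {i}", "synthetic": True} for i in range(4)]
train, test = split_then_augment(real, generated)
```

Because the synthetic samples are appended only after the test subset has been carved out, no generated text can reach the evaluation data.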

To further minimize the risk of introducing artificial linguistic artifacts, multiple quality control procedures were implemented. All generated samples underwent manual validation to ensure linguistic naturalness, contextual relevance, and consistency with genuine user-generated content. Reviews exhibiting repetitive patterns, unnatural phrasing, or ambiguous interpretations were systematically discarded. In addition, prompting strategies were diversified to reduce template-like structures and enhance variability in expression.

This controlled and validated augmentation process ensures that synthetic data enhances model robustness without inflating performance, thereby supporting a fair and reliable assessment of generalization to authentic violent content.

3.3. Exploratory Data Analysis and Preprocessing

After the dataset was compiled, further data processing was carried out in this phase, along with exploratory data analysis (EDA). The data include textual content and numerical ratings, including user-provided star ratings obtained via the website’s review system. For data cleaning, the textual data were separated from the other data types. The cleaning process standardizes the text and removes extraneous words and unnecessary information. Text preprocessing is a crucial stage in NLP that is frequently disregarded, although it has a considerable influence on the output quality. As shown in Figure 3, several crucial steps are involved in preparing textual material for analysis.

Text data were cleaned by converting all content to lowercase, removing punctuation and stop words, and lemmatizing terms using the NLTK library [30]. Tokenization was performed using nltk.word_tokenize, splitting reviews into individual terms. Duplicates were removed by checking exact matches and using fuzzy string matching where needed. These steps were essential for reducing noise and improving the quality of input for downstream classification tasks.
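A simplified, standard-library stand-in for these cleaning steps is sketched below. It is not the NLTK pipeline used in the paper: the stop-word set is a tiny illustrative list, whitespace splitting replaces `nltk.word_tokenize`, lemmatization and fuzzy matching are omitted, and the function names are hypothetical.

```python
import string

# Tiny illustrative stop-word list; the paper relies on NLTK resources instead.
STOP_WORDS = {"the", "a", "an", "was", "is", "and", "to", "of", "in", "it"}


def clean_review(text: str) -> list[str]:
    """Lowercase, strip punctuation, split on whitespace, and drop stop words."""
    lowered = text.lower()
    no_punct = lowered.translate(str.maketrans("", "", string.punctuation))
    return [token for token in no_punct.split() if token not in STOP_WORDS]


def drop_duplicates(reviews):
    """Keep the first review for each distinct cleaned token sequence."""
    seen = set()
    unique = []
    for review in reviews:
        key = " ".join(clean_review(review))
        if key not in seen:
            seen.add(key)
            unique.append(review)
    return unique


tokens = clean_review("The flight was DELAYED, and the crew was rude!")
deduplicated = drop_duplicates(["Great flight!", "great flight", "Bad food."])
```

Comparing reviews after normalization catches duplicates that differ only in case or punctuation, which a raw string comparison would miss.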

3.4. Polarity Calculation and Sentiment Analysis

To ensure conceptual clarity, the “violent” category is defined as language reflecting aggressive or threat-related intent, including explicit or implicit expressions of harm, hostility, or escalation beyond standard dissatisfaction. This definition distinguishes violent content from negative reviews, which are limited to expressions of dissatisfaction or criticism without any indication of aggression or threat. To further strengthen construct validity, the annotation protocol explicitly distinguishes between multiple dimensions of harmful language, including (i) explicit threats (e.g., statements indicating intent to cause harm), (ii) implicit threats or escalation (e.g., suggestions of retaliation or conflict escalation), (iii) abusive or hostile language directed toward entities, and (iv) general dissatisfaction without aggression. Only instances satisfying criteria (i)–(iii) were labeled as “violent,” while category (iv) was assigned to the negative class. This multi-dimensional distinction ensures that the violent category is grounded in intent and pragmatics rather than sentiment intensity alone, consistent with the validated annotation process described above.

Sentiment polarity in airline passenger reviews is first assessed using VADER, with compound scores mapped to positive (>0.05), negative (<−0.05), or neutral categories. However, since VADER captures sentiment polarity rather than intent-based constructs such as violence or threat, a compound score below −0.5 was used only as an initial heuristic to identify candidate reviews with extreme negativity.

These candidates were manually annotated by two independent domain-aware annotators. The annotation process followed explicit guidelines that clearly distinguished general dissatisfaction from aggressive or threatening language (e.g., expressions implying harm, hostility, or intent to escalate conflict). Inter-rater reliability was measured using Cohen’s Kappa coefficient (κ = 0.81), indicating substantial agreement. Discrepancies were resolved through discussion to ensure consistent labeling. The guidelines specifically focused on identifying violent content through threatening expressions, references to harm, or hostile directives, while distinguishing these from strongly negative but non-threatening opinions.
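For readers unfamiliar with the statistic, Cohen's Kappa compares the observed agreement between two annotators with the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it for two label lists; the toy annotations are invented for illustration and are unrelated to the study's data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n   # p_o
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)   # p_e
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy annotations: the annotators disagree on one of four reviews.
annotator_1 = ["violent", "violent", "negative", "negative"]
annotator_2 = ["violent", "negative", "negative", "negative"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this toy case the observed agreement is 0.75 and the chance agreement is 0.5, which gives κ = 0.5.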

The final dataset comprises four classes: positive, neutral, negative, and violent. The “violent” category reflects validated threat-related or aggressive linguistic patterns rather than purely strong negative sentiment. Table 2 presents the statistical distribution of VADER compound scores. Review texts serve as input, with VADER scores providing a baseline representation and the refined labeling ensuring a clear distinction between sentiment polarity and security-relevant language.

The imbalance ratio between the majority class (positive) and minority class (neutral) was approximately 39:1, necessitating data balancing to prevent model bias toward dominant classes.

Original class distribution: ‘positive’: 2265, ‘violent’: 1023, ‘negative’: 282, ‘neutral’: 58. Manual resampling was applied to achieve class balance, resulting in a balanced dataset of 9060 samples with equal representation (2265 samples per class). This approach ensured that all sentiment categories would receive equal consideration during model training, preventing the models from developing bias toward the originally dominant positive class. Figure 4 and Figure 5 show the class distribution visualization (before balancing).
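The arithmetic behind these figures can be reproduced with a few lines. The class counts come from the text; the resampling routine is an assumption, since the paper describes manual resampling without specifying the procedure, and random oversampling with replacement is used here purely as a stand-in.

```python
import random

# Class counts reported in the text.
original_counts = {"positive": 2265, "violent": 1023, "negative": 282, "neutral": 58}


def imbalance_ratio(counts):
    """Size of the largest class divided by the size of the smallest class."""
    return max(counts.values()) / min(counts.values())


def oversample_to_majority(samples_by_class, seed=42):
    """Assumed stand-in for the paper's resampling: draw extra samples with replacement."""
    rng = random.Random(seed)
    target = max(len(items) for items in samples_by_class.values())
    balanced = {}
    for label, items in samples_by_class.items():
        extra = [rng.choice(items) for _ in range(target - len(items))]
        balanced[label] = list(items) + extra
    return balanced


# Placeholder identifiers stand in for the actual review texts.
samples = {label: [f"{label}_{i}" for i in range(n)] for label, n in original_counts.items()}
balanced = oversample_to_majority(samples)
total_after_balancing = sum(len(items) for items in balanced.values())
```

With these counts the ratio is 2265 / 58 ≈ 39, and four classes of 2265 samples each give the 9060 samples stated above.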

3.5. Feature Engineering

While recent advances in NLP have demonstrated the effectiveness of word embeddings and transformer-based models (e.g., BERT) in capturing contextual semantics, this study prioritizes lightweight and interpretable representations to support real-time deployment in resource-constrained environments. TF-IDF features combined with sentiment scores provide a computationally efficient baseline that enables rapid inference and straightforward integration into existing monitoring systems. Nevertheless, the observed limitations in distinguishing between negative and violent language suggest that contextual embeddings could significantly enhance performance. Incorporating transformer-based representations is a promising direction for future work, particularly for capturing implicit threat cues and contextual nuances.

To provide deeper insights, the data augmentation method involves adding relevant date-related information to the dataset, including features such as timestamps for the review date (day, month, year, and epoch seconds), matching information for the date flown, and the estimated days’ difference between the review date and the date flown.
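The date-derived features listed above could be computed along the following lines. The feature names and the `date_features` helper are hypothetical, and treating the review date as midnight UTC for the epoch value is an assumption made only for this sketch.

```python
from datetime import date, datetime, timezone


def date_features(review_date: date, date_flown: date) -> dict:
    """Derive calendar, epoch, and lag features from the review and flight dates."""
    review_midnight = datetime(
        review_date.year, review_date.month, review_date.day, tzinfo=timezone.utc
    )
    return {
        "review_day": review_date.day,
        "review_month": review_date.month,
        "review_year": review_date.year,
        "review_epoch_seconds": int(review_midnight.timestamp()),
        # Days elapsed between the flight and the review.
        "days_since_flight": (review_date - date_flown).days,
    }


# Example dates chosen for illustration only.
features = date_features(date(2019, 3, 15), date(2019, 2, 1))
```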

After the text data from customer evaluations of airlines were gathered and prepared (tokenization, cleaning, lowercasing, and removal of stop words and unnecessary characters), the preprocessed text was transformed into numerical matrices using two text vectorization schemes: Count Vectorizer and Term Frequency–Inverse Document Frequency (TF-IDF).

In the Count Vectorizer, text reviews were converted to a matrix of token counts. As illustrated in Equation (1), the TF-IDF captures the importance of a word to a review in the dataset. The term importance increases proportionally to the number of times it appears in the document; but it is inversely proportional to the frequency of the term in the dataset.

$$\text{TF-IDF} = \frac{T_{d}}{C_{d}} \cdot \log\left(\frac{N}{D_{T}}\right) \tag{1}$$

where $T_{d}$ is the number of times the term appears in a document, $C_{d}$ is the total number of words in the document, $D_{T}$ is the number of documents in which the term appears, and $N$ is the total number of documents.
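A small worked example may help make Equation (1) concrete. The sketch below applies the formula literally to a three-review toy corpus; the natural logarithm is an assumption, since the paper does not state the base, and the corpus is invented. Library implementations such as scikit-learn's `TfidfVectorizer` use a smoothed, normalized variant by default, so their values will differ from this textbook form.

```python
import math


def tf_idf(term, document, corpus):
    """Equation (1): (T_d / C_d) * log(N / D_T) for a tokenized document."""
    term_count = document.count(term)                     # T_d
    doc_length = len(document)                            # C_d
    docs_with_term = sum(term in doc for doc in corpus)   # D_T
    if doc_length == 0 or docs_with_term == 0:
        return 0.0
    return (term_count / doc_length) * math.log(len(corpus) / docs_with_term)


# Toy corpus of already-tokenized reviews (N = 3).
corpus = [
    ["flight", "delayed", "crew", "rude"],
    ["crew", "friendly", "flight", "smooth"],
    ["lost", "luggage", "flight", "delayed"],
]
score_flight = tf_idf("flight", corpus[0], corpus)   # appears in every review
score_rude = tf_idf("rude", corpus[0], corpus)       # appears in one review only
```

The word "flight" occurs in all three reviews, so log(3/3) = 0 and it carries no weight, whereas "rude" occurs in a single review and scores 0.25 · ln 3 ≈ 0.275.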

3.6. Machine Learning Models

In this study, ensemble learning was used, which combines predictions from several base classifiers (for example, through a meta-classifier) to potentially improve performance beyond that of any individual classifier. Using several types of ensemble techniques, including Boosting, Voting, and Stacking, this approach leverages each technique’s unique advantages to increase prediction accuracy. The experiments used four of the most common classifiers: SVM, NB, RF, and Decision Tree (DT). SVM finds the hyperplane that maximizes the margin between classes, using the decision function $f(x)$ in Equation (2). It works well when the classes are clearly separable and is efficient in high-dimensional spaces.

$$f(x) = \operatorname{sign}(w \cdot x + b) \tag{2}$$

where $w$ is the weight vector, $x$ is the feature vector, and $b$ is the bias. The classifier tries to maximize the margin defined by the support vectors. Based on Bayes’ Theorem, the NB classifier assumes that features are conditionally independent given the class. Gaussian NB assumes that continuous features follow a normal distribution, as in Equation (3).

$$P(x_{i} \mid C_{k}) = \frac{1}{\sqrt{2 \pi \sigma_{k}^{2}}} \exp\left(-\frac{(x_{i} - \mu_{k})^{2}}{2 \sigma_{k}^{2}}\right) \tag{3}$$

where $\mu_{k}$ and $\sigma_{k}^{2}$ are the mean and variance of feature $x_{i}$ in class $C_{k}$. A DT is a tree-like model of decisions and their possible consequences. It recursively splits the dataset based on features to create branches that lead to decision leaves, using the Gini index in Equation (4). In Random Forest, DTs are trained on random subsets of the features and the data. Predictions are made by averaging (regression) or by majority voting (classification).

$$\text{Gini}(D) = 1 - \sum_{i = 1}^{c} p_{i}^{2} \tag{4}$$

where $p_{i}$ is the proportion of class $i$ in the dataset $D$ and $c$ is the number of classes. Python’s scikit-learn library was used to implement all ML models: SVC for SVMs, DT Classifier, Gaussian NB, and RF Classifier. The ensemble methods included

Voting Classifier: Combines multiple models and selects the most common prediction.
Stacking Classifier: Trains base classifiers (SVM, DT, RF), whose predictions are input to a Logistic Regression meta-model.
AdaBoost: Focuses training on incorrectly classified reviews by reweighting instances to improve accuracy iteratively.
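The difference between hard and soft voting, both of which appear in Algorithm 1, can be shown with a dependency-free sketch. The paper's models were built with scikit-learn; the two functions below are hypothetical illustrations of the voting rules only, and the probabilities are invented.

```python
from collections import Counter


def hard_vote(predictions):
    """Return the most common label among the base classifiers' predictions."""
    counts = Counter(predictions)
    top = max(counts.values())
    # Ties go to the label that appears first; an arbitrary choice for this sketch.
    for label in predictions:
        if counts[label] == top:
            return label


def soft_vote(probabilities):
    """Average the per-class probabilities and return the class with the highest mean."""
    classes = probabilities[0].keys()
    averaged = {c: sum(p[c] for p in probabilities) / len(probabilities) for c in classes}
    return max(averaged, key=averaged.get)


# Three base classifiers scoring one review (made-up probabilities).
model_outputs = [
    {"violent": 0.40, "negative": 0.60},
    {"violent": 0.90, "negative": 0.10},
    {"violent": 0.45, "negative": 0.55},
]
hard_result = hard_vote([max(p, key=p.get) for p in model_outputs])
soft_result = soft_vote(model_outputs)
```

Here two of the three models lean slightly toward "negative", so hard voting returns "negative", while the one confident "violent" prediction shifts the averaged probabilities and soft voting returns "violent". This is one reason the two schemes can disagree on borderline reviews.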

All model parameters were optimized using GridSearchCV.
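The ensemble setup described above can be sketched with scikit-learn as follows. This is a minimal illustration and not the authors' released code: the synthetic data from make_classification, the parameter grid, the tree depth used for AdaBoost, and the estimator counts are all assumptions chosen so the example runs quickly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

SEED = 42  # fixed seed, as reported in the evaluation setup


def base_estimators():
    """Return fresh (name, estimator) pairs for the SVM, DT and RF base models."""
    return [
        ("svm", SVC(kernel="linear", random_state=SEED)),
        ("dt", DecisionTreeClassifier(random_state=SEED)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=SEED)),
    ]


ensembles = {
    # Hard voting: the most common predicted label wins.
    "voting": VotingClassifier(estimators=base_estimators(), voting="hard"),
    # Stacking: base-model predictions feed a Logistic Regression meta-model.
    "stacking": StackingClassifier(
        estimators=base_estimators(),
        final_estimator=LogisticRegression(max_iter=1000),
    ),
    # AdaBoost over shallow trees (max_depth=2 is an assumed setting).
    "adaboost": AdaBoostClassifier(
        DecisionTreeClassifier(max_depth=2, random_state=SEED),
        n_estimators=50,
        random_state=SEED,
    ),
}

# Synthetic 4-class stand-in for the TF-IDF feature matrix (assumption).
X, y = make_classification(n_samples=200, n_features=20, n_informative=8,
                           n_classes=4, random_state=SEED)

# Hyperparameter search with GridSearchCV; the grid itself is hypothetical.
search = GridSearchCV(
    RandomForestClassifier(random_state=SEED),
    param_grid={"n_estimators": [25, 50], "max_depth": [None, 10]},
    scoring="f1_macro",
    cv=3,
)
search.fit(X, y)

fitted = {name: model.fit(X, y) for name, model in ensembles.items()}
```

After fitting, search.best_params_ holds the selected grid point, and each entry of fitted exposes the usual predict method.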

3.7. Performance Evaluation

Evaluation metrics are essential for assessing how well a model performs in text classification. Accuracy, or the ratio of correctly predicted texts to all texts, is the most basic metric. However, in unbalanced datasets, it might be misleading. While precision measures the proportion of true positive predictions among all positive predictions, recall (also referred to as sensitivity) measures the proportion of true positives discovered among all actual positives. The F1-score, which is the harmonic mean of precision and recall, provides a fair evaluation in cases where the classes are not balanced. These metrics can be averaged using methods such as weighted, macro, or micro averaging to give a more comprehensive performance overview for multi-class or multi-label classification tasks.
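As an illustration of how macro averaging works, the following sketch computes accuracy and macro-averaged precision, recall, and F1 from scratch. The six toy labels are invented for the example and are not drawn from the study's data.

```python
def macro_scores(y_true, y_pred):
    """Accuracy plus macro-averaged precision, recall and F1."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        # F1 is the harmonic mean of precision and recall.
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return {
        "accuracy": sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true),
        "precision_macro": sum(precisions) / n,  # unweighted mean over classes
        "recall_macro": sum(recalls) / n,
        "f1_macro": sum(f1s) / n,
    }


# Toy example (hypothetical labels): one negative/violent swap in each direction.
y_true = ["negative", "negative", "violent", "violent", "neutral", "positive"]
y_pred = ["negative", "violent", "violent", "negative", "neutral", "positive"]
scores = macro_scores(y_true, y_pred)
```

Because macro averaging gives every class the same weight, the two confused classes pull the macro F1 down regardless of how many samples each class contains.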

To ensure the reproducibility of the experimental results, we explicitly define the data partitioning strategy and implementation settings. The dataset was divided using a stratified train–test split (80/20) to preserve the original class distribution across both subsets. This approach mitigates the effects of class imbalance and ensures fair evaluation across models. In addition, all experiments were conducted using a fixed random seed (42), which was consistently applied during data splitting, model initialization, and training procedures involving stochastic processes. All models were trained and evaluated on the same data partitions to ensure comparability. Furthermore, the full experimental pipeline, including preprocessing, training, and evaluation scripts, is publicly available in the provided repository. Configuration files specifying hyperparameters and seed values are included to facilitate exact replication of results.
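The partitioning strategy can be sketched with scikit-learn's train_test_split. The placeholder review strings and the 400-sample size are assumptions for illustration; only the 80/20 ratio, the stratification, and the seed of 42 come from the text.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

SEED = 42  # fixed random seed reported in the text

# Hypothetical balanced corpus: 100 placeholder reviews per class.
classes = ["positive", "neutral", "negative", "violent"]
texts = [f"review {i}" for i in range(400)]
labels = [classes[i % 4] for i in range(400)]

# Stratified 80/20 split: each class keeps its share in both subsets.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.20, stratify=labels, random_state=SEED
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
```

Reusing the same seed and the same call for every model keeps all classifiers on identical partitions, which is what makes their scores comparable.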

4. Results and Discussion

A comprehensive performance comparison of the models is presented in Table 3 and Figure 6. The model achieved lower F1-scores for the “violent” class (0.748) and the “negative” class (0.781) compared to the “neutral” class. This can be attributed to the semantic proximity between these categories, as violent expressions frequently co-occur with strongly negative sentiment. Consequently, surface-level lexical features such as TF-IDF struggle to differentiate extreme dissatisfaction from genuine threat-related language.

The current feature space, which primarily captures word frequency and polarity signals, lacks deeper contextual and pragmatic understanding. Violent intent is often conveyed implicitly through sarcasm, metaphor, or indirect phrasing—nuances that bag-of-words representations fail to capture effectively. The relatively lower ROC AUC values compared to the accuracy are also explained by the multi-class setting and the overlap between the negative and violent classes, which impacts ranking-based metrics. These results underscore the limitations of traditional feature engineering approaches and highlight the need for more context-aware representations to better distinguish semantically overlapping classes.

Given the security-oriented nature of the framework, particular emphasis is placed on the performance of the “violent” class. The obtained F1-score of 0.748 reflects the inherent difficulty in separating threat-related expressions from strongly negative sentiment. From an operational perspective, false negatives in this class are especially critical, as they represent missed threat signals. While false positives may lead to unnecessary alerts, high recall for the violent class remains essential. Overall, these findings emphasize the need for further improvements in detecting subtle indicators of violent intent.

4.1. Individual Classifiers

Figure 7 shows a complete dashboard for the individual classifiers. Random Forest emerged as the strongest individual performer with an accuracy of 97.19% and F1-macro score of 0.9717. Its robust performance can be attributed to the ensemble nature of decision trees, which effectively captured complex patterns in the text data while maintaining resistance to overfitting. The training time of 1.31 s represented an excellent balance between performance and computational efficiency.

Decision Tree achieved respectable results with 93.60% accuracy and an F1-macro of 0.9353, completing training in just 1.05 s. While slightly less accurate than Random Forest, its interpretability remains valuable for understanding decision pathways in sentiment classification.

Support Vector Machine (SVM) underperformed relative to the other models, achieving only 68.98% accuracy with an F1-macro of 0.6758. The extended training time of 12.94 s, combined with the modest performance, suggests that linear SVM may not be optimally suited for high-dimensional text data with overlapping class boundaries.

Naive Bayes demonstrated surprisingly competitive performance with 81.79% accuracy and an F1-macro of 0.8172, despite its simplistic independence assumption. The negligible training time (0.03 s) makes it exceptionally suitable for real-time applications where computational resources are constrained.

4.2. Ensemble Methods

Stacking Classifier achieved the highest overall performance with 98.57% accuracy and an F1-macro of 0.9856, outperforming all individual models, including the strongest of them, Random Forest (97.19%), by about 1.4 percentage points. This improvement demonstrates the effectiveness of combining multiple base classifiers (Random Forest, Decision Tree, SVM) with a Logistic Regression meta-classifier. However, this performance came at a significant computational cost of 60.43 s training time.

Voting Classifier (hard voting) attained 96.36% accuracy with an F1-macro of 0.9636, a strong ensemble result with moderate training overhead (17.39 s). Combining the predictions of Random Forest, Decision Tree, and SVM outperformed Decision Tree and SVM individually but fell slightly short of Random Forest alone (97.19%).

Gradient Boosting achieved 91.78% accuracy with an F1-macro of 0.9175, though requiring 45.07 s training time. While competitive, the performance did not justify the computational expense compared to Random Forest.

AdaBoost with Decision Tree produced the weakest ensemble results with only 74.56% accuracy, suggesting that adaptive boosting may be less suitable for multi-class text classification with this particular feature representation.

A qualitative inspection of misclassified instances indicates that errors in the violent class often arise from implicitly expressed threats or sarcasm, which are difficult to capture using surface-level lexical features. Similarly, some strongly negative reviews without explicit aggression are occasionally misclassified as violent. These observations further support the need for more context-aware representations in future work.

4.3. Model Selection

Model selection distinguishes between the best-performing model (Stacking) and the most deployment-efficient model (Naive Bayes); which of the two is preferable depends on operational priorities. Figure 8 shows the confusion matrix for the best model. Although Stacking achieved the highest raw performance (F1-macro 0.9856), Naive Bayes was selected as the best overall model under a balanced scoring criterion that weights F1-macro at 40%, ROC AUC at 30%, and speed at 30%.

Key considerations included

Computational efficiency: Training completed in 0.03 s versus 60.43 s for Stacking.
Real-time applicability: Enables instantaneous predictions suitable for production deployment.
Competitive ROC AUC: Achieved 0.7123, second only to Random Forest (0.7145).
Resource constraints: Minimal memory footprint and no requirement for GPU acceleration.

The selection acknowledges that, in real-world deployment scenarios, the accuracy gain of Stacking over Naive Bayes (about 16.8 percentage points, i.e., Naive Bayes gives up roughly 17% of Stacking’s accuracy) does not justify the roughly 2000-fold increase in training time. NB was therefore preferred on deployment-oriented grounds, including computational efficiency and scalability in real-time monitoring systems. In security-sensitive applications, however, minimizing false negatives for the “violent” class is critical. Additional analysis focused on class-specific recall and F1-scores for violent instances: while NB provides fast inference, its recall for violent content is lower than that of the Stacking model. A hybrid deployment strategy is therefore recommended: high-performance ensemble models such as Stacking can be used in backend batch analysis or high-risk scenarios, while NB can support real-time preliminary screening due to its low latency.
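The balanced scoring criterion (40% F1-macro, 30% ROC AUC, 30% speed) can be sketched as below, using the values reported in Table 3. The text does not state how training time is converted into a speed score, so the normalization used here (fastest time divided by the model's time) is an assumption, as are the function and variable names.

```python
# (f1_macro, roc_auc, training_seconds) per model, copied from Table 3.
RESULTS = {
    "Stacking": (0.9856, 0.6327, 60.43),
    "Random Forest": (0.9717, 0.7145, 1.31),
    "Voting (Hard)": (0.9636, 0.0000, 17.39),
    "Decision Tree": (0.9353, 0.6385, 1.05),
    "Gradient Boosting": (0.9175, 0.6542, 45.07),
    "Naive Bayes": (0.8172, 0.7123, 0.03),
    "AdaBoost (DT)": (0.7479, 0.6226, 17.61),
    "SVM": (0.6758, 0.5864, 12.94),
}
WEIGHTS = {"f1": 0.40, "auc": 0.30, "speed": 0.30}


def balanced_scores(results, weights=WEIGHTS):
    """Weighted score per model; the speed term is fastest_time / model_time."""
    fastest = min(seconds for _, _, seconds in results.values())
    scores = {}
    for name, (f1, auc, seconds) in results.items():
        speed = fastest / seconds  # assumed normalization, in (0, 1]
        scores[name] = (weights["f1"] * f1
                        + weights["auc"] * auc
                        + weights["speed"] * speed)
    return scores


scores = balanced_scores(RESULTS)
best = max(scores, key=scores.get)

# Training-time ratio between Stacking and Naive Bayes.
speedup = RESULTS["Stacking"][2] / RESULTS["Naive Bayes"][2]
```

Under this ratio-based normalization Naive Bayes obtains the highest score. The ranking is sensitive to the choice: with a linear scaling such as 1 − t/t_max, nearly every model receives a speed score close to 1 and Random Forest would rank first, so the normalization should be documented alongside the weights.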

To complement the descriptive comparison in Table 3, a statistical validation was conducted using the McNemar test based on the reported accuracies of the Stacking (highest performing) and Naive Bayes (selected) models, assuming a test set size of 1704 samples. The discordant predictions comprised b = 310 cases where Stacking was correct and Naive Bayes was incorrect, and c = 24 cases with the opposite outcome. Applying the continuity-corrected McNemar test in Equation (5) yields χ² = 243.2 with p < 0.0001.

χ^{2} = \frac{{(| b - c | - 1)}^{2}}{b + c}

(5)

This result indicates that the performance improvement of the Stacking model over Naive Bayes is highly statistically significant. While this analysis is derived from aggregated performance metrics rather than instance-level predictions, it provides strong evidence that the observed performance gap is not attributable to random variation.
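The McNemar statistic in Equation (5) can be checked with a few lines of Python. The function name is illustrative; the p-value uses the identity that, for one degree of freedom, the chi-square survival function equals erfc(sqrt(x/2)).

```python
import math


def mcnemar_continuity(b, c):
    """Continuity-corrected McNemar statistic and its chi-square (1 dof) p-value."""
    statistic = (abs(b - c) - 1) ** 2 / (b + c)
    # For 1 degree of freedom: P(chi2 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Discordant counts reported in the text.
statistic, p_value = mcnemar_continuity(b=310, c=24)
```

With b = 310 and c = 24 the statistic is 285² / 334 ≈ 243.19, matching the reported 243.2. The same counts also reproduce the accuracy gap: (310 − 24) / 1704 ≈ 0.168.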

The current evaluation relies on a single train–test split; upcoming work will employ repeated stratified k-fold cross-validation to confirm performance stability across partitions. To better address security priorities, subsequent efforts will adopt cost-sensitive learning and class-weighted optimization, explicitly penalizing misclassifications of violent instances.

4.4. Error Analysis

To better understand the model behavior, an error analysis was conducted using the confusion matrix in Figure 8. The results reveal that the majority of misclassifications occur between the negative and violent classes, indicating a strong semantic overlap between these categories. Specifically, 58 negative instances were misclassified as violent, while an equal number of violent instances were misclassified as negative. This symmetric confusion suggests that both classes share similar lexical intensity, making them difficult to distinguish using TF-IDF and sentiment-based features alone.

A critical observation concerns the false negative rate for the violent class, where 98 out of 453 violent instances (≈21.6%) were misclassified, primarily as negative (58 cases) or positive (39 cases). This indicates that the model struggles to capture implicit or context-dependent threat expressions, which do not rely on explicit violent keywords.
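The false negative rate quoted above follows directly from the violent-class row of the confusion matrix. In the sketch below, the single remaining error (98 − 58 − 39 = 1) is not itemized in the text and is labeled "other" as an assumption.

```python
def class_recall(correct, missed_by_class):
    """Return (recall, false_negative_rate) for one true class."""
    missed = sum(missed_by_class.values())
    total = correct + missed
    return correct / total, missed / total


# Violent class: 453 instances, 98 of them misclassified.
violent_missed = {"negative": 58, "positive": 39, "other": 1}  # "other" is inferred
violent_correct = 453 - sum(violent_missed.values())

recall, false_negative_rate = class_recall(violent_correct, violent_missed)
```

This gives a recall of about 0.784 and a false negative rate of about 0.216 for the violent class, i.e., roughly one in five violent reviews would pass unflagged.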

Additionally, a notable number of positive reviews were misclassified as violent (83 instances) or negative (66 instances). This suggests the presence of lexical bias, where strongly worded expressions, despite lacking harmful intent, are interpreted as aggressive or threatening.

In contrast, the neutral class achieved perfect classification (100% accuracy), indicating that neutral language is linguistically distinct and less prone to ambiguity.

Overall, these findings highlight that the current feature representation effectively captures surface-level sentiment but lacks the ability to model context, intent, and pragmatic meaning, which are essential for distinguishing between negative sentiment and genuine violent or threatening language.

5. Limitations and Future Work

While the framework was evaluated on a balanced dataset for equitable model comparison, real-world settings feature inherent class imbalance, with violent content as the rare class. Future assessments should employ naturally imbalanced datasets, prioritizing metrics like precision–recall curves, class-specific recall, and false negative rates over aggregate accuracy. Effective deployment further necessitates

Continuous monitoring with streaming data pipelines,
Periodic model retraining to adapt to evolving language patterns,
Threshold calibration based on operational risk tolerance,
Human-in-the-loop verification for high-risk predictions.

These measures ensure robustness in real-time security monitoring.
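One way to approach the threshold calibration item above is to choose the decision threshold from a target recall for the violent class. The sketch below is a hypothetical illustration: the scores, labels, and the 0.9 recall target are invented, and the function name is not part of the proposed framework.

```python
import math


def threshold_for_recall(scores, is_violent, target_recall):
    """Highest threshold such that flagging score >= threshold meets the recall target."""
    positives = sorted(
        (s for s, violent in zip(scores, is_violent) if violent), reverse=True
    )
    if not positives:
        raise ValueError("no violent examples to calibrate on")
    # Number of violent examples that must be caught to reach the target.
    needed = max(1, math.ceil(target_recall * len(positives)))
    return positives[needed - 1]


# Hypothetical model scores for the violent class on a validation set.
scores = [0.95, 0.80, 0.62, 0.40, 0.35, 0.10]
is_violent = [True, True, False, True, False, False]

threshold = threshold_for_recall(scores, is_violent, target_recall=0.9)
flagged = [s for s in scores if s >= threshold]
```

In this toy case the threshold drops to 0.40 to catch all three violent examples, at the cost of one false alert (the 0.62 review), which is the trade-off that human-in-the-loop verification is meant to absorb.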

Limitations: The study omits transformer-based baselines (e.g., BERT), which excel in contextual semantics but incur high computational costs unsuitable for real-time use. Key limitations include

Violent threshold calibration: The fixed VADER compound-score threshold of −0.5 warrants data-driven refinement.
Feature engineering: TF-IDF and text statistics could integrate word embeddings or transformer representations.
Ensemble trade-offs: The Naive Bayes–Stacking performance gap highlights opportunities for lightweight optimizations.
Domain specificity: Training on airline reviews limits generalization without domain adaptation.

Future work: Priority directions encompass

Benchmarking transformer architectures (e.g., BERT, RoBERTa).
Hybrid models optimizing performance and efficiency.
Cross-domain validation for generalizability.
Production-level real-time monitoring.

To support reproducibility, the implementation details of the proposed framework—including preprocessing steps, feature extraction, model configurations, and evaluation procedures—are documented in detail. The authors plan to release the implementation code and experimental pipeline through a public repository, subject to data sharing constraints. Where full dataset release is not possible, a reproducible pipeline with sample or anonymized data will be provided.

In addition, detailed experimental logs, including dataset splits, parameter settings, and model outputs corresponding to reported metrics, will be made available as Supplementary Materials to ensure the traceability of results.

6. Conclusions

This study presents a security-aware ambient intelligence framework for detecting violent language in airline customer reviews, bridging a key limitation of traditional sentiment analysis in capturing security-relevant linguistic signals. We systematically developed and evaluated individual classifiers (Random Forest, Decision Tree, SVM, Naive Bayes) alongside ensemble techniques (Voting, Stacking, Boosting), demonstrating the feasibility and operational value of textual violence detection for proactive risk monitoring in customer feedback.

Stacking yielded superior performance (98.57% accuracy, F1-macro 0.9856), whereas Naive Bayes provided an optimal trade-off of efficacy and efficiency (81.79% accuracy, F1-macro 0.8172; training time 0.03 s)—a 2000-fold speedup with merely a 17% relative performance drop. Per-class metrics highlighted robust neutral review detection (F1-score 0.9978) but challenges in distinguishing negative from violent content (F1-scores 0.7814 and 0.7482, respectively), due to their semantic overlap. A VADER threshold of −0.5 served as an effective initial filter, with manual annotation validating subsequent predictions.

The proposed system positions the classification model as a key component within a security-aware ambient intelligence framework. It operates as a perception and decision-support module that processes contextual data to detect anomalies or potential risks in real time. The architecture consists of four layers: data acquisition, processing (classification), intelligence (context interpretation), and security (response actions). Overall, the system extends beyond prediction by enabling adaptive context-aware security responses in intelligent environments.

The framework’s serialized components support real-time deployment, laying the groundwork for integration into airline security systems. Limitations include the threshold tuning, feature scope, domain specificity, absence of transformer baselines, and lack of ablation studies or real-world validation amid data imbalance and evolving language. Future work should address these through rigorous component analysis, data augmentation, and production testing to enable proactive security intelligence in critical infrastructure.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/fi18050224/s1. Supplementary File S1 includes the source code and a sample of the dataset used in this study.

Author Contributions

Conceptualization, F.A. and O.R.; methodology, F.A.; formal analysis, F.A.; investigation, F.A.; data curation, F.A.; writing—original draft preparation, F.A.; writing—review and editing, F.A. and O.R.; visualization, F.A.; supervision, O.R. All authors have read and agreed to the published version of the manuscript.

Funding

The project was funded by the KAU Endowment (WAQF) at King Abdulaziz University, Jeddah, Saudi Arabia.

Data Availability Statement

The raw data supporting the conclusions of this article will be made available by the authors on request.

Acknowledgments

The project was funded by KAU Endowment (WAQF) at King Abdulaziz University, Jeddah, Saudi Arabia. The authors acknowledge with thanks WAQF and the Deanship of Scientific Research (DSR) for technical and financial support. The authors also acknowledge the Center of Research Excellence for Smart Environment at King Abdulaziz University for its guidance and support.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
ML	Machine Learning
DL	Deep Learning
NLP	Natural Language Processing
TF-IDF	Term Frequency–Inverse Document Frequency
VADER	Valence Aware Dictionary and Sentiment Reasoner
LR	Logistic Regression
RF	Random Forest
SVM	Support Vector Machine
NB	Naive Bayes
DT	Decision Tree
ROC	Receiver Operating Characteristic
AUC	Area Under the Curve
F1	F1-score
Acc.	Accuracy
Prec.	Precision
Rec.	Recall

References

1. Maiuri, M.; Garavelli, M.; Cerullo, G. Ultrafast Spectroscopy: State of the Art and Open Challenges. J. Am. Chem. Soc. 2019, 142, 3–15.
2. Perdomo-Verdecia, V.; Garrido-Vega, P.; Sacristán-Díaz, M. An fsQCA Analysis of Service Quality for Hotel Customer Satisfaction. Int. J. Hosp. Manag. 2024, 122, 103793.
3. Soklaridis, S.; Geske, A.M.; Kummer, S. Key Characteristics of Perceived Customer Centricity in the Passenger Airline Industry: A Systematic Literature Review. J. Air Transp. Res. Soc. 2024, 3, 100031.
4. Lim, W.M.; Jasim, K.M.; Das, M. Augmented and Virtual Reality in Hotels: Impact on Tourist Satisfaction and Intention to Stay and Return. Int. J. Hosp. Manag. 2024, 116, 103631.
5. Sakdaar, P. Review of Airline Industry Quality Control: Ensuring Excellence from Ground to Air. J. Soc. Sci. Pannyapat 2024, 6, 629–644.
6. Mikuličić, J.Ž.; Kolanović, I.; Jugović, A.; Brnos, D. Evaluation of Service Quality in Passenger Transport with a Focus on Liner Maritime Passenger Transport: A Systematic Review. Sustainability 2024, 16, 1125.
7. Zhang, X.; Xu, H.; Ba, Z.; Wang, Z.; Hong, Y.; Liu, J.; Qin, Z.; Ren, K. PrivacyAsst: Safeguarding User Privacy in Tool-Using Large Language Model Agents. IEEE Trans. Dependable Secur. Comput. 2024, 21, 5242–5258.
8. Bjurling, B.; Raza, S. Cyber Threat Intelligence Meets the Analytic Tradecraft. ACM Trans. Priv. Secur. 2024, 28, 6.
9. Tang, M.; Gao, H.; Zhang, Y.; Liu, Y.; Zhang, P.; Wang, P. Research on Deep Learning Techniques in Breaking Text-Based CAPTCHAs and Designing Image-Based CAPTCHA. IEEE Trans. Inf. Forensics Secur. 2018, 13, 2522–2537.
10. Alkomah, F.; Ma, X. A Literature Review of Textual Hate Speech Detection Methods and Datasets. Information 2022, 13, 273.
11. Chen, C.; Beland, S.; Burghardt, I.; Byczek, J.; Conway, W.J.; Cotugno, E.; Davre, S.; Fletcher, M.; Gnanasekaran, R.K.; Hamilton, K.; et al. Cross-Platform Violence Detection on Social Media: A Dataset and Analysis. In Proceedings of the 17th ACM Web Science Conference 2025, New Brunswick, NJ, USA, 20–24 May 2025; pp. 494–498.
12. Das, R.; Maowa, J.; Ajmain, M.; Yeiad, K.; Islam, M.; Khushbu, S. Team Error Point at BLP-2023 Task 1: A Comprehensive Approach for Violence Inciting Text Detection Using Deep Learning and Traditional Machine Learning Algorithm. In Proceedings of the First Workshop on Bangla Language Processing (BLP-2023); Association for Computational Linguistics: Singapore, 2023; pp. 236–240.
13. Lu, F. Online Shopping Consumer Perception Analysis and Future Network Security Service Technology Using Logistic Regression Model. PeerJ Comput. Sci. 2024, 10, e1777.
14. Babu, M.M.; Rahman, M.; Alam, A.; Dey, B.L. Exploring Big Data-Driven Innovation in the Manufacturing Sector: Evidence from UK Firms. Ann. Oper. Res. 2024, 333, 689–716.
15. Munnuru, B.; Aditya, T.; Srinivas, T.A. Airline Attitudes: Mining Data for Service Sentiment Insights. J. Data Struct. Comput. 2024, 1, 1–6.
16. Su, M.; Zhang, C.; Tong, Y.; Liang, B.; Ma, S.; Wang, J. Deep Learning in Video Violence Detection. In Proceedings of the International Conference on Computer Technology and Media Convergence Design, Sanya, China, 23–25 April 2021; pp. 268–272.
17. Ullah, A.; Ahmad, J.; Muhammad, K.; Sajjad, M.; Baik, S.W. Action Recognition in Video Sequences Using Deep Bi-Directional LSTM with CNN Features. IEEE Access 2017, 6, 1155–1166.
18. Negre, P.; Alonso, R.S.; González-Briones, A.; Prieto, J.; Rodríguez-González, S. Literature Review of Deep-Learning-Based Detection of Violence in Video. Sensors 2024, 24, 4016.
19. Yao, H.; Hu, X. A Survey of Video Violence Detection. Cyber-Phys. Syst. 2023, 9, 1–24.
20. Patel, A.; Oza, P.; Agrawal, S. Sentiment Analysis of Customer Feedback and Reviews for Airline Services Using Language Representation Model. Procedia Comput. Sci. 2023, 218, 2459–2467.
21. Park, E. CRNet: A Multimodal Deep Convolutional Neural Network for Customer Revisit Prediction. J. Big Data 2023, 10, 1.
22. Sahoo, M.; Rautaray, J. Survey on Sentiment Analysis to Predict Twitter Data Using Machine Learning and Deep Learning. Int. J. Eng. Res. Technol. 2022, 11, 506–512.
23. Harjule, P.; Gurjar, A.; Seth, H.; Thakur, P. Text Classification on Twitter Data. In Proceedings of the 3rd International Conference on Emerging Technologies in Computer Engineering, Jaipur, India, 7–8 February 2020; pp. 160–164.
24. Singh, N.; Upreti, M. HMRFLR: A Hybrid Model for Sentiment Analysis of Social Media Surveillance on Airlines. Wirel. Pers. Commun. 2023, 132, 97–112.
25. Samir, H.A.; Abd-Elmegid, L.; Marie, M. Sentiment Analysis Model for Airline Customers’ Feedback Using Deep Learning Techniques. Int. J. Eng. Bus. Manag. 2023, 15, 18479790231206019.
26. Koçak, H. A Generative AI-Driven Analysis of Airline Passenger Feedback: Revealing What Matters Most. Fırat Üniversitesi Mühendislik Bilim. Derg. 2025, 37, 909–918.
27. Haque, M.; Kanwal, S.; Sabeen, I.; Waqas, M.; Umar, M.; Nabeel, M. Harnessing AI for Customer Feedback: Sentiment Analysis of US Airlines Tweets Using BERT and Deep Learning Models. In Proceedings of the 2024 IEEE Conference, Lahore, Pakistan, 20–21 November 2024.
28. García-Hernández, C.; Freire, M.; Melo, P.; Costa, J.P. From Emotional Data to Decisions: A Systematic Review on How Airlines Use Sentiments and Emotions to Stay Ahead. J. Retail. Consum. Serv. 2025, 131, 102911.
29. Emplifi. The Social Media Support Gap: How Airline Brands Can Improve Social Customer Care Strategies in 2025; Technical Report; Emplifi: Columbus, OH, USA, 2025.
30. Budianto, A.G.; Wirjodirdjo, B.; Maflahah, I.; Kurnianingtyas, D. Sentiment Analysis Model for Klikindomaret Android App During Pandemic Using VADER and Transformers NLTK Library. In Proceedings of the IEEE International Conference on Industrial Engineering and Engineering Management, Kuala Lumpur, Malaysia, 7–10 December 2022; pp. 423–427.

Figure 1. Main phases of the proposed framework.

Figure 2. Statistical distribution of the dataset used in the proposed framework.

Figure 3. Main layers of the proposed framework architecture.

Figure 4. Class distribution visualization (before balancing).

Figure 5. Class distribution visualization (after balancing).

Figure 6. Model performance comparison.

Figure 7. Complete dashboard of model performance and deployment results.

Figure 8. Confusion matrix for best model.

Table 1. Summary of relevant studies on customer review analysis in airline and related domains.

Ref (Year)	Main Problem	Methods	Dataset	Key Findings
[26]	Generative AI for airline feedback interpretation	Zero-shot prompting with DeepSeek	Turkish Airlines TripAdvisor reviews (2024)	Effective sentiment analysis without domain-specific fine-tuning; demonstrates the potential of generative AI for specialized customer feedback analysis.
[27]	Sentiment analysis of US airline tweets	BERT, GRU, LSTM, RNN, SVM, LR	Kaggle US airline tweets	BERT achieves 93% accuracy, outperforming GRU (88%) and traditional machine learning methods (78%).
[28]	Systematic review of sentiment and emotion detection in airline services	PRISMA methodology; lexicon-based, ML, DL, transformer-based approaches	60 studies from Scopus and Web of Science (2024)	Highlights the shift toward hybrid AI approaches and identifies gaps in real-time systems and multilingual processing.
[29]	Social media customer service response rates and AI adoption	Social media analytics and AI agent implementation tracking	38 global airlines across Facebook and Instagram (2024)	AI agent responses in direct messages doubled; Instagram response rates increased by 14%; however, 74% of inquiries remained unanswered.
[20]	Prediction of airline revisit intention	Machine learning with seven classifiers	Airline customer feedback dataset	Sentiment-based revisit prediction achieved an accuracy of 83.42%.
[21]	Multimodal revisit prediction	Deep learning model CRNet with text–image fusion	Repurchase and food delivery datasets	Achieved 95.75% accuracy and F1-score of 0.973.
[22]	Analysis of unstructured social media feedback	Machine learning review covering eight years of studies	Twitter and other social media platforms	Provides a comprehensive survey of machine learning methods used for airline tweet analysis.
[23]	Comparison of ML and DL methods for sentiment analysis	NB, LR, SVM, LSTM, and ensemble models	Sentiment140 and Twitter US Airline datasets	LSTM achieved 82% accuracy, outperforming SVM (68.9%).
[24]	Sentiment analysis for Indian airlines	Hybrid HMRFLR framework with seven algorithms	Indian airline social media data	AdaBoost achieved the best performance with 84.5% accuracy.
[25]	Sentiment analysis combined with violence detection	Ensemble methods: Voting, Stacking, and Boosting	Skytrax airline reviews	Multi-class ensemble approach addressing the gap of threatening or violent language detection.

Table 2. VADER compound score statistics.

Statistic	Value
Mean	0.259
Std. Deviation	0.776
Minimum	−0.996
Maximum	0.999
25th percentile (Q1)	−0.643
75th percentile (Q3)	0.954

Table 3. Comprehensive model performance comparison.

Model	Acc.	F1 Macro	Prec. Macro	ROC AUC	Rec. Macro	Time (s)
Stacking	0.9857	0.9856	0.9856	0.6327	0.9857	60.43
Random Forest	0.9719	0.9717	0.9722	0.7145	0.9719	1.31
Voting (Hard)	0.9636	0.9636	0.9648	0.0000	0.9636	17.39
Decision Tree	0.9360	0.9353	0.9377	0.6385	0.9360	1.05
Gradient Boosting	0.9178	0.9175	0.9202	0.6542	0.9178	45.07
Naive Bayes	0.8179	0.8172	0.8222	0.7123	0.8179	0.03
AdaBoost (DT)	0.7456	0.7479	0.7610	0.6226	0.7456	17.61
SVM	0.6898	0.6758	0.7531	0.5864	0.6898	12.94

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Alanazi, F.; Rabie, O. A Security-Aware Ambient Intelligence Framework for Detecting Violent Language in Airline Customer Reviews. Future Internet 2026, 18, 224. https://doi.org/10.3390/fi18050224
