Leveraging Stacking Framework for Fake Review Detection in the Hospitality Sector

Abstract: Driven by motives of profit and competition, fake reviews are increasingly used to manipulate product ratings. This trend has caught the attention of academic researchers and international regulatory bodies. Current methods for spotting fake reviews suffer from scalability and interpretability issues. This study focuses on identifying suspected fake reviews in the hospitality sector using data from a review aggregator platform. By combining features and leveraging various classifiers through a stacking architecture, we improve training outcomes. User-centric traits emerge as crucial in spotting fake reviews. Incorporating SHAP (Shapley Additive Explanations) enhances model interpretability. Our model consistently outperforms existing methods across diverse dataset sizes, proving its adaptable, explainable, and scalable nature. These findings hold implications for review platforms, decision-makers, and users.


Introduction
Effective decision-making heavily hinges on information search processes. Within the realm of e-commerce, consumer choices are significantly influenced not only by the information disseminated by companies but also by the evaluations of fellow product purchasers [1]. Gradually, these product reviews have transitioned into an integral facet of consumers' purchasing decisions, with a staggering 90% of customers consulting online reviews prior to making purchase-related choices. Remarkably, 88% of consumers place a level of trust in online reviews akin to personal recommendations [2]. This escalating reliance on such reviews, however, unravels a concern of manipulation in the decision-making process through the injection of fabricated reviews [3].
Managers have realized the potential impact of reviews on consumer engagement intention, leading some of them to engage in review manipulation [4]. Fabricated reviews encompass two distinct categories: destructive and deceptive. Destructive reviews often serve as mere promotional content that bears no relation to the actual product experience. On the other hand, deceptive reviews are particularly harmful as they spread false information that can seriously harm businesses and result in significant financial loss [5]. Notably, even renowned platforms like Tripadvisor have struggled to grapple with the pervasive issue of counterfeit reviews. This is demonstrated by their multiple shifts in slogans over the years, underscoring the complexity of distinguishing genuine reviews from fraudulent ones. This issue has inspired our research into the detection of counterfeit reviews within the hospitality domain [6][7][8].
The magnitude of this predicament is starkly quantified, with the World Economic Forum estimating annual losses of an astonishing USD 152 billion due to the proliferation of counterfeit reviews, an economic loss that merits serious consideration [9]. The gravity of the problem has attracted the attention of international regulatory bodies, spanning from the European Union and the United States to Australia and India [10]. These authorities have enacted stringent measures to combat the endorsement and dissemination of counterfeit reviews. Regulations now mandate that platforms verify the authenticity of consumer reviews or face prosecution and penalties. Despite these efforts, the ground reality remains different [11][12][13].
The menace of counterfeit reviews has particularly targeted review aggregator platforms like Yelp, Tripadvisor, and more, compelling scholars' engagement for well over a decade [14,15]. These inauthentic reviews extend beyond external sites to influential platforms like Amazon, Walmart, and Flipkart. Notably, establishments are found to be more prone to receiving fake positive reviews on external sites like Tripadvisor, Yelp, MouthShut, etc. [16]. However, in cases where establishments feature on both internal and external platforms, the ratings on external platforms tend to be lower, owing to the deliberate injection of negative counterfeit reviews by competitors [17]. The lack of a robust purchase validation mechanism renders external platforms more susceptible to the infiltration of fake reviews.
While fake reviews span various sectors, they are notably concentrated in entertainment, hospitality, and e-commerce [18]. The initial response involved manual identification, but this approach was proven to be sluggish, imprecise, and resource-intensive. This paved the way for the pioneering work of Jindal and Liu [19], who introduced the concept of automated fake review detection. Subsequently, machine learning techniques encompassing Support Vector Machines (SVM), Random Forest (RF), Naïve Bayes, and neural networks (NN) have gained significant prominence for detecting counterfeit reviews [20]. Notably, the scope of fake review detection extends beyond supervised learning methods, encompassing semi-supervised and unsupervised approaches [14].
The prevailing research landscape predominantly revolves around feature engineering and training classifiers to effectively distinguish between genuine and fraudulent reviews. Comparative analyses of classifiers, such as the Naïve Bayes Classifier and Support Vector Machine (SVM), have garnered attention [18,21,22]. Paradoxically, despite the capabilities of machine and deep learning to handle large datasets, research often focuses on comparatively small datasets (as mentioned in Section 2) when reporting superior performance. This raises valid concerns about overfitting and a lack of real-world adaptability in scenarios where millions of reviews are generated daily. These observations culminate in the formulation of the ensuing research questions.
RQ1. In the context of fake review detection in the hospitality sector, does combining the power of several base classifiers with a meta-classifier lead to a consistently better result?

RQ2. Will the performance of such a classifier vary with the scaling of the input data size?
To answer RQ1, we have built upon the methods of [23,24] by creating an ensemble model using several well-reported classifiers (XGBoost, Random Forest, artificial neural network, etc.) from the fake review detection domain [15,25]. In answering RQ2, we have used the Yelp datasets by [26]. The datasets consist of reviews from the review aggregator site Yelp.com and correspond to hotels and restaurants. They have created three databases with varying sizes, ideal for testing our classifier's performance. The work offers some key theoretical contributions, including the proposition of a state-of-the-art framework using the stacking ensemble technique for fake review detection, the demonstration of framework performance comparable with the existing benchmark models irrespective of the size of the input, and the introduction of several new features (average rating provided to a product, subjectivity, average rating received by a product, etc.) with more distinguishing power. The study findings unequivocally demonstrate the enhancement in classification performance through the application of classifier ensembling. This observed outcome remains invariant across varying dataset magnitudes. The empirical verification encompasses diverse evaluation metrics, including, but not limited to, average precision, recall, area under the curve, F1 score, and accuracy. Through comprehensive benchmarking, it is established that our approach distinctly excels in cumulative performance, thereby substantiating its superior efficacy in the domain of counterfeit review identification.
The remainder of this paper is organized as follows: Section 2 reviews the existing literature. Section 3 highlights the proposed model along with its features, learning techniques, and evaluation. Section 4 delves into findings, implications, and future research, while Section 5 throws light on the conclusions.

Literature Review

Firms and Fake Reviews
It has been established that users base their opinions on reviews of a product or a service, a fact that manufacturers and service providers are well aware of [27,28]. Research indicates that service providers are more likely to engage in fake review activities when their reputation is at stake due to a limited number of reviews or negative feedback. Moreover, restaurants experiencing heightened competition are found to be more vulnerable to receiving negative reviews [29]. A study by [16] revealed that independent or single-unit hotels and restaurants are the primary beneficiaries of review manipulation and are, therefore, more susceptible to it. Conversely, [30] found that even a small number of fake reviews (50) can be sufficient to surpass the competitors in certain markets. Refs. [31] and [1] have reported that consumers associate themselves with the review website rather than other participants, and this relationship is moderated by homophily and tie strength, which foster source credibility in the context of electronic word-of-mouth (eWOM) on review websites.
There have been cases where prominent brands were prosecuted for availing the services of a third party to promote or defend them online [32]. The pervasiveness of this issue has garnered attention not only from academic circles but also from major news outlets such as the BBC and the New York Times, which reported on a photography company for defaming its competitors by posting fake negative reviews [33]. Trends have shown that fake reviews are increasing day by day across platforms, thus causing a problem for online information accuracy and potential market regulation [34].

Fake Review Detection
The prevalence of fake reviews has garnered significant attention in academic circles. A range of studies have aimed to detect fake reviews. Some of them have applied supervised learning techniques, while others have utilized alternative methods (semi-supervised learning, unsupervised learning, probabilistic models, graph-based models, and deep learning) [35]. Supervised learning is a machine learning technique where the model is trained on a labeled dataset, learning to map input data to corresponding output labels through iterative adjustments, enabling it to make predictions on new, unseen data, whereas semi-supervised learning is a type of machine learning where the model is trained on a combination of labeled and unlabeled data, leveraging both the provided answers and the patterns it discovers independently to make predictions. Unsupervised learning is a machine learning approach where the model explores patterns and structures in unlabeled data without explicit guidance, uncovering hidden relationships or grouping similar items based on inherent similarities. Ref. [19] tackled this issue by developing a rule based on similarity: reviews with more than a 90% similarity were deemed as spam. They extracted 24 review-centric features and trained the model using logistic regression. Ref. [36] have reported that deceptive reviews have greater lexical complexity, contain more frequent mentions of brands and first-person pronouns, and display a sentiment tone that is more positive towards a product or service. Burst-related features are more relevant in identifying fake reviews. Ref. [50] proposed a bidirectional gated recurrent neural network, which helps in capturing global semantic information that standard discrete features fail to grasp. A bidirectional gated recurrent neural network is a type of artificial neural network designed for processing sequential data, capable of analyzing information both forward and backward through time. It uses special gates to selectively remember and forget past information, enhancing its ability to understand context and relationships in the data bidirectionally.
In the experiment conducted by [51], they revealed that a weak brand suffers more when it excessively adds fake positive reviews, and this raises suspicion among the users, leading to a loss of credibility [52]. They further reported that deleting negative reviews is more subtle and leaves fewer manipulation cues. Ref. [22] illustrated the use of univariate and multivariate distributions in improving classifier accuracy.
Ref. [53] developed multiple deep learning-based solutions to address the issue of variable-length review texts. They proposed two approaches: one using multi-instance learning and the other employing a hierarchical architecture, both aimed at effectively handling reviews of varying lengths. Ref. [17] developed a supplementary method called the trust measure that determines the genuineness of a review based on strongly positive and negative terms. They reported that fake reviews are more prevalent on open reviewing platforms than on closed platforms.
Ref. [54] derived several micro-linguistic cues using Linguistic Inquiry Word Count (LIWC) and Coh-Metrix to study their impact on positive and negative reviews being either fake or genuine. Their findings revealed that single posts, reiterating posts, and generic feedback are useful clues in identifying spam. Ref. [55] leveraged the capability of deep learning architectures and proposed a high-dimensional model conflating n-gram, skip-gram, and emotion-based features. Refs. [28,56] have postulated that fake reviews have more social and affective cues as compared to genuine reviews.
Ref. [57] examined the temporal impact on classifier performance. They proposed that algorithms dealing with text should have the capability to periodically update their vocabulary with words used in general parlance. Ref. [58] have explored an interesting concept of inconsistency and its potential to enhance classifier performance, defining it as a disparity between review content and star ratings, differing sentiments for the same rating, or a change in a reviewer's writing style for the same rating. Ref. [59] have stressed the presence of emotional cues in fake reviews. Refs. [60,61] have relied on feature engineering along with word embedding to identify fake and genuine reviewers.
A quick look at Table 1 solidifies our argument that the majority of the high-performing work has been reported on relatively smaller datasets and often on a selective set of metrics, which can lead to a false sense of performance. Furthermore, combining the power of multiple classifiers has received less attention than warranted in academia, in contrast to machine learning contests, where such architectures often feature as best-performing models [62]. Additionally, training deep learning models is resource-intensive and time-consuming. To this end, we propose a stacking ensemble-based classifier that is faster and easier to train and performs well, irrespective of the size of the input dataset.

Methodology
For this study, the Yelp datasets curated by [26] are selected. The reviews were collected over four years, between 2010 and 2014, from Yelp.com. The dataset comprises reviews given to restaurants and hotels in the US. The investigation encompasses three datasets, namely YelpChi, YelpNYC, and YelpZip. The YelpChi dataset has reviews of hotels and restaurants situated in Chicago, whereas YelpNYC comprises reviews of restaurants in New York City. YelpZip specifically gathers reviews from restaurants in New York City based on their zip codes. Yelp employs a filtering algorithm designed to identify and segregate potentially fake or suspicious reviews into a distinct filtered list. These filtered reviews are publicly accessible, with a business's Yelp page showcasing recommended reviews, while a link at the page's bottom allows users to peruse the filtered or unrecommended reviews. Although the Yelp anti-fraud filter is not infallible, it approximates a "near" ground truth and has demonstrated a propensity for accurate outcomes [63]. The datasets under consideration include both recommended and filtered reviews, denoted as genuine and potentially fraudulent, respectively. Table 2 presents the details of the datasets. The metadata includes features such as 'user_id', 'prod_id', 'rating', 'label', and 'date'. These represent the encoded identifier of the user who submitted the review, the encoded identifier of the product being reviewed, the rating given by the user on a scale from 1 to 5, whether the review has been filtered out by the system, and the date on which the review was submitted. The 'label' feature has two values: '−1' signifies that the review has been filtered, indicating that Yelp.com's algorithm has marked it as fake or spam, while '1' indicates that the review has not been filtered.
A series of preprocessing steps performed on the data for classification purposes is discussed in detail in Section 3.1. The proposed framework is shown in Figure 1. After the preprocessing was completed, feature engineering was performed, resulting in several features based on the previous literature. A total of six user-centric, six product-centric, and twenty-five review-centric features were derived. Details of these features are provided in Section 3.

Data Balancing
Table 2 shows that the dataset is highly imbalanced, which would cause a model trained on it to be biased towards the majority class. To balance the dataset, the 'imbalanced-learn' Python package by [68] is used. The 'RandomUnderSampler' technique brings the number of majority-class instances down to that of the minority class by randomly removing instances from the majority class. This technique is employed for two reasons: Firstly, oversampling the dataset would have augmented the data points, making it difficult for the classifier to converge, especially when word embeddings were also used in the model. Secondly, synthetically created embeddings do not represent actual reviews and would not have made any sense. As machine learning models do not accept textual data as they are, the review text was embedded. The most popular types of embeddings are BERT and its variants DistilBERT, RoBERTa, and ALBERT. These embeddings, along with the user/product/review-centric features, were used in the classification task. The evaluation was performed in terms of classification accuracy, AUC, F1 score, average precision, and recall.
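To make the balancing step concrete, the following is a minimal sketch of how the undersampling described above can be performed with the 'imbalanced-learn' package; the file name and column names are illustrative assumptions rather than the exact ones used in the study.

```python
# Sketch of random undersampling with imbalanced-learn (assumed column names).
import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

df = pd.read_csv("yelp_reviews.csv")        # hypothetical file of reviews + metadata
X = df.drop(columns=["label"])              # engineered features and metadata
y = df["label"]                             # 1 = recommended, 0 = filtered (after relabeling)

# Randomly drop majority-class rows until both classes have equal counts.
rus = RandomUnderSampler(random_state=42)
X_balanced, y_balanced = rus.fit_resample(X, y)

print(y.value_counts())
print(y_balanced.value_counts())
```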

Text Pre-Processing
The standard pre-processing steps are followed: removal of stop words, removal of punctuation, removal of URLs, lowercasing the text, and lemmatization. The label of filtered reviews is changed from −1 to 0 as this is a binary classification problem, and some of the classifiers used expect classes to be labeled as 0 and 1 only. In this study, the NLTK package is used for preliminary processing. Lemmatization was performed using the spaCy package, which has been reported to give the best results in real-world settings [69].
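The following sketch illustrates these pre-processing steps with NLTK and spaCy; the regular expressions and the 'en_core_web_sm' model choice are our own assumptions, and the spaCy model must be installed separately.

```python
# Illustrative text pre-processing: URL/punctuation removal, lowercasing,
# stop-word removal (NLTK), and lemmatization (spaCy).
import re
import nltk
import spacy
from nltk.corpus import stopwords

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])  # keep only tagging/lemmatization

def preprocess(text: str) -> str:
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # remove URLs
    text = re.sub(r"[^a-z\s]", " ", text.lower())         # lowercase, drop punctuation/digits
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tok.lemma_ for tok in nlp(" ".join(tokens)))

print(preprocess("Loved this hotel!!! See http://example.com for photos."))
```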

Feature Engineering
Apart from the existing meta-features, a total of thirty-seven psycholinguistic features have been derived that can be classified into user-centric, product-centric, and review-centric features. Besides the features that have already been reported in the literature, we have engineered some new features intuitively and adopted some features from other fields of the literature. User-centric features are those that are concerned with user behavior. These features include the average rating provided by the user across all products, the total number of reviews written by a user, and their deviation. Product-centric features are those that describe the characteristics of the product from the reviewers' point of view. These include the number of reviews received by the product and the average rating given by users to the product. Lastly, review-centric features are mainly concerned with the linguistic aspects of the reviews and are not limited to the presence of exclamation marks in sentences or the count of uppercase and lowercase letters. They also include various emotional variables, such as sentiment scores and variables highlighting anger, joy, and trust in the review text. The complete list of features, along with their description and categorization, is presented in Table 3.

Figure 2 shows the correlation matrix among the features of the YelpZip dataset. Most features show either a negative or negligible correlation (values below or close to 0), suggesting that there is no multicollinearity issue with the acquired features. Figure 3 illustrates the cumulative distribution function for the engineered features from the YelpZip dataset. In this context, 'Ham' refers to genuine reviews and 'Spam' to fake reviews. Features like avg_Urating, day_Urating, Entropy, similarity, and day_entropy exhibit the greatest discriminatory power. For the correlation plot and CDF of the YelpChi and YelpNYC datasets, refer to Appendices B and C, respectively.
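As an illustration of the feature engineering step, the sketch below derives a handful of the listed features from the review metadata; the column names follow the dataset description above, while the helper function and the simplified feature definitions (e.g., treating rating extremity as a 1- or 5-star flag) are our own illustrative assumptions.

```python
# Simplified derivation of a few user-, product-, and review-centric features.
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()
    # User-centric: average rating given by the user and reviews posted per day.
    df["avg_Urating"] = df.groupby("user_id")["rating"].transform("mean")
    df["day_Urating"] = df.groupby(["user_id", "date"])["rating"].transform("count")
    # Product-centric: review count and average rating received by the product.
    df["prod_review_count"] = df.groupby("prod_id")["rating"].transform("count")
    df["avg_Prating"] = df.groupby("prod_id")["rating"].transform("mean")
    # Review-centric: lexical diversity (unique words / all words) and rating extremity.
    words = df["review"].str.lower().str.split()
    df["ld"] = words.apply(lambda w: len(set(w)) / len(w) if w else 0.0)
    df["Ext"] = df["rating"].isin([1, 5]).astype(int)
    return df
```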

Text Embedding
We have considered the transformer-based BERT (Bidirectional Encoder Representations from Transformers) embedding and its popular variants, ALBERT, RoBERTa, and DistilBERT, to convert textual input into a machine-understandable numerical form. The transformer technique utilizes attention mechanisms to process input data in parallel, capturing contextual information efficiently. BERT can produce meaningful embeddings because it is trained on large-scale real-world datasets. It leverages an attention mechanism that dynamically calculates the relationships between input words based on their context within a sentence. BERT comprehends context in language by considering both preceding and succeeding words, capturing intricate relationships for better understanding. ALBERT optimizes BERT's efficiency by sharing parameters among layers, offering similar performance with fewer parameters, making it computationally more efficient. RoBERTa refines BERT by modifying training methods, removing the next sentence prediction task, and using larger mini-batches for improved natural language understanding. DistilBERT retains essential language representations with reduced complexity, enabling faster training and lower resource requirements while maintaining performance.
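A hedged sketch of the embedding step is shown below, using the Hugging Face 'transformers' library with DistilBERT; mean pooling over token vectors is one common way to obtain a fixed-length review vector and is an assumption here rather than the exact pooling used in the experiments.

```python
# Producing fixed-length review embeddings with DistilBERT (mean pooling assumed).
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")

def embed(texts):
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state        # (batch, tokens, 768)
    mask = inputs["attention_mask"].unsqueeze(-1)          # exclude padding tokens
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # mean-pooled review vectors

vectors = embed(["Great food and friendly staff.", "Worst stay ever!!!"])
print(vectors.shape)   # torch.Size([2, 768])
```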


Fake Review Detection Model
Our approach utilizes a supervised learning framework known as the stacking-based ensemble technique. Table 1 lists various classifiers sourced from the literature, including the multi-layer perceptron classifier, Random Forest classifier (bagging), logistic regression, k-nearest neighbor classifier (k = 3), and XGBoost classifier (boosting). The multi-layer perceptron classifier is a neural network with multiple layers of interconnected nodes that learns complex patterns, making it effective for tasks like image recognition and classification. The Random Forest classifier is an ensemble method that builds multiple decision trees during training and combines their predictions in classification tasks. Logistic regression is a statistical model used for binary classification, predicting the probability of an event occurring; it employs the logistic function to map input features into a probability range between 0 and 1. The k-nearest neighbor classifier classifies data points based on the majority class among their k (=3) nearest neighbors, making it straightforward and adaptable for various datasets. XGBoost is an optimized gradient boosting algorithm that combines weak learners (usually decision trees) to create a strong predictive model. The boosting technique sequentially builds a series of models, with each new model focusing on correcting errors made by the previous ones, enhancing overall predictive performance.
Classifiers can be combined via two approaches: voting or stacking. As the name suggests, voting adds an extra layer that decides the final outcome based on the majority rule from the base classifiers' predictions. Stacking, however, is a more complex approach where the base classifiers' predictions are used as input for another classifier, creating a layered approach. Through experimentation, we chose the XGBoost classifier as the meta-classifier because it delivered the best results among the prospective classifiers. We have used the stacking method with probabilities over voting as it provides better performance [77]. Furthermore, we have reported five-fold stratified cross-validation results across all the prominent metrics.
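The sketch below shows how such a stack can be assembled with scikit-learn's StackingClassifier, passing base-classifier probabilities to an XGBoost meta-classifier; the hyperparameters are placeholders, not the tuned values from the experiments.

```python
# Stacking heterogeneous base classifiers with an XGBoost meta-classifier.
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from xgboost import XGBClassifier

base_learners = [
    ("mlp", MLPClassifier(hidden_layer_sizes=(64,), max_iter=300)),
    ("rf", RandomForestClassifier(n_estimators=200)),
    ("lr", LogisticRegression(max_iter=1000)),
    ("knn", KNeighborsClassifier(n_neighbors=3)),
    ("xgb", XGBClassifier(eval_metric="logloss")),
]

stack = StackingClassifier(
    estimators=base_learners,
    final_estimator=XGBClassifier(eval_metric="logloss"),  # meta-classifier
    stack_method="predict_proba",  # feed class probabilities, not hard votes
    cv=5,                          # out-of-fold predictions train the meta-learner
    n_jobs=-1,
)
# stack.fit(X_train, y_train); stack.predict(X_test)
```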

Performance Evaluation
In order to evaluate the performance of the classifier, several metrics are used. Table 4 shows a binary confusion matrix. Then,

Accuracy = (TP + TN) / (TP + TN + FP + FN)

Precision = TP / (TP + FP)

Recall (Sensitivity) = TP / (TP + FN)

F1 score = 2 × (Precision × Recall) / (Precision + Recall)

Specificity = TN / (TN + FP)

The area under the curve (AUC) summarizes the ROC curve, which plots sensitivity (true positive rate) against 1 − specificity (false positive rate) at different threshold values. The closer this value is to 1, the better the classification.
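These metrics can be computed directly with scikit-learn, as in the sketch below; the fitted 'stack' model and the held-out arrays X_test and y_test are assumed to exist from the previous steps.

```python
# Evaluating the classifier with the metrics listed above.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score, average_precision_score)

y_pred = stack.predict(X_test)
y_prob = stack.predict_proba(X_test)[:, 1]   # probability of the positive class

print("Accuracy         :", accuracy_score(y_test, y_pred))
print("Precision        :", precision_score(y_test, y_pred))
print("Recall           :", recall_score(y_test, y_pred))
print("F1 score         :", f1_score(y_test, y_pred))
print("AUC              :", roc_auc_score(y_test, y_prob))
print("Average precision:", average_precision_score(y_test, y_prob))
```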

Model Evaluation
To address Research Questions 1 (RQ1) and 2 (RQ2), a series of empirical experiments were conducted, involving the systematic manipulation of feature sets, embedding styles, and dataset sizes. A total of five distinct experiments were executed, delineated by variations in both embedding and feature configurations. Specifically, Experiment 1 was designed to investigate classification performance solely utilizing engineered features. Subsequent experiments, namely Experiments 2 through 5, incorporated the fusion of features alongside word embeddings. Experiment 2 explored the amalgamation of features with BERT embeddings, while Experiments 3 and 4 sequentially incorporated ALBERT and DistilBERT embeddings. Experiment 5 culminated with the utilization of RoBERTa embeddings. Within Experiment 1, a progressive strategy was adopted, commencing with the standalone utilization of the derived feature sets and subsequently evaluating their combined effect. Detailed graphical representations of our model's performance across varied evaluation metrics were generated for the YelpZip, YelpChi, and YelpNYC datasets (Figures 4-6, respectively). Notably, the delineated feature sets encompass user-centric (FU), product-centric (FP), review-centric (FR), and the composite of all features (FA).
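One such configuration is sketched below: the engineered features are concatenated with the review embeddings and evaluated with five-fold stratified cross-validation; the variable names (features, embeddings, labels) are illustrative.

```python
# Sketch of one experiment: engineered features + embeddings, 5-fold stratified CV.
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate

X = np.hstack([features, embeddings])   # e.g., 37 engineered features + 768-dim BERT vectors
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

scores = cross_validate(
    stack, X, labels, cv=cv,
    scoring=["accuracy", "roc_auc", "f1", "average_precision", "recall"],
)
for name, values in scores.items():
    if name.startswith("test_"):
        print(name, round(values.mean(), 4))
```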
Figure 4 elucidates that the stacking technique consistently yielded optimal outcomes across diverse scenarios. Particularly, for the expansive YelpZip dataset, the model achieved notable performance metrics, with an accuracy of 83.89%, an average precision of 92.93%, a recall of 79.22%, an AUC of 91.46%, and an F1 score of 88.12%. Analogous trends were echoed in the context of the YelpChi (Figure 5) and YelpNYC (Figure 6) datasets. For a comprehensive assessment of the model's performance, readers are encouraged to review Appendix A. Detailed examination of these results indicates that feature engineering alone, as opposed to its combination with various large language models, is exceptionally effective. This underscores the idea that in the context of detecting counterfeit reviews, the use of feature engineering and the extraction of relevant features are fundamentally more appropriate than employing resource-heavy neural network-based large language models.

Additionally, it can be noticed from Figures 4-6 that the results of the stacking classifier were closest to the best-performing classifier. It was noted that user-related features contributed significantly to the classification performance, both when used alone and when combined with other features or embeddings.

To explore the explainability aspect of our model, we plotted the feature importance using the full set of features. Since our model is based on a stacking architecture, it is not advisable to depend upon the importance plot of individual classifiers such as XGBoost or Random Forest. Researchers propose various approaches, such as LIME [78] or DeepLIFT [79], to improve the interpretability of the model. For our study, we employed SHAP (Shapley Additive exPlanations), as recommended by [80]. SHAP is a comprehensive, model-agnostic method that amalgamates different feature importance techniques previously developed by researchers. In a nutshell, it performs sensitivity analysis based on accuracy. The SHAP values allow us to understand any prediction or classification as the sum effect of the features. Figure 7 shows the beeswarm plot of feature importance based on SHAP values.

As shown in Figure 7, user-centric features dominate the top five features in all three datasets. For the YelpZip dataset, the plot suggests that an increased number of reviews written by a user in a single day can decrease the model's accuracy by as much as 6%. This observation is consistent with the model performance, which indicates that user-centric features generally lead to improved performance of the model. The number of ratings provided by the user on a day, the average rating provided by the user, and the average number of words written by the user in a review are consistently among the top five features.

The feature importance plot (Figure 7) also shows that behavioral features are more helpful in identifying fake and genuine reviews than linguistic features. The input variables are arranged in top-to-bottom order as per their mean absolute SHAP values for the first 1000 reviews in the test dataset. For Figure 7a, the values for 'day_Urating' can be interpreted as follows: if the number of ratings provided by a user in a day is higher, there is a lower chance of the user being truthful. Figures 8-10 show the local explanation of a fake review record using the SHAP force plot in each of the datasets. The plot is read with respect to the binary target, fake (Label = 0) or truthful (Label = 1). In the plot, the bold 0.00 is the model's score for this observation. Higher scores lead the model to predict 1, and lower scores lead the model to predict 0. The features that were important in making the prediction for this observation are shown in red and blue, with red representing features that pushed the model score higher and blue representing features that pushed the score lower. Features that had more of an impact on the score are located closer to the dividing boundary between red and blue, and the size of that impact is represented by the size of the bar. Again, behavioral features are significant in deciding to classify a record as fake or truthful.
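For readers who wish to reproduce the explainability analysis, the sketch below treats the stacked model as a black box and uses SHAP's model-agnostic KernelExplainer; the background-sample size, the 1000-review subset, and the assumption that X_test is a NumPy array of engineered features with a matching feature_names list are illustrative choices, not the study's exact settings.

```python
# Model-agnostic SHAP analysis of the stacked classifier (illustrative settings).
import shap

background = shap.sample(X_test, 100)                       # small background distribution
explainer = shap.KernelExplainer(lambda d: stack.predict_proba(d)[:, 1], background)
shap_values = explainer.shap_values(X_test[:1000])           # first 1000 test reviews

# Global view: beeswarm-style summary of feature importance.
shap.summary_plot(shap_values, X_test[:1000], feature_names=feature_names)

# Local view: force plot explaining a single prediction.
shap.force_plot(explainer.expected_value, shap_values[0], X_test[0],
                feature_names=feature_names, matplotlib=True)
```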

Benchmarking
The proposed model is benchmarked against existing models and frameworks. The following popular works based on the Yelp datasets are considered:

1. SpEagle [26]: An unsupervised learning approach (Spam Eagle) capable of integrating and scaling with labeled data, using metadata along with relational data. The authors were the ones who curated the datasets. The performance of their model was tested on all three datasets.

2. Ref. [21]: Proposed an effective multi-feature-based model. They suggested some new features and conducted a performance evaluation based on burst features. The authors considered only the YelpZip dataset.

3. Ref. [22]: Developed a novel hierarchical supervised learning approach that analyzes user features and their interactions in univariate and multivariate distributions. They also used the YelpZip dataset for modeling purposes.

4. SPR2EP [81]: Proposed a semi-supervised framework (SPam Review REPresentation) for fake review detection, which uses feature vectors extracted from reviews, reviewers, and products. After combining these vectors for detection purposes, they demonstrated the performance of their model on all three datasets.

5. HFAN [66]: Hierarchical Fusion Attention Network (HFAN) is a deep learning-based technique that automatically learns reviews' semantics from the user and product information using a multi-attention unit.

6. Ref. [60]: Proposed a convolutional neural network-based architecture connecting sentiment-dependent linguistic features and behavioral features via a fully connected layer to determine fake and genuine reviewers.

7. Ref. [82]: Proposed an integrated multi-view feature strategy, blending implicit and explicit features from review content, reviewer data, and product descriptions. They introduced a hybrid extraction method, combining word- and sentence-level techniques with attention. This extends to a classification framework with an ensemble classifier leveraging a convolutional neural network (CNN) for reviewer information, a deep neural network (DNN) for product-level analysis, and a Bidirectional Long Short-Term Memory (Bi-LSTM) network for review-level features. This comprehensive methodology aims to enhance analysis effectiveness across diverse dimensions.

8. Ref. [83]: Used a fine-tuned version of BERT to identify fake reviews just from review text.
As can be seen from Table 5, most of the work has been done on the YelpZip dataset; one reason for this could be the availability of a large amount of labeled data as compared to the other two datasets. Furthermore, many authors have not measured the performance of their models on other well-defined and accepted metrics such as accuracy, recall, and F1; most have evaluated their models against the AUC metric alone. As apparent from Table 5, our model has delivered better performance compared to the other models on almost all the performance metrics across the three datasets. It is crucial to employ a model that exhibits robustness across various metrics, and our model successfully bridges this gap. The results also demonstrate the effectiveness of our approach of stacking the output of heterogeneous classifiers to reliably detect spam or fake reviews in both small and large datasets. Furthermore, our model is resource-efficient, unlike deep learning models that are complex to implement and scale, are resource-hungry, and require heavy infrastructural investment. Our model's performance is comparable to the advanced deep learning model proposed by [66]. These obvious advantages of our model over other models make our work more practical and applicable for real-world deployment.

Conclusions
This study is dedicated to addressing the pervasive challenge of detecting counterfeit reviews within the hospitality and restaurant domain. Our approach centers around the development of a stacking-based model, which ingeniously amalgamates the outputs of divergent base classifiers and harnesses a meta-classifier to discern the veracity of a given review: whether it is genuine or fabricated. The framework leverages a comprehensive suite of user-centric, product-centric, and review-centric features, collectively providing a multifaceted perspective on review authenticity.
The potency of our strategy resides in the stacking technique's ability to consolidate the predictive capabilities of individual classifiers, culminating in an elevated overall efficiency. Evidently, the endeavor culminates in the establishment of a model proficient in discerning between fake and truthful reviews with commendable accuracy levels. Intriguingly, our model exhibits superior performance in comparison to well-established works across a spectrum of relevant performance metrics. Furthermore, the framework's adaptability manifests in its ability to robustly operate across varying dataset sizes, rendering its utility independent of the dataset's magnitude.
The ethical landscape pertaining to review authenticity is intricate, especially concerning the deployment of filtering algorithms designed to distinguish fraudulent or suspicious reviews. While these algorithms aim to uphold the quality and reliability of reviews, the potential for false positives (incorrectly classifying authentic reviews as fraudulent) poses a risk, potentially tarnishing the reputations of businesses or individuals. This gives rise to concerns about algorithmic bias and its implications for stakeholders. To mitigate this, we advocate for the labeling of reviews identified by the model as 'not suggested reviews' rather than outright deletion or labeling them fake/manipulated. Additionally, triangulating model findings with supplementary data sources, such as IP addresses, reviewer location, reviewer history, and businesses' history, can further enhance accuracy and reduce false positives. Routine model retraining involving the deliberate introduction of fake reviews contributes to increased accuracy and resilience against false detection.
Furthermore, ethical challenges extend to the transparency of the filtering process and the disclosure of filtered reviews. Users may lack full awareness of the criteria guiding review identification and filtering, hindering their ability to evaluate information reliability. Balancing platform integrity and user transparency emerges as a nuanced ethical consideration. Addressing this challenge, our model prioritizes interpretability, facilitating a clear understanding of features significantly contributing to fake review identification.
The implications of this investigation transcend the confines of the hospitality and restaurant sector, extending to domains like movie reviews and e-commerce. This affords a broader utility for our model's insightful mechanisms. The hospitality and tourism review aggregators can embrace our model as an instrument to identify potentially inauthentic reviews, thus embarking on a path of enhanced credibility. Acknowledging the inherent uncertainty surrounding the authenticity of reviews, we advocate for a 'not-suggested' annotation for reviews flagged by our algorithm. This approach empowers website visitors to make more informed choices safeguarded against the ambiguity of potentially deceptive reviews.
As the reliance on reviews surges among consumers making purchasing decisions, it is imperative for stakeholders to proactively establish regulations and mechanisms to thwart the influence of counterfeit reviewers. Our recommendations extend beyond mere detection mechanisms. We propose a preemptive approach, suggesting the integration of a pop-up prompt before a review is submitted. This prompt would serve as a reminder to users, encouraging them to ensure their reviews adhere to the platform's ethical guidelines. The wisdom advocated by [32] underscores the importance of consumer awareness and regulatory vigilance, effectively shielding prospective consumers from the deleterious impact of counterfeit reviews.

Theoretical Contributions
The proliferation of counterfeit reviews presents a dual menace, impacting both consumers and businesses. The increasing digital footprint of those who have grown up in the digital era heightens their vulnerability to misleading and deceptive information. This represents a significant concern as it has the capacity to profoundly skew decision-making processes by introducing cognitive biases. In light of this challenge, the imperative for a robust and scalable mechanism to counteract this influx of misinformation becomes strikingly evident.
Our contribution to the scholarly landscape is multifaceted. Firstly, our work extends the existing corpus of the deception detection literature through the development of an efficient system adept at identifying suspicious reviews. This accomplishment is underpinned by the formulation of a stacking-based framework adept at harnessing the capabilities of diverse underlying classifiers. The precedent established by [84], who validated the superior performance of this architecture through simulations using real-world data, highlights its effectiveness. Unlike deep learning-based architectures, these models demonstrate accelerated convergence and adaptability to varying data sizes. A notable shift from previous research emerges, affirming the pre-eminence of user-centric behavioral features over their linguistic counterparts. This underscores that the inherent characteristics of users exert a more profound influence on the classification process.
Secondly, the enhancement of model interpretability constitutes a pivotal facet of our approach. Rather than relying on conventional feature importance plots prevalent in the literature, our strategy leverages SHAP values. This choice enables a deeper level of insight, as SHAP-based importance plots not only unveil feature significance but also elucidate how variations in feature values impact the model's outcomes. Furthermore, as corroborated by [85], conventional feature importance plots are inherently sensitive to the chosen methodology, posing a concern over robustness.
Our endeavor encompasses the introduction of novel features, as underscored by the insights gleaned from Table 5. Importantly, the architectural framework we have devised operates as a modular entity, thereby accommodating the integration of more efficient classifiers as they emerge in the future research landscape. Such adaptability can be achieved through minimal code adjustments, rendering our model remarkably customizable at a negligible expense.
In summary, empirical tests of our model against recognized benchmarks confirm its effectiveness. It consistently surpasses standard models in various evaluation metrics, regardless of the size of the dataset. This empirical evidence highlights the strength, robustness, and dependability of our approach.

Managerial Implications
In terms of managerial implications, the pervasive issue of counterfeit reviews poses a significant threat to trust and credibility within the review ecosystem. Platforms like Yelp, reporting the filtration of nearly 25% of their reviews, underscore the gravity of the challenge [34]. Notable instances such as Oobah Butler's experiment in 2017, where a fictitious profile became a top-rated restaurant on Tripadvisor, further accentuate the vulnerability of such platforms.
Our work provides a practical solution to the problem of counterfeit review identification. The proposed method distinguishes itself by its ease of implementation, circumventing the complexities often associated with academic solutions. Its lightweight nature, swift convergence, and ability to flag fraudulent reviews at an acceptable threshold align seamlessly with existing operational standards. Moreover, the ordered importance of features offers review aggregator platforms actionable insights for identifying cues in reviewer comments, augmenting their mechanisms to combat fraudulent activities.
Beyond its immediate application, the generalizability and scalability of our approach extend its potential to diverse domains grappling with the scourge of fake reviews. The far-reaching implications of our study empower review platforms to mitigate the incursion of counterfeit reviews. Such liberation from the influence of fraudulent reviewers engenders a greater degree of trust in the disseminated information, culminating in heightened user traffic and ultimately translating into enhanced revenue for businesses.
From the consumer perspective, our study holds the promise of furnishing them with credible and dependable information for decision-making purposes. This will help consumers make more informed purchases, leading to less post-purchase dissonance and a more satisfactory experience.
Our model has the potential to significantly enhance the trustworthiness of the internet, particularly in the context of user reviews. By effectively identifying and flagging suspicious reviews, it acts as a robust deterrent against the proliferation of counterfeit feedback, thus mitigating the distortion of online information. This not only empowers users to engage with more reliable content but also instills confidence in the integrity of the digital space. An additional layer of transparency allows users to discern between reviews that may lack authenticity. In turn, businesses that heavily depend on user reviews stand to benefit immensely. The model's capability to distinguish between genuine and fraudulent reviews shields businesses from potential reputational harm caused by deceptive feedback. This safeguarding of reputations contributes to building and maintaining the credibility of businesses in the online sphere. Moreover, the adaptability of our approach across diverse dataset sizes ensures that businesses of varying scales can leverage the model to enhance their online presence.

Societal Implications
Our efforts to address counterfeit reviews wield profound societal impact by fostering a culture of authenticity and trust in the digital realm. In a broader context, our model not only benefits consumers and businesses but contributes to shaping responsible online behavior. By empowering users to make informed choices and safeguarding against misinformation, we align with societal goals of promoting consumer rights and digital literacy. The adaptability of our approach ensures that even smaller businesses integral to local economies can enjoy enhanced credibility. Ethical considerations embedded in our model, such as the 'not-suggested' annotation and transparency in the filtering process, reflect a commitment to responsible technology use. Our proactive approach, exemplified by the integration of a pop-up prompt, sets a precedent for ethical platform management, influencing the broader landscape. Moreover, our emphasis on interpretability contributes to increased transparency and accountability, enabling users to critically evaluate online information. In essence, our work goes beyond the technical realm, actively participating in the ongoing discourse on responsible technology development with the overarching aim of creating a digital landscape that positively impacts society at large.

Limitations
Our study encounters significant limitations originating from the choice of the foundational classifier. The configuration encompassing the count and nature of both the base classifiers and a meta-classifier wields substantial influence over the model's performance. A subsequent limitation pertains to the temporal relevance of our database, which potentially renders it an imperfect representation of contemporary lexical usage. Notably, features such as punctuation count, lexical diversity, and lexical density are susceptible to this dynamic linguistic landscape. This phenomenon has been examined by [57], who unveiled its repercussions on classifier efficacy within a comparable context. An important limitation arises from the dataset's balance or lack thereof. It is imperative to examine how our model will perform when confronted with an imbalanced dataset where the distribution of instances across different classes is uneven. Understanding the model's behavior under such conditions is critical as it may impact its accuracy in real-world scenarios.

Future Work
For forthcoming endeavors, we are inclined to subject our model to assessment across alternate datasets. This proactive measure seeks to bolster the model's adaptability and reinforce its empirical validity. Furthermore, our model awaits validation against expansive language models, such as GPT-4 and ChatGPT, which are capable of generating synthetic text dynamically. The cautionary standpoint underscored by [86] regarding the uncritical adoption of artificial intelligence without vigilant scrutiny for inherent biases is salient. To build a more resilient model, we intend to scrutinize our data for potential biases, a step aimed at enhancing its robustness in the face of inherent complexities. Yet another future direction could be to enhance the model to detect fake reviews posted by new users against whom the model lacks behavioral data.
An integral aspect of our future research trajectory involves the construction of a tangible regulatory framework firmly grounded in practical principles. This framework aspires to empower review aggregators in their efforts to combat the disruptive and malicious conduct of fake reviewers. Moreover, our model primarily integrates features derived from the textual content of reviews. Incorporating aspects of social networks, like reviewer interactions, could add a new layer of complexity to the model. This addition has the potential to enhance its performance even further.

Figure 1. Proposed framework for fake review detection.

Figure 2. Correlation among the features of the YelpZip dataset.

Figure 4. Graph showing the model's performance on the YelpZip dataset.

Figure 5. Graph showing the model's performance on the YelpChi dataset.

Figure 6. Graph showing the model's performance on the YelpNYC dataset.

Figure 7. Feature importance plot for all three datasets.

Figure 8. Force plot of a fake record in the YelpChi dataset.

Figure 9. Force plot of a fake record in the YelpNYC dataset.

Figure 10. Force plot of a fake record in the YelpZip dataset.

Table 1. Review of the selected literature on fake review detection in hospitality.

Table 2. Details of the Yelp datasets.

Table 3. Description of the features extracted from the dataset.

Table 4. A binary confusion matrix.

1 Figures in bold are the cases where our model outperformed the existing state of the art.

Table A2. Classifier performance on the YelpChi dataset.

Table A2. Cont. Cumulative distribution function of YelpChi dataset features.

Table A3. Cont. Cumulative distribution function of YelpNYC dataset features.