1. Introduction
Recommender systems extract patterns from user behaviour and preference data to reduce information overload in product and service selection [1]. Beyond traditional user behaviour data, such as clicks, ratings, and browsing history, recent approaches incorporate psychological user traits, including personality. By incorporating information from personality models, recommenders can capture more stable user traits that shape preferences, interaction patterns, and engagement behaviours [2]. When extracted from user-generated text (e.g., reviews), personality has been shown to improve recommendation performance [3], particularly in addressing the cold start problem and recommendation novelty. However, the sparsity of textual data for each user limits the reliability of personality inference, and the application of personality in restaurant recommender systems remains underexplored.
Recommenders, particularly restaurant recommenders, aim to enhance customer experience and satisfaction through personalisation. Food choice is a crucial activity during vacations that shapes travellers’ experiences. The problem of overchoice—stemming from an abundance of available options—has been widely studied across psychology, marketing, and decision science, which has motivated the use of recommender systems. Recently, interest in food experience analysis has emerged [4], focusing on a deeper understanding of customers’ food preferences to improve recommendations [5] and restaurant selection [6]. Traditional restaurant recommendation approaches extract user preferences from structured information in users’ historical records (e.g., review ratings, purchases) or collect them explicitly from users via questionnaires. Such techniques rely heavily on rating data and basic demographic information, which do not fully capture the nuanced preferences that influence customers’ choices. More recent work leverages electronic word of mouth (eWOM) to extract emotion and personality [7,8], which influence motivation, purchasing decisions, preferences, perception of a service [9,10], satisfaction, and consumer behaviour [7]. Personality has been proven to improve recommenders’ performance [11], since people with similar personalities have similar preferences and needs [12]. Additionally, the automated extraction of personality scores from user-generated textual content addresses the bias and limitations inherent in traditional self-reported surveys (e.g., response bias arising when respondents provide socially desirable answers). Despite these advances, the integration of personality into restaurant recommendation systems remains limited.
In the context of consumer behaviour, businesses are often described using human characteristics [13], a concept referred to as brand personality [14,15,16]. This enables businesses to better engage with their customers, since people tend to interact with brands as if they were human entities. Brand personality represents the set of human traits that consumers associate with a brand, enabling the formation of an emotional connection that can lead to increased loyalty and marketplace success. The relationship between a business and its customers is, therefore, fundamental and can be explained by personality–brand congruence (PBC) theory, which suggests that consumers are more likely to be attracted to brands whose projected personalities are congruent with their own [15,17,18]. Seminal work on brand personality by Aaker [16] explains how consumers perceive brands along five dimensions: sincerity, excitement, competence, sophistication, and ruggedness. Among these, sincerity, excitement, and competence were found to correlate directly with three of the Big Five human personality traits [17], specifically agreeableness, extraversion, and conscientiousness [16,17]. In this work, we utilise these three traits to express brand personality and assess it in a manner consistent with prior work [18], which derives brand personality from social media text. However, unlike previous work [18], we utilise a large language model (LLM) and linguistic style analysis to assess personality, rather than a dictionary-based approach. In the context of the restaurant business, PBC explains why customers whose personalities are similar to a restaurant’s personality are more likely to visit or revisit such venues. Such PBC insights are utilised by businesses to build relationships with consumers and differentiate themselves from competitors [14]. However, such strategies are mainly manual (i.e., they use questionnaires). To our knowledge, current recommendation approaches do not utilise PBC to improve recommendations, and very few studies have investigated the use of personality for restaurant recommendations [5]. This work aims to address this research gap.
Users’ food preferences and opinions about venues constitute another critical factor in restaurant recommendations. Such information is typically collected through explicit questionnaires, which impose a high user burden and are prone to bias. This approach has also been criticised because users do not always know what they want, and the recency and frequency of their experiences tend to dominate their answers. More advanced methods for extracting consumer preferences use online behaviour data, such as browsing paths on websites and idle times. However, obtaining such information can be challenging when building recommender systems for various businesses. By contrast, eWOM in the form of online reviews provides a rich source of customer perceptions from different service engagements, expressed in both textual and quantitative forms, and has been utilised to externalise preferences and opinions. Nevertheless, research on preference analysis from online reviews remains scarce.
We hypothesise that review text provides valuable information about customers’ preferences (e.g., food) that cannot be inferred from users’ scale evaluations (e.g., food rating, environment rating) alone, since these scales evaluate generic factors assumed by the designer of the social media platform (e.g., TripAdvisor). Customers might have preferences and opinions that fall outside the scope of these scales and thus cannot be captured through such approaches. In addition, customers’ past reviews provide a history of previous decisions and opinions that can be extracted from text; frequent opinions about certain foods (positive or negative) or food consumption patterns, irrespective of user ratings, therefore provide information about food preference (a customer repeatedly ordering a dish signals that they like it) [19]. Accordingly, we extract fine-grained food preferences directly from review text.
This study extends previous work that introduced personality to enhance restaurant recommendations [20] by incorporating the concept of venue personality grounded in PBC; automated user preference extraction from eWOM text; an optimised topic modelling component for extracting relevant user opinions (topics) about venues; and a novel personality classifier that improves personality recognition on out-of-distribution data [21] by exploiting linguistic information in review text, combined with neural collaborative filtering (NCF) [22]. The proposed method utilises and combines two types of information from eWOM, namely derived and direct. The former refers to information extracted from eWOM text, such as venue and consumer personalities, customer food preferences, and opinions about venues, using text classification, topic modelling, and named entity recognition. Direct information refers to restaurant properties that are explicitly rated by users, such as price, value for money, service quality, atmosphere, and the offered cuisine, as provided on the review platform. Derived and direct information are jointly used to train and test an XGBoost regression model [23] to predict restaurant ratings. We hypothesise that integrating derived and direct information enriches the algorithm’s input, enabling it to learn more generalisable patterns and generate more accurate recommendations.
The guiding research question is whether the combination of derived and direct information from eWOM improves restaurant recommendations when compared to baseline recommender methods (such as neural collaborative filtering and matrix factorisation). In responding to this question, this work makes four methodological contributions:
A novel personality classifier for deriving customer personality from reviews that outperforms baseline machine learning (ML) methods (trained on secondary data);
The introduction and evaluation of the concept of venue personality based on PBC [15];
Automated extraction of food preferences using a custom named-entity recogniser;
Opinion inference via topic modelling to assess its impact on recommendation performance.
Overall, the proposed approach is among the first to integrate personality traits and personality–brand congruence into restaurant recommendations, combining heterogeneous eWOM-derived and explicit information sources to achieve superior performance over traditional models.
The paper is organised as follows. The next section reviews the literature on recommender systems and techniques for extracting user preferences and personality from text. This is followed by a description of the method and the empirical results obtained from applying it to a custom dataset and comparing its performance with that of mainstream recommender methods. The paper concludes with a discussion of findings, implications for management, and future directions.
3. Methodology
The proposed methodology aims to improve restaurant recommendations by combining direct information (quantitative information extracted from review websites, such as ratings for food, price, and cuisine offered) with information derived from eWOM text. Derived information includes users’ personalities and opinions extracted through a novel personality classifier and a topic model built from user reviews. Users’ preferences are extracted using NCF, similar to [69], and food preferences are identified using a custom NER. The performance of the proposed approach is evaluated using various recommender system metrics and compared against traditional and neural network-based techniques, such as the two-tower model.
Figure 1 illustrates the overall workflow of the proposed methodology, which is implemented through the steps detailed below. An ablation study is conducted by progressively adding feature groups to the model, allowing the individual contribution of each component to recommendation performance to be analysed.
3.1. Data Collection [Step 1]
Restaurant reviews were collected from TripAdvisor using a dedicated web crawler (primary dataset). Consumers’ eWOM and additional direct information on restaurant features, such as price range, atmosphere, service, and value for money, were also collected. The dataset consists of approximately 255 thousand English-language reviews posted by consumers who visited restaurants in Cyprus between 2010 and 2021. It includes 52 thousand unique users and 2615 restaurants. For this study, only users with at least 6 reviews and restaurants with at least 50 reviews are considered, to reduce user–item matrix sparsity and ensure sufficient textual data for personality extraction. These users typically correspond to tourists who stay for extended periods or revisit the destination regularly. This filtering resulted in 774 restaurants and 5595 users. Reviewers who systematically posted identical positive messages across venues were removed, as these were indicative of fake reviews. Consequently, four users were eliminated along with their corresponding reviews. In addition, reviewers who systematically rated particular restaurants negatively and others positively in the same location were eliminated, as these were also considered fake reviewers aiming to benefit certain venues while diminishing the reputation of others (e.g., nearby competitors). To identify such users, reviews were grouped by user and restaurant location. Restaurant locations were determined using TripAdvisor’s location ID metadata. For each user who rated multiple restaurants in a particular location, a rating matrix was constructed: its columns represent the restaurants in that location, its rows represent the user’s individual reviews, and its cells contain the corresponding ratings. For each user–location matrix, a paired sample t-test is performed between all pairs of the matrix’s columns (i.e., the ratings of two restaurants).
The process is repeated for each user and location in the dataset. A t-test with a mean difference between two restaurants’ ratings greater than 3 (e.g., a user rating one restaurant with 5 and another with less than or equal to 2) and a p-value < 0.01 indicates biased behaviour favouring one restaurant and diminishing another in the same location. The test was applied only when at least three negative reviews were present in a matrix column, to account for cases where users revisited a venue following an initial negative experience. As a result of this process, one additional user account and its reviews were eliminated. The final dataset contains 31,597 reviews from 5591 unique users across 774 restaurants.
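The biased-reviewer check described above can be sketched as follows. This is a minimal illustration with a hand-rolled paired t-statistic and hypothetical restaurant names; in practice a library routine such as scipy.stats.ttest_rel would also supply the p-value, which is omitted here for brevity.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, stdev

def paired_t(a, b):
    """Paired-sample t statistic for two equal-length rating columns."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / sqrt(len(d)))

def flag_biased_pairs(user_location_matrix, diff_threshold=3, min_negatives=3):
    """Return restaurant pairs a user rates in a systematically biased way.

    `user_location_matrix` maps restaurant id -> list of this user's ratings
    at one location. A pair is flagged when the mean rating difference
    exceeds `diff_threshold` and the disfavoured column holds at least
    `min_negatives` negative reviews (the p-value check is omitted here)."""
    flagged = []
    for (r1, a), (r2, b) in combinations(user_location_matrix.items(), 2):
        n = min(len(a), len(b))
        low = a if mean(a) < mean(b) else b
        # apply the test only when there are enough negative reviews
        if sum(1 for x in low if x <= 2) < min_negatives:
            continue
        if abs(mean(a) - mean(b)) > diff_threshold:
            flagged.append((r1, r2, paired_t(a[:n], b[:n])))
    return flagged

ratings = {"taverna_A": [5, 5, 5], "taverna_B": [1, 1, 2]}
print(flag_biased_pairs(ratings))  # the single pair is flagged as biased
```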
Secondary data used in this study include the stream-of-consciousness essay (BIG5) dataset [70] and the MBTI dataset [71,72], which are employed for personality classification. The BIG5 dataset comprises 2468 essays written by individuals, annotated with personality labels [70]. The MBTI dataset consists of social media posts labelled by personality type, as defined using the MBTI questionnaire. The dataset is publicly available on Kaggle [72] and contains 8675 rows corresponding to users’ posts on the social network personalitycafe.com, annotated with personality labels. The dataset was constructed by first asking users to complete an MBTI questionnaire, after which they engaged in discussions with other users on the platform. Each of the 8675 users contributed 50 posts, which were concatenated per user and separated with the delimiter “|||”.
Preprocessing was performed prior to model training to ensure that the secondary data did not include tags (e.g., personality type indicators) or the sentence separator “|||” used by the dataset’s curators, which could overfit the classifier despite having no occurrence in our primary dataset (i.e., a domain shift problem). In contrast to previous work [73], eliminating the curator-specified tags degraded classifier performance; however, this was necessary since these tags are not present in our primary data. To further improve the model’s generalisation to our sample data, we enhanced the preprocessing so that the training data include generic features relevant to personality, such as POS tags and POS sequences, in addition to text embeddings from BERT. Furthermore, emoticons (e.g., “:D”, “:P”) were converted into text. Words written entirely in uppercase were annotated with additional information in the text, indicating that the author feels strongly about something, since this behaviour differs among personalities. Repeated punctuation, such as exclamation marks or periods, was also converted into textual form reflecting the author’s writing style (for example, the word “emphasis” is inserted when multiple exclamation marks are encountered, and “etc.” is used for multiple consecutive periods). Finally, text abbreviations and contractions were expanded, joined words were split into separate words, and repeated characters in words were eliminated.
3.2. Text Preprocessing of Primary and Secondary Datasets [Step 2]
Data preprocessing and preparation are common and necessary steps for the subsequent analyses (i.e., topic modelling, named entity recognition, and personality classification). These procedures include the elimination of punctuation, URLs, numbers, and stop-words; lowercasing; inserting spaces in long joined words and breaking them into separate words; removing repeated characters in words (e.g., “yeaaahhhh” becomes “yeah”); and expanding contractions and text abbreviations (e.g., “don’t” or “dnt” to “do not”). Additionally, annotations (tags) within the text specified by the MBTI dataset curators, such as tags on certain words, are removed to generalise the data prior to personality model training.
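A minimal sketch of these normalisation rules, using simple regex heuristics; the contraction table and the “emphasis” style marker are illustrative stand-ins, not the exact rule set used in the study.

```python
import re

# illustrative contraction/abbreviation table (the real table is larger)
CONTRACTIONS = {"don't": "do not", "dnt": "do not", "can't": "can not"}

def preprocess(text):
    text = text.lower()
    # expand text abbreviations / contractions
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    # multiple exclamation marks signal emphasis in the author's style
    text = re.sub(r"!{2,}", " emphasis ", text)
    # collapse characters repeated 3+ times ("yeaaahhhh" -> "yeah")
    text = re.sub(r"(.)\1{2,}", r"\1", text)
    # strip URLs, then numbers and punctuation
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("Yeaaahhhh!!! don't miss it http://t.co/x"))
# -> "yeah emphasis do not miss it"
```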
3.3. Topic Modelling [Step 3]
Topic modelling, in particular the structural topic model (STM) technique [74], is employed in this step to infer the themes consumers discuss in eWOM. In general, topic models employ statistical models to identify topics arising in a collection of documents [75]. Each topic represents a set of words that frequently occur together in a corpus, and each document is associated with a probability distribution over the topics that appear in it. Restaurant and user opinions are produced by averaging the topics’ theta values (representing the distribution of topics over documents) associated with each restaurant/user. These represent common consumer opinions per restaurant and common topics that characterise users (preferences). The optimal number of topics that best describes the dataset is identified through an iterative process of examining different values for the number of topics (K) and inspecting the semantic coherence and held-out likelihood until a satisfactory model is found [74]. Coherence measures the semantic consistency of high-scoring words within a given topic and serves as an indication of the interpretability and meaningfulness of that topic. Held-out likelihood tests a trained topic model against a test set of unseen documents, with higher values indicating a statistically strong topic model. Exclusivity measures the extent to which the top words for each topic do not appear as top words in other topics. The topics are named by domain experts using the most prevalent words that characterise each topic.
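Once the STM yields a document–topic (theta) matrix, the per-restaurant (or per-user) opinion vectors described above are simple group averages. A stdlib sketch with hypothetical two-topic theta rows:

```python
from collections import defaultdict

def average_thetas(doc_thetas, doc_groups):
    """Average document-topic distributions per group.

    doc_thetas: list of topic-probability vectors, one per review.
    doc_groups: parallel list of restaurant (or user) ids."""
    sums, counts = {}, defaultdict(int)
    for theta, group in zip(doc_thetas, doc_groups):
        if group not in sums:
            sums[group] = [0.0] * len(theta)
        sums[group] = [s + t for s, t in zip(sums[group], theta)]
        counts[group] += 1
    return {g: [s / counts[g] for s in vec] for g, vec in sums.items()}

thetas = [[0.8, 0.2], [0.4, 0.6], [0.1, 0.9]]
groups = ["rest_1", "rest_1", "rest_2"]
print(average_thetas(thetas, groups))
# rest_1 ≈ [0.6, 0.4], rest_2 ≈ [0.1, 0.9]
```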
3.4. Food Preference Extraction [Step 4]
The extraction of food preferences assumes that consumers who visit multiple restaurants and write numerous reviews about the foods they consume indirectly indicate their food preferences. Named-entity recognition (NER) is used to extract customers’ food preferences. An NER entity can refer to any concept of interest (e.g., food types, locations, products). Existing food NER models (such as those in NLTK, SpaCy, and Stanford NER) were evaluated and deemed inappropriate for our analysis, as none were trained on labelled data for Cypriot dishes [76]. Domain-specific applications require different types of entities to be identified by NER models; thus, an existing NER had to be customised for the task. To create or fine-tune an NER, text labelled with the entities of interest must be provided, and a rule-based approach can be used to annotate the text using grammatical rules and linguistic terms. The SpaCy library was utilised, as it demonstrated superior performance when customised compared to alternative libraries [77]. SpaCy comes with a pretrained NER model that can be fine-tuned to different tasks using labelled data. This was an essential step, since the SpaCy NER did not recognise Cypriot foods. Customising the SpaCy NER to identify food entities in reviews required training the model with additional cases containing custom food words. Thus, the rule-based technique was used to extract sentences referring to the consumption of food from the local cuisine. For this task, only reviews from traditional restaurants—tavernas—were utilised. The identified cases were used to fine-tune the original NER. The identified foods were added to a food dictionary and used as the vocabulary during TF-IDF vectorisation of customers’ reviews. The cumulative TF-IDF score for each food entity across all reviews per user serves as a proxy for food preference. This assumes that when customers comment on the food they consume in different restaurants, irrespective of their ratings, they provide information about their food preferences.
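The cumulative TF-IDF scoring over the food dictionary can be sketched as follows. This is a hand-rolled TF-IDF restricted to a food vocabulary (raw term frequency, smoothed IDF, as in common vectoriser implementations); the dish names and tokenised reviews are illustrative.

```python
from collections import Counter
from math import log

def food_preferences(user_reviews, food_vocab):
    """Cumulative TF-IDF score per food entity over one user's reviews.

    user_reviews: list of tokenised reviews; food_vocab: foods found by
    the fine-tuned NER. Higher scores indicate stronger food preference."""
    n = len(user_reviews)
    df = Counter()  # document frequency of each food across reviews
    for review in user_reviews:
        df.update(set(review) & set(food_vocab))
    scores = {}
    for food in food_vocab:
        idf = log((1 + n) / (1 + df[food])) + 1  # smoothed IDF
        tf_sum = sum(review.count(food) for review in user_reviews)
        scores[food] = tf_sum * idf
    return scores

reviews = [["great", "halloumi", "and", "souvla"], ["halloumi", "again"]]
prefs = food_preferences(reviews, ["halloumi", "souvla", "kleftiko"])
print(max(prefs, key=prefs.get))  # halloumi scores highest
```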
3.5. Optimising Personality Classification [Step 5]
To identify customers’ personalities from eWOM, several binary text classification techniques are evaluated, utilising knowledge transfer from BERT embeddings with several machine learning techniques, along with deep learning classification (fine-tuned BERT). The models were trained and tested using two labelled personality datasets: the MBTI social media posts dataset [71,72] and the well-known BIG-5 stream-of-consciousness essays [70]. Each dimension of the two personality models was used to train a binary classifier, resulting in 5 binary classifiers for the BIG-5 and 4 for the MBTI. For instance, a classifier for the extraversion–introversion dimension of the BIG-5 model assigns a probability that the author of a given text is extroverted or introverted. The classification process begins with the vectorisation of the text into a form suitable for ML/deep learning algorithms. This is achieved either by using open/closed lexicons or through text embeddings learned from large corpora in an unsupervised manner, as in the case of language models such as BERT. The vectorised text is used to train logistic regression (LR), XGBoost, naive Bayes (NB), and support vector machine (SVM) classifiers, as these constitute mainstream models in personality recognition [11]. The second group of techniques evaluated employed large language models and transfer learning by fine-tuning a pre-trained BERT model. BERT comes in two versions: BERT-base and BERT-large. The former uses 12 transformer blocks with 12 self-attention heads each, a hidden layer size of 768 (which defines the size of the text embeddings), and 110 M trained parameters; BERT-large has 24 transformer blocks and 340 M parameters. The BERT-base model is utilised to generate embeddings, as it requires fewer computational resources, and evidence suggests that BERT-large provides minimal to no benefit when the datasets used are relatively small [78]. The most popular architecture for assessing personality with BERT involves adding a dense layer on top of BERT’s output, followed by a binary output layer (sigmoid) for classification. In the case of multi-class text classification, a softmax layer is used instead.
This approach uses only the final hidden state vector of the [CLS] token from BERT, as it represents an aggregate embedding of the entire text and is generally regarded as the most informative feature for text classification tasks [79]. To enhance the performance of this standard BERT architecture, we combine the [CLS] output with a convolutional neural network (CNN) and a long short-term memory (LSTM) network to capture both local and sequential dependencies within the text. The former extracts additional features from the embedding of the last dense layer, and the latter finds patterns in the linguistic features of the text, namely part-of-speech (POS) sequences. LSTM and CNN have been used in personality classification to improve accuracy [80] and in combination with BERT to enhance text classification [81]; however, they have not been used to find patterns within POS sequences. For the CNN, BERT’s last dense layer is used to extract local features by sliding a 1D kernel across the contextualised embeddings to capture additional local relationships between tokens (max-pooling is adopted in this step). The LSTM layer utilises linguistic features that have been shown to contribute to language complexity prediction, which is linked to personality. SpaCy’s POS tagger is used to extract linguistic features [82], such as the numbers of pronouns, verbs, adjectives, and nouns within reviews. The most influential POS features for the personality class are selected using various feature selection techniques, including model-based and statistical approaches. To leverage additional linguistic features from the text, the order of POS tags in reviews is also used as input to the classifier (Figure 2). Tag sequences are expressed using unique IDs, specified for each POS, and represented as a sequence of numbers. An LSTM layer is used to find patterns in the POS sequences. The CNN, LSTM, and linguistic inputs are concatenated and fed to a linear layer with a sigmoid activation function that predicts the probability of each personality class. These layers were used independently and in combination, and the classifier structure that yielded the best results was selected; it is presented in Figure 2.
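The POS-sequence input to the LSTM branch amounts to mapping each review’s tag sequence to integer IDs of fixed length. A stdlib sketch; the tag inventory and ID assignment below are hypothetical (in the actual pipeline SpaCy’s tagger produces the tags):

```python
# hypothetical tag-to-ID table; SpaCy's universal POS tags in practice
POS_TO_ID = {"NOUN": 1, "VERB": 2, "ADJ": 3, "PRON": 4, "ADV": 5, "DET": 6}

def encode_pos_sequence(pos_tags, max_len=64, pad_id=0, unk_id=7):
    """Map a review's POS tag sequence to a fixed-length ID sequence
    suitable as LSTM input (truncated and zero-padded to max_len)."""
    ids = [POS_TO_ID.get(tag, unk_id) for tag in pos_tags[:max_len]]
    return ids + [pad_id] * (max_len - len(ids))

seq = encode_pos_sequence(["DET", "ADJ", "NOUN", "VERB"], max_len=6)
print(seq)  # [6, 3, 1, 2, 0, 0]
```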
The second issue addressed while searching for the best personality classifier is data imbalance. When the number of training examples is skewed toward one class, ML models struggle to correctly predict minority classes. For example, in our case, the number of extrovert cases exceeds that of introvert cases in the training data. Due to such imbalance, additional techniques are required to balance the data prior to training the classifiers with text embeddings. Deep learning models, however, especially those using pre-trained architectures (such as BERT), can be more resilient to moderate data imbalance. During this step, we considered prominent imbalance treatment techniques for the ML models, such as resampling, cost-sensitive algorithms, ensemble methods [83], and class weighting. Resampling involves under-sampling the majority class or over-sampling the minority class (in the case of binary classification), thus balancing the data by altering the number of sample units per class. Oversampling is a proven technique for treating class imbalance in text classification [84] that generates synthetic new cases (instead of replication) based on data from the minority class. Two of the most popular over-sampling techniques are the synthetic minority over-sampling technique (SMOTE) [83] and adaptive synthetic (ADASYN) sampling. The latter is an extension of SMOTE that adaptively generates minority data instances based on their distribution [84]. Both SMOTE and ADASYN are evaluated in this step. In the case of BERT, imbalance was treated using class weighting and different loss functions (i.e., focal loss, binary cross-entropy loss), as this performed better than over- and under-sampling.
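Class weighting for the BERT models can be computed directly from label frequencies. One common scheme (a sketch matching the widely used “balanced” heuristic, e.g. in scikit-learn; not necessarily the exact weights used in the study) is:

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c): rarer classes get
    proportionally larger weights in the training loss."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * cnt) for c, cnt in counts.items()}

# e.g. 8 extrovert vs 2 introvert examples in the training data
weights = balanced_class_weights(["E"] * 8 + ["I"] * 2)
print(weights)  # {'E': 0.625, 'I': 2.5}
```

The resulting weights would typically be passed to a weighted binary cross-entropy (or focal) loss so that misclassifying the minority class costs more.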
The third issue addressed regarding BERT-based classification concerns its tendency to perform best with short texts, typically those containing fewer than 128 tokens. Since the review texts used are longer than this limit, different long-text BERT classification techniques were considered, each using different parts of the text: the naive head-only approach uses the first X words (tokens) and ignores the remaining text; the naive tail-only approach uses the last X words and ignores the rest; and semi-naive approaches combine the first X words with the last X words, or combine these with important words in the text and ignore the rest. Even though such approaches lose information, they have minimal computational cost and achieve good results [81]; indeed, simple head- or tail-only truncation achieved the best classification performance. Recent work on classifying long texts utilises more sophisticated models that fragment the text into chunks and combine the embeddings of these chunks [85]. The benefits of such models, however, were not sufficiently different from the aforementioned techniques to justify the extra processing, which matters in recommender systems that aim to serve a large number of users simultaneously. During this step, different long-text treatments were evaluated.
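The naive truncation strategies can be sketched at the token level as follows (BERT’s special tokens, e.g. [CLS]/[SEP], are ignored here for brevity):

```python
def truncate_for_bert(tokens, limit=128, strategy="head"):
    """Select which part of a long token sequence is fed to BERT.

    'head' keeps the first `limit` tokens, 'tail' the last `limit`,
    'head_tail' splits the budget between the start and the end."""
    if len(tokens) <= limit:
        return tokens
    if strategy == "head":
        return tokens[:limit]
    if strategy == "tail":
        return tokens[-limit:]
    if strategy == "head_tail":
        head = limit // 2
        return tokens[:head] + tokens[-(limit - head):]
    raise ValueError(f"unknown strategy: {strategy}")

tokens = [f"t{i}" for i in range(300)]
print(truncate_for_bert(tokens, 128, "head")[-1])   # t127
print(truncate_for_bert(tokens, 128, "tail")[0])    # t172
print(len(truncate_for_bert(tokens, 128, "head_tail")))  # 128
```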
3.6. User and Venue Personality Extraction [Step 6]
The best personality classifier from Step 5 is used to label the personality of each consumer and restaurant. Consumer personality is estimated by first aggregating all text generated by each user. Where the text length exceeds 512 tokens, the text is divided into 512-token chunks, and any trailing fragment shorter than 512 tokens is discarded. Each chunk of the aggregated user text is fed to the personality classifiers, and the predictions for all chunks are averaged to produce the user’s overall personality. This is repeated for each user and personality dimension. Similarly, venue personality is estimated by aggregating the reviews of users who visited the venue and liked it (positive evaluation > 4), then chunking the text and averaging the personality scores.
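This step reduces to chunking the aggregated text and averaging per-chunk predictions. A sketch with a stand-in chunk classifier (the real classifier is the Step-5 model; `fake_clf` below is purely illustrative):

```python
def personality_score(text_tokens, classify_chunk, chunk_len=512):
    """Average a chunk-level classifier's probabilities over all full
    512-token chunks of a user's (or venue's) aggregated text; a trailing
    fragment shorter than chunk_len is discarded, as described above."""
    chunks = [text_tokens[i:i + chunk_len]
              for i in range(0, len(text_tokens) - chunk_len + 1, chunk_len)]
    if not chunks:
        return None  # not enough text for a single full chunk
    preds = [classify_chunk(c) for c in chunks]
    return sum(preds) / len(preds)

# stand-in classifier: fraction of a marker token in the chunk
fake_clf = lambda chunk: sum(t == "great" for t in chunk) / len(chunk)
tokens = ["great"] * 512 + ["meh"] * 512 + ["great"] * 100  # tail dropped
print(personality_score(tokens, fake_clf))  # 0.5
```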
3.7. Extracting Latent User Information Through Neural Collaborative Filtering (NCF) [Step 7]
A neural collaborative filtering (NCF) component is used to extract latent user/item features. A deep neural network (NCF) is trained using embeddings of customers and restaurants as input and user ratings as output. The NCF converts the sparse user–item matrix into low-dimensional user–item embeddings (dense layer), thereby extracting latent customer preferences [36]. Embeddings from the NCF model are extracted and combined with features from the previous steps of the method (personality, topics, food preferences). Inputs referring to both derived and direct information are used collectively to train and test an XGBoost regression model. The rationale for using XGBoost lies in its better interpretability and popularity with tabular data compared to deep neural networks; its prediction logic can be explained with techniques such as Shapley additive explanations (SHAP) [86]. This hybrid approach is similar to the wide and deep architecture [41], which leverages a wide linear model for memorisation and a deep neural network for generalisation, allowing it to capture both specific feature interactions and broader patterns in the data; however, instead of combining the wide and deep components through a dense layer in a multi-layer perceptron [41], an XGBoost model is used, since it can be trained faster and is easier to explain. Another related approach is the two-tower model [87], which evaluates user–item rankings through the inner product of the respective embeddings. Two-tower models are capable of learning complex relationships between users and items and can scale to large datasets; thus, they are popular in industrial settings.
3.8. Recommendation Generation [Step 8]
An XGBoost regressor model is trained (80%) and tested (20%) to predict user ratings (i.e., customer satisfaction) for restaurants that users have not yet visited. XGBoost is an ensemble method; multiple trees are generated, with each tree learning from the errors of previously generated trees [23]. XGBoost is selected due to its good results in similar problems and its faster training and prediction speeds compared to neural networks [88]. To address overfitting, several XGBoost hyperparameters were optimised using GridSearch, Bayesian optimisation, and random search, with GridSearch producing the best results. XGBoost predictions were used to rank recommendations, evaluated with Recall@k (the proportion of all relevant items successfully retrieved in the first K results) and Precision@k (the proportion of recommended items in the first K results that are relevant).
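Recall@k and Precision@k as defined above can be computed directly from the ranked prediction list; the item ids below are illustrative:

```python
def precision_recall_at_k(ranked_items, relevant_items, k):
    """ranked_items: items sorted by predicted rating (best first);
    relevant_items: set of items the user actually rated highly."""
    top_k = ranked_items[:k]
    hits = sum(1 for item in top_k if item in relevant_items)
    precision = hits / k
    recall = hits / len(relevant_items) if relevant_items else 0.0
    return precision, recall

ranked = ["r3", "r7", "r1", "r9", "r2"]
relevant = {"r7", "r2", "r5"}
p, r = precision_recall_at_k(ranked, relevant, k=3)
print(p, r)  # 1 hit in the top 3: precision 1/3, recall 1/3
```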
3.9. Comparative Analysis and Stepwise Ablation Study [Step 9–10]
During this step, alternative techniques are used to stress test the results of the proposed method. Different matrix factorisation techniques are used as alternatives to the proposed method, such as NMF, SVD, SVD++, and NCF models, to predict cells in the user–item matrix with unknown values. The user–item matrix is generated with rows corresponding to consumers and columns to restaurants, and cells containing user–item interactions. The above are popular collaborative filtering techniques that are considered state-of-the-art in the industry [
30,
33]. They are, thus, used as baseline approaches against which the proposed method is compared as part of its evaluation. Hyperparameters such as the number of latent factors (K) and the regularisation options for SVD, SVD++, and NMF were tuned using GridSearch (SVD best params: {‘n_factors’: 100, ‘reg_all’: 0.005}; SVD++ best params: {‘n_factors’: 20, ‘reg_all’: 0.01}, where reg_all applies L2 regularisation to the model’s learned parameters; NMF best params: {‘n_factors’: 20, ‘reg_pu’: 0.05, ‘reg_qi’: 0.05}, where reg_pu and reg_qi refer to the regularisation penalties on users’ and items’ latent factors). Additionally, a widely used deep learning architecture, namely the two-tower model, is employed to further evaluate the proposed method, as its rationale is similar to that of the proposed method. Here, the user features comprise user_id and preferences from the topic model (tower 1), while the restaurant (item) features are item_id, food, price, and cuisine offered (tower 2). The two-tower NN model was optimised over the hyperparameters embedding dimension, user_units (the number of neurons in the dense layer of the user tower), joint_units (the sizes of the dense layers after combining the user and item towers), dropout_rate, and learning_rate. The best hyperparameters based on MAE are {‘emb_dim’: 16, ‘user_units’: 128, ‘joint_units’: (128, 64), ‘dropout_rate’: 0.4, ‘learning_rate’: 0.001}. The NCF model was optimised over {‘emb_dim’, ‘hidden_units’, ‘dropout’, ‘lr’, ‘optimiser’: [adam or sgd], ‘batch_size’, and ‘epochs’}. The best hyperparameters, selected based on validation RMSE, were {‘emb_dim’: 64, ‘hidden_units’: 256, ‘dropout’: 0.2, ‘lr’: 0.01, ‘optimiser’: ‘sgd’, ‘batch_size’: 64, ‘epochs’: 15}. The performance of the proposed method is evaluated using offline recommender systems evaluation metrics, namely the mean squared error (MSE), mean absolute error (MAE), and root mean squared error (RMSE).
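For illustration, the grid search over n_factors and reg_all can be sketched with a tiny gradient-descent matrix factorisation standing in for the SVD/SVD++/NMF implementations; the toy rating matrix, learning rate, and epoch count are assumptions:

```python
import numpy as np

def factorise(R, n_factors, reg, lr=0.01, epochs=500, seed=0):
    """Tiny SVD-style matrix factorisation with L2 regularisation (reg_all).

    R: user-item rating matrix with np.nan marking unknown cells.
    A simplified stand-in for the library implementations used in the paper.
    """
    rng = np.random.default_rng(seed)
    n_u, n_i = R.shape
    P = rng.normal(scale=0.1, size=(n_u, n_factors))  # user factors
    Q = rng.normal(scale=0.1, size=(n_i, n_factors))  # item factors
    known = ~np.isnan(R)
    for _ in range(epochs):
        E = np.where(known, R - P @ Q.T, 0.0)  # error on known cells only
        P += lr * (E @ Q - reg * P)
        Q += lr * (E.T @ P - reg * Q)
    rmse = np.sqrt((E[known] ** 2).mean())
    return P @ Q.T, rmse

# Toy user-item matrix (rows: users, columns: restaurants; nan = unvisited).
R = np.array([[5, 4, np.nan, 1],
              [4, np.nan, 1, 1],
              [1, 1, np.nan, 5],
              [np.nan, 1, 5, 4.]])

# Grid search over factor counts and regularisation strengths.
best = min((factorise(R, k, reg)[1], k, reg)
           for k in (2, 3) for reg in (0.005, 0.05))
```

The completed matrix `P @ Q.T` fills the unknown cells with predicted ratings, which is exactly how the baseline models generate candidate recommendations.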
Additional ranked evaluation metrics were used, such as Recall@k and Precision@k.
Table 1 lists the evaluation metrics along with their mathematical formulas.
4. Results
The primary dataset contains restaurant reviews collected from TripAdvisor, as discussed in
Section 3.1.
Figure 3 illustrates the descriptive statistics of review ratings by year. It is evident that customer satisfaction declined in 2021, possibly due to the COVID-19 pandemic.
4.1. Topic Modelling
To extract the topics that are discussed by consumers in eWOM, an STM topic model was developed. STM is used over the traditional latent Dirichlet allocation (LDA) [
89] since it produced a higher-quality model while providing insights into how metadata, such as the review rating or sentiment, links to documents in the corpus, which helped in naming the extracted topics. The STM model was trained using the stm package (version 1.3) in R. During text preprocessing, words with fewer than three characters were eliminated, as they provided little contextual information. Custom stop words referring to names of people, towns, countries, cities, etc., were also eliminated, as they offer no information about user preferences. The optimal number of topics (K), based on the model’s performance metrics in
Figure 4, with a focus on high coherence and exclusivity, high held-out likelihood, low residuals, and high lower bound scores, is 18 (K = 18).
Topics’ names in
Table 2 are derived by domain experts using the words with the highest probability per topic, together with high Lift scores (words that appear less frequently in other topics receive higher weight) and high FREX scores (FRequency and EXclusivity: the harmonic mean of a word’s probability of appearance under a topic and its exclusivity to that topic) [
90]. These provide more semantically intuitive representations of topics and are, thus, good for distinguishing topics.
The probability distribution of topics per review represents the topics discussed in each review, and the sum of the probabilities of all topics in each review is 1. The STM model’s theta values were used as review embeddings, and in combination with the other features that were extracted from eWOM, were used to train the XGBoost model.
Figure 5 shows the average theta value per topic, indicating the prevalence of each topic in the corpus. Of these, only topics that were informative for the recommendation task were retained. In particular, the topic “Intention to revisit” was excluded from the list of features, since it provides no information about customer preference. The remaining topics were utilised.
4.2. Food Preference Extraction
Users’ food preferences are extracted from eWOM text using a custom named entity recognition (NER) model trained on annotated data generated with a rule-based approach and the SpaCy library. Initially, SpaCy was used to extract sentences with food mentions, and these automatically annotated sentences were then used as training data to update the existing SpaCy NER. For the annotation task, several rules were specified using the SpaCy pattern language to extract sentences that mention food consumption in reviews. Examples of the specified patterns include “I ate {}”, “We had {} for dinner”, etc. Patterns were designed using combinations of generic part-of-speech tags to relax the constraints of sentence filtering and handle variations in sentences that refer to food consumption, such as “we ate steak at the xxx restaurant” and “we had a nice steak at this lovely restaurant”. Extracted sentences were annotated automatically based on the position of the food entity in the sentence, identified from the string length of the pattern that was satisfied when the sentence was selected.
Figure 6 depicts an example sentence annotated with the position of the food entity in the text: the number 20 refers to the character position at which the entity name starts, and five is the number of characters that comprise the entity name. This process was necessary to create a training dataset with which to fine-tune the generic SpaCy NER. During NER training, the dataset was split into training (70%) and testing (30%) sets. Generalisation was encouraged through regularisation techniques such as dropout (preventing complex co-adaptations on training data by randomly shutting down neurons). The trained NER model achieved an accuracy of 94%. In addition, the NER was evaluated qualitatively to verify the correctness of the labels using a sample of 50 reviews from a dataset different from the one used to fine-tune the model. This manual process involved counting true positives, false positives, and false negatives, and yielded a recall of 65% and a precision of 70%.
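The offset-based annotation can be sketched as follows; the sentence, the regular-expression pattern (a simplified stand-in for the SpaCy pattern language), and the FOOD label are illustrative, while the (start, end, label) triple mirrors the character-offset format SpaCy NER training expects:

```python
import re

# Rule-based annotation sketch: a pattern such as "We had {}" locates a
# food mention; the match offsets become the NER training annotation.
sentence = "We had a nice steak at this lovely restaurant"
pattern = re.compile(r"We had (?:a nice )?(\w+)")  # illustrative pattern

match = pattern.match(sentence)
start, end = match.span(1)  # character offsets of the food entity
annotation = (sentence, {"entities": [(start, end, "FOOD")]})
```

Here the entity "steak" starts at character 14 and spans five characters, analogous to the offsets shown in Figure 6.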
The fine-tuned NER was applied to restaurant reviews to extract the foods associated with each review. Many food entities were generated, with numerous repetitions due to spelling variations. To reduce the number of features, a feature selection process was performed using a random forest classifier to identify the most important food names (features), with the cuisine offered by restaurants as the target variable. Each restaurant’s cuisine is sourced from its page on the review platform. During this process, restaurants were initially clustered based on the cuisine they offered. Reviews in each cluster were used to extract the dominant foods, using the foods identified in reviews as features and cuisine as the target variable. Several binary classifiers were generated, each using one cuisine as the positive class versus the rest. For the identification of local-cuisine foods, traditional restaurants were used as one cluster. The most important features from the binary classifiers were combined into a collection of 220 international and local foods that formed our food vocabulary. TF-IDF vectorisation with the compiled food names as vocabulary was applied to eWOM text. Users’ food preferences were specified as the foods with the highest cumulative TF-IDF scores across all reviews by the user. These scores correspond to foods that users ordered or consumed and discussed in their reviews; the assumption is that, whether or not users liked a given dish, its mention reflects a preference to consume it. The same approach is used to identify the foods each restaurant is known for; in the case of restaurants, only positive reviews were used.
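A minimal sketch of the vocabulary-constrained TF-IDF step, assuming a toy four-item food vocabulary and hypothetical reviews in place of the 220-food list:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical food vocabulary (the compiled list has 220 entries).
food_vocab = ["steak", "souvlaki", "halloumi", "pasta"]

# Toy reviews by one user; their food preferences are the foods with the
# highest cumulative TF-IDF score across all of their reviews.
user_reviews = [
    "great steak and a side of halloumi",
    "the steak was perfectly cooked",
    "pasta was average but the steak shone",
]

vec = TfidfVectorizer(vocabulary=food_vocab)
tfidf = vec.fit_transform(user_reviews)             # reviews x foods
cumulative = np.asarray(tfidf.sum(axis=0)).ravel()  # sum over the reviews
top_food = food_vocab[int(cumulative.argmax())]
```

Restricting the vectoriser to the food vocabulary keeps the feature space small and ensures only food mentions, not general vocabulary, drive the preference profile.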
4.3. Selecting the Best Personality Classifier
The stream-of-consciousness essay dataset [
70] is used to train the BIG 5 models, while the MBTI dataset [
72] is used to train the MBTI models (recall
Section 3.1). To identify the best-performing BERT long-text handling method for personality classification, we compared the naive approach, which uses the head (i.e., the beginning) of the text with lengths of 256 and 512 tokens, to the semi-naive approach, which divides the text into 128-token chunks and combines their embeddings. This analysis showed that the BERT classifier using the head-only 512-token preprocessing strategy [
85], outperformed the semi-naive approach; thus, this approach was employed in the methodology. Accordingly, aggregated reviews from each user were first chunked into 512-token segments. The personality of each chunk was then assessed, and the results averaged to determine the user’s overall personality. This chunked long-text approach also outperformed several ML classifiers that used BERT embeddings as features, across both datasets (MBTI and BIG 5). The results in
Table 3 show that the MBTI BERT 512-based approach outperformed the ML models and the BIG5 BERT 512-based model. During the second evaluation phase, the MBTI BERT-512 was compared with four ML models that utilised data balancing. Oversampling with SMOTE and ADASYN did not improve the performance of the MBTI ML classifiers, as seen in
Table 4. This could be because SMOTE may amplify noise in the minority class by creating synthetic samples from noisy instances, or it may focus solely on the minority class, potentially overlooking important characteristics of the majority class. SMOTE can also create synthetic samples that cross class boundaries, potentially confusing the classifier. On the other hand, ADASYN focuses on generating more synthetic samples for minority class instances that are harder to learn (i.e., those closer to the majority class). While this can be beneficial in some cases, it can also significantly amplify noise. Also, if there are outliers or mislabelled samples in the minority class, ADASYN may generate more synthetic samples around these problematic instances. Therefore, based on the results in
Table 4 and after statistically evaluating the significance of one algorithm over the other using the McNemar–Bowker test, the MBTI BERT 512 classifier was chosen to label the primary data.
Having identified the best personality classifier, it was used to label each user and venue. Reviews were first aggregated per user and venue. Each user’s text was split into 512-token chunks, as this is the maximum text length our classifier can handle. The text is then vectorised using BERT, and through the model of
Figure 2, the personality of each chunk is predicted. The user’s personality is the average over all of that user’s chunks. For each review, four binary classifiers were used to predict the probabilities for each of the four MBTI dimensions.
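The chunk-and-average procedure can be sketched schematically; here `predict_mbti_probs` is a hypothetical stand-in for the four BERT binary classifiers, and whitespace splitting replaces the BERT tokeniser for brevity:

```python
import numpy as np

CHUNK_LEN = 512  # token budget of the BERT classifier

def chunk_tokens(tokens, size=CHUNK_LEN):
    """Split a user's aggregated review tokens into fixed-size chunks."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def predict_mbti_probs(chunk):
    # Stand-in for the four BERT binary classifiers (I/E, N/S, T/F, J/P):
    # returns one probability per MBTI dimension for this chunk.
    rng = np.random.default_rng(len(chunk))
    return rng.uniform(size=4)

# Aggregated reviews for one user, tokenised (whitespace split for brevity).
tokens = ("lovely dinner " * 600).split()
chunks = chunk_tokens(tokens)

# The user's personality is the mean of the per-chunk predictions.
user_personality = np.mean([predict_mbti_probs(c) for c in chunks], axis=0)
```

Averaging over chunks lets a fixed-length classifier score arbitrarily long user histories while weighting every chunk equally.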
Figure 7 shows descriptive statistics, in the form of personality distributions of users resulting from the MBTI BERT classifier. The distributions indicate that each personality trait varies around a mean probability of 0.5, representing an approximately balanced likelihood of belonging to either class. This observation is consistent with personality theory, which suggests that, on average, most individuals tend to lie near the midpoint of personality continua rather than at the extremes of either class.
Venue personality is assessed by averaging the personality profiles of users who reviewed each restaurant and expressed positive evaluations (>4). From the five brand dimensions proposed in [
14,
16]—sincerity, excitement, competence, sophistication, and ruggedness—which describe how consumers perceive brands, three dimensions (i.e., sincerity, excitement, and competence) closely correspond to three human personality traits in the BIG 5 model (i.e., agreeableness, extraversion, and conscientiousness). In turn, as the BIG 5 dimensions are known to map the four dimensions of MBTI [
91], the three MBTI dimensions (introversion–extroversion, thinking–feeling, and judging–perceiving) were utilised in the recommendation approach proposed (
Figure 7), to evaluate brand personality, since the intuition–sensing dimension is not directly linked to any dimension of the venue personality model based on [
14,
16].
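A minimal sketch of the venue-personality computation, assuming a toy review table with hypothetical MBTI probabilities for the three retained dimensions:

```python
import pandas as pd

# Toy review table: each row holds a reviewer's MBTI probabilities
# (IE, TF, JP) and the rating they gave the venue. Values are illustrative.
reviews = pd.DataFrame({
    "venue":  ["a", "a", "a", "b", "b"],
    "rating": [5, 4, 2, 5, 3],
    "IE": [0.8, 0.6, 0.1, 0.3, 0.9],
    "TF": [0.4, 0.6, 0.9, 0.5, 0.2],
    "JP": [0.7, 0.5, 0.3, 0.6, 0.1],
})

# Venue personality: mean profile over positive (>4) reviews only.
positive = reviews[reviews["rating"] > 4]
venue_personality = positive.groupby("venue")[["IE", "TF", "JP"]].mean()
```

Restricting the average to positive reviewers ties the venue profile to the personalities of users who actually endorsed the restaurant.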
4.4. Training and Evaluating the Proposed Model: A Stepwise Ablation Study
The features derived from eWOM, along with direct information regarding restaurants’ service dimensions, were combined with embeddings from the NCF component and used collectively to train an XGBoost regression model, with the restaurant rating as the output variable. The following XGBoost hyperparameters were tuned using GridSearch: learning_rate, max_depth, n_estimators, and scale_pos_weight (for data balancing), along with the regularisation parameters alpha, lambda, and gamma.
The model’s performance was evaluated using ranking and accuracy metrics, including the mean absolute error (MAE), mean squared error (MSE), root mean squared error (RMSE), Precision@k, and Recall@k. Precision and recall were computed by measuring the proportion of relevant restaurants retrieved among the top-K recommendations. The comparison of the proposed XGBoost approach against NMF, SVD, SVD++, NCF, and the two-tower model revealed improved performance over these models.
In the analysis conducted using the extracted restaurant reviews, the data was split into test and training sets (80/20) using stratified sampling by user ratings, to ensure sufficient representation of all categories in the test and training sets. The model’s hyperparameters were tuned, and the model was trained and tested using the same samples (
Table 5).
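The stratified 80/20 split can be sketched with scikit-learn; the ratings and placeholder features below are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy ratings; stratifying by rating keeps every rating category
# represented in both the training and test sets (80/20 split).
ratings = np.array([1, 2, 3, 4, 5] * 20)           # 100 labelled samples
features = np.arange(len(ratings)).reshape(-1, 1)  # placeholder features

X_tr, X_te, y_tr, y_te = train_test_split(
    features, ratings, test_size=0.2, stratify=ratings, random_state=42)
```

Without `stratify`, a small rating class could end up absent from the test set, which would distort the error metrics for that class.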
An incremental, stepwise evaluation was conducted using the MBTI dataset to assess the impact of various components of the proposed model on recommendation performance. The evaluation metrics summarised in
Table 5 show that the proposed model achieves its best performance when all components—personality, topics, and food preferences—are included along with direct information (ratings, cuisine, etc). The baseline models, which represent current industry best practices, do not incorporate personality information.
The results reveal several key findings. (1) Introducing venue and user personality into the initial model (XGBoost trained on direct information) leads to a clear improvement in recommendation performance. This supports our hypothesis that consumers prefer restaurants whose personalities align with their own, as reflected in the enhanced accuracy of recommendations. (2) When food preferences and restaurant cuisines are incorporated in addition to personality features, performance improves further. This suggests that recommendations are more effective when users’ food preferences align with the cuisines in which restaurants excel, increasing the likelihood of customer satisfaction. (3) Finally, incorporating user opinions and venue themes, extracted from topic modelling, alongside personality and food preferences, yields the highest gains in performance. This indicates that matching user opinions with the thematic attributes of restaurants (i.e., topics that characterise positively reviewed venues) further enhances the relevance of recommendations.
A final observation is that when all features are combined, the proposed model outperforms both collaborative filtering (CF) and neural network-based (NCF) approaches. This demonstrates that information derived from eWOM substantially improves recommendation quality, providing strong empirical support for our initial hypothesis. In the proposed approach, recommendations are generated by ranking the XGBoost model’s predictions for each user. The number of recommendations to be produced (e.g., the top five restaurants) is specified by the user.
An additional evaluation is conducted by first selecting the best hyperparameters for each model configuration (corresponding to different component combinations) using a fixed 80/20 tuning split of the training data, ensuring that hyperparameter selection is not influenced by the test set. After tuning, each model is retrained and evaluated 30 times using different random seeds, which affect both the data splitting (different samples end up in the training and test sets) and the stochastic elements of model training. Each run produces evaluation metrics, and the results are aggregated across runs to compute the standard deviation of each metric. This procedure quantifies the sensitivity of each model configuration to randomness in training and data sampling. The results in
Table 6 show that all model variants exhibit low variability across 30 random runs, indicating robust performance and greater consistency.
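The multi-seed evaluation protocol can be sketched as follows; `run_once` is a hypothetical stand-in for one full tune-train-evaluate cycle, and the metric values are simulated:

```python
import numpy as np

def run_once(seed):
    # Stand-in for one train/evaluate cycle: the seed changes both the
    # train/test split and the stochastic parts of model training.
    # Returns a simulated RMSE for illustration.
    rng = np.random.default_rng(seed)
    return 0.9 + rng.normal(scale=0.02)

# Repeat 30 times with different seeds, then aggregate across runs.
metrics = np.array([run_once(seed) for seed in range(30)])
mean_rmse, std_rmse = metrics.mean(), metrics.std()
```

A small standard deviation across seeds, as reported in Table 6, indicates the configuration is robust to randomness in sampling and training.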
4.5. Explanation of XGBoost Using SHAP
The SHAP (Shapley Additive exPlanations) summary plot (
Figure 8) [
86] shows that the most influential implicit category of features is the topic, followed by personality and then food. This result is consistent with the RMSE, MAE, and MSE performance metrics in
Table 5, which show a more pronounced improvement when topics are introduced into the model. The SHAP summary plot indicates that the topics “Disappointment”, “Long wait”, and “Bad food” negatively influence the ratings of the reviews. The postfixes “u_avg” and “r_avg” in the topics’ names refer to the user average and restaurant average, respectively. Higher values of these topics (red points on the left) are associated with a decrease in the predicted outcome, indicating that disappointment and long waits strongly drive the model towards lower scores. In contrast, topics such as service and atmosphere (red points on the right) increase the prediction, suggesting a positive impact. The user’s personality also emerges as a significant feature: high levels of the extraversion (IE) and thinking–feeling (TF) traits positively influence the model’s output. The thinking–feeling dimension refers to how a person makes decisions and evaluates information, with thinking associated with logic and objective analysis and feeling with emotion and empathy. Alongside these two categories of features, the food category is also influential, with foods such as chicken showing a positive effect on the rating. Food features are shown with lower-case names followed by the postfix “u_avg” or “r_avg”.
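The category-level ranking read off the summary plot can be reproduced from a SHAP value matrix as sketched below; the matrix here is randomly generated as a stand-in for the output of a tree explainer on the trained XGBoost model, with per-feature scales chosen to mirror the reported ordering:

```python
import numpy as np

# Hypothetical feature set spanning the three implicit categories.
feature_names = ["disappointment_u_avg", "service_r_avg",  # topics
                 "IE", "TF",                                # personality
                 "chicken_u_avg"]                           # food
category = {"disappointment_u_avg": "topic", "service_r_avg": "topic",
            "IE": "personality", "TF": "personality",
            "chicken_u_avg": "food"}

# Simulated SHAP values: rows = samples, columns = features.
rng = np.random.default_rng(1)
shap_values = rng.normal(scale=[0.5, 0.4, 0.3, 0.25, 0.1], size=(200, 5))

# Global importance: mean |SHAP| per feature, summed per category.
mean_abs = np.abs(shap_values).mean(axis=0)
cat_importance = {}
for name, imp in zip(feature_names, mean_abs):
    cat_importance[category[name]] = cat_importance.get(category[name], 0) + imp
ranking = sorted(cat_importance, key=cat_importance.get, reverse=True)
```

Aggregating mean absolute SHAP values per category is what yields the topic > personality > food ordering described above.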
5. Discussion
This work combines direct and derived information from electronic word-of-mouth (eWOM) to enhance the performance of recommendations. The importance of personality in recommender systems has been acknowledged in previous studies [
58]; however, most of these rely on self-reported personality assessments obtained through questionnaires, which are often impractical and hinder consumer adoption. Automated personality assessment from eWOM text has been explored using machine learning (ML) and deep learning techniques; however, such approaches have not yet been effectively applied to the restaurant recommendation domain.
In addition to user personality, we operationalised and incorporated the concept of brand personality within recommender systems, demonstrating that the joint consideration of user and brand (in this case, restaurant) personality positively contributes to recommendation performance. This finding aligns with personality–brand congruence theory, which posits that individuals tend to prefer brands that reflect their own personality traits [
14].
Combining direct information (e.g., ratings, cuisine type, and restaurant metadata) with derived information (e.g., personality, topics, and food preferences) enriches the model’s feature space. Each component contributes unique, complementary information. Direct features reflect explicit user behaviour. Personality captures latent psychological tendencies influencing preferences. Food preferences and topics encompass nuanced, contextual, and sentiment-based cues. The integration of these heterogeneous data sources reduces feature sparsity, enabling the model to learn more robust and generalizable patterns, thereby improving predictive accuracy. Fusing diverse feature types enhances the model’s representational capacity and mitigates bias toward purely behavioural data. Models such as XGBoost can leverage interactions between structured and unstructured features to more accurately approximate user–item relevance. This multimodal fusion increases the model’s ability to generalise beyond observed ratings, which explains the consistent improvement over baseline methods that rely solely on direct features.
The work presented evaluates various personality classification techniques to determine the most effective performer. Prominent ML techniques are evaluated using two methods for addressing imbalance. The paper extends our previous work [
20], which examined deep learning classifiers such as BERT, by optimising their performance using different long-text treatment strategies. Similar work that uses transfer learning through language models for text classification [
92] does not adequately address long-text challenges, resulting in inferior classification performance. This systematic evaluation of personality classifiers prior to labelling data contributes towards improved labelling, which is of paramount importance, as also highlighted in [
70]. The performance of our proposed personality classifier is better than ML classifiers and other baseline personality classification techniques, such as [
62], thereby giving us greater confidence in the labelling of the data.
Moreover, the method introduces an automated approach for identifying user preferences from eWOM text. Similarly, ref. [
50] utilises sentiment for food preference extraction and clustering to identify topics from online reviews; however, it does not jointly address the concept of personality, nor does it utilise a topic modelling technique such as STM, which enables the association of metadata (e.g., sentiment) with topics to improve interpretability.
Our results also show that a combination of direct and derived information from eWOM (i.e., personality, food preference, and opinions from topic modelling) enhances recommendation performance.