Leveraging LLMs for User Rating Prediction from Textual Reviews: A Hospitality Data Annotation Case Study
Abstract
1. Introduction
- The paper provides a comprehensive comparison of sentiment scores from GPT models and human annotators, contributing to the understanding of reliability and consistency in LLM-based textual annotation;
- It presents a statistical approach for comparing the performance of different sentiment annotators, including LLMs versus humans as well as different LLM variants.
2. Literature Review
2.1. Traditional Approaches and Their Limitations
2.2. Crowdsourced Annotation and Bias Concerns
2.3. Emergence of LLMs for Annotation
2.4. Gaps and Research Direction
3. Methodology
3.1. Amazon Mechanical Turk
3.2. Large Language Models (GPT-4o and GPT-3.5 Turbo)
4. Results and Discussion
4.1. Statistical Analysis
4.2. Hypothesis Testing
- RQ1: Can the customers’ original ratings of text reviews be accurately represented using GPT models and/or the MTurk workers’ ratings?
- RQ2: Do AI-generated (GPT) sentiment scores exhibit a stronger correlation with the original numeric ratings assigned by customers compared to the sentiment scores provided by human MTurk annotators?
5. Future Directions
6. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest







| Text Review | Original Rating |
|---|---|
| hated inn terrible room-service horrible staff un-welcoming decor recently updated lacks complete look managment staff horrible | 1 |
| not recommend staying staff routinely indifferent point rude parking arrangement public garage hill corner remarkably inconvenient mobility issues parking hotel challenging maneuver rooms adequate not maintained cleaned particularly hotel not bad easily better seattle definitely | 2 |
| nice rooms not 4* experience hotel monaco seattle good hotel n’t 4* level.positives large bathroom mediterranean suite comfortable bed pillowsattentive housekeeping staffnegatives ac unit malfunctioned stay desk disorganized missed 3 separate wakeup calls concierge busy hard touch did n’t provide guidance special requests.tv hard use ipod sound dock suite non functioning. decided book mediterranean suite 3 night weekend stay 1st choice rest party filled comparison w spent 45 night larger square footage room great soaking tub whirlpool jets nice shower.before stay hotel arrange car service price 53 tip reasonable driver waiting arrival.checkin easy downside room picked 2 person jacuzi tub no bath accessories salts bubble bath did n’t stay night got 12/1a checked voucher bottle champagne nice gesture fish waiting room impression room huge open space felt room big tv far away bed chore change channel ipod dock broken disappointing.in morning way asked desk check thermostat said 65f 74 2 degrees warm try cover face night bright blue light kept got room night no 1st drop desk called maintainence came look thermostat told play settings happy digital box wo n’t work asked wakeup 10 a.m. morning did n’t happen called later 6 p.m. nap wakeup forgot 10 a.m. wakeup morning yep forgotten.the bathroom facilities great room surprised room sold whirlpool bath tub n’t bath amenities great relax water jets going | 3 |
| excellent staff housekeeping quality hotel chocked staff make feel home experienced exceptional service desk staff concierge door men maid service needs work maid failed tuck sheets foot bed instance soiled sheets used staff quickley resolved soiled sheets issue guess relates employee not reflection rest staff.we received excellent advice concierge regarding resturants area happy hour wine tasting nice touch staff went way make feel home.great location like close good food shopping took play 5th street theather well.pikes market pioneer square access mono rail short walking distance | 4 |
| hotel stayed hotel monaco cruise rooms generous decorated uniquely hotel remodeled pacific bell building charm sturdiness everytime walked bell men felt like coming home secure great single travelers location fabulous walk things pike market space needle.little grocery/drug store block away today green bravo 1 double bed room room bed couch separated curtain snoring mom slept curtain great food nearby | 5 |

| Prompt |
|---|
| Analyse the sentiment in the review ‘“{text}”’ and provide a sentiment score between 1 (very negative) and 5 (very positive), following the examples provided in ‘“{In-Context}”’. Only give the right ratings and do not provide any explanation or more than one class and must not make it up. Using the response choices range to be returned, produce only the sentiment score: |
| Very Negative > 1 |
| Negative > 2 |
| Neutral > 3 |
| Positive > 4 |
| Very Positive > 5 |
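
The prompt template above can be filled in programmatically. The following is a minimal sketch, not the authors' implementation: the helper names (`format_examples`, `build_prompt`, `parse_score`), the layout of the in-context examples, and the reply-parsing rule are all assumptions made for illustration. No model API is called here; the sketch only builds the prompt string and extracts a 1 to 5 score from a reply.

```python
import re
from typing import List, Optional, Tuple

# Prompt wording copied from the template above, including its two placeholders.
PROMPT_TEMPLATE = (
    "Analyse the sentiment in the review ‘“{text}”’ and provide a sentiment "
    "score between 1 (very negative) and 5 (very positive), following the "
    "examples provided in ‘“{In-Context}”’. Only give the right ratings and "
    "do not provide any explanation or more than one class and must not make "
    "it up. Using the response choices range to be returned, produce only the "
    "sentiment score:\n"
    "Very Negative > 1\nNegative > 2\nNeutral > 3\nPositive > 4\nVery Positive > 5"
)


def format_examples(examples: List[Tuple[str, int]]) -> str:
    """Render (review, rating) pairs as text. The layout is an assumption."""
    return "\n".join(f"Review: {text} -> Rating: {rating}" for text, rating in examples)


def build_prompt(review: str, examples: List[Tuple[str, int]]) -> str:
    """Substitute the review and the in-context examples into the template."""
    prompt = PROMPT_TEMPLATE.replace("{text}", review)
    return prompt.replace("{In-Context}", format_examples(examples))


def parse_score(reply: str) -> Optional[int]:
    """Return the first standalone digit from 1 to 5 in the reply, or None."""
    match = re.search(r"\b([1-5])\b", reply)
    return int(match.group(1)) if match else None


if __name__ == "__main__":
    demo = build_prompt("nice rooms not 4* experience", [("hated inn terrible", 1)])
    print(demo)
    print(parse_score("Positive > 4"))
```

Returning `None` for replies without a valid score keeps malformed model output visible to the caller instead of silently coercing it to a rating.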

| Sentiments | Observations | | | |
|---|---|---|---|---|
| | Original | MTurk | GPT-4o | GPT-3.5 Turbo |
| 1 | 66 | 11 | 84 | 7 |
| 2 | 77 | 113 | 98 | 51 |
| 3 | 110 | 273 | 99 | 545 |
| 4 | 294 | 436 | 274 | 342 |
| 5 | 453 | 167 | 445 | 55 |

| Measure | Observations | | | |
|---|---|---|---|---|
| | Original | MTurk | GPT-4o | GPT-3.5 Turbo |
| Mean | 3.991 | 3.635 | 3.898 | 3.387 |
| Median | 4 | 4 | 4 | 3 |
| Mode | 5 | 4 | 5 | 3 |
| Range | 4 | 4 | 4 | 4 |
| Interquartile Range | 2 | 1 | 2 | 1 |
| Standard Deviation | 1.210 | 0.926 | 1.297 | 0.701 |
| Kullback–Leibler Divergence | - | 0.311 | 0.201 | 0.339 |
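
The location and spread measures in the table above can be recomputed from the rating frequency counts reported earlier. The sketch below is illustrative only: it assumes the population form of the standard deviation and the default quartile method of Python's `statistics.quantiles`, neither of which is stated in the source. The Kullback–Leibler divergence is left out because its direction, log base, and any smoothing are not specified.

```python
import statistics
from typing import Dict, List

SCORES = [1, 2, 3, 4, 5]

# Frequency counts per sentiment score, taken from the observations table.
COUNTS: Dict[str, List[int]] = {
    "Original": [66, 77, 110, 294, 453],
    "MTurk": [11, 113, 273, 436, 167],
    "GPT-4o": [84, 98, 99, 274, 445],
    "GPT-3.5 Turbo": [7, 51, 545, 342, 55],
}


def expand(counts: List[int]) -> List[int]:
    """Turn frequency counts into a sorted list of individual ratings."""
    return [score for score, n in zip(SCORES, counts) for _ in range(n)]


def summarize(counts: List[int]) -> Dict[str, float]:
    """Compute descriptive statistics for one annotator's ratings."""
    values = expand(counts)
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile method is an assumption
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "std": statistics.pstdev(values),  # population SD (assumption)
    }


if __name__ == "__main__":
    for name, counts in COUNTS.items():
        stats = summarize(counts)
        print(name, {key: round(value, 3) for key, value in stats.items()})
```

Under these assumptions the computed means, medians, modes, interquartile ranges, and standard deviations agree with the tabulated values to three decimal places.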

| Metric | GPT-4o | MTurk |
|---|---|---|
| Shapiro–Wilk p-value | 0.0001 | 0.0000 |
| Wilcoxon p-value | 3.7 × 10⁻⁴ | 2.5 × 10⁻⁸ |
| Cohen’s d | 0.32 | 0.65 |
| Correlation (95% CI) | 0.805 [0.79, 0.82] | 0.153 [0.12, 0.18] |

| Metric | Ratings | | |
|---|---|---|---|
| | MTurk | GPT-4o | GPT-3.5 Turbo |
| Spearman Correlation | 0.153 | 0.703 | 0.663 |
| Mean Squared Error (MSE) | 2.106 | 0.629 | 1.520 |
| Mean Absolute Error (MAE) | 1.094 | 0.451 | 0.980 |
| Root Mean Square Error (RMSE) | 1.451 | 0.793 | 1.232 |
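
For readers who want to compute the agreement metrics in the table above on their own data, the following sketch defines them in plain Python. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, tied ratings receive average ranks, and Spearman correlation is computed as the Pearson correlation of those ranks. The toy ratings in the demo are invented for demonstration only.

```python
import math
from typing import Dict, List, Sequence


def error_metrics(truth: Sequence[float], pred: Sequence[float]) -> Dict[str, float]:
    """MSE, MAE and RMSE between original ratings and annotator ratings."""
    diffs = [p - t for t, p in zip(truth, pred)]
    mse = sum(d * d for d in diffs) / len(diffs)
    mae = sum(abs(d) for d in diffs) / len(diffs)
    return {"mse": mse, "mae": mae, "rmse": math.sqrt(mse)}


def average_ranks(values: Sequence[float]) -> List[float]:
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    original = [1, 2, 3, 4, 5, 5, 4]   # toy data, not from the study
    annotated = [1, 3, 3, 4, 4, 5, 5]  # toy data, not from the study
    print(error_metrics(original, annotated))
    print(round(spearman(original, annotated), 3))
```

Because ratings on a 1 to 5 scale contain many ties, the average-rank treatment matters: the shortcut formula based on squared rank differences is exact only when there are no ties.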
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Nnanna, P.; Amujo, O.; Ezenkwu, C.P.; Ibeke, E. Leveraging LLMs for User Rating Prediction from Textual Reviews: A Hospitality Data Annotation Case Study. Information 2025, 16, 1059. https://doi.org/10.3390/info16121059

