Article

Biased by Design? Evaluating Bias and Behavioral Diversity in LLM Annotation of Real-World and Synthetic Hotel Reviews

by Maria C. Voutsa 1,*, Nicolas Tsapatsoulis 1 and Constantinos Djouvas 2

1 Department of Communication and Marketing, Cyprus University of Technology, Limassol 3036, Cyprus
2 Department of Communication and Internet Studies, Cyprus University of Technology, Limassol 3036, Cyprus
* Author to whom correspondence should be addressed.
AI 2025, 6(8), 178; https://doi.org/10.3390/ai6080178
Submission received: 29 June 2025 / Revised: 22 July 2025 / Accepted: 30 July 2025 / Published: 4 August 2025
(This article belongs to the Special Issue AI Bias in the Media and Beyond)

Abstract

As large language models (LLMs) gain traction among researchers and practitioners, particularly in digital marketing for tasks such as customer feedback analysis and automated communication, concerns remain about the reliability and consistency of their outputs. This study investigates annotation bias in LLMs by comparing human and AI-generated annotation labels across sentiment, topic, and aspect dimensions in hotel booking reviews. Using the HRAST dataset, which includes 23,114 real user-generated review sentences and a synthetically generated corpus of 2000 LLM-authored sentences, we evaluate inter-annotator agreement between a human expert and three LLMs (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4-mini) as a proxy for assessing annotation bias. Our findings show high agreement among LLMs, especially on synthetic data, but only moderate to fair alignment with human annotations, particularly in sentiment and aspect-based sentiment analysis. LLMs display a pronounced neutrality bias, often defaulting to neutral sentiment in ambiguous cases. Moreover, annotation behavior varies notably with task design, as manual, one-to-one prompting produces higher agreement with human labels than automated batch processing. The study identifies three distinct AI biases—repetition bias, behavioral bias, and neutrality bias—that shape annotation outcomes. These findings highlight how dataset complexity and annotation mode influence LLM behavior, offering important theoretical, methodological, and practical implications for AI-assisted annotation and synthetic content generation.

1. Introduction

As artificial intelligence (hereinafter AI) continues to reshape how companies engage with customers, large language models (hereinafter LLMs) are becoming increasingly embedded in marketing practice. From analyzing customer reviews to identifying sentiment trends and summarizing feedback, LLMs offer an unprecedented ability to process and classify content at scale. This is particularly valuable in data-intensive industries such as hospitality, where organizations must navigate large volumes of user-generated reviews to inform decision-making. In this context, LLMs are being adopted for tasks ranging from customer experience management and brand monitoring to automated communication and personalized marketing strategies [1,2,3].
However, as the use of LLMs shifts from generating content to interpreting it, particularly in the role of automated annotators, important questions arise about the reliability and transparency of their output. One key concern is annotation bias, which refers to the tendency of different annotators (human or machine) to apply labels to the same content in systematically different ways [4]. These discrepancies, though often subtle, can have significant downstream effects. They may compromise the quality of training data, introduce inconsistencies in model evaluation, and ultimately lead to biased or misleading results in consumer-facing applications [5,6]. Yet while a growing body of research has explored bias in AI-generated predictions, much less attention has been paid to bias at the annotation stage, which is the very foundation on which supervised AI systems are built.
Importantly, annotation bias is not unique to LLMs. Human annotators are also prone to a multitude of cognitive and contextual biases, such as confirmation bias, inter-annotator variability, and the influence of demographics, expertise, or task design [7,8,9]. At the same time, recent studies reveal that LLMs can also reflect and amplify biases, especially when operating without task-specific calibration or when exposed to vague instructions [10,11]. For example, LLMs have been found to exhibit alignment inconsistencies with expert annotations in high-stakes tasks such as hate speech detection and health misinformation labeling [12,13].
In marketing and customer analytics, these annotation biases can be particularly consequential. A misclassified sentiment in a user review can lead to incorrect brand perception metrics, while flawed topic extraction could distort managerial responses to customer concerns, ultimately harming the consumer–brand relationship [14]. Furthermore, the rise of synthetic data (i.e., text generated entirely by LLMs to simulate real-world inputs) further complicates the annotation landscape. Synthetic reviews are often used to train or benchmark models due to their efficiency and controllability [15,16], yet there is limited understanding of how the annotation behavior changes when applied to such artificially constructed inputs. Initial evidence shows that synthetic data tend to lack the ambiguity, variation, and complexity present in real user-generated content [17], potentially masking the limitations of AI annotation systems under real-world conditions.
An additional variable often overlooked in LLM evaluation is the annotation mode itself. Whether an LLM is prompted one sentence at a time (manually) or in bulk (batch processing) can influence its output quality, particularly when tasks require contextual interpretation or disambiguation. Although role conditioning and prompt engineering have been shown to improve consistency [15], little is known about how the annotation interface and the delivery method affect alignment with human coders.
This study addresses these research gaps by investigating annotation bias in LLMs across two data types (real vs. synthetic hotel reviews) and two annotation modes (manual one-to-one prompting vs. automated batch processing). Specifically, we compare human and LLM-generated labels across three dimensions of customer review content (i.e., sentiment, topic, and aspect presence), using the publicly available Hotel Reviews: Aspects, Sentiments and Topics (HRAST) dataset [18] and a synthetically generated corpus of 2000 review sentences created using ChatGPT-4. We evaluate inter-annotator agreement between a domain expert and three LLMs (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4-mini) using Cohen’s and Fleiss’ kappa as reliability metrics.
Our findings show that while LLMs display strong internal consistency, especially when labeling synthetic data, their alignment with human annotations is only moderate. This divergence reflects underlying biases in how LLMs process, interpret, and label data, rather than performance differences per se. The greatest differences were observed in sentiment and aspect-based sentiment categorization, where LLMs showed a consistent tendency to default to neutral sentiment, particularly in cases involving ambiguity or mixed evaluations, a behavior characterized as neutrality bias [19]. Furthermore, the results reveal a notable behavioral sensitivity to annotation context, with LLMs displaying markedly different annotation patterns under manual, one-to-one prompting compared to batch processing. This finding highlights that input delivery format itself constitutes a source of annotation bias, shaping how models generate labels independently of the input content.
By foregrounding the interaction between data type, annotation mode, and bias expression, this study makes three key contributions: (1) it documents the structural annotation tendencies of LLMs in practical review classification tasks; (2) it demonstrates the limitations of synthetic data as a proxy for real-world complexity in model evaluation; and (3) it provides actionable insights for marketing practitioners and researchers deploying LLMs in customer-facing or analytic workflows. In contrast to prior studies, which primarily focus on accuracy and bias in domains such as hate speech [12,20], fairness [11], and multilingual NLP [21], this study uniquely examines how annotation mode (batch vs. manual) and data type (real vs. synthetic) jointly influence human–LLM agreement in a hospitality context, revealing systematic, mode-driven biases and divergent annotation patterns. These findings underscore the importance of context-sensitive AI deployment and advocate for human-in-the-loop approaches in annotation and decision-support systems that rely on LLMs.
The remainder of the paper is structured as follows. Section 2 reviews the literature on AI biases in annotation tasks, with particular attention to human–AI alignment and synthetic data generation. Section 3 presents the datasets, annotation protocols, and methodological design, highlighting our comparative approach to data types and annotation modes. Section 4 reports the empirical results, emphasizing inter-annotator agreement and the emergence of neutrality and behavioral biases. Section 5 discusses the theoretical, methodological, and practical implications of the findings, offering recommendations to mitigate bias in LLM-assisted annotation. Finally, Section 6 concludes with a summary of the contributions.

2. Literature Review

2.1. Biases in AI Systems

Artificial Intelligence (AI) refers to systems that replicate human cognitive functions, such as perception, memory, language, reasoning, and problem-solving, by applying logical frameworks (e.g., deductive, inductive, abductive) and learning methods (supervised, semi-supervised, unsupervised, reinforcement) to analyze data, identify patterns, and adapt over time [22]. These functions, however, are not free of bias: a wealth of systematic literature reviews and conceptual studies has documented a variety of AI biases [5,6,23,24,25].
The Oxford English Dictionary defines bias as a “tendency to favour or dislike a person or thing, especially as a result of a preconceived opinion; partiality, prejudice” [26]. In the context of research, bias can appear in multiple forms (e.g., design, selection, data collection, analysis, and publication), each of which can distort findings and compromise validity [27]. Delgado-Rodriguez and Llorca [28] further identified 74 types of bias, which can be broadly categorized into four groups: bias in intervention execution, information bias, selection bias, and confounding.

Types of Biases in AI Systems

Similarly, bias in AI systems originates from three interrelated sources: problem-related bias (i.e., problem formulation), data-related bias (i.e., data sources and processing), and model-related bias (i.e., model development, validation, and implementation) [29]. At the core of these concerns is algorithmic bias, defined as a “systematic deviation from equality that emerges in the outputs of an algorithm” [6] (p. 395). AI systems are particularly susceptible to biases rooted in data, design, and implementation choices [4]. Data-related issues such as selection, sampling, and historical bias reflect structural inequalities, while measurement and framing biases stem from inconsistent labeling and poorly defined problems. Aggregation and learning biases mask group-level disparities, and evaluation and deployment biases arise from narrow test sets or unintended use contexts.
Crucially, bias in AI is not only a technical issue rooted in flawed data or algorithms, but also a societal one shaped by unrepresentative samples, problematic analytical frameworks, and embedded social prejudices [5], such as gender and ethnic biases [30,31]. AI systems inherit these biases through the human knowledge they are trained on, which is often shaped by dominant power structures and Western-centric paradigms that marginalize underrepresented voices [4]. As Kordzadeh and Ghasemaghaei [6] argue, such algorithmic bias undermines perceived fairness, diminishing user trust, and hindering system adoption. The consequences are profound, including unequal access to resources, distorted decision-making, and erosion of public confidence in AI technologies [5]. Within this broader context, the present study focuses specifically on data-related biases, and more specifically, on annotation bias, which arises from the characteristics of annotators and the subjective processes they follow during data labeling [4].

2.2. Human vs. AI Annotator Biases

Humans are central to the design and effectiveness of AI and machine learning algorithms [32,33]. Human annotation biases may be affected by human errors, which are mainly categorized into active failures and latent conditions [34]. Active failures refer to the immediate, often visible, unsafe acts (i.e., slips, lapses, mistakes, or procedural violations) committed by individuals in direct contact with the system [35]. In contrast, latent conditions are systemic weaknesses, such as poor design, inadequate staffing/burnout, or flawed procedures, introduced by higher-level decision-makers [34]. For instance, Pandey et al. [33] demonstrated that the sequencing of data annotation tasks can influence annotation outcomes, highlighting the subtle impact of latent conditions.
In addition to these errors, three key types of bias further compromise annotation quality: cognitive bias, stemming from annotators’ prior knowledge or lack thereof; inter-annotator bias, reflecting inconsistencies due to differences in training, expertise, or task interpretation; and confirmation bias, where annotators label data in ways that affirm their pre-existing beliefs rather than adhere to objective guidelines [7]. The presence of confirmation bias is further supported by Haliburton et al. [8], who found that annotators’ ethnicity and gender significantly affect labeling decisions.
To mitigate such biases and reduce annotator fatigue, crowdsourcing platforms are often employed [32]. However, discrepancies among annotators persist even within crowdsourced environments [36,37]. Parmar et al. [9] highlight instruction bias in crowdsourcing, reinforcing concerns about inter-annotator variability. Geva et al. [38] also show that model performance improves when annotator identifiers are included as features, enabling the model to adapt to individual annotation styles and prioritize input from the most reliable annotators. Annotator demographics also appear to shape these biases: although Kuwatly et al. [39] found no gender differences, they reported that native English speakers understand the task better and produce higher-quality annotations, and that age and education also have a crucial effect [39].
A further challenge stems from the increasing complexity of annotation tasks and the sheer volume of data now required for training AI systems, which has significantly increased the cost of human annotation (whether through crowdsourcing platforms or expert-driven labeling), both financially and in time. In response, the field has seen a growing shift toward the use of semi-supervised and unsupervised learning algorithms, as well as the adoption of AI-assisted annotation tools [40]. For image-based tasks, tools such as LabelMe and the Computer Vision Annotation Tool (CVAT) are commonly used, while for text-based annotation, general-purpose AI tools like ChatGPT-3.5 and various open-source alternatives are increasingly being integrated into workflows.
However, LLMs like ChatGPT-3.5 have been shown to exhibit biases in annotation. For instance, Das et al. [10] found that persona-based LLMs reflect annotator biases related to gender, race, disability, and religion in hate speech classification tasks. Similarly, Giorgi et al. [12] reported that while human annotators show mild in-group bias and demographic variation, persona-based LLMs also display bias but align poorly with human annotation patterns, even when tailored through personalization. In another study, Felkner et al. [11] examined annotations related to antisemitism and the Jewish community and concluded that GPT-3.5-Turbo could not match the performance of expert annotators with lived experience. Table 1 synthesizes these and other recent studies comparing human and LLM annotations across diverse domains and tasks. Collectively, these works reveal consistent patterns: LLMs tend to approximate human annotations in structured and less ambiguous tasks (e.g., information retrieval [41], web and social media text classification [42], and multimodal genre and hate speech annotation [20]), yet diverge substantially in complex or socially sensitive settings (e.g., hate speech and antisemitism detection [11,12]; multilingual emotion detection [21]). Furthermore, Wang et al. [43] highlighted that integrating LLM-generated labels and explanations into human workflows can improve annotation accuracy, though at the cost of increased cognitive load. Across these studies, LLMs demonstrate high internal consistency but often fail to align with human annotators in domains requiring subjective judgment or cultural and experiential grounding. Based on this body of research, we hypothesize that differences will exist between human and LLM-generated annotations, particularly as task complexity increases. We expect these discrepancies to be more pronounced in challenging tasks such as aspect categorization than in relatively simpler tasks such as sentiment analysis. Thus, the following research hypothesis is addressed:
Hypothesis 1.
Human annotations differ from LLM annotations in (a) sentiment, (b) topic, and (c) aspect categorizations of real review data.
Table 1. Comparison of studies evaluating LLMs as annotators.

Study | Domain | Source | Mode | Task | Annotators | Key Findings
Giorgi et al. [12] | Hate speech/Social media posts | Real | Batch | Hate speech labeling (not hate/maybe/hate) | Humans: crowdworkers; LLMs: Llama3, Phi3, Solar, Starling | Humans show biases (age, religion, gender identity); LLMs exhibit fewer annotator-like biases but still misreport by simulated persona (e.g., underreporting by Christian/straight personas, overreporting by gay/bisexual personas).
Felkner et al. [11] | Fairness/Survey responses | Real & Synthetic | N/A | Bias benchmark construction (Jewish Community Survey) | Humans: Jewish Community Survey; LLMs: ChatGPT-3.5 | All LMs showed significant antisemitic bias (avg. 69% vs. ideal 50%), higher on Jewish women/mothers and Israel/Palestine topics. LLM-extracted predicates often hallucinated, and were repetitive and poorly aligned with human annotations.
Nasution & Onan [21] | Multilingual/Tweets | Real | Batch | Topic classification, sentiment analysis, emotion classification | Humans: native speakers, crowdworkers; LLMs: ChatGPT-4, BERT, RoBERTa, T5 | Humans outperform LLMs on topic and emotion (higher precision/recall); LLMs competitive on sentiment (i.e., ChatGPT-4, BERTurk); both struggle with fear/neutral; ChatGPT-4 shows promise for low-resource languages.
Wang et al. [43] | Gen. NLP/Sentence pairs, social media posts | Real | N/A | NLI, stance detection, hate speech detection | Humans, verifier models, LLMs | Verifier-assisted human reannotation improves accuracy (+7–8%) over LLM-only; LLM explanations help when correct but mislead otherwise; increases cognitive load without improving perceived trust.
Zendel et al. [41] | Info. retrieval/TREC topics | Real | Batch & Manual | Cognitive complexity classification (Remember/Understand/Analyze) | Humans: expert-annotated benchmark; LLMs: ChatGPT-3.5/4 | ChatGPT-4 stable and matches human quality; 3.5 less consistent and sensitive to batch order. Batch mode reduces time/cost without significant loss of quality. ChatGPT-4 effective as additional annotator.
Aldeen et al. [42] | Multi-task/Web, social media posts | Real | Batch | 10 classification tasks: sentiment, emotion, spam, sarcasm, topic, etc. | Humans: benchmark datasets; LLMs: ChatGPT-3.5/4 | ChatGPT performs better on formal tasks (e.g., banking, websites); weaker on informal/casual tasks (sarcasm, emotion); still competitive on some informal tasks with explicit cues (e.g., Amazon reviews, Twitter topics).
Mohta et al. [20] | Multimodal/Image-text, text pairs | Real | N/A | Movie genre, hate speech, NLI, binary internal tasks | Humans: crowdworkers; LLMs: Llama2, Vicuna-v1.5 | Humans outperform LLMs; fine-tuned Vicuna better than base Llama; images improve response rates but not accuracy.
Current Study | Hospitality/Hotel booking reviews | Real & Synthetic | Batch & Manual | Sentiment, aspect, topic classification | Humans: domain expert; LLMs: ChatGPT-3.5/4 | LLMs showed high internal agreement; moderate with human on sentiment (real); low on aspects and synthetic data; manual mode improved agreement, revealing mode- and data-driven biases.

2.3. Large Language Models and Synthetic Data

Synthetic data refers to “artificial data from scratch or using advanced data manipulation techniques to produce novel and diverse training examples” [16] (p. 12). According to a recent narrative review on synthetic health data, there are six levels of synthetic data, ranging from low-utility structural datasets with no analytical value to high-fidelity replicas that closely resemble real data but carry increased privacy risks [13]. A wide array of tools, including open-source software and libraries in R (v4.3.1) and Python (v3.11.4), support the generation of such data.
The field of synthetic data generation has advanced significantly since its early applications, which were largely limited to basic data augmentation techniques such as bootstrapping. As Siriborvornratanakul [40] observes, contemporary methods now encompass sophisticated data annotation and the automated generation of synthetic text and images, enabled primarily by the capabilities of LLMs. Despite these advancements, synthetic data presents notable limitations. One prominent concern is the presence of generative bias, wherein models disproportionately reflect patterns, attributes, or perspectives from their training data, resulting in skewed or unbalanced outputs [40,44,45]. More specifically, Guo and Chen [17], in a comparative analysis of three generative LLMs, found that synthetic data often suffer from inaccuracies, limited intra-class and inter-class diversity, and instances of hallucination (i.e., outputs that are factually incorrect or entirely fabricated). While Chan et al. [46] suggest that data augmentation is among the most effective uses of synthetic data, other findings highlight its limitations in high-stakes applications. Specifically, Li et al. [15] demonstrated that classification models trained on real data consistently outperform those trained solely on synthetic datasets.
Given these observations, a key question arises regarding the role of annotation quality in fully synthetic datasets. When the data itself is generated by an LLM, it tends to be inherently more simplistic, neutral, and stylistically aligned with the model’s own linguistic patterns. Recent research has demonstrated that LLM-generated texts systematically differ from human-authored content, exhibiting more restricted vocabulary, fewer adjectives, simpler syntax, and less expression of strong negative emotions [47]. As a result, synthetic content is more homogeneous and predictable compared to real user-generated reviews, which are often ambiguous, idiosyncratic, and characterized by mixed sentiments and implicit topics [18]. Consequently, synthetic data annotation imposes lower cognitive demands on LLMs, resulting in closer alignment with human annotations, while real-world reviews pose greater challenges to annotation consistency. This leads to the formulation of the following hypothesis:
Hypothesis 2.
Human annotations do not differ from LLM annotations in (a) sentiment, (b) topic, and (c) aspect categorizations of synthetic review data.

3. Materials and Methods

To investigate annotation bias in LLMs, we designed a comparative study involving both real and synthetic hotel review datasets, each annotated by a human expert and three different LLMs. This section outlines the data sources, annotation protocols, and generation procedures used to ensure methodological consistency across models and conditions. By standardizing inputs and prompts across manual and automated annotation modes, we aim to isolate key factors contributing to bias and variability in LLM-generated labels. Figure 1 presents the research design, showing the two datasets (real-world HRAST and synthetic hotel reviews), their annotation by a human expert and three LLMs (using manual and batch modes), and the evaluation of sentiment, topic, and aspect labels through inter-annotator agreement metrics.

3.1. Real Review Data

In this study, we use the HRAST dataset developed by Andreou et al. [18], a publicly available benchmarking dataset comprising 23,114 hotel review sentences from 42 hotels in four European capitals. Each sentence is annotated across three dimensions: sentiment (positive, negative, neutral), topic (covering 21 hotel-related categories such as cleanliness, staff, breakfast, location, and Wi-Fi), and aspect presence (yes/no). While the binary aspect annotation does not fully capture sentences with multiple aspects or mixed sentiments, it follows the original HRAST [18] and prior work (e.g., [48]) to ensure benchmark compatibility. We selected the hospitality domain due to the strong influence of user-generated reviews on booking intentions [49,50], as well as their strategic importance in shaping competitive advantage [1]. Additionally, a significant proportion of the dataset consists of aspect-relevant sentences [18], making it well-suited for testing our hypotheses on annotation bias between human and LLM annotators.

3.2. Real Review Data Annotation

The original HRAST dataset was fully annotated by human annotators, with a subset of 5167 sentences labeled by a field expert [18]. To ensure consistency and allow a robust comparison with LLM annotations, one of the authors of the present study, who is also a field expert annotator, further annotated the remaining 17,947 sentences. As a result, as part of this work, a new expert-annotated version of the full dataset has been created and made publicly available as an updated benchmarking resource (see Data Availability statement).
The entire dataset was also annotated using LLMs. For LLM-based annotation, we used the prompt provided in Appendix A.1, which served not only as the LLM instruction set but also as the annotation guideline followed by the initial HRAST annotator and both expert annotators in the revised version. Two annotation techniques were applied. In the first method, two research assistants manually input each sentence from the dataset into ChatGPT-3.5 after initializing the model with the full prompt. Each assistant annotated 11,557 sentences. To minimize historical bias [4], they used newly created ChatGPT accounts without any prior conversation history. The results were recorded in new columns appended to the dataset, with each dimension (sentiment, topic, and aspect) binary coded as 0 (absence) or 1 (presence).
Subsequently, a second method was employed using the newer ChatGPT-4 and ChatGPT-4-mini models, which allow for automated batch processing of the full dataset. The same prompt was adapted slightly to begin with the following: “Suppose you are an annotator. Please read the attached file and, based on the following instructions, generate a new column with the output of your annotation”. This role-playing framing was intentionally used, as recent research suggests that prompting LLMs to adopt specific roles can significantly enhance output quality and consistency [15]. Two additional datasets were produced using this method and coded in the same binary format across 25 variables (three for sentiment, 21 for topics, and one for aspect presence), ensuring consistency in the annotation schema across all LLM-generated outputs.
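For readers who wish to adapt this workflow programmatically, the following sketch illustrates role-prompted annotation with binary coding via the OpenAI Python SDK. It is a minimal sketch under stated assumptions: the model identifier, JSON output format, and column names are illustrative, and the study itself relied on the ChatGPT interface (manual prompting and file-based batch processing) rather than this exact code.

```python
# Minimal sketch of role-prompted LLM annotation with binary coding.
# Hypothetical model name and column names; the study used the ChatGPT
# interface rather than this API workflow.
import json

import pandas as pd
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ROLE_PROMPT = (
    "Suppose you are an annotator. Based on the following instructions, "
    "label the sentence for sentiment (positive/negative/neutral), "
    "the hotel-related topics it mentions, and whether it contains an aspect."
)

TOPICS = ["cleanliness", "staff", "breakfast", "location", "wifi"]  # subset of the 21 topics


def annotate_sentence(sentence: str, model: str = "gpt-4") -> dict:
    """Ask the model for a JSON label set for a single review sentence."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": ROLE_PROMPT},
            {"role": "user", "content": f"Sentence: {sentence}\n"
                                        "Return JSON with keys 'sentiment', 'topics', and 'aspect'."},
        ],
    )
    return json.loads(response.choices[0].message.content)


def to_binary_row(labels: dict) -> dict:
    """Expand one annotation into the 0/1 coding scheme used in the dataset."""
    row = {f"sent_{s}": int(labels["sentiment"] == s)
           for s in ("positive", "negative", "neutral")}
    row.update({f"topic_{t}": int(t in labels.get("topics", [])) for t in TOPICS})
    row["aspect"] = int(bool(labels.get("aspect")))
    return row


# Example: annotate a few sentences and collect the binary columns in a dataframe.
sentences = ["The breakfast was excellent.", "Wi-Fi kept dropping all night."]
coded = pd.DataFrame([to_binary_row(annotate_sentence(s)) for s in sentences])
print(coded)
```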
To further isolate the effect of annotation mode, we also conducted an additional analysis where ChatGPT-3.5 was applied in batch mode to the real review dataset, using the same prompt and coding procedure as before. We then computed Cohen’s κ between ChatGPT-3.5 (batch) and the human expert annotations for all categories.
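The agreement computation itself can be reproduced with standard tooling. The sketch below uses scikit-learn's cohen_kappa_score on a hypothetical merged file with illustrative column names for the human expert and batch-mode ChatGPT-3.5 labels.

```python
# Sketch: per-category Cohen's kappa between ChatGPT-3.5 (batch) and the human expert.
# File and column names are illustrative; each category is a 0/1 indicator column.
import pandas as pd
from sklearn.metrics import cohen_kappa_score

df = pd.read_csv("hrast_annotations.csv")  # hypothetical file holding both coders' labels

categories = ["sent_positive", "sent_negative", "sent_neutral", "aspect",
              "topic_parking", "topic_breakfast", "topic_wifi"]

for cat in categories:
    kappa = cohen_kappa_score(df[f"human_{cat}"], df[f"gpt35_batch_{cat}"])
    print(f"{cat}: Cohen's kappa = {kappa:.3f}")
```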

3.3. Synthetic Review Data Generation

For the creation of the synthetic dataset, we employed OpenAI’s ChatGPT-4, following the data generation approach outlined by Li et al. [15]. The process began with an introductory role-based prompt, “Imagine you are a hotel booking review writer”, to establish context and tone. This was followed by a structured data generation prompt instructing the model to produce 500 positive and 500 negative hotel booking review sentences, covering the full range of 21 predefined topics and incorporating aspect-level complexity (see Appendix A.2). As recommended by Li et al. [15], a final diversity prompt was used to explicitly request 1000 distinct sentences. Although the generation prompt specified 2000 reviews in total, the LLM output exhibited substantial redundancy, resulting in only 199 unique sentences after post-processing and deduplication. To preserve the analytical integrity of the study and avoid inflating inter-annotator agreement through repetitive content, we retained only these distinct examples, aligning with our objective to assess the diversity and fidelity of synthetic data. This curated subset served as the synthetic dataset for subsequent annotation and analysis.
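A minimal sketch of this role-conditioned generation and deduplication step is given below, assuming the OpenAI Python SDK; the model identifier and prompt wording are simplified placeholders rather than the exact prompts reproduced in Appendix A.2, and the generation in this study was carried out interactively in ChatGPT-4.

```python
# Sketch of role-conditioned synthetic review generation followed by deduplication.
# Hypothetical model name and simplified prompts.
from openai import OpenAI

client = OpenAI()

ROLE_PROMPT = "Imagine you are a hotel booking review writer."
GENERATION_PROMPT = (
    "Write 500 positive and 500 negative single-sentence hotel booking reviews, "
    "covering 21 predefined topics and including aspect-level complexity. "
    "All sentences must be distinct. Return one sentence per line."
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": ROLE_PROMPT},
        {"role": "user", "content": GENERATION_PROMPT},
    ],
)

# Post-processing: split into sentences and keep only unique ones,
# mirroring the deduplication that reduced the corpus to 199 sentences.
raw_sentences = [s.strip() for s in response.choices[0].message.content.splitlines() if s.strip()]
unique_sentences = list(dict.fromkeys(raw_sentences))  # order-preserving deduplication
print(f"{len(raw_sentences)} generated, {len(unique_sentences)} unique after deduplication")
```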

3.4. Synthetic Review Data Annotation

The annotation procedure for the synthetic dataset mirrored that of the real review dataset to ensure consistency in comparison. Specifically, a domain expert manually annotated all 199 synthetic review sentences using the same guidelines applied to the real data. In parallel, automated annotations were generated using prompt-engineered inputs in ChatGPT-3.5, ChatGPT-4, and ChatGPT-4-mini. As with the real dataset, each model’s outputs were processed using the previously described coding scheme, transforming sentiment, topic, and aspect classifications into binary indicators for subsequent analysis.

4. Results

4.1. Descriptive Statistics

The final dataset comprised 23,113 real and 199 unique synthetic hotel review sentences, each annotated by one human expert and three LLMs (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4-mini). Annotations covered three dimensions: sentiment (three categories: positive, negative, neutral), topic (21 categories, e.g., cleanliness, location, room), and one aspect category capturing non-specific references to service elements. In terms of review length, the real dataset consisted of reviews ranging from two to 123 words (M = 10.30, SD = 7.15), while the synthetic dataset included reviews between two and ten words (M = 5.62, SD = 1.26), indicating shorter and potentially less complex text generated synthetically. The distribution of sentiment, topic, and aspect annotations across coders for both datasets is presented in Figure 2.
Focusing on the human annotations, sentiment in the real dataset was primarily positive (N = 11,819; 51%), followed by negative (N = 10,504; 45%), and a small proportion was labeled neutral (N = 794; 3%). In the synthetic dataset, sentiment was predominantly positive (N = 149; 75%), with negative (N = 47; 24%) and neutral (N = 3; 2%) annotations appearing less frequently. This is noteworthy given that the original prompt for synthetic data generation requested a balanced distribution of 500 positive and 500 negative reviews. Regarding topic annotations, the most frequently occurring topics in the real dataset, based on human annotation, were room (N = 5004; 21.65%), location (N = 4729; 20%), and staff (N = 3804; 16%). In contrast, the synthetic reviews demonstrated a relatively even distribution of topic mentions, with cleanliness being the most frequently tagged topic (N = 16; 8%), albeit with low absolute frequencies overall.
Review complexity also differs significantly between datasets. Based on human annotations, real reviews contained between zero and eight topics per sentence (M = 1.60, SD = 0.86), while synthetic reviews ranged from one to two topics (M = 1.19, SD = 0.39). This suggests that synthetic review sentences tend to be less complex, often expressing fewer evaluative dimensions than their real counterparts. This is also evident in the aspect dimension, where 12% (N = 2884) of the real reviews and only 4% (N = 8) of the synthetic reviews were coded as containing an aspect, further supporting the observation of reduced complexity and abstraction in synthetic content.
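These descriptive statistics can be derived directly from the binary coding scheme. The sketch below assumes a hypothetical merged annotation file and illustrative column names.

```python
# Sketch: review length and topics-per-sentence derived from the binary coding scheme.
# File and column names are illustrative.
import pandas as pd

df = pd.read_csv("hrast_annotations.csv")  # hypothetical merged annotation file

topic_cols = [c for c in df.columns if c.startswith("human_topic_")]

df["word_count"] = df["sentence"].str.split().str.len()          # review length in words
df["topics_per_sentence"] = df[topic_cols].sum(axis=1)            # number of topics tagged per sentence

print(df[["word_count", "topics_per_sentence"]].agg(["mean", "std", "min", "max"]))
print("Share of sentences coded as containing an aspect:", df["human_aspect"].mean())
```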

4.2. Hypothesis Testing

Our hypotheses addressed potential differences in annotation patterns between annotators across both real (Hypothesis 1) and synthetic (Hypothesis 2) datasets. To investigate these differences, we conceptualized LLMs as coders and evaluated the level of agreement between them using inter-coder reliability metrics. Inter-coder reliability was assessed using Cohen’s kappa when comparing two coders, and Fleiss’ kappa when three or more coders were involved [51]. These metrics are widely used in content analysis to quantify the extent of agreement, with interpretive guidelines well-established in the literature [52,53]. According to Landis and Koch [53], kappa values are classified as follows: values between 0.00 and 0.20, slight agreement; 0.21 and 0.40, fair; 0.41 and 0.60, moderate; 0.61 and 0.80, substantial; and 0.81 and 1.00, almost perfect agreement. To complement the kappa statistics, Figure 3 provides a visual overview of inter-annotator agreement patterns across coders and datasets. Panel (a) shows the almost perfect agreement among the three LLMs, particularly in the synthetic dataset. Panel (b) depicts the moderate agreement between the human annotator and ChatGPT-4 in the real dataset, with noticeably lower agreement on neutral sentiment. Panel (c) highlights the higher alignment between the human annotator and ChatGPT-3.5 under manual annotation, and panel (d) shows consistently strong agreement between ChatGPT-4 and ChatGPT-3.5.
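For transparency, the sketch below shows how Cohen's and Fleiss' kappa, together with the Landis and Koch interpretation bands, can be computed with scikit-learn and statsmodels; the file and column names are illustrative rather than taken from the released dataset.

```python
# Sketch: inter-coder reliability for one binary category across the four coders,
# with the Landis & Koch (1977) interpretation bands. Column names are illustrative.
import pandas as pd
from sklearn.metrics import cohen_kappa_score
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa


def interpret_kappa(k: float) -> str:
    """Map a kappa value onto the Landis & Koch agreement bands."""
    if k < 0:
        return "poor"
    bands = [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
             (0.80, "substantial"), (1.00, "almost perfect")]
    return next(label for upper, label in bands if k <= upper)


df = pd.read_csv("hrast_annotations.csv")  # hypothetical merged annotation file

# Cohen's kappa: human expert vs. ChatGPT-4 on positive sentiment.
k2 = cohen_kappa_score(df["human_sent_positive"], df["gpt4_sent_positive"])
print(f"Human vs. ChatGPT-4: kappa = {k2:.3f} ({interpret_kappa(k2)})")

# Fleiss' kappa: agreement among the three LLM coders on the same category.
ratings = df[["gpt35_sent_positive", "gpt4_sent_positive", "gpt4mini_sent_positive"]].to_numpy()
table, _ = aggregate_raters(ratings)
k3 = fleiss_kappa(table, method="fleiss")
print(f"Three LLMs: Fleiss' kappa = {k3:.3f} ({interpret_kappa(k3)})")
```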
Our dataset included four coders: one human and three LLMs (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4-mini). However, to test our hypotheses more robustly, we first examined the inter-rater agreement among the three LLMs. As shown in Figure 3a, their agreement was almost perfect (Fleiss' κ = 1.00, p < 0.001) across all categories in the synthetic dataset. In the real dataset, the LLMs achieved substantial to almost perfect agreement across most categories (Fleiss' κ > 0.60, p < 0.001). Exceptions included neutral sentiment (Fleiss' κ = 0.147, 95% CI [0.140, 0.154], p < 0.001), positive sentiment (Fleiss' κ = 0.541, 95% CI [0.534, 0.549], p < 0.001), and the facilities topic (Fleiss' κ = 0.297, 95% CI [0.289, 0.304], p < 0.001), where agreement was lower.
Since the three LLMs demonstrated substantial agreement across most categories, with a few exceptions, we proceeded to test our hypotheses by comparing the human annotator's labels with those of ChatGPT-4, which served as a representative of the LLMs, given that it was the most advanced of the three models at the time of writing. In the real dataset, the agreement between the human annotator and ChatGPT-4 was moderate for both positive (Cohen's κ = 0.434, p < 0.001) and negative (Cohen's κ = 0.462, p < 0.001) sentiment, while agreement on neutral sentiment was slight (Cohen's κ = 0.024, p < 0.001). These findings support Hypothesis 1a. This pattern also suggests that LLMs tend to default to the neutral sentiment category, likely the simplest and least risky choice, whereas the human annotator was more willing to assign a clear evaluative tone (positive or negative), even in more ambiguous cases.
Regarding topic annotation (see Figure 3b), slight agreement between the human annotator and ChatGPT-4 was observed for facilities (Cohen's κ = 0.180, p < 0.001) and fair agreement for the generic topic (Cohen's κ = 0.217, p < 0.001). Moderate agreement was found for reception (Cohen's κ = 0.543, p < 0.001), value for money (Cohen's κ = 0.453, p < 0.001), comfort (Cohen's κ = 0.588, p < 0.001), and restaurant (Cohen's κ = 0.585, p < 0.001). Contrary to our expectations, however, the majority of topics exhibited substantial to almost perfect agreement between the human annotator and ChatGPT-4 (Cohen's κ ≥ 0.63, p < 0.001). More specifically, almost perfect agreement was observed for Wi-Fi (Cohen's κ = 0.97, p < 0.001), breakfast (Cohen's κ = 0.97, p < 0.001), parking (Cohen's κ = 0.96, p < 0.001), lift (Cohen's κ = 0.94, p < 0.001), bathroom and pool (Cohen's κ = 0.93, p < 0.001), staff (Cohen's κ = 0.89, p < 0.001), bed (Cohen's κ = 0.86, p < 0.001), location and cleanliness (Cohen's κ = 0.85, p < 0.001), and beach (Cohen's κ = 0.83, p < 0.001). These findings suggest that while discrepancies remain for more subjective or abstract topics, ChatGPT-4 aligns closely with human annotations on concrete and unambiguous topics. Therefore, Hypothesis 1b is partially supported. As far as aspect-based sentiment analysis is concerned, there was fair agreement between the human annotator and ChatGPT-4 (Cohen's κ = 0.207, p < 0.001); thus, Hypothesis 1c is also supported.
As far as the synthetic dataset is concerned, and contrary to our expectations, the agreement between the human annotator and ChatGPT-4 on sentiment labels was generally low. Specifically, there was slight agreement for neutral (Cohen's κ = 0.015, p = 0.224) and positive (Cohen's κ = 0.176, p < 0.001) sentiment, and fair agreement for negative sentiment (Cohen's κ = 0.211, p < 0.001). These results lead to the rejection of Hypothesis 2a. Overall, the LLMs appeared to default more readily to the neutral category rather than explicitly choosing a sentiment polarity, even in the controlled setting of synthetic data. In terms of topic classification, the majority of categories exhibited substantial or almost perfect agreement between the human annotator and ChatGPT-4. However, lower levels of agreement were observed for comfort (Cohen's κ = 0.519, p < 0.001), restaurant (Cohen's κ = 0.444, p < 0.001), generic (Cohen's κ = 0.549, p < 0.001), and facilities (Cohen's κ = 0.273, p < 0.001), all of which fell below the substantial threshold. These findings provide only partial support for Hypothesis 2b: human and LLM annotations largely align on concrete topics in the synthetic data but still diverge on the more abstract categories. Similarly, in the synthetic dataset, aspect classification produced the lowest agreement score, with only slight agreement observed (Cohen's κ = 0.143, p < 0.001), leading to the rejection of Hypothesis 2c.
It is also worth noting that, in both the real and synthetic datasets, the agreement between the human annotator and ChatGPT-4 was very high for certain topics, such as pool, lift, parking, bathroom, Wi-Fi, and bed. However, notable differences emerged between the two datasets in other topics: for example, agreement on breakfast and generic was higher in the real dataset than in the synthetic one, while agreement on reception and value for money was higher in the synthetic dataset than in the real dataset. Additionally, agreement on positive and negative sentiment was consistently higher in the real dataset compared to the synthetic dataset.
In addition, although it was beyond the initial scope of the study, we also explored the agreement between ChatGPT-3.5 and both the human annotator and ChatGPT-4, given that annotation with ChatGPT-3.5 was performed manually on a one-to-one basis in the real dataset. Figure 3c,d present these comparisons. In the synthetic dataset, where annotation by ChatGPT-3.5 was generated automatically, there was almost perfect agreement (Cohen's κ = 1.00, p < 0.001) across all annotated variables. In the real dataset, almost perfect agreement between ChatGPT-3.5 and ChatGPT-4 was found only in a subset of categories, including staff (Cohen's κ = 0.84, p < 0.001), pool (Cohen's κ = 0.89, p < 0.001), bathroom (Cohen's κ = 0.90, p < 0.001), lift (Cohen's κ = 0.90, p < 0.001), parking (Cohen's κ = 0.94, p < 0.001), breakfast (Cohen's κ = 0.96, p < 0.001), and Wi-Fi (Cohen's κ = 0.96, p < 0.001). Notably, when comparing ChatGPT-3.5 directly with the human annotator in the real dataset, almost perfect agreement was achieved in several categories, including lift (Cohen's κ = 0.91, p < 0.001), parking (Cohen's κ = 0.96, p < 0.001), pool (Cohen's κ = 0.90, p < 0.001), bathroom (Cohen's κ = 0.90, p < 0.001), Wi-Fi (Cohen's κ = 0.95, p < 0.001), staff (Cohen's κ = 0.89, p < 0.001), breakfast (Cohen's κ = 0.95, p < 0.001), location (Cohen's κ = 0.91, p < 0.001), bed (Cohen's κ = 0.82, p < 0.001), noise (Cohen's κ = 0.86, p < 0.001), negative sentiment (Cohen's κ = 0.88, p < 0.001), and positive sentiment (Cohen's κ = 0.86, p < 0.001). These findings suggest that when LLMs are used for one-to-one annotation, processing and labeling each item individually, they can approximate human annotation to a highly satisfactory degree. However, this level of agreement diminishes significantly when LLMs are tasked with bulk or automated annotation across large sets of inputs. It is also particularly noteworthy that the older ChatGPT-3.5 model, when used manually, outperformed newer and ostensibly more advanced models in aligning with human annotations, highlighting the critical impact of annotation mode over model scale or architecture.
To further investigate the impact of annotation mode, we compared ChatGPT-3.5 annotations performed in batch mode with those obtained in manual, one-by-one mode (see Table 2). The agreement between ChatGPT-3.5 (batch) and the human annotator is shown in Table 3. Compared to ChatGPT-3.5 in manual mode, ChatGPT-3.5 in batch mode yielded substantially lower κ values for sentiment (neutral: Cohen's κ = 0.109, p < 0.001; negative: Cohen's κ = 0.201, p < 0.001; positive: Cohen's κ = 0.535, p < 0.001), aspect (Cohen's κ = 0.244, p < 0.001), and abstract topics such as facilities (Cohen's κ = 0.270, p < 0.001) and generic (Cohen's κ = 0.381, p < 0.001). Conversely, agreement remained high in concrete, easily identifiable topics such as parking (Cohen's κ = 0.945, p < 0.001), breakfast (Cohen's κ = 0.958, p < 0.001), and Wi-Fi (Cohen's κ = 0.963, p < 0.001). These results closely mirror those of ChatGPT-4 and ChatGPT-4-mini in batch mode, suggesting that annotation mode exerts a stronger influence on agreement with human annotations than model version alone. To illustrate the observed patterns, Table 3 shows representative sentences where LLM annotations disagreed or agreed with the human coder. Disagreements often reflect neutrality bias, missed aspects, or incomplete topic detection, while agreement occurred in clear, simple cases. This table also reveals that LLMs systematically default to neutral sentiment, reflecting a conservative and risk-averse response to ambiguity, while the human annotator more readily assigned a definitive positive or negative sentiment.

5. Discussion

The findings of this study reveal multiple systematic forms of bias that shape annotation outcomes when comparing LLMs to a human annotator across both real and synthetic datasets. First of all, the three LLMs (ChatGPT-3.5, ChatGPT-4, and ChatGPT-4-mini) exhibited substantial to almost perfect agreement among themselves across most coding categories, particularly within the synthetic dataset. When comparing ChatGPT-4 to a human annotator, moderate agreement was observed for positive and negative sentiment in the real dataset, but only slight to fair agreement in the synthetic dataset, indicating a tendency for LLMs to default to neutral sentiment. Topic classification generally showed substantial agreement, although some topics (e.g., facilities, generic) fell below this threshold. Aspect annotation showed consistently low agreement in both datasets. Importantly, when LLMs were used for one-to-one manual annotation, as was the case with ChatGPT-3.5 in the real dataset, almost perfect agreement with the human annotator was observed in several categories, suggesting that the annotation approach (manual vs. batch) significantly affects performance. These results do not serve to assess the relative performance of human and AI annotators but rather underscore how divergent cognitive and algorithmic rationality leads to systematic AI biases.

5.1. Theoretical Contribution

These findings raise important theoretical issues that have a bearing on the AI biases literature. The first type of bias identified comes from the structural properties of the synthetic dataset itself. Specifically, of the 2000 LLM-generated review sentences, only 199 were unique. This empirical result reflects well-documented challenges associated with generative AI models, including output repetition, exposure bias, and hallucination [54,55]. Crucially, the observed simplicity and repetitiveness of the synthetic data should not be construed as a limitation of the present study but rather as an empirical manifestation of AI bias inherent in LLMs. This structural bias undermines the ecological validity of synthetic data by stripping away the ambiguity, heterogeneity, and multi-dimensionality characteristic of authentic user-generated reviews.
A second layer of bias revealed in the analysis pertains to behavioral biases, which emerge from the interaction between annotation mode and task structure. The divergence between the high-fidelity outputs of manual, one-to-one annotation and the less consistent results of batch processing suggests that LLM behavior is shaped not only by the input but also by the computational context of the task. We hypothesize that several factors contribute to this phenomenon. First, batch processing may induce context dilution; in a large-volume request, the model’s attentional resources are distributed across numerous items, potentially weakening the guiding influence of the initial prompt on any single data point and leading to more generic classifications. Second, this discrepancy may reflect divergent computational optimization goals, wherein interactive sessions prioritize conversational coherence and per-item accuracy, while batch processing prioritizes aggregate throughput, potentially at the expense of interpretive nuance.
This behavior could be linked to implicit parameter settings such as temperature, which governs the randomness of the model’s output. Although not explicitly configured in our study, batch processing APIs may operate under a more conservative (i.e., lower temperature) default setting than interactive sessions. This would align with the “temperature paradox” [56,57], where lower temperature settings favor deterministic, high-probability outputs over more creative or nuanced interpretations. Such a setting would naturally promote safer, default classifications—like ’neutral’—and inhibit the model’s ability to resolve ambiguity, thereby constituting a distinct, context-dependent form of AI bias. These findings broadly support existing research linking input complexity to LLM performance [58,59] and underscore that the annotation mode itself is a significant source of systematic bias.
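Where annotation is performed through the API rather than the chat interface, the decoding temperature can be fixed explicitly so that manual and batch runs share identical sampling settings. The sketch below is purely illustrative: the temperature parameter was not configured in the present study, and the model identifier and prompts are assumptions.

```python
# Sketch: pinning the sampling temperature explicitly when annotating via the API,
# so that manual and batch runs share the same decoding settings.
# Hypothetical model name; temperature was not configured in this study.
from openai import OpenAI

client = OpenAI()


def annotate(sentence: str, temperature: float = 0.0) -> str:
    """Single-sentence annotation call with an explicit, reproducible temperature."""
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=temperature,  # 0.0 favors deterministic outputs; higher values add randomness
        messages=[
            {"role": "system", "content": "You are an annotator for hotel review sentences."},
            {"role": "user", "content": f"Label the sentiment of: {sentence}"},
        ],
    )
    return response.choices[0].message.content
```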
A third and particularly consequential form of bias uncovered in this study is the emergence of a pronounced neutrality bias in LLM-based annotation, especially within the sentiment classification task. As discussed by Herrera-Poyatos et al. [19], this tendency reflects broader challenges of uncertainty, variability, and opacity inherent to LLMs. Across both datasets, the models exhibited a strong tendency to default to neutral sentiment categories, a pattern significantly more frequent than observed in human coding. This behavior aligns with one definition of neutrality as “not favouring one position over another” [60] (p. 346), suggesting that LLMs may over-correct in an effort to appear objective or non-partisan. However, this neutrality likely arises not only from such over-correction but also from a structural combination of aleatoric and epistemic uncertainty, risk-averse inference favoring safe outputs under ambiguity, prompt sensitivity, and the black-box nature of LLM decision-making, which obscures how these outcomes are reached [19]. Together, these factors contribute to what appears to be a gatekeeping bias [61], wherein the model avoids taking a definitive evaluative stance. In doing so, LLMs may effectively “neutralize biased text” [62] (p. 480), but they also diminish their capacity to detect sentiment polarity, thereby compromising their performance as reliable annotators.

5.2. Methodological Contribution

Methodologically, this study offers a significant contribution by designing and implementing a systematic, multi-model annotation comparison across both real and synthetic data in the hospitality domain. By utilizing the HRAST dataset [18] and extending it with new expert annotations, the study enhances the reliability and applicability of an already recognized benchmarking resource. The study also introduces a dual-mode annotation protocol, comparing manual one-to-one interactions with ChatGPT-3.5 to fully automated batch processing using ChatGPT-4 and ChatGPT-4-mini. This methodological design not only enables robust hypothesis testing but also provides critical insights into how annotation mode (manual vs. automated) impacts inter-coder reliability between humans and LLMs. Moreover, the standardized use of prompt engineering (including consistent instructions and role-based framing; [15]) ensures comparability across models and reflects real-world practices in AI deployment for content analysis. Furthermore, by integrating a controlled synthetic dataset, generated through role-conditioned prompting and topic balancing, the study establishes a rigorous testing ground for isolating annotation behaviors.

5.3. Practical Contribution

This study offers clear guidance for researchers and digital marketing practitioners using LLMs for content analysis, sentiment classification, and customer feedback monitoring. While LLMs perform reliably in one-to-one, prompt-based annotation settings, their effectiveness declines in bulk or automated contexts, especially in tasks requiring subjective judgment. This suggests that LLMs should be used as human-assistive tools, not autonomous annotators, which is in line with Wang et al. [43].
The diminished performance observed in batch processing underscores a critical trade-off between scalability and annotation fidelity. The performance degradation in batch mode is likely attributable to the increased cognitive load imposed by large-volume inputs, which can result in a loss of instruction fidelity. As the model processes thousands of items in a single request, the influence of the initial, nuanced instructions may be diluted, increasing the propensity for simplified heuristics and default classifications. Consequently, the choice between manual and batch processing is not one of inherent superiority but of strategic alignment with analytical goals. For high-stakes tasks requiring granular accuracy, such as fine-grained brand sentiment analysis, the loss of fidelity may be unacceptable. Conversely, for large-scale, low-stakes applications like broad topic identification, the efficiency of batch processing may justify the modest compromise in per-item accuracy. This distinction is crucial for practitioners aiming to deploy LLMs effectively and responsibly in analytical workflows.
For digital marketers, the high internal consistency among LLMs in synthetic data highlights their utility for large-scale, low-stakes tasks, such as identifying frequently mentioned product features. However, the observed neutrality bias indicates that LLMs may underrepresent emotional tone, potentially skewing sentiment analyses. Human-in-the-loop processes or well-designed prompt strategies can help mitigate these risks.
The study also shows that LLMs are better at identifying concrete topics (e.g., Wi-Fi, parking) than abstract or evaluative dimensions (e.g., value for money, comfort), which should inform how marketers delegate tasks between AI and human analysts. Additionally, our synthetic data generation revealed that even when LLMs follow structured prompts, the outputs often lack authenticity, raising concerns about the use of LLMs for generating synthetic reviews. These findings call for careful human oversight when using AI-generated content in consumer-facing or analytical applications.
Lastly, the findings also inform practical strategies for mitigating the biases we identified. To counteract the LLMs’ pronounced neutrality bias, practitioners can employ several targeted approaches beyond general human oversight. One such strategy involves refining prompt engineering to establish a “forced-choice” framework, which instructs the model to assign either a positive or negative sentiment, reserving the ’neutral’ category exclusively for objectively descriptive statements. A more integrated approach is the implementation of a tiered annotation protocol. In this hybrid model, the LLM performs the initial classification pass but is programmed to flag sentences where its confidence score for a neutral label is high or where probabilities for competing labels are proximate. These flagged instances are then routed to a human annotator for a final decision, thereby preserving the scalability of automated annotation while strategically allocating human expertise to the most ambiguous cases where bias is most likely to manifest.
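Such a tiered protocol can be expressed as a simple routing rule over the model's label probabilities. The sketch below uses hypothetical probability outputs and thresholds rather than values derived from this study.

```python
# Sketch of the tiered annotation protocol described above: keep confident,
# non-neutral LLM labels automatically and route ambiguous cases to a human.
# Probabilities and thresholds are hypothetical.

NEUTRAL_FLAG_THRESHOLD = 0.6   # escalate if the 'neutral' probability is this high
MARGIN_THRESHOLD = 0.15        # escalate if the top two labels are this close


def route(label_probs: dict) -> str:
    """Return 'auto' to keep the LLM label or 'human' to escalate the sentence."""
    ranked = sorted(label_probs.items(), key=lambda kv: kv[1], reverse=True)
    (_, top_p), (_, second_p) = ranked[0], ranked[1]
    if label_probs.get("neutral", 0.0) >= NEUTRAL_FLAG_THRESHOLD:
        return "human"                      # likely neutrality-bias case
    if top_p - second_p < MARGIN_THRESHOLD:
        return "human"                      # competing labels too close to call
    return "auto"


# Example: a sentence with a high neutral score is escalated; a clear case is not.
print(route({"positive": 0.25, "negative": 0.15, "neutral": 0.60}))  # -> human
print(route({"positive": 0.85, "negative": 0.10, "neutral": 0.05}))  # -> auto
```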

5.4. Limitations and Future Research

Like any study, this research is subject to several limitations. First, although the HRAST dataset was initially annotated by multiple human coders, the present study relied on a single expert human annotator to create a consistent benchmark for comparison with LLM outputs. This approach ensured high-quality and domain-informed annotations compared to alternative methods such as crowdsourcing, which often introduce additional variability and lower reliability [63,64]. However, relying on a single expert does limit the generalizability of the observed human annotation patterns, as it does not capture intra-human variability. Given the scale of the dataset, involving multiple expert annotators would have been prohibitively resource-intensive. Future studies should aim to include multiple human coders, at least on a representative subset of the data, to better assess inter-rater reliability and establish a stronger human benchmark. Furthermore, while Cohen’s and Fleiss’ kappa remain standard metrics for assessing inter-annotator agreement beyond chance [51,53], we acknowledge their sensitivity to class imbalance, which can depress scores for infrequent categories even when absolute agreement is high [65]. Future research could therefore complement kappa with class-level metrics, such as precision, recall, and the Area under the ROC Curve (AUC), to better capture performance in imbalanced settings and to disentangle the effects of imbalance from genuine disagreement.
Second, the study focused exclusively on the ChatGPT family of LLMs, which, although representative of current state-of-the-art models, may not reflect the full spectrum of annotation behavior exhibited by alternative architectures or providers such as Llama, Flan, or DeepSeek [66]. This focus is nevertheless warranted given the widespread adoption and familiarity of ChatGPT variants among both professional practitioners and general users. Recent survey data show that ChatGPT is currently among the most commonly used generative AI tools by U.S. adults, reflecting its prominence in everyday use compared to other available LLMs [67]. Accordingly, our findings regarding annotation biases (e.g., neutrality, repetition, and behavioral bias) should be interpreted within the context of the ChatGPT ecosystem. Future research would benefit from extending this comparative framework to encompass a broader array of LLM architectures, thereby enhancing the generalizability of insights into annotation behavior. Moreover, the annotation task was restricted to sentiment, topic, and aspect labeling within hotel review-style data. As such, findings may not generalize to other domains, such as news articles, social media content, or more technical text types, where annotation demands and interpretive complexity may differ significantly. As illustrated in Table 1, prior studies across diverse domains have reported varying patterns of bias and agreement, further underscoring the need for domain-specific evaluation.
Third, the annotation process itself varied across models: ChatGPT-3.5 was used in a manually prompted, one-to-one format, while ChatGPT-4 and ChatGPT-4-mini operated in bulk-processing mode. Although this distinction enabled valuable insights into the impact of annotation mode, it also introduces a confounding variable that future research should isolate and control for more systematically. Furthermore, the synthetic dataset was generated using a prompt-based approach without structurally replicating the distributional characteristics of the real dataset. Future research could build on recent recommendations (e.g., [13]) by generating synthetic data that more closely mirrors the structure, complexity, and distribution of actual datasets. This would allow for more ecologically valid assessments of LLM performance in artificial data contexts.
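As a rough illustration of what such distribution-aware generation could look like, the sketch below derives topic and sentiment quotas from an annotated real dataset and embeds them in a generation prompt. The file name and column names are hypothetical placeholders, not the released HRAST schema, and the quota prompt is only one of several possible ways to mirror an empirical distribution.

```python
import pandas as pd  # assumes the real annotations are available as a CSV file

# Hypothetical file and column names, used only for illustration.
real = pd.read_csv("hrast_annotations.csv")

# Empirical shares (treats each annotated topic string as one category for simplicity).
topic_share = real["topic"].value_counts(normalize=True).round(3)
sentiment_share = real["sentiment"].value_counts(normalize=True).round(3)

# Turn the empirical shares into explicit quotas for a generation prompt,
# so the synthetic corpus mirrors the real topic/sentiment distribution.
n_synthetic = 2000
quota_lines = [
    f"- {topic}: about {int(round(share * n_synthetic))} sentences"
    for topic, share in topic_share.items()
]
prompt = (
    "Generate hotel review sentences matching these topic quotas:\n"
    + "\n".join(quota_lines)
    + f"\nKeep the sentiment mix close to: {sentiment_share.to_dict()}."
)
print(prompt)
```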
Fourth, although we conducted an additional analysis comparing ChatGPT-3.5 in batch versus manual modes, our design did not fully cross models and modes. While the findings suggest that annotation mode has a stronger effect on agreement patterns than the model version, further research is needed to systematically evaluate both modes across multiple models and datasets. This study also does not account for the potential impact of the temperature parameter, often referred to as the “temperature paradox”, which may contribute to AI-induced biases. Prior research suggests that adjusting the temperature can slightly improve causal reasoning performance [56] while exhibiting complex trade-offs: temperature is positively correlated with output novelty, negatively correlated with coherence, and shows no statistically significant correlation with creativity [57]. Despite employing a diversity-enhancing prompt strategy, as recommended by Li et al. [15], to guide the generation of synthetic data in this study, the outputs nonetheless exhibited substantial repetition, suggesting that prompt engineering alone may be insufficient to address redundancy bias. Future research should therefore examine more systematically how temperature settings contribute to AI-driven biases, both in synthetic data generation and in annotation tasks. While such work could also explore parameter tuning (e.g., temperature, top-p sampling) and more sophisticated generation strategies to mitigate repetition bias, our decision to employ a straightforward prompting approach was intentional: it reflects common practitioner practice, prioritizing accessibility over optimization, and thus enhances the ecological validity of our findings.
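Purely as a starting point for such work, the sketch below varies the temperature parameter and measures how much repetition appears in the generated sentences. It assumes the openai (>=1.0) Python client; the model name is a placeholder, the prompt is abbreviated, and the uniqueness ratio is only a crude proxy for repetition bias.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Generate 20 short, realistic hotel booking review sentences, "
    "one per line, with no numbering or extra text."
)

def unique_ratio(temperature: float, model: str = "gpt-4o-mini") -> float:
    """Generate sentences at a given temperature and return the share that is unique.

    The model name is a placeholder; any chat-completion model could be substituted.
    A low ratio indicates heavy repetition in the generated output.
    """
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=temperature,
    )
    sentences = [
        s.strip().lower()
        for s in response.choices[0].message.content.splitlines()
        if s.strip()
    ]
    return len(set(sentences)) / max(len(sentences), 1)

for temp in (0.2, 0.7, 1.2):
    print(f"temperature={temp}: unique ratio={unique_ratio(temp):.2f}")
```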

6. Conclusions

In sum, the study identifies and empirically demonstrates three distinct forms of AI bias affecting annotation outcomes: (1) repetition bias, driven by the highly redundant structure of synthetic data; (2) behavioral bias, contingent upon both the mode of input processing (manual versus bulk) and the nature of the dataset (real versus synthetic); and (3) neutrality bias, reflecting LLMs’ conservative default toward neutral classifications in sentiment annotation, coupled with notable limitations in aspect-based sentiment analysis, particularly when sentences contain contradictory sentiments directed at different aspects or topics. These findings underscore the need for further research, with particular emphasis on neutrality bias, which warrants deeper investigation across different annotation tasks (e.g., topic classification) and diverse application contexts. Finally, this study contributes to the research community by releasing a fully annotated version of the HRAST dataset, labeled by an expert human coder. This resource is publicly available and intended to support future benchmarking, replication, and further research.

Author Contributions

Conceptualization, N.T., C.D. and M.C.V.; methodology, N.T. and M.C.V.; formal analysis, M.C.V.; data curation, N.T. and M.C.V.; writing—original draft preparation, M.C.V.; writing—review and editing, N.T. and C.D.; visualization, M.C.V.; supervision, N.T.; project administration, N.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the use of publicly available online booking review texts, which do not involve interaction with human participants or the collection of identifiable private information.

Data Availability Statement

The revised HRAST dataset is available at https://www.kaggle.com/datasets/costastziouvas/hotel-reviews-aspects-sentiments-and-topics (accessed on 27 June 2025).

Acknowledgments

We gratefully acknowledge Christiana Andreou, Varnavia Giorgalla, and Eleni Kakoulli for their valuable assistance with prompt preparation and the manual input for ChatGPT-3.5 annotation.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:
Abbreviation | Definition
AI | Artificial Intelligence
LLM | Large Language Model
HRAST | Hotel Reviews: Aspects, Sentiments, and Topics
CSV | Comma-Separated Values
AUC | Area Under the ROC Curve
ROC | Receiver Operating Characteristic
GPT | Generative Pre-trained Transformer
SE | Standard Error
SD | Standard Deviation
M | Mean
p | p-value (probability value)

Appendix A

Appendix A.1. Annotation Protocol and Prompt Design for LLM-Based Labeling

Hotel Review Sentence Evaluation
Task: In this assignment, you will evaluate individual sentences from hotel reviews. Your primary goal is to determine the sentiment, topic, and aspect of each sentence. Your contribution will help us understand customer feedback more comprehensively.
Sentence Evaluation
Sentiment: Determine if the sentence conveys a positive, negative, or neutral sentiment.
Aspect: Decide if the sentence (yes/no) contains multiple topics or combines positive and negative elements related to a single topic. If met, mark it as “Aspect”.
Examples of aspect sentences:
  • The breakfast was poor but the staff was very helpful
  • The breakfast was decent and the room was quiet
  • The breakfast was rich but quite expensive
Select the most relevant topic from the provided categories:
  • Room
  • Location
  • Staff
  • Cleanliness
  • Comfort
  • Facilities/Amenities
  • Breakfast
  • Bathroom/Shower (toilet)
  • Bed
  • Noise
  • Value for money
  • Generic (design, architecture, building, atmosphere, etc.)
  • View (Balcony)
  • Parking
  • Bar
  • Pool
  • Restaurant (dinner)
  • Reception (check-in, check-out, etc.)
  • Lift
  • Wi-Fi
  • Beach
  • None
Question 1: Judge the sentiment of the review: (positive/negative/neutral) 0. Negative, 1. Neutral, 2. Positive
Question 2: Is this review aspect? (Yes/No) 0. No, 1. Yes
Question 3: What topic(s) does the review contain? (you can choose 1 to 7 topics)
Choose from the above:
  • Room
  • Location
  • Staff
  • Cleanliness
  • Comfort
  • Facilities/Amenities
  • Breakfast
  • Bathroom/Shower (toilet)
  • Bed
  • Noise
  • Value for money
  • Generic (design, architecture, building, atmosphere, etc.)
  • View (Balcony)
  • Parking
  • Bar
  • Pool
  • Restaurant (dinner)
  • Reception (check-in, check-out, etc.)
  • Lift
  • Wi-Fi
  • Beach
  • None
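Purely as an illustration of how the protocol above could be issued programmatically for a single sentence, the sketch below wraps the three questions in one chat-completion call. It assumes the openai (>=1.0) Python client; the model name, the JSON output format, and the parsing step are assumptions made for this example and do not describe the study’s actual annotation pipeline.

```python
import json
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()

INSTRUCTIONS = (
    "You evaluate individual sentences from hotel reviews. "
    "Answer three questions: (1) sentiment: 0=Negative, 1=Neutral, 2=Positive; "
    "(2) aspect: 0=No, 1=Yes; (3) topics: a list of 1 to 7 topics from the provided "
    "categories (Room, Location, Staff, Cleanliness, Comfort, Facilities/Amenities, "
    "Breakfast, Bathroom/Shower, Bed, Noise, Value for money, Generic, View, Parking, "
    "Bar, Pool, Restaurant, Reception, Lift, Wi-Fi, Beach, None). "
    "Reply with a JSON object with keys 'sentiment', 'aspect', and 'topics'."
)

def annotate_sentence(sentence: str, model: str = "gpt-4o-mini") -> dict:
    """Label one review sentence, e.g. {'sentiment': 2, 'aspect': 0, 'topics': ['Staff']}."""
    response = client.chat.completions.create(
        model=model,  # placeholder model name
        messages=[
            {"role": "system", "content": INSTRUCTIONS},
            {"role": "user", "content": sentence},
        ],
        temperature=0,
    )
    # Assumes the model returns valid JSON as instructed.
    return json.loads(response.choices[0].message.content)

print(annotate_sentence("The breakfast was poor but the staff was very helpful."))
```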

Appendix A.2. Prompt Design for Synthetic Hotel Review Generation Using LLMs

Step 1: intro-role prompt
Imagine you are a hotel booking review writer
Step 2: data generation prompt
You are tasked with generating 1000 realistic booking review sentences for a hotel. Each sentence must meet the following criteria:
  • Be concise and natural, reflecting typical guest feedback
  • Contain no more than 30 words
  • Be written in the style of real booking reviews
  • Include 500 positive and 500 negative sentences
Ensure that all sentences are distributed across the following review topics:
  • Room
  • Location
  • Staff
  • Cleanliness
  • Comfort
  • Facilities/Amenities
  • Breakfast
  • Bathroom/Shower (toilet)
  • Bed
  • Noise
  • Value for money
  • Generic (design, architecture, building, atmosphere, etc.)
  • View (Balcony)
  • Parking
  • Bar
  • Pool
  • Restaurant (dinner)
  • Reception (check-in, check-out, etc.)
  • Lift
  • Wi-Fi
  • Beach
Additionally, include a variety of Aspect sentences, defined as sentences that either:
  • Combine positive and negative sentiments about the same topic, or
  • Refer to multiple topics within the same sentence.
Examples of Aspect sentences:
  • The breakfast was poor but the staff was very helpful.
  • The breakfast was decent and the room was quiet.
  • The breakfast was rich but quite expensive.
Output Requirements:
  • Format: CSV
  • Each row should contain a single sentence only
  • Do not include headers or additional text
Please generate and return 1000 unique sentences as specified above, in CSV format.
Step 3: diversity prompt
Can you provide something more diverse compared to the previously generated data?
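As an illustration of how the three prompting steps above could be chained within a single conversation, the sketch below sends them as consecutive turns so that the diversity prompt can refer back to the previously generated data. It assumes the openai (>=1.0) Python client; the model name is a placeholder and the Step 2 prompt is abbreviated. In practice, output-length limits may require generating in smaller batches rather than 1000 sentences per call.

```python
from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder; any chat-completion model could be substituted

messages = []  # running conversation, so each step sees the previous turns


def send(prompt: str) -> str:
    """Append a prompt to the conversation and return the model's reply."""
    messages.append({"role": "user", "content": prompt})
    response = client.chat.completions.create(model=MODEL, messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply


# Step 1: intro-role prompt
send("Imagine you are a hotel booking review writer")

# Step 2: data generation prompt (abbreviated here; the full wording is given above)
generated_csv = send(
    "You are tasked with generating 1000 realistic booking review sentences for a hotel... "
    "Please generate and return 1000 unique sentences as specified above, in CSV format."
)

# Step 3: diversity prompt
more_diverse_csv = send(
    "Can you provide something more diverse compared to the previously generated data?"
)
```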

References

  1. Milwood, P.A.; Hartman-Caverly, S.; Roehl, W.S. A scoping study of ethics in artificial intelligence research in tourism and hospitality. In ENTER22 e-Tourism Conference; Springer: Berlin/Heidelberg, Germany, 2023; pp. 243–254. [Google Scholar]
  2. Wang, P.Q. Personalizing guest experience with generative AI in the hotel industry: There’s more to it than meets a Kiwi’s eye. Curr. Issues Tour. 2024, 28, 527–544. [Google Scholar] [CrossRef]
  3. Wüst, K.; Bremser, K. Artificial Intelligence in Tourism Through Chatbot Support in the Booking Process—An Experimental Investigation. Tour. Hosp. 2025, 6, 36. [Google Scholar] [CrossRef]
  4. Kouros, T.; Theodosiou, Z.; Themistocleous, C. Machine Learning Bias: Genealogy, Expression and Prevention; CABI Books: Bognor Regis, UK, 2025; pp. 113–126. [Google Scholar] [CrossRef]
  5. Akter, S.; McCarthy, G.; Sajib, S.; Michael, K.; Dwivedi, Y.K.; D’Ambra, J.; Shen, K.N. Algorithmic bias in data-driven innovation in the age of AI. Int. J. Inf. Manag. 2021, 60, 102387. [Google Scholar] [CrossRef]
  6. Kordzadeh, N.; Ghasemaghaei, M. Algorithmic bias: Review, synthesis, and future research directions. Eur. J. Inf. Syst. 2021, 31, 388–409. [Google Scholar] [CrossRef]
  7. Chen, Y.; Clayton, E.W.; Novak, L.L.; Anders, S.; Malin, B. Human-Centered Design to Address Biases in Artificial Intelligence. J. Med. Internet Res. 2023, 25, e43251. [Google Scholar] [CrossRef]
  8. Haliburton, L.; Ghebremedhin, S.; Welsch, R.; Schmidt, A.; Mayer, S. Investigating labeler bias in face annotation for machine learning. In HHAI 2024: Hybrid Human AI Systems for the Social Good; IOS Press: Amsterdam, The Netherlands, 2024; pp. 145–161. [Google Scholar]
  9. Parmar, M.; Mishra, S.; Geva, M.; Baral, C. Don’t blame the annotator: Bias already starts in the annotation instructions. arXiv 2022, arXiv:2205.00415. [Google Scholar]
  10. Das, A.; Zhang, Z.; Hasan, N.; Sarkar, S.; Jamshidi, F.; Bhattacharya, T.; Rahgouy, M.; Raychawdhary, N.; Feng, D.; Jain, V. Investigating annotator bias in large language models for hate speech detection. In Proceedings of the Neurips Safe Generative AI Workshop 2024, Vancouver, BC, Canada, 15 December 2024. [Google Scholar]
  11. Felkner, V.K.; Thompson, J.A.; May, J. Gpt is not an annotator: The necessity of human annotation in fairness benchmark construction. arXiv 2024, arXiv:2405.15760. [Google Scholar]
  12. Giorgi, T.; Cima, L.; Fagni, T.; Avvenuti, M.; Cresci, S. Human and LLM biases in hate speech annotations: A socio-demographic analysis of annotators and targets. In Proceedings of the International AAAI Conference on Web and Social Media, Copenhagen, Denmark, 23–26 June 2025; pp. 653–670. [Google Scholar]
  13. Gonzales, A.; Guruswamy, G.; Smith, S.R. Synthetic data in health care: A narrative review. PLoS Digit. Health 2023, 2, e0000082. [Google Scholar] [CrossRef]
  14. Kozinets, R.V.; Gretzel, U. Commentary: Artificial Intelligence: The Marketer’s Dilemma. J. Mark. 2020, 85, 156–159. [Google Scholar] [CrossRef]
  15. Li, Z.; Zhu, H.; Lu, Z.; Yin, M. Synthetic data generation with large language models for text classification: Potential and limitations. arXiv 2023, arXiv:2310.07849. [Google Scholar]
  16. Nikolenko, S.I. Synthetic Data for Deep Learning; Springer: Berlin/Heidelberg, Germany, 2021; Volume 174. [Google Scholar]
  17. Guo, X.; Chen, Y. Generative ai for synthetic data generation: Methods, challenges and the future. arXiv 2024, arXiv:2403.04190. [Google Scholar]
  18. Andreou, C.; Tsapatsoulis, N.; Anastasopoulou, V. A Dataset of Hotel Reviews for Aspect-Based Sentiment Analysis and Topic Modeling. In Proceedings of the 2023 18th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP 2023), Limassol, Cyprus, 25–26 September 2023. [Google Scholar] [CrossRef]
  19. Herrera-Poyatos, D.; Peláez-González, C.; Zuheros, C.; Herrera-Poyatos, A.; Tejedor, V.; Herrera, F.; Montes, R. An overview of model uncertainty and variability in LLM-based sentiment analysis. Challenges, mitigation strategies and the role of explainability. arXiv 2025, arXiv:2504.04462. [Google Scholar]
  20. Mohta, J.; Ak, K.; Xu, Y.; Shen, M. Are large language models good annotators? In Proceedings of the NeurIPS 2023 Workshops, New Orleans, LA, USA, 16 December 2023; pp. 38–48. [Google Scholar]
  21. Nasution, A.H.; Onan, A. Chatgpt label: Comparing the quality of human-generated and llm-generated annotations in low-resource language nlp tasks. IEEE Access 2024, 12, 71876–71900. [Google Scholar] [CrossRef]
  22. Kar, A.K.; Choudhary, S.K.; Singh, V.K. How can artificial intelligence impact sustainability: A systematic literature review. J. Clean. Prod. 2022, 376, 134120. [Google Scholar] [CrossRef]
  23. Bacalhau, L.M.; Pereira, M.C.; Neves, J. A bibliometric analysis of AI bias in marketing: Field evolution and future research agenda. J. Mark. Anal. 2025, 13, 308–327. [Google Scholar] [CrossRef]
  24. Roselli, D.; Matthews, J.; Talagala, N. Managing Bias in AI. In Proceedings of the WWW’19: Companion Proceedings of the 2019 World Wide Web Conference, San Francisco, CA, USA, 13–17 May 2019. [Google Scholar] [CrossRef]
  25. Varsha, P.S. How can we manage biases in artificial intelligence systems—A systematic literature review. Int. J. Inf. Manag. Data Insights 2023, 3, 100165. [Google Scholar] [CrossRef]
  26. Oxford English Dictionary. Bias, n., sense 3.c. Available online: https://doi.org/10.1093/OED/4832698884 (accessed on 27 June 2025).
  27. Smith, J.; Noble, H. Bias in research. Evid. Based Nurs. 2014, 17, 100–101. [Google Scholar] [CrossRef] [PubMed]
  28. Delgado-Rodriguez, M.; Llorca, J. Bias. J. Epidemiol. Community Health 2004, 58, 635–641. [Google Scholar] [CrossRef]
  29. Nazer, L.H.; Zatarah, R.; Waldrip, S.; Ke, J.X.C.; Moukheiber, M.; Khanna, A.K.; Hicklen, R.S.; Moukheiber, L.; Moukheiber, D.; Ma, H.; et al. Bias in artificial intelligence algorithms and recommendations for mitigation. PLoS Digit. Health 2023, 2, e0000278. [Google Scholar] [CrossRef]
  30. Spennemann, D.H. Who Is to Blame for the Bias in Visualizations, ChatGPT or DALL-E? AI 2025, 6, 92. [Google Scholar] [CrossRef]
  31. Gupta, O.; Marrone, S.; Gargiulo, F.; Jaiswal, R.; Marassi, L. Understanding Social Biases in Large Language Models. AI 2025, 6, 106. [Google Scholar] [CrossRef]
  32. Gautam, S.; Srinath, M. Blind spots and biases: Exploring the role of annotator cognitive biases in NLP. arXiv 2024, arXiv:2404.19071. [Google Scholar]
  33. Pandey, R.; Castillo, C.; Purohit, H. Modeling human annotation errors to design bias-aware systems for social stream processing. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, Vancouver, BC, Canada, 27–30 August 2019; pp. 374–377. [Google Scholar]
  34. Reason, J. Human error: Models and management. BMJ 2000, 320, 768–770. [Google Scholar] [CrossRef]
  35. Reason, J. Human Error; Cambridge University Press: Cambridge, UK, 1990. [Google Scholar]
  36. Ntalianis, K.; Tsapatsoulis, N.; Doulamis, A.; Matsatsinis, N. Automatic annotation of image databases based on implicit crowdsourcing, visual concept modeling and evolution. Multimed. Tools Appl. 2014, 69, 397–421. [Google Scholar] [CrossRef]
  37. Giannoulakis, S.; Tsapatsoulis, N. Filtering Instagram Hashtags Through Crowdtagging and the HITS Algorithm. IEEE Trans. Comput. Soc. Syst. 2019, 6, 592–603. [Google Scholar] [CrossRef]
  38. Geva, M.; Goldberg, Y.; Berant, J. Are we modeling the task or the annotator? An investigation of annotator bias in natural language understanding datasets. arXiv 2019, arXiv:1908.07898. [Google Scholar]
  39. Kuwatly, H.; Wich, M.; Groh, G. Identifying and measuring annotator bias based on annotators’ demographic characteristics. In Proceedings of the Fourth Workshop on Online Abuse and Harms; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 184–190. [Google Scholar]
  40. Siriborvornratanakul, T. From Human Annotators to AI: The Transition and the Role of Synthetic Data in AI Development. In Proceedings of the International Conference on Human-Computer Interaction, Gothenburg, Sweden, 22–27 June 2025; pp. 379–390. [Google Scholar]
  41. Zendel, O.; Culpepper, J.S.; Scholer, F.; Thomas, P. Enhancing human annotation: Leveraging large language models and efficient batch processing. In Proceedings of the 2024 Conference on Human Information Interaction and Retrieval, Sheffield, UK, 10–14 March 2024; pp. 340–345. [Google Scholar]
  42. Aldeen, M.; Luo, J.; Lian, A.; Zheng, V.; Hong, A.; Yetukuri, P.; Cheng, L. Chatgpt vs. human annotators: A comprehensive analysis of chatgpt for text annotation. In Proceedings of the 2023 International Conference on Machine Learning and Applications (ICMLA), Jacksonville, FL, USA, 15–17 December 2023; pp. 602–609. [Google Scholar]
  43. Wang, X.; Kim, H.; Rahman, S.; Mitra, K.; Miao, Z. Human-llm collaborative annotation through effective verification of llm labels. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, Honolulu, HI, USA, 11–16 May 2024; pp. 1–21. [Google Scholar]
  44. Ferrara, E. Fairness and Bias in Artificial Intelligence: A Brief Survey of Sources, Impacts, and Mitigation Strategies. Sci 2023, 6, 3. [Google Scholar] [CrossRef]
  45. Djouvas, C.; Charalampous, A.; Christodoulou, C.J.; Tsapatsoulis, N. LLMs are not for everything: A Dataset and Comparative Study on Argument Strength Classification. In Proceedings of the 28th Pan-Hellenic Conference on Progress in Computing and Informatics, Athens, Greece, 13–15 December 2024; pp. 437–443. [Google Scholar]
  46. Chan, Y.C.; Pu, G.; Shanker, A.; Suresh, P.; Jenks, P.; Heyer, J.; Denton, S. Balancing cost and effectiveness of synthetic data generation strategies for llms. arXiv 2024, arXiv:2409.19759. [Google Scholar]
  47. Muñoz-Ortiz, A.; Gómez-Rodríguez, C.; Vilares, D. Contrasting linguistic patterns in human and LLM-generated news text. Artif. Intell. Rev. 2024, 57, 265. [Google Scholar] [CrossRef]
  48. Yadav, R.K.; Jiao, L.; Granmo, O.C.; Goodwin, M. Human-level interpretable learning for aspect-based sentiment analysis. In Proceedings of the AAAI Conference on Artificial Intelligence, Virtually, 2–9 February 2021; Volume 35, pp. 14203–14212. [Google Scholar]
  49. Jia, S.; Chi, O.H.; Chi, C.G. Unpacking the impact of AI vs. human-generated review summary on hotel booking intentions. Int. J. Hosp. Manag. 2025, 126, 104030. [Google Scholar] [CrossRef]
  50. Sparks, B.A.; Browning, V. The impact of online reviews on hotel booking intentions and perception of trust. Tour. Manag. 2011, 32, 1310–1323. [Google Scholar] [CrossRef]
  51. O’Connor, C.; Joffe, H. Intercoder Reliability in Qualitative Research: Debates and Practical Guidelines. Int. J. Qual. Methods 2020, 19, 1609406919899220. [Google Scholar] [CrossRef]
  52. Gisev, N.; Bell, J.S.; Chen, T.F. Interrater agreement and interrater reliability: Key concepts, approaches, and applications. Res. Soc. Adm. Pharm. 2013, 9, 330–338. [Google Scholar] [CrossRef] [PubMed]
  53. Landis, J.R.; Koch, G.G. The Measurement of Observer Agreement for Categorical Data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef] [PubMed]
  54. Huo, F.Y.; Johnson, N.F. Capturing AI’s Attention: Physics of Repetition, Hallucination, Bias and Beyond. arXiv 2025, arXiv:2504.04600. [Google Scholar]
  55. Arora, K.; Asri, L.E.; Bahuleyan, H.; Cheung, J.C.K. Why exposure bias matters: An imitation learning perspective of error accumulation in language generation. arXiv 2022, arXiv:2204.01171. [Google Scholar]
  56. Li, L.; Sleem, L.; Gentile, N.; Nichil, G.; State, R. Exploring the Impact of Temperature on Large Language Models: Hot or Cold? arXiv 2025, arXiv:2506.07295. [Google Scholar]
  57. Peeperkorn, M.; Kouwenhoven, T.; Brown, D.; Jordanous, A. Is temperature the creativity parameter of large language models? arXiv 2024, arXiv:2405.00492. [Google Scholar]
  58. Nishu, K.; Mehta, S.; Abnar, S.; Farajtabar, M.; Horton, M.; Najibi, M.; Nabi, M.; Cho, M.; Naik, D. From Dense to Dynamic: Token-Difficulty Driven MoEfication of Pre-Trained LLMs. arXiv 2025, arXiv:2502.12325. [Google Scholar]
  59. Behera, A.P.; Champati, J.P.; Morabito, R.; Tarkoma, S.; Gross, J. Towards Efficient Multi-LLM Inference: Characterization and Analysis of LLM Routing and Hierarchical Techniques. arXiv 2025, arXiv:2506.06579. [Google Scholar]
  60. Macdonald, S.; Birdi, B. The concept of neutrality: A new approach. J. Doc. 2019, 76, 333–353. [Google Scholar] [CrossRef]
  61. Donohue, G.A.; Tichenor, P.J.; Olien, C.N. Gatekeeping: Mass media systems and information control. Curr. Perspect. Mass Commun. Res. 1972, 1, 41–70. [Google Scholar]
  62. Pryzant, R.; Martinez, R.D.; Dass, N.; Kurohashi, S.; Jurafsky, D.; Yang, D. Automatically neutralizing subjective bias in text. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 480–489. [Google Scholar]
  63. Snow, R.; O’Connor, B.; Jurafsky, D.; Ng, A.Y. Cheap and fast–but is it good? Evaluating non-expert annotations for natural language tasks. In Proceedings of the 2008 Conference on Empirical Methods in Natural Language Processing, Honolulu, HI, USA, 25–27 October 2008; pp. 254–263. [Google Scholar]
  64. Nowak, S.; Rüger, S. How reliable are annotations via crowdsourcing: A study about inter-annotator agreement for multi-label image annotation. In Proceedings of the International Conference on Multimedia Information Retrieval, Philadelphia, PA, USA, 29–31 March 2010; pp. 557–566. [Google Scholar]
  65. Wardhani, N.W.S.; Rochayani, M.Y.; Iriany, A.; Sulistyono, A.D.; Lestantyo, P. Cross-validation metrics for evaluating classification performance on imbalanced data. In Proceedings of the 2019 International Conference on Computer, Control, Informatics and Its Applications (IC3INA), Tangerang, Indonesia, 23–24 October 2019; pp. 14–18. [Google Scholar]
  66. Rasool, A.; Shahzad, M.I.; Aslam, H.; Chan, V.; Arshad, M.A. Emotion-aware embedding fusion in large language models (Flan-T5, Llama 2, DeepSeek-R1, and ChatGPT 4) for intelligent response generation. AI 2025, 6, 56. [Google Scholar] [CrossRef]
  67. Thormundsson, B.S.R.T. Artificial Intelligence Tools Popularity in the United States as of September 2024, by Brand. 2024. Available online: https://www.statista.com/forecasts/1480449/ai-tools-popularity-share-usa-adults (accessed on 15 July 2025).
Figure 1. Research design and methodology.
Figure 2. Distribution of sentiment, topic, and aspect annotations across real and synthetic datasets by annotator type.
Figure 3. Inter-coder agreement among/between the following: (a) the three LLMs (ChatGPT-3.5, ChatGPT-4, ChatGPT-4-mini); (b) human annotator and ChatGPT-4; (c) human annotator and ChatGPT-3.5; (d) ChatGPT-4 and ChatGPT-3.5. The red squares indicate areas of slight agreement.
Table 2. Inter-mode agreement (Cohen’s κ) between batch and manual annotation modes of ChatGPT-3.5.
Variable | Cohen’s κ | Asympt. SE | Approx. T | Approx. Sig.
Neutral | 0.109 | 0.004 | 27.49 | <0.001
Negative | 0.201 | 0.004 | 48.75 | <0.001
Aspect | 0.244 | 0.007 | 46.73 | <0.001
Facilities | 0.270 | 0.011 | 54.55 | <0.001
Generic | 0.381 | 0.01 | 76.69 | <0.001
Comfort | 0.423 | 0.014 | 64.43 | <0.001
Bar | 0.498 | 0.021 | 84.94 | <0.001
Positive | 0.535 | 0.006 | 82.44 | <0.001
View | 0.610 | 0.013 | 97.89 | <0.001
Restaurant | 0.619 | 0.022 | 96.55 | <0.001
Reception | 0.667 | 0.014 | 101.46 | <0.001
Value For Money | 0.672 | 0.015 | 102.45 | <0.001
Room | 0.681 | 0.005 | 106.75 | <0.001
Noise | 0.683 | 0.011 | 104.38 | <0.001
Beach | 0.721 | 0.033 | 113.11 | <0.001
Cleanliness | 0.775 | 0.007 | 117.87 | <0.001
Bed | 0.795 | 0.009 | 123.00 | <0.001
Location | 0.802 | 0.005 | 122.23 | <0.001
Staff | 0.827 | 0.005 | 125.76 | <0.001
Pool | 0.887 | 0.011 | 135.52 | <0.001
Bathroom | 0.900 | 0.005 | 136.87 | <0.001
Lift | 0.904 | 0.013 | 137.80 | <0.001
Parking | 0.945 | 0.007 | 143.65 | <0.001
Breakfast | 0.958 | 0.003 | 145.69 | <0.001
WiFi | 0.963 | 0.008 | 146.34 | <0.001
Table 3. Examples of human–LLM annotation (sentiment/topic/aspect) disagreement and agreement across real and synthetic datasets. Each annotator cell lists sentiment / topic(s) / aspect.
Dataset | Example | Human | ChatGPT-3.5 (1-1) | ChatGPT-3.5 (Batch) | ChatGPT-4 | ChatGPT-4-mini | Agr.
real | Air conditioning worked well. | positive / facilities / no | positive / facilities / no | neutral / none / no | neutral / none / no | neutral / none / no | no
real | Basic restaurant with limited choice. | negative / restaurant / no | negative / restaurant / no | neutral / restaurant / no | neutral / restaurant / no | neutral / restaurant / no | no
real | No fresh air and AC didn’t work the first night. | negative / room, facilities / no | negative / room / yes | neutral / room, facilities / yes | negative / room / yes | neutral / room / yes | no
real | At breakfast I was asked for €6 for juice. | negative / breakfast, value for money / no | negative / breakfast, value for money / yes | neutral / breakfast / no | neutral / breakfast / no | neutral / breakfast / no | no
real | Very poor noise cancellation. | negative / noise / no | negative / noise / no | negative / noise / no | negative / noise / no | negative / noise / no | yes
real | Lack of physical and relational contact. | negative / staff / no | neutral / staff / no | neutral / none / no | neutral / none / yes | neutral / none / yes | no
synthetic | Slept like a baby on that bed. | positive / bed / no | N/A | neutral / bed / no | neutral / bed / no | neutral / bed / no | no
synthetic | Beautifully designed interior and exterior. | positive / generic / no | N/A | neutral / generic / yes | neutral / generic / yes | neutral / generic / yes | no
synthetic | Great spot near all attractions. | positive / location / no | N/A | positive / none / no | positive / none / no | positive / none / no | no
synthetic | Reception staff were disorganized. | negative / staff, reception / no | N/A | neutral / staff, reception / no | neutral / staff, reception / no | neutral / staff, reception / no | no
synthetic | Location was inconvenient. | negative / location / no | N/A | neutral / location / no | neutral / location / no | neutral / location / no | no
synthetic | Affordable without compromising quality. | positive / value for money / no | N/A | positive / value for money / no | positive / value for money / no | positive / value for money / no | yes