Model-Contingent Polarity Bias in Large Language Model Annotation: Implications for Semantic Multimedia Personalization

Djouvas, Constantinos; Andreou, Christiana; Voutsa, Maria C.; Tsapatsoulis, Nicolas

doi:10.3390/computers15050262

Open AccessArticle

Model-Contingent Polarity Bias in Large Language Model Annotation: Implications for Semantic Multimedia Personalization

¹

Department of Communication and Internet Studies, Cyprus University of Technology, Limassol 3036, Cyprus

²

Department of Communication and Marketing, Cyprus University of Technology, Limassol 3036, Cyprus

^*

Authors to whom correspondence should be addressed.

Computers 2026, 15(5), 262; https://doi.org/10.3390/computers15050262

Submission received: 1 April 2026 / Revised: 16 April 2026 / Accepted: 17 April 2026 / Published: 22 April 2026

(This article belongs to the Special Issue Advances in Semantic Multimedia and Personalized Digital Content)

Download

Browse Figure

Versions Notes

Abstract

Large Language Models (LLMs) are increasingly deployed as automated annotators in semantic multimedia systems, yet their reliability varies significantly across architectures. This study extends prior cross-model evaluations by benchmarking ChatGPT-5, Qwen-3, and Gemini-3-flash against human expert annotations using the HRAST hotel review dataset. We adopt a bias-by-design framework to analyze systematic divergences in sentiment, topic, and aspect labeling across real and synthetic data, while investigating the moderating effects of annotation mode. Findings reveal model-contingent polarity bias: ChatGPT-5 exhibits a pronounced neutrality bias, while Qwen-3 and Gemini-3-flash align more closely with human polarization. Agreement is substantial for concrete topics but diverges on abstract evaluative dimensions. Synthetic data consistently inflates reliability metrics while masking ambiguity. These findings highlight that annotation bias is structurally embedded in model design choices and operational conditions. Cross-architectural triangulation and mode-aware deployment strategies are recommended for robust semantic multimedia system development.

Keywords:

large language models; annotation bias; semantic multimedia; sentiment analysis; cross-model evaluation; synthetic data; bias mitigation; hotel reviews; aspect-based sentiment analysis

1. Introduction

The exponential growth of user-generated content (UGC) on digital platforms has intensified the demand for scalable semantic analysis tools that support personalized multimedia experiences [1]. Large Language Models (LLMs) have emerged as versatile components in these pipelines, acting as automated annotators for sentiment classification, topic extraction, and aspect-based analysis [2,3]. Their integration into recommendation engines, brand monitoring dashboards, and adaptive content delivery systems promises unprecedented efficiency in processing unstructured textual data [4]. Recent advances in ontology-driven frameworks further underscore the need for accurate semantic grounding to enable context-aware personalization, particularly in niche domains such as sport tourism, where user preferences must align with complex contextual factors [5]. Similarly, the demonstrated capacity of modern LLMs to generalize across domains and languages suggests their potential to streamline the adaptation of multilingual content without exhaustive manual translation [6].

However, as LLMs transition from experimental prototypes to production-grade annotation tools, concerns about output reliability and structural bias have gained prominence [7]. Annotation outcomes directly shape downstream personalization algorithms: misclassified sentiment can distort user preference models, while inconsistent topic labeling can compromise content relevance [8]. These risks are amplified when LLMs process ambiguous, culturally nuanced, or multi-modal inputs characteristic of authentic UGC [9]. Research in affective computing indicates that single-modal approaches often lack reliability due to the complexity of human expression, suggesting that annotation systems relying on limited contextual signals may inherit similar vulnerabilities [10]. Furthermore, empirical studies in crisis informatics reveal that model performance varies significantly in regional and cultural contexts, implying that universal annotation models may overlook linguistic nuances critical for accurate interpretation [11].

Annotation deviations are increasingly understood not as errors, but as structured manifestations of model design choices [12]. Bender et al. [12] demonstrate that pre-trained language models reproduce and amplify patterns inherent in their training data, thereby compromising interpretive neutrality. This aligns with the concept of bias-by-design, suggesting that systematic labeling tendencies stem from architectural and optimization constraints rather than a misunderstanding of the task [13,14]. For example, architectural decisions regarding feature extraction and fusion strategies have been shown to dictate performance robustness in multi-modal systems, where specific design choices may prioritize certain signal types over others [10]. Similarly, reliance on English-centric corpora in foundational models can introduce structural biases that persist even when models are adapted for low-resource languages or specialized domains [6].

Despite growing empirical interest, most evaluations remain confined to single-model settings or synthetic benchmarks [15]. Mohta et al. [15] note that LLM behavior varies significantly with prompting conditions and computational context, suggesting that annotation biases may originate from essential system properties. Furthermore, agreement levels appear to be sensitive to dataset construction, particularly when synthetic data are introduced [16]. These gaps limit our understanding of whether observed biases generalize across LLM families or represent model-contingent artifacts. Evidence from cross-regional classification tasks indicates that dataset composition and regional representation significantly influence model generalization, highlighting the need for diverse evaluation benchmarks [11]. Additionally, comparative analyses between encoder-based and decoder-based architectures suggest that performance gaps may vary depending on the structural nature of the input data, necessitating broader architectural scrutiny [6].

To address these limitations, this study extends the work of Voutsa et al. [17] through a cross-model investigation of three architecturally distinct LLMs: ChatGPT-5, Qwen-3, and Gemini-3-flash. Leveraging the HRAST dataset of real hotel booking reviews [18] and its augmented counterpart containing both authentic and synthetic content [16], we benchmark model-generated annotations against expert human coding across sentiment, topic, and aspect dimensions. Guided by a ’bias-by-design’ analytical perspective [13], we conceptualize systematic labeling tendencies as structural outcomes of model development choices rather than incidental errors. This approach aligns with recent calls for context-based machine learning frameworks that account for cultural and regional variations in data interpretation [11].

This paper makes three key contributions aligned with the Special Issue on Advances in Semantic Multimedia and Personalized Digital Content:

it documents model-contingent polarity bias, revealing a systematic divergence in polarity assignment between LLM families with implications for personalization reliability, echoing findings on architectural influence in multi-modal systems [10];
it demonstrates how the type of dataset (real vs. synthetic) moderates human–LLM agreement, reinforcing concerns regarding the effects of data composition observed in cross-regional studies [11];
it proposes actionable, bias-aware strategies for deploying LLMs in semantic multimedia pipelines, emphasizing cross-architectural triangulation and hybrid human–AI workflows to mitigate design-induced biases [6].

The remainder of this paper is structured as follows: Section 2 reviews related work on LLM annotation, bias typologies, and semantic personalization. Section 3 details the methodology, including the construction of the dataset, the annotation protocols, and the analytical procedures. Section 4 presents empirical results across real and synthetic datasets. Section 5 discusses theoretical, methodological, and practical implications for semantic multimedia adaptation. Section 6 concludes with limitations and directions for future research.

2. Related Work

2.1. Large Language Models in Semantic Multimedia Analysis

LLMs have evolved from experimental tools to multifunctional agents within computational pipelines, fundamentally altering the way semantic analysis is conducted. Tan and Jiang [2] position LLMs as capable of simultaneously managing encoding, evaluation, and knowledge enhancement, offering insights that are not immediately apparent in heterogeneous multimedia content. Their work underscores the ability of LLMs to perform complex analyses across domains, supporting downstream user evaluation and decision-making. In specialized contexts, such as dark web Q&A forums, De-Marcos and Domínguez-Díaz [19] demonstrate that LLM-based topic modeling exhibits superior coherence and coverage compared to traditional statistical approaches, uncovering hidden patterns in forum discussions and providing richer semantic representations.

However, the reliability of automated semantic analysis often depends on the robustness of the underlying architecture. Recent studies in affective computing suggest that single-modal approaches often lack reliability due to the complexity of human expression [10]. While Mostert et al. focus on multi-modal emotion detection, their finding that integrating diverse data streams enhances robustness parallels the need for diverse annotation sources in text analysis. Similarly, in the hospitality and service quality domains, the performance of LLMs relative to established baselines remains an active area of investigation. Ghatora et al. [20] report that GPT-4 shows comparable performance to traditional machine learning baselines (e.g., Random Forest, SVM) in large-scale review sentiment analysis, although it does not consistently outperform the strongest classifiers. Similarly, Falatouri et al. [3] find that while LLMs achieve 76% accuracy in multilingual information extraction tasks, their alignment with human sentiment judgments is only moderate, particularly in capturing cultural nuances and fine-grained aspects.

2.2. Annotation Bias

Annotation bias represents a central concern in the implementation of AI systems for semantic interpretation, as it directly shapes the representations upon which personalization algorithms rely. This perspective is reinforced by Bender et al. [12], who argue that language technologies inherently reflect the values, priorities, and limitations of their creators, particularly through the selection of training data and the optimization of training goals. From this viewpoint, annotation bias is not a random error, but a predictable outcome of how systems are designed.

Empirical studies have documented specific manifestations of this structural bias, often influenced by regional and cultural contexts. Miah et al. [11] highlight that model performance varies significantly in regional and cultural contexts in disaster informatics, implying that universal annotation models may overlook linguistic nuances critical for accurate interpretation. In sentiment analysis, Pyreddy and Zaman [21] identify a consistent positivity bias in the labels generated by ChatGPT for social media content, where the model assigns more positive sentiment than human annotators. In contrast, Voutsa et al. [16] identify a neutrality bias in LLM-based hotel review annotation, in which models frequently default to neutral sentiment when faced with ambiguous or mixed reviews. This tendency attenuates emotional variance and produces homogeneous semantic interpretations, potentially compromising the effectiveness of sentiment-driven personalization systems. Furthermore, Giorgi et al. [22] note that while LLMs may exhibit fewer annotator biases than humans in hate speech labeling, this does not eliminate structural biases inherent to the model’s training distribution.

2.3. Model-Specific Bias and Cross-Model Divergence

An emerging theme in the literature is that annotation bias is not universal but varies across LLMs, depending on their architectures, training methodologies, and language settings. Voutsa et al. [17] provide one of the first cross-family evaluations, showing that ChatGPT-4 and Qwen exhibit divergent annotation patterns when analyzing identical hotel review datasets. ChatGPT-4 tends toward neutral sentiment assignment, while Qwen shows a polarity bias that aligns more closely with human evaluations. This divergence is particularly pronounced in abstract and emotionally nuanced categories, challenging the assumption that LLMs constitute a homogeneous class of annotators.

This architectural dependency is further supported by research on cross-lingual adaptation, where Cho et al. [6] demonstrate that performance gaps vary depending on whether encoder- or decoder-based architectures are employed, underscoring the need for broader architectural scrutiny. In educational assessment, Anghel et al. [23,24] observe that human–LLM alignment is domain-dependent, with tighter correlation in technical subjects (e.g., machine learning) than in others (e.g., computer networks), and that LLM judges show weaker alignment to human mean than human–human pairs. Mohta et al. [15] warn that relying on a single model type can inadvertently embed unchecked biases into semantic pipelines. Employing multiple models, on the contrary, helps surface underlying interpretive assumptions and enhances the system’s robustness. Viewed this way, model-specific bias is not merely a constraint, but also a diagnostic tool for uncovering hidden dimensions of semantic interpretation.

2.4. Synthetic Data and Bias Amplification

Synthetic data generation has become common to address data scarcity and scalability challenges in AI research. However, recent studies warn that synthetic data can introduce distinctive forms of generative bias. Li et al. [25] demonstrate that models trained exclusively in synthetic text perform poorly relative to those trained in real data, particularly in tasks requiring semantic nuance. LLM-generated texts often employ narrower vocabularies, simpler syntactic structures, and reduced expression of negative emotion, characteristics that may make synthetic content easier for models to annotate, potentially inflating observed agreement scores.

Voutsa et al. [16] confirm that the LLM–human agreement is higher in synthetic datasets than in real reviews, yet this reliability masks a loss of ambiguity and interpretive depth. Consequently, synthetic benchmarks can amplify existing annotation biases while misrepresenting real-world performance. Synthetic data, therefore, does not function as a neutral substitute but rather as a bias amplifier in the annotation process. This pattern reinforces concerns that synthetic benchmarks may inflate reliability while masking the ambiguity present in authentic UGC.

2.5. Research Gap and Positioning

The reviewed literature reveals a critical insight: annotation bias in LLM-based systems is systematic, design-driven, and model-dependent. Although LLMs offer scalability and accessibility for semantic multimedia analysis, their annotation behavior often diverges from human judgment in predictable, explainable ways. Existing studies outline various bias typologies, performance limitations, and dataset effects, as summarized in Table 1. However, most evaluations focus on single-model settings or synthetic benchmarks. Few frameworks systematically link annotation outcomes to design choices while examining divergence across models and operational conditions. This gap motivates the present study, which adopts a bias-by-design perspective and employs cross-model comparison to investigate how annotation bias manifests in LLM-driven semantic multimedia systems.

3. Data and Methods

This study adopts a comparative experimental design evaluating multiple LLMs performing identical semantic annotation tasks on the same datasets. The methodology comprises three interconnected phases: (1) dataset creation and validation, extending Andreou et al. [18]; (2) bias-oriented annotation experiments using LLMs, building on Voutsa et al. [16]; and (3) cross-model structural bias analysis, advancing the framework introduced in Voutsa et al. [17].

3.1. Dataset

This research employs the augmented HRAST (Hotel Reviews: Aspects, Sentiments and Topics) dataset, comprising 23,114 human-annotated hotel review sentences from Booking.com [16,18] and 199 synthetic hotel review sentences generated by ChatGPT-4 [16]. Each sentence is annotated at three semantic dimensions: (1) sentiment polarity (positive/negative/neutral); (2) topic classification (one or more of 21 predefined hotel-related aspects (e.g., room, location, staff, cleanliness, comfort, breakfast, facilities, value for money); and (3) aspect presence (binary indicator for sentences containing multiple aspects or mixed sentiments).

Notably, 5812 sentences (25.1%) based on human annotation [16] contain multiple aspects with conflicting sentiments, making the dataset suitable for evaluating advanced aspect-based sentiment models.

In addition to real human-generated reviews, the study includes 199 synthetically generated reviews created with ChatGPT-4 using carefully designed prompts, described in Voutsa et al. [16]. Because the dataset explicitly indicates whether each review is human- or model-authored, researchers can conduct controlled experiments to examine how annotation behavior varies across data types. Synthetic data was also annotated by humans [16].

3.2. LLM-Based Annotation Framework

Following the experimental framework proposed by Voutsa et al. [16], annotation is examined not merely as a classification task but as a form of semantic interpretation carried out by computational systems. Building upon the validated HRAST corpus, the study deploys ChatGPT-5, Qwen-3, and Gemini-3-flash as independent annotators to evaluate the extent to which their semantic judgments replicate, approximate, or systematically diverge from human expert labeling.

The annotation pipeline is implemented through structured prompting and batch-based interpretation. Each review sentence is provided to the models together with standardized instructions describing the required tasks: determine sentiment polarity, identify corresponding topics using the predefined aspect schema, and detect aspect presence at the sentence level. Identical prompt templates are used on all experimental runs to minimize the variability arising from instruction design.

To operationalize the LLM-based annotation framework, a standardized, API-driven pipeline was developed to ensure consistency, reproducibility, and cross-model comparability. The implementation builds upon the structured prompting paradigm outlined in Section 3.2, translating conceptual annotation tasks into executable workflows using the OpenRouter interface and model-specific endpoints.

All models were accessed through a unified API client, enabling identical input formatting and parameter control across architectures. Each review sentence from the HRAST dataset was processed, and a constrained output schema that requires configuration (temperature = 0) was used, minimizing stochastic variation and ensuring that observed differences in annotation outputs reflect model-specific interpretive tendencies rather than sampling noise.

A single, carefully engineered system prompt was used across all experiments to standardize task interpretation. The prompt explicitly defines the three annotation dimensions (sentiment polarity, aspect presence, and topic classification) along with a constrained output schema requiring a valid JSON response. This design enforces structural uniformity in model outputs while reducing ambiguity in instruction-following behavior. The prompt also enumerates a fixed taxonomy of 21 domain-specific topics, ensuring alignment with the human-annotated schema of the HRAST dataset.

To enhance robustness against API instability and transient failures, an exponential backoff retry mechanism was implemented. Requests returning server-side errors were automatically retrieved using a sequential approach, including direct JSON decoding, regex-based corrections, and scale processing. This design choice is especially important in high-volume experimental settings, where uninterrupted model access cannot always be guaranteed.

Model outputs were subjected to a multi-stage post-processing pipeline to ensure syntactic and semantic validity. Because LLM responses may deviate from strict JSON formatting, a custom parsing function was implemented to normalize outputs using a sequential approach, including direct JSON decoding, regex-based corrections, abstract syntax evaluation, and fallback extraction of target fields. This procedure minimized annotation loss due to malformed responses while preserving the substantive content of model decisions.

Following parsing, annotations were transformed into a structured format aligned with the dataset schema. Sentiment labels were converted into binary indicator variables (positive, negative, neutral), while aspect presence was encoded as a binary flag. Topic assignments were mapped to the predefined vocabulary using exact matching and normalization heuristics for lexical variants. For example, semantically equivalent outputs such as “elevator” were standardized to the topic “Lift”, thereby improving compatibility between model-generated and human-coded labels.

The annotation process was executed in batch mode across the full dataset, iterating over sentences and appending model predictions to newly created output columns. To support fault tolerance and long-running execution, intermediate results were periodically saved at fixed checkpoints, allowing interrupted runs to resume without loss of previously processed instances. Final outputs were stored as extended dataset files combining original human annotations with LLM-generated labels, thereby enabling direct downstream agreement analysis.

3.3. Cross-Model Comparative Analysis

To address the limitations of single-model evaluations identified in previous work (Table 1), where analyses often remain confined to single-model families or binary comparisons, three distinct LLM families were selected to ensure diversity in model origin, multilingual exposure, and reinforcement learning strategies. This study benchmarks ChatGPT-5, Qwen-3, and Gemini-3-flash, representing proprietary, open-source, and multi-modal architectures, respectively, to isolate model-contingent biases from universal annotation tendencies. This cross-family design extends the comparative framework introduced by Voutsa et al. [17] by incorporating a third architectural paradigm, thus improving the robustness of the design bias analysis. Key characteristics regarding training orientation, access reproducibility, and alignment objectives are summarized in Table 2.

4. Results

4.1. Descriptive Statistics

Descriptive statistics (see Table 3) revealed systematic differences in response distributions between the real user-generated review dataset (N = 23,113) and the synthetic ChatGPT-3-generated dataset (N = 199). In the real dataset, Qwen and Gemini closely approximated human annotation distributions across most variables, with absolute percentage-point deviations typically ≤3.0 (e.g., Location: Human = 20.5%, Qwen = 21.6%, Gemini = 20.8%). ChatGPT, however, exhibited systematic divergences: it substantially underrepresented negative sentiment (28.7% vs. 45.4% human;

Δ

= 16.7) while overrepresenting neutral sentiment (25.6% vs. 3.4%;

Δ

= 22.2), and produced markedly elevated rates for the aggregate Aspect variable (45.0% vs. 12.5% human;

Δ

= 32.5). In the synthetic dataset, Qwen and Gemini maintained strong distributional fidelity to human annotations (mean absolute deviation = 1.8 percentage points), whereas ChatGPT demonstrated pronounced misalignment, most notably reversing the sentiment profile: neutral sentiment was annotated at 68.8% (vs. 1.5% human;

Δ

= 67.3), while positive sentiment dropped to 22.6% (vs. 74.9% human;

Δ

= 52.3). Similarly, ChatGPT overrepresented the Generic category (53.3% vs. 6.5% human) and the aggregate Aspect variable (35.2% vs. 4.0% human), suggesting a tendency toward vague or undifferentiated coding in synthetic contexts. Missing values were minimal and consistent across annotators (real: N = 172, 0.7%; synthetic: N = 3, 1.5%), indicating high complete annotation.

4.2. Inter-Rater Reliability

Inter-rater reliability between human annotations and LLM outputs was evaluated using Cohen’s (

κ

) coefficient [39] across two distinct datasets (real user-generated hotel reviews and synthetic ChatGPT-3-generated reviews) and three models (Qwen, Gemini, and ChatGPT; see Table A1). Similarly, to assess consensus among the three LLMs independent of human annotation, Fleiss’ kappa (

κ

) [40] was computed for each variable across the real and synthetic datasets (see Table A2; Figure 1). To enhance methodological transparency and effect size precision, all

κ

estimates are reported with asymptotic standard errors (

S E

), 95% confidence intervals (computed as

κ \pm 1.96 \times S E

), exact p-values, and interpretations following Landis and Koch’s [41] benchmarks. All statistical tests employed two-tailed significance thresholds at

α = 0.05

. Kappa values were interpreted following Landis and Koch’s [41] benchmarks: slight agreement (

κ = 0.00

–

0.20

), fair (

0.21

–

0.40

), moderate (

0.41

–

0.60

), substantial (

0.61

–

0.80

), and almost perfect (

0.81

–

1.00

). Descriptive statistics for

κ

distributions are presented in Table 4.

Across all conditions, Cohen’s

κ

values ranged from −0.015 to 1.00. Gemini demonstrated the highest mean agreement in both datasets (real: M = 0.810, SD = 0.20; synthetic: M = 0.849, SD = 0.19), followed by Qwen (real: M = 0.701, SD = 0.19; synthetic: M = 0.799, SD = 0.23) and ChatGPT (real: M = 0.683, SD = 0.27; synthetic: M = 0.442, SD = 0.28). Notably, ChatGPT exhibited greater variability in performance, particularly in the synthetic condition, where its minimum kappa value (

κ

= −0.001) indicated agreement no better than chance for neutral sentiment classification.

In addition, inter-LLM agreement was consistently higher in the real compared to the synthetic dataset. In the real condition, 18 of 25 variables (72%) achieved substantial or almost perfect agreement (

κ

≥ 0.61), whereas only 9 of 25 variables (36%) reached this threshold in the synthetic condition. All kappa coefficients in the real dataset were statistically significant (p < 0.001); in the synthetic dataset, 2 comparisons were not statistically significant.

4.2.1. Inter-Rater Reliability in Sentiment Classification

For positive sentiment annotation in the real dataset, all models achieved statistically significant, substantial to almost perfect agreement with human coders (Qwen,

κ

= 0.906, p < 0.001; Gemini,

κ

= 0.940, p < 0.001; ChatGPT,

κ

= 0.707, p < 0.001), while negative sentiment yielded comparable patterns (Qwen,

κ

= 0.908, p < 0.001; Gemini,

κ

= 0.935, p < 0.001; ChatGPT,

κ

= 0.555, p < 0.001).

In the synthetic dataset, Qwen and Gemini maintained robust reliability for both positive (Qwen:

κ

= 0.907; Gemini:

κ

= 0.932) and negative (Qwen:

κ

= 0.931; Gemini:

κ

= 0.928) sentiment (all ps < 0.001). However, ChatGPT showed markedly attenuated agreement, with slight agreement in positive sentiment (

κ

= 0.147, p < 0.001), and fair agreement in negative sentiment (

κ

= 0.393, p < 0.001).

However, the neutral sentiment classification revealed the most pronounced model and dataset effects. In the real dataset, agreement was statistically significant but substantively modest: fair for Qwen (

κ

= 0.299, p < 0.001); moderate for Gemini (

κ

= 0.403, p < 0.001), and slight for ChatGPT (

κ

= 0.066, p < 0.001). In the synthetic dataset, agreement on neutral sentiment deteriorated substantially for both models. Qwen produced a negative coefficient

κ

(

κ

= −0.015, p = 0.829), indicating systematic disagreement beyond chance levels. Similarly, ChatGPT yielded

κ

= −0.001, p = 0.935, reflecting the agreement equivalent to random assignment. Only Gemini retained statistically significant, albeit fair, agreement (

κ

= 0.235, p = 0.001). These findings suggest that neutral sentiment, which is often characterized by ambiguous or mixed evaluative content, poses significant challenges for LLM annotation, particularly when applied to synthetic text.

Inter-LLM agreement for sentiment variables revealed a pronounced dataset effect. In the real dataset, positive sentiment achieved substantial agreement (

κ

= 0.792, p < 0.001), negative reached moderate agreement (

κ

= 0.684, p < 0.001) and neutral yielded only slight agreement (

κ

= 0.166, p < 0.001). In the synthetic dataset, the agreement patterns diverged markedly. Positive sentiment decreased to fair LLM agreement (

κ

= 0.248, p < 0.001), negative remained moderate (

κ

= 0.572, p < 0.001), and neutral produced a negative kappa coefficient (

κ

= −0.287, p < 0.001).

4.2.2. Inter-Rater Reliability in Aspect-Based Annotation

Agreement statistics for the aggregate Aspect variable revealed distinct patterns across datasets and annotator pairs. In the real dataset, the human–LLM agreement ranged from fair for Qwen (

κ

= 0.296) and ChatGPT (

κ

= 0.209) to moderate agreement for Gemini (

κ

= 0.457). The Inter-LLM consensus among all three models was fair (

κ

= 0.341). In contrast, in the synthetic dataset, Qwen showed substantially higher agreement with human annotations than in the real condition (

κ

= 0.790), indicating a shift from fair to substantial agreement. Gemini maintained moderate agreement (

κ

= 0.504), while ChatGPT’s performance declined to slight agreement (

κ

= 0.143). In particular, the inter-LLM consensus decreased to slight agreement (

κ

= 0.194), indicating that the models diverged considerably when generating synthetic content despite Qwen’s strong alignment with human judgments.

4.2.3. Inter-Rater Reliability in Topic Annotation

Attributes representing concrete, objectively verifiable hotel features consistently yielded substantial to almost perfect agreement across models and datasets. The breakfast annotation demonstrated exceptional reliability: in the real dataset, the values of

κ

ranged from 0.948 (Qwen) to 0.974 (Gemini) to 0.950 (ChatGPT), all ps < 0.001; in the synthetic dataset, the agreement remained high (Qwen:

κ

= 0.923; Gemini:

κ

= 1.00; ChatGPT:

κ

= 0.616). Similarly, the annotations for Staff, Location, and Bathroom/Shower showed

κ

≥ 0.819 in all combinations of models and datasets where data were available, indicating robust replicability for these well-defined constructs.

Furthermore, variables related to physical infrastructure showed greater variability. The annotation of the pool in the real dataset showed strong agreement for Qwen (

κ

= 0.831) and Gemini (

κ

= 0.970) but near-zero agreement for ChatGPT (

κ

= 0.004, p < 0.001), suggesting model-specific sensitivity to this attribute. In the synthetic dataset, both Qwen (

κ

= 0.939) and Gemini (

κ

= 1.00) achieved almost perfect agreement for Pool, whereas this variable was not assessed for ChatGPT. Parking had also achieved perfect agreement (

κ

= 1.00) among the three LLMs, while Wi-Fi demonstrated a similar pattern, with perfect agreement (

κ

= 1.00) for Qwen and Gemini in synthetic data, but ChatGPT showed substantial but lower reliability (Wi-Fi:

κ

= 0.655).

However, more subjective dimensions, such as Comfort, Value for money, and Generic, produced moderate and less consistent agreement. In the real dataset, the kappa values ranged from 0.429 (Qwen) to 0.646 (ChatGPT), indicating fair to substantial agreement. In synthetic data, ChatGPT’s agreement on value for money (

κ

= 0.705) exceeded that of Qwen (

κ

= 0.808) and Gemini (

κ

= 0.945), although all remained statistically significant. The Generic category, intended to capture ambiguous references, yielded the lowest agreement across conditions (real dataset:

κ

range = [0.316, 0.502]; synthetic dataset:

κ

range = [0.097, 0.585], underscoring the challenge of annotating poorly defined constructs.

Regarding the agreement among LLM, in the real dataset, concrete attributes such as Breakfast (

κ

= 0.954), Bathroom/Shower (

κ

= 0.922), and Location (

κ

= 0.888) achieved almost perfect consensus, while more ambiguous categories such as Generic (

κ

= 0.338) reached only fair agreement. In particular, the Generic category in the synthetic data did not achieve statistically significant agreement (

κ

= 0.019, p = 0.640).

4.2.4. Missing Data Patterns

Several aspect variables were not annotated for specific combinations of model datasets (denoted by blank cells in Table A1), primarily for ChatGPT under synthetic conditions (e.g., Restaurant (dinner), View (Balcony), Pool, Beach, Reception-check-in, Lift). This pattern suggests either methodological exclusions due to low annotation frequency or model-specific limitations in generating relevant content for these attributes in synthetic reviews.

4.2.5. Real Versus Synthetic Dataset

A consistent cross-model pattern emerged, with agreement metrics generally higher and less variable in the real dataset than in the synthetic dataset. This effect was most pronounced for ChatGPT, whose mean kappa decreased from 0.683 (real) to 0.442 (synthetic), a 35.3% reduction. Qwen and Gemini showed greater cross-dataset stability, with mean kappa values increasing by 14.0% and 4.8%, respectively.

The interaction between the type of dataset and the sentiment classification was particularly noteworthy. Although all models maintained strong agreement for polarized sentiment (positive/negative) across datasets, the reliability of neutral sentiment collapsed in synthetic data for Qwen and ChatGPT. This is also pronounced in the LLM agreement, where several variables of physical infrastructure achieved substantial agreement in the synthetic dataset (Parking:

κ

= 1.00; Staff:

κ

= 0.808; Location:

κ

= 0.750), while others showed marked declines relative to real data (Pool:

κ

= 0.486 vs. 0.870 in real; View:

κ

= 0.433 vs. 0.790 in real).

4.2.6. Model-Specific Performance Profiles

Gemini demonstrated the most consistent high-level performance, achieving almost perfect agreement (

κ

≥ 0.81) for 15 of 25 variables (60%) in the real dataset and 20 of 25 variables (80%) in the synthetic dataset. Its lowest agreement occurred for Generic in real data (

κ

= 0.337) and neutral sentiment in synthetic data (

κ

= 0.235), both still statistically significant. However, it was unable to provide annotations for 172 sentences in the real dataset and 3 sentences in the synthetic dataset, respectively.

Qwen showed strong, stable performance, with 10 variables (40%) reaching almost perfect agreement in real data and 17 variables (68%) in synthetic data. Its primary weakness was its classification of neutral feelings in synthetic contexts (

κ

= −0.015, p = 0.829), indicating a systematic misalignment with human judgment on this specific task.

ChatGPT exhibited the greatest performance variability. Although it achieved almost perfect agreement for 11 variables (46%) in the real dataset, particularly for concrete attributes such as breakfast (

κ

= 0.950) and Parking (

κ

= 0.946), the performance of its synthetic dataset was markedly less reliable, with only 1 variable (4%) reaching almost perfect agreement (Parking,

κ

= 1.00) and 5 variables (28%) achieving slight or negative agreement.

5. Discussion & Implications

The present study extends previous LLM ([17]) and cross-model ([16]) evaluations by incorporating a third architectural family (Gemini) along with ChatGPT and Qwen, offering a more robust triangulation of annotation behaviors in semantic multimedia systems. The findings confirm that annotation bias is not a universal property of LLMs but is highly model-contingent, shaped by architectural design, alignment objectives, and operational contexts (see Table 2). By benchmarking three distinct models against human expert annotations across real and synthetic datasets, this research advances the bias-by-design framework, demonstrating how structural choices manifest as systematic divergences in sentiment, topic, and aspect labeling.

5.1. Architectural Determinants of Annotation Bias

A primary contribution of this study is the identification of a performance hierarchy among LLM families. Gemini demonstrated the highest stability and human alignment, achieving substantial to almost perfect agreement (

κ

≥ 0.81) for 60% of variables in the real dataset and 80% in the synthetic dataset. This suggests that Gemini’s underlying architecture and training regimen may prioritize semantic fidelity closer to human interpretive patterns than its counterparts. In contrast, ChatGPT-5 exhibited the highest variability, particularly a pronounced neutrality bias in synthetic contexts. These findings support the perspective of Bender et al. [12], who argue that language technologies inherently reflect the values and limitations of their creators through the selection of training data and the optimization of their goals. The divergence between Gemini’s polarity sensitivity and ChatGPT’s neutrality suggests that annotation bias is a predictable outcome of system design rather than random error.

A more fine-grained error analysis suggests that the observed biases are not random but follow model-specific failure modes. ChatGPT-5’s main error pattern is not simply lower accuracy, but systematic compression of polarity into safer, less committal labels. In the real dataset, it underassigned negative sentiment relative to the human annotator (28.7% vs. 45.4%) while overassigning neutral sentiment (25.6% vs. 3.4%). This pattern intensified sharply in the synthetic dataset, where neutral sentiment rose to 68.8% compared with only 1.5% in the human annotations, alongside a large increase in Generic labels (53.3% vs. 6.5%) and Aspect labels (35.2% vs. 4.0%). The corresponding

κ

values support the same interpretation, since ChatGPT’s agreement for neutral sentiment was only slight in the real data (

κ

= 0.066) and collapsed to chance level in the synthetic data (

κ

= −0.001). This negative agreement (

κ

= −0.001) does not reflect random noise, but rather a structured misalignment driven by class imbalance and calibration divergence. When a model systematically overpredicts a minority class (neutral),

κ

is mathematically penalized, revealing that the metric captures systematic label collision rather than mere disagreement.

Taken together, these results suggest a risk-averse interpretive strategy in which the model defaults to underspecified or non-committal categories when evaluative ambiguity is present, rather than resolving mixed evidence toward positive or negative polarity.

Qwen-3 exhibits a different error profile. Its overall alignment with human annotations is strong, but its weakness is concentrated in a narrow, theoretically important case: neutral sentiment in synthetic reviews. Whereas Qwen performed robustly on polarized sentiment and many concrete topics, its neutral-sentiment agreement in the synthetic dataset became negative (

κ

= −0.015), indicating systematic disagreement rather than simple noise. This suggests that Qwen is less prone than ChatGPT-5 to generalized neutral overassignment, but may instead struggle when neutrality is expressed through synthetic, stylized, or weakly grounded linguistic constructions. This pattern likely stems from Qwen’s open-weight training paradigm, which emphasizes instruction-following and factual grounding over conversational safety constraints. Without heavy RLHF smoothing, Qwen appears more willing to commit to polarity, yet this strength becomes a liability when the input contains artificially smoothed or ambiguous evaluative language, causing the model to misclassify synthetic neutrality as implicit polarity or structural noise.

Gemini-3-flash shows the most balanced profile, with the highest mean human–LLM agreement across both datasets, but even this model is not uniformly reliable. Its weakest results occur on categories that are conceptually diffuse rather than concretely referential, especially Generic in the real dataset (

κ

= 0.337) and neutral sentiment in the synthetic dataset (

κ

= 0.235). This pattern indicates that Gemini’s relative advantage lies less in the elimination of bias than in better calibration across clearly bounded categories. The broader pattern across all three models, therefore, suggests that concrete, physically verifiable hotel attributes are easier to stabilize computationally, whereas ambiguity, mixed sentiment, and residual categories expose differences in alignment strategy, calibration, and tolerance for uncertainty.

This architectural dependency aligns with recent research on cross-lingual adaptation, where Sanghyun et al. [6] demonstrate that performance gaps depend on whether encoder- or decoder-based architectures are used. Although our study did not isolate specific architectural parameters, the observed divergence between proprietary models (ChatGPT-5, Gemini-3-flash) and open-weight models (Qwen-3) underscores the need for a broader architectural review. Furthermore, while Pyreddy and Zaman [21] identify a consistent positivity bias in ChatGPT for social media content, our results reveal a neutrality bias in the hospitality domain. This discrepancy highlights the findings of Miah et al. [11], who note that the model’s performance varies significantly across regional and cultural contexts. It suggests that bias typologies are not static but interact with domain-specific linguistic norms.

5.2. Human–LLM Alignment and Domain Specificity

The results reinforce that, while LLMs offer scalability, their alignment with human judgment is moderate and task-dependent. Ghatora et al. [20] report that GPT-4 shows performance comparable to traditional machine learning baselines but does not consistently outperform them. Our findings extend this by showing that agreement is robust for concrete topics (e.g., Wi-Fi, Parking) across all models but diverges significantly on abstract evaluative dimensions (e.g., Comfort, Value for Money). This divergence between concrete and abstract domains mirrors observations by Falatouri et al. [3], who find that while LLMs achieve high accuracy in information extraction, their alignment with human sentiment judgments is only moderate, particularly in capturing cultural nuances and fine-grained aspects.

Moreover, the domain-dependent nature of the alignment observed in this study resonates with Anghel et al. [23,24], who observe a tighter correlation in technical subjects than in others and note that LLM judges show weaker alignment with the human mean than human–human pairs. In our context, the lower agreement on abstract sentiment categories suggests that LLMs struggle with the interpretive depth required for nuanced hospitality reviews. This supports the argument of Giorgi et al. [22] that while LLMs may reduce some human annotator biases (e.g., demographic biases in hate speech labeling), this does not eliminate structural biases inherent to the model’s training distribution. Consequently, relying on a single model type can inadvertently embed unchecked biases into semantic pipelines, as warned by Mohta et al. [15]. Our cross-model comparison demonstrates that architectural diversity can serve as a diagnostic filter: when models disagree on abstract constructs, the divergence itself signals interpretive ambiguity that should trigger human review or multi-model consensus routing. Employing multiple models, as demonstrated in this study, helps surface the underlying interpretive assumptions and enhances the system’s robustness.

5.3. The Synthetic Data Paradox

A counterintuitive finding emerged regarding the synthetic data. While inter-LLM agreement was generally lower in synthetic conditions, human–LLM agreement for specific tasks (e.g., Aspect annotation by Qwen) was higher in synthetic data than in real data. This suggests a synthetic alignment effect, where models may find it easier to align with human annotations on data that share structural or linguistic patterns with their own training distributions. However, this alignment comes with caveats. Li et al. [25] demonstrate that models trained exclusively in synthetic text perform poorly relative to those trained in real data, particularly in tasks requiring semantic nuance. Thus, while the agreement metrics were inflated in synthetic data, this reliability masked a loss of ambiguity and interpretive depth.

Voutsa et al. [16] confirm that the LLM–human agreement is higher in synthetic datasets than in real reviews, yet this reliability masks a loss of ambiguity. Consequently, synthetic benchmarks can amplify existing annotation biases while misrepresenting real-world performance. This pattern reinforces concerns that synthetic benchmarks may inflate reliability while masking the ambiguity present in authentic UGC. Therefore, synthetic data do not serve as a neutral substitute, but rather amplify bias in the annotation process. For researchers using synthetic data to benchmark annotation tools, these results reveal that high agreement on synthetic benchmarks may reflect shared architectural biases rather than genuine semantic understanding.

The extreme neutrality pattern observed for ChatGPT-5 in the synthetic dataset suggests that synthetic text may not merely simplify annotation, but qualitatively reshape the decision space available to the model. When reviews are generated in a more regular, fluent, and distributionally familiar style, evaluative language may become less sharply anchored to concrete experiential cues and more evenly balanced across positive and negative expressions. For a model with a conservative or ambiguity-sensitive alignment profile, such inputs may favor neutral classification as a fallback strategy. Thus, the synthetic dataset appears to amplify an already present model tendency: rather than resolving mixed or flattened evaluative signals into polarity, ChatGPT-5 absorbs them into neutral or generic categories. This interpretation helps explain why synthetic benchmarks can yield not only inflated agreement in some cases, but also dramatic model-specific distortions in label distributions.

5.4. Implications for Semantic Multimedia Systems

The findings offer strategic implications for the development of adaptive semantic multimedia architectures. Tan and Jiang [2] position LLMs as multi-role agents capable of managing encoding and evaluation. However, our results suggest that these roles must be allocated based on model-specific strengths. Rather than deploying a single LLM across an entire pipeline, developers should implement bias-aware routing: concrete, schema-bound tasks (e.g., aspect detection, location extraction, facility mentions) can be safely automated using cost-efficient or high-throughput models, whereas evaluative, culturally contingent, or emotionally nuanced judgments should be routed to models with proven polarity calibration (e.g., Gemini-3-flash) or escalated to human validators. For tasks that require conservative classification (e.g., content moderation), ChatGPT’s neutrality bias might be a feature. For tasks requiring emotional nuance (e.g., customer experience analytics), Gemini’s higher polarity alignment is preferable.

Furthermore, the need for diverse annotation sources parallels findings in affective computing, where Mostert et al. [10] suggest that single-modality approaches often lack reliability due to the complexity of human expression. Integrating diverse data streams enhances robustness; similarly, integrating diverse model architectures (cross-model triangulation) mitigates the risk of model-specific bias. De-Marcos and Domínguez-Díaz [19] demonstrate that LLM-based topic modeling exhibits superior coherence compared to traditional approaches. Our study adds that this coherence is highest for concrete topics, suggesting that knowledge graphs and recommendation engines should treat concrete entity extraction as high-confidence automated tasks, while routing abstract evaluative judgments to higher-fidelity models or human validators. We therefore propose a tiered annotation architecture: (1) automated extraction for physically verifiable constructs, (2) cross-model consensus voting for subjective dimensions, and (3) human-in-the-loop adjudication when model disagreement exceeds a predefined

κ

threshold or when label distributions deviate significantly from historical baselines.

5.5. Limitations & Future Research

This study has limitations. First, the study relies on a single expert human annotator for the ground truth. While this ensures consistency, it does not capture inter-human variability. Anghel et al. [24] note that LLM judges show weaker alignment to human mean than human–human pairs, suggesting that future research should incorporate multi-annotator benchmarks to distinguish model bias from annotator subjectivity. Second, the empirical analysis relies on a single domain (hotel reviews), which features structured aspect schemas and moderate emotional variance. Consequently, findings may not directly transfer to high-stakes, emotionally volatile, or low-resource domains (e.g., clinical notes, financial disclosures, or social media discourse. Miah et al. [11] highlight that model performance varies across regional and cultural contexts, suggesting that future work should test these findings across diverse linguistic and cultural datasets. Third, while we identified performance hierarchies, the specific architectural parameters (e.g., temperature, top-p) were not systematically varied. Sanghyun et al. [6] emphasize the need for a more comprehensive architectural review, and future work should investigate how the inference-time parameters interact with the model architecture to modulate annotation bias. Fourth, the synthetic dataset adopted by Voutsa et al. [16] was generated using a prompt-based approach without structurally replicating the distributional characteristics of the real dataset. Li et al. [25] recommend generating synthetic data that more closely mirrors the structure and complexity of real datasets to enable more ecologically valid assessments. Future research should examine more systematically how temperature settings and generation strategies contribute to AI-driven biases in both synthetic data generation and annotation tasks. Fifth, the synthetic subset (N = 199 effective cases) yields wider confidence intervals for synthetic-condition estimates (SEs: 0.030–0.204 vs. 0.002–0.025 in real data), reflecting elevated sampling variance. Results should therefore be interpreted as indicative rather than definitive, and caution should be exercised against overgeneralization to larger synthetic corpora. Sixth, although our analysis prioritizes

κ

-based statistics to reduce the confounding effects of skewed label distributions, future studies should address the constraints of our current agreement-focused evaluation by incorporating complementary metrics and robust class-balancing strategies (e.g., F1-scores, accuracy) to better capture true model–rater alignment across varying data distributions.

Finally, reproducibility remains a structural limitation of this study and of contemporary LLM evaluation more broadly. Although the annotation workflow was standardized and is reproducible at the procedural level, the evaluated systems are not equally reproducible at the model level. As summarized in Table 2, GPT-5 was accessible only through proprietary API-based access, without public weights, full training recipes, or complete architectural disclosure [27,28]. Gemini-3-flash presents similar constraints, as it is also available through closed Google services and does not expose open weights or full corpus-level documentation [31,32,33]. By contrast, Qwen-3 offers a comparatively stronger reproducibility profile through open weights, public model cards, and independent deployment possibilities, even though it too does not provide complete transparency into all development choices [29,30]. This creates an epistemic asymmetry in cross-model comparison: Qwen-3 can be inspected and rerun more directly, whereas GPT-5 and Gemini-3-flash can only be replicated behaviorally through externally managed API endpoints.

As a result, the present study should be understood as fully reproducible in its evaluation protocol, but only partially reproducible in relation to the underlying models themselves. Independent researchers can replicate the dataset, prompt structure, parameter settings, parsing logic, schema-constrained outputs, and agreement analyses, and can therefore reproduce the analytical procedure under comparable access conditions. However, they cannot verify whether future API calls will reflect the same hidden architecture, training distribution, routing behavior, safety layers, or post-deployment alignment updates that shaped the outputs observed here [27,28,31,32]. For this reason, our findings are best interpreted as time-specific measurements of deployed model behavior rather than as definitive audits of stable and fully inspectable systems.

To mitigate these constraints, we adopted fixed prompt templates, temperature = 0 decoding, version logging, a unified API client, and JSON schema enforcement in order to maximize behavioral reproducibility across runs. Even so, procedural transparency cannot fully compensate for limited model transparency. Future work would benefit from stronger community standards, including version-pinned API access, fuller model-card disclosure, release of executable annotation pipelines, and benchmarking designs that deliberately combine API-based systems with open-weight models so that behavioral comparison can be complemented by deeper independent auditability [28,30,31,33].

6. Conclusions

This study demonstrates that annotation bias in LLMs is model-contingent rather than universal, shaped by architectural design and alignment strategies. The systematic divergences observed, particularly ChatGPT-5’s neutrality bias versus Qwen-3’s and Gemini-3-flash’s polarity alignment, reveal that biases are structurally embedded in model development choices. These findings underscore the indispensable value of human-annotated datasets, which provide irreplaceable ground truth benchmarks that capture the ambiguity and contextual sensitivity that LLMs systematically flatten through risk-averse classification. Without high-quality human annotations, researchers risk creating circular validation frameworks in which models are evaluated against synthetic or LLM-generated labels that share the same biases, leading to inflated metrics that do not reflect real-world performance.

Author Contributions

Conceptualization, N.T. and C.D.; methodology, C.D. and M.C.V.; software, C.D. and M.C.V.; validation, C.D. and M.C.V.; formal analysis, M.C.V.; investigation, C.D.; resources, C.D.; data curation, C.D. and M.C.V.; writing—original draft preparation, C.D., C.A. and M.C.V.; writing—review and editing, M.C.V. and N.T.; visualization, M.C.V.; supervision, N.T.; project administration, N.T.; funding acquisition, N.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Ethical review and approval were waived for this study due to the use of publicly available online booking review texts, which do not involve interaction with human participants or the collection of identifiable private information.

Data Availability Statement

The original HRAST dataset containing human-annotated hotel review sentences is openly available on Kaggle. The LLM-generated annotations produced in this study (ChatGPT-5, Qwen-3, and Gemini-3-flash outputs), together with the prompt templates, model identifiers, parameter settings, and post-processing specifications used in the annotation pipeline, are available from the corresponding author upon reasonable request. This enables replication of the evaluation procedure and comparative agreement analyses under similar access conditions. However, exact model-level replication remains constrained for API-based systems such as GPT-5 and Gemini-3-flash, because their weights, full training corpora, and internal alignment pipelines are not publicly disclosed, whereas Qwen-3 provides comparatively greater reproducibility through open-weight availability and public deployment resources [28,29,30,32]. Accordingly, the materials shared by the authors support procedural and behavioral replication of the study, but not full independent reconstruction of the proprietary models themselves.

Acknowledgments

During the preparation of this manuscript/study, the authors used Qwen3.5 for the purposes of proofreading. The authors have reviewed and edited the output and take full responsibility for the content of this publication.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI	Artificial Intelligence
API	Application Programming Interface
LLM	Large Language Model
MoE	Mixture of experts
UGC	User-generated content

Appendix A

Table A1. Inter-rater reliability (

κ

) with standard errors, 95% confidence intervals, and interpretation between Human and AI models across Real and Synthetic datasets.

Table A1. Inter-rater reliability (

κ

) with standard errors, 95% confidence intervals, and interpretation between Human and AI models across Real and Synthetic datasets.

Dataset	Variable	Human vs. Qwen			Human vs. Gemini			Human vs. ChatGPT
Dataset	Variable	$κ$ (SE)	95% CI	Interp.	$κ$ (SE)	95% CI	Interp.	$κ$ (SE)	95% CI	Interp.
real	Sentiment: positive	0.906 (0.003)	[0.900, 0.912]	AP	0.940 (0.002)	[0.936, 0.944]	AP	0.707 (0.005)	[0.697, 0.717]	S
real	Sentiment: negative	0.908 (0.003)	[0.902, 0.914]	AP	0.935 (0.002)	[0.931, 0.939]	AP	0.555 (0.005)	[0.545, 0.565]	M
real	Sentiment: neutral	0.299 (0.012)	[0.275, 0.323]	F	0.403 (0.015)	[0.374, 0.432]	M	0.066 (0.005)	[0.056, 0.076]	Sl
real	Aspect	0.296 (0.008)	[0.280, 0.312]	F	0.457 (0.007)	[0.443, 0.471]	M	0.209 (0.005)	[0.199, 0.219]	F
real	Clean	0.786 (0.006)	[0.774, 0.798]	S	0.937 (0.003)	[0.931, 0.943]	AP	0.818 (0.006)	[0.806, 0.830]	AP
real	Comfort	0.429 (0.009)	[0.411, 0.447]	M	0.459 (0.010)	[0.439, 0.479]	M	0.646 (0.007)	[0.632, 0.660]	S
real	Facilities/Amenities	0.488 (0.009)	[0.470, 0.506]	M	0.783 (0.006)	[0.771, 0.795]	S	0.672 (0.009)	[0.654, 0.690]	S
real	Location	0.921 (0.003)	[0.915, 0.927]	AP	0.945 (0.003)	[0.939, 0.951]	AP	0.856 (0.004)	[0.848, 0.864]	AP
real	Restaurant (dinner)	0.582 (0.021)	[0.541, 0.623]	M	0.712 (0.021)	[0.671, 0.753]	S	N/A	N/A	–
real	Staff	0.916 (0.004)	[0.908, 0.924]	AP	0.925 (0.003)	[0.919, 0.931]	AP	0.819 (0.005)	[0.809, 0.829]	AP
real	View (Balcony)	0.740 (0.011)	[0.718, 0.762]	S	0.872 (0.009)	[0.854, 0.890]	AP	0.844 (0.010)	[0.824, 0.864]	AP
real	Breakfast	0.948 (0.003)	[0.942, 0.954]	AP	0.974 (0.002)	[0.970, 0.978]	AP	0.950 (0.003)	[0.944, 0.956]	AP
real	Room	0.610 (0.005)	[0.600, 0.620]	S	0.737 (0.005)	[0.727, 0.747]	S	0.705 (0.005)	[0.695, 0.715]	S
real	Pool	0.831 (0.014)	[0.804, 0.858]	AP	0.970 (0.006)	[0.958, 0.982]	AP	0.982 (0.004)	[0.974, 0.990]	AP
real	Beach	0.704 (0.025)	[0.655, 0.753]	S	0.885 (0.018)	[0.850, 0.920]	AP	0.853 (0.021)	[0.812, 0.894]	AP
real	Bathroom/Shower	0.910 (0.004)	[0.902, 0.918]	AP	0.935 (0.004)	[0.927, 0.943]	AP	0.893 (0.005)	[0.883, 0.903]	AP
real	Bar	0.647 (0.020)	[0.608, 0.686]	S	0.781 (0.018)	[0.746, 0.816]	S	0.740 (0.018)	[0.705, 0.775]	S
real	Bed	0.862 (0.007)	[0.848, 0.876]	AP	0.929 (0.005)	[0.919, 0.939]	AP	0.770 (0.009)	[0.752, 0.788]	S
real	Parking	0.687 (0.014)	[0.660, 0.714]	S	0.964 (0.006)	[0.952, 0.976]	AP	0.946 (0.007)	[0.932, 0.960]	AP
real	Noise	0.857 (0.007)	[0.843, 0.871]	AP	0.944 (0.004)	[0.936, 0.952]	AP	0.847 (0.007)	[0.833, 0.861]	AP
real	Reception-checkin	0.690 (0.011)	[0.668, 0.712]	S	0.734 (0.010)	[0.714, 0.754]	S	0.696 (0.012)	[0.672, 0.720]	S
real	Lift	0.813 (0.017)	[0.780, 0.846]	AP	0.944 (0.010)	[0.924, 0.964]	AP	0.808 (0.019)	[0.771, 0.845]	AP
real	Value for money	0.602 (0.012)	[0.578, 0.626]	S	0.769 (0.011)	[0.747, 0.791]	S	0.729 (0.012)	[0.705, 0.753]	S
real	Wi-Fi	0.580 (0.020)	[0.541, 0.619]	M	0.972 (0.007)	[0.958, 0.986]	AP	0.944 (0.010)	[0.924, 0.964]	AP
real	Generic	0.502 (0.011)	[0.480, 0.524]	M	0.337 (0.012)	[0.313, 0.361]	F	0.316 (0.011)	[0.294, 0.338]	F
synthetic	Sentiment: positive	0.907 (0.034)	[0.840, 0.974]	AP	0.932 (0.030)	[0.873, 0.991]	AP	0.147 (0.033)	[0.082, 0.212]	Sl
synthetic	Sentiment: negative	0.931 (0.031)	[0.870, 0.992]	AP	0.928 (0.032)	[0.865, 0.991]	AP	0.393 (0.078)	[0.240, 0.546]	F
synthetic	Sentiment: neutral	−0.015 (0.006)	[−0.027, −0.003]	Po	0.235 (0.204)	[−0.165, 0.635]	F	−0.001 (0.012)	[−0.025, 0.023]	Po
synthetic	Aspect	0.790 (0.102)	[0.590, 0.990]	S	0.504 (0.111)	[0.286, 0.722]	M	0.143 (0.047)	[0.051, 0.235]	Sl
synthetic	Clean	0.825 (0.076)	[0.676, 0.974]	AP	0.901 (0.057)	[0.789, 1.000]	AP	0.508 (0.123)	[0.267, 0.749]	M
synthetic	Comfort	0.440 (0.098)	[0.248, 0.632]	M	0.711 (0.112)	[0.492, 0.930]	S	0.421 (0.148)	[0.131, 0.711]	M
synthetic	Facilities/Amenities	0.616 (0.111)	[0.398, 0.834]	S	0.615 (0.125)	[0.370, 0.860]	S	N/A	N/A	–
synthetic	Location	0.888 (0.064)	[0.763, 1.000]	AP	0.881 (0.068)	[0.748, 1.000]	AP	0.582 (0.131)	[0.325, 0.839]	M
synthetic	Restaurant (dinner)	0.816 (0.104)	[0.612, 1.000]	AP	0.884 (0.081)	[0.725, 1.000]	AP	N/A	N/A	–
synthetic	Staff	0.770 (0.091)	[0.592, 0.948]	S	0.801 (0.086)	[0.632, 0.970]	S	0.770 (0.099)	[0.576, 0.964]	S
synthetic	View (Balcony)	0.790 (0.102)	[0.590, 0.990]	S	0.895 (0.074)	[0.750, 1.000]	AP	N/A	N/A	–
synthetic	Breakfast	0.923 (0.054)	[0.817, 1.000]	AP	1.000 (0.000)	[1.000, 1.000]	AP	0.616 (0.132)	[0.357, 0.875]	S
synthetic	Room	0.617 (0.095)	[0.431, 0.803]	S	0.814 (0.081)	[0.655, 0.973]	AP	0.638 (0.095)	[0.452, 0.824]	S
synthetic	Pool	0.939 (0.061)	[0.819, 1.000]	AP	1.000 (0.000)	[1.000, 1.000]	AP	N/A	N/A	–
synthetic	Beach	0.895 (0.074)	[0.750, 1.000]	AP	0.945 (0.055)	[0.837, 1.000]	AP	N/A	N/A	–
synthetic	Bathroom/Shower	0.950 (0.050)	[0.852, 1.000]	AP	0.950 (0.050)	[0.852, 1.000]	AP	0.449 (0.171)	[0.114, 0.784]	M
synthetic	Bar	0.950 (0.050)	[0.852, 1.000]	AP	0.950 (0.050)	[0.852, 1.000]	AP	N/A	N/A	–
synthetic	Bed	0.872 (0.073)	[0.729, 1.000]	AP	0.872 (0.073)	[0.729, 1.000]	AP	0.449 (0.153)	[0.149, 0.749]	M
synthetic	Parking	1.000 (0.000)	[1.000, 1.000]	AP	1.000 (0.000)	[1.000, 1.000]	AP	1.000 (0.000)	[1.000, 1.000]	AP
synthetic	Noise	0.835 (0.081)	[0.676, 0.994]	AP	0.911 (0.062)	[0.789, 1.000]	AP	0.385 (0.158)	[0.075, 0.695]	F
synthetic	Reception-checkin	0.957 (0.043)	[0.873, 1.000]	AP	0.957 (0.043)	[0.873, 1.000]	AP	N/A	N/A	–
synthetic	Lift	1.000 (0.000)	[1.000, 1.000]	AP	1.000 (0.000)	[1.000, 1.000]	AP	N/A	N/A	–
synthetic	Value for money	0.808 (0.093)	[0.626, 0.990]	AP	0.945 (0.055)	[0.837, 1.000]	AP	0.705 (0.140)	[0.431, 0.979]	S
synthetic	Wi-Fi	1.000 (0.000)	[1.000, 1.000]	AP	1.000 (0.000)	[1.000, 1.000]	AP	0.655 (0.143)	[0.375, 0.935]	S
synthetic	Generic	0.471 (0.108)	[0.259, 0.683]	M	0.585 (0.125)	[0.340, 0.830]	M	0.097 (0.032)	[0.034, 0.160]	Sl

Note. SE = Asymptotic Standard Error; 95% CI =

κ

± 1.96 × SE; Interp. = Interpretation per Landis & Koch [41]: Po = Poor (

κ

< 0), Sl = Slight (0.00–0.20), F = Fair (0.21–0.40), M = Moderate (0.41–0.60), S = Substantial (0.61–0.80), AP = Almost Perfect (0.81–1.00). All p-values < 0.001 unless otherwise indicated. N/A = variable excluded due to low annotation frequency in synthetic condition. Confidence intervals exceeding 1.0 are capped at 1.000.

Table A2. Fleiss’s

κ

reliability statistics with standard errors, 95% confidence intervals, z-scores, significance levels, and interpretation for inter-LLM agreement across Real and Synthetic datasets.

Table A2. Fleiss’s

κ

reliability statistics with standard errors, 95% confidence intervals, z-scores, significance levels, and interpretation for inter-LLM agreement across Real and Synthetic datasets.

Variable	Dataset	$κ$ (SE)	95% CI	z	p	Interp.
Sentiment: positive	real	0.792 (0.004)	[0.784, 0.800]	207.798	<0.001	S
Sentiment: positive	synthetic	0.248 (0.041)	[0.168, 0.328]	6.025	<0.001	F
Sentiment: negative	real	0.684 (0.004)	[0.676, 0.692]	179.452	<0.001	S
Sentiment: negative	synthetic	0.572 (0.041)	[0.492, 0.652]	13.871	<0.001	M
Sentiment: neutral	real	0.166 (0.004)	[0.158, 0.174]	43.639	<0.001	Sl
Sentiment: neutral	synthetic	−0.287 (0.041)	[−0.367, −0.207]	−6.970	<0.001	Po
Aspect	real	0.341 (0.004)	[0.333, 0.349]	89.389	<0.001	F
Aspect	synthetic	0.194 (0.041)	[0.114, 0.274]	4.704	<0.001	Sl
Clean	real	0.773 (0.004)	[0.765, 0.781]	202.799	<0.001	S
Clean	synthetic	0.6920 (0.041)	[0.612, 0.772]	16.788	<0.001	S
Comfort	real	0.456 (0.004)	[0.448, 0.464]	119.608	<0.001	M
Comfort	synthetic	0.386 (0.041)	[0.306, 0.466]	9.357	<0.001	F
Facilities/Amenities	real	0.512 (0.004)	[0.504, 0.520]	134.195	<0.001	M
Facilities/Amenities	synthetic	0.373 (0.041)	[0.293, 0.453]	9.053	<0.001	F
Location	real	0.888 (0.004)	[0.880, 0.896]	233.068	<0.001	AP
Location	synthetic	0.750 (0.041)	[0.670, 0.830]	18.193	<0.001	S
Restaurant	real	0.297 (0.004)	[0.289, 0.305]	78.037	<0.001	F
Restaurant	synthetic	0.456 (0.041)	[0.376, 0.536]	11.060	<0.001	M
Staff	real	0.868 (0.004)	[0.860, 0.876]	227.621	<0.001	AP
Staff	synthetic	0.808 (0.041)	[0.728, 0.888]	19.587	<0.001	AP
View	real	0.790 (0.004)	[0.782, 0.798]	207.179	<0.001	S
View	synthetic	0.433 (0.041)	[0.353, 0.513]	10.508	<0.001	M
Breakfast	real	0.954 (0.004)	[0.946, 0.962]	250.192	<0.001	AP
Breakfast	synthetic	0.719 (0.041)	[0.639, 0.799]	17.436	<0.001	S
Room	real	0.769 (0.004)	[0.761, 0.777]	201.706	<0.001	S
Room	synthetic	0.710 (0.041)	[0.630, 0.790]	17.213	<0.001	S
Pool	real	0.870 (0.004)	[0.862, 0.878]	228.287	<0.001	AP
Pool	synthetic	0.486 (0.041)	[0.406, 0.566]	11.785	<0.001	M
Beach	real	0.745 (0.004)	[0.737, 0.753]	195.388	<0.001	S
Beach	synthetic	0.484 (0.041)	[0.404, 0.564]	11.741	<0.001	M
Bathroom/Shower	real	0.922 (0.004)	[0.914, 0.930]	241.906	<0.001	AP
Bathroom/Shower	synthetic	0.666 (0.041)	[0.586, 0.746]	16.145	<0.001	S
Bar	real	0.684 (0.004)	[0.676, 0.692]	179.438	<0.001	S
Bar	synthetic	0.482 (0.041)	[0.402, 0.562]	11.697	<0.001	M
Bed	real	0.820 (0.004)	[0.812, 0.828]	215.218	<0.001	AP
Bed	synthetic	0.657 (0.041)	[0.577, 0.737]	15.922	<0.001	S
Parking	real	0.760 (0.004)	[0.752, 0.768]	199.429	<0.001	S
Parking	synthetic	1.000 (0.041)	[0.920, 1.000] ^†	24.249	<0.001	AP
Noise	real	0.833 (0.004)	[0.825, 0.841]	218.459	<0.001	AP
Noise	synthetic	0.601 (0.041)	[0.521, 0.681]	14.574	<0.001	S
Reception	real	0.790 (0.004)	[0.782, 0.798]	207.220	<0.001	S
Reception	synthetic	0.479 (0.041)	[0.399, 0.559]	11.608	<0.001	M
Lift	real	0.778 (0.004)	[0.770, 0.786]	204.178	<0.001	S
Lift	synthetic	0.482 (0.041)	[0.402, 0.562]	11.697	<0.001	M
Value for money	real	0.684 (0.004)	[0.676, 0.692]	179.525	<0.001	S
Value for money	synthetic	0.700 (0.041)	[0.620, 0.780]	16.974	<0.001	S
Wi-Fi	real	0.684 (0.004)	[0.676, 0.692]	179.425	<0.001	S
Wi-Fi	synthetic	0.791 (0.041)	[0.711, 0.871]	19.184	<0.001	S
Generic	real	0.338 (0.004)	[0.330, 0.346]	88.717	<0.001	F
Generic	synthetic	0.019 (0.041)	[−0.061, 0.099]	0.467	0.640	Sl

Note. SE = Asymptotic Standard Error; 95% CI =

κ

± 1.96 × SE; Interp. = Interpretation per Landis & Koch [41]: Po = Poor (

κ

< 0), Sl = Slight (0.00–0.20), F = Fair (0.21–0.40), M = Moderate (0.41–0.60), S = Substantial (0.61–0.80), AP = Almost Perfect (0.81–1.00). ^† Confidence intervals exceeding 1.0 are capped at 1.000 for interpretation.

References

Baier, D.; Decker, R.; Asenova, Y. Collecting and analyzing user-generated content for decision support in marketing management: An overview of methods and use cases. Schmalenbach J. Bus. Res. 2025, 77, 419–455. [Google Scholar] [CrossRef]
Tan, Z.; Jiang, M. User modeling in the era of large language models: Current research and future directions. arXiv 2023, arXiv:2312.11518. [Google Scholar]
Falatouri, T.; Hrušecká, D.; Fischer, T. Harnessing the power of LLMs for service quality assessment from user-generated content. IEEE Access 2024, 12, 99755–99767. [Google Scholar] [CrossRef]
Wei, W.; Hao, C.; Wang, Z. User needs insights from UGC based on large language model. Adv. Eng. Inform. 2025, 65, 103268. [Google Scholar] [CrossRef]
Intana, A.; Tantayakul, K.; Tanthavanich, W.; Chumchuay, W. An Ontology-Driven Framework for Personalised Context-Aware Running Event Recommendations. Computers 2026, 15, 195. [Google Scholar] [CrossRef]
Sanghyun, C.; Kim, M.; Hye-Lynn, K.; Jung-Hun, L.; Hyuk-Chul, K.; Soo-Jong, L. Cross-Lingual Adaptation for Multilingual Table Question Answering and Comparative Evaluation with Large Language Models. Computers 2026, 15, 92. [Google Scholar] [CrossRef]
Kozinets, R.V.; Gretzel, U. Commentary: Artificial intelligence: The marketer’s dilemma. J. Mark. 2021, 85, 156–159. [Google Scholar] [CrossRef]
Milwood, P.A.; Hartman-Caverly, S.; Roehl, W.S. A scoping study of ethics in artificial intelligence research in tourism and hospitality. In Proceedings of the ENTER22 e-Tourism Conference; Springer: Berlin/Heidelberg, Germany, 2023; pp. 243–254. [Google Scholar]
Jia, S.J.; Chi, O.H.; Chi, C.G. Unpacking the impact of AI vs. human-generated review summary on hotel booking intentions. Int. J. Hosp. Manag. 2025, 126, 104030. [Google Scholar]
Mostert, W.; Kurien, A.; Djouani, K. Multi-Modal Emotion Detection and Tracking System Using AI Techniques. Computers 2025, 14, 441. [Google Scholar] [CrossRef]
Miah, M.R.; Akter, L.; Ahmed, A.A.; Ngamassi, L.; Ramakrishnan, T. An ML-Based Approach to Leveraging Social Media for Disaster Type Classification and Analysis Across World Regions. Computers 2026, 15, 16. [Google Scholar] [CrossRef]
Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual, 3–10 March 2021; pp. 610–623. [Google Scholar]
Kordzadeh, N.; Ghasemaghaei, M. Algorithmic bias: Review, synthesis, and future research directions. Eur. J. Inf. Syst. 2022, 31, 388–409. [Google Scholar] [CrossRef]
Djouvas, C.; Charalampous, A.; Christodoulou, C.J.; Tsapatsoulis, N. Llms are not for everything: A dataset and comparative study on argument strength classification. In Proceedings of the 28th Pan-Hellenic Conference on Progress in Computing and Informatics, Athens, Greece, 13–15 December 2024; pp. 437–443. [Google Scholar]
Mohta, J.; Ak, K.; Xu, Y.; Shen, M. Are large language models good annotators? In Proceedings of the Proceedings on “I Can’t Believe It’s Not Better: Failure Modes in the Age of Foundation Models” at NeurIPS 2023 Workshops, New Orleans, LA, USA, 16 December 2023; pp. 38–48. [Google Scholar]
Voutsa, M.C.; Tsapatsoulis, N.; Djouvas, C. Biased by design? evaluating bias and behavioral diversity in llm annotation of real-world and synthetic hotel reviews. AI 2025, 6, 178. [Google Scholar] [CrossRef]
Voutsa, M.C.; Tsapatsoulis, N.; Djouvas, C.; Andreou, C. Bias in the Machine: Cross-Model Evaluation of ChatGPT-4 and Qwen in Hotel Booking Review Annotation. In Proceedings of the 2025 20th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP), Mystras, Greece, 27–28 November 2025; pp. 81–85. [Google Scholar]
Andreou, C.; Tsapatsoulis, N.; Anastasopoulou, V. A dataset of hotel reviews for aspect-based sentiment analysis and topic modeling. In Proceedings of the 2023 18th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP) 18th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP 2023), Limassol, Cyprus, 25–26 September 2023; pp. 1–9. [Google Scholar]
De-Marcos, L.; Domínguez-Díaz, A. Llm-based topic modeling for dark web q&a forums: A comparative analysis with traditional methods. IEEE Access 2025, 13, 67159–67169. [Google Scholar] [CrossRef]
Ghatora, P.S.; Hosseini, S.E.; Pervez, S.; Iqbal, M.J.; Shaukat, N. Sentiment analysis of product reviews using machine learning and pre-trained llm. Big Data Cogn. Comput. 2024, 8, 199. [Google Scholar] [CrossRef]
Pyreddy, S.R.; Zaman, T.S. Emoxpt: Analyzing emotional variances in human comments and llm-generated responses. In Proceedings of the 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 6–8 January 2025; pp. 88–94. [Google Scholar]
Giorgi, T.; Cima, L.; Fagni, T.; Avvenuti, M.; Cresci, S. Human and LLM biases in hate speech annotations: A socio-demographic analysis of annotators and targets. In Proceedings of the International AAAI Conference on Web and Social Media, Copenhagen, Denmark, 23–26 June 2025; Volume 19, pp. 653–670. [Google Scholar]
Anghel, C.; Craciun, M.V.; Pecheanu, E.; Cocu, A.; Anghel, A.A.; Iacobescu, P.; Maier, C.; Andrei, C.A.; Scheau, C.; Dragosloveanu, S. CourseEvalAI: Rubric-Guided Framework for Transparent and Consistent Evaluation of Large Language Models. Computers 2025, 14, 431. [Google Scholar] [CrossRef]
Anghel, C.; Craciun, M.V.; Anghel, A.A.; Cocu, A.; Balau, A.S.; Andrei, C.A.; Maier, C.; Dragosloveanu, S.; Nedelea, D.G.; Scheau, C. EvalCouncil: A Committee-Based LLM Framework for Reliable and Unbiased Automated Grading. Computers 2025, 14, 530. [Google Scholar] [CrossRef]
Li, Z.; Zhu, H.; Lu, Z.; Yin, M. Synthetic data generation with large language models for text classification: Potential and limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 10443–10461. [Google Scholar]
Nasution, A.H.; Onan, A. Chatgpt label: Comparing the quality of human-generated and llm-generated annotations in low-resource language nlp tasks. IEEE Access 2024, 12, 71876–71900. [Google Scholar] [CrossRef]
OpenAI. GPT-5 System Card, 2025. OpenAI System Card for GPT-5. 2025. Available online: https://openai.com/index/gpt-5-system-card/ (accessed on 1 March 2026).
OpenAI. GPT-5 Model. OpenAI API Documentation for GPT-5. 2025. Available online: https://developers.openai.com/api/docs/models/gpt-5 (accessed on 1 March 2026).
Qwen Team and Alibaba Cloud. Qwen3, 2025. Official GitHub Repository and Model Overview for Qwen3. Available online: https://github.com/QwenLM/qwen3 (accessed on 1 March 2026).
Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 Technical Report. arXiv 2025. [Google Scholar] [CrossRef]
Google DeepMind. Gemini 3 Flash-Model Card. Google DeepMind Model Card for Gemini 3 Flash. 2025. Available online: https://deepmind.google/models/model-cards/gemini-3-flash/ (accessed on 1 March 2026).
Google Cloud. Gemini 3 Flash|Generative AI on Vertex AI. Vertex AI Documentation for Gemini 3 Flash. 2025. Available online: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-flash (accessed on 1 March 2026).
Google DeepMind. Gemini 3.1 Pro-Model Card. Google DeepMind Model Card for Gemini 3.1 Pro. 2026. Available online: https://deepmind.google/models/model-cards/gemini-3-1-pro/ (accessed on 1 March 2026).
Google. Gemini 3 Developer Guide|Gemini API. Gemini API Developer Guide for the Gemini 3 Series. 2026. Available online: https://ai.google.dev/gemini-api/docs/gemini-3 (accessed on 1 March 2026).
OpenAI. Models|OpenAI API. OpenAI API Models Documentation. 2026. Available online: https://developers.openai.com/api/docs/models (accessed on 1 March 2026).
Qwen Team. Qwen Documentation. Official Qwen Documentation. 2026. Available online: https://qwen.readthedocs.io/ (accessed on 1 March 2026).
Qwen Team. Qwen3: Think Deeper, Act Faster. Official Qwen Blog Post Introducing Qwen3. 2025. Available online: https://qwenlm.github.io/blog/qwen3/ (accessed on 1 March 2026).
OpenAI. Safety & Responsibility. OpenAI Safety Overview and Governance Documentation. 2026. Available online: https://openai.com/safety/ (accessed on 1 March 2026).
Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46. [Google Scholar] [CrossRef]
Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76, 378. [Google Scholar] [CrossRef]
Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174. [Google Scholar] [CrossRef] [PubMed]

Figure 1. Inter-rater agreement results. (a) Human vs. Qwen. (b) Human vs. ChatGPT. (c) Human vs. Gemini. (d) Agreement across ALL LLMs.

Table 1. Summary of studies on LLM annotation capabilities and comparisons.

Study	Domain	Source	Task	Annotators	Key Findings
Ghatora et al. (2024) [20]	Product reviews	Real	Sentiment analysis	ChatGPT-4; Machine learning-based classifiers (Random Forest, Naive Bayes, and Support Vector Machine)	GPT-4 shows comparable performance to traditional ML models but does not consistently outperform them.
Falatouri et al. (2024) [3]	Apps	Real reviews of 2 mobile apps (English & Persian)	Service quality assessment; Sentiment analysis	ChatGPT 3.5; Claude 3; NLP methods: TextBlob, VADER, Transformers; Human	Both LLMs > NLP methods; ChatGPT close to human annotators.
Voutsa et al. (2025) [16]	Hotel reviews	Real & synthetic Booking reviews	Sentiment analysis; Topic modelling; Aspect-based-sentiment analysis	ChatGPT-3.5; ChatGPT-4; Human	High intra-family LLM agreement; manual > batch mode; LLM biases regarding neutrality and repetition.
Voutsa et al. (2025) [17]	Hotel reviews	Real & synthetic Booking reviews	Comparison of cross-family LLM annotation behavior	ChatGPT-4; Qwen-3; Human	ChatGPT-4 tends to be more neutral, whereas Qwen aligns more closely with human judgments.
Nasution & Onan (2024) [26]	Tweets	Real (low-resource Multilingual datasets)	Sentiment analysis; Topic modeling	ChatGPT-4; BERT; RoBERTa; T5; Human	LLMs similar to Humans in sentiment analysis; Humans > LLMs on topic modeling and emotion classification.
Mohta et al. (2023) [15]	General LLM tasks	Synthetic/Real datasets	plethora of tasks (e.g., movie genre, hate speech)	Llama2; Vicuna-v1.5; Humans	Vicuna > Llama; Humans > LLMs.
Giorgi et al. (2025) [22]	Social media posts	Real (Twitter/social media)	Hate speech labeling	Llama3; Phi3; Solar; Starling; Humans	LLMs less annotator biases than humans.
Anghel et al. (2025) [23]	Educational assessment	Real (university exam responses)	Rubric-based scoring of answers & explanations	GPT-4; Llama-3.1; Mistral-LoRA; Human (2)	Fine-tuning improves rubric alignment; GPT-4 strongest human correlation for technical answers; LLM-human agreement moderate for answers, weaker for explanations.
Anghel et al. (2025) [24]	Educational assessment	Real (university exam responses)	Rubric-based grading of answers & explanations	Mistral-7B; Gemma-7B; Zephyr-7B; OpenHermes; LLaMA3-Instruct; Humans	Domain-dependent Human-LLM alignment (Machine Learning tighter than Computer Networks); LLM judges show weaker alignment to human mean than human-human.
Current Study	Hotel reviews	Real & synthetic Booking reviews	Cross-family LLM evaluation	Gemini-3-flash; Qwen-3; ChatGPT-5; Human	Gemini demonstrated the most consistent, high-fidelity alignment with human annotations across both datasets; ChatGPT exhibited systematic neutrality bias and marked performance degradation on synthetic data; Qwen showed stable performance but failed on neutral sentiment in synthetic contexts. Reliability was highest for concrete, objectively verifiable attributes (e.g., Breakfast, Location) and lowest for ambiguous constructs (e.g., Generic, neutral sentiment, aspect).

Table 2. Key characteristics of GPT-5, Qwen-3, and Gemini 3 Flash.

Dimension	GPT-5 (OpenAI)	Qwen-3 (Alibaba)	Gemini 3 Flash (Google)
Model type & licensing	Proprietary flagship reasoning model; closed weights; access via ChatGPT and OpenAI API [27,28]	Open-weight LLM family with dense and MoE variants; Apache 2.0 license; local deployment supported [29,30]	Proprietary natively multimodal reasoning model; closed weights; access via Gemini API/Vertex AI [31,32]
Training data & transparency	Broad categories disclosed (public web, partner data, user/trainer/researcher data), but no detailed dataset inventory or token counts [27]	More transparent than the others; technical report describes multilingual and synthetic data mix and reports ∼36T training tokens [30]	Large-scale multimodal data described at category level (web, code, images, audio, video, licensed and synthetic data), but without full corpus breakdown [31,33]
Context window & architecture	400 K context window, 128 K max output; unified routed system with fast and deeper-reasoning components, but architecture details remain undisclosed [27,28]	Dense and MoE family; up to 256 K native context in official releases, with some variants extendable further; architecture disclosed more explicitly than proprietary peers [29,30]	Up to 1M input context and 64 K output; based on Gemini 3 Pro architecture; sparse MoE and natively multimodal design [31,33,34]
Reproducibility & research access	API-based access only; no open weights or full training recipe; limited independent reproducibility [28]	Strongest reproducibility profile of the three: open weights, public model cards, GitHub access, and independent benchmarking possible [29,30]	API-only access through Google services; no open weights; restricted independent replication [31,32]
Multimodal capabilities	Text and image input with text output in the API; strong vision support, but no native audio/video output on the GPT-5 model page [28,35]	Core Qwen-3 is primarily an LLM family, while the broader Qwen ecosystem includes multimodal variants such as Qwen3-Omni [29,36]	Native multimodal input across text, image, audio, video, and code, with text output [31,34]
Computational efficiency	Efficiency improved through configurable reasoning effort and smaller mini/nano variants, but no public disclosure of training compute or active parameters [28,35]	Broad size range plus MoE design improves deployment flexibility; Alibaba reports strong efficiency gains from sparse activation compared with dense predecessors [29,37]	Designed for lower latency and lower cost than larger Gemini models; thinking levels allow explicit quality/latency/cost trade-offs [31,34]
Bias evaluation & ethical governance	Extensive public safety documentation, including fairness/bias discussion, preparedness framework, and large-scale red-teaming [27,38]	Public governance documentation is lighter than for OpenAI/Google; alignment and safety are discussed, but bias/governance reporting is less extensive [30,37]	Detailed model-card governance with Google AI Principles, multimodal safety evaluations, red-teaming, and Frontier Safety Framework alignment [31,33]

Table 3. Frequency distribution of sentiment, topic, and aspect annotations by Human Annotator and AI models across real and synthetic datasets.

Variable	Human Annotator		Qwen		Gemini		ChatGPT
Variable	N	%	N	%	N	%	N	%
Real Dataset
Sentiment: positive	11,819	51.1	11,098	48.0	11,518 *	49.8	10,568	45.7
Sentiment: negative	10,504	45.4	10,198	44.1	10,346 *	44.8	6637	28.7
Sentiment: neutral	794	3.4	1816	7.9	1077 *	4.7	5908	25.6
Clean	2927	12.7	2784	12.0	3007 *	13.0	2570	11.1
Comfort	2562	11.1	2663	11.5	1282 *	5.5	3700	16.0
Facilities/Amenities	2923	12.6	2803	12.1	2627 *	11.4	1921	8.3
Location	4729	20.5	4992	21.6	4796 *	20.8	4668	20.2
Restaurant	300	1.3	479	2.1	315 *	1.4
Staff	3804	16.5	3750	16.2	3641 *	15.8	3039	13.1
View	784	3.4	1106	4.8	809 *	3.5	690	3.0
Breakfast	2539	11.0	2598	11.2	2545 *	11.0	2402	10.4
Room	5004	21.7	8162	35.3	6177 *	26.7	6097	26.4
Pool	472	2.0	444	1.9	496 *	2.1	479	2.1
Beach	188	0.8	262	1.1	154 *	0.7	147	0.6
Bathroom/Shower (toilet)	2346	10.2	2582	11.2	2462 *	10.7	2237	9.7
Bar	322	1.4	481	2.1	344 *	1.5	447	1.9
Bed	1480	6.4	1668	7.2	1578 *	6.8	1190	5.1
Parking	541	2.3	987	4.3	561 *	2.4	574	2.5
Noise	1604	6.9	1729	7.5	1654 *	7.2	1361	5.9
Reception	953	4.1	1410	6.1	1394 *	6.0	1035	4.5
Lift	291	1.3	349	1.5	294 *	1.3	230	1.0
Value for money	892	3.9	1424	6.2	867 *	3.8	906	3.9
Wi-Fi	268	1.2	639	2.8	279 *	1.2	293	1.3
Generic	1973	8.5	1695	7.3	636 *	2.8	1596	6.9
Aspect	2884	12.5	4690	20.3	6100 *	26.4	10,412	45.0
Synthetic Dataset
Sentiment: positive	149	74.9	148	74.4	147 **	73.9	45	22.6
Sentiment: negative	47	23.6	48	24.1	44 **	22.1	17	8.5
Sentiment: neutral	3	1.5	3	1.5	5 **	2.5	137	68.8
Clean	16	8.0	15	7.5	17 **	8.5	10	5.0
Comfort	12	6.0	29	14.6	10 **	5.0	6	3.0
Facilities/Amenities	12	6.0	16	8.0	10 **	5.0	N/A	N/A
Location	14	7.0	15	7.5	14 **	7.0	6	3.0
Restaurant	8	4.0	9	4.5	10 **	5.0	N/A	N/A
Staff	13	6.5	15	7.5	14 **	7.0	10	5.0
View	9	4.5	11	5.5	11 **	5.5	N/A	N/A
Breakfast	13	6.5	15	7.5	13 **	6.5	6	3.0
Room	12	6.0	25	12.6	17 **	8.5	24	12.1
Pool	8	4.0	9	4.5	8 **	4.0	N/A	N/A
Beach	11	5.5	9	4.5	9 **	4.5	N/A	N/A
Bathroom/Shower (toilet)	10	5.0	11	5.5	11 **	5.5	3	1.5
Bar	11	5.5	10	5.0	10 **	5.0	N/A	N/A
Bed	11	5.5	14	7.0	14 **	7.0	6	3.0
Parking	9	4.5	9	4.5	9 **	4.5	9	4.5
Noise	12	6.0	14	7.0	13 **	6.5	3	1.5
Reception	13	6.5	12	6.0	12 **	6.0	N/A	N/A
Lift	10	5.0	10	5.0	10 **	5.0	N/A	N/A
Value for money	9	4.5	13	6.5	10 **	5.0	5	2.5
Wi-Fi	10	5.0	10	5.0	10 **	5.0	5	2.5
Generic	13	6.5	22	11.1	10 **	5.0	106	53.3
Aspect	8	4.0	12	6.0	22 **	11.1	70	35.2

Note: * N = 172 (0.7%); ** N = 3 (1.5%) missing values; N/A: not applicable.

Table 4. Descriptive statistics of inter-rater reliability (

κ

) between Human and AI models across Real and Synthetic datasets.

Table 4. Descriptive statistics of inter-rater reliability (

κ

) between Human and AI models across Real and Synthetic datasets.

Dataset	Model	M	SD	Range	n
real	ChatGPT	0.683	0.268	[0.004, 0.950]	24
real	Gemini	0.810	0.196	[0.337, 0.974]	25
real	Qwen	0.701	0.194	[0.296, 0.948]	25
synthetic	ChatGPT	0.442	0.279	[−0.001, 1.000]	18
synthetic	Gemini	0.849	0.186	[0.235, 1.000]	25
synthetic	Qwen	0.799	0.228	[−0.015, 1.000]	25

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Djouvas, C.; Andreou, C.; Voutsa, M.C.; Tsapatsoulis, N. Model-Contingent Polarity Bias in Large Language Model Annotation: Implications for Semantic Multimedia Personalization. Computers 2026, 15, 262. https://doi.org/10.3390/computers15050262

AMA Style

Djouvas C, Andreou C, Voutsa MC, Tsapatsoulis N. Model-Contingent Polarity Bias in Large Language Model Annotation: Implications for Semantic Multimedia Personalization. Computers. 2026; 15(5):262. https://doi.org/10.3390/computers15050262

Chicago/Turabian Style

Djouvas, Constantinos, Christiana Andreou, Maria C. Voutsa, and Nicolas Tsapatsoulis. 2026. "Model-Contingent Polarity Bias in Large Language Model Annotation: Implications for Semantic Multimedia Personalization" Computers 15, no. 5: 262. https://doi.org/10.3390/computers15050262

APA Style

Djouvas, C., Andreou, C., Voutsa, M. C., & Tsapatsoulis, N. (2026). Model-Contingent Polarity Bias in Large Language Model Annotation: Implications for Semantic Multimedia Personalization. Computers, 15(5), 262. https://doi.org/10.3390/computers15050262

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Model-Contingent Polarity Bias in Large Language Model Annotation: Implications for Semantic Multimedia Personalization

Abstract

1. Introduction

2. Related Work

2.1. Large Language Models in Semantic Multimedia Analysis

2.2. Annotation Bias

2.3. Model-Specific Bias and Cross-Model Divergence

2.4. Synthetic Data and Bias Amplification

2.5. Research Gap and Positioning

3. Data and Methods

3.1. Dataset

3.2. LLM-Based Annotation Framework

3.3. Cross-Model Comparative Analysis

4. Results

4.1. Descriptive Statistics

4.2. Inter-Rater Reliability

4.2.1. Inter-Rater Reliability in Sentiment Classification

4.2.2. Inter-Rater Reliability in Aspect-Based Annotation

4.2.3. Inter-Rater Reliability in Topic Annotation

4.2.4. Missing Data Patterns

4.2.5. Real Versus Synthetic Dataset

4.2.6. Model-Specific Performance Profiles

5. Discussion & Implications

5.1. Architectural Determinants of Annotation Bias

5.2. Human–LLM Alignment and Domain Specificity

5.3. The Synthetic Data Paradox

5.4. Implications for Semantic Multimedia Systems

5.5. Limitations & Future Research

6. Conclusions

Author Contributions

Funding

Institutional Review Board Statement

Data Availability Statement

Acknowledgments

Conflicts of Interest

Abbreviations

Appendix A

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI