Model-Contingent Polarity Bias in Large Language Model Annotation: Implications for Semantic Multimedia Personalization
Abstract
1. Introduction
- it documents model-contingent polarity bias, revealing a systematic divergence in polarity assignment between LLM families with implications for personalization reliability, echoing findings on architectural influence in multi-modal systems [10];
- it demonstrates how the type of dataset (real vs. synthetic) moderates human–LLM agreement, reinforcing concerns regarding the effects of data composition observed in cross-regional studies [11];
- it proposes actionable, bias-aware strategies for deploying LLMs in semantic multimedia pipelines, emphasizing cross-architectural triangulation and hybrid human–AI workflows to mitigate design-induced biases [6].
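
The cross-architectural triangulation and hybrid human–AI workflow named in the last contribution can be illustrated with a minimal sketch. It assumes one simple realization: a majority vote across model families, with disagreements routed to a human annotator. The function name `triangulate`, the `min_votes` threshold, and the example labels are hypothetical and are not taken from the study.

```python
from collections import Counter
from typing import Optional


def triangulate(labels: dict[str, str], min_votes: int = 2) -> Optional[str]:
    """Return the majority label across models, or None to signal human review.

    `labels` maps a model name to the label it assigned to one review.
    Assumes at least one label is supplied.
    """
    label, votes = Counter(labels.values()).most_common(1)[0]
    return label if votes >= min_votes else None


# Hypothetical labels for single reviews (illustrative only).
agreed = triangulate({"qwen": "positive", "gemini": "positive", "chatgpt": "neutral"})
split = triangulate({"qwen": "positive", "gemini": "negative", "chatgpt": "neutral"})

print(agreed)  # positive
print(split)   # None
```

A `None` result marks the item for manual annotation, so a single model's polarity tendency cannot decide the label on its own.
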
2. Related Work
2.1. Large Language Models in Semantic Multimedia Analysis
2.2. Annotation Bias
2.3. Model-Specific Bias and Cross-Model Divergence
2.4. Synthetic Data and Bias Amplification
2.5. Research Gap and Positioning
3. Data and Methods
3.1. Dataset
3.2. LLM-Based Annotation Framework
3.3. Cross-Model Comparative Analysis
4. Results
4.1. Descriptive Statistics
4.2. Inter-Rater Reliability
4.2.1. Inter-Rater Reliability in Sentiment Classification
4.2.2. Inter-Rater Reliability in Aspect-Based Annotation
4.2.3. Inter-Rater Reliability in Topic Annotation
4.2.4. Missing Data Patterns
4.2.5. Real Versus Synthetic Dataset
4.2.6. Model-Specific Performance Profiles
5. Discussion & Implications
5.1. Architectural Determinants of Annotation Bias
5.2. Human–LLM Alignment and Domain Specificity
5.3. The Synthetic Data Paradox
5.4. Implications for Semantic Multimedia Systems
5.5. Limitations & Future Research
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| AI | Artificial Intelligence |
| API | Application Programming Interface |
| LLM | Large Language Model |
| MoE | Mixture of Experts |
| UGC | User-Generated Content |
Appendix A
| Dataset | Variable | Human vs. Qwen: κ (SE) | Human vs. Qwen: 95% CI | Human vs. Qwen: Interp. | Human vs. Gemini: κ (SE) | Human vs. Gemini: 95% CI | Human vs. Gemini: Interp. | Human vs. ChatGPT: κ (SE) | Human vs. ChatGPT: 95% CI | Human vs. ChatGPT: Interp. |
|---|---|---|---|---|---|---|---|---|---|---|
| real | Sentiment: positive | 0.906 (0.003) | [0.900, 0.912] | AP | 0.940 (0.002) | [0.936, 0.944] | AP | 0.707 (0.005) | [0.697, 0.717] | S |
| real | Sentiment: negative | 0.908 (0.003) | [0.902, 0.914] | AP | 0.935 (0.002) | [0.931, 0.939] | AP | 0.555 (0.005) | [0.545, 0.565] | M |
| real | Sentiment: neutral | 0.299 (0.012) | [0.275, 0.323] | F | 0.403 (0.015) | [0.374, 0.432] | M | 0.066 (0.005) | [0.056, 0.076] | Sl |
| real | Aspect | 0.296 (0.008) | [0.280, 0.312] | F | 0.457 (0.007) | [0.443, 0.471] | M | 0.209 (0.005) | [0.199, 0.219] | F |
| real | Clean | 0.786 (0.006) | [0.774, 0.798] | S | 0.937 (0.003) | [0.931, 0.943] | AP | 0.818 (0.006) | [0.806, 0.830] | AP |
| real | Comfort | 0.429 (0.009) | [0.411, 0.447] | M | 0.459 (0.010) | [0.439, 0.479] | M | 0.646 (0.007) | [0.632, 0.660] | S |
| real | Facilities/Amenities | 0.488 (0.009) | [0.470, 0.506] | M | 0.783 (0.006) | [0.771, 0.795] | S | 0.672 (0.009) | [0.654, 0.690] | S |
| real | Location | 0.921 (0.003) | [0.915, 0.927] | AP | 0.945 (0.003) | [0.939, 0.951] | AP | 0.856 (0.004) | [0.848, 0.864] | AP |
| real | Restaurant (dinner) | 0.582 (0.021) | [0.541, 0.623] | M | 0.712 (0.021) | [0.671, 0.753] | S | N/A | N/A | – |
| real | Staff | 0.916 (0.004) | [0.908, 0.924] | AP | 0.925 (0.003) | [0.919, 0.931] | AP | 0.819 (0.005) | [0.809, 0.829] | AP |
| real | View (Balcony) | 0.740 (0.011) | [0.718, 0.762] | S | 0.872 (0.009) | [0.854, 0.890] | AP | 0.844 (0.010) | [0.824, 0.864] | AP |
| real | Breakfast | 0.948 (0.003) | [0.942, 0.954] | AP | 0.974 (0.002) | [0.970, 0.978] | AP | 0.950 (0.003) | [0.944, 0.956] | AP |
| real | Room | 0.610 (0.005) | [0.600, 0.620] | S | 0.737 (0.005) | [0.727, 0.747] | S | 0.705 (0.005) | [0.695, 0.715] | S |
| real | Pool | 0.831 (0.014) | [0.804, 0.858] | AP | 0.970 (0.006) | [0.958, 0.982] | AP | 0.982 (0.004) | [0.974, 0.990] | AP |
| real | Beach | 0.704 (0.025) | [0.655, 0.753] | S | 0.885 (0.018) | [0.850, 0.920] | AP | 0.853 (0.021) | [0.812, 0.894] | AP |
| real | Bathroom/Shower | 0.910 (0.004) | [0.902, 0.918] | AP | 0.935 (0.004) | [0.927, 0.943] | AP | 0.893 (0.005) | [0.883, 0.903] | AP |
| real | Bar | 0.647 (0.020) | [0.608, 0.686] | S | 0.781 (0.018) | [0.746, 0.816] | S | 0.740 (0.018) | [0.705, 0.775] | S |
| real | Bed | 0.862 (0.007) | [0.848, 0.876] | AP | 0.929 (0.005) | [0.919, 0.939] | AP | 0.770 (0.009) | [0.752, 0.788] | S |
| real | Parking | 0.687 (0.014) | [0.660, 0.714] | S | 0.964 (0.006) | [0.952, 0.976] | AP | 0.946 (0.007) | [0.932, 0.960] | AP |
| real | Noise | 0.857 (0.007) | [0.843, 0.871] | AP | 0.944 (0.004) | [0.936, 0.952] | AP | 0.847 (0.007) | [0.833, 0.861] | AP |
| real | Reception-checkin | 0.690 (0.011) | [0.668, 0.712] | S | 0.734 (0.010) | [0.714, 0.754] | S | 0.696 (0.012) | [0.672, 0.720] | S |
| real | Lift | 0.813 (0.017) | [0.780, 0.846] | AP | 0.944 (0.010) | [0.924, 0.964] | AP | 0.808 (0.019) | [0.771, 0.845] | AP |
| real | Value for money | 0.602 (0.012) | [0.578, 0.626] | S | 0.769 (0.011) | [0.747, 0.791] | S | 0.729 (0.012) | [0.705, 0.753] | S |
| real | Wi-Fi | 0.580 (0.020) | [0.541, 0.619] | M | 0.972 (0.007) | [0.958, 0.986] | AP | 0.944 (0.010) | [0.924, 0.964] | AP |
| real | Generic | 0.502 (0.011) | [0.480, 0.524] | M | 0.337 (0.012) | [0.313, 0.361] | F | 0.316 (0.011) | [0.294, 0.338] | F |
| synthetic | Sentiment: positive | 0.907 (0.034) | [0.840, 0.974] | AP | 0.932 (0.030) | [0.873, 0.991] | AP | 0.147 (0.033) | [0.082, 0.212] | Sl |
| synthetic | Sentiment: negative | 0.931 (0.031) | [0.870, 0.992] | AP | 0.928 (0.032) | [0.865, 0.991] | AP | 0.393 (0.078) | [0.240, 0.546] | F |
| synthetic | Sentiment: neutral | −0.015 (0.006) | [−0.027, −0.003] | Po | 0.235 (0.204) | [−0.165, 0.635] | F | −0.001 (0.012) | [−0.025, 0.023] | Po |
| synthetic | Aspect | 0.790 (0.102) | [0.590, 0.990] | S | 0.504 (0.111) | [0.286, 0.722] | M | 0.143 (0.047) | [0.051, 0.235] | Sl |
| synthetic | Clean | 0.825 (0.076) | [0.676, 0.974] | AP | 0.901 (0.057) | [0.789, 1.000] | AP | 0.508 (0.123) | [0.267, 0.749] | M |
| synthetic | Comfort | 0.440 (0.098) | [0.248, 0.632] | M | 0.711 (0.112) | [0.492, 0.930] | S | 0.421 (0.148) | [0.131, 0.711] | M |
| synthetic | Facilities/Amenities | 0.616 (0.111) | [0.398, 0.834] | S | 0.615 (0.125) | [0.370, 0.860] | S | N/A | N/A | – |
| synthetic | Location | 0.888 (0.064) | [0.763, 1.000] | AP | 0.881 (0.068) | [0.748, 1.000] | AP | 0.582 (0.131) | [0.325, 0.839] | M |
| synthetic | Restaurant (dinner) | 0.816 (0.104) | [0.612, 1.000] | AP | 0.884 (0.081) | [0.725, 1.000] | AP | N/A | N/A | – |
| synthetic | Staff | 0.770 (0.091) | [0.592, 0.948] | S | 0.801 (0.086) | [0.632, 0.970] | S | 0.770 (0.099) | [0.576, 0.964] | S |
| synthetic | View (Balcony) | 0.790 (0.102) | [0.590, 0.990] | S | 0.895 (0.074) | [0.750, 1.000] | AP | N/A | N/A | – |
| synthetic | Breakfast | 0.923 (0.054) | [0.817, 1.000] | AP | 1.000 (0.000) | [1.000, 1.000] | AP | 0.616 (0.132) | [0.357, 0.875] | S |
| synthetic | Room | 0.617 (0.095) | [0.431, 0.803] | S | 0.814 (0.081) | [0.655, 0.973] | AP | 0.638 (0.095) | [0.452, 0.824] | S |
| synthetic | Pool | 0.939 (0.061) | [0.819, 1.000] | AP | 1.000 (0.000) | [1.000, 1.000] | AP | N/A | N/A | – |
| synthetic | Beach | 0.895 (0.074) | [0.750, 1.000] | AP | 0.945 (0.055) | [0.837, 1.000] | AP | N/A | N/A | – |
| synthetic | Bathroom/Shower | 0.950 (0.050) | [0.852, 1.000] | AP | 0.950 (0.050) | [0.852, 1.000] | AP | 0.449 (0.171) | [0.114, 0.784] | M |
| synthetic | Bar | 0.950 (0.050) | [0.852, 1.000] | AP | 0.950 (0.050) | [0.852, 1.000] | AP | N/A | N/A | – |
| synthetic | Bed | 0.872 (0.073) | [0.729, 1.000] | AP | 0.872 (0.073) | [0.729, 1.000] | AP | 0.449 (0.153) | [0.149, 0.749] | M |
| synthetic | Parking | 1.000 (0.000) | [1.000, 1.000] | AP | 1.000 (0.000) | [1.000, 1.000] | AP | 1.000 (0.000) | [1.000, 1.000] | AP |
| synthetic | Noise | 0.835 (0.081) | [0.676, 0.994] | AP | 0.911 (0.062) | [0.789, 1.000] | AP | 0.385 (0.158) | [0.075, 0.695] | F |
| synthetic | Reception-checkin | 0.957 (0.043) | [0.873, 1.000] | AP | 0.957 (0.043) | [0.873, 1.000] | AP | N/A | N/A | – |
| synthetic | Lift | 1.000 (0.000) | [1.000, 1.000] | AP | 1.000 (0.000) | [1.000, 1.000] | AP | N/A | N/A | – |
| synthetic | Value for money | 0.808 (0.093) | [0.626, 0.990] | AP | 0.945 (0.055) | [0.837, 1.000] | AP | 0.705 (0.140) | [0.431, 0.979] | S |
| synthetic | Wi-Fi | 1.000 (0.000) | [1.000, 1.000] | AP | 1.000 (0.000) | [1.000, 1.000] | AP | 0.655 (0.143) | [0.375, 0.935] | S |
| synthetic | Generic | 0.471 (0.108) | [0.259, 0.683] | M | 0.585 (0.125) | [0.340, 0.830] | M | 0.097 (0.032) | [0.034, 0.160] | Sl |
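
The intervals in the table above are consistent with a normal approximation, κ ± 1.96·SE, with the upper bound capped at 1. The sketch below reproduces that arithmetic and attaches an agreement label. The label bands follow the commonly cited Landis and Koch cut-offs and are an assumption here; the abbreviations are read as Po (poor), Sl (slight), F (fair), M (moderate), S (substantial), and AP (almost perfect). The table's treatment of values near a band boundary may differ from this sketch.

```python
def kappa_ci(kappa: float, se: float, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval for kappa, upper bound capped at 1."""
    lower = kappa - z * se
    upper = min(1.0, kappa + z * se)
    return round(lower, 3), round(upper, 3)


def landis_koch(kappa: float) -> str:
    """Map kappa to an agreement label (assumed Landis and Koch bands)."""
    if kappa < 0:
        return "Po"
    for upper, label in [(0.20, "Sl"), (0.40, "F"), (0.60, "M"), (0.80, "S")]:
        if kappa <= upper:
            return label
    return "AP"


# Values taken from the rows "real / Sentiment: positive / Qwen"
# and "synthetic / Location / Qwen".
print(kappa_ci(0.906, 0.003))  # (0.9, 0.912)
print(kappa_ci(0.888, 0.064))  # (0.763, 1.0)
print(landis_koch(0.906))      # AP
```
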
| Variable | Dataset | κ (SE) | 95% CI | z | p | Interp. |
|---|---|---|---|---|---|---|
| Sentiment: positive | real | 0.792 (0.004) | [0.784, 0.800] | 207.798 | <0.001 | S |
| Sentiment: positive | synthetic | 0.248 (0.041) | [0.168, 0.328] | 6.025 | <0.001 | F |
| Sentiment: negative | real | 0.684 (0.004) | [0.676, 0.692] | 179.452 | <0.001 | S |
| Sentiment: negative | synthetic | 0.572 (0.041) | [0.492, 0.652] | 13.871 | <0.001 | M |
| Sentiment: neutral | real | 0.166 (0.004) | [0.158, 0.174] | 43.639 | <0.001 | Sl |
| Sentiment: neutral | synthetic | −0.287 (0.041) | [−0.367, −0.207] | −6.970 | <0.001 | Po |
| Aspect | real | 0.341 (0.004) | [0.333, 0.349] | 89.389 | <0.001 | F |
| Aspect | synthetic | 0.194 (0.041) | [0.114, 0.274] | 4.704 | <0.001 | Sl |
| Clean | real | 0.773 (0.004) | [0.765, 0.781] | 202.799 | <0.001 | S |
| Clean | synthetic | 0.692 (0.041) | [0.612, 0.772] | 16.788 | <0.001 | S |
| Comfort | real | 0.456 (0.004) | [0.448, 0.464] | 119.608 | <0.001 | M |
| Comfort | synthetic | 0.386 (0.041) | [0.306, 0.466] | 9.357 | <0.001 | F |
| Facilities/Amenities | real | 0.512 (0.004) | [0.504, 0.520] | 134.195 | <0.001 | M |
| Facilities/Amenities | synthetic | 0.373 (0.041) | [0.293, 0.453] | 9.053 | <0.001 | F |
| Location | real | 0.888 (0.004) | [0.880, 0.896] | 233.068 | <0.001 | AP |
| Location | synthetic | 0.750 (0.041) | [0.670, 0.830] | 18.193 | <0.001 | S |
| Restaurant | real | 0.297 (0.004) | [0.289, 0.305] | 78.037 | <0.001 | F |
| Restaurant | synthetic | 0.456 (0.041) | [0.376, 0.536] | 11.060 | <0.001 | M |
| Staff | real | 0.868 (0.004) | [0.860, 0.876] | 227.621 | <0.001 | AP |
| Staff | synthetic | 0.808 (0.041) | [0.728, 0.888] | 19.587 | <0.001 | AP |
| View | real | 0.790 (0.004) | [0.782, 0.798] | 207.179 | <0.001 | S |
| View | synthetic | 0.433 (0.041) | [0.353, 0.513] | 10.508 | <0.001 | M |
| Breakfast | real | 0.954 (0.004) | [0.946, 0.962] | 250.192 | <0.001 | AP |
| Breakfast | synthetic | 0.719 (0.041) | [0.639, 0.799] | 17.436 | <0.001 | S |
| Room | real | 0.769 (0.004) | [0.761, 0.777] | 201.706 | <0.001 | S |
| Room | synthetic | 0.710 (0.041) | [0.630, 0.790] | 17.213 | <0.001 | S |
| Pool | real | 0.870 (0.004) | [0.862, 0.878] | 228.287 | <0.001 | AP |
| Pool | synthetic | 0.486 (0.041) | [0.406, 0.566] | 11.785 | <0.001 | M |
| Beach | real | 0.745 (0.004) | [0.737, 0.753] | 195.388 | <0.001 | S |
| Beach | synthetic | 0.484 (0.041) | [0.404, 0.564] | 11.741 | <0.001 | M |
| Bathroom/Shower | real | 0.922 (0.004) | [0.914, 0.930] | 241.906 | <0.001 | AP |
| Bathroom/Shower | synthetic | 0.666 (0.041) | [0.586, 0.746] | 16.145 | <0.001 | S |
| Bar | real | 0.684 (0.004) | [0.676, 0.692] | 179.438 | <0.001 | S |
| Bar | synthetic | 0.482 (0.041) | [0.402, 0.562] | 11.697 | <0.001 | M |
| Bed | real | 0.820 (0.004) | [0.812, 0.828] | 215.218 | <0.001 | AP |
| Bed | synthetic | 0.657 (0.041) | [0.577, 0.737] | 15.922 | <0.001 | S |
| Parking | real | 0.760 (0.004) | [0.752, 0.768] | 199.429 | <0.001 | S |
| Parking | synthetic | 1.000 (0.041) | [0.920, 1.000] † | 24.249 | <0.001 | AP |
| Noise | real | 0.833 (0.004) | [0.825, 0.841] | 218.459 | <0.001 | AP |
| Noise | synthetic | 0.601 (0.041) | [0.521, 0.681] | 14.574 | <0.001 | S |
| Reception | real | 0.790 (0.004) | [0.782, 0.798] | 207.220 | <0.001 | S |
| Reception | synthetic | 0.479 (0.041) | [0.399, 0.559] | 11.608 | <0.001 | M |
| Lift | real | 0.778 (0.004) | [0.770, 0.786] | 204.178 | <0.001 | S |
| Lift | synthetic | 0.482 (0.041) | [0.402, 0.562] | 11.697 | <0.001 | M |
| Value for money | real | 0.684 (0.004) | [0.676, 0.692] | 179.525 | <0.001 | S |
| Value for money | synthetic | 0.700 (0.041) | [0.620, 0.780] | 16.974 | <0.001 | S |
| Wi-Fi | real | 0.684 (0.004) | [0.676, 0.692] | 179.425 | <0.001 | S |
| Wi-Fi | synthetic | 0.791 (0.041) | [0.711, 0.871] | 19.184 | <0.001 | S |
| Generic | real | 0.338 (0.004) | [0.330, 0.346] | 88.717 | <0.001 | F |
| Generic | synthetic | 0.019 (0.041) | [−0.061, 0.099] | 0.467 | 0.640 | Sl |
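
The z and p columns above appear to follow the usual large-sample test of κ against zero: z is κ divided by its standard error, and p is the two-sided tail probability of the standard normal distribution. The sketch below assumes exactly that. Because it starts from the rounded SE printed in the table, its z values differ slightly from the printed ones.

```python
import math


def z_and_p(kappa: float, se: float) -> tuple[float, float]:
    """Return the z statistic and two-sided normal p-value for H0: kappa = 0.

    Assumes se > 0; rows with a zero standard error cannot be tested this way.
    """
    z = kappa / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided tail of N(0, 1)
    return z, p


# Rounded inputs from the rows "Generic / synthetic" and "Sentiment: positive / synthetic".
z_generic, p_generic = z_and_p(0.019, 0.041)
z_positive, p_positive = z_and_p(0.248, 0.041)

print(f"{z_generic:.3f} {p_generic:.3f}")
print(f"{z_positive:.3f} {p_positive:.2e}")
```

Only the synthetic "Generic" row fails to reach significance under this test, which matches the p = 0.640 reported in the table.
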
References
- Baier, D.; Decker, R.; Asenova, Y. Collecting and analyzing user-generated content for decision support in marketing management: An overview of methods and use cases. Schmalenbach J. Bus. Res. 2025, 77, 419–455.
- Tan, Z.; Jiang, M. User modeling in the era of large language models: Current research and future directions. arXiv 2023, arXiv:2312.11518.
- Falatouri, T.; Hrušecká, D.; Fischer, T. Harnessing the power of LLMs for service quality assessment from user-generated content. IEEE Access 2024, 12, 99755–99767.
- Wei, W.; Hao, C.; Wang, Z. User needs insights from UGC based on large language model. Adv. Eng. Inform. 2025, 65, 103268.
- Intana, A.; Tantayakul, K.; Tanthavanich, W.; Chumchuay, W. An Ontology-Driven Framework for Personalised Context-Aware Running Event Recommendations. Computers 2026, 15, 195.
- Sanghyun, C.; Kim, M.; Hye-Lynn, K.; Jung-Hun, L.; Hyuk-Chul, K.; Soo-Jong, L. Cross-Lingual Adaptation for Multilingual Table Question Answering and Comparative Evaluation with Large Language Models. Computers 2026, 15, 92.
- Kozinets, R.V.; Gretzel, U. Commentary: Artificial intelligence: The marketer’s dilemma. J. Mark. 2021, 85, 156–159.
- Milwood, P.A.; Hartman-Caverly, S.; Roehl, W.S. A scoping study of ethics in artificial intelligence research in tourism and hospitality. In Proceedings of the ENTER22 e-Tourism Conference; Springer: Berlin/Heidelberg, Germany, 2023; pp. 243–254.
- Jia, S.J.; Chi, O.H.; Chi, C.G. Unpacking the impact of AI vs. human-generated review summary on hotel booking intentions. Int. J. Hosp. Manag. 2025, 126, 104030.
- Mostert, W.; Kurien, A.; Djouani, K. Multi-Modal Emotion Detection and Tracking System Using AI Techniques. Computers 2025, 14, 441.
- Miah, M.R.; Akter, L.; Ahmed, A.A.; Ngamassi, L.; Ramakrishnan, T. An ML-Based Approach to Leveraging Social Media for Disaster Type Classification and Analysis Across World Regions. Computers 2026, 15, 16.
- Bender, E.M.; Gebru, T.; McMillan-Major, A.; Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, Virtual, 3–10 March 2021; pp. 610–623.
- Kordzadeh, N.; Ghasemaghaei, M. Algorithmic bias: Review, synthesis, and future research directions. Eur. J. Inf. Syst. 2022, 31, 388–409.
- Djouvas, C.; Charalampous, A.; Christodoulou, C.J.; Tsapatsoulis, N. LLMs are not for everything: A dataset and comparative study on argument strength classification. In Proceedings of the 28th Pan-Hellenic Conference on Progress in Computing and Informatics, Athens, Greece, 13–15 December 2024; pp. 437–443.
- Mohta, J.; Ak, K.; Xu, Y.; Shen, M. Are large language models good annotators? In Proceedings on “I Can’t Believe It’s Not Better: Failure Modes in the Age of Foundation Models” at NeurIPS 2023 Workshops, New Orleans, LA, USA, 16 December 2023; pp. 38–48.
- Voutsa, M.C.; Tsapatsoulis, N.; Djouvas, C. Biased by design? Evaluating bias and behavioral diversity in LLM annotation of real-world and synthetic hotel reviews. AI 2025, 6, 178.
- Voutsa, M.C.; Tsapatsoulis, N.; Djouvas, C.; Andreou, C. Bias in the Machine: Cross-Model Evaluation of ChatGPT-4 and Qwen in Hotel Booking Review Annotation. In Proceedings of the 2025 20th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP), Mystras, Greece, 27–28 November 2025; pp. 81–85.
- Andreou, C.; Tsapatsoulis, N.; Anastasopoulou, V. A dataset of hotel reviews for aspect-based sentiment analysis and topic modeling. In Proceedings of the 2023 18th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP 2023), Limassol, Cyprus, 25–26 September 2023; pp. 1–9.
- De-Marcos, L.; Domínguez-Díaz, A. LLM-based topic modeling for dark web Q&A forums: A comparative analysis with traditional methods. IEEE Access 2025, 13, 67159–67169.
- Ghatora, P.S.; Hosseini, S.E.; Pervez, S.; Iqbal, M.J.; Shaukat, N. Sentiment analysis of product reviews using machine learning and pre-trained LLM. Big Data Cogn. Comput. 2024, 8, 199.
- Pyreddy, S.R.; Zaman, T.S. EmoXpt: Analyzing emotional variances in human comments and LLM-generated responses. In Proceedings of the 2025 IEEE 15th Annual Computing and Communication Workshop and Conference (CCWC), Las Vegas, NV, USA, 6–8 January 2025; pp. 88–94.
- Giorgi, T.; Cima, L.; Fagni, T.; Avvenuti, M.; Cresci, S. Human and LLM biases in hate speech annotations: A socio-demographic analysis of annotators and targets. In Proceedings of the International AAAI Conference on Web and Social Media, Copenhagen, Denmark, 23–26 June 2025; Volume 19, pp. 653–670.
- Anghel, C.; Craciun, M.V.; Pecheanu, E.; Cocu, A.; Anghel, A.A.; Iacobescu, P.; Maier, C.; Andrei, C.A.; Scheau, C.; Dragosloveanu, S. CourseEvalAI: Rubric-Guided Framework for Transparent and Consistent Evaluation of Large Language Models. Computers 2025, 14, 431.
- Anghel, C.; Craciun, M.V.; Anghel, A.A.; Cocu, A.; Balau, A.S.; Andrei, C.A.; Maier, C.; Dragosloveanu, S.; Nedelea, D.G.; Scheau, C. EvalCouncil: A Committee-Based LLM Framework for Reliable and Unbiased Automated Grading. Computers 2025, 14, 530.
- Li, Z.; Zhu, H.; Lu, Z.; Yin, M. Synthetic data generation with large language models for text classification: Potential and limitations. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Singapore, 6–10 December 2023; pp. 10443–10461.
- Nasution, A.H.; Onan, A. ChatGPT label: Comparing the quality of human-generated and LLM-generated annotations in low-resource language NLP tasks. IEEE Access 2024, 12, 71876–71900.
- OpenAI. GPT-5 System Card. OpenAI System Card for GPT-5. 2025. Available online: https://openai.com/index/gpt-5-system-card/ (accessed on 1 March 2026).
- OpenAI. GPT-5 Model. OpenAI API Documentation for GPT-5. 2025. Available online: https://developers.openai.com/api/docs/models/gpt-5 (accessed on 1 March 2026).
- Qwen Team and Alibaba Cloud. Qwen3. Official GitHub Repository and Model Overview for Qwen3. 2025. Available online: https://github.com/QwenLM/qwen3 (accessed on 1 March 2026).
- Yang, A.; Li, A.; Yang, B.; Zhang, B.; Hui, B.; Zheng, B.; Yu, B.; Gao, C.; Huang, C.; Lv, C.; et al. Qwen3 Technical Report. arXiv 2025.
- Google DeepMind. Gemini 3 Flash-Model Card. Google DeepMind Model Card for Gemini 3 Flash. 2025. Available online: https://deepmind.google/models/model-cards/gemini-3-flash/ (accessed on 1 March 2026).
- Google Cloud. Gemini 3 Flash|Generative AI on Vertex AI. Vertex AI Documentation for Gemini 3 Flash. 2025. Available online: https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-flash (accessed on 1 March 2026).
- Google DeepMind. Gemini 3.1 Pro-Model Card. Google DeepMind Model Card for Gemini 3.1 Pro. 2026. Available online: https://deepmind.google/models/model-cards/gemini-3-1-pro/ (accessed on 1 March 2026).
- Google. Gemini 3 Developer Guide|Gemini API. Gemini API Developer Guide for the Gemini 3 Series. 2026. Available online: https://ai.google.dev/gemini-api/docs/gemini-3 (accessed on 1 March 2026).
- OpenAI. Models|OpenAI API. OpenAI API Models Documentation. 2026. Available online: https://developers.openai.com/api/docs/models (accessed on 1 March 2026).
- Qwen Team. Qwen Documentation. Official Qwen Documentation. 2026. Available online: https://qwen.readthedocs.io/ (accessed on 1 March 2026).
- Qwen Team. Qwen3: Think Deeper, Act Faster. Official Qwen Blog Post Introducing Qwen3. 2025. Available online: https://qwenlm.github.io/blog/qwen3/ (accessed on 1 March 2026).
- OpenAI. Safety & Responsibility. OpenAI Safety Overview and Governance Documentation. 2026. Available online: https://openai.com/safety/ (accessed on 1 March 2026).
- Cohen, J. A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960, 20, 37–46.
- Fleiss, J.L. Measuring nominal scale agreement among many raters. Psychol. Bull. 1971, 76, 378.
- Landis, J.R.; Koch, G.G. The measurement of observer agreement for categorical data. Biometrics 1977, 33, 159–174.

| Study | Domain | Source | Task | Annotators | Key Findings |
|---|---|---|---|---|---|
| Ghatora et al. (2024) [20] | Product reviews | Real | Sentiment analysis | ChatGPT-4; Machine learning-based classifiers (Random Forest, Naive Bayes, and Support Vector Machine) | GPT-4 shows comparable performance to traditional ML models but does not consistently outperform them. |
| Falatouri et al. (2024) [3] | Apps | Real reviews of 2 mobile apps (English & Persian) | Service quality assessment; Sentiment analysis | ChatGPT 3.5; Claude 3; NLP methods: TextBlob, VADER, Transformers; Human | Both LLMs > NLP methods; ChatGPT close to human annotators. |
| Voutsa et al. (2025) [16] | Hotel reviews | Real & synthetic Booking reviews | Sentiment analysis; Topic modelling; Aspect-based-sentiment analysis | ChatGPT-3.5; ChatGPT-4; Human | High intra-family LLM agreement; manual > batch mode; LLM biases regarding neutrality and repetition. |
| Voutsa et al. (2025) [17] | Hotel reviews | Real & synthetic Booking reviews | Comparison of cross-family LLM annotation behavior | ChatGPT-4; Qwen-3; Human | ChatGPT-4 tends to be more neutral, whereas Qwen aligns more closely with human judgments. |
| Nasution & Onan (2024) [26] | Tweets | Real (low-resource Multilingual datasets) | Sentiment analysis; Topic modeling | ChatGPT-4; BERT; RoBERTa; T5; Human | LLMs similar to Humans in sentiment analysis; Humans > LLMs on topic modeling and emotion classification. |
| Mohta et al. (2023) [15] | General LLM tasks | Synthetic/Real datasets | Wide range of tasks (e.g., movie genre, hate speech) | Llama2; Vicuna-v1.5; Humans | Vicuna > Llama; Humans > LLMs. |
| Giorgi et al. (2025) [22] | Social media posts | Real (Twitter/social media) | Hate speech labeling | Llama3; Phi3; Solar; Starling; Humans | LLMs exhibit fewer annotator biases than humans. |
| Anghel et al. (2025) [23] | Educational assessment | Real (university exam responses) | Rubric-based scoring of answers & explanations | GPT-4; Llama-3.1; Mistral-LoRA; Human (2) | Fine-tuning improves rubric alignment; GPT-4 strongest human correlation for technical answers; LLM-human agreement moderate for answers, weaker for explanations. |
| Anghel et al. (2025) [24] | Educational assessment | Real (university exam responses) | Rubric-based grading of answers & explanations | Mistral-7B; Gemma-7B; Zephyr-7B; OpenHermes; LLaMA3-Instruct; Humans | Domain-dependent Human-LLM alignment (Machine Learning tighter than Computer Networks); LLM judges show weaker alignment to human mean than human-human. |
| Current Study | Hotel reviews | Real & synthetic Booking reviews | Cross-family LLM evaluation | Gemini-3-flash; Qwen-3; ChatGPT-5; Human | Gemini demonstrated the most consistent, high-fidelity alignment with human annotations across both datasets; ChatGPT exhibited systematic neutrality bias and marked performance degradation on synthetic data; Qwen showed stable performance but failed on neutral sentiment in synthetic contexts. Reliability was highest for concrete, objectively verifiable attributes (e.g., Breakfast, Location) and lowest for ambiguous constructs (e.g., Generic, neutral sentiment, aspect). |
| Dimension | GPT-5 (OpenAI) | Qwen-3 (Alibaba) | Gemini 3 Flash (Google) |
|---|---|---|---|
| Model type & licensing | Proprietary flagship reasoning model; closed weights; access via ChatGPT and OpenAI API [27,28] | Open-weight LLM family with dense and MoE variants; Apache 2.0 license; local deployment supported [29,30] | Proprietary natively multimodal reasoning model; closed weights; access via Gemini API/Vertex AI [31,32] |
| Training data & transparency | Broad categories disclosed (public web, partner data, user/trainer/researcher data), but no detailed dataset inventory or token counts [27] | More transparent than the others; technical report describes multilingual and synthetic data mix and reports ∼36T training tokens [30] | Large-scale multimodal data described at category level (web, code, images, audio, video, licensed and synthetic data), but without full corpus breakdown [31,33] |
| Context window & architecture | 400 K context window, 128 K max output; unified routed system with fast and deeper-reasoning components, but architecture details remain undisclosed [27,28] | Dense and MoE family; up to 256 K native context in official releases, with some variants extendable further; architecture disclosed more explicitly than proprietary peers [29,30] | Up to 1M input context and 64 K output; based on Gemini 3 Pro architecture; sparse MoE and natively multimodal design [31,33,34] |
| Reproducibility & research access | API-based access only; no open weights or full training recipe; limited independent reproducibility [28] | Strongest reproducibility profile of the three: open weights, public model cards, GitHub access, and independent benchmarking possible [29,30] | API-only access through Google services; no open weights; restricted independent replication [31,32] |
| Multimodal capabilities | Text and image input with text output in the API; strong vision support, but no native audio/video output on the GPT-5 model page [28,35] | Core Qwen-3 is primarily an LLM family, while the broader Qwen ecosystem includes multimodal variants such as Qwen3-Omni [29,36] | Native multimodal input across text, image, audio, video, and code, with text output [31,34] |
| Computational efficiency | Efficiency improved through configurable reasoning effort and smaller mini/nano variants, but no public disclosure of training compute or active parameters [28,35] | Broad size range plus MoE design improves deployment flexibility; Alibaba reports strong efficiency gains from sparse activation compared with dense predecessors [29,37] | Designed for lower latency and lower cost than larger Gemini models; thinking levels allow explicit quality/latency/cost trade-offs [31,34] |
| Bias evaluation & ethical governance | Extensive public safety documentation, including fairness/bias discussion, preparedness framework, and large-scale red-teaming [27,38] | Public governance documentation is lighter than for OpenAI/Google; alignment and safety are discussed, but bias/governance reporting is less extensive [30,37] | Detailed model-card governance with Google AI Principles, multimodal safety evaluations, red-teaming, and Frontier Safety Framework alignment [31,33] |
| Variable | Human Annotator: N | Human Annotator: % | Qwen: N | Qwen: % | Gemini: N | Gemini: % | ChatGPT: N | ChatGPT: % |
|---|---|---|---|---|---|---|---|---|
| Real Dataset | | | | | | | | |
| Sentiment: positive | 11,819 | 51.1 | 11,098 | 48.0 | 11,518 * | 49.8 | 10,568 | 45.7 |
| Sentiment: negative | 10,504 | 45.4 | 10,198 | 44.1 | 10,346 * | 44.8 | 6637 | 28.7 |
| Sentiment: neutral | 794 | 3.4 | 1816 | 7.9 | 1077 * | 4.7 | 5908 | 25.6 |
| Clean | 2927 | 12.7 | 2784 | 12.0 | 3007 * | 13.0 | 2570 | 11.1 |
| Comfort | 2562 | 11.1 | 2663 | 11.5 | 1282 * | 5.5 | 3700 | 16.0 |
| Facilities/Amenities | 2923 | 12.6 | 2803 | 12.1 | 2627 * | 11.4 | 1921 | 8.3 |
| Location | 4729 | 20.5 | 4992 | 21.6 | 4796 * | 20.8 | 4668 | 20.2 |
| Restaurant | 300 | 1.3 | 479 | 2.1 | 315 * | 1.4 | N/A | N/A |
| Staff | 3804 | 16.5 | 3750 | 16.2 | 3641 * | 15.8 | 3039 | 13.1 |
| View | 784 | 3.4 | 1106 | 4.8 | 809 * | 3.5 | 690 | 3.0 |
| Breakfast | 2539 | 11.0 | 2598 | 11.2 | 2545 * | 11.0 | 2402 | 10.4 |
| Room | 5004 | 21.7 | 8162 | 35.3 | 6177 * | 26.7 | 6097 | 26.4 |
| Pool | 472 | 2.0 | 444 | 1.9 | 496 * | 2.1 | 479 | 2.1 |
| Beach | 188 | 0.8 | 262 | 1.1 | 154 * | 0.7 | 147 | 0.6 |
| Bathroom/Shower (toilet) | 2346 | 10.2 | 2582 | 11.2 | 2462 * | 10.7 | 2237 | 9.7 |
| Bar | 322 | 1.4 | 481 | 2.1 | 344 * | 1.5 | 447 | 1.9 |
| Bed | 1480 | 6.4 | 1668 | 7.2 | 1578 * | 6.8 | 1190 | 5.1 |
| Parking | 541 | 2.3 | 987 | 4.3 | 561 * | 2.4 | 574 | 2.5 |
| Noise | 1604 | 6.9 | 1729 | 7.5 | 1654 * | 7.2 | 1361 | 5.9 |
| Reception | 953 | 4.1 | 1410 | 6.1 | 1394 * | 6.0 | 1035 | 4.5 |
| Lift | 291 | 1.3 | 349 | 1.5 | 294 * | 1.3 | 230 | 1.0 |
| Value for money | 892 | 3.9 | 1424 | 6.2 | 867 * | 3.8 | 906 | 3.9 |
| Wi-Fi | 268 | 1.2 | 639 | 2.8 | 279 * | 1.2 | 293 | 1.3 |
| Generic | 1973 | 8.5 | 1695 | 7.3 | 636 * | 2.8 | 1596 | 6.9 |
| Aspect | 2884 | 12.5 | 4690 | 20.3 | 6100 * | 26.4 | 10,412 | 45.0 |
| Synthetic Dataset | | | | | | | | |
| Sentiment: positive | 149 | 74.9 | 148 | 74.4 | 147 ** | 73.9 | 45 | 22.6 |
| Sentiment: negative | 47 | 23.6 | 48 | 24.1 | 44 ** | 22.1 | 17 | 8.5 |
| Sentiment: neutral | 3 | 1.5 | 3 | 1.5 | 5 ** | 2.5 | 137 | 68.8 |
| Clean | 16 | 8.0 | 15 | 7.5 | 17 ** | 8.5 | 10 | 5.0 |
| Comfort | 12 | 6.0 | 29 | 14.6 | 10 ** | 5.0 | 6 | 3.0 |
| Facilities/Amenities | 12 | 6.0 | 16 | 8.0 | 10 ** | 5.0 | N/A | N/A |
| Location | 14 | 7.0 | 15 | 7.5 | 14 ** | 7.0 | 6 | 3.0 |
| Restaurant | 8 | 4.0 | 9 | 4.5 | 10 ** | 5.0 | N/A | N/A |
| Staff | 13 | 6.5 | 15 | 7.5 | 14 ** | 7.0 | 10 | 5.0 |
| View | 9 | 4.5 | 11 | 5.5 | 11 ** | 5.5 | N/A | N/A |
| Breakfast | 13 | 6.5 | 15 | 7.5 | 13 ** | 6.5 | 6 | 3.0 |
| Room | 12 | 6.0 | 25 | 12.6 | 17 ** | 8.5 | 24 | 12.1 |
| Pool | 8 | 4.0 | 9 | 4.5 | 8 ** | 4.0 | N/A | N/A |
| Beach | 11 | 5.5 | 9 | 4.5 | 9 ** | 4.5 | N/A | N/A |
| Bathroom/Shower (toilet) | 10 | 5.0 | 11 | 5.5 | 11 ** | 5.5 | 3 | 1.5 |
| Bar | 11 | 5.5 | 10 | 5.0 | 10 ** | 5.0 | N/A | N/A |
| Bed | 11 | 5.5 | 14 | 7.0 | 14 ** | 7.0 | 6 | 3.0 |
| Parking | 9 | 4.5 | 9 | 4.5 | 9 ** | 4.5 | 9 | 4.5 |
| Noise | 12 | 6.0 | 14 | 7.0 | 13 ** | 6.5 | 3 | 1.5 |
| Reception | 13 | 6.5 | 12 | 6.0 | 12 ** | 6.0 | N/A | N/A |
| Lift | 10 | 5.0 | 10 | 5.0 | 10 ** | 5.0 | N/A | N/A |
| Value for money | 9 | 4.5 | 13 | 6.5 | 10 ** | 5.0 | 5 | 2.5 |
| Wi-Fi | 10 | 5.0 | 10 | 5.0 | 10 ** | 5.0 | 5 | 2.5 |
| Generic | 13 | 6.5 | 22 | 11.1 | 10 ** | 5.0 | 106 | 53.3 |
| Aspect | 8 | 4.0 | 12 | 6.0 | 22 ** | 11.1 | 70 | 35.2 |
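
The polarity shift visible in the sentiment rows above can be quantified as a difference in label shares. The sketch below uses the real-dataset counts for the human annotator and ChatGPT from the table. Normalizing each annotator by its own total and expressing the shift in percentage points are assumptions made for illustration, not a measure defined in the study.

```python
def shares(counts: dict[str, int]) -> dict[str, float]:
    """Convert label counts to percentage shares of that annotator's total."""
    total = sum(counts.values())
    return {label: 100.0 * n / total for label, n in counts.items()}


# Real-dataset sentiment counts from the table above.
human = {"positive": 11819, "negative": 10504, "neutral": 794}
chatgpt = {"positive": 10568, "negative": 6637, "neutral": 5908}

human_pct = shares(human)
chatgpt_pct = shares(chatgpt)

# Positive values mean the model assigns the label more often than the human annotator.
shift = {label: chatgpt_pct[label] - human_pct[label] for label in human}

for label, delta in shift.items():
    print(f"{label}: {delta:+.1f} percentage points")
```

On these counts the neutral share rises by roughly 22 percentage points, with most of the loss coming from the negative class.
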
| Dataset | Model | M | SD | Range | n |
|---|---|---|---|---|---|
| real | ChatGPT | 0.683 | 0.268 | [0.004, 0.950] | 24 |
| real | Gemini | 0.810 | 0.196 | [0.337, 0.974] | 25 |
| real | Qwen | 0.701 | 0.194 | [0.296, 0.948] | 25 |
| synthetic | ChatGPT | 0.442 | 0.279 | [−0.001, 1.000] | 18 |
| synthetic | Gemini | 0.849 | 0.186 | [0.235, 1.000] | 25 |
| synthetic | Qwen | 0.799 | 0.228 | [−0.015, 1.000] | 25 |
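
A per-model summary of this kind can be derived from a list of κ values by dropping unavailable entries before aggregating. The sketch below assumes that N/A cells are excluded and that SD is the sample standard deviation; neither choice is confirmed by the table. The input is a small illustrative subset (the three real-dataset sentiment κ values for Gemini plus one placeholder N/A), so the output does not reproduce any row above.

```python
from statistics import mean, stdev
from typing import Optional


def summarize(kappas: list[Optional[float]]) -> dict:
    """Summarize kappa values, skipping entries recorded as None (N/A)."""
    valid = [k for k in kappas if k is not None]
    return {
        "M": round(mean(valid), 3),
        "SD": round(stdev(valid), 3),  # sample SD (assumption); needs n >= 2
        "range": (min(valid), max(valid)),
        "n": len(valid),
    }


# Illustrative subset only: Gemini sentiment kappas on the real dataset, plus one N/A.
summary = summarize([0.940, 0.935, 0.403, None])
print(summary["M"], summary["range"], summary["n"])  # 0.759 (0.403, 0.94) 3
```
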
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Djouvas, C.; Andreou, C.; Voutsa, M.C.; Tsapatsoulis, N. Model-Contingent Polarity Bias in Large Language Model Annotation: Implications for Semantic Multimedia Personalization. Computers 2026, 15, 262. https://doi.org/10.3390/computers15050262

