Breaking the Ceiling: Mitigating Extreme Response Bias in Surveys Using an Open-Ended Adaptive-Testing System and LLM-Based Response Analysis
Abstract
1. Introduction
2. Related Work
2.1. The Ceiling Effect
2.2. Multistage Testing
2.3. LLM-as-a-Judge
3. Methods
3.1. Modules
3.2. Routing Question for Module Assignment
3.3. Survey Questions
3.4. Analysis of Survey Responses
3.4.1. Classification Instructions
- Define a small number of classification categories (e.g., low, medium, high).
- Provide detailed descriptions of the types of responses that are appropriate for each category.
- Clearly delineate category boundaries to minimize ambiguity and reduce the likelihood that a single response could fit multiple categories.
- Include in the classification instructions keywords whose presence in a response may help the LLM distinguish between categories.
- Specify the desired output format, such as outputting only the label, or the label accompanied by a brief justification.
- Prohibit any speculative inferences about the respondent’s background, motivation, or intent beyond what is explicitly stated in the response’s text.
- Conduct a small-scale preliminary survey and refine classification instructions based on observed response content. At this stage, it is often necessary to further clarify the separation between categories.
- Have an LLM classify the responses from the preliminary survey, then visualize the distribution of classifications for each question (see Appendix D). This allows the researcher to detect unbalanced distributions or deviations from expected patterns, which may indicate flaws in the instructions.
- Revise the instructions as needed in light of the classification distributions. Occasionally, the wording of the survey questions themselves may need to change to support proper classification. Repeat this cycle of LLM classification and revision until the classification distributions meet expectations.
- Administer a second small-scale preliminary survey to confirm that both the survey questions and the classification instructions work as intended.
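The guidelines above can be sketched in code. The following is a minimal illustration, not the prompt or pipeline used in the study: the category descriptions, keywords, function names, and the `max_share` threshold are all hypothetical placeholders. It assembles classification instructions from a small category specification and then tallies the distribution of labels for one question so that an unbalanced distribution can be spotted.

```python
from collections import Counter

# Hypothetical category specification: the labels, descriptions, and keywords
# below are placeholders, not the instructions used in the study.
CATEGORIES = {
    "low": {
        "description": "The respondent expresses little or no aversion.",
        "keywords": ["ignore", "don't mind", "curious"],
    },
    "medium": {
        "description": "The respondent expresses discomfort but stays in control.",
        "keywords": ["uneasy", "clean", "remove"],
    },
    "high": {
        "description": "The respondent expresses strong disgust, fear, or avoidance.",
        "keywords": ["scream", "throw away", "leave"],
    },
}


def build_instructions(question, categories, with_justification=False):
    """Assemble classification instructions from a category specification."""
    lines = [
        f"Survey question: {question}",
        "Classify the response into exactly one of the categories below.",
    ]
    for label, spec in categories.items():
        keywords = ", ".join(spec["keywords"])
        lines.append(f"- {label}: {spec['description']} Typical keywords: {keywords}.")
    # Forbid inferences beyond the response text.
    lines.append(
        "Base the decision only on what the response text states; do not "
        "speculate about the respondent's background, motivation, or intent."
    )
    # State the desired output format explicitly.
    if with_justification:
        lines.append("Output the label followed by a one-sentence justification.")
    else:
        lines.append("Output only the label.")
    return "\n".join(lines)


def label_distribution(labels, categories):
    """Return the share of responses assigned to each defined category."""
    if not labels:
        raise ValueError("no labels to summarize")
    counts = Counter(labels)
    return {label: counts.get(label, 0) / len(labels) for label in categories}


def flag_imbalance(distribution, max_share=0.8):
    """List categories whose share exceeds max_share (an arbitrary threshold)."""
    return [label for label, share in distribution.items() if share > max_share]


# Usage with made-up LLM outputs for a single question.
prompt = build_instructions("What do you do with your dinner, and why?", CATEGORIES)
labels = ["low", "low", "high", "low", "low"]
dist = label_distribution(labels, CATEGORIES)
flagged = flag_imbalance(dist, max_share=0.7)
```

A flagged category does not by itself mean the instructions are flawed; it only marks a question whose classification distribution is worth inspecting before the next revision round.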
3.4.2. Filtering Non-Human Responses—Bot and AI Detection
3.4.3. Scoring Scheme
3.5. Evaluation of LLM-as-a-Judge
3.6. Case Study
3.6.1. Survey Administration
3.6.2. Questionnaire Structure
3.6.3. Survey Development
3.6.4. LLM Candidates
3.6.5. Data Analysis
3.6.6. Score Calculation
3.6.7. Bot and LLM Response Detection
4. Results
4.1. Ceiling Effect in Likert Scale Rating
4.2. Ceiling Effect Mitigation with Open-Ended MST
4.3. LLM Performance
Cost-Effectiveness Analysis
4.4. Prompting Methodologies
5. Discussion
6. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Definition |
|---|---|
| MST | Multistage Testing |
| LLM | Large Language Model |
| NLP | Natural Language Processing |
| CAT | Computerized Adaptive Testing |
| IRT | Item Response Theory |
| AI | Artificial Intelligence |
| alt-test | Alternative Annotator Test |
Appendix A. Open-Ended Questions—Full Version
Appendix A.1. Module 1
- You’ve just finished making yourself dinner. You place the plate on the table, but before sitting down, you go to the bathroom. When you return, you see a cockroach standing on your food. As you approach, it scurries off and upon reaching the edge of the table, it spreads its wings and flies out the window. What do you do with your dinner, and why do you do that?
- In some countries, fried cockroaches are regarded as a delicacy. Imagine a friend returning from one of these countries with a bag of fried cockroach snacks and offering you a taste. Would you try one? Under what conditions might you be willing to give it a try?
- Your friend has gone on vacation and asked you to feed their pet hissing cockroaches. When you arrive at their house, you notice that, despite the winter cold making the cockroaches lethargic, one has managed to escape and is sitting motionless near the terrarium. What would you do with the escaped cockroach, and what factors would influence your choice?
- While visiting a natural history museum with your child, you come across an exhibit featuring live insects. In one display, an open terrarium allows visitors to pick up and hold cockroaches. Your child wants to try holding one but asks you to go first. How would you feel about it? What could affect your decision?
- Imagine you’re sitting in a fancy restaurant, waiting for your dinner to arrive, when you notice a cockroach scurrying across the floor before disappearing under a counter. Would you take any immediate or delayed action in response? What thoughts or emotions do you think you would think or feel?
- It’s late in the evening, and as you walk into the kitchen, you notice two long antennae peeking out from behind a jar. Suspecting it might be a cockroach, do you take any action? Why do you think you would behave that way?
Appendix A.2. Module 2
- You’re seated in a movie theater, waiting for the movie to begin, when you spot two cockroaches scurrying on the floor nearby. Moments later, the lights dim and the movie begins. Do you take any action in response? If so, what do you do and what motivates your choice of action?
- You’ve just returned home in the evening. As you step into the living room and turn on the light, you spot a cockroach scurrying across the floor before it disappears under the couch. What do you do? Why would you choose that course of action?
- While driving through town, you stop at a red light and glance to your left, noticing a bunch of cockroaches crawling out of a storm drain. Does this sight evoke any feelings in you? If so, what emotions do you think you might experience?
- While hiking through the woods, you pause to rest on a tree stump, taking in the natural surroundings. As you look around, you notice a cockroach crawling on a nearby log. How does this sight make you feel, and how do you think you would respond after noticing it?
- Imagine you’re alone in your living room, watching a movie on your television, when you suddenly notice a dead cockroach lying on its back in the corner of the room. Would you take any action? If so, would you address it immediately or wait until later, and what would influence the timing of your response?
- Suppose you are visiting a large museum, and one of the exhibits focuses on the natural world, featuring a collection of various preserved cockroach species displayed safely behind glass. Would you choose to explore this exhibit? If not, why would you avoid it? If you do decide to go in, what would be your motivation to enter?
Appendix A.3. Module 3
- Imagine you’re at a party, and someone begins sharing their passion for insect photography, showing photographs of cockroaches they’ve captured on their phone. How would you respond to this? Why do you think you would act this way?
- Imagine you are browsing your TV alone, looking for something to watch, and come across a highly rated documentary about the life of cockroaches, highlighting their complex social interactions and sophisticated communication skills. How would you feel about watching this documentary? What factors might influence your decision?
- While you are at work, your partner calls to tell you they just spotted a cockroach in the kitchen. They managed to catch it and throw it out the window. What would be your reaction to this news? What would you say or ask?
- Imagine scrolling through your social media feed and coming across a post from a friend describing how they discovered a small family of cockroaches living in the cabinet under their kitchen sink, and got rid of them. If you were to respond to their post, what would you write in your response to the post? Why would that be your response?
- Suppose you are at a friend’s house, and their child asks you to play with them using a collection of realistic plastic toy cockroaches. How would you handle the situation, and what feelings might it evoke in you?
- Imagine coming across a children’s book featuring friendly, cartoonish cockroach characters designed to teach kids about nature. Would you consider reading this book to your child? Why or why not?
Appendix B. Prompt for LLM Classification
Appendix C. Bot and AI Detection Prompt
Appendix D. LLM Classifications








| Aversion Level the Questions Were Intended to Invoke | Question # | Topics of the Questions in Module 1 | Topics of the Questions in Module 2 | Topics of the Questions in Module 3 |
|---|---|---|---|---|
| Low | 1, 2 | Seeing a live cockroach in a place where food is prepared | Being with a dead cockroach in an indoor space | Being exposed to symbolic representations of cockroaches |
| Medium | 3, 4 | Picking up or holding a live cockroach | Seeing live cockroaches outdoors | Hearing or reading about someone else’s encounter with cockroaches |
| High | 5, 6 | Ingesting a cockroach or food that came in contact with a cockroach | Being with a live cockroach in an indoor space | Looking at photography of cockroaches on a screen |

| LLM | Cohen’s Kappa vs. Majority Vote | Fleiss’ Kappa (All Annotators) |
|---|---|---|
| GPT-4.1 | 0.750 | 0.782 |
| GPT-5 | 0.719 | 0.780 |

| LLM | Irrelevant | Not Sure |
|---|---|---|
| GPT-4.1 | 20 | 33 |
| GPT-5 | 13 | 63 |
| Both GPT-4.1 & GPT-5 | 12 | 16 |
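
For readers unfamiliar with the agreement statistics tabulated above, the following is a minimal sketch of how Cohen's kappa between an LLM and the majority vote of human annotators could be computed. It is not the authors' analysis code: the function names and the tie-breaking rule are assumptions, and the labels in the usage lines are made up for illustration.

```python
from collections import Counter


def majority_vote(annotations):
    """Return the most frequent label per item.

    `annotations` holds one list of annotator labels per item. Ties go to the
    label that appears first in the item's list (an arbitrary assumption).
    """
    return [Counter(item).most_common(1)[0][0] for item in annotations]


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: chance-corrected agreement between two raters."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if each rater labeled independently at their own rates.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        counts_a[label] * counts_b[label] for label in set(rater_a) | set(rater_b)
    ) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so perfect agreement is reported by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Usage with made-up labels: three human annotators per item, one LLM label per item.
human = [
    ["low", "low", "high"],
    ["low", "low", "low"],
    ["high", "medium", "high"],
    ["high", "high", "low"],
]
llm = ["low", "low", "high", "low"]
consensus = majority_vote(human)
kappa = cohens_kappa(consensus, llm)
```

Fleiss' kappa generalizes the same idea to more than two raters and is available in common statistics packages, so it is not reimplemented here.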
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2026 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license.
Share and Cite
Gish, M.; Nowominski, A.; Dror, R. Breaking the Ceiling: Mitigating Extreme Response Bias in Surveys Using an Open-Ended Adaptive-Testing System and LLM-Based Response Analysis. AI 2026, 7, 73. https://doi.org/10.3390/ai7020073










