Automatic- and Transformer-Based Automatic Item Generation: A Critical Review
Abstract
1. Introduction
1.1. Psychometric Costs of Testing on Demand
1.2. Psychometric Costs of Retesting
1.3. Psychometric Costs of Test Preparation
1.4. Implications for Test Construction
2. Human-Constructed Test Items (H-IG)
3. Automatic Item Generation (AIG)
3.1. Item Model Approach
3.2. Cognitive Design System Approach
3.3. Automatic Min-Max Approach
4. Transformer-Based Automatic Item Generation (TB-AIG)
4.1. Introduction to Transformer Networks
4.2. Perils of Using Transformer-Based Automatic Item Generation in a Fully Automated Manner
4.3. Studies Using Transformer-Based Automatic Item Generation in the Cognitive Ability Domain
4.3.1. Item Model-Based TB-AIG
4.3.2. Element-Based TB-AIG
5. Discussion
5.1. Capacity of Item Generation Methods to Deal with the Specific Item Construction Demands
5.2. Capacity of Item Generation Methods to Provide Cost-Free Practice-Based Training
5.3. Potential Cost Savings During the Actual Item Construction Phase
5.4. Potential Cost Savings in the Item Calibration Phase and the Item Pool Maintenance Phase
5.5. Differences in Test Security Concerns
6. Concluding Remarks
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
| --- | --- |
| AIG | Automatic item generation |
| CAT | Computerized adaptive test |
| DIF | Differential item functioning |
| LLTM | Linear Logistic Test Model |
| LOFT | Linear on-the-fly test |
| MST | Multi-stage test |
| TB-AIG | Transformer-based automatic item generation |
References
- Abu-Haifa, Mohammad, Bara’a Etawi, Huthaifa Alkhatatbeh, and Ayman Ababneh. 2024. Comparative analysis of ChatGPT, GPT-4, and Microsoft Copilot Chatbots for GRE test. International Journal of Learning, Teaching and Educational Research 23: 327–47. [Google Scholar] [CrossRef]
- Ahn, Jihyun J., and Wenpeng Yin. 2025. Prompt-Reverse Inconsistency: LLM Self-Inconsistency Beyond Generative Randomness and Prompt Paraphrasing. arXiv arXiv:2504.01282. [Google Scholar]
- Al Faraby, Said, Ade Romadhony, and Adiwijaya. 2024. Analysis of LLMs for educational question classification and generation. Computers and Education: Artificial Intelligence 7: 100298. [Google Scholar] [CrossRef]
- Allalouf, Avi, and Gershon Ben-Shakhar. 1998. The effect of coaching on the predictive validity of scholastic aptitude tests. Journal of Educational Measurement 35: 31–47. [Google Scholar] [CrossRef]
- American Educational Research Association (AERA), American Psychological Association (APA), and National Council on Measurement in Education (NCME). 2018. Standards for Educational and Psychological Testing. Washington, DC: American Psychological Association. [Google Scholar]
- Anderson, John R., Jon M. Fincham, and Scott Douglass. 1997. The role of examples and rules in the acquisition of a cognitive skill. Journal of Experimental Psychology: Learning, Memory and Cognition 23: 932–45. [Google Scholar] [CrossRef]
- Appelhaus, Stefan, Susanne Werner, Pascal Grosse, and Juliane E. Kämmer. 2023. Feedback, fairness, and validity: Effects of disclosing and reusing multiple-choice questions in medical schools. Medical Education Online 28: 2143298. [Google Scholar] [CrossRef]
- Appelrouth, Jed I., Karen M. Zabrucky, and DeWayne Moore. 2017. Preparing students for college admissions tests. Assessment in Education: Principles, Policy and Practice 24: 78–95. [Google Scholar] [CrossRef]
- Arendasy, Martin. 2000. Psychometrischer Vergleich Computergestützter Vorgabeformen bei Raumvorstellungsaufgaben: Stereoskopisch-Dreidimensionale und Herkömmlich-Zweidimensionale Darbietung. Ph.D. thesis, Universität Wien, Wien, Austria. [Google Scholar]
- Arendasy, Martin. 2004. Automatisierte Itemgenerierung und Psychometrische Qualitätssicherung am Beispiel des Matrizentests GEOM. Lausanne: Peter Lang. [Google Scholar]
- Arendasy, Martin, and Markus Sommer. 2005. The effect of different types of perceptual manipulations on the dimensionality of automatically generated figural matrices. Intelligence 33: 307–24. [Google Scholar] [CrossRef]
- Arendasy, Martin, and Markus Sommer. 2007. Using psychometric technology in educational assessment: The case of a schema-based isomorphic approach to the automatic generation of quantitative reasoning items. Learning and Individual Differences 17: 366–83. [Google Scholar] [CrossRef]
- Arendasy, Martin, and Markus Sommer. 2010. Evaluating the contribution of different item features to the effect size of the gender difference in three-dimensional mental rotation using automatic item generation. Intelligence 38: 574–81. [Google Scholar] [CrossRef]
- Arendasy, Martin, and Markus Sommer. 2011. Automatisierte Itemgenerierung: Aktuelle Ansätze, Anwendungen und Forschungen. In Enzyklopädie für Psychologie: Methoden der Psychologischen Diagnostik. Edited by Lutz F. Hornke, Manfred Amelang and Martin Kersting. Göttingen: Hogrefe, pp. 215–80. [Google Scholar]
- Arendasy, Martin, and Markus Sommer. 2012a. Using automatic item generation to meet the increasing item demands of high-stakes assessment. Learning and Individual Differences 22: 112–17. [Google Scholar] [CrossRef]
- Arendasy, Martin, and Markus Sommer. 2012b. Gender differences in figural matrices: The moderating role of item design features. Intelligence 40: 584–97. [Google Scholar] [CrossRef]
- Arendasy, Martin, and Markus Sommer. 2013a. Quantitative differences in retest effects across different methods used to construct alternate test forms. Intelligence 41: 181–92. [Google Scholar] [CrossRef]
- Arendasy, Martin, and Markus Sommer. 2013b. Reducing response elimination strategies enhances the construct validity of figural matrices. Intelligence 41: 234–43. [Google Scholar] [CrossRef]
- Arendasy, Martin, Markus Sommer, and Andreas Hergovich. 2007. Psychometrische Technologie: Automatische Zwei-Komponenten-Itemgenerierung am Beispiel eines neuen Aufgabentyps zur Messung der Numerischen Flexibilität. Diagnostica 53: 119–30. [Google Scholar] [CrossRef]
- Arendasy, Martin, Markus Sommer, and Friedrich Mayr. 2012. Using automatic item generation to simultaneously construct German and English versions of a verbal fluency test. Journal of Cross-Cultural Psychology 43: 464–79. [Google Scholar] [CrossRef]
- Arendasy, Martin, Markus Sommer, and Georg Gittler. 2010. Combining automatic item generation and experimental designs to investigate the contribution of cognitive components to the gender difference in mental rotation. Intelligence 38: 506–12. [Google Scholar] [CrossRef]
- Arendasy, Martin, Markus Sommer, and Georg Gittler. 2020. Manual Intelligence-Struktur-Battery 2 (INSBAT-2). Mödling: SCHUHFRIED GmbH. [Google Scholar]
- Arendasy, Martin, Markus Sommer, Andreas Hergovich, and Martina Feldhammer. 2011. Evaluating the impact of depth cue salience in working three-dimensional mental rotation tasks by means of psychometric experiments. Learning and Individual Differences 21: 403–8. [Google Scholar] [CrossRef]
- Arendasy, Martin, Markus Sommer, Georg Gittler, and Andreas Hergovich. 2006. Automatic generation of quantitative reasoning items: Pilot study. Journal of Individual Differences 27: 2–14. [Google Scholar] [CrossRef]
- Arendasy, Martin E., and Markus Sommer. 2017. Reducing the effect size of the retest effect: Examining different approaches. Intelligence 62: 89–98. [Google Scholar] [CrossRef]
- Arendasy, Martin E., Markus Sommer, Karin Gutierrez-Lobos, and Joachim F. Punter. 2016. Do individual differences in test preparation compromise the measurement fairness of admission tests? Intelligence 55: 44–56. [Google Scholar] [CrossRef]
- Arendasy, Martin E., Markus Sommer, Reinhard Tschiesner, Martina Feldhammer-Kahr, and Konstantin Umdasch. 2024. Using automatic item generation to construct scheduling problems measuring planning ability. Intelligence 106: 101855. [Google Scholar] [CrossRef]
- Ariel, Adelaide, Wim J. van der Linden, and Bernard P. Veldkamp. 2006. A strategy for optimizing item-pool management. Journal of Educational Measurement 43: 85–96. [Google Scholar] [CrossRef]
- Attali, Yigal, Andrew Runge, Geoffrey T. LaFlair, Kevin Yancey, Sarah Goodwin, Yena Park, and Alina A. von Davier. 2022. The interactive reading task: Transformer-based automatic item generation. Frontiers in Artificial Intelligence 5: 903077. [Google Scholar] [CrossRef]
- Attali, Yigal, Luis Saldivia, Carol Jackson, Fred Schuppan, and Wilbur Wanamaker. 2014. Estimating Item Difficulty with Comparative Judgments. Princeton: ETS. [Google Scholar]
- Baldonado, Angela A., Dubravka Svetina, and Joanna Gorin. 2015. Using necessary information to identify item dependence in passage-based reading comprehension tests. Applied Measurement in Education 28: 202–18. [Google Scholar] [CrossRef]
- Balestri, Roberto. 2025. Gender and content bias in Large Language Models: A case study on Google Gemini 2.0 Flash Experimental. Frontiers in Artificial Intelligence 8: 1558696. [Google Scholar] [CrossRef] [PubMed]
- Bangert-Drowns, Robert L., James A. Kulik, and Chen-Lin C. Kulik. 1983. Effects of coaching programs on achievement test performance. Review of Educational Research 53: 571–85. [Google Scholar] [CrossRef]
- Becker, Betsy J. 1990. Coaching for the Scholastic Aptitude Test: Further synthesis and appraisal. Review of Educational Research 60: 373–417. [Google Scholar] [CrossRef]
- Beg, Mirza A., Afifa Tabassum, and Sobia Ali. 2021. Role of faculty development workshop for improving MCQs quality in basic medical sciences. Biomedica 37: 51–55. [Google Scholar] [CrossRef]
- Bejar, Isaac I. 1983. Subject matter experts’ assessment of item statistics. Applied Psychological Measurement 7: 303–10. [Google Scholar] [CrossRef]
- Bejar, Isaac I. 2002. Generative testing: From conception to implementation. In Item Generation for Test Development. Edited by Sidney H. Irvine and Patrick C. Kyllonen. Mahwah: Lawrence Erlbaum, pp. 199–217. [Google Scholar]
- Bejar, Isaac I., René R. Lawless, Mary E. Morley, Michael E. Wagner, Randy E. Bennett, and Javier Revuelta. 2002. A Feasibility Study of On-the-Fly Item Generation in Adaptive Testing (GRE Board Professional Rep. No. 98-12P). Princeton: ETS. [Google Scholar]
- Bejar, Isaac I., Roger Chaffin, and Susan Embretson. 2012. Cognitive and Psychometric Analysis of Analogical Problem Solving. Berlin: Springer. [Google Scholar]
- Belzak, William C., Ben Naismith, and Jill Burstein. 2023. Ensuring fairness of human-and AI-generated test items. In International Conference on Artificial Intelligence in Education. Cham: Springer Nature Switzerland, pp. 701–7. [Google Scholar]
- Belzak, William C. M. 2019. Testing differential item functioning in small samples. Multivariate Behavioral Research 55: 722–47. [Google Scholar] [CrossRef]
- Berenbon, Rebecca F., and Bridget C. McHugh. 2023. Do subject matter experts’ judgments of multiple-choice format suitability predict item quality? Educational Measurement: Issues and Practice 42: 13–21. [Google Scholar] [CrossRef]
- Bethell-Fox, Charles E., David F. Lohman, and Richard E. Snow. 1984. Adaptive reasoning: Componential and eye movement analysis of geometric analogy performance. Intelligence 8: 205–38. [Google Scholar] [CrossRef]
- Bezirhan, Ummugul, and Matthias von Davier. 2023. Automated reading passage generation with OpenAI’s large language model. Computers and Education: Artificial Intelligence 5: 100161. [Google Scholar] [CrossRef]
- Bhayana, Rajesh, Satheesh Krishna, and Robert R. Bleakney. 2023. Performance of ChatGPT on a radiology board-style examination: Insights into current strengths and limitations. Radiology 307: e230582. [Google Scholar] [CrossRef]
- Blum, Diego, and Heinz Holling. 2018. Automatic generation of figural analogies with the imak package. Frontiers in Psychology 9: 1286. [Google Scholar] [CrossRef] [PubMed]
- Borsboom, Denny, Jan-Willem Romeijn, and Jelte M. Wicherts. 2008. Measurement invariance versus selection invariance: Is fair selection possible? Psychological Methods 13: 75–98. [Google Scholar] [CrossRef] [PubMed]
- Bozkurt, Aras, and Ramesh C. Sharma. 2023. Generative AI and prompt engineering: The art of whispering to let the genie out of the algorithmic world. Asian Journal of Distance Education 18: 1–7. Available online: https://www.asianjde.com/ojs/index.php/AsianJDE/article/view/749 (accessed on 3 March 2025).
- Briggs, Derek C. 2009. Preparation for College Admission Exams (2009 NACAC Discussion Paper). Arlington: National Association for College Admission Counseling. [Google Scholar]
- Brown, Tom B., Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, and et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems 33: 1877–901. [Google Scholar]
- Buchmann, Claudia, Dennis J. Condron, and Vincent J. Roscigno. 2010. Shadow education, American style: Test preparation, the SAT and college enrollment. Social Forces 89: 435–82. [Google Scholar] [CrossRef]
- Bulut, Okan, Maggie Beiting-Parrish, Jodi M. Casabianca, Sharon C. Slater, Hong Jiao, Dan Song, and Polina Morilova. 2024. The rise of artificial intelligence in educational measurement: Opportunities and ethical challenges. arXiv arXiv:2406.18900. [Google Scholar]
- Burke, Eugene F. 1997. A short note on the persistence of retest effects on aptitude scores. Journal of Occupational and Organizational Psychology 70: 295–301. [Google Scholar] [CrossRef]
- Burns, Gary N., Brian P. Siers, and Neil D. Christiansen. 2008. Effects of providing pre-test information and preparation materials on applicant reactions to selection procedures. International Journal of Selection and Assessment 16: 73–77. [Google Scholar] [CrossRef]
- Calamia, Matthew, Kristian Markon, and Daniel Tranel. 2012. Scoring higher the second time around: Meta-analyses of practice effects in neuropsychological assessment. The Clinical Neuropsychologist 26: 543–70. [Google Scholar] [CrossRef] [PubMed]
- Campion, Michael C., Emily D. Campion, and Michael A. Campion. 2019. Using practice employment tests to improve recruitment and personnel selection outcomes for organizations and job seekers. Journal of Applied Psychology 104: 1089–102. [Google Scholar] [CrossRef] [PubMed]
- Chan, Kuang W., Farhan Ali, Joonhyeong Park, Kah S. B. Sham, Erdalyn Y. T. Tan, Francis W. C. Chong, and Guan K. Sze. 2025. Automatic item generation in various STEM subjects using large language model prompting. Computers and Education: Artificial Intelligence 8: 100344. [Google Scholar] [CrossRef]
- Chauhan, Archana, Farah Khaliq, and Kirtana Raghurama Nayak. 2025. Assessing quality of scenario-based multiple-choice questions in physiology: Faculty-generated vs. ChatGPT-generated questions among phase I medical students. International Journal of Artificial Intelligence in Education, 1–30. [Google Scholar] [CrossRef]
- Cho, Sun-Joo, Paul De Boeck, Susan Embretson, and Sophia Rabe-Hesketh. 2014. Additive multilevel item structure models with random residuals: Item modeling for explanation and item generation. Psychometrika 79: 84–104. [Google Scholar] [CrossRef]
- Choi, Jaehwa, and Xinxin Zhang. 2019. Computerized item modeling practices using computer adaptive formative assessment automatic item generation system: A tutorial. The Quantitative Methods for Psychology 15: 214–25. [Google Scholar] [CrossRef]
- Chung, Jinmin, and Sungyeun Kim. 2024. Comparison of rule-based models and Large Language Models in item and feedback generation. Journal of Science Education 48: 154–69. [Google Scholar] [CrossRef]
- Circi, Ruhan, Juanita Hicks, and Emmanuel Sikali. 2023. Automatic item generation: Foundations and machine learning-based approaches for assessments. Frontiers in Education 8: 858273. [Google Scholar] [CrossRef]
- Colvin, Kimberly F., Lisa A. Keller, and Frederic Robin. 2016. Effect of imprecise parameter estimation on ability estimation in a multistage test in an automatic item generation context. Journal of Computerized Adaptive Testing 4: 1–18. [Google Scholar] [CrossRef]
- Daniel, Robert C., and Susan E. Embretson. 2010. Designing cognitive complexity in mathematical problem-solving items. Applied Psychological Measurement 34: 348–64. [Google Scholar] [CrossRef]
- De Boeck, Paul, and Mark Wilson. 2004. Explanatory Item Response Models: A Generalized Linear and Nonlinear Approach. New York: Springer. [Google Scholar]
- Denker, Marek, Clara Schütte, Martin Kersting, Daniel Weppert, and Stephan J. Stegt. 2023. How can applicants’ reactions to scholastic aptitude tests be improved? A closer look at specific and general tests. Frontiers in Education 7: 931841. [Google Scholar] [CrossRef]
- Devlin, Jacob, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv arXiv:1810.04805. [Google Scholar]
- Doebler, Anna. 2012. The problem of bias in person parameter estimation in adaptive testing. Applied Psychological Measurement 36: 255–70. [Google Scholar] [CrossRef]
- Doebler, Anna, and Heinz Holling. 2016. A processing speed test based on rule-based item generation: An analysis with the Rasch Poisson Counts model. Learning and Individual Differences 52: 121–28. [Google Scholar] [CrossRef]
- Draheim, Christopher, Tyler L. Harrison, Susan E. Embretson, and Randall W. Engle. 2018. What item response theory can tell us about the complex span tasks. Psychological Assessment 30: 116–29. [Google Scholar] [CrossRef]
- Drasgow, Fritz, Richard M. Luecht, and Randy Bennett. 2006. Technology and testing. In Educational Measurement, 4th ed. Edited by Robert L. Brennan. Westport: American Council on Education and Praeger Publishers, pp. 471–515. [Google Scholar]
- Eleragi, Ali M. S., Elhadi Miskeen, Kamal Hussein, Assad A. Rezigalla, Masoud I. Adam, Jaber A. Al-Faifi, and Osama A. Mohammed. 2025. Evaluating the multiple-choice questions quality at the College of Medicine, University of Bisha, Saudi Arabia: A three-year experience. BMC Medical Education 25: 233. [Google Scholar] [CrossRef]
- El Masri, Yasmine H., Steve Ferrara, Peter W. Foltz, and Jo-Anne Baird. 2017. Predicting item difficulty of science national curriculum tests: The case of key stage 2 assessments. The Curriculum Journal 28: 59–82. [Google Scholar] [CrossRef]
- Embretson, Susan. 2023. Understanding examinees’ item responses through cognitive modeling of response accuracy and response times. Large-Scale Assessments in Education 11: 9. [Google Scholar] [CrossRef]
- Embretson, Susan E. 1998. A cognitive design system approach to generating valid tests: Application to abstract reasoning. Psychological Methods 3: 380–96. [Google Scholar] [CrossRef]
- Embretson, Susan E. 1999. Generating items during testing: Psychometric issues and models. Psychometrika 64: 407–33. [Google Scholar] [CrossRef]
- Embretson, Susan E. 2002. Generating abstract reasoning items with cognitive theory. In Item Generation for Test Development. Edited by Sidney H. Irvine and Patrick C. Kyllonen. Mahwah: Lawrence Erlbaum. [Google Scholar]
- Embretson, Susan E. 2005. Measuring human intelligence with artificial intelligence. In Cognition and Intelligence. Edited by Robert J. Sternberg and Jean E. Pretz. New York: Cambridge University Press, pp. 251–67. [Google Scholar]
- Embretson, Susan E. 2016. Understanding examinees’ responses to items: Implications for measurement. Educational Measurement: Issues and Practice 35: 6–22. [Google Scholar] [CrossRef]
- Embretson, Susan E., and Joanna S. Gorin. 2001. Improving construct validity with cognitive psychology principles. Journal of Educational Measurement 38: 343–68. [Google Scholar] [CrossRef]
- Embretson, Susan E., and Neal M. Kingston. 2018. Automatic item generation: A more efficient process for developing mathematics achievement items? Journal of Educational Measurement 55: 112–31. [Google Scholar] [CrossRef]
- Embretson, Susan E., and Robert C. Daniel. 2008. Understanding and quantifying cognitive complexity level in mathematical problem solving items. Psychology Science 50: 328–44. [Google Scholar]
- Embretson, Susan E., and Xiangdong Yang. 2007. Automatic item generation and cognitive psychology. In Handbook of Statistics: Vol 26 Psychometrics. Edited by Calyampudi R. Rao and Sandip Sinharay. North Holland: Elsevier, pp. 747–68. [Google Scholar]
- Emekli, Emre, and Betül N. Karahan. 2025. Comparison of automatic item generation methods in the assessment of clinical reasoning skills. Revista Española de Educación Médica 6: 1–12. [Google Scholar] [CrossRef]
- Enright, Mary K., Mary Morley, and Kathleen M. Sheehan. 2002. Items by design: The impact of systematic feature variation on item statistical characteristics. Applied Measurement in Education 15: 49–74. [Google Scholar] [CrossRef]
- Estrada, Eduardo, Emilio Ferrer, Francisco J. Abad, Francisco J. Román, and Roberto Colom. 2015. A general factor of intelligence fails to account for changes in tests’ scores after cognitive practice: A longitudinal multi-group latent variable study. Intelligence 50: 93–99. [Google Scholar] [CrossRef]
- Falcão, Filipe, Daniela M. Pereira, Nuno Gonçalves, Andre De Champlain, Patrício Costa, and José M. Pêgo. 2023. A suggestive approach for assessing item quality, usability and validity of Automatic Item Generation. Advances in Health Sciences Education 28: 1441–65. [Google Scholar] [CrossRef]
- Falcão, Filipe M. V., Daniela M. Pereira, José M. Pêgo, and Patrício Costa. 2024. Progress is impossible without change: Implementing automatic item generation in medical knowledge progress testing. Education and Information Technologies 29: 4505–30. [Google Scholar] [CrossRef]
- Farrell, Simon, and Stephan Lewandowsky. 2010. Computational models as aids to better reasoning in psychology. Current Directions in Psychological Science 19: 329–35. [Google Scholar] [CrossRef]
- Fehringer, Benedict C. 2020. Spatial thinking from a different view: Disentangling top-down and bottom-up processes using eye tracking. Open Psychology 2: 138–212. [Google Scholar] [CrossRef]
- Fehringer, Benedict C. 2023. Different perspectives on retest effects in the context of spatial thinking: Interplay of behavioral performance, cognitive processing, and cognitive workload. Journal of Intelligence 11: 66. [Google Scholar] [CrossRef] [PubMed]
- Fischer, Gerhard H. 1995. The Linear Logistic Test Model. In Rasch Models: Foundations, Recent Developments, and Applications. Edited by Gerhard H. Fischer and Ivo W. Molenaar. New York: Springer, pp. 157–80. [Google Scholar]
- Folk, Valerie G., and Robert L. Smith. 2002. Models for delivery of CBTs. In Computer-Based Testing: Building the Foundation for Future Assessments. Edited by Craig Mills, Maria Potenza, John Fremer and William Ward. Mahwah: Lawrence Erlbaum, pp. 41–66. [Google Scholar]
- Foster, David. 2016. Testing technology and its effects on test security. In Technology and Testing: Improving Educational and Psychological Measurement. Edited by Fritz Drasgow. New York: Routledge, pp. 235–55. [Google Scholar]
- Förster, Natalie, and Jörg-Tobias Kuhn. 2023. Ice is hot and water is dry: Developing equivalent reading tests using rule-based item design. European Journal of Psychological Assessment 39: 96–105. [Google Scholar] [CrossRef]
- Freund, Philipp A., and Heinz Holling. 2011. How to get real smart: Modeling retest and training effects in ability testing using computer-generated figural matrices items. Intelligence 39: 233–43. [Google Scholar] [CrossRef]
- Freund, Philipp A., Stefan Hofer, and Heinz Holling. 2008. Explaining and controlling for the psychometric properties of computer-generated figural matrix items. Applied Psychological Measurement 32: 195–210. [Google Scholar] [CrossRef]
- Fried, Eiko I. 2020. Lack of theory building and testing impedes progress in the factor and network literature. Psychological Inquiry 31: 271–88. [Google Scholar] [CrossRef]
- Fu, Yanyan, Edison M. Choe, Hwanggyu Lim, and Jaehwa Choi. 2022. An evaluation of automatic item generation: A case study of weak theory approach. Educational Measurement: Issues and Practice 41: 10–22. [Google Scholar] [CrossRef]
- Funk, Paul F., Cosima C. Hoch, Samuel Knoedler, Leonard Knoedler, Sebastian Cotofana, Giuseppe Sofo, and Michael Alfertshofer. 2024. ChatGPT’s response consistency: A study on repeated queries of medical examination questions. European Journal of Investigation in Health, Psychology and Education 14: 657–68. [Google Scholar] [CrossRef]
- Geerlings, Hanneke, Wim J. van der Linden, and Cees A. Glas. 2013. Optimal test design with rule-based item generation. Applied Psychological Measurement 37: 140–61. [Google Scholar] [CrossRef]
- Georgiadou, Elissavet, Evangelos Triantafillou, and Anastasios A. Economides. 2007. A review of item exposure control strategies for computerized adaptive testing developed from 1983 to 2005. Journal of Technology, Learning, and Assessment 5. Available online: https://files.eric.ed.gov/fulltext/EJ838610.pdf (accessed on 3 March 2025).
- Gierl, Mark J., and Hollis Lai. 2012. The role of item models in automatic item generation. International Journal of Testing 12: 273–98. [Google Scholar] [CrossRef]
- Gierl, Mark J., Hollis Lai, and Simon Turner. 2012. Using automatic item generation to create multiple-choice items for assessments in medical education. Medical Education 46: 757–65. [Google Scholar] [CrossRef]
- Gierl, Mark J., Jinnie Shin, Tahereh Firoozi, and Hollis Lai. 2022a. Using content coding and automatic item generation to improve test security. Frontiers in Education 7: 853578. [Google Scholar] [CrossRef]
- Gierl, Mark J., Kimberly Swygert, Donna Matovinovic, Allison Kulesher, and Hollis Lai. 2022b. Three sources of validation evidence needed to evaluate the quality of generated test items for medical licensure. Teaching and Learning in Medicine 36: 72–82. [Google Scholar] [CrossRef]
- Glas, Cees A., Wim J. van der Linden, and Hanneke Geerlings. 2016. Item-family models. In Handbook of Item Response Theory. Edited by Wim J. van der Linden. Boca Raton: Chapman and Hall/CRC, vol. 1, pp. 465–76. [Google Scholar]
- Glas, Cees A. W., and Wim J. van der Linden. 2003. Computerized adaptive testing with item cloning. Applied Psychological Measurement 27: 247–61. [Google Scholar] [CrossRef]
- Gorin, Joanna S. 2005. Manipulating processing difficulty of reading comprehension questions: The feasibility of verbal item generation. Journal of Educational Measurement 42: 351–73. [Google Scholar] [CrossRef]
- Gorin, Joanna S. 2006. Test design with cognition in mind. Educational Measurement: Issues and Practice 25: 21–35. [Google Scholar] [CrossRef]
- Gorin, Joanna S., and Susan E. Embretson. 2006. Item difficulty modeling of paragraph comprehension items. Applied Psychological Measurement 30: 394–411. [Google Scholar] [CrossRef]
- Graf, Edith A., Stephen Peterson, Manfred Steffen, and René Lawless. 2005. Psychometric and Cognitive Analysis as a Basis for the Design and Revision of Quantitative Item Models (No. RR-05-25). Princeton: Educational Testing Service. [Google Scholar]
- Greeno, James G., Joyce L. Moore, and David R. Smith. 1993. Transfer of situated learning. In Transfer on Trial: Intelligence, Cognition, and Instruction. Edited by Douglas K. Detterman and Robert J. Sternberg. New York: Ablex Publishing, pp. 99–167. [Google Scholar]
- Guest, Olivia, and Andrea E. Martin. 2021. How computational modeling can force theory building in psychological science. Perspectives on Psychological Science 16: 789–802. [Google Scholar] [CrossRef] [PubMed]
- Gühne, Daniela, Philipp Doebler, David M. Condon, Fang Luo, and Luning Sun. 2020. Validity and reliability of automatically generated propositional reasoning items. European Journal of Psychological Assessment 16: 325–39. [Google Scholar] [CrossRef]
- Guo, Jing, Louis Tay, and Fritz Drasgow. 2009. Conspiracy and test compromise: An evaluation of the resistance of test systems to small-scale cheating. International Journal of Testing 9: 283–309. [Google Scholar] [CrossRef]
- Gupta, Piyush, Pinky Meena, Amir M. Khan, Rajeev K. Malhotra, and Tejinder Singh. 2020. Effect of faculty training on quality of multiple-choice questions. International Journal of Applied and Basic Medical Research 10: 210–14. [Google Scholar] [CrossRef]
- Hao, Jiangang, Alina A. von Davier, Victoria Yaneva, Susan Lottridge, Matthias von Davier, and Deborah J. Harris. 2024. Transforming assessment: The impacts and implications of large language models and generative AI. Educational Measurement: Issues and Practice 43: 16–29. [Google Scholar] [CrossRef]
- Hausknecht, John P., Jane A. Halpert, Nicole T. DiPaolo, and Meghan O. Moriarty Gerrard. 2007. Retesting in selection: A meta-analysis of coaching and practice effects for tests of cognitive ability. Journal of Applied Psychology 92: 373–85. [Google Scholar] [CrossRef]
- Hayes, Taylor R., Alexander A. Petrov, and Per B. Sederberg. 2015. Do we really become smarter when our fluid intelligence test scores improve? Intelligence 48: 1–14. [Google Scholar] [CrossRef]
- He, Wei, and Mark D. Reckase. 2014. Item pool design for an operational variable-length computerized adaptive test. Educational and Psychological Measurement 74: 473–94. [Google Scholar] [CrossRef]
- Heil, Martin, and Petra Jansen-Osmann. 2008. Sex differences in mental rotation with polygons of different complexity: Do men utilize holistic processes whereas women prefer piecemeal ones? The Quarterly Journal of Experimental Psychology 61: 683–89. [Google Scholar] [CrossRef]
- Heil, Martin, Frank Rösler, Michael Link, and Jasmin Bajric. 1998. What is improved if mental rotation task is repeated: The efficiency of memory access, or the speed of transformation routine? Psychological Research 61: 99–106. [Google Scholar] [CrossRef]
- Hermes, Michael, Frank Albers, Jan R. Böhnke, Gerrit Huelmann, Julia Maier, and Dirk Stelling. 2019. Measurement and structural invariance of cognitive ability tests after computer-based training. Computers in Human Behavior 93: 370–78. [Google Scholar] [CrossRef]
- Hermes, Michael, Julia Maier, Justin Mittelstädt, Frank Albers, Gerrit Huelmann, and Dirk Stelling. 2023. Computer-based training and repeated test performance: Increasing assessment fairness instead of retest effects. European Journal of Work and Organizational Psychology 32: 450–59. [Google Scholar] [CrossRef]
- Heston, Thomas F., and Charya Khun. 2023. Prompt engineering in medical education. International Medical Education 2: 198–205. [Google Scholar] [CrossRef]
- Hickman, Luis, Patrick D. Dunlop, and Jasper L. Wolf. 2024. The performance of large language models on quantitative and verbal ability tests: Initial evidence and implications for unproctored high-stakes testing. International Journal of Selection and Assessment 32: 499–511. [Google Scholar] [CrossRef]
- Hines, Scott. 2017. The Development and Validation of an Automatic-Item Generation Measure of Cognitive Ability. Ph.D. dissertation, Louisiana Tech University, Ruston, LA, USA. Available online: https://digitalcommons.latech.edu/dissertations/71 (accessed on 3 March 2025).
- Holling, Heinz, Jonas P. Bertling, and Nina Zeuch. 2009. Automatic item generation of probability word problems. Studies in Educational Evaluation 35: 71–76. [Google Scholar] [CrossRef]
- Holmes, Stephen D., Michelle Meadows, Ian Stockford, and Qingping He. 2018. Investigating the comparability of examination difficulty using comparative judgement and Rasch modelling. International Journal of Testing 18: 366–91. [Google Scholar] [CrossRef]
- Hornke, Lutz F., and Michael W. Habon. 1986. Rule-based item bank construction and evaluation within the linear logistic framework. Applied Psychological Measurement 10: 369–80. [Google Scholar] [CrossRef]
- Impara, James C., and Barbara S. Plake. 1998. Teachers’ ability to estimate item difficulty: A test of the assumptions in the Angoff standard setting method. Journal of Educational Measurement 35: 69–81. [Google Scholar] [CrossRef]
- Irvine, Sidney H. 2002. The foundations of item generation for mass testing. In Item Generation for Test Development. Edited by Sidney H. Irvine and Patrick C. Kyllonen. Mahwah: Lawrence Erlbaum Associates, pp. 3–34. [Google Scholar]
- Irvine, Sidney H., and Patrick C. Kyllonen. 2002. Item Generation for Test Development. Mahwah: Lawrence Erlbaum Associates. [Google Scholar]
- Ivie, Jennifer L., and Susan E. Embretson. 2010. Cognitive process modeling of spatial ability: The assembling objects task. Intelligence 38: 324–35. [Google Scholar] [CrossRef]
- Jarosz, Andrew F., and Jennifer Wiley. 2012. Why does working memory capacity predict RAPM performance? A possible role of distraction. Intelligence 40: 427–38. [Google Scholar] [CrossRef]
- Joncas, Sébastien X., Christina St-Onge, Sylvie Bourque, and Paul Farand. 2018. Re-using questions in classroom-based assessment: An exploratory study at the undergraduate medical education level. Perspectives on Medical Education 7: 373–78. [Google Scholar] [CrossRef]
- Jozefowicz, Ralph F., Bruce M. Koeppen, Susan Case, Robert Galbraith, David Swanson, and Robert H. Glew. 2002. The quality of in-house medical school examinations. Academic Medicine 77: 156–61. [Google Scholar] [CrossRef] [PubMed]
- Kaller, Christoph P., Benjamin Rahm, Lena Köstering, and Josef M. Unterrainer. 2011. Reviewing the impact of problem structure on planning: A software tool for analyzing tower tasks. Behavioural Brain Research 216: 1–8. [Google Scholar] [CrossRef] [PubMed]
- Kamruzzaman, Mahammed, Hieu Nguyen, Nazmul Hassan, and Gene L. Kim. 2024. “A Woman is More Culturally Knowledgeable than A Man?”: The Effect of Personas on Cultural Norm Interpretation in LLMs. arXiv arXiv:2409.11636. [Google Scholar]
- Kapoor, Radhika, Sang T. Truong, Nick Haber, Maria A. Ruiz-Primo, and Benjamin W. Domingue. 2025. Prediction of item difficulty for reading comprehension items by creation of annotated item repository. arXiv arXiv:2502.20663. [Google Scholar]
- Kara, Başak Erdem, and Nuri Dogan. 2022. The effect of ratio of items indicating differential item functioning on computer adaptive and multi-stage tests. International Journal of Assessment Tools in Education 9: 682–96. [Google Scholar] [CrossRef]
- Karthikeyan, Sowmiya, Elizabeth O’Connor, and Wendy Hu. 2019. Barriers and facilitators to writing quality items for medical school assessments–a scoping review. BMC Medical Education 19: 123. [Google Scholar] [CrossRef]
- Kıyak, Yavuz S., and Andrzej A. Kononowicz. 2025. Using a hybrid of AI and template-based method in automatic item generation to create multiple-choice questions in medical education: Hybrid AIG. JMIR Formative Research 9: e65726. [Google Scholar] [CrossRef]
- Kıyak, Yavuz S., Andrzej A. Kononowicz, and Stanislaw Górski. 2024. Multilingual template-based automatic item generation for medical education supported by generative artificial intelligence models ChatGPT and Claude. Bio-Algorithms and Med-Systems 20: 81–89. [Google Scholar] [CrossRef]
- Kıyak, Yavuz S., Emre Emekli, Özlem Coşkun, and Işıl İ. Budakoğlu. 2025. Keeping humans in the loop efficiently by generating question templates instead of questions using AI: Validity evidence on Hybrid AIG. Medical Teacher 47: 744–47. [Google Scholar] [CrossRef]
- Klahr, David, and Brian MacWhinney. 1997. Information Processing. In Cognition, Perception, and Language. Handbook of Child Psychology, 5th ed. Edited by William Damon, Deanna Kuhn and Robert Siegler. Hoboken: John Wiley and Sons, vol. 2, pp. 631–78. [Google Scholar]
- Kosh, Audra E., Mary A. Simpson, Lisa Bickel, Mark Kellogg, and Ellie Sanford-Moore. 2019. A cost–benefit analysis of automatic item generation. Educational Measurement: Issues and Practice 38: 48–53. [Google Scholar] [CrossRef]
- Krautter, Kai, Jessica Lehmann, Eva Kleinort, Marco Koch, Frank M. Spinath, and Nicolas Becker. 2021. Test preparation in figural matrices tests: Focus on the difficult rules. Frontiers in Psychology 12: 619440. [Google Scholar] [CrossRef] [PubMed]
- Kulik, James A., Chen-Lin C. Kulik, and Robert L. Bangert. 1984a. Effects of practice on aptitude and achievement test scores. American Educational Research Journal 21: 435–47. [Google Scholar] [CrossRef]
- Kulik, James A., Robert L. Bangert-Drowns, and Chen-Lin C. Kulik. 1984b. Effectiveness of coaching for aptitude tests. Psychological Bulletin 95: 179–88. [Google Scholar] [CrossRef]
- Kurdi, Ghader, Jared Leo, Bijan Parsia, Uli Sattler, and Salam Al-Emari. 2020. A systematic review of automatic question generation for educational purposes. International Journal of Artificial Intelligence in Education 30: 121–204. [Google Scholar] [CrossRef]
- LaDuca, Anthony, William I. Staples, Bryce Templeton, and Gerald B. Holzman. 1986. Item modelling procedures for constructing content-equivalent multiple-choice questions. Medical Education 20: 53–56. [Google Scholar] [CrossRef]
- Lai, Hollis, Mark J. Gierl, Claire Touchie, Debra Pugh, André-Philippe Boulais, and André De Champlain. 2016. Using automatic item generation to improve the quality of MCQ distractors. Teaching and Learning in Medicine 28: 166–73. [Google Scholar] [CrossRef]
- Lane, Suzanne, Mark Raymond, and Thomas Haladyna. 2016. Test development process. In Handbook of Test Development. Edited by Suzanne Lane, Mark Raymond and Thomas Haladyna. New York: Routledge, pp. 3–18. [Google Scholar]
- Lang, Jonas W. B. 2011. Computer-adaptives Testen. In Enzyklopädie für Psychologie: Verfahren zur Leistungs-, Intelligenz- und Verhaltensdiagnostik. Edited by Lutz F. Hornke, Manfred Amelang and Martin Kersting. Göttingen: Hogrefe, pp. 405–46. [Google Scholar]
- Lee, Hye Y., So J. Yune, Sang Y. Lee, Sunju Im, and Bee S. Kam. 2024. The impact of repeated item development training on the prediction of medical faculty members’ item difficulty index. BMC Medical Education 24: 599. [Google Scholar] [CrossRef]
- Lee, Jennifer C., Natasha Quadlin, and Denise Ambriz. 2023a. Shadow education, pandemic style: Social class, race, and supplemental education during COVID-19. Research in Social Stratification and Mobility 83: 100755. [Google Scholar] [CrossRef]
- Lee, Jooyoung, Thai Le, Jinghui Chen, and Dongwon Lee. 2022. Do language models plagiarize? arXiv arXiv:2203.07618. [Google Scholar]
- Lee, Philseok, Shea Fyffe, Mina Son, Zihao Jia, and Ziyu Yao. 2023b. A paradigm shift from “human writing” to “machine generation” in personality test development: An application of state-of-the-art natural language processing. Journal of Business and Psychology 38: 163–90. [Google Scholar] [CrossRef]
- Lee, Unggi, Haewon Jung, Younghoon Jeon, Younghoon Sohn, Wonhee Hwang, Jewoong Moon, and Hyeoncheol Kim. 2023c. Few-shot is enough: Exploring ChatGPT prompt engineering method for automatic question generation in English education. Education and Information Technologies 29: 11483–515. [Google Scholar] [CrossRef]
- Leslie, Tara, and Mark J. Gierl. 2023. Using automatic item generation to create multiple-choice questions for pharmacy assessment. American Journal of Pharmaceutical Education 87: 100081. [Google Scholar] [CrossRef] [PubMed]
- Levacher, Julie, Marco Koch, Johanna Hissbach, Frank M. Spinath, and Nicolas Becker. 2021. You can play the game without knowing the rules-but you’re better off knowing them: The influence of rule knowledge on figural matrices tests. European Journal of Psychological Assessment 38: 15–23. [Google Scholar] [CrossRef]
- Li, Kunze, and Yu Zhang. 2024. Planning first, question second: An LLM-guided method for controllable question generation. In Findings of the Association for Computational Linguistics ACL 2024. Bangkok: Association for Computational Linguistics, pp. 4715–29. [Google Scholar]
- Lievens, Filip, Charlie L. Reeve, and Eric D. Heggestad. 2007. An examination of psychometric bias due to retesting on cognitive ability tests in selection settings. Journal of Applied Psychology 92: 1672–82. [Google Scholar] [CrossRef]
- Lievens, Filip, Tine Buyse, and Paul R. Sackett. 2005. Retest effects in operational selection settings: Development and test of a framework. Personnel Psychology 58: 981–1007. [Google Scholar] [CrossRef]
- Lilly, Jane, and Paul Montgomery. 2011. Systematic reviews of the effects of preparatory courses on university entrance examinations in high school-age students. International Journal of Social Welfare 21: 3–12. [Google Scholar] [CrossRef]
- Lim, Sangdon, and Seung W. Choi. 2024. Item exposure and utilization control methods for optimal test assembly. Behaviormetrika 51: 125–56. [Google Scholar] [CrossRef]
- Lin, Zhiqing, and Huilin Chen. 2024. Investigating the capability of ChatGPT for generating multiple-choice reading comprehension items. System 123: 103344. [Google Scholar] [CrossRef]
- Liu, Cheng, Kyung T. Han, and Jun Li. 2019. Compromised item detection for computerized adaptive testing. Frontiers in Psychology 10: 1–16. [Google Scholar] [CrossRef]
- Liu, Mingxin, Tsuyoshi Okuhara, XinYi Chang, Ritsuko Shirabe, Yuriko Nishiie, Hiroko Okada, and Takahiro Kiuchi. 2024. Performance of ChatGPT across different versions in medical licensing examinations worldwide: Systematic review and meta-analysis. Journal of Medical Internet Research 26: e60807. [Google Scholar] [CrossRef] [PubMed]
- Liu, Pengfei, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023a. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55: 1–35. [Google Scholar] [CrossRef]
- Liu, Yaohui, Keren He, Kaiwen Man, and Peida Zhan. 2025. Exploring critical eye-tracking metrics for identifying cognitive strategies in Raven’s Advanced Progressive Matrices: A data-driven perspective. Journal of Intelligence 13: 14. [Google Scholar] [CrossRef] [PubMed]
- Liu, Yaohui, Peida Zhan, Yanbin Fu, Qipeng Chen, Kaiwen Man, and Yikun Luo. 2023b. Using a multi-strategy eye-tracking psychometric model to measure intelligence and identify cognitive strategy in Raven’s advanced progressive matrices. Intelligence 100: 101782. [Google Scholar] [CrossRef]
- Loesche, Patrick, Jennifer Wiley, and Marcus Hasselhorn. 2015. How knowing the rules affects solving the Raven Advanced Progressive Matrices Test. Intelligence 48: 58–75. [Google Scholar] [CrossRef]
- Lu, Pan, Liang Qiu, Wenhao Yu, Sean Welleck, and Kai-Wei Chang. 2022. A survey of deep learning for mathematical reasoning. arXiv arXiv:2212.10535. [Google Scholar]
- Luca, Massimiliano, Ciro Beneduce, Bruno Lepri, and Jacopo Staiano. 2025. The LLM wears Prada: Analysing gender bias and stereotypes through online shopping data. arXiv arXiv:2504.01951. [Google Scholar]
- Luecht, Richard M. 2005. Some useful cost-benefit criteria for evaluating computer-based test delivery models and systems. Association of Test Publishers Journal 7. Available online: http://jattjournal.net/index.php/atp/article/view/48338 (accessed on 3 March 2025).
- Matteucci, Mariagiulia, Stefania Mignani, and Bernard P. Veldkamp. 2012. The use of predicted values for item parameters in item response theory models: An application in intelligence tests. Journal of Applied Statistics 39: 2665–83. [Google Scholar] [CrossRef]
- Matton, Nadine, Stéphane Vautier, and Éric Raufaste. 2011. Test-specificity of the advantage of retaking cognitive ability tests. International Journal of Selection and Assessment 19: 11–17. [Google Scholar] [CrossRef]
- McCoy, R. Thomas, Paul Smolensky, Tal Linzen, Jianfeng Gao, and Asli Celikyilmaz. 2023. How much do language models copy from their training data? Evaluating linguistic novelty in text generation using RAVEN. Transactions of the Association for Computational Linguistics 11: 652–70. [Google Scholar] [CrossRef]
- Messick, Samuel. 1982. Issues of effectiveness and equity in the coaching controversy: Implications for educational testing and practice. Educational Psychologist 17: 67–91. [Google Scholar] [CrossRef]
- Messick, Samuel, and Ann Jungeblut. 1981. Time and method in coaching for the SAT. Psychological Bulletin 89: 191–216. [Google Scholar] [CrossRef]
- Mislevy, Robert J., and Michelle M. Riconscente. 2006. Evidence-centered assessment design: Layers, concepts, and terminology. In Handbook of Test Development. Edited by Steven Downing and Thomas Haladyna. Mahwah: Lawrence Erlbaum Associates, pp. 61–90. [Google Scholar]
- Mislevy, Robert J., Kathleen M. Sheehan, and Marilyn Wingersky. 1993. How to equate tests with little or no data. Journal of Educational Measurement 30: 55–78. [Google Scholar] [CrossRef]
- Mislevy, Robert J., Russell G. Almond, and Janice F. Lukas. 2003. A Brief Introduction to Evidence-Centered Design (Research Report: RR-03-16). Princeton: Educational Testing Service. [Google Scholar]
- Morley, Mary E., Brent Bridgeman, and René R. Lawless. 2004. Transfer Between Variants of Quantitative Items (GRE Board Rep. No. 00-06R). Princeton: ETS. [Google Scholar]
- Nemec, Eric C., and Beth Welch. 2016. The impact of a faculty development seminar on the quality of multiple-choice questions. Currents in Pharmacy Teaching and Learning 8: 160–63. [Google Scholar] [CrossRef]
- OpenAI. 2023. GPT-4 technical report. arXiv arXiv:2303.08774. [Google Scholar]
- O’Reilly, Tenaha, Gary Feng, John Sabatini, Zuowei Wang, and Joanna Gorin. 2018. How do people read the passages during a reading comprehension test? The effect of reading purpose on text processing behavior. Educational Assessment 23: 277–95. [Google Scholar] [CrossRef]
- Park, Julie J., and Ann H. Becks. 2015. Who benefits from SAT prep? An examination of high school context and race/ethnicity. The Review of Higher Education 39: 1–23. [Google Scholar] [CrossRef]
- Piromsombat, Chayut. 2014. Differential Item Functioning in Computerized Adaptive Testing: Can CAT Self-Adjust Enough? (Publication No. 3620715). Doctoral dissertation, University of Minnesota, Minneapolis, MN, USA. [Google Scholar]
- Powers, Donald E. 2005. Effects of Pre-Examination Disclosure of Essay Prompts for the GRE Analytical Writing Assessment (Research Report: RR-05–01). Princeton: Educational Testing Service. [Google Scholar]
- Powers, Donald E. 2012. Understanding the Impact of Special Preparation for Admissions Tests. In Advancing Human Assessment: The Methodological, Psychological and Policy Contributions of ETS. ETS Research Report Series; Cham: Springer International Publishing. [Google Scholar]
- Powers, Donald E., and Donald A. Rock. 1999. Effects of coaching on SAT I: Reasoning scores. Journal of Educational Measurement 36: 93–118. [Google Scholar] [CrossRef]
- Powers, Donald E., and Donald L. Alderman. 1983. Effects of test familiarization on SAT performance. Journal of Educational Measurement 20: 71–79. [Google Scholar] [CrossRef]
- Primi, Ricardo. 2002. Complexity of geometric inductive reasoning tasks: Contribution to the understanding of fluid intelligence. Intelligence 30: 41–70. [Google Scholar] [CrossRef]
- Primi, Ricardo. 2014. Developing a fluid intelligence scale through a combination of Rasch modeling and cognitive psychology. Psychological Assessment 26: 774–88. [Google Scholar] [CrossRef] [PubMed]
- Raffel, Colin, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research 21: 5485–551. [Google Scholar]
- Rajeb, Mehdi, Andrew T. Krist, Qingzhou Shi, Daniel O. Oyeniran, Stefanie A. Wind, and Joni M. Lakin. 2024. Mental rotation performance: Contribution of item features to difficulties and functional adaptation. Journal of Intelligence 13: 2. [Google Scholar] [CrossRef]
- Ranjan, Rajesh, Shailja Gupta, and Saranyan N. Singh. 2024. Gender Biases in LLMs: Higher intelligence in LLM does not necessarily solve gender bias and stereotyping. arXiv arXiv:2409.19959. [Google Scholar]
- Reckase, Mark D. 2010. Designing item pools to optimize the functioning of a computerized adaptive test. Psychological Test and Assessment Modeling 52: 127–41. [Google Scholar]
- Reckase, Mark D., Unhee Ju, and Sewon Kim. 2019. How adaptive is an adaptive test: Are all adaptive tests adaptive? Journal of Computerized Adaptive Testing 7: 1–14. [Google Scholar] [CrossRef]
- Reeve, Charlie L., and Holly Lam. 2005. The psychometric paradox of practice effects due to retesting: Measurement invariance and stable ability estimates in the face of observed score changes. Intelligence 33: 535–49. [Google Scholar] [CrossRef]
- Ren, Xuezhu, Frank Goldhammer, Helfried Moosbrugger, and Karl Schweizer. 2012. How does attention relate to the ability-specific and position-specific components of reasoning measured by APM? Learning and Individual Differences 22: 1–7. [Google Scholar] [CrossRef]
- Reynolds, Laria, and Kyle McDonell. 2021. Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems. New York: Association for Computing Machinery, pp. 1–7. [Google Scholar]
- Reza, Mohi, Ioannis Anastasopoulos, Shreya Bhandari, and Zachary A. Pardos. 2024. PromptHive: Bringing subject matter experts back to the forefront with collaborative prompt engineering for educational content creation. arXiv arXiv:2410.16547. [Google Scholar]
- Riedel, Maximilian, Katharina Kaefinger, Antonia Stuehrenberg, Viktoria Ritter, Niklas Amann, Anna Graf, Florian Recker, Evelyn Klein, Marion Kiechle, Fabian Riedel, and et al. 2023. ChatGPT’s performance in German OB/GYN exams—Paving the way for AI-enhanced medical education and clinical practice. Frontiers in Medicine 10: 129661. [Google Scholar] [CrossRef]
- Rogausch, Anja, Rainer Hofer, and René Krebs. 2010. Rarely selected distractors in high stakes medical multiple-choice examinations and their recognition by item authors: A simulation and survey. BMC Medical Education 10: 85. [Google Scholar] [CrossRef]
- Roid, Gale H., and Thomas M. Haladyna. 1982. Toward a Technology of Test-Item Writing. New York: Academic. [Google Scholar]
- Runge, Andrew, Yigal Attali, Geoffrey T. LaFlair, Yena Park, and Jaqueline Church. 2024. A generative AI-driven interactive listening assessment task. Frontiers in Artificial Intelligence 7: 1474019. [Google Scholar] [CrossRef]
- Ryoo, Ji H., Sunhee Park, Hongwook Suh, Jaehwa Choi, and Jongkyum Kwon. 2022. Development of a new measure of cognitive ability using automatic item generation and its psychometric properties. SAGE Open 12: 1–13. [Google Scholar] [CrossRef]
- Sahin, Alper, and Duygu Anil. 2017. The effects of test length and sample size on item parameters in item response theory. Educational Sciences: Theory and Practice 17: 321–35. [Google Scholar] [CrossRef]
- Sahin Kursad, Merve, and Seher Yalcin. 2024. Effect of differential item functioning on computer adaptive testing under different conditions. Applied Psychological Measurement 48: 303–22. [Google Scholar] [CrossRef] [PubMed]
- Sahoo, Pranab, Ayush K. Singh, Sriparna Saha, Vinija Jain, Samrat Mondal, and Aman Chadha. 2024. A systematic survey of prompt engineering in large language models: Techniques and applications. arXiv arXiv:2402.07927. [Google Scholar]
- Sayin, Ayfer, and Mark J. Gierl. 2023. Automatic item generation for online measurement and evaluation: Turkish literature items. International Journal of Assessment Tools in Education 10: 218–31. [Google Scholar] [CrossRef]
- Sayin, Ayfer, and Mark J. Gierl. 2024. Using OpenAI GPT to generate reading comprehension items. Educational Measurement: Issues and Practice 43: 5–18. [Google Scholar] [CrossRef]
- Sayın, Ayfer, and Okan Bulut. 2024. The difference between estimated and perceived item difficulty: An empirical study. International Journal of Assessment Tools in Education 11: 368–87. [Google Scholar] [CrossRef]
- Sayın, Ayfer, and Sebahat Gören. 2023. Comparing estimated and real item difficulty using multi-facet Rasch analysis. Journal of Measurement and Evaluation in Education and Psychology 14: 440–54. [Google Scholar] [CrossRef]
- Säuberli, Andreas, and Simon Clematide. 2024. Automatic generation and evaluation of reading comprehension test items with large language models. arXiv arXiv:2404.07720. [Google Scholar]
- Scharfen, Jana, Judith M. Peters, and Heinz Holling. 2018. Retest effects in cognitive ability tests: A meta-analysis. Intelligence 67: 44–66. [Google Scholar] [CrossRef]
- Schneider, Benedikt, and Jörn R. Sparfeldt. 2021a. How to get better: Taking notes mediates the effect of a video tutorial on number series. Journal of Intelligence 9: 55. [Google Scholar] [CrossRef]
- Schneider, Benedikt, and Jörn R. Sparfeldt. 2021b. How to solve number series items: Can watching video tutorials increase test scores? Intelligence 87: 101547. [Google Scholar] [CrossRef]
- Schneider, Benedikt, Nicolas Becker, Florian Krieger, Frank M. Spinath, and Jörn R. Sparfeldt. 2020. Teaching the underlying rules of figural matrices in a short video increases test scores. Intelligence 82: 101473. [Google Scholar] [CrossRef]
- Schroeders, Ulrich, and Priscilla Achaa-Amankwaa. 2025. Developing NOVA: Next-generation open vocabulary assessment. Unpublished manuscript. [Google Scholar]
- Schroeders, Ulrich, and Timo Gnambs. 2025. Sample-size planning in item-response theory: A tutorial. Advances in Methods and Practices in Psychological Science 8: 25152459251314798. [Google Scholar] [CrossRef]
- Schulhoff, Sander, Michael Ilie, Nishant Balepur, Konstantine Kahadze, Amanda Liu, Chenglei Si, and Philip Resnik. 2024. The prompt report: A systematic survey of prompting techniques. arXiv arXiv:2406.06608. [Google Scholar]
- Schulze Balhorn, Lukas, Jana M. Weber, Stefan Buijsman, Julian R. Hildebrandt, Martina Ziefle, and Artur M. Schweidtmann. 2024. Empirical assessment of ChatGPT’s answering capabilities in natural science and engineering. Scientific Reports 14: 4998. [Google Scholar] [CrossRef]
- Segall, Daniel O. 2004. A sharing item response theory model for computerized adaptive testing. Journal of Educational and Behavioral Statistics 29: 439–60. [Google Scholar] [CrossRef]
- Selvi, Hüseyin. 2020. Should items and answer keys of small-scale exams be published? Higher Education Studies 10: 107–13. [Google Scholar] [CrossRef]
- Shi, Qingzhou, Stefanie A. Wind, and Joni M. Lakin. 2023. Exploring the influence of item characteristics in a spatial reasoning task. Journal of Intelligence 11: 152. [Google Scholar] [CrossRef] [PubMed]
- Shin, Jinnie, and Mark J. Gierl. 2022. Generating reading comprehension items using automated processes. International Journal of Testing 22: 289–311. [Google Scholar] [CrossRef]
- Shultz, Benjamin, Robert J. DiDomenico, Kristen Goliak, and Jeffrey Mucksavage. 2025. Exploratory assessment of GPT-4’s effectiveness in generating valid exam items in pharmacy education. American Journal of Pharmaceutical Education 89: 101405. [Google Scholar] [CrossRef]
- Siegler, Robert S. 1996. Emerging Minds: The Process of Change in Children’s Thinking. New York: Oxford University Press. [Google Scholar]
- Sinharay, Sandip. 2017. Which statistic should be used to detect item pre-knowledge when the set of compromised items is known? Applied Psychological Measurement 41: 403–21. [Google Scholar] [CrossRef]
- Sinharay, Sandip, and Matthew S. Johnson. 2008. Use of item models in a large-scale admissions test: A case study. International Journal of Testing 8: 209–36. [Google Scholar] [CrossRef]
- Sinharay, Sandip, Matthew S. Johnson, and David M. Williamson. 2003. Calibrating item families and summarizing the results using family expected response functions. Journal of Educational and Behavioral Statistics 28: 295–313. [Google Scholar] [CrossRef]
- Smaldino, Paul E. 2020. How to build a strong theoretical foundation. Psychological Inquiry 31: 297–301. [Google Scholar] [CrossRef]
- Sobieszek, Adam, and Tadeusz Price. 2022. Playing games with AIs: The limits of GPT-3 and similar large language models. Minds and Machines 32: 341–64. [Google Scholar] [CrossRef]
- Someshwar, Shonai. 2024. Quality Control and the Impact of Variation and Prediction Errors on Item Family Design. Doctoral dissertation, The University of North Carolina at Greensboro, Greensboro, NC, USA. [Google Scholar]
- Sommer, Markus, Margit Herle, Joachim Häusler, and Martin Arendasy. 2009. Von TAVTMB zu ATAVT: Eine Anwendung der Automatisierten Itemgenerierung unter einschränkenden Rahmenbedingungen. In Zweites Österreichisches Symposium für Psychologie im Militär. Edited by Georg Ebner and Günther Fleck. Wien: Schriftenreihe der Landesverteidigungsakademie, pp. 27–52. [Google Scholar]
- Sommer, Markus, Martin E. Arendasy, Joachim F. Punter, Martina Feldhammer-Kahr, and Anita Rieder. 2025. Does test preparation mediate the effect of parents’ level of educational attainment on medical school admission test performance? Intelligence 108: 101893. [Google Scholar] [CrossRef]
- Song, Yishen, Junlei Du, and Qinhua Zheng. 2025. Automatic item generation for educational assessments: A systematic literature review. Interactive Learning Environments, 1–20. [Google Scholar] [CrossRef]
- Stricker, Lawrence J. 1984. Test disclosure and retest performance on the SAT. Applied Psychological Measurement 8: 81–87. [Google Scholar] [CrossRef]
- Su, Mei-Chin, Li-En Lin, Li-Hwa Lin, and Yu-Chun Chen. 2024. Assessing question characteristic influences on ChatGPT’s performance and response-explanation consistency: Insights from Taiwan’s nursing licensing exam. International Journal of Nursing Studies 153: 104717. [Google Scholar] [CrossRef] [PubMed]
- Sun, Luning, Yanan Liu, and Fang Luo. 2019. Automatic generation of number series reasoning items of high difficulty. Frontiers in Psychology 10: 884. [Google Scholar] [CrossRef]
- Svetina, Dubravka, Joanna Gorin, and Kikumi K. Tatsuoka. 2011. Defining and comparing the reading comprehension construct: A cognitive-psychometric modeling approach. International Journal of Testing 11: 1–23. [Google Scholar] [CrossRef]
- Sydorenko, Tetyana. 2011. Item writer judgments of item difficulty versus actual item difficulty: A case study. Language Assessment Quarterly 8: 34–52. [Google Scholar] [CrossRef]
- Tan, Bin, Nour Armoush, Elisabetta Mazzullo, Okan Bulut, and Mark J. Gierl. 2024. A review of automatic item generation techniques leveraging large language models. EdArXiv. [Google Scholar] [CrossRef]
- te Nijenhuis, Jan, Annelies E. M. van Vianen, and Henk van der Flier. 2007. Score gains on g-loaded tests: No g. Intelligence 35: 283–300. [Google Scholar] [CrossRef]
- Thakur, Vishesh. 2023. Unveiling gender bias in terms of profession across LLMs: Analyzing and addressing sociological implications. arXiv arXiv:2307.09162. [Google Scholar]
- Tian, Chen, and Jaehwa Choi. 2023. The impact of item model parameter variations on person parameter estimation in computerized adaptive testing with automatically generated items. Applied Psychological Measurement 47: 275–90. [Google Scholar] [CrossRef]
- Touvron, Hugo, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, and Guillaume Lample. 2023. Llama: Open and efficient foundation language models. arXiv arXiv:2302.13971. [Google Scholar]
- van der Linden, Wim J., and Cees A. Glas. 2010. Elements of Adaptive Testing. New York: Springer. [Google Scholar]
- van der Maas, Han L., Lukas Snoek, and Claire E. Stevenson. 2021. How much intelligence is there in artificial intelligence? A 2020 update. Intelligence 87: 101548. [Google Scholar] [CrossRef]
- Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz U. Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. Edited by Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna Wallach, Rob Fergus, Vichy SVN Vishwanathan and Roman Garnett. Red Hook: Curran Associates, pp. 5998–6008. [Google Scholar]
- Veerkamp, Wim J., and Cees A. W. Glas. 2000. Detection of known items in adaptive testing with a statistical quality control method. Journal of Educational and Behavioral Statistics 25: 373–89. [Google Scholar] [CrossRef]
- Veldkamp, Bernard P., and Wim J. van der Linden. 2010. Designing item pools for adaptive testing. In Computerized Adaptive Testing: Theory and Practice. Edited by Wim J. van der Linden and Cees A. W. Glas. New York: Springer, pp. 149–62. [Google Scholar]
- Verguts, Tom, and Paul De Boeck. 2002. The induction of solution rules in Raven’s Progressive Matrices. European Journal of Cognitive Psychology 14: 521–47. [Google Scholar] [CrossRef]
- Vigneau, François, André F. Caissie, and Douglas A. Bors. 2006. Eye-movement analysis demonstrates strategic influences on intelligence. Intelligence 34: 261–72. [Google Scholar] [CrossRef]
- von Davier, Matthias. 2018. Automated item generation with recurrent neural networks. Psychometrika 83: 847–57. [Google Scholar] [CrossRef]
- von Davier, Matthias. 2019. Training Optimus Prime, M.D.: Generating medical certification items by fine-tuning OpenAI’s gpt2 transformer model. arXiv arXiv:1908.08594. [Google Scholar]
- Wagner-Menghin, Michaela, Ingrid Preusche, and Michael Schmidts. 2013. The effects of reusing written test items: A study using the Rasch model. ISRN Education 2013: 585420. [Google Scholar] [CrossRef]
- Wainer, Howard. 2002. On the automatic generation of items: Some whens, whys and hows. In Item Generation for Test Development. Edited by Sidney H. Irvine and Patrick C. Kyllonen. Mahwah: Lawrence Erlbaum, pp. 287–316. [Google Scholar]
- Waldock, William J., Joe Zhang, Ahmad Guni, Ahmad Nabeel, Ara Darzi, and Hutan Ashrafian. 2024. The accuracy and capability of artificial intelligence solutions in health care examinations and certificates: Systematic review and meta-analysis. Journal of Medical Internet Research 26: e56532. [Google Scholar] [CrossRef]
- Wancham, Kittitas, Kamonwan Tangdhanakanond, and Sirichai Kanjanawasee. 2023. Development of the automatic item generation system for the diagnosis of misconceptions about force and laws of motion. Eurasia Journal of Mathematics, Science and Technology Education 19: em2282. [Google Scholar] [CrossRef]
- Wang, Yi, Qian Zhou, and David Ledo. 2024. StoryVerse: Towards co-authoring dynamic plot with LLM-based character simulation via narrative planning. Paper presented at 19th International Conference on the Foundations of Digital Games, Worcester, MA, USA, May 21–24. [Google Scholar]
- Webb, Emily M., Jonathan S. Phuong, and David M. Naeger. 2015. Does educator training or experience affect the quality of multiple-choice questions? Academic Radiology 22: 1317–22. [Google Scholar] [CrossRef]
- Weppert, Daniel, Dorothee Amelung, Malvin Escher, Leander Troll, Martina Kadmon, Lena Listunova, and Jana Montasser. 2023. The impact of preparatory activities on the largest clinical aptitude test for prospective medical students in Germany. Frontiers in Education 8: 1104464. [Google Scholar] [CrossRef]
- Witt, Elizabeth A. 1993. Meta-analysis and the effects of coaching for aptitude tests. Paper presented at the Annual Meeting of the American Educational Research Association, Atlanta, GA, USA, April 12–16. [Google Scholar]
- Wonde, Shewatatek G., Tefera Tadesse, Belay Moges, and Stefan K. Schauber. 2024. Experts’ prediction of item difficulty of multiple-choice questions in the Ethiopian Undergraduate Medicine Licensure Examination. BMC Medical Education 24: 1016. [Google Scholar] [CrossRef]
- Wood, Timothy J. 2009. The effect of reused questions on repeat examinees. Advances in Health Sciences Education 14: 465–73. [Google Scholar] [CrossRef]
- Wood, Timothy J., Christina St-Onge, André-Philippe Boulais, David E. Blackmore, and Thomas O. Maguire. 2010. Identifying the unauthorized use of examination material. Evaluation and the Health Professions 33: 96–108. [Google Scholar] [CrossRef]
- Yang, Eunbae B., Myung A. Lee, and Yoon S. Park. 2018. Effects of test item disclosure on medical licensing examination. Advances in Health Sciences Education 23: 265–74. [Google Scholar] [CrossRef] [PubMed]
- Yang, Jingfeng, Hongye Jin, Ruixiang Tang, Xiaotian Han, Qizhang Feng, Haoming Jiang, and Xia Hu. 2024. Harnessing the power of LLMs in practice: A survey on ChatGPT and beyond. ACM Transactions on Knowledge Discovery from Data 18: 1–32. [Google Scholar] [CrossRef]
- Yang, Yuan, and Mathilee Kunda. 2023. Computational models of solving Raven’s Progressive Matrices: A comprehensive introduction. arXiv arXiv:2302.04238. [Google Scholar]
- Yang, Yuan, Deepayan Sanyal, Joel Michelson, James Ainooson, and Mathilee Kunda. 2022. Automatic item generation of figural analogy problems: A review and outlook. arXiv arXiv:2201.08450. [Google Scholar]
- Yi, Qing, Jinming Zhang, and Hua-Hua Chang. 2008. Severity of organized item theft in computerized adaptive testing: A simulation study. Applied Psychological Measurement 32: 543–58. [Google Scholar] [CrossRef]
- Yu, Jiayuan. 1994. Homogeneity of problem solving strategies and the fitting of the linear logistic model. Acta Psychologica Sinica 26: 219–24. [Google Scholar]
- Zenisky, April, Ronald K. Hambleton, and Richard M. Luecht. 2010. Multistage testing: Issues, designs, and research. In Elements of Adaptive Testing. Edited by Wim J. van der Linden and Cees A. Glas. New York: Springer, pp. 355–72. [Google Scholar]
- Zha, Daochen, Zaid P. Bhat, Kwei-Herng Lai, Fan Yang, Zhimeng Jiang, Shaochen Zhong, and Xia Hu. 2025. Data-centric artificial intelligence: A survey. ACM Computing Surveys 57: 1–42. [Google Scholar] [CrossRef]
- Zhang, Jinming, and Hua-Hua Chang. 2005. The Effectiveness of Enhancing Test Security by Using Multiple Item Pools (ETS RR-05-19). Princeton: ETS. [Google Scholar]
- Zhang, Jinming, Hua-Hua Chang, and Qing Yi. 2012. Comparing single-pool and multiple-pool designs regarding test security in computerized testing. Behavior Research Methods 44: 742–52. [Google Scholar] [CrossRef] [PubMed]
- Zickar, Michael J. 2020. Measurement development and evaluation. Annual Review of Organizational Psychology and Organizational Behavior 7: 213–32. [Google Scholar] [CrossRef]
- Zimmer, Felix, Mirka Henninger, and Rudolf Debelak. 2024. Sample size planning for complex study designs: A tutorial for the mlpwr package. Behavior Research Methods 56: 5246–63. [Google Scholar] [CrossRef] [PubMed]
- Zimmermann, Stefan, Dietrich Klusmann, and Wolfgang Hampe. 2016. Are exam questions known in advance? Using local dependence to detect cheating. PLoS ONE 11: e0167545. [Google Scholar] [CrossRef]
- Zorowitz, Samuel, Gabriele Chierchia, Sarah-Jayne Blakemore, and Nathaniel D. Daw. 2024. An item response theory analysis of the matrix reasoning item bank (MaRs-IB). Behavior Research Methods 56: 1104–22. [Google Scholar] [CrossRef]
- Zu, Jiyun, Ikkyu Choi, and Jiangang Hao. 2023. Automated distractor generation for fill-in-the-blank items using a prompt-based learning approach. Psychological Testing and Assessment Modeling 65: 55–75. Available online: https://www.psychologie-aktuell.com/fileadmin/Redaktion/Journale/ptam_2023-1/PTAM__1-2023_3_kor.pdf (accessed on 3 March 2025).
- Zwick, Rebecca. 2002. Is the SAT a ‘wealth test’? Phi Delta Kappan 84: 307–11. [Google Scholar] [CrossRef]
- Zwick, Rebecca, Dorothy T. Thayer, and Marilyn Wingersky. 1995. Effect of Rasch calibration on ability and DIF estimation in computer-adaptive tests. Journal of Educational Measurement 32: 341–63. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).