Controlling Rater Effects in Divergent Thinking Assessment: An Item Response Theory Approach to Individual Response and Snapshot Scoring
Abstract
1. Introduction
1.1. From Individual Response Scoring to Snapshot Scoring
1.2. Item Response Theory Applied to DT Scoring
1.3. Rationale of the Study
2. Materials and Methods
2.1. Participants
2.2. Creativity Measures
2.3. Analysis Strategy
3. Results
3.1. Aim 1: Applying IRT to Individual Response Scoring and Snapshot Scoring
3.2. Aim 2: Comparing Scoring Approaches
3.3. Aim 3: Simulating Missing Data
4. Discussion
Limitations and Future Directions
5. Conclusions
Supplementary Materials
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Conflicts of Interest
Abbreviations
Abbreviation | Meaning
---|---
DT | Divergent Thinking
IRT | Item Response Theory
AUT | Alternate Uses Task
JRT | Judge Response Theory
ICC | Intraclass Correlation Coefficient
MFRM | Many-Facet Rasch Model
AIC | Akaike Information Criterion
BIC | Bayesian Information Criterion
JCC | Judge Category Curves
TIF | Test Information Function
EAP | Expected A Posteriori
GRM | Graded Response Model
References
Model | AIC (Individual Response Scoring) | BIC (Individual Response Scoring) | AIC (Snapshot Scoring) | BIC (Snapshot Scoring)
---|---|---|---|---
Rating Scale Model | 34,925.89 | 34,971.16 | 6669.37 | 6708.98
Generalized Rating Scale Model | 34,299.33 | 34,363.88 | 6635.26 | 6696.89
Partial Credit Model | 33,721.05 | 33,805.11 | 6619.11 | 6711.55
Generalized Partial Credit Model | 33,710.67 | 33,807.66 | 6607.11 | 6717.16
Constrained Graded Rating Scale Model | 33,705.17 | 35,006.43 | 6657.57 | 6697.18
Graded Rating Scale Model | 34,262.57 | 34,320.77 | 6625.81 | 6683.03
Constrained Graded Response Model | 33,705.17 | 33,789.23 | 6603.03 | 6695.47
Graded Response Model | 33,703.75 | 33,800.75 | 6602.47 | 6712.52
Many Facet Rasch Model | 34,863.84 | 34,928.51 | 7137.61 | 7177.25
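
For readers who want to run this kind of model comparison themselves, the sketch below fits several of the polytomous IRT models listed above to a persons-by-raters matrix of ordinal ratings and collects their AIC and BIC values. It uses the mirt package and a hypothetical `ratings` matrix; it is an illustrative sketch under those assumptions, not the authors' exact analysis pipeline, and it covers only four of the candidate models.

```r
# Minimal sketch: compare polytomous IRT models for rater data via AIC/BIC.
# 'ratings' is a hypothetical persons x raters matrix of ordinal scores.
library(mirt)

fits <- list(
  rsm  = mirt(ratings, model = 1, itemtype = "rsm",    verbose = FALSE),  # Rating Scale Model
  pcm  = mirt(ratings, model = 1, itemtype = "Rasch",  verbose = FALSE),  # Partial Credit Model
  gpcm = mirt(ratings, model = 1, itemtype = "gpcm",   verbose = FALSE),  # Generalized Partial Credit Model
  grm  = mirt(ratings, model = 1, itemtype = "graded", verbose = FALSE)   # Graded Response Model
)

# Information criteria for each fitted model, analogous to the table above
data.frame(
  AIC = sapply(fits, extract.mirt, what = "AIC"),
  BIC = sapply(fits, extract.mirt, what = "BIC")
)
```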
Scoring Approach | Rater | a | d1 | d2 | d3 | d4 | Outfit | Infit
---|---|---|---|---|---|---|---|---
Individual Response Scoring | Rater 1 | 2.26 | 1.83 | −0.87 | −3.12 | −5.48 | 0.68 | 0.70
Individual Response Scoring | Rater 2 | 2.26 | 3.43 | −1.07 | −5.33 | −8.14 | 0.77 | 0.78
Individual Response Scoring | Rater 3 | 2.51 | 4.24 | 1.34 | −1.21 | −3.84 | 0.64 | 0.65
Snapshot Scoring | Rater 1 | 2.71 | 5.09 | 1.20 | −1.26 | −4.45 | 0.81 | 0.82
Snapshot Scoring | Rater 2 | 3.24 | 7.39 | 2.36 | −1.43 | −5.70 | 0.75 | 0.77
Snapshot Scoring | Rater 3 | 2.84 | 7.27 | 2.44 | −1.14 | −5.29 | 0.80 | 0.80
Snapshot Scoring | Rater 4 | 2.41 | 4.17 | 1.18 | −1.86 | −5.10 | 0.84 | 0.85
Snapshot Scoring | Rater 5 | 2.80 | 5.05 | 0.72 | −2.47 | −5.33 | 0.79 | 0.80
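
The rater parameters reported above (slopes a, intercepts d1 to d4) and the infit/outfit mean squares can be obtained from a fitted graded response model roughly as sketched below. The sketch reuses the hypothetical `fits$grm` object from the previous example and is illustrative rather than a reproduction of the authors' workflow.

```r
# Minimal sketch: rater ("item") parameters, fit statistics, and person scores
# from the graded response model fitted in the previous sketch (fits$grm).
coef(fits$grm, simplify = TRUE)$items        # slopes (a) and intercepts (d1-d4) per rater
itemfit(fits$grm, fit_stats = "infit")       # infit and outfit mean squares per rater
theta <- fscores(fits$grm, method = "EAP")   # EAP estimates of the latent person score
head(theta)
```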
Score | α | ω | H
---|---|---|---
Fluency | 0.85 [0.80; 0.88] | 0.85 [0.80; 0.89] | 0.87 [0.83; 0.92]
Complete dataset | | |
Average scoring | 0.67 [0.57; 0.75] | 0.67 [0.57; 0.75] | 0.69 [0.60; 0.78]
IRT-adjusted average scoring | 0.68 [0.57; 0.76] | 0.68 [0.58; 0.76] | 0.68 [0.60; 0.78]
Max-1 scoring | 0.56 [0.43; 0.66] | 0.56 [0.43; 0.65] | 0.56 [0.46; 0.74]
Max-2 scoring | 0.65 [0.55; 0.73] | 0.65 [0.56; 0.73] | 0.66 [0.59; 0.78]
Max-3 scoring | 0.70 [0.62; 0.77] | 0.71 [0.63; 0.77] | 0.72 [0.65; 0.84]
Max-4 scoring | 0.72 [0.64; 0.79] | 0.72 [0.64; 0.79] | 0.73 [0.66; 0.82]
Max-5 scoring | 0.73 [0.64; 0.79] | 0.73 [0.64; 0.78] | 0.73 [0.67; 0.81]
Snapshot scoring | 0.74 [0.66; 0.80] | 0.74 [0.66; 0.80] | 0.74 [0.67; 0.83]
IRT-adjusted snapshot scoring | 0.74 [0.66; 0.80] | 0.74 [0.64; 0.80] | 0.74 [0.67; 0.83]
Simulated missingness dataset | | |
Average scoring | 0.64 [0.52; 0.73] | 0.64 [0.53; 0.73] | 0.65 [0.56; 0.79]
IRT-adjusted average scoring | 0.64 [0.52; 0.72] | 0.65 [0.54; 0.73] | 0.65 [0.56; 0.77]
Max-1 scoring | 0.50 [0.36; 0.61] | 0.50 [0.37; 0.61] | 0.52 [0.42; 0.95]
Max-2 scoring | 0.59 [0.47; 0.67] | 0.60 [0.49; 0.68] | 0.61 [0.54; 0.80]
Max-3 scoring | 0.67 [0.58; 0.74] | 0.67 [0.57; 0.74] | 0.68 [0.60; 0.80]
Max-4 scoring | 0.68 [0.58; 0.75] | 0.68 [0.59; 0.75] | 0.68 [0.60; 0.77]
Max-5 scoring | 0.71 [0.62; 0.78] | 0.71 [0.62; 0.77] | 0.71 [0.64; 0.80]
Snapshot scoring | 0.69 [0.59; 0.76] | 0.69 [0.59; 0.76] | 0.69 [0.60; 0.78]
IRT-adjusted snapshot scoring | 0.69 [0.60; 0.76] | 0.69 [0.59; 0.76] | 0.70 [0.62; 0.79]
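
The three reliability coefficients reported in this table (Cronbach's α, McDonald's ω, and Hancock and Mueller's H) can be computed from a single-factor model of the task scores. The sketch below uses lavaan with hypothetical item names (aut1 to aut3) and assumes a congeneric one-factor model with uncorrelated errors; it is not the authors' code.

```r
# Minimal sketch: alpha, omega, and coefficient H for a set of task scores.
# 'scores' is a hypothetical data frame with columns aut1, aut2, aut3.
library(lavaan)

cronbach_alpha <- function(x) {
  k <- ncol(x)
  S <- cov(x, use = "pairwise.complete.obs")
  k / (k - 1) * (1 - sum(diag(S)) / sum(S))
}

fit <- cfa("g =~ aut1 + aut2 + aut3", data = scores, std.lv = TRUE)
lam <- standardizedSolution(fit)
lam <- lam$est.std[lam$op == "=~"]                  # standardized factor loadings

alpha <- cronbach_alpha(scores[, c("aut1", "aut2", "aut3")])
omega <- sum(lam)^2 / (sum(lam)^2 + sum(1 - lam^2)) # McDonald's omega
H     <- 1 / (1 + 1 / sum(lam^2 / (1 - lam^2)))     # Hancock & Mueller's H
round(c(alpha = alpha, omega = omega, H = H), 2)
```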
Variable | 1. | 2. | 3. | 4. | 5. | 6. | 7. | 8. | 9.
---|---|---|---|---|---|---|---|---|---
1. Average scoring | - | | | | | | | |
2. IRT-adjusted average scoring | >0.99 ** | - | | | | | | |
3. Max-1 scoring | 0.75 ** | 0.74 ** | - | | | | | |
4. Max-2 scoring | 0.76 ** | 0.75 ** | 0.96 ** | - | | | | |
5. Max-3 scoring | 0.74 ** | 0.74 ** | 0.93 ** | 0.99 ** | - | | | |
6. Max-4 scoring | 0.74 ** | 0.73 ** | 0.92 ** | 0.97 ** | 0.99 ** | - | | |
7. Max-5 scoring | 0.76 ** | 0.76 ** | 0.92 ** | 0.96 ** | 0.98 ** | 0.99 ** | - | |
8. Snapshot scoring | 0.79 ** | 0.79 ** | 0.85 ** | 0.86 ** | 0.85 ** | 0.84 ** | 0.85 ** | - |
9. IRT-adjusted snapshot scoring | 0.80 ** | 0.80 ** | 0.84 ** | 0.86 ** | 0.84 ** | 0.83 ** | 0.84 ** | >0.99 ** | -
10. Fluency | −0.02 | −0.02 | 0.40 ** | 0.48 ** | 0.54 ** | 0.57 ** | 0.55 ** | 0.29 ** | 0.28 *
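
The scoring variants correlated above can be derived from response-level ratings. The sketch below assumes a hypothetical long-format data frame `resp` with one row per response (columns `person` and `rating`, the mean rating across raters) and assumes that Max-k scoring denotes the mean of a person's k highest-rated responses; both assumptions are for illustration only.

```r
# Minimal sketch: fluency, average, and Max-3 scores from response-level ratings,
# followed by their intercorrelations (cf. the table above).
library(dplyr)

person_scores <- resp |>
  group_by(person) |>
  summarise(
    fluency = n(),                                            # number of responses
    average = mean(rating),                                   # average scoring
    max3    = mean(head(sort(rating, decreasing = TRUE), 3))  # Max-3 (top-3) scoring
  )

cor(person_scores[, c("average", "max3", "fluency")], use = "pairwise.complete.obs")
```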
Complete Dataset | Average Scoring (Simulated Dataset) | IRT-Adjusted Average Scoring (Simulated Dataset) | Snapshot Scoring (Simulated Dataset) | IRT-Adjusted Snapshot Scoring (Simulated Dataset)
---|---|---|---|---
Average scoring | 0.97 | 0.97 | 0.73 | 0.75
IRT-adjusted average scoring | 0.97 | 0.98 | 0.72 | 0.74
Snapshot scoring | 0.75 | 0.76 | 0.94 | 0.95
IRT-adjusted snapshot scoring | 0.75 | 0.76 | 0.94 | 0.94
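
As a rough illustration of the missing-data simulation summarized in this table, the sketch below randomly masks a share of the ratings, rescores persons on the remaining ratings, and correlates the resulting scores with the complete-data scores. The masking rate and the simple mean-based rescoring are assumptions made for illustration, not the study's actual design.

```r
# Minimal sketch: mask a random share of ratings, rescore, and compare with
# the complete-data scores. 'ratings' is the hypothetical persons x raters matrix.
set.seed(1)
masked <- as.matrix(ratings)
n_cells <- length(masked)
masked[sample(n_cells, size = round(n_cells / 3))] <- NA  # mask about one third of ratings

complete_score <- rowMeans(as.matrix(ratings), na.rm = TRUE)
masked_score   <- rowMeans(masked, na.rm = TRUE)
cor(complete_score, masked_score, use = "pairwise.complete.obs")
```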
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Share and Cite
Pellegrino, G.; Saretzki, J.; Benedek, M. Controlling Rater Effects in Divergent Thinking Assessment: An Item Response Theory Approach to Individual Response and Snapshot Scoring. J. Intell. 2025, 13, 69. https://doi.org/10.3390/jintelligence13060069