A Comparison of Linking Methods for Two Groups for the Two-Parameter Logistic Item Response Model in the Presence and Absence of Random Differential Item Functioning
Abstract
1. Introduction
2. Linking Two Groups with the 2PL Model
2.1. 2PL Model
2.2. Linking Design
2.3. Random Differential Item Functioning
2.3.1. Identified Item Parameters in Separate Calibrations in the Two Groups
2.3.2. The Role of Normally Distributed Random DIF in Educational Assessment
3. Linking Methods
3.1. Log-Mean-Mean Linking
3.2. Mean-Mean Linking
3.3. Haberman Linking (HAB and HAB-nolog)
3.4. Invariance Alignment with Power p = 2 (IA2)
3.5. Haebara Linking Methods (HAE-asymm, HAE-symm, HAE-joint)
3.6. Recalibration Linking (RC1, RC2, and RC3)
3.7. Anchored Item Parameters
3.8. Concurrent Calibration
4. Simulation Study
4.1. Purpose
4.2. Design
4.3. Analysis Methods
4.4. Results
5. Empirical Example: Linking PISA 2006 and PISA 2009 for Austria
5.1. Method
5.2. Results
6. Discussion
Funding
Data Availability Statement
Conflicts of Interest
Abbreviations
| Abbreviation | Meaning |
|---|---|
| 1PL | one-parameter logistic model |
| 2PL | two-parameter logistic model |
| ANCH | anchored item parameters |
| CC | concurrent calibration |
| DIF | differential item functioning |
| HAB | Haberman linking with logarithmized item discriminations |
| HAB-nolog | Haberman linking with untransformed item discriminations |
| HAE | Haebara linking |
| HAE-asymm | asymmetric Haebara linking |
| HAE-joint | Haebara linking with joint item parameters |
| HAE-symm | symmetric Haebara linking |
| IA2 | invariance alignment with power p = 2 |
| IRF | item response function |
| IRT | item response theory |
| logMM | log-mean-mean linking |
| LSA | large-scale assessment |
| MM | mean-mean linking |
| MML | marginal maximum likelihood |
| MSE | mean-squared error |
| NUDIF | nonuniform differential item functioning |
| PIRLS | Progress in International Reading Literacy Study |
| PISA | Programme for International Student Assessment |
| RC | recalibration linking |
| RMSE | root-mean-squared error |
| SD | standard deviation |
| TIMSS | Trends in International Mathematics and Science Study |
| UDIF | uniform differential item functioning |
Appendix A. Nonidentifiability of DIF Effects Distributions
Appendix A.1. DIF Effects for Item Difficulties
Appendix A.2. DIF Effects for Item Discriminations
Appendix B. Proof of Proposition 1
Appendix B.1. Consistency of Additive DIF Effects f_i with Condition (I)
Appendix B.2. Consistency for Multiplicative DIF Effects f_i with Condition (II)
Appendix C. Proof of Proposition 2
Appendix C.1. Consistency for Additive DIF Effects f_i with Condition (I)
Appendix C.2. Consistency for Multiplicative DIF Effects f_i with Condition (II)
Appendix D. Estimates in Haberman Linking
Appendix E. Estimates in Invariance Alignment
Appendix F. Item Parameters Used in the Simulation Study
| Item | Discrimination a_i | Difficulty b_i |
|---|---|---|
| 1 | 0.95 | −0.97 |
| 2 | 0.88 | 0.59 |
| 3 | 0.75 | 0.75 |
| 4 | 1.29 | −0.79 |
| 5 | 1.28 | 1.23 |
| 6 | 1.29 | −1.10 |
| 7 | 1.25 | −0.67 |
| 8 | 0.97 | 0.20 |
| 9 | 0.73 | 1.26 |
| 10 | 1.27 | 0.05 |
| 11 | 1.42 | 1.22 |
| 12 | 0.75 | −0.01 |
| 13 | 0.50 | 0.20 |
| 14 | 0.81 | 1.39 |
| 15 | 1.12 | 0.61 |
| 16 | 0.78 | −1.00 |
| 17 | 1.30 | −1.58 |
| 18 | 0.70 | −1.62 |
| 19 | 1.29 | 1.06 |
| 20 | 0.74 | −0.81 |
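For readers who want to reuse these generating parameters, the following is a minimal sketch (not code from the article) of how item responses can be simulated under the standard 2PL item response function, P(X_i = 1 | θ) = 1 / (1 + exp(−a_i(θ − b_i))), using the a_i and b_i values from the table above. The sample size and the N(0, 1) ability distribution of the reference group are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Generating item parameters from the table above
a = np.array([0.95, 0.88, 0.75, 1.29, 1.28, 1.29, 1.25, 0.97, 0.73, 1.27,
              1.42, 0.75, 0.50, 0.81, 1.12, 0.78, 1.30, 0.70, 1.29, 0.74])  # discriminations a_i
b = np.array([-0.97, 0.59, 0.75, -0.79, 1.23, -1.10, -0.67, 0.20, 1.26, 0.05,
              1.22, -0.01, 0.20, 1.39, 0.61, -1.00, -1.58, -1.62, 1.06, -0.81])  # difficulties b_i

n_persons = 1000                              # illustrative sample size
theta = rng.normal(0.0, 1.0, size=n_persons)  # abilities; N(0, 1) assumed for the reference group

# 2PL item response function: P(X = 1 | theta) = logistic(a * (theta - b))
logits = a[None, :] * (theta[:, None] - b[None, :])
prob = 1.0 / (1.0 + np.exp(-logits))

# Draw dichotomous responses (persons x items)
responses = (rng.uniform(size=prob.shape) < prob).astype(int)

print(responses.shape)         # (1000, 20)
print(responses.mean(axis=0))  # empirical proportions correct per item
```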
| Source | Bias (Mean) | RMSE (Mean) | Bias (SD) | RMSE (SD) |
|---|---|---|---|---|
| N | 0.3 | 1.1 | 0.6 | 3.9 |
| I | 0.0 | 0.3 | 0.0 | 0.0 |
| Meth | 10.2 | 14.9 | 19.1 | 0.0 |
|  | 13.0 | 0.0 | 0.8 | 1.8 |
|  | 4.3 | 9.0 | 12.3 | 0.0 |
| NI | 0.0 | 0.0 | 0.0 | 0.0 |
| NMeth | 0.0 | 3.7 | 0.8 | 0.0 |
| N | 0.0 | 2.4 | 0.0 | 0.0 |
| N | 0.0 | 0.6 | 0.0 | 5.0 |
| IMeth | 0.4 | 0.1 | 0.1 | 0.0 |
| I | 0.0 | 0.1 | 0.0 | 0.0 |
| I | 0.0 | 0.0 | 0.0 | 0.6 |
| Meth | 58.1 | 13.1 | 17.5 | 14.2 |
| Meth | 8.2 | 12.1 | 47.7 | 13.2 |
|  | 0.0 | 4.1 | 0.0 | 17.7 |
| NIMeth | 0.0 | 0.0 | 0.1 | 0.0 |
| NI | 0.0 | 0.2 | 0.0 | 0.0 |
| NI | 0.0 | 0.1 | 0.0 | 0.4 |
| NMeth | 0.2 | 7.5 | 0.0 | 4.2 |
| NMeth | 0.0 | 4.0 | 0.0 | 9.1 |
| N | 0.1 | 10.0 | 0.0 | 13.8 |
| IMeth | 0.5 | 0.0 | 0.0 | 0.4 |
| IMeth | 0.1 | 0.0 | 0.2 | 1.1 |
| I | 0.1 | 0.3 | 0.0 | 0.7 |
| Meth | 1.0 | 10.1 | 0.1 | 8.2 |
| Residual | 3.7 | 6.4 | 0.6 | 5.7 |
| Method | Bias (NODIF) | Bias (UDIF) | Bias (NUDIF) | RMSE (NODIF) | RMSE (UDIF) | RMSE (NUDIF) |
|---|---|---|---|---|---|---|
| logMM | 100 | 97 | 94 | 100 | 100 | 45 |
| HAB | 100 | 97 | 94 | 100 | 100 | 44 |
| MM | 100 | 94 | 95 | 92 | 100 | 72 |
| HAB-nolog | 100 | 94 | 96 | 100 | 100 | 78 |
| IA2 | 75 | 78 | 8 | 100 | 100 | 4 |
| HAE-asymm | 100 | 42 | 42 | 100 | 61 | 78 |
| HAE-symm | 100 | 97 | 94 | 100 | 61 | 81 |
| HAE-joint | 100 | 42 | 60 | 100 | 42 | 61 |
| RC1 | 83 | 78 | 16 | 100 | 61 | 29 |
| RC2 | 83 | 78 | 8 | 100 | 61 | 48 |
| RC3 | 100 | 94 | 96 | 100 | 61 | 79 |
| ANCH | 83 | 78 | 13 | 100 | 61 | 48 |
| CC | 100 | 50 | 45 | 100 | 33 | 46 |
| Method | Bias (NODIF) | Bias (UDIF) | Bias (NUDIF) | RMSE (NODIF) | RMSE (UDIF) | RMSE (NUDIF) |
|---|---|---|---|---|---|---|
| Mean |  |  |  |  |  |  |
| logMM | 0.000 | 0.007 | 0.008 | 108.2 | 104.4 | 106.1 |
| HAB | 0.000 | 0.007 | 0.008 | 108.2 | 104.4 | 106.1 |
| MM | 0.000 | 0.007 | 0.007 | 108.1 | 103.7 | 104.7 |
| HAB-nolog | 0.001 | 0.007 | 0.007 | 108.5 | 103.5 | 104.5 |
| IA2 | −0.001 | 0.001 | 0.045 | 103.2 | 107.5 | 133.3 |
| HAE-asymm | −0.002 | −0.030 | −0.032 | 102.3 | 100.0 | 100.0 |
| HAE-symm | −0.001 | 0.002 | 0.005 | 102.7 | 105.0 | 105.2 |
| HAE-joint | −0.002 | 0.067 | 0.064 | 100.9 | 136.1 | 132.4 |
| RC1 | −0.001 | 0.001 | 0.028 | 100.2 | 104.8 | 120.5 |
| RC2 | −0.006 | −0.004 | −0.022 | 100.0 | 104.0 | 100.1 |
| RC3 | −0.003 | −0.001 | 0.002 | 100.1 | 103.9 | 109.4 |
| ANCH | −0.003 | −0.004 | −0.021 | 101.4 | 104.2 | 103.9 |
| CC | −0.002 | 0.095 | 0.109 | 101.3 | 149.2 | 157.7 |
| Standard Deviation |  |  |  |  |  |  |
| logMM | 0.000 | 0.003 | 0.008 | 110.2 | 112.6 | 128.9 |
| HAB | 0.000 | 0.003 | 0.008 | 110.2 | 112.6 | 129.4 |
| MM | −0.001 | 0.001 | 0.005 | 108.5 | 109.4 | 107.7 |
| HAB-nolog | 0.001 | 0.002 | 0.007 | 100.0 | 100.0 | 100.0 |
| IA2 | 0.009 | 0.009 | 0.147 | 113.2 | 111.6 | 197.9 |
| HAE-asymm | −0.002 | −0.120 | −0.134 | 107.2 | 378.8 | 185.6 |
| HAE-symm | 0.001 | −0.003 | 0.003 | 108.3 | 233.7 | 119.9 |
| HAE-joint | −0.001 | 0.020 | 0.029 | 107.5 | 317.0 | 146.6 |
| RC1 | 0.006 | 0.008 | 0.105 | 109.8 | 243.8 | 174.5 |
| RC2 | −0.009 | −0.008 | −0.097 | 108.5 | 217.2 | 148.3 |
| RC3 | −0.002 | 0.000 | 0.002 | 106.6 | 228.3 | 110.2 |
| ANCH | −0.009 | −0.008 | −0.097 | 108.5 | 217.2 | 148.3 |
| CC | −0.001 | 0.015 | 0.029 | 107.4 | 220.4 | 129.0 |
| Domain | N (P06) | N (P09) | I (P06) | I (P09) | M (P06) | M (P09) | SD (P06) | SD (P09) |
|---|---|---|---|---|---|---|---|---|
| Mathematics | 3784 | 4575 | 48 | 35 | 506.8 | 495.9 | 96.8 | 96.1 |
| Reading | 2646 | 6585 | 27 | 99 | 491.2 | 470.3 | 107.7 | 100.1 |
| Science | 4927 | 4577 | 103 | 53 | 511.7 | 494.3 | 97.3 | 101.8 |
| Method | Mathematics (1PL) | Mathematics (2PL) | Reading (1PL) | Reading (2PL) | Science (1PL) | Science (2PL) |
|---|---|---|---|---|---|---|
| logMM | −15.5 | −12.4 | −5.8 | −6.3 | −14.7 | −16.8 |
| HAB | −15.5 | −12.4 | −5.8 | −6.3 | −14.7 | −16.8 |
| MM | −15.5 | −12.4 | −5.8 | −6.3 | −14.7 | −16.7 |
| HAB-nolog | −15.5 | −12.3 | −6.0 | −6.3 | −14.5 | −16.6 |
| IA2 | −15.5 | −15.9 | −5.8 | −6.1 | −14.7 | −11.6 |
| HAE-asymm | −14.4 | −14.6 | −4.9 | −6.4 | −14.2 | −15.9 |
| HAE-symm | −14.6 | −15.0 | −5.0 | −6.6 | −14.2 | −15.7 |
| HAE-joint | −13.5 | −14.1 | −4.1 | −5.0 | −13.9 | −14.0 |
| RC1 | −14.3 | −14.5 | −4.4 | −5.1 | −14.0 | −13.2 |
| RC2 | −14.3 | −14.3 | −4.3 | −5.0 | −14.2 | −12.9 |
| RC3 | −14.3 | −14.4 | −4.4 | −5.0 | −14.1 | −13.1 |
| ANCH | −14.4 | −15.7 | −4.5 | −5.4 | −14.5 | −14.1 |
| CC | −14.3 | −14.9 | −4.3 | −5.3 | −14.2 | −13.6 |
| M | −14.8 | −14.1 | −5.0 | −5.8 | −14.3 | −14.7 |
| SD | 0.7 | 1.3 | 0.7 | 0.6 | 0.3 | 1.8 |
| Min | −15.5 | −15.9 | −6.0 | −6.6 | −14.7 | −16.8 |
| Max | −13.5 | −12.3 | −4.1 | −5.0 | −13.9 | −11.6 |
| Method | Mathematics (1PL) | Mathematics (2PL) | Reading (1PL) | Reading (2PL) | Science (1PL) | Science (2PL) |
|---|---|---|---|---|---|---|
| logMM | 97.7 | 98.3 | 98.6 | 103.2 | 103.2 | 106.8 |
| HAB | 97.7 | 98.3 | 98.6 | 103.2 | 103.2 | 106.8 |
| MM | 97.7 | 98.7 | 98.6 | 103.8 | 103.2 | 106.9 |
| HAB-nolog | 97.9 | 99.3 | 94.6 | 102.0 | 103.9 | 108.1 |
| IA2 | 97.7 | 99.5 | 98.6 | 104.6 | 103.2 | 109.2 |
| HAE-asymm | 94.1 | 95.0 | 102.6 | 105.4 | 105.0 | 107.5 |
| HAE-symm | 95.0 | 96.2 | 103.1 | 105.9 | 105.3 | 107.8 |
| HAE-joint | 95.0 | 95.7 | 105.1 | 107.5 | 104.7 | 107.4 |
| RC1 | 96.0 | 96.9 | 103.1 | 107.2 | 103.9 | 108.6 |
| RC2 | 96.0 | 95.6 | 99.9 | 106.2 | 104.7 | 105.9 |
| RC3 | 96.0 | 96.3 | 101.5 | 106.7 | 104.3 | 107.2 |
| ANCH | 96.0 | 95.6 | 99.9 | 106.2 | 104.7 | 105.9 |
| CC | 95.9 | 96.7 | 101.3 | 106.4 | 104.1 | 107.5 |
| M | 96.3 | 97.1 | 100.4 | 105.2 | 104.1 | 107.4 |
| SD | 1.2 | 1.5 | 2.7 | 1.7 | 0.7 | 0.9 |
| Min | 94.1 | 95.0 | 94.6 | 102.0 | 103.2 | 105.9 |
| Max | 97.9 | 99.5 | 105.1 | 107.5 | 105.3 | 109.2 |