# How Much g Is in the Distractor? Re-Thinking Item-Analysis of Multiple-Choice Items


## Abstract


[…] $R_{CC}$) for their potential to detect distractors with ability-related discriminatory power. The simulation design was adapted to item selection scenarios relying on rather small sample sizes (e.g., N = 100 or N = 200). Both suggested effect size measures (Cohen’s ω only when based on two ability groups) yielded acceptable to conservative type-I-error rates, whereas the canonical correlation outperformed Cohen’s ω in terms of empirical power. The simulation results further suggest that an effect size threshold of 0.30 is more appropriate than more lenient (0.10) or stricter (0.50) thresholds. The suggested item-analysis procedure is illustrated with an analysis of twelve Raven’s progressive matrices items in a sample of N = 499 participants. Finally, strategies for item selection for cognitive ability tests with the goal of scaling by means of nested logit models are discussed.

## 1. Introduction

#### 1.1. Distractor Analysis as Part of Traditional Analysis of Multiple-Choice Items

#### 1.1.1. Distractor Choice Frequency

#### 1.1.2. The Point-Biserial Correlation

[…] $PB_{D}$ (Gierl et al. 2017; Attali and Fraenkel 2000) that contrasts test performance between participants who chose distractor D and the participants who did not choose D (i.e., the participants who chose either the correct solution or one of the other distractors):

$$PB_{D} = \frac{M_{D} - M}{S}\sqrt{\frac{P_{D}}{1 - P_{D}}},$$

where $M_{D}$ is the average performance of participants who chose D, M is the average performance of all participants, S is the standard deviation of the performance of all participants, and $P_{D}$ is the proportion of participants who selected D. Well-functioning distractors show a negative $PB_{D}$ (Attali and Fraenkel 2000). However, Attali and Fraenkel (2000) pointed out that the groups of participants contrasted by $PB_{D}$ do not yield the relevant information for developers in every situation. For example, a positive $PB_{D}$ can be found for rather difficult items even when the average score of participants choosing D is substantially lower than the average score of participants solving the item (i.e., M is also affected by participants who chose one of the other distractors). Hence, they suggest an alternative index $PB_{DC}$ that contrasts the group choosing D only with the group who solved the item:

$$PB_{DC} = \frac{M_{D} - M_{DC}}{S_{DC}}\sqrt{\frac{P_{D}}{P_{C}}},$$

where $M_{DC}$ is the average sum correct score of the participants who either chose D or the correct solution C, $S_{DC}$ is the standard deviation of the sum correct score of the group choosing either D or C, and $P_{C}$ is the proportion of participants choosing the correct solution. It is clear that this index provides better contrast between distractor choice and item solution in terms of ability. However, neither contrast is informative for the aim of detecting distractors with discriminatory power with respect to ability, because this would require a contrast between participants choosing distractor D and participants choosing any other distractor. For Example-Item 1 and Example-Item 2, there were on average no observable differences for either $PB_{D}$ or $PB_{DC}$. This is indeed expected given that both indices focus on a different aspect of discrimination than the distractor discrimination parameters in nested logit models (see Table 1).
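The two point-biserial indices can be sketched in base R as follows. This is only an illustration: the toy vectors `choice` and `score` and the helper names `pb_d()` and `pb_dc()` are hypothetical. The sketch assumes the population standard deviation, so that both indices coincide with ordinary Pearson correlations; the code in Appendix C uses `sd()` instead, which changes the values by a factor of $\sqrt{(n-1)/n}$.

```r
# Toy data (hypothetical): chosen option on one item and sum scores
choice <- c("C", "C", "C", "C", "A", "A", "B", "B", "B", "C")
score  <- c(10, 9, 8, 9, 5, 6, 3, 4, 2, 7)
key    <- "C"

# Population standard deviation (divides by n, not n - 1)
pop_sd <- function(x) sqrt(mean((x - mean(x))^2))

# PB_D: choosers of distractor d versus all participants
pb_d <- function(choice, score, d) {
  p_d <- mean(choice == d)
  (mean(score[choice == d]) - mean(score)) / pop_sd(score) *
    sqrt(p_d / (1 - p_d))
}

# PB_DC: choosers of distractor d versus solvers of the item only
pb_dc <- function(choice, score, d, key) {
  in_dc <- choice %in% c(d, key)
  p_d <- mean(choice == d)
  p_c <- mean(choice == key)
  (mean(score[choice == d]) - mean(score[in_dc])) / pop_sd(score[in_dc]) *
    sqrt(p_d / p_c)
}

pb_d(choice, score, "B")
pb_dc(choice, score, "B", key)
```

Under these assumptions, `pb_d()` equals the correlation between the sum score and a 0/1 indicator for choosing the distractor, and `pb_dc()` equals the same correlation restricted to participants who chose either the distractor or the key.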

#### 1.1.3. Trace Line Plots and χ^{2} Statistics

[…] $\omega_{D}$ to distinguish it from the effect size measure $\omega_{G}$ that will be introduced below) for Example-Item 1 and Example-Item 2 (see Table 1). On average, $\omega_{D}$ was comparable in this case. One might argue that the variation of $\omega_{D}$ across distractors could be sensitive to the discriminatory power of distractors, but using a dispersion index (e.g., SD of $\omega_{D}$ across distractors) would yield a measure with a rather non-intuitive metric (as compared to commonly used effect size metrics). Thus, in this work, we aim at a direct effect size quantification of the discriminatory power of distractors. In addition, there were no intersections of the trace lines for the distractors between the ability groups in the left plot of Figure 1 (Example-Item 1). The $\chi_{D}^{2}$ statistic, however, would nonetheless be significant for all distractors in this plot (see also the large $\omega_{D}$ values in Table 1). In this vein, it has been conjectured by Garcia-Perez (2014) that non-monotonic empirical trace lines are required (such as those depicted for Example-Item 2 in the right plot of Figure 1) to allow effective modeling by polytomous IRT models. Hence, effect sizes are needed that are sensitive to distractor trace lines displaying distractor-ability interaction effects. In this work, we will use an effect size based on the $\chi^{2}$ statistic using ability groups (as shown in Figure 1). In this approach, however, all distractors are considered (i.e., participants solving an item will be discarded from analysis); for a comparable implementation see Levine and Drasgow (1983). However, they used a less intuitive metric than the one that we will introduce in Section 1.2.

#### 1.1.4. Rising Selection Ratios

#### 1.2. Effect Sizes for the Detection of Discriminatory Distractors

#### 1.2.1. Cohen’s ω Based on Ability Groups and Distractor Choice

[…] $\chi^{2}$ derived from a contingency table in which the rows represent the distractors and the columns represent ability groups. Hence, this effect size is analogous to the above-mentioned approach used by Levine and Drasgow (1983), but it has a normed range that is easier to interpret. They also scale the $\chi^{2}$ statistic for better interpretability. In particular, the $\chi^{2}$ should be independent of the number of participants who did not solve the item under consideration because the raw $\chi^{2}$ statistic would clearly depend on item difficulty otherwise (Levine and Drasgow 1983). Cohen’s ω (Cohen 1992) can be calculated by

$$\omega_{G} = \sqrt{\frac{\chi^{2}}{\sum_{k=1}^{K} N_{D_{k}}}},$$

where $\chi^{2}$ is the statistic based on the G ability groups and all K distractors, and $\sum_{k=1}^{K} N_{D_{k}}$ is the number of all participants who did not solve the item under consideration. In this study, we will examine Cohen’s well-known interpretation guideline for $\omega_{G}$ (Cohen 1992). Specifically, we will use 0.10 (small effect size), 0.30 (medium effect size), and 0.50 (large effect size) as cut-offs for the detection of items with discriminatory distractor sets. Example-Item 1 had $\omega_{2}$ = 0.02 and $\omega_{5}$ = 0.04, whereas Example-Item 2 had $\omega_{2}$ = 0.24 and $\omega_{5}$ = 0.34. This illustrates that the discriminatory power of distractors could potentially be detected with this variant of Cohen’s ω.
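A minimal base-R sketch of this variant of Cohen’s ω is given below. The function name `cohen_omega()` and the toy data are hypothetical; the sketch assumes that ability groups have already been formed and that the uncorrected $\chi^{2}$ statistic (no continuity correction) is the one being scaled, which matches the observed/expected computation in Appendix C.

```r
# Cohen's omega from a distractor-by-ability-group table,
# computed only for participants who did not solve the item.
cohen_omega <- function(choice, group, key) {
  wrong <- choice != key
  tab <- table(choice[wrong], group[wrong])
  # Uncorrected chi-square; warnings about small expected counts are muted
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  unname(sqrt(chi2 / sum(tab)))
}

# Toy item (hypothetical): option "C" is the key, two distractors, two groups
choice <- c(rep("A", 40), rep("B", 40), rep("C", 20))
group  <- c(rep("low", 30), rep("high", 10),   # distractor A
            rep("low", 10), rep("high", 30),   # distractor B
            rep("high", 20))                   # solvers, discarded
cohen_omega(choice, group, key = "C")
```

In this toy table the distractor choice depends strongly on the ability group, so the value would pass even the strictest of the three cut-offs discussed above.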

#### 1.2.2. Canonical Correlation Based on Ability and Distractor Choice

[…] $\mathbf{X}_{1}$, including k binary indicator variables with an entry of 1 in the kth column and vth row when person v chose distractor k and zero otherwise, and (b) a vector $\mathbf{x}_{2}$ that includes the total scores for all participants who did not solve the item under consideration. Then, $\mathbf{r}_{12}$ is the column vector including the correlations between each column from $\mathbf{X}_{1}$ and $\mathbf{x}_{2}$, and $\mathbf{H}_{1}$ is the Cholesky decomposition (Harville 2008) of the correlation matrix between all binary indicator variables in $\mathbf{X}_{1}$. With these terms in mind, the canonical correlation can be expressed as

$$R_{CC} = d_{11},$$

where $d_{11}$ is the only element from the $\mathbf{D}$ matrix resulting from a singular value decomposition (Harville 2008) of $\mathbf{W} = \mathbf{r}_{12}^{\prime}\mathbf{H}_{1}^{-1}$. For the canonical correlation, the same cut-offs for the detection of items with discriminatory distractors are suggested as for Cohen’s $\omega_{G}$ above (small: 0.10; medium: 0.30; large: 0.50). Example-Item 1 had $R_{CC}$ = 0.01, whereas Example-Item 2 had $R_{CC}$ = 0.34. This illustrates that the discriminatory power of distractors could also be detected by means of $R_{CC}$.
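The following base-R sketch computes the canonical correlation between distractor choice and total score in three ways on hypothetical toy data: with `cancor()` (as in Appendix C), as the multiple correlation from a linear model, and through the Cholesky and singular value route described above. One assumption is made loudly here: for the Cholesky route, one indicator column is dropped, because the full set of indicators sums to one and its correlation matrix would be singular.

```r
# Toy data (hypothetical): option "C" is the key; A, B, D are distractors
choice <- c("A", "A", "A", "B", "B", "B", "D", "D", "D", "C", "C")
score  <- c(2, 3, 4, 4, 5, 6, 6, 7, 9, 10, 11)
wrong  <- choice != "C"

x1 <- model.matrix(~ -1 + factor(choice[wrong]))  # indicator matrix X1
x2 <- score[wrong]                                # total scores x2

# (1) canonical correlation via stats::cancor
r_cancor <- cancor(x1, x2)$cor[1]

# (2) multiple correlation of the score on distractor choice
r_lm <- sqrt(summary(lm(x2 ~ factor(choice[wrong])))$r.squared)

# (3) Cholesky/SVD route; first indicator dropped (assumption, see lead-in)
x   <- x1[, -1, drop = FALSE]
r12 <- cor(x, x2)                  # column vector of correlations
H1  <- chol(cor(x))                # upper triangular, t(H1) %*% H1 = cor(x)
W   <- t(r12) %*% solve(H1)
r_svd <- svd(W)$d

c(cancor = r_cancor, lm = r_lm, svd = r_svd)
```

With a single score variable on one side, the three routes should agree up to numerical precision, which makes the linear-model version a convenient cross-check.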

#### 1.3. Aim of the Current Study

[…] $R_{CC}$ (as outlined above) to detect items for their potential to discriminate individuals with respect to their latent trait based on distractor choice behavior. To this end, we first simulated conditions in which distractors did not possess discriminatory power with respect to the latent trait to assess the type-I-error of the statistical indices used (i.e., effect sizes passing the effect size threshold when the population model did not include discriminatory distractors). Second, we simulated conditions based on a population model with discriminatory distractors to examine the power to detect items that are suitable for nested logit modeling. A final aim of this work is to illustrate the suggested item-analytical strategy by means of the data taken from Myszkowski and Storme (2018).

## 2. Simulation Study

#### 2.1. Method

#### 2.1.1. Data Generating Model

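[…] where $\theta_{j}$ is the ability parameter, $\beta_{i}$ is the item difficulty parameter, and $\alpha_{i}$ is the discrimination parameter. Then, if an item has not been solved, the probability of choosing distractor v among the set of the remaining $m_{i}$ distractors is modeled by the nominal response model with intercept parameters $\zeta_{iv}$ and distractor discrimination parameters $\lambda_{iv}$:

$$P(\text{distractor } v \mid \text{item } i \text{ not solved}, \theta_{j}) = \frac{\exp(\zeta_{iv} + \lambda_{iv}\theta_{j})}{\sum_{k=1}^{m_{i}} \exp(\zeta_{ik} + \lambda_{ik}\theta_{j})}.$$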
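A minimal base-R sketch of this two-step data-generating process for a single item is shown below. It is not the simulation code of the study (which relied on `simdata()` from mirt); the function name `simulate_nested_item()` is hypothetical, and the slope-intercept form of the 2PL part is an assumption suggested by the negative difficulty values used for difficult items.

```r
set.seed(1)

# Simulate responses to one item under a nested logit structure.
# Returns 0 for a correct response and 1..m for the chosen distractor.
simulate_nested_item <- function(theta, alpha, beta, zeta, lambda) {
  n <- length(theta)
  # Step 1: 2PL for solving the item (assumed slope-intercept form)
  p_correct <- plogis(alpha * theta + beta)
  solved <- rbinom(n, 1, p_correct)
  # Step 2: nominal response model among the m distractors
  eta <- outer(theta, lambda) + matrix(zeta, n, length(zeta), byrow = TRUE)
  p_dist <- exp(eta) / rowSums(exp(eta))
  pick <- apply(p_dist, 1, function(p) sample.int(length(p), 1, prob = p))
  ifelse(solved == 1, 0L, pick)
}

theta <- rnorm(200)
resp <- simulate_nested_item(theta, alpha = 1, beta = -1,
                             zeta = c(0, 0, 0), lambda = c(-1, 0, 1))
table(resp)
```

Setting all entries of `lambda` to the same value gives distractors without ability-related discriminatory power, which corresponds to the type-I-error conditions of the simulation.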

#### 2.1.2. Facets of the Simulation Design

1. Sample size (three levels): N = 100; N = 200; and N = 500.
2. Number of items (three levels): I = 10; I = 20; and I = 50.
3. Number of distractors (two levels): D = 3; and D = 7.
4. 2-PL difficulty (three levels): Moderate [$\beta_{i} \sim U(-0.15, 0.15)$]; difficult [$\beta_{i} \sim U(-1.15, -0.85)$]; and very difficult [$\beta_{i} \sim U(-2.25, -1.85)$].
5. 2-PL discrimination (three levels): Low [$\alpha_{i} \sim U(0.25, 0.55)$]; moderate [$\alpha_{i} \sim U(0.85, 1.15)$]; and high [$\alpha_{i} \sim U(1.60, 1.90)$].
6. NRM discrimination parameters (four levels) are depicted in Table 2.

[…] $\zeta_{iv}$ were further sampled for all design cells from a U(−1, 1) distribution. Further facets resulted from the effect size threshold used and the type of effect size (but these facets did not imply additionally generated datasets):

7. Effect size threshold (three levels): Small: Effect size > 0.10; moderate: Effect size > 0.30; and large: Effect size > 0.50.
8. Type of effect size (three levels): Cohen’s ω based on two ability groups; Cohen’s ω based on five ability groups; and the canonical correlation coefficient.

#### 2.1.3. Dependent Variables

#### 2.1.4. Simulation Setup

[…] `simdata()` function included in the R package mirt (Chalmers 2012). The design for the dataset generation was based on crossing all facets of the simulation design (see 1. to 6. presented in Section 2.1.2). Hence, it was a sample-size × number-of-items × number-of-distractors × 2-PL-difficulty × 2-PL-discrimination × NRM-discrimination design with 3 × 3 × 2 × 3 × 3 × 4 = 648 cells. For each of these 648 cells, we generated 1000 datasets and aggregated the dependent variables across these datasets for each cell. All R code files and simulated data are available in the OSF repository for this work (https://osf.io/9tp8h/).
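The fully crossed design can be sketched with `expand.grid()`. The column names are hypothetical, and the four NRM discrimination levels are coded 1 to 4 as placeholders because their values are defined in Table 2.

```r
# Fully crossed simulation design (facets 1 to 6 of Section 2.1.2)
design <- expand.grid(
  n_persons          = c(100, 200, 500),
  n_items            = c(10, 20, 50),
  n_distractors      = c(3, 7),
  difficulty         = c("moderate", "difficult", "very difficult"),
  discrimination     = c("low", "moderate", "high"),
  nrm_discrimination = 1:4,   # placeholder codes for the Table 2 levels
  stringsAsFactors   = FALSE
)

nrow(design)  # 3 * 3 * 2 * 3 * 3 * 4 = 648 design cells
```

Each row of `design` then describes one cell for which replicated datasets would be generated and summarized.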

#### 2.2. Simulation Results

#### 2.2.1. Type-I-Error Results

[…] $\omega_{5}$ was adequate only when the sample size was N = 500. Moreover, for seven distractors, a 0.30 threshold, and a sample size of N = 500, acceptable type-I-error rates were reached only for very difficult items. The best performance of Cohen’s ω based on five ability groups was found for the three-distractor condition and a threshold of 0.50 (i.e., only the type-I-error for moderately difficult items was too large). However, both other effect size measures (Cohen’s ω based on two ability groups and the canonical correlation) yielded highly conservative type-I-error rates (i.e., type-I-error rates that are notably smaller than 0.05) when the effect size threshold was 0.50, regardless of any other design facet. Moreover, Cohen’s ω based on two ability groups and the canonical correlation coefficient yielded acceptable to conservative type-I-error rates with three distractors and a threshold of 0.30 when 2PL-difficulty was at least difficult. The same was observed for these two effect size measures for seven distractors, but only when the sample size was at least N = 200 (see Figure 2). Finally, we found that 2PL-difficulty was inversely related to type-I-error rates, as in several simulated conditions moderate 2PL-difficulty resulted in the highest type-I-error rate (see Figure 2), whereas the level of 2PL-discrimination and the number of items did not show any specific relationship with the type-I-error rate (see Appendix A).

[…] $\omega_{2}$ (85%) and $R_{CC}$ (84%) comparable numbers of cells with acceptable type-I-error rates resulted, whereas Cohen’s $\omega_{5}$ displayed acceptable type-I-error rates only for 48% of the simulated cells. Thus, Cohen’s $\omega_{5}$ appeared to have only a very narrow range of scenarios in which this statistic is advisable for the detection of discriminatory distractors. Cohen’s $\omega_{2}$ and $R_{CC}$, however, were found to function comparably well (see also Table 3).

#### 2.2.2. Power Results

[…] $PB_{DC}$ would be ignored. Hence, we checked all conditions that had both acceptable type-I-error and sufficient power (i.e., power ≥ 0.80) for their power under additional boundary conditions. First, the power of effect size measures was reevaluated under the additional condition that the average γ is greater than 0.30. Second, another reevaluation of the power of effect size measures took $PB_{DC}$ as a boundary condition into account. Here we tested the additional condition that the average $PB_{DC}$ had to be smaller than −0.30. Table 3 displays the percentages of design cells with adequate empirical power with and without boundary conditions.

[…] $\omega_{5}$ and adequate power was only achieved for less than 10% of the design cells, which was still surpassed by $R_{CC}$ paired with a 0.30 threshold (see Table 3). Results indicated further, as expected, a positive relationship between NRM discrimination and empirical power. That is, the higher the NRM discrimination was in the data-generating model, the higher was the percentage of design cells with adequate power to detect discriminatory distractors (see Table 3). This pattern was rather robust across effect size thresholds and effect size measures. Restricting the findings to $R_{CC}$ as the overall best-performing statistic, however, revealed that power gains from high to very high NRM conditions were negligible (even non-existent when boundary conditions were taken into account) with a 0.30 threshold. Comparing further the $R_{CC}$ results between the 0.30 and the 0.50 threshold, independent of NRM discrimination, suggested that the power advantage of the lower 0.30 threshold as compared to the 0.50 threshold vanishes for very high NRM discrimination. The differences between power analyses without and with boundary conditions increased with the level of NRM discrimination. Generally, the overall impression of empirical power results was supported regardless of the presence of additional boundary conditions.

#### 2.2.3. Empirical Substance Examination

[…] $\omega_{2}$ in all conditions and 79.95% for $\omega_{5}$ in all conditions. Hence, $\omega_{5}$ was affected the most by empirical substance loss, which in turn might explain its inferior performance in the simulation study.
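One source of such substance loss, distractors that are chosen too rarely to enter the analysis, can be flagged with a few lines of base R. The sketch below mirrors the 5% rule used for `rf05` in Appendix C (relative frequency computed against all participants); the function name `low_frequency_distractors()` and the toy responses are hypothetical.

```r
# Return the labels of distractors chosen by fewer than `cutoff`
# of all participants (the key itself is never flagged).
low_frequency_distractors <- function(responses, key, cutoff = 0.05) {
  rel_freq <- table(responses[responses != key]) / length(responses)
  names(rel_freq)[as.vector(rel_freq < cutoff)]
}

# Toy item (hypothetical): option 1 is the key
responses <- c(rep(1, 60), rep(2, 25), rep(3, 12), rep(4, 3))
low_frequency_distractors(responses, key = 1)
```

Here only option 4 (3% of all participants) falls below the cut-off, so an effect size for this item would be based on the two remaining distractors.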

#### 2.2.4. Discussion of Simulation Study Findings

[…] $R_{CC}$, $\omega_{2}$, and $\omega_{5}$ effect sizes to detect the discriminatory power of distractors. The power examination was also carried out under additional boundary conditions defined by effect sizes with a focus on solution behavior (i.e., γ and $PB_{DC}$). The simulation was further flanked by an empirical substance investigation to reveal the amount of information loss when, for example, distractors are chosen by less than 5% of the participants or the creation of ability groups did not result in the target number of groups. The aim of this simulation was twofold: (a) We wanted to identify the best-performing effect size for the detection of discriminatory distractors, and (b) we wanted to explore potential factors that influence type-I-error and empirical power.

[…] $R_{CC}$ and $\omega_{2}$ yielded comparable performance with respect to type-I-error. $R_{CC}$ and $\omega_{2}$ displayed acceptable type-I-error for a far greater variety of simulated conditions as compared to $\omega_{5}$. Hence, $\omega_{5}$ was found to be clearly limited in its range of application. In terms of empirical power, however, it was found that $R_{CC}$ clearly outperformed $\omega_{2}$ in most of the simulated conditions, with few design cells in which $R_{CC}$ and $\omega_{2}$ performed comparably well. In relation to this, it is further important that using $R_{CC}$ in combination with a 0.30 threshold yielded better empirical power findings in conditions with moderate or high NRM discrimination. For very high NRM discrimination conditions, $R_{CC}$ combined with a 0.30 threshold and $R_{CC}$ combined with a 0.50 threshold were found to be comparable with respect to empirical power findings. Thus, for a wide range of simulated conditions in this study, $R_{CC}$ in combination with a 0.30 threshold would be the best choice.

[…] $R_{CC}$ as the generally best-performing effect size measure is used (i.e., $\omega_{2}$ also had acceptable type-I-error rates for all conditions with a 0.50 threshold, but did not perform on par with $R_{CC}$ with respect to power). However, in terms of empirical power, it is important to take further into account that 2PL-discrimination needs to be at least moderate and NRM discrimination needs to be very high to yield largely acceptable detection power (with better findings for seven-distractor items). When NRM discrimination is only high, detection of discriminatory distractors was only feasible for seven-distractor items when at the same time 2PL-discrimination was at least moderate. Empirical power to detect items with moderate NRM discrimination was found to be unacceptable with a 0.50 threshold.

## 3. Empirical Illustration

#### 3.1. Method

#### 3.1.1. Dataset

#### 3.1.2. Analytical Strategy

[…] $R_{CC}$ as effect size with a threshold of 0.30 because it displayed acceptable type-I-error rates and reasonable empirical power in the simulation study above under conditions comparable to those of the given dataset. In addition, the number of distractors with relative choice frequencies < 0.05, $PB_{DC}$, and γ were calculated. Finally, we re-estimated the 2PL-parameters to facilitate interpretation of the findings in connection with the simulation study presented above.

#### 3.2. Results and Discussion

[…] $R_{CC}$. As expected, this sparsity of distractor frequencies was associated with 2PL-difficulty estimates. These five items were indeed among the easiest items according to the estimates in Table 4. Moreover, the estimates, in particular those for items 1 to 5, were much higher than the difficulties simulated above. In fact, only items 10 to 12 were found to be in the range of simulated 2PL-difficulty values used above. The values for items 8 and 9 were closer to the moderate difficulty level used in the simulation, whereas the estimates for items 6 and 7 were clearly easier. The 2PL-discrimination estimates, however, were inside the range of the simulation study and were even higher for several items. The latter observation is particularly important, because even for the detection of moderate NRM discrimination it was found that $R_{CC}$ had adequate power levels with seven distractors and a sample size of N = 500 (which are conditions resembling the Myszkowski-Storme dataset). Given that 2PL-discrimination was identified in the simulation as an important influencing factor on the detection power, one could reasonably assume that higher 2PL-discrimination can compensate for lower 2PL-difficulty as compared to the simulation setup. This reasoning applies particularly to item 6, which was found to have a much larger 2PL-discrimination parameter estimate than the values used in the simulation (and a much lower difficulty estimate). Nonetheless, caution is needed when interpreting these findings with parameter estimates outside the simulated values.

[…] $PB_{DC}$ for items 6 to 12 to examine if the correct solution was associated with higher ability levels as compared to choosing one of the distractors. In addition, γ was used to check the rising selection ratio property. Table 4 displays the average $PB_{DC}$ across all distractors for each of the items. These values ranged from moderate to large effect sizes, implying rather well-functioning distractors in this regard. Importantly, some items displayed comparable $R_{CC}$ values (items 9 and 10), but clearly varying $PB_{DC}$ values. This highlights the importance of studying ability-related discrimination of distractors and discrimination with respect to solution behavior at the same time. $PB_{DC}$ and also γ focus on solution behavior, but $R_{CC}$ focuses on distractors that are more often chosen by participants with higher ability levels as compared to other distractors (i.e., not in comparison to the correct solution). These aspects of discrimination are not necessarily expected to covary. This is further illustrated by the $R_{CC}$ and $PB_{DC}$ findings for item 11, which had the lowest $R_{CC}$ value, but the second strongest $PB_{DC}$ (see Table 4).

[…] $PB_{DC}$ was below −0.30, but γ was not larger than 0.30). To decide if this pattern is problematic for the detection of ability-related discriminatory distractors, new simulations to understand the interplay of $PB_{DC}$ and γ as boundary conditions are clearly needed. Hence, results for items 11 and 12 should also be treated with caution. These findings largely support the feasibility of nested logit modeling with its primary focus on solution behavior and distractor choices as additional information used for ability estimation.

## 4. Overall Discussion

#### Limitations

[…] $PB_{DC}$, and the average γ and this approach seems to be most promising. In particular, the $PB_{DC}$ can further reveal if choosing the correct solution is more strongly related to the ability as compared to any other distractor. This is crucial for a model that puts solution behavior in the first place. Moreover, average γ ensures that the assumption of rising selection ratios holds for a set of candidate items, which is further useful to guide item pre-selection. In our simulation, we found that using these two statistics as boundary conditions is useful, but for simplicity, these two item statistics were studied in isolation. Obviously, other more complex item-analysis strategies could be used in which a combined cut-off for both statistics is applied. It is further possible to consider scenarios in which item selection is carried out in multiple consecutive steps, and the current work does not shed much light on the question of which statistic should be consulted first (e.g., testing discrimination in the sense of item-scale correlations first, testing for rising selection ratios second, and screening for ability-related discriminatory power of distractors as a final step).

## 5. Conclusions

## Author Contributions

## Funding

## Conflicts of Interest

## Appendix A

**Figure A1.** 2PL-discrimination split of type-I-error analysis: Depiction of the type-I-error rate (y-axis) as a function of sample size (x-axis), number of distractors (three distractors = top-row; seven distractors = bottom-row), 2PL-discrimination (disc_level: Low vs. moderate vs. high), and effect size measures combined with effect size thresholds. The horizontal red dashed line represents the target type-I-error rate of 0.05. For more explanations, see Figure 2.

**Figure A2.** Number-of-items split of type-I-error analysis: Depiction of the type-I-error rate (y-axis) as a function of sample size (x-axis), number of distractors (three distractors = top-row; seven distractors = bottom-row), number of items and effect size measures combined with effect size thresholds. The horizontal red dashed line represents the target type-I-error rate of 0.05. For more explanations, see Figure 2.

## Appendix B

[…] $PB_{DC}$ as a boundary condition (see Figure A4). 2PL discrimination was found to be the strongest influencing factor on these examinations of boundary conditions. For 2PL difficulty, a similar pattern was revealed with the lowest power for moderately difficult items (in some cases even dropping below the 0.80 target level) and the highest power for very difficult items (see Figure A5 and Figure A6). When structuring the boxes in the boxplots according to the number of items, it was further revealed that the dispersion of power results increased with the number of items (see Figure A7 and Figure A8). With ten items, power findings were fairly homogeneous for all studied effect size measures, but for 50 items, the power results were strongly scattered.

**Figure A3.** 2PL-discrimination split of empirical power analysis: Depiction of the empirical power (y-axis) under the boundary condition that the average γ is greater than 0.30. Power is depicted as a function of NRM discrimination, sample size (x-axis), number of distractors, 2PL-discrimination, and discrimination effect sizes combined with effect size thresholds (p30_cc_p30_m_g = canonical correlation with a 0.30 threshold; p30_cw2_p30_m_g = Cohen’s ω based on two ability groups with a 0.30 threshold; p30_cw5_p30_m_g = Cohen’s ω based on five ability groups with a 0.30 threshold; p50_cc_p30_m_g = canonical correlation with a 0.50 threshold; p50_cw2_p30_m_g = Cohen’s ω based on two ability groups with a 0.50 threshold; p50_cw5_p30_m_g = Cohen’s ω based on five ability groups with a 0.50 threshold). The horizontal red dashed line represents the target power level of 0.80. For more explanations, see Figure 2 and Figure 3.

**Figure A4.** 2PL-discrimination split of empirical power analysis: Depiction of the empirical power (y-axis) under the boundary condition that the average $PB_{DC}$ is smaller than −0.30. Power is depicted as a function of NRM discrimination, sample size (x-axis), number of distractors, 2PL-discrimination, and discrimination effect sizes combined with effect size thresholds (p30_cc_p30_m_pb = canonical correlation with a 0.30 threshold; p30_cw2_p30_m_pb = Cohen’s ω based on two ability groups with a 0.30 threshold; p30_cw5_p30_m_pb = Cohen’s ω based on five ability groups with a 0.30 threshold; p50_cc_p30_m_pb = canonical correlation with a 0.50 threshold; p50_cw2_p30_m_pb = Cohen’s ω based on two ability groups with a 0.50 threshold; p50_cw5_p30_m_pb = Cohen’s ω based on five ability groups with a 0.50 threshold). The horizontal red dashed line represents the target power level of 0.80. For more explanations, see Figure 2 and Figure 3.

**Figure A5.** 2PL-difficulty split of empirical power analysis: Depiction of the empirical power (y-axis) under the boundary condition that the average γ is greater than 0.30. Power is depicted as a function of NRM discrimination, sample size (x-axis), number of distractors, 2PL-difficulty, and discrimination effect sizes combined with effect size thresholds. The horizontal red dashed line represents the target power level of 0.80. For more explanations, see Figure 2 and Figure 3.

**Figure A6.** 2PL-difficulty split of empirical power analysis: Depiction of the empirical power (y-axis) under the boundary condition that the average $PB_{DC}$ is smaller than −0.30. Power is depicted as a function of NRM discrimination, sample size (x-axis), number of distractors, 2PL-difficulty, and discrimination effect sizes combined with effect size thresholds. The horizontal red dashed line represents the target power level of 0.80. For more explanations, see Figure 2 and Figure 3.

**Figure A7.** Number-of-items split of empirical power analysis: Depiction of the empirical power (y-axis) under the boundary condition that the average γ is greater than 0.30. Power is depicted as a function of NRM discrimination, sample size (x-axis), number of distractors, number of items, and discrimination effect sizes combined with effect size thresholds. The horizontal red dashed line represents the target power level of 0.80. For more explanations, see Figure 2 and Figure 3.

**Figure A8.** Number-of-items split of empirical power analysis: Depiction of the empirical power (y-axis) under the boundary condition that the average $PB_{DC}$ is smaller than −0.30. Power is depicted as a function of NRM discrimination, sample size (x-axis), number of distractors, and discrimination effect sizes combined with effect size thresholds. The horizontal red dashed line represents the target power level of 0.80. For more explanations, see Figure 2 and Figure 3.

## Appendix C

    ### load data
    #
    ### downloaded from
    # https://data.mendeley.com/datasets/h3yhs5gy3w/1
    dataset <- read.csv("dataset.csv", stringsAsFactors = FALSE)

    ### install required packages (if needed)
    ### remove # to make this code run
    #install.packages(c("psych", "mirt"))

    ### get results function
    ### includes also the effect size measures that were
    ### studied in the simulation and also more useful
    ### descriptive statistics
    get_results <- function(data, keys = "sim"){
      ### keys for scoring (a single value means: all keys are 0)
      if(length(keys) == 1){
        keys <- rep(0, ncol(data))
      }
      ### quantiles for two ability groups
      p2 <- .5
      ### quantiles for five ability groups
      p5 <- c(.2, .4, .6, .8)
      ### load psych library
      require(psych)
      ### score all items
      scored <- score.multiple.choice(key = keys, data = data, score = FALSE)
      ### total score, used repeatedly below
      total <- rowSums(scored)
      ### ability groups (two groups)
      abil2.c <- rep(0, nrow(scored))
      for(i in 1:length(p2)){
        if(i < length(p2)){
          abil2.c[total > quantile(total, probs = p2[i]) &
                    total <= quantile(total, probs = p2[i + 1])] <- i
        }else{
          abil2.c[total > quantile(total, probs = p2[i])] <- i
        }
      }
      ### ability groups (five groups)
      abil5.c <- rep(0, nrow(scored))
      for(i in 1:length(p5)){
        if(i < length(p5)){
          abil5.c[total > quantile(total, probs = p5[i]) &
                    total <= quantile(total, probs = p5[i + 1])] <- i
        }else{
          abil5.c[total > quantile(total, probs = p5[i])] <- i
        }
      }
      ### list distractors with relative frequency < .05
      rf05 <- list()
      for(j in 1:ncol(data)){
        rf05[[j]] <- table(data[, j][data[, j] != keys[j]]) / length(data[, j]) < .05
      }
      ### general Cohen's w, 2 ability groups
      chi_g2 <- list()
      cw_g2 <- list()
      tab_c2_l <- list()
      zero_columns2 <- list()
      for(k in 1:ncol(data)){
        ### participants who did not solve item k
        wrong <- data[, k] != keys[k]
        tab_c2 <- matrix(table(data[, k][wrong], abil2.c[wrong])[!rf05[[k]], ],
                         ncol = length(unique(abil2.c[wrong])))
        zero_columns2[[k]] <- colSums(tab_c2) == 0
        tab_c2 <- tab_c2[, colSums(tab_c2) > 0]
        tab_c2_l[[k]] <- tab_c2
        if(sum(!rf05[[k]]) >= 2){
          chi_g2[[k]] <- chisq.test(tab_c2)
        }else{
          chi_g2[[k]] <- NA
        }
        ### Cohen's w - general
        if(sum(!rf05[[k]]) >= 2){
          n2 <- sum(tab_c2)
          cw_g2[[k]] <- sqrt(sum(((chi_g2[[k]]$observed / n2 -
                                     chi_g2[[k]]$expected / n2)^2) /
                                   (chi_g2[[k]]$expected / n2)))
        }else{
          cw_g2[[k]] <- NA
        }
      }
      ### general Cohen's w, 5 ability groups
      chi_g5 <- list()
      cw_g5 <- list()
      tab_c5_l <- list()
      zero_columns5 <- list()
      for(k in 1:ncol(data)){
        ### participants who did not solve item k
        wrong <- data[, k] != keys[k]
        tab_c5 <- matrix(table(data[, k][wrong], abil5.c[wrong])[!rf05[[k]], ],
                         ncol = length(unique(abil5.c[wrong])))
        zero_columns5[[k]] <- colSums(tab_c5) == 0
        tab_c5 <- tab_c5[, colSums(tab_c5) > 0]
        tab_c5_l[[k]] <- tab_c5
        if(sum(!rf05[[k]]) >= 2){
          chi_g5[[k]] <- chisq.test(tab_c5)
        }else{
          chi_g5[[k]] <- NA
        }
        ### Cohen's w - general
        if(sum(!rf05[[k]]) >= 2){
          n5 <- sum(tab_c5)
          cw_g5[[k]] <- sqrt(sum(((chi_g5[[k]]$observed / n5 -
                                     chi_g5[[k]]$expected / n5)^2) /
                                   (chi_g5[[k]]$expected / n5)))
        }else{
          cw_g5[[k]] <- NA
        }
      }
      ### canonical correlation
      can_cor <- list()
      ncol_mmat <- list()
      for(k in 1:ncol(data)){
        ncol_mmat[[k]] <- if(sum(!rf05[[k]]) >= 2){
          ncol(model.matrix(rowSums(scored[scored[, k] == 0, -1]) ~
                              -1 + factor(data[, k][scored[, k] == 0]))[, !rf05[[k]]])
        }else{
          NA
        }
        can_cor[[k]] <- if(sum(!rf05[[k]]) >= 2){
          cancor(rowSums(scored[scored[, k] == 0, -k]),
                 model.matrix(rowSums(scored[scored[, k] == 0, -1]) ~
                                -1 + factor(data[, k][scored[, k] == 0]))[, !rf05[[k]]])$cor
        }else{
          NA
        }
      }
      ### point-biserial coefficient PB_DC
      pb_dc <- list()
      ### Goodman-Kruskal gamma
      gkg <- list()
      gkg_tab <- list()
      ### start loop over items
      for(v in 1:ncol(data)){
        pb_dc_d <- list()
        gkg_d <- list()
        gkg_tab_d <- list()
        ### function to calculate
        ### Goodman-Kruskal gamma
        ### taken from here:
        ### https://stat.ethz.ch/pipermail/r-help/2003-March/030835.html
        goodman <- function(x, y){
          Rx <- outer(x, x, function(u, v) sign(u - v))
          Ry <- outer(y, y, function(u, v) sign(u - v))
          S1 <- Rx * Ry
          return(sum(S1) / sum(abs(S1)))
        }
        ### start loop over distractors
        non_key <- unique(data[, v])[!unique(data[, v]) %in% keys[v]]
        for(w in non_key){
          ### participants who chose the key or distractor w
          key_or_w <- data[, v] %in% c(keys[v], w)
          MDC <- mean(total[key_or_w])
          SDC <- sd(total[key_or_w])
          MD <- mean(total[data[, v] %in% w])
          PD <- mean(data[, v] %in% w)
          PC <- mean(data[, v] %in% keys[v])
          ### r-PB_DC
          pb_dc_d[[w]] <- (MD - MDC) / SDC * sqrt(PD / PC)
          ### Goodman-Kruskal gamma
          score_other_items <- factor(rowSums(scored[, -v]))
          tab_gkg_d <- table(data[key_or_w, v], score_other_items[key_or_w])
          ### exclude ability levels with zero frequency
          tab_gkg_d <- tab_gkg_d[, colSums(tab_gkg_d) > 0]
          gkg_d[[w]] <- goodman(as.numeric(colnames(tab_gkg_d)),
                                tab_gkg_d[as.numeric(rownames(tab_gkg_d)) %in% keys[v], ] /
                                  colSums(tab_gkg_d))
          gkg_tab_d[[w]] <- tab_gkg_d
        }
        pb_dc[[v]] <- pb_dc_d
        gkg[[v]] <- gkg_d
        gkg_tab[[v]] <- gkg_tab_d
      }
      ### return results
      res <- list(rf05 = rf05,
                  tab_c2_l = tab_c2_l, zero_columns2 = zero_columns2,
                  tab_c5_l = tab_c5_l, zero_columns5 = zero_columns5,
                  cw_g2 = cw_g2, cw_g5 = cw_g5,
                  can_cor = can_cor,
                  pb_dc = pb_dc,
                  gkg = gkg, gkg_tab = gkg_tab,
                  ncol_mmat = ncol_mmat)
      return(res)
    }

    ### frequencies of distractor usage
    ### including correct response
    apply(dataset, 2, table)
    ### load psych library
    library(psych)
    ### score all items
    scored <- score.multiple.choice(key = c(7,6,8,2,1,5,1,6,3,2,4,5),
                                    data = dataset, score = FALSE)
    ### does choosing a certain other distractor
    ### imply better overall scores?
    #
    ### run suggested distractor analysis
    ms_res <- get_results(dataset, keys = c(7,6,8,2,1,5,1,6,3,2,4,5))
    ### show Results for Table 4
    #
    ### show for which items the distractor choice frequencies were
    ### below 5%:
    ms_res$rf05
    ### Items 1 to 5 have too many distractors with response frequencies
    ### below 5%.
    #
    ### get 2PL parameters from mirt
    library(mirt)
    est_test2pl <- mirt(scored, 1, itemtype = "2PL")
    ### show results
    coef(est_test2pl)
    # a1 = 2PL-discrimination
    # d = 2PL-difficulty
    #
    ### canonical correlation findings
    ms_res$can_cor[6:12]
    ### check boundary conditions
    #
    ### average pb_dc
    lapply(ms_res$pb_dc, function(x) mean(unlist(x), na.rm = TRUE))[6:12]
    ### average gamma
    lapply(ms_res$gkg, function(x) mean(unlist(x), na.rm = TRUE))[6:12]
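To make the two main effect size measures computed in `get_results()` more concrete, the following minimal sketch evaluates Cohen's w and the canonical correlation on made-up toy data. The objects `tab`, `rest_score`, and `choice` are hypothetical and do not come from the dataset; the calls mirror the `chisq.test()` and `cancor()` usage in the appendix code, but the sketch is only an illustration, not the analysis reported in the article.

```r
### Toy illustration (hypothetical data, not the Myszkowski-Storme dataset)

### Cohen's w: rows = distractors, columns = two ability groups
tab <- matrix(c(30, 10,
                20, 20,
                10, 30), nrow = 3, byrow = TRUE)
chi <- chisq.test(tab)
n <- sum(tab)
### same formula as in get_results(): observed vs. expected proportions
cohen_w <- sqrt(sum((chi$observed / n - chi$expected / n)^2 /
                      (chi$expected / n)))
### for tables larger than 2 x 2 this equals sqrt(X^2 / n)
w_check <- sqrt(unname(chi$statistic) / n)

### Canonical correlation: score on the other items vs. dummy-coded
### distractor choice (only participants who did not solve the item)
rest_score <- c(2, 3, 4, 5, 6, 7, 8, 9, 10)
choice <- factor(rep(c("A", "B", "C"), each = 3))
dummies <- model.matrix(~ -1 + choice)
r_cc <- cancor(rest_score, dummies)$cor
### with a single score variable, the canonical correlation equals the
### square root of R-squared from regressing the score on distractor choice
eta <- sqrt(summary(lm(rest_score ~ choice))$r.squared)

print(c(cohen_w = cohen_w, w_check = w_check, r_cc = r_cc, eta = eta))
```

Because only one variable enters on the score side, the canonical correlation in this setting reduces to a multiple correlation between distractor choice and ability, which is one way to read it as an index of how much ability-related information the distractors carry.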

## References

- Arendasy, Martin, and Markus Sommer. 2013. Reducing response elimination strategies enhances the construct validity of figural matrices. Intelligence 41: 234–43.
- Arendasy, Martin, Markus Sommer, Georg Gittler, and Andreas Hergovich. 2006. Automatic generation of quantitative reasoning items: A pilot study. Journal of Individual Differences 27: 2–14.
- Attali, Yigal, and Tamar Fraenkel. 2000. The point-biserial as a discrimination index for distractors in multiple-choice items: Deficiencies in usage and an alternative. Journal of Educational Measurement 37: 77–86.
- Barton, Mark A., and Frederic M. Lord. 1981. An upper asymptote for the three-parameter logistic item-response model. ETS Research Report Series 1: i-8.
- Bethell-Fox, Charles E., David F. Lohman, and Richard E. Snow. 1984. Adaptive reasoning: Componential and eye movement analysis of geometric analogy performance. Intelligence 8: 205–38.
- Blum, Diego, and Heinz Holling. 2018. Automatic generation of figural analogies with the IMak package. Frontiers in Psychology 9: 1286.
- Blum, Diego, Heinz Holling, Maria S. Galibert, and Boris Forthmann. 2016. Task difficulty prediction of figural analogies. Intelligence 56: 72–81.
- Bock, R. Darrell. 1972. Estimating item parameters and latent ability when responses are scored in two or more nominal categories. Psychometrika 37: 29–51.
- Chalmers, R. Philip. 2012. mirt: A multidimensional item response theory package for the R environment. Journal of Statistical Software 48: 1–29.
- Cohen, J. 1992. A power primer. Psychological Bulletin 112: 155–59.
- Crocker, Linda S., and James Algina. 1986. Introduction to Classical and Modern Test Theory. Fort Worth: Harcourt Brace Jovanovich.
- Cureton, Edward. 1966. Corrected item-test correlations. Psychometrika 31: 93–96.
- Davis, Frederick B., and Gordon Fifer. 1959. The effect on test reliability and validity of scoring aptitude and achievement tests with weights for every choice. Educational and Psychological Measurement 19: 159–70.
- DeMars, Christine E. 2003. Sample size and recovery of nominal response model item parameters. Applied Psychological Measurement 27: 275–88.
- Garcia-Perez, Miguel A. 2014. Multiple-choice tests: Polytomous IRT models misestimate item information. Spanish Journal of Psychology 17: e88.
- Gierl, Mark J., Okan Bulut, Qi Guo, and Xinxin Zhang. 2017. Developing, analyzing, and using distractors for multiple-choice tests in education: A comprehensive review. Review of Educational Research 87: 1082–116.
- Gonthier, Corentin, and Jean-Luc Roulin. 2019. Intraindividual strategy shifts in Raven’s matrices, and their dependence on working memory capacity and need for cognition. Journal of Experimental Psychology: General 149: 564–79.
- Gonthier, Corentin, and Noémylle Thomassin. 2015. Strategy use fully mediates the relationship between working memory capacity and performance on Raven’s matrices. Journal of Experimental Psychology: General 144: 916–24.
- Goodman, Leo A., and William H. Kruskal. 1979. Measures of Association for Cross Classifications. New York: Springer.
- Guttman, Louis, and Izchak M. Schlesinger. 1967. Systematic construction of distractors for ability and achievement test items. Educational and Psychological Measurement 27: 569–80.
- Haladyna, Thomas M. 2004. Developing and Validating Multiple-Choice Test Items. New York: Routledge.
- Haladyna, Thomas M., and Steven M. Downing. 1993. How many options is enough for a multiple-choice test item? Educational and Psychological Measurement 53: 999–1010.
- Harville, David A. 2008. Matrix Algebra from a Statistician’s Perspective. New York: Springer.
- Hayes, Taylor R., Alexander A. Petrov, and Per B. Sederberg. 2011. A novel method for analyzing sequential eye movements reveals strategic influence on Raven’s Advanced Progressive Matrices. Journal of Vision 11: 1–11.
- Henrysson, Sten. 1962. The relation between factor loadings and biserial correlations in item analysis. Psychometrika 27: 419–29.
- Henrysson, Sten. 1971. Gathering, analyzing, and using data on test items. In Educational Measurement, 2nd ed. Edited by Robert L. Thorndike. Beverly Hills: American Council on Education.
- Hornke, Lutz F., and Michael W. Habon. 1986. Rule-based item bank construction and evaluation within the linear logistic framework. Applied Psychological Measurement 10: 369–80.
- Jacobs, Paul I., and Mary Vandeventer. 1970. Information in wrong responses. Psychological Reports 26: 311–15.
- Jarosz, Andrew F., and Jennifer Wiley. 2012. Why does working memory capacity predict RAPM performance? A possible role of distraction. Intelligence 40: 427–38.
- Johanson, George A., and Gordon P. Brooks. 2010. Initial scale development: Sample size for pilot studies. Educational and Psychological Measurement 70: 394–400.
- Klecka, William R. 1980. Discriminant Analysis. Beverly Hills: SAGE Publications, ISBN 0-8039-1491-1.
- Kline, Paul. 2000. The Handbook of Psychological Testing. London: Routledge.
- Kunda, Maithilee, Isabelle Soulieres, Agata Rozga, and Ashok K. Goel. 2016. Error patterns on the Raven’s Standard Progressive Matrices test. Intelligence 59: 181–98.
- Levine, Michael V., and Fritz Drasgow. 1983. The relation between incorrect option choice and estimated ability. Educational and Psychological Measurement 43: 675–85.
- Lord, Frederic M. 1980. Applications of Item Response Theory to Practical Testing Problems. Hillsdale: Lawrence Erlbaum Associates.
- Love, Thomas E. 1997. Distractor selection ratios. Psychometrika 62: 51–62.
- Matzen, Laura E., Zachary O. Benz, Kevin R. Dixon, Jamie Posey, James K. Kroger, and Ann E. Speed. 2010. Recreating Raven’s: Software for systematically generating large numbers of Raven-like matrix problems with normed properties. Behavior Research Methods 42: 525–41.
- Mitchum, Ainsley L., and Colleen M. Kelley. 2010. Solve the problem first: Constructive solution strategies can influence the accuracy of retrospective confidence judgments. Journal of Experimental Psychology: Learning, Memory, and Cognition 36: 699–710.
- Muraki, Eiji. 1992. A generalized partial credit model: Application of an EM algorithm. ETS Research Report Series 1: i-30.
- Myszkowski, Nils, and Martin Storme. 2018. A snapshot of g? Binary and polytomous item-response theory investigations of the last series of the Standard Progressive Matrices (SPM-LS). Intelligence 68: 109–16.
- Nunnally, Jum C., and Ira H. Bernstein. 1994. Psychometric Theory. New York: McGraw-Hill.
- R Core Team. 2019. R: A Language and Environment for Statistical Computing. Vienna: R Foundation for Statistical Computing.
- Revelle, William. 2018. Psych: Procedures for Personality and Psychological Research. Evanston: Northwestern University.
- Revuelta, Javier. 2005. An item response model for nominal data based on the rising selection ratios criterion. Psychometrika 70: 305–24.
- Schiano, Diane J., Lynn A. Cooper, Robert Glaser, and Hou C. Zhang. 1989. Highs are to lows as experts are to novices: Individual differences in the representation and solution of standardized figural analogies. Human Performance 2: 225–48.
- Sigel, Irving E. 1963. How intelligence tests limit understanding of intelligence. Merrill-Palmer Quarterly of Behavior and Development 9: 39–56.
- Snow, Richard E. 1980. Aptitude processes. In Aptitude, Learning, and Instruction: Cognitive Process Analyses of Aptitude. Edited by Richard E. Snow, Pat-Anthony Federico and William E. Montague. Hillsdale: Erlbaum, vol. 1, pp. 27–63. ISBN 978-089-859-043-2.
- Storme, Martin, Nils Myszkowski, Simon Baron, and David Bernard. 2019. Same test, better scores: Boosting the reliability of short online intelligence recruitment tests with nested logit item response theory models. Journal of Intelligence 7: 17.
- Suh, Youngsuk, and Daniel M. Bolt. 2010. Nested logit models for multiple-choice item response data. Psychometrika 75: 454–73.
- Thissen, David. 1976. Information in wrong responses to the Raven Progressive Matrices. Journal of Educational Measurement 13: 201–14.
- Thissen, David, Lynne Steinberg, and Anne R. Fitzpatrick. 1989. Multiple-choice models: The distractors are also part of the item. Journal of Educational Measurement 26: 161–76.
- Thompson, Bruce. 1984. Canonical Correlation Analysis. Newbury Park: SAGE Publications, ISBN 0-8039-2392-9.
- Vejleskov, Hans. 1968. An analysis of Raven matrix responses in fifth grade children. Scandinavian Journal of Psychology 9: 177–86.
- Vigneau, François, André F. Caissie, and Douglas A. Bors. 2006. Eye-movement analysis demonstrates strategic influences on intelligence. Intelligence 34: 261–72.
- Vodegel Matzen, Linda B. L., Maurits W. van der Molen, and Ad C. M. Dudink. 1994. Error analysis of Raven test performance. Personality and Individual Differences 16: 433–45.
- Von der Embse, Nathaniel P., Andrea D. Mata, Natasha Segool, and Emma-Catherine Scott. 2014. Latent profile analyses of test anxiety: A pilot study. Journal of Psychoeducational Assessment 32: 165–72.
- Wainer, Howard. 1989. The future of item analysis. Journal of Educational Measurement 26: 191–208.
- Yen, Wendy M., and Anne R. Fitzpatrick. 2006. Item Response Theory. In Educational Measurement. Edited by Robert L. Brennan. Westport: Praeger Publishers.

**Figure 1.** Trace line plots (Gierl et al. 2017; Wainer 1989) of two simulated items (N = 10,000 each) when distractors were simulated as non-discriminating in a 2PL (left plot) and when distractors were simulated with moderate NRM (nominal response model) discrimination in a 2PNL (right plot).

**Figure 2.** 2PL-difficulty split of type-I-error analysis: Depiction of the type-I-error rate (y-axis) as a function of sample size (x-axis), number of distractors (three distractors = top row; seven distractors = bottom row), 2PL-difficulty (diff_level: moderate vs. difficult vs. very difficult), and effect size measures combined with effect size thresholds (p10_cc = canonical correlation with a 0.10 threshold; p10_cw2 = Cohen’s ω based on two ability groups with a 0.10 threshold; p10_cw5 = Cohen’s ω based on five ability groups with a 0.10 threshold; p30_cc = canonical correlation with a 0.30 threshold; p30_cw2 = Cohen’s ω based on two ability groups with a 0.30 threshold; p30_cw5 = Cohen’s ω based on five ability groups with a 0.30 threshold; p50_cc = canonical correlation with a 0.50 threshold; p50_cw2 = Cohen’s ω based on two ability groups with a 0.50 threshold; p50_cw5 = Cohen’s ω based on five ability groups with a 0.50 threshold). The horizontal red dashed line represents the target type-I-error rate of 0.05.

**Figure 3.** Number-of-items split of empirical power analysis: Depiction of the empirical power (y-axis) as a function of NRM discrimination (0.40 = top row; 1.00 = middle row; and 1.75 = bottom row), sample size (x-axis), number of distractors (three distractors = top row in each sub-plot; seven distractors = bottom row in each sub-plot), number of items, and effect size measures combined with effect size thresholds. The horizontal red dashed line represents the target power level of 0.80. For more explanations, see Figure 2.

**Figure 4.** 2PL-difficulty split of empirical power analysis: Depiction of the empirical power (y-axis) as a function of NRM discrimination, sample size (x-axis), number of distractors, 2PL-difficulty (diff_level: moderate vs. difficult vs. very difficult), and effect size measures combined with effect size thresholds. The horizontal red dashed line represents the target power level of 0.80. For more explanations, see Figure 2 and Figure 3.

**Figure 5.** 2PL-discrimination split of empirical power analysis: Depiction of the empirical power (y-axis) as a function of NRM discrimination, sample size (x-axis), number of distractors, 2PL-discrimination (disc_level: low vs. moderate vs. high), and effect size measures combined with effect size thresholds. The horizontal red dashed line represents the target power level of 0.80. For more explanations, see Figure 2 and Figure 3.

Distractor | Relative Choice Frequency < 0.05 (Item 1/Item 2) | PB_{D} (Item 1/Item 2) | PB_{DC} (Item 1/Item 2) | ω_{D} (Item 1/Item 2) | γ (Item 1/Item 2) |
---|---|---|---|---|---|
Distractor 1 | 0/0 | −0.18/−0.21 | −0.54/−0.58 | 0.57/0.68 | 1/1 |
Distractor 2 | 0/0 | −0.29/−0.08 | −0.56/−0.47 | 0.55/0.39 | 1/1 |
Distractor 3 | 0/0 | −0.13/−0.36 | −0.50/−0.68 | 0.58/0.97 | 1/1 |

PB_{D} = point-biserial correlation for the contrast between participants who chose D vs. participants who chose any other option (including the correct option) with respect to test performance; PB_{DC} = point-biserial correlation for the contrast between participants who chose D vs. participants who chose the correct option with respect to test performance; ω_{D} (Haladyna-Downing approach; Haladyna and Downing 1993) = Cohen’s ω based on choice frequencies restricted to D as a function of five ability groups based on equidistant quantiles; γ = Goodman-Kruskal γ for the relationship between test performance on all other items and the probability of choosing the correct response, estimated from a 2 × J contingency table (where J is the number of possible performance scores), which Love (1997) suggested as an index for evaluating rising selection ratios.
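For reference, the indices named in this note can be written compactly as follows. PB_{D} uses the symbols defined in Section 1.1.2; PB_{DC} and the general form of Cohen's ω are transcribed from the R code in Appendix C. The symbols M_{DC} and S_{DC} (mean and standard deviation of test performance among participants who chose either D or the correct option), P_{C} (proportion of participants who chose the correct option), and the cell index c are notation introduced here for illustration only.

```latex
\begin{align}
PB_{D}  &= \frac{M_{D} - M}{S}\,\sqrt{\frac{P_{D}}{1 - P_{D}}}, \\
PB_{DC} &= \frac{M_{D} - M_{DC}}{S_{DC}}\,\sqrt{\frac{P_{D}}{P_{C}}}, \\
\omega  &= \sqrt{\sum_{c}
           \frac{\left(p_{c}^{\mathrm{obs}} - p_{c}^{\mathrm{exp}}\right)^{2}}
                {p_{c}^{\mathrm{exp}}}}
\end{align}
```

Here p_{c}^{obs} and p_{c}^{exp} denote the observed and expected proportions in cell c of the distractor-by-ability-group contingency table. How the cells are restricted for the distractor-specific ω_{D} is not spelled out in the appendix code, so the third line should be read as the general definition only.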

Distractor | Level 1: Zero (3 distractors/7 distractors) | Level 2: Moderate (3 distractors/7 distractors) | Level 3: High (3 distractors/7 distractors) | Level 4: Very High (3 distractors/7 distractors) |
---|---|---|---|---|
λ_{i1} | 0.00/0.00 | −0.40/−1.20 | −1.00/−3.00 | −1.75/−5.25 |
λ_{i2} | 0.00/0.00 | 0.00/−0.80 | 0.00/−2.00 | 0.00/−3.50 |
λ_{i3} | 0.00/0.00 | 0.40/−0.40 | 1.00/−1.00 | 1.75/−1.75 |
λ_{i4} | -/0.00 | -/0.00 | -/0.00 | -/0.00 |
λ_{i5} | -/0.00 | -/0.40 | -/1.00 | -/1.75 |
λ_{i6} | -/0.00 | -/0.80 | -/2.00 | -/3.50 |
λ_{i7} | -/0.00 | -/1.20 | -/3.00 | -/5.25 |
NRM discrimination (step size) | 0.00 | 0.40 | 1.00 | 1.75 |

NRM discrimination (step size) = difference between adjacent λ_{i} parameters that can be used as a general indicator of NRM discrimination.
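The λ values in the table above are equally spaced around zero, with the spacing given by the step size in the last row. A minimal sketch of how such values could be generated is shown below; the helper name `nrm_lambdas` is hypothetical and not taken from the authors' simulation code.

```r
### Equally spaced NRM slope parameters centered at zero
### (hypothetical helper, for illustration only)
nrm_lambdas <- function(n_distractors, step) {
  ### indices 1..n shifted so that they are symmetric around zero
  centered_index <- seq_len(n_distractors) - (n_distractors + 1) / 2
  centered_index * step
}

### three distractors, moderate level (step size 0.40)
nrm_lambdas(3, 0.40)
### seven distractors, very high level (step size 1.75)
nrm_lambdas(7, 1.75)
```

With three distractors the step size therefore spans a total λ range of two steps, whereas with seven distractors the same step size spans six steps, which is why the extreme λ values differ between the two conditions at the same discrimination level.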

**Table 3.** Percentages of cells in the simulation design with adequate type-I-error rate and empirical power.

Condition | R_{CC} (threshold = 0.30) | ω_{2} (threshold = 0.30) | ω_{5} (threshold = 0.30) | R_{CC} (threshold = 0.50) | ω_{2} (threshold = 0.50) | ω_{5} (threshold = 0.50) |
---|---|---|---|---|---|---|
Adequate type-I-error rate | | | | | | |
All | 69 | 70 | 25 | 100 | 100 | 72 |
Adequate power | | | | | | |
NRM Discrimination = 0.40 | | | | | | |
All | 23 | 17 | 5 | 0 | 0 | 7 |
M(γ) > 0.30 | 17 | 3 | 4 | - | - | 6 |
M(PB_{DC}) < −0.30 | 16 | 3 | 4 | - | - | 3 |
NRM Discrimination = 1.00 | | | | | | |
All | 61 | 45 | 23 | 35 | 24 | 23 |
M(γ) > 0.30 | 39 | 31 | 16 | 25 | 17 | 14 |
M(PB_{DC}) < −0.30 | 41 | 30 | 16 | 24 | 15 | 15 |
NRM Discrimination = 1.75 | | | | | | |
All | 65 | 62 | 26 | 65 | 41 | 54 |
M(γ) > 0.30 | 38 | 38 | 14 | 41 | 25 | 36 |
M(PB_{DC}) < −0.30 | 40 | 41 | 15 | 44 | 24 | 40 |

M(PB_{DC}) < −0.30: In addition to adequate empirical power, the boundary condition had to be met that PB_{DC}, which checks discrimination between item solvers and participants who chose a certain distractor, was smaller than −0.30. Frequencies in bold font refer to the best performing effect sizes under the respective threshold conditions.

**Table 4.** Distractor choice frequency, 2PL-parameter estimates, and distractor effect size measure findings on the Myszkowski-Storme dataset.

Item | Number of Distractors with Relative Choice Frequency < 0.05 | 2PL-Difficulty | 2PL-Discrimination | R_{CC} | M(PB_{DC}) ^{2} | M(γ) ^{3} |
---|---|---|---|---|---|---|
Item 1 | 6 | 1.32 | 0.85 | NA ^{1} | NA ^{1} | NA ^{1} |
Item 2 | 7 | 3.56 | 2.01 | NA ^{1} | NA ^{1} | NA ^{1} |
Item 3 | 6 | 2.07 | 1.69 | NA ^{1} | NA ^{1} | NA ^{1} |
Item 4 | 6 | 4.11 | 4.10 | NA ^{1} | NA ^{1} | NA ^{1} |
Item 5 | 7 | 5.51 | 4.97 | NA ^{1} | NA ^{1} | NA ^{1} |
Item 6 | 5 | 2.13 | 2.38 | 0.46 | −0.38 | 0.48 |
Item 7 | 4 | 1.23 | 1.55 | 0.28 | −0.36 | 0.40 |
Item 8 | 1 | 0.50 | 1.61 | 0.34 | −0.47 | 0.65 |
Item 9 | 3 | 0.40 | 1.27 | 0.34 | −0.39 | 0.37 |
Item 10 | 1 | −0.70 | 2.20 | 0.36 | −0.61 | 0.77 |
Item 11 | 1 | −0.82 | 1.51 | 0.21 | −0.48 | 0.21 |
Item 12 | 1 | −0.91 | 1.14 | 0.31 | −0.43 | 0.23 |

^{1} Distractor effect size measures were not calculated when more than five distractors had a relative choice frequency < 0.05 (i.e., when at most one distractor remained for analysis).

^{2} The average of all PB_{DC} values for all available distractors of an item is reported; the lower the average PB_{DC}, the better.

^{3} The average of all γ values for all available distractors of an item is reported; the higher the average γ, the better. The R code to reproduce the findings in this table can be found in Appendix C.
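As a small illustration of how the Table 4 values could be screened against the 0.30 effect size threshold and the two boundary conditions used in Table 3, the sketch below re-enters the reported values for Items 6 to 12 by hand. The data frame `table4`, its column names, and the choice of `>=` for the R_{CC} comparison are assumptions made for this example; combining the three conditions into one flag is likewise only one possible reading and not the authors' item selection decision.

```r
### Values copied from Table 4 (Items 6 to 12 only; Items 1 to 5 are NA)
table4 <- data.frame(
  item    = 6:12,
  r_cc    = c(0.46, 0.28, 0.34, 0.34, 0.36, 0.21, 0.31),
  m_pb_dc = c(-0.38, -0.36, -0.47, -0.39, -0.61, -0.48, -0.43),
  m_gamma = c(0.48, 0.40, 0.65, 0.37, 0.77, 0.21, 0.23)
)

### effect size threshold suggested by the simulation results
threshold <- 0.30

### one flag per condition (assumed comparison operators)
table4$flag_r_cc  <- table4$r_cc >= threshold
table4$flag_pb_dc <- table4$m_pb_dc < -threshold
table4$flag_gamma <- table4$m_gamma > threshold

### items meeting all three conditions at once
table4$all_met <- with(table4, flag_r_cc & flag_pb_dc & flag_gamma)

print(table4[, c("item", "flag_r_cc", "flag_pb_dc", "flag_gamma", "all_met")])
```

Keeping one flag per condition, rather than a single combined filter, makes it visible which criterion an item fails, which is usually the more useful information when revising distractors.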

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Forthmann, B.; Förster, N.; Schütze, B.; Hebbecker, K.; Flessner, J.; Peters, M.T.; Souvignier, E.
How Much *g* Is in the Distractor? Re-Thinking Item-Analysis of Multiple-Choice Items. *J. Intell.* **2020**, *8*, 11.
https://doi.org/10.3390/jintelligence8010011
