Same Test, Better Scores: Boosting the Reliability of Short Online Intelligence Recruitment Tests with Nested Logit Item Response Theory Models

Assessing job applicants’ general mental ability online poses psychometric challenges due to the necessity of having brief but accurate tests. Recent research (Myszkowski & Storme, 2018) suggests that recovering distractor information through Nested Logit Models (NLM; Suh & Bolt, 2010) increases the reliability of ability estimates in reasoning matrix-type tests. In the present research, we extended this result to a different context (online intelligence testing for recruitment) and a larger sample (N = 2949 job applicants). We found that the NLMs outperformed the Nominal Response Model (Bock, 1972) and provided significant reliability gains compared with their binary logistic counterparts. In line with previous research, the gain in reliability was obtained especially at low ability levels. Implications and practical recommendations are discussed.


Introduction
With the development of the Internet, the assessment of job applicants is increasingly performed online, which facilitates large-scale testing while reducing costs [1,2]. This recent trend has led to the creation of a new research field in psychometrics, referred to as e-assessment [2]. Considering the relevance of General Mental Ability (GMA) in predicting job performance [3], many e-assessment platforms have included tasks that aim at capturing it, such as logical series or logical reasoning matrices, in their online test batteries.
The assessment phase in e-recruiting poses very specific psychometric challenges. On the one hand, the assessment should ideally lead to a short-list of the best applicants [1,2]. The accuracy of the assessment is therefore a key issue in e-recruiting, just as it is in recruiting in general. On the other hand, the assessment phase cannot require applicants to take part in processes that are too time-consuming and too cognitively demanding. It is indeed not acceptable to extensively test people who have a relatively low chance of getting an interview. Perceived unfairness of the recruitment process has been shown to have a negative impact on the image of the recruiting company, which can lead to negative word of mouth and/or intentions not to complete the recruitment process [4,5]. The challenge inherent to e-assessment in a recruitment context is essentially the challenge of short psychometric measures: extracting as much information as possible from short instruments.
Extracting reliable information from short tests remains a real challenge from a measurement perspective [6]. Fortunately, psychometricians have allies in this challenging endeavor, such as Item Response Theory (IRT) modeling, which often allows them to extract more information from short psychometric tools than Classical Test Theory (CTT). Originally suggested for multiple-choice items by Bock [7], one way that researchers can take advantage of the IRT framework in logical series or matrices tests consists in extracting information from which incorrect response was selected. This approach is based on the premise that when a test taker selects a wrong response option out of a set of wrong response options, that choice can carry information about the ability of the test taker. Further, recent developments [8] applied to progressive matrices have suggested recovering additional information from distractor responses through Nested Logit Models (NLM) [9], and have indicated that, in logical reasoning tests, such models may be more appropriate than Bock's [7] Nominal Response Model, but also than traditional binary IRT models [8]. In that research, recovering information from the choice of distractors provided significant gains in reliability in comparison with discarding such information and using traditional binary logistic models.
Currently, no study has investigated whether applying this approach in the field of recruitment would lead to gains in reliability. Yet, taking an online GMA test as part of a recruitment process is in several ways different from taking a GMA test for an experiment in the lab. There is reasonable evidence to suspect that such differences could affect the way distractors are processed by test takers, which could possibly jeopardize the very idea of recovering psychometric information from distractors. In the present article, our main aim is to extend and conceptually replicate previous research on students and in laboratory conditions [8] to online personnel pre-selection contexts, by testing whether the modeling strategies previously suggested are able, even in this context, to provide tangible gains in reliability. The effort of conducting conceptual replications in the field is crucial in psychology to rule out the possibility that a laboratory finding is too weak to be relevant in contexts that are less tightly controlled [10].

Binary Item Response Theory Models
Item Response Theory (IRT) has traditionally helped psychometricians improve the reliability of the ability estimates obtained with short intelligence measures [8,11]. IRT provides a framework that has indeed been shown to improve the reliability of measurement compared to the Classical Test Theory (CTT) approach [12]. While CTT assumes that all items are linked to the latent trait in a similar fashion, IRT assumes that each item is linked to the latent trait in a unique manner [13]. The aim of IRT is to model the probability of a response to an item as a function of the latent trait or ability of the test taker, traditionally with a non-linear function of the latent trait that is unique for each item. In the case of binary responses, the non-linear function is frequently the logistic function. Because of the flexibility of its parametrization in comparison with CTT, IRT allows a variety of testing phenomena to be accounted for, and relevant information to be extracted, in the context of GMA assessment [8].
GMA tests, such as progressive matrices or logical series, usually contain one correct answer option and several incorrect answer options-which are often referred to as distractors. Although the response dataset is thus polytomous, it is typical to recode the dataset by collapsing the distractor responses together, which yields a dichotomous success/failure variable format. The binary IRT approach generally consists in modeling these dichotomous responses using a logistic function of the latent ability and a set of item parameters representing various item characteristics (difficulty, discrimination, etc.).
The simplest IRT models, including only one parameter and referred to as One-Parameter Logistic (1PL) models, characterize items by their level of difficulty only. The difficulty parameter corresponds to the level of the latent trait at which the slope of the function linking the ability and the probability of selecting the correct response option reaches its maximum; in other words, the ability level where the discrimination of the item is at its maximum. The model is often extended with another parameter, discrimination, leading to Two-Parameter Logistic (2PL) models. Such models not only take into account the difficulty of an item, but also its ability to discriminate between ability levels. The discrimination parameter corresponds to the strength of the relationship between the ability and the probability of selecting the correct response option. Three-Parameter Logistic (3PL) models add to previous models a variable lower asymptote in the relation between the ability and the probability of selecting the correct response option. In the context of IRT, the lower asymptote corresponds to the probability of selecting the correct answer to a given item by guessing it. Therefore, 3PL models allow items to be characterized by the extent to which they are susceptible to correct guessing. A fourth parameter is included in 4-Parameter Logistic (4PL) models, which corresponds to a variable upper asymptote in the relation between the ability and the probability of selecting the correct response option. In the context of IRT, the upper asymptote corresponds to the probability of responding incorrectly to an item in spite of having a level of ability that should normally lead to providing the right answer. This parameter allows the modeling of the phenomenon of inattention or slipping. Although 4PL models are used less frequently than 1, 2, and 3PL models, they have been shown to correct adequately for careless mistakes and to improve measurement efficiency [14,15].
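The logistic family described above can be sketched numerically. The following is a minimal illustration, not any package's implementation; all parameter values are hypothetical.

```python
import math

def p_correct(theta, a=1.0, b=0.0, g=0.0, u=1.0):
    """Probability of a correct response under a binary logistic IRT model.

    a: discrimination, b: difficulty, g: lower asymptote (guessing),
    u: upper asymptote (u = 1 means no slipping). With g = 0 and u = 1 this
    reduces to the 2PL; additionally fixing a across items gives the 1PL.
    """
    logistic = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return g + (u - g) * logistic

# Hypothetical 3PL item: the guessing asymptote keeps p above g = 0.2
p_low = p_correct(-4.0, a=1.5, b=0.0, g=0.2)  # near the lower asymptote
p_mid = p_correct(0.0, a=1.5, b=0.0, g=0.2)   # at theta = b: g + (1 - g)/2 = 0.6
```

At the item's difficulty (theta = b), the probability sits halfway between the two asymptotes, which is the inflection point described above.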
Although binary IRT is able to model phenomena that appear in matrix-type reasoning tasks, even models that include guessing fail to account for the possibility that choosing one distractor over another could be related to the respondent's ability, a phenomenon often described as ability-based guessing [16]. Indeed, the lower asymptote parameter of the 3PL and 4PL models accounts for the probability of correctly guessing, but which distractor is chosen when an examinee uses a guessing strategy is not considered: all distractor responses are still collapsed together as incorrect. Yet, if one considers that the guessing process is related to the ability, then the outcome of this process, the distractor chosen, can contain information about the ability that binary models fail to recover.

Recovering Distractor Information
In matrix-type or logical series type tests, distractors are usually designed in a way that they are only partially in line with the set of rules that structures the logical series. For example, if three rules are structuring the progression of a logical series, the correct response option will respect all three of them, but frequently a distractor could respect only two, while another may respect one or even none of them. In this example, a distractor that respects two out of three rules could be considered as a better response option than a distractor that would only respect one out of three rules, although both are ultimately incorrect response options. As a consequence, the wrong response options that are selected by test takers are usually not equivalent in (in)correctness, and thus could carry information about their ability [17].

The Nominal Response Model
A traditional approach to recovering information from distractors is to fit the nominal data with Bock's [7] Nominal Response Model (NRM). This model is essentially a multinomial adaptation of the 2PL model, where the probability P(x_ij = v | θ_j) that an examinee j chooses a category v, which could be the correct response or a distractor, among the m_i possible responses for item i is modeled as a function of the examinee's ability θ_j, an intercept item-category parameter ζ_iv and a slope item-category parameter λ_iv, as well as the item-category parameters of all other categories, such that:

P(x_ij = v | θ_j) = e^(ζ_iv + λ_iv θ_j) / Σ_{k=1}^{m_i} e^(ζ_ik + λ_ik θ_j)     (1)

A way to interpret this model is to consider each category as having a propensity e^(ζ_iv + λ_iv θ_j), with the probability of selecting a category given by the category's propensity over the total of all category propensities. When applied to multiple-choice items, a consequence of this is that the Nominal Response Model's formulation is mathematically consistent with a response process where all response categories compete with one another and where, depending on the examinee's ability, one category dominates the others in propensity, resulting in the examinee (more probably) responding in favor of that category [9]. However, as we later discuss, this representation of the response process may not be in line with all multiple-choice tests, especially in the case of logical reasoning matrices or logical series.
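The propensity-normalization interpretation of the NRM can be sketched as a softmax over categories. This is a minimal illustration with hypothetical item-category parameters, not an estimation routine.

```python
import math

def nrm_probs(theta, zetas, lambdas):
    """Nominal Response Model: each category v has propensity
    exp(zeta_v + lambda_v * theta); category probabilities are these
    propensities normalized over all categories (a softmax)."""
    props = [math.exp(z + l * theta) for z, l in zip(zetas, lambdas)]
    total = sum(props)
    return [p / total for p in props]

# Hypothetical 4-option item where the last category (the key) has the
# steepest slope: its propensity dominates as theta increases.
probs = nrm_probs(2.0, zetas=[0.0, 0.2, 0.1, -0.5], lambdas=[-0.8, -0.2, 0.1, 1.2])
```

Because every category enters the same normalizing sum, each category's probability is affected by the propensities of all the others, which is the competition assumption discussed above.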

Nested Logit Models
In certain multiple-choice tests, in order to respond, the examinee is supposed to consider a stimulus (for example, in matrix-type tests, the incomplete matrix), from which a rule should be extracted and used to find the completing element. In such cases, it can be questioned whether examinees put into competition the different response options right away-a process that would ideally be modeled by the NRM. Instead, it could be that they first focus on understanding the stimulus (the matrix, or the beginning of the series) to find the correct response (regardless of what the response options are). From that process, two situations may arise-either they have understood the rule correctly and found the correct response-in that case, the distractors are not really considered as viable options and the correct response is selected-or they have not-and in that case the response options are put in competition in the guessing strategy.
Such a sequential process was described by Suh and Bolt [9] as a motivation to develop a new class of item-response models for multiple-choice questions where this double process could be considered: Nested Logit Models (NLM). NLMs have been designed to model situations in which the response choice possesses a nested structure, that is when the final choice of a response option is made through a sequential process.
NLMs attempt to approximate the response probabilities that arise from this sequential process by combining the two models that best describe each step into a single model. NLMs have two levels that separate the response options into two nests. At a higher level (level 1), the model distinguishes the choice of the correct response option versus the choice of any incorrect response option, which can be achieved with a binary logistic IRT model (e.g., the 2PL, 3PL or 4PL model). At a lower level (level 2), the model expresses the probability of selecting one particular distractor (as opposed to another one) as the product of the probability of selecting any distractor (the complement of the probability modeled by the level 1 part) and a probability modeled using the propensities of each distractor, similar to a Nominal Response Model of the distractors.
To summarize, using the 4-Parameter NLM (4PNL) as an example at level 1, the probability P(x_ij = u | θ_j) that the jth person selects the correct response (category u) to the ith item depends on their ability θ_j and item parameters α_i (discrimination/slope), β_i (difficulty/intercept), γ_i (lower asymptote) and δ_i (upper asymptote), such that:

P(x_ij = u | θ_j) = γ_i + (δ_i − γ_i) / (1 + e^(−α_i(θ_j − β_i)))     (2)

Similar to binary logistic models, the 3-Parameter Nested Logit (3PNL) model is a constrained 4PNL where δ_i is fixed, generally (and throughout this paper) to 1, such that:

P(x_ij = u | θ_j) = γ_i + (1 − γ_i) / (1 + e^(−α_i(θ_j − β_i)))     (3)

Further, the 2-Parameter Nested Logit (2PNL) model is a constrained 3PNL where γ_i is fixed, generally (and throughout this paper) to 0, such that:

P(x_ij = u | θ_j) = 1 / (1 + e^(−α_i(θ_j − β_i)))     (4)

At level 2, which models the distractor responses, the probability that the examinee selects the distractor category v among the m_i possible distractor responses is modeled as the product of the probability of responding incorrectly, 1 − P(x_ij = u | θ_j), and the probability that the examinee selects that distractor conditional upon an incorrect response. The latter is similar to a Nominal Response Model, where distractor responses have propensities that are a function of the ability θ_j, an intercept ζ_iv and a slope λ_iv. The resulting distractor model for the probability P(U_ij = 0, D_ijv | θ_j) that person j responds incorrectly and selects distractor v for item i is thus given by:

P(U_ij = 0, D_ijv | θ_j) = [1 − P(x_ij = u | θ_j)] × e^(ζ_iv + λ_iv θ_j) / Σ_{k=1}^{m_i} e^(ζ_ik + λ_ik θ_j)     (5)

where the sum runs over the m_i distractor categories of item i. Using the level 1 4PL model in Equation (2) in Equation (5) yields the 4PNL model; using the level 1 3PL model in Equation (3) yields the 3PNL model; and using the level 1 2PL model in Equation (4) yields the 2PNL model. An important distinction to note between the models of this class and the Nominal Response Model is that, in the NLM, the probability of a correct response is not directly affected by the propensities towards the different distractors; rather, the probability of selecting the distractors is conditional upon the probability of a correct (or rather, incorrect) response. In contrast, in the Nominal Response Model, the propensities towards all response categories, correct response and distractors alike, affect one another.
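The two-level structure can be sketched numerically using the 2PNL as the simplest case. This is an illustration with hypothetical parameter values, not an estimation routine.

```python
import math

def two_pnl(theta, a, b, zetas, lambdas):
    """2PNL: level 1 is a 2PL for correct vs. incorrect; level 2 spreads
    the incorrect-response probability over the distractors using
    NRM-style propensities (a softmax conditional on an incorrect response)."""
    p_u = 1.0 / (1.0 + math.exp(-a * (theta - b)))            # level 1 (2PL)
    props = [math.exp(z + l * theta) for z, l in zip(zetas, lambdas)]
    total = sum(props)
    p_distractors = [(1.0 - p_u) * p / total for p in props]  # level 2
    return p_u, p_distractors

# Hypothetical item with three distractors
p_u, p_d = two_pnl(0.5, a=1.2, b=0.0,
                   zetas=[0.3, -0.1, 0.0], lambdas=[-0.9, 0.2, -0.4])
# The correct-response and distractor probabilities sum to 1 by construction.
```

Note that, unlike in the NRM sketch, the correct-response probability p_u is computed without reference to the distractor propensities, which only partition the remaining probability mass.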
To illustrate NLMs, we present in Figure 1 the item-category characteristic curves for an item of the test studied in this paper. The correct response category (4) is increasingly probable as θ_j increases. However, response category 3, which is the only distractor response where the blue and the yellow squares are (correctly) not adjacent, would more probably be selected by individuals with low abilities (θ_j ≈ −2.7), while category 1 would more probably be selected by individuals with even lower abilities (θ_j < −3), thus showing that the choice of distractor may be informative of θ_j.

The Aim of This Study
Although Nominal Response Models were originally considered as a way to recover information from multiple-choice tests, recent research suggests that, in the case of matrix- or series-type GMA tests, NLMs may better fit the nominal-level data than the NRM and provide significant reliability gains in comparison with binary logistic models. In particular, Myszkowski and Storme [8] have shown that, on the last series of the Standard Progressive Matrices [18], (1) NLMs provided a better fit than Nominal Response Models to the nominal-level data, and (2) NLMs allowed significant reliability gains when estimating ability.
Yet, however promising, this result has only been observed with one GMA test and has only been used on a convenience sample of undergraduates, in a low-stakes situation. This study aims at bridging this gap by replicating this result on another short GMA test, with higher stakes, and in a context that would be particularly interested in these reliability gains: Online recruitment.
The conditions under which job applicants take GMA tests are indeed very different from the conditions in which research participants take similar tests in the lab as part of a typical research study. For example, in a recruitment context, the stakes are higher than when taking the test in a lab. Previous research on the effect of pressure on cognitive processes when taking intelligence tests has shown that, under pressure, working memory is busy processing intrusive thoughts, which can in turn have a negative impact on performance [19,20]. It is possible that this phenomenon also affects the way distractors are processed and leads to a different processing of response options. When under pressure, test takers who fail at identifying the rule that structures the progression of the series might experience high levels of stress and fail at efficiently comparing distractors to identify the best of the incorrect response options. As a consequence, in the context of the online assessment of job applicants, the choice of distractors might carry little information about the ability of test takers. If this is the case, NLMs should not lead, in this context, to gains in empirical reliability compared with binary models.
Furthermore, the fact that job applicants usually take online tests in their own time leads to less standardized testing contexts. Compared with the relatively controlled and quiet conditions of a lab, there might be more attentional perturbations in the environment, which might induce a shallower processing of the wrong response options. Consequently, it is possible that in the context of e-assessment, the choice of distractors is not so much reflective of the ability of the test taker, which could hinder the potential gains from NLMs.
The aim of the present study is to test whether the findings of Myszkowski and Storme [8]-obtained in a low psychological pressure and controlled context-can be replicated and generalized to an assessment situation characterized by more psychological pressure and less standardization, as well as a different test.

Participants and Procedure
The sample consisted of 2949 French job applicants (2084 men, 865 women; M_age = 36.88, SD_age = 8.66) who responded online to a logical series test that aims at measuring GMA. The examinees responded using an e-assessment application presented in their web browser. As is common in e-assessment, the standardization regarding when and where the test was taken can be expected to have been relatively low, as job applicants were free to take the test at the time and place that was most convenient for them. Of the participants, 40.96% had a master's (or higher) degree, 23.64% had a bachelor's degree, and the remaining applicants had less than a bachelor's degree.

Instrument
The test under investigation, the GF20, comprises 20 incomplete logical series, each presented with six response options to complete the missing part, including one correct answer that can be deduced from the application of logical rules. Each logical series consists of a 4 by 1 matrix with colored cells moving progressively on a grid according to simple geometric rules, such as translations and rotations. The 20 items that comprise the final test were designed and pre-tested to discriminate different levels of ability. An item example is provided in Figure 1. Except for the instructions asking participants to complete the series, the test only included non-verbal and non-numerical content. No time limit was imposed on applicants. It took them on average 21.30 min to complete the 20 items (SD = 9.78). Items were presented one by one. Participants were required to provide a response to each item before they could move on to the next item, and were not able to go back.

Model Estimation
All binary IRT models, that is, the 1-Parameter Logistic (1PL), 2-Parameter Logistic (2PL), 3-Parameter Logistic with free lower asymptote (3PL), and 4-Parameter Logistic (4PL) models, were estimated using an Expectation-Maximization (EM) algorithm with the R package "mirt" [23]. All models successfully converged. Nevertheless, the information matrix of the 4PL model could not be inverted to compute the parameter standard errors (decreasing the convergence tolerance and changing the estimation method did not solve this issue), which may be a sign that the estimates were unstable. Item characteristic curve plots, which, for binary models, present the expected probability of a correct response as a function of the latent ability θ_j, were produced using the R package "jrt" [24]. To keep the paper concise, only models with appropriate fit were plotted.

Model Fit
The fit of the models was then compared using Likelihood Ratio Tests (LRT) and the models' corrected Akaike Information Criterion (AICc). For the former, p values below 0.05 were used to indicate a significantly improved fit of the more complex (less constrained) model over the less complex (more constrained) model. For the latter, a smaller AICc indicates a better (more parsimonious) model fit.
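The two comparison tools reduce to simple formulas. The following sketch uses invented log-likelihood values purely for illustration; they are not the values obtained in this study.

```python
def aicc(log_lik, k, n):
    """Corrected AIC: the usual -2LL + 2k penalty, plus a small-sample
    correction term that vanishes as n grows."""
    aic = -2.0 * log_lik + 2.0 * k
    return aic + (2.0 * k * (k + 1)) / (n - k - 1)

def lrt_chi2(log_lik_constrained, log_lik_free):
    """Likelihood-ratio statistic: twice the log-likelihood difference,
    referred to a chi-square with df = difference in free parameters."""
    return 2.0 * (log_lik_free - log_lik_constrained)

# Hypothetical log-likelihoods for a 20-item test (n = 2949):
# a 2PL (40 item parameters) against a 3PL (60 item parameters)
delta_chi2 = lrt_chi2(-30000.0, -29750.0)  # 500.0, with df = 60 - 40 = 20
better_by_aicc = aicc(-29750.0, 60, 2949) < aicc(-30000.0, 40, 2949)
```

A large chi-square on few degrees of freedom favors the less constrained model, while the AICc weighs the same likelihood improvement against the extra parameters.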
In addition, absolute model fit indices were obtained by using limited-information Goodness-of-Fit statistics [25] as implemented in "mirt." As is common practice, although more frequently seen in Structural Equation Modeling, and similar to the original study of Myszkowski and Storme [8], we used the Comparative Fit Index (CFI) and Tucker-Lewis Index (TLI) as absolute model fit indices, with thresholds of 0.95, along with the Standardized Root Mean Square Residual (SRMR), with a threshold of 0.08, and the Root Mean Square Error of Approximation (RMSEA), with a threshold of 0.06.

Reliability
Since the aim of this paper is to extend and replicate the finding that NLMs provide an increase in measurement accuracy in logical GMA tests, as found with Raven's progressive matrices [8], quantifying measurement accuracy is key. Measurement accuracy is represented by several statistics in IRT, notably information, the standard error of measurement, and reliability, which are mathematical transformations of one another. Reliability was chosen in this study because it is a familiar metric for most researchers in both CTT and IRT, is conveniently bounded by 0 and 1, and is the metric chosen in the article that this study attempts to replicate. However, it should be noted that the conclusions reached here about reliability also extend to information and standard errors.
Similar to the original study, reliability functions were plotted for the 2PL, 3PL and 4PL models, overlaid with their Nested Logit counterparts. In addition, and also similar to the original study, marginal estimates of empirical reliability were computed [26]. The estimate of empirical reliability reported corresponds to the reliability of the θ_j scores, averaged across all cases j.
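One common estimator of marginal empirical reliability relates the variance of the ability estimates to the average error variance. The sketch below uses this estimator with invented scores; it is an illustration of the quantity, not of mirt's exact implementation.

```python
from statistics import mean, pvariance

def empirical_reliability(theta_hats, standard_errors):
    """Marginal empirical reliability: the variance of the ability
    estimates over that variance plus the average squared standard
    error across cases."""
    error_var = mean(se ** 2 for se in standard_errors)
    score_var = pvariance(theta_hats)
    return score_var / (score_var + error_var)

# Hypothetical scores and standard errors for five examinees:
# tighter standard errors push the estimate toward 1.
rxx = empirical_reliability([-1.0, -0.2, 0.1, 0.9, 1.4],
                            [0.4, 0.3, 0.3, 0.35, 0.5])
```

This makes concrete why reliability, information, and standard errors are transformations of one another: shrinking the standard errors directly raises the estimate.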

Model Estimation
The models for nominal data, that is, the Nominal Response (NR), the 2-Parameter Nested Logit (2PNL), the 3-Parameter Nested Logit (3PNL) and the 4-Parameter Nested Logit (4PNL) models, were estimated with the package "mirt" [23] using an EM algorithm. All models converged successfully. However, as with the binary models, the information matrix of the 4PNL model could not be inverted to compute the parameter standard errors, which may be a sign that the estimates were unstable. As for the binary models, item category curve plots, which present the expected probability of selecting a category as a function of the latent ability θ_j, were computed using "jrt" [24]. Again, to keep the paper concise, only models with appropriate fit were reported.

Model Fit
Similar to the binary models, Likelihood Ratio Tests were used to compare the different nominal models. However, only the 2PNL, 3PNL and 4PNL models are nested with one another, and thus only they allow the use of Likelihood Ratio Tests when comparing them. The AICcs of all models were computed, and the AICc was used to compare the Nominal Response model with the other models.
Polytomous models are much more heavily parametrized than binary models, which, in some cases, prevents the computation of limited-information Goodness-of-Fit statistics, as in Myszkowski and Storme [8], thereby limiting the assessment of model fit. However, in the present case, because of the larger sample size than in Myszkowski and Storme [8], we were able to compute them, and we used the same indices and thresholds discussed earlier for the binary models.

Reliability
Similar to the binary models, we also computed the reliability functions of the NLMs, which were plotted as an overlay on the reliability functions of their respective binary counterparts (i.e., 2PL and 2PNL, 3PL and 3PNL, 4PL and 4PNL), thereby facilitating visual comparisons. We also computed the empirical reliability of each model averaged across cases as an estimate of marginal reliability.
As one of the aims of this study is to examine potential gains in reliability from using NLMs as opposed to their binary counterparts, we computed the reliability gain ∆r xx between models by computing their difference. Similar to the original study and other previous studies [8,15], we used bootstrapping to obtain a Wald's z test (based on the bootstrapped standard error) and 95% Confidence Intervals for the reliability gains.
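The bootstrap-based Wald test for a reliability gain can be sketched as follows. The bootstrap distribution here is simulated with invented values for illustration; in the actual analysis each replicate would come from refitting both models on a resampled dataset.

```python
import random
from statistics import mean, stdev

def wald_gain_test(bootstrap_gains, z_crit=1.96):
    """Wald-type z test for a reliability gain: the estimate is the mean
    bootstrapped gain, the standard error is the bootstrap SD, and the
    95% confidence interval is estimate +/- 1.96 * SE."""
    est = mean(bootstrap_gains)
    se = stdev(bootstrap_gains)
    z = est / se
    return est, z, (est - z_crit * se, est + z_crit * se)

# Hypothetical bootstrap distribution of the gain (NLM minus binary model)
random.seed(42)
gains = [random.gauss(0.02, 0.005) for _ in range(1000)]
est, z, ci = wald_gain_test(gains)
# Here z exceeds 1.96, so such a gain would be judged significant.
```

The gain is deemed significant when the confidence interval excludes zero (equivalently, |z| > 1.96).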

Binary IRT Models
The model fit indices of all binary models are reported in Table 1. The 2PL, 3PL and 4PL models all had satisfactory fit, with the 4PL model providing the best fit. The 4PL model fitted significantly better than the 3PL model (∆χ² = 167.405, ∆df = 20, p < 0.001), which fitted significantly better than the 2PL model (∆χ² = 519.018, ∆df = 20, p < 0.001), which fitted significantly better than the 1PL model (∆χ² = 1100.652, ∆df = 19, p < 0.001). As they were the best fitting models and provided very similar absolute fit indices, we present the item characteristic curves of the 2PL, 3PL and 4PL models respectively in Figures 2-4. Their similarity and the relatively high upper asymptotes for the 4PL model (for the 3PL, they are fixed to 1) are in line with the fact that the two models provided similar fit.
The parameter estimates (along with standard errors for the 2PL and 3PL) of the 2PL, 3PL, and 4PL models are presented in Table 2. The marginal estimates of empirical reliability for all the binary models were satisfactory and close to the CTT-based estimates reported earlier: 0.833 for the 1PL model, 0.849 for the 2PL model, 0.868 for the 3PL model and 0.873 for the 4PL model.

Nominal Models
The model fit indices of all nominal models are reported in Table 3. Although the Nominal Response model provided a borderline acceptable fit, it was, as hypothesized, outperformed by all the NLMs, which all presented satisfactory fit. The 4PNL model fitted significantly better than the 3PNL model (∆χ² = 82.624, ∆df = 20, p < 0.001), which fitted significantly better than the 2PNL model (∆χ² = 541.102, ∆df = 20, p < 0.001).
The item category curve plots of the 2PNL, 3PNL and 4PNL models are presented respectively in Figures 5-7. Their model estimates as well as standard errors are presented respectively in Tables 4-6. The reliability functions of the 2PL, 3PL and 4PL models are reported with their Nested Logit counterparts in Figures 8-10, respectively. As noted by a reviewer, between a binary model and its nested counterpart, θ_j is not perfectly invariant, and thus the reliability functions may cross, as in Figure 4. This was also previously observed in the comparison between binary and nominal response models [7]. As expected, the figures show that using NLMs provided increments in reliability especially in the lower range of abilities.

Discussion
The aim of the present research was to extend the previous findings of Myszkowski and Storme [8] to a different testing modality (online assessment), a different and higher-stakes context (personnel selection), a larger sample, and a different logical reasoning test.
We found that the 4-parameter models, both binary and nested logit, were likely unstable (as their information matrices could not be inverted), although they seemed to outperform their 1PL, 2PL, and 3PL counterparts. Because the 2-parameter and 3-parameter models did not present this issue while still showing excellent fit, the results suggest that choosing them may be a more parsimonious but still well-fitting approach for this test. In fact, since the 2PL and 2PNL fit almost as well as the 3PL and 3PNL, respectively, they may be the more optimal modeling strategy for this test.
We also found that, as hypothesized, the Nested Logit Models (NLM) both outperformed the Nominal Response Model [7] and provided significant reliability gains compared with their binary counterparts. In addition, the absolute fit of the NLMs, which was not computable in Myszkowski and Storme [8] due to the smaller sample size, could be computed here and was found satisfactory, especially for the models including a guessing parameter (3PNL and 4PNL).
These findings overall suggest that NLMs [9] are a better modeling alternative than binary logistic models and than the Nominal Response Model [7] for logical reasoning multiple-choice tests, such as incomplete matrix or series tests, in online personnel selection settings.

Theoretical and Practical Implications
From a theoretical viewpoint, the present study can be seen as a conceptual replication and extension of Myszkowski and Storme [8]'s study on Raven's progressive matrices. Replicating findings is an important endeavor in scientific research. This is especially true in the field of psychology, which is regularly criticized for its lack of consideration for replicating empirical findings [27].
Recently, Hüffmeier et al. [10] have designed a theoretical framework to conceptualize the replication process in psychology and have proposed a typology of replication studies. Rather than considering replication as a process separate from the initial research process, they conceptualize replication as the very research process by which fundamental findings are generalized to situations that are increasingly close to real life conditions. When a result has been shown at a fundamental level, it may be interesting to replicate it to see if it is not due to chance. In this case, exact or close replications will be used [10]. To be able to further generalize the findings of a fundamental study, it is important to be able to perform conceptual replications in the laboratory or in the field. In conceptual replications, comparability to the original study is limited to the aspects that are considered theoretically relevant [28,29]. Among the conceptual replications are field studies. The aim of such studies is to investigate whether laboratory findings also hold under field conditions, and to rule out the possibility that a laboratory finding is a laboratory artifact or too weak to be relevant in contexts that are less tightly controlled [10]. In the framework described by Hüffmeier et al. [10], our study can be defined as a conceptual replication in the field of the study conducted by Myszkowski and Storme [8]. Our findings suggest that the characteristics of the e-assessment context do not fundamentally affect the way distractors are selected by test takers. Previous basic research on recovering distractor information is therefore relevant in an e-assessment context.
From a practical viewpoint, our findings suggest that one way to improve the accuracy of e-assessment in the context of recruitment is to recover distractor information. Web applications that use tests with distractors could implement NLM to obtain more reliable estimates of the general mental ability of job applicants. To this day, there are few software implementations of NLM, and a recommendation to designers of IRT platforms would be to add NLM to their offer. For e-assessment platforms, a relatively inexpensive alternative to commercial IRT software could be to use the "mirt" [23] R package on the server side to estimate the ability of test takers with its built-in NLM models. One challenge of this option is that R can be relatively demanding in terms of computing resources and time, although ability (θ) estimation in "mirt" is relatively fast once the item parameters of the model are stored in memory. Further optimizations facilitating the implementation of NLM in e-assessment may come in the future.
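To make concrete what such a server-side scorer computes, the following is a minimal, language-agnostic sketch (in Python, for illustration; in practice "mirt" would be used as recommended above) of the category probabilities of a 3PL nested logit item in the Suh and Bolt formulation, together with a simple EAP ability estimate by quadrature. All parameter values are hypothetical.

```python
import math

def p_correct(theta, a, b, g):
    """3PL probability of solving the item (level 1 of the nested logit)."""
    return g + (1.0 - g) / (1.0 + math.exp(-a * (theta - b)))

def category_probs(theta, a, b, g, slopes, intercepts):
    """Category probabilities of a 3PL nested logit item.

    Returns [P(correct), P(distractor 1), ..., P(distractor K)]:
    the correct response follows a 3PL; conditional on an incorrect
    response, distractor choice follows a nominal (softmax) model.
    """
    pc = p_correct(theta, a, b, g)
    logits = [s * theta + c for s, c in zip(slopes, intercepts)]
    m = max(logits)                               # stabilize the softmax
    expl = [math.exp(l - m) for l in logits]
    total = sum(expl)
    return [pc] + [(1.0 - pc) * e / total for e in expl]

def eap_theta(responses, items, grid=None):
    """EAP ability estimate under a standard normal prior, by quadrature.

    responses: list of chosen categories (0 = correct, 1..K = distractors).
    items: list of dicts of item parameters (a, b, g, slopes, intercepts).
    """
    if grid is None:
        grid = [i / 10.0 for i in range(-40, 41)]  # theta from -4 to 4
    post = []
    for t in grid:
        w = math.exp(-0.5 * t * t)                 # N(0,1) prior (unnormalized)
        for resp, item in zip(responses, items):
            w *= category_probs(t, **item)[resp]
        post.append(w)
    z = sum(post)
    return sum(t * w for t, w in zip(grid, post)) / z

# Hypothetical parameters for one 4-option item (key + 3 distractors)
item = dict(a=1.2, b=0.3, g=0.2,
            slopes=[-0.5, 0.1, 0.4], intercepts=[0.2, -0.1, -0.3])
probs = category_probs(0.0, **item)
assert abs(sum(probs) - 1.0) < 1e-12               # probabilities sum to 1
```

Because the distractor slopes differ, two applicants who both fail an item can receive different ability estimates depending on which distractor they chose, which is precisely the information that binary models discard.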
In line with the findings of Myszkowski and Storme [8], the observed gain in reliability was especially visible at relatively low levels of ability. This is not surprising, as NLM recover information from wrong response options. Recruiters are usually most interested in applicants with high levels of intelligence, but this is not always the case. For example, it is possible that, due to high competition on the job market, a recruiter is unable to attract the best applicants and has to select among applicants with relatively lower levels of ability. In such situations, the use of NLM could be highly valuable, as it allows recruiters to form a more accurate impression of applicants at the low end of the trait and to select the best among them.
As a reviewer pointed out, the standard errors of the item parameter estimates of the Nested Logit Models were overall smaller than those of their binary counterparts (this, of course, only concerns parameters that are common between models: difficulty, discrimination and, for the 3PL and 3PNL, guessing). This result may seem counterintuitive because, in general, for a given dataset, item parameter standard errors tend to increase with model complexity, and the Nested Logit Models are substantially more parametrized than the binary models. However, it should be noted that the Nested Logit Models are not only more complex; they also use, to some extent, a different dataset, in that they extract more information from the base dataset: they use the complete information at the nominal level, while the binary models use only the information at the binary level. Although we have shown that, as in Myszkowski and Storme [8], the Nested Logit Models resulted in gains in reliability (and thus lower standard errors) for the person estimates, the present results also suggest that the difficulty, discrimination and guessing parameters of the Nested Logit Models are estimated more accurately, because they use more information, than the respective item parameters of their binary counterparts. This result calls for replication in other datasets, contexts and types of tests.
Throughout the paper, we have mostly emphasized the benefits of using NLM to improve the accuracy of ability estimates. However, NLM have other potentially interesting applications beyond improved scoring. For example, Suh and Bolt [30] have described a method relying on NLM to evaluate how distractors might contribute to Differential Item Functioning (DIF). It is indeed possible that distractors function differently across groups, leading to Differential Distractor Functioning (DDF). DDF can in turn lead to DIF, which is a major problem when the same test is used on different groups. Multigroup NLM could help test designers diagnose the causes of DIF and thus improve their tests. Bolt et al. [31] have suggested another interesting application of NLM: using them to determine whether the ability distinguished by the distractors is the same as the ability underlying the choice of the correct response. Here again, NLM could help test designers select the items that best reflect the underlying ability.

Limitations and Future Research
Our study has several limitations, which should stimulate and guide further research on the topic. A first limitation is related to the sample used in the study: it comes from a single e-assessment platform, and it is therefore difficult to know whether the findings would generalize to other platforms. It is possible, for example, that characteristics of the design of a Web application affect the way distractors are processed by test takers. Previous research has shown that user experience greatly affects the cognitive processes mobilized when using a Web application [32]. Applied to our question, it is possible that a poor Web design reduces the motivation of test takers to process the distractors when they fail to identify the rule governing the logical progression of the series. Further research is needed to test the generalizability of the findings to other platforms, but also to other types of GMA tasks.
Another limitation is related to our sample size. NLM have more parameters than the models to which they were compared in the current study. Although our sample is larger than the one used in the original study that we conceptually replicated [8], it is still unclear whether it is large enough to obtain reliable parameter estimates. Further research using Monte Carlo simulations is needed to investigate the influence of sample size on parameter estimation in NLM and to provide clear guidelines regarding the necessary sample size.
In addition, it should be noted that the fact that the NLM provided a better fit, as in Myszkowski and Storme [8]'s study, does not necessarily imply that the cognitive processes engaged in responding to such tests follow only the two-step sequence that the NLM are based on: attempting to solve the task by looking at the stimulus alone and then, if the correct answer is not found, examining the distractors. Indeed, it remains very possible that the actual response process is less clear-cut and closer to a back-and-forth between a stimulus-based strategy and a response option comparison-based strategy. Further, it has been noted that the NLM could be improved by allowing the guessing strategy (level 2) to result in the choice of the correct response; choosing the correct response could then be the outcome of either strategy. Future research might consider this interesting possibility when such models become available in standard IRT software.
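The extension just mentioned, in which the level-2 "guessing" strategy may also land on the correct option, can be made concrete. The following is a minimal sketch of such a model's category probabilities, under one possible parametrization; as noted above, this model is not, to our knowledge, implemented in standard IRT software, and all parameter values are hypothetical.

```python
import math

def category_probs_guess_correct(theta, a, b, g, slopes, intercepts):
    """Sketch of an extended nested logit item in which the level-2
    ("guessing") nominal choice may also yield the correct option.

    P(solve) follows a 3PL-type curve; if the item is not solved, the
    examinee picks among ALL options (key included) via a nominal
    (softmax) model, so that
        P(correct) = P(solve) + (1 - P(solve)) * P(key | not solved).
    slopes/intercepts index option 0 as the key, 1..K as distractors.
    """
    p_solve = g + (1.0 - g) / (1.0 + math.exp(-a * (theta - b)))
    logits = [s * theta + c for s, c in zip(slopes, intercepts)]
    m = max(logits)                              # stabilize the softmax
    expl = [math.exp(l - m) for l in logits]
    total = sum(expl)
    nominal = [e / total for e in expl]          # choice when not solved
    probs = [(1.0 - p_solve) * p for p in nominal]
    probs[0] += p_solve                          # both routes reach the key
    return probs

# Hypothetical 4-option item: option 0 is the key
probs = category_probs_guess_correct(
    0.0, a=1.0, b=0.0, g=0.2,
    slopes=[0.6, -0.4, 0.0, -0.2], intercepts=[0.0, 0.1, -0.1, 0.2])
assert abs(sum(probs) - 1.0) < 1e-12
```

In this formulation, the probability of the key always exceeds the probability of solving the item outright, since the key can also be reached through the level-2 choice.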
Another limitation of this study is that the breadth of nominal models tested was limited by their availability in "mirt." Although this package provides a large number of popular models, we were not able to fit some alternative models, notably Thissen and Steinberg [33]'s Multiple Choice Model (MCM), which essentially adds to the Nominal Response Model a latent category corresponding to examinees who do not know, and thus guess, the correct response. Although the Nominal Response Model was outperformed here by the Nested Logit Models, it may be that alternatives such as the MCM perform better.
Another important limitation of our study is that, because it did not include a measure of job performance, we could not test whether the improvement in reliability translates into an improvement in predictive validity. The ability of an assessment tool to predict future job performance is crucial in the context of recruitment, and improvements in measurement reliability can lead to improvements in predictive validity, as reliability is a prerequisite for validity [34].
Whether recovering distractor information actually improves predictive validity in the context of e-assessment remains to be investigated, and the answer to this question could represent an important contribution to the literature. It has indeed been shown that in situations in which test takers are under pressure, for example when stakes are high, the predictive validity of GMA tests tends to decrease [35]. Duckworth et al. [35] argued that GMA tests predict various indicators of success in life because, when used in low-stakes contexts, they essentially measure the motivation of test takers. Although there is empirical evidence supporting this argument, one can wonder whether a more precise strategy for scoring GMA tests might not ultimately reveal a relation between GMA itself and various indicators of achievement. Testing the predictive validity of GMA tests scored with NLM could therefore have important implications for our knowledge of the true relationship between GMA and achievement in general.