Regularized Estimation of the Four-Parameter Logistic Model

Abstract: The four-parameter logistic model is an Item Response Theory model for dichotomous items that limits the probability of giving a positive response to an item to a restricted range, so that even people at the extremes of a latent trait do not have a probability close to zero or one. Although the literature acknowledges the usefulness of this model in certain contexts, the difficulty of estimating the item parameters has limited its use in practice. In this paper we propose a regularized approach to the estimation of the item parameters based on the inclusion of a penalty term in the log-likelihood function. Simulation studies show the good performance of the proposal, which is further illustrated through an application to a real dataset.


Introduction
Item Response Theory (IRT) provides a framework for the statistical modeling of responses to a test or questionnaire [1,2]. In IRT models, the probability of giving a certain response to an item depends on one or more latent variables and on some parameters related to the items. The aim of the analysis is usually to measure the latent variables and to study the properties of the items. Different kinds of models have been proposed in the literature depending on the type of responses that can be given to an item. In the case of binary responses (for example, correct or incorrect, agree or disagree, yes or no), the four-parameter logistic (4PL) model [3] constitutes the most flexible option, since it captures the relation of the responses with the latent variable while allowing for some randomness: people at a very low level of the latent trait have a nonzero probability of giving a positive response, and people at a very high level of the latent trait have a probability of giving a positive response lower than 1. Guessing in a multiple-choice educational test is a typical example of the need to model such behavior for examinees at low ability levels. Likewise, people at high ability levels could fail to give the correct response because of inattention or tiredness. Recently, the 4PL model has received renewed interest. Reise and Waller [4] suggested that a four-parameter model is needed to completely characterize the functioning of psychopathology items, which was later confirmed by the authors [5]. The 4PL model was also found to be useful for computerized adaptive testing [6–9]. However, the estimation of the parameters of the 4PL model is a difficult task [10,11], which explains why this model was substantially ignored for a long time.
Some recent contributions to the literature employ a Bayesian approach for the estimation of the item parameters [5,10–12], while another interesting work employs a mixture model formulation and prior distributions on the parameters [13].
In recent years, statistical learning methods [14] have attracted increasing interest due to their capacity to deal with the complexity of the data. Of particular interest here are regularization methods, which were first proposed for the linear regression model with the aim of shrinking the coefficients toward zero and were also employed to obtain smoothing splines [14,15]. In general, these methods prevent overfitting, reduce the variability of the estimates and improve the predictive capacity of the model. Regularization methods have since been applied to a variety of models, including models for categorical data [16,17]. Restricting our attention to IRT, a ridge-type penalty was used for the two-parameter logistic (2PL) model [18], while a lasso penalty for the detection of differential item functioning (DIF) was employed for the Rasch model [19] and for generalized partial credit models [20]. A lasso penalty was also used for latent variable selection in multidimensional IRT models [21], while a fused-lasso penalty was proposed for the nominal response model to group response categories and perform variable selection [22]. Penalized estimation for the detection of DIF was also implemented for a logistic regression model [23].
To deal with the complexity of the estimation of the 4PL model, in this paper we propose a regularization approach based on the inclusion of a penalty term in the log-likelihood function. The paper is structured as follows. Section 2 introduces the 4PL model and some regularization methods. Section 3 describes our proposal, whose performance is assessed in Section 4 through some simulation studies. Section 5 illustrates the method with a real-data example. Finally, Section 6 concludes with a discussion.

The 4-Parameter Logistic Model
In the following, the terminology of educational testing is used as an example for introducing the 4PL model, though this model is applicable to other contexts as well. Let the variable $X_{ij}$ be equal to 1 if person $i$ knows the correct answer to item $j$, and let $Y_{ij}$ be equal to 1 if the response is correct. The probability of knowing the correct response to item $j$ according to the 2PL model is given by

$$P(X_{ij} = 1 \mid \theta_i) = \frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}}, \tag{1}$$

where $a_j$ and $b_j$ are the parameters of the item usually referred to as discrimination and difficulty, and $\theta_i$ is the ability of person $i$. In the 4PL model, the probability of giving the correct response does not coincide with the probability of knowing it, and it is given by

$$P(Y_{ij} = 1 \mid \theta_i) = c_j \, P(X_{ij} = 0 \mid \theta_i) + d_j \, P(X_{ij} = 1 \mid \theta_i), \tag{2}$$

where $c_j = P(Y_{ij} = 1 \mid X_{ij} = 0)$ is the probability of giving the correct response when it is not known, and $d_j = P(Y_{ij} = 1 \mid X_{ij} = 1)$ is the probability of giving the correct response when it is known. These two other item parameters are often referred to as guessing and inattention, or just as the lower and upper asymptotes. Rewriting Equation (2) as follows:

$$P(Y_{ij} = 1 \mid \theta_i) = c_j + (d_j - c_j) \frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}} \tag{3}$$

returns the usual form of the 4PL model. The three-parameter logistic (3PL) model is obtained if $d_j$ is set to 1, while the 2PL model results when $c_j$ is also constrained to 0. See also [24] for a similar argument about the 3PL model.

The item parameters are usually estimated using the marginal maximum likelihood method [25], which treats the abilities as random variables with a standard normal distribution and integrates them out of the likelihood function. A parameterization of the model more suitable for estimation is the following:

$$P(Y_{ij} = 1 \mid \theta_i) = F(\beta_{3j}) + \{F(\beta_{4j}) - F(\beta_{3j})\} \frac{\exp(\beta_{1j} + \beta_{2j}\theta_i)}{1 + \exp(\beta_{1j} + \beta_{2j}\theta_i)}, \tag{4}$$

where the function $F(x) = e^x / (1 + e^x)$ constrains the parameters $c_j = F(\beta_{3j})$ and $d_j = F(\beta_{4j})$ to be in the $(0, 1)$ range, while $b_j = -\beta_{1j}/\beta_{2j}$ and $a_j = \beta_{2j}$.
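As an illustration of the model and of its two parameterizations, the following Python sketch evaluates the 4PL response probability in the usual form and in the form based on the β parameters, and approximates the marginal log-likelihood on a fixed grid. It is not the code of the package accompanying the paper: the function names, the equally spaced grid of 61 nodes on [-6, 6] and the plain-Python implementation are all assumptions made for this example.

```python
import math

def logistic(x):
    """F(x) = e^x / (1 + e^x), written to avoid overflow for large |x|."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def p_4pl(theta, a, b, c, d):
    """P(Y = 1 | theta) in the usual 4PL form: c + (d - c) * F(a * (theta - b))."""
    return c + (d - c) * logistic(a * (theta - b))

def to_beta(a, b, c, d):
    """Map (a, b, c, d) to (beta1, beta2, beta3, beta4); requires 0 < c, d < 1."""
    logit = lambda p: math.log(p / (1.0 - p))
    return (-a * b, a, logit(c), logit(d))

def p_4pl_beta(theta, beta1, beta2, beta3, beta4):
    """Same probability in the parameterization used for estimation."""
    c, d = logistic(beta3), logistic(beta4)
    return c + (d - c) * logistic(beta1 + beta2 * theta)

def marginal_loglik(responses, items, n_nodes=61, bound=6.0):
    """Marginal log-likelihood with theta ~ N(0, 1) integrated out numerically.

    responses: list of 0/1 rows, one row per person.
    items: list of (a, b, c, d) tuples, one per item.
    The integral is approximated on an equally spaced grid (an assumption of
    this sketch; other quadrature rules could be used instead).
    """
    step = 2.0 * bound / (n_nodes - 1)
    nodes = [-bound + i * step for i in range(n_nodes)]
    dens = [math.exp(-0.5 * t * t) for t in nodes]
    total = sum(dens)
    weights = [w / total for w in dens]  # normalized N(0, 1) weights
    loglik = 0.0
    for row in responses:
        marginal = 0.0
        for t, w in zip(nodes, weights):
            lik = 1.0
            for y, (a, b, c, d) in zip(row, items):
                p = p_4pl(t, a, b, c, d)
                lik *= p if y == 1 else 1.0 - p
            marginal += w * lik
        loglik += math.log(marginal)
    return loglik
```

For instance, `p_4pl(theta, *item)` and `p_4pl_beta(theta, *to_beta(*item))` should agree up to rounding error for any item with asymptotes strictly inside (0, 1), which is a quick check that the two parameterizations describe the same curve.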

Regularized Estimation
In order to achieve regularized parameter estimates and reduce their variability, a very common strategy is based on the inclusion of a penalty term in the loss function, which is minimized to obtain the parameter estimates [14,17].
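Let $\beta = (\beta_1, \ldots, \beta_K)^\top$ be a $K \times 1$ vector of parameters. A very common penalty function is the ridge penalty:

$$J(\beta, \lambda) = \lambda \sum_{k=1}^{K} \beta_k^2, \tag{5}$$

where $\lambda$ is a tuning parameter that determines the amount of shrinkage. This penalty has the effect of shrinking all the parameters toward zero proportionally [14]. If, instead, the purpose is setting some parameters exactly at zero, a more effective penalty is the lasso [26,27]:

$$J(\beta, \lambda) = \lambda \sum_{k=1}^{K} |\beta_k|. \tag{6}$$

A variant is given by the fused lasso penalty [28]

$$J(\beta, \lambda_1, \lambda_2) = \lambda_1 \sum_{k=1}^{K} |\beta_k| + \lambda_2 \sum_{k=2}^{K} |\beta_k - \beta_{k-1}|, \tag{7}$$

which forces adjacent parameters to take the same value and is meaningful only if the parameters present a natural order. Another penalty that requires an ordering of the coefficients was employed in [29] to smooth the effects of an ordered predictor on the response variable and takes the following form:

$$J(\beta, \lambda) = \lambda \sum_{k=2}^{K} (\beta_k - \beta_{k-1})^2. \tag{8}$$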
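The four penalties described above can be sketched in a few lines of Python. The forms below are the standard textbook ones, written without additional scaling constants, which is an assumption of this example; the function names are illustrative.

```python
def ridge(beta, lam):
    """Ridge penalty: lam times the sum of squared parameters."""
    return lam * sum(b * b for b in beta)

def lasso(beta, lam):
    """Lasso penalty: lam times the sum of absolute values."""
    return lam * sum(abs(b) for b in beta)

def fused_lasso(beta, lam1, lam2):
    """Fused lasso: lasso term plus absolute differences of adjacent parameters."""
    adjacent = sum(abs(beta[k] - beta[k - 1]) for k in range(1, len(beta)))
    return lam1 * sum(abs(b) for b in beta) + lam2 * adjacent

def smooth_difference(beta, lam):
    """Squared differences of adjacent parameters (requires a natural order)."""
    return lam * sum((beta[k] - beta[k - 1]) ** 2 for k in range(1, len(beta)))
```

The last two functions depend on the order of the elements of `beta`, whereas the first two do not, which is the distinction the text draws between penalties that need a natural ordering and those that do not.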

A New Proposal for the 4PL Model
Let $\beta$ be the vector containing all the item parameters and $\ell(\beta)$ be the marginal log-likelihood function. Our proposal employs a penalty term on the item parameters in order to obtain regularized estimates with limited variability. However, the penalties originally proposed for regression models, which force the parameters toward zero, are not particularly meaningful for IRT models. Hence, the penalized log-likelihood function we propose takes the following form:

$$\ell_p(\beta) = \ell(\beta) - \lambda \sum_{j < k} \sum_{m=1}^{4} (\beta_{mj} - \beta_{mk})^2, \tag{9}$$

where the first sum runs over all pairs of distinct items. The penalty added to the log-likelihood function has the effect of forcing each different type of item parameter toward a common value. This means that the intercepts $\beta_{1j}$ of all items are forced toward a common value, as well as the slopes $\beta_{2j}$ and the parameters that determine the lower and upper bounds, $\beta_{3j}$ and $\beta_{4j}$. The penalty used here is similar to (8); however, in this case, there is no natural order of the parameters, so it is necessary to consider all the pairs of parameters pertaining to different items. The same type of penalty was also employed in [18] to shrink the slopes in the 2PL model. It is worth noting that the penalty employed here does not force the parameters to assume exactly the same value, as the penalty (7) does, but rather pulls them toward a common value.
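A minimal Python sketch of the pairwise penalty may help fix ideas. It assumes a single tuning parameter and no additional scaling constant, and it takes the value of the marginal log-likelihood as given; the function names are hypothetical and do not come from the package accompanying the paper.

```python
from itertools import combinations

def pairwise_penalty(values):
    """Sum of squared differences over all pairs of items, one parameter type."""
    return sum((x - y) ** 2 for x, y in combinations(values, 2))

def total_penalty(beta, lam):
    """Penalty over the four parameter types.

    beta: list of per-item tuples (beta1, beta2, beta3, beta4).
    zip(*beta) regroups the values by parameter type across items.
    """
    return lam * sum(pairwise_penalty(column) for column in zip(*beta))

def penalized_loglik(loglik_value, beta, lam):
    """Penalized log-likelihood: log-likelihood minus the penalty."""
    return loglik_value - total_penalty(beta, lam)
```

For n values, the sum of squared pairwise differences equals n times the sum of squared deviations from their mean, so the penalty measures the spread of each parameter type around its own average without requiring that average to be specified. For the values 1, 2 and 4, both sides of this identity are equal to 14.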
The assumption that underlies this penalty is that the upper asymptotes assume similar values, as well as the lower asymptotes, the intercepts and the slopes. The amount of similarity between the parameters is determined using a data-driven procedure, as explained below, and this procedure could possibly lead to preferring the unpenalized estimates.
Including a penalty in the form of a density function in the log-likelihood function for each type of item parameter is implemented, for example, in the R package mirt [30]. However, it requires the choice of the parameters of such a density, which is neither trivial nor irrelevant. It is possible to show that the penalties included in Equation (9) are equivalent to the logarithm of the normal density (see Appendix A for the proof). The great advantage of the penalty employed in this paper is that it does not require choosing or estimating the mean of the distribution. Instead, the selection of the tuning parameter $\lambda$, which plays the same role as the variance, should be performed on the basis of a data-driven procedure. To this end, K-fold cross-validation represents an effective approach. Data are divided into $K$ groups, and the parameters of the model are estimated on $K - 1$ folds, leaving one fold out to evaluate the error. This is performed leaving each fold out in turn and for each value of $\lambda$. In our application, the error was evaluated using the negative log-likelihood function as suggested in [31]. Hence, the cross-validation error is given by

$$CV(\lambda) = -\sum_{k=1}^{K} \ell(\hat{\beta}_{-k}; y_k), \tag{10}$$

where $\hat{\beta}_{-k}$ is the vector of item parameter estimates obtained excluding the $k$-th group of data, which is denoted by $y_k$. The value of $\lambda$ that minimizes the cross-validation error is selected.
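The cross-validation procedure can be sketched in Python as follows. The functions `fit` and `loglik` are placeholders supplied by the user: in the setting of this paper they would be the penalized 4PL estimation routine and the marginal log-likelihood, which are not implemented here. The toy pair at the end (independent Bernoulli items whose proportions are shrunk toward their average) is only a stand-in, assumed for this example so that the scaffold can be run; it is not the 4PL model.

```python
import math
import random

def k_fold_indices(n, K, seed=0):
    """Randomly split the indices 0..n-1 into K folds of nearly equal size."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[k::K] for k in range(K)]

def cv_error(data, lam, K, fit, loglik, seed=0):
    """Negative log-likelihood of each held-out fold, summed over the K folds."""
    total = 0.0
    for held_out in k_fold_indices(len(data), K, seed):
        held = set(held_out)
        train = [row for i, row in enumerate(data) if i not in held]
        test = [data[i] for i in held_out]
        estimates = fit(train, lam)        # estimates without the k-th fold
        total -= loglik(estimates, test)   # error evaluated on the k-th fold
    return total

def select_lambda(data, lambdas, K, fit, loglik, seed=0):
    """Return the lambda with the smallest cross-validation error, and all errors."""
    errors = [cv_error(data, lam, K, fit, loglik, seed) for lam in lambdas]
    best = min(range(len(lambdas)), key=errors.__getitem__)
    return lambdas[best], errors

# Toy stand-in model (NOT the 4PL model): item proportions shrunk toward their mean.
def toy_fit(train, lam):
    n_items = len(train[0])
    props = [sum(row[j] for row in train) / len(train) for j in range(n_items)]
    common = sum(props) / n_items
    shrunk = [(p + lam * common) / (1.0 + lam) for p in props]
    return [min(max(p, 1e-6), 1.0 - 1e-6) for p in shrunk]

def toy_loglik(estimates, test):
    return sum(math.log(p if y == 1 else 1.0 - p)
               for row in test for y, p in zip(row, estimates))
```

The same seed is used for every value of λ so that all candidate values are compared on identical folds; including λ = 0 in the grid allows the procedure to select the unpenalized estimates.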

Simulation Studies
In order to assess the performance of our proposal in comparison to maximum likelihood estimation (MLE), we conducted a simulation study with two settings. In the first setting, the true item parameters were taken to be equal to the estimates reported in Table 4 in [10], where a Bayesian approach was followed to fit a 4PL model to a dataset with 14 items assessing delinquency. In this setting, the items were rather difficult (all the difficulties were above zero, with a mean of 1.51), with high discriminations (the mean of the discrimination parameters was 2.13), lower asymptotes close to zero (their mean was 0.03), and upper asymptotes ranging from 0.72 to 0.89. Hence, we also considered a second setting with true parameters more similar to those of a typical educational test, which were obtained by random generation. The number of items in this setting is 30. The difficulties $b_j$ were generated from a standard normal distribution, the discriminations $a_j$ were generated from a normal distribution with mean 1 and standard deviation 0.2, the lower asymptotes $c_j$ were generated from a uniform distribution in the [0.1, 0.3] range, and the upper asymptotes $d_j$ were generated from a uniform distribution in the [0.8, 1] range. In both cases, the latent variables $\theta$ were generated from a standard normal distribution and the number of examinees was taken to be equal to n = 500, 1000 and 5000. All results are based on 500 replications.
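The data-generating process of the second setting can be sketched in Python as follows. The sketch assumes that the value 30 refers to the number of items; the function name, the seed handling and the dictionary layout are illustrative choices, not the code used for the study.

```python
import math
import random

def simulate_setting2(n_persons, n_items=30, seed=0):
    """Generate item parameters, abilities and 0/1 responses under the 4PL model."""
    rng = random.Random(seed)
    items = [
        {
            "a": rng.gauss(1.0, 0.2),    # discrimination ~ N(1, 0.2^2)
            "b": rng.gauss(0.0, 1.0),    # difficulty ~ N(0, 1)
            "c": rng.uniform(0.1, 0.3),  # lower asymptote ~ U[0.1, 0.3]
            "d": rng.uniform(0.8, 1.0),  # upper asymptote ~ U[0.8, 1]
        }
        for _ in range(n_items)
    ]
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_persons)]  # abilities ~ N(0, 1)
    responses = []
    for theta in thetas:
        row = []
        for it in items:
            # 4PL probability of a positive response
            p = it["c"] + (it["d"] - it["c"]) / (
                1.0 + math.exp(-it["a"] * (theta - it["b"]))
            )
            row.append(1 if rng.random() < p else 0)
        responses.append(row)
    return items, thetas, responses
```

A call such as `simulate_setting2(500, seed=1)` returns one simulated dataset; in a replication study the item parameters would typically be generated once and only the abilities and responses redrawn at each replication, a detail that this sketch leaves to the caller.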
All statistical analyses were performed in R [32] and C++, employing the Rcpp [33] and RcppArmadillo [34] packages to integrate the code. The Reg4PL package developed to implement the methods is available as supplementary material to this paper. The maximum likelihood estimates were computed by maximizing the log-likelihood function with the optim function of R, using the estimates provided by the mirt package [30] as initial values. The same function was employed to obtain the penalized estimates.

Table 1 displays the results in the first setting. The root mean square error (RMSE) reported in the table is the average over the 14 items. The bias (B) is the square root of the average of the squared bias of the estimates of each item. The bias and the RMSE of MLE are particularly large for the discrimination parameters. The penalized estimates consistently present smaller values of bias and RMSE than MLE for the $a_j$ parameters, though for n = 500 they are not negligible. The RMSE and the bias of the maximum likelihood estimates are smaller for the difficulty parameters than for the discrimination parameters. However, the penalized estimates perform better for all the sample sizes. Considering the lower asymptotes, there is only one case in which the penalized estimates present a larger bias than MLE, which is for n = 1000. In all other cases, penalized estimation performs better. Finally, the estimates of the upper asymptotes present good properties with both MLE and penalized estimation, with a slight advantage sometimes for one method and sometimes for the other.

The results in the second setting are reported in Table 2. The most difficult parameters to estimate are the discriminations. Penalized estimation always performs better than MLE for the discrimination parameters. The difficulty parameters generally benefit from penalized estimation too, with the only exception of the bias when n = 5000, which is slightly increased. The RMSE is always smaller for the penalized estimates, while the bias of the lower and upper asymptotes is at the same level or increased.
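The two summary measures reported in the tables can be computed as in the following Python sketch, which follows the definitions given above: the RMSE is averaged over the items, and the bias is the square root of the average squared bias across items. The function name and the data layout are assumptions of this example.

```python
import math

def bias_and_rmse(estimates, truth):
    """Summary bias (B) and RMSE for one type of item parameter.

    estimates[r][j]: estimate for item j in replication r.
    truth[j]: true value of the parameter for item j.
    """
    n_rep, n_items = len(estimates), len(truth)
    sum_sq_bias = 0.0
    sum_rmse = 0.0
    for j in range(n_items):
        errors = [estimates[r][j] - truth[j] for r in range(n_rep)]
        item_bias = sum(errors) / n_rep
        item_rmse = math.sqrt(sum(e * e for e in errors) / n_rep)
        sum_sq_bias += item_bias ** 2
        sum_rmse += item_rmse
    # B: root of the average squared bias; RMSE: average of the item RMSEs
    return math.sqrt(sum_sq_bias / n_items), sum_rmse / n_items
```

Because B squares the item-level biases before averaging, positive and negative biases of different items do not cancel out in the summary.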

A Real-Data Example
The method was then applied to the Second International Self-Report Delinquency Study (ISRD-2), a large-scale study on delinquency of 12- to 15-year-old students [35]. The dataset is publicly available at https://www.icpsr.umich.edu/web/NACJD/studies/34658. The questions used in this paper are reported in Table 3. These items are all dichotomous, since the responses can be either yes or no. After selecting students from Switzerland and deleting the cases with missing responses, our dataset is composed of 3247 students.
The mirt package [30] was used to compare the fit of the 2PL, 3PL and 4PL models to these data. Table 4 reports the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC) and the p-value of the likelihood ratio (LR) test. According to these results, the 3PL model does not provide a better fit to these data than the 2PL model. However, the 4PL model is the preferred one, since it presents the lowest AIC and BIC and the LR test indicates that the upper asymptotes should be included in the model. Figure 1 shows the cross-validation error as a function of λ. The vertical dashed line corresponds to the minimum value, which is attained at λ = 0.005. Figure 2 shows the item parameter estimates at different values of λ. The maximum likelihood estimates and the penalized estimates at the selected value of λ are reported in Table 5. The parameter estimates were obtained using the reg4PL package available as supplementary material to this paper. Since in this example the sample size is rather large, the value of λ determined by cross-validation is small. Nonetheless, some of the resulting parameter estimates are noticeably reduced, such as the discrimination parameter of item BEERLTP, which was extremely high using MLE. As already observed in self-reported delinquency studies [10,11], the lower asymptotes of the 4PL model are nearly zero, while some items present an upper asymptote lower than one. In this application, only item BEERLTP has a lower asymptote considerably greater than zero, probably because this behavior is not completely unusual for teenagers with a low level of delinquency.
It is also worth noting that the lower asymptote of item GFIGLTP is slightly greater than zero, likely because one could participate in a fight against their will. Various items have an upper asymptote lower than one, such as items HASHLTP and DRUDLTP, showing that some behaviors are not necessarily pursued by teenagers at high levels of delinquency. However, it is worth noting that these items are related to the latent variable, as indicated by the highly positive discrimination parameters.

Table 3 (excerpt). Questions used in this paper.
EXTOLTP: Did you ever threaten somebody with a weapon or to beat them up, just to get money or other things from them?
GFIGLTP: Did you ever participate in a group fight on the school playground, a football stadium, the streets or in any public place?
ASLTLTP: Did you ever intentionally beat up someone, or hurt him with a stick or knife, so bad that he had to see a doctor?
DRUDLTP: Did you ever sell any (soft or hard) drugs or act as an intermediary?
Version November 9, 2020 submitted to Psych

Discussion
In this paper we have proposed a regularization method for the 4PL model based on penalized maximum likelihood estimation. While penalized estimation of the linear regression model always introduces bias to reduce the variability of the estimates, in the case studied in this paper the penalty in many cases does not increase the bias and always reduces the root mean square error. In this respect, it is worth noting that the least squares estimator used for linear regression provides unbiased estimates of the coefficients, while MLE of nonlinear models is known to be consistent but biased in finite samples. It is also worth observing that the values of bias and root mean square error reported in this paper are averages over all the items. Thus, for a single item, the bias may still increase.
Our approach shares some similarities with the marginalized maximum a posteriori estimation proposed in [13], where some prior distributions are assumed on the parameters and an EM algorithm is then implemented. The main difference between the two approaches lies in the treatment of the parameters of the prior distributions. While in [13] the parameters of the prior distributions are fixed, in our approach the parameter λ is estimated by K-fold cross-validation. In this respect, it is important to have only one parameter to estimate, since cross-validation would become impractical for more parameters. Despite there being only one parameter that determines the amount of shrinkage induced by the penalty term, the parameters are shrunk by varying magnitudes, depending on the log-likelihood function. In like manner, the shrinkage of the coefficients of a regression model is governed by a single tuning parameter, when the usual ridge or lasso penalties are employed. It is important to note that the cross-validation error as a function of λ, as shown for example in Figure 1, suggests that the estimation of the tuning parameter is fundamental, since different values of λ can lead to a cross-validation error larger than the one obtained with MLE. The simulation studies show that our proposal is always able to reduce the RMSE and, in many cases, to lower the bias, hence supporting the usefulness of this approach for the 4PL model.

Conflicts of Interest:
The author declares no conflict of interest.

Abbreviations
The following abbreviations are used in this manuscript:

2PL: two-parameter logistic
3PL: three-parameter logistic
4PL: four-parameter logistic
B: bias
IRT: Item Response Theory
ISRD-2: Second International Self-Report Delinquency Study
MLE: maximum likelihood estimation
RMSE: root mean square error