DINA Model with Entropy Penalization

Abstract: The cognitive diagnosis model (CDM) is an effective statistical tool for extracting the discrete attributes of individuals based on their responses to diagnostic tests. When dealing with cases that involve small sample sizes or highly correlated attributes, not all attribute profiles may be present. The standard method, which accounts for all attribute profiles, not only increases the complexity of the model but also complicates the calculation. Thus, it is important to identify the empty attribute profiles. This paper proposes an entropy-penalized likelihood method to eliminate the empty attribute profiles. In addition, the relation between attribute profiles and the parameter space of item parameters is discussed, and two modified expectation–maximization (EM) algorithms are designed to estimate the model parameters. Simulations are conducted to demonstrate the performance of the proposed method, and a real data application based on the fraction–subtraction data is presented to showcase the practical implications of the proposed method.


Introduction
CDMs are widely used in the field of educational and psychological assessments. These models are used to extract the examinees' latent binary random vectors, which can provide rich and comprehensive information about the examinees. Different CDMs have been proposed for different test scenarios. Popular CDMs include the Deterministic Input, Noisy "And" gate (DINA) model [1], the Deterministic Input, Noisy "Or" gate (DINO) model [2], the Noisy Inputs, Deterministic "And" gate (NIDA) model [3], the Noisy Inputs, Deterministic "Or" gate (NIDO) model [2], the Reduced Reparameterized Unified Model (RRUM) [4,5], and the Log-linear Cognitive Diagnosis Model (LCDM) [6]. These CDMs differ in how they model the positive response probabilities. CDMs can also be summarized in more flexible frameworks such as the Generalized Deterministic Input, Noisy "And" gate (GDINA) model [7] and the General Diagnostic Model (GDM) [8]. The simplicity and interpretability of the DINA model have positioned it as one of the most popular CDMs.
The DINA model, also known as a latent class model [9][10][11][12][13], is a mixture model, so it inherits the drawbacks of mixture models. Too many latent classes may overfit the data, which means that the data should have been characterized by a simpler model. Too few latent classes cannot characterize the true underlying data structure well and yield poor inference. In practical terms, identifying the empty latent classes improves the model's interpretability while still explaining the data well. Chen [14] showed that the theoretical optimal convergence rate of a mixture model with an unknown number of classes is slower than the optimal rate when the number of classes is known. This means that inference benefits strongly from knowing the number of classes. Therefore, from both practical and theoretical viewpoints, eliminating the empty latent classes is a crucial issue in the DINA model.
Common reasons for empty attribute profiles include small sample sizes or highly correlated attributes. Let us explore a few examples to illustrate further. In a scenario where the sample size is smaller than the number of attribute profiles, it is inevitable that some attribute profiles will be empty. In another scenario with two attributes $\alpha_1$ and $\alpha_2$, suppose the relation $\alpha_2 = 1$ if and only if $\alpha_1 = 1$. Under this assumption of extremely correlated attributes, the attribute profiles $(\alpha_1 = 1, \alpha_2 = 0)$ and $(\alpha_1 = 0, \alpha_2 = 1)$ do not appear. Situations with empty attribute profiles can occur in various scenarios [15].
The hierarchical diagnostic classification model [15,16] is a well-known method to eliminate empty attribute profiles. In that literature, directed acyclic graphs are employed to describe the relationships among the attributes, and the directions of the edges impose strict constraints on the attributes. If there is a directed edge from $\alpha_1$ to $\alpha_2$, the attribute profile $(0, 1)$ is forbidden. Gu and Xu [15] utilized a penalized EM to select the true attribute profiles, avoid overfitting, and learn attribute hierarchies. Wang and Lu [17] compared two exploratory approaches for learning attribute hierarchies in the LCDM and DINA models. In essence, the attribute hierarchy can be regarded as a specific family of correlated attributes that can be effectively represented and described through a graph model.
Penalized methods have been widely studied in many statistical problems. In the regression model, the least absolute shrinkage and selection operator (LASSO) and its variants were analyzed in [18][19][20]. Fan and Li [21] proposed a nonconcave penalty, the smoothly clipped absolute deviation (SCAD), to reduce the bias of estimators. In the Gaussian mixture model, Ma and Wang [22] and Huang et al. [23] proposed penalized likelihood methods to determine the number of components. In CDMs, Chen et al. [10] used SCAD to obtain sparse item parameters and recover the Q matrix. Xu and Shang [11] applied an "$L_0$ norm" penalty to CDMs and suggested a truncated "$L_1$ norm" penalty as an approximate computation.
In the hierarchical diagnostic classification model, directed acyclic graphs of attributes often need to be specified in advance. A limitation of this model is that it is difficult to specify such a graph in real scenarios. Moreover, the penalty of the penalized EM proposed by Gu and Xu [15] involves two tuning parameters, which complicates the implementation. Therefore, we aim to propose a method that does not require specifying a directed acyclic graph in advance and has a concise penalty term.
This paper makes two primary contributions: firstly, it introduces an entropy-based penalty, and secondly, it develops the corresponding algorithms to utilize this penalty. Specifically, this paper proposes a novel approach for estimating the DINA model that combines Shannon entropy with the penalized likelihood method. In information theory, "uncertainty" can be interpreted informally as the negative logarithm of probability, and Shannon entropy is the average of this "uncertainty". Shannon entropy can thus be used to characterize the distribution of the attribute profiles. By utilizing the proposed method, the empty attribute profiles can be eliminated. We further develop the EM algorithm for the proposed method and conduct simulations to verify it.
The rest of the paper is organized as follows. In Section 2, we give an overview of the DINA model and its estimation methods, and the feasible domain is defined to characterize the latent classes. Section 3 introduces the entropy-penalized method and the EM algorithm employed to estimate the DINA model. The numerical studies of the entropy-penalized method are shown in Section 4. Section 5 presents a real data analysis based on the fraction-subtraction data. A summary of the paper and directions for future research are given in Section 6. The details of the EM algorithms and the proofs are given in Appendices A-C.

Review of DINA Model
Firstly, some useful notation is introduced. For examinee $i = 1, \ldots, N$, the attribute profile $\alpha_i$, also known as the latent class, is a $K$-dimensional binary vector $\alpha_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iK})^\top$, and the corresponding response data to $J$ items is a $J$-dimensional binary vector $y_i = (y_{i1}, y_{i2}, \ldots, y_{iJ})^\top$, where "$\top$" is the transpose operation. Let $Y$ and $\alpha$ denote the collections of all $y_i$ and $\alpha_i$, respectively. The Q matrix is a $J \times K$ binary matrix, where the element $q_{jk}$ is 1 if item $j$ requires attribute $k$, and 0 otherwise. The $j$-th row vector is denoted by $q_j$. Given the fixed $K$, there are $C = 2^K$ latent classes. We use a multinomial distribution with probabilities $\pi_\Lambda \triangleq P(\alpha_i = \Lambda)$ to describe the attribute profile $\Lambda \in \{0, 1\}^K$, where $\sum_{\Lambda \in \{0,1\}^K} \pi_\Lambda = 1$, and the population parameter $\pi$ denotes the collection of probabilities for all attribute profiles.
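To make the notation concrete, the following Python sketch enumerates the $C = 2^K$ latent classes and a multinomial $\pi$ over them; the choice $K = 3$ and the small Q-matrix are illustrative assumptions for this example, not values from the paper.

```python
import itertools
import numpy as np

K = 3  # number of attributes (illustrative choice)
# All 2^K attribute profiles Lambda in {0, 1}^K.
profiles = np.array(list(itertools.product([0, 1], repeat=K)))  # shape (2^K, K)

# A hypothetical 4 x 3 Q-matrix: q_jk = 1 iff item j requires attribute k.
Q = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [1, 1, 0]])

# Population probabilities pi_Lambda of the multinomial distribution; must sum to 1.
pi = np.full(2 ** K, 1 / 2 ** K)
assert np.isclose(pi.sum(), 1.0)
```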
The DINA model [1] supposes that, in an ideal scenario, the examinees with all required attributes will provide correct answers. For examinee $i$ and item $j$, the ideal response is defined as $\eta_{j,\alpha_i} = \prod_{k=1}^{K} \alpha_{ik}^{q_{jk}}$, where $0^0$ is defined as 1. The slipping and guessing parameters are defined by the conditional probabilities $s_j = P(y_{ij} = 0 \mid \eta_{j,\alpha_i} = 1)$ and $g_j = P(y_{ij} = 1 \mid \eta_{j,\alpha_i} = 0)$, respectively. The parameters $s$ and $g$ are the collections of all $s_j$ and $g_j$, respectively.
In the DINA model, the positive response probability $\theta_{j,\alpha_i}$ can be constructed as
$$\theta_{j,\alpha_i} \triangleq P(y_{ij} = 1 \mid s_j, g_j, \alpha_i; q_j) = (1 - s_j)^{\eta_{j,\alpha_i}} g_j^{1 - \eta_{j,\alpha_i}}.$$
If both $Y$ and $\alpha$ are observed, the likelihood function is
$$L(s, g, \pi; Y, \alpha) = \prod_{i=1}^{N} \pi_{\alpha_i} \prod_{j=1}^{J} \theta_{j,\alpha_i}^{y_{ij}} (1 - \theta_{j,\alpha_i})^{1 - y_{ij}}.$$
Given the data $Y$ and the attribute profiles $\alpha$, the parameters $s$ and $g$ can be directly estimated by the maximum likelihood estimators
$$\hat{s}_j = \frac{\sum_{i=1}^{N} 1(\eta_{j,\alpha_i} = 1)(1 - y_{ij})}{\sum_{i=1}^{N} 1(\eta_{j,\alpha_i} = 1)}, \qquad \hat{g}_j = \frac{\sum_{i=1}^{N} 1(\eta_{j,\alpha_i} = 0)\, y_{ij}}{\sum_{i=1}^{N} 1(\eta_{j,\alpha_i} = 0)},$$
where $1(\cdot)$ is the indicator function. When $\alpha$ is latent, integrating out $\alpha$ gives the marginal likelihood
$$L(s, g, \pi; Y) = \prod_{i=1}^{N} \sum_{\Lambda \in \{0,1\}^K} \pi_\Lambda \prod_{j=1}^{J} \theta_{j,\Lambda}^{y_{ij}} (1 - \theta_{j,\Lambda})^{1 - y_{ij}}, \qquad (4)$$
which is the primary focus of this paper.
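As a sketch of how these quantities can be computed from the formulas above, the following Python functions evaluate the ideal responses and the marginal log-likelihood; the vectorized inequality used to compute $\eta$ and the function names are our implementation choices, not code from the paper.

```python
import numpy as np

def ideal_response(profiles, Q):
    """eta[c, j] = prod_k profiles[c, k] ** Q[j, k]; equals 1 iff class c
    possesses every attribute required by item j."""
    return (profiles @ Q.T >= Q.sum(axis=1)).astype(int)  # shape (C, J)

def marginal_log_likelihood(Y, s, g, pi, profiles, Q):
    """log L(s, g, pi; Y) = sum_i log sum_Lambda pi_Lambda prod_j theta^y (1-theta)^(1-y)."""
    eta = ideal_response(profiles, Q)            # (C, J)
    theta = (1 - s) ** eta * g ** (1 - eta)      # positive response probabilities, (C, J)
    # log P(y_i | Lambda) for each examinee i (rows of Y) and each class Lambda:
    logp = Y @ np.log(theta).T + (1 - Y) @ np.log(1 - theta).T  # (N, C)
    # Classes with pi_Lambda = 0 contribute nothing (0 log 0 convention).
    log_pi = np.where(pi > 0, np.log(np.maximum(pi, 1e-300)), -np.inf)
    return np.logaddexp.reduce(logp + log_pi, axis=1).sum()
```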

Estimation Methods
EM and Markov chain Monte Carlo (MCMC) are two estimation methods for the DINA model. De la Torre [24] discussed the marginal maximum likelihood estimation for the DINA model, employing the EM algorithm with Equation (4) as the objective function. Gu and Xu [15] proposed a penalized expectation-maximization (PEM) method with a penalty of the form
$$\lambda \sum_{\Lambda} \left[ 1(\pi_\Lambda > \rho_N) \log \pi_\Lambda + 1(\pi_\Lambda \leq \rho_N) \log \rho_N \right],$$
where $\lambda \in (-\infty, 0)$ controls the sparsity of $\pi$ and $\rho_N$ is a small threshold parameter of the same order as $N^{-d}$ for some constant $d \geq 1$. There are thus two tuning parameters, $\lambda$ and $\rho_N$, in PEM. Additionally, a variational EM algorithm has been proposed as an alternative approach.
Culpepper [25] proposed a Bayesian formulation for the DINA model and used Gibbs sampling to estimate the parameters. The algorithm is implemented in the R package "dina". The Gibbs sampler can be extended by a sequential method for the DINA and GDINA models with many attributes [26], which provides an alternative to traditional MCMC. As the focus of this paper does not revolve around MCMC, we refrain from further details.

The Property of DINA as Mixture Model
For a fixed $K$, the DINA model can be viewed as a mixture model comprising $2^K$ latent classes (i.e., components). In contrast to the Gaussian mixture model, where a change in the number of components introduces or removes mean and covariance parameters, the DINA model behaves differently. Specifically, a change in the number of latent classes does not necessarily affect the presence of item parameters. This means that there are two cases: (i) the latent classes change while the structure of the item parameters does not, and (ii) the latent classes and the structure of the item parameters change simultaneously. To account for these two cases, a formal definition of the feasible domain of latent classes is introduced.

Definition 1. Let $F \subseteq \{0, 1\}^K$ be a subset of latent classes. If for every $s_j$ and $g_j$, $j = 1, \ldots, J$, there exist some latent classes in $F$ whose response function (i.e., the distribution of the response data) is determined by $s_j$ or $g_j$, we say that $F$ is a feasible subset of latent classes, and all feasible $F$s make up the feasible domain $\mathcal{F}$.
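Definition 1 admits a direct computational check: for every item $j$, the subset $F$ must contain at least one class with ideal response 1 (so that $s_j$ enters some response function) and at least one class with ideal response 0 (so that $g_j$ does). The sketch below implements this reading of the definition; it is our interpretation, not code from the paper.

```python
import numpy as np

def is_feasible(F, Q):
    """F: array of shape (|F|, K) whose rows are latent classes; Q: (J, K) binary matrix.
    Returns True iff every s_j and every g_j determines the response function
    of at least one class in F (Definition 1)."""
    eta = (F @ Q.T >= Q.sum(axis=1))   # (|F|, J) boolean ideal responses
    s_identified = eta.any(axis=0)     # item j mastered by some class -> s_j matters
    g_identified = (~eta).any(axis=0)  # item j failed by some class -> g_j matters
    return bool(s_identified.all() and g_identified.all())
```

For instance, a set containing both $0_K$ and $1_K$ passes this check whenever every item requires at least one attribute, in line with Lemma 1 below.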
Given a subset $F$, the probability $\pi_\Lambda$ of any latent class $\Lambda \notin F$ is strictly 0. There exist some subsets $F$ that will spoil the item parameter space. Let us see the following examples with $J = 4$ items, and assume the response vector $y_i = (y_{i1}, y_{i2}, y_{i3}, y_{i4})^\top$. If $\alpha_i$ is from $F_1 = \{\Lambda_{1,1}, \Lambda_{1,2}\}$, then for $g_j$, $j = 1, 2, 3, 4$, we have
$$P(y_{ij} = 1 \mid s_j, g_j, \Lambda_{1,1}; q_j) = g_j, \quad j = 1, 2, 3, 4, \qquad (7)$$
which means that $\Lambda_{1,1}$'s response function is determined by $g_j$. For $s_j$, $j = 1, 2, 3, 4$, we have
$$P(y_{ij} = 1 \mid s_j, g_j, \Lambda_{1,2}; q_j) = 1 - s_j, \quad j = 1, 2, 3, 4, \qquad (8)$$
which means that $\Lambda_{1,2}$'s response function is determined by $s_j$. Equations (7) and (8) are obtained by calculating the ideal responses. To determine the response functions of $\Lambda_{1,1}$ and $\Lambda_{1,2}$, all item parameters $s_j$ and $g_j$ are required. Based on similar discussions, for $F_2 = \{\Lambda_{2,1}, \Lambda_{2,2}\}$, $\Lambda_{2,1}$'s response function is determined by $g_1, g_2, s_3, g_4$, and $\Lambda_{2,2}$'s response function is determined by $s_1, s_2, g_3, s_4$. To determine the response functions of $\Lambda_{2,1}$ and $\Lambda_{2,2}$, all item parameters $s_j$ and $g_j$ are again required. Then, a different case is presented. If $\alpha_i$ is from $F_3 = \{\Lambda_{3,1}, \Lambda_{3,2}, \Lambda_{3,3}\}$, $\Lambda_{3,1}$'s response function is determined by $s_1, g_2, g_3, g_4$; $\Lambda_{3,2}$'s response function is determined by $g_1, s_2, g_3, g_4$; and $\Lambda_{3,3}$'s response function is determined by $s_1, g_2, s_3, g_4$. The item parameter $s_4$ cannot affect the response function of any attribute profile in $F_3$; hence, $s_4$ is called a redundant parameter. Meanwhile, this indicates that there is no slipping behavior for item 4, and we can set the redundant parameter $s_4 = 0$. If $\alpha_i$ is from $F_3$, the item parameter space collapses from 8-dimensional to 7-dimensional. It is clear that a subset $F \in \mathcal{F}$ will not spoil the item parameter space, and the feasible domain $\mathcal{F}$ depends on Q. Nevertheless, a general lemma can be given as follows; the proof is deferred to Appendix A.

Lemma 1. If $F$ contains $0_K$ and $1_K$, then $F$ always lies in the feasible domain $\mathcal{F}$, where $0_K$ and $1_K$ are the $K$-dimensional vectors of all 0s and all 1s, respectively.

Intuitively, the all-one profile $1_K$ masters every item, so its response function is determined by every $s_j$, while the all-zero profile $0_K$ masters no item (provided each item requires at least one attribute), so its response function is determined by every $g_j$.

Entropy Penalty of DINA
Shannon entropy is $E(\pi) = -\sum_{\Lambda} \pi_\Lambda \log \pi_\Lambda$, where the value of $0 \log 0$ is taken to be 0 according to $\lim_{x \to 0^+} x \log x = 0$ [27,28]. This section focuses on the case within the feasible domain, and the entropy-penalized log-likelihood function with the constraint is
$$\log L(s, g, \pi; Y) + \lambda E(\pi), \quad \text{subject to} \quad \sum_{\Lambda \in F} \pi_\Lambda = 1, \; F \in \mathcal{F}, \qquad (9)$$
where the penalty parameter $\lambda \in (-\infty, 0)$, whose interpretation coincides with that in [15]. Analogously, the penalty parameter $\lambda$ still controls the sparsity of $\pi$. The two penalties have different scales because $\log \pi_\Lambda$ is not close to $\pi_\Lambda \log \pi_\Lambda$. If the condition $F \in \mathcal{F}$ is omitted, letting $\lambda$ go to negative infinity implies that one latent class will be randomly selected (i.e., an extremely sparse solution), while letting $\lambda$ go to zero implies that all information comes from the observed data. Compared with PEM, the proposed method only needs one tuning parameter. The essential differences between the PEM and entropy penalization methods should be emphasized. PEM utilizes the term $1(\pi_\Lambda \leq \rho_N) \log \rho_N$ to handle the population probability $\pi_\Lambda = 0$, where $\rho_N$ is pre-specified rather than determined by some fit index; hence, the selection of $\rho_N$ affects the performance of PEM. In the entropy penalization method, $0 \log 0$ is well defined and no extra parameter is needed: the performance is completely determined by the parameter $\lambda$.
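A minimal sketch of the penalized objective in Equation (9), assuming the marginal log-likelihood has already been computed (e.g., by the earlier sketch); the function names are ours.

```python
import numpy as np

def entropy(pi):
    """Shannon entropy E(pi) = -sum_Lambda pi_Lambda log pi_Lambda, with 0 log 0 := 0."""
    p = pi[pi > 0]
    return -float(np.sum(p * np.log(p)))

def penalized_objective(loglik, pi, lam):
    """Equation (9): marginal log-likelihood plus lambda * E(pi); lam < 0
    penalizes high entropy and hence favors a sparse pi."""
    return loglik + lam * entropy(pi)
```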
Treating $\alpha$ as the latent data, the expected penalized log-likelihood function is
$$\sum_{i=1}^{N} \sum_{\Lambda \in F} h_{i,\Lambda} \left[ \log \pi_\Lambda + \log P(y_i \mid s, g, \Lambda) \right] + \lambda E(\pi),$$
where the expectation is taken with respect to the distribution $P(\alpha \mid Y, s, g, \pi)$ and
$$h_{i,\Lambda} = \frac{\pi_\Lambda P(y_i \mid s, g, \Lambda)}{\sum_{\Lambda' \in F} \pi_{\Lambda'} P(y_i \mid s, g, \Lambda')}$$
is the posterior probability of examinee $i$ belonging to latent class $\Lambda$. Considering the constraint $\sum_{\Lambda \in F} \pi_\Lambda = 1$, the Lagrange function is obtained by adding the multiplier term $\mu (\sum_{\Lambda \in F} \pi_\Lambda - 1)$. Setting the derivative with respect to $\pi_\Lambda$ to 0, for any $\Lambda \in F$ we obtain
$$\frac{\sum_{i=1}^{N} h_{i,\Lambda}}{\pi_\Lambda} - \lambda (\log \pi_\Lambda + 1) + \mu = 0.$$
Solving this stationarity condition yields the iterative formula for $\pi_\Lambda^{(t+1)}$ (see Appendix B), where the superscript "$(t)$" indicates values coming from the $t$-th iteration.
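The E-step only requires the posterior probabilities $h_{i,\Lambda}$; a sketch under the same conventions as the earlier code, where the log-space computation is a numerical-stability choice on our part:

```python
import numpy as np

def e_step(Y, s, g, pi, profiles, Q):
    """h[i, c] = pi_c P(y_i | s, g, Lambda_c) / sum_c' pi_c' P(y_i | s, g, Lambda_c')."""
    eta = (profiles @ Q.T >= Q.sum(axis=1)).astype(int)
    theta = (1 - s) ** eta * g ** (1 - eta)
    logp = Y @ np.log(theta).T + (1 - Y) @ np.log(1 - theta).T   # (N, C)
    log_num = logp + np.where(pi > 0, np.log(np.maximum(pi, 1e-300)), -np.inf)
    log_num -= log_num.max(axis=1, keepdims=True)  # stabilize before exponentiating
    h = np.exp(log_num)
    return h / h.sum(axis=1, keepdims=True)        # rows sum to 1
```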
Based on the iterative formula, a theorem is given to shrink the interval of $\lambda$.

Theorem 1. For the DINA model with a fixed integer $K$, the penalty parameter $\lambda$ of the penalized function in Equation (9) should lie in the interval $\left(-\frac{N}{K \log 2}, 0\right)$.

This theorem also indicates that $\lambda$ and $N$ have the same order, with rate $-\frac{1}{K \log 2}$. This paper focuses on $\lambda \in \{-0.05N, -0.1N, \ldots, -\frac{N}{K \log 2}\}$. Algorithm 1 shows the schedule of the EM algorithm for the DINA model within the feasible domain. In their implementation, the algorithms of Gu and Xu [15] introduce an additional pre-specified constant $c > 0$ to update the population parameter, via $\pi_\Lambda^{(t+1)} \propto \max\{c, \sum_{i=1}^{N} h_{i,\Lambda}^{(t)} + \lambda\}$, where $c$ is a small constant. Algorithm 1 does not rely on a pre-specified constant. Overall, in both the method's construction and its implementation, the algorithms of Gu and Xu [15] involve three parameters $\lambda, \rho_N, c$, while Algorithm 1 only involves the single parameter $\lambda$. We emphasize again that the parameter $\lambda$ in the two methods has different scales. The calculation and proof are omitted here; more details are deferred to Appendix B.
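Theorem 1 restricts the search for $\lambda$ to $(-N/(K \log 2), 0)$; the small helper below (ours, mirroring the grid used in the paper) builds the corresponding grid of candidate values.

```python
import numpy as np

def lambda_grid(N, K, step=0.05):
    """Candidate penalties {-step*N, -2*step*N, ...}, truncated at the
    lower bound -N / (K log 2) from Theorem 1."""
    lower = -N / (K * np.log(2))
    grid = -N * np.arange(step, 1.0, step)
    return grid[grid > lower]
```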
Algorithm 1: EM of Entropy Penalized Method within the Feasible Domain.
Input: observed data $Y$, initial values $s^{(0)}, g^{(0)}, \pi^{(0)}$, penalty parameter $\lambda$, initial feasible set $F^{(0)}$, and maximum number of iterations $T$.
While $t < T$ and not converged: perform the E-step to compute the posterior probabilities $h^{(t)}_{i,\Lambda}$; perform the M-step to update $s^{(t+1)}, g^{(t+1)}, \pi^{(t+1)}$ and the set $F^{(t+1)}$ of classes with nonzero probability; if $F^{(t+1)} \notin \mathcal{F}$, stop the algorithm and export the warning "$\lambda$ is too small".
Output: $\hat{s} = s^{(t)}$, $\hat{g} = g^{(t)}$, $\hat{\pi} = \pi^{(t)}$, and $\hat{F} = F^{(t)}$.

Chen and Chen [29] proposed the extended Bayesian information criterion (EBIC) to conduct model selection over a large model space. The EBIC has the following form:
$$\mathrm{EBIC} = -2 \log L(\hat{s}, \hat{g}, \hat{\pi}; Y) + \left( \|\hat{F}\| - 1 + 2J \right) \log N + 2 \log \binom{2^K}{\|\hat{F}\|}, \qquad (13)$$
where $\|\hat{F}\|$ is the number of nonempty latent classes and $\binom{m}{n}$ indicates the binomial coefficient "m choose n". A smaller EBIC indicates a preferred model. In this paper, the EBIC is used to select $\lambda$ from the aforementioned grid.
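A sketch of the EBIC computation in Equation (13); the degrees-of-freedom bookkeeping follows our reading of the text (mixing probabilities plus item parameters), and `gammaln` is used for a numerically stable binomial coefficient.

```python
import numpy as np
from scipy.special import gammaln

def log_binom(m, n):
    """log C(m, n), stable even for m = 2^K with large K."""
    return gammaln(m + 1) - gammaln(n + 1) - gammaln(m - n + 1)

def ebic(loglik, n_classes, n_item_params, N, K):
    """EBIC = -2 loglik + df log N + 2 log C(2^K, ||F||),
    with df = (||F|| - 1) mixing probabilities plus the item parameters."""
    df = (n_classes - 1) + n_item_params
    return -2.0 * loglik + df * np.log(N) + 2.0 * log_binom(2 ** K, n_classes)
```

In the feasible case, `n_item_params` is $2J$; in the modified EM of the next section, it becomes $\|\hat{s}\| + \|\hat{g}\|$.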
In the implementation of the EM algorithm, we let the initial $F^{(0)}$ include all latent classes to avoid missing important ones. The gaps between successive item parameter values are used to check convergence. If $F^{(t+1)}$ is not feasible, the penalty parameter $\lambda$ is too small.

Modified EM
This section considers the case where $F$ is not necessarily feasible. In this case, some $g_j$ or $s_j$ will disappear, which means that the observed data cannot provide any information about them. These redundant item parameters are set to zero. We focus on the Lagrange function without the feasibility constraint; its derivation is analogous to the feasible case and is given in Appendix B. Because the space of item parameters may collapse, the dimension of the item parameters needs to be recalculated. Meanwhile, the EBIC becomes
$$\mathrm{EBIC} = -2 \log L(\hat{s}, \hat{g}, \hat{\pi}; Y) + \left( \|\hat{F}\| - 1 + \|\hat{s}\| + \|\hat{g}\| \right) \log N + 2 \log \binom{2^K}{\|\hat{F}\|},$$
where $\|\hat{s}\|$ and $\|\hat{g}\|$ indicate the numbers of nonzero slipping and guessing parameters, respectively. If all $s_j$ and $g_j$ exist, the sum of $\|\hat{s}\|$ and $\|\hat{g}\|$ is $2J$, which recovers Equation (13).
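Identifying the redundant parameters is mechanical given $F$ and Q; the sketch below (our helper, not the paper's code) flags the $s_j$ and $g_j$ that no class in $F$ can inform, which are then fixed at zero.

```python
import numpy as np

def redundant_item_params(F, Q):
    """s_j is redundant if no class in F masters item j (eta is never 1);
    g_j is redundant if every class in F masters item j (eta is never 0)."""
    eta = (F @ Q.T >= Q.sum(axis=1))   # (|F|, J) boolean ideal responses
    s_redundant = ~eta.any(axis=0)     # these s_j are set to 0
    g_redundant = eta.all(axis=0)      # these g_j are set to 0
    return s_redundant, g_redundant
```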
The corresponding EM is shown in Algorithm 2, where the discussions of $F^{(0)}$ and convergence are similar. However, this algorithm does not distinguish between feasible and infeasible subsets. To the best of our knowledge, no previous algorithm has handled the case without the feasible domain. More details are given in Appendix B.
Algorithm 2: EM of Entropy Penalized Method without the Feasible Domain. It mirrors Algorithm 1, except that the feasibility check on $F^{(t+1)}$ is removed and the redundant item parameters are set to zero.

Study I
In this study, the settings are $N = 150, 500, 1000$ and $s = g = 0.2$. The attribute profiles are generated from the true feasible set $F$, and each attribute profile in $F$ has a probability of 1/7.
For the proposed EM, the penalty parameter is $\lambda = -0.075N$. The method to explore $\lambda$ will be discussed in the next simulation study. For each attribute $\alpha_{ik}$, the classification accuracy is evaluated by the posterior marginal probability
$$P(\alpha_{ik} = 1 \mid Y, \hat{s}, \hat{g}, \hat{\pi}) = \sum_{\Lambda: \Lambda_k = 1} P(\alpha_i = \Lambda \mid Y, \hat{s}, \hat{g}, \hat{\pi}),$$
where $P(\alpha_i \mid Y, \hat{s}, \hat{g}, \hat{\pi})$ is the posterior probability of examinee $i$ having attribute profile $\alpha_i$. Given the posterior marginal probabilities, the logarithm of the posterior marginal likelihood,
$$\sum_{i=1}^{N} \sum_{k=1}^{K} \log P(\alpha_{ik} = \alpha^*_{ik} \mid Y, \hat{s}, \hat{g}, \hat{\pi}),$$
reflects the global classification accuracy, where $\alpha^*_{ik}$ denotes the true attribute used during data generation.
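Both accuracy measures follow directly from the E-step posteriors; a sketch (function names ours), where `h` is the $N \times C$ posterior matrix and `alpha_true` holds the generating attributes $\alpha^*_{ik}$:

```python
import numpy as np

def attribute_marginals(h, profiles):
    """P(alpha_ik = 1 | Y, ...) = sum of h[i, c] over classes c with Lambda_ck = 1."""
    return h @ profiles                                   # shape (N, K)

def log_posterior_marginal_likelihood(h, profiles, alpha_true):
    """sum_i sum_k log P(alpha_ik = alpha*_ik | Y, ...), the global accuracy measure."""
    m1 = attribute_marginals(h, profiles)                 # P(alpha_ik = 1 | ...)
    p_true = np.where(alpha_true == 1, m1, 1.0 - m1)      # probability of the true value
    return np.log(np.maximum(p_true, 1e-300)).sum()
```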
This study focuses on the estimators of $\pi$ and the classification accuracy. The results for $\pi$ are shown in Figure 1. We observe that, for all settings of $N$, the performance of the standard EM is poor, as it fails to eliminate any irrelevant attribute profiles. When the sample size is $N$, the probability $1/N$ can be treated as a threshold to distinguish irrelevant attribute profiles; however, this thresholding only eliminates part of the irrelevant attribute profiles, and finding an appropriate threshold is challenging in general. In contrast, regardless of the sample size, the proposed EM finds the true attribute profiles and sets the probabilities of all irrelevant attribute profiles to zero. The logarithms of the posterior marginal likelihood are $-204.239$, $-453.163$, $-508.047$, $-514.681$, $-1046.273$, and $-1054.014$ for the six settings $\{150, 500, 1000\} \otimes \{\text{proposed EM, standard EM}\}$, respectively. The likelihood of the proposed EM is consistently larger than that of the standard EM, indicating that the proposed EM enjoys higher classification accuracy. The results also show that the proposed method works well for the small sample size $N = 150$; especially in terms of the posterior marginal likelihood, the proposed method has obvious advantages over the standard EM. Note that the discussion regarding the estimation of the item parameters is deferred to the third simulation study.

Figure 1. The results of the estimated $\pi$ for different sample sizes. The horizontal axis represents the attribute profiles, enumerated as 1 = (0,0,0,0,0), 2 = (1,0,0,0,0), 3 = (0,1,0,0,0), ..., 7 = (1,1,0,0,0), ..., 32 = (1,1,1,1,1); the same rule applies to other values of $K$.

Study II

In this study, the solution paths of $\pi$ versus $\lambda$ are elaborated. The settings of $N$, $s$, $g$, and $F$ are the same as in Study I. The penalty parameter ranges over $\lambda \in \{-0.2N, -0.195N, \ldots, -0.01N\}$. Due to the similarity of the results, only those for the sample size $N = 500$ are presented. Figure 2 shows the solution paths of $\pi$, the number of classes in $\hat{F}$, and the EBIC versus $\lambda$.

In Figure 2a, the colored lines indicate the true latent classes, while the black lines indicate the irrelevant latent classes; the dotted line represents the probability 1/7. Based on the figure, $\lambda$ in the interval $[-0.155N, -0.07N]$ efficiently estimates the true $\pi$ and eliminates the empty latent classes. For large $\lambda$ (close to zero), the recovery of the true $\pi$ does not worsen much, but the probabilities of the irrelevant latent classes cannot be shrunk strictly to zero. In Figure 2b, the left and right vertical axes show the number of classes in $\hat{F}$ and the EBIC, respectively, and the dotted line represents the true number of classes in $F$. Figure 2b shows that the EBIC achieves its minimum exactly when the correct number of classes is selected. This study illustrates how to explore the values of $\lambda$ using the EBIC.

Figure 2. Solution paths of the estimated $\pi$, the number of classes in $\hat{F}$, and the EBIC versus $\lambda$. (a) Solution paths of $\pi$. (b) Number of classes in $\hat{F}$ and EBIC.

Study III
In this study, the effects of the sample size and the item parameters are evaluated. We consider $N = 500, 1000$ and item parameters $s = g = 0.1, 0.2, 0.3$. The response data are generated with a more complex $F$, and each latent class in $F$ has probability 1/12. In each setting, 200 independent data sets are generated.
The true set $F$ contains the 12 latent classes (0,0,0,0,0), (1,0,0,0,0), (0,1,0,0,0), (0,0,1,0,0), (0,0,0,1,0), (0,0,0,0,1), (1,1,0,0,0), (0,1,1,0,0), (0,0,1,1,0), (0,0,0,1,1), (1,0,0,0,1), and (1,1,1,1,1). Firstly, the EBIC is used to select an appropriate $\lambda$ for each setting. The selection precision of the latent classes and the bias and RMSE of the item parameters are used to evaluate the proposed method; the results are listed in Table 1. The notation "ST/AT" denotes the ratio of selected true latent classes to all true latent classes, and "ST/AS" denotes the ratio of selected true latent classes to all selected latent classes. The subscripts of Bias and RMSE indicate the type of item parameter. As the sample size $N$ increases and $s, g$ decrease, both the selection precision and the quality of the item parameter estimators improve. If $N = 1000$ and $s = g = 0.1$, the true $F$ can be completely recovered. The RMSE and bias of the guessing parameters are lower than those of the slipping parameters, which is due to the DINA model itself: the guessing parameter is estimated from all latent classes that do not fully master the attributes required by an item, whereas the slipping parameter is estimated only from the latent classes with complete mastery of the specific $q_j$ vector.
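For completeness, the evaluation metrics of Table 1 can be computed as sketched below; the exact averaging convention (jointly over items and replications) is our assumption.

```python
import numpy as np

def selection_precision(selected, true):
    """ST/AT and ST/AS: 'selected' and 'true' are sets of latent-class tuples."""
    st = len(selected & true)
    return st / len(true), st / len(selected)

def bias_rmse(estimates, truth):
    """Bias and RMSE of item-parameter estimates over R replications.
    estimates: array of shape (R, J); truth: scalar or shape (J,)."""
    err = estimates - truth
    return err.mean(), np.sqrt((err ** 2).mean())
```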

Real Data Analysis
In this section, the fraction-subtraction data are analyzed; for more about the data, please refer to the literature [7,32,33]. This data set contains the responses of $N = 536$ middle school students to $J = 20$ items, where the responses are coded as 0 or 1. The test measures $K = 8$ attributes, so there are $2^8 = 256$ possible latent classes. The Q-matrix and the item contents are shown in Table 2.
Because the sample size $N = 536$ is not much larger than the number of possible latent classes $2^K = 256$, we cannot ensure that there are enough latent classes to guarantee that the true $F$ is feasible. Algorithm 2 is therefore suggested for analyzing the real data. Firstly, the EBIC is used to select the penalty parameter $\lambda \in \{-0.2N, -0.19N, \ldots, -0.01N\}$. The results around the optimal EBIC are shown in Table 3, which is based on a stable interval of the EBIC. We observe that the EBIC achieves its minimum at $\lambda = -0.17N$. At $\lambda = -0.16N$, the number $\|\hat{F}\|$ changes from 76 to 20, and two guessing parameters disappear; if $\lambda$ increases slightly beyond $-0.16N$, the model becomes more complicated. Based on this fact, we discard the values of $\lambda$ that are not less than $-0.16N$. Next, the evaluation is based on Theorem 1 and the estimators $\hat{s}, \hat{g}, \hat{F}$. We note that $-0.19 < -\frac{1}{8 \log 2}$, so $\lambda = -0.19N$ does not satisfy Theorem 1, and the corresponding estimator eliminates too many classes. Figure 3 presents the estimated $\hat{F}$ for different $\lambda$, and the results of Figure 3a are consistent with the conclusion of Theorem 1. Consequently, the values of $\lambda$ no larger than $-0.19N$ are discarded. In addition, combining Figure 3a-c, we find that attribute 7 is the most basic attribute. Figure 4a,b display the estimators of the guessing and slipping parameters, respectively. According to the estimated $\hat{s}$, the results for $\lambda = -0.19N$ are strongly shifted on items 2, 3, 5, 9, and 16. For different $\lambda$s, the behavior of the estimated $\hat{g}$ is more complex, and significant differences are found in items 8, 9, and 13.

Discussion
In this paper, we study the penalized method for the DINA model. There are two contributions. Firstly, the entropy-penalized method is proposed for the DINA model, and the feasible domain is defined to describe the relation between the latent classes and the parameter space of the item parameters; this framework allows irrelevant attribute profiles to be distinguished. Secondly, based on the definition of the feasible domain, two modified EM algorithms are developed. In practice, it is recommended to perform exploratory analyses using Algorithm 2 before proceeding further, which can provide valuable insights and guidance for understanding the data structure.
While this paper focuses on the DINA model, a natural extension would be the application of the entropy-penalized method to other CDMs. Additionally, it is worth noting that this study involves situations with a maximum dimension of $K = 8$, which is relatively low. Improving the power and performance of the entropy-penalized method in high-dimensional cases of $K$ is an interesting topic. A more challenging question is how the specification of irrelevant latent classes affects the classification accuracy and the estimation of the model. These topics are left for future research.


Table 1. The selection precision of the latent classes and the Bias and RMSE of the item parameters. The results are based on 200 independent data sets.

Table 2. The Q-matrix and item contents of the fraction-subtraction data.