Article

DINA Model with Entropy Penalization

Juntao Wang and Yuan Li
1 School of Economics and Statistics, Guangzhou University, Guangzhou 510006, China
2 Institute of Applied Mathematics, Shenzhen Polytechnic, Shenzhen 518000, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(18), 3993; https://doi.org/10.3390/math11183993
Submission received: 22 August 2023 / Revised: 15 September 2023 / Accepted: 19 September 2023 / Published: 20 September 2023
(This article belongs to the Special Issue Statistical Methods in Data Science and Applications)

Abstract

The cognitive diagnosis model (CDM) is an effective statistical tool for extracting the discrete attributes of individuals based on their responses to diagnostic tests. When dealing with cases that involve small sample sizes or highly correlated attributes, not all attribute profiles may be present. The standard method, which accounts for all attribute profiles, not only increases the complexity of the model but also complicates the calculation. Thus, it is important to identify the empty attribute profiles. This paper proposes an entropy-penalized likelihood method to eliminate the empty attribute profiles. In addition, the relation between attribute profiles and the parameter space of item parameters is discussed, and two modified expectation–maximization (EM) algorithms are designed to estimate the model parameters. Simulations are conducted to demonstrate the performance of the proposed method, and a real data application based on the fraction–subtraction data is presented to showcase the practical implications of the proposed method.

1. Introduction

CDMs are widely used in the field of educational and psychological assessments. These models are used to extract the examinees’ latent binary random vectors, which can provide rich and comprehensive information about examinees. Different CDMs are proposed for different test scenarios. The popular CDMs include Deterministic Input, Noisy “And” gate (DINA) model [1], Deterministic Input, Noisy “Or” gate (DINO) model [2], Noisy Inputs, Deterministic “And” gate (NIDA) model [3], Noisy Inputs, Deterministic “Or” gate (NIDO) model [2], Reduced Reparameterized Unified Model (RRUM) [4,5] and Log-linear Cognitive Diagnosis Model (LCDM) [6]. The differences among the above-mentioned CDMs are the modeling methods of the positive response probabilities. CDMs can be summarized in more flexible frameworks such as the Generalized Noisy Inputs, Deterministic “And” gate (GDINA) model [7] and the General Diagnostic Model (GDM) [8]. The simplicity and interpretability of the DINA model have positioned it as one of the most popular CDMs.
The DINA model, also known as the latent classes model [9,10,11,12,13], is a mixture model, so it still suffers from the drawbacks of mixture models. Too many latent classes may overfit the data, meaning that the data would have been better characterized by a simpler model. Too few latent classes cannot characterize the true underlying data structure well and yield poor inference. In practical terms, identifying the empty latent classes improves the model's interpretability and helps explain the data well. Chen [14] showed that the theoretical optimal convergence rate of a mixture model with an unknown number of classes is slower than the optimal convergence rate when the number of classes is known. This means that inference would strongly benefit from knowing the number of classes. Therefore, from both practical and theoretical viewpoints, eliminating the empty latent classes is a crucial issue in the DINA model.
Common reasons for empty attribute profiles include small sample sizes or highly correlated attributes. Let us explore a few examples to illustrate further. In a scenario where the sample size is smaller than the number of attribute profiles, it is inevitable that some attribute profiles will be empty. In another scenario with two attributes $\alpha_1$ and $\alpha_2$, suppose the relation $\alpha_2 = 1$ if and only if $\alpha_1 = 1$ holds. Under this assumption of extremely correlated attributes, the attribute profiles $(\alpha_1 = 1, \alpha_2 = 0)$ and $(\alpha_1 = 0, \alpha_2 = 1)$ do not appear. Situations with empty attribute profiles can occur in various scenarios [15].
The hierarchical diagnostic classification model [15,16] is a well-known method to eliminate empty attribute profiles. In the literature, directed acyclic graphs are employed to describe the relationships among the attributes, and the directions of edges impose strict constraints on attributes. If there is a directed edge from α 1 to α 2 , the attribute profile ( 01 ) is forbidden. Gu and Xu [15] utilized penalized EM to select the true attribute profiles, avoid overfitting, and learn attribute hierarchies. Wang and Lu [17] compared two exploratory approaches of learning attribute hierarchies in the LCDM and DINA models. In essence, the attribute hierarchy can be regarded as a specific family of correlated attributes that can be effectively represented and described through a graph model.
Penalized methods have been widely researched in many statistical problems. In the regression model, the least absolute shrinkage and selection operator (LASSO) and its variants are analyzed in [18,19,20]. Fan and Li [21] proposed the nonconcave smoothly clipped absolute deviation (SCAD) penalty to reduce the bias of estimators. In the Gaussian mixture model, Ma and Wang [22] and Huang et al. [23] proposed penalized likelihood methods to determine the number of components. In CDMs, Chen et al. [10] used SCAD to obtain sparse item parameters and recover the Q matrix. Xu and Shang [11] applied an "$L_0$ norm" penalty to CDMs and suggested a truncated "$L_1$ norm" penalty as an approximation for computation.
In the hierarchical diagnostic classification model, directed acyclic graphs of attributes often need to be specified in advance. A limitation of this model is that it is difficult to specify a graph in real scenarios. The penalty of the penalized EM proposed by Gu and Xu [15] involves two tuning parameters that complicate the implementation. Therefore, we hope to propose a method that does not require specifying a directed acyclic graph in advance and has a concise penalty term.
This paper makes two primary contributions. Firstly, it introduces an entropy-based penalty, and secondly, it develops the corresponding algorithms to utilize this penalty. This paper proposes a novel approach for estimating the DINA model, combining Shannon entropy and the penalized method. In information theory, “uncertainty” can be interpreted informally as the negative logarithm of probability, and Shannon entropy is the average effect of the “uncertainty”. Shannon entropy can be used to characterize the distribution of attribute profiles. By utilizing the proposed method, the empty attribute profiles can be eliminated. We further develop the EM algorithm for the proposed method and conduct some simulations to verify the proposed method.
The rest of the paper is organized as follows. In Section 2, we give an overview of the DINA model and the estimation method, and the feasible domain is defined to characterize the latent classes. Section 3 introduces the entropy-penalized method, and the EM algorithm is employed to estimate the DINA model. The numerical studies of the entropy-penalized method are shown in Section 4. Section 5 presents a real data analysis based on the fraction–subtraction data. The summary of the paper and future research are given in Section 6. The details of the EM algorithm and the proofs are given in Appendix A, Appendix B and Appendix C.

2. DINA Model

2.1. Review of DINA Model

Firstly, some useful notations are introduced. For examinee $i = 1, \ldots, N$, the attribute profile $\alpha_i$, also known as the latent class, is a K-dimensional binary vector $\alpha_i = (\alpha_{i1}, \alpha_{i2}, \ldots, \alpha_{iK})^\top$, and the corresponding response data to J items is a J-dimensional binary vector $y_i = (y_{i1}, y_{i2}, \ldots, y_{iJ})^\top$, where "$\top$" is the transpose operation. Let $Y$ and $\alpha$ denote the collections of all $y_i$ and $\alpha_i$, respectively. The Q matrix is a $J \times K$ binary matrix, where the element $q_{jk}$ is 1 if item j requires attribute k and 0 otherwise. The j-th row vector is denoted by $q_j$. Given the fixed K, there are $C = 2^K$ latent classes. We use a multinomial distribution with probability $\pi_\Lambda \equiv P(\alpha_i = \Lambda)$ to describe the attribute profile $\Lambda \in \{0,1\}^K$, where $\sum_{\Lambda \in \{0,1\}^K} \pi_\Lambda = 1$, and the population parameter $\pi$ denotes the collection of probabilities for all attribute profiles.
The DINA model [1] supposes that, in an ideal scenario, the examinees with all required attributes will provide correct answers. For examinee i and item j, the ideal response is defined as $\eta_{j,\alpha_i} = \prod_{k=1}^{K} \alpha_{ik}^{q_{jk}}$, where $0^0$ is defined as 1. The slipping and guessing parameters are defined by the conditional probabilities $s_j = P(y_{ij} = 0 \mid \eta_{j,\alpha_i} = 1)$ and $g_j = P(y_{ij} = 1 \mid \eta_{j,\alpha_i} = 0)$, respectively. The parameters $s$ and $g$ are the collections of all $s_j$ and $g_j$, respectively.
In the DINA model, the positive response probability θ j , α i can be constructed as
$\theta_{j,\alpha_i} \equiv P(y_{ij} = 1 \mid s_j, g_j, \alpha_i; q_j) = (1 - s_j)^{\eta_{j,\alpha_i}}\, g_j^{1 - \eta_{j,\alpha_i}}.$
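As a brief worked illustration with hypothetical values (not taken from the paper): let $K = 3$, $q_j = (1, 1, 0)$, $s_j = 0.2$ and $g_j = 0.1$. An examinee with $\alpha_i = (1, 0, 1)$ lacks required attribute 2, so $\eta_{j,\alpha_i} = 1^{1} \cdot 0^{1} \cdot 1^{0} = 0$ and $\theta_{j,\alpha_i} = (1 - s_j)^{0} g_j^{1} = 0.1$, whereas an examinee with $\alpha_i = (1, 1, 0)$ masters both required attributes, so $\eta_{j,\alpha_i} = 1$ and $\theta_{j,\alpha_i} = 1 - s_j = 0.8$.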
If both Y and α are observed, the likelihood function is
$P(Y, \alpha \mid s, g) = \prod_{i=1}^{N} \pi_{\alpha_i} \prod_{j=1}^{J} \left[(1 - s_j)^{\eta_{j,\alpha_i}} g_j^{1 - \eta_{j,\alpha_i}}\right]^{y_{ij}} \left[s_j^{\eta_{j,\alpha_i}} (1 - g_j)^{1 - \eta_{j,\alpha_i}}\right]^{1 - y_{ij}}.$
Given data Y and attribute profile α , the parameters s and g can be directly estimated by the maximum likelihood estimators:
$\hat{s}_j = \dfrac{\sum_{i=1}^{N} \mathbf{1}(\eta_{j,\alpha_i} = 1 \,\&\, y_{ij} = 0)}{\sum_{i=1}^{N} \mathbf{1}(\eta_{j,\alpha_i} = 1)}, \qquad \hat{g}_j = \dfrac{\sum_{i=1}^{N} \mathbf{1}(\eta_{j,\alpha_i} = 0 \,\&\, y_{ij} = 1)}{\sum_{i=1}^{N} \mathbf{1}(\eta_{j,\alpha_i} = 0)}, \qquad j = 1, \ldots, J,$
where 1 ( · ) is the indicator function. When α is latent, by integrating out α , the marginal likelihood is
$P(Y \mid s, g, \pi) = \prod_{i=1}^{N} \sum_{\Lambda \in \{0,1\}^K} \pi_\Lambda \prod_{j=1}^{J} \left[(1 - s_j)^{\eta_{j,\Lambda}} g_j^{1 - \eta_{j,\Lambda}}\right]^{y_{ij}} \left[s_j^{\eta_{j,\Lambda}} (1 - g_j)^{1 - \eta_{j,\Lambda}}\right]^{1 - y_{ij}},$
which is the primary focus of this paper.

2.2. Estimation Methods

EM and Markov chain Monte Carlo (MCMC) are two estimation methods for the DINA model. De la Torre [24] discussed the marginal maximum likelihood estimation for the DINA model, and the EM algorithm was employed where the objective function was Equation (4). Gu and Xu [15] proposed a penalized expectation–maximization (PEM) with the penalty
$\lambda \sum_{\Lambda} \left[\mathbf{1}(\pi_\Lambda > \rho_N) \log \pi_\Lambda + \mathbf{1}(\pi_\Lambda \le \rho_N) \log \rho_N\right],$
where $\lambda \in (-\infty, 0)$ controls the sparsity of π and $\rho_N$ is a small threshold parameter of the same order as $N^{-d}$ for a constant $d \ge 1$. There are two tuning parameters, λ and $\rho_N$, in PEM. Additionally, a variational EM algorithm is proposed as an alternative approach.
Culpepper [25] proposed a Bayesian formulation for the DINA model and used Gibbs sampling to estimate parameters. The algorithm can be implemented by the R package “dina”. The Gibbs sampling can be extended by a sequential method in the DINA and GDINA with many attributes [26], which provides an alternative approach to the traditional MCMC. As the focus of this paper does not revolve around the MCMC, we refrain from details.

2.3. The Property of DINA as Mixture Model

For a fixed K, the DINA model can be viewed as a mixture model comprising 2 K latent classes (i.e., components). In contrast to the Gaussian mixture model, where a change in the number of components will introduce or remove the mean and covariance parameters, the DINA model behaves differently. Specifically, a change in the number of latent classes does not necessarily affect the presence of item parameters. This means that there are two cases: (i) the latent classes have changed while the structure of the item parameters does not change, and (ii) the latent classes and the structure of the item parameters change simultaneously. To account for the two cases, a formal definition of the feasible domain of latent classes is introduced.
Definition 1.
Let $F \subseteq \{0,1\}^K$ be a subset of latent classes. If, for every $s_j$ or $g_j$, $j = 1, \ldots, J$, there exists some latent class in $F$ whose response function (i.e., the distribution of the response data) is determined by $s_j$ or $g_j$, we say that $F$ is a feasible subset of latent classes, and all feasible $F$s make up the feasible domain $\mathcal{F}$.
If all attribute profiles lie in $F \subseteq \{0,1\}^K$, the probability of any $\Lambda \notin F$ is strictly 0. There exist some subsets $F$ that will spoil the item parameter space. Let us see the following examples:
$Q_1 = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \end{pmatrix}, \qquad F_1 = \begin{Bmatrix} \Lambda_{1,1} = (000) \\ \Lambda_{1,2} = (111) \end{Bmatrix}, \qquad F_2 = \begin{Bmatrix} \Lambda_{2,1} = (001) \\ \Lambda_{2,2} = (110) \end{Bmatrix}, \qquad F_3 = \begin{Bmatrix} \Lambda_{3,1} = (100) \\ \Lambda_{3,2} = (010) \\ \Lambda_{3,3} = (001) \end{Bmatrix}.$
Assume the response vector $y_i = (y_{i1}, y_{i2}, y_{i3}, y_{i4})^\top$. If $\alpha_i$ is from $F_1$, then for $g_j$, $j = 1, 2, 3, 4$, we have
$P(y_{ij} \mid \alpha_i = \Lambda_{1,1}) = g_j^{y_{ij}} (1 - g_j)^{1 - y_{ij}},$
which means that the Λ 1 , 1 ’s response function is determined by g j . For s j , j = 1 , 2 , 3 , 4 , we have
$P(y_{ij} \mid \alpha_i = \Lambda_{1,2}) = s_j^{1 - y_{ij}} (1 - s_j)^{y_{ij}},$
which means that the Λ 1 , 2 ’s response function is determined by s j . The Equations (7) and (8) are obtained by calculating the ideal responses. To determine the response function of Λ 1 , 1 and Λ 1 , 2 , all item parameters s j and g j are required. Based on similar discussions, Λ 2 , 1 ’s response function is determined by g 1 , g 2 , s 3 , g 4 , and Λ 2 , 2 ’s response function is determined by s 1 , s 2 , g 3 , s 4 . To determine the response function of Λ 2 , 1 and Λ 2 , 2 , all item parameters s j and g j are required.
Then, a different case is presented. If $\alpha_i$ is from $F_3$, $\Lambda_{3,1}$'s response function is determined by $s_1, g_2, g_3, g_4$, $\Lambda_{3,2}$'s response function is determined by $g_1, s_2, g_3, g_4$, and $\Lambda_{3,3}$'s response function is determined by $g_1, g_2, s_3, g_4$. The item parameter $s_4$ cannot affect the response function of any attribute profile. Hence, $s_4$ is called a redundant parameter. Meanwhile, this indicates that there is no slipping behavior for item 4, and we can set the redundant parameter $s_4 = 0$. If $\alpha_i$ is from $F_3$, the item parameter space collapses from 8-dimensional to 7-dimensional. It is obvious that a subset $F \in \mathcal{F}$ will not spoil the item parameter space, and the feasible domain $\mathcal{F}$ depends on Q. However, a lemma can be given as follows; the proof is deferred to Appendix A.
Lemma 1.
If F contains $0_K$ and $1_K$, then F always lies in the feasible domain $\mathcal{F}$, where $0_K$ and $1_K$ are the K-dimensional vectors with all entries 0 and 1, respectively.
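To make Definition 1 concrete, the following sketch (Python; not part of the original paper, with illustrative function names) checks whether a candidate subset F is feasible for a given Q: feasibility holds exactly when, for every item j, some profile in F has ideal response 1 (so $s_j$ matters) and some profile has ideal response 0 (so $g_j$ matters).

```python
import numpy as np

def ideal_responses(Q, profiles):
    """eta[j, c] = 1 iff profile c masters every attribute required by item j."""
    return np.all(profiles[None, :, :] >= Q[:, None, :], axis=2).astype(int)   # (J, |F|)

def is_feasible(Q, profiles):
    """Definition 1: every s_j and every g_j must govern the response function of some profile in F."""
    eta = ideal_responses(Q, profiles)
    needs_s = eta.any(axis=1)          # some profile with eta = 1 -> s_j is used
    needs_g = (1 - eta).any(axis=1)    # some profile with eta = 0 -> g_j is used
    return bool(np.all(needs_s & needs_g))

# The Q_1 and subsets of the example above
Q1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0]])
F1 = np.array([[0, 0, 0], [1, 1, 1]])
F3 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
print(is_feasible(Q1, F1))   # True: F1 contains 0_K and 1_K (Lemma 1)
print(is_feasible(Q1, F3))   # False: s_4 is redundant, since no profile masters attributes 1 and 2
```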

3. Entropy Penalized Method

3.1. Entropy Penalty of DINA

Shannon entropy is $E(\pi) = -\sum_{\Lambda} \pi_\Lambda \log \pi_\Lambda$, where the value of the notation $0 \log 0$ is taken to be 0 according to $\lim_{x \to 0^+} x \log x = 0$ [27,28]. This section focuses on the case within the feasible domain, and the entropy-penalized log-likelihood function with the constraint is
$L_\lambda(s, g, \pi) = \log P(Y \mid s, g, \pi) + \lambda E(\pi), \qquad \text{s.t. } F \in \mathcal{F},$
where the penalty parameter $\lambda \in (-\infty, 0)$, whose interpretation coincides with that in [15]. Analogously, the penalty parameter λ still controls the sparsity of π. The two penalties have different scales, because the per-class terms $\log \pi_\Lambda$ and $\pi_\Lambda \log \pi_\Lambda$ are not close to each other. If the condition $F \in \mathcal{F}$ is omitted, letting λ go to negative infinity implies that one latent class will be randomly selected (i.e., the solution is extremely sparse), whereas letting λ go to zero implies that all information comes from the observed data. Compared with PEM, the proposed method only needs one tuning parameter.
The essential differences between the PEM and Entropy penalization methods are emphasized. PEM utilizes the term 1 ( π Λ ρ N ) log ρ N to handle the population probability π Λ = 0 , where ρ N is pre-specified rather than determined by some fit indices. Hence, the selection of ρ N will affect the performance of PEM. In the Entropy penalization method, 0 log 0 is well defined, and we do not need extra parameters. The performance of the Entropy penalization method is completely determined by the parameter λ .
Treating α as the latent data, the expected log-likelihood function is
$E[L_\lambda] = E\big[\log P(Y \mid s, g, \alpha) P(\alpha \mid \pi)\big] + \lambda E(\pi), \qquad \text{s.t. } F \in \mathcal{F},$
where the expectation E is taken with respect to the distribution $P(\alpha \mid Y, s, g, \pi)$. Considering the constraint $\sum_{\Lambda} \pi_\Lambda = 1$, the Lagrange function can be defined as
$L_\lambda^\mu = E[L_\lambda] + \mu\Big(\sum_{\Lambda \in F} \pi_\Lambda - 1\Big), \qquad \text{s.t. } F \in \mathcal{F}.$
Letting the derivatives $\partial L_\lambda^\mu / \partial \pi_\Lambda$ and $\partial L_\lambda^\mu / \partial \mu$ be 0, for any $\Lambda \in F$, we obtain
$\dfrac{\sum_{i=1}^{N} h_{i,\Lambda}}{\pi_\Lambda} - \lambda(\log \pi_\Lambda + 1) + \mu = 0, \qquad \sum_{\Lambda} \pi_\Lambda - 1 = 0,$
where $h_{i,\Lambda} = \dfrac{\pi_\Lambda P(y_i \mid s, g, \Lambda)}{\sum_{\Lambda' \in F} \pi_{\Lambda'} P(y_i \mid s, g, \Lambda')}$ is the posterior probability of examinee i belonging to the latent class Λ. The iterative formula of $\pi_\Lambda^{(t+1)}$ is proportional to $\max\{0, \sum_{i=1}^{N} h_{i,\Lambda}^{(t)} - \lambda \pi_\Lambda^{(t)} \log \pi_\Lambda^{(t)}\}$, where the superscript "(t)" indicates the values coming from the t-th iteration. Based on the iterative formula, a theorem is given to shrink the interval of λ.
Theorem 1.
For the DINA model with a fixed integer K, the penalty parameter λ of the penalized function Equation (9) should lie in the interval $\left(-\frac{N}{K \log 2},\ 0\right)$.
This theorem also indicates that λ and N have the same order, with $\frac{1}{K \log 2}$ as the rate. This paper focuses on $\lambda \in \{-0.05N, -0.1N, \ldots, -\frac{N}{K \log 2}\}$. Algorithm 1 shows the schedule of the EM for the DINA model within the feasible domain. In the implementation, the algorithms of Gu and Xu [15] introduce an additional pre-specified constant c to update the population parameter, $\pi_\Lambda^{(t+1)} \propto \max\{c, \sum_{i=1}^{N} h_{i,\Lambda}^{(t)} + \lambda\}$, where $c > 0$ is a small constant. Algorithm 1 does not rely on a pre-specified constant. In the method establishment and algorithm implementation, the algorithms of Gu and Xu [15] involve three parameters λ, $\rho_N$ and c, while Algorithm 1 only involves the parameter λ. We emphasize again that the parameter λ in the two methods has different scales. The calculation and proof are omitted here, and more details are deferred to Appendix B.
Algorithm 1: EM of Entropy Penalized Method within the Feasible Domain.
Input:
Observed data Y , initial values s ( 0 ) , g ( 0 ) , π ( 0 ) , penalty parameter λ , initial feasible set F ( 0 ) and maximum iterations T.
Output:
The estimators s ^ , g ^ , π ^ and the final feasible set F ^ .
[The step-by-step body of Algorithm 1 is presented as an image in the published article; see the sketch below for one possible realization.]
Output: s ^ = s ( t ) , g ^ = g ( t ) , π ^ = π ( t ) and F ^ = F ( t ) .
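The following sketch (Python, written by analogy with the update formulas of Section 3.1 and Appendix B; the variable names, starting values, and the simple renormalization of the truncated probabilities are our own illustrative choices, not the authors' code) shows one way the entropy-penalized EM within the feasible domain could be organized.

```python
import numpy as np

def entropy_penalized_em(Y, Q, lam, max_iter=500, tol=1e-6):
    """Sketch of Algorithm 1: entropy-penalized EM for the DINA model.

    Y   : (N, J) binary responses;  Q : (J, K) binary Q-matrix
    lam : penalty parameter, expected in (-N / (K * log 2), 0) by Theorem 1
    """
    N, J = Y.shape
    K = Q.shape[1]
    # F^(0): start from all 2^K latent classes so that no important class is missed
    profiles = np.array([[(c >> k) & 1 for k in range(K)] for c in range(2 ** K)])
    eta = np.all(profiles[:, None, :] >= Q[None, :, :], axis=2).astype(int)   # (C, J) ideal responses
    C = profiles.shape[0]
    s, g, pi = np.full(J, 0.2), np.full(J, 0.2), np.full(C, 1.0 / C)

    for _ in range(max_iter):
        # E-step: posterior h[i, c] that examinee i belongs to latent class c
        theta = np.clip(np.where(eta == 1, 1.0 - s, g), 1e-10, 1 - 1e-10)     # (C, J)
        log_lik = Y @ np.log(theta).T + (1 - Y) @ np.log(1 - theta).T         # (N, C)
        log_post = np.log(np.maximum(pi, 1e-300)) + log_lik
        log_post -= log_post.max(axis=1, keepdims=True)
        h = np.exp(log_post)
        h /= h.sum(axis=1, keepdims=True)

        # M-step for pi: truncated entropy-penalized update, cf. Equation (A6)
        plogp = np.zeros(C)
        plogp[pi > 0] = pi[pi > 0] * np.log(pi[pi > 0])
        numer = h.sum(axis=0) - lam * plogp
        pi_new = np.maximum(numer, 0.0)           # classes with a negative numerator are eliminated
        pi_new /= pi_new.sum()                    # renormalize so the probabilities sum to one

        # M-step for the item parameters, cf. Equation (A8)
        n1 = h @ eta                              # posterior mass on classes mastering item j
        n0 = h @ (1 - eta)                        # posterior mass on classes not mastering item j
        s_new = (n1 * (1 - Y)).sum(axis=0) / np.maximum(n1.sum(axis=0), 1e-12)
        g_new = (n0 * Y).sum(axis=0) / np.maximum(n0.sum(axis=0), 1e-12)

        converged = max(np.abs(s_new - s).max(), np.abs(g_new - g).max()) < tol
        s, g, pi = s_new, g_new, pi_new
        if converged:
            break

    support = np.flatnonzero(pi > 0)              # estimated F-hat
    return s, g, pi, profiles[support]
```

A practical implementation would additionally monitor the penalized log-likelihood and verify that the surviving support remains feasible in the sense of Definition 1, falling back to a λ closer to zero if it does not.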
Chen and Chen [29] proposed the extended Bayesian information criteria (EBIC) to conduct model selection from a large model space. The EBIC has the following form:
$\text{EBIC} = -2 L_\lambda + (2J + \|\hat{F}\| - 1)\log(N) + \log\binom{2^K}{\|\hat{F}\|},$
where $\|\hat{F}\|$ is the number of nonempty latent classes and $\binom{m}{n}$ indicates the binomial coefficient "m choose n". A smaller EBIC indicates a preferred model. In this paper, EBIC is used to select λ from the grid mentioned above.
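A companion sketch (again illustrative Python; the function name ebic and the reuse of the hypothetical entropy_penalized_em sketch above are our own assumptions) of how the EBIC in Equation (13) could be evaluated and used to pick λ from the grid implied by Theorem 1.

```python
import numpy as np
from scipy.special import gammaln, logsumexp

def ebic(Y, Q, s, g, pi, lam):
    """EBIC of Equation (13); pi is the estimated probability vector over all 2^K profiles."""
    N, J = Y.shape
    K = Q.shape[1]
    profiles = np.array([[(c >> k) & 1 for k in range(K)] for c in range(2 ** K)])
    eta = np.all(profiles[:, None, :] >= Q[None, :, :], axis=2)
    theta = np.clip(np.where(eta, 1.0 - s, g), 1e-10, 1 - 1e-10)              # (2^K, J)
    log_lik = Y @ np.log(theta).T + (1 - Y) @ np.log(1 - theta).T             # (N, 2^K)
    log_marg = logsumexp(log_lik + np.log(np.maximum(pi, 1e-300)), axis=1).sum()
    nz = pi[pi > 0]
    L_lam = log_marg - lam * (nz * np.log(nz)).sum()                          # log P + lambda * entropy
    m = len(nz)                                                               # ||F-hat||
    log_binom = gammaln(2 ** K + 1) - gammaln(m + 1) - gammaln(2 ** K - m + 1)
    return -2.0 * L_lam + (2 * J + m - 1) * np.log(N) + log_binom

# Model selection: fit the penalized EM for each lambda on a grid such as
# {-0.05 N, -0.10 N, ..., down to -N / (K log 2)} and keep the fit with the smallest EBIC.
```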
In the implementation of the EM algorithm, we assume that the initial $F^{(0)}$ includes all latent classes to avoid missing important latent classes. The gaps between successive item parameter estimates are used to check convergence. If $F^{(t+1)}$ is not feasible, it means that the penalty parameter λ is too small (i.e., too negative).

3.2. Modified EM

This section considers the case in which F is not necessarily feasible. In this case, some $g_j$ or $s_j$ will disappear, which means that the observed data cannot provide any information about them. These redundant item parameters are set to zero. We focus on the Lagrange function without the constraint,
$L_\lambda^\mu = L_\lambda(s, g, \pi) + \mu\Big(\sum_{\Lambda} \pi_\Lambda - 1\Big).$
Because the space of item parameters may collapse, the dimension of the item parameters needs to be recalculated. Meanwhile, the EBIC becomes
$\text{EBIC} = -2 L_\lambda + (\|s\| + \|g\| + \|\hat{F}\| - 1)\log(N) + \log\binom{2^K}{\|\hat{F}\|},$
where $\|s\|$ and $\|g\|$ indicate the numbers of nonzero slipping and guessing parameters, respectively. If all s and g exist, the sum of $\|s\|$ and $\|g\|$ is 2J, which recovers Equation (13).
The corresponding EM is shown in Algorithm 2, where the discussions of $F^{(0)}$ and convergence are similar. However, this algorithm does not distinguish between feasible and infeasible subsets. To the best of our knowledge, no existing algorithms handle the case without the feasible domain. More details are shown in Appendix B.
Algorithm 2: EM of Entropy Penalized Method without the Feasible Domain.
Input:
Observed data Y , initial values s ( 0 ) , g ( 0 ) , π ( 0 ) , penalty parameter λ , initial set F ( 0 ) and maximum iterations T.
Output:
The estimator s ^ , g ^ , π ^ and the final feasible set F ^ .
[The step-by-step body of Algorithm 2 is presented as an image in the published article.]
Output: s ^ = s ( t ) , g ^ = g ( t ) , π ^ = π ( t ) and F ^ = F ( t ) .
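Following the discussion in Appendix B, the fragment below (illustrative Python, with assumed inputs h, eta and Y as in the earlier sketch) isolates the M-step modification that distinguishes Algorithm 2 from Algorithm 1: when no remaining latent class carries posterior mass with $\eta_{j,\Lambda} = 1$ (respectively $\eta_{j,\Lambda} = 0$), the slipping (respectively guessing) parameter of item j is treated as redundant and fixed at zero.

```python
import numpy as np

def m_step_items_without_feasibility(h, eta, Y, eps=1e-12):
    """M-step for the item parameters when F need not be feasible (Algorithm 2 sketch).

    h : (N, C) posterior class memberships; eta : (C, J) ideal responses; Y : (N, J) responses.
    """
    n1 = h @ eta                       # posterior mass on classes that fully master item j
    n0 = h @ (1 - eta)                 # posterior mass on classes that do not
    denom_s, denom_g = n1.sum(axis=0), n0.sum(axis=0)
    # items whose denominators vanish have redundant s_j (or g_j); these are fixed at zero,
    # and the counts of nonzero parameters ||s||, ||g|| enter the EBIC used in Section 3.2
    s = np.where(denom_s > eps, (n1 * (1 - Y)).sum(axis=0) / np.maximum(denom_s, eps), 0.0)
    g = np.where(denom_g > eps, (n0 * Y).sum(axis=0) / np.maximum(denom_g, eps), 0.0)
    return s, g
```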

4. Simulation Studies

Three simulation studies are conducted: the first implements the standard EM accounting for all latent classes as a baseline for comparison, the second verifies the selection validity of EBIC, and the third evaluates the performance of the entropy-penalized method. Each simulation study serves a specific purpose and contributes to the overall assessment of the proposed approach. For all simulation studies, we set K = 5, J = 15, and the Q matrix has the following structure:
$Q^\top = \begin{pmatrix}
1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 \\
0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 & 0 \\
0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 0 & 1 & 0 & 0 & 0 & 1 & 1
\end{pmatrix}$
(shown as its transpose for compactness: rows correspond to the five attributes and columns to the fifteen items).
The Q matrix with two identity matrices satisfies the identifiable conditions [12,30,31]. All attributes are required by four items, and the design of the Q matrix is balanced.

4.1. Study I

In this study, the settings are N = 150 , 500 , 1000 , and s = g = 0.2 . The attribute profiles are generated from F , and each attribute profile of F has a probability of 1/7.
$F = \{(00000),\ (10000),\ (01000),\ (00100),\ (00010),\ (00001),\ (11111)\}.$
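For concreteness, the responses of Study I could be generated as in the following sketch (hypothetical Python, not taken from the paper; the seed and array names are illustrative): profiles are drawn uniformly from the seven classes in F, and responses follow the DINA model with $s_j = g_j = 0.2$.

```python
import numpy as np

rng = np.random.default_rng(0)
N, J, K = 500, 15, 5
s = np.full(J, 0.2)
g = np.full(J, 0.2)

# Q-matrix of Section 4: two identity blocks plus five two-attribute items
Q = np.vstack([np.eye(K, dtype=int),
               np.eye(K, dtype=int),
               np.array([[1, 1, 0, 0, 0], [0, 1, 1, 0, 0], [0, 0, 1, 1, 0],
                         [0, 0, 0, 1, 1], [1, 0, 0, 0, 1]])])

# True latent classes F of Study I, each drawn with probability 1/7
F = np.vstack([np.zeros((1, K), dtype=int), np.eye(K, dtype=int), np.ones((1, K), dtype=int)])
alpha = F[rng.integers(0, len(F), size=N)]                  # sampled attribute profiles

eta = np.all(alpha[:, None, :] >= Q[None, :, :], axis=2)    # (N, J) ideal responses
p_correct = np.where(eta, 1.0 - s, g)                       # DINA success probabilities
Y = rng.binomial(1, p_correct)                              # observed responses
```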
For the proposed EM, the penalty parameter is set to $\lambda = -0.075N$; the method to explore λ will be discussed in the next simulation study. For each attribute $\alpha_{ik}$, the classification accuracy is evaluated by the posterior marginal probability
$P(\alpha_{ik} = 1 \mid Y, \hat{s}, \hat{g}, \hat{\pi}) = \sum_{\{\alpha_i \mid \alpha_{ik} = 1\}} P(\alpha_i \mid Y, \hat{s}, \hat{g}, \hat{\pi}),$
where $P(\alpha_i \mid Y, \hat{s}, \hat{g}, \hat{\pi})$ is the posterior probability of examinee i having attribute profile $\alpha_i$. Given the posterior marginal probability, the logarithm of the posterior marginal likelihood, defined as
$\sum_{i=1}^{N} \sum_{k=1}^{K} \log\left[P(\alpha_{ik} = 1 \mid Y, \hat{s}, \hat{g}, \hat{\pi})^{\alpha_{ik}^{*}}\, P(\alpha_{ik} = 0 \mid Y, \hat{s}, \hat{g}, \hat{\pi})^{1 - \alpha_{ik}^{*}}\right],$
reflects the global classification accuracy, where $\alpha_{ik}^{*}$ denotes the true attribute used during data generation.
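Both quantities are straightforward to compute from a fitted model; the fragment below (illustrative Python, assuming a posterior matrix h over the latent classes in profiles, as produced by the EM sketch in Section 3) is one way to do so.

```python
import numpy as np

def posterior_attribute_marginals(h, profiles):
    """P(alpha_ik = 1 | Y, s, g, pi): sum class posteriors h[i, c] over classes possessing attribute k."""
    return h @ profiles                        # (N, C) @ (C, K) -> (N, K)

def log_posterior_marginal_likelihood(h, profiles, alpha_true, eps=1e-12):
    """Global classification accuracy measure used in Study I (larger is better)."""
    p1 = np.clip(posterior_attribute_marginals(h, profiles), eps, 1 - eps)
    return np.sum(alpha_true * np.log(p1) + (1 - alpha_true) * np.log(1 - p1))
```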
This study focuses on the estimators of π and the classification accuracy. The results for π are shown in Figure 1. We observe that for all settings of N, the performance of the standard EM is poor, as it fails to eliminate any irrelevant attribute profiles. When the sample size is N, the probability 1/N can be treated as a threshold to distinguish irrelevant attribute profiles, but this device only eliminates part of the irrelevant attribute profiles, and finding an appropriate threshold is challenging in practice. In contrast, regardless of the sample size, the proposed EM finds the true attribute profiles and sets the probability of all irrelevant attribute profiles to zero. The logarithms of the posterior marginal likelihood for the six settings $\{150, 500, 1000\} \times \{\text{proposed EM}, \text{standard EM}\}$ are −204.239, −453.163, −508.047, −514.681, −1046.273 and −1054.014, respectively. The likelihood of the proposed EM is consistently larger than that of the standard EM, indicating that the proposed EM enjoys higher classification accuracy. The results also show that the proposed method works well for the small sample size N = 150; especially in terms of the posterior marginal likelihood, the proposed method has obvious advantages over the standard EM. Note that the discussion regarding the estimation of item parameters is deferred to the third simulation study.

4.2. Study II

In this study, the solution paths of π versus λ are elaborated. The settings of N, s, g and F are the same as those of Study I. The penalty parameter $\lambda \in \{-0.2N, -0.195N, \ldots, -0.01N\}$. Due to the similarity of the results, only the results for sample size N = 500 are presented. Figure 2 shows the solution paths of π, the number of classes in F, and EBIC versus λ.
In Figure 2a, the colored lines indicate the true latent classes, while the black lines indicate the irrelevant latent classes. The dotted line represents the probability 1/7. Based on the figure, λ in the interval $[-0.155N, -0.07N]$ can efficiently estimate the true π and eliminate the empty latent classes. For λ close to zero, the recovery of π does not appear to worsen much when estimating the probabilities of irrelevant latent classes; however, the irrelevant latent classes cannot be shrunk strictly to zero. In Figure 2b, the left and right vertical axes show the number of classes in F and the EBIC, respectively. The dotted line represents the true number of classes in F. Figure 2b shows that when the correct number of classes in F is selected, the EBIC achieves its minimum. This study illustrates how to explore values of λ using EBIC.

4.3. Study III

In this study, the effects of sample size and item parameters are evaluated. We consider N = 500, 1000 and item parameters s = g = 0.1, 0.2, 0.3. The response data are generated with a more complex F, and each latent class of F has probability 1/12. In each setting, 200 independent data sets are generated.
$F = \{(00000),\ (10000),\ (01000),\ (00100),\ (00010),\ (00001),\ (11000),\ (01100),\ (00110),\ (00011),\ (10001),\ (11111)\}.$
Firstly, the information criterion EBIC is used to select an appropriate λ for each setting. The selection precision of the latent classes and the bias and RMSE of the item parameters are used to evaluate the proposed method; they are listed in Table 1. The notation "ST/AT" denotes the ratio between the selected true latent classes and all true latent classes, and "ST/AS" denotes the ratio between the selected true latent classes and all selected latent classes. The subscripts of Bias and RMSE indicate the type of item parameter. As the sample size N increases and s, g decrease, both the selection precision and the accuracy of the item parameter estimators improve. If N = 1000 and s = g = 0.1, the true F can be completely recovered. The RMSE and Bias of the guessing parameters are lower than those of the slipping parameters, which is due to the DINA model itself: the guessing parameter is estimated from all latent groups that do not fully master the required attributes of an item, whereas the slipping parameter is estimated only from the latent group with complete mastery for that specific $q_j$ vector.

5. Real Data Analysis

In this section, the fraction–subtraction data are analyzed; for more about the data, please refer to [7,32,33]. This data set contains the responses of N = 536 middle school students to J = 20 items, where the responses are coded as 0 or 1. The test measures K = 8 attributes, so there are $2^8 = 256$ possible latent classes. The Q-matrix and item contents are shown in Table 2.
Because the sample size N = 536 is not significantly larger than the number of possible latent classes $2^K = 256$, we cannot ensure that there are enough latent classes to guarantee that the true F is feasible. Algorithm 2 is therefore suggested for analyzing the real data. Firstly, EBIC is used to select the penalty parameter $\lambda \in \{-0.2N, -0.19N, \ldots, -0.01N\}$. The results around the optimal EBIC are shown in Table 3, which is based on a stable interval of EBIC. We observe that when $\lambda = -0.17N$, the EBIC achieves its minimum. Moving from $\lambda = -0.16N$ to $\lambda = -0.17N$, the number $\|\hat{F}\|$ changes from 76 to 20, and two guessing parameters disappear. Starting from $\lambda = -0.16N$, if λ slightly increases, the model becomes more complicated. Based on this fact, we discard the λs that are not less than $-0.16N$.
Next, the evaluation is based on Theorem 1 and the estimators $\hat{s}$, $\hat{g}$, $\hat{F}$. We note that $-0.19 < -\frac{1}{8 \log 2}$, so $\lambda = -0.19N$ does not satisfy Theorem 1, and the corresponding estimator eliminates many classes. Figure 3 presents the estimated $\hat{F}$ for different λ, and we can see that the results of Figure 3a are consistent with the conclusion of Theorem 1. The conclusion is that the λs no larger than $-0.19N$ are discarded. In addition, combining Figure 3a–c, we know that attribute 7 is the most basic attribute. Figure 4a,b display the estimators of the guessing and slipping parameters, respectively. According to the estimated $\hat{s}$, the results for $\lambda = -0.19N$ are strongly shifted on items 2, 3, 5, 9, and 16. For different λs, the behavior of the estimated $\hat{g}$ is rather complex, and significant differences are found on items 8, 9, and 13.
Up to now, the candidate penalties are $\lambda = -0.18N$ and $-0.17N$. The penalty $\lambda = -0.17N$ is supported by the EBIC criterion, and $\lambda = -0.18N$ prefers a simpler model. Furthermore, a denser grid within $[-0.18N, -0.16N]$ would give more detailed results.

6. Discussion

In this paper, we study the penalized method for the DINA model. There are two contributions. Firstly, the entropy penalized method is proposed for the DINA model. The feasible domain is defined to describe the relation between latent classes and the parameter space of item parameters. This framework allows for distinguishing irrelevant attribute profiles. Second, based on the definition of the feasible domain, two modified EM algorithms are developed. In practice, it is recommended to perform exploratory analyses using Algorithm 2 before proceeding further, which can provide valuable insights and guidance to understand the data structure.
While this paper focuses on the DINA model, a natural extension would be the application of the entropy-penalized method to other CDMs. Additionally, it is worth noting that this paper only involves situations with a maximum dimension of K = 8, which is relatively low. In high-dimensional cases of K, improving the power and performance of the entropy-penalized method is an interesting topic. A more challenging question is how the specification of irrelevant latent classes may affect the classification accuracy and the estimation of the model. These topics are left for future research.

Author Contributions

J.W. provided original idea and wrote the paper; Y.L. provided feedback on the manuscript and all authors contributed substantially to revisions. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the Humanities and Social Sciences Youth Foundation, Ministry of Education of China (No. 22YJC880073), the Guangdong Natural Science Foundation of China (2022A1515011899), and the National Natural Science Foundation of China (No. 11731015).

Data Availability Statement

The full data set used in the real data analysis is openly available as an example data set in the R package CDM on CRAN at https://CRAN.R-project.org/package=CDM (accessed on 25 August 2022).

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Proof of Lemma 1

Proof. 
Assume the response vector $y_i = (y_{i1}, \ldots, y_{iJ})^\top$. The vector $q_j$ cannot be $0_K$, because item j in the diagnostic test must measure some attribute. If $\alpha_i = 0_K$ is from F, then $\eta_{j,0_K} = \prod_{k=1}^{K} 0^{q_{jk}} = 0$. For $g_j$, $j = 1, \ldots, J$, we have
$P(y_{ij} \mid \alpha_i = 0_K) = g_j^{y_{ij}} (1 - g_j)^{1 - y_{ij}},$
which means that the $0_K$'s response function is determined by $g_j$. If $\alpha_i = 1_K$ is from F, then $\eta_{j,1_K} = \prod_{k=1}^{K} 1^{q_{jk}} = 1$. For $s_j$, $j = 1, \ldots, J$, we have
$P(y_{ij} \mid \alpha_i = 1_K) = s_j^{1 - y_{ij}} (1 - s_j)^{y_{ij}},$
which means that the $1_K$'s response function is determined by $s_j$. Hence, we only need $0_K$ and $1_K$ to conclude that all item parameters $g_j$ and $s_j$ are required. □

Appendix B. The Details of EM Algorithm

For Algorithm 1, the Lagrange function becomes
$L_\lambda^\mu = E[L_\lambda] + \mu\Big(\sum_{\Lambda \in F} \pi_\Lambda - 1\Big)$
$\quad = E\big[\log P(Y \mid s, g, \alpha) P(\alpha \mid \pi)\big] + \lambda E(\pi) + \mu\Big(\sum_{\Lambda \in F} \pi_\Lambda - 1\Big)$
$\quad = E\Big[\sum_{i=1}^{N} \sum_{\Lambda \in F} \mathbf{1}(\alpha_i = \Lambda)\Big(\sum_{j=1}^{J} \log P(y_{ij} \mid s_j, g_j, \alpha_i = \Lambda) + \log P(\alpha_i = \Lambda \mid \pi)\Big)\Big] + \lambda E(\pi) + \mu\Big(\sum_{\Lambda \in F} \pi_\Lambda - 1\Big)$
$\quad = \sum_{i=1}^{N} \sum_{\Lambda \in F} E\big[\mathbf{1}(\alpha_i = \Lambda)\big]\Big(\sum_{j=1}^{J} \log P(y_{ij} \mid s_j, g_j, \alpha_i = \Lambda) + \log P(\alpha_i = \Lambda \mid \pi)\Big) + \lambda E(\pi) + \mu\Big(\sum_{\Lambda \in F} \pi_\Lambda - 1\Big),$
where $E[\mathbf{1}(\alpha_i = \Lambda)]$ is defined as $h_{i,\Lambda}$. Given s, g and π, the expectation E is taken with respect to the distribution $P(\alpha \mid Y, s, g, \pi)$, and $h_{i,\Lambda}$ is nothing but the posterior probability of examinee i belonging to the latent class Λ. If $\pi_\Lambda > 0$, Equation (12) can be strictly reduced to
$\sum_{i=1}^{N} h_{i,\Lambda} - \lambda \pi_\Lambda(\log \pi_\Lambda + 1) + \mu \pi_\Lambda = 0, \qquad \sum_{\Lambda} \pi_\Lambda - 1 = 0.$
If $\pi_\Lambda = 0$, the term $\sum_{i=1}^{N} h_{i,\Lambda}$ is positive and close to zero, so $\sum_{i=1}^{N} h_{i,\Lambda} - \lambda \pi_\Lambda(\log \pi_\Lambda + 1) + \mu \pi_\Lambda \approx 0$. Equation (A4) can therefore be treated as an alternative to Equation (12). By taking the summation over all Λ, we obtain
$\sum_{i=1}^{N} \sum_{\Lambda} h_{i,\Lambda} - \lambda \sum_{\Lambda} \pi_\Lambda(\log \pi_\Lambda + 1) + \mu = 0 \;\Longrightarrow\; N + \lambda E(\pi) + \mu - \lambda = 0 \;\Longrightarrow\; N + \lambda E(\pi) = \lambda - \mu.$
According to Equations (A4) and (A5), the iterative formula is
$\pi_\Lambda = \begin{cases} 0, & \text{if } \sum_{i=1}^{N} h_{i,\Lambda} - \lambda \pi_\Lambda \log \pi_\Lambda < 0, \\[4pt] \dfrac{\sum_{i=1}^{N} h_{i,\Lambda} - \lambda \pi_\Lambda \log \pi_\Lambda}{N + \lambda E(\pi) + \Delta}, & \text{otherwise}, \end{cases}$
where $\Delta = \sum_{\Lambda} \big[\big(\sum_{i=1}^{N} h_{i,\Lambda} - \lambda \pi_\Lambda \log \pi_\Lambda\big) \cdot \mathbf{1}\big(\sum_{i=1}^{N} h_{i,\Lambda} - \lambda \pi_\Lambda \log \pi_\Lambda < 0\big)\big]$ is negative. It implies that $\pi_\Lambda$ is proportional to $\max\{0, \sum_{i=1}^{N} h_{i,\Lambda}^{(t)} - \lambda \pi_\Lambda^{(t)} \log \pi_\Lambda^{(t)}\}$. Equation (A6) can also be used to explain why λ should be negative. Assume that λ is non-negative. For any $\Lambda \in \{\Lambda \mid \pi_\Lambda^{(t)} \neq 0\}$, the posterior probability $h_{i,\Lambda}^{(t)}$ is positive, and the term $-\pi_\Lambda^{(t)} \log \pi_\Lambda^{(t)}$ with $0 < \pi_\Lambda^{(t)} < 1$ is positive. Due to the non-negative λ, we obtain that $\sum_{i=1}^{N} h_{i,\Lambda}^{(t)} - \lambda \pi_\Lambda^{(t)} \log \pi_\Lambda^{(t)}$ is positive for all $\Lambda \in \{\Lambda \mid \pi_\Lambda^{(t)} \neq 0\}$. This result means that, from iteration t to iteration t + 1, a non-negative λ cannot eliminate any attribute profiles. Hence, λ should be negative.
The derivatives of the item parameters, $\partial L_\lambda^\mu / \partial s_j$ and $\partial L_\lambda^\mu / \partial g_j$, are
$\frac{\partial L_\lambda^\mu}{\partial s_j} = \sum_{i=1}^{N} \sum_{\{\Lambda \mid \eta_{j,\Lambda} = 1\}} h_{i,\Lambda}\left(-\frac{y_{ij}}{1 - s_j} + \frac{1 - y_{ij}}{s_j}\right), \qquad \frac{\partial L_\lambda^\mu}{\partial g_j} = \sum_{i=1}^{N} \sum_{\{\Lambda \mid \eta_{j,\Lambda} = 0\}} h_{i,\Lambda}\left(\frac{y_{ij}}{g_j} - \frac{1 - y_{ij}}{1 - g_j}\right).$
Therefore, the solutions of item parameters are
$s_j = \dfrac{\sum_{i=1}^{N} \sum_{\Lambda} h_{i,\Lambda} \cdot \mathbf{1}(\eta_{j,\Lambda} = 1 \,\&\, y_{ij} = 0)}{\sum_{i=1}^{N} \sum_{\Lambda} h_{i,\Lambda} \cdot \mathbf{1}(\eta_{j,\Lambda} = 1)}, \qquad g_j = \dfrac{\sum_{i=1}^{N} \sum_{\Lambda} h_{i,\Lambda} \cdot \mathbf{1}(\eta_{j,\Lambda} = 0 \,\&\, y_{ij} = 1)}{\sum_{i=1}^{N} \sum_{\Lambda} h_{i,\Lambda} \cdot \mathbf{1}(\eta_{j,\Lambda} = 0)}.$
Equations (A6) and (A8) imply Algorithm 1.
For Algorithm 2, if some item parameters disappear, the derivatives $\partial L_\lambda^\mu / \partial s_j$ and $\partial L_\lambda^\mu / \partial g_j$ make no sense. This event is reflected in $h_{i,\Lambda}$ when $\sum_{i=1}^{N} \sum_{\Lambda} h_{i,\Lambda} \cdot \mathbf{1}(\eta_{j,\Lambda} = 1)$ or $\sum_{i=1}^{N} \sum_{\Lambda} h_{i,\Lambda} \cdot \mathbf{1}(\eta_{j,\Lambda} = 0)$ takes the value 0. In Algorithm 2, we should identify those items and set the corresponding item parameters to 0. This is the key difference between Algorithms 1 and 2.

Appendix C. Proof of Theorem 1

Proof of Theorem 1.
The denominator of Equation (A6) must be positive, so $N + \lambda E(\pi) + \Delta > 0$ for all $E(\pi)$. Due to $\Delta < 0$, $N + \lambda E(\pi)$ must be positive. Noting that the discrete Shannon entropy satisfies $E(\pi) \in (0, K \log 2]$, the conclusion $\lambda > \max_{E(\pi)} \{-N / E(\pi)\} = -\frac{N}{K \log 2}$ can be obtained. □

References

  1. Junker, B.W.; Sijtsma, K. Cognitive assessment models with few assumptions, and connections with nonparametric item response theory. Appl. Psychol. Meas. 2001, 25, 258–272. [Google Scholar] [CrossRef]
  2. Templin, J.L.; Henson, R.A. Measurement of psychological disorders using cognitive diagnosis models. Psychol. Methods 2006, 11, 287–305. [Google Scholar] [CrossRef]
  3. Maris, E. Estimating multiple classification latent class models. Psychometrika 1999, 64, 187–212. [Google Scholar] [CrossRef]
  4. Hartz, S.M. A Bayesian Framework for the Unified Model for Assessing Cognitive Abilities: Blending Theory with Practicality. Diss. Abstr. Int. B Sci. Eng. 2002, 63, 864. [Google Scholar]
  5. DiBello, L.V.; Stout, W.F.; Roussos, L.A. Unified cognitive/psychometric diagnostic assessment likelihood-based classification techniques. In Cognitively Diagnostic Assessment; Routledge: London, UK, 1995; pp. 361–389. [Google Scholar]
  6. Henson, R.A.; Templin, J.L.; Willse, J.T. Defining a family of cognitive diagnosis models using log-linear models with latent variables. Psychometrika 2009, 74, 191–210. [Google Scholar] [CrossRef]
  7. de la Torre, J. The generalized DINA model framework. Psychometrika 2011, 76, 179–199. [Google Scholar] [CrossRef]
  8. von Davier, M. A general diagnostic model applied to language testing data. ETS Res. Rep. Ser. 2005, 2005, i-35. [Google Scholar] [CrossRef]
  9. Liu, J.; Xu, G.; Ying, Z. Data-driven learning of Q-matrix. Appl. Psychol. Meas. 2012, 36, 548–564. [Google Scholar] [CrossRef]
  10. Chen, Y.; Liu, J.; Xu, G.; Ying, Z. Statistical analysis of Q-matrix based diagnostic classification models. J. Am. Stat. Assoc. 2015, 110, 850–866. [Google Scholar] [CrossRef]
  11. Xu, G.; Shang, Z. Identifying latent structures in restricted latent class models. J. Am. Stat. Assoc. 2018, 113, 1284–1295. [Google Scholar] [CrossRef]
  12. Gu, Y.; Xu, G. Identification and Estimation of Hierarchical Latent Attribute Models. arXiv 2019, arXiv:1906.07869. [Google Scholar]
  13. Gu, Y.; Xu, G. Partial identifiability of restricted latent class models. Ann. Stat. 2020, 48, 2082–2107. [Google Scholar] [CrossRef]
  14. Chen, J. Optimal rate of convergence for finite mixture models. Ann. Stat. 1995, 23, 221–233. [Google Scholar] [CrossRef]
  15. Gu, Y.; Xu, G. Learning Attribute Patterns in High-Dimensional Structured Latent Attribute Models. J. Mach. Learn. Res. 2019, 20, 1–58. [Google Scholar]
  16. Templin, J.; Bradshaw, L. Hierarchical diagnostic classification models: A family of models for estimating and testing attribute hierarchies. Psychometrika 2014, 79, 317–339. [Google Scholar] [CrossRef] [PubMed]
  17. Wang, C.; Lu, J. Learning attribute hierarchies from data: Two exploratory approaches. J. Educ. Behav. Stat. 2021, 46, 58–84. [Google Scholar] [CrossRef]
  18. Tibshirani, R. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B (Methodol.) 1996, 58, 267–288. [Google Scholar] [CrossRef]
  19. Zou, H. The adaptive lasso and its oracle properties. J. Am. Stat. Assoc. 2006, 101, 1418–1429. [Google Scholar] [CrossRef]
  20. Yuan, M.; Lin, Y. Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 2006, 68, 49–67. [Google Scholar] [CrossRef]
  21. Fan, J.; Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Am. Stat. Assoc. 2001, 96, 1348–1360. [Google Scholar] [CrossRef]
  22. Ma, J.; Wang, T. Entropy penalized automated model selection on Gaussian mixture. Int. J. Pattern Recognit. Artif. Intell. 2004, 18, 1501–1512. [Google Scholar] [CrossRef]
  23. Huang, T.; Peng, H.; Zhang, K. Model selection for Gaussian mixture models. Stat. Sin. 2017, 27, 147–169. [Google Scholar] [CrossRef]
  24. de la Torre, J. DINA model and parameter estimation: A didactic. J. Educ. Behav. Stat. 2009, 34, 115–130. [Google Scholar] [CrossRef]
  25. Culpepper, S.A. Bayesian estimation of the DINA model with Gibbs sampling. J. Educ. Behav. Stat. 2015, 40, 454–476. [Google Scholar] [CrossRef]
  26. Wang, J.; Shi, N.; Zhang, X.; Xu, G. Sequential Gibbs Sampling Algorithm for Cognitive Diagnosis Models with Many Attributes. Multivar. Behav. Res. 2022, 57, 840–858. [Google Scholar] [CrossRef]
  27. Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423. [Google Scholar] [CrossRef]
  28. Thomas, M.; Joy, A.T. Elements of Information Theory; Wiley-Interscience: Hoboken, NJ, USA, 2006. [Google Scholar]
  29. Chen, J.; Chen, Z. Extended Bayesian information criteria for model selection with large model spaces. Biometrika 2008, 95, 759–771. [Google Scholar] [CrossRef]
  30. Xu, G. Identifiability of restricted latent class models with binary responses. Ann. Stat. 2017, 45, 675–707. [Google Scholar] [CrossRef]
  31. Xu, G.; Zhang, S. Identifiability of diagnostic classification models. Psychometrika 2016, 81, 625–649. [Google Scholar] [CrossRef]
  32. Tatsuoka, C. Data analytic methods for latent partially ordered classification models. J. R. Stat. Soc. Ser. C (Appl. Stat.) 2002, 51, 337–350. [Google Scholar] [CrossRef]
  33. de la Torre, J.; Douglas, J.A. Higher-order latent trait models for cognitive diagnosis. Psychometrika 2004, 69, 333–353. [Google Scholar] [CrossRef]
Figure 1. The results of the estimated $\hat{\pi}$ for different sample sizes. The horizontal axis indexes the attribute profiles as 1 = (0,0,0,0,0), 2 = (1,0,0,0,0), 3 = (0,1,0,0,0), …, 7 = (1,1,0,0,0), …, 32 = (1,1,1,1,1); the same coding applies to other values of K. (a) The results of sample size N = 150. (b) The results of sample size N = 500. (c) The results of sample size N = 1000.
Figure 2. Solution paths of the estimated π , the number of F and EBIC versus λ . (a) Solution paths of π . (b) Number of F and EBIC.
Figure 3. The estimated $\hat{F}$ for different penalty parameters. (a) λ = −0.19N. (b) λ = −0.18N. (c) λ = −0.17N. (d) λ = −0.16N.
Figure 4. The estimators of the item parameters s and g as the penalty parameter λ varies in the set {−0.19N, −0.18N, …, −0.15N}. (a) Slipping parameters. (b) Guessing parameters.
Table 1. The selection precision of the latent classes and Bias, RMSE of item parameters. The results are based on 200 independent data.
s, g | N | ST/AT | ST/AS | Bias_s | RMSE_s | Bias_g | RMSE_g
0.1 | 500 | 1.0000 | 0.9950 | 0.0012 | 0.0315 | −0.0003 | 0.0173
0.1 | 1000 | 1.0000 | 1.0000 | 0.0001 | 0.0220 | −0.0001 | 0.0123
0.2 | 500 | 0.9671 | 0.9291 | 0.0028 | 0.0553 | −0.0012 | 0.0279
0.2 | 1000 | 0.9933 | 0.9601 | 0.0035 | 0.0381 | −0.0019 | 0.0197
0.3 | 500 | 0.7246 | 0.7040 | 0.0013 | 0.1138 | −0.0149 | 0.0671
0.3 | 1000 | 0.8483 | 0.8054 | 0.0021 | 0.0759 | −0.0045 | 0.0388
Table 2. The Q-matrix and contents of the fractions–subtraction data.
Item | α1 α2 α3 α4 α5 α6 α7 α8 | Item | α1 α2 α3 α4 α5 α6 α7 α8
5/3 − 3/4 | 0 0 0 1 0 1 1 0 | 4 1/3 − 2 4/3 | 0 1 0 0 1 0 1 0
3/4 − 3/8 | 0 0 0 1 0 0 1 0 | 11/8 − 1/8 | 0 0 0 0 0 0 1 1
5/6 − 1/9 | 0 0 0 1 0 0 1 0 | 3 3/8 − 2 6/5 | 0 1 0 1 1 0 1 0
3 1/2 − 2 3/2 | 0 1 1 0 1 0 1 0 | 3 4/5 − 3 2/5 | 0 1 0 0 0 0 1 0
4 3/5 − 3 4/10 | 0 1 0 1 0 0 1 1 | 2 − 1/3 | 1 0 0 0 0 0 1 0
6/7 − 4/7 | 0 0 0 0 0 0 1 0 | 4 5/7 − 1 4/7 | 0 1 0 0 0 0 1 0
3 − 2 1/5 | 1 1 0 0 0 0 1 0 | 7 3/5 − 2 4/5 | 0 1 0 0 1 0 1 0
2/3 − 2/3 | 0 0 0 0 0 0 1 0 | 4 1/10 − 2 8/10 | 0 1 0 0 1 1 1 0
3 7/8 − 2 | 0 1 0 0 0 0 0 0 | 4 − 1 4/3 | 1 1 1 0 1 0 1 0
4 4/12 − 2 7/12 | 0 1 0 0 1 0 1 1 | 4 1/3 − 1 5/3 | 0 1 1 0 1 0 1 0
Table 3. EBIC and exploratory results about | | F ^ | | , | | g ^ | | , | | s ^ | | .
λ | −0.19N | −0.18N | −0.17N | −0.16N | −0.15N
EBIC | 11,245.30 | 10,263.11 | 9807.29 | 10,342.25 | 10,173.47
||F̂|| | 7 | 16 | 20 | 76 | 80
||ĝ|| | 14 | 15 | 18 | 20 | 20
||ŝ|| | 20 | 20 | 20 | 20 | 20
