Deep Item Response Theory as a Novel Test Theory Based on Deep Learning

Item Response Theory (IRT) evaluates, on a common scale, examinees who take different tests. Doing so requires linking the examinees' ability scores estimated from the different tests. However, IRT linkage techniques assume that examinees' abilities are sampled independently at random from a standard normal distribution. Because of this assumption, linkage not only requires much labor to design, but also has no guarantee of optimality. To resolve this shortcoming, this study proposes a novel IRT based on deep learning, Deep-IRT, which requires no assumption that examinees' abilities are sampled randomly from a distribution. Experimental results demonstrate that Deep-IRT estimates examinees' abilities more accurately than traditional IRT does. Moreover, Deep-IRT can flexibly express actual examinees' ability distributions, rather than merely following the standard normal distribution assumed by traditional IRT. Furthermore, the results show that Deep-IRT predicts examinee responses to unknown items from the examinees' own past response histories more accurately than IRT does.


Introduction
As a rapidly growing area of e-assessment, e-testing involves the delivery of examinations and assessments on screen, using either local or web-based systems. In general, e-testing provides automatic assembly of uniform test forms, where each form comprises a different set of items but still has equivalent measurement accuracy [1][2][3][4][5][6][7][8][9][10]. Uniform test forms are assembled so that all forms have equivalent qualities for the equal evaluation of examinees who take different forms. Examinees' test scores should be guaranteed to be equivalent, even if different examinees with the same ability take different tests. However, because it is difficult to develop perfectly uniform test forms, a calibration process is fundamentally important when multiple test forms are used. To resolve this difficulty, IRT has been used as a calibration method. The literature reports that Item Response Theory (IRT) offers the following benefits [11,12]:
• IRT estimates examinees' abilities while minimizing the effects of heterogeneous or aberrant items that have low estimation accuracy.
• IRT produces examinee ability estimates on a single scale, even for results obtained from different tests.
• IRT predicts an individual examinee's correct response probability for an item from the examinee's past response history.
Evaluating the abilities of numerous examinees on a single scale requires linkage of the abilities estimated from different tests [12][13][14][15]. However, IRT linkage techniques assume random sampling of examinees' abilities from a standard normal distribution. Because of this assumption, IRT linkage has no theoretical guarantee of optimality. Nevertheless, it requires much labor to design [16][17][18][19]. In addition, there is no guarantee that examinees' abilities are actually sampled randomly from a standard normal distribution.
To resolve the difficulties of linkage, this study proposes a novel Item Response Theory based on deep learning, Deep-IRT, which does not assume random sampling of examinees' abilities from a statistical distribution. The proposed method represents the probability of an examinee answering an item correctly based on the examinee's ability parameter and the item's difficulty parameter. The main contributions of this study are presented below:
• Based on deep learning technology, a novel IRT is proposed. It requires no linkage procedures because it does not assume random sampling of examinees.
• Deep-IRT estimates examinees' abilities with high accuracy when the examinees are not sampled randomly from a single distribution or when there are no common items among the different tests.
• Deep-IRT can flexibly express actual examinees' ability distributions; it need not follow a standard normal distribution.
• The proposed method provides more reliable and robust ability estimation for actual data than IRT does.
In artificial intelligence research, deep learning methods that incorporate IRT have recently been developed for knowledge tracing [20][21][22][23]. Nevertheless, these methods have not achieved interpretable parameters for examinee ability and item difficulty because each examinee parameter depends on each item. Estimating interpretable parameters is the most important task in the field of test theory. To increase the interpretability of the parameters, the proposed method estimates parameters using two independent networks: an examinee network and an item network. Generally speaking, however, independent networks are known to have lower prediction accuracy than dependent networks. Recent studies of deep learning have demonstrated that redundancy of parameters (deep layers of hidden variables) reduces generalization error, contrary to Occam's razor [24][25][26][27]. Based on these state-of-the-art reports, the proposed method constructs two independent redundant deep networks: an examinee network and an item network. The present study uses the term "deep learning" in the sense of learning neural networks with deep layers of hidden variables. Therefore, the proposed method is expected to have highly interpretable parameters without impairing estimation accuracy.
Simulation experiments demonstrate that the proposed Deep-IRT estimates examinees' abilities more accurately than IRT does when examinees' abilities are not sampled randomly from a single distribution or when no common items exist among the different tests. Experiments conducted with actual data demonstrated that the proposed method provides more reliable and robust ability estimation than IRT does. Furthermore, Deep-IRT more accurately predicted examinee responses to unknown items from the examinee's past response histories than IRT does.

Related Works
For knowledge tracing [28][29][30][31][32][33][34], the task of tracking the knowledge states of different learners over time, several deep IRT methods developed in the domain of artificial intelligence combine IRT with deep learning [20][21][22][23][27]. Cheng and Liu [21] proposed a deep IRT based on Long Short-Term Memory (LSTM) [35] that estimates item discrimination and difficulty parameters by extracting item text information. Yeung [20] and Gan et al. [23] used the dynamic key-value memory network (DKVMN) [21], based on a Memory-Augmented Neural Network and attention mechanisms, to trace a learner's knowledge state. Ghosh et al. [22] used attention mechanisms that incorporate a forgetting function over the learner's past response data, together with a Rasch model [13,36] incorporating the learner's ability parameter and the item's difficulty parameter.
These deep knowledge tracing methods have not achieved interpretable parameters for learner ability and item difficulty, which are extremely important in the field of test theory. In addition, these earlier deep knowledge tracing methods estimate time-series changes in an examinee's abilities to capture the examinee's growth for knowledge tracing.
However, changes in examinees' abilities are not considered in the field of test theory because the purpose of testing is to estimate an examinee's current ability.
Consequently, earlier deep knowledge tracing methods [20][21][22][23][27] emphasized not a test theory but a knowledge tracing task. By contrast, this study proposes an IRT model based on deep learning as a novel test theory, which we designate as "Deep-IRT".

Item Response Theory
This section briefly introduces IRT and the two-parameter logistic model (2PLM), which is an extremely popular IRT model. For the 2PLM, $u_{ij}$ denotes the response of examinee $i$ to item $j$ $(j = 1, \ldots, n)$:

$$u_{ij} = \begin{cases} 1 & (\text{examinee } i \text{ answers item } j \text{ correctly}) \\ 0 & (\text{otherwise}) \end{cases}$$

In the 2PLM, the probability of a correct answer to item $j$ by examinee $i$ with ability parameter $\theta_i \in (-\infty, \infty)$ is assumed to be

$$P_j(\theta_i) = P(u_{ij} = 1 \mid \theta_i) = \frac{1}{1 + \exp\left(-a_j(\theta_i - b_j)\right)},$$

where $a_j \in (0, \infty)$ is the $j$-th item's discrimination parameter, expressing its discriminatory power for examinees' abilities, and $b_j \in (-\infty, \infty)$ is the $j$-th item's difficulty parameter, expressing its degree of difficulty. From Bayes' theorem, the posterior distribution of an ability parameter, $g(\theta \mid \boldsymbol{u})$, is given as

$$g(\theta \mid \boldsymbol{u}) = \frac{f(\theta) \prod_{j=1}^{n} P_j(\theta)^{u_j} \left(1 - P_j(\theta)\right)^{1 - u_j}}{h(\boldsymbol{u})},$$

where $h(\boldsymbol{u})$ is the marginal distribution

$$h(\boldsymbol{u}) = \int_{-\infty}^{\infty} f(\theta) \prod_{j=1}^{n} P_j(\theta)^{u_j} \left(1 - P_j(\theta)\right)^{1 - u_j} d\theta.$$

The ability parameter is estimated using the expected a posteriori (EAP) method, which is known to theoretically maximize prediction accuracy, as

$$\hat{\theta} = \int_{-\infty}^{\infty} \theta \, g(\theta \mid \boldsymbol{u}) \, d\theta.$$

Because calculating these parameters analytically is difficult, numerical methods such as Markov chain Monte Carlo (MCMC) are generally used.
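The 2PLM probability and the EAP estimate above can be sketched numerically as follows. This is a minimal illustration using grid quadrature in place of MCMC; the item parameters and responses are invented for the example.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PLM probability that an examinee with ability theta answers item (a, b) correctly."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def eap_ability(u, a, b, n_grid=201):
    """EAP estimate of ability from responses u to items with parameters a, b,
    using a standard normal prior and simple grid quadrature."""
    theta = np.linspace(-4.0, 4.0, n_grid)               # quadrature grid
    prior = np.exp(-0.5 * theta ** 2)                    # N(0, 1) up to a constant
    p = p_correct(theta[:, None], a[None, :], b[None, :])
    like = np.prod(np.where(u[None, :] == 1, p, 1.0 - p), axis=1)
    post = prior * like                                  # unnormalized posterior
    return float(np.sum(theta * post) / np.sum(post))    # posterior mean

# Illustrative items and one response pattern (correct, correct, incorrect)
a = np.array([1.0, 1.5, 0.8])
b = np.array([-0.5, 0.0, 0.5])
u = np.array([1, 1, 0])
theta_hat = eap_ability(u, a, b)
```

Because the examinee answered the two easier items correctly and missed the hardest, the posterior mean lands slightly above zero, shrunk toward the prior mean as EAP estimates always are.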
Here, the prior distribution $f(\theta)$ represents the examinees' ability distribution; the examinees' abilities are assumed to be sampled randomly from $f(\theta)$. Therefore, comparing the examinees' abilities estimated from different tests requires a linkage that places those abilities on the same scale using examinees or items common to the tests. Typical linkage methods include the following:
• Concurrent calibration: item parameters for different tests are estimated together using common items [37,50].
• Fixed common item parameters: the common item parameters are fixed, and only the pretest items are calibrated so that the pretest item parameter estimates are on the same scale as the common item parameters [51,52].
Even though linkage requires much labor to design, no linkage method can fully represent the joint probability distribution. Particularly when examinees are not sampled randomly from a certain statistical distribution, the linkage accuracy is greatly decreased [16][17][18][19]. In addition, examinees' abilities might not be sampled randomly from the standard normal distribution.

Deep-IRT
To resolve the difficulties described above, this study proposes a novel Item Response Theory based on deep learning: Deep-IRT. To increase the interpretability of the parameters, Deep-IRT estimates parameters using two independent networks: an examinee network and an item network. In general, however, independent networks are known to have lower prediction accuracy than dependent networks. Recent studies of deep learning have demonstrated that redundancy of parameters (deep layers of hidden variables) reduces generalization error, contrary to Occam's razor [24][25][26][27]. Based on these state-of-the-art reports, Deep-IRT constructs two independent redundant deep networks: an examinee network and an item network. Deep-IRT is thereby expected to have highly interpretable parameters without impairing estimation accuracy.

Method
This subsection explains the Deep-IRT method, which uses two independent neural networks: an Examinee Layer and an Item Layer. Using the outputs of both networks, the probability of an examinee answering an item correctly is calculated. Figure 1 presents a brief illustration.

To express the $i$-th examinee, the input encoding of the Examinee Layer is a one-hot vector $\boldsymbol{s}_i \in \{0, 1\}^I$, where $I$ represents the number of examinees; the $i$-th element is 1 and the other elements are 0. The Examinee Layer comprises three layers, as described below:

$$\boldsymbol{\theta}_1^{(i)} = \tanh\left(W^{(\theta_1)} \boldsymbol{s}_i + \boldsymbol{\tau}^{(\theta_1)}\right),$$
$$\boldsymbol{\theta}_2^{(i)} = \tanh\left(W^{(\theta_2)} \boldsymbol{\theta}_1^{(i)} + \boldsymbol{\tau}^{(\theta_2)}\right),$$
$$\theta_3^{(i)} = \boldsymbol{w}^{(\theta_3)\top} \boldsymbol{\theta}_2^{(i)} + \tau^{(\theta_3)}.$$

Here, the hyperbolic tangent $\tanh(\cdot)$ is used as the activation function. $W^{(\theta_1)} \in \mathbb{R}^{|\boldsymbol{\theta}_1| \times I}$ and $W^{(\theta_2)} \in \mathbb{R}^{|\boldsymbol{\theta}_2| \times |\boldsymbol{\theta}_1|}$ are weight matrices, $\boldsymbol{w}^{(\theta_3)} \in \mathbb{R}^{|\boldsymbol{\theta}_2|}$ is a weight vector, $\boldsymbol{\tau}^{(\theta_1)}$ and $\boldsymbol{\tau}^{(\theta_2)}$ are bias vectors, and $\tau^{(\theta_3)}$ is a bias parameter. In this study, we regard the last layer's output $\theta_3^{(i)}$ as the $i$-th examinee's ability parameter. An overview of the Examinee Layer calculation is presented in Figure 2. The weight matrices represent an estimate of the relation between an examinee's ability and all other examinees' abilities. Therefore, Deep-IRT requires no assumption that examinees' abilities are sampled randomly from a statistical distribution, because it estimates each examinee's ability by adjusting the other examinees' ability estimates.

Similarly, to express the $j$-th item, the input encoding of the Item Layer is a one-hot vector $\boldsymbol{q}_j \in \{0, 1\}^J$, where $J$ stands for the number of items; the $j$-th element is 1 and the other elements are 0. The Item Layer also consists of three layers:

$$\boldsymbol{\beta}_1^{(j)} = \tanh\left(W^{(\beta_1)} \boldsymbol{q}_j + \boldsymbol{\tau}^{(\beta_1)}\right),$$
$$\boldsymbol{\beta}_2^{(j)} = \tanh\left(W^{(\beta_2)} \boldsymbol{\beta}_1^{(j)} + \boldsymbol{\tau}^{(\beta_2)}\right),$$
$$\beta_3^{(j)} = \boldsymbol{w}^{(\beta_3)\top} \boldsymbol{\beta}_2^{(j)} + \tau^{(\beta_3)}.$$

Here, $W^{(\beta_1)} \in \mathbb{R}^{|\boldsymbol{\beta}_1| \times J}$ and $W^{(\beta_2)} \in \mathbb{R}^{|\boldsymbol{\beta}_2| \times |\boldsymbol{\beta}_1|}$ are weight matrices, $\boldsymbol{w}^{(\beta_3)} \in \mathbb{R}^{|\boldsymbol{\beta}_2|}$ is a weight vector, $\boldsymbol{\tau}^{(\beta_1)}$ and $\boldsymbol{\tau}^{(\beta_2)}$ are bias vectors, and $\tau^{(\beta_3)}$ is a bias parameter. We regard the last layer's output $\beta_3^{(j)}$ as the $j$-th item's difficulty parameter. As with the examinees, this method does not assume random sampling of item difficulty parameters from a statistical distribution.
Then, Deep-IRT represents an examinee's correct response probability for an item using the difference between the examinee's ability parameter and the item difficulty parameter. Specifically, examinee $i$'s correct response probability for item $j$ is described using a hidden layer as

$$y_1^{(ij)} = \tanh\left(w_1^{(y)} \left(\theta_3^{(i)} - \beta_3^{(j)}\right) + \tau_1^{(y)}\right),$$
$$\hat{y}_{ij} = \mathrm{sigmoid}\left(w_2^{(y)} y_1^{(ij)} + \tau_2^{(y)}\right).$$

Here, $\boldsymbol{w}^{(y)} = (w_1^{(y)}, w_2^{(y)})$ and $\boldsymbol{\tau}^{(y)} = (\tau_1^{(y)}, \tau_2^{(y)})$ are the weight and bias parameter vectors.
Deep-IRT does not assume random sampling of examinees' abilities and item difficulties from any statistical distribution. Instead, it uses a deep learning method to estimate the relation between an examinees' ability and all other examinees' abilities by maximizing the prediction accuracy of examinees' responses. The unique feature of this method is to es-timate an examinee's ability by adjusting the other examinees' ability estimates. Because of this property, this method requires no linkage procedure.
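The forward pass of the two networks can be sketched as follows. This is an illustrative reconstruction, not the authors' Chainer implementation: the hidden width `H`, the random placeholder weights, and the fixed output-layer weights are all assumptions, and no training is performed here.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J = 100, 50            # numbers of examinees and items (illustrative sizes)
H = 32                    # hidden-layer width (hypothetical choice)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Examinee network: one-hot examinee vector -> scalar ability theta_3
W_t1, tau_t1 = rng.normal(0, 0.1, (H, I)), np.zeros(H)
W_t2, tau_t2 = rng.normal(0, 0.1, (H, H)), np.zeros(H)
w_t3, tau_t3 = rng.normal(0, 0.1, H), 0.0

# Item network: one-hot item vector -> scalar difficulty beta_3
W_b1, tau_b1 = rng.normal(0, 0.1, (H, J)), np.zeros(H)
W_b2, tau_b2 = rng.normal(0, 0.1, (H, H)), np.zeros(H)
w_b3, tau_b3 = rng.normal(0, 0.1, H), 0.0

def ability(i):
    s = np.zeros(I); s[i] = 1.0                 # one-hot examinee encoding
    h1 = np.tanh(W_t1 @ s + tau_t1)
    h2 = np.tanh(W_t2 @ h1 + tau_t2)
    return w_t3 @ h2 + tau_t3                   # theta_3: ability parameter

def difficulty(j):
    q = np.zeros(J); q[j] = 1.0                 # one-hot item encoding
    h1 = np.tanh(W_b1 @ q + tau_b1)
    h2 = np.tanh(W_b2 @ h1 + tau_b2)
    return w_b3 @ h2 + tau_b3                   # beta_3: difficulty parameter

# Output layer: correct-response probability from the difference theta - beta
w_y1, tau_y1, w_y2, tau_y2 = 1.0, 0.0, 1.0, 0.0   # placeholder output weights
def predict(i, j):
    h = np.tanh(w_y1 * (ability(i) - difficulty(j)) + tau_y1)
    return sigmoid(w_y2 * h + tau_y2)

p = predict(0, 0)          # a probability strictly inside (0, 1)
```

Note that the two networks never share parameters: the prediction couples them only through the scalar difference between ability and difficulty, which is what keeps both parameters interpretable.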

Learning Parameters
In general, deep learning methods learn their parameters using the back-propagation algorithm by minimizing a loss function. The loss function of the proposed Deep-IRT employs cross-entropy, which reflects classification errors. It is calculated from the predicted responses $\hat{y}_{ij}$ and the true responses $u_{ij}$ as

$$E = -\sum_{i=1}^{I} \sum_{j=1}^{J} \left[ u_{ij} \log \hat{y}_{ij} + (1 - u_{ij}) \log \left(1 - \hat{y}_{ij}\right) \right].$$

Like other machine learning techniques, deep learning methods are biased toward the data they have encountered; their generalization capacity depends on the training data, which leads to sub-optimal performance. Consequently, Deep-IRT cannot accurately predict the responses of examinees or items with an extremely small number of correct (or incorrect) answers. To overcome this shortcoming, cost-sensitive learning, which weights minority data over majority data, has been used widely [53]. Therefore, we augment the loss function with cost-sensitive terms as

$$E' = E + \gamma_1 E_{L_e} + \gamma_2 E_{H_e} + \gamma_3 E_{L_i} + \gamma_4 E_{H_i},$$

where $E_{L_e}$ denotes the cross-entropy restricted to $L_e$, the group of examinees whose correct answer rates are less than $\alpha_{L_e}$; $E_{H_e}$ to $H_e$, the group of examinees whose correct answer rates are more than $\alpha_{H_e}$; $E_{L_i}$ to $L_i$, the group of items whose correct answer rates are less than $\alpha_{L_i}$; and $E_{H_i}$ to $H_i$, the group of items whose correct answer rates are more than $\alpha_{H_i}$. Here, $\gamma_1, \gamma_2, \gamma_3, \gamma_4$ and $\alpha_{L_e}, \alpha_{H_e}, \alpha_{L_i}, \alpha_{H_i}$ are tuning parameters. All of the parameters are learned simultaneously using a popular optimization algorithm: adaptive moment estimation (Adam) [54].
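A sketch of this loss, assuming the cost-sensitive terms simply re-weight the cross-entropy of the thresholded examinee and item groups; the γ and α values shown are hypothetical, not the paper's tuned settings.

```python
import numpy as np

def cross_entropy(u, y_hat, eps=1e-12):
    """Binary cross-entropy summed over all given responses."""
    y = np.clip(y_hat, eps, 1.0 - eps)
    return -np.sum(u * np.log(y) + (1.0 - u) * np.log(1.0 - y))

def cost_sensitive_loss(U, Y, gammas=(0.1, 0.1, 0.1, 0.1),
                        alphas=(0.2, 0.8, 0.2, 0.8)):
    """Base cross-entropy plus extra weight on examinees/items whose
    correct-answer rates fall outside the alpha thresholds."""
    g1, g2, g3, g4 = gammas
    aLe, aHe, aLi, aHi = alphas
    loss = cross_entropy(U, Y)
    exam_rate = U.mean(axis=1)                      # correct-answer rate per examinee
    item_rate = U.mean(axis=0)                      # correct-answer rate per item
    Le, He = exam_rate < aLe, exam_rate > aHe
    Li, Hi = item_rate < aLi, item_rate > aHi
    loss += g1 * cross_entropy(U[Le], Y[Le])        # low-scoring examinees
    loss += g2 * cross_entropy(U[He], Y[He])        # high-scoring examinees
    loss += g3 * cross_entropy(U[:, Li], Y[:, Li])  # low-correct-rate (hard) items
    loss += g4 * cross_entropy(U[:, Hi], Y[:, Hi])  # high-correct-rate (easy) items
    return loss

# Tiny example: 2 examinees x 3 items, uninformative 0.5 predictions
U = np.array([[1, 0, 1], [0, 0, 1]], dtype=float)
Y = np.full_like(U, 0.5)
base = cross_entropy(U, Y)          # 6 responses at probability 0.5
total = cost_sensitive_loss(U, Y)   # base plus penalties for extreme items
```

In this toy matrix, item 2 is answered by nobody and item 3 by everybody, so both fall outside the α thresholds and their cross-entropy is counted again with weight γ, making the total strictly larger than the base loss.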

Simulation Experiments
This section presents an evaluation of the performance of Deep-IRT using simulation data, following earlier IRT studies of linkage and multiple populations [55,56].

Experiment Settings
We implemented Deep-IRT using Chainer (https://chainer.org/ (accessed on 23 April 2021)), a popular framework for neural networks. The values of tuning parameters are presented in Table 1.

Estimation Accuracy
For this experiment, we evaluate root mean square error (RMSE), Pearson's correlation coefficient, and the Kendall rank correlation coefficient between the estimated abilities and the true values. For calculation of RMSE, the estimated abilities of Deep-IRT are standardized.
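The three accuracy measures can be computed as follows. This is a simple sketch with invented ability values; Kendall's tau is implemented naively (O(n²)) for clarity, and the Deep-IRT estimates are standardized before the RMSE, as in the experiment.

```python
import numpy as np

def rmse_standardized(est, true):
    """RMSE between standardized estimates and true abilities; standardization
    is needed because Deep-IRT estimates are not on a fixed scale."""
    z = (est - est.mean()) / est.std()
    return np.sqrt(np.mean((z - true) ** 2))

def pearson(x, y):
    """Pearson's correlation coefficient."""
    return np.corrcoef(x, y)[0, 1]

def kendall_tau(x, y):
    """Kendall rank correlation (tau-a), naive pairwise implementation."""
    n = len(x)
    s = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            s += np.sign(x[i] - x[j]) * np.sign(y[i] - y[j])
    return 2.0 * s / (n * (n - 1))

# Illustrative true abilities and estimates
true = np.array([-1.0, -0.2, 0.3, 1.1])
est = np.array([-0.8, -0.1, 0.4, 0.9])
```

For production use, `scipy.stats.kendalltau` and `scipy.stats.pearsonr` provide the same statistics with tie handling and significance tests.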

Estimation Accuracy for Randomly Sampled Examinee Data
To underscore the effectiveness of Deep-IRT for data in which examinees' abilities are not randomly sampled, this subsection evaluates the estimation accuracy while changing the assignment of examinees to different tests. The experimental procedures are explained hereinafter.
This experiment generates 10 test datasets that have no common examinees. In addition, the k-th test (k = 1, . . . , 10) has common items only with the (k − 1)-th and (k + 1)-th tests.
The true parameters were generated randomly. The simulation data were then generated based on the 2PLM in the following two ways: in the first, examinees are assigned randomly to each test from Equation (17); in the other, examinees are assigned systematically to each test as described below.

2. The examinees are sorted in ascending order of their abilities and divided equally into 10 groups in that order.
3. The k-th examinee group is assigned to the k-th test.

Table 2 presents the average estimation accuracies for each condition. Results of the random assignment condition show that IRT outperforms Deep-IRT. The reason is that this condition is ideal for IRT because the data are generated randomly from the IRT model. However, for small numbers of examinees or items, the differences between IRT and Deep-IRT become smaller.
In contrast, the results obtained for the systematic assignment condition show that Deep-IRT, which does not assume randomly sampled examinees, outperforms IRT, which does. Furthermore, Deep-IRT suppresses the decline in accuracy in cases without common items among different tests. These results are expected to be beneficial when applying Deep-IRT to actual data.
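The two assignment schemes can be sketched as follows. The population N(0, 1) and group sizes are illustrative; the point is that systematic assignment gives each test a narrow ability band, violating IRT's random-sampling assumption.

```python
import numpy as np

rng = np.random.default_rng(1)
n_tests = 10
n_per_test = 100
# Hypothetical true abilities for all examinees
abilities = rng.normal(0.0, 1.0, n_tests * n_per_test)

# Random assignment: shuffle examinees across the 10 tests
random_groups = rng.permutation(len(abilities)).reshape(n_tests, n_per_test)

# Systematic assignment: sort by ability, split into 10 equal groups in order,
# and assign the k-th group to the k-th test
order = np.argsort(abilities)
systematic_groups = order.reshape(n_tests, n_per_test)

# Within-test ability spread: large for random, small for systematic assignment
spread_random = abilities[random_groups].std(axis=1).mean()
spread_system = abilities[systematic_groups].std(axis=1).mean()
```

Under systematic assignment, each test's examinees come from a narrow slice of the ability distribution, so no test's sample resembles a standard normal, which is exactly the condition under which the experiment shows Deep-IRT's advantage.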

Estimation Accuracy for Multi-Population Data
As described earlier, IRT assumes that examinees' abilities follow a standard normal distribution, and it is known that no optimal linkage exists under this assumption [17]. Additionally, there is no guarantee that examinees' abilities actually follow a standard normal distribution. When the assumption is violated, the ability estimation accuracy of IRT degrades considerably, even apart from the linkage problem. However, because Deep-IRT does not assume random sampling from a statistical distribution, it is expected to provide robust ability estimation even when the IRT presumption is violated. To demonstrate these benefits, this subsection evaluates the estimation accuracies of IRT and Deep-IRT when examinees' abilities follow multiple populations.
For this experiment, the abilities of examinees taking different tests are assumed to be sampled from different populations. We assume two tests of 50 items each; the abilities for the two tests are sampled randomly from N₁(µ₁, σ²) and N₂(µ₂, σ²). Table 3 shows the average estimation accuracies for different ability distributions and numbers of common items. The standard deviation of each distribution was chosen so that the overall standard deviation of abilities is close to 1.0. Here, Wilcoxon's signed rank test is applied to infer whether the accuracies of IRT and Deep-IRT differ significantly. The results show that when the difference between µ₁ and µ₂ is small, IRT provides significantly higher accuracy because the distribution approaches a single normal distribution. By contrast, as the difference between µ₁ and µ₂ becomes large, Deep-IRT estimates examinees' abilities more accurately. Therefore, Deep-IRT is robust for estimating examinees' abilities when they follow different distributions. The results also show that, when there are no common items, Deep-IRT estimates the examinees' abilities more accurately than IRT does; consequently, Deep-IRT can estimate examinees' abilities accurately without common items.

Next, we demonstrate that Deep-IRT can accommodate abilities from multiple populations. Specifically, we generate abilities according to the multiple populations N₁(−0.7, 0.3) and N₂(0.7, 0.3) in Table 3. Figure 3 shows histograms of the true abilities, the abilities estimated by IRT, and the abilities estimated by Deep-IRT. Deep-IRT clearly estimates a bimodal ability distribution similar to the true distribution. This result demonstrates that Deep-IRT flexibly expresses actual examinees' ability distributions that do not follow a standard normal distribution.
Next, we evaluate the estimated ability distributions of IRT and Deep-IRT using a fitting score with respect to the true distribution:

$$\mathrm{Fit} = \sum_{k \in \{1,2\}} \sum_{i=1}^{I_k} \log p\left(\hat{\theta}_{ki} \mid \mu_k, \sigma\right),$$

where $I_k$ represents the number of examinees who took the $k$-th test and $\hat{\theta}_{ki}$ is the estimated ability of the $i$-th examinee on the $k$-th test. In addition, $p(\hat{\theta}_{ki} \mid \mu_k, \sigma)$ is the likelihood of the estimated ability given the true ability distribution, i.e., the normal density

$$p\left(\hat{\theta}_{ki} \mid \mu_k, \sigma\right) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(\hat{\theta}_{ki} - \mu_k)^2}{2\sigma^2}\right).$$

If a method fits the true distribution well, its estimated distribution approaches the true distribution and the score is high. The fitting score of IRT is −1633.4, whereas that of Deep-IRT is −1437.1. Therefore, Deep-IRT expresses the examinees' ability distributions more accurately than IRT does.
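A sketch of this fitting score under the stated normal populations. The "good" and "bad" estimate sets below are synthetic stand-ins (samples near the true bimodal populations versus a unimodal standard normal), not the actual Deep-IRT and IRT estimates.

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    """Log of the normal density N(mu, sigma^2) evaluated at x."""
    return -0.5 * np.log(2 * np.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

def fitting_score(est_by_test, mus, sigma):
    """Sum of log-likelihoods of estimated abilities under the true per-test
    normal populations; a higher (less negative) score means a closer fit."""
    return sum(np.sum(log_normal_pdf(est, mu, sigma))
               for est, mu in zip(est_by_test, mus))

rng = np.random.default_rng(2)
mus, sigma = (-0.7, 0.7), 0.3           # true populations from the experiment

# Stand-in estimates: one set near the true bimodal populations, one unimodal
good = [rng.normal(mu, sigma, 200) for mu in mus]
bad = [rng.normal(0.0, 1.0, 200) for _ in mus]
s_good = fitting_score(good, mus, sigma)
s_bad = fitting_score(bad, mus, sigma)
```

Estimates that recover the bimodal structure score far higher than unimodal estimates centered at zero, mirroring the reported gap between Deep-IRT (−1437.1) and IRT (−1633.4).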

Actual Data Experiments
The simulation experiments suggested that Deep-IRT might estimate examinees' abilities with high accuracy for actual data. This section evaluates the effectiveness of Deep-IRT using actual datasets.

Actual Datasets
For this experiment, we use the following actual datasets. For each, we report "Rate.Sparse", the average rate of items that an examinee did not address during the learning process.

1. Information datasets consist of two test datasets (Information 1 and 2) related to information technology. Information 1 has 169 examinees and 50 items; Information 2 has 266 examinees and 50 items. The tests were conducted on the learning management system "Samurai" developed by [57][58][59].

Reliability of Ability Estimation
This subsection evaluates the reliability of the ability estimation of Deep-IRT. Because the true parameter values are unknown, we evaluate reliability as follows: (1) each dataset is divided equally into two sets of data; (2) the parameters of each method are estimated from each divided set; (3) the RMSE and correlations between the two sets of estimated parameters are calculated; (4) these procedures are repeated 10 times, and the averages of the RMSEs and correlations are calculated. Table 4 presents the results. Here, a Wilcoxon signed rank test is applied to infer whether the reliabilities of IRT and Deep-IRT differ significantly. Table 4 shows that Deep-IRT provides more reliable ability estimates than IRT does. In particular, for the average Kendall rank correlation coefficient, which is known to be robust to aberrant values, Deep-IRT outperforms IRT significantly. These results indicate that Deep-IRT can estimate parameters more reliably than IRT for actual test data. Surprisingly, Deep-IRT outperforms IRT even for small datasets such as Program 1, Program 2, Statistics, Information Ethics, and Engineer Ethics, indicating that Deep-IRT is effective even for small datasets. For Practice_Math, Practice_Physics, and ASSISTMENTS, IRT has a higher Kendall rank correlation coefficient than Deep-IRT because the ability estimation of IRT tends to become stable as the dataset grows; IRT has that stability because it is guaranteed to converge asymptotically to the true joint probability distribution.
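The split-half reliability procedure can be sketched generically as follows. A simple per-item correct-rate statistic stands in for the actual IRT or Deep-IRT parameter estimation, and the synthetic responses come from a Rasch-like model; both are illustrative assumptions.

```python
import numpy as np

def split_half_reliability(U, estimate, n_rep=10, seed=0):
    """Repeatedly split the examinees into two halves, estimate the same
    parameters from each half, and compare the two sets of estimates."""
    rng = np.random.default_rng(seed)
    rmses, corrs = [], []
    n = U.shape[0]
    for _ in range(n_rep):
        perm = rng.permutation(n)
        p1 = estimate(U[perm[: n // 2]])      # parameters from first half
        p2 = estimate(U[perm[n // 2 :]])      # parameters from second half
        rmses.append(np.sqrt(np.mean((p1 - p2) ** 2)))
        corrs.append(np.corrcoef(p1, p2)[0, 1])
    return float(np.mean(rmses)), float(np.mean(corrs))

# Stand-in estimator: item easiness as the per-item correct-answer rate
easiness = lambda U_half: U_half.mean(axis=0)

# Synthetic responses from a Rasch-like model, for demonstration only
rng = np.random.default_rng(3)
theta = rng.normal(0.0, 1.0, 400)            # examinee abilities
b = rng.normal(0.0, 1.0, 30)                 # item difficulties
P = 1.0 / (1.0 + np.exp(-(theta[:, None] - b[None, :])))
U = (rng.random(P.shape) < P).astype(float)

rmse, corr = split_half_reliability(U, easiness)
```

A reliable estimator yields a small RMSE and a high correlation between the two halves; substituting the actual fitting routines for `easiness` reproduces the paper's evaluation design.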

Prediction of Responses to Unknown Items
In the field of artificial intelligence in education, predicting an examinee's responses to unknown items from the examinee's past response history is important for adaptive learning systems [20,30,32,61,62]. Reportedly, IRT achieves the highest prediction accuracy for this problem [63]. This subsection compares the prediction accuracy of Deep-IRT with that of IRT. Specifically, using ten-fold cross-validation, the parameters are learned from training data and used to predict the responses in the remaining data; we then calculate the accuracy rates across the cross-validation experiments. Here, a Wilcoxon signed rank test is applied to infer whether the respective accuracies of IRT and Deep-IRT differ significantly. Table 5 shows the results: the average F1 value of Deep-IRT is significantly higher than that of IRT, so Deep-IRT can predict examinees' responses to unknown items more accurately than IRT can. It is noteworthy that Deep-IRT does not always outperform IRT on large data. For ASSISTMENTS and Critical Thinking, IRT performs better than Deep-IRT because those datasets have high Rate.Sparse values; Deep-IRT might be weak at dealing with sparse datasets. In contrast, for datasets with low Rate.Sparse values, Deep-IRT outperforms IRT even for small datasets. Generally speaking, IRT prediction accuracy increases with the number of examinees; therefore, IRT achieves high prediction accuracies for Practice_Math and Practice_Physics. Furthermore, Figure 4 depicts histograms of abilities estimated from Practice_Math, where the prediction accuracy of IRT is higher than that of Deep-IRT, and Figure 5 depicts histograms of abilities estimated from the Classi_Biology data, where the prediction accuracy of Deep-IRT is higher. Figure 4 shows that both methods estimate an ability distribution similar to the standard normal distribution.
In contrast, Figure 5 shows that Deep-IRT expresses a multimodal distribution, whereas IRT estimates a unimodal distribution. Deep-IRT can predict responses to unknown items well because it can flexibly express various ability distributions.
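The ten-fold cross-validation protocol can be sketched as follows. A trivial base-rate predictor stands in for the fitted IRT or Deep-IRT model, and the synthetic response triples are invented; both are illustrative assumptions.

```python
import numpy as np

def cv_accuracy(responses, fit, predict, k=10, seed=0):
    """k-fold cross-validation over (examinee, item, response) triples:
    fit on k-1 folds, predict the held-out responses, average the accuracy."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(responses))
    folds = np.array_split(idx, k)
    accs = []
    for fold in folds:
        held = {int(t) for t in fold}
        train = [responses[t] for t in idx if int(t) not in held]
        model = fit(train)
        correct = 0
        for t in fold:
            i, j, u = responses[int(t)]
            # Threshold the predicted probability at 0.5
            correct += (predict(model, i, j) >= 0.5) == (u == 1)
        accs.append(correct / len(fold))
    return float(np.mean(accs))

# Stand-in model: always predict the training set's overall correct rate
fit = lambda train: float(np.mean([u for (_, _, u) in train]))
predict = lambda model, i, j: model

# Synthetic triples: examinee i answers item j correctly iff (i + j) % 3 > 0
responses = [(i, j, int((i + j) % 3 > 0)) for i in range(20) for j in range(10)]
acc = cv_accuracy(responses, fit, predict, k=10)
```

Replacing `fit` and `predict` with the actual IRT or Deep-IRT training and scoring routines reproduces the comparison protocol of Table 5; F1 can be accumulated over the held-out folds in the same loop.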

Conclusions
This study examined a novel test theory based on deep learning: Deep-IRT. To increase the interpretability of the parameters, Deep-IRT estimates parameters using two independent networks: an examinee network and an item network. Generally speaking, independent networks are known to have lower prediction accuracy than dependent networks, but recent studies of deep learning have indicated that redundancy of parameters reduces generalization error, contrary to Occam's razor [24][25][26][27]. Based on these state-of-the-art reports, Deep-IRT was constructed with two independent redundant deep networks. Therefore, Deep-IRT has highly interpretable parameters without impairing estimation accuracy. The main contributions of Deep-IRT are presented below: (1) Deep-IRT does not assume random sampling of examinees' abilities from a statistical distribution because the weight matrix of the ability parameters estimates the relation between an examinee's ability and all other examinees' abilities. (2) Deep-IRT estimates examinees' abilities with high accuracy when the examinees are not sampled randomly from a single distribution or when no common items exist among the different tests. Experiments conducted using actual data demonstrated that Deep-IRT provided more reliable and robust ability estimation than IRT did. Furthermore, Deep-IRT more accurately predicted examinee responses to unknown items from the examinees' past response histories than IRT did. The results showed that Deep-IRT is effective even for small datasets. However, they also suggest that Deep-IRT might be weak at dealing with sparse data. To estimate an examinee's ability robustly from sparse data, the estimation methods must be improved; one potential means of doing so is optimizing the number of hidden layers of each neural network.
Furthermore, as another subject of future work, we expect to incorporate Deep-IRT into computerized adaptive testing (CAT) [64,65] to improve ability estimation accuracy in actual environments.

Conflicts of Interest:
The authors declare that they have no conflict of interest. The funders had no role in the design of the study, in the collection, analyses, or interpretation of data, in the writing of the manuscript, or in the decision to publish the results.