From Knowledge Transmission to Knowledge Construction: A Step towards Human-Like Active Learning

Machines usually employ a guess-and-check strategy to analyze data: they take the data, make a guess, check the answer, adjust it with regard to the correct one if necessary, and try again on a new data set. An active learning environment guarantees better performance while training on less, but carefully chosen, data, which reduces the costs of both annotating and analyzing large data sets. This issue becomes even more critical for deep learning applications. Human-like active learning integrates a variety of strategies and instructional models chosen by a teacher to contribute to learners' knowledge, while machine active learning strategies lack versatile tools for shifting the focus of instruction away from knowledge transmission to learners' knowledge construction. We approach this gap by considering an active learning environment in an educational setting. We propose a new strategy that measures the information capacity of data using the information function from the four-parameter logistic item response theory (4PL IRT). We compared the proposed strategy with the most common active learning strategies: Least Confidence and Entropy Sampling. The results of computational experiments showed that the Information Capacity strategy shares similar behavior but provides a more flexible framework for building transparent knowledge models in deep learning.


Introduction
Passive learning normally requires an enormous amount of labeled data that supplies the correct answers (see Figure 1). An active learning environment guarantees better performance while training on less, but carefully chosen, data, which reduces the costs of both annotating and analyzing large data sets [1][2][3][4][5][6][7][8][9][10]. In uncertainty sampling, which has been reported to be successful in numerous scenarios and settings [11,12], a machine requests the instances about which it is uncertain. This leads to the optimal leveraging of both new and existing data [13].
The process of querying the information imitates a classroom instructional method that actively engages learners in the learning process [14][15][16]. Learners replace or adapt their knowledge and understanding based on prior knowledge in response to learning opportunities provided by a teacher. This contrasts with a model of instruction whereby knowledge is transmitted from the teacher to learners, which typically characterizes passive learning. Active learning in an educational setting integrates a variety of strategies and instructional models chosen by a teacher to contribute to learners' knowledge [17].
By comparison, machine active learning strategies still need to become more versatile and self-sustaining. In particular, deep neural networks demonstrate remarkable performance on particular supervised learning tasks but are poor at signaling when they are uncertain while working in an active learning environment. The output of the softmax layer tends to be overconfident. Besides, deep neural networks have grown so complex that it seems practically impossible to follow their decision-making process [18].
In this study, we intend to inspect human and machine reasoning processes [19][20][21][22][23] in order to understand how machines make predictions in an active learning environment. Rather than improving performance, we explored whether we can explain how machines come to decisions by imitating human-like reasoning in multiple-choice testing [24][25][26][27][28][29]. We suggest a new uncertainty sampling strategy based on the four-parameter logistic item response theory (4PL IRT) [24], which we call Information Capacity. The strategy achieves performance similar to that of the most common uncertainty sampling techniques (Least Confidence and Entropy Sampling) but allows creating more transparent knowledge models in deep learning.
In deep neural networks, we have little visibility into how models come to conclusions. This happens because we do not know how learning is supposed to work. While training a model, we iterate with better data, better configurations, better algorithms, and more computational power, although we have little knowledge of why that model converges slowly or generalizes poorly. As a result, we do not have much control over rebuilding that model: it is not transparent [18,30,31].
Information Capacity brings with it a new interpretation of learning processes that sheds light on "black-box" models. In contrast to Least Confidence and Entropy Sampling, the proposed strategy relies on neural network architectures to model learners' behavior, where the neurons or network weights of network classifiers are considered to be a group of learners with different proficiency in classifying learning items. Information Capacity ensures more flexible deep architectures with explainable and controllable learning behavior, not restricted to connectionist models.

Related Work
• Deep active learning. Active learning of deep neural models has hardly been considered to date. The prominent related studies report minimizing test errors and computational efforts [32][33][34][35][36], taking some directions towards interpretability in deep learning [37]. This study approaches another major issue within the context of transparency: a lack of reasoning in deep neural models.
• IRT-based deep learning. Item response theory has been successfully used in solving machine and deep learning problems [38][39][40][41]. These studies mostly focus on improving generalization ability through optimizing the parameters of IRT models. Rather than optimizing hyperparameters [42] via IRT model-fitting, we aimed to find meaningful interpretations of deep network reasoning in terms of learning behaviors.
• Meta active learning. The reported studies mostly focused on increasing the accuracy of classification with adaptive optimization schemes [1,5,43]. Instead, we intend to simplify an active learning process by integrating a set of evolving learning behaviors into learning models while improving their transparency.

Design of Experiments
We built an SGD-based CNN classifier in PyTorch with two convolutional layers with ReLU activations and one dropout layer, ending with a softmax layer. The first convolutional layer filters the 1 × 10 input image with a square kernel of size 5. The second convolutional layer takes as input the pooled output of the first convolutional layer with a stride of 2 pixels and a square kernel of size 5. The model was trained with an SGD optimizer (learning rate 0.01, momentum 0.5) for $n_{epoch} = 10$ epochs with a batch size of $n_{batch} = 64$ and tested with a batch size of $n_{batch} = 1000$.
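To make the layer description easier to follow, the sketch below tracks the spatial size of the feature maps through the two convolutional layers. It is only an illustration under stated assumptions: 28 × 28 input images (the MNIST and Fashion MNIST image size), unpadded stride-1 convolutions, and a 2 × 2 pooling window with stride 2 after each convolution. The helper names `conv_out` and `feature_map_sizes` are hypothetical and not part of the original implementation.

```python
from typing import List


def conv_out(size: int, kernel: int, stride: int = 1) -> int:
    """Output width of an unpadded convolution or pooling window."""
    return (size - kernel) // stride + 1


def feature_map_sizes(input_size: int = 28, kernel: int = 5, pool: int = 2) -> List[int]:
    """Spatial sizes after each conv and pooling step (assumed layout)."""
    sizes = [input_size]
    for _ in range(2):  # two convolutional layers, each followed by pooling
        sizes.append(conv_out(sizes[-1], kernel))              # 5x5 convolution
        sizes.append(conv_out(sizes[-1], pool, stride=pool))   # 2x2 pooling, stride 2
    return sizes


print(feature_map_sizes())  # [28, 24, 12, 8, 4]
```

Under these assumptions, each image is reduced to 4 × 4 feature maps before the fully connected part of the network.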
We tested the CNN model on the MNIST and Fashion MNIST datasets. From each dataset we randomly took $m_{train} = 10{,}000$ examples for training and $m_{test} = 10{,}000$ examples for testing. The active learning environment was created with three labeled pool sizes $|L| \in \{100, 500, 1000\}$, fifty rounds ($n_{round} = 50$), and one hundred queried examples per round ($|L_S| = 100$).
The proposed Information Capacity strategy was implemented in line with the two baseline algorithms, Least Confidence and Entropy Sampling. Each experiment was repeated $n_{run} = 10$ times in order to produce statistically reliable estimates.
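For reference, the two baseline strategies can be sketched as scoring functions over predicted class probabilities. This is a minimal sketch using the textbook definitions (one minus the top probability for Least Confidence, Shannon entropy for Entropy Sampling); the function names and the toy probabilities are illustrative assumptions, not the code used in the experiments.

```python
import math


def least_confidence(probs):
    """Uncertainty as one minus the probability of the most likely class."""
    return 1.0 - max(probs)


def entropy(probs):
    """Shannon entropy (in nats) of the predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def query(pool_probs, score, k):
    """Indices of the k pool items with the highest uncertainty score."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score(pool_probs[i]),
                    reverse=True)
    return ranked[:k]


# Illustrative softmax outputs for three unlabeled items (made-up values).
pool = [[0.90, 0.05, 0.05], [0.40, 0.30, 0.30], [0.60, 0.30, 0.10]]
print(query(pool, least_confidence, 1))  # [1]
print(query(pool, entropy, 2))           # [1, 2]
```

Both scores rank the second item first because its predicted distribution is the flattest of the three.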

Analysis of Experiments
The parameters of the proposed strategy defined a clearly interpretable behavior of learners during multiple-choice testing while the model was trained. The learners (network weights) guessed correctly with the probability $a_i = 0.1$ on the item (labeled example) $i$ of the difficulty $\beta_i = 4$. We assumed that no penalty for guessing was announced. The item discrimination parameter $\alpha_i = 0.25$ reflects how well an item discriminates among the learners located at different points $\theta_j$ along the continuum. These parameter values were chosen to minimize the maximum of the information capacity of the items in $L$ while avoiding possible inaccuracies caused by machine precision when the values of the informativeness measure approach zero and become indistinguishable for different classes.
The implemented guessing behavior reflects "noise" in information. Therefore, a nonzero $a_i$ reduced the amount of information available for locating learners on the $\theta$ continuum. In addition, when answering the item $i$, the learners with locations at $\theta_j$ did not have a success probability equal to 1 but $b_i = 0.9$ due to partial forgetting. The locations $\theta_j < \beta_i$ represent lower-level learners, while the locations $\theta_j \geq \beta_i$ describe higher-level learners. The given values for the parameters $\alpha_i$, $\beta_i$, $a_i$, $b_i$ define the behavior of learners responding to the items in accord with the item information function (see Section 3), which gives the amount of information each item provides.
The experiments confirmed that Information Capacity with pre-defined learning behavior can represent the baseline active learning strategies (see Figures 2 and 3). The test accuracy over rounds (mean ± std) is given in Tables 1 and 2. The similarity in learning behavior for different subsets of the MNIST and Fashion MNIST datasets pointed to the conclusion that Information Capacity relies on neural network architectures to model learners' behavior. This can be explained by the fact that decisions on classification tasks are made at the output layer of a network but depend on the weights (learners) set at hidden layers. As the size of the labeled pool $|L|$ increases, the similarities between the accuracy curves for different strategies become stronger (see Tables 1 and 2).
We applied a one-sided Wilcoxon test [44] with Bonferroni correction [45,46] for each round to check for statistically significant differences between the three strategies in the test accuracy values. Since the p-value for each round turned out to be close to 1, we found no evidence to reject the null hypothesis. Consequently, the differences between the accuracy curves in Figures 2 and 3 are not statistically significant.
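The effect of the Bonferroni correction can be illustrated with a few lines of code. The sketch assumes the standard adjustment, in which each p-value is multiplied by the number of comparisons and capped at 1; the function name, the significance level of 0.05, and the p-values are hypothetical and are not results from the experiments.

```python
def bonferroni(p_values, alpha=0.05):
    """Bonferroni-adjust p-values and flag which hypotheses are rejected."""
    m = len(p_values)  # number of comparisons, e.g. one per round
    adjusted = [min(1.0, p * m) for p in p_values]
    rejected = [p_adj < alpha for p_adj in adjusted]
    return adjusted, rejected


# Hypothetical per-round p-values, for illustration only.
adjusted, rejected = bonferroni([0.9, 0.02, 0.0005])
print(rejected)  # [False, False, True]
```

With fifty rounds, the same rule means that a raw p-value must fall below 0.05 / 50 = 0.001 before a difference in a given round counts as significant.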

Discussion
We took Least Confidence and Entropy Sampling for comparison for two reasons. First, these active learning strategies are used as baseline sampling techniques for more complex approaches adopted in deep active learning. Second, Information Capacity shares some similarity with them: it considers labels $y_i$ that range over all possible labels (as in Entropy Sampling) and selects those with the least information capacity (as in Least Confidence).
As progress on improving performance in deep learning has come at the cost of transparency, we find this approach particularly beneficial. Information Capacity allows learners to exhibit different learning behaviors with regard to the IRT hyperparameters. In the experiments, they were chosen in a certain way to rule out the reasoning behind the Least Confidence and Entropy Sampling strategies. In particular, we modeled uncertainty with a group of learners who adopted both guessing and forgetting strategies ($a_i > 0$ and $b_i < 1$) to classify "hard" items ($\beta_i > 2$). In addition, it was difficult to assess how strong or weak the learners were ($\theta_j < \beta_i$ or $\theta_j > \beta_i$) because the discrimination factor was low ($\alpha_i < 0.5$). The absence of a penalty for guessing ($p = 0$) made the behavioral observations less predictable.
As we have seen, Least Confidence and Entropy Sampling can be interpreted by the scenario in which each learner (neuron) in a neural network shares the same behavior. Considering the complexity of deep networks, these backbone strategies seem limited. To increase the transparency of the deep learning process, different combinations of the IRT parameters can be used to construct a variety of educational scenarios and learning strategies with strong or weak learners, including learning in groups [47][48][49].
The analysis of different neural network architectures with regard to learning behaviors is beyond the scope of this study. However, we hope that our presentation of neural networks will encourage further research exploring novel neural networks building groups of learners with learning behavior which is not limited to gradient-based methods and primitive connectionist models.

Problem Statement
Let $\mathcal{X}$ be a feature space and $\mathcal{Y}$ be a label space. Let $P(X, Y)$ be an unknown underlying distribution, where $X \in \mathcal{X}$, $Y \in \mathcal{Y}$. We use a labeled training set $S_m = \{(x_i, y_i)\}_{i=1}^{m}$ of $m$ labeled training samples to select a prediction function $f \in \mathcal{F}$, $f : \mathcal{X} \to \mathcal{Y}$, so as to minimize the risk
$$R_\ell(f) = \mathbb{E}_{(X, Y) \sim P}\left[\ell(f(X), Y)\right],$$
where $\ell(\cdot) \in \mathbb{R}^+$ is a given loss function. For any labeled set $L$ (training and testing), the empirical risk over $L$ is given by:
$$\hat{R}_L(f) = \frac{1}{|L|} \sum_{(x_i, y_i) \in L} \ell(f(x_i), y_i).$$
In a pool-based setting [7,33], an active learner chooses examples from a set $U = S_m \setminus L$ of unlabeled samples according to a query function $S$. Query functions often select points based on information inferred from the current model $f_s$, the existing training set $L$, and the current pool $U$. The aim is to accurately train the model for a given number of labeled points $|L_S|$.
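As a small illustration of the empirical risk, the sketch below averages a per-example loss over a labeled set. It assumes the usual definition (the mean loss over the set) and uses the 0-1 loss as an example; the function names, the toy classifier, and the data are hypothetical.

```python
def zero_one_loss(prediction, label):
    """0-1 loss: 0 for a correct prediction, 1 otherwise."""
    return 0.0 if prediction == label else 1.0


def empirical_risk(f, labeled, loss=zero_one_loss):
    """Average loss of the prediction function f over a labeled set."""
    return sum(loss(f(x), y) for x, y in labeled) / len(labeled)


def threshold_classifier(x):
    """Toy prediction function used only for this example."""
    return int(x > 0.5)


# Made-up labeled set of (feature, label) pairs.
labeled_set = [(0.2, 0), (0.7, 1), (0.9, 0), (0.4, 0)]
print(empirical_risk(threshold_classifier, labeled_set))  # 0.25
```

With the 0-1 loss, the empirical risk is simply the error rate, so test accuracy equals one minus this value.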
We consider a class $\mathcal{B}$ of learning behaviors during testing, where each behavior $B \in \mathcal{B}$ represents a hypothesis class containing all learners $f_s \in B$, where $s$ defines a set of parameters in a testing framework for making behavioral observations.

Testing Framework
We are interested in measuring the classification proficiencies of a group of learners (neurons or network weights). Although it seems impossible to directly observe the level of proficiency (working knowledge), we can infer its existence through behavioral observations in a classroom. The learners are given an instrument containing several items (labeled examples), i.e., multiple-choice tests [27,28,50,51,52]. The responses to this instrument constitute the behavioral observations. Item Response Theory [24,28,53,54,55,56] suggests a variety of models to assess the distance between the learner and the item locations, as this distance determines the learner's correct response. Items are located along the same continuum according to their difficulty $\beta$: correctly answering items located toward the right side requires a learner to have greater proficiency $\theta$ than answering items located on the left side. In general, items located below 0 are "easy" while items above 0 are "hard".
In this study, we focused on the four-parameter logistic item response theory (4PL IRT) model, which can be presented as [24]:
$$p(y_{ij} = 1 \mid \theta_j, \alpha_i, \beta_i, a_i, b_i) = a_i + (b_i - a_i)\,\frac{e^{\alpha_i(\theta_j - \beta_i)}}{1 + e^{\alpha_i(\theta_j - \beta_i)}}, \tag{1}$$
where $p(y_{ij} = 1 \mid \theta_j, \alpha_i, \beta_i, a_i, b_i)$ is the probability of providing the correct response $y_{ij} = 1$ to an item $i$ by a learner $j$ with the location (ability) $\theta_j$. From the definition (1) we can see that the rate of success mainly depends on the relationship between the item's parameters and learners' proficiency.
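The response model can be evaluated numerically to see how the parameters shape a learner's behavior. The sketch below assumes the standard 4PL form, with $a$ as the lower (guessing) asymptote and $b$ as the upper asymptote, and uses the parameter values reported for the experiments as defaults; the function name `p_correct` is a hypothetical choice.

```python
import math


def p_correct(theta, alpha=0.25, beta=4.0, a=0.1, b=0.9):
    """4PL probability that a learner at location theta answers correctly.

    Assumed form: p = a + (b - a) / (1 + exp(-alpha * (theta - beta))).
    """
    return a + (b - a) / (1.0 + math.exp(-alpha * (theta - beta)))


# A learner located exactly at the item difficulty sits midway between a and b.
print(round(p_correct(4.0), 3))     # 0.5
# Very weak and very strong learners approach the guessing and forgetting limits.
print(round(p_correct(-100.0), 3))  # 0.1
print(round(p_correct(100.0), 3))   # 0.9
```

The low discrimination value makes the curve rise slowly, so learners on either side of the item difficulty have similar success probabilities.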

Information Capacity
So far, we considered the estimation of a learner's location from its uncertainty. Let us now take the opposite side and define a query strategy S.
The instrument's items (labeled examples) contain a certain amount of information that can be used for estimating the learner location parameters. We assume that each item contributes information to reduce the uncertainty about a learner's location independently of the other items of the instrument. The amount of information items provide can be presented using the Fisher information as [24,57,58]:
$$S(\theta) = \frac{1}{\sigma_e^2(\hat{\theta} \mid \theta)} = -\mathbb{E}\left[\frac{\partial^2 \ln \mathcal{L}}{\partial \theta^2}\right], \tag{2}$$
where $\sigma_e^2(\hat{\theta} \mid \theta)$ is the asymptotic variance error of the estimate $\hat{\theta}$. The log likelihood function for a learner $j$'s response vector is equal to:
$$\ln \mathcal{L}(\mathbf{y}_j \mid \theta_j) = \sum_i \left[ y_{ij} \ln p_{ij} + (1 - y_{ij}) \ln (1 - p_{ij}) \right], \tag{3}$$
where $p_{ij} = p(y_{ij} = 1 \mid \theta_j, \alpha_i, \beta_i, a_i, b_i)$ is given by (1). The items' capacity is defined as the maximum of the information function, $S(\theta)_{max}$. The definition (2) can be rewritten with regard to (1) in the explicit form for an item $i$ as:
$$S_i(\theta_j) = \frac{\alpha_i^2\,(p_{ij} - a_i)^2\,(b_i - p_{ij})^2}{(b_i - a_i)^2\, p_{ij}\,(1 - p_{ij})}. \tag{4}$$
The detailed derivation of Equation (4) is given in Appendix A. Figure 4 illustrates the projections of the information function with the fixed values $\alpha = 0.25$, $\beta = 4$, $a = 0.1$, $b = 0.9$ described in Section 2.2.
A new pool-based strategy, which we refer to as Information Capacity, suggests estimating the information capacity for unlabeled instances based on the definition (4). Figure 5 depicts the proposed pool-based strategy for measuring the items' capacity $S$ with regard to the items' difficulty $\beta_i$, learners' locations $\theta_j$, strategies $a_i$ and $b_i$, and penalty announcement $p$ in a classroom. The strategy is aimed at "moving" learners along the difficulty axis while keeping high values on the capacity axis. The learners query the examples with the lowest information capacity. For clarity, Algorithm 1 represents the proposed pool-based active learning with the Information Capacity strategy. The differences in implementation in comparison with the traditional active learning framework are highlighted in blue.

Algorithm 1. Pool-based active learning with the Information Capacity strategy.
4: Initialize a learning behavior of learners $B$ with regard to a set of parameters $\alpha$, $\beta$, $a$, $b$;
5: Train a group of learners on the labeled set $L$;
6: Measure performance of the group of learners on the test set $m_{test}$;
7: Initialize the number of rounds $n_{round}$ and the number of queried examples $|L_S|$;
8: for round ∈ $n_{round}$ do
9:   Estimate the probabilities with regard to (1) based on the learning behavior $B$;
10:   Sort the unlabeled items in $U$ according to (4) based on the probabilities from step 9;
11:   Query the items $L_S$ with the smallest of the maximum capacity $S$ in a round;
14:   Retrain a group of learners on the labeled set $L$;
15:   Measure performance of the group of learners on the test set $m_{test}$;
16: end for
17: return The performance of the learners with the interpretation of their learning behavior.
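The query step of the strategy can be sketched as follows. The code assumes the standard item information function of the 4PL model (the squared slope of the response curve divided by $p(1 - p)$) and takes an item's capacity to be the maximum information over the learner locations associated with it. How these locations are obtained from the network is not specified here, so the per-item lists of $\theta$ values, like all function names, are hypothetical.

```python
import math


def p_correct(theta, alpha, beta, a, b):
    """4PL response probability (assumed standard form)."""
    return a + (b - a) / (1.0 + math.exp(-alpha * (theta - beta)))


def information(theta, alpha=0.25, beta=4.0, a=0.1, b=0.9):
    """Item information of the 4PL model at location theta."""
    p = p_correct(theta, alpha, beta, a, b)
    slope = alpha * (p - a) * (b - p) / (b - a)  # derivative of p w.r.t. theta
    return slope * slope / (p * (1.0 - p))


def capacity(thetas, **params):
    """Capacity of one item: maximum information over its learner locations."""
    return max(information(t, **params) for t in thetas)


def query_lowest_capacity(pool, k, **params):
    """Indices of the k unlabeled items with the smallest capacity."""
    ranked = sorted(range(len(pool)), key=lambda i: capacity(pool[i], **params))
    return ranked[:k]


# Hypothetical learner locations (e.g. one per class) for three unlabeled items.
pool = [[4.0, 0.0], [-10.0, -12.0], [20.0, 1.0]]
print(round(information(4.0), 4))      # 0.01
print(query_lowest_capacity(pool, 2))  # [1, 2]
```

With $a = 0.1$ and $b = 0.9$ the response curve is symmetric, so the information peaks at $\theta = \beta$; items whose learner locations all lie far from the difficulty carry little information and are queried first.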

Conclusions
We present Information Capacity, an uncertainty sampling strategy that effectively integrates human and machine reasoning processes. The strategy allows embedding different learning behaviors into the models with regard to the parameters of the 4PL IRT model. The experiments on the MNIST and Fashion MNIST datasets with the same CNN model indicate that Information Capacity performs similarly to Least Confidence and Entropy Sampling but brings more transparency into a deep learning process.
We considered the neurons or network weights of the CNN classifier at the last hidden layer as a group of learners with different proficiency in classifying learning items, i.e., images. The pre-defined parameters of the Information Capacity strategy defined their learning behavior: the learners had a success probability $b_i = 0.9$ due to partial forgetting, while they guessed correctly with the probability $a_i = 0.1$ on the item $i$ of the difficulty $\beta_i = 4$, which discriminated the learners with the factor $\alpha_i = 0.25$.
The equivalence of the parameters $\alpha_i$, $\beta_i$, $a_i$, $b_i$ for different subsets of the MNIST and Fashion MNIST datasets revealed that the model architecture greatly influences learning behavior. As a direction for further research, we suggest modeling learning behaviors with different network architectures. While keeping equally good performance due to the similarity between different strategies, it seems desirable to optimize neural network architectures and learning processes.