Computer Adaptive Testing Using Upper-Confidence Bound Algorithm for Formative Assessment

Featured Application: The paper proposes an application of the UCB algorithm to item selection in formative assessment. The main advantage of this approach is its ease of implementation compared with testing based on Elo rating and Multidimensional Item Response Theory. Thus the method should be applicable in virtually any classroom where formative assessment is desired and students have access to computers or phones.

Abstract: There is strong support for including formative assessment in learning processes, with the main emphasis on corrective feedback for students. However, traditional testing and Computer Adaptive Testing can be problematic to implement in the classroom. Paper-based tests are logistically inconvenient and hard to personalize, and thus must be longer to accurately assess every student in the classroom. Computer Adaptive Testing can mitigate these problems by making use of Multidimensional Item Response Theory, at the cost of introducing several new problems. The most serious of these are greater test creation complexity, caused by the need to calibrate the question pool, and the debatable premise that different questions measure one common latent trait. In this paper a new approach is proposed: formative assessment is modelled as a Multi-Armed Bandit problem and solved using the Upper-Confidence Bound algorithm. In combination with the e-learning paradigm, the method has the potential to mitigate such problems as question item calibration and lengthy tests, while providing accurate formative assessment feedback for students. A number of simulation and empirical data experiments (with 104 students) are carried out to explore and measure the potential of this application, with positive results.


Introduction
Formative assessment has been proposed as a means of making education more accessible and more effective [1][2][3][4]. The distinction between the summative and formative roles of assessment was first proposed by Scriven [5] and then applied to students by Bloom [6,7]. Formative assessment is specifically intended to generate feedback on performance in order to improve and accelerate competency acquisition, as opposed to summarizing the achievement status of a student [8,9]. Any learning activity, from oral discourse to conventional quizzes, has potential value as formative assessment [10]. Three core principles form the basis for formative assessment [11]. Firstly, formative assessment should be viewed as an integral part of instruction, and it should be used in real time to guide the learning process. The material provided to students should depend on their current state of knowledge further learning. The process can be viewed through the lens of J. Hattie's three-question feedback model [15]. To utilize the algorithm the teacher must first form topics or competences (Where am I going?). The algorithm quickly identifies lacking areas of knowledge (How am I going?) and explores the topic in detail, helping further instruction (Where to next?). This approach, in combination with the now widespread mobile devices, has the potential to mitigate the aforementioned issues, such as test creation complexity and long test times, while providing accurate formative assessment data compatible with J. Hattie's three-question feedback model, Competency Based Learning and Assessment methodologies.

Modelling Assessment as UCB Problem
When tutoring, a teacher will often engage in a dialogue with a student. The teacher may ask the student a series of formative questions in order to diagnose gaps in the student's knowledge. Assume the material consists of two topics, and the teacher has asked five questions on each topic. Knowledge of the first topic (two incorrect answers) appears to be worse than knowledge of the second topic (one incorrect answer); see Table 1. If there is time for five more questions before proceeding with didactic instruction, the teacher faces a dilemma: which topic should be explored with further questions? If the two incorrect answers on the first topic are attributable to bad luck due to the small sample size, should the teacher explore the first or the second topic? What if there are more than two topics?
The family of bandit algorithms is designed to cope with uncertainty by balancing exploration and exploitation [40]. However, when applied to formative assessment the exploitation component is not obvious, as ultimately the goal is to explore the knowledge of the student. The algorithm should probe the different topics and engage in focused questioning, exploiting those that are most likely in need of instruction. This presents an opaque bandit problem, in which a single reward (one answer) is observed at each round, in contrast with the transparent one, in which all rewards are observed [34]. Thus, in the context of assessment, a sequential allocation problem arises: the assessor has many questions from multiple topics (the bandit's arms) and must repeatedly choose a topic to explore (which arm to pull). The choice of the next question should depend on the history of answers already known. A policy is then a mapping from the individual history of the student to actions (questions to be asked of the student).
Suppose the student's knowledge is assessed on a number of topics $T = \{1, 2, \ldots, k\}$. In a multiple-choice quiz, where the answer to each question is either correct or incorrect, the reward $X_r \in \{0, 1\}$ is binary valued. Each topic corresponds to an unknown probability distribution. There exists a vector $\mu \in [0, 1]^k$ such that the probability that $X_r = 1$, given that the algorithm chose topic $T_r = t$, is $\mu_t$. This kind of environment is called a stochastic Bernoulli bandit. If the mean vector associated with the environment were known, the optimal policy would be to always choose a question on the one topic $t^* = \operatorname{argmin}_{t \in T} \mu_t$. This results in the exploration of the weakest area of the student's knowledge, so as to aid further instruction. The regret over the $n$ questions is given by Equation (1), where the expectation $E$ is taken with respect to the stochastic environment and the policy. However, in a practical setting the number of questions on one topic is usually rather limited by the scope of the curriculum and the question pool, as is the length (the horizon) of the quiz. Thus regret calculated in this way is of little practical value. The main challenge of the task is finding the weakest topic of a student. To do so the algorithm must explore different topics and exploit a particular topic to obtain a more accurate estimate of the student's level of knowledge on that subject. This basic exploration-exploitation dilemma is the key to obtaining a good strategy. The heuristic principle chosen in this paper for dealing with this issue is optimism in the face of uncertainty, and the algorithm that operates on this principle is the Upper-Confidence Bound (UCB) algorithm. UCB is one of the simplest algorithms that offers sub-linear regret. The algorithm suggests choosing the action with the largest upper confidence bound or, in the case of our model, the topic with the smallest lower confidence bound.
The topic $t$ on which question number $n$ is asked is then chosen according to Formula (2), where $C$ is a constant that regulates the impact of the second (exploration) component on the choice of topic, and $N_t$ is the number of questions on topic $t$ that have been asked so far.
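As a sketch only, the regret and the selection rule described above might be written as follows. This assumes the standard UCB1 form adapted to minimization, and takes $\mu_t$ as the probability of a correct answer on topic $t$. The empirical mean $\hat{\mu}_t$, the exact shape of the exploration term (in particular any constant factor under the square root), and the sign convention of the regret are assumptions made for illustration and are not taken from the text.

```latex
% Assumed regret for a minimization objective: expected number of correct
% answers collected, minus what always asking the weakest topic would give.
R_n = \mathbb{E}\left[\sum_{r=1}^{n} X_r\right] - n\,\mu_{t^*}

% Assumed lower-confidence-bound rule for choosing the topic of question n.
% \hat{\mu}_t is the observed fraction of correct answers on topic t.
T_n = \operatorname*{argmin}_{t \in T}
      \left( \hat{\mu}_t - C \sqrt{\frac{\ln n}{N_t}} \right)
```

Under this reading, setting $C = 0$ removes exploration entirely, and the rule reduces to always asking about the topic with the lowest observed score.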
As the number of questions asked on a topic increases, the uncertainty and the exploration term of the formula decrease [40]. Thus the algorithm seeks out the weakest topics of a student's knowledge and, once they are identified, thoroughly questions the student on them. This is pedagogically valuable because, once identified, the lacking topic knowledge can be corrected. In addition, by "exploiting" the weakest topic the algorithm gathers more fine-grained information on it, which will be useful in post-assessment knowledge correction. When the item pool of a topic is exhausted, the algorithm chooses the topic with the second smallest value as estimated by Formula (2).
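The selection step, including the fallback when a topic's item pool is exhausted, can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `choose_topic`, the `ln(n) / N_t` form of the exploration term, and the decision to ask about untried topics first are all assumptions.

```python
import math


def choose_topic(correct, asked, remaining, c=0.5):
    """Return the topic with the smallest lower confidence bound.

    correct[t]   -- correct answers observed so far on topic t
    asked[t]     -- questions asked so far on topic t (N_t)
    remaining[t] -- unused items left in the pool of topic t
    c            -- exploration constant C
    Returns None when every pool is exhausted.
    """
    n = sum(asked) + 1  # number of the question about to be asked
    best_topic, best_score = None, math.inf
    for t in range(len(asked)):
        if remaining[t] == 0:
            # Pool exhausted: skip, so the next-smallest bound is used.
            continue
        if asked[t] == 0:
            # Assumption: an untried topic has an unbounded confidence
            # interval, so it is explored before any tried topic.
            score = -math.inf
        else:
            mean = correct[t] / asked[t]
            score = mean - c * math.sqrt(math.log(n) / asked[t])
        if score < best_score:
            best_topic, best_score = t, score
    return best_topic
```

For example, with two topics that have each been asked twice and scored 2/2 and 0/2, the sketch selects the second topic, and falls back to the first once the second topic has no items left.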

Simulated Students' Experiments
A number of experiments with simulated students were carried out before proceeding with testing on empirical data. The number of simulated students in each experiment was 1000, unless stated otherwise. Each simulated quiz had a number of questions on two or more topics. Each topic was represented as a vector of weights, one for each question in the quiz, with each weight representing the relevance of the question to the topic. In this paper only experiments with binary weights, 0 or 1, were carried out. Moreover, each question was assumed to belong to only one topic. The number of question items on each topic was set to be equal. Each simulated student had a vector equal in length to the number of questions in the quiz, where each element represented knowledge of one question, either yes or no. For all simulation experiments, unless stated otherwise, the inter-topic correlation of answers was random. The probability $p_t$ that the student will know the answer on a topic had a uniformly distributed random bias between 0 and 1.
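One way to generate such a class of simulated students is sketched below. It is an illustration under stated assumptions rather than the authors' code: the function name `simulate_students`, the fixed seed, and the reading of the "random bias" as an independent draw of $p_t$ from a uniform distribution for every student and topic are hypothetical.

```python
import random


def simulate_students(n_students, n_topics, items_per_topic, seed=0):
    """Build a simulated class with binary question-to-topic membership.

    Returns (topic_of_item, students), where topic_of_item[q] is the single
    topic that question q belongs to, and each student is a pair
    (p, answers): p[t] is the per-topic probability of knowing an answer
    and answers[q] is 1 (knows) or 0 (does not know).
    """
    rng = random.Random(seed)
    # Equal number of items per topic, each item belonging to one topic.
    topic_of_item = [t for t in range(n_topics) for _ in range(items_per_topic)]
    students = []
    for _ in range(n_students):
        # Assumption: p_t is drawn uniformly from [0, 1) per topic.
        p = [rng.random() for _ in range(n_topics)]
        answers = [1 if rng.random() < p[t] else 0 for t in topic_of_item]
        students.append((p, answers))
    return topic_of_item, students
```

Calling `simulate_students(1000, 4, 16)` yields 1000 students with 64 answers each, which matches the 64-item pools used in the simulations.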

Real Students' Assessment Methodology
In this study, the feasibility of applying the UCB algorithm to formative assessment is explored using data collected from a 60-question quiz covering 15 subtopics (e.g., network topologies, networking devices, Internet Protocol version 4, Ethernet, cloud computing services). The assessment was held at Vilnius Gediminas Technical University, Lithuania, on 25 April 2019. The test length and item pool size were set to 60 questions to keep the quiz length close to an hour. The number of topics was set to 15 because that is the number of lectures in the course. The quiz was designed to assess students' knowledge of basic computer networking and cloud computing technologies. In total, 104 undergraduate students, sophomores and juniors (third year), from 7 different groups were tested. All of the students took the test at the same place and time (Saulėtekio al. 11, Vilnius, from 12:30 to 13:30). The question pool contained questions of varying difficulty (the hardest question was answered correctly by 16 students, the easiest by 103). This was done to test the robustness of the algorithm, as it is meant as an alternative to methods that require item calibration. Therefore the algorithm must be able to work with items of varying and a priori unknown difficulty.
All questions in the 60-question quiz were multiple-choice questions with four options, only one of which was correct. Questions on the same topic never appeared consecutively, so as to lessen the impact of deductive reasoning relative to knowledge of the subject. Students were not allowed to assist each other during the assessment. All students were also informed that, if they so desired, the test would have no summative impact on their grade and would serve an exclusively formative function. Every student gave written permission allowing their anonymized data to be used for scientific research. The answers were aggregated in a comma-separated values (CSV) file, anonymized, and later processed using Python software written for the purpose of this experiment.
The true knowledge of a student on each topic, the ground truth, was calculated by dividing the number of correct answers within a topic by the total number of questions within that topic. In the experiment with real students the number of correct answers was known because all students were required to answer all questions in the item pool. Once complete knowledge of the test material was known for each student, the algorithms would query the database. Accuracy was then measured as the relationship between the student knowledge estimated from the incomplete information accessed by the algorithm and the complete information in the database. The formulas used for accuracy calculation are provided in the following statistical analysis section.
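The ground-truth calculation described above can be sketched as a per-topic proportion of correct answers. The function name `topic_ground_truth` and the data layout (one 0/1 answer per question plus a parallel list of topic indices) are assumptions for illustration; the layout of the actual CSV file is not specified in the text.

```python
def topic_ground_truth(answers, topic_of_item, n_topics):
    """Fraction of correct answers per topic for one student.

    answers[q]       -- 1 if question q was answered correctly, else 0
    topic_of_item[q] -- index of the topic that question q belongs to
    Assumes every topic has at least one question.
    """
    correct = [0] * n_topics
    total = [0] * n_topics
    for answer, topic in zip(answers, topic_of_item):
        correct[topic] += answer
        total[topic] += 1
    return [c / n for c, n in zip(correct, total)]
```

With four items per topic, as in the real quiz, each ground-truth value can only be 0, 0.25, 0.5, 0.75 or 1, which is one reason several topics can tie for weakest.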

Statistical Analysis
The accuracy (performance) of the test was established based on the Positive Predictive Value (PPV), which defines the probability of supplying the correct learning material to a student after the formative assessment. For one student it was evaluated according to the formula $\mathrm{PPV} = TP/(TP + FP)$, where $TP$ (True Positive) is the number of correctly identified weakest topics for the student. The number of weakest topics is not always one, because topic proficiency is assumed to be equal to the expected value of an answer to a question on the topic, a Bernoulli variable with $E(T) = p_t$. This value can be the same for several topics; in that case it is assumed that the student would benefit equally from instruction on any of them. $FP$ (False Positive) is the number of incorrectly identified topics, that is, topics whose mean $\mu_t$ is larger than the smallest mean, $\mu_m$. For a group of students, the average of the individual accuracies was taken.
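The per-student PPV can be illustrated with the short sketch below. It assumes that the topics "identified" by the assessment are those with the smallest estimated mean, with ties kept; the names `weakest_topics`, `ppv` and `class_accuracy` are hypothetical.

```python
def weakest_topics(means):
    """Set of topic indices that share the smallest mean (ties are kept)."""
    smallest = min(means)
    return {t for t, value in enumerate(means) if value == smallest}


def ppv(estimated_means, true_means):
    """PPV = TP / (TP + FP) for one student."""
    predicted = weakest_topics(estimated_means)  # assumption: argmin of estimates
    actual = weakest_topics(true_means)
    tp = len(predicted & actual)  # identified topics that are truly weakest
    fp = len(predicted - actual)  # identified topics with a larger true mean
    return tp / (tp + fp)


def class_accuracy(per_student_ppv):
    """Average of the individual accuracies over a group of students."""
    return sum(per_student_ppv) / len(per_student_ppv)
```

For instance, if the estimates tie topics 0 and 1 as weakest while the truly weakest topics are 0 and 2, then TP = 1, FP = 1 and the PPV is 0.5.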
Where applicable, experimental results were expressed as mean ± Standard Error of the Mean (SEM). The correlation matrix of questions for heat-map visualization was computed as Pearson correlation coefficients using the Pandas Python data analysis library.
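For reference, the mean ± SEM summary can be computed as in the sketch below, which uses only the Python standard library. It assumes the SEM is the sample standard deviation (n - 1 in the denominator) divided by the square root of the sample size; the text does not state which estimator was used, and the function name `mean_sem` is hypothetical.

```python
import math
import statistics


def mean_sem(values):
    """Return (mean, standard error of the mean) for a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    if n < 2:
        # The sample standard deviation is undefined for a single value.
        return mean, float("nan")
    sem = statistics.stdev(values) / math.sqrt(n)
    return mean, sem
```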
Variance of answers to questions on one topic in the simulation experiments was calculated using the formula for the Bernoulli distribution, $\mathrm{Var}[T] = p_t (1 - p_t)$. The probability of a correct answer on the topic, $p_t$, was known and controlled for each topic $T$ to observe its effect on assessment accuracy. In the experiments with real students the variance was computed using the same formula, with $p_t$ estimated as
$$p_t = \frac{1}{n_s\, n_{st}} \sum_{s=1}^{n_s} \sum_{q=1}^{n_{st}} a_{sq},$$
where $s$ is a student, $q$ is a question within a topic, $a_{sq}$ is the answer of a particular student to a particular question within the topic, and $n_s$ and $n_{st}$ are the number of students and the number of questions within the topic, respectively.
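The variance calculation for real students can be sketched as follows, assuming that $p_t$ is estimated as the mean of all answers $a_{sq}$ given by all students to the questions of one topic. The function names are hypothetical.

```python
def estimate_topic_probability(answer_rows):
    """Estimate p_t for one topic.

    answer_rows[s][q] is a_sq: 1 if student s answered question q of the
    topic correctly, else 0. All rows must have the same length n_st.
    """
    n_s = len(answer_rows)      # number of students
    n_st = len(answer_rows[0])  # number of questions within the topic
    return sum(sum(row) for row in answer_rows) / (n_s * n_st)


def bernoulli_variance(p):
    """Var[T] = p_t * (1 - p_t) for a Bernoulli variable."""
    return p * (1.0 - p)
```

The variance peaks at 0.25 when $p_t = 0.5$ and falls to zero when every answer on the topic is the same.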

Impact of Exploration Constant on Accuracy
We start by presenting a set of simulations that systematically explore different properties of formative assessment using the UCB algorithm. The efficiency of the UCB algorithm depends on the constant C, which regulates the impact of the exploration term on the topic choice, as can be seen in Formula (2).
To analyze this impact and to choose the most suitable C for assessing real students, a number of experiments with synthetic students were carried out.
A first view of the impact of C on assessment accuracy can be seen in Figure 1. The UCB algorithm performs better than randomly asked questions for every plotted constant. Note that the algorithm serving random questions never asks the same student the same question twice, and thus it achieves 100% accuracy after serving all 64 question items. Exploration can have both a positive and a negative impact on accuracy, as seen from the better performance of C = 0.45 compared with C = 0 and C = 1.
Figure 1 shows that the UCB algorithm, when applied to formative assessment, has the potential to significantly shorten test length. With a larger constant the algorithm displayed relatively poor accuracy at quiz lengths from about 15 to 30 questions. This can be explained by its failure to exploit topics known to be weak because it continues to explore topics about which little data is known. Finally, as seen from the best performance at C = 0.45, the exploration component does have a positive impact on assessment accuracy.
Appl. Sci. 2019, 9, x FOR PEER REVIEW
To choose the appropriate exploration constant for the real quiz, which had 60 questions (to set its duration at about 60 min), the following experiment with synthetic students was carried out. The number of questions was set to 64 in order to observe the importance of exploration in realistic scenarios: 4, 8, and 16 topics (see Figure 2). In the experiment we measured the minimal quiz length (number of questions) required to achieve accuracy greater than 95% in a class of 1000 synthetic students.
The experiment was performed for every constant value from 0 to 2 with a step of 0.1, and the results are plotted in Figure 2. The conclusion we can draw from Figure 2 is that, for every practical number of topics in a 64-question test, the constant can be set to 0.5 for optimal results.
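An experiment of this kind can be sketched as below. This is an illustrative reconstruction, not the authors' code: the `ln(n) / N_t` exploration term, asking untried topics first, the 0.5 estimate for unasked topics, and taking the ground truth from the simulated item pool are all assumptions, and the names `lcb_topic`, `accuracy_curve`, `min_length_for` and `sweep` are hypothetical.

```python
import math
import random


def lcb_topic(correct, asked, remaining, c):
    """Topic with the smallest lower confidence bound that still has items."""
    n = sum(asked) + 1
    best, best_score = None, math.inf
    for t in range(len(asked)):
        if remaining[t] == 0:
            continue  # pool exhausted, use the next-smallest bound
        if asked[t] == 0:
            score = -math.inf  # assumption: untried topics are asked first
        else:
            score = correct[t] / asked[t] - c * math.sqrt(math.log(n) / asked[t])
        if score < best_score:
            best, best_score = t, score
    return best


def argmin_set(values):
    smallest = min(values)
    return {i for i, v in enumerate(values) if v == smallest}


def accuracy_curve(n_students, n_topics, items_per_topic, c, seed=0):
    """Mean PPV after each question, averaged over simulated students."""
    rng = random.Random(seed)
    n_items = n_topics * items_per_topic
    totals = [0.0] * n_items
    for _ in range(n_students):
        p = [rng.random() for _ in range(n_topics)]
        pool = [[1 if rng.random() < p[t] else 0 for _ in range(items_per_topic)]
                for t in range(n_topics)]
        truth = argmin_set([sum(items) / items_per_topic for items in pool])
        correct, asked = [0] * n_topics, [0] * n_topics
        remaining = [items_per_topic] * n_topics
        for step in range(n_items):
            t = lcb_topic(correct, asked, remaining, c)
            remaining[t] -= 1
            correct[t] += pool[t][remaining[t]]  # serve an unused item
            asked[t] += 1
            est = [correct[i] / asked[i] if asked[i] else 0.5
                   for i in range(n_topics)]
            predicted = argmin_set(est)
            totals[step] += len(predicted & truth) / len(predicted)
    return [total / n_students for total in totals]


def min_length_for(curve, target=0.95):
    """Smallest quiz length whose mean accuracy exceeds the target."""
    for length, accuracy in enumerate(curve, start=1):
        if accuracy > target:
            return length
    return None


def sweep(constants, **kwargs):
    """Minimal quiz length for each exploration constant."""
    return {c: min_length_for(accuracy_curve(c=c, **kwargs)) for c in constants}
```

A call such as `sweep([i / 10 for i in range(21)], n_students=1000, n_topics=8, items_per_topic=8)` would mirror the grid from 0 to 2 in steps of 0.1 for a 64-question pool. The numbers it produces depend on the assumptions above and need not match Figure 2.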

UCB Advantage for Different Quiz Length
A more thorough exploration of the quiz space was performed to gauge the practicality of implementing the UCB algorithm in different classroom situations. Table 2 shows that the UCB testing method becomes more valuable as the pool of questions and the number of topics increase. The number of simulated students for this experiment was 500.
Table 2. Reduction in quiz length necessary to achieve 95% accuracy conferred by using the UCB algorithm with the optimal exploration constant over the traditional random question approach. All numbers are given for the optimal C value; T is the number of topics and Q is the total number of questions. Green color indicates a large reduction in quiz length, and red indicates a lesser benefit.

Assessment of Real Students
Figure 3 presents the results of the experiment with real students (n = 104). The formative assessment quiz included items from 15 topics, with 4 question items in each topic, for 60 questions in total. The results show that in this scenario, using traditional testing methods, quiz length could be shortened from 60 to 55 questions if the goal was 95% accuracy in weakest topic identification. With UCB adaptive assessment, however, quiz length could be almost halved, to 32 questions. Empirical results (orange and blue plots) are in line with projections drawn from simulations (grey plot). For this particular quiz and group of students, UCB would offer a reduction in quiz length of 23 questions if we aim for the same (>95%) accuracy.

The ease with which the weakest topic of a student can be identified depends on how strongly the answers on the same topic are correlated. At answer variance equal to zero, it is sufficient to ask only one question to know the student's knowledge of the rest of the items within the topic.
However, when answers are uncorrelated, estimation becomes harder, which might defeat the entire premise of the proposed UCB testing model. Thus an experiment was carried out to measure the effect of knowledge correlation between items within one topic on the effectiveness of the method. The impact of answer variance on testing accuracy can be seen in Figure 4.
As anticipated, identifying weak topics is trivial for unrealistically strongly correlated answers. However, even for uncorrelated answers UCB performs twice as well as random questioning. Also plotted in Figure 4 are variances calculated from the answers of real students, N = 104 (15-topic, 60-question quiz).
Figure 4. The impact of answer variance within topics on testing accuracy (number of questions needed to achieve 95% accuracy in a 60-question, 15-topic quiz). The green markers indicate variance calculated using data from real tests; the position of the square and circle along the x axis represents the average variance.
A correlation matrix for answers to different questions has been constructed and is shown in Figure 5. It indicates relatively low answer correlation even for intra-topic questions. Answers on each topic were grouped together for this illustration (i.e., answers 1, 2, 3, 4 all belong to the same topic). The matrix, in accordance with Figure 4, shows generally low correlation of answers within one topic, especially for items 20-24 (questions about the physical layer of the OSI model).
To assess shorter quiz lengths and the algorithm's behavior at the 95% accuracy value, a more detailed look at the accuracy distribution is provided in Figure 6. As seen from the figure, for the 60-question, 15-topic quiz the algorithm rarely displays accuracies between 50% and 99% for individual students.
Figure 6. Distribution of assessment accuracies for students, where dark blue is <1%, blue 1-50%, cyan 50-99%, and green >99% accuracy.
At quiz lengths between 10 and 32 questions a substantial portion (4-38%) of students were assessed very poorly, with less than 1% accuracy. Similarly, even at >95% accuracy, there can be a minority of students with incorrect weakest-topic estimates. The root causes may be the low correlation of answers (Figures 4 and 5) in the sample used for testing and the small number of items within a topic (only 4). The fraction of students with wildly wrong estimates (accuracy <1%) increased up to question 7. This is because the initial estimated knowledge on all topics is set to 0.5 in our implementation of the algorithm. The algorithm shows a steady increase in accurately (>99%) diagnosed students, with no abnormalities.
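The four accuracy bands used for this distribution can be tabulated with a sketch like the one below. How values that fall exactly on a boundary (1%, 50% or 99%) are assigned is an assumption here, since the text does not specify it, and the function name `accuracy_bins` is hypothetical.

```python
def accuracy_bins(accuracies):
    """Fraction of students in each accuracy band at one quiz length.

    accuracies -- per-student PPV values in [0, 1]; must be non-empty.
    """
    counts = {"<1%": 0, "1-50%": 0, "50-99%": 0, ">99%": 0}
    for value in accuracies:
        if value < 0.01:
            counts["<1%"] += 1
        elif value <= 0.50:    # assumption: 50% belongs to the 1-50% band
            counts["1-50%"] += 1
        elif value <= 0.99:    # assumption: 99% belongs to the 50-99% band
            counts["50-99%"] += 1
        else:
            counts[">99%"] += 1
    n = len(accuracies)
    return {band: count / n for band, count in counts.items()}
```

Because a student's PPV is a ratio of small topic counts, most students land in either the lowest or the highest band, which is consistent with the observation that intermediate accuracies are rare.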

Discussion
The UCB algorithm has the potential to significantly reduce assessment length without loss of accuracy. Even for very short tests with few topics the algorithm offers a significant reduction in test length (Table 2). However, there is no advantage for quizzes in which each topic contains only one question, at which point the notion of a topic loses its pedagogical meaning. There is evidence that time allocated to study positively affects student performance [41]; thus reducing the time spent on assessment is desirable. The experiments with real students support the conclusions drawn from the simulations (Figure 3). Such a strong change in quiz length has the potential to change the dynamics in the classroom, because the instructor would not need to spend an entire lesson on formative assessment alone. This, in turn, may increase opportunities to provide personalized feedback to students, which is linked to better performance [14,42]. This advantage comes at no cost in quiz creation complexity, unlike IRT and Elo rating based systems, where quiz creation can be prohibitively complex in some situations because of the item calibration problem [22,27]. Because of this fundamental difference we do not compare the performance of UCB with that of IRT and Elo based algorithms. Such a comparison would not discredit either approach: even if UCB performs worse (as is safe to assume), its simplicity of use and independence from question item calibration make it an interesting alternative to traditional assessment.
Compared to IRT and Elo algorithms, UCB will obtain less information about the student in a mathematical sense [27,43], and this can be seen as a disadvantage. However, not all information is equally pedagogically valuable [12,16]. For example, assume we are assessing a student's knowledge of two topics. We are nearing the end of the quiz and have determined that knowledge of the first topic is adequate, but knowledge of the second topic is lacking. Because of the nature of the UCB algorithm, topic two is more explored than topic one, therefore we expect to gain less information by exploring topic two further. However, from a pedagogical point of view the information on topic two can be more valuable for addressing the assessment needs according to the three question feedback model [15]. According to the assessment, topic two is in need of instruction. If we are going to proceed to teach it, we can use the extra diagnostic data to save time and effort by not teaching what the student already knows. Meanwhile, we have no immediate use for more precise data about the first topic.
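The claim that a further question on an already well-explored topic yields less information can be illustrated with a simple proxy. The sketch below uses the standard error of a proportion estimate, sqrt(p(1 - p) / n), as a stand-in for uncertainty; this is an assumption chosen for illustration and is not the information measure used in the IRT literature cited above. The function names and the example numbers are hypothetical.

```python
import math


def standard_error(p, n):
    """Standard error of a proportion p estimated from n answers."""
    return math.sqrt(p * (1 - p) / n)


def gain_from_one_more(p, n):
    """Reduction in standard error obtained by asking one more question."""
    return standard_error(p, n) - standard_error(p, n + 1)


# Hypothetical end-of-quiz state: topic one asked 3 times, topic two asked 10 times.
gain_topic_one = gain_from_one_more(0.5, 3)
gain_topic_two = gain_from_one_more(0.5, 10)
```

Under this proxy, the extra question on the less-explored topic one reduces uncertainty more than the extra question on topic two, even though the answer on topic two may be the one that matters for planning instruction.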
The simulation data presented in Figure 4 and the empirical results (Figures 3 and 5) suggest that the UCB assessment approach can offer a significant reduction in quiz length for any practical item pool size and intra-topic variance of answers. This is an important result because it indicates the suitability of the method for any grouping of questions, regardless of how correlated the answers are within one topic. Method effectiveness stays almost constant for any observed answer variance, which makes quiz length easy to predict. This allows a teacher to group questions on each topic as they see fit, in accordance with the syllabus and the learning material at hand, regardless of the existence or lack of a common latent trait underlying the items within a topic. This separates the UCB method from IRT and Elo based alternatives, which depend on the assumption of a common latent trait [22,28].
The fact that, at quiz lengths between 10 and 32 questions, a substantial portion (4-38%) of students were assessed very poorly implies that using shorter assessments is morally questionable: the majority of students will receive very accurate guidance, while the rest will be tutored on topics which they already know. This presents a problem for more important formative assessments (e.g., an assessment covering an entire semester). This property of the algorithm can be offset by a small increase in quiz length, as seen in Figure 6.
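The gap between a high class average and the experience of individual students can be made concrete with a short sketch. It reports the mean accuracy together with the fraction of students below a "very poorly assessed" threshold. The function name, the 1% threshold default, and the example class are illustrative assumptions and do not reproduce the experimental data.

```python
def summarize_class(accuracies, poor_threshold=0.01):
    """Return (mean accuracy, fraction of students below poor_threshold)."""
    mean = sum(accuracies) / len(accuracies)
    poor = sum(1 for a in accuracies if a < poor_threshold)
    return mean, poor / len(accuracies)


# Hypothetical class of 100: 95 students assessed perfectly, 5 assessed wrongly.
example_class = [1.0] * 95 + [0.0] * 5
mean_accuracy, poor_fraction = summarize_class(example_class)
```

In this made-up class the average accuracy is 95%, yet 5% of students would still be directed to the wrong topic, which is why reporting the poorly assessed fraction alongside the mean may be the more useful check when choosing a quiz length.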

Conclusions
The novel approach to formative assessment presented in this paper, based on the UCB algorithm, shows promising results when compared to traditional assessment methods. This approach can significantly reduce quiz length without a reduction in accuracy. For quizzes with an item pool of 8 questions the reduction is 29%; for quizzes with an item pool of 512 questions it is 73%. The variance of answers to questions within the same topic has little impact on assessment accuracy for empirically observed values (0.1 to 0.25), thus the algorithm is suited for situations where items do not necessarily measure the same latent skill or trait. However, the distribution of student accuracies within a class is non-normal. Even at high average class accuracies (95%), the majority of accurately assessed students is offset by a small minority of students for whom the weakest topics were incorrectly identified. To offset this property of the algorithm we recommend that educators target >99% accuracy for course-crucial UCB formative assessments. We believe UCB based formative assessment has pedagogical potential for practical applications and should be further explored. Unlike IRT and Elo rating-based assessment methods, UCB based assessment requires no question item calibration and does not depend on the debatable premise that different questions measure the same latent trait. As a consequence, the UCB method belongs to a different, sparsely explored class of formative assessment solutions that are easy to implement and maintain. It may prove to be a fresh and viable alternative to traditional linear assessment in situations where IRT and Elo methods were deemed too complex to implement and maintain. In the future, a comparative study of the UCB assessment method with established item calibration dependent methods may be of interest.