Improving Comprehension: Intelligent Tutoring System Explaining the Domain Rules When Students Break Them †

Abstract: Intelligent tutoring systems have become increasingly common in assisting students but are often aimed at isolated subject-domain tasks, without creating a scaffolding from lower- to higher-level cognitive skills; low-level skills are often neglected. We designed and developed an intelligent tutoring system, CompPrehension, which targets the comprehension level of Bloom's taxonomy. The system features a plug-in-based architecture that makes it easy to add new subject domains and learning strategies. It uses formal models and software reasoners to solve problems and judge answers, and it generates explanatory feedback about the broken domain rules and follow-up questions to stimulate the students' thinking. We developed two subject-domain models: an Expressions domain for teaching the order of expression evaluation, and a Control Flow Statements domain for code-tracing tasks. The chief novelty of our research is that the developed models are capable of automatic problem classification, determining the knowledge required to solve a problem and thus the pedagogical conditions for using it without human participation. More than 100 undergraduate first-year Computer Science students took part in evaluating the system. The results in both subject domains show medium but statistically significant learning gains after using the system for a few days; students with worse previous knowledge gained more. In the Control Flow Statements domain, the number of completed questions correlates positively with the post-test grades and learning gains. The students' survey showed a slightly positive perception of the system.


Introduction
Lately, a large number of Intelligent Tutoring Systems (ITS) have been developed for different subject domains. Less effort has been spent on categorizing the learning tasks they use and the kinds of skills they develop to provide adequate scaffolding for developing higher-level skills. The popular Bloom's taxonomy [1-5] of educational objectives and its successor models and approaches based on it, such as Bloom's Revised Taxonomy [6] and Bloom's Digital Taxonomy [7], identify six levels of cognitive skills: remembering, comprehending, applying, analyzing, synthesizing, and evaluating. Ideally, these skills should be developed in that order because higher-level skills rely on lower-level ones.

1. The knowledge level of Bloom's taxonomy "involves the recall of specifics and universals, the recall of methods and processes, or the recall of a pattern, structure, or setting" [1].
2. Comprehension "refers to a type of understanding or apprehension such that the individual knows what is being communicated and can make use of the material or idea being communicated without necessarily relating it to other material or seeing its fullest implications" [1].

In introductory programming, these levels can be interpreted as follows:

1. Remembering implies learning the symbols and syntax of a programming language to produce code.
2. Comprehending implies understanding the nature of the functions and statements of a programming language.
3. Applying means the ability to write code according to the task requirements and to define objects and their behavior using a programming language.
4. Analyzing is associated with the ability to find logical or syntax errors in program code.
5. Synthesizing implies the ability to implement complex structures in a programming language, such as modules, classes, and functions that interact with each other to solve the initial task.
6. Evaluating is related to making choices about how to approach the given problem and which data structures, programming patterns, and algorithms to apply for the task solution. Making such a choice requires an appropriate level of experience.
Usually, students' success in programming courses depends strongly on comprehending the programming concepts taught in the introductory courses, so an ITS in this area should analyze learners at various levels of cognition of these concepts to examine students' performance during the subsequent stages of the courses [10]. Furthermore, while higher-level objectives are often the true goals of education, concentrating on high-level assessments may lessen learning gains, as some students try to solve high-level problems without adequate knowledge and comprehension of the subject-domain concepts, a problem that limits their progress severely; e.g., students may have difficulty writing code even when given an algorithm because they do not make the connection between pseudocode and programming-language code [11]. One way to guide a student through the domain concepts as learning objectives, control the results, and provide feedback during the learning process is the microlearning approach: an activity-oriented approach that provides learning in small parts, including information and interactive activities for practice [12,13].
Students may gain some knowledge from simple quizzes, but developing full comprehension and learning to apply rules requires intelligent support. This makes developing ITS aimed at the comprehension and application levels an important challenge. The relative simplicity of comprehension-level assessments allows using formal subject-domain models and software reasoners [14,15]. This paves the way to the development of a multi-domain ITS where subject domains can be attached as plug-ins.
Comprehension-level tutoring systems are most useful when students are introduced to new subject domains with large numbers of new concepts. We chose introductory programming as a meta subject domain for in-depth study. Our system is unique because of a combination of factors. It uses formal subject-domain models capable of solving subject-domain problems and determining the causes of the errors in students' answers, which can be re-used in other educational software. Another important novelty is feedback in the form of questions instead of statements; this helps students by stimulating thinking and information retrieval, while also allowing the system to pinpoint the exact cause of the student's error. Furthermore, our system offers systematic coverage of all the ways to break subject-domain rules, which is used to measure the students' performance and determine suitable problems for them to solve using a pedagogical strategy based on the Q-learning approach [16,17], though the system design allows the easy integration of other approaches such as Bayesian Knowledge Tracing. This gives our system the ability to classify learning problems automatically, which is a step towards generating learning problems from existing program code without a human in the loop. Our long-term aim is to create a system capable of building problem banks in subject domains automatically, selecting the problems to give students according to the kinds of errors they made previously, and maintaining a dialog with the student (including asking follow-up questions) until the student masters the subject-domain concepts and laws.
The rest of this article is organized as follows. In Section 2, we will provide an overview of state-of-the-art works in the introductory programming domain. In Section 3, we will describe the architecture and workflow of the developed system. In Section 4, we will describe the two subject domains developed to evaluate the system. In Section 5, we will describe the evaluation of the developed system. We will follow with a brief conclusion in Section 6.
Related Works
In reviewing related works, we used the following criteria. The results are shown in Table 1 [18,28]. A common issue is missing levels of Bloom's taxonomy: an ITS may provide students only with high-level learning tasks, without adequate support for developing lower-level cognitive skills [11]; e.g., [23,24] are aimed at the synthesis level, so their feedback provides no or minimal information at the application and comprehension levels. To fully utilize its adaptive abilities, an ITS needs a large bank of different tasks. Most of the existing systems provide a limited set of predefined tasks [11,22,23,29]; in many of them, this set can be extended by a teacher [22,23,29]. One interesting and relevant approach to adaptive task generation and testing is the "notional machine" conception, i.e., an abstract computer for executing programs. This is noted in the literature as a major challenge in introductory programming learning [30,31] because teaching programming paradigms requires using multiple notional machines at different levels of abstraction. This approach covers learning at different levels of Bloom's taxonomy, including the higher ones.
Many IPTS use visualization as a powerful means of illustrating the task and its solution, making it easier to understand the problem, its possible solutions, and errors [32,33]. Interactive visualization can provide additional feedback during task solving to facilitate the student's progress [32,34,35]. Textual feedback can be helpful as well [29,36-38].
To uncover the full power of adaptive assessments, ITS needs either a way to generate new learning tasks according to the situation or a large bank of various tasks. Some works advocate problem generation on the formal subject-domain model [19,20], but the generated problems require human verification, limiting their number.

CompPrehension: Models, Architecture, and Workflow
We propose a new architecture of a multi-domain intelligent tutoring system on the comprehension level. The main goal of this architecture is the flexibility of the developed system along the four main axes, represented by the plug-in types:

1. Domain plug-ins encapsulate everything related to their subject domains, making the rest of the system domain-independent;
2. Backend plug-ins allow interaction with different software reasoners, solving tasks by using the provided laws and facts;
3. Strategy plug-ins assess the student's level of demonstrated knowledge and choose the pedagogical interventions and, if necessary, the next question;
4. Frontend plug-ins encapsulate the user interface for teachers and students.
Plug-ins exchange information through the core, but they are independent of each other, making the system's pedagogical parts (strategies) agnostic of the subject-domain knowledge, and vice versa.
The requirement to generate a semantic explanation for each possible error, including the broken rules (thus fostering understanding of the subject domain), limits the tutoring system to closed-answer questions, because determining the exact reasons for all possible errors in open-answer questions is impossible. Four kinds of questions were chosen for the developed system:

1. Single choice (choosing one element from the given set);
2. Multiple choice (choosing any number of elements from the given set);
3. Match (matching each element from the given set with an element from another set);
4. Order (ordering elements of a given set; elements may be omitted or used several times).
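A minimal sketch of how these four question kinds and their answer checks might be represented (the class and field names are illustrative, not CompPrehension's actual data model):

```python
# Illustrative data model for the four closed-answer question kinds.
from dataclasses import dataclass

@dataclass
class SingleChoice:
    options: list       # option texts
    correct: int        # index of the correct option
    def check(self, answer: int) -> bool:
        return answer == self.correct

@dataclass
class MultipleChoice:
    options: list
    correct: set        # indices of all correct options
    def check(self, answer: set) -> bool:
        return set(answer) == self.correct

@dataclass
class Match:
    left: list
    right: list
    correct: dict       # maps a left index to a right index
    def check(self, answer: dict) -> bool:
        return answer == self.correct

@dataclass
class Order:
    elements: list
    correct: list       # element indices; may repeat or omit elements
    def check(self, answer: list) -> bool:
        return list(answer) == self.correct

# An order question where element 1 is used twice and the answer
# is a sequence of element indices.
q = Order(elements=["begin loop", "iteration", "end loop"],
          correct=[0, 1, 1, 2])
```

Note that only the Order kind allows repeated and omitted elements, which is what trace-building tasks in the Control Flow Statements domain rely on.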

Comprehension-Level Model of Subject Domain
As shown in [39], comprehension-level modeling requires axiomatic definitions of all the important properties of the taught concepts. Therefore, while a subject-domain ontology serves as the basis of the model, it is enhanced with rules of two types:

• Positive rules, or productions, allow inferring the correct answer from the formal problem definition;
• Negative rules, or constraints, allow catching errors in a student's answer.
Since in comprehension-level tasks any complete sequence of correct steps leads to a correct answer, the negative rules can be inferred from the positive rules by applying negation, giving the system the full set of possible errors; each error has a template for explanation generation. One positive rule can spawn several negative rules, i.e., there may be several ways to break one law. This makes negative rules the best way to measure a student's knowledge. However, negative rules can be complex (i.e., there may be several reasons for making the same error in the given situation); in this case, the tutoring system can either give a complex explanation or generate follow-up questions to determine the particular fault reason.
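The relationship between a positive rule and the negative rules spawned by its negation can be sketched as follows (the rule and error names are illustrative, not the system's actual identifiers; the real rules are ontology rules, not Python):

```python
# Sketch: one positive rule and the negative rules obtained by negating it.
def positive_next_act(condition_value):
    """Positive rule: after a WHILE condition is evaluated,
    the single correct next act follows from the condition's value."""
    return "begin_iteration" if condition_value else "end_loop"

# Negating the positive rule enumerates the ways to break it; each
# negative rule carries an explanation template for the feedback.
NEGATIVE_RULES = {
    ("begin_iteration", False):
        "A loop ends when its condition becomes false. "
        "Since the condition is false, the iteration cannot start.",
    ("end_loop", True):
        "A WHILE loop continues if its condition is true: "
        "its new iteration must begin.",
}

def judge(student_act, condition_value):
    """Return None if the act is correct, else the error explanation."""
    if student_act == positive_next_act(condition_value):
        return None
    return NEGATIVE_RULES.get((student_act, condition_value),
                              "This act is not allowed here.")
```

Judging against the negative rules pinpoints which way the law was broken, which is exactly the information stored to measure the student's knowledge.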
While rules (laws) define all the important properties of the concepts, one type of knowledge remains attached to the concepts themselves: the student's ability to identify individual objects of a given concept in the human-readable problem definition. Everything else is taught using the rules.

Figure 1 shows the proposed architecture and the chief responsibilities of its components. Strategy plug-ins analyze the learning situation (the sequence of correct and incorrect applications of subject-domain laws during the student's previous interactions), determine the information to show to the learner and the allowed actions, and form a request for generating the next question or decide that the exercise is complete.

After conducting experiments measuring wall-clock reasoning time and memory usage for two different subject domains, we found that no single reasoner outperforms all the others. For the Control Flow Statements domain, the Jena reasoner proved to be the fastest, while for the Expressions domain, Clingo performs better (see Figures 2 and 3). Therefore, we developed backend plug-ins to integrate external reasoners for use by the subject domains.

Domain plug-ins encapsulate everything related to a particular subject domain, including the formal model for the software reasoner of both the domain (i.e., its concepts and laws) and the particular problem (individuals and facts), and human-readable versions of problem formulations and error explanations. However, subject domains are separated from both pedagogical decisions (the responsibility of strategies) and solving the problem according to the formal model (done by backends). This minimizes the effort required to add a new subject domain to the system and allows subject domains, tutoring models, and reasoners to be combined in a versatile way for better efficiency.
Another important feature of subject-domain plug-ins is tagging, allowing the system to limit the concepts and laws used in the exercise; tags let a single subject domain cover a group of similar tasks (e.g., the same tasks in different programming languages), re-using the rules.

Architecture
Frontends encapsulate the domain-independent user interface. This allows integration with modern learning management systems through standards such as Learning Tools Interoperability (LTI), the use of special interfaces to increase accessibility for particular user categories or to provide a gamified experience, and the use of mobile interfaces or even messaging software to interact with the system. Frontend plug-ins can transform complex questions into simpler ones if their user interface is restricted; e.g., a question requiring ordering a set of elements can be transformed into a series of single-choice questions "choose the next element".
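The order-to-single-choice transformation mentioned above could be sketched as follows (a simplified illustration, not the actual frontend API):

```python
# Sketch: a restricted frontend presents an ordering question as a
# series of "choose the next element" single-choice questions.
def order_as_single_choices(elements, correct_order):
    """Yield (prompt, options, correct_option_index) triples, one per
    step. `correct_order` lists indices into `elements`; repeats are
    allowed, as in the original ordering question."""
    for step, correct_idx in enumerate(correct_order):
        prompt = f"Step {step + 1}: choose the next element"
        yield prompt, list(elements), correct_idx

# An ordering question over two operators, answered in reverse order,
# becomes two single-choice questions.
steps = list(order_as_single_choices(["a+b", "c*d"], [1, 0]))
```

Each generated question can then be rendered by even the simplest user interface, such as a messaging bot.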

Typical Workflow
As domain plug-ins provide rules for finding a correct solution, the system can combine providing worked examples with guiding students through the task-solving process. A typical workflow for task solving is shown in Figure 4 and works as follows:

1. The strategy creates a request for a question based on the exercise settings, the student's performance, and the difficulty of the laws and concepts.
2. The domain generates a question based on the strategy's request, including machine-solvable and human-readable versions.
3. The backend solves the question, finding the correct answer.
4. The student reads the question and provides a (possibly partial) answer through the frontend.
5. The core transforms the student's answer into backend facts.
6. The backend judges the answer, determining its correctness, errors, and their fault reasons (the sentence).
7. The domain interprets this sentence, transforming backend facts into subject-domain law violations, and generates human-readable feedback (e.g., an error explanation).
8. The strategy adjusts the feedback level.
9. The student views the feedback, possibly requesting more feedback.
10. The strategy chooses to show a worked example for the next step, to ask a follow-up question, to continue with the current question, to generate a new question, or to consider the exercise completed (the strategy can also let the student choose among some of these options).
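The steps above can be condensed into a runnable sketch; the plug-in classes and their methods below are simplified stand-ins for the real interfaces, assumed only for illustration:

```python
# Condensed sketch of the question life cycle across the plug-in types.
class Strategy:
    def request_question(self, history):      # step 1
        return {"concepts": ["loops"], "difficulty": 0.5}
    def adjust_feedback(self, feedback):      # step 8
        return feedback
    def next_action(self, history):           # step 10
        return "new_question" if len(history) < 3 else "complete"

class Domain:
    def make_question(self, request):         # step 2
        return {"text": "Trace the loop", "model": "..."}
    def interpret(self, verdict):             # step 7
        return "correct" if verdict["errors"] == [] else "error: ..."

class Backend:
    def solve(self, question):                # step 3
        return ["cond", "iter", "end"]
    def judge(self, question, answer_facts):  # step 6
        return {"errors": [] if answer_facts == self.solve(question)
                else ["e1"]}

def run_exercise(strategy, domain, backend, student_answers):
    history = []
    for answer in student_answers:            # steps 4-5: frontend + core
        question = domain.make_question(strategy.request_question(history))
        verdict = backend.judge(question, answer)
        feedback = strategy.adjust_feedback(domain.interpret(verdict))
        history.append(feedback)              # step 9: student sees feedback
        if strategy.next_action(history) == "complete":
            break
    return history
```

The key design point is that the strategy never sees subject-domain facts and the domain never makes pedagogical decisions; they communicate only through the core.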

Developed Domains
As a proof of concept and in order to evaluate the developed system, we developed two domain plug-ins related to teaching programming: Control Flow Statements and Expressions.

Control Flow Statements
This model is aimed at teaching basic control-flow structures (sequences, alternatives, and loops) using the task of creating a trace for a given algorithm, knowing the values of the control conditions as they are evaluated. Control flow is a basic concept of the imperative programming paradigm, which makes it an important element of many introductory programming courses. The domain supports the C++, Java, and Python programming languages.
The domain concepts are defined using an ontology. The algorithm is represented as an abstract syntax tree, as in [40]; the classes used for it are shown in Figure 5. The trace is a linked list of Act (i.e., act of execution) instances connected by the has_next property; it shows the step-by-step process of the algorithm's execution (see Figure 6). Single Act instances are created for leaf nodes of the algorithm tree; a pair of corresponding acts (Act_begin and Act_end) represents the boundaries of the other nodes, enclosing the acts of the nested statements. The acts that evaluate the control conditions of alternatives and loops contain the values of their conditions. Currently, the domain model contains 29 algorithm elements, 7 trace acts, 29 kinds of errors, and 27 explanations for correct acts (92 classes in total), as well as 30 properties, 16 positive rules, 41 negative rules, and 52 helper rules (totaling 99 underlying rules).

The Control Flow Statements domain uses order tasks, requiring the student to put the acts of execution of the control-flow statements in the correct order; some statements can be executed several times, while others may not be executed at all. Figure 7 shows an example of the problem text and a partial answer. The student can use the "play" and "stop" buttons embedded in the algorithm to add more acts of execution. If the new act is correct, it is added to the trace; if it is incorrect, the student receives a message explaining their error and can try another trace element. It is also possible to ask the program to demonstrate a correct step if the student is lost, exploring the example. The ability to demonstrate some steps while leaving the others to the student allows our system to narrow down existing questions if the strategy needs to verify or develop the student's knowledge regarding a particular subject-domain law.
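The trace structure described above might be sketched as follows (the class and property names informally mirror the ontology; this is an illustration, not the domain's actual OWL model):

```python
# Sketch of the trace: a linked list of execution acts, with begin/end
# act pairs enclosing the acts of nested statements.
class Act:
    def __init__(self, kind, node, condition_value=None):
        self.kind = kind                  # "single", "begin", or "end"
        self.node = node                  # algorithm-tree node it executes
        self.condition_value = condition_value  # set for condition acts
        self.has_next = None              # link to the following act

def build_trace(acts):
    """Link a list of Act objects via has_next and return the head."""
    for prev, cur in zip(acts, acts[1:]):
        prev.has_next = cur
    return acts[0] if acts else None

# A WHILE loop whose condition is false at once: begin, condition, end.
trace = build_trace([
    Act("begin", "while(ready)"),
    Act("single", "cond ready", condition_value=False),
    Act("end", "while(ready)"),
])
```

Representing the trace as linked facts lets the reasoner walk it act by act and check every transition against the positive and negative rules.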
Consider the example shown in Figure 7. The student began the trace correctly, but then made an error, starting a loop iteration after its control condition evaluates to false.
The positive rule calculating the correct Act at this point (the end of the loop because the condition is false) is shown in (1).
The negative rule (39th in Table 2) that caught the error and its context (precursor) is shown in (2).

Expressions
This domain is aimed at teaching the order of expression evaluation using the primary task of choosing the next operator to be evaluated (i.e., an ordering question over the operators in the given expression, which requires using each operator once and can have several correct orders). Understanding expression evaluation is a basic skill in computer science education; its mastery serves as a basis for developing more complex skills such as analyzing and writing program code. The domain supports the C++ and Python programming languages; support for other languages can be added using tags for subject-domain laws.
The order of evaluation in modern programming languages is a partial order. The evaluation order is defined by the sequenced-before relation, which is based on the abstract syntax tree (AST) of the expression; it can be represented as a directed acyclic graph in which child nodes must be evaluated before their parent nodes.

Table 2. Kinds of errors in the Control Flow Statements domain and examples of their explanation messages.

3. WrongCondNeighbour: Why did you execute branch "if-green" when the condition "green" is false?
4. BeginEndMismatch: Bad trace: loop "waiting" ended as selection "choose".
5. EndedDeeper: An action ends only when all its nested actions have ended, so A cannot end until K ends, as K is a part of A.
6. EndedShallower: Selection "choose" cannot end loop "waiting" as loop "waiting" contains selection "choose".
7. WrongContext: A cannot be executed inside of B because A is not a direct part of B.
8. OneLevelShallower: A cannot be executed within C because A is an element of P, so start P first.
9. TooEarlyInSequence: A sequence executes its nested actions in order, so B cannot be executed before A.
10. TooLateInSequence: A sequence executes its nested actions in order, so A cannot be executed after B.
11. SequenceFinishedTooEarly: A sequence always executes all its actions. The sequence A cannot finish until actions X, Y, Z are executed.
12. SequenceFinishedNotInOrder: Sequence "if-ready" cannot end until it starts.
13. DuplicateOfAct (of sequence): A sequence executes each of its actions once, so each execution of P can contain only one execution of X.
14. NoFirstCondition: Selection statement "choose" should start with evaluating its first condition "red".
15. BranchNotNextToCondition: Selection statement "choose" can execute the branch "if-red" right after condition "red" only.
16. ElseBranchNotNextToLastCondition: Selection statement "choose" cannot execute the branch "ELSE" until its condition "green" is evaluated.
17. ElseBranchAfterTrueCondition: Selection statement "choose" must not execute its branch "ELSE" since condition "green" is true.
18. CondtionNotNextToPrevCondition: Selection statement "choose" can evaluate its condition "green" right after the condition "red" only, if "red" is false.
19. ConditionTooEarly: Selection statement "choose" cannot evaluate its condition "green" until the condition "red" is evaluated.
20. ConditionTooLate: Selection statement "choose" should evaluate its condition "green" earlier, right after condition "red" is evaluated.
21. ConditionAfterBranch: Selection statement "choose" must not evaluate its condition "green" because the branch "if-red" was executed.
22. DuplicateOfCondition: Selection statement "choose" must not evaluate its condition "red" twice.
23. NoNextCondition: A selection statement evaluates its conditions in order up to the first true condition. Selection statement "choose" should evaluate its condition "green" next because the condition "red" is false.
24. BranchOfFalseCondition: A selection statement executes its branch only if its condition is true. Selection statement "choose" must not execute the branch "if-green" because its condition "green" is false.
25. AnotherExtraBranch: A selection statement executes only one branch. Selection statement "choose" must not start its branch "else" because the branch "if-red" was executed.
26. BranchWithoutCondition: A selection statement executes its branch when the branch condition evaluates to true. Selection statement "choose" must not execute the branch "if-red" without evaluating its condition "red" first.
27. NoBranchWhenConditionIsTrue: A selection statement executes its branch when the corresponding condition is true. Selection statement "choose" must execute the branch "if-red" because its condition "red" is true.
28. LastFalseNoEnd: When all conditions of a selection statement are false and an "ELSE" branch does not exist, the selection does nothing. Selection statement "choose" does not have an "else" branch, so it must finish because its condition "green" is false.
29. AlternativeEndAfterTrueCondition: When a condition of a selection statement evaluates to true, the selection executes the corresponding branch. Selection statement "choose" should not finish until the branch of the successful condition "red" is executed.
30. NoAlternativeEndAfterBranch: A selection statement finishes after executing one branch. Selection statement "choose" executed its branch "if-green" and should finish.
31. LastConditionIsFalseButNoElse: A selection statement executes its "ELSE" branch only if all conditions are false. Selection statement "choose" must execute its branch "ELSE" because the condition "green" evaluated to false.
32. NoIterationAfterSuccessfulCondition: A WHILE loop continues if its condition is true: its new iteration must begin. A new iteration of the loop "waiting" must begin because its condition "ready" is true.
33. LoopEndAfterSuccessfulCondition: A WHILE loop continues if its condition is true: its new iteration must begin. It is too early to finish the loop "waiting" because its condition "ready" is true.
34. NoLoopEndAfterFailedCondition: A WHILE loop ends when its condition becomes false. As the condition "ready" is false, the loop "waiting" must end.
35. LoopEndsWithoutCondition: Since the condition "ready" is not evaluated yet, the loop "waiting" must not end.
36. LoopStartIsNotCondition: A WHILE loop is a pre-test loop, so the loop "waiting" should start by evaluating its condition "ready".
37. LoopStartIsNotIteration: A DO loop is a post-test loop; therefore, loop "running" should begin with an iteration.
38. LoopContinuedAfterFailedCondition: A loop ends when its condition becomes false. Since condition "ready" is false, loop "running" cannot continue.
39. IterationAfterFailedCondition: A loop ends when its condition becomes false. Since condition "ready" is false, the iteration cannot start.
40. NoConditionAfterIteration: After an iteration of a DO-WHILE loop, its condition must be evaluated to determine whether to continue the loop or finish it. After an iteration of loop "running", its condition "ready" should be evaluated.
41. NoConditionBetweenIterations: After an iteration of a DO-WHILE loop, its condition must be evaluated to determine whether the loop continues or ends. Before proceeding to the next iteration of loop "running", its condition "ready" should be evaluated.

Figure 8 demonstrates the evaluation-order graph for expression 3 in the C++ programming language; the purple arrow between the rectangles shows that the left operand of || (the logical OR operator) should be evaluated before any operator from its right operand. This notation is used to avoid drawing too many lines between the operators in the left and right rectangles.
The ontology rules we developed determine the sequenced-before relations between the expression's operators (positive rules) and compare them with the student's evaluation order to find inconsistencies and classify the errors by matching (negative rules). The domain code, written in Java, can then generate error messages from the natural-language templates associated with the negative rules. The formal reasoning model receives the expression split into tokens; for each token, its position in the expression and its text are set. An abstract syntax tree for the given expression is then built. AST construction is an iterative process: at each step, an operator whose operands are already evaluated is found, and its operands are determined. To make the model compatible with reasoners that can only add new facts (not change or delete existing ones), we needed special measures.
We copied the tokens to create a separate layer for every step of AST construction so that we could assign different labels to the same token at different steps, storing its current state. A unique data property "step" is set for the tokens in each layer. This is performed sequentially: after a new operator and its operands are found, all the specified properties of the current layer are copied to the next one, except the auxiliary properties that do not make sense at the next iteration (such as the property marking the operator with the highest priority) and the properties for the token labels that change between steps.
There are three token states represented by labels: initialized, evaluated, and applied. The initialized state is used for operator tokens that have not been evaluated yet. The evaluated label is used for tokens that have already been evaluated during previous steps but have not yet been used as operands of other evaluated operators; operand tokens (variables and literals) start in this state. The applied label is used for tokens that are already marked as direct operands of some evaluated operator. At the end of AST building, only one token remains in the evaluated state (the root operator of the syntax tree); all other tokens should be in the applied state.
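One AST-construction step under an add-only reasoner can be sketched as follows (a simplified illustration of the layering idea, not the actual ontology rules):

```python
# Sketch of one AST-construction step: tokens are copied to a new layer
# (a new "step" value) instead of being mutated, and their state labels
# change between layers.
def apply_operator(layer, op_pos, operand_positions):
    """Copy `layer` (a dict mapping position -> token dict) to the next
    step, marking the chosen operator `evaluated` and its operands
    `applied`. The previous layer is left untouched (add-only)."""
    next_layer = {}
    for pos, token in layer.items():
        copy = dict(token)                  # never mutate existing facts
        copy["step"] = token["step"] + 1
        if pos == op_pos:
            copy["state"] = "evaluated"
        elif pos in operand_positions:
            copy["state"] = "applied"
        next_layer[pos] = copy
    return next_layer

# Expression "a + b": operands start evaluated, the operator initialized.
layer0 = {
    1: {"text": "a", "state": "evaluated", "step": 0},
    2: {"text": "+", "state": "initialized", "step": 0},
    3: {"text": "b", "state": "evaluated", "step": 0},
}
layer1 = apply_operator(layer0, 2, {1, 3})
```

After this step, "+" is the only token in the evaluated state, i.e., the root of the (one-operator) syntax tree.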
The built AST serves as the basis for the graph of sequenced-before relations between the operators. Some operators require a strict order of evaluation for their operands; this leads to adding more edges to the graph. For example, in the C++ programming language, the logical operators and the comma operator require a full evaluation of their left operand before the evaluation of any operator belonging to their right operand; the ternary operator also evaluates its first operand before the others. After taking into account these relationships, the full graph of the operators' dependencies, encompassing every possible correct answer to the question, is built.
A student should click on the operators in the expression in the order they are evaluated. The student's answer is passed to the reasoner as a sequence of operator evaluations. If the order of a pair of operators linked by a sequenced-before graph edge contradicts the student's answer (i.e., the parent is evaluated before the child), a student's error is detected. Using the information from the edge's nodes and the type of the violated edge, the negative rules classify the error and link all the necessary nodes to it (errors concerning the strict operand order need the operator that causes the strict-order relationship linked to them to generate the full message). This information is used to store the student's error in the database (to evaluate the student's knowledge and better choose the next question) and to generate the error message shown to the student. Figure 9 shows the error messages for an attempt to evaluate the addition operator at position 4 that broke two subject-domain rules: the first is related to operator precedence, the second to operator associativity. The list of all possible basic errors is shown in Table 3.
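The core of this check, judging the student's evaluation order against the sequenced-before edges, can be sketched as follows (a simplified illustration; the real check is performed by the reasoner's negative rules):

```python
# Sketch of judging a student's answer against the sequenced-before
# graph: an edge (u, v) means operator u must be evaluated before v.
def find_violations(edges, student_order):
    """Return the edges contradicted by `student_order` (operators are
    identified by their positions in the expression)."""
    rank = {pos: i for i, pos in enumerate(student_order)}
    return [(u, v) for (u, v) in edges
            if u in rank and v in rank and rank[u] > rank[v]]

# For "a + b * c": * (pos 4) is sequenced before + (pos 2) because of
# its higher precedence.
edges = [(4, 2)]
errors = find_violations(edges, [2, 4])   # student evaluated + first
```

Each violated edge, together with its type (precedence, associativity, parentheses, or strict operand order), selects the negative rule and thus the message template to instantiate.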

Table 3. Kinds of errors in the Expressions domain and examples of their explanation messages.

HigherPrecedenceRight: Operator * at pos 4 should be evaluated before Operator + at pos 2 because Operator * has higher precedence.
HigherPrecedenceLeft: Operator + at pos 4 should be evaluated before Operator < at pos 6 because Operator + has higher precedence.
LeftAssociativityLeft: Operator + at pos 2 should be evaluated before Operator − at pos 4 because Operator + has the same precedence and left associativity and evaluates left to right.
RightAssociativityRight: Operator = at pos 6 should be evaluated before Operator += at pos 4 because Operator = has the same precedence and right associativity and evaluates right to left.
InComplex: Operator + at pos 2 should be evaluated before Operator * at pos 5 because an expression in parentheses is evaluated before the operators outside of it. Operator / at pos 4 should be evaluated before parenthesis ( at pos 2 because function arguments are evaluated before the function call.
StrictOperandsOrder: Operator < at pos 2 should be evaluated before Operator > at pos 6 because the left operand of the Operator || at pos 4 must be evaluated before its right operand.

However, these error messages and situations are complex and may not be easily understandable for all students. Furthermore, there are many possible causes for the same error. Consider the situation in Figure 9. The student thinks that the addition operator at position 4 can be evaluated, but why? There are many possible causes, for example:

•	The student does not know that addition is left-associative;
•	The student does not know that precedence must be checked before associativity;
•	The student does not know the relative precedence of addition and multiplication.
To learn the true cause of the student's error and to stimulate thinking about the problem, the domain can ask the student a series of small multiple-choice questions related to the error made (follow-up questions), which the student can obtain by pressing the "More details" button or which appear if the strategy plug-in decides that the student should answer them.
Follow-up questions are shown in a subsection of the specific error (see Figure 10). Messages about the correctness of answers to follow-up questions are shown for several seconds below the question; correctness is also emphasized by the message color (green for correct answers, red for wrong ones). After that, the next question or the final message is shown. This dialog lets the program determine the exact subject-domain fact that the student got wrong and give a detailed explanation. Sometimes, however, the student understands their error while answering the follow-up questions, and then the resulting message must be affirmative; consider the example dialog between the student and the system shown in Figure 10. Follow-up questions are defined as a finite-state machine in the domain code. Figure 11 shows a part of the graph of follow-up questions in the Expressions domain: the questions regarding operator precedence and associativity.
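A minimal sketch of such a finite-state machine, assuming a dict-based encoding (the actual domains define these transitions in plug-in code; all state names and question texts here are invented):

```python
# Hypothetical encoding of a follow-up question dialog as an FSM.
# Non-terminal states hold a question and answer->state transitions;
# terminal states hold the final explanatory or affirmative message.
FOLLOW_UP = {
    "start": {
        "question": "Which operator has higher precedence: + or *?",
        "answers": {"+": "explain_precedence", "*": "ask_associativity"},
    },
    "ask_associativity": {
        "question": "Is + left- or right-associative?",
        "answers": {"left": "done_correct", "right": "explain_assoc"},
    },
    "explain_precedence": {"message": "* has higher precedence than +."},
    "explain_assoc": {"message": "+ is left-associative."},
    "done_correct": {"message": "Right: so the leftmost + is evaluated first."},
}

def next_state(state, answer):
    """Advance the dialog: return the next state for the given answer."""
    return FOLLOW_UP[state]["answers"][answer]

def is_terminal(state):
    """A terminal state carries a final message instead of a question."""
    return "message" in FOLLOW_UP[state]
```

Walking the machine from "start" either pinpoints the exact fact the student got wrong (an explanatory terminal state) or confirms that the student has understood the error (an affirmative one).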

Evaluation
First-year undergraduate Computer Science students of Volgograd State Technical University participated in the evaluation of our system. They were asked to complete four steps:

1.	Pass a pre-test to measure their baseline knowledge (maximum grade is 10);
2.	Work with the CompPrehension system to learn more, preferably until the system reports that the exercise is complete;
3.	Pass a post-test to measure their learning gains (maximum grade is 10);
4.	Participate in a survey about the system they used.
The acquired knowledge was related to the topics of the "Informatics" (CS0) course they were taking at the time and could improve their grades, which motivated them to participate. They were given one week to complete everything. The survey was anonymous so that the students could express their opinions freely. Out of the 151 students who attempted the first pre-test (for the Expressions domain), 88 completed all the tasks. Table 4 shows in detail how many students completed each task. The smaller number of students attempting the Control Flow Statements domain can be explained by its higher complexity and the effort required to complete the exercise, as a typical algorithm trace contains significantly more elements than a typical expression. Another possible explanation is that the Control Flow Statements domain came second in their online course, so some students dropped out after completing the exercises for the first domain.

Expressions Domain
The Expressions domain relied partially on the students' previous knowledge: most of them should have been acquainted at least with the concept of operator precedence from school. Relatively high pre-test grades were therefore expected, and these expectations were confirmed: the students solved more than 50% of the pre-test on average.
The maximum grade for the pre-test and post-test was 10. As Table 5 shows, the average learning gains were relatively small but, combined with the previous knowledge, let the students master more than 2/3 of the subject-domain knowledge after a short exercise. A paired two-tailed t-test showed that the learning gains are statistically significant. The increase in standard deviation shows that the students did not learn uniformly: some of them benefited from using the system more than others. Most of the students' pre-test grades were between 5 and 6 (out of 10), as shown in Figure 12. We divided the students into two groups by their previous knowledge: students with low pre-test grades (less than 6) and students with high pre-test grades (more than 6). Their learning gains were different (see Table 6): the students with less initial knowledge learned much more. An independent two-tailed t-test showed that this difference is statistically significant (p = 0.001). This confirms the hypothesis that developing comprehension of concepts helps lower-performing students more. One possible argument against this is that the students with poor initial knowledge had more to gain than the students with good initial knowledge. To assess this, we calculated relative learning gains by dividing the absolute learning gains of each student by their maximum possible gains; the difference between the students with low and high pre-test grades remains statistically significant for relative gains, though with a much larger p = 0.03. Table 6. Absolute learning gains for the students with low and high pre-test grades in the Expressions domain.

                     Low Pre-Test   High Pre-Test   Significance
Number of students   86             57
Avg. learning gain   2.69           1.92            p = 0.001

Table 7 shows the statistics on how the students used the Expressions domain in the system: the number of questions generated for each student, the number of solution steps the student made, the percentage of correct steps, the number of hinted steps, and the number of follow-up question series. Zeroes in the minimum number of steps show that some students did not try to solve CompPrehension's questions; they only asked for hints of correct steps. The students could also skip some of the questions. On average, a student received 14 questions but fully solved only about five of them, making 36 solution steps (each step required one mouse click). The exercise was done as homework, and the students were allowed to move to the next question without fully solving the previous one, which led to a rather low percentage of completed questions. The average percentage of correct steps was more than 80%, but in difficult situations the students asked for correct-solution hints. The low usage of follow-up questions is understandable because this topic is easy for college-level students, so the error explanations were mostly sufficient. The average number of steps (clicks) to solve a question was 2.34; the average percentage of hinted steps was 11%. We found no significant correlation between these variables and learning gains. Generally, while the students' average learning gain of 1.6 out of 10 does not look impressive, it is significant for an exercise that required 36 mouse clicks on average.
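The gain measures used above can be sketched as follows, assuming the relative gain is the absolute gain normalized by the maximum possible gain on a 10-point test, as described (function names are ours, not the system's):

```python
# Sketch of the (assumed) gain formulas on a 10-point test:
# absolute gain = post - pre; relative gain = (post - pre) / (10 - pre).
def absolute_gain(pre, post):
    """Raw improvement between pre-test and post-test grades."""
    return post - pre

def relative_gain(pre, post, max_grade=10):
    """Improvement as a fraction of the maximum possible improvement."""
    if pre >= max_grade:  # a perfect pre-test leaves nothing to gain
        return 0.0
    return (post - pre) / (max_grade - pre)
```

For example, a student going from 5 to 7 has an absolute gain of 2 and a relative gain of 0.4, since only 5 points of improvement were possible.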

Control Flow Statements Domain
Fewer students chose to participate in the evaluation of the Control Flow Statements domain than in the evaluation of the Expressions domain, possibly because it took more effort. Still, the percentage of students who finished the exercises and the post-test (90%) was almost the same as for the easier Expressions domain. Out of 129 students who attempted the post-test, only 13 finished their exercise attempts in the system; two of them finished two attempts. This is understandable, as there were many laws to learn in this domain and each question in an attempt required more work, while the system was conservative and considered an exercise attempt complete only when the student had demonstrated knowledge of all the subject-domain laws.
The learning gains after using the CompPrehension system were about the same as for the Expressions domain, with a slightly lower knowledge level, which can be explained by the complexity of this domain. The t-test (paired, two-tailed) showed the statistical significance of absolute learning gains (see Table 8). The histogram of the pre-test and post-test grades in Figure 13 shows that after working with our system, the students were divided into two distinct groups: students with poor knowledge (grade less than 4) and students with good knowledge (grade 6 or more). As in the Expressions domain, the students with lower pre-test grades (less than 6 out of 10) gained more than the students with high pre-test grades. This difference (see Table 9) is statistically significant for absolute learning gains (independent two-tailed t-test, p = 0.009) but narrowly misses significance for relative learning gains (p = 0.06). This supports the hypothesis that poorly performing students gain more by using comprehension-level intelligent tutoring systems. Usage statistics are shown in Table 10. The students solved about half as many questions as in the Expressions domain but performed twice as many solution steps, which is understandable because the Control Flow Statements domain has longer questions. In this domain, the students completed most of the questions they started, unlike in the Expressions domain, which shows their interest and engagement. The percentage of correct steps was lower than in the Expressions domain: the more complex subject domain made students err more often. However, the increase in the use of correct-step hints was not large, and the ratio of hinted steps to all solution steps decreased significantly, which shows that the error-explanation messages were useful most of the time.
We did not use follow-up questions in the Control Flow Statements domain in this evaluation because its error explanations are simpler than in the Expressions domain: in the Control Flow Statements domain, each situation has only one correct next step. The values of Pearson's correlation coefficient between different variables for the Control Flow Statements domain are shown in Table 11. The weak negative correlation between the percentage of errors (failed interactions) and the pre-test grade is trivial, showing that better-prepared students made fewer errors. More interesting are the correlations with the post-test grades: there was a medium positive correlation between the number of completed questions and the resulting knowledge, and a medium negative correlation between the percentage of errors and the resulting knowledge. The number of completed questions shows a consistent influence on learning gains, with a weak positive correlation with absolute learning gains and a medium positive correlation with relative learning gains. This shows that the students who did not hurry to the next question, completing each question instead, gained more knowledge.
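For reference, Pearson's correlation coefficient reported in Table 11 can be computed as below (a standard stdlib-only sketch, not the analysis code used in the study):

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

The coefficient ranges from −1 (perfect negative linear relationship, e.g., percentage of errors vs. post-test grade trending down) to +1 (perfect positive relationship, e.g., completed questions vs. learning gains trending up).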

Survey
Out of the students who participated in the evaluation, only 88 completed the final survey. The students were asked eight questions using a Likert scale (1 = strongly disagree, 5 = strongly agree). We used both positive and negative statements about the system to avoid bias in question formulation. The results are shown in Table 12. While most average results are close to neutral, they are consistently above three for the positive statements and below three for the negative statements, which shows moderate support for the developed system. The main problems the students noted in the free-text part of the survey were long waits for server responses (the server did not bear the load of more than 100 students well during the first days of the evaluation) and the small amount of theoretical material supporting the exercises.

Discussion
In this article, we described an intelligent tutoring system aimed at developing comprehension of subject-domain concepts in programming and other well-formalized domains. Some similar systems (e.g., [41]), even when working in well-formalized domains such as mathematics and programming, can evaluate and provide feedback only for errors in the answer to the final task, or for some of the steps of solving it, which makes the feedback more general and decreases its quality. This also limits the system's ability to determine the exact cause of a student's error, and thus to update the model of the student's knowledge, because several kinds of errors can lead to the same wrong answer step.
We enhance this approach by introducing advanced feedback in the form of follow-up questions based on the Socratic questioning approach [42,43]. It requires active student participation, thus stimulating information retrieval and thinking in subject-domain terms, and allows the system to determine the exact cause of the error the student made. Using questions in addition to statements (a common form of feedback in ITS) advances the field and opens the way for new research determining when and where questions are more effective than regular feedback.
A common approach to generating feedback (see [44]) is based on classifying it by level of detail (e.g., abstract, concrete, and bottom-out levels), but this often makes the feedback content non-systematic: sometimes it is aimed at explaining to the student why they are wrong, while in other cases it just tells the student what to do to proceed to the correct answer, without developing comprehension. Both kinds of feedback can co-exist in the same system, e.g., [45]. Our approach advances the field by concentrating on feedback about the subject-domain rules and facts that the student violated when making an error and by using these violations to measure the students' knowledge. Our formal models deliberately avoid comparing a student's answer with the correct answer, which makes it easy to support domains with many possible correct answers.
The same data about the causes of the students' errors are used both for providing feedback and for choosing the next question to support learning, unlike in [11,22,23] and most of the other solutions reviewed. To estimate the student's knowledge of the subject domain, a model based on the Q-learning approach is used, but other models such as Bayesian Knowledge Tracing can easily be added through the interface for strategy plug-ins. This allows using our approach in guided learning [46,47], exporting these data for review by teachers and tutors, and fostering integration with Learning Management Systems and Intelligent Textbooks [48] using common domain models.
While developing the described ITS, we faced the following challenges:

1.	decoupling pedagogical strategies from subject-domain models;
2.	decoupling the subject-domain-specific user interface from the general question interface;
3.	systematically covering all the ways to violate subject-domain rules by negative laws;
4.	developing a question base satisfying all possible requests for generating questions.
To make the plug-ins implementing pedagogical strategies agnostic of the subject domains, we needed a way to define the knowledge space of a subject domain, i.e., all the kinds of facts, rules, and laws students should learn (develop mastery of). We found negative laws, i.e., the ways to make an error, to be the best measure of the knowledge required to solve a specific question because they are more detailed than the rules for making correct steps (positive laws). The knowledge space also includes concepts, but they are used to measure the student's ability to recognize objects of that concept in the problem formulation; everything else is modeled using rules. Using this knowledge space, a strategy can measure the student's knowledge and mastery and formulate requests for new questions based on the kinds of errors that can be made in them.
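A strategy's request for a new question over such a knowledge space could be sketched as follows (the bank entries and matching logic are illustrative; the negative-law names are taken from Table 3):

```python
# Hypothetical question bank: each question is tagged with the negative
# laws (possible errors) it can exercise. The matching function is a
# sketch of how a strategy's request could be satisfied.
QUESTION_BANK = [
    {"id": "q1", "laws": {"HigherPrecedenceRight", "LeftAssociativityLeft"}},
    {"id": "q2", "laws": {"StrictOperandsOrder"}},
]

def select_question(target_laws):
    """Pick a question exercising any law the student should practice."""
    for q in QUESTION_BANK:
        if q["laws"] & target_laws:  # non-empty intersection
            return q["id"]
    return None  # no suitable question in the bank
```

A strategy that decides the student needs practice with strict operand order would request `{"StrictOperandsOrder"}` and receive a question tagged with that law.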
The user interface subsystem faced a similar problem because many subject domains benefit from domain-specific interfaces. The main challenge there was placing as few restrictions on possible question formulations as possible while embedding the answer controls right into the formulation (cf. the examples in the Control Flow domain above: while it would be possible to let the student select the next statement to execute using the regular multiple-choice interface with buttons below the algorithm, this is highly impractical because of the number of possible choices, all of which have corresponding parts in the question formulation). Our solution was passing the question formulation from the domain plug-in as HTML code, supplemented with an array of "answer objects", i.e., the objects from which the answer is built, containing button identifiers for the buttons generated by the domain plug-in. This way, the system core, user interface, and strategy can keep track of the student's answer without knowing how it is represented in the question formulation.
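A hypothetical sketch of such a question payload (the field names and structure are invented for illustration and are not the system's actual API):

```python
# Illustrative payload a domain plug-in might return: the formulation
# as HTML with embedded answer buttons, plus the answer objects that
# the system core tracks independently of the rendering.
question_payload = {
    "html": (
        '<p>a <button id="op-plus-2">+</button> '
        'b <button id="op-mul-4">*</button> c</p>'
    ),
    "answer_objects": [
        {"id": "op-plus-2", "kind": "operator", "pos": 2},
        {"id": "op-mul-4", "kind": "operator", "pos": 4},
    ],
}
```

The core only needs the `answer_objects` array to record and grade clicks, while the HTML is opaque to it; the only contract is that every answer object's identifier appears on a control inside the formulation.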
Other challenges in developing this system were systematically covering the entire error space and building a question bank able to satisfy the requests for questions generated by the strategy over the given knowledge space. They required a lot of work from the team (though this work is needed only once per subject domain), and we are going to address them in further research. To cover the error space systematically, we need a formalized representation of the taught intellectual skill, which can be achieved using decision trees. The question bank can be significantly expanded by mining question formulations from existing program code. The key feature of our subject-domain models that enables this is that they can automatically determine the list of possible errors for a given task, i.e., there is no need for a human question author to specify which question templates correspond to which topics within the subject domain. This allows large-scale question generation without teacher intervention.

Conclusions and Future Work
We designed a domain-independent comprehension-level ITS CompPrehension and developed a working prototype, including two subject domains (Control Flow Statements and Expressions), two backends (SWRL and Jena rules), the LTI Frontend, and a basic strategy. The domains contain tag sets for the C++ and Python programming languages.
The system is capable of selecting a problem from the problem base according to the student's performance, solving it using software reasoning, grading answers, determining fault reasons, showing explanatory feedback, asking follow-up questions, and showing worked examples. The main difference between our system and most other intelligent programming tutoring systems is that it consistently displays information about the broken subject-domain rules in its error messages. Integrating a new subject domain requires developing its domain plug-in. As all the questions in the developed system are automatically linked with the concepts and subject-domain laws that should be known for a correct solution, they can be easily integrated with Intelligent Textbooks [48].
The system's evaluation showed that for both domains, doing exercises produced moderate learning gains that were greater for the low-performing students. Since the students did not receive any additional instruction on the topic during the evaluation, these learning gains can be attributed only to the developed system. This is in line with the findings of Kumar [44] for code-tracing problems. For exercises in the Control Flow Statements domain, the percentage of completed questions was positively correlated with the post-test grades and learning gains while not being correlated with the pre-test grade. This shows an influence on how much the students learned; it may be beneficial to disable the ability to start a new question without completing the previous one. The percentage of errors was negatively correlated with both the pre-test and post-test grades but not with learning gains; this only shows that better-performing students make fewer errors.
The developed system exposes the properties of subject-domain concepts through simple questions, verifying and developing students' comprehension. This supports students in doing higher-level tasks by ensuring that they understand the concepts they use. The system is limited to the comprehension level and is most effective in introductory courses in new subject domains, such as programming, mathematics, or natural languages, where students need to understand a significant number of new concepts and learn to handle them in typical situations. As the evaluation shows, worse-performing students gain more from such systems.
The main problem of the developed system is the small number of predefined questions compared to the number of subject-domain laws and possible errors, given that learning may require answering several questions concerning the same subject-domain law in a row. The properties of the developed formal models allow for large-scale question generation by mining existing open-source code, which is the most important direction of further work. Another direction is developing a method to build sets of follow-up questions from the formal description of the subject domain, because creating and debugging them manually is a time-consuming process.