Prediction of Confusion Attempting Algebra Homework in an Intelligent Tutoring System through Machine Learning Techniques for Educational Sustainable Development

Abstract: Incorporating substantial sustainable development issues into teaching and learning is the ultimate task of Education for Sustainable Development (ESD). The purpose of our study was to identify the confused students who had failed to master the skill(s) given by the tutors as homework using the Intelligent Tutoring System (ITS). We focused on ASSISTments, an ITS, in this study and scrutinized the skill-builder data using machine learning techniques and methods. We used seven candidate models: Naïve Bayes (NB), Generalized Linear Model (GLM), Logistic Regression (LR), Deep Learning (DL), Decision Tree (DT), Random Forest (RF), and Gradient Boosted Trees (XGBoost). We trained, validated, and tested the learning algorithms, performed stratified cross-validation, and measured the performance of the models through various performance metrics, i.e., ROC (Receiver Operating Characteristic), Accuracy, Precision, Recall, F-Measure, Sensitivity, and Specificity. We found that RF, GLM, XGBoost, and DL were the high accuracy-achieving classifiers. However, other approaches, such as detecting unexplored features that might be related to the forecasting of outputs, could also boost the accuracy of the prediction model. Through machine learning methods, we identified the group of students that were confused when attempting the homework exercise, to help foster their knowledge and talent to play a vital role in environmental development.


Introduction
Intelligent tutoring systems (ITSs) and Massive Open Online Courses (MOOCs) have similar educational approaches. However, they differ in many aspects. For instance, an ITS facilitates instant feedback and scaffolded practice in solving pedagogical problems. While learning through web-based interfaces, students have the opportunity to take hints and watch videos on related topics. They also receive guidance on how to practice concepts and arrive at the right answer. A MOOC, on the other hand, provides much more interactive learning through the learning management system (LMS), in which various forms of instructive video lectures, moderated discussion boards (MDBs), and forums are available for learning with peer feedback [1]. feedback from ITS causes student confusion. Therefore, as per education for sustainable development, we found the gap in which every student wishes to nurture their talent, knowledge, and experience to become a responsible member and citizen of society during their ongoing development. However, the ultimate goal of sustainable development is to eliminate common issues and frustrations, which can only be achieved when we as people, students, teachers, etc. find remedies for these issues by drawing on instant feedback, hints, and help from technology and humans. Confusion among students while using an ITS is a major hindrance to maintaining sustainable development, leading to an inability to meet the needs of the present.
Wang et al. [14] discourage manual classification because it is imperfect, irreproducible, and dependent on the human eye. Thus, we used digitized, computerized classification, as our data were diversified. Using machine learning classifiers allowed us to categorize students who were confused and those who were not. Machine learning algorithms work on statistical principles, and for this objective we used the statistical programming language R and RapidMiner 8.1 to analyze and predict the confused students in an ITS skill-builder session and to present the results.
To accomplish this task, the following was our research question, which we considered in order to bridge the gap.
 Can we categorize which machine learning algorithms are the best fit to classify mastery skill learning confusion among the students using skill-builder in an intelligent tutoring system from chosen skills?
Further, the structure of the paper is as follows: In Section 2, we present a short overview of related works and research on the particular subject; in Section 3, we define related methods used, and proposed predictive methods; in Section 4, we interpret the results and discuss prediction performance; in Section 5, we leave the reader with concluding thoughts, shortcomings, and future recommendations.

Related Works
ITSs are computer-based programs that have been developed for different subject areas (e.g., algebra, physics, science, medicine, statistics, etc.) to help students and learners obtain domain-specific intellectual knowledge. They also model the mental and cognitive states of students and learners to provide tailored instruction through instant and prompt feedback. Such a system provides an interface that presents and receives information to communicate with learners. For instance, while learning a concept in the subject domain (e.g., algebra), the learner can interact through the interface to solve problems by requesting hints or answering questions [15].
ITSs distinguish learning for students from various abilities, understandings, and performances. Variations in learning style are taken into account to distinguish learning. Learning style can be predicted by taking independent behavior variables during tutoring discussions. Fuzzy trees have been induced to predict the learning style of individuals. Outcome classification has been observed due to the automatic behavior of learning from a collection of data [16].
The material for the mathematics course in ASSISTments comprises problems with solutions and timely hints. Furthermore, substantial assistance is readily available over the Internet for the problems that students solve online. Another type of material, named "Skill-builders", is designed specifically for mastery-focused skills training, as discussed above. At the moment, ASSISTments covers more than 300 topics related to middle school mathematics, and teachers can assign skill-builders to pupils, allowing them to practice those problems while emphasizing the desired skill(s) until they meet the pre-defined standard for accuracy [9].
Nonetheless, very limited research has examined the importance of ITS used as homework [17]. Many studies corroborate the significance of ITS used in class for students at school [18]. Hence, it was very encouraging when ANDES, an ITS used in this manner, reported favorable outcomes [19].
At present, ASSISTments is being used by massive numbers of middle and high school students for their evening homework. The instant advice regarding homework allows students to feel comfortable and enables tutors to monitor the reports specifying students' achievements [17]. So far, multilayered tutoring systems have not proven suitable for evening homework; on the other hand, technology-supported instruction that presents the same questions with a fast response about the problem is more appropriate [20].
According to Singh et al., homework on web-based tools using the tutoring system is more authentic and robust for learning and mastering student skill(s) than the traditional paper-based style. Furthermore, that research compares the instantaneous responses from the tutoring system against the feedback students receive from the tutor the next working day, which is time-consuming and reduces learning as a whole. They further showed that 8th-grade math students who took part in both scenarios improved significantly, with an effect size of 0.40, by using technology-assisted homework [18].
As described by Fyfe, ITS and technology-assisted homework have achieved fame and pervasiveness in schools around the globe [21]. According to Ma et al., personalized education with well-timed advice is the solid foundation of intelligent tutoring systems [15]. The objective of the study by Fyfe was to present an experimental assessment of an algebra class for middle school students who had preceding variable knowledge affected by the technology-based responses [21].
Generally, many studies support the notion that in-time responses from ITSs have beneficial effects on learning outcomes compared with no response from ITSs [20,21]. Lee et al., Baker et al., and Gupta and Rose all reported that confusion, both its causes and consequences, can be easily recognized through student performance and actions [22][23][24].
Confusion causes students to halt, reflect, and begin problem-solving to resolve their own confusion. The only way to cope with confusion is for every student to have in-depth knowledge of complicated matters, as grappling with confusion is an intellectual activity [25,26]. On the other hand, if a healthy learning environment offers an adequate platform and timely assistance to students, and they efficiently regulate their confusion, they can achieve positive outcomes [25][26][27][28].
Moreover, many scholars have used different methods, such as "classification or knowledge engineering", to detect affective changes in students, particularly confusion [29]. Likewise, Conati and MacLaren built a detector based on logged data and survey question groupings to predict self-reported student affect; this model was better at recognizing attentive and inquisitive students but ineffective at classifying confused students [30].
Baker et al. conducted strong research particularly focused on computer software designed for education, e.g., ITSs to automatically identify confusion through affect detection where they collected this information through semantic actions of students, and labeled the existing Pittsburgh Science of Learning Center (PSLC) DataShop log files. In this research, they defined confusion as the slower pattern of students' actions while attempting the pre-defined teacher criterion before the starting of mastery skill-builder assignment or homework. Authors focused the preliminary step and observed the percentage of clip actions [26]. Table 1 shows the summary of the illustrative methods used in the previous research in the domain of ITS.

Table 1. Summary of the illustrative methods used in previous research in the domain of ITS.

Methods Used | Advantages | Disadvantages
Fuzzy Decision Trees [16] | The most popular choice for learning and reasoning systems, particularly from feature-based discrete values; deals with uncertainty, noise, and inexact data | Does not take into consideration the connections between behavior variables and, due to the uncertainty intrinsically present in modeling learning styles, small differences in behavior can lead to incorrect predictions
Hierarchical Linear Regression Model (HLM) [9] | Simple relationship with a limited number of variables; it is the ordinary least square (OLS) regression-based analysis that takes the hierarchical structure of data into consideration | Complex form; assesses data using a fixed parameter, and thus gives insufficient analysis due to the neglect of the shared variance
Regression Analysis [21] | A statistical analysis technique used to forecast future conditions; provides the relationship between two or more related variables, with the help of which we can quickly estimate or predict the unknown values of one variable from the known values of another variable | The cause and effect of the relationship between variables remain unchanged; cannot be used in a qualitative phenomenon; long and complex calculations and analysis
Mixed-effects Modeling [31] | Useful where repeated measurements are made on the same statistical units or on clusters of related statistical units; includes a combination of fixed and random effects; very appropriate for dealing with missing values | Increase the power in studies without sample structure
ANCOVA (Analysis of Covariance) [17] | Better power, enhanced capability to detect and evaluate interactions, and the availability of extensions to deal with measurement error in the covariates | It may not be helpful when the imbalance between the groups is large
Sensor-Free Detectors [26] | These detectors are designed to operate solely on the information available through the students' semantic actions within the interface | Not substantially better, especially when subject to stringent cross-validation processes
ANOVA (Analysis of Variance) [18] | The statistical method used to compare the means of multiple groups (more than two sets of data); can also control the overall Type-I error rate | Not suitable when the samples are not independent
Discovery with a model approach [25] | Leaves clear data trials that can be re-inspected in the future; development of lifelong learning skills; supports an active engagement of the learner; uses activities to focus attention; can be motivating | Inefficient; too time-consuming; possibility of confusing learners if no initial framework is available; requires that the teacher prepares for too many corrections if discovery turns out to be wrong
Data-Driven Methodology [24] | Make up-to-date design decisions based on real user needs and prioritize issues to solve based on their relative impact for users | Data are trusted blindly without any uncertainty, and are often messy and even incorrect. Low-quality data leads to low-quality decisions
Feedback Model [22] | Reduces the discrepancy between current and desired understanding | Feedback is only built on something. It is of little use when there is no initial learning or surface information, and under particular circumstances, an instruction is more effective than feedback
Probabilistic Student Model [19] | Beneficial for making responses to help requests that are particularly relevant to domains in which there is uncertainty about the student's mental state | Unable to look at the problem of deciding what kind of response to give to the student at any given time
Academic Emotions Questionnaire (AEQ) assumptions of a cognitive-motivational model [30] | Useful for analyzing students' emotions in learning, as emotions are multifaceted and can be measured reliably by the AEQ | Due to primarily used cross-sectional or predictive designs, not allowing precise inferences of causal relations

Methods
This section clarifies and illustrates the effectiveness of raw data to classification via machine learning methods. Figure 1 depicts the visualization of raw data to workflow classification:

Preparation of Data
In this study, we used a dataset (in the supplementary) collected from ASSISTments, Skill-builder data 2009-2010 [8].
Skill-builder problem sets have the following features:
• Questions are based on one specific skill; a question can have multiple skill tags.
• Students must answer three questions correctly in a row to complete the assignment.
• If a student uses the tutoring ("Hint" or "Break this Problem into Steps"), the question will be marked incorrect.
• Students will know immediately if they answered the question correctly.
• If a student is unable to figure out the problem on his or her own, the last hint will give the answer.
• Currently, this feature is only available for math problem sets.
In the whole dataset, various features are available relating to mastery skill-builder learning. There were almost 72 participating schools, 93 mastery skills in algebra mathematics, and about 28 features. We targeted school ID-73 because it had the maximum availability of records, i.e., a total of 5148 records with 20 attributes, and four attributes initially had various missing values amongst other school IDs. We selected 10 mastery skills, i.e., Absolute Value, Addition and Subtraction of Positive Decimals, Box and Whisker, Circle Graph, Multiplication Fractions, Ordering Fractions, Percent Of, Subtraction of Whole Numbers, Venn Diagram, and Write Linear Equation from Graph, as the maximum number of students had attempted these skills. After removing duplicate values, 166 distinct student IDs remained.

Measurements and Covariates
We selected the predictors (original, attempt_count, ms_first_response, correct, hint_total, overlap_time, and opportunity) from the list of features available in the dataset. ROC, accuracy, precision, recall, F-measure, sensitivity, and specificity were measured as the performance indicators for the machine learning algorithms. The basic descriptions of the predictors are as follows:
• Original: '0' means a scaffolding problem; any other value means a main problem.
• Attempt_count: Number of times a student entered an answer.
• Ms_first_response: Time between the start time and the first student action.
• Correct: '0' means incorrect on the first attempt; otherwise correct.
• Hint_total: Number of possible hints on the problem.
• Overlap_time: The time taken by the student to finish the problem.
• Opportunity: The number of opportunities each student has to practice the skill.

Discretization of Predicted Variable
After the precise selection of predictors, we were interested in learning what features appraise the status of the confused/not confused student. In order to determine this aspect, we used a feature extraction technique to select the predicted variable. We chose and combined three variables, i.e., attempt_count, correct, and overlap_time, with the conditions mentioned below to form a new feature called "student state", and on the basis of that we categorized the status of the confused/not confused students, '1' designated confused and '0' not confused.

Experimental Manipulations or Interventions
We used 10-fold cross-validation and divided our dataset into training and test sets in the standard 80%-20% ratio, with stratified sampling, because our response variable was dichotomous.
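A stratified 80%-20% split of this kind can be sketched as follows. This is only an illustration under stated assumptions: the function name `stratified_split`, the random seed, and the toy label counts are hypothetical, and the code does not reproduce the RapidMiner operator used in the study.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) with similar class proportions in both parts."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Take the same fraction of every class for the test set.
        n_test = round(len(indices) * test_fraction)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 1 = confused, 0 = not confused (counts are illustrative).
labels = [1] * 40 + [0] * 60
train_idx, test_idx = stratified_split(labels)
```

Because the fraction is applied per class, the share of confused students in the test set matches the share in the full data, which is the point of stratifying a dichotomous response.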
We also checked the correlation between the explanatory and response variables and identified which variables were significant, i.e., brought information to the model, and which were not.

Pre-Processing of Data
Data extracted from databases, log files, or Microsoft Excel files required cleaning. Although the data were in good shape, cleaning them before moving ahead was an essential part of the pre-processing. Data can be noisy, missing, or uneven. Machine learning algorithms perform pre-processing of data to some extent, but they are more robust if this step is accomplished manually.

Integration and Transformation of Data
For better statistical analysis and classification, data must be integrated and transformed. For this objective, Table 2 illustrates five amongst ten mastery skills as a snapshot and each skill has four attributes (stated above in 3.1. Preparation of Data) for each student.

Feature Extraction
Feature extraction is a procedure for creating new attributes amongst existing features. Figure 1 shows a snapshot of the feature extraction step. Classification is considered to be an essential step as the performance measure of the learning process depends on significant explanatory variables. In many real-world cases, we cumulatively extract features, alter if needed, combine them, and produce one variable. The same procedure might be adopted for the selection of response variables. In this study, we selected ten mastery skills and also nominated the associated explanatory variables and clubbed them to form 40 explanatory variables for each student in each mastery skill.
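The combination of skills and attributes into one row per student can be sketched as below. This is a minimal illustration, not the authors' procedure: the four attribute names, the `user_id` and `skill` field names, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical attribute names: the paper states that each skill has four
# attributes, but which four is an assumption of this sketch.
ATTRIBUTES = ["attempt_count", "ms_first_response", "hint_total", "overlap_time"]


def widen(rows, skills, attributes=ATTRIBUTES):
    """Turn one row per (student, skill) into one wide feature row per student."""
    columns = [f"{skill}__{attr}" for skill in skills for attr in attributes]
    per_student = defaultdict(dict)
    for row in rows:
        for attr in attributes:
            per_student[row["user_id"]][f"{row['skill']}__{attr}"] = row[attr]
    # Skills a student never attempted stay as None.
    table = {uid: [feats.get(col) for col in columns]
             for uid, feats in per_student.items()}
    return columns, table


skills = ["Absolute Value", "Venn Diagram"]  # two of the ten skills, for brevity
rows = [
    {"user_id": 1, "skill": "Absolute Value", "attempt_count": 2,
     "ms_first_response": 5000, "hint_total": 3, "overlap_time": 9000},
    {"user_id": 1, "skill": "Venn Diagram", "attempt_count": 1,
     "ms_first_response": 3000, "hint_total": 2, "overlap_time": 4000},
    {"user_id": 2, "skill": "Venn Diagram", "attempt_count": 4,
     "ms_first_response": 7000, "hint_total": 2, "overlap_time": 15000},
]
columns, table = widen(rows, skills)
```

With two skills the sketch yields 2 × 4 = 8 columns; passing all ten skills would give the 10 × 4 = 40 explanatory variables described in the text.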

Feature Selection
As shown in Figure 1 above, there are many criteria available for feature selection, for instance: Backward Elimination, Forward Selection, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), DIC (Deviance Information Criterion), Bayes factor, Mallows's Cp, etc. We performed Backward Elimination using the adjusted R² method with a cutoff p-value of 0.05 to construct our model because it is a common approach [32]. We started with the full model and eliminated one variable at a time until the parsimonious model was reached [33].
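The elimination loop can be sketched as follows. This is a simplified illustration that uses only a score to be maximized (such as adjusted R²) and does not reproduce the p-value cutoff; the `score` callable and the toy scorer are hypothetical stand-ins for fitting a regression model on the chosen variables.

```python
def adjusted_r2(r2, n_obs, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


def backward_eliminate(features, score):
    """Start from the full set and drop one feature at a time while the score improves."""
    current = list(features)
    best = score(current)
    while len(current) > 1:
        # Score every model that has exactly one feature removed.
        candidates = [(score([f for f in current if f != drop]), drop)
                      for drop in current]
        cand_score, drop = max(candidates)
        if cand_score <= best:
            break  # no single removal helps any more
        best = cand_score
        current.remove(drop)
    return current, best


# Toy scorer (hypothetical): only "a" and "b" carry signal, and every
# feature in the model costs a small penalty.
def toy_score(feats):
    return 0.4 * ("a" in feats) + 0.3 * ("b" in feats) - 0.01 * len(feats)


selected, best = backward_eliminate(["a", "b", "c", "d"], toy_score)
```

In a real analysis the scorer would refit the model for each candidate subset and return its adjusted R², which penalizes predictors that add little explained variance.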

Training of Model
Before predicting the confusion amongst the students attempting algebra mastery skills homework in ITS, we trained the machine learning algorithms to minimize the difference between actual and predicted values. For this objective, we split the data in an 80%-20% ratio with stratified sampling, as our response variable was nominal.

Testing and Evaluation of the Model
Model evaluation is an essential part of implementing machine learning techniques. After a model is trained on known data, we evaluate it on unseen data to verify that it has learned well enough and classifies correctly. We used seven learning models in this study, described as follows:
Naïve Bayes (NB) is a fast and efficient probabilistic classifier with an extensive record of research. Due to its robustness, precision, and competence, this method is frequently referenced [34].
Generalized Linear Model (GLM) is an extension of traditional linear models, which are fitted to the data using the maximum likelihood estimation (MLE) technique. These models offer very fast, parallel computation with a small number of explanatory variables with non-zero coefficients [35,36].
Logistic Regression (LR) is a broadly used statistical technique for the classification of binary output. When predicting the output of the nominal response variable, usually the logistic regression algorithm is used [36,37].
Deep Learning (DL) uses a neural network that takes information describing the data as input and produces the outcome using many layers [38]. DL can tune and select the model at an optimal level by itself, and it also extracts features automatically without human involvement, which saves considerable effort and time [39].
Decision Tree (DT) has a tree-like structure with internal and leaf nodes. It is built from training data, which consist of rows or records. Each record is formed by a set of features and an outcome label [40].
Random Forest (RF) combines multiple decision trees into a group. A new instance to be classified is passed to the trees, and each tree states a classification [41]. RF generates many random trees on various subsets of a data set, and the resulting model is built by polling (voting across) these trees. Due to this variation, it is less prone to overtraining [36].
Gradient Boosted Trees (XGBoost) is related to the Gradient Boosting Machine (GBM), which is another boosting algorithm. It produces good accuracy due to its parallel computing capabilities and effective linear model solver. It also creates decision trees, which are individually understandable models [42].
We executed all the above-mentioned learning algorithms in RapidMiner 8.1 for our experimentation. RapidMiner has a collection of machine learning/data mining algorithms, and we evaluated the desired output by accuracy. Furthermore, we tested these algorithms on the other performance metrics mentioned in Section 3.7.1.

Performance Metrics
In this study, we adopted the most common and widely used performance metrics of [43,44]. They used the ROC Area under the curve (AUC) to calculate the performance of prediction models.
• ROC curve or AUC: The ROC curve demonstrates the association between true positive and false positive rates [45]. It spans several thresholds, each of which produces a 2 × 2 contingency table. We also used other performance metrics, i.e., accuracy, precision, recall, F-Measure, sensitivity, and specificity.
Detailed results for this present study are described in Section 4 (Results).
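For reference, the threshold-based metrics named above can all be derived from one 2 × 2 contingency table. The sketch below uses the standard definitions; the function name and the toy labels are illustrative assumptions and are not data from the study.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute common metrics from the 2 x 2 contingency table."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0  # same quantity as sensitivity
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": specificity,
        "f_measure": f_measure,
    }


# Toy example: 1 = confused, 0 = not confused (values are illustrative).
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```

Note that recall and sensitivity are two names for the same ratio, TP / (TP + FN), so they always coincide for a given table.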

Classification
Plenty of machine learning/data mining tools are available. We used RapidMiner 8.1 for our investigation and testing. RapidMiner Studio is well equipped for data mining/machine learning tasks, with a state-of-the-art collection of machine learning algorithms along with data access, pre-processing, blending, cleansing, modeling, visualization, and validation operators, which provide an advanced platform to perform machine learning/data mining tasks in an efficient and well-organized manner [46,47]. Because we performed supervised learning, several linear and non-linear classifiers were used for classification, application, and verification.

Statistical Analysis and Parameters
In order to discover the significance of the explanatory variables for the prediction of students' confusion in algebra mastery skills in ITS, it was imperative to explore the predictors (explanatory variables) and their impact on the response variable statistically. Although machine learning algorithms intrinsically perform statistical tests and analysis of variables, it is always good practice to check manually before applying any machine learning methods and techniques. Table 3 lists the weights (ranks) of the attributes, which show the overall significance of each attribute for the value of the target attribute, independent of the modeling algorithm. We used the statistical programming language R with the standard cut-off level for the probability value (p-value of 0.05). Table 4 displays the p-values of the correlation matrix with the response variable.
In this statistical summary of correlation, we found nine predictor variables that were most significant, i.e., their p-values were below 0.05 in relation to the dichotomous response variable. Correlation was used to measure the strength of the linear association between two numeric variables. Many types of correlation coefficients exist, e.g., Pearson, Kendall, and Spearman. We used Pearson's correlation coefficient as it is commonly used in linear regression. It is denoted by r (or R), and its value always lies in the range from −1 to +1, where +1 specifies a strong positive correlation and −1 a strong negative correlation.
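As a reminder of what the coefficient measures, Pearson's r can be computed directly from its definition. The sketch below is a plain-Python illustration with made-up inputs; the study itself used R, and the function name here is hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Illustrative inputs only: a perfectly increasing and a perfectly decreasing pair.
r_up = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
r_down = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

When one of the two variables is a 0/1 response, as with the confused/not confused label, the same formula gives the point-biserial correlation.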
Statistically, after using backward elimination, we ended up with the final model, which validates the significance of nine explanatory variables. Table 5 presents descriptive statistics; Table 6 shows the regression analysis, including the predictors' coefficients, standard errors, p-values, etc.; and Table 7 gives the regression summary. Furthermore, Table 8 shows a useful feature of the R language, which reports the maximum adjusted R² value that could be achieved with the given explanatory variables. These are the ten chosen skills with ASSISTments data attributes. The purpose of showing this table is to validate the adjusted R² value in Table 7.

Results
As discussed in Section 3 (Methods), ROC stands for Receiver Operating Characteristic, also referred to as ROC AUC or simply the ROC curve. It demonstrates the relationship between the true positive rate (TPR) and false positive rate (FPR). It also captures the trade-off between sensitivity and specificity, as the two move in opposite directions, i.e., when sensitivity rises, specificity declines. The closer the curve is to the top left corner, the better the results; if the curve comes close to the 45° diagonal, the results are not accurate. Moreover, a ROC AUC value >0.9 indicates excellent results; values between 0.8-0.9 are considered good, those between 0.7-0.8 fair, and <0.6 poor [48].
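One way to see what the AUC value summarizes is to compute it as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, which equals the area under the ROC curve. The sketch below is illustrative only: the function name, labels, and scores are assumptions, not outputs of the models in this study.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Illustrative labels (1 = confused) and classifier scores.
y_true = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.5, 0.1]
auc = roc_auc(y_true, scores)
```

A classifier that ranks every confused student above every non-confused student reaches 1.0, while random scoring gives about 0.5, which corresponds to the diagonal line.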
Figures 2 and 3 show the ROC AUC values for the seven machine learning algorithms used to predict confusion among students attempting algebra mastery skills in the ITS. In Figure 3, the vertical axis shows percentage values between 0% and 100%. We constructed seven candidate models built on various machine learning methods. The performance achieved by each classifier is shown in Figure 4, which reports the accuracy of each model under a repeated-sampling validation technique that randomly repeats the division into training and test data. Accuracy here is the proportion of cases that were predicted correctly. We attained the highest accuracy with RF, GLM, XGBoost, and DL, i.e., 86.1%, 84.9%, 84.9%, and 83.1%, respectively. The other classifiers achieved DT: 79.5%, NB: 78.9%, and LR: 77.1%.
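The repeated random division into training and test data can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the 30% test fraction, the number of repeats, and the majority-class placeholder classifier are all hypothetical and do not reproduce the configuration used in this study.

```python
import random


def repeated_holdout_accuracy(rows, labels, fit, predict,
                              repeats=10, test_fraction=0.3, seed=0):
    """Mean test accuracy over several random train/test divisions."""
    n = len(rows)
    n_test = int(round(n * test_fraction))
    if not 0 < n_test < n:
        raise ValueError("test_fraction leaves an empty training or test set")
    rng = random.Random(seed)
    accuracies = []
    for _ in range(repeats):
        indices = list(range(n))
        rng.shuffle(indices)  # a fresh random division on every repeat
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        model = fit([rows[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(1 for i in test_idx if predict(model, rows[i]) == labels[i])
        accuracies.append(correct / n_test)
    return sum(accuracies) / len(accuracies)


# Placeholder classifier (an assumption): always predicts the majority class
def fit_majority(train_rows, train_labels):
    return max(set(train_labels), key=train_labels.count)


def predict_majority(model, row):
    return model


rows = list(range(20))      # hypothetical feature rows
labels = [0, 1] * 10        # hypothetical 0/1 confusion labels
print(repeated_holdout_accuracy(rows, labels, fit_majority, predict_majority))
```

Any of the seven classifiers could be substituted for the placeholder by supplying its own fit and predict functions.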
The accuracy of each classifier is also listed in Table 9. We also checked the other performance metrics discussed in Section 3. Figure 5 displays the performance of the seven machine learning algorithms in terms of precision, recall, F-measure, sensitivity, and specificity.
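These metrics all derive from the four cells of the confusion matrix, as the following sketch shows. The counts in the example are hypothetical and are not the values behind Figure 5 or Table 9. Note that recall and sensitivity are the same quantity.

```python
def _ratio(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def classification_metrics(tp, fp, tn, fn):
    """Standard metrics from confusion-matrix counts (positive = confused)."""
    precision = _ratio(tp, tp + fp)
    recall = _ratio(tp, tp + fn)  # identical to sensitivity
    return {
        "accuracy": _ratio(tp + tn, tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "sensitivity": recall,
        "specificity": _ratio(tn, tn + fp),
        "f_measure": _ratio(2 * precision * recall, precision + recall),
    }


# Hypothetical counts for 100 students
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```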
Figure 6 displays the lift charts of the high-accuracy machine learning models. A lift chart is a graphical illustration of the improvement that a model delivers compared with a random guess [49]. It shows the effectiveness of the model by measuring the ratio between the outcomes obtained "with and without a model" [36].
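The "with and without a model" ratio can be made concrete as follows: lift at a given fraction is the rate of confused students among the top-scored fraction divided by the overall rate of confused students. The sketch below is one common way to compute it; the labels and scores are hypothetical.

```python
def lift_at(labels, scores, fraction):
    """Lift of the top-scored `fraction` of cases over the overall positive rate."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    n = len(labels)
    base_rate = sum(labels) / n  # positive rate "without a model"
    if base_rate == 0:
        raise ValueError("lift is undefined when there are no positive cases")
    k = max(1, int(round(n * fraction)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    top_rate = sum(label for _, label in ranked[:k]) / k  # rate "with a model"
    return top_rate / base_rate


# Hypothetical data: 10 students, 3 of them confused (label 1)
labels = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
for fraction in (0.2, 0.5, 1.0):
    print(fraction, round(lift_at(labels, scores, fraction), 2))
```

A lift above 1 means the model concentrates confused students among its highest-scored cases better than random selection would; at a fraction of 1.0 the lift is always 1.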

Discussion and Conclusions
This research investigates models for predicting which students are confused while attempting homework using the skill-builder in an ITS. Detecting confusion is a classification task, and machine learning offers many robust classification algorithms. In this study, we used machine learning methods for the experiment. Applying data mining techniques to ITS data is difficult because multiple features are related to varying extents, with much noisy data and many missing fields. We extracted the explanatory variables (input features) and the target (output) response variable from the ITS. We then applied the machine learning models NB, GLM, LR, DL, DT, RF, and XGBoost.
The results demonstrated that the RF, GLM, XGBoost, and DL models attained high accuracies of 86.1%, 84.9%, 84.9%, and 83.1%, respectively, in predicting students' confusion in the algebra mastery skills in the ITS.
Such a result can assist school tutors in next-day classes by identifying the student groups that were confused while attempting the homework exercise in the mastery skill-builder. It also highlights which skill(s) need more attention and further practice. Furthermore, tutors can monitor learning behaviors and student performance across the various mastery skill(s), allowing them to focus only on the problematic skill(s) in the next day's class, which will save time and effort for both tutors and students.
Our study has several implications, both educational and practical. Firstly, to the best of our knowledge, our research is one of the few studies that have focused on predicting confusion using machine learning methods for sustainable educational development. ITS contributes to sustainable development in education, as such development focuses on the necessities of the present day without compromising future needs. The objective of sustainable development is to balance our environmental, economic, and social needs [50].
Sustainable development in education is an interdisciplinary learning approach that covers the combined environmental, social, and economic aspects of the formal and informal curriculum. This educational approach can help students develop their aptitudes, knowledge, and experience to play a significant role in ecological development and become responsible members of society. Participatory teaching and learning techniques and methods are also required to encourage and empower learners to change their behavior and take corrective action for sustainable development. Critical thinking, envisioning the future, and decision-making are the skills and abilities that Education for Sustainable Development (ESD) promotes [51].

Shortcomings
One shortcoming of this study is that we used a limited number of variables; more attributes are available that could be used for further investigation and might yield statistically stronger results. Another shortcoming is that more rigorous optimization, such as changing the criterion, pruning, or selecting a threshold for the machine learning models (algorithms), could achieve better results.

Future Recommendations
In future work, we will design and apply strategies to further improve our model. First, better-tuned parameters can be used to build a more accurate model. For instance, in DT we can change the criterion (i.e., gain_ratio, information_gain, gini_index, or accuracy), the maximal depth parameter, etc.; in LR, we can set the criterion, solver, reproducible, use regularization, etc.; and in XGBoost, we can alter maximal depth, min rows, min split improvement, number of bins, etc. to optimize performance. Additionally, we will consider a Convolutional Neural Network (CNN) [52] in our experimentation, as this methodology is reported to give low loss and high classification accuracy. Other approaches such as the Dynamic Bayesian Network (DBN), Gauss Cloud Model, and Cloud Reasoning Algorithm will also be considered for better classification accuracy and precision [53][54][55].
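One simple way to organize such parameter changes is to enumerate every combination of candidate settings and evaluate each with the same validation scheme. The sketch below only builds the list of configurations for the DT example; the criterion names come from the text above, whereas the maximal depth values and the function name are hypothetical, and the step that trains and scores a model for each configuration is left out because it depends on the tool used.

```python
from itertools import product


def parameter_grid(options):
    """List every combination of the candidate settings as a dict."""
    names = list(options)
    combos = product(*(options[name] for name in names))
    return [dict(zip(names, combo)) for combo in combos]


# Candidate DT settings: criterion names from the text, depths are hypothetical
dt_options = {
    "criterion": ["gain_ratio", "information_gain", "gini_index", "accuracy"],
    "maximal_depth": [5, 10, 20],
}
dt_grid = parameter_grid(dt_options)
print(len(dt_grid), "configurations, first:", dt_grid[0])
```

The configuration with the best validated performance would then be retained for the final model.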
Secondly, other kinds of classification methods and techniques can be evaluated. Although the machine learning techniques applied in this study are relatively comprehensive, there are still various unexplored methods that can be applied to the prediction problem in the domain of students in intelligent tutoring systems. Thirdly, other structures and features in the data that may improve prediction accuracy can be added. Furthermore, from the tutors' perspective, we can identify the benefits of detecting confusion in a group of students solving mathematics homework using the skill-builder in ITS.