Learning Time Acceleration in Support Vector Regression: A Case Study in Educational Data Mining

Abstract: The development of a country involves directly investing in the education of its citizens. Learning analytics/educational data mining (LA/EDM) allows access to big observational structured/unstructured data captured from educational settings and relies mostly on machine learning algorithms to extract useful information. Support vector regression (SVR) is a supervised statistical learning approach that allows modelling and predicting the performance tendency of students to direct strategic plans for the development of high-quality education. In Brazil, performance can be evaluated at the national level using the average grades of a student on their National High School Exams (ENEMs) based on their socioeconomic information and school records. In this paper, we focus on increasing the computational efficiency of SVR applied to ENEM for online requisitions. The results are based on an analysis of a massive data set composed of more than five million observations, and they also indicate computational learning time savings of more than 90%, as well as providing a prediction of performance that is compatible with traditional modeling.


Introduction
Education as a human right is a prerequisite for an individual to function fully as a human being in modern society. For instrumental reasons, the guarantee of education as a multi-faceted social, economic and cultural human right allows the development of a society because it facilitates economic self-sufficiency through employment or self-employment and promotes the full development of the human personality. In a country such as Brazil, with strong social inequality inherited from a history of slavery, access to education for all as a high-quality public service plays a fundamental role in reducing this inequality. Moreover, equalizing opportunities within education is an even greater challenge. It has long been observed that there are racial discrepancies when it comes to study opportunities. At the end of the 20th century, it had already been observed that the average difference in years of study between white and black individuals was 2 years [1], and actions regarding equality and high-quality education must be connected: they cannot be seen as policy trade-offs [2].
Educational data are increasingly being used to support effective policy and practice and to move education systems towards more evidence-informed approaches to large-scale improvement. Generally, high-income countries with established assessment programmes use data for sector-wide reforms or interventions to improve learning outcomes. Low-income countries that are beginning to use these programmes tend to identify a few separate issues, such as resource allocation, correlations between students' socio-economic status and their performance, and teacher qualifications. Resulting policies include interventions prompted by demands for policies to address equity issues.
Statistical/machine Learning (ML) based on computer tools and statistical methodologies allows for the semi-automatic discovery of knowledge in LA/EDM by finding

Background
With advances in LA/EDM, new meaningful insights can be obtained from large data sets towards helping to identify novel and useful patterns besides predicting the outcome of future observations. The combination of artificial intelligence and data science covers a range of computational approaches and methods towards the extraction of actionable knowledge from large, complex, multidimensional, and diverse data sources. Recently, the use of data mining tools and applied machine learning has risen over conventional statistical approaches for more accurate predictions [3].

Support Vector Regression
Support vector models are a class of powerful ML methods introduced by Vapnik and co-workers [27][28][29][30] for classification and regression models that often have superior predictive performance to classical neural networks. Their remarkably elegant optimization and risk minimization theories provide robust performance with respect to sparse and noisy data, which makes them the optimal choice in several applications. A support vector machine (SVM) is primarily a method that performs classification tasks by constructing hyperplanes in a multidimensional space that separates cases of different class labels. In many situations where the response variable is continuous, i.e., y ∈ R, it is possible to use SVM to predict the outcome by using covariates via a regression model, the so-called support vector regression (SVR) [31]. In this sense, SVMs can handle multiple continuous and categorical variables. To construct an optimal hyperplane, an SVM employs training algorithms, which are used to minimize an error function.
The general idea of support vector models for regression in the linear case is to build a hyperplane $f(x) = w^\top x + \varepsilon = 0$, where $x$ is the vector of explanatory variables, $w$ is the parameter vector, and $\varepsilon > 0$ is a hyperparameter. This representation is plotted in Figure 1, in which the observations lie within a $2\varepsilon$-wide hyper-tube that covers all observations as close as possible to its external limits [32], i.e., the margin equations are minimized. In practical situations, not all observations are expected to be inside this hyper-tube when $\varepsilon$ is small. For this reason, slack variables $\xi$ are added to the linear SVM, giving soft margins, so that the model becomes suitable (see the plot in Figure 2). When using slack variables, the optimization problem consists of finding an optimal hyperplane that maximizes the hyper-tube margins and minimizes the slack variables, where $\xi^+$ is used for values above the upper margin and $\xi^-$ for values below the lower margin, and $w$ is the (not necessarily normalized) normal vector to the hyperplane. In general, it is given by the following convex optimization problem:
$$\min_{w,\,\xi^+,\,\xi^-} \; \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \left(\xi_i^+ + \xi_i^-\right) \quad \text{subject to} \quad y_i - \langle w, x_i \rangle \le \varepsilon + \xi_i^+, \quad \langle w, x_i \rangle - y_i \le \varepsilon + \xi_i^-, \quad \xi_i^+, \xi_i^- \ge 0,$$
where $\langle \cdot, \cdot \rangle$ denotes the inner product and $C$ is a regularization constant that imposes a weight on minimizing errors, since the number of observations falling outside the hyper-tube is not otherwise limited. If $C \to \infty$, a soft-margin SVR returns to a hard-margin SVR. After applying the maximization process via positive Lagrange multipliers $\alpha^+$ and $\alpha^-$, associated, respectively, with $\xi^+$ and $\xi^-$, the SVR predictor is given by:
$$f(x) = \sum_{i=1}^{n} \left(\alpha_i^+ - \alpha_i^-\right) \langle x_i, x \rangle,$$
which only depends on the support vectors. It is also worth mentioning the influence of the sample size on the optimization process used to estimate the parameters of the support vector models. Basically, the larger the data set, the more parameters must be estimated, since the number of multipliers $\alpha$ (one pair per restriction) grows with the number of observations. A traditional limitation of the linear SVM is that it can only find a linear boundary, which is often not adequate.
The trick is to map the training data from the input space (R) to a higher-dimensional space called the feature space (F) and then use kernel functions to represent the inner product of two data vectors projected onto this space. The mapping is realized via the function φ, which is implicitly given by the kernel function. An appropriate choice of the mapping φ, or equivalently of the kernel function, implies that the training set mapped into F can be handled by a linear SVM, as shown in Figure 3. The advantage of this approach is that the data are mapped onto the higher-dimensional space only implicitly, and only inner products are needed to estimate the parameters. This matters because the dimension of F can be very high, which would bring a high computational cost if the mapping were computed explicitly. Replacing the scalar product with a kernel function is thus straightforward and does not affect the solver. The most used kernel functions are listed in Table 1.

Table 1. Most used kernel functions, by kernel type.
In Table 1, $\sigma > 0$, $d \in \mathbb{R}$ and $q \in \{2, 3, \ldots\}$. SVM models are highly dependent on a number of user-defined parameters (hyperparameters), which include the regularization parameter, the tube size of the ε-insensitive loss function, and the bandwidth of the kernel function. An inappropriate choice of these parameters may lead to over-fitting or under-fitting [34], and, for massive data problems, this is a troublesome situation.
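The quantities introduced in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: all function names are hypothetical, the linear predictor is taken without an intercept, and the kernel forms (in particular the Gaussian kernel written as exp(−σ‖x − z‖²)) are assumed parametrizations that use the symbols σ, d and q, since the contents of Table 1 are not reproduced here.

```python
import math


def slacks(y, y_hat, eps):
    """Slack values (xi_plus, xi_minus) of one observation for tube half-width eps."""
    residual = y - y_hat
    xi_plus = max(0.0, residual - eps)    # distance above the upper margin
    xi_minus = max(0.0, -residual - eps)  # distance below the lower margin
    return xi_plus, xi_minus


def linear_kernel(x, z):
    """Plain inner product <x, z>."""
    return sum(a * b for a, b in zip(x, z))


def polynomial_kernel(x, z, d=1.0, q=2):
    """(<x, z> + d)^q, an assumed form with offset d and degree q."""
    return (linear_kernel(x, z) + d) ** q


def gaussian_kernel(x, z, sigma=1.0):
    """exp(-sigma * ||x - z||^2), an assumed parametrization with sigma > 0."""
    return math.exp(-sigma * sum((a - b) ** 2 for a, b in zip(x, z)))


def primal_objective(w, X, y, eps, C):
    """0.5 * ||w||^2 + C * (sum of slacks) for the linear predictor f(x) = <w, x>."""
    total_slack = 0.0
    for x_i, y_i in zip(X, y):
        total_slack += sum(slacks(y_i, linear_kernel(w, x_i), eps))
    return 0.5 * linear_kernel(w, w) + C * total_slack


def dual_predict(x, support_vectors, coefs, kernel=linear_kernel):
    """f(x) = sum_i coef_i * k(x_i, x), where coef_i stands for alpha_i^+ - alpha_i^-."""
    return sum(c * kernel(sv, x) for c, sv in zip(coefs, support_vectors))


X = [[1.0], [2.0], [3.0]]
y = [1.0, 2.5, 2.0]
print(primal_objective([1.0], X, y, eps=0.25, C=2.0))  # 2.5
```

In the example, only the second and third observations fall outside the tube (slacks 0.25 and 0.75), so the objective is 0.5 + 2 × 1.0. Passing `gaussian_kernel` or `polynomial_kernel` to `dual_predict` changes the predictor without touching the rest of the code, which is the practical meaning of the kernel substitution described above.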

SVM Applied to Large Databases
In the traditional formulation of an SVM, the number of underlying support vectors (SVs) usually grows linearly with the sample size n, and this implies a high prediction cost. In addition, the standard training procedures for solving the dual problem of kernel SVMs, such as Sequential Minimal Optimization (SMO) [35], require Ω(n) iterations, each with O(n) cost [36].
Tsang et al. [37] proposed the Core Vector Machine (CVM) algorithm, which, unlike the native SVM algorithm, has a time complexity that does not depend on the size of the training sample. Experiments on large data sets, with real and simulated data, have shown that CVM is as accurate as traditional SVM, but it is much faster and can handle larger data sets. This method differs from the native SVM by formulating the kernel method as a minimum enclosing ball (MEB) problem, as well as by proposing a specific approximation algorithm to estimate its parameters. From another perspective, Sarmento [38] presented a series of techniques for applying SVM to large databases, all focused on the selection of representative samples followed by the application of the traditional SVM algorithm. Among the suggested methods for sample selection are the k-nearest neighbors method (KNN) [39], Random Selection Reduction, and Convolutional Neural Networks [40]. Results obtained with these methods on synthetic and real data indicate that such techniques reduce the computational cost of fitting the model while maintaining good results. Finally, Torres et al. [41] proposed an improved version of the SMO algorithm for training classification and regression SVMs, based on a Conjugate Descent procedure, which decreases the number of iterations needed for convergence. The cited methods rely on sophisticated concepts to improve computational performance, modifying some ideas of the basic theory of SVM. In this paper, we consider a more intuitive concept based on Weak SVR applied to regression tasks. This method is presented in the next section.

Reducing Learning Time Using Weak SVMs
To apply SVM models to large data sets, data reduction (reducing the number of support vectors) appears to be a priority. Wang et al. [26] present another technical option, directed towards classification, that reduces the training base before the traditional SVM is applied. In their work, the so-called Weak SVMs, which are models fitted to small databases sampled from the initial base, are used to select observations and then build a training base smaller than the original one.
A Weak SVR, also known as an ε-Gross Granularity Weak SVR [26], is defined as follows. Let $X = \{x_1, \ldots, x_n\}$ be the training data set and $\dot{X}$ a subset of $X$ whose cardinality is much smaller than that of $X$; in other words, $\dot{X} \subset X$ and $\mathrm{Card}(\dot{X}) \ll \mathrm{Card}(X)$. In this sense, $f(x) = w^\top x$ subject to $|y - f(x)| < \varepsilon$ is the SVR predictor of $X$, and $f(\dot{x}) = \dot{w}^\top \dot{x} + \varepsilon$ is the SVR predictor of $\dot{X}$. The SVR of $\dot{X}$ is called the ε-Gross Granularity Weak SVR and its empirical loss function is given by the following equation:
[equation missing in the source text]
Although the relationship between the size of the training data set and the error bound ε is weak, since the performance of the hypothesis predictor depends on the size of the training data set, Weak SVRs are defined with training data sets of size $n_0 \ll n$ [26,42].
The entire procedure is based on two stages for training data. A random sub-sampling data cleaning method is applied in the first stage, and two maximum entropy-based informative pattern extraction methods are presented in the second stage. The traditional SVM model is then applied to this final constructed base, which requires a shorter estimation time, since most of the observations were previously removed. The results achieved by this method are comparable to those of other methods such as PEGASOS [43], LIBLINEAR [44], and RSVM [45]. In this work, we modify the algorithm proposed by Wang et al. [26] by adapting its Weak SVM classification procedure to a regression problem, and we use it to predict a student's average grade on the ENEM from the associated covariates. In this approach, the variability of the values estimated by the "Weak SVRs" is considered in place of the "Weak SVM" criterion proposed by Wang et al. [26]. In other words, this approach considers a regression task to predict a continuous variable instead of the classification task considered in the previous work. Figure 4 shows the flowchart of the steps used by the proposed modified method.

Initial Sampling
The first step in the approach is to extract K samples of size $n_0 \ll n$ from the population in order to fit K SVR models, which will now be called the "Weak SVRs". These initial K models are said to be "weak" because they are fitted to samples that are small compared with the original size of the initial base.
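The sampling step can be sketched as follows. This is only an illustration under stated assumptions: the function name `draw_weak_samples` is hypothetical, each sample is drawn without replacement, and the K samples are drawn independently, so they may overlap (the text does not say whether they must be disjoint). The defaults K = 10 and n0 = 1000 are the values reported later in the Modeling section.

```python
import random


def draw_weak_samples(n, K=10, n0=1000, seed=0):
    """Return K lists of n0 row indices, each drawn without replacement from range(n)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.sample(range(n), n0) for _ in range(K)]


samples = draw_weak_samples(n=30_000)
print(len(samples), len(samples[0]))  # 10 1000
```

Each list of indices would then be used to subset the training base before fitting one Weak SVR.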

Adjustment and Prediction of Weak SVRs
Following the selection of the initial K samples, K SVR models are fitted, one for each extracted sample, and these models are then used to predict all observations in the initial population. The values predicted by each of the K models are stored. In the original method presented by Wang et al. [26], observations are removed when they obtain the same prediction from each of the K estimated models. In the proposed method, the observations to be removed are selected using the standard deviation of the K predictions produced for each observation by the "Weak SVRs".
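A minimal sketch of the dispersion computation is given below. The function name `prediction_spread` is hypothetical, the three lambda functions are stand-ins for fitted Weak SVRs (any callable that maps an observation to a prediction would do), and the sample standard deviation with an n − 1 denominator is an assumption, since the text does not state which estimator was used.

```python
import statistics


def prediction_spread(models, X):
    """Standard deviation of the K model predictions, one value per observation in X."""
    spreads = []
    for x in X:
        predictions = [model(x) for model in models]
        spreads.append(statistics.stdev(predictions))  # sample SD (n - 1 denominator)
    return spreads


# Stand-in "weak" predictors; in the method described above these would be fitted SVRs.
models = [lambda x: 1.0 * x, lambda x: 2.0 * x, lambda x: 3.0 * x]
print(prediction_spread(models, [1.0, 2.0, 3.0]))  # [1.0, 2.0, 3.0]
```

Observations on which the stand-in models disagree more receive a larger spread, which is the quantity used for selection in the next step.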

Selection of the Final Sample
To select the observations that will be part of the final training sample, the previously calculated standard deviation will be used. Based on these values, the observations are ordered, and the observations with the greatest observed variations are selected. An observation with a low standard deviation indicates that the predictions made by K models were close to each other, and therefore, this observation is considered to have low uncertainty or information, which would be similar to the idea originally presented by Wang et al. [26]. In this case, using quantiles, the observations that presented a standard deviation above the third quartile are selected for the final sample. Then, with the completion of the selection process of the set to be used, the traditional SVR method is finally applied to the data.
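The quartile-based selection can be sketched as below. The function name `select_above_q3` is hypothetical, and the quantile definition (`method="inclusive"`, i.e., linear interpolation between order statistics) is an assumption, because the text does not specify how the third quartile was computed; other definitions can give slightly different cut-offs on small samples.

```python
import statistics


def select_above_q3(spreads):
    """Return (indices with spread strictly above the third quartile, Q3 value)."""
    q3 = statistics.quantiles(spreads, n=4, method="inclusive")[2]
    selected = [i for i, s in enumerate(spreads) if s > q3]
    return selected, q3


spreads = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
print(select_above_q3(spreads))  # ([7, 8], 7.0)
```

By construction, roughly a quarter of the observations are kept, so the final SVR is fitted on about 25% of the original training base (fewer when many spreads tie at the cut-off).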
Algorithm 1 displays the pseudocode of the method to speed up the SVR learning time.

ENEM as an Educational Selection Procedure
Historically, measuring "education" and its uses is not straightforward, since several facets and aggregation levels should be considered. An approach usually employed to obtain LA/EDM is the application of tests to monitor the quality of candidates, systems, and student learning outcomes. For example, around 160 AD, in China, an imperial examination was employed for the selection of public servants to compose the intellectual elite of the Chinese government [46]. Since 1926, North American universities have selected their students using the Scholastic Aptitude Test (SAT). In Brazil, the selection procedure for candidates used by several universities is the National High School Exam (ENEM, from the Portuguese Exame Nacional do Ensino Médio). University admittance exams have existed in Brazil since the last century, but their use became most prevalent in the early 1970s, when they were unified to cover the national demand for higher education [21]. At the same time, many preparatory courses were created, bringing compilations of study materials that included questions extracted from previously applied exams, and the discourse in schools about the preparatory courses for exams also grew [47].
Originally, the ENEM was created in 1998 as an instrument to provide educational information and government actions based on the evaluation of the results of students who had completed basic education. In its first edition, it had over 150,000 candidates [48]. Throughout the editions, ENEM became one of the options used by students, alongside the entrance exams that were carried out independently by educational institutions, to access several colleges and public universities across the country. Later, ENEM was adopted by many institutions as the only option for those seeking admission. Today, the ENEM score is accepted by hundreds of institutions in Brazil and some Portuguese institutions as a form of selection. The exam continues to be held year after year, with millions of candidates.
Nowadays, the ENEM has 180 items that are completed over two days (two Sundays, in general) and it is divided into four major areas of knowledge (Natural Sciences and its technologies, Human Sciences and its technologies, Mathematics and its technologies, Languages, Codes and its technologies) and a mandatory writing component.

Data
This paper uses the raw data from ENEM 2019 for all candidates in Brazil. This data set was used because it was the most recent one available; it is composed of 136 variables. The data are publicly available on the website of the National Institute of Educational Studies and Research Anísio Teixeira (INEP) in Brazil, at https://www.gov.br/inep/pt-br/acesso-a-informacao/dados-abertos/microdados/enem (accessed on 23 June 2021). For the purposes of this paper, the considered variables are shown in Table 2.
In order to understand the candidates' situation in Brazil and learn more about the considered features, we carried out an Exploratory Data Analysis (EDA) to describe the profile of the 5,095,270 students who participated in ENEM 2019. Figure 5 shows the number of candidates across the 27 Brazilian Federative Units. In this plot, we can observe a large concentration of candidates in the states with a higher population density (São Paulo, Rio de Janeiro, Minas Gerais, Bahia, and Ceará) and a smaller number of candidates in the states with the lowest population density (Roraima, Amapá, Acre, and Tocantins). Figure 6 shows a histogram of the students' ages. The asymmetrical shape is expected because many younger people usually register for the ENEM. The average age of 22 years is expected because participants are typically students at the end of high school or those who have recently graduated, which can be confirmed by observing Table 3, which shows the large concentration of students who had already completed high school or completed it in 2019, the year in which this exam was taken. The largest number of candidates are female (3 million versus 2 million males). These proportions differ slightly from the gender distribution of the Brazilian population [49]. On the other hand, around 600,000 students are identified as so-called trainee students (ENEM trainees are students under the age of 18 who are in their 1st or 2nd year of high school and wish to take the ENEM exam to test their knowledge). Figure 7 reveals the differences among candidates from different ethnic groups, with a concentration of applicants who declared themselves as Brown (Pardo in Portuguese; Brown is an official category used by the Brazilian Institute of Geography and Statistics, IBGE, in the Brazilian census [50]).
Figures 8 and 9 show the most common level of education for fathers and mothers, where it was observed that mothers have a higher level of education when compared to fathers. Figure 10 reveals a greater concentration of families with an income in the range of R$ 998.00 to R$ 2994.00, followed by those with an income of up to R$ 998.00. In addition to the personal information of these candidates, some characteristics related to their school life were analyzed. Among the students who provided their year of completion of high school, 2018 was the year that registered the highest number of candidates (600,000 candidates). In addition, in relation to the type of school attended during high school (see Figure 11), the vast majority of students chose not to provide this information; however, among those who did, most had attended public schools. Regarding the choice of the language test, Table 4 shows a small difference between the number of registrants who chose English as a foreign language and the number who chose Spanish. Moreover, the data reveal an average absence rate of approximately 25% across the tests; less than 1% of the students were eliminated from any of the four tests, typically for violating an exam rule (see Table 5). Observing the variable that shows the number of people who lived with the person enrolled in the ENEM (see Figure 12), it is noted that the vast majority of students shared a house with up to five people; this is an expected result given that, as this is an exam taken mostly by students, they are expected to live with their families. In addition, there was also a high concentration of single candidates (86%), which once again is expected because participants were mostly young people at the end of school age.

Modeling
This section presents the results obtained by modeling the average grades of students who took the ENEM 2019 using the variables listed in Table 2 as inputs of an SVR model. The results presented here compare the so-called traditional SVR model, which is fitted to all available data, with the model proposed in this paper, which applies pre-processing to extract the most informative observations and thus reduces the estimation time needed for the final model while still maintaining good predictive performance. All analyses were performed using the R [51] language on a personal computer with a 2.00 GHz Intel Core i3-6006U processor, 4 GB of RAM, and a 64-bit Windows 10 operating system. The R codes are available from the authors upon request.
In order to use the proposed model, a total of K = 10 "Weak SVR" models were considered during the process, each constructed with n 0 = 1000 observations from the original training sample selected at random. In the observation selection process for the final training sample, the distribution of the calculated deviations for each observation can be seen in Figure 13. Thus, as previously defined, observations with a standard deviation greater than the value of the third quartile were selected as the final training sample. For the case of Figure 13, Q3 = 22.90. The evaluation of the proposed method considers different population sizes and the results are shown in Table 6, which considers 70% of observations as the training sample, as well as the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE), as predictive performance measures in the test sample. Based on Figure 14, it is possible to observe the difference in time between the proposed model and the traditional SVR model, which includes all the observations in the estimation process. In addition, based on Table 6, it is observed that the quality of the adjustment was maintained, despite the small observed difference. Figure 15 shows the percentage gain in time performance when using the proposed model. It is possible to observe that the gain in time has a tendency of growth when also increasing the size of the base used, reaching a gain of 90% when working with a set with 300,000 observations, the largest set used in this comparison. Moreover, despite the use of a lower sample size, the obtained results of the proposed method are consistent with the traditional method for all the cases from 30,000 to 300,000 total observations. Furthermore, there was a reduction in the time needed for the learning phase, making it possible to observe the quality of the proposed model compared to the traditional model. 
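For reference, the three predictive performance measures named above can be written as the following minimal sketch, using their usual definitions (the function names are hypothetical, and the example grades are made up for illustration). Note that MAPE is undefined whenever an observed value equals zero, which would need special handling for grades on a 0 to 900 scale.

```python
import math


def rmse(y, y_hat):
    """Root Mean Squared Error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))


def mae(y, y_hat):
    """Mean Absolute Error."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)


def mape(y, y_hat):
    """Mean Absolute Percentage Error, in percent; undefined if any observed value is 0."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(y, y_hat)) / len(y)


y_true = [400.0, 500.0, 600.0]  # illustrative values, not taken from the ENEM data
y_pred = [440.0, 500.0, 540.0]
print(round(rmse(y_true, y_pred), 2))  # 41.63
print(round(mae(y_true, y_pred), 2))   # 33.33
print(round(mape(y_true, y_pred), 2))  # 6.67
```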
In particular, the RMSE mostly presented values close to 70, which may be justified by the scale of the response variable, which varied between 0 and 900 points. Furthermore, the general quality of the model's fit was verified through the $R^2$ metric. As expected, it was 34%, which indicates that the model is able to explain only 34% of the data variability. This value is consistent with and close to other results found in the recent literature when performing a similar analysis using the ENEM database [52]. In particular, $R^2$ is not the best metric to analyze the goodness of fit of a model, especially for a massive database such as ENEM, whose response variable presents great variability. Figure 16 displays the behavior of the model when crossing the real values with the predicted values. Despite the high variability, as expected, it is possible to notice a moderate linear trend. We also observed how the model behaved for each of the Brazilian geographic regions by crossing the real and predicted values per region (see Figure 17). As seen for the whole country, there is great variability in the predicted values; however, higher concentrations of grades are seen in the southeast and northeast regions of the country.
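As an illustration of how such a value is read, the sketch below computes $R^2$ as one minus the ratio of the residual to the total sum of squares. This definition is an assumption: the text does not state which variant was used (for instance, the squared correlation between real and predicted values is also common), and the function name and data are hypothetical.

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))  # unexplained variation
    ss_tot = sum((a - mean_y) ** 2 for a in y)            # total variation around the mean
    return 1.0 - ss_res / ss_tot


y_true = [1.0, 2.0, 3.0, 4.0]  # illustrative values only
y_pred = [1.5, 1.5, 3.5, 3.5]
print(r_squared(y_true, y_pred))  # 0.8
```

Under this definition, a value of 0.34 means that the residual sum of squares is 66% of the total sum of squares around the mean grade.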

Comparative Analysis of the Predicted Grades
This section presents a comparative analysis between the worst and best predicted grades in the database, taking the bottom 10% and the top 10%; this analysis can be seen in Table 7. There is a significant difference in the behavior of some key variables used in the modeling when comparing the worst grades with the best grades. The variable Ethnic group shows the great racial inequality existing in the country: among the worst average grades, there is a high concentration of self-declared brown students (62%), whereas among the best average grades there is a large concentration of white students (71%), and the share of black students declines from 20% among the worst grades to only 3% among the best. Another important variable is family income, which, in turn, shows economic inequality. Among the worst grades, it is possible to observe a concentration of students with an income of up to 1 basic salary (R$ 998.00) (53%) and an income between 1 and 3 basic salaries (R$ 2994.00) (26%); for the best grades, there is an even greater concentration of students with a family income above 5 basic salaries (R$ 4990.00) (72%), an income range found in less than 1% of the worst grades. Finally, two other important variables are the educational levels of the students' parents. Among the worst grades, the most common level of education is incomplete primary education, both for the father (49%) and for the mother (49%). On the other hand, for students who obtained the highest grades, there is a high number of mothers (71%) and fathers (62%) who completed higher education.

Final Considerations
Several high-profile publications have demonstrated a lack of transparency, reproducibility, ethics, and effectiveness in the reporting and evaluation of ML/AI-based predictive models (error rates) [53][54][55]. This growing body of evidence suggests that while many best practice recommendations for the design, performance, analysis, reporting, evaluation of student performance, and implementation of education and public policy can be borrowed from the traditional economics, public policy, health systems, and education statistics literature, they are not sufficient to guide the use of ML/AI in research. Producing such guidance with transparency and with intuitive methods is an important undertaking because of the increasing speed of producing predictions, the large battery of ML/AI algorithms, and the multifaceted nature of assessing student performance and social impact [56]. Taking no action is unacceptable, and if we wait for a more definitive solution, we risk breaking ethical and moral norms beyond the work of methodological development.
In particular, this paper presents some theoretical concepts about the SVR machine learning method, as well as a proposition to address the problem found when using this native regression method in large databases. The results obtained in the modeling process with the proposed model were traced as well as applied in an educational data mining problem. The proposed method maintained good performance and presented a considerable reduction in learning time, reaching a gain of 90% for a database with at least 300,000 observations. It represents a reduction in the time needed to learn the model from 8 h to only 26 min. This reduction was achieved only using theoretical modifications, without any use of parallel procedures.
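As a quick check of these figures, and assuming that the gain is measured as the relative reduction in learning time and that the quoted times are rounded, the two reported durations give:

```latex
\[
\text{gain} \;=\; 1 - \frac{t_{\text{proposed}}}{t_{\text{traditional}}}
\;=\; 1 - \frac{26\ \text{min}}{8 \times 60\ \text{min}}
\;=\; 1 - \frac{26}{480}
\;\approx\; 0.946 ,
\]
```

that is, a reduction of roughly 94.6%, which is consistent with the savings of more than 90% stated in the abstract. The symbols $t_{\text{proposed}}$ and $t_{\text{traditional}}$ are introduced here only for illustration.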
The educational application provides a general descriptive analysis about the candidates who participated in the ENEM 2019 in Brazil, the distribution of the candidates by federative units, the economic situation of their families, the educational level of their parents, and also the distribution of high school type.
The learning time reduction and computational effort are quite relevant for online applications, since the learning step may be repeated in different subsets in dashboard applications, for example. In this sense, using only these seven input variables, it is possible to predict precisely the average grade of a student on the National High School Exam (ENEM-Exame Nacional do Ensino Médio) in Brazil. These efforts would be impractical when using SVR without a learning time reduction.
Based on the predictive results of the SVR, it is possible to determine the performance of a given student in a future ENEM test application: this can be key, on the one hand, to produce more comprehensive and fairer tests that account for the different demographics of students from different segments of society, i.e., the predictions allow the profiling of students and the grouping of them for various purposes, which mainly allows a reduction in inequality. On the other hand, the information assimilated by the algorithm can help us to understand more accurately the students' learning processes and their interconnections in such a way that distortions can be corrected more quickly in teaching procedures.
Based on the experiment, we also found research limitations in our current study and identified directions for future studies, as follows. We observed that the number of students in the data set is an important factor affecting the predictive performance. For subsets identified by region or ethnic group, for example, a larger number of students yields better predictive performance; for example, 10% better grades were found in a larger number of students with 60% ethnic group EG2 (high concentration). Nonetheless, predictions can be unstable if there is substantial volatility in the underlying data set or if the data set is small. Thus, in future work, it is necessary to introduce noise to compensate for the shortage of sample data. On the other hand, we observed that the method used in predicting students' performance is based on a shallow architecture and may fail to capture the relationships among attributes in a massive data set; a similar conclusion was already presented in [57,58] and in other similar works. It is also worth mentioning that the work developed is easily extendable to other contexts and methods, and that it can be adapted for parallelization, which would provide an even greater computational time gain, or combined with other methodologies applied to large databases [37,41].

Conflicts of Interest:
The authors declare no conflict of interest.