Electronics
  • Article
  • Open Access

30 January 2023

Analysis of Enrollment Criteria in Secondary Schools Using Machine Learning and Data Mining Approach

1 Department of Computer Science, MNSUA Multan, Multan 60650, Pakistan
2 Department of Computer Science, Virtual University of Pakistan, Lahore 54000, Pakistan
3 School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
4 BK21 Chungbuk Information Technology Education and Research Center, Chungbuk National University, Cheongju 28644, Republic of Korea
This article belongs to the Special Issue Advances in the Use of Artificial Intelligence (AI)/Machine Learning (ML) and IoT in the Primary Sector

Abstract

Out-of-school children (OSC) surveys are conducted annually throughout Pakistan, and the results show that the literacy rate is increasing gradually, but not at the desired speed. Enrollment campaigns and the target-based enrollment system assigned to schools require a sound model for better analysis of enrollment criteria. In existing studies, the research community has mainly focused on performance evaluation, dropout ratio, and results, rather than on student enrollment. There is a great need to develop a model for analyzing student enrollment in schools. In this proposed work, five years of enrollment data from 100 schools in the province of Punjab (Pakistan) were taken. Significant features were extracted from the data and analyzed through machine learning algorithms (Multiple Linear Regression, Random Forest, and Decision Tree). These algorithms contribute to the prediction of future school enrollment and classify each school’s target level. Based on these results, a brief analysis of future registrations and target levels was carried out. Furthermore, the proposed model also helps determine solutions for low enrollment in schools and improve the literacy rate.

1. Introduction

1.1. Background

In this research, we applied the power of machine learning to educational purposes. Low enrollment of students in public schools is a prime challenge in developing countries. Pakistan is a signatory of the Sustainable Development Goals (SDG) 2030, and Article IV of SDG 2030 states that primary education is the right of every individual in the world [1]. Article 25(A) of the Constitution of the Islamic Republic of Pakistan states that fundamental education will be given to every citizen of the country [2]. According to the Ministry of Federal Education and Professional Training, Pakistan’s current literacy rate is 62.3%, implying that an estimated 60 million people in the country are illiterate [3].
Following the Punjab Education Sector Reform Program (PESRP), the school census report published by the Punjab government in 2020–2021 demonstrates the importance of increasing student enrollment and the reasons for the dropping out of young children. The Institute of Statistics defines the dropout ratio as the proportion of students in the same class who are no longer enrolled in the next school year in the same school [4].
The authors predicted the academic performance of architecture students using data from their earlier research and machine learning models. The researchers compared linear discriminant analysis with a k-nearest neighbor (k-NN) model. In terms of accuracy, the k-NN model significantly outperformed the linear discriminant analysis model. Additionally, how well architecture students performed on math exams (at the ordinary level) greatly impacted their grades [5].
Information Technology University (ITU), Lahore, Pakistan, used Restricted Boltzmann Machines (RBM), Matrix Factorization, and Collaborative Filtering to analyze real-world data. The authors looked at the electrical engineering departmental grades of ITU undergraduates. RBM was found to be more reliable than the other approaches in predicting a student’s performance in a particular course [6].
Researchers discussed a case study showing how machine learning methods can forecast students’ academic performance. They created a regression algorithm that can forecast a student’s performance based on their performance on a small number of written assignments and their demographics. A software tutoring aid prototype was developed [7].
The primary goal of their paper is to identify the key factors that influence school academic performance and to investigate their relationships using a two-stage analysis of a sample of Tunisian secondary schools. To deal with undesirable outputs in the first stage, they employ the Directional Distance Function (DDF) approach [8]. Past work shows that researchers have focused on student performance, slow learning, and dropout rate. It is a dire necessity to pay special attention to low enrollment. Low enrollment means a decrease in the literacy rate, which is directly related to the progress and development of the country. The current system demands the design of a model that will help schools achieve their targets. It is a dire need to maximize enrollment by applying a suitable model that supports school administration and policymakers. Another task is to find the category in which a school’s enrollment lies: Far from Target, Below Target, or On Target [9].
Many countries have shown growing interest and concern about the problem of low school enrollment and its primary causes in recent years. This problem is known as the “100-factor problem”. A lot of research has been done to identify factors that affect student performance (school failure and dropouts) at different academic levels (primary, secondary and tertiary) [10,11].
In this context, machine learning can contribute to a remarkable achievement in understanding and analyzing the challenges of low enrollment. The authors stated that the accurate use of machine learning algorithms to evaluate whether a thesis is complete or incomplete is one of the study’s innovative results. These assessment models allow for a more balanced match between students and instructors. Machine learning is a cornerstone of artificial intelligence and big data analysis. It includes powerful algorithms capable of recognizing patterns, classifying data, and, in essence, learning to perform a specific task independently. This field has grown in popularity recently, but it is still unknown to most people, including professionals [12]. Machine learning approaches are presented in Figure 1.
Figure 1. Machine Learning Model A-Z process.

1.2. Research Gaps and Limitations

1.2.1. Gaps in Previous Research

In recent years, research has been increasingly interested in issues such as forecasting student performance, preventing failure, and determining what causes children to drop out of school. Nagy and Molontay used and evaluated several machine learning methods based on information available at the time of enrollment (secondary school performance, personal details) to identify at-risk individuals and estimate student dropout from university programs. They also provided a platform for data-driven decision assistance to the education directorate and other stakeholders. They based their models on data from 15,825 undergraduate students who registered at the Budapest University of Technology and Economics between 2010 and 2017 and either completed or dropped out of their programs [13].
Mengash demonstrated how data mining techniques can be used to anticipate applicants’ academic success at universities to aid admissions decisions. The proposed methodology was validated using data from 2039 students enrolled in the Computer Science and Information College of a Saudi state university between 2016 and 2019. According to the findings, candidates’ early university performance can be predicted before admission based on specific pre-admission traits (high school grade average, Scholastic Achievement Admission Test score, and General Aptitude Test score). Furthermore, the data show that the Scholastic Achievement Admission Test score is the best predictor of future student achievement. As a result, admissions algorithms should give this score more weight [14].
A multitude of academic and non-academic factors influence a student’s academic achievement at a university. While students who previously failed due to familial distractions may be able to focus away from home and thrive at university, students who previously succeeded in secondary school may lose focus due to peer pressure and a social lifestyle. In Nigeria, university admission is heavily based on a student’s cognitive entry criteria, which are predominantly intellectual and may not always transfer to excellence if a student enrolls in a university [15].
Learning analytics and educational data mining have improved tremendously in a short amount of time. Baker had a vision for the field’s future directions, including increased interpretability, generalizability, transferability, application, and clearer evidence of effectiveness. The keynote talk, delivered in 2019 at the Learning Analytics and Knowledge Conference, was gently revised. They offer these future approaches as a set of six competitions, the Baker Learning Analytics Prizes (BLAP), with particular standards for what would represent forward advancement in each of these routes. By addressing these challenges, the field will be able to more effectively use data to benefit students and enhance education [16].
Muralidharan and Prakash investigated the effects of an innovative project in the Indian state of Bihar that provided girls who progressed to secondary school with a bicycle to make it simpler for them to commute to school to bridge the gender gap in secondary enrollment. They analyzed data from a large representative household survey using a triple difference approach, with boys and the neighboring state of Jharkhand serving as comparison groups. They discovered that being part of a cohort that was exposed to the Cycle program increased girls’ age-appropriate secondary school enrollment by 32% and reduced the associated gender gap by 40%. Furthermore, they identified an 18% increase in the number of girls taking the important secondary school certificate exam and a 12% increase in the proportion of girls passing it. The triple-difference estimate as a function of distance to the nearest secondary school reveals that enrollment increases tended to occur in villages further from a secondary school, indicating that the ability of the bicycle to reduce the time and safety costs of school attendance was the mechanism of impact [17].
Many educational institutions place a high value on reducing student dropouts. Peréz et al. examined the findings of a case study in educational data analytics aimed at identifying undergraduate Systems Engineering (SE) students who had dropped out after six years of enrollment at a Colombian institution. Original data were enlarged and enriched using a feature engineering technique. The experiment’s findings indicated that dropout predictors can be determined with consistent levels of accuracy using simple algorithms. The findings of Decision Trees, Logistic Regression, Naive Bayes, and Random Forest were compared to recommend the best option. In addition, Watson Analytics is evaluated to see how well it works for non-expert users. The major findings are presented to lower dropout rates by identifying reasonable explanations [18].
However, the current study focuses on the impact of prior academic achievement on the academic performance of architecture students. Several factors affect student academic performance [19].
Academic failure is a serious worry at a time when postsecondary education is becoming increasingly vital to economic success. The authors analyzed student data available at registration, such as school records and environmental circumstances, to predict potential failure early, with the goal of swift and successful remediation and/or study reorientation. Three algorithms were compared: artificial neural networks, logistic regression, and random forest. They developed approaches to improve forecast accuracy when specific classes are of great importance. These strategies are applicable across multiple disciplines and are context-independent [20].
It is critical to pay special attention to low enrollments. Low enrollment leads to a drop in literacy rates, directly related to the country’s progress and development.
The objectives of this research are as follows [9,21].
  • To find a model that will predict upcoming school enrollment ahead of time.
  • To highlight the categories of schools that require close attention in the future, so that these schools can enroll the maximum number of students.
  • To improve the schools’ targets for enhancement of the literacy rate.

1.2.2. Limitations of Our Work

  • Our work is limited to secondary schools.
  • Our investigation of enrollment criteria is limited to the schools of a particular geographical area.

1.3. List of Abbreviations

The notations used in our study are presented in Table 1.
Table 1. List of abbreviations.
The first part of the paper is related to the introduction of our research. The rest of the paper is arranged as follows. Section 2 presents the related work in our research area. Section 3 is about the research methodology and techniques. Section 4 presents simulation work, and its results are discussed in detail. The conclusion is presented in Section 5. Finally, future work is discussed in Section 6.

3. Materials and Methods

3.1. Workflow of Research

After defining the problem of school enrollment, suitable features were selected based on related work and the current scenario. The dataset of student information, including many important features (age, gender, family size, and physical health), was collected from the head office of the education department of Punjab for research purposes. This dataset (after pre-processing) is reliable for the proposed research because it was collected from a real-time monitoring survey by the School Education Department. Data pre-processing is the backbone of further analysis and of algorithm accuracy. It involves removing missing values, feature scaling, managing categorical data, and many other useful tasks that make the data consistent. Briefly, during pre-processing, missing values were filled using the median technique and feature scaling was applied. The data after pre-processing are shown in Table 2.
Table 2. Data after pre-processing.
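The pre-processing steps described above (median imputation followed by feature scaling) can be sketched with scikit-learn; the two columns below are illustrative stand-ins, not the actual dataset schema:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Illustrative data: [enrollment, family size], with one missing value
X = np.array([[35.0, 5.0],
              [40.0, np.nan],   # missing family size
              [60.0, 7.0],
              [55.0, 6.0]])

# Fill missing values with the column median, as described in the text
imputer = SimpleImputer(strategy="median")
X_filled = imputer.fit_transform(X)

# Bring all features onto a comparable scale
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_filled)

print(X_filled[1, 1])         # median of [5, 7, 6] = 6.0
print(X_scaled.mean(axis=0))  # each column has ~zero mean after scaling
```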
Feature reduction methods were analyzed, and we selected the Backward Elimination Method for feature reduction. The data were then split into training and testing sets using the Stratified Shuffle Split technique.
There are multiple techniques for splitting the data into testing and training sets:
(1) Train-test split (simply divides the dataset in the proposed ratio);
(2) Stratified Shuffle Split (used for an equal division of categorical data);
(3) k-fold (an iterative technique for data splitting).
The Stratified Shuffle Split is used because the dataset has a categorical feature, “GENDER”, on which the data are divided into test and train sets. After applying the Stratified Shuffle Split, the dataset is proportionally divided: the 100-school test set has 24 female and 76 male schools, and the 400-school training set has 96 female and 304 male schools. The final OLS summary is presented in Table 3.
Table 3. Final OLS summary.
Implementation of algorithms for both classification and regression is the next objective. After a brief comparative analysis, a conclusion is drawn. The research design (workflow) is presented in Figure 2.
Figure 2. Workflow of research.

3.2. Algorithms Used for Classification

The algorithms used for classification are:
  • Decision Tree Classifier
  • Random Forest Classifier
  • KNN Classifier
  • SVM Classifier
  • Naïve Bayes Classifier
The algorithms used for predicting (regression of) future enrollment are:
  • Multiple Linear Regression
  • Random Forest Regression
  • Decision Tree Regression

3.3. Data Collection

Five hundred students’ data were collected from public schools in the Punjab province of Pakistan, especially schools with low enrollment in early classes. In addition, rural and urban students are represented. The dataset of student information, including many important features (age, gender, family size, physical health), was collected from the head office of the education department of Punjab for research purposes. This dataset (after pre-processing) is reliable for the proposed research because it is compiled from a real-time monitoring survey of the School Education Department [26].
The selection of required features demands special attention, because all subsequent processing and model accuracy are correlated with the selected features. Twelve features were chosen for this research. The gender breakdown of the testing and training data is presented in Table 4.
Table 4. The Dataset.
This raw data must be made clean, noise-free, and on a common scale. Pre-processing is therefore the way to make the data consistent and reliable [30].

3.4. Data Mining Process

Data mining is widely used by businesses, organizations, and governments to find “hidden” patterns and connections in their transactional data. There has been a lot of progress in the last several years, and data mining algorithms can address many of the problems that arise while analyzing data. Models for Data Mining and Knowledge Discovery reached their peak with the release of SEMMA in 2000. The five-step SEMMA framework is used by the SAS Institute to organize the phases of data mining. SEMMA stands for Sample, Explore, Modify, Model, and Evaluate. An overview of each SEMMA process phase is shown in Figure 3.
Figure 3. Workflow Process with Data Mining.

3.4.1. Sampling

The first step in the SEMMA data mining process, which starts with a large dataset, is to extract a relevant subset of the data comprising the most important and most easily manipulated information. For extremely large datasets, concentrating on a subset rather than the complete dataset may greatly shorten the time required for data mining. A representative sample will reveal whether a statistically significant trend exists in the data; any subset must be considered in the context of the full dataset.

3.4.2. Explore

The second step is to visually or statistically analyze the data to look for underlying classifications or patterns. Exploration allows the refinement and modification of the discovery process and promotes conceptual and cognitive growth. If a visual review of the data does not show any clear patterns, statistical methods such as factor analysis, correspondence analysis, and clustering can be employed to analyze the data.
If all of the data are processed at once, more intricate patterns will likely not be found until further exploration is carried out.

3.4.3. Model

The data will be cleaned and organized, and statistical models will be created to show how the patterns behave. Data mining uses a variety of statistical models, including time series analysis, memory-based reasoning, and principal component models, among others, as well as modeling approaches including artificial neural networks, decision trees, rough set analysis, support vector machines, and logistic models. Each model has distinctive qualities and attributes that make it the best choice for various data mining tasks.

3.4.4. Evaluation of the Model’s Performance (the Last SEMMA Stage)

The user’s assessment of the model is used to evaluate how accurately it predicts the criteria. It is typical to divide the data into training and testing sets when evaluating a model: the first is used to train the model, while the second is used to assess it. This step’s goal is to evaluate how closely the model reproduces the training data set’s outcomes. If the model is accurate, it can be applied to both the training sample and the reserved sample.

3.4.5. Tools and Data Analysis Techniques

Python is used for data pre-processing and machine learning algorithm implementation. Although a significant variety of software related to our domain exists, Python is user-friendly and easy to apply to Artificial Intelligence algorithms, especially machine learning and deep learning algorithms. It is suitable for large and small datasets [44].
Early data processing is performed using scikit-learn (sklearn), the most popular and dependable machine learning library for Python. Clustering, regression, classification, and dimension reduction are a few of the powerful machine learning and data mining modeling tools accessible through its Python-compatible interface [26]. The library is built on NumPy, SciPy, and Matplotlib. During the pre-processing stage, outliers are eliminated, missing values are filled in, and features are scaled [45].
In what follows, we go into detail on three different categories of scikit-learn objects:
Estimators are objects that can estimate specific parameters based on a dataset. The fit method performs the estimation and calculates internal parameters to obtain useful data-related information [46].
Transformers take in data and output a transformed version of it; the fit_transform method carries out both fitting and transforming in one call [44].
Predictors, such as the linear regression model, are similar: they provide fit and predict methods, as well as a score function that can assess the forecasts’ accuracy [47].
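The estimator/transformer/predictor pattern can be illustrated with a small sketch on made-up data (not the enrollment dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler   # a transformer
from sklearn.linear_model import LinearRegression  # a predictor

# Toy data with a perfectly linear relationship
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# Transformer: fit() learns internal parameters (mean, std),
# transform() applies them; fit_transform() does both at once.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Predictor: fit() estimates coefficients, predict() produces outputs,
# score() reports the R^2 of the forecasts for a regressor.
model = LinearRegression().fit(X_scaled, y)
r2 = model.score(X_scaled, y)
print(round(r2, 3))  # 1.0 on this perfectly linear toy data
```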
The dataset contains a total of five hundred entries, as counted with the GENDER count function: 120 female schools and 380 male schools. A significant part of the data needs scaling [48,49].
Feature scaling is used to put all the attributes on one scale. There are primarily two types of feature scaling methods:
  • Min-max scaling (normalization): (value − min) / (max − min). Sklearn provides a class called MinMaxScaler for this.
  • Standardization: (value − mean) / std. Sklearn provides a class called StandardScaler.
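The two scaling methods can be compared directly on toy values (illustrative only):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

x = np.array([[10.0], [20.0], [30.0], [40.0]])

# Min-max scaling: (value - min) / (max - min), mapping into [0, 1]
mm = MinMaxScaler().fit_transform(x)

# Standardization: (value - mean) / std, giving zero mean and unit variance
st = StandardScaler().fit_transform(x)

print(mm.ravel())                               # [0. 0.333... 0.666... 1.]
print(round(st.mean(), 9), round(st.std(), 9))  # 0.0 1.0
```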
Backward elimination is a technique for selecting significant features when constructing a machine learning model. This technique eliminates elements that do not affect the dependent variable [50]. There are multiple ways to construct a Machine Learning model; some are given below:
  • Backward Elimination
  • All-in
  • Forward Selection
  • Score Comparison
  • Bidirectional Elimination
Although several potential methods exist for training the model in machine learning, we used the Backward Elimination approach, as it is the quickest. A list of proposed features is presented in Table 5, and the final OLS summary in Table 6.
Table 5. List of proposed features.
Table 6. Final OLS Summary.
Below are the primary steps used to apply the backward elimination process:
  • First, decide on the model’s significance level (SL = 0.05).
  • Second, fit the entire model, including all independent variables as potential predictors.
  • Third, pick the predictor with the highest p-value.
  • If this p-value is greater than the SL, proceed to Step 4; otherwise, the model is finished.
  • Step 4 is to remove that predictor.
  • Fifth, rebuild and re-fit the model after removing the variable, and repeat from the third step.
We discovered that three features could be eliminated from our model to improve its accuracy. Eliminating any more would make our results worse. The p-values of the eliminated features (family size, school extra education, parents’ status) exceeded the significance level. Further analysis was performed on the remaining features.
There are several ways to split datasets into training and test sets, including:
  • Train-test split (simply divides the dataset in the proposed ratio)
  • Stratified Shuffle Split (used for an equal division of categorical data)
  • k-fold (an iterative approach to data splitting) [47].
The Stratified Shuffle Split is used because the dataset has a feature named “GENDER.” After applying the stratified shuffle split, the data are divided proportionally into test and train sets: the 100-school test set has 24 female and 76 male schools, and the 400-school training set has 96 female and 304 male schools [46].
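A minimal sketch of such a split with scikit-learn, using the dataset’s stated class balance (120 female, 380 male schools) but an otherwise made-up feature matrix:

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Labels mirroring the paper's counts: 120 female (1) and 380 male (0)
gender = np.array([1] * 120 + [0] * 380)
X = np.arange(500).reshape(-1, 1)  # placeholder feature matrix

# One split with a 100-school test set, stratified on GENDER
sss = StratifiedShuffleSplit(n_splits=1, test_size=100, random_state=42)
train_idx, test_idx = next(sss.split(X, gender))

print(gender[test_idx].sum())   # 24 female schools in the test set
print(gender[train_idx].sum())  # 96 female schools in the training set
```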
Manual enrollment targets are assigned to all public schools, and only approximately 70% of the targets are achievable; the rest always remain unachieved. This research contributes significantly to increasing enrollment in public schools and helps to achieve the given targets.
This research aims to predict and identify low-enrollment schools and classify them according to their targets.

4. Results and Discussion

4.1. Classification

Classification is the machine learning technique used for predicting group membership of data instances. Several classification methods are available. In this section, we discuss linear and sigmoid kernels, the basic classification techniques, and some major classification methods, including k-nearest neighbor, decision tree, support vector machine (SVM), Naive Bayes, and Random Forest classifiers [51].
Class-wise categories are mentioned below:
  • Class 1: Below Target
  • Class 2: Far from Target
  • Class 3: On target

4.1.1. Random Forest

Random forests are popularly used for data science competitions and practical problems. They are accurate, do not require scaling or categorical encoding of features, and need little parameter tuning. A random forest consists of several random decision trees [52]. Two types of randomness are built into the trees. First, each tree is constructed from a random sample of the original data. Second, at each tree node, a subset of features is randomly selected to produce the best split. The Random Forest Classifier (RFC) is the most common ensemble learning classifier and has proven to be a very popular and effective machine learning technique for high-dimensional classification and skewed problems. RF classifier performance with truth data and results are presented in Figure 4.
Figure 4. RF Classifier Performance. The accuracy of Random Forest is 97% as shown in the truth table (Figure 4).
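A sketch of fitting such a classifier with scikit-learn; the synthetic three-class data here stand in for the Below Target / Far from Target / On Target labels and are not the paper’s dataset:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the three enrollment target classes
X, y = make_classification(n_samples=500, n_features=9, n_informative=6,
                           n_classes=3, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          stratify=y, random_state=42)

# Each tree sees a bootstrap sample of the data, and each split considers
# a random feature subset -- the two sources of randomness described above.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_tr, y_tr)

pred = clf.predict(X_te)
acc = accuracy_score(y_te, pred)
print(confusion_matrix(y_te, pred))  # 3x3 truth table
print(round(acc, 3))
```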

4.1.2. Decision Tree

Every machine learning algorithm has its own benefits and motivations for implementation. The decision tree algorithm is one of the most widely used. A decision tree is an upside-down tree that makes decisions based on conditions in the data [10]. Decision tree performance with truth data, classifier results, and accuracy is presented in Figure 5.
Figure 5. DT Performance. The accuracy of Decision Tree is 94% as shown in the truth table (Figure 5).

4.1.3. SVM (Linear)

A support vector machine (SVM) is a supervised machine learning model that employs classification algorithms, originally designed for two-class problems. After an SVM model is provided with sets of labeled training data for each category, it can categorize new data. Here, we apply it to the enrollment target classification problem, going a step beyond simpler baselines such as Naive Bayes. SVM is a straightforward and trustworthy classification algorithm that performs admirably with sparse data [53]. SVM performance with truth data, classifier results, and accuracy is presented in Figure 6.
Figure 6. SVM Performance. The accuracy of SVM (linear) is 59% as shown in the truth table (Figure 6).
Searching a little deeper, one encounters concepts such as linear separability, the kernel trick, and kernel functions [53]. The idea behind the SVM algorithm is simple, and applying it to the classification of targets does not require much complicated machinery.
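The kernel choice can be compared directly; this sketch uses synthetic data rather than the enrollment dataset, so the exact scores will differ from the paper’s, though the sigmoid kernel often trails the linear one on data like this:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic three-class data standing in for the target classes
X, y = make_classification(n_samples=400, n_features=9, n_informative=6,
                           n_classes=3, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                          random_state=0)

# Fit one SVM per kernel and compare test accuracy
scores = {}
for kernel in ("linear", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X_tr, y_tr)
    scores[kernel] = clf.score(X_te, y_te)
    print(kernel, round(scores[kernel], 3))
```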

4.1.4. SVM (Sigmoid Kernel)

The accuracy of this model is not up to the mark. A detailed confusion table and essential parameters are given in Figure 7.
Figure 7. SVM Sigmoid Confusion Matrix. The accuracy of SVM (sigmoid kernel) is 37% as shown in the truth table (Figure 7).

4.1.5. KNN

k-Nearest-Neighbors (kNN) is one of the common classification techniques in machine learning; it essentially classifies by identifying the most similar pieces of information in the training data and making an informed guess based on their classification. As it is very easy to understand and implement, this approach has seen wide use in many fields, for example, in recommendation systems, semantic search, and anomaly detection [53]. KNN performance, with truth data, classifier results, and accuracy, is presented in Figure 8.
Figure 8. KNN Performance. The accuracy of KNN is 80% as shown in the truth table (Figure 8).

4.1.6. Naive Bayes

Naive Bayes is a simple but efficient and commonly used machine learning classifier. It is a probabilistic classifier that makes classifications in a Bayesian setting using the Maximum A Posteriori (MAP) decision rule. It can also be represented using a very basic Bayesian network. Naive Bayes classifiers are particularly popular for classification and are a traditional solution to problems such as spam detection [54,55,56]. NB performance, with truth data, classifier results, and accuracy, is presented in Figure 9.
Figure 9. NB Performance. The accuracy of Naive Bayes is 49% as shown in the truth table (Figure 9).
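A minimal Gaussian Naive Bayes sketch with scikit-learn on made-up one-feature, two-class data, illustrating the MAP decision rule (the class with the highest posterior wins):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Tiny made-up dataset: one feature, two well-separated classes
X = np.array([[1.0], [1.2], [0.9], [5.0], [5.2], [4.8]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = GaussianNB().fit(X, y)

# predict() applies the MAP rule over the per-class Gaussian likelihoods
pred_low = clf.predict([[1.1]])[0]   # near class 0 -> 0
pred_high = clf.predict([[5.1]])[0]  # near class 1 -> 1
print(pred_low, pred_high)
print(clf.predict_proba([[1.1]]).round(3))  # posterior probabilities
```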

4.1.7. Models Summary

A summary of all models, containing precision, recall, F1-score, and accuracy, is presented in Table 7. The accuracy of the random forest model is the highest among them.
Table 7. Classification Summary.
The accuracies of all models, among which Random Forest is the highest, are shown in Figure 10.
Figure 10. Classification Models Summary.

4.2. Prediction

The second part of this research consists of prediction. Predicting future enrollment is the prime objective, and we implemented three major algorithms for prediction. Results and comparative analysis are discussed in the section below. Parametric charts and comparison tables highlight the comparisons of the regression models’ parameters. The highest performance in terms of R2 and RMSE belongs to the Random Forest Regressor.

4.2.1. Multiple Linear Regression

Multiple regression analysis methods are used to investigate linear relationships between two or more variables. The independent variables (IVs), also known as stand-alone variables, are the Xs; Y is the dependent variable (DV) [53]. The subscript j stands for the observation (row) number. The regression coefficients are the βs, and projections are based on their estimates, the bs.
Each b is an estimate of the corresponding unknown population parameter β. Although nine independent variables are used in this study, the general equation is Yj = b0 + b1X1j + … + b9X9j + ej. The feature coefficients are presented in Figure 11.
Figure 11. Coefficient of Features.
Multiple Linear Regression Evaluation Parameters are presented in Table 8, and Intercept and Coefficients are presented in Table 9.
Table 8. Multiple Linear Regression Evaluation Parameters.
Table 9. Multiple Linear Regression Intercept and Coefficients.
Although various techniques can solve the regression problem, the most commonly used approach is least squares. In least squares regression analysis, the bs are chosen to minimize the sum of the squared residuals. This set of bs is not necessarily the set we want, because outliers (points that are not representative of the data) can distort them.
Intercept is 65.43
  • Mean Absolute Error (MAE) is the mean of the absolute value of the errors;
  • Mean Squared Error (MSE) is the mean of the squared errors;
  • Root Mean Squared Error (RMSE) is the square root of the mean of the squared errors;
  • Mean Squared Logarithmic Error (MSLE) is the mean, over the observed data, of the squared differences between the log-transformed true and predicted values;
  • Where ŷ is the predicted value, MSLE can be interpreted as a ratio between the true and predicted values. Because MSE “punishes” larger errors, it is more popular than MAE. However, RMSE is more popular than MSE because it is expressed in the units of y. Model evaluation parameters are presented in Figure 12.
    Figure 12. Model Evaluation parameters.
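The four error measures defined in the list above are one-liners in scikit-learn. The arrays below are illustrative, not the study's data; for these numbers the absolute errors are 2, 4, 3, 2, giving MAE = 2.75 and MSE = 8.25.

```python
# The four metrics listed above (MAE, MSE, RMSE, MSLE), computed with
# scikit-learn on illustrative values.
import numpy as np
from sklearn.metrics import (mean_absolute_error, mean_squared_error,
                             mean_squared_log_error)

y_true = np.array([120.0, 95.0, 143.0, 88.0])
y_pred = np.array([118.0, 99.0, 140.0, 90.0])

mae = mean_absolute_error(y_true, y_pred)   # 2.75
mse = mean_squared_error(y_true, y_pred)    # 8.25
rmse = np.sqrt(mse)                         # expressed in the units of y
msle = mean_squared_log_error(y_true, y_pred)
print(mae, mse, rmse, msle)
```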

4.2.2. Random Forest Regression

The Random Forest regressor's performance is high compared to the other models used in this research. A detailed analysis is given in Figure 13.
Figure 13. Random Forest Regressor Parameters.
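An evaluation like the one behind Figure 13 boils down to fitting `RandomForestRegressor` on a train split and scoring it on a held-out split. This sketch uses synthetic data and default-ish hyperparameters, which are assumptions; the paper's settings are not reproduced.

```python
# Hedged sketch of the Random Forest regressor evaluation (Section 4.2.2)
# on synthetic data. Hyperparameters are illustrative, not the paper's.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=9, noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

rf = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_tr, y_tr)
r2 = r2_score(y_te, rf.predict(X_te))
print("R2 on held-out data:", r2)
```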

4.2.3. Decision Tree

The evaluation parameters of the decision tree model are presented in Figure 14.
Figure 14. DT Performance.

4.2.4. Models Summary

A comparative summary of the MLR, RF, and DT models is presented in Figure 15.
Figure 15. Comparison of Regression Models.
To conclude the regression part, the best model for predicting school enrollment is the Random Forest Regressor, with performance above 97%. A parametric analysis of all models under study is presented in Figure 16.
Figure 16. Parametric Analysis.
The parametric charts highlight the comparisons among the regression models' parameters; the Random Forest Regressor has the highest performance in terms of R2 and RMSE. In addition, a comparison of the regression models is presented in Table 10.
Table 10. Comparison of Regressors Models.
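A comparison table like Table 10 comes from evaluating all three regressors on the same split with the same metrics. The loop below sketches that procedure; the synthetic data means the printed numbers only illustrate the method, not the paper's 97.1% result.

```python
# Comparison loop in the spirit of Table 10: R2 and RMSE for MLR, RF, and
# DT on one shared train/test split of synthetic data.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=400, n_features=9, noise=5.0, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

models = {
    "MLR": LinearRegression(),
    "RF":  RandomForestRegressor(random_state=7),
    "DT":  DecisionTreeRegressor(random_state=7),
}
results = {}
for name, model in models.items():
    pred = model.fit(X_tr, y_tr).predict(X_te)
    results[name] = (r2_score(y_te, pred),
                     np.sqrt(mean_squared_error(y_te, pred)))
    print(name, "R2=%.3f RMSE=%.2f" % results[name])
```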

4.3. Testing of Regression Model

Under the supervised learning paradigm, different model evaluation techniques can be used to determine how well a model is performing. A straightforward approach is to compute the difference between predicted and actual values, but this alone is not the best solution and can lead to poor decision making. We therefore need additional measures, and choosing the appropriate evaluation measure is crucial for selecting the most suitable model.
Unlike classification (where we can count the number of correctly classified results), in regression our estimates are almost always either greater or smaller than the original value (rarely exactly the same). Therefore, we are not concerned with how many times we have been incorrect but rather with the amount of variance between the actual and predicted values.
The coefficient of determination is the most important method of evaluating a regression model and is much more common than SSE/MSE/RMSE. In statistics, the coefficient of determination is the proportion of the variance in the dependent variable that is predictable from the independent variable(s).
R-Square measures how well the model replicates observed outcomes, based on the proportion of total outcome variance explained by the model. The coefficient of determination ranges from 0 to 1 and provides information about the goodness of fit of a model. In regression, R-Square is a statistical indicator of how well the regression line approximates the actual data points; an R-Square of one indicates that the data fit the regression line perfectly. After training our model, we create a joblib file of the trained model, which is then ready for applications. We performed multiple tests on this model by taking five rows from the test data; the results are below. The output of five random samples is presented in Table 11.
Table 11. Final testing Summary.
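The joblib persistence step described above, and the spot check on five held-out rows behind Table 11, can be sketched as follows. The model, file path, and data here are illustrative assumptions; only the dump/load/predict pattern reflects the described workflow.

```python
# Sketch of persisting the trained model with joblib and re-testing it on
# five rows, as described in Section 4.3. Path and data are illustrative.
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=9, noise=5.0, random_state=3)
model = RandomForestRegressor(random_state=3).fit(X, y)

path = os.path.join(tempfile.gettempdir(), "enrollment_model.joblib")
joblib.dump(model, path)        # save the trained model to disk
loaded = joblib.load(path)      # reload it later, e.g. in an application

sample = X[:5]                  # five rows for a spot check, as in Table 11
print(loaded.predict(sample))
```

Because the reloaded estimator is byte-for-byte the trained one, its predictions on the five sample rows match the in-memory model exactly.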

5. Conclusions

This study is divided into two sections. The first section includes different models for predicting school enrollment, namely Random Forest Regression, Decision Tree Regression, and Multiple Linear Regression; enrollment is predicted from essential student characteristics. Pre-processing techniques extracted useful information from the collected raw data, backward elimination was used for feature reduction, and a stratified shuffle split was used as the cross-validated data-splitting technique. The models were trained with multiple machine learning algorithms to identify the best predictor of future enrollment: the Random Forest Regressor, with an accuracy of 97.1%. The second part contains the classification of schools into three categories: on target, below target, or far from the target. The Random Forest classifier has the highest accuracy (94%). The goal is to determine which schools will be on track in the coming academic year. This research can significantly improve literacy and allows specific initiatives for low-performing schools to be implemented ahead of schedule.

6. Future Work and Directions

  • The study can be expanded to include an examination of student retention. Twelve features were chosen based on the scope of this research; however, the work can be taken further by analyzing the level of study of enrolled students using machine learning.
  • We can implement deep learning and federated learning techniques to improve the accuracy of the models.
  • We can also extend our study to higher secondary schools, colleges, and universities for the analysis of enrollment criteria.

Author Contributions

Conceptualization, Z.u.A., T.M. and A.R.; methodology, T.M., I.H. and Z.u.A.; software, T.M. and Z.u.A.; validation, A.R. and I.U.; formal analysis, T.M., I.H.; investigation, T.M., I.U. and I.H.; resources, I.U. and H.A.; data curation, I.U., Z.u.A.; writing—original draft preparation, T.M., I.H. and I.U.; writing—review and editing, I.H. and I.U.; visualization, A.R. and H.A.; Project administration, I.U., H.A. and H.G.M.; Funding acquisition, I.U., H.G.M. All authors have read and agreed to the published version of the manuscript.

Funding

Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2023TR140), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Acknowledgments

Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2023TR140), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. SDGs. N.I.F.S.D.G. Article IV of SDG 2030. 2022. Available online: https://www.sdgpakistan.pk/ (accessed on 25 December 2022).
  2. National Assembly of Pakistan. Article 25(A) of the Constitution of the Islamic Republic of Pakistan. Available online: https://na.gov.pk/en/downloads.php (accessed on 25 August 2022).
  3. Ministry of Federal Education and Professional Training Pakistan. Literacy Rate. 2022. Available online: http://mofept.gov.pk/Detail/NDM1NDI0ZTQtZmFjMy00ZTVlLWE5M2YtYjgxOTE4YTkyYWNi (accessed on 25 December 2022).
  4. PESRP. The School Census Report 2020–2021. 2022. Available online: https://www.pesrp.edu.pk/ (accessed on 25 December 2022).
  5. Aluko, R.O.; Adenuga, O.A.; Kukoyi, P.O.; Soyingbe, A.A.; Oyedeji, J.O. Predicting the academic success of architecture students by pre-enrolment requirement: Using machine-learning techniques. Constr. Econ. Build. 2016, 16, 86–98. [Google Scholar] [CrossRef]
  6. Iqbal, Z.; Qadir, J.; Mian, A.N.; Kamiran, F. Machine learning based student grade prediction: A case study. arXiv 2017, arXiv:1708.08744. [Google Scholar]
  7. Kotsiantis, S.B. Use of machine learning techniques for educational proposes: A decision support system for forecasting students’ grades. Artif. Intell. Rev. 2012, 37, 331–344. [Google Scholar] [CrossRef]
  8. Rebai, S.; Yahia, F.B.; Essid, H. A graphically based machine learning approach to predict secondary schools performance in Tunisia. Socio-Econ. Plan. Sci. 2020, 70, 100724. [Google Scholar] [CrossRef]
  9. Lykourentzou, I.; Giannoukos, I.; Nikolopoulos, V.; Mpardis, G.; Loumos, V. Dropout prediction in e-learning courses through the combination of machine learning techniques. Comput. Educ. 2009, 53, 950–965. [Google Scholar] [CrossRef]
  10. Ciolacu, M.; Tehrani, A.F.; Beer, R.; Popp, H. Education 4.0—Fostering student’s performance with machine learning methods. In Proceedings of the 2017 IEEE 23rd International Symposium for Design and Technology in Electronic Packaging (SIITME), Constanta, Romania, 26–29 October 2017. [Google Scholar]
  11. Doan, T.; Kalita, J. Selecting machine learning algorithms using regression models. In Proceedings of the 2015 IEEE International Conference on Data Mining Workshop (ICDMW), Atlantic City, NJ, USA, 14–17 November 2015. [Google Scholar]
  12. Superby, J.-F.; Vandamme, J.; Meskens, N. Determination of factors influencing the achievement of the first-year university students using data mining methods. In Proceedings of the Workshop on Educational Data Mining; Citeseer: University Park, PA, USA, 2006. [Google Scholar]
  13. Nagy, M.; Molontay, R. Predicting dropout in higher education based on secondary school performance. In Proceedings of the 2018 IEEE 22nd International Conference on Intelligent Engineering Systems (INES), Las Palmas de Gran Canaria, Spain, 21–23 June 2018. [Google Scholar]
  14. Mengash, H.A. Using data mining techniques to predict student performance to support decision making in university admission systems. IEEE Access 2020, 8, 55462–55470. [Google Scholar] [CrossRef]
  15. Adekitan, A.I.; Noma-Osaghae, E. Data mining approach to predicting the performance of first year student in a university using the admission requirements. Educ. Inf. Technol. 2019, 24, 1527–1543. [Google Scholar] [CrossRef]
  16. Baker, R.S. Challenges for the future of educational data mining: The Baker learning analytics prizes. J. Educ. Data Min. 2019, 11, 1–17. [Google Scholar]
  17. Muralidharan, K.; Prakash, N. Cycling to school: Increasing secondary school enrollment for girls in India. Am. Econ. J. Appl. Econ. 2017, 9, 321–350. [Google Scholar] [CrossRef]
  18. Pérez, B.; Castellanos, C.; Correal, D. Predicting student drop-out rates using data mining techniques: A case study. In IEEE Colombian Conference on Applications in Computational Intelligence; Springer: Berlin/Heidelberg, Germany, 2018. [Google Scholar]
  19. Aluko, R.O.; Daniel, E.I.; Oshodi, O.S.; Aigbavboa, C.O.; Abisuga, A.O. Towards reliable prediction of academic performance of architecture students using data mining techniques. J. Eng. Des. Technol. 2018, 16, 385–397. [Google Scholar] [CrossRef]
  20. Hoffait, A.-S.; Schyns, M. Early detection of university students with potential difficulties. Decis. Support Syst. 2017, 101, 1–11. [Google Scholar] [CrossRef]
  21. Azar, A.T.; Elshazly, H.I.; Hassanien, A.E.; Elkorany, A.M. A random forest classifier for lymph diseases. Comput. Methods Programs Biomed. 2014, 113, 465–473. [Google Scholar] [CrossRef] [PubMed]
  22. Uskov, V.L.; Bakken, J.P.; Byerly, A.; Shah, A. Machine learning-based predictive analytics of student academic performance in STEM education. In Proceedings of the 2019 IEEE Global Engineering Education Conference (EDUCON), Dubai, United Arab Emirates, 8–11 April 2019. [Google Scholar]
  23. Adebayo, A.O.; Chaubey, M.S. Data mining classification techniques on the analysis of student’s performance. GSJ 2019, 7, 45–52. [Google Scholar]
  24. Zhang, Z. Introduction to machine learning: K-nearest neighbors. Ann. Transl. Med. 2016, 4, 27386492. [Google Scholar] [CrossRef] [PubMed]
  25. Abdelhamid, N.; Thabtah, F. Associative classification approaches: Review and comparison. J. Inf. Knowl. Manag. 2014, 13, 1450027. [Google Scholar] [CrossRef]
  26. Cortez, P.; Silva, A.M.G. Using Data Mining to Predict Secondary School Student Performance; EUROSIS-ETI: Oostende, Belgia, 2008. [Google Scholar]
  27. Soofi, A.A.; Awan, A. Classification techniques in machine learning: Applications and issues. J. Basic Appl. Sci. 2017, 13, 459–465. [Google Scholar] [CrossRef]
  28. Iqbal, Z.; Qadir, J.; Mian, A.N. Admission criteria in pakistani universities: A case study. In Proceedings of the 2016 International Conference on Frontiers of Information Technology (FIT), Islamabad, Pakistan, 19–21 December 2016. [Google Scholar]
  29. Pal, A.K.; Pal, S. Classification model of prediction for placement of students. Int. J. Mod. Educ. Comput. Sci. 2013, 5, 49. [Google Scholar]
  30. Yadav, S.K.; Bharadwaj, B.; Pal, S. Data mining applications: A comparative study for predicting student’s performance. arXiv 2012, arXiv:1202.4815. [Google Scholar]
  31. Boswell, D. Introduction to Support Vector Machines; Departement of Computer Science and Engineering University of California San Diego: La Jolla, CA, USA, 2002. [Google Scholar]
  32. Taheri, S.; Mammadov, M. Learning the naive Bayes classifier with optimization models. Int. J. Appl. Math. Comput. Sci. 2013, 23, 787–795. [Google Scholar] [CrossRef]
  33. Cortez, P.; Morais, A.D.J.R. A Data Mining Approach to Predict Forest Fires Using Meteorological Data; Associação Portuguesa Para a Inteligência Artificial: Braga, Portugal, 2007. [Google Scholar]
  34. Akinode, J.; Bada, O. Student Enrollment Prediction using Machine Learning Techniques. Presented at the 5th National Conference of the School of Pure & Applied Sciences Federal Polytechnic, Ilaro, Nigeria, 29–30 September 2021. [Google Scholar]
  35. Kim, J.S.; Sunderman, G.L. Measuring academic proficiency under the No Child Left Behind Act: Implications for educational equity. Educ. Res. 2005, 34, 3–13. [Google Scholar] [CrossRef]
  36. Huang, X.-L.; Ma, X.; Hu, F. Machine learning and intelligent communications. Mob. Netw. Appl. 2018, 23, 68–70. [Google Scholar] [CrossRef]
  37. Batool, S.; Rashid, J.; Nisar, M.W.; Kim, J.; Kwon, J.-Y.; Hussain, A. Educational data mining to predict students’ academic performance: A survey study. Educ. Inf. Technol. 2022, 7, 1–67. [Google Scholar] [CrossRef]
  38. Ekong, A.; Silas, A.; Inyang, S. A Machine Learning Approach for Prediction of Students’ Admissibility for Post-Secondary Education using Artificial Neural Network. Int. J. Comput. Appl. 2022, 184, 44–49. [Google Scholar] [CrossRef]
  39. Yağcı, M. Educational data mining: Prediction of students’ academic performance using machine learning algorithms. Smart Learn. Environ. 2022, 9, 11. [Google Scholar] [CrossRef]
  40. Yousafzai, B.K.; Khan, S.A.; Rahman, T.; Khan, I.; Ullah, I.; Ur Rehman, A.; Baz, M.; Hamam, H.; Cheikhrouhou, O. Student-performulator: Student academic performance using hybrid deep neural network. Sustainability 2021, 13, 9775. [Google Scholar] [CrossRef]
  41. Ahmad, I.; Ullah, I.; Khan, W.U.; Ur Rehman, A.; Adrees, M.S.; Saleem, M.Q.; Cheikhrouhou, O.; Hamam, H.; Shafiq, M. Efficient algorithms for E-healthcare to solve multiobject fuse detection problem. J. Healthc. Eng. 2021, 2021, 9500304. [Google Scholar] [CrossRef]
  42. Tufail, A.B.; Ullah, K.; Khan, R.A.; Shakir, M.; Khan, M.A.; Ullah, I.; Ma, Y.K.; Ali, M. On Improved 3D-CNN-Based Binary and Multiclass Classification of Alzheimer’s Disease Using Neuroimaging Modalities and Data Augmentation Methods. J. Healthc. Eng. 2022, 2022, 1302170. [Google Scholar] [CrossRef]
  43. Amjad, S.; Younas, M.; Anwar, M.; Shaheen, Q.; Shiraz, M.; Gani, A. Data Mining Techniques to Analyze the Impact of Social Media on Academic Performance of High School Students. Wirel. Commun. Mob. Comput. 2022, 2022, 9299115. [Google Scholar] [CrossRef]
  44. Raschka, S. Python Machine Learning; Packt Publishing Ltd.: Birmingham, UK, 2015. [Google Scholar]
  45. Saa, A.A. Educational data mining & students’ performance prediction. Int. J. Adv. Comput. Sci. Appl. 2016, 7. [Google Scholar]
  46. Macintyre, J. Engineering Applications of Neural Networks. In Proceedings of the 20th International Conference, EANN 2019, Xersonisos, Greece, 24–26 May 2019; Springer: Berlin/Heidelberg, Germany, 2019; Volume 1000. [Google Scholar]
  47. Marquez-Vera, C.; Romero, C.; Ventura, S. Predicting school failure using data mining. In Educational Data Mining; CiteSeer: University Park, PA, USA, 2011. [Google Scholar]
  48. Al-Obeidat, F.; Tubaishat, A.; Dillon, A.; Shah, B. Analyzing students’ performance using multi-criteria classification. Clust. Comput. 2018, 21, 623–632. [Google Scholar] [CrossRef]
  49. Tahir, M.E.; Abbas, N.; Sayat, M.F.; Nasir, M. Statistical analysis of crowd behaviour in catastrophic situation. Mehran Univ. Res. J. Eng. Technol. 2022, 41, 104–112. [Google Scholar] [CrossRef]
  50. Livieris, I.E.; Drakopoulou, K.; Tampakas, V.T.; Mikropoulos, T.A.; Pintelas, P. Predicting secondary school students’ performance utilizing a semi-supervised learning approach. J. Educ. Comput. Res. 2019, 57, 448–470. [Google Scholar] [CrossRef]
  51. Nair, P.B.; Choudhury, A.; Keane, A.J. Some greedy learning algorithms for sparse regression and classification with mercer kernels. J. Mach. Learn. Res. 2002, 3, 781–801. [Google Scholar]
  52. Trawiński, B.; Smętek, M.; Telec, Z.; Lasota, T. Nonparametric statistical analysis for multiple comparison of machine learning regression algorithms. Int. J. Appl. Math. Comput. Sci. 2012, 22, 867–881. [Google Scholar] [CrossRef]
  53. Goetz, J.; Brenning, A.; Petschko, H.; Leopold, P. Evaluating machine learning and statistical prediction techniques for landslide susceptibility modeling. Comput. Geosci. 2015, 81, 1–11. [Google Scholar] [CrossRef]
  54. Khalil, H.; Rahman, S.U.; Ullah, I.; Khan, I.; Alghadhban, A.J.; Al-Adhaileh, M.H.; Ali, G.; ElAffendi, M. A UAV-Swarm-Communication Model Using a Machine-Learning Approach for Search-and-Rescue Applications. Drones 2022, 6, 372. [Google Scholar] [CrossRef]
  55. Haq, I.; Mazhar, T.; Malik, M.A.; Kamal, M.M.; Ullah, I.; Kim, T.; Hamdi, M.; Hamam, H. Lung Nodules Localization and Report Analysis from Computerized Tomography (CT) Scan Using a Novel Machine Learning Approach. Appl. Sci. 2022, 12, 12614. [Google Scholar] [CrossRef]
  56. Tufail, A.B.; Ullah, I.; Rehman, A.U.; Khan, R.A.; Khan, M.A.; Ma, Y.K.; Hussain Khokhar, N.; Sadiq, M.T.; Khan, R.; Shafiq, M.; et al. On Disharmony in Batch Normalization and Dropout Methods for Early Categorization of Alzheimer’s Disease. Sustainability 2022, 14, 14695. [Google Scholar] [CrossRef]
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
