# Predictive Models of Student College Commitment Decisions Using Machine Learning


## Abstract


## 1. Introduction

## 2. Literature Review

## 3. Materials and Methods

- Logistic Regression (LG)
- Naive Bayes (NB)
- Decision Trees (DT)
- Support Vector Machine (SVM)
- K-Nearest Neighbors (K-NN)
- Random Forests (RF)
- Gradient Boosting (GB)
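
As a minimal sketch, the seven classifiers above can be instantiated and compared with scikit-learn (the library used in this study). The synthetic dataset and all settings below are illustrative placeholders, not the study's admissions data or tuned hyperparameters.

```python
# Sketch: instantiating and cross-validating the seven classifiers.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

# Placeholder data standing in for the admissions dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

classifiers = {
    "LG": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "K-NN": KNeighborsClassifier(n_neighbors=10),
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
}

# Mean 5-fold cross-validated accuracy for each classifier.
cv_accuracy = {name: cross_val_score(clf, X, y, cv=5).mean()
               for name, clf in classifiers.items()}
```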

#### 3.1. Data

Note that the variables are organized into columns corresponding to type, i.e., binary, categorical, and numerical.

#### 3.1.1. Preprocessing

#### 3.1.2. Data Exploration

#### 3.1.3. Prediction Techniques

#### 3.2. Methodology

#### 3.2.1. Implementation

#### 3.2.2. Resolution of Class Imbalance: Different Success Metrics

- **True positive**: The admitted student accepted the offer, and the model correctly predicted that the student would accept the offer (correct classification).
- **True negative**: The admitted student rejected the offer, and the model correctly predicted that the student would reject the offer (correct classification).
- **False positive (Type I error)**: The admitted student rejected the offer, but the model incorrectly predicted that the student would accept the offer (incorrect classification).
- **False negative (Type II error)**: The admitted student accepted the offer, but the model incorrectly predicted that the student would reject the offer (incorrect classification).
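
The four outcomes above can be tallied directly from paired actual/predicted labels; a minimal sketch (the label vectors are toy examples):

```python
# Sketch: tallying the four confusion-matrix outcomes.
# 1 = admitted student accepted the offer, 0 = rejected it.
def confusion_counts(actual, predicted):
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    return {"TP": tp, "TN": tn, "FP": fp, "FN": fn}

counts = confusion_counts([1, 0, 1, 1, 0, 0], [1, 0, 0, 1, 1, 0])
# counts == {"TP": 2, "TN": 2, "FP": 1, "FN": 1}
```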

**Definition 1.** *Accuracy is the fraction of predictions the model classifies correctly:* $\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$.

**Definition 2.** *Precision is the fraction of predicted acceptances that are actual acceptances:* $\mathrm{Precision} = \frac{TP}{TP + FP}$.

**Definition 3.** *Recall is the fraction of actual acceptances the model predicts correctly:* $\mathrm{Recall} = \frac{TP}{TP + FN}$.

**Definition 4.** *For $\beta > 0$, the $F_\beta$ score is the weighted harmonic mean of precision and recall:* $F_\beta = (1+\beta^2)\,\frac{\mathrm{Precision}\cdot\mathrm{Recall}}{\beta^2\cdot\mathrm{Precision} + \mathrm{Recall}}$.
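
The success metrics of this section — accuracy, precision, recall, and the $F_\beta$ score used in the Results — can be sketched in terms of the confusion-matrix counts. The counts below are toy numbers for illustration:

```python
# Sketch: the success metrics expressed via confusion-matrix counts.
def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def f_beta(tp, fp, fn, beta):
    # Weighted harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# With beta = 0.5 the score weights precision more heavily than recall,
# which matters under class imbalance.
score = f_beta(tp=40, fp=10, fn=20, beta=0.5)
```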

#### 3.2.3. Feature Selection

## 4. Results

#### 4.1. Choosing the Final Model: Classifier Comparison and Hyperparameter Optimization

- penalty: $L_2$
- solver: newton-cg
- max_iter $= 100$ (default setting)
- tolerance factor $= 10$ (tolerance for stopping criteria)
- $C = 10^2$ ($C$ = inverse of regularization strength)
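
The configuration above can be expressed in scikit-learn as follows. Mapping the stated "tolerance factor = 10" directly onto scikit-learn's `tol` parameter is an assumption on our part; the rest mirrors the listed settings.

```python
# Sketch: the final Logistic Regression configuration in scikit-learn.
from sklearn.linear_model import LogisticRegression

final_model = LogisticRegression(
    penalty="l2",        # L2 regularization
    solver="newton-cg",  # Newton conjugate-gradient solver
    max_iter=100,        # default iteration cap
    tol=10,              # stopping-criterion tolerance (assumed mapping)
    C=100,               # inverse of regularization strength, C = 10^2
)
```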

#### 4.2. Testing Statistical Significance of Important Features by the Chi-Squared Test

## 5. Discussion

#### 5.1. Conclusions

- predicting a student’s likelihood of receiving admission to an institution of their choice;
- predicting how likely an admission committee is to admit an applicant based on the information provided in their application file [19] (dataset size 588);
- predicting student dropouts in an online program of study [20] (dataset size 189).

#### 5.2. Future Work

**Feature Engineering.** First, the adage "better data beat better algorithms" comes to mind. Feature engineering is the process of using domain knowledge of the problem being modeled to create features that increase the predictive power of the model [9]. This is often a difficult task that requires deep knowledge of the problem domain. One way to engineer features for the accepted-student college commitment decision problem is to design arrival surveys for incoming students to better understand their reasons for committing to the college. Knowledge gained from these surveys would allow the creation of features to be added to the dataset associated with future admits. These engineered features would likely improve model accuracy, with the caveat that including more input features may increase training time.

**Geocoding.** Another avenue for exploration is to better understand the effects of incorporating geographic location into the model, a process referred to as geocoding [42]. For example, we could consider an applicant's location relative to that of the institution they are applying to. One can regard geocoding as a special case of feature engineering.

**Data Imputation.** Instead of dropping categorical variables with missing or obviously inaccurate entries, which results in the loss of potentially useful information, one could instead implement data imputation [8]. Implementation of data imputation techniques could potentially improve the accuracy of the model presented here.

**Ensemble Learning.** Random Forest is an ensemble learning technique that was used in this study but did not outperform the Logistic Regression classifier. Ensemble learning is a process by which a decision is made via the combination of multiple classification techniques [9]. Future work could consider other ensemble learning methods, though we note that they typically require more storage and computation time.
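
The data-imputation idea from Future Work can be sketched with scikit-learn's `SimpleImputer`: fill missing entries instead of dropping the variable. The values below are toy examples, not study data.

```python
# Sketch: imputing missing numerical and categorical entries.
import numpy as np
from sklearn.impute import SimpleImputer

# Numerical feature (e.g., GPA) with a missing entry imputed by the mean.
gpa = np.array([[3.8], [np.nan], [3.2], [4.0]])
gpa_filled = SimpleImputer(strategy="mean").fit_transform(gpa)

# Categorical feature (e.g., Permanent State/Region) imputed by the
# most frequent value.
state = np.array([["CA"], ["NY"], [np.nan], ["CA"]], dtype=object)
state_filled = SimpleImputer(strategy="most_frequent").fit_transform(state)
```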

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## Appendix A

**Logistic Regression** produces, via the logistic function, a probability that a given applicant falls into one of two classes: accepts the admission offer or rejects it [9,38]. The main objective of Logistic Regression is to find the best-fitting model describing the relationship between the dependent variable and the set of independent variables. The parameters of the logistic function are optimized using the method of maximum likelihood [43].

**Naive Bayes** (NB) is a well-known algorithm based on Bayes' Theorem [8,38] whose goal is to compute the conditional probability distribution of each feature. The conditional probability that a vector is classified into a class C equals the product of the probabilities of each of the vector's features belonging to class C. It is called "naive" because of its core assumption of conditional independence, i.e., all input features are assumed to be independent of one another. If the conditional independence assumption actually holds, a Naive Bayes classifier converges more quickly than other models such as Logistic Regression.

**Decision Trees** (or the ID3 algorithm) use a tree-based structure to geometrically represent a number of possible decision paths and an outcome for each path [39]. A typical decision tree starts at a root node and then branches out into a series of possible decision nodes by ranking features to minimize the entropy of the sub-dataset. Edges representing decision rules culminate in leaves representing the outcome classes for the given study. Parameters that can be controlled in Decision Trees include, but are not limited to, the maximum depth of the tree, the minimum number of samples a node must have before it can be split, and the minimum number of samples a leaf node must have.

**Support Vector Machines** (SVM) separate binary classes with an optimal hyperplane that maximizes the margin of the data, i.e., SVM searches for a decision surface that is as far away as possible from any data point while dividing the two classes [9,10]. The margin of the classifier is the distance from the decision surface to the closest data points; these points are called the support vectors.

**$K$-Nearest Neighbors** (K-NN) uses the notion of distance between data points and is based on the assumption that data points that are "close" to each other are similar [8,39]. Given an unseen observation, its unknown class is assigned by investigating its K "nearest" labeled data points and the classes those points belong to. The unseen data point is assigned the majority class among its K nearest neighbors. The choice of the parameter K can be crucial in this algorithm.

**Random Forest** is a classifier based on a combination of tree predictors such that each tree in the ensemble is constructed independently [8,38]. After some number of trees is generated, each tree votes on how to classify a new data point and the majority prediction wins. Parameters that can be tuned in the Random Forest classifier include the number of trees in the ensemble, the number of features considered at each split, and the minimum number of samples needed to make a split.

**Gradient Boosting** is a boosting method in which the weak prediction models are decision trees (or stumps), and each weak predictor that is sequentially added tries to fit the residual error made by the previous weak predictor [8]. Boosting refers to any ensemble method that combines an ensemble of weak prediction models into a strong classifier.
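
The boosting idea described above — sequentially adding depth-1 trees ("stumps") that each fit the residual error of the ensemble so far — can be sketched in scikit-learn. The dataset and settings below are illustrative, not the study's.

```python
# Sketch: gradient boosting with decision stumps.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)

booster = GradientBoostingClassifier(
    n_estimators=50,    # number of sequential weak learners
    max_depth=1,        # depth-1 trees, i.e., stumps
    learning_rate=0.1,  # shrinks each stump's contribution
    random_state=0,
).fit(X, y)

train_accuracy = booster.score(X, y)
```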

## References

1. Lapovsky, L. The Changing Business Model For Colleges And Universities. Forbes **2018**. Available online: https://www.forbes.com/sites/lucielapovsky/2018/02/06/the-changing-business-model-for-colleges-and-universities/#bbc03d45ed59 (accessed on 15 December 2018).
2. The Higher Education Business Model, Innovation and Financial Sustainability. Available online: https://www.tiaa.org/public/pdf/higher-education-business-model.pdf (accessed on 15 December 2018).
3. Occidental College. Available online: https://www.oxy.edu (accessed on 15 December 2018).
4. Occidental College Office of Financial Aid. Available online: https://www.oxy.edu/admission-aid/costs-financial-aid (accessed on 15 December 2018).
5. Tuition Discounting. Available online: https://www.agb.org/briefs/tuition-discounting (accessed on 15 December 2018).
6. Massa, R.J.; Parker, A.S. Fixing the net tuition revenue dilemma: The Dickinson College story. New Dir. High. Educ. **2007**, 140, 87–98.
7. Hossler, D.; Bean, J.P. The Strategic Management of College Enrollments, 1st ed.; Jossey Bass: San Francisco, CA, USA, 1990.
8. Géron, A. Hands-On Machine Learning with Scikit-Learn & TensorFlow, 1st ed.; O'Reilly: Sebastopol, CA, USA, 2017.
9. Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference and Prediction, 2nd ed.; Springer: New York, NY, USA, 2009.
10. Grus, J. Data Science from Scratch: First Principles with Python, 1st ed.; O'Reilly: Sebastopol, CA, USA, 2015.
11. Kotsiantis, S.B. Supervised Machine Learning: A Review of Classification Techniques. Informatica **2007**, 4, 249–268.
12. Alpaydin, E. Introduction to Machine Learning, 3rd ed.; MIT Press: Cambridge, MA, USA, 2010.
13. Kuncheva, L.I. Combining Pattern Classifiers: Methods and Algorithms, 2nd ed.; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2014.
14. Occidental College Office of Admissions. Available online: https://www.oxy.edu/admission-aid (accessed on 15 December 2018).
15. Journal of Educational Data Mining. Available online: http://jedm.educationaldatamining.org/index.php/JEDM (accessed on 15 December 2018).
16. Educational Data Mining Conference 2018. Available online: http://educationaldatamining.org/EDM2018/ (accessed on 15 December 2018).
17. Romero, C.; Ventura, S. Educational data mining: A survey from 1995 to 2005. Expert Syst. Appl. **2007**, 33, 135–146.
18. Peña-Ayala, A. Educational data mining: A survey and a data mining-based analysis of recent works. Expert Syst. Appl. **2014**, 41, 1432–1462.
19. Waters, A.; Miikkulainen, R. GRADE: Machine Learning Support for Graduate Admissions. In Proceedings of the Twenty-Fifth Conference on Innovative Applications of Artificial Intelligence, Bellevue, WA, USA, 14–18 July 2013. Available online: http://www.cs.utexas.edu/users/ai-lab/downloadPublication.php?filename=http://www.cs.utexas.edu/users/nn/downloads/papers/waters.iaai13.pdf&pubid=127269 (accessed on 15 December 2018).
20. Yukselturk, E.; Ozekes, S.; Türel, Y.K. Predicting Dropout Student: An Application of Data Mining Methods in an Online Education Program. Eur. J. Open Distance E-Learn. **2014**, 17, 118–133.
21. Tampakas, V.; Livieris, I.E.; Pintelas, E.; Karacapilidis, N.; Pintelas, P. Prediction of students' graduation time using a two-level classification algorithm. In Proceedings of the 1st International Conference on Technology and Innovation in Learning, Teaching and Education (TECH-EDU 2018), Thessaloniki, Greece, 20–22 June 2018.
22. Livieris, I.E.; Kotsilieris, T.; Tampakas, V.; Pintelas, P. Improving the evaluation process of students' performance utilizing a decision support software. Neural Comput. Appl. **2018**.
23. Livieris, I.E.; Drakopoulou, K.; Kotsilieris, T.; Tampakas, V.; Pintelas, P. DSS-PSP—A decision support software for evaluating students' performance. Eng. Appl. Neural Netw. (EANN) **2017**, 744, 63–74.
24. Duzhin, F.; Gustafsson, A. Machine Learning-Based App for Self-Evaluation of Teacher-Specific Instructional Style and Tools. Educ. Sci. **2018**, 8, 7.
25. Chang, L. Applying Data Mining to Predict College Admissions Yield: A Case Study. New Dir. Institutional Res. **2006**, 131, 53–68.
26. Powell, F. Universities, Colleges Where Students Are Eager to Enroll. U.S. News and World Report, 2018. Available online: https://www.usnews.com/education/best-colleges/articles/2018-01-23/universities-colleges-where-students-are-eager-to-enroll (accessed on 15 December 2018).
27. Wu, X.; Kumar, V.; Quinlan, J.R.; Ghosh, J.; Yang, Q.; Motoda, H.; McLachlan, G.J.; Ng, A.F.M.; Liu, B.; Yu, P.S.; et al. Top 10 algorithms in data mining. Knowl. Inf. Syst. **2008**, 14, 1–37.
28. Japkowicz, N.; Stephen, S. The class imbalance problem: A systematic study. Intell. Data Anal. **2002**, 6, 429–449.
29. James, G.; Witten, D.; Hastie, T.; Tibshirani, R. An Introduction to Statistical Learning with Applications in R, 7th ed.; Springer: New York, NY, USA, 2013.
30. Brink, H.; Richards, J.; Fetherolf, M. Real-World Machine Learning, 1st ed.; Manning: Shelter Island, NY, USA, 2017. Available online: https://www.manning.com/books/real-world-machine-learning (accessed on 15 December 2018).
31. Rao, R.B.; Fung, G. On the Dangers of Cross-Validation: An Experimental Evaluation. In Proceedings of the 2008 International Conference on Data Mining, Atlanta, GA, USA, 24–26 April 2008. Available online: https://doi.org/10.1137/1.9781611972788.54 (accessed on 15 April 2019).
32. Dormann, C.; Elith, J.; Bacher, S.; Buchmann, C.; Carl, G.; Carré, G.; García Marquéz, J.R.; Gruber, B.; Lafourcade, B.; Leitão, P.J.; et al. Collinearity: A review of methods to deal with it and a simulation study evaluating their performance. Ecography **2012**, 36, 27–46.
33. Maarsman, M.; Waldorp, L.; Maris, G. A note on large-scale logistic prediction, using an approximate graphical model to deal with collinearity and missing data. Behaviometrika **2017**, 44, 513–534.
34. Peck, R.; Olsen, C.; Devore, J. Statistics and Data Analysis, 5th ed.; Cengage: Boston, MA, USA, 2016.
35. Scikit-learn: Machine Learning in Python. Available online: http://scikit-learn.org/stable/ (accessed on 15 December 2018).
36. Python 3.0. Available online: https://www.python.org (accessed on 15 December 2018).
37. Schapire, R.E. The Boosting Approach to Machine Learning: An Overview, 1st ed.; Springer: New York, NY, USA, 2003; pp. 149–171.
38. Godsey, B. Think Like a Data Scientist, 1st ed.; Manning: Shelter Island, NY, USA, 2017.
39. Cielen, D.; Meysman, A.; Ali, M. Introducing Data Science, 1st ed.; Manning: Shelter Island, NY, USA, 2016. Available online: https://www.manning.com/books/introducing-data-science (accessed on 15 December 2018).
40. Chawla, N.V.; Bowyer, K.W.; Hall, O.; Kegelmeyer, W.P. SMOTE: Synthetic Minority Over-sampling Technique. J. Artif. Intell. Res. **2002**, 16, 321–357.
41. IBM SPSS Software. Available online: https://www.ibm.com/analytics/spss-statistics-software (accessed on 15 December 2018).
42. Karimi, H.A.; Karimi, B. Geospatial Data Science Techniques and Applications, 1st ed.; CRC Press: Boca Raton, FL, USA, 2017.
43. Millar, R.B. Maximum Likelihood Estimation and Inference: With Examples in R, SAS and ADMB, 1st ed.; Wiley: New York, NY, USA, 2011.


| Binary | Categorical | Numerical |
|---|---|---|
| Gross Commit Indicator | Application Term | GPA |
| Net Commit Indicator | Ethnic Background | HS Class Rank |
| Final Decision | Permanent State/Region | HS Class Size |
| Financial Aid Intent | Permanent Zip Code/Postal Code | ACT Composite Score |
| Gender | Permanent Country | SAT I Critical Reading |
| Legacy | Current/Most Recent School Geomarket | SAT I Math |
| Direct Legacy | First Source Date | SAT I Writing |
| First Generation Indicator | First Source Summary | SAT I Superscore |
| Campus Visit Indicator | Top Academic Interest | SATR EBRW |
| Interview | Extracurricular Interests | SATR Math |
| Recruited Athlete Indicator | Level of Financial Need | SATR Total |
| | Scholarship | Reader Academic Rating |

| Application Year | Class of 2018 | Class of 2019 | Class of 2020 | Class of 2021 | Total |
|---|---|---|---|---|---|
| Applicants | 6071 | 5911 | 6409 | 6775 | 25,166 |
| Admits | 2549 | 2659 | 2948 | 2845 | 11,001 |
| Admit Percentage | 42% | 45% | 46% | 42% | 44% |
| Commits | 547 | 518 | 502 | 571 | 2138 |
| Commit Percentage | 21% | 19% | 17% | 20% | 19% |
| Uncommits | 2002 | 2141 | 2446 | 2274 | 8863 |
| Uncommit Percentage | 79% | 81% | 83% | 80% | 81% |

| GPA Bin | Accept Percentage | Reject Percentage | Number of Students |
|---|---|---|---|
| $2.0+$ | 100 | 0 | 1 |
| $2.2+$ | 33 | 67 | 3 |
| $2.4+$ | 100 | 0 | 4 |
| $2.6+$ | 43 | 57 | 14 |
| $2.8+$ | 53 | 47 | 34 |
| $3.0+$ | 71 | 29 | 152 |
| $3.2+$ | 77 | 23 | 586 |
| $3.4+$ | 79 | 21 | 1424 |
| $3.6+$ | 85 | 15 | 2577 |
| $3.8+$ | 87 | 13 | 3324 |
| $4.0$ | 89 | 11 | 1412 |

| | Accept Percentage | Reject Percentage | Number of Students |
|---|---|---|---|
| Visited Campus | 23 | 77 | 5122 |
| Didn't Visit Campus | 7 | 93 | 4409 |

| | Predicted Accept | Predicted Reject |
|---|---|---|
| Accepted | True Positive | False Negative |
| Rejected | False Positive | True Negative |

| Numerical Variable | GPA | HS Class Size | RAR |
|---|---|---|---|
| GPA | 1.0000 | 0.0045 | −0.6100 |
| HS Class Size | 0.0045 | 1.0000 | −0.0820 |
| Reader Academic Rating (RAR) | −0.6100 | −0.0820 | 1.0000 |

| Variable Name | Type | Range/Example Values |
|---|---|---|
| Financial Aid Intent | binary | Y/N |
| Scholarship | categorical | type of scholarship program |
| Direct Legacy | binary | Y/N |
| Ethnic Background | categorical | e.g., Hispanic, White |
| First Generation Indicator | binary | Y/N |
| Permanent State/Region | categorical | e.g., NY, CA |
| GPA | numerical | 0–4 |
| HS Class Size | numerical | 1–5000 |
| Campus Visit Indicator | binary | Y/N |
| Interview | binary | Y/N |
| Top Academic Interest | categorical | e.g., Politics, Marine Biology |
| Extracurricular Interests | categorical | e.g., Dance, Yoga |
| Gender | binary | M/F |
| Level of Financial Need | categorical | High/Medium/Low |
| Reader Academic Rating (RAR) | numerical | 1–5 |

| Classifier | CV Accuracy | Run Time (s) | CV AUC | Run Time (s) |
|---|---|---|---|---|
| Logistic Regression | 85.53% | 48.72 | 77.79% | 47.60 |
| Naive Bayes | 85.13% | 46.82 | 66.99% | 47.14 |
| Decision Trees | 82.59% | 199.80 | 59.47% | 199.57 |
| SVM | 83.13% | 2111.62 | 66.61% | 2112.58 |
| 10-Nearest Neighbors | 84.78% | 856.00 | 69.45% | 857.88 |
| Random Forests | 86.18% | 245.08 | 72.85% | 242.51 |
| Gradient Boosting | 84.96% | 10,461.47 | 76.02% | 10,308.31 |

| Classifier | Training $F_{0.5}$ Score | Testing $F_{0.5}$ Score |
|---|---|---|
| Logistic Regression | 0.8818 | 0.8283 |
| Naive Bayes | 0.8332 | 0.8264 |
| Decision Trees | 1.0000 | 0.8045 |
| SVM | 0.8326 | 0.8264 |
| 10-Nearest Neighbors | 0.8415 | 0.8245 |
| Random Forests | 0.9959 | 0.8276 |
| Gradient Boosting | 0.8647 | 0.8264 |

| Variable | Importance in Percentage |
|---|---|
| GPA | 4.0841 |
| Campus Visit Indicator | 4.0831 |
| HS Class Size | 3.6876 |
| Reader Academic Rating | 2.3227 |
| Gender | 1.5428 |

| Variable | p-Value |
|---|---|
| GPA | 0.021 |
| Campus Visit Indicator | 0.028 |
| HS Class Size | 0.037 |
| Reader Academic Rating | 0.039 |
| Gender | 0.053 |
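
A chi-squared test of independence of the kind reported in Section 4.2 — here between a binary feature such as the Campus Visit Indicator and the commit decision — can be sketched with SciPy. The contingency counts below are illustrative, not the study's data.

```python
# Sketch: chi-squared test of independence for a binary feature.
from scipy.stats import chi2_contingency

#                 commit  no-commit
contingency = [[120,     380],    # visited campus
               [ 40,     460]]    # did not visit
chi2, p_value, dof, expected = chi2_contingency(contingency)
# A p-value below 0.05 suggests the feature and the commit decision
# are not independent, i.e., the feature is statistically significant.
```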

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Basu, K.; Basu, T.; Buckmire, R.; Lal, N.
Predictive Models of Student College Commitment Decisions Using Machine Learning. *Data* **2019**, *4*, 65.
https://doi.org/10.3390/data4020065
