Using Machine Learning in Veterinary Medical Education: An Introduction for Veterinary Medicine Educators
1. Introduction
1.1. Introduction to Educational Data Mining and Machine Learning
1.2. Comparison of Classical Statistical Analysis and Machine Learning Models
1.3. Overview of the Main Types of Machine Learning Algorithms and Random Forest Machine Learning Models
1.4. Programming Languages and Tools
2. Simulation of Dataset and Creation of a Random Forest Machine Learning Model
2.1. Defining the Project Goals
2.2. Data Collection and Storage Plan
2.2.1. Simulated Data Collection and Storage
2.2.2. Importing the Dataset
- #Import required packages:
- import pandas as pd
- #Import the dataset using the function pd.read_excel().
- dataset = pd.read_excel(r'C:\location_of_data\name_of_excell_datafile.xlsx',
- sheet_name = 'name')
- #To view the first 10 rows of the dataset with the column names:
- dataset.head(10)
2.3. Exploratory Data Analysis
2.4. Data Preprocessing
- #To one-hot encode the Race column, use get_dummies() from the pandas package
- #We assign this transformed data to a new variable called dataset_OneHot
- #The argument drop_first must be False in order to perform one-hot
- #encoding.
- dataset_OneHot = pd.get_dummies(dataset, columns = ['Race'], drop_first = False)
- dataset_OneHot.head() #To view the first several rows and column names
- #To dummy encode the Gender column, set drop_first to True
- dataset_OneHot = pd.get_dummies(dataset_OneHot, columns = ['Gender'],
- drop_first = True)
- print(dataset_OneHot.head()) #To view the first several rows and column names
- #Import the first dataset with missing GRE values using the function pd.read_excel().
- biased_dataset = pd.read_excel(r'C:\location_of_data\name_of_excell_datafile.xlsx',
- sheet_name = 'name')
- #Encode the Race and Gender columns as shown above
- biased_dataset_OneHot = pd.get_dummies(biased_dataset, columns = ['Race'], drop_first = False)
- biased_dataset_OneHot = pd.get_dummies(biased_dataset_OneHot, columns = ['Gender'], drop_first = True)
- #Option 1: delete each student record that does not have a GRE score reported
- #An empty GRE value is read by pandas as a missing value ("na"), therefore we use
- #dropna()
- #The argument axis = 0 means the row with the "na" will be dropped.
- #The argument how = 'any' means that any "na" will result in the row being deleted
- #The argument inplace = True means that a new dataframe will not be created
- biased_dataset_OneHot.dropna(axis = 0, how = 'any', inplace = True)
- #Option 2 (used instead of Option 1): replace each missing GRE score with the mean
- #of the GRE values. Only the GRE column is filled.
- biased_dataset_OneHot['GRE'] = biased_dataset_OneHot['GRE'].fillna(
- biased_dataset_OneHot['GRE'].mean())
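The encoding and missing-value steps above can be tried on a small table before they are applied to the full dataset. The following is a minimal sketch, assuming a hypothetical four-student sample whose values are invented for illustration; the names `sample`, `encoded`, `dropped`, and `imputed` are not from the original analysis.

```python
import numpy as np
import pandas as pd

# Hypothetical four-student sample; all values are invented for illustration.
sample = pd.DataFrame({
    "Race": ["Asian", "Black", "White", "Asian"],
    "Gender": ["Female", "Male", "Female", "Male"],
    "GRE": [300.0, np.nan, 320.0, 310.0],
})

# One-hot encoding keeps one indicator column per Race category.
encoded = pd.get_dummies(sample, columns=["Race"], drop_first=False)
# Dummy encoding drops the first Gender category (Female), leaving Gender_Male.
encoded = pd.get_dummies(encoded, columns=["Gender"], drop_first=True)

# Option 1: delete every record that has no GRE score.
dropped = encoded.dropna(axis=0, subset=["GRE"])

# Option 2: replace missing GRE scores with the mean of the observed scores.
imputed = encoded.copy()
imputed["GRE"] = imputed["GRE"].fillna(imputed["GRE"].mean())

print(list(encoded.columns))
print(len(dropped), "records remain after deletion")
print(imputed["GRE"].tolist())
```

Restricting both operations to the GRE column is a deliberate choice in this sketch: it keeps a missing value in any other column from being deleted or overwritten by accident.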
2.5. Data Feature Extraction
2.6. Model Creation and Performance Evaluation
2.6.1. Generation of Base Random Forest Model
- #X is our variable dataframe and y is our target dataframe
- #create dataframe without target, [rows, columns], the : indicates to select all rows
- X = dataset_OneHot.loc[:, dataset_OneHot.columns != 'Fail']
- y = dataset_OneHot['Fail'] #target variable for prediction
- #Import required python function of SMOTE from Python package imbalanced-learn
- #version 0.10.1
- from imblearn.over_sampling import SMOTE
- #Oversampling to allow 0 and 1 target to be equal
- #Assigning a value to the random state argument ensures that anyone can generate
- #the same set of random numbers again
- X_resampled, y_resampled = SMOTE(random_state = 23).fit_resample(X, y)
- #Import required functions:
- from sklearn.model_selection import train_test_split
- #Use the balanced data to create testing and training datasets with 70% of the data
- #being training and 30% of the data being testing.
- X_trainSMOTE, X_testSMOTE, y_trainSMOTE, y_testSMOTE = train_test_split(
- X_resampled, y_resampled, stratify = y_resampled, test_size = 0.3,
- random_state = 50)
- #Check sizes of arrays to make sure they match each other
- print('Training Variables Shape:', X_trainSMOTE.shape)
- print('Training Target Shape:', y_trainSMOTE.shape)
- print('Testing Variables Shape:', X_testSMOTE.shape)
- print('Testing Target Shape:', y_testSMOTE.shape)
- #Import required functions:
- from sklearn.ensemble import RandomForestClassifier
- #Build base model without any changes to default settings
- forest_base = RandomForestClassifier(random_state = 23)
- #Train the model via fit()
- forest_base.fit(X_trainSMOTE, y_trainSMOTE) #using training data
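The split-and-fit steps above can be checked end to end without the study data. This is a minimal sketch, assuming a synthetic dataset from scikit-learn's `make_classification` as a stand-in for the balanced records; `X_demo`, `y_demo`, and `forest_demo` are illustrative names only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the balanced data (assumption: 400 records, 6 features).
X_demo, y_demo = make_classification(
    n_samples=400, n_features=6, weights=[0.5, 0.5], random_state=23)

# 70/30 split; stratify keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, stratify=y_demo, test_size=0.3, random_state=50)

# Base random forest with default settings, trained on the training part only.
forest_demo = RandomForestClassifier(random_state=23)
forest_demo.fit(X_train, y_train)

print("Training shape:", X_train.shape, "Testing shape:", X_test.shape)
print("Share of class 1 in train/test:", y_train.mean(), y_test.mean())
```

Comparing the share of the positive class in the two parts is a quick way to confirm that stratification worked as intended.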
2.6.2. Evaluation of the Base Random Forest Model
- #Make predictions using testing data set
- y_predictions = forest_base.predict(X_testSMOTE)
- y_trueSMOTE = y_testSMOTE #Rename the test target dataframe
- #Import required function
- from sklearn.model_selection import KFold
- #Defining the cross-validation to be able to compute the performance metrics using
- #the k-fold CV
- kf = KFold(shuffle = True, n_splits = 5)
- #Import required function
- from sklearn.model_selection import cross_val_score
- #To calculate the accuracy of the model using k-fold cross validation
- score_accuracy_mean = cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE,
- cv = kf, scoring = 'accuracy').mean()
- print(score_accuracy_mean) #View the mean of the CV validation results for
- #accuracy of the model.
- #To calculate the recall of the model using k-fold cross validation
- score_recall_mean = cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE, cv = kf,
- scoring = 'recall').mean()
- print(score_recall_mean) #View the mean of the CV validation results for recall of the model
- #Import required function
- from sklearn.metrics import make_scorer, recall_score
- #Define specificity
- scoring = make_scorer(recall_score, pos_label = 0)
- #Use our defined specificity as the type of score that is calculated
- score_specificity_mean = cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE,
- cv = kf, scoring = scoring).mean()
- cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE, cv = kf, scoring = scoring)
- print(score_specificity_mean) #View the mean of the CV validation results for
- #specificity of the model
- # To calculate the precision of the model using k-fold cross validation
- score_precision_mean = cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE,
- cv = kf, scoring = 'precision').mean()
- print(score_precision_mean) #View the mean of the CV validation results for
- #precision of the model
- #To calculate the F-score of the model using k-fold cross validation
- score_f1_mean = cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE, cv = kf,
- scoring = 'f1').mean()
- print(score_f1_mean) #View the mean of the CV validation results for F1-score of
- #the model
- #To calculate the ROC curve AUC of the model using k-fold cross validation
- score_auc_mean = cross_val_score(forest_base, X_testSMOTE, y_trueSMOTE, cv = kf,
- scoring = 'roc_auc').mean()
- print(score_auc_mean) #View the mean of the CV validation results for ROC curve
- #AUC of the model
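The listing above calls cross_val_score() once per metric. As an alternative sketch, scikit-learn's `cross_validate` accepts several scorers at once, so every metric is computed on the same folds. The synthetic data and the names `X_demo`, `y_demo`, `kf_demo`, and `scorers` are assumptions for illustration, and fixing `random_state` in `KFold` is a choice made here for reproducibility, not a setting taken from the original code.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import KFold, cross_validate

# Synthetic stand-in data (assumption), binary target coded 0/1.
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=23)

# Five shuffled folds; random_state makes the fold assignment repeatable.
kf_demo = KFold(n_splits=5, shuffle=True, random_state=23)

# Specificity is the recall of the negative class (label 0).
scorers = {
    "accuracy": "accuracy",
    "recall": "recall",
    "specificity": make_scorer(recall_score, pos_label=0),
    "precision": "precision",
    "f1": "f1",
    "roc_auc": "roc_auc",
}

results = cross_validate(
    RandomForestClassifier(n_estimators=50, random_state=23),
    X_demo, y_demo, cv=kf_demo, scoring=scorers)

# cross_validate stores each metric under the key "test_<name>".
mean_scores = {name: results[f"test_{name}"].mean() for name in scorers}
for name, value in mean_scores.items():
    print(f"{name}: {value:.3f}")
```

Because all metrics share the same folds, differences between them reflect the model rather than a different random partition of the data.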
2.6.3. Tuning of the Random Forest Model
- ##Assess hyperparameters to try to improve upon base model:
- #Import required functions:
- from sklearn.model_selection import RandomizedSearchCV
- from sklearn.model_selection import GridSearchCV
- # Create the hyperparameter grid, first for the random search function
- hyper_grid = {# Number of trees to be included in random forest
- 'n_estimators': [150, 200, 250, 300, 350, 400],
- # Number of features to consider at every split
- 'max_features': ['sqrt'],
- #Maximum number of levels in a tree
- 'max_depth': [10, 20, 40, 60, 80, 100, 120, 140, 160, 180, 200],
- # Minimum number of samples required to split a node
- 'min_samples_split': [2, 4, 6, 8, 10],
- # Minimum number of samples required at each leaf node
- 'min_samples_leaf': [1, 2, 4, 6, 8, 10],
- # Method of selecting samples for training each tree
- 'bootstrap': [True, False]}
- #Initiate random forest base model to tune
- best_params = RandomForestClassifier(random_state = (23))
- #Use random grid search to find best hyperparameters, uses k-fold validation as cross
- #validation method
- #Search 200 different combinations
- best_params_results = RandomizedSearchCV(estimator = best_params,
- param_distributions = hyper_grid, n_iter = 200, cv = kf, verbose = 5, random_state = (23))
- #Fit the random search model
- best_params_results.fit(X_trainSMOTE, y_trainSMOTE)
- #Find the best parameters from the random search results
- print(best_params_results.best_params_)
- #Build another hyperparameter grid using narrowed down parameter guidelines
- #from above
- #Then use GridSearchCV method to search every combination of grid
- #Note: an integer min_samples_split must be at least 2
- new_grid = {'n_estimators': [250, 275, 300, 325, 332, 350, 375],
- 'max_features': ['sqrt'],
- 'max_depth': [160, 165, 170, 175, 180, 185, 190, 195],
- 'min_samples_split': [2, 3, 4, 5, 6],
- 'min_samples_leaf': [1, 2, 3],
- 'bootstrap': [True]}
- #Initiate random forest base model to tune
- best_params = RandomForestClassifier(random_state = (23))
- #Use GridSearchCV method to search every combination of grid
- best_params_grid_search = GridSearchCV(estimator = best_params, param_grid =
- new_grid, cv = kf, n_jobs = -1, verbose = 10)
- #Fit the gridsearch model
- best_params_grid_search.fit(X_trainSMOTE, y_trainSMOTE)
- #Get the results of the grid search from the random forest model
- best_params_grid_search.best_params_
- #Using the results of the best parameters, we will create a new model and show the
- #specific arguments.
- best_grid_model = RandomForestClassifier(n_estimators = 375, max_features = 'sqrt',
- max_depth = 160, min_samples_split = 2, min_samples_leaf = 2, bootstrap = True)
- #Train the best model based upon the grid search
- best_grid_model.fit(X_trainSMOTE, y_trainSMOTE)
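The two-stage tuning above (a random search over a wide grid, then an exhaustive search over a narrowed grid) can be sketched at a much smaller scale. The example below assumes synthetic data and deliberately tiny grids so that it runs in seconds; `coarse_grid`, `fine_grid`, and `tuned_model` are hypothetical names, and the grid values are not the ones used in the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, RandomizedSearchCV

# Synthetic stand-in data and a small, repeatable 3-fold split (assumptions).
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=23)
kf_demo = KFold(n_splits=3, shuffle=True, random_state=23)

# Stage 1: sample 5 of the 27 combinations in a coarse grid.
coarse_grid = {
    "n_estimators": [20, 40, 60],
    "max_depth": [2, 4, 8],
    "min_samples_leaf": [1, 2, 4],
}
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=23),
    param_distributions=coarse_grid, n_iter=5, cv=kf_demo, random_state=23)
random_search.fit(X_demo, y_demo)
coarse_best = random_search.best_params_

# Stage 2: search every combination in a narrow grid around the stage-1 result.
n_best = coarse_best["n_estimators"]
fine_grid = {
    "n_estimators": [n_best - 10, n_best, n_best + 10],
    "max_depth": [coarse_best["max_depth"]],
    "min_samples_leaf": [coarse_best["min_samples_leaf"]],
}
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=23), param_grid=fine_grid, cv=kf_demo)
grid_search.fit(X_demo, y_demo)

# best_estimator_ is already refit on all the data passed to fit().
tuned_model = grid_search.best_estimator_
print("Random search best:", coarse_best)
print("Grid search best:", grid_search.best_params_)
```

Taking the model from `best_estimator_` avoids retyping the selected hyperparameters by hand, which removes one opportunity for transcription errors.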
2.6.4. Determining the Most Important Features of the Random Forest Model
- #Most important features from best performing random forest model, Gini
- #importance
- feature_imp = pd.Series(best_grid_model.feature_importances_, index = X.columns)
- feature_imp = feature_imp.sort_values(ascending = False)
- print(feature_imp)
- #Import required package
- import shap
- #Most important features from best performing random forest model, SHAP values
- shap_feature_imp = shap.TreeExplainer(best_grid_model)
- shap_values = shap_feature_imp.shap_values(X_testSMOTE)
- shap.summary_plot(shap_values, X_testSMOTE) #Shows results in a plot
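When reading Gini importance output, it helps to know that scikit-learn normalizes `feature_importances_` so that the scores of a fitted forest sum to 1; each score is therefore a relative share within one model. The sketch below illustrates this on synthetic data with placeholder feature names (`feature_a` to `feature_d`), none of which come from the study.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data with placeholder feature names (assumptions for illustration).
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
X_arr, y_demo = make_classification(
    n_samples=300, n_features=4, n_informative=2, n_redundant=0, random_state=23)
X_demo = pd.DataFrame(X_arr, columns=feature_names)

forest_demo = RandomForestClassifier(n_estimators=100, random_state=23)
forest_demo.fit(X_demo, y_demo)

# Gini importance per feature, sorted from most to least important.
gini_imp = pd.Series(forest_demo.feature_importances_, index=X_demo.columns)
gini_imp = gini_imp.sort_values(ascending=False)

print(gini_imp)
print("Sum of importances:", gini_imp.sum())
```

Because the scores are shares of a total, a feature's score can fall in one model simply because another feature's score rose, which matters when importance rankings are compared across differently prepared datasets.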
3. Results
4. Discussion
5. Conclusions
Variable Name | Range of Values | Type of Data |
Full Name | 400 randomly generated female and male names | Categorical |
Gender | Male or Female | Categorical |
Race/Ethnicity | Asian, Black, Latinx, Not Provided, White | Categorical |
Age | 20–40 years | Numeric |
Pre-Vet School GPA | 3.00–4.00 | Numeric |
GRE | 260–330 | Numeric |
Fail | 0–1 | Numeric |
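A dataset with the variables and ranges listed in the table above could be simulated along the following lines. This is only a sketch: the uniform and independent distributions, the 10% failure rate, the random seed, and the omission of the Full Name column are all assumptions made for illustration, not the procedure used to build the study dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(23)   # fixed seed so the simulation is repeatable
n_students = 400                  # matches the 400 names in the table

simulated = pd.DataFrame({
    "Gender": rng.choice(["Male", "Female"], size=n_students),
    "Race": rng.choice(
        ["Asian", "Black", "Latinx", "Not Provided", "White"], size=n_students),
    # integers() excludes the upper bound, so 41 yields ages 20 to 40.
    "Age": rng.integers(20, 41, size=n_students),
    "preGPA": rng.uniform(3.00, 4.00, size=n_students).round(2),
    "GRE": rng.integers(260, 331, size=n_students),
    # Assumed 10% failure rate; 1 = failed, 0 = did not fail.
    "Fail": rng.binomial(1, 0.10, size=n_students),
})

print(simulated.head(10))
print(simulated.describe())
```

Because every column is drawn independently here, no feature carries real information about failure; a realistic simulation would need to build in the relationships the model is expected to find.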
 | Actual Negative Class: 0, Student Who Did Not Fail | Actual Positive Class: 1, Student Who Did Fail |
Predicted negative Class: 0, student who did not fail | True negative (TN) | False negative (FN) |
Predicted positive Class: 1, student who did fail | False positive (FP) | True positive (TP) |
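The performance metrics used in this tutorial all follow from the four cells of the confusion matrix above. The sketch below computes them with the standard formulas; the function name `classification_metrics` and the example counts are hypothetical and are not results from the study.

```python
def classification_metrics(tn, fp, fn, tp):
    """Compute common metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    recall = tp / (tp + fn)           # sensitivity, true positive rate
    specificity = tn / (tn + fp)      # true negative rate
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f1": f1,
    }

# Hypothetical counts for 100 students, invented for illustration.
metrics = classification_metrics(tn=50, fp=10, fn=5, tp=35)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that scikit-learn's `confusion_matrix` places the actual classes in rows and the predicted classes in columns, which is the transpose of the layout shown in the table above.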
Performance Metric | Random Forest Base Model | Random Forest Tuned Model |
Accuracy | 87.07% | 86.61% |
Recall/Sensitivity/TPR | 89.61% | 89.77% |
Specificity/TNR | 87.15% | 88.11% |
Precision | 86.46% | 86.21% |
F1-Score | 86.24% | 88.40% |
ROC curve AUC | 87.15% | 88.11% |
All GRE Records | Feature Importance Score | Missing Low GRE Values Removed | Feature Importance Score | Missing Low GRE Values Replaced with Mean | Feature Importance Score | Random Missing GRE Values Removed | Feature Importance Score | Random Missing GRE Values Replaced with Mean | Feature Importance Score |
GRE | 0.241850 | GRE | 0.370290 | GRE | 0.357720 | GRE | 0.291419 | preGPA | 0.218575 |
Age | 0.150457 | Race_Not Provided | 0.131981 | preGPA | 0.152793 | preGPA | 0.199777 | GRE | 0.181400 |
Race_Not Provided | 0.146913 | preGPA | 0.106498 | Age | 0.146683 | Age | 0.163021 | Age | 0.180350 |
preGPA | 0.130455 | Age | 0.100696 | Race_Not Provided | 0.096973 | Race_Not Provided | 0.071514 | Race_Not Provided | 0.117902 |
Race_White | 0.078924 | Race_White | 0.093832 | Race_Latinx | 0.062529 | Race_Asian | 0.062693 | Race_White | 0.073569 |
Race_Latinx | 0.067359 | Gender_Male | 0.082514 | Race_White | 0.057276 | Race_White | 0.059140 | Race_Black | 0.071190 |
Gender_Male | 0.063287 | Race_Latinx | 0.047706 | Race_Asian | 0.044694 | Race_Black | 0.051953 | Race_Asian | 0.062513 |
Race_Black | 0.063043 | Race_Black | 0.041844 | Gender_Male | 0.042020 | Gender_Male | 0.050608 | Race_Latinx | 0.060454 |
Race_Asian | 0.057713 | Race_Asian | 0.024639 | Race_Black | 0.039311 | Race_Latinx | 0.049876 | Gender_Male | 0.034048 |
