An Improved Machine Learning-Based Employee Attrition Prediction Framework with Emphasis on Feature Selection

Abstract: Companies always seek ways to retain their professional employees in order to reduce extra recruiting and training costs. Predicting whether a particular employee will leave helps the company make preventive decisions. Unlike physical systems, human resource problems cannot be described by a scientific-analytical formula; therefore, machine learning approaches are the best tools for this aim. This paper presents a three-stage (pre-processing, processing, post-processing) framework for attrition prediction. An IBM HR dataset is chosen as the case study. Since there are several features in the dataset, the "max-out" feature selection method is proposed for dimension reduction in the pre-processing stage and is applied to the IBM HR dataset. The coefficient of each feature in the logistic regression model shows the importance of that feature in attrition prediction. The results show an improvement in the F1-score performance measure due to the "max-out" feature selection method. Finally, the validity of the parameters is checked by training the model on multiple bootstrap datasets; the average and standard deviation of the parameters are then analyzed to assess the confidence in the model's parameters and their stability. The small standard deviation of the parameters indicates that the model is stable and is more likely to generalize well.


Introduction
Human resources are the primary source and the most critical asset of any company. Managers spend a considerable amount of time recruiting capable employees, and they regularly spend additional resources on training staff. Every instance of employee attrition, i.e., quitting the job without a replacement, imposes a cost on the company for recruiting and training a new employee. To illustrate the definition of attrition, consider two cases (a) and (b). In case (a), which is not attrition, the employer decides to replace an employee with another, more skilled person. In case (b), which is attrition, an employee leaves the company. Obviously, in the second case, the employer faces delays in its project schedule due to recruiting and training the replacement employee. Predicting attrition makes it easier for decision-makers to take proper preventive actions. Several factors, such as age, salary, distance from home, and education level, contribute to whether an employee decides to leave a company. Since there is no deterministic analytical relation between employee attrition and these influential factors, machine learning approaches, which are computational methods that use experience to improve performance or make accurate predictions [1], can be utilized.

The main contributions of this work are:
• Proper feature selection method;
• Informative evaluation of the classifier's performance;
• Confidence levels for the value of the coefficient of each feature in the logistic regression model.
A proper study of attrition prediction tasks should include pre-processing, processing, and post-processing stages in order to present a practical, reliable predictor that human resource managers can trust. At the pre-processing stage, the features that are most relevant to the task are selected and redundant features are eliminated. Proper feature selection enhances the performance of the models that are trained at the processing stage. At the processing stage, the predictor model is trained, tested, and compared with other models. Finally, the degree of confidence in the model must be established at the post-processing stage.
To the best of our knowledge, this paper, for the first time, proposes an attrition prediction framework that addresses all three stages of pre-processing, processing, and post-processing. First, the "max-out" algorithm, a novel feature selection method for enhancing the performance of our attrition prediction classifier, is presented as the pre-processing stage. Then, a logistic regression model is trained on the new set of features at the processing stage. Next, a confidence analysis for quantifying how sure we are about our model's parameters is introduced at the post-processing stage. Finally, the methodology is verified using IBM attrition data [12]. The general structure of the proposed framework is shown in Figure 1, in which yellow, green, and red blocks depict the pre-processing, processing, and post-processing stages, respectively. The main objective of these steps is to make sure that the model is able to generalize properly.
The rest of this paper is organized as follows. Section 2 introduces the "max-out" method and discusses the complexity of the algorithm. Next, the parameters' confidence analysis is introduced in Section 3. Then, logistic regression, which is the predictor of attrition, is briefly reviewed in Section 4. After that, Section 5 studies this algorithm's capability to enhance the IBM attrition classifier's performance. Finally, conclusions are drawn in Section 6.

Feature Selection
Several unnecessary features can be eliminated in the pre-processing stage. One solution is to train the model on every subset of features and then compare the metrics on the validation dataset. This procedure requires training the model 2^n times for a dataset with n features and is therefore highly time-consuming. Several feature selection methods have been proposed to avoid this exhaustive search [13,14]. Filter methods are based on the correlation between the values of features. Wrapper methods, on the other hand, select features based on the classifier's performance with the selected features. Embedded methods embed feature selection in the learning algorithm, and hybrid methods combine these approaches. A comparison between these categories is provided in Table 1 [13,14]. Notice that the feature selection method should be consistent with the nature of the features.

Max-Out Feature Selection Algorithm
Based on the nature of this problem's feature set, which includes both binary and continuous features, the "max-out" feature selection algorithm, which belongs to the wrapper category, is developed. The algorithm is expressed in Algorithm 1. According to this algorithm, the model is first trained on all subsets of n − m features, and the subset with the largest metric is chosen as the new feature set. The process is then repeated for the new set of features. When the metric becomes smaller than in the previous step, the feature set is not changed further. Since the model is trained for only a portion of all possible feature subsets, the algorithm is much faster than checking all possible combinations of features. If m is equal to 1, the algorithm is called 1-max-out; if m is 2, it is called 2-max-out. The 1-max-out algorithm is equivalent to backward feature selection [22]. However, in some special cases, the inclusion of m features together may enhance the performance even though no single one of them plays a significant role in the classification performance on its own. In these cases, 1-max-out may wrongly eliminate these features one after another, and m-max-out (m > 1) performs better than 1-max-out. The m-max-out algorithm is of the order of O(f^m), given f as the number of initial features. Therefore, choosing a suitable m also depends on the available computational resources. Figure 2 provides an example in which omitting four features (X2, X3, X9, X8) from the total of fifteen is the best feature selection determined by the "1-max-out" method. This process takes 14 iterations to identify the first feature to be omitted, 13 iterations for the second, 12 for the third, 11 for the fourth, and 10 iterations to realize that further eliminations do not help. A brute-force search would instead require training the model 2^15 = 32,768 times.
Algorithm 1. The m-max-out feature selection algorithm.
1. Create SetF = set of all n features.
2. Compute variable M as the metric of a classifier trained with the features of SetF.
3. Take n equal to the size of SetF.
4. Create Sub_1, Sub_2, ..., Sub_k, all subsets of SetF of size n − m.
5. Take M' equal to the largest metric value among the subsets of step 4 (obtained for the corresponding subset Sub_j).
6. If M' is not smaller than M, set SetF = Sub_j and M = M', and return to step 3; otherwise, stop and return SetF.
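As a concrete illustration of Algorithm 1, the following is a minimal Python sketch of the m-max-out procedure. The callback train_and_score is a hypothetical placeholder for training the classifier on a candidate subset and returning its validation metric; it is not part of the paper's code. With m = 1, the sketch reduces to the backward-elimination behaviour described above.

```python
# Minimal sketch of the m-max-out wrapper selection in Algorithm 1 (not the authors' code).
# `train_and_score(features)` is a hypothetical user-supplied callback that trains the chosen
# classifier on the given feature subset and returns a validation metric (e.g., F1-score).
from itertools import combinations

def max_out(all_features, train_and_score, m=1):
    """Repeatedly drop the m features whose removal maximizes the validation metric,
    stopping as soon as no subset improves on the current metric."""
    selected = list(all_features)
    best_metric = train_and_score(selected)                      # metric with the full set
    while len(selected) > m:
        # evaluate every subset obtained by removing m features from the current set
        candidates = [list(sub) for sub in combinations(selected, len(selected) - m)]
        scores = [train_and_score(sub) for sub in candidates]
        best_idx = max(range(len(scores)), key=scores.__getitem__)
        if scores[best_idx] < best_metric:                       # metric got worse: stop
            break
        best_metric, selected = scores[best_idx], candidates[best_idx]
    return selected, best_metric
```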

Illustrative 1-Max-out Example
In order to further illustrate the "max-out" method, an example on the fish market problem [23] is discussed in this section. Notice that this is not the main case study of this paper. The objective of this benchmark problem is to predict the weight of fish. There are seven binary variables ("Bream", "Parkki", "Perch", "Pike", "Roach", "Smelt", and "Whitefish") and five continuous variables ("Length1", "Length2", "Length3", "Height", and "Width"). After indexing these from 0 to 11, the 1-max-out algorithm is performed. As Figure 3 depicts, variables 2, 1, and 10 are removed in succession. As a result of this feature selection, the R²-score for the validation set increases from 0.9384 to 0.9396, and for the test set from 0.9145 to 0.9179.
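For instance, assuming the fish-market data are available as a CSV with a "Species" column and the numeric measurements listed above (the file name, regressor, and split below are illustrative assumptions, not details given in the paper), the max_out sketch from Section 2 could be driven as follows:

```python
# Hypothetical driver for the fish-market example; file name, regressor, and split are assumed.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

df = pd.get_dummies(pd.read_csv("fish.csv"), columns=["Species"])      # binary species columns
features = [c for c in df.columns if c != "Weight"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features], df["Weight"], test_size=0.2, random_state=0)

def train_and_score(subset):
    model = LinearRegression().fit(X_train[subset], y_train)
    return r2_score(y_val, model.predict(X_val[subset]))               # R²-score on validation

selected, best_r2 = max_out(features, train_and_score, m=1)            # 1-max-out
```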

Parameter Confidence Analysis
In order to check how confident we are about the values of our model's parameters, the procedure in Figure 4 is performed. Each time a new dataset is produced by bootstrapping (resampling with replacement [24]) the primary training dataset, the model's parameters change. If the model overfits the training set, the variation of the parameters will be significant. Otherwise, the parameters vary only slightly, which means that they capture the real trend rather than memorizing the training set. Various statistical analyses can be performed on each parameter. As an illustration, a toy example is provided in Figure 5. Consider a dataset with six samples, three of which are in the star class and the others in the circle class. A logistic regression classifier, as shown in Figure 5a, is trained. If the model is trained on a bootstrap dataset in which the green star and the red circle are omitted, the parameters change dramatically, as shown in Figure 5b. Therefore, it can be concluded that we cannot be certain about the model's parameters, which depend strongly on the training set. To make the parameters more stable, they can be regularized, so that the training objective includes a penalty on the magnitude of the parameters.
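The regularized objective itself does not survive in the extracted text. Under the common L2 (ridge) formulation for logistic regression, it would take a form such as the following (an assumed reconstruction, not copied from the paper):

```latex
% Assumed L2-regularized log-likelihood; the exact penalty form used in the paper is unavailable.
\max_{\omega}\; \sum_{r=1}^{R} \Big[ g_r \log P(G_r = 1) + (1 - g_r) \log\big(1 - P(G_r = 1)\big) \Big]
              \;-\; \lambda \sum_{i=1}^{p} \omega_i^{2}
```

where g_r ∈ {0, 1} is the observed class of sample r and λ ≥ 0 controls the strength of the penalty.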

Logistic Regression
Logistic regression classification aims to determine the probability that the output variable belongs to a specific class as a function of a linear combination of the features. This relation is asserted in (1) and (2) [24]:

$\ln\frac{P(G=1)}{1-P(G=1)} = \omega_0 + \omega_1 X_1 + \omega_2 X_2 + \cdots + \omega_p X_p$ (1)

$P(G=1) = \dfrac{e^{\omega_0 + \omega_1 X_1 + \cdots + \omega_p X_p}}{1 + e^{\omega_0 + \omega_1 X_1 + \cdots + \omega_p X_p}}$ (2)

In (1) and (2), X_1, X_2, ..., X_p are the features, the ω_i are the coefficients, and P(G = 1) is the probability that the output G belongs to class 1. The coefficients should be tuned so that the likelihood of the observed outputs of the training samples is maximized. This can be formulated for a training dataset of R samples as in (3), which can be rewritten as (4). Several algorithms, such as gradient descent, can be used to optimize the likelihood [14].
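Equations (3) and (4) are missing from the extracted text; for binary labels g_r ∈ {0, 1} and feature vectors x_r, the standard likelihood and log-likelihood over R training samples would read as follows (an assumed reconstruction consistent with (2), not copied from the paper):

```latex
% Assumed standard forms of (3) and (4); the paper's exact notation is unavailable.
L(\omega) = \prod_{r=1}^{R} P(G=1 \mid x_r)^{g_r} \, \bigl(1 - P(G=1 \mid x_r)\bigr)^{1-g_r}  \quad (3)

\ell(\omega) = \sum_{r=1}^{R} \Bigl[ g_r \log P(G=1 \mid x_r) + (1-g_r)\log\bigl(1 - P(G=1 \mid x_r)\bigr) \Bigr]  \quad (4)
```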

Performance Evaluation
Four performance measures, "Accuracy", "Precision", "Recall", and "F1-score", are used to evaluate the performance of the classifiers. These measures are presented in (5)-(8) [25]. In (5)-(7), TP and TN are the numbers of samples that the classifier correctly predicted as positive (class 1) and negative (class 2), while FP and FN are the numbers of samples that the classifier incorrectly predicted as positive and negative. For imbalanced datasets, accuracy is not a good evaluation measure. As an illustrative example, consider a dataset with 10 positive samples and 99,990 negative samples. If a weak classifier labels all samples as negative, its accuracy is 99.99 percent, so accuracy may overrate the performance of the classifier. The recall measure evaluates what percentage of the positive samples are labeled correctly; in this example, recall is zero.
On the other hand, the precision measure calculates what percentage of the samples that the algorithm labeled as positive are really positive. Some biased algorithms may have either a large recall or a large precision. Therefore, the F1-score is calculated in order to represent both recall and precision: if either recall or precision is small, the F1-score will also be small.
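The formulas themselves were lost in extraction; the standard definitions, which match the verbal descriptions above, are reproduced here with the paper's numbering (a reconstruction, not a verbatim copy):

```latex
% Standard definitions consistent with the text above.
\text{Accuracy}  = \frac{TP + TN}{TP + TN + FP + FN}  \quad (5)

\text{Precision} = \frac{TP}{TP + FP}  \quad (6)

\text{Recall}    = \frac{TP}{TP + FN}  \quad (7)

\text{F1-score}  = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}  \quad (8)
```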

Case Study
The IBM attrition dataset is used as the case study. This dataset consists of 35 columns for each employee. One of these columns, which is "attrition," is the target output of the classifier. The other 34 columns are features. Two of these features, which are "standard hours" and "employee count," are constant for all employees. Therefore, they are omitted from the features. Other features are listed as: "age", "business travel", "daily rate", "department", "distance from home", "education", "education field", "employee number", "environment satisfaction", "gender", "hourly rate", "job involvement", "job level", "job role", "job satisfaction", "marital status", "monthly income", "monthly rate", "number companies worked", "over 18", "overtime", "percent salary hike", "performance rating", "relationship satisfaction", "stock option level", "total working years", "training times last year", "work-life balance", "years at company", "years in the current role", "years since last promotion", and "years with the current manager." These features are either categorical or numerical. Since machine learning models cannot deal with categorical features directly, categorical data are converted to binary features using dummy coding [26]. For example, the categorical feature "education field" can be converted into five binary variables, as seen in Table 2.
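As a sketch of this dummy-coding step (the CSV file name is assumed to be that of the public Kaggle file; the snippet is illustrative rather than the authors' code):

```python
# Illustrative dummy coding of the IBM HR dataset's categorical columns (assumed file name).
import pandas as pd

df = pd.read_csv("WA_Fn-UseC_-HR-Employee-Attrition.csv")
df = df.drop(columns=["StandardHours", "EmployeeCount"])     # constant for all employees
y = (df.pop("Attrition") == "Yes").astype(int)               # binary target output
categorical = df.select_dtypes(include="object").columns     # e.g., EducationField, JobRole
X = pd.get_dummies(df, columns=list(categorical))            # one binary column per category
```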

Feature Selection
In order to decide which features are the most important, the dataset was first divided into a training-and-validation set and a test set. Then, a validation set was split off from the training set, and the 1-max-out algorithm was performed to determine which features should be omitted. This procedure of randomly splitting off a validation set and running 1-max-out was repeated seven times, and the features that were marked for omission in more than four of the seven runs were considered eliminated, as sketched below.
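A minimal sketch of this repeated-split voting procedure is given below; it reuses the max_out sketch from Section 2 and the dummy-coded X and y from above, while the split ratios, logistic-regression settings, and F1 scorer are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch of the repeated 1-max-out voting described above (settings are assumptions).
from collections import Counter
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

X_trval, X_test, y_trval, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

votes = Counter()
for seed in range(7):                                        # seven random validation splits
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_trval, y_trval, test_size=0.2, random_state=seed)

    def train_and_score(subset):
        clf = LogisticRegression(max_iter=1000).fit(X_tr[subset], y_tr)
        return f1_score(y_val, clf.predict(X_val[subset]))

    selected, _ = max_out(list(X_trval.columns), train_and_score, m=1)
    votes.update(set(X_trval.columns) - set(selected))       # features dropped in this run

eliminated = [f for f, n in votes.items() if n > 4]          # omitted in more than four runs
```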
According to the results of this procedure, "hourly rate", "Education Field_HR", "Monthly income", "Gender female", "Department_Research & Development", "Over18_yes", "Education", "Job Level", "Department_Human Resources", "Business Travel_Travel_Rarely", "performance rating", "Job Role_Manufacturing Director", "Monthly Rate", "Education Field_Other", "Business Travel_Non-Travel", "Education Field_Marketing", "Years at Company", "Department_Sales", "Over Time_No", "Education Field_Medical", and "Marital Status_Married" are omitted. These are not necessarily the least important features for attrition prediction; some of them are deleted because they are perfectly correlated with other features in the dataset. For instance, the "Gender female" feature is simply one minus the "Gender male" feature. For a binary feature derived from a categorical one, being eliminated means that membership in that category neither increases nor decreases the probability of attrition.

Final Model
The value of the coefficient for each feature is presented in Table 3. According to the coefficients, "years since last promotion", "overtime", working as a sales representative, and "number of companies worked" are the most influential factors for an employee to leave the job; an increase in any of these values increases the probability that the employee leaves. On the other hand, working as a research director, "total working years", "years with current manager", and "job involvement" are the most influential factors for an employee to stay with the company. The model shows an accuracy of 81% on the test dataset, with precision, recall, and F1-score of 0.43, 0.82, and 0.56, respectively. If the 1-max-out feature selection were not performed, the accuracy, precision, recall, and F1-score would be 78%, 0.39, 0.82, and 0.53, respectively. A comparison of the proposed method's performance with the classifiers used in previous works is displayed in Figure 6. The results show a considerable improvement in F1-score for the proposed method. It is worth mentioning that these results are valid only for this dataset; the coefficients may vary for other companies in other countries with different cultures and economic situations.

Parameters Confidence Analysis
In order to check the confidence in each coefficient, the procedure of Section 3 is performed. Three hundred bootstrap datasets are generated from the original dataset, and the model is trained on each of them. The average, standard deviation, and coefficient of variation (the ratio of the standard deviation to the absolute value of the average) of every coefficient are listed in Table 4. The standard deviations indicate a moderate level of confidence overall. We can be more confident about the fields for which the coefficient of variation is small. For instance, we have the most confidence in the coefficient of the "over time" feature, whereas we are not certain about the coefficient associated with the "Job Role_Research Scientist" feature. Box plots of the coefficients can also graphically demonstrate the variation of each parameter over all bootstraps. Figure 7 depicts the variation of the coefficients associated with the most influential features, which were discussed in the previous subsection. The plot shows that the coefficient of "years since last promotion" takes a value between 2 and 4 for all of the bootstrap training datasets, so we can be confident about its prominent effect on attrition. The coefficient of the "Over Time-Yes" feature barely varies, so we can be sure about its value. In contrast, the value of the "Years with Current Manager" coefficient varies across a wide interval, and we cannot be certain about this parameter. However, in all of the bootstrap datasets this parameter is negative, so it can be inferred that this feature contributes to making employees stay with the company.
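A minimal sketch of this bootstrap analysis (reusing the training data from the previous subsection; the classifier settings and random seed are assumptions) is:

```python
# Sketch of the bootstrap confidence analysis: refit the model on 300 resampled training
# sets and summarize each coefficient (classifier settings are assumptions).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
coefs = []
for _ in range(300):                                          # 300 bootstrap datasets
    idx = rng.integers(0, len(X_trval), size=len(X_trval))    # resample with replacement
    clf = LogisticRegression(max_iter=1000).fit(X_trval.iloc[idx], y_trval.iloc[idx])
    coefs.append(clf.coef_.ravel())

coefs = np.array(coefs)
mean, std = coefs.mean(axis=0), coefs.std(axis=0)
cv = std / np.abs(mean)                                       # coefficient of variation per feature
```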

Conclusions
This paper aimed at presenting a machine learning model for predicting employee attrition. A feature selection method for reducing the dimension of the feature space was first presented, and a logistic regression model was then trained for the purpose of prediction. A comparison of the results with existing methods reveals that the proposed feature selection increases the performance of the predictor. The model demonstrated that "years since last promotion", "Over Time-Yes", "Job Role_Sales Representative", and "Number Companies Worked" are the prominent reasons for leaving the job; larger values of these features lead to a greater attrition probability. Conversely, "total working years", "years with current manager", and "job involvement" are the most influential factors for staying with the company. In order to check whether the parameters are valid, 300 bootstrap datasets were produced and a model was fitted to each of them. A statistical analysis of the coefficient of each feature was then performed. Generally, the variation of the coefficients was acceptable; in particular, the variations of the parameters associated with the most influential features were insignificant. Therefore, we are confident that the aforementioned features are the prominent features in predicting attrition.
In comparison to previous works, this paper presents a three-stage pre-processing, processing, and post-processing framework for building a precise employee attrition prediction model and for checking the validity of the model's parameters. The m-max-out algorithm is introduced for feature selection at the pre-processing stage. Due to the authors' current computational limits, the 1-max-out algorithm (the special case in which m equals one) is used in this paper; a larger m could be used when more computational resources are available. The validity of the logistic regression model's parameters for attrition prediction is checked by analyzing the parameters' variations when the model is trained over multiple bootstrap datasets. These pre-processing and post-processing stages can be used to develop accurate and stable models for general problems. The max-out feature selection method can be applied to any feature set, including mixtures of binary and continuous features, and for any parametric machine learning model, statistical analysis of the model's parameters over numerous bootstraps can indicate how much confidence we can place in the model. For future research on attrition prediction, psychological factors regarding employee attrition are suggested for analysis. In addition, the effect of the number of available vacancies for each employee, considering their qualifications and situational factors related to their attrition probability, can also be analyzed in future works.

Data Availability Statement:
Publicly available datasets were analyzed in this study. This data can be found here: https://www.kaggle.com/pavansubhasht/ibm-hr-analytics-attrition-dataset (accessed on 17 April 2021).