Article

Predicting Employee Turnover Based on Improved ADASYN and GS-CatBoost

by Shuigen Hu and Kai Dong *
School of Public Affairs, Zhejiang University, Hangzhou 310030, China
* Author to whom correspondence should be addressed.
Mathematics 2026, 14(2), 313; https://doi.org/10.3390/math14020313
Submission received: 18 November 2025 / Revised: 25 December 2025 / Accepted: 13 January 2026 / Published: 16 January 2026
(This article belongs to the Section E5: Financial Mathematics)

Abstract

In corporate management practices, human resources are among the most active and critical elements, and frequent employee turnover can impose substantial losses on firms. Accurately predicting employee turnover dynamics and identifying turnover propensity in advance is therefore of significant importance for organizational development. To improve turnover prediction performance, this study proposes an employee turnover prediction model that integrates an improved ADASYN data rebalancing algorithm with a grid-search-optimized CatBoost classifier. In practice, turnover instances typically constitute a minority class; severe class imbalance may lead to overfitting or underfitting and thus degrade predictive performance. To mitigate imbalance, we employ ADASYN oversampling to reduce skewness in the dataset. However, because ADASYN is primarily designed for continuous features, it may generate invalid or meaningless values when discrete variables are present. Accordingly, we improve ADASYN by introducing a new distance metric and an enhanced sample generation strategy, making it applicable to turnover data with mixed (continuous and discrete) features. Given CatBoost’s strong predictive capability in high-dimensional settings, we adopt CatBoost as the base learner. Nonetheless, CatBoost performance is highly sensitive to hyperparameter choices, and different parameter combinations can yield markedly different results. Therefore, we apply grid search (GS) to efficiently optimize CatBoost hyperparameters and obtain the best-performing configuration. Experimental results on three datasets demonstrate that the proposed improved-ADASYN GS-CatBoost model effectively enhances turnover prediction performance, exhibiting strong robustness and adaptability. Compared with existing models, our approach improves predictive accuracy by approximately 4.6112%.

1. Introduction

In recent years, employee turnover in large corporations has become increasingly severe, particularly among specialized professionals. Employees with diverse career aspirations are often dissatisfied with the status quo; alternatively, for various reasons, they may lack sufficient communication and interaction with the organization and may impulsively resign as a way to address their concerns [1]. Employee turnover not only reduces corporate revenue but also poses substantial risks to organizational development, such as the leakage of business secrets, the loss of key clients or core technologies, and the erosion of market competitiveness, potentially resulting in irreparable losses [1]. Reducing the turnover rate is therefore a critical means of improving the effectiveness of human resource management. Consequently, it is essential to predict employees’ turnover intentions so that timely and effective retention measures can be implemented [2].
Employee turnover is generally classified into two types [3,4]. The first is voluntary turnover, which refers to employees leaving an organization based on their own intentions and is often termed employee attrition. The second is involuntary turnover, whereby employees exit their current positions against their will, including separations resulting from layoffs or dismissals. From the perspective of organizational management, voluntary turnover is typically more latent and difficult to detect in advance, while employee-related data are usually transparent and readily accessible to firms. Therefore, this study focuses on voluntary turnover; throughout the remainder of this paper, the term “turnover” refers exclusively to voluntary turnover unless otherwise specified.
Human resource management data that contain information on employees’ behaviors, backgrounds, and related attributes provide a solid foundation for data-driven employee turnover prediction. Accordingly, extensive efforts have been made in academia to predict employee turnover. In addition to using statistical models for prediction, in recent years, employee turnover prediction based on machine learning has also received extensive attention from both the academic and industrial communities [5,6,7]. However, most existing machine learning approaches are developed on the basis of a single model, such as decision trees, random forests, and support vector machines [8,9,10]. For turnover data characterized by complex underlying relationships, a single classifier may fail to fully capture the input–output mapping, thereby compromising the overall predictive performance.
Because the outcomes of employee attrition prediction may constitute sensitive information that can affect the reputation of organizational members, predictive accuracy is of paramount importance. Although most companies possess human resource data and the analysis may appear straightforward, the drivers of employee attrition are diverse and the data often exhibit non-typical characteristics; therefore, more sophisticated techniques are required to improve prediction accuracy. Ensemble learning algorithms can construct high-performing predictive models by integrating multiple base learners, thereby achieving higher predictive accuracy. In recent years, ensemble-learning-based employee attrition prediction models have been widely studied and applied, including Random Forest, AdaBoost, XGBoost, and CatBoost [11]. CatBoost achieves strong performance by considering different combinations of categorical features to expand the effective feature space and by employing ordered boosting and gradient bias reduction to alleviate overfitting. As a result, it performs well in addressing gradient bias and prediction shift, leading to improved accuracy and generalization capability [12]. Nevertheless, CatBoost is highly sensitive to hyperparameter settings; different parameter combinations can yield substantial performance differences, and careful tuning is required to obtain optimal results [13].
Employee turnover data are typically imbalanced: the majority of instances correspond to employees who remain employed, whereas only a minority correspond to employees who leave. Such class imbalance can substantially degrade the predictive performance of machine learning models for turnover prediction [14]. This is because learning algorithms often tend to classify more samples as “stayers” in order to maximize overall accuracy, which in turn markedly reduces the model’s ability to identify employees who will leave. However, the primary objective of turnover prediction is precisely to detect potential leavers. Therefore, it is necessary to develop turnover prediction models that explicitly account for class imbalance [15]. Transforming an imbalanced dataset into a balanced one is among the most important strategies for handling class imbalance, and representative methods include MWMOTE, SMOTE, and ADASYN [16,17,18]. Among them, the Adaptive Synthetic Sampling approach (ADASYN) has been widely used for data balancing due to its strong resistance to overfitting, good generalization capability, and robustness to noise [19]. Nevertheless, the standard ADASYN algorithm is mainly designed to generate new samples for continuous features (variables), whereas employee-related datasets often contain a substantial number of discrete/categorical variables (e.g., gender, place of origin). Therefore, it is necessary to improve standard ADASYN so that it can be effectively applied to employee turnover data with mixed (continuous and discrete) features.
In summary, this study proposes a novel predictive framework that integrates an improved ADASYN algorithm with a parameter-optimized CatBoost ensemble learning model to enhance the accuracy of employee turnover prediction. The main contributions of this work are summarized as follows:
  • An improved ADASYN algorithm was constructed to balance the employee turnover data. To mitigate the adverse impact of class imbalance on turnover prediction accuracy and to enhance overall predictive performance, we employ the ADASYN adaptive synthetic oversampling strategy to rebalance the dataset. Considering that standard SMOTE-type methods primarily generate new samples for continuous features, whereas employee turnover data typically contain both continuous and discrete variables, the proposed improved ADASYN incorporates a new distance metric and an enhanced sample generation strategy to better accommodate mixed-type features.
  • A grid search (GS) procedure is employed to optimize the hyperparameters of the categorical boosting model (CatBoost). GS addresses the low efficiency of manual hyperparameter tuning and improves the predictive performance of CatBoost. The effectiveness and feasibility of the GS-CatBoost model are validated through prediction experiments on the employee turnover datasets.
  • By integrating the improved ADASYN algorithm with the GS-CatBoost model, we propose an IADASYN–GS-CatBoost model for employee turnover prediction. The effectiveness of the proposed approach is validated on three employee turnover datasets. Experimental results demonstrate that, compared with existing methods, our model achieves higher predictive accuracy and stronger generalization performance.
The remainder of this paper is organized as follows. Section 2 reviews the literature relevant to employee turnover prediction. Section 3 describes the proposed methodology. Section 4 presents the experimental results on different datasets. Section 5 concludes the paper by summarizing the main findings and discussing the implications and limitations of this study.

2. Literature Review

At present, employee turnover prediction methods can be broadly classified into two categories: probability-based statistical prediction approaches and machine-learning-based approaches. The latter can be further divided into single-classifier (standalone) models and ensemble-learning models.

2.1. Turnover Prediction Based on Statistical Learning

Statistical approaches typically predict employee turnover by leveraging methods such as regression analysis, factor analysis, and descriptive statistical analysis [20,21].
Kong et al. [22] conducted a survey study on the key factors influencing accountants’ turnover in Singapore. Using 20 turnover-related features, they developed a turnover prediction model via stepwise regression analysis. Gong et al. [23] proposed an employee turnover prediction model based on logistic regression. Using an encoded dataset, they constructed a logistic regression model that outputs the probability of turnover for a given employee profile. Changling Pi and Xiangmin Zheng et al. [24] focused on a new-generation cohort of hotel employees and investigated the frequent turnover problem. They employed questionnaire surveys and statistical analyses to examine the effects of work values on turnover intention. Yumin Liu and Guangping Li [25] studied dispatched employees in labor-using organizations. To address the high turnover rate among dispatched workers, they applied hierarchical regression to analyze the relationship between perceived organizational support and turnover intention.
Overall, employee turnover prediction models based on traditional statistical methods tend to operate on relatively small-scale datasets and typically require stringent assumptions about the sample data, such as normality or linear relationships. These requirements may limit the applicability and effectiveness of such models in practice.

2.2. Employee Turnover Prediction Using a Single Machine Learning Model

Human resource management data that capture employees’ behaviors, backgrounds, and related attributes provide a foundation for data-driven employee turnover prediction. In recent years, machine-learning-based turnover prediction has attracted substantial attention from both academia and industry.
Zhang et al. [4] first applied k-means clustering to preliminarily categorize employees and then used decision trees (DT) to predict potential turnover. Ali et al. [5] proposed an improved Extremely Randomized Trees model for employee attrition prediction. Guerranti et al. [6] compared the turnover prediction performance of several models, including decision trees, logistic regression, and neural networks, and reported that logistic regression achieved the best predictive results. Rohit Punnoose et al. [26] modeled and predicted employee turnover using data collected from a human resource information system and found that the XGBoost model significantly improved turnover prediction accuracy. Xiang Gao et al. [27] proposed a weighted quadratic random forest model for high-dimensional and imbalanced turnover data; the results showed that the proposed model achieved significant improvements over conventional models across multiple evaluation metrics, particularly in terms of recall and the F-measure. Ozmen et al. [28] applied a convolutional neural network (CNN) to address employee turnover prediction in the retail sector and further improved the CNN-based approach by proposing a novel hybrid Expanded Convolutional Decision Tree (ECDT) model. This model mitigated the issue of missing data and provided an accurate and reliable method for turnover prediction. Bagus Priambodo et al. [29] focused on employees in the IT industry, identified turnover-related factors using correlation analysis and chi-square tests, and comparatively evaluated the classification performance of three models: decision trees, naive Bayes, and random forests.
Fallucchi et al. [9] developed an employee turnover prediction model using 35 features extracted from human resource data and conducted empirical analyses based on an IBM HR dataset. In their experiments, they applied multiple models—including naive Bayes, k-nearest neighbors (KNN), decision trees, logistic regression, support vector machines (SVM), and random forests—to predict turnover on a dataset of 1500 samples. The results indicated that the logistic regression model achieved the highest predictive accuracy. Ganthi et al. [30] constructed employee turnover prediction models using decision trees, random forests, k-nearest neighbors, neural networks, extreme gradient boosting, and AdaBoost. To further improve predictive accuracy, they applied regularization techniques to optimize model parameters. Experiments on human resource data from a U.S. power company showed that extreme gradient boosting achieved the best performance, with a turnover prediction accuracy of up to 88%.
The above studies develop employee turnover prediction models based on single classifiers. However, for turnover data characterized by complex underlying relationships, a single model may not fully capture the input–output mapping, thereby limiting the overall predictive performance.

2.3. Employee Turnover Prediction Based on Ensemble Models

Ensemble learning algorithms can construct high-performance predictive models by integrating multiple base classifiers, and a large body of research has shown that such methods perform well in both classification and regression tasks. In recent years, ensemble-learning-based employee turnover prediction models have been extensively studied [8,31,32].
Muslim and Dasril et al. [31] proposed a turnover prediction algorithm based on a stacking ensemble framework. In their approach, k-nearest neighbors (KNN), decision trees, support vector machines (SVM), and random forests were used as base predictors, and LightGBM served as the meta-learner to integrate the outputs of the base models. Using a Polish bankruptcy dataset for empirical evaluation, the results showed that the ensemble model achieved higher predictive accuracy than each individual model (KNN, decision tree, SVM, and random forest). Xiang Gao et al. [27] proposed a weighted quadratic random forest model for high-dimensional and imbalanced employee turnover data. The results showed that the proposed model significantly outperformed conventional methods across multiple performance metrics, particularly in terms of recall and the F-measure.
Qiang Li et al. [33] developed an employee turnover prediction model for a company by combining Random Forest with a stacking ensemble learning strategy to construct an LRA predictive framework. The results showed that the ensemble approach outperformed single-model baselines and provided more accurate decision support for the firm; however, the model was relatively complex and computationally slow. Rohit Punnoose [26] modeled and predicted employee turnover using data derived from a human resource information system and found that the XGBoost model significantly improved the accuracy of turnover prediction. Jian Zheng and Renjing Liu developed an employee turnover prediction model based on a bagging framework—Random Forest (RF) [7,34]. Their approach introduced interval variables into the random forest to handle class-imbalanced data. Chung et al. constructed a stacking ensemble classification model in which random forest and artificial neural networks served as base learners and logistic regression was used as the meta-learner [8,35]. In addition to bagging and stacking, boosting is another important ensemble learning paradigm. CatBoost is one of the classical boosting algorithms and has been widely used in related applications [11].
Compared with bagging- and stacking-based methods, CatBoost is generally easier to implement. During training, CatBoost places greater emphasis on samples that were misclassified in previous iterations, and it often achieves better and more stable classification performance on complex datasets [12]. Moreover, CatBoost adopts an ordered boosting scheme, in which residuals are computed in a permutation-driven manner. This design helps mitigate target leakage and prediction shift, leading to more stable performance on small or noisy datasets and reducing the risk of overfitting. Therefore, in this study, we select CatBoost as the ensemble learning model for turnover prediction [13].

2.4. Balanced Data Algorithms

Employee turnover data exhibit imbalanced characteristics, with the majority of samples being “employed” and a minority being “departed.” Data imbalance significantly affects the prediction performance of machine learning models for employee turnover [15]. Models often tend to classify more samples as “employed” to maximize overall classification accuracy, which can significantly reduce the prediction performance for leavers. However, the core objective of employee turnover prediction is to identify potential leavers. Therefore, it is necessary to construct a prediction model that considers data imbalance. Transforming imbalanced data into balanced data is one of the key strategies for handling data imbalance, with methods including oversampling and undersampling techniques. Oversampling methods balance data by generating additional samples for the minority class. For example, SMOTE uses an adaptive strategy to oversample minority class samples, considering noise effects [16]. The MWMOTE method performs over-sampling based on distance metrics and weighted clustering [17]. The WKSMOTE method generates new minority samples based on SVM for oversampling [18]. However, SMOTE is sensitive to outliers, MWMOTE may overlook distant minority class subclusters, and WKSMOTE faces class overlap issues. Additionally, ADASYN (Adaptive Synthetic Over-sampling Method) is one of the widely used oversampling techniques. ADASYN increases the number of minority class samples by generating new samples between existing minority samples, thus achieving data balance.
Compared with SMOTE, ADASYN does not treat all minority class samples equally. Instead, it takes into account the distribution of samples around each minority sample and adaptively generates more synthetic samples for those surrounded by majority-class neighbors, thereby increasing the number of boundary samples and providing more information for classification. Compared with under-sampling techniques, oversampling techniques such as ADASYN do not eliminate original samples, which avoids the unintended loss of useful information in the data. ADASYN is also conceptually simple and balances data effectively.
However, the standard ADASYN algorithm primarily generates new samples for continuous features, whereas employee data often include many categorical variables (e.g., employee gender, origin) [36]. Therefore, it is necessary to improve the standard ADASYN algorithm to accommodate employee turnover data with mixed (continuous and categorical) features.

3. IADASYN-GS-CatBoost Turnover Prediction

Let $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ be the employee attrition dataset, where $y_i \in \{1, 0\}$ is the class label, $y_i = 1$ indicates employee attrition, $y_i = 0$ indicates no attrition, and $i = 1, 2, \ldots, N$. Each sample $x_i$ contains $F$ features $\{x_{i,1}, x_{i,2}, \ldots, x_{i,F}\}$, which describe variables related to employee background and behavior. The objective of employee attrition prediction is to train a classification model $\Psi$ on the attrition dataset, i.e., to learn a mapping function $\Psi: x \to y$. Then, for a new sample $x_{new}$, the learned mapping function $\Psi$ is used to predict its corresponding attrition label $y_{new}$, namely,
$$y_{new} = \Psi(x_{new}) \tag{1}$$
The employee attrition dataset $D$ has two typical characteristics: data imbalance and mixed features. First, the dataset has significantly more non-attrition samples ($y = 0$) than attrition samples ($y = 1$), leading to a noticeable imbalance in the data. Therefore, the data must be balanced before training the model to reduce the impact of data imbalance on classification. Second, the data may contain both discrete features, such as “department”, “education”, “specialization”, and “employee satisfaction score”, and continuous features, such as “performance evaluation score”. In this study, continuous features are rescaled to the range [0, 1] by min-max normalization, eliminating the impact of different scales. Based on feature encoding and normalization, this study proposes an employee attrition prediction method based on improved ADASYN-GS-CatBoost to address the issues of data imbalance and mixed features in the employee attrition dataset.
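The min-max normalization mentioned above can be illustrated with a small sketch. The function name `min_max_normalize` and the score values are hypothetical and chosen only for illustration; mapping a constant column to zeros is likewise an assumption, since the paper does not discuss that case.

```python
def min_max_normalize(column):
    """Rescale a list of numbers to [0, 1] via (v - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # Assumption: a constant feature carries no information, so map it to 0.
        return [0.0 for _ in column]
    return [(v - lo) / (hi - lo) for v in column]


# Hypothetical "performance evaluation score" values.
scores = [55.0, 70.0, 85.0, 100.0]
normalized = min_max_normalize(scores)
```

In practice, the minimum and maximum would typically be computed on the training split only and then reused for new samples, so that no information from the test data enters the scaling.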

3.1. Balancing Turnover Samples by Improved ADASYN

The standard ADASYN algorithm primarily generates new samples based on continuous features (variables) and is not suitable for employee datasets containing discrete features. On the one hand, for samples with discrete attributes, the Euclidean distance employed by ADASYN fails to accurately capture the similarity between samples, leading to bias in the selection of nearest neighbors. On the other hand, applying sample generation methods designed for continuous features to discrete attributes may produce invalid new values, thereby disrupting the original data structure of discrete features.
Therefore, it is necessary to improve the standard ADASYN algorithm to accommodate employee turnover datasets characterized by mixed features (continuous and discrete). The improved ADASYN algorithm utilizes a distance metric that accounts for mixed feature types to generate new samples [19]. These synthetic samples are added to the original training set to obtain a balanced dataset. Specifically, a new distance metric is proposed: for discrete features, the difference diff is defined as 1 if the attribute values differ, and 0 if they are equal; for continuous features, diff is calculated as the squared difference between the normalized feature values. This new metric reflects the similarity between samples with mixed feature types, particularly discrete features, which is essential for employee turnover data. Furthermore, during the generation of new samples, a random factor drawn from the interval [0, 1] is introduced. This random factor enables the generation of new synthetic samples from original mixed-feature samples without disrupting the inherent structure of discrete attributes.

3.1.1. ADASYN Algorithm

ADASYN [36,37] is commonly used for machine learning classification tasks, and its main idea is to generate synthetic samples that shift the classification decision boundary towards difficult-to-learn minority samples, thereby addressing class imbalance. In this paper, ADASYN is used to augment the training dataset $D_{train}$ (which has an imbalanced ratio of minority to majority class samples). The main calculation process is as follows:
In the training set $D_{train} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m)\}$, which contains $m$ samples, $m_s$ and $m_l$ are defined as the number of turnover samples and non-turnover samples, respectively. Thus, $m_s < m_l$ and $m_s + m_l = m$. The following steps are performed on the training set $D_{train}$:
(I) Calculate the class imbalance degree:
$$d = \frac{m_s}{m_l} \tag{2}$$
In the formula, the degree of imbalance satisfies $d \in (0, 1]$.
(II) When $d < d_{th}$ (where $d_{th}$ is the preset threshold for the maximum tolerated class imbalance ratio):
(II.1) Calculate the total number of synthetic turnover samples to be generated:
$$G = (m_l - m_s) \times \beta \tag{3}$$
In the formula, $\beta \in (0, 1]$ is a parameter that specifies the desired level of balance after generating synthetic samples. $\beta = 1$ indicates that the operation results in a completely balanced dataset.
(II.2) For each turnover sample $x_i$, find its $K$ nearest neighbors in the n-dimensional space based on the Euclidean distance and calculate the proportion $r_i$:
$$r_i = \frac{\Delta_i}{K}, \quad i = 1, 2, \ldots, m_s, \quad r_i \in [0, 1] \tag{4}$$
In the formula, $\Delta_i$ is the number of non-turnover samples among the $K$ nearest neighbors of $x_i$.
(II.3) Normalize $r_i$:
$$\bar{r}_i = \frac{r_i}{\sum_{i=1}^{m_s} r_i} \tag{5}$$
(II.4) Calculate the number of synthetic turnover samples required for each turnover sample: $g_i = \bar{r}_i \times G$.
(II.5) Synthesize samples according to Equation (6):
$$s_i = x_i + (x_{zi} - x_i) \times \lambda \tag{6}$$
In the formula, $(x_{zi} - x_i)$ represents the difference vector in n-dimensional space, where $x_{zi}$ is a turnover sample randomly selected from the $K$ nearest neighbors of $x_i$; $\lambda$ is a random number that satisfies $\lambda \in [0, 1]$.
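Steps (I) to (II.4) can be sketched in a few lines of Python. This is a minimal illustration of standard ADASYN's allocation logic rather than the authors' implementation: the function name `adasyn_allocation`, the toy coordinates, the rounding of $g_i$ to an integer, and the uniform fallback when every $r_i$ is zero are all assumptions made for the example.

```python
import math


def adasyn_allocation(minority, majority, k=5, beta=1.0, d_th=1.0):
    """Return (d, G, g), where g[i] is the synthetic-sample count for minority[i]."""
    m_s, m_l = len(minority), len(majority)
    d = m_s / m_l                                  # Equation (2)
    if d >= d_th:
        return d, 0, [0] * m_s                     # imbalance is tolerated: nothing to generate
    G = (m_l - m_s) * beta                         # Equation (3)

    # Pool all samples; label 1 = turnover (minority), 0 = non-turnover (majority).
    pool = [(x, 1) for x in minority] + [(x, 0) for x in majority]
    r = []
    for i, x in enumerate(minority):
        # Euclidean distances to every other sample (pool[i] is x itself).
        others = [(math.dist(x, p), lab) for j, (p, lab) in enumerate(pool) if j != i]
        others.sort(key=lambda t: t[0])
        delta = sum(1 for _, lab in others[:k] if lab == 0)
        r.append(delta / k)                        # Equation (4)

    total = sum(r)
    if total == 0:
        r_bar = [1.0 / m_s] * m_s                  # assumption: fall back to a uniform split
    else:
        r_bar = [ri / total for ri in r]           # Equation (5)
    g = [round(rb * G) for rb in r_bar]            # g_i = r_bar_i * G, rounded (assumption)
    return d, G, g


# Toy data: the second leaver sits inside a cluster of stayers.
leavers = [(0.0, 0.0), (1.0, 1.0)]
stayers = [(1.0, 1.2), (1.1, 0.9), (0.9, 1.1), (1.2, 1.0), (5.0, 5.0), (6.0, 6.0)]
d, G, g = adasyn_allocation(leavers, stayers, k=3)
```

The larger $r_i$ is, the more majority-class neighbors surround the sample, so ADASYN directs a larger share of the $G$ synthetic samples to it.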

3.1.2. Improved ADASYN Algorithm

  • (I) Distance metric for samples with mixed features
This section proposes an improved ADASYN algorithm based on a new distance metric and a new data generation strategy for employee turnover data containing mixed features. Let $\Omega$ be the index set of discrete features and $\Pi$ be the index set of continuous features in the dataset. Given two samples $X_p = (x_{p,1}, x_{p,2}, \ldots, x_{p,F})$ and $X_q = (x_{q,1}, x_{q,2}, \ldots, x_{q,F})$, the distance between $X_p$ and $X_q$ is defined as
$$d(X_p, X_q) = \sum_{j=1}^{F} \Delta(x_{p,j}, x_{q,j}) \tag{7}$$
where $\Delta$ calculates the difference between $X_p$ and $X_q$ on the $j$-th feature, and $\Delta$ is defined as
$$\Delta(x_{p,j}, x_{q,j}) = \begin{cases} 0, & \text{if } j \in \Omega \text{ and } x_{p,j} = x_{q,j} \\ 1, & \text{if } j \in \Omega \text{ and } x_{p,j} \neq x_{q,j} \\ (x_{p,j} - x_{q,j})^2, & \text{if } j \in \Pi \end{cases} \tag{8}$$
Equation (8) accounts for the mixed feature characteristics of employee turnover data. For discrete features, the difference value $\Delta$ is set to 1 when the feature values are not equal and 0 when they are equal. For continuous features, the difference value $\Delta$ is the squared difference of the feature values. Before Equation (8) is applied, continuous features are rescaled to the [0, 1] interval by min-max normalization to eliminate the influence of different scales.
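As a concrete illustration, the mixed-feature distance of Equations (7) and (8) might be coded as follows. The function name `mixed_distance`, the `discrete_idx` argument (standing in for the index set $\Omega$), and the two employee records are hypothetical.

```python
def mixed_distance(x_p, x_q, discrete_idx):
    """Equation (7): sum over all features of the per-feature difference in Equation (8)."""
    total = 0.0
    for j, (a, b) in enumerate(zip(x_p, x_q)):
        if j in discrete_idx:
            total += 0.0 if a == b else 1.0    # discrete: 0 if equal, 1 otherwise
        else:
            total += (a - b) ** 2              # continuous: squared difference
    return total


# Hypothetical records: (department, education, normalized performance score).
p = ("sales", "bachelor", 0.25)
q = ("sales", "master", 0.75)
dist = mixed_distance(p, q, discrete_idx={0, 1})
```

Here the department matches (contributing 0), the education level differs (contributing 1), and the normalized scores differ by 0.5 (contributing 0.25). Because continuous features lie in [0, 1], each feature contributes at most 1, so no single feature type dominates the sum.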
  • (II) Sample Generation Strategy for Mixed Features:
Given a sample $X_a = (x_{a,1}, x_{a,2}, \ldots, x_{a,F})$ and one of its $K$ nearest neighbors $X_b = (x_{b,1}, x_{b,2}, \ldots, x_{b,F})$, the formula for synthesizing a new sample $X_i^g = (x_{i,1}^g, x_{i,2}^g, \ldots, x_{i,F}^g)$ from $X_a$ and $X_b$ is as follows [37]:
$$x_{i,j}^g = \begin{cases} x_{a,j}, & j \in \Omega \text{ and } \mu \le 0.5 \\ x_{b,j}, & j \in \Omega \text{ and } \mu > 0.5 \\ x_{a,j} + (x_{b,j} - x_{a,j}) \times \lambda, & j \in \Pi \end{cases} \tag{9}$$
In the formula, $\mu$ and $\lambda$ are random numbers generated between 0 and 1. According to Equation (9), when the $j$-th feature is continuous, $x_{i,j}^g$ is obtained by interpolating between $x_{a,j}$ and $x_{b,j}$ with weight $\lambda$. When the $j$-th feature is discrete, $x_{i,j}^g$ randomly takes the value of $x_{a,j}$ or $x_{b,j}$. Equation (9) can therefore generate new samples from existing mixed-feature samples $X_a$ and $X_b$ without altering the structure of the original discrete features.
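The generation rule in Equation (9) can be sketched as below. The text does not specify whether $\mu$ and $\lambda$ are drawn once per sample or once per feature; this sketch assumes one $\lambda$ per synthetic sample (so the continuous part lies on the segment between the two parents, as in Equation (6)) and a fresh $\mu$ for each discrete feature. The name `synthesize` and the example records are hypothetical.

```python
import random


def synthesize(x_a, x_b, discrete_idx, rng=random):
    """Build one synthetic sample from x_a and its neighbor x_b following Equation (9)."""
    lam = rng.random()                     # assumption: a single lambda per synthetic sample
    new = []
    for j, (a, b) in enumerate(zip(x_a, x_b)):
        if j in discrete_idx:
            mu = rng.random()              # assumption: a fresh mu for each discrete feature
            new.append(a if mu <= 0.5 else b)
        else:
            new.append(a + (b - a) * lam)  # interpolate the continuous feature
    return tuple(new)


# Hypothetical parents: (department, education, normalized performance score).
parent = ("sales", "bachelor", 0.2)
neighbor = ("hr", "master", 0.6)
child = synthesize(parent, neighbor, discrete_idx={0, 1}, rng=random.Random(0))
```

Every discrete value in the synthetic sample is copied from one of the two parents, so no invalid category (such as a fractional department code) can appear.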
Algorithm 1 illustrates the process of the improved ADASYN algorithm, which takes the minority-class sample set $D_{min}$ (i.e., the turnover samples) and generates a new sample set $D_g$ containing $N_g$ samples. The generated samples $D_g$ and the original dataset $D_{train}$ together form a balanced dataset $D_{train}^{B} = D_{train} \cup D_g$, which is used to train the employee turnover prediction model. The number of generated samples $N_g$ is set to the difference between the number of majority class samples and the number of minority class samples in $D_{train}$.
Algorithm 1. Improved ADASYN algorithm
Input: Minority-class sample set $D_{min} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_{N_{min}}, y_{N_{min}})\}$ (with sample size $N_{min}$), number of new samples to generate $N_g$, number of nearest neighbors $k$;
Output: Generated sample set $D_g$
Randomly shuffle the order of the samples in $D_{min}$;
Let $y_{min} = y_1$;
for $i \leftarrow 1$ to $N_g$ do
    $a \leftarrow \{(i - 1) \bmod N_{min}\} + 1$; /* Obtain the index of the current sample */
    Calculate the distances from $x_a$ to the other samples in $D_{min}$ by Equation (7);
    Select the $k$ samples closest to $x_a$ and place them in set $\Pi$;
    Randomly select a sample $x_b$ from $\Pi$;
    Generate a new sample $X_i^g = (x_{i,1}^g, \ldots, x_{i,F}^g)$ from $X_a = (x_{a,1}, \ldots, x_{a,F})$ and $X_b = (x_{b,1}, \ldots, x_{b,F})$:
    for $j \leftarrow 1$ to $F$ do
        Calculate $x_{i,j}^g$ by Equation (9) from $x_{a,j}$ and $x_{b,j}$;
    end
end
return the generated sample set $D_g = \{(X_1^g, y_{min}), (X_2^g, y_{min}), \ldots, (X_{N_g}^g, y_{min})\}$;
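The control flow of Algorithm 1 can be sketched as a short Python driver. This is an illustration under stated assumptions, not the authors' code: the function name `improved_adasyn` and its signature are hypothetical, and the distance of Equation (7) and the generation rule of Equation (9) are passed in as the callables `distance(x_p, x_q)` and `synthesize(x_a, x_b, rng)` rather than fixed inside the loop.

```python
import random


def improved_adasyn(d_min, n_g, k, distance, synthesize, seed=0):
    """Generate n_g synthetic (features, label) pairs from the minority set d_min."""
    if len(d_min) < 2 or k < 1:
        raise ValueError("need at least two minority samples and k >= 1")
    rng = random.Random(seed)
    samples = list(d_min)
    rng.shuffle(samples)                       # randomly arrange the minority samples
    y_min = samples[0][1]                      # all minority samples share this label
    generated = []
    for i in range(n_g):
        a = i % len(samples)                   # 0-based form of a = ((i - 1) mod N_min) + 1
        x_a = samples[a][0]
        # Rank the other minority samples by their distance to x_a.
        others = [x for idx, (x, _) in enumerate(samples) if idx != a]
        others.sort(key=lambda x: distance(x_a, x))
        neighbours = others[:k]                # the k nearest minority neighbors
        x_b = rng.choice(neighbours)           # pick one neighbor at random
        generated.append((synthesize(x_a, x_b, rng), y_min))
    return generated
```

Cycling through the shuffled minority set with the modulo index means every turnover sample serves as a seed roughly equally often. Passing the metric and the generator as arguments keeps the loop independent of how discrete and continuous features are encoded.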

3.2. Employee Turnover Prediction Model Based on GS-CatBoost

3.2.1. CatBoost Prediction Algorithm

CatBoost [11] is a gradient boosting decision tree algorithm based on symmetric (oblivious) decision trees. It leverages a symmetric binary tree architecture to achieve iterative ensemble learning, using the residuals from preceding models as the optimization targets for subsequent models, thereby progressively reducing prediction errors. Building on the foundations of XGBoost and LightGBM, CatBoost further improves performance by adopting a ranking-based categorical feature encoding algorithm [13], which enables the direct processing of categorical variables such as department and gender, thus avoiding the informational bias and model misclassification that may arise from traditional manual encoding of categorical features. In the context of employee turnover prediction, CatBoost employs the Ordered Boosting strategy to effectively suppress the interference of noisy data, while its symmetric tree design not only enhances computational efficiency but also captures the nonlinear effects of multiple variables (such as age, monthly income, salary growth rate, and tenure in the company) on employee attrition. Given CatBoost’s significant advantages in feature handling efficiency and model generalization capability, this study adopts CatBoost as the core modeling method, with its objective function formalized as follows:
$$\min_{f} \sum_{i=1}^{n} L\left(Y_i, F_m(X_i)\right) \tag{10}$$
Here, $X = \{x_1, x_2, \ldots, x_n\}$ denotes the n-dimensional feature input vector, incorporating variables such as age, monthly income, and distance from home that influence employee turnover. $Y \in \{0, 1\}$ is the target variable indicating whether an employee leaves the organization: $Y = 1$ indicates attrition and $Y = 0$ indicates no attrition. $L$ represents the loss function, which measures the discrepancy between the model’s predicted values and the true targets. $F_m(x)$ denotes the cumulative prediction of the first $m$ decision trees.
During the iterative process of CatBoost, $F_m(x)$ can be expressed as follows:
$$F_m(x) = F_{m-1}(x) + u f_m(x) \tag{11}$$
In Equation (11), $F_{m-1}(x)$ denotes the cumulative prediction of the first $m-1$ decision trees, $f_m(x)$ represents the prediction of the $m$-th decision tree, and $u$ is the learning rate that controls the step size of each iterative update, thereby adjusting the contribution of the newly added tree to the overall model prediction. At the $m$-th iteration, the objective is to fit the negative gradient of the current loss function with respect to the previous round’s prediction:
$$g_i = -\frac{\partial L\left[Y_i, F_{m-1}(X_i)\right]}{\partial F_{m-1}(X_i)} \tag{12}$$
Then, by minimizing the squared error, f m ( x ) is obtained, and the specific computation process is as follows:
$$f_m(x) = \arg\min_{f} \sum_{i=1}^{n} \left[g_i - f(X_i)\right]^2 \tag{13}$$
CatBoost performs regression on the negative gradient of the residuals from the previous model [Equations (12) and (13)], progressively fitting the optimal direction of the loss function [Equation (10)]. In this way, the model output is updated at each iteration [Equation (11)], ultimately constructing a powerful ensemble prediction model. When one tree is fully grown, the next tree begins learning from the residuals of the previous tree, continuing until the final tree is fully grown. The predictions from all trees are then summed to obtain the final prediction result, which represents the probability of employee turnover.
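The iteration described by Equations (10) to (13) can be illustrated with a minimal pure-Python sketch. It assumes the logistic loss, for which the negative gradient in Equation (12) reduces to $Y_i - \sigma(F_{m-1}(X_i))$, and it uses a depth-1 regression stump as the base learner. This is a generic gradient boosting loop rather than CatBoost itself: ordered boosting, categorical encoding, and deeper symmetric trees are omitted, and the names `fit_stump` and `boost` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_stump(X, g):
    """Least-squares fit of a depth-1 tree to the targets g, as in Equation (13)."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [gi for row, gi in zip(X, g) if row[j] <= t]
            right = [gi for row, gi in zip(X, g) if row[j] > t]
            if not left or not right:
                continue
            ml, mr = sum(left) / len(left), sum(right) / len(right)
            sse = (sum((gi - ml) ** 2 for gi in left)
                   + sum((gi - mr) ** 2 for gi in right))
            if best is None or sse < best[0]:
                best = (sse, j, t, ml, mr)
    if best is None:                       # no valid split: constant predictor
        mean = sum(g) / len(g)
        return lambda row: mean
    _, j, t, ml, mr = best
    return lambda row: ml if row[j] <= t else mr

def boost(X, y, n_trees=50, u=0.3):
    """Additive model F_m = F_{m-1} + u * f_m, as in Equation (11)."""
    F = [0.0] * len(X)
    trees = []
    for _ in range(n_trees):
        # Negative gradient of the logistic loss, Equation (12).
        g = [yi - sigmoid(Fi) for yi, Fi in zip(y, F)]
        f_m = fit_stump(X, g)
        F = [Fi + u * f_m(row) for Fi, row in zip(F, X)]
        trees.append(f_m)
    # Summed tree outputs form a raw score; the sigmoid maps it to a probability.
    return lambda row: sigmoid(sum(u * f(row) for f in trees))

X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [0, 0, 0, 1, 1, 1]
predict = boost(X, y)
```

In this sketch the summed tree outputs are a raw score, and the sigmoid converts that score into the turnover probability.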

3.2.2. Grid Search-Based CatBoost Parameter Optimization

Grid Search (GS) is a commonly used exhaustive search algorithm in machine learning. During the hyperparameter tuning process, a broad search range is first defined for the hyperparameters to locate potential global optima. The search range is then gradually narrowed by adjusting the step size, and K-fold cross-validation is employed to validate each parameter combination, thereby identifying the optimal set of model parameters.
Several parameters affect the CatBoost model, including the number of iterations $t$, the maximum depth of the decision trees $Tr_{max}$, the learning rate $eta$, the regularization parameter $\gamma$, the minimum sum of sample weights $W_{min}$, the proportion of data used to train each tree $S$, and the proportion of features used $C$. Using the balanced dataset produced by the improved ADASYN algorithm as the training set, grid search optimizes the parameters $t$, $Tr_{max}$, $eta$, $\gamma$, $W_{min}$, $S$, and $C$ of the CatBoost model according to specific step sizes, with 10-fold cross-validation conducted to ensure the robustness of the parameter tuning.
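The exhaustive search with K-fold validation described above can be sketched without any library dependency. The scoring callback `fit_and_score` is a hypothetical placeholder: in practice it would train a CatBoost classifier on the training folds and return a validation score. The three keys in the toy grid (`iterations`, `depth`, `learning_rate`) are assumed counterparts of $t$, $Tr_{max}$, and $eta$; the mapping of the remaining symbols to concrete CatBoost options is not specified in the text and is left out.

```python
import itertools

def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of near-equal size."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    return folds

def grid_search(param_grid, fit_and_score, n_samples, k=10):
    """Score every parameter combination by K-fold CV and return the best.

    fit_and_score(params, train_idx, val_idx) -> float, higher is better.
    """
    folds = k_fold_indices(n_samples, k)
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for i, val_idx in enumerate(folds):
            train_idx = [j for f_i, f in enumerate(folds) if f_i != i for j in f]
            scores.append(fit_and_score(params, train_idx, val_idx))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Toy scorer standing in for "train the model, return validation accuracy".
def toy_score(params, train_idx, val_idx):
    return (-abs(params["depth"] - 6)
            - abs(params["learning_rate"] - 0.1)
            - abs(params["iterations"] - 200) / 1000)

grid = {"iterations": [100, 200], "depth": [4, 6, 8], "learning_rate": [0.05, 0.1]}
best_params, best_score = grid_search(grid, toy_score, n_samples=25, k=10)
```

The cost grows multiplicatively with the number of values per parameter (here 2 × 3 × 2 = 12 combinations, each fitted 10 times), which is why the search range is narrowed in stages rather than scanned finely from the start.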

3.3. Algorithm Framework

This study proposes an employee turnover prediction algorithm for imbalanced data, improved ADASYN-GS-CatBoost. The algorithm first improves the ADASYN method to convert the original data into balanced data, then employs the ensemble machine learning algorithm CatBoost to construct the employee turnover prediction model, and uses grid search to optimize key parameters of CatBoost.
The standard ADASYN algorithm balances the data by generating new minority class samples between existing minority class samples. However, standard ADASYN is not suitable for employee datasets that contain discrete features. On one hand, ADASYN uses Euclidean distance to measure sample similarity, which is not accurate for discrete feature samples, leading to biased selection of nearest samples. On the other hand, the new sample generation method for continuous features creates new values for discrete features, disrupting the original data structure of discrete features. Therefore, this study proposes an improved ADASYN algorithm to handle employee turnover data with mixed features.
Figure 1 shows the overall framework of the proposed IADASYN-GS-CatBoost algorithm. First, the original employee turnover dataset is divided into a training set $D_{train}$ and a test set $D_{test}$. Then, the training set $D_{train}$ is input into IADASYN-GS-CatBoost to train the employee turnover prediction model. IADASYN-GS-CatBoost includes two stages: Stage 1 uses the improved ADASYN algorithm to balance the training set, and Stage 2 uses the balanced dataset to train a classification model based on the CatBoost algorithm. The improved ADASYN algorithm uses a distance metric that accounts for mixed features to generate new samples. These new samples are added to the original training set to obtain a balanced training set.
When constructing the classification model with CatBoost, each iteration fits a new decision tree to the negative gradient of the loss under the current ensemble (Section 3.2.1), generating multiple decision trees. Additionally, grid search with cross-validation is used to select the best parameter combination for CatBoost (Section 3.2.2). By aggregating the decision trees, a final ensemble classification model (i.e., the employee turnover prediction model) is obtained. Finally, the test set $D_{test}$ is input into the resulting classification model to obtain predictions for the test set samples, which are used to evaluate the effectiveness of the constructed employee turnover prediction model.

4. Experimental Setup

4.1. Employee Turnover Datasets

To validate the effectiveness of IADASYN-GS-CatBoost (abbreviated as IA-GS-CatBoost), three datasets are used: the first from a company’s human resources department (referred to as “Dataset-I”, https://www.datacastle.cn/dataset_description.html?type=dataset&id=1903 accessed on 20 May 2025), the second from the IBM Watson Analytics platform (referred to as “Dataset-II”, https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset accessed on 20 May 2025), and the third from the Kaggle database (referred to as “Dataset-III”, https://www.kaggle.com/jiangzuo/hr-comma-sep/version/1 accessed on 20 May 2025).
Dataset-I contains employee turnover information from a company over a 9-year period. It includes 12 discrete features: “Unit Type,” “Gender,” “Source of Origin,” “School Type,” “Highest Education Level,” “Major,” “Marital Status,” “Region Type,” “Initial Job Position,” “Political Status,” “Highest Title,” and “Work Region Before Leaving.” This dataset includes 2494 samples, of which 208 are turnover samples (positive class) and 2286 are non-turnover samples (negative class), resulting in a class imbalance ratio of approximately 1:11.
Dataset-II is the IBM employee turnover dataset from the Kaggle database [38], featuring 34 attributes and 1470 samples. This dataset includes 237 turnover samples and 1233 non-turnover samples, with a class imbalance ratio of approximately 1:5. In the original dataset, features “Over 18 Years Old,” “Standard Hours,” and “Number of Employees” are constant, and “Employee ID” does not relate to turnover status, so these four features are removed. Additionally, “Daily Rate,” “Hourly Rate,” and “Monthly Rate” are redundant with “Monthly Income,” so the first three redundant variables are removed. After feature elimination, the dataset contains 27 attributes. Among these, 10 features are continuous, including “Age,” “Distance from Home,” “Monthly Income,” “Number of Companies Worked For,” “Promotion Rate,” “Total Work Experience,” “Years at Current Company,” “Time in Current Role,” “Years Since Last Promotion,” and “Years with Current Manager,” while the remaining features are discrete.
Dataset-III is an employee turnover dataset from a company in India, sourced from the Kaggle database [https://www.kaggle.com/jiangzuo/hr-comma-sep/version/1 accessed on 20 May 2025]. This dataset includes 10 attributes: “employee satisfaction level,” “latest performance evaluation,” “number of projects,” “average monthly working hours,” “years at the company,” “work accident occurrence,” “whether the employee left,” “promotion in the last five years,” “job position,” and “salary level.” The attribute “whether the employee left” serves as the turnover label. The dataset contains 14,999 samples in total, among which 3570 samples are turnover cases and 11,429 are non-turnover cases, resulting in a positive-to-negative sample ratio of approximately 1:3.2. Thus, the dataset is characterized as imbalanced.
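As a quick arithmetic check, the class counts reported for the three datasets can be turned into totals, minority shares, and imbalance ratios. The helper name `imbalance_summary` is hypothetical; the counts are those stated above.

```python
# (turnover samples, non-turnover samples) as reported in Section 4.1
datasets = {
    "Dataset-I": (208, 2286),
    "Dataset-II": (237, 1233),
    "Dataset-III": (3570, 11429),
}

def imbalance_summary(n_pos, n_neg):
    """Return total size, minority share, and negatives per positive."""
    total = n_pos + n_neg
    return {
        "total": total,
        "minority_share": n_pos / total,
        "neg_per_pos": n_neg / n_pos,
    }

for name, (n_pos, n_neg) in datasets.items():
    s = imbalance_summary(n_pos, n_neg)
    print(f"{name}: total={s['total']}, "
          f"minority share={s['minority_share']:.1%}, "
          f"ratio=1:{s['neg_per_pos']:.1f}")
```

The counts give totals of 2494, 1470, and 14,999 samples, with ratios of roughly 1:11.0, 1:5.2, and 1:3.2, so Dataset-I is by far the most skewed of the three.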

4.2. Experimental Design

To comprehensively evaluate the performance of IADASYN-GS-CatBoost, two sets of experiments were designed in this study:
(I) Comparative Experiments: To assess the predictive performance and stability of IADASYN-GS-CatBoost on employee turnover prediction, comparative experiments were conducted across three different datasets. The proposed method was compared against three hybrid feature handling approaches: Logistic Regression (LR) [39], Passive-Aggressive Stochastic Gradient Descent (PMSGD) [40], and Ensemble Random Forest (Ensemble RF) [6].
(II) Ablation Experiments: To evaluate the effectiveness of GS-CatBoost, a series of comparative experiments were performed using several classification algorithms, including Random Forest (RF) [41], Least Squares Support Vector Machine (LSSVM) [42], Back Propagation Neural Network (BPNN) [43], Naïve Bayes (NB) [44], and CatBoost [45].
In the ADASYN-CatBoost model, the number of base classifiers is set to 120, and the minimum leaf size is set to 1. The balanced employee turnover sample data is divided into training and test sets in a 7:3 ratio. Grid search is used to optimize parameters such as $t$, $Tr_{max}$, $eta$, $\gamma$, $W_{min}$, $S$, and $C$ in the CatBoost model according to the specified step sizes. After multiple rounds of tuning, the optimal parameters for the model are shown in Table 1.
In reference to the standard ADASYN, the improved ADASYN sets the number of nearest neighbors $k = 5$. For the Random Forest (RF), the minimum leaf size is set to 1 and the number of trees is set to 120, consistent with ADASYN-CatBoost. The Least Squares Support Vector Machine (LSSVM) uses a Gaussian kernel function. The parameters “BoxConstraint” and “KernelScale” are set to 357.29 and 0.2771 for Dataset-I, and to 488.52 and 0.6924 for Dataset-II (IBM dataset), respectively. These parameters were tuned using a 10-fold cross-validation approach. The Back Propagation Neural Network (BPNN) has a network structure with 2 hidden layers, each containing 30 units, and uses the ReLU activation function. It employs the cross-entropy loss function and the Adam optimizer, with a learning rate of 0.01 and 200 epochs. In the Naïve Bayes (NB) model, the smoothing factor is set to $10^{-9}$, and the parameter smoothing coefficient is set to 1.
The experiments are conducted on a personal computer with a 2.70 GHz CPU and 32 GB of RAM, with all algorithms implemented on the Python 3.9.13 platform. The datasets are randomly divided into a training set $D_{train}$ and a test set $D_{test}$ in a 7:3 ratio. To thoroughly compare algorithm performance, this division is performed randomly 10 times, resulting in 10 different training and test sets. Each algorithm is evaluated on these 10 sets through 10 experiments, and the average results of these 10 experiments are used for comparison. The 10 sets of training and test data are consistent across all algorithms to ensure fairness in the comparison.
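The repeated-split protocol can be sketched as follows. The sketch fixes one seed per repetition so that every algorithm sees the same 10 train/test partitions. Stratifying the split by class is an assumption added here to keep the minority share stable across partitions (the text only says the division is random), and the names `stratified_split`, `repeated_evaluation`, and the `run` callables are hypothetical.

```python
import random

def stratified_split(labels, test_frac=0.3, seed=0):
    """Return (train_idx, test_idx) with about test_frac of each class in the test set."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

def repeated_evaluation(labels, algorithms, n_repeats=10):
    """Average each algorithm's score over the same n_repeats partitions.

    algorithms maps a name to run(train_idx, test_idx) -> float. Any
    oversampling belongs inside run and should touch the training indices
    only, so the test set keeps its original class distribution.
    """
    splits = [stratified_split(labels, seed=s) for s in range(n_repeats)]
    results = {}
    for name, run in algorithms.items():
        scores = [run(train, test) for train, test in splits]
        results[name] = sum(scores) / len(scores)
    return results

# Toy labels: 20 turnover cases and 80 retained employees.
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels, seed=0)
demo = repeated_evaluation(labels, {"test_share": lambda tr, te: len(te) / len(labels)})
```

Generating the partitions once and reusing them for every algorithm is what makes the averaged scores directly comparable.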

4.3. Evaluation Metrics

The classification results of each algorithm on the test set are used to assess the algorithm’s performance. This study employs five classification performance metrics: Accuracy, Precision, Recall, F1-Score, and AUC. Employee turnover prediction is a binary classification problem, and its confusion matrix is shown in Table 2. In the table, TP, TN, FP, and FN represent the numbers of samples that are True Positives, True Negatives, False Positives, and False Negatives, respectively. Based on the confusion matrix in Table 2, the metrics are defined as follows [46]:
(1) Accuracy refers to the ratio of the number of correctly predicted samples to the total number of samples in a classification model. It is used to measure the overall classification performance, with a higher accuracy indicating better overall model performance. The calculation formula is
$$\text{Accuracy} = \frac{TP + TN}{TP + FN + FP + TN}$$
(2) Precision refers to the ratio of the number of samples correctly predicted as positive to the total number of samples predicted as positive. It is used to evaluate the model’s accuracy when predicting positive classes. A Precision value closer to 1 indicates a stronger ability of the model to correctly identify positive instances. The calculation formula is
$$\text{Precision} = \frac{TP}{TP + FP}$$
(3) Recall refers to the ratio of the number of samples correctly predicted as positive to the total number of actual positive samples. It is used to evaluate the model’s ability to identify positive instances. A higher Recall indicates a stronger capability of the model in correctly recognizing positive samples. The calculation formula is
$$\text{Recall} = \frac{TP}{TP + FN}$$
(4) F1-Score is the harmonic mean of Precision and Recall, taking into account both the model’s accuracy and its ability to identify positive instances. The F1-Score ranges from 0 to 1, with a value of 1 indicating optimal performance. Formula of F1-score is
$$\text{F1-score} = \frac{2 \times \text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$
(5) Receiver Operating Characteristic (ROC) curve plots the True Positive Rate (TPR) on the vertical axis against the False Positive Rate (FPR) on the horizontal axis, illustrating the performance of the model across different classification thresholds. The formulas for TPR and FPR are
$$\text{FPR} = \frac{FP}{TN + FP}, \qquad \text{TPR} = \frac{TP}{TP + FN}$$
Area Under the Curve (AUC) refers to the area under the ROC curve and quantifies the model’s ability to correctly classify positive samples relative to the misclassification of negative samples across different classification thresholds. The AUC value ranges from 0 to 1, with values closer to 1 indicating better model performance.
In the experiments, all five metrics (Accuracy, Precision, Recall, F1-Score, and AUC) were used to comprehensively evaluate and compare the performance of different algorithms. Higher values of these metrics generally indicate stronger predictive capability and better discrimination between positive and negative samples.
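The five metrics can be computed directly from the confusion-matrix counts and the predicted scores, as in the following minimal sketch. The function names are hypothetical, and AUC is computed through its rank interpretation (the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one, with ties counted as one half), which equals the area under the ROC curve.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, Precision, Recall (TPR), F1-score, and FPR from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "fpr": fpr}

def auc(y_true, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative counts (not taken from the paper's tables).
m = classification_metrics(tp=8, tn=85, fp=2, fn=5)
example_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])
```

With these illustrative counts Accuracy is 0.93 while Recall is only 8/13 ≈ 0.62, which shows why Accuracy alone can look strong on imbalanced data even when many turnover cases are missed.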

4.4. Experiment Results and Analysis

This section evaluates the data balancing performance of the improved ADASYN algorithm and the predictive performance of the IADASYN-GS-CatBoost model. The 10-fold cross-validation strategy was employed in the experiments, with the results of 10 iterations recorded and their average taken as the final evaluation metric.

4.4.1. Comparative Experiments

To evaluate the effectiveness of IADASYN-GS-CatBoost in employee turnover prediction, the proposed method was compared with several hybrid feature processing approaches, including Logistic Regression (LR) [47], Passive-Aggressive Stochastic Gradient Descent (PMSGD) [13], and Ensemble Random Forest (Ensemble RF) [48]. Table 3, Table 4 and Table 5 present the prediction results of the four methods on Dataset-I, Dataset-II, and Dataset-III, respectively.
As shown in Table 3, for Dataset-I, the proposed IADASYN-GS-CatBoost method achieved an average Accuracy, Recall, False Positive Rate, Precision, and AUC of 93.29%, 92.01%, 93.48%, 92.30%, and 93.24%, respectively. Across all five metrics, it consistently outperformed LR, PMSGD, and Ensemble RF. This superior performance is attributed to the improved ADASYN algorithm, which selectively increases the number of samples in locally similar feature spaces, thus achieving localized balancing and avoiding the indiscriminate oversampling issues typical of traditional SMOTE algorithms. Furthermore, the improvement effectively addresses the challenges posed by the coexistence of continuous and discrete features in employee turnover data, thereby enhancing classification accuracy. Additionally, the CatBoost model, based on the boosting framework, effectively reduces residual errors during training, improving predictive accuracy. The use of Grid Search (GS) for hyperparameter optimization further helps prevent overfitting, thereby enhancing both predictive performance and the model’s generalization capability.
For Dataset-II, the proposed IADASYN-GS-CatBoost method achieved an Accuracy, Precision, Recall, F1-Score, and AUC of 96.81%, 94.01%, 98.99%, 96.91%, and 96.80%, respectively. Compared with LR, PMSGD, and Ensemble RF methods, the proposed method outperformed all others across the five metrics. Specifically, compared to the relatively strong-performing Ensemble RF method, the proposed method improved Accuracy, Precision, Recall, F1-Score, and AUC by approximately 6.28%, 1.29%, 8.28%, 4.67%, and 4.84%, respectively.
For Dataset-III, the proposed method also demonstrated the best performance across all five metrics. Compared to the PMSGD method, which balances data using the SMOTE algorithm and optimizes decision trees with a genetic algorithm, the proposed method achieved improvements of approximately 6.31%, 3.40%, 7.72%, 5.64%, and 6.20% in Accuracy, Precision, Recall, F1-Score, and AUC, respectively. Furthermore, compared to the ensemble-based Ensemble RF method, the proposed approach improved the five metrics by approximately 3.93%, 2.07%, 7.12%, 4.67%, and 3.99%, respectively.
The experimental results across all three datasets demonstrate that the IADASYN-GS-CatBoost method not only achieves superior predictive performance but also exhibits strong stability and generalization ability when applied to different employee turnover datasets.

4.4.2. Ablation Experiments

The ablation study consists of two parts: (I) evaluation of the superiority of the GS-CatBoost model, and (II) validation of the effectiveness of the improved ADASYN sample generation method.
  • (I) Effectiveness Analysis of the GS-CatBoost Model
This section evaluates the predictive performance of the GS-CatBoost model for employee turnover prediction. Table 6, Table 7 and Table 8 present the experimental comparison results of GS-CatBoost against classical classification algorithms on Dataset-I, Dataset-II, and Dataset-III, respectively.
As shown in Table 6, Table 7 and Table 8, for Dataset-I, the proposed GS-CatBoost method achieved average Accuracy, Precision, Recall, F1-Score, and AUC values of 87.29%, 88.89%, 36.37%, 51.62%, and 67.66%, respectively. For Dataset-II, the GS-CatBoost method achieved Accuracy, Precision, Recall, F1-Score, and AUC values of 94.41%, 92.47%, 96.62%, 94.50%, and 94.46%, respectively. For Dataset-III, the GS-CatBoost method achieved Accuracy, Precision, Recall, F1-Score, and AUC values of 96.95%, 92.78%, 94.12%, 93.44%, and 95.96%, respectively. The experimental results indicate that across all datasets, GS-CatBoost consistently achieved the best performance across all five evaluation metrics.
By comparing Accuracy and Precision, it can be observed that the GS-CatBoost method outperforms the five classical prediction algorithms. Compared with the CatBoost algorithm, which had the highest performance among the comparison models, the proposed method achieved an average improvement of 8.90% in Accuracy and 3.91% in Precision. This indicates that the GS-CatBoost method proposed in this study provides the highest prediction accuracy for both turnover and retained employees.
For Dataset-I, the Recall rates of the comparison algorithms (RF, LSSVM, NB, BPNN, CatBoost) were all below 32%, and their F1-Scores were all below 47%. For Dataset-II, the Recall rates remained below 95% and the F1-Scores below 93%. For Dataset-III, the Recall rates were below 93%, and the F1-Scores were below 92%. In contrast, GS-CatBoost achieved Recall rates of 36.37%, 96.62%, and 94.12%, and F1-Scores of 51.62%, 94.50%, and 93.44% across the three datasets, significantly outperforming the other methods. Moreover, the AUC values obtained by the five comparison algorithms on the three datasets were all below 90%, noticeably lower than those achieved by GS-CatBoost.
These results demonstrate that through optimized parameter tuning, GS-CatBoost not only improves the prediction of turnover employees but also substantially enhances the prediction performance for retained employees. However, it can also be observed that, due to the impact of data imbalance, the overall prediction accuracy for turnover employees remains relatively lower across all algorithms.
  • (II) Effectiveness Analysis of the Improved ADASYN
To validate the effectiveness of the proposed improved ADASYN (IADASYN), the training data are rebalanced using both standard ADASYN and IADASYN. For each rebalanced dataset, we then conduct turnover prediction using five classifiers (RF, LSSVM, NB, BPNN, and GS-CatBoost) and compare the results under the same evaluation protocol. We employ five metrics (Accuracy, Precision, Recall, F1-score, and AUC) and report the average performance over the three datasets as the final results. The experimental results are shown in Table 9.
As shown in Table 9, models trained on the datasets generated by IADASYN (prefix IA-) consistently outperform those trained on the datasets generated by standard ADASYN (prefix A-), to varying degrees. Specifically, (i) Accuracy increases across all five models, with an average improvement of approximately 1.342%; (ii) for Precision, IA-BPNN and IA-GS-CatBoost improve over A-BPNN and A-GS-CatBoost by 2.05% and 0.83%, respectively; (iii) for Recall, IA-BPNN and IA-GS-CatBoost improve by 1.55% and 1.03%, respectively; (iv) for F1-score, IA-BPNN and IA-GS-CatBoost improve by 1.06% and 1.14%, respectively; and (v) for AUC, improvements of 1.32% and 1.18% are observed for IA-BPNN and IA-GS-CatBoost, respectively. These results indicate that, across different datasets and predictive models, IADASYN leads to more balanced and stable training data, thereby improving predictive performance.

4.4.3. Comparison with Current Research

We compare the method proposed in this study against sixteen recently introduced approaches on the public IBM human-resources dataset (Dataset-II); the prediction accuracies of all methods are reported in Table 10. As Table 10 shows, references [44,49] employed Decision Tree models for attrition prediction and achieved an accuracy of 83.44%. Logistic Regression (LR), owing to its solid predictive performance, has been adopted in multiple studies: references [9,50] reported prediction accuracies of 87.50% and 87.00%, respectively, while reference [8] further improved accuracy to 88.43% with ensemble LR models. Support Vector Machines (SVMs) have likewise been widely used. Reference [51] attained 88.44% accuracy with SVM; reference [13] reported 84% accuracy using SVM; and reference [52] achieved 92.50% using linear combinations of SVMs. Random Forest (RF) was applied in references [14,53,54], yielding accuracies of 80%, 85.11%, and 87.298%, respectively.
Boosting-based methods have demonstrated high predictive accuracy in employee attrition prediction. Reference [13] obtained 86.02% accuracy with XGBoost and increased accuracy to 89.45% with CatBoost. Deep learning has also attracted considerable attention: reference [55] employed deep neural networks (DNNs) to reach 89.11% accuracy, reference [56] combined a genetic-algorithm-optimized deep autoencoder with KNN (GA-DeepAutoencoder-KNN) to reach 90.95% accuracy, and reference [57] used an ensemble bidirectional temporal convolutional network (Ensemble Bi-TCN) to achieve 92.17%. Our IADASYN-GS-CatBoost model likewise attains high predictive accuracy, with an average accuracy of 97.48%.
Compared with the Linear-SVM with feature fusion and Ensemble Bi-TCN + GAN methods, the accuracy of our method is higher by 4.98 and 5.31 percentage points, respectively; compared with the remaining models, it shows an even more pronounced advantage in attrition-prediction accuracy.
Table 10. Comparing the proposed model with current research on Dataset-II.

| Reference | Year | Model | Accuracy |
|---|---|---|---|
| Reference [49] | 2019 | Decision Tree (DT) | 83.44% |
| Reference [9] | 2020 | Logistic Regression (LR) | 87.50% |
| Reference [50] | 2020 | Logistic Regression (LR) | 87.00% |
| Reference [8] | 2021 | Ensemble LR (ELR) | 88.43% |
| Reference [51] | 2020 | Support Vector Machine (SVM) | 88.44% |
| Reference [14] | 2021 | Random Forest (RF) | 87.30% |
| Reference [41] | 2021 | LR with feature selection | 81.00% |
| Reference [55] | 2021 | Deep Neural Network (DNN) | 89.11% |
| Reference [58] | 2021 | Artificial Neural Network (ANN) | 84.00% |
| Reference [54] | 2021 | Random Forest (RF) | 85.11% |
| Reference [53] | 2022 | k-Nearest Neighbor (KNN) | 84.00% |
| Reference [13] | 2022 | CatBoost | 89.45% |
| Reference [13] | 2023 | XGBoost | 86.02% |
| Reference [52] | 2024 | Linear-SVM with feature fusion | 92.50% |
| Reference [56] | 2024 | GA-DeepAutoencoder-KNN | 90.95% |
| Reference [57] | 2025 | Ensemble Bi-TCN + GAN | 92.17% |
| Proposed method | 2025 | IADASYN-GS-CatBoost | 97.48% |

4.4.4. Algorithm Runtime

Figure 2 records the training and testing times for each algorithm. The training time of each model increased after the introduction of IADASYN, both because IADASYN needs time to generate the balanced data and because the larger sample size lengthens training. RF, LSSVM, NB, and CatBoost have lower training and testing times than BPNN and GS-CatBoost. Comparing BPNN and GS-CatBoost, BPNN has slightly higher training and testing times than GS-CatBoost. After incorporating IADASYN, the training times of IA-BPNN and IA-GS-CatBoost show a noticeable increase. The testing time of IA-GS-CatBoost is close to that of CatBoost and lower than that of IA-BPNN, indicating that the ensemble model obtained by IA-GS-CatBoost is more streamlined than IA-BPNN. Although the training times of IA-GS-CatBoost are slightly higher than those of RF, LSSVM, NB, and CatBoost, its classification performance is significantly better than that of these algorithms. Additionally, once training is complete, the testing time of the proposed IA-GS-CatBoost model is relatively close to that of RF, LSSVM, NB, and CatBoost and remains under 100 ms, which is acceptable for practical applications.

5. Conclusions

Employee turnover prediction plays a crucial role in enabling organizations to proactively identify potential resignations, thereby aiding in employee retention and reducing human resource management costs. This paper proposes an employee turnover prediction method for imbalanced datasets, based on the IADASYN-GS-CatBoost algorithm. To address the challenges posed by mixed-type features in turnover datasets, we adopt an improved ADASYN technique to balance the data and utilize a Grid Search-optimized CatBoost ensemble model (GS-CatBoost) to construct the prediction framework. Experimental results demonstrate that the IADASYN-GS-CatBoost model significantly enhances the prediction accuracy of employee turnover. Notably, across various imbalanced datasets, the algorithm consistently achieves high precision, indicating its robustness in identifying high-risk employees. The proposed model enables HR departments to assess turnover risks in a timely manner and take targeted intervention measures, thereby effectively preventing the loss of key talent and reinforcing the organization’s core competitiveness.
Despite the promising performance, several limitations remain: (1) Feature Selection: Future research could integrate feature selection strategies into the current model to filter out non-essential variables, incorporating only those features strongly associated with turnover, which may further improve model accuracy. (2) Hyperparameter Optimization: While grid search was used in this study, other optimization techniques such as Bayesian optimization could be employed to find more efficient and accurate hyperparameter combinations. (3) Model Architecture: The algorithms used in this study are primarily shallow models. Expanding to include deep learning frameworks could enhance both the generalizability and stability of turnover prediction. In future work, we aim to integrate Bayesian optimization with deep learning models to further improve predictive performance.

Author Contributions

Methodology, S.H. and K.D.; Software, K.D.; Formal analysis, S.H.; Data curation, S.H.; Writing—original draft, K.D.; Writing—review & editing, S.H. and K.D. All authors have read and agreed to the published version of the manuscript.

Funding

Funding was provided by the National Natural Science Foundation of China (Grant No. 61671338).

Data Availability Statement

The experimental data were obtained from Kaggle: IBM HR Analytics Employee Attrition & Performance [24 January 2024] (https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset accessed on 20 May 2025). All data generated or analyzed during this study are included in this published article. All code used in this study is available online (https://doi.org/10.5281/zenodo.16908067 accessed on 20 May 2025, https://github.com/k64723-cmyk/code/releases/tag/v.1.0.0 accessed on 20 May 2025).

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Wang, G.; Qin, S.; Gui, H. Analysis of influence factors and prediction for employee turnover. J. Syst. Sci. Math. Sci. 2022, 42, 1616–1632.
  2. Akasheh, M.A.; Malik, E.F.; Hujran, O.; Zaki, N. A decade of research on machine learning techniques for predicting employee turnover: A systematic literature review. Expert Syst. Appl. 2024, 238, 121794.
  3. Lei, T.; Min, J.; Han, C.; Qi, C.; Jin, C.; Li, S. Multi-model ensemble forecasting of 10-m wind speed over eastern China based on machine learning optimization. Atmos. Ocean. Sci. Lett. 2023, 16, 97–103.
  4. Zhang, Y.M.; Zhang, C.Y. The application of the decision tree algorithm based on K-means in employee turnover prediction. J. Phys. Conf. Ser. 2019, 1325, 012123.
  5. Raza, A.; Munir, K.; Almutairi, M.; Younas, F.; Fareed, M.M.S. Predicting employee attrition using machine learning approaches. Appl. Sci. 2022, 12, 6424.
  6. Guerranti, F.; Dimitri, G.M. A comparison of machine learning approaches for predicting employee attrition. Appl. Sci. 2022, 13, 267.
  7. Zhao, Y.; Hryniewicki, M.K.; Cheng, F.; Fu, B.; Zhu, X. Employee turnover prediction with machine learning: A reliable approach. In Intelligent Systems and Applications, Proceedings of the 2018 Intelligent Systems Conference (IntelliSys), London, UK, 6–7 September 2018; Springer: Cham, Switzerland, 2019; Volume 2, pp. 737–758.
  8. Qutub, A.; Al-Mehmadi, A.; Al-Hssan, M.; Aljohani, R.; Alghamdi, H.S. Prediction of employee attrition using machine learning and ensemble methods. Int. J. Mach. Learn. Comput. 2021, 11, 110–114.
  9. Fallucchi, F.; Coladangelo, M.; Giuliano, R.; William De Luca, E. Predicting employee attrition using machine learning techniques. Computers 2020, 9, 86.
  10. El-Rayes, N.; Fang, M.; Smith, M.; Taylor, S.M. Predicting employee attrition using tree-based models. Int. J. Organ. Anal. 2020, 28, 1273–1291.
  11. Jhaveri, S.; Khedkar, I.; Kantharia, Y.; Jaswal, S. Success Prediction using Random Forest, CatBoost, XGBoost and AdaBoost for Kickstarter Campaigns. In Proceedings of the 2019 3rd International Conference on Computing Methodologies and Communication (ICCMC), Erode, India, 27–29 March 2019; pp. 1170–1173.
  12. Jain, R.; Nayyar, A. Predicting employee attrition using CatBoost machine learning approach. In Proceedings of the 2018 International Conference on System Modeling & Advancement in Research Trends (SMART), Moradabad, India, 23–24 November 2018; pp. 113–120.
  13. Atique, M.M.A.B.; Hoque, M.N.; Uddin, M.J. Employee attrition analysis using CatBoost. In Proceedings of the International Conference on Machine Intelligence and Emerging Technologies, Noakhali, Bangladesh, 23–25 September 2022; pp. 644–658.
  14. Chakraborty, R.; Mridha, K.; Shaw, R.N.; Ghosh, A. Study and prediction analysis of the employee turnover using machine learning approaches. In Proceedings of the 2021 IEEE 4th International Conference on Computing, Power and Communication Technologies (GUCON), Kuala Lumpur, Malaysia, 24–26 September 2021; pp. 1–6.
  15. Lin, W.C.; Tsai, C.F.; Hu, Y.H.; Jhang, J.S. Clustering-based undersampling in class-imbalanced data. Inf. Sci. 2017, 409, 17–26.
  16. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
  17. Barua, S.; Islam, M.M.; Yao, X.; Murase, K. MWMOTE--Majority weighted minority oversampling technique for imbalanced data set learning. IEEE Trans. Knowl. Data Eng. 2012, 26, 405–425.
  18. Mathew, J.; Pang, C.; Luo, M.; Leong, W.H. Classification of imbalanced data by oversampling in kernel space of support vector machines. IEEE Trans. Neural Netw. Learn. Syst. 2017, 29, 4065–4076.
  19. Chen, H.; Zhao, J.Z.; Xiao, C.L.; Chen, J.H.; Xiao, Y. Research on improved intrusion detection model of ADASYN-SDA. Comput. Eng. Appl. 2020, 56, 97–105.
  20. Li, S.J.; Zhang, H.R. Job Demands-Resources Model for Turnover Intention of Manufacturing Employees. Sci. Technol. Dev. 2021, 17, 711–718.
  21. Yasin, Y.M.; Kerr, M.S.; Wong, C.A.; Bélanger, C.H. Factors affecting nurses’ job satisfaction in rural and urban acute care settings: A PRISMA systematic review. J. Adv. Nurs. 2020, 76, 963–979. [Google Scholar] [CrossRef]
  22. Ang, K.B.; Goh, C.T.; Koh, H.C. An employee turnover prediction model: A study of accountants in singapore. Asian Rev. Account. 1994, 2, 121–138. [Google Scholar] [CrossRef]
  23. Gong, J.W.; Zhang, L.F.; She, Q.; Yu, F. Research on employee data visualization and turnover prediction based on logistic regression and decision tree. Intell. Comput. Appl. 2023, 13, 162–167. [Google Scholar]
  24. Pi, C.L.; Zheng, X.M. A Study on the Influence of Work Values of New Generation Employees in Hotels on Turnover Intention. J. Chongqing Technol. Bus. Univ. 2018, 2, 108–112. [Google Scholar]
  25. Lu, Y.M.; Li, G.P. Effects of employers organizational support sense on the turnover intention of dispatched employees intermediate effects of regulation. Manag. Rev. 2016, 28, 193–201. [Google Scholar]
  26. Punnoose, R.; Ajit, P. Prediction of employee turnover in organizations using machine learning algorithms. Int. J. Adv. Res. Artif. Intell. 2016, 5, 222–226. [Google Scholar] [CrossRef]
  27. Gao, X.; Wen, J.H.; Zhang, C. An improved random forest algorithm for predicting employee turnover. Math. Probl. Eng. 2019, 2019, 4140707. [Google Scholar] [CrossRef]
  28. Ozmen, E.P.; Ozcan, T. A novel deep learning model based on convolutional neural networks for employee churn prediction. J. Forecast. 2022, 41, 539–550. [Google Scholar] [CrossRef]
  29. Priambodo, B.; Jumaryadi, Y.; Rahayu, S.; Ani, N.; Ratnasari, A.; Salamah, U.; Putra, Z.P.; Otong, M. Predicting employee turnover in IT industries using correlation and chi-square visualization. Int. J. Adv. Comput. Sci. Appl. 2022, 13, 71–75. [Google Scholar] [CrossRef]
  30. Ganthi, L.S.; Nallapaneni, Y.; Perumalsamy, D.; Mahalingam, K. Employee Attrition Prediction Using Machine Learning Algorithms. In Proceedings of the International Conference on Data Science and Applications, Virtual, 26–27 March 2022; Springer: Singapore, 2022; pp. 577–596. [Google Scholar]
  31. Muslim, M.A.; Dasril, Y. Company bankruptcy prediction framework based on the most influential features using XGBoost and stacking ensemble learning. Int. J. Electr. Comput. Eng. 2021, 11, 5549–5557. [Google Scholar] [CrossRef]
  32. Heidari, M.; Zad, S.; Rafatirad, S. Ensemble of supervised and unsupervised learning models to predict a profitable business decision. In Proceedings of the 2021 IEEE International IoT, Electronics and Mechatronics Conference (IEMTRONICS), Toronto, ON, Canada, 21–24 April 2021; pp. 1–6. [Google Scholar]
  33. Li, Q.; Zhai, L. Analysis and research on employee turnover prediction based on stacking algorithm. J. Chongqing Technol. Bus. Univ. 2019, 36, 117–123. [Google Scholar]
  34. Zheng, J.; Liu, R.J. Prediction of employee turnover in power enterprises in Qinghai electric power company on IVRF algorithm. Oper. Res. Manag. Sci. 2022, 31, 210–216. [Google Scholar]
  35. Chung, D.; Yun, J.; Lee, J.; Jeon, Y. Predictive model of employee attrition based on stacking ensemble learning. Expert Syst. Appl. 2023, 215, 119364. [Google Scholar] [CrossRef]
  36. Wang, W.; Zhi, L. Fraud detection model generalization performance improvement and interpretability study based on ADASYN-SFS-RF. Appl. Res. Comput. 2022, 39, 3605–3613. [Google Scholar]
  37. Krishnan, U.; Sangar, P. A rebalancing framework for classification of imbalanced medical appointment no-show data. J. Data Inf. Sci. 2021, 6, 178–192. [Google Scholar] [CrossRef]
  38. Kaggle. IBM HR Analytics Employee Attrition & Performance. Available online: https://www.kaggle.com/datasets/pavansubhasht/ibm-hr-analytics-attrition-dataset (accessed on 24 January 2024).
  39. Najafi, Z.S.; Shams, G.N.; Arjomandi, N.A.; Hashemkhani, Z.S. An improved machine learning-based employee attrition prediction framework with emphasis on feature selection. Mathematics 2021, 9, 1226. [Google Scholar] [CrossRef]
  40. Azad, C.; Bhushan, B.; Sharma, R.; Shankar, A.; Singh, K.K.; Khamparia, A. A prediction model using SMOTE, genetic algorithm and decision tree (PMSGD) for classification of diabetes mellitus. Multimed. Syst. 2022, 28, 1289–1307. [Google Scholar] [CrossRef]
  41. Schumacher, T.B.; LaMonte, J.M.; LaCroix, Z.A.; Simonsick, E.M.; Hooker, S.P.; Parada, H., Jr.; Bellettiere, J.; Kumar, A. Development, validation, and transportability of several machine-learned, non-exercise-based VO2max prediction models for older adults. J. Sport Health Sci. 2024, 13, 611–620+718. [Google Scholar] [CrossRef] [PubMed]
  42. Ahangari, D.; Daneshfar, R.; Zakeri, M.; Ashoori, S.; Soulgani, B.S. On the prediction of geochemical parameters (TOC, S1 and S2) by considering well log parameters using ANFIS and LSSVM strategies. Petroleum 2022, 8, 174–184. [Google Scholar] [CrossRef]
  43. Liu, D.; Su, C.; Yang, R.; Ren, J.; Liu, X. Temperature field test and prediction using a GA-BP neural network for CRTSⅡ slab tracks. Railw. Eng. Sci. 2023, 31, 381–395. [Google Scholar] [CrossRef]
  44. Li, D.; Hu, S.; Guo, J.; Wang, K.; Gao, C.; Wang, S.; He, W. A new hybrid machine learning model for short-term climate prediction by performing classification prediction and regression prediction simultaneously. J. Meteorol. Res. 2022, 36, 853–865. [Google Scholar] [CrossRef]
  45. Lin, J.; Qi, C.; Wan, H.; Min, J.; Chen, J.; Zhang, K.; Zhang, L. Prediction of cross-tension strength of self-piercing riveted joints using finite element simulation and CatBoost algorithm. Chin. J. Mech. Eng. 2021, 34, 178–188. [Google Scholar] [CrossRef]
  46. Jiang, H.; Deng, H. Traffic incident detection method based on factor analysis and weighted random forest. IEEE Access 2020, 8, 168394–168404. [Google Scholar] [CrossRef]
  47. Jhaver, M.; Gupta, Y.; Mishra, A.K. Employee turnover prediction system. In Proceedings of the 2019 4th International Conference on Information Systems and Computer Networks (ISCON), Mathura, India, 21–22 November 2019; pp. 391–394. [Google Scholar]
  48. Jain, N.; Tomar, A.; Jana, P.K. A novel scheme for employee churn problem using multi-attribute decision making approach and machine learning. J. Intell. Inf. Syst. 2021, 56, 279–302. [Google Scholar] [CrossRef]
  49. Usha, P.M.; Balaji, N. Analyzing employee attrition using machine learning. Karpagam J. Comput. Sci. 2019, 13, 277–282. [Google Scholar]
  50. Mohbey, K.K. Employee attrition prediction using machine learning approaches. In Machine Learning and Deep Learning in Real-Time Applications; IGI Global: New York, NY, USA, 2020; pp. 121–128. [Google Scholar]
  51. Sethy, A.; Raut, A.K. Employee attrition rate prediction using machine learning approach. Turk. J. Physiother. Rehabil. 2020, 32, 14024–14031. [Google Scholar]
  52. Al Akasheh, M.; Hujran, O.; Malik, E.F.; Zaki, N. Enhancing the prediction of employee turnover with knowledge graphs and explainable AI. IEEE Access 2024, 12, 77041–77053. [Google Scholar] [CrossRef]
  53. Atef, M.S.; Elzanfaly, D.; Ouf, S. Early prediction of employee turnover using machine learning algorithms. Int. J. Electr. Comput. Eng. Syst. 2022, 13, 135–144. [Google Scholar] [CrossRef]
  54. Pratt, M.; Boudhane, M.; Cakula, S. Employee attrition estimation using random forest algorithm. Balt. J. Mod. Comput. 2021, 9, 49–66. [Google Scholar] [CrossRef]
  55. Jin, Z.; Shang, J.; Zhu, Q.; Ling, C.; Xie, W.; Qiang, B. RFRSF: Employee turnover prediction based on random forests and survival analysis. In Proceedings of the Web Information Systems Engineering—WISE 2020: 21st International Conference, Amsterdam, The Netherlands, 20–24 October 2020; Part II, pp. 503–515. [Google Scholar]
  56. Lim, C.S.; Malik, E.F.; Khaw, K.W.; Alnoor, A.; Chew, X.; Chong, Z.L.; Al Akasheh, M. Hybrid GA–DeepAutoencoder–KNN model for employee turnover prediction. Stat. Optim. Inf. Comput. 2024, 12, 75–90. [Google Scholar] [CrossRef]
  57. Mortezapour Shiri, F.; Yamaguchi, S.; Ahmadon, M.A.B. A deep learning model based on bidirectional temporal convolutional network (Bi-TCN) for predicting employee attrition. Appl. Sci. 2025, 15, 2984. [Google Scholar] [CrossRef]
  58. Ahmed, T.M. A novel classification model for employees turnover using neural network to enhance job satisfaction in organizations. J. Inf. Organ. Sci. 2021, 45, 361–374. [Google Scholar] [CrossRef]
Figure 1. Flowchart of the IADASYN-GS-CatBoost Algorithm.
Figure 2. Runtime of Different Algorithms.
Table 1. Results of Grid Search Optimization for CatBoost Algorithm Parameters.
| Parameter | Value Range | Step Size | Optimal Parameter |
|---|---|---|---|
| Maximum Depth of Decision Trees (Tr_max) | [6, 10] | 1 | 8 |
| Learning Rate (eta) | [0.1, 0.3] | 0.01 | 0.16 |
| Number of Iterations (t) | [10, 500] | 10 | 240 |
| Regularization Parameter (γ) | [0, 1] | 0.01 | 0.05 |
| Minimum Sum of Sample Weights (W_min) | [0, 10] | 1 | 1 |
| Ratio of Data Used (C) | [0, 1] | 0.02 | 0.88 |
| Ratio of Features Used (S) | [0, 1] | 0.02 | 0.92 |
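To make the scale of the search in Table 1 concrete, the following Python sketch builds the candidate values from the listed ranges and step sizes and counts the combinations of a full Cartesian grid. The dictionary keys, the helper names (`grid_values`, `grid_size`, `grid_search`), and the `toy_score` function are illustrative assumptions; they are neither the authors' implementation nor confirmed CatBoost argument names. `toy_score` only stands in for a cross-validated score and is constructed to peak at the Table 1 optima.

```python
from itertools import product
from math import prod


def grid_values(low, high, step):
    """Inclusive grid points from low to high; rounding avoids float drift."""
    n = int(round((high - low) / step))
    return [round(low + i * step, 10) for i in range(n + 1)]


# Ranges and step sizes copied from Table 1. The keys are illustrative
# labels (assumption), not confirmed CatBoost parameter names.
TABLE1_GRID = {
    "max_depth": grid_values(6, 10, 1),
    "eta": grid_values(0.1, 0.3, 0.01),
    "iterations": grid_values(10, 500, 10),
    "gamma": grid_values(0, 1, 0.01),
    "w_min": grid_values(0, 10, 1),
    "data_ratio": grid_values(0, 1, 0.02),
    "feature_ratio": grid_values(0, 1, 0.02),
}


def grid_size(grid):
    """Number of combinations in the full Cartesian product."""
    return prod(len(values) for values in grid.values())


def grid_search(grid, score_fn):
    """Evaluate score_fn on every combination; return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_score(params):
    # Hypothetical stand-in for cross-validated accuracy (assumption for the demo).
    return -abs(params["max_depth"] - 8) - abs(params["eta"] - 0.16)


# Search only a two-parameter sub-grid so the demo finishes instantly.
small_grid = {key: TABLE1_GRID[key] for key in ("max_depth", "eta")}
best, best_score = grid_search(small_grid, toy_score)
print(grid_size(TABLE1_GRID), best)
```

Under these assumptions the full grid contains 15,170,982,750 combinations, which is why the demo searches only a small sub-grid; a practical search would more plausibly tune the parameters in stages or on a coarser grid.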
Table 2. Confusion Matrix.
| | Predicted “Employed” (−) | Predicted “Resigned” (+) |
|---|---|---|
| Actual “Employed” (−) | TN | FP |
| Actual “Resigned” (+) | FN | TP |
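The entries of Table 2 determine four of the five metrics reported in the result tables. A minimal Python sketch using the standard definitions follows; the function name and the example counts are hypothetical and are not taken from the paper's datasets. AUC is omitted because it depends on ranked prediction scores rather than on a single confusion matrix.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, recall and F1 for the positive ("Resigned") class."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical imbalanced example: 92 employed, 8 resigned.
metrics = classification_metrics(tn=90, fp=2, fn=6, tp=2)
for name, value in metrics.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these made-up counts, accuracy is high while recall is low, which is the pattern that makes accuracy alone a weak indicator on imbalanced turnover data.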
Table 3. Prediction Results of IADASYN-GS-CatBoost and the other three methods on Dataset-I.
| Methods | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%) |
|---|---|---|---|---|---|
| LR | 86.05 | 89.84 | 81.53 | 85.48 | 87.27 |
| PMSGD | 87.66 | 88.65 | 87.32 | 87.98 | 89.59 |
| Ensemble RF | 89.17 | 90.81 | 88.67 | 89.73 | 90.15 |
| Proposed | 93.29 | 92.01 | 93.48 | 92.30 | 93.24 |
Table 4. Prediction Results of IADASYN-GS-CatBoost and the other three methods on Dataset-II.
| Methods | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%) |
|---|---|---|---|---|---|
| LR | 88.76 | 92.72 | 85.35 | 88.83 | 90.31 |
| PMSGD | 89.81 | 91.57 | 88.97 | 90.25 | 91.18 |
| Ensemble RF | 91.09 | 92.81 | 91.42 | 92.59 | 92.33 |
| Proposed | 96.81 | 94.01 | 98.99 | 96.91 | 96.80 |
Table 5. Prediction Results of IADASYN-GS-CatBoost and the other three methods on Dataset-III.
| Methods | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%) |
|---|---|---|---|---|---|
| LR | 90.51 | 91.66 | 88.68 | 90.14 | 90.27 |
| PMSGD | 91.693 | 92.84 | 91.78 | 92.30 | 91.78 |
| Ensemble RF | 93.79 | 94.05 | 92.30 | 93.16 | 93.73 |
| Proposed | 97.48 | 96.00 | 98.87 | 97.51 | 97.47 |
Table 6. Prediction Results of GS-CatBoost and Classical methods on Dataset-I.
| Methods | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%) |
|---|---|---|---|---|---|
| RF | 82.15 | 79.68 | 29.45 | 45.92 | 64.88 |
| LSSVM | 80.97 | 79.19 | 28.46 | 44.79 | 62.05 |
| NB | 75.68 | 73.99 | 27.98 | 42.46 | 61.47 |
| BPNN | 84.32 | 83.44 | 30.81 | 43.09 | 64.31 |
| CatBoost | 86.44 | 87.50 | 31.82 | 46.67 | 65.39 |
| GS-CatBoost | 87.29 | 88.89 | 36.37 | 51.62 | 67.66 |
Table 7. Prediction Results of GS-CatBoost and Classical methods on Dataset-II.
| Methods | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%) |
|---|---|---|---|---|---|
| RF | 91.61 | 90.04 | 91.37 | 90.91 | 91.06 |
| LSSVM | 89.77 | 89.21 | 91.15 | 89.13 | 88.22 |
| NB | 84.68 | 84.15 | 87.84 | 85.95 | 86.40 |
| BPNN | 91.32 | 89.75 | 91.80 | 90.72 | 90.57 |
| CatBoost | 92.77 | 91.30 | 94.38 | 92.81 | 92.79 |
| GS-CatBoost | 94.41 | 92.47 | 96.62 | 94.50 | 94.46 |
Table 8. Prediction Results of GS-CatBoost and Classical methods on Dataset-III.
| Methods | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%) |
|---|---|---|---|---|---|
| RF | 89.16 | 91.06 | 85.67 | 88.28 | 88.33 |
| LSSVM | 86.87 | 88.11 | 83.75 | 85.87 | 86.12 |
| NB | 84.35 | 85.51 | 80.86 | 83.12 | 85.44 |
| BPNN | 89.42 | 91.15 | 86.05 | 88.58 | 89.51 |
| CatBoost | 96.20 | 90.93 | 92.74 | 91.84 | 95.00 |
| GS-CatBoost | 96.95 | 92.78 | 94.12 | 93.44 | 95.96 |
Table 9. Prediction Results Comparison of IADASYN and ADASYN.
| Methods | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) | AUC (%) |
|---|---|---|---|---|---|
| A-RF | 88.43 | 87.22 | 87.68 | 88.42 | 89.18 |
| A-LSSVM | 86.39 | 85.91 | 86.13 | 86.92 | 86.27 |
| A-NB | 82.88 | 81.51 | 84.81 | 83.29 | 84.71 |
| A-BPNN | 89.07 | 88.48 | 90.51 | 90.56 | 90.73 |
| A-GS-CatBoost | 94.64 | 93.18 | 96.11 | 94.63 | 94.89 |
| IA-RF | 89.72 | 89.06 | 89.84 | 89.71 | 90.10 |
| IA-LSSVM | 87.11 | 86.19 | 88.03 | 87.40 | 88.36 |
| IA-NB | 84.22 | 82.85 | 85.49 | 84.48 | 85.94 |
| IA-BPNN | 91.21 | 90.53 | 92.06 | 91.62 | 92.05 |
| IA-GS-CatBoost | 95.86 | 94.01 | 97.14 | 95.77 | 96.07 |
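As a reading aid for Table 9, the accuracy differences between the IADASYN (IA-) and ADASYN (A-) variants can be computed directly from the tabulated values. The sketch below is illustrative only; the variable and function names are assumptions.

```python
# Accuracy (%) values copied from Table 9, keyed by base classifier.
ADASYN_ACC = {"RF": 88.43, "LSSVM": 86.39, "NB": 82.88, "BPNN": 89.07, "GS-CatBoost": 94.64}
IADASYN_ACC = {"RF": 89.72, "LSSVM": 87.11, "NB": 84.22, "BPNN": 91.21, "GS-CatBoost": 95.86}


def gains(before, after):
    """Per-classifier difference in percentage points, rounded to two decimals."""
    return {name: round(after[name] - before[name], 2) for name in before}


accuracy_gain = gains(ADASYN_ACC, IADASYN_ACC)
mean_gain = sum(accuracy_gain.values()) / len(accuracy_gain)

for name, delta in accuracy_gain.items():
    print(f"{name}: +{delta:.2f} points")
print(f"mean: +{mean_gain:.2f} points")
```

On these values the accuracy gain ranges from 0.72 to 2.14 percentage points across the five classifiers, with a mean of about 1.34 points.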
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Hu, S.; Dong, K. Predicting Employee Turnover Based on Improved ADASYN and GS-CatBoost. Mathematics 2026, 14, 313. https://doi.org/10.3390/math14020313
