2.1. Random Forest (RF)
Random Forest (RF) is a powerful ensemble learning method recognized for its robustness against overfitting, particularly in high-dimensional datasets. This strength arises from its ensemble approach, which integrates numerous decision trees to improve predictive accuracy and reduce variance. Studies indicate that Random Forest consistently outperforms other ensemble methods in classification performance [7]. The method’s ability to aggregate predictions from numerous trees enhances its reliability, making it a stable choice for classification tasks.
In addition, Random Forest provides insights into variable importance, aiding in understanding the influence of different predictors on outcomes and enhancing the model’s interpretability [8]. Its computational efficiency allows for fast training and prediction, making it well suited to large datasets and real-time applications.
Due to these characteristics, Random Forest has been widely adopted across various domains, including medicine, finance, and environmental science. In medical research, for example, Random Forest has been effectively utilized to predict treatment responses in cancer patients, highlighting its versatility and effectiveness in complex scenarios [
9].
Random Forest is recognized for its robustness and accuracy in classification tasks, making it a preferred choice in HR analytics. For instance, a study by Jayanti and Wasesa found that Random Forest outperforms Naive Bayes in predictive modeling for hiring processes, demonstrating its superior capability in handling the complex datasets typical of HR applications [10]. Similarly, Santhanalakshmi’s research indicates that Random Forest achieved 89% accuracy in predicting employee turnover, confirming its effectiveness in critical HR functions [11]. These findings underscore the algorithm’s strength in providing actionable insights that can enhance decision-making in HR management.
In contrast, other machine learning techniques such as Support Vector Machines (SVMs) and linear regression have also been employed in HR analytics. For example, Oyeleye et al. explored hybrid models combining Random Forest with linear regression to improve cardiovascular disease predictions, achieving notable accuracy levels [
12]. This suggests that while Random Forest is highly effective on its own, its performance can be further enhanced when integrated with other methodologies. Furthermore, the comparative analysis of various algorithms, including SVMs and neural networks, indicates that the choice of algorithm should be guided by the specific characteristics of the dataset and the predictive task at hand [
13].
Moreover, the versatility of Random Forest in handling diverse types of data—be it categorical or continuous—positions it favorably against other methods. Hu and Szymczak emphasize that Random Forest is a non-parametric approach capable of accommodating diverse response types, which is particularly beneficial in longitudinal data analysis within HR contexts [
14]. This flexibility contrasts with more rigid models that may not perform as well across varying data types.
Despite its advantages, Random Forest is not without limitations. For instance, Wang’s study on packet classification using Random Forest revealed that while the algorithm is capable of high-speed processing, the accuracy of its classifications can be compromised, emphasizing the need for careful model training and validation [
15]. This highlights a critical aspect of machine learning applications in HR analytics: the importance of model performance metrics and the potential trade-offs between speed and accuracy.
Random Forest is a well-known ensemble learning technique for regression and classification applications. It generates many decision trees during training and then combines their predictions to increase accuracy and prevent overfitting. In classification, the final prediction is generated by taking the majority vote across all trees, while in regression, the tree outputs are averaged to obtain the final forecast. This strategy makes Random Forest an effective tool for developing robust and dependable models.
A key characteristic of Random Forest lies in its ensemble learning structure. By merging predictions from several decision trees, the model smooths out individual tree errors, improving overall predictive performance. The ensemble approach leads to a more stable and reliable model, as it reduces the variance inherent in single decision trees. This is achieved through bootstrap aggregating (bagging), in which each tree is trained on a random bootstrap sample of the training data. This method mitigates overfitting, making the model more generalizable to unseen data. Additionally, at each node split, a random subset of features is considered, which ensures that the trees are diverse and strengthens the overall effectiveness of the model.
Several important parameters contribute to the performance of a Random Forest model. One such parameter is the number of trees (ntree), which defines how many decision trees are built in the forest. Increasing the number of trees improves the model’s stability and predictive power, though it also demands more computational resources. Another important parameter is the number of features to consider at each split (mtry), which specifies how many features are randomly selected to search for the best split at each node. Tuning this parameter controls the level of randomness and influences the diversity within the forest. Additionally, the model can calculate feature importance, which indicates the influence of individual predictors on the outcome. This capability helps with understanding the model and guiding future feature selection.
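To make these parameters concrete, the following is a minimal sketch using the randomForest R package; the data frame hr_data and the outcome column Performance are hypothetical names, not taken from the study.

```r
# Minimal Random Forest sketch; `hr_data` and `Performance` are assumed names.
library(randomForest)

set.seed(42)
rf_model <- randomForest(
  Performance ~ .,    # classify performance from all remaining columns
  data       = hr_data,
  ntree      = 500,   # number of trees grown in the forest
  mtry       = 3,     # features randomly sampled at each split
  importance = TRUE   # record variable importance measures
)

print(rf_model)       # out-of-bag error estimate and confusion matrix
importance(rf_model)  # per-predictor importance scores
```

Increasing ntree stabilizes the out-of-bag error at the cost of computation, while mtry trades off tree diversity against the strength of individual trees.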
Multiple criteria are required to assess the performance of a Random Forest model. The confusion matrix summarizes the predictions by indicating how many cases were correctly or incorrectly classified. Accuracy measures the overall proportion of correct predictions, giving a broad picture of performance. Precision is the percentage of true positive predictions among all positive predictions, reflecting the model’s reliability when it predicts the positive class. Recall (or sensitivity) measures how effectively the model identifies all relevant instances in the dataset. The F1-score, the harmonic mean of precision and recall, provides a balanced assessment when both metrics matter. Furthermore, the ROC curve displays the trade-off between the true positive rate (sensitivity) and the false positive rate (1 − specificity), with the AUC (Area Under the ROC Curve) providing a single value representing the model’s ability to differentiate between classes.
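As a worked illustration, these metrics can be computed in R from predicted and actual labels; the vectors pred, actual, and scores below are hypothetical, and the pROC package is assumed for the ROC/AUC step.

```r
# Evaluation-metric sketch; `pred`, `actual` (0/1 factors), and `scores`
# (predicted probabilities) are assumed inputs.
library(pROC)

cm <- table(Predicted = pred, Actual = actual)  # confusion matrix
tp <- cm["1", "1"]; fp <- cm["1", "0"]
fn <- cm["0", "1"]; tn <- cm["0", "0"]

accuracy  <- (tp + tn) / sum(cm)
precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f1        <- 2 * precision * recall / (precision + recall)

roc_obj <- roc(actual, scores)  # ROC curve from probability scores
auc(roc_obj)                    # area under the ROC curve
```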
Random Forest is a powerful and versatile algorithm that provides robust performance across various datasets. Its ability to handle large, high-dimensional datasets, along with built-in feature importance metrics, makes it a popular choice for classification and regression problems. Its parameters can be tuned to optimize the model for specific datasets and achieve reliable predictions.
2.2. K-Nearest Neighbors (KNN)
The K-Nearest Neighbors (KNN) algorithm is widely used in machine learning for classification applications. It works on the instance-based learning principle: the class of a new data point is decided by the majority class of its k closest neighbors in the feature space. This approach is popular because it is simple and effective in applications such as image classification, medical diagnosis, and customer satisfaction analysis [16].
KNN functions by calculating the distance between the query point and all points in the training dataset, using distance metrics like Euclidean, Manhattan, and Minkowski to identify the k closest neighbors [
17]. The choice of k is critical; a small k can make the model sensitive to noise, while a large k may obscure important patterns [
16,
18]. Notably, KNN is a lazy learner, meaning it does not create a model during training but retains training instances for classification. KNN has been effectively applied in various domains, such as classifying diseases based on patient data, where it has achieved significant accuracy improvements when combined with feature selection techniques [
19]. In image data contexts, KNN has demonstrated its versatility, being used for species classification [
20].
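For reference, the distance metrics named above belong to the Minkowski family; the following is the textbook definition (not taken from the cited studies), where p = 1 yields the Manhattan distance and p = 2 the Euclidean distance:

```latex
% Minkowski distance between points x and y in n-dimensional feature space
d(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{n} \lvert x_i - y_i \rvert^{p} \right)^{1/p}
```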
KNN is a non-parametric, instance-based learning algorithm that classifies data points based on the majority class of their nearest neighbors. Its effectiveness in various applications has been documented in recent studies. For instance, Kumar and Vidhya demonstrated that an enhanced version of KNN achieved significant improvements in accuracy for heart plaque detection compared to the Least Squares Support Vector Machine (LSSVM) [
21]. This finding illustrates KNN’s potential for high accuracy in medical applications, which can be paralleled in HR analytics for tasks such as employee retention prediction and candidate selection.
In HR analytics, KNN has been compared favorably against other algorithms such as decision trees and Support Vector Machines (SVMs). Sitienei’s research highlighted that KNN outperformed decision trees in predicting maize yield, demonstrating its robustness in handling varied datasets [
22]. Similarly, a comparative study by Lone evaluated KNN against decision trees for heart disease prediction, finding that KNN provided competitive accuracy, which is crucial for applications in HR where accurate predictions can lead to better hiring and employee management decisions [
23]. These comparisons indicate that KNN can serve as a reliable alternative to more complex algorithms in HR analytics.
Moreover, the performance of KNN can be significantly enhanced through various techniques. For example, Yin et al. explored different sampling algorithms to address imbalanced data, employing KNN alongside Logistic Regression and decision trees. Their findings suggest that KNN’s performance can be optimized when combined with data preprocessing techniques [
24]. This adaptability is particularly beneficial in HR analytics, where datasets may often be imbalanced due to the nature of hiring and employee turnover data.
Despite its advantages, KNN has limitations, particularly concerning computational efficiency and sensitivity to noise. High-Level K-Nearest Neighbors (HLKNN) has been proposed to mitigate these issues by enhancing KNN’s classification performance in noisy environments [
25]. This development is relevant for HR applications where data quality can vary significantly, and robust models are necessary for reliable predictions.
K-Nearest Neighbors (KNN) is a simple, intuitive, and widely used algorithm for classification or regression tasks. It operates by identifying the k closest training examples in the feature space to make predictions about unknown data points. The code begins with data preparation, reading from an Excel file and processing the data for use in the KNN algorithm. This involves converting the date column to POSIXct format and ensuring that the rating is numeric. Categorical variables, such as division and department, are transformed into numeric factors, and any rows with missing values are removed to maintain clean data for modeling.
Following data preparation, the dataset is split into features (X) and labels (y), with X containing the independent variables and y representing the dependent variable (rating). The dataset is then partitioned into training (80%) and testing (20%) sets using the createDataPartition function from the caret library. Normalization is performed on the features to scale them between 0 and 1 using a custom normalize function, which is essential in KNN to prevent attributes with larger ranges from dominating the distance calculations.
The code also addresses duplicates and constant columns by checking for and removing duplicate rows in the training data and identifying constant columns that provide no useful information for the model. To determine the optimal value of k, the test_knn function is employed to apply the KNN algorithm for different k values (3, 5, 7, 10), using the tryCatch function to gracefully handle any errors during execution. The results are stored in a list for comparison, and the best k value is selected based on the length of this results list, reflecting the number of successful predictions for each k.
Once the best k is identified, the final KNN model is trained using this value, and predictions are made on the test set. The predictions are converted to numeric format for further processing, and these predicted ratings are combined with the test data, which includes the actual ratings and staff names. Duplicate staff names in the results are managed by retaining the highest predicted rating for each staff member. Finally, the top five and bottom five staff members are identified based on the predicted ratings and printed. The KNN algorithm primarily depends on the parameter k, which is crucial as it determines how many nearest neighbors are considered when making predictions; in this code, several k values (3, 5, 7, 10) are tested to find the one that yields the best performance.
The provided code effectively prepares data for and implements the KNN algorithm, allowing for flexibility in selecting the optimal k value. The combination of normalization, handling of duplicates and constant columns, and the systematic testing of different k values demonstrates good practices in preparing for and executing a KNN classification task. The final output, highlighting the top and bottom staff based on predicted ratings, offers a useful evaluation of the model’s performance.
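A condensed sketch of the pipeline described above follows; the file name, column names, and the k-selection step are reconstructions for illustration, not the authors’ exact code.

```r
# KNN pipeline sketch; file and column names are assumptions.
library(readxl)
library(caret)
library(class)

raw <- read_excel("staff_ratings.xlsx")
raw$Division   <- as.numeric(as.factor(raw$Division))   # encode categories
raw$Department <- as.numeric(as.factor(raw$Department))
raw <- na.omit(raw)                                     # drop missing rows

normalize <- function(x) (x - min(x)) / (max(x) - min(x))  # scale to [0, 1]
X <- as.data.frame(lapply(raw[, c("Division", "Department")], normalize))
y <- as.factor(raw$Rating)

set.seed(1)
idx     <- createDataPartition(y, p = 0.8, list = FALSE)   # 80/20 split
X_train <- X[idx, ];  X_test <- X[-idx, ]
y_train <- y[idx];    y_test <- y[-idx]

# Test several k values, trapping failures as the described test_knn does
results <- list()
for (k in c(3, 5, 7, 10)) {
  results[[as.character(k)]] <- tryCatch(
    knn(train = X_train, test = X_test, cl = y_train, k = k),
    error = function(e) NULL
  )
}

best_k <- 5   # in the described code, chosen by comparing stored results
pred   <- knn(train = X_train, test = X_test, cl = y_train, k = best_k)
```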
2.3. XGBoost
XGBoost, or Extreme Gradient Boosting, is a highly efficient and powerful machine learning algorithm known for its outstanding performance in predictive modeling applications. As an ensemble learning method based on boosting, XGBoost combines weak learners, typically decision trees, to improve predictive accuracy. Its capacity to handle large, high-dimensional datasets and its resistance to overfitting make it a popular choice for a variety of applications, including healthcare, finance, and environmental research [
15,
26].
XGBoost’s efficacy stems from the optimization of multiple hyperparameters, which have a substantial impact on model performance. Key parameters include the learning rate, maximum tree depth, subsampling ratios, and regularization terms. The learning rate affects each tree’s contribution to the final model, whereas the maximum depth determines the complexity of individual trees. Subsampling ratios for rows and columns introduce randomness into the training process, helping to prevent overfitting [
27,
28]. Regularization parameters, such as L1 (Lasso) and L2 (Ridge), are crucial for managing model complexity and enhancing generalization [
29].
XGBoost also incorporates advanced techniques such as tree pruning and parallel processing, which enhance its speed and efficiency. The algorithm optimizes the loss function through iterative updates within a gradient boosting framework, allowing it to converge quickly to a robust model [
15,
26]. These features make XGBoost particularly suitable for applications requiring high accuracy and interpretability, such as risk prediction in medical studies and fraud detection in finance [
29,
30].
XGBoost is a gradient-boosted decision tree method that has demonstrated outstanding performance in a variety of fields, including healthcare and HR analytics. For example, Xu et al. showed that XGBoost outperformed traditional algorithms like K-Nearest Neighbors (KNN), Support Vector Machines (SVMs), and Logistic Regression (LR) in predicting lymphovascular invasion in breast cancer patients, demonstrating its clinical viability [
31]. This performance advantage is attributed to XGBoost’s ability to handle missing data and its inherent feature importance measurement, which allows for better interpretability of model predictions [
31]. In HR analytics, similar advantages are observed, particularly in predicting employee attrition and optimizing recruitment processes [
32,
33].
In HR analytics, various machine learning techniques have been used to enhance decision-making processes. For instance, Thakral et al. explored the thematic landscape of HR analytics, emphasizing the integration of machine learning frameworks to improve hiring and placement decisions [
34]. This aligns with the findings of Gazi, who applied machine learning to predict employee attrition, underscoring the necessity for organizations to leverage data-driven insights to retain talent [
32]. XGBoost’s ability to provide accurate predictions and insights into employee behavior positions it as a preferred choice over other algorithms, such as Random Forest and SVMs, which have been commonly used in HR analytics [
19].
Moreover, the optimization of XGBoost through metaheuristic algorithms has further enhanced its performance. Gulsun’s research on optimizing hyperparameters of XGBoost using Artificial Rabbits Optimization (ARO) demonstrated significant improvements in forecasting accuracy compared to traditional methods [
35]. This optimization capability is crucial in HR analytics, where the accuracy of predictive models directly impacts strategic decision-making. The combination of XGBoost with other advanced techniques, such as deep learning models, has also been explored, highlighting its versatility in complex HR scenarios [
36].
Maximizing the performance of XGBoost models requires effective parameter tuning. Techniques like grid search and Bayesian optimization are commonly used to systematically explore the hyperparameter space and identify optimal settings [
35]. Proper tuning has been shown to significantly improve model accuracy and stability, especially in complex datasets with multiple features [
29]. Liu et al. highlighted XGBoost’s advantages over other algorithms, such as Random Forest and Support Vector Machines, in managing high-dimensional data while maintaining computational efficiency [
29].
In this study, the XGBoost algorithm is utilized as the primary machine learning model. XGBoost, which stands for Extreme Gradient Boosting, is an optimized gradient boosting algorithm renowned for its performance and speed, particularly in structured or tabular data. Several key parameters are critical in the implementation of the XGBoost model.
The first important parameter is the objective, set as objective = “reg:squarederror”. This defines the learning task and the accompanying objective function, indicating that the model performs regression with the goal of minimizing the squared error, i.e., the difference between predicted and actual values. Another important setting is the maximum depth (max_depth), set at 6. This specifies the maximum depth of a tree; raising this number increases the model’s complexity and capacity to capture interactions, but it also increases the risk of overfitting. A depth of 6 is frequently selected as a balanced starting point between bias and variance.
The learning rate (eta) is another crucial parameter, set at eta = 0.3. This parameter shrinks the contribution of each tree by a factor of eta, where lower values typically yield a more robust model but require more trees to adequately model the data. A learning rate of 0.3 is regarded as a reasonable value that allows for quicker convergence without compromising accuracy significantly. Additionally, the number of rounds (nrounds) is specified as nrounds = 100, which indicates the number of boosting rounds or trees to build. Increasing this number can enhance model performance but also raises the risk of overfitting, making it common practice to employ early stopping techniques to determine the optimal number of rounds.
Regarding model training and prediction, the XGBoost model is trained using the training dataset, comprising the feature matrix Xtrain and the target variable Ytrain. The data is transformed into an xgb.DMatrix, an internal data structure optimized for XGBoost, which improves training efficiency. Following the training process, predictions are generated on the test dataset Xtest to evaluate the model’s performance.
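A minimal sketch of this configuration with the xgboost R package follows; the matrices X_train, y_train, and X_test are assumed to come from the preprocessing described above.

```r
# XGBoost regression sketch; X_train, y_train, X_test are assumed inputs.
library(xgboost)

dtrain <- xgb.DMatrix(data = as.matrix(X_train), label = y_train)
dtest  <- xgb.DMatrix(data = as.matrix(X_test))

params <- list(
  objective = "reg:squarederror",  # squared-error regression objective
  max_depth = 6,                   # maximum tree depth
  eta       = 0.3                  # learning rate (shrinkage per tree)
)

model <- xgb.train(params = params, data = dtrain, nrounds = 100)
preds <- predict(model, dtest)     # predictions on the test set

xgb.importance(model = model)      # feature importance for interpretation
```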
XGBoost was chosen because of its capacity to handle a variety of data formats and its performance in regression tasks. The model parameters are tuned following established best practices, with the goal of achieving a balance between model complexity and predictive power. Parameters may be adjusted and tuned based on the dataset’s features and desired performance measures.
2.4. Support Vector Machines (SVMs)
Support Vector Machines (SVMs) are sophisticated supervised learning algorithms used in classification and regression. SVMs seek to discover the best hyperplane that divides various classes in high-dimensional space by maximizing the margin between the nearest points, also known as support vectors. SVMs are popular in areas such as bioinformatics and financial forecasting due to their ability to handle high-dimensional data and their resistance to overfitting [
37,
38].
Key parameters influencing SVM performance include the regularization parameter \(C\), the kernel function, and kernel-specific parameters such as \(\gamma\) for the radial basis function (RBF) kernel. The parameter \(C\) balances maximizing the margin and minimizing classification error, while the choice of kernel function (linear, polynomial, or RBF) determines the feature space transformation [34].
Hyperparameter tuning is crucial for optimizing SVM performance, employing techniques like grid search, random search, and Bayesian optimization to find the best settings. Effective hyperparameter tuning enhances model accuracy and generalization [
39]. In practice, the SVM has been applied in various domains, including disease classification in healthcare and credit scoring in finance, demonstrating its capability to handle complex, high-dimensional datasets [
39].
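As a hedged illustration of such tuning, the e1071 R package provides a cross-validated grid search; the data frame train_df and its columns are assumptions, not objects from the study.

```r
# Grid-search tuning sketch for an RBF-kernel SVM; `train_df` is assumed.
library(e1071)

tuned <- tune(
  svm, Performance ~ Division + Department, data = train_df,
  ranges = list(
    cost  = c(0.1, 1, 10, 100),  # candidate values of C
    gamma = c(0.01, 0.1, 1)      # candidate RBF kernel widths
  )
)

tuned$best.parameters  # best C and gamma found by cross-validation
best_model <- tuned$best.model
```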
The SVM is particularly effective for classification problems due to its ability to find the optimal hyperplane that separates different classes in high-dimensional spaces. This characteristic is crucial in HR analytics, where the data often involves numerous features related to employee performance, attrition, and recruitment. For instance, Tian et al. employed SVMs in conjunction with Latent Semantic Analysis (LSA) and Bidirectional Encoder Representations from Transformers (BERT) to enhance the employee selection process, demonstrating the SVM’s capability to handle complex textual data effectively [
40]. The study highlighted that the SVM outperformed other machine learning techniques in terms of accuracy and robustness when applied to HR resume data, reinforcing its position as a reliable choice for HR analytics tasks.
Moreover, the SVM’s interpretability is another significant advantage, especially in HR contexts where understanding the decision-making process is essential. As noted by Thakral et al., the integration of machine learning techniques, including SVMs, in HR analytics allows organizations to develop comprehensive competency frameworks and understand the psychological variables influencing employee behavior [
34]. This interpretability is vital for HR professionals who need to justify their decisions based on data-driven insights, making SVMs a preferred choice over more complex models that may lack transparency.
In comparison to other machine learning techniques such as Random Forest and XGBoost, the SVM has demonstrated competitive performance, particularly in scenarios with smaller datasets. For example, Avrahami et al. explored various machine learning algorithms in HR analytics and found that the SVM maintained high accuracy levels even with limited data, which is often the case in HR settings where data collection can be challenging [
19]. This adaptability makes the SVM a versatile tool for HR analytics, capable of delivering reliable predictions without requiring extensive data preprocessing.
Furthermore, the application of SVMs in predicting employee attrition has been highlighted in recent studies. Gazi’s research emphasized the importance of machine learning models, including SVMs, in identifying at-risk employees and developing retention strategies [
32]. The ability of SVMs to classify employees based on various attributes allows organizations to implement targeted interventions, thereby reducing turnover rates and associated costs.
The Support Vector Machine (SVM) is a supervised machine learning algorithm primarily employed for classification tasks. The main goal of the SVM is to find the optimal hyperplane that effectively separates data points belonging to different classes in a high-dimensional space.
One of the key features of SVMs is margin maximization, which identifies the hyperplane that maximizes the margin between the classes. This approach enhances generalization on unseen data. Another significant aspect is the kernel trick, allowing SVMs to utilize various kernel functions (such as linear, polynomial, and radial basis function) to transform the data into higher dimensions, facilitating the identification of a non-linear separating hyperplane.
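For reference, the soft-margin optimization problem that this margin-maximization description corresponds to is given below; this is the textbook formulation, not an equation from the study.

```latex
% Soft-margin SVM: maximize the margin while penalizing slack xi_i via C
\min_{\mathbf{w},\, b,\, \boldsymbol{\xi}} \; \frac{1}{2}\lVert \mathbf{w} \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad y_i\left( \mathbf{w}^{\top} \phi(\mathbf{x}_i) + b \right) \ge 1 - \xi_i, \quad \xi_i \ge 0
```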
Several parameters are essential in the SVM model. The type of parameter is set as type = “C-classification”, indicating that the SVM is utilized for classification problems, with “C” denoting classification. The kernel parameter is defined as kernel = “linear”, which signifies that a linear kernel is employed to transform the input space into a higher-dimensional space, aiming to find a linear hyperplane for class separation. Alternatively, other kernels, like polynomial or radial basis function (RBF), could be utilized depending on the dataset characteristics. The formula specified is Performance ~ Division + Department, which defines the target variable (Performance) and the predictor variables (Division and Department), with the model attempting to classify staff performance based on these factors.
Regarding model training and prediction, the SVM model is trained on a filtered dataset containing only instances with defined performance labels (1 for the top 25% and 0 for the bottom 25%). The model learns to identify patterns that differentiate top performers from bottom performers based on Division and Department. Predictions are made using the predict function on the same dataset, yielding the expected performance classification for each staff member.
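A sketch of this setup with the e1071 R package follows; the data frame perf_data is an assumed name for the filtered dataset described above.

```r
# SVM classification sketch; `perf_data` is the assumed filtered dataset.
library(e1071)

svm_model <- svm(
  Performance ~ Division + Department,  # top vs. bottom performers
  data   = perf_data,
  type   = "C-classification",
  kernel = "linear"
)

pred <- predict(svm_model, perf_data)
table(Predicted = pred, Actual = perf_data$Performance)  # confusion matrix
```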
After generating predictions, the model’s performance is measured using a range of assessment metrics. The confusion matrix summarizes the classification results by displaying the numbers of true positives, false positives, true negatives, and false negatives. Accuracy is determined from the confusion matrix as the fraction of correctly classified cases among all instances. Precision is the fraction of true positive results among all positive predictions, indicating the model’s reliability when predicting the positive class. Recall (sensitivity) is the fraction of true positives among all actual positives, which assesses the model’s ability to recognize positive cases. The F1-score is the harmonic mean of precision and recall, providing a balanced statistic that is particularly useful for unbalanced datasets. The ROC curve contrasts the true positive and false positive rates at different threshold settings, while the AUC (Area Under the ROC Curve) measures the model’s ability to distinguish between classes, with values closer to 1 indicating better performance.
Finally, the model identifies and sorts staff members based on predicted performance, providing a list of the top five and bottom five performers. The results are then saved into a new Excel file, including the classifications of the top and bottom staff.
The SVM algorithm, particularly with a linear kernel, is effectively used in this analysis to classify staff performance based on division and department. The hyperparameters, evaluation metrics, and overall workflow demonstrate a systematic approach to binary classification, making it a valuable tool for performance analysis in organizational settings.
2.5. TabNet
TabNet is a deep learning architecture designed specifically for tabular data that uses a sequential attention mechanism to improve feature selection and model interpretability. By concentrating on the most pertinent features at each decision step, it improves learning efficiency and prediction accuracy [
41,
42]. This feature is especially helpful for structured data, which is frequently used in marketing, finance, and healthcare.
Important TabNet model parameters include the width of the decision layers, the attention embedding dimension, and the number of decision steps. Each decision step processes a subset of features, while the attention embedding dimension regulates how many features can be attended to concurrently [
41]. The width of the decision layers can be adjusted to maximize the model’s capacity to pick up intricate patterns; for instance, setting both the decision layer width and the attention embedding width to 8 has been shown to be effective [
43].
Hyperparameter tuning is crucial for optimizing TabNet’s performance, with methods such as Bayesian optimization used to search systematically for ideal values, greatly improving its predictive capabilities [
44,
45]. In comparison to cutting-edge models such as XGBoost, TabNet has demonstrated competitive performance, especially when dealing with mixed continuous and categorical data [
46]. It has also performed well in binary classification tasks, achieving an Area Under the Curve (AUC) of 0.9957 in one study [
47].
In HR analytics, where datasets frequently contain employee attributes, performance indicators, and other structured information, TabNet’s design is especially well-suited for tabular data. In contrast to conventional deep learning models that can struggle with tabular input, TabNet uses a sequential attention method that enables it to learn feature interactions efficiently while preserving interpretability. Thakral et al. emphasized the significance of incorporating advanced machine learning approaches into HR analytics to enhance decision-making, as well as the necessity of models that can effectively manage complex information. In this situation, TabNet is extremely useful because it can automatically select pertinent features and model their interactions.
TabNet has proven more effective than other machine learning methods such as Random Forest and Support Vector Machines (SVMs) in a number of prediction tasks. For example, according to Gazi’s research on staff attrition prediction, TabNet surpassed these standard algorithms, effective as they are, in both accuracy and interpretability [
32]. This is especially significant in HR analytics, as decision-makers need to comprehend the reasoning behind forecasts. The strategic value of HR analytics is increased by the interpretability of TabNet, which enables HR professionals to extract useful insights from model predictions [
32].
Additionally, TabNet’s ability to handle missing data is a noteworthy benefit. For a variety of reasons, such as data entry problems and staff turnover, many HR datasets are incomplete. According to Avrahami et al., reliable predictive modeling in HR analytics depends on efficient handling of missing data [
19]. A common drawback of existing machine learning models is their inability to handle missing values; TabNet’s architecture allows it to take advantage of the available data without extensive preprocessing.
Furthermore, TabNet’s flexibility with regard to various data types is notable. HR analytics may involve text, numerical, and categorical data, and TabNet’s adaptability to these different kinds of data without requiring extensive feature engineering makes it a useful tool for HR professionals. This flexibility is consistent with Adeusi’s findings, which highlight the value of employing machine learning models that can handle a variety of data formats to predict employee turnover [
48].
Because TabNet offers interpretability and can compete with more conventional models like Random Forest and gradient-boosting machines, it has become increasingly popular. The attention mechanism, which uses a sparse attention method to enable the model to concentrate on the most pertinent features for every choice, is one of its primary features. There is less need for intensive feature engineering because this feature selection happens automatically during training. Furthermore, TabNet can process raw tabular data directly thanks to its support for end-to-end learning, which makes it adaptable to a wide range of applications without requiring a lot of preprocessing. The attention scores obtained from the model improve its interpretability by offering information about the significance of features and simplifying the model’s decision-making process.
Several important parameters influence TabNet’s performance. The epochs parameter determines the number of complete passes through the training dataset; more epochs can improve learning, but too many may cause overfitting. The batch size specifies how many training samples are processed before the model’s weights are updated; a larger batch size can speed up training but consumes more memory. The n_steps parameter indicates the number of decision steps the model uses to generate predictions, with each step corresponding to a layer in the architecture; raising this figure can improve the model’s capacity to capture intricate relationships in the data. The step size regulates the learning rate for each decision step, where a smaller step size may improve convergence but requires more training epochs. The n_independent parameter determines the number of independent layers in the model, enabling different feature representations and increased flexibility, while the n_shared parameter indicates how many layers are shared between decision steps, which helps generalize learned patterns across steps and improves overall model performance.
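A hedged sketch of these settings follows, assuming the mlverse tabnet R package (torch backend); the training objects hr_features and hr_target are hypothetical.

```r
# TabNet fitting sketch; package choice and data names are assumptions.
library(tabnet)

model <- tabnet_fit(
  x = hr_features,
  y = hr_target,
  epochs          = 50,   # passes through the training data
  batch_size      = 256,  # samples per weight update
  num_steps       = 3,    # sequential decision steps
  num_independent = 2,    # step-specific feature-transformer layers
  num_shared      = 2,    # layers shared across decision steps
  learn_rate      = 0.02  # optimizer step size
)

preds <- predict(model, hr_features)  # predictions on new data
```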
With its strong performance and interpretability, TabNet is a powerful machine learning model for tabular data. Its parameters can be tuned to suit particular datasets, supporting accurate predictions across a range of applications. Because its attention mechanism automatically focuses on the most informative features, TabNet is an effective option for tasks involving complex data relationships.
2.6. Neural Architecture Search (NAS)
Neural Architecture Search (NAS) is an innovative machine learning method that automates the creation of neural network designs, reducing the need for expert manual intervention. Its main objective is to find the design that maximizes performance on a task while minimizing computing expense. The method explores a predetermined search space of potential architectures and assesses their performance using criteria such as efficiency and accuracy [
49,
50].
The algorithmic framework of NAS usually includes three essential elements: a search space, a search strategy, and a performance evaluation technique. The search space defines the potential architectures that can be investigated, including variations in layer types, the number of layers, and connectivity patterns [
51,
52]. Evolutionary algorithms and reinforcement learning (RL) are popular search techniques. Differentiable architecture search (DARTS), for example, formulates the architecture search as a continuous optimization problem, enabling gradient-based optimization of both the network weights and the architectural parameters [
53].
Usually, a validation procedure is used to evaluate the performance of candidate designs: each architecture is trained on a subset of the data, and its performance is evaluated using a validation measure. Because this assessment can be computationally costly, methods such as weight sharing, in which several architectures exchange weights to cut down on redundancy and expedite the search process, have been developed [
54]. Furthermore, some methods take hardware into account, optimizing architectures for efficiency on certain hardware platforms in addition to accuracy [
55]. The search space selection, the search approach, and the evaluation criteria used to gauge architectural performance are important NAS parameters. There are several ways to construct the search space, including employing more flexible representations like graphs or preset layer types [
56,
57]. The efficiency and efficacy of the search process can be greatly impacted by the search strategy chosen; for instance, evolutionary algorithms use mutation and selection processes to explore the search space, while reinforcement learning-based approaches can adaptively learn which architectures perform well. The evaluation criteria, which are crucial for assessing the viability of the found designs in practical applications, frequently comprise accuracy, computational cost, and memory use [
40,
58].
In automated machine learning, Neural Architecture Search (NAS) has become a key technique, especially for creating optimal neural network architectures. It automates the architecture selection process, which has historically depended largely on human judgment and manual adjustment. The theoretical underpinnings of NAS can be reinforced by contrasting it with the machine learning methods used in Human Resource (HR) analytics, where effective and efficient model creation is crucial.
The capacity of NAS to optimize deep neural network topologies through automated search procedures is one of its main benefits. According to [
59], NAS can lessen the computational load related to traditional architecture design, which frequently calls for a large amount of manual labor and resources. This is especially important in HR analytics since decision-making processes depend on models that are both accurate and understandable. According to [
60], NAS can find architectures that perform better than hand-designed models. This capability fits very nicely with the requirements of HR analytics, where predictive accuracy is essential.
Furthermore, Ref. [
61] highlights the intricacy of the optimization problem inherent in NAS. This intricacy mirrors the difficulties encountered in HR analytics, where models must handle a variety of datasets with differing feature relevance. By utilizing sophisticated methods like automatic differentiation, which Demertzis investigated [
62], NAS efficiency is further increased, and optimal architectures can be quickly reached. In HR analytics, where prompt insights are crucial for strategic decision-making, this is especially advantageous.
The robustness of models created with NAS is a crucial factor in addition to efficiency. Recent developments in NAS frameworks, like the one put forth by [
63], highlight the significance of hardware efficiency and adversarial robustness. In HR analytics, these factors are crucial, since models need to function well in typical scenarios while preserving their integrity in the face of possible data abnormalities or adversarial attacks. By ensuring that designs are tailored to specific computing environments, the use of hardware-aware NAS approaches, as covered by [
64], can further improve model deployment in practical HR applications.
Additionally, Ref. [
65]’s investigation into federated learning methodologies in NAS creates new opportunities for HR analytics, especially when sensitive employee data is involved. Federated NAS makes it possible to create customized models while protecting data privacy, which is an increasingly critical issue in HR analytics. One important contribution that NAS can make to the HR field is the ability to customize models to meet the needs of specific clients while adhering to data protection laws.
A neural network is a computational model inspired by the biological neural networks of the human brain, made up of interconnected nodes that resemble neurons. These networks can identify patterns in input data and generate classifications or predictions. A neural network’s architecture consists of layers: an input layer holding the features, one or more hidden layers whose neurons can capture intricate patterns, and an output layer that produces the desired result. Each neuron transforms its input to output via a non-linear activation function, which allows the network to learn complex mappings. Learning proceeds by backpropagation, which adjusts connection weights according to the error between the network’s output and the expected outcome.
With Rating as the target variable and Division and Department as the predictor variables, the neural network model’s formula is expressed as Rating ~ Division + Department. The network’s hidden layers are specified with hidden = c(5, 3), denoting two hidden layers with five neurons in the first layer and three in the second. The linear.output parameter sets the output type: with linear.output = TRUE, the network generates continuous output values appropriate for regression applications such as rating prediction.
To ensure that every input feature contributes on a comparable scale during training, data normalization is applied to numeric features using a function that scales the data to the range [0, 1]. A random sample is then used to divide the data into training (80%) and testing (20%) sets so that the model can be trained on one set and assessed on the other. The neural network is trained with the neuralnet function, which learns the relationship between the inputs (Division and Department) and the output (Rating). The trained model is used to make predictions on the test data, and a denormalize function returns the output values to their original scale.
Following the acquisition of the anticipated ratings, the ratings are categorized into binary classifications according to a threshold, in this case the original ratings’ median. Those who score higher than the cutoff are categorized as high performers (1), and those who score below are categorized as low performers (0). A confusion matrix that contrasts predicted and actual classes and provides counts of true positives, false positives, true negatives, and false negatives is used to assess the model’s performance. Recall (the percentage of true positive predictions among actual positive instances), accuracy (the percentage of correct predictions), precision (the ratio of true positive predictions to all positive predictions), and the F1-score (the harmonic mean of precision and recall) are significant metrics taken from this evaluation. The AUC measures the model’s overall performance, with values nearer 1 denoting superior class discrimination, while the ROC curve illustrates the trade-off between true positive and false positive rates across various thresholds.
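The following sketch reconstructs this workflow with the neuralnet R package; the input vectors and helper functions are illustrative assumptions based on the description above, not the authors’ exact code.

```r
# neuralnet workflow sketch; data objects and helpers are assumed names.
library(neuralnet)

normalize   <- function(x) (x - min(x)) / (max(x) - min(x))
denormalize <- function(x, orig) x * (max(orig) - min(orig)) + min(orig)

df <- data.frame(
  Rating     = normalize(ratings),                # scaled numeric target
  Division   = as.numeric(as.factor(divisions)),  # encoded predictors
  Department = as.numeric(as.factor(departments))
)

set.seed(7)
idx   <- sample(seq_len(nrow(df)), size = 0.8 * nrow(df))  # 80/20 split
train <- df[idx, ]; test <- df[-idx, ]

nn <- neuralnet(
  Rating ~ Division + Department,
  data          = train,
  hidden        = c(5, 3),  # two hidden layers: 5 then 3 neurons
  linear.output = TRUE      # continuous output for regression
)

raw_pred  <- compute(nn, test[, c("Division", "Department")])$net.result
predicted <- denormalize(raw_pred, ratings)

pred_class <- as.integer(predicted > median(ratings))  # 1 = high performer
```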
With its defined architecture of hidden layers and output type, the neural network model is a capable tool for predicting staff ratings from categorical input variables. The code applies fundamental techniques, including normalization, binary classification, and model evaluation, to ensure efficient training and sound performance assessment. This end-to-end approach shows how neural networks can support predictive analytics in organizational contexts.
The fundamental ideas, key characteristics, parameters, training procedures, and assessment metrics for the machine learning models discussed, namely Random Forest, K-Nearest Neighbors (KNN), XGBoost, Support Vector Machines (SVMs), TabNet, and Neural Architecture Search (NAS), are summarized in Table 1. Although quantitative measures like accuracy, precision, recall, F1-score, and ROC-AUC offer a quick overview of model performance, making an informed selection requires understanding the underlying causes of each model’s efficacy, its constraints, and its potential areas for development.
Because of its effective gradient boosting framework, which lowers bias and variance and improves prediction accuracy, XGBoost routinely outperforms other models, especially on structured datasets. Its exceptional handling of missing data allows it to keep functioning even when the data are incomplete, and its support for regularization helps avoid overfitting, making it more resilient than many conventional models. Together, these advantages enable XGBoost to produce better outcomes, particularly in intricate datasets where precision is essential for decision-making.
By integrating several decision trees, Random Forest enhances forecast accuracy and performs exceptionally well in ensemble learning, excelling at managing large datasets with diverse attributes. It can, however, pose interpretability issues that make it difficult for HR specialists to extract useful information; feature selection and dimensionality reduction strategies could improve performance by streamlining data processing while preserving crucial information. Because of its non-parametric nature and simplicity, KNN performs well in situations with irregular decision boundaries, but its performance deteriorates on high-dimensional data due to the curse of dimensionality. Furthermore, KNN requires effective normalization methods because of its sensitivity to feature scaling; improvement techniques could involve modified distance metrics or weighted KNN to increase classification accuracy.
The SVM’s capacity to identify ideal hyperplanes makes it useful for high-dimensional data. Although the SVM shows resilience in smaller datasets, overlapping classes or noisy data may cause its performance to deteriorate. Furthermore, the SVM does not include built-in feature selection, which can result in inefficiencies when dealing with complicated datasets. In order to maximize speed, future research could concentrate on combining feature selection strategies and testing out various kernel functions. TabNet is especially useful for tabular data in HR analytics because of its attention-based methodology, which enables automatic feature selection and interaction modeling. Although it performs admirably when handling missing data, careful hyperparameter adjustment is necessary to optimize effectiveness. Its acceptance among HR professionals may be facilitated by more research into its interpretability, which would increase confidence in model projections.
Lastly, NAS optimizes neural network topologies for certain objectives, such as HR analytics, by automating the neural network design process. By drastically lowering the need for expert manual intervention, this technique makes it possible to find architectures that perform better than manually created models. Interpretability, however, might be made more difficult by the intricacy of model architectures. For more adaptability in HR environments, future research should concentrate on merging NAS approaches with current models and streamlining architectures without sacrificing performance.
2.7. Impact of Staff Performance Analytics on Organizational Productivity
A crucial topic of research in modern corporate management is the effect of staff performance analytics on organizational efficiency, especially when combined with the Transactional Net Promoter Score (tNPS). The tNPS is a useful indicator for evaluating customer satisfaction just after certain encounters, giving businesses useful information about how customers are feeling. By coordinating staff efforts with customer expectations and feedback, the tNPS, when paired with staff performance data, can improve organizational productivity.
Staff performance analytics is the methodical gathering and examination of employee performance data to identify strengths, shortcomings, and opportunities for development. Organizations can establish a direct connection between customer satisfaction outcomes and employee performance by including the tNPS in this framework. According to [
5], a team’s consistent high tNPS scores, for example, show that their service delivery meets customer expectations, which in turn encourages positive behaviors and practices inside the team. On the other hand, low tNPS scores can indicate areas in which employees might need more assistance or training, allowing for focused interventions that can enhance customer satisfaction and employee performance [
5].
Additionally, using the tNPS in tandem with staff performance data promotes a culture of continuous improvement and accountability. When workers are aware of how their actions directly affect customer perceptions and organizational outcomes, they are more inclined to take actions that improve customer experiences. This alignment can increase motivation and engagement, because employees can see the actual results of their work in terms of customer loyalty and organizational performance [
5]. Additionally, companies can use tNPS data to identify and reward top performers, strengthening a customer-focused culture that places a premium on providing outstanding customer service [
5].
Organizations can also make data-driven decisions about resource allocation and operational enhancements by integrating the tNPS into performance analytics. By examining trends in tNPS scores in conjunction with employee performance metrics, organizations can determine which departments or teams are performing well and which may require more support or resources. This strategic approach not only increases productivity but also guarantees that customer needs are met.
Past Methods and Their Limitations
Organizations should be aware of the following restrictions when using Excel to calculate the Transactional Net Promoter Score (tNPS). The tNPS is calculated from customer feedback obtained just after a transaction, usually by asking customers to rate their likelihood of recommending the service or product to others on a scale of 0 to 10. The percentage of detractors (those who score 0–6) is subtracted from the percentage of promoters (those who score 9–10) to determine the score. Despite being a strong tool for data analysis, Excel has built-in drawbacks that may compromise the precision and efficiency of tNPS computations.
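As a worked illustration, the calculation just described reduces to a few lines of R; the survey scores below are hypothetical.

```r
# tNPS from a hypothetical vector of 0-10 survey scores
scores <- c(10, 9, 8, 7, 6, 10, 9, 3, 8, 10)

promoters  <- mean(scores >= 9) * 100  # % scoring 9-10
detractors <- mean(scores <= 6) * 100  # % scoring 0-6

tnps <- promoters - detractors         # %promoters minus %detractors
tnps                                   # 50 - 20 = 30 for this sample
```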
A major drawback of using Excel for tNPS computations is the possibility of human error in data entry and formula application. Manually entering survey replies can introduce inaccuracies, particularly in big datasets, and misclassifying respondents as promoters, passives, or detractors can distort the results and produce misleading tNPS figures. Furthermore, even though Excel’s formula capabilities are strong, they may not suffice for more intricate analyses that go beyond simple tNPS computations to deepen the understanding of customer sentiment.
The management of big datasets is another drawback. Excel may experience performance problems as businesses expand and gather more customer feedback, especially when handling thousands of responses. This can lead to sluggish processing and a higher chance of crashes, which may impede prompt analysis and judgment. Additionally, Excel lacks sophisticated analytical functions, such as sentiment analysis, trend identification, and real-time reporting, that are available in specialized customer experience management applications. For businesses looking to obtain more insight from tNPS data and effectively address customer feedback, these capabilities are essential. Furthermore, Excel does not natively support the integration of tNPS data with other operational metrics or key performance indicators (KPIs). To obtain a complete picture of organizational efficiency, organizations frequently need to link the tNPS with other performance indicators, such as sales data or employee performance analytics; without smooth integration, insights become fragmented and strategic decision-making may be hampered.
There are still a few unanswered questions regarding machine learning-based performance prediction, especially when it comes to using tNPS (Transactional Net Promoter Score) data. A large portion of the existing literature on performance prediction relies on traditional measures such as sales numbers, job completion rates, or direct supervisor evaluations (cite sources). Despite their value, these methods frequently fall short in capturing the wider attitude and engagement levels of employees, which can be important predictors of future performance.
Furthermore, not much research has made use of tNPS data, which is becoming more widely acknowledged as a potent instrument for gauging team satisfaction and engagement. Despite its capacity to capture team attitude and its association with a positive work environment, the tNPS is frequently overlooked in performance prediction models. This discrepancy is important because it ignores the potential of employee feedback to forecast team dynamics and engagement as well as individual performance, both of which are key factors in organizational productivity.
In addition, much previous research has used conventional machine learning models such as Random Forest and Support Vector Machines (SVMs) but has not thoroughly investigated more sophisticated models, such as XGBoost or deep learning architectures, that might provide better prediction accuracy and resilience (cite sources). Furthermore, prior research frequently considers performance in static snapshots rather than using time-series data, which could offer deeper insights into performance trends and patterns over time.
By employing machine learning models, such as XGBoost, to forecast employee performance using tNPS data over a three-month period, this article seeks to close these gaps. This study offers fresh perspectives on the predictive ability of employee feedback for identifying both high-performing and at-risk personnel by concentrating on this underutilized indicator and contrasting its efficacy across several machine learning algorithms. A more dynamic view of employee engagement and performance over time is provided by our method, which also incorporates time-series data to forecast performance patterns.
Machine learning models have been increasingly applied in HR analytics due to their ability to handle complex, non-linear relationships in workforce data. For example, Random Forest has been widely used to predict employee turnover and identify key factors influencing attrition due to its high interpretability and resistance to overfitting. XGBoost, on the other hand, has demonstrated superior performance in HR scenarios involving structured datasets, offering enhanced accuracy and robustness, particularly in handling missing values and imbalanced classes. Support Vector Machines (SVMs) are effective for classification tasks with clear margins but may struggle with scalability in larger datasets. K-Nearest Neighbors (KNN), while simple and intuitive, are sensitive to high-dimensional data and less effective when class distributions are skewed. TabNet stands out for its deep learning capabilities on tabular data, providing feature interpretability through attentive mechanisms, which makes it particularly suitable for HR domains where transparency is crucial. Neural Architecture Search (NAS) automates model optimization but often requires high computational resources. In summary, ensemble methods like XGBoost and Random Forest offer strong performance and stability, while deep learning methods like TabNet and NAS bring flexibility and scalability, especially in large-scale workforce analytics. The choice of model should be guided by dataset characteristics, need for interpretability, and available computational resources.