A Hybrid Bayesian Machine Learning Framework for Simultaneous Job Title Classification and Salary Estimation

Zita, Wail; Abou El Faouz, Sami; Alayedi, Mohanad; Elsayed, Ebrahim E.

doi:10.3390/sym17081261

by Wail Zita ¹, Sami Abou El Faouz ¹, Mohanad Alayedi ¹,* and Ebrahim E. Elsayed ²

¹ Department of Software Engineering, Faculty of Engineering, Haliç University, 34060 Istanbul, Türkiye

² Department of Electronics and Communications, Faculty of Engineering, Mansoura University, Mansoura 35516, Egypt

* Author to whom correspondence should be addressed.

Symmetry 2025, 17(8), 1261; https://doi.org/10.3390/sym17081261

Submission received: 1 July 2025 / Revised: 22 July 2025 / Accepted: 1 August 2025 / Published: 7 August 2025

(This article belongs to the Special Issue Mathematics: Feature Papers 2025)

Abstract

In today’s fast-paced and evolving job market, salary continues to play a critical role in career decision-making. The ability to accurately classify job titles and predict corresponding salary ranges is increasingly vital for organizations seeking to attract and retain top talent. This paper proposes a novel approach, the Hybrid Bayesian Model (HBM), which combines Bayesian classification with advanced regression techniques to jointly address job title identification and salary prediction. HBM is designed to capture the inherent complexity and variability of real-world job market data. The model was evaluated against established machine learning (ML) algorithms, including Random Forests (RF), Support Vector Machines (SVM), Decision Trees (DT), and multinomial naïve Bayes classifiers. Experimental results show that HBM outperforms these benchmarks, achieving 99.80% accuracy, 99.85% precision, 100% recall, and an F1 score of 98.8%. These findings highlight the potential of hybrid ML frameworks to improve labor market analytics and support data-driven decision-making in global recruitment strategies. Consequently, the suggested HBM algorithm provides high accuracy and handles the dual tasks of job title classification and salary estimation in a symmetric way. It does this by learning from class structures and mirrored decision limits in feature space.

Keywords:

job classification; regression; salary prediction; hybrid Bayesian model (HBM); machine learning

1. Introduction

The job market today is bigger, faster-moving, and more complex than ever. New industries appear overnight, and established ones reinvent themselves just as quickly. For job seekers, human resources (HR) teams, and analysts alike, keeping track of all these shifts can feel overwhelming. That is why job classification systems—around since the early Internet days—have kept evolving alongside advances in tech and data science [1,2].

Job ads now pop up everywhere: in newspapers and posters, on social media feeds, inside niche career portals, and sometimes through nothing more than a friend’s recommendation. Modern tools help sift and match positions, but they still stumble when titles are vague (“ninja,” anyone?) or ultra-specific [3]. Solid classification is not just a convenience; it is the backbone of an efficient labor market. By slotting roles into clear categories, employers can hire faster, planners can see workforce gaps, and the whole system becomes more transparent—benefits that ultimately boost economic growth and fairness [4].

Of course, a neatly labeled job means little without a clear idea of pay. For many candidates, salary trumps everything else [2,4]. Accurate predictions help people gauge whether a role meets their needs and guide companies toward competitive, equitable offers—critical for attracting talent and keeping teams happy [3].

Symmetry is important in computational intelligence to keep things consistent and fair when making decisions. In our system, we maintain symmetry by balancing how we improve tasks for both sorting and prediction. We treat job titles and salary ranges as similar parts of human resources data analysis.

1.1. Importance of the Use of Machine Learning

Machine learning (ML) has become a cornerstone in addressing these challenges. Unlike traditional rule-based approaches, which struggle with unstructured data and naming inconsistencies, ML models can process vast datasets, detect complex patterns, and produce actionable insights [5,6,7].

Building on this foundation, recent years have seen a surge in the use of ML techniques for complex classification and prediction tasks across industries. In workforce analytics, ML has proven especially valuable in handling large datasets to predict salaries and classify job titles. For instance, techniques like Decision Trees (DT), Random Forests (RF), Support Vector Machines (SVM), Logistic Regression (LR), and naive Bayesian classifiers have been widely adopted for their ability to uncover hidden patterns and relationships in data, even in the presence of inconsistencies or noise [3,5,8,9,10,11,12]. While deep learning (DL) models have made significant strides in areas like image and signal processing, this paper concentrates exclusively on the application of ML models to the twin challenges of job classification and salary prediction. By comparing how well different ML algorithms perform on real-world workforce data, our study seeks to determine which method delivers the most accurate and interpretable predictions to support better HR decision-making.

Nevertheless, even with ML, two big puzzles remain: pinpointing the exact title and nailing the right salary—especially for global firms juggling different cultures, pay scales, and naming quirks [13]. Classic models like linear or multiple regression often fall short when data are messy or unbalanced, letting outliers skew the numbers [1,11]. To bridge that gap, researchers are now leaning into more sophisticated ML and deep learning (DL) techniques that capture the rich, tangled links between roles, skills, and compensation.

1.2. Relationship Between Salary and Job

As shown in Figure 1, classifying a job by its requirements, responsibilities, and skill levels kick-starts a cycle that eventually shapes salary expectations. This salary, in turn, is influenced by various factors—from labor market shifts and company policies to the satisfaction of the people actually doing the work. Once a salary level is set, it can have a direct effect on how employees feel about their roles, which may prompt organizations to revisit and update job classifications down the line. Because each factor—job classification, salary prediction, market dynamics, and employee morale—feeds into the others, they create a self-reinforcing loop. A change in any one element, such as a sudden rise in demand for certain skills or growing discontent among workers, sends ripples through the entire system, compelling all aspects of the cycle to realign accordingly.

The rest of this paper is structured as follows: Section 2 reviews recent works on salary prediction and job classification. Section 3 provides a recap of the ML models used in this study. Section 4 describes the dataset employed in this work, followed by the explanation of our proposal in Section 5. Section 6 and Section 7 are dedicated to the experimental evaluation of our proposed system and the analysis of the simulation results, respectively. For further demonstration, Section 8 compares our work with existing approaches. Finally, this work ends with a conclusion that summarizes the essential motivations of our work and provides our vision for future directions.

2. Literature Review

Recent works have tackled salary prediction and job categorization with various deep learning (DL) and machine learning (ML) methods, tending to address the two tasks individually. Ji et al. [14] proposed LGDESetNet, a neural-prototyping model that makes salary estimates more interpretable by identifying both the global and local skill patterns affecting compensation. Mittal et al. [10] compared Lagrangian Support Vector Machines (LSVM), Random Forests (RF), and multinomial naïve Bayes classifiers, finding that LSVM delivered the highest accuracy (96.25%) on a large dataset. Rahhal et al. [15] introduced a two-stage job title identification system, improving classification accuracy by 14% compared to earlier models, especially in more challenging labor markets.

Tree-based and instance-based approaches remain common in salary prediction. Dutta et al. [9] reported 87.3% accuracy with a Random Forest model, while Zhang et al. [16] raised performance to 93.3% using k-nearest neighbors. DL methods have pushed these numbers further. Sun et al. [8] employed a two-stage neural architecture to achieve the lowest reported root-mean-square error (RMSE) and mean absolute error (MAE) at the time of publication. Wang et al. [17] combined a bidirectional gated recurrent unit with a convolutional layer, reducing the MAE below that of a standard text-CNN baseline. Polynomial regression has also proven competitive: Ayua et al. [18] achieved an R² of 0.972 using a Nigerian salary dataset.

Han et al. [19] recently showed that keeping symmetry in machine learning systems is important, especially when dealing with structural and time-based data that need equal handling, like in financial risk models. Their results back up the idea of using symmetric learning methods and consistent data representations. This fits well with our combined method, which sees classifying jobs and estimating salaries as related, paired goals.

Zhalilova et al. [20] explored salary prediction for data science professionals between 2020 and 2024, applying regression methods (decision tree, random forest, and gradient boosting). They found decision tree regression often achieves the lowest error, with implications for labor market dynamics and salary forecasting.

Aufiero et al. [21] presented a novel approach mapping jobs to skill networks and deriving a “job complexity” metric, showing strong correlations between complexity and wages—highlighting how unsupervised methods can uncover intrinsic drivers of compensation.

Alsheyab et al. [22] proposed a hybrid methodology using synthetic job postings to prototype salary prediction and job grouping. Their combined regression, classification, and clustering approach closely parallels the proposed HBM design, emphasizing how hybrid systems can be developed and validated, even with synthetic data.

While previous studies have demonstrated strong performance for either salary prediction or job classification independently, few have attempted to address both tasks simultaneously. Our work addresses this gap by creating a hybrid model that jointly classifies job titles and predicts salaries, trained on a newly curated dataset. This offers a more holistic approach to labor market analytics.

The following is a condensed summary of the contributions of this work:

A Hybrid Bayesian Model (HBM) that combines Bayesian classification and predictive techniques is developed to handle two tasks—job title classification and salary prediction—simultaneously within a single framework.
The proposed model achieves high accuracy of up to 99.8% by using an optimized algorithm.
Two key factors in the job market are integrated by addressing both job classification and salary estimation together, recognizing that salary is a critical motivating factor for job seekers and providing a tool that can benefit candidates, recruiters, and analysts alike.
A novel algorithm that improves performance, stability, and adaptability is presented, demonstrating practical efficacy in real-world labor market scenarios.

3. Machine Learning Algorithms

To guarantee a comprehensive assessment, diverse machine learning (ML) algorithms were examined, each offering distinct advantages in handling workforce data and compensation-related variables. This research involved the selection and evaluation of the following models:

3.1. Support Vector Machine (SVM)

SVMs harness kernel functions to transform input features into higher dimensional spaces, assisting in the management of non-linear separations [23].

3.2. Decision Tree (DT)

A decision tree is a model that uses a tree-like structure of decisions to predict continuous values, splitting data based on feature values to minimize variance [24].

3.3. Random Forest (RF)

Random forest is an ensemble learning method that builds several decision trees and aggregates their predictions to enhance accuracy and mitigate overfitting [25].

4. Dataset Characteristics and Description

This study’s dataset was purpose-built by combining aspects of several public job market datasets and specifically adapted to the task of job classification and salary prediction. It was also adapted to real-life job titles, salary distributions, and classification task requirements. To achieve this, job titles were standardized, salaries were discretized into meaningful salary brackets, and class-balancing methods were employed to account for unbalanced role distributions. This approach allowed this dataset to be tuned to the testing objectives and increased the accuracy and overall generalizability of the model to former human talent identifiers.

4.1. Dataset Characteristics

Figure 2 displays the 10 most common job titles in our dataset and shows a large class imbalance. This imbalance informs both the job classification and salary prediction parts of the proposed hybrid model.

4.1.1. Key Observations

The majority of the distribution is skewed towards “Data Engineer” (27.64%) and “Data Scientist” (22.46%), which, together, account for nearly half of all job postings. Their prominence reflects their central roles in modern data infrastructure and analytics processes. “Data Analyst” (16.29%) is also well represented, aligning with the ongoing demand for professionals who interpret and communicate data insights. More specialized roles—such as “ML Engineer” (7.80%), “AI Engineer” (1.14%), and “Data Science Manager” (1.52%)—appear less frequently. These positions typically require niche expertise or are senior-level, contributing to their lower representation in the job market.

4.1.2. Implications for Our Hybrid Model

The imbalanced class distribution calls for either selective sampling or weighting in the job classification module to guarantee less represented roles are not underpredicted. Roles with few samples—such as those of ML engineers and managers—may need the incorporation of domain-specific characteristics or hierarchical classification techniques to adequately generalize for pay prediction. Knowing this distribution also helps improve the interpretability of the model, particularly in cases where projected compensation bands in high-demand roles coincide with classification findings, in contrast with niche roles.

Overall, this figure provides a foundational understanding of job title prevalence, guiding both feature engineering and model calibration in our hybrid predictive system.

4.2. Dataset Description

The data for this study come from a collection built and managed by the authors. Though original and not taken directly from a public source, its structure, pay ranges, job types, and distribution are based on publicly available data and salary reports. These include Kaggle’s “Data Science Salaries 2023” dataset (https://www.kaggle.com/datasets/arnabchaki/data-science-salaries-2023, accessed on 15 May 2025), salary information from Levels.fyi (https://www.levels.fyi, accessed on 15 May 2025), and general job information from Glassdoor (https://www.glassdoor.com, accessed on 15 May 2025). The finished dataset was made to mirror real-world trends in workforce analytics while giving us complete control over factors such as class balance and representativeness.

Our dataset consists of 3754 instances, each with five key features: Work year, experience level, company size, discretized salary (brackets), and employment type. It covers 10 distinct job titles and salary categories grouped into meaningful salary brackets based on industry standards. To facilitate reproducibility and future research, we commit to making our curated dataset publicly accessible upon reasonable request to the corresponding author.

A sample of the dataset is shown in Table 1. The dataset includes the following features:

N: A unique identifier for each record.
Work Year: The year of recording of the salary, allowing for an analysis of salary trends over time.
Experience Level: Represents the employee’s level of experience, categorized as follows:
– EN (Entry level/Junior): Early-career professionals with limited experience;
– MI (Mid level): Professionals with a few years of industry experience;
– SE (Senior level): Experienced employees with advanced skills and responsibilities;
– EX (Executive level): High-ranking professionals in leadership or decision-making roles.
Company Size: Defines the scale of the organization where the employee works, classified as follows:
– S (Small): Small businesses with a limited workforce;
– M (Medium): Mid-sized companies with moderate employee strength;
– L (Large): Large enterprises with extensive operations and workforce.
Discretized Salary (USD): Instead of using raw salary values, salaries were converted into numerical categories to create salary brackets. Lower values correspond to lower salaries, while higher values indicate higher compensation levels.

This dataset serves as the foundation for analyzing salary distribution across various job levels and company sizes over time. By discretizing salary values, the model is able to efficiently classify salary groups, improving accuracy while reducing sensitivity to extreme salary variations. This structured approach enhances the model’s ability to identify trends and patterns, ultimately supporting a more effective salary prediction system.
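The discretization step described above can be sketched as follows. This is only an illustration: the bracket edges in `BRACKET_EDGES` and the helper name `salary_bracket` are hypothetical, since the text does not list the actual cut points used for the dataset.

```python
from bisect import bisect_right

# Hypothetical upper edges (USD) separating the brackets; the actual
# cut points used for the dataset are not given in the text.
BRACKET_EDGES = [50_000, 100_000, 150_000, 200_000]


def salary_bracket(salary_usd):
    """Return an ordered bracket index, where 0 is the lowest bracket."""
    # bisect_right counts how many edges are <= salary_usd, so a salary
    # exactly on an edge falls into the higher bracket.
    return bisect_right(BRACKET_EDGES, salary_usd)


salaries = [42_000, 100_000, 135_500, 260_000]
brackets = [salary_bracket(s) for s in salaries]
print(brackets)
```

Because the output is an ordered integer, a very large raw salary lands in the top bracket instead of acting as an extreme numeric value, which is the reduced sensitivity to outliers mentioned above.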

Figure 3 presents a t-SNE visualization of the job titles based on their vector features, clustered by employment type: full-time (FT), contract (CT), part-time (PT), and freelance (FL). This dimensionality reduction demonstrates the separability and overlap of job types and provides a visual way to verify the consistency of the clusters and class distributions for the feature space used in classification tasks.

For further details, Figure 4 is added to illustrate the correlation matrix, which refers to the relationships between the characteristics of the workforce, such as the work year, the salary, the salary in USD, and the remote ratio. In particular, salary and salary in USD are highly correlated (0.63), reflecting consistency in compensation reporting, while work year shows moderate correlations with both salary (0.22) and salary in USD (0.33), suggesting a potential tenure effect on pay. However, the correlation between the remote ratio and the other features is minimal, indicating limited linear association.

5. Proposed Methodology

The proposed HBM framework is designed to be broadly applicable across various organizational contexts and geographical regions. Its Bayesian foundation enables it to adapt to regional and organizational specifics by incorporating context-dependent prior knowledge and customized training data. For instance, in Europe or North America, extensive historical salary data could refine the model’s predictive accuracy, whereas in regions like MENA, local labor market trends and cultural factors can be effectively integrated into the Bayesian priors. Furthermore, the flexibility of the HBM allows organizations of different types (public, private, and family-owned) to tailor the feature inputs according to their particular organizational needs and salary structures.

The proposed approach proceeds through structured steps tailored for our Hybrid Bayesian Model (HBM), as illustrated in Figure 5. Our methodology begins with data acquisition, followed by data preparation aligned with the Bayesian regression and classification tasks.

Bayesian ridge regression handles the salary estimation task because it is robust to noisy data and regularizes effectively. In parallel, naive Bayes classification handles the categorical features to categorize job titles. Additionally, K-means clustering is applied to segment salaries, complementing the predictive capabilities of the Bayesian components.
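To illustrate the salary segmentation idea, the following is a minimal sketch of K-means (Lloyd's algorithm) on one-dimensional salary values. The function name `kmeans_1d`, the deterministic initialization, and the toy salaries (in thousands of USD) are assumptions made for this example and are not taken from the authors' implementation.

```python
def kmeans_1d(values, k, n_iter=100):
    """Lloyd's algorithm on 1-D data; returns (centroids, assignments)."""
    data = sorted(values)
    # Deterministic start: spread initial centroids across the sorted data.
    centroids = [data[(2 * j + 1) * len(data) // (2 * k)] for j in range(k)]
    assignments = []
    for _ in range(n_iter):
        # Assignment step: each value goes to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: abs(v - centroids[c])) for v in values
        ]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, a in zip(values, assignments) if a == c]
            # Keep the old centroid if a cluster ends up empty.
            new_centroids.append(
                sum(members) / len(members) if members else centroids[c]
            )
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assignments


# Toy salaries in thousands of USD (illustrative values only).
salaries_k = [40, 45, 50, 120, 125, 130, 240, 250]
centroids, assignments = kmeans_1d(salaries_k, k=3)
print(centroids, assignments)
```

In practice, a library implementation with multiple random restarts would usually be preferred; the fixed initialization here only keeps the example reproducible.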

5.1. Data Loading

Data loading imports the curated dataset, which consists of 3754 instances defined by five essential features: work year, experience level, company size, discretized salary, and employment type. Python libraries such as Pandas are used to load the data in a structured form that is compatible with the subsequent Bayesian modeling steps.

5.2. Data Cleaning

Data cleaning removes irrelevant or incomplete records, focusing on the features required by HBM, such as experience level, company size, and salary brackets. Missing or inconsistent values are handled through imputation methods compatible with the Bayesian models, ensuring high-quality data for subsequent modeling.

5.3. Normalization

Normalization rescales the numeric features (work year and discretized salary brackets) into a consistent [0, 1] range. This stabilizes the Bayesian ridge regression computations and improves model convergence and prediction stability. The following normalization formula is used:

x_{\text{norm}} = \frac{x - x_{\min}}{x_{\max} - x_{\min}}

(1)
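As a small worked example of Equation (1), the sketch below rescales a list of work years. The helper name `min_max_normalize` and the handling of a constant column are assumptions made for illustration.

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1] following Equation (1)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map it to 0.0 to avoid 0/0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


work_years = [2020, 2021, 2022, 2023]
normalized_years = min_max_normalize(work_years)
print(normalized_years)
```

When a train-test split is used, the minimum and maximum are normally computed on the training portion only and then reused for the test portion, so that no information from the test data leaks into preprocessing.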

5.4. Data Splitting

The dataset is partitioned into training and testing subsets (80:20). This partitioning supports robust learning of the Bayesian priors and a thorough evaluation of joint salary and job title prediction performance on held-out data, which helps assess the HBM’s generalizability.
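A minimal sketch of an 80:20 split that preserves class proportions is given below. The function `stratified_split`, the seed, and the toy label counts are hypothetical; the sketch assumes stratification by class label, in line with the stratified sampling mentioned in the results discussion.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx) preserving per-class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_fraction)
        # Keep at least one test sample for every class with 2+ members.
        if len(indices) > 1:
            n_test = max(1, n_test)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Toy labels (hypothetical counts) with a clear imbalance.
labels = ["Data Engineer"] * 50 + ["Data Scientist"] * 40 + ["AI Engineer"] * 10
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Splitting per class matters for imbalanced data such as this: a purely random split could leave a rare job title with no test samples at all.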

5.5. Training Set

The training dataset is used to fit both the Bayesian ridge regression model and the naive Bayes classifier. During training, the Bayesian priors are iteratively updated and the regression parameters are optimized, in line with the model’s dual predictive objectives (salary estimation and job title classification).

5.6. Test Set

The test set serves as independent data for evaluating the hybrid Bayesian model’s predictive effectiveness on unseen examples. This assessment validates the simultaneous performance of both regression (salary estimation) and classification (job title prediction).

5.7. Model Training and Testing

In the training phase of HBM, Bayesian ridge regression produces salary estimates informed by prior distributions, while the naive Bayes classifier calculates posterior probabilities for job titles from the feature likelihoods. After training, the hybrid model is tested to evaluate prediction accuracy and reliability for job titles and salary ranges simultaneously.
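To make the classification half of this step concrete, the following is a minimal sketch of a naive Bayes classifier for categorical features, in which the score of each job title is the log prior plus the sum of log feature likelihoods. The class name `CategoricalNaiveBayes`, the Laplace smoothing parameter `alpha`, and the toy rows are assumptions for illustration; this is not the authors' implementation, and the Bayesian ridge regression part is not shown.

```python
import math
from collections import Counter, defaultdict


class CategoricalNaiveBayes:
    """Naive Bayes for categorical features with Laplace smoothing."""

    def fit(self, rows, labels, alpha=1.0):
        self.alpha = alpha
        self.class_counts = Counter(labels)
        self.n_features = len(rows[0])
        # value_counts[label][j][value] = occurrences of value in feature j
        self.value_counts = defaultdict(
            lambda: [Counter() for _ in range(self.n_features)]
        )
        self.feature_values = [set() for _ in range(self.n_features)]
        for row, label in zip(rows, labels):
            for j, value in enumerate(row):
                self.value_counts[label][j][value] += 1
                self.feature_values[j].add(value)
        return self

    def log_scores(self, row):
        """Unnormalized log posterior (log prior + log likelihoods) per class."""
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            score = math.log(count / total)  # log prior
            for j, value in enumerate(row):
                num = self.value_counts[label][j][value] + self.alpha
                den = count + self.alpha * len(self.feature_values[j])
                score += math.log(num / den)  # smoothed log likelihood
            scores[label] = score
        return scores

    def predict(self, row):
        scores = self.log_scores(row)
        return max(scores, key=scores.get)


# Toy rows: (experience level, company size); values are illustrative only.
rows = [("SE", "L"), ("SE", "M"), ("MI", "M"), ("EN", "S"), ("EN", "M"), ("MI", "S")]
titles = ["Data Engineer", "Data Engineer", "Data Scientist",
          "Data Analyst", "Data Analyst", "Data Scientist"]
model = CategoricalNaiveBayes().fit(rows, titles)
print(model.predict(("SE", "L")), model.predict(("EN", "S")))
```

Working in log space avoids numerical underflow when many feature likelihoods are multiplied, and the smoothing term keeps an unseen feature value from forcing a class probability to zero.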

5.8. Explainability and Interpretability Aspects (XAI)

The HBM integrates SHapley Additive exPlanations (SHAP) for interpretability. SHAP quantifies feature importance, enabling HR practitioners to interpret salary and job classification predictions transparently. Moreover, the explainability framework preserves decision symmetry by ensuring consistent interpretation of each feature’s impact across both the classification and regression modules.

5.9. Scalability and Performance Efficiency

The HBM addresses scalability via batch training and the numerical optimizations provided by NumPy and Pandas. Bayesian ridge regression handles high-dimensional data efficiently while limiting overfitting through regularization, supporting reliable performance even in the large-scale applications typical of global HR analytics.

5.10. Handling Class Imbalance

We addressed class imbalance using class-weighting techniques during naive Bayes classification training, assigning higher weights to under-represented job title categories. Additionally, the dataset was augmented through the synthetic minority oversampling technique (SMOTE) to enhance minority-class representation, with the aim of balancing predictions and reducing model bias toward majority classes.
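One common way to assign higher weights to under-represented classes is the inverse-frequency ("balanced") heuristic, sketched below. The text does not state which weighting formula was used, so the formula, the helper name `balanced_class_weights`, and the toy counts are assumptions; SMOTE is not shown.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Toy labels (hypothetical counts) with a rare class.
toy_titles = ["Data Engineer"] * 6 + ["Data Scientist"] * 3 + ["AI Engineer"] * 1
weights = balanced_class_weights(toy_titles)
for title, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{title}: {w:.3f}")
```

With this heuristic the weights sum to the number of samples when added over all records, so the overall scale of the training objective is unchanged while rare titles count more per record.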

6. Experimental Setup

This section outlines the experimental framework employed to evaluate the performance of the proposed Hybrid Bayesian Model (HBM). We detail the dataset characteristics, the baseline models selected for comparison, and the evaluation metrics used to measure classification and prediction performance.

Model training and evaluation were conducted on a high-performance computing system with the specifications presented in Table 2.

6.1. Evaluation Metrics

To guarantee a fair comparison, all models were trained and evaluated on the same preprocessed dataset with the same 80:20 train–test split. Cross-validation was used where necessary to tune hyperparameters. Several measures were used to quantify classification and clustering quality, namely the confusion matrix, accuracy, precision, recall, F1 score, ROC-AUC, and silhouette score, in order to capture a complete picture of model performance. The experimental framework was designed to reflect a real-world scenario for both job classification and salary prediction.

6.1.1. Confusion Matrix

The confusion matrix visually summarizes the performance of the classification model by clearly showing correct and incorrect predictions. Practically, this helps recruiters and decision-makers quickly identify common misclassifications and areas where the model excels. A confusion matrix with high values along the diagonal (from top left to bottom right) indicates robust performance [12].
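As an illustration, the sketch below builds a confusion matrix with true classes as rows and predicted classes as columns. The function name `confusion_matrix`, the class codes, and the toy predictions are assumptions made for this example.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


classes = ["FT", "PT", "CT"]
y_true = ["FT", "FT", "FT", "PT", "CT", "FT"]
y_pred = ["FT", "FT", "FT", "FT", "FT", "FT"]  # always predicts the majority class
cm = confusion_matrix(y_true, y_pred, classes)
for name, row in zip(classes, cm):
    print(name, row)
```

In this toy case the diagonal holds 4 of 6 samples, yet every minority-class row has a zero on the diagonal, which shows why the full matrix is more informative than a single accuracy figure on imbalanced data.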

6.1.2. Accuracy

Accuracy measures the proportion of correct predictions made by the model over the total number of predictions. Practically, high accuracy means the model is consistently reliable in correctly classifying job titles and predicting appropriate salary ranges, thereby significantly assisting recruiters and HR specialists in making informed decisions.

The formal definition of accuracy is provided in Equation (2) below [7].

\text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}}

(2)

Equation (3) below expresses accuracy for binary classification in terms of positive and negative outcomes.

Accuracy = \frac{TP + TN}{TP + TN + FP + FN}

(3)

where TP stands for true positives, TN stands for true negatives, FP stands for false positives, and FN stands for false negatives.

6.1.3. Precision

Precision measures how accurately the model identifies relevant instances from all the predictions it makes. In simple terms, high precision indicates that when the model classifies a job title or salary group, it is usually correct. This ensures recruiters have confidence in the predicted outcomes, knowing they can trust the classifications provided by the model. Referring to Equation (4), the formula for precision is shown below.

Precision = \frac{TP}{TP + FP}

(4)

6.1.4. Recall

Recall refers to the model’s ability to identify all relevant instances from the actual cases. A model with high recall rarely misses important instances. For example, it effectively identifies most or all actual job roles and associated salaries, ensuring comprehensive and fair analysis. The mathematical representation of recall is provided in Equation (5) below.

Recall = \frac{TP}{TP + FN}

(5)

6.1.5. F1-Score

The F1 score provides a balanced measure of precision and recall, giving an overall picture of the model’s effectiveness. A high F1 score means the model maintains a strong balance between correctly identifying instances (precision) and capturing all relevant instances (recall), thereby serving as a reliable metric for recruiters to gauge overall model performance. The computation of the F1 score is provided in Equation (6) below.

\text{F1-score} = \frac{2 \times (\text{Precision} \times \text{Recall})}{\text{Precision} + \text{Recall}}

(6)
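Equations (3) to (6) can be checked with a short worked example. The function name `classification_metrics` and the confusion counts below are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from binary confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)          # Equation (3)
    precision = tp / (tp + fp) if (tp + fp) else 0.0    # Equation (4)
    recall = tp / (tp + fn) if (tp + fn) else 0.0       # Equation (5)
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # Equation (6)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Hypothetical counts chosen only to exercise the formulas.
m = classification_metrics(tp=90, tn=50, fp=10, fn=30)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Since F1 is the harmonic mean of precision and recall, it always lies between the two values, which is a quick sanity check when reading reported metrics.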

6.1.6. ROC-AUC

The overall discriminative ability of the model is measured by the Area Under the Receiver Operating Characteristic Curve (ROC-AUC), which reflects how well the model separates the positive and negative classes across different threshold values.

It can be computed using the expression presented in Equation (7) [26].

\text{AUC} = \int_{0}^{1} \text{TPR} \, d(\text{FPR})

(7)
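For a finite test set, the integral in Equation (7) is usually approximated by sweeping the decision threshold and summing trapezoids under the resulting ROC points. The sketch below does this for the binary case; the function name `roc_auc` and the toy scores are assumptions, and a multi-class problem such as job title classification would additionally need a scheme like one-vs-rest averaging, which is not shown.

```python
def roc_auc(y_true, scores):
    """Area under the ROC curve via the trapezoidal rule (labels are 0/1)."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC-AUC needs both classes present")
    # Sweep thresholds from the highest score downwards.
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    tp = fp = 0
    prev_tpr = prev_fpr = 0.0
    auc = 0.0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume all samples tied at this score before emitting a point.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        tpr, fpr = tp / n_pos, fp / n_neg
        auc += (fpr - prev_fpr) * (tpr + prev_tpr) / 2  # trapezoid area
        prev_tpr, prev_fpr = tpr, fpr
    return auc


auc_value = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc_value)
```

A value of 0.5 corresponds to scores that carry no ranking information, and 1.0 to scores that rank every positive above every negative.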

The symmetric evaluation of the model’s classification and regression performance using consistent metrics like precision, recall, and ROC-AUC ensures that both tasks are equally weighted in determining effectiveness. This evaluation symmetry enhances interpretability and reinforces the model’s general applicability across diverse HR contexts.

6.1.7. Silhouette Score Method

The silhouette score is a commonly used measure for assessing the quality of clustering results in unsupervised learning. It compares how close each data instance is to its own cluster (cohesion) with how far it is from other clusters (separation). Mathematically, the silhouette coefficient of a sample i is defined as follows:

s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}

(8)

The a(i) variable represents the average distance between point i and all other points that are contained within the same cluster, while the b(i) variable indicates the smallest average distance that exists between point i and points that are located in any other cluster. The silhouette score is a numerical value that can vary from -1 to 1. A value that is close to 1 indicates that the data point is well clustered, while a value that is close to 0 indicates that it is perhaps on the decision boundary between clusters.

Negative values imply possible misclassification in the wrong cluster. In this study, the silhouette score is utilized to evaluate the effectiveness of clustering models, such as K-means, in categorizing employees based on salary ranges or job roles. A higher silhouette score indicates more distinct and cohesive clusters, which are desirable for accurate job classification and salary grouping [27].
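The definition in Equation (8) can be sketched directly for one-dimensional data such as salary values. The function name `silhouette_scores`, the use of absolute difference as the distance, the convention of scoring singleton clusters as 0, and the toy points are assumptions made for this example.

```python
def silhouette_scores(points, labels):
    """Per-sample silhouette s(i) for 1-D points, following Equation (8)."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the point's zero distance to itself does not change the sum).
        a = sum(abs(p - q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(abs(p - q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != l
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores


# Two well-separated toy clusters (illustrative values only).
points = [1.0, 2.0, 10.0, 11.0]
cluster_labels = [0, 0, 1, 1]
sil = silhouette_scores(points, cluster_labels)
print(sum(sil) / len(sil))
```

The mean of the per-sample values is the overall silhouette score; for the well-separated toy clusters above it is close to 1, as expected.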

7. Results and Discussion

This section focuses on the system’s performance, which is carefully evaluated to ensure its effectiveness. Key metrics such as the confusion matrix, F1 score, precision, recall, and accuracy are used to assess how well the model performs in a statistically rigorous manner.

Model training and evaluation were performed on a high-performance computing system with the specifications presented in Table 2.

The evaluation results highlight the effectiveness of our proposed Hybrid Bayesian Model (HBM) in both job classification and salary prediction. The model consistently outperforms traditional ML techniques across multiple performance metrics, demonstrating its reliability in making accurate predictions.

Figure 6 and Figure 7 display the job classification and salary prediction accuracies, respectively. HBM achieves the best job classification accuracy, 99.8%, and outperforms all baseline models, such as RF, SVM, MC, and DT. This accuracy indicates that HBM is able to distinguish between different job positions based on factors such as the level of experience, company size, and other associated factors. The generalizability of the model to different job positions makes it a candidate tool for career recommendation and job classification.

In the case of prediction of salaries, HBM also performs exceptionally, always having better accuracy than traditional models. The model effectively applies the principles of Bayesian learning to improve the prediction capabilities and provides accurate estimations of salaries based on different levels of experience and organizational settings. Furthermore, excellent accuracy in prediction implies that HBM is appropriate for use in forecasting salaries, providing trustworthy predictions that are in accordance with actual salary distributions.

Beyond accuracy, precision and recall are compared in Figure 8 and Figure 9, and the F1 score, which balances the two, provides further validation of the model's effectiveness. Figure 10 shows that HBM achieves the highest F1 score among all tested models, indicating that it keeps both false positives and false negatives low, so that classification and salary estimation errors are reduced. Table 3 summarizes the accuracy, precision, recall, and F1 score of all models.

The similar precision and recall scores suggest the model makes balanced decisions internally, treating false positives and false negatives equally. This symmetry in evaluation is key for fairness in HR decisions, particularly when considering positions with unequal group representation.

Moreover, this work examines the confusion matrix of each classifier, as shown in Figure 11. SVM, LR, NB, and the hybrid model produce the same confusion matrix. Although stratified sampling was used to represent all employment types proportionally in the test set, the confusion matrices show that these models predicted the majority class (class label 3) for nearly every instance in the test set. This can be attributed to the class imbalance in the dataset, in which the overwhelming majority of instances are full-time employment. Such strong imbalance rewards models that maximize overall accuracy, so they predict the majority class and ignore most of the minority classes. This motivates future work on class rebalancing, such as class weighting or synthetic oversampling (e.g., SMOTE), to improve performance across all employment types.
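The class-weighting idea mentioned above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the common "balanced" heuristic, weight = n_samples / (n_classes × class count), and the label distribution is hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return n_samples / (n_classes * count) for every class label."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical, heavily imbalanced employment-type labels (3 = full-time).
labels = [3] * 96 + [0] * 2 + [1] * 1 + [2] * 1
weights = balanced_class_weights(labels)

for label in sorted(weights):
    print(label, round(weights[label], 4))
```

With these weights every class contributes the same total weight to a weighted loss, so errors on the rare employment types are no longer negligible compared with errors on the majority class.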

The high accuracy (99.8%) should be interpreted considering the balanced precision and recall metrics. The confusion matrix indicates some bias toward majority classes; hence, we explicitly emphasize precision, recall, and the F1 score as balanced metrics better reflecting overall model performance.
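To see why accuracy alone can mislead under class imbalance, consider a classifier that always predicts the majority class. The sketch below uses hypothetical labels (not the study's data) and computes accuracy next to the macro-averaged F1 score, which weights every class equally.

```python
def per_class_f1(y_true, y_pred, label):
    """F1 score of one class, treating that class as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical test set: 96 majority-class instances, 4 minority instances.
y_true = [3] * 96 + [0] * 2 + [1] * 1 + [2] * 1
y_pred = [3] * 100  # a model that always predicts the majority class

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.2f}, macro F1 = {mf1:.3f}")
```

In this toy case the accuracy is 0.96 while the macro F1 stays below 0.25, because three of the four classes are never predicted.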

Figure 12 shows the receiver operating characteristic (ROC) curves of the job classification model for three job classes. The area under the curve (AUC) values are very high: 0.98 for Class 0, 1.00 for Class 1, and 0.97 for Class 2, reflecting excellent discrimination capability. This close-to-perfect separation implies that the classifier distinguishes the categories with very few false positives. A random baseline (AUC = 0.50) is drawn for comparison. These findings support the reliability of the selected features and architecture in identifying patterns that enable successful multi-class classification.
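Per-class AUC values of this kind are usually obtained one-vs-rest. The following sketch assumes that convention and uses hypothetical predicted probabilities. It computes the AUC as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one, which is equivalent to the area under the ROC curve.

```python
def roc_auc(scores, is_positive):
    """AUC as P(score of a positive > score of a negative); ties count 0.5."""
    pos = [s for s, y in zip(scores, is_positive) if y]
    neg = [s for s, y in zip(scores, is_positive) if not y]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative instance")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def one_vs_rest_auc(prob_rows, y_true, n_classes):
    """AUC of each class against all remaining classes."""
    return [
        roc_auc([row[c] for row in prob_rows], [y == c for y in y_true])
        for c in range(n_classes)
    ]


# Hypothetical class-probability outputs for six test instances.
probs = [
    [0.70, 0.20, 0.10],
    [0.45, 0.10, 0.45],
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.30, 0.10, 0.60],
    [0.50, 0.10, 0.40],
]
y_true = [0, 0, 1, 1, 2, 2]

aucs = one_vs_rest_auc(probs, y_true, 3)
print(aucs)
```

A classifier that scores every instance identically obtains 0.5 under this definition, which corresponds to the random baseline drawn in the figure.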

In addition, the Precision–Recall (PR) curve is presented in Figure 13, which summarizes the classifier’s performance on three job classes. The average precision (AP) values are 0.96 on Class 0, 1.00 on Class 1, and 0.88 on Class 2. The near-perfect Class 1 curve demonstrates the model’s stellar precision and recall in detecting instances within that class. Class 0 also performs well, with the well-balanced curve reflecting little sacrifice between precision and recall. Finally, Class 2 performs reasonably well, though with more variation, reflecting that the model has more difficulty in distinguishing it from the rest of the classes. These curves are complementary to the ROC analysis and are particularly useful in class-imbalanced datasets in which precision and recall provide the more subtle picture of classifier effectiveness.
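Average precision can be computed directly from a ranked list of scores. The sketch below assumes the usual definition (the mean of the precision values at the ranks where a positive instance is retrieved) and no tied scores. The scores and labels are hypothetical.

```python
def average_precision(scores, is_positive):
    """Mean of precision@rank over the ranks that hold a positive instance."""
    ranked = sorted(zip(scores, is_positive), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(1 for _, y in ranked if y)
    if n_pos == 0:
        raise ValueError("need at least one positive instance")
    hits = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y:
            hits += 1
            total += hits / rank  # precision at this rank
    return total / n_pos


# Hypothetical scores for one class: positives sit at ranks 1 and 3.
scores = [0.9, 0.8, 0.7, 0.6, 0.5]
labels = [True, False, True, False, False]

ap = average_precision(scores, labels)
print(round(ap, 4))
```

Because the denominator counts only positive instances, AP is sensitive to how the rare class is ranked, which is why it complements ROC analysis on imbalanced data.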

Finally, Figure 14 applies the silhouette score method to determine the ideal number of clusters when segmenting salaries with K-means clustering. The silhouette score measures how similar a data point is to its own cluster relative to other clusters; scores closer to 1 indicate tightly defined clusters. As can be seen in the figure, the highest silhouette score occurs at k = 2, indicating that the salary data separate most clearly into two clusters, which fits the salary discretization used in the hybrid model (i.e., low–high). From k = 3 to k = 6, the score decreases gradually, suggesting that additional clusters bring little gain in segmentation quality for basic, unweighted K-means. A sharp decline appears at k = 7, indicative of poor cluster cohesion. Thus, retaining two or three salary clusters is best for interpretability and classification quality, which supports the use of clustering in our salary prediction pipeline.
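The selection rule described here (keep the k with the highest mean silhouette) can be illustrated on a toy example. The sketch below is an assumption-laden simplification: salaries are one-dimensional, the distance is the absolute difference, and the candidate cluster assignments are given by hand instead of being produced by K-means. The salary values are hypothetical.

```python
def silhouette_score_1d(points, labels):
    """Mean silhouette coefficient for 1-D points with absolute-difference distance."""
    clusters = {}
    for x, c in zip(points, labels):
        clusters.setdefault(c, []).append(x)
    if len(clusters) < 2:
        raise ValueError("the silhouette score needs at least two clusters")
    total = 0.0
    for x, c in zip(points, labels):
        own = clusters[c]
        if len(own) == 1:
            continue  # singleton clusters contribute 0 by convention
        # a: mean distance to the other members of the point's own cluster
        a = sum(abs(x - y) for y in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(abs(x - y) for y in other) / len(other)
            for k, other in clusters.items()
            if k != c
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


# Hypothetical salaries in thousands of USD.
salaries = [40, 45, 50, 150, 155, 160]
score_k2 = silhouette_score_1d(salaries, [0, 0, 0, 1, 1, 1])
score_k3 = silhouette_score_1d(salaries, [0, 0, 1, 2, 2, 2])
print(round(score_k2, 4), round(score_k3, 4))
```

Splitting the low-salary group into two clusters lowers the mean silhouette, mirroring the behavior reported for the real data, where the score peaks at k = 2.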

7.1. Comparison with Baseline Models

The superior performance of our Hybrid Bayesian Model (HBM) over traditional ML approaches arises primarily from the synergistic integration of Bayesian classification and regression techniques. The Bayesian foundation allows for effective management of uncertainty, robustness to noise, and integration of context-specific priors. Additionally, performing classification and regression simultaneously enables mutual information reinforcement, substantially enhancing prediction accuracy and consistency. While comparisons with methods utilizing different datasets provide a general performance context, direct quantitative comparison should acknowledge variations in dataset specifics. When compared to RF, SVM, MC, and DT, HBM consistently delivers better results across all evaluation metrics. The hybrid approach integrates Bayesian inference and predictive modeling, allowing it to leverage prior knowledge while effectively learning from observed data. This advantage translates into higher accuracy, greater flexibility, and improved predictive reliability. The pseudocode of the WEKA model and of our proposed HBM can be seen in Algorithm 1.

Compared to Transformer-based multi-task neural architectures (Ji et al., 2025) [14], our HBM explicitly offers superior computational efficiency, interpretability via Bayesian priors, and transparency through SHAP explanations. While transformer models excel on large-scale unstructured data, our Bayesian approach explicitly provides robust performance on structured, smaller-scale HR datasets common in industry practice.

Algorithm 1 The Proposed Hybrid Bayesian Model (HBM).

Define Hyperparameters
Input: $μ$, $θ_{t}$, and $T_{max}$
Output: $θ_{Bays}$ and $θ_{Ridge}$
Initialize: $θ_{Bays} \leftarrow 0$ and $θ_{Ridge} \leftarrow 0$

for $t = 0, 1, \dots, T_{max}$ do
    Compute posterior probability for classes given $x_{i}$:
    $P (y_{i} = c ∣ x_{i}) \propto P (x_{i} ∣ y_{i} = c) P (y_{i} = c)$
    Update class posterior and combinational likelihood:
    $θ_{Bays} \leftarrow μ θ_{Bays} L_{NB}$
    if $∥ θ_{t + 1} - θ_{t} ∥ \leq ζ$ then
        break
    end if
    Compute:
    $L = \sum_{i = 1}^{N} {(y_{i} - X_{i} θ_{Ridge})}^{2} + α {∥ θ_{Ridge} ∥}^{2}$
    Solve:
    $θ_{Ridge} \leftarrow {(X^{T} X + α I)}^{- 1} X^{T} Y$
end for
Find hybrid model output:
Train naive Bayes classifier for job classification
for bins = 1 to 100 do
    Define Company size
    bins = [Low, Medium, High]
    if $Θ < Th_{1}$ then
        Classify as Low
    else if $Θ < Th_{2}$ then
        Classify as Medium
    else if $Θ < Th_{3}$ then
        Classify as High
    end if
end for
Train Bayesian ridge regressor for salary prediction
return $θ_{Bays}, θ_{Ridge}$
for $t = 0, 1, \dots, T_{max}$ do
    Compute forward propagation:
    $Z \leftarrow f (x, θ_{t})$
    Compute loss function:
    $L (θ) = \sum_{i = 1}^{N} loss (y_{i}, z_{i}) + α_{i} {∥ θ_{i} ∥}^{2}$
    Compute backpropagation:
    $δ θ_{t} \leftarrow \frac{∂ L}{∂ θ_{t}}$
    Update parameters:
    $θ_{t + 1} \leftarrow θ_{t} - μ δ θ_{t}$
    Prediction:
    $θ_{final}$
end for
return Accuracy, Precision, Recall, and F1-Score
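The posterior step of Algorithm 1 can be illustrated with a small categorical naive Bayes classifier. This is a minimal sketch, not the authors' code: the Laplace smoothing, the two features (experience level and company size), and the "low"/"high" class labels are assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def fit_categorical_nb(rows, labels, smoothing=1.0):
    """Count class frequencies and per-class feature-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, feature index) -> value counts
    vocab = [set() for _ in rows[0]]     # distinct values seen per feature
    for row, c in zip(rows, labels):
        for j, v in enumerate(row):
            value_counts[(c, j)][v] += 1
            vocab[j].add(v)
    return class_counts, value_counts, vocab, smoothing


def posterior(model, row):
    """P(y = c | x), proportional to P(y = c) * prod_j P(x_j | y = c)."""
    class_counts, value_counts, vocab, s = model
    n = sum(class_counts.values())
    log_joint = {}
    for c, n_c in class_counts.items():
        lp = math.log(n_c / n)  # log prior
        for j, v in enumerate(row):
            # Laplace-smoothed conditional likelihood of feature value v
            lp += math.log((value_counts[(c, j)][v] + s) / (n_c + s * len(vocab[j])))
        log_joint[c] = lp
    top = max(log_joint.values())
    unnorm = {c: math.exp(lp - top) for c, lp in log_joint.items()}
    z = sum(unnorm.values())
    return {c: u / z for c, u in unnorm.items()}


# Hypothetical training rows: (experience level, company size).
rows = [("SE", "M"), ("SE", "L"), ("MI", "S"), ("EN", "S"), ("SE", "M"), ("MI", "L")]
labels = ["high", "high", "low", "low", "high", "low"]

model = fit_categorical_nb(rows, labels)
post = posterior(model, ("SE", "M"))
print({c: round(p, 4) for c, p in post.items()})
```

Dividing by the sum of the unnormalized terms plays the role of the evidence $P (x_{i})$, so the returned values form a proper probability distribution over the classes.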

In our proposed algorithm, $μ$ represents the learning rate, $θ_{NB}$ refers to the parameters of the naive Bayes model (written $θ_{Bays}$ in Algorithm 1), $θ_{Ridge}$ denotes the parameters of the Bayesian ridge regression model, $θ$ is used for general parameters, $L$ stands for the loss function, $X$ is the feature matrix, and $y$ indicates the target variable. The term $P (y_{i} ∣ x_{i})$ represents the posterior probability. The symbol $α$ refers to the regularization parameter. $T_{1}$ and $T_{2}$ (written $Th_{1}$ and $Th_{2}$ in Algorithm 1) are classification thresholds used for decision boundaries. Model evaluation metrics such as MAE (mean absolute error) and accuracy are employed to assess performance.
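The closed-form solve step and the threshold rule of Algorithm 1 can be sketched as follows. The example is illustrative only: the feature matrix, the targets, the value of α, and the threshold values are hypothetical, and the intercept column is regularized together with the slope, as the formula in the algorithm implies.

```python
def ridge_fit(X, y, alpha):
    """Solve (X^T X + alpha * I) theta = X^T y by Gaussian elimination."""
    m = len(X[0])
    A = [
        [sum(row[i] * row[j] for row in X) + (alpha if i == j else 0.0) for j in range(m)]
        for i in range(m)
    ]
    b = [sum(row[i] * t for row, t in zip(X, y)) for i in range(m)]
    aug = [A[i] + [b[i]] for i in range(m)]  # augmented matrix [A | b]
    for col in range(m):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, m):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, m + 1):
                aug[r][k] -= factor * aug[col][k]
    theta = [0.0] * m
    for i in reversed(range(m)):  # back substitution
        rest = sum(aug[i][k] * theta[k] for k in range(i + 1, m))
        theta[i] = (aug[i][m] - rest) / aug[i][i]
    return theta


def to_band(value, th1, th2):
    """Map a numeric value to Low / Medium / High using two thresholds."""
    if value < th1:
        return "Low"
    if value < th2:
        return "Medium"
    return "High"


# Hypothetical data: columns are (intercept, years of experience);
# targets are salaries in thousands of USD.
X = [[1, 0], [1, 1], [1, 2], [1, 3]]
y = [50, 60, 70, 80]

theta_ols = ridge_fit(X, y, alpha=0.0)    # no regularization
theta_ridge = ridge_fit(X, y, alpha=1.0)  # coefficients shrunk toward zero
predicted = theta_ridge[0] + theta_ridge[1] * 2
print(theta_ols, theta_ridge, to_band(predicted, 60, 100))
```

Setting α = 0 recovers ordinary least squares, while a positive α reduces the norm of the coefficient vector, which is the effect of the penalty term in the loss.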

The output variable (z) is interpreted differently based on the task type. In the classification task, z represents the predicted class labels obtained from the naive Bayes model. In the regression task, z corresponds to the predicted salary values generated by the Bayesian ridge regression model.

To further assess the computational efficiency of our proposed HBM, we compare its complexity with widely used ML models. Table 4 presents the training time, prediction time, and space complexity for several classification algorithms.

7.1.1. Computational Efficiency of HBM

The computational complexity analysis in Table 4 highlights the trade-offs between different models in terms of training efficiency, prediction speed, and memory usage. Our proposed model is designed to balance these aspects, making it a scalable and efficient choice for job classification and salary prediction.

Training Time Complexity: Some traditional models, such as SVM, require significantly more time for training, making them less suitable for large datasets. In contrast, DT and naïve Bayes are faster but may sacrifice predictive accuracy.
Prediction Time Complexity: Algorithms like KNN require substantial computation at the prediction stage, as they need to compare each new data point with the entire training set. Bayesian-based models, such as naïve Bayes and Bayesian ridge regression, have a more streamlined prediction process, enabling faster results.
Space Complexity: Certain models, such as SVM, have higher memory requirements, which can be a challenge for large-scale applications. In contrast, Bayesian models are more memory-efficient, making them better suited for handling extensive datasets.
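To make these trade-offs concrete, the leading-order expressions from Table 4 can be evaluated for a given problem size. The sketch below is a rough illustration: constants are dropped, the logarithm base (2) is an assumption, the prediction costs are stated per test sample, and the values of n, m, and c are hypothetical.

```python
import math

# Leading-order training costs from Table 4 (n samples, m features, c classes).
TRAIN_COST = {
    "KNN": lambda n, m, c: 1,
    "DT": lambda n, m, c: n * m * math.log2(n),
    "SVM (RBF)": lambda n, m, c: n ** 2 * m,
    "Bayesian Ridge": lambda n, m, c: n * m ** 2 + m ** 3,
    "Naive Bayes": lambda n, m, c: n * m * c,
}

# Leading-order prediction costs from Table 4, per test sample.
PREDICT_COST = {
    "KNN": lambda n, m, c: n * m,
    "DT": lambda n, m, c: math.log2(n),
    "SVM (RBF)": lambda n, m, c: m,
    "Bayesian Ridge": lambda n, m, c: m,
    "Naive Bayes": lambda n, m, c: m * c,
}

n, m, c = 4000, 10, 4  # hypothetical dataset size

for name in TRAIN_COST:
    train = TRAIN_COST[name](n, m, c)
    predict = PREDICT_COST[name](n, m, c)
    print(f"{name:15s} train ~ {train:>13,.0f}   predict ~ {predict:>8,.0f}")

svm_vs_nb = TRAIN_COST["SVM (RBF)"](n, m, c) / TRAIN_COST["Naive Bayes"](n, m, c)
```

For these sizes the SVM training term exceeds the naive Bayes term by a factor of n / c, and the KNN prediction term exceeds the naive Bayes one by n / c as well, which is the pattern described in the list above.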

7.1.2. Advantages of HBM over Baseline Models

Our Hybrid Bayesian Model (HBM) effectively combines the strengths of Bayesian learning techniques with predictive modeling, achieving high accuracy while maintaining computational efficiency. Unlike traditional classifiers that struggle with high-dimensional data and computational bottlenecks, HBM provides a well-balanced trade-off between speed, accuracy, and memory usage.

By leveraging probabilistic learning and optimized feature selection, HBM achieves faster training times and lower memory consumption while outperforming conventional models in accuracy. This makes it an ideal choice for large-scale salary prediction and job classification applications, ensuring reliable and scalable performance.

Overall, the results confirm that HBM surpasses traditional models in terms of both efficiency and predictive performance, making it a promising solution for job market analytics and automated career recommendations.

7.2. Theoretical Novelty and Comparative Advantages

Our Hybrid Bayesian Model introduces context-aware priors for Bayesian ridge regression, explicitly derived from job-specific salary distribution data, improving prediction accuracy. Furthermore, the posterior update rule integrates categorical posterior probabilities from naive Bayes classification explicitly into the regression step, leveraging mutual Bayesian updates between classification and regression tasks—a unique synergy absent in traditional Bayesian hybrids.
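One standard way to express an informative prior in ridge regression is to shrink the coefficients toward a non-zero prior mean instead of toward zero. The derivation below is a sketch under that assumption. The symbol $θ_{0}$ is hypothetical (for example, a coefficient vector derived from job-specific salary statistics) and does not appear in the paper, which does not state the exact update rule.

```latex
\begin{aligned}
L(θ_{Ridge}) &= \sum_{i=1}^{N} \left(y_{i} - X_{i} θ_{Ridge}\right)^{2}
                + α \lVert θ_{Ridge} - θ_{0} \rVert^{2}, \\
\nabla L &= -2 X^{T} \left(Y - X θ_{Ridge}\right) + 2 α \left(θ_{Ridge} - θ_{0}\right) = 0, \\
θ_{Ridge} &= \left(X^{T} X + α I\right)^{-1} \left(X^{T} Y + α θ_{0}\right).
\end{aligned}
```

Setting $θ_{0} = 0$ recovers the closed-form solution used in Algorithm 1, so the zero-mean case is a special instance of this estimator.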

Compared to multi-task neural networks, our hybrid Bayesian model explicitly leverages interpretability and uncertainty quantification strengths inherent in Bayesian inference, ensuring transparent decision-making, which is critical in HR contexts. Moreover, unlike hierarchical Bayesian models—which typically require computationally intensive inference—our explicit combination of naive Bayes and Bayesian ridge regression balances computational efficiency with interpretability, making it practical for large-scale and real-time HR analytics.

8. Comparison with Existing Studies

In this section, our approach is compared with recently reported state-of-the-art methods.

Table 5 summarizes a comparative analysis of the most relevant existing approaches in the context of job classification and salary prediction. Table 5 highlights the key characteristics, task coverage, datasets used, reported performance metrics, and their respective limitations compared to our proposed HBM framework.

Unlike existing methods, most of which focus solely on regression (salary prediction), our proposed HBM framework addresses both job classification and salary estimation as a dual-task model. While traditional models such as RF and KNN demonstrate moderate to high accuracy, they generally lack explainability and scalability. Some recent neural approaches improve performance but often sacrifice interpretability. Furthermore, our proposed approach not only attains the highest accuracy and F1 score, 99.80% and 98.8%, respectively, on a custom multi-role dataset, but also integrates SHAP-based explainability, offering a more scalable solution for real-world employment applications.

As shown in Figure 15, the proposed HBM outperforms all baseline methods in classification accuracy, underscoring its dual-task advantage. Beyond accuracy, HBM uniquely integrates SHAP-based explainability, offering transparent and interpretable predictions—critical in employment contexts where fairness and auditability are essential. While Table 5 outlines architectural and task-level differences, the bar chart visually reinforces HBM’s practical superiority. These findings collectively highlight the framework’s scalability, real-world readiness, and potential to redefine intelligent job analytics.

Figure 15, along with Table 5, summarizes the relative advantages of the proposed HBM framework over existing approaches. Whereas existing models, including Random Forest, KNN, and LSVM, can achieve moderate to high classification accuracy, they lack key properties such as dual-task learning, interpretability, and scalability. HBM achieved a higher accuracy (99.80%) than any baseline while also supporting simultaneous job classification and salary prediction. This combination makes HBM a strong and transparent solution for real-world HR analytics.

Symmetry Perspective in HBM Design

Symmetry is very important in creating intelligent systems, offering structure, easy understanding, and fairness in decisions. In the Hybrid Bayesian Model (HBM), symmetry appears in how it is built and how it works for job grouping and salary prediction. The model makes sure the steps to prepare data are the same, uses the same way to code features, and uses similar methods of classification (naïve Bayes) and regression (Bayesian ridge regression).

This design helps to keep things the same when dealing with different kinds of outputs, which are related in workforce studies. Also, the way of measuring the results is kept the same by using similar metrics—accuracy, precision, recall, F1 score, and AUC—for both tasks. The balance between precision and recall and similar handling of class boundaries tell us that the model reduces bias and class imbalance, ensuring fair prediction results. This mirrors the findings of Wang et al. [28], who applied SHAP-based interpretability analysis to demonstrate feature-level symmetry across social and demographic subgroups in multi-output models.

More generally, the HBM uses balanced patterns in learning by maintaining a balance between making generalizations and focusing on specifics. It also respects how data is spread through normalization and discretization and makes things easier to understand by using SHAP analysis consistently across the job–salary relationship. This focus on symmetry makes the framework clearer, stronger, and easier to expand, which fits with the main ideas of the symmetry topics.

9. Conclusions and Future Directions

In this paper, we have introduced a new model that handles two tasks simultaneously, job classification and salary prediction, while maintaining consistent evaluation scores. The proposed algorithm serves a hybrid purpose and differs substantially from existing algorithms for these tasks. Using the Spyder and WEKA environments, a thorough comparison of HBM with four other well-known techniques demonstrated that it outperforms them in terms of accuracy, precision, recall, and F1 score. Furthermore, the proposed model can be applied to more varied datasets, as it showed promising results on two separate tasks while outperforming established techniques.

As a future direction, the inclusion of diverse non-tech-sector data and broader geographic coverage is explicitly recommended to further validate the global applicability and external validity of our proposed hybrid Bayesian model.

On the other hand, the current model depends solely on job titles and does not take into account contextual information about the job, such as the job description or desired skills. Future improvements should therefore apply natural language processing (NLP) techniques to analyze full job descriptions. Future iterations of our hybrid system could also draw inspiration from biometric recognition. For instance, Zita et al. [29] combined fingerprint recognition with cryptographic key generation to boost security. If similar multi-task learning methods were applied to job analytics, our model could go beyond prediction and also handle secure access and digital credentials.

Author Contributions

Conceptualization, W.Z.; methodology, W.Z.; software, W.Z. and S.A.E.F.; validation, W.Z. and S.A.E.F.; formal analysis, W.Z.; investigation, W.Z. and S.A.E.F.; resources, W.Z. and S.A.E.F.; data curation, W.Z. and S.A.E.F.; writing—original draft preparation, M.A. and E.E.E.; writing—review and editing, M.A. and E.E.E.; visualization, W.Z.; supervision, M.A.; project administration, M.A. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

The dataset used in this study was compiled and curated by the authors based on publicly available sources, including Kaggle, Levels.fyi, and Glassdoor. The data are available from the corresponding author upon reasonable request.

Acknowledgments

The authors would like to thank their affiliated institutions for providing administrative and technical support for this research.

Conflicts of Interest

The authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Murphy, K.P. Machine Learning: A Probabilistic Perspective (Adaptive Computation and Machine Learning Series); The MIT Press: London, UK, 2018.
2. Hastie, T. The Elements of Statistical Learning: Data Mining, Inference, and Prediction; Springer: Berlin/Heidelberg, Germany, 2009.
3. Jordan, M.I.; Mitchell, T.M. Machine learning: Trends, perspectives, and prospects. Science 2015, 349, 255–260.
4. Fogel, J.; Modenesi, B. What is a labor market? classifying workers and jobs using network theory. arXiv 2023, arXiv:2311.00777.
5. Leon, F.; Gavrilescu, M.; Floria, S.A.; Minea, A.A. Hierarchical Classification of Transversal Skills in Job Advertisements Based on Sentence Embeddings. Information 2024, 15, 151.
6. Matbouli, Y.; Alghamdi, S. Statistical machine learning regression models for salary prediction featuring economy wide activities and occupations. Information 2022, 13, 2022.
7. Gashut, A.; Alayedi, M. Optimized Seizure Detection using Classical Machine Learning Models on EEG Signals. In Proceedings of the 2025 5th International Conference on Pervasive Computing and Social Networking (ICPCSN), Salem, India, 14–16 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 1167–1172.
8. Sun, Y.; Zhuang, F.; Zhu, H.; Zhang, Q.; He, Q.; Xiong, H. Market-oriented job skill valuation with cooperative composition neural network. Nat. Commun. 2021, 12, 1992.
9. Dutta, S.; Halder, A.; Dasgupta, K. Design of a novel Prediction Engine for predicting suitable salary for a job. In Proceedings of the 2018 Fourth International Conference on Research in Computational Intelligence and Communication Networks (ICRCICN), Kolkata, India, 22–23 November 2018; IEEE: Piscataway, NJ, USA, 2018; pp. 275–279.
10. Mittal, S.; Gupta, S.; Shamma, A.; Sahni, I.; Thakur, D.N. A Performance Comparisons of Machine Learning Classification Techniques for Job Titles Using Job Descriptions. 2020. Available online: https://ssrn.com/abstract=3589962 (accessed on 21 July 2025).
11. Bansal, U.; Narang, A.; Sachdeva, A.; Kashyap, I.; Panda, S. Empirical analysis of regression techniques by house price and salary prediction. In Proceedings of the IOP Conference Series: Materials Science and Engineering, Bapatla, India, 7–8 May 2021; IOP Publishing: Bristol, UK, 2021; Volume 1022, p. 012110.
12. Khutaba, N.; Ghanama, D.; Alayedi, M.; Al-Hubaishi, M. Integration of Wireless Network Features with Human Activity Recognition: A Comprehensive Dataset Analysis. In Proceedings of the 2025 5th International Conference on Pervasive Computing and Social Networking (ICPCSN), Salem, India, 14–16 May 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 75–81.
13. Ghorai, P.; Barik, R. Salary Prediction Using Machine Learning Techniques. In Proceedings of the International Conference on Data & Information Sciences, Agra, India, 16–17 June 2024; Springer: Berlin/Heidelberg, Germany, 2024; pp. 201–213.
14. Ji, Y.; Sun, Y.; Zhu, H. Enhancing Job Salary Prediction with Disentangled Composition Effect Modeling: A Neural Prototyping Approach. arXiv 2025, arXiv:2503.12978.
15. Rahhal, I.; Carley, K.M.; Kassou, I.; Ghogho, M. Two stage job title identification system for online job advertisements. IEEE Access 2023, 11, 19073–19092.
16. Zhang, J.; Cheng, J. Study of Employment Salary Forecast using KNN Algorithm. In Proceedings of the 2019 International Conference on Modeling, Simulation and Big Data Analysis (MSBDA 2019), Wuhan, China, 23–24 June 2019; Atlantis Press: Dordrecht, The Netherlands, 2019; pp. 166–170.
17. Wang, Z.; Sugaya, S.; Nguyen, D.P. Salary prediction using bidirectional-gru-cnn model. In Proceedings of the 25th Annual Conference of the Association for Natural Language Processing, Nagoya, Japan, 12–15 March 2019; pp. 292–295.
18. Ayua, S.I.; Malgwi, Y.M.; Afrifa, J. Salary prediction model for non-academic staff using polynomial regression technique. In Proceedings of the Artificial Intelligence and Applications, Toronto, Canada, 20–21 July 2024; Volume 2, pp. 330–337.
19. Han, X.; Yang, Y.; Chen, J.; Wang, M.; Zhou, M. Symmetry-Aware Credit Risk Modeling: A Deep Learning Framework Exploiting Financial Data Balance and Invariance. Symmetry 2025, 17, 341.
20. Zhalilova, G.; Mamatkasymova, A.; Zhusupova, E.; Zhalzhaeva, K. Forecasting data science professionals’ salaries using machine learning methods based on real data. In Proceedings of the AIP Conference Proceedings, Samarkand, Uzbekistan, 2–3 May 2024; AIP Publishing LLC: Melville, NY, USA, 2024; Volume 3244, p. 030034.
21. Aufiero, S.; De Marzo, G.; Sbardella, A.; Zaccaria, A. Mapping job complexity and skills into wages. arXiv 2023, arXiv:2304.05251.
22. Alsheyab, A.R.; Alkhasawneh, M.; Shahin, N. Job Market Cheat Codes: Prototyping Salary Prediction and Job Grouping with Synthetic Job Listings. arXiv 2025, arXiv:2506.15879.
23. Ghojogh, B.; Ghodsi, A.; Karray, F.; Crowley, M. Reproducing Kernel Hilbert Space, Mercer’s Theorem, Eigenfunctions, Nyström Method, and Use of Kernels in Machine Learning: Tutorial and Survey. arXiv 2021, arXiv:2106.08443.
24. Mienye, I.D.; Jere, N. A survey of decision trees: Concepts, algorithms, and applications. IEEE Access 2024, 12, 86716–86727.
25. Iranzad, R.; Liu, X. A review of random forest-based feature selection methods for data science education and applications. Int. J. Data Sci. Anal. 2024, 1–15.
26. Li, J. Area under the ROC Curve has the most consistent evaluation for binary classification. PLoS ONE 2024, 19, e0316019.
27. Shutaywi, M.; Kachouie, N.N. Silhouette analysis for performance evaluation in machine learning with applications to clustering. Entropy 2021, 23, 759.
28. Wang, Y.; Chen, H.; Zhao, W.; Zhang, Q. Decoding the Symmetry of Influence: A Machine Learning Study of Reading Exposure and Social Attitudes Across Social Groups. Symmetry 2025, 17, 900.
29. Zita, W.; Khalil, M.S. Fingerprint-Based Cryptographic Identity: A Custom Recognition Pipeline with Key Pair Generation. East J. Comput. Sci. 2025, 1, 17–34.

Figure 1. Relationship between job and salary.

Figure 2. Percentage of available job titles.

Figure 3. t-SNE visualization of job titles colored by employment type.

Figure 4. Visualization of correlation matrix.

Figure 5. Diagram of the proposed methodology based on HBM.

Figure 6. Comparison of the accuracy of the HBM in job classification.

Figure 7. Comparison of the accuracy of the HBM model in salary prediction.

Figure 8. Representation of precision comparison results in a percentage of the proposed HBM model with other models.

Figure 9. Representation of recall comparison results in a percentage of the proposed HBM model with other models.

Figure 10. Representation of F1-score comparison results in a percentage of the proposed HBM model with other models.

Figure 11. Representation of confusion matrices.

Figure 12. Visualization of receiver operating characteristic (ROC) curves.

Figure 13. Precision vs. recall (curve).

Figure 14. Silhouette score method.

Figure 15. Accuracy comparison between the proposed HBM and existing classification methods [9,10,16].

Table 1. Sample data of discretized salaries.

N	Work Year	Experience Level	Company Size	Discretized_Salary_in_Usd
2	2023	MI	S	1
3	2023	SE	M	2
4	2023	SE	M	1
6	2023	SE	L	1
7	2023	SE	M	2
...	...	...	...	...
3749	2021	SE	L	2
3750	2020	SE	L	0
3751	2021	MI	L	2
3752	2020	EN	S	1
3754	2021	SE	L	1

Table 2. Device specifications.

Component	Specification
Processor	Intel® Core™ i5-5200U @ 2.20 GHz
Installed RAM	8 GB DDR3
Storage	256 GB SSD (KingFast)
Graphics Card	Intel® HD Graphics 5500 (128 MB)
Simulation Tool	WEKA and Spyder (Python)
Operating System	Windows 10 22H2

Table 3. Performance comparison of ML models.

Model	Accuracy (%)	Precision (%)	Recall (%)	F1 Score (%)
RF	95.67	90.36	97.48	93.66
SVM	83.78	72.81	77.99	79.75
MC	90.40	89.72	91.51	90.80
DT	92.26	82.53	93.71	90.92
HBM	99.80	99.85	100.00	98.8

Table 4. Computational complexity of baseline models.

Algorithm	Training Time Complexity	Prediction Time Complexity	Space Complexity (Training)
KNN	$O (1)$	$O (n \cdot m)$	$O (n \cdot m)$
DT	$O (n \cdot m \cdot \log n)$	$O (\log n)$	$O (n \cdot m)$
SVM (RBF Kernel)	$O (n^{2} \cdot m)$	$O (n_{test} \cdot m)$	$O (n^{2})$
Bayesian Ridge Regression	$O (n \cdot m^{2} + m^{3})$	$O (n_{test} \cdot m)$	$O (m^{2})$
Naïve Bayes	$O (n \cdot m \cdot c)$	$O (m \cdot c)$	$O (m \cdot c)$

Table 5. Comparison of proposed HBM with existing state-of-the-art methods.

Study (Year)	Proposed Methodology	Task	Performance Metrics	Dataset	Limitations
Proposed HBM (2025)	Naïve Bayes + Bayesian Ridge Regression	Job Classification + Salary Prediction	Accuracy: 99.80%; F1: 98.8%	Custom-curated, multi-role dataset	Dual-task model; best accuracy, F1 score, and SHAP explainability; interpretable and scalable
Dutta et al. (2018) [9]	Random Forest	Salary Prediction	Accuracy: 87.3%	Salary survey dataset	Weak generalization; no classification; no explanation module
Zhang et al. (2019) [16]	K-Nearest Neighbors	Salary Prediction	Accuracy: 93.3%	CareerBuilder dataset	Poor scalability; no job classification; not robust to noise
Ayua et al. (2024) [18]	Polynomial Regression	Salary Prediction	$R^{2}$ : 0.972	Nigerian salary dataset	Good fit but overfitting risk; lacks generalizability and classification
Mittal et al. (2020) [10]	LSVM (Linear SVM with Elastic Penalty)	Job Classification	Accuracy: 96.25%	Kaggle Top 30 Job Titles dataset	No salary regression; no interpretability features; classification only
Sun et al. (2021) [8]	Two-stage Neural Network	Salary Prediction	Lower MAE; RMSE	Job postings with salary text	Focuses on regression only; DNN without interpretability; no job classification
Ji et al. (2025) [14]	LGDESetNet (Set-based Neural Network)	Salary Prediction	RMSE/MAE; interpretability via prototypical sets	Four real-world datasets	Single task only; no classification and not SHAP-based; lacks dual-task integration
Zhalilova et al. (2024) [20]	Decision Tree, Random Forest, and Gradient Boosting	Salary Prediction	Lowest RMSE with decision tree regression	Data Science Salaries 2020–2024 dataset	Regression only; no job classification; limited scalability and interpretability
Aufiero et al. (2023) [21]	Unsupervised skill-network mapping	Job Complexity & Wage Prediction	Strong correlation (complexity–wage)	Job market skill network data	Unsupervised; no salary/classification models; limited interpretability
Alsheyab et al. (2025) [22]	Hybrid regression, classification, and clustering	Job Grouping & Salary Prediction	Prototype-based hybrid predictions	Synthetic job postings dataset	Synthetic data only; no real-world validation; generalizability concerns

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
