Article

Machine Learning Models for Predicting Professional Disqualification in Peruvian Association Members

by Manuel Pretel Pretel 1, Yeny Chávez Llempén 1, Abel Angel Sullon Macalupu 1, Paulo Canas Rodrigues 2, Javier Linkolk López-Gonzales 1 and Esteban Tocto-Cano 1,*
1 Escuela de Posgrado, Universidad Peruana Unión, Lima 15468, Peru
2 Department of Statistics, Federal University of Bahia, Salvador 40170-110, Brazil
* Author to whom correspondence should be addressed.
Data 2026, 11(5), 98; https://doi.org/10.3390/data11050098
Submission received: 20 January 2026 / Revised: 3 March 2026 / Accepted: 4 March 2026 / Published: 30 April 2026

Abstract

The disqualification of licensed professionals for non-payment of their monthly fees constitutes a significant operational risk to the financial sustainability of professional associations. This problem highlights the need for predictive tools that can anticipate the risk of disqualification and protect institutional stability. The main objective of this study was to develop a supervised machine learning model for estimating the risk of disqualification among registered professionals based on historical and contextual variables. An empirical, applied, and quantitative study was conducted by analyzing more than 5.7 million financial records corresponding to 27,964 registered professionals. Multiple supervised classification algorithms, including ensemble models such as CatBoost and XGBoost, were evaluated using stratified cross-validation and class-balancing techniques to address the substantial imbalance in the data. The results indicated that CatBoost performed best (F1-score = 57.96%; AUC = 0.72), whereas XGBoost showed greater stability across cross-validation folds. In conclusion, the model developed supports the timely identification of members at high risk of disqualification, enabling the implementation of early warning systems and proactive institutional financial management strategies.

1. Introduction

Non-payment of mandatory membership fees constitutes a systemic financial risk that undermines both the institutional sustainability of professional associations and the continuity of licensed practices. This phenomenon generates recurring liquidity constraints, disrupts medium- and long-term budget planning, and limits investment capacity in infrastructure expansion, welfare facilities, and continuing professional development programs. Consequently, membership delinquency emerges as a critical management challenge that requires evidence-based risk-mitigation strategies. This structural vulnerability extends beyond administrative inefficiency and directly constrains institutional growth, infrastructure maintenance, and service quality, thereby reinforcing the need for proactive financial governance. Previous studies demonstrate that delinquency reflects structured behavioral patterns, rather than random non-compliance. Age, income level, marital status, professional seniority, and employment status systematically influence the propensity to default or incur financial delinquency [1,2,3,4,5,6]. These findings establish that payment behavior is shaped by identifiable sociodemographic and professional characteristics, suggesting the feasibility of predictive approaches to delinquency management. In the field of credit risk management, machine learning methods, particularly ensemble algorithms, consistently outperform traditional statistical models on complex classification tasks. For example, Random Forest achieves accuracies above 89% in insurance applications [7], while XGBoost demonstrates strong predictive performance and computational efficiency in financial risk assessment [8]. Moreover, integrating these algorithms with balancing techniques such as the Synthetic Minority Oversampling Technique (SMOTE) effectively addresses imbalanced distributions in binary classification problems [9].
These advances indicate that ensemble learning provides robust solutions for high-dimensional and imbalanced datasets in finance. Despite these advances, predictive modeling remains underexplored in professional association contexts, particularly in relation to disqualification due to non-payment. Conventional approaches, including logistic regression and discriminant analysis, often perform suboptimally in datasets characterized by nonlinear relationships, high dimensionality, and pronounced class imbalance [10,11]. Furthermore, few studies have integrated institutional variables, including membership type, payment history, banking institution, and payment location, with sociodemographic and academic characteristics into a unified predictive framework [10,11]. This limitation restricts the development of institution-specific risk mitigation strategies tailored to the operational realities of professional licensing organizations. This study develops a supervised machine learning model to estimate the probability of disqualification due to delinquency among licensed professionals in Peru. By integrating institutional financial transaction records with member registry data, this study evaluates multiple supervised algorithms under controlled conditions to identify the most effective approach for delinquency prediction. The analysis incorporates demographic, academic, and financial-institutional variables within a comprehensive predictive framework. The findings demonstrate that gradient-boosting frameworks, particularly CatBoost and XGBoost, exhibit robust discriminatory capacity in identifying members at risk of disqualification, outperforming traditional statistical methods and other machine-learning approaches. By systematically evaluating multiple algorithms across a large-scale institutional dataset, the analysis establishes an empirical foundation for preventive delinquency management systems within professional associations. 
The integration of institutional and sociodemographic variables with advanced modeling techniques enables the development of proactive, data-driven intervention strategies. The remainder of this article is structured as follows: Section 2 establishes the Research Hypothesis and Conceptual Model; Section 3 reviews the relevant literature; Section 4 describes the data and methodology; Section 5 presents the empirical results; Section 6 discusses their implications; and Section 7 concludes and outlines directions for future research.

2. Research Hypothesis and Conceptual Model

Building on behavioral economics, financial risk theory, life-cycle theory, and human capital theory, this study advances a framework that conceptualizes professional disqualification risk as a multifactorial, institutional outcome. Non-payment does not occur by chance—it emerges from the structured interaction of structural, professional, and behavioral determinants that can be modeled through supervised machine learning [1,10,11]. By integrating individual attributes with institutional conditions, the framework delivers a coherent predictive architecture grounded in economic reasoning.
Behavioral economics and financial life-cycle theory treat age as a signal of economic stability, accumulated capital, and financial maturity [2,4,5]. Evidence indicates that early-career professionals experience greater income volatility and reduced capacity to absorb shocks, thereby increasing delinquency risk [1,6,12]. Accordingly, H1 predicts a negative association between age and professional disqualification risk: younger professionals are more likely to be classified as high-risk. In this way, life-cycle dynamics function as a structural driver of financial vulnerability.
Sociodemographic variables—including gender, marital status, and country of origin—capture structural differences in financial stability and compliance behavior [1,2,5,12]. These factors reflect broader socioeconomic conditions that shape distinct institutional risk profiles. Accordingly, H2 posits a significant association between sociodemographic characteristics and professional disqualification risk, extending the analysis beyond purely financial indicators.
Human capital theory provides the foundation for H3 by conceptualizing institutional seniority and number of registered specialties as accumulated investments in education, experience, and labor market positioning [13]. Greater specialization and longer tenure signal more stable income expectations and stronger organizational commitment, thereby reducing the default probability. Accordingly, H3 posits a negative association between academic and professional variables and disqualification risk, framing them as protective factors against financial instability.
Within financial risk theory, historical payment behavior constitutes the most robust predictor of future default [8,11,14,15]. Indicators such as payment frequency, proportion of months with outstanding debt, and balance variability capture persistent behavioral patterns, rather than static attributes [1,10,11]. Because these measures reflect dynamic financial discipline or instability, they are expected to display superior discriminatory capacity. Accordingly, H4 proposes that financial and payment behavior variables exhibit greater predictive power than static sociodemographic characteristics.
From an organizational perspective, administrative infrastructure (payment group, payment method, and banking institution) structures the context in which compliance decisions are made [1,10]. These institutional conditions may introduce operational frictions or facilitate timely payments through automated mechanisms. Thus, H5 posits that payment infrastructure variables contribute significantly to the classification of disqualification risk.
Finally, methodological considerations motivate H6. Research on imbalanced classification consistently shows that gradient boosting ensemble methods outperform traditional linear models when nonlinear relationships and high-dimensional interactions are present [7,8,9,16]. Given the heterogeneity and interplay among structural and behavioral predictors in this framework, H6 predicts that supervised ensembles—XGBoost, LightGBM, and CatBoost—achieve superior predictive performance compared to conventional statistical models for identifying high-risk members. Collectively, these hypotheses translate the conceptual framework into a coherent theoretical and methodological basis for predictive institutional risk assessment.

Conceptual Model

The model conceptualizes professional disqualification risk, operationalized as a binary dependent variable, as the outcome of systematic interactions among three theoretically grounded dimensions. The first dimension encompasses sociodemographic attributes that capture individual structural conditions; the second includes academic and professional characteristics that reflect educational trajectory and accumulated human capital [13]; and the third comprises financial-institutional factors that integrate historical payment behavior with relevant administrative features [8,10,11,14,15]. This tripartite structure delineates the determinants of the phenomenon with conceptual precision and aligns observable indicators with explicit theoretical foundations.
The framework guides variable selection and avoids a purely exploratory data mining approach. By integrating structural and behavioral dimensions within a supervised modeling architecture, the study compares predictive algorithms and evaluates the explanatory contribution of each conceptual domain to the risk classification. This design ensures coherence between theoretical foundations and empirical implementation.
Accordingly, the study extends beyond a technical comparison of modeling methods and advances a theory-grounded framework for preventive institutional risk management. By aligning conceptual rationale, variable operationalization, and empirical evaluation, it enhances explanatory rigor and establishes a solid foundation for decision-making based on predictive evidence.

3. Related Works

The use of supervised machine learning models has been widely documented for predicting complex events across various domains. This section organizes the literature into studies directly related to financial risk, delinquency prediction, and institutional compliance modeling, with the objective of positioning the present work within the current state of the art.

3.1. Models Applied to Financial Risk and Fraud Detection

Recent literature demonstrates that machine learning models consistently outperform traditional statistical approaches in fraud detection and financial risk prediction, particularly when integrating data balancing techniques and hybrid optimization strategies. In the health insurance domain, Nabrawi and Alanazi [17] compare Random Forest, Logistic Regression, and Artificial Neural Networks on imbalance-corrected datasets and report anomaly-detection precision exceeding 98%, identifying policy type, education level, and claimant age as key determinants. Their findings confirm the superiority of ensemble-based methods in heterogeneous financial contexts.
Similarly, in the U.S. banking sector, Kusaya and O’Keefe [18] evaluate supervised models—including Artificial Neural Networks, Gradient Boosting, Random Forest, Logistic Regression, and deep autoencoders—for predicting internal abuse and fraud. They demonstrate that model selection must prioritize predictive performance and out-of-sample generalization capacity rather than theoretical simplicity.
In financial risk management, Sundar et al. [14] propose a hybrid framework combining CatBoost and Support Vector Machines with SMOTE and Principal Component Analysis, achieving 95.93% accuracy and outperforming Logistic Regression and Random Forest. Complementarily, Brygala and Korol [15] show that gradient-based algorithms such as LightGBM, CatBoost, and XGBoost enhance personal insolvency prediction and employ SHAP values to interpret variable contributions, including income, credit denial history, and payment delays.
Collectively, these studies consolidate empirical evidence that advanced ensemble algorithms provide scalable, high-precision, and interpretable solutions for delinquency and fraud detection, particularly in high-dimensional and imbalanced financial datasets. However, their application to professional association contexts remains limited, thereby justifying the present study.

3.2. Models for Social and Professional Phenomena

Al-Alawi et al. [13] proposed a hybrid model to predict academic and psychological stress in students, combining k-NN, Random Forest, XGBoost, and weighted voting techniques. The integration of multiple models enabled the effective handling of interrelated variables.
Akinyemi et al. [19] applied SVM, Random Forest, k-NN, and neural networks to predict student dropout based on personal, academic, and social variables. Random Forest achieved accuracy above 90%, supported by systematic cross-validation and attribute selection.
Naik et al. [20] evaluated Naïve Bayes, k-NN, and Decision Trees to predict salary class using the UCI Census dataset. Their findings validate the predictive value of demographic and professional attributes in administrative contexts, a methodological principle shared with the present research.

3.3. Theoretical Foundations and Variable Selection in Delinquency Modeling

Payment behavior has been extensively analyzed in behavioral economics and financial risk theory, which demonstrates that delinquency is shaped by sociodemographic, academic, and institutional determinants rather than purely by income constraints [1,6,12]. Age, marital status, and gender are systematically associated with financial stability and compliance patterns, consistent with life-cycle finance theory and household decision-making models [6,12].
Academic and professional variables, including tenure and number of specialties, operate as proxies for accumulated human capital and employment stability. Human capital theory posits that higher specialization increases income expectations and reduces default probability [13]. Institutional variables, such as payment method and banking institution, capture operational compliance mechanisms and transactional friction [10,11].
Prior research on delinquency prediction also emphasizes the importance of integrating complementary data sources to adequately represent the domain structure. For instance, Nurdin et al. [21] analyze default risk in educational financing using demographic, academic, and financial variables—including gender, study program, funding type, installment amount, payment status, and historical arrears—demonstrating that combining individual attributes with historical payment behavior improves predictive accuracy.
Following this methodological logic, the present study integrates two complementary institutional databases—“Payment Records” and “Membership Registry”—to capture both transactional dynamics and structural professional attributes. The “Payment Records” dataset includes transactional variables such as payment date, amount, payment type, payment method, and banking institution, enabling the construction of behavioral indicators such as frequency, punctuality, and cumulative arrears. The “Membership Registry” dataset incorporates sociodemographic and administrative attributes, including tenure, gender, age, marital status, country of qualification, professional category, and number of specialties, which characterize the structural profile of each member.
This dual-database structure ensures coherent representation of the problem domain and aligns with best practices in delinquency modeling, where preliminary variable assessment and domain-specific feature engineering are critical for robust predictive performance.

3.4. Studies Focused on Benchmarking and Algorithm Evaluation

Tousi et al. [22] emphasize that model selection must balance predictive accuracy, interpretability, and computational efficiency. Belavagi et al. [23] highlight the importance of F1-score, AUC, and cross-validation in imbalanced classification contexts. These methodological criteria guide the evaluation framework adopted in the present study.
Table 1 summarizes the principal studies related to this research, detailing context, algorithms, evaluation metrics, and methodological contributions.

4. Methodology

4.1. Research Design

This study is an empirical, applied, and quantitative investigation aimed at developing a predictive model using supervised learning techniques [28,29] to estimate the risk of disqualification among licensed professionals in Peru, based on sociodemographic, occupational, and financial characteristics. The analysis was conducted at the individual level using anonymized administrative records provided by a national professional association.

4.2. Data Sources

Dataset 1 contains information on payment history, charges, and deposits, with approximately 5.7 million records. Dataset 2 includes demographic, academic, and professional information on 28,802 active and historical members. Both datasets were integrated to construct the final dataset. Each record represents a member, aggregated at the monthly level. The target variable (risk_level) was defined as binary (0 = low risk; 1 = high risk) according to internal accounting rules, based on accumulated debt relative to institutional delinquency thresholds. An exploratory analysis was also conducted to characterize initial patterns in both numerical and categorical variables and to examine their association with the observed levels of disqualification risk (risk_level).

4.3. Data Preprocessing and Variable Construction

Data processing was performed using Python 3.9, employing the libraries pandas, scikit-learn, xgboost, catboost, lightgbm, among others.
The steps were as follows:
  • Initial cleaning: Duplicate entries were removed.
  • Monthly aggregation: Monthly financial indicators were calculated for each member, including variables such as collections_month, payments_month, and accumulated_debt.
Let $i$ represent an individual identified by their unique code c_person, and let $t$ denote a calendar month defined by truncating the transaction date to the year-month level. Based on the individual transaction records, the following variables were defined:
  • Monthly quota used:
    $$\text{amount\_used}_{i,t} = \frac{1}{N^{+}_{i,t}} \sum_{j=1}^{N^{+}_{i,t}} \text{amount}_{i,t,j}, \quad \text{where } \text{amount}_{i,t,j} > 0$$
    This variable represents the average of positive collection transactions for individual $i$ during month $t$, that is, the amounts actually collected.
  • Monthly collections:
    $$\text{monthly\_collections}_{i,t} = \sum_{j=1}^{N^{+}_{i,t}} \text{amount}_{i,t,j}, \quad \text{where } \text{amount}_{i,t,j} > 0$$
    This variable corresponds to the total sum of positive amounts collected from individual $i$ during month $t$.
  • Monthly payments:
    $$\text{monthly\_payments}_{i,t} = \sum_{j=1}^{N^{-}_{i,t}} \text{amount}_{i,t,j}, \quad \text{where } \text{amount}_{i,t,j} < 0$$
    This variable captures the total amount paid by individual $i$ during month $t$, represented as negative values in the transaction records.
Here $N^{+}_{i,t}$ and $N^{-}_{i,t}$ represent, respectively, the number of positive (collections) and negative (payments) transactions for individual $i$ in month $t$.
  • Target variable (high risk): The target variable, denoted as high_risk, was defined according to institutional accounting rules, using accumulated debt and total disqualification duration as classification criteria. An individual was labeled high-risk if their total debt exceeded a threshold equivalent to 25% of the number of months they remained disqualified. Formally, let $\text{debt}_i$ denote the accumulated debt for individual $i$, and let $M_i^{\text{disq}}$ be the total number of disqualification months. The high-risk condition is then defined as follows:
    $$\text{high\_risk}_i = \begin{cases} 1 & \text{if } \text{debt}_i > 0.25 \cdot M_i^{\text{disq}} \\ 0 & \text{otherwise} \end{cases}$$
    The 25% threshold was determined based on internal policy and empirically validated through a sensitivity analysis aimed at optimizing the F1-score and balancing debt recovery with classification accuracy for the minority class.
  • Feature engineering: Variables such as the following were derived:
    $$\text{monthly\_payment\_frequency}_i = \frac{\text{number of payments made by } i}{\text{total number of months observed}}$$
    $$\text{percentage\_months\_with\_debt}_i = \frac{\text{months with debt}_i}{\text{total months observed}_i} \times 100$$
    $$\text{monthly\_balance\_variation}_{i,t} = \frac{\text{balance}_{i,t} - \text{balance}_{i,t-1}}{\text{balance}_{i,t-1}}$$
  • Encoding: Categorical variables were coded using target encoding or one-hot encoding, depending on the model. Numerical variables were normalized using MinMaxScaler.
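The aggregation, labeling, and feature-engineering definitions above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' pipeline (which relied on pandas): the transaction tuples, the helper names (monthly_indicators, high_risk, payment_frequency, percentage_months_with_debt, balance_variation), and the example balances are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical records: (c_person, transaction date, amount).
# Positive amounts are collections; negative amounts are payments.
transactions = [
    ("A01", date(2024, 1, 5), 100.0),
    ("A01", date(2024, 1, 20), 50.0),
    ("A01", date(2024, 1, 28), -120.0),
    ("A01", date(2024, 2, 5), 100.0),
    ("B02", date(2024, 1, 5), 100.0),
    ("B02", date(2024, 1, 9), -100.0),
]


def monthly_indicators(records):
    """Aggregate transactions per (person, year-month)."""
    positive = defaultdict(list)  # collections for each (i, t)
    negative = defaultdict(list)  # payments for each (i, t)
    for person, day, amount in records:
        key = (person, (day.year, day.month))  # truncate date to year-month
        if amount > 0:
            positive[key].append(amount)
        elif amount < 0:
            negative[key].append(amount)

    result = {}
    for key in sorted(set(positive) | set(negative)):
        pos, neg = positive.get(key, []), negative.get(key, [])
        result[key] = {
            "amount_used": sum(pos) / len(pos) if pos else 0.0,  # mean collection
            "monthly_collections": sum(pos),
            "monthly_payments": sum(neg),  # stays negative, as in the records
            "n_payments": len(neg),
        }
    return result


def high_risk(debt, months_disqualified, ratio=0.25):
    """Binary label following the rule as stated: debt > 0.25 * months."""
    return int(debt > ratio * months_disqualified)


def payment_frequency(indicators, person):
    """Number of payments divided by the number of observed months."""
    months = [v for (p, _), v in indicators.items() if p == person]
    return sum(m["n_payments"] for m in months) / len(months)


def percentage_months_with_debt(balances):
    """Share of months (in %) with a positive outstanding balance."""
    return 100 * sum(1 for b in balances if b > 0) / len(balances)


def balance_variation(balances):
    """Relative month-over-month change; None when the previous balance is 0."""
    return [
        (cur - prev) / prev if prev else None
        for prev, cur in zip(balances, balances[1:])
    ]


indicators = monthly_indicators(transactions)
print(indicators[("A01", (2024, 1))])
print(payment_frequency(indicators, "A01"))
print(balance_variation([100.0, 150.0, 75.0]))
```

The zero-balance guard in balance_variation is a design choice of this sketch; the source does not say how months with a zero previous balance were handled.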

4.4. Predictor Variables

The selection of predictor variables is guided by a theory-driven approach to capture the multifactorial nature of institutional financial risk [30,31,32]. Rather than relying exclusively on transactional indicators, this study integrates sociodemographic, academic, and institutional variables to represent the structural, behavioral, and organizational determinants of payment compliance.
Twelve predictor variables were included in the analysis (see Table 2), organized into three conceptual categories to reflect their distinct analytical roles. The first category comprised demographic characteristics, specifically age, gender, marital status, and country of birth, which capture individual-level background attributes. The second category encompassed academic and professional factors, including length of membership in the association, registered specialties, country of degree, and membership type, thereby representing educational trajectory and professional positioning. The third category consisted of financial and institutional variables, namely payment group, payment method, bank, and payment office, which characterize economic arrangements and administrative structures. This classification establishes a structured framework for examining how individual, professional, and institutional dimensions jointly relate to the outcome of interest.

4.5. Modeling Framework

Figure 1 presents the proposed methodological framework for estimating risk levels through a structured machine learning pipeline composed of seven sequential stages. The process begins with data extraction and proceeds to data cleaning and transformation, where validation, standardization, and indicator construction are performed to ensure data integrity, consistency, and analytical reliability. Subsequently, multiple data sources are integrated to construct a unified and coherent analytical dataset aligned with the problem domain.
The fourth stage involves feature engineering, aimed at identifying and selecting variables with the highest predictive relevance. This step supports the subsequent training and validation of multiple supervised classification models, including Logistic Regression (LR), Linear Discriminant Analysis (LDA), Gaussian Naïve Bayes (GNB), k-Nearest Neighbors (k-NN), Decision Trees (DT), Random Forest (RF), Multilayer Perceptron (MLP), XGBoost, LightGBM, and CatBoost. Model performance is then systematically evaluated and compared using appropriate classification metrics to identify the algorithm with the strongest predictive accuracy and generalization capacity. Finally, the selected model generates a prediction output that classifies individuals into mutually exclusive risk categories—high or low—thereby completing the analytical workflow of the proposed framework.

4.6. Predictive Modeling

The selection of supervised learning algorithms in this study was based on a strategic balance between generalization capacity, computational efficiency, and model interpretability [11]. First, traditional linear models such as logistic regression (LR) were included, serving as a robust reference due to their transparency and ease of interpretation in regulated sectors, enabling stakeholders to understand the functional relationships between input variables and financial risk [10,11]. These models are valued for their training efficiency and flexibility regarding class distributions in the feature space [26,33]. However, to capture the multifactorial nature and nonlinear interactions inherent in institutional data, tree-based and ensemble methods such as Random Forest (RF), XGBoost, LightGBM, and CatBoost were incorporated [9]. The Random Forest algorithm was selected for its capacity to reduce variance and overfitting by averaging multiple independent decision trees [10,11]. In contrast, gradient boosting methods, particularly XGBoost, offer enhanced scalability and effective handling of imbalanced data through regularization and gradient-based optimization [8]. Notably, state-of-the-art models such as CatBoost and LightGBM have demonstrated high effectiveness in managing high-cardinality categorical variables, minimizing encoding bias, and optimizing computational resources through leaf-wise tree growth strategies [34]. This combination of approaches enables the development of models that not only achieve high predictive accuracy (R² > 0.90) but also maintain robustness against volatility in historical data [22,34].
Ten supervised learning algorithms were systematically compared to evaluate their predictive performance across methodological families. The first group comprised base models, including Logistic Regression (LR), Linear Discriminant Analysis (LDA), Gaussian Naïve Bayes (GNB), and the k-nearest neighbors (k-NN) algorithm, which represent established statistical and distance-based classification approaches. The second group consisted of tree-based, boosting, and neural network approaches, specifically Decision Trees (DTs), Random Forest (RF), XGBoost, LightGBM, and CatBoost, as well as Multilayer Perceptrons (MLPs), which model nonlinear relationships and complex feature interactions through hierarchical partitioning, boosting, or layered network architectures. This design allows a systematic comparison of linear, probabilistic, instance-based, tree-based, boosting, and neural network paradigms.

Hyperparameter Tuning

GridSearchCV with stratified cross-validation (k = 5) was applied for each model, optimizing the F1-score metric. The random seed (random_state = 42) was fixed to ensure reproducibility.
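As a hedged illustration of this tuning setup, the sketch below runs scikit-learn's GridSearchCV with a five-fold StratifiedKFold and F1 scoring. The synthetic data from make_classification, the choice of Random Forest as the example estimator, and the small parameter grid are assumptions made for illustration; the source does not report the grids actually searched.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic stand-in for the institutional data: 12 predictors and
# roughly 11% / 89% class proportions (class 1 is the majority).
X, y = make_classification(
    n_samples=400, n_features=12, weights=[0.11, 0.89], random_state=42
)

# Hypothetical grid; the grids used in the study are not given in the text.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# Stratified folds keep the class proportions similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=param_grid,
    scoring="f1",  # F1 of the positive class (label 1)
    cv=cv,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

Fixing random_state in the data generator, the estimator, and the splitter is what makes repeated runs return the same selected configuration.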

4.7. Class Balancing

Due to the severe imbalance in the target variable (89% high-risk, n = 25,498; 11% low-risk, n = 3304), SMOTE (Synthetic Minority Oversampling Technique) was applied within each training fold, and class weighting (class_weight = 'balanced') was implemented for algorithms that allow it. Reporting the absolute number of records per class makes the magnitude of the imbalance in the original dataset explicit.
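The two balancing mechanisms can be illustrated with a short, dependency-free sketch. The first function computes the usual "balanced" class-weight heuristic, n_samples / (n_classes × n_c), from the class counts reported above. The second is a simplified SMOTE-style interpolation written for illustration only; in practice a library implementation (for example, the SMOTE class in imbalanced-learn) would normally be used. The function names, the toy minority points, and the parameter values are assumptions.

```python
import math
import random


def balanced_class_weights(counts):
    """'Balanced' heuristic: n_samples / (n_classes * n_c) for each class c."""
    total, n_classes = sum(counts.values()), len(counts)
    return {label: total / (n_classes * n) for label, n in counts.items()}


def smote(minority, n_new, k=5, seed=42):
    """Simplified SMOTE: each synthetic point lies on the segment between a
    minority sample and one of its k nearest minority-class neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Class counts reported in the text (low risk is the minority class).
weights = balanced_class_weights({"high_risk": 25498, "low_risk": 3304})
print({label: round(w, 2) for label, w in weights.items()})

# Oversample only the minority points of a TRAINING fold (toy 2-D example);
# validation and test folds keep their original class distribution.
train_minority = [(1.0, 2.0), (1.5, 1.8), (2.0, 2.2), (1.2, 2.5)]
new_points = smote(train_minority, n_new=6, k=3)
print(len(new_points))
```

Because every synthetic point is a convex combination of two real minority samples, the oversampled class never extends beyond the region already covered by the training fold.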

4.8. Evaluation Metrics

In classification problems with extreme class imbalance, where the incidence rate may be as low as 1%—as in fraud or dropout detection—accuracy becomes a misleading metric that systematically favors the majority class [19]. In such scenarios, a naive model that classifies all instances as negative can achieve 99% accuracy, yet entirely fails to identify critical cases of institutional risk [9]. From a risk assessment perspective, it is essential to prioritize discriminative metrics. Although AUC-ROC is a widely used standard, under severe imbalance, it tends to overestimate performance, as the false-positive rate is artificially suppressed by the large volume of negative cases [9]. Therefore, the PR-AUC (area under the precision–recall curve) is preferred, as it focuses exclusively on minority class performance and more accurately reflects the trade-off between case detection and prediction precision [9]. Accordingly, the F1-score was adopted as the primary optimization metric during hyperparameter tuning, balancing sensitivity and precision for the positive class while reducing bias toward majority-class predictions [9].
Model performance was evaluated using complementary classification metrics that capture overall predictive accuracy and class-specific discrimination. Accuracy was quantified as the proportion of correctly classified instances among all observations. Precision measured the proportion of true positives among predicted positives, thereby reflecting the reliability of positive classifications. Recall, or sensitivity, assessed the proportion of actual positives correctly identified, indicating the model’s ability to detect relevant cases. The F1-score is the harmonic mean of precision and recall, providing a single summary measure that is less sensitive to class imbalance than accuracy. The area under the receiver operating characteristic curve (AUC–ROC) evaluated the model’s discriminatory capacity across classification thresholds. Confusion matrices and ROC curves were examined for the best-performing models to enable a detailed assessment of classification errors and threshold-dependent performance.
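The metric definitions above, and the accuracy paradox described for rare-event problems, can be checked with a small worked example. The confusion-matrix counts below are invented for illustration (1,000 cases with a 1% positive rate) and are not results from this study; the function name classification_metrics is likewise hypothetical.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Naive model: predicts "negative" for all 1,000 cases (10 true positives missed).
naive = classification_metrics(tp=0, fp=0, fn=10, tn=990)

# A model that finds 8 of the 10 positives at the cost of 4 false alarms.
useful = classification_metrics(tp=8, fp=4, fn=2, tn=986)

for name, m in (("naive", naive), ("useful", useful)):
    print(name, {k: round(v, 3) for k, v in m.items()})
```

The naive model reaches 99% accuracy with zero recall and zero F1, whereas the second model improves accuracy only marginally but has an F1 of about 0.73, which is why F1 rather than accuracy is the more informative optimization target under imbalance.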

4.9. Implementation Environment

The experiments were conducted in a hybrid computing environment, consisting of an institutional Linux server (32 GB RAM, Xeon 3.2 GHz CPU, Lima, Peru) and local workstations (Intel Core i7, 16 GB RAM, Lima, Peru). The codebase was version-controlled using GitHub. Workflow traceability and experimental replicability were ensured throughout the process.

5. Results

5.1. Exploratory Descriptive Analysis

Figure 2 shows the correlation matrix between the numerical variables age_membership, n_specialist, and the target variable target_num. The correlations exhibit low magnitudes (absolute values below 0.10), suggesting limited linear association between the numerical characteristics and the target variable. This result supports incorporating categorical and contextual variables into predictive modeling, since numerical variables alone do not capture relevant discriminative patterns.
Figure 3 presents the distribution of age_membership by risk_level. There is a notable concentration of members in the 22 to 30 age range, especially within the high-risk group. This trend suggests a higher likelihood of disqualification among younger professionals, possibly due to factors such as economic instability or limited tenure within the institution. Likewise, Figure 4 shows the relationship between marital status and risk level. Single members account for the highest proportion of individuals classified as high-risk, while married members exhibit a significantly lower incidence. This finding suggests a possible association between socio-family factors and the fulfillment of institutional obligations.
Taken together, these exploratory results confirm the relevance of developing a predictive model that integrates multiple dimensions—demographic, professional, and financial characteristics—to estimate risk levels with greater accuracy.

5.2. Predictive Modeling Results

Ten classification models were evaluated under the same stratified cross-validation protocol (5 folds), incorporating balancing techniques—Synthetic Minority Oversampling Technique (SMOTE) and class weight adjustment—to compensate for class imbalance. Table 3 summarizes the performance metrics on the held-out test folds.
The application of balancing techniques to the test set is methodologically inappropriate because it alters the true class distribution and biases predictive performance estimates; therefore, balancing must be restricted to the training data [16]. The results exhibit consistently high precision (≈90–93%) and recall (≈69–86%) across models, whereas the F1-score ranges from approximately 54% to 58%. This pattern reflects the pronounced class imbalance in the test set and the sensitivity of the F1 metric to asymmetric error distributions, rather than computational inconsistency. Within this comparative framework, CatBoost attains the highest F1-score (57.96%), indicating a relatively stronger balance between precision and recall under asymmetric conditions. LightGBM and XGBoost achieve comparable but slightly lower F1 values, suggesting marginally reduced performance stability. In contrast, traditional classifiers, including Logistic Regression (LR), Linear Discriminant Analysis (LDA), and Gaussian Naive Bayes (GNB), maintain high precision but substantially lower recall, limiting their suitability for detecting high-risk default cases. Overall, algorithmic performance differences primarily reflect variation in the capacity to operate under structurally imbalanced classification regimes.
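The principle that balancing belongs only in the training portion of each fold can be sketched as follows. This is a minimal standard-library illustration, not the authors' pipeline: plain random oversampling stands in for SMOTE, and the toy labels, the helper names `stratified_folds` and `oversample`, and the 90/10 class split are all assumptions made for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign each index to a fold so that every fold keeps the class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % n_folds].append(idx)
    return folds


def oversample(train_idx, labels, seed=0):
    """Randomly duplicate minority-class indices until classes are equal.

    Simple random oversampling, used here as a stand-in for SMOTE.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx in train_idx:
        by_class[labels[idx]].append(idx)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for indices in by_class.values():
        balanced.extend(indices)
        balanced.extend(rng.choices(indices, k=target - len(indices)))
    return balanced


# Toy labels (hypothetical): 1 = high risk (majority), 0 = low risk (minority).
labels = [1] * 90 + [0] * 10
folds = stratified_folds(labels, n_folds=5)

split_summaries = []
for k, test_idx in enumerate(folds):
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    train_balanced = oversample(train_idx, labels, seed=k)
    # A model would be fitted on train_balanced and scored on test_idx here.
    split_summaries.append({
        "train": Counter(labels[i] for i in train_balanced),
        "test": Counter(labels[i] for i in test_idx),
    })

print(split_summaries[0])
```

In each split the training indices end up balanced, while the test fold keeps the original 90/10 class proportions, so the reported metrics are estimated on the true class distribution.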

5.3. ROC Curve and Discriminant Analysis

The ROC curve of the CatBoost model on the test set yielded an AUC of 0.72, exceeding the random baseline of 0.50 and indicating moderate discriminative capacity between risk classes (Figure 5). For false-positive rates between 0.2 and 0.3, the true-positive rate ranges from 0.6 to 0.8, demonstrating effective sensitivity at controlled levels of false alarms.
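One way to read an AUC value is as the probability that a randomly chosen high-risk member receives a higher predicted score than a randomly chosen low-risk member. The sketch below computes AUC through this pairwise (Mann-Whitney) formulation; the scores are hypothetical and the function name `roc_auc` is introduced only for this example.

```python
def roc_auc(scores_pos, scores_neg):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair (the Mann-Whitney U formulation).
    """
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))


# Hypothetical predicted risk scores, for illustration only.
high_risk_scores = [0.9, 0.8, 0.6, 0.4]
low_risk_scores = [0.7, 0.5, 0.3]

auc = roc_auc(high_risk_scores, low_risk_scores)
print(round(auc, 3))
```

With these toy scores, 9 of the 12 possible pairs are ordered correctly, giving an AUC of 0.75. Under the same reading, the reported 0.72 would correspond to roughly 72% of such pairs being ranked correctly.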

5.4. Confusion Matrix

The confusion matrix for CatBoost, presented in Table 4, demonstrates effective detection of high-risk cases (positive class).
Although the model yields 417 false positives, it accurately identifies most high-risk cases, substantially reducing false negatives. This trade-off is appropriate in institutional contexts where undetected risk may result in sanctions or significant financial losses.
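As a worked illustration, the standard metrics can be recomputed directly from the four counts in Table 4. The sketch assumes the counts as transcribed from that table, with 1-HIGH as the positive class; the values need not coincide exactly with Table 3, whose figures were produced under the authors' own evaluation protocol.

```python
# Counts as transcribed from Table 4 (positive class = 1-HIGH).
tn, fp = 244, 417    # actual LOW:  predicted LOW, predicted HIGH
fn, tp = 715, 4385   # actual HIGH: predicted LOW, predicted HIGH


def f1(tp_, fp_, fn_):
    """F1 from counts: 2TP / (2TP + FP + FN)."""
    return 2 * tp_ / (2 * tp_ + fp_ + fn_)


accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_high = tp / (tp + fp)
recall_high = tp / (tp + fn)
f1_high = f1(tp, fp, fn)
# Treating LOW as the positive class swaps the roles of TP/TN and FP/FN.
f1_low = f1(tn, fn, fp)
f1_macro = (f1_high + f1_low) / 2

for name, value in [
    ("accuracy", accuracy),
    ("precision (HIGH)", precision_high),
    ("recall (HIGH)", recall_high),
    ("F1 (HIGH)", f1_high),
    ("F1 (LOW)", f1_low),
    ("F1 (macro)", f1_macro),
]:
    print(f"{name}: {value:.4f}")
```

From these counts, the positive-class F1 is about 0.89 and the low-risk F1 about 0.30, so their unweighted mean is about 0.59. This suggests, though it does not establish, that an F1 value near 58% is closer to a macro-averaged figure than to the harmonic mean of the positive-class precision and recall.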

6. Discussion

From a theoretical perspective, the findings support the idea that the risk of professional disqualification is a multifactorial institutional phenomenon in which contextual and organizational variables play a role comparable to that of traditional financial indicators. This is consistent with contemporary data-driven governance frameworks, which emphasize the integration of administrative, behavioral, and sociodemographic information to support preventive decision-making in institutional risk management.
The results empirically validate the conceptual framework by aligning theoretical foundations with predictive evidence. H1 supports financial life-cycle theory: age consistently differentiates risk, with early-career professionals more likely to be classified as high-risk [1,2,4,6]. Life-course dynamics therefore shape financial vulnerability. H2 shows that sociodemographic variables have moderate predictive relevance [1,6,12]. Although weaker than behavioral indicators, they improve model performance and structure differentiated risk profiles beyond strictly financial data. For H3, the evidence confirms human capital theory [13]. Institutional seniority and specialization are negatively associated with disqualification risk. Accumulated experience acts as a protective factor, even if its effect is less pronounced than that of financial behavior. Hypothesis H4 receives strong empirical support, in line with the credit risk modeling literature [8,10,11]. Payment behavior indicators rank among the most influential variables in ensemble models. This result confirms that historical financial conduct constitutes the most robust predictor of future default. With respect to H5, variables related to payment infrastructure provide additional explanatory power, albeit secondary to direct behavioral patterns [1,10]. This pattern suggests that administrative conditions indirectly shape members’ financial behavior and complement individual-level risk determinants. Finally, the findings confirm H6 and align with evidence demonstrating the superiority of ensemble algorithms in imbalanced data settings [7]. Gradient boosting models outperform traditional statistical approaches across key performance metrics because they capture nonlinear relationships and complex interactions more effectively.
Taken together, these results reinforce the validity of the conceptual model and demonstrate that disqualification risk primarily arises from financial behavioral patterns embedded within a broader structural framework. The integration of economic theory, human capital perspectives, and computational modeling provides a robust foundation for institutional governance grounded in predictive evidence.
The results confirm the effectiveness of machine learning-based approaches for predicting delinquency phenomena in professional contexts, particularly when working with highly imbalanced and multidimensional datasets. The CatBoost algorithm ranked as the most effective model in the final test set, achieving an F1-score of 57.96% and an accuracy of 79.92%, due to its capacity to handle categorical variables without prior encoding and to model complex nonlinear relationships. However, in the cross-validation process, XGBoost achieved a higher F1-macro (0.585), suggesting greater overall stability across different data subsets. This difference underscores the need to consider both average performance and production behavior when selecting optimal models. The observed F1-score indicates that the model effectively identifies high-risk professionals who might otherwise be undetected, enabling early interventions to reduce delinquency rates and administrative costs.
These findings are consistent with previous studies in which XGBoost has demonstrated advantages in financial and credit risk scenarios, especially when combined with balancing techniques such as SMOTE [9]. However, as Tikhe and Rana [27] also point out, high overall accuracy can be misleading in imbalanced contexts, as it tends to favor the majority class. A similar pattern emerged in this study: models such as Random Forest and k-NN achieved high accuracy (>90%) but lower F1-macro scores, reflecting a bias toward the majority class (high risk).
The weight adjustment strategy in XGBoost, combined with oversampling, improved the recall of the minority class (low risk) from 39.3% to 60.6%, reducing false negatives. This improvement advances one of the key objectives of any risk prediction tool: correctly identifying cases that would otherwise be overlooked [9]. Regarding model interpretability, although a technical discussion of the metrics is presented, the analysis does not include variable importance. Incorporating tools such as SHAP (SHapley Additive exPlanations) would clarify which attributes most influence risk prediction, strengthening transparency before institutional authorities or auditors. This is particularly relevant in applications where decisions carry direct professional consequences.
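Class weight adjustment can be illustrated with the common "balanced" heuristic, in which each class is weighted by n_samples / (n_classes × class_count); this is the formula scikit-learn documents for `class_weight="balanced"`. Whether the authors used this exact weighting for XGBoost is not stated, so the sketch below is an assumption, and the label counts are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}


# Hypothetical training labels: 1 = high risk (majority), 0 = low risk (minority).
labels = [1] * 900 + [0] * 100
weights = balanced_class_weights(labels)
print(weights)
```

With a 900/100 split, the minority class receives a weight of 5.0 and the majority class about 0.56, so both classes contribute the same total weight to the training loss.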
The set of variables used—covering demographic, financial, and institutional attributes—represents an improvement over studies that focus exclusively on credit or clinical aspects. As shown by [13,19], incorporating contextual variables enhances model performance in social and administrative prediction tasks.
Estimating risk at a specific point in time aligns methodologically with institutional decision-making processes, which operate through discrete evaluations under operational constraints. The literature on financial risk management demonstrates that predictive models applied to information available at a given moment can identify individuals with a high probability of default before adverse events occur [17,18]. Likewise, behavioral economics and life cycle theory indicate that sociodemographic variables observed at a given moment are systematically associated with future financial behavior [1,6,12]. Consequently, the cross-sectional prediction adopted in this study is conceptually robust and empirically supported for practical institutional implementation.
However, this study presents several important limitations. Independent validation was not conducted, as all data were assessed exclusively through k-fold cross-validation. This methodological choice may constrain the model’s capacity to generalize beyond the analyzed dataset, thereby weakening the robustness of the resulting inferences.
The variables used are structured and quantifiable, while qualitative dimensions such as member satisfaction, perceived institutional value, and motivations for payment compliance are excluded. Future research could incorporate these factors through surveys or sentiment analysis of digital platforms.
From an institutional perspective, the study provides a basis for developing preventive monitoring systems, early warning mechanisms, and evidence-based financial retention strategies. Implementing models such as the one proposed would enable targeted interventions for high-risk subgroups and the design of differentiated strategies aligned with sociodemographic and academic profiles.
The institutional implications identified align with empirical evidence from other contexts of financial risk management in the FinTech sector. Dospinescu et al. [35] demonstrate that institutions design differentiated strategies based on users’ demographic characteristics and optimize technologies that precisely respond to the priorities of each segment. In the context of professional associations, the predictive models proposed in this study enable the identification of specific risk groups based on sociodemographic and academic profiles, thereby facilitating targeted interventions and segmented campaigns. Similarly, Alamsyah et al. [36] argue that incorporating alternative data sources into risk assessment frameworks improves decision-making and strengthens institutional adaptation to the evolving digital environment. Their analysis of credit risk using professional social network data shows that the integrated use of demographic, behavioral, and relational information reduces defaults and expands access to credit by decreasing reliance on traditional scoring systems. Consequently, the findings of the present study indicate that professional associations can enhance their predictive capacity by integrating behavioral and digital engagement variables, thereby enabling a more comprehensive assessment of delinquency risk and potential professional disqualification.
Consistent with the institutional implications discussed above, the findings show that machine learning models—particularly XGBoost and CatBoost—when calibrated with class-balancing and interpretability techniques, offer robust solutions for anticipating disqualification risk among licensed professionals. Their implementation enhances preventive financial management and strengthens the long-term sustainability of professional associations.

7. Conclusions

This study demonstrates that supervised machine learning can estimate disqualification risk due to delinquency using large-scale institutional data. By consolidating more than 5.7 million transactions from 27,964 members, it developed and tested a predictive framework grounded in historical and contextual variables. Among ten algorithms, CatBoost achieved the strongest performance (F1-score = 57.96%; AUC = 0.72), followed closely by XGBoost after class balancing. These results confirm the effectiveness of gradient boosting for modeling structurally imbalanced institutional risk data.
Methodologically, the study integrates sociodemographic, academic, and financial-institutional variables within a unified predictive architecture. Individual-level analysis and standardized metrics (F1-score, AUC) ensure rigorous evaluation under asymmetric class conditions. The findings demonstrate that contextual and organizational attributes enhance prediction beyond purely financial indicators.
Operationally, the results provide a solid empirical basis for early-warning systems within professional associations. Estimating risk at specific decision points reflects the discrete logic of institutional governance. Evidence from financial risk and behavioral economics supports the use of contemporaneous data to anticipate future payment behavior. Implementing CatBoost- or XGBoost-based systems would enable proactive identification of high-risk members and reinforce financially sustainable governance. Incorporating interpretability tools such as SHAP would strengthen transparency and accountability.
Beyond its predictive utility, the risk scoring model strengthens the operational capacity of professional associations by structuring portfolio management around quantifiable criteria. Its implementation as a periodic classification tool allows cases to be prioritized according to exposure level, optimizing the allocation of follow-up resources and reducing administrative costs associated with indiscriminate collection processes. Segmentation by profile facilitates the application of differentiated strategies—such as adjustments in contact frequency, personalized payment plans, and targeted actions based on sociodemographic and professional characteristics—improving collection efficiency without affecting the institutional relationship with members. Likewise, the analysis of variables linked to the payment infrastructure (bank, modality, and channel) provides concrete inputs for redesigning internal processes through automation, operational simplification, and reduction of administrative friction. Integrated into regular governance mechanisms, the model enables the establishment of performance indicators—risk ratios by segment, differentiated recovery rates, and average management times—and the periodic review of intervention criteria, thereby strengthening financial planning and the consistency of institutional decisions.
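The prioritization and segment-level indicators described above could be derived from model scores along the following lines. This is only a sketch: the member identifiers, segments, scores, and the 0.5 cut-off are hypothetical and would need to be defined by the institution.

```python
from collections import defaultdict

# Hypothetical scored members: (member_id, segment, predicted_risk).
members = [
    ("M001", "single", 0.82),
    ("M002", "married", 0.35),
    ("M003", "single", 0.64),
    ("M004", "married", 0.71),
    ("M005", "single", 0.28),
]
THRESHOLD = 0.5  # assumed cut-off, to be tuned by the institution

# Follow-up queue: members at or above the threshold, highest risk first.
queue = sorted(
    (m for m in members if m[2] >= THRESHOLD),
    key=lambda m: m[2],
    reverse=True,
)

# Risk ratio by segment: share of members at or above the threshold.
totals, flagged = defaultdict(int), defaultdict(int)
for _, segment, risk in members:
    totals[segment] += 1
    flagged[segment] += int(risk >= THRESHOLD)
risk_ratio = {seg: flagged[seg] / totals[seg] for seg in totals}

print([m[0] for m in queue])
print(risk_ratio)
```

The queue orders follow-up by exposure level, while the per-segment ratio is one possible form of the "risk ratios by segment" indicator mentioned in the text.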
Limitations remain. The absence of temporal or external validation limits the assessment of model stability under changing economic or regulatory conditions. Reliance on structured quantitative variables also excludes psychosocial dimensions such as institutional engagement, perceived value, or motivational drivers of compliance.
Future research should prioritize longitudinal validation, integration of behavioral and psychosocial indicators, and experimentation with hybrid or deep learning architectures paired with interpretability techniques. Advancing these directions will enhance the robustness, generalizability, and strategic value of predictive systems for preventive disqualification risk management in professional associations.

Author Contributions

Conceptualization, M.P.P. and Y.C.L.; methodology, E.T.-C., M.P.P. and Y.C.L.; software, A.A.S.M.; validation, M.P.P., J.L.L.-G., P.C.R., Y.C.L. and A.A.S.M.; formal analysis, M.P.P., Y.C.L., P.C.R. and A.A.S.M.; investigation, E.T.-C., J.L.L.-G., M.P.P., Y.C.L. and A.A.S.M.; data curation, M.P.P., J.L.L.-G., Y.C.L., P.C.R. and A.A.S.M.; writing—review and editing, E.T.-C., M.P.P., Y.C.L. and A.A.S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The dataset underpinning the results of this study is available in the following repository: https://drive.google.com/file/d/1vDIBnp6sopBaqq2ZBM-WaZHDXJn-EoZA/view (accessed on 15 December 2025).

Acknowledgments

We would like to express our sincerest gratitude to all the experts in the field of university quality who dedicated their time and experience to validating the instrument used in this research. Their contribution has been fundamental in ensuring the accuracy and relevance of our study. Their commitment to excellence and their willingness to share their experience have not only enriched this work but also reflect their dedication to the advancement of educational quality at the university level. We thank all of them for their invaluable input.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Jones, L.E.; Loibl, C.; Tennyson, S. Effects of informational nudges on consumer debt repayment behaviors. J. Econ. Psychol. 2015, 51, 16–33.
  2. Moulton, S.; Loibl, C.; Samak, A.; Collins, J.M. Borrowing capacity and financial decisions of low-to-moderate income first-time homebuyers. J. Consum. Aff. 2013, 47, 375–403.
  3. Reniers, R.L.; Beavan, A.; Keogan, L.; Furneaux, A.; Mayhew, S.; Wood, S.J. Is it all in the reward? Peers influence risk-taking behaviour in young adulthood. Br. J. Psychol. 2017, 108, 276–295.
  4. Bricker, J.; Bucks, B.; Kennickell, A.; Mach, T.; Moore, K. The financial crisis from the family’s perspective: Evidence from the 2007–2009 SCF panel. J. Consum. Aff. 2012, 46, 537–555.
  5. Bruine de Bruin, W.; Van der Klaauw, W.; Downs, J.S.; Fischhoff, B.; Topa, G.; Armantier, O. Expectations of inflation: The role of financial literacy and demographic variables. J. Consum. Aff. 2010, 44, 381–402.
  6. Letkiewicz, J.C.; Heckman, S.J. Homeownership among young Americans: A look at student loan debt and behavioral factors. J. Consum. Aff. 2018, 52, 88–114.
  7. Brati, E.; Braimllari, A.; Gjeçi, A. Machine Learning Applications for Predicting High-Cost Claims Using Insurance Data. Data 2025, 10, 90.
  8. Qin, R. The construction of corporate financial management risk model based on XGBoost algorithm. J. Math. 2022, 2022, 2043369.
  9. Imani, M.; Beikmohammadi, A.; Arabnia, H.R. Comprehensive analysis of Random Forest and XGBoost performance with SMOTE, ADASYN, and GNUS under varying imbalance levels. Technologies 2025, 13, 88.
  10. Hu, L.; Chen, J.; Vaughan, J.; Aramideh, S.; Yang, H.; Wang, K.; Sudjianto, A.; Nair, V.N. Supervised machine learning techniques: An overview with applications to banking. Int. Stat. Rev. 2021, 89, 573–604.
  11. Nguyen, Q.G.; Nguyen, L.H.; Hosen, M.M.; Rasel, M.; Shorna, J.F.; Mia, M.S.; Khan, S.I. Enhancing Credit Risk Management with Machine Learning: A Comparative Study of Predictive Models for Credit Default Prediction. Am. J. Appl. Sci. 2025, 7, 21–30.
  12. Loibl, C.; Letkiewicz, J.; McNair, S.; Summers, B.; Bruine de Bruin, W. On the association of debt attitudes with socioeconomic characteristics and financial behaviors. J. Consum. Aff. 2021, 55, 939–966.
  13. Al-Alawi, L.; Al Shaqsi, J.; Tarhini, A.; Al-Busaidi, A.S. Using machine learning to predict factors affecting academic performance: The case of college students on academic probation. Educ. Inf. Technol. 2023, 28, 12407–12432.
  14. Sundar, J.S.; Chandra, S.; Saini, R.K.; Jha, S.; Katta, S.K. AI-Powered Predictive Analytics for Financial Risk Management in Banking and Fintech. In Proceedings of the 2025 IEEE 4th World Conference on Applied Intelligence and Computing (AIC), Delhi, India, 26–27 July 2025; IEEE: Piscataway, NJ, USA, 2025; pp. 174–178.
  15. Brygała, M.; Korol, T. Personal bankruptcy prediction using machine learning techniques. Econ. Bus. Rev. 2024, 10, 118–142.
  16. Goyal, S.R. A systematic review on AI based class imbalance handling in software defect prediction. Results Eng. 2025, 27, 106578.
  17. Nabrawi, E.; Alanazi, A. Fraud detection in healthcare insurance claims using machine learning. Risks 2023, 11, 160.
  18. Kusaya, C.; O’Keefe, J. Insider Abuse and Fraud Prediction for US Banks: A Comparison of Machine Learning Approaches. SSRN Electron. J. 2021. Available online: https://www.ssrn.com/abstract=3659757 (accessed on 10 January 2026).
  19. Akinyemi, B.O.; Olalere, D.A.; Sanni, M.L.; Olajubu, E.A.; Aderounmu, G.A.; Ibrahim, I.A. Performance evaluation of machine learning models for cyber threat detection and prevention in mobile money services. Informatica 2023, 47.
  20. Naik, R.; Kunchur, P.N. Census Employee Salary Prediction using Supervised Machine Learning. Grenze Int. J. Eng. Technol. 2020, 6, 214.
  21. Nurdin, A.; Zunaidi, R.A.; Wicaksono, M.A.F.; Martadinata, A.L.J. Analisis Kredit Pembayaran Biaya Kuliah Dengan Pendekatan Pembelajaran Mesin. J. Teknol. Inf. Dan Ilmu Komput. 2023, 10, 271–280.
  22. Tousi, A.; Luján, M. Comparative analysis of machine learning models for performance prediction of the SPEC benchmarks. IEEE Access 2022, 10, 11994–12011.
  23. Belavagi, M.C.; Muniyal, B. Performance evaluation of supervised machine learning algorithms for intrusion detection. Procedia Comput. Sci. 2016, 89, 117–123.
  24. Koç, S.; Tomak, L.; Karabulut, E. A Predictive Model for the Risk of Infertility in Men Using Machine Learning Algorithms. J. Urol. Surg. 2022, 9, 265–271.
  25. Gonsalves, A.H.; Thabtah, F.; Mohammad, R.M.A.; Singh, G. Prediction of coronary heart disease using machine learning: An experimental analysis. In Proceedings of the 2019 3rd International Conference on Deep Learning Technologies, Xiamen, China, 5–7 July 2019; pp. 51–56.
  26. Kurian, B.; Jyothi, V. Breast cancer prediction using an optimal machine learning technique for next generation sequences. Concurr. Eng. 2021, 29, 49–57.
  27. Tikhe, S.A.; Rana, D.P. Fine-tuned predictive models for forecasting severity level of COVID-19 patient using epidemiological data. In Frontiers of ICT in Healthcare: Proceedings of EAIT 2022; Springer: Cham, Switzerland, 2023; pp. 431–442.
  28. Dresch, A.; Pacheco, D.; Valle, J.A. Design Science Research; Springer: Cham, Switzerland, 2015; p. 176.
  29. Kitchin, R. The Data Revolution: Big Data, Open Data, Data Infrastructures and Their Consequences; Sage: Thousand Oaks, CA, USA, 2014.
  30. Bosker, J.; Gürtler, M.; Zöllner, M. Machine learning-based variable selection for clustered credit risk modeling. J. Bus. Econ. 2025, 95, 617–652.
  31. Wang, Y.; Zhang, Y. Credit Risk Assessment for Small and Microsized Enterprises Using Kernel Feature Selection-Based Multiple Criteria Linear Optimization Classifier: Evidence from China. Complexity 2020, 2020, 2394948.
  32. Maldonado, S.; Bravo, C.; López, J.; Pérez, J. Integrated framework for profit-based feature selection and SVM classification in credit scoring. Decis. Support Syst. 2017, 104, 113–121.
  33. Nyunga Mpinda, B.; Sadefo-Kamdem, J.; Osei, S.; Fadugba, J. Accuracies of Model Risks in Finance using Machine Learning. Preprint. Available online: https://hal.umontpellier.fr/hal-03191437 (accessed on 26 January 2026).
  34. Lazaridis, P.C.; Kavvadias, I.E.; Demertzis, K.; Iliadis, L.; Vasiliadis, L.K. Interpretable machine learning for assessing the cumulative damage of a reinforced concrete frame induced by seismic sequences. Sustainability 2023, 15, 12768.
  35. Dospinescu, O.; Dospinescu, N.; Agheorghiesei, D.T. Fintech services and factors determining the expected benefits of users: Evidence in Romania for millennials and generation Z. E M Econ. Manag. 2021, 24, 101–118.
  36. Alamsyah, A.; Hafidh, A.A.; Mulya, A.D. Innovative credit risk assessment: Leveraging social media data for inclusive credit scoring in Indonesia’s fintech sector. J. Risk Financ. Manag. 2025, 18, 74.
Figure 1. Predictive model framework for risk level estimation. Source: Authors’ own elaboration.
Figure 2. Numerical correlations between variables and risk indicator (target_num).
Figure 3. Distribution of membership age by risk level.
Figure 4. Distribution of marital status according to risk level.
Figure 5. ROC curve on the test set for both classes (high and low). The green dotted line represents random performance.
Table 1. Summary of related studies.
| Author(s) | Context and Algorithms | Metrics and Findings | Contribution to the Present Study |
|---|---|---|---|
| Nabrawi and Alanazi (2023) [17] | Health insurance fraud detection. Random Forest, Logistic Regression, Artificial Neural Networks. | Precision > 98% after imbalance correction. Ensemble methods outperform traditional models. | Evidence of ML superiority in imbalanced financial datasets. |
| Kusaya and O’Keefe (2021) [18] | U.S. banking internal fraud. ANN, Gradient Boosting, Random Forest, Logistic Regression, and deep autoencoders. | Emphasis on out-of-sample generalization and predictive robustness. | Importance of model selection based on predictive performance. |
| Sundar et al. (2025) [14] | Financial risk management. CatBoost + SVM with SMOTE and PCA. | Accuracy 95.93%. Hybrid models outperform baseline methods. | Relevance of hybrid optimization and data balancing techniques. |
| Brygała and Korol (2024) [15] | Personal insolvency prediction. LightGBM, CatBoost, and XGBoost with SHAP. | Improved predictive accuracy and interpretability. | Demonstrates the value of gradient boosting and explainable AI. |
| Nurdin et al. (2023) [21] | Educational financing defaults. Supervised ML with demographic, academic, and financial variables. | Combining structural and behavioral variables improves accuracy. | Supports integration of complementary institutional databases. |
| Letkiewicz and coauthors (2018); Loibl et al. (2021); Jones (2015) [1,6,12] | Financial behavior and delinquency theory. Socioeconomic determinants analysis. | Age, marital status, and income linked to compliance behavior. | Theoretical grounding for sociodemographic variable selection. |
| Al-Alawi et al. (2023) [13] | Academic and psychological stress. k-NN, RF, XGBoost, weighted voting. | Hybrid ensemble improves predictive performance. | Modeling interrelated professional and behavioral variables. |
| Akinyemi et al. (2023) [19] | Student dropout prediction: SVM, RF, k-NN, Neural Networks. | Accuracy > 90% with cross-validation and feature selection. | Methodological alignment for supervised classification. |
| Naik et al. (2020) [20] | Salary prediction (UCI Census). Naïve Bayes, k-NN, Decision Trees. | Demographic and professional attributes predict class membership. | Administrative prediction using structured personal profiles. |
| Tousi et al. (2022) [22] | Comparative benchmarking. RF, k-NN, SVM, Linear Regression. | Balance between accuracy, interpretability, and efficiency. | Criteria for model evaluation and benchmarking. |
| Belavagi et al. (2016) [23] | Intrusion detection. Logistic Regression, NB, SVM, RF. | F1-score, AUC, cross-validation in imbalanced contexts. | Evaluation framework for robust classification. |
| Koç et al. (2022) [24] | Male infertility. Random Forest, SVM, SuperLearner. | AUC up to 97%. Ensemble improves accuracy. | Demonstrates robustness of ensemble learning. |
| Gonsalves et al. (2019) [25] | Coronary heart disease. Naïve Bayes, Decision Trees, SVM. | Sensitivity to imbalanced classes; NB reduces false negatives. | Relevance for binary classification problems. |
| Kurian et al. (2021) [26] | Breast cancer classification. Decision Tree among nine classifiers. | Accuracy 94.03%. Feature selection improves results. | Importance of preprocessing in structured datasets. |
| Tikhe et al. (2023) [27] | COVID-19 severity prediction. RF, AdaBoost, DT, k-NN, NB with SMOTE. | Accuracy 98.32%. Manual attribute design and balancing. | Feature engineering and class imbalance management. |
Table 2. Description of predictor variables used in the classification model.
| ID | Variable Name (Input) | Type | Range of Values | Variable Description |
|---|---|---|---|---|
| 1 | age_membership | Numeric | 0–51 | Years since professional membership (seniority of the member). |
| 2 | n_specialist | Numeric | 0–4 | Number of specialties officially registered by the member. |
| 3 | gender_description | Categorical | {Male, Female} | Gender declared in the institutional registry. |
| 4 | marital_status | Categorical | {Single, Married, Other} | Marital status reported by the member. |
| 5 | country_of_birth | Categorical | {Peru, Foreign, Not declared} | Country of birth of the registered professional. |
| 6 | university_country | Categorical | {Peru, Foreign} | Country where the degree-awarding university is located. |
| 7 | membership_type | Categorical | {Regular, Special, Honorary} | Type of membership according to professional registration status. |
| 8 | payer_group_code | Categorical | {G1, G2, G3} | Institutional classification of the employer group or paying entity. |
| 9 | local_payment_name | Categorical | 21 categories | Office or headquarters where the member makes payments. |
| 10 | bank | Categorical | {BBVA, BCP, Interbank, Scotiabank} | Banking entity used for payments or transactions. |
| 11 | payment_method_code | Categorical | {Discount, Deposit, Transfer} | Payment method registered in the association’s financial system. |
| 12 | risk_level | Ordinal | {Low, High} | Institutional risk level determined by financial behavior. |
| 13 | target_num | Binary | {0, 1} | Target variable: 1 = high-risk; 0 = low risk. |
Table 3. Model comparison.
| ML Model | Accuracy (%) | Precision (%) | Recall (%) | F1-Score (%) |
|---|---|---|---|---|
| CatBoost | 79.92 | 90.87 | 85.95 | 57.96 |
| LightGBM | 79.81 | 90.86 | 85.83 | 57.89 |
| XGBoost | 77.21 | 91.38 | 81.99 | 57.66 |
| k-NN | 79.96 | 90.40 | 86.56 | 56.70 |
| RF | 77.47 | 90.61 | 83.16 | 56.08 |
| DT | 76.36 | 90.56 | 81.83 | 55.44 |
| LR | 68.30 | 93.24 | 69.21 | 55.08 |
| LDA | 68.33 | 93.19 | 69.29 | 55.04 |
| GNB | 69.68 | 92.35 | 71.71 | 54.82 |
| MLP | 67.85 | 92.41 | 69.40 | 53.84 |
Table 4. CatBoost model confusion matrix.
| | Pred 0-LOW | Pred 1-HIGH |
|---|---|---|
| Actual 0-LOW | 244 | 417 |
| Actual 1-HIGH | 715 | 4385 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
