1. Introduction
In recent decades, advancements in medical treatments, hospital processes, and pharmaceuticals have significantly improved global life expectancy. Simultaneously, artificial intelligence (AI), particularly machine learning (ML), has emerged as a critical tool for optimizing healthcare processes, including fraud detection. Traditional methods, such as manual audits, struggle with the growing complexity of healthcare data, in which fraud accounts for approximately 5% of hospital expenditures.
In modern health systems, claims are increasingly submitted, validated, and audited through internet portals and web APIs. Fraud patterns therefore emerge and evolve in online environments at web scale. Our study explicitly frames fraud detection as an internet-native analytics problem over online databases and interoperable health data services.
5G network slicing enables reliable, low-latency data exchange, while blockchain traceability ensures tamper-evident audit trails across claims. IoT devices generate real-time signals preprocessed at the edge to extract fraud-relevant patterns before secure integration into central analytics. These enablers jointly enhance timeliness, integrity, and privacy, forming a scalable, transparent framework for AI-driven healthcare fraud detection.
AI encompasses computational systems designed to emulate human cognitive processes such as learning, reasoning, and decision-making. Within the healthcare domain, AI has been increasingly utilized in a wide spectrum of applications, including diagnostic support, outcome prediction, personalized treatment planning, and fraud detection. Recent methodological advances, particularly the integration of Bayesian inference with deep learning architectures, have significantly enhanced the predictive performance of time-series models in clinical contexts, offering more reliable decision support in dynamic healthcare environments [
2]. In parallel, AI has demonstrated substantial impact in pharmaceutical research, contributing to the identification of novel therapeutic targets, optimization of drug candidates, and mitigation of antimicrobial resistance through advanced pattern recognition and data-driven modeling approaches [
2]. Building on these advances, this study links ML pipelines with next-generation infrastructure to ensure provenance, privacy, and low-latency operation for deployable fraud analytics in hospitals.
AI advances healthcare fraud detection by analyzing large datasets, enabling real-time monitoring and early anomaly detection, surpassing traditional auditing in efficiency and accuracy [
3]. Machine learning models such as Random Forest, XGBoost, CNNs, and RNNs show strong precision and adaptability, effectively capturing complex and evolving fraud patterns [
4]. The operational value of these models depends on the digital infrastructure, including connectivity, secure interoperability, and distributed learning, which enables prioritization of high-risk claims within audit timeframes.
This approach uses an open-source dataset with demographics, admissions, diagnoses, and financial transactions. Feature engineering enriches the data by linking patient, provider, and claim details, enabling the detection of fraudulent patterns across hospital operations. Aligned with this internet-scale perspective, the proposed pipeline is audit-ready, preserves data provenance, maintains privacy through fold-restricted training, and supports low-latency edge-oriented data ingestion.
The article is organized into seven sections.
Section 1 introduces the problem and the role of machine learning in sustainable hospital operations.
Section 2 reviews related work, while
Section 3 outlines the dataset and the leakage-safe preprocessing pipeline.
Section 4 details the modeling framework and evaluates eight algorithms, followed by
Section 5 on the evaluation protocol, including the composite CPI and SMOTE-ENN comparisons.
Section 6 discusses operational implications such as threshold policy and error trade-offs, and
Section 7 concludes with findings, limitations, and directions for future research.
The contributions of this study can be summarized as follows:
A systematic comparison of eight algorithms from four methodological groups, evaluated with F1, ROC-AUC, Precision/Recall, MCC, and a composite productivity index tailored to audit workloads.
A robust preprocessing framework that integrates four data sources, engineers 13 additional features, applies imputation, categorical encoding, Power transformation, Boruta feature selection, denoising autoencoders, and SMOTE-ENN within cross-validation folds to prevent leakage.
An operational layer that connects calibrated thresholds and Type I/II errors with alert volumes and recovery expectations, supporting prepayment triage, postpayment audit, and provider-level profiling.
A deployment-oriented framework for fraud analytics that conceptually integrates 5G slicing, blockchain traceability, federated learning, and edge feature extraction to enable timely, privacy-preserving, and auditable detection.
2. Literature Review
Healthcare fraud imposes heavy costs on stakeholders and is enabled by complex revenue systems, collusion, and convoluted billing processes. Coding schemes such as CPT with modifiers add further difficulty to detection. Effective auditing requires tailored strategies such as cross-checking billing against medical records, coding analysis, and anonymous tip lines. Building on prior work on AI-driven fraud detection frameworks [
5], recent work frames anti-fraud analytics as an online service that depends on modern networked infrastructures, where connectivity, distributed compute, and verifiable data flows shape what can be detected in practice [
6].
The National Health Care Anti-Fraud Association (NHCAA) estimates that fraud costs the United States approximately
$68 billion annually, with other estimates reaching as high as
$230 billion, representing up to 10% of total healthcare expenditure [
7]. These losses could potentially be sufficient to provide coverage for the uninsured population in the United States, highlighting the severe impact of healthcare fraud [
8].
The Department of Justice identifies five categories of hospital fraud: billing, illegal referrals, vendor bribes, financial schemes, and false invoices. Fraudulent billing is most frequent, but illegal referrals cause the highest losses, requiring auditors with specialized healthcare knowledge and collaboration with medical experts [
9].
The study by Pan et al. [
10] highlights differences in fraud roles between administrative and medical professionals, which are important for developing targeted fraud prevention strategies. Button and Gee estimate that global healthcare loses at least 3% and potentially over 5% of its spending to fraud. They highlight that between 1999 and 2006, the UK’s National Health Service (NHS) managed to reduce fraud losses by up to 60%, underscoring the critical role of accurate data and effective detection systems [
11]. US healthcare insurance faces fraud losses of
$125–175 billion annually within total losses of
$600–850 billion [
12]. Detection systems use claims, practitioner, and clinical data with supervised and unsupervised methods, adapting to changing fraud patterns [
13]. A growing line of research examines how detection throughput is constrained by audit capacity and data latency, which makes threshold policy and real-time ingestion central design choices [
14].
Sparrow identified two distinct healthcare fraud strategies. The first, called “hit-and-run,” involves fraudsters rapidly submitting multiple claims in a short period to maximize gains before detection; the second, called “steal a little, all the time,” involves small, ongoing fraudulent claims made continuously over a longer period [
15].
The high-dimensional nature of healthcare data complicates fraud detection. Techniques like spectral clustering and propositionalizing help reduce data complexity [
16]. Most systems lack real-time detection and rely heavily on manual oversight, but improvements such as SVM and knowledge engineering have shown promising results [
17]. Recent studies investigate online pipelines with streaming features, edge preprocessing close to data sources, and near-real-time anomaly scoring, which can shorten detection delays and reduce the window of undetected fraud [
18].
Beyond model-level accuracy, healthcare fraud exhibits domain-specific risk factors that shape both feature design and evaluation. Diagnosis Related Group payment schemes are vulnerable to systematic upcoding and code creep, which increase case severity on paper and inflate reimbursements [
19]. Prescription behavior can drift over time at the prescriber or clinic level, producing gradual yet anomalous shifts in drug classes, dosages, or refill cadence that are detectable with time series and anomaly-based methods [
20]. Fraud is also organized through collusive provider–patient–intermediary networks where repeated co-billing, atypical referral paths, and shared attributes produce collective anomalies [
21]. Graph-based approaches model these multi-entity relations explicitly, and recent heterogeneous GNNs on claims graphs have shown that community structure and cross-institutional link patterns help surface collusive behavior that tabular models miss [
22]. This study addresses the complementary challenge of noisy and imbalanced tabular claims by combining SMOTE-ENN with a denoising autoencoder, while recognizing that graph modeling is a natural extension for detecting collusion at network scale.
In his study, Yang notes that common fraud types include phantom claims, duplicate claims, and upcoding, and that effective detection and prevention require advanced techniques such as Markov blanket filtering for feature selection, real-time systems, and simplified data inputs [
23].
To overcome these limitations, recent research by Abdallah et al. has focused on automating fraud detection through techniques such as dimensionality reduction and pattern recognition that uncover hidden correlations. The study also notes that, despite these advances, most existing systems still rely heavily on human oversight and lack real-time detection capabilities [
24].
Stowell, Schmidt, and Wadlinger show that healthcare fraud continues to threaten the US economy and public welfare despite countermeasures [
25]. Wibowo et al. report its harmful effects on service quality and financial stability [
26], while NHCAA estimates losses of up to 10% of annual health spending, raising costs and reducing consumer benefits [
7].
Dean et al. highlight the root causes and challenges of healthcare fraud in the U.S., stressing the role of Enterprise Risk Management [
27]. Aruleba and Sun show that ML can mitigate financial losses [
28], while Lavanya et al. note the limits of manual detection and the value of MLPs [
29]. Van Capelleveen et al. estimate U.S. fraud at
$700 billion annually, affecting Medicare and Medicaid, with outlier detection as a proposed solution [
30].
Healthcare fraud detection employs diverse machine learning models, from classical methods like Logistic Regression, KNN, and SVM to ensemble techniques such as Random Forest. Gradient boosting models (XGBoost, LightGBM, CatBoost) and neural networks like MLP also show strong performance in capturing complex patterns. Graph oriented approaches are increasingly used to model collusive structures among providers, patients, and intermediaries, enabling detection of communities and atypical referral pathways that are not visible to record level models [
21].
The ACFE highlights AI’s role in detecting healthcare fraud by analyzing large datasets for patterns like over-billing and false claims. While challenges remain in billing code interpretation and data privacy, AI enhances real-time prevention and strengthens healthcare system efficiency and integrity [
31].
Peng et al. introduce LR for binary outcomes, emphasizing coefficient and odds ratio interpretation [
32], while Sperandei highlights its value in epidemiologic studies [
33]. Itoo et al. achieved up to 95.9% accuracy with LR, excelling in multiple metrics [
34]. Ahsan et al. compared LDA, RF, and LR for credit card fraud, showing RF best at 60:40 splits, LR at 90:10, and SMOTE effective in improving detection [
35].
Zhang highlights the importance of choosing k and distance measures in KNN [
36]. Guo et al. propose a model-based KNN using representative points to improve efficiency while maintaining accuracy [
37]. Itoo et al. compare LR, Naïve Bayes, and KNN for fraud detection, finding LR most accurate [
34].
Breiman introduced Random Forest (RF) as an ensemble of decision trees achieving high accuracy, robustness to noise, and scalability to high-dimensional data [
38]. Biau later analyzed its statistical properties, showing consistency and adaptability without overfitting [
39]. Al-Hashedi and Magalingam further confirmed RF’s effectiveness in financial fraud detection, highlighting its robustness and accuracy [
40].
Burges reviewed SVMs, covering linear and non-linear solutions, generalization, and practical implementation [
41]. Perols showed SVM and LR outperform other methods in fraud detection under cost and imbalance constraints [
42]. Zhang further highlights SVM’s strong accuracy and robustness, especially with small samples and high-dimensional data [
43].
Chen and Guestrin (2016) introduced XGBoost, a scalable boosting system with sparsity-aware optimization for large datasets [
44]. Adeola et al. showed its superior accuracy and efficiency in diagnosing kidney disease [
45], while Priscilla and Prabha found that hyperparameter tuning improves performance on imbalanced data without resampling [
46].
Ke et al. (2017) introduced LightGBM, a highly efficient gradient boosting decision tree framework that uses Gradient-based One-Side Sampling and Exclusive Feature Bundling to significantly reduce computational cost while maintaining accuracy [
47]. Du et al. (2023) propose combining an autoencoder for dimensionality reduction with LightGBM for classification to enhance credit card fraud detection [
48].
Bentéjac et al. (2021) show CatBoost’s effectiveness in chronic kidney disease prediction and fraud detection, noting its robustness with imbalanced data and categorical features [
49]. Hancock and Khoshgoftaar confirm CatBoost’s superiority in Medicare fraud detection, outperforming other algorithms on high-cardinality features and achieving the highest AUC [
50].
Rosenblatt introduced the Perceptron as the basis for MLPs [
51]. Alsmadi et al. (2009) showed backpropagation greatly improves MLP performance in complex classification [
52], while Nanduri et al. demonstrated their effectiveness in e-commerce fraud detection by adapting to dynamic fraud patterns [
53].
Arık and Pfister introduce TabNet, a deep learning model for tabular data that uses sequential attention to focus on key features, improving interpretability and performance over traditional models [
54].
Privacy preserving learning and data provenance emerge as enabling themes in healthcare settings with strict regulation. Federated learning enables training across institutions without raw data exchange, while secure aggregation and differential privacy reduce inference risks, and blockchain based traceability strengthens audit readiness by recording claim lineage and model input provenance [
55].
Preprocessing techniques that improve data quality and structure address issues such as class imbalance and feature scaling, making models more accurate and robust. These methods optimize data preparation, enabling better performance and more reliable results across diverse applications.
Pellegrini et al. propose a learnable aggregation function (LAF) that approximates standard and complex aggregators, such as variance and skewness. LAF outperforms traditional methods by offering greater flexibility and accuracy for complex datasets and tasks [
56].
Scaling and monotonic transformations were applied to stabilize features and reduce skewness. Amorim et al. show that scaler choice, such as Robust Scaler, can strongly affect accuracy under outliers and imbalance [
57]. Yeo and Johnson propose a power transformation extending Box–Cox for zero and negative values [
58], while the QuantileTransformer maps features to uniform or normal distributions, reducing skewness and outlier impact [
59].
To reduce dimensionality while preserving information, three feature-selection methods were applied. Brown et al. use mutual information to capture target dependence while limiting redundancy [
60]. Friedman, Hastie, and Tibshirani apply L1 penalties in logistic regression for sparsity and better generalization [
61]. Kursa and Rudnicki propose Boruta, a random forest wrapper that retains all strongly relevant features beyond a minimal subset [
62].
For compact representations, both linear and non-linear methods were evaluated. PCA captures maximal variance with minimal information loss [
63], denoising autoencoders learn noise-resilient features [
64], and variational autoencoders provide scalable latent-variable models through efficient stochastic optimization [
65].
Given the rarity of positive cases, oversampling and hybrid resampling were applied to balance classes. SMOTE generates synthetic minority samples to expand decision regions [
66], ADASYN focuses on harder minority instances to reduce bias [
67], and SMOTE-ENN combines oversampling with noise removal for cleaner class separation [
68].
Evaluation metrics play a crucial role in assessing the performance of machine learning models, with different metrics being appropriate for different data scenarios.
Powers compares accuracy and F1 score as evaluation metrics for ML models. Accuracy reflects overall correctness in balanced datasets [
69], while F1, combining precision and recall, is preferred in imbalanced settings. In fraud detection, false negatives cause major financial losses, and false positives raise administrative costs, making these errors critical in healthcare. Such imbalance necessitates threshold calibration beyond standard accuracy metrics [
67].
Fawcett (2006) introduced ROC AUC as a classifier metric, showing its value in visualizing true–false positive trade-offs and providing a single probability-based score, particularly useful for imbalanced datasets [
70].
Wasserbacher and Spindler (2022) show that machine learning enhances financial planning and resource allocation through automated analysis [
71]. Ali et al. demonstrate its impact on fraud detection by improving resource efficiency [
72], while Shen et al. propose consumption- and morbidity-based methods for forecasting drug inventories [
73].
Previous studies often used single algorithms or limited comparisons, leaving gaps in model evaluation. This study systematically assesses eight models within a unified preprocessing and evaluation framework, highlighting the value of advanced techniques such as feature engineering, SMOTE-ENN, and dimensionality reduction on high-dimensional data.
3. Dataset and Preprocessing
This section presents the dataset and the preprocessing methodology adopted in this study.
Section 3.1,
Section 3.2,
Section 3.3,
Section 3.4,
Section 3.5,
Section 3.6 and
Section 3.7 describe the dataset composition, analytical procedures, and the sequential pipeline of scaling, feature selection, representation learning, and imbalance treatment, implemented under leakage controls to ensure model robustness.
3.1. Dataset Description
This study uses a Kaggle dataset [
74] segmented into four files:
Patient demographics and reimbursements (40,474 rows, 30 columns)
Inpatient claims with admissions, discharges, and diagnoses (517,737 rows, 27 columns)
Outpatient claims for non-admitted patients (138,556 rows, 25 columns)
Provider identifiers with a binary fraud label
Table 1 summarizes the dataset features, describing demographic, clinical, and claim-related variables and their roles in building a comprehensive fraud detection framework.
3.2. Data Analysis
This study analyzed data distribution to guide fraud detection. With fraud at 9.35%, imbalance-aware training was applied, restricting resampling and threshold tuning to training folds. Exploratory analysis of financial and utilization variables (
Figure 1 and
Figure 2) showed skewness, heavy tails, and heterogeneity, reflecting inflationary billing and highlighting discriminative tail behavior. These insights motivated robust preprocessing with outlier handling and normalization, supported by leverage-resistant models. Evaluation used imbalance-sensitive metrics (PR-AUC, F1, MCC) with calibrated thresholds to align decisions with operational costs. The following
Figure 1 presents the distribution of key financial variables by fraud label.
Figure 2 shows claim-duration distributions: outpatients cluster at same-day visits, while inpatients average longer stays (≈4 days), both with rare extremes. These patterns suggest using spline terms, duration flags, and interactions with monetary variables, ideally with stratified models. To prevent bias, splits must be leakage-controlled, while imbalance-aware objectives and threshold tuning balance prolonged fraud cases against benign long stays.
These descriptive patterns establish the statistical setting of the problem and highlight the need for a leakage-controlled framework supported by robust preprocessing and duration-sensitive representations.
3.3. Data Integration and Preprocessing
The healthcare dataset comprises four interrelated tables: Inpatient, Outpatient, Beneficiary, and Provider, as illustrated in
Figure 3. The Inpatient and Outpatient tables, indexed by Claim ID, include patient, provider, physician, timing, and diagnostic details, and are first merged to consolidate hospital services. This structure is then linked to the Beneficiary table via Beneficiary ID, enriching claims with demographic and clinical attributes such as age, gender, location, chronic conditions, and mortality data. Finally, integration with the Provider table adds the binary Fraud Result attribute, which serves as the ground-truth label for supervised learning. Through this sequential merging process, raw records are transformed into a unified dataset that represents patient, claim, and provider dimensions, aligned with the fraud detection objective and used as input for all subsequent preprocessing and modeling stages.
The complete workflow is summarized in
Figure 4, which illustrates the sequential stages from data integration and preprocessing to feature scaling, dimensionality reduction, imbalance handling, model training, evaluation, and final model selection.
Before modeling, we define a systematic preprocessing framework for high-dimensional, heterogeneous, and imbalanced claims data, where naïve steps risk bias and leakage. The pipeline performs controlled comparisons (at least three alternatives per stage) under stratified 5-fold cross-validation, applying all transforms strictly within folds to prevent leakage. Scaled/transformed datasets are exported for transparency and reproducibility. Subsequent subsections detail feature scaling, feature selection, representation learning, and class-imbalance handling, each reported with metrics. All features were aggregated to the provider level prior to modeling, and cross-validation used a stratified GroupKFold by Provider to prevent any provider appearing in both training and test folds, thereby eliminating provider-level leakage. A minimal sketch of this fold structure follows.
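The sketch below assumes a provider-level feature matrix X, label vector y, and provider identifiers provider_ids as NumPy arrays (with one row per provider the group constraint reduces to stratification, but it is kept for generality); the logistic regression is a placeholder for any downstream classifier.

import numpy as np
from sklearn.model_selection import StratifiedGroupKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PowerTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, test_idx in cv.split(X, y, groups=provider_ids):
    # Every transform is fit on the training fold only, so no statistics
    # leak from held-out providers into the model.
    pipe = Pipeline([
        ("scale", PowerTransformer(method="yeo-johnson")),
        ("clf", LogisticRegression(class_weight="balanced", max_iter=2000)),
    ])
    pipe.fit(X[train_idx], y[train_idx])
    print(f1_score(y[test_idx], pipe.predict(X[test_idx])))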
Although timestamps such as claim and admission dates could support temporal feature engineering, this study aggregated records at the provider level to prevent cross-claim leakage during cross-validation. As a result, explicit sliding-window statistics (e.g., rolling means or exponentially weighted averages) were not generated. Future work may exploit such temporal dynamics to capture gradual shifts in provider or patient behavior for improved fraud detection.
3.4. Scaling
In high-dimensional claims data, feature scale and distribution affect separability, regularization, and threshold stability under imbalance. Normalization reshapes skewness and extreme values, influencing margins and calibration. For rare events like hospital fraud, the focus is not only ROC-AUC but precision–recall trade-offs that determine audit workload and recovery. This study evaluates normalization methods, attributing observed differences to the transformation rather than model capacity.
Three techniques are considered, each encoding a distinct bias about the data-generating process. Robust scaling centers on the median and scales by the IQR, downweighting outliers through robust statistics. Power transformation performs variance stabilization and approximate Gaussianization while preserving order, a combination that can improve linear decision boundaries and probability calibration. Quantile normalization to a normal output enforces a full rank-preserving warp to standard marginals; this can enhance rank-based discrimination such as PR-AUC, yet may distort local metric structure relevant to thresholded decisions. Under severe imbalance, theory thus suggests a tension: transforms that maximize recall by broadening minority mass may simultaneously inflate false positives, whereas variance-stabilizing transforms can improve precision without sacrificing too much sensitivity.
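The three normalizations correspond to standard scikit-learn transformers; the sketch below, with X_train and X_test as assumed placeholders for one fold split, shows how each is fit on the training fold only.

from sklearn.preprocessing import RobustScaler, PowerTransformer, QuantileTransformer

scalers = {
    "robust": RobustScaler(),                         # median/IQR centering
    "power": PowerTransformer(method="yeo-johnson"),  # variance stabilization
    "quantile": QuantileTransformer(output_distribution="normal",
                                    random_state=42), # rank-preserving warp
}
for name, scaler in scalers.items():
    # Fit on the training fold only and apply to the test fold,
    # keeping the comparison leakage-free.
    X_train_t = scaler.fit_transform(X_train)
    X_test_t = scaler.transform(X_test)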
Table 2 summarizes stratified 5-fold cross validated discrimination statistics for the three normalizations under an identical pipeline. Best values per column are highlighted with bold.
Results corroborate the theoretical expectations. Power transformation delivers the most balanced profile, leading F1, ROC-AUC, and Precision, a desirable operating regime for audit-constrained fraud detection, where excess false positives carry tangible cost. Quantile normalization attains the top PR-AUC, indicating strong rank discrimination. Robust scaling amplifies Recall to 0.966, consistent with outlier-tolerant centering, but the accompanying collapse in Precision (0.155) renders this configuration operationally impractical. Accordingly, subsequent stages of this research proceed with Power normalization as the default.
Financial variables showed strong right skew due to infrequent high-cost claims. The Yeo–Johnson power transformation was applied to stabilize variance and compress extreme values while preserving proportionality among typical observations. This transformation improves separability and calibration in models that are sensitive to outliers and class imbalance.
3.5. Feature Selection
Feature selection addresses variance control under limited positives (≈9.4%) and removal of redundant or weak signals that reduce precision. Three methods are tested: KBest-MI, which uses mutual information to capture broad non-linear effects but tolerates redundancy; L1-penalized logistic regression, which enforces sparsity for compact, interpretable sets but discards correlated weak variables; and Boruta, a tree-based wrapper that retains all strongly relevant features, including interactions, while filtering noise.
Table 3 depicts stratified 5-fold results, with best values in bold.
Table 3 shows that all three feature selection methods perform similarly, with only minor differences. Boruta has a slight edge, giving the highest F1 (0.520) and PR-AUC (0.677) while maintaining strong ROC-AUC and recall. KBest-MI and L1-Logistic yield nearly identical results. Overall, all methods retain informative features, though Boruta provides a small but consistent advantage.
Feature relevance was evaluated using the Boruta algorithm with a Random Forest base estimator of 500 trees and balanced class weighting. The procedure was performed within each cross-validation fold to prevent information leakage and was limited to 100 iterations with a significance level of 0.05. In the absence of longitudinal data, temporal proxies such as claim duration and admission–discharge intervals were included. Future studies could add rolling-window features and Shapley-value analysis to enhance feature stability and interpretability.
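A sketch of this Boruta configuration using the BorutaPy package; X_train and y_train denote one training fold (assumed names), and the surrounding fold loop is omitted.

import numpy as np
from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

# 500-tree RF base estimator with balanced class weights, up to
# 100 Boruta iterations at alpha = 0.05, run per CV fold.
rf = RandomForestClassifier(n_estimators=500, class_weight="balanced",
                            n_jobs=-1, random_state=42)
boruta = BorutaPy(rf, n_estimators=500, max_iter=100, alpha=0.05,
                  random_state=42, verbose=0)
boruta.fit(np.asarray(X_train), np.asarray(y_train))

X_train_sel = X_train.loc[:, boruta.support_]  # strongly relevant features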
3.6. Representation Learning
Representation learning, unlike feature selection, generates embeddings that capture latent structure in high-dimensional claims data. This study evaluates PCA, Denoising Autoencoders, and Variational Autoencoders as complementary methods for mapping claims into lower-dimensional spaces with more regular class boundaries. PCA achieves linear decorrelation and variance compression but lacks the ability to model curvature. Denoising Autoencoders learn stable features by reconstructing noise-corrupted inputs, often aligning with fraud signatures. Variational Autoencoders impose a latent prior for smoother, disentangled factors, though sometimes at the expense of boundary sharpness critical for minority-class detection.
Table 4 reports stratified 5-fold discrimination statistics for the three embeddings learned on the Boruta subset, with the best values in bold. Latent dimensionality is shown to contextualize compression strength.
Among the methods, Denoising Autoencoders provide the most balanced performance across F1, PR-AUC, ROC-AUC, and Precision, showing that moderate non-linearity with denoising effectively captures minority structure without raising false positives. PCA achieves the highest Recall (0.895) but loses Precision, lowering F1. Variational Autoencoders fall between, with strong ROC-AUC and Recall but weaker Precision. Overall, DAE emerges as the most reliable default for downstream modeling, while PCA and VAE remain useful baselines in settings prioritizing recall or calibration.
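A minimal PyTorch sketch of the denoising autoencoder adopted here; the layer widths, latent dimensionality of 16, Gaussian noise level, and epoch count are illustrative assumptions, since the text reports only the method family.

import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    # Reconstructs clean inputs from noise-corrupted ones; the encoder
    # output serves as the embedding for downstream classifiers.
    def __init__(self, n_features, latent_dim=16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

def train_dae(model, X, epochs=50, noise_std=0.1, lr=1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        noisy = X + noise_std * torch.randn_like(X)  # Gaussian corruption
        opt.zero_grad()
        loss = loss_fn(model(noisy), X)              # reconstruct clean input
        loss.backward()
        opt.step()
    return model

# Embeddings for downstream models: Z = model.encoder(X).detach().numpy()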
3.7. Class Imbalance
With a minority prevalence of ≈9.4% (506 positives, 4904 negatives), fraud detection in claims data is dominated by class imbalance. Beyond probability ranking, the practical objective is to increase positive yield at tolerable false-alert rates, improving F1 and precision–recall behavior at operating thresholds. On top of the DAE representation, this study investigates three resampling strategies: SMOTE (synthetic minority oversampling via k-NN interpolation), ADASYN (adaptive oversampling that focuses on hard-to-learn minority instances), and SMOTE-ENN, which augments SMOTE with Edited Nearest Neighbors to remove ambiguous majority points near the decision boundary. These methods encode different geometric priors: SMOTE smooths minority manifolds, ADASYN aggressively expands into difficult regions, and ENN cleans noisy borders to favor precision.
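The three strategies map directly onto imbalanced-learn classes; in the sketch below, Z_train and y_train are assumed names for the DAE embeddings and labels of one training fold.

from imblearn.over_sampling import SMOTE, ADASYN
from imblearn.combine import SMOTEENN

samplers = {
    "smote": SMOTE(random_state=42),
    "adasyn": ADASYN(random_state=42),
    "smote_enn": SMOTEENN(random_state=42),  # oversample, then clean borders
}
for name, sampler in samplers.items():
    # Resampling touches the training fold only; the test fold keeps its
    # natural ~9.4% prevalence so the metrics remain honest.
    Z_res, y_res = sampler.fit_resample(Z_train, y_train)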
Table 5 reports mean test metrics across folds on the DAE embeddings from the Boruta subset, comparing no resampling to SMOTE, ADASYN, and SMOTE-ENN. Bold values denote the highest performance for each metric.
ADASYN raises recall (0.911) but lowers precision (0.300), reducing F1 and making it costly for audits. SMOTE-ENN provides the most balanced results across metrics, while the small gap with no resampling suggests DAE embeddings already stabilize class geometry. Thus, SMOTE-ENN is chosen for resampled modeling, with the baseline kept as a strong reference for robustness.
4. Modeling Algorithms
Eight supervised learning algorithms were implemented to capture complementary modeling behaviors across linear, instance-based, ensemble, and neural methodologies. Logistic Regression and Support Vector Machines provide interpretable and well-calibrated probabilistic baselines suitable for structured data. Random Forest and k-Nearest Neighbors extend the analysis to non-parametric methods that capture complex local and global relationships without strong distributional assumptions. Gradient boosting frameworks, including LightGBM, XGBoost, and CatBoost, were selected for their ability to model non-linear feature interactions efficiently and to generalize well on high-dimensional, imbalanced datasets. The Multi-Layer Perceptron introduces a compact deep learning component capable of learning latent representations from denoised embeddings while maintaining manageable model complexity. These algorithms represent mature, reproducible, and computationally efficient methods widely adopted for tabular healthcare prediction. More complex architectures designed for sequential, graphical, or image-based data were excluded, as they require substantially larger datasets, introduce additional hyperparameter uncertainty, and provide limited benefits for independent claim-level records. The chosen set therefore reflects an optimal trade-off between predictive accuracy, interpretability, and operational feasibility in healthcare fraud detection.
- 1.
Logistic Regression (LR)
A convex probabilistic classifier minimizing the Bernoulli negative log-likelihood with L1/L2 regularization; it is a strong, well-calibrated baseline for tabular tasks and provides interpretable coefficients. On compact DAE embeddings, the linear decision boundary often trades a small loss in recall for improved precision and stable calibration. Best values: C = 3, penalty = L2, solver = liblinear, class_weight = balanced, max_iter = 2000.
- 2.
K-Nearest Neighbor (kNN)
A non-parametric, instance-based rule whose consistency improves with more data but whose performance is sensitive to the distance metric and to feature scaling. On DAE embeddings, distance weighting with a mid-sized neighborhood improved robustness to local noise near the decision boundary. Best values: n_neighbors = 11, weights = distance, p = 2 (Minkowski/Euclidean), standardization applied inside CV.
- 3.
Random Forest
Bagging of decorrelated decision trees reduces variance while capturing interactions and non-linearities. It is robust to outliers and provides native feature importance. Best values: n_estimators = 500, max_depth = 16, min_samples_leaf = 2, max_features = log2, class_weight = balanced.
- 4.
Support Vector Machines (SVM)
Maximum-margin classification with the kernel trick. The Gaussian (RBF) kernel yields flexible non-linear boundaries governed by C (soft margin) and γ (locality of similarity). Best values: C = 10, gamma = 0.01, kernel = rbf, class_weight = balanced, probability = True.
- 5.
LightGBM (LGBM)
Leaf-wise gradient boosting with histogram splits is notably efficient on high-dimensional tables while retaining competitive accuracy. Best values: n_estimators = 600, num_leaves = 63, max_depth = 10, learning_rate = 0.05, subsample = 0.85, colsample_bytree = 0.80, min_child_samples = 20, class_weight = balanced, verbosity = −1.
- 6.
eXtreme Gradient Boosting Classifier (XGBOOST)
Second-order gradient boosting with shrinkage, row/column subsampling, and explicit regularization. The histogram method sustains speed without sacrificing accuracy on tabular data. Best values: n_estimators = 500, max_depth = 7, learning_rate = 0.05, subsample = 0.85, colsample_bytree = 0.80, gamma = 0.0, eval_metric = logloss, tree_method = hist.
- 7.
Categorical Boosting (CatBoost)
Boosting with ordered target statistics and permutation-driven regularization mitigates prediction shift and often performs strongly with minimal preprocessing. Best values: iterations = 700, depth = 8, learning_rate = 0.05, l2_leaf_reg = 3, loss_function = Logloss, verbose = False.
- 8.
Multi-layer Perceptron Classifier (MLP)
A feed-forward neural network trained by backpropagation, with ReLU activations and weight decay, that provides a flexible non-linear baseline for tabular fraud signals. Best values: hidden_layer_sizes = (128, 64), activation = relu, alpha = 1 × 10⁻⁴, learning_rate_init = 3 × 10⁻⁴, max_iter = 300.
The following
Table 6 lists the hyperparameter grids investigated for each algorithm, with the final retained configurations marked in bold. The adopted values reflect empirical trade-offs between model complexity, computational feasibility, and stability across folds. For ensemble methods such as Random Forest, XGBoost, LightGBM, and CatBoost, the chosen depths, numbers of estimators, and sampling ratios favor moderate complexity that controls variance.
In contrast, the RBF SVM relies on carefully selected regularization (C) and kernel width (γ) to produce smooth yet responsive decision surfaces. Linear models (Logistic Regression) were tuned with penalties and solvers that ensured convergence under imbalance, preserving interpretability of coefficients. For k-Nearest Neighbors, distance weighting and scaling were critical to stabilize performance, while the neural network was deliberately configured with compact hidden layers and controlled learning rates to avoid overfitting. Taken together, the bolded settings provide reproducible defaults that balance generalization and computational efficiency for downstream evaluation.
The selected algorithms represent complementary methodological paradigms encompassing linear, ensemble, kernel-based, and neural models. Logistic Regression provides a well-calibrated linear baseline under class imbalance, while tree ensembles (Random Forest, XGBoost, LightGBM, CatBoost) capture high-order feature interactions with intrinsic handling of non-linearity and categorical data. SVM with an RBF kernel models complex boundaries through similarity metrics, and the MLP extends capacity for non-linear abstraction within compact architectures suitable for tabular data. Hyperparameters were tuned within bounded grids to balance discrimination power, interpretability, and computational efficiency, ensuring comparability across models. This design enables both methodological diversity and operational feasibility for healthcare fraud detection at scale.
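For reproducibility, the retained settings listed above translate into the following constructors; this is a sketch of the final configurations, not the complete training harness, and variable names are illustrative.

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

models = {
    "LR": LogisticRegression(C=3, penalty="l2", solver="liblinear",
                             class_weight="balanced", max_iter=2000),
    "KNN": KNeighborsClassifier(n_neighbors=11, weights="distance", p=2),
    "RF": RandomForestClassifier(n_estimators=500, max_depth=16,
                                 min_samples_leaf=2, max_features="log2",
                                 class_weight="balanced"),
    "SVM": SVC(C=10, gamma=0.01, kernel="rbf", class_weight="balanced",
               probability=True),
    "LGBM": LGBMClassifier(n_estimators=600, num_leaves=63, max_depth=10,
                           learning_rate=0.05, subsample=0.85,
                           colsample_bytree=0.80, min_child_samples=20,
                           class_weight="balanced", verbosity=-1),
    "XGB": XGBClassifier(n_estimators=500, max_depth=7, learning_rate=0.05,
                         subsample=0.85, colsample_bytree=0.80, gamma=0.0,
                         eval_metric="logloss", tree_method="hist"),
    "CatBoost": CatBoostClassifier(iterations=700, depth=8,
                                   learning_rate=0.05, l2_leaf_reg=3,
                                   loss_function="Logloss", verbose=False),
    "MLP": MLPClassifier(hidden_layer_sizes=(128, 64), activation="relu",
                         alpha=1e-4, learning_rate_init=3e-4, max_iter=300),
}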
5. Evaluation
This study reports a strict, leakage-safe modeling protocol on DAE embeddings with per-fold threshold tuning, evaluated by a fraud-oriented composite productivity index (CPI) that prioritizes detection while controlling false alarms.
The CPI weights (Recall 0.30, Precision 0.20, MCC 0.20, F1 0.15, ROC-AUC 0.10, G-Mean 0.05) emphasize recall (auditor coverage) and precision (audit yield), with MCC capturing balanced correctness under imbalance and F1/ROC-AUC/G-Mean providing complementary views of ranking quality and operating balance. This reweighting compresses the top of the leaderboard into a practical tie.
The relative weights were determined based on the operational relevance of each metric to healthcare fraud auditing. Recall received the highest weight, as undetected fraud imposes the greatest expected financial loss, while precision followed to preserve investigator efficiency and maintain optimal audit yield. MCC was weighted equally to ensure balanced assessment under class imbalance and to enhance model reliability across folds. F1 was assigned a moderate weight to capture the harmonic interaction between recall and precision at the calibrated threshold. ROC-AUC contributed a smaller proportion to represent global ranking ability across thresholds, and G-Mean was included as a secondary stability indicator balancing sensitivity and specificity. Collectively, these proportional weights reflect the economic and operational priorities of fraud detection, aligning the composite score with institutional decision-making objectives.
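With these stated weights, the CPI reduces to a weighted sum of fold metrics. The sketch below also illustrates per-fold threshold tuning; y_train and train_scores are assumed placeholders for fold labels and predicted probabilities, and the threshold grid is illustrative.

import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             matthews_corrcoef, roc_auc_score)

def g_mean(y_true, y_pred):
    # Geometric mean of sensitivity and specificity.
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

def cpi(y_true, y_score, threshold=0.5):
    # Composite productivity index with the weights used in this study.
    y_pred = (y_score >= threshold).astype(int)
    return (0.30 * recall_score(y_true, y_pred)
            + 0.20 * precision_score(y_true, y_pred)
            + 0.20 * matthews_corrcoef(y_true, y_pred)
            + 0.15 * f1_score(y_true, y_pred)
            + 0.10 * roc_auc_score(y_true, y_score)
            + 0.05 * g_mean(y_true, y_pred))

# Per-fold tuning: pick the cut-off that maximizes CPI on the training
# fold, then apply it unchanged to the test fold.
thresholds = np.linspace(0.05, 0.95, 19)
best_t = max(thresholds, key=lambda t: cpi(y_train, train_scores, t))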
Table 7 presents the performance of eight classifiers under the baseline regime without resampling, evaluated on the DAE embeddings with the fraud-oriented CPI as the principal metric; the best values are shown in bold.
Table 7 summarizes the performance of the evaluated models across the composite CPI and standard classification metrics. The Neural Network (MLP) achieves the highest CPI (0.644), highlighting its balanced performance across different evaluation criteria. CatBoost stands out with the best F1 score (0.608), ROC-AUC (0.933), Precision (0.642), Specificity (0.966), Accuracy (0.930), MCC (0.572), and the lowest Type I Error (0.033), demonstrating strong discriminative ability and reliability in minimizing false alarms. On the other hand, KNN exhibits the highest Recall (0.911) and G-Mean (0.810), along with the lowest Type II Error (0.089), indicating superior capability in detecting fraud cases. Overall, the results suggest that while CatBoost provides the most consistent performance across several key metrics, KNN emphasizes detection sensitivity, and MLP offers the best composite performance as reflected in the CPI.
To further assess the impact of resampling,
Table 8 reports the corresponding results when the models are trained with the SMOTE-ENN strategy, with the best values in bold. This comparison highlights how the integration of resampling affects both the composite CPI and the standard evaluation metrics.
Table 8 presents the performance of the models when trained with the SMOTE-ENN resampling strategy. CatBoost achieves the best composite CPI (0.633), F1 (0.603), Precision (0.651), Specificity (0.969), Accuracy (0.931), and the lowest Type I Error (0.031), confirming its robustness under resampling. The Neural Network (MLP) attains the highest ROC-AUC (0.933), while XGBoost yields the strongest MCC (0.560). SVM (RBF) shows competitive performance, with a CPI on par with CatBoost (0.633) and strong Recall (0.634). Finally, KNN maintains its strength in detection sensitivity, achieving the best Recall (0.915), G-Mean (0.810), and the lowest Type II Error (0.085). Overall, SMOTE-ENN shifts the balance, while CatBoost continues to dominate across most performance dimensions.
Table 9 provides a direct head-to-head comparison between the baseline (No Sampler) and the resampling-based (SMOTE-ENN).
The results in
Table 9 demonstrate that the influence of SMOTE-ENN is heterogeneous across algorithms and evaluation metrics. Neural Networks (MLP), Random Forests, and Logistic Regression exhibit consistently positive differences in CPI, F1-score, Precision, MCC, and Accuracy, which indicates that these models attain superior overall performance in the absence of resampling.
LightGBM, CatBoost, and SVM (RBF) benefit from SMOTE-ENN through fewer Type II errors and higher recall, while KNN and XGBoost show minimal changes. Overall, classical models perform better without resampling, whereas boosting and kernel methods gain from added class balance. Aggregate results confirm higher performance without resampling, with SMOTE-ENN offering modest reductions in Type I/II errors.
In terms of CPI, MLP, Random Forest, Logistic Regression, LightGBM, and CatBoost score higher without resampling, while SVM (RBF) and XGBoost benefit from SMOTE-ENN through improved recall and fewer false negatives. KNN is largely unaffected. Thus, no-resampling is generally preferable, with SMOTE-ENN useful when reducing false negatives is the priority.
Beyond identifying the preferred sampling strategy per model, it is also essential to assess how consistently each algorithm performs across multiple metrics.
Figure 5 compares algorithms across six metrics. CatBoost shows the most balanced performance, combining high F1, MCC, and ROC-AUC with strong precision and recall. XGBoost, SVM, and Logistic Regression perform well with minor trade-offs, while KNN is unbalanced, favoring recall at the cost of precision. LightGBM and Random Forest remain stable but weaker. Overall, CatBoost stands out as the most consistent algorithm.
Figure 6 presents the CatBoost confusion matrix at threshold τ ≈ 0.30. Of 5410 cases, the model correctly identified 4739 legitimate and 294 fraudulent claims, while misclassifying 165 as false positives and 212 as false negatives. This corresponds to recall 0.581, precision 0.642, specificity 0.966, and F1 score 0.609, illustrating the model’s effectiveness in distinguishing fraudulent from legitimate claims.
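These operating-point figures follow directly from the reported counts; a few lines of arithmetic reproduce them, including the NPV quoted below (values rounded as in the text).

# CatBoost confusion-matrix counts at threshold ~0.30 (Figure 6).
tn, fp, fn, tp = 4739, 165, 212, 294

recall = tp / (tp + fn)                             # 294/506   ~ 0.581
precision = tp / (tp + fp)                          # 294/459   ~ 0.641
specificity = tn / (tn + fp)                        # 4739/4904 ~ 0.966
npv = tn / (tn + fn)                                # 4739/4951 ~ 0.957
f1 = 2 * precision * recall / (precision + recall)  # ~ 0.609
print(recall, precision, specificity, npv, f1)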
Figure 7 shows the CatBoost ROC curve with AUC = 0.930, confirming strong class separability. At the F1-optimal threshold (τ = 0.3), the true positive rate is 0.70 and the false positive rate 0.06, consistent with the confusion matrix. Although ROC-AUC can be optimistic in imbalanced data, the high value supports CatBoost’s robustness, while threshold choice balances false negatives against investigative cost. The ROC curve thus underpins cost-sensitive threshold selection.
To enhance financial interpretability, the confusion matrix at τ = 0.3 was extended to include predictive value metrics. The model achieved a PPV of 0.641 and an NPV of 0.957, meaning fewer than 5% of legitimate claims were wrongly flagged while one in three alerts was a false positive. Assuming a review cost of USD 50,000 per one million claims, this corresponds to an additional USD 18,000 per million in audit expenses. Although overall results are aggregated, future work may incorporate DRG-specific evaluation to capture clinical variation and guide targeted auditing.
To contextualize these findings, the performance of the proposed models was also compared with prior work.
Table 10 contrasts the results of this study with those reported by Suri (2019) [
75] for Logistic Regression and Random Forest. The article addresses fraudulent insurance claims in healthcare and their financial impact. It evaluates Logistic Regression and Random Forest models to identify fraud patterns, aiming to enhance automated detection systems and reduce economic losses. While both studies achieve comparable levels of accuracy, the present models exhibit markedly higher discriminative ability, as reflected in superior ROC-AUC and F1-scores.
The evaluation is methodologically robust, using stratified cross-validation with strict fold separation, threshold optimization, and controlled preprocessing. At each pipeline stage, alternative techniques were tested and only the best retained, minimizing leakage and ensuring reliable results that provide a stronger benchmark than simpler procedures such as those in Suri’s study. The comparison in
Table 10 illustrates the advantage of this study, with the best values shown in bold.
6. Discussion
In this study, a leakage-safe, end-to-end pipeline was engineered to translate Future-Internet-oriented fraud-detection modeling into operational and economic value for hospitals and payers. The key finding is that model accuracy alone is insufficient to deliver impact. What matters is the alignment between statistical performance, threshold policy, and audit capacity. In large-scale audit systems, even highly accurate classifiers can generate excessive alerts that exceed review capacity or, conversely, suppress detection when thresholds are set too conservatively. Meaningful impact arises only when predictive precision and recall are balanced against institutional resources and risk tolerance, linking statistical metrics to actionable workload and recovery outcomes. By integrating threshold calibration, scenario analysis, and governance, the system converts probabilistic outputs into decisions that can be executed via web-facing workflows in prepayment triage and postpayment audit without overwhelming staff or delaying legitimate claims.
Methodologically, the design choices were made to maximize reliability rather than headline scores. All preprocessing, feature learning, and hyperparameter selection were performed within cross-validation folds to prevent information leakage. The use of both resampled and non-resampled training regimes showed that gains in F1 and recall under class-imbalance handling did not come at the expense of inflated optimism; the same evaluation protocol was maintained so that improvements could be attributed to the modeling choices rather than to procedural artifacts. The resulting performance is therefore interpretable for deployment, with calibrated scores that remain stable across folds.
The comparative analysis indicates that, for the overlapping algorithms, this study’s pipeline outperformed prior baselines reported by Suri [
75]. More importantly for practice, models such as the multilayer perceptron and CatBoost, when paired with the full preprocessing and calibration stack, achieved the strongest results among the candidates evaluated. These gains arose from the interaction between denoising and normalization, feature selection guided by cross-validated importance, and threshold optimization against explicit audit objectives. The pattern is consistent with the view that modern architectures deliver value when embedded in disciplined evaluation and policy tuning rather than used in isolation.
Complementing discrimination metrics, the operating-point analysis using Type I and Type II errors clarifies the practical trade-offs that matter for audit capacity. Reporting Type I and Type II explicitly therefore anchors model selection to institutional risk appetite and makes threshold policy transparent.
Operational considerations were central to the evaluation. In prepayment, the calibrated scores support tiered review, sending only the upper-risk tail to investigators and preserving flow for low-risk claims. In postpayment, adjusted thresholds recover additional losses through targeted retrospective audits without requiring additional headcount. At the provider level, aggregation of claim-level scores into risk indices, controlling volume and case mix, enables proportionate responses that begin with education for borderline behavior and escalate to focused investigation when anomalies persist.
Governance completes the bridge from models to policy. Continuous monitoring for drift, calibration audits across subgroups, and fairness checks ensure that performance remains stable as coding practices and case mix evolve. A structured change-management protocol authorizes threshold updates, retraining, and version promotion, while documentation of features and attributions supports internal review, external audit, and clinician communication without revealing proprietary elements.
Ethical reliability is essential for deploying AI in healthcare. While this study focused on technical and operational aspects, potential bias from uneven demographic or insurance representation is acknowledged. Future research should include subgroup fairness testing using AUC disparity and parity metrics, along with adversarial debiasing and fairness audits to ensure equitable model performance.
The economic implications follow directly from these mechanisms. Higher precision at the review threshold reduces wasted investigator time and mitigates adversarial interactions with compliant providers. Higher recall of the calibrated operating points increases recovered losses. Because thresholds are set in view of capacity, alert volumes remain manageable, which protects day-to-day operations. Taken together, these effects raise audit productivity and reduce fraudulent leakage, the outcomes that finance and compliance units ultimately target. From an internet-operations perspective, the system supports continuous monitoring on streaming data and role-based access over online databases. Furthermore, next-generation infrastructures such as 5G connectivity, blockchain-based auditability, federated learning, and edge analytics can further enhance scalability, trust, and privacy, aligning the proposed framework with the core principles of the Future Internet.
Although the evaluation controlled for leakage and used rigorous cross-validation, external validity depends on institutional context and on local billing practices not represented in the development data.
Overall, the evidence supports the claim that advanced techniques deliver practical value when embedded within a disciplined pipeline that links modeling to thresholds, capacity, and governance. The contribution of this study is not only improved accuracy but the articulation of a deployment pathway that makes those improvements actionable in hospital settings. By treating decision thresholds as policy instruments, by validating economic and operational metrics alongside standard measures, and by institutionalizing monitoring and change control, the system demonstrates a credible route from predictive performance to measurable gains in recovery and workload management.
7. Conclusions and Future Work
This study shows that disciplined preprocessing and honest evaluation are as decisive as the nominal classifier for hospital fraud detection. A leakage-safe, imbalance-aware pipeline uses Power transformation, Boruta selection, and denoising autoencoders to align statistical gains with auditor-relevant decisions that save time and enhance recovery. The composite productivity index used here ties model choice and threshold policy to precision, recall, MCC, and ROC-AUC in proportions aligned with workload and recovery priorities, and it supports both prepayment triage and postpayment audit. Within this framework, MLP achieved the highest composite CPI, CatBoost delivered the best control of false positives and strong accuracy, and resampling had limited incremental value once upstream representations regularized class geometry.
The economic perspective is central. Type I and Type II errors are reported at the operating threshold, making explicit the trade-off between investigator burden and missed fraud. Thresholds function as policy instruments that map directly to alert volume and expected yield, and calibration inside cross-validation provides estimates decision makers can trust for prepayment, postpayment, and provider-level profiling.
Labels are binary at the provider level, which constrains typology-specific conclusions and may mask specialty heterogeneity. Claims data embed coding artifacts and policy effects, and choices in feature design and temporal aggregation can shift relative performance. These factors advise caution when generalizing beyond the studied cohort.
Future Internet directions include integrating additional online data sources and federated learning across institutions to respect data sovereignty while updating models from internet-connected sites. Future research should also strengthen the link between prediction and economics through cost-sensitive training and thresholding under explicit loss matrices, external validation by state, setting, specialty, and time, and the use of semi- and self-supervised learning to exploit unlabeled claims. Graph-based models over beneficiaries, providers, and procedures may capture network structure that single-claim models miss.
In summary, a carefully engineered and leakage-safe pipeline surpasses prior baselines on overlapping algorithms and, when paired with MLP and CatBoost, delivers stronger performance at actionable thresholds. With the extensions outlined above, the framework can evolve into a continuously learning, internet-based decision service that adapts to emerging fraud patterns while safeguarding operational, legal, and ethical constraints in the healthcare sector.